arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.24111 2026-05-26 cs.RO cs.AI

MASt3R-Nav: WayPixel Navigation in Relative 3D Maps

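Vansh Garg, Rohit Jayanti, Krish Pandya, Sarthak Chittawar, Siddharth Tourani, Muhammad Haris Khan, Sourav Garg, Madhava Krishna

AI summary Proposes a map representation based on pixel-relative connectivity: the map is built from pixel correspondences in relative 3D coordinate frames, a pixel-level graph is used for global path planning, and a controller is trained to predict trajectories, yielding high-accuracy navigation.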

Comments 2026 IEEE International Conference on Robotics & Automation (ICRA)

Abstract

Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally consistent geometry, image- or object-relative topological graphs almost entirely do away with geometric understanding. However, this comes at the cost of navigation capability, often limiting it to merely teach-and-repeat. In this work, we propose a novel map representation in the form of pixel-relative connectivity, which is geometrically accurate but does not require global geometric consistency. Inspired by recent progress in 3D grounded image matching, we construct a map from an image sequence through inter-image connectivity based on pixel correspondences in the relative 3D coordinate systems of individual image pairs. We then use this pixel-level graph to perform global path planning by approximating and sparsifying intra-image pixel connectivity. Through this, we derive a "WayPixel Costmap" representation and train a controller conditioned on it to predict a trajectory rollout. We show that this dense pixel-level costmap based on relative geometry is a more accurate conditioning variable for control prediction than its image- and object-level counterparts. This enables a highly capable navigation system, as validated on four types of navigation tasks in the simulator and through real-world demonstrations.

2605.24110 2026-05-26 cs.AI

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

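Haiyang Shen, Xuanzhong Chen, Wendong Xu, Yun Ma, Liang Chen, Kuan Li

AI summary Introduces the EvoCode-Bench benchmark, which uses stateful multi-round tasks and cumulative tests to evaluate whether coding agents can keep a codebase working as requirements change. Multi-round scores are far lower than single-round scores, and even the strongest agents reach only about 50% multi-round success.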

Comments Work in Progress; 32 pages, 10 figures, preprint

Abstract

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.

2605.24106 2026-05-26 cs.LG cs.AI

Overcoming "Physics Shock" in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference

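Tewodros Syum Gebre, Jagrati Talreja, Matilda Anokye, Leila Hashemi-Beni

AI summary Proposes an uncertainty-aware physics-informed neural network framework that uses a dynamic warm-start protocol and heteroscedastic uncertainty modeling to resolve the gradient divergence caused by conflicts between physical constraints and noisy data in remote-sensing flood mapping, with a 25% relative IoU improvement on the Sen1Floods11 dataset.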

Comments This article has been accepted by the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

Abstract

Rapid and accurate flood extent mapping from Remote Sensing data, such as Synthetic Aperture Radar (SAR), is critical for operational disaster response, but standard Deep Learning models often produce physically impossible predictions due to a lack of hydrological constraints. While Physics-Informed Neural Networks (PINNs) attempt to address this by embedding governing laws directly into the loss function, their application to real-world remote sensing data frequently fails. Enforcing rigid spatial derivatives (e.g., the 2D Shallow Water Equations) onto unconditioned latent spaces attempting to fit noisy SAR speckle causes catastrophic gradient divergence, a phenomenon we term Physics Shock. In this paper, we propose a novel Uncertainty-Aware PINN framework tailored specifically for applied Earth Observation that addresses this instability. By integrating a dynamic Warm-Start protocol and modeling heteroscedastic aleatoric uncertainty via a negative log-likelihood objective, the network learns to dynamically relax physical constraints in regions of high sensor noise while strictly enforcing them in high-confidence areas. Evaluated on the Sen1Floods11 dataset, our probabilistic Attention-Gated FNO-UNet successfully stabilizes multi-objective optimization, achieving a +25% relative improvement in Intersection over Union (IoU) compared to deterministic baselines. Furthermore, through Deep Ensembles, we successfully disentangle intrinsic sensor noise from out-of-distribution terrain ignorance, providing operational agencies with highly calibrated, physically consistent confidence bounds for robust disaster mitigation and real-time decision-making.
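
As a rough illustration of the heteroscedastic negative log-likelihood mentioned in the abstract, the sketch below uses the standard per-pixel Gaussian form, in which a network predicts a log-variance alongside its output. The function name `gaussian_nll` and the use of a single scalar residual are assumptions for illustration; the abstract does not say how the paper combines the data and physics terms.

```python
import math


def gaussian_nll(residual: float, log_var: float) -> float:
    """Gaussian negative log-likelihood for one pixel, without the constant term.

    residual: prediction error at the pixel (hypothetical input).
    log_var:  predicted log-variance; larger values mean lower confidence.
    """
    # exp(-log_var) is the precision. It shrinks the squared residual where the
    # predicted noise is high, while the + log_var term penalizes claiming
    # high uncertainty everywhere.
    return 0.5 * (math.exp(-log_var) * residual ** 2 + log_var)


if __name__ == "__main__":
    confident = gaussian_nll(residual=2.0, log_var=0.0)
    uncertain = gaussian_nll(residual=2.0, log_var=math.log(4.0))
    print(confident, uncertain)
```

With the same residual, the loss is smaller when the predicted variance is larger, which is one way a model could relax a constraint in noisy regions.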

2605.24098 2026-05-26 cs.CV

D2-V2X: Depth-Driven Cooperative V2X Reasoning for Autonomous Driving

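Kevin Richard, Alphin Varghese, Colin Pham, David Oh, Srijan Das

AI summary Addresses the sensor-occlusion limits of single-vehicle vision-language models with the D2-V2X benchmark and a baseline model that fuses 3D LiDAR features into the VLM latent space and uses chain-of-thought reasoning to identify occluded objects and estimate spatial positions, with marked gains in occluded-hazard recall and lower spatial estimation error.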

Comments Accepted to the DriveX Workshop at CVPR 2026 (Non-archival)

Abstract

Single-vehicle Vision-Language Models (VLMs) are fundamentally constrained by sensor occlusions. While Vehicle-to-Everything (V2X) systems mitigate this, current benchmarks lack the cooperative reasoning required for resolving ambiguities in complex environments. We introduce D2-V2X, a spatially-aware Question-Rationale-Answer (QRA) benchmark featuring 8,500 triplets derived from multimodal vehicle and infrastructure sensors. We additionally establish a baseline that aligns 3D LiDAR features with the VLM's latent space. By enforcing natural language Chain-of-Thought rationales prior to structured JSON outputs, our model is forced to explicitly articulate spatial relations. Our experiments demonstrate that grounding VLMs in cooperative LiDAR achieves 24.4% recall in identifying occluded hazards compared to near-zero in zero-shot models and reduces spatial estimation error for visible objects by 77% compared to the zero-shot baseline. While the model achieves a functional decision-making F1-score of 53.5, we identify 3D-to-2D projection as a fundamental bottleneck in current VLM architectures, establishing a new baseline for future innovation. Data, code, and trained models are available at https://github.com/KevinRichard1/D2-V2X

2605.24084 2026-05-26 cs.LG cs.AI cs.LO

Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks

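David Boetius, Shahaf Bassan, Guy Katz, Stefan Leue, Tobias Sutter

AI summary Uses neural network verification techniques to build an algorithm that computes exact lower and upper bounds on SHAP values and scales to search spaces orders of magnitude larger than existing exact methods can handle.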

Comments Accepted at ICML 2026. 34 pages, 13 figures

Abstract

Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponential search space over the input features. In this work, we take a first step towards scaling exact SHAP computation to larger search spaces by introducing an algorithm that leverages recent advances in neural network verification to compute arbitrarily tight exact lower and upper bounds on SHAP values for neural networks, ultimately recovering the exact SHAP values. We demonstrate that our approach scales to orders of magnitude larger search spaces than state-of-the-art exact methods. This provides an important first step towards exact SHAP computation and establishes a principled cornerstone for evaluating statistical approximation methods on larger search spaces.
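
To make the "exponential search space" concrete, the sketch below computes exact Shapley values by brute-force enumeration of all feature subsets. It is not the paper's verification-based algorithm. The coalition value function `value` is a hypothetical stand-in for whatever model-dependent quantity SHAP evaluates on a feature subset.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List


def exact_shapley(value: Callable[[FrozenSet[int]], float], n: int) -> List[float]:
    """Exact Shapley values for n features by enumerating every coalition.

    value: maps a frozenset of feature indices to a real number (assumed given).
    The loop visits 2**(n-1) subsets per feature, hence the exponential cost.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


if __name__ == "__main__":
    contributions = [1.0, 2.0, 3.0]
    # For an additive game, each Shapley value equals the feature's own term.
    print(exact_shapley(lambda s: sum(contributions[j] for j in s), 3))
```

The enumeration doubles in cost with every added feature, which is why bounds that tighten without visiting every subset are attractive.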

2605.24074 2026-05-26 cs.CV cs.RO

WideDepth: Millimeter-Accurate Benchmark for Fisheye Depth Estimation

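Ilia Indyk, Ignat Penshin, Ivan Sosin, Maxim Monastyrny, Aleksei Valenkov, Ilya Makarov

AI summary Presents WideDepth, the first indoor dataset for fisheye depth estimation, with 5K high-resolution stereo pairs across 101 scenes and millimeter-level ground truth; introduces a LiDAR-based stereo fisheye image generation method and evaluates a range of models, with fine-tuning improving performance by up to 62%.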

Comments Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026

Abstract

Fisheye cameras are increasingly adopted in robotics for near-field manipulation, navigation, and immersive perception, yet indoor depth benchmarks with accurate ground truth are still missing. To address this, we introduce WideDepth, the first indoor dataset for fisheye depth estimation, featuring 101 scenes containing 5K high-resolution stereo pairs labeled with millimeter-level ground truth depth and disparity. Our dataset also includes paired pinhole and fisheye samples across varying fields of view and baselines in both horizontal and vertical stereo setups. We further propose a method to adapt pinhole-trained stereo models to fisheye images and introduce a novel stereo fisheye image generation pipeline based on high-resolution LiDAR scans. Leveraging these methods, we thoroughly evaluate state-of-the-art monocular depth, stereo matching, and depth completion models on our benchmark. Additionally, we provide 18K LiDAR-derived sparse depth training samples, achieving up to a 62% performance boost on fisheye data when fine-tuning pinhole-based stereo models. In summary, the high precision and versatility of our benchmark set a strong foundation for advancing research in fisheye depth estimation and robotics perception. Project page: https://ilyaind.github.io/WideDepth

2605.24066 2026-05-26 cs.CV

Distance-Aware Joint Spatio-Temporal Graph Contrastive Learning for Major Depressive Disorder Diagnosis

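Muhammad Asif Hasan, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

AI summary Targets three problems of dynamic functional connectivity in major depressive disorder diagnosis (noisy estimates, underused frequency-domain information, and separate spatial and temporal modeling) with HWSTCL, a joint spatio-temporal graph contrastive learning framework built on a Hawkes-process prior. Spectral node descriptors, exponentially distance-decayed edge weights, and a kernel-weighted contrastive objective produce reliable spatio-temporal representations and improve diagnostic performance.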

Abstract

Major depressive disorder (MDD) is a common neuropsychiatric condition whose accurate diagnosis from resting-state functional magnetic resonance imaging (rs-fMRI) remains difficult. Dynamic functional connectivity (DFC) captures time-varying interactions among brain regions and provides rich spatio-temporal information, yet current DFC-based methods face three limitations: sliding-window Pearson correlation yields noisy estimates sensitive to window length and motion artifacts; correlation-derived node features do not fully exploit frequency-domain properties of blood-oxygen-level-dependent (BOLD) signals; and most spatio-temporal graph models handle spatial structure and temporal dynamics in separate stages, restricting their ability to represent coupled brain network evolution. To overcome these issues, we reformulate DFC learning as joint spatio-temporal graph representation learning under a Hawkes-process-inspired temporal dependency prior and propose HWSTCL, a two-stage framework built on a reliability-refined joint spatio-temporal graph with a kernel-weighted pretraining objective. Within each temporal window, BOLD signals are encoded as spectral node descriptors and functional edges are refined by an exponential distance-decay prior that down-weights less reliable long-range connections. The joint graph is then formed by linking each region to itself across future windows through a Hawkes-inspired exponential kernel, allowing spatial and temporal information to be propagated together during message passing. A kernel-weighted contrastive objective further promotes temporal consistency for each region across windows while reducing redundant similarity between different regions. Experiments on a benchmark rs-fMRI dataset show that HWSTCL outperforms recent baselines and yields coherent spatio-temporal representations for MDD diagnosis.
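
The two exponential weightings named in the abstract can be sketched as follows. The functional forms and the parameters `length_scale` and `decay_rate` are assumptions chosen for illustration; the abstract does not give the paper's actual definitions.

```python
import math


def decay_edge_weight(correlation: float, distance: float, length_scale: float) -> float:
    """Down-weight a functional edge by the distance between two brain regions.

    Assumed form: correlation * exp(-distance / length_scale), so long-range
    connections contribute less than nearby ones.
    """
    return correlation * math.exp(-distance / length_scale)


def temporal_kernel_weight(lag: int, decay_rate: float) -> float:
    """Hawkes-style exponential kernel linking a region to itself `lag` windows later.

    Assumed form: exp(-decay_rate * lag), so nearer windows are linked more strongly.
    """
    return math.exp(-decay_rate * lag)


if __name__ == "__main__":
    for distance in (0.0, 20.0, 80.0):
        print(distance, decay_edge_weight(0.8, distance, length_scale=40.0))
    for lag in (1, 2, 3):
        print(lag, temporal_kernel_weight(lag, decay_rate=0.5))
```

Both weights decrease monotonically, so a joint graph built from them would favor short-range spatial edges and short-lag temporal edges.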

2605.24065 2026-05-26 cs.CV

fMRI-Diffusion: Generating fMRI Time Series Via a Temporal Transformer Diffusion Model for Major Depressive Disorder Diagnosis

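Muhammad Asif Hasan, Yanming Zhu, Xuefei Yin, Alan Wee-Chung Liew

AI summary Proposes fMRI-Diffusion, a framework that uses a temporal Transformer diffusion model to synthesize ROI-level fMRI time series rather than functional connectivity matrices, preserving temporal information and improving MDD diagnostic accuracy in small-sample settings.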

Abstract

Diagnosing Major Depressive Disorder (MDD) from functional magnetic resonance imaging (fMRI) using functional connectivity (FC) analysis requires large amounts of labeled data that are scarce in clinical settings. Existing augmentation methods synthesize FC matrices, which compress fMRI recordings into static pairwise summaries and discard temporal information. We propose fMRI-Diffusion, a framework that synthesizes region-of-interest (ROI)-level fMRI time series rather than FC matrices. A Temporal Transformer serves as the denoising network within a denoising diffusion probabilistic model, treating each time point as a token to capture temporal dependencies through self-attention. A supervised pretraining strategy initializes the Transformer with task-relevant representations before diffusion training, and FC matrices are derived from the synthesized time series for classification. Experiments on the REST-meta-MDD dataset show that augmenting training data with synthetic time series consistently improves diagnostic accuracy across ten classifiers, six parcellation atlases, and three acquisition sites. The method outperforms five recent FC-based synthesis approaches, with accuracy gains of up to 3.7 percentage points over the strongest baseline. Ablation studies confirm the contributions of both the Transformer-based denoiser and the pretraining strategy. Distributional fidelity metrics remain below 0.06 across all conditions, indicating close agreement between real and synthetic distributions. These findings suggest that synthesizing fMRI time series before FC computation preserves temporal information lost in matrix-level augmentation and provides a practical strategy for MDD diagnosis under limited data.

2605.24064 2026-05-26 cs.LG cs.AI

Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion

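Jaejun Lee, Seheon Kim, Joyce Jiyoung Whang

AI summary Proposes KREPE, a generative representation learning method based on masked discrete diffusion, for completing arbitrarily masked queries and generating facts in hyper-relational knowledge graphs; it unifies link prediction and fact generation and reaches state-of-the-art performance.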

Comments 28 pages, 16 figures, 18 tables, 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem, current methods cast it as a simple link prediction, assuming that nearly all entities and relations within a fact are known, leaving only a single blank to be filled. However, this restricted assumption may not hold in real-world scenarios in which multiple, or even all, constituent components of a fact may be missing simultaneously. To bridge this gap, we introduce a task called fact generation: generating a valid hyper-relational fact from an arbitrarily masked query, i.e., completing a partially observed fact or generating a fact from scratch. We propose KREPE, the first generative representation learning method for HKGs that learns to model the probability distributions of missing components conditioned on the local fact components and global structure of HKGs via a masked discrete diffusion. KREPE models both the intra-fact dependencies by contextual message passing and inter-fact correlations by aggregating stochastically sampled contexts. KREPE seamlessly unifies link prediction and fact generation within a single training framework, achieving state-of-the-art performance on standard HKG link prediction benchmarks and outperforming LLM-based baselines in generating novel and correct facts.

2605.24062 2026-05-26 cs.LG cs.AI

Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette

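Koffka Khan

AI summary Surveys the intersection of human-body communication and federated learning for wearable devices, proposes a taxonomy that distinguishes intra-body, body-hub, cross-user, and clinical-cloud federated learning deployments, and introduces the BODYFED-HBC reference architecture and a scheduling algorithm for body-channel-aware federated learning.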

Abstract

Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication around the body and reduce the burden of conventional radio links. Federated learning (FL) is a promising learning substrate because it can reduce raw-data centralization for physiological and behavioral sensing. Yet these two literatures remain weakly connected: FL for wearables usually abstracts the communication layer, whereas HBC research usually abstracts learning and model-update traffic. This article surveys the intersection of HBC, wireless body-area networks, wearable FL, Internet-of-Bodies privacy, and edge-intelligence optimization. We propose a taxonomy that distinguishes intra-body, body-hub, cross-user, and clinical-cloud FL deployments, and we identify the open problem of body-channel-aware FL: learning protocols whose client selection, update compression, and aggregation are controlled by posture-dependent HBC links, residual energy, sensor memory, and privacy risk. To make the research agenda concrete, we introduce BODYFED-HBC as a reference architecture and provide an optimization formulation and scheduling algorithm. We further specify a reproducible simulation vignette that combines public wearable datasets with empirical body-coupled-communication signal-loss models. The article concludes with open datasets, evaluation metrics, limitations, and research directions for computer scientists working above the hardware layer.

2605.24058 2026-05-26 cs.LG cs.AI

Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

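Yoshihiko Fujisawa, Yuma Ichikawa, Yudai Fujimoto, Akira Sakai, Katsuki Fujisawa

AI summary Proposes LoRDBA, an adapter that replaces the low-rank factors with binary sign carriers and channel-wise scales, substantially reducing storage and compute overhead while remaining LoRA-compatible, and matching or outperforming low-bit baselines in on-device fine-tuning.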

Comments 34 pages, 3 figures

Abstract

On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-specific LoRA adapter. In the unmerged adapter-mode setting, however, the adapter is more than a compact storage module; it introduces an additional dense floating-point branch, maintains a trainable state for local updates, and acts as a unit of communication and hot-swapping. We introduce LoRDBA, a LoRA-compatible adapter that replaces both low-rank factors with binary sign carriers while representing magnitudes through lightweight, channel-wise scales, converting the dense adapter branch into two sign-accumulation matrix multiplications interleaved with channel-wise scaling. A finite-sample analysis shows that reconstruction quality is governed by the residual-to-magnitude ratio of the original LoRA factors. In adapter-mode experiments, LoRDBA outperforms low-bit baselines at matched model sizes while matching fp16 LoRA quality in selected regimes. The unmerged adapter incurs at most 8% prefill latency overhead at matched rank r=16 despite an over 10x reduction in adapter footprint, with moderate training memory overhead of approximately 1.6x that of fp16 LoRA.
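
The idea of a sign carrier plus a per-channel scale can be illustrated with the standard sign-and-mean-magnitude binarization of a matrix row. This is a generic technique, not LoRDBA itself. Treating each row as a "channel" and the helper names `binarize_rows` and `reconstruct` are assumptions; the abstract does not specify how the paper assigns or trains its scales.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def binarize_rows(matrix: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Split each row into a +1/-1 sign pattern and one floating-point scale.

    The scale is the mean absolute value of the row, which minimizes the
    squared error between the row and scale * sign(row).
    """
    signs, scales = [], []
    for row in matrix:
        signs.append([1 if x >= 0 else -1 for x in row])
        scales.append(sum(abs(x) for x in row) / len(row))
    return signs, scales


def reconstruct(signs: List[List[int]], scales: List[float]) -> Matrix:
    """Rebuild the approximate matrix from sign patterns and per-row scales."""
    return [[scale * bit for bit in row] for row, scale in zip(signs, scales)]


if __name__ == "__main__":
    factor = [[0.5, -0.5], [2.0, -1.0]]
    signs, scales = binarize_rows(factor)
    print(signs, scales, reconstruct(signs, scales))
```

Rows whose entries share a similar magnitude are reconstructed closely, which is consistent with the abstract's remark that quality depends on a residual-to-magnitude ratio.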

2605.24057 2026-05-26 cs.LG cs.AI

Feature Lottery? A Bifurcation Theory of Concept Emergence

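Fuming Yang

AI summary Proposes a bifurcation-theoretic method that detects the emergence of structure in representation dynamics as a supercritical pitchfork bifurcation driven by the loss Hessian, introduces the label-free phase coordinate β/β_c, validates four distinct transition regimes across several settings, and shows that feature interpretability is predictable early in training.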

Abstract

Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies on retrospective, label-dependent metrics. We introduce a bifurcation theory of representation dynamics to detect these moments in real time. Analyzing a passive GMM probe attached to the evolving encoder, we show the onset of structure corresponds to a supercritical pitchfork bifurcation driven by the loss Hessian. The system exhibits a theoretically predictable zero-crossing ($β_c$) that, compared to the network's current state ($β$), yields a dynamic ratio $β(t)/β_c(t)$: a universal, label-free phase coordinate for representation dynamics, computable entirely from hidden states. We empirically validate four distinct transition regimes predicted by this coordinate across diverse settings: SAEs on language models (Pythia), SSL (CIFAR), and grokking (modular arithmetic). Crucially, under finite dissipation, macroscopic symmetry-breaking can lag the initial zero-crossing by orders of magnitude, providing a rigorous dynamical account of the delayed escape observed in grokking. Microscopically, the bifurcation creates a shared unstable subspace, forcing collective symmetry breaking. We term this the "feature lottery" in SAE training: a feature's terminal interpretability becomes predictable remarkably early. By only 5% of training, early atom purity robustly predicts final convergence purity, with top-decile early atoms achieving over 12x the baseline purity at convergence. Beyond explaining concept emergence, $β/β_c$ provides a practical early-warning indicator for training health, detecting the onset of usable structure, the crystallization of feature identity, and representational collapse epochs before downstream metrics react.

2605.24055 2026-05-26 cs.LG cs.AI

Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions

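Yuefeng Liu, Ning Yang, Ziyu Yang

AI summary Proposes Cascade-KDE, a training-free framework that combines two-dimensional density estimation, a density-truncated robust expectation, and an exponential cascade with adaptive stopping to robustly restore time series corrupted by Gaussian noise and impulse outliers while preserving local structure.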

Abstract

Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occasional large-magnitude impulse outliers. For tasks that depend on local shape, such as ECG morphology analysis and battery degradation monitoring, the main requirement is not only low reconstruction error but also preservation of derivative peaks and task-critical features. We propose Cascade-KDE, a training-free restoration framework for corrupted time series. The method first estimates a two-dimensional temporal-amplitude density, then applies a Density-Truncated Robust Expectation to limit the influence of distant abnormal points, and finally refines the sequence through an exponential cascade with adaptive stopping. This design aims to improve robustness under out-of-distribution impulse corruptions while keeping the restored trajectory close to the original local structure. Across several benchmark datasets, the proposed method shows consistent gains over classical filters and representative learning-based baselines on curve fidelity, derivative preservation, downstream classification, and runtime efficiency. These results suggest that bounded density-based restoration is a practical option for feature-preserving preprocessing in noisy time-series pipelines.

2605.24053 2026-05-26 cs.AI cs.CL cs.LG

Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

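Maikel Yelandi Leyva-Vázquez, Florentin Smarandache

AI summary Proposes neutrosophic logic, with Truth, Indeterminacy, and Falsity as three independent dimensions, as an alternative to the conventional probabilistic framework. Experiments indicate a richer representation of an LLM's internal state, with hyper-truth states emerging spontaneously in 35% of evaluations; the authors present this as a key step toward transparent, reliable, and ethically aware AI systems.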

Comments Published in Neutrosophic Sets and Systems, Vol. 99 (2026). Author's preprint version. Open code and data available at: github.com/mleyvaz/neutrosophic-llm-logic

Journal ref
Neutrosophic Sets and Systems, Vol. 99, 2026

Abstract

Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constrained to unity. This architectural limitation, often imposed by Softmax layers, leads to a collapse of uncertainty that makes it difficult to differentiate between epistemic uncertainty, paradox, and vagueness. We present an empirical investigation of the application of Neutrosophic Logic, a framework that treats Truth (T), Indeterminacy (I), and Falsity (F) as three independent dimensions, to model epistemic states in LLMs. We conducted experiments on a family of four OpenAI GPT models across five linguistic phenomena: logical paradoxes, epistemic ignorance, vagueness, ethical contradictions, and future contingencies, under three prompting strategies: neutrosophic, probabilistic, and entropy-derived. Our findings reveal that the neutrosophic approach, by allowing T+I+F > 1, a state we term hyper-truth, provides a richer representation of a model's internal state. In 35% of evaluations, hyper-truth emerged spontaneously, predominantly under ethical contradiction and logical paradox. We demonstrate that this approach preserves truth values in fuzzy contexts and offers a robust method for identifying and quantifying internal model conflict. We conclude that the integration of neutrosophic evaluation layers is a critical step toward more transparent, reliable, and ethically aware AI systems.
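
A minimal sketch of the three-component representation and the hyper-truth condition T+I+F > 1 described in the abstract. The class name, the per-component range of [0, 1], and the tolerance are assumptions for illustration; how the paper elicits these values from a model is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NeutrosophicValue:
    """Truth, indeterminacy, and falsity as three independent components."""

    truth: float
    indeterminacy: float
    falsity: float

    def __post_init__(self) -> None:
        # Assumed range: each component lies in [0, 1], with no constraint on the sum.
        for name in ("truth", "indeterminacy", "falsity"):
            component = getattr(self, name)
            if not 0.0 <= component <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {component}")

    def total(self) -> float:
        return self.truth + self.indeterminacy + self.falsity

    def is_hyper_truth(self, tolerance: float = 1e-9) -> bool:
        """True when T + I + F exceeds 1, the state the abstract calls hyper-truth."""
        return self.total() > 1.0 + tolerance


if __name__ == "__main__":
    paradox = NeutrosophicValue(truth=0.8, indeterminacy=0.5, falsity=0.6)
    plain = NeutrosophicValue(truth=0.5, indeterminacy=0.2, falsity=0.3)
    print(paradox.is_hyper_truth(), plain.is_hyper_truth())
```

A probability distribution over outcomes could never trigger `is_hyper_truth`, because its components must sum to one.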

2605.24052 2026-05-26 cs.LG cs.AI

Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing

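Shugang Hao, Lingjie Duan

AI summary Addresses strategic misreporting of preference feedback by workers in mobile crowdsourcing with a dynamic Bayesian game model and an online weighted aggregation mechanism that ensures truthful feedback and achieves sublinear regret.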

Abstract

To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large language model (LLM)-generated content (e.g., AI-generated traffic condition predictions) with human feedback collected from crowdsourcing workers (e.g., mobile users). However, workers may strategically misreport their online preference feedback to maximize their influence or payment. Existing pipelines in mobile crowdsourcing (e.g., EM-based weight estimation) fail to identify the most accurate worker in this online setting, resulting in a linear regret $\mathcal{O}(T)$ over $T$ time slots. In this paper, we study truthful online preference aggregation for LLM fine-tuning in mobile crowdsourcing. We formulate a new dynamic Bayesian game to model the multi-agent online learning process between the platform and strategic mobile workers. We propose a novel online weighted aggregation mechanism that dynamically adjusts each worker's weight in the preference aggregation according to their feedback accuracy. We prove that our mechanism ensures truthful feedback from strategic workers and achieves a sublinear regret $\mathcal{O}(\sqrt{T})$ over $T$ time slots. We further extend our mechanism to a challenging scenario with limited worker feedback per time slot, still guaranteeing a sublinear regret $\mathcal{O}(\sqrt{T})$. Experiments on LLM fine-tuning with real-world datasets further demonstrate significant performance gains of our mechanisms over benchmark schemes.
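
The idea of adjusting each worker's weight according to feedback accuracy can be illustrated with a standard multiplicative-weights (Hedge-style) update. This is a swapped-in textbook technique, not the authors' mechanism: it ignores incentives and truthfulness entirely, and the function name `hedge_aggregate`, the binary labels, and the learning rate `eta` are assumptions made for the sketch.

```python
import math


def hedge_aggregate(reports, outcomes, eta):
    """Weighted-majority aggregation with multiplicative weight updates.

    reports[t][i] is worker i's binary label in time slot t;
    outcomes[t] is the label revealed after the platform has decided.
    Returns the platform's decisions and the final normalized weights.
    """
    n_workers = len(reports[0])
    weights = [1.0] * n_workers
    decisions = []
    for labels, truth in zip(reports, outcomes):
        total = sum(weights)
        vote_for_one = sum(w for w, y in zip(weights, labels) if y == 1) / total
        decisions.append(1 if vote_for_one >= 0.5 else 0)
        # Shrink the weight of every worker whose report was wrong.
        weights = [w * math.exp(-eta * (y != truth))
                   for w, y in zip(weights, labels)]
    total = sum(weights)
    return decisions, [w / total for w in weights]


# Hypothetical data: worker 0 is always right, worker 1 always wrong,
# worker 2 always reports 1 (right half of the time).
outcomes = [t % 2 for t in range(20)]
reports = [[o, 1 - o, 1] for o in outcomes]
decisions, final_weights = hedge_aggregate(reports, outcomes, eta=0.5)
```

After a few slots the accurate worker dominates the vote, which is the behaviour a fixed, offline weight estimate cannot reproduce in an online setting.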

2605.24048 2026-05-26 cs.LG cs.AI

Mixture of Complementary Agents for Robust LLM Ensemble

互补代理混合:鲁棒的大语言模型集成

Yichi Zhang, Kevin Lu, Yuang Zhang, Jie Gao, Lirong Xia, Fang-Yi Yu

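AI summary: Treats the choice of LLMs as a combinatorial selection problem and proposes complementarity-based greedy selection algorithms that achieve the best trade-off between performance and cost.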

详情
AI中文摘要

多AI协作,例如集成或辩论大语言模型(LLMs),是一种有前景的聚合信息和提升性能的范式。这些流程的基础步骤是将多个提议LLM的响应输入到一个总结LLM中,后者合成一个更好的答案。然而,选择哪些提议者并非易事。现有方法主要关注准确性(选择最强模型)或多样性(确保多样性),并且常常忽视提议者之间以及与总结者之间的交互。我们将提议者选择重新定义为类似于特征选择的组合选择问题,其中LLM的价值在于其与其他模型的互补性。然而,由于时间复杂度过高,直接应用标准特征选择算法在LLM场景中不切实际。受此限制,我们探索了一系列计算可行的贪心式选择算法,这些算法使用少量标记集评估互补性。我们的实验验证了互补性作为提议者选择的指导原则,并确定了在实践中实现最佳性能-成本权衡的方法。

英文摘要

Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.
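
One way to picture complementarity-driven selection on a small labeled set is a greedy coverage rule: repeatedly add the proposer that answers the most examples not yet answered by the proposers already chosen. This is only a sketch under a strong assumption (an example counts as solved if any chosen proposer gets it right, and the summarizer is ignored); the function name `greedy_select` and the toy data are hypothetical and do not reproduce any specific algorithm from the paper.

```python
def greedy_select(correct, k):
    """Pick up to k proposers by marginal coverage of labeled examples.

    correct maps a proposer name to the set of labeled-example ids
    that proposer answers correctly.
    """
    chosen, covered = [], set()
    candidates = dict(correct)
    while candidates and len(chosen) < k:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] - covered))
        if not candidates[best] - covered:
            break  # no remaining proposer adds anything new
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered


# Hypothetical proposers: B is strong but redundant with A,
# C is weaker overall but covers examples A misses.
correct = {
    "A": {0, 1, 2, 3, 4, 5},
    "B": {0, 1, 2, 3, 4},
    "C": {6, 7, 8},
}
chosen, covered = greedy_select(correct, k=2)
```

An accuracy-only rule would pick A and B here; the coverage rule picks A and C, which is the kind of interaction between proposers the abstract argues existing approaches overlook.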

2605.24047 2026-05-26 cs.CV

EMMA: Extracting Multiple physical parameters from Multimodal Data

EMMA: 从多模态数据中提取多个物理参数

Farhat Shaikh, Ayan Banerjee, Sandeep Gupta

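AI summary: Proposes the EMMA framework, which uses physics-informed multimodal fusion and an LTC network to jointly infer a system's dynamical parameters from raw video, audio, and image-based time series without prior conditions or specialized sensors, outperforming single-modality methods across 100+ scenarios.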

Comments Accepted at CVPR 2026 (main conference)

详情
AI中文摘要

我们引入了EMMA,一个基于物理信息的多模态框架,能够直接从原始视频、音频和基于图像的时间序列观测中恢复系统的所有可识别动力学参数。与先前仅依赖视频的方法不同,这些方法难以处理遮挡状态、隐藏驱动输入或对已知初始条件和坐标系的假设,EMMA在统一的连续时间模型中对显式参数、隐式动力学分量和校准不变量进行联合推断。EMMA利用液态时间常数(LTC)网络从异构模态中学习潜在动力学,同时物理约束损失强制与支配微分方程保持一致。统一的特征管道实现了视频轨迹、声学特征和图表测量之间的一致对齐,使得EMMA能够在受迫、隐式和多元动力学下估计参数,无需分割掩码、可微渲染或专用传感器。在100多个场景中,包括五个标准动力学基准(75个Delfys视频)、具有隐藏输入的真实世界轮式机器人和四旋翼系统,以及涵盖生物和混沌系统的模拟图表案例研究,EMMA实现了稳健的多参数恢复,并显著优于现有的单模态和方程发现基线。我们的结果确立了EMMA作为从机会性多模态数据中进行物理一致模型提取的通用、可扩展解决方案。代码和数据可在 https://github.com/ImpactLabASU/EMMA-CVPR2026 获取。

英文摘要

We introduce EMMA, a physics-informed multimodal framework that recovers all identifiable dynamical parameters of a system directly from raw video, audio, and image-based time-series observations. Unlike prior video-only approaches that struggle with occluded states, hidden actuation inputs, or assumptions about known initial conditions and coordinate frames, EMMA performs joint inference of explicit parameters, implicit dynamical components, and calibration invariants within a unified continuous-time model. EMMA leverages a Liquid Time-Constant (LTC) network to learn latent dynamics from heterogeneous modalities while a physics-constrained loss enforces consistency with the governing differential equations. A unified feature pipeline enables consistent alignment across video trajectories, acoustic signatures, and chart-derived measurements, allowing EMMA to estimate parameters under forced, implicit, and multivariate dynamics without requiring segmentation masks, differentiable rendering, or specialized sensors. Across 100+ scenarios including five standard dynamical benchmarks (75 Delfys videos), real-world rover and quadrotor systems with hidden inputs, and simulation-chart case studies spanning biological and chaotic systems, EMMA delivers robust multi-parameter recovery and significantly outperforms existing single-modality and equation-discovery baselines. Our results establish EMMA as a general, scalable solution for physics-consistent model extraction from opportunistic multimodal data. Code and data are available at: https://github.com/ImpactLabASU/EMMA-CVPR2026

2605.24045 2026-05-26 cs.LG cs.AI

A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?

大规模数据集与基准:蛋白质-配体模型学习的是结合位点还是仅仅结合可能性?

Zhaohan Meng, Zhen Bai, Ke Yuan, Iadh Ounis, Zaiqiao Meng, Hao Xu, Joseph Loscalzo

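AI summary: Because existing benchmarks cannot assess whether models localize binding sites, the paper introduces the InteractBind dataset of roughly 100k protein-ligand pairs and a fine-grained benchmark; its binding-site localization task shows that models with strong binary prediction still localize poorly.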

Comments Under Review for the NeurIPS 2026 Conference, Track on Evaluations and Datasets

详情
AI中文摘要

蛋白质-配体建模是计算药物发现和分子设计的基础。现有的蛋白质-配体基准通常通过二元结合预测和亲和力回归等任务评估蛋白质与配体是否相互作用以及结合强度。然而,这些评估提供的证据有限,无法判断模型是否能够定位结合位点或识别分子识别背后的非共价相互作用。为填补这一空白,我们引入了InteractBind,一个大规模蛋白质-配体数据集,包含约10万对蛋白质-配体对,以及一个用于细粒度评估的基准。核心细粒度任务是结合位点定位,它利用覆盖六种主要非共价相互作用类型的蛋白质残基和配体原子相互作用图,评估模型导出的相互作用图是否能够定位结合位点。InteractBind还包含结合亲和力和蛋白质相似性控制的分割,以支持现实的泛化评估。使用InteractBind,我们评估了八个现有的基于序列和交互感知的模型,评估了二元结合预测和结合位点定位。结果显示,尽管二元结合预测表现强劲,但结合位点定位能力有限,且在不同非共价相互作用类型间存在显著差异。总体而言,InteractBind建立了一个基准范式,鼓励开发更具可解释性和物理基础的蛋白质-配体模型。

英文摘要

Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate whether a protein and ligand interact and how strongly they bind, through tasks such as binary binding prediction and affinity regression. However, these evaluations provide limited evidence of whether models can localize binding sites or identify the non-covalent interactions underlying molecular recognition. To address this gap, we introduce InteractBind, a large-scale protein-ligand dataset comprising approximately 100k protein-ligand pairs, together with a benchmark for fine-grained evaluation. The core fine-grained task is that of binding-site localization, which uses protein-residue and ligand-atom interaction maps spanning six major types of non-covalent interactions to assess whether model-derived interaction maps localize binding sites. InteractBind further includes binding affinity and protein similarity-controlled splits to support realistic generalization assessment. Using InteractBind, we evaluate eight existing sequence-based and interaction-aware models, assessing binary binding prediction and binding-site localization. Results reveal limited binding-site localization despite strong binary binding prediction, with marked variation across non-covalent interaction types. Overall, InteractBind establishes a benchmark paradigm that encourages the development of more interpretable and physically grounded protein-ligand models.

2605.24044 2026-05-26 cs.RO cs.SE cs.SY eess.SY

RED: Adaptive Real-Time DAG Scheduling for Robotic Inference under Environmental Dynamics

RED:面向环境动态的自适应实时DAG调度用于机器人推理

Zexin Li, Tao Ren, Johnathan Liu, Xiaoxi He, Cong Liu

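AI summary: Proposes the RED framework, which uses a deadline-aware scheduler and MIMONet structure alignment to schedule multi-task deep neural network workloads in real time on resource-constrained robotic platforms, adapting to environmental dynamics while preserving end-to-end timing constraints.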

Comments Extension version of RTSS'23

详情
AI中文摘要

部署在动态环境中的机器人必须应对环境驱动的变化,这些变化会在运行时重塑计算:新任务可能出现,优先级关系可能改变,整体工作负载结构会演变,所有这些都会降低性能,特别是在资源紧张和实时预算下需要多任务推理时。我们提出RED,一个用于资源受限机器人平台上多任务深度神经网络工作负载的实时调度框架,它适应机器人环境动态(RED),同时在建模假设下保留端到端时序保证。RED的核心是一个截止时间感知调度器,它分配中间子截止时间,从而能够适应由不可预测条件引起的计算图演变和异步推理。该框架还支持灵活部署MIMONet(多输入多输出神经网络),这种网络常用于多任务机器人,通过权重共享缓解内存压力。RED通过工作负载细化和图重构过程显式利用这种共享参数属性,将MIMONet结构与可调度性要求对齐,提高兼容性和效率。我们在NVIDIA Jetson系列平台和Apple M系列MacBook上实现RED,并在代表真实机器人场景的导航导向工作负载上进行评估。实验表明,在吞吐量、截止时间满足率、抗干扰鲁棒性、适应性和运行时开销方面,RED持续优于现有方法。

英文摘要

Robots deployed in dynamic environments must contend with environment-driven changes that reshape computation at runtime: new tasks may appear, precedence relations can shift, and overall workload structure evolves, all of which degrade performance, especially when multi-task inference is required under tight resource and real-time budgets. We present RED, a real-time scheduling framework for multi-task deep neural network workloads on resource-constrained robotic platforms that adapts to Robotic Environmental Dynamics (RED) while preserving end-to-end timing guarantees under modeling assumptions. The core of RED is a deadline-aware scheduler that assigns intermediate sub-deadlines, allowing it to accommodate evolving computation graphs and asynchronous inference induced by unpredictable conditions. The framework also supports flexible deployment of MIMONet (multi-input multi-output neural networks), commonly used in multi-tasking robots to alleviate memory pressure through weight sharing. RED explicitly leverages this shared-parameter property via a workload refinement and graph-reconstruction procedure that aligns MIMONet structure with schedulability requirements, improving compatibility and efficiency. We implement RED on NVIDIA Jetson family platforms and on an Apple M-series MacBook and evaluate it on navigation-oriented workloads representative of real robotic scenarios. Experiments show consistent gains over existing methods in throughput, deadline satisfaction, robustness to interference, adaptability, and runtime overhead.

2605.24043 2026-05-26 cs.LG cs.AI

LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

LLM-AutoSciLab:通过LLM主动实验进行闭环科学发现

Sanchit Kabra, Nikhil Abhyankar, Saaketh Desai, Prasad Iyer, Chandan K Reddy

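AI summary: Proposes LLM-AutoSciLab, a closed-loop framework that iterates hypothesis generation and experiment selection to acquire data actively under a budget; it outperforms existing methods on three benchmarks and is 2-5x more sample-efficient.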

详情
AI中文摘要

科学发现是一个闭环过程,其中假设指导数据采集,观察结果细化假设空间。然而,大多数方法将发现简化为对固定数据集的监督学习,其中有限的观察可能支持多种局部拟合但无法泛化的合理机制。因此,关键挑战在于选择信息丰富的观察以消除不确定性,将焦点从静态推断转向自适应数据采集。为此,我们提出LLM-AutoSciLab,一个将假设生成与假设条件实验选择和机制细化相结合的闭环框架。LLM-AutoSciLab不是将模型拟合到被动收集的数据,而是迭代地提出合理的假设,选择信息丰富的实验来区分或细化它们,并使用由此产生的证据更新其状态。为了评估具有主动数据采集的动态闭环科学发现,我们引入了ActiveSciBench,包含两个数据集:包含57个酶动力学任务的ActiveSciBench-Chem和包含45个基因调控网络任务的ActiveSciBench-GRN。这些数据集将发现建模为预算约束过程,需要自适应实验设计、变量选择和真实机制的恢复。在NewtonBench、ActiveSciBench-Chem和ActiveSciBench-GRN上,LLM-AutoSciLab优于先前方法,在NewtonBench和ActiveSciBench-Chem上分别达到67.6%和35.1%的符号准确率,在ActiveSciBench-GRN上达到31.1%的精确图恢复。此外,假设引导的实验比最强竞争基线样本效率高2-5倍。代码和数据可在https://github.com/scientific-discovery/LLM-AutoSciLab获取。

英文摘要

Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM-AutoSciLab, a closed-loop framework that couples hypothesis generation with hypothesis-conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench-Chem with 57 enzyme-kinetics tasks and ActiveSciBench-GRN with 45 gene-regulatory-network tasks. These datasets model discovery as a budget-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench-Chem, and ActiveSciBench-GRN, LLM-AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench-Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench-GRN. Moreover, hypothesis-guided experimentation is 2-5x more sample-efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific-discovery/LLM-AutoSciLab

2605.24040 2026-05-26 cs.CV

Learning to See Like Humans: Gaze-Aligned Cycling Safety Prediction

学习像人类一样看:基于注视对齐的骑行安全预测

Luís Maria Perdigão, Miguel Costa, Carlos Santiago, Manuel Marques

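AI summary: Proposes the Eye-Tracking-Guided Perceived Cycling Safety framework (EG-PCS), which integrates gaze data into a vision-transformer-based pairwise learning pipeline so that model attention aligns with human fixation patterns, improving predictive accuracy and interpretability.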

Comments Accepted to be published as part of the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC), Naples, Italy, September 15-18, 2026

详情
AI中文摘要

骑行带来显著的公共健康和环境效益,但在城市中的普及常受限于感知安全性。当街道环境看起来不安全时,人们骑行的可能性降低,因此感知成为采用骑行的关键障碍。近期研究表明,街景图像的成对比较为学习主观安全判断提供了一种可扩展的方法。然而,现有方法未明确建模人类视觉注意力,而注意力在人类感知安全中起核心作用。我们提出眼动追踪引导的感知骑行安全框架(EG-PCS),该框架将注视数据集成到基于视觉Transformer的成对学习流程中。通过用眼动追踪信号监督模型的注意力机制,我们促使学习到的注意力图与人类注视模式对齐。实验表明,与最先进方法相比,注视引导模型在实现相似排序性能的同时,生成的注意力图更准确地反映人类视觉注意行为。我们的结果表明,在基于感知的城市分析中融入眼动追踪信息可提升预测准确性和可解释性。

英文摘要

Cycling delivers significant public-health and environmental benefits, yet its uptake in cities is often limited by perceived safety. When street environments appear unsafe, individuals are less likely to cycle, making perception a key barrier to adoption. Recent work has shown that pairwise comparisons of street-view images provide a scalable way to learn subjective safety judgments. However, existing approaches do not explicitly model human visual attention, which plays a central role in how humans perceive safety. We propose an Eye-Tracking-Guided Perceived Cycling Safety framework (EG-PCS) that integrates gaze data into a pairwise learning pipeline based on vision transformers. By supervising the model's attention mechanism with eye-tracking signals, we encourage alignment between learned attention maps and human fixation patterns. Experiments show that gaze-guided models achieve similar ranking performance compared to state-of-the-art approaches while producing attention maps that more accurately reflect human visual attention behavior. Our results demonstrate that incorporating eye-tracking information enhances both predictive accuracy and interpretability in perception-based urban analytics.

2605.24037 2026-05-26 cs.CV cs.AI

Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling

模式即序列:将多模态运动预测转化为统一序列模式建模

Zikang Zhou, Haibo Hu, Xinhong Chen, Yifan Zhang, Nan Guan, Yung-Hui Li, Chun Jason Xue, Jianping Wang

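AI summary: Proposes the Mode-as-Sequence framework, which turns an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency; its two instantiations, ModeSeq and Parallel ModeSeq, address mode collapse and confidence ranking in multimodal motion prediction and achieve leading results on the Waymo dataset.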

详情
AI中文摘要

多模态运动预测本质上是欠监督的:每个训练场景只提供一个已实现的未来,但存在多个合理的未来。这种稀疏监督通常会导致模式坍塌(冗余假设和模式覆盖不足)以及在预测少量轨迹时置信度排序不可靠。我们提出Mode-as-Sequence,一个统一的解码框架,将无序模式集转化为有序模式序列,并显式建模模式间依赖。在该框架下,我们开发了两种互补的实例化方法。ModeSeq执行循环模式解码,每个模式基于先前生成的模式生成,鼓励多样化、非冗余的假设,并具有校准的置信度排序。为了消除逐模式自回归瓶颈,我们进一步提出Parallel ModeSeq,它使用掩码模式间自注意力保留相同的因果依赖,同时在前向传播中一次性解码所有模式,从而实现高效的大K推理和可扩展的联合场景预测。为了在稀疏标签下学习代表性模式和校准的置信度,我们引入了Early-Match-Take-All (EMTA)及其联合场景扩展MA-EMTA,以及一个轻量级的排序正则化器,以减少置信度反转。在大型基准上的大量实验表明,在数据集、预测时长和对象类型上,排序导向指标和最佳K准确率均有一致提升。在Waymo开放数据集挑战中,ModeSeq在2024年无激光雷达运动预测赛道获得第一名,Parallel ModeSeq在2025年交互预测挑战赛中获得第一名,验证了Mode-as-Sequence在准确性和效率上的有效性。

英文摘要

Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible futures exist. This sparse supervision often leads to mode collapse (redundant hypotheses and insufficient mode coverage) and unreliable confidence ranking when predicting a small set of trajectories. We propose Mode-as-Sequence, a unified decoding framework that translates an unordered mode set into an ordered mode sequence and explicitly models mode-to-mode dependency. Under this framework, we develop two complementary instantiations. ModeSeq performs recurrent mode decoding, where each mode is generated conditioned on the previously generated modes, encouraging diverse, non-redundant hypotheses with calibrated confidence ordering. To remove the mode-by-mode autoregressive bottleneck, we further propose Parallel ModeSeq, which preserves the same causal dependency using masked mode-to-mode self-attention while decoding all modes in a single forward pass, enabling efficient large-$K$ inference and scalable joint-scene prediction. To learn representative modes and calibrated confidence under sparse labels, we introduce Early-Match-Take-All (EMTA) and its joint-scene extension MA-EMTA, together with a lightweight ranking regularizer that reduces confidence inversions. Extensive experiments on large-scale benchmarks demonstrate consistent improvements in both ranking-oriented metrics and best-of-K accuracy across datasets, horizons, and object types. In the Waymo Open Dataset challenges, ModeSeq achieves 1st place in the 2024 LiDAR-free motion prediction track, and Parallel ModeSeq achieves 1st place in the 2025 Interaction Prediction Challenge, validating the effectiveness of Mode-as-Sequence for both accuracy and efficiency.

2605.24033 2026-05-26 cs.LG cs.LO

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

迈向可验证的Transformer:求解器可检查的电路解释

Neel Somani

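AI summary: Proposes the Verifiable Transformers framework, which converts task-localized Transformer circuits into bounded, solver-checkable claims so that circuit properties can be formally verified.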

详情
AI中文摘要

机制可解释性通常识别Transformer模型内部的电路,但这些电路的解释通常通过示例、消融和手动推理来验证。这留下了在发现合理电路与证明电路功能之间的差距。我们引入了Verifiable Transformers,一个将任务局部Transformer电路转化为有界、求解器可检查的声明的框架。给定一个行为、一个有限任务域和一个候选token投影,我们提取任务电路并验证属性,如投影功能等价性、边必要性、任务相关不变性和最终残差鲁棒性。直接验证将提取的电路本身编码到SMT求解器中。当电路包含无法精确或可处理编码的算子时,代理介导的验证拟合一个SMT可编码的代理,在有限域上针对提取的电路验证它,并针对代理验证符号解释。我们使用带有Signed L1 BandNorm、sparsemax注意力和LeakyReLU的GPT风格架构实例化直接验证。在小型符号序列任务上,我们训练一个SMT可表示的Transformer,提取用于引号闭合和括号类型跟踪的稀疏电路,并详尽验证投影功能等价性、内容不变性、边必要性和最终残差鲁棒性。在GPT-2规模上,相同的算子堆栈在OpenWebText上稳定训练,尽管朴素直接SMT验证仍然难以处理。我们还展示了在具有难以编码注意力的任务局部电路上的代理介导验证,显示了已验证的符号解释和求解器生成的反例。目标不是全模型验证,而是为将机制电路解释转化为可证明或反驳的形式命题提供一条具体路径。

英文摘要

Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.
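
The abstract names sparsemax attention as one of the operators in its directly verifiable architecture. A minimal pure-Python sketch of the standard sparsemax projection follows; it illustrates the operator only, not the paper's SMT encoding. Sparsemax is piecewise linear and produces exact zeros, which is plausibly what makes it friendlier to solvers than softmax, though that motivation is an assumption here.

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex."""
    zs = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(zs, start=1):
        cumsum += zk
        # The support condition holds for k = 1..k* and fails afterwards,
        # so the last update leaves tau at the threshold for k*.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    # Scores at or below the threshold are clipped to exactly zero.
    return [max(x - tau, 0.0) for x in z]


peaked = sparsemax([2.0, 1.0, 0.1])    # one score dominates
mixed = sparsemax([1.0, 0.8, -1.0])    # two scores share the mass
```

Unlike softmax, which would give every position a strictly positive weight, both outputs above contain exact zeros while still summing to one.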

2605.24025 2026-05-26 cs.CV cs.LG

Towards Large Model Feature Coding

面向大模型特征编码

Youwei Pang, Changsheng Gao, Dong Liu, Huchuan Lu, Weisi Lin

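AI summary: Presents a benchmark and evaluation framework for large model feature coding (LaMoFC); the LaMoFCBench feature dataset, covering 4 categories and 16 scenarios, exposes a severe misalignment between existing coding paradigms and the heterogeneity of large model features.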

详情
AI中文摘要

大模型在广泛的感知和生成任务中取得了显著性能,但实际部署日益受到计算和内存预算以及隐私要求的限制。分割执行通过跨设备划分计算来缓解这些约束,但不可避免地引入了中间特征的密集传输和存储。与通常针对同质空间激活图的传统CNN特征编码不同,现代大模型生成具有不同统计分布和压缩容忍度的异构特征,例如多级/多模态表示和自回归上下文缓存。这些特性使得将大模型特征编码(LaMoFC)视为一个基本系统组件,并需要一个系统的评估框架。在本文中,我们提出了一个全面的LaMoFC基准和评估框架。我们首先构建特征数据集LaMoFCBench,涵盖4个类别和16个场景中的多样化任务需求,同时集成广泛采用的架构和各种分割计算设置。然后,我们根据实际应用场景指定代表性的分割点以提取中间特征,建立统一的流水线以实现公平和可重复的比较。最后,我们对主流的通用特征编解码器进行基准测试,揭示了现有编码范式与大模型特征异构性之间的严重错位。这些发现表明,LaMoFC需要从根本上脱离现有范式,而LaMoFCBench提供了推动这一转变的共享实证基础。数据和代码将在https://github.com/lartpang/LaMoFCBench上提供。

英文摘要

Large models have delivered remarkable performance across a wide range of perception and generation tasks, yet practical deployment is increasingly constrained by computational and memory budgets, as well as privacy requirements. Split execution alleviates these constraints by partitioning computation across devices, but it inevitably introduces intensive transmission and storage of intermediate features. Unlike conventional feature coding for CNNs that typically targets homogeneous spatial activation maps, modern large models generate heterogeneous features with varying statistical distributions and compression tolerances, e.g., multi-level/multi-modal representations and autoregressive context caches. These characteristics necessitate treating large model feature coding (LaMoFC) as a fundamental system component and call for a systematic evaluation framework. In this paper, we present a comprehensive benchmark and evaluation framework for LaMoFC. We first build the feature dataset LaMoFCBench, covering diverse task requirements across 4 categories and 16 scenarios while integrating widely adopted architectures and various split-computing settings. We then specify representative split points according to practical application scenarios to extract intermediate features, establishing a unified pipeline for fair and reproducible comparisons. Finally, we benchmark mainstream universal feature codecs, exposing the profound misalignment between existing coding paradigms and the heterogeneous nature of large model features. These findings reveal that LaMoFC demands a fundamental departure from existing paradigms, and LaMoFCBench provides the shared empirical foundation to drive this transition. The data and code will be available at https://github.com/lartpang/LaMoFCBench.

2605.24024 2026-05-26 cs.CV

Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating

通过因果路由门控减轻大型视觉语言模型中的幻觉

Zhe Cheng, Wenyu Chen, Fode Zhang, Dehuan Shen

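AI summary: To address hallucinations caused by dominance of the textual pathway in large vision-language models, the paper proposes a training-free, decision-aligned intervention that decomposes attention heads into visual and text routes and suppresses the text route, effectively reducing hallucination errors.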

Comments Accepted as a Spotlight Paper at ICML 2026. 33 pages, 8 figures

详情
AI中文摘要

大型视觉语言模型(LVLMs)常常产生流畅但缺乏图像支持的幻觉内容,限制了其在现实部署中的可靠性。我们表明,一个关键的失败模式源于路由竞争:即使视觉标记获得注意力,最终的标记决策也可能被文本路径主导,导致解码器遵循语言先验而非视觉证据。为了缓解这一问题,我们提出一种无训练、决策对齐的干预方法,将每个注意力头分解为视觉路由和文本路由,并使用高效的一次前向/一次梯度近似估计其标记级效应。这些估计揭示了头内的路由冲突并识别出先验主导的头,从而能够选择性地仅抑制文本路由,同时保持视觉路由完整。在涵盖判别和生成设置的五个基准测试中,我们的方法一致地减少了幻觉相关错误,对整体多模态性能影响有限,同时仅带来适度的推理时间开销。

英文摘要

Large vision-language models (LVLMs) often hallucinate content that is fluent yet unsupported by the image, limiting their reliability in real-world deployment. We show that a key failure mode arises from route competition: even when visual tokens receive attention, the final token decision can be dominated by the textual pathway, causing the decoder to follow linguistic priors over visual evidence. To mitigate this, we propose a training-free, decision-aligned intervention that decomposes each attention head into a visual route and a text route, and estimates their token-level effects using an efficient one-forward/one-gradient approximation. These estimates reveal route conflict within heads and identify prior-dominant ones, enabling selective suppression of only the text route while keeping the visual route intact. Across five benchmarks spanning discriminative and generative settings, our method consistently reduces hallucination-related errors across models with limited impact on overall multimodal performance, while incurring a modest inference-time overhead.

2605.24023 2026-05-26 cs.CV cs.DM

Soft Tuy-Completeness for Robust Projection Selection in Cone-Beam CT

锥束CT中鲁棒投影选择的软Tuy完备性

Linda-Sophie Schneider, Andreas Maier

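AI summary: Building on Tuy's completeness theory, the paper proposes a continuous soft near-orthogonality score and a resolution-aware saturated coverage objective, selects projections with a submodular greedy algorithm and a mixed-integer linear program, and introduces Effective Spatial Resolution as a trajectory-level diagnostic.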

Comments Preprint

详情
AI中文摘要

本工作引入了一个连续的软近正交评分和一个分辨率感知的饱和覆盖目标,用于感兴趣区域聚焦的锥束CT中的投影选择,基于Tuy完备性理论。将经典Tuy完备性的二元命中-未命中模型替换为分级的、可微的公式,保留了对可实现特征尺寸的直接联系,同时支持高效的近似和精确优化。我们通过从集合覆盖的多项式时间归约证明了底层离散决策问题是NP完全的,从而激发了具有证明的$(1-1/\mathrm{e})$近似保证的次模贪心算法和提供认证最优性边界的混合整数线性规划(MILP)。MILP作为贪心解的质量证书,而不是竞争性优化器。主要实证结果证实了这种关系:在跨越六个目标区域、多个投影预算和四个受控遮挡条件的系统基准测试中,贪心与MILP目标值的合并中位数为0.998,且相当一部分案例被认证为全局最优。包含一个二元公式作为诊断基线;它增强了硬方向完备性,但在连续覆盖尺度上较弱。我们还引入了有效空间分辨率(ESR),这是一个物理可解释的轨迹级诊断指标,将方向采样间隙映射到可实现的特征尺寸。ESR在投影预算和遮挡水平上与匹配的重建质量可靠相关,提供了选择阶段与图像域之间的实用桥梁,而无需重建。

英文摘要

This work introduces a continuous soft near-orthogonality score and a resolution-aware saturated coverage objective for projection selection in region-of-interest focused cone-beam CT, grounded in Tuy's completeness theory. Replacing the binary hit-or-miss model of classical Tuy completeness with a graded, differentiable formulation preserves a direct link to achievable feature sizes while enabling both efficient approximate and exact optimisation. We establish that the underlying discrete decision problems are NP-complete via polynomial-time reductions from Set Cover, motivating a submodular greedy algorithm with proven $(1-1/\mathrm{e})$ approximation guarantees and a mixed-integer linear program (MILP) that provides certified optimality bounds. The MILP serves as a quality certificate for the greedy solution rather than a competing optimiser. The primary empirical finding confirms this relationship: across a systematic benchmark spanning six target regions, multiple projection budgets, and four controlled occlusion conditions, the pooled median greedy-to-MILP objective ratio was 0.998, with a substantial fraction of cases certified globally optimal. A binary formulation is included as a diagnostic baseline; it strengthens hard directional completeness but is weaker on the continuous coverage scale. We additionally introduce Effective Spatial Resolution (ESR), a physically interpretable trajectory-level diagnostic that maps directional sampling gaps to achievable feature sizes. ESR correlates reliably with matched reconstruction quality across projection budgets and occlusion levels, providing a practical bridge between the selection stage and the image domain without requiring reconstruction.
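
The combination of a saturated coverage objective with a greedy selector can be sketched generically. The objective below, a sum over directions of the accumulated soft score capped at `cap`, is an assumed stand-in for the paper's resolution-aware objective; the score table, the function names, and the budget are hypothetical. With non-negative scores this capped sum is monotone submodular, which is the setting in which greedy selection under a cardinality budget carries the classical $(1-1/\mathrm{e})$ guarantee.

```python
def saturated_coverage(selected, scores, cap=1.0):
    """Sum over directions of min(cap, accumulated soft score)."""
    n_dirs = len(next(iter(scores.values())))
    return sum(min(cap, sum(scores[i][j] for i in selected))
               for j in range(n_dirs))


def greedy_projections(scores, budget, cap=1.0):
    """Add, one at a time, the projection with the largest marginal gain."""
    selected = []
    remaining = sorted(scores)
    for _ in range(min(budget, len(remaining))):
        base = saturated_coverage(selected, scores, cap)
        best = max(remaining,
                   key=lambda i: saturated_coverage(selected + [i], scores, cap) - base)
        selected.append(best)
        remaining.remove(best)
    return selected, saturated_coverage(selected, scores, cap)


# Hypothetical soft scores: projection id -> contribution to 3 directions.
scores = {
    0: [0.9, 0.1, 0.0],
    1: [0.7, 0.2, 0.0],
    2: [0.0, 0.3, 0.9],
    3: [0.1, 0.6, 0.1],
}
selected, value = greedy_projections(scores, budget=2)
```

The cap is what makes redundant views unattractive: once a direction is saturated, further projections covering it add nothing, so the selector moves on to directions that are still poorly sampled.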

2605.24020 2026-05-26 cs.CV cs.AI

Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments

理解视觉与语言信息并与人类及环境交互的机器智能

Van Quang Nguyen

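AI summary: Proposes GRIT, LTMI, and a two-stage instruction interpretation framework, which improve image captioning, visual dialog, and interactive instruction following respectively, achieving leading results in accuracy and efficiency.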

Comments Doctoral dissertation, Tohoku University, 2022. Uploaded for archival purposes. 146 pages

详情
AI中文摘要

计算机视觉与自然语言处理交叉领域的进展对于辅助技术、多媒体查询和机器人等应用至关重要。本论文提出了新颖的架构,以改进智能体在三个关键视觉-语言任务上的表现:图像描述、视觉对话和交互式指令跟随。 首先,我们解决了图像描述中视觉表示的局限性。传统模型依赖CNN检测器提取的区域特征,缺乏全局上下文且计算开销大。我们提出GRIT(基于网格和区域的图像描述Transformer),一种纯Transformer架构。通过使用基于DETR的检测器整合网格和区域特征,GRIT实现了端到端训练,并在推理准确性和速度上均优于先前方法。 其次,我们处理视觉对话,这需要对图像进行多轮对话。挑战在于高效建模多个输入(图像、问题、历史)之间的交互。我们引入LTMI(轻量级多输入Transformer)。利用专门的注意力块,LTMI层在VisDial数据集上验证,其表示能力与标准Transformer扩展相当,但参数不到其十分之一。 最后,我们使用ALFRED数据集研究具身AI的交互式指令跟随。我们提出一个包含两阶段指令解释的框架:首先独立于视觉上下文解码语言指令以预测暂定的动作-对象序列,然后与视觉特征融合以最终执行。通过使用多个自我中心视图和分层注意力,我们的方法准确定位对象,并实现了8.37%的最新未见成功率。

英文摘要

Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multimedia querying, and robotics. This dissertation proposes novel architectures to improve intelligent agents across three key vision-language tasks: image captioning, visual dialog, and interactive instruction following. First, we address limitations in visual representation for image captioning. Traditional models rely on region-based features from CNN detectors, which lack global context and suffer from high computational overhead. We propose GRIT (Grid and Region-based Image captioning Transformer), a transformer-only architecture. By integrating grid and region features using a DETR-based detector, GRIT enables end-to-end training and outperforms prior methods in both inference accuracy and speed. Second, we tackle visual dialog, which requires multi-turn conversation about an image. The challenge lies in efficiently modeling interactions between multiple inputs (image, question, history). We introduce LTMI (Light-weight Transformer for Many Inputs). Utilizing a specialized attention block, an LTMI layer matches the representational power of a standard Transformer extension while utilizing less than one-tenth of its parameters, as validated on the VisDial dataset. Finally, we study interactive instruction-following for embodied AI using the ALFRED dataset. We propose a framework featuring a two-stage instruction interpretation: it first decodes language directives independently of visual context to predict a tentative action-object sequence, which is then fused with visual features for final execution. Using multiple egocentric views and hierarchical attention, our method accurately localizes objects and achieves a state-of-the-art unseen success rate of 8.37%.

2605.24019 2026-05-26 cs.CV cs.LG

MGVQ: Synergizing Multi-dimensional Sensitivity-Aware and Gradient-Hessian Fusion for Vector Quantization

MGVQ:协同多维敏感度感知与梯度-海森融合的向量量化

Zhong Wang, Zukang Xu, Xing Hu, Dawei Yang

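AI summary: Proposes the MGVQ framework, which achieves ultra-low-bit vector quantization of vision-language models through sensitivity-guided structured mixed-precision quantization and gradient-aware second-order error compensation, improving accuracy by up to 4.9 points under 2-bit quantization.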

详情
AI中文摘要

视觉-语言模型(VLM)取得了卓越的性能,但其巨大的模型尺寸严重阻碍了在资源受限的边缘设备上的部署。作为一种高效的模型压缩技术,向量量化(VQ)在超低位表示方面表现出色,它将模型权重映射到紧凑码本中的离散码字,以降低内存消耗和传输开销,同时保持模型能力。直接将VQ应用于VLM仍存在两个核心限制。首先,视觉和文本输入带来的跨模态权重分布差异无法被单一的统一码本很好地拟合。其次,当前的二阶误差补偿忽略了梯度信息,导致权重偏离预训练最优状态、梯度漂移和补偿结果有偏。本文提出MGVQ,一种新颖的向量量化框架,集成了多维敏感度感知和梯度-海森融合。它包含两个核心模块:敏感度引导的结构化混合精度量化,通过结合全局和局部敏感度分析,根据通道敏感度动态分配不同位宽,实现精细的资源分配;梯度感知的二阶误差补偿,将一阶梯度嵌入误差校正,并采用Kronecker和Block-LDL分解确保低计算成本。在主流VLM(包括LLaVA-onevision、InternVL2和Qwen2-VL)上的大量实验验证了MGVQ的有效性。在2-bit量化设置下,MGVQ显著超越现有先进的后训练量化方法,在InternVL2-26B上最高提升4.9个点(71.4% vs 67.0%)。所提方法实现了稳定高效的超低位VLM量化,极大促进了多模态大模型在资源受限环境中的实际部署。

英文摘要

Vision-Language Models (VLMs) achieve outstanding performance, yet their huge model size severely hinders deployment on edge devices with limited resources. As an efficient model compression technique, vector quantization (VQ) excels in ultra-low-bit representation, which maps model weights to discrete codewords in a compact codebook to cut memory consumption and transmission overhead while preserving model capability. Direct VQ application to VLMs still has two core limitations. First, cross-modality weight distribution differences brought by visual and textual inputs cannot be well fitted by a single unified codebook. Second, current second-order error compensation ignores first-order gradient information, causing weight deviation from pre-trained optimal states, gradient drift and biased compensation results. This work proposes MGVQ, a novel vector quantization framework integrating multi-dimensional sensitivity perception and gradient-Hessian fusion. It consists of two core modules: sensitivity-guided structured mixed-precision quantization dynamically assigns different bit-widths according to channel sensitivity via combined global and local sensitivity analysis for refined resource allocation; gradient-aware second-order error compensation embeds first-order gradients into error correction, and adopts Kronecker and Block-LDL decomposition to ensure low computational cost. Extensive experiments on mainstream VLMs including LLaVA-onevision, InternVL2 and Qwen2-VL verify the effectiveness of MGVQ. In 2-bit quantization settings, MGVQ surpasses existing advanced post-training quantization methods significantly, achieving a maximum accuracy improvement of 4.9 points (71.4% vs 67.0% on InternVL2-26B). The proposed method realizes stable and efficient ultra-low-bit VLM quantization, greatly promoting the practical deployment of multimodal large models in resource-limited environments.
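
The basic operation the abstract builds on, mapping groups of weights to codewords in a compact codebook, can be shown in a few lines. This sketch covers only plain nearest-codeword vector quantization; it includes none of MGVQ's sensitivity analysis or error compensation, and the codebook, weights, and function names are hypothetical.

```python
import math


def quantize(weights, codebook, dim):
    """Split flat weights into dim-sized vectors and store nearest-codeword indices."""
    if len(weights) % dim != 0:
        raise ValueError("weight count must be a multiple of dim")
    indices = []
    for start in range(0, len(weights), dim):
        vec = weights[start:start + dim]
        # Squared Euclidean distance from this vector to every codeword.
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def dequantize(indices, codebook):
    """Rebuild approximate weights by looking the indices up in the codebook."""
    return [x for i in indices for x in codebook[i]]


def bits_per_weight(codebook, dim):
    # Each index costs log2(codebook size) bits and covers dim weights.
    return math.log2(len(codebook)) / dim


codebook = [[1.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, -1.0]]
weights = [0.9, 1.1, -1.2, -0.8, 0.1, -0.1, 1.0, -1.0]
indices = quantize(weights, codebook, dim=2)
restored = dequantize(indices, codebook)
```

With 4 codewords of dimension 2 this toy setup spends 1 bit per weight; a 2-bit setting would correspond, for example, to 16 codewords at the same dimension. The codebook itself adds a fixed storage cost that this calculation leaves out.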

2605.24018 2026-05-26 cs.AI cs.MA

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

EvoSci: 一种受生物启发的多智能体框架用于科学发现的演化

Xiaoyu Xiong, Yuqi Ren, Deyi Xiong

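AI summary Proposes the EvoSci framework, which combines bio-inspired evolution with knowledge graph modeling and uses collaborating role-based agents to iteratively generate, evaluate, and refine research ideas, significantly improving the coherence and creativity of scientific exploration.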

Comments ACL 2026 Main Conference

Details
AI Chinese abstract

大型语言模型(LLMs)在科学发现中展现出强大潜力,但现有方法在研究工作流设计和多角色协作机制方面仍面临重大挑战。为解决这些问题,我们提出了EvoSci,一个多智能体科学协作框架,它整合了受生物启发的演化与知识图谱建模。为了迭代生成、评估和完善研究想法,EvoSci包含了多个基于角色的智能体,包括导师、研究者和评审者。通过结合协作推理、共享记忆和演化反馈,EvoSci显著增强了科学探索的连贯性和创造力。在真实研究主题上的实验表明,EvoSci在基于LLM的结构化同行评审和比较排名评估中显著优于强基线,获得了最高的整体同行评审分数(ICLR 4.90)和最高排名(Top-10 = 54)。这些结果表明其在科学想法生成和持续发现方面的优越性。

English abstract

Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework that integrates bio-inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

2605.24014 2026-05-26 cs.CV

SkySeg: Collaborative Onboard Semantic Segmentation with Heterogeneous UAVs in the Wild

SkySeg: 野外异构无人机协同机载语义分割

Anqi Lu, Yun Cheng, Youbing Hu, Zhiqiang Cao, Jie Liu, Zhijun Li

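AI summary To address the challenge of real-time semantic segmentation on resource-constrained UAVs in dynamic environments, proposes SkySeg, a heterogeneous multi-UAV air-air cooperation framework that combines efficient information-fusion inference with a cross-device test-time adaptation strategy, enabling onboard segmentation with low-cost sensors, speeding up inference by about 3.6x and improving accuracy by 5.91%.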

Details
AI Chinese abstract

基于无人机的图像采集和分析需求激增,无人机越来越多地用于语义分割任务。为了满足无人机遥感任务的实时分析要求,进行机载计算并基于结果做出决策是一种自然的方法。然而,在资源受限的无人机平台上部署语义分割面临两个重大挑战:1)硬件限制限制了无人机执行实时语义分割的能力,2)飞行过程中的环境变化导致数据分布偏移,偏离原始训练数据。为了解决这些问题,本文介绍了SkySeg,一种异构多无人机空-空协作框架,它集成了计算机视觉和飞行模式,能够使用低成本传感器实现机载语义分割。SkySeg采用高效的信息融合推理方法,将低分辨率广域图像与高分辨率聚焦区域图像相结合。此外,它还包含一种跨设备测试时自适应策略,通过协作解决无人机间测试数据流的分布偏移,增强动态环境中的分割性能。实验结果表明,我们的SkySeg框架将推理延迟加速约3.6倍,将机载分割精度提高5.91%,并在野外环境中实现了10.91%的平均精度增益。

English abstract

The demand for unmanned aerial vehicle (UAV)-based image acquisition and analysis has surged, with UAVs increasingly utilized for semantic segmentation tasks. To meet the real-time analysis requirements of UAV remote sensing missions, performing onboard computation and making decisions based on the results is a natural approach. However, deploying semantic segmentation on resource-constrained UAV platforms presents two significant challenges: 1) hardware constraints limit the ability of UAVs to perform real-time semantic segmentation, and 2) environmental variations during flight cause the data distribution to shift away from the original training data. To address these issues, this paper introduces SkySeg, a heterogeneous multi-UAV air-air cooperation framework that integrates computer vision and flight patterns to enable onboard semantic segmentation using low-cost sensors. SkySeg employs an efficient information fusion inference method, combining low-definition, wide-area images with high-definition, focused-area images. Additionally, it incorporates a cross-device test-time adaptation (TTA) strategy to enhance segmentation performance in dynamic environments by collaboratively addressing distribution shifts of test data streams across UAVs. Experimental results demonstrate that our SkySeg framework speeds up inference by approximately 3.6x, improves onboard segmentation accuracy by 5.91%, and achieves a 10.91% average accuracy gain in the wild.
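One way to picture combining a low-definition wide-area view with a high-definition focused view is sketched below. The abstract does not describe how SkySeg fuses the two, so everything here is an assumption for illustration: the coarse label map is upsampled by nearest neighbour, and the fine label patch simply overwrites the region it covers. The names `upsample_nearest` and `fuse`, and the tiny label grids, are hypothetical.

```python
def upsample_nearest(label_map, factor):
    # Repeat every label factor times along both axes.
    out = []
    for row in label_map:
        wide_row = [label for label in row for _ in range(factor)]
        out.extend(list(wide_row) for _ in range(factor))
    return out

def fuse(wide_labels, focus_labels, top, left, factor):
    # Start from the upsampled coarse prediction, then overwrite the
    # focused region with the fine prediction (assumed fusion rule).
    fused = upsample_nearest(wide_labels, factor)
    for dy, row in enumerate(focus_labels):
        for dx, label in enumerate(row):
            fused[top + dy][left + dx] = label
    return fused

wide = [[0, 0],
        [1, 1]]          # coarse 2x2 label map from the wide-area view
focus = [[2, 2],
         [2, 0]]         # fine 2x2 label patch from the focused view
fused = fuse(wide, focus, top=1, left=1, factor=2)
for row in fused:
    print(row)
```

The overwrite rule is the simplest possible choice; a real system would more likely blend class scores or weight each view by confidence.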