arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31007 2026-06-01 cs.LG cs.AI

DEM: A Distilled Explanation Model for Interpretable Anomaly Detection in Physiological Sensor Networks

DEM:面向生理传感器网络中可解释异常检测的蒸馏解释模型

Jyotirmoy Singh, Anushka Roy, Shreea Bose, Chittaranjan Hota

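AI summary: Proposes DEM, a three-stage glass-box framework that distills the knowledge of a gradient boosting expert into a decision tree operating on residuals relative to a linear baseline, achieving accurate and intrinsically interpretable anomaly detection, and introduces a distillation fidelity metric to quantify how trustworthy the explanations are.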

详情
Comments
21 pages, 10 figures, 7 tables. Code: https://github.com/Jyotirmoy17/dem-model
AI中文摘要

无线体域网(WBANs)中生理传感器数据的异常检测可能由传感器故障、网络中断或数据缺失引起,导致误报。因此,它既需要高预测精度,也需要临床可解释的解释。现有方法要么依赖性能强但无透明度的黑盒模型,要么依赖SHAP和LIME等事后解释方法。本文提出蒸馏解释模型(DEM),一个三阶段玻璃箱框架,将梯度提升专家模型的非线性知识蒸馏到基于线性基线残差的可解释决策树中,使得解释不是近似而是预测本身。DEM引入了一种新颖的蒸馏保真度指标,量化解释树忠实捕捉专家模型非线性贡献的程度,提供了先前可解释模型所缺乏的解释可信度的原则性度量。在包括MIMIC-IV、WESAD、eICU和内部SmartNet WBAN语料库在内的四个生理数据集上评估,DEM在临床上下文异常检测上达到0.9964的AUC,在可穿戴压力检测上达到0.9047,同时以可控深度生成人类可读的if-then规则。推理每1000个样本需要0.17ms,使DEM比基于SHAP的事后解释快1235倍,适用于实时生理监测。消融研究证实,XGBoost蒸馏步骤比朴素残差拟合提供了可测量的增益,深度敏感性分析展示了DEM在现有内在可解释模型中独有的、用户可控的准确性-可解释性权衡。

英文摘要

Anomalies in physiological sensor data from Wireless Body Area Networks (WBANs) can be caused by sensor faults, network disruptions, or missing data, leading to false alarms. Hence, anomaly detection demands both high predictive accuracy and clinically interpretable explanations. Existing approaches rely either on black-box models that achieve strong performance but offer no transparency, or on post-prediction explanation methods such as SHAP and LIME. In this paper, we propose the Distilled Explanation Model (DEM), a three-stage glass-box framework that distills the non-linear knowledge of a gradient boosting expert into an interpretable decision tree operating on residuals relative to a linear baseline, so that the explanation is not an approximation but the prediction itself. DEM introduces a novel distillation fidelity metric that quantifies how faithfully the explanation tree captures the expert model's non-linear contribution, providing a principled measure of explanation trustworthiness absent from prior interpretable models. Evaluated across four physiological datasets (MIMIC-IV, WESAD, eICU, and an in-house SmartNet WBAN corpus), DEM achieves an AUC of 0.9964 on clinical contextual anomaly detection and 0.9047 on wearable stress detection while producing human-readable if-then rules at a controllable depth. Inference requires 0.17 ms per 1000 samples, rendering DEM 1235x faster than SHAP-based post-hoc explanation and suitable for real-time physiological monitoring. Ablation studies confirm that the XGBoost distillation step provides measurable gains over naive residual fitting, and depth-sensitivity analysis demonstrates an explicit, user-controlled accuracy-interpretability trade-off unique to DEM among existing intrinsically interpretable models.

2605.31005 2026-06-01 cs.LG

Learning Multi-Agent Coordination via Sheaf-ADMM

通过 Sheaf-ADMM 学习多智能体协调

Jeffrey Seely, Bartłomiej Cupiał, Llion Jones

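AI summary: Proposes a differentiable optimization framework that uses cellular sheaves and ADMM for multi-agent coordination, validates it on maze pathfinding, image classification, and Sudoku, and shows better interpretability and robustness than standard message-passing architectures.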

详情
Comments
17 pages, 8 figures, 6 tables. Accepted at ICML 2026
AI中文摘要

我们提出了一种用于多智能体协调的可微优化框架。输入被分解为重叠的局部视图,每个视图由一个智能体处理,该智能体求解由神经编码器参数化的凸子问题。智能体通过交替方向乘子法(ADMM)进行协调,其中智能体间的约束由细胞层(cellular sheaf)指定。该层指定了相邻解必须在哪些方面达成一致,从而允许异构的全局共识概念。通过展开的优化进行反向传播,联合训练多智能体系统的所有组件。我们在迷宫路径规划、图像分类和数独任务上进行了评估,在这些任务中,局部视图单独不足的智能体学会了协调以产生正确的全局输出。在MNIST上,相对于标准CNN,局部视图分解提高了对分布偏移的鲁棒性。在数独上,优化导出的结构比参数匹配的MPNN基线产生了显著更高的求解率。最后,ADMM结构暴露了不同的原始、共识和对偶状态变量,使得协调动态可以直接分析和干预——这是标准消息传递架构所不具备的特性。

英文摘要

We present a differentiable optimization framework for multi-agent coordination. An input is decomposed into overlapping local views, each processed by an agent that solves a convex subproblem parameterized by a neural encoder. Agents coordinate through the Alternating Direction Method of Multipliers (ADMM) with inter-agent constraints specified by a cellular sheaf. The sheaf specifies which aspects of neighboring solutions must agree, allowing for heterogeneous notions of global consensus. Backpropagating through the unrolled optimization jointly trains all components of the multi-agent system. We evaluate on maze pathfinding, image classification, and Sudoku, where agents with individually insufficient local views learn to coordinate to produce correct global outputs. On MNIST, the local-view decomposition yields improved robustness to distribution shifts relative to a standard CNN. On Sudoku, the optimization-derived structure yields markedly higher solve rates than parameter-matched MPNN baselines. Finally, the ADMM structure exposes distinct primal, consensus, and dual state variables, opening the coordination dynamics to direct analysis and intervention -- a property unavailable in standard message-passing architectures.
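The primal, consensus, and dual variables mentioned in the abstract can be made concrete with a minimal consensus-ADMM sketch. This is not the paper's method. It assumes the simplest possible agreement constraint (every agent must agree on one shared scalar, rather than a general cellular sheaf) and fixed quadratic local objectives in place of subproblems parameterized by a neural encoder. The function name `consensus_admm` and the parameters `rho` and `iters` are illustrative.

```python
def consensus_admm(targets, rho=1.0, iters=200):
    """Scaled-form consensus ADMM for: minimize sum_i 0.5 * (x_i - a_i)^2  s.t.  x_i = z.

    targets: the local preferences a_i, one per agent (hypothetical stand-ins
    for what each agent infers from its local view).
    Returns the primal variables x, the consensus variable z, and the scaled duals u.
    """
    n = len(targets)
    x = [0.0] * n  # primal state: each agent's local solution
    u = [0.0] * n  # dual state: each agent's accumulated disagreement
    z = 0.0        # consensus state shared by all agents
    for _ in range(iters):
        # Local step: closed-form minimizer of 0.5*(x - a)^2 + (rho/2)*(x - z + u)^2.
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # Consensus step: average of the agents' proposals.
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # Dual step: accumulate the remaining constraint violation.
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return x, z, u


x, z, u = consensus_admm([1.0, 2.0, 6.0])
```

For this toy problem the consensus value converges to the mean of the targets, and the dual variables record how far each agent's preference sits from the agreed value, which is the kind of inspectable state the abstract refers to.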

2605.31001 2026-06-01 cs.CV

Iterative Framework For Data Augmentation Of Segmented Fingerprints

分割指纹数据增强的迭代框架

João Leonardo H. D. Agnol, Wesley Augusto de Bona, Erick Oliveira Rodrigues, Luiz Fernando Puttow Southier, Jefferson Oliva, Marcelo Filipak, Dalcimar Casanova

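AI summary: To address the scarcity of infant fingerprint data, proposes an iterative data augmentation method that generates diverse variants of segmented fingerprints by inducing errors in a convolutional neural network trained to extract fingerprint ridges and valleys; experiments show that the method expands fingerprint variability while retaining visual similarity.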

详情
Journal ref
Anais do XV Workshop de Sistemas de Informação 2024
AI中文摘要

婴儿生物识别由于婴儿与成人之间的生理差异而面临独特挑战,加上可用于研究的数据稀缺,限制了稳健匹配系统的发展。本文提出一种新颖的数据增强方法,使用迭代技术通过在训练用于提取指纹脊线和谷线的卷积神经网络中引入错误,生成分割指纹的多样化变体。在真实婴儿指纹上的实验证明了该方法在扩展指纹变异性方面的有效性,增强后的指纹在细节计数上表现出显著波动,同时仍保持与原始指纹的视觉相似性。研究还强调了该方法在应用不同程度变化到指纹分割方面的可定制性。未来研究包括使用所提框架增强的数据集训练分割和匹配神经网络。

英文摘要

Infant biometrics presents unique challenges due to the physiological differences between infants and adults, compounded by the scarcity of available data for research that limits the development of robust matching systems. This paper proposes a novel data augmentation method that uses iterative techniques to generate diverse variants of segmented fingerprints by inducing errors in a convolutional neural network trained to extract fingerprint ridges and valleys. Experiments on real infant fingerprints demonstrate the method's effectiveness in expanding fingerprint variability, with augmentations exhibiting significant fluctuations in minutiae counts while still retaining visual similarity to the originals. The study also highlights the method's customizable nature for applying varying levels of changes to fingerprint segmentations. Future research includes training segmentation and matching neural networks using datasets augmented by the proposed framework.

2605.31000 2026-06-01 cs.NI cs.LG

HetCCL: Enabling Collective Communication For Mixed-Vendor Heterogeneous Clusters

HetCCL:实现混合供应商异构集群的集体通信

Yuejie Wang, Tao Chang, Yuanyuan Zhao, Yulong Ao, Zeyu Gu, Zhiyu Li, Yanmin Jia, Yan Zhang, Mingjun Zhang, He Liu, Yongzhe He, Yonghua Lin, Guyue Liu

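AI summary: Proposes HetCCL, a framework that uses efficient P2P transport and a border-communicator mechanism to enable cross-vendor collective communication in heterogeneous clusters, eliminating host-device memory copy overhead and optimizing bandwidth utilization.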

详情
AI中文摘要

在异构集群上训练大型语言模型(LLM)给集体通信带来了重大挑战,因为来自多个供应商的硬件引入了多样化的网络和计算特性。现有的为同构环境设计的集体通信框架(如NCCL、RCCL)无法处理混合硬件设置,而支持异构的通信库(如Gloo、OpenMPI)在数据路径中引入了大量开销。本文提出了HetCCL,一个通过跨异构设备(如GPU)的高效P2P传输实现异构集体通信的框架,消除了主机-设备内存拷贝开销,同时将控制卸载到CPU。对于组合集体(如AllReduce、ReduceScatter),HetCCL引入了一种边界通信器机制,通过使用供应商集体通信库中组合集体的内在归约来实现供应商独立性。凭借高效的异构P2P传输和可移植的归约机制,HetCCL提出了异构集群的层次拓扑抽象,将集体通信分解为集群级原语,保证了最优的跨集群数据传输量和最优的带宽利用率。我们实现了支持4种不同供应商的HetCCL,并在4种异构设置下使用基准测试和端到端LLM任务进行了评估。评估结果表明,在异构通信中,HetCCL的带宽比Gloo高17-19倍,并且在端到端训练中每步时间加速高达16.9%。

英文摘要

Training Large Language Models (LLMs) on heterogeneous clusters presents significant challenges for collective communication, as hardware from multiple vendors introduces diverse network and computational characteristics. Existing collective communication frameworks (e.g., NCCL, RCCL) designed for homogeneous environments fail to address mixed-hardware setups, while communication libraries with heterogeneous support (e.g., Gloo, OpenMPI) incur heavy overhead in the data path. This paper presents HetCCL, a framework that enables heterogeneous collective communication through efficient P2P transport across heterogeneous devices (e.g., GPUs), eliminating the host-device memory copy overhead while offloading the control to the CPUs. For combining collectives (e.g., AllReduce, ReduceScatter), HetCCL introduces a border-communicator mechanism that achieves vendor independence by using the intrinsic reduction in the combining collectives in vendor collective communication libraries. With efficient heterogeneous P2P transport and a portable reduction mechanism, HetCCL proposes a hierarchical topology abstraction for heterogeneous clusters, dissecting collective communication into cluster-level primitives that guarantee optimal cross-cluster data transfer volume and optimal bandwidth utilization. We implement HetCCL with support for 4 different vendors and evaluate it in 4 heterogeneous settings with benchmarks and end-to-end LLM tasks. Our evaluation shows that HetCCL achieves 17-19x higher bandwidth than Gloo in heterogeneous communications, and speeds up end-to-end training by up to 16.9% in per-step time.

2605.30997 2026-06-01 stat.ML cs.LG

Hedging on the Frontier: Learning New Tasks with Few Samples

前沿对冲:基于少量样本学习新任务

Tobias Wegel, Federico Di Gennaro, Geelon So, Fanny Yang

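AI summary: For new tasks with few samples, exploits a weak monotonicity assumption and hedges on the model frontier within transfer learning and model selection aggregation to obtain provable statistical gains.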

详情
AI中文摘要

当学习者面临少量样本的新任务时,必须利用任何可用的辅助信息。在实践中,这通常以公共基准中相关任务的模型评估形式出现。一个关键问题是如何对任务相关性进行建模,使其既现实又能从基准评估中获得可证明的收益。经验上,我们观察到弱单调性通常近似满足:如果一个模型在许多基准上占优,那么它在新任务上也往往表现更好。我们探索了在(近似)弱单调性下学习的统计复杂性,并在两种学习范式(迁移学习和模型选择聚合)中利用它。我们表明,不仅可以根据单调性剪枝模型类,还可以通过在前沿进行对冲来进一步适应可用权衡的几何结构。

英文摘要

When a learner faces a new task with few samples, it must leverage any available side information. In practice, this often comes in the form of model evaluations on related tasks in public benchmarks. A key question then is how to model task relatedness such that it is both realistic and the benchmark evaluations lead to provable gains. Empirically, we observe that weak monotonicity is often approximately satisfied: if a model dominates another on many benchmarks, it also tends to outperform on the new task. We explore the statistical complexity of learning under (approximate) weak monotonicity, leveraging it within two learning paradigms: transfer learning and model selection aggregation. We show that not only can we prune the model class based on monotonicity, but we can also further adapt to the geometry of the available trade-offs by hedging on the frontier.
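The idea of pruning the model class by monotonicity can be illustrated with a small sketch. It makes two simplifying assumptions that are not taken from the paper: dominance means strict Pareto dominance on every benchmark (stronger than "dominates on many benchmarks"), and the hedge over the surviving models is a uniform weighting used purely as a placeholder, not the paper's aggregation rule. All names and scores below are hypothetical.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores):
    """Return the names of models that no other model dominates (higher scores are better)."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    )


def hedge_weights(frontier):
    """Placeholder hedge: spread weight uniformly over the frontier models."""
    return {name: 1.0 / len(frontier) for name in frontier}


# Hypothetical benchmark accuracies for four candidate models on two benchmarks.
scores = {
    "A": (0.9, 0.6),
    "B": (0.7, 0.8),
    "C": (0.6, 0.5),  # dominated by A
    "D": (0.9, 0.5),  # dominated by A
}
frontier = pareto_frontier(scores)  # ["A", "B"]
weights = hedge_weights(frontier)   # {"A": 0.5, "B": 0.5}
```

Models A and B trade off against each other across the two benchmarks, so neither can be pruned; with few samples on the new task, spreading weight over such a frontier is one simple way to avoid committing to the wrong side of the trade-off.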

2605.30992 2026-06-01 cs.LG

Eigenvectors of Experts are Training-free Non-collapsing Routers

专家特征向量是无需训练的非崩溃路由器

Giang Do, Hung Le, Truyen Tran

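AI summary: To address expert collapse in sparse mixture-of-experts models, proposes SSMoE, a training-free routing framework based on the eigenvectors of expert weight matrices, which uses singular value decomposition to exploit spectral properties and improve model performance.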

详情
Journal ref
ICML 2026
Comments
24 pages
AI中文摘要

稀疏混合专家(SMoE)架构通过将输入令牌路由到选定的专家子集来提高大型语言模型(LLMs)的训练效率。尽管取得了显著成功,SMoE模型在训练和推理中仍面临专家崩溃问题(Chi等人,2022),这会降低模型性能。先前研究主要关注改进路由器;然而,这些方法依赖于从头训练或微调,需要高昂的计算和数据处理成本。此外,我们通过理论和实证结果证明,尽管有这些努力,在推进预训练良好的SMoE模型时,该问题仍然存在。为填补这一空白,我们分析了先进的SMoE模型,观察到专家权重矩阵的特征向量编码了丰富的语义信息,指向传统路由策略的有效替代方案。基于这一见解,我们提出了奇异值分解SMoE(SSMoE),一种新颖且无需训练的框架,利用专家权重的谱特性来解决崩溃问题并提升模型性能。在多种语言和视觉任务上的大量实验,包括干净和损坏数据设置,证明了SSMoE的强大泛化能力和鲁棒性。我们的发现强调了更深入理解模型内部结构如何指导开发更有效的SMoE架构。我们的实现已在https://github.com/giangdip2410/SSMoE公开。

英文摘要

Sparse Mixture of Experts (SMoE) architectures improve the training efficiency of Large Language Models (LLMs) by routing input tokens to a selected subset of specialized experts. Despite their remarkable success, both training and inference in SMoE models suffer from the expert collapse issue (Chi et al., 2022), which degrades model performance. Prior studies primarily focus on improving the router; however, such methods rely on training from scratch or fine-tuning, which requires high computational and data-processing costs. Furthermore, we demonstrate that, despite these efforts, the issue persists when advancing well-pretrained SMoE models, as evidenced by both theoretical and empirical results. To fill that gap, we analyze the advanced SMoE models and observe that the eigenvectors of expert weight matrices encode rich semantic information, pointing to an effective alternative to conventional routing strategies. Building on this insight, we propose Singular Value Decomposition SMoE (SSMoE), a novel and training-free framework that leverages spectral properties of the expert weights to address the collapse issue and enhance model performance. Extensive experiments across diverse language and vision tasks, under both clean and corrupt data settings, demonstrate the strong generalization and robustness of SSMoE. Our findings highlight how a deeper understanding of model internals can guide the development of more effective SMoE architectures. Our implementation is publicly available at https://github.com/giangdip2410/SSMoE.

2605.30991 2026-06-01 cs.LG cs.CV

Parallel Tempering Initial Sampling in Inference-Time Reward Alignment

推理时奖励对齐中的并行回火初始采样

Myeongjun Oh, Gwangho Kim, Sungyoon Lee

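AI summary: Standard SMC methods for inference-time reward alignment get trapped in local modes because of their initial sampling; proposes PATHS, a parallel-tempering method that couples multiple tempered chains to explore efficiently and improve alignment quality.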

详情
Comments
31 pages, 11 figures
AI中文摘要

推理时奖励对齐无需重新训练即可引导预训练的扩散和基于流的生成模型满足用户指定的奖励。最近,序贯蒙特卡洛(SMC)通过迭代过滤和传播多个粒子成为该任务的有力框架。然而,我们表明基于SMC的标准方法通常性能不佳,因为它们从标准先验初始化粒子,而复杂奖励景观中的高奖励区域极为罕见。此外,我们表明即使最近的奖励感知初始采样方法仍然容易陷入局部模式,因为复杂奖励景观通常是多模态的。为克服这些限制,我们提出PATHS(用于高复杂度奖励采样的并行回火),一种通过并行回火耦合多个采样链的新型初始化方法。PATHS维护一个奖励回火链的阶梯,并定期执行Metropolis交换,从而在平坦化的奖励景观中实现高效探索,缓解模式陷阱问题。我们的分析表明,该机制显著增强了有限预算下对通常难以采样的罕见高奖励区域的探索。在布局到图像和数量感知生成上的实验表明,PATHS在对齐质量上取得了一致的提升,尤其是在复杂提示上。

英文摘要

Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this task by iteratively filtering and propagating multiple particles. However, we show that standard SMC-based methods often suffer from poor performance because they initialize particles from a standard prior, whereas high-reward regions in complex reward landscapes are extremely rare. Further, we show that even recent reward-aware initial sampling approaches remain vulnerable to getting trapped in local modes, as complex reward landscapes are often multi-modal. To overcome these limitations, we propose PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method that couples multiple sampling chains through parallel tempering. PATHS maintains a ladder of reward-tempered chains and periodically performs Metropolis swaps, enabling efficient exploration across flattened reward landscapes, thereby mitigating the mode-trapping issues. Our analysis reveals that this mechanism substantially enhances the finite-budget exploration of rare, high-reward regions that are typically challenging to sample. Experiments on layout-to-image and quantity-aware generation show that PATHS achieves consistent gains in alignment quality, particularly on complex prompts.
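The ladder of reward-tempered chains with periodic Metropolis swaps can be sketched in one dimension. This is a generic parallel-tempering illustration, not the PATHS implementation. It assumes a standard normal prior over a scalar "initial noise" `x`, a hypothetical bimodal `reward` function, and a tempered target proportional to `prior(x) * exp(beta * reward(x))`. The ladder values, step size, and swap schedule are arbitrary choices.

```python
import math
import random


def reward(x):
    # Hypothetical bimodal reward: a lower peak near x = -3 and a higher peak near x = +3.
    return 2.0 * math.exp(-((x + 3.0) ** 2)) + 4.0 * math.exp(-((x - 3.0) ** 2))


def log_target(x, beta):
    # Log density (up to a constant) of a standard normal prior tilted by beta * reward.
    return -0.5 * x * x + beta * reward(x)


def swap_log_ratio(beta_i, beta_j, r_i, r_j):
    # Log Metropolis ratio for exchanging the states of two chains; the prior terms cancel.
    return (beta_i - beta_j) * (r_j - r_i)


def parallel_tempering(betas, steps=2000, swap_every=10, step_size=1.0, seed=0):
    rng = random.Random(seed)
    states = [rng.gauss(0.0, 1.0) for _ in betas]  # initialize every chain from the prior
    accepted_swaps = 0
    for t in range(1, steps + 1):
        # Within-chain random-walk Metropolis update at each temperature.
        for k, beta in enumerate(betas):
            proposal = states[k] + rng.gauss(0.0, step_size)
            log_acc = log_target(proposal, beta) - log_target(states[k], beta)
            if math.log(1.0 - rng.random()) < log_acc:
                states[k] = proposal
        # Periodically propose swaps between adjacent rungs of the ladder.
        if t % swap_every == 0:
            for k in range(len(betas) - 1):
                log_ratio = swap_log_ratio(
                    betas[k], betas[k + 1], reward(states[k]), reward(states[k + 1])
                )
                if math.log(1.0 - rng.random()) < log_ratio:
                    states[k], states[k + 1] = states[k + 1], states[k]
                    accepted_swaps += 1
    return states, accepted_swaps


# The last chain (largest beta) targets the sharpest, most reward-weighted distribution.
states, accepted_swaps = parallel_tempering([0.0, 0.5, 1.0, 2.0])
```

Chains with small `beta` see a nearly flat reward landscape and move freely between modes, while swaps let good states found there migrate toward the high-`beta` chain, which is the mechanism the abstract credits with mitigating mode trapping.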

2605.30989 2026-06-01 cs.RO

A study on a Real-Time VR-Based Teleoperation Framework for Manipulator in Dynamic Environment

动态环境下基于实时VR的机械臂遥操作框架研究

InGyu Choi, GeonYeong Go, SunWoo Ahn, HyoJae Kang, Min-Sung Kang

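AI summary: Proposes a VR teleoperation framework integrating GPU-accelerated inverse kinematics and trajectory optimization, achieving low-latency, collision-aware real-time manipulator control in environments with static and dynamic obstacles.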

详情
Comments
This manuscript has been submitted for possible publication
AI中文摘要

机器人遥操作能够在人类难以直接进入的危险环境中安全、非接触地执行任务,并且随着最近VR技术的发展,其应用范围已经扩大。然而,许多VR遥操作研究主要作为机器人模仿学习的数据收集工具,因此它们通常没有明确处理操作过程中的动态障碍物、工作空间变化或碰撞风险。为了实际部署以保障操作员安全,遥操作必须能够以低延迟响应动态情况,并对经验不足的操作员的错误保持鲁棒性。本文提出了一种VR遥操作框架,支持实时操作,同时处理与静态和移动障碍物的碰撞。该框架在VR界面中集成了GPU加速的逆运动学和轨迹优化,以在机器人约束下在每个控制周期生成可行的关节命令。使用7自由度机械臂进行的实验展示了在无障碍物、静态障碍物和移动障碍物三种场景下的稳定在线行为和碰撞感知运动生成。结果表明,所提出的方法生成的运动与操作员的命令一致,并在障碍物干扰命令路径时产生安全的绕行。

英文摘要

Robot teleoperation enables safe, non-contact task execution in hazardous environments where direct human access is difficult, and its application has expanded with recent VR technologies. Many VR teleoperation studies, however, have primarily served as data-collection tools for robot imitation learning, so they often do not explicitly address dynamic obstacles, workspace changes, or collision risks during operation. For real deployment aimed at operator safety, teleoperation must react to dynamic situations with low latency and remain robust to mistakes made by inexperienced operators. This paper presents a VR teleoperation framework that supports real-time manipulation while handling collisions with both static and moving obstacles. The framework integrates GPU-accelerated inverse kinematics and trajectory optimization within a VR interface to generate feasible joint commands at each control cycle under robot constraints. Experiments with a 7-DoF manipulator demonstrate stable online behavior and collision-aware motion generation across three scenarios: obstacle-free, static-obstacle, and moving-obstacle environments. The results indicate that the proposed approach generates motion consistent with the operator's command while producing safe detours when obstacles interfere with the commanded path.

2605.30987 2026-06-01 cs.CV

Benchmarking Single-Step Inpainting Methods for Multi-Object 3D Gaussian Splatting Scenes

多对象3D高斯泼溅场景的单步修复方法基准测试

Finn Dröge, Cecilia Curreli, Abhishek Saroha, Daniel Cremers

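AI summary: For object removal and inpainting in 3D Gaussian Splatting scenes, compares 2D inpainters on 3D consistency, finding that reconstruction-based inpainters outperform generative diffusion models and that initializing the scene from scratch works better than finetuning the existing scene; also introduces a new multi-object scene with ground truth data.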

详情
Comments
Accepted as an extended abstract to the CVEU Workshop at CVPR 2026
AI中文摘要

对象移除和修复3D高斯泼溅(3DGS)场景面临跨相机视图的3D一致性等挑战。在比较2D修复器及其对3D领域的适用性时,我们发现基于重建的修复器在3D一致性上优于生成扩散模型。将这些2D修复器集成到创建和微调3DGS场景的不同单步方法中,我们的结果表明,从头初始化场景比微调现有场景产生更高质量的结果。使用最先进的生成式2D修复器,我们创建了一个简单的基线,以强调在3D设置中先移除对象再进行修复的重要性。由于360°数据集很少包含真实世界的地面真值,且具有挑战性的遮挡场景同样稀少,我们引入了一个新的多对象场景,其中包含记录的地面真值数据和多个存在对象遮挡的视图。

英文摘要

Object removal and inpainting in 3D Gaussian Splatting (3DGS) scenes face challenges such as maintaining 3D consistency across camera views. In comparing 2D inpainters and their suitability for the 3D domain, we find that reconstruction-based inpainters outperform generative diffusion models in 3D consistency. Integrating these 2D inpainters into different single-step methods for creating and finetuning 3DGS scenes, our results indicate that initializing the scene from scratch produces higher quality results than finetuning the existing scene. Using a state-of-the-art generative 2D inpainter, we create a straightforward baseline to underline the importance of object removal before inpainting in the 3D setting. Since 360° datasets rarely include real-world ground truths, and challenging occlusion scenarios are equally sparse, we introduce a novel multi-object scene with recorded ground truth data and many views with object occlusions.

2605.30984 2026-06-01 cs.CV cs.AI cs.CL

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

生成报告还是重复模板?测量和缓解三维CT报告生成中的模板崩溃

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani, Benedikt Wiestler, Christian Wachinger

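AI summary: To address template collapse in 3D CT report generation, where models show low output diversity and poor pathology detection, proposes CLarGen, a decoupled framework that separates clinical detection from language synthesis, substantially improving clinical accuracy while keeping reports fluent.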

详情
AI中文摘要

现代三维医学视觉语言模型(VLM)能够生成流畅的放射学风格文本,但表现出极低的病理检测率和输出多样性,崩溃为低估罕见但关键发现的通用模板。我们将这种失败模式识别为模板崩溃。这种失败源于三维医学成像的独特限制,例如数据有限、标签严重不平衡以及体积编码器的弱信号。在这些限制下,文本生成目标鼓励捷径学习和流畅但基础薄弱的报告。我们通过临床保真度、输出多样性、正常模板偏差和罕见发现存活率系统性地诊断模板崩溃。为了缓解它,我们提出CLarGen,一个解耦框架,将说什么(临床检测)与怎么说(语言合成)分开。CLarGen使用(i)用于多标签病理检测的潜在查询变换器,(ii)用于临床匹配示例的病理引导检索,以及(iii)用于从检测到的发现和检索到的上下文中合成最终报告的医学语言模型。在最新的三维CT报告生成基线中,CLarGen缓解了模板崩溃,并在保持流畅报告的同时显著提高了临床准确性(macro-F1 0.487 vs. 0.189;CRG 0.472 vs. 0.368)。我们的结果表明,明确、可测量的临床基础对于抗模板崩溃的三维CT报告生成至关重要。代码将在接收后发布。

英文摘要

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.

2605.30983 2026-06-01 cs.CV

Can BEV Perception Gracefully Degrade under Sensor Failures?

BEV感知能否在传感器故障下优雅降级?

Haifa Zhang, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo

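AI summary: To address the sharp performance drop of multi-modal BEV perception under sensor corruption, proposes Grace-BEV, a framework that achieves graceful degradation through active reliability assessment and dynamic feature recalibration, restoring mAP from 0.0% to as high as 34.7% under extreme LiDAR failure.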

详情
AI中文摘要

尽管多模态鸟瞰图(BEV)感知在自动驾驶中取得了显著成功,但现有系统存在一个关键脆弱性:现有融合机制对传感器损坏高度敏感,常导致灾难性性能下降。这种脆弱性主要源于标准融合框架通常以静态方式集成多模态表示,导致在缺失或损坏模态下性能急剧崩溃。相比之下,我们表明通过主动模态可靠性评估可以实现优雅降级。为此,我们提出Grace-BEV,一个轻量级即插即用框架,在多模态融合过程中强制引入主动可靠性感知。Grace-BEV不依赖计算昂贵的跨模态交互,而是利用对齐的BEV空间通过TrustGate路由器显式评估模态可信度,并使用FailSafe融合块动态重新校准特征集成。此外,我们设计了带模态丢弃的三阶段训练策略,以防止模态主导并鼓励在不可靠输入下进行平衡的跨模态学习。在nuScenes-R和nuScenes-C上的大量实验表明,Grace-BEV在各种损坏设置下保持稳健性能。值得注意的是,在标准基线崩溃至0.0%平均精度(mAP)的灾难性LiDAR故障下,Grace-BEV将性能恢复至高达34.7% mAP。此外,它将干净准确率提升高达1.4%,实现了鲁棒性与效率之间的强权衡。

英文摘要

Despite the remarkable success of multi-modal bird's-eye view (BEV) perception in autonomous driving, current systems exhibit a critical vulnerability: existing fusion mechanisms are highly brittle to sensor corruptions, often causing catastrophic performance degradation. This vulnerability largely stems from the fact that standard fusion frameworks typically integrate multi-modal representations in a static manner, leading to a precipitous performance collapse under missing or corrupted modalities. In contrast, we show that graceful degradation is achievable through active modality reliability assessment. To this end, we present Grace-BEV, a lightweight and plug-and-play framework that enforces active reliability awareness during multi-modal fusion. Instead of relying on computationally expensive cross-modal interactions, Grace-BEV leverages the aligned BEV space to explicitly assess modality trustworthiness via a TrustGate Router and dynamically recalibrate feature integration using the FailSafe Fusion Block. Furthermore, we devise a Three-Phase Training strategy with Modality Dropout to prevent modality dominance and encourage balanced cross-modal learning under unreliable inputs. Extensive experiments on nuScenes-R and nuScenes-C show that Grace-BEV maintains robust performance across diverse corruption settings. Notably, under catastrophic LiDAR failures where standard baselines collapse to 0.0% mean Average Precision (mAP), Grace-BEV restores performance to as high as 34.7% mAP. Moreover, it improves clean accuracy by up to 1.4%, achieving a strong trade-off between robustness and efficiency.

2605.30981 2026-06-01 cs.CL cs.LG

Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement

自回归Transformer中的认知疲劳:形式化与测量

Riju Marwah, Ritvik Garimella, Vishal Pallagani, Atishay Jain, Michael Stewart, Amit Sheth

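AI summary: Formalizes the degradation of autoregressive language models during long-horizon generation as cognitive fatigue and proposes the Fatigue Index (FI), a lightweight diagnostic that aggregates three signals (attention decay, representational drift, and entropy calibration) for real-time monitoring; experiments show that FI predicts task degradation and repetition with high accuracy.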

详情
Comments
9 pages, 7 figures. Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

自回归语言模型在长程生成过程中经常退化,产生重复文本、失去指令遵循能力并表现出不稳定的熵。尽管这些失败普遍存在,但从业者缺乏在线诊断工具来实时检测它们。我们将这种退化形式化为认知疲劳,这是一种可测量的生成时状态,其特征是对原始提示的注意力衰减、表征漂移和熵校准错误。我们引入了疲劳指数(FI),这是一种轻量级、模型无关的诊断方法,在明确的公理(单调性、有界性、可解释性)下聚合这三个信号,从而实现可靠的运行时监控。在九个模型(1B-13B参数)上,FI轨迹表现出结构化的时间动态,预测任务退化(AUROC = 0.95)和重复(Spearman rho = 0.94),并揭示了非单调的缩放行为:低于3B的指令微调模型比基础模型退化更快,而在7B时这一趋势逆转。压力分析进一步表明,在更长的上下文、中间位置的证据和降低的数值精度下,FI onset加速。这些结果确立了认知疲劳作为一个连贯且可测量的现象,并将FI定位为生产级LLM系统中运行时可靠性监控的原则性工具。

英文摘要

Autoregressive language models frequently degrade during long-horizon generation, producing repetitive text, losing instruction adherence, and exhibiting unstable entropy. Despite the prevalence of these failures, practitioners lack online diagnostics to detect them in real time as they occur. We formalize this degradation as cognitive fatigue, a measurable generation-time state characterized by decay in attention to the original prompt, representational drift, and entropy miscalibration. We introduce the Fatigue Index (FI), a lightweight, model-agnostic diagnostic that aggregates these three signals under explicit axioms (monotonicity, boundedness, interpretability), enabling reliable runtime monitoring. Across nine models (1B-13B parameters), FI trajectories exhibit structured temporal dynamics, predict task degradation (AUROC = 0.95) and repetition (Spearman rho = 0.94), and reveal non-monotonic scaling behavior: instruction-tuned models below 3B exhibit faster collapse than base models, with this trend reversing at 7B. Stress analyses further show that FI onset accelerates under longer contexts, middle-positioned evidence, and reduced numerical precision. These results establish cognitive fatigue as a coherent and measurable phenomenon, and position FI as a principled tool for runtime reliability monitoring in production LLM systems.

2605.30976 2026-06-01 stat.ML cs.IT cs.LG math.IT

Batched Stochastic Linear Bandits with 1-Bit Communication Constraints

具有1比特通信约束的批量随机线性赌博机

Ivan Lau, Daniel McMorrow, Kevin Jamieson, Jonathan Scarlett

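AI summary: Studies regret minimization for stochastic linear bandits under communication constraints with batch size B and only one bit of feedback per batch, and proposes two phased-elimination algorithms based on G-optimal designs and 1-bit mean estimation that achieve regret close to the optimum for unconstrained linear bandits.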

详情
AI中文摘要

我们研究了在批处理和通信约束的自然组合下的随机线性赌博机:时间范围被划分为大小相等的批次$B$,在每个批次中,学习器向一个智能体发送$B$个请求的臂拉动,智能体观察相应的$B$个奖励,并用单个比特的反馈回复学习器。对于每个批次,学习器指定智能体使用的1比特量化规则,该规则可能依赖于所有先前接收到的比特,但不直接依赖于任何过去的奖励。这一设置解决了先前模型(仅有每轮量化或仅有总比特预算)之间一个显著但尚未探索的“中间地带”。我们建立了一个极小极大下界,表明由于1比特通信瓶颈,即使在没有噪声的情况下,$\Omega(B\min\{d,\log\lvert \mathcal{A} \rvert\})$的遗憾也是不可避免的。结合标准的统计极限,这给出了一个通用的下界$\widetilde{\Omega}(B\min\{d,\log\lvert \mathcal{A} \rvert\} + \sqrt{dT \min\{d,\log\lvert \mathcal{A} \rvert\}})$。我们开发了两种基于$G$-最优设计和1比特均值估计的相位消除算法。第一种算法实现了$\widetilde{O}(dB + d\sqrt{T})$的遗憾,当$\lvert \mathcal{A} \rvert = \exp(\Omega(d))$时,该下界在对数因子内匹配;第二种算法结合了安全臂识别和热启动过程,获得了$\widetilde{O}(B\log\lvert \mathcal{A} \rvert + d^{3/2}\sqrt{B} + \sqrt{dT\log\lvert \mathcal{A} \rvert})$的遗憾,在$(\lvert \mathcal{A} \rvert, B, d, T)$的广泛缩放范围内接近最优。总之,我们的结果表明,每批仅需一个比特的反馈就足以在广泛的缩放范围内几乎匹配无约束线性赌博机的极小极大遗憾,即使对于$\Theta(\sqrt{T})$这样大的批量大小也是如此。

英文摘要

We study stochastic linear bandits under a natural combination of batching and communication constraints: the time horizon is partitioned into batches of equal size $B$, and during each batch the learner sends $B$ requested arm pulls to an agent, who then observes the corresponding $B$ rewards and responds with a single bit of feedback to the learner. For each batch, the learner specifies the 1-bit quantization rule the agent uses, which may depend on all previously received bits but not on any past rewards directly. This setting addresses a significant yet unexplored "middle ground" between previous models having per-round quantization only or total bit budgets only. We establish a minimax lower bound showing that $\Omega(B\min\{d,\log\lvert \mathcal{A} \rvert\})$ regret is unavoidable due to the 1-bit communication bottleneck, even in the absence of noise. Combined with standard statistical limits, this yields a general lower bound of $\widetilde{\Omega}(B\min\{d,\log\lvert \mathcal{A} \rvert\} + \sqrt{dT \min\{d,\log\lvert \mathcal{A} \rvert\}})$. We develop two phased-elimination algorithms based on $G$-optimal designs and 1-bit mean estimation. The first achieves $\widetilde{O}(dB + d\sqrt{T})$ regret, matching the lower bound up to logarithmic factors when $\lvert \mathcal{A} \rvert = \exp(\Omega(d))$, and the second incorporates a safe-arm identification and warm-start procedure to obtain $\widetilde{O}(B\log\lvert \mathcal{A} \rvert + d^{3/2}\sqrt{B} + \sqrt{dT\log\lvert \mathcal{A} \rvert})$ regret, which is near-optimal in broad scaling regimes of $(\lvert \mathcal{A} \rvert, B, d, T)$. Together, our results demonstrate that a single bit of feedback per batch suffices to nearly match the minimax regret of unconstrained linear bandits in broad scaling regimes, even for batch sizes as large as $\Theta(\sqrt{T})$.
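A short worked substitution, using only the bounds quoted in the abstract, may help show why batch sizes up to order $\sqrt{T}$ cost nothing beyond logarithmic factors in the large-action-set regime. The only assumption added here is the reading that $\lvert \mathcal{A} \rvert = \exp(\Omega(d))$ makes $\min\{d,\log\lvert \mathcal{A} \rvert\}$ of order $d$.

```latex
% Regime: |A| = exp(Omega(d)), so min{d, log|A|} = Theta(d).
\begin{align*}
\text{Lower bound:}\quad
  & \widetilde{\Omega}\bigl(Bd + \sqrt{dT \cdot d}\bigr)
    = \widetilde{\Omega}\bigl(Bd + d\sqrt{T}\bigr), \\
\text{First algorithm:}\quad
  & \widetilde{O}\bigl(dB + d\sqrt{T}\bigr), \\
B = \Theta(\sqrt{T}) \;\Longrightarrow\quad
  & dB = \Theta\bigl(d\sqrt{T}\bigr),
    \text{ so both bounds reduce to } \widetilde{\Theta}\bigl(d\sqrt{T}\bigr).
\end{align*}
```

In this regime the communication term $dB$ is of the same order as the statistical term $d\sqrt{T}$, so the 1-bit constraint does not change the rate.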

2605.30972 2026-06-01 cs.CV

BiSegMamba: Efficient Bidirectional Tri-Oriented Mamba for 3D Medical Image Segmentation

BiSegMamba: 用于3D医学图像分割的高效双向三向Mamba

Bakht Zada, Chao Tong, Qile Su, Shuai Zhang

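AI summary: Proposes BiSegMamba, an efficient 3D medical image segmentation network based on a bidirectional tri-oriented Mamba, which uses a progressive compacting stem, a multi-scale spatial mixer, bidirectional tri-oriented Ortho Mamba blocks, and adaptive directional fusion to improve segmentation accuracy while reducing computational cost.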

详情
Comments
10 pages, 7 figures, 5 tables. Code is available at: https://github.com/bakhtzadaabshare/BiSegMamba
AI中文摘要

精确的3D医学图像分割需要长程体积上下文和精细边界保持。基于CNN的方法全局依赖建模有限,而基于Transformer的模型对于密集3D输入通常计算成本高昂。最近的基于Mamba的方法提供了一种高效替代方案,但现有的体积设计仍依赖于重复的高分辨率扫描、仅前向的顺序建模和固定的方向求和,导致高成本、扫描顺序偏差和次优的方向聚合。我们提出BiSegMamba,一种用于3D医学图像分割的高效双向三向Mamba网络。BiSegMamba遵循紧凑到细节的设计,其中渐进压缩主干(PCS)能够进行高效的潜在空间推理,同时保留浅层高分辨率特征用于重建。多尺度空间混合器(MSSM)在早期阶段捕获局部解剖模式,而提出的双向三向正交Mamba(Bi-ToOM)块使用联合处理的前向和后向扫描序列,从多个正交视图建模长程依赖。自适应方向融合(ADF)学习跨扫描方向的输入相关通道权重,用方向感知融合替代固定求和。在收集的颈动脉CTA数据集和三个公共基准BraTS2023、ACDC和AMOS-CT上的实验表明,BiSegMamba在血管、心脏、脑肿瘤和腹部多器官分割任务中具有良好的泛化能力。与SegMamba-V2相比,BiSegMamba在BraTS2023上性能略有提升,在ACDC和颈动脉数据集上显著改进,同时计算成本降低高达77.9% FLOPs,展示了在通用3D医学图像分割中强大的精度-效率平衡。

英文摘要

Accurate 3D medical image segmentation requires both long-range volumetric context and fine boundary preservation. CNN-based methods have limited global dependency modeling, while Transformer-based models are often computationally expensive for dense 3D inputs. Recent Mamba-based methods provide an efficient alternative, but existing volumetric designs still depend on repeated high-resolution scanning, forward-only sequential modeling, and fixed directional summation, causing high cost, scan-order bias, and suboptimal directional aggregation. We propose BiSegMamba, an efficient bidirectional tri-oriented Mamba network for 3D medical image segmentation. BiSegMamba follows a compact-to-detail design, where a progressive compacting stem (PCS) enables efficient latent-space reasoning while retaining shallow high-resolution features for reconstruction. A multi-scale spatial mixer (MSSM) captures local anatomical patterns in early stages, and the proposed bidirectional tri-oriented Ortho Mamba (Bi-ToOM) block models long-range dependencies from multiple orthogonal views using jointly processed forward and backward scan sequences. Adaptive directional fusion (ADF) learns input-dependent channel-wise weights across scan orientations, replacing fixed summation with orientation-aware fusion. Experiments on a collected carotid CTA dataset and three public benchmarks, BraTS2023, ACDC, and AMOS-CT, show that BiSegMamba generalizes well across vascular, cardiac, brain tumor, and abdominal multi-organ segmentation tasks. Compared with SegMamba-V2, BiSegMamba achieves slightly better performance on BraTS2023 and clear improvements on ACDC and the carotid dataset, while reducing computational cost by up to 77.9% FLOPs, demonstrating a strong accuracy-efficiency balance for general 3D medical image segmentation.

2605.30969 2026-06-01 cs.CV

Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning

全监督运动编辑:通过正负学习平衡变化与不变性

Zhenwu Shi, Jingyu Gong, Peiwei Wang, Xingzan Wang, Tianwen Qian, Wenxi Li, Yuan Fang, Jiao Xie, Lizhuang Ma, Shaohui Lin

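AI summary: Proposes OmniME, a framework that uses positive-negative learning with retrospective feature supervision, a motion preservation mechanism, and triplet-based semantic alignment to balance change and invariance in motion editing, reaching state-of-the-art performance on the MotionFix and STANCE Adjustment datasets.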

详情
AI中文摘要

基于文本的人体运动编辑旨在根据自然语言指令修改现有运动序列,同时保持原始运动的一致性。现有的基于扩散的方法通常依赖启发式相似性线索或粗糙的全局条件,导致运动失真和次优的语义对齐。关键挑战在于平衡变化(即精确编辑目标区域)和不变性(即保留未编辑部分)。为应对这一挑战,我们提出了一个全监督正负学习框架,名为OmniME。我们的方法集成了三个互补组件:(1)回顾特征监督,在Transformer层之间强制执行从粗到细的一致性;(2)运动保持机制,根据源-目标相似性关注细微变化;(3)基于三元组的语义对齐,增强文本-运动对应关系。这些组件共同形成了一个统一的监督范式,平衡变化与不变性。在MotionFix和STANCE Adjustment数据集上的大量实验表明,OmniME在编辑对齐方面达到了最先进的性能,验证了我们统一学习框架的有效性。我们的源代码和模型已发布在:https://github.com/rocket-ycyer/OmniME.git

英文摘要

Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e., precisely editing target regions) and invariance (i.e., preserving unedited parts). To address this challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers, (2) a motion preservation mechanism that focuses on subtle variations according to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that OmniME achieves state-of-the-art performance in editing alignment, validating the effectiveness of our unified learning framework. Our source code and models have been released at: https://github.com/rocket-ycyer/OmniME.git
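The triplet-based semantic alignment mentioned in the abstract can be illustrated with a generic triplet margin loss. This is a minimal sketch of the standard technique, not OmniME's actual objective: the function name, the cosine distance, and the margin value are assumptions chosen for illustration.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Generic triplet loss on cosine distance; inputs have shape (batch, dim).

    Hypothetical reading: the anchor is a text embedding, the positive a matching
    motion embedding, and the negative a mismatched motion embedding.
    """
    def cosine_distance(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return 1.0 - np.sum(a * b, axis=1)

    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    # Penalize triplets where the positive is not closer than the negative by `margin`.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

The loss is zero once every positive is closer to its anchor than the negative by at least the margin, so only violating triplets contribute gradient in a training setting.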

2605.30968 2026-06-01 cs.CV cs.AI

Variational Adapter for Cross-modal Similarity Representation

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye, Huayi Wu

AI summary: Addresses the compression into binary classification boundaries and the false negatives caused by scarce fine-grained annotations in cross-modal matching. Proposes the variational adapter VACSR, which recasts the matching task as a variational inference problem and mitigates overfitting by constructing a latent similarity space with regularization; its effectiveness is validated on image-text retrieval, domain generalization, and base-to-novel generalization.

Comments
Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.

2605.30966 2026-06-01 cs.IR cs.AI cs.CL

Reading Between the Citations: A Typed Claim Network for Scientific Literature

Ning Ding, Sergio J. Rodríguez Méndez, Pouya G. Omran

AI summary: Addresses the fact that existing knowledge graphs ignore citation stance. Proposes reifying cross-document references as a typed claim network with stance labels, builds an instance containing 8,260 claims, and validates it on three tasks: retrieval augmentation, stance summarization, and topological analysis.

Abstract

Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community-level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross-document reference is reified as a typed claim, carrying source, target, claim text, and a four-class stance label grounded in the citation-intent literature. We give a construction pipeline applicable to any corpus of scholarly inter-referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated-stance summarisation, and topological analytics. Head-to-head evaluation against standard Retrieval-Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.
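A typed claim as described in the abstract (source, target, claim text, stance label) maps naturally onto a small record type. The sketch below is illustrative only: the abstract says the stance label has four classes but does not name them, so the `Stance` members, the paper identifiers, and the claim texts are all hypothetical placeholders.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class Stance(Enum):
    # Hypothetical label set; the abstract only states that there are four classes.
    SUPPORTS = "supports"
    CRITIQUES = "critiques"
    EXTENDS = "extends"
    NEUTRAL = "neutral"

@dataclass(frozen=True)
class Claim:
    source: str   # citing document
    target: str   # cited document
    text: str     # what the source says about the target
    stance: Stance

def stance_summary(claims, target):
    """Count how often each stance is taken toward `target` across all claims."""
    return Counter(c.stance for c in claims if c.target == target)

claims = [
    Claim("paper_A", "paper_C", "builds on the encoder of C", Stance.EXTENDS),
    Claim("paper_B", "paper_C", "confirms the results of C", Stance.SUPPORTS),
    Claim("paper_D", "paper_C", "reproduces C's accuracy", Stance.SUPPORTS),
    Claim("paper_B", "paper_A", "mentions A as related work", Stance.NEUTRAL),
]
summary = stance_summary(claims, "paper_C")
```

An untyped citation graph would record three edges into `paper_C` and nothing more; keeping the stance on each edge is what makes the aggregated summary possible.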

2605.30965 2026-06-01 eess.AS cs.AI cs.CL

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

AI summary: Proposes ImmersiveTTS, a model that uses a multimodal diffusion transformer and domain-specific representation alignment to generate speech from text that blends naturally with environmental audio.

Comments
Accepted to ACL 2026 main conference. Code is available at https://github.com/jjunak-yun/ImmersiveTTS

Abstract

Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.

2605.30963 2026-06-01 q-bio.BM cs.AI

AMix-2: Establishing Protein as a Native Modality in Large Language Models

Keyue Qiu, Yixin Wu, Lihao Wang, Yawen Ouyang, Jixiang Yu, Zihan Zhou, Changze Lv, Dongyu Xue, Yuxuan Song, Xinbo Zhang, Hao Wang, Jiangtao Feng, Zhiqiang Gao, Lijun Wu, Xiaoqing Zheng, Ka-Chun Wong, Lei Bai, Ya-Qin Zhang, Wei-Ying Ma, Dahua Lin, Bowen Zhou, Hao Zhou

AI summary: Proposes AMix-2, a protein-text foundation model that unifies protein understanding and sequence design, treating protein as a native modality of large language models, and introduces a block-wise diffusion language modeling backbone that better matches the intrinsic nature of proteins.

Comments
30 pages, 4 figures, 12 tables

Abstract

We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language and protein sequence in a shared token space, enabling one model to perform biological reasoning and conditional design instead of separate downstream task-specialized models; and (2) a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left-to-right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time-aware and homology-aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein-specialized models and LLMs. On ProteinArena, AMix-2 outperforms frontier LLMs and demonstrates competitive performance to task-specific protein models. Controlled experiments further show that the diffusion-based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix-2 and ProteinArena to facilitate open research in protein foundation models.
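The combination of causal generation across blocks with bidirectional context within blocks can be pictured as an attention mask. The following is a minimal sketch under assumptions: the function name and the fixed block size are hypothetical, and the abstract does not say how AMix-2 actually builds its masks.

```python
import numpy as np

def block_attention_mask(seq_len, block_size):
    """Boolean mask where entry [i, j] is True if position i may attend to position j.

    Positions attend to every position in their own block (bidirectional)
    and to all earlier blocks (causal across blocks), never to later blocks.
    """
    block_ids = np.arange(seq_len) // block_size
    return block_ids[:, None] >= block_ids[None, :]

mask = block_attention_mask(seq_len=4, block_size=2)
```

With a block size of 1 this reduces to the usual left-to-right causal mask, and with a block size equal to the sequence length it becomes fully bidirectional, which is one way to see the scheme as interpolating between the two.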

2605.30961 2026-06-01 cs.CL

EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation

Xu Li, Hanzhe Tu, Xinyi Li, Kuncheng Zhao, Xun Han, Zhonghui Liu

AI summary: Addresses the semantic convergence and the limited diversity and novelty of existing LLM methods for scientific idea generation. Proposes the EvoGens framework, which strengthens idea exploration through evolutionary search (mutation, crossover, selection) and substantially improves novelty and diversity.

Comments
21 pages, 6 figures

Abstract

Generating novel research ideas is fundamental to scientific progress. While Large Language Models (LLMs) show promise in assisting this process, existing approaches often exhibit semantic convergence, resulting in limited diversity and novelty. To address this, we introduce EvoGens, an evolution-inspired framework that recasts scientific idea generation as an evolutionary search over a population of ideas. EvoGens iteratively applies rank-based mutation with differentiated retrieval planning to incorporate external knowledge, and semantic-aware crossover to fuse complementary concepts for conceptual reorganization. A lightweight evaluation signal guides the selection process, encouraging sustained exploration while mitigating premature convergence. Extensive experiments demonstrate that EvoGens substantially enhances exploration capabilities compared to state-of-the-art baselines. Specifically, it improves the Novelty from 0.1 to 0.4 and the Diversity from 0.24 to 0.55, while maintaining comparable idea quality under the current automatic evaluation protocol. These findings suggest that evolutionary mechanisms can serve as a useful framework for exploration-oriented research ideation, especially for broadening the novelty and diversity of candidate ideas under a shared automatic evaluation setting.
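The mutation, crossover, and selection loop described in the abstract follows the shape of a generic population-based search. The sketch below shows that generic loop only, with toy operators on bit tuples standing in for ideas; in EvoGens the operators are LLM-driven and the selection signal is a learned evaluation, none of which is reproduced here. All function names are hypothetical.

```python
import random

def evolve(population, score, mutate, crossover, generations=20, seed=0):
    """Generic elitist evolutionary search; needs at least two individuals."""
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: max(2, size // 2)]
        children = []
        while len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Selection: keep the best `size` individuals among parents and children.
        population = sorted(parents + children, key=score, reverse=True)[:size]
    return population

# Toy stand-ins: an "idea" is a tuple of bits and its score is the number of ones.
def toy_score(idea):
    return sum(idea)

def toy_mutate(idea, rng):
    i = rng.randrange(len(idea))
    return idea[:i] + (1 - idea[i],) + idea[i + 1:]

def toy_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

initial = [(0,) * 8 for _ in range(6)]
final = evolve(initial, toy_score, toy_mutate, toy_crossover)
```

Because the best parent always survives selection, the top score never decreases between generations. A purely greedy selection like this one also tends to converge early, which is the failure mode the abstract says its evaluation signal is meant to mitigate.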

2605.30960 2026-06-01 cs.LG

Revisiting Zeroth-Order Hessian Approximation: A Single-Step Policy Optimization Lens

Junbin Qiu, Zhaowei Hong, Renzhe Xu, Yao Shu

AI summary: Unifies zeroth-order Hessian estimation through the lens of single-step policy optimization and proposes the variance-reduced ZoVH framework for efficient estimation of the full Hessian matrix, its regularized inverse, and the bias-corrected inverse Hessian-gradient product.

Abstract

Accurate Zeroth-Order (ZO) Hessian estimation is a cornerstone of derivative-free methods, essential for tasks such as bilevel optimization, Bayesian inference, and uncertainty quantification. However, obtaining a complete suite of low-variance estimators for the Hessian and its inverse in high-dimensional settings remains a significant challenge. To address this, we propose a unified framework that reinterprets ZO Hessian approximation through the lens of single-step Policy Optimization (PO). This perspective establishes a theoretical equivalence between general ZO Hessian estimators and the Hessian of a smoothed PO objective, unifying distinct classical randomized estimators as specific instances of baseline selection. Building on this foundation, we introduce ZoVH, a comprehensive suite of variance-reduced estimators for the full Hessian matrix, its regularized inverse, and the bias-corrected inverse Hessian-gradient product. ZoVH leverages two key techniques: (1) a unique optimal baseline derived to provably minimize variance, and (2) a query reuse strategy that incorporates historical function queries to enhance sample efficiency without inflating costs. Our rigorous theoretical analysis confirms the unbiasedness of the Hessian estimator, validates the variance optimality of our baseline, provides error bounds for the entire ZoVH suite, and establishes convergence guarantees for the resulting curvature-aware ZO algorithm. Extensive empirical results validate our theoretical findings, demonstrating that ZoVH achieves superior estimation accuracy and convergence performance in real-world applications. Code is available at https://github.com/Qjbtiger/ZoVH
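For readers unfamiliar with zeroth-order Hessian estimation, the sketch below shows a classical Gaussian-smoothing estimator built from function values only. It is not ZoVH: it uses a symmetric second difference, in which subtracting `f(x)` plays the role of a simple baseline (valid because the matrix `u u^T - I` has zero mean), and it includes neither the optimal baseline nor the query reuse described in the abstract. The function name and default parameters are assumptions.

```python
import numpy as np

def zo_hessian(f, x, mu=1e-2, num_samples=1000, seed=0):
    """Estimate the Hessian of f at x from function evaluations only.

    Averages (f(x + mu*u) + f(x - mu*u) - 2*f(x)) / (2*mu^2) * (u u^T - I)
    over Gaussian directions u. For a quadratic f the expectation equals
    the exact Hessian.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    fx = f(x)
    estimate = np.zeros((d, d))
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        second_diff = f(x + mu * u) + f(x - mu * u) - 2.0 * fx
        estimate += second_diff / (2.0 * mu**2) * (np.outer(u, u) - np.eye(d))
    return estimate / num_samples

# Toy check on a quadratic whose Hessian is A.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
quadratic = lambda x: 0.5 * x @ A @ x
H = zo_hessian(quadratic, np.array([1.0, -1.0]), num_samples=50000)
```

Even on this two-dimensional toy problem the estimate needs tens of thousands of queries to settle, which gives a sense of why variance reduction matters in higher dimensions.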

2605.30957 2026-06-01 cs.RO

RDGen: Demonstration Generation for High-Quality Robot Learning via Reinforcement Learning

Zijian Zhu, Menglin Zou, Zhuang Li, Yaojie Tu, Xinhai Sun

AI summary: Proposes the RDGen framework, which uses sim-to-real reinforcement learning policies to generate high-quality robot demonstration trajectories for training vision-language-action models, producing smoother trajectories than human teleoperation and improving downstream performance.

Comments
13 pages, 4 figures, 3 tables

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robot control. However, their performance remains fundamentally constrained by the availability of high-quality robot trajectory data. In current robot learning practice, such data are primarily collected through human teleoperation, which is labor-intensive, costly, and difficult to scale. In this paper, we propose RDGen, a sim-to-real reinforcement learning framework for generating high-quality robot demonstrations. Rather than employing reinforcement learning solely as the final control policy, RDGen leverages trained RL policies as a structured trajectory generator. The system consists of a VLM-based task parser that identifies task-relevant objects, a Grounding DINO-based object localizer, and an RL policy transferred from simulation to the real robot. Successful rollouts are then harvested as clean, high-quality demonstrations for downstream VLA training, while the simulation stage further provides a scalable source of additional trajectories at little marginal cost. Experiments on a pick-and-place task demonstrate that the transferred RL policy achieves a high task success rate. Compared with human teleoperation, RDGen produces significantly smoother trajectories and yields superior downstream VLA performance. These results indicate that RL-generated demonstrations can serve as more reliable and consistent supervisory signals for robot policy learning.

2605.30942 2026-06-01 cs.CV

PRISM: Progressive Reasoning through Iterative Slot Memory for Vision

Ziyu Wang, Shuangpeng Han, Mengmi Zhang

AI summary: Proposes the PRISM architecture, which reasons progressively through iterative slot memory, achieving competitive performance on image classification, object detection, and semantic segmentation while showing stronger robustness under incomplete observations such as occlusion.

Abstract

Modern vision models process images in a single feed-forward pass, which limits their ability to recover missing evidence or refine uncertain representations under incomplete observations. Inspired by the iterative nature of human perception, we introduce PRISM (Progressive Reasoning through Iterative Slot Memory), a pyramid vision architecture that reasons over images through iterative refinement. At a high level, PRISM groups visual features into object-centric representations, retrieves relevant patterns from a learned memory, and iteratively refines the representation to resolve ambiguity and recover missing information. This organize-recall-refine process operates recurrently across multiple scales, enabling progressive improvement of visual representations. Across standard vision tasks, including image classification, object detection, and semantic segmentation, PRISM achieves competitive performance while demonstrating improved robustness under incomplete observations such as occlusion. These results suggest that iterative reasoning with structured representations and memory is a promising direction for building more resilient and adaptive vision models. Source code and models will be released.

2605.30940 2026-06-01 eess.AS cs.MM cs.SD

Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Ke Lei, Yu Zhang, Changhao Pan, Xueyi Pu, Wenxiang Guo, Ruiqi Li, Zhou Zhao

AI summary: Proposes SwanSphere, a unified streaming framework that generates high-fidelity spatial audio from panoramic videos and text prompts using a causal autoregressive diffusion transformer, spatial video-audio contrastive learning, and multi-objective online direct preference optimization, and develops an automated annotation pipeline to alleviate data scarcity.

Comments
Accepted by ICML 2026

Abstract

Real-time and accurate spatial audio generation is pivotal for delivering an immersive experience. However, existing spatial audio synthesis technologies are often encumbered by a tradeoff between generation quality and high inference latency, as well as difficulty in capturing precise spatial information from multimodal inputs. To address these challenges, we propose SwanSphere, a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts. SwanSphere mainly makes the following contributions: 1) We introduce a causal autoregressive diffusion transformer architecture that enables streaming high-quality spatial audio generation. 2) We design a Spatial Video-Audio Contrastive (SVAC) learning strategy to align the video encoder with the acoustic domain, and further employ a multi-objective online direct preference optimization (ODPO) scheme, resulting in strong spatial perception and robust multimodal spatial audio synthesis. 3) To alleviate the current scarcity of spatial audio datasets, we also develop an automated annotation pipeline for generating detailed spatial captions. Experimental results demonstrate that SwanSphere achieves superior performance in both video-to-spatial and text-to-spatial audio generation tasks. Demos can be found at: https://swanaigc.github.io.

2605.30939 2026-06-01 cs.CV

IAF-Net: Illumination-Adaptive Fusion for Low-Light Urban Road Segmentation

Bingtao Wang, Daojie Peng, Fulong Ma, Jun Ma, Liang Zhang

AI summary: Proposes IAF-Net, which dynamically adjusts the fusion weights of RGB and geometric features through an illumination-adaptive fusion module and enhances low-light feature selection with a brightness-modulated attention decoder, achieving robust road segmentation across lighting conditions.

Abstract

Semantic road segmentation is important for autonomous driving, but existing methods suffer severe performance degradation under low-light conditions. Many existing multi-modal fusion methods do not explicitly adapt to illumination-dependent changes in modality reliability, which can propagate degraded RGB features into the fused representation at night. We propose IAF-Net (Illumination-Adaptive Fusion Network), an end-to-end framework with illumination-adaptive fusion for robust road segmentation across different lighting conditions. It dynamically adjusts fusion weights of RGB and geometric features via the core Illumination-Adaptive Fusion (IAF) module, and enhances low-light feature selection with a brightness-modulated attention decoder. We also construct two dedicated datasets: nuScenes Nighttime Road Segmentation (nuScenes-NRS) and CARLA Multi-Weather Road Segmentation (CARLA-MWRS). Experiments on nuScenes-NRS show state-of-the-art overall performance among the compared methods, while CARLA-MWRS further validates robustness across adverse weather conditions. Ablation studies on a 40% training subset further highlight the importance of the IAF module, which provides the largest individual gain of 0.70% in MaxF.
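The idea of shifting fusion weight away from RGB features as illumination drops can be illustrated with a hand-set gate. This is only a toy sketch: in IAF-Net the fusion weights are learned by the IAF module, whereas the sigmoid gate, its `midpoint` and `sharpness` parameters, and the use of mean pixel intensity as a brightness estimate are all assumptions made here for illustration.

```python
import numpy as np

def illumination_gate(rgb_image, midpoint=0.3, sharpness=10.0):
    """Map mean image brightness (pixel values assumed in [0, 1]) to a weight in (0, 1)."""
    brightness = float(np.mean(rgb_image))
    return 1.0 / (1.0 + np.exp(-sharpness * (brightness - midpoint)))

def fuse(rgb_feat, geo_feat, gate):
    """Convex combination: bright scenes favor RGB, dark scenes favor geometry."""
    return gate * rgb_feat + (1.0 - gate) * geo_feat

dark_gate = illumination_gate(np.zeros((4, 4, 3)))
bright_gate = illumination_gate(np.ones((4, 4, 3)))
```

A single scalar gate per image is the crudest possible version; a learned module can vary the weights per channel or per location, which a fixed formula like this cannot.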

2605.30936 2026-06-01 cs.LG math.OC stat.ML

Local linear convergence of gradient methods for overparameterized Gaussian mixtures

Jingxing Wang, Vasileios Charisopoulos, Maryam Fazel

AI summary: For overparameterized Gaussian mixture models, proposes a method that alternates short gradient steps with long Polyak steps, achieving a local linear convergence rate and overcoming the slow convergence caused by overparameterization.

Comments
45 pages, 7 figures

Abstract

We study the problem of learning Gaussian mixture models under overparameterization. Prior work has shown that while overparameterization is essential for avoiding spurious local optima and enables global recovery of the ground-truth model using the gradient-EM (expectation-maximization) algorithm, it can dramatically slow down the local rate of convergence. Under certain assumptions on the mixture weights, we show that a standard divergence measure minimized by statistical learning procedures possesses a manifold of slow growth on which the well-known Polyak stepsize reduces the loss geometrically, and design a gradient-based method that converges to minimizers at a locally linear rate. Additionally, we show that our method converges to nearly optimal solutions -- up to a natural misspecification threshold -- for mixtures with arbitrary weights. At a high level, the method alternates between several "short" gradient descent steps that approach the manifold and "long" Polyak steps that contract the distance to minimizers. Our results suggest that slow convergence is not an intrinsic challenge of overparameterization, but can be overcome by exploiting the favorable structure of the loss landscape.
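The Polyak stepsize mentioned in the abstract sets the step length to (f(x) - f*) / ||grad f(x)||^2, where f* is the minimum value. The sketch below applies it to a toy quartic objective, not to a Gaussian mixture loss, and it omits the paper's alternation with short gradient steps. The function names and the toy objective are assumptions for illustration.

```python
import numpy as np

def polyak_step(x, f, grad, f_star=0.0):
    """One gradient step with the Polyak stepsize (f(x) - f_star) / ||grad f(x)||^2."""
    g = grad(x)
    g_norm_sq = np.dot(g, g)
    if g_norm_sq == 0.0:
        return x  # already at a stationary point
    return x - (f(x) - f_star) / g_norm_sq * g

# Toy objective with slow (quartic) growth around its minimizer at the origin.
quartic = lambda x: 0.25 * np.dot(x, x) ** 2
quartic_grad = lambda x: np.dot(x, x) * x

x0 = np.array([1.0, 2.0])
x = x0.copy()
for _ in range(50):
    x = polyak_step(x, quartic, quartic_grad)
```

On this objective the step works out to x <- 0.75 x, so the distance to the minimizer shrinks by a constant factor per step. Gradient descent with a fixed stepsize slows down on the same function as the gradient flattens near the origin, which is the contrast the abstract draws between slow and geometric convergence.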

2605.30934 2026-06-01 cs.CL cs.AI

Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity

Nattavudh Powdthavee

AI summary: Uses cross-linguistic moral dilemma experiments to study whether large language models encode institutional experience through language in ambiguous situations, finding that implicit institutional cues amplify cross-linguistic moral divergence while explicit framing suppresses it.

Comments
44 pages

Abstract

Large language models (LLMs) exhibit systematic differences in moral reasoning across languages, yet the source of this variation remains unclear. We test the hypothesis that languages encode aspects of the institutional environments in which they are spoken, allowing LLMs to inherit institution-specific moral priors through training. Across nine languages spanning a broad gradient of institutional quality, six frontier LLMs, and two preregistered studies, we examine moral dilemmas whose acceptability depends on institutional functioning. In Study 1, explicit institutional framing produced uniformly null results: cross-linguistic moral divergence did not increase in institutionally contingent scenarios, nor did it track institutional differences between language communities. In Study 2, we introduced institutionally ambiguous scenarios in which institutional stakes were present but not explicitly stated. Under these conditions, cross-linguistic moral divergence increased relative to institutionally inert controls and, with one theoretically informative exception, was associated with real-world institutional differences between language communities. Explicit framing again attenuated these effects. These findings suggest that institutional experience may leave detectable traces in language that shape LLM moral reasoning, while also indicating that explicit institutional cues can suppress the expression of those differences.

2605.30931 2026-06-01 cs.CL

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang

AI summary: Proposes the MineExplorer benchmark, which builds implicit multi-hop tasks through a multi-agent synthesis workflow to evaluate the open-world exploration of multimodal large language models in Minecraft, finding that coordination over long trajectories remains challenging.

Comments
Work in progress

Abstract

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging, as strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.

2605.30930 2026-06-01 cs.HC cs.AI cs.CL cs.CY

TUX: Measuring Human-AI Tacit Understanding

Yueshen Li, Hanyi Min, Vedant Das Swain, Koustuv Saha

AI summary: Quantifies tacit understanding between humans and LLMs using a spectrum-placement task and the TUX index, finding that personality traits influence the degree of alignment.

Abstract

As large language models (LLMs) increasingly act as collaborative partners, human-AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. We find that nearest human-agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.
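A pairwise similarity over spectrum placements could be computed in many ways, and the abstract does not give the TUX formula. The sketch below is therefore one hypothetical choice and not the paper's definition: placements are assumed to be normalized to [0, 1], and similarity is taken as one minus the mean absolute difference.

```python
import numpy as np

def placement_similarity(human, agent):
    """Hypothetical similarity in [0, 1] between two sets of spectrum placements.

    Both inputs list where each concept was placed on its spectrum,
    normalized to [0, 1] and given in the same concept order.
    """
    human = np.asarray(human, dtype=float)
    agent = np.asarray(agent, dtype=float)
    return float(1.0 - np.mean(np.abs(human - agent)))

same = placement_similarity([0.2, 0.8, 0.5], [0.2, 0.8, 0.5])
opposite = placement_similarity([0.0, 1.0], [1.0, 0.0])
```

Under this choice, identical placements score 1 and placements at opposite ends of every spectrum score 0.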

2605.30928 2026-06-01 cs.RO

Enhancing Human-Likeness in Reinforcement Learning Agents via Hierarchical Macro Action Quantization

Usman Nizamani, M. Shaheer Luqman, Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, M. Zeeshan Zia, Quoc-Huy Tran

AI summary: Proposes a hierarchical macro action quantization framework (HiMAQ) that encodes human demonstrations into macro actions via two-level vector quantization, enabling reinforcement learning agents to generate more human-like action sequences while maintaining high returns; it outperforms the non-hierarchical baseline on the D4RL benchmarks and is compatible with multiple RL algorithms.

Abstract

Human-like agents are a long-standing goal of artificial intelligence. Despite strong performance, most reinforcement learning (RL) agents remain reward-driven and often exhibit behaviors that differ from humans, limiting interpretability and reliability. In this work, we introduce a novel human-like RL framework that predicts action sequences closely aligned with human behaviors while maximizing rewards. Specifically, we encode human demonstrations into macro actions using a hierarchical macro action quantization approach (termed HiMAQ) consisting of two successive levels of vector quantization. The lower quantization level maps input actions to fine-grained subaction clusters, while the higher quantization level aggregates these subaction clusters into action clusters. Extensive evaluations on the D4RL benchmarks show that our hierarchical approach outperforms the non-hierarchical baseline (MAQ), achieving better human-likeness scores while maintaining comparable or better success rates than previous RL agents. The improvements generalize across integrations with various RL algorithms, namely IQL, SAC, and RLPD.
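The two successive quantization levels can be sketched with plain nearest-neighbor codebook lookups. This is an illustrative toy and not HiMAQ itself: the codebooks here are fixed by hand rather than learned from demonstrations, the inputs are single action vectors rather than macro action sequences, and the function names are hypothetical.

```python
import numpy as np

def nearest_code(vectors, codebook):
    """Index of the closest codebook entry (Euclidean distance) for each row of `vectors`."""
    dists = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def hierarchical_quantize(actions, sub_codebook, top_codebook):
    """Lower level: action -> subaction cluster. Higher level: subaction cluster -> action cluster."""
    sub_ids = nearest_code(actions, sub_codebook)
    top_ids = nearest_code(sub_codebook[sub_ids], top_codebook)
    return sub_ids, top_ids

sub_codebook = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
top_codebook = np.array([[0.5, 0.0], [10.5, 10.0]])
actions = np.array([[0.1, 0.0], [10.9, 10.1]])
sub_ids, top_ids = hierarchical_quantize(actions, sub_codebook, top_codebook)
```

Here four fine-grained codes collapse into two coarse ones, so nearby subaction clusters share the same higher-level index.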