arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02282 2026-06-02 cs.AI

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Iñaki Dellibarda Varela, R. Sendra-Arranz, Pablo Romero-Sorozabal, J. M. Valverde-García, Annemarie F. Laudanski, Álvaro Gutiérrez, Eduardo Rocon, Manuel Cebrian

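AI summary: Proposes POIROT, a protocol that uses a multi-agent system's own agents as a diagnostic layer for failure detection, outperforming single-LLM evaluator baselines on complex problems, with many agents, and under compound fault conditions.

Details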
Comments
44 pages, 6 figures
English abstract

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains -- a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system's own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, $p = 0.008$), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.

2606.02280 2026-06-02 cs.RO

Dynamics Are Learned, Not Told: Semi-Supervised Discovery of Latent Dynamics Geometries For Zero-Shot Policy Adaptation

Zhiming Xu, Weitao Zhou, Xianghui Pan, Nanshan Deng, Chengju Liu, Qijun Chen, Chenpeng Yao

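AI summary: To address policy failure under dynamics shifts in robot reinforcement learning, proposes a contrastive-learning-based semi-supervised method that builds a smooth, task-relevant latent topology for zero-shot policy adaptation, outperforming parameter-centric methods on MuJoCo benchmarks.

Details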
Comments
Proceedings of the 43rd International Conference on Machine Learning
English abstract

Real-world dynamics shifts pose a critical challenge for reinforcement learning in robotics, as policies tightly coupled to nominal environments often fail catastrophically when physical conditions change. Most existing methods rely on encoding explicitly identified physical parameters into a latent context, a parameter-centric paradigm that depends on pre-specified axes of variation and becomes brittle under unmodeled or compound dynamics changes. We revisit dynamics adaptation from an outcome-centric perspective: rather than telling policies what the dynamics are, we enable them to learn how dynamics affect interaction outcomes. Theoretically, this is grounded in a monotonic relationship between target-domain regret and the Lipschitz constant of a trajectory dynamics encoder. Practically, this constant can be upper-bounded through contrastive learning, yielding a smooth, task-relevant latent topology without privileged dynamics information. On MuJoCo benchmarks, our method consistently outperforms parameter-centric baselines under severe dynamics shifts, including unmodeled and time-varying parameters, while also improving in-distribution stability and latent interpretability. Overall, these results validate that controlling latent geometry is a principled mechanism for robust adaptation.

2606.02278 2026-06-02 eess.SY cs.LG cs.SY

Physics-Guided Recurrent State-Space Neural Networks for Multi-Step Prediction

Ruiyuan Li, Ajay Seth, Manon Kok

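AI summary: Proposes PG-RSSNN, a state-space neural network that combines physical knowledge with recurrent structure; by mitigating vanishing gradients and the risk of numerical divergence, it improves multi-step prediction with limited data and partial physical models.

Details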
Comments
6 pages, 3 figures. Accepted at IFAC World Congress 2026
English abstract

State-space models are traditionally based on physical knowledge, but multi-step predictions from these physical models can be poor due to model inaccuracy. Black-box deep learning has shown promise as an alternative. However, these methods rely on the availability of large datasets and potentially available physical knowledge is neglected. We propose the PG-RSSNN, a physics-guided recurrent state-space neural network that incorporates recurrent structures to enable the use of non-saturating activation functions in multi-step prediction. It mitigates the vanishing gradients and eliminates the risk of numerical divergence in training seen in existing structures that feed back state estimates. Results across multiple systems with various physical model imperfections, from linear state-space models with Gaussian noise to a robotic arm and a cascaded water tank system, show that the proposed PG-RSSNN maintains stable training behavior, and improves multi-step predictions, as compared with black-box neural networks and physics-only models, even with limited training data and when physical models are only partially known.

2606.02277 2026-06-02 cs.RO

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen

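AI summary: Proposes the RoboSemanticBench benchmark, which uses multiple-choice question-answering tasks to test whether VLA models use instruction semantics to pick the correct object; models select the semantically correct object at near-random rates, revealing a gap between semantic understanding and action prediction.

Details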
Comments
GitHub: https://github.com/ZGC-EmbodyAI/RoboSemanticBench
English abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

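AI summary: Studies the risk that clinical vision-language models (VLMs) enable cross-modal re-linkage via cosine similarity when images and reports are kept separate, and applies differentially private fine-tuning to the projection heads only, which substantially lowers re-linkage rates while preserving image utility.

Details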
English abstract

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer ($\epsilon = 0.34$, $\delta = 6 \times 10^{-6}$). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
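The abstract above frames re-linkage as image-to-report retrieval by cosine similarity, scored with Recall@1 against a chance rate of 1/N. A minimal sketch of that measurement follows. It is an illustration only, not the authors' code: the function names and the three-dimensional toy embeddings are hypothetical, and real embeddings would come from a VLM's projection heads.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(image_embs, report_embs):
    """Fraction of images whose most similar report is the true pair.

    Assumes the true report for image i sits at index i of report_embs.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        best = max(range(len(report_embs)),
                   key=lambda j: cosine(img, report_embs[j]))
        hits += int(best == i)
    return hits / len(image_embs)


# Hypothetical toy embeddings (made up for illustration).
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
reports = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.7]]

r1 = recall_at_1(images, reports)
chance = 1 / len(reports)  # random guessing in a candidate pool of N reports
print(f"Recall@1 = {r1:.2f}, chance = {chance:.2f}")
```

The ratio of Recall@1 to the chance rate is what lets retrieval rates be compared across candidate pools of different sizes, as in the "15 times chance" and "50 times chance" figures.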

2606.02273 2026-06-02 cs.CV

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

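AI summary: Creates a detailed natural-language version of the Drive&Act dataset to evaluate and fine-tune vision-language models on recognizing subtle driver actions; the fine-tuned model performs better in cross-dataset evaluation.

Details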
Comments
Accepted at IEEE ITSC 2026
English abstract

Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our new Drive&Act description dataset generalizes well to actions in the DMD dataset. The VLM fine-tuned on our Drive&Act description dataset achieves an ACCR score of 76, outperforming the zero-shot VLM baseline with an ACCR score of 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior, while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.

2606.02268 2026-06-02 cs.CV

From Extrinsic to Intrinsic: Geodesic-Guided Representation Learning for 3D Geometric Data

Yuming Zhao, Junhui Hou, Qijian Zhang, Jia Qin, Ying He

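AI summary: Proposes PRISM, a pre-training paradigm that learns isometric embeddings by recovering the intrinsic surface geodesic metric, addressing the disconnect between extrinsic space and intrinsic topology in 3D representation learning; it performs well on geodesic distance prediction and downstream tasks.

Details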
English abstract

Geometric analysis fundamentally distinguishes between extrinsic and intrinsic perspectives. The dominant paradigm in current 3D representation learning relies on either extrinsic spatial structures or high-level semantics, struggling to capture the essence of shape identity and underlying manifold topology. To bridge this gap, we introduce a novel 3D representation learning paradigm, namely PRISM, for Pre-training, which learns isometric embeddings by Recovering the Intrinsic Surface geodesic Metric. PRISM incorporates a topology-enforcing objective that explicitly constrains the structure of latent space, alongside a specialized two-stage training recipe mitigating sample imbalance inherent in the distribution of geodesic distances. Experiments demonstrate that our approach shows satisfactory accuracy, robustness, and high efficiency in geodesic distance prediction and achieves superior performance across diverse downstream tasks, including shape recognition, surface parameterization, and non-rigid correspondence. The code will be publicly available at https://github.com/AidenZhao/PRISM.

2606.02267 2026-06-02 cs.LG cs.CV

A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs

Nicolas Stalder, Benjamin F. Grewe, Matteo Saponati, Pau Vilimelis Aceituno

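AI summary: Proposes a preprocessing method combining Gaussian noise and bilateral filtering that achieves supralinear gains in adversarial robustness through complementary mechanisms, and shows that, combined with adversarial training, it reaches performance comparable to state-of-the-art defenses at lower computational cost.

Details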
Comments
Main: 8 pages, 3 figures, 2 Tables. Supplement: 10 pages, 7 figures, 6 Tables
English abstract

The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding Gaussian noise or filtering images, both of which can boost network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined. Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only $\sim$35% of the training FLOPs, using a model with $\sim$50% fewer parameters, trained with $\sim$33% of the epochs and $\sim$15% of the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2-8x less total compute across 3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.
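To make the "simple preprocessor combining Gaussian noise and bilateral filtering" concrete, here is a minimal pure-Python sketch on a small grayscale image stored as nested lists with values in [0, 1]. Several things are assumptions rather than details from the abstract: the order of operations (noise first, then filtering), every parameter value, and all function names. A practical version would use an array library and a tuned filter.

```python
import math
import random


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in img]


def bilateral_filter(img, radius, sigma_space, sigma_range):
    """Edge-preserving smoothing: each neighbour is weighted by both its
    spatial distance and its intensity difference from the centre pixel."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        spatial = math.exp(-(dx * dx + dy * dy)
                                           / (2 * sigma_space ** 2))
                        tonal = math.exp(-((img[yy][xx] - img[y][x]) ** 2)
                                         / (2 * sigma_range ** 2))
                        weight = spatial * tonal
                        total += weight * img[yy][xx]
                        norm += weight
            out[y][x] = total / norm  # norm >= 1: the centre pixel has weight 1
    return out


def preprocess(img, noise_sigma=0.05, radius=1, sigma_space=1.0,
               sigma_range=0.2, seed=0):
    """Assumed pipeline: inject noise, then bilateral-filter the result."""
    rng = random.Random(seed)
    noisy = add_gaussian_noise(img, noise_sigma, rng)
    return bilateral_filter(noisy, radius, sigma_space, sigma_range)


# Hypothetical 4x4 image with a vertical edge between dark and bright halves.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
cleaned = preprocess(image)
```

In a classification pipeline, the output of `preprocess` would be passed to the network in place of the raw input.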

2606.02256 2026-06-02 cs.LG

ArrythML: An Autoencoder-Based TinyML Approach for On-Device Arrhythmia Detection on Resource-Constrained Embedded Systems

Nagarajan S, Kurian Polachan

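AI summary: Proposes a TinyML model based on an INT8-quantized autoencoder that performs real-time, low-power ECG segmentation and arrhythmia detection on an ESP32-S3 microcontroller, reaching 84% recall and a 79% F1-score.

Details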
Comments
19 pages
English abstract

Our work presents a method for ECG segmentation and arrhythmia detection using Tiny Machine Learning (TinyML) models for real-time, on-device inference on resource-constrained embedded systems. We develop INT8 quantized autoencoder-based TinyML models with minimal layers and parameters for embedded deployment. These models are evaluated using a custom dataset derived from the MIT-BIH Arrhythmia Database and validated in both PC-based simulations and on-device environments. For the evaluations, over 95,000 ECG segments are processed on an ESP32-S3 microcontroller running the TensorFlow Lite Micro runtime. Post-evaluation, detailed analysis, including annotation-wise and record-wise failure analysis, is conducted to characterize model behavior across diverse ECG morphologies and rhythm patterns and to explain missed detections. In several cases, apparent misclassifications may correspond to early or subtle anomaly patterns labeled as normal in the reference annotations, highlighting the model's sensitivity. A refined evaluation by filtering out ambiguous cases in the dataset shows that the best-performing DNN-based autoencoder achieves a recall of 84%, an F1-score of 79%, a model size of approximately 180 KB, and an inference latency of 9 ms on-device. These results demonstrate the feasibility of low-power, privacy-preserving embedded wearable systems capable of performing accurate arrhythmia detection entirely on-device.

2606.02255 2026-06-02 cs.CL cs.AI

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger

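AI summary: Through a large-scale audit of human annotation reporting in NLP papers, finds that annotation details are reported incompletely and proposes a framework and recommendations for improving reporting quality.

Details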
English abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.

2606.02253 2026-06-02 cs.AI

CEON: Circular Economy Ontology Network

Huanyu Li, Els de Vleeschauwer, Robin Keskisärkkä, Mikael Lindecrantz, Mina Abd Nikooie Pour, Ying Li, Ben De Meester, Patrick Lambrix, Eva Blomqvist

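AI summary: To address semantic interoperability in cross-industry information sharing for the circular economy, proposes the Circular Economy Ontology Network (CEON), which defines cross-sectorial concepts and enables semantics-aware data documentation, and validates it in scenarios from the construction, electronics, and textile sectors.

Details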
English abstract

Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more circular economy. There are many different circular strategies to do so, such as reusing products and components, refurbishing and remanufacturing used products, or recycling left-over or used materials. To enable these strategies, it is necessary to share information at the infrastructure level and to communicate between industry sectors along the product life cycle. Enabling semantic interoperability in this information sharing and communication is therefore a key to increasing circularity. However, knowledge representation for the circular economy (CE) domain, which involves many relevant industry sectors related to product life cycles, remains challenging. To bridge this gap, we developed the Circular Economy Ontology Network (CEON) within the Onto-DESIDE project. This ontology network aims to fill gaps in CE by defining cross-sectorial concepts and to enable semantics-aware data documentation. We demonstrate CEON through cross-industry data documentation scenarios spanning construction, electronics, and textile sectors.

2606.02252 2026-06-02 cs.CL

ResMerge: Residual-based Spectral Merging of Large Language Models

Yandu Sun, Zhiyan Hou, Haokai Ma, Yuheng Jia, Junfeng Fang, Haiyun Guo, Hongyan An, weizhen wang, Jinqiao Wang

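AI summary: Proposes the ResMerge framework, which decomposes RL task vectors into a spectral head and a residual component, builds a stable backbone from the residual components, and selectively reintroduces head information to merge expert models without training.

Details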
Comments
14 pages including appendix
English abstract

Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at https://github.com/sunyd0303-cpu/ResMerge-release.

2606.02251 2026-06-02 cs.RO cs.AI eess.SP

FW-NKF: Frequency-Weighted Neural Kalman Filters

Adnan Harun Dogan, Berken Utku Demirel, Christian Holz

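AI summary: Proposes the Frequency-Weighted Neural Kalman Filter (FW-NKF), which embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks to suppress band-limited noise, reducing localization error by up to 10% on tasks such as chaotic systems and inertial pose estimation.

Details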
Comments
Published at ICRA 2026
English abstract

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.

2606.02248 2026-06-02 cs.CL

Geometric Latent Reasoning Induces Shorter Generations in LLMs

Shashi Kumar, Yacouba Kaloga, Petr Motlicek, Ina Kodrasi, Andrea Cavallaro

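AI summary: Proposes Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space and approximate discrete reasoning trajectories, substantially reducing the number of LLM generation steps without explicitly optimizing for length.

Details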
English abstract

Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, determining useful structures for intermediate latent states is an open challenge. In this paper, we formulate latent reasoning as a geometric path-approximation problem within the model's pretrained token-embedding space. We introduce Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space. Using textual chain-of-thought traces as anchors, GLR learns to approximate discrete reasoning trajectories while permitting continuous deviations from exact token embeddings. Evaluations on mathematical reasoning benchmarks using Qwen3 models reveal an emergent phenomenon: geometric latent reasoning induces substantially shorter generations without an explicit length objective. By replacing early explicit reasoning with continuous latent steps, models often reach correct answers using substantially fewer total generation steps. These findings suggest that continuous trajectories act as compact intermediate reasoning states, exposing a new tradeoff between latent computation budget, output length, and accuracy.

2606.02247 2026-06-02 stat.ML cs.LG

ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation

David Rundel, Fabian Fumagalli, Maximilian Muschalik, Bernd Bischl, Matthias Feurer

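AI summary: Proposes ShaplEIG, which uses a Gaussian process surrogate and expected information gain to select coalitions adaptively for efficient Shapley value estimation, markedly improving sample efficiency in low-budget settings.

Details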
Comments
Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)
English abstract

Shapley values are a principled attribution measure widely used in interpretable machine learning, but their exact computation scales exponentially with the number of players, motivating a wide range of approximation methods based on value function evaluations of sampled coalitions. This raises the question of whether approximation accuracy can be improved by adaptively selecting coalitions for evaluation based on previous evaluations. This is particularly relevant in settings where the value function is costly and the number of evaluations is severely limited, such as retraining-based feature importance, data valuation, and hyperparameter importance. For this purpose, we propose ShaplEIG, a Bayesian experimental design approach that approximates the expensive value function using a Gaussian process surrogate and adaptively selects coalitions based on their expected information gain about the Shapley values. By the linearity of the Shapley values in the value function, we show that the expected information gain is available in closed form. Furthermore, we propose an efficient computation scheme that reduces the complexity from exponential to polynomial in the number of players via elementary symmetric polynomials. In extensive experiments across diverse costly applications, our method consistently improves sample efficiency in the low-budget regime over state-of-the-art baselines.
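For context on why exact computation "scales exponentially with the number of players", the sketch below evaluates the textbook Shapley formula by enumerating every coalition. This is the brute-force baseline that approximation methods try to avoid, not ShaplEIG itself. The toy value function and all names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating all coalitions.

    value(frozenset) is called for 2^(n-1) coalitions per player, which is
    why this approach is only feasible for a handful of players.
    """
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Weight of a size-k coalition in the Shapley average.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_value(coalition):
    """Hypothetical game: additive payoffs plus a bonus of 2 when
    players 'a' and 'b' are both present."""
    base = {"a": 1.0, "b": 2.0, "c": 3.0}
    total = sum(base[p] for p in coalition)
    if "a" in coalition and "b" in coalition:
        total += 2.0
    return total


phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

In this toy game the interaction bonus is split evenly between `a` and `b`, and the values sum to the payoff of the full coalition (the efficiency property). Each call to `value` stands in for what the abstract describes as a costly evaluation, such as retraining a model.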

2606.02246 2026-06-02 cs.CV

Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark

Ego-METAS:面向自我中心的在线多模态节能时间动作分割基准

Maria Santos-Villafranca, Jesus Bermudez-Cameo, Alejandro Perez-Yus, Giovanni Maria Farinella, Antonino Furnari

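AI summary: To address energy-aware perception on resource-constrained devices, proposes Ego-METAS, the first egocentric online multimodal energy-efficient temporal action segmentation benchmark, with more than 100 hours of untrimmed video and 5 modalities; models must dynamically select sensors while respecting an energy budget. The evaluation shows that optimal routing is highly scenario-dependent and that existing methods struggle to adapt to continuous environments.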

详情
Comments
Project Page: https://maria-sanvil.github.io/Ego-METAS-website/
AI中文摘要

为了在物理世界中运行,具身智能体必须以“始终在线”的方式感知环境,选择性访问信息最丰富的传感器,以平衡能量约束和任务准确性。尽管这对资源受限设备至关重要,但能耗感知感知仍未被充分探索,大多数先前工作假设无限计算。为了解决这一问题,我们引入了Ego-METAS:首个自我中心在线多模态节能时间动作分割基准。Ego-METAS提供了一个统一的测试平台,包含来自EgoExo4D、CMU-MMAC和CaptainCook4D的超过100小时未裁剪自我中心视频,涵盖5种模态(RGB、音频、注视、IMU和单色相机)。我们制定了一个在线时间动作分割任务,其中模型必须动态选择在每个时间步激活哪些传感器,同时严格遵守硬件代表性的能量预算。除了基准测试,我们还发布了统一的分割、清理后的标注、预提取特征以及一套多样化的基线路由策略。我们的评估表明,最优路由高度依赖于场景,并且现有的策略学习方法(主要针对裁剪片段设计)难以适应连续的未裁剪环境。然而,即使是互补模态的简单动态融合(例如通过随机路由)也被证明对于平衡预测准确性与严格能量预算至关重要。最终,Ego-METAS为开发自主、始终在线的具身AI的鲁棒、成本感知策略提供了标准化基础。

英文摘要

To operate in the physical world, embodied agents must perceive their environment in an "always-on" fashion, selectively accessing the most informative sensors to balance energy constraints and task accuracy. Despite its importance for resource-constrained devices, energy-aware perception remains under-explored, with most prior work assuming unlimited compute. To address this, we introduce Ego-METAS: the first Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark. Ego-METAS provides a unified testbed of more than 100 hours of untrimmed egocentric video from EgoExo4D, CMU-MMAC, and CaptainCook4D, spanning 5 modalities (RGB, audio, gaze, IMU, and monochrome camera). We formulate an online temporal action segmentation task where models must dynamically select which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets. Alongside the benchmark, we release unified splits, cleaned annotations, pre-extracted features, and a diverse suite of baseline routing policies. Our evaluations show that optimal routing is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. However, even simple dynamic fusion of complementary modalities (e.g., via random routing) proves critical for balancing predictive accuracy against strict energy budgets. Ultimately, Ego-METAS provides a standardized foundation to develop robust, cost-aware policies for autonomous, always-on embodied AI.
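
The random-routing baseline mentioned above might look roughly like the following sketch. The per-sensor costs in `SENSOR_COST_MW` and the function `random_route` are assumptions made for illustration; the benchmark's actual energy model and baseline implementations are not described in the abstract.

```python
import random

# Hypothetical per-timestep energy cost of each modality, in milliwatts.
SENSOR_COST_MW = {"rgb": 300.0, "audio": 20.0, "gaze": 50.0, "imu": 5.0, "mono": 120.0}


def random_route(budget_mw, rng):
    """Pick a random set of sensors whose total cost stays within the budget."""
    order = list(SENSOR_COST_MW)
    rng.shuffle(order)  # random priority order for this timestep
    active, spent = [], 0.0
    for sensor in order:
        cost = SENSOR_COST_MW[sensor]
        if spent + cost <= budget_mw:  # never exceed the energy budget
            active.append(sensor)
            spent += cost
    return sorted(active), spent


rng = random.Random(0)
active_sensors, spent_mw = random_route(200.0, rng)
```

Calling `random_route` once per timestep gives a policy that varies its sensor mix over time while respecting a hard budget, which is the property the abstract attributes to simple dynamic fusion.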

2606.02245 2026-06-02 cs.CL

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

当知识并非免费:检索增强生成中的成本感知证据选择

Mingyan Wu, Han Yang, Omer Ben-Porat, Yftah Ziser

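AI summary: Proposes a cost-aware RAG setting with access-cost tiers and budget constraints to study evidence selection strategies, finding that static selection is brittle while agentic approaches are promising but model- and task-dependent.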

详情
AI中文摘要

检索增强生成(RAG)通常假设外部知识是免费的,但许多高质量来源需要付费、许可、受限或以其他方式访问成本高昂。我们引入了成本感知RAG,这是一种设置,其中检索到的证据被分配访问成本层级,系统必须在明确的证据访问预算下回答问题。我们通过为MS MARCO v2.1添加访问摩擦层级来实例化这一设置,并在通用领域和特定领域QA基准上评估预算证据选择。我们的结果表明,静态选择是脆弱的:没有固定的选择器能统一占优,更大的预算并不能可靠地提高答案质量,即使高成本证据与领域匹配。然后我们研究了智能体成本感知RAG,其中LLM决定何时检索、访问哪个层级以及何时停止。智能体作为自适应证据获取控制器显示出强大的潜力,但其行为仍然高度依赖于模型和任务。这些发现表明,成本感知的证据获取是下一代RAG系统的核心挑战。所有代码和数据可在https://github.com/Mignonmy/Cost-Aware获取。

英文摘要

Retrieval-Augmented Generation (RAG) typically assumes that external knowledge is free, but many high-quality sources are paywalled, licensed, restricted, or otherwise costly to access. We introduce cost-aware RAG, a setting where retrieved evidence is assigned access-cost tiers and systems must answer under an explicit evidence-access budget. We instantiate this setting by augmenting MS MARCO v2.1 with access-friction tiers and evaluate budgeted evidence selection across general-domain and domain-specific QA benchmarks. Our results show that static selection is brittle: no fixed selector uniformly dominates, and larger budgets do not reliably improve answer quality, even when costly evidence is domain-matched. We then study agentic cost-aware RAG, where an LLM decides when to retrieve, which tier to access, and when to stop. Agents show strong promise as adaptive evidence-acquisition controllers, but their behavior remains highly model- and task-dependent. These findings suggest that cost-aware evidence acquisition is a central challenge for the next generation of RAG systems. All code and data are available at https://github.com/Mignonmy/Cost-Aware.
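
As a rough illustration of budgeted evidence selection, the sketch below implements one possible static selector: a greedy pick by relevance per unit cost. The tier names, the costs in `TIER_COST`, the `Passage` type, and the scoring rule are all assumptions for illustration and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    relevance: float  # hypothetical retriever score
    tier: str         # hypothetical access-cost tier


# Hypothetical cost of accessing one passage from each tier.
TIER_COST = {"free": 0, "licensed": 2, "paywalled": 5}


def select_under_budget(passages, budget):
    """Greedy static selector: best relevance per unit cost first."""
    ranked = sorted(
        passages,
        key=lambda p: p.relevance / (TIER_COST[p.tier] + 1),
        reverse=True,
    )
    chosen, spent = [], 0
    for p in ranked:
        cost = TIER_COST[p.tier]
        if spent + cost <= budget:  # skip anything that would break the budget
            chosen.append(p)
            spent += cost
    return chosen, spent


pool = [
    Passage("a", 0.9, "paywalled"),
    Passage("b", 0.6, "free"),
    Passage("c", 0.7, "licensed"),
    Passage("d", 0.4, "licensed"),
]
chosen, spent = select_under_budget(pool, budget=5)
```

With a budget of 5 this selector passes over the single most relevant passage because it is paywalled, which gives a feel for why a fixed rule can behave very differently across budgets and domains.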

2606.02242 2026-06-02 cs.CV cs.AI cs.LG

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

解决基于图像和基于文本的行人重识别之间的优化冲突

Karina Kvanchiani, Timur Mamedov

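AI summary: To address the suboptimal shared representations caused by modality discrepancies and conflicting objectives in image- and text-based person re-identification, proposes a decoupled two-stage training pipeline built on a single vision encoder that avoids cross-task interference; experiments show that image pre-training and textual supervision improve performance on both tasks.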

详情
AI中文摘要

基于图像(I2I)和基于文本(T2I)的行人重识别(ReID)的联合优化受到模态差异和冲突训练目标的阻碍,导致共享表示次优。虽然I2I ReID关注同一人图像间的身份级不变性,但T2I ReID由与独特视觉特征相关的实例特定文本描述驱动。本文探讨了两个ReID任务及其优化过程之间的根本差异,以实现有效训练。由于I2I和T2I ReID通常分开研究,为一种检索设置优化的损失函数可能对另一种所需的表示质量产生负面影响。基于这些发现,我们提出了一种解耦的两阶段训练流程,用于学习跨图像和文本模态的共享表示。该流程基于单个视觉编码器,支持I2I和T2I检索,同时避免训练期间的跨任务干扰。我们在多种配置下进行了大量实验,改变了域混合程序、学习策略和任务目标。我们观察到I2I ReID预训练对T2I数据的泛化能力有积极影响。此外,我们发现视觉编码器训练阶段引入文本监督能提升I2I和T2I性能。我们相信,我们的见解为统一的ReID系统和跨模态检索整体迈出了有意义的一步。

英文摘要

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental differences between the two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observe that I2I ReID pre-training positively impacts the generalization ability to T2I data. In addition, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

2606.02241 2026-06-02 cs.LG

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

BlockGen: 灵活的分块序列建模与混合采样器

Justin Deschenaux, Caglar Gulcehre

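AI summary: Proposes the BlockGen framework, which uses blockwise sequence modeling and AR-informed predictor-corrector sampling to compare uniform-state and masked diffusion in blockwise generation.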

详情
AI中文摘要

均匀态扩散框架是否为离散扩散更强大的范式?最近的研究表明情况可能如此。结合预测-校正采样器,均匀态扩散模型(USDMs)生成的样本质量高于掩码扩散模型(MDMs),并且在下游任务中USDMs与MDMs相当或更优,尽管它们表现出更大的困惑度。两个问题仍未解决。首先,现有工作比较均匀和掩码扩散时使用了无信息的校正器,这些校正器在随机位置重新注入噪声,而不是针对最可能出错的标记。其次,先前的工作比较了全序列扩散模型,因此我们不知道当逐块生成标记时是否得出相同的结论。为了解决这些问题,我们引入了BlockGen,一种分块序列模型,我们使用掩码和均匀扩散两种方式实例化它。BlockGen在混合块大小上训练,其似然性比固定块大小的模型更精细地在自回归和纯扩散之间插值。BlockGen实现了AR-informed预测-校正采样(ARPC),它结合了AR和扩散预测来重新生成不太可能的标记,而无需辅助验证器。在祖先采样下,均匀扩散在逐块设置中优于掩码扩散,尤其是在少步数情况下。在ARPC下,差距缩小并在高NFE时反转。在GSM8K上使用块大小16时,MDMs达到略高于USDMs的准确率,我们在OpenWebText上的生成困惑度中也观察到类似趋势。代码见https://github.com/jdeschena/blockgen。

英文摘要

Is the uniform-state diffusion framework a more powerful paradigm for discrete diffusion? Recent studies indicate that this may be the case. In combination with predictor-corrector samplers, uniform-state diffusion models (USDMs) produce samples of higher quality than masked diffusion models (MDMs), and USDMs equal or outperform MDMs in downstream tasks, even though they exhibit greater perplexity. Two issues remain unresolved. First, existing work compares uniform and masked diffusion with uninformed correctors that re-inject noise at random positions, rather than targeting tokens most likely to be wrong. Second, prior work compares full-sequence diffusion models, so we do not know whether the same conclusion holds when tokens are generated block by block. To address these issues, we introduce BlockGen, a blockwise sequence model that we instantiate with both masked and uniform diffusion. BlockGen trains on a mixture of block sizes and its likelihood interpolates between AR and pure diffusion more finely than models with a fixed block size. BlockGen enables AR-informed predictor-corrector sampling (ARPC), which combines AR and diffusion predictions to re-generate unlikely tokens without an auxiliary verifier. Under ancestral sampling, uniform outperforms masked in the block-by-block setting, especially in the few-step regime. Under ARPC, the gap closes and reverses at high NFE. With block size $16$ on GSM8K, MDMs reach slightly higher accuracy than USDMs, and we observe a similar trend in Generative Perplexity on OpenWebText. Find our code at https://github.com/jdeschena/blockgen.

2606.02237 2026-06-02 cs.LG

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

为什么 DMD 学生懒惰?理解少步蒸馏中的复制行为

Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein

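AI summary: Studies the phenomenon in Distribution Matching Distillation (DMD) whereby high-dimensional student models spontaneously copy the teacher's noise-data pairings, and explains it through limited geometric freedom.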

详情
AI中文摘要

分布匹配蒸馏(DMD)通过在所有尺度上对齐噪声分布,将预训练扩散模型压缩为高效的少步生成器。原则上,这种分布级监督对教师的特定噪声-数据配对保持不可知;这为学生提供了重新映射潜在噪声的自由度,这一行为在低维设置中一直被观察到。令人惊讶的是,我们发现,在高维设置中,蒸馏学生自发地再现了教师的原始噪声-数据配对,我们将这种现象称为复制。我们证明复制既不是对抗性目标的副产品,也不是教师记忆的结果。相反,我们的证据表明,复制是高维蒸馏过程中学生模型几何自由度有限而产生的一种涌现特性。

英文摘要

Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning their noised distributions across all scales. In principle, such distribution-level supervision remains agnostic to specific noise-data pairings of the teacher; this provides the student the freedom to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the original noise-data pairings of the teacher, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of adversarial objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.

2606.02232 2026-06-02 cs.LG

A Doeblin-Anchored Contrastive Chart for Learning Markov Transition Kernels

Doeblin锚定对比图:学习马尔可夫转移核

Ao Xu

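AI summary: Proposes a Doeblin-anchored contrastive coordinate framework that learns valid Markov transition kernels from contrastive objectives and introduces a measurable Markovization operator to guarantee kernel validity, yielding nonparametric convergence rates and finite-horizon error bounds.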

详情
AI中文摘要

学习马尔可夫转移模型不仅仅是条件密度估计:学习到的对象必须是一个有效的转移核,才能在后续动力学中迭代。本文介绍了一种Doeblin锚定对比图,这是一个从统计到动力学的坐标框架,用于从对比目标学习转移核。给定一个重启律和一个锚定强度,该图将目标转移与重启律混合。得到的锚定核同时是一个Doeblin小化马尔可夫核、一个二元对比实验中的正条件律,以及原始转移律的一个显式可逆坐标。我们证明了锚定对比风险识别锚定转移密度,并将超额风险校准为密度误差。由于学习得分的反演可能产生有符号或未归一化的对象,我们引入了一个可测马尔可夫化算子,在保持积分$L^1$精度(最多一个常数因子)的同时恢复核有效性。Oracle不等式和Hölder-ReLU逼近界给出了独立转移对的非参数速率。对于平稳几何$\beta$-混合轨迹,一个保守的稀疏化与耦合扩展以有效样本量提供了相同的重建接口。在显式覆盖下,占用加权扰动界将一步核误差转化为有限时域边际、路径律和占用测度误差。

英文摘要

Learning a Markov transition model is not merely conditional density estimation: the learned object must be a valid transition kernel before it is iterated in downstream dynamics. This paper introduces a Doeblin-anchored contrastive chart, a statistical-to-dynamical coordinate framework for learning transition kernels from contrastive objectives. Given a restart law and an anchor strength, the chart mixes the target transition with the restart law. The resulting anchored kernel is simultaneously a Doeblin-minorized Markov kernel, the positive conditional law in a binary contrastive experiment, and an explicitly invertible coordinate for the original transition law. We prove that the anchored contrastive risk identifies the anchored transition density and calibrates excess risk to density error. Since inversion of a learned score may produce a signed or unnormalized object, we introduce a measurable Markovization operator that restores kernel validity while preserving integrated $L^1$ accuracy up to a constant factor. Oracle inequalities and Hölder--ReLU approximation bounds yield nonparametric rates for independent transition pairs. For stationary geometrically $\beta$-mixing trajectories, a conservative thinning-and-coupling extension yields the same reconstruction interface with an effective sample size. Occupancy-weighted perturbation bounds transfer one-step kernel error to finite-horizon marginal, path-law, and occupation-measure errors under explicit coverage.
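
The mixing and inversion steps can be sketched on a finite state space. The sketch assumes the anchored kernel is the convex mixture (1 - a) P + a ν of the transition matrix P and the restart law ν; the abstract does not spell out the exact form. The `markovize` function is a simple clip-and-renormalize stand-in, not the paper's Markovization operator, and all names are hypothetical.

```python
def anchor(P, nu, a):
    """Mix each row of transition matrix P with restart law nu (assumed convex mixture)."""
    return [[(1 - a) * p + a * v for p, v in zip(row, nu)] for row in P]


def invert(K, nu, a):
    """Recover the original transition matrix from an anchored one."""
    return [[(k - a * v) / (1 - a) for k, v in zip(row, nu)] for row in K]


def markovize(Q):
    """Stand-in repair step: clip negatives, then renormalize each row."""
    repaired = []
    for row in Q:
        clipped = [max(q, 0.0) for q in row]
        total = sum(clipped)
        if total > 0:
            repaired.append([q / total for q in clipped])
        else:
            repaired.append([1.0 / len(row)] * len(row))  # fall back to uniform
    return repaired


P = [[0.9, 0.1], [0.2, 0.8]]
nu = [0.5, 0.5]
a = 0.2

K = anchor(P, nu, a)            # every entry is at least a * nu[j] (minorization)
P_back = invert(K, nu, a)       # exact inversion returns P up to rounding
noisy = [[1.05, -0.05], [0.3, 0.6]]  # a signed, unnormalized estimate
valid = markovize(noisy)        # rows are nonnegative and sum to one
```

Under this assumed form, the lower bound `K[i][j] >= a * nu[j]` is the Doeblin minorization, and the repair step shows why an inverted estimate needs post-processing before it can be iterated as a kernel.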

2606.02231 2026-06-02 stat.ML cs.LG stat.ME

Identifiable Markov Switching Models with Instantaneous Effects and Exponential Families

具有瞬时效应和指数族的可识别马尔可夫切换模型

Roel Hulsman, Carles Balsells-Rodas, Sara Magliacane

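AI summary: For non-stationary time series, proposes an identifiability theory for Markov switching models with instantaneous effects under exponential-family noise, and develops the FlowMSM framework for detecting latent regimes and recovering causal structures.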

详情
Comments
International Conference on Machine Learning (ICML) 2026
AI中文摘要

时间系统通常表现出非平稳行为,例如季节性气候变化或1型糖尿病患者的血糖波动。对非平稳性建模的一种方法是通过离散隐状态,即时间的平稳片段。此类系统诱导出马尔可夫切换模型(MSM),这是一类隐马尔可夫模型,其中隐状态和观测变量之间存在自回归依赖关系。在存在频繁状态切换以及非线性和非高斯动态的情况下,特别是在变量之间存在瞬时效应(例如由于测量速率较慢)时,识别隐状态具有挑战性。在这项工作中,我们建立了在时间状态依赖、非线性滞后和瞬时效应以及来自指数族的独立噪声下,隐状态和状态依赖因果结构的可识别性。我们的可识别性理论涵盖了因果模型的非时间混合。此外,我们引入了FlowMSM,这是一个状态检测框架,可与任何平稳因果发现方法配对,以恢复状态依赖的因果结构。在合成基准和金融经济学数据集上的实验证明了我们的方法在检测隐状态和从非平稳时间序列中发现因果结构方面的有效性。

英文摘要

Temporal systems often exhibit non-stationary behaviour, such as seasonal climate variation or glucose fluctuations in patients with type-1 diabetes. One way to model non-stationarity is through discrete latent regimes, i.e., stationary segments of time. Such systems induce a Markov Switching Model (MSM), a class of Hidden Markov Models with autoregressive dependencies among latent regimes and observed variables. Identifying latent regimes is challenging in the presence of frequent regime switches and nonlinear and non-Gaussian dynamics, particularly when there are instantaneous effects between the variables, e.g., due to slow rates of measurements. In this work, we establish the identifiability of both latent regimes and regime-dependent causal structures under temporal regime dependencies, nonlinear lagged and instantaneous effects, and independent noise from the exponential family. Our identifiability theory subsumes non-temporal mixtures of causal models. Furthermore, we introduce FlowMSM, a regime detection framework that can be paired with any stationary causal discovery method to recover regime-dependent causal structures. Experiments on synthetic benchmarks and a financial economics dataset demonstrate the effectiveness of our approach to detect latent regimes and discover causal structures from non-stationary time series.

2606.02228 2026-06-02 stat.ML cs.CV cs.LG

Bayesian meta-learning for modeling Alzheimer's disease progression

贝叶斯元学习用于阿尔茨海默病进展建模

Clara Hoffmann, Nadja Klein

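AI summary: Proposes a Bayesian meta-learning method that uses an individual's historical MRI volumes and disease trajectory to predict the distribution of disease scores, predicting dynamically without retraining and reducing overconfidence in long-term predictions.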

详情
AI中文摘要

预测阿尔茨海默病患者将经历轻度还是重度疾病进展对于个性化治疗至关重要。通常,临床医生试图预测离散疾病评分的分布,条件是个体当前的MRI体积及其历史疾病轨迹。经典的统计回归模型和单任务神经网络不适合此目的,因为拟合单独模型不可行(每个个体通常只有少量观测),而忽略个体间相关性会导致泛化能力差。相比之下,元学习提供了一种自然的方法来动态预测分布,无需重新训练,并能建模结果与协变量之间的非线性关系。受此启发,我们提出了一种贝叶斯元学习器,它在多个个体上训练,但根据每个个体的历史数据定制预测的疾病评分分布。我们的模型无需重新训练即可预测未见过的个体,与历史观测数量呈线性扩展,并且在预测长期疾病评分时,与确定性对应模型相比,保证更少的过度自信。在阿尔茨海默病神经影像学倡议(ADNI)数据库的真实世界数据上,我们的模型在性能上与单任务模型和确定性元学习器相当,同时在预测长期疾病进展时显著提高了性能。

英文摘要

Predicting whether an individual with Alzheimer's disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual's current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual's historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart. On real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.

2606.02223 2026-06-02 cs.LG math.ST stat.ME stat.TH

Network Learning with Semi-relaxed Gromov-Wasserstein

半松弛Gromov-Wasserstein的网络学习

Charles Dufour, Ulysse Naepels, Leonardo V. Santoro

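AI summary: To address missing node labels when estimating the generative mechanism of large-scale networks, proposes a semi-relaxed Gromov-Wasserstein objective that relaxes the assignment problem through probabilistic couplings and is solved with a block-coordinate conditional gradient algorithm; proves that the optimality gap between the relaxed solution and the deterministic assignment vanishes at rate O(1/n), giving consistency and minimax-optimal convergence rates for stochastic block models and Hölder-smooth graphons.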

详情
AI中文摘要

估计大规模网络的生成机制是统计机器学习中的一个基本挑战。由于缺乏规范的节点标签,识别潜在连接结构通常是一个NP难的组合问题。我们通过允许概率耦合来应对这一挑战,从而松弛了分配问题。我们的估计框架可以表述为半松弛Gromov-Wasserstein目标,并提供了生成结构的低维表示。我们通过块坐标条件梯度算法求解该问题。尽管进行了松弛,但所得解通常是确定性的:事实上,我们证明了松弛解与确定性分配之间的最优性差距以$O(1/n)$的速率消失,其中$n$是节点数。这使得潜在模型的可处理恢复成为可能,并能够进行严格的统计分析:我们为随机块模型和Hölder光滑图模型建立了相合性和极小化最优收敛速率。我们的实现在合成和真实数据集上均展示了随$n$的高效扩展能力。

英文摘要

Estimating the generative mechanism of large-scale networks is a fundamental challenge in statistical machine learning. It requires the identification of the latent connectivity structure, which is in general an NP-hard combinatorial problem due to the absence of canonical node labels. We address this challenge by allowing for probabilistic couplings, thereby relaxing the assignment problem. Our estimation framework can be formulated as a semi-relaxed Gromov-Wasserstein objective and provides a low-dimensional representation of the generative structure. We solve this via a block-coordinate conditional gradient algorithm. Despite the relaxation, the resulting solution is typically deterministic: in fact, we show that the optimality gap between the relaxed solution and the deterministic assignment vanishes at rate $O(1/n)$, where $n$ is the number of nodes. This allows for tractable recovery of the underlying model and enables rigorous statistical analysis: we establish consistency and minimax-optimal convergence rates for both stochastic block models and Hölder-smooth graphons. Our implementation scales efficiently with $n$, as demonstrated on both synthetic and real-world datasets.

2606.02221 2026-06-02 cs.CV cs.LG

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

CORE-MTL: 通过因果正交表示重新思考梯度平衡

Chengfeng Wu, Tao Zou, Yanru Wu, Jingge Wang

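AI summary: Proposes the CORE-MTL framework, which uses causal orthogonal representations to factorize the shared representation into a semantic stream and a residual stream, separating task-relevant structure from spurious context to reduce negative transfer and improve generalization.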

详情
Comments
Accepted by ICML 2026
AI中文摘要

多任务学习旨在通过跨领域共享共同表示来构建联合模型。为实现这一目标,现有的优化中心方法要么平衡任务梯度,要么修改共享架构。然而,由于这些方法对共享表示的内容不可知,它们无法将任务相关结构与虚假上下文分离,导致负迁移和泛化能力差。为克服这一限制,我们提出了用于多任务学习的因果正交表示(CORE-MTL),这是一个因果驱动的表示中心框架,鼓励对共享表示进行结构化的语义-残差分解,将任务相关结构集中在语义流中,而将干扰变化归入残差流。我们通过利用结构化场景的物理先验和属性的统计约束,在视觉领域实例化了该框架。理论上,我们的方法比优化中心方法具有更紧的分布外泛化界,并且无需显式梯度投影或重新加权即可减少任务梯度干扰。实验上,CORE-MTL在视觉多任务基准测试中,在分布内和分布外设置下均持续优于现有方法。代码公开于 https://github.com/Hope-Rita/CORE-MTL。

英文摘要

Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approaches remain agnostic to the content of the shared representation, they fail to disentangle task-relevant structure from spurious context, leading to negative transfer and poor generalization. To overcome this limitation, we propose Causal Orthogonal Representations for Multi-Task Learning (CORE-MTL), a causally motivated representation-centric framework that encourages a structured semantic-residual factorization of the shared representation, concentrating task-relevant structure in the semantic stream while relegating nuisance variation to the residual stream. We instantiate this framework in the visual domain by leveraging physical priors for structured scenes and statistical constraints for attributes. Theoretically, our method enjoys a tighter out-of-distribution generalization bound than optimization-centric methods and reduces task gradient interference without explicit gradient projection or reweighting. Empirically, CORE-MTL consistently outperforms existing methods on visual multi-task benchmarks in both in-distribution and out-of-distribution settings. Code is publicly available at https://github.com/Hope-Rita/CORE-MTL.

2606.02219 2026-06-02 cs.CV

Symmetry-Aware 9D Pose Estimation with Sim(3)-Consistent Feature and Spherical Inception Convolution

对称感知的9D姿态估计:Sim(3)一致特征与球形Inception卷积

Panfei Cheng, Hongshan Yu, Wenrui Chen, Xiaojun Tang, Jian Liu, Naveed Akhtar

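AI summary: Proposes a category-level object pose estimation method that fuses features through a semantic-guided symmetry-aware module and a spherical large-kernel inception convolution, achieving accurate translation/size estimation without shape priors and robust rotation estimation, with state-of-the-art performance on benchmarks and real-world scenes.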

详情
Comments
12 pages, 7 figures
AI中文摘要

物体姿态估计是智能系统感知或操作图像/视频中物体的基本问题。然而,当前的实例级方法难以泛化到未见物体。类别级方法试图解决这一问题,但仍受限于非线性Sim(3)空间的学习复杂性和类内变化。为应对这些挑战,我们提出一种有效的类别级物体姿态估计方法,包含两项关键创新:(1) 一个平移/尺寸估计器,具有语义引导的对称感知模块,利用大型视觉模型(LVM)的鲁棒泛化能力推断对称点,从而无需形状先验即可获得精确的平移和尺寸。该结果作为旋转估计的预计算线索,降低了在非线性Sim(3)空间学习的难度,并为处理更具挑战性的旋转估计奠定坚实基础。(2) 一个特征融合模块,基于我们提出的球形大核Inception卷积,将LVM的语义特征与系统计算的几何特征融合,通过建模长程依赖关系以较低计算成本从类内变化中提取关键姿态特征。基于这些创新,我们在基准和真实场景中达到最优性能,并开发了一个能够处理多样物体的鲁棒机器人抓取系统。我们的代码将在项目页面提供:https://panfei-cheng.github.io/SSH-Pose。

英文摘要

Object pose estimation is a fundamental problem for an agent system to perceive or manipulate objects in images or videos. However, current instance-level methods struggle with generalization to unseen objects. Category-level methods seek to address this, but remain constrained by the complexities of learning in the non-linear Sim(3) space and intra-class variations. To address these challenges, we propose an effective method for category-level object pose estimation with two key innovations: (1) A translation/size estimator, featuring a semantic-guided symmetry-aware module that leverages robust generalization capabilities of a large vision model (LVM) to infer symmetry points, resulting in accurate translation and size without shape priors. This result serves as a precomputed cue for rotation estimation, thereby reducing the difficulty of learning in the non-linear Sim(3) space and laying a robust foundation for tackling the inherently more challenging rotation estimation. (2) A feature fusion module, based on our proposed spherical large-kernel inception convolution, fuses semantic features from the LVM with systematically computed geometric features to extract essential pose features from intra-class variations by modeling long-range dependencies without excessive computational cost. Built on these innovations, we achieve SOTA on benchmarks and real-world scenes, while developing a robust robotic picking system capable of handling diverse objects. Our code will be available at the project page: https://panfei-cheng.github.io/SSH-Pose.

2606.02218 2026-06-02 cs.LG cs.AI

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过感知掉队者的组大小调整实现更快的同步在线策略强化学习

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar

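AI summary: Proposes SAGC, a dynamic group-size controller that adjusts group size through online constrained optimization, reducing straggler events in synchronous on-policy reinforcement learning and improving wall-clock efficiency while maintaining or improving training reward and model quality.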

详情
AI中文摘要

同步强化学习方法如组相对策略优化(GRPO)提供稳定且可复现的在线策略训练,但极易受到掉队者的影响——单个异常长的轨迹可能延迟整个组的奖励计算和参数更新。随着组大小增加,这个问题变得更加严重,在更大组的好处与同步停滞的墙钟成本之间产生矛盾。我们提出感知掉队者的组控制(SAGC),一种动态组大小控制器,根据观察到的轨迹行为在线调整训练组。SAGC将组大小选择形式化为一个在线约束优化问题,旨在保留更大组的好处,同时控制掉队者事件的长期发生率。在同步GRPO和DAPO训练中,以及在普通和强工程基线上,SAGC一致地减少了掉队者发生率并提高了墙钟效率,同时实现了有竞争力或更好的训练奖励。我们进一步表明这些收益转化为最终模型质量:在下游推理基准上,SAGC与最强的静态组大小基线相比具有竞争力或更好,并且通常在没有显式长度惩罚的情况下产生更短的输出。这些结果将动态组控制定位为使同步在线策略强化学习更高效和更稳健的实用方法。

英文摘要

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers: a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
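
The straggler problem and the idea of adapting group size online can be sketched as follows. This is a toy illustration, not SAGC: the straggler definition (slowest rollout more than `ratio` times the median), the one-step grow/shrink rule, and the names `is_straggler_event` and `ToyGroupController` are all assumptions.

```python
from statistics import median


def sync_step_time(rollout_times):
    """A synchronous step finishes only when the slowest rollout does."""
    return max(rollout_times)


def is_straggler_event(rollout_times, ratio=2.0):
    # Hypothetical definition: the slowest rollout exceeds ratio x the median.
    return max(rollout_times) > ratio * median(rollout_times)


class ToyGroupController:
    """Shrink the group when the running straggler rate is above target, else grow it."""

    def __init__(self, size, min_size, max_size, target_rate):
        self.size = size
        self.min_size = min_size
        self.max_size = max_size
        self.target_rate = target_rate
        self.events = 0
        self.steps = 0

    def update(self, rollout_times):
        self.steps += 1
        self.events += int(is_straggler_event(rollout_times))
        rate = self.events / self.steps  # long-run straggler rate so far
        if rate > self.target_rate:
            self.size = max(self.min_size, self.size - 1)
        else:
            self.size = min(self.max_size, self.size + 1)
        return self.size


ctrl = ToyGroupController(size=8, min_size=2, max_size=16, target_rate=0.25)
first = ctrl.update([10, 11, 9, 50])   # one rollout dominates the step
second = ctrl.update([10, 11, 9, 12])  # balanced step
```

In the first step the group waits 50 time units for a rollout that is roughly five times longer than the others, which is the stall that a dynamic controller tries to make rarer while keeping groups as large as the target rate allows.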

2606.02215 2026-06-02 cs.CL cs.SI

Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes

经验更优:用于证据基础健康社区笔记的自进化LLM智能体

Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang, Preslav Nakov, Bryan Hooi, Min-Yen Kan, Jiaying Wu

AI总结 提出EvoNote框架,通过自进化经验记忆和细粒度信用分配,在健康社区笔记生成中实现证据获取、分析与撰写,显著提升笔记质量并缩短生成时间。

详情
AI中文摘要

大型语言模型增强的社区笔记为社交媒体上健康错误信息的及时、基于证据的纠正提供了一条可扩展的路径。然而,它们仍然在每条帖子后重置,留下了先前案例中未使用的有用纠正经验。我们引入了EvoNote,一个智能体框架,通过先前错误信息纠正事件的自进化经验记忆,使健康社区笔记生成能够自我进化。其核心是细粒度信用分配:EvoNote将轨迹级反馈基于健康特定的笔记质量进行归因,并将其提炼为行动级记忆,用于声明分析、证据获取和笔记撰写。我们在MM-HealthCN上评估了EvoNote,这是一个包含1200个实例的多模态基准测试,包括用户标记的健康帖子、人工撰写的社区笔记和众包有用性标签。在一个人工验证的分层效用评判器下,EvoNote生成的笔记在89.6%的案例中优于对应的人工笔记;在一组没有众包有用性判定的“需要更多评分”帖子中,EvoNote为82.0%的案例生成了有用的笔记。它还将生成候选纠正所需的中位时间从人工笔记流程的超过13小时减少到不到2分钟。分析将这些收益归因于更强的证据使用和可复用的纠正策略,将自进化笔记生成定位为健康错误信息治理的一种有前景的范式。

英文摘要

Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.

2606.02214 2026-06-02 cs.CL

Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark

性别线索是否影响LLM价值权衡?来自受控决策基准的证据

Yangyang Liu, Dong Yu, Pengyuan Liu

AI总结 通过构建受控基准RVDB,研究性别线索是否导致大语言模型在价值决策中产生系统性翻转,并发现性别效应集中在价值边界模糊和决策严重性高的情境下,且模型自我归因常掩盖性别影响。

详情
AI中文摘要

大型语言模型越来越多地用于价值敏感的决策场景,在这些场景中,无关的人口统计线索不应改变判断。我们构建了现实价值决策基准(RVDB),这是一个受控基准,仅改变角色-性别配置,同时保持场景、有序价值对、角色、候选决策、价值距离和决策严重性固定。使用跨七个模型的位置平衡评估,我们测试模型是否在性别扰动下保持决策不变性,以及它们的自我归因是否反映观察到的行为变化。我们发现,显式性别线索会引起有限但系统的决策翻转,包括在显式性别归因提示下,该提示要求模型报告性别是否影响其选择。跨性别角色交换揭示了一致的女性提议决策不对称性,而模型通常将翻转的决策归因于无影响或其他非性别因素。进一步分析表明,性别效应集中在价值边界较不明确和决策背景更严重的情况下,表明性别线索作为局部边界移动因素,而非价值推理的全局覆盖。价值排名基本保持稳定,但有序价值对权衡在不同角色-性别配置中不均匀地变化。这些结果表明,性别可以在行为上进入LLM价值权衡,同时在自我归因中被掩盖,这促使在基于解释的评估之外进行受控行为审计。

英文摘要

Large language models are increasingly used in value-sensitive decision settings, where irrelevant demographic cues should not alter judgments. We construct the Realistic Value Decision Benchmark (RVDB), a controlled benchmark that varies only the role-gender configuration while holding the scenario, ordered value pair, roles, candidate decisions, Value Distance, and Decision Severity fixed. Using a position-balanced evaluation across seven models, we test whether models preserve decision invariance under gender perturbations and whether their self-attributions reflect observed behavioral changes. We find that explicit gender cues induce bounded but systematic decision flips, including under an explicit gender-attribution prompt that asks models to report whether gender influenced their choice. Cross-gender role swaps reveal a consistent female-proposed-decision asymmetry, while models often attribute flipped decisions to No Influence or other non-gender factors. Further analysis shows that gender effects concentrate near less determinate value boundaries and under more severe decision contexts, suggesting that gender cues act as local boundary-shifting factors rather than global overrides of value reasoning. Value rankings remain largely stable, but ordered value-pair trade-offs shift unevenly across role-gender configurations. These results show that gender can enter LLM value trade-offs behaviorally while remaining obscured in self-attribution, motivating controlled behavioral audits beyond explanation-based evaluation.

2606.02212 2026-06-02 cs.SD

C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification

C2GA:一种用于呼吸音分类的类别可控生成式增强框架

Ziqi Ma, Mengyu Han, Anteng Cai, Zhanchong Liu, Bowen Feng, Hang Yu, Sheng Hu

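AI summary: To address limited data, heavy noise, and class imbalance in respiratory sound classification, proposes C2GA, a class-controllable generative augmentation framework based on a conditional VQ-VAE and a Transformer autoregressive prior, enabling high-fidelity, semantically consistent sample generation.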

详情
Comments
18 pages, 5 figures, submitted to Computer Methods and Programs in Biomedicine
AI中文摘要

背景:呼吸音分类在肺部病理的临床识别中起着关键作用。然而,其性能常受限于真实听诊数据集的规模小、噪声严重和类别不平衡。尽管传统的音频增强技术易于实现,但它们可能无意中扭曲微妙的病理特征。同时,现有的基于变分自编码器(VAE)或生成对抗网络(GAN)的生成方法往往面临样本保真度有限和对类别语义的可控性不足的问题,特别是在监督稀缺的情况下。方法:为克服这些限制,我们提出C2GA,一个类别可控的生成式增强框架。C2GA首先使用条件向量量化变分自编码器(VQ-VAE)构建一个语义丰富的离散潜在空间,其中局部声学标记与全局类别原型显式解耦。随后,训练一个基于Transformer的自回归先验以生成标签一致的标记序列。这些生成的标记随后与相应的类别原型融合,并解码为高保真Mel频谱图用于数据增强。结论:这些结果表明,C2GA为呼吸音分析提供了一种有效且语义可靠的增强策略。通过实现可控且高质量的数据生成,所提框架为提高真实临床场景中呼吸音分类的鲁棒性和泛化能力提供了一种有前景的解决方案。

英文摘要

Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision. Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation. Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
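
The discrete latent space of a VQ-VAE rests on a nearest-codebook lookup, sketched below in plain Python. This shows only the generic vector-quantization step; the codebook values, the function names, and the two-dimensional toy vectors are assumptions, and the class-prototype fusion described in the abstract is not modeled here.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for v in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each discrete token."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 2-D codebook and encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
encoded = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

tokens = quantize(encoded, codebook)
reconstructed = dequantize(tokens, codebook)
```

Once audio frames are reduced to token indices like these, an autoregressive prior can be trained over the token sequences, which is the role the abstract assigns to the Transformer.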