arXivDaily: arXiv daily digest, updated Monday through Friday
2606.00093 2026-06-02 cs.CL cs.HC physics.data-an

Agreement Metrics for LLM-as-Judge Evaluation: What to Report and Why

Delip Rao, Chris Callison-Burch

Affiliation: University of Pennsylvania

AI summary: Surveying 24 recent papers, this work shows that most agreement metrics are redundant under binary judging criteria, highlights that Cohen's κ adds extra information, and provides a reporting checklist.

Comments
12 pages

Abstract

Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, $F_1$, Cohen's $κ$, and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria -- the common case in rubric-based evaluation, where each criterion is graded MET or UNMET -- most of the reported numbers are redundant: Pearson's $r$, Spearman's $ρ$, Kendall's $τ_b$, the phi coefficient $ϕ$, and the Matthews Correlation Coefficient all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence. Cohen's $κ$ is the one agreement coefficient that adds information: it shares $ϕ$'s numerator but normalizes differently, and the gap between them measures how far the judge's positive-label rate has drifted from the human's. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences. The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss' $κ$ or Krippendorff's $α$. We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.
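
The binary equivalence and the gap between $κ$ and $ϕ$ described in this abstract can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and both confusion matrices are hypothetical, chosen only to contrast a judge whose positive-label rate matches the human's with one whose rate has drifted.

```python
import math

def phi_and_kappa(tp, fp, fn, tn):
    """phi (identical to MCC) and Cohen's kappa for a 2x2 judge-vs-human table."""
    judge_pos, judge_neg = tp + fp, fn + tn
    human_pos, human_neg = tp + fn, fp + tn
    numerator = tp * tn - fp * fn  # shared by phi and kappa
    # Assumes non-degenerate data: every marginal count is non-zero.
    phi = numerator / math.sqrt(judge_pos * judge_neg * human_pos * human_neg)
    kappa = 2 * numerator / (judge_pos * human_neg + judge_neg * human_pos)
    return phi, kappa

def pearson(xs, ys):
    """Plain Pearson correlation, used to confirm r == phi on 0/1 labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def expand(tp, fp, fn, tn):
    """Turn confusion counts back into paired 0/1 label lists (judge, human)."""
    judge = [1] * tp + [1] * fp + [0] * fn + [0] * tn
    human = [1] * tp + [0] * fp + [1] * fn + [0] * tn
    return judge, human

matched = (40, 10, 10, 40)  # hypothetical: judge and human both label 50% positive
drifted = (45, 25, 5, 25)   # hypothetical: judge labels 70% positive, human 50%

for name, table in [("matched", matched), ("drifted", drifted)]:
    phi, kappa = phi_and_kappa(*table)
    r = pearson(*expand(*table))
    print(f"{name}: phi={phi:.4f} pearson={r:.4f} kappa={kappa:.4f}")
```

In the matched table the two coefficients coincide; in the drifted table $κ$ falls below $ϕ$, which is the gap the abstract attributes to differing positive-label rates.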

2606.00092 2026-06-02 cs.CV cs.AI

Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization

Devansh Lalwani, Swapnil Bhat, Maulik Shah

Affiliation: Turocrates AI Private Limited

AI summary: To address inaccurate attention-map localization in weakly-supervised whole-slide image classification, this work proposes a consistency training method that combines cellular sheaves with attention, reaching a patch-level AUC of 0.940 on Camelyon16 and raising the attention AUC from 0.717 to 0.953.

Abstract

Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation features now reaches near-saturation on Camelyon16 slide-level performance, but the corresponding attention maps are an imperfect localization signal: in clinical interpretation, a model that classifies correctly without firing on the actual lesion is hard to trust. We address this gap with cellular sheaves, which equip each vertex and edge of a graph with a finite-dimensional vector space and consistent linear maps between them, providing a principled way to detect local disagreement on graph-structured data. We apply cellular sheaves to weakly-supervised tumour localization on whole-slide images, combining a sheaf disagreement field with ABMIL. The natural training objective, encouraging consistency between similar features, produces a disagreement field that tracks tissue-level texture rather than diagnostic content. We propose attention-conditional consistency, which uses the classifier's attention to define which neighbouring patches should agree. Joint training of the classifier and the sheaf under this objective produces a disagreement field with patch-level AUC 0.940 on Camelyon16 and raises the attention head from its ABMIL-alone level of 0.717 to 0.953. Two-stage ablation with the classifier frozen at its ABMIL values reaches only 0.727 on the disagreement field and leaves attention at 0.717, confirming that the gain comes from the projector co-adapting under both objectives, not from the loss change in isolation. The trained model transfers without retraining to annotated slides from Camelyon17, maintaining Delta AUC 0.932 +/- 0.083 and attention AUC 0.955 +/- 0.099. The result is an attention map and a sheaf-disagreement map that fire on the same diagnostic regions, giving clinicians two complementary explanations for each slide-level prediction.
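
To make the notion of a sheaf disagreement field concrete, here is a minimal sketch using the standard edge-wise definition: each endpoint's feature is mapped into the edge's space by its restriction map, and the squared difference is the disagreement. The helper names, the toy graph, and the identity restriction maps are assumptions for illustration; the paper's learned maps and its attention-conditional objective are not reproduced here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def edge_disagreement(map_u, x_u, map_v, x_v):
    """Squared distance between the two endpoint features in the edge space."""
    pu, pv = matvec(map_u, x_u), matvec(map_v, x_v)
    return sum((a - b) ** 2 for a, b in zip(pu, pv))

def disagreement_field(features, edges, maps):
    """Accumulate each edge's disagreement onto both incident vertices (patches)."""
    field = {v: 0.0 for v in features}
    for (u, v) in edges:
        map_u, map_v = maps[(u, v)]
        d = edge_disagreement(map_u, features[u], map_v, features[v])
        field[u] += d
        field[v] += d
    return field

identity = [[1.0, 0.0], [0.0, 1.0]]
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}  # toy patch features
edges = [(0, 1), (1, 2)]
maps = {e: (identity, identity) for e in edges}  # hypothetical restriction maps

field = disagreement_field(features, edges, maps)
print(field)
```

Patches 0 and 1 agree, so only the edge to patch 2 contributes; a trained sheaf would replace the identity maps with learned ones.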

2606.00091 2026-06-02 cs.CL cs.AI

DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models

Sangdae Nam

Affiliation: arXiv.org cs.CL

AI summary: Proposes DLLM-JEPA, which pairs Joint Embedding Predictive Architectures with masked-diffusion language models, removing the need for explicit multi-view data and for two gradient-carrying forward passes, and improving accuracy on several tasks while reducing training FLOPs.

Comments
17 pages, 4 figures, 13 tables. Accepted at SPIGM Workshop, ICML 2026

Abstract

Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates -- no explicit pairs needed -- and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds -- whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
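
The idea of deriving two views from one input through different masking rates can be sketched as follows. This is only an illustration of the data side: the mask token, the two rates, and the function names are assumptions, and the encoder, predictor, and loss used in the paper are not shown.

```python
import random

MASK = "[MASK]"  # placeholder mask token (assumed)

def mask_tokens(tokens, rate, rng):
    """Independently replace each token with MASK with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]

def two_views(tokens, light_rate=0.15, heavy_rate=0.6, seed=0):
    """Build a lightly masked and a heavily masked view of the same sequence.

    Both rates are hypothetical defaults, not values from the paper.
    """
    rng = random.Random(seed)
    light = mask_tokens(tokens, light_rate, rng)
    heavy = mask_tokens(tokens, heavy_rate, rng)
    return light, heavy

tokens = "the cat sat on the mat".split()
light, heavy = two_views(tokens)
print(light)
print(heavy)
```

No paired data (such as text-code pairs) is needed, because both views come from the same sequence.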

2606.00090 2026-06-02 cs.RO cs.AI

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Barak Or

Affiliation: STATE16

AI summary: Reviews silent failures in Physical AI systems, where black-box models issue physical actions that look plausible but are wrong, and proposes a taxonomy of runtime guardrails together with evaluation requirements.

Comments
23 pages

Abstract

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-language-action models, and world-model-based autonomous systems can condition decisions that move vehicles, robots, drones, and industrial machines. This transition exposes a safety problem that is not fully captured by conventional AI content moderation or by classical robot safety alone: a black-box model may issue a physically consequential action while appearing confident, plausible, and semantically aligned. The resulting failure can be silent, arising from sensor drift, occlusion, state-estimation error, distribution shift, hallucinated affordances, or invalid physical assumptions before downstream hardware controllers detect a violation. Across embodied foundation models, world models, robotics simulation, embodied safety benchmarks, safe control, runtime assurance, uncertainty estimation, verification, and guardrail evaluation, model capability and safety mechanisms have advanced along largely separate technical tracks. A recurring gap synthesized here is that no single stream surveyed in this review supplies a complete runtime authorization boundary between black-box Physical AI models and physical execution. The resulting analysis develops a bounded problem formulation, a definition of silent physical-action failure, a taxonomy of runtime guardrail functions, and evaluation requirements for comparing guardrails as Physical AI assurance mechanisms.

2606.00089 2026-06-02 cs.RO cs.AI

Can Predicted Dynamics Exist in the Physical World?

Barak Or

Affiliations: STATE16; Technion - Israel Institute of Technology; Reichman University; Google-Reichman AI Tech School

AI summary: Proposes physical admissibility as a prediction-control interface that evaluates whether decoded proposals are physically executable using kinematic, dynamic, and direct-to-composed horizon conditions; experiments show that it identifies invalid proposals while preserving task progress.

Comments
17 pages

Abstract

Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imply that a particular proposal is physically executable. We formulate physical admissibility as a prediction-control interface: before execution, a decoded proposal is treated as candidate dynamics and evaluated using kinematic, dynamic, and direct-to-composed horizon conditions. Passing is not a certificate of task success; rejection identifies violation of the specified physical envelope and gives a component-level reason. On Hugging Face LeRobot PushT, controlled falsification shows that one-step prediction-RMSE and standardized dynamics residuals reach area under the receiver operating characteristic curve (AUC) 0.982 and 0.972, kinematic-only conditions reach AUC 0.592, and the full gate reaches AUC 0.957 with condition-level attribution. In replay-based intervention experiments, residual-based filters and the full physical-admissibility gate prevent 87-89% of invalid proposals while preserving mean progress near 0.998.
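
A gate that rejects a proposal with a component-level reason might look like the sketch below. It is a simplified, one-dimensional illustration under assumed conditions (a speed limit and an acceleration limit obtained by finite differences); the names, limits, and checks are hypothetical and do not reproduce the paper's kinematic, dynamic, or horizon conditions.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    admissible: bool
    reasons: list = field(default_factory=list)

def check_rollout(positions, dt, v_max, a_max):
    """Check a proposed 1-D position rollout against simple physical limits."""
    velocities = [(b - a) / dt for a, b in zip(positions, positions[1:])]
    accelerations = [(b - a) / dt for a, b in zip(velocities, velocities[1:])]
    reasons = []
    if any(abs(v) > v_max for v in velocities):
        reasons.append("kinematic: speed limit exceeded")
    if any(abs(a) > a_max for a in accelerations):
        reasons.append("dynamic: acceleration limit exceeded")
    # Passing only means the stated envelope was respected, not task success.
    return Verdict(admissible=not reasons, reasons=reasons)

smooth = check_rollout([0.0, 1.0, 2.0, 3.0], dt=1.0, v_max=2.0, a_max=5.0)
jumpy = check_rollout([0.0, 1.0, 10.0], dt=1.0, v_max=2.0, a_max=5.0)
print(smooth)
print(jumpy)
```

Returning the list of violated conditions, rather than a bare boolean, is what allows a rejection to be attributed to a specific component.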

2606.00087 2026-06-02 cs.CV cs.AI

Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome

Chen Zhan, Yingchen Wei, Xiaoyu Tan, Jingjing Huang, Xihe Qiu

Affiliations: School of Electronic and Electrical Engineering, Shanghai University of Engineering Science; Tencent Youtu Lab; ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital of Fudan University; National University of Singapore

AI summary: Proposes the EviOSAHS framework, which decomposes facial images into seven anatomical queries, generates structured evidence cards, and combines them with clinical information for high-sensitivity OSAHS screening.

Abstract

Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
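
The structured evidence card lends itself to a small data-structure sketch. The field names below follow the six items listed in the abstract, but their types, the example values, and the `parse_card` helper are assumptions made for illustration.

```python
from dataclasses import dataclass

# The seven fixed anatomical queries named in the abstract.
ANATOMY_QUERIES = ["neck", "chin", "mouth", "face/neck fat",
                   "lower jaw", "midface", "nose"]

@dataclass
class EvidenceCard:
    target_anatomy: str
    visibility: str         # assumed categorical, e.g. "high" or "low"
    risk_direction: str     # assumed, e.g. "increases", "decreases", "neutral"
    evidence_strength: str  # assumed, e.g. "weak", "moderate", "strong"
    confidence: float       # assumed to lie in [0, 1]
    summary: str

def parse_card(anatomy, raw):
    """Build a card from one raw visual answer; raises TypeError if a field is missing."""
    return EvidenceCard(target_anatomy=anatomy, **raw)

# Hypothetical raw answer reused for every query, just to exercise the parser.
raw_answer = {"visibility": "high", "risk_direction": "neutral",
              "evidence_strength": "weak", "confidence": 0.5,
              "summary": "No clear finding."}
cards = [parse_card(a, dict(raw_answer)) for a in ANATOMY_QUERIES]
print(len(cards), cards[0])
```

Seven cards per subject across the 642-subject cohort gives 642 × 7 = 4,494 visual outputs, matching the size of the question-level audit reported in the abstract.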

2606.00086 2026-06-02 cs.RO

Whole-Body Inverse Kinematics with Graph Diffusion

Helong Huang, Kai Tan, Feng Wen, Guowei Huang, Xingyue Quan

Affiliation: Large Model Algorithm Lab, Huawei

AI summary: Proposes GraphDiff-IK, a structure-aware graph diffusion framework for inverse kinematics that represents the robot as a kinematic graph and introduces hierarchical message passing and torso-aware conditioning, achieving accurate and stable IK for multi-branch robots.

Abstract

Inverse kinematics (IK) is a fundamental problem in robotics, requiring the generation of joint configurations that satisfy target end-effector poses. Existing approaches often struggle to generalize across diverse robot morphologies and to effectively model the multi-modal nature of IK, particularly in articulated systems with multiple kinematic branches. In this work, we propose GraphDiff-IK, a structure-aware graph diffusion framework for inverse kinematics. Specifically, we represent the robot as a kinematic graph constructed from the robot URDF, where nodes correspond to actuated joints and edges encode kinematic dependencies. Building upon this representation, we formulate IK as a conditional graph diffusion process that directly generates joint configurations on the robot graph. To better capture structural dependencies in articulated systems, we further introduce a structure-aware graph reasoning framework with hierarchical stage-wise message passing and torso-aware conditioning for multi-branch robots. In addition, we incorporate noisy forward kinematics feedback and task-space supervision to improve geometric consistency during denoising. The proposed framework provides a unified formulation that naturally supports single-arm robots, dual-arm systems, and articulated robots with torso or waist structures. Extensive experiments on diverse robotic platforms demonstrate that the proposed method achieves accurate and stable IK performance while preserving the ability to generate multiple feasible solutions for redundant robotic systems.
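
Building a kinematic graph from a URDF, with actuated joints as nodes and kinematic dependencies as edges, can be sketched with the standard library alone. The toy URDF, the function name, and the edge rule (joint A points to joint B when B's parent link is A's child link) are assumptions for illustration; this simple version does not trace dependencies through fixed joints.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-arm robot with a torso joint and one fixed tool mount.
TOY_URDF = """
<robot name="toy">
  <link name="base"/><link name="torso"/>
  <link name="l_arm"/><link name="r_arm"/><link name="l_tool"/>
  <joint name="torso_yaw" type="revolute">
    <parent link="base"/><child link="torso"/>
  </joint>
  <joint name="l_shoulder" type="revolute">
    <parent link="torso"/><child link="l_arm"/>
  </joint>
  <joint name="r_shoulder" type="revolute">
    <parent link="torso"/><child link="r_arm"/>
  </joint>
  <joint name="l_mount" type="fixed">
    <parent link="l_arm"/><child link="l_tool"/>
  </joint>
</robot>
"""

def kinematic_graph(urdf_text):
    """Return (nodes, edges): actuated joints and parent-to-child joint dependencies."""
    root = ET.fromstring(urdf_text)
    actuated = {}
    for joint in root.findall("joint"):
        if joint.get("type") == "fixed":
            continue  # fixed joints carry no degree of freedom
        parent = joint.find("parent").get("link")
        child = joint.find("child").get("link")
        actuated[joint.get("name")] = (parent, child)
    joint_moving_link = {child: name for name, (_, child) in actuated.items()}
    edges = [(joint_moving_link[parent], name)
             for name, (parent, _) in actuated.items()
             if parent in joint_moving_link]
    return list(actuated), edges

nodes, edges = kinematic_graph(TOY_URDF)
print(nodes)
print(edges)
```

The torso joint ends up as the common parent of both shoulder joints, which is the multi-branch structure that torso-aware conditioning is meant to handle.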

2606.00085 2026-06-02 cs.RO

Balancing Accuracy and Efficiency: Adaptive Dynamics Orchestration for Model Predictive Control

Francesco Cancelliere, Aniket Datar, Giovanni Muscato, Xuesu Xiao

Affiliation: Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI, USA

AI summary: Proposes the Adaptive Dynamics Orchestration (ADO) framework, which evaluates model residuals through online counterfactual rollouts and dynamically selects the dynamics model best suited to the current navigation context, balancing computational efficiency and predictive accuracy.

Comments
8 pages, 7 figures

Abstract

Model Predictive Control (MPC) for autonomous navigation faces a fundamental trade-off between model accuracy and real-time efficiency. High-fidelity dynamics models can accurately predict complex vehicle-terrain interactions during trajectory rollouts, but incur significant computational cost, increasing inference latency and reducing control frequency. Conversely, lightweight models enable fast updates and dense sampling, yet may produce erroneous predictions under safety-critical conditions, potentially leading to catastrophic failures such as vehicle rollover. To address this trade-off, we propose Adaptive Dynamics Orchestration (ADO), a framework that dynamically selects the most appropriate dynamics model for the current navigation context. ADO maintains a library of models spanning diverse accuracy-efficiency profiles and continuously refines terrain-conditioned performance estimates using residual errors from online counterfactual rollouts, where executed control actions are replayed across the model library to assess predictive discrepancy. These estimates guide model selection in real time, balancing computational efficiency and predictive accuracy. Real-world experiments on an off-road ground robot demonstrate that ADO significantly reduces modeling error compared to a fixed low-latency baseline, while approaching the accuracy of the highest-fidelity model without incurring its computational cost, resulting in more reliable and effective navigation in challenging terrain.
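
The loop of refining terrain-conditioned error estimates and then selecting a model can be sketched as below. The exponential moving average, the tolerance-based selection rule, the fallback to the most expensive model, and all names and numbers are assumptions for illustration, not the method described in the paper.

```python
class Orchestrator:
    """Toy model selector driven by per-terrain residual estimates."""

    def __init__(self, costs, smoothing=0.2):
        self.costs = costs          # model name -> compute cost (hypothetical units)
        self.smoothing = smoothing  # EMA weight given to the newest residual
        self.estimates = {}         # (terrain, model) -> running residual estimate

    def update(self, terrain, residuals):
        """Fold in residuals from replaying the executed action through each model."""
        for model, residual in residuals.items():
            key = (terrain, model)
            previous = self.estimates.get(key, residual)
            self.estimates[key] = ((1 - self.smoothing) * previous
                                   + self.smoothing * residual)

    def select(self, terrain, tolerance):
        """Pick the cheapest model whose estimated error is within tolerance."""
        good = [m for m in self.costs
                if self.estimates.get((terrain, m), float("inf")) <= tolerance]
        if good:
            return min(good, key=self.costs.get)
        # No model is known to be good enough: fall back to the costliest one,
        # assumed here to be the highest-fidelity model.
        return max(self.costs, key=self.costs.get)

ado = Orchestrator(costs={"light": 1.0, "heavy": 10.0})
ado.update("pavement", {"light": 0.05, "heavy": 0.02})
ado.update("mud", {"light": 0.90, "heavy": 0.10})
print(ado.select("pavement", tolerance=0.2))
print(ado.select("mud", tolerance=0.2))
```

On easy terrain the cheap model is accurate enough and is preferred; on difficult terrain the selector pays for the heavier model.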

2606.00079 2026-06-02 cs.LG cs.AI

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu

Affiliations: School of Microelectronics, University of Science and Technology of China; College of Computing and Data Science, Nanyang Technological University; School of Electrical and Electronic Engineering, Nanyang Technological University

AI summary: Proposes the BitsMoE framework, which uses SVD decomposition and spectral-energy-guided mixed-precision bit allocation to address accuracy loss in ultra-low-bit quantization of MoE models, improving accuracy by 27.83 percentage points under 2-bit quantization on Qwen3-30B-A3B-Base.

Comments
29 pages, 6 figures, 9 tables. Code and models are available at https://github.com/zjiayu064/BitsMoE

Abstract

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3$\times$, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76$\times$ over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.
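
Choosing per-unit bit-widths to minimize estimated reconstruction loss under a fixed bit budget is a small combinatorial problem. The sketch below swaps the integer linear program for exhaustive search, which is only practical for a handful of units; the loss table, the equal-size assumption for all units, and the function name are hypothetical.

```python
from itertools import product

def allocate_bits(loss_tables, budget):
    """Exhaustively pick one bit-width per unit, minimizing total estimated loss.

    loss_tables[i][b] is the estimated loss of unit i when stored at b bits.
    All units are assumed to have equal size, so the budget is a sum of bit-widths.
    """
    options = [sorted(table) for table in loss_tables]
    best_choice, best_loss = None, float("inf")
    for choice in product(*options):
        if sum(choice) > budget:
            continue  # violates the bit budget
        total = sum(table[b] for table, b in zip(loss_tables, choice))
        if total < best_loss:
            best_choice, best_loss = choice, total
    return best_choice, best_loss

# Hypothetical loss estimates for three quantization units at 1, 2, or 3 bits.
loss_tables = [
    {1: 9.0, 2: 4.0, 3: 1.0},  # sensitive unit: gains a lot from extra bits
    {1: 2.0, 2: 1.5, 3: 1.4},  # insensitive unit
    {1: 5.0, 2: 2.0, 3: 1.8},
]
choice, total = allocate_bits(loss_tables, budget=6)  # average of 2 bits per unit
print(choice, total)
```

With the same average budget as uniform 2-bit quantization, the search moves a bit from the insensitive unit to the sensitive one and lowers the total estimated loss.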

2606.00078 2026-06-02 cs.CV cs.AI

Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications

Roman Pavelkin, Luis A. Zavala-Mondragon, Christiaan G. A. Viviers, Fons van der Sommen

Affiliation: Eindhoven University of Technology

AI summary: Proposes a task-aware flow-based generative framework that trains a flow model to optimize subsampling masks in compressed sensing, substantially improving performance on image classification, image reconstruction, and MRI acceleration.

Abstract

Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal's ambient dimension, a requirement often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered with fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework -- a reformulation of the conventional Flow Matching training paradigm in which a flow model is trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework for learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrated state-of-the-art performance, achieving a peak signal-to-noise ratio (PSNR) of 25.17 dB at a subsampling rate of 5\% on the CelebA dataset and 29.24 dB when reconstructing $8\times$ accelerated MRI measurements (fastMRI dataset) with minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can potentially be adapted to a broad range of inverse problems.

2606.00077 2026-06-02 cs.CV cs.AI

Improved Belief-Attention in Vision Task

Guoqiang Zhang

Affiliation: University of Exeter

AI summary: Proposes Belief2-Attention, which extends Belief-Attention by using both the perpendicular and projected components and introduces an additional inner-product matrix to strengthen token correlation, improving performance on vision tasks.

Abstract

Recently, Belief-Attention \cite{Guoqiang25BeliefAttention} has been proposed, which first performs an orthogonal projection of the softmax-based weighted summation of $V$ vectors with respect to the original $V$ vectors and then takes the perpendicular component as the residual signal in the Transformer for performance improvement. In this paper, we first conduct an ablation study showing that the projected component also carries information about the token correlation, which should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component goes through a certain activation function and then a linear mapping before merging with the considered token. Conceptually speaking, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. It is also noted that standard attention captures the token correlation via the inner-product matrix $QK^T$. We propose to introduce an additional inner-product matrix $ZZ^T$ to $QK^T$ to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard Attention. We then verify the effectiveness of Belief2-Attention for vision tasks of image classification and segmentation.
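
The split into projected and perpendicular components can be illustrated for a single token. The sketch below reflects one reading of the abstract, in which the attention output for a token is projected onto that token's own $V$ vector; the function names and the toy vectors are assumptions, and the activation, linear mapping, and $ZZ^T$ term are not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_output(weights, values):
    """Softmax-weighted sum of V vectors; weights are assumed already normalized."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

def decompose(output, v):
    """Split `output` into its component along `v` and the perpendicular remainder."""
    coeff = dot(output, v) / dot(v, v)  # assumes v is not the zero vector
    projected = [coeff * x for x in v]
    perpendicular = [o - p for o, p in zip(output, projected)]
    return projected, perpendicular

values = [[1.0, 0.0], [0.0, 2.0]]           # toy V vectors for two tokens
out = attention_output([0.5, 0.5], values)  # attention output for token 0
projected, perpendicular = decompose(out, values[0])
print(out, projected, perpendicular)
```

The two components sum back to the attention output, so using both of them discards no information.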

2606.00076 2026-06-02 cs.CV

DefocusTrackerAI -- A Generalized Framework for the Automatic Detection of Defocused Particle Images

Gonçalo Coutinho, Ana S. Moita, António L. N. Moreira, Massimiliano Rossi

Affiliations: IN+ Center for Innovation, Technology and Policy Research, Instituto Superior Técnico, University of Lisbon, Lisbon, Portugal; CINAMIL - Military Academy Research Center, Military Academy, Portugal; Department of Industrial Engineering, Alma Mater Studiorum University of Bologna, Bologna, Italy

AI summary: Proposes DefocusTrackerAI, a generalized YOLOv9-based deep-learning framework for automatic detection and position estimation of defocused particle images, achieving high recall and low uncertainty across a range of optical configurations.

Comments
24 pages, 10 figures

Abstract

The present work introduces DefocusTrackerAI, a generalized deep-learning framework for the automatic detection and position estimation of defocused particle images from any kind of optical configuration without compromising uncertainty and recall, intended as a follow-up to the open-source project DefocusTracker. We selected the deep neural network architecture from a direct comparison of two well-known object detection models, Faster R-CNN and YOLOv9, trained on a diverse and feature-rich synthetic image set containing astigmatic and non-astigmatic defocused particle images of varying diameters. The model evaluation on synthetic data showed that, first, YOLOv9 outperforms Faster R-CNN, achieving higher recall and lower uncertainty, particularly at high particle image densities; and second, that YOLOv9 provides enhanced spatial resolution, with uncertainty values between 0.1 and 0.4 pixels for particle image densities N_s up to 0.5, outperforming state-of-the-art algorithms. We demonstrated that our models are able to detect astigmatic and non-astigmatic defocused particle images in multiple optical setups with varying lighting conditions. In addition, we successfully applied our models to real defocusing particle tracking (DPT) experiments, including fluorescence and shadowgraph data, showing that they can be used beyond conventional DPT applications, including the tracking of sprays and droplets. A pre-trained, ready-to-use version of DefocusTrackerAI based on YOLOv9 is available at https://gitlab.com/goncalo.coutinho/defocustrackerAI-main/-/tree/7e0f11f649ebad50e20dca5b9545f26ca303ebe0 and can be used for automatic detection of defocused particle images of any kind with high accuracy. In combination with a suitable calibration approach for the depth position, it can be used as an effective first step for three-dimensional DPT.

2606.00069 2026-06-02 cs.RO eess.IV

Invascal: Inverse-Vacuity Self-Calibration for Uncertainty-Aware LiDAR Range-View Semantic Segmentation

Kerim Turacan, Hannes Reichert, Andrei Bolandut, Konrad Doll

Affiliation: Faculty of Engineering and Computer Science, University of Applied Sciences Aschaffenburg

AI summary: Proposes an architecture-agnostic uncertainty-aware adapter head that decomposes predictions into a preference head and a strength head, together with an inverse-vacuity self-calibration objective (Invascal) that supervises the strength signal, yielding reliable, well-calibrated uncertainty estimates while preserving segmentation accuracy.

Comments
Accepted for publication at the 2026 IEEE 29th International Conference on Intelligent Transportation Systems (ITSC)

Abstract

LiDAR semantic segmentation is a core perception capability for autonomous vehicles and mobile robots. However, safe operation also depends on knowing when predictions are unreliable. Existing approaches typically rely on softmax confidence, which is often miscalibrated and overconfident, while stronger uncertainty estimates from Monte Carlo dropout or ensembles are often computationally expensive for real-time use. To this end, we introduce a novel, architecture-agnostic uncertainty-aware Adapter Head. It decomposes the prediction into a Preference Head for class ranking and a Strength Head that refines uncertainty assessment, thereby enabling a principled construction of evidential Dirichlet representations. Building on this design, we propose our inverse-vacuity self-calibration objective (Invascal), which directly supervises the strength signal to produce reliable and well-calibrated uncertainty estimates while preventing runaway evidence growth. We evaluate our framework across multiple LiDAR datasets and backbone architectures. We compare against deterministic training, Monte Carlo dropout and ensembles, and prior evidential methods. Our approach consistently improves uncertainty calibration over traditional deterministic methods with minimal computational overhead. At the same time, it preserves competitive segmentation accuracy, where prior evidential methods often suffer performance degradation.
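
One way a class-ranking output and a scalar strength could be combined into a Dirichlet is sketched below, using the usual evidential convention that each concentration is one plus the class evidence and that vacuity is the number of classes divided by the total concentration. This parameterization is an assumption for illustration; the abstract does not specify how the Adapter Head builds its Dirichlet, and the Invascal objective itself is not shown.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dirichlet_from_heads(preference_logits, strength):
    """Combine class preferences and a non-negative strength into Dirichlet parameters.

    Assumed construction: evidence_k = strength * softmax(preference)_k,
    alpha_k = 1 + evidence_k.
    """
    probs = softmax(preference_logits)
    alpha = [1.0 + strength * p for p in probs]
    total = sum(alpha)
    vacuity = len(alpha) / total          # 1.0 when there is no evidence at all
    expected = [a / total for a in alpha]  # mean of the Dirichlet
    return alpha, vacuity, expected

logits = [2.0, 0.5, -1.0]  # hypothetical preference-head output for one point
_, vacuity_none, _ = dirichlet_from_heads(logits, strength=0.0)
_, vacuity_high, expected = dirichlet_from_heads(logits, strength=27.0)
print(vacuity_none, vacuity_high, expected)
```

In this construction the class ranking depends only on the preference logits, while the strength alone controls how much uncertainty (vacuity) remains.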

2606.00066 2026-06-02 cs.SD eess.AS

DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

DUET: 扩散与流匹配驱动的文本转语音的统一双空间情感控制

Xu Zhang, Longbing Cao, Zhangkai Wu

发表机构 * Frontier AI Research Centre, Macquarie University(前沿人工智能研究中心,麦考瑞大学)

AI总结 提出DUET框架,通过隐空间引导和梅尔谱梯度修正的双空间控制,在预训练扩散/流匹配TTS模型中实现细粒度情感控制,超越10个有监督基线。

AI中文摘要

基于扩散和流匹配的文本转语音(TTS)模型在自然度方面表现出色,但由于情感信号与说话人身份纠缠,往往缺乏显式的情感控制。我们发现情感嵌入作为冻结隐藏状态的线性可解码方向出现,几乎与编码说话人身份的方向正交。这启发了一个即插即用框架DUET,用于对预训练的扩散和流匹配TTS模型进行情感控制。在生成过程中,DUET统一双空间控制,在单步更新中实现细粒度情感干预:隐空间引导沿目标情感方向移动生成,而梅尔谱引导通过从可微分声码器反向传播的梯度细化频谱细节。我们在三个数据集上的五个架构多样的预训练TTS骨干上验证了DUET,它跨范式超越了10个有监督的最先进情感TTS基线,并获得了最高的人类评分情感适宜性。为了进一步展示其定性行为,我们将DUET部署在Ameca人形机器人上,使其产生丰富表现力的情感语音,展示了即插即用情感交互在具身智能体中的巨大潜力。

英文摘要

Diffusion and flow-matching based text-to-speech (TTS) models excel in naturalness but often lack explicit emotion control, as emotional signals remain entangled with speaker identity. We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires a plug-and-play framework DUET for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datasets, where it outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness. To further showcase its qualitative behavior, we deploy DUET on an Ameca humanoid robot, where it produces richly expressive emotional speech on the humanoid, demonstrating the strong potential for plug-and-play affective interaction for embodied agents.

2606.00063 2026-06-02 cs.RO math-ph math.MP physics.flu-dyn

Linear Motility Maps in Nonlinear Viscous Fluids

非线性粘性流体中的线性运动映射

Yishun Zhou, Shai Revzen

发表机构 * Department of Robotics, University of Michigan(机器人学系,密歇根大学) Departments of Electrical Engineering and Computer Science, and Ecology and Evolutionary Biology(电气工程与计算机科学系、生态与进化生物学系)

AI总结 研究在低雷诺数流体中,线性运动映射扩展到幂律流体,并发现Carreau-Yasuda流体可违反该线性性质实现净运动,方向可随速度改变。

AI中文摘要

已知在低雷诺数流体中运动的系统受“运动映射”支配,该映射线性地将形状变化率与通过流体的本体框架速度联系起来。其结果是“珀塞尔扇贝定理”——经历时间上前后相同路径的形状变化(往复身体变形)的运动系统无法实现净位移,无论这些变化的速度如何。我们证明线性速度运动映射扩展到任何幂律粘度(即Ostwald-de Waele流体),因此也适用于中间剪切范围内的许多生物流体。我们还表明,在Carreau-Yasuda流体中,线性速度性质可以被违反,使用由两个不等质量且具有不等阻力系数的质量组成的“尺蠖”模型进行往复运动,从而产生净运动。有趣的是,运动方向可以通过改变速度来切换。我们的结果表明,几何力学的线性运动映射可用于分析和设计幂律流体中的运动,并且某些非线性阻力关系(如Carreau-Yasuda)可用于产生净运动,看似违反了“扇贝定理”。

英文摘要

Systems moving in low Reynolds number fluid regimes are known to be governed by a "motility map" which linearly relates their shape change rates to their body-frame velocity through the fluid. A consequence of this is "Purcell's Scallop Theorem": a locomotion system that undergoes shape changes that follow the same path forward and backward in time (reciprocal body deformations) cannot achieve net displacement, regardless of the pacing of those changes. We show that linear-in-velocity motility maps extend to any power-law viscosity (a.k.a. Ostwald-de Waele fluid), and therefore to many biological fluids in intermediate shear ranges. We also show that the linear-in-velocity property can be violated in Carreau-Yasuda fluids to produce net motion using an "inchworm" model consisting of two unequal masses with unequal drag coefficients performing reciprocal motions. Interestingly, the direction of motion can be switched by changing speeds. Our results show that the linear motility map of geometric mechanics can be used to analyze and design locomotion in power-law fluids, and that some nonlinear drag relationships such as Carreau-Yasuda can be exploited to generate net locomotion in seeming violation of the "scallop theorem".
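The pacing-independence claim behind the scallop theorem can be checked numerically on a toy one-dimensional system. The sketch below assumes a scalar shape variable r and a made-up motility function A(r); neither comes from the paper, and the Carreau-Yasuda case is not modeled. It integrates x' = A(r) r' along a reciprocal stroke whose forward and backward halves run at different speeds.

```python
def net_displacement(motility, shape_path, dt_per_step):
    """Integrate x' = A(r) * r' along a sampled shape path (midpoint rule).

    dt_per_step sets the pacing of each step; for a motility map that is
    linear in the shape rate it cancels out of the displacement.
    """
    x = 0.0
    steps = zip(zip(shape_path, shape_path[1:]), dt_per_step)
    for (r0, r1), dt in steps:
        r_mid = 0.5 * (r0 + r1)
        shape_rate = (r1 - r0) / dt
        x += motility(r_mid) * shape_rate * dt
    return x

# Illustrative (assumed) motility function and a reciprocal stroke 0 -> 1 -> 0.
motility = lambda r: 1.0 + r * r
forward = [i / 100 for i in range(101)]
stroke = forward + forward[-2::-1]
pacing = [0.01] * 100 + [0.05] * 100  # fast opening, slow closing

reciprocal_result = net_displacement(motility, stroke, pacing)
one_way_result = net_displacement(motility, forward, [0.01] * 100)
```

Here `reciprocal_result` vanishes up to rounding error even though the two halves of the stroke are paced differently, while `one_way_result` approximates the integral of A(r) over the opening half.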

2606.00062 2026-06-02 cs.CL

Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study

跨实体金融情感分析的图增强检索:一项比较研究

Rajan Bastakoti, Sagar Bhetwal, Nirajan Acharya, Gaurav Kumar Gupta

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 本文提出一种两跳图增强检索架构(Graph-RAG),通过构建情感加权知识图谱并融合密集检索与图遍历,相比标准向量检索在跨实体金融情感分析中显著提升实体召回率和复杂查询答案相关性。

AI中文摘要

检索增强生成(RAG)已成为将大语言模型锚定到特定领域语料库的基础方法,然而传统的基于向量的RAG系统在捕捉支撑金融市场分析的结构化多实体关系方面存在根本性局限。本文对一种新颖的两跳图增强检索架构(Graph-RAG)与标准纯向量基线在跨实体金融情感分析中进行了全面比较研究。我们的系统从覆盖10只主要科技股的255篇新闻文章中构建了一个包含59个股票实体的情感加权知识图谱,然后通过强度过滤的图遍历(沿INFLUENCES边)增强密集检索,以揭示纯向量搜索无法获取的关系证据。我们使用语义相似度、实体召回率、RAGAS指标、延迟基准和消融研究,在100个有依据的查询(30个直接查询,70个关系查询)上评估了两种架构。Graph-RAG在实体召回率上实现了统计显著的提升(+6.4%,p < 0.001,Wilcoxon符号秩检验),并为复杂的多实体查询提供了显著更相关的答案(答案相关性+11.7%),增益集中在关系型问题类型(+16.1%)。关键的是,这些改进在答案质量上没有可测量的成本(语义相似度变化+0.001,Cohen's d = 0.078),平均延迟适度增加22.6%,但延迟方差降低了80%。对图遍历强度阈值的消融研究揭示了其与答案质量的倒U型关系,确定tau = 0.5为最优值,而生产默认值为tau = 0.7。这些发现刻画了图增强检索固有的精度-覆盖率权衡,并为构建多实体金融分析RAG系统的从业者提供了可操作的架构指导。

英文摘要

Retrieval-Augmented Generation (RAG) has become foundational for grounding large language models in domain-specific corpora, yet conventional vector-based RAG systems are fundamentally limited in their ability to capture the structured, multi-entity relationships that underpin financial market analysis. This paper presents a comprehensive comparative study of a novel two-hop Graph-RAG architecture versus a standard vector-only baseline for cross-entity financial sentiment analysis. Our system constructs a sentiment-weighted knowledge graph of 59 equity entities from 255 news articles covering 10 major technology stocks, then augments dense retrieval with intensity-filtered graph traversal over INFLUENCES edges to surface relational evidence inaccessible to vector search alone. We evaluate both architectures on 100 grounded queries (30 Direct, 70 Relational) using semantic similarity, entity recall, RAGAS metrics, latency benchmarks, and ablation studies. Graph-RAG achieves a statistically significant improvement in entity recall (+6.4%, p < 0.001, Wilcoxon signed-rank) and delivers substantially more relevant answers for complex multi-entity queries (+11.7% Answer Relevancy), with gains concentrating in relational question types (+16.1%). Critically, these improvements come at no measurable cost to answer quality (delta = +0.001 semantic similarity, Cohen's d = 0.078), with a modest 22.6% increase in mean latency offset by an 80% reduction in latency variance. An ablation study on the graph traversal intensity threshold reveals an inverted-U relationship with answer quality, identifying tau = 0.5 as optimal over the production default of tau = 0.7. These findings characterize a precision-for-coverage trade-off inherent to graph-augmented retrieval and provide actionable architectural guidance for practitioners building RAG systems for multi-entity financial analysis.
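The intensity-filtered two-hop traversal can be sketched as follows. The adjacency format, the entity names, and the rule "keep an edge when the absolute intensity is at least tau" are assumptions for illustration; the abstract only states that traversal runs over INFLUENCES edges and is filtered by an intensity threshold tau.

```python
def two_hop_neighbors(graph, seeds, tau):
    """Collect entities within two INFLUENCES hops of the seed entities.

    graph: {entity: [(neighbor, intensity), ...]} (hypothetical format).
    An edge is followed only if |intensity| >= tau (assumed filter rule).
    """
    seeds = set(seeds)
    reached = set(seeds)
    frontier = set(seeds)
    for _ in range(2):  # exactly two hops
        next_frontier = set()
        for node in frontier:
            for neighbor, intensity in graph.get(node, []):
                if abs(intensity) >= tau and neighbor not in reached:
                    next_frontier.add(neighbor)
        reached |= next_frontier
        frontier = next_frontier
    return reached - seeds

# Toy graph with placeholder entity names.
toy_graph = {
    "A": [("B", 0.8), ("C", 0.4)],
    "B": [("D", 0.6)],
    "C": [("E", 0.9)],
    "D": [("F", 0.9)],
}
```

With this toy graph, raising tau prunes weaker edges and shrinks the expanded entity set, which is the precision-for-coverage trade-off the ablation over tau probes.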

2606.00059 2026-06-02 cs.RO cs.LG

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

机电系统参数辨识中最优实验设计的强化学习方法

Julian Langschwert, Georg Schaefer, Jakob Rehrl, Stefan Huber, Simon Hirlaender

发表机构 * Josef Ressel Centre for Intelligent and Secure Industrial Automation, Salzburg University of Applied Sciences, Salzburg, Austria(约瑟夫·雷斯尔智能与安全工业自动化中心,萨尔茨堡应用技术大学,萨尔茨堡,奥地利) Paris Lodron University of Salzburg, Salzburg, Austria(萨尔茨堡帕里斯·洛德龙大学,萨尔茨堡,奥地利)

AI总结 提出一种强化学习智能体,通过奖励塑形自主满足安全约束,为Quanser Aero 2测试平台学习最优激励信号,在三个辨识参数上均达到竞争性估计精度,且安全违规率仅0.75%。

Comments
Accepted at DEXA AI4IP 2026
AI中文摘要

信息丰富的激励信号对于机电系统的精确系统辨识至关重要,然而经典系统辨识方法需要专家知识和手工设计的信号以满足硬件安全约束,限制了其通用性。我们提出一种强化学习智能体,为Quanser Aero 2测试平台学习最优激励信号,同时通过奖励塑形自主强制执行安全约束。在10个独立训练种子的评估中,我们的综合智能体在所有三个辨识参数上均实现了具有竞争力的估计精度,优于经典基线方法,且仅产生0.75%的安全违规。

英文摘要

Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.

2606.00054 2026-06-02 cs.RO cs.AI cs.CV

From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data

从人类视频到机器人操作:基于人类中心数据的可扩展视觉-语言-动作学习综述

Zhiyuan Feng, Qixiu Li, Huizhi Liang, Rushuai Yang, Yichao Shen, Zhiying Du, Zhaowei Zhang, Yu Deng, Li Zhao, Hao Zhao, Zongqing Lu, Oier Mees, Marc Pollefeys, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University(清华大学) HKUST(香港科技大学) Xi’an Jiaotong University(西安交通大学) Fudan University(复旦大学) Microsoft Research Asia(微软亚洲研究院) Peking University(北京大学) Microsoft Zurich Project(微软苏黎世实验室)

AI总结 本文综述了如何将丰富的人类视频转化为视觉-语言-动作(VLA)模型的有效知识,分类了四种方法(潜在动作表示、预测世界模型、显式2D监督、显式3D重建),并指出了结构化非结构化视频、跨具身和视角的动作映射、以及评估协议设计三大挑战。

Comments
Accepted to IJCAI 2026 Survey Track. Project page: https://aaronfengzy.github.io/HumanCentricToVLA-Survey/
AI中文摘要

近期在可泛化具身控制方面的进展由大规模预训练的视觉-语言-动作(VLA)模型驱动。然而,大多数现有方法依赖于大量机器人演示数据,这些数据获取成本高昂且与特定具身紧密耦合。相比之下,人类视频丰富且捕捉了丰富的交互,为真实世界操作提供了多样的语义和物理线索。然而,具身差异以及任务对齐标注的频繁缺失使得它们直接用于VLA模型具有挑战性。本综述提供了一个统一的视角,探讨如何将人类视频转化为VLA模型的有效知识。我们根据所提取的动作相关信息将现有方法分为四类:(i) 编码帧间变化的潜在动作表示;(ii) 预测未来帧的预测世界模型;(iii) 提取图像平面线索的显式2D监督;(iv) 恢复几何或运动的显式3D重建。除分类外,我们强调了该领域的三个关键开放挑战:将非结构化视频结构化为可训练的片段、在具身和视角异质性下将视频导出的监督接地到机器人可执行动作中,以及设计能更好预测真实世界部署性能和迁移效率的评估协议,从而为未来研究方向提供参考。论文和资源的精选列表见 https://github.com/AaronFengZY/HumanCentricToVLA-Survey。

英文摘要

Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at https://github.com/AaronFengZY/HumanCentricToVLA-Survey.

2606.00053 2026-06-02 cs.RO

VLAMotor: Test-Guided Enhancement of Vision-Language-Action Models via Agent-Based Data Synthesis

VLAMotor: 通过基于智能体的数据合成实现视觉-语言-动作模型的测试引导增强

Zeqin Liao, Peifan Ren, Zixu Gao, Hongyu Gong, Lianyu Hu, Wenbing Tang, Yuhong Nan, Zibin Zheng, Yang Liu

发表机构 * School of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学) School of Software Engineering, Sun Yat-sen University(软件工程学院,中山大学) GuangDong Engineering Technology Research Center of Blockchain, China(广东省区块链工程技术研究中心,中国) Northwest A&F University(西北农林科技大学)

AI总结 提出VLAMotor框架,通过距离感知测试暴露失败案例,并利用基于智能体的数据合成生成成功轨迹微调VLA模型,显著提升模型在仿真和真实环境中的成功率。

AI中文摘要

视觉-语言-动作(VLA)模型遵循数据驱动范式,受训练数据覆盖范围的限制,在部署后容易在边缘情况配置上失败。为了减轻此类风险,必须暴露高质量失败模式,并将由此产生的失败转化为监督数据用于模型增强。现有研究大多止步于失败检测,缺乏利用发现的失败进行模型修复的机制。我们提出VLAMotor,这是首个用于VLA增强的分析框架,它集成了距离感知模型测试以暴露失败,以及基于智能体的数据合成以进行模型微调。首先,VLAMotor基于与训练样本的距离估计输入不确定性,并将不确定性排序与冗余消除相结合,构建暴露多样化失败的紧凑测试集。然后,VLAMotor将失败轨迹抽象为结构化语义表示,并规划参数化的修复技能序列,通过逆运动学和运动执行将其实现为可执行轨迹。由此产生的成功轨迹被自动标注并用于微调原始VLA模型,从而得到增强的VLA模型。在四个代表性机器人操作任务上的评估表明,VLAMotor生成的仿真测试用例中有92.33%触发了VLA失败,并且VLAMotor将测试覆盖率相比最先进工具提高了18.93%。通过使用从失败测试用例中导出的合成数据微调VLA模型,VLAMotor进一步将VLA模型的总体成功率提高了49.25%。当部署在真实硬件上时,仿真增强模型相比原始VLA模型成功率提高了57.50%,展示了VLA增强的一种有效且低成本的方向。

英文摘要

Vision-Language-Action (VLA) models follow a data-driven paradigm and are constrained by the coverage of training data, making them prone to failure on edge-case configurations after deployment. To mitigate such risks, it is essential to expose high-quality failure modes and convert the resulting failures into supervisory data for model enhancement. Existing studies largely stop at failure detection and lack a mechanism for leveraging discovered failures for model repair. We propose VLAMotor, the first analysis framework for VLA enhancement, which integrates distance-aware model testing for failure exposure and agent-based data synthesis for model fine-tuning. First, VLAMotor estimates input uncertainty based on the distance to training samples, and combines uncertainty ranking with redundancy elimination to build compact test sets that expose diverse failures. Then, VLAMotor abstracts failure trajectories into structured semantic representations, and plans parameterized repair-skill sequences, which are then realized as executable trajectories through inverse kinematics and motion execution. The resulting successful trajectories are automatically labeled and used to fine-tune the original VLA model, yielding an enhanced VLA model. Evaluation on four representative robotic manipulation tasks shows that 92.33% of the in-simulation test cases generated by VLAMotor trigger VLA failures, and VLAMotor improves test coverage over the state-of-the-art tool by 18.93%. By fine-tuning VLA models with synthetic data derived from failed test cases, VLAMotor further enhances the overall success rate of VLA models by 49.25%. When deployed on real hardware, the simulation-enhanced models improve the success rate over the original VLA models by 57.50%, demonstrating an effective and low-cost direction for VLA enhancement.

2606.00052 2026-06-02 cs.AI cs.LG

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

产品感知深度自编码器用于多产品信息物理系统的鲁棒过程监控

MD Shafikul Islam, Jordan Carden

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对多产品制造中全局模型因决策边界扩大而产生盲点的问题,提出产品感知自编码器,通过限制学习域到产品特定分布来提升异常检测鲁棒性,在扩展田纳西伊士曼过程基准上实现100%攻击检测。

AI中文摘要

随着工业4.0加速信息物理系统在制造业中的集成,鲁棒异常检测对于确保过程安全与安保变得至关重要。当前的数据驱动方法通常采用“产品无关”或全局模型,这些模型在所有正常操作数据的聚合上训练。然而,现代工业设施经常在不同的产品等级下运行。虽然计算简单,但这些全局模型本质上会扩展其决策边界以适应多种模式的方差,从而产生一个“盲点”,其中微妙的异常或针对性的信息物理攻击可能被模型的宽接受区域所掩盖。在这项工作中,我们首先证明了上述漏洞存在于跨多个产品等级运行的全局无关模型中。然后,我们提出了一种产品感知自编码器作为原则性的缓解措施,将学习域限制在等级特定的分布上。虽然这种方法降低了已识别的盲点风险,但我们并不声称它是所有可能替代方案中的最优缓解措施。我们使用扩展的田纳西伊士曼过程基准对这种方法进行了严格的验证,并与全局无关基线进行了比较。我们的实证结果表明,产品感知框架在标准检测指标上与全局基线表现相当,同时提供了对产品等级特定操作模式的改进鲁棒性。最关键的是,模拟我们假设的攻击场景的压力测试显示,虽然全局模型在77.8%的场景中未能检测到操作偏差,但产品感知系统实现了100%的检测准确率。这些发现表明,在柔性制造环境中,广义异常检测器可能带来非平凡的安全风险,促使向模式感知诊断架构的转变。

英文摘要

As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical for ensuring process safety and security. Current data-driven approaches typically employ "product-agnostic" or global models trained on the aggregate of all normal operating data. However, modern industrial facilities frequently operate under diverse product grades. While computationally simple, these global models inherently expand their decision boundaries to accommodate the variance of multiple modes, creating a "blind spot" where subtle anomalies or targeted cyber-physical attacks may be masked by the wide acceptance region of the model. In this work, we first demonstrate that the vulnerability described above is present in global-agnostic models operating across multiple product grades. We then present a Product-Aware Autoencoder as a principled mitigation that restricts the learning domain to grade-specific distributions. While this approach reduces the identified blind-spot risk, we do not claim it as the optimal mitigation among all possible alternatives. We rigorously validate this approach against a Global Agnostic baseline using the Extended Tennessee Eastman Process (TEP) benchmark. Our empirical results indicate that the Product-Aware framework performs comparably to the global baseline on standard detection metrics, while offering improved robustness to product-grade-specific operating modes. Most critically, stress tests simulating our hypothetical attack scenarios reveal that while the global model fails to detect operational deviations in 77.8% of the scenarios, the product-aware system achieves 100% detection accuracy. These findings suggest that, in flexible manufacturing environments, generalized anomaly detectors can pose non-trivial security risks, motivating a shift toward mode-aware diagnostic architectures.
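The "blind spot" argument can be made concrete with a deliberately simplified stand-in. The sketch below swaps the autoencoder for a one-variable mean plus-or-minus k-sigma control limit, and the readings are invented numbers, so it illustrates only the mechanism (pooled data widens the acceptance region), not the paper's model or results.

```python
from statistics import mean, pstdev

def fit_limits(values, k=3.0):
    """Acceptance region mean +/- k * sigma fitted on normal operating data."""
    mu, sigma = mean(values), pstdev(values)
    return mu - k * sigma, mu + k * sigma

def is_anomalous(x, limits):
    low, high = limits
    return x < low or x > high

# Invented sensor readings for two product grades.
grade_a = [9.8, 10.0, 10.2, 9.9, 10.1]
grade_b = [19.8, 20.0, 20.2, 19.9, 20.1]

global_limits = fit_limits(grade_a + grade_b)  # product-agnostic detector
grade_a_limits = fit_limits(grade_a)           # product-aware detector

# A reading of 14.0 while producing grade A is far from normal for that grade.
suspicious = 14.0
missed_by_global = not is_anomalous(suspicious, global_limits)
caught_by_grade_model = is_anomalous(suspicious, grade_a_limits)
```

The pooled limits have to span both grades, so a reading midway between them passes the global check while the grade-specific limits flag it.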

2606.00050 2026-06-02 cs.AI cs.CL cs.DB cs.IR

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

Grokers: 基于类型化知识图谱的自底向上归纳理解与写入时智能

Gregory Magarshak

发表机构 * Gregory Magarshak

AI总结 提出Grokers架构,通过自底向上的依赖子图归纳遍历构建持久结构化理解,将智能推至写入时,实现零额外LM成本的查询,并证明字节同一性、累积单调性和双遍历顺序三个形式性质。

Comments
6 pages; second in a series with the Magarshak Machine / SPACER paper and the Context paper
AI中文摘要

我们提出Grokers,一种通过依赖子图的自底向上归纳遍历来构建类型化知识图谱的持久结构化理解的架构。与检索增强生成(RAG)不同,后者在每个查询时支付全部理解成本,Grokers将智能推至写入时:自主的Groker代理分析类型化流图中的节点,通过受控语言模型(LM)调用提取结构化属性,并通过依赖关系归纳组合这种理解,写入丰富的类型化属性,从而以零额外LM成本服务于所有未来查询。我们证明了三个形式性质:(1)字节同一性定理,确立了从事务性维护的反规范化索引组装出的上下文块在语义变化之间的LM轮次中字节相同,使得KV缓存命中率接近100%;(2)累积单调性定理,确立了在受控智慧库增长协议下,无需LM调用即可解决交互的比例随已完成交互数量非递减;(3)双遍历顺序定理,确立了自顶向下生成和自底向上理解分别是它们在依赖DAG上各自任务的唯一正确遍历顺序,且它们的组合闭合为一个完整的生成-理解循环。我们进一步提出了一种基于嵌入的语义搜索的确定性替代方案,采用同义词缓存协议,其LM回退率在有限词汇域中收敛至零。在开源Qbix/Safebox/Safebots栈中提供了参考实现。

英文摘要

We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension cost at every query, Grokers pushes intelligence to write time: autonomous Groker agents analyze nodes in a typed stream graph, extract structured attributes via governed language model (LM) calls, and inductively compose that understanding upward through dependency relations, writing enriched typed attributes that serve all future queries at zero additional LM cost. We prove three formal properties: (1) the Byte-Identity Theorem, establishing that context blocks assembled from a transactionally-maintained denormalization index are byte-identical across LM turns between semantic changes, enabling KV-cache hit rates approaching 100%; (2) the Accumulation Monotonicity Theorem, establishing that the fraction of interactions resolved without LM calls is non-decreasing in the number of completed interactions under a governed wisdom library growth protocol; and (3) the Dual-Traversal Ordering Theorem, establishing that top-down generation and bottom-up comprehension are the unique correct traversal orderings for their respective tasks over a dependency DAG, and that their composition closes into a complete generation-comprehension cycle. We further present a deterministic alternative to embedding-based semantic search, with a synonym caching protocol whose LM fallback rate converges to zero for finite-vocabulary domains. A reference implementation is provided in the open-source Qbix / Safebox / Safebots stack.
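Bottom-up comprehension over a dependency DAG amounts to visiting each node only after everything it depends on. A minimal sketch using Python's standard `graphlib` module is shown below; the `analyze` callback is a hypothetical stand-in for a governed LM call, and the node names are placeholders, none of which come from the paper or its reference implementation.

```python
from graphlib import TopologicalSorter

def comprehend_bottom_up(dependencies, analyze):
    """Compute one summary per node, dependencies first.

    dependencies: {node: set of nodes it depends on}.
    analyze(node, child_summaries): hypothetical stand-in for an LM call
    that composes the summaries of a node's dependencies.
    """
    summaries = {}
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(dependencies).static_order():
        children = {dep: summaries[dep] for dep in dependencies.get(node, ())}
        summaries[node] = analyze(node, children)
    return summaries

# Placeholder graph: a report depends on two sections sharing one paragraph.
deps = {
    "report": {"section1", "section2"},
    "section1": {"para"},
    "section2": {"para"},
}
visit_order = []

def count_nodes(node, child_summaries):
    visit_order.append(node)
    return 1 + sum(child_summaries.values())

result = comprehend_bottom_up(deps, count_nodes)
```

Each node is analyzed once and its result is stored, so later readers of `result` pay no further analysis cost, which mirrors the write-time framing in the abstract.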

2606.00031 2026-06-02 cs.CL cs.AI

LLMs for Cardiovascular Risk Prediction from Structured Clinical Data

基于结构化临床数据的LLMs心血管风险预测

Jeba Maliha, Md Rafiul Kabir

发表机构 * Central Michigan University(中央密歇根大学)

AI总结 提出混合框架将结构化临床数据转换为自然语言表示,利用LLMs进行冠心病预测,并验证了高保真度与隐私保护优势。

Comments
International Conference on Intelligent Systems, Blockchain, and Communication Technologies
AI中文摘要

冠状动脉疾病(CAD)仍然是全球主要死因之一,凸显了对可靠预测系统的需求以支持早期诊断和风险评估。虽然传统机器学习模型在结构化临床数据上表现良好,但大型语言模型(LLMs)为解释自然语言表达的医疗信息提供了新的可能性。在这项工作中,我们开发了一个混合框架,桥接了结构化临床数据和自然语言表示用于CAD预测。使用包含1190名患者记录和11个临床属性的公开数据集,结构化变量被转换为可解释的特征表示,并通过LLMs生成合成临床叙述。验证流程进行临床变量的反向提取,并计算与原始记录的一致性分数,平均保真度达到94.61%。然后,我们评估了四种传统机器学习模型,并在零样本和少样本提示设置下与基于LLM的分类进行比较。我们使用了两个LLM:GPT和Gemini。实验结果表明,随机森林达到了最高准确率。尽管有这一优势,基于LLM的分类在真实临床环境中仍然有益。这是因为LLMs直接操作于自然语言的患者描述,意味着敏感的数值患者数据(如精确的实验室值、血压读数和诊断代码)得以保密。研究结果表明,将结构化临床数据与LLM生成的叙述相结合,可以为混合临床预测系统开辟新方向。

英文摘要

Coronary artery disease (CAD) remains one of the leading causes of death globally, highlighting the need for reliable predictive systems to support early diagnosis and risk assessment. While traditional machine learning models perform well on structured clinical data, large language models (LLMs) present new possibilities to interpret medical information expressed in natural language. In this work, we develop a hybrid framework that bridges structured clinical data and natural-language representations for CAD prediction. Using a publicly available dataset of 1,190 patient records with 11 clinical attributes, structured variables are converted into interpretable feature representations and synthetic clinical narratives using LLMs. A validation pipeline performs reverse extraction of clinical variables and computes a consistency score with the original records, achieving an average fidelity of 94.61%. We then evaluate four conventional machine learning models and compare their performance with LLM-based classification under zero-shot and few-shot prompting settings. We use two LLMs here, GPT and Gemini. Experimental results show that Random Forest achieves the highest accuracy. Despite this advantage, LLM-based classification remains beneficial in real-world clinical settings. This is because LLMs operate directly on natural language patient descriptions, meaning that sensitive numerical patient data such as exact lab values, blood pressure readings, and diagnostic codes are kept private. Findings suggest that combining structured clinical data with LLM-generated narratives can enable new directions for hybrid clinical prediction systems.

2606.00029 2026-06-02 cs.CL cs.AI

TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation

TCAR-Gen: 基于证据融合的时间图检索用于知识增强生成

Sidra Nasir, Muhammad Noman Zahid, Rizwan Ahmed Khan

发表机构 * Dipartimento di Informatica, Università di Verona(维罗纳大学信息学系) School of Advanced Studies, University of Camerino(卡梅里诺大学高等研究院) Department of Computer Science, School of Mathematics and Computer Science, Institute of Business Administration (IBA), Karachi, Pakistan(巴基斯坦卡拉奇工商管理学院(IBA)数学与计算机科学学院计算机科学系)

AI总结 提出TCAR-Gen框架,结合查询条件图神经网络、时间证据融合和树链推理,在历史犯罪叙事问答中实现时间推理和多源证据融合,优于现有RAG方法。

AI中文摘要

检索增强生成系统在回答历史犯罪案件叙述的复杂问题时,在时间推理和证据融合方面存在困难。现有方法要么独立于查询语义进行检索,要么无法连贯地整合多个证据来源。我们提出时间上下文增强检索生成(TCAR-Gen),一个结合查询条件图神经网络、时间证据融合和树链推理的框架,以将答案生成基于检索到的证据。在维多利亚犯罪日记基准上,TCAR-Gen在Recall@5上达到0.3738,在包括多跳推理和反事实问题在内的七种查询类型上优于Vanilla RAG、Temporal RAG、GraphRAG-C和GraphRAG-T。消融研究表明,上下文图、时间惩罚机制和查询条件是关键组件。跨五个语言模型(GPT-OSS 20B到TinyLlama 1.1B)的评估表明,TCAR-Gen在较小模型规模下保持稳健的检索覆盖,但生成质量随模型容量减少而显著下降。我们的工作表明,显式时间建模和多分支证据融合对于基于知识语料库的忠实、推理密集型问答至关重要。

英文摘要

Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language models (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.
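For readers unfamiliar with the headline metric, a common definition of Recall@k is the fraction of relevant items that appear among the top k retrieved items. The sketch below implements that common definition; whether the benchmark uses exactly this variant is an assumption, and the document IDs are placeholders.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items found in the top-k retrieved list.

    retrieved: ranked list of item IDs (best first).
    relevant: collection of ground-truth item IDs.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here for queries with no relevant items
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Placeholder example: one of three relevant documents is in the top 5.
example_score = recall_at_k(
    ["d1", "d2", "d3", "d4", "d5", "d6"], {"d2", "d6", "d9"}, k=5
)
```

A benchmark-level figure would typically be the average of this per-query value over all queries.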

2606.00027 2026-06-02 cs.CL cs.AI

A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models

医疗大语言模型安全性、鲁棒性和公平性评估的多领域红队框架

Andrei Marian Feier, Veysel Kocaman, Yigit Gul, Ahmet Korkmaz, Alexander Thomas, Aleksei Zakharov, Jay Gil, Mehmet Butgul, David Talby

发表机构 * John Snow Labs Inc.(约翰·斯诺实验室公司)

AI总结 提出一个多领域红队框架,通过690个临床场景评估11个当代大语言模型,发现平均准确率掩盖了临床意义上的风险,性能方差和最坏情况失败比平均准确率更能反映可靠性,混合评估方法对可信安全评估至关重要。

Comments
10 pages, 4 figures. To be presented at the Text2Story 2026 Workshop (Delft, The Netherlands, 29 March 2026); CEUR Workshop Proceedings (forthcoming). Affiliation: John Snow Labs Inc
AI中文摘要

大语言模型(LLM)在医疗领域的部署日益增多,但现有基准未能捕捉临床实践中常见的对抗性或伦理复杂条件下的模型行为。我们开发了一个多领域红队框架,评估了11个当代LLM在690个临床场景中的表现,这些场景涵盖9个领域和150多个子类别。场景包含对抗性变换,响应使用七维度评分标准进行评估,包括LLM辅助评分和人在环验证。结果揭示了显著的性能差异,平均得分范围从0.791到0.984。关键的是,几个高性能系统在个别安全关键场景中完全失败,表明平均准确率掩盖了临床意义上的风险。最高性能系统(X-BAI、GPT-5、Claude Opus 4.1)得分超过0.97且方差较低,而不同领域间性能差异显著。公平性相关任务在人口统计修改后错误率放大10-20%,人工评审员识别出自动评估遗漏的临床相关失败。我们的发现表明,性能方差和最坏情况失败比平均准确率更能提供临床意义的可靠性指标,而结合自动化与临床监督的混合评估方法对于可信安全评估至关重要。

英文摘要

Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories. Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation. Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.

2606.00026 2026-06-02 cs.CL

Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation

在线社区中抑郁的认知语言指标:基于DistilBERT和全息简化表示的分析

Brian Van Steen

发表机构 * School of Computing, University of Leeds(利兹大学计算机学院)

AI总结 本研究结合认知语言学特征与DistilBERT嵌入,通过混合模型(DistilBERT+HRR)在Reddit帖子中检测抑郁,宏F1达0.94,优于基线TF-IDF的0.80。

AI中文摘要

本文研究将认知基础的语言特征与基于transformer的嵌入相结合是否能改善在线文本中抑郁的自动检测。基于Beck的抑郁认知理论,研究提取认知扭曲作为可测量特征,包括第一人称代词密度、绝对化词汇以及Reddit中抑郁相关和对照社区帖子中的负面情绪。使用Kaggle Reddit自杀和抑郁检测数据集的一个子集,比较了两个分类流程:作为基线的TF-IDF嵌入与朴素贝叶斯,以及一个混合模型,该模型将DistilBERT句子嵌入与编码认知语言特征的全息简化表示(HRR)向量拼接,然后进行逻辑回归。混合DistilBERT HRR模型的宏F1得分为0.94,而TF-IDF基线为0.80,5折交叉验证F1从0.83提高到0.92,AUC从0.958提高到0.981。

英文摘要

This paper investigates whether combining cognitively grounded linguistic features with transformer-based embeddings improves automated detection of depression in online text. Using Beck's Cognitive Theory of Depression, the study extracts cognitive distortions as measurable features, including first-person pronoun density, absolutist words, and negative emotion in Reddit posts from depression-related and control communities. Using a subset of the Kaggle Reddit Suicide and Depression Detection dataset, two classification pipelines are compared: a TF-IDF embedding with Naive Bayes as a baseline, and a hybrid model that concatenates DistilBERT sentence embeddings with Holographic Reduced Representation (HRR) vectors encoding the cognitive-linguistic features, followed by Logistic Regression. The hybrid DistilBERT HRR model achieves a macro F1 score of 0.94 versus 0.80 for the TF-IDF baseline, with 5-fold cross-validation F1 improving from 0.83 to 0.92, and AUC from 0.958 to 0.981.
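Two of the ingredients named above are easy to sketch: lexical densities for first-person and absolutist words, and circular convolution, the standard binding operator of Holographic Reduced Representations. The word lists and the tokenizer below are illustrative assumptions, and the sketch does not attempt to reproduce how the study maps features into HRR vectors.

```python
import re

# Illustrative word lists (assumed, not the study's lexicons).
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
ABSOLUTIST = {"always", "never", "nothing", "everything", "completely", "totally"}

def feature_densities(text):
    """Share of tokens that are first-person pronouns or absolutist words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1  # avoid division by zero on empty input
    return {
        "first_person": sum(t in FIRST_PERSON for t in tokens) / n,
        "absolutist": sum(t in ABSOLUTIST for t in tokens) / n,
    }

def circular_convolution(a, b):
    """HRR binding: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]

densities = feature_densities("I never feel like myself")
```

In an HRR scheme a role vector (say, for "absolutist density") would be bound to a value vector with `circular_convolution` and the bound pairs summed into one fixed-width vector; that encoding step is left out here because the abstract does not specify it.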

2606.00023 2026-06-02 cs.CL cs.AI cs.LG

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

TrustLDM:语言扩散模型的可信度基准测试

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

发表机构 * State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(通用人工智能国家重点实验室,智能科学与技术学院,北京大学) CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) School of EECS, Peking University(信息科学技术学院,北京大学) Institute for Artificial Intelligence, Peking University(人工智能研究院,北京大学)

AI总结 针对语言扩散模型(LDM)的可信度问题,提出TrustLDM基准,评估其在不同架构和恶意上下文下的安全性、隐私性和公平性,并开发自动评估框架TrustLDM-Auto以识别脆弱配置。

AI中文摘要

语言扩散模型(LDM)的快速发展挑战了自回归模型在语言处理中的主导地位。然而,其灵活、任意顺序的解码策略不仅实现了快速解码速度,还可能带来新的可信度挑战。为了更好地理解其流程背后的风险,我们引入了一个针对LDM的全面可信度基准(TrustLDM),评估不同LDM架构在多种静态后上下文类别下的安全性、隐私性和公平性。我们的实证结果表明,尽管LDM在仅使用用户提示时通常表现出较强的可信度,但当恶意后上下文附加到掩码响应时,其对齐行为明显下降。我们进一步观察到,较长的上下文不一定产生更强的影响,解码顺序和生成长度都会影响评估结果。最后,我们提出了TrustLDM-Auto,一个利用LDM解码灵活性自动识别脆弱配置的评估框架,揭示了所有评估模型和维度上的显著可信度弱点。我们的工作可能有助于社区构建更可信的LDM。我们的代码可在https://github.com/PKU-ML/TrustLDM获取。

英文摘要

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.

2606.00022 2026-06-02 cs.CL cs.AI

lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation

lmfaoooo at SemEval-2026 Task 1: 幽默是受众。面向约束幽默生成的偏好建模

Alexey Tikhonov, Alexey Ivanov

发表机构 * Inworld.AI Berlin, Germany(Inworld.AI柏林,德国) OpenAI Mountain View, CA(山景城,加利福尼亚州)

AI总结 针对约束幽默生成任务,提出“生成多候选-偏好选择”策略,利用人类成对比较训练偏好模型,在MWAHAHA任务英、中、西语子任务中分别获得第1、第1和第2名。

Comments
5 pages. Accepted for SEMEVAL 2026
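AI Chinese abstract (translated to English)

Humor generation remains difficult, not only because producing fluent, novel jokes is hard, but also because "funny" depends on the audience and the supervision signal is noisy: preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for SemEval-2026 Task-1 (MWAHAHA), a task focused on humor generation under explicit constraints. The task evaluates submitted systems through human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many -> select-best" strategy. First, we generate a diverse candidate pool for each instance through multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs with a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. On three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to candidate ranking in the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.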

English abstract

Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy -- preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many -> select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.
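The abstract does not say how labeled comparisons are turned into a preference model. As a rough illustration only, the sketch below fits a Bradley-Terry model, a standard technique for pairwise data that is swapped in here and is not claimed to be the authors' pipeline. The function names, learning rate, and toy comparisons are all hypothetical.

```python
import math


def fit_bradley_terry(n_items, comparisons, lr=0.1, epochs=500):
    """Fit one latent score per item by gradient ascent on the
    Bradley-Terry log-likelihood.

    comparisons: list of (winner_index, loser_index) pairs.
    """
    scores = [0.0] * n_items
    for _ in range(epochs):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            # P(winner beats loser) under the current scores
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grads)]
        # Scores are identified only up to a constant shift, so re-centre them
        mean = sum(scores) / n_items
        scores = [s - mean for s in scores]
    return scores


def select_best(candidates, scores):
    """Return the candidate with the highest fitted score."""
    return max(zip(candidates, scores), key=lambda pair: pair[1])[0]


# Hypothetical toy data: (winner, loser) indices over three candidate jokes
toy_comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2), (1, 2), (1, 2), (2, 1)]
toy_scores = fit_bradley_terry(3, toy_comparisons)
best = select_best(["joke A", "joke B", "joke C"], toy_scores)
```

This toy version only ranks items that appear in the comparisons. A preference model used to rank unseen candidates would need to score text features rather than item indices.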

2606.00021 2026-06-02 cs.CL cs.AI cs.LG

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

Shaowen Chen, Zhicheng Liao, Hongwei Wang

Affiliations * arXiv.org cs.CL

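AI summary Proposes SENSE, which replaces surface-form matching with semantic embedding navigation and a Soft-gated Evaluation module to improve the robustness and speedup of retrieval-based speculative decoding, reaching up to 4.09 mean acceptance length and 3.26x speedup on the LLaMA and Qwen families.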

Details
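AI Chinese abstract (translated to English)

Speculative decoding (SD) accelerates large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that the target model verifies in parallel, without compromising generation quality. Although retrieval-based speculative decoding (RSD) is favored for its plug-and-play versatility, its potential is held back by rigid lexical dependencies, which make retrieval and verification sensitive to surface-form variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which allows the Soft-gated Evaluation module to verify semantic equivalence rather than surface form. To ensure rigorous benchmarking, we decompose existing methods into atomic primitives within a unified framework, enabling fine-grained, component-level comparison. Extensive experiments across multiple domains show that SENSE outperforms multiple baselines on the LLaMA and Qwen families, achieving up to 4.09 mean acceptance length and 3.26x speedup while preserving generation quality. Our code will be released upon publication.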

English abstract

Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation quality. While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms. To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison. Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining up to 4.09 mean acceptance length and 3.26x speedup, while preserving generation quality. Our code will be released upon publication.
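To make the "mean acceptance length" figure concrete, here is a minimal sketch of exact-match draft verification, the surface-form style of check the abstract contrasts SENSE with. It assumes one common counting convention, in which each verification step yields the accepted draft prefix plus one token from the target model. The abstract does not state the paper's exact definition, and the function names and toy tokens are hypothetical.

```python
def accepted_length(draft_tokens, target_tokens):
    """Number of leading draft tokens that exactly match the target's choices."""
    n = 0
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            break
        n += 1
    return n


def mean_tokens_per_step(steps):
    """steps: list of (draft_tokens, target_tokens) pairs, one per verification step.

    Assumed convention: each step contributes its accepted draft prefix
    plus one token supplied by the target model.
    """
    total = sum(accepted_length(draft, target) + 1 for draft, target in steps)
    return total / len(steps)


# Hypothetical toy trace of three verification steps
toy_steps = [
    (["the", "cat", "sat"], ["the", "cat", "sat", "on"]),  # all 3 drafts accepted
    (["on", "a"], ["on", "the", "mat"]),                    # 1 draft accepted
    (["rug"], ["mat"]),                                     # 0 drafts accepted
]
toy_mean = mean_tokens_per_step(toy_steps)
```

Under this convention the toy trace produces 4 + 2 + 1 = 7 tokens over 3 steps. A softer verifier that accepts semantically equivalent drafts would lengthen the accepted prefixes and so raise the mean.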

2606.00020 2026-06-02 cs.CL cs.AI

CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards

Wei Tian, Yuhao Zhou, Man Lan

Affiliations * School of Computer Science and Technology, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University

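AI summary Proposes CSRP, a three-stage framework that combines continual pre-training, Chain-of-Thought supervised fine-tuning, and Group Relative Policy Optimization with an efficiency-aware reward, achieving state-of-the-art performance on the NACGEC benchmark and effectively mitigating over-correction bias.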

Details
Comments
Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main conference)
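AI Chinese abstract (translated to English)

LLM-based Chinese grammatical error correction systems face two key challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and supervised fine-tuning with maximum likelihood estimation cannot optimize precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through the following steps: continual pre-training on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought supervised fine-tuning with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel efficiency-aware reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming the previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also raises CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies show that the RL alignment stage contributes an 8% relative gain over the SFT baseline and that this gain is orthogonal to the contribution of large-scale CPT, confirming that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.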

English abstract

Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 $F_{0.5}$ and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes an 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at https://github.com/TW-NLP/ChineseErrorCorrector.
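For readers unfamiliar with $F_{0.5}$, the sketch below shows the standard $F_\beta$ formula, $F_\beta = (1+\beta^2)PR / (\beta^2 P + R)$, which weights precision more heavily than recall when $\beta < 1$. It also back-solves the recall implied by the two figures in the abstract. That recall is not reported in the abstract. It is derived here under the assumption that the precision and $F_{0.5}$ come from the same evaluation and follow the standard formula. The function names are hypothetical.

```python
def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision above recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def implied_recall(precision, f_score, beta=0.5):
    """Solve the F-beta formula for recall, given precision and F-beta."""
    b2 = beta * beta
    return b2 * precision * f_score / ((1 + b2) * precision - f_score)


# Figures quoted in the abstract (percent scale)
reported_precision = 57.17
reported_f05 = 50.99

# Derived under the stated assumption; not a number reported by the authors
recall_estimate = implied_recall(reported_precision, reported_f05)
round_trip = f_beta(reported_precision, recall_estimate)
```

Under those assumptions the implied recall is roughly 35.6, well below the precision, which fits the abstract's emphasis on penalizing unnecessary edits.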

2606.00017 2026-06-02 cs.AI cs.CL cs.MA

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov

Affiliations * iMak AI Lab

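AI summary Proposes delayed per-step reward attribution, combined with eligibility gating, asynchronous rollout generation, and curriculum-based opponent sampling, to achieve stable, sample-efficient reinforcement learning in multi-agent environments, and takes the lead on the MindGames Arena benchmark at NeurIPS 2025.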

Details
Comments
18 pages, 2 figures, 9 tables. Technical report. First place in both Open and Efficient tracks of MindGames Arena Generalization Track at NeurIPS 2025
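AI Chinese abstract (translated to English)

Training language model agents for multi-agent strategic interaction faces a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that a reward can be assigned at every step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to the originating steps according to task-specific semantics, and excludes steps that lack valid dependent information. Combined with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, the approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.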

English abstract

Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.
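The idea of computing rewards only at episode end and gating out ineligible steps can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline. The `Step` fields, the `eligible` flag, and the attribution rule (each step receives its own player's final score) are hypothetical stand-ins for the "task-specific semantics" the abstract mentions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    player: int
    action: str
    eligible: bool                 # hypothetical gate: False if the step lacks valid dependent information
    reward: Optional[float] = None  # unset until the episode has finished


def attribute_rewards(steps: List[Step], final_scores: Dict[int, float]) -> List[Step]:
    """Assign rewards after the episode ends and return the steps kept for training."""
    kept = []
    for step in steps:
        if not step.eligible:
            continue  # gated out: excluded from training
        # Assumed rule: propagate the player's final outcome back to the step
        step.reward = final_scores[step.player]
        kept.append(step)
    return kept


# Hypothetical two-player episode; the second step is gated out
toy_episode = [
    Step(player=0, action="cooperate", eligible=True),
    Step(player=1, action="illegal move", eligible=False),
    Step(player=1, action="defect", eligible=True),
]
toy_kept = attribute_rewards(toy_episode, final_scores={0: 1.0, 1: -1.0})
```

Deferring assignment this way means no step is rewarded on information that was unavailable or invalid when it was taken. A real system would likely use richer attribution rules than a single per-player score.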