arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29807 2026-05-29 cs.CL cs.AI cs.LG

Data filtering methods for training language models

Egor Shevchenko, Elena Bruches

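AI summary: This paper compares two automatic label error detection methods, Confident Learning and Dataset Cartography, on Russian text classification tasks. Their effectiveness depends on dataset characteristics; on a small, high-noise dataset, Confident Learning significantly improves F1-macro.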

Comments
AINL-2026
Abstract

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
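
The comparison of targeted filtering against random removal can be sketched as follows. This is a minimal illustration, assuming out-of-sample predicted probabilities are already available and that every class has at least one labeled example. The function names (`confident_learning_flags`, `random_removal_mask`) are hypothetical, the thresholding is a simplified form of Confident Learning's per-class confidence thresholds, and nothing here reproduces the authors' pipeline.

```python
import numpy as np

def confident_learning_flags(probs, labels):
    """Flag likely label errors with a simplified Confident Learning rule.

    probs:  (n, k) out-of-sample predicted probabilities.
    labels: (n,) given (possibly noisy) integer labels.
    Assumes every class occurs at least once in `labels`.
    """
    n, k = probs.shape
    # Per-class threshold: mean confidence in class j over examples labeled j.
    thresholds = np.array([probs[labels == j, j].mean() for j in range(k)])
    # Mark classes whose probability clears their own threshold.
    above = probs >= thresholds
    # Among those classes, pick the most probable one; -1 if none qualifies.
    masked = np.where(above, probs, -np.inf)
    confident = np.where(above.any(axis=1), masked.argmax(axis=1), -1)
    # Flag an example when it is confidently assigned to a different class.
    return (confident != -1) & (confident != labels)

def random_removal_mask(n, n_remove, seed=0):
    """Control condition: mark the same number of examples uniformly at random."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_remove, replace=False)] = True
    return mask

# Toy data (illustrative only): the last example is labeled 1 but looks like class 0.
probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]])
labels = np.array([0, 0, 1, 1, 1])
flags = confident_learning_flags(probs, labels)
control = random_removal_mask(len(labels), int(flags.sum()))
```

Training once on `labels[~flags]` and once on `labels[~control]` gives the two conditions whose scores are compared; matching the number of removed examples keeps the comparison from being confounded by training-set size.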

2605.29803 2026-05-29 cs.LG

Gated Graph Attention Networks with Learnable Temperature

Zhongtian Ma, Hao Wu, Yexin Zhang, Qiaosheng Zhang, Zhen Wang

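AI summary: Proposes gated graph attention and a learnable temperature mechanism that filter unreliable feature dimensions and dynamically adjust the sharpness of the attention coefficient distribution, improving graph attention networks on homogeneous and heterophilic heterogeneous benchmarks.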

Abstract

Graph attention networks learn neighbor importance through data-dependent coefficients, but standard layers lack explicit control over unreliable feature dimensions and use fixed sharpness of attention coefficient distributions. This paper proposes gated graph attention and learnable temperature for common graph attention mechanisms. Gated graph attention filters feature or message responses to reduce the influence of unreliable dimensions, while learnable temperature dynamically adjusts the sharpness of the attention coefficient distribution. Experiments on homogeneous and heterophilic heterogeneous benchmarks show that the proposed variants consistently improve the corresponding graph attention backbones, and controlled noise studies further verify their behavior under feature perturbations. Theoretical analysis explains these results by showing that gating improves robustness when only part of the feature coordinates are reliable, while temperature is beneficial when global noise weakens the discriminability of node features.
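
How a temperature changes the sharpness of attention coefficients can be illustrated with a small sketch. It assumes a single scalar temperature `tau` applied to raw neighbor scores before the softmax; the abstract does not say how the paper parameterizes or learns the temperature, so the function name and the scalar form are assumptions made for illustration.

```python
import numpy as np

def attention_with_temperature(scores, tau):
    """Softmax over neighbor scores divided by a temperature tau > 0."""
    z = np.asarray(scores, dtype=float) / tau
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def entropy(p):
    """Shannon entropy (nats); lower values mean a sharper distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Raw attention scores for four neighbors of one node (illustrative values).
scores = [2.0, 1.0, 0.5, 0.1]
sharp = attention_with_temperature(scores, tau=0.5)  # concentrates on the top neighbor
flat = attention_with_temperature(scores, tau=4.0)   # spreads weight more evenly
```

A small `tau` pushes the coefficients toward the highest-scoring neighbor, while a large `tau` approaches uniform averaging over neighbors.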

2605.29801 2026-05-29 cs.AI cs.CL cs.CR cs.CV cs.LG

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

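AI summary: To address emerging safety risks from open-world agents, proposes a lightweight, scalable safety alignment framework that updates the safety taxonomy, builds a data engine, and trains small models (0.8B to 8B parameters), matching closed-source models while reducing deployment overhead by two orders of magnitude.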

Comments
44 pages, 12 Figures, 9 Tables
Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2605.29800 2026-05-29 cs.CL

Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

Guneet Kohli

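AI summary: Analyzing the voting behavior of 9 frontier LLMs on natural language inference tasks, the paper finds that highly correlated errors across models leave the panel with the information of only about 2 independent votes. Actual accuracy falls 8-22 percentage points below the independent-voting ideal, and neither adding judges nor improving the aggregation algorithm closes the gap.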

Comments
14 pages, 5 figures, 12 tables
Abstract

LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets (each with 100 human annotations per item), we find that the 9 judges effectively provide only about 2 independent votes' worth of information. Roughly three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. The consequences are stark: the panel's actual accuracy falls 8-22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions. Neither adding more judges nor using smarter aggregation algorithms helps -- established methods close at most 11% of this gap, even with access to the correct answers. We quantify these findings using the Kish effective sample size (n_eff) and a Condorcet null model, and show the deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench). The bottleneck is correlated judges, not the aggregation algorithm, implying that scaling up panels cannot substitute for genuinely independent evaluation.
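
The intuition behind an effective number of votes can be sketched with the equal-correlation form of the effective sample size, n_eff = n / (1 + (n - 1) * rho). This is a hedged illustration: the abstract names the Kish effective sample size but does not give the exact estimator, so the use of mean pairwise error correlation and both function names are assumptions.

```python
import numpy as np

def mean_pairwise_correlation(errors):
    """Mean off-diagonal correlation between judges' error indicators.

    errors: (n_items, n_judges) array of 0/1 values (1 = judge was wrong).
    Assumes no judge has a constant column (correlation would be undefined).
    """
    corr = np.corrcoef(errors, rowvar=False)
    n = corr.shape[0]
    off_diagonal = corr[~np.eye(n, dtype=bool)]
    return float(off_diagonal.mean())

def effective_votes(n_judges, rho):
    """Effective sample size under equal pairwise correlation rho."""
    return n_judges / (1.0 + (n_judges - 1) * rho)

# Independent judges keep all 9 votes; perfectly correlated judges collapse to 1.
independent = effective_votes(9, 0.0)
identical = effective_votes(9, 1.0)
```

Under this form, nine judges fall to about two effective votes when the mean pairwise error correlation is roughly 0.44. That value is an arithmetic illustration of the formula, not a figure reported in the paper.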

2605.29798 2026-05-29 cs.CV cond-mat.mtrl-sci eess.IV

Low-Magnification SEM May Suffice: Interpretable Deep Learning for Multi-Scale Fracture-Cause Classification in Zirconia-Toughened Alumina

Julian Schmid, Pawel Astankow, Tom Vater, Julius Beck, Robert Cichon, Danny Krautz

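AI summary: Proposes an interpretable vision-transformer workflow that automatically classifies fracture causes in alumina matrix composite implants from low-magnification SEM images, reaching accuracy comparable to high magnification.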

Abstract

Reliable identification of fracture origins in alumina matrix composite hip and knee implants is critical for quality assurance and patient safety, yet current fractographic workflows are time-consuming, partly subjective, and reliant on high-magnification scanning electron microscopy (SEM). We present an interpretable vision-transformer (ViT) workflow for automated classification of fracture causes in an alumina matrix composite (BIOLOX delta, CeramTec GmbH) widely used in total joint replacements. A dataset of 8,493 SEM images (50x-10,000x) was curated from five years of in-production burst and proof tests and annotated into three defect categories defined along the manufacturing chain: green body, hard machining, and material defects. Under severe class imbalance, the fine-tuned ViT reached an accuracy of 0.907 and a macro-F1 of 0.888 in stratified five-fold cross-validation, with a two-stage perceptual-hash/SSIM leakage audit confirming negligible specimen overlap. Notably, performance at low magnification (50x) was comparable to that at high magnification (1k-10kx), indicating that macro-scale features - mirror geometry and hackle line fields - already encode sufficient diagnostic signal. Grad-CAM attributions consistently localised on canonical fractographic cues (mirrors, hackles, pores, machining marks), aligning with established fractographic criteria. Together, these results position interpretable ViTs as a complementary tool for ceramic-implant quality assurance, enabling low-magnification pre-screening and reducing reliance on time-intensive high-magnification inspection.

2605.29797 2026-05-29 cs.CL

Metric-Dependent Annotation Saturation for Learning from Label Distributions

Guneet Kohli

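AI summary: Studies how the number of annotators needed when learning from label distributions depends on the evaluation metric. Entropy correlation needs 20-50 annotators to converge, KL divergence saturates at about 10 annotators, and soft labels outperform label smoothing.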

Comments
16 pages, 3 figures, 14 tables
Abstract

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
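
The difference between soft labels and label smoothing can be made concrete with a small sketch. It assumes soft labels are normalized annotator vote counts and that smoothing mixes the majority one-hot label with a uniform distribution at strength `eps`; the function names and vote counts are illustrative and do not come from the paper.

```python
import numpy as np

def soft_label(votes, num_classes):
    """Empirical label distribution from integer annotator votes."""
    counts = np.bincount(votes, minlength=num_classes)
    return counts / counts.sum()

def smoothed_label(majority_class, num_classes, eps):
    """Standard label smoothing: (1 - eps) * one-hot + eps * uniform."""
    p = np.full(num_classes, eps / num_classes)
    p[majority_class] += 1.0 - eps
    return p

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# A clear item (9 of 10 annotators agree) and an ambiguous one (4/3/3 split).
clear = soft_label([0] * 9 + [1], num_classes=3)
ambiguous = soft_label([0] * 4 + [1] * 3 + [2] * 3, num_classes=3)

# Smoothing sees only the majority class, so both items get the same target.
smoothed_clear = smoothed_label(int(clear.argmax()), 3, eps=0.1)
smoothed_ambiguous = smoothed_label(int(ambiguous.argmax()), 3, eps=0.1)
```

The soft labels have different entropies for the two items, while the smoothed targets are identical, which is the item-specific signal that smoothing cannot carry.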

2605.29795 2026-05-29 cs.AI

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

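AI summary: Proposes the MEMENTO framework, which uses the web as a learning signal through an Adaptive Exploration Tree and dual-channel memory, substantially improving performance in low-data professional domains (sales automation and legal research).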

Abstract

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct-based baselines (+25.6% on sales automation and +36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

2605.29794 2026-05-29 cs.AI

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie, Zhehong Ai, Na Zou, Yuqiang Li, Tianfan Fu

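AI summary: To address the performance degradation caused by static skill injection, proposes SkillsInjector, a two-stage adaptive method in which a context planner learns skill preferences and sets an adaptive budget, and a set-aware renderer optimizes how descriptions are presented. It improves over the strongest baseline by 3.9, 6.1, and 7.3 percentage points on three benchmarks, respectively.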

Abstract

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.

2605.29793 2026-05-29 cs.CV

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li

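AI summary: Proposes SpotVMR, which uses a learnable clip search model and low-cost semantic indexing features to efficiently trim query-relevant video clips, and serves as a plug-and-play module that improves the efficiency and performance of existing VMR methods.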

Comments
Published in AAAI 2024
Abstract

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions between the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Moreover, because the video is down-sampled into fixed-length clips, some query-related frames may be filtered out. This blurs the specific boundary of the target moment and treats adjacent irrelevant frames as new boundaries, which easily leads to cross-modal misalignment and introduces both boundary bias and reasoning bias. To this end, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. SpotVMR can also serve as a plug-and-play module that improves the efficiency of state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. In addition, a distillation loss is used to address the optimization issues arising from end-to-end joint training of the clip selector and the VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.

2605.29791 2026-05-29 cs.CL

ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation

Yutong Yang, Chenxi Miao, Weikang Li, Yunfang Wu

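AI summary: Proposes the ActTraitBench framework, which uses human data to establish one-to-one mappings between psychometric facets and behavioral paradigms and calibrates LLM score distributions via quantile mapping. It reveals a knowledge-decision gap between LLM self-reports and behavioral decisions and introduces the CoCA intervention to mitigate it.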

Abstract

While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap ($G_{\text{KD}}$). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address these issues, we propose ActTraitBench, a human-grounded evaluation framework for measuring personality consistency in LLMs. Grounded in empirical human data, ActTraitBench establishes one-to-one mappings between psychometric facets and behavioral paradigms, and applies a Distributional Calibration via Quantile Mapping procedure to align LLM-judge score distributions with human norms. Experiments on 14 mainstream LLMs reveal a pervasive knowledge-decision asymmetry, where larger and more capable models often exhibit stronger behavioral divergence despite highly consistent self-reports. To mitigate this gap, we further introduce the Chain of Cognitive Alignment (CoCA), a plug-and-play inference-time intervention that improves alignment in reasoning-capable frontier models while exposing clear capability limitations in smaller architectures.
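
Quantile mapping in general can be sketched as below. This is an illustration of the generic empirical-CDF technique, assuming a reference sample of LLM-judge scores and a reference sample of human scores on the same scale; the function name, the sample values, and the interpolation choices are assumptions, and the paper's actual calibration procedure may differ.

```python
import numpy as np

def quantile_map(scores, source_ref, target_ref):
    """Map scores from the source distribution onto the target distribution.

    Each score is converted to its empirical CDF position within `source_ref`,
    and the value at that same quantile is read off from `target_ref`.
    """
    src = np.sort(np.asarray(source_ref, dtype=float))
    tgt = np.sort(np.asarray(target_ref, dtype=float))
    # Fraction of source reference scores less than or equal to each score.
    q = np.searchsorted(src, scores, side="right") / len(src)
    # Value at the same quantile of the target sample (linear interpolation).
    return np.quantile(tgt, np.clip(q, 0.0, 1.0))

# Illustrative reference samples: the judge's scores bunch up at the top of the scale.
judge_ref = [6, 7, 7, 8, 8, 9, 9, 9, 10, 10]
human_ref = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
calibrated = quantile_map([6, 8, 10], judge_ref, human_ref)
```

The mapping preserves the ordering of scores while spreading a compressed judge distribution over the range observed in the human reference sample.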

2605.29790 2026-05-29 cs.MA cs.AI

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu, Hong Wang, Xiankun Lin, Qiang Lin, Can Wang, Hande Dong, Jiawei Chen

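AI summary: Proposes the Meta-Team framework, whose collaborative self-evolution mechanism uses execution experience to improve the behaviors, coordination, and team organization of multi-agent systems, significantly outperforming single-agent systems, hand-crafted MAS, and prior evolution methods on long-horizon tasks.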

Abstract

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.

2605.29788 2026-05-29 cs.AI cs.LG

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

Tim Woydt, Paul-David Zuercher

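AI summary: Proposes the Nested Causal Thompson Sampling (NCTS) algorithm and a PAC-Bayes excess-risk bound that certifies deployment policies from historical data, off-policy and anytime, addressing causal coupling across timescales in hierarchical causal bandits.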

Abstract

Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.

2605.29786 2026-05-29 cs.AI

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren

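AI summary: Proposes the Croissant Tasks metadata format, which decouples task problem from solution through a declarative specification and, combined with an automated LLM pipeline, enables conceptual reproducibility, allowing autonomous agents to generate reproducible evaluation pipelines from scratch.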

Comments
10 pages, 4 figures
Abstract

Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.

2605.29782 2026-05-29 cs.LG cs.AI cs.CL

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

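AI summary: To address inaccurate state value estimation in LLM reinforcement learning, proposes two methods, Numca (which uses numerical spans as gradable milestones) and Hista (which uses hidden states to weight-average disjoint rollouts and their returns), significantly improving estimation accuracy and training performance.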

Comments
Accepted at ICML 2026
Abstract

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state value estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.

2605.29776 2026-05-29 cs.CV

Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning

Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

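AI summary: To address CLIP's performance degradation in cross-domain few-shot learning, proposes Adaptive Tail-Head Alignment (ATHA), which selectively weakens the alignment of low-similarity image tokens to reduce overfitting and achieves the best results on four benchmarks.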

Comments
Accepted by ICML 2026
Abstract

Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance significantly degrades in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing away certain low-similarity image tokens, termed "tail tokens", from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under large domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief in alignment is valid only for tokens already containing sufficient semantic information; for tail tokens, forcing the alignment would lead to excessive overfitting to the scarce training data, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm into an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/ATHA.

2605.29773 2026-05-29 cs.CV cs.AI cs.RO

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

能量感知NECO:用于语义分割中单次逐像素分布外检测

Boyuan Zhang, Huanshan Huang, Yifei Cao

AI总结 提出一种结合NECO几何比率和能量分数的混合方法,实现单次前向传播的逐像素分布外检测,在miniMUAD数据集上AUROC达0.8539,优于单独使用NECO或能量分数。

详情
Comments
7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)
AI中文摘要

移动机器人的可靠语义分割需要准确的密集预测和分布偏移下的鲁棒不确定性估计。强不确定性基线如蒙特卡洛Dropout通常需要重复的随机前向传播,难以在边缘平台上部署。我们提出能量感知NECO,一种用于语义分割的单次逐像素分布外(OOD)检测器。该方法将从解码器特征计算的居中NECO风格几何比率与基于logit的能量分数相结合。两个分量均使用在纯分布内验证集上拟合的统计量进行标准化,并通过凸组合融合。我们在miniMUAD子集上使用真实像素级OOD标签评估该方法。所提出的混合分数达到0.8539的AUROC,优于仅NECO(0.8280)、仅能量(0.8171)和集成预测熵基线(0.8124)。额外的定性和操作点分析表明,混合检测器在保持单次设计效率优势的同时,提高了整体排名性能。代码可在https://github.com/boyuan-zhangx/Energy-Aware_NECO获取。

英文摘要

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
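The fusion step described above (standardize each component with in-distribution validation statistics, then take a convex combination) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the weight `alpha`, and the sign convention (higher score means more in-distribution) are assumptions, and the NECO ratio itself is taken as a precomputed input.

```python
import numpy as np

def energy_score(logits):
    """Log-sum-exp over the class axis (last axis), i.e. the negative free energy.

    Assumed convention: higher values indicate more in-distribution pixels.
    """
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def fit_stats(scores):
    """Mean and standard deviation fitted on in-distribution validation scores."""
    return float(scores.mean()), float(scores.std() + 1e-8)

def fuse(neco, energy, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of z-scored components; alpha is a hypothetical weight in [0, 1]."""
    z_neco = (neco - neco_stats[0]) / neco_stats[1]
    z_energy = (energy - energy_stats[0]) / energy_stats[1]
    return alpha * z_neco + (1.0 - alpha) * z_energy
```

Because both components are standardized with statistics from the same validation split, `alpha` directly controls their relative influence; a single forward pass supplies both the logits and the decoder features.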

2605.29771 2026-05-29 cs.RO

Joint Angle Estimation with Customized Wristband Based on Online Incremental Learning

基于在线增量学习的定制腕带关节角度估计

Shuo Wang, Xiaobin Chen, Xiaoming Tao

AI总结 提出一种基于在线增量学习的定制腕带系统,通过两阶段方法(在线学习更新模型+模型估计)实现腕关节角度估计,适应不同佩戴配置下的数据漂移,误差约15度。

详情
AI中文摘要

智能可穿戴技术在人机交互、运动和健康监测中扮演着越来越重要的角色。为了确保使用的舒适性和实用性,运动监测的一种常见形式是利用软体可穿戴传感器。然而,许多关于可穿戴传感器的研究应用过于简单,难以适应不同情况。本研究提出了一种基于在线增量学习方法的定制腕带系统,用于估计腕关节角度。这是一种两阶段估计方法:第一阶段根据佩戴者的手腕运动特征,利用在线学习更新模型,并集成来自IMU的实时数据作为真实值。第二阶段仅使用腕带利用更新后的模型进行腕关节角度估计。换句话说,模型训练在数据采集过程中完成,使得训练好的模型可用于后续的角度估计。该方法在适应由不同测试配置引起的数据漂移方面具有优势,例如同一受试者的左右手腕、同一手腕上佩戴位置的偏差,甚至不同受试者之间的差异。结果表明,传感器在应变变化下表现出良好的性能,所提系统在不同场景下的腕关节轨迹估计误差约为15度。

英文摘要

Intelligent wearable technology plays an increasingly important role in human-computer interaction, motion, and health monitoring. To ensure comfort and practicality of use, one common form for motion monitoring is to utilize soft wearable sensors. However, many research applications regarding wearable sensors are simplistic and difficult to adapt to different situations. This study proposes a system for estimating the angle of the wrist joint using a customized wristband based on an online incremental learning approach. It is a two-stage estimation method: the first stage updates the model based on the wearer's wrist movement characteristics using online learning, integrating real-time data from an IMU as ground truth. The second stage utilizes the updated model to estimate the wrist joint angle with the wristband alone. In other words, model training is completed during data acquisition, allowing the trained model to be used for subsequent angle estimation. This method offers advantages in adapting to data drift caused by variations in testing configurations, such as the left and right wrists of the same subject, deviations in the wearing position on the same wrist, and even differences among subjects. The results indicate that the sensors exhibit good performance under strain variations, and the wrist joint trajectory estimation of the proposed system has an error of approximately 15 degrees in different scenarios.

2605.29768 2026-05-29 cs.AI

From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks

从XXLTraffic到EvoXXLTraffic:将交通预测扩展到传感器演化的网络

Du Yin, Hao Xue, Arian Prabowo, Shuang Ao, Flora Salim

AI总结 针对现有交通预测基准假设固定传感器集的问题,提出包含长达27年数据的XXLTraffic数据集及其传感器演化版本EvoXXLTraffic,定义年度流式预测协议,并评估多种基线方法,发现超大规模演化数据集更贴近现实且许多现有SOTA方法失效。

详情
Comments
Under Review
AI中文摘要

现有的交通预测基准假设固定的传感器集,但实际道路传感器网络随着道路网络逐年变化而持续增长。我们引入了XXLTraffic数据集系列,涵盖长达27年的加州PeMS和新南威尔士州交通数据。XXLTraffic的固定传感器子集支持多年间隔的极长周期预测以及标准的每小时/每日长时预测。我们将其扩展为EvoXXLTraffic,这是一个传感器演化的重组版本,暴露了每年活跃的传感器、年度交通流矩阵以及九个PeMS区域的年度图快照,增长率从+305%到超过+10,000%。我们在EvoXXLTraffic上定义了一个年度流式预测协议,其中每个日历年是一个持续任务,并评估了来自静态时空GNN、朴素在线方案、演化图持续方法以及检索/测试时方法的各种代表性基线。我们发现,我们的超大规模演化数据集更好地反映了现实世界,许多最先进(SOTA)结果不再有效。我们的数据集通过支持在超长演化道路网络下更现实的预测,补充了现有的基准。

英文摘要

Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, naïve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects the real world, and that many state-of-the-art (SOTA) results no longer hold on it. Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.

2605.29766 2026-05-29 cs.RO

MARS Policy: Multimodality Only When It Matters

MARS策略:仅在必要时使用多模态

Jindou Jia, Tuo An, Yuxuan Hu, Gen Li, Jingliang Li, Bohan Hou, Xiangyu Chen, Jiaqi Bai, Bofan Lyu, Jianfei Yang

AI总结 提出MARS策略,通过自适应地在需要时引入随机性,在单模态阶段使用确定性学习,平衡生成策略的多模态能力与确定性模型的高效率,在模拟和真实任务中提升成功率和推理速度。

详情
Comments
13 figures, 17 pages
AI中文摘要

模仿学习已成为解决复杂机器人操作任务的基石。特别是,多模态使机器人能够捕捉多样且有效的行为模式,推动了生成策略作为机器人学习主导范式的迅速兴起。然而,实现这种多模态通常依赖于随机噪声初始化和迭代去噪过程,导致训练复杂度高、推理效率低。同时,机器人任务的并非所有阶段都固有地需要行为多样性。受此启发,我们提出了模态自适应机器人采样(MARS)策略,该策略仅在真正有益时自适应地调用定制的随机性,而在单模态阶段恢复为高效的确定性学习。换句话说,仅在适当的时间注入适量的噪声。通过选择性激活多模态生成,MARS策略弥合了生成策略的多模态能力与确定性模型优越的训练和推理效率之间的差距。在8个模拟和4个真实世界任务上的实证研究表明,MARS展现出鲁棒的多模态表达能力和高效率,在真实世界测试中成功率提升16.67%,推理延迟降低83.20%。反直觉的是,MARS在近确定性任务上的训练效率也超过了确定性策略,因为它更有效地建模了细微的动作多样性。

英文摘要

Imitation learning has become a cornerstone for solving complex robotic manipulation tasks. In particular, multimodality, which enables robots to capture diverse yet valid behavioral patterns, has driven the rapid emergence of generative policies as a dominant paradigm in robot learning. However, achieving such multimodality typically relies on stochastic noise initialization and iterative denoising procedures, resulting in substantial training complexity and low inference efficiency. Meanwhile, not all phases of a robotic task inherently require behavioral diversity. Motivated by this insight, we propose the Modality-Adaptive Robot Sampling (MARS) policy, which adaptively invokes tailored stochasticity only when it is truly beneficial, while reverting to efficient deterministic learning during single-modal phases. In other words, the proper amount of noise is injected only at the proper time. By selectively activating multimodal generation, the MARS policy bridges the gap between the multimodal capability of generative policies and the superior training and inference efficiency of deterministic models. Empirical studies across 8 simulated and 4 real-world tasks demonstrate that MARS exhibits robust multimodal expressivity and high efficiency, with a 16.67% success rate improvement and an 83.20% inference latency reduction in real-world tests. Counterintuitively, MARS also outpaces deterministic policies in training efficiency on near-deterministic tasks by more effectively modeling nuanced action diversity.

2605.29765 2026-05-29 cs.LG

MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion

MMTM: 基于相似性门控融合的长视频三模态主题建模

Ali Abusaleh, Bhuvanesh Verma, Alexander Mehler

AI总结 提出MMTM模块化流水线,通过相似性门控融合集成语音识别、音频和视觉嵌入及BERTopic聚类,在长视频主题发现中显著提升主题质量。

详情
Comments
Submitted to EMNLP 2026
AI中文摘要

我们介绍了MMTM,一个用于长视频主题发现的模块化流水线,它通过确定性相似性门控融合集成了语音识别、音频和视觉嵌入以及BERTopic聚类。在德语(Tagesschau)和英语(NBC)广播新闻上进行跨语言评估,联合三模态建模显著提高了主题质量:噪声从0.27降至0.06,转换率从0.70降至0.21,归一化熵从0.84升至0.92,表明主题更加连贯且时间稳定。聚类有效性(Calinski-Harabasz)在嵌入空间上提高了5-12倍。词汇连贯性(NPMI)在德语上从0.77升至0.86,但依赖于语料库,并未迁移到较短的NBC广播中。我们发布了流水线代码和一个经过人工验证的54小时多模态视频主题语料库,包含双标注者视觉评估和LLM辅助标注。

英文摘要

We introduce MMTM, a modular pipeline for topic discovery in long-form video that integrates speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, transition rate from 0.70 to 0.21, and normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves by 5-12X across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on German but is corpus-dependent and does not transfer to the shorter NBC broadcasts. We release the pipeline code and a human-validated 54-hour multimodal video topic corpus with dual-annotator visual evaluation and LLM-assisted labeling.
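The abstract reports noise, transition rate, and normalized entropy without defining them. A minimal sketch of one plausible reading is given below; the definitions are assumptions, not the authors' formulas. It assumes one topic label per consecutive video segment and uses BERTopic's convention that the outlier topic is labeled `-1`.

```python
import math
from collections import Counter

def noise_rate(labels, noise_label=-1):
    """Fraction of segments assigned to the outlier topic (assumed label: -1)."""
    return sum(1 for t in labels if t == noise_label) / len(labels)

def transition_rate(labels):
    """Fraction of adjacent segment pairs whose topic label changes."""
    if len(labels) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / (len(labels) - 1)

def normalized_entropy(labels):
    """Shannon entropy of the topic distribution divided by log(number of topics)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    n = len(labels)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))
```

Under these definitions a lower transition rate means topics persist across neighbouring segments, and a normalized entropy near 1 means segments are spread evenly over the discovered topics.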

2605.29762 2026-05-29 cs.CV

GeoMag: Geometric-Aware Video Motion Magnification via State Space Model

GeoMag: 基于状态空间模型的几何感知视频运动放大

Kecheng Han, Yuchen Zhang, Bingqing Liu, Boqiang Guo, Wenbin Zheng, Shiyuan Pei

AI总结 提出GeoMag框架,利用状态空间模型实现全局一致的运动放大,并构建Geo-200K数据集提升训练多样性,在视觉保真度和计算效率上优于现有方法。

详情
Comments
ICME 2026 Spotlight
AI中文摘要

视频运动放大(VMM)揭示了不可感知的动态,但在复杂几何变换下常常遭受结构不一致的问题。现有的基于学习的方法通常面临CNN的有限全局上下文与Transformer的高计算成本之间的权衡。此外,当前的训练协议主要由简单的线性运动主导,未能捕捉真实世界视频中遇到的几何和成像复杂性。为了解决这些问题,我们提出了GeoMag,一个基于状态空间模型的几何感知VMM框架,以实现具有线性复杂度的全局一致运动放大。我们进一步构建了Geo-200K,一个大规模合成数据集,引入了丰富的几何变换以及传感器真实的退化,提高了训练信号的多样性和真实性。在合成和真实世界基准上的大量实验表明,GeoMag在视觉保真度和计算效率上始终优于先前的方法,同时产生更少的伪影和更好的结构一致性。

英文摘要

Video Motion Magnification (VMM) reveals imperceptible dynamics but often suffers from structural inconsistencies under complex geometric transformations. Existing learning-based methods generally face a trade-off between the limited global context of CNNs and the high computational cost of Transformers. In addition, current training protocols, largely dominated by simple linear motion, fail to capture the geometric and imaging complexities encountered in real-world videos. To address these issues, we propose GeoMag, a geometric-aware VMM framework built upon State Space Models to achieve globally consistent motion amplification with linear complexity. We further construct Geo-200K, a large-scale synthetic dataset that introduces rich geometric transformations together with sensor-realistic degradations, improving the diversity and realism of training signals. Extensive experiments on synthetic and real-world benchmarks show that GeoMag consistently outperforms prior methods in visual fidelity and computational efficiency, while producing fewer artifacts and better structural consistency.

2605.29761 2026-05-29 cs.CV cs.CG

S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields

S2MDF:用于无交叉多物体有符号距离场的即插即用层

Deniz Sayin Mercadier, Federico Stella, Aurel Bizeau, Nicolas Talabot, Pascal Fua

AI总结 提出S2MDF模块,通过硬约束强制向量值有符号距离场避免物体间几何交叉,无需修改网络架构,在训练或后处理中均可使用,显著减少交叉至数值精度且保持重建质量。

详情
AI中文摘要

组合隐式表面表示将场景建模为物体集合,每个物体由有符号距离场(SDF)编码。该方法的一个基本限制是多个SDF可能产生相互穿透的几何形状,违反物理合理性。现有的缓解策略依赖于软惩罚项,这些项减少但不能消除交叉,并且需要仔细的损失加权。为了真正防止相互穿透,我们提出了对向量值SDF的硬约束,并引入了S2MDF,一个轻量级的即插即用模块,无需架构修改即可对任何物体组合SDF表示施加约束。它引入可忽略的计算开销,并与线性插值的标准网格化算法(如Marching Cubes)兼容。它可以在训练期间或作为后处理步骤应用。在多种最先进的组合方法上的实验表明,S2MDF将交叉减少到数值精度,同时保持重建质量,优于现有的缓解策略。

英文摘要

Compositional implicit surface representations model scenes as collections of objects, each encoded by a Signed Distance Field (SDF). A fundamental limitation of this approach is that multiple SDFs can produce geometries that interpenetrate, violating physical plausibility. Existing mitigation strategies rely on soft penalty terms that reduce but do not eliminate intersections, and require careful loss weighting. To truly prevent interpenetration, we propose a hard constraint on vector-valued SDFs and introduce S2MDF, a lightweight plug-and-play module that enforces the constraint on any object-compositional SDF representation without architectural modifications. It introduces negligible computational overhead and is compatible with linearly-interpolated standard meshing algorithms such as Marching Cubes. It can be applied during training or as a post-processing step. Experiments on multiple state-of-the-art compositional methods show that S2MDF reduces intersections to numerical precision while preserving reconstruction quality, outperforming existing mitigation strategies.

2605.29756 2026-05-29 cs.AI

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

LFQ:面向提升低比特量化LLM生成质量的逻辑感知最终块量化

Jung Hyun Lee, June Yong Yang, Jungwook Choi, Eunho Yang

AI总结 针对低比特量化LLM在生成任务中质量下降的问题,提出通过最小化FP模型与量化模型在最终Transformer块上的logits交叉熵来优化量化,从而提升复杂生成任务的准确性。

详情
Comments
Accepted to ICML 2026
AI中文摘要

随着大语言模型规模的持续扩大,低比特权重的训练后量化(PTQ)为其内存高效部署提供了实用解决方案。尽管分块PTQ在基本语言建模和理解任务上能够匹配全精度(FP)基线,但其在生成任务(尤其是长响应和扩展思维链,这对提升任务准确性至关重要)上的质量有所下降。我们将这一不足归因于两个因素:(i) 分块优化中忽略了反嵌入层(LM头),以及(ii) 对均方误差(MSE)目标的依赖。这两个因素导致量化模型的令牌概率分布与FP模型不一致,从而在文本生成基准上产生显著的准确性下降。为纠正这一偏差,我们引入了逻辑感知最终块量化(LFQ),这是对分块PTQ的一种简单而有效的增强,通过最小化FP模型与其量化对应模型在logits上的交叉熵来量化最终Transformer块。通过在最终块中在logit级别对齐令牌概率,LFQ在不同模型家族中持续提升了复杂生成任务的准确性,优于最先进的分块PTQ,同时在语言建模和理解任务上保持与FP基线相当的性能。

英文摘要

As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of matching the full-precision (FP) baseline on basic language modeling and understanding, its quality is degraded for generative tasks -- especially for longer responses and extended chains of thought, which are critical for boosting task accuracy. We attribute this shortfall to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To rectify the discrepancy, we introduce Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. By aligning token probabilities at the logit level in the final block, LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ across diverse model families, while maintaining parity with FP baselines on language modeling and understanding.
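The logit-level objective can be illustrated with a small NumPy sketch: the cross-entropy between the token distribution implied by the FP logits and the one implied by the quantized model's logits. This shows only the loss, under the assumption that both logit arrays have shape `(tokens, vocab)`; the function names are hypothetical, and how LFQ optimizes the quantization parameters is not covered here.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis (last axis)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def logit_cross_entropy(fp_logits, q_logits):
    """Mean cross-entropy H(p_fp, p_q) between FP and quantized token distributions."""
    p_fp = np.exp(log_softmax(fp_logits))
    return float(-(p_fp * log_softmax(q_logits)).sum(axis=-1).mean())
```

The loss is smallest when the two distributions coincide, and it ignores a constant shift of the quantized logits because softmax is shift-invariant. A squared-error objective would penalize such a shift even though the token probabilities are unchanged.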

2605.29754 2026-05-29 cs.AI

Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models

基于Transformer的脑电图基础模型位置编码策略基准测试

Ayse Betul Yuce, Sebastian Stober

AI总结 本研究在CBraMod骨干网络中基准测试五种位置编码策略,通过线性探测和微调协议评估运动想象分类和情感识别任务,发现最优策略具有任务依赖性。

详情
AI中文摘要

脑电图(EEG)是一种广泛使用的非侵入性技术,用于测量脑机接口(BCI)应用中的大脑活动。监督式EEG解码模型通常难以跨任务、受试者和数据集泛化,这促使了基于Transformer的EEG基础模型通过自监督学习进行训练。由于Transformer是排列不变的,它们需要显式的位置信息。与文本标记不同,EEG电极在头皮上空间分布,这引发了如何在基于Transformer的EEG模型中编码电极位置的问题。在本研究中,我们在CBraMod骨干网络中基准测试了五种位置编码策略,并在运动想象分类和情感识别任务上通过线性探测和微调协议进行评估。我们的结果表明,没有单一策略能在所有任务中持续表现优异。球形位置编码(SPE)为运动想象生成了强大的表示,但在情感识别上表现不佳,而非对称条件位置编码(ACPE)在任务间表现更为一致。这些发现表明,最优位置编码策略具有任务依赖性,在EEG解码场景中没有通用解决方案。

英文摘要

Electroencephalography (EEG) is a widely used non-invasive technique for measuring brain activity in brain-computer interface (BCI) applications. Supervised EEG decoding models often struggle to generalize across tasks, subjects, and datasets, motivating transformer-based EEG foundation models trained with self-supervised learning. Since transformers are permutation-invariant, they require explicit positional information. Unlike textual tokens, EEG electrodes are spatially distributed across the scalp, raising the question of how electrode positions should be encoded in transformer-based EEG models. In this study, we benchmark five positional encoding strategies within the CBraMod backbone and evaluate them under linear probing and fine-tuning protocols on motor imagery classification and emotion recognition. Our results show that no single strategy consistently outperforms the others across tasks. Spherical Positional Encoding (SPE) yields strong representations for motor imagery but underperforms on emotion recognition, while Asymmetric Conditional Positional Encoding (ACPE) demonstrates more consistent performance across tasks. These findings suggest that the optimal positional encoding strategy is task-dependent, with no universal solution across EEG decoding scenarios.

2605.29753 2026-05-29 eess.IV cs.AI

A unified deeplearning framework for contrast-phase-specific virtual monochromatic imaging

一种用于对比相位特异性虚拟单色成像的统一深度学习框架

Antony Jerald, Hemant K Aggarwal, Brian Nett, Avinash Gopal, Phaneendra K Yalavarthy, Bipul Das, Rajesh Langoju

AI总结 提出一种统一深度学习框架,利用对比相位先验信息从单能CT数据合成对比相位特异性虚拟单色50 keV图像,通过新型先验条件架构实现能量转换,并在四个对比相位上验证了其对比增强和泛化能力。

详情
Journal ref
SPIE Medical Imaging 2026
AI中文摘要

双能CT(DECT)可实现虚拟单色成像(VMI)并提高对比度分辨率,但其临床采用受到硬件复杂性和成本的限制。在这项工作中,我们提出了一种统一的深度学习框架,通过利用对比相位信息作为先验,从单能CT(SECT)数据合成对比相位特异性虚拟单色50 keV图像。该模型使用DECT衍生的70 keV和50 keV图像对进行训练,涵盖四个对比相位——血管期、动脉期、门脉期和延迟期——采用一种新颖的先验条件架构,将对比相位先验整合到能量转换过程中。我们证明了所提出的统一模型能够实现对比增强,并在对比相位之间具有良好的泛化能力。此外,我们展示了该模型可以从SECT输入生成类似50 keV的图像,并保留对比相位特异性动态。

英文摘要

Dual-energy CT (DECT) enables virtual monochromatic imaging (VMI) and improved contrast resolution, but its clinical adoption is limited by hardware complexity and cost. In this work, we propose a unified deep learning framework that synthesizes contrast-phase-specific virtual monochromatic 50 keV images from single-energy CT (SECT) data by leveraging contrast phase information as a prior. The model is trained using DECT-derived 70 keV and 50 keV image pairs across four contrast phases -- Angio, Arterial, Portal, and Delayed -- using a novel prior conditioning architecture that integrates contrast phase priors into the energy transformation process. We demonstrate that the proposed unified model achieves contrast enhancement and generalizes well across contrast phases. Additionally, we show that the model can generate 50 keV-like images from SECT inputs, preserving contrast phase-specific dynamics.

2605.29748 2026-05-29 stat.ML cs.LG

Instance-dependent Stochastic Lipschitz bandit

实例依赖的随机Lipschitz bandit

Marius Potfer, Vianney Perchet

AI总结 针对Lipschitz bandit问题,提出一种基于水平集次优性间隙积分的算法,实现比传统缩放维度更优的实例依赖遗憾界。

详情
AI中文摘要

我们研究Lipschitz bandit问题,其中学习器通过带噪声的点评估在域$\mathcal{X} \subset [0,1]^d$上顺序最大化未知的Lipschitz函数$f$。现有的遗憾界要么是最坏情况的,缩放为$\tilde{\Theta}\left(T^{\frac{d+1}{d+2}}\right)$,要么通过缩放维度$d_z$自适应,得到$\tilde{\Theta}\left(T^{\frac{d_z+1}{d_z+2}}\right)$。然而,这种基于缩放的保证仅是部分实例依赖的,因为它们仅依赖于近最优水平集的渐近增长,未能捕捉$f$的更精细结构性质。我们提供了一种分析和算法,通过$f$在其水平集上的次优性间隙的积分来刻画遗憾。这产生了适应水平集局部增长(而不仅仅是其渐近行为)的遗憾界。作为推论,当最大化者集合的维度$d^\star>0$时,我们获得了阶为$\tilde{\mathcal{O}}\left(T^{\frac{d_z+1}{\max(d_z,d^\star)+2}}\right)$的改进自适应速率,在该情况下严格优于经典的缩放界。最后,我们将分析扩展到完全信息设置(Lipschitz专家),并展示了如何放宽一些正则性假设。

英文摘要

We study the Lipschitz bandit problem, where a learner sequentially maximizes an unknown Lipschitz function $f$ over a domain $\mathcal{X} \subset [0,1]^d$ using noisy pointwise evaluations. Existing regret bounds are either worst-case, scaling as $\tilde{\Theta}\left(T^{\frac{d+1}{d+2}}\right)$, or adaptive via the zooming dimension $d_z$, yielding $\tilde{\Theta}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. However, such zooming-based guarantees are only partially instance-dependent, as they depend solely on the asymptotic growth of near-optimal level sets and fail to capture finer structural properties of $f$. We provide an analysis and an algorithm that characterizes the regret through integrals of the suboptimality gap of $f$ over its level sets. This yields regret bounds that adapt to the local growth of level sets, rather than only their asymptotic behavior. As a corollary, when the set of maximizers has dimension $d^\star>0$, we obtain improved adaptive rates of order $\tilde{\mathcal{O}}\left(T^{\frac{d_z+1}{\max(d_z,d^\star)+2}}\right)$, strictly improving over classical zooming bounds in this regime. Finally, we extend our analysis to the full-information setting (Lipschitz experts) and show how some of the regularity assumptions can be relaxed.
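As a worked illustration of the two classical rates, reading the exponents as the fractions $(d+1)/(d+2)$ and $(d_z+1)/(d_z+2)$, take the illustrative values $d=2$ and $d_z=1$ (these numbers are assumptions chosen for the example, not taken from the paper):

```latex
\[
\left.\frac{d+1}{d+2}\right|_{d=2} = \frac{3}{4},
\qquad
\left.\frac{d_z+1}{d_z+2}\right|_{d_z=1} = \frac{2}{3}.
\]
```

Up to logarithmic factors, the worst-case bound then grows as $T^{3/4}$ while the zooming bound grows as $T^{2/3}$, so an instance whose near-optimal region is effectively one-dimensional already gives a polynomially smaller regret.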

2605.29744 2026-05-29 cs.AI cs.CL cs.LG cs.MA

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

为什么专家模型仍然重要:面向医学人工智能的异构多智能体范式

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

AI总结 提出HetMedAgent异构多智能体框架,通过冲突感知证据融合、不确定性驱动的临床医生干预触发和自适应阈值校准,实现通用大语言模型与领域专家模型的协同,在三个临床决策任务中验证了专家模型在模态特定分析中的不可替代价值。

详情
Comments
Accepted at ICML 2026. 12 pages main text, 16 pages appendix
AI中文摘要

GPT和Claude等通用大语言模型在医疗保健领域的出色表现引发了一个关键问题:特定领域的医学专家模型是否会变得过时?我们认为,医学人工智能的未来不在于构建单一的医学基础模型,也不在于取代人类专业知识,而在于协调通用大语言模型、领域特定专家模型和临床医生之间的协作。我们提出HetMedAgent,一个异构医学多智能体框架,能够实现冲突感知证据融合、基于不确定性的临床医生干预触发和自适应阈值校准。在三个真实世界临床决策任务上的实验表明,通用大语言模型与领域特定专家模型之间的协同显著优于单独使用任一类型模型,验证了专家模型在模态特定分析中的不可替代价值。HetMedAgent代表了从构建医学大语言模型或基础模型向多智能体协作的转变,实现了通用推理能力与领域特定精度之间的平衡。

英文摘要

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2605.29742 2026-05-29 cs.AI

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering

引用闭包检索与逐规则归因:面向真实世界法规合规问答

Yeong-Joon Ju, Seong-Whan Lee

AI总结 针对法规合规问答中多层级权威结构的引用追踪难题,提出基于操作知识图谱的基准RegOps-Bench和统一框架RefWalk,通过共享主题锚点遍历跨文档引用、多视角候选融合及逐规则归因,显著提升检索召回率和引用准确性。

详情
Comments
Under Review
AI中文摘要

将大型语言模型(LLM)部署于法规合规领域,要求通过跨多层权威结构的全面引用来实现严格的追溯性。与传统多跳或法律问答不同,该任务需要结构化的程序性查找和证据集闭包,而非实体解析或判例推理。现有的RAG系统由于扁平化的引用边、碎片化的检索扩展以及脆弱的后期归因而难以胜任。我们通过RegOps-Bench将法规合规问答形式化,这是一个新颖的基准,包含从复杂的国家研发法规中导出的操作知识图谱。为解决这些瓶颈,我们提出了RefWalk,一个由共享主题锚点驱动的统一框架。RefWalk遍历跨文档引用,通过基于最大值的聚合融合多视角候选,并强制执行逐规则归因,以明确地将声明映射到来源。我们建立了一个强大的基线,在检索召回率和引用准确性方面取得了显著改进。最后,在美国健康合规数据集(HIPAA)上的对比评估显示,现有系统在扁平结构规则上表现饱和,凸显了RegOps-Bench的必要性。我们的代码可在https://github.com/yeongjoonJu/RefWalk获取。

英文摘要

Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi-tiered authority structures. Unlike traditional multi-hop or legal QA, this task requires structured procedural lookups and evidence-set closure rather than entity resolution or case-law reasoning. Existing RAG systems struggle here due to flattened citation edges, fragmented retrieval expansions, and fragile post-hoc attribution. We formalize Regulatory Compliance QA with RegOps-Bench, a novel benchmark featuring an Operational Knowledge Graph derived from complex national R&D regulations. To address these bottlenecks, we propose RefWalk, a unified framework driven by a shared topic anchor. RefWalk traverses cross-document citations, fuses multi-view candidates via max-based aggregation, and enforces per-rule attribution to explicitly map claims to sources. We establish a strong baseline with substantial improvements in retrieval recall and citation accuracy. Finally, a contrastive evaluation on a U.S. health compliance dataset (HIPAA) reveals that existing systems exhibit saturation on flat-structure rules, underscoring the need for RegOps-Bench. Our code is available at https://github.com/yeongjoonJu/RefWalk.
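Two of the steps named above, following cross-document citations until the evidence set is closed and fusing candidates from several retrieval views by taking the maximum score, can be sketched generically. This is an illustration under assumed data structures (a citation adjacency dict and per-view score dicts), not RefWalk's actual code; all names are hypothetical.

```python
from collections import deque

def citation_closure(cites, seeds):
    """All rules reachable from the seed rules by following citation edges (BFS)."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in cites.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def fuse_max(view_scores):
    """Keep each candidate's best score across views, ranked from highest to lowest."""
    fused = {}
    for scores in view_scores.values():
        for cand, s in scores.items():
            if cand not in fused or s > fused[cand]:
                fused[cand] = s
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Taking the maximum rather than the sum means a candidate that one view ranks highly is not diluted by views that never retrieved it, which is one plausible motivation for max-based aggregation.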

2605.29741 2026-05-29 cs.CL

AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation

AfriScience-MT:通过文本翻译实现非洲科学去殖民化

Idris Abdulmumin, Tajuddeen Gwadabe, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Nomonde Khalo, Ibrahim Said Ahmad, Abiodun Modupe, Anina Mumm, Sibusiso Biyela, Michelle Rabie, Johanna Havemann, Marek Rei, Jade Abbott, Vukosi Marivate

AI总结 针对非洲语言缺乏科学术语的问题,构建包含6种非洲语言、11个科学领域的平行语料库AfriScience-MT,并评估机器翻译和大型语言模型在零样本、少样本和微调设置下的性能。

详情
AI中文摘要

殖民语言在非洲教育和科学传播中的主导地位限制了数亿非洲语言使用者获取和产生科学知识的能力。一个核心障碍是这些语言缺乏既定的科学术语。我们引入了AfriScience-MT,这是一个涵盖六种非洲语言(阿姆哈拉语、豪萨语、卢干达语、北索托语、约鲁巴语和祖鲁语)和11个科学领域的平行语料库。专业翻译人员与科学传播专家合作,将科学论文的通俗语言摘要翻译成每种目标语言,并在没有现成术语的地方创建新术语。我们在零样本、少样本和微调设置下对机器翻译系统和大型语言模型进行了基准测试。结果表明,在句子和文档层面,闭源模型均优于所有开源模型:GPT-5.4和Gemini-3.1-Flash-Lite领先,平均句子级COMET得分分别为68.3和68.0,平均文档级COMET得分均为48.3。在开源系统中,微调的NLLB-1.3B在句子级达到67.3,TranslateGemma-12B在1-shot上下文学习下文档级达到44.0。我们发布AfriScience-MT以支持非洲语言的基准测试和文档级科学机器翻译。

英文摘要

The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific terminology in these languages. We introduce AfriScience-MT, a parallel corpus covering six African languages (Amharic, Hausa, Luganda, Northern Sotho, Yorùbá, and isiZulu) across 11 scientific domains. Professional translators, working with expert science communicators, translated plain-language summaries of scientific papers into each target language and created new terms where none existed. We benchmark machine translation systems and large language models in zero-shot, few-shot, and fine-tuned settings. Our results show that closed-source models outperform all open-source models at both the sentence and document levels: GPT-5.4 and Gemini-3.1-Flash-Lite lead with average sentence-level COMET scores of 68.3 and 68.0, respectively, and tie at an average document-level COMET of 48.3. Among open systems, fine-tuned NLLB-1.3B reaches 67.3 at the sentence level, and TranslateGemma-12B reaches 44.0 at the document level with 1-shot in-context learning. We release AfriScience-MT to support benchmarking and document-level scientific MT for African languages.

2605.29738 2026-05-29 cs.CL cs.AI

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Multi-Legal-Bench: 跨司法管辖区、语言和法律传统的法律推理评估LLM

Volodymyr Ovcharov

AI总结 提出Multi-Legal-Bench,首个跨司法管辖区法律基准,在6个国家、4个语系和1.34亿份法院判决上评估LLM,发现少样本效果跨辖区复制、无单一模型主导所有语言、跨语言迁移不遵循语言邻近性、分词器效率不显著预测跨语言准确率。

详情
Comments
14 pages, 5 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/multi-legal-bench
AI中文摘要

法律NLP基准绝大多数评估单一语言或汇总跨司法管辖区根本不同的任务,使得跨语言比较不可能。我们引入Multi-Legal-Bench,首个跨司法管辖区法律基准,在六个国家(乌克兰、法国、荷兰、波兰、捷克共和国、立陶宛)、四个语系和1.34亿份法院判决上评估相同任务。该基准定义了五个任务——法院类型分类、判决形式分类、案件结果预测、法律规范提取和原因类别预测——映射到来自国家法院登记处的结构化元数据,形成一个故意稀疏的5x6任务-司法管辖区矩阵(30个单元格中填充20个)。我们通过AWS Bedrock在零样本和3样本提示下评估7个前沿LLM,并额外使用4个小/中型模型(3-12B)进行规模分析。我们的结果显示:(1)在乌克兰发现的依赖任务的少样本效果在所有司法管辖区复制;(2)没有单一模型主导任何语言——排名随任务和司法管辖区而变化;(3)跨语言少样本迁移不遵循语言邻近性:UA->FR(罗曼语族,-2.1个百分点)迁移优于UA->PL(斯拉夫语族,-13.7个百分点),标签集对齐比语系更能预测迁移质量;(4)分词器生育率尽管有2.3倍的差异,并不能显著预测跨语言准确率(r=-0.27,p=0.14),表明模型架构和预训练数据主导分词器效率。我们发布所有数据、提示和模型预测。

英文摘要

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language, and rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.