arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00285 2026-06-02 cs.CL

Model-Based Quality Assessment for Massively Multilingual Parallel Data

Abdelaziz M. A. Ibrahim, Zihao Li, Jörg Tiedemann, Shaoxiong Ji

Affiliations: University of Jyväskylä; University of Helsinki; ELLIS Institute Finland; University of Turku

AI summary: To address non-parallel sentence pairs and low-quality translations in large-scale multilingual parallel data, the paper decomposes model-based assessment into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation. Experiments show that no model is universally reliable across translation directions, and the authors recommend direction-aware routing and calibration.

Abstract

Large-scale multilingual bitext often contains two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment for such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source-target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source-target directions. Results show that no model is universally reliable across translation directions. Naive QE ensembles dilute strong model signals, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best approached as a direction-aware routing and calibration problem, where no single universal metric is expected to suffice across all languages.
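The parallelism component described above is evaluated as a retrieval task. A minimal sketch of that kind of measurement follows, assuming cosine similarity and top-1 retrieval accuracy; the abstract does not state the exact metric, and the function names and toy 2-d vectors are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_accuracy(src_embs, tgt_embs):
    """Fraction of sources whose most similar target is the aligned one.

    src_embs[i] and tgt_embs[i] are assumed to embed a translation pair.
    """
    hits = 0
    for i, src in enumerate(src_embs):
        scores = [cosine(src, tgt) for tgt in tgt_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(src_embs)


# Toy embeddings: the third "translation" points in an unrelated direction,
# mimicking a non-parallel pair.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [1.0, -1.0]]
accuracy = retrieval_accuracy(src, tgt)
```

Computing this score separately for every source-target direction, rather than pooling all languages, is what makes direction-level unreliability visible.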

2606.00284 2026-06-02 cs.CL

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

Sanchit Ahuja, Terra Blevins

Affiliations: Northeastern University

AI summary: To address catastrophic forgetting in multilingual continual pretraining, the paper proposes five layer-aware parameter alignment strategies (hard freezing, soft regularization, post-hoc weight reversion, and model merging), evaluates them on 32 training languages plus held-out languages, and shows that parameter alignment reduces forgetting at low cost to language acquisition.

Comments: 25 pages, 5 figures

Abstract

While continual pretraining (CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition-forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
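Three of the strategy families named above can be sketched on toy weights. The concrete forms below (an L2 penalty on drift, copying base weights back per layer, and linear interpolation for merging) are common choices assumed here for illustration; the abstract does not specify the paper's exact formulations, and all names are hypothetical.

```python
def l2_drift_penalty(current, base, strength):
    """Soft regularization: strength times squared L2 distance to base weights."""
    return strength * sum(
        (c - b) ** 2
        for name in base
        for c, b in zip(current[name], base[name])
    )


def revert_layers(current, base, layers_to_revert):
    """Post-hoc reversion: copy base weights back into the chosen layers."""
    return {
        name: list(base[name]) if name in layers_to_revert else list(weights)
        for name, weights in current.items()
    }


def merge(current, base, alpha):
    """Model merging as linear interpolation: alpha*current + (1-alpha)*base."""
    return {
        name: [alpha * c + (1 - alpha) * b
               for c, b in zip(current[name], base[name])]
        for name in current
    }


# Toy two-layer "model": weights before and after continual pretraining.
base = {"layer0": [0.0, 1.0], "layer1": [2.0, 2.0]}
tuned = {"layer0": [1.0, 1.0], "layer1": [2.0, 4.0]}

penalty = l2_drift_penalty(tuned, base, strength=0.5)
reverted = revert_layers(tuned, base, {"layer0"})
merged = merge(tuned, base, alpha=0.5)
```

Hard layer freezing needs no helper in this sketch: it amounts to excluding the chosen layers from the optimizer so that their weights never drift.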

2606.00275 2026-06-02 cs.CV cs.AI

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang, Huishen Jiao, Yi Zhao

Affiliations: China University of Petroleum (Beijing); Hainan Institute of China University of Petroleum (Beijing); South China Normal University

AI summary: To address the asymmetry between visual and linguistic modalities in large vision-language models, the paper proposes AsyMoE, an architecture whose hyperbolic inter-modality experts model hierarchical relationships and whose evidence-priority language experts maintain contextual grounding, improving performance while activating fewer parameters.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.

2606.00272 2026-06-02 cs.AI cs.CL cs.CY

On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral

Quinten Steenhuis, Jacqueline Harvey

Affiliations: Suffolk University Law School

AI summary: Using expert attorneys and LLMs, the paper evaluates the follow-up question approach of the FETCH classifier, finds that low-cost LLMs perform well at classification but that generating high-quality questions requires a higher-cost model, and proposes a rubric for evaluating legal triage questions.

Comments: Working paper submitted as accepted to AIDA2J workshop at International Conference for AI and Law in Singapore, June 2026

Abstract

The FETCH classifier generates follow-up questions to help refine the best match for the applicant's legal problem, using a low-cost ensemble of LLMs. In this paper, we describe an expert attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for the evaluation of legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the questions lead to more accurate performance at classification tasks. We also find uneven fact elicitation across different categories, including domestic violence, at odds with family law screening protocols, suggesting the value of including dedicated screening panels for certain areas of law.

2606.00269 2026-06-02 cs.AI

Closed-Loop Neural Activation Control in Vision-Language-Action Models

Abhijith Babu, Ramneet Kaur, Nathaniel D. Bastian, Olivera Kotevska, Susmit Jha, Yanzhao Wu, Sumit Kumar Jha, Anirban Roy

Affiliations: Florida International University; SRI International; United States Military Academy; Oak Ridge National Laboratory; University of Florida

AI summary: Proposes CTRL-STEER, a closed-loop framework that replaces fixed intervention coefficients with adaptive, time-varying control signals, achieving more stable concept regulation and better task success.

Comments: Accepted at the IEEE/CVF CVPR 2026 Workshop on Visual Concepts (VisCon). 25 pages, 8 figures, including supplementary material

Abstract

Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly suited to embodied control, where task state and concept error evolve over time, often causing overcorrection, oscillation, and reduced task success, especially for temporal behaviors such as speed and smoothness. We propose CTRL-STEER, a closed-loop framework that replaces static intervention strength with adaptive, time-varying control signals. The key idea is to decouple representation from regulation: rather than assuming temporal concepts are directly controlled by individual neurons, we steer along motion-aligned residual directions while a feedback controller adjusts intervention magnitude online. We instantiate this framework with both PID and reinforcement learning based controllers. Experiments with a fine-tuned OpenVLA policy on four LIBERO task suites show that CTRL-STEER achieves more stable concept regulation and a better steering-task success trade-off than fixed-coefficient baselines, without modifying or retraining the base model.
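The closed-loop idea can be illustrated with a textbook discrete PID controller that adjusts a steering coefficient from the current concept error. This is a sketch under stated assumptions, not the CTRL-STEER implementation: the gains, the clamp range, and especially the toy response model (measured concept value equals 0.5 times the coefficient) are invented for the example, and the derivative gain is set to zero in the demo.

```python
class PIDController:
    """Textbook discrete PID controller with output clamping."""

    def __init__(self, kp, ki, kd, out_min, out_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=1.0):
        """Return the next steering coefficient for the current concept error."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(max(raw, self.out_min), self.out_max)


# Toy stand-in for the steered policy (hypothetical, not a model of OpenVLA):
# the measured concept value, e.g. speed, is assumed to be 0.5 * coefficient.
controller = PIDController(kp=0.5, ki=1.0, kd=0.0, out_min=0.0, out_max=4.0)
target, measured = 1.0, 0.0
for _ in range(30):
    coefficient = controller.step(target - measured)
    measured = 0.5 * coefficient
```

A fixed coefficient corresponds to skipping the loop and choosing one value in advance; the feedback version instead settles on whatever coefficient the response model requires, which is the contrast between open-loop and closed-loop steering that the abstract draws.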

2606.00262 2026-06-02 cs.LG cs.AI stat.AP stat.ML

When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE

Melihcan Erol, Suat Evren, Oktay Ozel, Alexander Morgan, Jongha Jon Ryu, Lizhong Zheng

Affiliations: University of Waterloo

AI summary: To address the mismatch between the softmax assumption in InfoNCE and the embedding setting used in contrastive learning, the paper proposes WEINCE, a correction based on extreme value theory that improves frozen-feature evaluation on five vision benchmarks.

Comments: Presented in ICML 2026

Abstract

InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a statistical assumption about how the top-scoring example is selected. Using extreme value theory, we show that this assumption is often misaligned with the normalized embedding setting used in modern contrastive learning. Motivated by this mismatch, we propose WEINCE, a simple modification of InfoNCE that uses anchor-wise online batch statistics to blend the usual softmax logits with an endpoint shortfall correction, adding no trainable parameters. Across five vision benchmarks, WEINCE yields consistent improvements in frozen-feature evaluation. These results show that a more faithful statistical treatment of hard negatives can improve contrastive objectives.
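For reference, the standard InfoNCE objective that WEINCE modifies can be written in a few lines. The sketch below shows only the usual softmax form on L2-normalized embeddings; the endpoint shortfall correction is not reproduced because the abstract does not give its formula. The function name and the default temperature are assumptions.

```python
import math


def info_nce(anchor, candidates, positive_index, temperature=0.1):
    """Standard InfoNCE loss for one anchor.

    Cross-entropy of the positive candidate under a softmax over
    temperature-scaled cosine similarities.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = normalize(anchor)
    logits = [
        sum(x * y for x, y in zip(a, normalize(c))) / temperature
        for c in candidates
    ]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(logits)
    log_partition = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_partition - logits[positive_index]


anchor = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
loss_good_positive = info_nce(anchor, candidates, positive_index=0)
loss_bad_positive = info_nce(anchor, candidates, positive_index=2)
```

Because the embeddings are normalized, every logit is bounded by 1/temperature, so the scores have a finite upper endpoint. That boundedness is the setting in which the abstract argues the softmax assumption becomes questionable.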

2606.00261 2026-06-02 cs.CV physics.soc-ph

The Harsh Truth: Segment-Level Analysis of Harsh Driving Events in Milan Using Large-Scale Telematics, Street Networks, and Google Street View

Andrea La Grotteria, Paolo Santi, Titus Venverloo, Umberto Fugiglando, Carlo Ratti

Affiliations: Massachusetts Institute of Technology

AI summary: Combining large-scale telematics, traffic metrics, street-network attributes, and Google Street View visual features, this study uses non-parametric tests and machine-learning regression to analyze segment-level characteristics of harsh driving events across Milan's urban road network. Wider carriageways, crossings and transit stops, and more open visual fields are associated with higher harsh-event intensity, while dense built frontage is associated with lower intensity; a cycling-infrastructure case study reveals an intensity gradient across facility types.

Abstract

Police-reported crash statistics remain the standard input for urban road-safety assessment, but their incompleteness and reporting lag limit their usefulness for timely, fine-grained intervention design. Harsh acceleration and braking events are widely used as surrogate safety indicators, but have so far been studied only in comparatively small urban samples. This study analyses harsh events across the urban road network of Milan, combining high-resolution telematics from more than 4.2 million vehicles equipped with On-Board Units, segment-level traffic metrics from TomTom, street-network and infrastructure attributes from OpenStreetMap, and visual streetscape features extracted from Google Street View via semantic segmentation using a OneFormer model. We employ an analytical framework combining non-parametric Mann-Whitney U tests of segment-feature distributions between high- and low-harshness groups with supervised machine-learning regressors. We find that, once exposure is controlled for, wider carriageways, crossings and transit stops, and more open visual fields (higher sky- and road-pixel proportions) are associated with higher harsh-event intensity, while denser built frontage is associated with lower intensity. Finally, the cycling-infrastructure case study identifies a gradient in harsh-event intensity across facility types: markings-only cycle lanes are associated with a 19.5% higher harshness score, and mixed-traffic configurations with an 11.5% higher score, relative to physically separated cycle paths, conditional on the included controls. These results support context-specific rather than uniform urban-safety interventions and illustrate how large-scale telematics combined with open geospatial and visual data can inform Vision Zero decision-making at the metropolitan scale.
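The group comparison mentioned above rests on the Mann-Whitney U statistic, which counts how often a value from one group exceeds a value from the other. A minimal sketch follows; the feature values are invented for illustration, and a real analysis would also compute a p-value (for example with an established statistics library) rather than stop at the raw statistic.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs (a, b) with a > b, ties counted as 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical sky-pixel proportions for segments in each harshness group.
high_harshness = [0.42, 0.55, 0.61]
low_harshness = [0.30, 0.38, 0.55]

u_high = mann_whitney_u(high_harshness, low_harshness)
u_low = mann_whitney_u(low_harshness, high_harshness)
```

The two statistics always sum to the number of pairs (here 3 x 3 = 9), so a value of `u_high` well above half of that indicates that the high-harshness group tends to have larger feature values.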

2606.00257 2026-06-02 cs.LG cs.AI

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

Rodney Lafuente-Mercado

Affiliations: Rodney Lafuente-Mercado

AI summary: To address the degeneration of token-level credit assignment signals under LoRA fine-tuning, the paper proposes ARCA, which uses the adapter's hidden-state residual as a token salience measure and requires no learned reward model or value head.

Comments: Accepted to DEMO 2026: ICML Workshop on Decision-Making from Offline Datasets to Online Adaptation. Non-archival report

Abstract

Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals (surprisal, entropy reduction, and policy divergence) can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce Adapter-Residual Credit Assignment (ARCA), a lightweight alternative that derives token salience from the adapter's own hidden-state residual, $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.
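The residual-based salience and the two concentration diagnostics can be sketched as follows. Only the residual norm $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$ comes from the abstract. Normalizing by the trajectory sum, the particular Gini formula, and defining the effective-token ratio as exp(entropy) divided by sequence length are assumptions made for this illustration, as are all function names.

```python
import math


def residual_salience(h_adapted, h_base):
    """Per-token weights from the L2 norm of the adapter's hidden-state residual.

    Weights are normalized to sum to 1 within the trajectory (an assumption).
    Requires at least one token with a nonzero residual.
    """
    norms = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(ha, hb)))
        for ha, hb in zip(h_adapted, h_base)
    ]
    total = sum(norms)
    return [n / total for n in norms]


def gini(weights):
    """Gini coefficient: 0 for uniform weights, (n-1)/n for a one-hot vector."""
    ordered = sorted(weights)
    n = len(ordered)
    ranked_sum = sum((i + 1) * w for i, w in enumerate(ordered))
    return 2 * ranked_sum / (n * sum(ordered)) - (n + 1) / n


def effective_token_ratio(weights):
    """exp(entropy) / n: the share of tokens that effectively receive credit."""
    entropy = -sum(w * math.log(w) for w in weights if w > 0)
    return math.exp(entropy) / len(weights)


# Toy trajectory of three tokens with 2-d hidden states.
h_base = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
h_adapted = [[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]]

weights = residual_salience(h_adapted, h_base)
concentration = gini(weights)
ratio = effective_token_ratio(weights)
```

In these terms, the two degenerate regimes from the abstract correspond to a Gini near 0 with a ratio near 1 (uniform credit) and a Gini near its maximum with a ratio near 1/n (credit collapsed onto a few positions).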

2606.00253 2026-06-02 cs.RO cs.LG

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze

Affiliations: University of California, Berkeley; ETH Zurich

AI summary: When fine-tuning vision-language-action models for mobile manipulators with heterogeneous joint spaces, the checkpoint with the lowest total MSE is not the one that performs best in practice; the paper proposes per-group error as a more reliable checkpoint selection signal.

Comments: 4 pages, 3 figures, 3 tables. Accepted as poster at ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots". Code: https://github.com/paumontagut/per-group-mse-vla

Abstract

Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $\pi_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of $\pi_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $\pi_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla
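The masking effect described above is easy to reproduce on toy numbers. The sketch below splits an 11-dimensional action vector into the four joint groups named in the abstract; the index layout (5 arm, 1 gripper, 2 head, 3 base) is a hypothetical assignment chosen for illustration and may not match the layout in the authors' repository.

```python
# Hypothetical index layout for an 11-DoF action vector.
GROUPS = {
    "arm": range(0, 5),
    "gripper": range(5, 6),
    "head": range(6, 8),
    "base": range(8, 11),
}


def mse(pred, target, dims):
    """Mean squared error restricted to the given action dimensions."""
    errors = [(p[d] - t[d]) ** 2 for p, t in zip(pred, target) for d in dims]
    return sum(errors) / len(errors)


def per_group_mse(pred, target, groups=GROUPS):
    """MSE reported separately for each joint group."""
    return {name: mse(pred, target, dims) for name, dims in groups.items()}


# One predicted action: arm, gripper, and head are exact; the base is off by 1.
pred = [[0.0] * 8 + [1.0, 1.0, 1.0]]
target = [[0.0] * 11]

total = mse(pred, target, range(11))
by_group = per_group_mse(pred, target)
```

Here the aggregate error is 3/11 (about 0.27) while the base-group error is 1.0: averaging over all eleven dimensions dilutes a failure that is concentrated in three of them.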

2606.00252 2026-06-02 cs.RO cs.LG

HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads

Songyang Liu, Shunyu Yao, Dingyuan Huang, Shuai Li

Affiliations: Department of Civil and Coastal Engineering, University of Florida

AI summary: Proposes HOIST, which combines imitation learning with sample-efficient batched reinforcement learning to improve the placement accuracy and stopping behavior of a humanoid robot manipulating suspended loads.

Abstract

Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST (Humanoid Optimization with Imitation and Sample-efficient Tuning) for manipulating suspended loads. HOIST first finetunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.

2606.00251 2026-06-02 cs.AI

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Haoyan Yang, Reza Shirkavand, Yukai Jin, Jiawei Zhou, Shangqian Gao, Heng Huang

Affiliations: Stony Brook University; University of Maryland; Florida State University

AI summary: The paper introduces the Capability Self-Assessment (CSA) problem and uses reinforcement learning to train large language models to judge the boundaries of their own capabilities accurately, significantly outperforming supervised fine-tuning without degrading original capabilities.

Abstract

The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We refer to this ability as Capability Self-Assessment (CSA) and formulate it as a policy-learning problem, aiming to improve self-assessment while preserving the model's original capabilities. Our results show that reinforcement learning teaches CSA effectively, significantly outperforming supervised fine-tuning while preserving original capabilities. In contrast, supervised fine-tuning severely degrades the capabilities the model is meant to assess. Moreover, learned self-assessment behavior generalizes well out of distribution, suggesting that CSA is a transferable model trait. Finally, CSA is practically useful: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.

2606.00250 2026-06-02 cs.CL cs.AI cs.HC

Effects of Varying LLM Access on Essay Writing Behavior

Julia Christenson, Karin de Langis, Shirley Anugrah Hayati, Dongyeop Kang

Affiliations: University of Minnesota

AI summary: In a randomized experiment that varied LLM access (none, limited, unlimited), limited access preserved students' authorship confidence and writing strategies, whereas unlimited access reduced creative expression.

Comments: BEA (Building Educational Applications) Workshop 2026

Abstract

Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (at most 3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.

2606.00248 2026-06-02 cs.AI

Geodesic Flow Matching for Denoising High-Dimensional Structured Representations

Karim Habashy, Chris Eliasmith

Affiliations: University of California, Berkeley

AI summary: To address the manifold constraints on Spatial Semantic Pointers in Vector Symbolic Algebras, the paper proposes Geodesic Flow Matching, which restricts the denoising flow to the torus and achieves a 72% reduction in tracking error and a 40% increase in neural efficiency in a Spiking Neural SLAM system.

Comments: ICML 2026 Main track

Abstract

Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information into high-dimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto continuous toroidal manifolds. However, standard approaches like Flow Matching assume a flat Euclidean geometry, which fails to account for the geometric constraints imposed on valid SSP states. We demonstrate that this assumption fails for SSPs: Euclidean linear interpolants "cut through" the manifold's interior, destroying the phase and magnitude structure required for accurate decoding. To resolve this, we employ Geodesic Flow Matching, adapting Riemannian transport dynamics to strictly restrict the denoising flow to the SSP toroidal manifold. We validate this approach in a Spiking Neural SLAM system, showing that manifold-aware cleanup stabilizes path integration against drift. The method achieves a 72% reduction in tracking error and enables a 40% increase in neural efficiency compared to competitive baselines. Code is available at https://github.com/kremHabashy/CleanupSSP.
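The "cut through" failure can be seen on a single phase angle. The sketch below assumes that a valid state is a vector of unit-magnitude phasors exp(i*theta), so that the set of valid states is a torus; this is a simplification for illustration, and the helper names are hypothetical rather than taken from the authors' code.

```python
import cmath
import math


def wrap(angle):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def geodesic_interp(theta0, theta1, t):
    """Move along the shortest arc on the torus, one phase at a time."""
    return [wrap(a + t * wrap(b - a)) for a, b in zip(theta0, theta1)]


def euclidean_interp(theta0, theta1, t):
    """Straight-line interpolation of the phasors in the ambient complex space."""
    return [
        (1 - t) * cmath.exp(1j * a) + t * cmath.exp(1j * b)
        for a, b in zip(theta0, theta1)
    ]


theta_start, theta_end = [0.0], [2.0]

midpoint_phase = geodesic_interp(theta_start, theta_end, 0.5)[0]
midpoint_phasor = euclidean_interp(theta_start, theta_end, 0.5)[0]
euclidean_magnitude = abs(midpoint_phasor)
```

The geodesic midpoint is the phase 1.0, which still corresponds to a unit-magnitude phasor, whereas the Euclidean midpoint has magnitude cos(1), roughly 0.54: the straight line leaves the manifold and shrinks the magnitude that decoding relies on.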

2606.00241 2026-06-02 cs.LG cs.AI stat.ML

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

Zhengyang Hu, Yanzhi Chen, Hanxiang Ren, Qunsong Zeng, Youyi Zheng, Adrian Weller, Kaibin Huang, Yanchao Yang

Affiliations: University of Science and Technology of China

AI summary: Proposes InfoAtlas, a foundation-model-like architecture that infers mutual information directly in a single forward pass, enabling zero-shot estimation with a 100x speedup while maintaining accuracy.

Comments: Accepted to ICML 2026

Abstract

Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neural mutual information (MI) estimators offer a promising avenue, but they typically require costly iterative optimization for each new dataset, making them impractical for real-time applications. We present InfoAtlas, a foundation model-like architecture that eliminates this bottleneck by directly inferring MI in a single forward pass. Pretrained on large-scale synthetic data with rich dependence patterns, InfoAtlas learns to identify diverse dependence structures and predict MI directly from the dataset. Comprehensive experiments demonstrate that InfoAtlas matches state-of-the-art neural estimators in accuracy while achieving $100\times$ speedup, can flexibly handle varying dimensions and sample sizes through a single unified model, and generalizes effectively to complex, real-world scenarios. By reformulating MI estimation as an inference task, InfoAtlas establishes a foundation for real-time dependency analysis.
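Pretraining on synthetic data presupposes ground-truth MI labels. One family for which the label is known in closed form is the bivariate Gaussian, where MI depends only on the correlation coefficient. The sketch below computes that label; whether InfoAtlas uses this family is not stated in the abstract, so treat it as an illustrative assumption.

```python
import math


def gaussian_mi(rho):
    """Mutual information, in nats, of a bivariate Gaussian with correlation rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return -0.5 * math.log(1.0 - rho * rho)


# Ground-truth labels for a few synthetic dependence strengths.
mi_values = {rho: gaussian_mi(rho) for rho in (0.0, 0.5, 0.9)}
```

A dataset sampled at a known correlation, paired with its closed-form MI, gives one supervised training example for a model that maps a dataset directly to an MI estimate.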

2606.00240 2026-06-02 cs.AI cs.MA

MindZero: Learning Online Mental Reasoning With Zero Annotations

Shunchi Zhang, Jin Lu, Chuanyang Jin, Yichao Zhou, Zhining Zhang, Tianmin Shu

Affiliations: University of Science and Technology of China

AI summary: Proposes MindZero, a framework that trains multimodal large language models with self-supervised reinforcement learning for efficient and robust online mental reasoning without explicit mental state annotations.

Comments: ICML 2026. Website: https://scai.cs.jhu.edu/MindZero

Abstract

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.

2606.00232 2026-06-02 cs.AI cs.LG

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

Kaixiang Zhao, Tianrun Yu, Shawn Huang, Porter Jenkins, Yushun Dong, Amanda Hughes

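Affiliations * Brigham Young University; Florida State University

AI summary: Proposes the TIGER framework, which independently extracts an observation graph from the input and a claim graph from the output, then repairs high-risk claims using graph-conditioned risk scores, to mitigate fact-level hallucinations in multimodal generation.

Details
Comments
25 pages, 7 figures, 16 tables. Under review
AI-generated Chinese abstract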

我们研究多模态生成的事实级修复,其中流畅的输出可能包含输入不支持的具体事实。现有的推理时修复方法通常通过联合条件化输入和当前输出来生成反馈。这种设计有两个局限性:输出中的幻觉声明可能偏置模型对输入的解释,且自由形式的反馈无法在事实级别进行排序或调度。我们提出TIGER,一种重新设计反馈以进行局部修复的推理时框架。TIGER从输入中独立提取观测图,从当前输出中提取声明图,然后根据支持和冲突为每个声明分配图条件风险分数。模型修复选定的高风险声明,同时保持骨干网络冻结。我们提供收敛性分析,表明在温和假设下,期望总风险几何级数下降至显式渐近界。跨四个跨模态路径(包括图像到文本、图像+文本到文本、音频到文本和视频到文本)的实验表明,TIGER在保持任务质量的同时减少了不支持内容。该增益在多个骨干网络上成立,CrisisFACTS案例研究表明相同的修复机制可以改善多源设置中的接地性。

English abstract

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.

2606.00230 2026-06-02 cs.LG

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

语言模型中预训练的Grokking类比:追踪延迟的语法泛化

Sherin Muckatira, Namrata Shivagunde, Vijeta Deshpande, Anna Rumshisky

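Affiliations * University of Massachusetts Lowell

AI summary: Proposes an exposure-based framework for studying grokking-like delayed generalization during LLM pre-training, finds delayed grammatical generalization using BLiMP minimal pairs, and analyzes how grammatical concept vectors change before and after generalization.

Details
Comments
18 pages, 10 figures, 9 tables
AI-generated Chinese abstract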

Grokking是指神经网络在拟合训练数据后很长时间才泛化的现象,已在监督设置下经过多个epoch研究。LLM预训练则涉及在未标注语料库上进行下一个词预测,数据重复有限且没有明确的训练/验证划分。为了解决这个问题,我们提出了一个基于暴露的框架,使得在LLM预训练期间能够研究类似grokking的动态。我们将评估基于BLiMP最小对,它们提供了受控的语法对比。对于每个BLiMP最小对,我们识别出一个关键短语,即捕获语法对比和现象相关上下文的最小连续跨度。其关键短语出现在预训练窗口中的示例被分配到代理训练集;其余示例被分配到代理验证集。在五个语法现象中,我们观察到延迟泛化。分析泛化前后的预训练检查点表明,语法概念向量在泛化后更能预测语法可接受性,并占据更高维的子空间。我们还发现,从关键标记到相关上下文标记的注意力集中在少数头上。

English abstract

Grokking, the phenomenon in which neural networks generalize long after fitting their training data, has been studied in supervised settings over many epochs. LLM pre-training instead involves next-token prediction over an unlabeled corpus, with limited data repetition and no explicit train/validation split. To address this gap, we propose an exposure-based framework that enables the study of grokking-like dynamics during LLM pre-training. We ground our evaluation in BLiMP minimal pairs, which provide controlled grammatical contrasts. For every BLiMP minimal pair, we identify a critical phrase, the smallest contiguous span that captures the grammatical contrast and the phenomenon-relevant context. Examples whose critical phrase appears in the pre-training window are assigned to the proxy-train split; the remaining examples are assigned to the proxy-validation split. Across five grammatical phenomena, we observe delayed generalization. Analyzing pre-training checkpoints before and after generalization shows that grammatical concept vectors become more predictive of grammatical acceptability and occupy a higher-dimensional subspace after generalization. We also find that attention from the critical token to the relevant context token is concentrated in a small number of heads.
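
The proxy-train / proxy-validation assignment described in this abstract can be illustrated with a minimal sketch. The exact-substring matching rule, the dictionary layout, the function name `split_by_exposure`, and the toy sentences are assumptions made for illustration; the abstract does not say how the authors match critical phrases against the pre-training window.

```python
def split_by_exposure(examples, window_text):
    """Assign an example to proxy-train if its critical phrase occurs in the window.

    Assumption: "appears in the pre-training window" is approximated here by an
    exact substring match, which may differ from the authors' matching rule.
    """
    proxy_train, proxy_validation = [], []
    for example in examples:
        if example["critical_phrase"] in window_text:
            proxy_train.append(example)
        else:
            proxy_validation.append(example)
    return proxy_train, proxy_validation


# Made-up minimal pairs in the spirit of BLiMP (not taken from the dataset).
examples = [
    {"good": "The cats are sleeping.", "bad": "The cats is sleeping.",
     "critical_phrase": "cats are"},
    {"good": "The girl saw herself.", "bad": "The girl saw himself.",
     "critical_phrase": "girl saw herself"},
]

# Made-up stand-in for the text seen during one pre-training window.
window_text = "yesterday the cats are sleeping on the mat again"

proxy_train, proxy_validation = split_by_exposure(examples, window_text)
print(len(proxy_train), len(proxy_validation))
```

Tracking accuracy on the two splits across checkpoints would then show whether the held-out split catches up only after the exposed split has been fit.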

2606.00228 2026-06-02 cs.LG

LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching

LithoGRPO: 通过GRPO强化流匹配的快速逆向光刻

Yao Lai, Xuyuan Xiong, Zeyue Xue, Guojin Chen, Jing Wang, Xihui Liu, Rui Zhang, Robert Mullins, Bei Yu, Ping Luo

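Affiliations * University of Science and Technology of China; Tsinghua University

AI summary: Proposes the LithoGRPO framework, which combines flow matching with GRPO-based reinforcement learning fine-tuning and uses a physics-based reward function to optimize masks, achieving efficient inverse lithography and outperforming existing methods.

Details
Comments
ICML 2026
AI-generated Chinese abstract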

在半导体制造中,光刻通过光学掩模将电路布局投影到硅片上。随着电路特征尺寸缩小到光波长以下,光学衍射导致印刷图案偏离预期布局。逆向光刻技术(ILT)通过生成优化掩模来解决这一挑战,提高图案转移到晶圆上的保真度。虽然ILT类似于图像合成任务,但其对掩模评估依赖明确的物理指标限制了现有生成模型的适用性。我们引入了LithoGRPO,一个ILT框架,它将流匹配范式与基于GRPO的强化学习(RL)微调相结合,能够针对给定目标布局高效探索多样化的掩模。与纯生成或基于优化的方法不同,LithoGRPO中的RL利用了ILT明确定义的、基于物理的奖励函数,从而在复杂、工艺感知约束下进行优化。据我们所知,这是第一个将流匹配和RL统一用于掩模优化的框架。为了提高RL采样效率,我们提出了一种用于可制造性评估的快速镜头计数算法,在保持传统镜头计数指标掩模排序的同时,实现了超过130倍的加速。大量实验表明,LithoGRPO在基于优化和基于学习的方法中均达到了最先进的性能,同时保持了高效的掩模生成。

English abstract

In semiconductor manufacturing, lithography projects circuit layouts onto silicon wafers through an optical mask. As circuit features shrink below the wavelength of light, optical diffraction causes the printed patterns to deviate from their intended layouts. Inverse Lithography Technology (ILT) addresses this challenge by generating optimized masks that enhance the fidelity of pattern transfer onto wafers. While ILT resembles an image synthesis task, its reliance on explicit physical metrics for mask evaluation limits the applicability of existing generative models. We introduce LithoGRPO, an ILT framework that integrates the flow-matching paradigm with GRPO-based reinforcement learning (RL) fine-tuning, enabling efficient exploration of diverse masks for a given target layout. Unlike purely generative or optimization-based approaches, RL in LithoGRPO exploits the explicitly defined, physics-based reward function of ILT, enabling optimization under complex, process-aware constraints. To the best of our knowledge, this is the first framework that unifies flow matching and RL for mask optimization. To improve RL sampling efficiency, we propose a fast shot-counting algorithm for manufacturability evaluation, achieving over 130x speedup while preserving the mask ranking of the traditional shot-count metric. Extensive experiments demonstrate that LithoGRPO achieves state-of-the-art performance over both optimization-based and learning-based methods, while maintaining efficient mask generation.

2606.00206 2026-06-02 cs.LG

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

量化推理模型认为它们需要思考更长时间,但实际上并不需要

Sanae Lotfi, Polina Kirichenko, Steven Li, Zechun Liu

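Affiliations * FAIR at Meta; Meta AI

AI summary: Finds that post-training quantization lowers the accuracy of reasoning models while lengthening their chains of thought. After analyzing "overthinking" errors, in which a quantized model reaches the correct answer at an intermediate step but does not output it as the final answer, it proposes a training-free logit penalty on overthinking markers that shortens chains of thought by 12-23% while preserving or improving accuracy.

Details
AI-generated Chinese abstract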

后训练量化(PTQ)被广泛用于高效部署大型语言模型,但其对推理模型的影响尚不明确。在数学、编程和科学问答任务中,我们发现激进的PTQ会降低准确率,同时增加思维链(CoT)长度。令人惊讶的是,我们证明在量化模型高达52%的失败案例中,模型在中间推理步骤中得出了正确答案,但并未将其作为最终答案输出。为了理解量化为何导致这种过度思考错误的增加,我们测量了量化模型与全精度输出分布之间的token级KL散度。KL散度高的位置与高下一个token熵强相关,在这些位置上,量化模型过度采样了“wait”、“but”、“alternatively”等过度思考标记。我们表明,仅对一组精心挑选的过度思考标记引入无训练的logit惩罚,即可在5个模型(1.5B-32B参数)、3种量化方法和5个基准测试中将CoT长度减少12-23%,同时保持或提升准确率,与惩罚其他标记集相比,在准确率与推理成本之间产生了更优的帕累托前沿。量化模型产生的过度思考错误尤其减少了高达58%。

English abstract

Post-training quantization (PTQ) is widely used to deploy large language models efficiently, but its effect on reasoning models is not well understood. Across math, coding, and science QA, we find that aggressive PTQ reduces accuracy while increasing chain-of-thought (CoT) length. Surprisingly, we show that in up to 52% of the quantized models' failures, models reach the right answer in intermediate reasoning steps but do not output it as a final answer. To understand why quantization leads to this increase in overthinking errors, we measure the token-level KL divergence between quantized and full-precision output distributions. Positions with high KL divergence correlate strongly with high next-token entropy, and at these positions quantized models disproportionately sample overthinking markers such as "wait", "but", and "alternatively". We show that simply introducing a training-free logit penalty on a curated set of overthinking markers can reduce CoT length by 12–23% while preserving or improving accuracy across 5 models (1.5B-32B parameters), 3 quantization methods, and 5 benchmarks, yielding a favorable Pareto frontier of accuracy against reasoning cost compared to penalizing other token sets. In particular, overthinking errors produced by quantized models are reduced by up to 58%.
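
The training-free logit penalty mentioned in this abstract can be illustrated with a minimal sketch. The marker set, the penalty value of 2.0, the toy logits, and the function names `softmax` and `penalize_markers` are assumptions made for illustration; the abstract does not specify how the authors curate the markers or size the penalty.

```python
import math

# Hypothetical marker set, borrowed from the examples quoted in the abstract.
OVERTHINKING_MARKERS = {"wait", "but", "alternatively"}


def softmax(logits):
    """Convert a token -> logit mapping into a token -> probability mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def penalize_markers(logits, markers=OVERTHINKING_MARKERS, penalty=2.0):
    """Subtract a fixed penalty from the logits of marker tokens only."""
    return {tok: (val - penalty) if tok in markers else val
            for tok, val in logits.items()}


# Toy next-token logits at a high-entropy position (made-up numbers).
logits = {"wait": 1.2, "so": 1.0, "therefore": 0.9, "but": 1.1}

before = softmax(logits)
after = softmax(penalize_markers(logits))
print(round(before["wait"], 3), round(after["wait"], 3))
```

Because the penalty acts only at sampling time, the model weights are unchanged, which is what makes an intervention of this kind training-free.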

2606.00204 2026-06-02 cs.CV

APE: Agentic Prompt Enhancer for Image Generation and Editing

APE: 用于图像生成与编辑的智能提示增强器

Zijian Huang, Jay Zhangjie Wu, Zian Wang, Tianshi Cao, Jiasi Chen, Sanja Fidler, Huan Ling, Xuanchi Ren

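Affiliations * NVIDIA; University of Michigan

AI summary: Proposes the APE framework, which post-trains small language models as prompt-enhancement agents that improve prompt quality for image generation and editing in single-agent or multi-agent form, without modifying the downstream visual model.

Details
Comments
Project Page: https://research.nvidia.com/labs/sil/projects/ape/
AI-generated Chinese abstract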

自然语言已成为图像生成和编辑的强大接口,但文本引导的视觉系统对提示表述高度敏感。语义相似的请求可能因措辞、具体性以及视觉约束的明确程度而产生不同输出,这促使将提示增强作为可训练组件而非外围用户选择。现有的强增强器通常依赖大型专有LLM(如ChatGPT或Gemini),增加了视觉生成流水线的成本、延迟和部署依赖性。我们提出智能提示增强器(APE),一种轻量级框架,将小型语言模型(SLM)后训练为提示增强代理。APE支持单代理重写和角色专用多代理增强。其单代理实例SAPE一次性重写提示,而多代理实例MAPE将增强分解为路由器-重写器-组合器过程,以处理对象、属性、空间关系和编辑的组合约束。通过任务感知奖励和后训练协议,APE在不修改下游视觉模型的情况下改善了视觉对齐和提示遵循。在具有挑战性的图像生成和编辑基准上的实验表明,后训练的小型提示增强器可靠地优于其基础对应物,缩小了与闭源提示增强器的差距;此外,MAPE在这些基准中的复杂组合任务上表现尤为强劲。

English abstract

Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router–rewriter–composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.

2606.00203 2026-06-02 cs.CL

DeSQ: Decomposition-based SPARQL Query Generation

DeSQ:基于分解的SPARQL查询生成

Papa Abdou Karim Karou Diallo, Aditya Sharma, Neshat Elhami Fard, Amal Zouaq

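Affiliations * LAMA-WeST; Mila – Quebec AI Institute; Polytechnique Montréal

AI summary: Proposes the DeSQ framework, which decomposes complex questions into atomic constraints, maps them to SPARQL fragments, and assembles these into a complete query; it surpasses existing methods on four of five benchmarks and improves robustness and explainability.

Details
AI-generated Chinese abstract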

知识库问答(KBQA)的主流方法分为两类:一是生成形式化查询,但存在脆弱性和可解释性有限的问题;二是通过知识库探索直接检索答案,但计算成本高且容易产生幻觉。为了结合两种范式的优势并减轻各自的缺点,我们提出了DeSQ(基于分解的SPARQL查询生成),这是一个与知识库无关的框架,分为三个阶段。首先,它将复杂问题分解为原子约束(AC),这些约束反映了底层知识库的关系结构。其次,它生成一个两部分的结构化输出:(a)将每个AC映射到对应的SPARQL片段,使用标准化的变量和URI占位符,以及(b)描述每个占位符的URI接地块。第三,它将这些片段组装成一个完整的SPARQL查询。DeSQ在五个主要基准测试中的四个上超越了现有最先进的方法,并展现出对词汇变化的卓越鲁棒性。除了性能提升,我们的框架通过消除对实时知识库端点的需求,大大简化了评估,并且其结构化输出支持细粒度的错误分析,从而实现更有针对性的改进干预。

English abstract

Dominant approaches to Knowledge Base Question Answering (KBQA) fall into two categories: generating a formal query, which suffers from brittleness and limited explainability, and retrieving answers directly through KB exploration, which is computationally costly and prone to hallucination. To combine the strengths of both paradigms while mitigating their respective weaknesses, we introduce DeSQ (Decomposition-based SPARQL Query Generation), a KB-agnostic framework that operates in three stages. First, it decomposes complex questions into Atomic Constraints (ACs) that mirror the relational structure of the underlying KB. Second, it generates a two-part structured output: (a) a mapping of each AC to its corresponding SPARQL fragment, using standardized variable and URI placeholders, and (b) a URI grounding block describing each placeholder. Third, it assembles these fragments into a complete SPARQL query. DeSQ surpasses state-of-the-art approaches on four out of five major benchmarks and demonstrates superior robustness to lexical variation. Beyond performance gains, our framework greatly simplifies evaluation by eliminating the need for a live KB endpoint, and its structured output enables fine-grained error analysis, allowing more targeted interventions for improvement.

2606.00202 2026-06-02 cs.LG cs.AI

From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets

从Rashomon理论到PRAXIS:高效决策树Rashomon集

Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

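Affiliations * Stanford University

AI summary: To address the high cost of computing decision-tree Rashomon sets, proposes the PRAXIS algorithm, which improves runtime and memory usage by orders of magnitude and recovers almost all of the full Rashomon set.

Details
Comments
Accepted to ICML 2026
AI-generated Chinese abstract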

标准机器学习流程通常会产生许多接近最优的模型。这些“Rashomon集”为不确定性感知的鲁棒决策带来了一系列挑战和机遇。它们允许用户整合领域知识和偏好,这些知识和偏好通常难以直接指定为目标函数,并且它们量化了给定训练数据集和目标函数下有效模型之间的多样性。然而,即使对于稀疏决策树这样简单、可解释的模型类,Rashomon集的计算仍然需要巨大的内存和运行时资源。我们提出了PRAXIS,一种近似该Rashomon集的算法,在运行时和内存使用上实现了数量级的改进。我们验证了PRAXIS通常能恢复几乎完整的Rashomon集。PRAXIS使研究人员和从业者能够可扩展地对真实世界数据集的Rashomon集进行建模。PRAXIS的代码可在https://github.com/zakk-h/PRAXIS获取。

English abstract

Standard machine learning pipelines often admit many near-optimal models. These "Rashomon sets" pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets. Code for PRAXIS is available at https://github.com/zakk-h/PRAXIS
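
The notion of a Rashomon set used in this abstract can be illustrated with a brute-force sketch. This is not PRAXIS: it simply filters an already enumerated list of candidate models, whereas the paper's contribution is avoiding that cost. The additive tolerance `epsilon`, the function name `rashomon_set`, and the toy objective values are assumptions made for illustration.

```python
def rashomon_set(objective_by_model, epsilon):
    """Return the models whose objective is within epsilon of the best one.

    Assumption: "near-optimal" is taken to mean an additive tolerance on the
    objective; other formulations use a multiplicative tolerance instead.
    """
    best = min(objective_by_model.values())
    return {name for name, value in objective_by_model.items()
            if value <= best + epsilon}


# Made-up regularized losses for four hypothetical sparse decision trees.
objectives = {"tree_a": 0.20, "tree_b": 0.21, "tree_c": 0.25, "tree_d": 0.30}

near_optimal = rashomon_set(objectives, epsilon=0.02)
print(sorted(near_optimal))
```

A user could then choose among the returned models using preferences that are hard to encode in the objective itself, which is the use case the abstract describes.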

2606.00201 2026-06-02 cs.RO

Series-Parallel Integrated Nonlinear Elastic Actuator applied to the lean motion of a bicycle simulator

应用于自行车模拟器倾斜运动的串并联集成非线性弹性致动器

Christina Kohler, Michiel Plooij, Nuria Peña-Perez, Arend L. Schwab, Heike Vallery

Affiliations * Institute of Automatic Control, RWTH Aachen University; Demcon Life Sciences & Health; Hapticlink Technologies; Department of BioMechanical Engineering, Delft University of Technology; Department of Rehabilitation Medicine, Erasmus MC

AI summary: Proposes a Series-Parallel Integrated Nonlinear Elastic Actuator (SPINEA) in which a nonlinear transmission lets a single elastic element act in series and in parallel at once, achieving high torque and precise torque tracking; it is applied to the lean motion of a bicycle simulator.

Details
AI-generated Chinese abstract

设计用于高扭矩、高保真力触觉交互的机器人具有挑战性。并联弹性致动器(PEA)使用与较小电机并联的弹性元件来补充扭矩,而串联弹性致动器(SEA)使用串联的弹性元件来解耦电机阻抗并改善力控制。最近的工作结合了SEA和PEA以获得两者的优点,但需要单独的弹性元件或离合器。本文提出了串并联集成非线性弹性致动器(SPINEA),它融合了SEA和PEA,使得单个弹性元件同时承担并联和串联的双重角色。这是通过非线性传动实现的,其中电机和负载具有不对齐的旋转轴并且弹性连接。这种几何结构实现了高峰值扭矩和精确的扭矩跟踪。我们将SPINEA应用于力触觉自行车模拟器的倾斜驱动,这需要高力矩和精确的渲染以实现安全且逼真的骑行者交互。我们实现了一个原型并进行了实验,包括外部激励装置和骑行者骑行。我们的结果证实了SPINEA的低阻抗和精确扭矩跟踪,在自行车框架固定时高达4.25 Hz,在骑行者骑行时高达4 Hz。这些优点可能转移到其他需要紧凑、高性能驱动的应用中。

English abstract

Designing robots for high-torque, high-fidelity haptic interaction is challenging. Parallel Elastic Actuators (PEAs) use elastic elements in parallel to smaller motors to complement torques, and Series Elastic Actuators (SEAs) use elastic elements in series to decouple motor impedance and improve force control. Recent work combines SEAs and PEAs to obtain both benefits but requires separate elastic elements or clutching. This paper presents the Series Parallel Integrated Nonlinear Elastic Actuator (SPINEA), which merges SEA and PEA such that a single elastic element takes on dual roles simultaneously, parallel and series. This is achieved by a nonlinear transmission in which the motor and load have misaligned rotation axes and are elastically connected. This geometry enables both high peak torque and precise torque tracking. We apply SPINEA to actuate lean of a haptic bicycle simulator, which requires high moments and precise rendering for safe and realistic rider interactions. We realized a prototype and performed experiments, both with an external excitation setup and with riders cycling. Our results confirm SPINEA's low impedance and precise torque tracking, up to 4.25 Hz with the bicycle frame fixed and up to 4 Hz with riders. The benefits may transfer to other applications requiring compact, high-performance actuation.

2606.00198 2026-06-02 cs.LG cs.AI cs.CL

BAGEN: Are LLM Agents Budget-Aware?

BAGEN:LLM 智能体是否具有预算意识?

Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

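Affiliations * Northwestern University; O2 Lab; Independent; University of Michigan; Cornell; All Hands AI; Stanford; UT Austin

AI summary: Introduces the concept of a Budget-Aware Agent (BAGEN), which treats budget as an active control signal rather than a passive cost metric and uses progressive interval estimation to predict upper and lower bounds on the remaining budget. Across four environments and five frontier models, it finds failure patterns such as strong models lacking strong budget-awareness and models being over-optimistic; early stopping saves 28-64% of tokens on failed trajectories, but precise interval calibration remains challenging.

Details
AI-generated Chinese abstract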

尽管智能体正在花费越来越多的资源,但如今智能体成本大多仅在执行后衡量。预算感知智能体(BAGEN)应将预算视为主动控制信号,而非被动成本指标。我们首先系统地将预算估计定义为内部预算(来自智能体计算)和外部预算(来自智能体动作)。然后,我们将预算意识形式化为渐进区间估计:在计划的每一步,智能体应预测剩余预算的上限和下限,并在完成可能性低时发出警报。通过 rollout-replay 协议进行评分,我们在四个环境和五个前沿模型上发现了一致的失败模式:(1)强模型不一定具有强预算意识,相关性 r=0.35。(2)前沿模型始终过度乐观,继续在不太可能成功的任务上花费资源,而不是尽早提醒用户。(3)预算感知信号是可操作且可训练的。早期停止在失败轨迹上节省 28-64% 的令牌,SFT+RL 增强了早期停止和警报行为。(4)精确区间校准仍然具有挑战性,SFT+RL 后区间覆盖率上限为 47%。项目页面:https://ragen-ai.github.io/bagen/

English abstract

While agents are spending more and more resources, agent cost today is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) Strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) Frontier models are consistently over-optimistic and continue spending on tasks that are unlikely to succeed instead of alerting the user early. (3) Budget-aware signals are actionable and trainable: early stopping saves 28-64% of tokens on failed trajectories, and SFT+RL strengthens early-stop and alert behavior. (4) Precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen-ai.github.io/bagen/
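
The interval coverage figure reported in this abstract can be illustrated with a minimal sketch of one plausible reading of the metric: the fraction of steps at which the true remaining budget falls inside the predicted interval. The abstract does not define the metric, so this definition, the function name `interval_coverage`, and the toy token counts are assumptions made for illustration.

```python
def interval_coverage(predictions, actual_remaining):
    """Fraction of steps whose true remaining budget lies inside [lower, upper].

    Assumption: coverage is counted per step with inclusive bounds; the
    paper's rollout-replay scoring may aggregate differently.
    """
    if not predictions:
        raise ValueError("at least one step is required")
    if len(predictions) != len(actual_remaining):
        raise ValueError("one interval is required per step")
    hits = sum(
        1
        for (lower, upper), actual in zip(predictions, actual_remaining)
        if lower <= actual <= upper
    )
    return hits / len(predictions)


# Made-up (lower, upper) predictions of remaining tokens at four plan steps.
intervals = [(800, 1500), (600, 900), (200, 700), (50, 150)]
# Made-up true remaining tokens observed when the trajectory is replayed.
actual = [1200, 1000, 400, 300]

coverage = interval_coverage(intervals, actual)
print(coverage)
```

In this toy trajectory the agent underestimates the remaining budget at two of four steps, the kind of over-optimism the abstract reports for frontier models.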

2606.00191 2026-06-02 cs.RO cs.CV

Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models

Safe2Drive: 评估端到端自动驾驶模型的安全驾驶行为

Nishad Sahu, Kalpana Panda, Congyuan Yu, Changzhong Qian, Shounak Sural, Ragunathan Rajkumar

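Affiliations * Carnegie Mellon University; Birla Institute of Technology and Science Pilani

AI summary: To address the brittleness of end-to-end autonomous driving models in common safety-critical scenarios, proposes the Safe2Drive test suite and the SafeDriving Score (SDS); evaluation shows that leading models' driving scores drop sharply in these scenarios and that their SDS is low.

Details
Journal ref
CVPR Workshops 2026
AI-generated Chinese abstract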

最近的端到端(E2E)自动驾驶策略在闭环模拟中取得了高驾驶得分。然而,这些策略是否能够处理常见的安全关键场景仍不清楚。我们提出了Safe2Drive(S2D),一组与Bench2Drive对齐的场景扩展,重点关注三类常见的道路危险:施工区、行人乱穿马路和被遮挡的弱势道路使用者(VRU)。Safe2Drive增加了100个常见但具有挑战性的场景,并引入了安全驾驶评分(SDS),这是一种以安全为中心的度量,在先前评估器的基础上增加了碰撞前制动、施工区物体接触、车道居中和平滑性检查。在S2D上评估两种最先进的策略(LEAD和SimLingo),我们发现它们的驾驶得分相对于报告的Bench2Drive基线急剧下降(LEAD:从Bench2Drive上的94.70 DS下降到S2D上的39.95 DS;SimLingo:从Bench2Drive上的85.07 DS下降到S2D上的41.00 DS),并且S2D上的SDS较低(LEAD为11.85,SimLingo为15.27)。这些结果与脆弱的安全驾驶行为一致,例如对施工区理解差、闯红灯以及行人制动延迟或缺失。这项研究突显了E2E模型即使在训练集包含的CARLA城镇上进行测试时也缺乏安全行为推理。我们计划发布所有100个S2D场景的代码和视频。

English abstract

Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for SimLingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.

2606.00189 2026-06-02 cs.LG cs.AI

Learning to Construct Practical Agentic Systems

学习构建实用的智能体系统

Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, Cassandra A. Cohen, Arthur Kajiyama, William W. Cohen

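Affiliations * Carnegie Mellon University; Dept. of Computer Science; Emory University

AI summary: Proposes an agent framework built on pseudo-tools and fixed workflows; through modular design and multi-objective optimization, it supports automatic construction and optimization of practical agentic systems while keeping cost controllable and result quality high.

Details
AI-generated Chinese abstract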

基于LLM的智能体系统的自动设计和优化能够产生复杂的系统,显著提升结果质量,优于现成的智能体模式。然而,对实际部署的智能体系统的研究表明,生产系统更关注推理成本的简单性、可控性和可预测性等问题。本文提出了设计和优化实用智能体系统的原则性方法。我们描述了一个智能体框架,通过定义在受限上下文中递归调用LLM的“伪工具”,使设计者能够强制智能体系统的模块化。利用该框架,我们为多种任务手工设计了智能体,并表明相对于动态规划的工作流,手工构建的固定工作流通常更便宜且更准确。随后,我们提出了针对该框架所需的智能体组件(即伪工具和固定工作流)的新型学习方法。这些学习方法通常优于手工设计的智能体。我们还利用框架的模块化特性,应用多目标优化方法联合优化成本和响应质量,并融合多个学习系统的结果。

English abstract

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining "pseudo-tools" that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.

2606.00187 2026-06-02 cs.LG cond-mat.mtrl-sci

AI-Guided Design and Optimization of Graphite-Based Anodes via Iterative Experimental Feedback

基于迭代实验反馈的AI引导石墨负极设计与优化

Qian Du, Mark M. Sullivan, James E. Saal, Florian Huber

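Affiliations * Citrine Informatics; hte GmbH

AI summary: Proposes an iterative AI-guided workflow that uses multi-objective inverse design and feasibility feedback labels to raise battery anode fabrication from frequent failures to a 100% success rate, increase the fraction of high-capacity cells from 28.4% to 84.8%, and raise capacity retention from 42.1% to 97.3%.

Details
Comments
12 pages, 10 figures, 2 tables
AI-generated Chinese abstract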

本研究提出一种迭代AI引导工作流,通过提高配方可行性和工艺鲁棒性来加速石墨负极开发。利用Citrine平台实现AI/ML引导的多目标逆向设计以优化负极。从嘈杂、不完整的数据集开始,Citrine平台生成早期代理模型,尽管预测确定性低,但突出了缺失的工艺约束。通过迭代添加可行性标签和边界条件失败,工作流迅速收敛到可制造、性能更高的配方。制造可靠性从频繁的工艺失败提高到100%成功的电池生产,而提供≥350 mAh g$^{-1}$的电池比例从28.4%增加到84.8%,容量保持率从42.1%上升到97.3%。这些结果表明,结构化的、反馈驱动的AI工作流可以将不完美的工业数据转化为可操作的指导,实现更快、更可重复的电池电极制造优化。

English abstract

This study presents an iterative AI-guided workflow that accelerates graphite-based anode development by improving both formulation feasibility and process robustness. Sequential learning via AI/ML-guided multiobjective inverse design for anode optimization was implemented using the Citrine Platform. Starting from a noisy, incomplete dataset, the Citrine Platform was used to generate early surrogate models, which despite low predictive certainty highlighted missing process constraints. By iteratively adding feasibility labels and boundary condition failures, the workflow rapidly converged toward manufacturable, higher-performing formulations. Fabrication reliability improved from frequent process failures to 100% successful cell production, while the fraction of cells delivering $\geq$ 350 mAh g$^{-1}$ increased from 28.4% to 84.8%, with capacity retention rising from 42.1% to 97.3%. These results demonstrate that structured, feedback-driven AI workflows can transform imperfect industrial data into actionable guidance, enabling faster, more reproducible optimization of battery electrode manufacturing.

2606.00180 2026-06-02 cs.LG cs.AI

Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection

超越增强:基于评分引导的病理先验用于脑电图抑郁症检测

Xiaojing Chen, Jingqi Cheng, Xu Zhao, Wan Jiang, Jingjing Wu

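Affiliations * School of Internet, Anhui University; School of Computer Science and Technology, Hefei University of Technology; School of Computer Science and Information Engineering, Hefei University of Technology

AI summary: To address the small-sample dilemma in EEG-based depression detection, proposes an augmentation-free score-guided classification framework that uses a generative network to model a pathological prior and fuses it with deep features, together with a Cross-Channel Spatial Adaptation module that handles hardware heterogeneity across multi-center datasets.

Details
AI-generated Chinese abstract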

基于深度学习的脑电图(EEG)重度抑郁症(MDD)检测从根本上受到“小样本困境”的制约。主流的生成式数据增强方法不仅带来沉重的计算开销,还可能引入合成噪声,从而模糊分类边界。为了挑战传统的“数据数量优先”惯例,我们提出了一种新颖的框架“超越增强”:评分引导分类(SGC)。SGC不合成伪样本,而是利用无监督生成网络架构对样本的结构和统计异常程度进行建模,作为核心的“病理先验”。该先验经过鲁棒归一化后,与深度特征表示显式融合,从而精确指导分类器的决策边界。此外,为了动态适应不同的通道配置,我们提出了跨通道空间适应模块,利用空间映射机制有效解决多中心数据集中不匹配通道的硬件异构问题。在Mumtaz2016和高密度MODMA数据集上的大量实验证明了我们的方法在具有挑战性的“零数据增强”设置和“零样本合成成本”下的有效性和卓越的泛化能力。

English abstract

Deep learning-based Major Depressive Disorder (MDD) detection using Electroencephalography (EEG) is fundamentally constrained by the "small-sample dilemma." Prevailing generative data augmentation methods not only incur heavy computational overhead but also risk introducing synthetic noise, thereby blurring classification boundaries. To challenge the traditional "data quantity first" convention, we propose a novel framework "Beyond Augmentation": Score-Guided Classification (SGC). SGC does not synthesize pseudo-samples; instead, it utilizes an unsupervised generative network architecture to model the structural and statistical anomaly degrees of samples, serving as the core "Pathological Prior". This prior, after robust normalization, is explicitly fused with deep feature representations, thereby precisely guiding the classifier's decision boundary. Furthermore, to dynamically adapt to varying channel configurations, we propose a Cross-Channel Spatial Adaptation module, utilizing a spatial mapping mechanism to effectively resolve the hardware heterogeneity of mismatched channels in multi-center datasets. Extensive experiments on the Mumtaz2016 and high-density MODMA datasets demonstrate the effectiveness and exceptional generalizability of our method under the challenging "zero data augmentation" setting and at "zero sample synthesis cost".
Keywords: Electroencephalography (EEG), Depression Detection, Anomaly Score, Diffusion Models, Few-Shot Learning

2606.00174 2026-06-02 cs.CV cs.AI

MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding

MyoSem: 将肌电图与自然语言动作语义对齐以实现手部动作理解

Chiyue Wang, Dong She, Yang Gao, Zhanpeng Jin

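Affiliations * South China University of Technology

AI summary: Proposes the MyoSem framework, which combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment to enable bidirectional retrieval between EMG signals and text descriptions; it generally outperforms most baselines on several datasets and generalizes well.

Details
Comments
16 pages, 9 figures. Preprint
AI-generated Chinese abstract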

肌电图(EMG)直接反映肌肉激活,是手势识别、假肢控制和可穿戴交互的关键传感模态。然而,现有的EMG方法通常将手部动作理解视为固定标签的分类问题,难以支持基于动作描述的查询、检索和泛化。我们提出MyoSem,一个EMG-动作语义对齐框架,将低层EMG信号映射到由多视角动作描述构建的共享语义空间。MyoSem结合多视角动作语义构建、激活感知EMG编码和语义查询对齐,实现了EMG信号与文本描述之间的双向检索。我们在EMG2Pose和NinaPro系列数据集上系统评估了MyoSem。结果表明,MyoSem在EMG-文本双向检索上表现良好,普遍优于大多数基线,并在未见用户、保留动作类别和截肢用户迁移场景中展现出良好的泛化性。消融实验和可视化进一步验证了每个模块的有效性。总体而言,MyoSem将基于EMG的手部动作理解从固定标签识别推进到可查询的双向语义检索,为语言介导的EMG动作理解提供了新的建模范式。

English abstract

Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and wearable interaction. Existing EMG methods, however, commonly formulate hand action understanding as classification over fixed labels, making it difficult to support querying, retrieval, and generalization based on action descriptions. We present MyoSem, an EMG–action semantic alignment framework that maps low-level EMG signals into a shared semantic space constructed from multi-view action descriptions. MyoSem combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment, enabling bidirectional retrieval between EMG signals and text descriptions. We systematically evaluate MyoSem on EMG2Pose and NinaPro-series datasets. Results show that MyoSem performs well on EMG–text bidirectional retrieval, generally outperforms most baselines, and shows favorable generalization to unseen users, held-out action classes, and amputee-user transfer scenarios. Ablations and visualizations further validate the effectiveness of each module. Overall, MyoSem advances EMG-based hand action understanding from fixed-label recognition toward queryable bidirectional semantic retrieval, providing a new modeling paradigm for language-mediated EMG action understanding.

2606.00172 2026-06-02 cs.AI

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

CAST:用于GRPO的非特权裁剪非对称自教学与优势翻转

Yang Li, Gongle Xue, Yijia Guo, Yuheng Yuan, Liwen Hu, Lei Ma

Affiliations * School of Software and Microelectronics, Peking University, Beijing; School of Artificial Intelligence, Peking University, Beijing; School of Computer Science, Peking University, Beijing; School of Future Technology, Peking University, Beijing

AI summary: Proposes CAST, which uses an answer-free self-teacher and bidirectional local advantage sign reversal to address sparse rewards and vanishing group-relative advantages in GRPO, improving mathematical reasoning performance.

Details
Comments
10 pages
AI-generated Chinese abstract

基于可验证奖励的强化学习(RLVR),特别是组相对策略优化(GRPO),已被广泛用于改进大型语言模型的推理能力。然而,结果级奖励仅提供稀疏监督,并且当某个提示的所有采样轨迹全部正确或全部错误时,组相对优势会消失。在线自蒸馏(OPSD)提供了密集的令牌级指导,但其令牌偏好不一定与轨迹正确性对齐;实证诊断表明,OPSD信号在正确和错误的轨迹上表现不同,教师正向和教师负向的差距信号表现出不同的噪声特征。这些诊断仅在OPSD风格的特权教师上下文中进行分析,而CAST训练使用无答案的自教师评分。受这些观察启发,本文提出了CAST,一种用于GRPO风格RLVR的无答案自蒸馏方法。CAST保留了基于验证器的GRPO目标,但使用停止梯度的自教师根据轨迹正确性塑造令牌级优势。与先前的自蒸馏RLVR方法不同,CAST不需要参考解条件的教师评分,在整个训练过程中保持自教师对数概率差距活跃,并应用双向局部优势符号翻转:正确轨迹中的教师负向令牌可以获得负的令牌级优势,而错误轨迹中的教师正向令牌可以获得有界的正向局部优势。对于零方差的全正确和全错误组,CAST分配有界的符号约束基础优势,使得这些原本零梯度的组能够贡献验证器签名的令牌反馈。数学推理实验表明,CAST在保持轻量级、基于验证器的轨迹级目标的同时,改进了RLVR训练。

English abstract

Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher scoring. Motivated by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
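
The vanishing of group-relative advantages noted in this abstract can be illustrated with a minimal sketch of the commonly used GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function name `group_relative_advantages`, the stabilizer `eps`, and the toy rewards are assumptions made for illustration. The sketch does not implement CAST's bounded base advantages, which the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward by its group's mean and standard deviation.

    Assumption: eps is a small stabilizer added to the denominator so that
    zero-variance groups do not cause a division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Binary verifier rewards for four sampled rollouts of one prompt (made up).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
all_correct = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_correct)
print(all_wrong)
```

In the all-correct and all-wrong groups every advantage is zero, so those prompts contribute no gradient under this normalization; this is the gap the abstract says CAST fills.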