arXivDaily: daily arXiv digest, updated Monday through Friday
2605.26128 2026-05-27 cs.LG cs.SE

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Jaideep Ray

AI summary: This paper proposes the "constraint tax" measurement protocol, shows experimentally that hard output constraints markedly reduce the answer accuracy and executable accuracy of small language models, and recommends that production systems report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.

Abstract

Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity hardware but have limited capacity to satisfy schemas while solving tasks. The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce constraint tax, a measurement protocol for isolating the answer and executable-accuracy loss caused by structured-output constraints at fixed model, fixed task distribution, and fixed problem instances. Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and increases wrong-valid-schema outputs from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same hard tool-call schema, while both modes are 100.0% schema-valid. The error is semantic, not structural. We also show that the 3B boundary still pays a direct-schema tax and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
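
The recommendation to report the metrics separately can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-key `{"answer": ...}` schema, the function names, and the choice to divide every count by the total number of outputs. The abstract does not define the denominators, so this is not the paper's protocol.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    schema_validity: float
    answer_accuracy: float
    wrong_valid_schema_rate: float


def is_schema_valid(raw: str) -> bool:
    """Hypothetical schema: a JSON object with exactly one key, "answer"."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and set(obj) == {"answer"}


def score(outputs: List[str], gold: List[str]) -> Report:
    """Count valid, correct, and wrong-but-valid outputs separately.

    Assumption: all three rates share the same denominator (all outputs),
    and an output can only be correct if it is schema-valid.
    """
    if not outputs or len(outputs) != len(gold):
        raise ValueError("outputs and gold must be non-empty and equal length")
    valid = correct = wrong_valid = 0
    for raw, expected in zip(outputs, gold):
        if not is_schema_valid(raw):
            continue
        valid += 1
        if str(json.loads(raw)["answer"]) == expected:
            correct += 1
        else:
            wrong_valid += 1
    n = len(outputs)
    return Report(valid / n, correct / n, wrong_valid / n)
```

Keeping the three counters apart is the point of the sketch: a decoder that forces every output into the schema can push `schema_validity` to 1.0 while `wrong_valid_schema_rate` rises, which a single "valid output" metric would hide.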

2605.26103 2026-05-27 cs.CV

Global Structure-from-Motion Meets Feedforward Reconstruction

Linfei Pan, Johannes Schönberger, Marc Pollefeys

AI summary: Proposes a new pipeline that combines the strengths of classical SfM and feedforward reconstruction, achieving state-of-the-art reconstruction results across a wide range of scenarios.

Comments: CVPR 2026, Highlight

Abstract

Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.

2605.26079 2026-05-27 cs.CL

Automated Benchmark Auditing for AI Agents and Large Language Models

Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou

AI summary: Proposes ABA, an automated benchmark auditing framework that systematically audits benchmark tasks for hidden environment dependencies, specification gaps, and grading-logic problems. Across 168 benchmarks it finds critical issues in over 25.7% of tasks; filtering those tasks shifts model rankings and raises average performance by 9.9% on SWE-bench Verified and 9.6% on Terminal-Bench 2.

Abstract

Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out these tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.

2605.26047 2026-05-27 cs.AI

Retrying vs Resampling in AI Control

James Lucassen, Adam Kaufman

AI summary: Studies the safety of retrying versus resampling in AI coding scaffolds; finds that retrying leaks information that enables sneakier attacks, whereas resampling with audits on the maximum suspicion score improves safety.

Abstract

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
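
The two resampling ideas in this abstract, auditing on the maximum suspicion score and resampling only steps that look suspicious on the first draw, can be sketched as follows. The function names, both thresholds, and the `draw_score` callback are hypothetical; the abstract does not specify how scores are produced or how thresholds are set.

```python
from typing import Callable, List, Sequence


def should_audit(scores: Sequence[float], audit_threshold: float) -> bool:
    """Audit the step if the most suspicious sample crosses the threshold."""
    return max(scores) >= audit_threshold


def selective_resample(
    first_score: float,
    resample_threshold: float,
    draw_score: Callable[[], float],
    extra_draws: int = 4,
) -> List[float]:
    """Draw extra samples only when the first draw already looks suspicious.

    draw_score is a hypothetical callback that samples one more action from
    the same context and returns the monitor's suspicion score for it.
    """
    scores = [first_score]
    if first_score >= resample_threshold:
        scores.extend(draw_score() for _ in range(extra_draws))
    return scores
```

With `extra_draws=4`, a flagged step ends up with five samples in total, matching the five-samples-per-step setting the abstract describes. Using `max` rather than `min` means a single suspicious sample is enough to trigger an audit.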

2605.25971 2026-05-27 cs.CL cs.IR cs.MA

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu

AI summary: Proposes ProAct, a proactive agent architecture that uses idle-time compute to anticipate and fulfill future user needs, and validates it on the ProActEval benchmark, where it requires fewer turns, reduces user effort, and lowers hallucination rates.

Comments: 26 pages, 4 figures; code available at https://github.com/AgentACE-AI/ProAct

Abstract

While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.

2605.25861 2026-05-27 cs.CV cs.AI

MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen

AI summary: Proposes MuNet, a mutualistic network that jointly optimizes 3D human mesh recovery and 3D clothed human reconstruction through a unified representation and a mutualistic mechanism, reaching state-of-the-art performance on six benchmark datasets.

Abstract

3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.

2605.25758 2026-05-27 cs.CL

StreamProfileBench: A Benchmark for Fine-Grained User Profile Inference in Real-World Streaming Scenarios

Sizhe Wang, Feiyu Duan, Juelin Wang, Liwen Zhang, Zhongyu Wei

AI summary: Proposes the StreamProfileBench benchmark, which uses a continuous state maintenance task and an annotation-free evaluation framework to study the conservative bias of large language models in streaming user-profile updates.

Abstract

Large Language Models (LLMs) have reshaped user profiling, yet current evaluations mainly focus on static data snapshots. This paradigm overlooks the reality of personalized systems, where User-Generated Content (UGC) arrives continuously and fine-grained profiles evolve rapidly. To bridge this gap, we introduce StreamProfileBench, a large-scale benchmark for fine-grained streaming user profiling. We formalize streaming user profiling as a continuous state maintenance task and curate a highly authentic dataset comprising over 120,000 UGC posts from 7,000+ real users across five diverse platforms. By leveraging the temporal correlation of user interests, we further propose a novel, annotation-free evaluation framework. Extensive experiments across 14 leading LLMs reveal that continuous profile updating remains an open challenge. Models exhibit a systemic conservative bias, over-retaining past interests while failing to recognize interest decay. Ablation experiments further validate the practical utility and necessity of the streaming paradigm.

2605.25629 2026-05-27 cs.CL cs.LG

When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen

AI summary: Studies weak-to-strong preference learning under zero-shot distribution shift, finds that weakly supervised fine-tuning pulls the strong model toward source-domain features, and proposes Representation Anchoring, a regularizer that improves out-of-distribution transfer.

Comments: Code: https://anonymous.4open.science/r/w2s_reward_ood-682F

Abstract

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
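
The idea of constraining drift from the pretrained representation space can be illustrated with a generic penalty term. This is a sketch under stated assumptions only: the abstract does not give the form of the Anchor regularizer, so the mean squared distance, the `weight` parameter, and the function name below are all hypothetical.

```python
from typing import Sequence


def anchored_loss(
    task_loss: float,
    hidden: Sequence[float],
    anchor_hidden: Sequence[float],
    weight: float = 0.1,
) -> float:
    """Task loss plus a penalty on drift from the pretrained representation.

    hidden: representation from the model being fine-tuned.
    anchor_hidden: representation of the same input from the frozen
    pretrained model. The squared-distance penalty is an assumed form.
    """
    if len(hidden) != len(anchor_hidden) or not hidden:
        raise ValueError("representations must be non-empty and equal length")
    drift = sum((h - a) ** 2 for h, a in zip(hidden, anchor_hidden)) / len(hidden)
    return task_loss + weight * drift
```

When the fine-tuned representation equals the pretrained one, the penalty vanishes and only the task loss remains; a larger `weight` trades task adaptation for staying close to the pretrained features.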

2605.25570 2026-05-27 cs.CV

From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation

Rui Hu, Song Wu, Wen Yang, Jinjian Wu

AI summary: Proposes a hybrid-supervised framework based on Spatio-temporal Structural Consistency (STSC), combining a bidirectionally complementary multi-scale architecture with a curriculum-guided hybrid training strategy, that reaches state-of-the-art performance in continuous-time and standard optical flow estimation.

Comments: Accepted by CVPR 2026

Abstract

Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of temporally dense ground-truth annotations limits the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion. To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization. Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.

2605.25569 2026-05-27 cs.CV

ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement

Yufeng Yang, Jianzhuang Liu, Jisheng Chu, Yuqi Peng, Xianfang Zeng, Jiancheng Huang, Shifeng Chen

AI summary: Proposes the ControlLight framework, which achieves controllable, consistent, and generalizable low-light enhancement by building a large-scale dataset with continuous illumination-strength supervision and introducing a misalignment-aware weighted flow matching loss.

Comments: 18 pages, 12 figures

Abstract

Existing deep learning-based low-light enhancement methods are typically trained on limited datasets with single enhancement targets, which restricts their generalization ability and controllability in real-world applications. To overcome these limitations, we propose ControlLight, a controllable, consistent, and generalizable framework for low-light enhancement. We first construct a large-scale dataset of real-world degraded images with continuous illumination-strength supervision. To further ensure consistent outputs under different control strengths, we introduce a misalignment-aware weighted flow matching loss that preserves image structure across continuous enhancement strengths. ControlLight allows users to edit real-world degraded low-light images toward satisfactory enhancement results by flexibly controlling the strength while preserving visual consistency and realism. Extensive experiments show that ControlLight achieves state-of-the-art performance against existing low-light enhancement approaches while demonstrating strong continuous controllability and generalization to real-world scenarios.

2605.25538 2026-05-27 cs.CV cs.DB

Tetris: Tile-level Sampling for Efficient and High-Fidelity Video Object Tracking

Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung

AI summary: Proposes Tetris, a system that decomposes videos into a tile-based polyomino data model for fine-grained spatiotemporal pruning, staying within a 5% tracking accuracy loss while achieving up to 68.8x higher throughput than the full-frame, every-frame reference pipeline.

Abstract

Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at https://tetris-db.github.io .

2605.25353 2026-05-27 cs.LG cs.CV physics.comp-ph

PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

AI summary: Introduces the PDEInvBench benchmark dataset, built from numerical simulations covering a range of PDEs, and systematically explores the neural-network design space along three dimensions (optimization, representation, and scaling), yielding practical insights on two-stage training, PDE-derivative inputs, and initial-condition diversity.

Comments: 37 total pages, 13 main pages, 20 figures, 8 tables. Published in Transactions on Machine Learning Research (TMLR), 2026

Journal ref: Transactions on Machine Learning Research, 2026

Abstract

Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.
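
To make "using the PDE residual" concrete, here is a minimal sketch of a residual-based inverse problem. It is a deliberately simplified stand-in and not the paper's method: instead of fine-tuning a neural network, it recovers the diffusivity of a 1D heat equation u_t = nu * u_xx in closed form by least squares on the finite-difference residual. The equation, grid sizes, and function names are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def heat_solution(nu: float, nx: int = 64, nt: int = 200,
                  t_end: float = 0.1) -> Tuple[List[List[float]], float, float]:
    """Analytic solution u(x, t) = exp(-nu * t) * sin(x) on periodic [0, 2*pi)."""
    dx = 2 * math.pi / nx
    dt = t_end / (nt - 1)
    u = [[math.exp(-nu * n * dt) * math.sin(i * dx) for i in range(nx)]
         for n in range(nt)]
    return u, dx, dt


def estimate_diffusivity(u: List[List[float]], dx: float, dt: float) -> float:
    """Return the nu that minimizes sum((u_t - nu * u_xx)**2) over the grid.

    Derivatives use central differences; space is treated as periodic.
    """
    nt, nx = len(u), len(u[0])
    num = den = 0.0
    for n in range(1, nt - 1):
        for i in range(nx):
            u_t = (u[n + 1][i] - u[n - 1][i]) / (2 * dt)
            u_xx = (u[n][(i + 1) % nx] - 2 * u[n][i] + u[n][(i - 1) % nx]) / dx**2
            num += u_t * u_xx
            den += u_xx * u_xx
    return num / den
```

Calling `estimate_diffusivity(*heat_solution(0.5))` should return a value close to 0.5, with a small bias from the finite-difference approximation. The same residual could serve as a label-free objective at test time, which is the role the abstract assigns to it in the second training stage.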

2605.25029 2026-05-27 cs.RO

ParkingWorld: End-to-End Autonomous Parking Reinforcement Learning from Corrective Experience in 3DGS Simulation

Zhengcheng Yu, Changze Li, Haoran Liu, Tong Qin

AI summary: Proposes CIL-SERL, a correction-in-the-loop sample-efficient reinforcement learning framework that trains an end-to-end autonomous parking policy in a photorealistic 3D Gaussian Splatting (3DGS) simulator and uses a multi-level replay buffer mechanism to improve success rate, efficiency, and safety.

Comments: 9 pages (including 1 page of Appendix), 6 figures. Will be submitted to RA-L 2026

Abstract

Autonomous parking demands precise low-speed maneuvering within narrow, cluttered, and highly constrained environments, where vehicles must navigate tight spaces while avoiding static obstacles and complex geometric boundaries. Unlike imitation learning, which typically requires massive volumes of high-quality expert demonstrations to converge to a stable policy and often suffers from limited generalization to unseen scenarios, traditional reinforcement learning (RL) methods face persistent challenges including excessive training overhead, inefficient exploration, and even failure to learn viable parking strategies in challenging settings. To address these limitations, this paper presents a correction-in-the-loop sample-efficient reinforcement learning (CIL-SERL) framework for end-to-end autonomous parking, which is entirely trained in a photorealistic 3D Gaussian Splatting (3DGS) parking simulator that enables high-fidelity digital reconstruction of real-world scenes. Inspired by error-correction notebooks used in learning practice, we design a novel multi-level replay buffer mechanism. These buffers hierarchically organize and store standard RL rollouts, human corrective interventions, failed exploration trajectories, and rollback-based correction segments in separate yet interconnected memory regions, facilitating structured sampling and targeted learning during training. The proposed framework is systematically evaluated in both the 3DGS simulation environment and a physical vehicle platform. Extensive experimental results demonstrate that our method achieves substantial improvements in parking success rate, operational efficiency, and safety performance across diverse scenarios, validating the effectiveness and practical applicability of the proposed CIL-SERL-based end-to-end autonomous parking solution.

2605.24785 2026-05-27 cs.AI

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Yubo Li, Yidi Miao, Yuntian Shen, Yuxin Liu

AI summary: Proposes the PANDO framework, which uses online skill distillation, a structured skill library, and cache-aware prompting to achieve a higher success rate on VisualWebArena tasks with lower token consumption.

Abstract

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

2605.24383 2026-05-27 cs.AI cs.CY cs.SE

A governance horizon for ethical-use constraints in open-weight AI models

Weiwei Xu, Hengzhi Ye, Haoran Ye, Kai Gao, Vladimir Filkov, Minghui Zhou

AI summary: By auditing model repositories on Hugging Face Hub, finds that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, introduces the concept of a governance horizon, and compares the effects of different policy designs.

Abstract

Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps (R^2 = 0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
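
A half-life of 1.31 derivation steps can be turned into a rough decay curve. The sketch below assumes a pure exponential model, which the abstract's fit suggests but does not spell out; the function name is hypothetical. It describes the restriction-evidence signal only and is not directly comparable to the "at least 80% of descendant models" figure, which measures a different quantity.

```python
def remaining_fraction(steps: float, half_life: float = 1.31) -> float:
    """Fraction of the initial signal left after `steps` derivation steps,
    assuming pure exponential decay with the given half-life."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (steps / half_life)
```

Under this assumption the signal halves every 1.31 steps, so after seven steps only a few percent of the initial restriction evidence would remain.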

2605.24296 2026-05-27 cs.AI cs.IR

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

合成专利数据何时有帮助?低资源多标签分类中的数量-保真度权衡

Amirhossein Yousefiramandi, Ciaran Cooney

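AI summary Studies the volume-fidelity trade-off when LLM-generated synthetic data is used for multi-label patent classification, finding that the volume effect dominates in low-resource settings, fidelity matters more in high-resource settings, and a mixed-data strategy performs best.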

AI Chinese abstract

关于利用通过LLM生成的合成数据进行多标签专利分类时必须考虑的问题包括:(i) 何时使用此类数据可能有所帮助以及(ii) 为何如此。实际上,前一部分适当调整了通过增加样本量来改进结果的可能性。当前实验涉及六个开源LLM(从3.8B到12B参数),针对辅助技术64个WIPO标签分类的四种真实数据机制。应用了基于标签集条件化的全合成生成方法和释义方法,每种方法与三种分类器类别结合使用。结果表明,BERT-for-Patents的微F1从0.120到0.702的声称改进主要反映了数量效应;实际上,在165个样本中进行有放回复制产生了0.678。因此,相对于对照组的改进为+0.024,而与最佳基线(焦点损失重加权)相比为+0.219。这里要考虑的第二个关键点是随着数据生成机制变化,保真度分数的演变。对于低真实数据机制,数量效应占主导,最大均值差异(MMD)与分类性能之间的相关系数等于r = +0.95。随着使用更多真实数据,相关性变为负值,在1:10机制下达到r = -0.73(Fisher z = +6.47,p < 0.001,Delta r的95% CI [ +0.96, +1.00 ])。在固定预算分配方面,将真实数据(约20-30%)与合成数据(70-80%)结合优于纯合成和纯真实策略。此外,一个能够将原始微F1改进高达+0.58的语料库可能会对Jaccard重叠检索代理产生不利影响。其他体裁的提示族变体可能提供对该现象的一些解释,但使用标准专利过滤器仍使nDCG@10降低26%。

English abstract

The issues that must be considered regarding the utilization of synthetic data generated through LLMs for multilabel patent classification include (i) when the use of such data may help and (ii) why. Indeed, the former part appropriately adjusts for the possibility of improving results by an increase in sample size. The current experiment involves six open-source LLMs (from 3.8B to 12B parameters) for four real-data regimes in classification of 64 WIPO labels of assistive technologies. Both full-synthesis generation, conditioned on the label set, and paraphrasing methods are applied, with each used in combination with three classifier categories. It is shown that the claimed improvements in micro F1 for BERT-for-Patents from 0.120 to 0.702 mainly reflect a volume effect; indeed, replication with replacement in 165 examples produces 0.678. Thus, the improvement over the control is +0.024, while compared to the best baseline (focal loss reweighting) is +0.219. The second crucial point to consider here is that of evolving fidelity scores as the data generation regime varies. For low real-data regimes, the volume effect dominates and the correlation coefficient between maximum mean discrepancy (MMD) and classification performance equals r = +0.95. As more real data is used, the correlation becomes inverted and reaches r = -0.73 at the 1:10 regime (Fisher z = +6.47, p < 0.001, 95% CI on Delta r [ +0.96, +1.00 ]). In terms of a fixed budget allocation, combining real data (about 20-30%) with synthetic (70-80%) outperforms both purely synthetic and purely real strategies. Moreover, a corpus that allows for improvement in classification performance up to +0.58 in raw micro F1 may adversely affect a Jaccard-overlap retrieval proxy. Prompt-family variations for other genres may provide some explanation of the phenomenon, but using the standard-patent filter still decreases nDCG@10 by 26%.
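The volume control described above can be sketched as follows. This is an illustration only, assuming the control simply resamples the real examples with replacement until it matches the size of the augmented corpus. The helper name `volume_matched_control` and the target size of 1,000 are hypothetical; the micro F1 values are the ones quoted in the abstract.

```python
import random


def volume_matched_control(real_examples, target_size, seed=0):
    """Resample real examples with replacement up to `target_size`.

    Training on this set isolates the effect of sheer volume, because no
    new (synthetic) information is added.
    """
    rng = random.Random(seed)
    return [rng.choice(real_examples) for _ in range(target_size)]


# Micro F1 values quoted in the abstract.
F1_SYNTHETIC = 0.702      # BERT-for-Patents with synthetic augmentation
F1_VOLUME_CONTROL = 0.678  # replication with replacement of 165 real examples
GAIN_OVER_BASELINE = 0.219  # gain over focal-loss reweighting

gain_over_control = round(F1_SYNTHETIC - F1_VOLUME_CONTROL, 3)
implied_baseline_f1 = round(F1_SYNTHETIC - GAIN_OVER_BASELINE, 3)

if __name__ == "__main__":
    control = volume_matched_control(list(range(165)), target_size=1000)
    print(len(control), gain_over_control, implied_baseline_f1)
```

The arithmetic makes the abstract's point explicit: most of the headline gain is recovered by duplication alone, and only the residual over the control can be attributed to the synthetic content.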

2605.24217 2026-05-27 cs.AI cs.DC

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

识别和减轻生产级LLM推理基准中的系统性测量偏差

Ashok Chandrasekar, Jason Kramberger

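AI summary To address measurement bias caused by client-side queuing in production LLM inference benchmarks, the paper proposes an unbiased multi-process evaluation framework and the Normalized Time Per Output Token (NTPOT) metric, enabling accurate performance evaluation under high concurrency.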

AI Chinese abstract

随着大型语言模型(LLM)从研究环境过渡到生产部署,评估其是否满足严格的服务水平目标(SLO)变得至关重要。然而,当前的评估方法在大规模下存在严重的测量偏差。我们证明,广泛使用的基准测试工具依赖于单进程、异步驱动架构,在高并发下引入了根本性的客户端排队瓶颈。通过将基准测试客户端建模为$M/G/1$队列,我们从数学上展示了Python全局解释器锁(GIL)如何随着请求速率增加而人为地膨胀首令牌时间(TTFT)和每输出令牌时间(TPOT)指标。为了解决这一系统性不准确性,我们提出了一个无偏的多进程评估框架,有效分散客户端负载,确保可忽略的排队开销。此外,我们形式化了一个复合指标——归一化每输出令牌时间(NTPOT),以稳健地摊销端到端延迟,包括跨序列长度的预填充和调度延迟。我们的实证评估表明,该方法成功隔离了纯服务引擎性能,能够在每秒数千个查询的生产规模下对LLM进行准确、可复现的性能分析。

English abstract

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
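The client-side queuing effect can be illustrated with the standard Pollaczek-Khinchine formula for the mean waiting time in an M/G/1 queue. This is a minimal sketch, not the paper's derivation: it assumes each request costs the single-process client a fixed amount of CPU time, and the 0.5 ms service time and the request rates are hypothetical numbers.

```python
def mg1_mean_wait(arrival_rate: float, mean_service: float,
                  second_moment_service: float) -> float:
    """Mean time a request waits in an M/G/1 queue (Pollaczek-Khinchine).

    arrival_rate: requests per second (lambda)
    mean_service: E[S], mean client-side service time in seconds
    second_moment_service: E[S^2] in seconds squared
    """
    utilization = arrival_rate * mean_service  # rho
    if utilization >= 1.0:
        raise ValueError("queue is unstable: utilization must be below 1")
    return arrival_rate * second_moment_service / (2.0 * (1.0 - utilization))


if __name__ == "__main__":
    service = 0.0005  # hypothetical: 0.5 ms of client CPU work per request
    for rate in (200.0, 1000.0, 1800.0, 1950.0):
        wait = mg1_mean_wait(rate, service, service ** 2)  # deterministic service
        print(f"{rate:7.0f} req/s -> {wait * 1000:.3f} ms of client-side queuing")
```

The waiting time grows without bound as utilization approaches 1, so latency measured at the client includes a queuing term that does not come from the serving engine.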

2605.24152 2026-05-27 cs.AI

Neuro-Inspired Inverse Learning for Planning and Control

神经启发式逆向学习用于规划与控制

Maryna Kapitonova, Tonio Ball

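AI summary Proposes Inverter, a neuro-inspired framework trained through Inverse Learning (IL) that combines forward/inverse internal models, open-loop multi-step motor commands, and hierarchical action organization, achieving efficient inference on planning and control tasks with an average performance gain of 24.2% and one to two orders of magnitude less compute time.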

Comments Version 2, minor fix in online version of the abstract, pdf unchanged

AI Chinese abstract

我们提出了一种用于具身规划与控制的神经启发式框架。基于哺乳动物大脑中实现快速高效目标导向行为的三个原则——配对的前向/逆向内部模型、开环多步运动指令以及顺序层次化的动作组织——我们的Inverter框架使用学习组件,通过逆向学习(IL)进行端到端训练,并在自然情况下辅以解析或算法模块;我们形式化了IL,并将其与监督学习、强化学习和模仿学习区分开来。IL桥接了强化学习(RL)式的摊销(单次前向传播但每次只输出一个动作)和最优控制(OC)式的序列规划(整个轨迹但需要迭代测试时计算)。单个Inverter或层次化n=2的Inverter堆栈在所有3个maze2d和6个antmaze D4RL变体上,平均比离线RL和扩散规划基线提升24.2%(范围-1.9%至+78.2%),同时推理计算时间减少一到两个数量级。显著的是,通过前向模型(FoM)对整个T步动作序列进行优化(而非逐步骤优化),使得Inverter能够生成平滑、目标一致、轨迹级的结构,并达到比训练数据本身所蕴含的策略更接近解析最优的控制策略。我们还发现了IL的一种失败模式:在训练数据覆盖范围狭窄时出现FoM攻击,我们通过使用覆盖范围更广的随机训练数据来缓解。作为一个应用实例,脉冲Inverter合成任意单量子比特量子门,其保真度与标准迭代数值基线(GRAPE)相当,而每个门的计算时间降低超过1000倍。总之,我们得出结论:IL实现了一类通用的世界接口,特别适用于对延迟和资源敏感的具身AI。

English abstract

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

2605.24071 2026-05-27 cs.LG cs.AI

Not All Transitions Matter: Evidence from PPO

并非所有转移都重要:来自PPO的证据

Ajhesh Basnet

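AI summary Proposes randomly dropping a fixed fraction of rollout transitions during PPO training to break the repetitive gradient structure and stabilize training, and validates the effect across multiple environments.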

Comments 19 pages, 5 figures. Accepted to 2026 8th Asia Conference on Machine Learning and Computing (ACMLC 2026)

Journal ref
Proceedings of the 2026 8th Asia Conference on Machine Learning and Computing
AI Chinese abstract

在策略上训练强化学习代理意味着每次更新时收集新的经验,而这些经验隐藏着一个问题。轨迹中的每个状态都是前一个状态的直接输出,由代理自身的动作因果链连接。因此,连续的转移从未真正独立。它们携带重叠信息,网络接收到的梯度信号最终比批次大小所暗示的要重复得多。相同的方向被反复强化,价值网络在策略变化时难以跟上,训练变得悄悄不稳定,而仅凭奖励曲线很少能揭示这一点。本文询问这种冗余是否可以简单地移除。我们表明,在适当阶段从轨迹中随机丢弃固定比例的转移,使得奖励信号保持完整,足以打破重复的梯度结构并稳定训练。变化很小:一个采样步骤,没有新组件,不修改核心算法,并且适用于任何PPO实现。在五个难度递增的环境(CartPole-v1、Acrobot-v1、LunarLander-v2、HalfCheetah-v5和Hopper-v5)中,该方法在奖励上与标准PPO匹配,同时在KL散度、策略熵和价值估计上产生更一致的训练动态。丢弃25%的转移是最佳点:足以破坏冗余,又不至于使批次过薄。

English abstract

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
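The sampling step described above might look like the sketch below. It assumes that "the right stage" means after returns and advantages have been computed over the full rollout, so the reward signal is unaffected; that reading, the function name `drop_transitions`, and the batch layout are assumptions rather than the paper's implementation.

```python
import numpy as np


def drop_transitions(batch: dict, drop_fraction: float = 0.25, rng=None) -> dict:
    """Randomly discard a fixed fraction of transitions from a rollout batch.

    `batch` maps names to arrays that share the same leading dimension.
    Advantages and returns are assumed to be already computed on the full
    rollout, so dropping rows does not change them.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(batch["advantages"])
    n_keep = n - int(round(drop_fraction * n))
    # Sample indices without replacement and keep the original time order.
    keep = np.sort(rng.choice(n, size=n_keep, replace=False))
    return {name: values[keep] for name, values in batch.items()}


if __name__ == "__main__":
    rollout = {
        "obs": np.arange(16).reshape(8, 2),
        "actions": np.arange(8),
        "advantages": np.linspace(-1.0, 1.0, 8),
    }
    thinned = drop_transitions(rollout, drop_fraction=0.25,
                               rng=np.random.default_rng(0))
    print({name: values.shape for name, values in thinned.items()})
```

The thinned batch can then be passed to an unchanged PPO update, which matches the abstract's claim that only one sampling step is added.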

2605.24042 2026-05-27 cs.LG cs.AI

Hidden-State Privacy Has an Empty Middle

隐藏状态隐私存在空中间

Alexander Okezue Bell

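AI summary Shows through a theoretical lower bound and experiments that Gaussian release mechanisms for hidden-state privacy cannot achieve moderate utility and moderate privacy at the same time, leaving an empty middle, and identifies the diagonal inverse-Fisher mechanism as the minimax-optimal diagonal mechanism.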

Comments 74 pages, 61 figures

AI Chinese abstract

在我们测试的1536个高斯释放协方差中,对于单层隐藏状态隐私,没有一个能在自适应检索攻击者下同时实现中等效用和中等隐私。我们证明了一个互补的Fisher球下界:每个具有O(1) Fisher效用的满秩高斯释放都存在一个方向,其马氏信号随隐藏宽度线性增长,排除了该类中的均匀高斯安全性,并与经验上的空中间匹配。对角逆Fisher释放Σ^⋆_{diag}(K) = (2K/d) diag(1/F_{ii})是在一阶KL预算K下唯一的最小最大最优对角机制,也是在32个模型层网格的每个点上最坏攻击者top-1 ≤ 0.001的唯一释放,但它位于隐私/效用边界上,而不是填充中间。在欧几里得检索下达到13倍帕累托缩减的广义特征机制,在自适应马氏攻击者下崩溃为100% top-1,而全轨迹序列逆变器恢复了干净GPT-2前缀的94%,但在Σ_{diag}下为0%。从头训练的分离记忆Transformer在90M时达到G_{Mah} ∈ [20, 33],并在固定token语言建模损失惩罚下,从30M到1B保持比相同预算GPT基线6-24倍的优势;预训练模型最高为9.3。这些结果将隐藏状态释放从高斯类内的机制设计重新定义为架构或释放协同设计。

English abstract

Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $Σ^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $Σ_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
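The diagonal inverse-Fisher covariance above is simple to compute once the diagonal Fisher entries are known. This is a minimal sketch of the formula $(2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ only; the function names and the toy Fisher values are hypothetical, and estimating $F_{ii}$ from a model is outside its scope.

```python
import numpy as np


def diag_inverse_fisher_variances(fisher_diag, kl_budget: float) -> np.ndarray:
    """Per-coordinate noise variances (2K/d) / F_ii for a diagonal release."""
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    if np.any(fisher_diag <= 0.0):
        raise ValueError("diagonal Fisher entries must be positive")
    d = fisher_diag.size
    return (2.0 * kl_budget / d) / fisher_diag


def release_hidden_state(hidden, fisher_diag, kl_budget: float, rng) -> np.ndarray:
    """Add zero-mean Gaussian noise with the diagonal inverse-Fisher covariance."""
    hidden = np.asarray(hidden, dtype=float)
    variances = diag_inverse_fisher_variances(fisher_diag, kl_budget)
    return hidden + rng.normal(size=hidden.shape) * np.sqrt(variances)


if __name__ == "__main__":
    fisher = np.array([1.0, 4.0, 0.25, 2.0])  # toy values, not from the paper
    variances = diag_inverse_fisher_variances(fisher, kl_budget=1.0)
    print(variances, float(np.sum(fisher * variances)))
    print(release_hidden_state(np.zeros(4), fisher, 1.0, np.random.default_rng(0)))
```

By construction every coordinate contributes the same amount, $2\mathcal{K}/d$, to $\sum_i F_{ii}\,\Sigma_{ii} = 2\mathcal{K}$, so low-Fisher directions receive proportionally more noise.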

2605.24001 2026-05-27 cs.CV cs.AI cs.LG

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Diff-Instruct with Diffused Reward: 迈向有原则的一步生成器强化学习

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

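AI summary To address the mismatch between reward optimization and generative dynamics in one-step generator RL, proposes DIDR, a data-free trajectory-level alignment framework based on Integral KL minimization that applies a reward-driven correction through the Diffused Reward Score and a proxy estimator, achieving Pareto gains on one-step SDXL and a 6B DiT backbone.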

Comments author list correction

AI Chinese abstract

近期一步文本到图像生成的进展实现了实时合成,具有显著的效率和质量。先前用于一步生成器的强化学习方法将图像空间奖励优化与扩散噪声空间分布匹配相结合。这种范式由于终端奖励优化与底层生成动力学之间的不匹配带来了挑战。结果,优化倾向于利用随机自由度,通常以牺牲图像保真度为代价来提高奖励。为了解决这个问题,我们提出了Diff-Instruct with Diffused Reward (DIDR),一个从积分KL最小化推导出的无数据轨迹级对齐框架。DIDR将RLHF最优的奖励倾斜干净图像分布沿扩散轨迹传播到所有噪声水平。我们证明该目标与干净图像RLHF具有相同的最小化器,同时自然诱导出扩散奖励分数(DRS),它作为对参考分数函数的奖励驱动校正。为了使其实用,我们进一步引入了扩散奖励代理(DRP),一种基于可微短步去噪的DRS高效估计器。大量实验表明,DIDR持续帕累托主导现有的一步SDXL基线。此外,当迁移到6B DiT骨干网络(Z-Image)时,DIDR在偏好对齐上超越了其50步教师模型,同时仅需单步生成。

English abstract

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.
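The objects named above can be written out in their standard textbook forms. This is a hedged sketch, not the paper's exact objective: the notation ($p_{\mathrm{ref}}$, reward $r$, regularization strength $\beta$, forward kernel $q_t$, generator marginal $p_{\theta,t}$, weight $w(t)$) is assumed here, and prompt conditioning is omitted.

```latex
% Assumed notation: p_ref is the reference clean-image distribution, r the reward,
% beta > 0 the KL-regularization strength, q_t(x_t | x_0) the forward diffusion kernel,
% p_{\theta,t} the diffused marginal of the one-step generator, w(t) a positive weight.
\begin{align}
p^{\star}(x_0) &= \frac{1}{Z}\, p_{\mathrm{ref}}(x_0)\, \exp\!\big(r(x_0)/\beta\big),
  \qquad Z = \int p_{\mathrm{ref}}(x_0)\, \exp\!\big(r(x_0)/\beta\big)\, dx_0, \\
p^{\star}_t(x_t) &= \int q_t(x_t \mid x_0)\, p^{\star}(x_0)\, dx_0, \\
\mathcal{L}_{\mathrm{IKL}}(\theta) &= \int_0^T w(t)\,
  \mathrm{KL}\!\big(p_{\theta,t} \,\|\, p^{\star}_t\big)\, dt .
\end{align}
```

The first line is the closed-form optimum of KL-regularized reward maximization, the second diffuses it to noise level $t$, and the third integrates the per-level KL divergence along the trajectory.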

2605.23651 2026-05-27 cs.CL

How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework

大型语言模型有多像人类?一个语域感知的语言评估框架

Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier, Emmanuelle Salin

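AI summary Proposes a register-aware evaluation framework that compares the distributions of lexico-grammatical features in a human reference corpus and in LLM-generated text (using Maximum Mean Discrepancy and Biber's 67 features), finding that LLMs deviate from the human baseline and that which model is closest to human language depends on the register rather than on model size.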

Comments 8.5 pages (main) + 31 pages appendix, 29 figures, 10 tables. Code and data: https://github.com/BjoernNieth/Register_Aware_LLMs

AI Chinese abstract

虽然事实正确性和任务性能长期以来一直是大型语言模型(LLM)研究的焦点,但生成文本在语言层面上与人类相似程度这一基本问题尚未得到充分探索。从语料库语言学的角度来看,语言生产本质上是依赖语境的,不同的交际语境会导致语言特征的频率和共现模式产生差异。未能遵循这些模式的文本可能在内容上是正确的,但仍然不受人类读者欢迎。在这项工作中,我们提出了一个上下文感知的评估框架,其中通过使用给定语域的人类参考语料库与相应的LLM生成语料库之间的语言特征分布的两样本问题来评估人类相似度。我们使用最大均值差异(MMD)和Biber引入的67个词汇语法特征来实现该框架,这些特征通常应用于语料库语言学。在我们的实验中,我们比较了七个经过指令微调的开源模型,跨越五个不同语域的英语数据集,并与人类基线进行对比。虽然在所有测试设置中,LLM都偏离了人类基线,但哪些模型最接近人类语言取决于语域,而不是由模型大小决定。

English abstract

While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
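The two-sample comparison above can be illustrated with a small Maximum Mean Discrepancy estimator. This is a sketch under stated assumptions: it uses the biased (V-statistic) estimate with an RBF kernel and a fixed bandwidth, and the toy 3-dimensional feature vectors stand in for per-text feature vectors. The paper's kernel, bandwidth choice, and estimator are not specified in the abstract.

```python
import numpy as np


def rbf_kernel(a: np.ndarray, b: np.ndarray, bandwidth: float) -> np.ndarray:
    """Gaussian kernel matrix between the rows of `a` and the rows of `b`."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd2_biased(x: np.ndarray, y: np.ndarray, bandwidth: float = 1.0) -> float:
    """Biased estimate of squared MMD between samples `x` and `y`."""
    return float(rbf_kernel(x, x, bandwidth).mean()
                 + rbf_kernel(y, y, bandwidth).mean()
                 - 2.0 * rbf_kernel(x, y, bandwidth).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    human = rng.normal(size=(200, 3))            # toy "human" feature vectors
    same_register = rng.normal(size=(200, 3))    # drawn from the same distribution
    shifted = rng.normal(size=(200, 3)) + 3.0    # toy "model" with shifted features
    print(mmd2_biased(human, same_register), mmd2_biased(human, shifted))
```

A larger value indicates that the two feature distributions are further apart, which is the sense in which a model's output is less human-like for a given register.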

2605.23327 2026-05-27 cs.CV

GFSR: Geometric Fidelity and Spatial Refinement for Reliable Lane Detection

GFSR:用于可靠车道检测的几何保真度与空间细化

Tiancheng Wang, Zhaolu Ding, Richeng Xu, Tianhui Zheng, Hui Liu, Hanyu Xuan, Zhiliang Wu, Guanghui Yue

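AI summary To address the decoupling of classification confidence from geometric quality and the weakened correlations among sampling points in regression modules, which degrade existing lane detectors in complex scenes, proposes the GFSR framework, comprising LaneIoU-guided Confidence Calibration and Adaptive Gated Location Refinement, achieving state-of-the-art performance on CULane and 87.35% F1_50 on CurveLanes.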

Comments Submitted to IEEE Transactions on Intelligent Transportation Systems. 12 pages, 6 figures

AI Chinese abstract

车道检测是自动驾驶和高级驾驶辅助系统中的一项关键感知任务。然而,现有方法在复杂真实场景中仍会退化,原因在于两个主要限制。首先,分类置信度仅表征车道先验的分类存在性,与几何质量无强相关性。如果仅基于该置信度进行阈值过滤和NMS,模型倾向于保留高置信度的车道先验,而消除那些置信度较低但几何表示更优的先验。其次,现有方法中的回归模块削弱了采样点之间的相关性,阻碍了对远处、高曲率和复杂拓扑车道的细粒度优化,导致欠拟合。为解决这些问题,我们提出了几何保真度与空间细化(GFSR),一个由LaneIoU引导的置信度校准(LCC)和自适应门控位置细化(AGLR)组成的框架。具体地,LCC采用LaneIoU作为软监督来显式估计车道先验的几何保真度,并将其与分类置信度融合以构建协同可靠性指数(CRI)。该指数引导车道先验过滤,有效保留那些具有高分类置信度和良好几何质量的先验。同时,在每个细化阶段与回归头协作,AGLR预测采样点横向偏移并采用门控机制自适应调节校正幅度,增强点间相关性,提升模型对复杂车道场景的适应性和鲁棒性。在CULane和CurveLanes上的大量实验表明,我们的GFSR在CULane上达到了最优性能,F1_50和F1_75分数分别为81.46%和65.01%,在CurveLanes上达到了87.35%的F1_50。

English abstract

Lane detection stands as a crucial perception task in autonomous driving and advanced driver assistance systems. However, existing methods still degrade in complex real scenarios due to two major limitations. First, classification confidence only characterizes the categorical existence of lane priors and has no strong correlation with geometric quality. If threshold filtering and NMS are conducted merely based on this confidence, the model tends to retain lane priors with high confidence while eliminating those with lower confidence but superior geometric representation. Secondly, the regression modules in existing methods weaken correlations among sampling points, hindering fine-grained optimization of distant, high-curvature and complex-topology lanes and causing underfitting. To address these issues, we propose Geometric Fidelity and Spatial Refinement (GFSR), a framework consisting of LaneIoU-guided Confidence Calibration (LCC) and Adaptive Gated Location Refinement (AGLR). Specifically, LCC adopts LaneIoU as soft supervision to explicitly estimate the geometric fidelity of lane priors, which is further fused with classification confidence to construct the Collaborative Reliability Index (CRI). This index guides lane prior filtering, effectively retaining those with high classification confidence and favorable geometric quality. Meanwhile, cooperating with regression heads in each refinement stage, AGLR predicts sampling point lateral offsets and adopts a gating mechanism to adaptively regulate correction magnitude, strengthen inter-point correlations and boost model adaptability as well as robustness toward complex lane scenarios. Extensive experiments on CULane and CurveLanes demonstrate that our GFSR achieves state-of-the-art performance on CULane, with F1_50 and F1_75 scores of 81.46% and 65.01%, and reaches 87.35% F1_50 on CurveLanes.

2605.22904 2026-05-27 cs.CV cs.AI

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

基于AI视频监控的自杀风险评估:地铁站预防的可解释框架

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara

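AI summary Proposes the first interpretable framework that assesses suicide risk from surveillance video by combining person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling, reaching 83.2% ROC-AUC on real surveillance data.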

Comments 9 pages, 6 figures, 1 table. Accepted for Publication in the International Joint Conference of Artificial Intelligence (IJCAI)

AI Chinese abstract

理解并监控地铁站中的人类行为对于支持自杀预防工作至关重要,早期识别高风险情况能够实现及时干预。这需要通过对每个乘客的行为、其空间上下文和时间动态进行联合推理,从监控视频中评估自杀风险。然而,使用监控摄像头捕获的视频进行评估具有挑战性,因为它需要准确感知人体运动、理解站台几何结构,并随时间聚合异质行为线索。在这项工作中,我们正式定义了地铁站自杀风险评估(SRA)任务,并引入了首个解决这一挑战的可解释框架。与专注于孤立子任务或试图直接推断意图的方法不同,我们的公式通过整合行人跟踪、活动识别、站台语义分割和轨迹驱动的风险热图建模,从累积证据中评估自杀风险。通过将SRA形式化为一个独特任务,并在真实监控数据上基准测试一个完整的操作流程,实现了83.2%的ROC-AUC,这项工作突出了自杀风险评估的复杂性,并为面向社会公益的可解释AI系统研究开辟了新方向。

English abstract

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from a surveillance video by jointly reasoning about the behavior of each passenger, his/her spatial context, and temporal dynamics. However, this assessment using videos captured by surveillance cameras is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline achieving 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

2605.22834 2026-05-27 cs.CL cs.IR

Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

查询自适应语义分块用于检索增强生成:一种具有上下文窗口扩展的动态策略

Mudit Rastogi

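AI summary Proposes Query-Adaptive Semantic Chunking (QASC), which folds the query into the chunking process and dynamically builds relevant, coherent document chunks using cosine similarity between sentence and query embeddings, contextual window expansion, and chunk-level score aggregation, improving F1 by 18-27% over fixed chunking and by 8-12% over semantic and agentic chunking.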

AI Chinese abstract

检索增强生成(RAG)系统关键依赖于文档分块质量以检索相关上下文。固定分块将文档分割成统一单元,不考虑语义或用户意图,导致精度-召回率权衡无法通过调整块大小解决。语义和智能体方法部分解决了这些限制,但未在分块阶段集成用户查询。我们提出查询自适应语义分块(QASC),通过三种机制将查询融入分割以动态构建块:句子与查询嵌入之间的余弦相似度评分以识别种子句子,围绕种子的上下文窗口扩展以保持连贯性,以及块级分数聚合以确保整体相关性。我们在100篇技术文档上评估QASC,涵盖四种类型的200个查询,并与五种粒度的固定分块、递归分割、语义分块和智能体分块进行比较。QASC实现了0.85的F1分数,相对于固定分块相对提升18-27%,相对于语义和智能体替代方法提升8-12%。消融研究证实每个组件都有意义贡献。三名标注者的人工评估(Cohen kappa = 0.82)证实QASC比现有方法产生更相关和连贯的块。

English abstract

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a precision-recall trade-off unresolvable by tuning chunk size alone. Semantic and agentic methods partially address these limitations but do not integrate user queries at the chunking stage. We present Query-Adaptive Semantic Chunking (QASC), which dynamically constructs chunks by integrating queries into segmentation through three mechanisms: cosine similarity scoring between sentence and query embeddings to identify seed sentences, contextual window expansion around seeds to preserve coherence, and chunk-level score aggregation to ensure holistic relevance. We evaluate QASC on 100 technical documents across 200 queries spanning four types, comparing against fixed chunking at five granularities, recursive splitting, semantic chunking, and agentic chunking. QASC achieves an F1-score of 0.85, a relative improvement of 18-27% over fixed chunking and 8-12% over semantic and agentic alternatives. Ablation studies confirm each component contributes meaningfully. Human evaluation by three annotators (Cohen kappa = 0.82) corroborates that QASC produces more relevant and coherent chunks than existing methods.
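The three mechanisms above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: sentence and query embeddings are taken as given arrays, and the seed threshold of 0.5, the window of one sentence, and mean aggregation of chunk scores are hypothetical choices.

```python
import numpy as np


def cosine_scores(sentence_embs: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between each sentence embedding and the query."""
    s = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return s @ q


def query_adaptive_chunks(scores: np.ndarray, seed_threshold: float = 0.5,
                          window: int = 1) -> list:
    """Return (start, end, chunk_score) spans of sentence indices, best first."""
    n = len(scores)
    spans = []
    # 1. Seed sentences are those whose similarity reaches the threshold.
    for i in map(int, np.flatnonzero(scores >= seed_threshold)):
        # 2. Expand a context window around each seed.
        lo, hi = max(0, i - window), min(n - 1, i + window)
        if spans and lo <= spans[-1][1] + 1:
            spans[-1][1] = max(spans[-1][1], hi)  # merge touching windows
        else:
            spans.append([lo, hi])
    # 3. Aggregate sentence scores into one score per chunk (mean, by assumption).
    chunks = [(lo, hi, float(np.mean(scores[lo:hi + 1]))) for lo, hi in spans]
    return sorted(chunks, key=lambda chunk: chunk[2], reverse=True)


if __name__ == "__main__":
    toy_scores = np.array([0.1, 0.9, 0.2, 0.1, 0.1, 0.8, 0.7])
    for start, end, score in query_adaptive_chunks(toy_scores):
        print(f"sentences {start}-{end}: score {score:.3f}")
```

Because the chunk boundaries depend on the query scores, the same document yields different chunks for different queries, which is the property that fixed chunking lacks.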

2605.22774 2026-05-27 cs.LG cs.AI cs.HC

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

CogAdapt: 通过导联适应将临床心电图基础模型迁移至可穿戴认知负荷评估

Amir Mousavi, Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

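AI summary Proposes the CogAdapt framework, which converts 3-lead wearable signals into 12-lead representations with the learnable adapter LeadBridge and combines this with the progressive fine-tuning strategy ProFine to transfer clinical ECG foundation models to wearable cognitive load assessment, substantially outperforming baselines trained from scratch under cross-subject validation.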

Comments 7 pages, 7 figures. Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

AI Chinese abstract

实时认知负荷评估对于自适应人机交互至关重要,但由于标记数据有限和跨受试者泛化能力差,仍然具有挑战性。最近在数百万临床记录上预训练的心电图基础模型提供了丰富的表示,但由于传感器配置不匹配和任务差异,无法直接应用于可穿戴设备。在本文中,我们提出了CogAdapt,一个将临床心电图基础模型适应于可穿戴认知负荷评估的框架。CogAdapt引入了LeadBridge,一个可学习的适配器,将3导联可穿戴信号转换为解剖学一致的12导联表示,以及ProFine,一种渐进微调策略,逐步解冻编码器层同时防止灾难性遗忘。在两个公共数据集(CLARE和CL-Drive)上的留一受试者交叉验证评估表明,CogAdapt显著优于从头训练的基线,宏F1分数分别达到0.626和0.768。这些结果证明了基础模型适应用于从可穿戴传感器进行与受试者无关的认知负荷评估的前景。

English abstract

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.
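The leave-one-subject-out protocol mentioned above is what makes the reported scores subject-independent. A minimal sketch of the split follows, assuming each sample carries a subject identifier; the function name and the toy identifiers are hypothetical and unrelated to the CLARE or CL-Drive data formats.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    Every fold tests on all samples of one subject and trains on the rest,
    so no subject appears on both sides of a split.
    """
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for held_out in sorted(by_subject):
        test_indices = by_subject[held_out]
        train_indices = [i for subject, indices in by_subject.items()
                         if subject != held_out for i in indices]
        yield held_out, train_indices, test_indices


if __name__ == "__main__":
    sample_subjects = ["s1", "s2", "s1", "s3", "s2"]
    for held_out, train, test in leave_one_subject_out(sample_subjects):
        print(held_out, "train:", train, "test:", test)
```

A per-fold metric such as macro-F1 would then be averaged over the folds, one per subject.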

2605.21883 2026-05-27 cs.CL

Token-weighted Direct Preference Optimization with Attention

基于注意力的令牌加权直接偏好优化

Chengyu Huang, Zhuohang Li, Sheng-Yen Chou, Claire Cardie

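AI summary Proposes Token-weighted DPO (TwDPO), which uses the LLM's own attention to estimate token weights and improves the alignment of large language models with human preferences without the cost of training a separate model.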

AI Chinese abstract

直接偏好优化(DPO)无需单独奖励模型即可使大语言模型与人类偏好对齐。然而,DPO平等对待响应中的所有令牌,忽略了单个令牌的不同重要性。现有的令牌级PO方法要么使用基于令牌位置的启发式函数,要么使用单独训练模型给出的概率估计来计算令牌权重,这缺乏鲁棒性且增加了额外训练成本。相比之下,我们提出令牌加权DPO(TwDPO)——一种基于令牌加权RL的新型训练目标,以及AttentionPO——TwDPO的一个实例,它利用LLM自身的注意力来估计令牌权重。AttentionPO提示LLM作为成对评判者,并在比较响应时检查模型关注的位置。这种设计使AttentionPO具有内容感知能力,根据响应内容调整权重,并且高效,每个样本仅需额外两次前向传播。实验结果表明,AttentionPO在AlpacaEval、MT-Bench和ArenaHard上显著提升了性能,超越了现有的偏好优化方法。

English abstract

Direct Preference Optimization (DPO) aligns Large Language Models with human preferences without the need for a separate reward model. However, DPO treats all tokens in responses equally, neglecting the differing importance of individual tokens. Existing token-level PO methods compute the token weights using either token-position-based heuristic functions or probability estimates given by a separately trained model, which lacks robustness and incurs extra training cost. In contrast, we propose Token-weighted DPO (TwDPO) -- a novel training objective grounded on token-weighted RL -- and AttentionPO -- an instantiation of TwDPO that uses attention from the LLM itself to estimate token weights. AttentionPO prompts the LLM to serve as a pairwise judge and check where the model attends when comparing the responses. This design makes AttentionPO content-aware, adjusting weights based on response content, and efficient, incurring only two extra forward passes per example. Experiment results show that AttentionPO significantly improves performance on AlpacaEval, MT-Bench, and ArenaHard, surpassing existing Preference Optimization methods.
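One plausible way to weight tokens inside the DPO loss is sketched below. It is an assumption, not the paper's TwDPO objective: the per-token log-probability ratios of the chosen and rejected responses are multiplied by given weights before being summed, and the weights are taken as inputs rather than derived from attention. With all weights equal to 1 the function reduces to the standard DPO loss.

```python
import numpy as np


def token_weighted_dpo_loss(policy_logp_chosen, ref_logp_chosen, weights_chosen,
                            policy_logp_rejected, ref_logp_rejected,
                            weights_rejected, beta: float = 0.1) -> float:
    """DPO loss with per-token weights on the log-probability ratios.

    Each argument except `beta` is a 1-D array with one entry per response token.
    """
    margin_chosen = np.sum(np.asarray(weights_chosen)
                           * (np.asarray(policy_logp_chosen)
                              - np.asarray(ref_logp_chosen)))
    margin_rejected = np.sum(np.asarray(weights_rejected)
                             * (np.asarray(policy_logp_rejected)
                                - np.asarray(ref_logp_rejected)))
    z = beta * (margin_chosen - margin_rejected)
    # -log(sigmoid(z)) written as log(1 + exp(-z)) for numerical stability.
    return float(np.logaddexp(0.0, -z))


if __name__ == "__main__":
    loss = token_weighted_dpo_loss(
        policy_logp_chosen=[-1.0, -0.5], ref_logp_chosen=[-1.2, -0.9],
        weights_chosen=[0.5, 1.5],
        policy_logp_rejected=[-2.0, -1.0], ref_logp_rejected=[-1.5, -1.0],
        weights_rejected=[1.0, 1.0],
    )
    print(loss)
```

Tokens with larger weights move the preference margin more, which is how a content-aware weighting could emphasize the parts of a response that matter for the comparison.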

2605.20988 2026-05-27 cs.LG cs.AI

A Sharper Picture of Generalization in Transformers

Transformer 泛化能力的更清晰图景

Paul Lintilhac, Sair Shaikh

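AI summary Studies the generalization behavior of Transformers on boolean domains through PAC-Bayes theory, showing that sparse, low-degree spectra admit low-sharpness constructions and yield a non-vacuous generalization bound, and explaining why chain-of-thought improves generalization for high-degree target functions.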

Comments 10 pages, 9 figures, 41 pages of supplementary material

AI Chinese abstract

我们从目标函数的傅里叶谱角度研究Transformer在布尔域上的泛化行为。与先前基于Rademacher复杂度推导泛化界的工作(Edelman等人,2022;Trauger & Tosh,2024)不同,我们探讨了通过PAC-Bayes理论获得泛化界的可行性。我们证明,集中在低阶分量上的稀疏谱能够实现具有良好泛化性质的低锐度构造。我们的思路是证明存在实现任何稀疏度不超过上下文长度的布尔函数的平坦极小值,然后将PAC-Bayes界应用于一个理想化的低锐度学习器,从而得到一个非平凡的泛化界。我们利用这一点正式解释了为什么思维链能改善高阶目标函数的泛化,并展示了我们界中的复杂度参数可以通过性质测试高效估计。我们通过实验评估了预测,并进行了机制可解释性研究,以支持我们的理论构造在真实Transformer中的现实性。

English abstract

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger & Tosh, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We use this to give a formal account of why chain-of-thought improves generalization for high-degree target functions, and show that the complexity parameters in our bound can be efficiently estimated via property testing. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.
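The Fourier spectrum of a boolean function, which the abstract uses to measure sparsity and degree, can be computed by brute force for small inputs. This sketch uses the standard definition $\hat f(S) = \mathbb{E}_x\big[f(x)\prod_{i \in S} x_i\big]$ over $x \in \{-1, 1\}^n$; the function names are hypothetical and the enumeration is only practical for small $n$.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(f, n: int) -> dict:
    """Map each subset S of {0..n-1} (as a tuple) to the coefficient of f on S."""
    points = list(product((-1, 1), repeat=n))
    coefficients = {}
    for degree in range(n + 1):
        for subset in combinations(range(n), degree):
            total = sum(f(x) * prod(x[i] for i in subset) for x in points)
            coefficients[subset] = total / len(points)
    return coefficients


def sparsity_and_degree(coefficients: dict, tol: float = 1e-12):
    """Number of non-zero coefficients and the largest subset size among them."""
    support = [s for s, c in coefficients.items() if abs(c) > tol]
    return len(support), max((len(s) for s in support), default=0)


if __name__ == "__main__":
    parity = lambda x: x[0] * x[1] * x[2]
    majority = lambda x: 1 if sum(x) > 0 else -1
    print(sparsity_and_degree(fourier_coefficients(parity, 3)))
    print(sparsity_and_degree(fourier_coefficients(majority, 3)))
```

Parity on three bits has a single coefficient at the highest degree, whereas majority spreads its weight over low-degree terms plus one degree-3 term; this is the kind of distinction between high-degree and low-degree targets that the abstract refers to.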

2605.20914 2026-05-27 cs.CV

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

RISE: 自进化视觉语言模型的可靠改进

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu

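AI summary To address coarse-grained role alternation, declining question quality, and question-type collapse in self-evolving vision-language models, proposes the RISE framework, which achieves reliable self-evolution through fine-grained role alternation, a quality supervisor, and skill-aware dynamic balancing.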

AI Chinese abstract

视觉语言模型(VLM)已具备强大的多模态推理能力,但进一步提升仍严重依赖大规模人工构建的监督信号进行后训练。这种监督信号获取成本高昂,尤其对于推理密集型多模态任务,其中问题、答案和反馈信号必须精心设计。这激发了自进化学习,即模型通过双角色闭环自我改进:提问者自主提出问题,求解者学习解答。然而,我们观察到当前的VLM自进化方法仍面临三大挑战:粗粒度的角色交替延迟了问题生成与求解者适应之间的交互;生成的问题质量可能逐渐下降;问题类型可能坍缩至狭窄分布。这些问题限制了自进化的效率和可靠性。因此,我们提出\textbf{RISE},一个可靠的视觉语言模型自进化框架。RISE基于三个互补设计:细粒度角色交替,缩短提问者与求解者之间的反馈循环以提高效率;质量监督器,提高问题有效性和伪标签可靠性;以及技能感知动态平衡,在进化过程中缓解模式坍缩并保持广泛的技能覆盖。这些组件共同使得从无标签图像中实现更可靠和有效的自进化成为可能。在两个VLM骨干网络上的七个基准测试实验表明,RISE持续改进基础模型,带来广泛而持久的性能提升。我们的代码已公开在https://github.com/AMAP-ML/RISE。

English abstract

Vision-language models (VLMs) have achieved strong multimodal reasoning capabilities, but further improving them still relies heavily on large-scale human-constructed supervision for post-training. Such supervision is costly to obtain, especially for reasoning-intensive multimodal tasks where questions, answers, and feedback signals must be carefully designed. This motivates self-evolving learning, where a model improves itself through a dual-role closed loop: a questioner autonomously poses questions and a solver learns to solve them. However, we observe that current VLM self-evolving methods still face three major challenges: coarse-grained role alternation delays the interaction between question generation and solver adaptation; generated questions can progressively degrade in quality; and question types may collapse toward a narrow distribution. These issues limit the efficiency and reliability of self-evolution. Thus, we propose \textbf{RISE}, a reliable self-evolving framework for vision-language models. RISE is built on three complementary designs: fine-grained role alternation, which shortens the feedback loop between the questioner and the solver to improve efficiency; a quality supervisor, which improves question validity and pseudo-label reliability; and skill-aware dynamic balancing, which mitigates mode collapse and maintains broad skill coverage during evolution. Together, these components enable more reliable and effective self-evolution from unlabeled images. Experiments on two VLM backbones across seven benchmarks show that RISE consistently improves the base models, yielding broad and sustained gains. Our code is publicly available at https://github.com/AMAP-ML/RISE.

2605.20690 2026-05-27 cs.AI

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

声明式数据服务:用于组合数据系统的结构化智能体发现

Shanshan Ye, Duo Lu

AI总结 提出声明式数据服务(DDS)架构,通过分层类型契约将全局搜索分解为有界子搜索,解决无界智能体发现无法稳定收敛的问题,并在交易后端工作负载上验证其有效性。

Comments Accepted at AI Agents for Discovery in the Wild (AID-Wild), Workshop at ACM CAIS 2026

详情
AI中文摘要

智能体发现已表明,在基准条件下,LLM驱动的搜索能够发现新颖的算法、设计和代码。将该范式迁移到多系统数据后端会暴露出一个更困难的问题:搜索空间是异构的,验证标准是所部署的栈能否实际运行,而预训练对组合知识的覆盖并不均匀。无界智能体发现(即一个根据失败日志反馈不断迭代的编码智能体)即使加入了迭代和显式组合知识,也无法稳定地收敛到可运行的栈。我们提出声明式数据服务(DDS),一种从声明式用户意图出发、对数据系统组合进行结构化智能体发现的架构。该框架在四个连续层(意图、算子DAG、各系统技能、运行时归因)上各持有一个类型化契约,将全局搜索分解为有界子搜索;子智能体分别搜索每个类型化空间,框架则提供通道,使知识以内联技能引用的形式向前流动,错误以类型化信号的形式向后传递。作为在交易后端工作负载上的初步可行性验证(proof of life),DDS在无界发现无法收敛之处实现了收敛;运行时失败被转化为技能补丁,并在下一次部署中被内联引用。我们将其定位为一个早期原型,报告从真实世界数据系统组合中获得的经验教训。

英文摘要

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.