arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.16220 2026-06-02 cs.LG

SEMixer: Semantics Enhanced MLP-Mixer for Multiscale Mixing and Long-term Time Series Forecasting

SEMixer: 语义增强的MLP-Mixer用于多尺度混合和长期时间序列预测

Xu Zhang, Qitong Wang, Peng Wang, Wei Wang

Affiliations: Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University; Harvard University

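AI summary: Proposes SEMixer, which uses a random attention mechanism and a multiscale progressive mixing chain to model multiscale temporal dependencies and address the semantic gap between scales; it performs strongly on 10 public datasets and on real wireless network data.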

Comments This work was accepted to the Proceedings of the ACM Web Conference 2026 (WWW 2026). Code is available at https://github.com/Meteor-Stars/SEMixer

AI中文摘要

建模多尺度模式对于长期时间序列预测(TSF)至关重要。然而,时间序列中的冗余和噪声,以及非相邻尺度之间的语义鸿沟,使得高效对齐和集成多尺度时间依赖具有挑战性。为此,我们提出了SEMixer,一种专为长期TSF设计的轻量级多尺度模型。SEMixer包含两个关键组件:随机注意力机制(RAM)和多尺度渐进混合链(MPMC)。RAM在训练期间捕获多样化的时间块交互,并通过推理时的dropout集成进行聚合,增强了块级语义,使MLP-Mixer能够更好地建模多尺度依赖。MPMC进一步以内存高效的方式堆叠RAM和MLP-Mixer,实现更有效的时间混合。它解决了跨尺度的语义鸿沟,促进了更好的多尺度建模和预测性能。我们不仅在10个公开数据集上验证了SEMixer的有效性,还在基于21GB真实无线网络数据的2025 CCF AIOps Challenge中取得了第三名。代码可在链接https://github.com/Meteor-Stars/SEMixer获取。

英文摘要

Modeling multiscale patterns is crucial for long-term time series forecasting (TSF). However, redundancy and noise in time series, together with semantic gaps between non-adjacent scales, make the efficient alignment and integration of multiscale temporal dependencies challenging. To address this, we propose SEMixer, a lightweight multiscale model designed for long-term TSF. SEMixer features two key components: a Random Attention Mechanism (RAM) and a Multiscale Progressive Mixing Chain (MPMC). RAM captures diverse time-patch interactions during training and aggregates them via dropout ensemble at inference, enhancing patch-level semantics and enabling MLP-Mixer to better model multiscale dependencies. MPMC further stacks RAM and MLP-Mixer in a memory-efficient manner, achieving more effective temporal mixing. It addresses semantic gaps across scales and facilitates better multiscale modeling and forecasting performance. We validate the effectiveness of SEMixer not only on 10 public datasets but also in the 2025 CCF AIOps Challenge, which is based on 21GB of real wireless network data and in which SEMixer achieves third place. Code is available at https://github.com/Meteor-Stars/SEMixer.

2602.05139 2026-06-02 cs.LG

Adaptive Exploration for Latent-State Bandits

潜在状态赌博机的自适应探索

Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang

Affiliations: The Institute for Computational and Mathematical Engineering; Stanford University; Meta Platforms, Inc.; Ads Online Experimentation; Central Applied Science

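AI summary: For bandits whose rewards depend on an unobserved Markov state, proposes LinUCB-based adaptive algorithms that distinguish states using two summaries, a lagged action-reward pair and a probe fingerprint, and that refresh the fingerprint with residual, margin, and staleness tests; in synthetic stress tests they reduce dynamic regret relative to standard, adversarial, and non-stationary baselines.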

Comments 12 pages, 3 figures, 5 tables

AI中文摘要

我们研究奖励依赖于未观测马尔可夫状态的赌博机,该状态独立于学习者的动作演化。即使学习者只观测过去的动作和奖励,最优臂也可能发生变化。我们提出的算法将隐藏状态的两种摘要馈送给LinUCB:滞后动作-奖励对,以及(当可用时)由多个臂的奖励形成的探针指纹。自适应变体使用残差、边际和过时测试刷新指纹。在关于状态数量、转移速率、噪声和时域的综合压力测试中,当这些摘要能够区分状态并足够频繁地更新时,这些方法相对于标准、对抗和非平稳赌博机基线减少了动态遗憾。消融和误设测试识别了主要失败模式:弱指纹分离、高噪声以及顺序探针期间的状态变化。

英文摘要

We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner's actions. The optimal arm can change even though the learner observes only past actions and rewards. We propose algorithms that feed LinUCB with two summaries of the hidden state: a lagged action-reward pair and, when available, a probe fingerprint formed from rewards of multiple arms. The adaptive variants refresh the fingerprint using residual, margin, and staleness tests. In synthetic stress tests over state count, transition rate, noise, and horizon, these methods reduce dynamic regret relative to standard, adversarial, and non-stationary bandit baselines when the summaries distinguish states and are updated often enough. Ablations and misspecification tests identify the main failure modes: weak fingerprint separation, high noise, and state changes during sequential probes.
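The idea of feeding LinUCB a lagged action-reward summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the disjoint (per-arm) LinUCB variant, the `lagged_context` encoding, and the parameter `alpha` are all assumptions chosen for clarity, and the probe fingerprint and its refresh tests are omitted.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration weight (assumed value)
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward-weighted sums

    def select(self, x):
        """Return the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold one observed (context, reward) pair into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


def lagged_context(prev_arm, prev_reward, n_arms):
    """Hypothetical encoding of the lagged action-reward pair:
    one-hot(previous arm), one-hot scaled by the previous reward, and a bias term."""
    one_hot = np.zeros(n_arms)
    one_hot[prev_arm] = 1.0
    return np.concatenate([one_hot, prev_reward * one_hot, [1.0]])
```

In use, each round would build `x = lagged_context(prev_arm, prev_reward, n_arms)`, call `select(x)`, observe the reward, and call `update`. Because the context carries the previous arm and its reward, the linear model can pick up reward patterns that correlate with the hidden state.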

2602.15278 2026-06-02 cs.CV cs.AI

Visual Persuasion: What Influences Decisions of Vision-Language Models?

视觉说服:什么影响了视觉语言模型的决策?

Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

Affiliations: Massachusetts Institute of Technology; MIT Media Lab

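AI summary: Proposes a framework that places vision-language models in controlled image-choice tasks, systematically perturbs their inputs, and uses visual prompt optimization to infer their latent visual utility, revealing the visual preferences that influence model decisions.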

Comments Accepted to ICML 2026

AI中文摘要

网络上充斥着图像,这些图像最初是为人类消费而创建的,现在越来越多地被使用视觉语言模型(VLM)的智能体解释。这些智能体大规模地做出视觉决策,决定点击、推荐或购买什么。然而,我们对它们视觉偏好的结构知之甚少。我们引入了一个框架来研究这一点,通过将VLM置于受控的基于图像的选择任务中,并系统地扰动它们的输入。我们的关键思想是将智能体的决策函数视为一种潜在的视觉效用,可以通过揭示偏好来推断:在系统编辑的图像之间进行选择。从常见图像(如产品照片)开始,我们提出了视觉提示优化的方法,将文本优化方法适应为使用图像生成模型(例如在构图、光照或背景方面)迭代地提出并应用视觉上合理的修改。然后,我们评估哪些编辑增加了选择概率。通过对前沿VLM的大规模实验,我们证明了优化后的编辑在直接比较中显著改变了选择概率。我们开发了一个自动可解释性管道来解释这些偏好,识别出驱动选择的一致视觉主题。我们认为,这种方法提供了一种实用且高效的方式来揭示视觉漏洞和安全问题,否则这些问题可能会在现实世界中隐含地发现,从而支持对基于图像的AI智能体进行更主动的审计和治理。

英文摘要

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.

2602.14849 2026-06-02 cs.LG cs.AI cs.DC cs.MA

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

Atomix: 用于可靠智能体工作流的及时事务性工具使用

Bardia Mohammadi, Nearchos Potamitis, Lars Klein, Akhil Arora, Laurent Bindschaedler

Affiliations: Max Planck Institute for Software Systems; Aarhus University; EPFL

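AI summary: To address inconsistent state caused by faults, speculation, and concurrency in multi-step LLM agent workflows, proposes Atomix, which uses progress-aware transactions to separate effect grouping from conflict resolution, enabling reliable commit and rollback.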

AI中文摘要

LLM智能体执行多步工作流,通过工具改变外部状态。常见的编排器将工具返回视为结算触发器,因此故障、推测和并发智能体可能留下部分效果、丢失分支残留、陈旧写入或不可逆发送。正确的结算需要两个事实,而重试、检查点重放、锁和补偿各自混淆了这些事实:哪些效果必须一起结算,以及何时较早的冲突工作已耗尽。Atomix通过进度感知事务使这种分离明确化。运行时在执行期间记录读取和效果,当足迹完成时密封事务,并且仅在每个资源的前沿显示没有更早的冲突工作可能到达后才提交。提交是最终结算:Atomix释放可缓冲效果,接受可逆外部效果为最终状态,并让不可逆效果离开。中止抑制未释放的效果,并在可能的情况下补偿外部化的可逆效果。在代表性智能体工作负载上,这种组合在注入故障下改善了干净恢复,隔离了竞争和推测工作,并防止了正确分类的不可逆动作泄漏;微基准测试显示相对于工具延迟的微秒级包装开销。

英文摘要

LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement trigger, so faults, speculation, and concurrent agents can leave partial effects, losing-branch residue, stale writes, or irreversible sends. Correct settlement needs two facts that retries, checkpoint replay, locks, and compensation each conflate: which effects must settle together, and when earlier conflicting work is exhausted. Atomix makes this split explicit with progress-aware transactions. The runtime records reads and effects during execution, seals a transaction when its footprint is complete, and commits only after per-resource frontiers show that no earlier conflicting work can still arrive. Commit is final settlement: Atomix releases bufferable effects, accepts reversible external effects as final, and lets irreversible effects leave the gate. Abort suppresses unreleased effects and compensates externalized reversible effects where possible. On representative agent workloads, this composition improves clean recovery under injected faults, isolates contending and speculative work, and prevents correctly classified irreversible actions from leaking; microbenchmarks show microsecond-scale wrapper overhead relative to tool latency.

2602.14065 2026-06-02 cs.AI

REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

REAL: 通过推理枢轴对齐解决知识密集型视觉问答中的知识冲突

Kai Ye, Xianwei Mao, Sheng Zhou, Zirui Shao, Ye Mo, Liangliang Liu, Haikuan Huang, Bin Li, Jiajun Bu

Affiliations: University of Science and Technology of China; Tsinghua University

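AI summary: Proposes the REAL framework, which uses reasoning-pivot alignment and pivot-guided decoding to resolve knowledge conflicts caused by open-domain retrieval in knowledge-intensive visual question answering.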

Comments Accepted by ICML 2026

AI中文摘要

知识密集型视觉问答(KI-VQA)经常因开放域检索的固有限制而遭受严重的知识冲突。然而,现有范式由于缺乏可泛化的冲突检测和模型内约束机制来处理冲突证据,面临关键限制。为应对这些挑战,我们提出了REAL(推理枢轴对齐)框架,其核心是新颖的推理枢轴概念。与优先考虑内部自我推导的推理步骤不同,推理枢轴作为推理链中的原子单元(节点或边),强调知识链接,通常依赖外部证据完成推理。在我们构建的REAL-VQA数据集支持下,我们的方法集成了推理枢轴感知SFT(RPA-SFT),通过将冲突与枢轴提取对齐来训练可泛化的判别器,并采用推理枢轴引导解码(RPGD),一种利用这些枢轴进行针对性冲突缓解的模型内解码策略。在多个数据集上的大量实验表明,REAL显著提高了判别准确性并实现了优越性能,验证了我们的枢轴驱动解决范式。

英文摘要

Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitations of open-domain retrieval. However, existing paradigms face critical limitations due to the lack of generalizable conflict detection and intra-model constraint mechanisms to handle conflicting evidence. To address these challenges, we propose the REAL (Reasoning-Pivot Alignment) framework centered on the novel concept of the Reasoning-Pivot. Distinct from reasoning steps that prioritize internal self-derivation, a reasoning-pivot serves as an atomic unit (node or edge) in the reasoning chain that emphasizes knowledge linkage, and it typically relies on external evidence to complete the reasoning. Supported by our constructed REAL-VQA dataset, our approach integrates Reasoning-Pivot Aware SFT (RPA-SFT) to train a generalizable discriminator by aligning conflicts with pivot extraction, and employs Reasoning-Pivot Guided Decoding (RPGD), an intra-model decoding strategy that leverages these pivots for targeted conflict mitigation. Extensive experiments on diverse datasets demonstrate that REAL significantly enhances discrimination accuracy and achieves superior performance, validating our pivot-driven resolution paradigm.

2602.13940 2026-06-02 cs.LG cs.AI

You Can Learn Tokenization End-to-End with Reinforcement Learning

你可以通过强化学习端到端地学习分词

Sam Dauncey, Roger Wattenhofer

Affiliations: University of Waterloo; ETH Zurich

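AI summary: Proposes learning discrete token boundaries with score function estimates from reinforcement learning, using techniques such as time discounting to reduce variance; at the 100 million parameter scale it outperforms prior straight-through estimates.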

Comments ICML 2026 camera-ready

AI中文摘要

分词是一个硬编码的压缩步骤,尽管架构总体上趋向于端到端,但它仍然保留在大语言模型(LLM)的训练流程中。先前的工作在大规模上展示了有希望的结果,通过启发式方法将这一压缩步骤引入LLM架构内部以绘制分词边界,并尝试使用直通估计来学习这些分词边界,直通估计将绘制离散分词边界的问题视为连续问题。我们表明,这些分词边界可以通过得分函数估计来学习,由于直接优化绘制离散分词边界以最小化损失的问题,得分函数估计具有更严格的理论保证。我们观察到,强化学习中的技术,如时间折扣,对于充分降低该得分函数的方差以使其可行是必要的。我们证明,所得到的方法在1亿参数规模上,在定性和定量上都优于先前提出的直通估计方法。

英文摘要

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and has also attempted to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively, at the 100 million parameter scale.
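A score function (REINFORCE-style) estimate with time discounting can be sketched as below. This is a generic illustration under stated assumptions, not the paper's estimator: boundaries are assumed to be independent Bernoulli draws with probability `sigmoid(logit)`, each position is assumed to have a scalar loss, and the helper names and `gamma` are hypothetical.

```python
import numpy as np

def discounted_to_go(losses, gamma):
    """Discounted sum of current and future losses: G_t = sum_k gamma**(k - t) * loss_k."""
    losses = np.asarray(losses, dtype=float)
    out = np.zeros(len(losses))
    running = 0.0
    for t in range(len(losses) - 1, -1, -1):
        running = losses[t] + gamma * running
        out[t] = running
    return out

def score_function_grad(logits, boundaries, losses, gamma=0.9):
    """Single-sample gradient estimate of the expected loss w.r.t. the boundary logits.

    For b ~ Bernoulli(sigmoid(logit)), d/dlogit log p(b) = b - sigmoid(logit).
    Each decision is credited only with the discounted loss that follows it.
    """
    logits = np.asarray(logits, dtype=float)
    boundaries = np.asarray(boundaries, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (boundaries - probs) * discounted_to_go(losses, gamma)
```

Discounting shrinks the contribution of losses far from a boundary decision. That lowers the variance of the estimate, at the cost of some bias when distant losses really do depend on that decision.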

2602.13937 2026-06-02 cs.LG cs.SE

iML: Executable, Problem-Grounded, and Broadly Exploratory Code-Driven AutoML

iML: 可执行、问题驱动且广泛探索的代码驱动自动机器学习

Dat Le, Duc-Cuong Le, Anh-Son Nguyen, Tuan-Dung Bui, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

Affiliations: Faculty of Information Technology, VNU University of Engineering and Technology

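AI summary: Proposes iML, a multi-agent framework that combines task analysis, data profiling, structured blueprint generation, and modular code synthesis to deliver executable, problem-grounded, and broadly exploratory code-driven AutoML; it significantly outperforms baselines on MLE-BENCH and iML-BENCH.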

AI中文摘要

自动机器学习(AutoML)改善了机器学习的可访问性,但现有技术通常在灵活性、透明度和执行可靠性方面仍然有限。代码驱动的AutoML通过合成用于预处理、模型训练和评估的可执行代码,提供了一个有前景的方向。然而,当前基于LLM的方法经常生成在文本上合理但在执行中脆弱、未充分基于实际数据集或局限于狭窄解决方案路径的代码。在本文中,我们介绍了iML,一个多智能体代码驱动AutoML框架,围绕三个需求设计:可执行性、问题驱动和对有效解决方案的广泛探索。iML首先分析任务并剖析数据,然后合成一个结构化的蓝图,指导跨多个实现轨道的模块化代码生成,包括传统机器学习、预训练适应和自定义神经架构。为了提高可靠性,iML在集成过程中强制执行接口检查、动态执行和迭代调试。我们在MLE-BENCH和新引入的iML-BENCH上评估iML,涵盖多样化的Kaggle风格任务。在MLE-BENCH上,iML达到了90%的有效提交率和45%的奖牌率,以及0.82的APS,将基于LLM的基线的平均标准化性能分数(APS)提高了52%-273%。在iML-BENCH上,它实现了最高的APS,并且即使在任务描述被大幅简化时也表现出稳健的性能。这些结果确立了iML作为代码驱动AutoML的可靠且有竞争力的框架。

英文摘要

Automated Machine Learning (AutoML) has improved access to machine learning, yet existing techniques often remain limited in flexibility, transparency, and execution reliability. Code-driven AutoML offers a promising direction by synthesizing executable code for preprocessing, model training, and evaluation. However, current LLM-based approaches frequently generate code that is plausible in text yet brittle in execution, insufficiently grounded in the actual dataset, or restricted to narrow solution paths. In this paper, we introduce iML, a multi-agent code-driven AutoML framework designed around three requirements: executability, problem grounding, and broad exploration of valid solutions. iML first analyzes the task and profiles the data, then synthesizes a structured blueprint that guides modular code generation across multiple implementation tracks, including traditional ML, pretrained adaptation, and custom neural architectures. To improve reliability, iML enforces interface checking, dynamic execution, and iterative debugging during integration. We evaluate iML on MLE-BENCH and the newly introduced iML-BENCH, covering diverse Kaggle-style tasks. On MLE-BENCH, iML attains a 90% valid submission rate, a 45% medal rate, and an average standardized performance score (APS) of 0.82, improving APS over the LLM-based baselines by 52%-273%. On iML-BENCH, it achieves the highest APS and demonstrates robust performance even when task descriptions are substantially stripped. These results establish iML as a reliable and competitive framework for code-driven AutoML.

2602.13602 2026-06-02 cs.CV cs.LG

Towards Sparse Video Understanding and Reasoning

迈向稀疏视频理解与推理

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

Affiliations: Northwestern University; Johns Hopkins University; Dolby Laboratories

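AI summary: Proposes a multi-round video question answering agent that uses sparse frame selection, a summary kept as state, and early stopping to improve accuracy while reducing the number of frames and tokens.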

Comments Accepted to CVPR 2026. Project page: https://sparsevideounderstanding.github.io

AI中文摘要

我们提出 ReViSe (Reasoning with Video Sparsity),一种用于视频问答 (VQA) 的多轮代理。与均匀采样帧不同,ReViSe 选择一小部分信息丰富的帧,跨轮维护摘要作为状态,并在置信时提前停止。它支持专有视觉语言模型 (VLM) 的“即插即用”设置,并允许对开源模型进行强化微调。对于微调,我们引入 EAGER (Evidence-Adjusted Gain for Efficient Reasoning),一种无注释奖励,包含三项:(1) 置信增益:添加新帧后,奖励正确选项与最强替代选项之间对数几率差距的增加;(2) 摘要充分性:在回答时仅使用最后提交的摘要重新提问,并奖励成功;(3) 正确且早期停止:在较小的轮次预算内正确回答即获得奖励。在多个 VQA 基准上,ReViSe 在减少帧数、轮数和提示令牌数的同时提高了准确率,展示了实用的稀疏视频推理。

英文摘要

We present ReViSe (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, ReViSe selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, ReViSe improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.
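The confidence-gain term can be illustrated with a small sketch. It assumes the model exposes log-probabilities over the answer options before and after new frames are added, and it reads "log-odds gap" as the difference between the correct option's log-probability and that of the strongest alternative. The function names are hypothetical, and any clipping or weighting used in the actual reward is not specified in the abstract.

```python
import math

def log_odds_gap(log_probs, correct):
    """Log-probability of the correct option minus that of the strongest alternative."""
    best_alt = max(lp for i, lp in enumerate(log_probs) if i != correct)
    return log_probs[correct] - best_alt

def confidence_gain(log_probs_before, log_probs_after, correct):
    """Change in the log-odds gap after new frames are added (positive means more confident)."""
    return log_odds_gap(log_probs_after, correct) - log_odds_gap(log_probs_before, correct)

# Example: option 0 is correct; new frames move its probability from 0.4 to 0.7.
before = [math.log(p) for p in (0.4, 0.4, 0.2)]
after = [math.log(p) for p in (0.7, 0.2, 0.1)]
gain = confidence_gain(before, after, correct=0)  # equals log(0.7 / 0.2) = log(3.5)
```

Measuring against the strongest alternative, not the average of the alternatives, rewards frames that separate the correct answer from its closest competitor.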

2602.11554 2026-06-02 cs.RO cs.CV cs.LG

HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds

HyperDet: 基于超4D雷达点云的3D目标检测

Yichun Xiao, Runwei Guan, Jin Jin, Fangqiang Ding

Affiliations: University of Edinburgh; HKUST (GZ); University of Oxford; MIT

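AI summary: Proposes HyperDet, a detector-agnostic framework that builds task-aware hyper 4D radar point clouds through spatio-temporal accumulation, cross-sensor validation, Doppler-guided motion compensation, and foreground generative enhancement, significantly improving radar-only 3D object detection.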

Comments 11 pages, 3 figures, 3 tables

AI中文摘要

仅使用4D雷达进行3D目标检测能达到什么程度?尽管现代4D雷达为自主感知提供了鲁棒天气和速度感知能力,但其点云仍然稀疏、嘈杂且不稳定,限制了仅用雷达的3D检测。我们提出HyperDet,一种与检测器无关的框架,在检测前构建任务感知的超4D雷达点云。HyperDet首先通过时空累积、跨传感器验证和多普勒引导的运动补偿来细化短窗口环视雷达观测,提高返回可靠性和时间一致性。然后,它利用仅在训练时可用的激光雷达引导的伪雷达监督进行前景生成增强,在保留测量雷达背景和雷达原生属性的同时丰富目标几何。在检测器训练期间,雷达感知的目标级增强进一步在几何重定位下保持多普勒一致性。在推理时,HyperDet仅需雷达输入,可直接与标准3D检测器配合使用。在两个公开的环视4D雷达数据集上的实验表明,与原始雷达输入相比,在标准3D检测器上均取得一致改进,验证了输入级雷达增强作为仅用雷达3D检测的有效方法。

英文摘要

How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.

2602.12984 2026-06-02 cs.CL

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

SciAgentGym:LLM代理中多步科学工具使用的基准测试

Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Affiliations: Fudan NLP Group

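AI summary: To address current benchmarks' neglect of agents' ability to orchestrate tools in scientific workflows, proposes SciAgentGym, a scalable interactive environment with 1,780 domain-specific tools; SciAgentBench, a tiered evaluation suite; and SciForge, a data synthesis method. After fine-tuning, SciAgent-8B outperforms larger models and shows positive cross-domain transfer of scientific tool-use capabilities.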

AI中文摘要

科学推理本质上需要整合复杂的工具包来导航领域特定知识。然而,当前的基准测试在很大程度上忽视了代理编排工具以进行此类严谨工作流的能力。为弥补这一差距,我们引入了SciAgentGym,一个可扩展的交互环境,包含四个自然科学学科中的1,780个领域特定工具,并由强大的执行基础设施支持。此外,我们提出了SciAgentBench,一个分层评估套件,旨在从基本动作到长期工作流对代理能力进行压力测试。我们的评估识别出一个关键瓶颈:最先进的模型仍然难以处理复杂的科学工具使用,并且随着交互范围的扩展,其性能显著下降。为解决这一问题,我们提出了SciForge,一种数据合成方法,将工具动作空间建模为依赖图,以生成逻辑感知的训练轨迹。通过在这些轨迹上进行微调,我们的SciAgent-8B优于显著更大的Qwen3-VL-235B-Instruct,同时表现出科学工具使用能力的跨领域正迁移。这些结果凸显了下一代自主科学代理的潜力。

英文摘要

Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models still struggle with complex scientific tool-use, and their performance degrades substantially as interaction horizons extend. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.

2602.12080 2026-06-02 cs.LG

PathCRF: Ball-Free Soccer Event Detection via Possession Path Inference from Player Trajectories

PathCRF: 通过球员轨迹的控球路径推断实现无球足球事件检测

Hyunsung Kim, Kunhee Lee, Sangwoo Seo, Sang-Ki Ko, Jinsung Yoon, Chanyoung Park

Affiliations: KAIST; Fitogether Inc.; University of Seoul

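AI summary: Proposes PathCRF, a framework that detects soccer events without ball data by using only player trajectories, modeling them as a dynamic graph and applying a conditional random field (CRF) to infer possession paths, which reduces reliance on manual annotation and ball tracking.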

AI中文摘要

尽管人工智能取得了最新进展,足球比赛的事件数据收集仍然严重依赖劳动密集型的人工标注。虽然已有研究利用球员和球轨迹探索自动事件检测,但由于高昂的基础设施和运营成本,球轨迹追踪仍然难以大规模应用。因此,足球领域的全面数据收集主要局限于顶级赛事,限制了数据驱动分析在该领域的广泛应用。为了解决这一挑战,本文提出了PathCRF,一个仅使用球员追踪数据检测足球控球事件的框架。我们将球员轨迹建模为全连接动态图,并将事件检测形式化为在每个时间步选择恰好一条对应于当前控球状态的边。为了确保所得边序列的逻辑一致性,我们采用条件随机场(CRF),禁止连续边之间出现不可能的转换,其中发射分数和转移分数由社会-时间骨干架构产生的边嵌入动态计算。在推理过程中,通过维特比解码获得最可能的边序列,当所选边在相邻时间步之间发生变化时,检测到控球或传球等事件。实验表明,PathCRF生成准确、逻辑一致的控球路径,能够实现可靠的下游分析,同时大幅减少对人工事件标注的需求。源代码可在 https://github.com/hyunsungkim-ds/pathcrf.git 获取。

英文摘要

Despite recent advances in AI, event data collection in soccer still relies heavily on labor-intensive manual annotation. Although prior work has explored automatic event detection using player and ball trajectories, ball tracking also remains difficult to scale due to high infrastructural and operational costs. As a result, comprehensive data collection in soccer is largely confined to top-tier competitions, limiting the broader adoption of data-driven analysis in this domain. To address this challenge, this paper proposes PathCRF, a framework for detecting on-ball soccer events using only player tracking data. We model player trajectories as a fully connected dynamic graph and formulate event detection as the problem of selecting exactly one edge corresponding to the current possession state at each time step. To ensure logical consistency of the resulting edge sequence, we employ a Conditional Random Field (CRF) that forbids impossible transitions between consecutive edges, where emission and transition scores are dynamically computed from edge embeddings produced by a socio-temporal backbone architecture. During inference, the most probable edge sequence is obtained via Viterbi decoding, and events such as ball controls or passes are detected whenever the selected edge changes between adjacent time steps. Experiments show that PathCRF produces accurate, logically consistent possession paths, enabling reliable downstream analyses while substantially reducing the need for manual event annotation. The source code is available at https://github.com/hyunsungkim-ds/pathcrf.git.
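The decoding step described above, Viterbi over a label sequence with forbidden transitions, can be sketched in a few lines. This is a generic linear-chain illustration rather than the released PathCRF code: the emission and transition scores here are fixed toy arrays (in the paper they are computed from edge embeddings), labels stand in for graph edges, and forbidden transitions are encoded as `-inf`.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most probable label sequence for a linear-chain model.

    emissions:   (T, K) score of each label at each time step.
    transitions: (K, K) score of moving from label i to label j;
                 use -np.inf to forbid a transition.
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions  # cand[i, j]: best path ending in i, then i -> j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy example with three labels; the transition 0 -> 2 is forbidden.
emissions = np.array([[2.0, 1.0, 0.0],
                      [0.0, 0.0, 3.0],
                      [0.0, 0.0, 3.0]])
transitions = np.zeros((3, 3))
transitions[0, 2] = -np.inf
path = viterbi(emissions, transitions)
# An event is flagged wherever the selected label changes between adjacent steps.
events = [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

Without the constraint the greedy choice would be label 0 followed by label 2. The forbidden transition pushes the decoder to the best sequence that stays logically consistent.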

2602.11852 2026-06-02 cs.AI cs.CL cs.LG

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

原型Transformer:迈向可解释设计的语言模型架构

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

Affiliations: University of Cambridge; ETH Zurich

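AI summary: Proposes the Prototype Transformer (ProtoT), an autoregressive language model architecture that replaces quadratic-cost self-attention with a linear-cost prototype module; the prototypes automatically capture nameable concepts, improving interpretability and supporting behavior editing.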

Comments Accepted at ICML 2026. Equal contribution: Yordan Yordanov and Matteo Forasassi. 40 pages, 28 figures, 22 tables

AI中文摘要

尽管最先进的语言模型(LM)在某些领域超越了大多数人类,但其推理过程仍然不透明,降低了信任度并增加了欺骗和幻觉的风险。我们引入了原型Transformer(ProtoT),一种自回归LM架构,它将Transformer的二次代价自注意力模块替换为基于原型的线性代价模块,原型是学习到的参数向量。在ProtoT中,原型创建了在不同时间尺度上聚合上下文信息的通信通道。我们表明,这种结构导致原型在训练过程中自动捕获可命名的概念,例如“女人”,为解释模型推理和对模型行为进行有针对性的编辑提供了途径。与基线相比,ProtoT在模型和数据规模上具有良好的扩展性,对输入扰动具有鲁棒性,并在文本生成和下游任务(包括GLUE)上表现良好。这些结果表明,ProtoT是朝着设计上更可解释的自回归语言模型迈出的有希望的一步。

英文摘要

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.

2602.11790 2026-06-02 cs.AI cs.CL

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

超越端到端视频模型:基于LLM的多智能体系统用于教育视频生成

Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang

Affiliations: Baidu Inc.

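AI summary: Proposes LASEV, a hierarchical LLM-based multi-agent system that decomposes educational video generation into collaborating specialized agents, addressing end-to-end models' weaknesses in logical rigor and knowledge representation and enabling low-cost, high-throughput automated production of instructional videos.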

Comments Accepted at ACM SIGKDD 2026 (KDD '26), Applied Data Science Track. 10 pages, 2 figures, 5 tables. The project is available at https://robitsg.github.io/LASEV

AI中文摘要

尽管最近的端到端视频生成模型在视觉导向的内容创作中表现出令人印象深刻的性能,但在需要严格逻辑严谨性和精确知识表示的场景(如教学和教育媒体)中仍然受限。为解决此问题,我们提出LASEV,一种基于LLM的分层多智能体系统,用于从教育问题生成高质量教学视频。LASEV将教育视频生成表述为一个多目标任务,同时要求正确的逐步推理、教学连贯的叙述、语义忠实的视觉演示以及精确的视听对齐。为解决先前方法的局限性——包括低程序保真度、高生产成本和有限的可控性——LASEV将生成工作流分解为通过中央编排智能体协作的专业智能体,共享生产状态、显式质量门控和迭代批评机制。具体来说,编排智能体监督一个用于严格问题求解的求解智能体、一个生成可执行可视化代码的插图智能体,以及一个面向学习者的教学脚本的叙述智能体。此外,工作智能体的所有输出都经过语义批评、基于规则的约束和基于工具的编译检查。该系统不直接合成像素,而是构建一个结构化的可执行视频脚本,该脚本通过模板驱动的组装规则确定性编译为同步的视觉和叙述,实现无需手动编辑的全自动生产。在大规模部署中,LASEV实现了每天超过一百万视频的吞吐量,与当前行业标准方法相比成本降低超过95%,同时保持高接受率。

英文摘要

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio-visual alignment. To address the limitations of prior approaches (including low procedural fidelity, high production cost, and limited controllability), LASEV decomposes the generation workflow into specialized agents that collaborate through a central Orchestrating Agent, shared production state, explicit quality gates, and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated production without manual editing. In large-scale deployments, LASEV achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.

2510.06028 2026-06-02 cs.LG stat.ML

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime

插值机制下吉布斯和朗之万蒙特卡洛算法的泛化

Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil

Affiliations: ETH Zurich

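AI summary: In the overparameterized interpolation regime, uses data-dependent bounds on the expected error to show that generalization in the low-temperature regime is signaled by small training errors in the high-temperature regime; the bounds are stable under approximation with Langevin Monte Carlo, and on MNIST, CIFAR-10, and SVHN they give nontrivial predictions close to the test error for true labels.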

AI中文摘要

本文在过参数化插值机制下提供了吉布斯算法期望误差的数据依赖界,其中对于不可能的数据(如分类中的随机标签)也能获得低训练误差。结果表明,低温区的泛化已经由噪声较大的高温区的小训练误差所预示。这些界在使用朗之万蒙特卡洛算法近似时是稳定的。该分析激励了一种计算界的算法设计,该算法在MNIST、CIFAR-10和SVHN数据集上对真实标签数据给出了非平凡且接近的测试误差预测,同时对随机标签保持了正确的测试误差上界。

英文摘要

This paper provides data-dependent bounds on the expected error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The results show that generalization in the low-temperature regime is already signaled by small training errors in the noisier high-temperature regime. The bounds are stable under approximation with Langevin Monte Carlo algorithms. The analysis motivates the design of an algorithm to compute bounds, which on the MNIST, CIFAR-10, and SVHN datasets yield nontrivial, close predictions on the test error for true labeled data, while maintaining a correct upper bound on the test error for random labels.
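For readers unfamiliar with Langevin Monte Carlo as an approximation to a Gibbs distribution, the following sketch shows the standard unadjusted Langevin update, which targets a density proportional to exp(-beta * L(theta)). It is a textbook illustration on a toy quadratic loss, not the bound-computing algorithm from the paper. The step size, the inverse temperature `beta`, and the function names are assumptions.

```python
import numpy as np

def langevin_samples(grad_loss, theta0, step, beta, n_steps, rng):
    """Unadjusted Langevin algorithm:
    theta <- theta - step * grad L(theta) + sqrt(2 * step / beta) * N(0, I).
    Large beta corresponds to low temperature (less noise), small beta to high temperature.
    """
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n_steps,) + theta.shape)
    noise_scale = np.sqrt(2.0 * step / beta)
    for k in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = theta - step * grad_loss(theta) + noise_scale * noise
        samples[k] = theta
    return samples

# Toy check: L(theta) = theta**2 / 2 with beta = 1 targets a standard normal.
rng = np.random.default_rng(0)
samples = langevin_samples(lambda th: th, [0.0], step=0.05, beta=1.0,
                           n_steps=50_000, rng=rng)
```

With a finite step size the chain samples a slightly biased distribution (here the stationary variance is 1 / (1 - step / 2), not exactly 1), which is why the stability of bounds under this approximation matters.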

2602.11641 2026-06-02 cs.LG

Both Topology and Text Matter: Revisiting LLM-guided Out-of-Distribution Detection on Text-attributed Graphs

拓扑与文本同样重要:重新审视基于LLM的文本属性图分布外检测

Yinlin Zhu, Di Wu, Xu Wang, Guocong Quan, Miao Hu

发表机构 * Sun Yat-sen University(中山大学) Shandong University(山东大学)

AI总结 针对文本属性图分布外检测中拓扑与文本信息利用不足的问题,提出LG-Plug框架,通过对齐拓扑与文本表示并利用聚类迭代LLM提示生成共识驱动的OOD样本,有效提升检测性能。

Comments Accepted by SIGKDD 2026

详情
AI中文摘要

文本属性图(TAGs)将节点与文本属性和图结构关联,使GNN能够联合建模语义和结构信息。尽管在分布内(ID)数据上有效,但GNN在面对具有未见文本或结构模式的分布外(OOD)节点时常常失败,产生过度自信的预测而缺乏可靠的OOD检测。现有的拓扑驱动方法通过邻域结构缓解节点级偏差,但通常将文本编码为浅层特征,未充分利用语义信息。最近的基于LLM的方法则从文本知识中合成伪OOD先验,但存在两个关键限制:(1)可靠性与信息性之间的权衡,生成的OOD暴露要么偏离真实的OOD语义,要么引入大量ID噪声;(2)依赖专用架构,限制了与先前工作中验证的拓扑级进展的兼容性。为解决这些问题,我们提出LG-Plug,一个用于TAG OOD检测的LLM引导的即插即用框架。LG-Plug对齐拓扑和文本表示以获得细粒度节点嵌入,然后通过聚类迭代LLM提示构建共识驱动的OOD暴露。为降低LLM查询成本,它进一步采用轻量级簇内码本和启发式采样。生成的OOD暴露作为正则化器,分离ID和OOD节点,实现与现有检测器的无缝集成。在六个TAG基准上的实验表明,LG-Plug持续改进拓扑驱动的OOD检测器(FPR95降低>7%),并超越先前基于LLM的方法(FPR95降低>5%)。

英文摘要

Text-attributed graphs (TAGs) associate nodes with textual attributes and graph structure, enabling GNNs to jointly model semantic and structural information. Although effective on in-distribution (ID) data, GNNs often fail on out-of-distribution (OOD) nodes with unseen textual or structural patterns, producing overconfident predictions without reliable OOD detection. Existing topology-driven methods mitigate node-level bias through neighboring structures, but typically encode texts as shallow features, underutilizing semantic information. Recent LLM-based approaches instead synthesize pseudo OOD priors from textual knowledge, yet suffer from two key limitations: (1) a trade-off between reliability and informativeness, where generated OOD exposures either deviate from true OOD semantics or introduce substantial ID noise; and (2) dependence on specialized architectures, limiting compatibility with topology-level advances validated in prior work. To address these issues, we propose LG-Plug, an LLM-Guided Plug-and-play framework for TAG OOD detection. LG-Plug aligns topology and text representations to obtain fine-grained node embeddings, then constructs consensus-driven OOD exposure through clustered iterative LLM prompting. To reduce LLM query cost, it further adopts lightweight in-cluster codebooks and heuristic sampling. The generated OOD exposure acts as a regularizer that separates ID and OOD nodes, enabling seamless integration with existing detectors. Experiments on six TAG benchmarks demonstrate that LG-Plug consistently improves topology-driven OOD detectors (>7% FPR95 reduction) and surpasses prior LLM-based methods (>5% FPR95 reduction).
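The FPR95 metric quoted above can be computed as in the sketch below. It follows a common convention in OOD detection, assumed here because the abstract does not spell it out: in-distribution (ID) samples are the positive class, higher scores mean "more ID", and the threshold is set so that 95% of ID samples are accepted. The function name is hypothetical.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted as ID at the threshold that keeps 95% of ID samples."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    threshold = np.percentile(id_scores, 5)  # 95% of ID scores lie at or above this value
    return float(np.mean(ood_scores >= threshold))

# Example: ID scores 1..100; two of five OOD scores clear the threshold.
fpr = fpr_at_95_tpr(np.arange(1, 101), [0, 3, 5, 10, 50])
```

Lower is better: a detector that scores every OOD node below the ID threshold reaches an FPR95 of 0.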

2602.11177 2026-06-02 cs.CL cs.AI

What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection

LLMs 对阿尔茨海默病了解多少?用于 AD 检测的多损失微调和探针分析

Lei Jiang, Yue Zhou, Natalie Parde

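Affiliations * University of Illinois Chicago

AI summary Fine-tunes BERT, T5, and Llama-1B with multiple losses, setting a new state of the art for text-based AD detection on the Pitt and CCC corpora with strong results on ADRC, and uses linear probing to analyze how AD-related information is encoded in internal representations.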

Details
AI Chinese abstract

可靠的阿尔茨海默病(AD)早期检测具有挑战性,特别是由于标记数据的有限可用性。虽然大型语言模型(LLMs)在跨领域表现出强大的迁移能力,但通过监督微调将其适应 AD 领域在很大程度上仍未被探索。在这项工作中,我们跨三个异构转录语料库(Pitt、CCC、ADRC)实证评估了各种模型架构,以研究它们在基于文本的 AD 检测中的有效性,并分析任务相关信息如何在其内部表征中编码。据我们所知,我们微调的 BERT 和 T5 模型在 Pitt 和 CCC 数据集上建立了新的最先进水平,同时在 ADRC 上取得了强劲性能。同时,仅解码器的 Llama-1B 在所有三个语料库上取得了与 BERT 和 T5 相当的高度竞争结果,突显了其在 AD 检测中的有效性。我们进一步对 Llama-1B 骨干网络进行了全面评估,分析了跨语料库可迁移性、最优输入块大小粒度以及临床转录标记的影响。此外,我们使用线性探针实证表明,微调以反映 AD 相关信号的方式改变了单个标记(语言标记和内容词)的表征。

English abstract

Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to the limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we empirically evaluate various model architectures across three heterogeneous transcript corpora (Pitt, CCC, ADRC) to investigate their effectiveness for text-based AD detection and analyze how task-relevant information is encoded within their internal representations. To the best of our knowledge, our fine-tuned BERT and T5 models establish a new state-of-the-art on the Pitt and CCC datasets, while achieving strong performance on ADRC. In parallel, the decoder-only Llama-1B achieves highly competitive results comparable to BERT and T5 across all three corpora, highlighting its effectiveness for AD detection. We further conduct a comprehensive evaluation of the Llama-1B backbone, analyzing cross-corpus transferability, optimal input chunk-size granularity, and the impact of clinical transcript markers. Also, we use linear probing to empirically show that fine-tuning shifts the representations of individual tokens, both linguistic markers and content words, in ways that reflect AD-related signal.

2507.15336 2026-06-02 cs.LG cs.AI cs.DB

Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design

超越模型库检索:编织知识以掌握细粒度神经网络设计

Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou

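Affiliations * National University of Singapore; Tsinghua University; University of Science and Technology of China

AI summary Proposes M-DESIGN, a framework that builds edit-effect evidence graphs and combines adaptive retrieval with predictive task planners to efficiently find near-optimal fine-grained architecture modification paths under a strict budget, reaching the search-space best performance in 26 of 33 cases.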

Comments Accepted at ICML 2026. Title changed from "Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design" to "Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design"

Details
AI Chinese abstract

为新任务设计高性能神经网络需要在优化质量与搜索效率之间取得平衡。当前方法未能实现这一平衡:神经架构搜索计算成本高,而模型检索通常产生次优的静态检查点。为解决这一困境,我们将细粒度架构修改带来的性能增益建模为编辑效应证据,并从先验任务构建证据图。通过构建检索增强的模型精炼框架,我们提出的M-DESIGN动态编织历史证据以发现近最优的修改路径。M-DESIGN具有自适应检索机制,可快速校准来自不同来源的编辑效应证据的演化可迁移性。为处理分布外偏移,我们引入预测任务规划器,从多跳证据外推增益,从而减少对详尽知识库的依赖。基于包含22个数据集上67,760个图神经网络的知识库,大量实验表明,M-DESIGN持续优于基线,在严格预算下33个案例中有26个达到搜索空间最佳性能。

English abstract

Designing high-performance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fail to achieve this balance: neural architectural search is computationally expensive, while model retrieval often yields suboptimal static checkpoints. To resolve this dilemma, we model the performance gains induced by fine-grained architectural modifications as edit-effect evidence and build evidence graphs from prior tasks. By constructing a retrieval-augmented model refinement framework, our proposed M-DESIGN dynamically weaves historical evidence to discover near-optimal modification paths. M-DESIGN features an adaptive retrieval mechanism that quickly calibrates the evolving transferability of edit-effect evidence from different sources. To handle out-of-distribution shifts, we introduce predictive task planners that extrapolate gains from multi-hop evidence, thereby reducing reliance on an exhaustive repository. Based on our model knowledge base of 67,760 graph neural networks across 22 datasets, extensive experiments demonstrate that M-DESIGN consistently outperforms baselines, achieving the search-space best performance in 26 out of 33 cases under a strict budget.

2509.18046 2026-06-02 cs.RO cs.AI cs.ET cs.SY eess.SP eess.SY

HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba

HuMam: 基于Mamba的端到端深度强化学习人形机器人运动控制

Yinuo Wang, Yuanyang Qi, Jinzhao Zhou, Pengxiang Meng, Xiaowen Tao

Affiliations * College of Graduate and Professional Studies, Trine University; Department of Civil Engineering, University of Hong Kong; Faculty of Engineering and Information Technology, University of Technology Sydney; National Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; School of Computer Science and Statistics, Trinity College Dublin

AI summary Proposes HuMam, which uses a single-layer Mamba encoder to fuse robot states with footstep targets and is optimized with PPO for stable, efficient end-to-end humanoid locomotion control.

Comments 12 pages

Details
Journal ref
2026 IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM) (CIS-RAM 2026)
AI Chinese abstract

端到端强化学习(RL)用于人形机器人运动因其紧凑的感知-动作映射而具有吸引力,但实际策略常受训练不稳定、特征融合低效和高执行成本困扰。我们提出HuMam,一种以状态为中心的端到端RL框架,采用单层Mamba编码器融合机器人中心状态与定向脚步目标及连续相位时钟。策略输出由低级PD环跟踪的关节位置目标,并通过PPO优化。一个简洁的六项奖励平衡接触质量、摆动平滑度、脚部放置、姿态和身体稳定性,同时隐含促进节能。在mc-mujoco中的JVRC-1人形机器人上,HuMam在强前馈基线上持续提高了学习效率、训练稳定性和整体任务性能,同时降低了功耗和扭矩峰值。据我们所知,这是首个采用Mamba作为融合骨干的端到端人形机器人RL控制器,展示了在效率、稳定性和控制经济性方面的切实提升。

English abstract

End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.
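The abstract states that the policy outputs joint position targets that a low-level PD loop tracks. A minimal Python sketch of such a PD law is given below; the gains `kp` and `kd`, the zero target velocity, and the function name are illustrative assumptions, not values or code from the paper.

```python
def pd_torques(q_target, q, q_dot, kp, kd):
    """Joint torques from a PD law that tracks position targets.

    tau_i = kp_i * (q_target_i - q_i) - kd_i * q_dot_i
    The target joint velocity is assumed to be zero (hypothetical choice).
    """
    return [
        p * (qt - qi) - d * vi
        for qt, qi, vi, p, d in zip(q_target, q, q_dot, kp, kd)
    ]
```

In a setup like this the learned policy would run at a lower rate and only update `q_target`, while the PD loop converts targets into torques at the simulator's control rate.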

2602.10623 2026-06-02 cs.LG cs.AI

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

通过贝叶斯非负奖励建模缓解RLHF中的奖励黑客

Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo

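Affiliations * Zhejiang University

AI summary Proposes the Bayesian Non-Negative Reward Model (BNRM), which uses non-negative factor analysis and variational inference to achieve disentanglement and debiasing within the Bradley-Terry preference model, mitigating reward over-optimization and improving robustness and interpretability.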

Comments Accepted as an Oral presentation at ICML 2026. The code is available at https://github.com/GuoweiRong/Bayesian-Non-negative-Reward-Model

Details
AI Chinese abstract

从人类偏好中学习的奖励模型是通过人类反馈强化学习对齐大型语言模型的核心,但由于噪声标注和系统偏差(如响应长度或风格),它们通常容易受到奖励黑客攻击。我们提出了贝叶斯非负奖励模型(BNRM),这是一个原则性的奖励建模框架,将非负因子分析整合到Bradley-Terry偏好模型中。BNRM通过稀疏的非负潜在因子生成过程表示奖励,该过程在两个互补层面运作:实例特定的潜在变量诱导解耦的奖励表示,而全局潜在因子的稀疏性作为隐式去偏机制,抑制虚假相关性。这种解耦-去偏结构共同实现了鲁棒的不确定性感知奖励学习。为了将BNRM扩展到现代LLM,我们开发了一个基于深度模型表示的条件摊销变分推断网络,实现高效的端到端训练。大量实验结果表明,BNRM显著缓解了奖励过度优化,提高了分布偏移下的鲁棒性,并比强基线产生了更可解释的奖励分解。

English abstract

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose the Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into the Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.
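For readers unfamiliar with the Bradley-Terry preference model that BNRM builds on, a minimal Python sketch of the standard BT likelihood and its loss follows. It covers only the plain BT objective on scalar rewards; the non-negative factor structure and variational inference described in the abstract are not reproduced here, and the function names are illustrative.

```python
import math


def bt_preference_prob(r_chosen, r_rejected):
    """P(chosen is preferred) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def bt_loss(pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs."""
    return -sum(math.log(bt_preference_prob(c, r)) for c, r in pairs) / len(pairs)
```

Because the loss depends only on reward differences, any feature that correlates with the preferred response (length, style) can lower it, which is the opening for the reward hacking the abstract describes.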

2602.09153 2026-06-02 cs.RO cs.AI cs.CV cs.GR

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

SceneSmith: 面向仿真就绪室内场景的智能体生成

Nicholas Pfaff, Thomas Cohn, Sergey Zakharov, Rick Cory, Russ Tedrake

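Affiliations * Massachusetts Institute of Technology; Harvard University

AI summary Proposes SceneSmith, a hierarchical agentic framework in which collaborating VLM agents generate simulation-ready indoor scenes from natural language, producing 3-6x more objects than prior methods with an inter-object collision rate below 2%.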

Comments ICML 2026 Spotlight; Project page: https://scenesmith.github.io/

Details
AI Chinese abstract

仿真已成为大规模训练和评估家用机器人的关键工具,但现有环境未能捕捉真实室内空间的多样性和物理复杂性。当前的场景合成方法生成的房间稀疏布置,缺乏机器人操作所必需的密集杂乱、铰接式家具和物理属性。我们提出了SceneSmith,一个层次化智能体框架,能够从自然语言提示生成仿真就绪的室内环境。SceneSmith通过连续阶段构建场景——从建筑布局到家具放置再到小物体填充——每个阶段都实现为VLM智能体(设计师、评论家和编排者)之间的交互。该框架通过文本到3D合成生成静态物体、数据集检索获取铰接式物体以及物理属性估计,紧密集成了资产生成。SceneSmith生成的物体数量是先前方法的3-6倍,物体间碰撞率低于2%,且96%的物体在物理仿真下保持稳定。在205名参与者参与的用户研究中,与基线相比,平均真实感胜率达到92%,平均提示忠实度胜率达到91%。我们进一步证明了这些环境可用于端到端的自动机器人策略评估流程。

English abstract

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages (from architectural layout to furniture placement to small object population), each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with <2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

2505.24069 2026-06-02 cs.LG cs.AI

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

LLM 能否进行结构性推理?通过数据结构视角进行基准测试

Yu He, Yingxi Li, Colin White, Ellen Vitercik

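AI summary Introduces the DSR-Bench benchmark, which evaluates the structural reasoning of LLMs across 20 data structures, 35 operations, and 4,140 problem instances; the top model scores only 0.46/1 on challenging instances, and models perform poorly on spatial data, context-rich scenarios, and reasoning over their own code.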

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

Details
AI Chinese abstract

大型语言模型(LLM)被部署在日益复杂的任务上,这些任务需要多步决策。因此,理解它们的算法推理能力至关重要。然而,我们缺乏用于评估这些能力的诊断基准。我们提议使用数据结构作为原则性视角:作为算法的基本构建块,它们自然地探测结构性推理——即理解和操作支撑算法推理的关系(如顺序、层次和连接性)的能力。我们引入了 DSR-Bench(数据结构推理基准),涵盖 20 种数据结构、35 种操作和 4140 个问题实例。DSR-Bench 具有层次化任务组织、全自动生成与评估以及细粒度诊断的特点。评估 13 个最先进的 LLM 揭示了关键局限性:表现最好的模型在挑战性实例上仅达到 0.46/1。三个针对更现实用法的辅助探针暴露了进一步的弱点:模型在空间数据和上下文丰富的场景中表现不佳,并且难以对其自身代码进行推理。

English abstract

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating these capabilities. We propose to use data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning - the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench (Data Structure Reasoning Benchmark), spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usages expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.

2602.10056 2026-06-02 cs.LG stat.ML

WildCat: Near-Linear Attention in Theory and Practice

WildCat: 理论与实践中近乎线性的注意力机制

Tobias Schröder, Lester Mackey

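Affiliations * University of Cambridge

AI summary Proposes WildCat, which selects a weighted coreset with randomly pivoted Cholesky to approximate exact attention in near-linear time with super-polynomial error decay, with applications to image generation, image classification, and language model KV cache compression.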

Details
AI Chinese abstract

我们介绍了WildCat,一种高精度、低成本的神经网络注意力机制压缩方法。虽然注意力是现代网络架构的标配,但由于其资源需求随输入序列长度$n$呈二次方增长,部署成本极高。WildCat通过仅关注一个小的加权核心集来避免这些二次成本。关键的是,我们使用一种快速但谱精确的子采样算法——随机枢轴Cholesky——来选择核心集,并最优地加权元素以最小化重构误差。值得注意的是,在输入有界的情况下,WildCat以超多项式$O(n^{-\sqrt{\log(\log(n))}})$的误差衰减逼近精确注意力,同时运行在近线性$O(n^{1+o(1)})$时间内。相比之下,先前的实用近似要么缺乏误差保证,要么需要二次运行时间才能保证如此高的保真度。我们将这一进展与GPU优化的PyTorch实现以及一套基准实验相结合,展示了WildCat在图像生成、图像分类和语言模型KV缓存压缩方面的优势。

English abstract

We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.
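The coreset selection step named in this abstract, randomly pivoted Cholesky, can be sketched in a few lines. The version below is a generic pure-Python rendering of that algorithm for a small positive semidefinite matrix given as a list of lists; it is an illustration under that assumption, not the authors' GPU implementation, and it omits WildCat's optimal reweighting of the selected elements.

```python
import random


def rp_cholesky(A, k, seed=0):
    """Rank-k randomly pivoted Cholesky of a PSD matrix A (list of lists).

    Returns (F, pivots) with A approximately equal to F @ F^T, where F has
    len(A) rows and at most k columns. `pivots` plays the role of a coreset.
    """
    rng = random.Random(seed)
    n = len(A)
    F = [[] for _ in range(n)]        # rows of the factor, grown column by column
    d = [A[i][i] for i in range(n)]   # diagonal of the current residual
    pivots = []
    for _ in range(k):
        if sum(d) <= 1e-12:           # residual is numerically zero
            break
        # Sample a pivot with probability proportional to the residual diagonal.
        s = rng.choices(range(n), weights=d)[0]
        # Residual column s: A[:, s] - F @ F[s, :]^T
        g = [A[i][s] - sum(a * b for a, b in zip(F[i], F[s])) for i in range(n)]
        if g[s] <= 0.0:               # numerical safeguard
            break
        scale = g[s] ** 0.5
        for i in range(n):
            F[i].append(g[i] / scale)
            d[i] = max(d[i] - F[i][-1] ** 2, 0.0)
        pivots.append(s)
    return F, pivots
```

Only the k sampled columns of `A` are ever read, which is why the approach avoids forming the full quadratic-size attention kernel when k is much smaller than n.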

2602.09492 2026-06-02 cs.LG cs.AI

Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

当心批量大小:评估 LoRA 中的超参数偏差

Sangyoon Lee, Jaeho Lee

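Affiliations * Pohang University of Science and Technology (POSTECH)

AI summary Identifies batch size as the key factor behind conflicting performance reports for LoRA variants and proposes an efficient proxy-based tuning strategy, elevating batch size to a first-order design parameter.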

Details
AI Chinese abstract

低秩适配(LoRA)是微调大型语言模型的标准方法,但其众多变体在相同基准上报告了相互矛盾的经验性收益。我们表明这些矛盾源于一个被忽视的因素:批量大小。当适当调整时,vanilla LoRA 通常能达到与更复杂变体相当的性能。我们进一步提出了一种基于代理的、成本高效的批量大小调优策略,揭示了秩、数据集大小和模型容量对最优批量大小的影响。我们的发现将批量大小从次要实现细节提升为一阶设计参数,调和了先前的不一致性,并使得对 LoRA 变体的评估更加可靠。

English abstract

Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical gains, often on the same benchmarks. We show that these contradictions arise from a single overlooked factor: the batch size. When properly tuned, vanilla LoRA often matches the performance of more complex variants. We further propose a proxy-based, cost-efficient strategy for batch size tuning, revealing the impact of rank, dataset size, and model capacity on the optimal batch size. Our findings elevate batch size from a minor implementation detail to a first-order design parameter, reconciling prior inconsistencies and enabling more reliable evaluations of LoRA variants.

2602.08868 2026-06-02 cs.LG cs.AI

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

AnomSeer: 增强多模态大语言模型进行时间序列异常检测的推理能力

Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu

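AI summary Proposes AnomSeer, which uses expert chain-of-thought traces and an optimal-transport-based time-series grounded policy optimization to strengthen the fine-grained reasoning of multimodal LLMs for time-series anomaly detection, unifying anomaly classification, localization, and explanation.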

Comments ICML 2026

Details
AI Chinese abstract

基于多模态大语言模型(MLLM)的时间序列异常检测(TSAD)是一个新兴领域,但一个持续存在的挑战是:MLLM依赖于粗略的时间序列启发式方法,但在多维、详细的推理方面存在困难,而这对于理解复杂的时间序列数据至关重要。我们提出AnomSeer来解决这个问题,通过增强模型将其推理基于时间序列的精确结构细节,统一异常分类、定位和解释。其核心是生成专家思维链轨迹,从经典分析(如统计度量、频率变换)中提供可验证的细粒度推理。在此基础上,我们提出了一种新颖的时间序列接地策略优化(TimerPO),它在标准强化学习之外引入了两个额外组件:基于最优传输的时间序列接地优势,以及确保这种辅助细粒度信号不干扰主要检测目标的正交投影。在各种异常场景中,使用Qwen2.5-VL-3B/7B-Instruct的AnomSeer在分类和定位准确性上优于更大的商业基线(如GPT-4o),特别是在点和频率驱动的异常上。此外,它产生了合理的时间序列推理轨迹,支持其结论。

English abstract

Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines (e.g., GPT-4o) in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible time-series reasoning traces that support its conclusions.

2602.08689 2026-06-02 cs.LG

Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning

通过逆强化学习从扩散模型中学习采样

Constant Bourdrez, Alexandre Vérine, Olivier Cappé

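Affiliations * DI ENS, Ecole normale supérieure, Université PSL, CNRS

AI summary Proposes an inverse reinforcement learning framework that optimizes diffusion sampling strategies (noise schedules, guidance scales, stochasticity) without retraining the denoiser, matching target behavior via policy gradients; on ImageNet-64 it replaces grid search at up to 9x lower cost with only 16% inference overhead.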

Comments Preprint

Details
AI Chinese abstract

扩散模型通过由预训练神经网络引导的迭代去噪过程生成样本。一旦去噪器固定,采样算法本身(噪声调度、引导尺度、随机性分布)仍需要仔细调整,这一过程通常通过昂贵的经验网格搜索进行。在这项工作中,我们引入了一个逆强化学习框架,用于在不重新训练去噪器的情况下学习采样策略。我们将扩散采样过程建模为一个离散时间有限时域马尔可夫决策过程,其中动作对应于采样动力学的可选修改。为了优化动作调度,我们避免定义显式奖励函数,而是直接使用策略梯度技术匹配采样器预期的目标行为。我们提供的实验证据表明,该方法与微调后的采样器性能相当,并且与网格搜索相比成本适中:在ImageNet-64上,单次训练运行以高达9倍更低的成本取代了穷举搜索,推理时仅增加16%的开销。

English abstract

Diffusion models generate samples through an iterative denoising process guided by a pretrained neural network. Once the denoiser is fixed, the sampling algorithm itself (noise schedules, guidance scales, stochasticity profiles) still requires careful tuning, a process typically carried out through costly empirical grid search. In this work, we introduce an inverse reinforcement learning framework for learning sampling strategies without retraining the denoiser. We formulate the diffusion sampling procedure as a discrete-time finite-horizon Markov Decision Process, where actions correspond to optional modifications of the sampling dynamics. To optimize action scheduling, we avoid defining an explicit reward function and instead directly match the target behavior expected from the sampler using policy gradient techniques. We provide experimental evidence that this approach matches fine-tuned samplers and comes at a modest cost compared to grid search: on ImageNet-64, a single training run replaces exhaustive search at up to 9x lower cost, with only 16% overhead at inference.

2602.08585 2026-06-02 cs.LG cs.AI

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

预测未来效用:任务无关的KV缓存驱逐的全局组合优化

Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

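AI summary Proposes LU-KV, a framework that allocates per-head budgets through global combinatorial optimization to maximize long-horizon marginal contribution, reducing KV cache size by 80% with minimal performance loss.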

Details
AI Chinese abstract

鉴于注意力的二次复杂度,KV缓存驱逐对于加速模型推理至关重要。当前的KV缓存驱逐方法通常依赖于瞬时启发式度量,隐含地假设分数幅度是所有注意力头的重要性一致代理。然而,这忽略了注意力头之间预测保真度的异质性。虽然某些头优先考虑令牌的瞬时贡献,但其他头致力于捕捉长期效用。在本文中,我们提出最优预算分配应由保留长期语义信息的边际效用决定。基于这一见解,我们提出了LU-KV,这是一个新颖的框架,将头级预算分配表述为全局组合优化问题,以最大化保留令牌的长期边际贡献。为了解决这个非凸问题,我们采用凸包松弛和基于边际效用的贪婪求解器,实现接近最优的解。此外,我们实现了一个数据驱动的离线分析协议,以促进LU-KV的实际部署。在LongBench和RULER基准上的评估表明,LU-KV将KV缓存大小减少了80%,性能下降最小,同时降低了推理延迟和GPU内存占用。

English abstract

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Building on this insight, we propose LU-KV, a novel framework that formulates head-level budget allocation as a global combinatorial optimization problem to maximize the long-horizon marginal contribution of reserved tokens. To solve this non-convex problem, we employ a convex-hull relaxation and a marginal-utility-based greedy solver, achieving near-optimal solutions. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Evaluations on LongBench and RULER benchmarks demonstrate that LU-KV reduces KV cache size by 80% with minimal performance degradation, while also decreasing inference latency and GPU memory footprint.
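To make the idea of a marginal-utility-based greedy solver concrete, here is a minimal Python sketch of greedy budget allocation across attention heads. The per-head utility curves `utilities[h][b]` are hypothetical inputs (the abstract does not specify how LU-KV estimates them), and the sketch assumes each curve is concave, which is the situation a convex-hull relaxation is meant to produce.

```python
import heapq


def greedy_budget(utilities, total_budget):
    """Allocate `total_budget` cache slots across heads, one slot at a time.

    utilities[h][b] is a hypothetical estimated utility of giving head h a
    budget of b slots (utilities[h][0] is the zero-budget utility).
    With concave curves, always taking the largest marginal gain is optimal.
    """
    alloc = [0] * len(utilities)
    # Max-heap (via negated keys) on the marginal gain of each head's next slot.
    heap = [(-(u[1] - u[0]), h) for h, u in enumerate(utilities) if len(u) > 1]
    heapq.heapify(heap)
    for _ in range(total_budget):
        if not heap:
            break                      # every head is already at its maximum
        _, h = heapq.heappop(heap)
        alloc[h] += 1
        b, u = alloc[h], utilities[h]
        if b + 1 < len(u):
            heapq.heappush(heap, (-(u[b + 1] - u[b]), h))
    return alloc
```

For example, with curves `[[0, 5, 8, 9], [0, 4, 6, 7]]` and a budget of 3, the marginal gains taken are 5, 4, and 3, so the first head receives two slots and the second receives one.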

2602.08236 2026-06-02 cs.CV cs.AI cs.CL

When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

何时想象以及想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理

Shoubin Yu, Yue Zhang, Zun Wang, Jaehong Yoon, Huaxiu Yao, Mingyu Ding, Mohit Bansal

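Affiliations * University of North Carolina, Chapel Hill; Nanyang Technological University

AI summary Proposes the adaptive test-time frameworks AVIC and AVIC-R, which selectively invoke and scale visual imagination with world models to balance accuracy and efficiency in spatial reasoning, surpassing baselines such as GPT-4o.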

Comments The first two authors contributed equally. Project page: https://adaptive-visual-tts.github.io/

Details
AI Chinese abstract

尽管多模态大语言模型(MLLMs)取得了快速进展,但当正确答案取决于场景在未见或替代视角下的外观时,视觉空间推理仍然不可靠。最近的工作通过使用世界模型进行视觉想象来增强推理,但诸如想象何时真正必要、多少想象有益、以及何时想象有害等问题仍知之甚少。在实践中,无差别的想象可能会增加计算量,甚至通过引入误导性证据而降低性能。在这项工作中,我们深入分析了作为空间推理可控资源的测试时视觉想象。我们首先研究静态视觉证据何时足够,想象何时改进推理,以及过度或不必要的想象如何影响准确性和效率。为了支持这一分析,我们随后引入了AVIC,一个基于世界模型的自适应测试时框架,该框架在选择性调用和缩放视觉想象之前,明确推理当前视觉证据的充分性。最后,为了进一步学习这种门控和规划行为,而无需任何关于何时想象以及想象多少的标注,我们引入了AVIC-R,它通过来自QA正确性奖励和想象成本惩罚的GRPO来训练策略。在空间推理基准(SAT, MMSI)和具身导航基准(R2R)上,我们的结果揭示了想象至关重要、边际或有害的明确场景,并表明选择性控制可以匹配或超越固定想象策略,同时大幅减少世界模型调用和语言标记。我们的AVIC-R超越了包括GPT-4o和GPT-4.1在内的强大专有基线,同时调用世界模型的频率更低。总体而言,我们的发现强调了分析和控制测试时想象对于高效可靠的空间推理的重要性。

English abstract

Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We first study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we then introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Finally, to further learn this gating and planning behavior without any annotation of when and how much to imagine, we introduce AVIC-R, which trains the policy via GRPO from QA-correctness rewards and penalties by imagination cost. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Our AVIC-R surpasses strong proprietary baselines including GPT-4o and GPT-4.1 while invoking the world model less often. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
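The training signal described for AVIC-R, a QA-correctness reward with a penalty for imagination cost fed into GRPO, can be sketched as follows. The linear penalty, the `cost_per_call` value, and both function names are assumptions made for illustration; the abstract does not give the exact reward shape. The second function is the usual group-relative standardization used by GRPO.

```python
import statistics


def shaped_rewards(correct, n_calls, cost_per_call=0.1):
    """QA-correctness reward minus a linear imagination-cost penalty.

    `cost_per_call` is a hypothetical coefficient, not a value from the paper.
    """
    return [float(c) - cost_per_call * k for c, k in zip(correct, n_calls)]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this shaping, a rollout that answers correctly without calling the world model receives a higher advantage than one that answers correctly after several calls, which is the pressure toward selective imagination.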

2511.21140 2026-06-02 cs.LG cs.CL stat.AP stat.ML

How to Correctly Report LLM-as-a-Judge Evaluations

如何正确报告LLM作为评估者的评估结果

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

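Affiliations * University of Science and Technology of China

AI summary Addresses bias in LLM-as-a-judge evaluation with a plug-in correction framework that provides bias-corrected estimates and statistically principled uncertainty quantification, and shows that the framework remains unbiased under distribution shift.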

Details
Journal ref
International Conference on Machine Learning (ICML) 2026
AI Chinese abstract

大型语言模型(LLMs)被广泛用作模型响应的可扩展评估者,以替代人工标注者。然而,LLM评估者的不完美灵敏度和特异性会导致朴素评估分数产生偏差。我们提出一个简单的插件式框架,可校正此偏差并实现统计原理的不确定性量化。我们的框架构建置信区间,该区间同时考虑来自测试数据集和人工标注校准数据集的不确定性。此外,它采用自适应策略分配校准样本以获得更紧的区间。重要的是,我们刻画了由真实评估分数和LLM评估者的灵敏度与特异性定义的参数区间,在这些区间内,基于LLM的评估比仅人工评估产生更可靠的估计。此外,我们证明,与现有方法相比,我们的框架在测试集和校准集之间存在分布偏移时仍保持无偏性。

English abstract

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.
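The bias that imperfect sensitivity and specificity introduce, and how a plug-in correction removes it, can be shown with the classical misclassification-correction formula. The Python sketch below is a generic illustration under the assumption that the judge's sensitivity and specificity are known (in practice they would be estimated from the human-labeled calibration set); it is not claimed to be the paper's exact estimator, and it does not cover the confidence intervals.

```python
def corrected_score(p_judge, sensitivity, specificity):
    """Bias-corrected estimate of the true score theta.

    If the judge outputs "correct" with probability
        p = theta * sensitivity + (1 - theta) * (1 - specificity),
    solving for theta gives the estimator below, clipped to [0, 1].
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0.0:
        raise ValueError("judge must be better than chance")
    theta = (p_judge + specificity - 1.0) / denom
    return min(1.0, max(0.0, theta))
```

For instance, a judge with sensitivity 0.9 and specificity 0.8 that marks 62% of responses correct implies a true score of about 0.60, so the naive 0.62 overstates it.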

2602.07955 2026-06-02 cs.CV

One-Shot Crowd Counting With Density Guidance For Scene Adaptation

基于密度引导的单次场景自适应人群计数

Jiwei Chen, Qi Wang, Junyu Gao, Jing Zhang, Dingyi Li, Jing-Jia Luo

Affiliations * Jiangsu Key Laboratory of Intelligent Weather Forecasting and Applications Based on Big Data; State Key Laboratory of Climate System Prediction and Risk Management (CPRM); ICAR/CIC-FEMD/KLME/ILCEC Nanjing University of Information Science and Technology; School of Artificial Intelligence, OPtics and ElectroNics, Northwestern Polytechnical University; School of Computer Science, Wuhan University; School of Computer Science and Engineering, Nanjing University of Science and Technology

AI summary Proposes using local and global density characteristics to guide a model in adapting to unseen surveillance scenes: a multiple local density learner learns multiple prototypes of the density distributions in the support scene and encodes local density similarity matrices for local guidance, while global density features provide global guidance; the method outperforms existing methods on three surveillance datasets.

Details
AI Chinese abstract

不同位置摄像头拍摄的人群场景差异很大,现有的人群模型对未见过的监控场景泛化能力有限。为了提高模型的泛化能力,我们将不同的监控场景视为不同的类别场景,并引入小样本学习,使模型适应属于给定示例类别场景的未见过的监控场景。为此,我们提出利用局部和全局密度特征来引导模型对未见过的监控场景进行人群计数。具体来说,为了使模型能够适应目标场景中不同的密度变化,我们提出了多局部密度学习器来学习多个原型,这些原型代表支持场景中的不同密度分布。随后,对这些多局部密度相似性矩阵进行编码,并以局部方式利用它们来引导模型。为了进一步适应目标场景中的全局密度,从支持图像中提取全局密度特征,然后以全局方式用于引导模型。在三个监控数据集上的实验表明,所提出的方法能够适应未见过的监控场景,并在小样本人群计数中优于最近的最先进方法。

English abstract

Crowd scenes captured by cameras at different locations vary greatly, and existing crowd counting models have limited generalization to unseen surveillance scenes. To improve generalization, we regard different surveillance scenes as different category scenes and introduce few-shot learning so that the model adapts to an unseen surveillance scene belonging to the given exemplar category scene. To this end, we propose to leverage local and global density characteristics to guide the crowd counting model for unseen surveillance scenes. Specifically, to enable the model to adapt to the varying density in the target scene, we propose a multiple local density learner that learns multiple prototypes representing different density distributions in the support scene. The resulting local density similarity matrices are then encoded and used to guide the model locally. To further adapt to the global density of the target scene, global density features are extracted from the support image and used to guide the model globally. Experiments on three surveillance datasets show that the proposed method adapts to unseen surveillance scenes and outperforms recent state-of-the-art methods in few-shot crowd counting.

2602.07356 2026-06-02 cs.LG

Controllable Value Alignment in Large Language Models through Neuron-Level Editing

大语言模型中通过神经元级编辑的可控价值对齐

Yonghui Yang, Yihui Wang, Junwei Li, Jilong Liu, Fengbin Zhu, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

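Affiliations * National University of Singapore; Hefei University of Technology; University of Illinois Urbana-Champaign; ST Engineering Ltd.

AI summary Proposes NeVA, a framework that identifies sparse value-relevant neurons and edits their activations at inference time, enabling fine-grained controllable value alignment that reduces value leakage while preserving general capability.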

Details
AI Chinese abstract

随着大语言模型对人类行为和决策的影响不断扩大,使其与人类价值观对齐变得越来越重要。然而,现有的基于引导的对齐方法存在可控性有限的问题:引导目标价值往往会无意中激活其他非目标价值。为了描述这一局限性,我们引入了价值泄漏这一诊断概念,它捕捉了价值引导过程中非目标价值的非预期激活,并基于Schwartz价值理论提出了归一化泄漏度量。基于此分析,我们提出了NeVA,一种用于大语言模型中可控价值对齐的神经元级编辑框架。NeVA识别稀疏的、与价值相关的神经元,并进行推理时激活编辑,无需参数更新或重新训练即可实现细粒度控制。实验表明,NeVA在实现更强的目标价值对齐的同时,对通用能力的性能下降更小。此外,NeVA显著降低了平均泄漏,残余效应主要局限于语义相关的价值类别。总体而言,NeVA为价值对齐提供了一种更可控且可解释的机制。

English abstract

Aligning large language models (LLMs) with human values has become increasingly important as their influence on human behavior and decision-making expands. However, existing steering-based alignment methods suffer from limited controllability: steering a target value often unintentionally activates other, non-target values. To characterize this limitation, we introduce value leakage, a diagnostic notion that captures the unintended activation of non-target values during value steering, along with a normalized leakage metric grounded in Schwartz's value theory. In light of this analysis, we propose NeVA, a neuron-level editing framework for controllable value alignment in LLMs. NeVA identifies sparse, value-relevant neurons and performs inference-time activation editing, enabling fine-grained control without parameter updates or retraining. Experiments show that NeVA achieves stronger target value alignment while incurring smaller performance degradation on general capability. Moreover, NeVA significantly reduces the average leakage, with residual effects largely confined to semantically related value classes. Overall, NeVA offers a more controllable and interpretable mechanism for value alignment.