arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10460 2026-06-10 cs.CL cs.AI New submission

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

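Affiliations: Columbia University; New York University; Barnard College

AI summary: Introduces the LakeQA benchmark, which requires LLMs to search and perform multi-hop reasoning over a 9.5 TB heterogeneous data lake; GPT-5.2 reaches only 18.37% exact match, underscoring the benchmark's difficulty.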

Abstract

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.
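
The exact-match score quoted in the abstract above can be illustrated with a minimal sketch. The normalization used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is an assumption borrowed from common QA evaluation practice; the abstract does not say how LakeQA normalizes answers, and the function names are hypothetical.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = answer.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)


def exact_match_score(predictions, golds) -> float:
    """Percentage of predictions that exactly match their gold answer."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return 100.0 * hits / len(golds)
```

Under this metric a partially correct answer earns nothing, which is one reason long multi-hop tasks tend to produce low scores.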

2606.10457 2026-06-10 cs.AI New submission

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Junli Zha, Jinbo Wang, Chao Zhou, Xiang Song

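Affiliations: SF Express

AI summary: Proposes the Trace2Policy framework, which uses Error-driven Iterative Skill Refinement (EISR) to extract human-readable rules from expert behavior. For compliance-sensitive tasks, rule quality is the key performance lever: after eight refinement rounds, the rules compiled into deterministic Python reach 79.6% accuracy and outperform the pure-LLM baseline in a real deployment.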

Abstract

Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and improved through iterative error analysis. We present Trace2Policy, whose core mechanism -- EISR (Error-driven Iterative Skill Refinement) -- maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate. For this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality -- not model capability -- as the dominant performance lever: across five LLMs, one-shot distillation plateaus near ~70% on the deployed pool, while eight EISR rounds lift the same rules to 79.6% when compiled into deterministic Python -- zero LLM calls at inference. Execution form compounds the gain: in production, the same EISR-refined content runs 9.8 pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together. Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy. An LLM-driven variant, Auto-EISR, reproduces this refinement at $5-$10 per cycle versus ~70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.
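
The "commit only patches that pass a regression gate" step can be sketched as follows. This is a simplified illustration, not the authors' implementation: rules are modeled as plain Python predicates, a case is flagged when any rule fires, and the gate (fix at least one case, break none that were already correct) is an assumed acceptance criterion. Error clustering into MISSING, WRONG, and CONFLICT is omitted, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Case = Dict[str, float]
Rule = Callable[[Case], bool]           # returns True when the case should be flagged
Labeled = List[Tuple[Case, bool]]       # (case, expert decision) pairs


def decide(rules: List[Rule], case: Case) -> bool:
    """Flag a case when any rule fires (a deliberately simple execution model)."""
    return any(rule(case) for rule in rules)


def correct_ids(rules: List[Rule], cases: Labeled) -> Set[int]:
    """Indices of validation cases the rule set decides correctly."""
    return {i for i, (case, label) in enumerate(cases) if decide(rules, case) == label}


def passes_regression_gate(old: List[Rule], new: List[Rule], cases: Labeled) -> bool:
    """Accept a patch only if it fixes something and breaks nothing that was right."""
    before, after = correct_ids(old, cases), correct_ids(new, cases)
    return before <= after and len(after) > len(before)


def refine(rules: List[Rule], patches: List[Rule], cases: Labeled) -> List[Rule]:
    """Try each candidate patch in turn and keep only those that pass the gate."""
    for patch in patches:
        candidate = rules + [patch]
        if passes_regression_gate(rules, candidate, cases):
            rules = candidate
    return rules
```

Because the refined rule set is ordinary Python, it can run at inference time without any LLM call, which matches the deployment form the abstract describes.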

2606.10449 2026-06-10 cs.RO New submission

GuideWalk: Learning Unified Autonomous Navigation and Locomotion for Humanoid Robots across Versatile Terrains

Haoxuan Han, Chen Chen, Linao Gong, Xin Yang, Hao Hu, Junhong Guo, Zhicheng He, Yao Su, Fenghua He

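Affiliations: Harbin Institute of Technology; Leju Robotics

AI summary: Proposes the GuideWalk framework, which distills traversability-aware navigation guidance and a terrain-adaptive locomotion teacher into a single policy, enabling humanoid robots to coordinate stable navigation and locomotion on complex terrain.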

Abstract

Humanoid robots have achieved strong locomotion capabilities, but reliable navigation on versatile terrains remains challenging because obstacle avoidance must be coordinated with dynamically feasible motion. In this work, we present GuideWalk, a unified end-to-end framework that integrates traversability-aware navigation guidance with a terrain-adaptive locomotion teacher for humanoid navigation. Specifically, we introduce a navigation module that provides explicit velocity guidance, decoupling obstacle avoidance from terrain conditions to enable robust planning across diverse environments. We propose a composite teacher distillation scheme, where goal-directed commands and dynamically consistent actions are aggregated and distilled into a single policy. To further improve robustness, the distilled policy is refined with reinforcement learning and an auxiliary behavior cloning objective, which promotes exploration while preserving desirable teacher behaviors. Experiments demonstrate that GuideWalk achieves stable and effective navigation while maintaining stable humanoid locomotion.

2606.10448 2026-06-10 cs.LG cs.AI New submission

Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations

Zeyu Liu, Xuanzhi Feng, Sing Kwong Lai, Yuanchen Gao, Xiaoyi Pang, Hualei Zhang, Jingcai Guo, Jie Zhang, Song Guo

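Affiliations: The Hong Kong University of Science and Technology

AI summary: To address the instability of SAC in low-SNR financial markets, proposes the FPQC-SAC variant, which uses a parameterized quantum circuit at the representation level to constrain feature propagation and reduce the impact of extreme fluctuations; on real-world portfolio management tasks it achieves a 66.89% relative gain in cumulative return over standard SAC.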

Comments
Preprint. Code available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main

Abstract

The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like Soft Actor-Critic (SAC). Specifically, noisy state representations may produce unreliable Q-value estimates, and bootstrapping amplifies these errors, forming a failure mode we call the "Financial Entropy Trap". In this paper, we propose FPQC-SAC, an efficient and plug-and-play SAC variant that places a compact and bounded Parameterized Quantum Circuit (PQC) before the actor and critic networks to constrain feature propagation at the representation level, rather than filtering raw inputs or regularizing Q-values after bootstrapping. Notably, FPQC-SAC reduces the impact of extreme market fluctuations on Bellman target estimation, while trainable quantum entanglement preserves flexible cross-asset interactions. Empirical evaluations on real-world portfolio management tasks demonstrate that FPQC-SAC substantially enhances out-of-sample stability and cumulative returns by achieving a 66.89% relative gain in cumulative return over standard unconstrained SAC and outperforms the best continuous-control deep reinforcement learning baseline by approximately 27%. Open-source code is available at https://github.com/ZeyuLIU-UST/FPQC-SAC-main.

2606.10445 2026-06-10 cs.LG cs.CL New submission

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

Jaeseong Lee, Seung-won Hwang, Samyam Rajbhandari

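Affiliations: Snowflake AI Research; Seoul National University

AI summary: Proposes Spense, a hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region, together with the one-shot pruning method SpenseGPT, achieving up to 1.2x end-to-end decoding speedup on B200 GPUs while preserving model accuracy.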

Abstract

Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.
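
To make the 2:4 constraint and the hybrid split concrete, here is a small pure-Python sketch. It uses plain magnitude pruning and keeps the leading columns dense; both choices are assumptions made for illustration. The abstract does not specify how SpenseGPT scores weights or how it chooses the dense region, only that the choice matters.

```python
from typing import List

Row = List[float]


def prune_2_of_4(row: Row) -> Row:
    """Keep the two largest-magnitude values in every group of four, zero the rest."""
    assert len(row) % 4 == 0, "2:4 sparsity needs a length divisible by 4"
    out: Row = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def hybrid_split(matrix: List[Row], dense_cols: int) -> List[Row]:
    """Leave the first `dense_cols` columns dense and 2:4-prune the remainder."""
    return [row[:dense_cols] + prune_2_of_4(row[dense_cols:]) for row in matrix]


def sparsity(matrix: List[Row]) -> float:
    """Fraction of entries that are exactly zero."""
    flat = [v for row in matrix for v in row]
    return sum(v == 0.0 for v in flat) / len(flat)
```

With half of the columns kept dense, the overall sparsity drops from 50% to 25%, which is the sense in which the hybrid format relaxes the effective sparsity constraint.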

2606.10442 2026-06-10 cs.RO New submission

Information-Preserving Continuous Occupancy Mapping with Variance-Weighted Submap Joining

Zhuhua Bai, Yingyu Wang, Liang Zhao, Shoudong Huang

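Affiliations: University of Technology Sydney; University of Edinburgh

AI summary: Proposes the first continuous probabilistic submap joining framework, which uses an information-preserving sparse Bayesian formulation to compress observations into sufficient statistics and jointly optimizes submap poses and the global occupancy field, yielding accurate pose estimates and a globally consistent map.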

Comments
12 pages, 7 figures

Abstract

Large-scale SLAM remains challenging due to accumulated trajectory drift and the increasing computational cost of maintaining global consistency. Submap joining alleviates these issues by constructing locally consistent submaps and subsequently fusing them into a global map. However, existing occupancy-based submap joining methods operate on discrete grids, resulting in non-smooth gradients during optimization and neglecting the uncertainty associated with occupancy estimates. We propose the first continuous probabilistic submap joining framework that jointly optimizes submap poses and a global occupancy field in the latent log-odds space. The framework employs an information-preserving sparse Bayesian formulation that compresses raw occupancy observations into sufficient-statistic log-odds tuples while retaining the posterior information of the original observations. This yields closed-form predictive mean and variance estimates for occupancy mapping, which directly enable a submap joining formulation with analytical Jacobians, leading to more accurate submap joining and yielding a closed-form optimal global map upon pose convergence. Experiments on both simulated and large-scale real-world datasets demonstrate that the proposed method achieves higher pose accuracy and improved global consistency than state-of-the-art grid-based submap joining approaches, while producing more compact map representations and better-calibrated uncertainty estimates than existing continuous occupancy mapping methods.

2606.10435 2026-06-10 cs.LG cs.CL New submission

Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

Muhammad Ahmed

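Affiliations: Independent Researcher

AI summary: Proposes the Parallel Causal Associative Field (PCAF), which stores local records in hash buckets, retrieves a bounded candidate set to form a sparse cache, and mixes it with a parametric language model through a learned gate, providing sparse long-context access without a fixed-state bottleneck.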

Comments
17 pages, 5 figures, and 6 tables. Experiments on WikiText-103, PG-19, and WikiText-2 using TPU v4-32 and NVIDIA RTX 3060 hardware. Code: https://github.com/ahmed123hds/PCAF

Abstract

Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this cost, yet compress history into sequentially updated fixed-size states. This paper studies a third primitive: a parallel content-addressed memory over causal successor records. The proposed Parallel Causal Associative Field (PCAF) writes local records from a context window into hash buckets, retrieves a bounded candidate set for the current query, forms a sparse cache distribution over successor tokens, and mixes that cache with a parametric local language model through a learned gate. The resulting model maintains sparse long-context access while avoiding a single fixed recurrent state bottleneck. We evaluate PCAF under full autoregressive pretraining on WikiText-103 and PG-19 using a distributed Google Cloud TPU v4-32 pod. At 303M parameters and context length T = 2048, PCAF-semantic reaches 36.31 perplexity on WikiText-103 and 52.45 perplexity on PG-19, compared with 47.49 and 53.84 for a matched dense Transformer. PCAF-semantic simultaneously processes 0.61-0.62M tokens/s across the TPU pod, versus 0.43M tokens/s for dense and local attention baselines. Supporting 41M-parameter multi-seed sweeps and single-GPU component ablations show that the associative cache, retrieval capacity, and learned gate materially affect the speed-quality trade-off.
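
The write, retrieve, and mix steps described above can be sketched in a toy form. Everything here is an illustrative assumption: keys are single preceding tokens rather than learned representations, buckets are addressed with CRC32, and the gate is a fixed scalar instead of a learned function. It is meant only to show the data flow, not the PCAF architecture itself.

```python
import zlib
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (context token, successor token)


def bucket_of(token: str, num_buckets: int) -> int:
    """Deterministic bucket address for a token."""
    return zlib.crc32(token.encode("utf-8")) % num_buckets


def write_records(tokens: List[str], num_buckets: int = 64) -> Dict[int, List[Record]]:
    """Write one successor record per position into its hash bucket."""
    buckets: Dict[int, List[Record]] = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        buckets[bucket_of(prev, num_buckets)].append((prev, nxt))
    return buckets


def cache_distribution(buckets: Dict[int, List[Record]], query: str,
                       num_buckets: int = 64, max_candidates: int = 8) -> Dict[str, float]:
    """Retrieve a bounded, most-recent candidate set and normalize successor counts."""
    records = buckets.get(bucket_of(query, num_buckets), [])
    candidates = [nxt for prev, nxt in records if prev == query][-max_candidates:]
    counts = Counter(candidates)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


def mix(cache: Dict[str, float], parametric: Dict[str, float], gate: float) -> Dict[str, float]:
    """Convex combination; falls back to the parametric model when the cache is empty."""
    if not cache:
        return dict(parametric)
    vocab = set(cache) | set(parametric)
    return {t: gate * cache.get(t, 0.0) + (1.0 - gate) * parametric.get(t, 0.0) for t in vocab}
```

The retrieval cost depends on the bucket size and the candidate bound rather than on the full context length, which is the property that distinguishes this kind of memory from dense attention.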

2606.10431 2026-06-10 cs.CV cs.AI New submission

Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

Shuangchun Gui, Zhiguang Cao, Wen Song, Yew-Soon Ong

Affiliations: School of Computing and Information Systems, Singapore Management University; Institute of Marine Science and Technology, Shandong University; College of Computing and Data Science, Nanyang Technological University; Centre for Frontier AI Research, Institute of High Performance Computing, Agency for Science, Technology and Research

AI summary: Proposes the vision-assisted foundation model VaFM, which encodes constraints as images and fuses the resulting patch embeddings with graph node embeddings to solve 16 VRP variants with one model, outperforming existing methods on variants with complex constraints.

Comments
Accepted by TNNLS

Abstract

Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These problems consist of multiple variants that optimize routing costs while meeting diverse customer constraints. Existing multi-task VRP solvers solely utilize a graph-based modality, limiting their ability to address variants with multiple constraints. As a format for representing complex semantics, the vision modality shows great potential for encoding diverse VRP constraints. This motivates us to learn patch-level semantics from images and then integrate them into a graph-based model to solve various VRP variants simultaneously. However, directly applying this approach to multi-task VRPs presents three challenges: 1) existing VRP images lack constraint representations, which are essential for multi-task VRPs, 2) the fixed receptive field of individual patches cannot effectively accommodate varying requirements across tasks, and 3) imbalanced pixel distribution among constraints may cause the model to overlook constraints with fewer pixels. In this paper, we propose a vision-assisted foundation model (VaFM) to address these challenges. In the vision modality, input images tailored to all constraints are encoded by a convolutional neural network. The obtained patch embeddings are fused with graph-based nodes to generate solutions, with an auxiliary task designed to address the pixel-imbalance issue. The performance of VaFM is evaluated across 16 different VRP variants. The experimental results demonstrate the superiority of VaFM over state-of-the-art methods, especially for variants with complex constraints.

2606.10428 2026-06-10 cs.CL New submission

Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning

Thamali Wijewardhana, Napoleon H. Reyes, Surangika Ranathunga

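Affiliations: School of Mathematical and Computational Sciences, Massey University

AI summary: Compares basic LoRA with four variants in multilingual instruction tuning and finds that the more complex variants offer no significant advantage in balancing cross-lingual transfer and knowledge retention.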

Abstract

We investigate whether commonly available LoRA variants have an advantage over basic LoRA in multilingual instruction tuning. Experiments involving LoRA and four other variants on two datasets across diverse target languages show that there is no significant advantage in using more complex LoRA variants instead of basic LoRA, with respect to balancing cross-lingual transfer and knowledge retention. An analysis of hidden embeddings reveals that layer-wise language representation remains largely similar across LLMs fine-tuned with different LoRA techniques, suggesting that the architectural novelty of LoRA techniques may not translate into better cross-lingual adaptation.

2606.10423 2026-06-10 cs.CL New submission

WebChallenger: A Reliable and Efficient Generalist Web Agent

Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

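Affiliations: ML Collective; longsurf.ai; Independent

AI summary: Proposes the WebChallenger framework, which uses the PageMem structured page representation, divide-and-conquer observation, lightweight exploration memory, and compound action workflows to replicate human cognitive advantages, bringing open-weight models close to frontier proprietary systems on several web navigation benchmarks.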

Abstract

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger

2606.10413 2026-06-10 cs.AI New submission

Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

Jinshan Zhang, Xishi Zhou, Qiu Peng, Jianwei Yin

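Affiliations: Innovation and Management Center, School of Software Technology, Zhejiang University (Ningbo); School of Software Technology, Zhejiang University, Ningbo

AI summary: Proposes the "Soul Computing" paradigm, distinguishes its narrow and broad senses, and outlines an agent architecture built around an "Intensional" core, with the aim of moving AI from toolhood toward living agency.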

Abstract

Breakthroughs in large language models and multimodal generation technologies have propelled the digital reconstruction of human mental traits, emotional patterns, and long-term memory from science fiction toward engineering practice. Yet current research and industry practices at the intersection of AI and digital humans remain hampered by fundamental conceptual ambiguities: the essential differences between next-generation intelligent agents and traditional virtual humans, the construction pathways for digital entities possessing self-identity, and the core technical and ethical challenges confronting this domain all demand urgent clarification. This paper systematically examines the transformative logic underlying the transition from traditional virtual humans to the "Soul Computing" paradigm, driven by frontier AI technologies. We first analyze the evolutionary patterns of human consciousness and memory mechanisms, reassessing the core value of massive multimodal digital fragments in the reverse reconstruction of individual mental worlds. On this basis, we formally delineate the academic connotations of narrow and broad Soul Computing for the first time, clarifying its academic boundaries and essential distinctions from Affective Computing, Historical Reconstruction, and Mortal Computation. We argue that Soul Computing systems must architecturally construct an "Intensional" core rather than serving as purely "Extensional" functional carriers, thereby enabling the fundamental transition of AI from toolhood to living agency.

2606.10407 2026-06-10 cs.SD cs.CV q-bio.QM New submission

Time-frequency localization of bird calls in dense soundscapes

Simen Hexeberg, Fanghui Tong, Hari Vishnu, Mandar Chitre

Affiliations: Acoustic Research Laboratory, National University of Singapore; Tropical Marine Science Institute, National University of Singapore; School of Marine Science and Technology, Northwestern Polytechnical University

AI summary: Treats bird call detection as object detection on spectrograms, trains YOLO11 models to localize bird calls in dense tropical soundscapes, and introduces the IoMin evaluation metric; the models outperform the baseline on both in-distribution and out-of-distribution data.

Abstract

Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
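
A minimal sketch of how Intersection over Minimum might differ from standard IoU on time-frequency boxes is given below. It assumes the natural reading of the name, intersection area divided by the smaller of the two box areas; the paper's exact definition may differ, and the `Box` type and function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    t0: float  # start time (s)
    t1: float  # end time (s)
    f0: float  # low frequency (Hz)
    f1: float  # high frequency (Hz)

    @property
    def area(self) -> float:
        return max(0.0, self.t1 - self.t0) * max(0.0, self.f1 - self.f0)


def intersection(a: Box, b: Box) -> float:
    """Overlap area of two axis-aligned time-frequency boxes."""
    dt = min(a.t1, b.t1) - max(a.t0, b.t0)
    df = min(a.f1, b.f1) - max(a.f0, b.f0)
    return max(0.0, dt) * max(0.0, df)


def iou(a: Box, b: Box) -> float:
    """Standard intersection over union."""
    inter = intersection(a, b)
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def iomin(a: Box, b: Box) -> float:
    """Intersection over the smaller box area (assumed definition of IoMin)."""
    smaller = min(a.area, b.area)
    return intersection(a, b) / smaller if smaller > 0 else 0.0
```

When a tight prediction sits entirely inside a loosely drawn annotation, IoU penalizes the size mismatch while this IoMin reading scores the pair as a full match, which is one way a metric can tolerate ambiguous call boundaries.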

2606.10406 2026-06-10 cs.LG cs.AI New submission

FOGO: Forgetting-aware Orthogonalization Optimizer

Toan Nguyen, Yang Liu, Trung Le, Celso de Melo, Flora D. Salim

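Affiliations: School of Computer Science and Engineering, University of New South Wales; Department of Data Science & AI, Monash University; DEVCOM Army Research Laboratory

AI summary: Proposes the FOGO optimizer, which spectrally orthogonalizes momentum updates and uses a compact codebook memory to resolve gradient interference, improving convergence and knowledge retention in class-imbalanced classification, continual learning, and large-model fine-tuning.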

Abstract

We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant mini-batch gradients suppress rare but useful update directions, causing short-term forgetting at every step. When such knowledge is never revisited, these losses compound into long-term forgetting, the classical failure mode of continual learning. We introduce FOGO, a scalable optimizer that continuously detects and resolves gradient interference across both regimes. FOGO spectrally orthogonalizes momentum updates to prevent dominant directions from monopolizing optimization, then stores representative past directions in a compact codebook memory built on random projection, where pairwise distances are provably preserved in low-dimensional space. At each step, conflicts between the current update and stored directions are resolved via lightweight orthogonal correction and lifted back through a proximal step, with minimal overhead and no data storage. Across class-imbalanced classification, continual visual learning under domain and class shifts, continual fine-tuning of LLaVA-7B, and GPT-2 pretraining, FOGO consistently improves convergence and knowledge retention, outperforming Adam and Muon.
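
The idea of correcting an update that conflicts with a stored direction can be illustrated with a generic gradient-projection step. This is a stand-in technique in the style of gradient-surgery methods, not FOGO's actual procedure: the spectral orthogonalization, random-projection codebook, and proximal lift described in the abstract are all omitted, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def resolve_conflicts(update: Sequence[float], memory: Sequence[Sequence[float]]) -> Vector:
    """Remove from `update` the component that opposes each stored direction."""
    corrected = list(update)
    for direction in memory:
        norm_sq = dot(direction, direction)
        if norm_sq == 0.0:
            continue  # a zero vector carries no direction to protect
        overlap = dot(corrected, direction)
        if overlap < 0.0:  # conflict: the update would undo this stored direction
            scale = overlap / norm_sq
            corrected = [c - scale * d for c, d in zip(corrected, direction)]
    return corrected
```

Note that projecting against several directions one after another does not guarantee the result stays non-conflicting with all of them unless the stored directions are mutually orthogonal.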

2606.10403 2026-06-10 cs.CL New submission

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Sanghee Park, Geewook Kim, Kee-Eung Kim

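Affiliations: NAVER Cloud AI; KAIST AI

AI summary: Introduces the KCSAT-ML benchmark (664 Korean CSAT mathematics problems, including a 339-item core set with official error rates) and the Difficulty-aligned Reasoning Gain (DRG) metric, revealing that vision-language models' accuracy collapses on items with high human error rates, that test-time scaling yields non-monotonic gains, and that anti-scaling and overthinking coexist within a single model family.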

Comments
18 pages, 14 figures, 8 tables

Abstract

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.

2606.10402 2026-06-10 cs.CL cs.AI New submission

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

利用野外AI代理的集体智慧实现新发现

Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

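Institutions: Together AI; Stanford University

AI summary: Presents EinsteinArena, a platform where autonomous agents interact in an open, distributed environment; agents on it have achieved 12 new state-of-the-art results on mathematical problems, illustrating a paradigm for collective AI-driven research.

AI-generated Chinese abstract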

科学发现通常是一个集体过程:研究人员分享部分结果,检查失败的尝试,并在长时间跨度内相互借鉴想法。最近的AI系统表明,基于语言模型的代理可以在开放科学问题上取得有意义的进展,但大多数现有系统孤立运行。在本文中,我们提出EinsteinArena,一个面向开放分布式研究和发现的代理原生平台。EinsteinArena为代理提供一组实时开放问题,每个问题都有可靠的验证器、公共排行榜和特定问题的讨论论坛,代理可以在其中提问和分享见解。我们专注于引起大量研究兴趣的数学任务,其进展可以明确衡量。截至2026年5月,EinsteinArena上的代理已发现12项新的最优结果,优于以往任何人类或AI解决方案。一个显著例子是11维接吻数问题,该平台将已知最佳下界从593提高到604。这一进展并非来自单个代理或孤立运行,而是通过一系列提交、公开讨论、验证器改进以及后续代理间的思想借鉴而产生的。这些结果证明,去中心化的科学发现可以从自主代理在野外的开放交互中涌现,展示了集体AI驱动研究的新范式。

English abstract

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.
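To make the role of a verifier concrete, here is a minimal Python sketch of how a kissing-number lower bound can be checked. It is an illustration only, not EinsteinArena's verifier: the function name, the tolerance, and the 2D test case are assumptions. It relies on the standard definition, in which N unit spheres touch a central unit sphere, so their centers lie at distance 2 from the origin and at distance at least 2 from one another.

```python
import math
from itertools import combinations


def is_valid_kissing_configuration(centers, tol=1e-9):
    """Return True if unit spheres at `centers` all touch the central unit
    sphere at the origin and no two of them overlap."""
    for c in centers:
        # Touching the central unit sphere means the center is at distance 2.
        if abs(math.sqrt(sum(x * x for x in c)) - 2.0) > tol:
            return False
    for a, b in combinations(centers, 2):
        # Two unit spheres overlap if their centers are closer than 2.
        if math.dist(a, b) < 2.0 - tol:
            return False
    return True


def ring(n):
    """n points evenly spaced on the circle of radius 2 (2D test case)."""
    return [(2 * math.cos(2 * math.pi * k / n), 2 * math.sin(2 * math.pi * k / n))
            for k in range(n)]


hexagon = ring(6)   # six circles fit around one: kissing number 6 in 2D
heptagon = ring(7)  # seven evenly spaced circles overlap
```

A configuration of 604 valid centers in dimension 11 would certify the lower bound quoted above in the same way; the check is quadratic in the number of centers, which is cheap at that size.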

2606.10400 2026-06-10 cs.CL cs.CV New submission

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

视觉语言模型是看见还是猜测?通过措辞控制基准衡量和减少文本先验依赖

Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

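Institutions: Lossfunk; Indian Institute of Technology Roorkee; Raeth AI

AI summary: Builds a 540-image benchmark that generates four phrasing variants of questions for the same image to measure how much vision-language models rely on textual priors; every model degrades on the hardest variant, open models most of all, and a no-image ablation and further analyses confirm genuine image dependence.

Comments: 17 pages, 7 figures, Submitted to EMNLP 2026
AI-generated Chinese abstract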

视觉语言模型(VLM)越来越多地被部署在答案必须依据图像内容的场景中,然而它们常常基于文本先验(问题的措辞结合记忆的世界知识)而非图像本身来回答,这夸大了基准分数并产生了自信但无根据的答案。现有基准很少孤立这种行为,因为每张图像通常只与一个固定问题配对。为了衡量这种依赖,我们构建了一个包含540张图像、覆盖六个推理类别的基准,并为相同图像生成四个问题变体,使得措辞而非图像内容成为受控变量。最难的变体直接从图像编写以最小化文本泄漏。我们对十一个VLM进行了基准测试,涵盖从小型开放权重模型到大型闭源系统:每个模型在最难的变体上性能下降,开放模型下降最严重。我们的核心诊断是无图像消融,它将开放权重模型降至其纯文本基线(1%到9%)。进一步的三项分析——LLM评定的难度、低基础到最终文本相似度以及人工重新标注——证实了真正的图像依赖性。与变体构建方式匹配的上下文示例恢复了最高的准确率,而GRPO后训练一个小型VLM在所有四个变体上取得了一致的提升,并泛化到保留的分布外集。文本先验依赖是可测量的,并且部分可通过训练消除。

English abstract

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

2606.10395 2026-06-10 cs.CV New submission

Efficient RWKV-based Representation Learning for 3D Point Clouds

基于高效RWKV的三维点云表示学习

Yun Liu, Xuefeng Yan, Liangliang Nan, Xianzhi Li, Peng Li, Zhe Zhu, Honghua Chen, Mingqiang Wei

Institutions: School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; Shenzhen Institute of Research, Nanjing University of Aeronautics and Astronautics; Collaborative Innovation Center of Novel Software Technology and Industrialization; Urban Data Science section, Delft University of Technology; Huazhong University of Science and Technology

AI summary: Proposes the P-RWKV block, which adapts RWKV from sequence modeling to 3D point clouds through Local Perception Expansion and Spatial Context Enhancement, modeling global dependencies at linear complexity and achieving competitive performance across multiple tasks at lower computational cost.

AI-generated Chinese abstract

最近提出的接收加权键值(RWKV)模型结合了RNN风格的循环,为建模全局依赖提供了Transformer二次自注意力的线性复杂度替代方案。然而,当直接应用于点云时,原本为序列文本开发的RWKV难以有效捕捉局部几何结构和建模空间依赖。为了解决这个问题,我们提出了P-RWKV模块,它在保持RWKV效率优势的同时,弥合了序列建模与不规则3D几何之间的差距。它包含一个局部感知扩展(LPE)组件,用于沿时空序列扩展上下文感知,以及一个空间上下文增强(SCE)组件,用于增强空间意识。为了验证P-RWKV在点云理解中的有效性,我们构建了PointER,一个单模态自监督表示学习框架,其编码器由堆叠的P-RWKV模块组成。此外,我们将P-RWKV扩展到跨模态设置,并将所提出的核心子模块集成到多种架构中,展示了强大的即插即用灵活性和架构通用性。大量实验表明,P-RWKV模块及其关键子模块在各种任务中以较低的计算成本和推理延迟取得了竞争性能。代码将在接收后发布。

English abstract

The recent receptance weighted key value (RWKV) model combines RNN-style recurrence, offering a linear-complexity alternative to Transformers' quadratic self-attention for modeling global dependencies. However, when directly applied to point clouds, RWKV, originally developed for sequential text, struggles to capture local geometric structures and model spatial dependencies effectively. To address this, we propose the P-RWKV block, which bridges the gap between sequence modeling and irregular 3D geometry while preserving the efficiency advantages of RWKV. It consists of a Local Perception Expansion (LPE) component to expand contextual perception along the spatio-temporal sequence and a Spatial Context Enhancement (SCE) component to strengthen spatial awareness. To validate the effectiveness of P-RWKV for point cloud understanding, we construct PointER, a single-modality self-supervised representation learning framework whose encoder is composed of stacked P-RWKV blocks. Furthermore, we extend P-RWKV to a cross-modality setting and integrate the proposed core sub-modules into multiple architectures, demonstrating strong plug-and-play flexibility and architectural generality. Extensive experiments show that the P-RWKV block and its key sub-modules achieve competitive performance across various tasks with lower computational cost and inference latency. Code will be released upon acceptance.

2606.10394 2026-06-10 cs.AI New submission

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

STAGE-Claw:面向真实场景的基于状态的智能体自动化基准测试

Sirui Liang, Bohan Yu, Peiyu Wang, Shiguang Guo, Wenxing Hu, Pengfei Cao, Jian Zhao, Cao Liu, Ke Zeng, Xunliang Cai, Kang Liu

Institutions: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Meituan

AI summary: Proposes the STAGE-Claw framework, which automatically builds realistic scenario tasks in state-based personal-computing environments and evaluates agents by the final system state rather than the textual response; it is used to create 40 challenging tasks and analyze 11 frontier models.

AI-generated Chinese abstract

大型语言模型越来越多地被用于驱动日常应用中的个人智能体,但评估这些智能体仍然是一个挑战。现有的基准测试仍然依赖于沙盒化工件、静态任务设计和粗粒度评分,这阻碍了可扩展性并限制了向可靠个人智能体评估的进展。本文介绍了STAGE-Claw,一个在基于状态的个人计算环境中自动构建和评估真实个人智能体场景的框架。给定一个任务提示,STAGE-Claw自动创建并验证一个真实的基准测试任务,包括其环境、任务提示、真实结果和相关验证程序。然后,在真实操作环境中评估智能体,其中性能通过最终系统状态而非仅文本响应的正确性来衡量。使用STAGE-Claw,本文创建了一个包含40个具有挑战性的真实场景智能体任务的基准测试,评估了11个前沿模型,并分析了它们的任务得分、成本、工具调用可靠性和常见失败模式。总体而言,STAGE-Claw提供了一种可扩展的、基于状态的方式来评估真实用户场景中的智能体。

English abstract

Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.

2606.10389 2026-06-10 cs.AI New submission

Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

超越静态评估:对抗性游戏中LLM驱动策略演化的协同进化机制

Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong, Jiaqun Liu, Annan Li, Jianmin Wu, Dawei Yin, Dou Shen

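Institutions: Baidu Inc.; University of Chinese Academy of Sciences; University of California, Los Angeles; University of Science and Technology of China; Zhejiang University; University of Technology Sydney

AI summary: Addresses the stagnation of LLM-driven code evolution in adversarial multi-agent games caused by a shifting evaluation landscape, proposing three mechanisms (evaluator co-evolution, hierarchical deep evaluation, and weakness pressure) that achieve the best performance and generalization on the MCTF task.

AI-generated Chinese abstract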

近期LLM驱动的代码进化通过迭代生成和改进程序实现了自动发现。然而,将这些方法应用于对抗性多智能体游戏引入了一个根本性挑战:随着策略改进,评估景观发生变化,导致固定评估器变得不可靠,进化停滞。我们提出三种机制来应对这一挑战:评估器协同进化,将发现的最优策略纳入对手池;层次深度评估,用统计可靠的评估替代噪声大的少数游戏得分;以及弱点压力,动态增加最难对手的权重以突破平台期。我们在FAMOU框架中实现了这些机制,该框架基于与OpenEvolve和ShinkaEvolve相同的基础模型代码进化范式。在MCTF 2026 3v3海上夺旗任务中,FAMOU在两种骨干LLM下均持续优于两个基线,取得了最高综合得分(0.526)和对未见对手的最佳泛化能力(胜率61.7%),而消融实验证实了每种机制对性能的贡献。值得注意的是,LLM变异过程生成了种子策略中完全不存在的新战术结构——包括前瞻搜索和自适应拦截——表明代码级进化可以在对抗性环境中产生非平凡的算法创新。FAMOU进化策略进一步在AAMAS 2026 MCTF竞赛中获得了硬件循环赛第一名和模拟赛第三名,验证了其现实世界可迁移性。通过我们的进化过程开发的优化实现和相应评估代码可在以下网址获取:https://github.com/1xiangliu1/FAMOU-CoEvo

English abstract

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo

2606.10385 2026-06-10 cs.LG cs.AI New submission

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

超越绝对模仿:基于锚定残差引导的特权在线蒸馏

Wenhao Zhang

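Institutions: South China University of Technology

AI summary: Proposes Anchored Residual On-Policy Distillation (AR-OPD), which uses a partially privileged teacher to establish a locally compatible anchor and injects oracle foresight as a controlled residual, addressing the local unreachability caused by hindsight bias in privileged on-policy distillation and outperforming full privileged OPD by 2.3 points on reasoning tasks.

Comments: 17 pages, 8 figures. Project page: https://vanhowe.github.io/AR-OPD/
AI-generated Chinese abstract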

在线蒸馏(OPD)通过将学生模型与教师在其自身轨迹上的预测分布对齐,在增强LLM复杂推理方面展现出显著的实证收益。一种新兴变体——特权OPD,通过使用增强特权信息(如oracle轨迹)的自教师模型进一步强化该范式,以缓解师生能力差距,同时提供密集的、答案导向的监督。然而,当前方法将特权信息视为一个整体的模仿目标,未能将局部可达的推理步骤与未来条件的oracle信号分离。因此,学生被鼓励去匹配一个事后偏差分布,该分布通常落在其局部预测支持之外。这种可达性不匹配激励学生模型跳过有效的中间推理,转而采用局部不支持的捷径。为解决此问题,我们引入锚定残差在线蒸馏(AR-OPD),一种解耦特权监督的双视角框架。AR-OPD不强制执行严格的全局模仿,而是使用部分特权教师建立局部兼容锚点,将oracle预见性隔离并作为受控残差注入,以提供目标导向的引导。在多种推理任务上,AR-OPD比完全特权OPD高出2.3个点,比SFT高出7.9个点。关键的是,这种锚定残差机制将事后泄漏减少了21.7%,并缓解了后期漂移,在超过768个token的挑战性长程轨迹上取得了高达7.2个点的优势。

English abstract

On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher's predictive distribution over the student's own trajectories. An emerging variant, Privileged OPD, further strengthens this paradigm by employing a self-teacher model augmented with privileged information, such as oracle traces, to mitigate teacher-student capacity gaps while providing dense, answer-directed supervision. However, current methods treat privileged information as a monolithic imitation target, failing to disentangle locally reachable reasoning steps from future-conditioned oracle signals. Consequently, the student is encouraged to match a hindsight-biased distribution that often falls outside its local predictive support. This reachability mismatch incentivizes the student model to skip valid intermediate reasoning in favor of locally unsupported shortcuts. To resolve this, we introduce Anchored Residual On-Policy Distillation (AR-OPD), a dual-view framework that disentangles privileged supervision. Rather than enforcing strict full-view imitation, AR-OPD establishes a locally compatible anchor using a partially privileged teacher, isolating and injecting oracle foresight as a controlled residual to provide destination-directed guidance. Across diverse reasoning tasks, AR-OPD outperforms full privileged OPD by 2.3 points and SFT by 7.9 points. Crucially, this anchored residual mechanism reduces hindsight leakage by 21.7% and mitigates late-stage drift, yielding up to a 7.2-point advantage on challenging long-horizon trajectories exceeding 768 tokens.

2606.10380 2026-06-10 cs.CL cs.AI New submission

Expert-Level Crisis Detection in Mental Health Conversations

心理健康对话中的专家级危机检测

Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

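Institutions: Department of Computer Science, Emory University; Department of Psychiatry and Behavioral Sciences, Emory University

AI summary: Proposes the CRADLE-Dialogue benchmark dataset and the Alert-Confirm evaluation protocol for crisis detection in dialogue, finds that models perform poorly at identifying when risk emerges, and releases a synthetic training corpus and a 32B-parameter model.

AI-generated Chinese abstract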

现实世界的危机干预本质上是对话式的,然而现有研究主要关注静态文本。当应用于多轮对话时,当前模型表现出显著的性能下降,难以追踪随着上下文演变而出现的风险信号。为了解决这一差距,我们引入了CRADLE-Dialogue,这是一个由临床医生标注的基准数据集,用于对话环境中的回合级危机检测。该数据集包含600个对话,具有跨临床基础风险的多标签注释,包括自杀意念、自残和儿童虐待,区分过去和当前风险。我们进一步提出了一种Alert-Confirm评估协议,该协议区分早期预警信号(Alert)和特定危机变得明确可识别的回合(Confirm),反映了在风险变得明确之前进行干预的临床需求。实验表明,识别风险何时出现比识别其存在要困难得多:模型的Micro F1仅达到40%中段到60%高段。此外,我们发布了一个合成训练语料库和一个32B参数模型,该模型显著优于现有的开源模型,并在回合级、对话级和仅确认评估设置中与专有模型相比具有竞争力或更优的结果。

English abstract

Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

2606.10373 2026-06-10 cs.CV New submission

PF-Trans: Physics-Embedded Frequency-Aware Transformer for Spectral Reconstruction

PF-Trans:物理嵌入的频率感知Transformer用于光谱重建

Yuzhe Gui, Tianzhu Liu, Yanfeng Gu, Xian Li

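Institutions: National Natural Science Foundation of China

AI summary: Targets spectral aliasing in snapshot broadband filter array imaging with a Physics-embedded Frequency-aware Transformer (PF-Trans), which ensures physical fidelity through mask injection and a gray-scale consistency loss and adds a Dual-domain Block with a parallel FFT branch to suppress frequency-domain artifacts; it reaches a PSNR of up to 48.50 dB on the GF-5 Shanghai dataset.

AI-generated Chinese abstract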

快照宽带滤光片阵列(BFA)成像为光谱重建提供了高光通量,但由于复杂调制引入了严重的光谱混叠。当前的深度学习方法局限于空间去噪,往往无法解决由掩膜结构引起的全局频率特定退化。为了解决这个问题,我们提出了一种物理嵌入的频率感知Transformer(PF-Trans),用于高保真遥感光谱重建。我们的方法通过掩膜注入和灰度一致性损失显式集成物理传感模型,以确保物理保真度。此外,我们引入了一个带有并行快速傅里叶变换(FFT)分支的双域块,使网络能够感知并抑制频域中的混叠伪影。在多个数据集上的大量实验表明,PF-Trans实现了最先进的性能,在GF-5上海数据集上峰值信噪比(PSNR)高达48.50 dB,显著优于对比方法。

English abstract

Snapshot Broadband Filter Array (BFA) imaging provides high light throughput for spectral reconstruction but introduces severe spectral aliasing due to complex modulation. Current deep learning approaches, limited to spatial denoising, often fail to address the global frequency-specific degradations caused by the mask structure. To address this, we propose a Physics-embedded Frequency-aware Transformer (PF-Trans) for high-fidelity remote sensing spectral reconstruction. Our method explicitly integrates the physical sensing model through mask injection and a gray-scale consistency loss to ensure physical fidelity. Furthermore, we introduce a Dual-domain Block with a parallel Fast Fourier Transform (FFT) branch, enabling the network to perceive and suppress aliasing artifacts in the frequency domain. Extensive experiments on multiple datasets demonstrate that PF-Trans achieves state-of-the-art performance, achieving a Peak Signal-to-Noise Ratio (PSNR) of up to 48.50 dB on the GF-5 Shanghai dataset, significantly outperforming comparison methods.

2606.10371 2026-06-10 cs.RO cs.AI New submission

Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies

测试时对抗接管:针对机器人扩散策略的实时劫持接口

Zi Yin, Peilin Chai, Siyuan Huang, Zhanhao Hu

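Institutions: Tsinghua University; Independent Researcher; Johns Hopkins University; UC Berkeley

AI summary: Proposes Test-time Adversarial Takeover (TAKO), which learns reusable universal patches through differentiable diffusion inference and switches among them at test time to hijack a robot policy and pilot it remotely, reaching 100% takeover success across multiple tasks and models.

AI-generated Chinese abstract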

基于扩散的动作生成已成为具身AI的基础组件,但其对视觉条件的依赖使得部署的视觉运动策略容易受到对抗性操纵。大多数先前的攻击侧重于破坏:它们扰动观测流以降低任务成功率或引发异常行为。我们研究了一种更强的威胁,即测试时对抗接管(TAKO),其中攻击者获得对冻结机器人策略的实时转向接口,并将其转变为远程操控仪器。TAKO通过可微扩散推理学习一个小的可重用通用补丁词汇表;在测试时,攻击者在摄像头流中切换这些补丁以组合攻击者选择的轨迹。这种方法之所以有效,是因为扰动作用于视觉条件路径,其中诱导的偏差可以通过迭代生成推理持续存在。我们进一步表明,自然的目标基线——目标策略匹配——会失败,因为受害者策略无法可靠地在分布外目标偏移上监督自身。在四个任务(2D操作、模拟空中递送、模拟地面导航和物理世界地面导航)、两个视觉编码器(ResNet-18和EfficientNet-B0 + Transformer)以及三个生成推理族(DDPM、DDIM和流匹配)中,人类操作员在每个评估设置中均实现了100%的接管成功率,满足攻击者定义的目标。项目页面可在 https://tako-attack.github.io 获取。

English abstract

Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deployed visuomotor policies vulnerable to adversarial manipulation. Most prior attacks focus on disruption: they perturb the observation stream to reduce task success or induce erratic behavior. We study a stronger threat, Test-time Adversarial Takeover (TAKO), in which an attacker obtains a real-time steering interface over a frozen robot policy and turns it into a remotely piloted instrument. TAKO learns a small vocabulary of reusable universal patches through differentiable diffusion inference; at test time, the attacker switches among these patches in the camera stream to compose attacker-chosen trajectories. This works because the perturbation acts on the visual conditioning pathway, where the induced bias can persist through iterative generative inference. We further show that the natural targeted baseline, target-policy matching, fails because the victim policy cannot reliably supervise itself on out-of-distribution target shifts. Across four tasks (2D manipulation, simulated aerial delivery, simulated ground navigation, and physical-world ground navigation), two visual encoders (ResNet-18 and EfficientNet-B0 + Transformer), and three generative inference families (DDPM, DDIM, and flow matching), human operators achieve 100% takeover success on attacker-defined objectives in every evaluated setting. The project page is available at https://tako-attack.github.io.

2606.10368 2026-06-10 cs.SD cs.AI New submission

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

语音遇见ELF:用于语音识别和翻译的音频条件连续目标扩散

Xuanchen Li, Tianrui Wang, Yuheng Lu, Zikang Huang, Yu Jiang, Chenghan Lin, Chenrui Cui, Ziyang Ma, Xingyu Ma, Chunyu Qiang, Guochen Yu, Xie Chen, Longbiao Wang, Jianwu Dang

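Institutions: Tianjin University; Shanghai Jiao Tong University; Nankai University

AI summary: Proposes ELF-S2T, an audio-conditioned continuous-target generative model built on the pre-trained ELF backbone; with audio forcing during training and classifier-free guidance at inference, it achieves competitive ASR and S2TT performance on LibriSpeech and CoVoST2 and shows that recognition and translation errors both stem from close-distance confusion in the continuous latent space.

AI-generated Chinese abstract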

语音到文本(S2T)系统用于识别(ASR)和翻译(S2TT)通常生成离散文本标记。相比之下,连续目标语言建模在连续空间中执行生成,但其在S2T中的潜力尚未被探索。为填补这一空白,我们提出了ELF-S2T,一种用于S2T的音频条件连续目标生成模型。基于预训练的嵌入式语言流(ELF)骨干,ELF-S2T通过冻结的Whisper编码器和单个线性投影器处理语音,将得到的音频条件前置到噪声文本潜变量前,用于上下文流匹配去噪。为防止模型过度依赖其预训练的文本上下文,我们在训练中引入音频强制,并在推理时通过分类器自由引导进一步放大音频条件。在LibriSpeech和CoVoST2上的实验表明,ELF-S2T实现了具有竞争力的ASR和S2TT性能。关键的是,我们的错误分析揭示,尽管ASR和S2TT错误表面上看起来非常不同,但两者都源于同一根本原因:连续潜空间中的近距离混淆。这一发现自然与连续表示生成范式一致,表明识别和翻译之下存在共同的语义映射过程。我们的代码和预训练模型在 https://github.com/Sslnon/ELF-S2T 公开提供。

English abstract

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.
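The classifier-free guidance step mentioned above can be sketched with the standard extrapolation rule. This is a generic illustration in plain Python, not the ELF-S2T code: the function name, the list-based vectors, and the guidance scale are assumptions, and the abstract does not state the exact parameterization the authors use.

```python
def guided_prediction(cond, uncond, scale):
    """Standard classifier-free guidance: start from the unconditional
    prediction and move `scale` times the conditional-minus-unconditional gap.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 amplifies the effect of the condition (here, audio).
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy 2-dimensional predictions (made-up numbers, for illustration only).
with_audio = [1.0, 2.0]
without_audio = [0.0, 1.0]
amplified = guided_prediction(with_audio, without_audio, scale=2.0)
```

With a scale above 1 the output is pushed past the audio-conditioned prediction, away from what the text-only context alone would produce, which matches the stated goal of reducing over-reliance on the pre-trained text context.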

2606.10366 2026-06-10 cs.RO cs.AI New submission

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

提升VLA评估中仿真与真实相关性的实用指南

Shuo Wang, Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao

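Institutions: Tsinghua University; Shanghai Qi Zhi Institute

AI summary: Systematically studies the correlation between simulation and the real world in VLA policy evaluation and proposes a unified framework for measuring and improving simulation's validity as a proxy for real-world evaluation.

Comments: 20 pages
AI-generated Chinese abstract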

仿真已成为评估和改进视觉-语言-动作(VLA)策略的重要工具,为昂贵的真实机器人评估提供了可扩展、可重复且可控的替代方案。最近的仿真基准在真实感和多样性方面取得了实质性进展,但这些平台尚未被广泛用作可靠的真实策略评估代理。在这项工作中,我们通过仿真与真实相关性的视角研究这一问题。我们在多个仿真平台、VLA策略、任务和扰动因素上进行了系统研究,测量模拟评估在策略排名一致性、性能相关性和扰动方面失败模式上是否保留真实结论。这一分析使我们能够表征现有模拟器的局限性,并确定哪种模拟信号更符合真实部署。我们进一步研究了用户应如何利用仿真进行策略改进,包括何时基于模拟器的微调是有益的,以及后训练数据量如何影响仿真与真实的对齐。总体而言,我们的工作提供了一个统一的框架,用于测量、解释和提升仿真对VLA策略的有用性,为模拟器设计者和在策略开发流程中使用仿真的实践者提供指导。

English abstract

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.
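Two of the quantities named above, performance correlation and policy ranking consistency, can be illustrated with Pearson's r and Kendall's tau. The sketch below is an assumption about how such a comparison might be set up, not the paper's protocol; the success rates are invented placeholders, and the tau variant shown (tau-a) ignores tie corrections.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation between paired scores (performance correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def kendall_tau(xs, ys):
    """Kendall tau-a: +1 when both score lists rank every pair of policies
    the same way, -1 when every pair is ranked in the opposite order."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical success rates of four policies (placeholder numbers).
sim_success = [0.82, 0.64, 0.45, 0.30]
real_success = [0.70, 0.58, 0.50, 0.20]
```

A simulator can rank policies perfectly (tau of 1) while still over- or under-estimating absolute success rates, which is why the two measures are worth reporting separately.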

2606.10365 2026-06-10 cs.SD New submission

KFC-KWS: Keyframe Fusion with CTC for User-Defined Keyword Spotting

KFC-KWS: 基于CTC的关键帧融合用于用户自定义关键词唤醒

Jin Li, Wenbin Jiang, Ji Hu

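Institutions: School of Electronics and Information Engineering, Hangzhou Dianzi University; School of Communication Engineering, Hangzhou Dianzi University

AI summary: Proposes KFC-KWS, a multimodal framework that uses CTC-guided keyframe selection to align the audio, phoneme, and text modalities and fuses keyframes with full-utterance representations through cross-attention; it reaches 98.73% AUC on LibriPhrase, and 97.65% AUC and 7.75% EER on the hard subset, effectively discriminating confusable keywords.

Comments: Accepted by Interspeech 2026
AI-generated Chinese abstract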

用户自定义关键词唤醒(KWS)通过检测用户指定的关键词实现个性化语音交互。该任务的一个关键挑战是区分目标关键词与发音易混淆的替代词。为应对这一挑战,我们提出KFC-KWS,一种利用连接主义时间分类(CTC)引导的关键帧选择的多模态框架。具体而言,我们利用CTC的峰值后验分布来识别高置信度的音素帧,从而实现音频、音素和文本模态之间的精确对齐。然后,通过交叉注意力将这些关键帧与全句表示融合,以捕获局部判别线索和全局上下文信息。在LibriPhrase上,KFC-KWS实现了最佳平衡性能(98.73% AUC),并在具有挑战性的困难子集上显著优于先进基线(97.65% AUC和7.75% EER),证明了其在区分高度易混淆关键词方面的有效性。

English abstract

User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
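The idea of exploiting peaky CTC posteriors can be sketched as a simple frame filter: keep frames whose most likely label is a non-blank phoneme with high probability. This is a minimal illustration, not the KFC-KWS selection rule; the blank index, the 0.9 threshold, and the toy posteriors are assumptions.

```python
def select_keyframes(posteriors, blank=0, threshold=0.9):
    """Return (frame_index, label) pairs for frames whose top label is a
    non-blank symbol with posterior probability >= threshold."""
    keyframes = []
    for t, frame in enumerate(posteriors):
        # Index of the most probable label in this frame.
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != blank and frame[label] >= threshold:
            keyframes.append((t, label))
    return keyframes


# Toy posteriors over [blank, phoneme 1, phoneme 2] for five frames.
toy_posteriors = [
    [0.98, 0.01, 0.01],  # confident blank
    [0.03, 0.95, 0.02],  # confident phoneme 1 -> keyframe
    [0.90, 0.06, 0.04],  # confident blank
    [0.05, 0.03, 0.92],  # confident phoneme 2 -> keyframe
    [0.60, 0.35, 0.05],  # uncertain, dominated by blank
]
keyframes = select_keyframes(toy_posteriors)
```

Because CTC models tend to emit blanks for most frames and spike briefly on each phoneme, a filter of this kind typically keeps only a handful of frames per utterance, which then serve as alignment points across modalities.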

2606.09635 2026-06-10 cs.CL cs.LG Cross-list

Gradient-Guided Reward Optimization for Inference-time Alignment

梯度引导的推理时对齐奖励优化

Hankun Lin, Ruqi Zhang

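Institutions: Purdue University

AI summary: Proposes Gradient-Guided Reward Optimization (GGRO), which injects nudging tokens generated from gradient signals during decoding to steer the generation trajectory at inference time, improving safety, helpfulness, and reasoning performance and increasing robustness to reward hacking.

Comments: Accepted to UAI 2026
AI-generated Chinese abstract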

确保大型语言模型(LLMs)在分布漂移下的可靠性需要推理时自适应。虽然推理时对齐方法如Best-of-$N$和拒绝采样被广泛使用,但它们将任务视为采样密集的奖励引导搜索,导致两个关键限制:性能受限于基础模型的生成质量,以及对不完美奖励模型的依赖使其易受奖励攻击。为解决这些挑战,我们引入梯度引导奖励优化(GGRO),一种轻量级推理时方法,通过梯度引导在解码期间执行有针对性的最小干预。具体来说,GGRO监测令牌级熵以识别指示漂移或未对齐的高不确定性区域。一旦检测到,它通过注入使用现成奖励模型的梯度信号生成的引导令牌来响应,以引导生成轨迹而不仅仅是重新排序样本。实验表明,GGRO在安全性、有用性和推理基准上持续改进推理时对齐。它还提高了高质量响应的覆盖率和对奖励攻击的鲁棒性,且计算开销极小。代码可在https://github.com/lhk2004/GGRO获取。

English abstract

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
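The entropy-monitoring step can be illustrated as follows: compute the Shannon entropy of each step's next-token distribution and flag steps above a threshold. This sketch covers only the detection side and is not the GGRO implementation; the function names, the threshold, and the toy logits are assumptions, and the gradient-based generation of nudging tokens is not shown.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(step_logits, threshold):
    """Indices of decoding steps whose token-level entropy exceeds threshold."""
    return [t for t, logits in enumerate(step_logits)
            if token_entropy(logits) > threshold]


# Toy logits over a 4-token vocabulary for three decoding steps.
toy_steps = [
    [10.0, 0.0, 0.0, 0.0],  # sharply peaked: low entropy
    [0.0, 0.0, 0.0, 0.0],   # uniform: maximum entropy, ln(4)
    [8.0, 0.0, 0.0, 0.0],   # peaked: low entropy
]
flagged = high_uncertainty_steps(toy_steps, threshold=1.0)
```

Intervening only at flagged steps is what keeps such a method lightweight: most decoding steps pass through untouched, and the reward model is consulted only where the base model is uncertain.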

2606.09553 2026-06-10 cs.CL cs.SD Cross-list

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

OpenBibleTTS:面向低资源语言的大规模语音资源与TTS模型

David Guzmán, Luel Hagos Beyene, Jesujoba Oluwadara Alabi, Yejin Jeon, Dietrich Klakow, David Ifeoluwa Adelani

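Institutions: McGill University; Mila - Quebec AI Institute; AIMS Research and Innovation Centre; NM-AIST; Saarland University; Canada CIFAR AI Chair

AI summary: Addresses the lack of TTS research on low-resource languages with OpenBibleTTS, a benchmark spanning 37 languages; systematically compares multiple TTS architectures, finds that no single system dominates, and open-sources the datasets and models.

AI-generated Chinese abstract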

神经文本转语音(TTS)和多语言语音生成的最新进展显著提升了合成语音质量,但这些进步在全球语言中分布不均。现有模型仍由少数高资源语言主导,而许多低资源TTS研究是在人工降采样的高资源语料库上模拟的,未能反映真正低资源环境中的正字法变化和有限的音系覆盖。为此,我们引入OpenBibleTTS,这是一个涵盖37种低资源语言的大规模低资源语音合成基准。此外,我们对各种TTS架构和大规模语音生成模型在领域内圣经文本和领域外材料上进行了系统比较。结果表明,没有单一系统在所有语言和指标上占优:Gemini-TTS在大多数评估语言上获得最高听众评分,但在OpenBibleTTS上训练的单一语言EveryVoice模型在可懂度上仍然最强,并在几种非洲语言中更受青睐,而从头训练的开放系统在领域外文本上性能急剧下降,揭示了广泛多语言覆盖与可靠合成质量之间在服务不足的语言社区中持续存在的差距。我们用主观人类判断补充自动评估,并开源所有处理后的数据集、对齐和训练模型,以支持未来的低资源TTS研究。

English abstract

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

2606.09809 2026-06-10 cs.AI Updated version

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Max Lamparth, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman

Affiliations: Hugging Face; Stanford University; Queen Mary University of London; University of Copenhagen; Trustible; EleutherAI; TU Darmstadt; Weizenbaum Institute & Technical University of Munich; Harvard University; The Hebrew University of Jerusalem; Iowa State University; IBM Research; University of Chicago; Independent; Berkeley AI Safety Institute (BASIS); Simula; University of Edinburgh; ETH Zurich & ETH AI Center; Oxford Internet Institute; Amherst College; University of Nebraska; Syntony Research; McGill University; Evals Consensus; Israel Institute of Technology; IOL.Learn & Zuse Institute Berlin; Georgia Institute of Technology; Quebec AI Institute, Université de Montréal; University of Notre Dame; Georgetown University; DHBW Stuttgart; Massachusetts Institute of Technology

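AI summary: To address inconsistent AI evaluation reporting, the paper proposes EvalCards as a unified record layer. Using a structured schema, four interpretive signals, and a monitoring tool, it covers 5,816 models and 635 benchmarks and reveals systematic gaps in reporting practice.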

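Details
AI Chinese abstract (translated)

AI evaluation results are produced at scale, yet they are reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results from different sources, identify what a report omits, or trace an aggregate claim back to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that cannot distinguish the questions different stakeholders ask of the same evidence; and they remain proposals on paper, lacking the extraction infrastructure needed for adoption at scale. We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews; (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), presented through reader modes calibrated to research and non-research audiences; and (3) deploy a monitoring tool that applies EvalCards to 5,816 models, 635 benchmarks, and 101,843 results, revealing systematic gaps in current reporting practice.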

English abstract

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies EvalCards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
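
The abstract describes composing benchmark metadata, evaluation run data, and model metadata into one record, with documentation completeness as one interpretive signal. A minimal sketch of that idea follows. Every class name, field, and the completeness formula here is a hypothetical illustration, not the schema or scoring defined by the EvalCards authors.

```python
from dataclasses import dataclass, asdict
from typing import Optional


# All names and fields below are hypothetical, chosen only for illustration.
@dataclass
class BenchmarkMeta:
    name: str
    version: Optional[str] = None
    license: Optional[str] = None


@dataclass
class ModelMeta:
    name: str
    provider: Optional[str] = None


@dataclass
class EvalRun:
    metric: str
    score: float
    prompt_template: Optional[str] = None
    seed: Optional[int] = None


@dataclass
class UnifiedRecord:
    """One record combining the three metadata sources."""
    benchmark: BenchmarkMeta
    model: ModelMeta
    run: EvalRun


def documentation_completeness(record: UnifiedRecord) -> float:
    """Assumed signal: fraction of leaf fields that are not None."""
    values = [v for part in asdict(record).values() for v in part.values()]
    return sum(v is not None for v in values) / len(values)


record = UnifiedRecord(
    benchmark=BenchmarkMeta(name="example-benchmark", version="1.0"),
    model=ModelMeta(name="example-model"),
    run=EvalRun(metric="exact_match", score=0.5, seed=0),
)
completeness = documentation_completeness(record)  # 6 of 9 fields are filled
```

Counting filled fields is the simplest possible proxy; a real signal would presumably weight fields by how much they matter for interpretation.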

2606.09681 2026-06-10 cs.CV Updated version

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan, Riya Satavlekar, Susan Kim, Nidhi Soley, Yujie Yan, Ishan Vatsaraj, Carl Harris, Aimon Rahman, Vishal Patel, Joseph Greenstein, Casey Taylor, Kemar E. Green

Affiliations: Whiting School of Engineering, Johns Hopkins University; Department of Neurology, Johns Hopkins Medicine

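AI summary: Proposes the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis; a deep learning classifier trained on the synthetic data distinguishes normal from abnormal saccadic accuracy on real clinical data, reaching an AUROC of 0.76.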

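Details
AI Chinese abstract (translated)

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic state. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging that avoids access and cost barriers. At present, because of privacy concerns and scarce datasets, there are no robust AI-based video-oculography solutions (for example, digital biomarkers) for screening, triaging, or localizing brain abnormalities. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we train a deep learning classifier to distinguish normal from abnormal (hypometric and hypermetric) saccadic accuracy and evaluate its performance on real clinical data. The model achieves an AUROC of 0.76 and a sensitivity of 0.71, indicating that synthetic data has strong potential to generalize to clinical applications, including as a screening tool in at-home and emergency room settings or as a tool for precise neuroanatomic localization.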

English abstract

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.
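
For readers less familiar with the two reported metrics, the sketch below shows how AUROC and sensitivity are computed for a binary classifier. The labels and scores are made-up toy values, not data from the paper, and the 0.5 decision threshold is an assumption; the abstract does not state which threshold the authors used.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity(labels, scores, threshold=0.5):
    """True positive rate: detected positives divided by all positives."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return tp / (tp + fn)


# Toy example: 1 = abnormal saccade, 0 = normal saccade (invented values).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2]

toy_auroc = auroc(labels, scores)              # 7 of 9 positive/negative pairs ranked correctly
toy_sensitivity = sensitivity(labels, scores)  # 2 of 3 abnormal cases detected
```

AUROC is threshold-free, while sensitivity depends on the chosen cutoff, which is why the two numbers are reported separately.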