arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.12369 2026-05-20 cs.CL

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang

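AI summary: This paper introduces TaxoBench, a benchmark that evaluates how well deep research agents retrieve and organize papers, and finds bottlenecks in both capability and alignment.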

Abstract

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.
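
The retrieval side of the evaluation above (Recall/Precision/F1 against expert-cited papers) can be illustrated with a minimal sketch. The function name and the use of plain paper identifiers are assumptions made for illustration; the benchmark's own matching procedure (for example, title normalization) is not described in the abstract and may differ.

```python
def retrieval_scores(retrieved, expert_cited):
    """Return (precision, recall, f1) of a retrieved paper set against a reference set.

    Both arguments are iterables of paper identifiers (hypothetical IDs here).
    """
    retrieved, expert_cited = set(retrieved), set(expert_cited)
    hits = len(retrieved & expert_cited)  # papers that appear in both sets
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(expert_cited) if expert_cited else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = retrieval_scores(["a", "b", "c", "d"], ["a", "b", "x", "y", "z"])
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Under this definition, the 20.92% figure quoted in the abstract corresponds to the recall term: the share of expert-cited papers that the agent managed to retrieve.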

2601.05437 2026-05-20 cs.CL cs.AI

Tracing Moral Foundations in Large Language Models

Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

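AI summary: This paper studies how moral foundations are encoded, organized, and expressed in large language models. Using a multi-level approach, it analyzes how well the models' moral-foundation representations align with human moral perceptions, and finds that moral structure emerges naturally during pretraining and fine-tuning and is partly disentangled.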

Abstract

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

2512.24470 2026-05-20 cs.RO cs.AI

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

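AI summary: This paper proposes a vision-language-model-based approach to semantic hazard detection and safety maneuvers that targets the draft IMO MASS Code requirements for autonomous maritime vessels. It combines a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver and is evaluated on 40 harbor scenes.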

Comments 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/

Journal ref Ocean Engineering 359, Part 3 (2026), Article 124646

Abstract

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy

2512.23461 2026-05-20 cs.LG cs.AI

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

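AI summary: This paper proposes DIR, an information-theoretic debiasing method for reward models. It maximizes the mutual information between reward model scores and human preference pairs while minimizing the mutual information between reward model outputs and biased attributes of the preference inputs, which mitigates inductive biases and improves RLHF performance.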

Comments Published as a conference paper at The International Conference on Learning Representations (ICLR) 2026

Abstract

Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually preferred by humans but contain more words, which makes response length one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: response length, sycophancy, and format. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.
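
As a rough illustration of the trade-off described above, the objective can be written schematically as below. The symbols are assumptions chosen for exposition, not the paper's notation: r_theta is the reward model, (x, y_1, y_2) is a prompt with a pair of responses, z is the human preference label for that pair, b is a biased attribute of the input such as response length, and beta >= 0 is a trade-off weight.

```latex
\max_{\theta} \; I\bigl(r_\theta(x, y_1, y_2);\, z\bigr) \;-\; \beta \, I\bigl(r_\theta(x, y_1, y_2);\, b\bigr)
```

The first term rewards scores that are informative about human preferences; the second penalizes scores that are informative about the bias attribute. How each mutual-information term is estimated or bounded in practice is not stated in the abstract.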

2512.16856 2026-05-20 cs.AI

Distributional AGI Safety

Nenad Tomašev, Matija Franklin, Julian Jacobs, Sébastien Krier, Simon Osindero

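AI summary: This paper proposes a distributional AGI safety framework that addresses the safety risks of coordination among groups of agents by designing and implementing virtual agentic sandbox economies, emphasizing market mechanisms, auditability, and oversight.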

Abstract

AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centres on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.

2512.11234 2026-05-20 cs.CV

RoomPilot: Controllable Indoor Scene Synthesis via Multimodal Semantic Parsing

Wentang Chen, Shougao Zhang, Yiman Zhang, Tianhao Zhou, Ruihui Li

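AI summary: This work proposes RoomPilot, a framework for controllable indoor scene synthesis via multimodal semantic parsing. It addresses the limited input modalities and implicit generation processes of existing methods and improves control over scene structure and semantics.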

Comments 30 pages, 8 figures

Abstract

Generating controllable indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI. However, existing approaches either support only limited input modalities or rely on implicit generation processes that hinder precise control over scene structure and semantics. To address these limitations, we introduce RoomPilot, a unified framework for controllable indoor scene synthesis from multi-modal inputs, including textual descriptions and CAD floor plans. RoomPilot maps heterogeneous inputs into an Indoor Domain-Specific Language (IDSL), which serves as a structured and interpretable semantic representation for describing indoor scenes. Built upon IDSL, RoomPilot presents a hierarchical synthesis pipeline that progressively organizes scenes at the building, room, and object levels, promoting structural coherence and functional consistency across multi-room layouts. Moreover, RoomPilot constructs a curated asset dataset with rich semantic annotations to support high-quality scene synthesis, improving visual realism and appearance consistency. Extensive experiments demonstrate effective multi-modal understanding, fine-grained controllability in scene generation, and improved physical consistency and visual fidelity, marking a significant step toward controllable 3D indoor scene synthesis. Code and model will be available.

2512.10891 2026-05-20 cs.RO cs.LG

Iterative Compositional Data Generation for Robot Control

Anh-Quan Pham, Marcel Hussing, Shubhankar P. Patankar, Dani S. Bassett, Jorge Mendez-Mendez, Eric Eaton

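AI summary: This paper proposes a semantic compositional diffusion transformer that uses attention to learn the interactions among robot-, object-, obstacle-, and objective-specific components. After training on a limited set of tasks, it generates high-quality transitions zero-shot, from which control policies for unseen task combinations can be learned, and an iterative self-improvement procedure further raises zero-shot performance.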

Abstract

Collecting robotic manipulation data is expensive, making it impractical to acquire demonstrations for the combinatorially large space of tasks that arise in multi-object, multi-robot, and multi-environment settings. While recent generative models can synthesize useful data for individual tasks, they do not exploit the compositional structure of robotic domains and struggle to generalize to unseen task combinations. We propose a semantic compositional diffusion transformer that factorizes transitions into robot-, object-, obstacle-, and objective-specific components and learns their interactions through attention. We show that, once trained on a limited subset of tasks, our model can zero-shot generate high-quality transitions from which control policies for unseen task combinations can be learned. We then introduce an iterative self-improvement procedure in which synthetic data is validated via offline reinforcement learning and incorporated into subsequent training rounds. Our approach substantially improves zero-shot performance over monolithic and hard-coded compositional baselines, ultimately solving nearly all held-out tasks and demonstrating the emergence of meaningful compositional structure in the learned representations.

2512.08237 2026-05-20 cs.CV

Fast-BEV++: Fast by Algorithm, Deployable by Design

Yuanpeng Chen, Hui Song, Sheng Yang, Wei Tao, Shanhui Mo, Shuang Zhang, Xiao Hua, Tiankun Zhao

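AI summary: This paper proposes Fast-BEV++, which follows two principles, Fast by Algorithm and Deployable by Design, to resolve the tension between accuracy and deployment efficiency in low-cost bird's-eye-view perception for autonomous driving. It achieves a speedup of at least 3x, a new state-of-the-art result of 0.488 NDS on the nuScenes benchmark, and real-time inference at more than 134 FPS.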

Comments most up-to-date version

Abstract

The advancement of vision-only Bird's-Eye-View (BEV) perception, a core paradigm for cost-effective autonomous driving, is hindered by the long-standing fundamental trade-off between perception accuracy and on-device deployment efficiency. In this work, we introduce Fast-BEV++, a BEV perception framework that resolves this tension through two fundamental design principles: Fast by Algorithm and Deployable by Design. By decomposing the core view transformation module into a hardware-oriented standard Index-Gather-Reshape pipeline, Fast-BEV++ eliminates dependencies on custom kernels while achieving at least a 3x speedup over the Fast-BEV baseline across mainstream edge platforms. Empirically, Fast-BEV++ establishes a new state-of-the-art result of 0.488 NDS on the nuScenes 3D object detection benchmark, simultaneously delivering real-time inference at more than 134 FPS via our acceleration design. In particular, our integrated, learnable depth module yields consistent performance gains, maintaining the highest accuracy among comparable methods. Overall, this inherently decomposed architecture enables seamless real-time deployment across diverse production-grade automotive platforms, alleviating hardware limitations without compromising perception accuracy or inference efficiency.
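
The abstract names an Index-Gather-Reshape pipeline without giving details, so the following is only a toy sketch of the general idea: precompute which image feature each BEV cell reads, then perform a plain gather and a reshape, both of which are standard operators. The function names, the 1-D feature list, and the `project` callback are all hypothetical stand-ins, and the assumption that the lookup table can be built offline from fixed camera geometry is ours.

```python
def build_index_table(bev_h, bev_w, project):
    """Index step (assumed to run offline): for each BEV cell, store the flat index
    of the image feature it samples, or -1 if the cell sees no image feature."""
    return [project(row, col) for row in range(bev_h) for col in range(bev_w)]


def view_transform(image_feats, index_table, bev_h, bev_w, fill=0.0):
    """Gather image features through the lookup table, then reshape to a BEV grid."""
    # Gather: one table lookup per BEV cell; cells with index -1 receive the fill value.
    flat = [image_feats[i] if i >= 0 else fill for i in index_table]
    # Reshape: split the flat list into bev_h rows of bev_w cells each.
    return [flat[row * bev_w:(row + 1) * bev_w] for row in range(bev_h)]


if __name__ == "__main__":
    feats = [10.0, 20.0, 30.0, 40.0]  # hypothetical flattened image features
    # Hypothetical projection: cell (1, 1) is outside every camera's view.
    table = build_index_table(2, 2, lambda r, c: -1 if (r, c) == (1, 1) else r * 2 + c)
    print(view_transform(feats, table, 2, 2))
```

Because the run-time path is only a lookup and a reshape, nothing in it requires a custom kernel, which is the deployment property the abstract emphasizes.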

2512.07068 2026-05-20 cs.CL

SETUP: Sentence-level English-To-Uniform Meaning Representation Parser

Emma Markle, Javier Gutierrez Bach, Shira Wein

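AI summary: This paper presents two methods for parsing English into Uniform Meaning Representation (UMR): one fine-tunes existing Abstract Meaning Representation parsers, and the other leverages a converter from Universal Dependencies. The best model, SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, a substantial gain in automatic UMR parsing.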

Comments LREC 2026 Camera-ready

Abstract

Uniform Meaning Representation (UMR) is a novel graph-based semantic representation which captures the core meaning of a text, with flexibility incorporated into the annotation schema such that the breadth of the world's languages can be annotated (including low-resource languages). While UMR shows promise in enabling language documentation, improving low-resource language technologies, and adding interpretability, the downstream applications of UMR can only be fully explored when text-to-UMR parsers enable the automatic large-scale production of accurate UMR graphs at test time. Prior work on text-to-UMR parsing is limited to date. In this paper, we introduce two methods for English text-to-UMR parsing, one that fine-tunes existing parsers for Abstract Meaning Representation and another that leverages a converter from Universal Dependencies, with prior work as a baseline. Our best-performing model, which we call SETUP, achieves an AnCast score of 84 and a SMATCH++ score of 91, indicating substantial gains towards automatic UMR parsing.

2512.05958 2026-05-20 cs.LG cs.AI

MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

Sara Patel, Mingxun Zhou, Giulia Fanti

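AI summary: This paper proposes MaxShapley, an algorithm for fairly attributing credit to, and compensating, content providers in generative search pipelines. It is a special case of the Shapley value that uses a decomposable max-sum utility function to compute attributions in polynomial time, rather than at the exponential cost of general Shapley values.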

Abstract

Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair credit attribution in generative search pipelines that retrieve external sources before generation. MaxShapley is a special case of the celebrated Shapley value; it leverages a decomposable max-sum utility function to compute attributions in time polynomial in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens. For instance, it gives up to a 9x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy. We release open-source code and re-calibrated datasets. An educational demo is available at https://fair-search.com.
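
To see why a max-type utility makes Shapley values cheap, consider the following sketch. It is not the paper's algorithm: it assumes, as one plausible reading of "max-sum", that the utility of a document set is the sum over answer parts of the best relevance score among the documents, with non-negative scores and zero utility for the empty set. For a single max term the Shapley value has a well-known closed form (the same one used for the classic airport cost-sharing game), and linearity of the Shapley value extends it to the sum. All function names and scores are hypothetical.

```python
from itertools import permutations


def shapley_max_game(scores):
    """Exact Shapley values for v(S) = max of scores[i] over i in S, with v(empty) = 0.

    Assumes non-negative scores. Runs in O(n log n) because of the sort.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    values = [0.0] * n
    running, previous = 0.0, 0.0
    for rank, i in enumerate(order):
        # The step from the previous score up to scores[i] is shared equally
        # by the n - rank players whose score is at least scores[i].
        running += (scores[i] - previous) / (n - rank)
        values[i] = running
        previous = scores[i]
    return values


def shapley_max_sum(score_rows):
    """Shapley values for a sum of max games, one row of scores per answer part."""
    totals = [0.0] * len(score_rows[0])
    for row in score_rows:
        for i, value in enumerate(shapley_max_game(row)):
            totals[i] += value  # linearity: Shapley of a sum is the sum of Shapleys
    return totals


def shapley_brute_force(scores):
    """Reference implementation that averages marginal gains over all n! orderings."""
    n = len(scores)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for ordering in orderings:
        best = 0.0
        for i in ordering:
            values[i] += max(best, scores[i]) - best
            best = max(best, scores[i])
    return [v / len(orderings) for v in values]


if __name__ == "__main__":
    print(shapley_max_game([1.0, 3.0, 2.0]))
    print(shapley_brute_force([1.0, 3.0, 2.0]))
```

The brute-force version enumerates every ordering of the documents and is included only to check the closed form on small inputs; its cost grows factorially, which is the kind of blow-up the abstract contrasts with polynomial-time attribution.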

2512.05721 2026-05-20 cs.LG

BERTO: Intent-Driven Network Time Series Forecasting via Natural Language Operator Preferences

Nitin Priyadarshini Shankar, Vaibhav Singh, Sheetal Kalyani, Christian Maciocco

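AI summary: BERTO performs intent-driven network time series forecasting from natural-language operator preferences. It uses a BERT-based framework for traffic prediction and energy optimization, and combines a Balancing Loss Function with prompt-based conditioning so that the model can shift its forecasting bias to match operator needs, enabling flexible, decision-aware forecasts.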

Comments 7 pages, 3 figures, 2 tables

Abstract

Traditional cellular traffic forecasting models are optimized for minimizing symmetric errors, leaving them indifferent to shifting operational priorities. To bridge this gap, we introduce BERTO, a BERT-based framework for traffic prediction and energy optimization in cellular networks. Built on transformer architectures, BERTO achieves high prediction accuracy while enabling a single fine-tuned model to operate across multiple forecasting regimes via natural-language operator prompts. By combining a Balancing Loss Function (BLF) with prompt-based conditioning, BERTO adaptively shifts its forecasting bias toward underprediction or overprediction depending on the operator's desired trade-off between power savings and service quality. This allows the same model to dynamically generate different decision-aware forecasts without retraining or modifying model parameters. Experiments on real-world datasets demonstrate that BERTO can operate across a flexible range of approximately 1.4 kW in power consumption while balancing 9x variation in service level agreement (SLA) violations, making it well suited for intelligent RAN deployments.
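
The abstract does not give the form of the Balancing Loss Function, so the sketch below uses the standard pinball (quantile) loss purely as a stand-in to show how an asymmetric loss can tilt a forecaster toward overprediction or underprediction. The function name and the `tau` parameter are assumptions for illustration and should not be read as BERTO's actual loss.

```python
def asymmetric_loss(y_true, y_pred, tau):
    """Mean pinball loss over paired observations and forecasts.

    tau in (0, 1): values above 0.5 penalize underprediction more heavily
    (pushing forecasts up), values below 0.5 penalize overprediction more heavily.
    """
    total = 0.0
    for target, forecast in zip(y_true, y_pred):
        error = target - forecast  # positive when the forecast is too low
        total += tau * error if error >= 0 else (tau - 1.0) * error
    return total / len(y_true)


if __name__ == "__main__":
    # Same absolute error of 2 units, opposite directions, with tau = 0.9.
    print(asymmetric_loss([10.0], [8.0], tau=0.9))   # forecast too low
    print(asymmetric_loss([10.0], [12.0], tau=0.9))  # forecast too high
```

In a setting like the one the abstract describes, an operator who prioritizes service quality would correspond to a setting that penalizes underprediction, while one who prioritizes power savings would tolerate it; how BERTO maps natural-language prompts to that trade-off is specific to the paper.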

2512.01152 2026-05-20 cs.LG cs.AI cs.CV

Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

Shravan Chaudhari, Yoav Wald, Suchi Saria

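AI summary: This paper studies the challenges of open-set domain adaptation under background distribution shift and proposes CoLOR, a provably efficient solution. Theoretical analysis shows that it outperforms a baseline in a simplified overparameterized setting, and experiments demonstrate its applicability to image and text data.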

Comments Project page at https://github.com/Shra1-25/CoLOR

Journal ref Transactions on Machine Learning Research (TMLR) 2026/May ISSN: 2835-8856

Abstract

As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop CoLOR, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make CoLOR scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that CoLOR significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influence performance, an aspect that has not been extensively explored in prior work.

2512.00281 2026-05-20 cs.CV q-bio.NC

Beyond Size and Growth: Rethinking Lung Cancer Screening with AI Based Nodule Detection and Diagnosis

Sylvain Bodard, Pierre Baudot, Benjamin Renoust, Charles Voyton, Gwendoline De Bie, Ezequiel Geremia, Van-Khoa Le, Danny Francis, Pierre-Henri Siot, Yousra Haddou, Vincent Bobin, Jean-Christophe Brisset, Carey C. Thomson, Valerie Bourdes, Benoit Huet

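AI summary: This paper presents an integrated AI system that performs nodule detection and malignancy assessment directly at the nodule level from low-dose CT scans, moving beyond conventional size- and growth-based screening criteria to improve the accuracy and efficiency of lung cancer screening.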

Comments 25 pages, 8 figures, with supplementary information containing 11 figures

Abstract

Early detection of malignant lung nodules remains constrained by size and growth based screening criteria, often delaying diagnosis. We present an integrated AI system that jointly performs nodule detection and malignancy assessment directly at the nodule level from low dose CT scans, within a unified CADe/CADx framework. Unlike conventional pipelines separating detection and diagnosis, our approach targets malignant nodules directly, redefining evaluation at the point where clinical decisions are made. To address limitations in dataset scale and explainability, the system consists of a Large Ensemble Model (LEM) combining ensembles of shallow deep learning and feature based models. It was trained and evaluated on 25,709 scans with 69,449 annotated nodules, with external validation on an independent cohort. It achieved an AUC of 0.98 internally and 0.945 externally, outperforming all growth based metrics, Lung RADS size based triage, European volume and VDT based screening criteria, radiologists, and leading AI models. The model maintains high sensitivity at low false positive rates, excels for small and early stage cancers, and enables malignancy assessment up to one year earlier than radiologists for indeterminate and slow growing nodules. This approach has the potential to streamline lung cancer screening workflows and support earlier, more actionable clinical decision making.

2511.17166 2026-05-20 cs.RO

Reflection-Based Relative Localization for Cooperative UAV Teams Using Active Markers

Tim Lakemann, Daniel Bonilla Licea, Viktor Walter, Martin Saska

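AI summary: This paper proposes a new relative localization method for UAV teams that exploits reflections of active markers in the environment. It requires no prior knowledge of robot size or marker configuration and accounts for uncertainty caused by surface irregularities; experiments show greater effective range and accuracy across varying lighting conditions.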

Abstract

Reflections of active markers in the environment are a common source of ambiguity in onboard visual relative localization. This work presents a novel approach that exploits these typically unwanted reflections for onboard relative localization in heterogeneous multi-UAV teams. The method operates without prior knowledge of robot size or predefined marker configurations, remains independent of surface properties, and explicitly accounts for uncertainties caused by surface irregularities, including dynamic water surfaces relevant for marine deployments. We validated the approach in both indoor and outdoor experiments, demonstrating reliable operation across varying lighting conditions and achieving greater effective range (above 30 m) and accuracy than state-of-the-art methods. The video is available at the following link: https://youtu.be/y0zp8cIwkig.

2511.16766 2026-05-20 cs.CV

SVG360: Editable Multiview Vector Graphics from a Single SVG

Mengnan Jiang, Zhaolin Sun, Christian Franke, Michele Franco Adesso, Antonio Haas, Grace Li Zhang

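AI summary: This paper presents SVG360, a framework that converts a single SVG into geometrically and visually consistent multiview SVG assets through a view-consistent vectorization pipeline. It addresses path fragmentation and color instability across views and improves multiview consistency and editability.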

Abstract

Scalable Vector Graphics are a standard representation for editable visual design, yet they are usually authored as single view two dimensional illustrations. This limits their use in applications that require object level assets to remain coherent when observed, edited, or animated from different viewpoints. We present SVG360, a framework that converts a single input SVG into geometrically and visually consistent multiview SVG assets. The key challenge is that direct per view generation or vectorization produces view dependent regions, fragmented paths, and unstable colors, making the resulting SVGs difficult to edit as a coherent object. SVG360 addresses this problem through a view consistent vectorization pipeline. It first lifts the rasterized input into a view conditioned object representation and renders target views under prescribed cameras. It then propagates part identity across neighboring views using a spatial memory mechanism adapted from video segmentation, establishing consistent region decomposition, path correspondence, and color assignment without task specific retraining. Finally, each view is reconstructed as an editable SVG through structure aware vectorization, where redundant paths are consolidated and local geometry is optimized while preserving boundaries and semantic parts. Experiments on object level SVG assets show that SVG360 improves multiview consistency, reduces path redundancy, and better preserves fine structures compared with direct per view vectorization. By turning a single view SVG into a coherent 360 degree vector asset, SVG360 expands vector graphics from static illustration toward editable multiview content for design, animation, and structured visual editing.

2511.13864 2026-05-20 cs.CV

GRLoc: Geometric Representation Regression for Visual Localization

GRLoc: 用于视觉定位的几何表示回归

Changyang Li, Xuejian Ma, Lixiang Liu, Zhan Li, Qingan Yan, Yi Xu

AI总结 本文提出了一种基于几何表示回归(GRR)的方法,通过分离旋转和平移预测来提升视觉定位的性能,并在7-Scenes和Cambridge Landmarks数据集上实现了最先进的结果。

详情
AI中文摘要

绝对姿态回归(APR)已成为视觉定位中的有力范式。然而,APR模型通常作为黑箱运行,直接从查询图像回归6自由度姿态,这可能导致模型记忆训练视图而非理解3D场景几何。在本文中,我们提出了一种基于几何的替代方法。新视角合成从中间几何表示渲染图像;受其启发,我们将APR重新表述为其逆过程,即直接从图像回归底层3D表示,并将此范式称为几何表示回归(GRR)。我们的模型在世界坐标系中显式预测两种解耦的几何表示:(1)射线图(raymap)的方向,用于估计相机旋转;(2)对应的点图(pointmap),用于估计相机平移。最终的相机姿态通过可微的确定性求解器从这些几何组件中恢复。这种解耦方法将学习到的视觉到几何映射与最终姿态计算分离,为网络引入了强几何先验。我们发现,显式解耦旋转和平移预测可明显提升性能。我们在7-Scenes和Cambridge Landmarks数据集上取得了最先进的性能,验证了建模逆渲染过程是通向可泛化绝对姿态估计的更稳健路径。

英文摘要

Absolute Pose Regression (APR) has emerged as a compelling paradigm for visual localization. However, APR models typically operate as black boxes, directly regressing a 6-DoF pose from a query image, which can lead to memorizing training views rather than understanding 3D scene geometry. In this work, we propose a geometrically-grounded alternative. Inspired by novel view synthesis, which renders images from intermediate geometric representations, we reformulate APR as its inverse that regresses the underlying 3D representations directly from the image, and we name this paradigm Geometric Representation Regression (GRR). Our model explicitly predicts two disentangled geometric representations in the world coordinate system: (1) a raymap's directions to estimate camera rotation, and (2) a corresponding pointmap to estimate camera translation. The final camera pose is then recovered from these geometric components using a differentiable deterministic solver. This disentangled approach, which separates the learned visual-to-geometry mapping from the final pose calculation, introduces a strong geometric prior into the network. We find that the explicit decoupling of rotation and translation predictions measurably boosts performance. We demonstrate state-of-the-art performance on 7-Scenes and Cambridge Landmarks datasets, validating that modeling the inverse rendering process is a more robust path toward generalizable absolute pose estimation.
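The translation half of this idea can be illustrated with a generic least-squares ray intersection. This is a minimal sketch under stated assumptions, not the authors' solver: it assumes each pixel contributes a predicted world-space point p_i and a world-space ray direction d_i, and it finds the camera center c that minimizes the summed squared distance from c to each line through p_i along d_i. The names `solve3` and `camera_center_from_rays` are hypothetical.

```python
import math


def solve3(A, b):
    """Solve a 3x3 linear system A x = b by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def camera_center_from_rays(points, directions):
    """Least-squares camera center from world points and world-frame ray directions.

    Solves (sum_i P_i) c = sum_i P_i p_i, where P_i = I - d_i d_i^T projects
    onto the plane orthogonal to the (normalized) ray direction d_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for p, d in zip(points, directions):
        norm = math.sqrt(sum(x * x for x in d))
        u = [x / norm for x in d]
        for r in range(3):
            for c in range(3):
                m = (1.0 if r == c else 0.0) - u[r] * u[c]
                A[r][c] += m
                b[r] += m * p[c]
    return solve3(A, b)
```

At least two non-parallel rays are needed for the system to be well-posed. Rotation would be recovered separately, for example by aligning predicted world-frame directions with the camera-frame directions implied by the intrinsics.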

2511.12158 2026-05-20 cs.LG

Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

用于细粒度鸟类叫声分析的数据高效自监督算法

Houtan Ghaffari, Lukas Rauch, Paul Devos

AI总结 本文提出了一种数据高效的鸟类叫声标注器,通过三阶段训练流程在最小标注情况下开发可靠的鸟类叫声音节检测器,并在极端标注稀缺场景下验证了其有效性,同时评估了自监督嵌入在线性探测和无监督鸟类叫声分析中的潜力。

详情
AI中文摘要

生物声学、神经科学和语言学研究经常使用鸟类鸣唱作为代理来获取不同领域的知识。这需要音频模型能够标注和解析鸟类鸣唱。开发此类模型需要精确的、音节级标注的训练数据。因此,能够降低标注成本的自动化方法需求迫切。本文提出了一种数据高效的鸟类鸣唱标注器,称为残差多层感知机循环神经网络。然后,本文提出了一个三阶段训练流程,以在极少标注的情况下开发可靠的鸟类鸣唱音节检测器。第一阶段是从未标注数据中进行自监督学习。探索了两种最成功的预训练范式,即掩码预测和在线聚类。第二阶段是使用有效的数据增强进行监督训练,为每个个体生成稳健的帧级音节检测器。第三阶段是一个半监督的后训练步骤,利用未标注数据来优化每个个体的模型。该方法的有效性在极端标注稀缺场景下对金丝雀(Canary)鸣唱进行了验证。从信号处理的角度来看,对算法时间序列标注而言,金丝雀鸣唱呈现出最具挑战性的频谱-时间模式之一:快速的发声、短暂的音节间间隔、快速且宽带的频率扫频,以及需要细粒度特征才能区分的频谱相似音节。因此,成功的金丝雀音节检测算法也为其他鸟类建立了稳健的基线。这种方法论的泛化在十姊妹(Bengalese Finch)鸣唱标注的案例研究中得到了验证。最后,评估了自监督嵌入在线性探测和无监督鸟类鸣唱分析中的潜力。

英文摘要

Research in bioacoustics, neuroscience, and linguistics often uses birdsong as a proxy to acquire knowledge across diverse areas. This requires audio models to annotate and parse the birdsong. Developing such models requires precise, syllable-level annotated training data. Therefore, automated methods that reduce annotation costs are in demand. This work presents a data-efficient birdsong annotator called Residual Multi-Layer Perceptron Recurrent Neural Network. It then presents a three-stage training pipeline for developing reliable birdsong syllable detectors with minimal annotation. The first stage is self-supervised learning from unlabeled data. Two of the most successful pretraining paradigms are explored, namely, masked prediction and online clustering. The second stage is supervised training with effective data augmentation to produce a robust frame-level syllable detector for each individual. The third stage is a semi-supervised post-training step that refines each individual's model using unlabeled data. The effectiveness of this approach is demonstrated for the Canary song in extreme label-scarcity scenarios. From a signal-processing perspective, the Canary song exhibits one of the most challenging spectro-temporal patterns for algorithmic time-series annotation: rapid vocalizations, brief inter-syllabic intervals, fast and broadband frequency sweeps, and spectrally similar syllables that require fine-grained features to distinguish. Hence, a successful syllable detection algorithm for Canary also establishes a robust baseline for other birds. This methodological generalization is validated in a case study of Bengalese Finch song annotation. Finally, the potential of self-supervised embeddings is assessed for linear probing and unsupervised birdsong analysis.

2511.11688 2026-05-20 cs.LG cs.CV

Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling

分层调度优化用于快速且稳健的扩散模型采样

Aihua Zhu, Rui Su, Qinglin Zhao, Li Feng, Meng Shen, Shibo He

AI总结 本文提出了一种分层调度优化方法,通过改进的双层优化框架,在极低的函数评估次数下实现高效的扩散模型采样,显著提升了样本质量和计算效率。

Comments Preprint, accepted to AAAI 2026

详情
AI中文摘要

扩散概率模型为生成保真度树立了新标准,但受限于缓慢的迭代采样过程。加速这一过程的一种强大的免训练策略是调度优化,其目标是在固定且较小的函数评估次数(NFE)下找到最优的时间步分布,以最大化样本质量。为此,成功的调度优化方法必须遵循四个核心原则:有效性、适应性、实用鲁棒性和计算效率。然而,现有范式难以同时满足这些原则,因此需要更先进的解决方案。为克服这些限制,我们提出了分层调度优化器(HSO),一种新颖且高效的双层优化框架。HSO通过在两个协同层级之间交替迭代,将全局最优调度的搜索转化为更易处理的问题:上层的全局搜索用于寻找最优初始化策略,下层的局部优化用于细化调度。这一过程由两个关键创新引导:中点误差代理(MEP),一种与求解器无关且数值稳定的局部优化目标;以及间距惩罚适应度(SPF)函数,它通过惩罚病态接近的时间步来确保实用鲁棒性。大量实验表明,HSO在极低NFE范围内为免训练采样确立了新的最先进水平。例如,在NFE仅为5时,HSO使用Stable Diffusion v2.1在LAION-Aesthetics上取得了11.94的FID。关键的是,这一性能并非通过昂贵的重新训练获得,而只需不到8秒的一次性优化成本,为扩散模型加速提供了一种高效且实用的范式。

英文摘要

Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.
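The idea of penalizing pathologically close timesteps can be made concrete with a toy fitness function. The sketch below is purely illustrative: the abstract does not give the SPF formula, so the hinge-style penalty, the `min_gap` threshold, the `weight`, and both function names are assumptions introduced here.

```python
def spacing_penalty(timesteps, min_gap):
    """Sum of hinge penalties for adjacent timesteps closer than min_gap."""
    ordered = sorted(timesteps, reverse=True)
    return sum(max(0.0, min_gap - (a - b)) for a, b in zip(ordered, ordered[1:]))


def spacing_penalized_fitness(quality, timesteps, min_gap=0.05, weight=10.0):
    """Hypothetical fitness: a quality score minus a weighted spacing penalty."""
    return quality - weight * spacing_penalty(timesteps, min_gap)
```

A schedule with evenly spread timesteps keeps its full quality score, while one that crowds two steps together is ranked lower, which is the behavior a global search over schedules would need to avoid degenerate candidates.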

2511.10292 2026-05-20 cs.CV cs.AI

Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

自适应残差更新引导用于大型视觉语言模型中低开销幻觉抑制

Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen

AI总结 本文提出RUDDER框架,通过创建持久视觉锚点来对抗视觉稀释,利用模型的prefill残差更新提取鲁棒证据方向,并通过自适应门控机制注入解码过程,有效抑制幻觉并保持高吞吐量。

Comments Accepted by ICML 2026; Code available at: https://github.com/Akko000/RUDDER-Residual-Update-Directed-DEcoding-Regulation-

详情
AI中文摘要

大型视觉-语言模型(LVLMs)通常将视觉输入作为语言解码器之前的前缀进行处理。随着模型自回归地生成文本,这种初始视觉信息不可避免地经历“稀释”,导致模型过度依赖语言先验并产生幻觉。现有干预尝试通过对比logits或迭代优化输出来纠正这一问题,但会带来不可接受的延迟成本。我们提出残差更新引导解码调节(RUDDER)框架,通过创建持久视觉锚点来对抗视觉稀释。我们直接从模型的prefill残差更新中提取鲁棒证据方向(CARD),并将其注入解码过程。这种注入通过自适应门控机制(Beta Gate)进行调节,该机制作为信任机制,确保只有在必要时才应用视觉提示。在LLaVA-1.5(7B/13B)、Idefics2、InstructBLIP和Qwen2.5-VL上的实验表明,RUDDER一致地抑制了幻觉(在贪婪解码中,RUDDER将CHAIR_S减少平均24.4%,将CHAIR_i减少23.6%),并在不同架构上有效扩展,同时保持>96.0%的吞吐量。

英文摘要

Large Vision-Language Models (LVLMs) typically process visual inputs as a prefix to the language decoder. As the model autoregressively generates text, this initial visual information inevitably undergoes "dilution", leading the model to over-rely on language priors and hallucinate objects. Existing interventions attempt to correct this by contrasting logits or iteratively refining outputs, but they incur prohibitive latency costs. We propose Residual-Update Directed DEcoding Regulation (RUDDER), a framework that counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates and inject it into the decoding process. This injection is modulated by an adaptive gate, the Beta Gate, which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL demonstrate that RUDDER consistently mitigates hallucination (with greedy decoding, it achieves average relative reductions of 24.4% in CHAIR_S and 23.6% in CHAIR_i) and scales effectively across architectures, all while maintaining >96.0% throughput.

2511.06943 2026-05-20 cs.CV cs.AI

PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data

PlantTraitNet: 一种考虑不确定性的多模态框架,用于从公民科学数据中进行全球尺度植物特性推断

Ayushi Sharma, Johanna Trost, Daniel Lusk, Johannes Dollinger, Julian Schrader, Christian Rossi, Javier Lopatin, Etienne Laliberté, Simon Haberstroh, Jana Eichel, Daniel Mederer, Jose Miguel Cerda-Paredes, Shyam S. Phartyal, Lisa-Maricia Schwarz, Anja Linstädter, Maria Conceição Caldeira, Teja Kattenborn

AI总结 本研究提出PlantTraitNet,一种多模态、多任务且考虑不确定性的深度学习框架,通过弱监督从公民科学照片中预测四个关键植物特性(植物高度、叶面积、特定叶面积和氮含量),并利用空间聚合生成全球特性分布图,验证结果表明其在所有评估特性上均优于现有特性地图。

Comments Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26). Link: https://ojs.aaai.org/index.php/AAAI/article/view/41272

详情
AI中文摘要

全球植物特性地图,如叶片氮含量或植物高度,对于理解生态系统过程,包括地球系统的碳和能量循环至关重要。然而,现有特性地图受限于基于现场测量的高成本和稀疏的地理覆盖。公民科学计划提供了一个未被充分利用的资源来克服这些限制,全球范围内有超过5000万张带有地理标签的植物照片,捕捉了有价值的植物形态和生理信息。在本研究中,我们引入PlantTraitNet,一种多模态、多任务且考虑不确定性的深度学习框架,利用弱监督从公民科学照片中预测四个关键植物特性(植物高度、叶面积、特定叶面积和氮含量)。通过在空间上聚合个体特性预测,我们生成全球特性分布图。我们通过独立的植被调查数据(sPlotOpen)验证这些地图,并将其与领先全球特性产品进行基准测试。我们的结果表明,PlantTraitNet在所有评估特性上均优于现有特性地图,证明了将公民科学影像与计算机视觉和地理空间AI结合,不仅能够实现可扩展的,而且更准确的全球特性映射。这种方法为生态研究和地球系统建模提供了强大的新途径。

英文摘要

Global maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task, uncertainty-aware deep learning framework that predicts four key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.
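The step of aggregating per-photo predictions into a map can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes each prediction comes with a predictive variance and combines predictions in the same latitude/longitude cell by inverse-variance weighting. The function name `aggregate_to_grid`, the tuple layout, and the one-degree default cell size are all assumptions.

```python
import math
from collections import defaultdict


def aggregate_to_grid(predictions, cell_deg=1.0):
    """Inverse-variance weighted mean of trait predictions per grid cell.

    predictions: iterable of (lat, lon, value, variance) tuples, variance > 0.
    Returns {(lat_index, lon_index): weighted_mean}.
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # cell -> [sum(w * value), sum(w)]
    for lat, lon, value, variance in predictions:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        w = 1.0 / variance
        sums[cell][0] += w * value
        sums[cell][1] += w
    return {cell: wv / w for cell, (wv, w) in sums.items()}
```

Weighting by inverse variance lets confident predictions dominate a cell, which is one simple way an uncertainty-aware model's outputs could feed into a trait map.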

2511.06077 2026-05-20 cs.LG cs.IR

Make It Long, Keep It Fast: End-to-End 10K Long User Behavior Sequence Modeling for Billion-Scale Douyin Recommendation

序列更长,速度不减:面向十亿级Douyin推荐的端到端10K长用户行为序列建模

Lin Guan, Jia-Qi Yang, Zhishan Zhao, Beichuan Zhang, Bo Sun, Xuanyuan Luo, Jinan Ni, Xiaowen Li, Yuhang Qi, Zhifang Fan, Hangyu Wang, Qiwei Chen, Yi Cheng, Feng Zhang, Xiao Yang

AI总结 本文提出了一种端到端的推荐系统,能够处理长度达10000的用户行为序列,通过引入堆叠的目标到历史交叉注意力机制、请求级别批量处理策略以及长度外推训练策略,实现了在大规模Douyin推荐中的高效长序列建模。

Comments WWW 2026. This work studies end-to-end 10K-scale long user behavior sequence modeling for billion-scale industrial recommendation on Douyin

详情
AI中文摘要

像Douyin这样的短视频推荐系统必须在不牺牲延迟或成本预算的前提下利用极其长的用户行为历史。我们提出了一种端到端的工业推荐系统,将长序列推荐建模扩展到10000长度的历史记录。首先,我们引入了堆叠的目标到历史交叉注意力(STCA),通过用目标到历史的堆叠交叉注意力替代历史自注意力,将复杂度从二次方降低到线性,从而在长用户行为序列上实现高效的端到端训练。其次,我们提出了请求级别批量处理(RLB),一种以用户为中心的批量方案,将相同用户/请求的多个目标聚合起来共享用户侧编码,显著降低了与序列相关的存储、通信和计算成本,而无需改变学习目标。第三,我们设计了一种长度外推训练策略——在较短的窗口上训练,在更长的窗口上推断——从而使模型能够泛化到10000规模的历史记录而无需额外的训练成本。在离线和在线实验中,我们观察到随着历史长度和模型容量的增加,我们获得的收益是可预测且单调的,与在大型语言模型中观察到的扩展定律行为相呼应。在Douyin全流量部署中,我们的系统在关键参与度指标上实现了显著提升,同时满足了生产延迟,展示了将端到端超长序列推荐扩展到10000规模的实用路径。

英文摘要

Short-video recommenders such as Douyin must exploit extremely long user behavior histories without breaking latency or cost budgets. We present an end-to-end industrial recommender system that scales long-sequence recommendation modeling to 10K-length histories in production. First, we introduce Stacked Target-to-History Cross Attention (STCA), which replaces history self-attention with stacked cross-attention from the target to the history, reducing complexity from quadratic to linear in sequence length and enabling efficient end-to-end training over long user behavior sequences. Second, we propose Request Level Batching (RLB), a user-centric batching scheme that aggregates multiple targets for the same user/request to share the user-side encoding, substantially lowering sequence-related storage, communication, and compute without changing the learning objective. Third, we design a length-extrapolative training strategy (train on shorter windows, infer on much longer ones) so the model generalizes to 10K-scale histories without additional training cost. Across offline and online experiments, we observe predictable, monotonic gains as we scale history length and model capacity, mirroring the scaling-law behavior observed in large language models. Deployed at full traffic on Douyin, our system delivers significant improvements on key engagement metrics while meeting production latency, demonstrating a practical path to scaling end-to-end ultra-long sequence recommendation to the 10K regime.
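The complexity claim for target-to-history cross-attention can be seen in a small single-head sketch. This is an illustrative simplification rather than the production STCA layer: it omits learned projections and stacking, and the names `cross_attend` and `score_request` are hypothetical. One target query attends over L history items, so the cost is proportional to L, whereas history self-attention would compare all L items with each other.

```python
import math


def cross_attend(target, keys, values):
    """Single-head attention of one target vector over L history items: O(L * d)."""
    d = len(target)
    scores = [sum(t * k for t, k in zip(target, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def score_request(targets, keys, values):
    """Request-level reuse: the history keys/values are built once and shared
    by every candidate target of the same request."""
    return [cross_attend(t, keys, values) for t in targets]
```

In `score_request`, the user-side `keys` and `values` are computed once per request and reused for each candidate, which mirrors the motivation given for Request Level Batching.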

2511.01526 2026-05-20 cs.CL

Difficulty-Controllable Cloze Question Distractor Generation

可调节难度的填空题干扰项生成

Seokhoon Kang, Yejin Jeon, Seonjeong Hwang, Gary Geunbae Lee

AI总结 本文提出了一种可调节难度的填空题干扰项生成框架,通过数据增强和多任务学习策略,生成高质量且标注难度的干扰项,优于GPT-4o在匹配干扰项难度与人类感知方面。

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

多项选择填空题常用于评估语言能力和理解能力。然而,生成高质量的干扰项仍具有挑战性,因为现有方法往往缺乏适应性和对难度水平的控制,缺乏难度标注的数据集进一步阻碍了进展。为了解决这些问题,我们提出了一种生成具有可控难度的干扰项的新框架,通过利用数据增强和多任务学习策略。首先,为了创建高质量、难度标注的数据集,我们引入了双向干扰项生成过程来生成多样且合理的干扰项。这些候选者经过筛选后,通过集成QA系统进行难度分类。其次,利用新创建的数据集通过多任务学习训练一个可调节难度的生成模型。实验结果表明,我们的方法在不同难度级别上生成高质量的干扰项,并在匹配干扰项难度与人类感知方面显著优于GPT-4o。

英文摘要

Multiple-choice cloze questions are commonly used to assess linguistic proficiency and comprehension. However, generating high-quality distractors remains challenging, as existing methods often lack adaptability and control over difficulty levels, and the absence of difficulty-annotated datasets further hinders progress. To address these issues, we propose a novel framework for generating distractors with controllable difficulty by leveraging both data augmentation and a multitask learning strategy. First, to create a high-quality, difficulty-annotated dataset, we introduce a two-way distractor generation process to produce diverse and plausible distractors. These candidates are filtered and then categorized by difficulty using an ensemble QA system. Second, this newly created dataset is used to train a difficulty-controllable generation model via multitask learning. Experimental results demonstrate that our method generates high-quality distractors across difficulty levels and substantially outperforms GPT-4o in aligning distractor difficulty with human perception.

2511.01126 2026-05-20 cs.LG cs.NA math.NA math.OC math.ST stat.TH

Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization

在线零阶和一阶双层优化的随机遗憾保证

Parvin Nazari, Bojian Hou, Davoud Ataee Tarzanagh, Li Shen, George Michailidis

AI总结 本文提出了一种新的搜索方向,证明了利用该方向的零阶和一阶随机在线双层优化算法能够在不使用窗口平滑的情况下实现亚线性随机双层遗憾。此外,该框架通过减少超梯度估计中的oracle依赖、同时更新内层和外层变量以及使用基于零阶的Hessian、雅可比和梯度估计来提高效率。

Comments Published at NeurIPS 2025

详情
AI中文摘要

在线双层优化(OBO)是一种强大的框架,用于解决外层和内层目标均随时间演变、需要动态更新的机器学习问题。当前的OBO方法依赖于确定性的窗口平滑遗憾最小化,这在函数变化迅速时可能无法准确反映系统性能。在本文中,我们引入了一种新的搜索方向,并证明利用该方向的一阶和零阶(ZO)随机OBO算法能够在不使用窗口平滑的情况下实现亚线性随机双层遗憾。除了这些保证外,我们的框架还通过以下方式提高效率:(i)减少超梯度估计中对oracle的依赖,(ii)在求解线性系统的同时更新内层和外层变量,(iii)采用基于ZO的Hessian、雅可比和梯度估计。在线参数化损失调优和黑盒对抗攻击的实验验证了我们的方法。

英文摘要

Online bilevel optimization (OBO) is a powerful framework for machine learning problems where both outer and inner objectives evolve over time, requiring dynamic updates. Current OBO approaches rely on deterministic window-smoothed regret minimization, which may not accurately reflect system performance when functions change rapidly. In this work, we introduce a novel search direction and show that both first- and zeroth-order (ZO) stochastic OBO algorithms leveraging this direction achieve sublinear stochastic bilevel regret without window smoothing. Beyond these guarantees, our framework enhances efficiency by: (i) reducing oracle dependence in hypergradient estimation, (ii) updating inner and outer variables alongside the linear system solution, and (iii) employing ZO-based estimation of Hessians, Jacobians, and gradients. Experiments on online parametric loss tuning and black-box adversarial attacks validate our approach.
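For readers unfamiliar with zeroth-order estimation, the standard two-point random-direction gradient estimator gives the flavor. This is a textbook sketch, not the paper's search direction or its Hessian and Jacobian estimators. The names `zo_gradient` and `gaussian_directions` and the smoothing radius `mu` are assumptions made for illustration.

```python
import random


def zo_gradient(f, x, directions, mu=1e-3):
    """Average of two-point estimates ((f(x + mu*u) - f(x - mu*u)) / (2*mu)) * u.

    Uses only function evaluations of f; with standard Gaussian directions u,
    the average approximates the gradient of f at x.
    """
    grad = [0.0] * len(x)
    for u in directions:
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui
    return [g / len(directions) for g in grad]


def gaussian_directions(dim, count, seed=0):
    """Draw `count` standard Gaussian direction vectors with a fixed seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]
```

Each direction costs two function evaluations, which is why such estimators suit black-box settings like the adversarial attack experiments mentioned above, where gradients of the objective are unavailable.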

2510.25064 2026-05-20 cs.CL

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

LLMs能否估算阅读理解题的认知复杂性?

Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee

AI总结 本文研究了大型语言模型是否能通过证据范围和转换级别两个维度估算阅读理解题的认知复杂性,结果显示LLMs能够近似估算认知复杂性,但存在推理能力与元认知意识之间的差距。

Comments ACL 2026 Main Conference

详情
AI中文摘要

估算阅读理解(RC)题的认知复杂性对于在施测前评估题目难度至关重要。与句法和语义特征(如文章长度或选项间的语义相似性)不同,答案推理过程中产生的认知特征难以用现有NLP工具提取,传统上依赖人工标注。在本研究中,我们探讨大型语言模型(LLMs)能否通过两个维度——证据范围和转换级别——来估算RC题的认知复杂性,这两个维度表明推理答案过程中所涉及的认知负担程度。我们的实验结果表明,LLMs能够近似估算题目的认知复杂性,表明其在前期难度分析中的潜力。进一步分析揭示了LLMs推理能力与其元认知意识之间的差距:即使它们产生了正确答案,有时也会错误地识别出自身推理过程背后的特征。

英文摘要

Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before the items are administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.

2510.23507 2026-05-20 cs.LG cs.AI cs.IT math.IT

A Deep Latent Factor Graph Clustering with Fairness-Utility Trade-off Perspective

具有公平性-效用权衡视角的深度潜在因子图聚类

Siamak Ghodsi, Amjad Seyedi, Tai Le Quy, Fariba Karimi, Eirini Ntoutsi

AI总结 本文提出DFNMF,一种针对图的端到端深度非负三因子分解方法,通过软统计均等正则项直接优化聚类分配,以实现公平性与效用的平衡,并在合成和真实网络上以相当的模块度取得了更高的群体平衡性。

Comments Accepted to IEEE Big-Data 2025 main research track. The paper is 10 main pages and 4 pages of Appendix

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData)
AI中文摘要

公平图聚类旨在找到既尊重网络结构、又在各敏感群体间保持比例代表性的划分,其应用涵盖社区检测、团队组建、资源分配和社会网络分析。许多现有方法施加刚性约束,或依赖多阶段流程(例如谱嵌入后接k-均值),从而限制了权衡控制、可解释性和可扩展性。我们引入DFNMF,一种针对图的端到端深度非负三因子分解方法,它通过软统计均等正则项直接优化聚类分配。单个参数λ调节公平性与效用之间的平衡,而非负性带来基于部件的因子和透明的软成员关系。优化采用对稀疏矩阵友好的交替更新,计算量随边数近似线性增长。在合成和真实网络上,DFNMF在模块度相当的情况下实现了显著更高的群体平衡,并且经常在帕累托前沿上支配最先进的基线。代码见 https://github.com/SiamakGhodsi/DFNMF.git。

英文摘要

Fair graph clustering seeks partitions that respect network structure while maintaining proportional representation across sensitive groups, with applications spanning community detection, team formation, resource allocation, and social network analysis. Many existing approaches enforce rigid constraints or rely on multi-stage pipelines (e.g., spectral embedding followed by k-means), limiting trade-off control, interpretability, and scalability. We introduce DFNMF, an end-to-end deep nonnegative tri-factorization tailored to graphs that directly optimizes cluster assignments with a soft statistical-parity regularizer. A single parameter λ tunes the fairness-utility balance, while nonnegativity yields parts-based factors and transparent soft memberships. The optimization uses sparse-friendly alternating updates and scales near-linearly with the number of edges. Across synthetic and real networks, DFNMF achieves substantially higher group balance at comparable modularity, often dominating state-of-the-art baselines on the Pareto front. The code is available at https://github.com/SiamakGhodsi/DFNMF.git.
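How a single parameter λ can trade utility against fairness is easiest to see with a toy objective. The sketch below is an assumption-laden illustration, not DFNMF's actual regularizer: it measures statistical parity as the squared gap between each group's mean soft membership and the overall mean, per cluster, and adds it to a given utility loss. The names `parity_penalty` and `fair_objective` are hypothetical.

```python
def parity_penalty(memberships, groups):
    """Squared deviation of each group's mean cluster membership from the overall mean.

    memberships: list of per-node soft membership rows (length k each).
    groups: list of sensitive-group labels, one per node.
    Returns 0.0 when every group spreads across clusters identically.
    """
    k = len(memberships[0])
    overall = [sum(row[c] for row in memberships) / len(memberships) for c in range(k)]
    penalty = 0.0
    for g in sorted(set(groups)):
        rows = [row for row, s in zip(memberships, groups) if s == g]
        mean = [sum(row[c] for row in rows) / len(rows) for c in range(k)]
        penalty += sum((mean[c] - overall[c]) ** 2 for c in range(k))
    return penalty


def fair_objective(utility_loss, memberships, groups, lam):
    """Hypothetical combined objective: utility loss plus lam times the parity penalty."""
    return utility_loss + lam * parity_penalty(memberships, groups)
```

Setting `lam` to zero recovers the pure clustering objective, and raising it pushes the optimizer toward clusters in which each sensitive group is proportionally represented.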

2510.21464 2026-05-20 cs.CV

CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis

CXR-LanIC:基于语言的可解释分类器用于胸部X光诊断

Yiming Tang, Wenjia Zhong, Rushi Shah, Dianbo Liu

AI总结 本文提出CXR-LanIC,一种基于语言的可解释分类器,通过任务对齐的模式发现解决胸部X光诊断的可解释性挑战,通过训练稀疏自编码器提取可解释的视觉模式,实现高准确率的诊断并支持自然语言解释。

详情
AI中文摘要

深度学习模型在胸部X光诊断中已取得显著的准确性,但其广泛应用仍受到预测黑盒性质的限制。临床医生需要透明、可验证的解释来信任自动化诊断并识别潜在的故障模式。我们介绍CXR-LanIC(基于语言的可解释分类器用于胸部X光),一种新的框架,通过任务对齐的模式发现解决这一可解释性挑战。我们的方法在BiomedCLIP诊断分类器上训练基于转码的稀疏自编码器,将医学图像表示分解为可解释的视觉模式。通过在MIMIC-CXR数据集上训练100个转码器,我们发现了约5,000个单义模式,涵盖心脏、肺部、胸膜、结构、设备和伪影类别。每个模式在共享特定放射学特征的图像中表现出一致的激活行为,使预测分解为20-50个可解释模式,具有可验证的激活画廊。CXR-LanIC在五个关键发现上实现了竞争性的诊断准确性,同时通过计划的大型多模态模型注释为自然语言解释奠定基础。我们的关键创新在于从在特定诊断目标上训练的分类器中提取可解释特征,而不是通用嵌入,确保发现的模式直接相关于临床决策,证明医疗AI系统可以既准确又可解释,通过透明、基于临床的解释支持更安全的临床部署。

英文摘要

Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, yet their widespread clinical adoption remains limited by the black-box nature of their predictions. Clinicians require transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a novel framework that addresses this interpretability challenge through task-aligned pattern discovery. Our approach trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, we discover approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern exhibits consistent activation behavior across images sharing specific radiological features, enabling transparent attribution where predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while providing the foundation for natural language explanations through planned large multimodal model annotation. Our key innovation lies in extracting interpretable features from a classifier trained on specific diagnostic objectives rather than from general-purpose embeddings, which ensures that discovered patterns are directly relevant to clinical decision-making. The results demonstrate that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.

2510.18821 2026-05-20 cs.LG

Search Self-play: Pushing the Frontier of Agent Capability without Supervision

搜索自博弈:在无监督条件下推动智能体能力的前沿

Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

AI总结 本文提出了一种基于自博弈的深度搜索智能体训练方法,让同一模型同时生成任务并求解任务,在无需外部监督的情况下提升智能体性能。

Comments Published as a conference paper at the Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为训练LLM智能体的主流技术。然而,RLVR高度依赖精心设计的任务查询和相应的标准答案来提供准确的奖励,这需要大量人力,并阻碍了RL过程的扩展,尤其是在智能体场景中。尽管近期有少数工作探索了任务合成方法,但所生成智能体任务的难度很难控制,难以为RL训练提供有效的优势。为了实现可扩展性更高的智能体RLVR,我们探索了深度搜索智能体的自博弈训练,其中被训练的LLM利用多轮搜索引擎调用,并同时充当任务提出者和问题解决者。任务提出者的目标是生成具有明确标准答案且难度逐步增加的深度搜索查询。问题解决者则尝试处理生成的搜索查询并输出正确的答案预测。为了确保每个生成的搜索查询都有准确的标准答案,我们收集提出者轨迹中的所有搜索结果作为外部知识,然后进行检索增强生成(RAG),以检验在提供所有必要搜索文档的情况下,所提出的查询能否被正确回答。在这个搜索自博弈(SSP)游戏中,提出者和解决者通过竞争与合作共同进化其智能体能力。大量实验结果表明,在从头训练和持续RL训练两种设置下,SSP都能在无任何监督的情况下,在各种基准上一致且显著地提升搜索智能体的性能。代码见 https://github.com/Qwen-Applications/SSP。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR depends heavily on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has an accurate ground truth, we collect all the search results from the proposer's trajectory as external knowledge, then conduct retrieval-augmented generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly and uniformly improve search agents' performance on various benchmarks without any supervision, under both from-scratch and continuous RL training setups. The code is at https://github.com/Qwen-Applications/SSP.

2510.16814 2026-05-20 cs.LG cs.AI cs.CV

Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

景观中的针:在标签稀缺条件下用于考古遗址发现的半监督伪标签方法

Simon Jaxy, Anton Theys, Patrick Willett, W. Chris Carleton, Ralf Vandam, Pieter Libin

AI总结 本文提出了一种非对称双伪标签(DPL)方法,通过端到端深度学习直接从多波段遥感影像中学习稀疏正样本,无需人工特征工程或对遗址不存在的假设,在两个著名的考古数据集上进行了评估。DPL在Sagalassos数据集上优于LAMAP基线,在F1和召回率上分别提高了12%和29%,而在Cyprus数据集上,DPL在无确认负样本的纯PU设置中恢复了判别能力。DPL的集成产生可解释的概率表面,支持调查规划,从最小的标记数据中有效发现遗址。

详情
AI中文摘要

考古预测建模通过结合已知遗址位置与环境和地理空间变量来估计未发现遗址可能出现的位置,这构成了一个正样本-无标签(PU)学习难题:已确认的遗址稀少,大多数位置是未标记的,而非真正的负样本。为克服这一问题,我们提出了非对称双伪标签(DPL),一种端到端深度学习方法,它直接从多波段地理空间影像中的稀疏正样本学习,无需人工特征工程,也无需对遗址不存在作出假设,并在两个著名的考古数据集上进行了评估。在Sagalassos数据集上,以独立的留出实地调查为评估依据,DPL在F1和召回率上分别比LAMAP基线高出12%和29%,而LAMAP在概率排序上保持优势。当负样本不确定时,标准监督基线会严重失效;仅用正样本训练则会退化为在所有位置都预测为正,二者共同给出了经验界限。Cyprus数据集是没有确认负样本的纯PU设置,在该数据集上SL颠倒了概率排序,而DPL恢复了判别能力。DPL集成模型生成可解释的概率曲面,可支持调查规划,从而以极少的标注数据实现有效的遗址发现。

英文摘要

Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental and geospatial variables, presenting a positive-unlabeled (PU) learning challenge where confirmed sites are rare and most locations are unlabeled rather than truly negative. To overcome this, we propose asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method that learns from sparse positives directly from multi-band geospatial imagery without hand-crafted feature engineering or assumptions about site absence, and evaluate it on two prominent archaeological datasets. On the Sagalassos dataset, evaluated against an independent, held-out field survey, DPL outperforms the LAMAP baseline by 12% in F1 and 29% in Recall, while LAMAP maintains advantages in probability ranking. Standard supervised baselines fail catastrophically when negatives are uncertain; positive-only training collapses to predicting everywhere, establishing empirical bounds. On the Cyprus dataset, a pure PU setting without confirmed negatives, the supervised-learning (SL) baseline inverts probability rankings while DPL recovers discrimination. DPL ensembles produce interpretable probability surfaces supporting survey planning, enabling effective site discovery from minimal labeled data.

2510.14261 2026-05-20 cs.CL

Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

重写历史:一种用于干预分析以研究数据对模型行为影响的配方

Rahul Nadkarni, Yanai Elazar, Hila Gonen, Noah A. Smith

AI总结 本文提出了一种实验方法,用于研究训练数据与语言模型行为之间的关系,通过干预数据批次(即'重写历史')并重新训练模型检查点来测试数据与行为之间的假设,展示了如何通过案例研究来验证事实知识获取中的数据影响。

Comments Accepted to TACL, pre-MIT Press publication version

详情
AI中文摘要

我们提出了一种实验配方,用于研究训练数据与语言模型(LM)行为之间的关系。我们概述了干预数据批次——即'重写历史'——的步骤,并重新训练模型检查点以测试数据与行为之间的假设。我们的配方将这种干预分解为多个阶段,包括从衡量模型行为的基准中选择评估项目、将相关文档与这些项目匹配,并在重新训练前修改这些文档以测量效果。我们通过事实性知识获取的案例研究展示了该配方的实用性,使用共现统计和信息检索方法来识别可能促进知识学习的文档。我们的结果补充了过去将共现与模型行为联系起来的观测分析,同时表明现有方法无法完全解释LM正确回答知识问题的能力。总体而言,我们概述了一种研究人员可以遵循的配方,以进一步测试训练数据如何影响模型行为的假设。我们的代码已公开发布,以促进未来的工作。

英文摘要

We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches (i.e., "rewriting history") and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.

2510.13727 2026-05-20 cs.AI

From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails

从拒绝到恢复:一种生成AI防护机制的控制论方法

Ravi Pandya, Madison Bland, Duy P. Nguyen, Changliu Liu, Jaime Fernández Fisac, Andrea Bajcsy

AI总结 本文提出了一种基于控制论的生成AI防护机制,通过实时监控和主动纠正高风险输出,提供了一种动态替代传统标志和阻断方法的解决方案。

详情
Journal ref
Second International Association on Safe and Ethical AI Conference (IASEAI 2026)
AI中文摘要

生成AI系统越来越多地在实际应用中协助并代表终端用户,从数字购物助手到下一代自动驾驶汽车。在此背景下,安全不再仅仅是阻止有害内容,而是要预防下游危害,如财务或人身伤害。然而,大多数AI防护机制仍然依赖于标记数据集和人工指定标准的输出分类,使其对新的危险情况变得脆弱。即使在不安全状况被标记时,这种检测也提供不了恢复的路径:通常,AI系统只是拒绝行动,这并不总是安全的选择。本文认为,代理AI安全本质上是一个连续决策问题:有害结果来自于AI系统持续变化的交互及其对世界下游后果。我们通过安全关键控制理论的视角来正式化这一问题,但是在AI模型的世界表征中。这使我们能够构建预测防护机制,(i) 实时监控AI系统的输出(动作),(ii) 主动纠正危险输出为安全输出,所有这些都以模型无关的方式进行,因此同一防护机制可以围绕任何AI模型。我们还提供了一种实用的训练配方,通过安全关键强化学习在大规模上计算此类防护机制。我们在模拟驾驶和电子商务设置中的实验表明,控制论防护机制能够可靠地引导LLM代理避免灾难性结果(从碰撞到破产),同时保持任务性能,提供了一种有原则的动态替代传统标志和阻断防护机制的解决方案。

英文摘要

Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards like financial or physical harm. Yet, most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to new hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act, which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled dynamic alternative to today's flag-and-block guardrails.