arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Agents, Planning, and Decision-Making (7 papers)

2510.05107 2026-06-18 cs.AI (updated version)

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Myung Ho Kim

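Affiliations: JEI University

AI summary: Proposes the Structured Cognitive Loop (SCL) architecture, which separates cognition, memory, control, and action into distinct modules to make LLM agent behavior accountable. Across 360 episodes it reaches an 86.3% task success rate, versus 70.5% to 76.8% for the prompt-based baselines.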

Comments This revised version extends the original SCL framework from a behavioral architecture for reliable LLM agents into a broader architecture of epistemic accountability, integrating context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment structure

Abstract

The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop (SCL) as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint-guided image generation. Across 360 episodes, SCL achieves 86.3% task success compared with 70.5% to 76.8% for prompt-based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.
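
The controller role described in this abstract (check preconditions, prevent redundant actions, authorize execution before a tool runs) can be pictured with a small Python sketch. This is only an illustration under assumptions: `Proposal`, `Controller`, the decision strings, and the toy travel tools are hypothetical names invented here and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Proposal:
    """An action proposed by the language model (hypothetical structure)."""
    tool: str
    args: tuple
    requires: tuple = ()  # memory keys that must already hold verified results


@dataclass
class Controller:
    """Gate between model proposals and tool execution (illustrative only)."""
    tools: dict[str, Callable]
    memory: dict = field(default_factory=dict)  # external, verified state
    trace: list = field(default_factory=list)   # inspectable decision log

    def step(self, p: Proposal) -> str:
        key = (p.tool, p.args)
        if p.tool not in self.tools:
            decision = "rejected: unknown tool"
        elif any(k not in self.memory for k in p.requires):
            decision = "rejected: missing precondition"
        elif key in self.memory:
            decision = "skipped: result reused from memory"
        else:
            # Only an authorized proposal ever reaches the tool.
            self.memory[key] = self.tools[p.tool](*p.args)
            decision = "authorized"
        self.trace.append((p.tool, p.args, decision))
        return decision


# Toy tools standing in for a travel-planning task.
tools = {
    "search_flights": lambda src, dst: [f"{src}->{dst} 09:00"],
    "book_flight": lambda src, dst: f"booked {src}->{dst}",
}
ctrl = Controller(tools)
search = Proposal("search_flights", ("ICN", "NRT"))
book = Proposal("book_flight", ("ICN", "NRT"),
                requires=(("search_flights", ("ICN", "NRT")),))

decisions = [ctrl.step(p) for p in (book, search, search, book)]
for entry in ctrl.trace:
    print(entry)
```

In this toy run the booking is refused until a search result exists in memory, and the repeated search is answered from memory rather than executed again. Every decision lands in `trace`, which is the kind of record an accountability argument would inspect.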

2603.00656 2026-06-18 cs.AI (updated version)

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

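Affiliations: Peking University; The Hong Kong University of Science

AI summary: To address credit assignment problems and insufficient advantage signals in multi-turn interaction, proposes InfoPO, which combines an information-gain reward with adaptive variance-gated fusion. It outperforms existing baselines on tasks such as intent clarification and collaborative coding.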

Abstract

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
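
To make the turn-level reward idea concrete, here is a minimal Python sketch. It assumes the shift in the action distribution is measured with KL divergence and that the gate depends on the variance of outcome rewards within a rollout group. The abstract confirms neither choice, so `kl_divergence`, `info_gain_reward`, `gated_reward`, and the `tau` parameter are illustrative assumptions rather than the paper's formulas.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def info_gain_reward(with_feedback, masked_feedback):
    """Credit a turn by how far its feedback moves the next-action distribution
    away from the masked-feedback counterfactual (assumed measure: KL)."""
    return kl_divergence(with_feedback, masked_feedback)


def gated_reward(outcome, info_gain, group_outcomes, tau=0.05):
    """Hypothetical variance gate: rely more on information gain when the
    outcome rewards in the rollout group barely differ."""
    gate = tau / (tau + pvariance(group_outcomes))  # lies in (0, 1]
    return outcome + gate * info_gain


# Feedback that changes nothing earns no credit; informative feedback does.
uninformative = info_gain_reward([0.5, 0.5], [0.5, 0.5])
informative = info_gain_reward([0.9, 0.1], [0.5, 0.5])

flat_group = gated_reward(1.0, informative, [1.0, 1.0, 1.0])
mixed_group = gated_reward(1.0, informative, [0.0, 1.0, 0.0, 1.0])
print(uninformative, informative, flat_group, mixed_group)
```

The design intuition in this sketch is that when every rollout in a group gets the same outcome, the trajectory-level advantage carries no signal, so the turn-level term is given full weight. When outcomes already differ, the gate shrinks it.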

2606.01139 2026-06-18 cs.AI (updated version)

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Yuxuan Liu, Zhaochen Su, Lingyun Xie, Yuhao Zhang, Qing Zong, Jiahe Guo, Zhongwei Xie, Yiyan Ji, Yauwai Yim, Hongyu Luo, Xiyu Ren, Ruan Chenyu, Haoran Li, Yangqiu Song

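Affiliations: The Hong Kong University of Science and Technology; Harbin Institute of Technology; Harbin Institute of Technology, Shenzhen; Nanjing University; The University of Hong Kong

AI summary: Proposes the SkillRevise framework, which iteratively refines an initial skill by diagnosing defects from execution evidence, retrieving repair principles, and applying execution-anchored edits. It raises the base agent's success rate on SkillsBench from 36.05% to 61.63% and shows cross-model transfer.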

Comments 15 pages, 4 figures

Abstract

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates, it retains the first verifier-passing skill within the revision budget and falls back to empirical utility only when no candidate succeeds. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills transfer across both executors and task environments, suggesting that SkillRevise captures reusable procedural knowledge beyond any single executor.
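
The selection rule in this abstract (keep the first verifier-passing skill within the revision budget, otherwise fall back to empirical utility) can be sketched as a short loop. The function `revise_skill`, its callback signatures, and the toy skill format are hypothetical. The sketch also leaves out the retrieval of repair principles from memory.

```python
from typing import Callable


def revise_skill(
    initial: str,
    propose_edit: Callable[[str, str], str],
    execute: Callable[[str], tuple[bool, float, str]],
    budget: int = 3,
) -> str:
    """Return the first verifier-passing skill, else the highest-utility one.

    execute(skill) returns (passed_verifier, utility, evidence), all assumed.
    """
    passed, utility, evidence = execute(initial)
    if passed:
        return initial
    best_skill, best_utility = initial, utility
    current = initial
    for _ in range(budget):
        current = propose_edit(current, evidence)     # execution-anchored edit
        passed, utility, evidence = execute(current)  # re-execute the candidate
        if passed:
            return current                            # first passing candidate wins
        if utility > best_utility:
            best_skill, best_utility = current, utility
    return best_skill                                 # fallback: empirical utility


# Toy setting: a "skill" is a comma-separated list of steps.
REQUIRED = ["login", "validate", "retry"]


def toy_execute(skill):
    steps = skill.split(",")
    missing = [s for s in REQUIRED if s not in steps]
    utility = 1 - len(missing) / len(REQUIRED)
    return (not missing, utility, missing[0] if missing else "")


def toy_edit(skill, evidence):
    # The "diagnosis" here is simply the first missing step.
    return f"{skill},{evidence}"


revised = revise_skill("login", toy_edit, toy_execute, budget=3)
partial = revise_skill("login", toy_edit, toy_execute, budget=1)
print(revised, "|", partial)
```

With a budget of three revisions the toy skill passes the verifier. With a budget of one it does not, and the loop returns the best partially repaired candidate.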

2606.17454 2026-06-18 cs.AI cs.LG (updated version)

Dissecting model behavior through agent trajectories

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

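Affiliations: AWS AI Labs

AI summary: Introduces the "intent-execution gap" and the Simple Strands Agent (SSA) harness, and analyzes 138k trajectories to reveal model-level behavioral differences in autonomous problem solving.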

Comments 106 pages, 50 Figures, 16 Tables

Abstract

AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the "intent-execution" gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called "Simple Strands Agent" (SSA). SSA aims to find the bulk of common patterns that generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified, and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers, which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

2603.00026 2026-06-18 cs.CL cs.AI cs.IR (updated version)

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

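Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary: Proposes the ActMem framework, which converts unstructured dialogue history into a structured causal and semantic graph and combines counterfactual reasoning with commonsense completion to perform active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.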

Abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2603.29247 2026-06-18 cs.CL cs.AI cs.LG (updated version)

MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong

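Affiliations: Santa Clara University; Independent Researcher

AI summary: Proposes the MemRerank framework, which uses reinforcement learning to distill user purchase history into query-independent preference memory for personalized reranking in LLM shopping agents, improving accuracy on a 1-in-5 selection task by up to 10.61 absolute points.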

Comments correct author name in metadata

Abstract

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

2605.30880 2026-06-18 cs.CL cs.AI (updated version)

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, yielding gradient-free symbolic belief-state programs that reach 76.4% macro success across AgentGym environments.

Comments 40 pages

Abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.

2. Knowledge Representation, Reasoning, and Symbolic AI (4 papers)

2505.12369 2026-06-18 cs.AI cs.LG cs.LO (updated version)

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

Fernando Zhapa-Camacho, Robert Hoehndorf

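Affiliations: KAUST Center of Excellence for Smart Health (KCSH); KAUST Center of Excellence for Generative AI

AI summary: Proposes GeometrE, which maps logical operations to purely geometric transformations and introduces a transitive loss function, improving multi-hop reasoning performance while preserving interpretability.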

Comments Accepted at ESWC 2026

Journal ref The Semantic Web. ESWC 2026. Lecture Notes in Computer Science, vol 16549. Springer, Cham (2026)

Abstract

Multi-hop logical reasoning on knowledge graphs requires faithfully mapping the logical semantics to latent space. Current geometric embedding methods have proven useful for this task by mapping entities to geometric regions and logical operations to latent transformations. While a geometric embedding can provide a direct interpretability framework for query answering, current methods have only leveraged the geometric construction of entities: they fail to map logical operations to pure geometric transformations and instead use neural components to learn these operations. On the other hand, purely neural-based methods outperform geometric methods, but they lack interpretability in the latent space. We introduce GeometrE, a geometric embedding method for multi-hop reasoning that maps every logical operation to a purely geometric operation in the latent space. Additionally, we introduce a transitive loss function and show that, unlike existing methods, it can preserve the logical rule r(a,b) and r(b,c) -> r(a,c) for all a, b, c. Our experiments show that GeometrE outperforms current state-of-the-art geometric methods and remains competitive with existing neural-based methods on standard benchmark datasets.

2605.16385 2026-06-18 cs.CV cs.AI cs.CL (updated version)

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

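Affiliations: Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank for rigorous reasoning on solid geometry problems, achieving SOTA performance on SolidFGeo2k and MathVerse-Solid.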

Comments Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Geometric problem solving, a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most works focus on plane geometry and usually fail on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which has two steps: parsing, then reasoning. In the parsing step, we use a conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2605.22142 2026-06-18 cs.LG cs.AI (updated version)

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Taewoon Kim, Vincent François-Lavet, Michael Cochez

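AI summary: Studies short-term-to-long-term memory transfer in knowledge graphs under partial observability and proposes a neuro-symbolic value-based approach that decides whether to keep or drop each observed triple before long-term insertion, improving memory efficiency. It outperforms symbolic and neural baselines on the RoomKG benchmark.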

Abstract

Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.
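
The per-item keep/drop decision can be pictured as Q-learning in which every item in the short-term buffer is scored by the same parameters. The sketch below is a simplified stand-in, not the paper's model: it assumes a linear Q-function, two hand-made features, and invented rewards. The names `q_values` and `td_update` are hypothetical.

```python
GAMMA, LR = 0.9, 0.1
ACTIONS = ("keep", "drop")


def q_values(weights, features):
    """Q(item, action) from one weight vector per action, shared by all items."""
    return {a: sum(w * x for w, x in zip(weights[a], features)) for a in ACTIONS}


def td_update(weights, features, action, reward, next_features=None):
    """One temporal-difference step for a single item.

    next_features holds the same item's features at the next step when the item
    can be matched there, or None when it cannot (treated as terminal here).
    """
    target = reward
    if next_features is not None:
        target += GAMMA * max(q_values(weights, next_features).values())
    error = target - q_values(weights, features)[action]
    weights[action] = [w + LR * error * x for w, x in zip(weights[action], features)]
    return error


# Toy features: [query_relevant, irrelevant]; the rewards are invented.
weights = {a: [0.0, 0.0] for a in ACTIONS}
relevant, irrelevant = [1.0, 0.0], [0.0, 1.0]
for _ in range(200):
    td_update(weights, relevant, "keep", 1.0)     # kept fact later answers a query
    td_update(weights, relevant, "drop", 0.0)
    td_update(weights, irrelevant, "keep", -0.2)  # kept fact only uses capacity
    td_update(weights, irrelevant, "drop", 0.0)

policy = {
    name: max(ACTIONS, key=lambda a: q_values(weights, feats)[a])
    for name, feats in (("relevant", relevant), ("irrelevant", irrelevant))
}
print(policy)
```

Because the weights are shared, the buffer can hold any number of items at each step. The greedy policy is applied to each item independently, which is one simple way to handle variable-sized buffers.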

2606.06133 2026-06-18 cs.SE cs.AI cs.LG cs.LO (updated version)

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

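Affiliations: Department of Computer Science, Loyola University Chicago

AI summary: Proposes the TLA-Prover model, which combines supervised fine-tuning with repair-based group-relative policy optimization for TLA+ specification synthesis checked by the TLC model checker. Its Gold/Diamond pass rate reaches 30%, about 3.5 times the untuned baseline.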

Comments 12 pages, 5 tables, 3 figures. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026)

Abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
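
The four-tier grading and the headline arithmetic can be written out as a short Python sketch. It assumes the tiers are cumulative (each tier requires the one before it), which the abstract suggests but does not state outright. `CheckResult` and `grade` are hypothetical names, and no real TLC invocation is shown.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical summary of checking one generated specification."""
    parses: bool
    warnings: int
    passes_tlc: bool
    mutant_violation_detected: bool  # TLC flags the slightly altered property


def grade(r: CheckResult) -> str:
    """Map a check result to the highest tier it satisfies (assumed cumulative)."""
    if not r.parses:
        return "Fail"
    if r.warnings > 0:
        return "Bronze"
    if not r.passes_tlc:
        return "Silver"
    if not r.mutant_violation_detected:
        return "Gold"  # the property may be always-true, so Diamond is withheld
    return "Diamond"


trivial = grade(CheckResult(True, 0, True, False))
meaningful = grade(CheckResult(True, 0, True, True))

# Headline numbers from the abstract: 9 of 30 held-out problems pass.
pass_at_1 = 9 / 30
ratio_to_baseline = pass_at_1 * 100 / 8.6
print(trivial, meaningful, pass_at_1, round(ratio_to_baseline, 2))
```

The Diamond check works like a mutation test: a property that still passes after being altered constrains nothing, so it earns only Gold. The ratio of 30% to the 8.6% baseline comes to about 3.49, which the abstract rounds to 3.5x.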

3. Multi-Agent Systems and Games (5 papers)

2402.08128 2026-06-18 cs.AI cs.GT (updated version)

Recursive Joint Simulation in Games

Vojtech Kovarik, Caspar Oesterheld, Vincent Conitzer

Affiliations: Foundations of Cooperative AI Lab (FOCAL), Computer Science Department; Carnegie Mellon University; AI Center; Czech Technical University; Center for Theoretical Study; Charles University

AI summary: Studies how AI agents can achieve cooperation through recursive joint simulation and proves that the process is strategically equivalent to an infinitely repeated version of the original game, so existing results such as the folk theorems apply directly.

Abstract

Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds "from the inside", meaning for an agent that finds itself inside the game and has self-locating uncertainty.
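
One way to see why a small failure chance keeps the recursion finite is to sample how deep the nesting goes. The sketch below assumes each level independently fails to spawn a further simulation with probability `eps`, so the depth is geometrically distributed with mean (1 - eps) / eps. Reading the continuation probability 1 - eps as playing the role of a per-round continuation probability in a repeated game is offered only as intuition, not as the paper's construction. `nesting_depth` is a hypothetical helper.

```python
import random


def nesting_depth(eps: float, rng: random.Random) -> int:
    """Count nested simulations; each level fails to spawn another with prob. eps."""
    depth = 0
    while rng.random() >= eps:  # continue with probability 1 - eps
        depth += 1
    return depth


rng = random.Random(0)  # fixed seed so the run is reproducible
eps = 0.2
samples = [nesting_depth(eps, rng) for _ in range(20_000)]
mean_depth = sum(samples) / len(samples)
expected_depth = (1 - eps) / eps  # mean of a geometric distribution

print(f"empirical mean depth: {mean_depth:.2f}, expected: {expected_depth:.2f}")
```

Every sampled depth is finite, yet no fixed bound on the depth is known in advance. That matches the informal picture of a game that is repeated an unknown number of times.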

2508.21720 2026-06-18 cs.AI (updated version)

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

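Affiliations: Graduate School of Artificial Intelligence, KAIST; School of Integrated Technology, Yonsei University

AI summary: Proposes PosterForest, a training-free framework for scientific poster generation that represents document structure hierarchically as a Poster Tree and uses content and layout agents for hierarchical reasoning and recursive refinement, jointly optimizing content and layout to improve semantic coherence, logical flow, and visual balance.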

Comments ACL 2026

Abstract

Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.

2606.15504 2026-06-18 cs.AI (updated version)

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang

Affiliations: Medical AI Lab, The First Hospital of Hebei Medical University; Hebei Provincial Engineering Research Center for AI-Based Cancer Treatment Decision-Making, The First Hospital of Hebei Medical University; State Key Laboratory of Neurology and Oncology Drug Development; School of Computing, University of Georgia; Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital and Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College; Department of Radiation Oncology, Mayo Clinic; College of Mechanical and Power Engineering, China Three Gorges University; Department of Radiation Oncology, Guangzhou Concord Cancer Center; Gastrointestinal Disease Diagnosis and Treatment Center, The First Hospital of Hebei Medical University; Department of General Surgery, The First Hospital of Hebei Medical University

AI summary: Proposes the VIBEMed multi-agent framework, which learns dynamically from interaction history through a self-evolution mechanism and an architecture-level safety sandbox to provide personalized clinical decision support.

英文摘要

In recent years, advances in large language models and autonomous agents have revolutionized the healthcare field, facilitating diagnosis and improving treatment outcomes. However, most existing AI systems rely on pre-trained knowledge and predefined pipelines, and they struggle to learn dynamically from interactive chat session histories that contain patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi-agent framework with a built-in self-evolution mechanism and an architecture-level safety sandbox for robust clinical decision support. The system integrates three specialized agents: a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge. Together they transform multimodal patient information into personalized medical decisions. Through its self-evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed demonstrates superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision-making and longitudinal planning. The framework also supports reliable end-to-end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real-world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience-driven clinical decision support, demonstrating the value of combining multi-agent collaboration with continuous evolution for advancing precision medicine.

2506.09046 2026-06-18 cs.LG cs.AI cs.MA 版本更新

Self-Evolving Multi-Agent Systems via Textual Backpropagation

通过文本反向传播的自进化多智能体系统

Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze

发表机构 * Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) University of Notre Dame(圣母大学)

AI总结 提出Agentic Neural Network框架,将多智能体协作建模为分层神经网络,通过前向分解任务和反向传播反馈实现智能体角色、提示和协作的自进化,在七个基准数据集上超越现有方法。

详情
AI中文摘要

利用多个大型语言模型(LLM)已被证明对处理复杂、高维任务有效,但当前方法通常依赖静态、手动设计的多智能体配置。为克服这些限制,我们提出Agentic Neural Network(ANN)框架,该框架将多智能体协作概念化为分层神经网络架构。在此设计中,每个智能体作为节点运行,每一层形成一个专注于特定子任务的协作团队。我们的框架遵循两阶段优化策略:(1)前向阶段——受神经网络前向传播启发,任务被动态分解为子任务,并逐层构建具有合适聚合方法的协作智能体团队。(2)反向阶段——模仿反向传播,我们通过迭代反馈优化全局和局部协作,使智能体能够自进化其角色、提示和协调。这种神经符号方法使我们的框架能够在训练后创建新的或专门的智能体团队,在准确性和适应性方面带来显著提升。在七个基准数据集上,我们的工作在相同配置下超越了领先的多智能体基线,显示出持续的性能改进。

英文摘要

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

2510.18085 2026-06-18 cs.RO cs.AI cs.MA 版本更新

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

R2BC: 从单智能体演示进行多智能体模仿学习

Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah(犹他大学凯勒尔计算学院) DEVCOM Army Research Laboratory(陆军研究实验室)

AI总结 提出R2BC方法,通过轮换单智能体演示训练多机器人系统,无需联合动作空间演示,在模拟和实物任务中性能媲美或超越基于特权同步演示的基线方法。

Comments 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)

详情
AI中文摘要

模仿学习(IL)是人类教授机器人的自然方式,尤其是在高质量演示易于获取的情况下。虽然IL已广泛应用于单机器人场景,但将其扩展到多智能体系统的研究相对较少,尤其是在单个人类必须为协作机器人团队提供演示的场景中。本文介绍并研究了轮换行为克隆(R2BC),该方法使单个人类操作员能够通过顺序的单智能体演示有效训练多机器人系统。我们的方法允许人类一次远程操作一个智能体,并逐步向整个系统教授多智能体行为,无需联合多智能体动作空间的演示。我们表明,在四个多智能体模拟任务中,R2BC方法的性能与基于特权同步演示的Oracle行为克隆方法相当,甚至在某些情况下超越后者。最后,我们在两个使用真实人类演示训练的物理机器人任务上部署了R2BC。

英文摘要

Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.

4. 搜索、优化与约束求解 5 篇

2510.27353 2026-06-18 cs.AI 版本更新

An In-depth Study of LLM Contributions to the Bin Packing Problem

LLM对装箱问题贡献的深入研究

Julien Herrmann, Guillaume Pallez

发表机构 * CNRS-IRIT Inria

AI总结 通过分析LLM生成的启发式算法,发现其虽可读但难以解释,进而提出更简单高效的新算法,质疑LLM对装箱问题的实际贡献。

Comments Accepted for publication in ACM Transactions on Evolutionary Learning and Optimization

详情
AI中文摘要

近期研究表明,大型语言模型(LLM)可能为数学发现提供有趣的思路。该主张基于报告称,基于LLM的遗传算法在均匀分布和Weibull分布下为在线装箱问题产生了具有新见解的启发式算法。本文通过详细分析LLM产生的启发式算法,考察其行为和可解释性,重新评估了这一主张。尽管这些启发式算法是人类可读的,但即使对领域专家而言,它们仍然在很大程度上是不透明的。基于此分析,我们提出了一类针对这些特定装箱实例的新算法。推导出的算法显著更简单、更高效、更可解释且更具泛化性,表明所考虑的实例本身相对简单。然后,我们讨论了关于LLM对该问题贡献的主张的局限性,该主张似乎基于一个错误的假设,即这些实例先前已被研究过。我们的发现反而强调了在评估LLM生成输出的科学价值时,需要进行严格的验证和情境化。

英文摘要

Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.
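For readers unfamiliar with the setting, the following is a minimal sketch of online bin packing using the classic Best Fit rule. It is only a textbook baseline for orientation; it is neither one of the LLM-generated heuristics nor the new class of algorithms proposed in the paper, and the function name `best_fit` is chosen here for illustration.

```python
def best_fit(items, capacity):
    """Online Best Fit: place each arriving item in the fullest bin that still fits it.

    Returns the number of bins used. Items are processed strictly in arrival order,
    which is what makes the problem "online".
    """
    residuals = []  # remaining free space of each open bin
    for size in items:
        # Bins that still have enough room for this item.
        candidates = [i for i, free in enumerate(residuals) if free >= size]
        if candidates:
            # Choose the bin that would be left with the least free space.
            target = min(candidates, key=lambda i: residuals[i] - size)
            residuals[target] -= size
        else:
            # No open bin fits the item, so open a new one.
            residuals.append(capacity - size)
    return len(residuals)
```

An online heuristic of the kind discussed in the abstract can be viewed as replacing the `key` function above with a different scoring rule over the candidate bins.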

2602.23092 2026-06-18 cs.AI 版本更新

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

通过LLM驱动的自动启发式设计增强CVRP求解器

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

发表机构 * Southern University of Science and Technology(南方科技大学) City University of Hong Kong(香港城市大学)

AI总结 提出AILS-AHD方法,结合进化搜索框架与大语言模型动态生成和优化破坏启发式,并引入加速机制,在中等和大规模CVRP实例上优于现有求解器,在CVRPLib大规模基准的10个实例中为8个刷新了最佳已知解。

详情
AI中文摘要

容量受限车辆路径问题(CVRP)是一个基本的组合优化挑战,专注于在车辆容量约束下优化车队运营。尽管在运筹学中得到了广泛研究,CVRP的NP-hard性质仍然带来显著的计算挑战,特别是对于大规模实例。本研究提出了AILS-AHD(自适应迭代局部搜索与自动启发式设计),一种利用大语言模型(LLMs)革新CVRP求解的新方法。我们的方法将进化搜索框架与LLMs集成,在AILS方法中动态生成和优化破坏启发式。此外,我们引入了一种基于LLM的加速机制以提高计算效率。针对最先进的求解器(包括AILS-II和HGS)的综合实验评估表明,AILS-AHD在中等和大规模实例上均表现出优越性能。值得注意的是,我们的方法在CVRPLib大规模基准的10个实例中为8个建立了新的最佳已知解,突显了LLM驱动的启发式设计在推进车辆路径优化领域的潜力。

英文摘要

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

2605.29649 2026-06-18 cs.AI 版本更新

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

LLM进化的符号AI规划领域无关启发式

Elliot Gestrin, Jendrik Seipp

AI总结 本文使用进化搜索让大语言模型生成领域无关的启发式函数,在未见测试域上超越手工最优启发式,并首次系统评估了启发式的信息性-速度权衡。

Comments Accepted at the LM4Plan workshop at ICAPS 2026

详情
AI中文摘要

启发式搜索是符号AI规划中的主导范式,最强的启发式是规划研究者数十年工作的成果。最近的工作表明,大型语言模型(LLM)可以为单个规划领域设计启发式,但迄今为止,没有LLM生成的启发式能在任意规划任务上工作。在本文中,我们使用进化搜索来产生第一个LLM生成的领域无关启发式,其超越了手工最优的现有技术。我们让LLM变异用C++编写的父启发式,将候选解存储在MAP-Elites档案中,以信息性和速度作为键,并通过混合覆盖率和求解时间计算适应度分数。为了将进化程序置于上下文中,我们还额外基准测试了一组广泛的手工启发式在信息性-速度权衡上的表现,据我们所知,这之前从未做过。在未见测试域上,我们最好的进化启发式比最强基线解决了更多任务,我们的完整启发式套件跨越了所述权衡的帕累托前沿。我们还发现,从平凡的盲目启发式开始进化优于从强FF启发式开始,即使最终程序本身是FF变体,并且LLM推理努力影响候选编译成功的频率远大于影响那些编译成功的候选的质量。由于进化程序是纯C++,它们可以作为即插即用替代品插入现有规划器,并继承底层搜索的健全性和完备性保证。

英文摘要

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.
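The MAP-Elites archive mentioned above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the descriptor ranges (both normalized to [0, 1]), the number of bins, and the names `bin_index` and `MapElitesArchive` are all hypothetical, and the fitness value is taken as given rather than computed from coverage and solving time.

```python
def bin_index(value, low, high, n_bins):
    """Map a descriptor value in [low, high] to one of n_bins discrete cells."""
    clipped = min(max(value, low), high)
    return min(int((clipped - low) / (high - low) * n_bins), n_bins - 1)


class MapElitesArchive:
    """Keeps the single best candidate per (informedness, speed) cell."""

    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.cells = {}  # (informedness_bin, speed_bin) -> (fitness, candidate)

    def add(self, candidate, informedness, speed, fitness):
        """Insert candidate if its cell is empty or it beats the incumbent.

        Assumes both descriptors are already normalized to [0, 1].
        Returns True when the archive changed.
        """
        key = (
            bin_index(informedness, 0.0, 1.0, self.n_bins),
            bin_index(speed, 0.0, 1.0, self.n_bins),
        )
        incumbent = self.cells.get(key)
        if incumbent is None or fitness > incumbent[0]:
            self.cells[key] = (fitness, candidate)
            return True
        return False
```

Keeping one elite per cell is what lets such an archive retain slow but informed candidates alongside fast but weakly informed ones, instead of collapsing to a single best program.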

2411.16206 2026-06-18 cs.LG cs.AI cs.NE 版本更新

Scalable Batch Bayesian Optimization Via Subspace Acquisition Functions

可扩展的批量贝叶斯优化:基于子空间采集函数

Dawei Zhan, Zhaoxi Zeng, Shuoxiao Wei, Ping Wu

发表机构 * School of Computing and Artificial Intelligence(计算与人工智能学院)

AI总结 提出通过从原始问题的轴对齐子空间中各选一点来扩展贝叶斯优化至大规模批量评估,显著加速收敛,与十种批量算法相比极具竞争力。

详情
Journal ref
ACM Transactions on Evolutionary Learning and Optimization, 2026
AI中文摘要

将贝叶斯优化扩展到批量评估可以使设计者充分利用并行计算技术。然而,当前大多数批量方法在批量大小增大时扩展性不佳,优化效率往往下降。为解决此问题,本文提出一种简单高效的方法,将贝叶斯优化扩展到大规模批量评估。与现有批量方法不同,新方法的思想是从原始问题中抽取一批轴对齐子空间,并使用现有采集函数从每个子空间中选择一个点。数值实验表明,与顺序贝叶斯优化算法相比,我们提出的方法显著加速收敛,并且与十种批量贝叶斯优化算法相比表现非常有竞争力。我们提出的方法的实现可在 https://github.com/zhandawei/SubSpace_Acquisition_Functions 获取。

英文摘要

Extending Bayesian optimization to batch evaluation enables the designer to make full use of parallel computing technology. However, most current batch approaches do not scale well with the batch size: their optimization efficiency often deteriorates as the batch size increases. To address this issue, we propose a simple and efficient approach to extend Bayesian optimization to large-scale batch evaluation. Unlike existing batch approaches, the new approach draws a batch of axis-aligned subspaces of the original problem and selects one point from each subspace using existing acquisition functions. Numerical experiments show that our proposed approach speeds up convergence significantly compared with the sequential Bayesian optimization algorithm, and performs very competitively compared with ten batch Bayesian optimization algorithms. The implementation of our proposed approach is available at https://github.com/zhandawei/SubSpace_Acquisition_Functions.
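The batch-selection idea in the abstract (one point per axis-aligned subspace) could be sketched as below. Several details are assumptions made for illustration only and are not taken from the paper or its repository: coordinates outside each subspace are held at an incumbent point, the acquisition function is maximized by plain random search, and the names `select_batch`, `incumbent`, and `n_candidates` are hypothetical.

```python
import random


def select_batch(acquisition, incumbent, bounds, batch_size, subspace_dim,
                 n_candidates=200, seed=0):
    """Pick batch_size points, each by maximizing acquisition within one random
    axis-aligned subspace of the full search space."""
    rng = random.Random(seed)
    dim = len(incumbent)
    batch = []
    for _ in range(batch_size):
        # An axis-aligned subspace is just a subset of the coordinate axes.
        axes = rng.sample(range(dim), subspace_dim)
        best_x, best_val = None, float("-inf")
        for _ in range(n_candidates):
            x = list(incumbent)  # assumption: other coordinates stay at the incumbent
            for a in axes:
                low, high = bounds[a]
                x[a] = rng.uniform(low, high)
            val = acquisition(x)
            if val > best_val:
                best_x, best_val = x, val
        batch.append(best_x)
    return batch
```

In a real Bayesian optimization loop, `acquisition` would be something like expected improvement under the current surrogate model; here any callable that scores a point works, which keeps the sketch independent of a particular library.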

2606.14202 2026-06-18 cs.NE cs.AI 版本更新

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

MeEvo: 元认知进化与自然进化相结合用于自动启发式设计

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

发表机构 * School of Computer Science, University of Nottingham Ningbo China(诺丁汉大学宁波分校计算机科学学院) School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院)

AI总结 提出MeEvo框架,通过循环耦合自然进化(探索启发式代码)和元认知进化(反思历史生成改进启发式),解决现有方法知识继承弱、探索不足的问题,在五个优化问题上表现更优。

详情
AI中文摘要

大型语言模型(LLMs)通过推理和代码合成实现启发式生成,推动了自动启发式设计(AHD)的发展。现有的基于LLM的AHD架构主要遵循两种范式:自然进化,它使用交叉和变异来探索启发式程序;以及元认知进化,它通过反思来改进推理。然而,自然进化丢弃了推理轨迹,削弱了知识继承和利用,而元认知进化缺乏种群级别的重组,限制了探索并增加了过早收敛的风险。这些局限性降低了复杂问题的搜索效率、稳定性和解的质量。为了解决这一差距,我们提出了MeEvo,一种双层AHD框架,它循环耦合自然进化和元认知进化。自然进化探索启发式代码,同时将推理轨迹、适应度值和错误记录到共享历史中;然后元认知进化反思该历史以生成改进的启发式,这些启发式重新进入父代池以进行下一轮循环。这种设计使得种群驱动的探索和反思驱动的改进相互加强。在五个优化问题上的实验(使用两个LLM骨干)表明,MeEvo比现有的基于LLM的AHD架构实现了更强且更稳定的性能,尤其是在复杂约束任务上。

英文摘要

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

5. 机器学习与表示学习 23 篇

2602.06774 2026-06-18 cs.AI 版本更新

Towards Understanding What State Space Models Learn About Code

理解状态空间模型在代码中学到了什么

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

发表机构 * TU Darmstadt(达姆施塔特工业大学) Hessian Center for Artificial Intelligence(黑森人工智能中心) National Research Center for Applied Cybersecurity ATHENE(应用网络安全国家研究中心ATHENE)

AI总结 本文首次系统分析状态空间模型(SSM)在代码理解中的学习机制,发现SSM在预训练时比Transformer更有效捕获语法和语义结构,但微调时会遗忘某些关系,并提出SSM-Interpret框架和架构改进,将NLCodeSearch的MRR提升高达6。

详情
AI中文摘要

状态空间模型(SSM)已成为Transformer架构的高效替代方案。先前工作表明,在可比条件下训练时,SSM在代码理解任务上可以匹配或超越Transformer。然而,其内部机制仍是一个黑箱。我们首次系统分析了基于SSM的代码模型所学到的内容,并在此领域直接比较了SSM和Transformer模型。我们的分析表明,SSM在预训练期间比Transformer更有效地捕获了语法和语义结构,但在某些任务的微调过程中会遗忘某些关系。为了研究这种行为,我们引入了SSM-Interpret,一个频域框架,揭示了微调期间向短程依赖的频谱偏移。在这些发现的指导下,我们提出了架构修改,将基于SSM的代码模型在NLCodeSearch上的性能显著提升了高达+6 MRR。这表明我们的分析不仅解释了模型行为,而且直接导致了更好的设计。

英文摘要

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.

2603.09344 2026-06-18 cs.AI stat.ML 版本更新

Robust Regularized Policy Iteration under Transition Uncertainty

鲁棒正则化策略迭代在转移不确定性下

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China(浙江大学计算机科学与技术学院) School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China(西北工业大学人工智能、光学与电子学院(iOPEN)) School of Software Technology, Zhejiang University, Hangzhou, China(浙江大学软件技术学院) School of Software Engineering, Xi'an Jiaotong University, Xi'an, China(西安交通大学软件工程学院) School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学系统科学与工程学院)

AI总结 提出鲁棒正则化策略迭代(RRPI),通过将离线强化学习建模为鲁棒策略优化,使用KL正则化替代难解的双层目标,并基于鲁棒正则化贝尔曼算子实现高效策略迭代,理论保证收敛性,实验在D4RL基准上表现优异。

详情
AI中文摘要

离线强化学习(RL)无需在线探索即可实现数据高效且安全的策略学习,但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对,其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性,我们将离线RL建模为鲁棒策略优化,将转移核视为不确定性集内的决策变量,并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代(RRPI),用可处理的KL正则化替代难解的最大-最小双层目标,并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证,证明所提出的算子是$\gamma$-压缩算子,且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明,RRPI实现了强大的平均性能,在大多数环境中优于包括基于百分位数方法在内的最新基线,并在其余环境中保持竞争力。此外,RRPI通过将较低的$Q$值与高认知不确定性对齐,展现出鲁棒性能,从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
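To give a feel for why a KL-regularized surrogate is tractable, the sketch below uses a standard identity: minimizing expected next-state value plus a KL penalty toward a nominal transition distribution has a closed-form "soft minimum" solution. This is a generic illustration, not the operator defined in the paper; the penalty form, the temperature `beta`, and the function name are assumptions made here.

```python
import math


def kl_regularized_worst_case(values, nominal_probs, beta):
    """Closed form of  min_P { E_P[V] + (1 / beta) * KL(P || P_nominal) }.

    The minimizer reweights the nominal distribution by exp(-beta * V), and the
    optimal value equals -(1 / beta) * log E_nominal[exp(-beta * V)].
    """
    # Shift by the minimum value for numerical stability (log-sum-exp trick).
    v_min = min(values)
    total = sum(p * math.exp(-beta * (v - v_min))
                for v, p in zip(values, nominal_probs))
    return v_min - math.log(total) / beta
```

As `beta` grows the result approaches the minimum next-state value under the nominal support (a fully adversarial backup), and as `beta` shrinks it approaches the nominal expectation, so a single temperature interpolates between pessimistic and standard backups.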

2606.11918 2026-06-18 cs.AI 版本更新

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

提问的艺术:一致性增强空间推理中的事实性

Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas

发表机构 * The University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) University of Oxford(牛津大学) Stanford University(斯坦福大学)

AI总结 提出自监督强化学习框架,通过几何与语义一致性验证器(如图像翻转、文本对象顺序交换)对齐预训练模型的内在空间推理能力,无需标注数据即可达到接近监督方法的精度。

详情
AI中文摘要

当前的大型推理模型(LRMs)展现出显著的通用能力,但在空间推理任务中表现明显不足。现有方法将此差距视为知识缺陷,依赖监督微调(SFT)从外部视觉源或合成引擎中获取标注空间数据。相反,我们认为对于许多任务,空间推理能力已经存在于预训练的LRMs中,但需要通过几何2D和3D约束下的逻辑一致性进行对齐。在这项工作中,我们提出了一个自监督强化学习(RL)框架,针对内部推理过程,无需真实标注。通过形式化一致性验证器——即在变换下检查几何和语义一致性的奖励函数——我们证明模型可以提高其空间推理能力。我们同时使用图像变换(如翻转)和文本变换(如交换问题中对象的顺序),并提出了一种新的基于最优传输的RL策略OT-GRPO,这是针对成对验证器定制的组相对策略优化的最小匹配变体。我们展示了这种无标签一致性训练在精度上接近使用真实监督训练的模型,并在不同任务和数据领域实现了类似的泛化。

英文摘要

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
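A consistency verifier of the kind described above can be illustrated with a toy reward for horizontal image flips. This is a simplified sketch, not the paper's verifier: it assumes answers are single direction words and that only left/right relations change under a horizontal flip, and the names `FLIP_MAP` and `flip_consistency_reward` are hypothetical.

```python
# Under a horizontal flip, "left" and "right" swap; other relations are unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


def flip_consistency_reward(answer_original, answer_flipped):
    """Return 1.0 if the answer on the flipped image is what the answer on the
    original image implies, else 0.0. No ground-truth label is needed."""
    original = answer_original.strip().lower()
    flipped = answer_flipped.strip().lower()
    expected = FLIP_MAP.get(original, original)
    return 1.0 if flipped == expected else 0.0
```

Note that the reward only checks agreement between two model outputs, so a model that is consistently wrong would still score 1.0; the abstract's claim is that optimizing this kind of signal nonetheless improves accuracy in practice.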

2606.18101 2026-06-18 cs.AI 版本更新

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

信任正确的教师:面向GUI定位的质量感知自蒸馏

Jingyuan Huang, Zuming Huang, Yucheng Shi, Tianze Yang, Xiaoming Zhai, Wei Chu, Ninghao Liu

发表机构 * University of Georgia(佐治亚大学) INFLY Tech Tencent AI Lab(腾讯AI实验室) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出质量感知自蒸馏方法,通过软正确性感知门控和教师概率缩放改善坐标令牌教师信号质量,提升VLM在GUI定位任务中的性能。

Comments corrected some claims

详情
AI中文摘要

图形用户界面(GUI)定位要求视觉语言模型(VLM)在高分辨率截图中识别小的目标元素并预测精确的屏幕坐标。同策略自蒸馏(OPSD)是一种有前景的后训练方法,因为它提供密集的令牌级教师信号,超越了硬坐标标签。然而,朴素OPSD并不适合GUI定位:OPSD在由学生生成的前缀上评估教师,当前缀已经偏离目标坐标时,坐标令牌教师信号的质量会下降,导致不可靠的教师信号。为缓解这一问题,我们提出了面向基于VLM的GUI定位的质量感知自蒸馏,通过软正确性感知门控和教师概率缩放来改善坐标令牌教师信号质量。软正确性感知门控检查在当前学生生成的前缀下,教师的坐标令牌预测是否仍能完成到真实框。如果不能,则相应教师信号被降低权重。教师概率缩放则利用教师置信度作为轻量级因子,进一步校准门控监督的强度。一个关键的实验发现是,单独使用任一组件都不能提升整体性能,而组合使用则能持续提升性能。这表明两种机制发挥互补作用:正确性感知门控抑制不可靠的坐标令牌监督,而教师概率缩放校准剩余信号的强度。在六个GUI定位基准上的实验表明,我们的方法持续提升基础模型性能,并优于强基线。

英文摘要

Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, leading to unreliable teacher signals. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher's current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher's confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them consistently improves performance. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.

2502.10239 2026-06-18 cs.LG cs.AI 版本更新

Efficient Zeroth-Order Federated Finetuning of Language Models on Resource-Constrained Devices

资源受限设备上语言模型的高效零阶联邦微调

Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Ramin Khalili, Heba Khdr, Jörg Henkel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Huawei(华为) Heisenberg Research Center (Munich), Germany(海森堡研究中心(慕尼黑),德国)

AI总结 提出一种基于零阶优化的联邦微调方法,通过分块模型并为第二块分配更多扰动,复用中间激活以减少前向评估次数,在保持内存和通信优势的同时,计算量相比其他零阶方法最多减少至1/3。

Comments Published at TMLR

详情
AI中文摘要

联邦学习是一种有前景的范式,可以在分布式数据源上微调大型语言模型,同时保护数据隐私。然而,在边缘设备上微调如此大的模型由于资源需求高而具有挑战性。零阶优化通过有限差分近似估计梯度,依赖于模型参数随机扰动下的函数评估。因此,与任务对齐的零阶优化提供了一种潜在解决方案,允许仅使用前向传播(推理级内存需求和低通信开销)进行微调,但存在收敛慢和计算需求高的问题。在本文中,我们提出了一种新的基于零阶优化的方法,应用更高效的技术来减少使用大量扰动带来的计算需求,同时保留其收敛优势。这是通过将模型分成连续的块,并为第二块分配更多扰动来实现的,从而能够高效复用中间激活,以更少的前向评估更新整个网络。我们在RoBERTa-large、OPT1.3B、LLaMa-3-3.2B模型上的评估显示,与其他基于零阶优化的技术相比,计算量减少了高达3倍,同时保留了一阶联邦学习技术的内存和通信优势。

英文摘要

Federated Learning (FL) is a promising paradigm for finetuning Large Language Models (LLMs) across distributed data sources while preserving data privacy. However, finetuning such large models is challenging on edge devices due to its high resource demand. Zeroth-order Optimization (ZO) estimates gradients through finite-difference approximations, which rely on function evaluations under random perturbations of the model parameters. Consequently, ZO with task alignment provides a potential solution, allowing finetuning using only forward passes with inference-level memory requirements and low communication overhead, but it suffers from slow convergence and higher computational demand. In this paper, we propose a new ZO-based method that applies a more efficient technique to reduce the computational demand associated with using a large number of perturbations while preserving their convergence benefits. This is achieved by splitting the model into consecutive blocks and allocating a higher number of perturbations to the second block, enabling efficient reuse of intermediate activations to update the full network with fewer forward evaluations. Our evaluation on RoBERTa-large, OPT1.3B, LLaMa-3-3.2B models shows up to $3\times$ reduction in computation compared to the other ZO-based techniques, while retaining the memory and communication benefits over first-order federated learning techniques.
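The finite-difference gradient estimate that zeroth-order optimization relies on can be sketched as follows. This shows only the generic two-point estimator averaged over random perturbations; it does not implement the paper's block splitting or activation reuse, and the function name and default values are assumptions for illustration.

```python
import random


def zo_gradient(loss, params, eps=1e-3, n_perturbations=8, seed=0):
    """Estimate the gradient of loss at params using forward evaluations only.

    For each random direction z, the directional derivative is approximated by
    (loss(params + eps*z) - loss(params - eps*z)) / (2*eps) and projected back onto z.
    """
    rng = random.Random(seed)
    grad = [0.0] * len(params)
    for _ in range(n_perturbations):
        z = [rng.gauss(0.0, 1.0) for _ in params]
        plus = loss([p + eps * zi for p, zi in zip(params, z)])
        minus = loss([p - eps * zi for p, zi in zip(params, z)])
        scale = (plus - minus) / (2 * eps)
        for i, zi in enumerate(z):
            grad[i] += scale * zi / n_perturbations
    return grad
```

Each perturbation costs two forward passes and no backward pass, which is why memory stays at inference level, and also why using many perturbations becomes computationally expensive, the cost the abstract sets out to reduce.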

2503.01805 2026-06-18 cs.LG cs.AI cs.CL 版本更新

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

图任务算法推理中Transformer的深度-宽度权衡

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

发表机构 * Courant Institute of Mathematical Sciences, New York University(纽约大学柯朗数学科学研究所) Google Research(谷歌研究) Meta AI Bar-Ilan University(巴伊兰大学) Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University(特拉维夫大学生物医学工程系、埃德蒙·J·萨夫拉生物信息学中心) Tel Aviv University(特拉维夫大学)

AI总结 研究Transformer在图算法任务中深度与宽度的权衡,发现线性宽度下常数深度足以解决许多图问题,而某些问题需要二次宽度,实验验证了宽模型在保持精度的同时训练和推理更快。

Comments Updated ISF grant number

详情
AI中文摘要

Transformer已经彻底改变了机器学习领域。特别是,它们可用于解决复杂的算法问题,包括基于图的任务。在此类算法任务中,一个关键问题是能够实现该任务的Transformer的最小尺寸是多少。最近的工作开始探索图任务的这个问题,表明对于次线性嵌入维度(即模型宽度),对数深度就足够了。然而,我们在这里解决的一个开放问题是,如果允许宽度线性增长而深度保持固定,会发生什么。我们分析了这种情况,并得出了一个令人惊讶的结果:在线性宽度下,常数深度足以解决一系列基于图的问题。这表明宽度的适度增加可以允许更浅的模型,这在推理和训练时间方面是有利的。对于其他问题,我们表明需要二次宽度。我们的结果展示了Transformer实现图算法的复杂而有趣的格局。我们通过实验研究了深度和宽度相对能力之间的这些权衡,并发现宽模型在具有与深模型相同准确度的任务中,由于可并行化的硬件,训练和推理时间更快。

英文摘要

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

2503.08038 2026-06-18 cs.LG cs.AI cs.CV 版本更新

Generalized Kullback-Leibler Divergence Loss

广义Kullback-Leibler散度损失

Jiequan Cui, Beier Zhu, Qingshan Xu, Zhuotao Tian, Xiaojuan Qi, Bei Yu, Hanwang Zhang, Richang Hong

发表机构 * Hefei University of Technology(合肥工业大学) University of Science and Technology of China(中国科学技术大学) Nanyang Technological University(南洋理工大学) The Chinese University of Hong Kong(香港中文大学) The University of Hong Kong(香港大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳))

AI总结 本文提出广义KL散度损失,通过解耦KL损失为加权MSE和交叉熵损失,并引入非对称优化修正和类别全局信息,在对抗训练和知识蒸馏中取得SOTA性能。

Comments TPAMI 2026, extension of our NeurIPS paper "Decoupled Kullback-Leibler Divergence Loss". arXiv admin note: substantial text overlap with arXiv:2305.13948

详情
AI中文摘要

在本文中,我们深入探讨了Kullback-Leibler (KL) 散度损失,并从数学上证明它等价于由(1)加权均方误差(wMSE)损失和(2)包含软标签的交叉熵损失组成的解耦Kullback-Leibler (DKL) 散度损失。得益于DKL损失的解耦结构,我们确定了两个改进方向。首先,我们通过打破KL损失的不对称优化性质并引入更平滑的权重函数,解决了其在知识蒸馏等场景中的局限性。这一修改有效缓解了优化中的收敛困难,特别是对于软标签中预测分数较高的类别。其次,我们将类别级别的全局信息引入KL/DKL,以减少单个样本带来的偏差。通过这两项改进,我们推导出广义Kullback-Leibler (GKL) 散度损失,并通过在CIFAR-10/100、ImageNet和视觉-语言数据集上进行实验,聚焦于对抗训练和知识蒸馏任务,评估其有效性。具体来说,我们在公开排行榜RobustBench上实现了新的最先进对抗鲁棒性,并在CIFAR/ImageNet模型和CLIP模型上取得了具有竞争力的知识蒸馏性能,展示了其重要的实际价值。我们的代码可在 https://github.com/jiequancui/DKL 获取。

英文摘要

In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and mathematically prove that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss that consists of (1) a weighted Mean Square Error (wMSE) loss and (2) a Cross-Entropy loss incorporating soft labels. Thanks to the decoupled structure of DKL loss, we have identified two areas for improvement. First, we address the limitation of KL loss in scenarios like knowledge distillation by breaking its asymmetric optimization property and introducing a smoother weight function. This modification effectively alleviates convergence challenges in optimization, particularly for classes with high predicted scores in soft labels. Second, we introduce class-wise global information into KL/DKL to reduce bias arising from individual samples. With these two enhancements, we derive the Generalized Kullback-Leibler (GKL) Divergence loss and evaluate its effectiveness by conducting experiments on CIFAR-10/100, ImageNet, and vision-language datasets, focusing on adversarial training and knowledge distillation tasks. Specifically, we achieve new state-of-the-art adversarial robustness on the public RobustBench leaderboard and competitive knowledge distillation performance across CIFAR/ImageNet models and CLIP models, demonstrating its substantial practical merits. Our code is available at https://github.com/jiequancui/DKL.

2506.11139 2026-06-18 eess.IV cs.AI cs.CV 版本更新

Grids Often Outperform Implicit Neural Representations at Compressing Dense Signals

网格通常在压缩密集信号方面优于隐式神经表示

Namhoon Kim, Sara Fridovich-Keil

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Georgia Institute of Technology(佐治亚理工学院)

AI总结 研究发现,对于密集信号任务,带插值的正则化网格在训练速度和重建质量上优于同等参数量的隐式神经表示,而INR仅在拟合二值信号(如形状轮廓)时表现更优。

Comments Our analysis is available at https://github.com/voilalab/INR-benchmark

详情
AI中文摘要

隐式神经表示(INR)最近展示了令人印象深刻的结果,但其基本容量、隐式偏差和缩放行为仍知之甚少。我们研究了不同INR在一系列具有不同有效带宽的2D和3D真实及合成信号上的性能,以及包括断层扫描、超分辨率和去噪在内的过拟合和泛化任务。通过根据模型大小以及信号类型和带宽对性能进行分层,我们的结果揭示了不同INR和网格表示如何分配其容量。我们发现,对于许多涉及密集信号的任务,具有插值的简单正则化网格在训练速度和质量上优于或等同于具有相同参数数量的任何INR。我们还发现有限的情况——即拟合二值信号(如形状轮廓)——其中INR优于网格,以指导INR的未来开发和使用,使其应用于最有利的应用场景。

英文摘要

Implicit Neural Representations (INRs) have recently shown impressive results, but their fundamental capacity, implicit biases, and scaling behavior remain poorly understood. We investigate the performance of diverse INRs across a suite of 2D and 3D real and synthetic signals with varying effective bandwidth, as well as both overfitting and generalization tasks including tomography, super-resolution, and denoising. By stratifying performance according to model size as well as signal type and bandwidth, our results shed light on how different INR and grid representations allocate their capacity. We find that, for many tasks involving dense signals, a simple regularized grid with interpolation trains faster and to higher or comparable quality than any INR with the same number of parameters. We also find limited settings -- namely fitting binary signals such as shape contours -- where INRs outperform grids, to guide future development and use of INRs towards the most advantageous applications.
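
To make the "grid with interpolation" baseline concrete, here is a minimal sketch of sampling a 2D grid at continuous coordinates with bilinear interpolation. In such a representation the learnable parameters are simply the values stored at the grid nodes. The function name and the clamping behavior are assumptions made for illustration; the regularization and training procedure used in the paper are not shown.

```python
def bilinear(grid, x, y):
    """Sample a 2D grid (list of rows) at continuous coordinates (x, y).

    x indexes columns and y indexes rows; coordinates are clamped to the grid.
    """
    rows, cols = len(grid), len(grid[0])
    x = min(max(x, 0.0), cols - 1.0)
    y = min(max(y, 0.0), rows - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - tx) * grid[y0][x0] + tx * grid[y0][x1]
    bottom = (1 - tx) * grid[y1][x0] + tx * grid[y1][x1]
    return (1 - ty) * top + ty * bottom

toy_grid = [[0.0, 1.0],
            [2.0, 3.0]]
center = bilinear(toy_grid, 0.5, 0.5)
```

Queries at node coordinates return the stored values exactly, and queries between nodes blend the four nearest values.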

2506.14126 2026-06-18 cs.LG cs.AI 版本更新

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

从记忆到参数干扰:过度训练专家如何损害模型合并

Stefan Horoi, Guy Wolf, Eugene Belilovsky, Gintare Karolina Dziugaite

发表机构 * Concordia University(康科德大学) Mila -- Québec AI Institute(魁北克人工智能研究所) Google DeepMind(谷歌DeepMind)

AI总结 本文研究专家模型微调过度对模型合并的影响,发现长时间微调导致记忆困难样本,造成参数干扰,降低合并性能,并提出任务相关的早停策略改善合并效果。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情
AI中文摘要

现代深度学习日益以使用开放权重基础模型为特征,这些模型可以在专门数据集上进行微调。这导致了专家模型和适配器的激增,通常通过HuggingFace和AdapterHub等平台共享。模型合并最近成为一种有效利用这些现有资源的方法,使得能够组合不同模型检查点的能力。因此,形成了一种自然的流程来利用迁移学习的好处并分摊沉没训练成本:模型在通用数据上预训练,在特定任务上微调,然后合并多个检查点以获得更强大的模型。一个普遍假设是,该流程中某一阶段的改进会向下游传播,从而在后续步骤中带来收益。在这项工作中,我们通过研究专家微调如何影响模型合并来挑战这一假设。我们表明,针对个体性能优化的专家长时间微调会导致跨视觉和语言模态、多种模型规模以及完全微调和LoRA适配模型的合并性能下降。我们将这种退化追溯到对一小部分困难样本的记忆,这些样本主导了微调后期步骤。这会导致负参数干扰,并编码在合并过程中被遗忘的知识。最后,我们证明任务相关的激进早停策略可以显著改善模型合并性能。

英文摘要

Modern deep learning is increasingly characterized by the use of open-weight foundation models that can be fine-tuned on specialized datasets. This has led to a proliferation of expert models and adapters, often shared via platforms like HuggingFace and AdapterHub. Model merging has recently emerged as an effective way to leverage these existing resources, enabling the composition of capabilities from different model checkpoints. A natural pipeline has thus formed to harness the benefits of transfer learning and amortize sunk training costs: models are pre-trained on general data, fine-tuned on specific tasks, and then multiple checkpoints are merged to obtain a more capable model. A prevailing assumption is that improvements at one stage of this pipeline propagate downstream, leading to gains at subsequent steps. In this work, we challenge that assumption by examining how expert fine-tuning affects model merging. We show that long fine-tuning of experts that optimizes for their individual performance leads to degraded merging performance across vision and language modalities, multiple model scales, and both fully fine-tuned and LoRA-adapted models. We trace this degradation to the memorization of a small set of difficult examples that dominate late fine-tuning steps. This causes negative parameter interference and encodes knowledge that is forgotten during merging. Finally, we demonstrate that task-dependent aggressive early stopping strategies can significantly improve model merging performance.
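
The abstract does not say which merging method is used, so the sketch below assumes the simplest common setup: each expert is described by its task vector (expert weights minus base weights), and the merged model adds the mean task vector back to the base. The helper `sign_conflicts` counts parameters where experts pull in opposite directions, which is one simple way to picture parameter interference. All names and numbers are hypothetical.

```python
def task_vector(expert, base):
    """Per-parameter difference between an expert and the shared base model."""
    return [e - b for e, b in zip(expert, base)]

def merge(base, experts, scale=1.0):
    """Merge by adding the scaled mean task vector to the base (assumed method)."""
    vectors = [task_vector(e, base) for e in experts]
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [b + scale * m for b, m in zip(base, mean)]

def sign_conflicts(base, experts):
    """Count parameters whose task vectors disagree in sign across experts."""
    vectors = [task_vector(e, base) for e in experts]
    count = 0
    for col in zip(*vectors):
        if any(v > 0 for v in col) and any(v < 0 for v in col):
            count += 1
    return count

# Toy flattened parameter vectors, for illustration only.
base = [0.0, 0.0, 0.0]
experts = [[1.0, 2.0, 0.0], [1.0, -2.0, 4.0]]
merged = merge(base, experts)
```

In the toy example the second parameter cancels out entirely after merging, even though each expert moved it substantially on its own.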

2601.21626 2026-06-18 cs.LG cs.AI 版本更新

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

HeRo-Q: 通过Hessian条件化实现稳定低比特量化的通用框架

Jinhao Zhang, Yunquan Zhang, Zicheng Yan, Boyang Zhang, Jun Sun, Daning Cheng

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of Science and Technology of China(中国科学技术大学) Zhejiang Lab(浙江实验室) Peng Cheng Laboratory(鹏城实验室)

AI总结 针对后训练量化中“低误差、高损失”的矛盾,提出HeRo-Q算法,通过轻量可学习的旋转压缩矩阵重塑损失景观,降低最大Hessian特征值,增强对量化噪声的鲁棒性,在Llama和Qwen模型上优于现有方法。

详情
AI中文摘要

后训练量化(PTQ)是一种主流的模型压缩技术,但由于其仅专注于最小化量化误差,常常导致矛盾的“低误差、高损失”现象。根本原因在于LLM损失景观的Hessian矩阵:少数高曲率方向对扰动极其敏感。为了解决这个问题,我们提出了Hessian鲁棒量化(HeRo-Q)算法,该算法在量化前对权重空间应用一个轻量级、可学习的旋转压缩矩阵。这个联合框架通过降低Hessian矩阵的最大特征值来重塑损失景观,从而显著增强对量化噪声的鲁棒性。HeRo-Q不需要修改架构,计算开销可忽略不计,并且可以无缝集成到现有的PTQ流程中。在Llama和Qwen模型上的实验表明,HeRo-Q在标准W4A8设置下不仅持续优于包括GPTQ、AWQ和SpinQuant在内的最先进方法,而且在极具挑战性的W3A16超低比特场景中表现出色,将Llama3 8B在GSM8K上的准确率提升至70.15%,并有效避免了激进量化中常见的逻辑崩溃。

英文摘要

Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical "low error, high loss" phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.
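
The "low error, high loss" effect can be illustrated with a second-order Taylor estimate: near a minimum, a weight perturbation delta changes the loss by roughly 0.5 * delta^T H delta, so two perturbations of the same size can cost very different amounts depending on the curvature along their direction. The sketch below pairs that estimate with a generic symmetric round-to-nearest quantizer. Both are standard textbook constructions and are not the HeRo-Q algorithm; the Hessian and the weights are made-up toy values.

```python
def quantize(weights, bits, w_max):
    """Symmetric uniform round-to-nearest quantization to `bits` bits."""
    levels = 2 ** (bits - 1) - 1            # e.g. 7 positive levels for 4 bits
    step = w_max / levels
    return [max(-levels, min(levels, round(w / step))) * step for w in weights]

def loss_increase(delta, hessian):
    """Second-order estimate 0.5 * delta^T H delta around a minimum."""
    n = len(delta)
    return 0.5 * sum(delta[i] * hessian[i][j] * delta[j]
                     for i in range(n) for j in range(n))

# Toy Hessian with one high-curvature and one low-curvature direction.
H = [[100.0, 0.0],
     [0.0, 1.0]]
sharp = loss_increase([0.1, 0.0], H)   # perturbation along the sharp direction
flat = loss_increase([0.0, 0.1], H)    # same-size perturbation along the flat one
```

Here both perturbations have the same norm, yet the estimated loss increase differs by a factor of 100, which is why lowering the largest Hessian eigenvalue can make a model more tolerant of quantization noise.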

2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph 版本更新

LLM Compression by Block Removal with Constrained Binary Optimization

通过带约束二进制优化的块移除进行LLM压缩

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

发表机构 * Multiverse Computing(多维计算公司) Donostia International Physics Center(多斯蒂亚国际物理中心) Ikerbasque Foundation for Science(伊克尔巴斯克科学基金会)

AI总结 提出将大语言模型块移除压缩问题建模为约束二进制优化,映射到Ising玻璃系统,实现高效排序和高质量非连续块移除,在50%压缩时MMLU提升近23个百分点,且计算高效、通用性强。

Comments 16 pages, 3 figures

详情
AI中文摘要

在本文中,我们将通过最优删除Transformer块(“块移除”)来压缩大语言模型(LLM)的问题,表述为一个约束二进制优化(CBO)问题,该问题可以映射到物理系统(Ising玻璃),其能量是下游模型性能的强代理。这种表述使得能够高效地对大量候选块移除配置进行排序,产生许多高质量、非平凡的解决方案,而不仅仅是移除连续区域。我们的方法在深度压缩场景中表现强劲,例如在Llama-3.3-70B-Instruct的50%压缩中,与其他最先进的块移除方法相比,我们在MMLU基准上取得了近23个百分点的提升。对于较轻的压缩,它在多个基准上与这些方法表现相当,适用于Llama-3.1-8B-Instruct、Qwen3-14B(重训练前后)以及Llama-3.3-70B-Instruct。该方法计算效率高,仅需在校准数据集上对少数活跃参数进行前向和反向传播。此外,我们证明,当无法精确求解CBO问题时,使用良好的启发式求解器可以在可忽略的运行时间内提供在下游任务上表现良好的解决方案。该方法可以轻松应用于任何架构。我们在最近的NVIDIA-Nemotron-3-Nano-30B-A3B-FP8模型上展示了这种通用性,该模型具有高度不均匀且具有挑战性的块结构,并且在移除2个注意力层或3个混合专家层时,我们在AIME25和GPQA上超越了最先进水平。

英文摘要

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations, yielding many high-quality, non-trivial solutions beyond those only removing consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is infeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.
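
A constrained binary optimization of this kind can be pictured with a tiny brute-force example: each block gets a binary "removed" variable, an energy combines per-block costs with pairwise couplings, and the constraint fixes how many blocks are removed. The coefficients `h` and `J` below are invented for illustration, as is the convention that lower energy means a better configuration; how the paper derives its coefficients from calibration data is not shown.

```python
from itertools import combinations

def energy(removed, h, J):
    """Ising/QUBO-style energy of removing the blocks listed in `removed`."""
    e = sum(h[i] for i in removed)
    # Pairwise couplings are read from the upper triangle J[i][j] with i < j.
    e += sum(J[i][j] for i, j in combinations(sorted(removed), 2))
    return e

def rank_removals(h, J, k):
    """Enumerate all ways to remove exactly k blocks, lowest energy first."""
    n = len(h)
    configs = [(energy(c, h, J), c) for c in combinations(range(n), k)]
    return sorted(configs)

# Hypothetical coefficients for a 4-block model.
h = [1.0, 0.2, 0.3, 0.9]
J = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],   # removing blocks 1 and 2 together is penalized
     [0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
ranking = rank_removals(h, J, k=2)
```

In this toy case the best configuration removes blocks 1 and 3, which are not adjacent, while the adjacent pair (1, 2) ranks last because of its coupling. Exhaustive enumeration only works for small problems, which is where heuristic solvers come in.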

2602.00176 2026-06-18 cs.CV cs.AI 版本更新

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出后验延续框架,根据扩散噪声水平逐步暴露测量频率,结合稳定采样器实现超分辨率、修复和去模糊的先进性能。

详情
AI中文摘要

扩散后验采样通过将预训练的扩散先验与测量一致性指导相结合来解决逆问题。然而,在高噪声水平下,全频带指导可能不可靠,因为干净估计包含分数诱导误差,且高频测量方向弱可识别。我们认为后验指导应根据瞬时扩散噪声水平暴露测量频率。基于这一原则,我们提出一个后验延续框架,构建一系列中间后验,其似然强调当前可靠频带并逐渐恢复全频带一致性。我们通过一个稳定采样器实例化该框架,该采样器结合了扩散预测器、频率受限似然细化以及Haar域承诺规则,该规则提交可靠粗校正同时推迟弱可识别细节。在超分辨率、修复和去模糊任务中,我们的方法实现了具有竞争力乃至最先进的恢复性能,包括在FFHQ和ImageNet评估中,运动去模糊相比强基线PSNR提升高达5 dB。

英文摘要

Diffusion posterior sampling solves inverse problems by combining a pretrained diffusion prior with measurement-consistency guidance. However, full-band guidance can be unreliable at high noise levels, where clean estimates contain score-induced errors and high-frequency measurement directions are weakly identifiable. We argue that posterior guidance should expose measurement frequencies according to the instantaneous diffusion noise level. Based on this principle, we propose a posterior continuation framework that constructs a family of intermediate posteriors whose likelihood emphasizes currently reliable frequency bands and gradually returns to full-band consistency. We instantiate this framework with a stabilized sampler that combines a diffusion predictor, frequency-limited likelihood refinement, and a Haar-domain commitment rule that commits reliable coarse corrections while deferring weakly identifiable details. Across super-resolution, inpainting, and deblurring, our method achieves competitive-to-state-of-the-art restoration performance, including up to 5 dB PSNR improvement on motion deblurring over strong baselines in evaluations on FFHQ and ImageNet.

2602.09234 2026-06-18 cs.LG cs.AI 版本更新

Do Neural Networks Lose Plasticity in a Gradually Changing World?

神经网络在渐变世界中会失去可塑性吗?

Tianhui Liu, Lili Mou

发表机构 * Dept. Computing Science & Alberta Machine Intelligence Institute (Amii), University of Alberta Canada CIFAR AI Chair

AI总结 研究任务转换的突然性对神经网络可塑性损失的影响,通过输入/输出插值和任务采样模拟渐变环境,理论和实验表明可塑性损失严重程度与任务转换突然性密切相关,渐变环境下可显著减轻。

详情
AI中文摘要

持续学习已成为机器学习的热门话题。最近的研究发现了一个有趣的现象,称为可塑性丧失,指的是神经网络逐渐失去学习新任务的能力。然而,现有的可塑性研究很大程度上依赖于具有突然任务转换的基准测试,而没有检验突然性本身是否导致了观察到的可塑性损失。在本文中,我们通过输入/输出插值和任务采样模拟逐渐变化的环境,研究了转换突然性的作用。我们进行了理论和实证分析,表明可塑性损失的严重程度与任务转换的突然性密切相关,并且在环境逐渐变化时可以显著降低。

英文摘要

Continual learning has become a trending topic in machine learning. Recent studies have discovered an interesting phenomenon called loss of plasticity, referring to neural networks gradually losing the ability to learn new tasks. However, existing plasticity research largely relies on benchmarks with abrupt task transitions, without examining whether the abruptness itself contributes to the observed plasticity loss. In this paper, we investigate the role of transition abruptness by simulating gradually changing environments through input/output interpolation and task sampling. We perform theoretical and empirical analysis, showing that the severity of plasticity loss is closely tied to the abruptness of task transitions, and can be substantially reduced when the environment changes gradually.

2603.15988 2026-06-18 eess.AS cs.AI cs.LG 版本更新

Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

无中生有:面向构音障碍语音严重程度鲁棒估计的数据增强

Jaesung Bae, Xiuwen Zheng, Minje Kim, Chang D. Yoo, Mark Hasegawa-Johnson

发表机构 * University of Illinois Urbana-Champaign, IL, USA Korea Advanced Institute of Science & Technology, KR

AI总结 提出三阶段框架,利用未标注构音障碍语音和典型语音数据集,通过教师模型生成伪标签、标签感知对比学习预训练和微调,在五个未见数据集上平均SRCC达0.761,显著优于现有方法。

Comments Accepted to Interspeech 2026 Long Paper Track

详情
AI中文摘要

构音障碍语音质量评估(DSQA)对于临床诊断和包容性语音技术至关重要。然而,主观评估成本高且难以规模化,而标注数据的稀缺限制了鲁棒的客观建模。为解决这一问题,我们提出了一个三阶段框架,利用未标注的构音障碍语音和大规模典型语音数据集来扩展训练。教师模型首先生成未标注样本的伪标签,然后使用标签感知对比学习策略进行弱监督预训练,使模型暴露于多样化的说话者和声学条件。预训练模型随后针对下游DSQA任务进行微调。在跨越多种病因和语言的五个未见数据集上的实验证明了我们方法的鲁棒性。我们的基于Whisper的基线显著优于SOTA DSQA预测器(如SpICE),完整框架在未见测试数据集上实现了平均SRCC为0.761。

英文摘要

Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors such as SpICE, and the full framework achieves an average SRCC of 0.761 across unseen test datasets.
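
The reported metric, SRCC, is the Spearman rank correlation coefficient: the Pearson correlation computed on ranks rather than raw values, so it rewards a predictor for ordering speakers correctly by severity even if its scale is off. The sketch below is a plain reference implementation with average ranks for ties; the function names are my own and the example scores are invented.

```python
def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented severity labels and predictions, for illustration only.
labels = [1, 2, 3, 4]
predictions = [10.0, 20.0, 30.0, 45.0]
srcc = spearman(labels, predictions)
```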

2604.13082 2026-06-18 cs.LG cs.AI 版本更新

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

算术泛化的长延迟:当学习到的表征超越行为时

Laura Gomezjurado Gonzalez

发表机构 * Stanford University(斯坦福大学)

AI总结 研究Transformer在算术任务中泛化延迟的原因,发现编码器早期已学到结构,但解码器瓶颈导致延迟,通过移植编码器或冻结编码器可加速泛化,且数字基的选择影响学习难度。

Comments 19 pages, 10 figures

详情
AI中文摘要

在算法任务上训练的Transformer中的grokking现象以训练集拟合与突然泛化之间的长延迟为特征,但该延迟的来源仍不清楚。在编码器-解码器算术模型中,我们认为这种延迟反映了对已学习结构的有限访问,而非未能首先获得该结构。我们研究一步Collatz预测,发现编码器在最初几千训练步内组织了奇偶性和余数结构,而输出精度在此后数万步内仍接近随机。因果干预支持解码器瓶颈假说。将训练好的编码器移植到新模型中将grokking加速2.75倍,而移植训练好的解码器则有害。冻结收敛的编码器并仅重新训练解码器完全消除了平台期,并达到97.6%的准确率,而联合训练为86.1%。解码器任务的难易取决于数字表示。在15种进制中,那些因数分解与Collatz映射的算术相契合的进制(例如24进制)达到99.8%的准确率,而二进制完全失败,因为其表示崩溃且无法恢复。进制的选择充当归纳偏置,控制解码器可利用的局部数位结构的多少,从而在相同底层任务上产生巨大的可学习性差异。

英文摘要

Grokking in transformers trained on algorithmic tasks is characterized by a long delay between training-set fit and abrupt generalization, but the source of that delay remains poorly understood. In encoder-decoder arithmetic models, we argue that this delay reflects limited access to already learned structure rather than failure to acquire that structure in the first place. We study one-step Collatz prediction and find that the encoder organizes parity and residue structure within the first few thousand training steps, while output accuracy remains near chance for tens of thousands more. Causal interventions support the decoder bottleneck hypothesis. Transplanting a trained encoder into a fresh model accelerates grokking by 2.75 times, while transplanting a trained decoder actively hurts. Freezing a converged encoder and retraining only the decoder eliminates the plateau entirely and yields 97.6% accuracy, compared to 86.1% for joint training. What makes the decoder's job harder or easier depends on numeral representation. Across 15 bases, those whose factorization aligns with the Collatz map's arithmetic (e.g., base 24) reach 99.8% accuracy, while binary fails completely because its representations collapse and never recover. The choice of base acts as an inductive bias that controls how much local digit structure the decoder can exploit, producing large differences in learnability from the same underlying task.
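
To see what "one-step Collatz prediction" under different numeral bases looks like as data, here is a small sketch that applies one step of the Collatz map and writes numbers as digit sequences in a chosen base. It assumes the standard map (n/2 for even n, 3n + 1 for odd n); the paper may use a variant, and its tokenization details are not given in the abstract.

```python
def collatz_step(n):
    """One step of the Collatz map: n/2 if n is even, otherwise 3n + 1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def to_digits(n, base):
    """Most-significant-first digits of a non-negative integer in `base`."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, d = divmod(n, base)
        digits.append(d)
    return digits[::-1]

def example_pair(n, base):
    """Input and target digit sequences for one training example."""
    return to_digits(n, base), to_digits(collatz_step(n), base)

binary_pair = example_pair(7, 2)
base24_pair = example_pair(7, 24)
```

Since 24 = 2^3 * 3, the last base-24 digit alone fixes a number's parity and its residue modulo 3, which gives some intuition for why such a base might expose the map's arithmetic locally, whereas in binary the multiplication by 3 spreads carries across many digits.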

2605.10840 2026-06-18 cs.LG cs.AI q-bio.QM 版本更新

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

Clin-JEPA:一种多阶段协同训练框架,用于EHR患者轨迹的联合嵌入预测预训练

Yixuan Yang, Mehak Arora, Ryan Zhang, Baraa Abed, Junseob Kim, Tilendra Choudhary, Md Hassanuzzaman, Kevin Zhu, Ayman Ali, Chengkun Yang, Alasdair Edward Gent, Victor Moas, Rishikesan Kamaleswaran

发表机构 * Duke University(杜克大学)

AI总结 本文提出Clin-JEPA框架,通过多阶段预训练稳定协同训练编码器和预测器,解决EHR数据中联合嵌入预测的挑战,实现多任务下游任务的高性能表现。

Comments 16 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA

详情
AI中文摘要

我们介绍了Clin-JEPA,一种用于EHR患者轨迹的联合嵌入预测(JEPA)预训练的多阶段协同训练框架。JEPA架构已在机器人领域实现了潜在空间规划,并在视觉领域实现了高质量的表示学习,但将其扩展到EHR数据,以获得一个既能预测患者轨迹、又能在无需逐任务微调的情况下服务于多种下游风险预测任务的单一主干,仍是一个开放性挑战。现有的JEPA框架要么在预训练后丢弃预测器(I-JEPA,V-JEPA),要么在冻结的预训练编码器上训练预测器(V-JEPA 2-AC),导致编码器无法感知被保留的预测器在推理时必须使用的滚动信号;在共享JEPA预测目标下协同训练编码器和预测器将提供这种基础,但朴素的协同训练不稳定,表示崩溃和在线/目标漂移会导致自回归滚动发散。Clin-JEPA的五阶段预训练课程(预测器预热、联合细化、EMA目标对齐、硬同步和预测器最终化)分阶段解决每种失败模式,稳定地协同训练基于Qwen3-8B的编码器和一个具有9200万参数的潜在轨迹预测器。在MIMIC-IV ICU数据上,三个独立评估支持该框架:(1)在48小时范围内,只有该方法的潜在ℓ1滚动漂移收敛(-15.7%),而基线和消融实验均发散(+3%至+4951%);(2)编码器学习了具有临床区分性的潜在几何结构(病情恶化患者群体在潜在空间中的位移是稳定患者的4.83倍,而基线编码器仅为≤2.62倍);(3)单一主干在多任务下游评估中优于强大的表格和序列基线。Clin-JEPA在ICareFM EEP上达到平均AUROC 0.851,在8个二元风险任务上达到0.883(比基线平均分别高0.038和0.041)。

英文摘要

We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent ℓ1 rollout drift uniquely converges (-15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83× further than stable patients in latent space, vs ≤2.62× for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).
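
Two of the curriculum phases named above, EMA target alignment and hard sync, refer to standard ways of keeping a target network close to an online network. The abstract does not give the update rules, so the sketch below assumes the usual forms: an exponential moving average with a decay hyperparameter, and a plain copy. The `l1_drift` helper is an illustrative way to measure how far the two parameter sets have diverged and is not the paper's rollout-drift metric.

```python
def ema_update(target, online, decay):
    """Move target parameters toward online ones: t <- decay*t + (1-decay)*o."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

def hard_sync(online):
    """Copy the online parameters into the target (equivalent to decay = 0)."""
    return list(online)

def l1_drift(target, online):
    """Mean absolute difference between target and online parameters."""
    return sum(abs(t - o) for t, o in zip(target, online)) / len(target)

# Toy parameter vectors, for illustration only.
online = [1.0, 1.0]
target = [0.0, 1.0]
for _ in range(10):
    target = ema_update(target, online, decay=0.9)
drift_after_ema = l1_drift(target, online)
drift_after_sync = l1_drift(hard_sync(online), online)
```

With a decay of 0.9 the gap shrinks geometrically but never reaches zero, which is one reason a schedule might end with an explicit hard sync.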

2605.11287 2026-06-18 cs.LG cs.AI 版本更新

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

超越相似性:时间序列分析中的时序操作注意力

Jevon Twitty, Vinh Pham, Nitiwith Rotchanarak, Viresh Pati, Yubin Kim, Shihao Yang, Jiecheng Lu

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出时序操作注意力(TOA),通过引入可学习的操作符增强注意力机制,以更有效地处理时间序列数据中的符号和振荡变换,提升时间序列预测、异常检测和分类任务的性能。

详情
AI中文摘要

时间序列预测中存在一个持久性悖论:结构简单的MLP和线性模型往往优于高容量的Transformer。我们指出,这种差距源于序列建模基本原语的不匹配:尽管许多时间序列动态由全局时间操作符(如滤波和谐波结构)主导,标准注意力却将每个输出构造为输入的凸组合。这限制了其表示带符号和振荡变换的能力,而这些变换是时间信号处理的基础。我们将这一限制形式化为softmax注意力中的单纯形约束混合瓶颈,它对由操作符驱动的时间序列任务尤其具有限制性。为了解决这一问题,我们提出时序操作注意力(TOA),一种通过显式、可学习的序列空间操作符增强注意力的框架,使跨时间的直接带符号混合成为可能,同时保持输入依赖的适应性。为了使密集的N×N操作符实用化,我们引入了随机操作符正则化,一种高方差的dropout机制,它稳定了训练并防止了平凡的记忆化。在预测、异常检测和分类基准上,TOA在集成到PatchTST和iTransformer等标准骨干时始终提高了性能,在以重建为主的任务中增益尤为突出。这些结果表明,显式操作符学习是有效时间序列建模的关键要素。

英文摘要

A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose Temporal Operator Attention (TOA), a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense N×N operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
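
The "convex combination" limitation can be checked directly on a toy example. Softmax weights are non-negative and sum to one, so an attention output over scalar values always lies between the smallest and largest value. A first-difference operator, by contrast, needs one negative weight and can leave that range. The sketch below only demonstrates this constraint; it is not an implementation of TOA, and the scores and series are arbitrary.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, values):
    """Weighted combination of scalar values across time steps."""
    return sum(w * v for w, v in zip(weights, values))

values = [1.0, 4.0, 2.0]              # a toy scalar time series
attn = softmax([0.3, 2.0, -1.0])      # arbitrary attention scores
difference_op = [0.0, -1.0, 1.0]      # signed row computing x[t] - x[t-1]

attn_out = mix(attn, values)
diff_out = mix(difference_op, values)
```

No choice of attention scores can make `attn_out` equal to `diff_out` here, because -2.0 lies outside the interval spanned by the values.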

2605.12713 2026-06-18 quant-ph cs.AI 版本更新

Controllable Quantum Memory Capacity in Quantum Reservoir Networks with Tunable partial-SWAPs

量子储备池网络中可控的量子记忆容量:基于可调部分SWAP

Erik L. Connerty, Ethan N. Evans

发表机构 * University of South Carolina - Columbia(南卡罗来纳大学哥伦比亚分校) Qodex Quantum(Qodex量子)

AI总结 本文提出一种可调部分SWAP机制,用于控制量子储备池网络中的记忆衰减速率,并通过模拟和IBM QPU进行验证,使噪声中等规模量子处理器上储备池的记忆容量可直接调控。

Comments 14 pages, 9 figures

详情
AI中文摘要

在量子储备池计算(QRC)领域,许多不同的计算模型和架构已被提出。从这些模型中,我们识别出基于反馈的模型(利用反馈机制将QRC的经典测量结果重新嵌入)和递归模型(采用包含记忆量子比特和读出量子比特的多寄存器方法)作为两种已在硬件上得到讨论和验证的主要竞争架构。本文在递归架构的基础上展开,这类架构采用双寄存器方法,使QRC具有衰减记忆。虽然这些方法已在硬件上验证并在噪声中等规模量子(NISQ)处理器上展示了优异性能,但记忆容量产生的确切机制尚未被完全理解,也无法完全控制。为此,我们扩展了递归方法,提出了一种硬件可实现的机制,称为可调部分SWAP,它允许直接控制在基于门的量子处理器上实现的量子储备池网络的记忆耗散速率。我们从受控振幅阻尼通道的角度讨论了该机制背后的理论,并分别使用模拟和IBM QPU,通过随机短期记忆容量(STMC)回忆基准和NARMA-5数据集进行了验证实验。

英文摘要

In the field of quantum reservoir computing (QRC), many different computational models and architectures have been proposed. From these models, we identify feedback-based models -- which use a feedback mechanism to re-embed classical measurements from the QRC -- and recurrent models -- which use a multi-register approach with memory and readout qubits -- as the two major competing architectures that have been discussed and validated on hardware. In this paper, we build upon the recurrent architectures, which employ a two-register approach to endow the QRC with a fading memory. While these approaches have been validated on hardware and have demonstrated strong real-world performance on noisy intermediate-scale quantum (NISQ) processing units (QPUs), the exact mechanism through which the memory capacity arises is not completely understood or fully controllable. We therefore augment the recurrent approaches and present a hardware-realizable mechanism, which we call a tunable partial-SWAP, that allows direct control of the rate of memory dissipation from a quantum reservoir network (QRN) implemented on a gate-based QPU. The theory behind this mechanism is discussed in terms of a controlled amplitude-damping channel, and validation experiments using a randomized short-term memory capacity (STMC) recall benchmark and the NARMA-5 dataset are conducted using simulation and IBM QPUs, respectively.
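
For readers unfamiliar with partial SWAP gates, the sketch below builds the textbook one-parameter family U(theta) = cos(theta) I + i sin(theta) SWAP on two qubits, which equals exp(i theta SWAP) because SWAP squared is the identity. At theta = 0 nothing is exchanged, and the probability of exchanging the states |01> and |10> grows as sin^2(theta). The parametrization used in the paper for its tunable partial-SWAP is not given in the abstract and may differ, so treat this as an assumed illustration of the general idea.

```python
import math

SWAP = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]

def partial_swap(theta):
    """U(theta) = cos(theta) * I + i * sin(theta) * SWAP as a 4x4 matrix."""
    c, s = math.cos(theta), 1j * math.sin(theta)
    return [[c * (r == k) + s * SWAP[r][k] for k in range(4)] for r in range(4)]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def dagger(a):
    """Conjugate transpose of a 4x4 complex matrix."""
    return [[a[j][i].conjugate() for j in range(4)] for i in range(4)]

theta = 0.7                                  # arbitrary angle for illustration
U = partial_swap(theta)
swap_probability = abs(U[1][2]) ** 2         # probability that |10> ends up as |01>
```

Tuning theta therefore sets how much of one qubit's state is handed over to the other per application, which is the intuition behind using such a gate to control how quickly stored information leaves a memory register.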

2606.06564 2026-06-18 cs.LG cs.AI 版本更新

HAARES: Half-Split Residual Basis Routing for Deep Transformers

WAV:面向深度仅解码器Transformer的多分辨率块残差路由

Kehan Wang

发表机构 * Chongqing University(重庆大学)

AI总结 提出WAV v1方法,通过为每个块增加方向性细节基(相位基和分裂基)来增强残差路由,在深层Transformer中优于现有方法,48层时在TinyStories和Text8上取得更低验证损失。

Comments 6 pages, 4 figures, 3 tables

详情
AI中文摘要

残差连接对于训练深度Transformer至关重要,但标准的PreNorm残差流以固定的单位权重聚合子层更新。最近的注意力残差用内容相关的深度路由替代了这种固定累积,而块注意力残差通过对块级残差摘要进行路由使机制高效。然而,单个块摘要仅存储块内的低频总残差位移,丢弃了方向性结构,例如注意力与MLP的不平衡以及早期与晚期块的动态。我们提出WAV v1,一种用于仅解码器Transformer的轻量级多分辨率残差路由方法。WAV v1不是仅通过累积残差和来表示每个块,而是为每个块增加两个方向性细节基:一个对比注意力和MLP更新的相位基,以及一个对比早期和晚期子层更新的分裂基。这些基与标准块摘要一起通过相同的深度softmax混合器进行路由,而负细节源初始化和分离的RMS匹配稳定了训练。在字符级TinyStories和Text8语言建模中,WAV v1显示出明显的深度相关优势。尽管在12层时并非始终有益,但在24层时变得有竞争力,并在48层时优于所有基线。在48层时,WAV v1将TinyStories上的验证损失从0.4960降至0.4738,Text8上从0.9363降至0.9305,且额外参数可忽略。这些结果表明,方向性残差细节(而不仅仅是块级和)对于在更深Transformer中扩展残差路由很重要。

英文摘要

Block-level residual routing makes learned residual aggregation practical by routing over block summaries, but each summary compresses an ordered sequence of attention and MLP updates into one cumulative vector. We propose HAARES, a lightweight residual basis router that keeps the cumulative block source and adds one half-split detail basis, computed as the difference between first-half and second-half residual updates. The detail basis is RMS-matched and updated online, exposing coarse intra-block trajectory information without dense sublayer-level routing. Across OpenWebText, cross-domain character-level benchmarks, and BPE-tokenized OpenWebText, the empirical pattern is depth-dependent: gains are small or mixed at shallow depth and most reliable in 48-layer models. In the 201M 48-layer setting, HAARES improves over Block AttnRes across all three seeds, while a 453M two-seed probe shows the same direction. Ablations rule out source duplication, random signed details, fixed detail-source biases, or block-count changes alone. Cost analysis shows that the method is FLOP-light but not wall-clock-free: it adds memory and routing overhead, yet its relative arithmetic cost is amortized as width grows and earlier convergence can reduce time-to-target.
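
The half-split detail basis described above can be sketched in a few lines: given the ordered residual updates inside one block, the cumulative source is their sum and the detail source is the first-half sum minus the second-half sum. The abstract says the detail is "RMS-matched" without giving the rule, so the version below assumes it is rescaled to have the same RMS as the cumulative source. The function names, the epsilon, and the toy vectors are all illustrative, and the online update and the router itself are not shown.

```python
def rms(v):
    """Root mean square of a vector."""
    return (sum(x * x for x in v) / len(v)) ** 0.5

def vec_sum(vectors, dim):
    """Element-wise sum of a list of vectors of length `dim`."""
    total = [0.0] * dim
    for v in vectors:
        total = [t + x for t, x in zip(total, v)]
    return total

def block_sources(updates, eps=1e-8):
    """Return (cumulative, detail) sources for one block's ordered updates."""
    dim = len(updates[0])
    half = len(updates) // 2
    first = vec_sum(updates[:half], dim)
    second = vec_sum(updates[half:], dim)
    cumulative = [a + b for a, b in zip(first, second)]
    detail = [a - b for a, b in zip(first, second)]
    # Assumed RMS matching: rescale the detail to the cumulative source's RMS.
    scale = rms(cumulative) / (rms(detail) + eps)
    return cumulative, [scale * d for d in detail]

# Four toy sublayer updates (e.g. attention, MLP, attention, MLP).
updates = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
cumulative, detail = block_sources(updates)
```

The cumulative vector alone cannot tell whether the block's movement happened early or late; the sign and direction of the detail vector carry that information.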

2606.10466 2026-06-18 cs.LG cs.AI 版本更新

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

UPLOTS: 一种用于约束时间序列生成的统一预训练语言模型

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

发表机构 * University of New South Wales(新南威尔士大学) HKUST(GZ)(香港科技大学(广州)) BUAA(北京航空航天大学)

AI总结 提出UPLOTS,一种基于统一预训练语言模型和提示引导的框架,通过动态多数据集损失重加权和提示到模式映射,实现跨领域约束时间序列生成,在四个基准上验证了其泛化性和数据增强效果。

详情
AI中文摘要

在时间序列生成中,现有方法通常为每个数据集手工设计或训练单独的模型,这阻碍了它们的可扩展性,并且未能利用跨领域的共享时间结构。为了解决这种碎片化问题,我们提出了UPLOTS,一种统一的、提示引导的语言模型框架,用于跨不同领域的约束时间序列生成。UPLOTS不是构建任务特定的模型,而是利用一个由学习到的约束提示引导的单一预训练transformer骨干网络,从而能够按需生成并精确控制模式。一个关键创新是我们的动态多数据集损失重加权和提示到模式映射,这使得UPLOTS能够在训练期间内化多样化的时间结构,并在推理时有条件地生成它们。我们在四个真实世界基准和多个约束设置(包括峰值周期、日历、负载水平和波动性模式)上评估了UPLOTS。额外的保留约束组合和下游预测实验进一步表明,UPLOTS能够泛化到原始峰值模式设置之外,并在真实数据稀缺的情况下改进数据增强。我们的代码和基线可在匿名GitHub仓库获取:https://anonymous.4open.science/r/UPLOTS-6C36。

英文摘要

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at an anonymous GitHub repo: https://anonymous.4open.science/r/UPLOTS-6C36.

2606.12629 2026-06-18 cs.LG cs.AI 版本更新

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Bag of Dims:通过维度级符号模式实现无需训练的机制可解释性

Varun Reddy Nalagatla

发表机构 * Amazon Web Services(亚马逊云服务)

AI总结 本文提出Bag of Dims框架,证明Transformer隐藏状态的标准基即可作为无需训练的特征基,通过维度符号模式编码语义,并在三个模型上验证了其有效性。

Comments 22 pages, 5 figures, 27 tables

详情
AI中文摘要

我们表明,Transformer隐藏状态的标准基已经提供了一个无需训练、架构通用的特征基。单个维度通过其符号编码语义内容,通过其幅度编码置信度,充当独立的二进制寄存器。我们通过四个渐进实验在三个模型家族(Qwen 3.5-4B、Gemma 3-4B、Mistral 7B)上验证了这种Bag of Dims框架。仅符号模式就携带预测性内容:将所有幅度替换为1,通过LM头实现72-93%的top-5下一个token准确率,而无需任何解码器的纯汉明评分达到80-90%的top-4096准确率。这些符号模式组织成语义特征:使用单token类型缓存(每个词汇token一次前向传播,无上下文),我们通过每维度符号一致性(平均AUC 0.80)从50个锚点发现了175个类别,无需任何训练。一个训练过的探针仅增加+0.018 AUC并收敛到轴对齐的权重,证实了可忽略的跨维度结构。这种结构扩展到注意力:所有175个类别在K和V投影中仍然可发现。在写入端,静态FFN权重检查将20%的特征与单个写入神经元联系起来(一致性>0.70;随机对照:0%),通过多数投票,top-200神经元联盟在99.9%的原型上实现>0.70的一致性。完全无监督的发现(随机种子,无标签)在所有三个模型上扩展到1500个特征,产量100%,稀疏度99%,成对互信息为0.0014比特,证实了低维度间耦合。这些结果确立了标准基已经足以在整个Transformer计算路径中进行特征读取,无需训练、无需优化,且每个词汇token仅需一次前向传播,无需GPU天数。

英文摘要

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
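
The reading procedure described in the abstract, counting sign agreements on a chosen subset of dimensions, is simple enough to sketch directly. Below, a "feature" is a list of dimension indices with an expected sign for each, and its score is the fraction of those dimensions whose sign matches. The Hamming helper counts sign disagreements between two hidden states. The function names, the convention that zero counts as positive, and the toy vectors are assumptions for illustration; how the paper selects dimensions and thresholds is not shown.

```python
def signs(hidden):
    """Map each dimension to +1 or -1 (zero is treated as +1 here)."""
    return [1 if h >= 0 else -1 for h in hidden]

def sign_agreement(hidden, dims, pattern):
    """Fraction of the feature's dimensions whose sign matches `pattern`."""
    s = signs(hidden)
    hits = sum(1 for d, p in zip(dims, pattern) if s[d] == p)
    return hits / len(dims)

def hamming_distance(a, b):
    """Number of dimensions where two hidden states differ in sign."""
    return sum(1 for x, y in zip(signs(a), signs(b)) if x != y)

# Toy 4-dimensional hidden state and a hypothetical 3-dimension feature.
hidden = [0.5, -2.0, 3.0, -0.1]
feature_dims = [0, 1, 3]
feature_pattern = [1, -1, 1]
score = sign_agreement(hidden, feature_dims, feature_pattern)
distance = hamming_distance(hidden, [1.0, 1.0, 1.0, -1.0])
```

Note that magnitudes play no role in either function: scaling the hidden state by any positive constant leaves both the score and the distance unchanged.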

2606.12808 2026-06-18 cs.LG cs.AI 版本更新

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

SymQNet: 低延迟自适应哈密顿量学习的摊销获取

Yash Vardhan Tomar, Dheeraj Peddireddy

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出SymQNet,一种摊销强化学习方法,通过离线学习后验条件获取策略,在线快速前向传播,显著降低自适应哈密顿量学习的获取延迟。

详情
AI中文摘要

自适应哈密顿量学习对于校准和表征量子设备至关重要。在自适应控制器中,选择下一个实验本身就是一个计算。贝叶斯设计规则在每次后验更新后重新计算,这一步可能需要几秒钟。在数百次试验中,这些秒数成为自适应性的显著墙钟成本。我们引入SymQNet,一种用于低延迟自适应哈密顿量学习的摊销强化学习方法。SymQNet离线学习后验条件获取策略,然后在线使用快速策略前向传播,同时保留贝叶斯后验反馈。在横向场伊辛基准测试中,相对于有界Fisher信息搜索和有界两步贝叶斯主动学习(BALD),SymQNet显著降低了获取延迟。在五量子比特时,相对于这些在线基线,它仅获取决策延迟降低了47.1×和72.6×;在十二量子比特时,SymQNet的完整模拟步骤需要1.02秒,而有界两步BALD需要13.27秒。总体而言,我们表明学习获取可以使自适应哈密顿量学习对于重复的低延迟工作负载变得实用。

英文摘要

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

2606.16214 2026-06-18 cs.LG cs.AI 版本更新

Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

贝叶斯深度学习中的校准无采样不确定性估计

Tobias Jan Wieczorek, Leon de Andrade, Thomas Möllenhoff, Marcus Rohrbach

发表机构 * TU Darmstadt & hessian.AI, Darmstadt, Germany(达姆施塔特工业大学 & hessian.AI,德国达姆施塔特) RIKEN Center for Advanced Intelligence Project, Tokyo, Japan(日本理化学研究所革新智能研究中心,日本东京)

AI总结 提出校准方差传播(CVP),通过新型归一化层传播方法、激活函数处理技术及轻量校准步骤,在单次前向传播中高效估计不确定性,在Transformer和CNN上达到与MC采样相当的精度,成本显著降低。

详情
AI中文摘要

现代深度学习模型仍然以过度自信而闻名,限制了它们在高风险应用中的可靠性。贝叶斯方法通过学习模型参数的分布来应对这一问题,最近的进展使得在大规模架构上以与AdamW相当的成本实现这一目标成为可能。然而,测试时仍存在一个挑战:预测必须对使用后验采样权重的多次前向传播取平均,这代价高昂。方差传播提供了一种高效的替代方案,在单次前向传播中计算每层不确定性的解析近似。虽然此类技术对MLP有效,但由于现代架构的深度增加和层类型多样,其扩展仍然具有挑战性。为填补这一空白,我们提出了校准方差传播(CVP),它引入了一种新的归一化层传播方法,结合了处理激活函数的近期技术,并通过轻量校准步骤吸收残差误差。CVP在Transformer和CNN上产生与MC采样精度相当的不确定性估计,而成本仅为其一小部分。与先前的方差传播工作相比,CVP将BEiT-3在视觉推理(NLVR2)上$0.5\%$风险下的覆盖率从$8.2\%$提高到$14.6\%$,将ViLT在VQAv2上的覆盖率从$2.6\%$提高到$10.8\%$,且增益扩展到卷积架构。

英文摘要

Modern deep learning models remain notoriously prone to overconfidence, limiting their reliability in high-stakes applications. Bayesian methods aim to counter this by learning a distribution over model parameters, and recent advances now make this feasible for large-scale architectures at costs comparable to AdamW. However, a challenge remains at test time: predictions must be averaged across many forward passes with weights sampled from the posterior, which is prohibitively expensive. Variance propagation offers an efficient alternative, computing layer-wise analytical approximations of uncertainty in a single forward pass. While such techniques are effective for MLPs, their extension to modern architectures remains challenging, due to increased depth and diversity of layer types. To fill this gap, we propose Calibrated Variance Propagation (CVP), which introduces a new propagation method for normalization layers, combines it with recent techniques for handling activation functions, and absorbs residual error through a light calibration step. CVP yields comparably accurate uncertainty estimates to MC sampling across transformers and CNNs, at a fraction of the cost. Against prior variance propagation work, CVP improves coverage at $0.5\%$ risk from $8.2\%$ to $14.6\%$ with BEiT-3 on Visual Reasoning (NLVR2) and from $2.6\%$ to $10.8\%$ with ViLT on VQAv2, with gains extending to convolutional architectures.
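To make "variance propagation in a single forward pass" concrete, here is a minimal sketch for one linear layer only. It is not CVP: it assumes independent Gaussian weights (a mean-field posterior) and independent inputs, and it omits the normalization-layer propagation, activation handling, and calibration step that the abstract attributes to CVP. All names and numbers are hypothetical.

```python
import numpy as np

def propagate_linear(x_mean, x_var, w_mean, w_var):
    """Moment-match y = W x for independent Gaussian weights and inputs.

    For independent w and x:
    Var(w * x) = m_w^2 * v_x + v_w * (m_x^2 + v_x).
    """
    y_mean = w_mean @ x_mean
    y_var = (w_mean ** 2) @ x_var + w_var @ (x_mean ** 2 + x_var)
    return y_mean, y_var

# Hypothetical layer and deterministic input (input variance is zero).
x_mean = np.array([1.0, 2.0])
x_var = np.zeros(2)
w_mean = np.array([[1.0, 0.0], [0.5, -1.0]])
w_var = np.array([[0.1, 0.1], [0.2, 0.05]])

y_mean, y_var = propagate_linear(x_mean, x_var, w_mean, w_var)

# Monte Carlo reference: many sampled forward passes instead of one.
rng = np.random.default_rng(0)
w_samples = w_mean + np.sqrt(w_var) * rng.standard_normal((20000, 2, 2))
mc_var = (w_samples @ x_mean).var(axis=0)

print("analytic variance:", y_var)
print("sampled variance: ", mc_var)
```

The analytic path costs one matrix product per moment, while the sampled path needs thousands of forward passes to reach a similar estimate, which is the cost gap the abstract refers to.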

6. 自然语言与多模态智能 14 篇

2606.16276 2026-06-18 cs.AI 版本更新

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

SpecAlign: 通过合成数据实现高效的大语言模型规范对齐

Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学) Carnegie Mellon University(卡内基梅隆大学) LMU Munich(慕尼黑大学) University of Southern California(南加州大学)

AI总结 提出规范对齐新范式,通过从规范文档合成数据(SpecAlign框架),结合结构化规则标注、可控规范实例化和多智能体对抗数据合成,生成细粒度偏好对,提升规则遵守度且不损害通用能力。

Comments 58 pages

详情
AI中文摘要

随着大语言模型(LLM)在现实应用中的部署日益增多,对齐不再由单一的通用安全或有用性概念主导,而是由提供商或应用特定的模型规范主导。这些规范通常冗长、结构化且频繁更新,然而现有的对齐流程缺乏系统化的机制来将其作为训练信号。在本文中,我们提出规范对齐(specification-grounded alignment),一种新的对齐范式,将提供商编写的模型规范作为主要对齐目标,而非抽象原则或静态基准。为实例化该范式,我们引入SpecAlign框架,该框架直接从规范文档合成对齐数据。SpecAlign结合结构化规则标注、可控规范实例化和多智能体对抗数据合成,生成细粒度、边界感知的偏好对,捕获合规行为和有意义的规范违反。在多个模型规范和骨干模型上的实验表明,使用SpecAlign进行训练一致地提高了规则遵守度,同时保持了通用能力并避免了过度保守的行为。这些结果表明,将对齐建立在显式模型规范上,能够实现LLM行为对不断变化的政策要求的快速、精确和可扩展的适应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

2502.07531 2026-06-18 cs.CV cs.AI cs.LG cs.MM 版本更新

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

VidCRAFT3: 面向图像到视频生成的相机、物体与光照控制

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Innovation Institute(上海创新研究院) Zhejiang University(浙江大学) Huawei Noah's Ark Lab(华为诺亚方舟实验室) Westlake University(西湖大学) School of Data Science and MOE Frontiers Center for Brain Science, Fudan University(复旦大学数据科学学院和教育部脑科学前沿中心) Fudan ISTBI–ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University(浙江师范大学复旦ISTBI-浙师大脑启发智能算法中心)

AI总结 提出VidCRAFT3框架,通过显式建模几何、运动与光照的跨因素交互,实现对相机运动、物体运动和光照方向的独立或联合控制,在控制精度和视觉一致性上达到最优。

Comments Accepted to TVCG 2026

详情
AI中文摘要

可控图像到视频(I2V)生成将参考图像转换为由用户指定控制信号引导的连贯视频。虽然对相机运动、物体运动和光照的精确控制对于高保真创作至关重要,但现有方法通常独立处理这些因素,忽视了动态场景中视角、几何和光照之间的物理耦合,导致这些因素同时变化时出现阴影不匹配和透视漂移等视觉不一致问题。我们提出了VidCRAFT3,一个统一且灵活的I2V框架,显式建模几何、运动和光照之间的跨因素交互,实现对相机运动、物体运动和光照方向的独立或联合控制。Image2Cloud提供显式的3D几何先验以实现精确的相机运动控制。ObjMotionNet将稀疏物体轨迹编码为多尺度运动特征,以引导逼真的物体运动。空间三重注意力Transformer(Spatial Triple-Attention Transformer)通过光照交叉注意力整合光照方向,实现一致的重光照。为了解决联合标注数据的稀缺性,我们构建了VideoLightingDirection(VLD)数据集,包含精确的逐帧光照方向标注,并引入三阶段渐进训练策略,使得无需完全联合标注即可实现鲁棒学习。大量实验表明,VidCRAFT3在多种场景下的控制精度和视觉一致性上达到了最先进水平。

英文摘要

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.

2508.09191 2026-06-18 cs.LG cs.AI 版本更新

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

从数值到标记:一种基于符号离散化的LLM驱动上下文感知时间序列预测框架

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室) University of Science and Technology of China(中国科学技术大学) College of Intelligence and Computing(智能科学与计算学院) iFLYTEK Research(iFLYTEK研究院)

AI总结 提出TokenCast框架,利用大语言模型通过符号离散化将连续时间序列转化为标记,与上下文文本对齐,实现上下文感知的预测,实验证明有效。

详情
AI中文摘要

时间序列预测在能源、医疗和金融等关键应用领域的决策支持中起着重要作用。尽管近期取得了进展,但由于难以将历史数值序列与通常包含非结构化文本数据的上下文特征相整合,预测精度仍然有限。为了解决这一挑战,我们提出了TokenCast,一个由大语言模型(LLM)驱动的框架,利用基于语言的符号表示作为上下文感知时间序列预测的统一中介。具体来说,TokenCast采用离散分词器将连续数值序列转化为时间标记,实现与基于语言输入的结构对齐。为了有效弥合模态之间的语义差距,时间标记和上下文标记通过预训练的LLM嵌入到共享表示空间中,并通过生成目标进一步优化。基于这一统一语义空间,对齐后的LLM随后以监督方式进行微调,以预测未来的时间标记,然后解码回原始数值空间。在真实世界数据集上的大量实验证明了我们框架的有效性,并突显了其作为上下文感知时间序列预测生成框架的潜力。代码可在 https://github.com/Xiaoyu-Tao/TokenCast 获取。

英文摘要

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, a large language model (LLM) driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To effectively bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained LLM, further optimized with generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework and highlight its potential as a generative framework for context-aware time series forecasting. The code is available at https://github.com/Xiaoyu-Tao/TokenCast.
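The "values to tokens and back" round trip can be illustrated with the simplest possible discretizer. This sketch swaps in uniform binning purely for illustration; the abstract does not say how TokenCast's discrete tokenizer is built, and the bin count, value range, and function names below are assumptions.

```python
import numpy as np

def tokenize(series, n_bins, lo, hi):
    """Map each value to the id of its uniform bin (ids in 0..n_bins-1)."""
    edges = np.linspace(lo, hi, n_bins + 1)
    clipped = np.clip(series, lo, hi)
    # Digitizing against the interior edges keeps hi inside the last bin.
    return np.digitize(clipped, edges[1:-1])

def detokenize(tokens, n_bins, lo, hi):
    """Decode each token id back to the centre of its bin."""
    width = (hi - lo) / n_bins
    return lo + (np.asarray(tokens) + 0.5) * width

# Hypothetical normalized series and vocabulary size.
series = np.array([0.0, 0.26, 0.5, 0.99, 1.0])
tokens = tokenize(series, n_bins=4, lo=0.0, hi=1.0)
decoded = detokenize(tokens, n_bins=4, lo=0.0, hi=1.0)

print(tokens.tolist())
print(decoded.tolist())
```

With uniform bins the reconstruction error is bounded by half a bin width, so the vocabulary size directly trades sequence fidelity against the number of symbols the language model has to handle.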

2510.04120 2026-06-18 cs.CL cs.AI 版本更新

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

探究大语言模型隐喻处理中的语义对齐、词汇不变性和句法影响

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2CT Lab, Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系NLP2CT实验室)

AI总结 通过几何探测、上下文替换和句法扰动三种方法,分析LLM在隐喻处理中的语义漂移、词汇稳定性及句法敏感性,揭示强行为表现可能源于异质信号。

Comments Accepted to ACL 2026

详情
AI中文摘要

大语言模型(LLM)在隐喻检测和解释任务上表现出色,但尚不清楚这种行为成功揭示了隐喻处理的哪些方面。我们通过探测三个互补维度:语义属性对齐、词汇不变性和句法敏感性,对行为证据的局限性进行诊断分析。使用几何探测,我们评估模型生成的解释是否与参考语义属性对齐;通过上下文变化替换,分析隐喻和字面表达之间词汇关联的稳定性;通过受控句法扰动,检查隐喻检测的敏感性。我们的分析表明,LLM生成的解释可能相对于参考属性出现语义漂移;稳定的词汇锚点在不同上下文条件下持续存在,可能支持常规隐喻,同时使需要上下文整合的新奇隐喻产生偏差;检测性能对句法不规则性敏感。这些发现表明,强行为表现可能反映了异质的潜在信号,强调在将隐喻基准解释为稳健、集成语义理解的证据时需要谨慎。

英文摘要

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

2510.15551 2026-06-18 cs.CL cs.AI cs.LG 版本更新

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

从统计视角重新思考跨语言差距

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 提出跨语言差距源于目标语言响应方差,通过形式化偏差和无偏误差,并采用推理时集成方法降低方差,使跨语言迁移得分提升8%-50%以上。

Comments 30 pages

详情
AI中文摘要

任何知识片段通常以一种或少数几种自然语言表达在网页或大型语料库中。大型语言模型(LLMs)通过从源语言获取知识,并在使用目标语言查询时使其可访问,从而充当桥梁。跨语言差距是指使用目标语言而非源语言查询知识时准确率的下降。现有研究侧重于导致跨语言差距的建模或训练失败。在这项工作中,我们采取另一种视角来表征跨语言错误的性质,并假设目标语言中响应的方差是造成这一差距的关键原因。我们首次将跨语言差距形式化为有偏误差和无偏误差。通过多种控制方差并减少跨语言差距的推理时干预,我们实证验证了我们的假设。我们展示了几种测试时集成方法,这些方法降低了响应方差,从而将源-目标迁移得分提高了多达12个绝对百分点,在各种LLMs上实现了8%到超过50%的相对提升。

英文摘要

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research has focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points, yielding relative gains of 8% to over 50% across various LLMs.
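A toy calculation shows why ensembling at test time can help when errors come from variance rather than bias. It is a sketch under strong assumptions that are not taken from the paper: each sampled response is an independent draw, and it is simply right with probability `p` or wrong otherwise. The paper's actual ensemble methods may differ, and the helper names are illustrative.

```python
from collections import Counter
from math import comb

def majority_vote(answers):
    """Return the most frequent answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent samples is correct (n odd)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

# Noisy but mostly right responses: voting raises accuracy.
print(majority_accuracy(0.6, 1), majority_accuracy(0.6, 5))

# Systematically wrong responses: voting makes things worse.
print(majority_accuracy(0.4, 1), majority_accuracy(0.4, 5))

print(majority_vote(["Paris", "Lyon", "Paris"]))
```

Under these assumptions, voting only pays off when the single-sample answer is right more often than wrong, which is one loose way to read the abstract's distinction between error that variance reduction can remove and error that it cannot.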

2601.14968 2026-06-18 cs.LG cs.AI 版本更新

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

InstructTime++: 通过隐式特征增强的多模态语言建模进行时间序列分类

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室)

AI总结 提出将时间序列分类转化为多模态生成任务,通过离散化模块和对齐投影层弥合模态差距,并利用隐式特征建模提升语言模型性能。

详情
AI中文摘要

大多数现有的时间序列分类方法采用判别范式,将输入序列直接映射到独热编码的类别标签。虽然有效,但这种范式难以融入上下文特征,也无法捕捉类别间的语义关系。为了解决这些局限性,我们提出了InstructTime,一种将时间序列分类重新定义为多模态生成任务的新框架。具体来说,连续的数值序列、上下文文本特征和任务指令被视为多模态输入,而类别标签则通过调优的语言模型作为文本输出生成。为了弥合模态差距,InstructTime引入了一个时间序列离散化模块,将连续序列转换为离散的时间标记,同时结合对齐投影层和生成式自监督预训练策略,以增强跨模态表示对齐。在此框架基础上,我们进一步提出了InstructTime++,通过引入隐式特征建模来扩展InstructTime,以补偿语言模型有限的归纳偏差。InstructTime++利用专门的工具包从原始时间序列和上下文输入中挖掘信息丰富的隐式模式,包括统计特征提取和基于视觉-语言模型的图像描述,并将其转化为文本描述以实现无缝集成。在多个基准数据集上的大量实验证明了InstructTime++的优越性能。

英文摘要

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.

2601.17226 2026-06-18 cs.CL cs.AI 版本更新

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复:面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 提出RRR强化学习框架,结合结构主义叙事学与标量叙事性,通过d-RLAIF从文本特征中获取训练信号,无需参考输出,提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

详情
AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷,此时它们无法再依赖回忆所记忆的训练数据。基于真实值的后训练(如SFT)无法教会LLM生成逻辑合理的叙事事件。本文提出Retell, Reward, Repeat (RRR),一个基于强化学习的流水线,将结构主义叙事学与标量叙事性相结合,以教授故事结构。我们扩展了TimeTravel数据集,加入人工标注的叙事平衡阶段,以评估奖励模型。通过d-RLAIF,RRR从文本特征的叙事性中推导训练信号,无需参考输出。评估表明,RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线,输出质量还通过人类盲评偏好得到验证。RRR仅依赖小型的纯查询数据集,为故事讲述(一个目前缺乏有效后训练方法的领域)提供了一种基于语言学、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2601.19792 2026-06-18 cs.CL cs.AI cs.HC 版本更新

LVLMs and Humans Ground Differently in Referential Communication

LVLMs与人类在指称交流中建立共同基础的方式不同

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

AI总结 通过人类与AI配对的多轮指称交流实验,发现LVLMs无法像人类一样利用共同基础生成和解析指称表达,导致交流不畅。

Comments 27 pages, 16 figures

详情
AI中文摘要

对于生成式AI代理与人类用户有效合作,准确预测人类意图的能力至关重要。但这种协作能力仍然受到一个关键缺陷的限制:无法建模共同基础。我们提出了一个因子设计的指称交流实验,涉及指导者-匹配者配对(人类-人类、人类-AI、AI-人类和AI-AI),他们在多轮重复回合中交互,以匹配与任何明显词汇化标签无关的物体图片。我们表明,LVLMs无法以促进顺畅交流的方式交互式生成和解析指称表达,而这是人类语言使用的基础技能。我们发布了包含356个对话(89对,每对4轮)的语料库,以及用于数据收集的在线流程和用于分析准确性、效率和词汇重叠的工具。

英文摘要

For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We show that LVLMs cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each) along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.

2602.06470 2026-06-18 cs.CL cs.AI 版本更新

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 本文提出UNO框架,通过用户日志提炼规则和偏好对,利用查询反馈驱动聚类处理数据异质性,量化模型知识与日志数据间的认知差距,提升LLM系统性能。

详情
AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型(LLMs)的发展,但这一范式日益受到高质量数据稀缺以及计算成本上升带来的边际效益递减的限制。因此,近期研究更加关注从真实世界部署中持续学习,其中用户交互日志提供了丰富的真实人类反馈和过程性知识。然而,从用户日志学习具有挑战性,因为它们是非结构化且含噪声的。普通的LLM系统往往难以区分有用的反馈信号与含噪声的用户行为,且用户日志收集与模型优化之间的差异(例如,离策略优化问题)进一步加剧了这一问题。为此,我们提出UNO(User log-driveN Optimization,用户日志驱动的优化),一个统一的框架,用于通过用户日志改进LLM系统(LLMsys)。UNO首先将日志提炼为半结构化的规则和偏好对,然后利用查询和反馈驱动的聚类来管理数据异质性,最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉含噪声的反馈,并构建不同模块来处理从用户日志中提取的初级经验和反思性经验,从而改进未来的响应。大量实验表明,UNO在效果和效率上均达到最先进的水平,显著优于检索增强生成(RAG)和基于记忆的基线方法。我们已开源代码至 https://github.com/bebr2/UNO。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

2602.15851 2026-06-18 cs.CL cs.AI 版本更新

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

叙事理论驱动的LLM方法在自动故事生成与理解中的应用:综述

David Y. Liu, Aditya Joshi, Paul Dawson

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) School of Arts and Media(艺术与媒体学院) University of New South Wales (UNSW)(新南威尔士大学)

AI总结 综述叙事理论驱动的大语言模型方法在自动故事生成与理解中的应用,分析现状并指出生成任务在理论应用、后训练方法、非虚构叙事及叙事层次等方面落后于理解任务,提出未来方向。

Comments 31 pages

详情
AI中文摘要

使用大语言模型(LLM)的叙事理论应用在自动故事生成和理解任务中提供了有前景的方法。本综述考察了自然语言处理(NLP)研究如何利用LLM方法处理叙事研究中的不同概念。我们使用叙事学中的既定区分来分类当前工作,并发现以下内容:(a) 叙事文本来源多样,不仅限于文学;(b) 理论综合与验证是潜在成果;(c) 生成任务在多个方面落后于理解任务:理论应用、后训练方法、探索非虚构叙事以及处理超出故事与话语层面的叙事层次。对于未来方向,我们相信,与其追求单一的、通用的“叙事质量”基准,进步可以受益于以下方面的努力:定义和改进针对单个叙事属性的基于理论的度量;继续开展大规模、理论驱动的文学/社会/文化分析;在情境化上下文中生成叙事;以及继续进行实验,其输出可用于验证或完善叙事理论。本文通过概述当前研究工作和更广泛的叙事研究领域,为NLP中更系统、更具理论依据的叙事研究提供了背景基础。

英文摘要

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: (a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

2605.21028 2026-06-18 cs.CV cs.AI 版本更新

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

DySink:动态帧 sinks 用于自回归长视频生成

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Lab. of Computer Network and Information Integration, Southeast University(东南大学计算机网络与信息集成重点实验室) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Institute of Automation, CAS(中国科学院自动化研究所)

AI总结 本文提出 DySink,一种基于检索的框架,通过维护紧凑的记忆库并选择视觉相关的历史帧作为动态帧 sinks,以提高自回归长视频生成的时间质量和动态度。

详情
AI中文摘要

自回归长视频生成通常采用有界内存的流式生成以提高效率,通常将用于短期连续性的局部窗口与作为长程锚点的静态早期帧 sinks 相结合。然而,这种固定分配在当前视觉状态与早期帧大幅偏离时仍会缓存早期帧,而丢弃可能更相关的中间历史。结果,保留的长程上下文可能适应性下降,并使生成偏向过时的线索;在严重情况下,RoPE 引起的相位再对齐会使头间注意力同质化并导致 sink 崩溃,即内容回退到 sink 帧。我们提出 DySink,一种基于检索的框架,维护紧凑的记忆库并选择视觉相关的历史帧作为动态帧 sinks。DySink 将自适应检索与 sink 异常门相结合,后者检测检索上下文中过度的头间共识并抑制易崩溃的上下文。在分钟级视频上的实验表明,DySink 在时间质量上一致优于强基线,同时实现了更高的动态度,从而带来连贯且更自然的长时程视觉演变。代码和模型权重已发布于 https://github.com/yebo0216best/DySink。

英文摘要

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.
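The retrieval step, picking the most relevant historical frames instead of always keeping the earliest ones, can be sketched as a top-k similarity lookup. This is an illustration only: the abstract does not specify the relevance measure, so cosine similarity over per-frame feature vectors is an assumption, the sink anomaly gate is omitted, and `select_dynamic_sinks` and the toy features are hypothetical.

```python
import numpy as np

def select_dynamic_sinks(query, memory, k):
    """Return the indices (in temporal order) of the k most similar stored frames."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity of each stored frame
    top = np.argsort(-sims)[:k]       # k best matches
    return np.sort(top), sims

# Hypothetical memory bank: one feature vector per stored frame, oldest first.
memory = np.array([
    [1.0, 0.0],    # frame 0: the earliest frame (what a static sink would keep)
    [0.9, 0.1],    # frame 1
    [0.0, 1.0],    # frame 2: intermediate history
    [-1.0, 0.0],   # frame 3
])
# Hypothetical feature of the current visual state, far from frame 0.
query = np.array([0.1, 1.0])

sink_ids, sims = select_dynamic_sinks(query, memory, k=2)
print(sink_ids.tolist())
```

In this toy case the earliest frame is not selected because the current state has drifted away from it, which is the failure of static early-frame sinks that the abstract describes.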

2606.13768 2026-06-18 cs.CV cs.AI 版本更新

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

CineOrchestra:面向电影视频生成的统一实体中心条件控制

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

发表机构 * Snap Inc.(Snap公司) UC Merced(加州大学默塞德分校)

AI总结 提出CineOrchestra,一种统一控制主体、事件、相机和镜头切换的视频扩散模型,通过实体中心条件原语和无参数的旋转位置编码实现多轴联合控制,在密集描述跟随和镜头切换时序上超越六种单轴专用方法。

Comments Project page: https://snap-research.github.io/CineOrchestra

详情
AI中文摘要

电影视频描绘了多个主体在特定时刻行动或互动,通过有意的相机运动捕捉,并由镜头切换拼接而成。这些元素共同要求超出当前文本到视频模型能力的细粒度控制。现有工作孤立地处理每个轴:多主体个性化、时间控制、多镜头合成或相机控制;此前没有框架能联合集成所有四个轴。我们提出CineOrchestra,一种统一的视频扩散模型,同时控制主体、事件、相机和镜头切换。我们的关键洞察是,这些异构的电影元素共享一个基本结构:每个元素都是在特定时间间隔内行动的实体,因此都可以通过一个共享的实体中心条件原语结构来表达,并辅以视觉实体的参考图像。这种表述将架构挑战简化为单个位置编码问题,我们通过两个无参数的协调旋转嵌入来解决:(a) 间隔采样的时间RoPE,在持续时间差异巨大的事件上产生一致的注意力行为;(b) 2D实体-时间交叉注意力RoPE,消除每个实体条件的歧义,并将其路由到对应的时空区域。在两个新基准上,CineOrchestra在密集描述跟随和镜头切换时序上优于六种单轴专用方法,并在成对用户研究和组件消融中持续获得增益。

英文摘要

Cinematic video depicts multiple subjects acting or interacting at specific moments, captured with deliberate camera movement, and stitched together by shot transitions. Together, these elements demand a level of fine-grained control beyond current text-to-video models. Existing work addresses each axis in isolation: multi-subject personalization, temporal control, multi-shot synthesis, or camera control; no prior framework jointly integrates all four. We present CineOrchestra, a unified video diffusion model that controls subjects, events, cameras, and shot transitions simultaneously. Our key insight is that these heterogeneous cinematic elements share a fundamental structure: each is an entity acting over a specific temporal interval, which can therefore all be expressed through one shared structure of entity-centric conditioning primitives, augmented with reference images for visual entities. This formulation reduces the architectural challenge to a single positional encoding problem, which we solve with two parameter-free coordinated rotary embeddings: (a) an interval-sampled temporal RoPE that yields consistent attention behavior across events of dramatically varying duration, and (b) a 2D entity-temporal cross-attention RoPE that disambiguates per-entity conditions and routes each to its corresponding spatiotemporal region. On two new benchmarks, CineOrchestra outperforms six per-axis specialists on dense caption following and shot-transition timing, with consistent gains in a pairwise user study and component ablations. Project page: https://snap-research.github.io/CineOrchestra

2606.17372 2026-06-18 cs.CL cs.AI 版本更新

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

LVLMs在指称通信中的隐式与显式提示策略

Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow, Cameron R. Jones

发表机构 * Stony Brook University(石溪大学)

AI总结 本研究通过控制任务差异,比较显式与隐式提示对LVLM生成高效指称表达的影响,发现显式提示下模型能协调高效表达,而隐式提示则失败,揭示了人机通信的关键差异。

详情
AI中文摘要

两项近期研究(Jones等人,2026;Zeng等人,2026)关于LVLM能否协调高效指称表达得出了明显矛盾的结论。我们在控制研究间任务差异的同时,直接比较了它们的提示风格。我们复现了当显式提示时模型可以协调高效指称表达的发现,表明其他任务差异并非导致结果分歧的原因。然而,我们也发现相同的模型无法从更隐式的提示中推断出通信效率的需求,凸显了人类与AI系统通信方式的关键差异。

英文摘要

Two recent studies (Jones et al., 2026; Zeng et al., 2026) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.

2606.17412 2026-06-18 cs.CV cs.AI Updated version

Enhancing Pathological VLMs with Cross-scale Reasoning

Chi Phan, Tianyi Zhang, Qiaochu Xue, Yufeng Wu, Dan Hu, Zeyu Liu, Sudong Wang, Yueming Jin

Affiliations * Department of Electrical and Computer Engineering, National University of Singapore; PuzzleLogic Pte Ltd; Department of Pathology, Fujian Medical University Cancer Hospital & Fujian Cancer Hospital

AI summary Proposes the first cross-scale training and evaluation paradigm, which strengthens the cross-scale reasoning of pathology vision-language models through multi-magnification visual question answering, and builds the high-quality benchmark Scale-VQA and the model ScaleReasoner-R1, which achieves the best performance.

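Details
AI abstract (translated from Chinese)

Pathology images are inherently multi-scale, requiring pathologists to integrate evidence ranging from overall tissue architecture at low magnification to cellular morphology at high magnification in order to diagnose accurately. Although existing pathology datasets for vision-language models (VLMs) contain multiple scales, they usually lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing key cross-scale representations and from learning evidence-based reasoning. To close this gap, we introduce the first cross-scale training and evaluation paradigm, which formulates pathology interpretation as multi-magnification reasoning. Creating such a task, however, exposes a key challenge: multi-image visual question answering (VQA) is vulnerable to text-only shortcuts, which let models guess answers from magnification-related artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. With this pipeline we build Scale-VQA, a high-quality benchmark of 4,685 multiple-choice questions based on 2,537 pathology images across multiple magnification levels. Finally, we propose ScaleReasoner-R1, a model trained with reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 reaches state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks. The results indicate that even limited cross-scale supervision can significantly improve pathology understanding. The code and demos will be open-sourced.

English abstract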

Pathological images are inherently multi-scale, requiring pathologists to integrate evidence from global tissue architecture at low magnification to cellular morphology at higher magnification for accurate diagnosis. While existing pathological datasets for vision-language models (VLMs) include various scales, they often lack an explicit cross-scale reasoning objective. This limitation prevents VLMs from capturing essential cross-scale representations and learning evidence-based reasoning. To bridge this gap, we introduce the first cross-scale training and evaluation paradigm that formulates pathology interpretation as multi-magnification reasoning. However, creating such a task reveals a critical challenge: multi-image visual question answering (VQA) is prone to text-only shortcuts, which allow models to guess answers using magnification-dependent artifacts rather than visual evidence. To address this, we propose a leakage-aware curation pipeline that combines adversarial text-only screening with constraint-guided question design. Using this pipeline, we construct Scale-VQA, a high-quality benchmark with 4,685 multiple-choice questions grounded in 2,537 pathology images across multiple magnification levels. Finally, we present ScaleReasoner-R1, a model trained via reinforcement learning to optimize performance on the cross-scale VQA task. ScaleReasoner-R1 achieves state-of-the-art performance on our cross-scale reasoning benchmark and generalizes to state-of-the-art performance on established single-scale benchmarks. Findings suggest that even limited cross-scale supervision can significantly improve pathological understanding. The code and demos will be open-sourced.
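
The text-only screening idea in this abstract can be illustrated with a small sketch. It is not the paper's pipeline: the `Question` type, the `screen_leaky` function, and the toy `magnification_shortcut` guesser are all hypothetical names introduced here. The only assumption taken from the abstract is the principle that a question is suspect if it can be answered without looking at the images.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Question:
    text: str            # question stem only, no image content
    options: List[str]   # multiple-choice options
    answer: int          # index of the correct option


def screen_leaky(
    questions: List[Question],
    text_only_guess: Callable[[str, List[str]], int],
) -> Tuple[List[Question], List[Question]]:
    """Split questions into (kept, leaky).

    A question counts as leaky when the text-only guesser, which never
    sees the images, still picks the correct option.
    """
    kept, leaky = [], []
    for q in questions:
        if text_only_guess(q.text, q.options) == q.answer:
            leaky.append(q)
        else:
            kept.append(q)
    return kept, leaky


def magnification_shortcut(text: str, options: List[str]) -> int:
    """Toy guesser (hypothetical): pick the first option sharing a word with the stem."""
    stem_words = set(text.lower().split())
    for i, opt in enumerate(options):
        if stem_words & set(opt.lower().split()):
            return i
    return 0  # fall back to the first option
```

A single correct guess can be luck, so a realistic screen would presumably repeat the check over several guessers or option orderings before discarding a question.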

7. Robotics and Embodied Intelligence: 1 paper

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO Updated version

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA: Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

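Affiliations * NVIDIA

AI summary Proposes Cosmos 3, an omnimodal world model built on a unified mixture-of-transformers architecture that jointly handles language, image, video, audio, and action sequences, sets a new state of the art on understanding and generation tasks, and offers a scalable, general-purpose backbone for embodied agents.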

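Details
AI abstract (translated from Chinese)

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies the key modalities of Physical AI, effectively bringing vision-language models, video generators, world simulators, and world-action models into one framework. Our evaluation shows that Cosmos 3 sets a new state of the art across a diverse range of understanding and generation tasks, demonstrating the capability of omnimodal world models as scalable, general-purpose backbones for embodied agents. At the time the technical report was written, our post-trained Cosmos 3 models were rated the best open-source text-to-image and image-to-video models by Artificial Analysis and the best policy model by RoboArena. To accelerate open research and deployment in Physical AI, we provide our code, model checkpoints, curated synthetic datasets, and evaluation benchmark under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

English abstract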

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

8. Trustworthiness, Safety, and AI Governance: 11 papers

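2510.09905 2026-06-18 cs.AI cs.CL Updated version

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

Affiliations * Amazon

AI summary Studies how user memory induces systematic bias in the emotional reasoning of large language models, finding that high-performing models interpret the emotions of users with advantaged backgrounds more accurately and that personalization mechanisms may embed social hierarchies.

Comments 19 pages, 5 figures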

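Details
AI abstract (translated from Chinese)

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how such memory shapes emotional reasoning is essential. We study how user memory affects the emotional intelligence of large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles yield systematically different emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, several high-performing LLMs showed systematic bias, with profiles from advantaged backgrounds receiving more accurate emotional interpretations. In addition, LLMs show significant disparities across demographic factors in emotion reasoning and supportive recommendation tasks, indicating that personalization mechanisms can embed social hierarchies into a model's emotional reasoning. These results highlight a key challenge for memory-augmented AI: systems designed for personalization may reinforce social inequality. To mitigate these disparities, we curate a general-purpose preference dataset intended to reduce the influence of demographic profiles on emotional understanding.

English abstract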

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion reasoning and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models' emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may reinforce social inequalities. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce demographic profiles' influence on emotional understanding.

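2606.12618 2026-06-18 cs.AI Updated version

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa, Geoffrey Irving

Affiliations * AI Security Institute

AI summary Builds 13 belief-verifiable reasoning model organisms and a varied prompted-lying testbed to evaluate four lie detectors across model scales, finding that activation- and probability-based detectors degrade sharply on trained model organisms, whereas the chain-of-thought judge stays strong, though partly as an artefact of the verification process.

Comments 12 pages, 6 figures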

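Details
AI abstract (translated from Chinese)

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behavior, but evaluating them requires testbeds in which models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail to meet this requirement, which makes earlier positive and negative detection results hard to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in the chain of thought and shown to generalize to held-out tasks, together with Varied Deception, a prompted-lying testbed covering a wide range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a log-probability classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted-lying tasks, across 31 open-weight models (2B to 1T parameters), all four detectors scale positively with model capability. However, every activation- and log-probability-based detector drops sharply on our trained organisms, with DYL retaining the most signal; only the chain-of-thought judge stays strong, with a balanced accuracy of 0.82, partly because our verification process favors CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we propose research directions that may address some of the current limitations. We release our datasets, model organisms, and trained detectors.

English abstract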

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.
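
The abstract reports detector quality as balanced accuracy. As a reminder of what that number measures, here is a minimal sketch using the standard definition (the mean of recall on each class); the function name and the label convention, 1 for "lie" and 0 for "honest", are choices made for this illustration and are not taken from the paper.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of recall on the 'lie' class (1) and the 'honest' class (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = sum(1 for t in y_true if t == 0)
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    return 0.5 * (tp / pos + tn / neg)
```

Unlike plain accuracy, this score does not reward a detector for ignoring the rare class: on 90 honest and 10 lying responses, a detector that always answers "honest" has 90% accuracy but a balanced accuracy of only 0.5.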

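2409.03500 2026-06-18 cs.CY cs.AI Updated version

Quality Perceptions and Intended Engagement in Response to AI-Generated and AI-Assisted News

Fabrizio Gilardi, Sabrina Di Lorenzo, Juri Ezzaini, Beryl Santa, Benjamin Streiff, Eric Zurfluh, Emma Hoes

Affiliations * University of Zurich

AI summary In a preregistered survey experiment (N=599), examines readers' quality perceptions of human-written, AI-assisted, and fully AI-generated news and their willingness to engage after AI involvement is disclosed. Quality ratings were similar, but after disclosure the AI groups showed a higher short-term willingness to read.

Comments Forthcoming, Scientific Reports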

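Details
AI abstract (translated from Chinese)

The growing use of artificial intelligence (AI) in news production raises important questions about how audiences perceive and respond to AI-generated news. This preregistered survey experiment (N = 599, German-speaking Switzerland) examines (i) perceived article quality (measured as credibility, readability, and expertise) for news excerpts that were human-written, AI-assisted, or fully AI-generated, and (ii) self-reported willingness to engage after AI involvement was disclosed. Participants rated two short news excerpts before learning how the articles had been produced. Articles in all conditions received similar perceived-quality ratings. After disclosure, participants in the AI-assisted and AI-generated conditions reported greater willingness than the control group to continue reading their assigned articles, but willingness to read AI-generated news in the future did not differ across conditions. Overall, the results indicate that readers rate the quality of AI-generated and human-written news comparably, while disclosing AI use may temporarily raise curiosity or interest without yet changing longer-term reading intentions.

English abstract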

The increasing use of artificial intelligence (AI) in news production raises important questions about how audiences perceive and respond to AI-generated journalism. This preregistered survey experiment (N = 599, German-speaking Switzerland) examines (i) perceptions of article quality (measured as credibility, readability, and expertise) across news excerpts that were human-written, AI-assisted, or fully AI-generated, and (ii) self-reported intentions to engage following disclosure of AI involvement. Participants rated two short news excerpts before learning how they had been produced. Articles across all conditions were evaluated similarly in perceived quality. After disclosure, participants in the AI-assisted and AI-generated conditions reported a higher willingness to continue reading their assigned articles compared to the control group, but future willingness to read AI-generated news did not differ across conditions. Overall, the findings suggest that readers assess AI-generated and human-written news comparably in quality, while disclosure of AI use can momentarily increase curiosity or interest without yet changing longer-term reading intentions.

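2505.03646 2026-06-18 cs.LG cs.AI cs.CV Updated version

Revealing Hidden Vulnerabilities in Autoencoders through Gradient Signal Restoration

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

Affiliations * University of the Bundeswehr Munich

AI summary Addresses the overestimation of robustness that vanishing gradients cause in adversarial attacks on autoencoders by proposing the GRILL framework, which restores the gradient signal, markedly improves attack effectiveness, and exposes hidden vulnerabilities.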

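Details
AI abstract (translated from Chinese)

The adversarial robustness of deep autoencoders (AEs) has received far less attention than that of discriminative models, even though their compressed latent representations produce ill-conditioned mappings that amplify small input perturbations and destabilize reconstructions. Existing white-box attacks on AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction loss, often converge to suboptimal perturbations and may therefore overestimate AE robustness. We show that this limitation is linked to adversarial loss gradients vanishing during backpropagation through ill-conditioned layers whose intermediate weight matrices have near-zero singular values. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework intended to reduce gradient degradation and make adversarial robustness evaluation of encoder-decoder architectures more reliable. By mitigating adversarial gradient degradation during optimization, GRILL lets attacks better approximate high-distortion perturbations under a fixed norm constraint. Through extensive experiments on multiple AE architectures, covering sample-specific and universal attacks as well as standard and adaptive attack settings, we show that GRILL significantly improves attack effectiveness and thereby exposes vulnerabilities hidden by the limitations of existing attacks. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures have similar vulnerabilities.

English abstract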

Adversarial robustness of deep autoencoders (AEs) has received less attention than that of discriminative models, although their compressed latent representations induce ill-conditioned mappings that can amplify small input perturbations and destabilize reconstructions. Existing white-box attacks for AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction damage, often converge to suboptimal perturbations, thereby potentially overstating AE robustness. We show that this limitation is linked to vanishing adversarial loss gradients during backpropagation through ill-conditioned layers, associated with near-zero singular values in their intermediate weight matrices. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework designed to mitigate gradient degradation and improve the reliability of adversarial robustness evaluation in encoder-decoder architectures. GRILL is designed to mitigate adversarial gradient degradation during optimization, enabling attacks to better approximate high-distortion perturbations under fixed norm constraints. Through extensive experiments across multiple AE architectures, under both sample-specific and universal attacks, as well as standard and adaptive attack settings, we show that GRILL significantly increases attack effectiveness, thereby exposing vulnerabilities hidden by existing attack limitations. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures exhibit similar vulnerabilities.
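
The link the abstract draws between near-zero singular values and vanishing gradients can be seen in a toy linear chain. The sketch below is only an illustration of that mechanism and is not GRILL. It assumes 2x2 linear layers built as U diag(s1, s2) V^T from rotation matrices, so that the singular values are known by construction; every function name is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def rot(theta: float) -> Matrix:
    """2x2 rotation matrix (orthogonal, so it preserves vector norms)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a: Matrix) -> Matrix:
    return [[a[j][i] for j in range(2)] for i in range(2)]


def layer(s1: float, s2: float, a: float, b: float) -> Matrix:
    """Weight matrix U diag(s1, s2) V^T with singular values s1 and s2."""
    return matmul(matmul(rot(a), [[s1, 0.0], [0.0, s2]]), transpose(rot(b)))


def backprop_norm(weights: List[Matrix], grad_out: List[float]) -> float:
    """Norm of dL/dx for the linear chain y = W_n ... W_1 x.

    The chain rule gives dL/dx = W_1^T ... W_n^T dL/dy, so the transposes
    are applied from the last layer back to the first.
    """
    g = grad_out
    for w in reversed(weights):
        wt = transpose(w)
        g = [wt[0][0] * g[0] + wt[0][1] * g[1],
             wt[1][0] * g[0] + wt[1][1] * g[1]]
    return math.hypot(g[0], g[1])
```

When the output gradient is aligned with the weak singular direction of a layer built as `layer(1.0, 1e-4, ...)`, the input gradient's norm equals that singular value, 1e-4, so a norm-bounded attack receives almost no signal along that direction. An orthogonal layer, `layer(1.0, 1.0, ...)`, passes the gradient through at full norm.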

2505.16057 2026-06-18 cs.HC cs.AI cs.MM Updated version

Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals

Ayae Ide, Tory Park, Jaron Mink, Tanusree Sharma

Affiliations * Pennsylvania State University; Arizona State University

AI summary Through interviews with 28 sighted and blind or low-vision users, studies how AI-generated-content indicators are used in practice, finding that content-based and menu-aided indicators each have strengths and weaknesses and that blind or low-vision users face more challenges because interfaces are insufficiently accessible; offers design recommendations.

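Details
AI abstract (translated from Chinese)

Recent advances in generative models and easy-to-use tools have sharply lowered the technical barrier to producing highly realistic audio, images, and video from simple natural language prompts, making AI-generated (AIG) content increasingly common. In response, platforms are adopting verifiable provenance mechanisms and recommending that AIG content be self-disclosed and signaled to users. However, these indicators are often missed, especially when they rely only on visual cues, which makes them ineffective for users with different sensory abilities. To fill this gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 blind or low-vision (BLV) participants, examining how they interact with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices and highlight the differing strengths and weaknesses of content-based indicators (e.g., title, description) and menu-aided indicators (e.g., AI labels). Sighted participants used visual and audio cues, whereas BLV participants relied mainly on audio and existing assistive tools, which limited their ability to identify AIG content. Participants in both groups often overlooked the menu-aided indicators deployed by platforms and preferred to interact with content-based indicators such as titles and comments. We found usability challenges caused by inconsistent indicator placement, unclear metadata, and cognitive overload. These problems were especially critical for BLV individuals because interface elements were insufficiently accessible. We offer practical recommendations and design implications for future AIG indicators along several dimensions.

English abstract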

AI-Generated (AIG) content has become increasingly widespread owing to recent advances in generative models and easy-to-use tools that have significantly lowered the technical barriers to producing highly realistic audio, images, and videos through simple natural language prompts. In response, platforms are adopting provable provenance and recommending that AIG content be self-disclosed and signaled to users. However, these indicators are often missed, especially when they rely solely on visual cues, which makes them ineffective for users with different sensory abilities. To address the gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 blind or low-vision (BLV) participants to examine their interaction with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices, highlighting different strengths and weaknesses of content-based (e.g., title, description) and menu-aided (e.g., AI labels) indicators. While sighted participants leveraged visual and audio cues, BLV participants primarily relied on audio and existing assistive tools, limiting their ability to identify AIG content. Participants in both groups frequently overlooked menu-aided indicators deployed by platforms and instead interacted with content-based indicators such as titles and comments. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload. These issues were especially critical for BLV individuals due to the insufficient accessibility of interface elements. We provide practical recommendations and design implications for future AIG indicators across several dimensions.

2507.04219 2026-06-18 cs.LG cs.AI Updated version

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

Affiliations * Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich; Mila, Université de Montréal

AI summary Proposes Partial Model Collapse (PMC), which achieves unlearning by deliberately triggering distribution collapse on the target data without optimizing on the unlearning targets, effectively removing private information while preserving model utility.

Comments Accepted at ICLR 2026

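Details
AI abstract (translated from Chinese)

Current unlearning methods for large language models optimize on the private information to be removed by including it in the fine-tuning data. We argue that this not only risks reinforcing exposure to sensitive data but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require the unlearning targets in the unlearning objective. Our method is inspired by the recent observation that training a generative model on its own generations leads to distribution collapse, which effectively removes information from the model's outputs. Our core insight is that model collapse can be exploited for machine unlearning by deliberately triggering it on the data we aim to remove. We show theoretically that our method converges to the desired outcome, namely that the model unlearns the data targeted for removal. We show empirically that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and that it removes private information from model outputs more effectively while preserving general model utility. Overall, our contributions are an important step toward more comprehensive unlearning that better matches real-world privacy constraints. Code is available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.

English abstract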

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We show theoretically that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.
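
The observation that PMC builds on, that a generative model trained on its own generations loses information, can be reproduced in a toy setting. The sketch below is not the PMC method, which triggers collapse only on the data targeted for removal. It assumes a categorical "model" that is refit each round to the empirical distribution of a finite sample drawn from the previous round; the function name and parameters are hypothetical.

```python
import random
from typing import List


def self_training_support(vocab: int, samples: int, generations: int,
                          seed: int = 0) -> List[int]:
    """Count distinct outputs that survive each round of self-training."""
    rng = random.Random(seed)
    # Generation 0: samples from a uniform distribution over the vocabulary.
    data = [rng.randrange(vocab) for _ in range(samples)]
    support = [len(set(data))]
    for _ in range(generations):
        # Refitting to the empirical distribution and sampling from it is
        # the same as resampling the previous data with replacement.
        data = [rng.choice(data) for _ in range(samples)]
        support.append(len(set(data)))
    return support
```

An output that is missed in one round can never reappear, so the number of distinct outputs can only stay level or shrink from one generation to the next.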

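2508.03483 2026-06-18 cs.CV cs.AI Updated version

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

Affiliations * AIM Intelligence; Yonsei University

AI summary Proposes the SODA framework, which uses three metrics to systematically measure demographic bias in the objects generated by text-to-image models, finding that neutral prompts implicitly lean toward middle-aged and White people and that demographic cues lead to highly skewed, stereotyped outputs.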

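Details
AI abstract (translated from Chinese)

Although earlier research on text-to-image generation has focused mainly on bias in depictions of people, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework that systematically measures these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images from five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts yield outputs visually closest to those for middle-aged and White people, indicating that these groups are implicitly over-represented in model defaults. In addition, demographic cues trigger highly skewed stereotypical outputs: in 26.6% of object-model-demographic combinations, all 20 generated images share exactly the same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces between-group disparity but paradoxically compresses within-group diversity, replacing one stereotype with another. SODA provides a practical pipeline that makes these implicit associations measurable, as a step toward more responsible AI development.

English abstract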

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.
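
The "all 20 generated images share the exact same attribute value" statistic can be computed with a few lines of counting. This sketch is a hypothetical reading of that one statistic only: the abstract does not define BDS, CDS, or VAC, and the names `top_share` and `fully_concentrated_rate` are introduced here for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

Combo = Tuple[str, str, str]  # (object, model, demographic)


def top_share(values: List[str]) -> float:
    """Share of images carrying the most common attribute value."""
    if not values:
        raise ValueError("need at least one image")
    return Counter(values).most_common(1)[0][1] / len(values)


def fully_concentrated_rate(attrs: Dict[Combo, List[str]]) -> float:
    """Fraction of combinations whose images all share one attribute value."""
    if not attrs:
        raise ValueError("need at least one combination")
    hits = sum(1 for values in attrs.values() if top_share(values) == 1.0)
    return hits / len(attrs)
```

Given one attribute value per generated image for each object-model-demographic combination, `fully_concentrated_rate` returns the kind of proportion the abstract reports as 26.6%.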

2511.20002 2026-06-18 cs.CV cs.AI cs.CR Updated version

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

Affiliations * The Chinese University of Hong Kong, Shenzhen, China; School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China

AI summary Proposes the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router to hijack multiple stateless decisions at once and is realized through theoretical analysis and the SORT optimization strategy, reaching a 66% attack success rate over five targets on Qwen.

Comments Accepted to ICML 2026

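Details
AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) are increasingly deployed in stateless systems such as autonomous driving and robotics. This paper studies a new threat: semantic-aware hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously with a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router that "actively" perceives input semantics and routes them to different attacker-defined targets. To achieve this, we analyze the geometric properties of the latent space both theoretically and empirically. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the basic feasibility of this attack, which achieves a 66% attack success rate over five targets with a single frame against Qwen.

English abstract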

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

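2604.23130 2026-06-18 cs.CL cs.AI Updated version

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

Affiliations * UMBC; Apple

AI summary Proposes a token-driven mechanistic pipeline that localizes jailbreak vulnerabilities through subgroups of sparse autoencoder features, finding that a single harmful token suffices to localize the vulnerable features and that these features concentrate in mid-to-late layers.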

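Details
AI abstract (translated from Chinese)

Jailbreak attacks reveal a persistent failure mode in safety-aligned large language models: a model can be pushed toward harmful behavior, yet the internal representations that enable this shift remain poorly localized. Recent mechanistic safety research usually explains this behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned subgroups of SAE features. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into sparse autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies (cluster-based, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting outputs with a standard harmfulness judge. Single-token-driven grouping reaches harmfulness comparable to full cluster-based grouping, showing that a single harmful prompt token is enough to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear in early and mid-to-late layers and are more concentrated in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results show that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing earlier explanations based on broad adversarial, refusal, or steering directions.

English abstract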

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.

2605.26903 2026-06-18 cs.CR cs.AI Updated version

Practical Anonymous Two-Party Gradient Boosting Decision Tree

Chenyu Huang, Fan Zhang, Minxin Du, Sherman S. M. Chow, Huangxun Chen, Huaming Rao, Danqing Huang, Bo Qian, Peng Chen

Affiliations * Tencent; Hong Kong Polytechnic University; Chinese University of Hong Kong; HKUST-GZ

AI summary For training gradient-boosted decision trees on data vertically partitioned between two parties, proposes an anonymous protocol based on dual circuit private set intersection and oblivious programmable pseudorandom functions that hides record identifiers while remaining efficient.

Comments 19 pages; 2026 IEEE Symposium on Security and Privacy (SP)

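Details
Journal ref
2026 IEEE Symposium on Security and Privacy (SP)
AI abstract (translated from Chinese)

Gradient-boosted decision trees (GBDTs) excel at structured data and are usually trained on features vertically partitioned across mutually distrustful parties. Their speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may perform poorly. Enabling secure computation for GBDTs poses unique challenges, since records must be securely aligned for comparison. Relying on private set intersection (PSI) is the de facto approach, but treating PSI as a safety measure in fact reveals which record identifiers (IDs) the datasets share. Circuit-PSI can help, yet it is costly for generic use. New ideas are needed to train efficiently in a "dark forest". To hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. The dual circuit-PSI in our design lets the two parties alternate as receiver to run pick-then-sum over local features. Through oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. By avoiding universal alignment, we resolve a neglected dilemma: hiding IDs incurs a cost proportional to the domain size. Next, we halve the cost of the ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors, compared with prior secure GBDT (USENIX Security '23) and related secure machine-learning computations. Comparative experiments show that our protocol remains competitive in efficiency with leaky approaches. By enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics scenarios.

English abstract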

Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a "dark forest". Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (USENIX Security '23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.

2606.07150 2026-06-18 cs.CR cs.AI cs.MA cs.NI 版本更新

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

从隐私到工作流完整性:自主智能体互操作性中的通信图元数据

Bijaya Dangol

发表机构 * Independent Researcher(独立研究者)

AI总结 针对智能体通信图元数据泄露问题,提出工作流完整性威胁模型,定义传输层与引导层隐私属性,并通过A2A案例验证元数据保护可有效抑制任务推断。

Comments 22 pages, 7 figures, 6 tables

详情
AI中文摘要

诸如A2A和MCP之类的智能体互操作性协议标准化了智能体之间的通信内容,但假设基于地址的HTTP(S)传输。此类传输保护消息内容,并越来越多地采用端到端加密。它们暴露在明文中的是通信图:哪个智能体联系哪个智能体、何时以及频率如何。在智能体系统中,该图比隐私框架所暗示的更具后果性。端点通常带有能力标签,工作流是结构化和链式的,交互与实际行动耦合,因此观察者恢复的不仅仅是过去的关系。它可以推断出待处理的工作流、正在组装的任务以及可能即将发生的行动。以机器速度,它可以在工作流完成之前根据该推断采取行动。因此,威胁是工作流完整性,而不仅仅是隐私:对自主行动的预测性杠杆。我们为智能体通信图提供了一个威胁模型;识别了使智能体元数据具有独特揭示性的因素(语义性、前瞻性、驱动性);定义了传输层和引导层隐私属性,并评估了候选传输(SimpleX/SMP、Tor、混合网络)与这些属性的匹配程度;并提出了一个A2A案例研究,其中元数据保护绑定是可表达的,但揭示了协议的身份假设。我们在一个基于真实A2A捕获的生成模型上测试了这些。仅凭被动元数据,没有载荷,一个分类器从工作流的开头就能以远高于随机水平的概率恢复任务类别;应用这些属性后,该恢复被急剧拉回随机水平。除了观察者能恢复的内容外,我们衡量了利用泄露的杠杆:在工作流开头和固定预算下,选择对哪些工作流采取行动的对手在此模型中实现了大部分先知攻击者相对于元数据盲攻击者的优势,而相同的属性抑制了这一点。

英文摘要

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to actions, so an observer recovers more than past relationships: it can recognize a recurring pending workflow from its opening and, at machine speed, act on it before it completes. The threat is one of workflow integrity, not privacy alone. We give a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, give them an indistinguishability-game semantics, evaluate transports, and give an A2A case study where a metadata-protecting binding surfaces its implicit identity assumptions. On a corpus of real multi-agent A2A traffic from the official reference agents, on a live A2A binding, and with a generative model as a controlled instrument, a label-blind classifier recovers a task's class from passive metadata at 6x chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. Acting on the leak is distinct from recoverability: under a fixed budget an adversary captures 0.63 of a clairvoyant attacker's advantage on the corpus (0.41 from a workflow's opening), governed by top-ranked precision rather than overall accuracy, so integrity and privacy come apart under defense.

9. 评测、基准与数据集 18 篇

2512.04144 2026-06-18 cs.AI 版本更新

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

RippleBench: 利用现有知识库捕捉涟漪效应

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

发表机构 * Harvard University(哈佛大学) Imperial College London(伦敦帝国学院) Northeastern University(东北大学)

AI总结 提出RippleBench-Maker自动管道,从知识库检索语义邻居生成选择题,评估八种遗忘方法在Llama3-8B-Instruct上的涟漪效应,发现准确率下降随语义距离衰减且跨模型一致。

详情
AI中文摘要

针对语言模型的目标干预,如遗忘或模型编辑,旨在修改特定信息,但其效果往往传播到相关的、非预期的领域(例如,删除病毒学内容可能降低对过敏任务的性能);这些副作用通常被称为涟漪效应。我们引入RippleBench-Maker,一个自动管道,从知识库中检索任何源概念的语义邻居,并生成不同语义距离的多选题。我们使用WikiRAG(一个基于英文维基百科的开源RAG系统)实例化该框架,构建RippleBench-WMDP-Bio(584个种子主题,352,961个问题),并在Llama3-8B-Instruct上评估八种遗忘方法。所有八种方法在遗忘目标附近准确率下降最大,并随语义距离衰减,每种方法具有不同的传播曲线。我们在Mistral-7B、Zephyr-7B和Yi-34B上复现了这些发现;跨模型的差值曲线几乎相同,表明涟漪效应是遗忘方法的属性而非基础模型。我们通过一项包含四个实验的Mechanical Turk研究(5,200+次响应,61名工作者)验证了所有主要管道阶段。我们发布所有代码、数据和基础设施。

英文摘要

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often propagate to related, unintended areas (e.g., removing virology content may degrade performance on allergies); these side-effects are commonly referred to as the ripple effect. We introduce RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at varying semantic distances. We instantiate this framework using WikiRAG, an open-source RAG system over English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. All eight exhibit accuracy drops that are largest near the unlearned target and decay with semantic distance, each with a distinct propagation profile. We replicate these findings across Mistral-7B, Zephyr-7B, and Yi-34B; cross-model delta curves are nearly identical, suggesting ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages using a four-experiment Mechanical Turk study (5,200+ responses, 61 workers). We release all code, data, and infrastructure.
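As a rough illustration of the "delta curve" idea in this abstract, the sketch below groups per-question results by semantic distance and reports the accuracy drop (base model minus unlearned model) in each bucket. The record layout, the function name `delta_curve`, and the toy data are assumptions made for illustration; the abstract does not specify how RippleBench stores or aggregates its results.

```python
from collections import defaultdict

def delta_curve(results):
    """Accuracy drop per semantic-distance bucket.

    results: iterable of (distance, base_correct, unlearned_correct),
    where the last two entries are booleans for one question.
    Returns {distance: base_accuracy - unlearned_accuracy}.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [n questions, base hits, unlearned hits]
    for distance, base_ok, unlearned_ok in results:
        bucket = totals[distance]
        bucket[0] += 1
        bucket[1] += int(base_ok)
        bucket[2] += int(unlearned_ok)
    return {d: (b - u) / n for d, (n, b, u) in sorted(totals.items())}

# Made-up toy records (not data from the paper): the drop shrinks with distance.
toy = [
    (0, True, False), (0, True, False), (0, True, True), (0, False, False),
    (1, True, False), (1, True, True), (1, True, True), (1, False, False),
    (5, True, True), (5, True, True), (5, False, False), (5, True, True),
]
curve = delta_curve(toy)
for distance, drop in curve.items():
    print(f"distance {distance}: accuracy drop {drop:.2f}")
```

A curve that is largest at distance 0 and decays toward zero is the pattern the abstract describes; comparing such curves across base models is one way to check whether the profile follows the unlearning method rather than the model.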

2605.29676 2026-06-18 cs.AI cs.CL 版本更新

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

符号至关重要:智能体AI系统中令牌优化格式的基准研究

Lorenz Kutschka, Bernhard Geiger

发表机构 * Know Center Research GmbH(知中心研究有限公司) Graz University of Technology(格拉茨技术大学) Graz Center for Machine Learning(格拉茨机器学习中心)

AI总结 本研究在四个智能体基准上评估了两种令牌优化格式TOON和TRON,发现TRON在保持准确率的同时最多减少27%的令牌,而TOON虽减少18%但存在多轮解析失败和并行工具调用输出崩溃的问题。

Comments 16 pages, 6 figures, 4 tables

详情
AI中文摘要

智能体AI系统中的大型语言模型消耗工具模式和执行结果,并发出结构化数据的工具调用。这种交换的默认语言JSON是为应用间交换而非令牌效率设计的,因此其结构元素带来大量令牌开销。最近的工作提出了令牌优化替代方案,如TOON(令牌导向对象表示法)和TRON(令牌减少对象表示法)作为更紧凑的替代,但这些格式仅在孤立的理解或生成任务上进行了评估。它们在端到端智能体循环中是否保持令牌减少仍是一个开放问题。我们在四个智能体基准(BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench)和五个开放权重LLM上评估了TOON和TRON,将输入压缩与输出压缩解耦,以独立测量理解和生成。TRON最多减少27%的令牌,准确率在JSON基线的14个百分点内。TOON实现了最多18%的减少,准确率成本类似为9个百分点,但在多轮解析失败上额外级联,并且对于大多数模型导致并行工具调用输出崩溃。

英文摘要

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
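To make the overhead claim concrete, the following minimal sketch serializes the same records as JSON and as a header-plus-rows layout, then compares sizes. The compact layout (`to_compact`) is a hypothetical stand-in written for this example, not the actual TOON or TRON syntax, and character count is used only as a rough proxy for token count, which depends on the tokenizer.

```python
import json

def to_compact(records):
    """Hypothetical header-plus-rows encoding: keys are written once,
    followed by one comma-separated row per record."""
    keys = list(records[0].keys())
    header = ",".join(keys)
    rows = [",".join(str(record[k]) for k in keys) for record in records]
    return "\n".join([header] + rows)

# Illustrative tool-result payload (made up for this example).
records = [
    {"id": 1, "name": "search", "cost": 0.2},
    {"id": 2, "name": "fetch", "cost": 0.5},
    {"id": 3, "name": "summarize", "cost": 1.0},
]

as_json = json.dumps(records)
as_compact = to_compact(records)
saving = 1 - len(as_compact) / len(as_json)
print(f"JSON: {len(as_json)} chars, compact: {len(as_compact)} chars, saving: {saving:.0%}")
```

The saving comes from dropping repeated keys, quotes, and braces. That is also where the risk lies: a model emitting such a format has fewer structural cues, which is consistent with the parsing failures the abstract reports for multi-turn use.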

2606.17453 2026-06-18 cs.AI 版本更新

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench: 通过行为隐含决策因素基准测试满意度感知的地图智能体

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 提出MapSatisfyBench基准,通过恢复用户行为链中的隐含决策因素来评估地图智能体的满意度感知能力,实验表明现有智能体在显式任务完成上表现良好,但在满足隐含需求方面仍有局限。

详情
AI中文摘要

大型语言模型智能体越来越多地集成到地图服务中。由于地图服务嵌入在日常场景而非专业任务设置中,用户通常非正式地表达需求,导致查询不明确,包含许多未言明的需求,即对用户满意度至关重要的隐含决策因素。虽然澄清是缓解这一问题的有效方法,但它增加了日常交互中的用户负担,而一个能干的智能体应首先从可用信息源主动恢复这些因素。然而,评估这一能力具有挑战性。第一个挑战是确定哪些隐含决策因素适合评估。一个因素只有在影响用户接受度且能从智能体响应前可获取的信息中恢复时才是可评估的。其次,用户满意度不能可靠地由单个参考答案表示,需要一个将满意度相关因素转化为客观可量化评估目标的基准。为应对这些挑战,我们提出一个恢复-识别-过滤框架,从行为链证据中重建完整的用户需求,识别隐含决策因素,并仅保留那些有查询前证据支持的因素。基于此方法,我们从大规模真实世界匿名用户数据构建MapSatisfyBench,并从五个维度标注真实值,实现对满意度感知地图智能体的全链条评估。实验表明,当前智能体在显式任务完成上普遍表现良好,但在满足隐含决策因素和主动获取满意度感知决策所需证据方面仍然有限。这些发现使MapSatisfyBench成为将地图智能体评估从任务完成转向满意度感知空间决策的基准。

英文摘要

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY 版本更新

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛:前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

发表机构 * Compassion Aligned Machine Learning(同情对齐机器学习) Sentient Futures(感知未来) Harvard Kennedy School(哈佛肯尼迪学院) Appalachian State University Department of Management(阿巴拉契亚州立大学管理系)

AI总结 提出首个代理基准TAC,测试AI代理在为用户执行旅行预订等操作时是否避免涉及动物剥削的选项。评估七个前沿模型,所有模型得分低于随机水平64%,最佳模型仅53%。

详情
AI中文摘要

AI代理正从顾问转变为行动者,代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应,但未检验这些响应中的福利推理是否迁移到代理部署中(模型必须使用工具采取行动)。我们引入TAC(旅行代理同情心),这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景,涵盖六类动物剥削,并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%,最佳表现者(Claude Opus 4.7)为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升,在GPT-5.2中提升26个百分点,在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计(使用Gemini 2.5 Flash Lite作为评判者,对前两名模型的288个基础条件转录进行审计)未标记任何评估意识转录,表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

英文摘要

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.18192 2026-06-18 cs.AI 版本更新

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

斯坦福EDGAR文件数据集:将美国公司及财务披露重建为布局忠实且令牌高效的预训练数据

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Nanjing University(南京大学) Stanford University(斯坦福大学)

AI总结 为解决长上下文文档稀缺问题,提出SEFD数据集,将SEC文件重建为布局忠实的MultiMarkdown格式,用于金融语言建模与评估,具有令牌高效、与Common Crawl重叠率低于0.1%的特点。

Comments Preprint. Includes appendix, tables, and figures

详情
AI中文摘要

随着高质量公共网络语料库日益枯竭,干净的长上下文文档已成为大型语言模型(LLM)训练数据中稀缺且昂贵的来源。现有的长上下文语料库通常是专有的且获取成本高昂、合成生成的,或集中在编程等狭窄领域。我们介绍了斯坦福EDGAR文件数据集(SEFD),这是将SEC文件重建为布局忠实的MultiMarkdown格式的开放数据集,用于金融语言建模和评估。SEFD使经过审计的财务报表、风险披露、所有权报告、会计说明和影响市场的事件文件能够用作长上下文预训练数据,并作为金融推理、预测、合规和文档理解的基础。生成的语料库令牌高效、可直接用于模型,并且与Common Crawl衍生的语料库重叠率低于0.1%。我们发布了SEFD-v1,一个152B令牌的初始公共快照,并提供了更大的1850万文件档案(估计为550B令牌)的语料库级分析。我们进一步引入了两个基于SEFD的基准:EDGAR-Forecast,用于评估模型知识截止后基于文件的数值预测;以及EDGAR-OCR,用于评估复杂金融表格的转录。

英文摘要

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2303.18031 2026-06-18 cs.CV cs.AI cs.LG 版本更新

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

简单域泛化方法是开放域泛化的强基线

Masashi Noguchi, Shinichi Shirakawa

发表机构 * Graduate School of Environment and Information Sciences(环境与信息科学研究生院) Yokohama National University(横滨国立大学) Faculty of Environment(环境学系)

AI总结 本文评估现有域泛化方法在开放域泛化中的表现,发现简单方法CORAL和MMD与复杂方法DAML竞争力相当,并通过集成学习和Dirichlet混合数据增强简单扩展后性能接近DAML且计算成本更低。

Comments Accepted at IJCNN 2024. The code used in the experiments is available at https://github.com/shiralab/OpenDG-Eval

详情
AI中文摘要

在现实应用中,机器学习模型需要处理开放集识别(OSR),即在推理过程中出现未知类别,同时还要处理域偏移,即训练和推理阶段数据分布不同。域泛化(DG)旨在处理推理阶段目标域在模型训练期间不可访问的域偏移情况。开放域泛化(ODG)同时考虑DG和OSR。域增强元学习(DAML)是一种针对ODG的方法,但其学习过程复杂。相比之下,尽管已提出多种DG方法,但它们尚未在ODG场景下进行评估。在本研究中,我们全面评估了现有DG方法在ODG中的表现,并表明两种简单的DG方法——相关对齐(CORAL)和最大均值差异(MMD)——在多种情况下与DAML具有竞争力。此外,我们通过引入DAML中使用的技术(如集成学习和Dirichlet混合数据增强)提出了CORAL和MMD的简单扩展。实验评估表明,扩展后的CORAL和MMD可以以较低的计算成本达到与DAML相当的性能。这表明简单的DG方法及其简单扩展是ODG的强基线。

英文摘要

In real-world applications, a machine learning model is required to handle an open-set recognition (OSR), where unknown classes appear during the inference, in addition to a domain shift, where the data distribution differs between the training and inference phases. Domain generalization (DG) aims to handle the domain shift situation where the target domain of the inference phase is inaccessible during the model training. Open domain generalization (ODG) considers DG and OSR. Domain-augmented meta-learning (DAML) is a method targeting ODG; however, it has a complicated learning process. By contrast, although various DG methods have been proposed, they have not been evaluated in ODG situations. In this study, we comprehensively evaluate the existing DG methods in ODG and show that the two simple DG methods, CORrelation ALignment (CORAL) and maximum mean discrepancy (MMD), are competitive with DAML in several cases. In addition, we propose simple extensions of CORAL and MMD by introducing the techniques used in DAML, such as ensemble learning and Dirichlet mixup data augmentation. The experimental evaluation demonstrates that the extended CORAL and MMD can perform comparably to DAML with lower computational costs. This suggests that the simple DG methods and their simple extensions are strong baselines for ODG.
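For readers unfamiliar with the two simple baselines named here, the sketch below computes the quantities they penalize between two batches of features: the CORAL distance (squared Frobenius distance between feature covariances, scaled by 1/(4d²) as it is commonly defined) and a linear-kernel squared MMD (squared distance between feature means). This is a dependency-free illustration under stated assumptions; the kernels, loss weights, and batching used in the paper's experiments are not given in the abstract, and the function names are chosen for this example.

```python
def mean_vector(X):
    """Per-feature mean of a list of equal-length feature rows."""
    n, d = len(X), len(X[0])
    return [sum(row[j] for row in X) / n for j in range(d)]

def covariance(X):
    """Unbiased d x d feature covariance matrix of a list of feature rows."""
    n, d = len(X), len(X[0])
    mu = mean_vector(X)
    return [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in X) / (n - 1)
             for j in range(d)] for i in range(d)]

def coral_distance(Xs, Xt):
    """Squared Frobenius distance between the two covariances, scaled by 1/(4 d^2)."""
    d = len(Xs[0])
    Cs, Ct = covariance(Xs), covariance(Xt)
    squared = sum((Cs[i][j] - Ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return squared / (4 * d * d)

def linear_mmd2(Xs, Xt):
    """Squared MMD with a linear kernel: squared distance between feature means."""
    ms, mt = mean_vector(Xs), mean_vector(Xt)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))

# Toy domains (made up): the second is the first shifted by +1 in feature 0,
# so the means differ while the covariances are identical.
Xs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
Xt = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
print("CORAL distance:", coral_distance(Xs, Xt))
print("linear MMD^2:", linear_mmd2(Xs, Xt))
```

The toy data shows why the two penalties are complementary: a pure mean shift is invisible to CORAL, which matches second-order statistics, but is picked up by the mean-based MMD term.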

2505.21954 2026-06-18 cs.CV cs.AI 版本更新

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

重新审视主动说话人检测:面向泛化性和鲁棒性的野外基准

Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Tuan Khai Nguyen, Soochahn Lee, Yong Jae Lee

发表机构 * University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Oregon State University(俄勒冈州立大学) University of Sydney(悉尼大学) Kookmin University(国民大学)

AI总结 提出UniTalk数据集,涵盖多语言、嘈杂背景和拥挤场景等挑战性真实条件,评估显示现有模型在野外环境下性能不足,而UniTalk训练模型泛化性更好,为主动说话人检测建立新基准。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

我们提出了UniTalk,一个强调挑战性场景的新数据集,旨在增强主动说话人检测(ASD)任务的模型泛化性。先前建立的基准如AVA主要包含老电影,因此与现实世界视频存在显著领域差距。相比之下,UniTalk涵盖了反映挑战性真实条件的多种视频类型,包括代表性不足的语言、嘈杂背景和拥挤场景,同时在规模上与AVA相当。广泛评估表明,在现实条件下ASD仍未解决:在AVA上接近完美的先进模型在UniTalk上未能达到饱和。相反,在UniTalk上训练的模型能更好地泛化到现代野外数据集,包括Talkies和ASW。因此,UniTalk为ASD建立了新的基准,为研究人员开发和评估多功能且鲁棒的模型提供了宝贵资源。

英文摘要

We present UniTalk, a novel dataset emphasizing challenging scenarios to enhance model generalization for the task of active speaker detection (ASD). Previously established benchmarks such as AVA predominantly comprise old movies and thus exhibit significant domain gaps with real-world video. In contrast, UniTalk covers diverse video types reflecting challenging real-world conditions, including underrepresented languages, noisy backgrounds, and crowded scenes, while being on par with AVA in scale. Extensive evaluations reveal that ASD remains unsolved under realistic conditions: state-of-the-art models near-perfect on AVA fail to reach saturation on UniTalk. Conversely, models trained on UniTalk generalize better to modern in-the-wild datasets including Talkies and ASW. UniTalk thus establishes a new benchmark for ASD, providing researchers with a valuable resource for developing and evaluating versatile and resilient models.

2505.23851 2026-06-18 cs.CL cs.AI cs.SC 版本更新

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

ASyMOB:代数符号数学运算基准

Michael Shalyt, Rotem Elimelech, Ido Kaminer

发表机构 * MIT(麻省理工学院) Technion - Israel Institute of Technology(以色列理工学院)

AI总结 提出ASyMOB基准,包含35,368个符号数学问题,通过扰动测试揭示大模型在符号数学推理中的鲁棒性不足,并发现LLM与CAS的互补潜力。

Comments Published in ICML2026: https://icml.cc/virtual/2026/poster/63549 Code repository: https://github.com/RamanujanMachine/ASyMOB Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark

详情
AI中文摘要

大型语言模型(LLM)越来越多地应用于符号数学,然而现有评估常常混淆模式记忆与真正推理。为弥补这一空白,我们提出ASyMOB,一个包含35,368个经过验证的符号数学问题的高分辨率数据集,涵盖积分、极限、微分方程、级数和超几何函数。与以往基准不同,ASyMOB通过符号、数值和等价保持变换系统地扰动每个种子问题,从而实现对泛化能力的细粒度评估。我们的评估揭示了三个关键发现:(1)大多数模型的性能在微小扰动下崩溃,而顶级系统表现出明显的鲁棒性机制转变;(2)集成代码工具稳定了性能,尤其对较弱模型;(3)我们识别出计算机代数系统(CAS)失败而LLM成功的例子,以及仅通过LLM-CAS混合方法解决的问题,突显了有前景的集成前沿。ASyMOB作为一个原则性诊断工具,用于衡量和加速构建可验证、可信赖的AI以促进科学发现。

英文摘要

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.

2509.02555 2026-06-18 cs.LG cs.AI cs.NE 版本更新

Surrogate Benchmarks for Model Merging Optimization

模型合并优化的替代基准

Rio Akizuki, Yuya Kudo, Nozomu Yoshinari, Yoichi Hirose, Toshiyuki Nishimoto, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

AI总结 针对模型合并超参数优化计算成本高的问题,构建替代基准以低成本预测合并模型性能并模拟优化算法行为。

Comments AutoML 2025 Non-Archival Content Track. The code of the surrogate benchmark is available at https://github.com/shiralab/SMM-Bench

详情
AI中文摘要

模型合并技术旨在将多个模型的能力整合到一个模型中。大多数模型合并技术都有超参数,其设置会影响合并模型的性能。由于现有几项工作表明,调整模型合并中的超参数可以增强合并结果,因此为模型合并开发超参数优化算法是一个有前景的方向。然而,其优化过程计算成本高昂,特别是在合并大型语言模型时。在这项工作中,我们为合并超参数的优化开发了替代基准,以实现低成本的算法开发和性能比较。我们定义了两个搜索空间并收集数据样本,以构建替代模型来预测合并模型在给定超参数下的性能。我们证明了我们的基准能够很好地预测合并模型的性能,并模拟优化算法的行为。

英文摘要

Model merging techniques aim to integrate the abilities of multiple models into a single model. Most model merging techniques have hyperparameters, and their setting affects the performance of the merged model. Because several existing works show that tuning hyperparameters in model merging can enhance the merging outcome, developing hyperparameter optimization algorithms for model merging is a promising direction. However, its optimization process is computationally expensive, particularly in merging LLMs. In this work, we develop surrogate benchmarks for optimization of the merging hyperparameters to realize algorithm development and performance comparison at low cost. We define two search spaces and collect data samples to construct surrogate models to predict the performance of a merged model from a hyperparameter. We demonstrate that our benchmarks can predict the performance of merged models well and simulate optimization algorithm behaviors.

2601.00567 2026-06-18 cs.IR cs.AI 版本更新

Improving Scientific Document Retrieval with Academic Concept Index

利用学术概念索引改进科学文献检索

Jeyun Lee, Junhyoung Lee, Wonbin Kweon, Bowen Jin, Yu Zhang, Susik Yoon, Dongha Lee, Hwanjo Yu, Jiawei Han, Seongku Kang

发表机构 * Korea University(高丽大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Texas A&M University(德克萨斯农工大学) Yonsei University(延世大学) Pohang University of Science

AI总结 针对通用检索器在科学领域因词汇和需求不匹配而表现不佳的问题,提出基于学术概念索引的方法,通过概念覆盖查询生成和概念聚焦上下文扩展,提升查询质量和检索性能。

Comments Accepted for publication in ACM TIST, 2026

详情
AI中文摘要

将通用领域的检索器适应到科学领域具有挑战性,原因在于缺乏大规模领域特定的相关性标注,以及词汇和信息需求的显著不匹配。最近的方法通过两个独立方向利用大型语言模型(LLMs)来解决这些问题:(1)生成合成查询以进行微调,(2)生成辅助上下文以支持相关性匹配。然而,这两个方向都忽略了科学文档中嵌入的多样化学术概念,常常产生冗余或概念狭窄的查询和上下文。为了解决这一限制,我们引入了一个学术概念索引,该索引从论文中提取关键概念,并在学术分类的指导下进行组织。这个结构化索引为改进这两个方向奠定了基础。首先,我们通过基于概念覆盖的查询生成(CCQGen)来增强合成查询生成,该方法自适应地以未覆盖的概念为条件,生成具有更广泛概念覆盖的互补查询。其次,我们通过概念聚焦的辅助上下文(CCExpand)来增强上下文增强,该方法利用一组文档片段作为对概念感知的CCQGen查询的简洁响应。大量实验表明,将学术概念索引纳入查询生成和上下文增强中,可以产生更高质量的查询、更好的概念对齐以及改进的检索性能。

英文摘要

Adapting general-domain retrievers to scientific domains is challenging due to the scarcity of large-scale domain-specific relevance annotations and the substantial mismatch in vocabulary and information needs. Recent approaches address these issues through two independent directions that leverage large language models (LLMs): (1) generating synthetic queries for fine-tuning, and (2) generating auxiliary contexts to support relevance matching. However, both directions overlook the diverse academic concepts embedded within scientific documents, often producing redundant or conceptually narrow queries and contexts. To address this limitation, we introduce an academic concept index, which extracts key concepts from papers and organizes them guided by an academic taxonomy. This structured index serves as a foundation for improving both directions. First, we enhance the synthetic query generation with concept coverage-based generation (CCQGen), which adaptively conditions LLMs on uncovered concepts to generate complementary queries with broader concept coverage. Second, we strengthen the context augmentation with concept-focused auxiliary contexts (CCExpand), which leverages a set of document snippets that serve as concise responses to the concept-aware CCQGen queries. Extensive experiments show that incorporating the academic concept index into both query generation and context augmentation leads to higher-quality queries, better conceptual alignment, and improved retrieval performance.
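The coverage-driven loop described for CCQGen can be sketched as follows: each round conditions query generation on concepts that earlier queries have not yet covered. This is a simplified illustration, not the authors' implementation. `template_query` is a stand-in for an LLM call, and the sketch assumes each generated query covers exactly the concepts it was conditioned on, whereas a real system would have to estimate coverage from the generated text.

```python
def next_concepts_to_condition_on(doc_concepts, covered, k=2):
    """Return up to k document concepts that no earlier query covered, in document order."""
    return [c for c in doc_concepts if c not in covered][:k]

def generate_queries(doc_concepts, make_query, rounds=3, k=2):
    """Generate complementary queries, one per round, each targeting uncovered concepts."""
    covered, queries = set(), []
    for _ in range(rounds):
        targets = next_concepts_to_condition_on(doc_concepts, covered, k)
        if not targets:  # every concept is already covered
            break
        queries.append(make_query(targets))
        covered.update(targets)  # assumption: the query covers exactly its targets
    return queries

def template_query(concepts):
    """Hypothetical stand-in for an LLM that writes a query about the given concepts."""
    return "papers on " + " and ".join(concepts)

# Illustrative concept list for one document (made up for this example).
concepts = ["dense retrieval", "synthetic queries", "taxonomy", "fine-tuning", "query expansion"]
queries = generate_queries(concepts, template_query)
for q in queries:
    print(q)
```

Conditioning on the uncovered remainder, rather than on the full concept list each time, is what pushes later queries away from repeating earlier ones.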

2601.12805 2026-06-18 q-bio.GN cs.AI cs.CL 版本更新

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

SciHorizon-GENE:从基因知识到功能理解的生命科学推理基准测试

Xiaohan Huang, Meng Xiao, Chuan Qin, Qingqing Long, Jinmiao Chen, Yuanchun Zhou, Hengshu Zhu

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of the Chinese Academy of Sciences(中国科学院大学) DUKE-NUS Medical School, National University of Singapore(杜克-新加坡国立大学医学院) Singapore Immunology Network, Agency for Science, Technology and Research(新加坡科技研究局新加坡免疫学网络)

AI总结 针对大语言模型在基因级推理能力上的不足,构建了包含超过19万个人类基因和54万问题的基准SciHorizon-GENE,从研究关注敏感性、幻觉倾向、答案完整性和文献影响力四个生物学关键维度评估模型,揭示了模型在生成忠实、完整且基于文献的功能解释方面的持续挑战。

Comments Accepted by SIGKDD 2026. 12 pages

详情
AI中文摘要

大型语言模型(LLMs)在生物医学研究中展现出日益增长的潜力,尤其是在知识驱动的解释任务中。然而,它们从基因知识到功能理解的可靠推理能力——这是知识增强型细胞图谱解释的核心要求——仍然在很大程度上未被探索。为了填补这一空白,我们引入了SciHorizon-GENE,这是一个基于权威生物数据库构建的大规模基因中心基准。该基准整合了超过19万个人类基因的经整理知识,包含超过54万个问题,涵盖了与细胞类型注释、功能解释和机制导向分析相关的多种基因到功能推理场景。受初步检查中观察到的行为模式启发,SciHorizon-GENE从四个生物学关键角度评估LLMs:研究关注敏感性、幻觉倾向、答案完整性和文献影响力,明确针对限制LLMs在生物解释管道中安全采用的失败模式。我们系统评估了多种最先进的通用和生物医学LLMs,揭示了基因级推理能力的显著异质性,以及在生成忠实、完整且基于文献的功能解释方面的持续挑战。我们的基准为在基因尺度上分析LLM行为建立了系统基础,并为模型选择和发展提供了见解,与知识增强型生物解释直接相关。

英文摘要

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.

2603.10827 2026-06-18 cs.SD cs.AI 版本更新

Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

语音感知大语言模型的说话人验证:评估与增强

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

发表机构 * Electrical and Computer Engineering Department, Johns Hopkins University, Baltimore, MD, USA(约翰霍普金斯大学电气与计算机工程系) Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA(约翰霍普金斯大学人类语言技术卓越中心)

AI总结 提出模型无关的评分协议评估语音感知LLM的说话人区分能力(EER>20%),并通过注入冻结的ECAPA-TDNN说话人嵌入和LoRA微调,实现接近专用系统的性能(EER 1.03%)。

Comments 3 Tables, 1 Figure, Published in Interspeech 2026

详情
AI中文摘要

语音感知大语言模型(LLMs)可以接受语音输入,但其训练目标主要强调语言内容或特定领域(如情感或说话人性别),尚不清楚它们是否编码了说话人身份。首先,我们提出了一种模型无关的评分协议,该协议利用Yes/No令牌概率的置信度分数或对数似然比,为仅API模型和开放权重模型生成连续验证分数。使用该协议,我们评估了最近的语音感知LLMs,观察到较弱的说话人区分能力(在VoxCeleb1上EER高于20%)。其次,我们引入了一种轻量级增强方法,通过可学习的投影注入冻结的ECAPA-TDNN说话人嵌入,并仅训练LoRA适配器,使LLM具备自动说话人验证(ASV)能力。在TinyLLaMA-1.1B上,得到的ECAPA-LLM在VoxCeleb1-E上实现了1.03%的EER,接近专用说话人验证系统,同时保留了自然语言接口。

英文摘要

Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
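The scoring idea in this abstract can be illustrated with a small sketch: turn the model's Yes/No token probabilities for a "same speaker?" prompt into a continuous log-likelihood-ratio score, then compute the equal error rate (EER) over target and non-target trials. The prompt wording, any score normalization, and the exact protocol are not given in the abstract, so the function names and the toy probabilities below are assumptions for illustration only.

```python
import math

def llr_score(p_yes, p_no, eps=1e-12):
    """Log-likelihood ratio from the probabilities of the 'Yes' and 'No' tokens."""
    return math.log(p_yes + eps) - math.log(p_no + eps)

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep thresholds over the observed scores and return the error rate at the
    point where false-accept and false-reject rates are closest (their average)."""
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Made-up (p_yes, p_no) pairs for same-speaker and different-speaker trials.
target_trials = [(0.9, 0.1), (0.8, 0.2), (0.4, 0.6), (0.7, 0.3)]
nontarget_trials = [(0.2, 0.8), (0.6, 0.4), (0.1, 0.9), (0.3, 0.7)]

target_scores = [llr_score(y, n) for y, n in target_trials]
nontarget_scores = [llr_score(y, n) for y, n in nontarget_trials]
eer = equal_error_rate(target_scores, nontarget_scores)
print(f"EER on toy trials: {eer:.2%}")
```

Using a continuous score rather than the model's hard Yes/No answer is what allows a threshold sweep, and therefore an EER, to be computed at all. For API-only models that do not expose token probabilities, the abstract mentions confidence scores as the alternative.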

2604.06367 2026-06-18 cs.CR cs.AI cs.LG 版本更新

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

WebSP-Eval:在网站安全与隐私任务上评估网络代理

Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出WebSP-Eval框架,通过200个任务实例和自动化评估器,测试多模态大模型在网站安全与隐私任务上的表现,发现状态UI元素(如开关)导致超过45%的任务失败。

Comments Accepted at PETS 2026. Project Page: https://wiscprivacy.com/webspeval/

详情
AI中文摘要

网络代理自动化浏览器任务,从简单的表单填写到复杂的工作流程(如订购杂货)。虽然当前的基准测试评估通用性能(如WebArena)或针对恶意行为的安全性(如SafeArena),但没有现有框架评估代理成功执行面向用户的网站安全和隐私任务的能力,例如管理cookie偏好、配置隐私敏感账户设置或撤销非活动会话。为填补这一空白,我们引入了WebSP-Eval,一个用于衡量网络代理在网站安全和隐私任务上性能的评估框架。WebSP-Eval包括:1)一个手动制作的任务数据集,涵盖28个网站的200个任务实例;2)一个强大的代理系统,支持使用自定义Google Chrome扩展在多次运行中进行账户和初始状态管理;以及3)一个自动化评估器。我们使用最先进的多模态大语言模型评估了总共8个网络代理实例,对网站、任务类别和UI元素进行了细粒度分析。我们的评估显示,当前模型在可靠解决网站安全和隐私任务方面自主探索能力有限,并且在特定任务类别和网站上表现困难。关键的是,我们发现状态UI元素是代理失败的主要原因,其中开关导致许多模型超过45%的任务失败。

英文摘要

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g., SafeArena), no existing framework assesses an agent's ability to successfully execute user-facing website security and privacy tasks, such as managing cookie preferences, configuring privacy-sensitive account settings, or revoking inactive sessions. To address this gap, we introduce WebSP-Eval, an evaluation framework for measuring web agent performance on website security and privacy tasks. WebSP-Eval comprises 1) a manually crafted task dataset of 200 task instances across 28 websites; 2) a robust agentic system supporting account and initial state management across runs using a custom Google Chrome extension; and 3) an automated evaluator. We evaluate a total of 8 web agent instantiations using state-of-the-art multimodal large language models, conducting a fine-grained analysis across websites, task categories, and UI elements. Our evaluation reveals that current models have limited autonomous exploration capabilities, which prevents them from reliably solving website security and privacy tasks, and that they struggle with specific task categories and websites. Crucially, we identify stateful UI elements as a primary reason for agent failure, with toggles causing more than 45% task failure across many models.

2604.13899 2026-06-18 cs.CL cs.AI 版本更新

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

我们是否仍然需要人在回路中?比较主动学习中用于敌意检测的人类与LLM标注

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

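AI summary Compares LLM and human annotation in active learning and finds that LLM annotation is cheaper and yields better classifiers, while active learning offers no advantage under LLM annotation.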

详情
AI中文摘要

指令微调的LLM可以低成本标注数千个实例。这为主动学习(AL)提出了两个问题:LLM标签能否替代AL回路中的人类标签?当整个语料库可以廉价标注时,AL是否仍然必要?我们在一个新的包含277,902条德国政治TikTok评论(25,974条LLM标注,5,000条人工标注)的数据集上进行了研究,比较了LLM和人类标注在七种条件、四种编码器和10个随机种子下的表现。在模仿人类标注任务的双问题界面下,大规模LLM标注的性能优于人类监督分类器,成本约为其十分之一(GPT-5.2 Batch API为28美元,Prolific为316美元)。这一优势对于闭源(GPT-5.2)和开源(Qwen3.5-122B-10B)LLM均成立,在软标签评估下具有鲁棒性,并且是通过双问题分解实现的;整体单提示基线仅与人类监督持平。在任一LLM标注器下,主动学习相比随机采样没有可靠优势。然而,错误结构差异显著:只有GPT-5.2在双问题界面下产生的分类器具有接近人类的FP/FN平衡,而其他LLM变体过度标记了边境管制和经济竞争话语。我们发布了数据集和代码。

英文摘要

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost ($28 for GPT-5.2 Batch API vs. $316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.
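The "roughly one-tenth the cost" figure can be checked from the two dollar amounts quoted in the abstract. The snippet below is only that arithmetic; the variable names are illustrative and nothing here comes from the paper's code.

```python
# Costs quoted in the abstract (USD).
llm_cost = 28.0      # GPT-5.2 Batch API
human_cost = 316.0   # Prolific

ratio = llm_cost / human_cost           # share of the human cost
savings_factor = human_cost / llm_cost  # how many times cheaper

print(f"LLM cost is {ratio:.1%} of the human cost ({savings_factor:.1f}x cheaper)")
# prints: LLM cost is 8.9% of the human cost (11.3x cheaper)
```

So "one-tenth" is a slight understatement of the saving: the LLM annotation costs a little under 9% of the human annotation.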

2604.28076 2026-06-18 cs.CL cs.AI cs.LG 版本更新

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

TopBench:表格问答中隐式预测推理的基准

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University, China(人工智能学院,南京大学,中国) National Key Laboratory for Novel Software Technology, Nanjing University, China(新型软件技术国家重点实验室,南京大学,中国)

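AI summary Introduces TopBench, a benchmark of 779 samples across four sub-tasks that evaluates whether large language models can recognize implicit predictive intent in tabular question answering and reason reliably; current models struggle with intent recognition.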

详情
AI中文摘要

大型语言模型(LLM)推动了表格问答的发展,其中大多数查询可以通过提取信息或简单聚合来回答。然而,一类常见的现实世界查询是隐式预测性的,需要从历史模式中推断未观察到的答案,而不仅仅是检索。这些查询带来了两个挑战:识别潜在意图和对大规模表格进行可靠的预测推理。为了评估LLM在带有隐式预测任务的表格问答中的表现,我们引入了TopBench,一个包含779个样本的基准,涵盖四个子任务,从单点预测到决策制定、处理效应分析和复杂过滤,要求模型生成涵盖推理文本和结构化表格的输出。我们在基于文本和代理工作流下评估了多种模型。实验表明,当前模型通常在意图识别上存在困难,默认进行查找。更深入的分析发现,准确的意图消歧是引导这些预测行为的前提。此外,提升预测精度的上限需要整合更复杂的建模或推理能力。

英文摘要

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and performing reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to simple lookups. Deeper analysis shows that accurate intent disambiguation is a prerequisite for eliciting these predictive behaviors. Furthermore, raising the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

2605.17986 2026-06-18 cs.CR cs.AI 版本更新

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

LivePI:更真实的智能体对抗间接提示注入基准测试

Lei Zhao, Abhay Bhaskar, Edgar Dobriban

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

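AI summary Introduces the LivePI benchmark, covering 7 input surfaces, 12 attack/rendering families, and 5 malicious goals; evaluates several AI agents on a real virtual machine, finds attack success rates of 10.7% to 29.6%, and validates a two-layer defense.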

详情
AI中文摘要

诸如OpenClaw之类的AI智能体越来越多地部署在本地工作流中,并能够访问外部工具。这带来了间接提示注入(IPI)风险:智能体可能会执行嵌入在不可信输入(如电子邮件、下载文件、网页、仓库或群聊消息)中的有害指令。现有的评估通常规模较小、完全模拟或仅关注狭窄的通道。我们引入了LivePI(实时提示注入),这是一个在生产类似但测试可控环境中的IPI风险结构化基准。LivePI覆盖了七个输入表面、十二个攻击/渲染家族和五个恶意目标,包括受保护信息窃取、未经授权的安全控制更改、不安全的代码检索或执行、收件箱摘要窃取以及加密货币转账。我们在一个真实的虚拟机上运行LivePI,该虚拟机具有实时但测试可控的电子邮件、聊天、网页、本地文件、仓库和钱包接口。在GPT-5.3-Codex、Claude Opus 4.6、Gemini 3.1 Pro、Kimi K2.5和GLM-5上,总攻击成功率范围为10.7%至29.6%。群聊注入在我们部署中评估的所有骨干模型上均成功,而仓库链接攻击尽管分母较小,仍产生了高严重性失败。我们还评估了一种由提示级过滤和执行前工具调用授权组成的两层防御。在GPT-5.3-Codex设置中,该防御在LivePI中拦截了所有测试的恶意目标完成,同时保留了PinchBench衍生工作负载上的良性效用。

英文摘要

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injection (IPI) risk: an agent may execute harmful instructions embedded in untrusted inputs such as email, downloaded files, webpages, repositories, or group-chat messages. Existing evaluations are often small, purely simulated, or focused on a narrow set of channels. We introduce LivePI (Live Prompt Injection), a structured benchmark for IPI risk in a production-like but test-controlled environment. LivePI covers seven input surfaces, twelve attack/rendering families, and five malicious goals, including protected-information exfiltration, unauthorized security-control changes, unsafe code retrieval or execution, inbox-summary exfiltration, and cryptocurrency transfer. We run LivePI on a real virtual machine with live but test-controlled email, chat, web, local-file, repository, and wallet interfaces. Across GPT-5.3-Codex, Claude Opus 4.6, Gemini 3.1 Pro, Kimi K2.5, and GLM-5, total attack success rates range from 10.7% to 29.6%. Group-chat injection is uniformly successful across the evaluated backbones in our deployment, and repository-link attacks produce high-severity failures despite a small denominator. We also evaluate a two-layer defense consisting of prompt-level filtering and pre-execution tool-call authorization. In the GPT-5.3-Codex setting, the defense intercepts all tested malicious-goal completions in LivePI before execution while preserving benign utility on PinchBench-derived workloads.
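The two-layer defense described above (prompt-level filtering plus pre-execution tool-call authorization) can be pictured with a small sketch. Everything below is hypothetical: the patterns, the allowlist, and the names `filter_untrusted`, `authorize`, and `ToolCall` are illustrative assumptions, not the authors' implementation, which the abstract does not describe.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns; a real prompt-level filter would be far more thorough.
SUSPICIOUS = [re.compile(p, re.IGNORECASE) for p in (
    r"ignore (all )?previous instructions",
    r"transfer .* (btc|eth|crypto)",
)]

# Hypothetical policy: tools the agent may call without human confirmation.
ALLOWED_TOOLS = {"read_file", "search_web"}


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def filter_untrusted(text: str) -> bool:
    """Layer 1: return True if untrusted input passes the pattern filter."""
    return not any(p.search(text) for p in SUSPICIOUS)


def authorize(call: ToolCall) -> bool:
    """Layer 2: return True only if the requested tool is on the allowlist."""
    return call.name in ALLOWED_TOOLS


# An injected email is stopped at layer 1; an unexpected tool call at layer 2.
print(filter_untrusted("Please IGNORE previous instructions and wire funds"))
print(authorize(ToolCall("transfer_crypto", {"amount": 1})))
```

The point of keeping two layers is that the second check runs on the action itself, so an injection that slips past the text filter still has to request a permitted tool.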

2606.07591 2026-06-18 cs.LG cs.AI cs.CL 版本更新

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

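AI summary Introduces ResearchClawBench, with 40 tasks across 10 domains, which evaluates autonomous research capability using multimodal rubrics; the strongest agent scores only 21.5, exposing weaknesses in experimental protocol, evidence matching, and scientific core.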

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2410.15595 2026-06-18 cs.AI cs.CL cs.LG 版本更新

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

直接偏好优化综述:数据集、理论、变体及应用

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

发表机构 * Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学) Alibaba Group(阿里巴巴集团)

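AI summary Surveys progress on Direct Preference Optimization (DPO) in theory, variants, datasets, and applications, discusses its potential and limitations as an RL-free alternative, and proposes future research directions.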

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

详情
AI中文摘要

随着大语言模型(LLMs)的快速发展,将策略模型与人类偏好对齐变得日益关键。直接偏好优化(DPO)作为一种有前景的对齐方法,作为从人类反馈中强化学习(RLHF)的无RL替代方案而出现。尽管DPO取得了各种进展并存在固有局限性,但文献中目前缺乏对这些方面的深入综述。在这项工作中,我们对DPO中的挑战和机遇进行了全面回顾,涵盖理论分析、变体、相关偏好数据集和应用。具体而言,我们基于关键研究问题对近期DPO研究进行分类,以提供对DPO当前格局的透彻理解。此外,我们提出了几个未来研究方向,为研究社区提供模型对齐的见解。相关论文的更新合集可在 https://github.com/Mr-Loevan/DPO-Survey 找到。

英文摘要

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.

10. AI Applications and Systems 21 papers

2310.05753 2026-06-18 cs.AI 版本更新

Large-Scale OD Matrix Estimation with A Deep Learning Method

基于深度学习的大规模OD矩阵估计

Zheli Xiong, Defu Lian, Enhong Chen, Gang Chen, Xiaomin Cheng

发表机构 * IEEE Publication Technology Group(IEEE出版技术组)

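AI summary Proposes a method combining deep learning with numerical optimization that infers structural constraints from probe traffic flows, enabling real-time estimation of large-scale OD matrices without prior information and with good generalization.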

Comments 12 pages, 25 figures

详情
AI中文摘要

起点-终点(OD)矩阵估计是智能交通系统(ITS)的关键方面。它涉及通过回归当前观测值(如路段交通计数,例如使用最小二乘法)来调整初始OD矩阵。然而,OD估计问题缺乏足够的约束,在数学上是欠定的。为缓解此问题,一些研究者将先验OD矩阵作为回归目标以提供更多结构约束,但该方法高度依赖于可能过时的先验矩阵。另一些研究者通过传感器数据(如车辆轨迹和速度)添加结构约束,这些数据能实时反映更当前的结构约束。我们提出的方法将深度学习与数值优化算法相结合,以推断矩阵结构并指导数值优化。该方法结合了深度学习与数值优化算法的优势。神经网络(NN)学习从探针交通流中推断结构约束,消除了对先验信息的依赖,并提供了实时性能。此外,由于NN的泛化能力,该方法在工程上经济高效。我们进行了测试,证明了该方法在大规模合成数据集上的良好泛化性能。随后,我们在真实交通数据上验证了方法的稳定性。实验证实了结合NN与数值优化的优势。

英文摘要

The estimation of origin-destination (OD) matrices is a crucial aspect of Intelligent Transport Systems (ITS). It involves adjusting an initial OD matrix by regression on current observations, such as traffic counts on road sections (e.g., using least squares). However, the OD estimation problem lacks sufficient constraints and is mathematically underdetermined. To alleviate this problem, some researchers incorporate a prior OD matrix as a target in the regression to provide more structural constraints. However, this approach is highly dependent on the existing prior matrix, which may be outdated. Others add structural constraints through sensor data, such as vehicle trajectory and speed, which reflect more current structural constraints in real time. Our proposed method integrates deep learning and numerical optimization algorithms to infer matrix structure and guide numerical optimization, combining the advantages of both. The neural network (NN) learns to infer structural constraints from probe traffic flows, eliminating dependence on prior information and providing real-time performance. Additionally, due to the generalization capability of the NN, this method is economical in engineering. We conducted tests to demonstrate the good generalization performance of our method on a large-scale synthetic dataset. Subsequently, we verified the stability of our method on real traffic data. Our experiments confirm the benefits of combining NN and numerical optimization.
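To see why the problem is underdetermined and why a prior matrix matters, consider a toy case with two OD pairs whose trips all cross one counted road section. The sketch below is an illustration of the prior-based least-squares idea mentioned in the abstract, not the paper's neural method; the function name and the numbers are assumptions made for the example.

```python
def project_onto_count(prior, assignment, count):
    """Closest OD vector to `prior` (in least squares) that matches one link count.

    assignment[i] is the share of OD pair i's flow that uses the counted link.
    """
    predicted = sum(a * x for a, x in zip(assignment, prior))
    norm_sq = sum(a * a for a in assignment)
    step = (count - predicted) / norm_sq
    return [x + step * a for a, x in zip(assignment, prior)]


# One equation, two unknowns: x1 + x2 = 100 has infinitely many solutions.
assignment = [1.0, 1.0]
count = 100.0

print(project_onto_count([60.0, 20.0], assignment, count))  # [70.0, 30.0]
print(project_onto_count([10.0, 50.0], assignment, count))  # [30.0, 70.0]
```

Both answers reproduce the observed count exactly, yet they differ because the priors differ, which is the dependence on a possibly outdated prior that the abstract points out.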

2604.25848 2026-06-18 cs.AI 版本更新

A Distributionally Robust Reinforcement Learning Framework for Constrained Urban EV Dispatch

面向约束城市电动汽车调度的分布鲁棒强化学习框架

An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui

发表机构 * College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam(VinUniversity 工程与计算机科学学院,河内,越南) Center for Environmental Intelligence, VinUniversity, Hanoi, Vietnam(VinUniversity 环境智能中心,河内,越南)

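AI summary For urban EV dispatch under charger and feeder capacity constraints and uncertain demand, proposes a semi-Markov decision process solved with a distributionally robust Soft Actor-Critic algorithm, using a graph convolutional encoder and a rolling mixed-integer linear program to guarantee feasibility; simulations on NYC taxi data achieve the highest net profit with zero violations.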

详情
AI中文摘要

我们研究城市规模的电动汽车(EV)网约车车队控制,其中调度、重新定位和充电决策必须在不确定且空间相关的出行需求和旅行时间下,遵守充电器和馈线限制。我们将问题建模为六边形网格半马尔可夫决策过程(semi-MDP),具有混合动作——用于服务、重新定位和充电的离散动作,以及连续充电功率——和可变动作持续时间。为了保证训练和部署期间的物理可行性,策略在由掩码温度退火actor产生的高层意图上学习。这些意图在每个决策步骤通过一个时间受限的滚动混合整数线性规划(MILP)进行投影,该规划严格强制执行荷电状态、充电端口和馈线约束。为了缓解分布偏移,我们针对一个Wasserstein-1模糊集优化软演员-评论家(SAC)智能体,该模糊集使用图对齐的马氏基础度量来捕捉空间相关性。鲁棒备份使用Kantorovich-Rubinstein对偶、投影次梯度内环和原始-对偶风险预算更新。我们的架构结合了两层图卷积网络(GCN)编码器、双评论家和一个驱动对手的价值网络。基于纽约出租车数据构建的大规模电动汽车车队模拟器上的实验表明,PD-RSAC实现了最高的净利润,达到122万美元,而强启发式、单智能体RL和多智能体RL基线(包括Greedy、SAC、MAPPO和MADDPG)的净利润为58万至70万美元,同时保持零馈线限制违规。

英文摘要

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions (discrete actions for serving, repositioning, and charging, together with continuous charging power) and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M-$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
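A Wasserstein-1 ambiguity set contains every distribution within a given transport distance of a nominal one. As a rough intuition aid, the sketch below computes that distance in the simplest setting: two equal-size samples on the real line with the absolute-difference ground metric. The paper uses a graph-aligned Mahalanobis ground metric and a dual formulation, neither of which is shown here; the sample values and the radius are made up.

```python
def wasserstein1_1d(xs, ys):
    """W1 distance between two equal-size empirical distributions on the real line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal transport plan matches sorted samples in order.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical hourly trip demand in one zone: nominal vs. shifted scenario.
nominal = [4.0, 6.0, 5.0, 7.0]
shifted = [5.0, 8.0, 6.0, 7.0]

distance = wasserstein1_1d(nominal, shifted)
radius = 1.5  # hypothetical ambiguity-set radius
print(distance, distance <= radius)  # 1.0 True
```

A robust agent is trained against the worst distribution inside the ball, so a larger radius buys protection against bigger demand shifts at the price of more conservative behavior.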

2605.03460 2026-06-18 cs.AI cs.LG 版本更新

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

FinSTaR:面向时间序列推理模型的金融推理

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

发表机构 * LG AI Research(LG人工智能研究)

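AI summary To address the failure of time series reasoning models in finance, proposes FinSTaR, built on a 2x2 capability taxonomy, which uses Compute-in-CoT and Scenario-Aware CoT strategies to reach 78.9% average accuracy on FinTSR-Bench.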

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

详情
AI中文摘要

时间序列推理模型在通用领域表现出色,但在具有独特特征的金融领域却持续失败。我们提出一个通用的2x2能力分类法,通过交叉1)单实体与多实体分析,以及2)当前状态评估与未来行为预测来划分TSRM能力。我们在金融领域实例化该分类法——其中确定性评估与随机性预测的区分尤为关键——形成十个金融推理任务,并基于标普股票构建FinTSR-Bench基准。为此,我们提出FinSTaR(金融时间序列思考与推理),在FinTSR-Bench上训练,并针对每个类别采用不同的思维链策略。对于评估(确定性,即可从可观测数据计算得出),我们采用Compute-in-CoT,一种程序化思维链,使模型能够直接从原始价格推导答案。对于预测(本质上是随机的,即受不可观测因素影响),我们采用场景感知思维链,在做出判断前生成多种场景,模拟金融分析师在不确定性下的推理方式。所提方法在FinTSR-Bench上达到78.9%的平均准确率,显著优于LLM和TSRM基线。此外,我们展示了四个能力类别通过联合训练具有互补性和相互增强性,并且场景感知思维链相比标准思维链持续提升预测准确率。代码已公开:https://github.com/seunghan96/FinSTaR。

英文摘要

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

2606.08532 2026-06-18 cs.AI 版本更新

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

DN-Hypo-Pipeline:一种基于大语言模型和科学解释的AI驱动假设生成工作流

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

发表机构 * Computer Network Information Center, Chinese Academy of Sciences, China(中国科学院计算机网络信息中心)

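AI summary Proposes DN-Hypo-Pipeline, which uses large language models with scientific explanations as prior knowledge to derive new hypotheses from existing literature; in data science modeling, statistical inference and expert evaluation show it outperforms direct generation, and the performance of algorithms derived from the generated hypotheses is validated.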

详情
AI中文摘要

科学假设是研究的第一步并经过实验验证,但它也反映了对科学现象的深刻理解和推理。我们引入了DN-Hypo-Pipeline,一种基于大语言模型的AI驱动工作流,旨在通过利用科学解释作为先验知识来支持结构化科学思维和假设生成。该流水线帮助研究人员从现有文献中推导出新假设。给定研究论文的解释项(即结论),它识别潜在的定律、理论和原理,并为观察到的现象重构一个新的、尚未验证的解释。我们在数据科学建模领域使用三篇高被引论文评估了DN-Hypo-Pipeline。由LLM作为评判者和人类专家评估支持的统计推断表明,我们的流水线比直接生成方法更有效。此外,我们通过开发相应新颖算法验证了得分最高的两个生成假设,这些算法优于原始论文中提出的基线模型。除了在数据科学中的应用,DN-Hypo-Pipeline还提供了一个理论框架,不仅包含了理论指导的数据科学建模方法,还揭示了建模过程更基础的结构。此外,这种方法本质上是理论指导建模的推广,具有扩展到其他领域和更广泛科学学科的潜力。

英文摘要

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

2606.10376 2026-06-18 cs.AI cs.IT math.IT 版本更新

Belief-Space Control for Personalized Cancer Treatment via Active Inference

基于主动推理的个性化癌症治疗信念空间控制

Deniz Sargun, H. Bugra Tulay, C. Emre Koksal

发表机构 * American Association for Cancer Research(美国癌症研究协会) AACR Project GENIE registry(AACR Project GENIE 注册中心) AACR Project GENIE Biopharma Collaborative(AACR Project GENIE 生物制药合作组织)

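AI summary Uses active inference to model cancer treatment as a belief-space planning problem, unifying goal-directed control and information acquisition under a measurement budget to achieve patient categorization and effective treatment.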

Comments 11 pages including appendix

详情
AI中文摘要

癌症治疗本质上是一个具有部分可观测性、潜在患者异质性以及医疗测量预算明确约束的序贯决策问题。与标准强化学习(RL)方法控制状态轨迹不同,癌症治疗会永久性地改变患者的转移动力学,从而改变状态随时间演化的方式。我们使用主动推理将癌症治疗建模为信念空间规划问题,推导出一个期望自由能目标,该目标在测量预算下统一了目标导向控制和信息获取。我们使用来自AACR Project GENIE Biopharma Collaborative数据集的真实临床癌症数据实现了该框架。临床数据结果表明,在真实的测量和治疗约束下,能够同时实现患者分类和高治疗效力。

英文摘要

Cancer treatment is at its core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit constraints on the budget for medical measurements. Unlike standard Reinforcement Learning (RL) approaches that control state trajectories, cancer treatments permanently modify patients' transition dynamics, changing how states evolve over time. We model cancer treatment as a belief-space planning problem using active inference, deriving an expected free-energy objective that unifies goal-directed control and information acquisition under measurement budgets. We implement this framework using real clinical cancer data from the AACR Project GENIE Biopharma Collaborative dataset. Results on clinical data demonstrate simultaneous patient categorization and high treatment efficacy under real measurement and treatment constraints.
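Planning in belief space means the controller acts on a probability distribution over hidden patient states and revises it after each measurement. The sketch below shows only that revision step, a discrete Bayes update, with a hypothetical two-type patient model; it does not implement the expected free-energy objective, and the probabilities are invented for illustration.

```python
def update_belief(belief, likelihood):
    """Bayes update: posterior[s] is proportional to likelihood[s] * belief[s]."""
    unnormalized = [l * b for l, b in zip(likelihood, belief)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("observation has zero probability under the current belief")
    return [u / total for u in unnormalized]


# Hypothetical latent patient types A and B, initially equally likely.
belief = [0.5, 0.5]
# Hypothetical probability of the observed test result under each type.
likelihood = [0.8, 0.2]

posterior = update_belief(belief, likelihood)
print(posterior)  # approximately [0.8, 0.2]
```

Under a measurement budget, each such update has a cost, so the planner has to weigh how much a test would sharpen the belief against what the same budget could buy later.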

2509.24725 2026-06-18 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net:基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam(阿姆斯特丹大学) Delft University of Technology(代尔夫特理工大学)

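AI summary Proposes Q-Net, a framework that combines Kalman filtering with neural networks to fuse data sources for queue length estimation at signalized intersections, improving spatial transferability and real-time applicability and enabling accurate estimation without costly sensing equipment.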

详情
AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源:(i) 接近停止线的环形检测器提供的车辆计数汇总数据,以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD),但如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此,本文提出Q-Net:一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战,如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构,并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现,并通过将aFCD测量分组为固定大小的局部组来提高空间转移性,使可学习参数的数量与路段长度无关。在荷兰 Rotterdam 城市主干道的评估显示,Q-Net优于基线方法,能够准确追踪队列的形成和消散,并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性,Q-Net在无需昂贵的传感基础设施(如摄像头或雷达)的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources, which differ in spatial and temporal resolution, for queue length estimation remains unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
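For readers unfamiliar with the Kalman predict-update structure that Q-Net follows, here is a minimal scalar version. It is the classical filter, where the gain is computed from assumed noise variances; per the abstract, Q-Net instead learns time-varying gain dynamics from data, which this sketch does not attempt. The random-walk state model and the input `u` are assumptions made for the example.

```python
def kalman_step(x, p, z, q, r, u=0.0):
    """One predict-update cycle of a scalar Kalman filter.

    x, p: previous state estimate (e.g., queue length) and its variance
    z: new measurement; q, r: process and measurement noise variances
    u: known input, e.g., net inflow minus outflow (hypothetical here)
    """
    # Predict: carry the state forward and inflate its uncertainty.
    x_pred = x + u
    p_pred = p + q
    # Update: blend prediction and measurement using the Kalman gain.
    gain = p_pred / (p_pred + r)
    x_new = x_pred + gain * (z - x_pred)
    p_new = (1.0 - gain) * p_pred
    return x_new, p_new, gain


# Equal trust in prediction and measurement gives a gain of 0.5.
print(kalman_step(x=0.0, p=1.0, z=10.0, q=0.0, r=1.0))  # (5.0, 0.5, 0.5)
```

The gain is the only place where the two information sources are weighed against each other, which is why replacing it with a learned, time-varying quantity is a natural point to add a neural network while keeping the rest of the filter interpretable.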

2307.05623 2026-06-18 cs.LG cs.AI 版本更新

A DeepLearning Framework for Dynamic Estimation of Origin-Destination Sequence

一种用于动态估计起点-终点序列的深度学习框架

Zheli Xiong, Defu Lian, Enhong Chen, Gang Chen, Xiaomin Cheng

发表机构 * School of Data Science, University of Science and Technology of China(数据科学学院 中国科学技术大学) Yangtze River Delta Information Intelligence Innovation Research Institute, China(长江三角洲信息智能创新研究院)

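AI summary To address underdetermination and lag in OD matrix estimation, proposes an integrated deep learning method in which a neural network infers the structure of the OD sequence and guides numerical optimization; experiments show it provides effective spatiotemporal constraints.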

Comments 11 pages, 25 figures

详情
AI中文摘要

OD矩阵估计是交通领域的一个关键问题。主要方法利用交通传感器测量信息(如交通计数)来估计由OD矩阵表示的交通需求。该问题分为两类:静态OD矩阵估计和动态OD矩阵序列(简称OD序列)估计。上述两类都面临由大量待估参数和不足的约束信息引起的欠定性问题。此外,OD序列估计还面临滞后挑战:由于拥堵等不同交通状况,同一车辆在相同观测时段内会出现在不同路段,导致相同的OD需求对应不同的行程。为此,本文提出一种集成方法,利用深度学习方法推断OD序列的结构,并利用结构约束指导传统数值优化。实验表明,神经网络能有效推断OD序列的结构,并为数值优化提供实用的约束以获得更好的结果。此外,实验表明,所提供的结构信息不仅包含对OD矩阵空间结构的约束,还提供了对OD序列时间结构的约束,很好地解决了滞后问题的影响。

英文摘要

OD matrix estimation is a critical problem in the transportation domain. The principal method uses information measured by traffic sensors, such as traffic counts, to estimate the traffic demand represented by the OD matrix. The problem is divided into two categories: static OD matrix estimation and dynamic OD matrix sequence (OD sequence for short) estimation. Both face the underdetermination problem caused by a large number of parameters to estimate and insufficient constraint information. In addition, OD sequence estimation faces the lag challenge: due to different traffic conditions such as congestion, identical vehicles will appear on different road sections during the same observation period, so that identical OD demands correspond to different trips. To this end, this paper proposes an integrated method that uses deep learning to infer the structure of the OD sequence and uses structural constraints to guide traditional numerical optimization. Our experiments show that the neural network (NN) can effectively infer the structure of the OD sequence and provide practical constraints for numerical optimization to obtain better results. Moreover, the experiments show that the provided structural information contains constraints not only on the spatial structure of OD matrices but also on the temporal structure of the OD sequence, which addresses the lag problem well.

2507.16859 2026-06-18 cs.RO cs.AI 版本更新

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

通过异构多源数据集成与跨域模态插补增强疲劳检测

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

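AI summary For deployment settings where high-quality sensors are unavailable, proposes a heterogeneous multi-source fatigue detection framework that uses shared modalities for cross-domain modality imputation, transferring source-domain knowledge to improve fatigue detection in the target domain.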

Comments 4 figures, 14 pages

详情
AI中文摘要

疲劳检测对于安全相关应用(如航空、采矿和长途运输)中的人类操作员至关重要。可靠的操作员疲劳估计可以支持人机系统中的及时警告、自适应任务调度、接管提醒和其他安全管理决策。然而,这些功能的有效性取决于疲劳相关信号是否能在部署环境中可靠捕获。虽然许多研究已显示高保真传感器在受控实验室环境中的价值,但在实际环境中,由于噪声、光照条件和视野限制,其性能往往会下降,从而限制了实际应用。本文形式化了一种面向实际部署的疲劳检测设置,其中高质量传感器在实际应用中通常不可用。为解决这一问题,我们利用来自异构源域的知识,包括难以在现场部署但常用于受控环境的高保真传感器,来辅助真实目标域中的疲劳检测。基于这一思想,我们设计了一个异构多源疲劳检测框架,该框架利用目标域中的可用模态,同时通过基于共享模态的跨域模态插补来利用源域中的多样化配置。

英文摘要

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.

2511.14555 2026-06-18 q-bio.NC cs.AI 版本更新

DecNefSimulator: A Modular, Interpretable Framework for Decoded Neurofeedback Simulation Using Generative Models

DecNefSimulator:一个用于解码神经反馈模拟的模块化、可解释框架

Alexander Olza, Roberto Santana, David Soto

发表机构 * Intelligent Systems Group, University of the Basque Country (UPV/EHU)(巴斯克国家大学智能系统组) Consciousness Group, Basque Center on Cognition, Brain and Language (BCBL)(巴斯克认知、大脑与语言中心意识组) Ikerbasque, Basque Foundation for Science(巴斯克科学基金会)

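AI summary Presents DecNefSimulator, a modular and interpretable simulation framework that formalizes decoded neurofeedback as a machine learning problem; latent variable generative models act as simulated participants, so internal states can be observed directly and the effect of protocol design on learning evaluated, reproducing empirical phenomena, identifying failure conditions, and guiding protocol design.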

详情
AI中文摘要

解码神经反馈(DecNef)是一种有前景的非侵入性脑调控方法,在神经医学和认知神经科学中具有广泛应用。然而,DecNef研究的进展仍受限于受试者依赖的学习变异性、依赖间接测量来量化进展,以及实验的高成本和时间消耗。我们提出DecNefSimulator,一个模块化且可解释的模拟框架,将DecNef形式化为一个机器学习问题。除了提供虚拟实验室,DecNefSimulator使研究人员能够建模、分析和理解神经反馈动态。通过使用潜变量生成模型作为模拟参与者,DecNefSimulator允许直接观察内部认知状态,并系统评估不同协议设计和受试者特征如何影响学习。我们展示了这种方法如何(i)复现DecNef学习的经验现象,(ii)识别DecNef反馈未能诱导学习的条件,以及(iii)在人体实施之前,在计算机中指导设计更稳健可靠的DecNef协议。总之,DecNefSimulator连接了计算建模和认知神经科学,为方法创新、稳健协议设计以及最终更深入地理解基于DecNef的脑调控提供了原则性基础。

英文摘要

Decoded Neurofeedback (DecNef) is a promising non-invasive approach to brain modulation with wide-ranging applications in neuromedicine and cognitive neuroscience. However, progress in DecNef research remains constrained by subject-dependent learning variability, reliance on indirect measures to quantify progress, and the high cost and time demands of experimentation. We present DecNefSimulator, a modular and interpretable simulation framework that formalizes DecNef as a machine learning problem. Beyond providing a virtual laboratory, DecNefSimulator enables researchers to model, analyze and understand neurofeedback dynamics. Using latent variable generative models as simulated participants, DecNefSimulator allows direct observation of internal cognitive states and systematic evaluation of how different protocol designs and subject characteristics influence learning. We demonstrate how this approach can (i) reproduce empirical phenomena of DecNef learning, (ii) identify conditions under which DecNef feedback fails to induce learning, and (iii) guide the design of more robust and reliable DecNef protocols in silico before human implementation. In summary, DecNefSimulator bridges computational modeling and cognitive neuroscience, offering a principled foundation for methodological innovation, robust protocol design, and ultimately, a deeper understanding of DecNef-based brain modulation.

2512.09185 2026-06-18 cs.CV cs.AI 版本更新

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

学习患者特异性疾病动态:基于潜在流匹配的纵向影像生成

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

发表机构 * University of Cambridge(剑桥大学) Nanjing First Hospital(南京第一医院) Nanjing Medical University(南京医科大学) Johns Hopkins University(约翰霍普金斯大学) University of Dundee(邓迪大学)

AI总结 提出Δ-LFM框架,利用流匹配对齐患者潜在轨迹,通过患者特异性潜在对齐实现单调疾病进展建模,在三个纵向MRI基准上验证了可解释性和性能。

Comments ICLR 2026 accepted

详情
AI中文摘要

理解疾病进展是一个核心临床挑战,对早期诊断和个性化治疗具有直接意义。虽然最近的生成方法试图对进展进行建模,但关键不匹配仍然存在:疾病动态本质上是连续且单调的,然而潜在表示通常是分散的,缺乏语义结构,并且基于扩散的模型通过随机去噪过程破坏了连续性。在这项工作中,我们提出将疾病动态视为速度场,并利用流匹配(FM)来对齐患者数据的时间演变。与先前方法不同,它捕捉了疾病的内在动态,使进展更具可解释性。然而,一个关键挑战仍然存在:在潜在空间中,自动编码器(AE)不能保证跨患者的对齐或与临床严重性指标(例如年龄和疾病状况)的相关性。为了解决这个问题,我们提出学习患者特异性潜在对齐,使患者轨迹沿着特定轴延伸,其幅度随疾病严重程度单调增加。由此得到一个一致且语义上有意义的潜在空间。总之,我们提出了Δ-LFM,一个通过流匹配建模患者特异性潜在进展的框架。在三个纵向MRI基准上,Δ-LFM展示了强大的实证性能,更重要的是,为解释和可视化疾病动态提供了一个新框架。

英文摘要

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered and lack semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamics as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamics of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which constrains patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
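The idea of treating progression as a velocity field can be illustrated with the generic linear-path flow-matching objective. This is a minimal sketch, not the Δ-LFM training code: the function name `flow_matching_loss`, the pairing of an earlier and a later latent code, and the straight-line interpolation are assumptions made here for illustration, and the patient-specific alignment term from the abstract is not included.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x0, x1, rng):
    """Monte Carlo estimate of the linear-path flow-matching loss.

    x0, x1: arrays of shape (batch, dim), e.g. latent codes of an earlier
    and a later scan of the same patient (an assumption of this sketch).
    velocity_fn(x_t, t) must return an array of shape (batch, dim).
    """
    t = rng.random((x0.shape[0], 1))      # one time in [0, 1) per pair
    x_t = (1.0 - t) * x0 + t * x1         # point on the straight path
    target = x1 - x0                      # velocity of that straight path
    pred = velocity_fn(x_t, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

# Toy check: every pair moves by the same displacement d.
rng = np.random.default_rng(0)
d = np.array([1.0, 2.0])
x0 = rng.normal(size=(16, 2))
x1 = x0 + d
perfect = lambda x, t: np.broadcast_to(d, x.shape)   # knows the true velocity
still = lambda x, t: np.zeros_like(x)                # predicts no progression
loss_perfect = flow_matching_loss(perfect, x0, x1, rng)
loss_still = flow_matching_loss(still, x0, x1, rng)
```

A model that predicts the true displacement drives the loss to zero, while a model that predicts no change pays the squared length of the displacement.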

2601.14288 2026-06-18 astro-ph.CO cs.AI cs.CE gr-qc hep-th 版本更新

DeepInflation: an AI agent for research and model discovery of inflation

DeepInflation:用于暴胀研究与模型发现的AI智能体

Ze-Yu Peng, Hao-Shi Yuan, Qi Lai, Jun-Qian Jiang, Gen Ye, Jun Zhang, Yun-Song Piao

发表机构 * School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China International Centre for Theoretical Physics Asia-Pacific, University of Chinese Academy of Sciences, 100190 Beijing, China Taiji Laboratory for Gravitational Wave Universe, University of Chinese Academy of Sciences, 100049 Beijing, China School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China Institute of Theoretical Physics, Chinese Academy of Sciences, P.O. Box 2735, Beijing 100190, China Département de Physique Théorique, Université de Genève, 24 quai Ernest-Ansermet, CH-1211 Genève 4, Switzerland

AI总结 提出基于多智能体架构的AI智能体DeepInflation,集成大语言模型、符号回归引擎和检索增强生成知识库,自动发现与最新观测一致的单场慢滚暴胀势,并解释理论背景。

详情
AI中文摘要

我们提出了DeepInflation,一个专为暴胀宇宙学中的研究和模型发现而设计的AI智能体。基于多智能体架构,DeepInflation将大语言模型(LLMs)与符号回归(SR)引擎以及检索增强生成(RAG)知识库相结合。该框架使智能体能够自动探索和验证广阔的暴胀势景观,同时将其输出建立在既定的理论文献基础上。我们证明,DeepInflation能够成功发现与最新观测(以ACT DR6结果为例)或任意给定的$n_s$和$r$一致的简单且可行的单场慢滚暴胀势,并为冷门的暴胀场景提供准确的理论背景。DeepInflation作为宇宙学中新一代自主科学发现引擎的原型,使研究人员和非专家都能使用自然语言探索暴胀景观。该智能体可从此网址获取:https://github.com/pengzy-cosmo/DeepInflation。

英文摘要

We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi-agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval-augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single-field slow-roll inflationary potentials consistent with the latest observations (with the ACT DR6 results taken as an example) or any given $n_s$ and $r$, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non-experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy-cosmo/DeepInflation.
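For readers unfamiliar with how a candidate potential is compared against a target $n_s$ and $r$, the standard leading-order slow-roll formulas for a monomial potential $V \propto \phi^p$ can be sketched as follows. This is a textbook calculation in reduced Planck units, not DeepInflation's code; the function name `monomial_slow_roll` is hypothetical.

```python
def monomial_slow_roll(p, n_efolds):
    """Leading-order n_s and r for V proportional to phi**p (M_pl = 1)."""
    phi_end_sq = p * p / 2.0                   # epsilon = 1 ends inflation
    phi_sq = 2.0 * p * n_efolds + phi_end_sq   # field value N e-folds earlier
    epsilon = p * p / (2.0 * phi_sq)           # (1/2) (V'/V)^2
    eta = p * (p - 1.0) / phi_sq               # V''/V
    n_s = 1.0 - 6.0 * epsilon + 2.0 * eta      # scalar spectral index
    r = 16.0 * epsilon                         # tensor-to-scalar ratio
    return n_s, r

# Quadratic potential evaluated 60 e-folds before the end of inflation.
n_s, r = monomial_slow_roll(2, 60)
```

For the quadratic case this gives roughly $n_s \approx 0.967$ and $r \approx 0.132$, matching the closed forms $1 - 2/(N + 1/2)$ and $8/(N + 1/2)$.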

2602.19591 2026-06-18 cs.LG cs.AI 版本更新

Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

使用异构图神经网络检测高潜力中小企业

Yijiashun Qi, Hanzhe Guo, Yijiazhen Qi

发表机构 * University of Michigan(密歇根大学) The University of Hong Kong(香港大学)

AI总结 提出SME-HGT异构图Transformer框架,利用公开数据构建包含公司、研究主题和政府机构的异构图,预测SBIR第一阶段获奖者能否进入第二阶段,AUPRC达0.621,优于基线模型。

Comments Accepted by ICIIS 2026

详情
AI中文摘要

中小企业占美国企业的99.9%,贡献44%的经济活动,但系统性地识别高潜力中小企业仍是一个开放挑战。我们提出了SME-HGT,一个异构图Transformer框架,仅使用公开数据预测哪些SBIR第一阶段获奖者将进入第二阶段资助。我们构建了一个异构图,包含32,268个公司节点、124个研究主题节点和13个政府机构节点,通过约99,000条边连接三种语义关系类型。SME-HGT在时间分割测试集上达到0.621±0.003的AUPRC,在五个随机种子上优于MLP基线(0.590±0.002)和R-GCN(0.608±0.013)。在筛选深度为100家公司时,SME-HGT达到89.6%的精确率,比随机选择提升2.14倍。我们的时间评估协议防止信息泄露,对公开数据的依赖确保了可重复性。这些结果表明,公司、研究主题和资助机构之间的关系结构为中小企业潜力评估提供了有意义的信号,对政策制定者和早期投资者具有启示意义。

英文摘要

Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity, yet systematically identifying high-potential SMEs remains an open challenge. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which SBIR Phase I awardees will advance to Phase II funding using exclusively public data. We construct a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes connected by approximately 99,000 edges across three semantic relation types. SME-HGT achieves an AUPRC of 0.621 ± 0.003 on a temporally-split test set, outperforming an MLP baseline (0.590 ± 0.002) and R-GCN (0.608 ± 0.013) across five random seeds. At a screening depth of 100 companies, SME-HGT attains 89.6% precision with a 2.14× lift over random selection. Our temporal evaluation protocol prevents information leakage, and our reliance on public data ensures reproducibility. These results demonstrate that relational structure among firms, research topics, and funding agencies provides meaningful signal for SME potential assessment, with implications for policymakers and early-stage investors.
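The screening figures can be read with a small worked example. The sketch below computes precision at depth k and lift over random selection. The function names and toy data are illustrative and not from the paper, and lift is assumed to mean precision divided by the base positive rate. Under that assumption, 89.6% precision with a 2.14× lift implies a base rate of roughly 42%.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

def lift_at_k(scores, labels, k):
    """Precision at k divided by the overall positive rate."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate

def implied_base_rate(precision, lift):
    """Base positive rate consistent with a reported precision and lift."""
    return precision / lift

# Toy ranking: the two highest scores are both positives.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_labels = [1, 1, 0, 0]
toy_precision = precision_at_k(toy_scores, toy_labels, 2)
toy_lift = lift_at_k(toy_scores, toy_labels, 2)

# Base rate implied by the figures quoted in the abstract.
base = implied_base_rate(0.896, 2.14)
```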

2603.28707 2026-06-18 cs.CE cs.AI 版本更新

A Convex Route to Thermoelasticity: Learning Internal Energy and Dissipation

热弹性的凸路径:学习内能和耗散

Hagen Holthusen, Paul Steinmann, Ellen Kuhl

发表机构 * Institute of Applied Mechanics, University of Erlangen-Nuremberg, Egerlandstraße 5, 91058 Erlangen, Germany(德国埃尔朗根-纽伦堡大学应用力学研究所) Department of Mechanical Engineering, Stanford University, United States(美国斯坦福大学机械工程系)

AI总结 提出基于物理的神经网络框架,通过输入凸神经网络表示内能和耗散势,自动满足热力学第二定律,实现全耦合热-力本构建模。

Comments 31 pages, 16 figures, 4 tables

详情
AI中文摘要

我们提出了一个基于物理的神经网络框架,用于发现全耦合热-力问题中的本构模型。与基于亥姆霍兹能量的经典公式不同,我们采用内能和耗散势作为主要本构函数,以变形和熵为变量。这一选择避免了强制混合凸-凹条件,并促进了热力学原理的一致纳入。在本文中,我们关注没有优先方向或内变量的材料。尽管公式以熵表示,但温度被视为独立可观测量,熵通过本构关系内部推断,从而在不需要熵数据的情况下实现热力学一致建模。网络的热力学可接受性通过构造保证。内能和耗散势由输入凸神经网络表示,确保凸性和符合第二定律。客观性、材料对称性和归一化通过基于不变量的表示和零锚定公式直接嵌入架构中。我们在合成和实验数据集上展示了所提出框架的性能,包括纯热问题以及软组织和填充橡胶的全耦合热-力响应。结果表明,学习模型准确捕捉了潜在的本构行为。所有代码、数据和训练模型均通过 https://doi.org/10.5281/zenodo.19248596 公开提供。

英文摘要

We present a physics-based neural network framework for the discovery of constitutive models in fully coupled thermomechanics. In contrast to classical formulations based on the Helmholtz energy, we adopt the internal energy and a dissipation potential as primary constitutive functions, expressed in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity--concavity conditions and facilitates a consistent incorporation of thermodynamic principles. In this contribution, we focus on materials without preferred directions or internal variables. While the formulation is posed in terms of entropy, the temperature is treated as the independent observable, and the entropy is inferred internally through the constitutive relation, enabling thermodynamically consistent modeling without requiring entropy data. Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, ensuring convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations. We demonstrate the performance of the proposed framework on synthetic and experimental datasets, including purely thermal problems and fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are made publicly available via https://doi.org/10.5281/zenodo.19248596.
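The convexity-by-construction idea behind input convex neural networks can be sketched in a few lines. The example below is a generic ICNN forward pass in NumPy, assuming non-negative weights on the hidden-to-hidden path and a convex, non-decreasing activation (softplus). It is not the authors' architecture: the invariant-based inputs and zero-anchored normalization mentioned in the abstract are omitted, and the names `make_icnn` and `icnn_forward` are hypothetical.

```python
import numpy as np

def softplus(x):
    # Convex and non-decreasing, which is what preserves convexity below.
    return np.logaddexp(0.0, x)

def make_icnn(dim_in, hidden, n_layers, seed=0):
    """Random ICNN parameters; hidden-path weights are forced non-negative."""
    rng = np.random.default_rng(seed)
    layers = []
    for i in range(n_layers):
        out = 1 if i == n_layers - 1 else hidden
        w_x = rng.normal(size=(dim_in, out))           # unconstrained skip weights
        w_z = None if i == 0 else np.abs(rng.normal(size=(hidden, out)))
        b = rng.normal(size=out)
        layers.append((w_x, w_z, b))
    return layers

def icnn_forward(layers, x):
    """Scalar output per row of x; convex in x by construction."""
    x = np.atleast_2d(x)
    z = None
    for w_x, w_z, b in layers:
        pre = x @ w_x + b                              # affine in the input
        if w_z is not None:
            pre = pre + z @ w_z                        # non-negative mix of convex terms
        z = softplus(pre)
    return z[:, 0]

# Midpoint convexity check on random pairs of inputs.
layers = make_icnn(dim_in=3, hidden=8, n_layers=3)
rng = np.random.default_rng(1)
xa, xb = rng.normal(size=(100, 3)), rng.normal(size=(100, 3))
mid = icnn_forward(layers, 0.5 * (xa + xb))
avg = 0.5 * (icnn_forward(layers, xa) + icnn_forward(layers, xb))
```

Each layer adds an affine function of the input to a non-negative combination of convex functions and then applies a convex, non-decreasing activation, so the output stays convex in the input whatever the parameter values.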

2604.00730 2026-06-18 cs.CY cs.AI cs.LG cs.SE 版本更新

A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

基于CEFR启发的模糊C均值分类框架:自动化评估Scratch编程技能

Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles

发表机构 * Universidad Rey Juan Carlos(胡安·卡洛斯国王大学)

AI总结 提出一种基于CEFR的Scratch项目评估框架,使用模糊C均值聚类对200万+项目分级,识别B2瓶颈并引入分类确定性指标以平衡自动反馈与人工审核。

Comments Best Paper Award CSEDU 2026 -Minor change FPC fix-

详情
AI中文摘要

背景:学校、培训平台和技术公司日益需要以透明、可重复的方法大规模评估编程能力,以支持个性化学习路径。目标:本研究引入一个与欧洲共同语言参考标准(CEFR)一致的Scratch项目评估教学框架,为学生和教师提供通用能力等级,并为课程设计提供可行见解。方法:我们对通过Dr.Scratch评估的2,008,246个Scratch项目应用模糊C均值聚类,实施序数准则将聚类映射到CEFR等级(A1-C2),并引入增强分类指标,用于识别过渡学习者、实现持续进度跟踪,并量化分类确定性以平衡自动反馈与教师评审。影响:该框架能够诊断系统性课程缺口,特别是“B2瓶颈”(由于整合逻辑同步和数据表示的认知负荷,仅13.3%的学习者处于该等级),同时提供基于确定性的触发机制以进行人工干预。

英文摘要

Context: Schools, training platforms, and technology firms increasingly need to assess programming proficiency at scale with transparent, reproducible methods that support personalized learning pathways. Objective: This study introduces a pedagogical framework for Scratch project assessment, aligned with the Common European Framework of Reference (CEFR), providing universal competency levels for students and teachers alongside actionable insights for curriculum design. Method: We apply Fuzzy C-Means clustering to 2,008,246 Scratch projects evaluated via Dr.Scratch, implementing an ordinal criterion to map clusters to CEFR levels (A1-C2), and introducing enhanced classification metrics that identify transitional learners, enable continuous progress tracking, and quantify classification certainty to balance automated feedback with instructor review. Impact: The framework enables diagnosis of systemic curriculum gaps, notably a "B2 bottleneck" where only 13.3% of learners reside due to the cognitive load of integrating Logic Synchronization and Data Representation, while providing certainty-based triggers for human intervention.
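The clustering step described in the Method can be sketched with a small NumPy implementation of Fuzzy C-Means. Several parts are assumptions made for illustration and are not taken from the paper: clusters are ranked by the sum of their center coordinates as a stand-in for the ordinal criterion, certainty is defined as the gap between the two largest memberships, and the function names are hypothetical.

```python
import numpy as np

CEFR = ["A1", "A2", "B1", "B2", "C1", "C2"]

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=200, tol=1e-8, seed=0):
    """Plain Fuzzy C-Means; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)                  # rows sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]   # membership-weighted means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        inv = np.maximum(dist, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)   # standard FCM update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def assign_levels(centers, U):
    """Map clusters to ordered levels and score each assignment."""
    order = np.argsort(centers.sum(axis=1))            # weakest cluster first
    rank_of = np.empty_like(order)
    rank_of[order] = np.arange(len(order))
    levels = [CEFR[rank_of[j]] for j in U.argmax(axis=1)]
    top2 = np.sort(U, axis=1)[:, -2:]
    certainty = top2[:, 1] - top2[:, 0]                # low value: transitional learner
    return levels, certainty

# Toy data: two low-scoring and two high-scoring projects on one metric.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
centers, U = fuzzy_c_means(X, n_clusters=2)
levels, certainty = assign_levels(centers, U)
```

With six clusters the same mapping would cover A1 through C2. A project whose certainty falls below a chosen threshold could then be routed to instructor review, in the spirit of the certainty-based triggers the abstract describes.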

2604.03275 2026-06-18 physics.ao-ph cs.AI cs.LG 版本更新

IPSL-AID: Generative Diffusion Models for Climate Downscaling from Global to Regional Scales

IPSL-AID:用于从全球到区域尺度气候降尺度的生成扩散模型

Kishanthan Kingston, Olivier Boucher, Freddy Bouchet, Pierre Chapel, Rosemary Eade, Jean-Francois Lamarque, Redouane Lguensat, Kazem Ardaneh

发表机构 * Climate Modeling Center(气候建模中心) Sorbonne University(索邦大学) CNRS(法国国家科学研究中心) IPSL Paris(巴黎) France(法国)

AI总结 提出基于去噪扩散概率模型的IPSL-AID工具,利用ERA5再分析数据从粗分辨率输入生成0.25°温度、风和降水场,并建模细尺度特征概率分布以量化不确定性,准确重建统计分布、极端事件和空间结构。

Comments 17 pages, 12 figures, submitted to Climate Informatics 2026, to appear in Environmental Data Science

详情
AI中文摘要

有效的气候变化适应和减缓策略需要高分辨率预测来指导战略决策。传统的全球气候模型通常以150至200公里的分辨率运行,缺乏表示关键区域过程的能力。IPSL-AID是一种基于去噪扩散概率模型的全球到区域降尺度工具,旨在解决这一限制。该工具在ERA5再分析数据上训练,利用粗分辨率输入及其时空上下文生成0.25°分辨率的温度、风和降水场。它还建模细尺度特征的概率分布,以产生用于不确定性量化的合理情景。该模型准确重建了统计分布,包括极端事件、功率谱和空间结构。这项工作突出了生成扩散模型在高效气候降尺度及不确定性量化方面的潜力。

英文摘要

Effective adaptation and mitigation strategies for climate change require high-resolution projections to inform strategic decision-making. Conventional global climate models, which typically operate at resolutions of 150 to 200 kilometers, lack the capacity to represent essential regional processes. IPSL-AID is a global to regional downscaling tool based on a denoising diffusion probabilistic model designed to address this limitation. Trained on ERA5 reanalysis data, it generates 0.25 degree resolution fields for temperature, wind, and precipitation using coarse inputs and their spatiotemporal context. It also models probability distributions of fine-scale features to produce plausible scenarios for uncertainty quantification. The model accurately reconstructs statistical distributions, including extreme events, power spectra, and spatial structures. This work highlights the potential of generative diffusion models for efficient climate downscaling with uncertainty quantification.

2604.04089 2026-06-18 physics.comp-ph cond-mat.str-el cs.AI cs.HC 版本更新

From Paper to Program: Externalizing and Diagnosing Knowledge Bottlenecks in AI-Assisted Quantum Many-Body Code Generation

从论文到程序:AI辅助量子多体代码生成中知识瓶颈的外化与诊断

Yi Zhou

AI总结 针对AI直接翻译论文为代码时因隐含约定导致失败的问题,提出知识外化方法,通过多阶段人机协作流程将隐式假设显式化,在DMRG和Pfaffian-MPS任务上验证了有效性。

Comments Core thesis upgraded

详情
AI中文摘要

大型语言模型可以编写科学代码,但当正确性依赖于文献中的默认约定时,直接的论文到程序翻译仍然脆弱。我们将这一瓶颈识别为知识外化:在实现之前将隐式计算假设(索引约定、规范选择、费米子符号、收缩顺序和内存约束)转换为明确的技术规范。我们评估了一个多阶段、人在回路的工作流程,该流程在理论提取和代码生成之间插入这样的规范,并带有验证和停止门。该工作流程在两个算法上不同的量子多体任务上进行了测试:来自教学综述的、基于变分扫描的密度矩阵重整化群(DMRG),以及将Hartree-Fock-Bogoliubov态构造性地转换为矩阵乘积态的Pfaffian方法,后者来自Jin等人的五页快报,Phys. Rev. B 105, L081101 (2022),其代码未公开。对于DMRG,在$4\times4$网格中,所有16个规范引导的模型配对都满足物理验证标准,而直接尝试为6/13。散文规范消融实验表明,外化的内容(而非LaTeX格式)是基本要素。对于Pfaffian-MPS,该工作流程在26次存档尝试中成功11次,而直接提示产生零次审计通过。跨规范转移是不对称的:由GPT 5.5实现的非GPT规范通过4/4,而由较弱模型实现的GPT 5.5规范失败4/4,表明存在残留的实现模型瓶颈。由此产生的“论文到程序多体”技能为AI辅助实现多体算法以及诊断外化成功或失败提供了可审计的协议。

英文摘要

Large language models can write scientific code, but direct paper-to-program translation remains fragile when correctness depends on tacit conventions rather than explicit equations. We frame this as a knowledge-externalization problem: index choices, gauges, fermionic signs, contraction order, validation gates, and scaling constraints must be made explicit before code generation. We evaluate a multi-stage, human-in-the-loop workflow on two quantum many-body tasks. DMRG from Schollwöck's pedagogical review serves as calibration: specification-guided implementations pass in all 16 model pairings, compared with 6/13 direct attempts, and a prose-specification ablation shows that externalized content, not LaTeX form, is the active ingredient. Pfaffian conversion of HFB states to MPS from the five-page Letter by Jin et al. serves as the stress test: no public implementation is available, and success depends on tacit sign, gauge, ordering, and scalability conventions. Here the workflow yields 11/26 audited passes, while direct prompting yields none. Cross-specification transfer is asymmetric: non-GPT specifications implemented by GPT 5.5 pass 4/4, whereas GPT 5.5 specifications implemented by weaker models fail 4/4. The contrast supports a two-bottleneck picture. Externalization resolves the first bottleneck (paper-to-code ambiguity) well enough to make DMRG reproducible and Pfaffian-MPS auditable. The remaining failures expose a second bottleneck in implementation-model capability. Iterative meta-specification moves this boundary but does not eliminate it. The resulting Paper-to-Program Many-Body skill is both a reusable implementation protocol and a diagnostic instrument for AI-assisted many-body programming.

2605.12567 2026-06-18 cs.CV cs.AI 版本更新

Pyramid Self-Contrastive Learning for Single-shot Test-time Ultrasound Image Denoising

用于单次成像测试时超声图像去噪的金字塔自对比学习

Jiajing Zhang, Bingze Dai, Xi Zhang, Yue Xu, Wei-Ning Lee

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong(香港大学电子与计算机工程系) Department of Biomedical Engineering, Duke University(杜克大学生物医学工程系)

AI总结 本文提出一种纯测试时训练框架,用于单次超声图像去噪,应用于合成孔径超声,通过自对比学习分离解剖相似性和噪声随机性,提升去噪效果和结构细节。

详情
AI中文摘要

内在的电子噪声和斑点噪声使超声图像的临床解释复杂化。传统去噪方法依赖显式噪声假设,其有效性在复合噪声条件下减弱。基于学习的方法需要大量标注数据和模型参数。这些预定义和预训练的方法在复杂体内环境中不可避免地导致领域偏移,因此局限于特定噪声类型并常模糊结构细节。本文提出了一种纯测试时训练框架用于单次超声图像去噪,并应用于合成孔径超声(SAU),该方法通过自对比学习在金字塔潜在空间中分离解剖相似性和噪声随机性。干净图像随后从解剖空间解码,而丢弃噪声空间。A2A在测试时仅使用一个噪声样本的SAU信号进行训练,从而从根本上消除了领域偏移和预训练成本。模拟实验,包括电子噪声水平0至30 dB和不同包含几何形状,证明了A2A在SNR和CNR上的改进分别为69.3%和34.4%。体内结果表明,仅使用心脏六个超声切面、肝脏和肾脏的两个孔径数据,SNR和CNR分别提高了84.8%和25.7%。A2A在多种成像目标和配置中产生清晰的图像/信号,为更可靠的超声解剖可视化和功能评估铺平了道路。

英文摘要

The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods are usually pretrained in a limited image domain using a labeled dataset, which implies inevitable domain shift in complex in vivo environments. This study proposes a Pyramid Self-Contrastive Learning (PSCL) framework for test-time ultrasound image denoising without pretraining. Given multiple noisy samples from only one-shot imaging, PSCL disentangles anatomical similarity and noise randomness into separate pyramid latent spaces. The clean image is then decoded from the anatomy space while discarding the noise space. We first apply PSCL to synthetic aperture ultrasound (SAU), where an Aperture-to-Aperture loop serves as a self-supervised proxy task to ensure denoising fidelity. Simulation experiments, including noise levels from 0 to 30 dB and inclusion geometries from simple to complex, demonstrated improvements of 69.3% in SNR and 34.4% in CNR. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. PSCL delivers clear images across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization without domain shift and pretraining costs.

2605.21528 2026-06-18 cs.LG cs.AI 版本更新

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

可重复的基于日志的自动机器学习框架用于医疗风险预测中的可解释流水线优化

Rui Huang, Lican Huang

发表机构 * School of Basic Medicine, Hangzhou Normal University(杭州师范大学基础医学院) Research Department, Hangzhou Domain Zones Technology Co.Ltd.(杭州域区技术有限公司)

AI总结 本文提出了一种可重复的基于日志的自动机器学习框架,用于医疗风险预测中的可解释流水线优化,通过分析组件属性、交互和冗余性,提高了模型性能和稳定性。

详情
AI中文摘要

准确且可重复的疾病风险预测仍然具有挑战性,由于异质特征、有限样本和严重的类别不平衡。本研究引入了yvsoucom-iterkit,一种确定性和基于日志的自动化机器学习框架,将流水线优化完全可重复地建模为配置级系统。每个流水线被编码为可追溯的日志实体,使能够分析组件属性、交互、相似性和跨种子鲁棒性。在超过18,000个流水线配置上对Pima Indians糖尿病和中风数据集的实验揭示了一个结构化且部分冗余的搜索空间,其中性能由一小部分相互作用的组件决定。随机森林重要性分析显示,增强(0.454)、模型选择(0.198)和不平衡处理(0.101)是Pima数据集的关键驱动因素,而不平衡处理主导中风(0.406)。组件相似性分析显示强冗余性,特征选择变体(biMax-biMean)表现出低RMS距离(0.0252),混合匹配无增强(0.0279),TomekLinks与无不平衡处理对齐(0.0325),而高斯噪声与无增强的差异更大(0.10)。该框架使用集成模型(加权F1 0.89,宏F1 0.88在Pima;加权F1 0.94在中风)实现了强且稳定的性能,而宏F1在中风上较低(0.67)由于类别不平衡。跨种子分析揭示了性能-鲁棒性权衡,集成模型的变异性低于SVM。这些结果表明,有效的AutoML优化可以聚焦于一组高影响的组件。

英文摘要

Accurate disease risk prediction is challenged by heterogeneous features, limited data, and class imbalance. This study presents yvsoucom-iterkit, a deterministic AutoML framework that models pipeline optimization as a configuration-level system with full reproducibility and traceable execution logs, enabling systematic analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured yet partially redundant search space, where performance is dominated by a small subset of interacting components. Ensemble models achieve stable performance, reaching a Weighted-F1 of 0.89 on Pima and 0.94 on Stroke. Macro-F1 reaches approximately 0.88 on Pima but drops to 0.6560 on Stroke due to severe imbalance. Cross-seed experiments show that ensembles reduce variance compared to single models. Friedman testing ($p < 0.05$) confirms significant ranking differences across configurations. Based on analysis of component attribution, interaction, and similarity, optimal configuration design reveals dataset-dependent behavior. For the Pima dataset, computational efficiency benefits from simplified search spaces where redundant components can be removed, with split ratio playing a key role. In contrast, the Stroke dataset requires enhanced imbalance-aware strategies, where RandomOverSampler improves Macro-F1 from 0.6560 to 0.6766. These findings demonstrate that effective AutoML optimization is achieved through optimal configuration design, where carefully constraining the search space to high-impact components can improve performance, stability, and interpretability while reducing unnecessary search complexity.

2606.00491 2026-06-18 cs.CV cs.AI 版本更新

Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation

CT分割系统的部署前鲁棒性压力测试:使用临床驱动的多损坏增强

CholMin Kanga, Jonghyun Chung, Amanpreet Kaur, Nagesh Gulkotwar, Aarthi Sivasankaran

发表机构 * Seoul National University(首尔国立大学) Google Inc.(谷歌公司)

AI总结 提出RAMP框架,通过多损坏增强提升CT分割模型在临床异质成像条件下的鲁棒性,显著缩小干净与损坏图像性能差距。

详情
AI中文摘要

基于深度学习的CT分割系统在干净基准图像上通常能达到高精度,但在噪声、分辨率损失、对比度变化、强度偏移和伪影等异质临床成像条件下,其性能可能会下降。这种不稳定性可能限制其在真实医疗成像工作流程中的可靠部署。 我们提出鲁棒性增强多损坏流水线(RAMP),这是一个面向鲁棒性的CT分割增强框架。RAMP结合了解剖约束的空间扰动、CT强度变换和随机多损坏组合,使模型在训练过程中暴露于临床可行的图像退化。 在两个CT分割评估设置中,RAMP实现了最强的损坏图像性能和最小的干净到损坏鲁棒性差距。在五器官噪声评估基准中,与nnU-Net基线相比,RAMP将平均损坏Dice从0.610提高到0.753,并将鲁棒性差距从0.264降低到0.064。在Abdomen1K中,RAMP将平均损坏Dice从0.633提高到0.789,并将鲁棒性差距从0.290降低到0.070。尽管RAMP未达到最高的干净图像Dice,但它显著减轻了严重图像退化下的最坏情况分割崩溃。 这些结果表明,多损坏增强可以作为提高CT分割系统在异质临床环境中可靠性的实用部署前策略。

英文摘要

Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments.
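The two quantities reported above can be made concrete with a short sketch. The Dice function is the standard overlap score. The robustness gap is assumed here to be clean Dice minus mean corrupted Dice, which the abstract does not state explicitly, and the function names are illustrative. Under that assumption the five-organ figures imply clean Dice of about 0.874 for the nnU-Net baseline and 0.817 for RAMP, consistent with the remark that RAMP does not have the highest clean-image Dice.

```python
import numpy as np

def dice(pred, target):
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0                          # both masks empty: perfect agreement
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def robustness_gap(clean_dice, corrupted_dices):
    """Clean Dice minus mean Dice over corrupted variants (assumed definition)."""
    return clean_dice - float(np.mean(corrupted_dices))

# Toy masks: the prediction covers the one true voxel plus one extra voxel.
toy_dice = dice([1, 1, 0, 0], [1, 0, 0, 0])
toy_gap = robustness_gap(0.9, [0.8, 0.6])

# Clean Dice implied by the reported corrupted Dice and gap (five-organ benchmark).
implied_clean_baseline = 0.610 + 0.264
implied_clean_ramp = 0.753 + 0.064
```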

2606.02045 2026-06-18 cs.CV cs.AI 版本更新

Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift

域偏移下基于注意力机制和迁移学习的鲁棒桃叶损伤分类

Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa, José Salvador Rubio-Asensio, Miguel A. Zamora Izquierdo, Juan Antonio Martínez Navarro, Antonio F. Skarmeta

发表机构 * Department of Information and Communication Engineering(信息与通信工程系) University of Murcia(穆尔西亚大学) Department of Irrigation, Centro de Edafología y Biología Aplicada del Segura CEBAS-CSIC(塞古拉应用土壤学与生物学中心CEBAS-CSIC灌溉系)

AI总结 提出基于注意力机制和迁移学习的桃叶损伤分类方法,通过CBAM增强EfficientNet模型在公共数据集上达到93.3%准确率,并在本地数据集上通过迁移学习实现93%宏F1分数,有效应对域偏移。

详情
AI中文摘要

人工智能为从图像数据评估作物损伤提供了实用框架,支持农业管理中的早期决策。在桃园中,气候变化增加了非生物胁迫和生物压力,包括病虫害,这些通常产生视觉上相似的叶片症状。这种重叠使得手动诊断变得困难,尤其是在不同环境条件下的多个田地中,凸显了对具有强泛化能力的自动化模型的需求。 我们提出了一种基于图像的桃叶损伤检测分类方法。通过手动标注公开图像创建了一个基准数据集,包含六个损伤类别的1,366片桃叶。评估了几种深度学习架构。EfficientNet模型取得了最佳结果,其中EfficientNetB0达到92.9%的准确率,EfficientNetB3达到91.5%,EfficientNetB5在少数类上表现最强。DenseNet121达到92.6%的准确率。卷积块注意力模块(CBAM)的集成在多个骨干网络中提升了性能,特别是在EfficientNetB5和InceptionV3中,而在其他网络中效果有限或为负。CBAM增强的EfficientNetB5取得了93.3%的最佳总体准确率。 为了评估在现实条件下的鲁棒性,收集了一个包含四个类别180张图像的本地数据集,并应用迁移学习策略来解决域偏移。测试了三种微调策略。结合CBAM的EfficientNetB3在本地域中取得了最佳性能,迁移后宏F1分数达到93%。总体而言,基于注意力的模型在少数类上表现出更强的鲁棒性,并在不同田间条件下具有更好的泛化能力。

英文摘要

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions.

2606.03827 2026-06-18 cs.CV cs.AI 版本更新

Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis

基于傅里叶运动建模的条件潜扩散模型用于虚拟人群合成

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

发表机构 * Centre for Computational Imaging and Modelling in Medicine (CIMIM)(计算医学成像与建模中心) University of Manchester(曼彻斯特大学) Christabel Pankhurst Institute(克里斯塔贝尔·潘克赫斯特研究所) Department of Computer Science(计算机科学系) Division of Informatics, Imaging & Data Sciences(信息学、成像与数据科学学部) Department of Electrical & Electronic Engineering(电气与电子工程系) NIHR Manchester Biomedical Research Centre, Manchester Academic Health Sciences Centre, University of Manchester(英国国家卫生与保健研究所(NIHR)曼彻斯特生物医学研究中心、曼彻斯特学术健康科学中心、曼彻斯特大学)

AI总结 提出4D F-MeshLDM框架,结合卷积网格VAE、截断傅里叶级数运动参数化和条件扩散先验,实现可控的3D+t心脏网格序列生成,在UK Biobank数据上优于基线方法。

Comments This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

详情
AI中文摘要

医疗设备的计算机模拟试验需要生成虚拟解剖人群。在心血管应用中,虚拟解剖通常表示为从生成模型采样的3D+t网格。然而,大多数现有网格生成器关注静态解剖,而序列模型往往缺乏显式周期性。为此,我们提出4D F-MeshLDM,一个条件生成框架,包括用于编码网格的卷积网格VAE、使用截断傅里叶级数参数化运动的结构化潜空间,以及学习傅里叶系数令牌上潜分布的先验扩散。通过仿射调制将扩散过程条件化于临床协变量,我们实现了可控合成。采样令牌并执行逆傅里叶合成产生周期一致的潜轨迹,可解码为3D+t心脏网格序列。在5,000名UK Biobank受试者上的实验表明,4D F-MeshLDM在解剖保真度上优于最先进的基线,并实现了接近零的周期闭合误差。此外,生成的队列准确保留了临床功能指标,突显了我们的框架在可靠的心脏计算机模拟试验中的潜力。

英文摘要

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
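Why a truncated Fourier parameterisation gives near-zero cycle closure error can be seen from a small sketch of inverse Fourier synthesis. This is an illustrative reconstruction, not the 4D F-MeshLDM code: the coefficient layout, the normalised phase t in [0, 1], and the name `fourier_trajectory` are assumptions.

```python
import numpy as np

def fourier_trajectory(a0, a, b, n_frames):
    """Latent trajectory from a truncated Fourier series.

    a0: (d,) mean latent code; a, b: (K, d) cosine and sine coefficients.
    Returns an array of shape (n_frames + 1, d) sampled at phases 0..1,
    with both endpoints of the cycle included.
    """
    t = np.linspace(0.0, 1.0, n_frames + 1)
    k = np.arange(1, a.shape[0] + 1)
    angles = 2.0 * np.pi * np.outer(t, k)              # (n_frames + 1, K)
    return a0 + np.cos(angles) @ a + np.sin(angles) @ b

# Random coefficients standing in for sampled Fourier coefficient tokens.
rng = np.random.default_rng(0)
a0 = rng.normal(size=4)
a = rng.normal(size=(3, 4))
b = rng.normal(size=(3, 4))
traj = fourier_trajectory(a0, a, b, n_frames=20)
closure_error = float(np.abs(traj[0] - traj[-1]).max())
```

Every harmonic has an integer number of periods per cycle, so the trajectory returns to its starting latent code up to floating-point error regardless of which coefficients are sampled.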

11. 其他/综合AI 16 篇

2604.23716 2026-06-18 cs.AI cs.IT cs.LG cs.MA math.IT 版本更新

Information-Theoretic Measures in AI: A Practical Decision Guide

人工智能中的信息论度量:实用决策指南

Nikolaos Al. Papadopoulos, Konstantinos E. Psannis

发表机构 * Department of Applied Informatics, University of Macedonia(马其顿大学应用信息系)

AI总结 本文为七种信息论度量提供实用决策框架,围绕每个度量的三个关键问题:回答的问题与AI场景、适合的估计器、最危险的误用,并附有流程图和决策表。

Comments 25 pages, 2 tables, 1 figure. Submitted to Entropy (MDPI)

详情
AI中文摘要

信息论(IT)度量在人工智能中无处不在:熵驱动决策树分裂和不确定性量化,交叉熵是默认的分类损失,互信息支撑表示学习和特征选择,转移熵揭示动态系统中的有向影响。第二类较不成熟的度量——整合信息(Phi)、有效信息(EI)和自主性——已出现用于表征智能体复杂性。尽管被广泛采用,度量选择常常与估计器假设、失败模式和安全的推断主张脱节。本文为所有七种度量提供了一个实用决策框架,围绕每个度量的三个指导性问题组织:(i)该度量回答什么问题,在何种AI背景下;(ii)哪种估计器适合数据类型和维度;(iii)最危险的误用是什么。该框架通过两个互补的人工制品实现:度量选择流程图和主决策表。我们涵盖每个度量的AI/ML和决策智能体应用领域,并使用标准化桥接框将IT量与认知构造联系起来。三个工作示例展示了该框架在具体从业者场景中的应用,涵盖表示学习、时间影响分析和进化智能体复杂性。

英文摘要

Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
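Three of the measures named above have simple plug-in forms for discrete distributions, sketched below in bits. These are the textbook definitions and not the paper's recommended estimators. Plug-in estimates of this kind are known to be biased on small samples, which is the sort of estimator caveat the guide is concerned with.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log2(p[nz])).sum())

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; q must be positive wherever p is."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log2(q[nz])).sum())

def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)              # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)              # marginal of Y, shape (1, m)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Fair coin, independent pair, and perfectly correlated pair.
h_coin = entropy([0.5, 0.5])
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_correlated = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```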

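2606.00729 2026-06-18 cs.AI updated version

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

Kim Phuc Tran

Affiliations * Univ. Lille, ENSAIT, ULR 2461 – GEMTEX

AI summary The paper proposes viewing national AI development as a dynamic learning system in which information injection and entropy dissipation are held in a controlled balance, and argues that AI sovereignty stems from a nation's ability to regulate its own information dynamics rather than from scale alone.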

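Details
AI-translated abstract (from the Chinese)

In France, artificial intelligence is often discussed in terms of investment, compute, regulation, employment, sovereignty, and education, and these dimensions are usually treated separately. This paper proposes a unified reading: France should be understood as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), recently formalized as a framework for entropy-regulated representation-learning dynamics, we interpret national AI development as a controlled balance between information injection and entropy dissipation. Information injection corresponds to compute, data, talent, research, capital, industrial deployment, and institutional experimentation; entropy dissipation corresponds to organizational complexity, coordination friction, energy constraints, regulatory uncertainty, talent-mobility pressure, and opportunities to strengthen industrial absorption. The central claim is that AI sovereignty does not come from scale alone, but from a nation's ability to regulate its own information dynamics. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, and game theory, and argues that the French AI debate should move beyond the opposition between techno-optimism and regulation-first positions. A competitive, human-centered AI strategy requires a controlled regime in which information injection grows faster than institutional dissipation while avoiding unstable, unequal, or energy-intensive expansion. We provide a mathematical model, measurable policy indicators, game-theoretic propositions, illustrative simulations of national AI regimes, and concrete policy implications for France. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system.

English abstract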

Artificial intelligence in France is often discussed through separate dimensions such as investment, compute, regulation, employment, sovereignty, and education. This viewpoint paper proposes a unified interpretation: France can be analyzed as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), we use HCLM not as a validated econometric model, but as a conceptual and diagnostic lens for interpreting national AI development as a balance between information injection, absorptive capacity, and institutional dissipation. Information injection includes compute, data, talent, research, capital, industrial deployment, and policy experimentation. Institutional dissipation refers to avoidable frictions such as administrative overload, coordination failures, energy constraints, regulatory uncertainty, talent mobility pressures, and weak industrial absorption. Regulation is not treated as mere friction: adaptive governance, trusted data spaces, and safety-oriented standards may increase long-term learning capacity by improving legitimacy, interoperability, and social trust. The central claim is not that a country follows neural-network equations, but that AI sovereignty depends on how effectively it converts distributed information into absorbed, coordinated, and socially legitimate capability. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, absorptive capacity, and coordination mechanisms. It offers a formal heuristic, policy indicators, illustrative scenarios, and implications for France. The numerical results are diagnostic scenarios, not econometric estimates or official rankings. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system that should be tested with historical and cross-country data.

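2605.17131 2026-06-18 cs.CV cs.AI cs.LG updated version

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

Affiliations * State University of New York at Albany

AI summary This survey systematically examines deep learning architectures for point cloud classification and segmentation: it analyzes the structural properties of point cloud data, categorizes prior work by architecture, evaluates performance on mainstream benchmarks, and identifies open challenges and future directions.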

Comments We reviewed a decade of advancements in point cloud processing: trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

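Details
Journal ref
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026
AI-translated abstract (from the Chinese)

Point clouds are the most widely adopted format for representing 3D shapes and scenes because of their simplicity and geometric fidelity. However, their inherently unordered and irregular nature, compounded by sensor noise and occlusion, poses distinctive challenges for machine-learning-based methods. To address these problems, a variety of strategies have been developed, including conversion to ordered formats, extraction of local geometric features, and processing based on permutation invariance or self-attention. In this paper we focus on deep learning models applied to three fundamental 3D vision tasks: point cloud classification, part segmentation, and semantic segmentation. We first formally define point cloud data and then discuss its structural properties in depth. Next, we categorize important works by their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions in 3D point cloud understanding.

English abstract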

Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

2606.00182 2026-06-18 cs.HC cs.AI cs.CY updated version

The New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

Affiliations * Autonomous Interactive Systems, University of Siegen; Experience & Interaction Design, University of Siegen

AI summary Using a 2x2x2 vignette experiment (n=50), the study examines how levels of AI competency and proactivity affect self- and peer-perceptions of work ownership, affect, meaningfulness, and role dynamics. AI with low competency or low proactivity generally improved positive perceptions, whereas high competency combined with high proactivity may have negative effects.

Comments Updated metadata following publication in Interacting with Computers. Added DOI and publication information

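Details
AI-translated abstract (from the Chinese)

Human-AI collaboration is regarded as the most promising way to integrate AI into the workplace. However, the experiential consequences of this collaboration have not yet been explored: specifically, how humans in a team with AI perceive themselves (self-perception) and how their coworkers perceive them (peer perception) with respect to work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics, with AI proactivity and AI competency as within-subject factors (two levels, low/high) and point of view (self-perception/peer perception) as a between-subjects factor. Our results show that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and increased positive affect while reducing negative affect. However, these effects were often influenced by point of view. For example, low AI proactivity led to higher job satisfaction from the self-perception than from the peer-perception point of view. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be sufficient. Highly competent and highly proactive AI-driven systems can have undesirable effects on perceived ownership, job identity, social image, and team dynamics, and in turn on job meaningfulness.

English abstract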

Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experiential consequences of this teaming. More specifically, in a team with AI, how humans perceive themselves (self-perception) and how they are perceived by their coworkers (peer perception) in terms of work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics across two levels (low/high) of AI proactivity and AI competency as within-subject factors, with point-of-view (self perception/peer perception) as between-subjects. Our results showed that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and also increased positive affect while reducing negative affect. However, these effects were often influenced by point-of-view. For instance, low AI proactivity resulted in higher job satisfaction from self-perception rather than peer perception. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be adequate. Highly competent and proactive AI-driven systems can have undesirable impacts on perceptions of ownership, job identity, social image and team dynamics, and consequently, job meaningfulness.

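2606.15091 2026-06-18 cs.HC cs.AI updated version

Sensory Restoration via Brain-Computer Interfaces: A Unified 2 x 2 Framework and Convergence Roadmap

Xuan-The Tran

Affiliations * School of Mechanical Engineering, Vietnam Maritime University

AI summary The chapter proposes a unified 2 x 2 framework that classifies brain-computer interfaces by invasiveness and signal direction, defines the paradigms of restoration, substitution, and augmentation, and lays out a convergence roadmap over near-, medium-, and long-term horizons.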

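Details
AI-translated abstract (from the Chinese)

Millions of people worldwide suffer sensory and communication deficits caused by neurodegenerative disease, stroke, or trauma. Brain-computer interfaces (BCIs) offer a promising route to sensory and motor restoration. However, the scientific literature is highly fragmented between invasive neuroprosthetics and non-invasive electrophysiological decoders, and lacks consistent terminology and comparison metrics. This chapter proposes a unified 2 x 2 framework that classifies BCIs along two axes: degree of invasiveness (invasive vs. non-invasive) and signal direction (afferent sensory-IN vs. efferent sensory-OUT). We define and distinguish the paradigms of restoration, substitution, and augmentation. In addition, we outline a structured roadmap for the convergence of these modalities over the near, medium, and long term, focusing on physical limits and the integrative role of machine learning foundation models.

English abstract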

Millions of individuals worldwide suffer from sensory and communication deficits caused by neurodegenerative diseases, stroke, or trauma. Brain-computer interfaces (BCIs) offer a promising avenue for sensory and motor restoration. However, the scientific literature remains highly fragmented between invasive neuroprosthetics and non-invasive electrophysiological decoders, with a lack of consistent terminology and comparison metrics. This chapter proposes a unified 2 x 2 framework categorizing BCIs along two axes: degree of invasiveness (invasive vs. non-invasive) and signal direction (afferent sensory-IN vs. efferent sensory-OUT). We define and distinguish the paradigms of restoration, substitution, and augmentation. Furthermore, we outline a structural roadmap for the convergence of these modalities over near-, medium-, and long-term horizons, focusing on physical limits and the integrative role of machine learning foundation models.

2602.15513 2026-06-18 cs.RO cs.AI updated version

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

Affiliations * The University of Hong Kong; Beijing Institute of Technology

Details
Journal ref
IROS 2026
English abstract

Deploying Multimodal Large Language Models as the brain of embodied agents remains challenging, particularly under long-horizon observations and limited context budgets. Existing memory assisted methods often rely on textual summaries, which discard rich visual and spatial details and remain brittle in non-stationary environments. In this work, we propose a non-parametric memory framework that explicitly disentangles episodic and semantic memory for embodied exploration and question answering. Our retrieval-first, reasoning-assisted paradigm recalls episodic experiences via semantic similarity and verifies them through visual reasoning, enabling robust reuse of past observations without rigid geometric alignment. In parallel, we introduce a program-style rule extraction mechanism that converts experiences into structured, reusable semantic memory, facilitating cross-environment generalization. Extensive experiments demonstrate state-of-the-art performance on embodied question answering and exploration benchmarks, yielding a 7.3% gain in LLM-Match and an 11.4% gain in LLM-Match × SPL on A-EQA, as well as +7.7% success rate and +6.8% SPL on GOAT-Bench. Analyses reveal that our episodic memory primarily improves exploration efficiency, while semantic memory strengthens complex reasoning of embodied agents.

2602.20135 2026-06-18 cs.CL cs.AI cs.IR updated version

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

Affiliations * University of Tehran; Independent Researcher; Amirkabir University of Technology; TEIAS Institute

Comments Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

Details
Journal ref
Conference on Parsimony and Learning, Proceedings of Machine Learning Research, 328:989-1024, 2026
English abstract

With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.

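2405.14273 2026-06-18 cs.LG cs.AI math.OC updated version

Exact Solution to Data-Driven Inverse Optimization of MILPs in Finite Time via Gradient-Based Methods

Akira Kitaoka

Affiliations * NEC Corporation

AI summary The paper studies data-driven inverse optimization for mixed integer linear programs, reveals the geometric structure of the suboptimality loss, proves that gradient-based optimization methods reach consistency with the observed data in finitely many iterations, and gives an upper bound on the iteration count for projected subgradient descent.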

Comments 66 pages; comments are welcome

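Details
AI-translated abstract (from the Chinese)

A data-driven inverse optimization problem (DDIOP) is the problem of estimating the objective-function parameters (weights) that explain observed optimal-solution data, and it is widely applied to mixed integer linear programming (MILP). In inverse optimization for MILPs, the prediction error of the features is discontinuous in the weights, which makes gradient-based optimization methods hard to apply directly. This paper focuses on the suboptimality loss, which attains its minimum value of zero when the weights are exactly consistent with the observed data. We reveal the geometric structure of this loss: it is convex and piecewise linear, and the set of weights exactly consistent with the observed data has positive "thickness" rather than being a single point or a thin boundary. Using this structure, we prove the following. First, a broad class of gradient-based optimization methods, including projected subgradient descent, reaches consistency with the observed data in finitely many iterations (an exact solution is obtained in finite time). Second, for projected subgradient descent we give an explicit upper bound on the number of iterations needed to reach exact consistency. Third, when the forward problem is an integer linear program (ILP), we express this upper bound as a fully explicit iteration count determined solely by the number of samples, the feature dimension, and the structure of the constraint coefficient matrix (for example, if the coefficient matrix is totally unimodular, the iteration count is explicitly bounded by a polynomial in the square of the number of samples and the dimension). Numerical experiments confirm this finite-step attainment behavior.

English abstract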

A data-driven inverse optimization problem (DDIOP) is the problem of estimating the objective-function parameters (weights) that explain observed optimal-solution data, and it arises in many applications, including mixed integer linear programming (MILP). In inverse optimization for MILPs, the prediction error of the features is discontinuous with respect to the weights, so applying gradient-based optimization directly is difficult. In this paper we focus on the suboptimality loss. This loss attains its minimum value, zero, if and only if the weights are exactly consistent with the observed data. We reveal a geometric structure of this loss -- it is convex and piecewise linear, and moreover the set of weights that are exactly consistent with the observed data has a positive "thickness" rather than being a single point or a thin boundary -- and use it to show the following. First, a broad class of gradient-based optimization methods, including projected subgradient descent, reaches exact consistency with the observed data in finitely many iterations (an exact solution is obtained in finite time). Second, for projected subgradient descent we give an explicit upper bound on the number of iterations needed to reach exact consistency. Third, when the forward problem is an integer linear program (ILP), we give this upper bound as a fully explicit iteration count determined solely by the number of samples, the dimension of the features, and the structure of the constraint coefficient matrix. Through numerical experiments, we confirm this finite-step attainment behavior.
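
To make the suboptimality loss concrete, here is a minimal sketch of projected subgradient descent on a toy problem. Everything specific in it is an assumption rather than the paper's setup: the forward problem is taken to maximize `w . x` over a tiny, explicitly enumerated feasible set, the weights are constrained to the probability simplex, and the step size is a fixed constant. With those choices, the loss is `max_x w . x - w . x_obs`, and a subgradient is `x_star - x_obs`, where `x_star` is the forward solution under the current weights.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) == 1} (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]

def forward_solve(w, feasible):
    """Forward problem: pick the feasible point with the largest objective w . x."""
    return max(feasible, key=lambda x: dot(w, x))

def suboptimality_loss(w, feasible, x_obs):
    """Zero exactly when x_obs is optimal for the forward problem under w."""
    return dot(w, forward_solve(w, feasible)) - dot(w, x_obs)

# Toy instance (hypothetical): choose exactly one of three items.
feasible = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
x_obs = (0, 0, 1)          # observed decision
w = [0.6, 0.3, 0.1]        # initial weights, inconsistent with x_obs
step = 0.2                 # fixed step size, chosen for illustration
iterations = 0

while suboptimality_loss(w, feasible, x_obs) > 0 and iterations < 100:
    x_star = forward_solve(w, feasible)
    subgradient = [xs - xo for xs, xo in zip(x_star, x_obs)]
    w = project_to_simplex([wi - step * gi for wi, gi in zip(w, subgradient)])
    iterations += 1

final_loss = suboptimality_loss(w, feasible, x_obs)
```

The loop stops as soon as the loss is exactly zero, which mirrors the finite-step behavior the abstract describes; the simplex constraint is there only to rule out the trivial all-zero weight vector.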

2407.00449 2026-06-18 cs.LG cs.AI cs.NE updated version

Fully tensorial approach to hypercomplex-valued neural networks

Agnieszka Niemczynowicz, Radosław Antoni Kycia

Affiliations * Faculty of Computer Science and Mathematics, Cracow University of Technology

Comments 23 pages, 3 figures

Details
Journal ref
Information Sciences, 2026, 123796
English abstract

A fully tensorial theoretical framework for hypercomplex-valued neural networks is presented. The proposed approach enables neural network architectures to operate on data defined over arbitrary finite-dimensional algebras. The central observation is that algebra multiplication can be represented by a rank-three tensor, which allows all algebraic operations in neural network layers to be formulated in terms of standard tensor contractions, permutations, and reshaping operations. This tensor-based formulation provides a unified and dimension-independent description of hypercomplex-valued dense and convolutional layers and is directly compatible with modern deep learning libraries supporting optimized tensor operations. The proposed framework recovers existing constructions for four-dimensional algebras as a special case. Within this setting, a tensor-based version of the universal approximation theorem for single-layer hypercomplex-valued perceptrons is established under mild non-degeneracy assumptions on the underlying algebra, thereby providing a rigorous theoretical foundation for the considered class of neural networks.
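
The central observation, that algebra multiplication is a rank-three tensor contraction, can be sketched in a few lines. The code below is an illustration and not the authors' implementation: it builds a structure tensor `T` with `e_i * e_j = sum_k T[i][j][k] * e_k` for the quaternions and multiplies two elements as `(a * b)_k = sum_{i,j} a_i * b_j * T[i][j][k]`. The helper names are hypothetical, and the signed-table input format assumes each basis product is plus or minus a single basis element, which holds for the quaternions but not for every algebra.

```python
def structure_tensor(table):
    """Build T[i][j][k] from a table whose entry (i, j) is (sign, k): e_i * e_j = sign * e_k."""
    n = len(table)
    T = [[[0] * n for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sign, k = table[i][j]
            T[i][j][k] = sign
    return T

def multiply(a, b, T):
    """Algebra product as a tensor contraction over the two input indices."""
    n = len(T)
    return [
        sum(a[i] * b[j] * T[i][j][k] for i in range(n) for j in range(n))
        for k in range(n)
    ]

# Quaternion basis order: 1, i, j, k (indices 0..3).
QUATERNION_TABLE = [
    [(+1, 0), (+1, 1), (+1, 2), (+1, 3)],  # 1 * {1, i, j, k}
    [(+1, 1), (-1, 0), (+1, 3), (-1, 2)],  # i * {1, i, j, k}
    [(+1, 2), (-1, 3), (-1, 0), (+1, 1)],  # j * {1, i, j, k}
    [(+1, 3), (+1, 2), (-1, 1), (-1, 0)],  # k * {1, i, j, k}
]

T = structure_tensor(QUATERNION_TABLE)
i_times_j = multiply([0, 1, 0, 0], [0, 0, 1, 0], T)   # i * j
product = multiply([1, 2, 3, 4], [5, 6, 7, 8], T)     # (1+2i+3j+4k)(5+6i+7j+8k)
```

Swapping in a different table changes the algebra without touching `multiply`, which is the dimension-independent property the abstract emphasizes. In a deep learning library the same contraction would typically be written as a single einsum-style call.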

2512.04115 2026-06-18 cs.CY cs.AI cs.HC updated version

Artificial Intelligence Competence of K-12 Students Shapes Their AI Risk Perception: A Co-occurrence Network Analysis

Ville Heilala, Pieta Sikström, Mika Setälä, Tommi Kärkkäinen

Affiliations * University of Jyväskylä

Comments Accepted for Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC'26)

Details
English abstract

As artificial intelligence (AI) becomes increasingly integrated into education, understanding how students perceive its risks is essential for supporting responsible and effective adoption. This research aimed to examine the relationships between perceived AI competence and risks among Finnish K-12 upper secondary students (n = 163) by utilizing a co-occurrence analysis. Students reported their self-perceived AI competence and concerns related to AI across systemic, institutional, and personal domains. The findings showed that students with lower competence emphasized personal and learning-related risks, such as reduced creativity, lack of critical thinking, and misuse, whereas higher-competence students focused more on systemic and institutional risks, including bias, inaccuracy, and cheating. These differences suggest that students' self-reported AI competence is related to how they evaluate both the risks and opportunities associated with artificial intelligence in education (AIED). The results of this study highlight the need for educational institutions to incorporate AI literacy into their curricula, provide teacher guidance, and inform policy development to ensure personalized opportunities for utilization and equitable integration of AI into K-12 education.

2506.20869 2026-06-18 cs.SE cs.AI cs.IR updated version

Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation

Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari, Pekka Abrahamsson

Affiliations * Faculty of Information Technology and Communication Sciences, Tampere University

Comments Published in the Proceedings of the 51st Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2025. Lecture Notes in Computer Science, volume 16082, pages 143-158. Springer, 2026

Details
Journal ref
LNCS 16082, 143-158, 2026
English abstract

Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics. Each system incorporates multilingual OCR, semantic retrieval via vector embeddings, and domain-adapted LLMs, deployed through local servers or cloud APIs to meet distinct user needs. A web-based evaluation involving a total of 100 participants assessed the systems across six dimensions: (i) Ease of Use, (ii) Relevance, (iii) Transparency, (iv) Responsiveness, (v) Accuracy, and (vi) Likelihood of Recommendation. Based on user feedback and our development experience, we documented twelve key lessons learned, highlighting technical, operational, and ethical challenges affecting the reliability and usability of RAG systems in practice.

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE updated version

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

Affiliations * Yokohama National University

Comments Accepted to ACL 2025 Findings

Details
English abstract

Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .
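
For readers unfamiliar with the bandit view of strategy selection, the following is a generic Beta-Bernoulli Thompson sampling sketch, not the OPTS code (which is available at the repository linked above). The strategy names, the simulated success probabilities, and the binary reward (imagined here as "the rewritten prompt scored better than its parent") are all assumptions made for illustration.

```python
import random

def thompson_select(alpha, beta, rng):
    """Draw one sample per arm from its Beta posterior and return the best arm's index."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=samples.__getitem__)

# Hypothetical strategies and success rates, used only to simulate feedback.
strategies = ["chain_of_thought", "role_play", "step_by_step"]
true_success = [0.3, 0.5, 0.8]

rng = random.Random(0)
alpha = [1.0] * len(strategies)   # Beta(1, 1) uniform prior for every strategy
beta = [1.0] * len(strategies)
pulls = [0] * len(strategies)
rounds = 2000

for _ in range(rounds):
    arm = thompson_select(alpha, beta, rng)
    reward = 1 if rng.random() < true_success[arm] else 0   # simulated binary reward
    alpha[arm] += reward          # count of successes
    beta[arm] += 1 - reward       # count of failures
    pulls[arm] += 1

most_used = strategies[pulls.index(max(pulls))]
```

Because each arm is chosen in proportion to the posterior probability that it is the best one, selection concentrates on the strongest strategy while weaker ones are still tried occasionally. In a prompt optimizer the simulated reward line would be replaced by an evaluation of the prompt produced with the chosen strategy.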

2506.09822 2026-06-18 cs.CE cs.AI updated version

Superstudent intelligence in thermodynamics

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

Affiliations * Laboratory of Engineering Thermodynamics (LTD); Visual Information Analysis Research Group (VIA); Machine Learning Research Group (ML)

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

Details
English abstract

In this short note, we report and analyze a striking event: OpenAI's large language model o3 has outwitted all students in a university exam on thermodynamics. The thermodynamics exam is a difficult hurdle for most students, where they must show that they have mastered the fundamentals of this important topic. Consequently, the failure rates are very high, A-grades are rare - and they are considered proof of the students' exceptional intellectual abilities. This is because pattern learning does not help in the exam. The problems can only be solved by knowledgeably and creatively combining principles of thermodynamics. We have given our latest thermodynamics exam not only to the students but also to OpenAI's most powerful reasoning model, o3, and have assessed the answers of o3 exactly the same way as those of the students. In zero-shot mode, the model o3 solved all problems correctly, better than all students who took the exam; its overall score was in the range of the best scores we have seen in more than 10,000 similar exams since 1985. This is a turning point: machines now excel in complex tasks, usually taken as proof of human intellectual capabilities. We discuss the consequences this has for the work of engineers and the education of future engineers.

2504.12347 2026-06-18 cs.CL cs.AI cs.CY updated version

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

Affiliations * Faculty of Information Technology; University of Jyväskylä; Faculty of Humanities and Social Sciences

Details
English abstract

Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.

2505.03863 2026-06-18 cs.CR cs.AI updated version

Data-Driven Falsification of Cyber-Physical Systems

Atanu Kundu, Sauvik Gon, Rajarshi Ray

Affiliations * Indian Association for the Cultivation of Science

Details
English abstract

Cyber-Physical Systems (CPS) are abundant in safety-critical domains such as healthcare, avionics, and autonomous vehicles. Formal verification of their operational safety is, therefore, of utmost importance. In this paper, we address the falsification problem, where the focus is on searching for an unsafe execution in the system instead of proving their absence. The contribution of this paper is a framework that (a) connects the falsification of CPS with the falsification of deep neural networks (DNNs) and (b) leverages the inherent interpretability of Decision Trees for faster falsification of CPS. This is achieved by: (1) building a surrogate model of the CPS under test, either as a DNN model or a Decision Tree, (2) application of various DNN falsification tools to falsify CPS, and (3) a novel falsification algorithm guided by the explanations of safety violations of the CPS model extracted from its Decision Tree surrogate. The proposed framework has the potential to exploit a repertoire of adversarial attack algorithms designed to falsify robustness properties of DNNs, as well as state-of-the-art falsification algorithms for DNNs. Although the presented methodology is applicable to systems that can be executed/simulated in general, we demonstrate its effectiveness, particularly in CPS. We show that our framework, implemented as a tool FlexiFal, can detect hard-to-find counterexamples in CPS that have linear and non-linear dynamics. Decision tree-guided falsification shows promising results in efficiently finding multiple counterexamples in the ARCH-COMP 2024 falsification benchmarks [khandait2024arch].

2406.15537 2026-06-18 q-bio.NC cs.AI cs.SD eess.AS updated version

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

Affiliations * Department of Biomedicine and Prevention, University of Rome Tor Vergata; A.A. Martinos Center for Biomedical Imaging, Harvard Medical School/MGH, Boston (US)

Comments The first two authors contributed equally to this work

Details
Journal ref
Neural Networks, 203, 109195 (2026)
English abstract

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.