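arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.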

2510.05107 2026-06-18 cs.AI Version update

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Myung Ho Kim

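Affiliations: JEI University

AI summary: Proposes the Structured Cognitive Loop (SCL) architecture, which separates cognition, memory, control, and action into distinct modules to make LLM agent behavior accountable. Across 360 episodes it reaches an 86.3% task success rate, outperforming the baselines.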

Comments This revised version extends the original SCL framework from a behavioral architecture for reliable LLM agents into a broader architecture of epistemic accountability, integrating context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment structure

Abstract

The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint guided image generation. Across 360 episodes, SCL achieves 86.3 percent task success compared with 70.5 to 76.8 percent for prompt based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context aware Human in the Loop control, Pool Gated Retrieval, and the Horizon Warrant Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.
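
The propose/authorize split described in this abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Proposal` and `Controller` names, the string-keyed memory, and the verdict strings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Proposal:
    """An action suggested by the language model (hypothetical structure)."""
    tool: str
    args: tuple
    requires: frozenset = frozenset()  # memory keys that must already exist


@dataclass
class Controller:
    """Gate that sits between the model's proposals and tool execution."""
    memory: dict = field(default_factory=dict)  # verified state
    trace: list = field(default_factory=list)   # inspectable decision log

    def step(self, proposal, tools):
        key = f"{proposal.tool}{proposal.args}"
        missing = sorted(k for k in proposal.requires if k not in self.memory)
        if missing:
            # Precondition check: refuse before any tool is touched.
            verdict = "rejected: missing " + ", ".join(missing)
        elif key in self.memory:
            # Redundancy check: reuse the stored result instead of calling again.
            verdict = "skipped: result already in memory"
        else:
            # Authorized: run the tool and store its result as verified state.
            self.memory[key] = tools[proposal.tool](*proposal.args)
            verdict = "executed"
        self.trace.append((key, verdict))
        return verdict
```

Because every verdict is appended to `trace`, a reviewer can see afterwards why each action was permitted, skipped, or refused, which is the kind of inspectability the abstract argues for.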

2603.00656 2026-06-18 cs.AI Version update

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

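Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary: Proposes the ActMem framework, which converts unstructured dialogue history into a structured causal and semantic graph and combines counterfactual reasoning with commonsense completion to perform active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.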

Abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2603.29247 2026-06-18 cs.CL cs.AI cs.LG Version update

MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong

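Affiliations: Santa Clara University; Independent Researcher

AI summary: Proposes the MemRerank framework, which uses reinforcement learning to distill user purchase history into query-independent preference memory for personalized reranking by LLM shopping agents, improving accuracy on the 1-in-5 selection task by up to 10.61 percentage points.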

Comments correct author name in metadata

Abstract

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

2605.30880 2026-06-18 cs.CL cs.AI Version update

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, yielding symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success on AgentGym environments.

Comments 40 pages

Abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.

2505.12369 2026-06-18 cs.AI cs.LG cs.LO Version update

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

Fernando Zhapa-Camacho, Robert Hoehndorf

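Affiliations: KAUST Center of Excellence for Smart Health (KCSH); KAUST Center of Excellence for Generative AI

AI summary: Proposes GeometrE, which maps logical operations to purely geometric transformations and introduces a transitive loss function, improving multi-hop reasoning performance while preserving interpretability.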

Comments Accepted at ESWC 2026

DOI: 10.1007/978-3-032-25156-5_14
Journal ref: The Semantic Web. ESWC 2026. Lecture Notes in Computer Science, vol 16549. Springer, Cham (2026)

Abstract

Multi-hop logical reasoning on knowledge graphs requires faithfully mapping the logical semantics to latent space. Current geometric embedding methods have proven useful for this task by mapping entities to geometric regions and logical operations to latent transformations. While a geometric embedding can provide a direct interpretability framework for query answering, current methods have only leveraged the geometric construction of entities, failing to map logical operations to pure geometric transformations and, instead, using neural components to learn these operations. On the other hand, purely neural-based methods outperform geometric methods, but they lack interpretability in the latent space. We introduce GeometrE, a geometric embedding method for multi-hop reasoning that maps every logical operation to a purely geometric operation in the latent space. Additionally, we introduce a transitive loss function and show that, unlike existing methods, it can preserve the logical rule that, for all a, b, c, r(a,b) and r(b,c) imply r(a,c). Our experiments show that GeometrE outperforms current state-of-the-art geometric methods and remains competitive with existing neural-based methods on standard benchmark datasets.
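
As a small illustration of how a geometric relation can satisfy the transitivity rule by construction, the sketch below uses interval containment. This is an assumption chosen for simplicity and is not GeometrE's actual construction or loss; the `regions` values are made up.

```python
from itertools import product

# Hypothetical 1-D regions: each entity is an interval (lo, hi).
regions = {"a": (2.0, 3.0), "b": (1.0, 4.0), "c": (0.0, 5.0), "d": (6.0, 7.0)}


def r(x, y):
    """r(x, y) holds when x's interval lies inside y's interval."""
    (lx, hx), (ly, hy) = regions[x], regions[y]
    return ly <= lx and hx <= hy


def transitive(names):
    """Check that r(a,b) and r(b,c) imply r(a,c) over all triples."""
    return all(r(a, c)
               for a, b, c in product(names, repeat=3)
               if r(a, b) and r(b, c))


print(transitive(regions))  # True: containment is transitive for any intervals
```

A learned embedding has no such guarantee unless the relation is tied to a geometric order like this one, which is the gap a transitive loss is meant to close.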

2605.16385 2026-06-18 cs.CV cs.AI cs.CL Version update

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

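Affiliations: Xi’an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd

AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank for rigorous reasoning on solid geometry problems, reaching SOTA performance on SolidFGeo2k and MathVerse-Solid.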

Comments Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which has two steps: first parsing, then reasoning. In the parsing step, we use conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2605.22142 2026-06-18 cs.LG cs.AI Version update

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Taewoon Kim, Vincent François-Lavet, Michael Cochez

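AI summary: Studies short-term-to-long-term memory transfer for knowledge graphs under partial observability and proposes a neuro-symbolic value-based decision method that decides whether to keep or drop each observed triple before long-term insertion, improving memory efficiency and outperforming symbolic and neural baselines on the RoomKG benchmark.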

Abstract

Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.
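
A per-item Q-learning design with shared parameters can be sketched as follows. This is only an illustration under stated assumptions: the linear Q-function, the feature vectors, the reward, and the `PerItemQ` name are hypothetical, and the abstract does not specify the paper's actual network or update rule beyond "per-item" and "temporal-difference".

```python
import random

ACTIONS = ("keep", "drop")


class PerItemQ:
    """Linear Q-function shared by every item in the short-term buffer."""

    def __init__(self, n_features, lr=0.1, gamma=0.9):
        self.w = {a: [0.0] * n_features for a in ACTIONS}
        self.lr, self.gamma = lr, gamma

    def q(self, x, a):
        return sum(wi * xi for wi, xi in zip(self.w[a], x))

    def act(self, buffer, eps=0.0, rng=random):
        """One keep/drop decision per item, whatever the buffer size."""
        return [rng.choice(ACTIONS) if rng.random() < eps
                else max(ACTIONS, key=lambda a: self.q(x, a))
                for x in buffer]

    def td_update(self, x, a, reward, x_next=None):
        """One-step TD update for one item; x_next is the same item matched
        at the next step, or None if no match exists."""
        target = reward
        if x_next is not None:
            target += self.gamma * max(self.q(x_next, b) for b in ACTIONS)
        err = target - self.q(x, a)
        self.w[a] = [wi + self.lr * err * xi for wi, xi in zip(self.w[a], x)]
        return err
```

Because the same weights score every item, the buffer can grow or shrink between steps without changing the model, which is the point of deciding item by item.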

2606.06133 2026-06-18 cs.SE cs.AI cs.LG cs.LO Version update

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

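Affiliations: Department of Computer Science, Loyola University Chicago

AI summary: Proposes the TLA-Prover model, which combines supervised fine-tuning with repair-based group-relative policy optimization for TLA+ specification synthesis checked by the TLC model checker; its Gold/Diamond pass rate reaches 30%, roughly 3.5 times the untuned baseline.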

Comments 12 pages, 5 tables, 3 figures. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026)

Abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
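
The four-tier grading and the headline arithmetic can be written out as a short sketch. It assumes the tiers are cumulative (each requires the ones below it), which the abstract implies but does not state outright; the function and argument names are hypothetical.

```python
def grade(parses, no_warnings, passes_tlc, mutant_violation_detected):
    """Return the highest tier reached, assuming tiers are cumulative."""
    tier = "None"
    checks = [("Bronze", parses), ("Silver", no_warnings),
              ("Gold", passes_tlc), ("Diamond", mutant_violation_detected)]
    for name, ok in checks:
        if not ok:
            break
        tier = name
    return tier


def pass_at_1(solved, total):
    """Fraction of problems solved on the first attempt."""
    return solved / total


print(grade(True, True, True, False))  # Gold: TLC passes, but the altered
                                       # property is not caught
rate = pass_at_1(9, 30)                # 0.3, i.e. 30%
ratio = (100 * 9 / 30) / 8.6           # about 3.49, the "roughly 3.5x"
```

The Diamond check is a mutation test: if TLC still passes after the property is altered, the property was not constraining the specification at all.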

2402.08128 2026-06-18 cs.AI cs.GT Version update

Recursive Joint Simulation in Games

Vojtech Kovarik, Caspar Oesterheld, Vincent Conitzer

Affiliations: Foundations of Cooperative AI Lab (FOCAL), Computer Science Department; Carnegie Mellon University; AI Center; Czech Technical University; Center for Theoretical Study; Charles University

AI summary: Studies cooperation between AI agents through recursive joint simulation and proves that the interaction is equivalent to an infinitely repeated version of the original game, so existing results such as the folk theorems transfer directly.

Abstract

Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds "from the inside", that is, for an agent that finds itself inside the game and has self-locating uncertainty.
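
To see why a small failure chance keeps the recursion finite, the sketch below samples the nesting depth. It reads "a small chance of failure" as an independent stopping probability `p_fail` at each level, which is an assumption; under it the depth is geometrically distributed with mean (1 - p_fail) / p_fail. The function names are hypothetical.

```python
import random


def nesting_depth(p_fail, rng):
    """Number of nested simulations before the chain stops (needs p_fail > 0)."""
    depth = 0
    while rng.random() >= p_fail:  # each level spawns another with prob. 1 - p_fail
        depth += 1
    return depth


def mean_depth(p_fail, n=20000, seed=0):
    """Monte Carlo estimate of the expected nesting depth."""
    rng = random.Random(seed)
    return sum(nesting_depth(p_fail, rng) for _ in range(n)) / n
```

With `p_fail = 0.1` the estimate should land near 9. A chain that continues with a fixed probability at each level has the same shape as a repeated game with a fixed continuation probability, which may help build intuition for the equivalence stated in the abstract.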

2508.21720 2026-06-18 cs.AI Version update

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

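Affiliations: Graduate School of Artificial Intelligence, KAIST; School of Integrated Technology, Yonsei University

AI summary: Proposes PosterForest, a training-free framework for scientific poster generation that represents document structure hierarchically as a Poster Tree and uses content and layout agents for hierarchical reasoning and recursive refinement, jointly optimizing content and layout to improve semantic coherence, logical flow, and visual balance.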

Comments ACL 2026

Abstract

Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.

2606.15504 2026-06-18 cs.AI Version update

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang

Affiliations: Medical AI Lab, The First Hospital of Hebei Medical University; Hebei Provincial Engineering Research Center for AI-Based Cancer Treatment Decision-Making, The First Hospital of Hebei Medical University; State Key Laboratory of Neurology and Oncology Drug Development; School of Computing, University of Georgia; Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital and Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College; Department of Radiation Oncology, Mayo Clinic; College of Mechanical and Power Engineering, China Three Gorges University; Department of Radiation Oncology, Guangzhou Concord Cancer Center; Gastrointestinal Disease Diagnosis and Treatment Center, The First Hospital of Hebei Medical University; Department of General Surgery, The First Hospital of Hebei Medical University

AI summary: Proposes the VIBEMed multi-agent framework, which uses a self-evolution mechanism and an architecture-level safety sandbox to learn dynamically from interaction history and provide personalized clinical decision support.

DOI: 10.1016/j.metrad.2026.100223

Abstract

In recent years, advances in large language models and autonomous agents have revolutionized the healthcare field, facilitating diagnosis and improving treatment results. However, most existing AI systems rely on pre-trained knowledge and predefined pipelines, which struggle to learn dynamically from the interactive chat session history that contains patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi-agent framework with a built-in self-evolution mechanism and architecture-level safety sandbox for robust clinical decision support. The system integrates three specialized agents: a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge. Together they transform multimodal patient information into personalized medical decisions. Through its self-evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed demonstrates superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision-making and longitudinal planning. The framework also supports reliable end-to-end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real-world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience-driven clinical decision support, demonstrating the value of combining multi-agent collaboration with continuous evolution for advancing precision medicine.

2506.09046 2026-06-18 cs.LG cs.AI cs.MA Version update

Self-Evolving Multi-Agent Systems via Textual Backpropagation

Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze

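Affiliations: Ludwig Maximilian University of Munich; Technical University of Munich; Munich Center for Machine Learning; University of Notre Dame

AI summary: Proposes the Agentic Neural Network framework, which models multi-agent collaboration as a layered neural network; a forward phase decomposes tasks and a backward phase propagates feedback so that agents self-evolve their roles, prompts, and coordination, outperforming existing methods on seven benchmark datasets.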

Abstract

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

2510.18085 2026-06-18 cs.RO cs.AI cs.MA Version update

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

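Affiliations: Kahlert School of Computing, University of Utah; DEVCOM Army Research Laboratory

AI summary: Proposes R2BC, which trains multi-robot systems from round-robin single-agent demonstrations without requiring demonstrations in the joint action space. It matches or surpasses a baseline trained on privileged synchronized demonstrations across four simulated tasks and is also deployed on two physical robot tasks.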

Comments 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)

Abstract

Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.

2510.27353 2026-06-18 cs.AI Version update

An In-depth Study of LLM Contributions to the Bin Packing Problem

Julien Herrmann, Guillaume Pallez

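Affiliations: CNRS-IRIT; Inria

AI summary: Analyzes the heuristics generated by LLMs and finds them human-readable yet hard to interpret, then proposes simpler and more efficient algorithms, calling into question the LLMs' actual contribution to the bin packing problem.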

Comments Accepted for publication in ACM Transactions on Evolutionary Learning and Optimization

DOI: 10.1145/3821574

Abstract

Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.


2602.23092 2026-06-18 cs.AI 版本更新

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

通过LLM驱动的自动启发式设计增强CVRP求解器

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

发表机构 * Southern University of Science and Technology（南方科技大学）； City University of Hong Kong（香港城市大学）

AI总结提出AILS-AHD方法，结合进化搜索框架与大语言模型动态生成和优化破坏启发式，并引入加速机制，在中等和大规模CVRP实例上优于现有求解器，在CVRPLib大规模基准中10个实例上取得8个新最优解。

详情

AI中文摘要

容量受限车辆路径问题（CVRP）是一个基本的组合优化挑战，专注于在车辆容量约束下优化车队运营。尽管在运筹学中得到了广泛研究，CVRP的NP-hard性质仍然带来显著的计算挑战，特别是对于大规模实例。本研究提出了AILS-AHD（自适应迭代局部搜索与自动启发式设计），一种利用大语言模型（LLMs）革新CVRP求解的新方法。我们的方法将进化搜索框架与LLMs集成，在AILS方法中动态生成和优化破坏启发式。此外，我们引入了一种基于LLM的加速机制以提高计算效率。针对最先进的求解器（包括AILS-II和HGS）的综合实验评估表明，AILS-AHD在中等和大规模实例上均表现出优越性能。值得注意的是，我们的方法在CVRPLib大规模基准的10个实例中为8个建立了新的最佳已知解，突显了LLM驱动的启发式设计在推进车辆路径优化领域的潜力。

英文摘要

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.


2605.29649 2026-06-18 cs.AI 版本更新

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

LLM进化的符号AI规划领域无关启发式

Elliot Gestrin, Jendrik Seipp

AI总结本文使用进化搜索让大语言模型生成领域无关的启发式函数，在未见测试域上超越手工最优启发式，并首次系统评估了启发式的信息性-速度权衡。

Comments Accepted at the LM4Plan workshop at ICAPS 2026

详情

AI中文摘要

启发式搜索是符号AI规划中的主导范式，最强的启发式是规划研究者数十年工作的成果。最近的工作表明，大型语言模型（LLM）可以为单个规划领域设计启发式，但迄今为止，没有LLM生成的启发式能在任意规划任务上工作。在本文中，我们使用进化搜索来产生第一个LLM生成的领域无关启发式，其超越了手工最优的现有技术。我们让LLM变异用C++编写的父启发式，将候选解存储在MAP-Elites档案中，以信息性和速度作为键，并通过混合覆盖率和求解时间计算适应度分数。为了将进化程序置于上下文中，我们还额外基准测试了一组广泛的手工启发式在信息性-速度权衡上的表现，据我们所知，这之前从未做过。在未见测试域上，我们最好的进化启发式比最强基线解决了更多任务，我们的完整启发式套件跨越了所述权衡的帕累托前沿。我们还发现，从平凡的盲目启发式开始进化优于从强FF启发式开始，即使最终程序本身是FF变体，并且LLM推理努力影响候选编译成功的频率远大于影响那些编译成功的候选的质量。由于进化程序是纯C++，它们可以作为即插即用替代品插入现有规划器，并继承底层搜索的健全性和完备性保证。

英文摘要

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.


2411.16206 2026-06-18 cs.LG cs.AI cs.NE 版本更新

Scalable Batch Bayesian Optimization Via Subspace Acquisition Functions

可扩展的批量贝叶斯优化：基于子空间采集函数

Dawei Zhan, Zhaoxi Zeng, Shuoxiao Wei, Ping Wu

发表机构 * School of Computing and Artificial Intelligence（计算与人工智能学院）

AI总结提出通过从原始问题的轴对齐子空间中各选一点来扩展贝叶斯优化至大规模批量评估，显著加速收敛，与十种批量算法相比极具竞争力。

详情

DOI: 10.1145/3820495
Journal ref: ACM Transactions on Evolutionary Learning and Optimization, 2026

AI中文摘要

将贝叶斯优化扩展到批量评估可以使设计者充分利用并行计算技术。然而，当前大多数批量方法在批量大小增大时扩展性不佳，优化效率往往下降。为解决此问题，本文提出一种简单高效的方法，将贝叶斯优化扩展到大规模批量评估。与现有批量方法不同，新方法的思想是从原始问题中抽取一批轴对齐子空间，并使用现有采集函数从每个子空间中选择一个点。数值实验表明，与顺序贝叶斯优化算法相比，我们提出的方法显著加速收敛，并且与十种批量贝叶斯优化算法相比表现非常有竞争力。我们提出的方法的实现可在 https://github.com/zhandawei/SubSpace_Acquisition_Functions 获取。

英文摘要

Extending Bayesian optimization to batch evaluation can enable the designer to make the most use of parallel computing technology. However, most current batch approaches do not scale well with the batch size. That is, their optimization efficiencies often deteriorate as the batch size increases. To address this issue, we propose a simple and efficient approach to extend Bayesian optimization to large-scale batch evaluation in this work. Different from existing batch approaches, the idea of the new approach is to draw a batch of axis-aligned subspaces of the original problem and select one point from each subspace using existing acquisition functions. Numerical experiments show that our proposed approach speeds up convergence significantly when compared with the sequential Bayesian optimization algorithm, and performs very competitively when compared with ten batch Bayesian optimization algorithms. The implementation of our proposed approach is available at https://github.com/zhandawei/SubSpace_Acquisition_Functions.
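The subspace idea in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the coordinates outside a subspace are set, so the sketch fixes them to a hypothetical `anchor` point (for example the current best design), uses plain random search as the inner optimizer, and substitutes a toy function for a real acquisition function such as expected improvement. All function names are hypothetical.

```python
import random

def draw_axis_aligned_subspace(dim, sub_dim, rng):
    """Pick `sub_dim` coordinate axes out of `dim`; these are the free dimensions."""
    return sorted(rng.sample(range(dim), sub_dim))

def select_in_subspace(acq, anchor, free_dims, bounds, rng, n_candidates=200):
    """Maximize `acq` by random search, varying only `free_dims`.

    Assumption: all other coordinates stay fixed at `anchor`.
    """
    best_x, best_val = list(anchor), acq(anchor)
    for _ in range(n_candidates):
        x = list(anchor)
        for d in free_dims:
            lo, hi = bounds[d]
            x[d] = rng.uniform(lo, hi)
        val = acq(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x

def propose_batch(acq, anchor, bounds, batch_size, sub_dim, seed=0):
    """Draw one subspace per batch slot and select one point from each."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        free_dims = draw_axis_aligned_subspace(len(anchor), sub_dim, rng)
        point = select_in_subspace(acq, anchor, free_dims, bounds, rng)
        batch.append((free_dims, point))
    return batch

def toy_acquisition(x):
    """Stand-in for a real acquisition function; peaks at 0.7 in every coordinate."""
    return -sum((xi - 0.7) ** 2 for xi in x)

bounds = [(0.0, 1.0)] * 6
anchor = [0.2] * 6
batch = propose_batch(toy_acquisition, anchor, bounds, batch_size=4, sub_dim=2)
for free_dims, point in batch:
    print(free_dims, [round(v, 3) for v in point])
```

Each inner search runs over only `sub_dim` coordinates, so the per-point cost does not grow with the full dimension, and the batch slots are independent of one another and could be filled in parallel.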


2606.14202 2026-06-18 cs.NE cs.AI 版本更新

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

MeEvo: 元认知进化与自然进化相结合用于自动启发式设计

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

发表机构 * School of Computer Science, University of Nottingham Ningbo China（诺丁汉大学宁波分校计算机科学学院）； School of Computer Science, University of Nottingham（诺丁汉大学计算机科学学院）

AI总结提出MeEvo框架，通过循环耦合自然进化（探索启发式代码）和元认知进化（反思历史生成改进启发式），解决现有方法知识继承弱、探索不足的问题，在五个优化问题上表现更优。

详情

AI中文摘要

大型语言模型（LLMs）通过推理和代码合成实现启发式生成，推动了自动启发式设计（AHD）的发展。现有的基于LLM的AHD架构主要遵循两种范式：自然进化，它使用交叉和变异来探索启发式程序；以及元认知进化，它通过反思来改进推理。然而，自然进化丢弃了推理轨迹，削弱了知识继承和利用，而元认知进化缺乏种群级别的重组，限制了探索并增加了过早收敛的风险。这些局限性降低了复杂问题的搜索效率、稳定性和解的质量。为了解决这一差距，我们提出了MeEvo，一种双层AHD框架，它循环耦合自然进化和元认知进化。自然进化探索启发式代码，同时将推理轨迹、适应度值和错误记录到共享历史中；然后元认知进化反思该历史以生成改进的启发式，这些启发式重新进入父代池以进行下一轮循环。这种设计使得种群驱动的探索和反思驱动的改进相互加强。在五个优化问题上的实验（使用两个LLM骨干）表明，MeEvo比现有的基于LLM的AHD架构实现了更强且更稳定的性能，尤其是在复杂约束任务上。

英文摘要

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.


2602.06774 2026-06-18 cs.AI 版本更新

Towards Understanding What State Space Models Learn About Code

理解状态空间模型在代码中学到了什么

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

发表机构 * TU Darmstadt（达姆施塔特工业大学）； Hessian Center for Artificial Intelligence（黑森人工智能中心）； National Research Center for Applied Cybersecurity ATHENE（应用网络安全国家研究中心ATHENE）

AI总结本文首次系统分析状态空间模型（SSM）在代码理解中的学习机制，发现SSM在预训练时比Transformer更有效捕获语法和语义结构，但微调时会遗忘某些关系，并提出SSM-Interpret框架和架构改进，将NLCodeSearch的MRR提升高达6。

详情

AI中文摘要

状态空间模型（SSM）已成为Transformer架构的高效替代方案。先前工作表明，在可比条件下训练时，SSM在代码理解任务上可以匹配或超越Transformer。然而，其内部机制仍是一个黑箱。我们首次系统分析了基于SSM的代码模型所学到的内容，并在此领域直接比较了SSM和Transformer模型。我们的分析表明，SSM在预训练期间比Transformer更有效地捕获了语法和语义结构，但在某些任务的微调过程中会遗忘某些关系。为了研究这种行为，我们引入了SSM-Interpret，一个频域框架，揭示了微调期间向短程依赖的频谱偏移。在这些发现的指导下，我们提出了架构修改，将基于SSM的代码模型在NLCodeSearch上的性能显著提升了高达+6 MRR。这表明我们的分析不仅解释了模型行为，而且直接导致了更好的设计。

英文摘要

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.


2603.09344 2026-06-18 cs.AI stat.ML 版本更新

Robust Regularized Policy Iteration under Transition Uncertainty

转移不确定性下的鲁棒正则化策略迭代

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China（浙江大学计算机科学与技术学院）； School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China（西北工业大学人工智能、光学与电子学院（iOPEN））； School of Software Technology, Zhejiang University, Hangzhou, China（浙江大学软件技术学院）； School of Software Engineering, Xi'an Jiaotong University, Xi'an, China（西安交通大学软件工程学院）； School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China（中山大学系统科学与工程学院）

AI总结提出鲁棒正则化策略迭代（RRPI），通过将离线强化学习建模为鲁棒策略优化，使用KL正则化替代难解的双层目标，并基于鲁棒正则化贝尔曼算子实现高效策略迭代，理论保证收敛性，实验在D4RL基准上表现优异。

详情

AI中文摘要

离线强化学习（RL）无需在线探索即可实现数据高效且安全的策略学习，但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对，其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性，我们将离线RL建模为鲁棒策略优化，将转移核视为不确定性集内的决策变量，并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代（RRPI），用可处理的KL正则化替代难解的最大-最小双层目标，并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证，证明所提出的算子是$\gamma$-压缩算子，且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明，RRPI实现了强大的平均性能，在大多数环境中优于包括基于百分位数方法在内的最新基线，并在其余环境中保持竞争力。此外，RRPI通过将较低的$Q$值与高认知不确定性对齐，展现出鲁棒性能，从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $γ$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.


2606.11918 2026-06-18 cs.AI 版本更新

HeRo-Q: 通过Hessian条件化实现稳定低比特量化的通用框架

Jinhao Zhang, Yunquan Zhang, Zicheng yan, Boyang Zhang, Jun Sun, Daning Cheng

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； University of Science and Technology of China（中国科学技术大学）； Zhejiang Lab（浙江实验室）； Peng Cheng Laboratory（鹏城实验室）

AI总结针对后训练量化中“低误差、高损失”的矛盾，提出HeRo-Q算法，通过轻量可学习的旋转压缩矩阵重塑损失景观，降低最大Hessian特征值，增强对量化噪声的鲁棒性，在Llama和Qwen模型上优于现有方法。

详情

AI中文摘要

后训练量化（PTQ）是一种主流的模型压缩技术，但由于其仅专注于最小化量化误差，常常导致矛盾的“低误差、高损失”现象。根本原因在于LLM损失景观的Hessian矩阵：少数高曲率方向对扰动极其敏感。为了解决这个问题，我们提出了Hessian鲁棒量化（HeRo-Q）算法，该算法在量化前对权重空间应用一个轻量级、可学习的旋转压缩矩阵。这个联合框架通过降低Hessian的最大特征值来重塑损失景观，从而显著增强对量化噪声的鲁棒性。HeRo-Q不需要修改架构，计算开销可忽略不计，并且可以无缝集成到现有的PTQ流程中。在Llama和Qwen模型上的实验表明，HeRo-Q在标准W4A8设置下不仅持续优于包括GPTQ、AWQ和SpinQuant在内的最先进方法，而且在极具挑战性的W3A16超低比特场景中表现出色，将Llama3 8B在GSM8K上的准确率提升至70.15%，并有效避免了激进量化中常见的逻辑崩溃。

英文摘要

Post Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical 'low error, high loss' phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state of the art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings, but also excelling in the highly challenging W3A16 ultra low bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.


2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph 版本更新

LLM Compression by Block Removal with Constrained Binary Optimization

通过带约束二进制优化的块移除进行LLM压缩

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

发表机构 * Multiverse Computing（多维计算公司）； Donostia International Physics Center（多斯蒂亚国际物理中心）； Ikerbasque Foundation for Science（伊克尔巴斯克科学基金会）

AI总结提出将大语言模型块移除压缩问题建模为约束二进制优化，映射到Ising玻璃系统，实现高效排序和高质量非连续块移除，在50%压缩时MMLU提升近23个百分点，且计算高效、通用性强。

Comments 16 pages, 3 figures

详情

AI中文摘要

在本文中，我们将通过最优删除Transformer块（“块移除”）来压缩大语言模型（LLM）的问题，表述为一个约束二进制优化（CBO）问题，该问题可以映射到物理系统（Ising玻璃），其能量是下游模型性能的强代理。这种表述使得能够高效地对大量候选块移除配置进行排序，产生许多高质量、非平凡的解决方案，而不仅仅是移除连续区域。我们的方法在深度压缩场景中表现强劲，例如在Llama-3.3-70B-Instruct的50%压缩中，与其他最先进的块移除方法相比，我们在MMLU基准上取得了近23个百分点的提升。对于较轻的压缩，它在多个基准上与这些方法表现相当，适用于Llama-3.1-8B-Instruct、Qwen3-14B（重训练前后）以及Llama-3.3-70B-Instruct。该方法计算效率高，仅需在校准数据集上对少数活跃参数进行前向和反向传播。此外，我们证明，当无法精确求解CBO问题时，使用良好的启发式求解器可以在可忽略的运行时间内提供在下游任务上表现良好的解决方案。该方法可以轻松应用于任何架构。我们在最近的NVIDIA-Nemotron-3-Nano-30B-A3B-FP8模型上展示了这种通用性，该模型具有高度不均匀且具有挑战性的块结构，并且在移除2个注意力层或3个混合专家层时，我们在AIME25和GPQA上超越了最先进水平。

英文摘要

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations, yielding many high-quality, non-trivial solutions beyond those only removing consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is infeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.


2602.00176 2026-06-18 cs.CV cs.AI 版本更新

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结提出后验延续框架，根据扩散噪声水平逐步暴露测量频率，结合稳定采样器实现超分辨率、修复和去模糊的先进性能。

详情

超越相似性：时间序列分析中的时序操作注意力

Jevon Twitty, Vinh Pham, Nitiwith Rotchanarak, Viresh Pati, Yubin Kim, Shihao Yang, Jiecheng Lu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结本文提出时序操作注意力（TOA），通过引入可学习的操作符增强注意力机制，以更有效地处理时间序列数据中的符号和振荡变换，提升时间序列预测、异常检测和分类任务的性能。

详情

AI中文摘要

时间序列预测中存在一个持久性悖论：结构简单的MLP和线性模型往往优于高容量的Transformer。我们指出，这种差距源于序列建模基本原理的不匹配：尽管许多时间序列动态由全局时间操作符（如滤波和谐波结构）主导，标准注意力将每个输出视为输入的凸组合。这限制了其表示带符号和振荡变换的能力，这些能力对于时间信号处理至关重要。我们正式将这一限制定义为softmax注意力中的单纯形约束混合瓶颈，这对由操作符驱动的时间序列任务限制尤为明显。为了解决这一问题，我们提出时序操作注意力（TOA），一种通过显式、可学习的序列空间操作符增强注意力的框架，使跨时间的直接带符号混合成为可能，同时保持输入依赖的适应性。为了使密集的N×N操作符实用化，我们引入了随机操作符正则化，一种高方差的dropout机制，它稳定了训练并防止了平凡的记忆。在预测、异常检测和分类基准上，TOA在集成到标准骨干如PatchTST和iTransformer时始终提高了性能，尤其是在重建密集任务中表现尤为突出。这些结果表明，显式操作符学习是有效时间序列建模的关键要素。

英文摘要

A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose Temporal Operator Attention (TOA), a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense $N \times N$ operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
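The simplex-constrained bottleneck described in this abstract can be checked numerically. The sketch below is a toy illustration, not the TOA method itself (whose exact form is not given here). It shows that a single softmax-attention output over scalar values is a convex combination and therefore stays between the smallest and largest value, whereas a signed operator row such as a first difference can leave that range. The helper names and the example numbers are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_output(scores, values):
    """One softmax-attention output over scalar values (a convex combination)."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

def apply_operator_row(row, values):
    """One row of an explicit sequence-space operator; entries may be negative."""
    return sum(c * v for c, v in zip(row, values))

values = [1.0, 3.0, 2.0]

# Whatever the scores are, the attention output lies in [min(values), max(values)].
attn = attention_output([0.5, -1.0, 2.0], values)

# A first-difference row (x[2] - x[1]) mixes with signed coefficients.
diff = apply_operator_row([0.0, -1.0, 1.0], values)

print(min(values) <= attn <= max(values))
print(diff)
```

The difference evaluates to 2.0 - 3.0 = -1.0, which is below every input value, so no choice of softmax scores can reproduce it from these values alone.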


2605.12713 2026-06-18 quant-ph cs.AI 版本更新

Controllable Quantum Memory Capacity in Quantum Reservoir Networks with Tunable partial-SWAPs

量子储备池网络中可控的量子记忆容量：可调部分SWAP

Erik L. Connerty, Ethan N. Evans

发表机构 * University of South Carolina - Columbia（南卡罗来纳大学哥伦比亚分校）； Qodex Quantum（Qodex量子）

AI总结本文提出一种可调部分SWAP机制，用于控制量子储备池网络中记忆衰减速率，通过模拟和IBM QPU验证，提升了噪声中间尺度量子处理器的性能。

Comments 14 pages, 9 figures

详情

AI中文摘要

在量子储备池计算领域，许多不同的计算模型和架构已被提出。从这些模型中，我们识别出基于反馈的模型和递归模型作为两种主要竞争架构。本文在递归架构基础上，提出了一种双寄存器方法，使量子储备池计算具有衰减记忆。虽然这些方法已在硬件上验证并展示了在噪声中间尺度量子处理器上的优异性能，但记忆容量的确切机制尚不完全理解或完全可控。为此，我们扩展了递归方法，提出了一种硬件可实现的可调部分SWAP机制，允许从基于门的量子处理器上实现的量子储备池网络直接控制记忆衰减速率。该机制的理论基于受控振幅阻尼通道，并通过随机短期记忆容量（STMC）回忆基准和NARMA-5数据集的验证实验进行验证，分别使用模拟和IBM QPU进行测试。

英文摘要

In the field of quantum reservoir computing (QRC), many different computational models and architectures have been proposed. From these models, we identify feedback-based models -- which use a feedback mechanism to re-embed classical measurements from the QRC -- and recurrent models -- which use a multi-register approach with memory and readout qubits -- as the two major competing architectures that have been discussed and validated on hardware. In this paper, we advance upon the recurrent architectures, which employ a two register approach to endow the QRC with a fading memory. While these approaches have been validated on hardware and have demonstrated great real-world performance on noisy-intermediate-scale-quantum (NISQ) quantum processing units (QPUs), the exact mechanism through which the memory capacity arises is not completely understood or fully controllable. With this, we augment the recurrent approaches and present a hardware-realizable mechanism, which we call a tunable partial-SWAP, that allows for the direct control of the rate of memory dissipation from a QRN implemented on a gate-based QPU. The theory behind this mechanism is discussed in terms of a controlled amplitude-damping channel and validation experiments using a randomized short-term memory capacity (STMC) recall benchmark and the NARMA-5 dataset are conducted using simulation and IBM QPUs, respectively.


2606.06564 2026-06-18 cs.LG cs.AI 版本更新

HAARES Half-Split Residual Basis Routing for Deep Transformers

WAV：面向深度仅解码器Transformer的多分辨率块残差路由

Kehan Wang

发表机构 * Chongqing University（重庆大学）

AI总结提出WAV v1方法，通过为每个块增加方向性细节基（相位基和分裂基）来增强残差路由，在深层Transformer中优于现有方法，48层时在TinyStories和Text8上取得更低验证损失。

Comments 6 pages, 4 figures, 3 tables

详情

AI中文摘要

残差连接对于训练深度Transformer至关重要，但标准的PreNorm残差流以固定的单位权重聚合子层更新。最近的注意力残差用内容相关的深度路由替代了这种固定累积，而块注意力残差通过对块级残差摘要进行路由使机制高效。然而，单个块摘要仅存储块内的低频总残差位移，丢弃了方向性结构，例如注意力与MLP的不平衡以及早期与晚期块的动态。我们提出WAV v1，一种用于仅解码器Transformer的轻量级多分辨率残差路由方法。WAV v1不是仅通过累积残差和来表示每个块，而是为每个块增加两个方向性细节基：一个对比注意力和MLP更新的相位基，以及一个对比早期和晚期子层更新的分裂基。这些基与标准块摘要一起通过相同的深度softmax混合器进行路由，而负细节源初始化和分离的RMS匹配稳定了训练。在字符级TinyStories和Text8语言建模中，WAV v1显示出明显的深度相关优势。尽管在12层时并非始终有益，但在24层时变得有竞争力，并在48层时优于所有基线。在48层时，WAV v1将TinyStories上的验证损失从0.4960降至0.4738，Text8上从0.9363降至0.9305，且额外参数可忽略。这些结果表明，方向性残差细节（而不仅仅是块级和）对于在更深Transformer中扩展残差路由很重要。

英文摘要

Block-level residual routing makes learned residual aggregation practical by routing over block summaries, but each summary compresses an ordered sequence of attention and MLP updates into one cumulative vector. We propose HAARES, a lightweight residual basis router that keeps the cumulative block source and adds one half-split detail basis, computed as the difference between first-half and second-half residual updates. The detail basis is RMS-matched and updated online, exposing coarse intra-block trajectory information without dense sublayer-level routing. Across OpenWebText, cross-domain character-level benchmarks, and BPE-tokenized OpenWebText, the empirical pattern is depth-dependent: gains are small or mixed at shallow depth and most reliable in 48-layer models. In the 201M 48-layer setting, HAARES improves over Block AttnRes across all three seeds, while a 453M two-seed probe shows the same direction. Ablations rule out source duplication, random signed details, fixed detail-source biases, or block-count changes alone. Cost analysis shows that the method is FLOP-light but not wall-clock-free: it adds memory and routing overhead, yet its relative arithmetic cost is amortized as width grows and earlier convergence can reduce time-to-target.


2606.10466 2026-06-18 cs.LG cs.AI 版本更新

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

UPLOTS: 一种用于约束时间序列生成的统一预训练语言模型

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

发表机构 * University of New South Wales（新南威尔士大学）； HKUST(GZ)（香港科技大学（广州））； BUAA（北京航空航天大学）

AI总结提出UPLOTS，一种基于统一预训练语言模型和提示引导的框架，通过动态多数据集损失重加权和提示到模式映射，实现跨领域约束时间序列生成，在四个基准上验证了其泛化性和数据增强效果。

详情

AI中文摘要

在时间序列生成中，现有方法通常为每个数据集手工设计或训练单独的模型，这阻碍了它们的可扩展性，并且未能利用跨领域的共享时间结构。为了解决这种碎片化问题，我们提出了UPLOTS，一种统一的、提示引导的语言模型框架，用于跨不同领域的约束时间序列生成。UPLOTS不是构建任务特定的模型，而是利用一个由学习到的约束提示引导的单一预训练transformer骨干网络，从而能够按需生成并精确控制模式。一个关键创新是我们的动态多数据集损失重加权和提示到模式映射，这使得UPLOTS能够在训练期间内化多样化的时间结构，并在推理时有条件地生成它们。我们在四个真实世界基准和多个约束设置（包括峰值周期、日历、负载水平和波动性模式）上评估了UPLOTS。额外的保留约束组合和下游预测实验进一步表明，UPLOTS能够泛化到原始峰值模式设置之外，并在真实数据稀缺的情况下改进数据增强。我们的代码和基线可在匿名GitHub仓库获取：https://anonymous.4open.science/r/UPLOTS-6C36。

英文摘要

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at anonymous github repo: https://anonymous.4open.science/r/UPLOTS-6C36.


2606.12629 2026-06-18 cs.LG cs.AI 版本更新

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Bag of Dims：通过维度级符号模式实现无需训练的机制可解释性

Varun Reddy Nalagatla

发表机构 * Amazon Web Services（亚马逊云服务）

AI总结本文提出Bag of Dims框架，证明Transformer隐藏状态的标准基即可作为无需训练的特征基，通过维度符号模式编码语义，并在三个模型上验证了其有效性。

Comments 22 pages, 5 figures, 27 tables

详情

AI中文摘要

我们表明，Transformer隐藏状态的标准基已经提供了一个无需训练、架构通用的特征基。单个维度通过其符号编码语义内容，通过其幅度编码置信度，充当独立的二进制寄存器。我们通过四个渐进实验在三个模型家族（Qwen 3.5-4B、Gemma 3-4B、Mistral 7B）上验证了这种Bag of Dims框架。仅符号模式就携带预测性内容：将所有幅度替换为1，通过LM头实现72-93%的top-5下一个token准确率，而无需任何解码器的纯汉明评分达到80-90%的top-4096准确率。这些符号模式组织成语义特征：使用单token类型缓存（每个词汇token一次前向传播，无上下文），我们通过每维度符号一致性（平均AUC 0.80）从50个锚点发现了175个类别，无需任何训练。一个训练过的探针仅增加+0.018 AUC并收敛到轴对齐的权重，证实了可忽略的跨维度结构。这种结构扩展到注意力：所有175个类别在K和V投影中仍然可发现。在写入端，静态FFN权重检查将20%的特征与单个写入神经元联系起来（一致性>0.70；随机对照：0%），通过多数投票，top-200神经元联盟在99.9%的原型上实现>0.70的一致性。完全无监督的发现（随机种子，无标签）在所有三个模型上扩展到1500个特征，产量100%，稀疏度99%，成对互信息为0.0014比特，证实了低维度间耦合。这些结果确立了标准基已经足以在整个Transformer计算路径中进行特征读取，无需训练、无需优化，且每个词汇token仅需一次前向传播，无需GPU天数。

英文摘要

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
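The reading rule stated in this abstract (a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements) can be sketched as follows. This is a minimal illustration on a hand-made vector, not the paper's code: the hidden state, the feature pattern, and the function names are all hypothetical, and treating an exact zero as positive is an assumption made here for simplicity.

```python
def sign(x):
    """Map a real value to +1 or -1 (assumption: zero counts as +1)."""
    return 1 if x >= 0 else -1

def binarize(hidden):
    """Unit-magnitude sign pattern of a hidden state: keep signs, drop magnitudes."""
    return [sign(h) for h in hidden]

def sign_agreement(hidden, pattern):
    """Fraction of a feature's dimensions whose sign matches the expected sign.

    `pattern` maps a dimension index to its expected sign (+1 or -1).
    No learned rotation or probe is involved; this is plain counting.
    """
    signs = binarize(hidden)
    agree = sum(1 for dim, expected in pattern.items() if signs[dim] == expected)
    return agree / len(pattern)

# Hypothetical 5-dimensional hidden state and a feature defined on dims 0, 1, 3.
hidden = [0.8, -1.2, 0.1, -0.4, 2.0]
feature_pattern = {0: 1, 1: -1, 3: 1}

print(binarize(hidden))
print(sign_agreement(hidden, feature_pattern))
```

Dimensions 0 and 1 match the pattern and dimension 3 does not, so the agreement score is 2/3. In this reading the magnitudes play no role in detection, which is why the abstract treats them separately as a confidence signal.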


2606.12808 2026-06-18 cs.LG cs.AI 版本更新

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

SymQNet: 低延迟自适应哈密顿量学习的摊销获取

Yash Vardhan Tomar, Dheeraj Peddireddy

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出SymQNet，一种摊销强化学习方法，通过离线学习后验条件获取策略，在线快速前向传播，显著降低自适应哈密顿量学习的获取延迟。

详情

AI中文摘要

自适应哈密顿量学习对于校准和表征量子设备至关重要。在自适应控制器中，选择下一个实验本身就是一个计算。贝叶斯设计规则在每次后验更新后重新计算，这一步可能需要几秒钟。在数百次试验中，这些秒数成为自适应性的显著墙钟成本。我们引入SymQNet，一种用于低延迟自适应哈密顿量学习的摊销强化学习方法。SymQNet离线学习后验条件获取策略，然后在线使用快速策略前向传播，同时保留贝叶斯后验反馈。在横向场伊辛基准测试中，相对于有界Fisher信息搜索和有界两步贝叶斯主动学习（BALD），SymQNet显著降低了获取延迟。在五量子比特时，相对于这些在线基线，它仅获取决策延迟降低了$47.1\times$和$72.6\times$；在十二量子比特时，SymQNet的完整模拟步骤需要$1.02$秒，而有界两步BALD需要$13.27$秒。总体而言，我们表明学习获取可以使自适应哈密顿量学习对于重复的低延迟工作负载变得实用。

英文摘要

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.
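As a quick check on the twelve-qubit figures quoted above, the per-step timings can be turned into a speedup factor and a cumulative saving. The two timings come from the abstract; the run length of 300 steps is a hypothetical value chosen only to illustrate "hundreds of shots".

```python
def speedup(baseline_s, method_s):
    """Ratio of baseline time to method time for one full simulated step."""
    return baseline_s / method_s


def total_saving(baseline_s, method_s, n_steps):
    """Wall-clock seconds saved over n_steps adaptive steps."""
    return (baseline_s - method_s) * n_steps


SYMQNET_STEP_S = 1.02   # from the abstract (twelve qubits)
BALD_STEP_S = 13.27     # from the abstract (bounded two-step BALD)
N_STEPS = 300           # assumption: illustrative run length, not from the paper

step_speedup = speedup(BALD_STEP_S, SYMQNET_STEP_S)           # roughly 13x per step
saved_seconds = total_saving(BALD_STEP_S, SYMQNET_STEP_S, N_STEPS)
```

Under that assumed run length, the saving is on the order of an hour of wall-clock time, which is the kind of cost the abstract attributes to recomputing the design rule online.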

2606.16214 2026-06-18 cs.LG cs.AI 版本更新

从数值到标记：一种基于符号离散化的LLM驱动上下文感知时间序列预测框架

Xiaoyu Tao, Shilong Zhang, Mingyue Cheng, Daoyu Wang, Tingyue Pan, Bokai Pan, Changqing Zhang, Shijin Wang

发表机构 * State Key Laboratory of Cognitive Intelligence（认知智能国家重点实验室）； University of Science and Technology of China（中国科学技术大学）； College of Intelligence and Computing（智能科学与计算学院）； iFLYTEK Research（科大讯飞研究院）

AI总结提出TokenCast框架，利用大语言模型通过符号离散化将连续时间序列转化为标记，与上下文文本对齐，实现上下文感知的预测，实验证明有效。

AI中文摘要

时间序列预测在能源、医疗和金融等关键应用领域支持决策中起着重要作用。尽管近期取得了进展，但由于将历史数值序列与通常包含非结构化文本数据的上下文特征整合的挑战，预测精度仍然有限。为了解决这一挑战，我们提出了TokenCast，一个由大语言模型（LLM）驱动的框架，利用基于语言的符号表示作为上下文感知时间序列预测的统一中介。具体来说，TokenCast采用离散分词器将连续数值序列转化为时间标记，实现与基于语言输入的结构对齐。为了有效弥合模态之间的语义差距，时间和上下文标记通过预训练的LLM嵌入到共享表示空间中，并通过生成目标进一步优化。基于这一统一语义空间，对齐的LLM随后以监督方式进行微调，以预测未来的时间标记，然后解码回原始数值空间。在真实世界数据集上的大量实验证明了我们框架的有效性，并突显了其作为上下文感知时间序列预测生成框架的潜力。代码可从此https URL获取。

英文摘要

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the challenge of integrating historical numerical sequences with contextual features, which often comprise unstructured textual data. To address this challenge, we propose TokenCast, a large language model (LLM) driven framework that leverages language-based symbolic representations as a unified intermediary for context-aware time series forecasting. Specifically, TokenCast employs a discrete tokenizer to transform continuous numerical sequences into temporal tokens, enabling structural alignment with language-based inputs. To effectively bridge the semantic gap between modalities, both temporal and contextual tokens are embedded into a shared representation space via a pre-trained LLM, further optimized with generative objectives. Building upon this unified semantic space, the aligned LLM is subsequently fine-tuned in a supervised manner to predict future temporal tokens, which are then decoded back into the original numerical space. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework and highlight its potential as a generative framework for context-aware time series forecasting. The code is available at https://github.com/Xiaoyu-Tao/TokenCast.
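The idea of mapping a continuous series to discrete temporal tokens and decoding back can be sketched with plain uniform binning. This is only a stand-in under stated assumptions: the abstract does not specify TokenCast's tokenizer, and all function names below are hypothetical.

```python
def fit_bins(series, n_bins):
    """Return (lowest value, bin width) for uniform binning of the series."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant series
    return lo, width


def encode(series, lo, width, n_bins):
    """Map each value to an integer token in [0, n_bins - 1]."""
    tokens = []
    for x in series:
        idx = int((x - lo) / width)
        tokens.append(min(max(idx, 0), n_bins - 1))  # clamp the top edge
    return tokens


def decode(tokens, lo, width):
    """Map each token back to the centre of its bin."""
    return [lo + (t + 0.5) * width for t in tokens]


series = [0.0, 1.0, 2.0, 3.0, 4.0]
n_bins = 4
lo, width = fit_bins(series, n_bins)
tokens = encode(series, lo, width, n_bins)
reconstructed = decode(tokens, lo, width)
```

With uniform bins the reconstruction error is bounded by half a bin width, so the vocabulary size trades off directly against numerical resolution; a learned tokenizer would place its codes differently.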

2510.04120 2026-06-18 cs.CL cs.AI 版本更新

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

探究大语言模型隐喻处理中的语义对齐、词汇不变性和句法影响

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP2CT Lab, Department of Computer and Information Science, University of Macau（澳门大学计算机与信息科学系NLP2CT实验室）

AI总结通过几何探测、上下文替换和句法扰动三种方法，分析LLM在隐喻处理中的语义漂移、词汇稳定性及句法敏感性，揭示强行为表现可能源于异质信号。

Comments Accepted to ACL 2026

AI中文摘要

大语言模型（LLM）在隐喻检测和解释任务上表现出色，但尚不清楚这种行为成功揭示了隐喻处理的哪些方面。我们通过探测三个互补维度：语义属性对齐、词汇不变性和句法敏感性，对行为证据的局限性进行诊断分析。使用几何探测，我们评估模型生成的解释是否与参考语义属性对齐；通过上下文变化替换，分析隐喻和字面表达之间词汇关联的稳定性；通过受控句法扰动，检查隐喻检测的敏感性。我们的分析表明，LLM生成的解释可能相对于参考属性出现语义漂移；稳定的词汇锚点在不同上下文条件下持续存在，可能支持常规隐喻，同时使需要上下文整合的新奇隐喻产生偏差；检测性能对句法不规则性敏感。这些发现表明，强行为表现可能反映了异质的潜在信号，强调在将隐喻基准解释为稳健、集成语义理解的证据时需要谨慎。

英文摘要

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

2510.15551 2026-06-18 cs.CL cs.AI cs.LG 版本更新

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

从统计视角重新思考跨语言差距

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

发表机构 * Google DeepMind（谷歌DeepMind）

AI总结提出跨语言差距源于目标语言响应方差，通过形式化偏差和无偏误差，并采用推理时集成方法降低方差，使跨语言迁移得分提升8%-50%以上。

Comments 30 pages

AI中文摘要

任何知识片段通常以一种或少数几种自然语言表达在网页或大型语料库中。大型语言模型（LLMs）通过从源语言获取知识，并在使用目标语言查询时使其可访问，从而充当桥梁。跨语言差距是指使用目标语言而非源语言查询知识时准确率的下降。现有研究侧重于导致跨语言差距的建模或训练失败。在这项工作中，我们采取另一种视角来表征跨语言错误的性质，并假设目标语言中响应的方差是造成这一差距的关键原因。我们首次将跨语言差距形式化为有偏误差和无偏误差。通过多种控制方差并减少跨语言差距的推理时干预，我们实证验证了我们的假设。我们展示了几种测试时集成方法，这些方法降低了响应方差，从而将源-目标迁移得分提高了多达12个绝对百分点，在各种LLMs上实现了8%到超过50%的相对提升。

英文摘要

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points yielding relative gains of 8% to over 50% across various LLMs.
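Why a test-time ensemble can shrink variance-driven errors but not bias-driven ones can be seen in a small worked example. The abstract does not say which ensemble methods are used; majority voting over independent samples, the binary-answer setting, and the probabilities below are all assumptions made for illustration.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer among sampled responses."""
    return Counter(answers).most_common(1)[0][0]


def majority_accuracy(p, n):
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary answer space, odd n, and per-sample accuracy p.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


vote = majority_vote(["Paris", "Paris", "Lyon"])

# Per-sample accuracy above 0.5: errors behave like variance, and voting helps.
helped = majority_accuracy(0.6, 5)
# Per-sample accuracy below 0.5: errors behave like bias, and voting makes it worse.
hurt = majority_accuracy(0.4, 5)
```

In this toy setting, five votes lift a 0.6 per-sample accuracy to about 0.68, while a 0.4 per-sample accuracy drops to about 0.32, which matches the intuition that ensembling addresses variance rather than bias.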

2601.14968 2026-06-18 cs.LG cs.AI 版本更新

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

InstructTime++: 通过隐式特征增强的多模态语言建模进行时间序列分类

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China（中国科学技术大学认知智能国家重点实验室）

AI总结提出将时间序列分类转化为多模态生成任务，通过离散化模块和对齐投影层弥合模态差距，并利用隐式特征建模提升语言模型性能。

AI中文摘要

大多数现有的时间序列分类方法采用判别范式，将输入序列直接映射到独热编码的类别标签。虽然有效，但这种范式难以融入上下文特征，也无法捕捉类别间的语义关系。为了解决这些局限性，我们提出了InstructTime，一种将时间序列分类重新定义为多模态生成任务的新框架。具体来说，连续的数值序列、上下文文本特征和任务指令被视为多模态输入，而类别标签则通过调优的语言模型作为文本输出生成。为了弥合模态差距，InstructTime引入了一个时间序列离散化模块，将连续序列转换为离散的时间标记，同时结合对齐投影层和生成式自监督预训练策略，以增强跨模态表示对齐。在此框架基础上，我们进一步提出了InstructTime++，通过引入隐式特征建模来扩展InstructTime，以补偿语言模型有限的归纳偏差。InstructTime++利用专门的工具包从原始时间序列和上下文输入中挖掘信息丰富的隐式模式，包括统计特征提取和基于视觉-语言模型的图像描述，并将其转化为文本描述以实现无缝集成。在多个基准数据集上的大量实验证明了InstructTime++的优越性能。

英文摘要

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.

2601.17226 2026-06-18 cs.CL cs.AI 版本更新

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复：面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales（新南威尔士大学）

AI总结提出RRR强化学习框架，结合结构主义叙事学与标量叙事性，通过d-RLAIF从文本特征中获取训练信号，无需参考输出，提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷，此时它们无法再依赖回忆已记住的训练数据。基于真实值的后训练（如SFT）无法教会LLM生成合乎逻辑且合理的叙事事件。本文提出Retell, Reward, Repeat (RRR)，一个基于强化学习的流水线，将结构主义叙事学与标量叙事性相结合，以教授故事结构。我们扩展了TimeTravel数据集，加入人工标注的叙事平衡阶段，以评估奖励模型。通过d-RLAIF，RRR从文本特征的叙事性中推导训练信号，无需参考输出。评估表明，RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线，输出质量还通过盲评的人类偏好得到验证。RRR仅依赖小型的纯查询数据集，为故事讲述这一目前缺乏有效后训练方法的领域提供了一种有语言学依据、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2601.19792 2026-06-18 cs.CL cs.AI cs.HC 版本更新

LVLMs and Humans Ground Differently in Referential Communication

LVLMs与人类在指称交流中建立共同基础的方式不同

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

AI总结通过人类与AI配对的多轮指称交流实验，发现LVLMs无法像人类一样利用共同基础生成和解析指称表达，导致交流不畅。

Comments 27 pages, 16 figures

2602.06470 2026-06-18 cs.CL cs.AI 版本更新

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结本文提出UNO框架，通过用户日志提炼规则和偏好对，利用查询反馈驱动聚类处理数据异质性，量化模型知识与日志数据间的认知差距，提升LLM系统性能。

AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型（LLMs）的发展，但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此，近期研究更加关注从真实世界部署中持续学习，其中用户交互日志提供了丰富的真实人类反馈和过程知识。然而，从用户日志学习具有挑战性，因为它们是非结构化且含噪声的。传统的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为，且用户日志收集与模型优化之间的差异（例如，非策略优化问题）进一步加剧了这一问题。为此，我们提出UNO（用户日志驱动的优化），一个统一的框架，用于通过用户日志改进LLM系统（LLMsys）。UNO首先将日志提炼为半结构化的规则和偏好对，然后利用查询和反馈驱动的聚类来管理数据异质性，最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块，以处理从用户日志中提取的初级和反思性经验，从而提升未来的响应。广泛的实验表明，UNO在效果和效率上均达到最先进的水平，显著优于检索增强生成（RAG）和基于记忆的基线方法。我们已开源代码至https://github.com/bebr2/UNO。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .

2602.15851 2026-06-18 cs.CL cs.AI 版本更新

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

叙事理论驱动的LLM方法在自动故事生成与理解中的应用：综述

David Y. Liu, Aditya Joshi, Paul Dawson

发表机构 * School of Computer Science and Engineering（计算机科学与工程学院）； School of Arts and Media（艺术与媒体学院）； University of New South Wales (UNSW)（新南威尔士大学）

AI总结综述叙事理论驱动的大语言模型方法在自动故事生成与理解中的应用，分析现状并指出生成任务在理论应用、后训练方法、非虚构叙事及叙事层次等方面落后于理解任务，提出未来方向。

Comments 31 pages

AI中文摘要

使用大语言模型（LLM）的叙事理论应用在自动故事生成和理解任务中提供了有前景的方法。本综述考察了自然语言处理（NLP）研究如何利用LLM方法处理叙事研究中的不同概念。我们使用叙事学中的既定区分来分类当前工作，并发现以下内容：(a) 叙事文本来源多样，不仅限于文学；(b) 理论综合与验证是潜在成果；(c) 生成任务在多个方面落后于理解任务：理论应用、后训练方法、探索非虚构叙事以及处理超出故事与话语层面的叙事层次。对于未来方向，我们相信，与其追求单一的、通用的“叙事质量”基准，进步可以受益于以下方面的努力：定义和改进针对单个叙事属性的基于理论的度量；继续开展大规模、理论驱动的文学/社会/文化分析；在情境化上下文中生成叙事；以及继续进行实验，其输出可用于验证或完善叙事理论。本文通过概述当前研究工作和更广泛的叙事研究领域，为NLP中更系统、更具理论依据的叙事研究提供了背景基础。

英文摘要

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: (a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

2605.21028 2026-06-18 cs.CV cs.AI 版本更新

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

DySink：动态帧 sinks 用于自回归长视频生成

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Key Lab. of Computer Network and Information Integration, Southeast University（东南大学计算机网络与信息集成重点实验室）； Zhongguancun Academy（中关村学院）； Zhongguancun Institute of Artificial Intelligence（中关村人工智能研究院）； Institute of Automation, CAS（中国科学院自动化研究所）

AI总结本文提出 DySink，一种基于检索的框架，通过维护紧凑的记忆银行并选择视觉相关的历史帧作为动态帧 sinks，以提高自回归长视频生成的动态性和时间质量。

AI中文摘要

自回归长视频生成通常采用有界内存流式处理以提高效率，一般将用于短期连续性的局部窗口与作为长程锚点的静态早期帧 sinks 相结合。然而，这种固定分配即使在当前视觉状态已与早期帧大幅偏离时仍会缓存早期帧，同时丢弃可能更相关的中间历史。结果，保留的长程上下文可能变得不够自适应，并使生成偏向过时的线索；在严重情况下，RoPE 引起的相位再对齐会使头间注意力同质化并导致 sink 崩溃，即内容回归到 sink 帧。我们提出 DySink，一种基于检索的框架，维护紧凑的记忆库并选择视觉相关的历史帧作为动态帧 sinks。DySink 将自适应检索与 sink 异常门相结合，后者检测检索上下文上过度的头间共识并抑制易崩溃的上下文。在分钟级视频上的实验表明，DySink 在时间质量方面一致优于强基线，同时也实现了更高的动态度，从而支持连贯且更自然的长时程视觉演化。代码和模型权重已在 https://github.com/yebo0216best/DySink 上发布。

英文摘要

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.

2606.13768 2026-06-18 cs.CV cs.AI 版本更新

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

CineOrchestra：面向电影视频生成的统一实体中心条件控制

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

发表机构 * Snap Inc.（Snap公司）； UC Merced（加州大学默塞德分校）

AI总结提出CineOrchestra，一种统一控制主体、事件、相机和镜头切换的视频扩散模型，通过实体中心条件原语和参数无关的旋转位置编码实现多轴联合控制，在密集描述跟随和镜头切换时序上超越六种专用方法。

Comments Project page: https://snap-research.github.io/CineOrchestra

AI中文摘要

个性化陷阱：用户记忆如何改变大语言模型的情感推理

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

发表机构 * Amazon（亚马逊）

AI总结研究用户记忆如何导致大语言模型在情感推理中产生系统性偏差，发现高绩效模型对优势背景用户的情感解读更准确，个性化机制可能嵌入社会等级。

Comments 19 pages 5 figures

AI中文摘要

当AI助手记住Sarah是一位打两份工的单亲母亲时，它对她压力的解读是否与她是富有的高管时不同？随着个性化AI系统越来越多地融入长期用户记忆，理解这种记忆如何塑造情感推理至关重要。我们通过在人验证的情感智能测试上评估15个模型，研究用户记忆如何影响大语言模型（LLMs）的情感智能。我们发现，相同的场景搭配不同的用户画像会产生系统性不同的情感解读。在经验证的独立于用户的情感场景和多样化的用户画像中，几个高性能LLM出现了系统性偏差，其中优势背景的用户画像获得了更准确的情感解读。此外，LLM在情感推理和支持性推荐任务中表现出跨人口统计因素的显著差异，表明个性化机制可以将社会等级嵌入模型的情感推理中。这些结果凸显了记忆增强AI的一个关键挑战：为个性化设计的系统可能会强化社会不平等。为缓解这些差异，我们整理了一个通用偏好数据集，旨在减少人口统计画像对情感理解的影响。

英文摘要

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion reasoning and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models' emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may reinforce social inequalities. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce demographic profiles' influence on emotional understanding.

2606.12618 2026-06-18 cs.AI 版本更新

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

“你撒谎了吗？”评估不同规模模型和信念验证模型生物体的谎言检测器

Alan Cooney, David Africa, Geoffrey Irving

发表机构 * AI Security Institute（AI安全研究所）

AI总结本研究通过构建13个信念可验证的推理模型生物体和多样化提示撒谎测试集，评估了四种谎言检测器在不同规模模型上的表现，发现基于激活和概率的检测器在训练模型生物体上性能显著下降，而思维链法官保持较强性能，但存在伪影。

Comments 12 pages, 6 figures

AI中文摘要

语言模型的鲁棒谎言检测器可以实现审计、监控和事后调查模型行为的强大技术，但评估它们需要模型可验证地相信与其所说相反的测试平台。我们表明，现有的训练模型生物体通常无法满足这一要求，使得先前的正面和负面检测结果难以解释。我们通过13个推理模型生物体来解决这个问题，这些生物体的隐藏信念在思维链中得到验证，并显示泛化到保留任务，同时结合了多样化欺骗（Varied Deception），一个涵盖广泛谎言诱导动机的提示撒谎测试集。在这些测试平台上，我们评估了四个检测器：一个思维链法官、一个对数概率分类器和两个激活探针，包括Did-You-Lie（DYL），一种训练后续探针的新方法。在提示撒谎任务上，跨越31个开放权重模型（参数从2B到1T），所有四个检测器都显示出与模型能力正相关的缩放。然而，每个基于激活和对数概率的检测器在我们训练的生物体上性能急剧下降，其中DYL保留了最多的信号；只有思维链法官保持强劲，平衡准确率为0.82，部分原因是我们的验证过程偏向于CoT可读的信念。因此，当前的谎言检测器无法支持关于模型信念的高置信度声明，我们提出了可能解决当前一些局限性的研究方向。我们发布了我们的数据集、模型生物体和训练好的检测器。

英文摘要

Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model behaviour, but evaluating them requires testbeds where models verifiably believe the opposite of what they say. We show that existing trained model organisms often fail this requirement, leaving prior positive and negative detection results difficult to interpret. We address this with 13 reasoning model organisms whose hidden beliefs are verified in chain-of-thought and shown to generalise to held-out tasks, alongside Varied Deception, a prompted-lying testbed covering a broad range of lie-inducing motivations. On these testbeds we evaluate four detectors: a chain-of-thought judge, a logprob classifier, and two activation probes, including Did-You-Lie (DYL), a new method for training follow-up probes. On prompted lying, across 31 open-weight models spanning 2B to 1T parameters, all four detectors show positive scaling with model capability. However, every activation- and logprob-based detector drops sharply on our trained model organisms, with DYL retaining the most signal; only the chain-of-thought judge remains strong, with 0.82 balanced accuracy, partly as an artefact of our verification process favouring CoT-readable beliefs. Current lie detectors therefore cannot support high-confidence claims about model beliefs, and we suggest research directions that may address some of their current limitations. We release our datasets, model organisms, and trained detectors.

2409.03500 2026-06-18 cs.CY cs.AI 版本更新

Quality Perceptions and Intended Engagement in Response to AI-Generated and AI-Assisted News

对AI生成和AI辅助新闻的质量感知与预期参与

Fabrizio Gilardi, Sabrina Di Lorenzo, Juri Ezzaini, Beryl Santa, Benjamin Streiff, Eric Zurfluh, Emma Hoes

发表机构 * University of Zurich（苏黎世大学）

AI总结通过预注册调查实验（N=599），研究读者对人类撰写、AI辅助和AI完全生成新闻的质量感知及披露AI参与后的参与意愿，发现质量评价相似，但披露后AI组短期阅读意愿更高。

Comments Forthcoming, Scientific Reports

AI中文摘要

人工智能在新闻生产中的日益普及引发了关于受众如何看待和回应AI生成新闻的重要问题。这项预注册调查实验（N=599，瑞士德语区）考察了（i）对人类撰写、AI辅助或完全AI生成的新闻摘录的文章质量感知（以可信度、可读性和专业知识衡量），以及（ii）在披露AI参与后自我报告的参与意愿。参与者在了解文章制作方式之前先阅读两篇短新闻摘录。所有条件下的文章在感知质量上评价相似。披露后，与对照组相比，AI辅助和AI生成条件下的参与者报告了更高的继续阅读指定文章的意愿，但未来阅读AI生成新闻的意愿在各条件下无差异。总体而言，研究结果表明，读者对AI生成和人类撰写的新闻质量评价相当，而披露AI使用可能暂时增加好奇心或兴趣，但尚未改变长期阅读意愿。

英文摘要

The increasing use of artificial intelligence (AI) in news production raises important questions about how audiences perceive and respond to AI-generated journalism. This preregistered survey experiment (N = 599, German-speaking Switzerland) examines (i) perceptions of article quality (measured as credibility, readability, and expertise) across news excerpts that were human-written, AI-assisted, or fully AI-generated, and (ii) self-reported intentions to engage following disclosure of AI involvement. Participants rated two short news excerpts before learning how they had been produced. Articles across all conditions were evaluated similarly in perceived quality. After disclosure, participants in the AI-assisted and AI-generated conditions reported a higher willingness to continue reading their assigned articles compared to the control group, but future willingness to read AI-generated news did not differ across conditions. Overall, the findings suggest that readers assess AI-generated and human-written news comparably in quality, while disclosure of AI use can momentarily increase curiosity or interest without yet changing longer-term reading intentions.

2505.03646 2026-06-18 cs.LG cs.AI cs.CV 版本更新

当汽车有刻板印象：审计文本到图像模型中对象的群体偏见

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

发表机构 * AIM Intelligence（AIM智能研究院）； Yonsei University（延世大学）

AI总结提出SODA框架，通过三个指标系统测量文本到图像模型在生成对象中的群体偏见，发现中性提示隐含偏向中年和白人，且人口统计线索导致高度偏斜的刻板输出。

AI中文摘要

虽然先前关于文本到图像生成的研究主要集中在人类描绘中的偏见，但生成对象中的群体偏见仍然相对未被充分探索。我们引入了SODA（刻板对象诊断审计），这是一个新颖的框架，通过自动属性发现和三个标准化指标系统地测量这些偏见：基础与群体差异（BDS）、跨群体差异（CDS）和视觉属性集中度（VAC）。将SODA应用于五个最先进模型和八个对象类别（例如汽车）的8000张图像，我们发现“中性”提示产生的输出在视觉上最接近中年和白人，表明这些群体在模型默认设置中被隐含地过度代表。此外，人口统计线索触发了高度偏斜的刻板输出：26.6%的对象-模型-群体组合产生的结果中，所有20张生成图像共享完全相同的属性值（例如，为女性生成玫瑰金笔记本电脑）。最后，提示级别的去偏减少了群体间差异，但矛盾地压缩了群体内多样性，用一种刻板印象取代了另一种。SODA提供了一个实用的流程，使这些隐含关联变得可测量，作为迈向更负责任的人工智能发展的一步。

英文摘要

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.
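The "all 20 generated images share the exact same attribute value" finding can be made concrete with a simple concentration score. The abstract does not define how SODA's Visual Attribute Concentration (VAC) is computed, so the function below is a hypothetical stand-in (the share of the most common attribute value), and the colour labels are made-up illustration data.

```python
from collections import Counter


def attribute_concentration(values):
    """Share of generated images that carry the single most common attribute value.

    A hypothetical proxy, not SODA's VAC definition. Returns a value in (0, 1].
    """
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


# Fully collapsed case: every one of 20 images has the same colour.
collapsed = attribute_concentration(["rose gold"] * 20)

# More varied case (made-up counts): 10 silver, 6 black, 4 white.
varied = attribute_concentration(["silver"] * 10 + ["black"] * 6 + ["white"] * 4)
```

A score of 1.0 corresponds to the fully collapsed case reported for 26.6% of object-model-demographic combinations; lower scores indicate more within-group diversity.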

2511.20002 2026-06-18 cs.CV cs.AI cs.CR 版本更新

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

语义路由器：通过单一对抗扰动劫持多模态大语言模型的可行性研究

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen, China（香港中文大学（深圳））； School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China（数据科学学院、人工智能学院、香港中文大学（深圳））

AI总结提出语义感知通用扰动（SAUP），作为语义路由器同时劫持多个无状态决策，通过理论分析和SORT优化策略实现，在Qwen上对五个目标达到66%攻击成功率。

Comments Accepted to ICML 2026

AI中文摘要

多模态大语言模型（MLLMs）越来越多地部署在无状态系统中，例如自动驾驶和机器人技术。本文研究了一种新型威胁：语义感知劫持。我们探索了使用单一通用扰动同时劫持多个无状态决策的可行性。我们引入了语义感知通用扰动（SAUP），它充当语义路由器，“主动”感知输入语义并将其路由到不同的、攻击者定义的目标。为了实现这一点，我们对潜在空间中的几何特性进行了理论和实证分析。在这些见解的指导下，我们提出了语义导向（SORT）优化策略，并标注了一个具有细粒度语义的新数据集以评估性能。在三个代表性MLLM上的大量实验证明了这种攻击的基本可行性，在针对Qwen的五个目标上使用单帧实现了66%的攻击成功率。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2604.23130 2026-06-18 cs.CL cs.AI 版本更新

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

从概念对齐的Token到脆弱特征：越狱的机制定位

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

发表机构 * UMBC（马里兰大学巴尔的摩郡分校）； Apple（苹果公司）

AI总结提出一种基于Token的机制流水线，通过稀疏自编码器特征子组定位越狱漏洞，发现单个有害Token足以定位脆弱特征，且这些特征集中在中后期层。

AI中文摘要

越狱攻击揭示了安全对齐的大语言模型中一种持续的失败模式：模型可以被推向有害行为，但促成这种转变的内部表示仍未被很好地定位。最近的机制安全性研究通常通过广泛的表示对象来解释这种行为，包括全局拒绝方向、激活引导向量和与拒绝相关的SAE特征。我们转而询问越狱脆弱性是否可以追溯到更细粒度的、基于提示的SAE特征子组。我们引入了一个基于Token的机制流水线，将Gemma-2-2B的残差流分解为稀疏自编码器（SAE）特征，并识别与不安全行为相关的特征子组。使用BeaverTails中的单类别不安全示例以减少跨类别干扰，我们从对抗性响应中提取有害概念，并通过子空间相似性将其与概念相关的提示Token对齐。然后，我们应用三种特征分组策略：基于聚类的、层次链接的和单Token驱动的，以识别所有26层中的SAE特征子组。最后，我们放大每个子组中的顶级特征，并使用标准的有害性评判器评估生成的输出。单Token驱动的分组实现了与完整基于聚类的分组相当的有害性，表明单个有害提示Token足以定位与脆弱性相关的SAE特征子组，而无需依赖更广泛的聚类级聚合。这些子组出现在早期和中后期层，且更集中在中后期层，其中目标引导暴露了特定的模型脆弱性。总体而言，我们的结果表明越狱敏感性可以追溯到稀疏的、基于Token定位的SAE特征子组，补充了先前基于广泛对抗、拒绝或引导方向的解释。

英文摘要

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.

2605.26903 2026-06-18 cs.CR cs.AI 版本更新

Practical Anonymous Two-Party Gradient Boosting Decision Tree

实用的匿名两方梯度提升决策树

Chenyu Huang, Fan Zhang, Minxin Du, Sherman S. M. Chow, Huangxun Chen, Huaming Rao, Danqing Huang, Bo Qian, Peng Chen

发表机构 * Tencent（腾讯）； Hong Kong Polytechnic University（香港理工大学）； Chinese University of Hong Kong（香港中文大学）； HKUST-GZ

AI总结针对两方垂直分割数据上的梯度提升决策树训练，提出一种基于双电路隐私集合求交和遗忘可编程伪随机函数的匿名协议，在隐藏记录标识符的同时保持效率。

Comments 19 pages; 2026 IEEE Symposium on Security and Privacy (SP)

DOI: 10.1109/SP63933.2026.00084
Journal ref: 2026 IEEE Symposium on Security and Privacy (SP)

AI中文摘要

梯度提升决策树（GBDT）擅长处理结构化数据，通常用于在互不信任的各方之间垂直分割的特征上进行训练。高速和可解释性使得GBDT在金融和医疗领域广受欢迎，而神经网络在这些领域可能表现不佳。为GBDT启用安全计算带来了独特的挑战，需要安全的记录对齐以进行比较。依赖隐私集合求交（PSI）是一种事实上的方法。将PSI误认为是安全措施实际上会暴露数据集中哪些记录标识符（ID）是共享的。尽管电路PSI可以提供帮助，但对于通用用途来说成本高昂。需要新的思路来在“黑暗森林”中高效训练。为了隐藏ID，我们启动了对两方持有的分割数据上的匿名GBDT训练的研究。我们设计中的双电路PSI让双方交替作为接收者，对本地特征执行“选取后求和”。通过遗忘可编程伪随机函数，我们将电路PSI的输出作为共享状态在运行之间传播。避免通用对齐，我们解决了被忽视的困境：隐藏ID会带来与域大小成比例的成本。接下来，我们将用于将单指令多数据同态加密从（环）学习误差转换的密文打包成本减半，相比之前的安全GBDT（Usenix Security' 23）和相关安全机器学习计算。对比实验表明，我们的协议在效率上与有泄漏的方法相比仍具有竞争力。通过启用隐藏ID的聚合，我们的技术可以扩展到其他垂直分割的分析场景。

英文摘要

Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a "dark forest". Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (USENIX Security '23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.

2606.07150 2026-06-18 cs.CR cs.AI cs.MA cs.NI 版本更新

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

从隐私到工作流完整性：自主智能体互操作性中的通信图元数据

Bijaya Dangol

发表机构 * Independent Researcher（独立研究者）

AI总结针对智能体通信图元数据泄露问题，提出工作流完整性威胁模型，定义传输层与引导层隐私属性，并通过A2A案例验证元数据保护可有效抑制任务推断。

Comments 22 pages, 7 figures, 6 tables

AI中文摘要

诸如A2A和MCP之类的智能体互操作性协议标准化了智能体之间的通信内容，但假设基于地址的HTTP(S)传输。此类传输保护消息内容，并越来越多地采用端到端加密。它们暴露在明文中的是通信图：哪个智能体联系哪个智能体、何时以及频率如何。在智能体系统中，该图比隐私框架所暗示的更具后果性。端点通常带有能力标签，工作流是结构化和链式的，交互与实际行动耦合，因此观察者恢复的不仅仅是过去的关系。它可以推断出待处理的工作流、正在组装的任务以及可能即将发生的行动。以机器速度，它可以在工作流完成之前根据该推断采取行动。因此，威胁是工作流完整性，而不仅仅是隐私：对自主行动的预测性杠杆。我们为智能体通信图提供了一个威胁模型；识别了使智能体元数据具有独特揭示性的因素（语义性、前瞻性、驱动性）；定义了传输层和引导层隐私属性，并评估了候选传输（SimpleX/SMP、Tor、混合网络）与这些属性的匹配程度；并提出了一个A2A案例研究，其中元数据保护绑定是可表达的，但揭示了协议的身份假设。我们在一个基于真实A2A捕获的生成模型上测试了这些。仅凭被动元数据，没有载荷，一个分类器从工作流的开头就能以远高于随机水平的概率恢复任务类别；应用这些属性后，该恢复被急剧拉回随机水平。除了观察者能恢复的内容外，我们衡量了利用泄露的杠杆：在工作流开头和固定预算下，选择对哪些工作流采取行动的对手在此模型中实现了大部分先知攻击者相对于元数据盲攻击者的优势，而相同的属性抑制了这一点。

英文摘要

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to actions, so an observer recovers more than past relationships: it can recognize a recurring pending workflow from its opening and, at machine speed, act on it before it completes. The threat is one of workflow integrity, not privacy alone. We give a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, give them an indistinguishability-game semantics, evaluate transports, and give an A2A case study where a metadata-protecting binding surfaces its implicit identity assumptions. On a corpus of real multi-agent A2A traffic from the official reference agents, on a live A2A binding, and with a generative model as a controlled instrument, a label-blind classifier recovers a task's class from passive metadata at 6x chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. Acting on the leak is distinct from recoverability: under a fixed budget an adversary captures 0.63 of a clairvoyant attacker's advantage on the corpus (0.41 from a workflow's opening), governed by top-ranked precision rather than overall accuracy, so integrity and privacy come apart under defense.

2512.04144 2026-06-18 cs.AI 版本更新

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

RippleBench: 利用现有知识库捕捉涟漪效应

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

发表机构 * Harvard University（哈佛大学）； Imperial College London（伦敦帝国学院）； Northeastern University（东北大学）

AI总结提出RippleBench-Maker自动管道，从知识库检索语义邻居生成选择题，评估八种遗忘方法在Llama3-8B-Instruct上的涟漪效应，发现准确率下降随语义距离衰减且跨模型一致。

AI中文摘要

针对语言模型的目标干预，如遗忘或模型编辑，旨在修改特定信息，但其效果往往传播到相关的、非预期的领域（例如，删除病毒学内容可能降低对过敏任务的性能）；这些副作用通常被称为涟漪效应。我们引入RippleBench-Maker，一个自动管道，从知识库中检索任何源概念的语义邻居，并生成不同语义距离的多选题。我们使用WikiRAG（一个基于英文维基百科的开源RAG系统）实例化该框架，构建RippleBench-WMDP-Bio（584个种子主题，352,961个问题），并在Llama3-8B-Instruct上评估八种遗忘方法。所有八种方法在遗忘目标附近准确率下降最大，并随语义距离衰减，每种方法具有不同的传播曲线。我们在Mistral-7B、Zephyr-7B和Yi-34B上复现了这些发现；跨模型的差值曲线几乎相同，表明涟漪效应是遗忘方法的属性而非基础模型。我们通过一项包含四个实验的Mechanical Turk研究（5,200+次响应，61名工作者）验证了所有主要管道阶段。我们发布所有代码、数据和基础设施。

英文摘要

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often propagate to related, unintended areas (e.g., removing virology content may degrade performance on allergies); these side-effects are commonly referred to as the ripple effect. We introduce RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at varying semantic distances. We instantiate this framework using WikiRAG, an open-source RAG system over English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. All eight exhibit accuracy drops that are largest near the unlearned target and decay with semantic distance, each with a distinct propagation profile. We replicate these findings across Mistral-7B, Zephyr-7B, and Yi-34B; cross-model delta curves are nearly identical, suggesting ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages using a four-experiment Mechanical Turk study (5,200+ responses, 61 workers). We release all code, data, and infrastructure.
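The "accuracy drop versus semantic distance" curve mentioned in the abstract can be computed with a few lines of bookkeeping. The sketch below is illustrative only: the record layout `(distance_bucket, base_correct, unlearned_correct)` and the function name `ripple_curve` are assumptions, not part of the RippleBench release.

```python
from collections import defaultdict

def ripple_curve(records):
    """Mean accuracy drop (base minus unlearned) per semantic-distance bucket.

    Each record is (distance_bucket, base_correct, unlearned_correct),
    where the two flags say whether each model answered the question correctly.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [questions, base hits, unlearned hits]
    for dist, base_ok, unlearned_ok in records:
        bucket = totals[dist]
        bucket[0] += 1
        bucket[1] += int(base_ok)
        bucket[2] += int(unlearned_ok)
    # Accuracy delta per bucket, ordered by increasing distance.
    return {d: (b - u) / n for d, (n, b, u) in sorted(totals.items())}
```

A curve that is largest at distance 0 and shrinks toward 0 at larger distances would match the decay pattern the abstract reports.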

2605.29676 2026-06-18 cs.AI cs.CL 版本更新

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

符号至关重要：智能体AI系统中令牌优化格式的基准研究

Lorenz Kutschka, Bernhard Geiger

发表机构 * Know Center Research GmbH（知中心研究有限公司）； Graz University of Technology（格拉茨技术大学）； Graz Center for Machine Learning（格拉茨机器学习中心）

AI总结本研究在四个智能体基准上评估了两种令牌优化格式TOON和TRON，发现TRON在保持准确率的同时最多减少27%的令牌，而TOON虽减少18%但存在多轮解析失败和并行工具调用输出崩溃的问题。

Comments 16 pages, 6 figures, 4 tables

AI中文摘要

智能体AI系统中的大型语言模型消耗工具模式和执行结果，并发出结构化数据的工具调用。这种交换的默认语言JSON是为应用间交换而非令牌效率设计的，因此其结构元素带来大量令牌开销。最近的工作提出了令牌优化替代方案，如TOON（令牌导向对象表示法）和TRON（令牌减少对象表示法）作为更紧凑的替代，但这些格式仅在孤立的理解或生成任务上进行了评估。它们在端到端智能体循环中是否保持令牌减少仍是一个开放问题。我们在四个智能体基准（BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench）和五个开放权重LLM上评估了TOON和TRON，将输入压缩与输出压缩解耦，以独立测量理解和生成。TRON最多减少27%的令牌，准确率在JSON基线的14个百分点内。TOON实现了最多18%的减少，准确率成本类似为9个百分点，但在多轮解析失败上额外级联，并且对于大多数模型导致并行工具调用输出崩溃。

英文摘要

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
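To see where JSON's structural overhead comes from, the sketch below compares compact JSON against a hypothetical header-plus-rows layout. This layout is an illustrative stand-in, not the TOON or TRON syntax, and character count is used only as a crude proxy for tokens; real savings depend on the tokenizer.

```python
import json

def to_compact(rows):
    """Hypothetical layout: keys written once, then one comma-separated line per row.

    Assumes flat records that share the same keys and contain no commas or newlines.
    """
    keys = list(rows[0])
    lines = [",".join(keys)]
    for row in rows:
        lines.append(",".join(str(row[k]) for k in keys))
    return "\n".join(lines)

def char_saving(rows):
    """Fraction of characters saved relative to whitespace-free JSON."""
    as_json = json.dumps(rows, separators=(",", ":"))
    as_compact = to_compact(rows)
    return 1 - len(as_compact) / len(as_json)
```

Repeated keys, quotes, and braces account for most of the difference, which is why the saving grows with the number of uniform rows.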

2606.17453 2026-06-18 cs.AI 版本更新

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench：通过基于行为的隐含决策因素对满意度感知地图智能体进行基准测试

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结提出MapSatisfyBench基准，通过恢复用户行为链中的隐含决策因素来评估地图智能体的满意度感知能力，实验表明现有智能体在显式任务完成上表现良好，但在满足隐含需求方面仍有局限。

AI中文摘要

大型语言模型智能体越来越多地集成到地图服务中。由于地图服务嵌入在日常场景而非专业任务设置中，用户通常非正式地表达需求，导致查询不明确，包含许多未言明的需求，即对用户满意度至关重要的隐含决策因素。虽然澄清是缓解这一问题的有效方法，但它增加了日常交互中的用户负担，而一个能干的智能体应首先从可用信息源主动恢复这些因素。然而，评估这一能力具有挑战性。第一个挑战是确定哪些隐含决策因素适合评估。一个因素只有在影响用户接受度且能从智能体响应前可获取的信息中恢复时才是可评估的。其次，用户满意度不能可靠地由单个参考答案表示，需要一个将满意度相关因素转化为客观可量化评估目标的基准。为应对这些挑战，我们提出一个恢复-识别-过滤框架，从行为链证据中重建完整的用户需求，识别隐含决策因素，并仅保留那些有查询前证据支持的因素。基于此方法，我们从大规模真实世界匿名用户数据构建MapSatisfyBench，并从五个维度标注真实值，实现对满意度感知地图智能体的全链条评估。实验表明，当前智能体在显式任务完成上普遍表现良好，但在满足隐含决策因素和主动获取满意度感知决策所需证据方面仍然有限。这些发现使MapSatisfyBench成为将地图智能体评估从任务完成转向满意度感知空间决策的基准。

英文摘要

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY 版本更新

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛：前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

发表机构 * Compassion Aligned Machine Learning（同情对齐机器学习）； Sentient Futures（感知未来）； Harvard Kennedy School（哈佛肯尼迪学院）； Appalachian State University Department of Management（阿巴拉契亚州立大学管理系）

AI总结提出首个代理基准TAC，测试AI代理在为用户执行旅行预订等操作时是否避免涉及动物剥削的选项。评估七个前沿模型，所有模型得分均低于64%的随机水平，最佳模型仅为53%。

AI中文摘要

AI代理正从顾问转变为行动者，代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应，但未检验这些响应中的福利推理是否迁移到代理部署中（模型必须使用工具采取行动）。我们引入TAC（旅行代理同情心），这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景，涵盖六类动物剥削，并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%，最佳表现者（Claude Opus 4.7）为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升，在GPT-5.2中提升26个百分点，在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计（使用Gemini 2.5 Flash Lite作为评判者，对前两名模型的288个基础条件转录进行审计）未标记任何评估意识转录，表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

英文摘要

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.18192 2026-06-18 cs.AI 版本更新

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

斯坦福EDGAR文件数据集：将美国公司及财务披露重建为布局忠实且令牌高效的预训练数据

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Nanjing University（南京大学）； Stanford University（斯坦福大学）

AI总结为解决长上下文文档稀缺问题，提出SEFD数据集，将SEC文件重建为布局忠实的MultiMarkdown格式，用于金融语言建模与评估，具有令牌高效、与Common Crawl重叠率低于0.1%的特点。

Comments Preprint. Includes appendix, tables, and figures

AI中文摘要

随着高质量公共网络语料库日益枯竭，干净的长上下文文档已成为大型语言模型（LLM）训练数据中稀缺且昂贵的来源。现有的长上下文语料库通常是专有的且获取成本高昂、合成生成的，或集中在编程等狭窄领域。我们介绍了斯坦福EDGAR文件数据集（SEFD），这是将SEC文件重建为布局忠实的MultiMarkdown格式的开放数据集，用于金融语言建模和评估。SEFD使经过审计的财务报表、风险披露、所有权报告、会计说明和影响市场的事件文件能够用作长上下文预训练数据，并作为金融推理、预测、合规和文档理解的基础。生成的语料库令牌高效、可直接用于模型，并且与Common Crawl衍生的语料库重叠率低于0.1%。我们发布了SEFD-v1，一个152B令牌的初始公共快照，并提供了更大的1850万文件档案（估计为550B令牌）的语料库级分析。我们进一步引入了两个基于SEFD的基准：EDGAR-Forecast，用于评估模型知识截止后基于文件的数值预测；以及EDGAR-OCR，用于评估复杂金融表格的转录。

英文摘要

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2303.18031 2026-06-18 cs.CV cs.AI cs.LG 版本更新

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

简单域泛化方法是开放域泛化的强基线

Masashi Noguchi, Shinichi Shirakawa

发表机构 * Graduate School of Environment and Information Sciences（环境与信息科学研究生院）； Yokohama National University（横滨国立大学）； Faculty of Environment（环境学系）

AI总结本文评估现有域泛化方法在开放域泛化中的表现，发现简单方法CORAL和MMD与复杂方法DAML竞争力相当，并通过集成学习和Dirichlet混合数据增强简单扩展后性能接近DAML且计算成本更低。

Comments Accepted at IJCNN 2024. The code used in the experiments is available at https://github.com/shiralab/OpenDG-Eval

DOI: 10.1109/IJCNN60899.2024.10650639

AI中文摘要

直接偏好优化综述：数据集、理论、变体及应用

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

发表机构 * Zhejiang University（浙江大学）； Nanyang Technological University（南洋理工大学）； Alibaba Group（阿里巴巴集团）

AI总结综述直接偏好优化（DPO）在理论、变体、数据集和应用方面的进展，指出其作为RL-free替代方案的潜力与局限，并提出未来研究方向。

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

DOI: 10.1109/TPAMI.2026.3704314

AI中文摘要

随着大语言模型（LLMs）的快速发展，将策略模型与人类偏好对齐变得日益关键。直接偏好优化（DPO）作为一种有前景的对齐方法，作为从人类反馈中强化学习（RLHF）的无RL替代方案而出现。尽管DPO取得了各种进展并存在固有局限性，但文献中目前缺乏对这些方面的深入综述。在这项工作中，我们对DPO中的挑战和机遇进行了全面回顾，涵盖理论分析、变体、相关偏好数据集和应用。具体而言，我们基于关键研究问题对近期DPO研究进行分类，以提供对DPO当前格局的透彻理解。此外，我们提出了几个未来研究方向，为研究社区提供模型对齐的见解。相关论文的更新合集可在此https URL找到。

英文摘要

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
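For readers unfamiliar with the objective being surveyed, the standard DPO loss for a single preference pair can be written in a few lines. This is a sketch of the commonly cited formulation, not code from the survey; the function and argument names are illustrative, and the inputs are assumed to be summed sequence log-probabilities under the policy and the frozen reference model.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the log-probability of a full response given the prompt.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-ratio relative to the rejected one.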

2310.05753 2026-06-18 cs.AI 版本更新

Large-Scale OD Matrix Estimation with A Deep Learning Method

基于深度学习的大规模OD矩阵估计

Zheli Xiong, Defu Lian, Enhong Chen, Gang Chen, Xiaomin Cheng

发表机构 * IEEE Publication Technology Group（IEEE出版技术组）

AI总结提出一种结合深度学习与数值优化的方法，利用探针交通流推断结构约束，实现大规模OD矩阵的实时估计，无需先验信息且具有良好泛化性。

Comments 12 pages, 25 figures

AI中文摘要

起点-终点（OD）矩阵估计是智能交通系统（ITS）的关键方面。它涉及通过回归当前观测值（如路段交通计数，例如使用最小二乘法）来调整初始OD矩阵。然而，OD估计问题缺乏足够的约束，在数学上是欠定的。为缓解此问题，一些研究者将先验OD矩阵作为回归目标以提供更多结构约束，但该方法高度依赖于可能过时的先验矩阵。另一些研究者通过传感器数据（如车辆轨迹和速度）添加结构约束，这些数据能实时反映更当前的结构约束。我们提出的方法将深度学习与数值优化算法相结合，以推断矩阵结构并指导数值优化。该方法结合了深度学习与数值优化算法的优势。神经网络（NN）学习从探针交通流中推断结构约束，消除了对先验信息的依赖，并提供了实时性能。此外，由于NN的泛化能力，该方法在工程上经济高效。我们进行了测试，证明了该方法在大规模合成数据集上的良好泛化性能。随后，我们在真实交通数据上验证了方法的稳定性。实验证实了结合NN与数值优化的优势。

英文摘要

The estimation of origin-destination (OD) matrices is a crucial aspect of Intelligent Transport Systems (ITS). It involves adjusting an initial OD matrix by regressing the current observations like traffic counts of road sections (e.g., using least squares). However, the OD estimation problem lacks sufficient constraints and is mathematically underdetermined. To alleviate this problem, some researchers incorporate a prior OD matrix as a target in the regression to provide more structural constraints. However, this approach is highly dependent on the existing prior matrix, which may be outdated. Others add structural constraints through sensor data, such as vehicle trajectory and speed, which can reflect more current structural constraints in real-time. Our proposed method integrates deep learning and numerical optimization algorithms to infer matrix structure and guide numerical optimization. This approach combines the advantages of both deep learning and numerical optimization algorithms. The neural network (NN) learns to infer structural constraints from probe traffic flows, eliminating dependence on prior information and providing real-time performance. Additionally, due to the generalization capability of NN, this method is economical in engineering. We conducted tests to demonstrate the good generalization performance of our method on a large-scale synthetic dataset. Subsequently, we verified the stability of our method on real traffic data. Our experiments provided confirmation of the benefits of combining NN and numerical optimization.
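The underdetermination mentioned in the abstract, and the prior-matrix remedy it criticizes, can be seen in a toy problem. The sketch below is not the paper's method: it is the classical baseline of least squares regularized toward a prior, solved by plain gradient descent, and the names `estimate_od`, `assign`, `lam`, and `step` are assumptions for illustration.

```python
def estimate_od(assign, counts, prior, lam=0.1, step=0.2, iters=200):
    """Minimize ||A x - b||^2 + lam * ||x - prior||^2 by gradient descent.

    assign[i][j] is 1 if OD pair j uses link i, counts[i] is the observed
    flow on link i, and prior[j] is the prior demand for OD pair j.
    The step size must be small relative to the problem scale to converge.
    """
    x = list(prior)
    for _ in range(iters):
        # Link-flow residuals A x - b under the current demand estimate.
        resid = [sum(a * xj for a, xj in zip(row, x)) - b
                 for row, b in zip(assign, counts)]
        # Gradient of the data term plus the pull toward the prior.
        grad = [2 * sum(assign[i][j] * resid[i] for i in range(len(assign)))
                + 2 * lam * (x[j] - prior[j])
                for j in range(len(x))]
        x = [xj - step * g for xj, g in zip(x, grad)]
    return x
```

With one counted link shared by two OD pairs, the count alone fixes only their sum; the estimate keeps the prior's split between the two pairs, which is exactly why an outdated prior is a problem.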

2604.25848 2026-06-18 cs.AI 版本更新

A Distributionally Robust Reinforcement Learning Framework for Constrained Urban EV Dispatch

面向约束城市电动汽车调度的分布鲁棒强化学习框架

An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui

发表机构 * College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam（VinUniversity 工程与计算机科学学院，河内，越南）； Center for Environmental Intelligence, VinUniversity, Hanoi, Vietnam（VinUniversity 环境智能中心，河内，越南）

AI总结针对城市电动汽车调度中充电站和馈线容量约束及不确定需求，提出基于半马尔可夫决策过程与分布鲁棒软演员-评论家算法，通过图卷积编码器和滚动混合整数线性规划保证可行性，在纽约出租车数据仿真中实现最高净利润且零违规。

AI中文摘要

我们研究城市规模的电动汽车（EV）网约车车队控制，其中调度、重新定位和充电决策必须在不确定且空间相关的出行需求和旅行时间下，遵守充电器和馈线限制。我们将问题建模为六边形网格半马尔可夫决策过程（semi-MDP），具有混合动作——用于服务、重新定位和充电的离散动作，以及连续充电功率——和可变动作持续时间。为了保证训练和部署期间的物理可行性，策略在由掩码温度退火actor产生的高层意图上学习。这些意图在每个决策步骤通过一个时间受限的滚动混合整数线性规划（MILP）进行投影，该规划严格强制执行荷电状态、充电端口和馈线约束。为了缓解分布偏移，我们针对一个Wasserstein-1模糊集优化软演员-评论家（SAC）智能体，该模糊集使用图对齐的马氏基础度量来捕捉空间相关性。鲁棒备份使用Kantorovich-Rubinstein对偶、投影次梯度内环和原始-对偶风险预算更新。我们的架构结合了两层图卷积网络（GCN）编码器、双评论家和一个驱动对手的价值网络。基于纽约出租车数据构建的大规模电动汽车车队模拟器上的实验表明，PD-RSAC实现了最高的净利润，达到122万美元，而强启发式、单智能体RL和多智能体RL基线（包括Greedy、SAC、MAPPO和MADDPG）的净利润为58万至70万美元，同时保持零馈线限制违规。

英文摘要

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M to $0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
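One plausible reading of the "masked, temperature-annealed actor" is a softmax over discrete intentions in which infeasible actions are zeroed out and a temperature controls how sharply the policy commits. The sketch below illustrates that idea only; it is an assumption about the general technique, not the PD-RSAC implementation, and it does not include the MILP projection that the abstract says enforces the hard constraints.

```python
import math

def masked_softmax(logits, feasible, temperature=1.0):
    """Softmax over feasible actions only; infeasible actions get probability 0."""
    scaled = [l / temperature for l, ok in zip(logits, feasible) if ok]
    if not scaled:
        raise ValueError("at least one action must be feasible")
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(l / temperature - peak) if ok else 0.0
               for l, ok in zip(logits, feasible)]
    total = sum(weights)
    return [w / total for w in weights]
```

Annealing the temperature downward over training moves the distribution from exploratory to nearly greedy while the mask keeps, for example, a "charge" action at zero probability when no port is free.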

2605.03460 2026-06-18 cs.AI cs.LG 版本更新

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

FinSTaR：迈向基于时间序列推理模型的金融推理

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

发表机构 * LG AI Research（LG人工智能研究）

AI总结针对时间序列推理模型在金融领域的失效问题，提出基于2x2能力分类法的FinSTaR模型，通过Compute-in-CoT和Scenario-Aware CoT策略在FinTSR-Bench基准上达到78.9%平均准确率。

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

AI中文摘要

时间序列推理模型在通用领域表现出色，但在具有独特特征的金融领域却持续失败。我们提出一个通用的2x2能力分类法，通过交叉1)单实体与多实体分析，以及2)当前状态评估与未来行为预测来划分TSRM能力。我们在金融领域实例化该分类法——其中确定性评估与随机性预测的区分尤为关键——形成十个金融推理任务，并基于标普股票构建FinTSR-Bench基准。为此，我们提出FinSTaR（金融时间序列思考与推理），在FinTSR-Bench上训练，并针对每个类别采用不同的思维链策略。对于评估（确定性，即可从可观测数据计算得出），我们采用Compute-in-CoT，一种程序化思维链，使模型能够直接从原始价格推导答案。对于预测（本质上是随机的，即受不可观测因素影响），我们采用场景感知思维链，在做出判断前生成多种场景，模拟金融分析师在不确定性下的推理方式。所提方法在FinTSR-Bench上达到78.9%的平均准确率，显著优于LLM和TSRM基线。此外，我们展示了四个能力类别通过联合训练具有互补性和相互增强性，并且场景感知思维链相比标准思维链持续提升预测准确率。代码已公开：https://github.com/seunghan96/FinSTaR。

英文摘要

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

2606.08532 2026-06-18 cs.AI 版本更新

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

DN-Hypo-Pipeline：一种基于大语言模型和科学解释的AI驱动假设生成工作流

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

发表机构 * Computer Network Information Center, Chinese Academy of Sciences, China（中国科学院计算机网络信息中心）

AI总结提出DN-Hypo-Pipeline，利用大语言模型和科学解释作为先验知识，从现有文献中推导新假设，在数据科学建模中通过统计推断和专家评估证明优于直接生成方法，并验证了生成假设对应的算法性能。

AI中文摘要

科学假设是研究的第一步并经过实验验证，但它也反映了对科学现象的深刻理解和推理。我们引入了DN-Hypo-Pipeline，一种基于大语言模型的AI驱动工作流，旨在通过利用科学解释作为先验知识来支持结构化科学思维和假设生成。该流水线帮助研究人员从现有文献中推导出新假设。给定研究论文的解释项（即结论），它识别潜在的定律、理论和原理，并为观察到的现象重构一个新的、尚未验证的解释。我们在数据科学建模领域使用三篇高被引论文评估了DN-Hypo-Pipeline。由LLM作为评判者和人类专家评估支持的统计推断表明，我们的流水线比直接生成方法更有效。此外，我们通过开发相应新颖算法验证了得分最高的两个生成假设，这些算法优于原始论文中提出的基线模型。除了在数据科学中的应用，DN-Hypo-Pipeline还提供了一个理论框架，不仅包含了理论指导的数据科学建模方法，还揭示了建模过程更基础的结构。此外，这种方法本质上是理论指导建模的推广，具有扩展到其他领域和更广泛科学学科的潜力。

英文摘要

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

2606.10376 2026-06-18 cs.AI cs.IT math.IT 版本更新

Belief-Space Control for Personalized Cancer Treatment via Active Inference

基于主动推理的个性化癌症治疗信念空间控制

Deniz Sargun, H. Bugra Tulay, C. Emre Koksal

发表机构 * American Association for Cancer Research（美国癌症研究协会）； AACR Project GENIE registry（AACR Project GENIE 注册中心）； AACR Project GENIE Biopharma Collaborative（AACR Project GENIE 生物制药合作组织）

AI总结提出用主动推理将癌症治疗建模为信念空间规划问题，在测量预算下统一目标导向控制与信息获取，实现患者分类与高效治疗。

Comments 11 pages including appendix

2509.24725 2026-06-18 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net：基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Delft University of Technology（代尔夫特理工大学）

AI总结本文提出Q-Net框架，通过结合卡尔曼滤波与神经网络，解决信号交叉口队列长度估计中的数据融合问题，提升空间转移性和实时性，实现无需昂贵传感设备的准确队列估计。

DOI: 10.1016/j.trc.2026.105809

AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源可用：(i) 接近停止线的环形检测器提供的车辆计数汇总数据，以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD)，车流的部分可观测性仍使这一任务变得复杂。然而，如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此，本文提出Q-Net：一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战，如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构，并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现，并通过将aFCD测量分组为固定大小的局部组来提高空间转移性，使可学习参数的数量与路段长度无关。在荷兰鹿特丹城市主干道的评估显示，Q-Net优于基线方法，能够准确追踪队列的形成和消散，并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性，Q-Net在无需昂贵的传感基础设施（如摄像头或雷达）的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
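
The Kalman predict-update structure mentioned in the abstract can be sketched in a few lines. This is a minimal illustration of the classical linear step, not Q-Net itself: according to the abstract, Q-Net learns time-varying gain dynamics from data, whereas the gain `K` below comes from the textbook closed form. The function name `kalman_step` and the matrices `F`, `H`, `Q`, `R` are illustrative assumptions, not names taken from the paper.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One classical Kalman predict-update step (hypothetical helper)."""
    # Predict: propagate the state estimate and its covariance.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: weigh the measurement residual by the Kalman gain.
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # closed-form gain (Q-Net learns this part)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new, K

# Toy one-dimensional example: prior 0 with variance 1, measurement 2 with variance 1.
I1 = np.eye(1)
x1, P1, K1 = kalman_step(np.array([0.0]), I1, np.array([2.0]),
                         F=I1, H=I1, Q=np.zeros((1, 1)), R=I1)
```

With equal prior and measurement variances the gain is 0.5, so the updated estimate lands halfway between the prior and the measurement.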

2307.05623 2026-06-18 cs.LG cs.AI 版本更新

A DeepLearning Framework for Dynamic Estimation of Origin-Destination Sequence

一种用于动态估计起点-终点序列的深度学习框架

Zheli Xiong, Defu Lian, Enhong Chen, Gang Chen, Xiaomin Cheng

发表机构 * School of Data Science, University of Science and Technology of China（中国科学技术大学数据科学学院）； Yangtze River Delta Information Intelligence Innovation Research Institute, China（长江三角洲信息智能创新研究院）

AI总结针对OD矩阵估计中的欠定性和滞后性问题，提出集成深度学习方法，利用神经网络推断OD序列结构并引导数值优化，实验证明能有效提供时空约束。

Comments 11 pages, 25 figures

详情

AI中文摘要

OD矩阵估计是交通领域的一个关键问题。主要方法利用交通传感器测量信息（如交通计数）来估计由OD矩阵表示的交通需求。该问题分为两类：静态OD矩阵估计和动态OD矩阵序列（简称OD序列）估计。上述两类都面临由大量待估参数和不足的约束信息引起的欠定性问题。此外，OD序列估计还面临滞后挑战：由于拥堵等不同交通状况，同一车辆在相同观测时段内会出现在不同路段，导致相同的OD需求对应不同的行程。为此，本文提出一种集成方法，利用深度学习方法推断OD序列的结构，并利用结构约束指导传统数值优化。实验表明，神经网络能有效推断OD序列的结构，并为数值优化提供实用的约束以获得更好的结果。此外，实验表明，所提供的结构信息不仅包含对OD矩阵空间结构的约束，还提供了对OD序列时间结构的约束，很好地解决了滞后问题的影响。

英文摘要

OD matrix estimation is a critical problem in the transportation domain. The principal method uses information measured by traffic sensors, such as traffic counts, to estimate the traffic demand represented by the OD matrix. The problem is divided into two categories: static OD matrix estimation and dynamic OD matrix sequence (OD sequence for short) estimation. Both face the underdetermination problem caused by the large number of parameters to be estimated and insufficient constraint information. In addition, OD sequence estimation faces the lag challenge: under different traffic conditions such as congestion, the same vehicle appears on different road sections during the same observation period, so identical OD demands correspond to different trips. To this end, this paper proposes an integrated method that uses deep learning to infer the structure of the OD sequence and uses structural constraints to guide traditional numerical optimization. Our experiments show that the neural network (NN) can effectively infer the structure of the OD sequence and provide practical constraints for numerical optimization to obtain better results. Moreover, the experiments show that the provided structural information constrains not only the spatial structure of OD matrices but also the temporal structure of the OD sequence, which addresses the effect of the lag problem well.

2507.16859 2026-06-18 cs.RO cs.AI 版本更新

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

通过异构多源数据集成与跨域模态插补增强疲劳检测

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

AI总结针对实际部署环境中高质量传感器不可用的问题，提出异构多源疲劳检测框架，利用共享模态进行跨域模态插补，融合源域知识提升目标域疲劳检测性能。

Comments 4 figures, 14 pages

详情

AI中文摘要

疲劳检测对于安全相关应用（如航空、采矿和长途运输）中的人类操作员至关重要。可靠的操作员疲劳估计可以支持人机系统中的及时警告、自适应任务调度、接管提醒和其他安全管理决策。然而，这些功能的有效性取决于疲劳相关信号是否能在部署环境中可靠捕获。虽然许多研究已显示高保真传感器在受控实验室环境中的价值，但在实际环境中，由于噪声、光照条件和视野限制，其性能往往会下降，从而限制了实际应用。本文形式化了一种面向实际部署的疲劳检测设置，其中高质量传感器在实际应用中通常不可用。为解决这一问题，我们利用来自异构源域的知识，包括难以在现场部署但常用于受控环境的高保真传感器，来辅助真实目标域中的疲劳检测。基于这一思想，我们设计了一个异构多源疲劳检测框架，该框架利用目标域中的可用模态，同时通过基于共享模态的跨域模态插补来利用源域中的多样化配置。

英文摘要

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.

2511.14555 2026-06-18 q-bio.NC cs.AI 版本更新

DecNefSimulator: A Modular, Interpretable Framework for Decoded Neurofeedback Simulation Using Generative Models

DecNefSimulator：一个用于解码神经反馈模拟的模块化、可解释框架

Alexander Olza, Roberto Santana, David Soto

发表机构 * Intelligent Systems Group, University of the Basque Country (UPV/EHU)（巴斯克国家大学智能系统组）； Consciousness Group, Basque Center on Cognition, Brain and Language (BCBL)（巴斯克认知、大脑与语言中心意识组）； Ikerbasque, Basque Foundation for Science（巴斯克科学基金会）

AI总结提出DecNefSimulator，一个模块化可解释的模拟框架，将解码神经反馈形式化为机器学习问题，通过潜变量生成模型模拟参与者，直接观察内部状态并评估协议设计对学习的影响，可复现经验现象、识别失败条件并指导协议设计。

详情

AI中文摘要

解码神经反馈（DecNef）是一种有前景的非侵入性脑调控方法，在神经医学和认知神经科学中具有广泛应用。然而，DecNef研究的进展仍受限于受试者依赖的学习变异性、依赖间接测量来量化进展，以及实验的高成本和时间消耗。我们提出DecNefSimulator，一个模块化且可解释的模拟框架，将DecNef形式化为一个机器学习问题。除了提供虚拟实验室，DecNefSimulator使研究人员能够建模、分析和理解神经反馈动态。通过使用潜变量生成模型作为模拟参与者，DecNefSimulator允许直接观察内部认知状态，并系统评估不同协议设计和受试者特征如何影响学习。我们展示了这种方法如何（i）复现DecNef学习的经验现象，（ii）识别DecNef反馈未能诱导学习的条件，以及（iii）在人体实施之前，在计算机中指导设计更稳健可靠的DecNef协议。总之，DecNefSimulator连接了计算建模和认知神经科学，为方法创新、稳健协议设计以及最终更深入地理解基于DecNef的脑调控提供了原则性基础。

英文摘要

Decoded Neurofeedback (DecNef) is a promising non-invasive approach to brain modulation with wide-ranging applications in neuromedicine and cognitive neuroscience. However, progress in DecNef research remains constrained by subject-dependent learning variability, reliance on indirect measures to quantify progress, and the high cost and time demands of experimentation. We present DecNefSimulator, a modular and interpretable simulation framework that formalizes DecNef as a machine learning problem. Beyond providing a virtual laboratory, DecNefSimulator enables researchers to model, analyze and understand neurofeedback dynamics. Using latent variable generative models as simulated participants, DecNefSimulator allows direct observation of internal cognitive states and systematic evaluation of how different protocol designs and subject characteristics influence learning. We demonstrate how this approach can (i) reproduce empirical phenomena of DecNef learning, (ii) identify conditions under which DecNef feedback fails to induce learning, and (iii) guide the design of more robust and reliable DecNef protocols in silico before human implementation. In summary, DecNefSimulator bridges computational modeling and cognitive neuroscience, offering a principled foundation for methodological innovation, robust protocol design, and ultimately, a deeper understanding of DecNef-based brain modulation.

2512.09185 2026-06-18 cs.CV cs.AI 版本更新

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

学习患者特异性疾病动态：基于潜在流匹配的纵向影像生成

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

发表机构 * University of Cambridge（剑桥大学）； Nanjing First Hospital（南京第一医院）； Nanjing Medical University（南京医科大学）； Johns Hopkins University（约翰霍普金斯大学）； University of Dundee（邓迪大学）

AI总结提出Δ-LFM框架，利用流匹配对齐患者潜在轨迹，通过患者特异性潜在对齐实现单调疾病进展建模，在三个纵向MRI基准上验证了可解释性和性能。

Comments ICLR 2026 accepted

详情

AI中文摘要

理解疾病进展是一个直接的临床挑战，对早期诊断和个性化治疗具有重要意义。虽然最近的生成方法试图对进展进行建模，但关键不匹配仍然存在：疾病动态本质上是连续且单调的，然而潜在表示通常是分散的，缺乏语义结构，并且基于扩散的模型通过随机去噪过程破坏了连续性。在这项工作中，我们提出将疾病动态视为速度场，并利用流匹配（FM）来对齐患者数据的时间演变。与先前方法不同，它捕捉了疾病的内在动态，使进展更具可解释性。然而，一个关键挑战仍然存在：在潜在空间中，自动编码器（AE）不能保证跨患者的对齐或与临床严重性指标（例如年龄和疾病状况）的相关性。为了解决这个问题，我们提出学习患者特异性潜在对齐，这迫使患者轨迹沿着特定轴延伸，其幅度随疾病严重程度单调增加。这导致了一个一致且语义上有意义的潜在空间。总之，我们提出了Δ-LFM，一个用于通过流匹配建模患者特异性潜在进展的框架。在三个纵向MRI基准上，Δ-LFM展示了强大的实证性能，更重要的是，为解释和可视化疾病动态提供了一个新框架。

英文摘要

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered, lacking semantic structure, and diffusion-based models disrupt continuity with random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which enforces patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.

2601.14288 2026-06-18 astro-ph.CO cs.AI cs.CE gr-qc hep-th 版本更新

DeepInflation: an AI agent for research and model discovery of inflation

DeepInflation：用于暴胀研究与模型发现的AI智能体

Ze-Yu Peng, Hao-Shi Yuan, Qi Lai, Jun-Qian Jiang, Gen Ye, Jun Zhang, Yun-Song Piao

发表机构 * School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China； International Centre for Theoretical Physics Asia-Pacific, University of Chinese Academy of Sciences, 100190 Beijing, China； Taiji Laboratory for Gravitational Wave Universe, University of Chinese Academy of Sciences, 100049 Beijing, China； School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China； Institute of Theoretical Physics, Chinese Academy of Sciences, P.O. Box 2735, Beijing 100190, China； Département de Physique Théorique, Université de Genève, 24 quai Ernest-Ansermet, CH-1211 Genève 4, Switzerland

AI总结提出基于多智能体架构的AI智能体DeepInflation，集成大语言模型、符号回归引擎和检索增强生成知识库，自动发现与最新观测一致的单场慢滚暴胀势，并解释理论背景。

详情

AI中文摘要

我们提出了DeepInflation，一个专为暴胀宇宙学中的研究和模型发现而设计的AI智能体。基于多智能体架构，DeepInflation将大语言模型（LLMs）与符号回归（SR）引擎以及检索增强生成（RAG）知识库相结合。该框架使智能体能够自动探索和验证广阔的暴胀势景观，同时将其输出建立在既定的理论文献基础上。我们证明，DeepInflation能够成功发现与最新观测（以ACT DR6结果为例）或任意给定的$n_s$和$r$一致的简单且可行的单场慢滚暴胀势，并为晦涩的暴胀场景提供准确的理论背景。DeepInflation作为宇宙学中新一代自主科学发现引擎的原型，使研究人员和非专家都能使用自然语言探索暴胀景观。该智能体可从此网址获取：https://github.com/pengzy-cosmo/DeepInflation。

英文摘要

We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi-agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval-augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single-field slow-roll inflationary potentials consistent with the latest observations (with the ACT DR6 results taken as an example) or any given $n_s$ and $r$, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non-experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy-cosmo/DeepInflation.
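
As a worked illustration of what "consistent with a given $n_s$ and $r$" means for a single-field slow-roll potential, the sketch below evaluates the standard first-order slow-roll formulas, $\epsilon = \tfrac{1}{2}(V'/V)^2$, $\eta = V''/V$, $n_s = 1 - 6\epsilon + 2\eta$ and $r = 16\epsilon$, in reduced Planck units. It is not DeepInflation's code. The function name `slow_roll_observables` and the choice of the quadratic potential as the example are assumptions made here for illustration.

```python
def slow_roll_observables(V, dV, d2V, phi):
    """First-order slow-roll n_s and r at field value phi (units with M_pl = 1)."""
    eps = 0.5 * (dV(phi) / V(phi)) ** 2   # first slow-roll parameter
    eta = d2V(phi) / V(phi)               # second slow-roll parameter
    return 1.0 - 6.0 * eps + 2.0 * eta, 16.0 * eps

# Example potential (an illustrative choice): V = m^2 phi^2 / 2.
m = 1.0

def V(phi):
    return 0.5 * m**2 * phi**2

def dV(phi):
    return m**2 * phi

def d2V(phi):
    return m**2

# For this potential, N e-folds before the end of inflation (eps = 1 at phi^2 = 2)
# corresponds to phi^2 = 4N + 2.
N = 60
phi_N = (4 * N + 2) ** 0.5
n_s, r = slow_roll_observables(V, dV, d2V, phi_N)
```

For the quadratic potential these formulas reduce to $n_s = 1 - 8/\phi^2$ and $r = 32/\phi^2$, which gives roughly 0.967 and 0.132 at $N = 60$. A candidate potential would be checked by comparing such values against the chosen observational constraints.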

2602.19591 2026-06-18 cs.LG cs.AI 版本更新

Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

使用异构图神经网络检测高潜力中小企业

Yijiashun Qi, Hanzhe Guo, Yijiazhen Qi

发表机构 * University of Michigan（密歇根大学）； The University of Hong Kong（香港大学）

AI总结提出SME-HGT异构图Transformer框架，利用公开数据构建包含公司、研究主题和政府机构的异构图，预测SBIR第一阶段获奖者能否进入第二阶段，AUPRC达0.621，优于基线模型。

Comments Accepted by ICIIS 2026

详情

AI中文摘要

中小企业占美国企业的99.9%，贡献44%的经济活动，但系统性地识别高潜力中小企业仍是一个开放挑战。我们提出了SME-HGT，一个异构图Transformer框架，仅使用公开数据预测哪些SBIR第一阶段获奖者将进入第二阶段资助。我们构建了一个异构图，包含32,268个公司节点、124个研究主题节点和13个政府机构节点，通过约99,000条边连接三种语义关系类型。SME-HGT在时间分割测试集上达到0.621±0.003的AUPRC，在五个随机种子上优于MLP基线（0.590±0.002）和R-GCN（0.608±0.013）。在筛选深度为100家公司时，SME-HGT达到89.6%的精确率，比随机选择提升2.14倍。我们的时间评估协议防止信息泄露，对公开数据的依赖确保了可重复性。这些结果表明，公司、研究主题和资助机构之间的关系结构为中小企业潜力评估提供了有意义的信号，对政策制定者和早期投资者具有启示意义。

英文摘要

Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity, yet systematically identifying high-potential SMEs remains an open challenge. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which SBIR Phase I awardees will advance to Phase II funding using exclusively public data. We construct a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes connected by approximately 99,000 edges across three semantic relation types. SME-HGT achieves an AUPRC of 0.621 ± 0.003 on a temporally-split test set, outperforming an MLP baseline (0.590 ± 0.002) and R-GCN (0.608 ± 0.013) across five random seeds. At a screening depth of 100 companies, SME-HGT attains 89.6% precision with a 2.14× lift over random selection. Our temporal evaluation protocol prevents information leakage, and our reliance on public data ensures reproducibility. These results demonstrate that relational structure among firms, research topics, and funding agencies provides meaningful signal for SME potential assessment, with implications for policymakers and early-stage investors.
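
The screening metrics quoted in the abstract, precision at a depth of 100 and lift over random selection, can be made concrete with a small sketch. It assumes that lift is defined as precision at depth k divided by the overall positive rate, which the abstract does not spell out. The helper names `precision_at_k` and `lift_at_k` and the toy data are hypothetical.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

def lift_at_k(scores, labels, k):
    """Precision at k relative to the base rate (assumed definition of lift)."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate

# Toy data: six companies, three of which advance (label 1).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
p_at_2 = precision_at_k(scores, labels, 2)
lift_2 = lift_at_k(scores, labels, 2)
```

Under that assumed definition, a precision of 89.6% with a 2.14-fold lift would correspond to a base rate of about 0.896 / 2.14 ≈ 0.419. That figure is an inference from the abstract, not a number the abstract reports.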

2603.28707 2026-06-18 cs.CE cs.AI 版本更新

A Convex Route to Thermoelasticity: Learning Internal Energy and Dissipation

热弹性的凸路径：学习内能与耗散

Hagen Holthusen, Paul Steinmann, Ellen Kuhl

发表机构 * Institute of Applied Mechanics, University of Erlangen-Nuremberg, Egerlandstraße 5, 91058 Erlangen, Germany（埃尔兰根-纽伦堡应用力学研究所，埃尔兰根大学，德国）； Department of Mechanical Engineering, Stanford University, United States（机械工程系，斯坦福大学，美国）

AI总结提出基于物理的神经网络框架，通过输入凸神经网络表示内能和耗散势，自动满足热力学第二定律，实现全耦合热力学本构建模。

Comments 31 pages, 16 figures, 4 tables

详情

DOI: 10.1016/j.cma.2026.119082

AI中文摘要

我们提出了一个基于物理的神经网络框架，用于发现全耦合热力学中的本构模型。与基于亥姆霍兹能量的经典公式不同，我们采用内能和耗散势作为主要本构函数，以变形和熵为变量。这一选择避免了强制混合凸-凹条件，并促进了热力学原理的一致纳入。在本文中，我们关注没有优先方向或内变量的材料。尽管公式以熵表示，但温度被视为独立可观测量，熵通过本构关系内部推断，从而在不需要熵数据的情况下实现热力学一致建模。网络的热力学可接受性通过构造保证。内能和耗散势由输入凸神经网络表示，确保凸性和符合第二定律。客观性、材料对称性和归一化通过基于不变量的表示和零锚定公式直接嵌入架构中。我们在合成和实验数据集上展示了所提出框架的性能，包括纯热问题以及软组织和填充橡胶的全耦合热力学响应。结果表明，学习模型准确捕捉了潜在的本构行为。所有代码、数据和训练模型均通过 https://doi.org/10.5281/zenodo.19248596 公开提供。

英文摘要

We present a physics-based neural network framework for the discovery of constitutive models in fully coupled thermomechanics. In contrast to classical formulations based on the Helmholtz energy, we adopt the internal energy and a dissipation potential as primary constitutive functions, expressed in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity--concavity conditions and facilitates a consistent incorporation of thermodynamic principles. In this contribution, we focus on materials without preferred directions or internal variables. While the formulation is posed in terms of entropy, the temperature is treated as the independent observable, and the entropy is inferred internally through the constitutive relation, enabling thermodynamically consistent modeling without requiring entropy data. Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, ensuring convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations. We demonstrate the performance of the proposed framework on synthetic and experimental datasets, including purely thermal problems and fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are made publicly available via https://doi.org/10.5281/zenodo.19248596.

2604.00730 2026-06-18 cs.CY cs.AI cs.LG cs.SE 版本更新

A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

基于CEFR启发的模糊C均值分类框架：自动化评估Scratch编程技能

Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles

发表机构 * Universidad Rey Juan Carlos（胡安·卡洛斯国王大学）

AI总结提出一种基于CEFR的Scratch项目评估框架，使用模糊C均值聚类对200万+项目分级，识别B2瓶颈并引入分类确定性指标以平衡自动反馈与人工审核。

Comments Best Paper Award CSEDU 2026 -Minor change FPC fix-

详情

AI中文摘要

背景：学校、培训平台和技术公司日益需要以透明、可重复的方法大规模评估编程能力，以支持个性化学习路径。目标：本研究引入一个与欧洲共同语言参考标准（CEFR）一致的Scratch项目评估教学框架，为学生和教师提供通用能力等级，并为课程设计提供可行见解。方法：我们对通过此http URL评估的2008246个Scratch项目应用模糊C均值聚类，实施序数准则将聚类映射到CEFR等级（A1-C2），并引入增强分类指标，识别过渡学习者，实现持续进度跟踪，量化分类确定性以平衡自动反馈与教师评审。影响：该框架能够诊断系统性课程缺口——特别是“B2瓶颈”，由于逻辑同步和数据表示的认知负荷，仅13.3%的学习者处于该等级——同时提供基于确定性的触发机制以进行人工干预。

基于傅里叶运动建模的条件潜扩散模型用于虚拟人群合成

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

发表机构 * Centre for Computational Imaging and Modelling in Medicine (CIMIM)（计算医学成像与建模中心）； University of Manchester（曼彻斯特大学）； Christabel Pankhurst Institute（克里斯塔贝尔·潘克赫斯特研究所）； Department of Computer Science（计算机科学系）； Division of Informatics, Imaging & Data Sciences（信息学、成像与数据科学学部）； Department of Electrical & Electronic Engineering（电气与电子工程系）； NIHR Manchester Biomedical Research Centre, Manchester Academic Health Sciences Centre, University of Manchester（英国国家健康与护理研究所曼彻斯特生物医学研究中心、曼彻斯特学术健康科学中心、曼彻斯特大学）

AI总结提出4D F-MeshLDM框架，结合卷积网格VAE、截断傅里叶级数运动参数化和条件扩散先验，实现可控的3D+t心脏网格序列生成，在UK Biobank数据上优于基线方法。

Comments This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

详情

AI中文摘要

医疗设备的计算机模拟试验需要生成虚拟解剖人群。在心血管应用中，虚拟解剖通常表示为从生成模型采样的3D+t网格。然而，大多数现有网格生成器关注静态解剖，而序列模型往往缺乏显式周期性。为此，我们提出4D F-MeshLDM，一个条件生成框架，包括用于编码网格的卷积网格VAE、使用截断傅里叶级数参数化运动的结构化潜空间，以及学习傅里叶系数令牌上潜分布的先验扩散。通过仿射调制将扩散过程条件化于临床协变量，我们实现了可控合成。采样令牌并执行逆傅里叶合成产生周期一致的潜轨迹，可解码为3D+t心脏网格序列。在5,000名UK Biobank受试者上的实验表明，4D F-MeshLDM在解剖保真度上优于最先进的基线，并实现了接近零的周期闭合误差。此外，生成的队列准确保留了临床功能指标，突显了我们的框架在可靠的心脏计算机模拟试验中的潜力。

英文摘要

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
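
The link between a truncated Fourier parameterisation and cycle consistency can be illustrated directly: a trajectory built from a finite sum of sines and cosines over one cardiac cycle returns to its starting point by construction. The sketch below is not the 4D F-MeshLDM implementation. The function name `fourier_trajectory`, the coefficient shapes, and the use of random coefficients in place of tokens sampled from a diffusion prior are all assumptions for illustration.

```python
import numpy as np

def fourier_trajectory(a0, a, b, t):
    """Latent trajectory z(t) = a0 + sum_k a_k cos(2*pi*k*t) + b_k sin(2*pi*k*t).

    a0: (D,) mean latent code; a, b: (K, D) coefficients for K harmonics;
    t: times in units of one cycle (t = 0 and t = 1 are the same phase).
    """
    t = np.asarray(t, dtype=float)
    k = np.arange(1, a.shape[0] + 1)
    phase = 2.0 * np.pi * np.outer(t, k)                # shape (T, K)
    return a0 + np.cos(phase) @ a + np.sin(phase) @ b   # shape (T, D)

# Random stand-ins for sampled coefficient tokens: K = 3 harmonics, D = 4 latent dims.
rng = np.random.default_rng(0)
a0 = rng.normal(size=4)
a = rng.normal(size=(3, 4))
b = rng.normal(size=(3, 4))
z = fourier_trajectory(a0, a, b, np.linspace(0.0, 1.0, 21))
closure_error = np.max(np.abs(z[-1] - z[0]))
```

Because every harmonic has an integer number of periods per cycle, `closure_error` is zero up to floating-point rounding, whatever the coefficients are. This is the kind of structural periodicity that a free-form sequence model does not guarantee.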

2604.23716 2026-06-18 cs.AI cs.IT cs.LG cs.MA math.IT 版本更新

Information-Theoretic Measures in AI: A Practical Decision Guide

人工智能中的信息论度量：实用决策指南

Nikolaos Al. Papadopoulos, Konstantinos E. Psannis

发表机构 * Department of Applied Informatics, University of Macedonia（马其顿大学应用信息系）

AI总结本文为七种信息论度量提供实用决策框架，围绕每个度量的三个关键问题：回答的问题与AI场景、适合的估计器、最危险的误用，并附有流程图和决策表。

Comments 25 pages, 2 tables, 1 figure. Submitted to Entropy (MDPI)

详情

AI中文摘要

信息论（IT）度量在人工智能中无处不在：熵驱动决策树分裂和不确定性量化，交叉熵是默认的分类损失，互信息支撑表示学习和特征选择，转移熵揭示动态系统中的有向影响。第二类较不成熟的度量——整合信息（Phi）、有效信息（EI）和自主性——已出现用于表征智能体复杂性。尽管被广泛采用，度量选择常常与估计器假设、失败模式和安全的推断主张脱节。本文为所有七种度量提供了一个实用决策框架，围绕每个度量的三个指导性问题组织：（i）该度量回答什么问题，在何种AI背景下；（ii）哪种估计器适合数据类型和维度；（iii）最危险的误用是什么。该框架通过两个互补的人工制品实现：度量选择流程图和主决策表。我们涵盖每个度量的AI/ML和决策智能体应用领域，并使用标准化桥接框将IT量与认知构造联系起来。三个工作示例展示了该框架在具体从业者场景中的应用，涵盖表示学习、时间影响分析和进化智能体复杂性。

英文摘要

Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
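
For readers who want the first few quantities in concrete form, the sketch below computes entropy, cross-entropy, and mutual information for discrete distributions using the plain plug-in formulas. It does not reproduce the paper's decision framework or its estimator recommendations. The function names are illustrative, and the inputs are assumed to be already-normalised probability arrays.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # treat 0 * log 0 as 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; infinite if q is 0 where p is positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def mutual_information(joint):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    return (entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0))
            - entropy(joint.ravel()))

mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
mi_copy = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

Two independent fair bits share no information, while a bit and its exact copy share one full bit. When the probabilities are estimated from small samples rather than known, plug-in values like these are biased, which is one reason the choice of estimator matters.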

2606.00729 2026-06-18 cs.AI 版本更新

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

AI主权作为国家学习能力：基于人本学习机制视角看法国、美国与中国

Kim Phuc Tran

发表机构 * Univ. Lille, ENSAIT, ULR 2461 – GEMTEX（里尔大学、ENSAIT、ULR 2461 – GEMTEX）

AI总结本文提出将国家AI发展视为一个受控的信息注入与熵耗散平衡的动态学习系统，主张AI主权源于国家调节自身信息动力学的能力，而非单纯规模扩张。

详情

AI中文摘要

在法国，人工智能常被从投资、算力、监管、就业、主权和教育等维度讨论，这些维度通常被分开处理。本文提出一个统一解读：法国应被理解为一个“国家AI学习系统”。基于最近被形式化为熵调控表示学习动力学框架的人本学习机制（HCLM），我们将国家AI发展解释为信息注入与熵耗散之间的受控平衡。信息注入对应算力、数据、人才、研究、资本、产业部署和制度实验；熵耗散对应组织复杂性、协调摩擦、能源约束、监管不确定性、人才流动压力以及加强产业吸收的机会。核心主张是：AI主权并非仅源于规模，而是源于国家调节自身信息动力学的能力。本文将HCLM与神经标度律、内生增长理论、创造性破坏和博弈论联系起来，认为法国AI辩论应超越技术乐观主义与监管优先的二元对立。一个具有竞争力且以人为本的AI战略需要一个受控机制，其中信息注入增长快于制度耗散，同时避免不稳定、不平等或高能耗的扩张。我们提供了一个数学模型、可衡量的政策指标、博弈论命题、国家AI制度的说明性模拟，以及对法国的具体政策启示。所提出的观点将AI政策重新定义为对一个开放、战略性、非均衡学习系统的治理。

英文摘要

Artificial intelligence in France is often discussed through separate dimensions such as investment, compute, regulation, employment, sovereignty, and education. This viewpoint paper proposes a unified interpretation: France can be analyzed as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), we use HCLM not as a validated econometric model, but as a conceptual and diagnostic lens for interpreting national AI development as a balance between information injection, absorptive capacity, and institutional dissipation. Information injection includes compute, data, talent, research, capital, industrial deployment, and policy experimentation. Institutional dissipation refers to avoidable frictions such as administrative overload, coordination failures, energy constraints, regulatory uncertainty, talent mobility pressures, and weak industrial absorption. Regulation is not treated as mere friction: adaptive governance, trusted data spaces, and safety-oriented standards may increase long-term learning capacity by improving legitimacy, interoperability, and social trust. The central claim is not that a country follows neural-network equations, but that AI sovereignty depends on how effectively it converts distributed information into absorbed, coordinated, and socially legitimate capability. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, absorptive capacity, and coordination mechanisms. It offers a formal heuristic, policy indicators, illustrative scenarios, and implications for France. The numerical results are diagnostic scenarios, not econometric estimates or official rankings. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system that should be tested with historical and cross-country data.

2605.17131 2026-06-18 cs.CV cs.AI cs.LG 版本更新

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

针对点云分类和分割的深度学习架构系统性调研

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

发表机构 * State University of New York at Albany（纽约州立大学阿尔巴尼分校）

AI总结本文系统性地探讨了点云分类和分割中的深度学习架构，分析了点云数据的结构特性，分类了不同架构的工作，并评估了其在主流基准上的性能，同时指出了开放挑战和未来方向。

Comments We reviewed a decade of advancements in point cloud processing: trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

详情

DOI: 10.1145/3815180
Journal ref: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

AI中文摘要

点云因其简洁性和几何保真度而成为表示3D形状和场景最广泛采用的格式。然而，其固有的无序和不规则性质，加剧了传感器噪声和遮挡的影响，给基于机器学习的方法带来了独特的挑战。为应对这些问题，已开发出多种策略，包括转换为有序格式、提取局部几何特征以及基于排列不变或自注意力的处理方法。在本文中，我们的重点是深度学习模型在3D视觉三个基本任务中的应用：点云分类、部分分割和语义分割。我们首先正式定义点云数据，然后深入讨论其结构特性。接着，我们根据其骨干结构对重要工作进行分类，并评估其在流行基准上的性能。除了经验比较外，我们还提供了架构创新和局限性的见解。我们还概述了3D点云理解中的开放挑战和有前途的未来方向。

英文摘要

Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.

2606.00182 2026-06-18 cs.HC cs.AI cs.CY 版本更新

The New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace

新社会形象：AI能力与AI主动性如何影响职场中的自我与同伴感知

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

发表机构 * Autonomous Interactive Systems, University of Siegen（自主交互系统，锡根大学）； Experience & Interaction Design, University of Siegen（体验与交互设计，锡根大学）

AI总结通过2x2x2情景实验（n=50），研究AI能力与主动性水平对员工工作所有权、情感、意义感及角色动态的自我与同伴感知影响，发现低能力或低主动性的AI通常提升积极感知，但高能力与高主动性可能带来负面影响。

Comments Updated metadata following publication in Interacting with Computers. Added DOI and publication information

详情

DOI: 10.1093/iwc/iwag033

AI中文摘要

人机协作被视为将AI融入职场的最有前景方式。然而，这种协作的体验后果尚未被探索。具体而言，在与AI组成的团队中，人类如何感知自己（自我感知）以及同事如何看待他们（同伴感知）在工作所有权和工作意义方面。在一项2x2x2情景研究（n=50）中，参与者对所有权、情感、工作意义和满意度以及角色动态的感知进行了评分，其中AI主动性和AI能力作为被试内因素（低/高两个水平），视角（自我感知/同伴感知）作为被试间因素。我们的结果表明，低能力或低主动性的AI通常提升了与所有权、意义感、满意度和角色动态相关的感受，并增加了积极情感，减少了消极情感。然而，这些效应往往受到视角的影响。例如，低AI主动性从自我感知而非同伴感知中带来了更高的工作满意度。基于我们的发现，我们认为仅围绕绩效指标设计未来工作的AI可能并不足够。高能力和高主动性的AI驱动系统可能对所有权感知、工作身份、社会形象和团队动态产生不良影响，进而影响工作意义。

英文摘要

Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experiential consequences of this teaming. More specifically, in a team with AI, how humans perceive themselves (self-perception) and how they are perceived by their coworkers (peer perception) in terms of work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics across two levels (low/high) of AI proactivity and AI competency as within-subject factors, with point-of-view (self perception/peer perception) as between-subjects. Our results showed that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and also increased positive affect while reducing negative affect. However, these effects were often influenced by point-of-view. For instance, low AI proactivity resulted in higher job satisfaction from self-perception rather than peer perception. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be adequate. Highly competent and proactive AI-driven systems can have undesirable impacts on perceptions of ownership, job identity, social image and team dynamics, and consequently, job meaningfulness.

2606.15091 2026-06-18 cs.HC cs.AI 版本更新

Sensory Restoration via Brain-Computer Interfaces: A Unified 2 x 2 Framework and Convergence Roadmap

通过脑机接口的感觉恢复：统一的2×2框架与融合路线图

Xuan-The Tran

发表机构 * School of Mechanical Engineering, Vietnam Maritime University（越南海事大学机械工程学院）

AI总结本文提出一个统一的2×2框架，按侵入性和信号方向分类脑机接口，并定义恢复、替代和增强范式，同时给出近中长期的融合路线图。

2602.15513 2026-06-18 cs.RO cs.AI 版本更新

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

Affiliations * The University of Hong Kong; Beijing Institute of Technology

2602.20135 2026-06-18 cs.CL cs.AI cs.IR version update

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

Affiliations * University of Tehran; Independent Researcher; Amirkabir University of Technology; TEIAS Institute

Comments Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

2405.14273 2026-06-18 cs.LG cs.AI math.OC version update

Exact Solution to Data-Driven Inverse Optimization of MILPs in Finite Time via Gradient-Based Methods

Akira Kitaoka

Affiliations * NEC Corporation

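AI summary This paper studies data-driven inverse optimization for mixed integer linear programs, reveals the geometric structure of the suboptimality loss, proves that gradient-based optimization methods reach consistency with the observed data in finitely many iterations, and gives an upper bound on the iteration count for projected subgradient descent.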

Comments 66 pages; comments are welcome

Details

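AI abstract (translated from Chinese)

A data-driven inverse optimization problem (DDIOP) is the problem of estimating the objective-function parameters (weights) that explain observed optimal-solution data; it is widely applied in mixed integer linear programming (MILP). In inverse optimization for MILPs, the prediction error of the features is discontinuous in the weights, which makes directly applying gradient-based optimization challenging. This paper focuses on the suboptimality loss, which attains its minimum value, zero, when the weights are exactly consistent with the observed data. We reveal the geometric structure of this loss: it is convex and piecewise linear, and the set of weights exactly consistent with the observed data has positive "thickness" rather than being a single point or a thin boundary. Using this structure, we prove three results. First, a broad class of gradient-based optimization methods, including projected subgradient descent, reaches consistency with the observed data in finitely many iterations (an exact solution is obtained in finite time). Second, for projected subgradient descent, we give an explicit upper bound on the number of iterations needed to reach exact consistency. Third, when the forward problem is an integer linear program (ILP), we express this upper bound as a fully explicit iteration count determined solely by the number of samples, the feature dimension, and the structure of the constraint coefficient matrix (for example, if the coefficient matrix is totally unimodular, the iteration count is explicitly bounded by a polynomial in the square of the number of samples and the dimension). Numerical experiments confirm this finite-step attainment behavior.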

English abstract

A data-driven inverse optimization problem (DDIOP) is the problem of estimating the objective-function parameters (weights) that explain observed optimal-solution data, and it arises in many applications, including mixed integer linear programming (MILP). In inverse optimization for MILPs, the prediction error of the features is discontinuous with respect to the weights, so applying gradient-based optimization directly is difficult. In this paper we focus on the suboptimality loss. This loss attains its minimum value, zero, if and only if the weights are exactly consistent with the observed data. We reveal a geometric structure of this loss -- it is convex and piecewise linear, and moreover the set of weights that are exactly consistent with the observed data has a positive ``thickness'' rather than being a single point or a thin boundary -- and use it to show the following. First, a broad class of gradient-based optimization methods, including projected subgradient descent, reaches exact consistency with the observed data in finitely many iterations (an exact solution is obtained in finite time). Second, for projected subgradient descent we give an explicit upper bound on the number of iterations needed to reach exact consistency. Third, when the forward problem is an integer linear program (ILP), we give this upper bound as a fully explicit iteration count determined solely by the number of samples, the dimension of the features, and the structure of the constraint coefficient matrix. Through numerical experiments, we confirm this finite-step attainment behavior.
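
The finite-step behavior described in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm or experiments: it assumes a tiny two-variable ILP whose feasible set can be enumerated, a single observed solution `x_obs`, weights constrained to the probability simplex, and a constant step size of 0.1. The function names and all numeric values are hypothetical choices made for illustration.

```python
from itertools import product


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        running += u_k
        t = (running - 1.0) / k
        if u_k - t > 0:  # holds for k = 1..rho; the last hit sets theta
            theta = t
    return [max(x - theta, 0.0) for x in v]


def suboptimality_loss(w, feasible, x_obs):
    """Return (max_x w.x - w.x_obs, a maximizer x) over the feasible set."""
    best = max(feasible, key=lambda x: dot(w, x))
    return dot(w, best) - dot(w, x_obs), best


def projected_subgradient_descent(w, feasible, x_obs, step=0.1, max_iters=100):
    """Stop as soon as the loss is exactly zero, i.e. x_obs is optimal under w."""
    for it in range(max_iters):
        loss, best = suboptimality_loss(w, feasible, x_obs)
        if loss == 0.0:
            return w, it
        # A subgradient of the piecewise-linear loss at w is (best - x_obs).
        grad = [b - o for b, o in zip(best, x_obs)]
        w = project_to_simplex([wi - step * gi for wi, gi in zip(w, grad)])
    return w, max_iters


# Toy forward problem (assumption): maximize w.x over integer x >= 0 with x1 + x2 <= 3.
feasible = [x for x in product(range(4), repeat=2) if sum(x) <= 3]
x_obs = (3, 0)  # observed "optimal" decision

w_final, iters = projected_subgradient_descent([0.1, 0.9], feasible, x_obs)
final_loss, _ = suboptimality_loss(w_final, feasible, x_obs)
print(w_final, iters, final_loss)
```

In this toy instance every weight vector on the simplex with w1 >= w2 makes `x_obs` optimal, so the consistent set is an interval rather than a single point, which is the kind of positive "thickness" that lets the iterates land inside it after finitely many steps.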

2407.00449 2026-06-18 cs.LG cs.AI cs.NE version update

Fully tensorial approach to hypercomplex-valued neural networks

Agnieszka Niemczynowicz, Radosław Antoni Kycia

Affiliations * Faculty of Computer Science and Mathematics, Cracow University of Technology

Comments 23 pages, 3 figures

2512.04115 2026-06-18 cs.CY cs.AI cs.HC version update

Artificial Intelligence Competence of K-12 Students Shapes Their AI Risk Perception: A Co-occurrence Network Analysis

Ville Heilala, Pieta Sikström, Mika Setälä, Tommi Kärkkäinen

Affiliations * University of Jyväskylä

Comments Accepted for Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC'26)

2506.20869 2026-06-18 cs.SE cs.AI cs.IR version update

Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation

Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari, Pekka Abrahamsson

Affiliations * Faculty of Information Technology and Communication Sciences, Tampere University

Comments Published in the Proceedings of the 51st Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2025. Lecture Notes in Computer Science, volume 16082, pages 143-158. Springer, 2026

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE version update

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

Affiliations * Yokohama National University

Comments Accepted to ACL 2025 Findings

2506.09822 2026-06-18 cs.CE cs.AI version update

Superstudent intelligence in thermodynamics

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

Affiliations * Laboratory of Engineering Thermodynamics (LTD); Visual Information Analysis Research Group (VIA); Machine Learning Research Group (ML)

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

2504.12347 2026-06-18 cs.CL cs.AI cs.CY version update

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

Affiliations * Faculty of Information Technology; University of Jyväskylä; Faculty of Humanities and Social Sciences

2505.03863 2026-06-18 cs.CR cs.AI version update

Data-Driven Falsification of Cyber-Physical Systems

Atanu Kundu, Sauvik Gon, Rajarshi Ray

Affiliations * Indian Association for the Cultivation of Science

2406.15537 2026-06-18 q-bio.NC cs.AI cs.SD eess.AS version update

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

Affiliations * Department of Biomedicine and Prevention, University of Rome Tor Vergata; A.A. Martinos Center for Biomedical Imaging, Harvard Medical School/MGH, Boston (US)

Comments The first two authors contributed equally to this work

Details

DOI: 10.1016/j.neunet.2026.109195
Journal ref: Neural Networks, 203, 109195 (2026)

English abstract

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
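
The retrieval step described in the abstract (a linear map from brain activity to stimulus features, followed by a nearest-candidate lookup) can be sketched on synthetic data. This is only an illustration under stated assumptions, not the authors' pipeline: random 16-dimensional vectors stand in for CLAP embeddings, simulated 50-"voxel" responses stand in for fMRI data, closed-form ridge regression is assumed as the linear map, and cosine similarity is assumed as the retrieval score. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 16-dim "stimulus features" instead of CLAP
# embeddings, 50 simulated "voxels" instead of real fMRI responses.
n_train, n_test, n_feat, n_vox = 200, 20, 16, 50
mixing = rng.standard_normal((n_feat, n_vox))


def simulate(n):
    """Simulated brain responses: a noisy linear mixture of the stimulus features."""
    feats = rng.standard_normal((n, n_feat))
    brain = feats @ mixing + 0.1 * rng.standard_normal((n, n_vox))
    return brain, feats


brain_train, feats_train = simulate(n_train)
brain_test, feats_test = simulate(n_test)


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def retrieve(pred, candidates):
    """For each prediction, index of the candidate with highest cosine similarity."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(p @ c.T, axis=1)


W = fit_ridge(brain_train, feats_train)           # voxels -> feature space
retrieved = retrieve(brain_test @ W, feats_test)  # look up held-out stimuli
accuracy = float(np.mean(retrieved == np.arange(n_test)))
print(f"identification accuracy: {accuracy:.2f}")
```

On real recordings the noise level, the number of voxels, and the cross-subject alignment would dominate the result, so the accuracy printed here says nothing about the figures reported in the paper.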

1. Agents, Planning, and Decision-Making (7 papers)

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Dissecting model behavior through agent trajectories

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

MemRerank: Preference Memory for Personalized Product Reranking

PatchWorld: Gradient-Free Optimization of Executable World Models

2. Knowledge Representation, Reasoning, and Symbolic AI (4 papers)

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

3. Multi-Agent Systems and Games (5 papers)

Recursive Joint Simulation in Games

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Self-Evolving Multi-Agent Systems via Textual Backpropagation

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

4. Search, Optimization, and Constraint Solving (5 papers)

An In-depth Study of LLM Contributions to the Bin Packing Problem

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Scalable Batch Bayesian Optimization Via Subspace Acquisition Functions

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

5. Machine Learning and Representation Learning (23 papers)

Towards Understanding What State Space Models Learn About Code

Robust Regularized Policy Iteration under Transition Uncertainty

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Efficient Zeroth-Order Federated Finetuning of Language Models on Resource-Constrained Devices

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

Generalized Kullback-Leibler Divergence Loss

Grids Often Outperform Implicit Neural Representations at Compressing Dense Signals

From Memorization to Parameter Interference: How Overtraining Experts Harms Model Merging

HeRo-Q: A General Framework for Stable Low Bit Quantization via Hessian Conditioning

LLM Compression by Block Removal with Constrained Binary Optimization

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

Do Neural Networks Lose Plasticity in a Gradually Changing World?

Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech

The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

Controllable Quantum Memory Capacity in Quantum Reservoir Networks with Tunable partial-SWAPs

HAARES Half-Split Residual Basis Routing for Deep Transformers

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

6. Natural Language and Multimodal Intelligence (14 papers)

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

InstructTime++: Time Series Classification with Multimodal Language Modeling via Implicit Feature Enhancement

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

LVLMs and Humans Ground Differently in Referential Communication

Improve Large Language Model Systems with User Logs

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Enhancing Pathological VLMs with Cross-scale Reasoning

7. Robotics and Embodied Intelligence (1 paper)

Cosmos 3: Omnimodal World Models for Physical AI

8. Trustworthiness, Safety, and AI Governance (11 papers)

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Quality Perceptions and Intended Engagement in Response to AI-Generated and AI-Assisted News

Revealing Hidden Vulnerabilities in Autoencoders through Gradient Signal Restoration

Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

Practical Anonymous Two-Party Gradient Boosting Decision Tree

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

9. Evaluation, Benchmarks, and Datasets (18 papers)

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories