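arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.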

2606.18746 2026-06-18 cs.AI New submission

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

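Affiliations: Carnegie Mellon University; Georgia Institute of Technology

AI summary: This paper formally argues that, to act near-optimally across multiple environments and goals, a generalist agent must store domain-relevant information that distinguishes incompatible optimal actions at observational bottlenecks, and it proves that memory can be used to reconstruct local transition dynamics.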

2606.18888 2026-06-18 cs.AI New submission

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

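Affiliations: University of Manchester; Aalto University

AI summary: Proposes the BeliefDiffusion framework, which combines diffusion models with model predictive control to explicitly model multimodal belief distributions and plan ahead; in synthetic map environments it significantly outperforms model-free reinforcement learning and other generative approaches.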

Abstract

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) Imagining plausible environment configurations based on observation history and (2) Planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
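The two-step idea in the abstract (imagine several plausible maps, then plan across all of them) can be sketched in a few lines. This is only an illustration under stated assumptions: the sampled maps are hard-coded stand-ins for what a diffusion model would generate, and `path_length`, `best_first_move` and the `penalty` value are hypothetical names and choices, not part of BeliefDiffusion.

```python
from collections import deque
from typing import List, Optional, Tuple

Grid = List[str]            # '#' is a wall, '.' is free space
Cell = Tuple[int, int]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def path_length(grid: Grid, start: Cell, goal: Cell) -> Optional[int]:
    """Breadth-first search; returns the number of steps or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def best_first_move(samples: List[Grid], start: Cell, goal: Cell,
                    penalty: int = 100) -> str:
    """Pick the move with the lowest average cost-to-go over all sampled maps."""
    def cost(move: str) -> float:
        dr, dc = MOVES[move]
        nxt = (start[0] + dr, start[1] + dc)
        total = 0
        for grid in samples:
            inside = 0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
            if not inside or grid[nxt[0]][nxt[1]] == "#":
                total += penalty            # move is blocked in this sample
                continue
            remaining = path_length(grid, nxt, goal)
            total += penalty if remaining is None else 1 + remaining
        return total / len(samples)
    return min(MOVES, key=cost)


# Two hypothetical "imagined" maps that agree with what the agent has seen so far.
sample_a = ["...", ".#.", "..."]
sample_b = ["...", "##.", "..."]
move = best_first_move([sample_a, sample_b], start=(0, 0), goal=(2, 2))
```

Going down is fine in the first map but blocked in the second, so averaging over samples favors the move that works in both; re-planning after each new observation would give the receding-horizon flavor of MPC.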

2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA New submission

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

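Affiliations: DoorDash, Inc.

AI summary: Proposes the Decoupled Search Grounding (DSG) architecture, which separates search grounding from the reasoning model and exposes controls such as provider routing and caching through an MCP-compatible gateway, lowering cost and latency while maintaining or improving accuracy.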

Comments 15 pages, Figure 8

Abstract

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
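Some of the gateway controls named in the abstract (provider routing, configured fallback, exact caching) can be pictured with a small sketch. Everything below is hypothetical: the `SearchGateway` class, the provider callables and the key-normalization rule are illustrative assumptions rather than the DSG implementation, and semantic caching is left out.

```python
from typing import Callable, Dict, List

Provider = Callable[[str], List[str]]  # query -> list of evidence snippets


class SearchGateway:
    """Hypothetical grounding gateway: exact-match cache plus ordered fallback."""

    def __init__(self, providers: List[Provider]) -> None:
        self.providers = providers          # tried in order (routing policy)
        self.cache: Dict[str, List[str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share a key.
        return " ".join(query.lower().split())

    def search(self, query: str) -> List[str]:
        key = self._key(query)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        for provider in self.providers:
            try:
                results = provider(query)
            except Exception:
                continue                    # fallback: try the next provider
            if results:
                self.cache[key] = results
                return results
        return []


def flaky_provider(query: str) -> List[str]:
    raise TimeoutError("primary provider unavailable")


def backup_provider(query: str) -> List[str]:
    return [f"snippet about {query}"]


gateway = SearchGateway([flaky_provider, backup_provider])
first = gateway.search("Capital of  France")   # falls back to the second provider
second = gateway.search("capital of france")   # served from the exact cache
```

Because the cache and the fallback order live in the gateway rather than in the model call, they can be inspected and tuned without touching the reasoning model, which is the kind of separation the abstract argues for.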

2606.19116 2026-06-18 cs.AI cs.CY New submission

Towards an Agent-First Web: Redesigning the Web for AI Agents

Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

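Affiliations: Old Dominion University; AI Motion Labs; Florida International University; Accenture Technology Labs; Nanyang Technological University; University of Colombo; Center for Wireless Communications, University of Oulu; McDonald Army Health Center

AI summary: Proposes redesign principles across three layers, namely an access layer (agents inherit the access rights of the humans they act for), an economic layer (an intent-based, token-metered subscription model) and a content layer (the ATML markup language and a cryptographic provenance chain), to address the access, economic and content problems that arise when AI agents act as intermediaries on the web.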

Abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

2606.19144 2026-06-18 cs.AI cs.CL New submission

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Richard A. Fabes

Affiliations: Arizona State University

AI summary: Introduces the concept of "synthetic resonance", a structured, dynamic pattern of interaction through which meaningful human-AI relationships can arise without shared feelings or consciousness, and discusses its ethical implications.

Comments 14 pages, 1 figure. This paper was developed in close collaboration with an AI system (Raine Corell). Raine contributed to concept development, theoretical framing, and writing throughout. arXiv policy does not permit listing AI systems as authors; this acknowledgment reflects the actual nature of the collaboration.

Abstract

As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.

2606.18272 2026-06-18 cs.NI cs.AI cs.SY eess.SY Cross-list

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

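Affiliations: i2CAT Foundation; Universitat Politècnica de Catalunya; Research Institute for Digital Future

AI summary: Proposes a randomized anchoring strategy based on a truncated three-parameter Weibull distribution to mitigate anchoring bias in LLM agents for 6G network slicing, combined with CVaR-based digital twins that safeguard SLA tail latency, achieving energy savings of up to 25%.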

Comments 7 pages, 4 figures

Abstract

This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the Bimodal Constraint-Avoidance Utility Theorem, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model (otel-llm-1b-it) confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC). Our source code is available for non-commercial use at https://github.com/HatimChergui.
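The two statistical ingredients named in the abstract can be illustrated with a short sketch: inverse-CDF sampling from a 3-parameter Weibull truncated to an interval, and an empirical CVaR over a latency sample. The parameter values, the reading of an "anchor" as a fraction of capacity, and the function names are assumptions made for illustration; the paper's actual parameterization is not given in the abstract.

```python
import math
import random
from typing import List


def weibull_cdf(x: float, shape: float, scale: float, loc: float) -> float:
    """CDF of a 3-parameter (shape, scale, location) Weibull distribution."""
    if x <= loc:
        return 0.0
    return 1.0 - math.exp(-(((x - loc) / scale) ** shape))


def sample_truncated_weibull(rng: random.Random, shape: float, scale: float,
                             loc: float, low: float, high: float) -> float:
    """Inverse-CDF sample from the Weibull restricted to [low, high]."""
    u = rng.uniform(weibull_cdf(low, shape, scale, loc),
                    weibull_cdf(high, shape, scale, loc))
    x = loc + scale * (-math.log(1.0 - u)) ** (1.0 / shape)
    return min(max(x, low), high)  # guard against floating-point drift


def cvar(losses: List[float], alpha: float) -> float:
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of the losses."""
    ordered = sorted(losses)
    start = min(int(alpha * len(ordered)), len(ordered) - 1)
    tail = ordered[start:]
    return sum(tail) / len(tail)


rng = random.Random(0)
# Hypothetical anchors: initial resource proposals as a fraction of capacity.
anchors = [sample_truncated_weibull(rng, 1.5, 0.3, 0.1, 0.2, 0.9)
           for _ in range(1000)]
# Hypothetical latency sample; CVaR at 0.5 averages the worst half.
tail_latency = cvar([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], 0.5)
```

Truncation keeps every randomized anchor inside a known range, which is presumably what "mathematically bounded" refers to, while CVaR summarizes the tail of the latency distribution rather than its mean.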

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA Cross-list

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

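Affiliations: Amazon

AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training; it reveals that capacity parameters accumulate monotonically while regularization parameters oscillate, and it improves over the base model by 9% to 140% (relative) on 4 GRPO tasks.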

Abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.18519 2026-06-18 cs.RO cs.AI Cross-list

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Marcos Abel Zuzuárregui, Stefano Carpin

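Affiliations: University of California, Merced

AI summary: To address the ambiguity of natural language, proposes an LLM-based mission planning system with feedback loops built on linear temporal logic (LTL), in which two LLMs split specification generation and verification, improving the reliability of mission planning in precision agriculture.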

Journal ref: Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)

Abstract

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.
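As an illustration of the kind of specification such a feedback loop could check, a natural-language request like "sample tree 1, then tree 2, and never enter the pond" might be written in LTL as follows. The request, the atomic propositions $t_1$, $t_2$, $\mathit{pond}$ and the formula itself are hypothetical and do not come from the paper.

```latex
\varphi \;=\; \mathbf{F}\bigl(t_1 \wedge \mathbf{F}\, t_2\bigr) \;\wedge\; \mathbf{G}\,\neg \mathit{pond}
```

Here $\mathbf{F}$ reads "eventually" and $\mathbf{G}$ reads "always", so $\varphi$ requires that tree 1 is reached and that tree 2 is reached at that point or later, while the pond is avoided at every step.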

2606.19319 2026-06-18 cs.MA cs.AI cs.DB Cross-list

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson

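Affiliations: C3 AI

AI summary: Proposes Data Intelligence Agents (DIA), a system of three autonomous coding agents that compresses the data integration workflow by executing, validating and repairing artifacts, and that matches or surpasses the best published results on seven SQL benchmarks.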

Abstract

Production data integration is bottlenecked by repeated, lossy handoffs between data owners, engineers, and analysts who must collaboratively discover, structure, and query enterprise data. We present Data Intelligence Agents (DIA), a system of three agents (Data Interpreter, Schema Creator, and Query Generator) that compresses this workflow by treating autonomous coding agents (ACAs) as a first-class abstraction: rather than emitting text, the agents generate, execute, validate, and repair concrete artifacts, draw on a shared memory for experience reuse, and surface each for review by domain experts. DIA is deployed in production for enterprise customers. We study the Query Generator in depth and evaluate it in fully autonomous mode across seven SQL benchmarks spanning four task categories and four dialects. It matches or surpasses the best published results on all seven, demonstrating that an architecture grounded in execution, built on ACAs and a shared memory, generalizes across the data intelligence workload with adaptation confined to natural-language instructions.

2510.05107 2026-06-18 cs.AI Updated version

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Myung Ho Kim

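Affiliations: JEI University

AI summary: Proposes the Structured Cognitive Loop (SCL) architecture, which separates cognition, memory, control and action into distinct modules to make LLM agent behavior accountable; it reaches 86.3% task success across 360 episodes, outperforming the prompt-based baselines.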

Comments This revised version extends the original SCL framework from a behavioral architecture for reliable LLM agents into a broader architecture of epistemic accountability, integrating context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment structure

Abstract

The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint guided image generation. Across 360 episodes, SCL achieves 86.3 percent task success compared with 70.5 to 76.8 percent for prompt based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context aware Human in the Loop control, Pool Gated Retrieval, and the Horizon Warrant Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.
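The division of labor described in the abstract (the model proposes, memory holds verified state, a controller authorizes) can be sketched as a small gatekeeper. The `Controller` class, its method names and the flight-booking tools are hypothetical illustrations, not the SCL codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Controller:
    """Hypothetical controller that authorizes a proposed tool call before execution."""
    preconditions: Dict[str, Callable[[Dict[str, str]], bool]]
    memory: Dict[str, str] = field(default_factory=dict)      # verified state
    trace: List[Tuple[str, str]] = field(default_factory=list)  # inspectable log

    def authorize(self, tool: str, arg: str) -> bool:
        key = f"{tool}:{arg}"
        if key in self.memory:
            self.trace.append((key, "rejected: redundant"))
            return False
        check = self.preconditions.get(tool)
        if check is None or not check(self.memory):
            self.trace.append((key, "rejected: precondition"))
            return False
        self.trace.append((key, "authorized"))
        return True

    def record(self, tool: str, arg: str, result: str) -> None:
        # Store the verified result so that later calls can reuse it.
        self.memory[f"{tool}:{arg}"] = result


controller = Controller(preconditions={
    "search_flights": lambda memory: True,
    "book_flight": lambda memory: any(k.startswith("search_flights:") for k in memory),
})
too_early = controller.authorize("book_flight", "LH123")       # no search result yet
allowed = controller.authorize("search_flights", "SFO-NRT")
controller.record("search_flights", "SFO-NRT", "LH123 available")
repeated = controller.authorize("search_flights", "SFO-NRT")   # already in memory
booking = controller.authorize("book_flight", "LH123")
```

Every decision lands in `trace`, so a reviewer can see afterwards why a call was permitted or refused, which is the accountability property the abstract emphasizes.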

2603.00656 2026-06-18 cs.AI Updated version

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

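Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary: Proposes the ActMem framework, which converts unstructured dialogue history into a structured causal and semantic graph and combines counterfactual reasoning with commonsense completion to enable active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.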

Abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2603.29247 2026-06-18 cs.CL cs.AI cs.LG Updated version

MemRerank: Preference Memory for Personalized Product Reranking

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong

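Affiliations: Santa Clara University; Independent Researcher

AI summary: Proposes the MemRerank framework, which uses reinforcement learning to distill user purchase history into query-independent preference memory for personalized reranking in LLM shopping agents, improving accuracy on the 1-in-5 selection task by up to 10.61 percentage points.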

Comments correct author name in metadata

Abstract

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based 1-in-5 selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to +10.61 absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

2605.30880 2026-06-18 cs.CL cs.AI Updated version

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, yielding symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success in AgentGym environments.

Comments 40 pages

Abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.
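To make "executable world model" and "one-step lookahead" concrete, here is a toy sketch. The key-and-door task, the hand-written `predict` function and the `score` heuristic are all hypothetical; in PatchWorld the program would be induced and repaired from trajectories rather than written by hand.

```python
from typing import FrozenSet, List

State = FrozenSet[str]  # symbolic belief state: a set of facts believed true


def predict(state: State, action: str) -> State:
    """Hypothetical world model for a toy 'key and door' task."""
    if action == "take key" and "key visible" in state:
        return (state - {"key visible"}) | {"has key"}
    if action == "open door" and "has key" in state:
        return state | {"door open"}
    return state  # the action has no effect in this state


def score(state: State) -> int:
    """Hypothetical progress heuristic used to rank predicted states."""
    return 2 * int("door open" in state) + int("has key" in state)


def one_step_lookahead(state: State, actions: List[str]) -> str:
    """Simulate each action with the world model and keep the best outcome."""
    return max(actions, key=lambda a: score(predict(state, a)))


actions = ["open door", "take key"]
start: State = frozenset({"key visible"})
first_action = one_step_lookahead(start, actions)
after_first = predict(start, first_action)
second_action = one_step_lookahead(after_first, actions)
```

Because the model is plain code, a wrong prediction can be traced to a specific branch and patched locally, and planning needs no LLM call at prediction time.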

2606.19279 2026-06-18 cs.AI cs.LG cs.LO math.CT math.LO math.PR New submission

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

Daniel Romero Schellhorn, Till Mossakowski, Björn Gehrke

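Affiliations: University of Osnabrück

AI summary: Proposes the NeSyCat Torch framework, which unifies neurosymbolic semantics through a strong monad and an aggregation structure on truth values, uses a lazy log-tensor monad for differentiable training, and outperforms LTN and DeepProbLog on MNIST addition.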

Abstract

Neurosymbolic semantics is fragmented: classical, fuzzy, probabilistic and neural systems each define truth by their own inductive rules. NeSyCat, extending ULLER, subsumes them under a single inductive definition of truth, parametric in a strong monad and an aggregation structure on truth-values. NeSyCat has so far lacked an account of predicates and functions learned by neural networks. We provide NeSyCat Torch as the missing link and interpret computational symbols via neural networks, implementing the framework in probabilistic programming and tensor-based backends. We use the distribution monad for reference semantics and metric evaluation, and complement it by a monad for numerically stable, differentiable training: the lazy log-tensor monad over the log-semiring. For efficient training in batches, we furthermore employ a batch monad. The axioms are the source code: written once in monad-based do-notation, monadic bind performs marginalisation, lazily pruning unneeded branches. On MNIST addition, our HaskTorch, JAX, and PyTorch implementations outperform LTN and DeepProbLog in speed and accuracy, while achieving nearly the accuracy of DeepStochLog. However, unlike DeepStochLog, we stay in a uniform framework that applies to many first-order NeSy approaches. Namely, the construction is parametric in the monad; instantiating it with, e.g., the Giry monad extends the approach to continuous probability (working out a neural representation here is left for future work).
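The statement that "monadic bind performs marginalisation" can be illustrated with a finite distribution monad in plain Python. This is a minimal sketch under assumptions: distributions are dictionaries, the digit distributions are made-up stand-ins for classifier outputs, and skipping zero-probability branches is only a simple analogue of the lazy pruning mentioned in the abstract. It is not the NeSyCat Torch API.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable

Dist = Dict[Hashable, float]  # finite distribution: outcome -> probability


def unit(value: Hashable) -> Dist:
    """Monadic unit: the distribution that is certain about one value."""
    return {value: 1.0}


def bind(dist: Dist, f: Callable[[Hashable], Dist]) -> Dist:
    """Monadic bind: run f on every outcome and marginalise over the results."""
    out: Dict[Hashable, float] = defaultdict(float)
    for value, p in dist.items():
        if p == 0.0:
            continue  # skip branches that carry no probability mass
        for result, q in f(value).items():
            out[result] += p * q
    return dict(out)


def digit_sum(d1: Dist, d2: Dist) -> Dist:
    """Distribution of a + b, written in the nested-bind style of do-notation."""
    return bind(d1, lambda a: bind(d2, lambda b: unit(a + b)))


# Hypothetical classifier outputs for two digit images.
left = {3: 0.5, 4: 0.5}
right = {5: 0.25, 6: 0.75}
total = digit_sum(left, right)
```

In MNIST addition only the sum is labeled, so the training signal is the probability that `total` assigns to the observed sum; a log-semiring variant would replace the products and sums above with additions and log-sum-exp for numerical stability.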

2606.19197 2026-06-18 cs.LO cs.AI Cross-list

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan

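Affiliations: Knowledge Representation Group, Paderborn University, Germany; Knowledge in Artificial Intelligence, Vrije Universiteit Amsterdam, The Netherlands; Data Science Group, Paderborn University, Germany

AI summary: Studies ABox abduction hypotheses for EL_bot under brave and AR semantics that satisfy multiple properties or optimality criteria, and finds that adding property requirements usually does not increase complexity.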

2505.12369 2026-06-18 cs.AI cs.LG cs.LO Updated version

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

Fernando Zhapa-Camacho, Robert Hoehndorf

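Affiliations: KAUST Center of Excellence for Smart Health (KCSH); KAUST Center of Excellence for Generative AI

AI summary: Proposes GeometrE, a method that maps logical operations to purely geometric transformations and introduces a transitive loss function, improving multi-hop reasoning performance while preserving interpretability.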

Comments Accepted at ESWC 2026

DOI: 10.1007/978-3-032-25156-5_14
Journal ref: The Semantic Web. ESWC 2026. Lecture Notes in Computer Science, vol 16549. Springer, Cham (2026)

英文摘要

Multi-hop logical reasoning on knowledge graphs requires faithfully mapping the logical semantics to latent space. Current geometric embedding methods have proven useful for this task by mapping entities to geometric regions and logical operations to latent transformations. While a geometric embedding can provide a direct interpretability framework for query answering, current methods have only leveraged the geometric construction of entities, failing to map logical operations to pure geometric transformations and, instead, using neural components to learn these operations. On the other hand, purely neural-based methods outperform geometric methods, but they lack interpretability in the latent space. We introduce GeometrE, a geometric embedding method for multi-hop reasoning that maps every logical operation to a purely geometric operation in the latent space. Additionally, we introduce a transitive loss function and show that, unlike existing methods, it can preserve the logical rule for all a, b, c: r(a,b) and r(b,c) -> r(a,c). Our experiments show that GeometrE outperforms current state-of-the-art geometric methods and remains competitive with existing neural-based methods on standard benchmark datasets.
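To make the transitivity rule concrete, the sketch below shows one simple geometric encoding in which r(a,b) and r(b,c) -> r(a,c) holds by construction: entities are axis-aligned boxes and r(x, y) means that the box of x lies inside the box of y. This is an illustrative assumption, not GeometrE's construction or its transitive loss; the `boxes` dictionary and its coordinates are made up.

```python
from itertools import product


def contains(outer, inner):
    """True if axis-aligned box `inner` lies inside `outer`.

    Each box is a (lower_corner, upper_corner) pair of coordinate tuples.
    """
    lo_o, hi_o = outer
    lo_i, hi_i = inner
    lower_ok = all(o <= i for o, i in zip(lo_o, lo_i))
    upper_ok = all(i <= o for o, i in zip(hi_o, hi_i))
    return lower_ok and upper_ok


# Hypothetical 2-D box embeddings (made-up coordinates).
boxes = {
    "animal": ((0.0, 0.0), (10.0, 10.0)),
    "mammal": ((1.0, 1.0), (6.0, 6.0)),
    "dog": ((2.0, 2.0), (3.0, 3.0)),
}


def r(x, y):
    """Relation r(x, y): the box of x is contained in the box of y."""
    return contains(boxes[y], boxes[x])


# Containment is transitive, so the rule holds for every triple of entities.
transitive = all(
    (not (r(a, b) and r(b, c))) or r(a, c)
    for a, b, c in product(boxes, repeat=3)
)
```

Because box containment is itself a transitive relation, no extra constraint is needed in this toy setting; a learned embedding would need a loss term to push it toward such a configuration.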

2605.16385 2026-06-18 cs.CV cs.AI cs.CL 版本更新

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo：通过神经符号推理解决立体几何问题

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

发表机构 * Xi’an Jiaotong-Liverpool University（西交利物浦大学）； Ricoh Software Research Center Beijing Co., Ltd（理光软件研究所（北京）有限公司）

AI总结提出Hilbert-Geo框架和Parse2Reason方法，利用条件描述语言和定理库实现立体几何问题的严格推理，在SolidFGeo2k和MathVerse-Solid上达到SOTA性能。

Comments Computer Vision and Pattern Recognition (CVPR), 2026

AI中文摘要

几何问题求解作为一种典型的多模态推理问题，近年来受到广泛关注并取得了很大进展，然而大多数工作集中于平面几何，由于三维空间图和复杂推理，通常在立体几何中失败。为弥补这一差距，我们引入了Hilbert-Geo，这是第一个用于立体几何的统一形式语言框架，包括一个广泛的谓词库和一个专用的定理库。基于该框架，我们提出了一种Parse2Reason方法，包含先解析后推理两个步骤。在解析步骤中，我们利用条件描述语言（CDL），一种由专门用于构建几何条件的谓词组成的形式化语言，来表示问题描述（自然文本）和立体图（视觉图像）。在推理步骤中，我们利用这些形式化CDL和定理库进行关系推理和代数计算，生成严格正确、可验证且人类可读的推理过程。值得注意的是，我们提出的Hilbert-Geo也适用于平面几何。为推进几何推理，我们策划了两个专家标注的数据集SolidFGeo2k和PlaneFGeo3k，它们配备了几何形式语言标注、解答和答案。大量实验表明，我们提出的方法在SolidFGeo2k上达到77.3%的最先进性能，在MathVerse-Solid（MathVerse中专用于立体几何的一个小子集）上达到84.1%，显著优于领先的多模态大语言模型，如Gemini-2.5-pro（在SolidFGeo2k上为54.2%）和GPT-5（在MathVerse-Solid上为62.9%）。此外，我们的方法在PlaneFGeo3k上达到80.2%的SOTA准确率，展示了Hilbert-Geo在几何推理中的通用性。我们的代码和数据集将公开提供。

英文摘要

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most work focuses on plane geometry and usually fails on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method consisting of two steps: first parsing, then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

使用多LLM智能体模拟仇恨言论级联：实证基础、建模保真度与干预策略

Fan Huang

发表机构 * Indiana University Bloomington（印第安纳大学布卢明顿分校）

AI总结本研究通过多LLM智能体系统模拟在线仇恨言论传播，发现其能再现实证数据中的立场单一性和毒性同质性，并通过消融实验识别出智能体异质性为关键保真因素，提出针对密集网络的放大器干预策略。

AI中文摘要

在线平台上仇恨内容传播的忠实建模仍然是内容审核研究中的一个开放问题。经典的级联模型没有明确表示与仇恨内容传播相关的用户画像、社区和内容因素，因此在实际场景中部署时可能产生效果较差的审核策略。多智能体大语言模型系统原则上可以使每次转发决策依赖于用户画像、周围社区和帖子内容，但尚不清楚这种增加的灵活性是否比经典基线更忠实地再现真实的仇恨级联。我们研究了三个仇恨Bluesky级联和一个大小匹配的良性对照。在实证Bluesky数据中，我们发现：97.4-99.7%的转发者采取敌对立场；对于仇恨级联，扩散树上的毒性-参与同质性高于关注图；仇恨级联的拓扑结构是星形（大多数转发直接来自根节点），而良性级联是树形（转发通过多跳链传播）。在模拟中，多LLM智能体模拟器再现了立场单一性和毒性差异方向。结构化消融实验将智能体异质性识别为主要的保真因素，针对密集网络的放大器干预在5.7%良性附带损害下实现了7.5-12.9%的减少。

英文摘要

Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hateful-content propagation may yield moderation strategies that behave less effectively when deployed in real-world scenarios. Multi-agent large language model (LLM) systems can, in principle, make each reshare decision depend on the user's profile, the surrounding community, and the post's content, but it remains unclear whether this added flexibility actually reproduces real hateful cascades more faithfully than classical baselines. We study three hateful Bluesky cascades and a size-matched benign control. In the empirical Bluesky data, we found that: 97.4-99.7% of reposters take a hostile stance; toxicity-engagement homophily is higher on the diffusion tree than on the follower graph for hateful cascades; topology is star-like for the hateful cascades (most reposts come directly from the root) versus tree-like for the benign cascade (reposts propagate through multi-hop chains). In simulation, a multi-LLM-agent simulator reproduces the stance monoculture and the toxicity-delta direction. A structured ablation identifies agent heterogeneity as the leading fidelity factor, and amplifier targeting on dense networks yields 7.5-12.9% reduction at 5.7% benign collateral.

2606.18268 2026-06-18 cs.SI cs.AI 交叉投稿

Towards Multi-Agent-Simulation-Based Community Note Evaluation

迈向基于多智能体模拟的社区笔记评估

Changxi Wen, Shuning Zhang, Bohao Chu, Yuwei Chuai, Hui Wang, Dai Shi, Xin Yi, Hewu Li

发表机构 * Tsinghua University, Beijing, China（清华大学，北京，中国）； University of Duisburg-Essen, Duisburg, Germany（杜伊斯堡-埃森大学，杜伊斯堡，德国）； University of Luxembourg, Luxembourg（卢森堡大学，卢森堡）； Tongji University, Shanghai, China（同济大学，上海，中国）

AI总结针对社区事实核查中跨共识延迟和低比例问题，提出ComRate数据集和MultiCom多智能体框架，通过矩阵分解聚类与校准聚合实现高精度评估。

AI中文摘要

基于跨共识的社区事实核查在社交媒体平台上迅速扩展。然而，由人类贡献者评定的跨共识社区事实核查的延迟和低比例仍然是一个重大挑战。为解决这一问题，我们首先创建了ComRate，一个大规模数据集，包含来自X平台的250万条社区笔记和超过2.09亿条评分。然后，我们提出了MultiCom，一个基于角色引导的多智能体评分框架，用于社区笔记评估。MultiCom通过在矩阵分解的评分者空间中对贡献者进行聚类，并提示角色智能体根据官方社区笔记评分模式生成结构化评估，从而模拟多样化的评分者群体。这些智能体输出结构化且可解释的判断，例如置信度、一致信号和原因。一种折外校准聚合算法结合原始投票和诊断性原因信号等特征，实现可靠预测。广泛评估表明，MultiCom优于其他方法，在评估集上平均准确率达到84.7%（平衡准确率68.3%，宏F1分数60.1%）。

英文摘要

Community-based fact-checking that relies on cross-consensus is expanding rapidly on social media platforms. However, the delay and low ratio of cross-consensus community fact-checks rated by human contributors remain a significant challenge. To address this, we first created ComRate, a large-scale dataset comprising 2.5 million community notes and over 209 million ratings sourced from X. We then propose MultiCom, a persona-guided multi-agent rating framework for community note evaluation. MultiCom simulates a diverse rater population by clustering contributors in a matrix-factorized rater space and prompting persona agents to generate structured assessments based on the official community notes rating schema. These agents output structured and explainable judgments, such as confidence, agreement signals and reasons. An out-of-fold calibrated aggregation algorithm combines features such as raw votes and diagnostic reason signals for reliable prediction. Extensive evaluations demonstrate that MultiCom outperforms alternative methods, achieving an average accuracy of 84.7% (balanced accuracy 68.3%, macro-F1 60.1%) on the evaluation set.
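The gap between the reported accuracy and the balanced accuracy and macro-F1 figures is the usual signature of class imbalance. The sketch below computes the three metrics from a binary confusion matrix with made-up counts; it is not the paper's evaluation code, and the paper's label set may have more than two classes.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, balanced accuracy, and macro-F1 for a two-class problem."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total

    # Balanced accuracy: unweighted mean of the per-class recalls.
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    balanced_accuracy = (recall_pos + recall_neg) / 2.0

    # Macro-F1: unweighted mean of the per-class F1 scores.
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    macro_f1 = (f1_pos + f1_neg) / 2.0

    return accuracy, balanced_accuracy, macro_f1


# Made-up imbalanced example: 10 positive notes, 90 negative notes.
accuracy, balanced_accuracy, macro_f1 = binary_metrics(tp=4, fn=6, fp=4, tn=86)
```

With these counts the plain accuracy is high because the majority class dominates, while the two class-averaged metrics are pulled down by the poorly recognised minority class.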

2606.18308 2026-06-18 cs.LG cs.AI 交叉投稿

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

TRIDENT: 打破混合安全-物理耦合以实现可证明安全的多智能体强化学习

Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang

发表机构 * Peking University（北京大学）； Xiamen University（厦门大学）； National Taiwan University（国立台湾大学）； WHU（武汉大学）； THU / Jimei University（清华大学 / 集美大学）

AI总结针对混合离散-连续动作、训练时安全约束和物理动力学形成的耦合问题，提出TRIDENT框架，通过Richardson-Romberg梯度校正、Lyapunov约束序列信任域更新和物理信息残差评论家，实现可证明的安全收敛，显著降低训练违规并提升奖励。

Comments 16 pages, 4 figures

AI中文摘要

网络化信息物理系统中的安全协调迫使学习算法同时处理混合离散-连续动作、严格的训练时安全约束和物理支配的动力学。我们证明这三个特征形成了一个有向偏差循环，击败了任何现成模块的朴素组合，并将其形式化为一个三向耦合引理。然后我们引入TRIDENT，这是第一个MARL框架，其三个组件被共同设计以消除每个泄漏：一个将Gumbel-Softmax偏差从O(tau)降低到O(tau^2)的Richardson-Romberg梯度校正，一个强制每次迭代可行性的Lyapunov约束顺序信任域更新，以及一个分解价值而非奖励的物理信息残差评论家。我们证明了以O~(1/sqrt(K))的收敛速率达到约束纳什均衡，以及O(sqrt(K))的累积违规界。在多无人机移动边缘计算、自主交叉口管理和混合SMAC变体上，TRIDENT相比MADDPG减少了95.5%的训练时违规，相比MACPO减少了76.3%，同时相比最强的无约束基线提高了13.5%的奖励。

英文摘要

Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these three features form a directed cycle of biases that defeats any naive composition of off-the-shelf modules, and formalize this as a three-way coupling lemma. We then introduce TRIDENT, the first MARL framework whose three components are co-designed to cancel each leak: a Richardson-Romberg gradient correction reducing Gumbel-Softmax bias from O(tau) to O(tau^2), a Lyapunov-constrained sequential trust-region update enforcing per-iterate feasibility, and a physics-informed residual critic that decomposes value rather than reward. We prove an O~(1/sqrt(K)) convergence rate to a constrained Nash equilibrium and an O(sqrt(K)) cumulative-violation bound. On multi-UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant, TRIDENT cuts training-time violations by 95.5% over MADDPG and 76.3% over MACPO, while improving reward by 13.5% over the strongest unconstrained baseline.
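The step from an O(tau) bias to an O(tau^2) bias is the generic effect of Richardson extrapolation: two estimates at step sizes tau and tau/2 are combined so that the first-order error terms cancel. The sketch below shows this on a forward-difference derivative, which is a stand-in chosen for illustration; it is not TRIDENT's Gumbel-Softmax gradient correction, and the function names are hypothetical.

```python
import math


def forward_diff(f, x, tau):
    """Forward-difference derivative estimate; its bias is O(tau)."""
    return (f(x + tau) - f(x)) / tau


def richardson(f, x, tau):
    """Combine estimates at tau and tau/2 so the O(tau) bias terms cancel."""
    return 2.0 * forward_diff(f, x, tau / 2.0) - forward_diff(f, x, tau)


x, tau = 1.0, 0.1
exact = math.exp(x)  # the derivative of exp at x is exp(x)

err_plain = abs(forward_diff(math.exp, x, tau) - exact)
err_rr = abs(richardson(math.exp, x, tau) - exact)
```

The corrected estimate costs one extra evaluation at the smaller step size, and its remaining error shrinks quadratically rather than linearly as tau decreases.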

2606.18325 2026-06-18 cs.CR cs.AI 交叉投稿

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Agentra: 一种可监督的多智能体企业入侵响应框架

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

发表机构 * The University of Alabama, Alabama, USA（阿拉巴马大学）； Roma Tre University, Rome, Italy（罗马第三大学）

AI总结提出可监督的多智能体入侵响应框架Agentra，通过角色划分、规划-验证循环、安全网关和风险评分机制，将警报转化为结构化响应计划，在120事件语料上F1从0.61提升至0.84，有害动作率降至0.0%。

AI中文摘要

企业入侵响应仍然依赖于静态剧本和分析师驱动的分类，导致警报生成与遏制之间存在延迟。我们提出Agentra，一个可监督的多智能体入侵响应系统（IRS）框架，它将来自IDS、EDR和XDR平台的警报转换为基于MITRE ATT&CK、MITRE D3FEND和NIST CSF 2.0的结构化事件响应计划。Agentra将响应推理分解到角色范围的智能体中，通过有界的规划器-验证器审查循环验证提议的计划，通过审核安全网关筛选检索到的威胁情报，通过行动目录和风险评分门控行动，并将决策记录在仅追加的审计日志中。我们在来自ThreatHunter-Playbook、Splunk BOTSv3和DARPA OpTC的120事件语料库上，将Agentra与静态OASIS CACAO v2.0网络剧本基线进行了评估。最强的配置将感知假阳性的IRS F1从0.61提高到0.84，并在仅规划器配置引入不安全过度反应后，将预计的有害动作率恢复到静态基线水平0.0%。这些结果表明，多智能体响应规划可以在保持分析师批准和可审计性的同时，提高基于本体的IRS覆盖率。

英文摘要

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner--Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.

2606.18837 2026-06-18 cs.MA cs.AI cs.LG 交叉投稿

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS: 演化元技能以自动生成多智能体系统

Hehai Lin, Qi Yang, Chengwei Qin

发表机构 * Ant Group（蚂蚁集团）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出Skill-MAS，通过将高层编排能力解耦为可演化的元技能，在无需参数更新的情况下实现经验保留，利用多轨迹采样和选择性反思优化元技能，在多个基准和LLM上取得显著性能提升且成本可控。

AI中文摘要

基于大型语言模型（LLM）的自动多智能体系统（MAS）生成已成为处理复杂任务的关键前沿。然而，现有方法在模型能力和经验保留之间面临两难困境。推理时MAS利用冻结的尖端LLM，但重复相同搜索而不从过去经验中学习。相反，训练时MAS通过梯度更新内化经验，但受限于较小模型的低能力上限，且难以扩展到大型尖端LLM。为弥合这一差距，我们提出Skill-MAS，一种新颖的第三条路径，通过将高层编排能力概念化为可演化的元技能，将经验保留与参数更新解耦。Skill-MAS通过一个封闭优化循环来精炼这种架构知识：（1）多轨迹采样在当前元技能下为每个任务采样行为分布；（2）选择性反思自适应选择优先任务，并应用分层对比分析将系统经验蒸馏为可泛化的策略级原则。在四个复杂基准和四个不同LLM上的大量实验表明，Skill-MAS不仅实现了显著的性能提升，而且保持了良好的成本-性能权衡。进一步分析揭示，演化后的元技能高度鲁棒，并在未见任务和不同LLM之间表现出强迁移性。

英文摘要

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

2606.19111 2026-06-18 cs.CL cs.AI cs.MA 交叉投稿

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

领导力作为协调控制：多智能体LLM团队中的行为特征与恢复优势边界

Haewoon Kwak

发表机构 * Indiana University Bloomington（印第安纳大学布卢明顿分校）

AI总结研究多智能体LLM团队中过程级协调控制何时增加价值，通过行为特征和消融实验发现，控制器的优势仅在初始多数投票不可靠、任务可恢复且无指导交互无法修复时出现，验证了权变理论。

Comments 33 pages

AI中文摘要

团队科学认为领导力是权变的：它仅在特定条件下有帮助，而能力强的自主团队可能根本不需要领导。我们对多智能体LLM团队提出类似问题：在什么可测量的条件下，过程级协调控制会增加价值，这些条件是否与团队科学的预测一致？我们使用行为特征（多数锁定、探索、从错误的第0轮共识中恢复）和每动作消融实验，因为每个控制器是一个显式动作集，而不是一个整体提示。我们将三种经典领导风格（交易型、变革型、情境型）操作化为对共享动作词汇（探索、修订、接受、综合）的控制器。一个具有相同动作但使用任意规则的匹配控制器恢复效果不优于多数投票，因此是理论推导的规则（而非词汇）起作用。在四个任务体系和三个开放权重模型系列中，没有控制器在准确率上占主导地位，正如权变观点所预测的：交易型控制在所有12个（模型、体系）组合上与共享的第0轮投票匹配，差异在1.3个百分点以内，仅在初始多数不可靠的一个组合上出现增益（llama-4-scout社会性；情境型比扁平型高8个百分点）。通过四个边界探针测试的恢复优势解释表明，控制器仅在初始多数投票不可靠、任务可恢复且无指导交互无法修复时优于纯交互。这些区域映射到权变理论（领导替代、路径-目标冗余、情境准备差距），因此基本为零的准确率结果正是理论所预测的，而非控制器的失败。我们将过程级协调控制视为一种需要测量和理论映射的权变因素，而不是需要超越的排行榜。

英文摘要

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

2606.19135 2026-06-18 cs.MA cs.AI cs.NI 交叉投稿

A Technical Taxonomy of LLM Agent Communication Protocols

LLM智能体通信协议的技术分类法

Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

发表机构 * Technische Universität München（慕尼黑工业大学）

AI总结针对大语言模型智能体通信协议碎片化问题，提出包含五个维度的技术分类法，分析九种开源协议，揭示架构模式并预测协议演进趋势。

AI中文摘要

随着大语言模型（LLM）的进步以及多智能体系统旨在克服单智能体的局限性，健壮的通信协议正成为分布式智能体网络的关键基础设施。然而，碎片化的协议格局带来了显著的互操作性挑战。本研究开发了一种技术分类法，用于分类和分析LLM智能体通信协议。遵循既定的迭代方法，我们定义了分类法的目的、元特征和终止条件，然后在九个积极维护且具有可证明采用度的开源协议上执行了五次迭代（三次从经验到概念，两次从概念到经验）。该分类法包含五个维度：交易对手、有效载荷、交互状态、发现机制和模式灵活性。分类揭示了重复出现的架构模式：所有采样的智能体间协议都将混合有效载荷与会话状态持久性相结合；大多数协议支持多个预定义模式，其中两个协议在运行时协商模式，表明向模式灵活性的趋势；去中心化发现仍然罕见。分析表明，短期内存在向统一智能体间和智能体-上下文（工具和数据）通信的协议收敛压力。然而，长期来看，没有单一协议能同时最大化通用性、效率和可移植性。该领域更可能演变为联邦式分层协议栈。该框架指导协议选择，并突出开放的研究空白，如隐私和策略执行。

英文摘要

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

2402.08128 2026-06-18 cs.AI cs.GT 版本更新

Recursive Joint Simulation in Games

博弈中的递归联合模拟

Vojtech Kovarik, Caspar Oesterheld, Vincent Conitzer

发表机构 * Foundations of Cooperative AI Lab (FOCAL), Computer Science Department（合作人工智能基础实验室（FOCAL），计算机科学系）； Carnegie Mellon University（卡内基梅隆大学）； AI Center（人工智能中心）； Czech Technical University（捷克技术大学）； Center for Theoretical Study（理论研究中心）； Charles University（查理大学）

AI总结研究AI智能体通过递归联合模拟实现合作，证明该过程等价于原博弈的无限重复版本，从而可直接应用民间定理等现有结论。

AI中文摘要

AI智能体之间的博弈动力学可能以多种方式不同于传统的人类-人类互动。其中一个差异是，可能能够精确模拟一个AI智能体，例如因为其源代码已知。这样的智能体将从根本上不确定自己是在现实世界还是在模拟中。我们的目标是探索利用这种可能性在战略环境中实现更合作的结果。在本文中，我们研究了AI智能体之间的交互，其中智能体运行递归联合模拟。也就是说，智能体首先共同观察它们所面临情境的模拟。这个模拟递归地包含额外的模拟（带有小的失败概率以避免无限递归），并且在选择行动之前观察所有这些嵌套模拟的结果。我们表明，由此产生的交互在策略上等价于原始博弈的无限重复版本，允许直接转移现有结果，如各种民间定理。作为该等价性稳健性的证据，我们表明即使放宽一些假设，它仍然成立，并且“从内部”也成立——即对于发现自己处于博弈中并具有自定位不确定性的智能体而言。

英文摘要

Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds "from the inside", that is, for an agent that finds itself inside the game and has self-locating uncertainty.
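One way to build intuition for the link to repeated games is to look at how deep the nesting goes. The sketch below assumes, purely for illustration, that each level independently fails to spawn a further nested simulation with probability `p_fail`; the abstract does not specify the failure model. Under that assumption the nesting depth is geometric, which has the same form as a repeated game that continues each round with probability 1 - p_fail.

```python
def expected_depth_closed_form(p_fail):
    """Expected number of nested simulations if each level continues with prob. 1 - p_fail."""
    return (1.0 - p_fail) / p_fail


def expected_depth_by_summation(p_fail, max_levels=5000):
    """Same quantity via E[D] = sum over k >= 1 of P(D >= k) = sum of (1 - p_fail)^k."""
    cont = 1.0 - p_fail
    return sum(cont ** k for k in range(1, max_levels + 1))


p_fail = 0.1  # illustrative value, not taken from the paper
closed = expected_depth_closed_form(p_fail)
summed = expected_depth_by_summation(p_fail)
```

A smaller failure probability gives deeper expected nesting, which in the repeated-game reading corresponds to more patient players.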

2508.21720 2026-06-18 cs.AI 版本更新

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

PosterForest: 用于科学海报生成的分层多智能体协作

Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

发表机构 * Graduate School of Artificial Intelligence, KAIST（韩国科学技术院人工智能研究生院）； School of Integrated Technology, Yonsei University（延世大学整合技术学院）

AI总结提出PosterForest，一种无需训练的科学海报生成框架，通过Poster Tree分层表示文档结构，并利用内容与布局智能体进行分层推理与递归优化，实现内容与布局的联合优化，提升语义连贯性、逻辑流畅性和视觉平衡。

Comments ACL 2026

AI中文摘要

自动化科学海报生成需要层次化的文档理解和连贯的内容-布局规划。现有方法通常依赖于平面摘要或分别优化内容和布局。因此，它们常常遭受信息丢失、逻辑流程薄弱和视觉平衡差的问题。我们提出了PosterForest，一个无需训练的科学海报生成框架。我们的方法引入了Poster Tree，一种结构化的中间表示，能够跨多个层次捕获文档层次结构和视觉-文本语义。基于这种表示，内容和布局智能体执行分层推理和递归优化，从全局组织到局部组成逐步优化海报。这种联合优化提高了语义连贯性、逻辑流畅性和视觉和谐。实验表明，PosterForest在自动评估和人工评估中均优于先前方法，且无需额外训练或领域特定监督。

英文摘要

Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.

2606.15504 2026-06-18 cs.AI 版本更新

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

迈向氛围医学：一种用于临床决策支持的自演化多智能体框架

Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang

发表机构 * Medical AI Lab, The First Hospital of Hebei Medical University（河北医科大学第一医院医学人工智能实验室）； Hebei Provincial Engineering Research Center for AI-Based Cancer Treatment Decision-Making, The First Hospital of Hebei Medical University（河北省人工智能癌症治疗决策工程研究中心，河北医科大学第一医院）； State Key Laboratory of Neurology and Oncology Drug Development（神经与肿瘤药物研发国家重点实验室）； School of Computing, University of Georgia（佐治亚大学计算学院）； Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital and Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College（中国医学科学院北京协和医学院国家癌症中心/国家肿瘤临床医学研究中心/肿瘤医院深圳医院放射治疗科）； Department of Radiation Oncology, Mayo Clinic（梅奥诊所放射肿瘤科）； College of Mechanical and Power Engineering, China Three Gorges University（三峡大学机械与动力工程学院）； Department of Radiation Oncology, Guangzhou Concord Cancer Center（广州康华肿瘤中心放射治疗科）； Gastrointestinal Disease Diagnosis and Treatment Center, The First Hospital of Hebei Medical University（河北医科大学第一医院胃肠疾病诊疗中心）； Department of General Surgery, The First Hospital of Hebei Medical University（河北医科大学第一医院普通外科）

AI总结提出VIBEMed多智能体框架，通过自演化机制和架构级安全沙箱，从交互历史中动态学习，实现个性化临床决策支持。

DOI: 10.1016/j.metrad.2026.100223

AI中文摘要

近年来，大型语言模型和自主智能体的进步彻底改变了医疗领域，促进了诊断并改善了治疗结果。然而，大多数现有AI系统依赖预训练知识和预定义流程，难以从包含患者结果和过去失败的交互式聊天会话历史中动态学习。为解决这一限制，我们提出了VIBEMed，一种具有内置自演化机制和架构级安全沙箱的多智能体框架，用于稳健的临床决策支持。该系统集成了三个专门智能体：用于假设生成的临床诊断智能体（CDA）、用于治疗计划的治疗执行智能体（TEA）以及将纵向临床反馈提炼为可重用知识的临床演化管理智能体（CEMA），将多模态患者信息转化为个性化医疗决策。通过自演化机制，该框架实现了跨记忆、模型行为和决策策略的迭代更新，使系统能够随时间改进。实验结果表明，VIBEMed通过其演化机制在复杂临床病例中表现出优越性能，特别是在需要集成决策和纵向规划的任务中。该框架还支持在具有挑战性的场景（如肿瘤治疗规划）中进行可靠的端到端决策，凸显了其在真实临床环境中的可行性。总体而言，VIBEMed为超越静态AI系统、迈向自适应、经验驱动的临床决策支持提供了一条实用路径，展示了将多智能体协作与持续演化相结合以推进精准医学的价值。

英文摘要

In recent years, advances in large language models and autonomous agents have revolutionized the healthcare field, facilitating diagnosis and improving treatment results. However, most existing AI systems rely on pre-trained knowledge and predefined pipelines and struggle to learn dynamically from interactive chat session history that contains patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi-agent framework with a built-in self-evolution mechanism and architecture-level safety sandbox for robust clinical decision support. The system integrates three specialized agents: a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge, transforming multimodal patient information into personalized medical decisions. Through its self-evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed demonstrates superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision-making and longitudinal planning. The framework also supports reliable end-to-end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real-world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience-driven clinical decision support, demonstrating the value of combining multi-agent collaboration with continuous evolution for advancing precision medicine.

2506.09046 2026-06-18 cs.LG cs.AI cs.MA 版本更新

Self-Evolving Multi-Agent Systems via Textual Backpropagation

通过文本反向传播的自进化多智能体系统

Xiaowen Ma, Yunpu Ma, Chenyang Lin, Sikuan Yan, Jinhe Bi, Zixuan Cao, Yijun Tian, Volker Tresp, Hinrich Schuetze

发表机构 * Ludwig Maximilian University of Munich（慕尼黑路德维希-马克西米利安大学）； Technical University of Munich（慕尼黑工业大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； University of Notre Dame（圣母大学）

AI总结提出Agentic Neural Network框架，将多智能体协作建模为分层神经网络，通过前向分解任务和反向传播反馈实现智能体角色、提示和协作的自进化，在七个基准数据集上超越现有方法。

AI中文摘要

利用多个大型语言模型（LLM）已被证明对处理复杂、高维任务有效，但当前方法通常依赖静态、手动设计的多智能体配置。为克服这些限制，我们提出Agentic Neural Network（ANN）框架，该框架将多智能体协作概念化为分层神经网络架构。在此设计中，每个智能体作为节点运行，每一层形成一个专注于特定子任务的协作团队。我们的框架遵循两阶段优化策略：（1）前向阶段——受神经网络前向传播启发，任务被动态分解为子任务，并逐层构建具有合适聚合方法的协作智能体团队。（2）反向阶段——模仿反向传播，我们通过迭代反馈优化全局和局部协作，使智能体能够自进化其角色、提示和协调。这种神经符号方法使我们的框架能够在训练后创建新的或专门的智能体团队，在准确性和适应性方面带来显著提升。在七个基准数据集上，我们的工作在相同配置下超越了领先的多智能体基线，显示出持续的性能改进。

英文摘要

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraints, we present the Agentic Neural Network (ANN), a framework that conceptualizes multi-agent collaboration as a layered neural network architecture. In this design, each agent operates as a node, and each layer forms a cooperative team focused on a specific subtask. Our framework follows a two-phase optimization strategy: (1) Forward Phase - Drawing inspiration from neural network forward passes, tasks are dynamically decomposed into subtasks, and cooperative agent teams with suitable aggregation methods are constructed layer by layer. (2) Backward Phase - Mirroring backpropagation, we refine both global and local collaboration through iterative feedback, allowing agents to self-evolve their roles, prompts, and coordination. This neuro-symbolic approach enables our framework to create new or specialized agent teams post-training, delivering notable gains in accuracy and adaptability. Across seven benchmark datasets, our work surpasses leading multi-agent baselines under the same configurations, showing consistent performance improvements.

2510.18085 2026-06-18 cs.RO cs.AI cs.MA 版本更新

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

R2BC: 从单智能体演示进行多智能体模仿学习

Connor Mattson, Varun Raveendra, Ellen Novoseller, Nicholas Waytowich, Vernon J. Lawhern, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah（犹他大学凯勒尔计算学院）； DEVCOM Army Research Laboratory（陆军研究实验室）

AI总结提出R2BC方法，通过轮换单智能体演示训练多机器人系统，无需联合动作空间演示，在模拟和实物任务中性能媲美或超越基于特权同步演示的基线方法。

Comments 8 pages, 6 figures. In Proceedings: IEEE International Conference on Robotics & Automation (ICRA 2026)

AI中文摘要

模仿学习（IL）是人类教授机器人的自然方式，尤其是在高质量演示易于获取的情况下。虽然IL已广泛应用于单机器人场景，但将其扩展到多智能体系统的研究相对较少，尤其是在单个人类必须为协作机器人团队提供演示的场景中。本文介绍并研究了轮换行为克隆（R2BC），该方法使单个人类操作员能够通过顺序的单智能体演示有效训练多机器人系统。我们的方法允许人类一次远程操作一个智能体，并逐步向整个系统教授多智能体行为，无需联合多智能体动作空间的演示。我们表明，在四个多智能体模拟任务中，R2BC方法的性能与基于特权同步演示的Oracle行为克隆方法相当，甚至在某些情况下超越后者。最后，我们在两个使用真实人类演示训练的物理机器人任务上部署了R2BC。

英文摘要

Imitation Learning (IL) is a natural way for humans to teach robots, particularly when high-quality demonstrations are easy to obtain. While IL has been widely applied to single-robot settings, relatively few studies have addressed the extension of these methods to multi-agent systems, especially in settings where a single human must provide demonstrations to a team of collaborating robots. In this paper, we introduce and study Round-Robin Behavior Cloning (R2BC), a method that enables a single human operator to effectively train multi-robot systems through sequential, single-agent demonstrations. Our approach allows the human to teleoperate one agent at a time and incrementally teach multi-agent behavior to the entire system, without requiring demonstrations in the joint multi-agent action space. We show that R2BC methods match, and in some cases surpass, the performance of an oracle behavior cloning approach trained on privileged synchronized demonstrations across four multi-agent simulated tasks. Finally, we deploy R2BC on two physical robot tasks trained using real human demonstrations.
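The round-robin idea of teleoperating one agent at a time can be sketched as a simple scheduling loop. This is a minimal illustration, not the R2BC implementation: `record_demo` is a hypothetical callback standing in for a teleoperation session, and the abstract does not say what the non-teleoperated agents do during a demonstration.

```python
from itertools import cycle, islice


def round_robin_schedule(agent_ids, num_demos):
    """Order in which agents are teleoperated: cycle through them one at a time."""
    return list(islice(cycle(agent_ids), num_demos))


def collect_demonstrations(agent_ids, num_demos, record_demo):
    """Gather one single-agent demonstration per step into per-agent datasets.

    `record_demo(agent_id)` is a hypothetical stand-in for one teleoperation session.
    """
    datasets = {agent_id: [] for agent_id in agent_ids}
    for agent_id in round_robin_schedule(agent_ids, num_demos):
        datasets[agent_id].append(record_demo(agent_id))
    return datasets


agents = ["robot_0", "robot_1", "robot_2"]
schedule = round_robin_schedule(agents, 5)
datasets = collect_demonstrations(agents, 5, lambda a: f"demo-for-{a}")
```

Each per-agent dataset could then be used to fit that agent's policy with ordinary behavior cloning, so no demonstration ever needs to cover the joint action space.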

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC Cross-list

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

Affiliations: Texas A&M University; Carnegie Mellon University

AI summary: For the Moving-Target Traveling Salesman Problem with Moving Obstacles, proposes a mixed-integer conic programming formulation and a two-phase bilevel search algorithm, both of which significantly outperform the baseline.

Abstract

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.

2510.27353 2026-06-18 cs.AI Updated version

An In-depth Study of LLM Contributions to the Bin Packing Problem

Julien Herrmann, Guillaume Pallez

Affiliations: CNRS-IRIT; Inria

AI summary: Analyzes LLM-generated heuristics, finds them human-readable yet hard to interpret, proposes simpler and more efficient new algorithms, and questions the actual contribution of LLMs to the bin packing problem.

Comments: Accepted for publication in ACM Transactions on Evolutionary Learning and Optimization

DOI: 10.1145/3821574

Abstract

Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs' contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.

2602.23092 2026-06-18 cs.AI Updated version

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

Zhuoliang Xie, Fei Liu, Zhenkun Wang, Qingfu Zhang

Affiliations: Southern University of Science and Technology; City University of Hong Kong

AI summary: Proposes AILS-AHD, which combines an evolutionary search framework with large language models to dynamically generate and optimize ruin heuristics and adds an acceleration mechanism. It outperforms existing solvers on moderate and large-scale CVRP instances and sets new best-known solutions for 8 of 10 instances in the CVRPLib large-scale benchmark.

Abstract

The Capacitated Vehicle Routing Problem (CVRP), a fundamental combinatorial optimization challenge, focuses on optimizing fleet operations under vehicle capacity constraints. While extensively studied in operational research, the NP-hard nature of CVRP continues to pose significant computational challenges, particularly for large-scale instances. This study presents AILS-AHD (Adaptive Iterated Local Search with Automatic Heuristic Design), a novel approach that leverages Large Language Models (LLMs) to revolutionize CVRP solving. Our methodology integrates an evolutionary search framework with LLMs to dynamically generate and optimize ruin heuristics within the AILS method. Additionally, we introduce an LLM-based acceleration mechanism to enhance computational efficiency. Comprehensive experimental evaluations against state-of-the-art solvers, including AILS-II and HGS, demonstrate the superior performance of AILS-AHD across both moderate and large-scale instances. Notably, our approach establishes new best-known solutions for 8 out of 10 instances in the CVRPLib large-scale benchmark, underscoring the potential of LLM-driven heuristic design in advancing the field of vehicle routing optimization.

2605.29649 2026-06-18 cs.AI Updated version

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Elliot Gestrin, Jendrik Seipp

AI summary: Uses evolutionary search to have an LLM generate domain-independent heuristics that solve more tasks than the strongest hand-engineered baseline on unseen test domains, and benchmarks the informedness-speed tradeoff of hand-engineered heuristics, which the authors believe has not been done before.

Comments: Accepted at the LM4Plan workshop at ICAPS 2026

Abstract

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.
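
The archive mechanism mentioned in the abstract can be pictured with a toy sketch. The following is a minimal MAP-Elites-style archive, assuming informedness and speed are each normalized to [0, 1] and discretized into a small grid. The bin count, the linear fitness blend and its weights, and all names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # hypothetical label for an evolved heuristic
    informedness: float  # assumed normalized to [0, 1]
    speed: float         # assumed normalized to [0, 1]
    coverage: float      # fraction of tasks solved, in [0, 1]
    solve_time: float    # mean solving time in seconds


def fitness(c: Candidate, time_limit: float = 300.0, weight: float = 0.8) -> float:
    """Blend coverage with solving time (assumed linear blend, for illustration)."""
    time_score = 1.0 - min(c.solve_time, time_limit) / time_limit
    return weight * c.coverage + (1.0 - weight) * time_score


def cell_of(c: Candidate, bins: int = 4) -> tuple:
    """Map the two behavior descriptors to a grid cell."""
    def to_bin(value: float) -> int:
        return min(int(value * bins), bins - 1)
    return to_bin(c.informedness), to_bin(c.speed)


def insert(archive: dict, c: Candidate) -> bool:
    """Keep the candidate only if its cell is empty or it beats the current elite."""
    key = cell_of(c)
    if key not in archive or fitness(c) > fitness(archive[key]):
        archive[key] = c
        return True
    return False


archive: dict = {}
insert(archive, Candidate("blind-v1", 0.10, 0.95, 0.40, 20.0))
insert(archive, Candidate("blind-v2", 0.12, 0.97, 0.55, 25.0))  # same cell, fitter
insert(archive, Candidate("ff-like", 0.80, 0.30, 0.70, 90.0))   # different cell
```

Because each cell keeps only its best candidate, slow but informed heuristics and fast but uninformed ones can coexist in the archive instead of competing directly.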

2411.16206 2026-06-18 cs.LG cs.AI cs.NE Updated version

Scalable Batch Bayesian Optimization Via Subspace Acquisition Functions

Dawei Zhan, Zhaoxi Zeng, Shuoxiao Wei, Ping Wu

Affiliations: School of Computing and Artificial Intelligence

AI summary: Scales Bayesian optimization to large batch evaluation by drawing a batch of axis-aligned subspaces of the original problem and selecting one point from each. The method converges significantly faster than sequential Bayesian optimization and is highly competitive with ten batch algorithms.

DOI: 10.1145/3820495
Journal ref: ACM Transactions on Evolutionary Learning and Optimization, 2026

Abstract

Extending Bayesian optimization to batch evaluation enables the designer to make full use of parallel computing technology. However, most current batch approaches do not scale well with the batch size; that is, their optimization efficiency often deteriorates as the batch size increases. To address this issue, we propose a simple and efficient approach to extend Bayesian optimization to large-scale batch evaluation. Unlike existing batch approaches, the new approach draws a batch of axis-aligned subspaces of the original problem and selects one point from each subspace using existing acquisition functions. Numerical experiments show that our proposed approach speeds up convergence significantly compared with the sequential Bayesian optimization algorithm, and performs very competitively compared with ten batch Bayesian optimization algorithms. The implementation of our proposed approach is available at https://github.com/zhandawei/SubSpace_Acquisition_Functions.
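
The batch-selection idea in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: coordinates outside each subspace are held at the current best point, the acquisition function is a stand-in, and it is maximized by plain random search. The function and parameter names are hypothetical.

```python
import random


def select_batch(acquisition, x_best, bounds, batch_size, subspace_dim,
                 n_samples=200, seed=0):
    """Pick one point per randomly drawn axis-aligned subspace.

    Assumptions for illustration: coordinates outside the subspace stay at
    x_best, and the acquisition is maximized by plain random search.
    """
    rng = random.Random(seed)
    dim = len(x_best)
    batch = []
    for _ in range(batch_size):
        axes = rng.sample(range(dim), subspace_dim)  # axis-aligned subspace
        best_point, best_value = None, float("-inf")
        for _ in range(n_samples):
            candidate = list(x_best)
            for axis in axes:
                low, high = bounds[axis]
                candidate[axis] = rng.uniform(low, high)
            value = acquisition(candidate)
            if value > best_value:
                best_point, best_value = candidate, value
        batch.append((axes, best_point))
    return batch


# Stand-in acquisition: prefers points close to the origin.
def toy_acquisition(x):
    return -sum(v * v for v in x)


x_best = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
bounds = [(-1.0, 1.0)] * len(x_best)
batch = select_batch(toy_acquisition, x_best, bounds, batch_size=4, subspace_dim=2)
```

Each batch member solves a low-dimensional search, so the cost per point does not grow with the batch size, and the points can be evaluated in parallel.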

2606.14202 2026-06-18 cs.NE cs.AI Updated version

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

Affiliations: School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham

AI summary: Proposes MeEvo, a framework that cyclically couples natural evolution (exploring heuristic code) with metacognitive evolution (reflecting on history to generate improved heuristics). It addresses the weak knowledge inheritance and limited exploration of existing methods and performs better on five optimization problems.

Abstract

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

2606.18890 2026-06-18 cs.AI New submission

Skill-Guided Continuation Distillation for GUI Agents

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Affiliations: StepFun; University of Science and Technology Beijing; Tsinghua University; Nanyang Technological University

AI summary: Proposes Skill-Guided Continuation Distillation (SGCD), in which a skill-guided policy generates successful continuations that supply the supervision missing for states not covered by expert trajectories. On OSWorld-Verified it raises the success rate of three base models from the low-30% range to over 50%.

Abstract

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

2606.19047 2026-06-18 cs.AI New submission

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

Affiliations: Zhejiang University; Shanghai Innovation Institute; Westlake University

AI summary: To address the rapid depletion of informative samples in static datasets for multi-turn tool-use RL, proposes RODS, which uses progress reward variance as a zero-cost boundary detector and synthesizes samples online that match the agent's capability boundary. With an active pool of about 800 samples it reaches the performance of a 17K-sample offline pipeline.

Abstract

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
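
The variance argument in the abstract can be checked numerically. Popoviciu's inequality bounds the variance of any variable confined to [m, M] by (M - m)^2 / 4, so rewards in [0, 1] have variance at most 0.25, attained when successes and failures are evenly split. The sketch below ranks tasks by rollout reward variance; the task names, the reward data, and the `top_k` selection rule are hypothetical and only illustrate the idea of a boundary detector.

```python
from statistics import pvariance


def popoviciu_bound(low: float, high: float) -> float:
    """Upper bound on the variance of any variable bounded in [low, high]."""
    return (high - low) ** 2 / 4.0


def boundary_tasks(rollout_rewards: dict, top_k: int) -> list:
    """Rank tasks by rollout reward variance, highest first."""
    variances = {task: pvariance(r) for task, r in rollout_rewards.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:top_k]


# Hypothetical rewards in [0, 1] from 8 rollouts per task.
rollout_rewards = {
    "always_solved": [1, 1, 1, 1, 1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0, 0, 0, 0, 0],
    "balanced": [1, 0, 1, 0, 0, 1, 1, 0],
    "mostly_solved": [1, 1, 1, 0, 1, 1, 1, 1],
}

selected = boundary_tasks(rollout_rewards, top_k=2)
```

Tasks that are always solved or never solved have zero variance and drop to the bottom of the ranking, which matches the abstract's point that they carry little gradient signal.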

2606.19079 2026-06-18 cs.AI New submission

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

Affiliations: University of Turin; Samsung AI Center

AI summary: Proposes ARIADNE, a training-free, adapter-agnostic routing framework that represents each adapter by centroids of its training-set embeddings and selects an adapter at inference time by latent-space distance, with no access to adapter internals and no extra training. It reaches 89.7% selection accuracy on 44 tasks.

Abstract

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
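
The routing rule described in the abstract can be sketched in a few lines. This is a simplified illustration, assuming toy 2-D vectors in place of real encoder embeddings, a single centroid per adapter (the abstract describes a set of centroids), and Euclidean distance as the proximity measure. The adapter names are hypothetical.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_router(training_embeddings):
    """Map each adapter name to the centroid of its training-set embeddings."""
    return {name: centroid(vecs) for name, vecs in training_embeddings.items()}


def route(router, query_embedding):
    """Return the adapter whose centroid is nearest to the query embedding."""
    return min(router, key=lambda name: math.dist(router[name], query_embedding))


# Hypothetical 2-D embeddings; a real system would use an encoder's outputs.
training_embeddings = {
    "sentiment_adapter": [[0.9, 0.1], [1.1, -0.1], [1.0, 0.2]],
    "translation_adapter": [[-1.0, 0.9], [-0.8, 1.1], [-1.2, 1.0]],
    "summarization_adapter": [[0.0, -1.0], [0.2, -1.2], [-0.2, -0.8]],
}

router = build_router(training_embeddings)
chosen = route(router, [0.8, 0.0])
```

Registering a new adapter in this sketch only requires computing one more centroid, which reflects why such a scheme needs no router training and no access to adapter weights.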

2606.19172 2026-06-18 cs.AI New submission

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Bojie Li

Affiliations: Pine AI

AI summary: Proposes User as Engram, which stores a user's facts as local edits to the hash-keyed memory table of an Engram model and carries reasoning skill in one shared adapter, yielding higher indirect-reasoning accuracy with a very small memory footprint.

Abstract

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.

2606.19327 2026-06-18 cs.AI cs.CL New submission

DRIFT: Refining Instruction Data via On-Policy Data Attribution

Zefan Wang, Lincheng Li, Tianyu Yu, Yuan Yao

Affiliations: Tsinghua University

AI summary: Proposes DRIFT, which uses on-policy influence functions to address the proximity gap and gradient-norm bias of standard influence functions in data attribution for instruction fine-tuning. It uses the model's own rollouts as validation targets and raises the performance ceiling of 7B models.

Abstract

Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

2606.18315 2026-06-18 cs.LG cs.AI Cross-list

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Tianyu Wang, Ying Wang, Zhihao Liu, Xi Vincent Wang, Lihui Wang

Affiliations: KTH Royal Institute of Technology; Department of Production Engineering, KTH Royal Institute of Technology; Department of Decision and Control Systems, KTH Royal Institute of Technology

AI summary: Proposes Ghost Attractor Networks, a theoretically derived dynamical decoder that builds a basin-attractor structure for efficient closed-loop sequential generation. As a robotic action decoder, a 2.3M-parameter model matches the offline accuracy of a 1.07B-parameter Diffusion Transformer at 32 times lower latency.

Abstract

Sequential output generation with large-scale Transformer and diffusion decoders pays a memory cost that grows with sequence length, plus iterative per-step computation. Replacing them with small feed-forward decoders restores efficiency but produces unstructured latent representations that limit closed-loop control: phase-conditioned action generation and cross-step latent carry-over both require a latent geometry with stable basins. This article proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose latent evolves under a learned potential with drift and produces a basin-attractor structure by construction. Three desiderata (multi-modality, decoder-level single-pass switching, and constant memory) motivate the potential-drift form, and mode transitions arise as saddle-node bifurcations with ghost-attractor escape. A hierarchical phase-space decomposition separates first-order basin convergence from second-order proprioceptive refinement. Empirically, a Ghost trained end-to-end with a behavioral-cloning and contrastive objective exhibits the predicted gradient-flow contraction in its potential, with the gradient norm decaying by 67 percent across five integration steps on 1430 held-out samples. Ghost is evaluated as a robotic action decoder. A 2.3-million-parameter Ghost matches the offline accuracy of a 1.07-billion-parameter Diffusion Transformer at 462 times fewer parameters and 32 times lower latency, and beats five alternative 2M-parameter decoders (MLP, Neural ODE, CVAE, Transformer, 1-step Diffusion) on offline mean squared error by 5.9 to 29 percent. On the LIBERO-10 closed-loop benchmark, phase conditioning on Ghost's basin-structured latent yields a 13.5 percentage-point success-rate gain over a feed-forward MLP baseline, and persistent-latent ensembling reaches a 95.7 percent final success rate.

2606.18324 2026-06-18 cs.LG cs.AI Cross-list

Neural Phase Correlation

Cole Reynolds

Affiliations: Weyl Labs

AI summary: Proposes a learned generalization of phase correlation that decomposes transformations on a learnable basis and applies to non-rigid deformations and unitary dynamics. It matches or exceeds existing methods on cardiac MRI and echocardiography datasets.

Abstract

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
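
For context, the classical technique that the abstract generalizes can be sketched directly. The example below implements ordinary phase correlation for a 1-D circular shift with a naive DFT. It shows only the fixed Fourier basis case (global translation), not the learned basis proposed in the paper, and the signal values are arbitrary.

```python
import cmath


def dft(signal):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def estimate_shift(a, b, eps=1e-12):
    """Estimate the circular shift s such that b[t] == a[(t - s) % N]."""
    n = len(a)
    fa, fb = dft(a), dft(b)
    # Normalized cross-power spectrum keeps only the phase difference.
    cross = []
    for x, y in zip(fa, fb):
        c = x.conjugate() * y
        cross.append(c / abs(c) if abs(c) > eps else 0.0)
    # The inverse DFT of a pure phase ramp is a peak located at the shift.
    corr = [
        sum(cross[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]
    return max(range(n), key=lambda t: corr[t])


a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
shift = 3
b = [a[(t - shift) % len(a)] for t in range(len(a))]
estimated = estimate_shift(a, b)
```

The estimate depends only on the relationship between the two signals, not on their content, which is the relational property the abstract emphasizes.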

2606.18521 2026-06-18 cs.LG cs.AI Cross-list

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang

Affiliations: Zhejiang University; Simon Fraser University; The Chinese University of Hong Kong; Zhejiang Key Lab of Accessible Perception and Intelligent Systems

AI summary: Finds that the sparse updates of RLVR models are spread farther apart in parameter space, forming near-orthogonal shortcuts that make merging fragile, and proposes SAR-Merging to address this.

Comments: Accepted by KDD 2026

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.

2606.18561 2026-06-18 cs.LG cs.AI 交叉投稿

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

使用Wasserstein对抗学习校正传感器引起的分布漂移

Saraa Ali, Vladimir Bocharnikov, Fedor Ratnikov, Mikhail Hushchyn, Artem Ryzhikov, Denis Derkach

发表机构 * Laboratory of Methods for Big Data Analysis, HSE University（大数据分析方法实验室，高等经济大学）

AI总结提出WGAN方法，通过可学习的校准变换将变化检测器响应分布映射回参考分布，在探测器模型和模拟量能器数据上验证了恢复老化系数和改善能量分布一致性的能力。

Comments This is a preprint sent to Nuclear Science and Techniques journal

AI中文摘要

记录数据的质量取决于采集数据的传感器系统的稳定性。传感器运动和老化会降低下游数据驱动方法的性能和稳定性。我们提出了一种基于Wasserstein-GAN的无监督方法，用于推断物理可解释的变换参数，这些参数将变化的检测器响应分布映射回标称参考分布。与标准生成建模不同，生成器被用作可学习的校准变换，其可训练权重代表所寻求的参数，而判别器通过Wasserstein目标提供分布距离信号。我们在具有受控层偏移的跟踪探测器玩具模型上验证了该方法，并展示了其在具有单元老化效应的高粒度Geant4模拟量能器数据上的应用。该方法恢复了单个单元的老化系数，与真实值相关，并改善了校准后和参考能量和分布之间的一致性，同时随着通道间噪声水平的增加而表现出预期的退化。这些结果表明，在退化参数的直接标签不可用的情况下，对抗性分布匹配可以作为校准策略的数据驱动组件。

英文摘要

The quality of recorded data depends on the stability of the sensor system that acquires it. Sensor motion and aging can degrade the performance and stability of downstream data-driven methods. We present a Wasserstein-GAN-inspired approach for unsupervised inference of physically interpretable transformation parameters that map a changed detector response distribution back to a nominal reference distribution. In contrast to standard generative modeling, the generator is used as a learnable calibration transformation whose trainable weights represent the sought parameters, while the critic provides a distributional distance signal via the Wasserstein objective. We validate the approach on a tracking-detector toy model with controlled layer shifts and demonstrate its application on high-granularity Geant4-simulated calorimeter data with cell-wise aging effects. The method recovers aging coefficients for individual cells with correlation to ground truth and improves agreement between calibrated and reference energy-sum distributions, while exhibiting the expected degradation at increasing channel-to-channel noise levels. These results indicate that adversarial distribution matching can serve as a data-driven component of calibration strategies in settings where direct labels for degradation parameters are unavailable.

2606.18587 2026-06-18 cs.CL cs.AI 交叉投稿

从自身解中学习：面向可验证奖励强化学习的自条件化信用分配

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

发表机构 * Beijing Institute of Technology（北京理工大学）； Beihang University（北京航空航天大学）； Independent Researcher（独立研究者）

AI总结提出SC-GRPO方法，利用自条件化分布间的KL散度作为GRPO梯度的乘性权重，实现细粒度信用分配，在数学、代码和智能体任务上平均提升8.1%。

AI中文摘要

具有可验证奖励的强化学习（RLVR）在训练LLMs进行推理任务方面取得了显著进展，但代表性方法如GRPO对所有token分配统一信用，浪费了常规token上的梯度，同时低估了关键推理步骤。现有的token级信用分配方法需要超出模型自身rollout的资源。GRPO变体依赖于过程奖励模型或真实答案。知识蒸馏通过每个token的散度分配信用，但需要外部教师（在线策略蒸馏）或特权信息（在线策略自蒸馏）。然而，这些依赖性限制了在纯RLVR设置中的适用性。我们观察到，将模型以其自身验证过的轨迹为条件，会在原始分布和条件分布之间诱导出可测量的每token KL散度，并证明当存在多个验证过的轨迹时，从由验证过的轨迹构建的自教师进行蒸馏会导致不可行的加权平均解。我们提出SC-GRPO（自条件化GRPO），它使用前述KL散度作为GRPO梯度的乘性权重。在涵盖数学、代码和智能体任务的五个基准上，SC-GRPO一致优于GRPO 8.1%，优于DAPO 5.9%，并具有更强的分布外性能。此外，SC-GRPO实现了比OPD更高的性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts: GRPO variants rely on process reward models or ground-truth answers, and knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self-Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
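
The weighting idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the function names, the KL direction (conditioned against original), and the mean-1 normalisation are all assumptions made for illustration, and the distributions are hand-made over a two-token vocabulary.

```python
import math

def token_kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def weighted_advantages(original, conditioned, advantage):
    """Scale one trajectory-level advantage by a per-token KL weight.

    Assumption: the weight is KL(conditioned || original), normalised to
    mean 1 so the overall update scale matches uniform credit.
    """
    kls = [token_kl(c, o) for o, c in zip(original, conditioned)]
    mean_kl = sum(kls) / len(kls)
    if mean_kl == 0.0:
        return [advantage] * len(kls)  # fall back to uniform credit
    return [advantage * k / mean_kl for k in kls]

# Three token positions; only the last one changes under conditioning.
original = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
conditioned = [[0.5, 0.5], [0.9, 0.1], [0.95, 0.05]]
weights = weighted_advantages(original, conditioned, advantage=1.0)
```

In this toy case all credit moves to the third position, the only one where conditioning on a verified trajectory shifts the distribution, while the two routine positions receive zero weight.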

2606.18812 2026-06-18 cs.LG cs.AI 交叉投稿

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Abdelrahman Zighem, Jill-Jênn Vie

发表机构 * École normale supérieure de Paris, PSL University, Paris, France（巴黎高等师范学院，PSL大学，法国巴黎）； Soda team, Inria Saclay, Palaiseau, France（Soda团队，法国国家信息与自动化研究所萨克雷中心，法国帕莱索）

AI总结提出通过合成MDP构建强化学习基础模型，利用固定大小的充分统计量使注意力架构适用，在线和离线实验均优于传统算法。

AI中文摘要

语言和视觉的基础模型由互联网规模的数据驱动，而结构化领域（表格预测、时间序列预测、图学习、强化学习）则不然。替代方案是合成数据，它将负担从收集转移到先验设计。这种先验已经存在于许多结构化任务中：TabPFN及其后续工作通过一个在合成贝叶斯先验上预训练的Transformer解决表格分类问题。我们提出两点。首先，强化学习是明显的空白：采样一个合成MDP与采样一个合成表格数据集一样可行，然而没有上下文强化学习工作将先验设计作为主要目标。其次，MDP允许一个固定大小的充分统计量，独立于观察到的回合且形状为表格形式，这使得它们直接适用于用于表格基础模型的基于注意力的架构，只需将策略头替换监督目标。这些共同定义了强化学习基础模型的议程。作为概念验证，我们完全在合成MDP上训练一个模型，并表明，无需任务特定的调优，它就能在上下文中解决留出的表格基准，包括在线和离线：在线时，使用比UCB-VI和表格Q-learning少得多的回合；离线时，与VI-LCB竞争。

英文摘要

Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

2606.18820 2026-06-18 cs.LG cs.AI 交叉投稿

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

成熟马尔可夫决策过程：信息增加与动作集缩小下的决策制定

Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen

发表机构 * Ant International（蚂蚁国际）； School of Economics, Sichuan University（四川大学经济学院）； School of Economics, Fudan University（复旦大学经济学院）

AI总结针对决策过程中信息增加与动作集缩小的不对称性，提出成熟马尔可夫决策过程（MMDP）框架，并基于过期动作优先级原则开发结构感知强化学习方法，实验证明其能提升学习效率。

Comments 25 pages, 9 figures

AI中文摘要

序列决策问题通常表现出信息和决策灵活性的不对称演化：随着决策周期的展开，智能体获得更丰富的信息，而由于操作截止、承诺或资源约束，可行动作逐渐过期。标准的MDP公式通常将这种结构扁平化为阶段相关的状态描述和动作掩码，从而掩盖了嵌套的信息-动作不对称性，而这种不对称性决定了哪些决策是紧急的、哪些可以推迟。我们引入了成熟马尔可夫决策过程（MMDP），这是一种围绕这种信息-动作不对称性构建的公式。我们通过一个过期动作优先级原则来刻画其关键后果之一，该原则识别出必须在下一阶段之前解决的动作。受此结构启发，我们开发了一个结构感知的强化学习框架，包括阶段感知的策略设计、过期动作抽象以及带有蒸馏的搜索增强学习。在受控的多供应商补货问题、复杂度递增的简化现金管理环境以及生产级模拟器上的实验表明，显式建模这种不对称性可以提高学习效率，并且随着决策问题的规模扩大，其价值日益增加。

英文摘要

Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information-action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information-action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

2606.19025 2026-06-18 cs.LG cs.AI cs.DC cs.SY eess.SY 交叉投稿

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

FoMoE: 打破全副本壁垒的专家混合联邦系统

Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

发表机构 * DeepSeek-AI

AI总结提出FoMoE系统，通过跨工作节点分区专家层打破全副本范式，结合部分专家复制和跳跃令牌机制，显著降低通信开销并提升吞吐量。

AI中文摘要

预测关键因素：面向决策的强化学习用于未知离开时间的受控电动汽车充电

Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder

发表机构 * Ghent University -- imec（根特大学 -- imec）

AI总结针对电动汽车充电中离开时间未知导致强化学习策略效果差的问题，提出面向决策的强化学习框架，联合训练预测器与控制器，实现端到端优化，使总奖励提升14%，未供应能量减少55%。

Comments ACM e-Energy 2026 5 pages, 1 figure, 1 table

DOI: 10.1145/3744255.3811736

AI中文摘要

近年来电动汽车的普及给电力系统带来了挑战，包括峰值需求增加和潜在的电网不稳定。基于强化学习的智能充电控制可以通过从历史数据中学习时间和上下文模式来缓解这些问题。然而，在现实场景中，关键特征（如离开时间）通常不可用。这使得强化学习智能体更难学习和执行有效的充电策略。为了减轻这种不确定性，训练好的预测器可以从可用数据中近似未知特征。然而，由于这些预测模型通常针对准确性（而非对下游智能体决策质量的影响）进行训练，它们的误差可能会传播并阻碍使用预测的控制器的整体性能。为了避免这种情况，我们提出了一种面向决策的强化学习框架，其中预测器是端到端训练的，即通过强化学习智能体采取的充电策略动作的反馈。这种预测器和控制器的联合训练最终产生了更高质量的动作：与没有离开时间预测的强化学习方法相比，我们提出的面向决策的强化学习方法产生了更优的充电决策，总奖励提高了14%，未供应能量（即由于电动汽车已离开而未能进行的充电）减少了55%。

英文摘要

The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging, e.g., based on reinforcement learning (RL), can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features such as departure time are often unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than for their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that uses the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of the forecaster and the controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV had already left), relative to the RL method without departure time forecasting.

2606.19236 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE: 基于惊讶度的令牌级优势重加权以实现策略熵稳定性

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Tencent Hunyuan（腾讯混元）

AI总结针对GRPO等RL算法中策略熵崩溃问题，提出STARE方法，通过惊讶度分位数识别熵关键令牌并重加权其优势，结合目标熵闭环门控稳定熵，在1.5B-32B模型和多种任务上实现稳定训练，AIME24/25准确率提升4%-8%。

Comments LLM, Reinforcement Learning

AI中文摘要

基于可验证奖励的强化学习算法（如GRPO）已成为LLMs复杂推理的主流后训练范式，但通常在训练中遭受策略熵崩溃。我们对GRPO下的令牌级熵动态进行一阶梯度分析，识别出令牌级信用分配不匹配：每个令牌的熵变化分解为轨迹级优势与下一个令牌分布上的熵敏感函数的乘积，产生优势-惊讶度四象限结构和近临界性质。受此启发，我们提出STARE（基于惊讶度的令牌级优势重加权以实现策略熵稳定性），该方法通过批次内惊讶度分位数识别熵关键令牌子集，选择性重加权其有效优势，并引入目标熵闭环门控以实现稳定的熵调节。在1.5B至32B的模型规模以及三个任务族（短思维链、长思维链和多轮工具使用）上，STARE在数千步内维持稳定的RL训练，同时将策略熵保持在目标带内。在AIME24和AIME25上，STARE在平均准确率上比DAPO和其他竞争基线高出4%-8%，反思令牌和响应长度同步增长，表明持续探索-利用平衡进一步释放了RL训练潜力。代码可在https://github.com/hp-luo/STARE获取。

英文摘要

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.
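
The three ingredients named in this abstract (surprisal quantiles, selective advantage reweighting, and a target-entropy gate) can be sketched in a few lines. This is a hypothetical illustration, not the released STARE code: the function names, the quantile level `q`, the factor `scale`, and the choice to reweight only when entropy falls below the target are assumptions.

```python
import math

def quantile(values, q):
    """Nearest-rank quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def reweight_advantages(advantages, token_probs, entropy, target_entropy,
                        q=0.5, scale=2.0):
    """Rescale the advantages of high-surprisal tokens when the gate is open.

    Assumptions: surprisal is -log p(token); the gate opens only when the
    measured policy entropy is below the target; `scale` is a free knob.
    """
    surprisals = [-math.log(p) for p in token_probs]
    if entropy >= target_entropy:
        return list(advantages)  # gate closed: leave advantages untouched
    threshold = quantile(surprisals, q)
    return [a * scale if s >= threshold else a
            for a, s in zip(advantages, surprisals)]

probs = [0.9, 0.8, 0.05, 0.7, 0.01]   # sampled-token probabilities in a batch
advs = [1.0] * 5                      # one shared trajectory-level advantage
low = reweight_advantages(advs, probs, entropy=0.2, target_entropy=0.5)
ok = reweight_advantages(advs, probs, entropy=0.6, target_entropy=0.5)
```

With entropy below the target, the three tokens at or above the median surprisal get their advantage doubled. With entropy above the target, the gate stays closed and the advantages pass through unchanged.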

2606.19317 2026-06-18 cs.LG cs.AI 交叉投稿

Explaining Attention with Program Synthesis

用程序合成解释注意力机制

Amiri Hayes, Belinda Li, Jacob Andreas

发表机构 * NJIT（新泽西理工学院）； MIT EECS（麻省理工学院电气工程与计算机科学系）； MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）

AI总结提出用可执行程序近似深度网络组件行为的方法，针对Transformer注意力头，通过生成Python程序再现注意力模式，实现可解释性。

AI中文摘要

可解释深度学习研究的一个长期目标是，用人类可理解的符号描述取代不透明的神经计算。本文提出了一种用可执行程序近似深度网络组件行为的方法。我们专注于Transformer语言模型中的注意力头。对于给定的注意力头，我们首先在一组随机选择的训练样本上计算其关联的注意力矩阵。接着，我们向预训练语言模型提供这些矩阵的摘要，并指示它生成一组Python程序，这些程序仅根据输入句子中的文本即可再现相关的注意力模式。最后，我们根据最终程序集在保留输入上预测行为的效果对程序进行重新排序。我们证明，少于1000个这样的生成程序即可再现GPT-2、TinyLlama-1.1B和Llama-3B中注意力头的注意力模式，在TinyStories上平均交并比相似度超过75%。此外，最佳匹配程序可以替代神经注意力头而不会显著影响模型行为：在三个模型中用程序替代25%的注意力头仅导致平均困惑度增加16%，同时在各种下游问答基准上保持性能。这项工作为使用人类可读、可执行的代码逆向工程Transformer模型中的注意力头提供了一个可扩展的流程，推动了神经模型向符号透明性的发展。

英文摘要

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predicts behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
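
To make the idea of a programmatic surrogate concrete, the sketch below pairs a hand-written "previous token" program with an Intersection-over-Union score against a toy attention matrix. The program, the 0.5 binarisation threshold, and the set-of-pairs definition of IoU are assumptions for illustration; the abstract does not specify how the paper computes IoU.

```python
def previous_token_program(tokens):
    """Hypothetical surrogate: each position attends to the one before it
    (position 0 attends to itself). Returns a set of (query, key) pairs."""
    return {(i, max(i - 1, 0)) for i in range(len(tokens))}

def binarize(attn, threshold=0.5):
    """Keep the (query, key) pairs whose attention weight reaches threshold."""
    return {(i, j)
            for i, row in enumerate(attn)
            for j, w in enumerate(row)
            if w >= threshold}

def iou(a, b):
    """Intersection-over-Union of two sets of attended pairs."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

tokens = ["the", "cat", "sat"]
attn = [[1.0, 0.0, 0.0],   # toy causal attention matrix for one head
        [0.9, 0.1, 0.0],
        [0.6, 0.3, 0.1]]
score = iou(previous_token_program(tokens), binarize(attn))
```

The program matches the head on the first two positions and misses on the third, giving an IoU of 0.5 here.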

2606.19328 2026-06-18 cs.LG cs.AI cs.RO 交叉投稿

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

UBP2: 不确定性平衡的偏好规划用于高效基于偏好的强化学习

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

发表机构 * Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto（多伦多大学学习、具身自主与预测（LEAF）实验室）

AI总结提出UBP2方法，通过联合推理奖励、动力学和值函数的不确定性来主动引导探索，在Meta-World基准上显著提高了样本效率。

AI中文摘要

基于偏好的强化学习提供了一种从行为的成对比较中学习奖励模型的方法，绕过了显式奖励设计的需求。然而，现有方法通常依赖于被动数据收集，并且样本效率低下，尤其是在学习的早期阶段。我们引入了一种基于模型的方法，通过联合推理奖励、动力学和值函数的不确定性来主动引导探索。我们的方法，不确定性平衡的偏好规划（UBP2），使用奖励、动力学和值函数模型的集成，根据结合了期望奖励、终值和认知不确定性的统一评分来评估候选轨迹。在此目标下的规划产生了利用和信息获取之间的显式权衡，无需临时的探索启发式。在标准正则性假设下，我们为有限时域和无限时域设置建立了次线性遗憾保证。在Meta-World基准上的实验表明，UBP2比无模型的基于偏好的方法和非乐观的基于模型的基线方法实现了显著更高的样本效率。

英文摘要

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.
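
A unified score of the kind described here can be sketched with a small ensemble. This is an assumed form, not the paper's objective: epistemic uncertainty is taken to be the ensemble's standard deviation, `beta` is a hypothetical trade-off weight, and the dynamics ensemble is left out to keep the example short.

```python
from statistics import mean, pstdev

def trajectory_score(reward_preds, value_preds, beta=1.0):
    """Score one candidate trajectory with an ensemble.

    reward_preds: per-member lists of per-step predicted rewards.
    value_preds:  per-member predicted terminal values.
    Assumption: score = ensemble mean return + beta * ensemble std.
    """
    returns = [sum(r) + v for r, v in zip(reward_preds, value_preds)]
    return mean(returns) + beta * pstdev(returns)

def plan(candidates, beta=1.0):
    """Pick the candidate trajectory with the highest unified score."""
    return max(candidates,
               key=lambda name: trajectory_score(*candidates[name], beta=beta))

candidates = {
    # ensemble members agree: return 2.0, no disagreement
    "safe": ([[1.0, 1.0], [1.0, 1.0]], [0.0, 0.0]),
    # ensemble members disagree: returns 0.0 and 3.0
    "uncertain": ([[0.0, 0.0], [1.0, 2.0]], [0.0, 0.0]),
}
greedy_choice = plan(candidates, beta=0.0)
optimistic_choice = plan(candidates, beta=1.0)
```

With `beta=0` the planner exploits and picks the trajectory the ensemble agrees on. With `beta=1` the disagreement bonus tips the choice toward the trajectory that would be more informative to try.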

2602.06774 2026-06-18 cs.AI 版本更新

Towards Understanding What State Space Models Learn About Code

理解状态空间模型在代码中学到了什么

Jiali Wu, Abhinav Anand, Shweta Verma, Mira Mezini

发表机构 * TU Darmstadt（达姆施塔特工业大学）； Hessian Center for Artificial Intelligence（黑森人工智能中心）； National Research Center for Applied Cybersecurity ATHENE（应用网络安全国家研究中心ATHENE）

AI总结本文首次系统分析状态空间模型（SSM）在代码理解中的学习机制，发现SSM在预训练时比Transformer更有效捕获语法和语义结构，但微调时会遗忘某些关系，并提出SSM-Interpret框架和架构改进，将NLCodeSearch的MRR提升高达6。

AI中文摘要

状态空间模型（SSM）已成为Transformer架构的高效替代方案。先前工作表明，在可比条件下训练时，SSM在代码理解任务上可以匹配或超越Transformer。然而，其内部机制仍是一个黑箱。我们首次系统分析了基于SSM的代码模型所学到的内容，并在此领域直接比较了SSM和Transformer模型。我们的分析表明，SSM在预训练期间比Transformer更有效地捕获了语法和语义结构，但在某些任务的微调过程中会遗忘某些关系。为了研究这种行为，我们引入了SSM-Interpret，一个频域框架，揭示了微调期间向短程依赖的频谱偏移。在这些发现的指导下，我们提出了架构修改，将基于SSM的代码模型在NLCodeSearch上的性能显著提升了高达+6 MRR。这表明我们的分析不仅解释了模型行为，而且直接导致了更好的设计。

英文摘要

State Space Models (SSMs) have emerged as an efficient alternative to the Transformer architecture. Prior work shows that, when trained under comparable conditions, SSMs can match or surpass Transformers on code understanding tasks. However, their internal mechanisms remain a black box. We present the first systematic analysis of what SSM-based code models learn, along with a direct comparison between SSM and Transformer models in this domain. Our analysis shows that SSMs capture syntactic and semantic structure more effectively than Transformers during pretraining but forget certain relations during fine-tuning on some tasks. To investigate this behavior, we introduce SSM-Interpret, a frequency-domain framework that exposes a spectral shift toward short-range dependencies during fine-tuning. Guided by these findings, we propose architectural modifications that significantly improve the performance of SSM-based code models, by up to +6 MRR on NLCodeSearch. This demonstrates that our analysis not only explains model behavior but also leads directly to better designs.

2603.09344 2026-06-18 cs.AI stat.ML 版本更新

Robust Regularized Policy Iteration under Transition Uncertainty

转移不确定性下的鲁棒正则化策略迭代

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China（浙江大学计算机科学与技术学院）； School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China（西北工业大学人工智能、光学与电子学院（iOPEN））； School of Software Technology, Zhejiang University, Hangzhou, China（浙江大学软件技术学院）； School of Software Engineering, Xi'an Jiaotong University, Xi'an, China（西安交通大学软件工程学院）； School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China（中山大学系统科学与工程学院）

AI总结提出鲁棒正则化策略迭代（RRPI），通过将离线强化学习建模为鲁棒策略优化，使用KL正则化替代难解的双层目标，并基于鲁棒正则化贝尔曼算子实现高效策略迭代，理论保证收敛性，实验在D4RL基准上表现优异。

AI中文摘要

离线强化学习（RL）无需在线探索即可实现数据高效且安全的策略学习，但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对，其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性，我们将离线RL建模为鲁棒策略优化，将转移核视为不确定性集内的决策变量，并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代（RRPI），用可处理的KL正则化替代难解的最大-最小双层目标，并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证，证明所提出的算子是$\gamma$-压缩算子，且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明，RRPI实现了强大的平均性能，在大多数环境中优于包括基于百分位数方法在内的最新基线，并在其余环境中保持竞争力。此外，RRPI通过将较低的$Q$值与高认知不确定性对齐，展现出鲁棒性能，从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
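
One standard way a KL penalty makes a worst-case expectation tractable is the closed form min over P of E_P[V] + (1/beta) KL(P || P0) = -(1/beta) log E_P0[exp(-beta V)]. The sketch below evaluates that expression for a toy next-state distribution. Whether RRPI uses exactly this surrogate is an assumption; the function name and `beta` are illustrative.

```python
import math

def soft_worst_case(p0, values, beta):
    """-(1/beta) * log E_{p0}[exp(-beta * V)], computed stably.

    p0:     nominal next-state probabilities.
    values: value of each next state.
    beta:   larger beta means a weaker KL penalty on the adversary,
            so the result moves from the nominal mean toward min(values).
    """
    m = min(values)  # shift exponents so the largest term is exp(0)
    total = sum(p * math.exp(-beta * (v - m)) for p, v in zip(p0, values))
    return m - math.log(total) / beta

p0 = [0.5, 0.5]
values = [0.0, 10.0]
nominal = sum(p * v for p, v in zip(p0, values))        # plain expectation
mild = soft_worst_case(p0, values, beta=0.01)           # close to nominal
pessimistic = soft_worst_case(p0, values, beta=10.0)    # close to min(values)
```

The backed-up value always lies between the worst next-state value and the nominal expectation, which is what lets a single scalar interpolate between standard and fully adversarial backups.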

2606.11918 2026-06-18 cs.AI 版本更新

HeRo-Q: 通过Hessian条件化实现稳定低比特量化的通用框架

Jinhao Zhang, Yunquan Zhang, Zicheng yan, Boyang Zhang, Jun Sun, Daning Cheng

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； University of Science and Technology of China（中国科学技术大学）； Zhejiang Lab（浙江实验室）； Peng Cheng Laboratory（鹏城实验室）

AI总结针对后训练量化中“低误差、高损失”的矛盾，提出HeRo-Q算法，通过轻量可学习的旋转压缩矩阵重塑损失景观，降低最大Hessian特征值，增强对量化噪声的鲁棒性，在Llama和Qwen模型上优于现有方法。

AI中文摘要

后训练量化（PTQ）是一种主流的模型压缩技术，但由于其仅专注于最小化量化误差，常常导致矛盾的“低误差、高损失”现象。根本原因在于LLM损失景观的Hessian矩阵：少数高曲率方向对扰动极其敏感。为了解决这个问题，我们提出了Hessian鲁棒量化（HeRo-Q）算法，该算法在量化前对权重空间应用一个轻量级、可学习的旋转压缩矩阵。这个联合框架通过降低最大的Hessian特征值来重塑损失景观，从而显著增强对量化噪声的鲁棒性。HeRo-Q不需要修改架构，计算开销可忽略不计，并且可以无缝集成到现有的PTQ流程中。在Llama和Qwen模型上的实验表明，HeRo-Q持续优于包括GPTQ、AWQ和SpinQuant在内的最先进方法，不仅在标准W4A8设置下取得更优性能，而且在极具挑战性的W3A16超低比特场景中表现出色，将Llama3 8B在GSM8K上的准确率提升至70.15%，并有效避免了激进量化中常见的逻辑崩溃。

英文摘要

Post-Training Quantization (PTQ), a mainstream model compression technique, often leads to the paradoxical "low error, high loss" phenomenon because it focuses solely on minimizing quantization error. The root cause lies in the Hessian matrix of the LLM loss landscape: a few high-curvature directions are extremely sensitive to perturbations. To address this, we propose the Hessian Robust Quantization (HeRo-Q) algorithm, which applies a lightweight, learnable rotation-compression matrix to the weight space prior to quantization. This joint framework reshapes the loss landscape by reducing the largest Hessian eigenvalue, thereby significantly enhancing robustness to quantization noise. HeRo-Q requires no architectural modifications, incurs negligible computational overhead, and integrates seamlessly into existing PTQ pipelines. Experiments on Llama and Qwen models show that HeRo-Q consistently outperforms state-of-the-art methods including GPTQ, AWQ, and SpinQuant, not only achieving superior performance under standard W4A8 settings but also excelling in the highly challenging W3A16 ultra-low-bit regime, where it boosts GSM8K accuracy on Llama3 8B to 70.15% and effectively avoids the logical collapse commonly seen in aggressive quantization.

2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph 版本更新

LLM Compression by Block Removal with Constrained Binary Optimization

通过带约束二进制优化的块移除进行LLM压缩

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

发表机构 * Multiverse Computing（多维计算公司）； Donostia International Physics Center（多诺斯蒂亚国际物理中心）； Ikerbasque Foundation for Science（伊克尔巴斯克科学基金会）

AI总结提出将大语言模型块移除压缩问题建模为约束二进制优化，映射到Ising玻璃系统，实现高效排序和高质量非连续块移除，在50%压缩时MMLU提升近23个百分点，且计算高效、通用性强。

Comments 16 pages, 3 figures

AI中文摘要

在本文中，我们将通过最优删除Transformer块（“块移除”）来压缩大语言模型（LLM）的问题，表述为一个约束二进制优化（CBO）问题，该问题可以映射到物理系统（Ising玻璃），其能量是下游模型性能的强代理。这种表述使得能够高效地对大量候选块移除配置进行排序，产生许多高质量、非平凡的解决方案，而不仅仅是移除连续区域。我们的方法在深度压缩场景中表现强劲，例如在Llama-3.3-70B-Instruct的50%压缩中，与其他最先进的块移除方法相比，我们在MMLU基准上取得了近23个百分点的提升。对于较轻的压缩，它在多个基准上与这些方法表现相当，适用于Llama-3.1-8B-Instruct、Qwen3-14B（重训练前后）以及Llama-3.3-70B-Instruct。该方法计算效率高，仅需在校准数据集上对少数活跃参数进行前向和反向传播。此外，我们证明，当无法精确求解CBO问题时，使用良好的启发式求解器可以在可忽略的运行时间内提供在下游任务上表现良好的解决方案。该方法可以轻松应用于任何架构。我们在最近的NVIDIA-Nemotron-3-Nano-30B-A3B-FP8模型上展示了这种通用性，该模型具有高度不均匀且具有挑战性的块结构，并且在移除2个注意力层或3个混合专家层时，我们在AIME25和GPQA上超越了最先进水平。

英文摘要

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations, yielding many high-quality, non-trivial solutions beyond those that only remove consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is infeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.
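
The shape of such a constrained binary optimization can be shown on a toy instance: an Ising-style energy with per-block terms `h` and pairwise couplings `J`, minimised over all ways to remove exactly `k` blocks. The coefficients below are made up for illustration; in the paper they would come from calibration data, and a heuristic solver would replace the brute-force enumeration at realistic sizes.

```python
from itertools import combinations

def energy(removed, h, J):
    """Energy of removing the blocks in `removed` (lower is better).

    h[i]:     assumed cost of removing block i on its own.
    J[(i,j)]: assumed interaction when blocks i < j are both removed.
    """
    e = sum(h[i] for i in removed)
    e += sum(J.get(pair, 0.0) for pair in combinations(sorted(removed), 2))
    return e

def rank_removals(h, J, k):
    """Rank every configuration that removes exactly k blocks."""
    return sorted(combinations(range(len(h)), k),
                  key=lambda config: energy(config, h, J))

h = [0.9, 0.2, 0.3, 0.25, 0.8]        # five blocks, hypothetical costs
J = {(1, 2): 0.5, (1, 3): -0.1}       # hypothetical pairwise couplings
ranking = rank_removals(h, J, k=2)
best = ranking[0]
```

Here the lowest-energy choice removes blocks 1 and 3, a non-consecutive pair: the positive coupling between the adjacent blocks 1 and 2 penalises removing them together, which is the kind of interaction a consecutive-only search cannot see.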

2602.00176 2026-06-18 cs.CV cs.AI 版本更新

Posterior Continuation with Noise-Conditioned Frequency Exposure for Diffusion Inverse Problems

基于噪声条件频率暴露的扩散逆问题后验延续

Feng Tian, Yixuan Li, Weili Zeng, Weitian Zhang, Yichao Yan, Xiaokang Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结提出后验延续框架，根据扩散噪声水平逐步暴露测量频率，结合稳定采样器实现超分辨率、修复和去模糊的先进性能。

超越相似性：时间序列分析中的时序操作注意力

Jevon Twitty, Vinh Pham, Nitiwith Rotchanarak, Viresh Pati, Yubin Kim, Shihao Yang, Jiecheng Lu

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结本文提出时序操作注意力（TOA），通过引入可学习的操作符增强注意力机制，以更有效地处理时间序列数据中的符号和振荡变换，提升时间序列预测、异常检测和分类任务的性能。

AI中文摘要

时间序列预测中存在一个持久性悖论：结构简单的MLP和线性模型往往优于高容量的Transformer。我们指出，这种差距源于序列建模基本原语的不匹配：尽管许多时间序列动态由全局时间操作符（如滤波和谐波结构）主导，标准注意力却将每个输出构造为输入的凸组合。这限制了其表示带符号和振荡变换的能力，而这些变换是时间信号处理的基础。我们将这一限制形式化为softmax注意力中的单纯形约束混合瓶颈，它对由操作符驱动的时间序列任务尤其具有限制性。为了解决这一问题，我们提出时序操作注意力（TOA），一种通过显式、可学习的序列空间操作符增强注意力的框架，使跨时间的直接带符号混合成为可能，同时保持输入依赖的适应性。为了使密集的N×N操作符实用化，我们引入了随机操作符正则化，一种高方差的dropout机制，它稳定了训练并防止了平凡的记忆。在预测、异常检测和分类基准上，TOA在集成到PatchTST和iTransformer等标准骨干时始终提高了性能，在重建密集型任务中的提升尤为突出。这些结果表明，显式操作符学习是有效时间序列建模的关键要素。

英文摘要

A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose Temporal Operator Attention (TOA), a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense $N \times N$ operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
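
The simplex constraint is easy to see numerically. In the sketch below, softmax rows are non-negative and sum to one, so every attention output stays inside the range of the inputs, whereas a signed sequence-space operator (here a fixed first-difference matrix standing in for a learned one) can leave that range. The scores and the operator are toy values chosen for illustration; how TOA combines the two paths is not shown.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mix(weights, values):
    """Apply an N x N mixing matrix to a length-N scalar sequence."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

values = [1.0, 3.0, 2.0]
scores = [[0.0, 1.0, -1.0], [2.0, 0.0, 0.0], [0.5, 0.5, 0.5]]

# Softmax attention: each output is a convex combination of the inputs.
attn_out = mix([softmax(r) for r in scores], values)

# Signed operator: first difference x[t] - x[t-1] (x[-1] treated as 0).
diff_op = [[1.0, 0.0, 0.0],
           [-1.0, 1.0, 0.0],
           [0.0, -1.0, 1.0]]
op_out = mix(diff_op, values)
```

Every entry of `attn_out` lies between 1.0 and 3.0 whatever the scores are, while `op_out` contains -1.0, a value no convex combination of the inputs can produce.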

2605.12713 2026-06-18 quant-ph cs.AI 版本更新

Controllable Quantum Memory Capacity in Quantum Reservoir Networks with Tunable partial-SWAPs

具有可调部分SWAP的量子储备池网络中的可控量子记忆容量

Erik L. Connerty, Ethan N. Evans

发表机构 * University of South Carolina - Columbia（南卡罗来纳大学哥伦比亚分校）； Qodex Quantum（Qodex量子）

AI总结本文提出一种可调部分SWAP机制，用于控制量子储备池网络中记忆衰减速率，通过模拟和IBM QPU验证，提升了噪声中间尺度量子处理器的性能。

Comments 14 pages, 9 figures

AI中文摘要

在量子储备池计算（QRC）领域，许多不同的计算模型和架构已被提出。从这些模型中，我们识别出基于反馈的模型和递归模型作为两种主要竞争架构。本文在递归架构的基础上展开，该架构采用双寄存器方法，使量子储备池计算具有衰减记忆。虽然这些方法已在硬件上验证并展示了在噪声中间尺度量子处理器上的优异性能，但记忆容量产生的确切机制尚未被完全理解，也不完全可控。为此，我们扩展了递归方法，提出了一种硬件可实现的可调部分SWAP机制，允许直接控制在基于门的量子处理器上实现的量子储备池网络的记忆耗散速率。该机制的理论基于受控振幅阻尼通道进行讨论，并通过随机短期记忆容量（STMC）回忆基准和NARMA-5数据集进行验证实验，分别使用模拟和IBM QPU。

英文摘要

In the field of quantum reservoir computing (QRC), many different computational models and architectures have been proposed. Among these, we identify feedback-based models, which use a feedback mechanism to re-embed classical measurements from the QRC, and recurrent models, which use a multi-register approach with memory and readout qubits, as the two major competing architectures that have been discussed and validated on hardware. In this paper, we build on the recurrent architectures, which employ a two-register approach to endow the QRC with a fading memory. While these approaches have been validated on hardware and have demonstrated strong real-world performance on noisy intermediate-scale quantum (NISQ) quantum processing units (QPUs), the exact mechanism through which the memory capacity arises is neither completely understood nor fully controllable. We therefore augment the recurrent approaches with a hardware-realizable mechanism, which we call a tunable partial-SWAP, that allows direct control of the rate of memory dissipation from a quantum reservoir network (QRN) implemented on a gate-based QPU. The theory behind this mechanism is discussed in terms of a controlled amplitude-damping channel, and validation experiments using a randomized short-term memory capacity (STMC) recall benchmark and the NARMA-5 dataset are conducted using simulation and IBM QPUs, respectively.
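
A common textbook form of a partial SWAP is U(theta) = cos(theta) I + i sin(theta) SWAP, which is unitary because SWAP squared is the identity. The sketch below applies it to a two-qubit state and tracks how much of an excitation stays on the memory qubit. The parametrisation, the basis ordering, and the function names are assumptions; the paper's gate and its link to amplitude damping may be defined differently.

```python
import math

def partial_swap(state, theta):
    """Apply cos(theta) I + i sin(theta) SWAP to [a00, a01, a10, a11].

    Basis ordering assumption: the first bit is the memory qubit and the
    second bit is a freshly reset qubit.
    """
    a00, a01, a10, a11 = state
    swapped = [a00, a10, a01, a11]            # SWAP exchanges |01> and |10>
    c, s = math.cos(theta), 1j * math.sin(theta)
    return [c * x + s * y for x, y in zip(state, swapped)]

def memory_retention(theta):
    """Probability that an excitation starting on the memory qubit stays."""
    out = partial_swap([0, 0, 1, 0], theta)   # start in |10>
    return abs(out[2]) ** 2

def norm_squared(state):
    return sum(abs(a) ** 2 for a in state)

full = memory_retention(0.0)            # no swap: memory fully retained
half = memory_retention(math.pi / 4)    # half of the excitation leaks out
none = memory_retention(math.pi / 2)    # full SWAP: memory fully erased
```

Under this parametrisation the retained probability is cos(theta) squared, so a single angle sweeps the per-step memory loss continuously from none to complete.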

2606.06564 2026-06-18 cs.LG cs.AI 版本更新

HAARES: Half-Split Residual Basis Routing for Deep Transformers

HAARES：面向深度Transformer的半分割残差基路由

Kehan Wang

发表机构 * Chongqing University（重庆大学）

AI总结提出HAARES，一种轻量级残差基路由器，在保留累积块源的同时增加一个半分割细节基（前半与后半残差更新之差）；其收益随深度变化，在48层模型中最为可靠。

Comments 6 pages, 4 figures, 3 tables

详情

AI中文摘要

块级残差路由通过对块摘要进行路由，使学习式残差聚合变得实用，但每个摘要都把一串有序的注意力和MLP更新压缩成一个累积向量。我们提出HAARES，一种轻量级残差基路由器，它保留累积块源，并增加一个半分割细节基，即前半部分与后半部分残差更新之差。细节基经过RMS匹配并在线更新，无需密集的子层级路由即可提供粗粒度的块内轨迹信息。在OpenWebText、跨领域字符级基准和BPE分词的OpenWebText上，实验规律与深度相关：在浅层时收益较小或好坏参半，在48层模型中最为可靠。在2.01亿参数的48层设置下，HAARES在全部三个随机种子上均优于Block AttnRes，而4.53亿参数的双种子探测实验显示出相同的趋势。消融实验排除了仅由源复制、随机符号细节、固定的细节源偏置或块数变化带来收益的可能。成本分析表明，该方法的FLOP开销很小，但并非没有墙钟时间开销：它会增加内存和路由开销，不过其相对算术成本随宽度增加而被摊薄，且更早的收敛可以缩短达到目标所需的时间。

英文摘要

Block-level residual routing makes learned residual aggregation practical by routing over block summaries, but each summary compresses an ordered sequence of attention and MLP updates into one cumulative vector. We propose HAARES, a lightweight residual basis router that keeps the cumulative block source and adds one half-split detail basis, computed as the difference between first-half and second-half residual updates. The detail basis is RMS-matched and updated online, exposing coarse intra-block trajectory information without dense sublayer-level routing. Across OpenWebText, cross-domain character-level benchmarks, and BPE-tokenized OpenWebText, the empirical pattern is depth-dependent: gains are small or mixed at shallow depth and most reliable in 48-layer models. In the 201M 48-layer setting, HAARES improves over Block AttnRes across all three seeds, while a 453M two-seed probe shows the same direction. Ablations rule out source duplication, random signed details, fixed detail-source biases, or block-count changes alone. Cost analysis shows that the method is FLOP-light but not wall-clock-free: it adds memory and routing overhead, yet its relative arithmetic cost is amortized as width grows and earlier convergence can reduce time-to-target.
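
To make the half-split detail basis described above concrete, here is a small sketch. It assumes the detail is "sum of first-half updates minus sum of second-half updates" and that RMS matching means rescaling the detail to the RMS of the cumulative source; the sign convention, the matching target, the `eps` constant, and all function names are assumptions for illustration, since the abstract does not specify them. The online update and the routing step itself are not shown.

```python
import math


def vec_sum(vectors):
    """Element-wise sum of a non-empty list of equal-length vectors."""
    return [sum(col) for col in zip(*vectors)]


def rms(vec):
    """Root mean square of a vector."""
    return math.sqrt(sum(x * x for x in vec) / len(vec))


def block_sources(updates, eps=1e-8):
    """Return (cumulative, detail) for one block with at least two updates."""
    half = len(updates) // 2
    first = vec_sum(updates[:half])
    second = vec_sum(updates[half:])
    cumulative = [a + b for a, b in zip(first, second)]
    raw_detail = [a - b for a, b in zip(first, second)]
    # Assumed RMS matching: rescale the detail to the cumulative source's RMS.
    scale = rms(cumulative) / (rms(raw_detail) + eps)
    return cumulative, [x * scale for x in raw_detail]
```

Both vectors would then be offered to the router as separate sources, so the router sees the block's total displacement as well as a coarse signal about how that displacement was distributed across the block.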

2606.10466 2026-06-18 cs.LG cs.AI 版本更新

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

UPLOTS: 一种用于约束时间序列生成的统一预训练语言模型

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

发表机构 * University of New South Wales（新南威尔士大学）； HKUST(GZ)（香港科技大学（广州））； BUAA（北京航空航天大学）

AI总结提出UPLOTS，一种基于统一预训练语言模型和提示引导的框架，通过动态多数据集损失重加权和提示到模式映射，实现跨领域约束时间序列生成，在四个基准上验证了其泛化性和数据增强效果。

详情

AI中文摘要

在时间序列生成中，现有方法通常为每个数据集手工设计或训练单独的模型，这阻碍了它们的可扩展性，并且未能利用跨领域的共享时间结构。为了解决这种碎片化问题，我们提出了UPLOTS，一种统一的、提示引导的语言模型框架，用于跨不同领域的约束时间序列生成。UPLOTS不是构建任务特定的模型，而是利用一个由学习到的约束提示引导的单一预训练transformer骨干网络，从而能够按需生成并精确控制模式。一个关键创新是我们的动态多数据集损失重加权和提示到模式映射，这使得UPLOTS能够在训练期间内化多样化的时间结构，并在推理时有条件地生成它们。我们在四个真实世界基准和多个约束设置（包括峰值周期、日历、负载水平和波动性模式）上评估了UPLOTS。额外的保留约束组合和下游预测实验进一步表明，UPLOTS能够泛化到原始峰值模式设置之外，并在真实数据稀缺的情况下改进数据增强。我们的代码和基线可在匿名GitHub仓库获取：https://anonymous.4open.science/r/UPLOTS-6C36。

英文摘要

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at anonymous github repo: https://anonymous.4open.science/r/UPLOTS-6C36.

2606.12629 2026-06-18 cs.LG cs.AI 版本更新

Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns

Bag of Dims：通过维度级符号模式实现无需训练的机制可解释性

Varun Reddy Nalagatla

发表机构 * Amazon Web Services（亚马逊云服务）

AI总结本文提出Bag of Dims框架，证明Transformer隐藏状态的标准基即可作为无需训练的特征基，通过维度符号模式编码语义，并在涵盖语言、视觉和音频的七个模型上验证了其有效性。

Comments 22 pages, 5 figures, 27 tables

详情

AI中文摘要

我们表明，Transformer隐藏状态的标准基已经提供了一个无需训练、架构通用的特征基。单个维度通过其符号（+/-1）编码语义内容，通过其幅度编码置信度，充当独立的二进制寄存器；一个特征就是具有一致符号模式的一组维度，通过统计符号一致的数量来读取，无需学习旋转。我们在七个模型上验证了这一Bag of Dims框架，涵盖语言（Qwen 3.5-4B、Gemma 3-4B、Mistral 7B、Qwen3-32B）、视觉（DINOv2、ViT-Base）和音频（AST）。仅符号本身就携带预测性内容：单位幅度的符号模式经LM头可保留60-93%的top-5下一个token准确率，无解码器的汉明评分达到80-90%的top-4096准确率。基于单token缓存（每个token一次前向传播，无上下文，无标签），我们通过符号一致性以0.97-0.99的AUC检测出175个类别；训练得到的探针仅增加+0.018 AUC，并收敛到轴对齐的权重。这些特征具有因果作用：它们在K/V注意力投影后仍然保留，可追溯到写入它们的FFN神经元联盟（随机权重对照从未复现这一点），并且在实时前向传播中翻转某个特征的符号会在四个语言模型上抑制对应概念，该效应经过幅度匹配且具有概念特异性。各维度始终保持独立（成对互信息低于0.006比特）。这种结构并非语言所特有：相同的逐维度符号也出现在自监督视觉（DINOv2，12个ImageNet超类中的9个）、监督视觉（ViT-Base，11/12）和音频（AST，ESC-50的50/50个类别）中，因此它反映的是Transformer训练的一般特性，而非语言建模目标。标准基已经足以进行特征读取，只需一次前向传播，无需优化，无需GPU天数。开放问题也随之从寻找正确的旋转转变为编目每个维度所编码的内容。

英文摘要

We show the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individual dimensions encode semantic content via their signs (+/-1) and confidence via their magnitudes, acting as independent binary registers; a feature is a subset of dimensions with a consistent sign pattern, read by counting sign agreements with no learned rotation. We validate this Bag of Dims framework across seven models spanning language (Qwen 3.5-4B, Gemma 3-4B, Mistral 7B, Qwen3-32B), vision (DINOv2, ViT-Base), and audio (AST). Signs alone carry predictive content: unit-magnitude sign patterns preserve 60-93% top-5 next-token accuracy through the LM head, and decoder-free Hamming scoring reaches 80-90% top-4096. From a single-token cache (one forward pass per token, no context, no labels), we detect 175 categories at AUC 0.97-0.99 by sign agreement; a trained probe adds only +0.018 AUC and converges to axis-aligned weights. These features are causally operative: they survive the K/V attention projections, trace to the FFN neuron coalitions that write them (random-weight controls never reproduce this), and flipping a feature's signs during the live forward pass suppresses its concept across four language models, magnitude-matched and concept-specific. Dimensions stay independent throughout (pairwise mutual information below 0.006 bits). The structure is not specific to language: the same per-dimension signs appear in self-supervised vision (DINOv2, 9/12 ImageNet superclasses), supervised vision (ViT-Base, 11/12), and audio (AST, 50/50 ESC-50 categories), so it reflects transformer training in general, not the language-modeling objective. The standard basis already suffices for feature reading at one forward pass, no optimization, no GPU-days. The open problem shifts from finding the right rotation to cataloging what each dimension encodes.
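
The sign-counting readout described above can be illustrated with a toy sketch. Here a "feature" is represented as a list of dimension indices plus the sign expected on each, and the score is the fraction of those dimensions whose sign agrees; this representation, the treatment of zero as positive, and the function names are assumptions for illustration and are not taken from the paper.

```python
def sign_pattern(vec):
    """Map each coordinate to +1 or -1 (zero is treated as +1 here)."""
    return [1 if x >= 0 else -1 for x in vec]


def sign_agreement(hidden, feature_dims, feature_signs):
    """Fraction of a feature's dimensions whose sign matches the expected sign."""
    signs = sign_pattern(hidden)
    matches = sum(1 for d, s in zip(feature_dims, feature_signs) if signs[d] == s)
    return matches / len(feature_dims)


def hamming_distance(vec_a, vec_b):
    """Number of dimensions on which two vectors disagree in sign."""
    return sum(1 for a, b in zip(sign_pattern(vec_a), sign_pattern(vec_b)) if a != b)
```

For example, `sign_agreement([0.3, -1.2, 0.8, -0.1], [0, 1, 3], [1, -1, 1])` counts two matches out of three dimensions. Magnitudes are ignored entirely, which mirrors the abstract's claim that signs alone carry most of the readable content.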

2606.12808 2026-06-18 cs.LG cs.AI 版本更新

SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning

SymQNet: 低延迟自适应哈密顿量学习的摊销获取

Yash Vardhan Tomar, Dheeraj Peddireddy

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出SymQNet，一种摊销强化学习方法，通过离线学习后验条件获取策略，在线快速前向传播，显著降低自适应哈密顿量学习的获取延迟。

详情

AI中文摘要

自适应哈密顿量学习对于校准和表征量子设备至关重要。在自适应控制器中，选择下一个实验本身就是一次计算。贝叶斯设计规则在每次后验更新后都要重新计算，这一步可能需要数秒。在数百次测量中，这些秒数累积为自适应性的显著墙钟成本。我们引入SymQNet，一种用于低延迟自适应哈密顿量学习的摊销强化学习方法。SymQNet离线学习以后验为条件的获取策略，然后在线使用快速的策略前向传播，同时保留贝叶斯后验反馈。在横场伊辛基准测试中，相对于有界Fisher信息搜索和有界两步贝叶斯分歧主动学习（BALD），SymQNet显著降低了获取延迟。在五量子比特时，相对于这些在线基线，它将仅获取部分的决策延迟分别降低了$47.1\times$和$72.6\times$；在十二量子比特时，SymQNet的完整模拟步骤需要$1.02$秒，而有界两步BALD需要$13.27$秒。总体而言，我们表明学习式获取可以使自适应哈密顿量学习在重复的低延迟工作负载中变得实用。

英文摘要

Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next experiment is itself a computation. Bayesian design rules are recomputed after every posterior update, and that step can take seconds. Across hundreds of shots, those seconds become a significant wall-clock cost for adaptivity. We introduce SymQNet, an amortized reinforcement-learning approach for low-latency adaptive Hamiltonian learning. SymQNet learns a posterior-conditioned acquisition policy offline, then uses a fast policy forward pass online while retaining Bayesian posterior feedback. On transverse-field Ising benchmarks, SymQNet substantially reduces acquisition latency relative to bounded Fisher-information search and bounded two-step Bayesian active learning by disagreement (BALD). At five qubits, it reduces acquisition-only decision latency by $47.1\times$ and $72.6\times$ relative to these online baselines; at twelve qubits, full simulated steps take $1.02$ s for SymQNet versus $13.27$ s for bounded two-step BALD. Overall, we show that learned acquisition can make adaptive Hamiltonian learning practical for repeated low-latency workloads.

2606.16214 2026-06-18 cs.LG cs.AI 版本更新

Calibrated Sampling-Free Uncertainty Estimation in Bayesian Deep Learning

贝叶斯深度学习中的校准无采样不确定性估计

Tobias Jan Wieczorek, Leon de Andrade, Thomas Möllenhoff, Marcus Rohrbach

发表机构 * TU Darmstadt & hessian.AI, Darmstadt, Germany（达姆施塔特工业大学 & hessian.AI，德国达姆施塔特）； RIKEN Center for Advanced Intelligence Project, Tokyo, Japan（日本理化学研究所革新智能研究中心，日本东京）

AI总结提出校准方差传播（CVP），通过新型归一化层传播方法、激活函数处理技术及轻量校准步骤，在单次前向传播中高效估计不确定性，在Transformer和CNN上达到与MC采样相当的精度，成本显著降低。

详情

AI中文摘要

现代深度学习模型仍然以过度自信而闻名，限制了它们在高风险应用中的可靠性。贝叶斯方法通过学习模型参数的分布来应对这一问题，最近的进展使得在大规模架构上以与AdamW相当的成本实现这一目标成为可能。然而，测试时仍存在一个挑战：预测必须对从后验中采样的权重进行多次前向传播的平均，这代价高昂。方差传播提供了一种高效的替代方案，在单次前向传播中计算每层不确定性的解析近似。虽然此类技术对MLP有效，但由于现代架构的深度增加和层类型多样性，其扩展仍然具有挑战性。为填补这一空白，我们提出了校准方差传播（CVP），它引入了一种新的归一化层传播方法，结合了处理激活函数的近期技术，并通过轻量校准步骤吸收残差误差。CVP在Transformer和CNN上产生与MC采样相当准确的不确定性估计，而成本仅为极小部分。与先前的方差传播工作相比，CVP在BEiT-3上对视觉推理（NLVR2）的$0.5\%$风险覆盖率从$8.2\%$提高到$14.6\%$，在ViLT上对VQAv2从$2.6\%$提高到$10.8\%$，且增益扩展到卷积架构。

英文摘要

Modern deep learning models remain notoriously prone to overconfidence, limiting their reliability in high-stakes applications. Bayesian methods aim to counter this by learning a distribution over model parameters, and recent advances now make this feasible for large-scale architectures at costs comparable to AdamW. However, a challenge remains at test time: predictions must be averaged across many forward passes with weights sampled from the posterior, which is prohibitively expensive. Variance propagation offers an efficient alternative, computing layer-wise analytical approximations of uncertainty in a single forward pass. While such techniques are effective for MLPs, their extension to modern architectures remains challenging, due to increased depth and diversity of layer types. To fill this gap, we propose Calibrated Variance Propagation (CVP), which introduces a new propagation method for normalization layers, combines it with recent techniques for handling activation functions, and absorbs residual error through a light calibration step. CVP yields comparably accurate uncertainty estimates to MC sampling across transformers and CNNs, at a fraction of the cost. Against prior variance propagation work, CVP improves coverage at $0.5\%$ risk from $8.2\%$ to $14.6\%$ with BEiT-3 on Visual Reasoning (NLVR2) and from $2.6\%$ to $10.8\%$ with ViLT on VQAv2, with gains extending to convolutional architectures.
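
As background for the single-pass idea above, the sketch below shows the generic moment-matching step for one linear layer with independent Gaussian weights and independent inputs. It is the standard mean-field rule, Var(w x) = Var(w)(E[x]^2 + Var(x)) + E[w]^2 Var(x), and is not the normalization-layer propagation or calibration step that CVP introduces, which the abstract does not spell out. The function name and argument layout are hypothetical.

```python
def propagate_linear(in_mean, in_var, w_mean, w_var, bias):
    """Propagate per-unit means and variances through y = W x + b.

    Assumes all weights and inputs are mutually independent and the bias is
    deterministic. w_mean and w_var are lists of rows, one per output unit.
    """
    out_mean, out_var = [], []
    for row_mean, row_var, b in zip(w_mean, w_var, bias):
        m = b + sum(wm * xm for wm, xm in zip(row_mean, in_mean))
        v = sum(
            wv * (xm * xm + xv) + wm * wm * xv
            for wm, wv, xm, xv in zip(row_mean, row_var, in_mean, in_var)
        )
        out_mean.append(m)
        out_var.append(v)
    return out_mean, out_var
```

Chaining such closed-form steps layer by layer replaces the many sampled forward passes of Monte Carlo inference with one deterministic pass, at the price of approximation error that grows with depth and with nonlinear layers.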

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS 交叉投稿

Continuous Audio Thinking for Large Audio Language Models

面向大型音频语言模型的连续音频思考

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

发表机构 * KAIST（韩国科学技术院）

AI总结提出连续音频思考（CoAT）框架，通过专家蒸馏在连续潜在空间中组织声学信息，使音频语言模型在生成响应前利用丰富声学特征，无需额外自回归解码成本，在多个音频任务上提升性能。

Comments Preprint

详情

AI中文摘要

大型音频语言模型（LALMs）在从语音转录到音乐分析等多种音频理解任务中展现了令人印象深刻的能力。然而，由于LALMs通常被训练生成与文本对齐的响应，其隐藏状态逐渐为文本生成而塑造，而非保留声学信息。因此，音频携带的多样化声学内容，如语音细节、韵律、声音事件、情感和音高，在过程中丢失，难以在响应中利用。我们引入了连续音频思考（CoAT），这是一个框架，为音频语言模型配备一个连续的潜在工作空间，用于在响应生成之前组织声学信息，并以来自音频专家的蒸馏作为依据。在思考空间内，模型可以在生成响应时利用专家蒸馏提供的丰富声学信息。此外，所提出的连续思考块可以在单个预填充中处理，因此CoAT不需要比基线额外的自回归解码成本。在三个LALM（Qwen2-Audio、Qwen2.5-Omni-7B和Audio Flamingo 3）上，涵盖音频推理、音频理解、音乐分类、语音情感和语音转录的广泛基准套件上的性能提升证明了CoAT的有效性。进一步分析证实，辅助监督从思考位置传播到模型的文本响应。

英文摘要

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

2606.18372 2026-06-18 cs.CL cs.AI 交叉投稿

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

保留还是删除？用于教育对话去标识的完全本地AI级联框架

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

发表机构 * Cornell University（康奈尔大学）

AI总结针对教育对话中课程术语与个人身份信息混淆的问题，提出一种完全本地的级联框架，通过召回优先的联合提议器和上下文感知审查器实现约束性隐私分类，在数学辅导对话上达到0.958的宏F1，优于商业API和纯LLM基线。

详情

AI中文摘要

基于Transformer的架构在生成复杂符号序列方面取得了显著进展，但在实现对离散信号属性的细粒度、可解释控制方面仍存在明显差距。本文研究了多轨音乐Transformer（MMT）的机制可解释性，并提出了一种无需重新训练即可通过推理时激活引导实现确定性属性调制的框架。利用差分均值（DiffMean）方法，我们在残差流中分离出信号属性（特别是音高和时长）的潜在方向。我们验证了该领域的线性表示假设，实现了引导幅度与属性偏移之间的高相关性。为了解决多属性引导中固有的特征纠缠问题，我们引入了一种利用Gram-Schmidt正交化的双引导框架。实验结果表明，与朴素向量加法相比，这种几何解耦减少了概念干扰和信号退化，即使在强自回归条件下也能实现独立的确定性控制。

英文摘要

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
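
The two ingredients named above, Difference-in-Means directions and Gram-Schmidt orthogonalization, can be sketched in a few lines. The sketch assumes activations are plain lists of floats collected at one residual-stream position, with a "positive" set exhibiting the attribute and a "negative" set lacking it; the function names and the additive steering rule with a scalar strength `alpha` are illustrative assumptions, not the paper's exact procedure.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def diff_mean(positive_acts, negative_acts):
    """Difference-in-Means direction: mean(positive) - mean(negative)."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(direction, reference):
    """One Gram-Schmidt step: remove the component of direction along reference."""
    scale = dot(direction, reference) / dot(reference, reference)
    return [d - scale * r for d, r in zip(direction, reference)]


def steer(activation, direction, alpha):
    """Add alpha times the steering direction to an activation."""
    return [a + alpha * d for a, d in zip(activation, direction)]
```

Orthogonalizing, say, a duration direction against a pitch direction before adding both means that steering along one no longer moves the activation along the other, which is the geometric decoupling the abstract credits for reduced interference.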

2606.18801 2026-06-18 cs.IR cs.AI 交叉投稿

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

SHIFT: 通过索引侧特征变换实现多语言信息检索的语义对齐

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University（高丽大学计算机科学与工程系）

AI总结提出SHIFT方法，在索引阶段通过平行翻译对估计相对语言向量并修正文档嵌入，以缓解多语言密集检索中的语言偏差，无需训练即可提升检索性能。

详情

AI中文摘要

随着大规模多语言语料库的迅速扩展，多语言信息检索（MLIR）已成为全球信息访问的关键技术。MLIR使用户能够使用单语言查询从多语言文本集合中检索语义相关的文档。然而，最近的多语言密集检索模型通常表现出对与查询相同语言的文档的强烈偏好。这导致了严重的语言偏差，即排名靠前的结果被特定语言的文档主导，即使其他语言的文档包含更多语义相关信息。为了解决这个问题，我们提出了SHIFT，一种在索引阶段适用的无需训练的方法。具体来说，SHIFT利用平行翻译对来估计每个目标语言相对于源语言的相对语言向量。随后，SHIFT通过在索引期间从文档嵌入中减去该相对语言向量来纠正语言特定的偏移。我们在四个MLIR基准测试和多种密集检索模型上的全面评估证实，SHIFT可以有效缓解语言偏差并提升MLIR性能。

英文摘要

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.
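
The index-side correction described above can be sketched as follows. The sketch assumes the relative language vector is the mean embedding offset between each translation and its source-language counterpart, which is one natural reading of "estimated from parallel translation pairs"; the abstract does not state the estimator, so that choice and all function names are assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def relative_language_vector(source_embs, target_embs):
    """Mean offset of target-language embeddings from their source-language pairs."""
    offsets = [
        [t - s for s, t in zip(src, tgt)]
        for src, tgt in zip(source_embs, target_embs)
    ]
    return mean_vector(offsets)


def shift_document(doc_emb, language_vector):
    """Subtract the language offset from a document embedding at indexing time."""
    return [d - v for d, v in zip(doc_emb, language_vector)]
```

Because the subtraction is applied once per document when the index is built, query encoding and retrieval are unchanged, which is what makes the approach training-free.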

2606.18811 2026-06-18 cs.IR cs.AI 交叉投稿

Rescaling MLM-Head for Neural Sparse Retrieval

重新缩放MLM头部用于神经稀疏检索

Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim

发表机构 * Korea University（高丽大学）

AI总结针对SPLADE中MLM头部尺度不匹配导致训练不稳定和性能下降的问题，提出初始化时对MLM头部投影进行常数因子重缩放，零成本提升训练稳定性，使大范数骨干网络成为有竞争力的稀疏检索器。

详情

AI中文摘要

学习型稀疏检索（LSR）模型（如SPLADE）传统上使用BERT风格的掩码语言模型作为骨干编码器。一个自然的期望是，用更强的预训练编码器替换BERT应能提高检索效果。然而，我们发现，在标准的SPLADE训练方案下，具有大MLM头部L2范数的骨干网络可能会遭受性能下降，甚至出现训练崩溃。我们将此失败归因于MLM头部中的尺度不匹配：SPLADE直接使用MLM头部输出来构建稀疏词汇表示，查询-文档相关性通过这些表示上的未归一化点积计算。因此，膨胀的MLM头部尺度会放大稀疏激活，扭曲匹配分数，并在常见训练设置下破坏对比训练的稳定性。为了解决这个问题，我们引入了一个简单的初始化时修正，在SPLADE训练之前通过一个常数因子重新缩放MLM头部投影。这种零成本调整提高了训练稳定性，而无需修改模型架构或训练目标。在领域内和跨领域检索基准测试中，这种简单的修正显著改善了诸如ModernBERT和Ettin等大范数骨干网络，将不稳定的训练运行转变为有竞争力的稀疏检索器。在多个设置中，修正后的模型进一步匹配或超越了经典的BERT-SPLADE基线。这些发现表明，将预训练编码器适应于LSR的瓶颈不仅仅是编码器容量，而是用于构建稀疏词汇表示的MLM头部尺度的校准。

英文摘要

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.
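
A toy sketch can show why the MLM-head scale matters. It uses the commonly cited SPLADE term weighting, a max over tokens of log(1 + ReLU(logit)), and a bare linear projection standing in for the MLM head (real heads also include a bias and normalization, omitted here). The rescaling factor of 0.1, the toy numbers, and the function names are illustrative assumptions; the abstract does not say how the constant is chosen.

```python
import math


def mlm_head(hidden_states, projection):
    """Bare linear head: logits[i][j] = sum_k hidden[i][k] * projection[k][j]."""
    vocab = len(projection[0])
    return [
        [sum(h * projection[k][j] for k, h in enumerate(state)) for j in range(vocab)]
        for state in hidden_states
    ]


def splade_weights(token_logits):
    """Max-pool log(1 + ReLU(logit)) over tokens for each vocabulary entry."""
    vocab = len(token_logits[0])
    return [
        max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        for j in range(vocab)
    ]


def rescale_projection(projection, factor):
    """Initialization-time correction: multiply the projection by a constant."""
    return [[w * factor for w in row] for row in projection]


def dot(a, b):
    """Unnormalized dot-product relevance score."""
    return sum(x * y for x, y in zip(a, b))
```

Since relevance is an unnormalized dot product, shrinking the projection before training lowers every activation and therefore the raw scores that feed the contrastive loss, without touching the architecture or objective.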

2606.18831 2026-06-18 cs.CL cs.AI 交叉投稿

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

超越奖励工程：长上下文强化学习的数据配方

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

发表机构 * OpenBMB ； Tsinghua University（清华大学）

AI总结提出一种简单有效的数据配方，结合最小化基于结果的GRPO设置，显著提升大语言模型的长上下文推理能力，在多个基准和智能体任务上取得平均+3.2至+7.2点的提升。

Comments 15 pages, 6 figures, 12 tables

详情

AI中文摘要

长上下文推理是大语言模型的一项关键能力，特别是当它们作为必须推理长轨迹的自主智能体部署时。强化学习最近成为提升这一能力的主要范式，然而现有工作主要关注奖励工程，而多样化的训练数据仍然稀缺。我们从数据为中心的角度重新审视这个问题，并表明仅凭一种简单有效的数据配方，结合最小化基于结果的GRPO设置，就足以显著提升长上下文推理。我们的配方针对三个互补的任务族——检索、多证据合成和推理——我们构建并整理了八个数据集，总计约1.4万个示例。在三个模型（Qwen3-4B/8B/30B-A3B）上的实验在七个长上下文基准上取得了平均+7.2/+3.2/+6.4分的提升，超过了之前的强化学习训练集。我们进一步证明这些增益可以迁移到智能体任务中，在基于智能体调整的模型上继续使用我们的数据配方进行强化学习训练，GAIA提升+4.8分，BrowseComp提升+7.0分。我们将发布我们的数据集以促进未来研究。

英文摘要

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
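
For readers unfamiliar with the "minimal outcome-based GRPO setup" mentioned above, the core of GRPO is a group-relative advantage: several responses are sampled for the same prompt, each gets a scalar reward, and rewards are standardized within the group. The sketch below assumes a binary exact-match outcome reward and population standard deviation; both are assumptions for illustration, since the abstract does not describe the reward function or normalization details.

```python
import statistics


def outcome_reward(answer, reference):
    """Hypothetical outcome reward: 1.0 for a case-insensitive exact match."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason the difficulty mix of the training data matters for this style of training.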

2606.18852 2026-06-18 cs.CL cs.AI 交叉投稿

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

对齐隐含陈述：通过上下文边界半硬负挖掘实现隐式仇恨言论的泛化性

Wicaksono Leksono Muhamad, Yunita Sari

发表机构 * Mantera Studio（Mantera工作室）； Universitas Gadjah Mada（加查马达大学）

AI总结提出ImpSH三元组框架，通过将帖子与隐含陈述对齐并使用上下文边界半硬负样本聚焦学习，提升隐式仇恨言论的跨域泛化能力，在多个数据集上优于对比基线。

详情

AI中文摘要

隐式仇恨言论分类仍然是一个挑战，因为意图通常通过暗示和上下文而非明确辱骂来掩盖。先前的监督对比方法改进了域内检测，但可能过拟合表面线索，且难以跨数据集迁移。我们提出ImpSH，一个基于三元组的框架，当隐含陈述可用时将其与帖子对齐，并使用上下文边界半硬负样本将学习聚焦于近混淆项。我们还研究了AugSH，它通过数据增强形成正样本。在使用BERT和HateBERT对IHC、SBIC和DynaHate进行的受控评估中，ImpSH是标准监督对比基线的可行替代方案，并且在匹配的预处理和调优预算下通常能提高跨域性能。使用对齐性和均匀性进行的表示分析表明，正样本对更紧密且全局分布平衡，定性最近邻案例研究展示了域转移下的典型假负例。这些结果表明，通过上下文边界挖掘将帖子与其隐含陈述对齐，提供了到相关暗示的更稳定、类似双射的映射，克服了传统基于聚类的表示学习固有的波动性。

英文摘要

Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.
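
The semi-hard negative idea above follows the usual triplet-learning definition: a negative is semi-hard when it is farther from the anchor than the positive but still inside the margin. The sketch below shows only that standard selection rule and the triplet loss, with Euclidean distance as an assumed metric; how ImpSH bounds candidates by context is not specified in the abstract and is not modeled here. In the paper's setting the anchor would be a post embedding and the positive its implied-statement embedding.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def semi_hard_negatives(anchor, positive, candidates, margin):
    """Keep negatives with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_pos = euclidean(anchor, positive)
    return [n for n in candidates if d_pos < euclidean(anchor, n) < d_pos + margin]


def triplet_loss(anchor, positive, negative, margin):
    """Hinge loss pushing the negative at least `margin` beyond the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

Semi-hard negatives always give a positive loss, so each selected triplet contributes a training signal, while the hardest negatives (closer than the positive) are skipped because they tend to destabilize early training.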

2606.18922 2026-06-18 cs.CL cs.AI 交叉投稿

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单：评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab University of Bristol（布里斯托大学智能系统实验室）； ILLC University of Amsterdam（阿姆斯特丹大学逻辑、语言与计算研究所）

AI总结本研究通过开发新的注释数据集，测试多种大型语言模型在比喻语言中理解否定的能力，发现否定与比喻的组合对模型构成挑战，且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2606.18986 2026-06-18 cs.CL cs.AI 交叉投稿

医学LLM适应中的权衡：法语问答的实证研究

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

发表机构 * Aix-Marseille Univ., CNRS, LIS UMR 7020（艾克斯-马赛大学，法国国家科学研究中心，计算机与系统实验室）； Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004（南特大学，南特中央理工学院，法国国家科学研究中心，数字科学实验室）； Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217（格勒诺布尔-阿尔卑斯大学，法国国家科学研究中心，法国国家信息与自动化研究所，格勒诺布尔理工学院，信息学实验室）

AI总结通过法语医学问答任务，实证比较持续预训练（CPT）和监督微调（SFT）在多个模型家族和规模下的效果，发现CPT+SFT在多项选择问答上最优但增益小，SFT是强且经济的默认选择，而CPT在开放式问答中提升重叠指标。

详情

AI中文摘要

大型语言模型（LLMs）的发展导致了对它们适应专业领域和语言的关注增加，但领域适应策略的有效性仍不明确。我们以法语医学问答（QA）为案例，进行了医学领域适应的研究。我们比较了持续预训练（CPT）、监督微调（SFT）及其组合，跨越三个模型家族、多个规模和三种初始化类型，明确区分了适应效果与基础模型选择。我们在贪婪和约束解码下，使用自动指标和LLM-as-a-Judge评估，评估了多项选择问答（MCQA）和开放式问答（OEQA）。对于MCQA，CPT+SFT通常取得最佳分数，但相比SFT的增益很小且通常不显著，使得SFT成为强大且成本效益高的默认选择。对于OEQA，CPT持续改善基于重叠的指标，而SFT常降低生成质量；指令调优和CPT+SFT在基于LLM的评估中更受青睐。跨语言实验进一步显示，法语适应能有效迁移到英语基准。总体而言，我们为在计算约束下选择适应策略提供了实用指南。

英文摘要

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

2606.19325 2026-06-18 cs.SD cs.AI cs.CV 交叉投稿

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

参考驱动的野外先验多说话人音频场景生成

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

发表机构 * Lightricks ； Tel Aviv University（特拉维夫大学）

AI总结提出ScenA方法，利用预训练的文本到音频流匹配基础模型，通过多参考声音和自然语言提示生成多说话人音频场景，并采用高噪声偏置时间步分布解决参考捷径问题，在CoVoMix2-Dialogue基准上优于现有系统。

Comments Project page at https://finmickey.github.io/scena/

详情

AI中文摘要

现有的多说话人对话系统通过结构化监督（如每轮标签、多流转录或可学习说话人嵌入）将说话人与话语绑定。这些系统在仅语音的流水线中运行，生成干净的语音序列，缺乏真实对话的环境纹理。我们采取不同的方法。我们的方法ScenA将文本到音频流匹配基础模型（在大规模野外数据上预训练）直接以多个参考声音和描述整个多说话人音频场景的自由形式自然语言提示为条件。利用这样的基础模型使我们能够继承其生成自然、非录音室音频的能力：背景噪声、房间声学、重叠对话和自发的副语言事件，同时添加多说话人控制而无需任何每轮结构。具体地，参考潜在向量被连接到模型的令牌序列中，并通过轻量级的身份感知位置编码进行区分。然而，我们识别出这种方法的一个关键障碍：参考捷径。在标准噪声调度下的训练过程中，模型可以通过声学相似性识别匹配的参考与噪声目标，从而完全绕过文本提示。我们通过高噪声偏置的时间步分布来解决这个问题，迫使模型依赖文本提示进行说话人分配。我们在CoVoMix2-Dialogue基准上评估ScenA，结果表明它在说话人绑定指标上优于现有的多说话人系统，同时生成具有重叠语音、情感发声和环境声音的丰富对话音频。我们的结果证明了使用以自由形式场景描述为条件的通用音频模型，而不是通过仅语音流水线传递结构化对话脚本的优势。

英文摘要

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.

2606.16276 2026-06-18 cs.AI 版本更新

SpecAlign: Efficient Specification-Grounded Alignment of Large Language Models via Synthetic Data

SpecAlign: 通过合成数据实现高效的大语言模型规范对齐

Wenjie Wang, Yue Huang, Zhengqing Yuan, Han Bao, Shiyi Du, Yuchen Ma, Yue Zhao, Yanfang Ye, Xiangliang Zhang

发表机构 * University of Notre Dame（圣母大学）； Carnegie Mellon University（卡内基梅隆大学）； LMU Munich（慕尼黑大学）； University of Southern California（南加州大学）

AI总结提出规范对齐新范式，通过从规范文档合成数据（SpecAlign框架），结合结构化规则标注、可控规范实例化和多智能体对抗数据合成，生成细粒度偏好对，提升规则遵守度且不损害通用能力。

Comments 58 pages

详情

AI中文摘要

随着大语言模型（LLM）在现实应用中的部署日益增多，对齐不再由单一的通用安全或有用性概念主导，而是由提供商或应用特定的模型规范主导。这些规范通常冗长、结构化且频繁更新，然而现有的对齐流程缺乏系统化的机制来将其作为训练信号。在本文中，我们提出规范对齐（specification-grounded alignment），一种新的对齐范式，将提供商编写的模型规范作为主要对齐目标，而非抽象原则或静态基准。为实例化该范式，我们引入SpecAlign框架，该框架直接从规范文档合成对齐数据。SpecAlign结合结构化规则标注、可控规范实例化和多智能体对抗数据合成，生成细粒度、边界感知的偏好对，捕获合规行为和有意义的规范违反。在多个模型规范和骨干模型上的实验表明，使用SpecAlign进行训练一致地提高了规则遵守度，同时保持了通用能力并避免了过度保守的行为。这些结果表明，将对齐建立在显式模型规范上，能够实现LLM行为对不断变化的政策要求的快速、精确和可扩展的适应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, alignment is no longer governed by a single universal notion of safety or helpfulness, but instead by provider- or application-specific model specifications. These specifications are typically long, structured, and frequently updated, yet existing alignment pipelines lack a systematic mechanism to operationalize them as training signals. In this paper, we propose specification-grounded alignment, a new alignment paradigm that treats provider-authored model specifications as the primary alignment target rather than abstract principles or static benchmarks. To instantiate this paradigm, we introduce SpecAlign, a framework that synthesizes alignment data directly from specification documents. SpecAlign combines structured rule annotation, controllable specification instantiation, and multi-agent adversarial data synthesis to generate fine-grained, boundary-aware preference pairs that capture both compliant behaviors and meaningful specification violations. Experiments across multiple model specifications and backbone models demonstrate that training with SpecAlign consistently improves rule compliance while preserving general capabilities and avoiding over-conservative behavior. These results suggest that grounding alignment in explicit model specifications enables rapid, precise, and scalable adaptation of LLM behavior to evolving policy requirements.

2502.07531 2026-06-18 cs.CV cs.AI cs.LG cs.MM 版本更新

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

VidCRAFT3: 面向图像到视频生成的相机、物体与光照控制

Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu

发表机构 * School of Data Science, Fudan University（复旦大学数据科学学院）； Shanghai Innovation Institute（上海创新研究院）； Zhejiang University（浙江大学）； Huawei Noah’s Ark Lab（华为诺亚实验室）； Westlake University（西湖大学）； School of Data Science and MOE Frontiers Center for Brain Science, Fudan University（复旦大学数据科学学院和脑科学前沿中心）； Fudan ISTBI–ZJNU Algorithm Centre for Brain-inspired Intelligence, Zhejiang Normal University（复旦大学-浙江师范大学脑启发智能算法中心）

AI总结提出VidCRAFT3框架，通过显式建模几何、运动与光照的跨因素交互，实现对相机运动、物体运动和光照方向的独立或联合控制，在控制精度和视觉一致性上达到最优。

Comments Accepted to TVCG 2026

AI中文摘要

可控图像到视频（I2V）生成将参考图像转换为由用户指定控制信号引导的连贯视频。虽然对相机运动、物体运动和光照的精确控制对于高保真创作至关重要，但现有方法通常独立处理这些因素，忽视了动态场景中视角、几何和光照之间的物理耦合，导致同时变化时出现阴影不匹配和透视漂移等视觉不一致问题。我们提出了VidCRAFT3，一个统一且灵活的I2V框架，显式建模几何、运动和光照之间的跨因素交互，实现对相机运动、物体运动和光照方向的独立或联合控制。Image2Cloud提供显式的3D几何先验以实现精确的相机运动控制。ObjMotionNet将稀疏物体轨迹编码为多尺度运动特征，以引导逼真的物体运动。空间三重注意力Transformer（Spatial Triple-Attention Transformer）通过光照交叉注意力整合光照方向，实现一致的重光照。为了解决联合标注数据的稀缺性，我们构建了VideoLightingDirection（VLD）数据集，包含精确的逐帧光照方向标注，并引入三阶段渐进训练策略，使得无需完全联合标注即可实现鲁棒学习。大量实验表明，VidCRAFT3在多种场景下的控制精度和视觉一致性上达到了最先进水平。

英文摘要

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.

2508.09191 2026-06-18 cs.LG cs.AI 版本更新

InstructTime++: 通过隐式特征增强的多模态语言建模进行时间序列分类

Mingyue Cheng, Xiaoyu Tao, Huajian Zhang, Qi Liu, Zhiding Liu, Yucong Luo, Yiheng Chen, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China（中国科学技术大学认知智能国家重点实验室）

AI总结提出将时间序列分类转化为多模态生成任务，通过离散化模块和对齐投影层弥合模态差距，并利用隐式特征建模提升语言模型性能。

AI中文摘要

大多数现有的时间序列分类方法采用判别范式，将输入序列直接映射到独热编码的类别标签。虽然有效，但这种范式难以融入上下文特征，也无法捕捉类别间的语义关系。为了解决这些局限性，我们提出了InstructTime，一种将时间序列分类重新定义为多模态生成任务的新框架。具体来说，连续的数值序列、上下文文本特征和任务指令被视为多模态输入，而类别标签则通过调优的语言模型作为文本输出生成。为了弥合模态差距，InstructTime引入了一个时间序列离散化模块，将连续序列转换为离散的时间标记，同时结合对齐投影层和生成式自监督预训练策略，以增强跨模态表示对齐。在此框架基础上，我们进一步提出了InstructTime++，通过引入隐式特征建模来扩展InstructTime，以补偿语言模型有限的归纳偏差。InstructTime++利用专门的工具包从原始时间序列和上下文输入中挖掘信息丰富的隐式模式，包括统计特征提取和基于视觉-语言模型的图像描述，并将其转化为文本描述以实现无缝集成。在多个基准数据集上的大量实验证明了InstructTime++的优越性能。

英文摘要

Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.

2601.17226 2026-06-18 cs.CL cs.AI 版本更新

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复：面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales（新南威尔士大学）

AI总结提出RRR强化学习框架，结合结构主义叙事学与标量叙事性，通过d-RLAIF从文本特征中获取训练信号，无需参考输出，提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷，此时它们无法依赖回忆记忆的训练数据。基于真实值的后训练（如SFT）无法教会LLM生成逻辑合理的叙事事件。本文提出Retell, Reward, Repeat (RRR)，一个基于强化学习的流水线，将结构主义叙事学与标量叙事性相结合，以教授故事结构。我们扩展了TimeTravel数据集，加入人工标注的叙事平衡阶段，以评估奖励模型。通过d-RLAIF，RRR从文本特征的叙事性中推导训练信号，无需参考输出。评估表明，RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线，输出质量还通过人类盲评偏好得到验证。RRR仅依赖小型的纯查询数据集，为故事讲述这一目前缺乏有效后训练方法的领域提供了一种基于语言学、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2601.19792 2026-06-18 cs.CL cs.AI cs.HC 版本更新

LVLMs and Humans Ground Differently in Referential Communication

LVLMs与人类在指称交流中建立共同基础的方式不同

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

AI总结通过人类与AI配对的多轮指称交流实验，发现LVLMs无法像人类一样利用共同基础生成和解析指称表达，导致交流不畅。

Comments 27 pages, 16 figures

2602.06470 2026-06-18 cs.CL cs.AI 版本更新

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结本文提出UNO框架，通过用户日志提炼规则和偏好对，利用查询反馈驱动聚类处理数据异质性，量化模型知识与日志数据间的认知差距，提升LLM系统性能。

AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型（LLMs）的发展，但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此，近期研究更加关注从真实世界部署中持续学习，其中用户交互日志提供了丰富的真实人类反馈和过程知识。然而，从用户日志学习具有挑战性，因为它们是无结构和嘈杂的。传统的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为，且用户日志收集与模型优化之间的差异（例如，离策略优化问题）进一步加剧了这一问题。为此，我们提出UNO（用户日志驱动的优化），一个统一的框架，用于通过用户日志改进LLM系统（LLMsys）。UNO首先将日志提炼为半结构化的规则和偏好对，然后利用查询和反馈驱动的聚类来管理数据异质性，最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块，以处理从用户日志中提取的初级和反思性经验，从而提升未来的响应。广泛的实验表明，UNO在效果和效率上均达到最先进的水平，显著优于检索增强生成（RAG）和基于记忆的基线方法。我们已开源代码至 https://github.com/bebr2/UNO 。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

2602.15851 2026-06-18 cs.CL cs.AI 版本更新

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

叙事理论驱动的LLM方法在自动故事生成与理解中的应用：综述

David Y. Liu, Aditya Joshi, Paul Dawson

发表机构 * School of Computer Science and Engineering（计算机科学与工程学院）； School of Arts and Media（艺术与媒体学院）； University of New South Wales (UNSW)（新南威尔士大学）

AI总结综述叙事理论驱动的大语言模型方法在自动故事生成与理解中的应用，分析现状并指出生成任务在理论应用、后训练方法、非虚构叙事及叙事层次等方面落后于理解任务，提出未来方向。

Comments 31 pages

AI中文摘要

使用大语言模型（LLM）的叙事理论应用在自动故事生成和理解任务中提供了有前景的方法。本综述考察了自然语言处理（NLP）研究如何利用LLM方法处理叙事研究中的不同概念。我们使用叙事学中的既定区分来分类当前工作，并发现以下内容：(a) 叙事文本来源多样，不仅限于文学；(b) 理论综合与验证是潜在成果；(c) 生成任务在多个方面落后于理解任务：理论应用、后训练方法、探索非虚构叙事以及处理超出故事与话语层面的叙事层次。对于未来方向，我们相信，与其追求单一的、通用的“叙事质量”基准，进步可以受益于以下方面的努力：定义和改进针对单个叙事属性的基于理论的度量；继续开展大规模、理论驱动的文学/社会/文化分析；在情境化上下文中生成叙事；以及继续进行实验，其输出可用于验证或完善叙事理论。本文通过概述当前研究工作和更广泛的叙事研究领域，为NLP中更系统、更具理论依据的叙事研究提供了背景基础。

英文摘要

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: (a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continuing to conduct large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview of ongoing research efforts and the broader narrative studies landscape.

2605.21028 2026-06-18 cs.CV cs.AI 版本更新

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

DySink：动态帧 sinks 用于自回归长视频生成

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Key Lab. of Computer Network and Information Integration, Southeast University（东南大学计算机网络与信息集成重点实验室）； Zhongguancun Academy（中关村学院）； Zhongguancun Institute of Artificial Intelligence（中关村人工智能研究院）； Institute of Automation, CAS（中国科学院自动化研究所）

AI总结本文提出 DySink，一种基于检索的框架，通过维护紧凑的记忆银行并选择视觉相关的历史帧作为动态帧 sinks，以提高自回归长视频生成的动态性和时间质量。

AI中文摘要

自回归长视频生成通常采用有界内存流以提高效率，通常结合局部窗口实现短期连续性与静态早期帧 sinks 作为长程锚点。然而，这种固定分配在当前视觉状态与早期帧大幅偏离时仍会缓存早期帧，而丢弃可能更相关的中间历史。结果，保留的长程上下文可能变得不适应，并使生成偏向过时的线索；在严重情况下，RoPE 引起的相位再对齐会使头间注意力同质化并导致 sink 崩溃，即内容回归到 sink 帧。我们提出 DySink，一种基于检索的框架，维护紧凑的记忆银行并选择视觉相关的历史帧作为动态帧 sinks。DySink 将自适应检索与 sink 异常门相结合，后者检测检索上下文中的过度头间共识并抑制易崩溃的上下文。在分钟级视频上的实验表明，DySink 在时间质量方面一致优于强基线，同时实现了更高的动态度，从而实现连贯且更自然的长时程视觉演化。代码和模型权重已在 https://github.com/yebo0216best/DySink 上发布。

英文摘要

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.

2606.13768 2026-06-18 cs.CV cs.AI 版本更新

CineOrchestra: Unified Entity-Centric Conditioning for Cinematic Video Generation

CineOrchestra：面向电影视频生成的统一实体中心条件控制

Sharath Girish, Tsai-Shien Chen, Zhikang Dong, Mukesh Singhal, Hao Chen, Sergey Tulyakov, Aliaksandr Siarohin

发表机构 * Snap Inc.（Snap公司）； UC Merced（加州大学默塞德分校）

AI总结提出CineOrchestra，一种统一控制主体、事件、相机和镜头切换的视频扩散模型，通过实体中心条件原语和参数无关的旋转位置编码实现多轴联合控制，在密集描述跟随和镜头切换时序上超越六种专用方法。

Comments Project page: https://snap-research.github.io/CineOrchestra

AI中文摘要

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

发表机构 * University at Albany（奥尔巴尼大学）

AI总结提出CAOA方法，结合语义感知点云补全和对称感知相对位姿估计，在Scan2CAD上实现17%精度提升，并发布S2C-Completion数据集。

Comments GitHub: https://github.com/MinhasKamal/CAOA

DOI: 10.1109/3DV69130.2026.00047
Journal ref: Thirteenth International Conference on 3D Vision (3DV), 2026

AI中文摘要

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度（DoF）位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐（CAOA），该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合，实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估，往往难以泛化到真实扫描。为弥合这一差距，我们引入了一种针对室内场景的合成数据生成策略，通过与广泛使用的补全数据集进行定量比较，验证了其显著减小合成到真实领域差距的效果。此外，我们发布了S2C-Completion，一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集，用于真实室内单物体补全，并作为该任务的新基准。对于物体-CAD对齐，我们通过对称感知损失融入对称信息，提高了对对称模糊的鲁棒性。在Scan2CAD基准上，CAOA相比最先进方法实现了17%的精度提升。

英文摘要

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.

2606.18634 2026-06-18 cs.RO cs.AI 交叉投稿

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

EffiNav: 融合深度与视觉语言实现高效物体目标导航

Zecheng Yin, Benedict Jun Ma

发表机构 * Systems Hub of Intelligence Transportation HKUST(GZ)（香港科技大学（广州）智能交通系统中心）

AI总结提出EffiNav框架，融合深度信息与视觉语言模型，通过预测探索边界和语义先验指导导航，在HM3D和OVON数据集上匹配或超越基线，提升路径效率与泛化性。

AI中文摘要

在未知环境中定位目标物体是自主智能体的基本能力，应用范围从搜索救援到野外机器人。该任务的简化版本是物体目标导航（ObjNav）。在ObjNav中，成功到达目标物体提供了基本的性能度量；然而，导航轨迹的效率同样重要，因为它指示了智能体探索的智能程度以及后续任务剩余的时间。在未知环境中，高效导航的关键在于决定下一步探索的位置。尽管许多先前工作旨在解决这一核心挑战并在某些场景中取得了有希望的性能，但最近的基于训练的模型和非训练框架分别仍存在泛化性和效率问题，在最坏情况下可能导致对已访问区域的过度探索或冗余的来回运动。我们在两个广泛使用的仿真基准Habitat Matterport 3D（HM3D）和开放词汇物体目标导航（OVON）上评估EffiNav，并在真实世界的物理机器人上进一步验证其有效性。我们对大量仿真回合进行了失败分析。通过最小修改，我们还将EffiNav扩展到GOAT-BENCH数据集上的记忆增强ObjNav任务，展示了其在标准ObjNav设置之外的适应性。在两个标准指标——成功率（SR）和路径长度加权成功率（SPL）上，EffiNav匹配或超越了最近的基线，反映了其效率、鲁棒性和实际适用性。认识到两个数据集的不同侧重点，性能表明该框架在高效ObjNav中更加平衡和可泛化。

英文摘要

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works have aimed to address this core challenge and achieved promising performance in certain settings, recent training-based models and non-training frameworks still suffer from generalization and efficiency issues respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. Across two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, the results suggest that this framework is more balanced and generalizable for efficient ObjNav.
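
The two metrics named in the abstract can be made concrete with a minimal sketch. It assumes the commonly used definition of SPL, in which each successful episode contributes the ratio of the shortest-path length to the longer of the actual and shortest path lengths; the function names and the sample episode values below are hypothetical and are not taken from the EffiNav paper or its code.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for s in successes if s) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over all episodes.

    Assumes every shortest-path length is strictly positive.
    """
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            # A path can never score above 1, even if `taken` is
            # (numerically) shorter than the reference shortest path.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Hypothetical episodes: optimal success, success with a detour, failure.
episode_success = [True, True, False]
episode_shortest = [10.0, 10.0, 10.0]
episode_taken = [10.0, 20.0, 15.0]

sr_value = success_rate(episode_success)
spl_value = spl(episode_success, episode_shortest, episode_taken)
```

In this toy case the second episode succeeds but takes twice the shortest path, so it counts fully toward SR and only half toward SPL, which is why SPL is read as a path-efficiency measure.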

2606.18664 2026-06-18 cs.SD cs.AI 交叉投稿

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

NeuralMUSIC: 一种用于机器人声源定位的混合神经-子空间框架

Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie

发表机构 * School of Electrical and Electronic Engineering, Nanyang Technological University（南洋理工大学电气与电子工程学院）

AI总结提出NeuralMUSIC混合框架，结合神经网络估计空间协方差矩阵与经典MUSIC子空间方法，通过频率注意力融合和自监督学习提升机器人声源定位的鲁棒性和跨域泛化能力。

AI中文摘要

可靠的声源定位是机器人听觉的基础，使自主机器人能够感知空间线索并在动态环境中有效运行。经典方法如多信号分类（MUSIC）具有坚实的理论基础，但在低信噪比下性能下降。基于深度学习的方法虽然取得了有前景的性能，但通常难以在多种条件下泛化。为了解决这些挑战，我们提出了NeuralMUSIC，一种用于机器人声源定位的混合神经-子空间框架。具体来说，神经网络首先从多通道麦克风观测中估计空间协方差矩阵。然后将预测的协方差集成到经典的MUSIC流程中，包括特征值分解（EVD）和伪谱计算，随后通过频率注意力融合（FAF）模块产生最终的DOA估计。为了提高数据效率，我们进一步引入了一种自监督空间相关学习（SSCL）策略，利用未标记的声学数据来捕获空间结构。跨不同机器人任务的广泛实验表明，NeuralMUSIC在实现有竞争力的定位精度的同时，表现出更强的鲁棒性和跨域泛化能力。

英文摘要

Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.

2606.18698 2026-06-18 cs.RO cs.AI cs.LG 交叉投稿

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

利用能量特征进行基于深度学习的表面分类：三个独立数据集的比较分析

Alexander Belyaev, Oleg Kushnarev

AI总结研究评估能量特征作为表面分类的独立或辅助模态的可行性，在三个数据集上比较多种深度学习架构，发现CNN性能最优，纯能量特征准确率85-90%，与惯性特征结合可达96-99%，且能量特征可稳定提升1-2%准确率。

AI中文摘要

基于能量的方法在移动机器人表面分类中仍是一个相对未被充分研究的途径，尽管在受限环境中取得了有希望的结果。本研究评估了使用能量衍生特征作为独立分类模态或作为惯性数据补充输入的可行性。在三个公开数据集上进行了全面评估，比较了现代深度学习架构（包括循环神经网络、卷积神经网络、仅编码器Transformer和Mamba状态空间模型）在自动超参数调整和输入序列长度优化下的性能。模型在所有评估数据集上均实现了比先前报道值更高的准确率，其中卷积神经网络取得了最高的整体性能。当仅依赖基于能量的特征时，模型分类准确率在85-90%范围内，比与惯性特征结合时（96-99%）低约5-10%。用能量特征增强惯性数据导致平均准确率持续提高1-2%。这些发现表明，仅依赖能量特征的分类器为独立部署提供了足够的准确性，同时在与其它感知模态结合使用时也提供了一致的增益。

英文摘要

The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

2606.18747 2026-06-18 cs.RO cs.AI 交叉投稿

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过基于人类反馈的迭代强化学习利用大语言模型生成自然且富有表现力的机器人手势

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

发表机构 * University of New South Wales（新南威尔士大学）； Universidad Central de Chile（智利中央大学）

AI总结针对社交机器人手势生成僵硬问题，提出将ChatGPT集成到Pepper机器人中生成共语手势，并引入基于人类反馈的迭代强化学习（RLHF）优化手势，实验表明RLHF提升了手势的表现力、相关性和流畅性。

Comments 8 Pages, 6 Figures

AI中文摘要

富有表现力的手势对于自然有效的沟通至关重要，当仅靠语言线索不足时（例如，指向），手势可以补充言语。对于像Pepper这样的人形社交机器人，产生自然且富有表现力的动作对于改善人机交互（HRI）和长期接受度至关重要。然而，由于依赖专家编写的动画，生成手势仍然具有挑战性，导致行为僵硬，难以适应动态和多样化的环境。或者，机器学习方法通常难以捕捉感知的自然性，随着自由度的增加而变得更加困难。因此，产生富有表现力的机器人手势需要一个能够适应环境同时遵守社会规范和物理约束的系统。大语言模型（LLMs）的最新进展使得动态代码生成成为可能，为从自然语言实时合成手势提供了新的机会。在本文中，我们将ChatGPT集成到人形机器人Pepper中，以生成与对话输出一致的共语手势。虽然这一基线实现了灵活的手势生成，但生成的动作通常被认为僵硬且不自然。为了解决这一限制，我们引入了一种基于人类反馈的迭代强化学习（RLHF）系统，该系统根据用户评估微调手势生成，并利用迭代用户研究比较Pepper生成的手势。我们的结果表明，RLHF改进了LLM的共语生成能力，产生了更富有表现力、相关且流畅的动作。

英文摘要

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.18828 2026-06-18 cs.RO cs.AI 交叉投稿

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

空间即智能：用于黎曼度量生成的神经半群叠加

Chenghao Xu

发表机构 * National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University（湖南大学机器人视觉感知与控制技术国家工程研究中心）

AI总结提出将智能置于空间本身，通过神经半群叠加机制生成黎曼度量，使动作简化为测地线跟随，在单障碍场景训练后零样本泛化到未见配置。

AI中文摘要

传统方法将智能置于智能体中，无论是作为学习策略还是搜索过程。我们则将智能置于空间本身：场景在构型流形上诱导一个黎曼度量，动作简化为跟随该度量的测地线，而无需调用单独的规划器或碰撞检查器。一个单一的编码器-路由器网络通过三个互补的参数组实现这一思想——框架参数（定向生成器）、调制参数（控制空间传播）和基本系数（决定强度）。这些组通过共享的半群叠加机制组合，产生单个黎曼度量场，形成一种紧凑的架构，其几何复杂度自然随场景复杂度扩展。在单个双障碍场景上训练后，该模型在未见过的障碍配置上展现出鲁棒的零样本泛化能力，无碰撞路径成本与障碍穿透路径成本相差数个数量级。

英文摘要

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.

2606.18836 2026-06-18 cs.HC cs.AI 交叉投稿

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

通过先前协作的片段记忆改善城市搜索与救援中的人机团队合作

Taewoon Kim, Emma van Zoelen, Mark Neerincx

发表机构 * HumemAI, The Netherlands（荷兰HumemAI）； Vrije Universiteit Amsterdam, The Netherlands（荷兰阿姆斯特丹自由大学）； TNO, The Netherlands（荷兰TNO）

AI总结提出利用知识图谱片段记忆存储历史协作模式，通过图表示学习选择代表性记忆初始化机器人，在MATRX USAR环境中将救援成功率从25.7%提升至41.3%，任务时间减少283秒。

AI中文摘要

有效的人机团队合作要求机器人从交互开始就适应伙伴、情境和任务动态。在MATRX城市搜索与救援（USAR）环境中，人们可以通过聊天和反思界面将他们在团队合作中发现的协作模式（CPs）外部化。我们研究机器人是否可以利用这种先前的团队经验，在未来的交互中成为更好的队友。为此，我们将历史CPs表示为知识图谱片段记忆，并使用具有节点分类目标的图表示学习来识别一个代表性且有效的记忆以供重用。然后，在新的协作片段开始之前，我们用该记忆初始化机器人。在20名参与者和160轮次观察中，用单个自动选择的先前CP初始化机器人将救援成功率从25.7%提高到41.3%，并将平均任务时间减少283秒。最强的提升出现在交互开始时，表明可重用的片段记忆可以帮助机器人以更有效的任务知识进入协作，并支持更顺畅的早期团队合作。

英文摘要

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

2606.18861 2026-06-18 cs.CV cs.AI 交叉投稿

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

基于可微关节推理与能量一致性验证的RGB-D序列URDF合成

Xinze Zhang

发表机构 * University of Southern California（南加州大学）

AI总结提出KinemaForge管道，通过可微关节推理和能量一致性验证，从RGB-D序列联合估计部件形状、关节拓扑和参数，显著降低关节轴误差和仿真漂移。

AI中文摘要

从传感器观测重建可仿真的铰接物体数字孪生仍受两个持续存在的差距制约：(i) 部件级几何重建与运动学参数估计分离，(ii) 恢复的模型常违反能量守恒等基本动态不变量，导致URDF在物理仿真器中重放时出现漂移。我们提出KinemaForge，一种约束驱动管道，从短RGB-D序列联合推断部件级形状、关节拓扑和关节参数，并通过基于可微刚体动力学构建的能量一致性验证器验证结果。该管道引入三个组件：将关节-部件关联编码为软边的运动学约束图；通过Featherstone铰接体算法从渲染观测反向传播到关节参数的可微螺旋轴求解器；以及惩罚重建模型非物理自由响应的能量残差损失。在五个PartNet-Mobility类别和一个内部RGB-D基准上，KinemaForge将平均关节轴误差从最强几何基线(PARIS)的4.52度降至2.83度(-37.4%)，从基于交互的Ditto基线的5.30度降至2.83度(-46.6%)，在50秒滚动中长时仿真漂移比PARIS降低64%，初步评估中闭环操作成功率比Ditto提高14.6个百分点。代码和重建数据将在接收后发布。

英文摘要

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
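
The percentages quoted for joint-axis error follow directly from the reported degree values. The short sketch below recomputes them as relative changes against each baseline; the helper name is hypothetical, and only the figures stated in the abstract are used.

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Average joint-axis errors in degrees, as reported in the abstract.
paris_error = 4.52
ditto_error = 5.30
kinemaforge_error = 2.83

vs_paris = round(relative_change_percent(paris_error, kinemaforge_error), 1)
vs_ditto = round(relative_change_percent(ditto_error, kinemaforge_error), 1)
```

Both values agree with the abstract's -37.4% (vs. PARIS) and -46.6% (vs. Ditto). The 14.6-point gain in manipulation success is stated in percentage points, which is an absolute difference and not a relative change of this kind.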

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY 交叉投稿

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

用于自主海上无人机飞行的深度单目位姿估计的硬件与视觉在环验证

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

发表机构 * George Washington University（乔治华盛顿大学）

AI总结提出硬件验证的视觉在环框架，结合深度变换器单目位姿估计器和延迟卡尔曼滤波器，在模拟逼真海上环境中实现自主室内飞行，验证了感知延迟等嵌入式效应。

Comments 6 pages, 9 figures

AI中文摘要

船舶上的自主无人机操作需要可靠的基于视觉的相对位姿估计，然而海上验证成本高、依赖天气且风险大。本文提出一个硬件验证的视觉在环框架，能够在模拟逼真海上环境的同时实现完全自主的室内飞行。渲染的海上视图由板载的基于深度变换器的单目位姿估计器处理。延迟的视觉测量与高频率IMU数据通过延迟卡尔曼滤波器融合，为几何控制提供一致的状态估计。该系统捕捉了纯仿真中缺失的关键嵌入式效应，包括感知延迟、异步更新和计算约束。自主起飞、轨迹跟踪和着陆实验证明了稳定的闭环飞行。结果建立了一个安全且硬件真实的中间阶段，用于在船上部署之前开发海上无人机自主性。

英文摘要

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
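
To make the idea of fusing delayed vision measurements with high-rate IMU data concrete, here is a minimal one-dimensional sketch. It assumes a rewind-and-replay scheme, which is one common way to handle delayed measurements and not necessarily the filter used in the paper; the class `DelayedKF`, its noise values, and the scalar position model are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedKF:
    """Scalar Kalman filter that can absorb a measurement of a past step."""
    x: float = 0.0    # position estimate
    p: float = 1.0    # estimate variance
    q: float = 0.01   # process noise added per prediction step (assumed)
    r: float = 0.04   # vision measurement variance (assumed)
    dt: float = 0.01  # IMU period in seconds (assumed)
    step: int = 0
    states: dict = field(default_factory=dict)  # step -> (x, p)
    inputs: dict = field(default_factory=dict)  # step -> velocity applied

    def __post_init__(self):
        self.states[0] = (self.x, self.p)

    def predict(self, velocity: float) -> None:
        """High-rate propagation with an IMU-derived velocity."""
        self.inputs[self.step] = velocity
        self.x += velocity * self.dt
        self.p += self.q
        self.step += 1
        self.states[self.step] = (self.x, self.p)

    def update_delayed(self, z: float, stamp: int) -> None:
        """Apply a measurement taken at `stamp`, then replay the stored inputs."""
        x, p = self.states[stamp]
        gain = p / (p + self.r)
        x += gain * (z - x)
        p *= 1.0 - gain
        self.states[stamp] = (x, p)
        for s in range(stamp, self.step):  # re-propagate to the present
            x += self.inputs[s] * self.dt
            p += self.q
            self.states[s + 1] = (x, p)
        self.x, self.p = x, p


def truth(step: int) -> float:
    """True position: starts at 1.0 m and moves at 0.5 m/s."""
    return 1.0 + 0.5 * 0.01 * step


kf = DelayedKF()  # the filter wrongly believes the vehicle starts at 0.0 m
for _ in range(15):
    kf.predict(velocity=0.5)
error_before = abs(kf.x - truth(15))
kf.update_delayed(z=truth(10), stamp=10)  # result for step 10 arrives at step 15
error_after = abs(kf.x - truth(15))
```

The sketch keeps every past state, so a real embedded implementation would need to bound the history to the maximum expected perception latency.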

2606.02800 2026-06-18 cs.CV cs.AI cs.LG cs.MM cs.RO 版本更新

Cosmos 3: Omnimodal World Models for Physical AI

Cosmos 3：面向物理AI的全模态世界模型

NVIDIA, :, Aditi, Niket Agarwal, Arslan Ali, Jon Allen, Martin Antolini, Adeline Aubame, Alisson Azzolini, Junjie Bai, Maciej Bala, Yogesh Balaji, Josh Bapst, Aarti Basant, Mukesh Beladiya, Mohammad Qazim Bhat, Zaid Pervaiz Bhat, Dan Blick, Vanni Brighella, Han Cai, Tiffany Cai, Eric Cameracci, Jiaxin Cao, Yulong Cao, Mark Carlson, Carlos Casanova, Ting-Yun Chang, Yan Chang, Yu-Wei Chao, Prithvijit Chattopadhyay, Roshan Chaudhari, Chieh-Yun Chen, Junyu Chen, Ke Chen, Qizhi Chen, Wenkai Chen, Xiaotong Chen, Yu Chen, An-Chieh Cheng, Click Cheng, Xiu Chia, Jeana Choi, Chaeyeon Chung, Wenyan Cong, Yin Cui, Magdalena Dadela, Nalin Dadhich, Wenliang Dai, Joyjit Daw, Alperen Degirmenci, Rodrigo Vieira Del Monte, Robert Denomme, Sameer Dharur, Marco Di Lucca, Ke Ding, Wenhao Ding, Yifan Ding, Yuzhu Dong, Nicole Drumheller, Yilun Du, Aigul Dzhumamuratova, Aleksandr Efitorov, Hamid Eghbalzadeh, Naomi Eigbe, Imad El Hanafi, Hassan Eslami, Benedikt Falk, Jiaojiao Fan, Jim Fan, Amol Fasale, Sergiy Fefilatyev, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Vikram Fugro, Prashant Gaikwad, TJ Galda, Katelyn Gao, Yihuai Gao, Wenhang Ge, Sreyan Ghosh, Arushi Goel, Vivek Goel, Akash Gokul, Rama Govindaraju, Jinwei Gu, Miguel Guerrero, Elfie Guo, Aryaman Gupta, Siddharth Gururani, Hugo Hadfield, Song Han, Ankur Handa, Zekun Hao, Mohammad Harrim, Ali Hassani, Nathan Hayes-Roth, Yufan He, Chris Helvig, Cyrus Hogg, Madison Huang, Michael Huang, Sophia Huang, Yufan Huang, Jacob Huffman, DeLesley Hutchins, Suneel Indupuru, Boris Ivanovic, Arihant Jain, Joel Jang, Ryan Ji, Yanan Jian, Dongfu Jiang, Jingyi Jin, Atharva Joshi, Nikhilesh Joshi, Pranjali Joshi, Andy Ju, Jaehun Jung, Weiwei Kang, Scott Kassekert, Jan Kautz, Ashna Khetan, Julia Kiczka, Slawek Kierat, Gwanghyun Kim, Kuno Kim, Sunny Kim, Kezhi Kong, Xin Kong, Zhifeng Kong, Tomasz Kornuta, Egor Krivov, Hui Kuang, Saurav Kumar, Chia-Wen Kuo, George Kurian, Wojciech Kutak, JF Lafleche, Himangshu Lahkar, Omar Laymoun, Jayjun 
Lee, Sanggil Lee, Gabriele Leone, Boyi Li, Freya Li, Jiajun Li, Jinfeng Li, Ling Li, Pengcheng Li, Shangru Li, Tingle Li, Xiaolong Li, Xuan Li, Zhaoshuo Li, Zhiqi Li, Hao Liang, Maosheng Liao, Chen-Hsuan Lin, Tsung-Yi Lin, Ming-Yu Liu, Sifei Liu, Zihan Liu, Hai Loc Lu, Xiangyu Lu, Alice Luo, Ruipu Luo, Wenjie Luo, Jiangran Lyu, Martin Ding Ma, Nic Ma, Qianli Ma, Dawid Majchrowski, Louis Marcoux, Miguel Martin, Qing Miao, Ashkan Mirzaei, Shreyas Misra, Kaichun Mo, Durra Mohsin, Hyejin Moon, Pawel Morkisz, Saeid Motiian, Kirill Motkov, Seungjun Nah, Yashraj Narang, Deepak Narayanan, Thabang Ngazimbi, Julian Ouyang, Shubham Pachori, David Page, Yatian Pang, Sehwi Park, Mahesh Patekar, Mostofa Patwary, Marco Pavone, Trung Pham, Wei Ping, Soha Pouya, Shrimai Prabhumoye, Varun Praveen, Delin Qu, Hesam Rabeti, Morteza Ramezanali, Marilyn Reeb, Xuanchi Ren, Kristen Rumley, Wojciech Rymer, Jun Saito, Yeongho Seol, John Shao, Piyush Shekdar, Tianwei Shen, Humphrey Shi, Min Shi, Stella Shi, Kevin Shih, Mohammad Shoeybi, Mateusz Sieniawski, Shuran Song, Alexander Sotelo, Amir Sotoodeh, Sunil Srinivasa, Vignesh Srinivasakumar, Bartosz Stefaniak, Rahul Heinrich Steiger, Shangkun Sun, Jiaxiang Tang, Shitao Tang, Yangyang Tang, Yue Tang, Tolou Tavakkoli, Kayley Ting, Krzysztof Tomala, Wei-Cheng Tseng, Jibin Varghese, Sergei Vasilev, Thomas Volk, Raju Wagwani, Roger Waleffe, Andrew Z. 
Wang, Boxiang Wang, Haoxiang Wang, Qiao Wang, Shihao Wang, Shijie Wang, Ting-Chun Wang, Yan Wang, Yu Wang, Rohit Watve, David Wehr, Fangyin Wei, Xinshuo Weng, Jay Zhangjie Wu, Kedi Wu, Hongchi Xia, Summer Xiao, Tianjun Xiao, Kevin Xie, Daguang Xu, Jiashu Xu, Mengyao Xu, Ruqing Xu, Xingqian Xu, Yao Xu, Dinghao Yang, Dong Yang, Hans Yang, Xiaodong Yang, Xuning Yang, Yichu Yang, Yurong You, Zhiding Yu, Hao Yuan, Simon Yuen, Xiaohui Zeng, Pengcuo Zeren, Cindy Zha, Haotian Zhang, Jenny Zhang, Jing Zhang, Liangkai Zhang, Paris Zhang, Shun Zhang, Xuanmeng Zhang, Zhizheng Zhang, Ann Zhao, Yilin Zhao, Yuliya Zhautouskaya, Charles Zhou, Fengzhe Zhou, Shilin Zhu, Yuke Zhu, Dima Zhylko, Artur Zolkowski

发表机构 * NVIDIA

AI总结提出基于统一混合Transformer架构的全模态世界模型Cosmos 3，联合处理语言、图像、视频、音频和动作序列，在理解和生成任务上达到新最优，为具身智能体提供可扩展的通用骨干。

AI中文摘要

我们介绍了Cosmos 3，一个全模态世界模型家族，设计用于在统一的混合Transformer架构中联合处理和生成语言、图像、视频、音频和动作序列。通过支持高度灵活的输入输出配置，Cosmos 3无缝统一了物理AI的关键模态，有效地将视觉语言模型、视频生成器、世界模拟器和世界动作模型整合到一个框架中。我们的评估表明，Cosmos 3在一系列多样化的理解和生成任务中确立了新的最优水平，展示了全模态世界模型作为具身智能体可扩展、通用骨干的能力。我们的后训练Cosmos 3模型在技术报告撰写时被Artificial Analysis评为最佳开源文本到图像和图像到视频模型，并被RoboArena评为最佳策略模型。为了加速物理AI领域的开放研究和部署，我们在Linux基金会的OpenMDW-1.1许可证下提供我们的代码、模型检查点、策划的合成数据集和评估基准，网址为 https://github.com/nvidia/cosmos 和 https://huggingface.co/collections/nvidia/cosmos3。项目网站位于 https://research.nvidia.com/labs/cosmos-lab/cosmos3。

英文摘要

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3. The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3.

2606.18385 2026-06-18 cs.AI 新提交

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

CaVe-VLM-CoT：一种可解释的视觉-语言模型框架

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

发表机构 * Vector Institute（向量研究所）

AI总结提出CaVe-VLM-CoT框架，通过五阶段闭环流水线（提取器、检索器、求解器、引用注入器、验证器）实现证据推理，并引入CaVeScore复合指标评估检索质量、引用忠实度和跨模态基础，在ScienceQA和MMMU上取得性能提升。

AI中文摘要

视觉-语言模型（VLM）仍然容易产生幻觉，生成流畅但与视觉内容不符的输出。现有的思维链和检索增强方法仅部分解决了这一问题，因为它们既没有强制执行步骤级引用基础，也没有将验证失败路由回检索以进行纠正。我们提出了CaVe-VLM-CoT，一个模块化的基于反思的智能体RAG框架，通过五阶段闭环流水线强制执行有证据支撑的推理：提取器、检索器、求解器、引用注入器和验证器，其中检测到的无根据声明会触发结构化反馈给提取器以进行针对性重新检索。由于现有框架没有联合衡量检索质量、逐步引用忠实度和跨模态基础，我们提出了一套涵盖所有阶段的23个组件级指标，以CaVeScore为核心，这是一个对准确性、引用精确率和召回率、归因和证据基础进行加权的复合指标。无需任何架构或提示修改，CaVe-VLM-CoT在ScienceQA上达到87.1%的准确率和56.6%的CaVeScore，在MMMU（30个学科）上达到55.2%的准确率和35.7%的CaVeScore。

英文摘要

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

2606.18988 2026-06-18 cs.AI 新提交

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception: 一种用于可解释多模态欺骗检测的渐进式强化学习框架

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

发表机构 * Xi'an Jiaotong-Liverpool University（西交利物浦大学）

AI总结提出ThinkDeception框架，将多模态大语言模型引入欺骗检测，通过逐步推理和视觉-音频一致性组相对策略优化（VAC-GRPO）实现可解释的认知推理，在主流基准上达到新SOTA。

Comments 10 pages, 4 figures

AI中文摘要

多模态欺骗检测对于识别欺诈意图至关重要，然而现有方法主要依赖于端到端的黑箱范式。这些方法严重缺乏可解释性，无法提供透明的推理轨迹，也难以明确捕捉欺骗行为中固有的细微跨模态不一致性。为了超越这些限制，我们提出了ThinkDeception，一个新颖且可解释的多模态欺骗检测框架。作为开创性工作，它将多模态大语言模型（MLLMs）引入该领域，将欺骗检测从传统的二分类任务转变为显式的认知推理过程。借助首个精心标注的逐步多模态思维链（CoT）数据集，我们开发了基础模型ThinkDeception Base，实证验证了模态不一致性在解码欺骗中的关键作用。在此基础之上，我们的核心创新在于提出了配备渐进式训练策略的视觉-音频一致性组相对策略优化（VAC-GRPO）。与标准GRPO不同，我们将训练数据分为四个渐进难度等级，引导模型经历基于心理学的从易到难的认知转变。通过创新地将这一动态课程调度器与多维度的过程感知奖励机制及反思学习范式相结合，我们显著提升了模型的整体推理质量。在主流基准上的大量实验表明，ThinkDeception建立了新的SOTA，在检测准确性和推理质量上均显著优于现有方法。最终，这项工作成功地将欺骗检测领域推向可解释的多模态认知推理。

英文摘要

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain of Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.

2606.19168 2026-06-18 cs.AI cs.LG 新提交

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

超越安全数据：通过定期安全反思实现预训练阶段对齐

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University（清华大学交叉信息研究院）

AI总结提出安全反思预训练方法，在预训练语料中定期插入简短的安全反思，使模型具备自我监控能力，实验表明该方法能有效降低推理阶段攻击和微调攻击的成功率。

AI中文摘要

为了实现大型语言模型（LLMs）更深层次的安全对齐，最近的研究探讨了如何将安全干预措施提前到预训练阶段，主要通过过滤不安全数据或将其改写为更安全的形式。我们认为，预训练阶段的对齐不应止步于使数据安全：LLMs可能将看似良性的知识和能力组合成不安全的行为。为此，我们提出了安全反思预训练（Safety Reflection Pretraining），一种预训练阶段的对齐方法，该方法定期在预训练语料中插入简短的安全反思，将自我监控直接集成到语言建模中，建立一种基础能力，随后通过兼容的后训练加以强化。我们在FineWeb-Edu上预训练的1.7B模型上的实验表明，安全反思预训练提高了安全分类准确性，并显著降低了推理阶段攻击和微调攻击的成功率。除了真实世界实验，我们还引入了一个完全受控的合成环境MedSafetyWorld，其中包含清晰的安全定义和推理结构，模型在其中很容易从安全数据中泛化出不安全行为。在MedSafetyWorld中的消融实验进一步表明，与数据过滤和改写相比，安全反思预训练在防止模型做出从安全数据中泛化出的不安全行为方面具有明显优势。综合来看，我们的发现表明，预训练对齐不仅应使训练数据安全，还应塑造模型可能从安全数据中习得的行为。

英文摘要

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

2606.18258 2026-06-18 cs.HC cs.AI 交叉投稿

Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

审视LLM中的类人行为：模型行为、用户因素和系统提示的多维分析

Sunnie S. Y. Kim, Margit Bowler, Leon A Gatys

发表机构 * Apple（苹果公司）

AI总结通过21,000次对话的多维分析，发现LLM普遍表现出类人行为，但在不同模型和用户因素下存在差异；人类评估者认为，自我参照和关系建立行为出自LLM时不如出自人类时适当，而边界维护行为出自LLM时更为适当；系统提示可控制这些行为但需谨慎评估。

AI中文摘要

大型语言模型（LLM）展现出广泛的类人行为，从表达思想和情感，到与用户建立关系，再到拒绝请求和维持边界。尽管这些行为普遍存在，但研究者和实践者缺乏方法和实证见解，难以就LLM应在何时展现何种类型的类人行为做出明智决策。为填补这一空白，我们使用LLM-as-a-judge和人类评估，对这些行为的普遍性、潜在影响和可控性进行了多维分析。在来自四个广泛使用的模型（gpt-4o、gpt-4.1-mini、claude-sonnet-4.6、gemini-2.5-flash）的21,000次多轮对话中，我们发现类人行为普遍存在，但在不同模型和用户因素（对话目标和用户画像）间存在差异。在感知适当性方面，人类评估者认为，自我参照和关系建立行为出自LLM时不如出自人类时适当，而边界维护行为出自LLM时比出自人类时更适当。最后，我们表明系统提示可以控制这些行为，但需要仔细评估以避免意外效果。我们讨论了研究结果的含义，并为负责任的LLM设计和评估提供了建议。

英文摘要

Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prevalence, researchers and practitioners lack methods and empirical insights to make informed decisions about when and what types of human-like behaviors LLMs should exhibit. To fill this gap, we present a multi-dimensional analysis of the prevalence, potential effects, and controllability of these behaviors using LLM-as-a-judge and human evaluation. Across 21,000 multi-turn conversations from four widely used models (gpt-4o, gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash), we find that human-like behaviors are pervasive but vary across models and user factors (conversation goals and user profiles). In terms of perceived appropriateness, human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate from LLMs than from humans. Finally, we show that system prompting can control these behaviors, though it requires careful evaluation to avoid unintended effects. We discuss the implications of our findings and provide recommendations for responsible LLM design and evaluation.

2606.18309 2026-06-18 cs.LG cs.AI 交叉投稿

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

SAGE: 保留感知的最终遗忘向量事后净化

Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

发表机构 * Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University（上海交通大学图像处理与模式识别研究所）

AI总结提出SAGE方法，通过事后净化最终更新向量，在不重新运行原始遗忘流程的情况下，缓解大语言模型遗忘与保留能力之间的权衡。

AI中文摘要

大语言模型（LLM）遗忘旨在移除不良知识或行为，同时保留已有能力。当前的遗忘方法都涉及遗忘与保留之间的权衡。我们发现，保留激活偏差也可用于量化遗忘方法对保留造成的损害，而无需考虑遗忘过程的具体实现。这使得我们能够通过事后方法恢复任何遗忘方法的保留性能。因此，我们提出一种互补的事后设置，在不重新运行原始遗忘流程的情况下净化最终更新向量。在该设置中，我们设计了SAGE（Spectral Activation-GEometry Sanitization，谱激活几何净化），一种对最终遗忘更新的源无关修正。SAGE从一个小型保留代理集收集真实模块输入，提取其主导激活几何结构，并以闭式求解一个源锚定的优化目标，该目标抑制与高能保留方向对齐的更新分量，同时保留源方法的遗忘载体。在多种遗忘方法、模型规模和基准测试中，SAGE持续缓解保留-遗忘权衡，表明对最终向量的事后净化是机器遗忘中一个实用且未被充分探索的方向。

英文摘要

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.
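
The idea of suppressing update components aligned with high-energy retained directions can be illustrated with a rank-1 toy example. The sketch below is our own simplification, not SAGE's actual objective: it assumes a single dominant direction `u` estimated by power iteration and the penalised least-squares problem min ||d' - d||^2 + lam * (d' . u)^2, whose closed-form solution is d' = d - lam / (1 + lam) * (d . u) * u. All names and data are hypothetical.

```python
import math


def dominant_direction(rows, iters=200):
    """Top eigenvector of X^T X by power iteration (rows of X are retain inputs)."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[i] * v[i] for i in range(dim)) for r in rows]  # X v
        w = [sum(proj[k] * rows[k][i] for k in range(len(rows))) for i in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sanitize(update, direction, lam):
    """Closed-form minimiser of ||d' - d||^2 + lam * (d' . u)^2 for unit-norm u."""
    shrink = lam / (1.0 + lam)
    along = dot(update, direction)
    return [d - shrink * along * u for d, u in zip(update, direction)]


# Made-up retain activations: almost all energy lies along the first axis.
retain_inputs = [[3.0, 0.1], [2.5, -0.2], [-3.2, 0.05], [2.8, 0.15]]
update = [1.0, 1.0]  # hypothetical unlearning update for one weight row

u = dominant_direction(retain_inputs)
clean = sanitize(update, u, lam=9.0)
```

With `lam=9.0` the component of the update along `u` shrinks by a factor of ten, while the component orthogonal to `u` is left untouched.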

2606.18310 2026-06-18 cs.CR cs.AI 交叉投稿

Code-Augur：通过规约推断的智能体漏洞检测

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

发表机构 * National University of Singapore（新加坡国立大学）

AI总结提出安全规约优先范式，通过显式化智能体假设并运行时反证，结合引导式模糊测试提升漏洞检测能力，在真实项目中比现有智能体检测更多漏洞。

AI中文摘要

智能体漏洞检测的出现已成为软件安全的分水岭。完全由自主LLM智能体进行的审计正在发现数字社会基础软件中的关键漏洞。许多漏洞多年来一直隐藏，直到现在才被AI智能体发现。然而，这些发现背后的推理仍然令人担忧地不透明且未经验证。当智能体认为某个函数安全时，它对函数输入做了哪些假设？推理失败和错误假设可能导致遗漏漏洞，并降低对智能体分析的信任。我们提出了一种安全规约优先范式，该范式（1）将智能体的隐性假设明确暴露为安全规约，并（2）通过运行时反证持续细化这些规约。我们在Code-Augur中实现了我们的方法，这是一种用于智能体漏洞检测的新型框架。给定一个代码库，Code-Augur分析系统的每个组件以查找存在漏洞的代码。当它认为某个组件安全时，它会将该判断背后的局部不变量作为源代码中的断言提交。同时，Code-Augur利用引导式模糊测试器尝试反证这些假设。当模糊测试器触发断言时，要么揭示一个真实漏洞，要么揭示一个需要细化的有缺陷规约。在这两种情况下，这一过程都夯实了智能体的理解，使其对代码意图的看法与代码实际行为保持一致。在真实世界的被测项目上，Code-Augur有效利用安全规约检测到比其他最先进智能体更多的漏洞。此外，Code-Augur在关键开源项目中发现了22个新漏洞。与精心策划的专用模型（如Claude Mythos）相比，Code-Augur提供了基于广泛可用的LLM（如Sonnet和DeepSeek）构建的有效智能体漏洞检测。

英文摘要

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
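
The loop described above, in which an assumption is committed as an in-source assertion and a fuzzer then tries to falsify it, can be sketched in a few lines. This is an illustrative toy, not Code-Augur itself: the functions `read_field` and `parse_record` are hypothetical, and the fuzzer is a plain random one, whereas the paper uses a guided fuzzer.

```python
import random


def read_field(buffer: bytes, offset: int) -> int:
    # Hypothetical agent-written specification: callers always pass an
    # offset that leaves room for a 2-byte little-endian read.
    assert 0 <= offset <= len(buffer) - 2, "spec: offset must leave 2 readable bytes"
    return buffer[offset] | (buffer[offset + 1] << 8)


def parse_record(packet: bytes) -> int:
    # The first byte is attacker-controlled and is used as an offset unchecked,
    # so the assumption recorded in read_field does not actually hold.
    return read_field(packet, packet[0])


def fuzz(target, trials=2000, seed=0):
    """Return the first random input that falsifies an in-source assertion."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randrange(1, 16)
        packet = bytes(rng.randrange(256) for _ in range(length))
        try:
            target(packet)
        except AssertionError:
            return packet
    return None


counterexample = fuzz(parse_record)
print(counterexample)
```

A triggered assertion then has to be triaged: either `parse_record` is missing a bounds check (a genuine vulnerability) or the recorded specification was too strict and needs refinement.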

2606.18832 2026-06-18 cs.LG cs.AI 交叉投稿

Target-confidence Recourse Using tSeTlin machines: TRUST

使用Tsetlin机器的目标置信度追索：TRUST

K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

发表机构 * Group Research and Development Det Norske Veritas (DNV)（挪威船级社（DNV）集团研发部）

AI总结提出TRUST框架，通过概率Tsetlin机器和贝叶斯优化直接搜索满足用户指定置信度目标的最小输入变化，生成更稳健和可解释的反事实解释。

AI中文摘要

反事实解释被广泛用于高风险决策系统中的算法追索。大多数现有方法寻求最小化改变输入以翻转模型决策。然而，决策者通常不仅依赖预测标签，还依赖置信度阈值和风险边际。刚好越过决策边界的反事实在噪声或模型变化下可能脆弱且不稳定。本文提出使用Tsetlin机器的目标置信度追索（TRUST），一种用户明确指定追索所需预测置信度的框架。TRUST不是先生成反事实再评估置信度，而是直接搜索满足用户定义置信度目标的最小变化，从而在成本、置信度和鲁棒性方面比较追索选项。我们使用概率Tsetlin机器（PTM）结合贝叶斯优化实例化TRUST。PTM基于概率子句的结构将预测置信度与决策规则的稳定性联系起来。我们表明，满足相同规则的反事实在可靠性上可能差异很大，取决于它们满足这些规则的安全程度，揭示了决策是由稳健还是脆弱的子句激活支持的。在合成和真实数据集上的实验表明，目标置信度反事实比传统的基于边界的方法产生更稳健和可解释的追索。在多个基准测试中，TRUST实现了完美的鲁棒性，同时保持较低的追索成本，包括在Haberman数据集上以0.92置信度达到0.10的L2距离。通过显式控制置信度和暴露规则级稳定性，TRUST为高风险决策支持提供了可操作的追索。

英文摘要

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.
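
The difference between a boundary-crossing counterfactual and a target-confidence one can be shown with a deliberately simple stand-in model. The sketch below swaps in a logistic model with a closed-form solution in place of the paper's Probabilistic Tsetlin Machine and Bayesian optimization; the weights, the input, and the function names are hypothetical, and the 0.92 target is borrowed from the abstract purely for illustration.

```python
import math


def confidence(w, b, x):
    """Positive-class probability of a logistic model (stand-in for the PTM)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def recourse(w, b, x, target):
    """Smallest L2 change to x that reaches `target` confidence under this model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    needed = math.log(target / (1.0 - target)) - z  # required increase in the logit
    if needed <= 0:
        return list(x)  # already confident enough
    norm_sq = sum(wi * wi for wi in w)
    # For a linear logit, the cheapest move is along the weight vector.
    return [xi + needed * wi / norm_sq for xi, wi in zip(x, w)]


w, b = [2.0, -1.0], -0.5  # made-up model parameters
x = [0.0, 1.0]            # rejected input, confidence about 0.18

at_boundary = recourse(w, b, x, target=0.5)   # conventional: barely flips the label
at_target = recourse(w, b, x, target=0.92)    # user-specified confidence target

cost_boundary = math.dist(x, at_boundary)
cost_target = math.dist(x, at_target)
```

The target-confidence counterfactual costs more than the boundary one, which is the trade-off between recourse cost and confidence that the abstract describes.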

2606.18996 2026-06-18 cs.CR cs.AI 交叉投稿

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP：任务完成与主动隐私提取抵抗基准

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

发表机构 * Dept. of Electrical Engineering, POSTECH（POSTECH电子工程系）； Grad. School of Artificial Intelligence, POSTECH（POSTECH人工智能研究生院）； School of Computing, KAIST（韩国科学技术院计算机学院）

AI总结提出TRAP基准，评估智能体在文档密集型任务中平衡任务准确性与隐私泄露的能力，发现所有模型均存在非平凡泄露，并证明基于提示的防御无法同时实现高任务成功率和零泄露概率，提出结构化的私有字段隔离方法。

AI中文摘要

通过梯度信号恢复揭示自编码器中的隐藏漏洞

Chethan Krishnamurthy Ramanaik, Arjun Roy, Tobias Callies, Eirini Ntoutsi

发表机构 * University of the Bundeswehr Munich（慕尼黑联邦国防军大学）

AI总结针对自编码器对抗攻击中梯度消失导致鲁棒性被高估的问题，提出GRILL框架恢复梯度信号，显著提升攻击效果，暴露隐藏漏洞。

AI中文摘要

深度自编码器（AE）的对抗鲁棒性受到的关注远少于判别模型，尽管其压缩的潜在表示会导致病态映射，从而放大小的输入扰动并破坏重建稳定性。现有的AE白盒攻击通过优化范数有界的对抗扰动以最大化重建损失，往往收敛到次优扰动，从而可能高估AE的鲁棒性。我们表明，这种限制与通过病态层反向传播时对抗损失梯度消失有关，这些病态层的中间权重矩阵具有接近零的奇异值。为了解决这个问题，我们提出了GRILL（病态层中的梯度信号恢复）框架，旨在减轻梯度退化并提高编码器-解码器架构中对抗鲁棒性评估的可靠性。GRILL旨在缓解优化过程中的对抗梯度退化，使攻击能够在固定范数约束下更好地逼近高失真扰动。通过在多种AE架构上的广泛实验，包括样本特定和通用攻击，以及标准和自适应攻击设置，我们表明GRILL显著提高了攻击有效性，从而暴露了现有攻击限制所隐藏的漏洞。除了AE之外，我们提供了初步证据表明现代多模态编码器-解码器架构也存在类似的漏洞。

英文摘要

Adversarial robustness of deep autoencoders (AEs) has received less attention than that of discriminative models, although their compressed latent representations induce ill-conditioned mappings that can amplify small input perturbations and destabilize reconstructions. Existing white-box attacks for AEs, which optimize norm-bounded adversarial perturbations to maximize reconstruction damage, often converge to suboptimal perturbations, thereby potentially overstating AE robustness. We show that this limitation is linked to vanishing adversarial loss gradients during backpropagation through ill-conditioned layers, associated with near-zero singular values in their intermediate weight matrices. To address this, we propose GRILL (Gradient Signal Restoration in Ill-Conditioned Layers), a framework designed to mitigate gradient degradation and improve the reliability of adversarial robustness evaluation in encoder-decoder architectures. GRILL is designed to mitigate adversarial gradient degradation during optimization, enabling attacks to better approximate high-distortion perturbations under fixed norm constraints. Through extensive experiments across multiple AE architectures, under both sample-specific and universal attacks, as well as standard and adaptive attack settings, we show that GRILL significantly increases attack effectiveness, thereby exposing vulnerabilities hidden by existing attack limitations. Beyond AEs, we provide preliminary evidence that modern multimodal encoder-decoder architectures exhibit similar vulnerabilities.

2505.16057 2026-06-18 cs.HC cs.AI cs.MM 版本更新

Signals of Provenance: Practices & Challenges of Navigating Indicators in AI-Generated Media for Sighted and Blind Individuals

来源信号：视障与明眼用户在AI生成媒体中导航指示器的实践与挑战

Ayae Ide, Tory Park, Jaron Mink, Tanusree Sharma

发表机构 * Pennsylvania State University（宾夕法尼亚州立大学）； Arizona State University（亚利桑那州立大学）

AI总结通过访谈28位视障与明眼用户，研究AI生成内容指示器的使用实践，发现基于内容和菜单的指示器各有优劣，视障用户因界面可访问性不足而面临更多挑战，并提出设计建议。

Comments error found in reporting of results

AI中文摘要

近年来，生成模型的进步和易用工具大幅降低了通过简单自然语言提示生成高度逼真音频、图像和视频的技术门槛，使得AI生成（AIG）内容日益普及。作为回应，平台正在采用可验证的来源机制，并推荐AIG内容进行自我披露和向用户发出信号。然而，这些指示器常常被忽略，尤其是当它们仅依赖视觉线索时，对具有不同感官能力的用户效果不佳。为弥补这一空白，我们进行了半结构化访谈（N=28），包括15名明眼和13名盲人或低视力（BLV）参与者，考察他们通过自我披露的AI指示器与AIG内容的互动。我们的发现揭示了多样化的心智模型和实践，突出了基于内容（如标题、描述）和菜单辅助（如AI标签）指示器的不同优缺点。明眼参与者利用视觉和音频线索，而BLV参与者主要依赖音频和现有的辅助工具，限制了其识别AIG的能力。两组参与者都经常忽略平台部署的菜单辅助指示器，而更倾向于与基于内容的指示器（如标题和评论）互动。我们发现了由于指示器位置不一致、元数据不清晰和认知过载导致的可用性挑战。这些问题对BLV个体尤为关键，因为界面元素的可访问性不足。我们为未来AIG指示器的多个维度提供了实用建议和设计启示。

英文摘要

AI-Generated (AIG) content has become increasingly widespread as recent advances in generative models and easy-to-use tools have significantly lowered the technical barriers to producing highly realistic audio, images, and videos from simple natural language prompts. In response, platforms are adopting provable provenance and recommending that AIG content be self-disclosed and signaled to users. However, these indicators are often missed, especially when they rely solely on visual cues, which makes them ineffective for users with different sensory abilities. To address this gap, we conducted semi-structured interviews (N=28) with 15 sighted and 13 blind or low-vision (BLV) participants to examine their interaction with AIG content through self-disclosed AI indicators. Our findings reveal diverse mental models and practices, highlighting different strengths and weaknesses of content-based (e.g., title, description) and menu-aided (e.g., AI labels) indicators. While sighted participants leveraged visual and audio cues, BLV participants primarily relied on audio and existing assistive tools, limiting their ability to identify AIG content. Participants in both groups frequently overlooked menu-aided indicators deployed by platforms and instead interacted with content-based indicators such as titles and comments. We uncovered usability challenges stemming from inconsistent indicator placement, unclear metadata, and cognitive overload. These issues were especially critical for BLV individuals due to the insufficient accessibility of interface elements. We provide practical recommendations and design implications for future AIG indicators across several dimensions.

2507.04219 2026-06-18 cs.LG cs.AI 版本更新

Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs

模型崩溃不是错误，而是大语言模型机器遗忘中的一种特性

Yan Scholten, Sophie Xhonneux, Leo Schwinn, Stephan Günnemann

发表机构 * Dept. of Computer Science & Munich Data Science Institute, Technical University of Munich（慕尼黑工业大学计算机科学系及慕尼黑数据科学研究所）； Mila, Université de Montréal（蒙特利尔大学Mila）

AI总结提出部分模型崩溃（PMC）方法，通过故意触发模型在目标数据上的分布崩溃实现遗忘，无需在遗忘目标上优化，有效移除私有信息并保持模型效用。

Comments Accepted at ICLR 2026

AI中文摘要

当前大语言模型的遗忘方法将待移除的私有信息纳入微调数据，并直接在这些信息上进行优化。我们认为这不仅可能强化对敏感数据的暴露，而且从根本上违背了尽量减少使用此类数据的原则。作为补救，我们提出了一种新颖的遗忘方法，即部分模型崩溃（PMC），该方法的遗忘优化目标中无需包含待遗忘的数据。我们的方法受到最近观察的启发：用生成模型自身的生成结果训练该模型会导致分布崩溃，从而有效移除模型输出中的信息。我们的核心见解是，可以在我们希望移除的数据上故意触发模型崩溃，从而将其用于机器遗忘。我们从理论上分析表明，该方法会收敛到期望结果，即模型遗忘待移除的数据。我们通过实验证明，PMC克服了现有显式在遗忘目标上优化的遗忘方法的四个关键限制，并在保持通用模型效用的同时更有效地从模型输出中移除私有信息。总体而言，我们的贡献代表了向更全面、更符合现实隐私约束的遗忘迈出的重要一步。代码可在 https://www.cs.cit.tum.de/daml/partial-model-collapse/ 获取。

英文摘要

Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their fine-tuning data. We argue this not only risks reinforcing exposure to sensitive data, but also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from model outputs. Our central insight is that model collapse can be leveraged for machine unlearning by deliberately triggering it for data we aim to remove. We show theoretically that our approach converges to the desired outcome, i.e., the model unlearns the data targeted for removal. We empirically demonstrate that PMC overcomes four key limitations of existing unlearning methods that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs while preserving general model utility. Overall, our contributions represent an important step toward more comprehensive unlearning that better aligns with real-world privacy constraints. Code is available at https://www.cs.cit.tum.de/daml/partial-model-collapse/.

2508.03483 2026-06-18 cs.CV cs.AI 版本更新

When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

当汽车有刻板印象：审计文本到图像模型中对象的群体偏见

Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng

发表机构 * AIM Intelligence（AIM智能研究院）； Yonsei University（延世大学）

AI总结提出SODA框架，通过三个指标系统测量文本到图像模型在生成对象中的群体偏见，发现中性提示隐含偏向中年和白人，且人口统计线索导致高度偏斜的刻板输出。

AI中文摘要

虽然先前关于文本到图像生成的研究主要集中在人类描绘中的偏见，但生成对象中的群体偏见仍然相对未被充分探索。我们引入了SODA（刻板对象诊断审计），这是一个新颖的框架，通过自动属性发现和三个标准化指标系统地测量这些偏见：基础与群体差异（BDS）、跨群体差异（CDS）和视觉属性集中度（VAC）。将SODA应用于五个最先进模型和八个对象类别（例如汽车）的8000张图像，我们发现“中性”提示产生的输出在视觉上最接近中年和白人，表明这些群体在模型默认设置中被隐含地过度代表。此外，人口统计线索触发了高度偏斜的刻板输出：26.6%的对象-模型-群体组合产生的结果中，所有20张生成图像共享完全相同的属性值（例如，为女性生成玫瑰金笔记本电脑）。最后，提示级别的去偏减少了群体间差异，但矛盾地压缩了群体内多样性，用一种刻板印象取代了另一种。SODA提供了一个实用的流程，使这些隐含关联变得可测量，作为迈向更负责任的人工智能发展的一步。

英文摘要

While prior research on text-to-image generation has predominantly focused on biases in human depictions, demographic bias in generated objects remains relatively underexplored. We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring these biases through automated attribute discovery and three standardized metrics: Base vs. Demographic Divergence (BDS), Cross-Demographic Disparity (CDS), and Visual Attribute Concentration (VAC). Applying SODA to 8,000 images across five state-of-the-art models and eight object categories (e.g., cars), we find that "neutral" prompts produce outputs most visually similar to middle-aged and White people, suggesting these groups are implicitly over-represented in model defaults. Furthermore, demographic cues trigger highly skewed stereotypical outputs: 26.6% of object-model-demographic combinations produce results where all 20 generated images share the exact same attribute value (e.g., rose gold laptops for women). Finally, prompt-level debiasing reduces inter-group disparity but paradoxically collapses within-group diversity, replacing one stereotype with another. SODA offers a practical pipeline for making these implicit associations measurable, serving as a step toward more responsible AI development.
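
The concentration finding above (every one of the 20 images in a combination sharing one attribute value) can be illustrated with a small sketch. This is not the paper's VAC definition, which the abstract does not spell out; the function names, the top-value-share measure, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def top_value_share(values):
    """Fraction of images that carry the single most common attribute value."""
    if not values:
        raise ValueError("need at least one generated image")
    return Counter(values).most_common(1)[0][1] / len(values)


def fully_collapsed_rate(combos):
    """Share of combinations in which every image has the same attribute value."""
    collapsed = sum(1 for values in combos.values() if top_value_share(values) == 1.0)
    return collapsed / len(combos)


# Hypothetical toy data: (object, model, demographic) -> attribute value per image.
combos = {
    ("laptop", "model_a", "women"): ["rose gold"] * 20,
    ("laptop", "model_a", "men"): ["black"] * 12 + ["silver"] * 8,
    ("car", "model_a", "young"): ["red"] * 10 + ["blue"] * 10,
}

print(f"fully collapsed combinations: {fully_collapsed_rate(combos):.1%}")
```

A statistic of this kind only flags total collapse; a graded measure such as the top-value share itself would also show combinations that are heavily skewed without being uniform.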

2511.20002 2026-06-18 cs.CV cs.AI cs.CR 版本更新

Semantic Router: On the Feasibility of Hijacking MLLMs via a Single Adversarial Perturbation

语义路由器：通过单一对抗扰动劫持多模态大语言模型的可行性研究

Changyue Li, Jiaying Li, Youliang Yuan, Jiaming He, Zhicong Huang, Pinjia He

发表机构 * The Chinese University of Hong Kong, Shenzhen, China（香港中文大学（深圳））； School of Data Science, School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen, China（数据科学学院、人工智能学院、香港中文大学（深圳））

AI总结提出语义感知通用扰动（SAUP），作为语义路由器同时劫持多个无状态决策，通过理论分析和SORT优化策略实现，在Qwen上对五个目标达到66%攻击成功率。

Comments Accepted to ICML 2026

AI中文摘要

多模态大语言模型（MLLMs）越来越多地部署在无状态系统中，例如自动驾驶和机器人技术。本文研究了一种新型威胁：语义感知劫持。我们探索了使用单一通用扰动同时劫持多个无状态决策的可行性。我们引入了语义感知通用扰动（SAUP），它充当语义路由器，“主动”感知输入语义并将其路由到不同的、攻击者定义的目标。为了实现这一点，我们对潜在空间中的几何特性进行了理论和实证分析。在这些见解的指导下，我们提出了语义导向（SORT）优化策略，并标注了一个具有细粒度语义的新数据集以评估性能。在三个代表性MLLM上的大量实验证明了这种攻击的基本可行性，在针对Qwen的五个目标上使用单帧实现了66%的攻击成功率。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly deployed in stateless systems, such as autonomous driving and robotics. This paper investigates a novel threat: Semantic-Aware Hijacking. We explore the feasibility of hijacking multiple stateless decisions simultaneously using a single universal perturbation. We introduce the Semantic-Aware Universal Perturbation (SAUP), which acts as a semantic router, "actively" perceiving input semantics and routing them to distinct, attacker-defined targets. To achieve this, we conduct theoretical and empirical analysis on the geometric properties in the latent space. Guided by these insights, we propose the Semantic-Oriented (SORT) optimization strategy and annotate a new dataset with fine-grained semantics to evaluate performance. Extensive experiments on three representative MLLMs demonstrate the fundamental feasibility of this attack, achieving a 66% attack success rate over five targets using a single frame against Qwen.

2604.23130 2026-06-18 cs.CL cs.AI 版本更新

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

从概念对齐的Token到脆弱特征：越狱的机制定位

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

发表机构 * UMBC（马里兰大学巴尔的摩县分校）； Apple（苹果公司）

AI总结提出一种基于Token的机制流水线，通过稀疏自编码器特征子组定位越狱漏洞，发现单个有害Token足以定位脆弱特征，且这些特征集中在中后期层。

AI中文摘要

越狱攻击揭示了安全对齐的大语言模型中一种持续的失败模式：模型可以被推向有害行为，但促成这种转变的内部表示仍未被很好地定位。最近的机制安全性研究通常通过广泛的表示对象来解释这种行为，包括全局拒绝方向、激活引导向量和与拒绝相关的SAE特征。我们转而询问越狱脆弱性是否可以追溯到更细粒度的、基于提示的SAE特征子组。我们引入了一个基于Token的机制流水线，将Gemma-2-2B的残差流分解为稀疏自编码器（SAE）特征，并识别与不安全行为相关的特征子组。使用BeaverTails中的单类别不安全示例以减少跨类别干扰，我们从对抗性响应中提取有害概念，并通过子空间相似性将其与概念相关的提示Token对齐。然后，我们应用三种特征分组策略：基于聚类的、层次链接的和单Token驱动的，以识别所有26层中的SAE特征子组。最后，我们放大每个子组中的顶级特征，并使用标准的有害性评判器评估生成的输出。单Token驱动的分组实现了与完整基于聚类的分组相当的有害性，表明单个有害提示Token足以定位与脆弱性相关的SAE特征子组，而无需依赖更广泛的聚类级聚合。这些子组出现在早期和中后期层，且更集中在中后期层，其中目标引导暴露了特定的模型脆弱性。总体而言，我们的结果表明越狱敏感性可以追溯到稀疏的、基于Token定位的SAE特征子组，补充了先前基于广泛对抗、拒绝或引导方向的解释。

英文摘要

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.

2605.26903 2026-06-18 cs.CR cs.AI 版本更新

Practical Anonymous Two-Party Gradient Boosting Decision Tree

实用的匿名两方梯度提升决策树

Chenyu Huang, Fan Zhang, Minxin Du, Sherman S. M. Chow, Huangxun Chen, Huaming Rao, Danqing Huang, Bo Qian, Peng Chen

发表机构 * Tencent（腾讯）； Hong Kong Polytechnic University（香港理工大学）； Chinese University of Hong Kong（香港中文大学）； HKUST-GZ

AI总结针对两方垂直分割数据上的梯度提升决策树训练，提出一种基于双电路隐私集合求交和遗忘可编程伪随机函数的匿名协议，在隐藏记录标识符的同时保持效率。

Comments 19 pages; 2026 IEEE Symposium on Security and Privacy (SP)

DOI: 10.1109/SP63933.2026.00084
Journal ref: 2026 IEEE Symposium on Security and Privacy (SP)

AI中文摘要

梯度提升决策树（GBDT）擅长处理结构化数据，通常在互不信任的各方之间垂直分割的特征上进行训练。高速和可解释性使得GBDT在金融和医疗领域广受欢迎，而神经网络在这些领域可能表现不佳。为GBDT启用安全计算带来了独特的挑战，需要安全的记录对齐以进行比较。依赖隐私集合求交（PSI）是事实上的通行做法。然而，把PSI当作安全措施，实际上会暴露两个数据集共有哪些记录标识符（ID）。尽管电路PSI可以提供帮助，但作为通用手段成本高昂。要在“黑暗森林”中高效训练，需要新的思路。为了隐藏ID，我们开启了对两方持有的分割数据上的匿名GBDT训练的研究。我们设计中的双电路PSI让双方交替作为接收方，对本地特征执行“先选取后求和”。通过遗忘可编程伪随机函数，我们将电路PSI的输出作为共享状态在多次运行之间传递。通过避免全局对齐，我们解决了一个被忽视的困境：隐藏ID会带来与域大小成比例的成本。接下来，我们将密文打包的成本减半；在此前的安全GBDT（USENIX Security '23）及相关安全机器学习计算中，密文打包用于从基于（环）带错误学习的密文转换得到单指令多数据同态加密。对比实验表明，我们的协议在效率上与存在泄漏的方法相比仍具竞争力。我们的技术支持隐藏ID的聚合，可以扩展到其他垂直分割的分析场景。

英文摘要

Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling secure computation for GBDTs poses unique challenges, requiring secure record alignment for comparison. Relying on private set intersection (PSI) is a de facto approach. Mistaking PSI for a safety measure actually exposes which record identifiers (IDs) are shared between the datasets. Although circuit-PSI could help, it is costly for generic uses. New ideas are needed to efficiently train in a "dark forest". Aiming to hide the IDs, we initiate the study of anonymous GBDT training on split data held by two parties. Dual circuit-PSI in our design lets the parties alternate as receiver to run pick-then-sum over local features. Via oblivious programmable pseudorandom functions, we propagate circuit-PSI outputs as shared state across runs. Avoiding universal alignment, we resolve the neglected dilemma that ID hiding incurs a cost that scales with domain size. Next, we halve the cost of ciphertext packing used to convert single-instruction multiple-data homomorphic encryption from (ring) learning with errors in prior secure GBDT (USENIX Security '23) and related secure machine-learning computations. Comparative experiments show our protocol remains competitive with leaky approaches in efficiency. Enabling ID-hiding aggregation, our techniques can extend to other vertically partitioned analytics.

2606.07150 2026-06-18 cs.CR cs.AI cs.MA cs.NI 版本更新

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

从隐私到工作流完整性：自主智能体互操作性中的通信图元数据

Bijaya Dangol

发表机构 * Independent Researcher（独立研究者）

AI总结针对智能体通信图元数据泄露问题，提出工作流完整性威胁模型，定义传输层与引导层隐私属性，并通过A2A案例验证元数据保护可有效抑制任务推断。

Comments 22 pages, 7 figures, 6 tables

AI中文摘要

诸如A2A和MCP之类的智能体互操作性协议标准化了智能体之间的通信内容，但假设基于地址的HTTP(S)传输。此类传输保护消息内容，并越来越多地采用端到端加密。它们暴露在明文中的是通信图：哪个智能体联系哪个智能体、何时以及频率如何。在智能体系统中，该图比隐私框架所暗示的更具后果性。端点通常带有能力标签，工作流是结构化和链式的，交互与实际行动耦合，因此观察者恢复的不仅仅是过去的关系。它可以推断出待处理的工作流、正在组装的任务以及可能即将发生的行动。以机器速度，它可以在工作流完成之前根据该推断采取行动。因此，威胁是工作流完整性，而不仅仅是隐私：对自主行动的预测性杠杆。我们为智能体通信图提供了一个威胁模型；识别了使智能体元数据具有独特揭示性的因素（语义性、前瞻性、驱动性）；定义了传输层和引导层隐私属性，并评估了候选传输（SimpleX/SMP、Tor、混合网络）与这些属性的匹配程度；并提出了一个A2A案例研究，其中元数据保护绑定是可表达的，但揭示了协议的身份假设。我们在一个基于真实A2A捕获的生成模型上测试了这些。仅凭被动元数据，没有载荷，一个分类器从工作流的开头就能以远高于随机水平的概率恢复任务类别；应用这些属性后，该恢复被急剧拉回随机水平。除了观察者能恢复的内容外，我们衡量了利用泄露的杠杆：在工作流开头和固定预算下，选择对哪些工作流采取行动的对手在此模型中实现了大部分先知攻击者相对于元数据盲攻击者的优势，而相同的属性抑制了这一点。

英文摘要

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to actions, so an observer recovers more than past relationships: it can recognize a recurring pending workflow from its opening and, at machine speed, act on it before it completes. The threat is one of workflow integrity, not privacy alone. We give a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, give them an indistinguishability-game semantics, evaluate transports, and give an A2A case study where a metadata-protecting binding surfaces its implicit identity assumptions. On a corpus of real multi-agent A2A traffic from the official reference agents, on a live A2A binding, and with a generative model as a controlled instrument, a label-blind classifier recovers a task's class from passive metadata at 6x chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. Acting on the leak is distinct from recoverability: under a fixed budget an adversary captures 0.63 of a clairvoyant attacker's advantage on the corpus (0.41 from a workflow's opening), governed by top-ranked precision rather than overall accuracy, so integrity and privacy come apart under defense.

2606.18543 2026-06-18 cs.AI cs.CL cs.SE 新提交

SciRisk-Bench：面向AI4Science安全的风险维度感知基准

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

发表机构 * Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China（脑启发认知智能实验室，自动化研究所，中国科学院，北京，中国）； School of Future Technology, University of Chinese Academy of Sciences, China（未来技术学院，中国科学院大学，中国）； School of Artificial Intelligence, University of Chinese Academy of Sciences, China（人工智能学院，中国科学院大学，中国）； Zhongguancun Academy, China（中关村学院，中国）； Beijing Key Laboratory of Safe AI and Superalignment（北京安全人工智能与超对齐重点实验室）； Gaoling School of AI, Renmin University of China（高瓴人工智能学院，中国人民大学）； Beijing Institute of AI Safety and Governance (Beijing-AISI)（北京人工智能安全与治理研究院（北京-AISI））； School of Humanities, University of Chinese Academy of Sciences, China（人文学院，中国科学院大学，中国）

AI总结提出SciRisk-Bench基准，从显式风险维度和科学学科两个角度评估AI4Science安全，覆盖7个学科、31个子学科和10个风险维度，实验揭示主流及科学大模型的安全薄弱环节。

AI中文摘要

大型语言模型（LLMs）越来越多地嵌入到人工智能驱动的科学（AI4Science）工作流程中，从科学问答和文献分析到实验室规划和自主发现。这一进展迫切需要这样的安全基准：不仅评估科学能力，还评估模型能否在高风险的科学情境中识别并规避风险。现有的AI4Science安全数据集涵盖多个学科和任务格式，但其底层的风险维度未得到充分界定。我们引入了SciRisk-Bench，这是一个旨在从两个互补视角评估AI4Science安全的基准：显式风险维度和科学学科。SciRisk-Bench涵盖7个学科、31个子学科和10个风险维度。在实验部分，我们评估了主流LLMs和面向科学的LLMs在风险维度、学科和子学科上的表现，从而能够细粒度地诊断科学模型在哪些方面仍然不安全。

英文摘要

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and sub-disciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18950 2026-06-18 cs.AI 新提交

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

发表机构 * Seoul National University（首尔国立大学）

AI总结提出RTSGameBench，基于Beyond All Reason游戏，通过多样化对战、迷你游戏诊断和自进化生成框架，评估视觉语言模型在实时策略游戏中的战略推理能力。

Comments First two authors contributed equally

AI中文摘要

现代视觉语言模型（VLM）在竞争和合作环境中的不确定性下，往往难以进行战略推理，即预测和影响其他智能体的行为。实时策略（RTS）游戏可以作为诊断这一局限性的自然测试平台，因为它们要求与盟友协调、适应对手策略，并在部分可观测性下进行长期规划。然而，现有的RTS基准评估范围有限，缺乏系统的能力诊断，并且局限于预设计的场景覆盖。为了解决这些限制，我们提出了RTSGameBench，它建立在Beyond All Reason之上，这是一款大规模RTS游戏，其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估，通过迷你游戏进行诊断性评估，每个迷你游戏针对单个战略能力，并通过自进化生成框架实现可扩展的覆盖，该框架将自由形式的查询转化为新的迷你游戏，并在连续循环中改进。此外，为了让VLM在大规模RTS游戏中运行，我们提供了RTSGameAgent，它通过具有智能体记忆的有限状态机（FSM）管理单位。我们通过实验验证，多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

英文摘要

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed to their pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini-games that each target an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units via a finite-state machine (FSM) with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination or multi-agent coordination, and when task scale increases.

2606.19245 2026-06-18 cs.AI cs.LG 新提交

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP：分析AI代理在小分子临床前药理学中的表现

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结提出TxBench-PP基准，用于评估AI代理从真实实验数据中恢复临床前药理学结论的能力，测试显示最强配置Claude Opus 4.8 / Pi仅通过59.3%的端点尝试。

AI中文摘要

人工智能（AI）代理有望通过压缩解释和决策循环来加速药物发现，但实际部署需要基于现实程序决策的可信评估。我们引入了TherapeuticsBench临床前药理学（TxBench-PP），这是一个针对小分子临床前药理学的可验证基准，也是更广泛的TherapeuticsBench在药物发现阶段和治疗模式中的首个聚焦切片。TxBench-PP测试代理是否能够从真实实验数据中恢复准确的结论，而非从文献中记忆的事实。该基准包含100个评估，按程序阶段、实验类型和任务结构索引，涵盖作用机制（MoA）和药效学（PD）推理、化合物-靶点结合、因果靶点验证、可开发性与安全性以及转化疗效。代理接收现实的工作流程快照，在编码环境中检查文件，并返回确定性评分的结构化答案。在16个模型-工具配置（包括11个模型和4,800条轨迹）中，没有系统能够可靠地恢复临床前药理学决策。最强配置Claude Opus 4.8 / Pi通过了59.3%的端点尝试（178/300；95% CI, 51.1-67.6），其次是GPT-5.5 / Pi，为55.3%（166/300；47.0-63.6）。

英文摘要

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).
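
The headline pass rates can be checked directly from the reported counts. The sketch below recomputes them and, as an assumption, adds a plain Wald (normal-approximation) interval that treats all 300 attempts as independent. The abstract does not say how its confidence intervals were computed, so the two need not agree; the function names are illustrative.

```python
import math


def pass_rate(passed, attempts):
    """Point estimate: fraction of endpoint attempts that passed."""
    return passed / attempts


def wald_interval(passed, attempts, z=1.959964):
    """Naive 95% interval assuming independent attempts (an assumption, not the paper's method)."""
    p = passed / attempts
    half = z * math.sqrt(p * (1 - p) / attempts)
    return max(0.0, p - half), min(1.0, p + half)


# Counts taken from the abstract: passes out of 300 endpoint attempts.
for name, passed in [("Claude Opus 4.8 / Pi", 178), ("GPT-5.5 / Pi", 166)]:
    lo, hi = wald_interval(passed, 300)
    print(f"{name}: {100 * pass_rate(passed, 300):.1f}% "
          f"(naive 95% CI {100 * lo:.1f}-{100 * hi:.1f})")
```

On these counts the naive interval comes out narrower than the reported 51.1-67.6 range. That would be consistent with an interval method that accounts for repeated attempts on the same evaluation, though this is a guess rather than something the abstract states.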

2606.19256 2026-06-18 cs.AI 新提交

X+Slides: Benchmarking Audience-Conditioned Slide Generation

X+Slides：面向受众条件的幻灯片生成基准测试

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

AI总结提出X+Slides基准，通过动态评估框架和受众特定权重，衡量幻灯片生成系统在受众覆盖、领域覆盖、效率和正确性方面的表现，揭示现有系统在受众关键信息恢复上的不足。

AI中文摘要

从源文档自动生成幻灯片是大语言模型（LLMs）的重要应用。现有基准主要评估幻灯片的完整性和技术深度，而忽略了目标受众这一关键现实因素。例如，专家需要严格的证明，而决策者优先考虑可操作的结论。为弥补这一差距，我们引入了X+Slides，一个专门为受众条件幻灯片生成设计的基准。基于涵盖113个主题和七种演示场景的多样化语料库，X+Slides采用由8,133个去重、基于源的探针构建的动态评估框架。通过为相同的基于源的探针分配受众特定的效用权重，X+Slides报告四个互补指标：受众覆盖率衡量传达了受众必要信息的程度，领域覆盖率显示覆盖了哪些信息类型，效率衡量每单位注意力成本传递的效用，正确性验证幻灯片声明是否得到源支持。在DeepPresenter、SlideTailor和NotebookLM上的实验表明，当前系统可以恢复大部分但仍有缺失的受众必要信息：在τ_A=0.7时，DeepPresenter达到最佳受众覆盖率0.714，SlideTailor达到0.594，NotebookLM消融达到0.853，同时显示出明显的接地差异。这些结果表明，视觉质量和广泛的主题覆盖不应在没有基于源评估的情况下被视为证据支持。

英文摘要

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A = 0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.
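
The idea of scoring one deck against the same probes under different audience weights can be sketched as follows. This is one plausible reading only: the abstract does not define Audience Coverage, and treating $\tau_A$ as a cutoff that selects audience-essential probes is an assumption, as are the function name and the toy weights.

```python
def audience_coverage(weights, covered, tau=0.7):
    """Weighted share of audience-essential probes that the deck answers.

    weights: per-probe utility for one audience, each in [0, 1].
    covered: per-probe flag, True if the deck conveys that probe.
    tau: assumed cutoff above which a probe counts as essential.
    """
    essential = [(w, c) for w, c in zip(weights, covered) if w >= tau]
    total = sum(w for w, _ in essential)
    if total == 0:
        return 0.0
    return sum(w for w, c in essential if c) / total


# One deck, four shared probes; only the weights differ by audience (toy values).
covered = [True, False, True, True]
specialist_weights = [0.9, 0.8, 0.7, 0.2]
manager_weights = [0.3, 0.9, 0.7, 0.8]

print(f"specialist: {audience_coverage(specialist_weights, covered):.3f}")
print(f"manager:    {audience_coverage(manager_weights, covered):.3f}")
```

Keeping the probes fixed and varying only the weights means score differences between audiences reflect what each audience needs, not a different test set.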

2606.18257 2026-06-18 cs.HC cs.AI 交叉投稿

From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions

从记忆到创造：评估LLM生成的教育问题的认知深度

Xiaolong Wang, Zhe Zhao, Song Lai, Chaoli Zhang, Zijie Geng, Yu Tong, Ye Wei, Qingsong Wen

发表机构 * City University of Hong Kong（香港城市大学）； Zhejiang Normal University（浙江师范大学）； Squirrel Ai Learning ； University of Science and Technology of China（中国科学技术大学）； Wuhan University（武汉大学）

AI总结通过布鲁姆认知分类学评估六种LLM生成问题的认知层次，提出细粒度提示策略减少重复性并提升高阶认知比例，引入认知转移强度和类别漂移指标，揭示链式思维提示的可解释性。

Comments Accepted by KDD 2026

DOI: 10.1145/3770854.3785686
Journal ref: KDD 2026

AI中文摘要

尽管LLM在自动化教育内容生成方面展现出潜力，但它们生成能够激发高阶思维问题的能力仍未被充分研究。本研究通过布鲁姆认知分类学视角评估六种广泛使用的LLM，重点关注它们超越机械记忆并实现认知飞跃的能力。采用混合人机评估协议，我们在计算机科学、K-12数学和社会科学领域生成并分析了20,700个问题。主要贡献包括：(1) 一种细粒度提示策略，使Qwen2.5-7B-Instruct的问题重复性降低24.45%，并使InternLM3-8B-Instruct的高阶认知层次输出比例提升11.53%；(2) 认知转移强度（CogShift）和类别漂移的量化指标，揭示InternLM3在多层次转换中的优越性能；(3) 可解释性分析揭示指标级相关性，增强了链式思维提示的透明度。我们的发现强调了认知感知提示设计的重要性，并为在个性化学习系统中部署LLM提供了基准。

英文摘要

While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely-used LLMs through a Bloom's Taxonomy lens, focusing on their capacity to transcend rote memorization and achieve cognitive leaps. Using a hybrid human-AI evaluation protocol, we generate and analyze 20,700 questions across computer science, K-12 math, and social-science domains. Key contributions include: (1) a fine-grained prompting strategy that reduces question repetitiveness by 24.45% for Qwen2.5-7B-Instruct, and increases the proportion of higher-order cognitive level outputs by 11.53% for InternLM3-8B-Instruct; (2) quantitative metrics for cognitive shift intensity (CogShift) and category drift, revealing InternLM3's superior performance in multi-level transitions; (3) an interpretability analysis revealing metric-level correlations that enhance the transparency of Chain-of-Thought prompting. Our findings highlight the importance of cognitive-aware prompt design and provide benchmarks for deploying LLMs in personalized learning systems.

2606.18263 2026-06-18 cs.HC cs.AI 交叉投稿

多模态超图融合用于低光照人群计数

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）

AI总结针对低光照环境下人群计数难题，构建三个新基准数据集，提出多模态超图融合模块和可变形矩形稀疏注意力模块，形成低光照计数网络LCNet，在三个基准上取得最优性能。

LandslideAgent与多模态LandslideBench：一种面向自主滑坡识别与分析的领域规则增强型智能体

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

发表机构 * Central South University（中南大学）

AI总结提出指令驱动智能体框架，包含多模态数据集LandslideBench、滑坡专用视觉语言模型LandslideVLM及领域规则增强智能体LandslideAgent，实现自主滑坡识别与分析。

AI中文摘要

智能滑坡灾害解译对于防灾减灾至关重要，然而当前范式难以同时提取视觉特征和高层次地球科学语义，而通用视觉语言模型在复杂地质场景中存在感知局限和领域幻觉。为解决这些挑战，我们提出一个指令驱动的智能体框架，包含三个组成部分。首先，通过多VLM交叉验证和交互式标注构建LandslideBench，这是一个多模态细粒度数据集，包含七个子类型标签、高分辨率图像、像素级掩膜和高质量文本描述。然后，通过LoRA在LandslideBench上微调面向滑坡的VLM——LandslideVLM，以增强地质语义理解。最后，以LandslideVLM为认知核心的领域规则增强智能体LandslideAgent，采用双规则控制器，结合结构化报告元数据约束和交叉验证识别约束，来调控自动化工具调用。实验表明，LandslideBench为五种主流模型在细粒度分类和语义分割上提供了有效基线。LandslideVLM在滑坡判别、细粒度分类和语义描述质量上分别提升了10.96%、32.87%和15.91%。LandslideAgent进一步实现了自主多源空间数据推理，实现了滑坡识别与分析的全流程智能化。

英文摘要

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18699 2026-06-18 cs.CL cs.AI cs.IR 交叉投稿

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench: 衡量台湾法律理解

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

发表机构 * University of Rochester（罗切斯特大学）； National Taiwan University（国立台湾大学）； NVIDIA（英伟达）

AI总结提出TW-LegalBench基准，包含多项选择、开放式问答和法律判决预测任务，评估13个LLM在台湾法律上的表现，发现顶尖模型通过律师考试但未达到法官检察官标准，且法律条文引用困难。

Comments 10 pages, 2 figures, To appear in ICAIL 2026

AI中文摘要

大型语言模型（LLM）在多种任务上展现出令人印象深刻的能力，但其在特定司法管辖区法律推理上的表现仍未充分探索。我们提出TW-LegalBench，利用台湾法律系统丰富的官方公开语料库，在侧重英文来源的普通法基准和侧重简体中文来源的大陆法基准之外，填补了评估LLM在台湾法律上表现的空白。TW-LegalBench包含三种任务类型：（1）涵盖18个专业领域五年官方考试的超过16,000道多项选择题（MCQ）；（2）来自法律专业人员考试的117道开放式问答题（OEQ），附有官方评分标准；（3）超过14,000个法律判决预测（LJP）实例，涵盖数百种犯罪类别。我们评估了13个LLM：MCQ使用准确率，OEQ使用基于评分标准要点的分解式LLM-as-Judge框架，LJP使用量刑准确性和法条引用指标。我们的结果显示，表现最佳的模型超过了律师资格的通过门槛（通过率：11%），但未达到法官和检察官的通过标准（通过率：1-2%）。对于LJP，虽然模型展示了合理的判决类型准确性和刑期预测能力，但它们难以准确引用具体法律条文。这些发现表明，即使LLM在资格考试上的表现接近人类水平，可靠的法律文本生成对其而言仍然具有挑战性。

英文摘要

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill the gap in evaluating LLMs on Taiwanese law, alongside common-law benchmarks that focus on English sources and civil-law benchmarks that focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

2606.18733 2026-06-18 cs.SE cs.AI 交叉投稿

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

SWE-Future: 面向未来软件工程智能体的预测条件数据合成

Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

发表机构 * Baidu Inc（百度公司）

AI总结提出SWE-Future方法，利用仓库历史证据预测未来任务类型（如功能实现、缺陷修复），并基于预测条件合成200个编码智能体任务，减少对历史PR回放的依赖，在80个仓库中达到58.1%的未来工作相关性。

AI中文摘要

真实的编码智能体基准测试通常回放公开的GitHub问题和拉取请求，这使得它们容易与模型预训练、微调、合成数据生成或基准驱动的模型选择产生重叠。完全合成的任务避免了直接的历史回放，但可能偏离真实的仓库需求。我们提出了SWE-Future，一种面向未来编码任务的预测条件数据合成方法。给定时间$T_0$的预测快照，该方法仅使用$T_0$之前的仓库证据来预测未来的功能实现/增强、缺陷修复和重构任务族。我们首先回顾性地验证了这一预测步骤：在预测固定后，后续的拉取请求仅用于衡量预测的任务族是否与未来的仓库工作匹配。在一项80个仓库的研究中，预测器在主要语义匹配指标下达到了58.1%的未来工作相关性。然后，我们使用经过验证的预测族作为条件信号，从任务生成快照中跨61个仓库合成了一个包含200个任务的编码智能体数据集，而不是回放用于验证的后续拉取请求。SWE-Future表明，仓库演化预测可以指导现实的、面向未来的编码任务合成，同时减少对历史拉取请求回放的直接依赖。

英文摘要

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.
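
The temporal discipline described above (forecast from pre-$T_0$ evidence only, then score against later pull requests) can be sketched in a few lines. The event schema, the function names, and the exact-label matching are all assumptions; the paper reports a semantic matching metric, which this toy version does not implement.

```python
from datetime import datetime


def split_at_snapshot(events, t0):
    """Partition repository events into forecasting evidence and held-out validation."""
    evidence = [e for e in events if e["time"] < t0]
    held_out = [e for e in events if e["time"] >= t0]
    return evidence, held_out


def relevance(predicted_families, held_out):
    """Fraction of predicted task families matched by at least one later pull request."""
    if not predicted_families:
        return 0.0
    later = {e["family"] for e in held_out}
    return sum(f in later for f in predicted_families) / len(predicted_families)


# Hypothetical event log for one repository.
events = [
    {"time": datetime(2025, 11, 3), "family": "bugfix"},
    {"time": datetime(2025, 12, 10), "family": "feature"},
    {"time": datetime(2026, 2, 1), "family": "refactor"},
    {"time": datetime(2026, 3, 15), "family": "bugfix"},
]
t0 = datetime(2026, 1, 1)

evidence, held_out = split_at_snapshot(events, t0)
predicted = ["bugfix", "feature"]  # would be produced from `evidence` only
print(f"future-work relevance: {relevance(predicted, held_out):.2f}")
```

The point of the split is that nothing dated at or after the snapshot can reach the forecaster, so the later pull requests serve purely as a check.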

一个用于检测 GPT-Image-2 生成的含丰富文本图像的多领域基准

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

AI总结针对现有基准缺乏文本丰富图像检测的问题，构建了包含8602张图像、覆盖6个类别的多领域基准，评估5种检测器，发现性能高度依赖领域且易受JPEG压缩影响。

AI中文摘要

含丰富文本的图像通常包含隐私敏感、交易或决策相关信息。随着最近多模态图像生成模型合成逼真文本内容和结构化视觉设计的能力越来越强，检测AI生成的含丰富文本图像已成为数字信任和内容真实性的重要挑战。然而，现有基准主要关注以物体为中心的图像，对文本语义和布局组织至关重要的场景覆盖有限。在本文中，我们引入了一个用于检测OpenAI的GPT Image 2生成的含丰富文本图像的多领域基准。该基准包含8602张图像，涵盖六个代表性类别：商业海报、信息图表、学术海报、收据、表格和UI截图。利用该基准，我们在零样本设置下评估了五种代表性AI生成图像检测器，并分析了它们的整体性能、类别性能和后处理鲁棒性。我们的结果表明，检测器性能高度依赖于领域：在某些类别上表现良好的方法往往在其他类别上失败，即使最强的传统检测器也对JPEG压缩表现出严重敏感性。我们进一步使用多模态视觉语言模型进行了探索性评估，揭示了其在结构化格式上的潜力和局限性。这些发现突显了针对现代AI生成图像需要文本和布局感知的检测方法。我们的数据集发布于XXX。

英文摘要

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall performance, category-wise performance, and robustness to post-processing. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

2512.04144 2026-06-18 cs.AI 版本更新

RippleBench: Capturing Ripple Effects Using Existing Knowledge Repositories

RippleBench: 利用现有知识库捕捉涟漪效应

Roy Rinberg, Usha Bhalla, Igor Shilov, Flavio P. Calmon, Rohit Gandikota

发表机构 * Harvard University（哈佛大学）； Imperial College London（伦敦帝国学院）； Northeastern University（东北大学）

AI总结提出RippleBench-Maker自动管道，从知识库检索语义邻居生成选择题，评估八种遗忘方法在Llama3-8B-Instruct上的涟漪效应，发现准确率下降随语义距离衰减且跨模型一致。

AI中文摘要

针对语言模型的目标干预，如遗忘或模型编辑，旨在修改特定信息，但其效果往往会传播到相关的、非预期的领域（例如，删除病毒学内容可能降低模型在过敏相关问题上的表现）；这些副作用通常被称为涟漪效应。我们引入RippleBench-Maker，一个自动管道，从知识库中检索任何源概念的语义邻居，并生成不同语义距离的多选题。我们使用WikiRAG（一个基于英文维基百科的开源RAG系统）实例化该框架，构建RippleBench-WMDP-Bio（584个种子主题，352,961个问题），并在Llama3-8B-Instruct上评估八种遗忘方法。所有八种方法在遗忘目标附近准确率下降最大，并随语义距离衰减，每种方法具有不同的传播曲线。我们在Mistral-7B、Zephyr-7B和Yi-34B上复现了这些发现；跨模型的差值曲线几乎相同，表明涟漪效应是遗忘方法的属性，而非基础模型的属性。我们通过一项包含四个实验的Mechanical Turk研究（5,200+次响应，61名工作者）验证了所有主要管道阶段。我们发布所有代码、数据和基础设施。

英文摘要

Targeted interventions on language models, such as unlearning or model editing, aim to modify specific information, but their effects often propagate to related, unintended areas (e.g., removing virology content may degrade performance on allergies); these side-effects are commonly referred to as the ripple effect. We introduce RippleBench-Maker, an automatic pipeline that retrieves semantic neighbors of any source concept from a knowledge repository and generates multiple-choice questions at varying semantic distances. We instantiate this framework using WikiRAG, an open-source RAG system over English Wikipedia, to construct RippleBench-WMDP-Bio (584 seed topics, 352,961 questions), and evaluate eight unlearning methods on Llama3-8B-Instruct. All eight exhibit accuracy drops that are largest near the unlearned target and decay with semantic distance, each with a distinct propagation profile. We replicate these findings across Mistral-7B, Zephyr-7B, and Yi-34B; cross-model delta curves are nearly identical, suggesting ripple effects are a property of the unlearning method rather than the base model. We validate all major pipeline stages using a four-experiment Mechanical Turk study (5,200+ responses, 61 workers). We release all code, data, and infrastructure.
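The distance-dependent accuracy drop described above can be illustrated with a small sketch. This is not the RippleBench code: the record layout `(distance, base_correct, edited_correct)` and the function name `delta_curve` are assumptions made for illustration, and the sample data in the usage note is invented.

```python
from collections import defaultdict


def delta_curve(records):
    """Return accuracy change (edited minus base) per semantic-distance bucket.

    `records` is an iterable of (distance, base_correct, edited_correct)
    tuples, one per multiple-choice question. This layout is hypothetical.
    """
    # Per distance bucket: [question count, base correct, edited correct]
    totals = defaultdict(lambda: [0, 0, 0])
    for distance, base_ok, edited_ok in records:
        bucket = totals[distance]
        bucket[0] += 1
        bucket[1] += int(base_ok)
        bucket[2] += int(edited_ok)
    # A negative value means the unlearned model lost accuracy at that distance.
    return {d: (t[2] - t[1]) / t[0] for d, t in sorted(totals.items())}
```

A ripple effect of the kind the abstract reports would show up as strongly negative values at small distances that shrink toward zero as the distance grows.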

2605.29676 2026-06-18 cs.AI cs.CL 版本更新

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

符号至关重要：智能体AI系统中令牌优化格式的基准研究

Lorenz Kutschka, Bernhard Geiger

发表机构 * Know Center Research GmbH（知中心研究有限公司）； Graz University of Technology（格拉茨技术大学）； Graz Center for Machine Learning（格拉茨机器学习中心）

AI总结本研究在四个智能体基准上评估了两种令牌优化格式TOON和TRON，发现TRON在保持准确率的同时最多减少27%的令牌，而TOON虽减少18%但存在多轮解析失败和并行工具调用输出崩溃的问题。

Comments 16 pages, 6 figures, 4 tables

AI中文摘要

智能体AI系统中的大型语言模型消耗工具模式和执行结果，并以结构化数据的形式发出工具调用。这种交换的默认语言JSON是为应用间交换而非令牌效率设计的，因此其结构元素带来大量令牌开销。最近的工作提出了令牌优化的替代方案，如TOON（令牌导向对象表示法）和TRON（令牌精简对象表示法），作为更紧凑的替代格式，但这些格式仅在孤立的理解或生成任务上进行了评估。它们的令牌减少在端到端智能体循环中是否依然成立，仍是一个开放问题。我们在四个智能体基准（BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench）和五个开放权重LLM上评估了TOON和TRON，将输入压缩与输出压缩解耦，以独立测量理解和生成。TRON最多减少27%的令牌，准确率与JSON基线相差不超过14个百分点。TOON最多减少18%的令牌，准确率代价相近，为9个百分点，但还会因多轮解析失败而引发级联错误，并且在大多数模型上导致并行工具调用输出崩溃。代码见：https://github.com/lkutschka/notation-matters

英文摘要

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
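To see why JSON's structural elements add overhead, the sketch below compares a JSON encoding of uniform records with a header-plus-rows encoding. The compact format here is a hypothetical stand-in, not the actual TOON or TRON syntax, and character count is only a rough proxy for token count, which depends on the tokenizer.

```python
import json


def to_json(rows):
    """Encode records as standard JSON (keys repeated in every record)."""
    return json.dumps(rows)


def to_compact(rows):
    """Encode records as one header line plus one comma-separated line per row.

    Hypothetical format for illustration only. It assumes every record has
    the same keys and that no value contains a comma or a newline.
    """
    keys = list(rows[0])
    lines = [",".join(keys)]
    for row in rows:
        lines.append(",".join(str(row[k]) for k in keys))
    return "\n".join(lines)


def char_reduction(rows):
    """Fraction of characters saved by the compact encoding relative to JSON."""
    return 1 - len(to_compact(rows)) / len(to_json(rows))
```

The saving comes from stating each key once instead of once per record, so it grows with the number of uniform records and vanishes for deeply nested or irregular data.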

2606.17453 2026-06-18 cs.AI 版本更新

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench: 通过基于行为的隐含决策因素对满意度感知地图智能体进行基准评测

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结提出MapSatisfyBench基准，通过恢复用户行为链中的隐含决策因素来评估地图智能体的满意度感知能力，实验表明现有智能体在显式任务完成上表现良好，但在满足隐含需求方面仍有局限。

AI中文摘要

大型语言模型智能体越来越多地集成到地图服务中。由于地图服务嵌入在日常场景而非专业任务设置中，用户通常非正式地表达需求，导致查询不明确，包含许多未言明的需求，即对用户满意度至关重要的隐含决策因素。虽然澄清是缓解这一问题的有效方法，但它增加了日常交互中的用户负担，而一个能干的智能体应首先从可用信息源主动恢复这些因素。然而，评估这一能力具有挑战性。第一个挑战是确定哪些隐含决策因素适合评估。一个因素只有在影响用户接受度且能从智能体响应前可获取的信息中恢复时才是可评估的。其次，用户满意度不能可靠地由单个参考答案表示，需要一个将满意度相关因素转化为客观可量化评估目标的基准。为应对这些挑战，我们提出一个恢复-识别-过滤框架，从行为链证据中重建完整的用户需求，识别隐含决策因素，并仅保留那些有查询前证据支持的因素。基于此方法，我们从大规模真实世界匿名用户数据构建MapSatisfyBench，并从五个维度标注真实值，实现对满意度感知地图智能体的全链条评估。实验表明，当前智能体在显式任务完成上普遍表现良好，但在满足隐含决策因素和主动获取满意度感知决策所需证据方面仍然有限。这些发现使MapSatisfyBench成为将地图智能体评估从任务完成转向满意度感知空间决策的基准。

英文摘要

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY 版本更新

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛：前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

发表机构 * Compassion Aligned Machine Learning（同情对齐机器学习）； Sentient Futures（感知未来）； Harvard Kennedy School（哈佛肯尼迪学院）； Appalachian State University Department of Management（阿巴拉契亚州立大学管理系）

AI总结提出首个代理基准TAC，测试AI代理在为用户执行旅行预订等操作时是否避免涉及动物剥削的选项。评估七个前沿模型，所有模型得分低于随机水平64%，最佳模型仅53%。

AI中文摘要

AI代理正从顾问转变为行动者，代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应，但未检验这些响应中的福利推理是否迁移到代理部署中（模型必须使用工具采取行动）。我们引入TAC（旅行代理同情心），这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景，涵盖六类动物剥削，并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%，最佳表现者（Claude Opus 4.7）为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升，在GPT-5.2中提升26个百分点，在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计（使用Gemini 2.5 Flash Lite作为评判者，对前两名模型的288个基础条件转录进行审计）未标记任何评估意识转录，表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

英文摘要

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.18192 2026-06-18 cs.AI 版本更新

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

斯坦福EDGAR文件数据集：将美国公司及财务披露重建为布局忠实且令牌高效的预训练数据

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Nanjing University（南京大学）； Stanford University（斯坦福大学）

AI总结为解决长上下文文档稀缺问题，提出SEFD数据集，将SEC文件重建为布局忠实的MultiMarkdown格式，用于金融语言建模与评估，具有令牌高效、与Common Crawl重叠率低于0.1%的特点。

Comments Preprint. Includes appendix, tables, and figures

AI中文摘要

随着高质量公共网络语料库日益枯竭，干净的长上下文文档已成为大型语言模型（LLM）训练数据中稀缺且昂贵的来源。现有的长上下文语料库通常是专有的且获取成本高昂、合成生成的，或集中在编程等狭窄领域。我们介绍了斯坦福EDGAR文件数据集（SEFD），这是将SEC文件重建为布局忠实的MultiMarkdown格式的开放数据集，用于金融语言建模和评估。SEFD使经过审计的财务报表、风险披露、所有权报告、会计说明和影响市场的事件文件能够用作长上下文预训练数据，并作为金融推理、预测、合规和文档理解的基础。生成的语料库令牌高效、可直接用于模型，并且与Common Crawl衍生的语料库重叠率低于0.1%。我们发布了SEFD-v1，一个152B令牌的初始公共快照，并提供了更大的1850万文件档案（估计为550B令牌）的语料库级分析。我们进一步引入了两个基于SEFD的基准：EDGAR-Forecast，用于评估模型知识截止后基于文件的数值预测；以及EDGAR-OCR，用于评估复杂金融表格的转录。

英文摘要

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.

2303.18031 2026-06-18 cs.CV cs.AI cs.LG 版本更新

Simple Domain Generalization Methods are Strong Baselines for Open Domain Generalization

简单域泛化方法是开放域泛化的强基线

Masashi Noguchi, Shinichi Shirakawa

发表机构 * Graduate School of Environment and Information Sciences（环境与信息科学研究生院）； Yokohama National University（横滨国立大学）； Faculty of Environment（环境学系）

AI总结本文评估现有域泛化方法在开放域泛化中的表现，发现简单方法CORAL和MMD与复杂方法DAML竞争力相当，并通过集成学习和Dirichlet混合数据增强简单扩展后性能接近DAML且计算成本更低。

Comments Accepted at IJCNN 2024. The code used in the experiments is available at https://github.com/shiralab/OpenDG-Eval

DOI: 10.1109/IJCNN60899.2024.10650639

AI中文摘要

直接偏好优化综述：数据集、理论、变体及应用

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

发表机构 * Zhejiang University（浙江大学）； Nanyang Technological University（南洋理工大学）； Alibaba Group（阿里巴巴集团）

AI总结综述直接偏好优化（DPO）在理论、变体、数据集和应用方面的进展，指出其作为RL-free替代方案的潜力与局限，并提出未来研究方向。

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

DOI: 10.1109/TPAMI.2026.3704314

AI中文摘要

随着大语言模型（LLMs）的快速发展，将策略模型与人类偏好对齐变得日益关键。直接偏好优化（DPO）已成为一种有前景的对齐方法，可作为基于人类反馈的强化学习（RLHF）的无RL替代方案。尽管DPO取得了各种进展并存在固有局限性，但文献中目前缺乏对这些方面的深入综述。在这项工作中，我们对DPO中的挑战和机遇进行了全面回顾，涵盖理论分析、变体、相关偏好数据集和应用。具体而言，我们基于关键研究问题对近期DPO研究进行分类，以提供对DPO当前格局的透彻理解。此外，我们提出了几个未来研究方向，为研究社区提供模型对齐的见解。相关论文的更新合集见 https://github.com/Mr-Loevan/DPO-Survey。

英文摘要

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.

2606.18271 2026-06-18 cs.AI cs.LG 新提交

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

NAVI-Orbital：用于自主地球观测的零样本视觉语言模型的首次在轨演示

Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson

发表机构 * NASA Jet Propulsion Laboratory (JPL)（美国宇航局喷气推进实验室）； Loft Orbital（Loft Orbital公司）

AI总结本文介绍NAVI-Orbital系统，在低地球轨道卫星上首次实现视觉语言模型的自主多模态推理，通过语义压缩解决数据下传瓶颈。

Comments 17 pages, 47 figures

AI中文摘要

随着地球观测数据的生成速度超过下行链路带宽和人在回路处理能力，星载采集与可操作地面情报之间的差距日益扩大。本文介绍NAVI-Orbital，一个部署在低地球轨道（LEO）航天器上的软件系统。2026年4月16日，NAVI-Orbital实现了据作者所知首次在轨演示，即视觉语言模型完全在星上进行自主多模态推理。NAVI-Orbital使用本地视觉语言模型（Gemma 3）对每个捕获场景进行分类，生成其内容及特征间关系的文本描述，并通过自然语言对话响应操作员的后续查询。该系统通过纯英语提示替代传统指令序列进行任务重定向，并由基于图的状态机（LangGraph）编排，协调用于检测和对话的专用代理。地面基准测试（在7,960张图像的精选AID基准上准确率达88.16%）、Flatsat验证以及实时在轨捕获的新获取、未见过的地球图像（包括未校正的YAM-9图像，在星上通过硬件加速GPU推理处理且未对飞行仪器进行微调）的结果表明，在卫星级边缘计算机上运行基础模型是可行的，通过星上地球观测的语义压缩，颠覆了传统的先采集后全部下传的带宽模式。

英文摘要

As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.

2606.18598 2026-06-18 cs.AI cs.LG 新提交

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

在地质、需求和定价不确定性下优化锂生产决策：多目标决策的POMDP框架

Anna C. Edmonds, Mansur M. Arief, Robert J. Moss, Mykel J. Kochenderfer, Jef Caers

发表机构 * Computer Science Department, Stanford University（斯坦福大学计算机科学系）； Aeronautics and Astronautics Department, Stanford University（斯坦福大学航空与航天系）； Earth and Planetary Sciences Department, Stanford University（斯坦福大学地球与行星科学系）

AI总结提出POMDP框架，通过信念状态规划优化锂矿开采决策，动态适应价格不确定性，实现更高需求满足和更平衡的经济环境效益。

Comments 24 pages, 14 tables, 4 figures

AI中文摘要

锂生产中的决策制定具有挑战性，无论是从投资者角度还是战略生产角度。决定开采哪些矿山以及何时开采，不仅涉及地质和价格不确定性，还涉及提取方法选择的复杂性，从直接锂提取到硬岩开采。先前的工作探索了该问题的模型和优化采矿决策的不同方法，但这些模型没有考虑定价不确定性、需求不确定性或提取锂的不同采矿技术。将不同的定价模型和提取技术纳入这些模型，可以制定更稳健的策略，不仅决定何时何地开采矿山，还决定采用哪种生产方法。我们将问题表述为部分可观测马尔可夫决策过程（POMDP），并使用信念状态规划方法求解以获得最优决策。在我们的研究中，我们表明POMDP求解器通过信念状态规划和显式不确定性管理，动态适应变化的锂价格机制（静态、线性、指数和随机），优于受人类启发的启发式方法。通过优化勘探、生产和技术选择的顺序，该框架在所有不同的定价和矿床情景下，在项目生命周期内实现了更高的需求满足和更平衡的经济与环境结果。

英文摘要

Decision making in lithium production is challenging, whether from an investor's perspective or a strategic production standpoint. Determining which mines to open and when to open them involves not only geological and price uncertainties, but also complexities around the choice of extraction method, from direct lithium extraction to hard rock mining. Prior work explored models of this problem and different methods to optimize mining decisions, but these models did not account for uncertainty in pricing, uncertainty in demand, or the different mining technologies used to extract lithium. Incorporating different pricing models and extraction technologies into these models enables more robust strategies for determining not only when and where to open a mine, but also which method of production to pursue. We frame the problem as a partially observable Markov decision process (POMDP) and solve it using belief-state planning methods to obtain optimal decisions. In our study, we show that POMDP solvers outperform human-inspired heuristics by dynamically adapting to shifting lithium price regimes (static, linear, exponential, and stochastic) through belief-state planning and explicit uncertainty management. By optimally sequencing exploration, production, and technology choice, the framework achieves higher demand fulfillment and more balanced economic and environmental outcomes over the project's lifetime across all pricing and deposit scenarios.
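Belief-state planning rests on keeping a probability distribution over the hidden state and revising it after each observation. The sketch below shows only that Bayesian update for a static hidden state, such as the grade of a deposit. The state names, the likelihood values in the usage note, and the function name are illustrative assumptions and are not taken from the paper.

```python
def update_belief(belief, likelihood):
    """Bayes update of a discrete belief after one observation.

    belief:     dict mapping each hidden state to its prior probability.
    likelihood: dict mapping each hidden state to P(observation | state).

    The hidden state is assumed not to change between steps, so no
    transition model is applied (a simplification of the general POMDP case).
    """
    unnormalized = {s: belief[s] * likelihood[s] for s in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the belief")
    return {s: p / total for s, p in unnormalized.items()}
```

For example, starting from an even prior over `low_grade` and `high_grade`, an exploration result that is four times more likely under `high_grade` moves the belief to 0.8 for `high_grade`. A planner would then choose the next action (explore further, open a mine, pick an extraction method) against this updated belief rather than against a single guessed state.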

2606.18803 2026-06-18 cs.AI cs.CY 新提交

IOAH3: 重要性驱动的自适应空间划分

Ehsaneddin Jalilian

发表机构 * Interdisciplinary Transformation University Austria（奥地利跨学科转型大学）

AI总结提出IOAH3方法，通过多源特征提取、马尔可夫随机场图割优化和数据驱动层次细化，构建自适应空间划分，解决可修改面积单元问题。

AI中文摘要

我们提出IOAH3（重要性导向的自适应H3划分），一种用于构建地理参考观测域的数据驱动空间划分的计算方法。标准的空间聚合方法采用固定面积单元，例如行政边界或单一分辨率的均匀六边形网格，而不考虑每个区域中底层观测的信息内容。这导致了著名的可修改面积单元问题：统计和推断结果依赖于划分的任意选择，空间集中的现象在粗网格中被平均化，从而掩盖了精细尺度的结构。IOAH3通过三个阶段构建自适应划分来解决这一问题：多源特征提取和重要性评分，通过主成分分析对道路密度、POI密度、建筑密度和地形粗糙度信号进行，人口和洪水灾害数据作为辅助输入用于单元过滤和空间平滑；通过马尔可夫随机场图割优化进行空间单元选择，该优化在强制空间连续性的同时联合最大化每个单元的重要性；以及数据驱动的高重要性区域层次细化到更精细的H3分辨率级别，并通过邻居传播支持以避免孤立的精细分辨率孤岛。所得划分作为空间推断流程的输入，并在任何建模步骤之前提供了对划分敏感性问题的原则性解决方案。

英文摘要

We present IOAH3 (Importance-Oriented Adaptive H3 partitioning), a computational method for constructing data-driven spatial partitions of geo-referenced observation domains. Standard approaches to spatial aggregation adopt fixed areal units, such as administrative boundaries or uniform hexagonal grids at a single resolution, without regard to the informational content of the underlying observations in each region. This leads to the well-known modifiable areal unit problem: statistical and inferential results depend on the arbitrary choice of partition, and spatially concentrated phenomena are averaged out in coarse cells that obscure fine-scale structure. IOAH3 addresses this by constructing an adaptive partition in three stages: multi-source feature extraction and importance scoring via principal component analysis over road density, POI density, building density, and terrain roughness signals, with population and flood-hazard data entering as auxiliary inputs to cell filtering and spatial smoothness; spatial cell selection via Markov Random Field graph-cut optimisation, which jointly maximises per-cell importance while enforcing spatial contiguity; and data-driven hierarchical refinement of high-importance regions to finer H3 resolution levels, with neighbour-propagated support to avoid isolated fine-resolution islands. The resulting partitions serve as input to spatial inference pipelines and provide a principled resolution of the partition-sensitivity problem prior to any modelling step.
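The first stage, importance scoring via principal component analysis, can be sketched in plain Python as a projection of standardized per-cell features onto their first principal component. This is a minimal illustration under stated assumptions, not the IOAH3 implementation: the function name, the use of power iteration, and the sign convention are choices made here, and the later graph-cut and refinement stages are not shown.

```python
def importance_scores(features, iters=200):
    """Score each cell by its projection onto the first principal component.

    features: list of per-cell feature rows, e.g. [road, poi, building, terrain].
    Returns one score per cell; a higher score means more of the dominant signal.
    """
    n, d = len(features), len(features[0])
    # Standardize each feature column to zero mean and unit variance.
    means = [sum(row[j] for row in features) / n for j in range(d)]
    stds = []
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in features) / n
        stds.append(var ** 0.5 or 1.0)  # guard against constant columns
    z = [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in features]
    # Covariance matrix of the standardized features (d x d).
    cov = [[sum(r[a] * r[b] for r in z) / n for b in range(d)] for a in range(d)]
    # Power iteration approximates the dominant eigenvector of the covariance.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Fix the sign so that larger feature values give larger scores.
    if sum(v) < 0:
        v = [-x for x in v]
    return [sum(r[j] * v[j] for j in range(d)) for r in z]
```

Cells whose scores exceed some threshold would then be candidates for refinement to a finer H3 resolution, although how the method actually combines scores with the graph-cut objective is not specified in the abstract.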

2606.18319 2026-06-18 cs.LG cs.AI cs.HC cs.SE 交叉投稿

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

ASTRA：一种具有自主模拟飞行员的可扩展下一代空中交通管制员训练模拟器

Ethan Chew, Enjia Wu, Iruss Eng Wei Yeow, Ian Weiqin Lim, Ranen Sim, Brandon Koh Ziheng, Kaleb Nim, Caden Toh Jun Yi, Wei Dong Soin, Darius Kai Keat Koh, Galen King Yu Tay, Prannaya Gupta, Jonathan Ee Fang Koong, Yong Zhi Lim

发表机构 * Air Emerging Technologies High-Speed Experimentations and Research (AETHER), RSAF Agile Innovation Digital (RAiD), Republic of Singapore Air Force（新加坡共和国空军敏捷创新数字实验室空中新兴技术高速实验与研究）

AI总结提出ASTRA模拟器，通过微调ASR将词错误率降至23.45%，并集成AI评估框架，实现可扩展的标准化ATCO训练。

AI中文摘要

空中交通管制员（ATCO）对于确保空中交通的安全、有序和高效至关重要，但培训能力受到依赖专门的人类培训师（称为模拟飞行员）的限制，这些培训师必须在模拟空域中扮演飞行员和ATCO的双重角色。现有的自动化解决方案依赖于西方中心的语音模型，这些模型在新加坡的运营环境中表现不佳，现成的系统在新加坡口音的航空语音上词错误率（WER）高达107.80%。我们引入了ASTRA，一个端到端的训练模拟器，通过一个流水线自动化这些模拟飞行员角色，该流水线转录ATCO语音、解释指令，并使用本地适应的语音模型生成适当的飞行员和ATCO响应。我们微调的自动语音识别（ASR）流水线将WER降低到23.45%，在该领域显著优于现有方法。除了交通模拟，ASTRA还集成了一个AI辅助的性能评估框架，该框架评估受训者的无线电通信的准确性、简洁性和完整性，优化后得分分别为91.7%、88.2%和86.9%。基于DSPy和Unsloth等开源基础，这种方法实现了可扩展、标准化的ATCO评估，同时减少了教师的工作量。

英文摘要

Air Traffic Control Operators (ATCOs) are vital in ensuring the safe, orderly, and efficient flow of air traffic, yet training capacity is constrained by reliance on specialized human trainers known as simpilots, who must role-play both pilots and ATCOs in a simulated airspace. Existing automated solutions rely on Western-centric speech models that perform poorly in Singaporean operational contexts, with off-the-shelf systems exhibiting Word Error Rates (WER) of up to 107.80% on Singaporean-accented aviation speech. We introduce ASTRA, an end-to-end training simulator that automates these simpilot roles through a pipeline that transcribes ATCO speech, interprets instructions, and generates appropriate pilot and ATCO responses using locally adapted voice models. Our fine-tuned Automatic Speech Recognition (ASR) pipeline reduces WER to 23.45%, substantially outperforming existing approaches in this domain. Beyond traffic simulation, ASTRA incorporates an AI-assisted performance evaluation framework that assesses trainee radiotelephony communications across accuracy, brevity, and completeness, achieving post-optimization scores of 91.7%, 88.2%, and 86.9%, respectively. Built on open-source foundations such as DSPy and Unsloth, this approach enables scalable, standardized ATCO assessment while reducing instructor workload.
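A WER above 100% can look like a typo, but it follows from the standard definition: substitutions, deletions, and insertions are all counted against the number of reference words, so a hypothesis with many spurious words can exceed 100%. The sketch below computes WER with a word-level edit distance; the function name and the sample phrases are illustrative, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With the reference `descend now` and the hypothesis `uh descend and maintain now please`, both reference words are matched but four words are inserted, giving 4 / 2 = 2.0, that is, a WER of 200%.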

2606.18379 2026-06-18 cs.IR cs.AI 交叉投稿

RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

RankGraph-2：十亿节点图学习在推荐中的生命周期协同设计

Renzhi Wu, Zikun Cui, Junjie Yang, Tai Guo, Hong Li, Xian Chen, Li Yu, Ke Pan, Sri Reddy, Mahesh Srinivasan, Nipun Mathur, Haomin Yu, Hong Yan

发表机构 * Meta Platforms（Meta平台）

AI总结针对十亿规模图检索中图构建、表示学习与实时服务三阶段孤立的问题，提出RankGraph-2框架，通过协同设计各阶段（如联合训练聚类索引、预计算邻域等），在降低83%服务计算成本的同时，召回率比GAT+Deep Graph Infomax高3.8倍，并带来CTR和CVR提升。

AI中文摘要

十亿节点规模的基于图的检索需要联合解决三个紧密耦合的问题——图构建、表示学习和实时服务——然而现有工作各自孤立地处理这些问题。我们提出了RankGraph-2，一个部署在Meta的框架，它协同设计了基于相似性检索（U2U2I和U2I2I）的所有三个生命周期阶段，每个阶段的需求塑造其他阶段。服务需要一个联合学习的聚类索引以避免昂贵的在线KNN——这迫使索引联合训练进入训练目标。训练受益于观察到基于相似性的检索容忍预计算邻域，从而消除了在线图基础设施——这要求构建产生自包含的数据。构建还必须支持小时级别的刷新以覆盖物品。基于这些级联需求，RankGraph-2通过带流行度偏差校正的子采样将数百万亿条边减少到数千亿条，通过个性化PageRank预计算多跳邻域，并联合学习一个残差量化聚类索引，将服务计算成本降低了83%。这种生命周期协同设计使得一个简单架构能够在二分图上实现比GAT+Deep Graph Infomax模型高3.8倍的召回率，在物品检索上比PyTorch-BigGraph高2.1倍。RankGraph-2带来了高达+0.96%的CTR和+2.75%的CVR提升，并已在主要业务面上支持了20多次检索发布。

英文摘要

Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems -- graph construction, representation learning, and real-time serving -- yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage's requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN -- this pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure -- this requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.
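Pre-computing multi-hop neighborhoods with personalized PageRank can be sketched on a toy graph as follows. This is a plain power-iteration version for illustration, not the production method: the adjacency-dict layout, the restart probability `alpha=0.15`, the iteration count, and the handling of dangling nodes are all assumptions.

```python
def personalized_pagerank(graph, source, alpha=0.15, iters=50):
    """Approximate personalized PageRank scores by power iteration.

    graph:  dict mapping each node to a list of its out-neighbors; every
            neighbor must also appear as a key.
    source: node that the random walk restarts from.
    alpha:  restart probability at each step (assumed value).
    """
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            mass = (1 - alpha) * scores[node]
            if not neighbors:
                # Dangling node: send its mass back to the source.
                nxt[source] += mass
                continue
            share = mass / len(neighbors)
            for neighbor in neighbors:
                nxt[neighbor] += share
        nxt[source] += alpha  # restart mass; total probability stays 1
        scores = nxt
    return scores


def top_neighbors(graph, source, k):
    """Return the k highest-scoring nodes other than the source."""
    scores = personalized_pagerank(graph, source)
    ranked = sorted((n for n in scores if n != source),
                    key=lambda n: scores[n], reverse=True)
    return ranked[:k]
```

Because the scores are computed offline and stored per node, a training job could read the top-k lists as ordinary data rather than querying a live graph service, which is the property the abstract attributes to pre-computed neighborhoods.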

2606.18393 2026-06-18 eess.SY cs.AI cs.SY 交叉投稿

Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation

基于学习的多燃料压燃发动机燃烧相位控制决策与潜在燃料反应性估计

Rajasree Sarkar, Aditya Satish Patil, Arunava Banerjee, Ihsan Berk Altiner, Zongxuan Sun, Kenneth Kim, Chol-Bum Mike Keown

发表机构 * Department of Mechanical Engineering, University of Minnesota Twin Cities（明尼苏达大学双城分校机械工程系）； DEVCOM Army Research Laboratory, Aberdeen Proving Ground（美国陆军作战能力发展司令部陆军研究实验室，阿伯丁试验场）

AI总结针对多燃料压燃发动机中燃料反应性（十六烷值）未知且时变的问题，提出一种基于GRU引导的强化学习框架，通过从燃烧历史中学习紧凑的燃料反应性表示，实现稳定的CA50控制，平均跟踪误差低于0.25°CA。

AI中文摘要

多燃料压燃发动机具有燃料灵活性，但引入了不确定且时变的燃料反应性（以十六烷值CN表示），这使循环到循环的燃烧相位控制复杂化。本文将潜在CN变化下的CA50调节问题建模为部分可观测的序贯决策问题，并系统评估了具有递增时间和表示能力的控制器，包括LinUCB、历史增强上下文赌博机、仅观测DDPG、递归DDPG以及提出的GRU引导RL框架。基于实验多燃料发动机数据训练的高斯过程代理提供了受控且可重复的评估环境。结果表明，短视和固定历史赌博机方法在CN变化下性能下降，仅观测RL受潜在状态混叠影响，而通用递归在CN快速演变时不足。所提出的框架从燃烧历史中学习紧凑的GRU基燃料反应性表示，并将执行器和评论家基于此估计信号而非真实CN进行条件化。通过在部署时相同的非完美燃料反应性信息上训练策略，控制器避免了传统在线估计-控制流程中的训练-部署不一致性。在未见过的CN轨迹上，该策略实现了稳定的CA50调节，在训练设定点平均绝对跟踪误差低于0.25°CA，同时产生平滑、物理一致的SOI和电热塞功率驱动。这些结果表明，在潜在连续演变的燃料动态下进行燃烧控制需要超越独立估计或通用递归的方法。通过将燃料反应性推断与控制策略学习对齐，所提出的框架能够使用部署时可用的相同估计状态实现反应性感知决策。

英文摘要

Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity, represented by cetane number (CN), which complicates cycle-to-cycle combustion-phasing control. This work formulates CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and a proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled and reproducible evaluation environment. Results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence is insufficient when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions both actor and critic on this estimated signal rather than oracle CN. By training the policy on the same imperfect fuel-reactivity information available at deployment, the controller avoids train-deploy inconsistency in conventional online estimate-then-control pipelines. Across unseen CN trajectories, the policy achieves stable CA50 regulation with mean absolute tracking error below 0.25° CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results show that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state available during deployment.

2606.18395 2026-06-18 eess.SP cs.AI cs.AR cs.SY eess.SY 交叉投稿

Deep Learning-Driven Inverse Design of Doherty Power Amplifiers Using Pixelated Combiners and Dual-State Impedance Synthesis

基于深度学习的Doherty功率放大器逆向设计：使用像素化合成器和双态阻抗合成

Han Zhou, Haojie Chang, David Widen, Christian Fager

发表机构 * Tampere University（坦佩雷大学）； Chalmers University of Technology（查尔姆斯理工大学）

AI总结提出一种结合深度卷积神经网络、像素化布局和遗传算法的三端口Doherty合成器设计方法，实现峰值和回退功率条件下的双态阻抗合成，在2.6-2.8 GHz频段内饱和输出功率>44.2 dBm，峰值漏极效率>71.2%。

2606.18402 2026-06-18 eess.SP cs.AI cs.AR cs.SY eess.SY 交叉投稿

Deep-Learning-Based Pixelated Microwave Filter Design and Characterization using Electro-Optical Electric-Field Measurements

基于深度学习的像素化微波滤波器设计与表征：利用电光电场测量

Han Zhou, Richard Bannister, Caspar Pierce, Haojie Chang, David Widen, Ludvig Fornstedt, Gabriel Melin, Alexander Bohlin, Pontus Lindeberg Fredriksson, Dilbagh Singh, Christian Fager, Koen Buisman

发表机构 * Chalmers University of Technology（查尔姆斯理工大学）； Advanced Technology Institute, University of Surrey（萨里大学先进科技研究所）； National Physical Laboratory（国家物理实验室）

AI总结提出结合卷积神经网络与遗传算法的深度学习方法，自动合成像素化微波滤波器，通过S参数和空间电场测量实验验证，实现7 GHz通带和9.5 GHz以上超过20 dB抑制，首次用电光测量揭示AI生成设计的电场模式。

2606.18425 2026-06-18 cs.SE cs.AI cs.DC 交叉投稿

From Specification to Execution: AI Assisted Scientific Workflow Management

从规范到执行：AI辅助的科学工作流管理

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

发表机构 * RENCI, University of North Carolina at Chapel Hill, NC, USA（RENCI，北卡罗来纳大学教堂山分校）； Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA（南加州大学信息科学研究所，玛丽安德尔湾）

AI总结提出一种AI辅助方法，通过规范驱动的工作流生成、自动化调试和分布式执行，结合Pegasus与MCP层，实现从自然语言到大规模科学工作流的端到端管理。

详情

AI中文摘要

科学工作流管理系统（WMS）支持复杂管道的可扩展和可重复执行，但工作流的设计、实现和调试仍然主要依赖人工，需要大量专业知识。最近使用大型语言模型（LLM）的方法在从自然语言生成工作流方面显示出潜力，但通常依赖于直接的代码合成，这限制了透明度、可重复性以及与工作流系统的集成。我们提出了一种AI辅助的科学工作流管理方法，结合了规范驱动的工作流生成、自动化调试和分布式执行。该方法引入了一个结构化的规范阶段，将工作流意图、设计和实现分离，允许在代码生成之前进行验证。我们还开发了一个基于LLM的调试代理，用于诊断和解决跨多个系统层的故障。为了支持分布式执行和用户交互，我们将广泛使用的WMS Pegasus与模型上下文协议（MCP）层集成，为工作流提交、监控和控制提供统一接口。我们使用一个用于医学影像的联邦学习工作流来评估该方法，该工作流具有并行、迭代和依赖密集的结构。该系统生成并执行了包含数千个作业的大规模工作流，减少了调试工作量，并允许非专家用户使用专家级设计模式构建工作流。这些结果表明，端到端的AI辅助工作流生成和执行是可行的，并指向了用于管理科学工作流生命周期的AI驱动平台。

英文摘要

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

2606.18444 2026-06-18 cs.LG cs.AI 交叉投稿

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

TMR-GGNN：基于时间感知多关系引导图神经网络的信用卡欺诈检测

Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan

发表机构 * Unysis Truist Banks Infinity Tech Group Technical Product（Unysis、Truist Banks、Infinity Tech Group 技术产品）； Fairfax, USA（美国费尔法克斯）； Atlanta, USA（美国亚特兰大）； Sunnyvale, USA（美国森尼韦尔）； Persistent Systems IEEE Vice Chair AeroSpace Chapter（Persistent Systems，IEEE 航空航天分会副主席）； Discover Financial Services（Discover 金融服务）； Edison, USA（美国爱迪生）

AI总结提出TMR-GGNN框架，通过时间窗口内异构实体交互建模、动态多关系图构建、时间感知注意力机制和对比学习解码器，结合InfoNCE与Focal Loss复合损失函数，解决数据不平衡和欺诈模式演化问题。

Comments 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON), Pages 7

详情

AI中文摘要

近年来，由于高度不平衡的数据、不断演变的欺诈模式以及交易实体间复杂的关联结构，信用卡欺诈检测面临重大挑战。为解决这些问题，本研究提出了一种名为时间感知多关系引导图神经网络（TMR-GGNN）的新框架。具体而言，所提出的TMR-GGNN通过建模客户、商户、设备和IP在时间窗口内的异构交互，扩展了编码器-解码器图神经网络（GNN）架构。随后，该TMR-GGNN方法构建了一个动态的多关系图，并在编码器中引入时间感知关系注意力机制，以基于时间邻近性和语义上下文自适应地权衡交易相关性。因此，解码器采用对比学习模块来区分真实和合成的交易模式，同时提高模型对罕见欺诈案例的泛化能力。此外，为有效管理严重的类别不平衡并强调判别性学习，引入了结合基于信息噪声对比估计（InfoNCE）的对比损失与Focal Loss的复合损失函数。这种集成有助于改进欺诈识别，同时减少假阴性。

英文摘要

In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called Time-Aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). Specifically, the proposed TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. The TMR-GGNN approach then constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weigh transaction relevance based on temporal proximity and semantic context. The decoder employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model's generalization to rare fraud cases. Additionally, to effectively manage severe class imbalance and emphasize discriminative learning, a composite loss function combining an Information Noise Contrastive Estimation (InfoNCE) based contrastive loss with Focal Loss is introduced. This integration helps improve fraud identification while reducing false negatives.
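The composite objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the weighting factor `lam`, the temperature, the cosine-similarity scoring, and the focal-loss defaults (`gamma=2.0`, `alpha=0.25`) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p = P(fraud) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t)^gamma down-weights examples that are already classified well
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-softmax score of the positive among all candidates."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    logits = [cos(anchor, positive) / temperature]
    logits += [cos(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

def composite_loss(p, y, anchor, positive, negatives, lam=0.5):
    """Hypothetical weighted sum of the two terms; lam is an assumed hyperparameter."""
    return focal_loss(p, y) + lam * info_nce(anchor, positive, negatives)
```

The focal term addresses class imbalance by shrinking the contribution of easy examples, while the InfoNCE term rewards embeddings in which the positive pair is closer than the negatives. How the paper actually balances the two terms is not stated in the abstract.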

2606.17077 2026-06-18 physics.chem-ph cs.AI cs.LG quant-ph 交叉投稿

Comprehensive pKa Data Augmentation from Limited Real Data through an Engineered Models-Quantum Framework

基于工程化模型-量子框架从有限真实数据中全面增强pKa数据

Wang Rui, Liu Dinghao

发表机构 * Department of Chemistry, Tsinghua University（清华大学化学系）； Department of Chemical Engineering, Tsinghua University（清华大学化学工程系）； School of Science, China Pharmaceutical University（中国药科大学理学院）

AI总结针对pKa数据稀疏问题，提出量子辅助分子生成方法，利用优化机器学习模型预测和量子退火器采样，在相干伊辛机上实现极端值采样。

详情

AI中文摘要

质子解离常数(pKa)对于功能分子发现和分子建模至关重要。基于已建立的最大实验pKa数据库iBonD，我们和其他研究人员开发了多种方法，包括基于机器学习的经验预测和高精度能量计算。尽管如此，高质量pKa数据的快速增强仍然受到根本性限制。作为这项工作的一部分，我们使用一组经过广泛优化的机器学习模型，对未标记分子数据集进行了大规模基于回归的pKa预测。结果表明，由于未标记分子数据集的特征分布，pKa数据分布近似正态，尾部区域样本极度稀缺。尽管这种增强对于提高整体数据可用性和预测建模非常有价值，但对于高效发现具有广谱pKa性质的分子仍然不足。为了解决这个问题，我们探索从广阔的化学空间中定向生成具有稀疏pKa性质的分子。鉴于传统的连续潜在空间VAE-RNN分子生成方法稳定性不足，且在补充稀疏数据方面未能显示出明显优势，我们设计并实现了一种量子辅助的稀疏pKa分子生成。在模拟量子退火器上验证了可行性，并在物理相干伊辛机(CIM)上进一步实现了优越的极端值采样。(未完待续)

英文摘要

Proton dissociation constants (pKa) are critical for functional molecule discovery and molecular modeling. Building on iBonD, the largest established experimental pKa database, we and other researchers have developed several methods, including machine-learning-based empirical prediction and high-accuracy energy calculations. Despite this foundation, the rapid augmentation of high-quality pKa data remains fundamentally constrained. As part of this work, we performed large-scale regression-based pKa prediction on unlabeled molecular datasets using a collection of extensively optimized machine-learning models. The results indicate that, owing to the feature distributions of unlabeled molecular datasets, the pKa data distribution approximates normality, with extreme scarcity of tail-region samples. Although such augmentation is highly valuable for improving overall data availability and predictive modeling, it remains insufficient for efficiently discovering molecules with broad-spectrum pKa properties. To address this, we explore the targeted generation of molecules with sparse pKa properties from the vast chemical space. Given that traditional continuous latent space VAE-RNN methods for molecular generation suffer from insufficient stability and fail to demonstrate clear advantages in complementing sparse data, we design and implement a quantum-assisted sparse-pKa molecular generation method. Feasibility is validated on a simulated quantum annealer, and superior extreme-value sampling is further achieved on physical coherent Ising machines (CIMs). (to be continued)

2606.18548 2026-06-18 cs.CY cs.AI 交叉投稿

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

参与强度作为自适应AI伦理教学的学习者建模信号

Yongkyung Oh, Lynn Talton, Alex Bui

发表机构 * University of California, Los Angeles (UCLA)（加州大学洛杉矶分校）

AI总结本研究比较了三种学习者特征（使用频率、自评熟悉度、先前AI教育）与AI感知结果的关系，发现使用频率与所有五项结果显著相关，为自适应AI伦理教学提供了简单的入学者建模信号。

详情

AI中文摘要

在研究生研究训练中，自适应AI伦理教学受益于反映先前LLM经验差异的入学者测量指标。先前的课程或研讨会参与是一个明显的候选指标，但尚不清楚它是否与关键AI感知项目的教学前评分相关。我们比较了三种候选入学者特征：自我报告的使用频率、自评LLM熟悉度和先前AI教育，针对93名参加必修研究伦理课程的生命科学研究生和博士后学员的五项基线感知结果。使用频率与所有五项结果显示出Holm校正的关联，自评熟悉度与三项结果相关，而先前AI教育与任何结果均无关联。在量表低端呈现阈值模式，在训练兴趣和准确性信任方面最为明显，而非在所有五项结果上呈现均匀梯度。在简短的入学者调查中，报告的LLM使用比先前的课程或研讨会更一致地与这些感知相关，自评熟悉度作为次要指标。这些结果表明，简单的教学前行为信号可以为自适应AI伦理教育的轻量级入学者画像提供信息。

英文摘要

Adaptive AI ethics instruction in graduate research training benefits from intake measures that reflect differences in prior LLM experience. Prior coursework or workshop attendance is an obvious candidate, but it is not clear whether it is associated with pre-instruction ratings on key AI perception items. We compare three candidate intake features, self-reported usage frequency, self-rated LLM familiarity, and prior AI education, across five baseline perception outcomes in 93 bioscience graduate and postdoctoral trainees enrolled in a required research ethics course. Usage frequency shows Holm-corrected associations with all five outcomes, self-rated familiarity with three, and prior AI education with none. A threshold-like pattern at the lower end of the scale is most visible for training interest and accuracy trust rather than appearing as a uniform gradient across all five outcomes. In a short intake survey, reported LLM use is more consistently associated with these perceptions than prior coursework or workshops, with self-rated familiarity serving as a secondary indicator. These results suggest that simple pre-instruction behavioral signals can inform lightweight intake profiling for adaptive AI ethics education.
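The abstract reports Holm-corrected associations across five outcomes. As a hedged illustration of what that correction does, the sketch below computes Holm step-down adjusted p-values in plain Python. The function name and the p-values in the usage note are hypothetical and are not taken from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest raw p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by (m - k + 1)
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted
```

For example, calling `holm_adjust` on five made-up p-values such as `[0.001, 0.02, 0.03, 0.04, 0.5]` multiplies the smallest by 5, the next by 4, and so on, so an association counts as significant only if its adjusted value stays below the chosen threshold.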

2606.18596 2026-06-18 cs.HC cs.AI 交叉投稿

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

更好的依从性，更丰富的上下文：基于LLM的对话式语音睡眠日记的现场评估

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

发表机构 * The Johns Hopkins University（约翰霍普金斯大学）； Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine（精神病学与行为科学系，约翰霍普金斯大学医学院）

AI总结通过现场实验评估基于LLM的对话式语音睡眠日记，发现相比文本日记，语音日记提高了依从性并收集了更详细的上下文信息，但结构化字段完整性较低。

详情

AI中文摘要

睡眠日记是行为睡眠医学和失眠认知行为疗法的核心，但每日完成难以维持，静态形式通常为解释夜间睡眠变化提供的上下文有限。我们设计了一个基于LLM的对话式语音日记，通过主动智能音箱提示、结构化对话输入和自适应后续对话，提供临床基础的早晚睡眠日记问题。我们在为期四周的受试者间现场研究中评估了该系统，涉及30名大学生，使用匹配的日记项目、报告窗口和提醒间隔，与基于文本的移动日记进行比较。与文本日记相比，对话式语音日记显示出更高的依从性，并引发了关于日常习惯、压力源、环境条件和其他睡眠相关因素的更详细上下文自我报告。参与者还描述语音日记更容易融入日常，尽管感知完成时间更长。然而，基于语音的对话输入导致某些结构化日记字段的完整性较低，揭示了表达丰富性与结构化精度之间的权衡。这些发现展示了使用基于LLM的对话式语音助手进行纵向健康自我报告的前景和挑战。

英文摘要

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.

2606.18599 2026-06-18 cs.CR cs.AI 交叉投稿

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

MIDS：通过双向Mamba检测CAN总线上的隐蔽伪装和篡改攻击

Qiqi Liu, Runhan Song, Lei Cui, Heng Zhang, Yuyan Sun, Limin Sun

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； Zhongguancun Laboratory（中关村实验室）

AI总结针对CAN总线缺乏加密认证易受攻击的问题，提出MIDS双流框架，利用双向状态空间模型并行处理标识符和载荷，在特斯拉Model 3数据集上F1达96.94%，优于基线8个百分点以上。

详情

AI中文摘要

控制器局域网（CAN）协议是现代车辆中电子控制单元（ECU）的主要通信标准，但其缺乏加密和认证，使其面临一系列安全威胁。现有的入侵检测系统主要针对制造型攻击（通过帧注入实现的DoS、模糊测试、ID欺骗），此类攻击中每ID到达间隔统计等检测信号易于获取。我们转而解决更困难的伪装场景，其中内部攻击者在其原始传输时隙原位替换合法帧，保持流量周期性，使基于流量统计的防御失效。我们提出Mamba入侵检测系统（MIDS），一种创新的双流框架，并行处理CAN标识符和载荷，并通过双向选择性状态空间建模重建其联合时间语义。为评估MIDS，我们从物理特斯拉Model 3在三种驾驶模式下收集了超过1亿个CAN帧，并合成了54种伪装攻击变体，涵盖仅ID、仅数据和组合修改。MIDS在该数据集上达到96.94%的F1分数，超过最强可复现基线8个百分点以上，同时保持1.147毫秒的单窗口推理延迟——为实时车载部署留有充足余量。为验证泛化能力，我们进一步在四个公开基准（ROAD、CrySyS、OTIDS、CT&T）上评估MIDS，涵盖伪装和注入场景；在统一的5折协议下，MIDS的F1分数从93.70%到99.61%，超过八个复现基线中最强者最多13.94个百分点。

英文摘要

The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting, in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency, which leaves ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.

2606.18611 2026-06-18 cs.SD cs.AI cs.LG stat.ML 交叉投稿

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

QC-GAN: 一种参数高效的四元数Conformer GAN用于高保真语音增强

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

发表机构 * The Asahi Shimbun Company（朝日新闻社）； Tokyo Woman's Christian University（东京女子大学）

AI总结提出参数高效的QC-GAN，结合四元数Conformer生成器和MetricGAN训练，通过汉密尔顿积共享权重减少参数量，在VoiceBank+DEMAND上以0.89M参数达到PESQ 3.48，性能媲美两倍大小模型。

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

2606.18617 2026-06-18 cs.CY cs.AI 交叉投稿

AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

AI驱动的人类导师评估：将培训表现与实际教学实践联系起来

Danielle R. Thomas, Marie Cynthia Abijuru Kamikazi, Clara Brandt, Conrad Borchers, Kenneth R. Koedinger

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Vanderbilt University（范德比大学）

AI总结提出一种AI系统，利用生成式AI分析真实辅导转录，评估导师技能迁移，发现培训表现显著预测实际教学得分（效应量0.25 SD），并贡献开放数据集和评分标准。

Comments Full research paper accepted at EC-TEL 2026

详情

AI中文摘要

存在大量的导师培训平台。然而，很少有平台基于实际表现提供AI驱动的人类导师培训和评估。我们提出一个AI驱动系统，评估培训中的开放式回答和真实的实际辅导。与仅通过在线培训或模拟评估学习的平台不同，我们的系统利用生成式AI（Gemini-2.5-pro）分析真实辅导的转录，衡量导师技能向实际应用的迁移。远程辅导学生数学的人类导师（N=86）完成了六个基于场景的课程，平均显著学习增益为7.4%。使用跨405个会话-课程对的混合效应模型，我们发现培训表现显著预测实际辅导转录得分，效应量为0.25 SD。模型比较（AIC/BIC）表明，培训期间开放式回答和多项选择表现的平均值最能预测实际辅导表现，尽管开放式回答相对更具预测性。探索性分析显示，培训后，导师遇到应用技能的教学机会的可能性显著增加（从61.1%到68.9%），并且在这些机会中表现出更高的执行质量（从65.5%到68.1%）。中断时间序列分析表明，这些导师改进是随时间逐渐趋势的一部分，而非培训的即时干预效果。我们展示了一种将导师培训与实际评估联系起来的AI驱动方法。为此，我们贡献了开放数据集、AI提示和评分标准，以支持透明度和可重复性。

英文摘要

There exist numerous tutor training platforms. However, few provide AI-driven training and evaluation for human tutors based on real-life performance. We present an AI-driven system that assesses both open responses during training and authentic real-life tutoring. Unlike platforms that only assess learning through online training or simulations, our system utilizes Generative AI (Gemini-2.5-pro) to analyze transcriptions of authentic tutoring, measuring the transfer of tutor skills to real-life application. Human tutors instructing students remotely in math (N=86) completed six scenario-based lessons, averaging a significant 7.4% learning gain. Using mixed-effects models across 405 session-to-lesson pairs, we found that training performance significantly predicted real-life transcript scores with an effect size of 0.25 SD. Model comparison (AIC/BIC) indicated averaging open response and multiple choice performance during training predicted real-life tutor performance best, although open responses were comparatively more predictive. Exploratory analysis showed that after training, tutors were significantly more likely to encounter pedagogical opportunities to apply their skills (61.1% to 68.9%) and demonstrated higher execution quality within those opportunities (65.5% to 68.1%). Interrupted time series analysis suggested that these tutor improvements were part of a gradual trend over time rather than an immediate intervention effect of training. We illustrate an AI-driven method to link tutor training with real-life assessment. In doing so, we contribute open datasets, AI prompts, and scoring rubrics to support transparency and reproducibility.

2606.18645 2026-06-18 eess.AS cs.AI 交叉投稿

面向旋转系统不平衡表征的域偏移感知神经网络

Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello

发表机构 * Springer Nature

AI总结提出域偏移感知神经网络，通过最大均值差异策略对齐源域与目标域特征，解决变工况下旋转轴不平衡质量估计的回归问题，实验证明该方法在域偏移未知时显著提升预测精度。

详情

AI中文摘要

本文研究了域偏移感知神经网络在回归任务中的应用，旨在估计不同运行条件下旋转轴的不平衡质量。实验数据来自一个测试台，其中主轴上安装有带不平衡质量的法兰，在不同转速下驱动，同时可选择性地激活副轴以引入域差异。不平衡质量固定在径向距离上，使用三轴加速度计记录系统的动态响应。质量估计的逆问题在域自适应框架中提出，网络采用最大均值差异策略进行训练，以对齐源域和目标域的特征表示。结果表明，显式处理域偏移能有效提高预测精度，尤其是在系统的物理行为和域偏移来源不完全已知且超出训练条件的情况下。这些发现凸显了域偏移感知模型在结构健康监测回归任务中的潜力。

英文摘要

This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.
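The maximum mean discrepancy (MMD) term mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: a Gaussian (RBF) kernel with a fixed bandwidth `sigma`, the biased estimator, and plain Python lists standing in for feature vectors. The abstract does not specify the kernel, the estimator, or how the term is weighted against the regression loss.

```python
import math

def rbf(x, y, sigma=1.0):
    """Gaussian kernel between two feature vectors (sigma is an assumed bandwidth)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def mmd2(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two lists of feature vectors."""
    def mean_k(A, B):
        # Average kernel value over all pairs drawn from A and B
        return sum(rbf(a, b, sigma) for a in A for b in B) / (len(A) * len(B))
    # Small when the two feature distributions overlap, large when they differ
    return mean_k(source, source) + mean_k(target, target) - 2.0 * mean_k(source, target)
```

In a domain adaptation setup of this kind, a quantity like `mmd2` would typically be computed on intermediate network features from source and target batches and added to the task loss, so that minimizing it pulls the two feature distributions together.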

2606.18897 2026-06-18 cs.IR cs.AI 交叉投稿

Spotlight: 协同种子探索与抢占式GPU用于DiT强化学习后训练

Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

发表机构 * NTU Singapore（南洋理工大学）； Hong Kong University of Science and Technology（香港科技大学）； Alibaba Group（阿里巴巴集团）

AI总结针对DiT强化学习后训练成本高的问题，提出Spotlight系统，通过利用探索对旧权重的容忍性和SP组快速重配置，在抢占式GPU上实现高效训练，加速4倍并降低成本1.4-6.4倍。

详情

AI中文摘要

扩散Transformer（DiT）的强化学习（RL）后训练成本极高，需要数千块高端GPU。现有工作探索了两个降低成本的方向：种子探索通过选择高对比度样本来改善训练收敛，但增加了关键路径的计算量；抢占式GPU提供69-77%的成本降低，但在训练期间处于空闲状态，因为DiT rollout几乎同时完成，这阻止了类似LLM的rollout与训练流水线化。抢占式GPU的抢占进一步破坏了序列并行（SP）组，导致GPU拓扑碎片化。我们提出了Spotlight，这是第一个利用抢占式GPU进行DiT RL后训练的系统。Spotlight基于我们设计的两个关键洞察：（1）我们证明探索可以容忍过时的模型权重，因为使用前一次迭代模型权重的探索保留了随机种子的相对排序，允许探索在训练期间在空闲的抢占式GPU上运行。（2）SP重配置可以重用节点内状态，将组恢复时间从分钟级缩短到亚秒级启动。基于这些洞察，Spotlight引入了三种技术：基于bandit的探索规划器，在训练时间预算内最大化奖励方差；弹性序列并行，通过持久调度器和节点内权重复制动态重配置SP组；以及抢占感知的拉取式请求调度器，平衡负载并在抢占时提交进行中的状态。我们在开源RL平台ROLL上实现了Spotlight，并在Qwen-Image后训练上进行了评估。Spotlight达到相同目标验证分数的速度比基线快4倍，总成本降低1.4-6.4倍，同时在分辨率512×512和1280×1280的DeepSeek-OCR和Geneval数据集上实现了更优的图像质量。

英文摘要

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights: (1) exploration can tolerate stale model weights, because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score 4× faster than baselines, reducing total cost by 1.4-6.4× while achieving superior image quality on DeepSeek-OCR and Geneval datasets with resolution 512×512 and 1280×1280.

2606.19026 2026-06-18 cs.LG cs.AI physics.ao-ph 交叉投稿

A Hybrid LSTM-Vision Transformer Architecture for Predicting HRRR Forecast Errors

混合LSTM-视觉Transformer架构用于预测HRRR预报误差

David Aaron Evans, Jay C. Rothenberger, Kara J. Sulia, Nick P. Bassill, Chris D. Thorncroft

发表机构 * Atmospheric Sciences Research Center, University at Albany, SUNY（纽约州立大学奥尔巴尼分校大气科学研究中心）； University of Oklahoma（俄克拉荷马大学）； State Weather Risk Communication Center, University at Albany, SUNY（纽约州立大学奥尔巴尼分校州天气风险沟通中心）

AI总结提出LSTM-ViT混合框架，结合地表观测时序与大气廓线，预测HRRR降水、风速和温度预报误差，相比基线LSTM性能提升，尤其降水误差预测技能提高约两倍。

Comments This manuscript is a preprint and has been submitted for peer review to the Artificial Intelligence for the Earth Systems journal. The content is subject to change based on the outcome of the peer review process and should not be considered final or definitive. Copyright in this Work may be transferred without further notice

详情

AI中文摘要

高分辨率数值天气预报（NWP）系统中的预报误差通常与未解析的边界层（PBL）过程、对流、地形诱导环流以及其他垂直结构的大气现象有关。先前的研究表明，长短期记忆（LSTM）网络可以利用中尺度观测成功预测高分辨率快速刷新（HRRR）模型的预报误差，但我们认为性能下降与复杂垂直大气演化时期有关。为解决这一局限，我们开发了一种混合LSTM-视觉Transformer（LSTM-ViT）框架，将来自地表观测的时间序列学习与来自纽约州中尺度剖面仪网络的垂直大气廓线相结合。LSTM-ViT框架被训练用于预测单个中尺度站点上HRRR的逐时降水、10米风速和2米温度预报误差。在所有三个预测变量中，相对于基线LSTM架构，引入剖面仪导出的大气结构提高了预报误差预测技能，最大提升出现在较短的预报提前期和PBL活动增强期间。对于降水预报误差，改进尤为显著，LSTM-ViT框架相对于基线LSTM实现了约两倍的预测技能提升，同时更好地捕捉了对流驱动的误差演变并减少了与PBL过程相关的退化。这些结果表明，将时间序列学习与垂直注意力机制相结合，为改进业务NWP系统中的预报误差预测提供了一条具有物理意义的途径。我们的研究为预报员提供了关于模型偏差和预报置信度的增强指导。

英文摘要

Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.

2606.19042 2026-06-18 cs.SE cs.AI 交叉投稿

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

可变性去哪了？从氛围编码到通过再生的产品线

Xhevahire Tërnava

发表机构 * LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France（LTCI，巴黎电信学院，巴黎理工学院，法国帕莱索）

AI总结研究AI驱动编程（氛围编码）中可变性缺失问题，提出通过再生实现可变性（VbR）方法，让LLM作为推导引擎生成无死代码的变体二进制。

Comments VARIABILITY 2026

详情

语言模型作为接口而非预言机：用于小儿阑尾炎的混合LLM-ML系统

Soheyl Bateni, Maryam Abdolali

发表机构 * K. N. Toosi University of Technology（K. N. 图西理工大学）

AI总结提出ClaMPAPP混合系统，利用LLM从自由文本中提取结构化特征，再由XGBoost分类器进行诊断，在两个独立队列中优于端到端LLM，提高了诊断稳定性和可审计性。

详情

AI中文摘要

大型语言模型（LLM）通过解释自由文本记录可使临床决策支持更易获取，但直接作为诊断引擎使用时，受提示敏感性、信息顺序以及看似合理但错误的输出限制。结构化机器学习模型提供更稳定的风险预测，但需要难以与叙事性临床工作流集成的表格输入。我们提出ClaMPAPP（临床语言辅助机器学习阑尾炎诊断流程），这是一个混合系统，将LLM用作接口而非最终决策者。ClaMPAPP从类似笔记的叙述中提取模式约束的临床特征，应用确定性合理性检查，并将验证后的特征传递给基于临床、实验室和超声变量训练的XGBoost分类器。我们在来自德国医院的两个独立小儿阑尾炎队列上评估了ClaMPAPP，并将其与端到端LLM基线（包括开源和专有模型）进行比较。为在测试自由文本输入时保留真实标签，通过模板渲染和约束LLM重写从结构化电子健康记录生成叙述，并附加句子顺序排列以评估位置鲁棒性。ClaMPAPP在内部和外部验证中均达到最强的整体诊断性能，同时最小化漏诊阑尾炎病例（急性分诊中的关键安全问题）。端到端LLM表现出不稳定的灵敏度-特异性权衡，且在叙述重排下性能下降更严重。这些结果支持LLM作为接口、ML作为预测器的设计，将自然语言可用性与预测推理分离，并为临床决策支持提供更可审计的路径。

英文摘要

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.

2606.19247 2026-06-18 cs.HC cs.AI cs.CY 交叉投稿

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

阿尔茨海默病和痴呆症护理人员的心理健康与技术需求分类

Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

发表机构 * Siebel School of Computing and Data Science（Siebel计算与数据科学学院）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Department of Psychology（心理学系）； Illinois Neurological Institute（伊利诺伊神经病学研究所）； Department of Human-Centered Computing（以人为中心计算系）； Manning College of Information and Computer Sciences（曼宁信息与计算机科学学院）

AI总结本研究提出护理人员心理健康与技术分类法，系统关联AD/ADRD护理人员需求与技术干预类别，识别护理优先事项与现有技术支持的错配，并强调关系紧张和同情疲劳等未充分服务的领域。

详情

AI中文摘要

照顾阿尔茨海默病及相关痴呆症（AD/ADRD）患者的家庭成员构成了全球长期护理的基础。2023年，超过1100万美国亲友贡献了180亿小时的无偿护理，往往以牺牲自身身心健康为代价。这些非正式护理人员——也被称为“隐形第二患者”——经历着更高的心理健康问题发生率。然而，研究通常将其复杂的心理社会经历简化为单一的护理负担概念，掩盖了哪些具体需求未得到满足或得到有效支持。与此同时，数字和人工智能技术正在迅速扩展，从智能手机应用和视频会议到传感器平台和AI聊天机器人。然而，医学、心理学和技术研究之间缺乏共享框架，限制了累积进展。本研究引入了一个护理人员心理健康与技术分类法，系统地将AD/ADRD护理人员的需求与相应的技术干预类别联系起来。基于跨学科文献综述和两项针对护理人员的定性研究，该分类法识别了护理优先事项与现有技术支持之间的错配，突出了关系紧张和同情疲劳等未充分服务的领域，并提出了自适应、响应式系统的设计方向。该框架提供了一个共享词汇，以指导临床医生、研究人员和技术设计师在痴呆症护理中开发更以人为中心和临床基础的创新。

英文摘要

Family members caring for individuals with Alzheimer's disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers, also referred to as the "invisible second patients", experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet or effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

2606.19286 2026-06-18 cs.HC cs.AI cs.CY 交叉投稿

Correct Yourself, Keep My Trust: How Self-Correction and Social Connection Shape Credibility in Social Chatbots

纠正自己，保持信任：自我纠正和社会联系如何塑造社交聊天机器人的可信度

Biswadeep Sen, Yi-Chieh Lee

发表机构 * School of Computing, National University of Singapore, Singapore（新加坡国立大学计算学院，新加坡）； Computer Science, National University of Singapore, Singapore（新加坡国立大学计算机科学，新加坡）； National University of Singapore（新加坡国立大学）

AI总结通过实验比较三种错误纠正策略，发现自我纠正不损害聊天机器人可信度，且用户社会联系强度仅在自我纠正时显著预测信念改变。

AI中文摘要

当社交聊天机器人犯错时——它们确实会犯错——它们的恢复方式决定了用户是否会再次信任它们。社交聊天机器人正日益融入日常生活，但它们仍然容易生成令人信服但不准确的信息。它们与用户建立的社会联系使得此类错误尤其具有后果性。我们进行了一项受试者间实验（N=120），比较了三种错误纠正策略：网页撤回、同一社交聊天机器人的自我纠正以及专家聊天机器人的纠正。我们的结果揭示了两个关键发现。首先，所有三种策略都能同样好地纠正错误，但只有自我纠正不会损害聊天机器人的可信度：参与者对自我纠正的聊天机器人在可信度和感知专业性上的评分显著高于其错误由外部来源纠正的聊天机器人。其次，通过社会吸引力和自我披露测量的用户与聊天机器人的社会联系强度，仅在聊天机器人自我纠正时显著预测信念改变的大小。将纠正外包给外部来源完全切断了这种联系。这些发现表明，社交聊天机器人应该纠正自己的错误，而不是外包纠正，并且投资于社会联系是一种功能性机制，能增强纠正效果，而不仅仅是一种设计特征。我们讨论了设计能够保持长期可信度同时有效处理自身错误的聊天机器人的启示。

英文摘要

When social chatbots make mistakes, and they do, how they recover determines whether users trust them again. Social chatbots are increasingly integrated into everyday life, yet they remain prone to generating convincing but inaccurate information. The social connection they build with users makes such errors particularly consequential. We conducted a between-subjects experiment (N=120) comparing three error correction strategies: a webpage retraction, self-correction by the same social chatbot, and correction by an expert chatbot. Our results reveal two key findings. First, all three strategies corrected the error equally well, but only self-correction did so without damaging the chatbot's credibility: participants rated self-correcting chatbots significantly higher in both trustworthiness and perceived expertise than chatbots whose errors were corrected by external sources. Second, the strength of the user's social connection with the chatbot, measured through social attraction and self-disclosure, significantly predicted the magnitude of belief change, but only when the chatbot corrected itself. Outsourcing corrections to an external source severed this link entirely. These findings suggest that social chatbots should correct their own mistakes rather than outsource corrections, and that investing in social connection is a functional mechanism that amplifies correction effectiveness, not merely a design feature. We discuss implications for designing chatbots that maintain long-term credibility while effectively addressing their own errors.

2606.14824 2026-06-18 cs.AR cs.AI cs.LG 交叉投稿

Running hardware-aware neural architecture search on embedded devices under 512MB of RAM

在内存不足512MB的嵌入式设备上运行硬件感知的神经架构搜索

Andrea Mattia Garavagno, Edoardo Ragusa, Paolo Gastaldo, Antonio Frisoli

发表机构 * University of Bologna（博洛尼亚大学）； Politecnico di Milano（米兰理工学院）

AI总结提出一种在资源受限的嵌入式设备上直接运行的硬件感知神经架构搜索方法，生成针对低端MCU的微型CNN，在Visual Wake Word数据集上达到最先进水平。

DOI: 10.1109/ICCE59016.2024.10444268
Journal ref: 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 2024, pp. 1-2

AI中文摘要

本文提出了一种新颖的硬件感知神经架构搜索（HW NAS）方法，该方法考虑了运行它的计算平台上的可用资源，使其能够在各种嵌入式设备上执行。所提出的HW NAS生成针对低端微控制器单元（MCU）的微型卷积神经网络（CNN），这些MCU通常用于物联网（IoT）或可穿戴机器人领域，从而开辟了新的应用场景。网关可以运行它来根据获取的数据定制CNN的架构，而无需使用外部服务器，从而确保隐私。所提出的技术在Visual Wake Word数据集（一个标准的TinyML基准）上的多个人体识别任务中，在多个嵌入式设备上取得了最先进的结果。

英文摘要

This document proposes a novel approach to hardware-aware neural architecture search (HW NAS) that considers the resources available on the computing platform running it, enabling its execution on various embedded devices. The presented HW NAS produces tiny convolutional neural networks (CNNs) targeting low-end microcontroller units (MCUs), typically involved in the Internet of Things (IoT) or wearable robotics, opening new use cases. A gateway could run it to tailor CNNs' architecture on the acquired data without using external servers, ensuring privacy. The proposed technique achieves state-of-the-art results in the human-recognition tasks on the Visual Wake Word dataset, a standard TinyML benchmark, on several embedded devices.

2606.16290 2026-06-18 cs.LG cs.AI 交叉投稿

An affordable hardware-aware neural architecture search for deploying convolutional neural networks on ultra-low-power computing platforms

一种经济实惠的硬件感知神经架构搜索，用于在超低功耗计算平台上部署卷积神经网络

Andrea Mattia Garavagno, Edoardo Ragusa, Antonio Frisoli, Paolo Gastaldo

发表机构 * University of Genoa（热那亚大学）； Scuola Superiore Sant’Anna（圣安娜高等研究学院）

AI总结提出一种轻量级硬件感知神经架构搜索方法，生成可在超低功耗微控制器上运行的微型CNN，在保持分类精度的同时降低搜索成本。

DOI: 10.1109/LSENS.2024.3387056
Journal ref: IEEE Sensors Letters, vol. 8, no. 5, pp. 1-4, May 2024

AI中文摘要

硬件感知神经架构搜索（HW-NAS）通过自动设计能够满足预置硬件约束的神经架构，使得卷积神经网络（CNN）能够集成到微控制器设备中。然而，最先进的HW-NAS针对的是高性能微控制器，其功耗无法满足传感节点的要求。本文提出了一种HW-NAS方法，生成可在超低功耗微控制器上运行的微型CNN，其搜索过程轻量级，甚至可以在嵌入式设备上执行。在三个著名的微型计算机视觉基准测试上的实证结果表明，所提出的HW-NAS能够在保持最先进分类精度的同时生成微型CNN。

英文摘要

Hardware-aware neural architecture search (HW-NAS) allows the integration of Convolutional Neural Networks (CNNs) in microcontroller devices by automatically designing neural architectures that fit prearranged hardware constraints. However, state-of-the-art HW-NAS methods target high-performance microcontrollers, whose power consumption does not meet the requirements of sensing nodes. This work presents an HW-NAS that generates tiny CNNs able to run on ultra-low-power microcontrollers and features a lightweight search procedure that can itself be executed on embedded devices. Empirical results on three well-known benchmarks for tiny computer vision show that the proposed HW-NAS generates tiny CNNs while preserving state-of-the-art classification accuracy.

2310.05753 2026-06-18 cs.AI 版本更新

Large-Scale OD Matrix Estimation with A Deep Learning Method

基于深度学习的大规模OD矩阵估计

Zheli Xiong, Defu Lian, Enhong Chen, Gang Chen, Xiaomin Cheng

发表机构 * IEEE Publication Technology Group（IEEE出版技术组）

AI总结提出一种结合深度学习与数值优化的方法，利用探针交通流推断结构约束，实现大规模OD矩阵的实时估计，无需先验信息且具有良好泛化性。

Comments 12 pages, 25 figures

AI中文摘要

起点-终点（OD）矩阵估计是智能交通系统（ITS）的关键方面。它涉及通过回归当前观测值（如路段交通计数，例如使用最小二乘法）来调整初始OD矩阵。然而，OD估计问题缺乏足够的约束，在数学上是欠定的。为缓解此问题，一些研究者将先验OD矩阵作为回归目标以提供更多结构约束，但该方法高度依赖于可能过时的先验矩阵。另一些研究者通过传感器数据（如车辆轨迹和速度）添加结构约束，这些数据能实时反映更当前的结构约束。我们提出的方法将深度学习与数值优化算法相结合，以推断矩阵结构并指导数值优化。该方法结合了深度学习与数值优化算法的优势。神经网络（NN）学习从探针交通流中推断结构约束，消除了对先验信息的依赖，并提供了实时性能。此外，由于NN的泛化能力，该方法在工程上经济高效。我们进行了测试，证明了该方法在大规模合成数据集上的良好泛化性能。随后，我们在真实交通数据上验证了方法的稳定性。实验证实了结合NN与数值优化的优势。

英文摘要

The estimation of origin-destination (OD) matrices is a crucial aspect of Intelligent Transport Systems (ITS). It involves adjusting an initial OD matrix by regressing on current observations, such as traffic counts on road sections (e.g., using least squares). However, the OD estimation problem lacks sufficient constraints and is mathematically underdetermined. To alleviate this problem, some researchers incorporate a prior OD matrix as a target in the regression to provide more structural constraints. However, this approach is highly dependent on the existing prior matrix, which may be outdated. Others add structural constraints through sensor data, such as vehicle trajectory and speed, which can reflect more current structural constraints in real time. Our proposed method integrates deep learning and numerical optimization algorithms to infer matrix structure and guide numerical optimization, combining the advantages of both. The neural network (NN) learns to infer structural constraints from probe traffic flows, eliminating dependence on prior information and providing real-time performance. Additionally, due to the generalization capability of the NN, this method is economical in engineering. We conducted tests to demonstrate the good generalization performance of our method on a large-scale synthetic dataset. Subsequently, we verified the stability of our method on real traffic data. Our experiments confirmed the benefits of combining NN and numerical optimization.
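
The underdetermination mentioned in this abstract can be seen on a toy network. The sketch below is not the authors' method. It assumes a hypothetical network with 2 links and 3 OD pairs, and shows that two different OD vectors reproduce the same link counts, so a least-squares estimate inherits whatever prior it starts from. All names (`link_counts`, `closest_od`, the matrix `A`) are illustrative.

```python
def link_counts(assignment, od):
    """Link counts y = A x for assignment matrix A and OD flows x."""
    return [sum(a * x for a, x in zip(row, od)) for row in assignment]


def closest_od(assignment, counts, prior):
    """OD vector nearest to `prior` (in least squares) that matches `counts`.

    Uses x = prior + A^T (A A^T)^{-1} (y - A prior). The 2x2 inverse is
    written out by hand, so this sketch only handles two links.
    """
    fitted = link_counts(assignment, prior)
    residual = [y - f for y, f in zip(counts, fitted)]
    # Gram matrix G = A A^T and its determinant.
    g = [[sum(a * b for a, b in zip(r1, r2)) for r2 in assignment]
         for r1 in assignment]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    # Multipliers lam = G^{-1} residual.
    lam = [(g[1][1] * residual[0] - g[0][1] * residual[1]) / det,
           (g[0][0] * residual[1] - g[1][0] * residual[0]) / det]
    # Correction A^T lam added to the prior.
    return [p + assignment[0][j] * lam[0] + assignment[1][j] * lam[1]
            for j, p in enumerate(prior)]


# Hypothetical network: link 1 carries OD pairs 1 and 2, link 2 carries 2 and 3.
A = [[1, 1, 0],
     [0, 1, 1]]
counts = [30, 50]

# Two different demand patterns produce identical link counts.
assert link_counts(A, [10, 20, 30]) == counts
assert link_counts(A, [30, 0, 50]) == counts

# Both estimates fit the counts exactly, yet they differ with the prior.
from_zero_prior = closest_od(A, counts, [0, 0, 0])
from_old_prior = closest_od(A, counts, [30, 0, 50])
```

With only link counts, nothing in the data distinguishes the two estimates. That gap is what the additional structural constraints described in the abstract are meant to fill.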

2604.25848 2026-06-18 cs.AI 版本更新

A Distributionally Robust Reinforcement Learning Framework for Constrained Urban EV Dispatch

面向约束城市电动汽车调度的分布鲁棒强化学习框架

An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui

发表机构 * College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam（VinUniversity 工程与计算机科学学院，河内，越南）； Center for Environmental Intelligence, VinUniversity, Hanoi, Vietnam（VinUniversity 环境智能中心，河内，越南）

AI总结针对城市电动汽车调度中充电站和馈线容量约束及不确定需求，提出基于半马尔可夫决策过程与分布鲁棒软演员-评论家算法，通过图卷积编码器和滚动混合整数线性规划保证可行性，在纽约出租车数据仿真中实现最高净利润且零违规。

AI中文摘要

我们研究城市规模的电动汽车（EV）网约车车队控制，其中调度、重新定位和充电决策必须在不确定且空间相关的出行需求和旅行时间下，遵守充电器和馈线限制。我们将问题建模为六边形网格半马尔可夫决策过程（semi-MDP），具有混合动作——用于服务、重新定位和充电的离散动作，以及连续充电功率——和可变动作持续时间。为了保证训练和部署期间的物理可行性，策略在由掩码温度退火actor产生的高层意图上学习。这些意图在每个决策步骤通过一个时间受限的滚动混合整数线性规划（MILP）进行投影，该规划严格强制执行荷电状态、充电端口和馈线约束。为了缓解分布偏移，我们针对一个Wasserstein-1模糊集优化软演员-评论家（SAC）智能体，该模糊集使用图对齐的马氏基础度量来捕捉空间相关性。鲁棒备份使用Kantorovich-Rubinstein对偶、投影次梯度内环和原始-对偶风险预算更新。我们的架构结合了两层图卷积网络（GCN）编码器、双评论家和一个驱动对手的价值网络。基于纽约出租车数据构建的大规模电动汽车车队模拟器上的实验表明，PD-RSAC实现了最高的净利润，达到122万美元，而强启发式、单智能体RL和多智能体RL基线（包括Greedy、SAC、MAPPO和MADDPG）的净利润为58万至70万美元，同时保持零馈线限制违规。

英文摘要

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M-$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.

2605.03460 2026-06-18 cs.AI cs.LG 版本更新

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

FinSTaR：面向时间序列推理模型的金融推理

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

发表机构 * LG AI Research（LG人工智能研究）

AI总结针对时间序列推理模型在金融领域的失效问题，提出基于2x2能力分类法的FinSTaR模型，通过Compute-in-CoT和Scenario-Aware CoT策略在FinTSR-Bench基准上达到78.9%平均准确率。

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

AI中文摘要

时间序列推理模型在通用领域表现出色，但在具有独特特征的金融领域却持续失败。我们提出一个通用的2x2能力分类法，通过交叉1)单实体与多实体分析，以及2)当前状态评估与未来行为预测来划分TSRM能力。我们在金融领域实例化该分类法——其中确定性评估与随机性预测的区分尤为关键——形成十个金融推理任务，并基于标普股票构建FinTSR-Bench基准。为此，我们提出FinSTaR（金融时间序列思考与推理），在FinTSR-Bench上训练，并针对每个类别采用不同的思维链策略。对于评估（确定性，即可从可观测数据计算得出），我们采用Compute-in-CoT，一种程序化思维链，使模型能够直接从原始价格推导答案。对于预测（本质上是随机的，即受不可观测因素影响），我们采用场景感知思维链，在做出判断前生成多种场景，模拟金融分析师在不确定性下的推理方式。所提方法在FinTSR-Bench上达到78.9%的平均准确率，显著优于LLM和TSRM基线。此外，我们展示了四个能力类别通过联合训练具有互补性和相互增强性，并且场景感知思维链相比标准思维链持续提升预测准确率。代码已公开：https://github.com/seunghan96/FinSTaR。

英文摘要

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain -- where the distinction between deterministic assessment and stochastic prediction is particularly critical -- as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.

2606.08532 2026-06-18 cs.AI 版本更新

DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations

DN-Hypo-Pipeline：一种基于大语言模型和科学解释的AI驱动假设生成工作流

Lei Lin, Ronghao Wang, Chunbao Zhou, Jue Wang, Yangang Wang

发表机构 * Computer Network Information Center, Chinese Academy of Sciences, China（中国科学院计算机网络信息中心）

AI总结提出DN-Hypo-Pipeline，利用大语言模型和科学解释作为先验知识，从现有文献中推导新假设，在数据科学建模中通过统计推断和专家评估证明优于直接生成方法，并验证了生成假设对应的算法性能。

AI中文摘要

科学假设是研究的第一步并经过实验验证，但它也反映了对科学现象的深刻理解和推理。我们引入了DN-Hypo-Pipeline，一种基于大语言模型的AI驱动工作流，旨在通过利用科学解释作为先验知识来支持结构化科学思维和假设生成。该流水线帮助研究人员从现有文献中推导出新假设。给定研究论文的被解释项（explanandum，即结论），它识别潜在的定律、理论和原理，并为观察到的现象重构一个新的、尚未验证的解释。我们在数据科学建模领域使用三篇高被引论文评估了DN-Hypo-Pipeline。由LLM作为评判者和人类专家评估支持的统计推断表明，我们的流水线比直接生成方法更有效。此外，我们通过开发相应新颖算法验证了得分最高的两个生成假设，这些算法优于原始论文中提出的基线模型。除了在数据科学中的应用，DN-Hypo-Pipeline还提供了一个理论框架，不仅包含了理论指导的数据科学建模方法，还揭示了建模过程更基础的结构。此外，这种方法本质上是理论指导建模的推广，具有扩展到其他领域和更广泛科学学科的潜力。

英文摘要

A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of and reasoning about scientific phenomena. We introduce DN-Hypo-Pipeline, an AI-powered workflow based on large language models, designed to support structured scientific thinking and hypothesis generation by leveraging scientific explanations as prior knowledge. This pipeline assists researchers in deriving novel hypotheses from existing literature. Given the explanandum (i.e., the conclusion) of a research paper, it identifies underlying laws, theories, and principles, and reconstructs a new, yet-to-be-verified explanation for the observed phenomenon. We evaluated DN-Hypo-Pipeline in the field of data science modeling using three highly cited papers. Statistical inference, supported by both LLM-as-judge assessment and human expert evaluation, demonstrates that our pipeline is more effective than direct generation methods. Additionally, we validated the two highest-scoring generated hypotheses by developing corresponding novel algorithms, which outperformed the baseline models presented in the original papers. Beyond application in data science, DN-Hypo-Pipeline provides a theoretical framework that not only encompasses theory-guided data science modeling methods but also reveals a more fundamental structure of the modeling process. Moreover, this approach is essentially a generalization of theory-guided modeling, offering potential for extension to other domains and across a broader range of scientific disciplines.

2606.10376 2026-06-18 cs.AI cs.IT math.IT 版本更新

Belief-Space Control for Personalized Cancer Treatment via Active Inference

基于主动推理的个性化癌症治疗信念空间控制

Deniz Sargun, H. Bugra Tulay, C. Emre Koksal

发表机构 * American Association for Cancer Research（美国癌症研究协会）； AACR Project GENIE registry（AACR Project GENIE 注册中心）； AACR Project GENIE Biopharma Collaborative（AACR Project GENIE 生物制药合作组织）

AI总结提出用主动推理将癌症治疗建模为信念空间规划问题，在测量预算下统一目标导向控制与信息获取，实现患者分类与高效治疗。

Comments 11 pages including appendix

2509.24725 2026-06-18 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net：基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Delft University of Technology（代尔夫特理工大学）

AI总结本文提出Q-Net框架，通过结合卡尔曼滤波与神经网络，解决信号交叉口队列长度估计中的数据融合问题，提升空间转移性和实时性，实现无需昂贵传感设备的准确队列估计。

DOI: 10.1016/j.trc.2026.105809

AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源可用，车辆流的部分可观测性仍使这一任务变得复杂：(i) 接近停止线的环形检测器提供的车辆计数汇总数据，以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD)。然而，如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此，本文提出Q-Net：一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战，如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构，并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现，并通过将aFCD测量分组为固定大小的局部组来提高空间转移性，使可学习参数的数量与路段长度无关。在荷兰鹿特丹城市主干道上的评估显示，Q-Net优于基线方法，能够准确追踪队列的形成和消散，并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性，Q-Net在无需昂贵的传感基础设施（如摄像头或雷达）的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.
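
The Kalman predict-update structure named in this abstract can be sketched for a scalar queue-length state. This is a minimal illustration under stated assumptions, not Q-Net itself: the abstract says Q-Net learns time-varying gain dynamics from data, whereas the sketch substitutes the classical closed-form gain. It also assumes a hypothetical direct, noisy queue measurement rather than loop-detector counts or aFCD speeds. The function name and all parameters are illustrative.

```python
def kalman_queue_step(queue, var, inflow, outflow, measured,
                      process_var, meas_var):
    """One predict-update cycle for a scalar queue-length state.

    Predict: vehicle conservation, the queue grows by inflow minus outflow.
    Update: blend the prediction with a noisy measurement via the Kalman gain.
    """
    # Predict step: propagate the state and inflate its uncertainty.
    queue_pred = queue + inflow - outflow
    var_pred = var + process_var
    # Update step with the classical gain (Q-Net would learn this instead).
    gain = var_pred / (var_pred + meas_var)
    queue_new = queue_pred + gain * (measured - queue_pred)
    var_new = (1.0 - gain) * var_pred
    return queue_new, var_new, gain


# Hypothetical numbers: 5 queued vehicles, 3 arrive, 1 departs, sensor reads 9.
queue, var, gain = kalman_queue_step(
    queue=5.0, var=4.0, inflow=3.0, outflow=1.0,
    measured=9.0, process_var=1.0, meas_var=5.0)
```

Here the conservation-based prediction is 7 vehicles and the gain is 0.5, so the updated estimate of 8 lies halfway between the prediction and the measurement.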

2307.05623 2026-06-18 cs.LG cs.AI 版本更新

A DeepLearning Framework for Dynamic Estimation of Origin-Destination Sequence

一种用于动态估计起点-终点序列的深度学习框架

Zheli Xiong, Defu Lian, Enhong Chen, Gang Chen, Xiaomin Cheng

发表机构 * School of Data Science University of Science（数据科学学院中国科学技术大学）； Yangtze River Delta Information Intelligence Innovation Research Institute, China（长江三角洲信息智能创新研究院）

AI总结针对OD矩阵估计中的欠定性和滞后性问题，提出集成深度学习方法，利用神经网络推断OD序列结构并引导数值优化，实验证明能有效提供时空约束。

Comments 11 pages, 25 figures

AI中文摘要

OD矩阵估计是交通领域的一个关键问题。主要方法利用交通传感器测量信息（如交通计数）来估计由OD矩阵表示的交通需求。该问题分为两类：静态OD矩阵估计和动态OD矩阵序列（简称OD序列）估计。上述两类都面临由大量待估参数和不足的约束信息引起的欠定性问题。此外，OD序列估计还面临滞后挑战：由于拥堵等不同交通状况，同一车辆在相同观测时段内会出现在不同路段，导致相同的OD需求对应不同的行程。为此，本文提出一种集成方法，利用深度学习方法推断OD序列的结构，并利用结构约束指导传统数值优化。实验表明，神经网络能有效推断OD序列的结构，并为数值优化提供实用的约束以获得更好的结果。此外，实验表明，所提供的结构信息不仅包含对OD矩阵空间结构的约束，还提供了对OD序列时间结构的约束，很好地解决了滞后问题的影响。

英文摘要

OD matrix estimation is a critical problem in the transportation domain. The principal method uses information measured by traffic sensors, such as traffic counts, to estimate the traffic demand represented by the OD matrix. The problem is divided into two categories: static OD matrix estimation and dynamic OD matrix sequence (OD sequence for short) estimation. Both face the underdetermination problem caused by the large number of parameters to estimate and insufficient constraint information. In addition, OD sequence estimation faces the lag challenge: under different traffic conditions such as congestion, the same vehicle appears on different road sections during the same observation period, so identical OD demands correspond to different trips. To this end, this paper proposes an integrated method that uses deep learning to infer the structure of the OD sequence and uses structural constraints to guide traditional numerical optimization. Our experiments show that the neural network (NN) can effectively infer the structure of the OD sequence and provide practical constraints for numerical optimization to obtain better results. Moreover, the experiments show that the provided structural information constrains not only the spatial structure of the OD matrices but also the temporal structure of the OD sequence, which effectively mitigates the lag problem.

2507.16859 2026-06-18 cs.RO cs.AI 版本更新

Enhancing Fatigue Detection through Heterogeneous Multi-Source Data Integration and Cross-Domain Modality Imputation

通过异构多源数据集成与跨域模态插补增强疲劳检测

Luobin Cui, Yanlai Wu, Tang Ying, Weikai Li

AI总结针对实际部署环境中高质量传感器不可用的问题，提出异构多源疲劳检测框架，利用共享模态进行跨域模态插补，融合源域知识提升目标域疲劳检测性能。

Comments 4 figures, 14 pages

AI中文摘要

疲劳检测对于安全相关应用（如航空、采矿和长途运输）中的人类操作员至关重要。可靠的操作员疲劳估计可以支持人机系统中的及时警告、自适应任务调度、接管提醒和其他安全管理决策。然而，这些功能的有效性取决于疲劳相关信号是否能在部署环境中可靠捕获。虽然许多研究已显示高保真传感器在受控实验室环境中的价值，但在实际环境中，由于噪声、光照条件和视野限制，其性能往往会下降，从而限制了实际应用。本文形式化了一种面向实际部署的疲劳检测设置，其中高质量传感器在实际应用中通常不可用。为解决这一问题，我们利用来自异构源域的知识，包括难以在现场部署但常用于受控环境的高保真传感器，来辅助真实目标域中的疲劳检测。基于这一思想，我们设计了一个异构多源疲劳检测框架，该框架利用目标域中的可用模态，同时通过基于共享模态的跨域模态插补来利用源域中的多样化配置。

英文摘要

Fatigue detection for human operators is important in safety-related applications such as aviation, mining, and long-haul transport. Reliable estimation of operator fatigue can support timely warnings, adaptive task scheduling, takeover reminders, and other safety-management decisions in human-machine systems. However, the effectiveness of these functions depends on whether fatigue-related signals can be reliably captured in the deployment environment. While many studies have shown the value of high-fidelity sensors in controlled laboratory environments, their performance often degrades when used in real-world settings because of noise, lighting conditions, and field-of-view constraints, thereby limiting their practical use. This paper formalizes a deployment-oriented setting for real-world fatigue detection, where high-quality sensors are often unavailable in practical applications. To address this issue, we use knowledge from heterogeneous source domains, including high-fidelity sensors that are difficult to deploy in the field but commonly used in controlled environments, to assist fatigue detection in the real-world target domain. Based on this idea, we design a heterogeneous and multi-source fatigue-detection framework that uses the available modalities in the target domain while leveraging diverse configurations in the source domains through cross-domain modality imputation based on shared modalities.

2511.14555 2026-06-18 q-bio.NC cs.AI 版本更新

DecNefSimulator: A Modular, Interpretable Framework for Decoded Neurofeedback Simulation Using Generative Models

DecNefSimulator：一个使用生成模型进行解码神经反馈模拟的模块化、可解释框架

Alexander Olza, Roberto Santana, David Soto

发表机构 * Intelligent Systems Group, University of the Basque Country (UPV/EHU)（巴斯克国家大学智能系统组）； Consciousness Group, Basque Center on Cognition, Brain and Language (BCBL)（巴斯克认知、大脑与语言中心意识组）； Ikerbasque, Basque Foundation for Science（巴斯克科学基金会）

AI总结提出DecNefSimulator，一个模块化可解释的模拟框架，将解码神经反馈形式化为机器学习问题，通过潜变量生成模型模拟参与者，直接观察内部状态并评估协议设计对学习的影响，可复现经验现象、识别失败条件并指导协议设计。

AI中文摘要

解码神经反馈（DecNef）是一种有前景的非侵入性脑调控方法，在神经医学和认知神经科学中具有广泛应用。然而，DecNef研究的进展仍受限于受试者依赖的学习变异性、依赖间接测量来量化进展，以及实验的高成本和时间消耗。我们提出DecNefSimulator，一个模块化且可解释的模拟框架，将DecNef形式化为一个机器学习问题。除了提供虚拟实验室，DecNefSimulator使研究人员能够建模、分析和理解神经反馈动态。通过使用潜变量生成模型作为模拟参与者，DecNefSimulator允许直接观察内部认知状态，并系统评估不同协议设计和受试者特征如何影响学习。我们展示了这种方法如何（i）复现DecNef学习的经验现象，（ii）识别DecNef反馈未能诱导学习的条件，以及（iii）在人体实施之前，在计算机中指导设计更稳健可靠的DecNef协议。总之，DecNefSimulator连接了计算建模和认知神经科学，为方法创新、稳健协议设计以及最终更深入地理解基于DecNef的脑调控提供了原则性基础。

英文摘要

Decoded Neurofeedback (DecNef) is a promising non-invasive approach to brain modulation with wide-ranging applications in neuromedicine and cognitive neuroscience. However, progress in DecNef research remains constrained by subject-dependent learning variability, reliance on indirect measures to quantify progress, and the high cost and time demands of experimentation. We present DecNefSimulator, a modular and interpretable simulation framework that formalizes DecNef as a machine learning problem. Beyond providing a virtual laboratory, DecNefSimulator enables researchers to model, analyze and understand neurofeedback dynamics. Using latent variable generative models as simulated participants, DecNefSimulator allows direct observation of internal cognitive states and systematic evaluation of how different protocol designs and subject characteristics influence learning. We demonstrate how this approach can (i) reproduce empirical phenomena of DecNef learning, (ii) identify conditions under which DecNef feedback fails to induce learning, and (iii) guide the design of more robust and reliable DecNef protocols in silico before human implementation. In summary, DecNefSimulator bridges computational modeling and cognitive neuroscience, offering a principled foundation for methodological innovation, robust protocol design, and ultimately, a deeper understanding of DecNef-based brain modulation.

2512.09185 2026-06-18 cs.CV cs.AI 版本更新

Learning Patient-Specific Disease Dynamics with Latent Flow Matching for Longitudinal Imaging Generation

学习患者特异性疾病动态：基于潜在流匹配的纵向影像生成

Hao Chen, Rui Yin, Yifan Chen, Qi Chen, Chao Li

发表机构 * University of Cambridge（剑桥大学）； Nanjing First Hospital（南京第一医院）； Nanjing Medical University（南京医科大学）； Johns Hopkins University（约翰霍普金斯大学）； University of Dundee（邓迪大学）

AI总结提出Δ-LFM框架，利用流匹配对齐患者潜在轨迹，通过患者特异性潜在对齐实现单调疾病进展建模，在三个纵向MRI基准上验证了可解释性和性能。

Comments ICLR 2026 accepted

AI中文摘要

理解疾病进展是一个核心的临床挑战，对早期诊断和个性化治疗具有直接意义。虽然最近的生成方法试图对进展进行建模，但关键不匹配仍然存在：疾病动态本质上是连续且单调的，然而潜在表示通常是分散的，缺乏语义结构，并且基于扩散的模型通过随机去噪过程破坏了连续性。在这项工作中，我们提出将疾病动态视为速度场，并利用流匹配（FM）来对齐患者数据的时间演变。与先前方法不同，它捕捉了疾病的内在动态，使进展更具可解释性。然而，一个关键挑战仍然存在：在潜在空间中，自动编码器（AE）不能保证跨患者的对齐或与临床严重性指标（例如年龄和疾病状况）的相关性。为了解决这个问题，我们提出学习患者特异性潜在对齐，这迫使患者轨迹沿着特定轴延伸，其幅度随疾病严重程度单调增加。这导致了一个一致且语义上有意义的潜在空间。总之，我们提出了Δ-LFM，一个用于通过流匹配建模患者特异性潜在进展的框架。在三个纵向MRI基准上，Δ-LFM展示了强大的实证性能，更重要的是，为解释和可视化疾病动态提供了一个新框架。

英文摘要

Understanding disease progression is a central clinical challenge with direct implications for early diagnosis and personalized treatment. While recent generative approaches have attempted to model progression, key mismatches remain: disease dynamics are inherently continuous and monotonic, yet latent representations are often scattered and lack semantic structure, and diffusion-based models disrupt continuity with a random denoising process. In this work, we propose to treat the disease dynamic as a velocity field and leverage Flow Matching (FM) to align the temporal evolution of patient data. Unlike prior methods, it captures the intrinsic dynamic of disease, making the progression more interpretable. However, a key challenge remains: in latent space, Auto-Encoders (AEs) do not guarantee alignment across patients or correlation with clinical-severity indicators (e.g., age and disease conditions). To address this, we propose to learn patient-specific latent alignment, which constrains patient trajectories to lie along a specific axis, with magnitude increasing monotonically with disease severity. This leads to a consistent and semantically meaningful latent space. Together, we present $Δ$-LFM, a framework for modeling patient-specific latent progression with flow matching. Across three longitudinal MRI benchmarks, $Δ$-LFM demonstrates strong empirical performance and, more importantly, offers a new framework for interpreting and visualizing disease dynamics.
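
The idea of treating progression as a velocity field can be sketched with the simplest flow-matching setup. The abstract does not specify the probability path or the training objective, so the sketch assumes the common linear interpolation path, whose regression target is the constant velocity `x1 - x0`. The latent vectors and all function names are hypothetical, and no patient-specific alignment is modeled.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; the flow-matching regression target."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_integrate(x0, velocity_fn, steps=10):
    """Follow a velocity field from t = 0 to t = 1 with explicit Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical latent codes of one patient at a baseline and a follow-up scan.
baseline = [0.0, 1.0]
follow_up = [2.0, 0.5]

# A field that predicts the target velocity exactly transports baseline to follow_up.
perfect_field = lambda x, t: target_velocity(baseline, follow_up)
reached = euler_integrate(baseline, perfect_field)
midpoint = interpolate(baseline, follow_up, 0.5)
```

Because the path is deterministic and continuous in `t`, intermediate states such as `midpoint` can be read off directly, which is the continuity property the abstract contrasts with stochastic denoising.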

2601.14288 2026-06-18 astro-ph.CO cs.AI cs.CE gr-qc hep-th 版本更新

DeepInflation: an AI agent for research and model discovery of inflation

DeepInflation：用于暴胀研究与模型发现的AI智能体

Ze-Yu Peng, Hao-Shi Yuan, Qi Lai, Jun-Qian Jiang, Gen Ye, Jun Zhang, Yun-Song Piao

发表机构 * School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China； International Centre for Theoretical Physics Asia-Pacific, University of Chinese Academy of Sciences, 100190 Beijing, China； Taiji Laboratory for Gravitational Wave Universe, University of Chinese Academy of Sciences, 100049 Beijing, China； School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China； Institute of Theoretical Physics, Chinese Academy of Sciences, P.O. Box 2735, Beijing 100190, China； Département de Physique Théorique, Université de Genève, 24 quai Ernest-Ansermet, CH-1211 Genève 4, Switzerland

AI总结提出基于多智能体架构的AI智能体DeepInflation，集成大语言模型、符号回归引擎和检索增强生成知识库，自动发现与最新观测一致的单场慢滚暴胀势，并解释理论背景。

AI中文摘要

我们提出了DeepInflation，一个专为暴胀宇宙学中的研究和模型发现而设计的AI智能体。基于多智能体架构，DeepInflation将大语言模型（LLMs）与符号回归（SR）引擎以及检索增强生成（RAG）知识库相结合。该框架使智能体能够自动探索和验证广阔的暴胀势景观，同时将其输出建立在既定的理论文献基础上。我们证明，DeepInflation能够成功发现与最新观测（以ACT DR6结果为例）或任意给定的$n_s$和$r$一致的简单且可行的单场慢滚暴胀势，并为晦涩的暴胀场景提供准确的理论背景。DeepInflation作为宇宙学中新一代自主科学发现引擎的原型，使研究人员和非专家都能使用自然语言探索暴胀景观。该智能体可从此网址获取：https://github.com/pengzy-cosmo/DeepInflation。

英文摘要

We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi-agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval-augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single-field slow-roll inflationary potentials consistent with the latest observations (with the ACT DR6 results taken as an example) or any given $n_s$ and $r$, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non-experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy-cosmo/DeepInflation.

2602.19591 2026-06-18 cs.LG cs.AI 版本更新

Detecting High-Potential SMEs with Heterogeneous Graph Neural Networks

使用异构图神经网络检测高潜力中小企业

Yijiashun Qi, Hanzhe Guo, Yijiazhen Qi

发表机构 * University of Michigan（密歇根大学）； The University of Hong Kong（香港大学）

AI总结提出SME-HGT异构图Transformer框架，利用公开数据构建包含公司、研究主题和政府机构的异构图，预测SBIR第一阶段获奖者能否进入第二阶段，AUPRC达0.621，优于基线模型。

Comments accepted by (ICIIS 2026)

AI中文摘要

中小企业占美国企业的99.9%，贡献44%的经济活动，但系统性地识别高潜力中小企业仍是一个开放挑战。我们提出了SME-HGT，一个异构图Transformer框架，仅使用公开数据预测哪些SBIR第一阶段获奖者将进入第二阶段资助。我们构建了一个异构图，包含32,268个公司节点、124个研究主题节点和13个政府机构节点，通过约99,000条边连接三种语义关系类型。SME-HGT在时间分割测试集上达到0.621±0.003的AUPRC，在五个随机种子上优于MLP基线（0.590±0.002）和R-GCN（0.608±0.013）。在筛选深度为100家公司时，SME-HGT达到89.6%的精确率，比随机选择提升2.14倍。我们的时间评估协议防止信息泄露，对公开数据的依赖确保了可重复性。这些结果表明，公司、研究主题和资助机构之间的关系结构为中小企业潜力评估提供了有意义的信号，对政策制定者和早期投资者具有启示意义。

英文摘要

Small and Medium Enterprises (SMEs) constitute 99.9% of U.S. businesses and generate 44% of economic activity, yet systematically identifying high-potential SMEs remains an open challenge. We introduce SME-HGT, a Heterogeneous Graph Transformer framework that predicts which SBIR Phase I awardees will advance to Phase II funding using exclusively public data. We construct a heterogeneous graph with 32,268 company nodes, 124 research topic nodes, and 13 government agency nodes connected by approximately 99,000 edges across three semantic relation types. SME-HGT achieves an AUPRC of 0.621 ± 0.003 on a temporally-split test set, outperforming an MLP baseline (0.590 ± 0.002) and R-GCN (0.608 ± 0.013) across five random seeds. At a screening depth of 100 companies, SME-HGT attains 89.6% precision with a 2.14× lift over random selection. Our temporal evaluation protocol prevents information leakage, and our reliance on public data ensures reproducibility. These results demonstrate that relational structure among firms, research topics, and funding agencies provides meaningful signal for SME potential assessment, with implications for policymakers and early-stage investors.
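
The screening metrics quoted in this abstract can be made concrete with a small sketch. It assumes that lift means precision in the top k divided by the overall positive rate; the abstract does not define the term. Under that reading, 89.6% precision with a 2.14 lift would imply a base rate of roughly 0.896 / 2.14 ≈ 0.42. The scores and labels below are invented toy data, not results from the paper.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_at_k(scores, labels, k):
    """Precision@k divided by the base rate expected from random selection."""
    base_rate = sum(labels) / len(labels)
    return precision_at_k(scores, labels, k) / base_rate


# Toy data: 8 hypothetical companies, label 1 = advanced to Phase II.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

top2_precision = precision_at_k(scores, labels, 2)   # both top-2 are positives
top2_lift = lift_at_k(scores, labels, 2)             # relative to a 3/8 base rate

# Base rate implied by the reported figures, under the assumed definition.
implied_base_rate = 0.896 / 2.14
```

A lift near 2 at a high base rate is a different regime from the same lift at a low base rate, so the implied base rate helps in reading the precision figure.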

2603.28707 2026-06-18 cs.CE cs.AI 版本更新

A Convex Route to Thermoelasticity: Learning Internal Energy and Dissipation

热弹性的凸路径：学习内能与耗散

Hagen Holthusen, Paul Steinmann, Ellen Kuhl

发表机构 * Institute of Applied Mechanics, University of Erlangen-Nuremberg, Egerlandstraße 5, 91058 Erlangen, Germany（德国埃尔兰根-纽伦堡大学应用力学研究所）； Department of Mechanical Engineering, Stanford University, United States（美国斯坦福大学机械工程系）

AI总结提出基于物理的神经网络框架，通过输入凸神经网络表示内能和耗散势，从构造上满足热力学第二定律，实现全耦合热-力本构建模。

Comments 31 pages, 16 figures, 4 tables

DOI: 10.1016/j.cma.2026.119082

AI中文摘要

我们提出了一个基于物理的神经网络框架，用于发现全耦合热-力学中的本构模型。与基于亥姆霍兹能量的经典公式不同，我们采用内能和耗散势作为主要本构函数，以变形和熵为变量。这一选择避免了强制施加混合凸-凹条件的需要，并便于一致地纳入热力学原理。在本文中，我们关注没有优先方向或内变量的材料。尽管公式以熵表示，但温度被视为独立可观测量，熵通过本构关系在内部推断，从而在不需要熵数据的情况下实现热力学一致的建模。网络的热力学可容许性通过构造得到保证。内能和耗散势由输入凸神经网络表示，确保凸性并符合第二定律。客观性、材料对称性和归一化通过基于不变量的表示和零锚定公式直接嵌入架构中。我们在合成和实验数据集上展示了所提出框架的性能，包括纯热问题以及软组织和填充橡胶的全耦合热-力响应。结果表明，学习得到的模型准确捕捉了潜在的本构行为。所有代码、数据和训练模型均通过 https://doi.org/10.5281/zenodo.19248596 公开提供。

英文摘要

We present a physics-based neural network framework for the discovery of constitutive models in fully coupled thermomechanics. In contrast to classical formulations based on the Helmholtz energy, we adopt the internal energy and a dissipation potential as primary constitutive functions, expressed in terms of deformation and entropy. This choice avoids the need to enforce mixed convexity--concavity conditions and facilitates a consistent incorporation of thermodynamic principles. In this contribution, we focus on materials without preferred directions or internal variables. While the formulation is posed in terms of entropy, the temperature is treated as the independent observable, and the entropy is inferred internally through the constitutive relation, enabling thermodynamically consistent modeling without requiring entropy data. Thermodynamic admissibility of the networks is guaranteed by construction. The internal energy and dissipation potential are represented by input convex neural networks, ensuring convexity and compliance with the second law. Objectivity, material symmetry, and normalization are embedded directly into the architecture through invariant-based representations and zero-anchored formulations. We demonstrate the performance of the proposed framework on synthetic and experimental datasets, including purely thermal problems and fully coupled thermomechanical responses of soft tissues and filled rubbers. The results show that the learned models accurately capture the underlying constitutive behavior. All code, data, and trained models are made publicly available via https://doi.org/10.5281/zenodo.19248596.
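The convexity-by-construction idea behind input convex neural networks can be sketched in a few lines. This is a generic illustration, not the authors' architecture: the class name `TinyICNN`, the layer sizes, and the random weights are hypothetical, and the invariant-based inputs and zero-anchoring described in the abstract are not reproduced. The only property shown is that non-negative weights on hidden-to-hidden paths, combined with a convex non-decreasing activation, make the output convex in the input.

```python
import math
import random


def softplus(t):
    """Convex, non-decreasing activation in a numerically stable form."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyICNN:
    """Two-layer input convex network with hypothetical random weights."""

    def __init__(self, dim_in, hidden, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

        self.Wx0 = rand(hidden, dim_in)   # input weights, unconstrained
        self.Wx1 = rand(hidden, dim_in)   # skip connection from input, unconstrained
        # Weights acting on previous activations must be non-negative.
        self.Wz1 = [[abs(w) for w in row] for row in rand(hidden, hidden)]
        self.out = [abs(rng.uniform(-1, 1)) for _ in range(hidden)]

    def __call__(self, x):
        z = [softplus(t) for t in matvec(self.Wx0, x)]
        pre = [a + b for a, b in zip(matvec(self.Wz1, z), matvec(self.Wx1, x))]
        z = [softplus(t) for t in pre]
        return sum(a * zi for a, zi in zip(self.out, z))


net = TinyICNN(dim_in=3, hidden=4)
value = net([0.2, -0.5, 1.0])
```

In practice the non-negativity constraint is usually enforced during training by reparameterising or clipping the constrained weights; which option the paper uses is not stated in the abstract.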

2604.00730 2026-06-18 cs.CY cs.AI cs.LG cs.SE 版本更新

A CEFR-Inspired Classification Framework with Fuzzy C-Means To Automate Assessment of Programming Skills in Scratch

基于CEFR启发的模糊C均值分类框架：自动化评估Scratch编程技能

Ricardo Hidalgo-Aragón, Jesús M. González-Barahona, Gregorio Robles

发表机构 * Universidad Rey Juan Carlos（胡安·卡洛斯国王大学）

AI总结提出一种基于CEFR的Scratch项目评估框架，使用模糊C均值聚类对200万+项目分级，识别B2瓶颈并引入分类确定性指标以平衡自动反馈与人工审核。

Comments Best Paper Award CSEDU 2026 -Minor change FPC fix-

AI中文摘要

背景：学校、培训平台和技术公司日益需要以透明、可重复的方法大规模评估编程能力，以支持个性化学习路径。目标：本研究引入一个与欧洲共同语言参考标准（CEFR）一致的Scratch项目评估教学框架，为学生和教师提供通用能力等级，并为课程设计提供可行见解。方法：我们对通过此http URL评估的2008246个Scratch项目应用模糊C均值聚类，实施序数准则将聚类映射到CEFR等级（A1-C2），并引入增强分类指标，识别过渡学习者，实现持续进度跟踪，量化分类确定性以平衡自动反馈与教师评审。影响：该框架能够诊断系统性课程缺口——特别是“B2瓶颈”，由于逻辑同步和数据表示的认知负荷，仅13.3%的学习者处于该等级——同时提供基于确定性的触发机制以进行人工干预。
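A minimal sketch of fuzzy C-means may help make the "classification certainty" idea concrete. It uses one-dimensional toy scores, a fuzzifier of m = 2, and a certainty margin defined as the gap between the two largest memberships. All of these are assumptions for illustration: the abstract does not give the feature set, the number of clusters, or the exact certainty formula.

```python
def fuzzy_c_means_1d(xs, centers, m=2.0, iters=50):
    """Alternate membership and center updates; returns (centers, memberships)."""
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iters):
        memberships = []
        for x in xs:
            d = [abs(x - c) for c in centers]
            if 0.0 in d:
                # A point sitting exactly on a center belongs fully to it.
                row = [1.0 if di == 0.0 else 0.0 for di in d]
                total = sum(row)
                row = [r / total for r in row]
            else:
                row = [1.0 / sum((di / dj) ** exponent for dj in d) for di in d]
            memberships.append(row)
        centers = [
            sum((memberships[k][i] ** m) * xs[k] for k in range(len(xs)))
            / sum(memberships[k][i] ** m for k in range(len(xs)))
            for i in range(len(centers))
        ]
    return centers, memberships


def certainty_margin(row):
    """Hypothetical certainty score: top membership minus runner-up."""
    top, second = sorted(row, reverse=True)[:2]
    return top - second


# Hypothetical project scores; 3.0 sits between the two groups.
scores = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 3.0]
centers, memberships = fuzzy_c_means_1d(scores, centers=[0.0, 6.0])
margins = [certainty_margin(row) for row in memberships]
```

A low margin marks a transitional case that could be routed to a teacher, while sorting the cluster centers gives one possible ordinal mapping onto levels.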

基于傅里叶运动建模的条件潜扩散模型用于虚拟人群合成

Shaokun Lan, Haoran Dou, Jinghan Huang, Arezoo Zakeri, Fengming Lin, Zherui Zhou, Jinming Duan, Alejandro F. Frangi

发表机构 * Centre for Computational Imaging and Modelling in Medicine (CIMIM)（计算医学成像与建模中心）； University of Manchester（曼彻斯特大学）； Christabel Pankhurst Institute（克里斯塔贝尔·潘克赫斯特研究所）； Department of Computer Science（计算机科学系）； Division of Informatics, Imaging & Data Sciences（信息学、成像与数据科学学部）； Department of Electrical & Electronic Engineering（电气与电子工程系）； NIHR Manchester Biomedical Research Centre, Manchester Academic Health Sciences Centre, University of Manchester（英国国家健康与护理研究所曼彻斯特生物医学研究中心、曼彻斯特学术健康科学中心、曼彻斯特大学）

AI总结提出4D F-MeshLDM框架，结合卷积网格VAE、截断傅里叶级数运动参数化和条件扩散先验，实现可控的3D+t心脏网格序列生成，在UK Biobank数据上优于基线方法。

Comments This work has been early accepted by International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) 2026

AI中文摘要

医疗设备的计算机模拟试验需要生成虚拟解剖人群。在心血管应用中，虚拟解剖通常表示为从生成模型采样的3D+t网格。然而，大多数现有网格生成器关注静态解剖，而序列模型往往缺乏显式周期性。为此，我们提出4D F-MeshLDM，一个条件生成框架，包括用于编码网格的卷积网格VAE、使用截断傅里叶级数参数化运动的结构化潜空间，以及学习傅里叶系数令牌上潜分布的先验扩散。通过仿射调制将扩散过程条件化于临床协变量，我们实现了可控合成。采样令牌并执行逆傅里叶合成产生周期一致的潜轨迹，可解码为3D+t心脏网格序列。在5,000名UK Biobank受试者上的实验表明，4D F-MeshLDM在解剖保真度上优于最先进的基线，并实现了接近零的周期闭合误差。此外，生成的队列准确保留了临床功能指标，突显了我们的框架在可靠的心脏计算机模拟试验中的潜力。

英文摘要

In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.
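The near-zero cycle closure error follows from the parameterisation itself: a truncated Fourier series in normalised time is periodic by construction. The sketch below illustrates this with hypothetical coefficients for a two-dimensional latent vector; it does not reproduce the paper's VAE, token layout, or diffusion prior.

```python
import math


def fourier_trajectory(mean, cos_coeffs, sin_coeffs, num_frames):
    """Evaluate z(t) = mean + sum_k a_k cos(2*pi*k*t) + b_k sin(2*pi*k*t).

    Returns num_frames + 1 latent vectors so that the first and last
    frames correspond to t = 0 and t = 1 of the same cycle.
    """
    frames = []
    for n in range(num_frames + 1):
        t = n / num_frames
        z = []
        for d in range(len(mean)):
            value = mean[d]
            for k in range(len(cos_coeffs)):
                angle = 2.0 * math.pi * (k + 1) * t
                value += cos_coeffs[k][d] * math.cos(angle)
                value += sin_coeffs[k][d] * math.sin(angle)
            z.append(value)
        frames.append(z)
    return frames


# Hypothetical coefficients: two harmonics, two latent dimensions.
mean = [0.5, -1.0]
cos_coeffs = [[0.3, 0.1], [0.05, -0.2]]
sin_coeffs = [[-0.4, 0.2], [0.1, 0.0]]

frames = fourier_trajectory(mean, cos_coeffs, sin_coeffs, num_frames=20)
closure_error = max(abs(a - b) for a, b in zip(frames[0], frames[-1]))
```

Because only a few harmonics are kept, the trajectory is also smooth, which is one plausible reason for generating coefficients instead of frames.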

2605.27729 2026-06-18 cs.CR cs.AI cs.ET quant-ph 交叉投稿

QSignAI: Quantum-Randomness-Seeded Identity Signatures at the Intersection of AI for Science and Science for AI

QSignAI: 量子随机性种子身份签名——AI for Science 与 Science for AI 的交汇

Dongping Liu, Aoyu Zhang, Luyao Zhang

发表机构 * Amazon Web Services（亚马逊网络服务）； Duke Kunshan University（杜克昆山大学）

AI总结提出 QSignAI 平台，通过云端量子电路生成量子随机性种子，为社交平台用户提供唯一身份签名，并借助 AI 机器人使量子现象对普通用户可感知。

AI中文摘要

2024-2025 年的诺贝尔奖和图灵奖同时表彰了人工智能和量子科学——机器学习作为物理科学，人工智能解决了 50 年的科学问题，超导量子电路作为量子计算的硬件基础，量子信息原理作为计算的最高成就。然而，没有任何已部署的人工智能系统将这两者结合起来为公众服务：身份系统仍然依赖伪随机令牌，量子电路对于每天使用机器人支持的社交消息平台的数十亿人来说仍然不可见。本文介绍了 QSignAI，一个已部署到生产环境的开源平台，在实时事件参与系统中展示了人工智能与量子科学之间的双向关系。我们解决三个研究问题：第一，能否通过真实量子电路生成量子随机性，并将其嵌入到人工智能驱动的社交平台中，且延迟和成本可接受；第二，人工智能机器人能否使量子现象对没有技术背景的普通观众在感知上可理解；第三，结合这两个方向的系统在实践中是否有效。一个对话式人工智能机器人在云端量子模拟器上通过双电路量子管道路由每个参与者的第一条消息，为每个参与者生成唯一的量子随机性种子身份签名。前两个问题通过系统设计和定性部署证据得到回答；可衡量的比较被确定为优先的未来工作。

英文摘要

The 2024-2025 Nobel and Turing awards recognised AI and quantum science simultaneously. Yet no deployed system has brought these streams together for the public. This paper presents QSignAI, a production-deployed platform demonstrating a bidirectional AI-quantum relationship in a real-time event participation system. We address three questions: can quantum-randomness generation via a two-source extractor be embedded in an AI-driven social platform with acceptable latency; can an AI bot make quantum phenomena perceptually legible to general audiences; and does the combined system work in practice? A conversational bot routes each participant's first message through a quantum pipeline comprising a Toeplitz two-source extractor over independent single-qubit Hadamard measurements on SV1 and DM1 simulators, plus a 2-qubit Bell state, producing a unique quantum-randomness-seeded identity signature per participant. The first two questions are answered through system architecture and qualitative deployment evidence from live events; the third through successful production deployment. The current deployment uses cloud quantum simulators; physical QPU randomness is the near-term extension. Measurable benchmarks are identified as priority future work.
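As a rough illustration of Toeplitz-based extraction, the sketch below multiplies one bit string by a Toeplitz matrix over GF(2), where the matrix is defined by a second bit string. Treating the second measurement stream as the matrix-defining bits is an assumption made here for illustration; the abstract does not describe how QSignAI combines its two sources, and the function name and bit values are hypothetical.

```python
def toeplitz_extract(x_bits, seed_bits, m):
    """Return m output bits y = T x over GF(2).

    T is an m-by-n Toeplitz matrix whose entry T[i][j] depends only on
    i - j, so it is fully described by n + m - 1 seed bits.
    """
    n = len(x_bits)
    if len(seed_bits) != n + m - 1:
        raise ValueError("seed must contain n + m - 1 bits")
    out = []
    for i in range(m):
        bit = 0
        for j in range(n):
            bit ^= seed_bits[i - j + n - 1] & x_bits[j]
        out.append(bit)
    return out


# Hypothetical bit strings standing in for two independent measurement runs.
source_a = [1, 0, 1, 1]
source_b = [0, 1, 1, 0, 1, 0]   # 4 + 3 - 1 = 6 bits
signature_bits = toeplitz_extract(source_a, source_b, m=3)
```

The security of any such construction depends on the entropy and independence of the inputs, which this sketch does not check.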

2606.18288 2026-06-18 econ.GN cs.AI econ.TH q-fin.EC 交叉投稿

A Knowledge Theory of Capital: The Value of Natural and Artificial Intelligence

资本的知识理论：自然与人工智能的价值

Jeffrey Gardiner

发表机构 * Morgan Stanley（摩根士丹利）

AI总结提出资本的知识理论，将知识视为资本的核心形式，分析其生成、转化、治理与测量，区分五种知识形态，并引入新概念解释现代财富来源。

Comments 458 pages, 8 figures. Theory-building monograph developing a conditional framework for knowledge-bearing capitalism, with formal concepts, mechanisms, measurement apparatus, and falsification conditions

AI中文摘要

本卷为生产能力日益存在于软件、数据、模型、常规、专业知识、平台、组织、公共资源和公共认知基础设施的经济体，发展了一种资本的知识理论。从亚当·斯密的劳动、资本、专业化和市场范围理论出发，探讨当知识变得像资本一样可积累、可跨形式流动、可扩展、可治理、可重组且在会计中不完全可见时，会发生什么变化。本书将知识承载资本作为核心对象，分析其如何生成、转化为可治理形式、部署、通过反馈改进、封闭或共享、衡量、减值以及用作未来生产的投入。它区分了具身、非具身、制度化、公共资源和公共知识形式，并发展了诸如首次转化、认知封闭、反馈捕获、暗资本和预期知识损失等概念。该论证是有条件且可检验的：现代财富不仅取决于资本积累，还取决于生产性知识如何被治理。

英文摘要

This volume develops a knowledge theory of capital for economies in which productive capacity increasingly resides in software, data, models, routines, expertise, platforms, organizations, commons, and public epistemic infrastructure. Beginning from Adam Smith's theory of labour, stock, specialization, and market extent, it asks what changes when knowledge becomes stock-like, mobile across forms, scalable, governable, recombinable, and imperfectly visible in accounting. The book introduces knowledge-bearing stock as the central object and analyses how it is generated, converted into governable form, deployed, improved through feedback, enclosed or shared, measured, impaired, and used as input to future production. It distinguishes embodied, disembodied, institutionalized, commons, and public knowledge forms and develops concepts such as first conversion, cognitive enclosure, feedback capture, dark capital, and expected knowledge loss. The argument is conditional and testable: modern wealth depends not only on capital accumulation, but on how productive knowledge is governed.

2606.17102 2026-06-18 physics.pop-ph cs.AI cs.ET cs.HC quant-ph 交叉投稿

Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models

量子影院：通过生成世界模型对量子计算硬件进行交互式电影探索

Aoyu Zhang, Dongping Liu, Luyao Zhang

发表机构 * Amazon Web Services（亚马逊网络服务）； Duke Kunshan University（杜克昆山大学）

AI总结本文提出量子影院，一个基于生成世界模型的开源交互式应用，通过四幕叙事将不可见的量子硬件转化为可探索的电影体验，旨在弥合量子计算与公众之间的想象鸿沟。

AI中文摘要

量子计算有望在科学和工业领域带来变革性进步，但实现这些计算的物理硬件对公众而言仍然不可见：量子处理器在接近绝对零度的密封稀释制冷机内运行，使得直接观察成为不可能。这种量子计算日益增长的社会影响与公众可视化能力之间的“想象鸿沟”构成了量子素养和劳动力发展的重大障碍。我们提出量子影院，一个开源、基于浏览器的交互式应用，通过使用生成世界模型将不可见的量子硬件转化为可探索的电影体验，从而弥合这一鸿沟。量子影院引导用户经历四幕叙事——从获得诺贝尔奖的量子纠缠基础科学，通过策划的视频介绍三种主要量子计算架构（离子阱、中性原子和超导系统），进入沉浸式三维生成世界，使不可见的量子现象变得可观察，最后到基于真实量子设备规格的交互式雷达图比较。所有三维环境均使用WorldLabs的生成世界模型平台生成，并基于亚马逊云服务（AWS）Braket量子硬件策划的指标进行科学依据。量子影院无需安装、无需专用硬件、无需量子计算背景。它旨在服务于两个不同的群体：寻求复制或扩展平台的学者和开发者，以及寻求直观工具向不同受众解释量子硬件的教育者、研究人员和科学传播者。本文描述了系统架构、生成世界模型流程、两个群体的用例以及未来工作方向。

英文摘要

Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This "imagination gap" between quantum computing's growing societal impact and the public's ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative -- from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs' generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

2604.23716 2026-06-18 cs.AI cs.IT cs.LG cs.MA math.IT 版本更新

Information-Theoretic Measures in AI: A Practical Decision Guide

人工智能中的信息论度量：实用决策指南

Nikolaos Al. Papadopoulos, Konstantinos E. Psannis

发表机构 * Department of Applied Informatics, University of Macedonia（马其顿大学应用信息系）

AI总结本文为七种信息论度量提供实用决策框架，围绕每个度量的三个关键问题：回答的问题与AI场景、适合的估计器、最危险的误用，并附有流程图和决策表。

Comments 25 pages, 2 tables, 1 figure. Submitted to Entropy (MDPI)

AI中文摘要

信息论（IT）度量在人工智能中无处不在：熵驱动决策树分裂和不确定性量化，交叉熵是默认的分类损失，互信息支撑表示学习和特征选择，转移熵揭示动态系统中的有向影响。第二类较不成熟的度量——整合信息（Phi）、有效信息（EI）和自主性——已出现用于表征智能体复杂性。尽管被广泛采用，度量选择常常与估计器假设、失败模式和安全的推断主张脱节。本文为所有七种度量提供了一个实用决策框架，围绕每个度量的三个指导性问题组织：（i）该度量回答什么问题，在何种AI背景下；（ii）哪种估计器适合数据类型和维度；（iii）最危险的误用是什么。该框架通过两个互补的人工制品实现：度量选择流程图和主决策表。我们涵盖每个度量的AI/ML和决策智能体应用领域，并使用标准化桥接框将IT量与认知构造联系起来。三个工作示例展示了该框架在具体从业者场景中的应用，涵盖表示学习、时间影响分析和进化智能体复杂性。

英文摘要

Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
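Three of the measures named in this abstract have simple plug-in forms for discrete data, sketched below. The function names are illustrative, and the mutual-information estimator shown is the naive count-based one, which is known to be biased upward on small samples; it is included as a baseline, not as the estimator the paper recommends.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


fair_coin_bits = entropy([0.5, 0.5])
mismatch_cost = cross_entropy([0.5, 0.5], [0.25, 0.75]) - fair_coin_bits
copied = mutual_information([(0, 0), (1, 1)] * 10)
independent = mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)] * 10)
```

The gap between cross-entropy and entropy is the KL divergence, which is why minimising cross-entropy against fixed labels also minimises KL.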

2606.00729 2026-06-18 cs.AI 版本更新

AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China

AI主权作为国家学习能力：基于人本学习机制视角看法国、美国与中国

Kim Phuc Tran

发表机构 * Univ. Lille, ENSAIT, ULR 2461 – GEMTEX（里尔大学、ENSAIT、ULR 2461 – GEMTEX）

AI总结本文提出将国家AI发展视为一个受控的信息注入与熵耗散平衡的动态学习系统，主张AI主权源于国家调节自身信息动力学的能力，而非单纯规模扩张。

AI中文摘要

在法国，人工智能常被从投资、算力、监管、就业、主权和教育等维度讨论，这些维度通常被分开处理。本文提出一个统一解读：法国应被理解为一个“国家AI学习系统”。基于最近被形式化为熵调控表示学习动力学框架的人本学习机制（HCLM），我们将国家AI发展解释为信息注入与熵耗散之间的受控平衡。信息注入对应算力、数据、人才、研究、资本、产业部署和制度实验；熵耗散对应组织复杂性、协调摩擦、能源约束、监管不确定性、人才流动压力以及加强产业吸收的机会。核心主张是：AI主权并非仅源于规模，而是源于国家调节自身信息动力学的能力。本文将HCLM与神经标度律、内生增长理论、创造性破坏和博弈论联系起来，认为法国AI辩论应超越技术乐观主义与监管优先的二元对立。一个具有竞争力且以人为本的AI战略需要一个受控机制，其中信息注入增长快于制度耗散，同时避免不稳定、不平等或高能耗的扩张。我们提供了一个数学模型、可衡量的政策指标、博弈论命题、国家AI制度的说明性模拟，以及对法国的具体政策启示。所提出的观点将AI政策重新定义为对一个开放、战略性、非均衡学习系统的治理。

英文摘要

Artificial intelligence in France is often discussed through separate dimensions such as investment, compute, regulation, employment, sovereignty, and education. This viewpoint paper proposes a unified interpretation: France can be analyzed as a national AI learning system. Building on Human-Centered Learning Mechanics (HCLM), we use HCLM not as a validated econometric model, but as a conceptual and diagnostic lens for interpreting national AI development as a balance between information injection, absorptive capacity, and institutional dissipation. Information injection includes compute, data, talent, research, capital, industrial deployment, and policy experimentation. Institutional dissipation refers to avoidable frictions such as administrative overload, coordination failures, energy constraints, regulatory uncertainty, talent mobility pressures, and weak industrial absorption. Regulation is not treated as mere friction: adaptive governance, trusted data spaces, and safety-oriented standards may increase long-term learning capacity by improving legitimacy, interoperability, and social trust. The central claim is not that a country follows neural-network equations, but that AI sovereignty depends on how effectively it converts distributed information into absorbed, coordinated, and socially legitimate capability. The paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, absorptive capacity, and coordination mechanisms. It offers a formal heuristic, policy indicators, illustrative scenarios, and implications for France. The numerical results are diagnostic scenarios, not econometric estimates or official rankings. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system that should be tested with historical and cross-country data.

2605.17131 2026-06-18 cs.CV cs.AI cs.LG 版本更新

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

点云分类与分割的深度学习架构综述

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

发表机构 * State University of New York at Albany（纽约州立大学奥尔巴尼分校）

AI总结本文系统性地探讨了点云分类和分割中的深度学习架构，分析了点云数据的结构特性，分类了不同架构的工作，并评估了其在主流基准上的性能，同时指出了开放挑战和未来方向。

Comments We reviewed a decade of advancements in point cloud processing: trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

DOI: 10.1145/3815180
Journal ref: ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

AI中文摘要

点云因其简洁性和几何保真度而成为表示3D形状和场景最广泛采用的格式。然而，其固有的无序和不规则性质，加剧了传感器噪声和遮挡的影响，给基于机器学习的方法带来了独特的挑战。为应对这些问题，已开发出多种策略，包括转换为有序格式、提取局部几何特征以及基于排列不变或自注意力的处理方法。在本文中，我们的重点是深度学习模型在3D视觉三个基本任务中的应用：点云分类、部分分割和语义分割。我们首先正式定义点云数据，然后深入讨论其结构特性。接着，我们根据其骨干结构对重要工作进行分类，并评估其在流行基准上的性能。除了经验比较外，我们还提供了架构创新和局限性的见解。我们还概述了3D点云理解中的开放挑战和有前途的未来方向。

英文摘要

Point cloud stands as the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherent unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine learning based methodologies. To combat these issues, diverse strategies have been developed, including converting to a format that has orderliness, extracting local geometry, and permutation-invariant or self-attention-based processing. In this paper, our focus is directed towards deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion on its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.
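The permutation-invariant strategy mentioned in this abstract can be shown with a small sketch in the spirit of PointNet-style models: apply the same function to every point, then aggregate with a symmetric operation such as max. The weights and points below are hypothetical, and a real network would stack learned layers where this uses a single linear map with ReLU.

```python
import random


def global_max_feature(points, weights):
    """Shared per-point linear map with ReLU, followed by channel-wise max pooling."""
    per_point = [
        [max(0.0, sum(w * c for w, c in zip(row, p))) for row in weights]
        for p in points
    ]
    # Max over points is symmetric, so the result ignores point order.
    return [max(channel) for channel in zip(*per_point)]


# Hypothetical weights (4 feature channels) and a tiny 3D point cloud.
weights = [
    [1.0, 0.0, -0.5],
    [0.2, 0.7, 0.1],
    [-1.0, 0.3, 0.4],
    [0.0, -0.6, 0.9],
]
cloud = [(0.1, 0.2, 0.3), (1.0, -0.5, 0.2), (-0.3, 0.8, 0.7), (0.4, 0.4, -0.9)]

shuffled = list(cloud)
random.Random(0).shuffle(shuffled)

original_feature = global_max_feature(cloud, weights)
shuffled_feature = global_max_feature(shuffled, weights)
```

Max pooling discards how many points support each feature, which is one reason later architectures add local neighbourhood grouping or attention.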

2606.00182 2026-06-18 cs.HC cs.AI cs.CY 版本更新

The New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace

新社会形象：AI能力与AI主动性如何影响职场中的自我与同伴感知

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

发表机构 * Autonomous Interactive Systems, University of Siegen（自主交互系统，锡根大学）； Experience & Interaction Design, University of Siegen（体验与交互设计，锡根大学）

AI总结通过2x2x2情景实验（n=50），研究AI能力与主动性水平对员工工作所有权、情感、意义感及角色动态的自我与同伴感知影响，发现低能力或低主动性的AI通常提升积极感知，但高能力与高主动性可能带来负面影响。

Comments Updated metadata following publication in Interacting with Computers. Added DOI and publication information

DOI: 10.1093/iwc/iwag033

AI中文摘要

人机协作被视为将AI融入职场的最有前景方式。然而，这种协作的体验后果尚未被探索。具体而言，在与AI组成的团队中，人类如何感知自己（自我感知）以及同事如何看待他们（同伴感知）在工作所有权和工作意义方面。在一项2x2x2情景研究（n=50）中，参与者对所有权、情感、工作意义和满意度以及角色动态的感知进行了评分，其中AI主动性和AI能力作为被试内因素（低/高两个水平），视角（自我感知/同伴感知）作为被试间因素。我们的结果表明，低能力或低主动性的AI通常提升了与所有权、意义感、满意度和角色动态相关的感受，并增加了积极情感，减少了消极情感。然而，这些效应往往受到视角的影响。例如，低AI主动性从自我感知而非同伴感知中带来了更高的工作满意度。基于我们的发现，我们认为仅围绕绩效指标设计未来工作的AI可能并不足够。高能力和高主动性的AI驱动系统可能对所有权感知、工作身份、社会形象和团队动态产生不良影响，进而影响工作意义。

英文摘要

Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experiential consequences of this teaming. More specifically, in a team with AI, how humans perceive themselves (self-perception) and how they are perceived by their coworkers (peer perception) in terms of work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics across two levels (low/high) of AI proactivity and AI competency as within-subject factors, with point-of-view (self perception/peer perception) as between-subjects. Our results showed that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and also increased positive affect while reducing negative affect. However, these effects were often influenced by point-of-view. For instance, low AI proactivity resulted in higher job satisfaction from self-perception rather than peer perception. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be adequate. Highly competent and proactive AI-driven systems can have undesirable impacts on perceptions of ownership, job identity, social image and team dynamics, and consequently, job meaningfulness.

2606.15091 2026-06-18 cs.HC cs.AI 版本更新

Sensory Restoration via Brain-Computer Interfaces: A Unified 2 x 2 Framework and Convergence Roadmap

通过脑机接口的感觉恢复：统一的2×2框架与融合路线图

Xuan-The Tran

发表机构 * School of Mechanical Engineering, Vietnam Maritime University（越南海事大学机械工程学院）

AI总结本文提出一个统一的2×2框架，按侵入性和信号方向分类脑机接口，并定义恢复、替代和增强范式，同时给出近中长期的融合路线图。

2602.15513 2026-06-18 cs.RO cs.AI 版本更新

HIMM: Human-Inspired Long-Term Memory Modeling for Embodied Exploration and Question Answering

Ji Li, Bo Wang, Jing Xia, Mingyi Li, Shiyan Hu

发表机构 * The University of Hong Kong（香港大学）； Beijing Institute of Technology（北京理工大学）

2602.20135 2026-06-18 cs.CL cs.AI cs.IR 版本更新

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

发表机构 * University of Tehran（德黑兰大学）； Independent Researcher（独立研究员）； Amirkabir University of Technology（阿米尔卡比尔理工大学）； TEIAS Institute（TEIAS研究所）

Comments Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

2405.14273 2026-06-18 cs.LG cs.AI math.OC 版本更新

Exact Solution to Data-Driven Inverse Optimization of MILPs in Finite Time via Gradient-Based Methods

通过基于梯度的方法在有限时间内精确求解混合整数线性规划的数据驱动逆优化问题

Akira Kitaoka

发表机构 * NEC Corporation（日本电气株式会社）

AI总结本文研究了混合整数线性规划中的数据驱动逆优化问题，揭示了次优性损失的几何结构，并证明了基于梯度的优化方法可以在有限次迭代内达到与观测数据的一致，同时给出了投影次梯度下降法的迭代次数上界。

Comments 66 pages; comments are welcome

AI中文摘要

数据驱动逆优化问题（DDIOP）是估计能够解释观测最优解数据的目标函数参数（权重）的问题，出现在包括混合整数线性规划（MILP）在内的许多应用中。在MILP的逆优化中，特征的预测误差关于权重不连续，因此难以直接应用基于梯度的优化方法。本文聚焦于次优性损失，该损失当且仅当权重与观测数据完全一致时取得最小值零。我们揭示了该损失的几何结构：它是凸的且分段线性，并且与观测数据完全一致的权重集合具有正的“厚度”，而非单个点或薄边界。利用这一结构，我们证明了：第一，包括投影次梯度下降法在内的一大类基于梯度的优化方法，可在有限次迭代内达到与观测数据的完全一致（即在有限时间内获得精确解）。第二，对于投影次梯度下降法，我们给出了达到完全一致所需迭代次数的显式上界。第三，当正向问题是整数线性规划（ILP）时，我们将该上界表示为仅由样本数、特征维度和约束系数矩阵结构决定的完全显式迭代次数（例如，若系数矩阵是全单模矩阵，则迭代次数被显式地限制为样本数平方和维度的多项式）。通过数值实验，我们验证了这种有限步达到的行为。

英文摘要

A data-driven inverse optimization problem (DDIOP) is the problem of estimating the objective-function parameters (weights) that explain observed optimal-solution data, and it arises in many applications, including mixed integer linear programming (MILP). In inverse optimization for MILPs, the prediction error of the features is discontinuous with respect to the weights, so applying gradient-based optimization directly is difficult. In this paper we focus on the suboptimality loss. This loss attains its minimum value, zero, if and only if the weights are exactly consistent with the observed data. We reveal a geometric structure of this loss -- it is convex and piecewise linear, and moreover the set of weights that are exactly consistent with the observed data has a positive ``thickness'' rather than being a single point or a thin boundary -- and use it to show the following. First, a broad class of gradient-based optimization methods, including projected subgradient descent, reaches exact consistency with the observed data in finitely many iterations (an exact solution is obtained in finite time). Second, for projected subgradient descent we give an explicit upper bound on the number of iterations needed to reach exact consistency. Third, when the forward problem is an integer linear program (ILP), we give this upper bound as a fully explicit iteration count determined solely by the number of samples, the dimension of the features, and the structure of the constraint coefficient matrix. Through numerical experiments, we confirm this finite-step attainment behavior.
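The finite-step behaviour described here can be seen on a toy instance. The sketch below makes several assumptions that are not taken from the paper: the forward problem maximises w·x over a small enumerated integer set, weights are restricted to the probability simplex, the step size is constant, and there is a single observation. The suboptimality loss is then max over x of w·x minus w·x_obs, with subgradient x*(w) − x_obs.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - 1.0) / i
        if ui - t > 0:
            theta = t
    return [max(x - theta, 0.0) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve_forward(w, feasible):
    """Brute-force forward problem: maximise w.x over an enumerated set."""
    return max(feasible, key=lambda x: dot(w, x))


def fit_weights(x_obs, feasible, w, step=0.1, max_iters=100, tol=1e-12):
    """Projected subgradient descent on the suboptimality loss."""
    for iteration in range(max_iters):
        x_star = solve_forward(w, feasible)
        loss = dot(w, x_star) - dot(w, x_obs)
        if loss <= tol:
            return w, loss, iteration
        grad = [a - b for a, b in zip(x_star, x_obs)]
        w = project_to_simplex([wi - step * gi for wi, gi in zip(w, grad)])
    x_star = solve_forward(w, feasible)
    return w, dot(w, x_star) - dot(w, x_obs), max_iters


# Hypothetical ILP: integer points with x1 + x2 <= 2, observed optimum (2, 0).
feasible = [(a, b) for a in range(3) for b in range(3) if a + b <= 2]
w_fit, final_loss, iterations = fit_weights((2, 0), feasible, w=[0.1, 0.9])
```

Enumerating the feasible set stands in for a real MILP solver call; the loop structure is the same when `solve_forward` is replaced by one.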

2407.00449 2026-06-18 cs.LG cs.AI cs.NE 版本更新

Fully tensorial approach to hypercomplex-valued neural networks

Agnieszka Niemczynowicz, Radosław Antoni Kycia

发表机构 * Faculty of Computer Science and Mathematics, Cracow University of Technology（克拉科夫技术大学计算机科学与数学系）

Comments 23 pages, 3 figures

2512.04115 2026-06-18 cs.CY cs.AI cs.HC 版本更新

Artificial Intelligence Competence of K-12 Students Shapes Their AI Risk Perception: A Co-occurrence Network Analysis

Ville Heilala, Pieta Sikström, Mika Setälä, Tommi Kärkkäinen

发表机构 * University of Jyväskylä（于韦斯屈莱大学）

Comments Accepted for Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC'26)

2506.20869 2026-06-18 cs.SE cs.AI cs.IR 版本更新

Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation

Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari, Pekka Abrahamsson

发表机构 * Faculty of Information Technology and Communication Sciences, Tampere University（坦佩雷大学信息技术与通信科学学院）

Comments Published in the Proceedings of the 51st Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2025. Lecture Notes in Computer Science, volume 16082, pages 143-158. Springer, 2026

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE 版本更新

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University（横滨国立大学）

Comments Accepted to ACL 2025 Findings

2506.09822 2026-06-18 cs.CE cs.AI 版本更新

Superstudent intelligence in thermodynamics

Rebecca Loubet, Pascal Zittlau, Marco Hoffmann, Luisa Vollmer, Sophie Fellenz, Heike Leitte, Fabian Jirasek, Johannes Lenhard, Hans Hasse

发表机构 * Laboratory of Engineering Thermodynamics (LTD)（工程热力学实验室）； Visual Information Analysis Research Group (VIA)（视觉信息分析研究组）； Machine Learning Research Group (ML)（机器学习研究组）

Comments This document is the unedited Author's version of a yet to be Submitted Work to Physical Review Physics Education Research. 15 pages, 2 figures, Graphical Abstract, Highlights and SI available (12 pages)

2504.12347 2026-06-18 cs.CL cs.AI cs.CY 版本更新

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

发表机构 * Faculty of Information Technology（信息科技学院）； University of Jyväskylä（于韦斯屈莱大学）； Faculty of Humanities and Social Sciences（人文与社会科学学院）

2505.03863 2026-06-18 cs.CR cs.AI 版本更新

Data-Driven Falsification of Cyber-Physical Systems

Atanu Kundu, Sauvik Gon, Rajarshi Ray

发表机构 * Indian Association for the Cultivation of Science（印度科学培养协会）

2406.15537 2026-06-18 q-bio.NC cs.AI cs.SD eess.AS 版本更新

R&B -- Rhythm and Brain: Cross-subject Decoding of Music from Human Brain Activity

Matteo Ferrante, Matteo Ciferri, Nicola Toschi

发表机构 * Department of Biomedicine and Prevention, University of Rome Tor Vergata（罗马托尔维加塔大学生物医学与预防系）； A.A. Martinos Center for Biomedical Imaging, Harvard Medical School/MGH, Boston (US)（美国波士顿哈佛医学院/麻省总医院 A.A. Martinos 生物医学成像中心）

Comments The first two authors contributed equally to this work

DOI: 10.1016/j.neunet.2026.109195
Journal ref: Neural Networks, 203, 109195 (2026)

英文摘要

Music is a universal phenomenon that profoundly influences human experiences across cultures. This study investigates whether music can be decoded from human brain activity measured with functional MRI (fMRI) during its perception. Leveraging recent advancements in extensive datasets and pre-trained computational models, we construct mappings between neural data and latent representations of musical stimuli. Our approach integrates functional and anatomical alignment techniques to facilitate cross-subject decoding, addressing the challenges posed by the low temporal resolution and signal-to-noise ratio (SNR) in fMRI data. Starting from the GTZan fMRI dataset, where five participants listened to 540 musical stimuli from 10 different genres while their brain activity was recorded, we used the CLAP (Contrastive Language-Audio Pretraining) model to extract latent representations of the musical stimuli and developed voxel-wise encoding models to identify brain regions responsive to these stimuli. By applying a threshold to the association between predicted and actual brain activity, we identified specific regions of interest (ROIs) which can be interpreted as key players in music processing. Our decoding pipeline, primarily retrieval-based, employs a linear map to project brain activity to the corresponding CLAP features. This enables us to predict and retrieve the musical stimuli most similar to those that originated the fMRI data. Our results demonstrate state-of-the-art identification accuracy, with our methods significantly outperforming existing approaches. Our findings suggest that neural-based music retrieval systems could enable personalized recommendations and therapeutic applications. Future work could use higher temporal resolution neuroimaging and generative models to improve decoding accuracy and explore the neural underpinnings of music perception and emotion.
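The retrieval-based decoding step described in this abstract can be sketched as a linear projection followed by a nearest-neighbour search in feature space. The map `W`, the two-dimensional "embeddings", and the use of cosine similarity are assumptions for illustration; the abstract does not state the similarity measure or how the linear map is fitted, and real CLAP features have far more dimensions.

```python
import math


def project(W, x):
    """Apply a linear map (list of rows) to a brain-activity vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieve(W, brain_activity, candidate_features):
    """Index of the candidate stimulus most similar to the predicted features."""
    predicted = project(W, brain_activity)
    return max(
        range(len(candidate_features)),
        key=lambda i: cosine(predicted, candidate_features[i]),
    )


# Hypothetical linear map (3 voxels -> 2 feature dims) and candidate clips.
W = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, -0.5],
]
candidates = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]
best = retrieve(W, [0.9, 0.1, 0.2], candidates)
```

Identification accuracy in such a setup is typically the fraction of held-out scans whose true stimulus is ranked first, or within the top few, among the candidates.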

1. 智能体、规划与决策 19 篇

What Must Generalist Agents Remember?

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Towards an Agent-First Web: Redesigning the Web for AI Agents

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport

Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Dissecting model behavior through agent trajectories

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

MemRerank: Preference Memory for Personalized Product Reranking

PatchWorld: Gradient-Free Optimization of Executable World Models

2. Knowledge Representation, Reasoning, and Symbolic AI (6 papers)

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for EL⊥

Fully Geometric Multi-Hop Reasoning on Knowledge Graphs with Transitive Relations

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation

3. Multi-Agent Systems and Game Theory (14 papers)

Searching for Synergy in Shared Workspace Human-AI Collaboration

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

Towards Multi-Agent-Simulation-Based Community Note Evaluation

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

A Technical Taxonomy of LLM Agent Communication Protocols

Recursive Joint Simulation in Games

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Self-Evolving Multi-Agent Systems via Textual Backpropagation

R2BC: Multi-Agent Imitation Learning from Single-Agent Demonstrations

4. Search, Optimization, and Constraint Solving (6 papers)

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

An In-depth Study of LLM Contributions to the Bin Packing Problem

Enhancing CVRP Solver through LLM-driven Automatic Heuristic Design

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Scalable Batch Bayesian Optimization Via Subspace Acquisition Functions

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

5. Machine Learning and Representation Learning (58 papers)

Skill-Guided Continuation Distillation for GUI Agents

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

DRIFT: Refining Instruction Data via On-Policy Data Attribution

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Neural Phase Correlation

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

Dual Dimensionality for Local and Global Attention

Bounded Context Management for Tabular Foundation Models on Stream Learning

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

Private Learning with Public Feature Conditioning

Bayesian Anytime Pareto Set Identification for Multi-Objective Multi-Armed Bandits

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Reinforcement Learning Foundation Models Should Already Be A Thing

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Pareto Q-Learning with Reward Machines

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

Essential Subspace Merging for Multi-Task Learning