2606.18272 2026-06-18 cs.NI cs.AI cs.SY eess.SY (cross-listed)

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

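Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

Affiliations: i2CAT Foundation; Universitat Politècnica de Catalunya; Research Institute for Digital Future

AI summary: Proposes a randomized anchoring strategy based on a truncated three-parameter Weibull distribution to mitigate anchoring bias in LLM agents for 6G network slicing; combined with a CVaR-based digital twin that safeguards SLA tail latency, it achieves energy savings of up to 25%.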

Comments 7 pages, 4 figures

Abstract

This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the Bimodal Constraint-Avoidance Utility Theorem, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model (otel-llm-1b-it) confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC). Our source code is available for non-commercial use at https://github.com/HatimChergui.
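The abstract above names two ingredients: anchors drawn from a truncated three-parameter Weibull distribution, and a CVaR check on tail latency. A minimal Python sketch of both follows. The shape, scale, location, truncation bounds, latency samples, and all function names are illustrative assumptions, not values or code from the paper or its repository.

```python
import math
import random


def weibull_cdf(x, shape, scale, loc):
    """CDF of the 3-parameter Weibull (shape, scale, location)."""
    if x <= loc:
        return 0.0
    return 1.0 - math.exp(-(((x - loc) / scale) ** shape))


def sample_truncated_weibull(rng, shape, scale, loc, lo, hi):
    """Inverse-CDF sample restricted to [lo, hi], assuming loc <= lo < hi."""
    u_lo = weibull_cdf(lo, shape, scale, loc)
    u_hi = weibull_cdf(hi, shape, scale, loc)
    u = rng.uniform(u_lo, u_hi)
    x = loc + scale * (-math.log1p(-u)) ** (1.0 / shape)
    return min(max(x, lo), hi)  # guard against floating-point round-off


def empirical_cvar(samples, alpha):
    """Mean of the worst (1 - alpha) fraction of samples (larger = worse)."""
    ordered = sorted(samples)
    start = min(int(math.ceil(alpha * len(ordered))), len(ordered) - 1)
    tail = ordered[start:]
    return sum(tail) / len(tail)


# Hypothetical usage: randomized opening proposals bounded to [22, 40] units.
rng = random.Random(0)
anchors = [
    sample_truncated_weibull(rng, shape=1.5, scale=10.0, loc=20.0, lo=22.0, hi=40.0)
    for _ in range(1000)
]

# Hypothetical latency samples (ms); CVaR averages the slowest 5 percent.
latencies_ms = [float(i) for i in range(1, 101)]
cvar_95 = empirical_cvar(latencies_ms, alpha=0.95)
```

Truncating through the CDF keeps every draw inside the stated bounds by construction, which is one plausible reading of "mathematically bounded" in the abstract.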

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA (cross-listed)

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

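Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

Affiliations: Amazon

AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training. It reveals that capacity parameters accumulate monotonically while regularization parameters oscillate, and it improves over the base model by 9% to 140% relative on 4 GRPO tasks.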

Abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.18519 2026-06-18 cs.RO cs.AI (cross-listed)

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

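Marcos Abel Zuzuárregui, Stefano Carpin

Affiliations: University of California, Merced

AI summary: To address the ambiguity of natural language, proposes an LLM mission planning system built on linear temporal logic (LTL) feedback loops, with two LLMs dividing specification generation and verification, improving the reliability of mission planning in precision agriculture.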

Journal ref: Published in Proceedings of 2026 International Conference on Robotics and Automation (ICRA)

Abstract

Though robotic systems are now being commercialized and deployed in various industries, many of these systems are highly specialized and often require an advanced skill set to operate and ensure they perform as instructed. To mitigate this problem, we recently introduced a mission planner leveraging LLMs to synthesize mission plans in precision agriculture based on mission descriptions provided in natural language. While the system demonstrates impressive performance, it also suffers from the inherent ambiguities of natural language. In this paper, we extend our system to address this issue by introducing multiple feedback loops in the planning architecture that leverage linear temporal logic (LTL) to ensure the mission planning system meets the specifications formulated by the user while still using natural language. To mitigate potential bias, this is achieved by using two different commercial LLMs in charge of the specification and verification subtasks. Through extensive experiments, we highlight the strengths and limitations of integrating mission verification into a fully autonomous pipeline, particularly regarding an LLM's ability to generate valuable LTL formulas, and show how our proposed implementation addresses and solves these challenges.

2606.19319 2026-06-18 cs.MA cs.AI cs.DB (cross-listed)

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

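Anoushka Vyas, Aarushi Dhanuka, Sina Khoshfetrat Pakazad, Henrik Ohlsson

Affiliations: C3 AI

AI summary: Proposes Data Intelligence Agents (DIA), a system of three autonomous coding agents that compresses the data integration workflow by executing, validating, and repairing artifacts, matching or exceeding the best results on seven SQL benchmarks.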

TRIDENT: Breaking the Hybrid Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Zijie Meng, Ziwei Li, Yufei Liu, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Miao Zhang

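Affiliations: Peking University; Xiamen University; National Taiwan University; Wuhan University (WHU); Tsinghua University (THU) / Jimei University

AI summary: Targets the coupling formed by hybrid discrete-continuous actions, training-time safety constraints, and physical dynamics. Proposes the TRIDENT framework, which achieves provably safe convergence through a Richardson-Romberg gradient correction, a Lyapunov-constrained sequential trust-region update, and a physics-informed residual critic, substantially reducing training-time violations and improving reward.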

Comments 16 pages, 4 figures

Abstract

Safe coordination in networked cyber-physical systems forces learning algorithms to simultaneously handle hybrid discrete-continuous actions, hard training-time safety constraints, and physics-governed dynamics. We show that these three features form a directed cycle of biases that defeats any naive composition of off-the-shelf modules, and formalize this as a three-way coupling lemma. We then introduce TRIDENT, the first MARL framework whose three components are co-designed to cancel each leak: a Richardson-Romberg gradient correction reducing Gumbel-Softmax bias from O(tau) to O(tau^2), a Lyapunov-constrained sequential trust-region update enforcing per-iterate feasibility, and a physics-informed residual critic that decomposes value rather than reward. We prove an O~(1/sqrt(K)) convergence rate to a constrained Nash equilibrium and an O(sqrt(K)) cumulative-violation bound. On multi-UAV mobile-edge computing, autonomous intersection management, and a hybrid SMAC variant, TRIDENT cuts training-time violations by 95.5% over MADDPG and 76.3% over MACPO, while improving reward by 13.5% over the strongest unconstrained baseline.
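The Richardson-Romberg correction mentioned above rests on a generic extrapolation identity: if an estimator g(tau) has bias a*tau + O(tau^2), then 2*g(tau/2) - g(tau) has bias O(tau^2). The Python sketch below checks that identity on a stand-in problem, a forward-difference derivative whose bias is first order in the step size. It is not TRIDENT's Gumbel-Softmax estimator, and every name in it is hypothetical.

```python
import math


def forward_difference(f, x, h):
    """First-order derivative estimate; its bias shrinks linearly in h."""
    return (f(x + h) - f(x)) / h


def richardson_romberg(estimator, h):
    """Combine estimates at h and h/2 so the first-order bias term cancels."""
    return 2.0 * estimator(h / 2.0) - estimator(h)


def estimate(h):
    # Stand-in biased estimator: derivative of exp at 0 (true value is 1.0).
    return forward_difference(math.exp, 0.0, h)


h = 0.1
plain_error = abs(estimate(h) - 1.0)
corrected_error = abs(richardson_romberg(estimate, h) - 1.0)
```

The corrected estimate costs two evaluations instead of one, which is the usual price of this kind of bias reduction.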

2606.18325 2026-06-18 cs.CR cs.AI (cross-listed)

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

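Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

Affiliations: The University of Alabama, Alabama, USA; Roma Tre University, Rome, Italy

AI summary: Proposes Agentra, a supervisable multi-agent intrusion response framework that turns alerts into structured response plans through role scoping, a planner-validator loop, a security gateway, and risk scoring; on a 120-event corpus it raises F1 from 0.61 to 0.84 and brings the harmful-action rate down to 0.0%.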

Abstract

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner-Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.

2606.18837 2026-06-18 cs.MA cs.AI cs.LG (cross-listed)

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

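Hehai Lin, Qi Yang, Chengwei Qin

Affiliations: Ant Group; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes Skill-MAS, which decouples high-level orchestration capability into an evolvable Meta-Skill so that experience is retained without parameter updates. The Meta-Skill is refined through multi-trajectory rollouts and selective reflection, yielding significant gains across multiple benchmarks and LLMs at a controlled cost.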

Abstract

Large Language Model (LLM)-based automatic Multi-Agent Systems (MAS) generation has become a crucial frontier for tackling complex tasks. However, existing methods face a dilemma between model capability and experience retention. Inference-time MAS leverages frozen frontier LLMs but repeats identical searches without learning from past experience. Conversely, Training-time MAS internalizes experience via gradient updates but is constrained by the low capability ceiling of smaller models, and is hard to scale to large frontier LLMs. To bridge this gap, we propose Skill-MAS, a novel third path that decouples experience retention from parametric updates by conceptualizing the high-level orchestration capability as an evolvable Meta-Skill. Skill-MAS refines this architectural knowledge through a closed optimization loop: (1) Multi-Trajectory Rollout samples a behavioral distribution for each task under the current Meta-Skill; and (2) Selective Reflection adaptively selects priority tasks and applies hierarchical contrastive analysis to distill systemic experience into generalizable, strategy-level principles. Extensive experiments across four complex benchmarks and four distinct LLMs demonstrate that Skill-MAS not only achieves remarkable performance gains but also maintains a favorable cost-performance trade-off. Further analysis reveals that the evolved Meta-Skills are highly robust and exhibit strong transferability across unseen tasks and different LLMs.

2606.19111 2026-06-18 cs.CL cs.AI cs.MA (cross-listed)

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

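Haewoon Kwak

Affiliations: Indiana University Bloomington

AI summary: Studies when process-level coordination control adds value in multi-agent LLM teams. Behavioral signatures and ablations show that a controller's advantage appears only when the initial majority vote is unreliable, the task is recoverable, and undirected interaction cannot repair it, consistent with contingency theory.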

Comments 33 pages

Abstract

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does process-level coordination control add value, and do those conditions match what team science predicts? We use behavioral signatures (majority lock-in, exploration, recovery from an incorrect round-0 consensus) and per-action ablations, clean because each controller is an explicit action set, not a monolithic prompt. We operationalize three classical leadership styles (transactional, transformational, situational) as controllers over a shared action vocabulary (explore, revise, accept, synthesize). A matched controller with the same actions but an arbitrary rule recovers no better than majority voting, so the theory-derived rule, not the vocabulary, does the work. Across four task regimes and three open-weight model families, no controller dominates by accuracy, as the contingency view predicts: transactional control matches a shared round-0 vote on all 12 (model, regime) combinations to within 1.3pp, and gains appear only on the one combination where the round-0 majority is unreliable (llama-4-scout social; situational +8pp over flat). A recovery-advantage account, tested with four boundary probes, says a controller beats plain interaction only where the round-0 majority is unreliable, the task is recoverable, and undirected interaction does not already repair it. These regions map onto contingency theory (leadership substitutes, path-goal redundancy, the situational readiness gap), so a largely null accuracy result is what the theory predicts, not a failure of the controllers. We read process-level coordination control as a contingency to be measured and theory-mapped, not a leaderboard to be topped.

2606.19135 2026-06-18 cs.MA cs.AI cs.NI (cross-listed)

A Technical Taxonomy of LLM Agent Communication Protocols

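Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Affiliations: Technische Universität München

AI summary: Addresses the fragmentation of LLM agent communication protocols by proposing a five-dimension technical taxonomy, analyzing nine open-source protocols, identifying architectural patterns, and projecting how protocols will evolve.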

Abstract

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC (cross-listed)

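DRIFT: Refining Instruction Data via On-Policy Data Attribution

Zefan Wang, Lincheng Li, Tianyu Yu, Yuan Yao

Affiliations: Tsinghua University

AI summary: Proposes DRIFT, which uses on-policy influence functions to address the proximity gap and gradient-norm bias of standard influence functions in data attribution for instruction tuning; using the model's own generations as validation targets, it raises the performance ceiling of 7B models.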

Abstract

Optimizing the training data distribution for Supervised Fine-Tuning (SFT) dictates the capability of Large Language Models (LLMs). While existing data curation methods excel at accelerating training under constrained budgets, they are less suited to elevating the capability upper bound. The challenge here is no longer to identify a smaller subset that preserves performance, but to refine the data distribution toward instances most capable of improving the final model. To address this problem, we explore instance-level data attribution using Influence Functions (IF). We identify that standard IF formulations struggle in this setting due to two structural limitations: a proximity gap caused by off-policy validation targets, and a severe bias towards gradient norm. We propose DRIFT (Data Refinement via On-Policy Influence Functions for Supervised Fine-Tuning). Instead of relying on external reference data, DRIFT utilizes the model's on-policy rollouts as validation targets, which empirically minimizes the parameter proximity gap and better aligns with the local neighborhood assumption of IF. It further applies signed weighting based on trajectory correctness and debiases influence scores against the gradient hacking issue, allowing a small set of validation queries to act as reliable anchors for attributing the full dataset. Experiments on 7B-parameter instruction and reasoning models show that DRIFT consistently raises the performance ceiling on both, outperforming existing data curation baselines.

2606.18315 2026-06-18 cs.LG cs.AI (cross-listed)

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Tianyu Wang, Ying Wang, Zhihao Liu, Xi Vincent Wang, Lihui Wang

Affiliations: KTH Royal Institute of Technology; Department of Production Engineering, KTH Royal Institute of Technology; Department of Decision and Control Systems, KTH Royal Institute of Technology

AI summary: Proposes Ghost Attractor Networks, a theoretically derived dynamical decoder that builds a basin-attractor structure for efficient closed-loop sequential generation; on robotic action decoding, a 2.3M-parameter model matches the offline accuracy of a 1.07B-parameter Diffusion Transformer at 32 times lower latency.

Abstract

Sequential output generation with large-scale Transformer and diffusion decoders pays a memory cost that grows with sequence length, plus iterative per-step computation. Replacing them with small feed-forward decoders restores efficiency but produces unstructured latent representations that limit closed-loop control: phase-conditioned action generation and cross-step latent carry-over both require a latent geometry with stable basins. This article proposes Ghost Attractor Networks, a theoretically derived dynamical decoder whose latent evolves under a learned potential with drift and produces a basin-attractor structure by construction. Three desiderata (multi-modality, decoder-level single-pass switching, and constant memory) motivate the potential-drift form, and mode transitions arise as saddle-node bifurcations with ghost-attractor escape. A hierarchical phase-space decomposition separates first-order basin convergence from second-order proprioceptive refinement. Empirically, a Ghost trained end-to-end with a behavioral-cloning and contrastive objective exhibits the predicted gradient-flow contraction in its potential, with the gradient norm decaying by 67 percent across five integration steps on 1430 held-out samples. Ghost is evaluated as a robotic action decoder. A 2.3-million-parameter Ghost matches the offline accuracy of a 1.07-billion-parameter Diffusion Transformer at 462 times fewer parameters and 32 times lower latency, and beats five alternative 2M-parameter decoders (MLP, Neural ODE, CVAE, Transformer, 1-step Diffusion) on offline mean squared error by 5.9 to 29 percent. On the LIBERO-10 closed-loop benchmark, phase conditioning on Ghost's basin-structured latent yields a 13.5 percentage-point success-rate gain over a feed-forward MLP baseline, and persistent-latent ensembling reaches a 95.7 percent final success rate.

2606.18324 2026-06-18 cs.LG cs.AI (cross-listed)

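Neural Phase Correlation

Cole Reynolds

Affiliations: Weyl Labs

AI summary: Proposes a learned generalization of phase correlation that decomposes the transformation over a learnable basis, extends to non-rigid deformations and unitary dynamics, and matches or exceeds existing methods on cardiac MRI and echocardiography datasets.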

Abstract

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
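For readers unfamiliar with the classical method that the abstract generalizes, here is a minimal sketch of fixed-basis phase correlation in one dimension, written in plain Python with a naive DFT so that it has no dependencies. It recovers only a global circular shift, which is exactly the restriction the abstract describes. The signal values and function names are illustrative assumptions; the learned-basis method itself is not reproduced here.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def phase_correlation_shift(ref, moved):
    """Return s such that moved[t] == ref[(t - s) % n] for a circular shift."""
    f_ref, f_mov = dft(ref), dft(moved)
    cross = [m * r.conjugate() for m, r in zip(f_mov, f_ref)]
    # Keep only the phase of the cross-power spectrum.
    normalized = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [value.real for value in idft(normalized)]
    return max(range(len(corr)), key=corr.__getitem__)


ref = [0, 1, 4, 2, 7, 3, 5, 1, 0, 2, 6, 1, 3, 8, 2, 4]
moved = ref[-5:] + ref[:-5]  # circular shift to the right by 5 samples
shift = phase_correlation_shift(ref, moved)
```

Discarding the magnitude and keeping the phase turns the correlation surface into a single sharp peak at the shift, which is why the method reads the transformation directly rather than inferring it from image content.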

2606.18521 2026-06-18 cs.LG cs.AI (cross-listed)

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

Chenrui Wu, Zexi Li, Jiajun Bu, Jiangchuan Liu, Haishuai Wang

Affiliations: Zhejiang University; Simon Fraser University; The Chinese University of Hong Kong; Zhejiang Key Lab of Accessible Perception and Intelligent Systems

AI summary: Finds that the sparse updates of RLVR models are spread farther apart in parameter space, forming near-orthogonal shortcuts that make merging fragile, and proposes SAR-Merging to address the problem.

Comments Accepted by KDD 2026

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful post-training paradigm that surpasses Supervised Fine-Tuning (SFT) in eliciting reasoning intelligence and resisting catastrophic forgetting. Recent studies further reveal that RLVR induces highly sparse and off-principal parameter updates compared to SFT. This naturally raises the question: does such sparsity make RLVR models more amenable to model merging? If so, model merging would offer a scalable, training-free path to aggregate diverse reasoning capabilities from independently trained RLVR models. Surprisingly, we find the opposite, uncovering a sparsity curse: the sparse RLVR updates are spread farther apart in parameter space, forming near-orthogonal shortcuts that make aggregation inherently fragile. This is likely rooted in the stochasticity of RL optimization and the diversity of emergent reasoning patterns. Unlike SFT models that converge to shared, flat basins and merge naturally, RLVR models suffer severe degradation under standard merging methods. Through systematic empirical analysis of the update geometry, we characterize the mechanisms behind this failure and propose Sensitivity-aware Resolving Merging (SAR-Merging), a merging recipe tailored for the unique structure of RLVR parameter spaces. SAR-Merging resolves conflicts in overlapping update regions via Fisher Information-based sensitivity arbitration, followed by magnitude-aware sparsification and rescaling to preserve fragile reasoning pathways. Experiments on mathematical and coding benchmarks demonstrate that SAR-Merging substantially outperforms existing merging methods on RLVR models, enabling both single-task enhancement and multi-capability fusion.

2606.18561 2026-06-18 cs.LG cs.AI (cross-listed)

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

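Saraa Ali, Vladimir Bocharnikov, Fedor Ratnikov, Mikhail Hushchyn, Artem Ryzhikov, Denis Derkach

Affiliations: Laboratory of Methods for Big Data Analysis, HSE University

AI summary: Proposes a WGAN-based method in which a learnable calibration transformation maps a changed detector response distribution back to the reference distribution; validated on a tracking-detector toy model and simulated calorimeter data, it recovers aging coefficients and improves the agreement of energy distributions.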

Comments This is a preprint sent to Nuclear Science and Techniques journal

Abstract

The quality of recorded data depends on the stability of the sensor system that acquires it. Sensor motion and aging can degrade the performance and stability of downstream data-driven methods. We present a Wasserstein-GAN-inspired approach for unsupervised inference of physically interpretable transformation parameters that map a changed detector response distribution back to a nominal reference distribution. In contrast to standard generative modeling, the generator is used as a learnable calibration transformation whose trainable weights represent the sought parameters, while the critic provides a distributional distance signal via the Wasserstein objective. We validate the approach on a tracking-detector toy model with controlled layer shifts and demonstrate its application on high-granularity Geant4-simulated calorimeter data with cell-wise aging effects. The method recovers aging coefficients for individual cells with correlation to ground truth and improves agreement between calibrated and reference energy-sum distributions, while exhibiting the expected degradation at increasing channel-to-channel noise levels. These results indicate that adversarial distribution matching can serve as a data-driven component of calibration strategies in settings where direct labels for degradation parameters are unavailable.

2606.18587 2026-06-18 cs.CL cs.AI (cross-listed)

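Learning from Your Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang

Affiliations: Beijing Institute of Technology; Beihang University; Independent Researcher

AI summary: Proposes SC-GRPO, which uses the KL divergence between the original and self-conditioned distributions as a multiplicative weight on GRPO gradients for fine-grained credit assignment, improving over GRPO by 8.1% on average across math, code, and agentic tasks.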

Abstract

Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts: GRPO variants rely on process reward models or ground-truth answers, and knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
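To make the weighting idea concrete, here is a minimal sketch of using a per-token KL divergence as a multiplicative weight on a sequence-level advantage. It is not the authors' implementation: the KL direction (conditioned relative to original), the mean-1 normalization, and the function names `per_token_kl` and `kl_weighted_advantages` are all assumptions made for illustration, and the toy distributions are invented.

```python
import math


def per_token_kl(p_rows, q_rows):
    """KL(p || q) at each token position; every row is a probability vector."""
    return [
        sum(p * math.log(p / q) for p, q in zip(p_row, q_row) if p > 0.0)
        for p_row, q_row in zip(p_rows, q_rows)
    ]


def kl_weighted_advantages(advantage, conditioned, original):
    """Spread one sequence-level advantage over tokens in proportion to KL.

    Assumption: weights are normalized to mean 1 so the overall gradient
    scale stays comparable to uniform credit. The abstract does not state this.
    """
    kl = per_token_kl(conditioned, original)
    mean_kl = sum(kl) / len(kl)
    if mean_kl == 0.0:
        # Conditioning changed nothing: fall back to uniform credit.
        return [advantage] * len(kl)
    return [advantage * k / mean_kl for k in kl]


# Toy example (invented numbers): three token positions, vocabulary of size two.
original = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
conditioned = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
weights = kl_weighted_advantages(1.0, conditioned, original)
```

In this toy case only the middle token's distribution moves under conditioning, so it receives all of the credit while the two unchanged positions receive none.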

2606.18812 2026-06-18 cs.LG cs.AI 交叉投稿

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Abdelrahman Zighem, Jill-Jênn Vie

发表机构 * École normale supérieure de Paris, PSL University, Paris, France（巴黎高等师范学院，PSL大学，法国巴黎）； Soda team, Inria Saclay, Palaiseau, France（Soda团队，法国国家信息与自动化研究所萨克雷中心，法国帕莱索）

AI总结提出通过合成MDP构建强化学习基础模型，利用固定大小的充分统计量使注意力架构适用，在线和离线实验均优于传统算法。

AI中文摘要

语言和视觉的基础模型由互联网规模的数据驱动，而结构化领域（表格预测、时间序列预测、图学习、强化学习）则不然。替代方案是合成数据，它将负担从收集转移到先验设计。这种先验已经存在于许多结构化任务中：TabPFN及其后续工作通过一个在合成贝叶斯先验上预训练的Transformer解决表格分类问题。我们提出两点。首先，强化学习是明显的空白：采样一个合成MDP与采样一个合成表格数据集一样可行，然而没有上下文强化学习工作将先验设计作为主要目标。其次，MDP允许一个固定大小的充分统计量，独立于观察到的回合且形状为表格形式，这使得它们直接适用于用于表格基础模型的基于注意力的架构，只需用策略头替换监督目标。这些共同定义了强化学习基础模型的议程。作为概念验证，我们完全在合成MDP上训练一个模型，并表明，无需任务特定的调优，它就能在上下文中解决留出的表格基准，包括在线和离线：在线时，使用比UCB-VI和表格Q-learning少得多的回合；离线时，与VI-LCB竞争。

英文摘要

Foundation models for language and vision are powered by internet-scale data, while structured domains (tabular prediction, time-series forecasting, graph learning, reinforcement learning) are not. The substitute is synthetic data, which shifts the burden from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. First, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. Second, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train one model entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

2606.18820 2026-06-18 cs.LG cs.AI 交叉投稿

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

成熟马尔可夫决策过程：信息增加与动作集缩小下的决策制定

Jiaxi Liu, Aiping Yang, Yuhang Yang, Shuqi Zhang, Zewei Dong, Jiangming Yang, Xuebin Chen

发表机构 * Ant International（蚂蚁国际）； School of Economics, Sichuan University（四川大学经济学院）； School of Economics, Fudan University（复旦大学经济学院）

AI总结针对决策过程中信息增加与动作集缩小的不对称性，提出成熟马尔可夫决策过程（MMDP）框架，并基于过期动作优先级原则开发结构感知强化学习方法，实验证明其能提升学习效率。

Comments 25 pages, 9 figures

AI中文摘要

序列决策问题通常表现出信息和决策灵活性的不对称演化：随着决策周期的展开，智能体获得更丰富的信息，而由于操作截止、承诺或资源约束，可行动作逐渐过期。标准的MDP公式通常将这种结构扁平化为阶段相关的状态描述和动作掩码，从而掩盖了嵌套的信息-动作不对称性，而这种不对称性决定了哪些决策是紧急的、哪些可以推迟。我们引入了成熟马尔可夫决策过程（MMDP），这是一种围绕这种信息-动作不对称性构建的公式。我们通过一个过期动作优先级原则来刻画其关键后果之一，该原则识别出必须在下一阶段之前解决的动作。受此结构启发，我们开发了一个结构感知的强化学习框架，包括阶段感知的策略设计、过期动作抽象以及带有蒸馏的搜索增强学习。在受控的多供应商补货问题、复杂度递增的简化现金管理环境以及生产级模拟器上的实验表明，显式建模这种不对称性可以提高学习效率，并且随着决策问题的规模扩大，其价值日益增加。

英文摘要

Sequential decision problems often exhibit an asymmetric evolution of information and decision flexibility: as a decision cycle unfolds, the agent receives richer information while feasible actions expire due to operational cutoffs, commitments, or resource constraints. Standard MDP formulations typically flatten this structure into stage-dependent state descriptions and action masks, thereby obscuring the nested information--action asymmetry that determines which decisions are urgent and which can be deferred. We introduce Maturing Markov Decision Processes (MMDPs), a formulation built around this information--action asymmetry. We characterize one of its key consequences through an expiring-action priority principle, which identifies the actions that must be resolved before the next stage. Motivated by this structure, we develop a structure-aware reinforcement learning framework with stage-aware policy design, expiring-action abstraction, and search-augmented learning with distillation. Experiments on a controlled multi-supplier replenishment problem, simplified cash-management environments of increasing complexity, and a production-scale simulator show that explicitly modeling this asymmetry improves learning efficiency and becomes increasingly valuable as decision problems scale.

2606.19025 2026-06-18 cs.LG cs.AI cs.DC cs.SY eess.SY 交叉投稿

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

FoMoE: 打破全副本壁垒的专家混合联邦系统

Lorenzo Sani, Zeyu Cao, Meghdad Kurmanji, Alex Iacob, Andrej Jovanovic, Yan Gao, Wanru Zhao, Nicholas D. Lane

发表机构 * DeepSeek-AI

AI总结提出FoMoE系统，通过跨工作节点分区专家层打破全副本范式，结合部分专家复制和跳跃令牌机制，显著降低通信开销并提升吞吐量。

AI中文摘要

预测关键因素：面向决策的强化学习用于未知离开时间的受控电动汽车充电

Giuseppe Gabriele, Fabio Pavirani, Seyed Soroush Karimi Madahi, Chris Develder

发表机构 * Ghent University -- imec（根特大学 -- imec）

AI总结针对电动汽车充电中离开时间未知导致强化学习策略效果差的问题，提出面向决策的强化学习框架，联合训练预测器与控制器，实现端到端优化，使总奖励提升14%，未供应能量减少55%。

Comments ACM e-Energy 2026 5 pages, 1 figure, 1 table

DOI: 10.1145/3744255.3811736

AI中文摘要

近年来电动汽车的普及给电力系统带来了挑战，包括峰值需求增加和潜在的电网不稳定。基于强化学习的智能充电控制可以通过从历史数据中学习时间和上下文模式来缓解这些问题。然而，在现实场景中，关键特征（如离开时间）通常不可用。这使得强化学习智能体更难学习和执行有效的充电策略。为了减轻这种不确定性，训练好的预测器可以从可用数据中近似未知特征。然而，由于这些预测模型通常针对准确性（而非对下游智能体决策质量的影响）进行训练，它们的误差可能会传播并阻碍使用预测的控制器的整体性能。为了避免这种情况，我们提出了一种面向决策的强化学习框架，其中预测器是端到端训练的，即通过强化学习智能体采取的充电策略动作的反馈。这种预测器和控制器的联合训练最终产生了更高质量的动作：与没有离开时间预测的强化学习方法相比，我们提出的面向决策的强化学习方法产生了更优的充电决策，总奖励提高了14%，未供应能量（即由于电动汽车已离开而未能进行的充电）减少了55%。

英文摘要

The recent growth of EV adoption poses challenges for power systems, including increased peak demand and potential grid instability. Smart control of EV charging -- e.g., based on reinforcement learning (RL) -- can alleviate these issues by learning temporal and contextual patterns from historical data. Yet, in real-world scenarios, key features, such as departure time, often are unavailable. This, in turn, makes it harder for an RL agent to learn and execute an effective charging policy. To mitigate this uncertainty, a trained forecaster can approximate the unknown features from available data. However, since these forecasting models are typically trained for accuracy (rather than their impact on a downstream agent's decision quality), their errors may propagate and hinder the overall performance of a controller that is using the forecasts. To avoid this, we propose a decision-focused RL (DF-RL) framework in which the forecaster is trained end-to-end, i.e., with feedback from the charging policy actions taken by the RL agent. Such joint training of both the forecaster and controller ultimately results in higher-quality actions: our proposed DF-RL method yields superior charging decisions compared to other baselines, achieving up to a 14% improvement in total reward and a 55% reduction of unsupplied energy (i.e., charging that failed to happen because the EV already left), relative to the RL method without departure time forecasting.

2606.19236 2026-06-18 cs.LG cs.AI cs.CL 交叉投稿

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE: 基于惊讶度的令牌级优势重加权以实现策略熵稳定性

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng, Han Hu, Yansong Tang

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Tencent Hunyuan（腾讯混元）

AI总结针对GRPO等RL算法中策略熵崩溃问题，提出STARE方法，通过惊讶度分位数识别熵关键令牌并重加权其优势，结合目标熵闭环门控稳定熵，在1.5B-32B模型和多种任务上实现稳定训练，AIME24/25准确率提升4%-8%。

Comments LLM, Reinforcement Learning

AI中文摘要

基于可验证奖励的强化学习算法（如GRPO）已成为LLMs复杂推理的主流后训练范式，但通常在训练中遭受策略熵崩溃。我们对GRPO下的令牌级熵动态进行一阶梯度分析，识别出令牌级信用分配不匹配：每个令牌的熵变化分解为轨迹级优势与下一个令牌分布上的熵敏感函数的乘积，产生优势-惊讶度四象限结构和近临界性质。受此启发，我们提出STARE（基于惊讶度的令牌级优势重加权以实现策略熵稳定性），该方法通过批次内惊讶度分位数识别熵关键令牌子集，选择性重加权其有效优势，并引入目标熵闭环门控以实现稳定的熵调节。在1.5B至32B的模型规模以及三个任务族（短思维链、长思维链和多轮工具使用）上，STARE在数千步内维持稳定的RL训练，同时将策略熵保持在目标带内。在AIME24和AIME25上，STARE在平均准确率上比DAPO和其他竞争基线高出4%-8%，反思令牌和响应长度同步增长，表明持续探索-利用平衡进一步释放了RL训练潜力。代码可在 https://github.com/hp-luo/STARE 获取。

英文摘要

Reinforcement Learning with Verifiable Rewards algorithms like GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the next-token distribution, yielding an advantage-surprisal four-quadrant structure and a near-criticality property. Motivated by this analysis, we propose STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability), which identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation. Across model scales from 1.5B to 32B and three task families (Short CoT, Long CoT, and Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while maintaining policy entropy within the target band. On AIME24 and AIME25, STARE outperforms DAPO and other competitive baselines by 4%-8% in average accuracy, with reflection tokens and response length growing in tandem, indicating a sustained exploration-exploitation balance that further unlocks RL training potential. Code is available at https://github.com/hp-luo/STARE.

2606.19317 2026-06-18 cs.LG cs.AI 交叉投稿

Explaining Attention with Program Synthesis

用程序合成解释注意力机制

Amiri Hayes, Belinda Li, Jacob Andreas

发表机构 * NJIT（新泽西理工学院）； MIT EECS（麻省理工学院电气工程与计算机科学系）； MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）

AI总结提出用可执行程序近似深度网络组件行为的方法，针对Transformer注意力头，通过生成Python程序再现注意力模式，实现可解释性。

AI中文摘要

可解释深度学习研究的一个长期目标是，用人类可理解的符号描述取代不透明的神经计算。本文提出了一种用可执行程序近似深度网络组件行为的方法。我们专注于Transformer语言模型中的注意力头。对于给定的注意力头，我们首先在一组随机选择的训练样本上计算其关联的注意力矩阵。接着，我们向预训练语言模型提供这些矩阵的摘要，并指示它生成一组Python程序，这些程序仅根据输入句子中的文本即可再现相关的注意力模式。最后，我们根据最终程序集在保留输入上预测行为的效果对程序进行重新排序。我们证明，少于1000个这样的生成程序即可再现GPT-2、TinyLlama-1.1B和Llama-3B中注意力头的注意力模式，在TinyStories上平均交并比相似度超过75%。此外，最佳匹配程序可以替代神经注意力头而不会显著影响模型行为：在三个模型中用程序替代25%的注意力头仅导致平均困惑度增加16%，同时在各种下游问答基准上保持性能。这项工作为使用人类可读、可执行的代码逆向工程Transformer模型中的注意力头提供了一个可扩展的流程，推动了神经模型向符号透明性的发展。

英文摘要

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predicts behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.

2606.19328 2026-06-18 cs.LG cs.AI cs.RO 交叉投稿

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

UBP2: 不确定性平衡的偏好规划用于高效基于偏好的强化学习

Mohamed Nabail, Leo Cheng, Jingmin Wang, Nicholas Rhinehart

发表机构 * Learning, Embodied Autonomy, and Forecasting (LEAF) Lab, University of Toronto（多伦多大学学习、具身自主与预测（LEAF）实验室）

AI总结提出UBP2方法，通过联合推理奖励、动力学和值函数的不确定性来主动引导探索，在Meta-World基准上显著提高了样本效率。

AI中文摘要

基于偏好的强化学习提供了一种从行为的成对比较中学习奖励模型的方法，绕过了显式奖励设计的需求。然而，现有方法通常依赖于被动数据收集，并且样本效率低下，尤其是在学习的早期阶段。我们引入了一种基于模型的方法，通过联合推理奖励、动力学和值函数的不确定性来主动引导探索。我们的方法，不确定性平衡的偏好规划（UBP2），使用奖励、动力学和值函数模型的集成，根据结合了期望奖励、终值和认知不确定性的统一评分来评估候选轨迹。在此目标下的规划产生了利用和信息获取之间的显式权衡，无需临时的探索启发式。在标准正则性假设下，我们为有限时域和无限时域设置建立了次线性遗憾保证。在Meta-World基准上的实验表明，UBP2比无模型的基于偏好的方法和非乐观的基于模型的基线方法实现了显著更高的样本效率。

英文摘要

Preference-based RL provides an approach to learning reward models from pairwise comparisons of behaviors, bypassing the need for explicit reward design. However, existing methods typically rely on passive data collection and suffer from poor sample efficiency, especially during the early stages of learning. We introduce a model-based approach that actively directs exploration by jointly reasoning over uncertainties in the reward, dynamics, and value functions. Our method, Uncertainty-Balanced Preference Planning (UBP2), uses ensembles of reward, dynamics, and value function models to evaluate candidate trajectories according to a unified score that combines expected reward, terminal value, and epistemic uncertainty. Planning under this objective yields an explicit tradeoff between exploitation and information acquisition without requiring ad hoc exploration heuristics. Under standard regularity assumptions, we establish sublinear regret guarantees for both finite-horizon and infinite-horizon settings. Empirically, experiments on the Meta-World benchmark show UBP2 achieves substantially higher sample efficiency than model-free preference-based methods and non-optimistic model-based baselines.

2606.18273 2026-06-18 cs.CL cs.AI cs.SD eess.AS 交叉投稿

Continuous Audio Thinking for Large Audio Language Models

面向大型音频语言模型的连续音频思考

Gyojin Han, Dong-Jae Lee, Changho Choi, Jongsuk Kim, Junmo Kim

发表机构 * KAIST（韩国科学技术院）

AI总结提出连续音频思考（CoAT）框架，通过专家蒸馏在连续潜在空间中组织声学信息，使音频语言模型在生成响应前利用丰富声学特征，无需额外自回归解码成本，在多个音频任务上提升性能。

Comments Preprint

AI中文摘要

大型音频语言模型（LALMs）在从语音转录到音乐分析等多种音频理解任务中展现了令人印象深刻的能力。然而，由于LALMs通常被训练生成与文本对齐的响应，其隐藏状态逐渐为文本生成而塑造，而非保留声学信息。因此，音频携带的多样化声学内容，如语音细节、韵律、声音事件、情感和音调，在过程中丢失，难以在响应中利用。我们引入了连续音频思考（CoAT），这是一个框架，为音频语言模型配备一个连续的潜在工作空间，用于在响应生成之前组织声学信息，并通过音频专家的蒸馏进行基础化。在思考空间内，模型可以在生成响应时利用专家蒸馏提供的丰富声学信息。此外，所提出的连续思考块可以在单个预填充中处理，因此CoAT不需要比基线额外的自回归解码成本。在三个LALM上，Qwen2-Audio、Qwen2.5-Omni-7B和Audio Flamingo 3，在涵盖音频推理、音频理解、音乐分类、语音情感和语音转录的广泛基准套件上的性能提升证明了CoAT的有效性。进一步分析证实，辅助监督从思考位置传播到模型的文本响应。

英文摘要

Large audio language models (LALMs) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis. However, because LALMs are typically trained to produce text-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response. We introduce Continuous Audio Thinking (CoAT), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline. Across three LALMs, Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model's textual responses.

2606.18372 2026-06-18 cs.CL cs.AI 交叉投稿

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

保留还是删除？用于教育对话去标识的完全本地AI级联框架

Haocheng Zhang, Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, René F. Kizilcec

发表机构 * Cornell University（康奈尔大学）

AI总结针对教育对话中课程术语与个人身份信息混淆的问题，提出一种完全本地的级联框架，通过召回优先的联合提议器和上下文感知审查器实现约束性隐私分类，在数学辅导对话上达到0.958的宏F1，优于商业API和纯LLM基线。

AI中文摘要

基于Transformer的架构在生成复杂符号序列方面取得了显著进展，但在实现对离散信号属性的细粒度、可解释控制方面仍存在明显差距。本文研究了多轨音乐Transformer（MMT）的机制可解释性，并提出了一种无需重新训练即可通过推理时激活引导实现确定性属性调制的框架。利用差分均值（DiffMean）方法，我们在残差流中分离出信号属性（特别是音高和时长）的潜在方向。我们验证了该领域的线性表示假设，实现了引导幅度与属性偏移之间的高相关性。为了解决多属性引导中固有的特征纠缠问题，我们引入了一种利用Gram-Schmidt正交化的双引导框架。实验结果表明，与朴素向量加法相比，这种几何解耦减少了概念干扰和信号退化，即使在强自回归条件下也能实现独立的确定性控制。

英文摘要

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
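The two ingredients named above, a Difference-in-Means direction and a Gram-Schmidt step that decouples two steering directions, can be sketched in a few lines. This is an illustrative sketch only: the helper names (`diff_mean`, `orthogonalize`, `steer`), the choice to orthogonalize the duration direction against the pitch direction, and the strengths `alpha` and `beta` are assumptions, not details taken from the paper.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_mean(high_acts, low_acts):
    """Difference-in-Means direction: mean(high-attribute) - mean(low-attribute)."""
    hi, lo = mean_vector(high_acts), mean_vector(low_acts)
    return [h - l for h, l in zip(hi, lo)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its component along u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


def steer(hidden, pitch_dir, duration_dir, alpha, beta):
    """Add both directions to a residual-stream vector after decoupling them.

    Assumption: duration is orthogonalized against pitch (the order is a choice).
    """
    duration_perp = orthogonalize(duration_dir, pitch_dir)
    return [
        h + alpha * p + beta * d
        for h, p, d in zip(hidden, pitch_dir, duration_perp)
    ]


# Toy directions (invented): duration initially overlaps with pitch.
pitch = [1.0, 0.0, 0.0]
duration = [1.0, 1.0, 0.0]
steered = steer([0.0, 0.0, 0.0], pitch, duration, alpha=2.0, beta=3.0)
```

With naive vector addition the duration term would also push along the pitch axis; after the orthogonalization step the two strengths act on separate axes in this toy case.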

2606.18801 2026-06-18 cs.IR cs.AI 交叉投稿

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

SHIFT: 通过索引侧特征变换实现多语言信息检索的语义对齐

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

发表机构 * Department of Computer Science and Engineering, Korea University（韩国大学计算机科学与工程系）

AI总结提出SHIFT方法，在索引阶段通过平行翻译对估计相对语言向量并修正文档嵌入，以缓解多语言密集检索中的语言偏差，无需训练即可提升检索性能。

AI中文摘要

随着大规模多语言语料库的迅速扩展，多语言信息检索（MLIR）已成为全球信息访问的关键技术。MLIR使用户能够使用单语言查询从多语言文本集合中检索语义相关的文档。然而，最近的多语言密集检索模型通常表现出对与查询相同语言的文档的强烈偏好。这导致了严重的语言偏差，即排名靠前的结果被特定语言的文档主导，即使其他语言的文档包含更多语义相关信息。为了解决这个问题，我们提出了SHIFT，一种在索引阶段适用的无需训练的方法。具体来说，SHIFT利用平行翻译对来估计每个目标语言相对于源语言的相对语言向量。随后，SHIFT通过在索引期间从文档嵌入中减去该相对语言向量来纠正语言特定的偏移。我们在四个MLIR基准测试和多种密集检索模型上的全面评估证实，SHIFT可以有效缓解语言偏差并提升MLIR性能。

英文摘要

With the rapid expansion of massive multilingual corpora, Multilingual Information Retrieval (MLIR) has emerged as a critical technology for global information access. MLIR enables users to retrieve semantically relevant documents from multilingual text collections using a single-language query. However, recent multilingual dense retrieval models often exhibit a strong preference for documents in the same language as the query. This leads to severe language bias, where top-ranked results are dominated by documents of specific languages, even when documents in other languages contain more semantically relevant information. To address this issue, we propose SHIFT, a training-free method applicable in the indexing stage. Specifically, SHIFT utilizes parallel translation pairs to estimate a relative language vector for each target language with respect to a source language. Subsequently, SHIFT corrects the language-specific offset by subtracting this relative language vector from document embeddings during indexing. Our comprehensive evaluation across four MLIR benchmarks and diverse dense retrieval models confirms that SHIFT can effectively mitigate language bias and enhance MLIR performance.
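The index-side correction described above can be sketched as follows. The abstract does not say how the relative language vector is estimated, so the mean embedding offset over parallel pairs used here is an assumption, as are the function names and the toy two-dimensional embeddings.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def relative_language_vector(target_embs, source_embs):
    """Assumed estimator: mean offset (target minus source) over parallel pairs.

    target_embs[i] and source_embs[i] embed translations of the same text.
    """
    diffs = [
        [t - s for t, s in zip(t_vec, s_vec)]
        for t_vec, s_vec in zip(target_embs, source_embs)
    ]
    return mean_vector(diffs)


def shift_document(doc_emb, language_vector):
    """Subtract the language offset from a document embedding at indexing time."""
    return [d - v for d, v in zip(doc_emb, language_vector)]


# Toy parallel pairs (invented): the target language adds a constant offset.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.5, 0.25], [0.5, 1.25]]
lang_vec = relative_language_vector(target, source)
corrected = shift_document([2.0, 2.0], lang_vec)
```

Because the correction is applied once per document while building the index, query encoding and the retrieval model itself are left untouched, which is consistent with the training-free framing in the abstract.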

2606.18811 2026-06-18 cs.IR cs.AI 交叉投稿

Rescaling MLM-Head for Neural Sparse Retrieval

重新缩放MLM头部用于神经稀疏检索

Youngjoon Jang, Seongtae Hong, Jonah Turner, Heuiseok Lim

发表机构 * Korea University（韩国大学）

AI总结针对SPLADE中MLM头部尺度不匹配导致训练不稳定和性能下降的问题，提出初始化时对MLM头部投影进行常数因子重缩放，零成本提升训练稳定性，使大范数骨干网络成为有竞争力的稀疏检索器。

AI中文摘要

学习型稀疏检索（LSR）模型（如SPLADE）传统上使用BERT风格的掩码语言模型作为骨干编码器。一个自然的期望是，用更强的预训练编码器替换BERT应能提高检索效果。然而，我们发现，在标准的SPLADE训练方案下，具有大MLM头部L2范数的骨干网络可能会遭受性能下降，甚至出现训练崩溃。我们将此失败归因于MLM头部中的尺度不匹配：SPLADE直接使用MLM头部输出来构建稀疏词汇表示，查询-文档相关性通过这些表示上的未归一化点积计算。因此，膨胀的MLM头部尺度会放大稀疏激活，扭曲匹配分数，并在常见训练设置下破坏对比训练的稳定性。为了解决这个问题，我们引入了一个简单的初始化时修正，在SPLADE训练之前通过一个常数因子重新缩放MLM头部投影。这种零成本调整提高了训练稳定性，而无需修改模型架构或训练目标。在领域内和跨领域检索基准测试中，这种简单的修正显著改善了诸如ModernBERT和Ettin等大范数骨干网络，将不稳定的训练运行转变为有竞争力的稀疏检索器。在多个设置中，修正后的模型进一步匹配或超越了经典的BERT-SPLADE基线。这些发现表明，将预训练编码器适应于LSR的瓶颈不仅仅是编码器容量，而是用于构建稀疏词汇表示的MLM头部尺度的校准。

英文摘要

Learned sparse retrieval (LSR) models such as SPLADE have traditionally used BERT-style masked language models as backbone encoders. A natural expectation is that replacing BERT with stronger pretrained encoders should improve retrieval effectiveness. However, we find that under standard SPLADE training recipes, backbones with large MLM-head L2 norms can suffer performance degradation and even training collapse. We identify this failure as a scale mismatch in the MLM head: SPLADE directly uses MLM-head outputs to construct sparse lexical representations, and query-document relevance is computed by an unnormalized dot product over these representations. As a result, an inflated MLM-head scale can amplify sparse activations, distort matching scores, and destabilize contrastive training under common training settings. To address this issue, we introduce a simple initialization-time correction that rescales the MLM-head projection by a constant factor before SPLADE training. This zero-cost adjustment improves training stability without modifying the model architecture or training objective. Across both in-domain and out-of-domain retrieval benchmarks, this simple correction substantially improves large-norm backbones such as ModernBERT and Ettin, turning unstable training runs into competitive sparse retrievers. In several settings, the corrected models further match or surpass the classic BERT-SPLADE baseline. These findings suggest that the bottleneck in adapting pretrained encoders to LSR is not encoder capacity alone, but the calibration of the MLM-head scale used to construct sparse lexical representations.
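The scale argument can be illustrated with a toy computation. The sketch below uses the common SPLADE-style term weighting (max over tokens of log(1 + ReLU(logit))) and an unnormalized dot-product score; the tiny weight matrix, the hidden state, and the factor 0.25 are invented for illustration, and how the paper actually chooses its constant is not stated in the abstract.

```python
import math


def rescale_head(weight, factor):
    """Multiply every entry of the MLM-head projection by a constant factor."""
    return [[w * factor for w in row] for row in weight]


def head_logits(hidden, weight):
    """Vocabulary logits for one token: one dot product per vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def splade_vector(token_hiddens, weight):
    """SPLADE-style term weights: max over tokens of log(1 + ReLU(logit))."""
    per_token = [head_logits(h, weight) for h in token_hiddens]
    return [
        max(math.log1p(max(0.0, tok[j])) for tok in per_token)
        for j in range(len(weight))
    ]


def score(query_vec, doc_vec):
    """Unnormalized dot product used as the relevance score."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


# Toy head with an inflated scale (invented numbers), vocabulary of size two.
weight = [[4.0, 0.0], [0.0, 4.0]]
tokens = [[1.0, 2.0]]  # a single token's hidden state

raw = splade_vector(tokens, weight)
scaled = splade_vector(tokens, rescale_head(weight, 0.25))  # hypothetical factor
```

Comparing `score(raw, raw)` with `score(scaled, scaled)` shows how a larger head scale inflates the unnormalized matching score even though the underlying hidden state is unchanged.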

2606.18831 2026-06-18 cs.CL cs.AI 交叉投稿

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

超越奖励工程：长上下文强化学习的数据配方

Xiaoyue Xu, Sikui Zhang, Xiaorong Wang, Xu Han, Chaojun Xiao

发表机构 * OpenBMB ； Tsinghua University（清华大学）

AI总结提出一种简单有效的数据配方，结合最小化基于结果的GRPO设置，显著提升大语言模型的长上下文推理能力，在多个基准和智能体任务上取得平均+3.2至+7.2点的提升。

Comments 15 pages, 6 figures, 12 tables

AI中文摘要

长上下文推理是大语言模型的一项关键能力，特别是当它们作为必须推理长轨迹的自主智能体部署时。强化学习最近成为提升这一能力的主要范式，然而现有工作主要关注奖励工程，而多样化的训练数据仍然稀缺。我们从数据为中心的角度重新审视这个问题，并表明仅凭一种简单有效的数据配方，结合最小化基于结果的GRPO设置，就足以显著提升长上下文推理。我们的配方针对三个互补的任务族——检索、多证据合成和推理——我们构建并整理了八个数据集，总计约1.4万个示例。在三个模型（Qwen3-4B/8B/30B-A3B）上的实验在七个长上下文基准上取得了平均+7.2/+3.2/+6.4分的提升，超过了之前的强化学习训练集。我们进一步证明这些增益可以迁移到智能体任务中，在基于智能体调整的模型上继续使用我们的数据配方进行强化学习训练，GAIA提升+4.8分，BrowseComp提升+7.0分。我们将发布我们的数据集以促进未来研究。

英文摘要

Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.

2606.18852 2026-06-18 cs.CL cs.AI 交叉投稿

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

对齐隐含陈述：通过上下文边界半硬负挖掘实现隐式仇恨言论的泛化性

Wicaksono Leksono Muhamad, Yunita Sari

发表机构 * Mantera Studio（Mantera工作室）； Universitas Gadjah Mada（加雅玛大学）

AI总结提出ImpSH三元组框架，通过将帖子与隐含陈述对齐并使用上下文边界半硬负样本聚焦学习，提升隐式仇恨言论的跨域泛化能力，在多个数据集上优于对比基线。

AI中文摘要

隐式仇恨言论分类仍然是一个挑战，因为意图通常通过暗示和上下文而非明确辱骂来掩盖。先前的监督对比方法改进了域内检测，但可能过拟合表面线索，且难以跨数据集迁移。我们提出ImpSH，一个基于三元组的框架，当隐含陈述可用时将其与帖子对齐，并使用上下文边界半硬负样本将学习聚焦于近混淆项。我们还研究了AugSH，它通过数据增强形成正样本。在使用BERT和HateBERT对IHC、SBIC和DynaHate进行的受控评估中，ImpSH是标准监督对比基线的可行替代方案，并且在匹配的预处理和调优预算下通常能提高跨域性能。使用对齐性和均匀性进行的表示分析表明，正样本对更紧密且全局分布平衡，定性最近邻案例研究展示了域转移下的典型假负例。这些结果表明，通过上下文边界挖掘将帖子与其隐含陈述对齐，提供了到相关暗示的更稳定、类似双射的映射，克服了传统基于聚类的表示学习固有的波动性。

英文摘要

Classifying implicit hate speech remains a challenge, as intent is often masked through insinuation and context rather than explicit slurs. Prior supervised contrastive approaches improve in-domain detection but can overfit surface cues and struggle to transfer across datasets. We propose ImpSH, a triplet-based framework that aligns posts with implied statements when available and uses context-bounded semi-hard negatives to focus learning on near confusions. We also examine AugSH, which forms positives via data augmentation. In controlled evaluations on IHC, SBIC, and DynaHate with BERT and HateBERT, ImpSH is a viable alternative to standard supervised contrastive baselines and often improves cross-domain performance under matched preprocessing and tuning budgets. Representation analysis using alignment and uniformity indicates tighter positive pairs with balanced global spread, and qualitative nearest-neighbor case studies illustrate typical false negatives under domain shift. These results demonstrate that aligning posts with their implied statements via context-bounded mining provides a more stable, bijective-like mapping to related insinuations, overcoming the volatility inherent in traditional clustering-based representation learning.

2606.18922 2026-06-18 cs.CL cs.AI 交叉投稿

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

像火箭科学一样简单：评估大型语言模型解释比喻语言中否定能力的研究

Jasmine Owers, Edwin Simpson, Martha Lewis

发表机构 * Intelligent Systems Lab, University of Bristol（布里斯托尔大学智能系统实验室）； ILLC, University of Amsterdam（阿姆斯特丹大学逻辑、语言与计算研究所）

AI总结本研究通过开发新的注释数据集，测试多种大型语言模型在比喻语言中理解否定的能力，发现否定与比喻的组合对模型构成挑战，且性能高度依赖提示风格。

Comments 16 pages, 16 figures; for associated code and data see https://github.com/jrdowers/Negation-and-Fig-Lang; To be published in Transactions of the Association for Computational Linguistics

2606.18986 2026-06-18 cs.CL cs.AI 交叉投稿

医学LLM适应中的权衡：法语问答的实证研究

Ikram Belmadani, Oumaima El Khettari, Carlos Ramisch, Frederic Bechet, Richard Dufour, Benoit Favre

发表机构 * Aix-Marseille Univ., CNRS, LIS UMR 7020（艾克斯-马赛大学，法国国家科学研究中心，计算机与系统实验室）； Nantes Univ., École Centrale Nantes, CNRS, LS2N UMR 6004（南特大学，南特中央理工学院，法国国家科学研究中心，数字科学实验室）； Grenoble Alpes Univ., CNRS, INRIA, Grenoble INP, LIG UMR 5217（格勒诺布尔-阿尔卑斯大学，法国国家科学研究中心，法国国家信息与自动化研究所，格勒诺布尔理工学院，信息学实验室）

AI总结通过法语医学问答任务，实证比较持续预训练（CPT）和监督微调（SFT）在多个模型家族和规模下的效果，发现CPT+SFT在多项选择问答上最优但增益小，SFT是强且经济的默认选择，而CPT在开放式问答中提升重叠指标。

AI中文摘要

大型语言模型（LLMs）的发展导致了对它们适应专业领域和语言的关注增加，但领域适应策略的有效性仍不明确。我们以法语医学问答（QA）为案例，进行了医学领域适应的研究。我们比较了持续预训练（CPT）、监督微调（SFT）及其组合，跨越三个模型家族、多个规模和三种初始化类型，明确区分了适应效果与基础模型选择。我们在贪婪和约束解码下，使用自动指标和LLM-as-a-Judge评估，评估了多项选择问答（MCQA）和开放式问答（OEQA）。对于MCQA，CPT+SFT通常取得最佳分数，但相比SFT的增益很小且通常不显著，使得SFT成为强大且成本效益高的默认选择。对于OEQA，CPT持续改善基于重叠的指标，而SFT常降低生成质量；指令调优和CPT+SFT在基于LLM的评估中更受青睐。跨语言实验进一步显示，法语适应能有效迁移到英语基准。总体而言，我们为在计算约束下选择适应策略提供了实用指南。

英文摘要

The development of large language models (LLMs) has led to an increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question-answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.

2606.19325 2026-06-18 cs.SD cs.AI cs.CV 交叉投稿

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

参考驱动的野外先验多说话人音频场景生成

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

发表机构 * Lightricks ； Tel Aviv University（特拉维夫大学）

AI总结提出ScenA方法，利用预训练的文本到音频流匹配基础模型，通过多参考声音和自然语言提示生成多说话人音频场景，并采用高噪声偏置时间步分布解决参考捷径问题，在CoVoMix2-Dialogue基准上优于现有系统。

Comments Project page at https://finmickey.github.io/scena/

AI中文摘要

现有的多说话人对话系统通过结构化监督（如每轮标签、多流转录或可学习说话人嵌入）将说话人与话语绑定。这些系统在仅语音的流水线中运行，生成干净的语音序列，缺乏真实对话的环境纹理。我们采取不同的方法。我们的方法ScenA将文本到音频流匹配基础模型（在大规模野外数据上预训练）直接以多个参考声音和描述整个多说话人音频场景的自由形式自然语言提示为条件。利用这样的基础模型使我们能够继承其生成自然、非录音室音频的能力：背景噪声、房间声学、重叠对话和自发的副语言事件，同时添加多说话人控制而无需任何每轮结构。具体地，参考潜在向量被连接到模型的令牌序列中，并通过轻量级的身份感知位置编码进行区分。然而，我们识别出这种方法的一个关键障碍：参考捷径。在标准噪声调度下的训练过程中，模型可以通过声学相似性识别匹配的参考与噪声目标，从而完全绕过文本提示。我们通过高噪声偏置的时间步分布来解决这个问题，迫使模型依赖文本提示进行说话人分配。我们在CoVoMix2-Dialogue基准上评估ScenA，结果表明它在说话人绑定指标上优于现有的多说话人系统，同时生成具有重叠语音、情感发声和环境声音的丰富对话音频。我们的结果证明了使用以自由形式场景描述为条件的通用音频模型，而不是通过仅语音流水线传递结构化对话脚本的优势。

英文摘要

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
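
The idea of a high-noise-biased timestep distribution can be illustrated with a small sketch. The abstract does not state which distribution ScenA uses, so the power-law warp below, the function name `sample_timesteps`, and the `bias` parameter are all hypothetical; the sketch also assumes the convention that t = 1 means pure noise.

```python
import random

def sample_timesteps(n, bias=3.0, seed=0):
    """Draw n timesteps in [0, 1], assuming t = 1 corresponds to pure noise.

    Raising a uniform sample to the power 1 / bias (with bias > 1) pushes
    samples towards 1, so heavily noised targets are seen more often than
    under uniform sampling. bias = 1 recovers the uniform case.
    """
    rng = random.Random(seed)
    return [rng.random() ** (1.0 / bias) for _ in range(n)]

uniform = sample_timesteps(10_000, bias=1.0)
biased = sample_timesteps(10_000, bias=3.0)
mean_uniform = sum(uniform) / len(uniform)  # close to 1/2
mean_biased = sum(biased) / len(biased)     # close to bias / (bias + 1) = 3/4
```

With more mass at high noise levels, the noisy target carries less acoustic information about the speaker, which is the condition under which the abstract argues the model has to fall back on the text prompt.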

2606.18363 2026-06-18 cs.RO cs.AI 交叉投稿

Guava: An Effective and Universal Harness for Embodied Manipulation

Guava: 一种有效且通用的具身操作工具框架

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

发表机构 * University of Maryland College Park（马里兰大学帕克分校）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Waterloo（滑铁卢大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； University of Pennsylvania（宾夕法尼亚大学）； Amazon FAR（亚马逊 FAR）

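AI summary: Proposes the Guava harness, built on three key design choices (iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations), and distills embodied manipulation capabilities into a 4B open-source model whose performance in simulation and real-world settings is comparable to frontier proprietary models.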

AI中文摘要

在大规模视觉-语言数据上训练的语言模型已展现出作为具身智能体的强大潜力。通过具身工具使用来驾驭模型，为端到端的视觉-语言-行动系统提供了一种有前景的替代方案，它将高层推理与外部模块（用于感知、规划和控制）相结合。然而，对于具身操作而言，什么构成了有效的工具框架，以及这种框架能在多大程度上解锁广泛推理模型的具身能力，仍不清楚。在这项工作中，我们提出了Guava，一个通过系统探索智能体工作流、动作空间和观测空间的设计空间而开发的具身工具使用框架。我们的研究确定了有效具身智能体的三个关键要素：迭代感知-推理-行动循环、语义动作抽象和多模态观测。为了理解这些设计原则是否对小型模型也具有普适性，我们开发了一个端到端的训练流程，利用完全在仿真中收集的不到2000条轨迹，将具身操作能力蒸馏到一个4B开源模型中。在仿真和真实环境中的实验结果表明，其性能与前沿专有模型相当，同时展现出对未见物体、新指令和长时域任务的强大泛化能力。结果表明，一个精心设计的框架可以作为具身操作的可扩展、模型无关的接口，使紧凑的开源模型在极少的训练数据下展现出强大的涌现具身能力。

英文摘要

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles hold even for small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.

2606.18429 2026-06-18 cs.CV cs.AI cs.LG 交叉投稿

CAOA -- Completion-Assisted Object-CAD Alignment

CAOA -- 补全辅助的物体-CAD对齐

Hiranya Garbha Kumar, Minhas Kamal, Balakrishnan Prabhakaran

发表机构 * University at Albany（奥尔巴尼大学）

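AI summary: Proposes CAOA, which combines semantically aware point cloud completion with symmetry-aware relative pose estimation, achieves a 17% accuracy improvement on Scan2CAD, and releases the S2C-Completion dataset.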

Comments GitHub: https://github.com/MinhasKamal/CAOA

DOI: 10.1109/3DV69130.2026.00047
Journal ref: Thirteenth International Conference on 3D Vision (3DV), 2026

AI中文摘要

准确地将CAD模型与室内RGB-D扫描中的对应物体对齐是3D语义重建的核心挑战。该任务需要估计9自由度（DoF）位姿——位置、旋转和三轴尺度——但受到噪声和不完整扫描以及导致几何畸变的分割误差的阻碍。我们提出补全辅助的物体-CAD对齐（CAOA），该方法将语义和上下文感知的点云补全模块与对称感知的相对位姿估计算法相结合，实现CAD模型与扫描物体的精确对齐。现有的补全方法通常在合成数据集上训练和评估，往往难以泛化到真实扫描。为弥合这一差距，我们引入了一种针对室内场景的合成数据生成策略，通过与广泛使用的补全数据集进行定量比较，验证了其显著减小合成到真实领域差距的效果。此外，我们发布了S2C-Completion，一个来自Scan2CAD的超过8500个物体-CAD对的专家标注数据集，用于真实室内单物体补全，并作为该任务的新基准。对于物体-CAD对齐，我们通过对称感知损失融入对称信息，提高了对对称模糊的鲁棒性。在Scan2CAD基准上，CAOA相比最先进方法实现了17%的精度提升。

英文摘要

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose (position, rotation, and scale along three axes) but is hindered by noisy and incomplete scans, as well as segmentation errors that cause geometric distortions. We present Completion-Assisted Object-CAD Alignment (CAOA), a method that integrates a semantically and contextually aware point cloud completion module with a symmetry-aware relative pose estimation algorithm, enabling precise alignment of CAD models to scanned objects. Existing completion methods are typically trained and evaluated on synthetic datasets, which often fail to generalize to real-world scans. To bridge this gap, we introduce a synthetic data generation strategy tailored to indoor scenes, significantly reducing the synthetic-to-real domain gap, as validated through quantitative comparisons with widely used completion datasets. In addition, we release S2C-Completion, an expert-annotated dataset of over 8,500 object-CAD pairs from Scan2CAD, created for real-world indoor single-object completion and intended as a new benchmark for this task. For object-CAD alignment, we incorporate symmetry information via a symmetry-aware loss, improving robustness to symmetric ambiguities. On the Scan2CAD benchmark, CAOA achieves a 17% accuracy improvement over state-of-the-art methods.
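
To make the 9-DoF pose concrete, the sketch below assembles three translation, three rotation, and three scale parameters into one 4x4 homogeneous transform. It is an illustration only: the function name `pose_9dof`, the Euler-angle parameterization, and the scale-then-rotate-then-translate order are assumptions, not details taken from CAOA.

```python
import numpy as np

def pose_9dof(translation, euler_xyz, scale):
    """Build a 4x4 transform from 3 translation, 3 rotation, 3 scale values.

    Assumed convention: points are scaled first, then rotated (about x,
    then y, then z), then translated.
    """
    rx, ry, rz = euler_xyz
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    rot_x = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    rot_y = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rot_z = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    rotation = rot_z @ rot_y @ rot_x
    pose = np.eye(4)
    pose[:3, :3] = rotation @ np.diag(scale)  # rotation applied after scaling
    pose[:3, 3] = translation
    return pose

# Scale-and-shift only: the point (1, 1, 1) is scaled per axis, then moved by +1 in x.
scaled = pose_9dof((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (2.0, 3.0, 4.0)) @ np.array([1.0, 1.0, 1.0, 1.0])
# Rotation only: a quarter turn about z maps the x axis onto the y axis.
turned = pose_9dof((0.0, 0.0, 0.0), (0.0, 0.0, np.pi / 2), (1.0, 1.0, 1.0)) @ np.array([1.0, 0.0, 0.0, 1.0])
```

Per-axis scale is what distinguishes this setting from ordinary 6-DoF or 7-DoF registration: a CAD chair may need to be stretched differently along each axis to match the scanned instance.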

2606.18634 2026-06-18 cs.RO cs.AI 交叉投稿

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

EffiNav: 融合深度与视觉语言实现高效物体目标导航

Zecheng Yin, Benedict Jun Ma

发表机构 * Systems Hub of Intelligence Transportation HKUST(GZ)（香港科技大学（广州）智能交通系统中心）

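AI summary: Proposes the EffiNav framework, which fuses depth information with a vision-language model and guides navigation by predicting exploration frontiers and semantic priors; it matches or outperforms baselines on the HM3D and OVON datasets while improving path efficiency and generalization.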

AI中文摘要

在未知环境中定位目标物体是自主智能体的基本能力，应用范围从搜索救援到野外机器人。该任务的简化版本是物体目标导航（ObjNav）。在ObjNav中，成功到达目标物体提供了基本的性能度量；然而，导航轨迹的效率同样重要，因为它指示了智能体探索的智能程度以及后续任务剩余的时间。在未知环境中，高效导航的关键在于决定下一步探索的位置。尽管许多先前工作旨在解决这一核心挑战并在某些场景中取得了有希望的性能，但最近的基于训练的模型和非训练框架分别仍存在泛化性和效率问题，在最坏情况下可能导致对已访问区域的过度探索或冗余的来回运动。我们在两个广泛使用的仿真基准Habitat Matterport 3D（HM3D）和开放词汇物体目标导航（OVON）上评估EffiNav，并在真实世界的物理机器人上进一步验证其有效性。我们对大量仿真回合进行了失败分析。通过最小修改，我们还将EffiNav扩展到GOAT-BENCH数据集上的记忆增强ObjNav任务，展示了其在标准ObjNav设置之外的适应性。在两个标准指标——成功率（SR）和路径长度加权成功率（SPL）上，EffiNav匹配或超越了最近的基线，反映了其效率、鲁棒性和实际适用性。认识到两个数据集的不同侧重点，性能表明该框架在高效ObjNav中更加平衡和可泛化。

英文摘要

Locating a target object while exploring an unknown environment is a fundamental capability for autonomous agents, with applications ranging from search-and-rescue to field robots. A simplified version of this task is Object Goal Navigation (ObjNav). In ObjNav, successful arrival at the target object provides a basic measure of performance; however, the efficiency of the navigation trajectory is equally important, as it indicates how intelligently the agent explores and how much time remains for subsequent tasks. In unknown environments, the key to efficient navigation lies in deciding where to explore next. While many prior works have addressed this core challenge and achieved promising performance in certain settings, recent training-based models and training-free frameworks still suffer from generalization and efficiency issues, respectively, which in the worst cases can lead to excessive exploration of already-visited areas or redundant back-and-forth motion. We evaluate EffiNav on two widely used simulation benchmarks, Habitat Matterport 3D (HM3D) and Open-Vocabulary Object Goal Navigation (OVON), and further validate its effectiveness on physical robots in real-world settings. We conduct failure analysis on a large number of simulation episodes. With minimal modification, we also extend EffiNav to a memory-augmented ObjNav task on the GOAT-BENCH dataset, demonstrating its adaptability beyond standard ObjNav settings. On two standard metrics, Success Rate (SR) and Success weighted by Path Length (SPL), EffiNav matches or outperforms recent baselines, reflecting its efficiency, robustness, and practical applicability. Given the different emphases of the two datasets, these results indicate that the framework is more balanced and generalizable for efficient ObjNav.
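
The two metrics named in the abstract are simple to compute. The sketch below uses the standard definition of SPL, in which each successful episode is weighted by the ratio of the shortest-path length to the length of the path actually taken; the function names and the three toy episodes are made up for illustration, and shortest-path lengths are assumed to be positive.

```python
def success_rate(successes):
    """Fraction of episodes in which the agent reached the goal."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length.

    Each episode contributes S_i * l_i / max(p_i, l_i), where l_i is the
    shortest-path length and p_i the length of the path the agent took.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

# Three toy episodes: an optimal success, a success with a detour, a failure.
successes = [1, 1, 0]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 15.0]
sr_value = success_rate(successes)
spl_value = spl(successes, shortest, taken)  # (1 + 0.5 + 0) / 3 = 0.5
```

The gap between SR and SPL in the toy data comes entirely from the detour in the second episode, which is why SPL is the metric that penalizes redundant back-and-forth motion.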

2606.18664 2026-06-18 cs.SD cs.AI 交叉投稿

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

NeuralMUSIC: 一种用于机器人声源定位的混合神经-子空间框架

Yizhuo Yang, Junqiao Fan, Shenghai Yuan, Lihua Xie

发表机构 * School of Electrical and Electronic Engineering, Nanyang Technological University（南洋理工大学电气与电子工程学院）

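AI summary: Proposes NeuralMUSIC, a hybrid framework that combines a neural network estimate of the spatial covariance matrix with the classical MUSIC subspace method, using frequency attention fusion and self-supervised learning to improve the robustness and cross-domain generalization of robot sound source localization.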

AI中文摘要

可靠的声源定位是机器人听觉的基础，使自主机器人能够感知空间线索并在动态环境中有效运行。经典方法如多信号分类（MUSIC）具有坚实的理论基础，但在低信噪比下性能下降。基于深度学习的方法虽然取得了有前景的性能，但通常难以在多种条件下泛化。为了解决这些挑战，我们提出了NeuralMUSIC，一种用于机器人声源定位的混合神经-子空间框架。具体来说，神经网络首先从多通道麦克风观测中估计空间协方差矩阵。然后将预测的协方差集成到经典的MUSIC流程中，包括特征值分解（EVD）和伪谱计算，随后通过频率注意力融合（FAF）模块产生最终的DOA估计。为了提高数据效率，我们进一步引入了一种自监督空间相关学习（SSCL）策略，利用未标记的声学数据来捕获空间结构。跨不同机器人任务的广泛实验表明，NeuralMUSIC在实现有竞争力的定位精度的同时，表现出更强的鲁棒性和跨域泛化能力。

英文摘要

Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
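
The classical part of the pipeline (covariance, eigenvalue decomposition, pseudo-spectrum) can be sketched in a few lines of NumPy. This is a narrowband, single-frequency illustration for a uniform linear array with half-wavelength spacing; the covariance here is synthetic rather than predicted by a network, and the array geometry, function names, and noise level are assumptions rather than details from the paper.

```python
import numpy as np

def steering(theta_rad, n_mics, spacing=0.5):
    """Steering vector of a uniform linear array (spacing in wavelengths)."""
    k = np.arange(n_mics)
    return np.exp(-2j * np.pi * spacing * k * np.sin(theta_rad))

def music_spectrum(cov, n_sources, grid_rad, spacing=0.5):
    """MUSIC pseudo-spectrum computed from a spatial covariance matrix."""
    n_mics = cov.shape[0]
    # eigh returns eigenvalues in ascending order, so the first
    # n_mics - n_sources eigenvectors span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise_space = eigvecs[:, : n_mics - n_sources]
    spectrum = []
    for theta in grid_rad:
        a = steering(theta, n_mics, spacing)
        proj = noise_space.conj().T @ a
        # Small constant avoids division by zero at an exact match.
        spectrum.append(1.0 / (np.real(proj.conj() @ proj) + 1e-12))
    return np.array(spectrum)

n_mics = 6
true_doa = np.deg2rad(20.0)
a_true = steering(true_doa, n_mics)
cov = np.outer(a_true, a_true.conj()) + 0.01 * np.eye(n_mics)  # one source plus white noise
grid = np.deg2rad(np.arange(-90, 91))
spectrum = music_spectrum(cov, 1, grid)
est_deg = float(np.rad2deg(grid[np.argmax(spectrum)]))
```

In the hybrid setting the abstract describes, `cov` would be replaced by the network's estimate at each frequency, leaving the subspace step unchanged.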

2606.18698 2026-06-18 cs.RO cs.AI cs.LG 交叉投稿

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

利用能量特征进行基于深度学习的表面分类：三个独立数据集的比较分析

Alexander Belyaev, Oleg Kushnarev

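AI summary: Evaluates energy features as a standalone or supplementary modality for surface classification, comparing several deep learning architectures on three datasets; CNNs perform best, energy-only features reach 85-90% accuracy, combining them with inertial features reaches 96-99%, and adding energy features consistently improves accuracy by 1-2%.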

AI中文摘要

基于能量的方法在移动机器人表面分类中仍是一个相对未被充分研究的途径，尽管在受限环境中取得了有希望的结果。本研究评估了使用能量衍生特征作为独立分类模态或作为惯性数据补充输入的可行性。在三个公开数据集上进行了全面评估，比较了现代深度学习架构（包括循环神经网络、卷积神经网络、仅编码器变压器和Mamba状态空间模型）在自动超参数调整和输入序列长度优化下的性能。模型在所有评估数据集上均实现了比先前报道值更高的准确率，其中卷积神经网络取得了最高的整体性能。当仅依赖基于能量的特征时，模型分类准确率在85-90%范围内，比与惯性特征结合时（96-99%）低约5-10%。用能量特征增强惯性数据导致平均准确率持续提高1-2%。这些发现表明，仅依赖能量特征的分类器为独立部署提供了足够的准确性，同时在与其它感知模态结合使用时也提供了一致的增益。

英文摘要

The energy-based method remains a comparatively underexamined approach for surface classification in mobile robotics, despite promising results in constrained environments. This study evaluated the viability of using energy-derived features as either a standalone classification modality or as supplementary input to inertial data. A comprehensive evaluation was conducted across three publicly available datasets, comparing the performance of modern deep learning architectures including recurrent neural networks, convolutional neural networks, encoder-only transformers, and Mamba state-space models, under automated hyperparameter tuning and input sequence length optimization. The models achieved higher accuracy than previously reported values on all evaluated datasets, with the convolutional neural network yielding the highest overall performance. When relying exclusively on energy-based features, the models attained classification accuracies in the range of 85-90%, approximately 5-10% lower than those achieved when combined with inertial features (96-99%). Augmenting inertial data with energy features resulted in a consistent mean accuracy improvement of 1-2%. These findings indicate that classifiers relying solely on energy features offer sufficient accuracy for standalone deployment, while also providing a consistent gain when used in combination with other sensing modalities.

2606.18747 2026-06-18 cs.RO cs.AI 交叉投稿

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过基于人类反馈的迭代强化学习利用大语言模型生成自然且富有表现力的机器人手势

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

发表机构 * University of New South Wales（新南威尔士大学）； Universidad Central de Chile（智利中央大学）

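AI summary: To address stiff gesture generation in social robots, integrates ChatGPT into the Pepper robot to generate co-speech gestures and introduces iterative reinforcement learning with human feedback (RLHF) to refine them; experiments show that RLHF improves the expressiveness, relevance, and fluidity of the gestures.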

Comments 8 Pages, 6 Figures

AI中文摘要

富有表现力的手势对于自然有效的沟通至关重要，当仅靠语言线索不足时（例如，指向），手势可以补充言语。对于像Pepper这样的人形社交机器人，产生自然且富有表现力的动作对于改善人机交互（HRI）和长期接受度至关重要。然而，由于依赖专家编写的动画，生成手势仍然具有挑战性，导致行为僵硬，难以适应动态和多样化的环境。或者，机器学习方法通常难以捕捉感知的自然性，随着自由度的增加而变得更加困难。因此，产生富有表现力的机器人手势需要一个能够适应环境同时遵守社会规范和物理约束的系统。大语言模型（LLMs）的最新进展使得动态代码生成成为可能，为从自然语言实时合成手势提供了新的机会。在本文中，我们将ChatGPT集成到人形机器人Pepper中，以生成与对话输出一致的共语手势。虽然这一基线实现了灵活的手势生成，但生成的动作通常被认为僵硬且不自然。为了解决这一限制，我们引入了一种基于人类反馈的迭代强化学习（RLHF）系统，该系统根据用户评估微调手势生成，并利用迭代用户研究比较Pepper生成的手势。我们的结果表明，RLHF改进了LLM的共语生成能力，产生了更富有表现力、相关且流畅的动作。

英文摘要

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.18828 2026-06-18 cs.RO cs.AI 交叉投稿

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

空间即智能：用于黎曼度量生成的神经半群叠加

Chenghao Xu

发表机构 * National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University（湖南大学机器人视觉感知与控制技术国家工程研究中心）

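AI summary: Proposes placing intelligence in the space itself: a neural semigroup-superposition mechanism generates a Riemannian metric so that action reduces to geodesic following, and a model trained on a single two-obstacle scene generalizes zero-shot to unseen configurations.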

AI中文摘要

传统方法将智能置于智能体中，无论是作为学习策略还是搜索过程。我们则将智能置于空间本身：场景在构型流形上诱导一个黎曼度量，动作简化为跟随该度量的测地线，而无需调用单独的规划器或碰撞检查器。一个单一的编码器-路由器网络通过三个互补的参数组实现这一思想——框架参数（定向生成器）、调制参数（控制空间传播）和基本系数（决定强度）。这些组通过共享的半群叠加机制组合，产生单个黎曼度量场，形成一种紧凑的架构，其几何复杂度自然随场景复杂度扩展。在单个双障碍场景上训练后，该模型在未见过的障碍配置上展现出鲁棒的零样本泛化能力，无碰撞路径成本与障碍穿透路径成本相差数个数量级。

英文摘要

Traditional approaches place intelligence in the agent, whether as a learned policy or a search procedure. We instead place intelligence in the space itself: a scene induces a Riemannian metric on the configuration manifold, and action reduces to following the geodesics of that metric rather than invoking a separate planner or collision checker. A single Encoder-Router network realizes this idea through three complementary parameter groups -- frame parameters that orient the generators, modulation parameters that govern their spatial propagation, and basic coefficients that determine their strength. These groups combine through a shared semigroup-superposition mechanism to produce a single Riemannian metric field, yielding a compact architecture whose geometry scales naturally with scene complexity. Trained on a single two-obstacle scene, the model demonstrates robust zero-shot generalization across unseen obstacle configurations, with orders-of-magnitude separation between collision-free and obstacle-penetrating path costs.
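
The claim about path costs can be illustrated with a hand-written metric. In the sketch below the metric is an isotropic scaling of the Euclidean metric that grows sharply near one obstacle; this is a stand-in chosen for illustration, not the learned metric field from the paper, and the names `metric_scale` and `path_cost` as well as the amplitude and width values are hypothetical.

```python
import math

def metric_scale(p, center=(0.0, 0.0), amp=1e6, sigma=0.1):
    """Conformal factor of the metric: about 1 in free space, large near the obstacle."""
    d2 = (p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2
    return 1.0 + amp * math.exp(-d2 / (2.0 * sigma ** 2))

def path_cost(points):
    """Riemannian length of a polyline, using the midpoint rule on each segment."""
    cost = 0.0
    for a, b in zip(points, points[1:]):
        mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
        cost += math.sqrt(metric_scale(mid)) * math.dist(a, b)
    return cost

n = 200
# Straight line from (-1, 0) to (1, 0), passing through the obstacle centre.
straight = [(-1.0 + 2.0 * i / n, 0.0) for i in range(n + 1)]
# Semicircular detour of radius 1 between the same endpoints.
detour = [(-math.cos(math.pi * i / n), math.sin(math.pi * i / n)) for i in range(n + 1)]
straight_cost = path_cost(straight)
detour_cost = path_cost(detour)
```

Although the detour is longer in Euclidean terms, it is far cheaper under the metric, so a geodesic between the two endpoints bends around the obstacle without any separate collision check.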

2606.18836 2026-06-18 cs.HC cs.AI 交叉投稿

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

通过先前协作的片段记忆改善城市搜索与救援中的人机团队合作

Taewoon Kim, Emma van Zoelen, Mark Neerincx

发表机构 * HumemAI, The Netherlands（荷兰HumemAI）； Vrije Universiteit Amsterdam, The Netherlands（荷兰阿姆斯特丹自由大学）； TNO, The Netherlands（荷兰TNO）

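AI summary: Proposes storing historical collaboration patterns as knowledge-graph episodic memories and using graph representation learning to select a representative memory with which to initialize the robot; in the MATRX USAR environment this raises rescue success from 25.7% to 41.3% and reduces task time by 283 seconds.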

AI中文摘要

有效的人机团队合作要求机器人从交互开始就适应伙伴、情境和任务动态。在MATRX城市搜索与救援（USAR）环境中，人们可以通过聊天和反思界面将他们在团队合作中发现的协作模式（CPs）外部化。我们研究机器人是否可以利用这种先前的团队经验，在未来的交互中成为更好的队友。为此，我们将历史CPs表示为知识图谱片段记忆，并使用具有节点分类目标的图表示学习来识别一个代表性且有效的记忆以供重用。然后，在新的协作片段开始之前，我们用该记忆初始化机器人。在20名参与者和160轮次观察中，用单个自动选择的先前CP初始化机器人将救援成功率从25.7%提高到41.3%，并将平均任务时间减少283秒。最强的提升出现在交互开始时，表明可重用的片段记忆可以帮助机器人以更有效的任务知识进入协作，并支持更顺畅的早期团队合作。

英文摘要

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

2606.18861 2026-06-18 cs.CV cs.AI 交叉投稿

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification

基于可微联合推理与能量一致性验证的RGB-D序列URDF合成

Xinze Zhang

发表机构 * University of Southern California（南加州大学）

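AI summary: Proposes the KinemaForge pipeline, which uses differentiable joint inference and energy-consistent verification to jointly estimate part shape, joint topology, and joint parameters from RGB-D sequences, substantially reducing joint-axis error and simulation drift.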

AI中文摘要

从传感器观测重建可仿真的铰接物体数字孪生仍受两个持续存在的差距制约：(i) 部件级几何重建与运动学参数估计分离，(ii) 恢复的模型常违反能量守恒等基本动态不变量，导致URDF在物理仿真器中重放时出现漂移。我们提出KinemaForge，一种约束驱动管道，从短RGB-D序列联合推断部件级形状、关节拓扑和关节参数，并通过基于可微刚体动力学构建的能量一致性验证器验证结果。该管道引入三个组件：将关节-部件关联编码为软边的运动学约束图；通过Featherstone铰接体算法从渲染观测反向传播到关节参数的可微螺旋轴求解器；以及惩罚重建模型非物理自由响应的能量残差损失。在五个PartNet-Mobility类别和一个内部RGB-D基准上，KinemaForge将平均关节轴误差从最强几何基线(PARIS)的4.52度降至2.83度(-37.4%)，从基于交互的Ditto基线的5.30度降至2.83度(-46.6%)，在50秒滚动中长时仿真漂移比PARIS降低64%，初步评估中闭环操作成功率比Ditto提高14.6个百分点。代码和重建数据将在接收后发布。

英文摘要

Reconstructing simulation-ready digital twins of articulated objects from sensor observations remains constrained by two persistent gaps: (i) part-level geometric reconstruction is decoupled from kinematic-parameter estimation, and (ii) the recovered models often violate basic dynamic invariants such as energy conservation, leading to drift when the URDF is replayed in physics simulators. We present KinemaForge, a constraint-driven pipeline that jointly infers part-level shape, joint topology, and joint parameters from short RGB-D sequences and validates the result against an energy-consistent verifier built on differentiable rigid-body dynamics. The pipeline introduces three components: a kinematic constraint graph that encodes joint-part incidences as soft edges; a differentiable screw-axis solver that backpropagates from rendered observations through Featherstone's articulated-body algorithm to joint parameters; and an energy residual loss that penalises non-physical free responses of the reconstructed model. Across five PartNet-Mobility categories and an internal RGB-D benchmark, KinemaForge reduces the average joint-axis error from 4.52 degrees to 2.83 degrees (-37.4%) over the strongest geometric baseline (PARIS) and from 5.30 degrees to 2.83 degrees (-46.6%) over the interaction-based Ditto baseline, lowers long-horizon simulation drift by 64% (vs. PARIS) over 50 s rollouts, and yields URDFs whose closed-loop manipulation success rate improves by 14.6 percentage points over Ditto in our preliminary evaluation. Code and reconstruction data will be released upon acceptance.
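
The relative reductions quoted in the abstract follow directly from the reported joint-axis errors, as the short check below shows. Only the four error values come from the abstract; the helper name `relative_change` is ours.

```python
def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before

vs_paris = relative_change(4.52, 2.83)  # joint-axis error in degrees, PARIS baseline
vs_ditto = relative_change(5.30, 2.83)  # joint-axis error in degrees, Ditto baseline
```

Rounded to one decimal place, the two values reproduce the -37.4% and -46.6% figures stated in the abstract.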

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY 交叉投稿

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

用于自主海上无人机飞行的深度单目位姿估计的硬件与视觉在环验证

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

发表机构 * George Washington University（乔治华盛顿大学）

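AI summary: Proposes a hardware-validated vision-in-the-loop framework that combines a deep transformer-based monocular pose estimator with a delayed Kalman filter to achieve autonomous indoor flight while emulating photorealistic maritime environments, capturing embedded effects such as perception latency.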

Comments 6 pages, 9 figures

AI中文摘要

船舶上的自主无人机操作需要可靠的基于视觉的相对位姿估计，然而海上验证成本高、依赖天气且风险大。本文提出一个硬件验证的视觉在环框架，能够在模拟逼真海上环境的同时实现完全自主的室内飞行。渲染的海上视图由板载的基于深度变换器的单目位姿估计器处理。延迟的视觉测量与高频率IMU数据通过延迟卡尔曼滤波器融合，为几何控制提供一致的状态估计。该系统捕捉了纯仿真中缺失的关键嵌入式效应，包括感知延迟、异步更新和计算约束。自主起飞、轨迹跟踪和着陆实验证明了稳定的闭环飞行。结果建立了一个安全且硬件真实的中间阶段，用于在船上部署之前开发海上无人机自主性。

英文摘要

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.
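
One common way to fuse a measurement that arrives late is to rewind to the estimate stored at the time the image was taken, apply the update there, and replay the inputs received since. The scalar sketch below illustrates that replay idea only; the abstract does not describe the internals of the paper's delayed Kalman filter, so the model (position driven by known increments), the function names, and all numbers are assumptions.

```python
def predict(x, p, u, q):
    """Propagate a scalar state by the increment u, with process noise variance q."""
    return x + u, p + q

def update(x, p, z, r):
    """Standard scalar Kalman update with measurement z of variance r."""
    gain = p / (p + r)
    return x + gain * (z - x), (1.0 - gain) * p

def fuse_delayed(history, inputs, z, r, q, k_meas):
    """Apply a measurement taken at step k_meas, then replay later inputs.

    history[k] is the (state, variance) estimate at step k, before
    inputs[k] was applied.
    """
    x, p = update(history[k_meas][0], history[k_meas][1], z, r)
    for u in inputs[k_meas:]:
        x, p = predict(x, p, u, q)
    return x, p

q = 0.1
inputs = [1.0, 1.0, 1.0]   # e.g. integrated high-rate IMU increments
history = [(0.0, 1.0)]
for u in inputs:
    history.append(predict(history[-1][0], history[-1][1], u, q))

# A vision measurement of the state at step 1 arrives only after step 3.
x_now, p_now = fuse_delayed(history, inputs, z=1.5, r=1.1, q=q, k_meas=1)
```

Without the late measurement the step-3 estimate is 3.0 with variance 1.3; after the replay it is 3.25 with variance 0.75, so the old observation still tightens the present estimate.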

2606.18258 2026-06-18 cs.HC cs.AI 交叉投稿

Examining Human-Like Behaviors in LLMs: A Multi-Dimensional Analysis of Model Behaviors, User Factors, and System Prompts

审视LLM中的人类行为：模型行为、用户因素和系统提示的多维分析

Sunnie S. Y. Kim, Margit Bowler, Leon A Gatys

发表机构 * Apple（苹果公司）

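AI summary: A multi-dimensional analysis of 21,000 conversations finds that human-like behaviors are pervasive in LLMs but vary markedly across models and user factors; human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors as more appropriate; system prompts can control these behaviors but require careful evaluation.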

AI中文摘要

大型语言模型（LLM）展现出广泛的人类行为，从表达思想和情感，到与用户建立关系，再到拒绝请求和维持边界。尽管这些行为普遍存在，但研究者和实践者缺乏方法和实证见解来做出关于LLM何时以及应展现何种类型人类行为的明智决策。为填补这一空白，我们使用LLM-as-a-judge和人类评估，对这些行为的普遍性、潜在影响和可控性进行了多维分析。在来自四个广泛使用的模型（gpt-4o、gpt-4.1-mini、claude-sonnet-4.6、gemini-2.5-flash）的21,000次多轮对话中，我们发现人类行为普遍存在，但不同模型和用户因素（对话目标和用户画像）间存在差异。在感知适当性方面，人类评估者认为LLM的自我参照和关系建立行为不如人类适当，但边界维护行为比人类更适当。最后，我们表明系统提示可以控制这些行为，但需要仔细评估以避免意外效果。我们讨论了研究结果的含义，并为负责任的LLM设计和评估提供了建议。

英文摘要

Large language models (LLMs) exhibit a wide range of human-like behaviors, from expressing thoughts and emotions, to engaging in relationship-building with users, to refusing requests and maintaining boundaries. Despite their prevalence, researchers and practitioners lack methods and empirical insights to make informed decisions about when and what types of human-like behaviors LLMs should exhibit. To fill this gap, we present a multi-dimensional analysis of the prevalence, potential effects, and controllability of these behaviors using LLM-as-a-judge and human evaluation. Across 21,000 multi-turn conversations from four widely used models (gpt-4o, gpt-4.1-mini, claude-sonnet-4.6, gemini-2.5-flash), we find that human-like behaviors are pervasive but vary across models and user factors (conversation goals and user profiles). In terms of perceived appropriateness, human evaluators judged self-referential and relationship-building behaviors as less appropriate from LLMs than from humans, but boundary-maintaining behaviors more appropriate from LLMs than from humans. Finally, we show that system prompting can control these behaviors, though it requires careful evaluation to avoid unintended effects. We discuss the implications of our findings and provide recommendations for responsible LLM design and evaluation.

2606.18309 2026-06-18 cs.LG cs.AI 交叉投稿

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

SAGE: 保留感知的最终遗忘向量事后净化

Jingyuan Zhang, Yucheng Bai, Peixi Wen, Zhehao Huang, Zhengbao He, Hanling Tian, Xinwen Cheng, Haiyin Ran, Xiaolin Huang

发表机构 * Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University（上海交通大学图像处理与模式识别研究所）

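AI summary: Proposes SAGE, which sanitizes the final update vector post hoc, without rerunning the original unlearning pipeline, to ease the trade-off between forgetting and retention in large language model unlearning.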

AI中文摘要

大语言模型（LLM）遗忘旨在移除不良知识或行为，同时保留已有能力。当前的遗忘方法都涉及遗忘与保留之间的权衡。我们发现，保留激活偏差也可用于量化遗忘方法对保留造成的损害，而无需考虑遗忘过程的具体实现。这使得我们能够通过事后方法恢复任何遗忘方法的保留性能。因此，我们提出一种互补的事后设置，在不重新运行原始遗忘流程的情况下净化最终更新向量。在该设置中，我们设计了SAGE（光谱激活-几何净化），一种对最终遗忘更新的源无关修正。SAGE从一个小型保留代理收集真实模块输入，提取其主导激活几何结构，并求解一个闭式源锚定优化目标，该目标抑制与高能保留方向对齐的更新分量，同时保留源方法的遗忘载体。在多种遗忘方法、模型规模和基准测试中，SAGE持续缓解保留-遗忘权衡，将最终向量的事后净化识别为机器遗忘中一个实用且未被充分探索的维度。

英文摘要

Large Language Model (LLM) unlearning aims to remove undesirable knowledge or behaviors while preserving retained capabilities. Current unlearning methods all involve a trade-off between unlearning and retention. We have found that the retention activation bias can also be used to quantify the damage an unlearning method inflicts on retention, without considering the specific implementation of the unlearning process. This allows us to restore retention performance for any unlearning method using a post-hoc approach. Therefore, we propose a complementary post-hoc setting to sanitize the final update vector without rerunning the original unlearning pipeline. In this setting, we design SAGE, Spectral Activation-GEometry Sanitization, a source-agnostic correction for final unlearning updates. SAGE collects real module inputs from a small retain proxy, extracts their dominant activation geometry, and solves a source-anchored optimization objective in closed form, which suppresses update components aligned with high-energy retained directions while preserving the source method's forgetting carrier. Across multiple unlearning methods, model scales, and benchmarks, SAGE consistently relieves the retain-forget trade-off, identifying post-hoc sanitization of final vectors as a practical and underexplored axis for machine unlearning.
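
The idea of suppressing update components aligned with high-energy retained directions can be sketched for a single linear layer. The code below uses a hard projection onto the complement of the top-k principal directions of the retain inputs; this is a simplification chosen for illustration and is not SAGE's closed-form source-anchored objective, and the function name `sanitize_update` and the synthetic data are hypothetical.

```python
import numpy as np

def sanitize_update(delta_w, retain_inputs, k):
    """Remove the part of a weight update that acts on dominant retain directions.

    delta_w:       (out_dim, in_dim) update produced by some unlearning method.
    retain_inputs: (n, in_dim) module inputs collected on a small retain proxy.
    k:             number of dominant input directions to protect.
    """
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(retain_inputs, full_matrices=False)
    v_k = vt[:k].T                                  # (in_dim, k)
    return delta_w - (delta_w @ v_k) @ v_k.T

rng = np.random.default_rng(0)
in_dim = 8
basis = rng.normal(size=(2, in_dim))
retain_inputs = rng.normal(size=(100, 2)) @ basis   # retain inputs span a 2-D subspace
delta_w = rng.normal(size=(4, in_dim))

clean = sanitize_update(delta_w, retain_inputs, k=2)
shift_before = np.abs(delta_w @ retain_inputs.T).max()
shift_after = np.abs(clean @ retain_inputs.T).max()
```

In this toy case the sanitized update no longer changes the layer's output on any retain input, while its components orthogonal to the retain subspace are left untouched.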

2606.18310 2026-06-18 cs.CR cs.AI 交叉投稿

Code-Augur：通过规约推断的智能体漏洞检测

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

发表机构 * National University of Singapore（新加坡国立大学）

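AI summary: Proposes a security-specification-first paradigm that makes the agent's assumptions explicit and falsifies them at runtime, combined with guided fuzzing to strengthen vulnerability detection; on real-world projects it detects more vulnerabilities than existing agents.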

AI中文摘要

智能体漏洞检测的出现已成为软件安全的分水岭。完全由自主LLM智能体进行的审计正在发现数字社会基础软件中的关键漏洞。许多漏洞多年来一直隐藏，直到现在才被AI智能体发现。然而，这些发现背后的推理仍然令人担忧地不透明且未经验证。当智能体认为某个函数安全时，它对函数输入做了哪些假设？推理失败和错误假设可能导致遗漏漏洞，并降低对智能体分析的信任。我们提出了一种安全规约优先范式，该范式（1）将智能体的隐性假设明确暴露为安全规约，并（2）通过运行时反证持续细化这些规约。我们在Code-Augur中实现了我们的方法，这是一种用于智能体漏洞检测的新型框架。给定一个代码库，Code-Augur分析系统的每个组件以查找漏洞代码。当它认为某个组件安全时，它会将该判断背后的局部不变量作为源代码中的断言提交。同时，Code-Augur利用引导式模糊测试器尝试反证这些假设。当模糊测试器触发断言时，要么揭示一个真实漏洞，要么揭示一个需要细化的有缺陷规约。在这两种情况下，这一过程都夯实了智能体的理解，使其对代码意图的看法与代码实际行为保持一致。在真实世界的主题上，Code-Augur有效利用安全规约检测到比其他最先进智能体更多的漏洞。此外，Code-Augur在关键开源项目中发现了22个新漏洞。与精心策划的专用模型（如Claude Mythos）相比，Code-Augur提供了基于广泛可用的LLM（如Sonnet和DeepSeek）构建的有效智能体漏洞检测。

英文摘要

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
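
The loop of committing an assumption as an in-source assertion and then trying to falsify it can be shown on a toy parser. Everything below is hypothetical: the functions `read_field` and `parse_packet`, the planted bug, and the plain random fuzzer, which stands in for the guided fuzzer mentioned in the abstract.

```python
import random

def read_field(buf, offset, length):
    # Assumption committed as an assertion: callers never read past the buffer end.
    assert 0 <= offset and offset + length <= len(buf), "spec: read stays in bounds"
    return buf[offset:offset + length]

def parse_packet(packet):
    # Planted bug: the first byte is trusted as a length field without validation.
    length = packet[0]
    return read_field(packet, 1, length)

def fuzz(target, trials=1000, seed=0):
    """Feed random byte strings to target; return the first input that breaks a spec."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randrange(1, 16)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except AssertionError as err:
            return data, str(err)
    return None

finding = fuzz(parse_packet)
```

A triggered assertion then needs triage, as the abstract describes: either the caller is genuinely vulnerable (here, the unvalidated length byte) or the assertion itself was too strict and should be refined.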

2606.18832 2026-06-18 cs.LG cs.AI cross-list

Target-confidence Recourse Using tSeTlin machines: TRUST

K. Darshana Abeyrathna, Sara El Mekkaoui, Nils Enric Canut Taugbøl, Anuja Vats

Affiliations: Group Research and Development, Det Norske Veritas (DNV)

AI summary: Proposes the TRUST framework, which uses a Probabilistic Tsetlin Machine and Bayesian optimization to search directly for the minimal input change that meets a user-specified confidence target, yielding more robust and interpretable counterfactual explanations.

Abstract

Counterfactual explanations are widely used to provide algorithmic recourse in high-stakes decision-making systems. Most existing methods seek the smallest change to an input that flips a model's decision. However, decision-makers often rely not only on predicted labels but also on confidence thresholds and risk margins. Counterfactuals that barely cross a decision boundary can be fragile and unstable under noise or model variation. In this paper, we propose Target-confidence Recourse Using tSeTlin machines (TRUST), a framework in which users explicitly specify the desired prediction confidence for recourse. Rather than generating counterfactuals and evaluating confidence afterward, TRUST directly searches for minimal changes that satisfy a user-defined confidence target, enabling comparison of recourse options in terms of cost, confidence, and robustness. We instantiate TRUST using a Probabilistic Tsetlin Machine (PTM) combined with Bayesian optimization. The probabilistic clause-based structure of PTM links prediction confidence to the stability of decision rules. We show that counterfactuals satisfying the same rules can still differ substantially in reliability depending on how securely they satisfy those rules, revealing whether decisions are supported by robust or fragile clause activations. Experiments on synthetic and real-world datasets demonstrate that target-confidence counterfactuals produce more robust and interpretable recourse than conventional boundary-based approaches. Across multiple benchmarks, TRUST achieves perfect robustness while maintaining low recourse cost, including an L2 distance of 0.10 on the Haberman dataset at 0.92 confidence. By explicitly controlling confidence and exposing rule-level stability, TRUST provides actionable recourse for high-stakes decision support.
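The difference between boundary-crossing recourse and target-confidence recourse can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not the TRUST method: it swaps in a logistic model (where the minimal L2 change has a closed form) for the Probabilistic Tsetlin Machine, and it uses no Bayesian optimization. The weights and input are invented.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def confidence(x, w, b) -> float:
    """Predicted probability of the positive class for a logistic model."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

def min_change_for_confidence(x, w, b, target):
    """Smallest L2 move of x that reaches `target` confidence.

    For a linear score the shortest path to the level set
    w.x + b = logit(target) runs along w.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    gap = logit(target) - score
    if gap <= 0:                       # already confident enough
        return list(x), 0.0
    norm_sq = sum(wi * wi for wi in w)
    step = gap / norm_sq
    x_new = [xi + step * wi for xi, wi in zip(x, w)]
    return x_new, gap / math.sqrt(norm_sq)

# Illustrative numbers only.
x, w, b = [0.0, 0.0], [3.0, 4.0], -5.0
_, boundary_cost = min_change_for_confidence(x, w, b, 0.5)
x_safe, target_cost = min_change_for_confidence(x, w, b, 0.92)
```

The counterfactual that only reaches the decision boundary is cheaper, but it sits exactly at 0.5 confidence; asking for 0.92 costs more distance and leaves a margin against noise, which is the trade-off the abstract describes.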

2606.18996 2026-06-18 cs.CR cs.AI cross-list

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

Affiliations: Dept. of Electrical Engineering, POSTECH; Grad. School of Artificial Intelligence, POSTECH; School of Computing, KAIST

AI summary: Proposes the TRAP benchmark, which evaluates how well agents balance task accuracy against privacy leakage in document-intensive tasks; finds non-trivial leakage in all models, proves that prompt-based defenses cannot achieve high task success and zero leakage probability at the same time, and proposes structural private field isolation.

Abstract

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.
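A minimal sketch of the isolation idea, assuming a dictionary-shaped document and a tool call expressed as a dictionary of arguments. The `FieldVault` class, the handle format, and the salt are hypothetical names chosen for illustration; the abstract does not specify how the hash keys are built or resolved.

```python
import hashlib

class FieldVault:
    """Swaps private values for opaque handles before text reaches a model."""

    def __init__(self, salt: str = "demo-salt"):
        self._salt = salt
        self._store = {}          # handle -> real value, kept outside the model

    def isolate(self, document: dict, private_keys: set) -> dict:
        redacted = {}
        for key, value in document.items():
            if key in private_keys:
                raw = (self._salt + key + str(value)).encode()
                handle = "<PRIV:" + hashlib.sha256(raw).hexdigest()[:12] + ">"
                self._store[handle] = value
                redacted[key] = handle
            else:
                redacted[key] = value
        return redacted

    def resolve(self, tool_args: dict) -> dict:
        """Restore real values only at the tool boundary."""
        return {
            k: self._store.get(v, v) if isinstance(v, str) else v
            for k, v in tool_args.items()
        }

vault = FieldVault()
doc = {"name": "A. Traveler", "passport_number": "X1234567", "destination": "ICN"}
model_view = vault.isolate(doc, {"passport_number"})

# The model can still route the field into the right tool argument...
tool_call = {"passport_number": model_view["passport_number"], "destination": "ICN"}
# ...and the real value is substituted after the model has finished.
resolved = vault.resolve(tool_call)
```

Because the model never sees the passport number, an attack query can at most extract the handle, while the tool call still completes with the correct value.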

2606.19220 2026-06-18 cs.LG cs.AI cross-list

Machine Unlearning for the XGBoost Model with Network Intrusion Datasets

Diana Magalhães, Eva Maia, João Vitorino, Isabel Praça

Affiliations: GECAD, ISEP, Polytechnic of Porto

AI summary: Proposes XGBoost-Forget, an unlearning method for XGBoost models that achieves efficient unlearning on tabular network intrusion datasets, preserving model performance while markedly speeding up unlearning.

Comments 12 pages, 7 tables, WorldCist'26 Conference

2606.19222 2026-06-18 cs.LG cs.AI cross-list

Mechanism-Guided Selective Unlearning for RLVR-Induced Reasoning

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

Affiliations: School of Engineering, Institute of Science Tokyo, Japan; College of Control Science and Engineering, Zhejiang University, China; Department of Electrical and Computer Engineering, National University of Singapore, Singapore

AI summary: Proposes MAST, which uses mechanism-guided selective parameter updates to unlearn RLVR-induced reasoning behavior with significantly less collateral damage to retained performance.

Comments 15 pages, 4 figures, 7 tables

Abstract

We propose MAST (Mechanism-Aligned Selective Targeting), a mechanism-guided method for unlearning RLVR-induced reasoning with substantially lower collateral damage than standard full-parameter updates. In matched SFT/RLVR checkpoints on Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, the SFT-to-RLVR increment differs sharply from the SFT update in token-level delta-log-probability, and full-parameter gradient ascent forgets only by damaging retain MATH and GSM8K. MAST ranks attention-projection tensors by off-principal energy, update magnitude, and forget-gradient coupling magnitude, then updates only the top-ranked subset. On the primary model, MAST induces statistically significant target forgetting (MATH forget 45/150 to 37/150; McNemar p=0.0078) while preserving GSM8K (+0.8 pp) and MATH retain (-0.5 pp). The advantage reproduces across seeds, NPO/SimNPO objectives, and Qwen3, where MAST preserves GSM8K while full-parameter unlearning collapses it.
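The reported McNemar p-value can be sanity-checked with an exact two-sided test on discordant pairs. The abstract gives only the totals (45/150 to 37/150, a net change of 8), so the split used below, 8 problems that flipped from solved to unsolved and 0 that flipped the other way, is an assumption chosen because it is the simplest configuration consistent with those totals.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) tail probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed split: 8 discordant pairs in one direction, 0 in the other.
p_value = mcnemar_exact(8, 0)   # 2 / 2**8 = 0.0078125
```

Under that assumption the test gives 0.0078125, which rounds to the p=0.0078 quoted in the abstract.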

2606.18257 2026-06-18 cs.HC cs.AI cross-list

From Memorization to Creation: Evaluating the Cognitive Depth of LLM-Generated Educational Questions

Xiaolong Wang, Zhe Zhao, Song Lai, Chaoli Zhang, Zijie Geng, Yu Tong, Ye Wei, Qingsong Wen

Affiliations: City University of Hong Kong; Zhejiang Normal University; Squirrel Ai Learning; University of Science and Technology of China; Wuhan University

AI summary: Evaluates the cognitive level of questions generated by six LLMs through Bloom's Taxonomy, proposes a fine-grained prompting strategy that reduces repetitiveness and raises the share of higher-order questions, introduces cognitive shift intensity and category drift metrics, and examines the interpretability of Chain-of-Thought prompting.

Comments Accepted by KDD 2026

DOI: 10.1145/3770854.3785686
Journal ref: KDD 2026

Abstract

While LLMs show promise in automating educational content creation, their ability to generate questions that stimulate higher-order thinking remains understudied. This work evaluates six widely used LLMs through a Bloom's Taxonomy lens, focusing on their capacity to transcend rote memorization and achieve cognitive leaps. Using a hybrid human-AI evaluation protocol, we generate and analyze 20,700 questions across computer science, K-12 math, and social-science domains. Key contributions include: (1) a fine-grained prompting strategy that reduces question repetitiveness by 24.45% for Qwen2.5-7B-Instruct, and increases the proportion of higher-order cognitive level outputs by 11.53% for InternLM3-8B-Instruct; (2) quantitative metrics for cognitive shift intensity (CogShift) and category drift, revealing InternLM3's superior performance in multi-level transitions; (3) an interpretability analysis revealing metric-level correlations that enhance the transparency of Chain-of-Thought prompting. Our findings highlight the importance of cognitive-aware prompt design and provide benchmarks for deploying LLMs in personalized learning systems.

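2606.18263 2026-06-18 cs.HC cs.AI cross-list

Multimodal Hypergraph Fusion for Low-Light Crowd Counting

Hao-Yuan Ma, Li Zhang, Yushi Qiu, Jie Gao, Yan Zhang, Bangjun Wang

Affiliations: School of Computer Science and Technology, Soochow University

AI summary: To address crowd counting in low-light environments, builds three new benchmark datasets and proposes a multimodal hypergraph fusion module and a deformable rectangular sparse attention module, which together form the low-light counting network LCNet; it achieves the best performance on all three benchmarks.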

LandslideAgent and the Multimodal LandslideBench: A Domain-Rule-Enhanced Agent for Autonomous Landslide Identification and Analysis

Chengfu Liu, Dongyang Hou, Junwu Xiang, Cheng Yang, Xuezhi Cui, Zeyuan Wang, Liangtian Liu, Zelang Miao

Affiliations: Central South University

AI summary: Proposes an instruction-driven agentic framework comprising the multimodal dataset LandslideBench, the landslide-specific vision-language model LandslideVLM, and the domain-rule-enhanced agent LandslideAgent, enabling autonomous landslide identification and analysis.

Abstract

Intelligent landslide hazard interpretation is critical for disaster prevention, yet current paradigms struggle to simultaneously extract visual features and high-level geoscientific semantics, while general-purpose vision-language models (VLMs) suffer from perceptual limitations and domain hallucinations in complex geological scenarios. To address these challenges, we propose an instruction-driven agentic framework comprising three components. First, LandslideBench, a multimodal fine-grained dataset with seven subtype labels, high-resolution imagery, pixel-level masks, and high-quality textual descriptions, is constructed via multi-VLM cross-validation and interactive annotation. Then, LandslideVLM, a landslide-oriented VLM, is fine-tuned via LoRA on LandslideBench to enhance geological semantic understanding. Finally, LandslideAgent, a domain rule-enhanced agent taking LandslideVLM as its cognitive backbone, employs a dual-rule controller incorporating structured report metadata constraints and cross-validation identification constraints to regulate automated tool invocation. Experiments demonstrate that LandslideBench provides effective baselines across five mainstream models on fine-grained classification and semantic segmentation. LandslideVLM achieves accuracy improvements of 10.96%, 32.87%, and 15.91% on landslide discrimination, fine-grained classification, and semantic description quality, respectively. LandslideAgent further enables autonomous multi-source spatial data inference, realizing full-process intelligence for landslide identification and analysis.

2606.18699 2026-06-18 cs.CL cs.AI cs.IR cross-list

TW-LegalBench: Measuring Taiwanese Legal Understanding

Fei-Yueh Chen, Chun Huang Lin, Chan Wei Hsu, Kuan Hsuan Yeh, Zih-Ching Chen, Kuan-Ming Chen, Patrick Chung-Chia Huang

Affiliations: University of Rochester; National Taiwan University; NVIDIA

AI summary: Proposes the TW-LegalBench benchmark, comprising multiple-choice, open-ended essay, and legal judgment prediction tasks, and evaluates 13 LLMs on Taiwanese law; top models pass the lawyer examination but fall short of the standard for judges and prosecutors, and they struggle to cite statutes.

Comments 10 pages, 2 figures, To appear in ICAIL 2026

Abstract

Large language models (LLMs) have shown impressive capabilities across diverse tasks, yet their performance on jurisdiction-specific legal reasoning remains underexplored. We present TW-LegalBench, which draws on the Taiwanese legal system's rich, publicly available official corpus to fill a gap in evaluating LLMs on Taiwanese law: existing common-law benchmarks focus on English sources, and existing civil-law benchmarks focus on Simplified Chinese sources. TW-LegalBench comprises three task types: (1) over 16,000 multiple-choice questions (MCQs) across five years of official examinations in 18 professional domains; (2) 117 open-ended essay questions (OEQs) from examinations for legal professionals with official scoring rubrics; and (3) more than 14,000 legal judgment prediction (LJP) instances covering hundreds of crime categories. We evaluate 13 LLMs using accuracy for MCQs, a decomposed LLM-as-Judge framework based on the scoring-rubric points for OEQs, and metrics for sentencing accuracy and statute citation for LJP. Our results reveal that top-performing models exceed the passing threshold for qualified lawyers (passing rate: 11%) but fall short of that for judges and prosecutors (passing rate: 1-2%). For LJP, while models demonstrate reasonable verdict-type accuracy and sentence prediction capability, they struggle to cite exact legal articles. These findings highlight that reliable legal text generation remains challenging for LLMs, even though their performance on qualification examinations approaches human level.

2606.18733 2026-06-18 cs.SE cs.AI cross-list

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

Qiao Zhao, JianYing Qu, Jun Zhang, Yehua Yang, Hanwen Du, Zhongkai Sun

Affiliations: Baidu Inc

AI summary: Proposes SWE-Future, which uses historical repository evidence to forecast future task types (such as feature implementation and bug fixing) and synthesizes 200 coding-agent tasks conditioned on those forecasts, reducing reliance on replaying historical pull requests; reaches 58.1% future-work relevance across 80 repositories.

Abstract

Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid direct historical replay, but can drift away from real repository needs. We propose SWE-Future, a forecast-conditioned data synthesis method for future-oriented coding tasks. Given a forecast snapshot at time $T_0$, the method uses only pre-$T_0$ repository evidence to forecast future feature implementation/enhancement, bugfix, and refactor task families. We first validate this forecasting step retrospectively: after forecasts are fixed, later pull requests are used only to measure whether the predicted task families match future repository work. In an 80-repository study, the forecaster achieves 58.1% future-work relevance under the main semantic matching metric. We then use validated forecast families as conditioning signals to synthesize a 200-task coding-agent dataset across 61 repositories from a task-generation snapshot, rather than replaying the later pull requests used for validation. SWE-Future shows that repository-evolution forecasts can guide realistic, future-oriented coding-task synthesis while reducing direct dependence on historical pull-request replay.
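The retrospective validation protocol hinges on a strict temporal split: forecasts see only pre-$T_0$ evidence, and later work is used only for scoring. The sketch below illustrates that discipline with invented events. The event schema is hypothetical, and exact label matching stands in for the paper's semantic matching metric, which the abstract does not define.

```python
from datetime import date

def split_at_snapshot(events, t0):
    """Separate repository events into pre-T0 evidence and later hold-out."""
    evidence = [e for e in events if e["date"] < t0]
    holdout = [e for e in events if e["date"] >= t0]
    return evidence, holdout

def future_work_relevance(predicted, holdout):
    """Fraction of predicted task families that appear in later work."""
    observed = {e["family"] for e in holdout}
    return sum(f in observed for f in predicted) / len(predicted)

# Invented history for one repository.
events = [
    {"date": date(2025, 1, 10), "family": "bugfix"},
    {"date": date(2025, 2, 3), "family": "feature"},
    {"date": date(2025, 4, 20), "family": "feature"},
    {"date": date(2025, 5, 2), "family": "refactor"},
]
t0 = date(2025, 3, 1)
evidence, holdout = split_at_snapshot(events, t0)

# The forecast is fixed from `evidence` alone, then scored on `holdout`.
predicted = ["feature", "bugfix"]
score = future_work_relevance(predicted, holdout)
```

Keeping the split in one function makes it easy to audit that nothing dated on or after the snapshot leaks into the forecasting step.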

A Multi-Domain Benchmark for Detecting Text-Rich Images Generated by GPT-Image-2

Yijin Wang, Shuyi Wang, Wenhan Zhang, Yuqi Ouyang

AI summary: To address the lack of text-rich image detection in existing benchmarks, builds a multi-domain benchmark of 8,602 images across 6 categories and evaluates 5 detectors, finding that performance is highly domain-dependent and vulnerable to JPEG compression.

Abstract

Text-rich images often contain privacy-sensitive, transactional, or decision-relevant information. As recent multimodal image generation models become increasingly capable of synthesizing realistic textual content and structured visual designs, detecting AI-generated text-rich images has become an important challenge for digital trust and content authenticity. Existing benchmarks, however, largely focus on object-centric images and provide limited coverage of scenarios where textual semantics and layout organization are central. In this paper, we introduce a multi-domain benchmark for detecting text-rich images generated by OpenAI's GPT Image 2. The benchmark contains 8,602 images across six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. Using this benchmark, we evaluate five representative AI-generated image detectors in a zero-shot setting and analyze their overall, category-wise, and post-processing robustness. Our results show that detector performance is highly domain-dependent: methods that perform well in some categories often fail on others, and even the strongest conventional detector exhibits severe sensitivity to JPEG compression. We further conduct an exploratory evaluation with a multimodal vision-language model, revealing both its promise and its limitations on structured formats. These findings highlight the need for text- and layout-aware detection methods for modern AI-generated images. Our dataset is released at XXX.

2606.18266 2026-06-18 cs.HC cs.AI cs.SD cross-list

EMORSION: Examining the Impact of Audio Parameters on Emotional Responses and Immersion in Film

Nelly Garcia, Ruby Crocker, Bleiz M Del Sette, Fabrizio Smeraldi, Charalampos Saitis, George Fazekas, Joshua Reiss

Affiliations: Queen Mary University of London

AI summary: Manipulates three audio parameters (frequency, dynamics, and directionality) to study how film audio design affects audience emotion and immersion; finds that subtle changes can alter emotional perception and that unconventional mixes increase variability in interpretation.

Comments AES Europe 2026

Abstract

EMORSION is an exploratory proof-of-concept study examining how film audio design shapes audience emotion and immersion in a cinema setting. Four film scenes were selected across the horror (2) and drama (2) genres, balanced between mainstream and independent productions. For each scene, multiple alternative audio mixes were created by systematically manipulating three core aspects of audio design: frequency (pitch), dynamics (loudness), and directionality (spatial placement). Three audience groups viewed the scenes, with each group exposed to one manipulated mix alongside a control mix for each scene. Audience responses were assessed through a triangulated multimodal framework combining self-reported emotion and immersion via a questionnaire, physiological measures including heart rate monitoring, and video-based motion tracking. The protocol successfully captured measurable, interpretable differences across audio conditions, indicating that even subtle changes in audio design can shape emotional perception and immersion. Unconventional mixes tended to produce greater variability in audience interpretation, while conventional immersive mixes were associated with stronger cross-audience agreement. These findings establish the feasibility of the EMORSION protocol and motivate larger-scale studies to characterise the role of specific audio parameters in shaping audience experience.

2606.18280 2026-06-18 stat.AP cs.AI cross-list

IOAH3: Importance-Driven Adaptive Spatial Partitioning

Ehsaneddin Jalilian

Affiliations: Interdisciplinary Transformation University Austria

AI summary: Proposes IOAH3, which builds adaptive spatial partitions through multi-source feature extraction, Markov Random Field graph-cut optimization, and data-driven hierarchical refinement, addressing the modifiable areal unit problem.

Abstract

We present IOAH3 (Importance-Oriented Adaptive H3 partitioning), a computational method for constructing data-driven spatial partitions of geo-referenced observation domains. Standard approaches to spatial aggregation adopt fixed areal units, such as administrative boundaries or uniform hexagonal grids at a single resolution, without regard to the informational content of the underlying observations in each region. This leads to the well-known modifiable areal unit problem: statistical and inferential results depend on the arbitrary choice of partition, and spatially concentrated phenomena are averaged out in coarse cells that obscure fine-scale structure. IOAH3 addresses this by constructing an adaptive partition in three stages: multi-source feature extraction and importance scoring via principal component analysis over road density, POI density, building density, and terrain roughness signals, with population and flood-hazard data entering as auxiliary inputs to cell filtering and spatial smoothness; spatial cell selection via Markov Random Field graph-cut optimisation, which jointly maximises per-cell importance while enforcing spatial contiguity; and data-driven hierarchical refinement of high-importance regions to finer H3 resolution levels, with neighbour-propagated support to avoid isolated fine-resolution islands. The resulting partitions serve as input to spatial inference pipelines and provide a principled resolution of the partition-sensitivity problem prior to any modelling step.

2606.18319 2026-06-18 cs.LG cs.AI cs.HC cs.SE cross-list

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

Ethan Chew, Enjia Wu, Iruss Eng Wei Yeow, Ian Weiqin Lim, Ranen Sim, Brandon Koh Ziheng, Kaleb Nim, Caden Toh Jun Yi, Wei Dong Soin, Darius Kai Keat Koh, Galen King Yu Tay, Prannaya Gupta, Jonathan Ee Fang Koong, Yong Zhi Lim

Affiliations: Air Emerging Technologies High-Speed Experimentations and Research (AETHER), RSAF Agile Innovation Digital (RAiD), Republic of Singapore Air Force

AI summary: Proposes the ASTRA simulator, which lowers word error rate to 23.45% with a fine-tuned ASR pipeline and integrates an AI evaluation framework to enable scalable, standardized ATCO training.

Abstract

Air Traffic Control Operators (ATCOs) are vital in ensuring the safe, orderly, and efficient flow of air traffic, yet training capacity is constrained by reliance on specialized human trainers known as simpilots, who must role-play both pilots and ATCOs in a simulated airspace. Existing automated solutions rely on Western-centric speech models that perform poorly in Singaporean operational contexts, with off-the-shelf systems exhibiting Word Error Rates (WER) of up to 107.80% on Singaporean-accented aviation speech. We introduce ASTRA, an end-to-end training simulator that automates these simpilot roles through a pipeline that transcribes ATCO speech, interprets instructions, and generates appropriate pilot and ATCO responses using locally adapted voice models. Our fine-tuned Automatic Speech Recognition (ASR) pipeline reduces WER to 23.45%, substantially outperforming existing approaches in this domain. Beyond traffic simulation, ASTRA incorporates an AI-assisted performance evaluation framework that assesses trainee radiotelephony communications across accuracy, brevity, and completeness, achieving post-optimization scores of 91.7%, 88.2%, and 86.9%, respectively. Built on open-source foundations such as DSPy and Unsloth, this approach enables scalable, standardized ATCO assessment while reducing instructor workload.
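A WER above 100%, such as the 107.80% quoted above, is possible because insertions count as errors while the denominator is the reference length only. The sketch below computes word-level edit distance in plain Python; the two utterances are invented for illustration and are not taken from the paper's data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# Invented example: a 3-word instruction misheard as 5 unrelated words.
wer = word_error_rate("descend two thousand", "the sender to town and")
```

Here three substitutions and two insertions against a three-word reference give a WER of 5/3, so a transcript that hallucinates extra words can score worse than an empty one.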

2606.18379 2026-06-18 cs.IR cs.AI cross-list

RankGraph-2: Lifecycle Co-Design for Billion-Node Graph Learning in Recommendation

Renzhi Wu, Zikun Cui, Junjie Yang, Tai Guo, Hong Li, Xian Chen, Li Yu, Ke Pan, Sri Reddy, Mahesh Srinivasan, Nipun Mathur, Haomin Yu, Hong Yan

Affiliations: Meta Platforms

AI summary: Billion-scale graph retrieval usually treats graph construction, representation learning, and real-time serving in isolation; RankGraph-2 co-designs the three stages (for example, co-training the cluster index and pre-computing neighborhoods), cutting serving compute cost by 83% while achieving 3.8x higher recall than GAT + Deep Graph Infomax and delivering CTR and CVR gains.

Abstract

Graph-based retrieval at billion-node scale requires jointly solving three tightly coupled problems (graph construction, representation learning, and real-time serving), yet existing work addresses each in isolation. We present RankGraph-2, a framework deployed at Meta that co-designs all three lifecycle stages for similarity-based retrieval (U2U2I and U2I2I), where each stage's requirements shape the others. Serving requires a co-learned cluster index to avoid expensive online KNN, which pushes index co-training into the training objective. Training benefits from the observation that similarity-based retrieval tolerates pre-computed neighborhoods, eliminating online graph infrastructure, which in turn requires construction to produce self-contained data. Construction must also support hour-level refresh for item coverage. Acting on these cascading requirements, RankGraph-2 reduces hundreds of trillions of edges to hundreds of billions via subsampling with popularity bias correction, pre-computes multi-hop neighborhoods via personalized PageRank, and co-learns a residual-quantization cluster index that reduces serving computational cost by 83%. This lifecycle co-design enables a simple architecture to achieve 3.8x higher recall than a GAT + Deep Graph Infomax model on a bipartite graph and 2.1x higher than PyTorch-BigGraph on item retrieval. RankGraph-2 delivers up to +0.96% CTR and +2.75% CVR, and has powered 20+ retrieval launches across major surfaces.
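Pre-computing multi-hop neighborhoods with personalized PageRank can be sketched on a toy graph. This is a plain power-iteration illustration, not the RankGraph-2 pipeline: the five-node user-item graph, the restart probability `alpha`, and the function names are all assumptions made for the example, and nothing here reflects billion-node engineering.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Power iteration for a random walk that restarts at `source`."""
    rank = {n: 1.0 if n == source else 0.0 for n in adj}
    for _ in range(iters):
        nxt = {n: (alpha if n == source else 0.0) for n in adj}
        for node, neighbors in adj.items():
            if not neighbors:
                # Dangling node: return its mass to the source.
                nxt[source] += (1 - alpha) * rank[node]
                continue
            share = (1 - alpha) * rank[node] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank

def top_k_neighbors(adj, source, k):
    """Offline neighborhood for `source`: the k highest-scoring other nodes."""
    scores = personalized_pagerank(adj, source)
    ranked = sorted((n for n in scores if n != source),
                    key=lambda n: (-scores[n], n))
    return ranked[:k]

# Toy undirected user-item graph (u* are users, i* are items).
adj = {
    "u1": ["i1", "i2"],
    "u2": ["i2", "i3"],
    "i1": ["u1"],
    "i2": ["u1", "u2"],
    "i3": ["u2"],
}
scores = personalized_pagerank(adj, "u1")
neighborhood = top_k_neighbors(adj, "u1", 2)
```

Storing a fixed top-k list per node is what lets training read self-contained neighborhoods instead of querying a live graph service, which is the property the abstract attributes to pre-computation.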

2606.18393 2026-06-18 eess.SY cs.AI cs.SY cross-list

Learning-Based Decision Making for Combustion Phasing Control in Multi-Fuel CI Engines with Latent Fuel Reactivity Estimation

Rajasree Sarkar, Aditya Satish Patil, Arunava Banerjee, Ihsan Berk Altiner, Zongxuan Sun, Kenneth Kim, Chol-Bum Mike Keown

Affiliations: Department of Mechanical Engineering, University of Minnesota Twin Cities; DEVCOM Army Research Laboratory, Aberdeen Proving Ground

AI summary: For multi-fuel compression-ignition engines whose fuel reactivity (cetane number) is unknown and time-varying, proposes a GRU-guided reinforcement learning framework that learns a compact fuel-reactivity representation from combustion history and achieves stable CA50 control with mean tracking error below 0.25° CA.

Abstract

Multi-fuel compression-ignition engines offer fuel flexibility but introduce uncertain, time-varying fuel reactivity, represented by cetane number (CN), which complicates cycle-to-cycle combustion-phasing control. This work formulates CA50 regulation under latent CN variation as a partially observable sequential decision problem and systematically evaluates controllers with increasing temporal and representational capacity, including LinUCB, history-augmented contextual bandits, observation-only DDPG, recurrent DDPG, and a proposed GRU-guided RL framework. A Gaussian-process surrogate trained on experimental multi-fuel engine data provides a controlled and reproducible evaluation environment. Results show that myopic and fixed-history bandit methods degrade under CN variation, observation-only RL suffers from latent-state aliasing, and generic recurrence is insufficient when CN evolves rapidly. The proposed framework learns a compact GRU-based representation of fuel reactivity from combustion history and conditions both actor and critic on this estimated signal rather than oracle CN. By training the policy on the same imperfect fuel-reactivity information available at deployment, the controller avoids train-deploy inconsistency in conventional online estimate-then-control pipelines. Across unseen CN trajectories, the policy achieves stable CA50 regulation with mean absolute tracking error below 0.25° CA at the training setpoint, while producing smooth, physically consistent SOI and glow-plug-power actuation. These results show that combustion control under latent, continuously evolving fuel dynamics requires more than standalone estimation or generic recurrence. By aligning fuel-reactivity inference with control policy learning, the proposed framework enables reactivity-aware decision-making using the same estimated state available during deployment.


2606.18395 2026-06-18 eess.SP cs.AI cs.AR cs.SY eess.SY 交叉投稿

Deep Learning-Driven Inverse Design of Doherty Power Amplifiers Using Pixelated Combiners and Dual-State Impedance Synthesis

基于深度学习的Doherty功率放大器逆向设计：使用像素化合成器和双态阻抗合成

Han Zhou, Haojie Chang, David Widen, Christian Fager

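Affiliations: Tampere University; Chalmers University of Technology

AI summary: Proposes a design method for three-port Doherty combiners that combines a deep convolutional neural network, pixelated layouts, and a genetic algorithm, achieving dual-state impedance synthesis under peak and back-off power conditions, with saturated output power above 44.2 dBm and peak drain efficiency above 71.2% across 2.6-2.8 GHz.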

2606.18402 2026-06-18 eess.SP cs.AI cs.AR cs.SY eess.SY 交叉投稿

Deep-Learning-Based Pixelated Microwave Filter Design and Characterization using Electro-Optical Electric-Field Measurements

基于深度学习的像素化微波滤波器设计与表征：利用电光电场测量

Han Zhou, Richard Bannister, Caspar Pierce, Haojie Chang, David Widen, Ludvig Fornstedt, Gabriel Melin, Alexander Bohlin, Pontus Lindeberg Fredriksson, Dilbagh Singh, Christian Fager, Koen Buisman

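Affiliations: Chalmers University of Technology; Advanced Technology Institute, University of Surrey; National Physical Laboratory

AI summary: Proposes a deep-learning method combining a convolutional neural network with a genetic algorithm to automatically synthesize pixelated microwave filters, validated experimentally through S-parameter and spatial electric-field measurements; the design achieves a 7 GHz passband with more than 20 dB of rejection above 9.5 GHz, and electro-optical measurements reveal the electric-field patterns of an AI-generated design for the first time.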

2606.18425 2026-06-18 cs.SE cs.AI cs.DC 交叉投稿

From Specification to Execution: AI Assisted Scientific Workflow Management

从规范到执行：AI辅助的科学工作流管理

Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

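Affiliations: RENCI, University of North Carolina at Chapel Hill, NC, USA; Information Sciences Institute, University of Southern California, Marina del Rey, CA, USA

AI summary: Proposes an AI-assisted approach that combines specification-driven workflow generation, automated debugging, and distributed execution, integrating Pegasus with an MCP layer to manage large-scale scientific workflows end to end, starting from natural language.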


AI中文摘要

科学工作流管理系统（WMS）支持复杂管道的可扩展和可重复执行，但工作流的设计、实现和调试仍然主要依赖人工，需要大量专业知识。最近使用大型语言模型（LLM）的方法在从自然语言生成工作流方面显示出潜力，但通常依赖于直接的代码合成，这限制了透明度、可重复性以及与工作流系统的集成。我们提出了一种AI辅助的科学工作流管理方法，结合了规范驱动的工作流生成、自动化调试和分布式执行。该方法引入了一个结构化的规范阶段，将工作流意图、设计和实现分离，允许在代码生成之前进行验证。我们还开发了一个基于LLM的调试代理，用于诊断和解决跨多个系统层的故障。为了支持分布式执行和用户交互，我们将广泛使用的WMS Pegasus与模型上下文协议（MCP）层集成，为工作流提交、监控和控制提供统一接口。我们使用一个用于医学影像的联邦学习工作流来评估该方法，该工作流具有并行、迭代和依赖密集的结构。该系统生成并执行了包含数千个作业的大规模工作流，减少了调试工作量，并允许非专家用户使用专家级设计模式构建工作流。这些结果表明，端到端的AI辅助工作流生成和执行是可行的，并指向了用于管理科学工作流生命周期的AI驱动平台。

英文摘要

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.


2606.18444 2026-06-18 cs.LG cs.AI 交叉投稿

TMR-GGNN: Credit Card Fraud Detection based on Time-Aware Multi-Relational Guided Graph Neural Network

TMR-GGNN：基于时间感知多关系引导图神经网络的信用卡欺诈检测

Rohit Tewari, Shubhankar Shilpi, Navin Chhibber, Devendra Singh Parmar, Sunil Khemka, Piyush Ranjan

Affiliations: Unysis Truist Banks Infinity Tech Group Technical Product; Fairfax, USA; Atlanta, USA; Sunnyvale, USA; Persistent Systems IEEE Vice Chair AeroSpace Chapter; Discover Financial Services; Edison, USA

AI summary: Proposes the TMR-GGNN framework, which models heterogeneous entity interactions within temporal windows, builds a dynamic multi-relational graph, and uses a time-aware attention mechanism and a contrastive-learning decoder, with a composite InfoNCE and Focal Loss objective, to address data imbalance and evolving fraud patterns.

Comments 2025 2nd International Conference on Software, Systems and Information Technology (SSITCON), Pages 7


AI中文摘要

近年来，由于高度不平衡的数据、不断演变的欺诈模式以及交易实体间复杂的关联结构，信用卡欺诈检测面临重大挑战。为解决这些问题，本研究提出了一种名为时间感知多关系引导图神经网络（TMR-GGNN）的新框架。具体而言，所提出的TMR-GGNN通过建模客户、商户、设备和IP在时间窗口内的异构交互，扩展了编码器-解码器图神经网络（GNN）架构。随后，该TMR-GGNN方法构建了一个动态的多关系图，并在编码器中引入时间感知关系注意力机制，以基于时间邻近性和语义上下文自适应地权衡交易相关性。因此，解码器采用对比学习模块来区分真实和合成的交易模式，同时提高模型对罕见欺诈案例的泛化能力。此外，为有效管理严重的类别不平衡并强调判别性学习，引入了结合基于信息噪声对比估计（InfoNCE）的对比损失与Focal Loss的复合损失函数。这种集成有助于改进欺诈识别，同时减少假阴性。

英文摘要

In recent years, credit card fraud detection has faced significant challenges due to highly imbalanced data, evolving fraud patterns, and complex relational structures among transaction entities. To address these issues, this research proposes a novel framework called the Time-Aware Multi-Relational Guided Graph Neural Network (TMR-GGNN). Specifically, TMR-GGNN extends the encoder-decoder Graph Neural Network (GNN) architecture by modeling heterogeneous interactions across customers, merchants, devices, and IPs over temporal windows. The approach constructs a dynamic, multi-relational graph and incorporates a time-aware relational attention mechanism within the encoder to adaptively weight transaction relevance based on temporal proximity and semantic context. The decoder employs a contrastive learning module to distinguish between real and synthesized transaction patterns while improving the model's generalization to rare fraud cases. Additionally, to manage severe class imbalance and emphasize discriminative learning, a composite loss function is introduced that combines an Information Noise Contrastive Estimation (InfoNCE) based contrastive loss with Focal Loss. This combination helps improve fraud identification while reducing false negatives.
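The composite objective described in this abstract can be sketched as follows. This is a minimal illustration only: the abstract does not give the weighting between the two terms, the temperature, or the focal parameters, so `lam`, `temperature`, `gamma`, and `alpha` below are assumed values, and the function names are hypothetical rather than the authors' code.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.1):
    """InfoNCE for one anchor: -log(exp(s+/t) / (exp(s+/t) + sum_j exp(s_j-/t)))."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

def focal_loss(p_fraud, label, gamma=2.0, alpha=0.25):
    """Binary focal loss; label is 1 for fraud and 0 for legitimate."""
    p_t = p_fraud if label == 1 else 1.0 - p_fraud
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    # (1 - p_t)^gamma shrinks the contribution of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def composite_loss(sim_pos, sim_negs, p_fraud, label, lam=0.5):
    """Assumed combination: focal term plus lam times the contrastive term."""
    return focal_loss(p_fraud, label) + lam * info_nce(sim_pos, sim_negs)

# A confidently correct fraud prediction with a well-separated positive pair
# yields a smaller loss than a confidently wrong one.
good = composite_loss(0.9, [0.1, 0.0], p_fraud=0.95, label=1)
bad = composite_loss(0.1, [0.9, 0.8], p_fraud=0.05, label=1)
print(good < bad)
```

The focal term down-weights easy examples, which is why it is commonly paired with heavily imbalanced labels such as fraud data.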


2606.17077 2026-06-18 physics.chem-ph cs.AI cs.LG quant-ph 交叉投稿

Comprehensive pKa Data Augmentation from Limited Real Data through an Engineered Models-Quantum Framework

基于工程化模型-量子框架从有限真实数据中全面增强pKa数据

Wang Rui, Liu Dinghao

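Affiliations: Department of Chemistry, Tsinghua University; Department of Chemical Engineering, Tsinghua University; School of Science, China Pharmaceutical University

AI summary: Addresses the sparsity of pKa data with a quantum-assisted molecular generation method that uses predictions from optimized machine-learning models and quantum-annealer sampling, achieving extreme-value sampling on coherent Ising machines.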


AI中文摘要

质子解离常数(pKa)对于功能分子发现和分子建模至关重要。基于已建立的最大实验pKa数据库iBonD，我们和其他研究人员开发了多种方法，包括基于机器学习的经验预测和高精度能量计算。尽管如此，高质量pKa数据的快速增强仍然受到根本性限制。作为这项工作的一部分，我们使用一组经过广泛优化的机器学习模型，对未标记分子数据集进行了大规模基于回归的pKa预测。结果表明，由于未标记分子数据集的特征分布，pKa数据分布近似正态，尾部区域样本极度稀缺。尽管这种增强对于提高整体数据可用性和预测建模非常有价值，但对于高效发现具有广谱pKa性质的分子仍然不足。为了解决这个问题，我们探索从广阔的化学空间中定向生成具有稀疏pKa性质的分子。鉴于传统的连续潜在空间VAE-RNN分子生成方法稳定性不足，且在补充稀疏数据方面未能显示出明显优势，我们设计并实现了一种量子辅助的稀疏pKa分子生成。在模拟量子退火器上验证了可行性，并在物理相干伊辛机(CIM)上进一步实现了优越的极端值采样。(未完待续)

英文摘要

Proton dissociation constants (pKa) are critical for functional molecule discovery and molecular modeling. Building on iBonD, the largest established experimental pKa database, we and other researchers have developed several methods including machine-learning-based empirical prediction and high-accuracy energy calculations. Despite this foundation, the rapid augmentation of high-quality pKa data remains fundamentally constrained. As part of this work, we performed large-scale regression-based pKa prediction on unlabeled molecular datasets using a collection of extensively optimized machine-learning models. The results indicate that, owing to the feature distributions of unlabeled molecular datasets, the pKa data distribution is approximately normal, with an extreme scarcity of tail-region samples. Although such augmentation is highly valuable for improving overall data availability and predictive modeling, it remains insufficient for efficiently discovering molecules with broad-spectrum pKa properties. To address this, we explore the targeted generation of molecules with sparse pKa properties from the vast chemical space. Given that traditional continuous latent space VAE-RNN methods for molecular generation suffer from insufficient stability and fail to demonstrate clear advantages in complementing sparse data, we design and implement a quantum-assisted method for sparse-pKa molecular generation. Feasibility is validated on a simulated quantum annealer, and superior extreme-value sampling is further achieved on physical coherent Ising machines (CIMs). (to be continued)


2606.18548 2026-06-18 cs.CY cs.AI 交叉投稿

Engagement Intensity as a Learner-Modeling Signal for Adaptive AI Ethics Instruction

参与强度作为自适应AI伦理教学的学习者建模信号

Yongkyung Oh, Lynn Talton, Alex Bui

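Affiliations: University of California, Los Angeles (UCLA)

AI summary: Compares three learner features (usage frequency, self-rated familiarity, and prior AI education) against AI perception outcomes and finds that usage frequency is significantly associated with all five outcomes, offering a simple intake signal for learner modeling in adaptive AI ethics instruction.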


AI中文摘要

在研究生研究训练中，自适应AI伦理教学受益于反映先前LLM经验差异的入学者测量指标。先前的课程或研讨会参与是一个明显的候选指标，但尚不清楚它是否与关键AI感知项目的教学前评分相关。我们比较了三种候选入学者特征：自我报告的使用频率、自评LLM熟悉度和先前AI教育，针对93名参加必修研究伦理课程的生命科学研究生和博士后学员的五项基线感知结果。使用频率与所有五项结果显示出Holm校正的关联，自评熟悉度与三项结果相关，而先前AI教育与任何结果均无关联。在量表低端呈现阈值模式，在训练兴趣和准确性信任方面最为明显，而非在所有五项结果上呈现均匀梯度。在简短的入学者调查中，报告的LLM使用比先前的课程或研讨会更一致地与这些感知相关，自评熟悉度作为次要指标。这些结果表明，简单的教学前行为信号可以为自适应AI伦理教育的轻量级入学者画像提供信息。

英文摘要

Adaptive AI ethics instruction in graduate research training benefits from intake measures that reflect differences in prior LLM experience. Prior coursework or workshop attendance is an obvious candidate, but it is not clear whether it is associated with pre-instruction ratings on key AI perception items. We compare three candidate intake features, self-reported usage frequency, self-rated LLM familiarity, and prior AI education, across five baseline perception outcomes in 93 bioscience graduate and postdoctoral trainees enrolled in a required research ethics course. Usage frequency shows Holm-corrected associations with all five outcomes, self-rated familiarity with three, and prior AI education with none. A threshold-like pattern at the lower end of the scale is most visible for training interest and accuracy trust rather than appearing as a uniform gradient across all five outcomes. In a short intake survey, reported LLM use is more consistently associated with these perceptions than prior coursework or workshops, with self-rated familiarity serving as a secondary indicator. These results suggest that simple pre-instruction behavioral signals can inform lightweight intake profiling for adaptive AI ethics education.
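For readers unfamiliar with the Holm correction mentioned in this abstract, the step-down adjustment can be sketched as below. The p-values in the example are made up for illustration and are not taken from the study; the abstract does not say which underlying association tests produced the raw p-values.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Illustrative raw p-values for five outcomes (hypothetical numbers).
raw = [0.001, 0.012, 0.020, 0.030, 0.040]
for p, q in zip(raw, holm_adjust(raw)):
    print(f"raw={p:.3f}  holm={q:.3f}")
```

An association is reported as significant after correction when its adjusted p-value is below the chosen level, commonly 0.05.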


2606.18596 2026-06-18 cs.HC cs.AI 交叉投稿

Better Adherence, Richer Context: A Field Evaluation of LLM-Powered Conversational Voice Diaries for Sleep

更好的依从性，更丰富的上下文：基于LLM的对话式语音睡眠日记的现场评估

Amama Mahmood, Bokyung Kim, Honghao Zhao, Molly E. Atwood, Luis F. Buenaver, Michael T. Smith, Chien-Ming Huang

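Affiliations: The Johns Hopkins University; Department of Psychiatry and Behavioral Sciences, The Johns Hopkins University School of Medicine

AI summary: Evaluates an LLM-powered conversational voice sleep diary in a field study and finds that, compared with a text diary, the voice diary improved adherence and collected more detailed contextual information, but with lower completeness of structured fields.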


AI中文摘要

睡眠日记是行为睡眠医学和失眠认知行为疗法的核心，但每日完成难以维持，静态形式通常为解释夜间睡眠变化提供的上下文有限。我们设计了一个基于LLM的对话式语音日记，通过主动智能音箱提示、结构化对话输入和自适应后续对话，提供临床基础的早晚睡眠日记问题。我们在为期四周的受试者间现场研究中评估了该系统，涉及30名大学生，使用匹配的日记项目、报告窗口和提醒间隔，与基于文本的移动日记进行比较。与文本日记相比，对话式语音日记显示出更高的依从性，并引发了关于日常习惯、压力源、环境条件和其他睡眠相关因素的更详细上下文自我报告。参与者还描述语音日记更容易融入日常，尽管感知完成时间更长。然而，基于语音的对话输入导致某些结构化日记字段的完整性较低，揭示了表达丰富性与结构化精度之间的权衡。这些发现展示了使用基于LLM的对话式语音助手进行纵向健康自我报告的前景和挑战。

英文摘要

Sleep diaries are central to behavioral sleep medicine and cognitive behavioral therapy for insomnia, yet daily completion is difficult to sustain, and static forms often provide limited context for interpreting night-to-night sleep variation. We designed an LLM-powered conversational voice diary that delivers clinically grounded morning and evening sleep diary questions through proactive smart-speaker prompts, structured conversational intake, and adaptive follow-up dialogue. We evaluated the system in a four-week between-subjects field study with 30 university students, comparing it with a text-based mobile diary using matched diary items, reporting windows, and reminder intervals. Compared with the text-based diary, the conversational voice diary showed higher adherence and elicited more detailed contextual self-report about routines, stressors, environmental conditions, and other sleep-related factors. Participants also described the voice diary as easier to integrate into daily routines, despite longer perceived completion time. However, voice-based conversational intake produced lower completeness for some structured diary fields, revealing a trade-off between expressive richness and structured precision. These findings show both the promise and the challenge of using LLM-powered conversational voice assistants for longitudinal health self-report.


2606.18599 2026-06-18 cs.CR cs.AI 交叉投稿

MIDS: Detecting Stealthy Masquerade and Tampering Attacks on CAN Bus via Bidirectional Mamba

MIDS：通过双向Mamba检测CAN总线上的隐蔽伪装和篡改攻击

Qiqi Liu, Runhan Song, Lei Cui, Heng Zhang, Yuyan Sun, Limin Sun

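Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Zhongguancun Laboratory

AI summary: Targets the CAN bus's exposure to attack due to its lack of encryption and authentication; proposes MIDS, a dual-stream framework that uses bidirectional state-space models to process identifiers and payloads in parallel, reaching an F1 of 96.94% on a Tesla Model 3 dataset, more than 8 percentage points above the baseline.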


AI中文摘要

控制器局域网（CAN）协议是现代车辆中电子控制单元（ECU）的主要通信标准，但其缺乏加密和认证，使其面临一系列安全威胁。现有的入侵检测系统主要针对制造型攻击（通过帧注入实现的DoS、模糊测试、ID欺骗），此类攻击中每ID到达间隔统计等检测信号易于获取。我们转而解决更困难的伪装场景，其中内部攻击者在其原始传输时隙原位替换合法帧，保持流量周期性，使基于流量统计的防御失效。我们提出Mamba入侵检测系统（MIDS），一种创新的双流框架，并行处理CAN标识符和载荷，并通过双向选择性状态空间建模重建其联合时间语义。为评估MIDS，我们从物理特斯拉Model 3在三种驾驶模式下收集了超过1亿个CAN帧，并合成了54种伪装攻击变体，涵盖仅ID、仅数据和组合修改。MIDS在该数据集上达到96.94%的F1分数，超过最强可复现基线8个百分点以上，同时保持1.147毫秒的单窗口推理延迟——为实时车载部署留有充足余量。为验证泛化能力，我们进一步在四个公开基准（ROAD、CrySyS、OTIDS、CT&T）上评估MIDS，涵盖伪装和注入场景；在统一的5折协议下，MIDS的F1分数从93.70%到99.61%，超过八个复现基线中最强者最多13.94个百分点。

英文摘要

The Controller Area Network (CAN) protocol is the primary communication standard for Electronic Control Units (ECUs) in modern vehicles, but its lack of encryption and authentication exposes it to a range of security threats. Existing intrusion detection systems are largely tuned to fabrication-style attacks (DoS, fuzzing, ID spoofing realised by frame injection), in which detection signals such as per-ID inter-arrival statistics are readily available. We instead address the harder masquerade setting \cite{b37}, in which an internal adversary substitutes a legitimate frame in-situ at its original transmission slot, preserving traffic periodicity and rendering traffic-statistic defences ineffective. We propose the Mamba Intrusion Detection System (MIDS), an innovative dual-stream framework that processes CAN identifiers and payloads in parallel and reconstructs their joint temporal semantics through bidirectional selective state-space modelling. To evaluate MIDS, we collected over 100 million CAN frames from a physical Tesla Model 3 across three driving regimes and synthesised 54 masquerade attack variants spanning ID-only, data-only, and combined modifications. MIDS attains an F1 of 96.94% on this dataset, exceeding the strongest reproducible baseline by more than 8 percentage points, while sustaining a 1.147 ms single-window inference latency, ample headroom for real-time onboard deployment. To verify generalisation, we further evaluate MIDS on four public benchmarks (ROAD, CrySyS, OTIDS, CT&T) covering both masquerade and injection scenarios; MIDS attains F1 from 93.70% to 99.61%, outperforming the strongest of eight reproduced baselines by up to 13.94 percentage points under a unified 5-fold protocol.


2606.18611 2026-06-18 cs.SD cs.AI cs.LG stat.ML 交叉投稿

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

QC-GAN: 一种参数高效的四元数Conformer GAN用于高保真语音增强

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

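Affiliations: The Asahi Shimbun Company; Tokyo Woman's Christian University

AI summary: Proposes the parameter-efficient QC-GAN, which combines a quaternion Conformer generator with MetricGAN training and reduces parameter count through Hamilton-product weight sharing, reaching PESQ 3.48 on VoiceBank+DEMAND with 0.89M parameters, comparable to models twice its size.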

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

2606.18617 2026-06-18 cs.CY cs.AI 交叉投稿

AI-Driven Assessment of Human Tutors: Linking Training Performance to Real-Life Practice

AI驱动的人类导师评估：将培训表现与实际教学实践联系起来

Danielle R. Thomas, Marie Cynthia Abijuru Kamikazi, Clara Brandt, Conrad Borchers, Kenneth R. Koedinger

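Affiliations: Carnegie Mellon University; Vanderbilt University

AI summary: Presents an AI system that uses generative AI to analyze transcripts of real tutoring sessions and assess the transfer of tutor skills; training performance significantly predicts real-life tutoring scores (effect size 0.25 SD), and the work contributes open datasets and scoring rubrics.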

Comments Full research paper accepted at EC-TEL 2026


AI中文摘要

存在大量的导师培训平台。然而，很少有平台基于实际表现提供AI驱动的人类导师培训和评估。我们提出一个AI驱动系统，评估培训中的开放式回答和真实的实际辅导。与仅通过在线培训或模拟评估学习的平台不同，我们的系统利用生成式AI（Gemini-2.5-pro）分析真实辅导的转录，衡量导师技能向实际应用的迁移。远程辅导学生数学的人类导师（N=86）完成了六个基于场景的课程，平均显著学习增益为7.4%。使用跨405个会话-课程对的混合效应模型，我们发现培训表现显著预测实际辅导转录得分，效应量为0.25 SD。模型比较（AIC/BIC）表明，培训期间开放式回答和多项选择表现的平均值最能预测实际辅导表现，尽管开放式回答相对更具预测性。探索性分析显示，培训后，导师遇到应用技能的教学机会的可能性显著增加（从61.1%到68.9%），并且在这些机会中表现出更高的执行质量（从65.5%到68.1%）。中断时间序列分析表明，这些导师改进是随时间逐渐趋势的一部分，而非培训的即时干预效果。我们展示了一种将导师培训与实际评估联系起来的AI驱动方法。为此，我们贡献了开放数据集、AI提示和评分标准，以支持透明度和可重复性。

英文摘要

There exist numerous tutor training platforms. However, few provide AI-driven training and evaluation for human tutors based on real-life performance. We present an AI-driven system that assesses both open responses during training and authentic real-life tutoring. Unlike platforms that only assess learning through online training or simulations, our system utilizes Generative AI (Gemini-2.5-pro) to analyze transcriptions of authentic tutoring, measuring the transfer of tutor skills to real-life application. Human tutors instructing students remotely in math (N=86) completed six scenario-based lessons, averaging a significant 7.4% learning gain. Using mixed-effects models across 405 session-to-lesson pairs, we found that training performance significantly predicted real-life transcript scores with an effect size of 0.25 SD. Model comparison (AIC/BIC) indicated averaging open response and multiple choice performance during training predicted real-life tutor performance best, although open responses were comparatively more predictive. Exploratory analysis showed that after training, tutors were significantly more likely to encounter pedagogical opportunities to apply their skills (61.1% to 68.9%) and demonstrated higher execution quality within those opportunities (65.5% to 68.1%). Interrupted time series analysis suggested that these tutor improvements were part of a gradual trend over time rather than an immediate intervention effect of training. We illustrate an AI-driven method to link tutor training with real-life assessment. In doing so, we contribute open datasets, AI prompts, and scoring rubrics to support transparency and reproducibility.


2606.18645 2026-06-18 eess.AS cs.AI 交叉投稿

Domain-Shift Aware Neural Networks for Unbalance Characterization in Rotating Systems

Bernardo Feijó Junqueira, Claudio Kiyoshi Umezu, Bruno Bilhar Karaziack, Tomaz Junior, Daniel Alves Castello

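Affiliations: Springer Nature

AI summary: Proposes a domain-shift aware neural network that aligns source- and target-domain features through a maximum mean discrepancy strategy to solve the regression problem of estimating unbalance masses on rotating shafts under varying operating conditions; experiments show that the method significantly improves prediction accuracy when the domain shift is not known.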


AI中文摘要

本文研究了域偏移感知神经网络在回归任务中的应用，旨在估计不同运行条件下旋转轴的不平衡质量。实验数据来自一个测试台，其中主轴上安装有带不平衡质量的法兰，在不同转速下驱动，同时可选择性地激活副轴以引入域差异。不平衡质量固定在径向距离上，使用三轴加速度计记录系统的动态响应。质量估计的逆问题在域自适应框架中提出，网络采用最大均值差异策略进行训练，以对齐源域和目标域的特征表示。结果表明，显式处理域偏移能有效提高预测精度，尤其是在系统的物理行为和域偏移来源不完全已知且超出训练条件的情况下。这些发现凸显了域偏移感知模型在结构健康监测回归任务中的潜力。

英文摘要

This work investigates the application of a domain-shift aware neural network for regression tasks aimed at estimating unbalance masses in rotating shafts under varying operating conditions. Experimental data were collected from a test rig in which a primary shaft, equipped with a flange carrying unbalanced masses, was driven at different rotational speeds, while a secondary shaft could be optionally activated to introduce domain discrepancy. The unbalance masses were positioned at a fixed radial distance, and the dynamic response of the system was recorded using triaxial accelerometers. The inverse problem of mass estimation is formulated within a domain adaptation framework, where the network is trained with a maximum mean discrepancy strategy to align feature representations across source and target distributions. The results demonstrate the effectiveness of explicitly addressing domain shift in improving prediction accuracy, especially when the system's physical behavior and sources of domain discrepancy are not fully known and fall outside the training conditions. These findings highlight the potential of domain-shift aware models for regression tasks in Structural Health Monitoring.
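The maximum mean discrepancy (MMD) term used to align source and target features can be sketched as below. This is a plain-Python illustration of the standard biased estimator with a Gaussian kernel; the kernel choice, the bandwidth `sigma`, and how the term is weighted against the regression loss are assumptions, since the abstract does not specify them.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(set_a, set_b):
        total = sum(rbf_kernel(a, b, sigma) for a in set_a for b in set_b)
        return total / (len(set_a) * len(set_b))
    # E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

# Hypothetical 1-D features: the same distribution versus a shifted one.
source_features = [[0.0], [1.0]]
shifted_features = [[5.0], [6.0]]
print(mmd_squared(source_features, source_features))   # no shift
print(mmd_squared(source_features, shifted_features))  # clear shift
```

In a domain-adaptation setup, a term of this kind is typically added to the task loss so that the network is penalized when the feature distributions of the two operating conditions differ.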


2606.18897 2026-06-18 cs.IR cs.AI 交叉投稿

Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training

Ruiqi Lai, Dakai An, Wei Gao, Ju Huang, Siran Yang, Jiamang Wang, Lin Qu, Dmitrii Ustiugov, Wei Wang

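Affiliations: NTU Singapore; Hong Kong University of Science and Technology; Alibaba Group

AI summary: Targets the high cost of RL post-training for DiTs; proposes Spotlight, a system that exploits exploration's tolerance of stale weights and fast reconfiguration of SP groups to train efficiently on spot GPUs, with a 4x speedup and a 1.4-6.4x cost reduction.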


AI中文摘要

扩散Transformer（DiT）的强化学习（RL）后训练成本极高，需要数千块高端GPU。现有工作探索了两个降低成本的方向：种子探索通过选择高对比度样本来改善训练收敛，但增加了关键路径的计算量；抢占式GPU提供69-77%的成本降低，但在训练期间处于空闲状态，因为DiT rollout几乎同时完成，这阻止了类似LLM的rollout与训练流水线化。抢占式GPU的抢占进一步破坏了序列并行（SP）组，导致GPU拓扑碎片化。我们提出了Spotlight，这是第一个利用抢占式GPU进行DiT RL后训练的系统。Spotlight基于我们设计的两个关键洞察：（1）我们证明探索可以容忍过时的模型权重，因为使用前一次迭代模型权重的探索保留了随机种子的相对排序，允许探索在训练期间在空闲的抢占式GPU上运行。（2）SP重配置可以重用节点内状态，将组恢复时间从分钟级缩短到亚秒级启动。基于这些洞察，Spotlight引入了三种技术：基于bandit的探索规划器，在训练时间预算内最大化奖励方差；弹性序列并行，通过持久调度器和节点内权重复制动态重配置SP组；以及抢占感知的拉取式请求调度器，平衡负载并在抢占时提交进行中的状态。我们在开源RL平台ROLL上实现了Spotlight，并在Qwen-Image后训练上进行了评估。Spotlight达到相同目标验证分数的速度比基线快4倍，总成本降低1.4-6.4倍，同时在分辨率512×512和1280×1280的DeepSeek-OCR和Geneval数据集上实现了更优的图像质量。

英文摘要

Reinforcement learning (RL) post-training of Diffusion Transformers (DiTs) is prohibitively expensive, requiring thousands of high-end GPUs. Existing works explore two directions to reduce cost: seed exploration improves training convergence by selecting high-contrast samples, yet adds compute to the critical path; spot GPUs offer 69-77% lower cost, yet sit idle during training because DiT rollouts finish nearly simultaneously, which prevents LLM-style pipelining of rollout with training. Spot preemptions further break Sequence Parallelism (SP) groups, fragmenting GPU topology. We present Spotlight, the first system that harvests spot GPUs for DiT RL post-training. Spotlight rests on two key insights: (1) exploration can tolerate stale model weights, because exploration that uses the model weights from the previous iteration preserves the relative ranking of random seeds, allowing exploration to run on idle spot GPUs during training; (2) SP reconfiguration can reuse on-node state, reducing group recovery from minutes to sub-second launches. Built on these insights, Spotlight introduces three techniques: a bandit-based exploration planner that maximizes reward variance within the training time budget, elastic sequence parallelism that reconfigures SP groups on the fly via persistent schedulers and intra-node weight copying, and a preemption-aware pull-based request scheduler that balances load and commits in-flight state upon preemption. We implement Spotlight on the open-source RL platform ROLL and evaluate it on Qwen-Image post-training. Spotlight reaches the same target validation score $4\times$ faster than baselines, reducing total cost by $1.4$-$6.4\times$ while achieving superior image quality on the DeepSeek-OCR and Geneval datasets at resolutions of $512\times512$ and $1280\times1280$.


2606.19026 2026-06-18 cs.LG cs.AI physics.ao-ph 交叉投稿

A Hybrid LSTM--Vision Transformer Architecture for Predicting HRRR Forecast Errors

混合LSTM-视觉Transformer架构用于预测HRRR预报误差

David Aaron Evans, Jay C. Rothenberger, Kara J. Sulia, Nick P. Bassill, Chris D. Thorncroft

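Affiliations: Atmospheric Sciences Research Center, University at Albany, SUNY; University of Oklahoma; State Weather Risk Communication Center, University at Albany, SUNY

AI summary: Proposes a hybrid LSTM-ViT framework that combines surface-observation time series with atmospheric profiles to predict HRRR forecast errors for precipitation, wind speed, and temperature, improving on a baseline LSTM, with roughly a twofold gain in skill for precipitation error.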

Comments This manuscript is a preprint and has been submitted for peer review to the Artificial Intelligence for the Earth Systems journal. The content is subject to change based on the outcome of the peer review process and should not be considered final or definitive. Copyright in this Work may be transferred without further notice


AI中文摘要

高分辨率数值天气预报（NWP）系统中的预报误差通常与未解析的边界层（PBL）过程、对流、地形诱导环流以及其他垂直结构的大气现象有关。先前的研究表明，长短期记忆（LSTM）网络可以利用中尺度观测成功预测高分辨率快速刷新（HRRR）模型的预报误差，但我们认为性能下降与复杂垂直大气演化时期有关。为解决这一局限，我们开发了一种混合LSTM-视觉Transformer（LSTM-ViT）框架，将来自地表观测的时间序列学习与来自纽约州中尺度剖面仪网络的垂直大气廓线相结合。LSTM-ViT框架被训练用于预测单个中尺度站点上HRRR的逐时降水、10米风速和2米温度预报误差。在所有三个预测变量中，相对于基线LSTM架构，引入剖面仪导出的大气结构提高了预报误差预测技能，最大提升出现在较短的预报提前期和PBL活动增强期间。对于降水预报误差，改进尤为显著，LSTM-ViT框架相对于基线LSTM实现了约两倍的预测技能提升，同时更好地捕捉了对流驱动的误差演变并减少了与PBL过程相关的退化。这些结果表明，将时间序列学习与垂直注意力机制相结合，为改进业务NWP系统中的预报误差预测提供了一条具有物理意义的途径。我们的研究为预报员提供了关于模型偏差和预报置信度的增强指导。

英文摘要

Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric phenomena. Previous work demonstrated that Long Short-Term Memory (LSTM) networks can successfully predict forecast errors in the High-Resolution Rapid Refresh (HRRR) model using mesonet observations, but we believe performance degradation is linked to periods of complex vertical atmospheric evolution. To address this limitation, we develop a hybrid LSTM-Vision Transformer (LSTM-ViT) framework that combines temporal sequence learning from surface observations with atmospheric profiles from the New York State Mesonet profiler network. The LSTM-ViT framework is trained to predict HRRR hourly precipitation, 10 m wind speed, and 2 m temperature forecast errors at individual mesonet stations. Across all three predictors, incorporation of profiler-derived atmospheric structure improves forecast error prediction skill relative to the baseline LSTM architecture, with the largest gains occurring at shorter forecast lead times and during periods of enhanced PBL activity. Improvements are particularly pronounced for precipitation forecast error, where the LSTM-ViT framework achieves approximately a twofold increase in predictive skill relative to the baseline LSTM while better capturing convectively driven error evolution and reducing degradation associated with PBL processes. These results demonstrate that combining temporal sequence learning with vertically informed attention mechanisms provides a physically meaningful pathway for improving forecast error prediction in operational NWP systems. Our research offers forecasters enhanced guidance regarding model bias and forecast confidence.


2606.19042 2026-06-18 cs.SE cs.AI 交叉投稿

Where Did the Variability Go? From Vibe Coding to Product Lines by Regeneration

可变性去哪了？从氛围编码到通过再生的产品线

Xhevahire Tërnava

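Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France

AI summary: Studies the loss of variability in AI-driven programming (vibe coding) and proposes Variability by Regeneration (VbR), in which an LLM acts as the derivation engine and generates variant binaries free of dead code.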

Comments VARIABILITY 2026


Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis

Soheyl Bateni, Maryam Abdolali

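Affiliations: K. N. Toosi University of Technology

AI summary: Proposes ClaMPAPP, a hybrid system in which an LLM extracts structured features from free text and an XGBoost classifier makes the diagnosis; it outperforms end-to-end LLMs on two independent cohorts and improves diagnostic stability and auditability.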


AI中文摘要

大型语言模型（LLM）通过解释自由文本记录可使临床决策支持更易获取，但直接作为诊断引擎使用时，受提示敏感性、信息顺序以及看似合理但错误的输出限制。结构化机器学习模型提供更稳定的风险预测，但需要难以与叙事性临床工作流集成的表格输入。我们提出ClaMPAPP（临床语言辅助机器学习阑尾炎诊断流程），这是一个混合系统，将LLM用作接口而非最终决策者。ClaMPAPP从类似笔记的叙述中提取模式约束的临床特征，应用确定性合理性检查，并将验证后的特征传递给基于临床、实验室和超声变量训练的XGBoost分类器。我们在来自德国医院的两个独立小儿阑尾炎队列上评估了ClaMPAPP，并将其与端到端LLM基线（包括开源和专有模型）进行比较。为在测试自由文本输入时保留真实标签，通过模板渲染和约束LLM重写从结构化电子健康记录生成叙述，并附加句子顺序排列以评估位置鲁棒性。ClaMPAPP在内部和外部验证中均达到最强的整体诊断性能，同时最小化漏诊阑尾炎病例（急性分诊中的关键安全问题）。端到端LLM表现出不稳定的灵敏度-特异性权衡，且在叙述重排下性能下降更严重。这些结果支持LLM作为接口、ML作为预测器的设计，将自然语言可用性与预测推理分离，并为临床决策支持提供更可审计的路径。

英文摘要

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but incorrect outputs. Structured machine-learning models offer more stable risk prediction, yet they require tabular inputs that are difficult to integrate with narrative clinical workflows. We present ClaMPAPP (Clinical Language-assisted Machine-learning Pipeline for Appendicitis), a hybrid system that uses an LLM as an interface rather than as the final decision-maker. ClaMPAPP extracts schema-constrained clinical features from note-like narratives, applies deterministic plausibility checks, and passes validated features to an XGBoost classifier trained on clinical, laboratory, and ultrasound variables. We evaluated ClaMPAPP on two independent pediatric appendicitis cohorts from German hospitals and compared it with end-to-end LLM baselines, including open-source and proprietary models. To preserve ground truth while testing free-text input, narratives were generated from structured electronic health records through template rendering and constrained LLM rewriting, with additional sentence-order permutation to assess positional robustness. ClaMPAPP achieved the strongest overall diagnostic performance in both internal and external validation while minimizing missed appendicitis cases, the key safety concern in acute triage. End-to-end LLMs showed unstable sensitivity-specificity trade-offs and greater degradation under narrative reordering. These results support an LLM-as-interface, ML-as-predictor design that separates natural-language usability from predictive inference and provides a more auditable pathway for clinical decision support.
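The validation stage between the LLM and the classifier can be sketched roughly as follows. Everything here is hypothetical: the field names, types, and numeric bounds are illustrative placeholders, not the schema or plausibility rules used by ClaMPAPP, and the LLM extraction and XGBoost steps are deliberately left out.

```python
# Hypothetical schema: field name -> (type, lower bound, upper bound).
# The bounds are illustrative placeholders, not clinical reference ranges.
SCHEMA = {
    "age_years": (float, 0.0, 18.0),
    "wbc_count": (float, 0.0, 100.0),
    "body_temperature_c": (float, 30.0, 45.0),
}

def validate_features(extracted):
    """Keep schema fields that parse and pass range checks; mark the rest as missing.

    `extracted` stands in for the key-value output of an LLM extraction step.
    Returns (clean_features, rejected_field_names).
    """
    clean, rejected = {}, []
    for name, (field_type, low, high) in SCHEMA.items():
        try:
            value = field_type(extracted.get(name))
        except (TypeError, ValueError):
            # Absent or unparseable values are treated as missing.
            clean[name] = None
            rejected.append(name)
            continue
        if low <= value <= high:
            clean[name] = value
        else:
            # Out-of-range values are dropped rather than passed downstream.
            clean[name] = None
            rejected.append(name)
    return clean, rejected

# Keys outside the schema (such as "note") are ignored entirely.
features, problems = validate_features(
    {"age_years": "9", "wbc_count": 250, "note": "free text"}
)
print(features)
print(problems)
```

Keeping this step deterministic means every rejected value can be traced to an explicit rule, which is the kind of auditability the abstract argues for; the validated dictionary would then be passed to the tabular classifier.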


2606.19247 2026-06-18 cs.HC cs.AI cs.CY 交叉投稿

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

阿尔茨海默病和痴呆症护理人员的心理健康与技术需求分类

Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

Affiliations: Siebel School of Computing and Data Science; University of Illinois Urbana-Champaign; Department of Psychology; Illinois Neurological Institute; Department of Human-Centered Computing; Manning College of Information and Computer Sciences

AI summary: Introduces a Caregiver Mental Health and Technology Taxonomy that systematically links the needs of AD/ADRD caregivers to classes of technology-based interventions, identifies mismatches between caregiver priorities and existing technological support, and highlights under-served domains such as relational strain and compassion fatigue.


AI中文摘要

照顾阿尔茨海默病及相关痴呆症（AD/ADRD）患者的家庭成员构成了全球长期护理的基础。2023年，超过1100万美国亲友贡献了180亿小时的无偿护理，往往以牺牲自身身心健康为代价。这些非正式护理人员——也被称为“隐形第二患者”——经历着更高的心理健康问题发生率。然而，研究通常将其复杂的心理社会经历简化为单一的护理负担概念，掩盖了哪些具体需求未得到满足或得到有效支持。与此同时，数字和人工智能技术正在迅速扩展，从智能手机应用和视频会议到传感器平台和AI聊天机器人。然而，医学、心理学和技术研究之间缺乏共享框架，限制了累积进展。本研究引入了一个护理人员心理健康与技术分类法，系统地将AD/ADRD护理人员的需求与相应的技术干预类别联系起来。基于跨学科文献综述和两项针对护理人员的定性研究，该分类法识别了护理优先事项与现有技术支持之间的错配，突出了关系紧张和同情疲劳等未充分服务的领域，并提出了自适应、响应式系统的设计方向。该框架提供了一个共享词汇，以指导临床医生、研究人员和技术设计师在痴呆症护理中开发更以人为中心和临床基础的创新。

英文摘要

Family members caring for individuals with Alzheimer's disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family members and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers -- also referred to as the "invisible second patients" -- experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet or effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

2606.19286 2026-06-18 cs.HC cs.AI cs.CY cross-listed

Correct Yourself, Keep My Trust: How Self-Correction and Social Connection Shape Credibility in Social Chatbots

Biswadeep Sen, Yi-Chieh Lee

Affiliations: School of Computing, National University of Singapore, Singapore; Computer Science, National University of Singapore, Singapore; National University of Singapore

AI summary: An experiment comparing three error-correction strategies finds that self-correction does not damage a chatbot's credibility, and that the strength of the user's social connection significantly predicts belief change only under self-correction.

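AI abstract (translated from Chinese)

When social chatbots make mistakes, and they do, the way they recover determines whether users will trust them again. Social chatbots are increasingly part of everyday life, yet they remain prone to generating convincing but inaccurate information. The social connection they build with users makes such errors especially consequential. We conducted a between-subjects experiment (N=120) comparing three error-correction strategies: a webpage retraction, self-correction by the same social chatbot, and correction by an expert chatbot. Our results reveal two key findings. First, all three strategies corrected the error equally well, but only self-correction did not damage the chatbot's credibility: participants rated self-correcting chatbots significantly higher in trustworthiness and perceived expertise than chatbots whose errors were corrected by an external source. Second, the strength of the user's social connection with the chatbot, measured through social attraction and self-disclosure, significantly predicted the magnitude of belief change only when the chatbot corrected itself. Outsourcing the correction to an external source severed this link entirely. These findings suggest that social chatbots should correct their own mistakes rather than outsource correction, and that investing in social connection is a functional mechanism that strengthens correction effectiveness, not merely a design feature. We discuss implications for designing chatbots that maintain long-term credibility while effectively handling their own errors.

English abstract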

When social chatbots make mistakes, and they do, how they recover determines whether users trust them again. Social chatbots are increasingly integrated into everyday life, yet they remain prone to generating convincing but inaccurate information. The social connection they build with users makes such errors particularly consequential. We conducted a between-subjects experiment (N=120) comparing three error correction strategies: a webpage retraction, self-correction by the same social chatbot, and correction by an expert chatbot. Our results reveal two key findings. First, all three strategies corrected the error equally well, but only self-correction did so without damaging the chatbot's credibility: participants rated self-correcting chatbots significantly higher in both trustworthiness and perceived expertise than chatbots whose errors were corrected by external sources. Second, the strength of the user's social connection with the chatbot, measured through social attraction and self-disclosure, significantly predicted the magnitude of belief change, but only when the chatbot corrected itself. Outsourcing corrections to an external source severed this link entirely. These findings suggest that social chatbots should correct their own mistakes rather than outsource corrections, and that investing in social connection is a functional mechanism that amplifies correction effectiveness, not merely a design feature. We discuss implications for designing chatbots that maintain long-term credibility while effectively addressing their own errors.

2606.14824 2026-06-18 cs.AR cs.AI cs.LG cross-listed

Running hardware-aware neural architecture search on embedded devices under 512MB of RAM

Andrea Mattia Garavagno, Edoardo Ragusa, Paolo Gastaldo, Antonio Frisoli

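Affiliations: University of Bologna; Politecnico di Milano

AI summary: Proposes a hardware-aware neural architecture search method that runs directly on resource-constrained embedded devices and generates tiny CNNs for low-end MCUs, achieving state-of-the-art results on the Visual Wake Word dataset.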

DOI: 10.1109/ICCE59016.2024.10444268
Journal ref: 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 2024, pp. 1-2

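AI abstract (translated from Chinese)

This paper proposes a novel hardware-aware neural architecture search (HW NAS) method that takes into account the resources available on the computing platform running it, allowing it to execute on a variety of embedded devices. The proposed HW NAS generates tiny convolutional neural networks (CNNs) for low-end microcontroller units (MCUs), which are commonly used in the Internet of Things (IoT) or wearable robotics, opening up new application scenarios. A gateway can run it to tailor the CNN architecture to the acquired data without using an external server, thereby ensuring privacy. The proposed technique achieves state-of-the-art results on several embedded devices in human-recognition tasks on the Visual Wake Word dataset, a standard TinyML benchmark.

English abstract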

This document proposes a novel approach to hardware-aware neural architecture search (HW NAS) that considers the resources available on the computing platform running it, enabling its execution on various embedded devices. The presented HW NAS produces tiny convolutional neural networks (CNNs) targeting low-end microcontroller units (MCUs), typically used in the Internet of Things (IoT) or wearable robotics, opening new use cases. A gateway could run it to tailor the CNN architecture to the acquired data without using external servers, ensuring privacy. The proposed technique achieves state-of-the-art results on human-recognition tasks on the Visual Wake Word dataset, a standard TinyML benchmark, on several embedded devices.

2606.16290 2026-06-18 cs.LG cs.AI cross-listed

An affordable hardware-aware neural architecture search for deploying convolutional neural networks on ultra-low-power computing platforms

Andrea Mattia Garavagno, Edoardo Ragusa, Antonio Frisoli, Paolo Gastaldo

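Affiliations: University of Genoa; Scuola Superiore Sant'Anna

AI summary: Proposes a lightweight hardware-aware neural architecture search method that generates tiny CNNs able to run on ultra-low-power microcontrollers, lowering search cost while preserving classification accuracy.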

DOI: 10.1109/LSENS.2024.3387056
Journal ref: IEEE Sensors Letters, vol. 8, no. 5, pp. 1-4, May 2024

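AI abstract (translated from Chinese)

Hardware-aware neural architecture search (HW-NAS) makes it possible to integrate convolutional neural networks (CNNs) into microcontroller devices by automatically designing neural architectures that satisfy preset hardware constraints. However, state-of-the-art HW-NAS targets high-performance microcontrollers, whose power consumption cannot meet the requirements of sensing nodes. This paper proposes an HW-NAS method that generates tiny CNNs able to run on ultra-low-power microcontrollers, with a search procedure lightweight enough to execute even on embedded devices. Empirical results on three well-known tiny computer vision benchmarks show that the proposed HW-NAS can generate tiny CNNs while maintaining state-of-the-art classification accuracy.

English abstract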

Hardware-aware neural architecture search (HW-NAS) allows the integration of Convolutional Neural Networks (CNNs) in microcontroller devices by automatically designing neural architectures that can fit prearranged hardware constraints. However, state-of-the-art HW-NAS methods target high-performance microcontrollers, whose power consumption does not meet the requirements of sensing nodes. This work presents a HW-NAS generating tiny CNNs that can run on ultra-low-power microcontrollers, featuring a lightweight search procedure enabling its execution even on embedded devices. Empirical results on three well-known benchmarks for tiny computer vision proved that the proposed HW-NAS was able to generate tiny CNNs while preserving state-of-the-art classification accuracy.

2605.27729 2026-06-18 cs.CR cs.AI cs.ET quant-ph cross-listed

QSignAI: Quantum-Randomness-Seeded Identity Signatures at the Intersection of AI for Science and Science for AI

Dongping Liu, Aoyu Zhang, Luyao Zhang

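Affiliations: Amazon Web Services; Duke Kunshan University

AI summary: Presents the QSignAI platform, which generates quantum-randomness seeds from cloud quantum circuits to give social-platform users unique identity signatures, and uses an AI bot to make quantum phenomena perceptible to ordinary users.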

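AI abstract (translated from Chinese)

The 2024-2025 Nobel and Turing awards recognized artificial intelligence and quantum science at the same time: machine learning as physical science, AI solving a 50-year-old scientific problem, superconducting quantum circuits as the hardware basis of quantum computing, and quantum information principles as the highest achievement in computing. Yet no deployed AI system has combined the two to serve the public: identity systems still rely on pseudorandom tokens, and quantum circuits remain invisible to the billions of people who use bot-supported social messaging platforms every day. This paper introduces QSignAI, an open-source platform deployed to production that demonstrates the bidirectional relationship between AI and quantum science in a real-time event participation system. We address three research questions: first, whether quantum randomness can be generated by real quantum circuits and embedded in an AI-driven social platform with acceptable latency and cost; second, whether an AI bot can make quantum phenomena perceptually understandable to general audiences without a technical background; and third, whether a system combining these two directions is effective in practice. A conversational AI bot routes each participant's first message through a dual-circuit quantum pipeline on a cloud quantum simulator, generating a unique quantum-randomness-seeded identity signature for each participant. The first two questions are answered through system design and qualitative deployment evidence; measurable comparisons are identified as priority future work.

English abstract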

The 2024-2025 Nobel and Turing awards recognised AI and quantum science simultaneously. Yet no deployed system has brought these streams together for the public. This paper presents QSignAI, a production-deployed platform demonstrating a bidirectional AI-quantum relationship in a real-time event participation system. We address three questions: can quantum-randomness generation via a two-source extractor be embedded in an AI-driven social platform with acceptable latency; can an AI bot make quantum phenomena perceptually legible to general audiences; and does the combined system work in practice? A conversational bot routes each participant's first message through a quantum pipeline comprising a Toeplitz two-source extractor over independent single-qubit Hadamard measurements on SV1 and DM1 simulators, plus a 2-qubit Bell state, producing a unique quantum-randomness-seeded identity signature per participant. The first two questions are answered through system architecture and qualitative deployment evidence from live events; the third through successful production deployment. The current deployment uses cloud quantum simulators; physical QPU randomness is the near-term extension. Measurable benchmarks are identified as priority future work.

2606.18288 2026-06-18 econ.GN cs.AI econ.TH q-fin.EC cross-listed

A Knowledge Theory of Capital: The Value of Natural and Artificial Intelligence

Jeffrey Gardiner

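Affiliations: Morgan Stanley

AI summary: Proposes a knowledge theory of capital that treats knowledge as a core form of capital, analyzes its generation, conversion, governance, and measurement, distinguishes five forms of knowledge, and introduces new concepts to explain the sources of modern wealth.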

Comments 458 pages, 8 figures. Theory-building monograph developing a conditional framework for knowledge-bearing capitalism, with formal concepts, mechanisms, measurement apparatus, and falsification conditions

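AI abstract (translated from Chinese)

This volume develops a knowledge theory of capital for economies in which productive capacity increasingly resides in software, data, models, routines, expertise, platforms, organizations, commons, and public epistemic infrastructure. Starting from Adam Smith's theory of labor, capital, specialization, and the extent of the market, it examines what changes when knowledge becomes accumulable like capital, mobile across forms, scalable, governable, recombinable, and incompletely visible in accounting. The book takes knowledge-bearing capital as its central object and analyzes how it is generated, converted into governable form, deployed, improved through feedback, enclosed or shared, measured, impaired, and used as an input to future production. It distinguishes embodied, disembodied, institutionalized, commons, and public forms of knowledge, and develops concepts such as first conversion, cognitive enclosure, feedback capture, dark capital, and expected knowledge loss. The argument is conditional and testable: modern wealth depends not only on capital accumulation but also on how productive knowledge is governed.

English abstract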

This volume develops a knowledge theory of capital for economies in which productive capacity increasingly resides in software, data, models, routines, expertise, platforms, organizations, commons, and public epistemic infrastructure. Beginning from Adam Smith's theory of labour, stock, specialization, and market extent, it asks what changes when knowledge becomes stock-like, mobile across forms, scalable, governable, recombinable, and imperfectly visible in accounting. The book introduces knowledge-bearing stock as the central object and analyses how it is generated, converted into governable form, deployed, improved through feedback, enclosed or shared, measured, impaired, and used as input to future production. It distinguishes embodied, disembodied, institutionalized, commons, and public knowledge forms and develops concepts such as first conversion, cognitive enclosure, feedback capture, dark capital, and expected knowledge loss. The argument is conditional and testable: modern wealth depends not only on capital accumulation, but on how productive knowledge is governed.

2606.17102 2026-06-18 physics.pop-ph cs.AI cs.ET cs.HC quant-ph cross-listed

Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models

Aoyu Zhang, Dongping Liu, Luyao Zhang

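Affiliations: Amazon Web Services; Duke Kunshan University

AI summary: Presents Quantum Cinema, an open-source interactive application built on generative world models that turns invisible quantum hardware into an explorable cinematic experience through a four-act narrative, aiming to close the imagination gap between quantum computing and the public.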

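AI abstract (translated from Chinese)

Quantum computing promises transformative advances across science and industry, yet the physical hardware that performs these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators near absolute zero, making direct observation impossible. This "imagination gap" between the growing societal impact of quantum computing and the public's ability to visualize it is a major barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that bridges this gap by using generative world models to turn invisible quantum hardware into an explorable cinematic experience. Quantum Cinema guides users through a four-act narrative: from the Nobel Prize-winning foundational science of quantum entanglement, through curated videos introducing three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons based on real quantum device specifications. All three-dimensional environments are generated with WorldLabs' generative world model platform and are scientifically grounded in curated metrics for Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers who want to replicate or extend the platform, and educators, researchers, and science communicators who want an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

English abstract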

Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This "imagination gap" between quantum computing's growing societal impact and the public's ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative -- from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs' generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

1. Agents, Planning, and Decision-Making (7 papers)

Dynamic In-Group Persona Generation for Enhancing Human-AI Rapport

Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

As You Wish: Mission Planning with Formal Verification using LLMs in Precision Agriculture

Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents

2. Knowledge Representation, Reasoning, and Symbolic AI (1 paper)

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for EL⊥

3. Multi-Agent Systems and Games (7 papers)

Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies

Towards Multi-Agent-Simulation-Based Community Note Evaluation

TRIDENT: Breaking the Hybrid-Safety-Physics Coupling for Provably Safe Multi-Agent Reinforcement Learning

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams

A Technical Taxonomy of LLM Agent Communication Protocols

4. Search, Optimization, and Constraint Solving (1 paper)

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

5. Machine Learning and Representation Learning (30 papers)

Breaking the Solver Bottleneck: Training Task Generators at the Learnable Frontier

A Link between Shock-wave Theory and Symmetry-reduced Stochastic Gradient Descent for Artificial Neural Networks

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

DRIFT: Refining Instruction Data via On-Policy Data Attribution

Ghost Attractor Networks: Basin-Structured Dynamical Decoders for Closed-Loop Sequential Generation

Why SWAVE May Not Be All You Need: A Concept-Evolution Retrospective on Complex-Valued Recurrent Language Models

What Does the Weight Norm Control in Grokking? Logit-Scale Mediation under Cross-Entropy

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Neural Phase Correlation

Sparsity Curse: Understanding RLVR Model Parameter Space from Model Merging

Correcting Sensor-Induced Distribution Drift with Wasserstein Adversarial Learning

Dual Dimensionality for Local and Global Attention

Bounded Context Management for Tabular Foundation Models on Stream Learning

Dual-Channel Grounded World Modeling (DCGWM): Structural Prevention of Objective Interference Collapse via Heterogeneous External Grounding with Inward-Only Gradient Flow

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

Private Learning with Public Feature Conditioning

Bayesian Anytime Pareto Set Identification for Multi-Objective Multi-Armed Bandits

Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Reinforcement Learning Foundation Models Should Already Be A Thing

Maturing Markov Decision Processes: Decision Making under Increasing Information and Shrinking Action Sets

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

Pareto Q-Learning with Reward Machines

OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems

Essential Subspace Merging for Multi-Task Learning

Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

Forecasting what Matters: Decision-Focused RL for Controlled EV Charging with Unknown Departure Times

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Explaining Attention with Program Synthesis

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

6. Natural Language and Multimodal Intelligence (18 papers)

Continuous Audio Thinking for Large Audio Language Models

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

A Variational Framework for LLM Generator-Regulator Games

MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data

APT: Atomic Physical Transitions for Causal Video-Language Understanding

BCL: Bayesian In-Context Learning Framework for Information Extraction

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

Rescaling MLM-Head for Neural Sparse Retrieval

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering

ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

7. Robotics and Embodied Intelligence (10 papers)

Guava: An Effective and Universal Harness for Embodied Manipulation

CAOA -- Completion-Assisted Object-CAD Alignment

EffiNav: Fusing Depth and Vision-Language for Efficient Object Goal Navigation

NeuralMUSIC: A Hybrid Neural-Subspace Framework for Robot Sound Source Localization

Leveraging Energy Features for Surface Classification with Deep Learning: A Comparative Analysis Across Three Independent Datasets

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

URDF Synthesis from RGB-D Sequences via Differentiable Joint Inference and Energy-Consistent Verification