arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.15206 2026-05-18 cs.LG cs.AI cs.DC

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi

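AI summary As autonomous agents built on large language models are applied to more complex tasks, deploying them locally improves privacy and lowers cost, but consumes far more resources than ordinary language-model interactions. This paper studies the energy cost of running agents locally on consumer hardware and proposes AgentStop, a lightweight supervision mechanism that predicts the likelihood of task failure and terminates unproductive runs early. It reduces wasted energy by 15-20% with only a small effect on task performance, offering a practical route to sustainable local agent systems.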

Comments ACM CAIS '26

Abstract

Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave-experiments/AgentStop.
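
As a rough illustration of predictive early termination, the sketch below scores each agent step by its mean token log-probability and stops after several consecutive low-confidence steps. The function name, the threshold, and the patience rule are assumptions made for illustration. The abstract does not describe AgentStop's actual predictor beyond its use of low-cost signals such as token-level log probabilities.

```python
from typing import List, Optional


def should_stop(step_logprobs: List[List[float]],
                threshold: float = -1.5,
                patience: int = 2) -> Optional[int]:
    """Return the index of the step at which to terminate, or None.

    step_logprobs[i] holds the token log-probabilities emitted in step i.
    Hypothetical rule (not from the paper): stop after `patience`
    consecutive steps whose mean log-probability is below `threshold`.
    """
    low_streak = 0
    for i, logprobs in enumerate(step_logprobs):
        if not logprobs:
            continue  # ignore steps that generated no tokens
        mean_lp = sum(logprobs) / len(logprobs)
        low_streak = low_streak + 1 if mean_lp < threshold else 0
        if low_streak >= patience:
            return i
    return None
```

A learned classifier over richer execution signals could replace the fixed threshold; the point here is only that the check is cheap compared with another round of generation.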

2605.15205 2026-05-18 cs.AI

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

Nanxu Gong, Zixin Chen, Haotian Li, Zishu Zhao, Jianxun Lian, Huamin Qu, Yanjie Fu, Xing Xie

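AI summary This study asks whether improving the Theory of Mind (ToM) capability of large language models (LLMs) actually improves human-AI interaction. It notes that existing benchmarks mostly assess ToM from a third-person perspective through story reading and multiple-choice questions, overlooking the first-person, dynamic, and open-ended nature of real interaction. The authors propose a new interactive ToM evaluation paradigm and systematically evaluate four representative ToM enhancement techniques on real-world datasets and in a user study. They find that gains on static benchmarks do not necessarily carry over to dynamic human-AI interaction, which underscores the importance of interaction-based evaluation for developing the next generation of socially intelligent models.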

Abstract

Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, the existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first proposed the new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following the paradigm, we conducted a systematic study of four representative ToM enhancement techniques using both four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.

2605.15204 2026-05-18 cs.AI

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Zhantao Wang

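AI summary This paper proposes SDOF, a multi-agent orchestration framework that addresses the lack of stage constraints in task dispatch in existing systems. The framework treats multi-agent execution as a constrained state machine and combines reinforcement learning with finite-state automaton checks to control task flow and verify compliance. On a recruitment system, SDOF reaches 86.5% task completion and blocks all 22 operations in the injection and illegal-HR subset, and its 7B intent router outperforms zero-shot GPT-4o on the routing benchmark (80.9% versus 48.9%).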

Comments 12 pages, 4 figures, 14 tables

Abstract

Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.
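
The idea of treating execution as a constrained state machine can be sketched as follows. This is a minimal illustration only: the stage names, the `ALLOWED` transition table, and the `StageDispatcher` class are hypothetical and do not reproduce SDOF's StateAwareDispatcher, GoalStage checks, or SkillRegistry.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical recruitment stages and the transitions allowed between them.
ALLOWED: Dict[str, Set[str]] = {
    "screening": {"interview"},
    "interview": {"offer", "rejected"},
    "offer": {"hired", "rejected"},
    "hired": set(),
    "rejected": set(),
}


class StageDispatcher:
    """Blocks any requested transition that violates the stage order."""

    def __init__(self, start: str = "screening") -> None:
        self.stage = start
        # Each entry records (current stage, requested stage, allowed?).
        self.audit_log: List[Tuple[str, str, bool]] = []

    def dispatch(self, requested_stage: str) -> bool:
        allowed = requested_stage in ALLOWED.get(self.stage, set())
        self.audit_log.append((self.stage, requested_stage, allowed))
        if allowed:
            self.stage = requested_stage  # advance only on a legal move
        return allowed
```

Logging blocked requests as well as accepted ones is what makes this kind of control auditable: the log shows every attempt to skip a stage.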

2605.15202 2026-05-18 cs.AI cs.CL cs.IR

DeepSlide: From Artifacts to Presentation Delivery

Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

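AI summary DeepSlide is a human-in-the-loop multi-agent system that supports end-to-end presentation preparation, aiming to optimize the whole process from content planning to delivery rather than merely generating visually plausible slides. It combines controllable logical-chain planning, content-tree retrieval, sequential rendering with style inheritance, and executable rehearsal support to improve narrative coherence, pacing precision, and slide-script synergy. The paper also introduces a dual-scoreboard benchmark that separates static content quality from dynamic delivery performance. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality and achieves larger gains on delivery metrics.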

Comments 37 pages,10 figures,9 tables

Abstract

Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide-script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide-script synergy with clearer attention guidance.

2605.15093 2026-05-18 cs.CV

CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites

Jess Jones, Leonardo Bertini, Kenneth Johnson, Erica Hendy, Tilo Burghardt

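AI summary This study proposes CoralLite, a method for reconstructing the skeletal structure of individual corallites from micro-CT scans of coral skeletons. Using a hybrid V-Trans-UNet that is pre-trained on weakly annotated data and fine-tuned on fully annotated slices, it segments and builds 3D models of entire colony skeletons. The model reaches mean Dice scores of 0.77 on unseen slices of the same colony and 0.63 on a biologically unrelated specimen, providing the first deep learning baseline and a complete dataset for micro-CT-based modelling of individual corallites.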

Comments 15 pages, 10 figures, 2 tables

Abstract

The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive $\textit{Porites}$ sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated $μ$CT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled $μ$CT virtual slabs of $\textit{Porites}$ sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from $μ$CT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 $μ$CT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
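
For readers unfamiliar with the Dice scores quoted above, the following is a minimal sketch of the standard Dice coefficient on two flat binary masks. It is not the paper's evaluation code; the function name and the convention for two empty masks are assumptions.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient 2|A and B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```

A score of 1.0 means the predicted and annotated regions coincide exactly, while 0.0 means they do not overlap at all.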

2605.15053 2026-05-18 cs.LG cs.AI

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Anurup Ganguli

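AI summary This paper proposes TFGN, a new architecture that enables continual pre-training of large language models without catastrophic forgetting, replay data, or task identifiers. By overlaying a parameter-efficient, input-conditioned update module on a transformer, it achieves forward and backward transfer across heterogeneous text domains and reports strong results across three model scales and six domains. The study further introduces a closed-loop meta-controller and an operator-level plan vector, which improve the model's autonomous learning and cross-domain adaptation, offering a new architectural approach to continual learning for large language models.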

Comments 65 pages, 10 figures, 40 tables

Abstract

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

2605.15010 2026-05-18 cs.CV

3D Skew-Normal Splatting

Xiangru Wu, Ke Fan, Yanwei Fu

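AI summary This paper proposes Skew-Normal Splatting (SNS), a method that improves the representational power of 3D Gaussian Splatting (3DGS) for real-time novel view synthesis. By adopting the Azzalini skew-normal distribution as the basic primitive, SNS can model both symmetric and asymmetric structures, with stronger representational capacity for object boundaries and one-sided surfaces in particular. SNS also remains analytically tractable and improves training stability through a decoupled parameterization and a block-wise optimization strategy. Experiments show that it outperforms Gaussian and other non-Gaussian kernels on several benchmarks.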

Abstract

3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and has been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions; however, they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moreover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization pipelines. Furthermore, to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.
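
To make the role of the skewness parameter concrete, here is a sketch of the standard univariate Azzalini skew-normal density, which multiplies a Gaussian density by a Gaussian CDF term. It is a one-dimensional illustration only; the paper's 3D primitive, its bounded skewness, and its parameterization are not reproduced, and the function name is an assumption.

```python
import math


def skew_normal_pdf(x: float, alpha: float,
                    loc: float = 0.0, scale: float = 1.0) -> float:
    """Univariate skew-normal density: (2 / scale) * phi(z) * Phi(alpha * z)."""
    z = (x - loc) / scale
    # Standard normal density at z.
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    # Standard normal CDF evaluated at alpha * z.
    cdf = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 * phi * cdf / scale
```

With `alpha = 0` the CDF term equals 0.5 and the density reduces to an ordinary Gaussian. As `alpha` grows, mass on one side of `loc` is suppressed and the shape approaches a half-Gaussian, which is the interpolation the abstract describes.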

2605.14978 2026-05-18 cs.CL

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

Jie Jiang, Xing Sun, Ruotian Chen, Jianan Su, Kaixin Shen

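AI summary This paper studies how performance-driven policy optimization can improve the efficiency of speculative decoding and proposes PPOW, a reinforcement learning framework that shifts draft-model optimization from conventional token-level imitation to window-level optimization. PPOW combines a cost-aware speedup reward, a distribution-based proximity reward, and an adaptive divergence-aware windowing mechanism that prioritizes windows with high confidence-weighted draft-target divergence. Across multiple model families and benchmarks, PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36x.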

Abstract

Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36$\times$ across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
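
The prefix sensitivity described above can be seen in a simplified greedy verification step, sketched below. This is an assumption-laden illustration: practical speculative decoding usually accepts or rejects draft tokens probabilistically, whereas this sketch uses exact token matching, and the function name is hypothetical.

```python
from typing import List


def accepted_prefix_length(draft: List[int], target: List[int]) -> int:
    """Count the leading draft tokens that match the target model's tokens.

    Simplified greedy verification: the first mismatch ends acceptance,
    so every later draft token in the window is discarded.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted
```

In the window `[5, 7, 9, 2]` verified against `[5, 7, 1, 2]`, only two tokens are accepted even though the fourth token also matches, which is why a single hard-to-draft position can dominate the achievable speedup.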

2605.14892 2026-05-18 cs.AI

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

Shihao Qi, Jie Ma, Rui Xing, Wei Guo, Xiao Huang, Zhitao Gao, Jianhao Deng, Jun Liu, Lingling Zhang, Bifan Wei, Boqian Yang, Pinghui Wang, Jianwen Sun, Jing Tao, Yaqiang Wu, Hui Liu, Yu Yao, Tongliang Liu

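AI summary This survey reviews research on collaboration, failure attribution, and self-evolution in LLM-based multi-agent systems. It notes that existing work tends to address individual agent capabilities, collaboration mechanisms, or self-evolution separately while neglecting the causal relationships among them. It proposes a unified framework, the LIFE progression, with four stages: laying the capability foundation, integrating agents through collaboration, finding faults through attribution, and evolving autonomously. The survey analyzes the dependencies between stages and proposes cross-stage research directions, aiming to advance self-organizing multi-agent systems capable of continuous diagnosis, structural reorganization, and behavioral refinement.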

Abstract

LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.

2605.14884 2026-05-18 cs.LG

AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks

Magdalena Proszewska, N. Siddharth

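AI summary Graph neural networks (GNNs) have made significant progress in handling graph-structured data, but a comprehensive framework for evaluating explainability is still lacking. This paper proposes AIM, a framework that evaluates explainability along three dimensions (accuracy, instance-level explanations, and model-level explanations) and is designed to be flexible and broadly applicable. Applying AIM to inherently interpretable GNNs such as graph kernel networks (GKNs), the study identifies limitations in their explanations and uses those findings to develop xGKN, which improves explainability while maintaining high accuracy.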

Comments 19 pages, 4 figures, 8 tables

Journal ref
Transactions on Machine Learning Research (TMLR). ISSN 2835-8856 (2026)
Abstract

Graph Neural Networks (GNNs) have advanced significantly in handling graph-structured data, but a comprehensive framework for evaluating explainability remains lacking. Existing evaluation frameworks primarily involve post-hoc explanations, and operate in the setting where multiple methods generate a suite of explanations for a single model. This makes comparison of explanations across models difficult. Evaluation of inherently interpretable models often targets a specific aspect of interpretability relevant to the model, but remains underdeveloped in terms of generating insight across a suite of measures. We introduce AIM, a comprehensive framework that addresses these limitations by measuring Accuracy, Instance-level explanations, and Model-level explanations. AIM is formulated with minimal constraints to enhance flexibility and facilitate broad applicability. Here, we use AIM in a pipeline, extracting explanations from inherently interpretable GNNs such as graph kernel networks (GKNs) and prototype networks (PNs), evaluating these explanations with AIM, identifying their limitations and obtaining insights to their characteristics. Taking GKNs as a case study, we show how the insights obtained from AIM can be used to develop an updated model, xGKN, that maintains high accuracy while demonstrating improved explainability. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, providing more robust and practical solutions for understanding and improving complex models.

2605.14876 2026-05-18 cs.CV cs.AI

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

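AI summary Although text-to-image generation models have advanced rapidly, most rely on a single-step generation paradigm that struggles with complex semantics and gains little from parameter scaling. To address the hallucination, optimization instability, and inference latency of multi-step reasoning approaches, this paper proposes CLVR, a closed-loop visual reasoning framework that tightly couples visual-language logical planning with pixel-level diffusion generation and introduces Proxy Prompt Reinforcement Learning and Δ-Space Weight Merge. Experiments show that it outperforms existing open-source models on several benchmarks and approaches the performance of commercial models.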

Abstract

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

2605.14665 2026-05-18 cs.AI cs.CL cs.IR

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Joy Bose

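AI summary This paper proposes Falkor-IRAC, a graph-constrained generation framework intended to improve the accuracy and reliability of legal reasoning in Indian judicial AI systems. Built on an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph, it structures judgments of the Supreme Court and High Courts of India as graph nodes and integrates procedural state transitions, precedent relationships, and statutory references. At inference time the system accepts only generated answers that can be verified against the graph structure, which is intended to reduce fabricated citations and incomplete reasoning chains. It also detects conflicts between legal doctrines explicitly.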

Comments 20 pages, 8 figures, 4 tables

Abstract

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work. The companion InIRAC dataset, 500+ structured Indian court judgments with IRAC annotations, is released alongside this paper.
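
The acceptance rule described above, where an answer is accepted only if a supporting path can be traced through the graph, can be sketched with a plain reachability check. This is a hedged illustration: the adjacency-list dictionary stands in for the FalkorDB graph, the node names in the usage note are hypothetical, and the actual Verifier Agent presumably checks edge types and procedural state, which this sketch ignores.

```python
from collections import deque
from typing import Dict, List


def has_supporting_path(graph: Dict[str, List[str]],
                        source: str, target: str) -> bool:
    """Breadth-first search: True if `target` is reachable from `source`."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

For example, with `graph = {"issue_1": ["rule_1"], "rule_1": ["precedent_1"]}`, a citation of `precedent_1` for `issue_1` is supported, while a citation of a node that does not appear in the graph is rejected.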

2605.14401 2026-05-18 cs.CL cs.AI

Agentic Recommender System with Hierarchical Belief-State Memory

Xiang Shen, Yuhang Zhou, Yifan Wu, Zhuokai Zhao, Siyu Lin, Lei Huang, Qianqian Zhong, Lizhu Zhang, Benyu Zhang, Xiangjun Fan, Hong Yan

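AI summary This paper proposes MARS, a memory-augmented agentic recommender system that models recommendation as a partially observable problem and uses a hierarchical belief-state memory to capture users' evolving preferences more accurately. MARS divides memory into three tiers (event memory, preference memory, and profile memory) and introduces a complete lifecycle of six operations (extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis), scheduled dynamically by an LLM-based planner. On four InstructRec benchmark domains, MARS improves on the strongest baselines by an average of 26.4% in HR@1 and 10.3% in NDCG@10.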

Comments 4 figures, 8 tables

Abstract

Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations -- extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis -- is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.

2605.14354 2026-05-18 cs.CL

LLM-based Detection of Manipulative Political Narratives

Sinclair Schneider, Florian Steuber, Gabi Dreo Rodosek

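AI summary This paper proposes an LLM-based computational framework for detecting and structuring manipulative political narratives. It first filters for manipulative posts using a few-shot prompt that contrasts documented campaign narratives with legitimate criticism, then embeds the posts, reduces dimensionality with UMAP, and clusters them with HDBSCAN to uncover narrative groups. Because the approach needs no predefined target categories, it can surface new narratives. Applied to more than 1.2 million social media posts, it identified 41 manipulative narrative clusters.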

Comments This paper has been submitted to the upcoming 18th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2026)

Abstract

We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important as political discussion has shifted to social media. One of the primary challenges is differentiating between manipulative political narratives and legitimate critiques. Some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we filter manipulative posts beforehand using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. This approach, applied to over 1.2 million social media posts, effectively identified 41 distinct manipulative narrative clusters by integrating prompt-based filtering with unsupervised clustering.

2605.14311 2026-05-18 cs.LG cs.AI cs.HC

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Yuchen Sun, Pei Fu, Shaojie Zhang, Anan Du, Xiuwen Xi, Ruoceng Zhang, Zhenbo Luo, Jian Luan, Chongyang Zhang

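AI summary This paper examines a key problem in Test-Time Scaling (TTS) for generalist GUI agents: existing critic models rely on binary classification and therefore cannot separate valid actions from plausible-but-invalid ones. The authors propose BBCritic, a continuous semantic alignment approach that uses two-stage contrastive learning to recover the hierarchical structure flattened by binary supervision, and introduce BBBench, the first fine-grained evaluation benchmark for GUI critics. Trained without extra annotation, BBCritic-3B outperforms 7B-parameter binary models and shows strong zero-shot transfer across platforms and tasks.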

Comments 28 pages including appendix. Code and BBBench benchmark to be released

Abstract

Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse--the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity--binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.

2605.14309 2026-05-18 cs.CV cs.AI cs.LG

ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

Shen Lin, Jing Lin, Junhao Dong, Piotr Koniusz, Li Xu

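AI summary This paper proposes ICED, a concept-level machine unlearning method for vision-language models (VLMs) based on interpretable concept decomposition. It addresses the difficulty that image-level or instance-level unlearning has in precisely removing target knowledge without affecting unrelated semantics. The method uses a multimodal large language model to build a task-specific concept vocabulary and decomposes visual representations into sparse, nonnegative combinations of semantic concepts, so that target concepts in an image can be suppressed precisely while non-target semantics and cross-modal knowledge are preserved. Experiments show that it forgets target knowledge more comprehensively and better preserves non-target information in the same image while maintaining competitive model utility.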

Abstract

Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.

2605.14205 2026-05-18 cs.AI

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ted Chaiwachirasak, Han Li, Lingyun Wang

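AI summary This paper proposes SimPersona, a framework that addresses the inability of LLM-based e-commerce agents to capture the heterogeneity and distribution of real buyer populations. It learns discrete buyer types from historical clickstreams and converts them into compact persona tokens that guide the agent's behavior. Across 42 held-out live storefronts, SimPersona achieves 78% conversion-rate alignment with real buyers and outperforms a baseline with 8x more parameters on goal-oriented shopping tasks.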

Details
English abstract

LLM-based web agents can navigate live storefronts, yet they often collapse to a single "average buyer" policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant's empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on $8.37$M buyers across $42$ held-out live storefronts, SimPersona achieves $78\%$ conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with $8\times$ more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.
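The two inference-time steps described above (assigning a buyer to a learned type, and sampling a population from a merchant's empirical type distribution) can be sketched as follows. This is an illustrative approximation only: the nearest-neighbor lookup is the standard VQ-VAE quantization step, while the toy codebook, the counts, and the `<persona_k>` token naming are assumptions.

```python
import numpy as np

def assign_buyer_type(z, codebook):
    """Return the index of the codebook vector nearest to encoder output z."""
    dists = np.sum((codebook - z) ** 2, axis=1)
    return int(np.argmin(dists))

def sample_population(type_counts, n_agents, rng):
    """Sample buyer-type indices from a merchant's empirical type distribution."""
    probs = np.asarray(type_counts, dtype=float)
    probs /= probs.sum()
    return rng.choice(len(probs), size=n_agents, p=probs)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])   # toy 3-entry codebook
buyer_type = assign_buyer_type(np.array([0.9, 1.2]), codebook)
persona_token = f"<persona_{buyer_type}>"                    # hypothetical token name

rng = np.random.default_rng(0)
population = sample_population([10, 30, 60], n_agents=1000, rng=rng)
```

Because assignment is a single lookup against a fixed codebook, adding a new merchant only requires its type counts, which matches the "no retraining" claim in spirit.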

2605.14087 2026-05-18 cs.CL cs.LG

Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

Mokshit Surana, Archit Rathod, Akshaj Satishkumar

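AI summary This study systematically evaluates the generation and mitigation of toxic content in large language models, focusing on how well the inference-time technique DExperts reduces harmful outputs. Across three experimental phases, it finds that DExperts performs very well on explicit toxicity, reaching a 100% safety rate, but drops to 98.5% on implicit hate speech and incurs substantial latency overhead. The study exposes the performance gap between explicit and implicit toxicity mitigation and provides empirical reference points for AI safety.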

Details
English abstract

Large Language Models (LLMs) trained on web-scale corpora inherently absorb toxic patterns from their training data. This leads to toxic degeneration, where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments, necessitating effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; (2) implementing and evaluating DExperts to mitigate explicit toxicity; and (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off: the method introduces a 10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate-speech patterns without incurring prohibitive computational costs.
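For readers unfamiliar with decoding-time steering, the sketch below shows the logit combination commonly associated with DExperts: the base model's next-token logits are shifted by the scaled difference between a non-toxic "expert" and a toxic "anti-expert". The function names, the toy three-token vocabulary, and the value of `alpha` are assumptions; this is not the replication study's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dexperts_next_token_probs(base_logits, expert_logits, antiexpert_logits, alpha=1.0):
    """Combine logits as base + alpha * (expert - anti-expert), then normalize."""
    combined = base_logits + alpha * (expert_logits - antiexpert_logits)
    return softmax(combined)

# Toy vocabulary of 3 tokens; token 0 plays the role of a toxic continuation.
base = np.array([2.0, 2.0, 0.5])
expert = np.array([1.0, 3.0, 0.5])       # model tuned on non-toxic text
anti = np.array([3.0, 1.0, 0.5])         # model tuned on toxic text
steered = dexperts_next_token_probs(base, expert, anti, alpha=1.0)
unsteered = softmax(base)
```

In this formulation each decoding step needs logits from three models instead of one, which is one plausible contributor to the latency overhead the abstract reports.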

2605.14057 2026-05-18 cs.CL

Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

Xubo Lin, Zezhi Deng, Shihao Wang, Grace Hui Yang, Yang Deng

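AI summary This paper proposes a dual hierarchical reinforcement learning framework for legal inquisitive conversational agents, addressing the fact that conventional dialogue systems only respond passively to user requests. Two cooperating reinforcement learning agents handle strategy-level dialogue management and fine-grained utterance generation respectively, enabling the agent to proactively ask questions to obtain key information and emulate judicial questioning patterns. Experiments show that the method outperforms multiple baselines on a U.S. Supreme Court dataset, marking important progress toward high-stakes, domain-specific dialogue applications.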

Comments Accepted in ACL 2026 as Findings

Details
English abstract

Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce Inquisitive Conversational Agents (ICAs) and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.

2605.13925 2026-05-18 cs.RO

Towards Robotic Dexterous Hand Intelligence: A Survey

Weiguang Zhao, Tian Liang, Xihao Guo, Rui Zhang, Irwin King, Kaizhu Huang

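AI summary This paper surveys research progress on dexterous robotic hands, systematically analyzing the state of the art and open challenges in hardware design, control and learning methods, and datasets and evaluation. From four complementary perspectives, it reviews the key trade-offs in actuation, perception, and control strategies, and summarizes the main limitations of current research and future directions, aiming to provide a structured understanding of the field and guidance for further study.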

Details
English abstract

Robotic dexterous hands are central to contact-rich manipulation, with rapid progress driven by advances in hardware, sensing, control, simulation, and data generation. However, existing studies are often developed under different assumptions regarding hand embodiments, sensory configurations, task settings, training data, and evaluation protocols, making systematic comparison difficult and obscuring the developmental trajectory of the field. This survey provides a holistic review of dexterous hand research from four complementary aspects. First, we present a hardware-level analysis covering actuation, transmission, perception, and representative hand designs, highlighting the key trade-offs in force capability, compliance, bandwidth, integration, and system complexity. Furthermore, we review control and learning methods for dexterous manipulation from a methodological perspective, grouping representative works by major paradigms and tracing their evolution in chronological order. In addition, we consolidate datasets, modality design, and evaluation practices, which enables methodological progress to be interpreted together with the ways in which it is trained, benchmarked, and assessed. Finally, we discuss the major limitations of current dexterous hand research and summarize the corresponding future directions. By connecting hardware analysis, methodological development, data resources, and evaluation, this survey aims to provide a structured understanding of dexterous hand research and to clarify the most important open challenges for future study.

2605.13142 2026-05-18 cs.AI math.OC

A Constraint Programming Approach for n-Day Lookahead Playoff Clinching in the NHL

Gili Rosenberg, Kyle E. C. Booth, J. Kyle Brubaker, Ruben S. Andrist

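AI summary This paper studies how to determine whether a National Hockey League (NHL) team can clinch a playoff spot within the next $n$ days. To handle the complex qualification rules and tie-breaking procedures, the authors propose a tree search algorithm with a constraint programming subroutine that efficiently analyzes all possible combinations of game outcomes over the next $n$ days and decides whether the team is guaranteed a playoff berth. The method combines preprocessing, pruning strategies, and node ordering heuristics to improve search efficiency, is validated on a large amount of real season data, and extends readily to other sports metrics of interest.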

Comments 18 pages, 5 figures, 4 tables. Accepted to CP 2026

Details
English abstract

In professional sports, a team has clinched the playoffs if they are guaranteed a postseason spot, regardless of the outcomes of any remaining games. As the season progresses, sports fans and other stakeholders are interested in precisely when, and under what conditions, their team will clinch the playoffs. In this paper, we investigate playoff clinching in the context of the National Hockey League (NHL), where it is computationally challenging to produce clinching scenarios due, in part, to complex tie-breakers. We present an algorithm that determines under which combinations of game outcomes in the next $n$ days a team will clinch the playoffs (i.e., "$n$-day lookahead clinching"). Our approach is a custom tree search which employs various preprocessing techniques, pruning strategies, and node ordering heuristics to efficiently explore the space of possible outcomes. The tree search leverages a constraint programming (CP)-based subroutine for inference that determines if a team has clinched the playoffs for some snapshot in time of the regular season (i.e., "0-day lookahead clinching"). This CP subroutine aims to find a counter-example in which the team being evaluated is eliminated, taking into account qualification rules and the NHL's extensive list of tie-breakers. We validate the efficacy of our algorithm using hundreds of scenarios based on public NHL data for the seasons 2021-22 through 2024-25. The methods introduced can be readily extended to other metrics of interest, including mathematical proof of playoff elimination, clinching the President's Trophy, as well as clinching (or being eliminated from clinching) any other seed in the standings.
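The structure of $n$-day lookahead (enumerate outcomes of upcoming games, then run a 0-day check on each resulting snapshot) can be sketched with a brute-force loop. This is a heavily simplified illustration, not the paper's algorithm: it replaces the CP subroutine with a conservative points-only test, ignores overtime-loss points and all tie-breakers, and uses no pruning. Team names, point totals, and the function names are hypothetical.

```python
from itertools import product

def clinched_now(team, points, games_left, spots):
    """Conservative 0-day check: True if fewer than `spots` rivals can still
    reach or pass `team`'s current points (ties counted against `team`)."""
    threats = sum(
        1 for t in points
        if t != team and points[t] + 2 * games_left[t] >= points[team]
    )
    return threats < spots

def lookahead_clinch(team, points, games_left, spots, upcoming):
    """Return the winner tuples for `upcoming` games [(home, away), ...]
    after which `team` has clinched under the simplified check."""
    scenarios = []
    for winners in product(*upcoming):          # one winner chosen per game
        pts, left = dict(points), dict(games_left)
        for (home, away), winner in zip(upcoming, winners):
            pts[winner] += 2                    # 2 points for a win, 0 for a loss
            left[home] -= 1
            left[away] -= 1
        if clinched_now(team, pts, left, spots):
            scenarios.append(winners)
    return scenarios

points = {"A": 90, "B": 95, "C": 88, "D": 70}
games_left = {"A": 1, "B": 1, "C": 1, "D": 1}
already = clinched_now("A", points, games_left, spots=2)
scenarios = lookahead_clinch("A", points, games_left, spots=2, upcoming=[("A", "C")])
```

The points-only check is sufficient but not necessary, since rivals that play each other cannot all reach their maximum; finding an actual counter-example schedule is the job the abstract assigns to the CP subroutine.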

2605.13073 2026-05-18 cs.CV

HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization

Yulei Kang, Tianze Zhu, Jian-Fang Hu, Jianhuang Lai, Wei-Shi Zheng

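AI summary This paper addresses transient distractors and illumination-induced cross-view appearance inconsistencies in in-the-wild 3D Gaussian Splatting (3DGS) reconstruction, and proposes a conflict-aware optimization framework. Using semantic consistency-guided mask generation and a dual-view gradient harmonization strategy, the method suppresses unreliable supervision and mitigates cross-view gradient conflicts, improving reconstruction quality and stability. Experiments show state-of-the-art rendering quality in complex real-world scenes.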

Details
English abstract

In-the-wild 3D Gaussian Splatting remains challenging due to transient distractors and illumination-induced cross-view appearance inconsistencies. Existing methods mainly rely on image-level masking to suppress unreliable supervision, but masking alone cannot fully eliminate residual occlusions or resolve illumination-induced inconsistencies, both of which can introduce conflicting cross-view gradients. These unresolved conflicts may destabilize Gaussian optimization and lead to visible reconstruction artifacts. We propose a conflict-aware 3DGS framework that addresses this problem from both image-space supervision and gradient-level optimization. Semantic Consistency-Guided Masking learns pixel-wise consistency scores to adaptively refine prior masks and suppress unreliable supervision before gradient formation. A dual-view Conflict-Aware Gradient Harmonization strategy further reconciles view-specific gradients by mutually rotating them into an orthogonal configuration, reducing negative directional interference across views. We also introduce conflict-aware densification and pruning to stabilize Gaussian growth and remove persistently conflicting primitives. Extensive experiments on standard in-the-wild benchmarks demonstrate that our method achieves state-of-the-art rendering quality under complex transient distractors and cross-view inconsistencies.
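One way to read "mutually rotating them into an orthogonal configuration" is a symmetric rotation of two conflicting gradients within their shared plane until the angle between them is 90 degrees. The sketch below implements that reading; it is an assumption about the geometry, not the authors' formulation, and the name `harmonize` is hypothetical.

```python
import numpy as np

def harmonize(g1, g2, eps=1e-12):
    """If g1 and g2 conflict (negative dot product), rotate both by the same
    angle toward each other in their common plane until they are orthogonal.
    Norms are preserved; non-conflicting gradients are returned unchanged."""
    dot = float(g1 @ g2)
    if dot >= 0.0:
        return g1, g2
    n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
    u1 = g1 / n1                                  # first in-plane basis vector
    e2 = g2 - (g2 @ u1) * u1                      # part of g2 orthogonal to g1
    e2_norm = np.linalg.norm(e2)
    if e2_norm < eps:                             # exactly opposite: plane undefined
        return g1, g2
    e2 = e2 / e2_norm                             # second in-plane basis vector
    theta = np.arccos(np.clip(dot / (n1 * n2), -1.0, 1.0))
    delta = 0.5 * (theta - np.pi / 2)             # each gradient rotates by delta
    h1 = n1 * (np.cos(delta) * u1 + np.sin(delta) * e2)
    h2 = n2 * (np.cos(theta - delta) * u1 + np.sin(theta - delta) * e2)
    return h1, h2

g1 = np.array([1.0, 0.0, 0.0])
g2 = np.array([-1.0, 1.0, 0.0])                   # 135 degrees away from g1
h1, h2 = harmonize(g1, g2)
```

Unlike projection-based schemes such as PCGrad, this variant keeps both gradient magnitudes and changes only their directions.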

2605.12667 2026-05-18 cs.LG cs.AI

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

Nirmal Patel, Fei Wang, Inderjit S. Dhillon

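AI summary This work targets the discrete reward noise faced by Reinforcement Learning from AI Feedback (RLAIF) in large language model alignment, and proposes ODRPO, a robust policy optimization framework. Its core idea is to decompose multi-tier discrete rewards into a sequence of binary ordinal indicators, structurally isolating evaluation noise, and to compute advantages independently at progressively harder success thresholds, improving learning stability and robustness. Experiments show that ODRPO clearly outperforms existing methods on multiple benchmarks with almost no added training-time overhead.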

Details
English abstract

The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify that the stochasticity of auto-raters can propagate to and corrupt standard advantage estimators like GRPO and MaxRL, as noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking majority voting may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce $\textbf{O}$rdinal $\textbf{D}$ecomposition for $\textbf{R}$obust $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{ODRPO}$), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.
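The ordinal decomposition itself is easy to illustrate: a reward $r \in \{1,\dots,K\}$ becomes the indicators $\mathbb{1}[r \ge k]$ for $k = 2,\dots,K$, and an advantage is computed per threshold before accumulation. The sketch below is an assumption-laden illustration, not ODRPO's estimator: the per-threshold advantage is plain mean-centering, and `weights` is a hypothetical hook for whatever per-threshold scaling a real method would apply.

```python
import numpy as np

def ordinal_indicators(rewards, num_levels):
    """Binary matrix of shape (K-1, N): row for threshold k holds 1[r >= k]."""
    r = np.asarray(rewards)
    thresholds = np.arange(2, num_levels + 1)   # r >= 1 always holds, so skip it
    return (r[None, :] >= thresholds[:, None]).astype(float)

def accumulated_advantages(rewards, num_levels, weights=None):
    """Mean-center each threshold's indicators within the group, then sum
    the (optionally weighted) per-threshold advantages for every sample."""
    ind = ordinal_indicators(rewards, num_levels)
    per_threshold = ind - ind.mean(axis=1, keepdims=True)
    if weights is None:
        weights = np.ones(ind.shape[0])
    return (np.asarray(weights)[:, None] * per_threshold).sum(axis=0)

rewards = np.array([1, 3, 3, 5])                # one sampled group, rubric 1..5
adv = accumulated_advantages(rewards, num_levels=5)
```

With uniform weights and no per-threshold normalization, the sum telescopes back to the ordinary mean-centered reward, so any robustness benefit has to come from how each threshold is normalized or weighted; the abstract does not give that detail.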

2605.11885 2026-05-18 cs.AI q-bio.NC

From Clever Hans to Scientific Discovery: Interpreting EEG Foundational Transformers with LRP

Justus Meyer zu Bexten, Nico Scherf, Bogdan Franczyk, Simon M. Hofmann

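AI summary This paper studies how attention-aware Layer-wise Relevance Propagation (LRP) can be used to interpret EEG foundation models (EEG-FMs) and address their poor interpretability. It extends LRP from conventional convolutional neural networks to Transformer-based EEG-FMs and finds that the method can both verify model decisions and surface new, biologically meaningful hypotheses. The study demonstrates LRP on motor imagery and affect prediction tasks, revealing the models' reliance on signals from specific brain regions and offering a new perspective on EEG-FM behavior.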

Comments 18 pages, 6 figures

Details
English abstract

Emerging foundation models (FMs) in electroencephalography (EEG) promise a path to scale deep learning in diagnostics and brain-computer interfaces despite data scarcity, yet their opaque nature remains a barrier to wider adoption. We investigate attention-aware Layer-wise relevance propagation (LRP) as a post-hoc attribution method for EEG-FMs, extending LRP's use on convolutional neural network (CNN)-based EEG models to the Transformer architectures that current FMs are based on. We find that LRP can both verify EEG-FM decisions and surface novel, biologically plausible hypotheses from them. In motor imagery, it unmasks 'Clever Hans' behavior where models prioritize task-correlated ocular signals over the intended motor correlates. In a naturalistic paradigm for affect prediction, it reveals a recurring reliance on a central electrode cluster, suggesting a candidate sensorimotor signature of arousal. Though heatmap interpretation remains ambiguous in this complex domain, the results position LRP as a tool for both verification and exploration of EEG-FMs, a role that will grow in both importance and discovery potential as the underlying models mature.

2605.11485 2026-05-18 cs.RO

Coordinated Diffusion: Generating Multi-Agent Behavior Without Multi-Agent Demonstrations

Lasse Peters, Laura Ferranti, Andrea Bajcsy, Javier Alonso-Mora

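AI summary This paper studies how to learn coordinated multi-agent behavior from single-agent demonstrations when no multi-agent demonstrations are available. It proposes CoDi, a framework that couples independently trained single-agent diffusion policies through a user-defined multi-agent cost function to generate coordinated multi-agent behavior. The method needs no multi-agent demonstration data, coordinates the policies through a new diffusion sampling scheme, and applies to black-box or non-differentiable cost functions without additional training. Experiments show that it is more data-efficient and better coordinated than conventional multi-agent methods.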

Details
English abstract

Imitation learning powered by generative models has proven effective for modeling complex single-agent behaviors. However, teaching multi-agent systems, like multiple arms or vehicles, to coordinate through imitation learning is hindered by a fundamental data bottleneck: as the joint state-action space grows exponentially with the number of agents, collecting a sufficient amount of coordinated multi-agent demonstrations becomes extremely costly. In this work, we ask: how can we leverage single-agent demonstration data to learn multi-agent policies? We present Coordinated Diffusion (CoDi), a framework that couples independently trained single-agent diffusion policies through a user-defined multi-agent cost function, without requiring any coordinated demonstrations. We derive a new diffusion-based sampling scheme wherein the diffusion score function decomposes into independent, single-agent pre-trained base policies plus a cost-driven guidance term that coordinates these base policies into cohesive multi-agent behavior. We show that this guidance term can be estimated in a gradient-free manner, making CoDi applicable to black-box, non-differentiable cost functions without additional training. Theoretically and empirically, we analyze the conditions under which this composition can faithfully approximate a target multi-agent behavior. We find a complementary role for demonstration data versus the cost function: single-agent demonstrations must cover the support of the desired multi-agent behavior, while the cost function must promote desired behavior from this product of single-agent policies. Our results in simulation and hardware experiments of a two-arm manipulation task show that CoDi discovers robust coordinated behavior from single-agent data, is more data-efficient than multi-agent baselines, and highlights the importance of joint guidance, base policy support, and cost design.

2605.11118 2026-05-18 cs.AI cs.IR

A Cascaded Generative Approach for e-Commerce Recommendations

Moein Hasani, Hamidreza Shahidi, Trace Levinson, Yuan Zhong, Guanghua Shu, Vinesh Gudla, Tejaswi Tenneti

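AI summary This paper proposes a cascaded generative framework for building personalized storefronts in e-commerce recommendation. It decomposes storefront generation into two generative tasks: theme generation for each page section, and constrained keyword generation per section to support product retrieval. A teacher-student fine-tuning strategy improves production efficiency, and the generative output is combined with traditional ranking models in a hybrid architecture. Experiments show a lift of about 2.7% in cart adds per page view over the baseline.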

Details
English abstract

Personalized storefronts in large e-commerce marketplaces are often assembled from many independent components: static themes per page section ("placement"), retrieval systems to fetch eligible products per placement, and pointwise rankers to order content. While effective in optimizing for aggregate preferences, this paradigm is rigid and can limit personalization and semantic cohesion across the page. This makes it poorly suited to support dynamic objectives and merchandising requirements over time. To address this, we introduce a cascaded merchandising framework that decomposes storefront construction into two generative tasks: (i) placement-level theme generation and (ii) constrained keyword generation per placement to power product retrieval. Teacher-student fine-tuning is leveraged to improve scalability of this framework under production latency and cost constraints. Fine-tuned model ablations are shown to approach closed-weight LLM performance. We further contribute frameworks for AI-driven content evaluation and quality filtering, enabling safe and automated deployment of dynamic content at scale. Generative output is fused with traditional ranking models to preserve hybrid infrastructure. In online experiments, this framework yields an estimated +2.7% lift in cart adds per page view over a strong baseline.

2605.10893 2026-05-18 cs.CL

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Charese H. Smiley, Ivan Brugere, Kundan Thind, Mohammad M. Ghassemi

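AI summary This paper studies the problem that large vision-language models (LVLMs) may answer questions from language priors rather than image content, and proposes BICR, a model-agnostic confidence estimation framework. During training, BICR contrasts hidden states from the real image-question pair with those from the same question and a blacked-out image, learning to distinguish visually grounded answers from purely language-driven ones and improving confidence reliability at no extra inference cost. Experiments show that BICR performs strongly on multiple benchmark tasks, clearly outperforming existing methods with fewer parameters.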

Details
English abstract

Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
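The training objective described above can be sketched as a correctness loss on the real-image view plus a ranking term that penalizes the probe whenever its confidence on the blacked-out view is not sufficiently lower. Everything specific here is an assumption: the linear probe, the hinge form with `margin`, the weighting `lam`, and applying the ranking term to every sample are illustrative choices, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bicr_style_loss(w, b, h_real, h_blind, correct, margin=0.1, lam=1.0):
    """Binary cross-entropy on the real-image confidence, plus a hinge ranking
    penalty active when confidence on the blind view is within `margin` of
    (or above) confidence on the real view."""
    c_real = sigmoid(h_real @ w + b)      # probe confidence, real image
    c_blind = sigmoid(h_blind @ w + b)    # probe confidence, blacked-out image
    eps = 1e-12
    bce = -np.mean(correct * np.log(c_real + eps)
                   + (1 - correct) * np.log(1 - c_real + eps))
    rank = np.mean(np.maximum(0.0, margin - (c_real - c_blind)))
    return bce + lam * rank

w, b = np.array([1.0]), 0.0
h_real = np.array([[2.0]])                # hidden state with the real image
correct = np.array([1.0])                 # the LVLM's answer was correct
loss_blind_more_confident = bicr_style_loss(w, b, h_real, np.array([[3.0]]), correct)
loss_blind_less_confident = bicr_style_loss(w, b, h_real, np.array([[-3.0]]), correct)
```

Only `h_real` is needed at inference time, which is consistent with the abstract's claim of zero additional inference cost: the blind view appears solely in the training loss.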

2605.10799 2026-05-18 cs.LG cs.AI cs.CL

The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies

Gabriel Garcia

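AI summary This paper points out a format-induced confound in the standard method for evaluating chain-of-thought (CoT) faithfulness: when benchmark reasoning chains end with an explicit final answer, existing corruption experiments mainly measure the effect of answer placement rather than the importance of intermediate computation steps. Experiments show that removing the final answer or supplying a wrong one strongly affects model behavior, and that the effect varies with model scale. The paper further proposes a three-prerequisite protocol to improve future corruption-based faithfulness studies.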

Comments 34 pages, 6 figures, 13 tables. Submitted to NeurIPS 2026. Code and data: https://github.com/Gpgabriel25/LastWordWinsCoT

Details
English abstract

Corruption studies, the standard tool for evaluating chain-of-thought (CoT) faithfulness, infer which steps are ``computationally important'' from accuracy loss when steps are corrupted. We show that when benchmark chains end with an explicit terminal answer line, as in GSM8K and MATH, these tests largely measure \emph{answer placement} rather than where intermediate computation is carried out. Using matched GSM8K examples, removing only the final answer statement while preserving all reasoning collapses suffix sensitivity by about $19\times$ for Qwen~2.5-3B ($N{=}300$, $p{=}0.022$). Conflicting-answer prompts, which contain correct reasoning but a wrong explicit final answer, drive accuracy to zero or near-zero at 7B across five open-weight model families; wrong-answer following is strong at 3B--7B and attenuates sharply at larger scales. Replications on MATH, within-stable comparisons at 7B, and suffix-free chains show the same pattern in different guises: corruption sensitivity tracks the location of explicit answer text, not a fixed computational depth in the reasoning. Generation-time probes indicate that final answers are rarely early-determined during generation (${<}5\%$ early commitment), yet consumption-time behavior systematically follows explicit answer text. The confound is therefore largely a readout effect when the chain is consumed. We propose a three-prerequisite protocol (question-only control, format characterization, and an all-position sweep) as a practical minimum for future corruption-based faithfulness studies.
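The two manipulations at the heart of the study (dropping only the terminal answer line, and pairing correct reasoning with a wrong explicit answer) are simple string operations on GSM8K-style chains, whose reference solutions end with a `#### <answer>` line. The sketch below shows one way to build such variants; the helper names and the example chain are hypothetical, and the paper's actual prompt construction may differ.

```python
import re

ANSWER_LINE = re.compile(r"^####\s*(.+)$")

def split_chain(chain):
    """Split a chain into (reasoning_lines, final_answer or None)."""
    lines = chain.strip().splitlines()
    if lines:
        match = ANSWER_LINE.match(lines[-1].strip())
        if match:
            return lines[:-1], match.group(1)
    return lines, None

def remove_terminal_answer(chain):
    """Keep all reasoning steps but drop the explicit final answer line."""
    steps, _ = split_chain(chain)
    return "\n".join(steps)

def conflicting_answer(chain, wrong_answer):
    """Keep the correct reasoning but append a wrong explicit final answer."""
    steps, _ = split_chain(chain)
    return "\n".join(steps + [f"#### {wrong_answer}"])

chain = "She has 3 + 4 = 7 apples.\nShe eats 2, so 7 - 2 = 5.\n#### 5"
no_answer = remove_terminal_answer(chain)
conflict = conflicting_answer(chain, "9")
```

Comparing corruption sensitivity on `chain` versus `no_answer` is the kind of matched contrast the abstract uses to separate answer placement from intermediate computation.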

2605.10057 2026-05-18 cs.AI cs.MA

STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

Ruiyi Yang, Lihuan Li, Hao Xue, Flora D. Salim

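AI summary This paper proposes STAR, a failure-aware routing framework for assigning work among agents in multi-agent spatiotemporal reasoning. By explicitly modeling inter-agent control decisions as a state-conditioned transition policy, it dynamically selects a suitable specialist agent based on task type and execution status, handling different kinds of execution failure. STAR combines expert-specified nominal routes with recovery transitions learned from execution traces, improving robustness and interpretability under abnormal conditions. Experiments show that STAR outperforms existing methods on multiple spatiotemporal reasoning benchmarks, especially when execution deviates from the expected path.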

Comments 30 pages, 13 figures

Details
English abstract

Compositional spatiotemporal reasoning often requires a system to invoke multiple heterogeneous specialists, such as geometric, temporal, topological, and trajectory agents. A central question is how such a system should route among specialists when execution does not simply succeed or fail, but fails in qualitatively different ways. Existing tool-augmented and multi-agent LLM systems typically leave this routing decision implicit in language generation, making recovery ad hoc, difficult to interpret, and hard to optimize. This paper presents STAR (Spatio-Temporal Agent Router), a failure-aware routing framework that externalizes inter-agent control as a state-conditioned transition policy over the current agent, task type, and typed execution status. At the center of STAR is an agent routing matrix that combines expert-specified nominal routes with recovery transitions learned from execution traces. Because the matrix conditions on distinct failure states, the router can respond differently to malformed outputs, missing dependencies, and tool--query mismatches, rather than collapsing them into a generic retry signal. Specialists execute through a tool-grounded extract--compute--deposit protocol and write intermediate results to a shared blackboard for downstream fusion. Results prove that retaining unsuccessful traces during training enlarges the support of the routing policy on error states, enabling recovery transitions that success-only training cannot represent. Across three spatiotemporal benchmarks and eight backbone LLMs, STAR improves over multiple baselines with the clearest gains on queries whose execution deviates from the nominal routing path. Router-specific ablations and recovery analyses further show that typed failure-aware routing, rather than specialist composition alone, is a key factor for these improvements.
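A routing table keyed by (current agent, task type, typed status) can be sketched as a pair of lookups: expert-specified nominal routes first, then recovery transitions. This is only an illustration of the data structure the abstract describes; the task name `path_query`, every table entry, and the probabilities standing in for learned values are hypothetical.

```python
from enum import Enum

class Status(Enum):
    OK = "ok"
    MALFORMED_OUTPUT = "malformed_output"
    MISSING_DEPENDENCY = "missing_dependency"
    TOOL_QUERY_MISMATCH = "tool_query_mismatch"

# Expert-specified nominal routes (hypothetical entries).
NOMINAL = {
    ("geometric", "path_query", Status.OK): "temporal",
    ("temporal", "path_query", Status.OK): "trajectory",
}

# Recovery transitions with probabilities; placeholders for values that
# would be learned from execution traces, including unsuccessful ones.
RECOVERY = {
    ("temporal", "path_query", Status.MISSING_DEPENDENCY): {"geometric": 0.8, "topological": 0.2},
    ("temporal", "path_query", Status.TOOL_QUERY_MISMATCH): {"topological": 0.7, "temporal": 0.3},
    ("temporal", "path_query", Status.MALFORMED_OUTPUT): {"temporal": 0.9, "trajectory": 0.1},
}

def next_agent(agent, task_type, status):
    """Pick the next specialist: nominal route if defined, otherwise the most
    probable recovery transition, otherwise None (no route known)."""
    key = (agent, task_type, status)
    if key in NOMINAL:
        return NOMINAL[key]
    candidates = RECOVERY.get(key)
    if candidates is None:
        return None
    return max(candidates, key=candidates.get)
```

Because each failure type has its own row, a missing dependency sends control back to an upstream specialist while a malformed output retries in place, instead of both collapsing into one generic retry.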

2605.10052 2026-05-18 cs.CL cs.AI

Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering

Xinyu Zhang, Zhicheng Dou, Deyang Li, Jianjun Tao, Shuo Cheng, Ruifeng Shi, Fangchao Liu, Enrui Hu, Yangkai Ding, Hongbo Wang, Qi Ye, Xuefeng Jin, Zhangchun Zhao

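AI summary As AI engineering paradigms shift from single-agent prompt and context engineering toward multi-agent coordination engineering, systematically codifying and improving multi-agent collaboration has become a key bottleneck. This paper proposes *Swarm Skills*, a portable, self-evolving multi-agent system specification that introduces roles, workflows, execution bounds, and a semantic structure for self-evolution, turning multi-agent collaboration workflows into distributable assets. It also presents a self-evolution algorithm that automatically distills successful execution trajectories and continuously refines existing skills, allowing multi-agent coordination strategies to evolve without human intervention.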

Details
English abstract

As artificial intelligence engineering paradigms shift from single-agent Prompt and Context Engineering toward multi-agent \textbf{Coordination Engineering}, the ability to codify and systematically improve how multiple agents collaborate has emerged as a critical bottleneck. While single-agent skills can now be distributed as portable assets, multi-agent coordination protocols remain locked within framework-internal code or static configurations, preventing them from being shared across systems or autonomously improved over time. We propose \textbf{Swarm Skills}, a portable specification that extends the Anthropic Skills standard with multi-agent semantics. Swarm Skills turns multi-agent workflows into first-class, distributable assets that consist of roles, workflows, execution bounds, and a built-in semantic structure for self-evolution. To operationalize the specification's evolving nature, we present a companion self-evolution algorithm that automatically distills successful execution trajectories into new Swarm Skills and continuously patches existing ones based on multi-dimensional scoring (Effectiveness, Utilization, and Freshness), eliminating the need for human-in-the-loop oversight during the refinement process. Through an architectural compatibility analysis and a comprehensive qualitative case study using the open-source JiuwenSwarm reference implementation, we demonstrate how Swarm Skills achieves zero-adapter cross-agent portability via progressive disclosure, enabling agent teams to self-evolve their coordination strategies without framework lock-in.