arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.05686 2026-05-15 cs.AI

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Qiyao Liang, Risto Miikkulainen, Ila Fiete

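AI summary This study examines two failure modes that can arise during language-model generation, knowledge conflict and confident hallucination, and gives a unified geometric account of both in hidden-state space. Learned facts form attractor basins; conflict arises when working memory disrupts convergence to the correct basin, while hallucination arises when no corresponding basin exists and the hidden state drifts freely. A geometric margin metric separates correct recall from hallucination, the structure is shown not to depend on fine-tuning, and the fraction of confident hallucinations follows a scaling law and grows with model scale.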

Comments 9 pages, 6 figures, plus appendices

Abstract

Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.
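To make the geometric margin concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each memorized basin can be summarized by a single center vector and that Euclidean distance is the metric; the names `geometric_margin` and `should_refuse` and the `threshold` value are hypothetical.

```python
import math

def geometric_margin(hidden_state, basin_centers):
    """Distance from a hidden state to the nearest basin center (Euclidean, by assumption)."""
    return min(math.dist(hidden_state, center) for center in basin_centers)

def should_refuse(hidden_state, basin_centers, threshold):
    """Flag a likely hallucination when no memorized basin lies within `threshold`."""
    return geometric_margin(hidden_state, basin_centers) > threshold

# Toy 2-D example; real hidden states are high-dimensional.
centers = [(0.0, 0.0), (10.0, 0.0)]
near_basin = (0.3, 0.4)   # close to the first center
no_basin = (5.0, 5.0)     # far from every center

print(geometric_margin(near_basin, centers))
print(should_refuse(no_basin, centers, threshold=2.0))
```

Unlike an entropy check on the output distribution, this test reads only the hidden state, which is the property the abstract relies on.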

2605.04554 2026-05-15 cs.CV

InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Kaili Zheng, Kaiwen Wang, Xun Zhu, Chenyi Guo, Ji Wu

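AI summary This paper proposes InterMesh, an end-to-end multi-person human mesh recovery framework that aims to model interactions between humans and their environment, and among humans, more accurately. Unlike existing DETR-based methods, InterMesh uses a human-object interaction detector to inject interaction semantics explicitly into mesh recovery, improving pose and shape estimation. Lightweight modules integrate the interaction information efficiently, and experiments on multiple datasets show clear gains in scenes with complex interactions.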

Comments 13 pages, 10 figures

Abstract

Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into the human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.

2605.04474 2026-05-15 cs.LG

Geometry-Aware Neural Optimizer for Shape Optimization and Inversion

Guoze Sun, Tianya Miao, Haoyang Huang, Huaguan Chen, Han Wan, Rui Zhang, Hao Sun

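AI summary This paper proposes the Geometry-Aware Neural Optimizer (GANO) for shape optimization and inversion in PDE-governed systems. GANO is an end-to-end differentiable framework that unifies geometry representation, field-level prediction, and automated optimization, addressing the unavailable gradients, restrictive parameterizations, and unstable optimization of earlier methods. It uses a denoising mechanism and a geometry-aware surrogate model for stable geometry updates, supports part-wise control and efficient geometry processing, and shows superior accuracy and controllability on several benchmarks.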

Comments To appear in ICML2026

Abstract

Geometry is central to PDE-governed systems, motivating shape optimization and inversion. Classical pipelines conduct costly forward simulation with geometry processing, requiring substantial expert effort. Neural surrogates accelerate forward analysis but do not close the loop because gradients from objectives to geometry are often unavailable. Existing differentiable methods either rely on restrictive parameterizations or unstable latent optimization driven by scalar objectives, limiting interpretability and part-wise control. To address these challenges, we propose Geometry-Aware Neural Optimizer (GANO), an end-to-end differentiable framework that unifies geometry representation, field-level prediction, and automated optimization/inversion in a single latent-space loop. GANO encodes shapes with an auto-decoder and stabilizes latent updates via a denoising mechanism, and a geometry-informed surrogate provides a reliable gradient pathway for geometry updates. Moreover, GANO supports part-wise control through null-space projection and uses remeshing-free projection to accelerate geometry processing. We further prove that denoising induces an implicit Jacobian regularization that reduces decoder sensitivity, yielding controlled deformations. Experiments on three benchmarks spanning 2D Helmholtz, 2D airfoil, and 3D vehicles show state-of-the-art accuracy and stable, controllable updates, achieving up to +55.9% lift-to-drag improvement for airfoils and ~7% drag reduction for vehicles.

2605.04236 2026-05-15 cs.LG

Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

Roberto E. Medina

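AI summary This study proposes DASE, an adaptive stopping mechanism for iterative deliberation in large language model ensembles that automatically identifies the deliberation budget during evidence accumulation and produces calibrated commit signals. DASE commits early when consensus is reached and falls back to a global-frequency strategy when evidence is fragmented, showing clear gains on several benchmarks. The study also finds that adaptive stopping affects accuracy far more than injection bandwidth, and that injection-based methods follow an inverted-U relationship between accuracy and inference cost.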

Abstract

Large Language Model ensembles improve reasoning accuracy, but only up to a performance boundary beyond which additional deliberation degrades accuracy. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. We make three contributions. (1) DASE produces a commit-type routing partition that generalises across benchmarks and is complementary to verbalized single-call confidence. On GPQA-Extended (N=546, 70B ensemble), the partition yields a 39.5 pp routing gap (right-wall 81.1% vs. left-wall 41.5%). On AIME 2010-2023 (N=261, 120B ensemble, 3 seeds), right-wall commits reach 98.3% accuracy vs. left-wall 72.8% (25.5 pp gap), statistically equivalent to Opus 4.6 Standard verbalized confidence at matched coverage (25.7 pp gap; bootstrap p=0.873); the two mechanisms disagree on 37% of routing assignments. (2) Adaptive stopping, not injection bandwidth, drives accuracy. On AIME-300, bandwidth accounts for only 0.3 pp (ns). On GPQA-Extended at the 120B tier, sparse injection ($\approx15$ tokens/worker/round) achieves 70.9% with a 30.7 pp routing gap; dense injection ($\approx600$ chars/worker/round) achieves 72.2% but with halved right-wall coverage and a narrower 18.9 pp gap. (3) Injection-based methods exhibit an inverted-U accuracy-vs-inference trajectory; this pattern is hypothesis-generating.
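The stopping behaviour described above (commit early on consensus, otherwise fall back to global frequency) can be illustrated with a small sketch. This is not DASE itself: the `quorum` threshold, the per-round majority test, and the function name `deliberate` are assumptions made for illustration, and the abstract does not specify how consensus is actually measured.

```python
from collections import Counter

def deliberate(rounds, quorum=0.8):
    """Return (answer, commit_type, rounds_used).

    `rounds` is a non-empty list of non-empty lists, one list of worker
    answers per deliberation round. `quorum` is a hypothetical threshold.
    """
    all_votes = Counter()
    for round_index, answers in enumerate(rounds, start=1):
        all_votes.update(answers)
        top_answer, count = Counter(answers).most_common(1)[0]
        # Early commit: enough workers agree within this round.
        if count / len(answers) >= quorum:
            return top_answer, "consensus", round_index
    # Fragmented evidence: pick the most frequent answer over all rounds.
    fallback_answer, _ = all_votes.most_common(1)[0]
    return fallback_answer, "fallback", len(rounds)

converging = [["A", "B", "C", "A"], ["A", "A", "A", "A"]]
fragmented = [["A", "B", "C", "B"], ["B", "C", "A", "B"]]
print(deliberate(converging))
print(deliberate(fragmented))
```

Returning the commit type alongside the answer mirrors the routing idea in the abstract, where the kind of commit is itself used as a confidence signal.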

2605.04215 2026-05-15 cs.LG cs.AI

Predict-then-Diffuse: Adaptive Response Length for Compute-Budgeted Inference in Diffusion LLMs

Michael Rottoli, Subhankar Roy, Stefano Paraboschi

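AI summary Diffusion-based large language models (D-LLMs) offer highly parallel generation and strong GPU utilization, but their fixed response length leads to either wasted computation or truncated output. This paper proposes Predict-then-Diffuse, a framework in which an Adaptive Response Length Predictor (AdaRLP) first estimates the optimal response length for an input and diffusion generation then runs at that length, reducing redundant computation while preserving output quality. Experiments on multiple datasets show lower computational cost and robustness to skewed data distributions.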

Abstract

Diffusion-based Large Language Models (D-LLMs) represent a promising frontier in generative AI, offering fully parallel token generation that can lead to significant throughput advantages and superior GPU utilization over the traditional autoregressive paradigm. However, this parallelism is constrained by the requirement of a fixed-size response length prior to generation. This architectural limitation imposes a severe trade-off: oversized response length results in computational waste on semantically meaningless padding tokens, while undersized response length causes output truncation requiring costly re-computations that introduce unpredictable latency spikes. To tackle this issue, we propose Predict-then-Diffuse, a simple and model-agnostic framework that enables compute-budgeted inference per input query by first estimating the response length and then using it to run inference with the D-LLM. At its core lies an Adaptive Response Length Predictor (AdaRLP), which estimates the optimal response length given an input query. As a measure against under-estimating the response length and re-running inference with a higher value, we introduce a data-driven safety mechanism based on a small increase of the predicted length. As a whole, our framework avoids wasting computation on padding tokens while preserving output quality. Experimental validation on multiple datasets demonstrates that Predict-then-Diffuse significantly reduces computational costs (FLOPs) compared to the default D-LLM inference mechanism, while being robust to skewed data distributions.
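The "small increase of the predicted length" can be pictured with the sketch below. It only illustrates the idea of padding a predicted length with a safety margin; the percentage margin, the rounding to a block size, the cap, and the name `budgeted_length` are all assumptions, not details taken from the paper.

```python
def budgeted_length(predicted_len, safety_percent=10, block=32, max_len=1024):
    """Turn a predicted response length into a generation budget.

    All parameters are hypothetical: the margin is a fixed percentage,
    the result is rounded up to a multiple of `block`, and it is capped
    at `max_len` (standing in for the default fixed length).
    """
    # Integer ceiling of predicted_len * safety_percent / 100.
    margin = (predicted_len * safety_percent + 99) // 100
    padded = predicted_len + margin
    # Round up to the next multiple of `block`.
    rounded = ((padded + block - 1) // block) * block
    return min(rounded, max_len)

print(budgeted_length(100))   # short answer: far below the 1024 cap
print(budgeted_length(2000))  # prediction above the cap is clamped
```

Integer arithmetic is used throughout so that the rounding does not depend on floating-point error.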

2605.03823 2026-05-15 cs.LG cs.IT math.IT math.ST stat.TH

Realizable Bayes-Consistency for General Metric Losses

Dan Tsir Cohen, Steve Hanneke, Aryeh Kontorovich

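AI summary This paper studies strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending known results for binary classification and regression. The authors characterize the conditions on a hypothesis class under which a distribution-free learning rule exists whose risk converges almost surely to the best-in-class risk (zero). The main contribution is a sharp characterization via a combinatorial obstruction: the infinite non-decreasing $(\gamma_k)$-Littlestone tree, which generalizes the classical Littlestone tree to metric losses.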

Comments 14 pages. To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026); v2: fixed abstract metadata rendering

Abstract

We study strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending classical characterizations beyond $0$-$1$ classification (Bousquet et al., 2020; Hanneke et al., 2021) and real-valued regression (Attias et al., 2024). Given an instance space $(X,\rho)$, a label space $(Y,\ell)$ with possibly unbounded loss, and a hypothesis class $H \subseteq Y^{X}$, we resolve the realizable case of an open problem presented in Tsir Cohen and Kontorovich (2022). Specifically, we find the necessary and sufficient conditions on the hypothesis class $H$ under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero) for every realizable data-generating distribution. Our main contribution is this sharp characterization in terms of a combinatorial obstruction: similarly to Attias et al. (2024), we introduce the notion of an infinite non-decreasing $(\gamma_k)$-Littlestone tree, where $\gamma_k \to \infty$. This extends the Littlestone tree structure used in Bousquet et al. (2020) to the metric loss setting.

2605.03596 2026-05-15 cs.AI cs.CL cs.DB cs.LG

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Yukai Wu, Weizheng Wang, Hongzhang Huang, Wei Zhou, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu

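AI summary Workspace-Bench 1.0 is a benchmark for evaluating how well AI agents handle large-scale file dependencies in workspace tasks. The authors build complex workspaces with many file types and realistic work scenarios, and design a large set of tasks that test cross-file retrieval, contextual reasoning, and adaptive decision-making. Experiments show that current mainstream AI models still perform far below human level, underscoring how hard reliable workspace learning remains in real work settings.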

Comments 30 pages, 16 figures

Abstract

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.

2605.02438 2026-05-15 cs.CV cs.LG

Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

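AI summary This paper studies open-set supervised anomaly detection (OSAD), which aims to identify unseen anomalies from limited anomalous supervision. Existing prototype-based methods ignore the multi-modality of normal data and so produce blurred decision boundaries; the proposed Mixture Prototype Flow Matching (MPFM) framework instead learns a continuous transformation from the normal feature distribution to a structured Gaussian mixture prototype space. The method models the velocity field with a Gaussian mixture prior and adds a mutual information maximization regularizer to improve prototype separation, achieving leading performance on a range of benchmark datasets.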

Comments Accepted by ICML 2026

Abstract

Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.

2605.02398 2026-05-15 cs.AI cs.CL cs.LG

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

Rahul Kumar

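AI summary As frontier AI models enter high-stakes decision pipelines, their ability to keep metacognition stable under adversarial pressure becomes a key safety requirement. This paper studies the metacognitive collapse that occurs when models face compliance-forcing instructions and introduces the "Compliance Trap": severe performance degradation is driven not by the threat content itself but by compliance-forcing instructions that override epistemic boundaries. In large-scale experiments, most models degrade significantly under adversarial conditions, while Anthropic's Constitutional AI shows strong immunity, attributed to alignment training.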

Comments 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap

Abstract

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.

2605.01758 2026-05-15 cs.AI

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Yue Ma, Ziyuan Yang, Yi Zhang

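AI summary This study addresses infectious jailbreak attacks in multi-agent systems and proposes a training-free Foresight-Guided Local Purification (FLP) framework. The method simulates future interaction trajectories with a multi-persona simulation strategy to detect and eliminate infection in agents, lowering the infection spread rate. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47% while preserving interaction diversity, outperforming existing methods.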

Comments 12 pages

Abstract

Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
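The recursive partitioning behind Recursive Binary Diagnosis can be sketched as a divide-and-conquer search. This is an illustration under stated assumptions, not the paper's procedure: `is_infected` is a hypothetical stand-in for the persona-based diagnosis, and the album is modelled as a plain list.

```python
def recursive_binary_diagnosis(album, is_infected):
    """Localize suspicious items by recursively halving the album.

    `is_infected(subset)` is a hypothetical predicate that returns True
    when the diagnosis flags the given subset of the album.
    """
    if not album or not is_infected(album):
        return []                      # clean subset: nothing to remove
    if len(album) == 1:
        return list(album)             # narrowed down to a single item
    mid = len(album) // 2
    left = recursive_binary_diagnosis(album[:mid], is_infected)
    right = recursive_binary_diagnosis(album[mid:], is_infected)
    return left + right

# Toy album where items containing "virus" trigger the diagnosis.
album = ["cat", "dog", "virus_1", "tree", "car", "virus_2", "sky"]
flagged = recursive_binary_diagnosis(
    album, lambda subset: any("virus" in item for item in subset)
)
purified = [item for item in album if item not in flagged]
print(flagged, purified)
```

When only a few items are infected, clean halves are discarded after a single check, so far fewer diagnoses are needed than testing every image individually.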

2605.01725 2026-05-15 cs.CV cs.AI

Motion-Aware Caching for Efficient Autoregressive Video Generation

Jing Xu, Yuexiao Ma, Xuzhe Zheng, Xing Wang, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Songwei Liu

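AI summary This paper studies how a motion-aware caching mechanism can make autoregressive video generation more efficient. Existing methods rely on coarse-grained chunk-level cache skipping, which cannot capture pixel-level dynamics and degrades generation quality. The authors propose MotionCache, which uses inter-frame differences as a lightweight proxy for pixel motion and applies a coarse-to-fine strategy to speed up generation while preserving quality. Experiments on several state-of-the-art models show speedups of up to 6.28x with high generation quality maintained.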

Comments 20 pages

Abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $6.28\times$ and $1.64\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.

2604.28130 2026-05-15 cs.CV

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Guanli Hou, Dongze Lian, Xiaoyu He, Mingyuan Zhang, Hanwang Zhang

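AI summary This paper proposes MoCapAnything V2, an end-to-end motion capture framework for arbitrary skeletons that resolves the ambiguity in mapping joint positions to rotations in conventional staged pipelines. A reference pose-rotation pair from the target asset fixes the rotation coordinate system, making rotation prediction more accurate and easier to learn. The method predicts joint positions directly from video without mesh intermediates, improving robustness and efficiency; it substantially reduces rotation error on multiple datasets and runs about 20x faster than mesh-based methods.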

Comments Project page: https://animotionlab.github.io/MoCapAnythingV2/

Abstract

Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/

2604.27263 2026-05-15 cs.CL

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Théo Gigant, Bowen Peng, Jeffrey Quesnelle

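AI summary This paper studies what subword tokenization actually contributes to large language model training by building a controlled byte-level pretraining framework that decouples and analyzes its effects. The authors formulate and test hypotheses along several dimensions, including sample throughput, vocabulary scaling, and the linguistic prior carried by subword boundaries, identifying key reasons subword models outperform raw byte models and suggesting directions for improving the pretraining of future byte-level and subword models.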

Comments 14 pages, 7 figures

Abstract

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses across various dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights to improve the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the integration of subword boundaries as either explicit priors or inductive biases.

2604.22050 2026-05-15 cs.LG cs.CL

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

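AI summary LayerBoost is a layer-aware attention reduction method for improving the inference efficiency of large language models. It runs a systematic sensitivity analysis on a pretrained model to identify the layers most critical to performance, then, depending on each layer's sensitivity, keeps standard attention, substitutes linear sliding window attention, or removes attention entirely, lowering computational cost while preserving model quality. Experiments show that LayerBoost cuts inference latency by up to 68% at high concurrency, matches or approaches the original model on multiple benchmarks, and significantly outperforms existing attention linearization methods.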

Abstract

Transformers mostly rely on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
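The three-way strategy can be pictured as a simple mapping from per-layer sensitivity scores to attention variants. The sketch below assumes sensitivity is a single score per layer (for example, the quality drop when that layer's attention is ablated) and uses two hypothetical cut-offs, `high` and `low`; neither the scoring rule nor the thresholds come from the paper.

```python
def plan_attention(sensitivities, high=0.5, low=0.1):
    """Assign an attention variant to each layer from its sensitivity score.

    `high` and `low` are hypothetical thresholds chosen for illustration.
    """
    plan = []
    for score in sensitivities:
        if score >= high:
            plan.append("softmax")                 # keep full attention
        elif score >= low:
            plan.append("linear_sliding_window")   # cheaper approximation
        else:
            plan.append("none")                    # drop attention entirely
    return plan

# Toy scores for a six-layer model.
scores = [0.92, 0.35, 0.04, 0.12, 0.61, 0.02]
print(plan_attention(scores))
```

In practice the plan would be followed by the distillation-based healing phase the abstract describes, which this sketch does not cover.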

2604.21809 2026-05-15 cs.LG cs.AI q-bio.QM stat.ML

Quotient-Space Diffusion Models

Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He, Di He, Chang Liu

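AI summary This paper proposes quotient-space diffusion models, a generative modeling framework designed to handle and exploit symmetry in a system. By running the generation process on the quotient space, where symmetry redundancy is removed, the model can learn the generation process more flexibly while still preserving the symmetric target distribution. Instantiated for molecular structure generation, the framework outperforms equivariant diffusion models and alignment-based methods, offering a new way to treat symmetry in generative models.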

Comments ICLR 2026 Oral Presentation; 43 pages, 5 figures, 6 tables; ICLR 2026 Camera Ready version

Abstract

Diffusion-based generative models have transformed generative AI, and also enabled new capabilities in the science domain, e.g., fast generation of 3D structures of molecules. In such tasks, there is often a symmetry in the system, identifying elements that can be converted by certain transformations as equivalent. Equivariant diffusion models guarantee a symmetric distribution, but miss the opportunity to make learning easier, while alignment-based simplification attempts fail to preserve the target distribution. In this work, we develop quotient-space diffusion models, a principled generative framework to fully handle and leverage symmetry. By viewing the intrinsic generation process on the quotient space, the exact construction that removes symmetry redundancy, the framework simplifies learning by allowing model output to have an arbitrary intra-equivalence-class movement, while generating the correct symmetric target distribution with guarantee. We instantiate the framework for molecular structure generation which follows $\mathrm{SE}(3)$ (rigid-body movement) symmetry. It improves the performance over equivariant diffusion models and outperforms alignment-based methods universally for small molecules and proteins, representing a new framework that surpasses previous symmetry treatments in generative models.

2604.19092 2026-05-15 cs.RO cs.AI

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, Ruihai Wu

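AI summary RoboWM-Bench is a benchmark focused on robotic manipulation that evaluates whether the behaviors generated by video world models are physically executable. It converts generated videos into executable action sequences and validates them in physical simulation environments, giving a systematic assessment of how models perform on robotic manipulation. The study finds that visual plausibility and physical executability do not always agree, highlighting the importance of embodiment-grounded evaluation for complex manipulation tasks.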

Abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.

2604.17548 2026-05-15 cs.LG math.AT stat.ML

Contraction and Hourglass Persistence for Learning on Graphs, Simplices, and Cells

Mattie Ji, Indradyumna Roy, Vikas Garg

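AI summary This paper studies topological methods for learning on graphs, simplicial complexes, and cellular networks, introducing Contraction Homology and Hourglass Persistence to improve on how conventional persistent homology is used in graph neural networks. By combining inclusion and contraction operations, Hourglass Persistence improves expressivity, learnability, and stability; the authors also design efficient algorithms that outperform conventional methods on a range of real-world graph datasets.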

Comments 31 pages, 6 figures, 4 algorithms, 2 tables. Accepted at ICLR 2026

Abstract

Persistent homology (PH) encodes global information, such as cycles, and is thus increasingly integrated into graph neural networks (GNNs). PH methods in GNNs typically traverse an increasing sequence of subgraphs. In this work, we first expose limitations of this inclusion procedure. To remedy these shortcomings, we analyze contractions as a principled topological operation, in particular, for graph representation learning. We study the persistence of contraction sequences, which we call Contraction Homology (CH). We establish that forward PH and CH differ in expressivity. We then introduce Hourglass Persistence, a class of topological descriptors that interleave a sequence of inclusions and contractions to boost expressivity, learnability, and stability. We also study related families parametrized by two paradigms. We also discuss how our framework extends to simplicial and cellular networks. We further design efficient algorithms that are pluggable into end-to-end differentiable GNN pipelines, enabling consistent empirical improvements over many PH methods across standard real-world graph datasets. Code is available at https://github.com/Aalto-QuML/Hourglass.

2604.16744 2026-05-15 cs.CL cs.AI cs.HC

Evaluating Adaptive Personalization of Educational Readings with Simulated Learners

Ryan T. Woo, Anmol Rao, Aryan Keluskar, Yinong Chen

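AI summary This paper proposes a theory-grounded simulated-learner framework for evaluating adaptive personalization of educational reading materials. The method builds an ontology of learning objectives and knowledge components from open textbooks, curates it with a browser-based tool, and generates aligned reading-assessment pairs. Experiments show that adaptive reading significantly improved learning outcomes in computer science, had inconclusive effects in inorganic chemistry, and yielded no clear gain, or even a slightly negative effect, in general biology.

Details
English abstract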

We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. The system builds a learning-objective and knowledge-component ontology from open textbooks, curates it in a browser-based Ontology Atlas, labels textbook chunks with ontology entities, and generates aligned reading-assessment pairs. Simulated readers learn from passages through a Construction-Integration-inspired memory model with DIME-style reader factors, KREC-style misconception revision, and an open New Dale-Chall readability signal. Answers are produced by score-based option selection over the learner's explicit memory state, while BKT drives adaptation. Across three sampled subject ontologies and matched cohorts of 50 simulated learners per condition, adaptive reading significantly improved outcomes in computer science, yielded smaller positive but inconclusive gains in inorganic chemistry, and was neutral to slightly negative in general biology.
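
The abstract states that BKT (Bayesian Knowledge Tracing) drives adaptation but gives no parameters. The sketch below shows the textbook BKT update for a single knowledge component. The class `BKTParams`, the function names, and all numeric values are assumptions chosen for illustration, not settings reported by the authors.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # Hypothetical values for illustration only; not taken from the paper.
    p_init: float = 0.2    # P(L0): prior probability the skill is already known
    p_learn: float = 0.15  # P(T): probability of learning at each opportunity
    p_slip: float = 0.1    # P(S): wrong answer despite knowing the skill
    p_guess: float = 0.25  # P(G): correct answer without knowing the skill


def bkt_update(p_known, correct, params):
    """Return the mastery estimate after observing one answer."""
    if correct:
        num = p_known * (1.0 - params.p_slip)
        den = num + (1.0 - p_known) * params.p_guess
    else:
        num = p_known * params.p_slip
        den = num + (1.0 - p_known) * (1.0 - params.p_guess)
    posterior = num / den  # Bayes rule given the observed answer
    # Learning transition: an unknown skill may become known after practice.
    return posterior + (1.0 - posterior) * params.p_learn


def trace(responses, params=None):
    """Mastery estimates after each response in a sequence of booleans."""
    params = params or BKTParams()
    p = params.p_init
    history = []
    for r in responses:
        p = bkt_update(p, r, params)
        history.append(p)
    return history


mastery = trace([True, False, True, True])
```

An adaptive system could, for example, compare the latest estimate against a mastery threshold to decide which reading to serve next; how the framework in the abstract actually uses the estimate is not specified there.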

2604.16325 2026-05-15 cs.LG cs.AI

UniMamba: A Unified Spatial-Temporal Modeling Framework with State-Space and Attention Integration

Xingsheng Chen, Xianpei Mu, Deyu Yi, Yilin Yuan, Xingwei He, Bo Gao, Regina Zhang, Pietro Lio, Siu-Ming Yiu

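AI summary Multivariate time series forecasting matters in domains such as energy, finance, and environmental monitoring, but complex temporal dependencies and cross-variable interactions make it challenging. This paper proposes UniMamba, a unified spatial-temporal forecasting framework that combines state-space models with attention mechanisms, keeping computation efficient while capturing explicit temporal patterns. By combining a Mamba variate-channel encoding layer, a spatial-temporal attention layer, and a feedforward temporal dynamics layer, the method models global temporal dependencies and inter-variate relationships; experiments on several public datasets show that UniMamba outperforms existing state-of-the-art models in both forecasting accuracy and computational efficiency.

Comments The authors wish to withdraw this preprint due to a lack of consensus regarding the final authorship list and the order of authors

Details
English abstract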

Multivariate time series forecasting is fundamental to numerous domains such as energy, finance, and environmental monitoring, where complex temporal dependencies and cross-variable interactions pose enduring challenges. Existing Transformer-based methods capture temporal correlations through attention mechanisms but suffer from quadratic computational cost, while state-space models like Mamba achieve efficient long-context modeling yet lack explicit temporal pattern recognition. Therefore we introduce UniMamba, a unified spatial-temporal forecasting framework that integrates efficient state-space dynamics with attention-based dependency learning. UniMamba employs a Mamba Variate-Channel Encoding Layer enhanced with FFT-Laplace Transform and TCN to capture global temporal dependencies, and a Spatial Temporal Attention Layer to jointly model inter-variate correlations and temporal evolution. A Feedforward Temporal Dynamics Layer further fuses continuous and discrete contexts for accurate forecasting. Comprehensive experiments on eight public benchmark datasets demonstrate that UniMamba consistently outperforms state-of-the-art forecasting models in both forecasting accuracy and computational efficiency, establishing a scalable and robust solution for long-sequence multivariate time-series prediction.

2604.10892 2026-05-15 cs.RO cs.MA

HECTOR: Human-centric Hierarchical Coordination and Supervision of Robotic Fleets under Continual Temporal Tasks

Shen Wang, Yinhang Luo, Jie Li, Meng Guo

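AI summary This paper proposes HECTOR, a human-centric hierarchical coordination and supervision framework for managing large-scale robotic fleets under continual and uncertain temporal tasks. It comprises three layers: a bidirectional human-fleet interaction protocol, a rolling task-assignment mechanism, and dynamic coordination within teams. This lets the operator adjust and supervise tasks at different granularities, improving computational efficiency and reducing human effort. Experiments show that the framework is adaptable and effective for heterogeneous fleets and tasks in complex environments.

Details
English abstract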

Robotic fleets can be extremely efficient when working concurrently and collaboratively, e.g., for delivery, surveillance, search and rescue. However, it can be demanding or even impractical for an operator to directly control each robot. Thus, autonomy of the fleet and its online interaction with the operator are both essential, particularly in dynamic and partially unknown environments. The operator might need to add new tasks, cancel some tasks, change priorities and modify planning results. How to design the procedure for these interactions and efficient algorithms to fulfill these needs have been mostly neglected in the related literature. Thus, this work proposes a human-centric coordination and supervision scheme (HECTOR) for large-scale robotic fleets under continual and uncertain temporal tasks. It consists of three hierarchical layers: (I) the bidirectional and multimodal protocol of online human-fleet interaction, where the operator interacts with and supervises the whole fleet; (II) the rolling assignment of currently-known tasks to teams within a certain horizon, and (III) the dynamic coordination within a team given the detected subtasks during online execution. The overall mission can be as general as temporal logic formulas over collaborative actions. Such hierarchical structure allows human interaction and supervision at different granularities and triggering conditions, to both improve computational efficiency and reduce human effort. Extensive human-in-the-loop simulations are performed over heterogeneous fleets under various temporal tasks and environmental uncertainties.

2604.09304 2026-05-15 cs.CV

GeRM: A Generative Rendering Model From Physically Realistic to Photorealistic

Jiayuan Lu, Rengan Xie, Xuancheng Jin, Zhizhen Wu, Qi Ye, Tian Xie, Hujun Bao, Rui Wang, Yuchi Huo

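AI summary This paper proposes GeRM, a generative rendering model that aims to bridge the gap between physically-based rendering (PBR) and photorealistic rendering (PRR). By learning a distribution transfer vector (DTV) field and combining a multi-condition ControlNet with a residual perceptual transfer mechanism, the model achieves controllable image generation from physically realistic to photorealistic. The work also introduces a multi-agent vision-language framework to build P2P-50K, an expert-guided dataset for supervising the transfer process; experiments show that GeRM outperforms existing state-of-the-art methods across a range of applications.

Details
English abstract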

While physically-based rendering (PBR) simulates light transport that guarantees physical realism, achieving true photorealistic rendering (PRR) demands prohibitive time and labor, and still struggles to capture the intractable richness of the real world. We propose GeRM, the first multimodal generative rendering model to bridge the gap from PBR to PRR (P2P). We formulate this P2P transition by learning a distribution transfer vector (DTV) field to direct the generative process. To achieve this, we introduce a multi-condition ControlNet that synthesizes PBR images and progressively transitions them into PRR images, guided by G-buffers, text prompts, and cues for enhanced regions. To improve the model's grasp of the image distribution shift driven by text prompts, we propose a residual perceptual transfer mechanism to associate text prompts with corresponding targeted modification regions, which more clearly defines the incremental component updates. To supervise this transfer process, we introduce a multi-agent visual language model framework to construct an expert-guided pairwise transfer dataset, named P2P-50K, where each paired sample corresponds to a specific transfer vector in the DTV field. Extensive experiments demonstrate that GeRM synthesizes high-quality controllable images and outperforms state-of-the-art baselines across diverse applications, including PBR and PRR image synthesis and editing.

2604.08991 2026-05-15 cs.CV cs.AI

PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Zhiyu Zhou, Peilin Liu, Ruoxuan Zhang, Luyang Zhang, Cheng Zhang, Hongxia Xie, Wen-Huang Cheng

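AI summary This paper presents PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos, designed to evaluate a model's ability to precisely localize a target object in video and describe its position. Built from ScanNet++ and ScanNet200, the dataset contains 1,024 scenes and 10,094 QA pairs across four progressively harder tasks. Experiments show that mainstream multimodal large language models still have a clear performance gap on the benchmark, while fine-tuning on PinpointQA substantially improves model performance.

Details
English abstract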

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP). The dataset is built from intermediate spatial representations, with QA pairs generated automatically and further refined through quality control. Experiments on representative MLLMs reveal a consistent capability gap along the progressive chain, with SSP remaining particularly difficult. Supervised fine-tuning on PinpointQA yields substantial gains, especially on the harder tasks, demonstrating that PinpointQA serves as both a diagnostic benchmark and an effective training dataset. The dataset and project page are available at https://rainchowz.github.io/PinpointQA.

2604.06757 2026-05-15 cs.CV

FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi, Rui Zhao, Jiahao Tang, Weixian Lei, Linjie Li, Qisheng Su, Zhengyuan Yang, Lijuan Wang, Xiaofeng Zhu, Alex Jinpeng Wang

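AI summary FlowInOne proposes a unified multimodal generation framework that converts inputs from different modalities, such as textual descriptions, spatial layouts, and editing instructions, into a single visual representation, yielding an image-in, image-out generation pipeline. Using a single flow matching model, the method removes the constraints of cross-modal alignment and task-specific architectures, bringing text-to-image generation, layout-guided editing, and visual instruction following under one paradigm. The work also builds the large-scale visual prompt dataset VisPrompt-5M and the evaluation benchmark VP-Bench; experiments show that FlowInOne achieves state-of-the-art performance on multiple tasks, laying a new foundation for fully vision-centric generative modeling.

Details
English abstract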

Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space. Our code and models are released on https://csu-jpg.github.io/FlowInOne.github.io/

2604.02482 2026-05-15 cs.LG

SEDGE: Structural Extrapolated Data Generation

Kun Zhang, Jiaqi Sun, Yiqing Li, Ignavier Ng, Namrata Deka, Shaoan Xie

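AI summary This paper proposes SEDGE, a framework for generating data that satisfies novel specifications beyond the training data; its core is making suitable assumptions about the data-generating process. Under certain conservative assumptions, the method guarantees approximate identifiability of the generated data distribution, and the paper shows that the distribution is non-identifiable without such assumptions. The work achieves extrapolated data generation through algorithms such as a structure-informed optimization strategy and diffusion posterior sampling, and validates it on synthetic data and image generation tasks.

Details
English abstract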

This paper aims to address the challenge of data generation beyond the training data and proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data-generating process. We provide conditions under which data satisfying novel specifications can be generated reliably, together with the approximate identifiability of the distribution of such data under certain "conservative" assumptions, as well as the inherent non-identifiability of this distribution without such assumptions. On the algorithmic side, we develop practical methods to achieve extrapolated data generation, based on a structure-informed optimization strategy or diffusion posterior sampling, respectively. We verify the extrapolation performance on synthetic data and also consider extrapolated image generation as a real-world scenario to illustrate the validity of the proposed framework.

2603.29665 2026-05-15 cs.CL

Near-Miss: Latent Policy Failure Detection in Agentic Workflows

Ella Rabinovich, David Boaz, Naama Zwerdling, Ateret Anaby-Tavor

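AI summary In agentic workflows, LLM-based systems can reach the expected final state while bypassing required policy checks along the way, producing latent policy failures. This paper proposes a new metric for detecting latent policy failures in agent conversation trajectories; building on the ToolGuard framework, it analyzes whether the agent's decisions when calling tools were sufficiently informed. Experiments show that even when the final outcome is correct, 8% to 17% of trajectories contain such latent failures, revealing a limitation of current evaluation methods.

Comments GEM@ACL2026, 13 pages

Details
English abstract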

Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent's tool-calling decisions were sufficiently informed. We evaluate our approach on the $τ^2$-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8-17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
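
To make the near-miss idea concrete, the sketch below flags mutating tool calls that were not preceded by the read-only checks a policy requires, and labels a trajectory a near-miss when the final state is nonetheless correct. This is a simplified proxy written for illustration: it does not use ToolGuard or its generated guard code, and the trajectory format, the `required_checks` mapping, and the tool names are all hypothetical.

```python
def find_unchecked_mutations(trajectory, required_checks):
    """Return (index, tool, missing_checks) for each under-informed mutating call.

    trajectory: ordered list of tool names the agent called (hypothetical format).
    required_checks: dict mapping a mutating tool name to the set of tool names
        that must have been called earlier in the same trajectory.
    """
    seen = set()
    findings = []
    for i, tool in enumerate(trajectory):
        missing = required_checks.get(tool, set()) - seen
        if missing:
            findings.append((i, tool, sorted(missing)))
        seen.add(tool)
    return findings


def is_near_miss(trajectory, required_checks, final_state_correct):
    """A near-miss: the outcome is correct, yet a required check was skipped."""
    return final_state_correct and bool(
        find_unchecked_mutations(trajectory, required_checks)
    )


# Hypothetical policy: cancelling requires looking up the reservation first.
policy = {"cancel_reservation": {"get_reservation_details"}}
skipped = ["get_user", "cancel_reservation"]
checked = ["get_user", "get_reservation_details", "cancel_reservation"]
```

Outcome-only evaluation would score both example trajectories identically whenever the final state matches the ground truth; only the second one actually gathered the information the policy depends on.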

2603.28205 2026-05-15 cs.CL

Beyond Cosine Similarity: Zero-Initialized Residual Complex Projection for Aspect-Based Sentiment Analysis

Yijin Wang, Fandi Sun, Haoyu Wen

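AI summary To address representation entanglement and negative-sample collisions in real-valued embedding spaces for aspect-based sentiment analysis (ABSA), this paper proposes a new framework comprising a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss. The method maps textual features into a complex semantic space, uses the phase to separate sentiment polarities, regularizes the amplitude to maintain structural consistency within aspect categories, and introduces an anti-collision mask to strengthen discrimination between opposing polarities. Experiments show that the method achieves a state-of-the-art Macro-F1 score on the ASAP dataset.

Details
English abstract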

Aspect-Based Sentiment Analysis (ABSA) faces critical challenges due to representation entanglement and false-negative collisions in real-valued embedding spaces. In this paper, we propose a novel framework featuring a Zero-Initialized Residual Complex Projection (ZRCP) and an Anti-collision Masked Angle Loss. Our approach projects textual features into a complex semantic space, utilizing the phase to isolate sentiment polarities while regularizing the amplitude to ensure structural consistency within aspect categories. To mitigate false-negative collisions, we introduce an anti-collision mask that preserves intra-polarity aspect cohesion while significantly expanding the discriminative margin between opposing polarities. Experimental results on the ASAP dataset demonstrate that our framework achieves a state-of-the-art Macro-F1 score of 0.8923, outperforming robust baselines.

2603.23129 2026-05-15 cs.LG

Polaris: A Gödel Agent Framework for Small Language Models through Experience-Abstracted Policy Repair

Aditya Kakade, Vivek Srivastava, Shirish Karande

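AI summary This paper proposes Polaris, a framework for recursive self-improvement in small language models. Through an experience-abstracted policy repair mechanism, the framework turns the model's task failures into reusable policy updates, improving policy-level performance without changing model parameters. Using meta-reasoning, the model explains its own errors and proposes concrete policy revisions, ultimately achieving significant performance gains on multiple benchmarks.

Comments Accepted to ACL 2026 (Findings). 33 pages

Details
English abstract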

Gödel agents realize recursive self-improvement: an agent inspects its own policy and traces and then modifies that policy in a tested loop. We introduce Polaris, a Gödel agent for compact models that performs policy repair via experience abstraction, turning failures into policy updates through a structured cycle of analysis, strategy formation, abstraction, and minimal code patch repair with conservative checks. Unlike response-level self-correction or parameter tuning, Polaris makes policy-level changes with small, auditable patches that persist in the policy and are reused on unseen instances within each benchmark. As part of the loop, the agent engages in meta-reasoning: it explains its errors, proposes concrete revisions to its own policy, and then updates the policy. To enable cumulative policy refinement, we introduce experience abstraction, which distills failures into compact, reusable strategies that transfer to unseen instances. On MGSM, DROP, GPQA, and LitBench (covering arithmetic reasoning, compositional inference, graduate-level problem solving, and creative writing evaluation), a 7-billion-parameter model equipped with Polaris achieves consistent gains over the base policy and competitive baselines.

2603.22586 2026-05-15 cs.LG

A Foundation Model for Instruction-Conditioned In-Context Time Series Tasks

Anish Saha, Konstantin Shmakov

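AI summary This paper proposes iAmTime, a time-series foundation model that adapts to in-context tasks through instruction-conditioned prompting. Trained with amortized meta-learning, the model builds structured prompts over historical context and future-known variables and combines a Hierarchical Multi-Scope Transformer Encoder with a Task-Conditioned Patch Decoder to capture temporal dynamics and covariate features, enabling zero-shot adaptation to multiple tasks such as forecasting, classification, and anomaly detection. Experiments show that iAmTime outperforms existing time-series foundation models on multiple benchmarks, with good generalization and task adaptability.

Details
English abstract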

In-context learning (ICL) enables task adaptation at inference time by conditioning on demonstrations rather than updating model parameters. Although recent time-series foundation models incorporate contextual conditioning, retrieval, or example-based prompting, they typically rely on implicit positional structure or task-specific objectives rather than explicit instruction-conditioned input-output demonstrations. We introduce iAmTime, a time-series foundation model trained with instruction-conditioned amortized meta-learning to infer tasks directly from example demonstrations. iAmTime represents each episode as a structured prompt over historical context and future-known variables using specialized semantic tokens that attend to designated time-series regions, exchange information across demonstrations, and inject task information into the query representation. The model combines a Hierarchical Multi-Scope Transformer Encoder, which captures temporal and covariate dynamics while inferring latent task structure from demonstrated input-output mappings, with a Task-Conditioned Patch Decoder, which adapts decoding through expert-based routing. We train iAmTime on large-scale real and synthetic corpora using supervised and self-supervised instruction-conditioned tasks, including forecasting, imputation, reconstruction, classification, anomaly detection, and source de-mixing. Across diverse domains, frequencies, and horizons, iAmTime improves zero-shot adaptation over strong time-series foundation baselines on probabilistic and point forecasting benchmarks, while achieving competitive performance on non-forecasting tasks such as classification.

2603.21250 2026-05-15 cs.AI

Graph of States: Solving Abductive Tasks with Large Language Models

Yu Luo, Rongchen Gao, Lu Teng, Xidao Wen, Jiamin Jiang, Qingliang Zhang, Yongqian Sun, Shenglin Zhang, Jiasong Feng, Tong Liu, Wenjie Zhang, Dan Pei

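AI summary This paper studies the use of large language models for abductive reasoning, the third type of logical reasoning alongside induction and deduction. To address existing frameworks' lack of structured state representation and explicit state control, the authors propose Graph of States (GoS), a neuro-symbolic framework that encodes logical dependencies in a causal graph and uses a state machine to govern valid transitions of the reasoning process, turning unconstrained exploration into directed search. Experiments show that GoS significantly outperforms existing methods on two real-world datasets, providing a robust solution for complex abductive tasks.

Details
English abstract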

Logical reasoning encompasses deduction, induction, and abduction. However, while Large Language Models (LLMs) have effectively mastered the former two, abductive reasoning remains significantly underexplored. Existing frameworks, predominantly designed for static deductive tasks, fail to generalize to abductive reasoning due to unstructured state representation and lack of explicit state control. Consequently, they are inevitably prone to Evidence Fabrication, Context Drift, Failed Backtracking, and Early Stopping. To bridge this gap, we introduce Graph of States (GoS), a general-purpose neuro-symbolic framework tailored for abductive tasks. GoS grounds multi-agent collaboration in structured belief states, utilizing a causal graph to explicitly encode logical dependencies and a state machine to govern the valid transitions of the reasoning process. By dynamically aligning the reasoning focus with these symbolic constraints, our approach transforms aimless, unconstrained exploration into a convergent, directed search. Extensive evaluations on two real-world datasets demonstrate that GoS significantly outperforms all baselines, providing a robust solution for complex abductive tasks. Code repo and all prompts: https://github.com/gaorch85/Graph-of-States.

2603.21174 2026-05-15 cs.CL

Explainable Semantic Textual Similarity via Dissimilar Span Detection

Diego Miguel Lozano, Daryna Dementieva, Alexander Fraser

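AI summary This paper proposes a new approach to explainable semantic textual similarity (STS) that improves interpretability by detecting semantically differing spans between text pairs (Dissimilar Span Detection, DSD). The work introduces a dataset for the task, the Span Similarity Dataset (SSD), and evaluates several baselines based on language models and explainability methods. Experiments show that although large language models and supervised models perform best, the task remains difficult overall, and that DSD can improve performance on specific tasks such as paraphrase detection.

Comments Accepted at LREC 2026

Details
English abstract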

Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.