arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08568 2026-05-12 cs.LG

Different Prompts, Different Ranks: Prompt-aware Dynamic Rank Selection for SVD-based LLM Compression

Hengyi Zhu, Zhendong Mi, Grace Li Zhang, Shaoyi Huang

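AI summary: Large language models (LLMs) are growing rapidly in scale, incurring substantial memory and compute costs that limit efficient deployment. Existing compression methods based on singular value decomposition (SVD) use static rank truncation, which cannot adapt to the differing needs of individual prompts and therefore limits performance. The paper proposes PARSE, a prompt-aware dynamic rank selection framework that trains a linear router offline to select ranks per prompt and combines it with semantic-similarity caching and system-level optimizations, improving both the quality and the inference efficiency of compressed models. Experiments show that PARSE improves task accuracy by up to 10% on LLaMA-7B and substantially speeds up inference.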

Abstract

Large language models (LLMs) have rapidly grown in scale, creating substantial memory and computational costs that hinder efficient deployment. Singular value decomposition (SVD) has emerged as an effective post-training compression technique, but existing SVD-based methods rely on static rank truncation, applying a fixed prefix of singular components to all inputs regardless of their diversity. We identify two limitations of this static design: the optimal rank varies across individual prompts, and the selected rank is sensitive to the choice of calibration set, leading to suboptimal performance across diverse inputs. To address these challenges, we propose PARSE, a post-training framework for Prompt-Aware Rank Selection as Experts in SVD-compressed LLMs. PARSE trains a linear router offline to perform prompt-aware rank selection, decoupling it from calibration information by supervising the router against dense-model outputs on a large-scale corpus. We further observe that rank-selection patterns are shared across semantically similar prompts and remain stable across decoding steps, allowing appropriate rank subsets to be served directly from a pattern cache at inference. Complemented by expert memory aggregation and kernel fusion for system-level efficiency, PARSE is orthogonal to existing SVD-based pipelines and consistently improves both model quality and inference efficiency. Integrated with four representative SVD-based methods, PARSE improves average task accuracy by up to 10% at a compression ratio of 0.6 on LLaMA-7B, and achieves up to 2.5× prefill and 2.4× decode speedup over native SVD execution.
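
To make the contrast between static and prompt-aware rank selection concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the PARSE implementation: the function names, the untrained random `router`, and the top-k selection rule are all hypothetical.

```python
import numpy as np

def svd_factors(W):
    """Return U, s, Vt with W == U @ diag(s) @ Vt."""
    return np.linalg.svd(W, full_matrices=False)

def static_truncation(U, s, Vt, rank):
    """Static baseline: keep the first `rank` singular components for every input."""
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def routed_selection(U, s, Vt, prompt_embedding, router, rank):
    """Hypothetical prompt-aware variant: a linear router scores each singular
    component and the `rank` highest-scoring components are kept for this prompt."""
    scores = prompt_embedding @ router           # one score per component
    keep = np.sort(np.argsort(scores)[-rank:])   # indices of the selected components
    return (U[:, keep] * s[keep]) @ Vt[keep], keep

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))                  # stand-in for a weight matrix
U, s, Vt = svd_factors(W)
router = rng.standard_normal((4, s.size))        # untrained, illustrative weights
prompt_embedding = rng.standard_normal(4)

W_static = static_truncation(U, s, Vt, rank=3)
W_routed, keep = routed_selection(U, s, Vt, prompt_embedding, router, rank=3)
```

By the Eckart-Young theorem the static prefix is always the best rank-3 approximation of `W` itself in Frobenius norm, so any benefit from a routed subset would have to show up in an input-dependent measure such as output error on a given prompt.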

2605.08566 2026-05-12 cs.CV cs.LG q-bio.QM

MicroDiffuse3D: A Foundation Model for 3D Microscopy Imaging Restoration

Yongkang Li, Brian Wong, King Wai Chiu, Hanwen Xu, Tangqi Fang, Erin Dunnington, Dan Fu, Sheng Wang

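AI summary: The paper presents MicroDiffuse3D, a pretrained foundation model for 3D microscopy image restoration that reconstructs high-quality 3D structure from degraded low-resolution measurements, substantially improving acquisition efficiency. The model performs well across several challenging restoration tasks, including sparse 3D super-resolution, joint degradation in resolution and noise, and denoising at low signal-to-noise ratio, with clear gains over existing methods in segmentation quality and line-profile concordance. The results indicate that pretrained 3D image restoration is a broadly applicable way to overcome the throughput and SNR limits of volumetric chemical imaging.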

Abstract

Chemical imaging enables label-free visualization of cells, tissues and living systems while providing direct biochemical information that is difficult to obtain with conventional fluorescence microscopy. Despite its promise in applications ranging from intraoperative diagnosis to drug-response analysis, its broader use remains limited by slow data acquisition, particularly for three-dimensional imaging. Here we present MicroDiffuse3D, a pretrained foundation model for 3D microscopy image restoration that recovers high-quality volumetric structure from degraded low-resolution measurements acquired at substantially higher throughput. We evaluated MicroDiffuse3D across three challenging restoration settings, including 3D super-resolution under 16-fold volumetric sparsity, joint degradation in resolution and noise, and 3D denoising in the low signal-to-noise ratio (SNR) regime, where the model delivered clear gains over strong baselines. Under the sparse 3D super-resolution setting, MicroDiffuse3D produced clearer continuity across depth with fewer artifacts and improved segmentation quality by 10.58% and line-profile concordance by 15.59%. Together, our results establish pretrained 3D restoration as a broadly applicable strategy for overcoming the throughput and SNR limitations in volumetric chemical imaging, enabling high-resolution analysis at scales and speeds that were previously difficult to achieve.

2605.08564 2026-05-12 cs.AI cs.CV cs.LG

Biological Plausibility and Representational Alignment of Feedback Alignment in Convolutional Networks

Jake Lance, Larry Kieu

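AI summary: The paper studies the biological plausibility and representational alignment of the feedback alignment (FA) algorithm in convolutional networks, comparing five learning algorithms, including modified FA and standard backpropagation (BP), on CIFAR-10. It finds that modified FA algorithms produce internal representations structurally similar to those of BP, suggesting that their success may stem from mimicking BP's representational geometry even though the weight update mechanisms differ. The study offers a new perspective on the relationship between biological plausibility and model performance.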

Abstract

The feedback alignment (FA) algorithm offers a biologically plausible alternative to backpropagation (BP) for training neural networks yet notably fails to scale to convolutional architectures. Modifications have been proposed to address this limitation, but at questionable cost to biological plausibility. In this paper, we evaluate five learning algorithms including modified FA and standard BP, applied to the same convolutional architecture with the CIFAR-10 dataset. We provide a tripartite comparative analysis focusing on biological plausibility, interpretability, and computational complexity. Our results indicate that modified FA algorithms converge on internal representations that are structurally similar to those produced by backpropagation. In particular, it appears the functional success of modified FA algorithms may be rooted in their ability to mimic the representational geometry of backpropagation, converging on similar representations despite relying on fundamentally different weight update mechanisms.
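
For readers unfamiliar with the mechanism, the difference between BP and the original FA rule can be sketched on a two-layer network: BP sends the output error backward through the transpose of the forward weights, while FA uses a fixed random matrix instead. The snippet below is a minimal sketch of that textbook formulation; the layer sizes and names are illustrative, and it does not reproduce the modified FA variants or the convolutional setup evaluated in the paper.

```python
import numpy as np

def hidden_delta(e, h, feedback):
    """Error signal at the hidden layer; `feedback` maps output error to hidden units."""
    return (feedback @ e) * (1.0 - h ** 2)      # (1 - h^2) is the tanh derivative

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 4, 3
W1 = rng.standard_normal((n_hid, n_in)) * 0.1
W2 = rng.standard_normal((n_out, n_hid)) * 0.1
B = rng.standard_normal((n_hid, n_out)) * 0.1   # fixed random feedback weights (FA)

x = rng.standard_normal(n_in)
t = rng.standard_normal(n_out)

h = np.tanh(W1 @ x)                   # forward pass
e = W2 @ h - t                        # dL/dy for L = 0.5 * ||y - t||^2

delta_bp = hidden_delta(e, h, W2.T)   # backpropagation: transpose of forward weights
delta_fa = hidden_delta(e, h, B)      # feedback alignment: fixed random matrix

grad_W1_bp = np.outer(delta_bp, x)    # exact gradient of L with respect to W1
grad_W1_fa = np.outer(delta_fa, x)    # FA update direction, generally not the gradient
```

Only the feedback pathway differs between the two updates, which is why FA avoids the weight-transport requirement that makes BP biologically implausible.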

2605.08563 2026-05-12 cs.AI

Why Retrying Fails: Context Contamination in LLM Agent Pipelines

Zhanfu Yang

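AI summary: When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt usually stays in the context, contaminating later attempts and raising the per-step error rate. The paper introduces the Context-Contaminated Restart Model (CCRM) to formalize this phenomenon and derives five main results, including a closed-form success probability, the additional attempts caused by contamination, and an optimal budget allocation. Validation on real data shows that the model fits well and clearly outperforms the IID model.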

Abstract

When an LLM agent fails a multi-step tool-augmented task and retries, the failed attempt typically remains in its context window -- contaminating the next attempt and elevating the per-step error rate beyond the base level. This context-contaminated restart phenomenon is widely observed in practice yet entirely lacks formal treatment. We introduce the Context-Contaminated Restart Model (CCRM): a chain of T tool-call steps, each failing with base rate epsilon_0; after any failed attempt, the subsequent attempt operates in contaminated context with elevated error rate epsilon_1 > epsilon_0. Under this model we derive five main results. (R1) An exact closed-form formula for P(succeed in at most K attempts). (R2) A cascade-overhead theorem giving the additional attempts Delta K incurred by contamination versus the clean-restart baseline. (R3) An optimal budget-allocation theorem identifying the pipeline depth T* that maximises success probability for a fixed total budget B=KT; we prove the closed form T* = sqrt(B * log(1/(1-epsilon_1)) / log(1/(1-epsilon_0))), with K*=B/T*. (R4) An information-theoretic lower bound via Le Cam's method showing K_CCRM is tight up to O(1). (R5) A clean-restart dominance theorem quantifying the exact benefit of context-clearing before retry. We validate CCRM on real SWE-bench Verified data: the IID model overestimates pass@3 by 17.4 percentage points (98.6% vs. 81.2%), while CCRM fits with error less than 0.001, implying a cascade ratio of epsilon_1/epsilon_0 = 7.1. Monte Carlo experiments confirm all theoretical predictions.
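
The model as stated in the abstract is simple enough to evaluate directly. The sketch below assumes steps fail independently, that the first attempt runs at epsilon_0, and that every retry (which only happens after a failure) runs at epsilon_1. The resulting expression is a straightforward reading of that description and may differ in form from the paper's result (R1); the function names are illustrative.

```python
def attempt_success(eps, T):
    """Probability that all T steps of one attempt succeed at per-step error rate eps."""
    return (1.0 - eps) ** T

def p_success_iid(eps0, T, K):
    """Clean-restart (IID) baseline: every attempt uses the base error rate."""
    return 1.0 - (1.0 - attempt_success(eps0, T)) ** K

def p_success_contaminated(eps0, eps1, T, K):
    """First attempt at eps0; each retry after a failure runs at eps1 > eps0."""
    if K < 1:
        return 0.0
    fail_first = 1.0 - attempt_success(eps0, T)
    fail_retry = 1.0 - attempt_success(eps1, T)
    return 1.0 - fail_first * fail_retry ** (K - 1)

# Illustrative numbers only, not taken from the paper.
iid = p_success_iid(0.05, T=10, K=3)
contaminated = p_success_contaminated(0.05, 0.30, T=10, K=3)
```

With `eps1 == eps0` the two functions coincide, and for `eps1 > eps0` the contaminated probability is never higher than the IID one, which matches the direction of the overestimate the abstract reports.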

2605.08560 2026-05-12 cs.CV cs.AI

ZAYA1-VL-8B Technical Report

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

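AI summary: The paper introduces ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built on the in-house language model ZAYA1-8B. Despite its small parameter count, the model performs well on a range of image understanding, reasoning, and counting tasks, matching or surpassing mainstream base models. Its key innovations are vision-specific LoRA adapters that increase modality-specific capacity and bidirectional attention over image tokens within the language model to improve visual understanding.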

Comments 20 pages, 7 figures, 3 appendices (with 31 figures)

Abstract

We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.

2605.08558 2026-05-12 cs.LG

Beyond Static Bias: Adaptive Multi-Fidelity Bandits with Improving Proxies

Muyun Lu, Haoyang Hong, Huazheng Wang, Ying Lin

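AI summary: The paper studies a multi-fidelity multi-armed bandit problem with improving proxies, in which the low-fidelity feedback source becomes more accurate through calibration. To handle this dynamic, the authors propose the Threshold-Based Adaptive Continuation Companion (TACC), which uses dynamic confidence bounds and a cost-benefit analysis to decide when to keep using low-fidelity feedback and when to escalate to high-fidelity evaluation. The analysis proves an instance-dependent regret bound for intermediate arms, and the method is evaluated in synthetic experiments and an LLM-based policy-evaluation task.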

Abstract

As an extension of the classical multi-armed bandit problem, multi-fidelity multi-armed bandits (MF-MAB) enable individual arms to be evaluated using diverse feedback sources that vary in both cost and accuracy. Prior stochastic models typically assume fixed low-to-high fidelity discrepancies, whereas modern proxy sources, such as learning-based simulators and Large Language Models (LLMs), can be improved using additional calibration. We investigate adaptive MF-MAB with improving proxy sources, and focus on the canonical two-fidelity case in which the low-fidelity source becomes more informative with repeated use. To capture this dynamic, we introduce a selected-average mismatch bound that converts dynamic low-fidelity observations into improvement-aware confidence bounds for the high-fidelity target. We propose the Threshold-Based Adaptive Continuation Companion (TACC), an optimistic algorithm that uses a bounded continuation rule to decide when low-fidelity sampling remains cost-effective and when to escalate. We prove an instance-dependent regret bound showing that, for detected intermediate arms, adaptive continuation replaces logarithmic high-fidelity confirmation with bounded low-fidelity continuation. Experiments on synthetic bandits and an LLM-as-a-judge policy-evaluation task examine when continuation improves cost-weighted regret.

2605.08557 2026-05-12 cs.CV cs.AI cs.LG

MC-RFM: Geometry-Aware Few-Shot Adaptation via Mixed-Curvature Riemannian Flow Matching

Salim Khazem, Ibrahim Mohamed Serouis, Zakaria Ezzahed

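AI summary: The study proposes MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of frozen visual backbones. By mapping features onto a product manifold of a hyperbolic space and a Euclidean space, the method explicitly models the geometry of the task-induced feature displacement. Experiments show that MC-RFM performs strongly across multiple visual recognition benchmarks and backbones, especially on Transformer backbones and fine-grained datasets, supporting the value of modeling task geometry.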

Comments Submitted to NeurIPS (Under Review)

Abstract

Parameter-efficient adaptation of pretrained vision models is commonly performed through linear probes, prompts, low-rank updates, or lightweight residual modules. While effective, these methods usually treat adaptation as a discrete Euclidean perturbation of frozen representations, without explicitly modeling the geometry of the task-induced feature displacement. We propose MC-RFM, a mixed-curvature Riemannian flow-matching framework for few-shot adaptation of frozen visual backbones. The key idea is to represent adapted features on a product manifold combining a hyperbolic factor, which captures hierarchy-sensitive semantic structure, and a Euclidean factor, which preserves locally discriminative visual variation. Adaptation is formulated as a task-conditioned continuous transport from frozen features to support-set prototypes, trained with a flow-matching objective and coupled to a hybrid prototype-linear classifier. The method is lightweight, backbone-agnostic, and operates entirely on cached frozen features. Across seven visual recognition benchmarks, five frozen backbones, and 1/4/16-shot regimes, MC-RFM is the best-performing method in a majority of evaluated settings, with the strongest gains on Transformer backbones and fine-grained datasets. Ablations show that the mixed-curvature head, task conditioning, adaptive branch gating, prototype shrinkage, and discriminative supervision each contribute to performance. These results suggest that few-shot adaptation benefits not only from deciding which parameters to update, but also from modeling how representations should move through a geometry matched to the structure of the downstream task.

2605.08556 2026-05-12 cs.LG

Can Revealed Preferences Clarify LLM Alignment and Steering?

Khurram Yamin, Jingjing Tang, Eric Horvitz, Bryan Wilder

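AI summary: The study examines the alignment of large language models (LLMs) in decision-making under uncertainty and proposes a revealed-preference approach to estimating the implicit preferences a model optimizes. By analyzing a model's choices in decision tasks together with its elicited probability distributions, it fits a discrete choice model to recover the cost function behind the decisions, then evaluates whether the model behaves in a consistently goal-directed way, can accurately state its objectives, and can be steered effectively through prompting. The results show that although many models are reasonably coherent internally, they have clear weaknesses in faithfully reporting or adopting user-specified preferences.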

Abstract

LLMs are increasingly used to make or support high-stakes decisions under uncertainty, where alignment depends not only on factual accuracy but on how models weigh tradeoffs between different outcomes. We present an empirical pipeline for estimating the implied preferences that an LLM's observed choices optimize: we elicit the model's probability distribution over unknowns along with the choice it would make for the decision task and then fit a discrete choice model to recover the cost function that best rationalizes the model's decisions. We show how this revealed-preference description allows rigorous evaluation of whether models behave in a consistently goal-directed way, whether they can verbalize a description of their objectives which matches their revealed decision policy, and whether prompting can reliably steer those policies to implement a user-specified cost function. We apply this evaluation across four medical diagnosis domains and multiple frontier and open-source models. We find that while many models have a nontrivial degree of internal coherence, they also have significant weaknesses in faithfully reporting or adopting preferences in response to user direction.
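
The forward direction of such a pipeline can be sketched in a few lines: given elicited probabilities over unknown states and a cost matrix, a multinomial-logit choice model assigns higher probability to actions with lower expected cost. This is a generic discrete-choice sketch with made-up numbers, not the paper's estimator; the paper fits the cost function from observed choices, which is the inverse of what is shown here.

```python
import numpy as np

def expected_costs(probs, cost):
    """cost[a, s] is the cost of taking action a when the true state is s."""
    return cost @ probs

def choice_probabilities(probs, cost, temperature=1.0):
    """Multinomial-logit choice model: lower expected cost gives higher choice probability."""
    utilities = -expected_costs(probs, cost) / temperature
    utilities = utilities - utilities.max()    # shift for numerical stability
    weights = np.exp(utilities)
    return weights / weights.sum()

probs = np.array([0.8, 0.2])       # illustrative P(healthy), P(disease)
cost = np.array([[0.0, 10.0],      # action 0: do not treat
                 [1.0, 0.0]])      # action 1: treat

costs = expected_costs(probs, cost)
p_choice = choice_probabilities(probs, cost)
```

Here treating has the lower expected cost even though disease is the less likely state, because a missed diagnosis is assumed to be ten times as costly as unnecessary treatment.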

2605.08554 2026-05-12 cs.SD

Online Segmented Beamforming via Dynamic Programming

Manan Mittal, Ryan M. Corey, Diego Cuji, John R. Buck, Andrew C. Singer

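AI summary: In dynamic acoustic environments, interferers and sources change over time, making it hard for conventional beamforming to identify stationary regions. The paper proposes an online segmented beamforming algorithm based on dynamic programming that uses data-driven temporal segmentation to adapt the covariance estimation window to local stationarity and resets the covariance estimate in real time when the environment changes abruptly, allowing it to track newly active interferers. Experiments show that the method outperforms fixed-window adaptive methods in complex reverberant environments.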

Comments 4 pages, 2 figures

Abstract

In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccessible in practice. Practical implementations overcome this by estimating the sample covariance matrix (SCM) through averaging over a block of temporal samples. However, in non-stationary settings, a naive batch approach fails. Moving interferers smear the SCM, causing the beamformer to place nulls in outdated locations while failing to track newly active interferers, thereby degrading its nulling capabilities. To address this fundamental limitation, an Online Segmented Beamformer is proposed. This algorithm incorporates data-driven temporal segmentation to causally minimize output power while dynamically adapting the SCM estimation windows to local stationarity. By framing the problem through the lens of dynamic programming, the proposed method tracks abrupt environmental changes and resets covariance estimates in real-time. We validate the performance of this framework in a complex, reverberant simulated acoustic environment and in highly reverberant real world experiments, demonstrating its superiority over fixed-window adaptive methods.
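
As background for the baseline the abstract describes, the following sketch computes a sample covariance matrix over one block of snapshots and the corresponding Capon (MVDR) weights, w = R^{-1} a / (a^H R^{-1} a). It covers only the standard fixed-window beamformer, not the proposed segmentation algorithm; the uniform linear array, diagonal loading level, and signal levels are assumptions chosen for illustration.

```python
import numpy as np

def sample_covariance(X):
    """SCM from snapshots X of shape (n_sensors, n_snapshots)."""
    return X @ X.conj().T / X.shape[1]

def capon_weights(R, a, loading=1e-3):
    """Capon/MVDR weights w = R^{-1} a / (a^H R^{-1} a), with diagonal loading."""
    n = R.shape[0]
    R_loaded = R + loading * np.trace(R).real / n * np.eye(n)
    Ri_a = np.linalg.solve(R_loaded, a)
    return Ri_a / (a.conj() @ Ri_a)

def steering_vector(n_sensors, theta, spacing=0.5):
    """Uniform linear array; spacing in wavelengths, theta in radians from broadside."""
    k = np.arange(n_sensors)
    return np.exp(2j * np.pi * spacing * k * np.sin(theta))

rng = np.random.default_rng(0)
n, snapshots = 8, 400
a_target = steering_vector(n, 0.0)
a_interf = steering_vector(n, np.deg2rad(40.0))

# One stationary block: a strong interferer plus unit-variance sensor noise.
interferer = 10.0 * (rng.standard_normal(snapshots) + 1j * rng.standard_normal(snapshots))
noise = (rng.standard_normal((n, snapshots)) + 1j * rng.standard_normal((n, snapshots))) / np.sqrt(2)
X = np.outer(a_interf, interferer) + noise

w = capon_weights(sample_covariance(X), a_target)
response_target = abs(w.conj() @ a_target)   # distortionless constraint keeps this at 1
response_interf = abs(w.conj() @ a_interf)   # small: a null is placed on the interferer
```

If the interferer moved during the block, the SCM would mix several directions and the null would be smeared, which is the failure mode that motivates adapting the estimation window.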

2605.08549 2026-05-12 cs.AI

Evaluating Developmental Cognition Capabilities of LLMs

Xiao Xiao, Hayoun Noh, Mar Gonzalez-Franco

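AI summary: The paper studies the developmental-cognition capabilities of large language models and introduces the Developmental Sentence Completion Test (DSCT) as an instrument for eliciting developmental signal in text responses. It finds that frontier models recover the intended stage labels of simulated personas with high accuracy, that human-LLM agreement on real human responses is limited, and that model-generated answers show stable stage-like differences across model scales. The developmental signal is cleaner in synthetic responses, and the core challenge for stage-aware conversational systems is obtaining usable developmental signal from elicited text.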

Comments 9 pages, 3 figures, (10 pages appendix)

Abstract

Conversational AI is increasingly personalized around users' preferences, histories, goals, and knowledge, but much less around how users interpret and take up model outputs to construct and understand their reality. We draw on Robert Kegan's constructive-developmental theory as a complementary lens on this dimension. Existing methods for assessing developmental stage in the Keganian tradition rely either on expert interviews that do not scale or on sentence-completion instruments that are proprietary, lengthy, or invasive. To make this perspective tractable for LLM evaluation, we introduce the Developmental Sentence Completion Test (DSCT), a 20-item instrument designed to elicit developmental signal in self-administered text. Throughout, we treat the resulting labels as characterizations of stage-like structure in elicited responses, not as validated person-level developmental stage. We then ask how much of that signal can be recovered by LLMs across three elicited response regimes: simulated personas, real human respondents, and default model-generated answers. On simulated personas, top frontier models recover simulator-intended labels with high accuracy. On real human DSCT responses, human-LLM agreement is fair, with much stronger within-neighborhood than exact agreement. Finally, when LLMs answer DSCT prompts without persona-conditioning, their responses exhibit stable stage-like differences across model families, with larger and newer models tending to generate higher-rated text. These results suggest that stage-conditioned signal is cleaner in synthetic responses than in human-written DSCT text, and that the core constraint for stage-aware conversational AI is not classifier accuracy alone, but the availability of developmental signal from elicited text.

2605.08545 2026-05-12 cs.AI

Log analysis is necessary for credible evaluation of AI agents

Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, Arvind Narayanan

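AI summary: The paper argues that agent benchmarks, which typically report only final outcomes (pass or fail), put evaluation credibility at risk. The authors propose log analysis, the systematic tracking and analysis of an agent's inputs, execution, and outputs, to surface problems such as misjudged capability, poor prediction of real-world utility, and hidden dangerous actions. The paper presents a taxonomy of threats documented through log analysis, sets out guiding principles, illustrates them with a worked example, and offers practical recommendations for more credible agent evaluation.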

Abstract

Agent benchmarks typically report only final outcomes: pass or fail. This threatens evaluation credibility in three ways. First, scores may be inflated or deflated by shortcuts and benchmark artifacts, misrepresenting capability. Second, benchmark performance may fail to predict real-world utility due to scaffold limitations and recurring failure modes. Finally, capability scores may conceal dangerous or catastrophic actions taken by the agent. We argue that log analysis -- the systematic tracking and analysis of the inputs, execution, and outputs of an AI agent -- is necessary to overcome these validity threats and promote credible agent evaluation. In this paper, we (1) present a taxonomy of threats to credible evaluation documented through log analysis, and (2) develop a set of guiding principles for log analysis. We illustrate these principles on tau-Bench Airline, revealing that pass^5 performance was under-elicited by nearly 50% and surfacing deployment failure modes invisible to outcome metrics. We conclude with pragmatic recommendations to increase uptake of log analysis, directed at diverse stakeholders including benchmark creators, model developers, independent evaluators, and deployers.

2605.08539 2026-05-12 cs.LG cs.AI

Continuity Laws for Sequential Models

Annan Yu, Dongwei Lyu, N. Benjamin Erichson

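AI summary: The paper studies an overlooked inductive bias in sequential models: continuity in time. The authors formalize model continuity as convergence under temporal refinement and find that S4 behaves in a stably continuous way, whereas S6 (the core of Mamba) is more sensitive to input amplitude and selective dynamics. They also introduce a metric for task continuity and report a clear empirical alignment between task continuity, model continuity, and model performance, indicating that continuity is not only an inductive bias but also has practical value, such as enabling temporal subsampling that improves efficiency and performance.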

Abstract

Inductive biases influence the behavior and performance of sequential models. In this work, we study an underexplored inductive bias in sequential modeling: continuity in time. We ask a simple question: do models motivated by continuous-time formulations, such as state-space models, actually behave continuously in time, and does this translate into better performance on tasks with continuous temporal structure? To answer this, we formalize model continuity as convergence under temporal refinement, where a model is continuous if its predictions approach an underlying continuous trajectory as the temporal discretization is refined. We show that S4 exhibits stable continuous behavior, whereas S6 (the core of Mamba) can be more sensitive to input amplitude and selective dynamics, despite being derived from a continuous dynamical system. To study whether this distinction matters for learning, we also need a corresponding notion of task continuity. We therefore introduce a metric to quantify the continuity of datasets directly from their temporal structure. Across benchmarks, we find a clear empirical alignment between task continuity, model continuity, and model performance. Beyond an inductive bias, continuity also has practical consequences: we show that it enables a simple temporal subsampling strategy that improves both efficiency and performance.
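
The idea of convergence under temporal refinement can be illustrated on the simplest possible case: a scalar linear state-space model x' = a x + b u(t), discretized with a zero-order hold and compared against its closed-form solution as the step size shrinks. This is a toy linear time-invariant example with hypothetical names, not S4 or S6, and it says nothing about how the paper measures continuity in trained models.

```python
import numpy as np

def simulate_zoh(a, b, u, t_end, dt):
    """Scalar linear SSM x' = a x + b u(t), x(0) = 0, under zero-order-hold discretization."""
    n_steps = int(round(t_end / dt))
    A_d = np.exp(a * dt)               # discrete state transition
    B_d = (A_d - 1.0) / a * b          # discrete input gain for a held input
    x = 0.0
    for k in range(n_steps):
        x = A_d * x + B_d * u(k * dt)  # input sampled at the start of each step
    return x

a, b, t_end = -1.0, 1.0, 2.0
# Closed-form x(t_end) for x' = -x + sin(t), x(0) = 0.
reference = 0.5 * (np.sin(t_end) - np.cos(t_end) + np.exp(-t_end))

dts = [0.2, 0.1, 0.05, 0.025]
errors = [abs(simulate_zoh(a, b, np.sin, t_end, dt) - reference) for dt in dts]
```

In this toy case the error shrinks as the discretization is refined, which is the behavior the abstract's definition of a continuous model asks for.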

2605.08538 2026-05-12 cs.AI cs.CL cs.IR cs.LG

Human-Inspired Memory Architecture for LLM Agents

Doga Kerestecioglu, Alexei Robsky, Clemens Vasters, Anshul Sharma, Yitzhak Kesselman

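AI summary: Current LLM agents lack effective mechanisms for managing persistent memory over long interactions. The paper proposes a memory architecture inspired by human cognition, comprising six cognitive mechanisms that address specific failure modes of naive memory accumulation, and introduces a synthetic calibration method that requires no benchmark data, which improves generalization. Experiments on two benchmarks, code issue tracking and long conversations, show marked improvements in memory precision and storage efficiency.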

Comments 10 pages, 4 tables. Preprint; comments welcome

Abstract

Current LLM agents lack principled mechanisms for managing persistent memory across long interaction horizons. We present a biologically-grounded memory architecture comprising six cognitive mechanisms: (1) sleep-phase consolidation, (2) interference-based forgetting, (3) engram maturation, (4) reconsolidation upon retrieval, (5) entity knowledge graphs, and (6) hybrid multi-cue retrieval. Each mechanism addresses a specific failure mode of naive memory accumulation. We introduce a synthetic calibration methodology that derives all pipeline thresholds without benchmark data exposure, eliminating a common source of evaluation leakage. We evaluate on two benchmarks. First, a VSCode issue-tracking dataset (13K issues, 120K events) where deduplication-based consolidation achieves 97.2% retention precision with 58% store reduction (+21.8 pp over baseline). Second, the LongMemEval personal-chat benchmark where we conduct the first streaming M-tier evaluation (475 sessions, ~540K unique turns). At a 200K-token context budget, our pipeline matches raw retrieval accuracy (70.1% vs. 71.2%, overlapping 95% CI) while exposing a tunable accuracy/store-size operating curve. At S-tier scale (50 sessions), dedup-based consolidation yields a +13.3 pp improvement in preference recall.

2605.08533 2026-05-12 cs.AI

Human-LLM Dialogue Improves Diagnostic Accuracy in Emergency Care

Burcu Sayin, Ngoc Vo Hong, Ipek Baris Schlicht, Jacopo Staiano, Pasquale Minervini, Sara Allievi, Nicola Susca, Nicola Osti, Alberto Maino, Vito Racanelli, Andrea Passerini

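AI summary: The study examines human-LLM dialogue for emergency diagnosis. Physicians who initially see only the patient's chief complaint can iteratively question an LLM that has the full clinical record. With LLM assistance, residents' diagnostic accuracy on hard cases rose markedly, physicians with different levels of experience used different questioning strategies, and overall diagnostic concordance increased. The study suggests that interactive LLM support can improve the accuracy and soundness of emergency diagnosis.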

Comments Paper under review

Abstract

Clinical decision-making in emergency medicine demands rapid, accurate diagnoses under uncertainty. Despite benchmark progress, evidence for LLMs as interactive aids in live physician workflows remains sparse. MedSyn lets physicians iteratively query an LLM provided with the full clinical record while initially viewing only the chief complaint. Seven physicians (three seniors, four residents) completed baseline and AI-assisted sessions across 52 MIMIC-IV cases stratified by difficulty. Blinded evaluation showed residents' Hard-case correctness rose from 0.589 to 0.734; difficulty-standardised completely-correct rates confirmed a medium effect (Δ = 0.092; p = 0.071; d = 0.47). Automated metrics corroborated these gains: standardised any-match accuracy improved by 0.156 (p < 0.0001), and residents showed the largest F1 gain (Δ = 0.138; p < 0.0001). Dialogue analysis revealed expertise-dependent strategies (seniors asked targeted, hypothesis-driven questions; residents relied on broader queries) and cross-expertise concordance increased (Δ = 0.145; p < 0.0001). Interactive LLM support meaningfully enhances diagnostic reasoning.

2605.08530 2026-05-12 cs.CV

A Two-Stage Motion-Aware Framework for mmWave-based Human Mesh Recovery

Hoang Hai Pham, Shuntian Zheng, Jiaqi Li, Yu Guan

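AI summary: The paper proposes a two-stage motion-aware framework for mmWave-radar-based human mesh recovery. To cope with signal clutter and incomplete measurements, the method first extracts human reflections through coarse-to-fine localization and voxel-wise segmentation, producing a confidence-weighted radar volume; a dual-branch network then jointly models per-frame geometry and inter-frame dynamics to recover the mesh more accurately. Experiments show that the method outperforms existing approaches while remaining computationally efficient.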

Abstract

Millimeter-wave (mmWave) radar has emerged as a promising sensing modality for human perception due to its robustness under challenging environmental conditions and strong privacy-preserving properties. However, recovering accurate 3D human body meshes from radar observations remains difficult due to severe signal clutter and the inherently partial nature of radar measurements. Previous works typically adopt end-to-end frameworks that directly regress human body parameters from raw radar data, without decoupling signal interpretation from geometric reasoning or exploiting temporal motion cues, limiting learning performance. To address this, we propose a two-stage framework for radar-based human body reconstruction. First, we introduce a human reflection extraction module that performs coarse-to-fine localization and voxel-wise segmentation to produce a confidence-weighted radar volume encoding voxel-level human likelihood. Second, we design a motion-aware mesh recovery network that reconstructs the human body by jointly modeling per-frame geometry and inter-frame dynamics using a dual-branch architecture. Extensive experiments demonstrate that the proposed method outperforms existing approaches while maintaining computational efficiency.

2605.08529 2026-05-12 cs.LG

The Propagation Field: A Geometric Substrate Theory of Deep Learning

Xingrui Gu

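AI summary: The paper proposes the propagation field, a theoretical framework that reinterprets the internal dynamics of neural networks geometrically. By analyzing the geometry of hidden-state trajectories and local Jacobian operators, it shows that endpoint losses constrain only the boundary behavior of the propagation field and leave its interior geometry undetermined. Experiments show that observable field metrics and field-aware objectives can improve unseen-path generalization, robustness, and calibration, and that field preservation improves on conventional methods in continual learning.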

Comments Technical notes on exploring the nature of deep learning propagation, Under review by the ICML 4th Workshop on High-dimensional Learning Dynamics (HiLD) 2026

Journal ref
ICML 2026
Abstract

Modern deep learning treats neural networks primarily as endpoint functions from inputs to outputs. Inspired by the shift from force to geometry in physics, we ask whether a network should instead be understood through the geometry of its internal propagation. We define a neural propagation field as the collection of hidden-state trajectories and local Jacobian operators across depth. Endpoint losses constrain only the boundary behavior of this field, leaving its interior geometry underdetermined. We show that endpoint-equivalent models can differ by orders of magnitude in trajectory and Jacobian structure, and introduce observable field metrics such as path sensitivity, solver consistency, and trajectory/Jacobian retention. In controlled teacher-flow and PDE systems, endpoint fitting fails to recover the underlying propagation law. In real multi-path tasks, field-aware objectives improve unseen-path generalization, OOD robustness, and calibration when aligned with the observation structure, but can collapse when over-constrained. In continual learning, field-preservation regularization complements replay and distillation: on Split CIFAR-100, DER++ with field preservation improves average accuracy, backward transfer, and field-retention metrics. These results identify propagation-field quality as a measurable and trainable property of neural networks beyond endpoint performance.

2605.08526 2026-05-12 cs.LG

Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

Zihan Huang, Junda Wu, Tong Yu, Qianqi Yan, Rohan Surana, Uttaran Bhattacharya, Lina Yao, Xin Eric Wang, Julian McAuley

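AI summary: The paper proposes Skill-CMIB, a method for constructing multimodal agent skills that make action execution more consistent. Using a conditional multimodal information bottleneck, it extracts and compresses task-relevant invariants from vision and language while separating interpretable text skills from residual perceptual information, which reduces cross-modal redundancy and improves execution stability. The method keeps skills reusable and avoids the overhead of multi-sample inference, offering a new approach to reliable execution for multimodal agents.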

Abstract

While LLM-based agents excel at planning and executing long action sequences, their execution often remains inconsistent across trials, limiting reliability. Consolidating agent consistency requires distilling trial-error trajectories into reusable skills that preserve task-relevant invariants while discarding trajectory-specific noise. However, in multimodal settings, the key challenge is not only that useful invariants are distributed across vision and language information, but that different modalities support different kinds of reusable skill content: while some skills are verbalizable and interpretable, others reside in perceptual evidence beyond text. Text-only skills may lose perceptual cues, whereas storing text and perception naively introduces redundancy and noise. Existing inference-time methods, such as self-consistency, improve reliability through costly multi-sample decoding, while internalization strategies lack a way to separate verbalizable skill content from residual perceptual information. To address this, we introduce Conditional Multimodal Information Bottleneck (CMIB), a method for multimodal skill construction. CMIB begins with a joint bottleneck over multimodal skills and derives an exact sequential decomposition: (1) a text-stage bottleneck distilling interpretable skill cards, and (2) a conditional multimodal bottleneck compressing only residual information in perception that remains predictive beyond text. Unlike naive two-stream formulations, CMIB explicitly conditions the multimodal latent on the text skill, thus structurally reducing cross-modal redundancy and enabling independent control over textual and perceptual compression. We instantiate CMIB with a variational objective that makes its conditional decomposition tractable to optimize, yielding reusable multimodal skills that improve execution stability without incurring multi-sample inference overhead.

2605.08525 2026-05-12 cs.RO cs.SY eess.SY

Model-Reference Adaptive Flight Control of the 95-mg Bee++

Francisco M. F. R. Gonçalves, Conor K. Trygstad, Néstor O. Pérez-Arancibia

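AI summary: The paper presents a flight control architecture based on model-reference adaptive control (MRAC) for high-precision position tracking of the Bee++, a 95-mg flapping-wing aerial vehicle. Real-time flight experiments demonstrate the approach's suitability, functionality, and high control performance, offering an effective solution for high-precision control of micro aerial vehicles.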

Comments Extended abstract to appear in the proceedings of the LSU Symposium on Control, Learning, and Intelligent Systems

Abstract

We introduce a model-reference adaptive control (MRAC) architecture for high-performance positional tracking of the Bee++, a 95-mg insect-scale flapping-wing aerial vehicle. The suitability, functionality, and high performance of the proposed approach are demonstrated using data from real-time flight experiments.

2605.08521 2026-05-12 cs.CV cs.LG

Geometric Flood Depth Estimation: Fusing Transformer-Based Segmentation with Digital Elevation Models

Nhut Le, Ehsan Karimi, Maryam Rahnemoonfar

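AI summary: The paper proposes a geometric method for flood depth estimation that fuses a Transformer-based segmentation model with a Digital Elevation Model (DEM) to estimate flood depth from monocular aerial imagery. Mask2Former generates precise flood masks, which are combined with the DEM to locate the water-land boundary, compute a global water surface elevation, and derive per-pixel depth. The study shows how high-performance segmentation can extract 3D volumetric data from 2D imagery without time-consuming hydrodynamic simulation.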

Comments Accepted by the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

Abstract

Post-disaster situational awareness relies heavily on understanding both the extent and the volume of floodwaters. While 2D semantic segmentation provides accurate flood masking, it lacks the vertical dimension required to assess navigability and structural risk. This paper presents a geometric "Water Surface Elevation" approach for estimating flood depth from monocular aerial imagery. Our pipeline utilizes Mask2Former, a state-of-the-art transformer-based segmentation model, to generate precise 2D flood masks. These masks are fused with Digital Elevation Models (DEMs) to identify the water-land boundary, calculate a global water surface elevation ($Z_{water}$), and compute per-pixel depth based on the principle of local hydrostatic equilibrium. We evaluate this workflow using the FloodNet and CRASAR-U-DROIDS datasets, demonstrating how high-performance segmentation can be leveraged to extract 3D volumetric data from 2D imagery without the latency of hydrodynamic simulations.
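
The geometric step described above (mask plus DEM to boundary, water surface elevation, and per-pixel depth) can be sketched with NumPy. The abstract does not say how $Z_{water}$ is aggregated from the boundary, so the median used here is an assumption, as are the function names and the 4-neighbour boundary definition; the segmentation model itself is out of scope and the mask is taken as given.

```python
import numpy as np

def boundary_pixels(mask):
    """Flooded pixels that have at least one non-flooded 4-neighbour."""
    padded = np.pad(mask, 1, mode="edge")   # image border alone does not count as land
    all_neighbours_wet = (
        padded[:-2, 1:-1] & padded[2:, 1:-1] & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return mask & ~all_neighbours_wet

def flood_depth(mask, dem):
    """Per-pixel depth = max(Z_water - DEM, 0) inside the flood mask, 0 elsewhere."""
    edge = boundary_pixels(mask)
    if not edge.any():
        raise ValueError("flood mask has no water-land boundary")
    z_water = float(np.median(dem[edge]))   # assumed estimator for Z_water
    depth = np.where(mask, np.clip(z_water - dem, 0.0, None), 0.0)
    return depth, z_water

# Toy terrain: a small basin whose floor is 1 m below its rim.
dem = np.array([
    [3, 3, 3, 3, 3],
    [3, 2, 2, 2, 3],
    [3, 2, 1, 2, 3],
    [3, 2, 2, 2, 3],
    [3, 3, 3, 3, 3],
], dtype=float)
mask = dem <= 2                             # stand-in for a segmentation output

depth, z_water = flood_depth(mask, dem)
```

In the toy basin the shoreline pixels sit at elevation 2, so the estimated water surface is 2 and only the central pixel has nonzero depth. A robust statistic such as the median is one way to limit the influence of mask errors along the shoreline.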

2605.08520 2026-05-12 cs.LG cs.DC

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

Zhengding Hu, Mingge Lu, Zhen Wang, Jixuan Ruan, Chang Chen, Zaifeng Pan, Yue Guan, Ruiyi Wang, Zhongkai Yu, Chao Zhang, Yufei Ding

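AI summary This paper proposes FlashEvolve, an efficient framework for accelerating the self-evolution of LLM-based agents. It replaces conventional synchronized stage execution with asynchronous workers and queues so that different stages and steps can overlap, improving overall efficiency. To handle the data staleness that asynchrony introduces, FlashEvolve tracks the versions of non-parametric artifacts and applies update, discard, or patch policies accordingly, improving system stability and evolution quality. Experiments show that the method significantly improves proposal throughput on several benchmark tasks.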

Details
English abstract

LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by $3.5\times$ on local vLLM and $4.9\times$ on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.

2605.08519 2026-05-12 cs.LG

SeBA: Semi-supervised few-shot learning via Separated-at-Birth Alignment for tabular data

Kacper Jurek, Wojciech Batko, Marek Śmieja, Marcin Przewięźlikowski

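AI summary This paper studies semi-supervised few-shot learning on tabular data when labeled data are scarce but many unlabeled samples are available. Existing methods rely on self-supervised learning frameworks designed for vision or language, which transfer poorly to tabular data. The paper proposes SeBA, a method that aligns separated views of the data and improves the feature-label relationship without data augmentation. Experiments show that SeBA achieves state-of-the-art performance on multiple benchmark datasets, opening a new research direction for semi-supervised few-shot learning on tabular data.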

Details
English abstract

Learning from scarce labeled data with a larger pool of unlabeled samples, known as semi-supervised few-shot learning (SS-FSL), remains critical for applications involving tabular data in domains like medicine, finance, and science. Existing SS-FSL methods often rely on self-supervised learning (SSL) frameworks developed for vision or language, which assume the availability of a natural form of data augmentation. For tabular data, defining meaningful augmentations is non-trivial and can easily distort semantics, limiting the effectiveness of conventional SSL. In this work, we rethink SSL for tabular data and propose Separated-at-Birth Alignment (SeBA), a joint-embedding framework for SS-FSL that eliminates the dependence on augmentations. Our core idea is to separate the data into two independent but complementary views and align the representations of one view to mirror the nearest-neighbor correspondence of the data in the second view. Our experimental evaluation, supported by a theoretical analysis, shows that SeBA generates an output space that improves the feature-label relationship. An experimental study on various benchmark datasets demonstrates that SeBA achieves state-of-the-art performance in the majority of cases, opening a new avenue for the SS-FSL paradigm in the domain of tabular data.

2605.08518 2026-05-12 cs.AI

Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

Dhaval Patel, Chathurangi Shyalika, Suryanarayana Reddy Yarrabothula, Ling Yue, Shuxin Lin, Nianjun Zhou, James Rayfield

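AI summary This paper revisits the CODS 2025 AssetOpsBench challenge, analyzing its ranking mechanism, the effect of hidden evaluation on results, and which design patterns were rewarded. It finds that performance on the public leaderboard saturates, that public and hidden scores are negatively correlated on execution tasks, and that successful execution methods rely mainly on guardrails rather than novel agent architectures. These results reveal the properties of the competition's evaluation scheme and suggest improvements for future competition design.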

Comments 43 pages, 32 Figures

Details
English abstract

Competition retrospectives are useful when they explain what a leaderboard measured, how hidden evaluation changed conclusions, and which design patterns were rewarded. We revisit the CODS 2025 \assetopslive{} challenge, a privacy-aware Codabench competition on industrial multi-agent orchestration built on \assetops{}. We combine final rank sheets, a 300-submission server log, 149-team registrations, best-submission exports, the organizer winners report, the companion \assetopslive{} system paper, and verified planning-track source trees. Five results stand out. First, the public planning leaderboard saturates at 72.73%, and richer prompts do not improve that peak. Second, hidden evaluation changes the story: public and private scores correlate moderately in planning ($r = 0.69$) but negatively in execution ($r = -0.13$), with several 45.45% public execution systems reaching 63.64% on the hidden set. Third, the \tmatch{} term is numerically almost inert in the official composite -- combined on a 0-1 scale with 0-100 percentage scores, it contributes at most 0.05 points per track, and rescaling would swap the top two teams. Fourth, the competition is operationally account-based but substantively team-based: 149 registered teams reduce to 24 with non-zero public scores and 11 fully ranked, while 52.3% of deduplicated registrations list multiple usernames. Fifth, successful execution methods mostly improve guardrails -- response selection, contamination cleanup, fallback, and context control -- rather than novel agent architectures. These findings identify which behaviors the evaluation rewarded, and motivate scale-aware composites, skill-level diagnostics, and versioned artifact release.

2605.08517 2026-05-12 cs.LG cs.CV physics.med-ph

A Deep Risk Estimator for Known Operator Learning

Andreas Maier, Md Hasan, Paulina Conrad, Paula Andrea Perez-Toro

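AI summary This paper proposes a method for estimating the statistical risk of deep networks that mix known and learned operators. Building on existing maximal training error bounds for known operator learning, the method links the network's expected error to the number of training samples and decomposes the total risk into a sum of contributions from the learned layers: known operators add no risk, while each learned layer introduces an approximation term and an estimation term. The study also shows that the risk upper bound shrinks when a learned layer is replaced by a known operator, and validates the estimator on applications such as CT reconstruction.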

Comments In Review

Details
English abstract

We describe an approach for estimating the statistical risk of deep networks that contain a mix of learned and known operators. Building on the maximal training error bounds previously established for known operator learning, we derive a deep risk estimator that connects the expected error of a layered network to the size of the training sample. The estimator decomposes the total risk into a sum over learned layers; every known operator contributes zero to this sum, while every learned layer adds an approximation term inspired by Barron's classic work and an estimation term that decreases with the number of training samples. We show that the bound shrinks whenever a learned layer is replaced by a known operator and that the corresponding sample requirement scales with the number of trainable parameters of the layer that is replaced. As an application, we use computed tomography as an example and compare an operator-aware filtered backprojection network with a fully connected substitute that collapses the entire reconstruction pipeline into a single learned dense matrix. The predicted parameter ratio coincides with the structural sparsity that the analytic decomposition into a circulant filter and a sparse backprojection exposes. We confirm the predicted scaling on CPU at small image scale and on GPU at medium image scale, with both following the same scaling law. Beyond CT reconstruction, the estimator applies to physics-informed neural networks that hardcode a known physical operation in their architecture, and we expect the result to be of interest for a broad community working on operator-aware deep learning. Calibrating the per-layer constants on each sweep yields a bound that tracks the empirical test MSE within a factor of two at every training-set size, so the estimator can be inverted to predict how many training samples are required to reach a target error.

2605.08516 2026-05-12 cs.AI

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

Darryl Jacob, Xinyu Liu, Muchao Ye, Xiaoyong Yuan, Pan He

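AI summary This paper proposes OracleTSC, a traffic signal control method aimed at improving the stability and interpretability of LLM-based reinforcement learning for traffic signal control. By introducing a reward hurdle mechanism and uncertainty regularization, it filters weak learning signals and encourages consistent decisions, which stabilizes training. Experiments show that OracleTSC significantly improves traffic efficiency on the LibSignal benchmark while preserving interpretability through natural-language explanations, and generalizes well across intersections.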

Comments Published in Transactions on Machine Learning Research

Details
English abstract

Transparent decision-making is essential for traffic signal control (TSC) systems to earn public trust. However, traditional reinforcement learning-based TSC methods function as black boxes with limited interpretability. Although large language models (LLMs) can provide natural language reasoning, reinforcement finetuning for TSC remains unstable because feedback is sparse and delayed, while most actions produce only marginal changes in congestion metrics. We introduce OracleTSC, which stabilizes LLM-based TSC through two mechanisms: (1) a reward hurdle mechanism that filters weak learning signals by subtracting a calibrated threshold from environmental rewards, and (2) uncertainty regularization that maximizes the probability of the selected response to encourage consistent decisions across sampled outputs. Experiments on the LibSignal benchmark show that OracleTSC enables a compact LLaMA3-8B model to substantially improve traffic efficiency, achieving a 75% reduction in travel time and a 67% decrease in queue length compared with the pretrained baseline while preserving interpretability through natural language explanations. OracleTSC also demonstrates strong cross-intersection generalization: a policy trained on one intersection transfers to a structurally different intersection with 17% lower travel time and 39% lower queue length without additional finetuning. These results suggest that uncertainty-aware reward shaping can improve the stability and effectiveness of reinforcement fine-tuning for TSC.
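
The two mechanisms named in this abstract can be sketched for a single sampled response. This is a minimal illustration under stated assumptions: the policy term uses a generic REINFORCE-style form, the regularizer is a weighted negative log-probability of the selected response, and the names `shaped_reward`, `sample_loss`, and `reg_weight` are hypothetical. The abstract does not specify how the hurdle is calibrated or how the terms are combined.

```python
import math


def shaped_reward(reward, hurdle):
    """Reward hurdle: subtract a threshold so that marginal improvements
    below the hurdle become negative learning signals."""
    return reward - hurdle


def sample_loss(reward, hurdle, prob_selected, reg_weight=0.1):
    """Loss for one sampled response (illustrative form, not the paper's).

    prob_selected: model probability of the selected response, in (0, 1].
    reg_weight:    hypothetical weight on the uncertainty regularizer.
    """
    log_p = math.log(prob_selected)
    # REINFORCE-style term: raise log-probability when the shaped reward
    # is positive, lower it when the reward falls below the hurdle.
    policy_term = -shaped_reward(reward, hurdle) * log_p
    # Uncertainty regularization: minimizing -log p maximizes the
    # probability of the selected response.
    uncertainty_term = -reg_weight * log_p
    return policy_term + uncertainty_term
```

With this form, an action whose environmental reward sits below the hurdle produces a negative shaped reward, which is one way to read the abstract's claim that weak learning signals are filtered.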

2605.08515 2026-05-12 cs.LG cs.RO

Quantile-Coupled Flow Matching for Distributional Reinforcement Learning

Michael Groom, Victor-Alexandru Darvariu, Lars Kunze, James Wilson, Nick Hawes

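AI summary Unlike standard expected-return reinforcement learning, distributional reinforcement learning (DRL) models the full return distribution, making it better suited to uncertainty-aware and risk-sensitive decision-making. This paper proposes FlowIQN, a critic based on conditional flow matching (CFM) that sorts source samples and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling, yielding a flow-matching loss aligned with the Wasserstein distance. The method provides the first explicit Wasserstein-aligned projection guarantee for a CFM critic and shows superior distributional accuracy and performance on several offline reinforcement learning benchmarks.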

Details
English abstract

Unlike standard expected-return Reinforcement Learning (RL), Distributional RL (DRL) models the full return distribution, making it better-suited for uncertainty-aware and risk-sensitive decision-making. Conditional Flow Matching (CFM) critics have recently attracted attention for modelling continuous, multi-modal return distributions. Despite this interest, there remains a substantial metric mismatch: DRL theory relies on the distributional Bellman operator being contractive in the $p$-Wasserstein distance, yet existing CFM critics are trained with arbitrary source-target couplings, so their flow-matching losses are not Wasserstein-aligned surrogates for matching Bellman target return distributions. In this work, we address this mismatch by proposing FlowIQN, a CFM critic that sorts source and Bellman target samples within each mini-batch to approximate the monotone optimal transport coupling, replacing arbitrary pairings with quantile-aligned flow paths. We prove that the loss of our quantile-coupled CFM critic yields a Wasserstein-aligned approximate projection compatible with the foundations of DRL. To our knowledge, FlowIQN is the first flow-matching distributional critic with an explicit Wasserstein-aligned projection guarantee. We further extend FlowIQN with shortcut models for efficient inference. Empirical results show that FlowIQN improves Wasserstein return-distribution accuracy over other CFM critics. It also yields competitive performance on offline RL benchmarks across multiple policy extraction methods, providing a theoretically grounded CFM critic that is readily compatible with DRL pipelines. Code: https://github.com/ori-goals/flowIQN.
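
The sorting-based coupling in this abstract can be illustrated for scalar returns. For one-dimensional samples of equal size, pairing the i-th smallest source sample with the i-th smallest target sample is the monotone optimal transport coupling. The sketch below builds that pairing and the regression targets of a straight-line flow path. The function names and the linear interpolation path are assumptions for illustration, and no network or Bellman backup is included.

```python
import random


def quantile_coupled_pairs(source, target):
    """Pair the i-th smallest source sample with the i-th smallest target."""
    if len(source) != len(target):
        raise ValueError("source and target batches must have equal size")
    return list(zip(sorted(source), sorted(target)))


def cfm_training_points(pairs, rng):
    """For each pair (x0, x1), draw t ~ U(0, 1) and return (t, x_t, velocity).

    Assumes the straight-line path x_t = (1 - t) * x0 + t * x1, whose
    velocity regression target is the constant x1 - x0.
    """
    points = []
    for x0, x1 in pairs:
        t = rng.random()
        x_t = (1.0 - t) * x0 + t * x1
        points.append((t, x_t, x1 - x0))
    return points


def transport_cost(pairs, p=2):
    """Empirical p-th root of the mean |x0 - x1|^p over the pairs."""
    return (sum(abs(a - b) ** p for a, b in pairs) / len(pairs)) ** (1.0 / p)


if __name__ == "__main__":
    source = [0.0, 2.0, 1.0]   # e.g. noise samples
    target = [5.0, 3.0, 4.0]   # e.g. Bellman target return samples
    arbitrary = list(zip(source, target))
    coupled = quantile_coupled_pairs(source, target)
    print(transport_cost(arbitrary), transport_cost(coupled))
    print(cfm_training_points(coupled, random.Random(0)))
```

On this toy batch the sorted pairing has a lower quadratic transport cost than the arbitrary pairing, which is the property that makes the resulting flow paths quantile-aligned.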

2605.08513 2026-05-12 cs.CL cs.AI cs.LG

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

Hamid Kazemi, Atoosa Chegini, Maria Safi

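AI summary This study examines the fragility of safety alignment in large language models and finds that it relies on two mechanistically distinct kinds of neurons: refusal neurons, which control whether harmful knowledge is output, and concept neurons, which encode the harmful content itself. By manipulating a single neuron in each of the two systems, the study demonstrates two ways of bypassing safety alignment, namely a suppression-based bypass on explicitly harmful requests and amplification of harmful content from innocuous prompts, and verifies the phenomenon across multiple models of different sizes. The results indicate that safety alignment is not broadly distributed across model weights but is controlled by individual neurons, each of which alone can determine whether harmful output is suppressed.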

Details
English abstract

Safety alignment in language models operates through two mechanistically distinct systems: refusal neurons that gate whether harmful knowledge is expressed, and concept neurons that encode the harmful knowledge itself. By targeting a single neuron in each system, we demonstrate both directions of failure -- bypassing safety on explicit harmful requests via suppression, and inducing harmful content from innocent prompts via amplification -- across seven models spanning two families and 1.7B to 70B parameters, without any training or prompt engineering. Our findings suggest that safety alignment is not robustly distributed across model weights but is mediated by individual neurons that are each causally sufficient to gate refusal behavior -- suppressing any one of the identified refusal neurons bypasses safety alignment across diverse harmful requests.

2605.08511 2026-05-12 cs.RO

Trajectory-Consistent Flow Matching for Robust Visuomotor Policy Learning

Riad Ahmed, Sujosh Nag, Moniruzzaman Akash, Mostafa Hussein, Momotaz Begum

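AI summary This study addresses the trajectory errors that flow matching policies incur in robot manipulation because the training objective does not match what inference requires, and proposes a trajectory-consistent flow matching method. Four complementary improvements, namely temporal consistency supervision, displacement constraints over trajectory segments, smoothness regularization of the velocity field, and higher-order Runge-Kutta inference, improve the policy's robustness and its ability to execute long-horizon tasks. Experiments show that the method significantly outperforms existing methods on several real-robot tasks, particularly on long-horizon multi-phase tasks.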

Details
English abstract

Flow matching policies learn continuous velocity fields that transport noise to actions, enabling fast deterministic inference for robot manipulation. However, standard training optimizes a pointwise velocity objective while inference requires numerical integration of that field -- a mismatch that causes compounding trajectory errors. We propose four complementary remedies: (1) auxiliary rectified flow velocity regression that provides uniform temporal supervision across the full time interval; (2) multi-step trajectory consistency training that supervises the integrated displacement of the velocity field over trajectory segments, directly closing the train-inference gap; (3) velocity field regularization that enforces temporal smoothness, preventing oscillations that destabilize integration; and (4) fourth-order Runge-Kutta (RK4) inference that reduces global discretization error by orders of magnitude over Euler methods. Critically, these components are not independently sufficient -- RK4 without a smooth velocity field fails, and smoothness without trajectory-level supervision still drifts, as our ablation study confirms. We further pair these with a dual-view 3D point cloud encoder using two independent PointNet encoders for complementary spatial perception. On four real-robot tasks across a Franka arm and a Boston Dynamics Spot, our method achieves 70% and 60% overall success on two long-horizon multi-phase tasks where both baselines score 0%, and reaches 100% on precision tool placement. Three MetaWorld simulation tasks confirm consistent improvements, validating that trajectory-level supervision is essential for reliable policy execution.
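
The gap between Euler and RK4 integration mentioned in this abstract can be seen on a toy problem. The sketch below integrates a scalar velocity field from t = 0 to t = 1 with both schemes. The field `toy_field` is a stand-in with a known exact solution, not a learned policy, and all function names are illustrative.

```python
import math


def euler_integrate(v, x0, steps):
    """First-order Euler integration of dx/dt = v(x, t) over [0, 1]."""
    h = 1.0 / steps
    x, t = x0, 0.0
    for _ in range(steps):
        x += h * v(x, t)
        t += h
    return x


def rk4_integrate(v, x0, steps):
    """Classical fourth-order Runge-Kutta integration over [0, 1]."""
    h = 1.0 / steps
    x, t = x0, 0.0
    for _ in range(steps):
        k1 = v(x, t)
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = v(x + h * k3, t + h)
        x += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
        t += h
    return x


def toy_field(x, t):
    """Stand-in velocity field: dx/dt = x, so x(1) = x0 * e exactly."""
    return x


if __name__ == "__main__":
    exact = math.e
    for steps in (5, 10, 20):
        e_err = abs(euler_integrate(toy_field, 1.0, steps) - exact)
        r_err = abs(rk4_integrate(toy_field, 1.0, steps) - exact)
        print(steps, e_err, r_err)
```

RK4 evaluates the field four times per step, so at a fixed number of steps it costs four times as many network calls as Euler. On a smooth field like this one, its global error is several orders of magnitude smaller, which matches the abstract's point that the benefit depends on the velocity field being smooth.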

2605.08505 2026-05-12 cs.LG cs.AI math.PR math.ST stat.TH

Scaling Limits of Long-Context Transformers

Giuseppe Bruno, Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet

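AI summary This paper studies the attention mechanism of long-context Transformers with a fixed query and a random context, analyzing how the inverse temperature $β_n$ affects attention behavior. It shows that the critical scale at which selectivity emerges is determined by the local exponent of the distance distribution rather than by global features. The study also characterizes the limiting distributions of the attention weights and output across the regimes of $β_n$, covering the subcritical, critical, and supercritical cases, and notes that in the subcritical case with an identity value matrix the attention map approximately implements a backward heat equation.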

Comments 40 pages, 4 figures

Details
English abstract

We study the long-context limit of softmax self-attention with a fixed query and a random context of $n$ i.i.d. keys on the sphere, viewing the inverse temperature $β_n$ as the scaling parameter that decides whether attention degenerates into uniform averaging or collapses onto the single closest key. We show that the critical scale at which selectivity emerges is determined by the local exponent of the distance-to-query distribution near zero rather than by global features of the context, and scales like $β_n^\ast \asymp n^{2/(d-1)}$ for uniform keys on $\mathbb{S}^{d-1}$. Furthermore, we characterize the limiting laws of the ordered attention weights and of the attention output across all regimes of $β_n$: a subcritical regime in which the output reduces to a local average around $q$ with explicit deterministic bias and Gaussian fluctuations; a critical regime in which a finite collection of nearest keys retains macroscopic mass without single-key collapse; and a supercritical regime in which all mass concentrates on the closest key. Of notable interest is the subcritical case with identity value matrix where the attention map approximately implements a backward heat equation.
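
The two degenerate behaviours described in this abstract, uniform averaging and collapse onto the closest key, can be observed numerically. The sketch below computes softmax weights $\propto \exp(β \langle q, k_i \rangle)$ for a fixed query and random keys on the sphere. The function names, the choice $d = 3$, and the sample size are illustrative assumptions. The middle value of $β$ follows the abstract's scale $n^{2/(d-1)}$ with the constant set to one.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a uniform point on the sphere S^{d-1} by normalizing a Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def attention_weights(query, keys, beta):
    """Softmax attention weights with inverse temperature beta."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    d, n = 3, 200
    query = [1.0, 0.0, 0.0]
    keys = [random_unit_vector(d, rng) for _ in range(n)]
    critical = n ** (2.0 / (d - 1))  # scale from the abstract, constant = 1
    for beta in (0.0, critical, 1e6):
        weights = attention_weights(query, keys, beta)
        print(beta, max(weights))
```

At $β = 0$ every weight equals $1/n$. As $β$ grows far beyond the critical scale, the largest weight approaches one, while values near the critical scale leave a handful of nearest keys with non-negligible mass.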

2605.08503 2026-05-12 cs.CL cs.CY cs.HC

NARRA-Gym for Evaluating Interactive Narrative Agents

Yue Huang, Yuchen Ma, Jiayi Ye, Wenjie Wang, Zipeng Ling, Xingjian Hu, Yuexing Hao, Zichen Chen, Zhangchen Xu, Yunhong He, Zhengqing Yuan, Yujun Zhou, Kehan Guo, Chaoran Chen, Toby Jia-Jun Li, Stefan Feuerriegel, Xiangliang Zhang

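AI summary This paper introduces NARRA-Gym, an executable environment for evaluating interactive narrative agents. It tests the ability of large language models to generate coherent stories over multi-turn interaction, manage long-term state, simulate characters, personalize their responses, and produce story-related artifacts. The environment generates a complete story from a sparse emotional seed and logs the model's full trajectory through story construction, memory updates, pacing control, and related steps. Experiments show that models differ markedly in story fluency, robustness, user experience, and other dimensions, highlighting interactive narrative as an effective benchmark for evaluating long-horizon, user-adaptive LLM behavior.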

Details
English abstract

Interactive narrative tasks require LLMs to sustain a coherent, evolving story while adapting to a user over multiple turns. However, suitable benchmarks for this setting are limited: existing evaluations often focus on static prompts, isolated story generations, or post-hoc ratings, and therefore miss whether models can jointly manage story generation, long-context state and pacing, character simulation, empathic personalization, and story-grounded artifacts. We introduce NARRA-Gym, an executable evaluation environment that turns a sparse emotional seed into a complete interactive story episode and logs the full model-in-the-loop trajectory, including story construction, memory updates, planning, pacing interventions, and optional artifact synthesis. We evaluate nine frontier LLMs using a controlled LLM-as-judge sweep over eight benchmark personas and a human evaluation in which participants rate customized model outputs. Our results show substantial variation across models, personas, and evaluation dimensions: models that produce fluent stories can still fail on robustness, user experience, or resistance-sensitive personalization. These findings suggest that interactive narrative offers a useful benchmark for evaluating long-horizon, user-adaptive LLM behavior beyond isolated story quality.

2605.08498 2026-05-12 cs.LG cs.AI cs.LO

MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs

Viresh Pati, Zhengyu Li, Piyush Jha, Rahul Garg, Yatharth Sejpal, Vijay Ganesh

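AI summary This paper proposes MathConstraint, a challenging adaptive benchmark for evaluating the combinatorial reasoning capabilities of large language models. The benchmark combines constraint satisfaction problems with solver-based verification and uses an adaptive generator that keeps producing instances that remain hard as models' reasoning improves. Experiments show that even frontier models with access to a tool environment see their accuracy drop substantially on MathConstraint, demonstrating the benchmark's robustness to model progress and revealing how sensitive performance is to the number of tool calls.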

Details
English abstract

We introduce MathConstraint, a hard, adaptive benchmark for evaluating the combinatorial reasoning capabilities of LLMs. We combine constraint satisfaction problems with rigorous solver-based verification and design an adaptive generator to create instances that remain challenging as the LLMs improve in their reasoning capabilities. Unlike existing benchmarks that quickly saturate on fixed datasets or use LLM-as-a-judge for checking solutions, MathConstraint uses parameterized problem types that enable scalable generation of arbitrarily difficult and automatically verifiable instances. We release MathConstraint-Easy ($266$ instances), on which frontier models achieve between $72.6\%$ (gemini-3.1-flash-lite) and $87.6\%$ (gpt-5.5) accuracy, and MathConstraint ($329$ instances) on which the same models drop to between $18.5\%$ (claude-4.6-sonnet) and $66.9\%$ (gpt-5.5) accuracy, demonstrating the resilience of our benchmark generator against rapid progress in LLM reasoning capabilities. We evaluate 12 frontier and open-weight models with and without access to a sandboxed Python environment that includes generic SAT/SMT solvers. Tool access roughly doubles frontier accuracy on MathConstraint (mean $+28$pp; up to $+52$pp for claude-4.6-sonnet). Further, halving the tool-call budget from $8$ to $4$ rounds erases up to $37$ points -- a sensitivity that most single-budget benchmarks miss. We release the generator, dataset, and evaluation harness as a robust environment for studying combinatorial reasoning and tool-use behavior under adversarially-tunable difficulty.