arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.07342 2026-05-11 cs.LG cs.AI cs.SE

Mage: Multi-Axis Evaluation of LLM-Generated Executable Game Scenes Beyond Compile-Pass Rate

Hugh Xuechen Liu, Kıvanç Tatar

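AI summary: This study proposes Mage, a multi-axis evaluation framework for assessing the quality of executable game scenes generated by large language models, going beyond the conventional compile-pass-rate metric. Using four axes (compile success, runtime success, structural fidelity, and mechanism adherence), it systematically evaluates Unity game scenes generated by several LLMs and reveals a negative correlation between compile-pass rate and functional correctness. The experiments show that relying on compile-pass rate alone misleads judgments of the generated output, whereas multi-axis evaluation reflects model performance in complex domains more accurately.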

Comments Main Content: 10 pages, 1 figure. In total 22 pages

Abstract

Compile-pass rate is the dominant evaluation signal for LLM code generation, yet for multi-component domain-specific artifacts it can be actively misleading. We demonstrate this on executable game scene synthesis with a four-axis evaluation protocol (named 'Mage') -- compile success, runtime success, structural fidelity, and mechanism adherence -- applied to 858 generation attempts across four open-weight LLMs (7B-30B), 26 hand-crafted Unity goal pattern playable concepts, and two automatically extracted IR granularity levels. Direct NL-to-C# generation achieves the highest runtime-pass rate (43% mean) yet produces structurally vacuous scenes (mechanism $F_1 \approx 0.12$). Structural IR conditioning halves the runtime rate but recovers domain-faithful structure ($F_1$ up to 1.00). Within IR conditioning, behavior-only and full-scene granularity are statistically indistinguishable (McNemar $p = 1.0$), indicating input-level granularity saturation. These results show that compile rate is anti-correlated with functional correctness in this domain and that multi-axis evaluation is necessary to detect the divergence. We release the benchmark, replay logs, and per-record metrics for independent verification.
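
The two statistics quoted in this abstract can be illustrated with a minimal sketch. It assumes that mechanism F1 is a set-overlap F1 between expected and detected mechanisms and that the McNemar test is the exact two-sided variant; the abstract does not state either detail, and the function names are hypothetical.

```python
from math import comb

def set_f1(expected, detected):
    """F1 between the set of expected mechanisms and the set detected in a scene."""
    expected, detected = set(expected), set(detected)
    true_pos = len(expected & detected)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(detected)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: items that pass under condition A only; c: items that pass under B only.
    Assumption: exact binomial variant, not the chi-square approximation.
    """
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

In this sketch, balanced discordant counts such as `mcnemar_exact(7, 7)` give a p-value of 1.0, which is one way a paired comparison can come out as statistically indistinguishable.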

2605.07339 2026-05-11 cs.AI

Tools as Continuous Flow for Evolving Agentic Reasoning

Tairan Huang, Siyu Shang, Qiang Chen, Xiu Su, Yi Chen

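AI summary: This paper proposes FlowAgent, a new method that addresses the error accumulation and limited generalization caused by the step-wise paradigm in LLM tool chaining. It recasts tool-chain execution as continuous trajectory generation in a semantic space and generates latent trajectories via conditional flow matching, providing a global planning perspective that keeps tool execution coherent and robust. Theoretical analysis establishes the method's advantages in utility convergence and error attenuation, and experiments show greater robustness and adaptability on long-horizon reasoning tasks.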

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in orchestrating tools for reasoning tasks. However, existing methods rely on a step-wise paradigm that lacks a global perspective, which causes error accumulation over long horizons and restricts generalization to unseen tools. To overcome these limitations, we propose Tools as Continuous Flow for Evolving Agentic Reasoning (FlowAgent), which reconceptualizes tool chaining as continuous trajectory generation within a semantic space. To systematically evaluate this paradigm, we introduce the first plan-level closed-loop benchmark dedicated to plan-level agentic reasoning in dynamic real-world environments. Specifically, the proposed FlowAgent leverages conditional flow matching to generate continuous latent trajectories, providing a global planning perspective to ensure coherent and robust tool execution. Theoretically, we establish formal bounds on utility convergence and prove that our continuous formulation fundamentally guarantees robust generalization and error attenuation. Empirical evaluations show that FlowAgent achieves superior robustness and adaptability in long-horizon reasoning tasks.

2605.07338 2026-05-11 cs.CV

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

Ziheng Zhou, Yang Wang, Nan Wang, Chengliang Wu, Jun Yan

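AI summary: As global shellfish biodiversity declines, effective ecological monitoring systems become increasingly important. To address the poor adaptation of existing datasets to real underwater environments, the team built ShellfishNet, a marine mollusc image benchmark of 8,691 images across 32 taxa that supports the evaluation and optimization of visual recognition models. The dataset was built through field photography and web scraping, covers samples from complex environments, and adds image-degradation tests to assess model robustness under turbid water and severe weather, providing a reliable data foundation and evaluation benchmark for intelligent ecological monitoring.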

Abstract

The decline of global shellfish biodiversity poses a severe threat to coastal ecosystems. Although artificial intelligence (AI) technologies show potential for automated ecological monitoring, existing marine benthic datasets often lack adaptation to the complexities of real underwater environments (e.g., variable lighting conditions and diverse species postures), posing challenges for the robust generalization of vision models in practical ecological monitoring. To address this problem, we construct ShellfishNet, a comprehensive image benchmark dataset designed specifically for real-world ecological monitoring constraints. Comprising 8,691 images across 32 taxa, this dataset includes a curated subset annotated with descriptive captions. It is constructed through field photography and web scraping, encompassing samples from complex real-world environments. Based on this benchmark, we systematically evaluate 80 representative neural network models, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), State Space Models (SSMs), and Self-Supervised Learning (SSL) methods. Furthermore, we evaluate the performance of fine-grained visual categorization (FGVC) models and investigate the image captioning capabilities of several mainstream multimodal large language models (MLLMs). Meanwhile, we introduce image corruption benchmark tests to simulate common underwater degradation scenarios (turbidity, severe weather) and assess the robustness of vision models, enabling trustworthy decisions on ecological protection in the wild. ShellfishNet is dedicated to providing a data foundation and a model-evaluation benchmark for the intelligent monitoring of benthic organisms.

2605.07335 2026-05-11 cs.LG cs.SE

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

Mengran Li, Bo Li, Jiaying Wang, Wenbin Xing, Yixuan Dong, Chengyang Zhang, Hongliang Zhang, Yuzhong Peng, Jinlin Wu, Bob Zhang, Bingo Wing-Kuen Ling, Fuji Yang, Zhen Lei, Jiebo Luo, Zelin Zang

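AI summary: This study proposes CellScientist, a dual-space hierarchical framework for closed-loop refinement of virtual cell models. By coupling a high-level hypothesis space with a low-level executable implementation space, it represents modeling decisions as structured states, generates executable programs from them, and routes discrepancies observed during execution back to the hypothesis or implementation level for targeted revision. Experiments show that it outperforms existing baselines on benchmarks such as morphology and transcriptomics, while the refinement process it produces is traceable and auditable.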

Abstract

Virtual Cell Modeling (VCM) requires models that not only predict perturbation responses, but also support targeted revision when predictions fail. Current LLM-assisted modeling workflows face a refinement-routing problem: prediction discrepancies are observed through executable implementations, but the relevant revision may involve the modeling assumption, representation design, implementation, or task constraint. Without structured feedback propagation across these levels, iterative refinement may repair code while failing to revise the assumption responsible for the discrepancy. We propose CellScientist, a dual-space hierarchical framework that couples a high-level hypothesis space with a low-level executable implementation space. CellScientist represents modeling decisions as structured states, realizes them as admissible programs under task and interface constraints, and routes execution discrepancies back to targeted hypothesis or implementation updates. This enables a closed Hypothesis -> Implementation -> Hypothesis loop where failures become structured signals for model refinement rather than debugging events. Across morphology and transcriptomic benchmarks, with additional single-cell perturbation evaluations, the final executable models selected by CellScientist improve over reference baselines under fixed split and evaluation protocols, while the workflow produces auditable refinement traces.

2605.07334 2026-05-11 cs.CV

RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

Junwei Wen, Deshui Miao, Guangming Lu, Xin Li, Wenjie Pei

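AI summary: Video Reasoning Segmentation (VRS) aims to segment target objects in videos from implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames, but limited supervision and frame-language similarity rules often narrow the range of keyframe choices, which weakens holistic temporal understanding and makes localization unstable in complex multi-object scenes. The paper therefore proposes RCoT-Seg, a framework that splits VRS into two stages, temporal video reasoning (TVR) and keyframe target perception (KTP), optimizes keyframe selection with reinforcement learning, and combines high-resolution segmentation with mask propagation to improve temporal reasoning and segmentation accuracy.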

Comments 21 pages

Abstract

Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is proposed to generate and reselect the keyframe through self-evaluation, strengthening moment localization and temporal reasoning. In the KTP stage, RCoT-Seg performs high-resolution segmentation on the selected frame and propagates masks with SAM2-based methods across the sequence, replacing heuristic sampling and external selectors while improving spatial precision and inter-frame consistency. Extensive experimental results demonstrate that the proposed RCoT-Seg achieves favorable performance against the state-of-the-art methods. The code and models will be publicly released at https://github.com/Victor-wjw/RCoT-Seg.

2605.07331 2026-05-11 cs.LG cs.AI

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Yuheng Zhang, Chenlu Ye, Shuowei Jin, Changlong Yu, Wei Xiong, Saurabh Sahu, Nan Jiang

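AI summary: This paper rethinks importance sampling in LLM policy optimization and proposes a solution based on a cumulative token perspective to resolve the fundamental bias-variance tension in existing methods. It proves that the cumulative token importance ratio provides an unbiased prefix correction for each token-level gradient term and has lower variance than the full-sequence ratio. Building on this, the authors propose CTPO, which combines the cumulative token ratio with position-adaptive clipping and improves model performance on tasks such as mathematical reasoning.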

Abstract

Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma: token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch; full sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios, while GSPO (Zheng et al., 2025) improves numerical stability via length normalization at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to position $t$, as a theoretically principled solution to this dilemma. We prove that, under the token-level policy-gradient formulation, this ratio provides an unbiased prefix correction for each token-level gradient term and has strictly lower variance than the full sequence ratio. Building on this insight, we propose CTPO (Cumulative Token Policy Optimization), which combines the cumulative token IS ratio with position-adaptive clipping that scales log-space clip bounds according to the natural $\sqrt{t}$ growth of the cumulative log-ratio. This yields more consistent regularization across token positions. We implement and evaluate CTPO in the tool-integrated reasoning setting on several challenging mathematical reasoning benchmarks, achieving the best average performance across both model scales compared with strong GRPO and GSPO baselines. Code will be available at https://github.com/horizon-llm/CTPO.
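
The cumulative token IS ratio and the $\sqrt{t}$-scaled clip bound described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the symmetric bound `eps * sqrt(t)`, the hard clamp applied directly to the log-ratio, and both function names are assumptions made for the example.

```python
import math

def cumulative_log_ratios(logp_new, logp_old):
    """Running sum of per-token log-ratios: log of the product of ratios up to t."""
    out, total = [], 0.0
    for new, old in zip(logp_new, logp_old):
        total += new - old
        out.append(total)
    return out

def clipped_cumulative_ratios(logp_new, logp_old, eps=0.2):
    """Cumulative IS ratios with a log-space clip bound that grows like sqrt(t).

    Assumption: symmetric bound eps * sqrt(t) and a hard clamp; the paper's
    actual clipping rule may differ.
    """
    ratios = []
    cumulative = cumulative_log_ratios(logp_new, logp_old)
    for t, log_ratio in enumerate(cumulative, start=1):
        bound = eps * math.sqrt(t)
        ratios.append(math.exp(max(-bound, min(bound, log_ratio))))
    return ratios
```

With a fixed bound, late positions would be clipped far more often than early ones simply because the cumulative log-ratio has more terms; scaling the bound with position is what keeps the regularization comparable across the sequence.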

2605.07330 2026-05-11 cs.LG cs.AI cs.DC

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

Lucas Hu, Ranchi Zhao, Isaac Zhu, Zach Zhang, Hscos Zhang, Hugh Yin, Jason Zhao

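AI summary: In large-scale reinforcement learning systems with a decoupled trainer-rollout architecture, weight-synchronization traffic becomes a key bottleneck for system performance. This paper proposes SparseRL-Sync, which exploits the fact that parameter changes are highly sparse at the element level and replaces full-weight transfers with only the indices and values of the changed parameters, giving a lossless sparse update that cuts communication volume by about 100x. The method improves the scalability and end-to-end efficiency of RL systems in bandwidth-limited and asynchronous settings.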

Comments Code will be released at https://github.com/scitix/helix

Abstract

In large-scale reinforcement learning (RL) systems with decoupled Trainer-Rollout execution, the Trainer must regularly synchronize policy weights to the Rollout side to limit policy staleness. When inter-node bandwidth is abundant, such synchronization is usually only a small fraction of end-to-end cost. As model size grows, however, the communication demand rises rapidly. In bandwidth-constrained or network-variable deployments -- for example, cross-datacenter or cross-cluster settings, heterogeneous resource pools, and online RL -- weight synchronization can become a dominant bottleneck for throughput and tail latency. We observe that, in mainstream large-model RL training, the locations where parameters actually change are highly sparse at the element level (often 99%+ sparsity). Building on this observation, we propose and implement SparseRL-Sync, which replaces full-weight transfers with a lossless sparse update payload (indices and values) that can be exactly reconstructed on the inference side, thereby preserving 100% fidelity. Under a simplified cost model, sparse synchronization reduces the per-update communication volume from S to approximately S/X; with 99% sparsity (X ~ 100), this yields about a 100x reduction in transmitted data. Combined with appropriate bucketing, SparseRL-Sync also reduces launch and control-plane overhead, significantly improving scalability and end-to-end efficiency in bandwidth-limited and highly asynchronous RL settings.
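
The index-and-value payload described above can be sketched on flat Python lists. This is a simplified illustration under stated assumptions: real systems operate on sharded tensors with bucketing, and the function names here are hypothetical rather than taken from the released code.

```python
def make_sparse_update(old_weights, new_weights):
    """Collect the positions whose value changed, plus their new values.

    Sending new values (not differences) makes reconstruction exact.
    Note: NaN entries compare unequal to themselves and would always be sent.
    """
    indices = [i for i, (old, new) in enumerate(zip(old_weights, new_weights))
               if old != new]
    values = [new_weights[i] for i in indices]
    return indices, values

def apply_sparse_update(weights, indices, values):
    """Overwrite the changed positions in place on the receiving side."""
    for i, value in zip(indices, values):
        weights[i] = value
    return weights

# Hypothetical example: 1,000 weights, 10 of which change (99% sparsity).
old = [0.0] * 1000
new = list(old)
for i in range(0, 1000, 100):
    new[i] = 0.5
indices, values = make_sparse_update(old, new)
rollout_copy = apply_sparse_update(list(old), indices, values)
reduction = len(old) / len(indices)  # counts values only, as in a simplified cost model
```

Here `rollout_copy` equals `new` element for element, and `reduction` is 100 when index overhead is ignored, matching the S to S/X estimate in the abstract.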

2605.07329 2026-05-11 cs.CV

GC-ART: Global Learnable Second-Order Rational Tone Curves for Illumination Robustness

Wei Huang, Joyce Huang

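AI summary: This paper proposes GC-ART, a lightweight differentiable pre-processing module that improves the robustness of image classification to illumination changes. GC-ART uses a 643-parameter multilayer perceptron to predict an endpoint-pinned rational tone curve from per-channel soft histograms and applies the curve before the classifier. Experiments show that GC-ART performs well under several illumination-degradation scenarios and is computationally efficient, using far fewer FLOPs than convolutional enhancers.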

Abstract

We introduce GC-ART (Global Curve Adaptive Rational Tone-mapping), a lightweight differentiable pre-processing module for robust image classification. GC-ART predicts an endpoint-pinned rational tone curve from per-channel soft histograms using a 643-parameter MLP, then applies the curve pointwise before the classifier. The module is trained end-to-end with cross-entropy and a soft monotonicity penalty. On CIFAR-10 with a CIFAR-style ResNet-18, GC-ART matches clean accuracy with the unenhanced baseline and other learned enhancers, improves over the baseline on multiplicative darkening, and achieves the best learned-method result on contrast corruption (48.45% vs. 46.27% for the baseline and 47.13% for Zero-DCE++). These results suggest that histogram-conditioned rational curves can learn useful global tone corrections, including contrast-expanding behavior, while preserving edge locations by construction through pointwise mapping. GC-ART also uses substantially fewer FLOPs than convolutional learned enhancers at 32 x 32. The current hyperparameters are untuned, leaving room for systematic improvement.

2605.07327 2026-05-11 cs.CV

Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations

Yuan Zhang, Chenyi Li, Guoqing Ma, Jiajun Zha, Yuanming Yang, Bo Wang, Wei Tang, Wenbo Li, Haoyang Huang, Nan Duan

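AI summary: This study proposes a one-step distillation method built on pretrained diffusion models. By using the teacher model's intermediate hidden states directly as the feature representation, it removes the complex network structures and training pipelines that conventional distillation requires. The study uses a drifting loss together with a lightweight mode coverage loss to improve the quality and diversity of generated images. Experiments on ImageNet and SDXL yield competitive FID scores and markedly improve distillation efficiency.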

Abstract

Sampling from pretrained diffusion and flow-matching models typically requires many forward passes to generate diverse and high-fidelity images. Existing distillation methods often rely on multiple auxiliary networks, carefully designed training stages, or complex optimization pipelines. In this work, we revisit the recently proposed Drifting Model objective and show that a single drifting loss can be directly used to simplify one-step distillation. A key observation is that the pretrained diffusion teacher itself already provides a strong representation space. Unlike the original Drifting Model, which relies on an additional pretrained feature extractor, we use intermediate hidden states of the pretrained teacher model as the feature representation. This removes the need for training or introducing an extra representation network while preserving a semantically meaningful feature geometry for drifting. Furthermore, we introduce a lightweight mode coverage loss to mitigate mode collapse during distillation and encourage the student generator to cover diverse teacher-supported regions. Extensive experiments on ImageNet and SDXL demonstrate that our method achieves efficient one-step generation with competitive image quality and diversity, achieving FID scores of 1.58 on ImageNet-64$\times$64 and 18.4 on SDXL, while substantially simplifying the overall distillation framework.

2605.07326 2026-05-11 cs.CV

GEM: Generating LiDAR World Model via Deformable Mamba

Yang Wu, Zhaojiang Liu, Qiang Meng, Youquan Liu, Renliang Weng, Jianjun Qian, Jian Yang, Jin Xie

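AI summary: This paper proposes GEM, a generative LiDAR world model based on a deformable Mamba architecture that aims to improve the simulation and perception of LiDAR point cloud data in autonomous driving. To handle the inherent disorder of LiDAR point clouds and the difficulty of separating dynamic objects from static structures, GEM combines a custom LiDAR scene tokenizer and a dynamic-static separator with a tri-path deformable Mamba network, yielding a more accurate understanding of the spatio-temporal evolution of the environment. Experiments show that GEM performs strongly on multiple benchmarks and demonstrate its strength in generating high-fidelity scenes and predicting "what-if" scenarios.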

Abstract

World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM, a generative LiDAR world model that leverages a deformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performance across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM.

2605.07325 2026-05-11 cs.RO cs.AI

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

Robin Karlsson, Go Suzui

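AI summary: This paper studies how to deploy massive language models efficiently as real-time cognitive engines for robotic systems, addressing the latency they incur when processing long state histories. It proposes the Cached State Representation (CSR) framework which, together with an Asynchronous State Reconciliation (ASR) algorithm, achieves optimal reuse of the key-value cache and substantially reduces inference latency. Experiments show that CSR performs strongly on both a physical robot and an embodied AI benchmark, greatly improving real-time responsiveness and inference efficiency.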

Comments Extended Technical Report for Paper Accepted to IEEE RA-L

Abstract

Deploying massive large language models (LLMs) as continuous cognitive engines for robotics is bottlenecked by the time-to-first-token (TTFT) latency required to process extensive state histories. Existing solutions like RAG or sliding windows compromise global context or incur prohibitive re-computation costs. We formalize the optimal task structure for minimizing latency and theoretically prove that prefix stability, incremental extensibility, and asynchronous state reconciliation are necessary conditions for real-time performance. Building on these proofs, we introduce the Cached State Representation (CSR) framework as the practical instantiation of these properties, ensuring optimal KV-cache reuse. To sustain these properties over infinite horizons, we further propose an Asynchronous State Reconciliation (ASR) algorithm that offloads state memory eviction to a parallel computational resource to eliminate latency spikes. On a physical robot wirelessly connected to an on-premise GPU server, CSR achieves a 26-fold latency reduction (14.67s to 0.56s) for 120K token contexts with a 235B parameter model compared to a standard baseline. On an embodied AI benchmark, we achieve SOTA recall (0.836 vs. 0.459) while maintaining RAG-level latency. ASR is validated to sustain bounded, spike-free TTFT over 10 eviction cycles in continuous real-world operation. Together, CSR and ASR enable massive LLMs to function as continuously operating, high-frequency (> 2 Hz) embodied policies.
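
Why prefix stability matters for KV-cache reuse can be shown with a small sketch. It assumes the common prefix-caching behaviour in which cached entries are reusable only for the exact leading tokens shared with the new prompt; the token ids and function names are hypothetical, and this is not the CSR implementation.

```python
def reusable_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached context and the new prompt."""
    shared = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        shared += 1
    return shared

def tokens_to_prefill(cached_tokens, new_tokens):
    """Tokens that must be recomputed before the first output token."""
    return len(new_tokens) - reusable_prefix_len(cached_tokens, new_tokens)

history = list(range(1000))          # hypothetical token ids of the state history
appended = history + [1000, 1001]    # append-only update keeps the prefix stable
slid = history[2:] + [1000, 1001]    # sliding window drops the oldest tokens
```

In this toy setting the append-only context needs only the 2 new tokens prefilled, while the sliding window shifts every position and forces all 1,000 tokens to be recomputed, which is the re-computation cost the abstract attributes to sliding windows.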

2605.07324 2026-05-11 cs.CL cs.AI cs.CR cs.LG

Activation Differences Reveal Backdoors: A Comparison of SAE Architectures

Sachin Kumar

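AI summary: This study examines how sparse autoencoder architectures can detect backdoor attacks in language models. The author compares two architectures, Crosscoders and Differential SAE (Diff-SAE), for identifying backdoor-related features in fine-tuned models. Experiments show that Diff-SAE substantially outperforms Crosscoders at isolating backdoor behavior, detecting backdoors with high precision and a zero false positive rate, while Crosscoders almost entirely fail to identify them. The study finds that backdoors manifest as directional shifts in activations rather than as sparse features, so difference-based representations are more effective for detection.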

Comments Accepted at IJCNN 2026 (IEEE WCCI). ©2026 IEEE

Abstract

Backdoor attacks on language models pose a significant threat to AI safety, where models behave normally on most inputs but exhibit harmful behavior when triggered by specific patterns. Detecting such backdoors through mechanistic interpretability remains an open challenge. We investigate two sparse autoencoder architectures -- Crosscoders and Differential SAEs (Diff-SAE) -- for isolating backdoor-related features in fine-tuned models. Using a controlled SQL injection backdoor triggered by year-based context ("2024" triggers vulnerable code, "2023" triggers safe code), we evaluate both approaches across LoRA and full-rank fine-tuning regimes on SmolLM2-360M. We find that Diff-SAE consistently and substantially outperforms Crosscoders for backdoor isolation. Diff-SAE achieves a Backdoor Isolation Score (BIS) of 0.40 with perfect precision (1.0) and zero false positive rate across most experimental conditions, while Crosscoders fail almost entirely with BIS below 0.02 in most cases. This performance gap holds across multiple transformer layers (14, 18, 22, 26) and both fine-tuning regimes, with full-rank fine-tuning producing particularly clean backdoor signals. Our results suggest that backdoors manifest as directional activation shifts rather than sparse feature activations, making difference-based representations fundamentally more effective for detection. These findings have important implications for AI safety monitoring and the development of interpretability tools for detecting model manipulation.

2605.07323 2026-05-11 cs.AI cs.LG cs.NE cs.SC

Discovering Ordinary Differential Equations with LLM-Based Qualitative and Quantitative Evaluation

Sum Kyun Song, Bong Gyun Shin, Jae Yong Lee

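AI summary: Discovering the differential equations that govern a system from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression methods rely mainly on quantitative metrics, but practical differential-equation modeling also needs domain knowledge to ensure physical plausibility. This paper therefore proposes DoLQ, which discovers ordinary differential equations by combining LLM-based qualitative and quantitative evaluation. The method uses a multi-agent architecture whose agents handle candidate-system generation, parameter optimization, and evaluation-guided search, and experiments on multi-dimensional ODE benchmarks show higher success rates and more accurate recovery of symbolic terms.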

Comments Accepted at ICML 2026

Journal ref
International Conference on Machine Learning 2026
Abstract

Discovering governing differential equations from observational data is a fundamental challenge in scientific machine learning. Existing symbolic regression approaches rely primarily on quantitative metrics; however, real-world differential equation modeling also requires incorporating domain knowledge to ensure physical plausibility. To address this gap, we propose DoLQ, a method for discovering ordinary differential equations with LLM-based qualitative and quantitative evaluation. DoLQ employs a multi-agent architecture: a Sampler Agent proposes dynamic system candidates, a Parameter Optimizer refines equations for accuracy, and a Scientist Agent leverages an LLM to conduct both qualitative and quantitative evaluations and synthesize their results to iteratively guide the search. Experiments on multi-dimensional ordinary differential equation benchmarks demonstrate that DoLQ achieves superior performance compared to existing methods, not only attaining higher success rates but also more accurately recovering the correct symbolic terms of ground truth equations. Our code is available at https://github.com/Bon99yun/DoLQ.

2605.07319 2026-05-11 cs.LG cs.AI

Generative Modeling with Flux Matching

Peter Pao-Huang, Xiaojie Qiu, Stefano Ermon

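AI summary: This paper proposes Flux Matching, a new generative modeling method that generalizes existing score-based models to a broader family of non-conservative vector fields. Instead of forcing the model to match the data score, it imposes a weaker condition that admits infinitely many vector fields sharing the data distribution as their stationary distribution. This gives generative models more flexibility to incorporate inductive biases, structural priors, and properties of the dynamics. The method performs strongly on high-dimensional image datasets and supports faster sampling, interpretable models, and directed dependencies between variables, opening a new design dimension for generative modeling.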

Abstract

We introduce Flux Matching, a new paradigm for generative modeling that generalizes existing score-based models to a broader family of vector fields that need not be conservative. Rather than requiring the model to equal the data score, the Flux Matching objective imposes a weaker condition that admits infinitely many vector fields whose stationary distribution is the data. This flexibility enables a class of generative models that cannot be learned under score matching, in which inductive biases, structural priors, and properties of the dynamics can be directly imposed or optimized. We show that Flux Matching performs strongly on high-dimensional image datasets and, more importantly, that our added freedom unlocks a range of applications including faster sampling, interpretable and mechanistic models, and dynamics that encode directed dependencies between variables. More broadly, Flux Matching opens a new dimension in generative modeling by turning the vector field itself into a design choice rather than a fixed target. Code is available at https://github.com/peterpaohuang/flux_matching.

2605.07317 2026-05-11 cs.CV cs.AI

Amortized-Precision Quantization for Early-Exit Vision Transformers

Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

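AI summary: This paper studies how to deploy low-precision early-exit Vision Transformers (ViTs) stably across vision tasks. To address the error amplification that quantization noise causes along dynamic inference paths in existing quantization methods, the authors propose Amortized-Precision Quantization (APQ), which accounts for each layer's stochastic exposure to quantization noise and reveals trade-offs between precision and depth. Building on APQ, they further propose MAQEE, a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. It outperforms strong baselines by up to 20% on classification, detection, and segmentation while reducing computation (BOPs) by up to 95%.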

Abstract

Vision Transformers (ViTs) achieve strong performance across vision tasks, yet their deployment with low-precision early exiting remains fragile. Existing quantization methods assume static full-depth execution, making them unstable when exit decisions are perturbed by quantization noise, which can amplify errors along dynamic inference paths. In this paper, we introduce Amortized-Precision Quantization (APQ), a utilization-aware formulation that accounts for layer-wise stochastic exposure to quantization noise and reveals depth-precision trade-offs. Building on APQ, we propose Mutual Adaptive Quantization with Early Exiting (MAQEE), a bi-level framework that jointly optimizes exit thresholds and bit-widths under explicit risk control to improve inference stability. MAQEE establishes a superior Pareto frontier in the accuracy-efficiency trade-off, reducing BOPs by up to 95% while maintaining accuracy and outperforming strong baselines by up to 20% across classification, detection, and segmentation tasks.

2605.07316 2026-05-11 cs.AI

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

Chen Wang, Hexuan Deng, Yining Zhang, Yuchen Zhang, Jionghao Bai, Zhaochun Li, Ge Lan, Yue Wang

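AI summary: This study targets the overthinking that reinforcement learning with verifiable rewards tends to induce and proposes a new compression regularization method, Implicit Compression Regularization (ICR). By analyzing how the length-accuracy correlation changes during training, the authors find that shorter responses are more likely to be correct early on but can gradually lose this property as compression proceeds. Based on this, ICR uses the virtual shorter distribution induced by the shortest correct responses in rollouts to guide the policy toward concise and accurate reasoning trajectories. Experiments show that the method shortens responses while preserving or improving accuracy.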

Abstract

Reinforcement learning with verifiable rewards improves LLM reasoning but often induces overthinking, where models generate unnecessarily long reasoning traces. Existing methods mainly rely on length penalties or early-exit strategies; however, the former may degrade accuracy and induce underthinking, whereas the latter assumes that substantial portions of reasoning traces can be safely truncated. To obtain a compression signal without these limitations, we revisit the training dynamics of existing compression methods. We observe that the length-accuracy correlation is initially negative but continually increases during compression, indicating that shorter responses are initially more likely to be correct but gradually lose this property as the policy moves toward underthinking. Based on this observation, we formalize overthinking: a negative correlation indicates an overthinking regime, while a positive one indicates underthinking. When overthinking, the shortest correct responses are shorter than the group-average response length in expectation, making them natural compression targets already present in on-policy rollouts. We therefore propose Implicit Compression Regularization (ICR), an on-policy regularization method whose compression signal comes from a virtual shorter distribution induced by the shortest correct responses in rollout groups, guiding the policy toward concise yet correct trajectories. Training dynamics show that ICR maintains a better length-accuracy correlation during compression, indicating that short responses remain better aligned with correctness instead of drifting toward underthinking. Experiments on three reasoning backbones and multiple mathematical and knowledge-intensive benchmarks show that ICR consistently shortens responses while preserving or improving accuracy, achieving a stronger accuracy-length Pareto frontier.
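
The regime diagnostic described above can be sketched for a single rollout group. This is an illustration only: it assumes a Pearson correlation between response length and a 0/1 correctness label, returns 0.0 when either quantity has no variance, and uses hypothetical function names; the paper's exact estimator is not given in the abstract.

```python
import math

def length_accuracy_correlation(lengths, correct):
    """Pearson correlation between response length and 0/1 correctness."""
    n = len(lengths)
    mean_len = sum(lengths) / n
    mean_acc = sum(correct) / n
    cov = sum((l - mean_len) * (c - mean_acc) for l, c in zip(lengths, correct))
    var_len = sum((l - mean_len) ** 2 for l in lengths)
    var_acc = sum((c - mean_acc) ** 2 for c in correct)
    if var_len == 0 or var_acc == 0:
        return 0.0  # assumption: treat a degenerate group as uncorrelated
    return cov / math.sqrt(var_len * var_acc)

def shortest_correct_length(lengths, correct):
    """Length of the shortest correct response, or None if none is correct."""
    ok = [l for l, c in zip(lengths, correct) if c]
    return min(ok) if ok else None

# Hypothetical rollout group: the two shortest responses are the correct ones.
lengths = [100, 200, 300, 400]
correct = [1, 1, 0, 0]
corr = length_accuracy_correlation(lengths, correct)
target = shortest_correct_length(lengths, correct)
```

For this group the correlation is negative (the overthinking regime in the abstract's terms) and the shortest correct response, 100 tokens, is well below the group mean of 250, so it is a usable compression target.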

2605.07315 2026-05-11 cs.CL

LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

Xuan Li, Yining Wang, Yuchen Liu, Guanjun Liu, Delai Qiu, Shengping Liu, Jiaen Liang, Wei Huang, Jun Yu, Junnan Zhu

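AI summary: LaTER is an efficient reasoning method that performs bounded exploration in a continuous latent space and then switches to explicit reasoning for verification and answer generation, reducing the number of tokens used during inference. Without additional training, it switches between reasoning stages automatically by projecting hidden states and using entropy probes, which substantially lowers computational cost. Experiments show that LaTER preserves accuracy on several benchmarks while sharply reducing token usage, and that it outperforms conventional chain-of-thought reasoning on tasks such as AIME 2025.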

Abstract

Chain-of-thought (CoT) reasoning improves large language models (LLMs) on difficult tasks, but it also makes inference expensive because every intermediate step must be generated as a discrete token. Latent reasoning reduces visible token generation by propagating continuous states, yet replacing explicit derivations with latent computation can hurt tasks that require symbolic checking. We propose Latent-Then-Explicit Reasoning (LaTER), a two-stage paradigm that first performs bounded exploration in a continuous latent space and then switches to explicit CoT for verification and answer generation. In a training-free instantiation, LaTER projects final-layer hidden states back to the input embedding space, preserves the latent KV cache, and uses entropy and model-native stop-token probes to decide when to switch. We find that strong reasoning models already exhibit structured latent trajectories under this interface. On Qwen3-14B, training-free LaTER reduces total token usage by 16%-32% on several benchmarks while matching or improving accuracy on most of them; for example, it improves AIME 2025 from 70.0% to 73.3% while reducing tokens from 15,730 to 10,661. We further construct Latent-Switch-69K, a supervised corpus that pairs condensed solution intuitions with shortened explicit derivations. Fine-tuning with latent rollout and halting supervision yields additional gains: trained LaTER reaches 80.0% accuracy on AIME 2025, 10.0 points above the standard CoT baseline, while using 33% fewer tokens. Our code, data, and model are available at https://github.com/TioeAre/LaTER.

2605.07313 2026-05-11 cs.AI

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

Jiaqi Shao, Yiyi Lu, Yunzhen Zhang, Bing Luo

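AI summary: This paper studies whether agent memory stays usable as stored evidence keeps growing and proposes a scale-conditioned evaluation protocol that simulates realistic conditions by holding task evidence fixed while adding irrelevant sessions step by step. The protocol introduces four diagnostic metrics, reveals how the reliability of different agents and memory interfaces changes with scale, and shows that reliability loss is not a single phenomenon. Experiments show that existing methods differ markedly in how they cope with accumulating irrelevant information, providing a finer-grained framework for evaluating scalable memory systems.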

Comments 19 pages, 11 figures, preprint

Abstract

Memory-agent evaluations report fixed-snapshot accuracy or retrieval quality, but these scores do not show whether evidence remains usable as irrelevant sessions (sessions not annotated as task-relevant evidence for the query) accumulate. We present a scale-conditioned evaluation protocol for agent memory under evidence-preserving growth: for each query, task evidence is held fixed while irrelevant sessions are added. The protocol logs agent--memory trajectories and reports four diagnostics: budget-compliant reliability, tail memory-call burden, failure-regime decomposition, and the usable-scale boundary where reliability falls below the target. Applied to LongMemEval and LoCoMo across flat, planar, and hierarchical memory interfaces, the protocol shows reliability loss is not a single phenomenon. On LongMemEval, HippoRAG stays within the two-call budget but loses 16--20 percentage points in budget-compliant reliability as irrelevant sessions are added; LiCoMemory's observed failures depend strongly on the agent, with Qwen3-8B exceeding the budget while Qwen3-32B and Qwen3-235B remain reliable in the tested range. The result supports a framework for making scalable-memory claims conditional on agent, interface, scale range, and interaction budget.

2605.07307 2026-05-11 cs.CL

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

Yi-Chang Chen, Feng-Ting Liao, Da-shan Shiu, Hung-yi Lee

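AI summary This study challenges two assumptions implicit in how modern reasoning language models generate reasoning chains: that every token matters and that steps must be processed in order. Through systematic interventions (removal, masking, shuffling, and noise injection), it finds that the order of a reasoning chain has little effect on answer extraction and that not all of the information is necessary. It also shows that models retain high accuracy even when chains are heavily reduced and shuffled, revealing that answer extraction relies on a sparse, order-insensitive, and structurally robust informational substrate.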

Abstract

Modern reasoning language models generate dense, sequential chain-of-thought traces implicitly assuming that every token contributes and that steps must be consumed in order. We challenge both assumptions through a systematic intervention pipeline--removal, masking, shuffling, and noise injection--applied to model-generated reasoning chains across three models and three benchmarks. Our findings are counterintuitive on three dimensions. Order: Does the sequential order of a reasoning chain matter for answer extraction? No--line-level shuffling reduces accuracy by less than 0.5 pp; word-level shuffling retains 62%-89% accuracy; only token-level shuffling collapses to near zero. Pretrained-only and instruction-tuned variants exhibit near-identical tolerance (78.67% vs. 78.00% under line shuffling), indicating order-independence originates from pretraining rather than reasoning-specific fine-tuning. Dense: Is all the information in a reasoning chain important for answer extraction? No--masking numeric digits collapses accuracy to exactly 0%, while masking alphabetic prose improves accuracy by 4.7 pp. Robustness: Is a reasoning chain that is both order-shuffling and non-dense still robust? Yes--the most aggressively reduced representation (all natural language removed, lines arbitrarily shuffled) still achieves 83% accuracy, and injecting false answers at 3x true-answer frequency leaves accuracy unchanged (83.3%->83.3%), falsifying a frequency-based extraction account. These results establish that answer extraction operates on a sparse, order-insensitive, and structurally robust informational substrate, opening paths toward parallelized and token-efficient reasoning generation.
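The interventions described above can be illustrated with a short sketch. This is not the authors' pipeline: the function names, whitespace-based word splitting, and the mask characters are assumptions chosen for illustration.

```python
import random


def shuffle_lines(chain, rng):
    """Line-level shuffling: permute the lines, keep each line intact."""
    lines = chain.split("\n")
    rng.shuffle(lines)
    return "\n".join(lines)


def shuffle_words(chain, rng):
    """Word-level shuffling: permute whitespace-separated words."""
    words = chain.split()
    rng.shuffle(words)
    return " ".join(words)


def mask_digits(chain, mask="#"):
    """Replace every numeric digit with a mask character."""
    return "".join(mask if ch.isdigit() else ch for ch in chain)


def mask_alpha(chain, mask="_"):
    """Replace every alphabetic character with a mask character."""
    return "".join(mask if ch.isalpha() else ch for ch in chain)


rng = random.Random(0)  # fixed seed so the perturbation is reproducible
chain = "Let x = 12.\nThen 2x = 24.\nSo the answer is 24."
shuffled = shuffle_lines(chain, rng)
digits_masked = mask_digits(chain)
prose_masked = mask_alpha(chain)
```

In an evaluation loop, each perturbed chain would be handed back to the model for answer extraction and scored against the unperturbed baseline.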

2605.07306 2026-05-11 cs.RO cs.AI

BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation

Zhaohui Du, Zhe Wang, Hongmei Fei, Xiwen Cao, Ting Xiao, Qi Wang, Huanbo Jin, Jiaming Gu, Quan Lu, Zhe Liu

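AI summary This paper presents BioProVLA-Agent, a low-cost, protocol-driven, vision-enhanced embodied multi-agent system for biological laboratory manipulation. By incorporating Vision-Language-Action (VLA) models, the system implements a closed-loop workflow from protocol parsing through visual state verification to embodied execution, improving operational reliability and adaptability in complex wet-lab environments. The work also introduces the AugSmolVLA augmentation strategy to handle visual perturbations such as transparent labware, reflections, and illumination changes; experiments show it outperforms existing methods across a range of task scenarios.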

Comments 16 pages, 7 figures

Abstract

Biological laboratory automation can reduce repetitive manual work and improve reproducibility, but reliable embodied execution in wet-lab environments remains challenging. Protocols are often unstructured, labware is frequently transparent or reflective, and multi-step procedures require state-aware execution beyond one-shot instruction following. Existing robotic systems often rely on costly hardware, fixed workflows, dedicated instruments, or robotics-oriented interfaces. Here, we introduce BioProVLA-Agent, an affordable, protocol-driven, vision-enhanced embodied multi-agent system enabled by Vision-Language-Action (VLA) models for biological manipulation. The system uses protocols as the task interface and integrates protocol parsing, visual state verification, and embodied execution in a closed-loop workflow. A Tailored LLM Protocol Agent converts protocols into verifiable subtasks; a VLM-RAG Verification Agent assesses readiness and completion using observations, robot states, retrieved knowledge, and success/failure examples; and a VLA Embodied Agent executes verified subtasks through a lightweight policy. To improve robustness under wet-lab visual perturbations, we develop AugSmolVLA, an online augmentation strategy targeting transparent labware, reflections, illumination shifts, and overexposure. We evaluate the system on a hierarchical benchmark covering 15 atomic tasks, 6 composite workflows, and 3 bimanual tasks, including tube loading, sorting, waste disposal, cap twisting, and liquid pouring. Across normal and high-exposure settings, AugSmolVLA improves execution stability over ACT, X-VLA, and the original SmolVLA, especially for precise placement, transparent-object manipulation, composite workflows, and visually degraded scenes. These results suggest a practical route toward accessible, protocol-centered, and verification-capable embodied AI for biological manipulation.

2605.07305 2026-05-11 cs.CL cs.AI

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

Hsin-Ling Hsu, Zizheng Wang, Donghua Zhang, Nai-Chia Chen, Jerry Wang, Jun-En Ding, Chia-Hsuan Hsu, Guoan Wang, Feng Liu, Fang-Ming Hung, Chenwei Wu, Liyue Shen

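AI summary Existing clinical diagnostic LLMs are mostly evaluated in static, single-turn settings, which differ substantially from real clinical practice, where information is gathered step by step over multiple turns of interaction. This paper studies active diagnosis, a setting closer to real practice, and identifies core problems in current models during multi-turn diagnosis: poorly grounded test selection, unreliable diagnostic updates, and incoherent multi-turn reasoning. To address this, the authors propose MedAction, which uses a tree-structured distillation framework and knowledge-graph-based evaluation metrics to build a high-quality dataset of 32,681 multi-turn diagnostic trajectories, and achieves the best performance among open-source models on multiple medical benchmarks.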

Abstract

Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model's hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.

2605.07304 2026-05-11 cs.LG

Latent Order Bandits

Emil Carlsson, Newton Mwai, Fredrik D. Johansson

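AI summary This paper studies how partial-order information tied to latent states can make exploration more efficient in multi-armed bandit problems. Conventional latent bandit algorithms depend on an accurate model of latent states and reward distributions, which is hard to obtain in practice. The authors therefore propose latent order bandits (LOB), which require only prior knowledge of a partial order of action preferences, relaxing the requirement for an accurate model. The approach lets instances of the same latent state differ in their reward distributions as long as they share the partial order of action preferences. The paper gives a confidence-bound algorithm with an upper bound on regret, and experiments show good performance across different settings.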

Comments arXiv admin note: text overlap with arXiv:2508.05367

Abstract

Bandit algorithms solve diverse sequential decision-making problems, but are often too sample-inefficient for from-scratch personalization. To substantially reduce exploration times, latent bandit algorithms exploit cross-instance structure implied by discrete latent states, provided that the posterior distribution of rewards and latent states is known and accurate. However, obtaining an accurate model of this structure is difficult, and a small number of latent states may be insufficient to characterize the reward distributions in all problem instances. We propose latent order bandits (LOB), relaxing the assumptions of latent bandits to require only prior knowledge of a partial order of action preferences in each state. This allows instances of the same state to vary in reward distributions, as long as the partial order of actions is shared. For example, groups of users on a streaming service may agree on which movie genres are the best but rate experiences on different scales. We give an upper-confidence bound procedure for the LOB problem, applicable to both total and partial latent orders, and give an upper bound on its regret. To improve empirical performance, we propose a posterior-sampling algorithm and show, in a suite of experiments, that both are competitive with full-prior latent bandits when same-state instances share reward parameters, and preferable to them when reward scales differ between instances with the same latent state.

2605.07302 2026-05-11 cs.LG

Pretraining Induces a Reusable Spectral Basis for Downstream Task Adaptation

Junjie Yu, Yue Wang, Zihan Deng, Yan Zhu, Wenxiao Ma, Quanying Liu

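AI summary This study examines the low-dimensional subspace of parameter space in which pretrained models are fine-tuned. It shows that the leading singular vectors of pretrained weight matrices remain highly stable during fine-tuning and are shared across different downstream tasks, indicating that pretraining establishes a reusable spectral coordinate system. It further finds that larger-scale pretraining yields greater spectral stability under distribution shift or task change, improving geometric transferability. Based on these findings, the authors propose a parameter-efficient fine-tuning method that optimizes only the leading spectral coefficients while freezing the pretrained singular vectors, achieving performance on GLUE comparable to full fine-tuning.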

Abstract

Finetuning pretrained models occurs in a low-dimensional subspace of the full parameter space. Prior work has focused on characterizing this optimization subspace, but largely ignored the complementary question: why do certain directions remain unexplored during finetuning? Are these stable directions irrelevant to downstream tasks, or do they already encode task-relevant structure that requires no further adjustment? Answering this question is central to understanding how pretrained knowledge transfers. Through systematic spectral analysis across vision and language models, we show that the leading singular vectors of pretrained weight matrices remain highly stable under finetuning and are shared across unrelated downstream tasks, revealing that pretraining establishes a reusable spectral coordinate system. Models pretrained on larger datasets exhibit greater spectral stability under distribution shift or task change, directly linking pretraining scale to geometric transferability. Motivated by these findings, we propose a parameter-efficient method that freezes pretrained singular vectors and optimizes only leading spectral coefficients, achieving competitive performance on GLUE with 0.2% trainable parameters. Our results reveal that the stable directions encode transferable structure rather than irrelevant noise: successful pretraining discovers spectral bases that downstream tasks inherit and operate within.

2605.07301 2026-05-11 cs.AI

SOM: Structured Opponent Modeling for LLM-based Agents via Structural Causal Model

Shiyue Cao, Pei Xu, Likun Yang, Lei Cui, Xiaotang Chen, Kaiqi Huang

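AI summary This paper proposes Structured Opponent Modeling (SOM), an opponent-modeling method based on structural causal models, to improve how well large language model (LLM)-based agents predict opponent behavior in multi-agent environments. The method splits opponent modeling into two stages: it first uses a structural causal model to explicitly represent the dependencies between an opponent's observations and actions, producing a structured opponent representation; it then reasons over that structure, guiding the LLM to predict along explicit causal pathways, which improves prediction accuracy and stability. Experiments show that SOM outperforms existing LLM-based prediction methods on multiple multi-agent benchmarks and markedly improves strategic decision-making in complex, dynamic environments.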

Abstract

Accurately predicting opponents' behavior from interactions is a fundamental capability for large language model (LLM)-based agents in multi-agent and game-theoretic environments. Existing approaches often entangle opponent modeling with prediction, relying on implicit contextual reasoning and limiting adaptability in dynamic interactions. To this end, we propose Structured Opponent Modeling (SOM), a two-stage opponent modeling framework that distinctly separates opponent model construction and opponent prediction. At the construction stage, SOM employs a Structural Causal Model (SCM), a graph-based formalism for representing dependencies among variables, to capture directed links between opponents' observations and actions, yielding an explicit and structured opponent representation. At the prediction stage, the LLM performs structured reasoning along clear pathways derived from the SCM, improving both prediction accuracy and stability. Extensive experiments on diverse multi-agent benchmarks demonstrate that SOM consistently outperforms state-of-the-art LLM-based reasoning baselines, enabling more accurate and adaptable strategic decision-making in complex and dynamic multi-agent interactions.

2605.07299 2026-05-11 cs.CV cs.AI

EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams

Dongchuan Ran, Linyu Ou, Xueheng Li, Wenwen Tong, Chenxu Guo, Hewei Guo, Kaibing Wang, Lewei Lu

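AI summary This paper presents EgoPro-Bench, a new benchmark built on egocentric video streams for evaluating and training multimodal large language models with personalized proactive interaction capabilities. The benchmark uses simulated user profiles to generate diverse user intentions and constructs high-fidelity human-machine interaction data across 12 distinct domains, addressing gaps in existing benchmarks around personalization and the evaluation of interaction timing. The work also proposes the "short thinking, better interaction" principle, which improves interaction efficiency by limiting the reasoning budget, and validates the benchmark's value for strengthening models' intent understanding and recognition of interaction timing.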

Comments 8 pages

Abstract

Existing Multimodal Large Language Models (MLLMs) remain primarily reactive, failing to continuously perceive environments or proactively assist users. While emerging benchmarks address proactivity, they are largely confined to alert scenarios, neglect personalized context, and fail to evaluate the precise timing of human-machine interactions (HMI). In this paper, we introduce EgoPro-Bench, a novel benchmark for training and evaluating proactive interaction capabilities based on streaming egocentric videos; it comprises 2,400 videos in the evaluation set and over 12,000 videos in the training set. Unlike previous works, EgoPro-Bench leverages simulated user profiles to generate diverse user intentions and to construct high-fidelity HMI data across 12 distinct domains. Subsequently, we propose a specialized evaluation protocol and metrics, train proactive interaction models designed for efficient reasoning and low-latency interaction on streaming video data, and conduct comprehensive evaluations. Furthermore, we introduce an interaction principle termed "short thinking, better interaction", which allocates a limited token budget prior to intent recognition, thereby enhancing interaction performance. The experiments demonstrate that EgoPro-Bench substantially enhances the intention understanding capabilities of MLLMs and enables accurate identification of appropriate timings for HMI, thereby laying a solid foundation for next-generation user-centric proactive interactive agents.

2605.07292 2026-05-11 cs.RO cs.SY eess.SY math.OC

Variable Aerodynamic Damping via Co-Contraction: A Dynamic Isomorphism with Variable Stiffness Actuators

Antonio Franchi

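AI summary This paper studies how co-contraction in a redundant dual-rotor actuator produces variable aerodynamic damping, and proves that the system's passive damping can be tuned by adjusting aerodynamic damping while the net thrust is held constant. It defines an incremental damping coefficient, derives its relationship to airflow velocity from Blade Element Theory, and validates the damping and hardening properties. The mechanism is formalized as a Variable Aerodynamic Damping Actuator (VADA) whose dynamics are isomorphic to stiffness modulation in antagonistic variable-stiffness actuators, providing a new theoretical basis and method for passive damping control of multirotor aircraft.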

Abstract

We prove that aerodynamic co-contraction in a redundant dual-rotor actuator can tune a passive, trim-defined aero-mechanical damping while keeping the commanded net force constant. In particular, we define an incremental damping coefficient as the local sensitivity of net thrust to air-relative velocity at a trim and prove that it increases monotonically along constant-force fibers under a mild aerodynamic hardening condition. We then validate the required damping and hardening properties from a first-principles Blade Element Theory derivation, which yields a minimal thrust model affine in inflow and explicitly reveals the speed--inflow coupling driving the effect. The resulting mechanism is formalized as a Variable Aerodynamic Damping Actuator (VADA) and shown to be dynamically isomorphic to stiffness modulation in antagonistic variable-stiffness actuation (VSA), similar to the co-contraction of tendons by muscle co-activation. The same fiber-density principle also enhances the active aerodynamic promptness measure of redundant multirotors. Finally, an impedance-form representation clarifies the roles of common-mode and differential-mode actuation in the control of passive impedance and the equilibrium velocity of the VADA system.

2605.07288 2026-05-11 cs.CV cs.AI

Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

Jiaxuan Gao, Yongjian Guo, Zhong Guan, Wen Huang, Wanlun Ma, Xi Xiao, Junwu Xiong, Sheng Wen

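AI summary This paper proposes Sword, a robust World Model framework that addresses problems existing World Models face when used as simulators for specific environments such as the LIBERO benchmark: poor generalization, long-horizon error accumulation, and sensitivity to initial-state perturbations. The method introduces structure-guided style augmentation, which disentangles the visual textures of interactive environments from task-relevant dynamics to improve generalization, and a dynamic latent bootstrapping mechanism that keeps training and inference consistent while reducing memory consumption. Experiments show that Sword significantly outperforms the baseline in generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.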

Abstract

The integration of Vision-Language-Action (VLA) models with World Models has gained increasing attention. One representative approach treats learned World Models as generative simulators, enabling policy optimization entirely within "imagination." However, when deployed as simulators for specific environments such as the LIBERO benchmark, existing World Models often suffer from poor generalization and long-horizon error accumulation. During closed-loop rollouts, these models are highly sensitive to initial-state perturbations; minor changes in color, illumination, and other visual factors can trigger cascading hallucinations, leading to severe blurriness or overexposure. Moreover, long-horizon error accumulation further degrades the quality and fidelity of predicted future states. These issues limit the reliability of World Models as simulators. To mitigate these problems, we propose Sword, a robust World Model framework. Our method introduces Structure-Guided Style Augmentation to disentangle the visual textures of interactive environments from task-relevant dynamics, thereby improving generalization. We further propose Dynamic Latent Bootstrapping, which maintains consistency between training and inference while keeping memory consumption low. Extensive experiments on the LIBERO benchmark show that our method significantly outperforms the baseline WoVR in terms of generalization, generation quality, robustness, fidelity, and the success rate of reinforcement-learning post-training for VLA models.

2605.07284 2026-05-11 cs.LG

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

Yifan Zhou

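AI summary This study examines how instruction tuning changes the way model-internal state conditions the late readout layers. It introduces a new cross-patching diagnostic to analyze the interaction between earlier and later layers of pretrained and instruction-tuned models. It finds that instruction-tuned late layers depend significantly on their own post-trained upstream state, and that this effect is consistent across models of different sizes. The results indicate that late-layer behavior is shaped not only by the late layers' own training but also by the state of earlier layers, offering a new perspective for interpreting model behavior.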

Abstract

Recent interpretability work has identified model-internal handles on post-trained behavior, including refusal directions, assistant/persona axes, and sparse chat-tuning features. These results localize where behaviors can be read out or controlled, often in middle-to-late layers. We ask how earlier computation and the late stack cooperate to turn those differences into next-token margins. To test this, we introduce first-divergence cross-patching: at the first token where pretrained base (PT) and instruction-tuned (IT) checkpoints disagree, we cross each model's earlier-layer state with each model's late stack. The diagnostic separates training recipes: same-base instruction-following descendants show late effects that depend on their own earlier-layer state, while OpenMath2 math-domain SFT and controlled code/biomed CPT controls with verified domain learning do not; for OpenMath2, the late effect is already largely portable from base earlier-layer state. Across five dense families (4B-32B), the IT late stack adds +0.76 logits from PT upstream and +2.44 from IT upstream, giving a +1.68 interaction that is positive in every family. Thus the late stack has a real PT-upstream effect, but its larger effect in the IT checkpoint appears only when it reads its own post-trained upstream state. Sparse features in final MLP layers partially mediate the effect and are driven by upstream patches, supporting a handoff from earlier state to final-layer feature activation to IT-token margin. Forced-token scoring shows that the local token choice can change later exact-answer success. Operationally, paired-checkpoint studies that localize a difference to late layers should test whether it survives under the other checkpoint's upstream state before treating the late stack as self-contained.
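The reported interaction can be read as a difference-in-differences over the four upstream/late-stack combinations. The sketch below assumes each late-stack effect is measured relative to the PT late stack on the same upstream state. The four cell values are hypothetical, chosen only so that they reproduce the +0.76, +2.44, and +1.68 figures from the abstract.

```python
def late_stack_effects(margin):
    """Effect of swapping in the IT late stack, per upstream state.

    margin maps (upstream, late_stack) -> mean next-token margin in logits.
    Returns (effect on PT upstream, effect on IT upstream, interaction).
    """
    effect_pt_upstream = margin[("PT", "IT")] - margin[("PT", "PT")]
    effect_it_upstream = margin[("IT", "IT")] - margin[("IT", "PT")]
    interaction = effect_it_upstream - effect_pt_upstream
    return effect_pt_upstream, effect_it_upstream, interaction


# Hypothetical cell means; only the derived effects match the abstract.
margins = {
    ("PT", "PT"): 0.00,
    ("PT", "IT"): 0.76,
    ("IT", "PT"): 1.00,
    ("IT", "IT"): 3.44,
}
effects = late_stack_effects(margins)
```

A positive interaction means the IT late stack contributes more when it reads IT upstream state than when it reads PT upstream state, which is the pattern the abstract reports for every model family.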

2605.07282 2026-05-11 cs.LG

The Convergence Gap: Instruction-Tuned Language Models Stabilize Later in the Forward Pass

Yifan Zhou

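AI summary This paper studies differences in when predictions stabilize during the forward pass of instruction-tuned language models. It introduces the "convergence gap", a diagnostic that measures the difference between each layer's predictive distribution and the final output. Compared with pretrained models, instruction-tuned models stay farther from their final predictions later into the forward pass, and the effect is robust across a range of experimental settings. Further experiments indicate that late multilayer perceptron (MLP) computation is a key factor in when predictions stabilize.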

Abstract

Final outputs hide when a checkpoint commits to its next-token prediction. We introduce the convergence gap, a model-diffing diagnostic that decodes each layer's next-token distribution and measures its distance to the model's own final distribution. Across six paired pretrained and instruction-tuned checkpoints in native prompting regimes, instruction-tuned checkpoints remain farther from their final predictions later into the stack. The effect persists under endpoint-matched raw and tuned readouts, endpoint-free same-history checks, and fixed-history template replay. Matched-prefix interventions identify late MLP windows as the largest tested leverage point: late IT grafts into PT hosts increase late KL by +0.34 nats, while PT-late swaps into IT hosts reduce it by -0.51 nats; matched random late perturbations give only +0.003 versus +0.327 for the true late graft. A preselected Gemma case study provides behavior-facing plausibility for the same late swap, without serving as a benchmark claim. These results identify a robust prediction-dynamics signature of post-training: released instruction-following checkpoints tend to settle later, and late MLP computation is the strongest tested bidirectional handle on that delay under matched histories.
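As a rough illustration of a per-layer distance to the final distribution, the sketch below computes a KL divergence in nats for each layer. The direction of the divergence (layer relative to final), the toy four-token distributions, and the function names are assumptions; the abstract does not specify how layers are decoded or which distance is used beyond reporting KL in nats.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def convergence_gap(layer_distributions):
    """Distance from each layer's decoded distribution to the final layer's."""
    final = layer_distributions[-1]
    return [kl_divergence(dist, final) for dist in layer_distributions]


# Toy next-token distributions over four tokens, one per layer (made up).
layers = [
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.15, 0.10, 0.05],
    [0.80, 0.10, 0.05, 0.05],  # final layer
]
gaps = convergence_gap(layers)
```

A checkpoint that "settles later" in this picture is one whose gap stays large for more layers before dropping to zero at the final layer.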

2605.07280 2026-05-11 cs.LG cs.AI

Mask2Cause: Causal Discovery via Adjacency Constrained Causal Attention

Omar Muhammad, Pasupuleti Dhruv Shivkant, Deepak N. Subramani

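AI summary This paper proposes Mask2Cause, an end-to-end framework that recovers the underlying causal graph of a time series directly during forecasting. The method introduces an inverted variable embedding and an adjacency-constrained masked attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences between variables in both mean and variance. Experiments show state-of-the-art causal discovery performance on multiple benchmark datasets with substantially lower parameter complexity; the inferred causal structure can also be used to cut the parameter count of forecasting models without hurting predictive accuracy.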

Abstract

Leveraging deep learning for causal discovery in time series remains challenging because existing neural methods predominantly rely on component-wise architectures that fail to capture shared system dynamics or employ decoupled post-hoc graph extraction that risks overfitting to spurious correlations. We propose Mask2Cause, an end-to-end framework that recovers the underlying causal graph directly during the forecasting forward pass. Our approach introduces an Inverted Variable Embedding and an Adjacency-Constrained Masked Attention mechanism, trained with homoscedastic or heteroscedastic objectives to capture causal influences in both mean and variance. Empirical results on diverse benchmarks, from synthetic chaotic dynamics to realistic biological simulations, demonstrate state-of-the-art causal discovery with significantly reduced parameter complexity compared to standard baselines. We further show that inferred causal structures can be used to reduce the parameter count of forecasting models by more than 70% on average while maintaining predictive accuracy.
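To make the idea of attention constrained by an adjacency matrix concrete, here is a generic masked-softmax sketch over variables. It is not the Mask2Cause architecture: the function name, the fixed binary mask, and the convention that `adjacency[i][j] == 1` means variable j may influence variable i are all assumptions for illustration.

```python
import math


def masked_attention_weights(scores, adjacency):
    """Row-wise softmax over scores, restricted to edges allowed by adjacency.

    scores[i][j]: raw attention score from target variable i to source j.
    adjacency[i][j]: 1 if source j may influence target i, else 0 (assumed).
    Rows with no allowed source get all-zero weights.
    """
    weights = []
    for row_scores, row_mask in zip(scores, adjacency):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        if not allowed:
            weights.append([0.0] * len(row_scores))
            continue
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [
            math.exp(s - peak) if m else 0.0
            for s, m in zip(row_scores, row_mask)
        ]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three variables; variable 0 depends on itself and 1, variable 2 on nothing.
scores = [[0.2, 1.0, 3.0], [0.5, 0.5, 0.5], [1.0, 2.0, 3.0]]
adjacency = [[1, 1, 0], [0, 1, 1], [0, 0, 0]]
attention = masked_attention_weights(scores, adjacency)
```

Masked entries receive exactly zero weight, so a forecast for one variable can only draw on the sources that the adjacency pattern permits.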