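arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.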

2605.14988 2026-05-15 cs.CV

Compositional Video Generation via Inference-Time Guidance

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

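AI summary: Text-to-video diffusion models can generate realistic videos but perform poorly on prompts that require fine-grained compositional understanding, such as entity relations, attributes, actions, and motion directions. This paper proposes CVG, an inference-time guidance method that uses the model's internal cross-attention maps to capture how prompt concepts are distributed across space and time, and trains a lightweight compositional classifier whose gradients steer the latent trajectory during the early denoising steps, improving the compositional faithfulness of the generated video. The method requires no architectural changes and no fine-tuning of the generator; relying only on a frozen vision-language model backbone, it transfers across semantically related compositional labels. Experiments show that it markedly improves accuracy and visual quality on compositional text-to-video tasks.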

2605.14984 2026-05-15 cs.CV cs.AI

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia

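AI summary: This paper studies how to generate a street-level 3D scene from a single satellite image, a challenging problem. Existing methods show a clear trade-off between geometric accuracy and semantic diversity. The proposed Sat3DGen takes a geometry-first approach that combines new geometric constraints with a perspective-view training strategy, substantially improving the geometric accuracy and photorealism of the generated scenes. Experiments show that it outperforms the previous best method in both geometric error and image quality, and that its 3D assets are useful across several downstream tasks.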

Comments: ICLR 2026; code: https://github.com/qianmingduowan/Sat3DGen demo: https://huggingface.co/spaces/qian43/Sat3DGen project page: https://qianmingduowan.github.io/Sat3DGen_project_page/

Abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on https://github.com/qianmingduowan/Sat3DGen.

2605.14982 2026-05-15 cs.LG cs.AI

Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

Sanjeev Manivannan, Shuban V

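AI summary: This paper studies reinforcement learning in the discounted-reward setting, aiming to improve the convergence efficiency of policy updates in policy gradient methods. By introducing a policy Hessian decomposition, the authors propose a second-order actor-critic method that exploits the curvature of the objective, improving stability while remaining computationally efficient. Within a two-timescale framework, the critic is treated as quasi-stationary, which justifies approximating the action-value function as locally constant with respect to the policy parameters and provides theoretical support for the second-order update.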

2605.14981 2026-05-15 cs.LG

Distance-Matrix Wasserstein Statistics for Scalable Gromov--Wasserstein Learning

Ao Xu, Tieru Wu

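AI summary: This paper proposes Distance-Matrix Wasserstein (DMW), a statistic that addresses the computational difficulty of large-scale Gromov-Wasserstein learning. DMW compares the distributions of random finite distance matrices, avoiding optimization over a global point alignment and thereby improving scalability. The authors prove that DMW is a relaxation of the Gromov-Wasserstein distance, relate its convergence to the number of samples, and give finite-sample theoretical bounds. Experiments show that DMW is effective and interpretable on synthetic data, graph classification, and two-sample testing.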

2605.14980 2026-05-15 cs.CV cs.AI

MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Shuohong Wang, Jeff W. Lichtman, Jun Liu

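AI summary: This paper proposes MicroscopyMatching, a general framework for microscopy image analysis that targets the automation of tasks such as segmentation, tracking, and counting under varying experimental conditions. The framework unifies these diverse tasks as a matching problem and exploits the strong matching ability of a pre-trained latent diffusion model, giving reliable results across many biological samples and imaging conditions without additional tuning. The work offers biomedical research a practical and broadly applicable solution that substantially reduces reliance on manual analysis.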

2605.13360 2026-05-15 cs.LG

Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

Coleman Hooper, Minwoo Kang, Suhong Moon, Nicholas Lee, Eric Wen, John Wawrzynek, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Kurt Keutzer

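AI summary: This paper proposes Speculative Interaction Agents, a method for reducing the high latency that tool calling introduces in agents used for real-time interaction. By introducing asynchronous I/O and speculative tool calling, the agent can keep processing while it waits for external information, which markedly improves responsiveness. Experiments show speedups of 1.3x to 2.2x for large cloud models and small edge models while accuracy stays high, making the method suitable for latency-sensitive settings such as customer service and voice assistants.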

Comments: 17 pages

Abstract

There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency responsiveness is required; for example, with voice-controlled applications, under 1 second of latency is typically required for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add several seconds or more of latency, which is prohibitive for real-time latency-sensitive applications. In our work, we propose Speculative Interaction Agents to enable real-time interaction even for agents with complex multi-turn tool calling. We propose Asynchronous I/O, which decouples the core agent reason-and-act thread from waiting for additional information from either the user or environment, thereby allowing for overlapping agentic processing while waiting on external delays. We also propose Speculative Tool Calling as a method to manage task execution when the agent is still unsure if it has received the full information or if additional user information may later be provided. For strong cloud models, our method can be applied out-of-the-box to existing real-time cloud APIs, providing 1.3-1.7$\times$ speedups with minor accuracy loss. To enable real-time interaction with small edge-scale models, we also present a clock-based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and demonstrate a synthetic data generation strategy for SFT. Altogether, this approach provides 1.6-2.2$\times$ speedups with the Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models across multiple tool calling benchmarks.
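
The overlap between listening and tool execution described in this abstract can be illustrated with a minimal `asyncio` sketch. This is not the authors' implementation: `lookup_order`, `wait_for_user`, and `respond` are hypothetical names, and the sleeps stand in for real network and user delays. The agent starts a tool call on a partial guess while it keeps waiting for the rest of the user input, then keeps the result if the guess was confirmed and cancels it otherwise.

```python
import asyncio

async def lookup_order(order_id: str) -> str:
    """Hypothetical slow tool call (stands in for any external API)."""
    await asyncio.sleep(0.05)
    return f"status of {order_id}: shipped"

async def wait_for_user(final_utterance: str) -> str:
    """Hypothetical user stream: the rest of the utterance arrives later."""
    await asyncio.sleep(0.03)
    return final_utterance

async def respond(partial_guess: str, final_utterance: str) -> tuple[str, bool]:
    # Launch the tool call speculatively on the partial input...
    speculative = asyncio.create_task(lookup_order(partial_guess))
    # ...while the agent keeps listening instead of blocking on the tool.
    confirmed = await wait_for_user(final_utterance)
    if confirmed == partial_guess:
        # Speculation hit: tool latency overlapped with the user delay.
        return await speculative, True
    # Speculation miss: discard the stale call and redo it with the full input.
    speculative.cancel()
    return await lookup_order(confirmed), False

def run_demo():
    hit = asyncio.run(respond("A17", "A17"))
    miss = asyncio.run(respond("A17", "B42"))
    return hit, miss
```

In the hit case the total wait is roughly the longer of the two delays rather than their sum, which is the kind of saving the abstract attributes to overlapping agentic processing with external waits.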

2605.12625 2026-05-15 cs.RO cs.CV

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Hengtong Lu, Victor Shea-Jay Huang, Chengmin Yang, Pengfei Jing, Jifeng Dai, Yan Xie, Benjin Zhu

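AI summary: This work addresses mode collapse in continuous-action policies trained on a single demonstrated trajectory per scene and proposes DIAL, a two-stage framework for preference-aligned driving policies. The first stage uses classifier-free guidance (CFG) to expand the action sampling distribution and break the mode collapse caused by single demonstrations; the second stage introduces multi-intent GRPO, which preserves distributional diversity during preference reinforcement learning and prevents fine-tuning from collapsing again. Experiments on driving show that the method raises the rater feedback score above that of the human-driven demonstration, supporting the importance of expanding the sampling distribution for continuous-action policies.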

Comments: Project page: https://mind-omni.github.io/

Abstract

Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.
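
As a rough illustration of intent-conditioned classifier-free guidance on a flow-matching head, the sketch below uses the standard CFG combination `v_uncond + w * (v_cond - v_uncond)` inside a plain Euler integrator. It is an assumption-laden toy, not DIAL's code: `toy_velocity`, the intent names, and the guidance scale are all hypothetical, and a real action head would be a learned network.

```python
import numpy as np

def cfg_velocity(v_uncond, v_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the intent-conditioned one."""
    v_uncond = np.asarray(v_uncond, dtype=float)
    v_cond = np.asarray(v_cond, dtype=float)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def sample_trajectory(x0, velocity_fn, intent, guidance_scale, steps=2):
    """Euler integration of a flow-matching ODE from t=0 to t=1.
    velocity_fn(x, t, intent) is a hypothetical model; intent=None
    selects the unconditional branch."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = cfg_velocity(velocity_fn(x, t, None),
                         velocity_fn(x, t, intent),
                         guidance_scale)
        x = x + dt * v
    return x

def toy_velocity(x, t, intent):
    """Toy stand-in for a learned head: no drift without an intent,
    lateral drift to one side or the other with one."""
    if intent is None:
        return np.zeros(2)
    return np.array([0.0, 1.0 if intent == "left" else -1.0])
```

With this toy model, `sample_trajectory([0, 0], toy_velocity, "left", 2.0)` and the same call with `"right"` end on opposite sides of the start point, which mirrors the idea of spreading samples along distinct maneuver modes.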

2605.12624 2026-05-15 cs.RO cs.CV

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li

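AI summary: This paper proposes MindVLA-U1, a unified streaming VLA architecture for autonomous driving that targets the gap in planning quality between existing VLA models and VA models. A single unified vision-language-action backbone jointly produces language tokens and continuous action trajectories, and a streaming design improves real-time performance. While keeping a natural-language interface, MindVLA-U1 substantially improves planning: it is the first to surpass human drivers on the WOD-E2E benchmark, achieves state-of-the-art planning accuracy, and matches the latency of VA models.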

Comments: Work in progress. Project page: https://mind-omni.github.io/

Abstract

Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense & sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steer the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.

2605.12622 2026-05-15 cs.RO cs.CV

Action Emergence from Streaming Intent

Pengfei Jing, Victor Shea-Jay Huang, Hengtong Lu, Jifeng Dai, Yan Xie, Benjin Zhu

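AI summary: This paper studies "action emergence" in end-to-end autonomous driving, that is, the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in complex traffic scenes. The authors propose SI (Streaming Intent), a vision-language-action model that streams driving intent both semantically, through a continuous causal chain of reasoning, and temporally, and uses that intent to guide action generation, yielding controllable, high-quality trajectory planning. Experiments show that SI performs well on the Waymo End-to-End benchmark and, for the first time in a fully end-to-end VLA model, achieves intent-based controllability.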

Comments: Project page: https://mind-omni.github.io/

Abstract

We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.

2605.12484 2026-05-15 cs.LG cs.AI

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri

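AI summary: Large language models (LLMs) are usually adapted to downstream tasks by updating their parameters (for example, with reinforcement learning), but this can cause catastrophic forgetting and loss of plasticity. In-context learning with fixed parameters, by contrast, adapts quickly to task requirements but yields limited performance gains. This paper proposes a fast-slow learning framework that treats model parameters as "slow weights" and the optimized context as "fast weights", so that learning is efficient while the model as a whole stays stable. Experiments show that the method beats conventional approaches in both sample efficiency and performance ceiling, and that in continual learning settings it adapts better and forgets less.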

Comments: 29 pages, 14 figures, including appendix; Blog post: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/

Abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but cannot by itself typically match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and persist general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL-training. This reduced drift also preserves plasticity: after training on one task, FST trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.

2605.08512 2026-05-15 cs.LG

MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

Yusuf Syed, Viraj Parimi, Brian Williams

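AI summary: This paper proposes MoMo, a conditioned contrastive representation learning method for planning modulated by user preference. Using feature-wise linear modulation and low-rank neural modulation, it jointly learns the representation structure and a latent prediction operator, so that at inference time the model can continuously adjust how conservative its plans are according to a scalar preference, without retraining. Experiments show that across several environments MoMo smoothly adjusts plan safety to the user's preference and improves temporal and preference consistency.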

2605.07785 2026-05-15 cs.CV

Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

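AI summary: This work proposes XpertCausal, a radiologist-guided causal concept bottleneck model for chest X-ray interpretation. The model represents the causal relations between diseases and radiographic findings with a probabilistic noisy-OR framework and uses Bayesian inference to estimate disease probabilities from the predicted concepts. By incorporating radiologist-defined concept-disease associations, the model structure is constrained to clinically plausible reasoning paths, so that it outperforms conventional concept bottleneck models in diagnostic performance, calibration, and explanation quality and stays closer to expert knowledge.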
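
For readers unfamiliar with noisy-OR models, here is a minimal sketch of the general technique, not the XpertCausal implementation. It assumes hard 0/1 findings, a small leak probability, and made-up link strengths; the names `noisy_or` and `disease_posterior` are hypothetical. A finding is present unless every active disease (and the leak) fails to trigger it, and the disease posterior follows from Bayes' rule by enumerating disease states, which is only feasible for a handful of diseases.

```python
from itertools import product

def noisy_or(active_links, leak=0.0):
    """P(finding present | active causes) under a noisy-OR gate.
    active_links: activation probabilities of the causes that are present."""
    p_absent = 1.0 - leak
    for p in active_links:
        p_absent *= 1.0 - p
    return 1.0 - p_absent

def disease_posterior(priors, links, findings, leak=0.01):
    """Exact P(disease_d = 1 | findings) by enumerating all disease states.
    priors[d]: prior of disease d; links[d][f]: probability that d alone
    triggers finding f; findings[f]: observed 0/1 value.
    All numbers passed in are illustrative assumptions."""
    n = len(priors)
    joint = [0.0] * n
    total = 0.0
    for state in product([0, 1], repeat=n):
        weight = 1.0
        for d, on in enumerate(state):
            weight *= priors[d] if on else 1.0 - priors[d]
        for f, seen in enumerate(findings):
            p = noisy_or([links[d][f] for d in range(n) if state[d]], leak)
            weight *= p if seen else 1.0 - p
        total += weight
        for d, on in enumerate(state):
            if on:
                joint[d] += weight
    return [j / total for j in joint]
```

For example, with two diseases of prior 0.1, where only the first is linked to an observed finding (`links=[[0.9], [0.0]]`), the posterior of the first disease rises sharply while the second stays at its prior.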

2605.02043 2026-05-15 cs.LG

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

Tehila Dahan, Roie Reshef, Sharon Goldstein, Kfir Y. Levy

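AI summary: This paper studies gradient staleness caused by data-dependent delays in asynchronous stochastic gradient descent (SGD) and proposes a momentum-based asynchronous framework that retains the information in delayed gradients while mitigating its harmful effect. The method establishes the first optimal convergence rates in both convex and non-convex smooth optimization, filling a gap in existing work, and provides a practical, robust learning-rate schedule that simplifies hyperparameter tuning.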

2605.02004 2026-05-15 cs.AI

Personalized Digital Health Modeling with Adaptive Support Users

Zhongqi Yang, Mahkameh Rasouli, Neda Mohseni, Yong Huang, Iman Azimi, Amir M. Rahmani

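AI summary: In digital health, physiological and behavioral differences between individuals are large, so personalized modeling is essential. Because per-user data are scarce and noisy, existing methods mostly rely on population pre-training or data from similar users, which leads to transfer bias and weak generalization. This paper proposes a unified personalization framework that adaptively weights data from both similar and dissimilar users, combining a personal loss, similarity-based transfer, and contrastive regularization to improve robustness. Experiments on several real-world datasets show that the method clearly outperforms conventional approaches, especially when data are limited, and improves data efficiency and interpretability.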

2604.25855 2026-05-15 cs.CV cs.AI

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Hector G. Rodriguez, Marcus Rohrbach

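AI summary: This paper proposes SIEVES, a new selective prediction method that aims to improve the reliability and coverage of visual question answering (VQA) systems in real-world and out-of-distribution (OOD) settings. The model is asked to produce localized visual evidence while answering, and a selector explicitly scores answer quality from that evidence, giving more accurate confidence estimates without relying on internal signals such as logits or hidden states. Experiments show that SIEVES markedly improves coverage on several challenging OOD benchmarks and applies to frontier closed-source models without access to their weights or logits.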

Abstract

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Precisely, selective prediction aims to improve coverage, i.e. the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model internal signals like logits or hidden representations, which are not available for frontier closed-sourced models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES .
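
The coverage-at-risk quantity that selective prediction optimizes can be computed as in the sketch below. This is a generic illustration under stated assumptions, not the SIEVES evaluation code: `coverage_at_risk` is a hypothetical helper, confidences are assumed to come from some selector, and ties in confidence are broken by input order rather than handled as a single threshold.

```python
def coverage_at_risk(confidences, correct, max_risk):
    """Largest fraction of inputs that can be answered (highest confidence
    first) while the error rate among answered inputs stays <= max_risk."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    best, errors = 0, 0
    for answered, (_, is_correct) in enumerate(ranked, start=1):
        errors += 0 if is_correct else 1
        if errors / answered <= max_risk:
            best = answered
    return best / len(ranked) if ranked else 0.0
```

With confidences `[0.9, 0.8, 0.7, 0.6]` and correctness `[True, True, False, True]`, a zero-error budget allows answering only the top two inputs, while a 25% budget allows answering all four. A better selector ranks wrong answers lower, which raises coverage at the same risk level.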

2604.21909 2026-05-15 cs.CV cs.IT math.IT q-bio.NC

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin

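AI summary: This work examines how human and machine vision systems differ in the direction of their confusions on classification tasks, revealing differences in their inductive biases. Analyzing the responses of humans and deep neural networks under 12 perturbations, the study quantifies asymmetry in the confusion matrices and relates it to rate-distortion geometry. The results show that humans exhibit broad but weak asymmetries between classes, whereas deep models show more concentrated and stronger directional confusions, and that this difference reflects distinct generalization strategies even at equal accuracy.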

2604.09860 2026-05-15 cs.RO cs.AI

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang, Rishit Dagli, Alex Zook, Hugo Hadfield, Ankit Goyal, Stan Birchfield, Fabio Ramos, Jonathan Tremblay

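AI summary: To address rapid performance saturation and the lack of genuine generalization assessment in simulation benchmarks for generalist robotics, this work proposes RoboLab, a high-fidelity simulation benchmark framework. The framework generates scenes and tasks that are independent of the robot and the policy, supports in-depth analysis in simulation of policies developed for the real world, and introduces the RoboLab-120 benchmark of 120 tasks covering three capability axes: visual, procedural, and relational. The study also systematically evaluates the shortcomings of current state-of-the-art models in performance and behavioral robustness, providing fine-grained metrics and scalable tooling for assessing the true generalization of task-generalist robot policies.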

2603.16039 2026-05-15 cs.LG cs.AI cs.CL

Residual Stream Duality in Modern Transformer Architectures

Yifan Zhang

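AI summary: This paper examines the dual nature of the residual stream in modern Transformer architectures, arguing that the residual pathway is not only an optimization aid but also part of the model's representational machinery. The author proposes viewing the Transformer design space along two axes, sequence position and layer depth, and shows that a causal attention read over the residual stream along the depth axis is dual to short sliding-window attention along the sequence axis. From this perspective the paper compares the strengths and weaknesses of different designs and recommends Deep Delta Learning (DDL) when the shortcut itself is the object of interest, and sequence-axis short sliding-window attention (ShortSWA) when local adaptive mixing is needed.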

Comments: Project Page: https://github.com/yifanzhang-pro/residual-stream-duality

Abstract

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
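
The operator-level duality in this abstract can be made concrete with a small NumPy sketch. It is an illustration under simplifying assumptions, not the paper's code: queries, keys, and values are taken to be the hidden states themselves (no learned projections), and `causal_window_attention` is a hypothetical helper. The same function is applied once along the sequence axis at a fixed layer and once along the depth axis at a fixed token position.

```python
import numpy as np

def causal_window_attention(queries, keys, values, window):
    """Softmax attention along an ordered axis where step i reads only
    steps max(0, i - window + 1) .. i. Inputs have shape (steps, dim)."""
    steps, dim = queries.shape
    scores = queries @ keys.T / np.sqrt(dim)
    idx = np.arange(steps)
    offset = idx[:, None] - idx[None, :]   # how far back each key lies
    mask = (offset >= 0) & (offset < window)
    scores = np.where(mask, scores, -np.inf)
    # Each row keeps at least its diagonal entry, so the row max is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values

# Toy hidden states indexed as (layer, token, dim); values are random.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(4, 5, 8))

# Sequence axis: fix layer 2, slide a causal window over token positions.
seq_mix = causal_window_attention(hidden[2], hidden[2], hidden[2], window=3)

# Depth axis: fix token 1, slide the same causal window over earlier layers.
depth_mix = causal_window_attention(hidden[:, 1], hidden[:, 1], hidden[:, 1], window=3)
```

Only the slicing differs between the two calls, which is the sense in which the operators coincide. The abstract's systems-level point is separate: kernels, KV-cache layouts, and chunked execution are built around the sequence axis, so the two placements are not equally cheap in practice.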

2602.24273 2026-05-15 cs.AI

A Minimal Agent for Automated Theorem Proving

Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran-Ferreiro, Leopoldo Sarra

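AI summary: This paper proposes a minimal agent baseline for automated theorem proving, intended as a systematic point of comparison for different AI-based theorem-proving architectures. The design implements the core features shared by current state-of-the-art systems: iterative proof refinement, library search, and context management. Experiments show that, with a much simpler architecture and low cost, the approach is competitive with existing state-of-the-art methods, and that iterative approaches have advantages over single-shot generation in sample efficiency and cost-effectiveness. The code is open source for future research and community use.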

2602.14674 2026-05-15 cs.AI

From User Preferences to Base Score Extraction Functions in Gradual Argumentation (with Appendix)

Aniol Civit, Antonio Rago, Antonio Andriella, Guillem Alenyà, Francesca Toni

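AI summary: This paper studies how to extract base score functions from users' preferences over arguments in order to support decision-making in gradual argumentation systems. The authors propose base score extraction functions that map user preferences to the base scores of arguments and apply them to bipolar argumentation frameworks, producing quantitative bipolar argumentation frameworks that can be analyzed with existing computational tools. The approach accounts for non-linearities in human preferences and is validated through theoretical analysis and a robotics experiment, offering guidance on choosing gradual semantics in practical applications.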

2602.11626 2026-05-15 cs.LG cs.AI physics.chem-ph physics.comp-ph physics.flu-dyn

ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning

Wenqian Chen, Yucheng Fu, Michael Penwarden, Pratanu Roy, Panos Stinis

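AI summary: A central challenge in scientific machine learning is learning solution operators for systems with complex, varying geometries and parametric physical settings. This paper proposes ArGEnT, an Arbitrary Geometry-encoded Transformer based on attention that encodes geometric information directly from point-cloud representations and incorporates geometric features flexibly through three variants: self-attention, cross-attention, and hybrid attention. Integrating ArGEnT into DeepONet as the trunk network gives a surrogate modeling framework that does not need an explicitly parametrized geometry input. Experiments on benchmark problems in fluid dynamics, solid mechanics, and electrochemical systems show that it is substantially better than the standard DeepONet and other geometry-aware surrogates in prediction accuracy and generalization.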

Comments: 69 pages, 21 figures, 10 tables

Abstract

Learning solution operators for systems with complex, varying geometries and parametric physical settings is a central challenge in scientific machine learning. In many-query regimes such as design optimization, control and inverse problems, surrogate modeling must generalize across geometries while allowing flexible evaluation at arbitrary spatial locations. In this work, we propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains. ArGEnT employs Transformer attention mechanisms to encode geometric information directly from point-cloud representations, with three variants (self-attention, cross-attention, and hybrid-attention) that use different strategies for incorporating geometric features. By integrating ArGEnT into DeepONet as the trunk network, we develop a surrogate modeling framework capable of learning operator mappings that depend on both geometric and non-geometric inputs without the need to explicitly parametrize geometry as a branch network input. Evaluating on benchmark problems spanning fluid dynamics, solid mechanics and electrochemical systems, we demonstrate significantly improved prediction accuracy and generalization performance compared with the standard DeepONet and other existing geometry-aware surrogates. In particular, the cross-attention transformer variant enables accurate geometry-conditioned predictions with reduced reliance on signed distance functions. By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty quantification, and data-driven modeling of complex physical systems.

2602.04289 2026-05-15 cs.CL cs.LG

Proxy Compression for Language Modeling

Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong

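AI summary: This paper proposes proxy compression, a new language model training method that aims to keep the efficiency benefits of compressed inputs while providing an end-to-end raw-byte interface at inference time. During training the model sees both raw byte sequences and compressed views produced by an external compressor, so that it learns to align compressed sequences with raw bytes internally and needs no external tokenizer at inference. Experiments show that proxy compression substantially improves training efficiency and outperforms conventional byte-level baselines under a fixed compute budget, with the advantage growing as model size increases.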

2602.02711 2026-05-15 cs.AI

Dynamic Mixed-Precision Routing for Efficient Multi-step LLM Interaction

Yuanzhe Li, Jianing Deng, Jingtong Hu, Tianlong Chen, Song Wang, Huanrui Yang

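AI summary: This work addresses the high inference cost of large language models (LLMs) on long-horizon decision-making tasks and proposes Dynamic Mixed-Precision Routing (DMR), a framework that adaptively selects a high-precision or low-precision model at each decision step to reduce compute cost while preserving task success. Building on the observation that steps differ in their sensitivity to precision, the method uses a two-stage training strategy that combines KL-divergence-supervised learning with Group Relative Policy Optimization, improving the balance between performance and efficiency. Experiments show that on tasks such as ALFWorld and WebShop, DMR achieves a better overall accuracy-cost trade-off than single-precision baselines.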

2602.01828 2026-05-15 cs.LG

Hyperbolic Graph Neural Networks Under the Microscope: The Role of Geometry-Task Alignment

Dionisia Naddeo, Jonas Linkerhägner, Nicola Toschi, Geri Skenderi, Veronica Lachi

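AI summary: Many complex networks have hierarchical, tree-like structure, which makes hyperbolic space a natural choice for learning their representations. This paper introduces a new condition, geometry-task alignment, which asks whether the metric structure of the target task matches the geometry of the input graph. Theory and experiments show that hyperbolic graph neural networks can learn low-distortion representations in regression tasks and have an advantage in tasks that require preserving metric structure. The study further finds that hyperbolic GNNs outperform Euclidean models on geometry-aligned link prediction, but show no clear advantage on node classification, which typically is not geometry-aligned, highlighting the importance of alignment between task and geometry.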

2601.16981 2026-05-15 cs.CV cs.GR

SyncLight: Single-Edit Multi-View Relighting

David Serrano-Lozano, Anand Bhattad, Luis Herranz, Jean-François Lalonde, Javier Vazquez-Corral

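AI summary: SyncLight is a method for consistent multi-view scene relighting from a single-view edit, aimed at settings with strict lighting-consistency requirements such as multi-camera broadcasts, stereoscopic cinema, and virtual production. Using a multi-view diffusion transformer based on a latent bridge, it relights multi-view images with high fidelity in a single inference pass and generalizes to an arbitrary number of views without camera pose information. Its core contributions are parametric lighting control and a large hybrid dataset of synthetic and real multi-view data to support training.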

2512.13609 2026-05-15 cs.CV cs.LG

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

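AI summary: This paper proposes a new task and benchmark, Do-Undo Bench, targeting a key weakness of vision-language models: understanding and generating scene transformations driven by real-world actions. The task requires a model not only to simulate the effect of a real-world action on a scene but also to restore the scene to its original state, which tests the model's grasp of cause and effect. The study finds that current models perform poorly on action reversibility, underscoring the need to evaluate action understanding; the benchmark provides an intuitive testbed for evaluating and advancing action-aware generation grounded in real-world dynamics in multimodal systems.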

2512.06471 2026-05-15 cs.LG cs.AI

Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control

Nathan P. Lawrence, Ali Mesbah

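AI summary: This paper analyzes why goal-conditioned reinforcement learning (goal-conditioned RL) works and connects it to optimal control theory. It characterizes the optimality gap between classical quadratic objectives and goal-conditioned rewards, explaining why goal-conditioned rewards can outperform dense rewards in some cases. It also links goal-conditioned rewards to state estimation in partially observable Markov decision processes, showing their relevance to dual control problems, and demonstrates the advantages of goal-conditioned policies in nonlinear and uncertain environments using reinforcement learning and predictive control methods.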

2511.14751 2026-05-15 cs.CV cs.RO

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer

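AI summary: This paper proposes Co-Me, a confidence-guided token merging method that accelerates visual geometric transformers without retraining or fine-tuning the base model. It trains a lightweight confidence predictor, ranks tokens by uncertainty, and selectively merges low-confidence tokens, reducing computation while preserving spatial coverage. Experiments show that Co-Me delivers substantial speedups across a range of multi-view and streaming visual geometric transformers, reaching 21.5x on VGGT and 20.4x on Pi3, and offers a practical route to real-time 3D perception and reconstruction.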

2510.25356 2026-05-15 cs.CL

Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis

Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider

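AI summary: This work examines whether large language models (LLMs) are suited to ordinary meaning analysis, particularly in high-stakes domains such as legal interpretation. Although LLMs see large amounts of data during pre-training, experiments show that they are highly sensitive to changes in question format, which makes their conclusions unstable, and that they correlate only weakly with human judgments. The study questions the current practice of relying on LLMs for textual interpretation in law and argues that their reliability and authority still require further validation.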

DOI: 10.1145/3805689.3812346
Comments: Accepted FAccT 2026; 29 pages, 14 tables, 7 figures. Previous title - Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments; NLLPW 2026

Abstract

In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an 'ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this assumption and raise serious questions about the utility of these models in practice. For the models in our evaluation, slight changes to the format of a question can lead to wildly different conclusions -- a vulnerability that parties with an interest in the outcome could exploit. Comparing with a dataset where people were asked similar legal interpretation questions, we see that these models are at best moderately correlated to human judgments -- not strong enough given the stakes in this domain.

2510.18766 2026-05-15 cs.RO

Sharing the Load: Autonomous Multi-Rover Cargo Transport

Alexander Krawciw, Luka Antonyshyn, Sven Lilge, Nicolas Olmedo, Faizan Rehmatullah, Maxime Desjardins-Goulet, Pascal Toupin, Timothy D. Barfoot

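AI summary: This paper studies cooperative cargo transport by multiple autonomous rovers in lunar missions and proposes a distributed model predictive control method that lets several rovers share the load and jointly carry large cargo. A custom cargo coupling decouples the rovers' kinematics while keeping the cargo fully supported, and field tests achieved high positioning accuracy. The experiments show that the control architecture improves the flexibility and reliability of transport tasks and can also support other mission operations.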