arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14988 2026-05-15 cs.CV

Compositional Video Generation via Inference-Time Guidance

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

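AI summary: Text-to-video diffusion models generate realistic videos but perform poorly on prompts that require fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. The paper proposes CVG, an inference-time guidance method that uses the model's internal cross-attention maps to capture how prompt concepts are grounded across space and time, trains a lightweight compositional classifier on them, and uses its gradients during early denoising steps to steer the latent trajectory, improving the compositional faithfulness of generated videos. The method requires no changes to the model architecture and no fine-tuning of the generator; because the classifier is built on a frozen vision-language model backbone, it transfers across semantically related composition labels. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving visual quality.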

English abstract

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
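
The idea of steering a denoising step with a classifier's gradient can be illustrated with a toy example. This is a minimal sketch, not the CVG implementation: it swaps in a logistic classifier applied directly to a small latent vector (CVG's classifier operates on cross-attention features through a frozen VLM backbone), and the names `log_prob_grad`, `guided_step`, and `scale` are hypothetical.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_prob_grad(x: List[float], w: List[float], b: float) -> List[float]:
    """Gradient of log sigmoid(w.x + b) with respect to x (toy classifier, assumed)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return [(1.0 - sigmoid(z)) * wi for wi in w]


def guided_step(x: List[float], denoise_update: List[float],
                w: List[float], b: float, scale: float) -> List[float]:
    """Apply the usual denoising update, then nudge the latent toward the
    region where the classifier assigns higher probability to the target label."""
    grad = log_prob_grad(x, w, b)
    return [xi + ui + scale * gi for xi, ui, gi in zip(x, denoise_update, grad)]


if __name__ == "__main__":
    latent = [0.0, 0.0]
    print(guided_step(latent, [0.0, 0.0], w=[1.0, 0.0], b=0.0, scale=2.0))
```

In a real sampler the guidance term would typically be applied only during the early, high-noise steps, as the abstract describes.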

2605.14984 2026-05-15 cs.CV cs.AI

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Ming Qian, Zimin Xia, Changkun Liu, Shuailei Ma, Wen Wang, Zeran Ke, Bin Tan, Hang Zhang, Gui-Song Xia

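AI summary: This paper studies how to generate street-level 3D scenes from a single satellite image, a challenging problem. Existing methods trade geometric accuracy against semantic diversity. The proposed Sat3DGen takes a geometry-first approach that combines new geometric constraints with a perspective-view training strategy, substantially improving the geometric accuracy and visual realism of generated scenes. Experiments show that it outperforms the previous best method in both geometric error and image quality and supports a range of downstream applications.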

Comments
ICLR 2026; code: https://github.com/qianmingduowan/Sat3DGen demo: https://huggingface.co/spaces/qian43/Sat3DGen project page: https://qianmingduowan.github.io/Sat3DGen_project_page/
English abstract

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. To address these fundamental challenges, we introduce Sat3DGen, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution Digital Surface Model (DSM) data. On this benchmark, our method improves geometric RMSE from 6.76 m to 5.20 m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from about 40 for the leading method, Sat2Density++, to 19, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image DSM estimation. The code has been released at https://github.com/qianmingduowan/Sat3DGen.

2605.14982 2026-05-15 cs.LG cs.AI

Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

Sanjeev Manivannan, Shuban V

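AI summary: This paper studies reinforcement learning in the discounted-reward setting, aiming to speed up the convergence of policy updates in policy gradient methods. Using a policy Hessian decomposition, the authors propose a second-order actor-critic method that exploits the curvature of the objective while remaining computationally efficient and stable. Within a two-timescale framework, the critic is treated as quasi-stationary, which justifies approximating the action-value function as locally constant with respect to the policy parameters and provides theoretical support for the second-order update.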

Comments
9 pages, 2 figures, including an appendix with detailed proofs
English abstract

We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational complexity of Hessian estimation. In this work, we analyze second-order approximations for the actor update that leverage the full curvature information of the objective as much as possible. A stable approximation requires treating the action-value function as locally constant with respect to policy parameters, which does not generally hold in policy gradient methods. We show that this approximation becomes well-justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that leverages Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
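
A Hessian-vector product (HVP) gives curvature information along one direction without ever forming the full Hessian. The sketch below is only an illustration of that idea and is not the paper's estimator: it swaps in a central finite difference of a gradient function (automatic differentiation would be the usual choice in practice), and the names `hvp`, `grad_fn`, and `eps` are assumptions.

```python
from typing import Callable, List

Vector = List[float]


def hvp(grad_fn: Callable[[Vector], Vector], x: Vector, v: Vector,
        eps: float = 1e-5) -> Vector:
    """Approximate H(x) @ v by a central difference of the gradient:
    (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus = grad_fn(x_plus)
    g_minus = grad_fn(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


def quad_grad(x: Vector) -> Vector:
    """Gradient of f(x) = x0^2 + 3*x0*x1 + 2*x1^2, whose Hessian is [[2, 3], [3, 4]]."""
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]]


if __name__ == "__main__":
    print(hvp(quad_grad, [1.0, 2.0], [1.0, 0.0]))  # close to the first Hessian column
```

The cost is two gradient evaluations per product, independent of the parameter dimension, which is why HVPs make second-order updates tractable.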

2605.14981 2026-05-15 cs.LG

Distance-Matrix Wasserstein Statistics for Scalable Gromov-Wasserstein Learning

Ao Xu, Tieru Wu

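AI summary: This paper proposes Distance-Matrix Wasserstein (DMW), a family of statistics that addresses the computational difficulty of Gromov-Wasserstein learning at scale. DMW compares the distributions of random finite distance matrices and thereby avoids optimizing a global point-level alignment, which improves scalability. The authors prove that DMW is a relaxation of the Gromov-Wasserstein distance, relate its convergence to the number of sampled points, and give finite-sample bounds. Experiments on synthetic data, graph classification, and two-sample testing show that DMW is effective and interpretable.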

English abstract

Gromov-Wasserstein (GW) distances compare graphs, shapes, and point clouds through internal distances, without requiring a common coordinate system. This invariance is powerful, but discrete GW is a nonconvex quadratic optimal transport problem and is difficult to estimate at scale. We propose Distance-Matrix Wasserstein (DMW), a hierarchy of Wasserstein statistics comparing laws of random finite distance matrices. Rather than optimizing a global point-level alignment, DMW samples $n$ points from each space, records their pairwise distances, and transports the resulting matrix laws. We prove that DMW is a relaxation and lower bound of GW, and establish a reverse approximation inequality: the GW-DMW gap is controlled by the Wasserstein error of approximating each original measure with $n$ samples. Hence population DMW converges to GW as sampled subspaces become dense. We further give finite-sample bounds, including intrinsic-dimensional rates that depend on the data manifold rather than the ambient matrix dimension $\binom{n}{2}$. For scalable computation, we introduce sliced and multi-scale DMW; for $p=1$, the sliced multi-scale dissimilarity yields positive-definite exponential kernels. Experiments on synthetic metric spaces, scalability benchmarks, graph classification, and two-sample testing validate the theory and demonstrate an interpretable GW-style proxy for structural comparison.
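
The sampling-and-transport recipe described in the abstract can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' estimator: each space is a finite list of points with uniform measure, tuples are drawn with replacement, and the matrix laws are compared with a sliced 1-Wasserstein distance over random unit directions. All function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def distance_vector(points: Sequence, dist: Callable) -> Vector:
    """Upper-triangular pairwise distances of an ordered n-tuple (length n*(n-1)/2)."""
    n = len(points)
    return [dist(points[i], points[j]) for i in range(n) for j in range(i + 1, n)]


def sample_distance_vectors(space: Sequence, dist: Callable, n: int, m: int,
                            rng: random.Random) -> List[Vector]:
    """Draw m random n-tuples (i.i.d. uniform over the space) and record their distances."""
    return [distance_vector([rng.choice(space) for _ in range(n)], dist)
            for _ in range(m)]


def random_directions(dim: int, k: int, rng: random.Random) -> List[Vector]:
    """k random unit vectors in R^dim."""
    dirs = []
    for _ in range(k):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g)) or 1.0
        dirs.append([x / norm for x in g])
    return dirs


def sliced_w1(vecs_a: List[Vector], vecs_b: List[Vector],
              directions: List[Vector]) -> float:
    """Average 1D Wasserstein-1 distance between projected samples of equal size."""
    if len(vecs_a) != len(vecs_b):
        raise ValueError("this sketch assumes equal sample counts")
    total = 0.0
    for d in directions:
        pa = sorted(sum(x * w for x, w in zip(v, d)) for v in vecs_a)
        pb = sorted(sum(x * w for x, w in zip(v, d)) for v in vecs_b)
        total += sum(abs(a - b) for a, b in zip(pa, pb)) / len(pa)
    return total / len(directions)


if __name__ == "__main__":
    rng = random.Random(0)
    square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    n, m = 3, 200
    a = sample_distance_vectors(square, math.dist, n, m, rng)
    b = sample_distance_vectors(line, math.dist, n, m, rng)
    dirs = random_directions(n * (n - 1) // 2, 16, rng)
    print(sliced_w1(a, b, dirs))
```

Because only pairwise distances enter the statistic, the two spaces never need a shared coordinate system, which mirrors the invariance GW provides.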

2605.14980 2026-05-15 cs.CV cs.AI

MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

Xiaofei Hui, Haoxuan Qu, Hossein Rahmani, Shuohong Wang, Jeff W. Lichtman, Jun Liu

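AI summary: This paper presents MicroscopyMatching, a general-purpose microscopy image analysis framework that aims to automate analysis tasks such as segmentation, tracking, and counting across varying experimental conditions. The framework recasts these diverse tasks as a single matching problem and exploits the strong matching capability of pre-trained latent diffusion models, giving reliable analysis across many biological samples and imaging conditions without additional adaptation. The work offers a practical, broadly applicable tool for biomedical research and reduces reliance on manual analysis.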

English abstract

Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to much biomedical research. Yet conducting this analysis manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations in biological object types, sample processing protocols, imaging equipment, and analysis tasks) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which imposes burdens that are often unsustainable for laboratories. Biomedical researchers therefore still commonly rely on manual analysis, which severely bottlenecks the pace of biomedical research. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. From a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem and handles this problem effectively by exploiting the robust matching capability of pre-trained latent diffusion models.

2605.13360 2026-05-15 cs.LG

Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O and Speculative Tool Calling

Coleman Hooper, Minwoo Kang, Suhong Moon, Nicholas Lee, Eric Wen, John Wawrzynek, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Kurt Keutzer

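AI summary: This paper proposes Speculative Interaction Agents to address the high latency that tool calling introduces in applications where an agent must interact in real time. By introducing asynchronous I/O and speculative tool calling, the agent can keep processing while it waits for external information, which markedly improves responsiveness. Experiments report speedups of 1.3-1.7x for strong cloud models and 1.6-2.2x for small edge models with minor accuracy loss, making the approach suitable for latency-sensitive settings such as customer service and voice assistants.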

Comments
17 pages
English abstract

There is a growing demand for agentic AI technologies for a range of downstream applications like customer service and personal assistants. For applications where the agent needs to interact with a person, real-time low-latency responsiveness is required; for example, with voice-controlled applications, under 1 second of latency is typically required for the interaction to feel seamless. However, if we want the LLM to reason and execute an agentic workflow with tool calling, this can add several seconds or more of latency, which is prohibitive for real-time latency-sensitive applications. In our work, we propose Speculative Interaction Agents to enable real-time interaction even for agents with complex multi-turn tool calling. We propose Asynchronous I/O, which decouples the core agent reason-and-act thread from waiting for additional information from either the user or environment, thereby allowing for overlapping agentic processing while waiting on external delays. We also propose Speculative Tool Calling as a method to manage task execution when the agent is still unsure if it has received the full information or if additional user information may later be provided. For strong cloud models, our method can be applied out-of-the-box to existing real-time cloud APIs, providing 1.3-1.7x speedups with minor accuracy loss. To enable real-time interaction with small edge-scale models, we also present a clock-based training methodology that adapts the model to handle streaming inputs and asynchronous responses, and demonstrate a synthetic data generation strategy for SFT. Altogether, this approach provides 1.6-2.2x speedups with the Qwen2.5-3B-Instruct and Llama-3.2-3B-Instruct models across multiple tool calling benchmarks.

2605.12625 2026-05-15 cs.RO cs.CV

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Hengtong Lu, Victor Shea-Jay Huang, Chengmin Yang, Pengfei Jing, Jifeng Dai, Yan Xie, Benjin Zhu

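AI summary: This work addresses mode collapse in continuous-action policies trained on a single demonstrated trajectory per scene and proposes DIAL, a two-stage framework for preference-aligned driving policies. The first stage uses classifier-free guidance (CFG) to expand the action sampling distribution and break the mode collapse caused by single demonstrations; the second stage introduces multi-intent GRPO, which preserves distributional diversity during preference-based reinforcement learning and prevents fine-tuning from collapsing the policy again. Experiments on driving show that the method raises the Rater Feedback Score above that of the human-driven demonstration, supporting the importance of expanding the sampling distribution for continuous-action policies.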

Comments
Project page: https://mind-omni.github.io/
English abstract

Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance: even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. We instantiate DIAL for end-to-end driving with eight rule-derived intents and evaluate it on WOD-E2E. Competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but also how to expand and preserve the sampling distribution being optimized.
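
Classifier-free guidance on a flow-matching head combines a conditional and an unconditional velocity prediction. The sketch below shows only the standard CFG combination rule and one Euler integration step; it is not DIAL's implementation, and the names `cfg_velocity`, `euler_step`, and the guidance scale `w` are assumptions for illustration.

```python
from typing import List

Vector = List[float]


def cfg_velocity(v_uncond: Vector, v_cond: Vector, w: float) -> Vector:
    """Standard CFG rule: v = v_uncond + w * (v_cond - v_uncond).
    w = 0 ignores the condition, w = 1 is plain conditional sampling,
    w > 1 pushes samples further toward the conditioned (intent) mode."""
    if len(v_uncond) != len(v_cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(v_uncond, v_cond)]


def euler_step(x: Vector, v: Vector, dt: float) -> Vector:
    """One explicit Euler step of the flow ODE dx/dt = v."""
    return [xi + dt * vi for xi, vi in zip(x, v)]


if __name__ == "__main__":
    # Toy 2D action state; the two velocities would come from the action head
    # evaluated with and without the intent label.
    v = cfg_velocity([0.0, 1.0], [1.0, 1.0], w=2.0)
    print(euler_step([0.0, 0.0], v, dt=0.5))
```

Sweeping the intent label while holding the scene fixed is what lets a sampler of this kind cover distinct maneuver modes.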

2605.12624 2026-05-15 cs.RO cs.CV

MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li

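AI summary: This paper presents MindVLA-U1, a unified streaming VLA architecture for autonomous driving that targets the gap in planning quality between existing VLA models and VA models. A single vision-language-action (VLA) backbone jointly generates language tokens and continuous action trajectories, and a streaming design improves real-time performance. While keeping a natural-language interface, MindVLA-U1 substantially improves planning performance, surpasses human drivers on the WOD-E2E benchmark for the first time, and reports state-of-the-art planning accuracy with latency comparable to VA models.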

Comments
Work in progress. Project page: https://mind-omni.github.io/
English abstract

Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built (as isolated subtask improvements that fail to compose coherent driving capabilities) rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of each modality. A full streaming design processes the driving video framewise rather than as fixed video-action chunks under costly temporal VLM modeling. Planned trajectories evolve smoothly across frames while a learned streaming memory channel carries temporal context and updates. The unified architecture enables fast/slow systems on dense and sparse MoT backbones via flexible self-attention context management, and exposes a measurable language-control path for action: language-predicted driving intents steer the action diffusion via classifier-free guidance (CFG), turning language-side intent into control signals for continuous action planning. On the long-tail WOD-E2E benchmark, MindVLA-U1 surpasses experienced human drivers for the first time (8.20 RFS vs. 8.13 GT RFS) with 2 diffusion steps, achieves state-of-the-art planning ADEs over prior VA/VLA by large margins, and matches VA latency (16 FPS vs. RAP's 18 FPS at 1B scale) while preserving natural language interfaces for human-vehicle interaction.

2605.12622 2026-05-15 cs.RO cs.CV

Action Emergence from Streaming Intent

Pengfei Jing, Victor Shea-Jay Huang, Hengtong Lu, Jifeng Dai, Yan Xie, Benjin Zhu

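AI summary: This paper studies action emergence in end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in complex, long-tail traffic scenes. The authors propose SI (Streaming Intent), a vision-language-action model that streams driving intent both semantically, through a continuous chain of causal reasoning, and temporally, and uses that intent to guide action generation, yielding controllable, high-quality trajectory planning. Experiments show competitive performance on the Waymo End-to-End benchmark and, for the first time in a fully end-to-end VLA model, intent-based controllability.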

Comments
Project page: https://mind-omni.github.io/
English abstract

We formalize action emergence as a target capability for end-to-end autonomous driving: the ability to generate physically feasible, semantically appropriate, and safety-compliant actions in arbitrary, long-tail traffic scenes through scene-conditioned reasoning rather than retrieval or interpolation of learned scene-action mappings. We show that previous paradigms cannot deliver action emergence: autoregressive trajectory decoders collapse the inherently multimodal future into a single averaged output, while diffusion and flow-matching generators express multimodality but are not steerable by reasoned intent. We propose Streaming Intent as a concrete way to approach action emergence: a mechanism that makes driving intent (i) semantically streamed through a continuous chain-of-thought that causally derives the intent from scene understanding, and (ii) temporally streamed across clips so that intent commitments remain coherent along the driving horizon. We realize Streaming Intent in a VLA model we call SI (Streaming Intent). SI autoregressively decodes a four-step chain-of-thought and emits an intent token; the decoded intent then drives classifier-free guidance (CFG) on a flow-matching action head, requiring only two denoising steps to generate the final trajectory. On the Waymo End-to-End benchmark, SI achieves competitive aggregate performance, with an RFS score of 7.96 on the validation set and 7.74 on the test set. Beyond aggregate metrics, the model demonstrates -- to our knowledge for the first time in a fully end-to-end VLA -- intent-faithful controllability: for a fixed scene, varying the intent class at inference yields qualitatively distinct yet consistently high-quality plans, arising purely from data-driven learning without any pre-built trajectory bank or hand-coded post-hoc selector.

2605.12484 2026-05-15 cs.LG cs.AI

Learning, Fast and Slow: Towards LLMs That Adapt Continually

Rishabh Tiwari, Kusha Sareen, Lakshya A Agrawal, Joseph E. Gonzalez, Matei Zaharia, Kurt Keutzer, Inderjit S Dhillon, Rishabh Agarwal, Devvrit Khatri

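AI summary: Large language models (LLMs) are usually adapted to downstream tasks by updating their parameters (for example, with reinforcement learning), which can cause catastrophic forgetting and loss of plasticity. In-context learning with fixed parameters adapts quickly to task requirements but yields smaller performance gains. This paper proposes a fast-slow learning framework that treats model parameters as "slow" weights and optimized context as "fast" weights, enabling efficient learning while keeping the model close to its base. Experiments show that the method beats parameter-only training in both sample efficiency and performance asymptote, and that it adapts better and forgets less in continual learning settings.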

Comments
29 pages, 14 figures, including appendix; Blog post: https://gepa-ai.github.io/gepa/blog/2026/05/11/learning-fast-and-slow/
English abstract

Large language models (LLMs) are trained for downstream tasks by updating their parameters (e.g., via RL). However, updating parameters forces them to absorb task-specific information, which can result in catastrophic forgetting and loss of plasticity. In contrast, in-context learning with fixed LLM parameters can cheaply and rapidly adapt to task-specific requirements (e.g., prompt optimization), but typically cannot by itself match the performance gains available through updating LLM parameters. There is no good reason for restricting learning to being in-context or in-weights. Moreover, humans also likely learn at different time scales (e.g., System 1 vs 2). To this end, we introduce a fast-slow learning framework for LLMs, with model parameters as "slow" weights and optimized context as "fast" weights. These fast "weights" can learn from textual feedback to absorb the task-specific information, while allowing slow weights to stay closer to the base model and retain general reasoning behaviors. Fast-Slow Training (FST) is up to 3x more sample-efficient than only slow learning (RL) across reasoning tasks, while consistently reaching a higher performance asymptote. Moreover, FST-trained models remain closer to the base LLM (up to 70% less KL divergence), resulting in less catastrophic forgetting than RL training. This reduced drift also preserves plasticity: after training on one task, FST-trained models adapt more effectively to a subsequent task than parameter-only trained models. In continual learning scenarios, where task domains change on the fly, FST continues to acquire each new task while parameter-only RL stalls.

2605.08512 2026-05-15 cs.LG

MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

Yusuf Syed, Viraj Parimi, Brian Williams

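AI summary: This paper proposes MoMo, a conditioned contrastive representation learning method for planning modulated by user preference. Using Feature-Wise Linear Modulation and low-rank neural modulation, it jointly conditions the representation geometry and the latent prediction operator, so that a scalar preference can continuously adjust how conservative the plan is at inference time, without retraining. Experiments across several environments show that MoMo smoothly adapts plan safety to user preferences and improves temporal and preferential consistency.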

English abstract

Temporally contrastive representation learning induces a latent structure capable of reducing long-horizon planning to inference in a low-dimensional linear system. However, existing contrastive planning work learns a single latent geometry which cannot distinguish multiple valid behaviors trading task efficiency against risk exposure for the same start-goal query. We introduce MoMo, a preference-conditioned contrastive planner allowing a scalar user preference to continuously modulate plan conservativeness at inference time, without retraining. MoMo learns a joint conditioning of the representation geometry and latent prediction operator via Feature-Wise Linear Modulation and low-rank neural modulation, respectively. We show that our formulation preserves the probability density ratio encoded in the representation space that is required for inference-driven contrastive planning, further retaining its inference-time efficiency. Across six environments, MoMo smoothly adapts plan safety according to user preferences, yielding improved temporal and preferential consistency over state augmentation baselines.
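
Feature-Wise Linear Modulation (FiLM) scales and shifts each feature using values computed from a conditioning input. The following is a minimal sketch of that operation with a scalar preference as the condition; it is not MoMo's architecture. The per-feature affine maps that produce the scale and shift (`w_gamma`, `b_gamma`, `w_beta`, `b_beta`) are the simplest possible stand-in for a learned conditioning network and are assumptions.

```python
from typing import List

Vector = List[float]


def film(features: Vector, preference: float,
         w_gamma: Vector, b_gamma: Vector,
         w_beta: Vector, b_beta: Vector) -> Vector:
    """FiLM: out_i = gamma_i(c) * x_i + beta_i(c), where the scale gamma and
    shift beta are (here) affine functions of the scalar preference c."""
    out = []
    for x, wg, bg, wb, bb in zip(features, w_gamma, b_gamma, w_beta, b_beta):
        gamma = wg * preference + bg
        beta = wb * preference + bb
        out.append(gamma * x + beta)
    return out


if __name__ == "__main__":
    z = [1.0, 2.0]
    for c in (0.0, 0.5, 1.0):  # sweep the preference without touching the encoder
        print(c, film(z, c, [2.0, 0.0], [1.0, 1.0], [0.0, 4.0], [0.0, 0.0]))
```

Because the preference enters only through the scale and shift, it can be varied continuously at inference time with the underlying features held fixed.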

2605.07785 2026-05-15 cs.CV

Radiologist-Guided Causal Concept Bottleneck Models for Chest X-Ray Interpretation

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

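AI summary: This work proposes XpertCausal, a radiologist-guided causal concept bottleneck model for chest X-ray interpretation. The model represents the causal relationship between diseases and radiographic findings with a probabilistic noisy-OR framework and uses Bayesian inference to estimate disease probabilities from predicted concepts. Radiologist-defined concept-disease associations constrain the model structure to clinically plausible reasoning pathways, so the model outperforms conventional concept bottleneck models in diagnostic performance, calibration, and explanation quality and aligns more closely with expert knowledge.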

English abstract

Concept Bottleneck Models (CBMs) in medical imaging aim to improve model interpretability by predicting intermediate clinical concepts before final diagnoses. However, most existing CBMs treat concepts as discriminative predictors of pathology labels, without explicitly modelling the underlying clinical generative process where diseases produce observable radiographic findings. We propose XpertCausal, a radiologist-guided causal CBM for chest X-ray interpretation which models pathology-to-concept relationships using a probabilistic noisy-OR framework. This generative model is then inverted via Bayesian inference to estimate pathology probabilities from predicted concepts. Radiologist-curated concept-pathology associations are used to constrain model structure to radiologist-defined clinically plausible reasoning pathways. We evaluate XpertCausal on MIMIC-CXR across pathology classification performance, calibration, explanation quality, and alignment with radiologist-defined reasoning pathways. Compared with both a non-causal CBM baseline and a causal ablation with unconstrained learned associations, XpertCausal achieves improved AUROC, calibration, and clinically relevant explanation quality, while learning concept-pathology relationships that more closely align with expert knowledge. These results demonstrate that incorporating clinically motivated causal structure and expert domain knowledge into CBMs can lead to more accurate, interpretable, and clinically aligned models for CXR interpretation.
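
A noisy-OR generative model and its Bayesian inversion can be written down compactly for small problems. This is a generic textbook sketch rather than XpertCausal itself: it assumes binarized concept observations, independent pathology priors, and exact inference by enumerating all pathology states. The link probabilities, leak terms, and function names are hypothetical.

```python
from itertools import product
from typing import List, Sequence


def noisy_or(active: Sequence[int], link: Sequence[float], leak: float = 0.0) -> float:
    """P(concept present | pathology states): the concept is absent only if the
    leak and every active pathology independently fail to produce it."""
    p_absent = 1.0 - leak
    for d, p in zip(active, link):
        if d:
            p_absent *= 1.0 - p
    return 1.0 - p_absent


def posterior_pathologies(concepts: Sequence[int],
                          links: Sequence[Sequence[float]],
                          priors: Sequence[float],
                          leaks: Sequence[float]) -> List[float]:
    """Posterior marginals P(pathology_j = 1 | concepts) by exhaustive enumeration.
    links[i][j] is the chance that pathology j alone triggers concept i."""
    n = len(priors)
    marginals = [0.0] * n
    total = 0.0
    for state in product((0, 1), repeat=n):
        weight = 1.0
        for d, prior in zip(state, priors):
            weight *= prior if d else 1.0 - prior
        for c, row, leak in zip(concepts, links, leaks):
            p = noisy_or(state, row, leak)
            weight *= p if c else 1.0 - p
        total += weight
        for j, d in enumerate(state):
            if d:
                marginals[j] += weight
    if total == 0.0:
        raise ValueError("observations have zero probability under the model")
    return [m / total for m in marginals]


if __name__ == "__main__":
    # Two pathologies, two concepts; only structurally allowed links are nonzero.
    links = [[0.8, 0.0], [0.3, 0.9]]
    print(posterior_pathologies([1, 0], links, priors=[0.1, 0.1], leaks=[0.05, 0.05]))
```

Setting a link probability to zero removes that pathology-to-concept edge, which is one simple way to encode an expert-curated structure in a model of this kind.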

2605.02043 2026-05-15 cs.LG

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

Tehila Dahan, Roie Reshef, Sharon Goldstein, Kfir Y. Levy

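AI summary: This paper studies gradient staleness caused by data-dependent delays in asynchronous stochastic gradient descent (SGD) and proposes a momentum-based asynchronous framework that preserves the information in delayed gradients while mitigating their effect. The method establishes the first optimal convergence rates for smooth convex and non-convex optimization under data-dependent delays, and provides robust learning-rate schedules that simplify hyperparameter tuning.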

English abstract

Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically attenuate or discard delayed gradients, introducing systematic bias: updates from simpler or faster-to-process samples are overrepresented, while gradients from more complex samples are delayed or suppressed. In contrast, prior approaches to data-dependent delays rely on a Lipschitz assumption that yields suboptimal rates or leave the smooth, convex case unaddressed. We propose a momentum-based asynchronous framework designed to preserve information from delayed gradients while mitigating the effects of staleness. We establish the first optimal convergence rates for data-dependent delays in both convex and non-convex smooth setups, providing a new result for asynchronous optimization under standard assumptions. Additionally, we derive robust learning-rate schedules that simplify hyperparameter tuning in practice.

2605.02004 2026-05-15 cs.AI

Personalized Digital Health Modeling with Adaptive Support Users

Zhongqi Yang, Mahkameh Rasouli, Neda Mohseni, Yong Huang, Iman Azimi, Amir M. Rahmani

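AI summary: In digital health, individuals differ substantially in physiology and behavior, so personalized modeling is essential. Because user-specific data are scarce and noisy, existing methods mostly rely on population pretraining or data from similar users, which leads to biased transfer and weak generalization. This paper proposes a unified personalization framework that adaptively weights data from both similar and dissimilar users, combining a personal loss, similarity-weighted transfer, and contrastive regularization to improve robustness. Experiments on several real-world datasets show consistent gains over population and personalized baselines, especially in low-data settings, along with better data efficiency and interpretability.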

English abstract

Personalized models are essential in digital health because individuals exhibit substantial physiological and behavioral heterogeneity. Yet personalization is limited by scarce and noisy user-specific data. Most existing methods rely on population pretraining or data from similar users only, which can lead to biased transfer and weak generalization. We propose a unified personalization framework that trains a personal model using adaptively weighted support users, including both similar and dissimilar individuals. The objective integrates personal loss, similarity-weighted transfer from similar users, and contrastive regularization from dissimilar users to suppress misleading correlations. An iterative optimization algorithm jointly updates model parameters and user similarity weights. Experiments on six tasks across four real-world digital health datasets show consistent improvements over population and personalized baselines. The method achieves up to 10% lower RMSE on large-scale datasets and approximately 25% lower RMSE in low-data settings. The learned adaptive weights improve data efficiency and provide interpretable guidance for targeted data selection.

2604.25855 2026-05-15 cs.CV cs.AI

SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

Hector G. Rodriguez, Marcus Rohrbach

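AI summary: This paper proposes SIEVES, a selective prediction method that aims to improve the reliability and coverage of visual question answering (VQA) systems in real-world and out-of-distribution (OOD) scenarios. The reasoner model produces localized visual evidence while answering, and a selector explicitly scores answer quality from that evidence, giving more accurate confidence estimates without relying on model-internal signals such as logits or hidden states. Experiments show substantially higher coverage on several challenging OOD benchmarks, and the approach works with frontier closed-source models without access to their weights or logits.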

English abstract

Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-language tasks. Even as traditional visual question answering (VQA) benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world, out-of-distribution (OOD) scenarios. Selective prediction aims to improve coverage, i.e., the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a certain threshold. Existing selective prediction methods estimate implicit confidence scores, relying on model-internal signals like logits or hidden representations, which are not available for frontier closed-source models. To enable reliable generalization in VQA, we require reasoner models to produce localized visual evidence while answering, and design a selector that explicitly learns to estimate the quality of the localization provided by the reasoner using only model inputs and outputs. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA), compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all tested OOD benchmarks and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro), without benchmark- or reasoner-specific training or adaptation. Code is publicly available at https://github.com/hector-gr/SIEVES.
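
Coverage at a fixed risk level, the quantity selective prediction tries to maximize, can be computed from per-answer confidence scores and correctness labels. The sketch below is a generic illustration of that metric, not the SIEVES evaluation code; the function name `coverage_at_risk` is hypothetical, and ties between confidence scores are broken arbitrarily rather than grouped under one threshold.

```python
from typing import Sequence


def coverage_at_risk(confidences: Sequence[float], correct: Sequence[bool],
                     max_risk: float) -> float:
    """Largest fraction of inputs that can be answered, taking answers in
    decreasing confidence order, while the error rate among answered inputs
    stays at or below max_risk."""
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    best = 0.0
    errors = 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        if errors / k <= max_risk:
            best = k / len(order)
    return best


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.6]
    ok = [True, True, False, True]
    print(coverage_at_risk(conf, ok, max_risk=0.0))
    print(coverage_at_risk(conf, ok, max_risk=0.25))
```

A better selector ranks wrong answers lower, so more inputs can be answered before the risk budget is exhausted.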

2604.21909 2026-05-15 cs.CV cs.IT math.IT q-bio.NC

Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision

Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin

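AI summary: This study examines how human and machine vision systems differ in the direction of their confusions on a categorization task, revealing differences in inductive bias. Analyzing human and deep neural network responses under 12 perturbation types, it quantifies asymmetry in confusion matrices and links it to the geometry of the information-error trade-off. Humans show broad but weak asymmetries across class pairs, whereas deep models show sparser, stronger directional confusions, and this difference reflects distinct generalization strategies even at matched accuracy.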

English abstract

To humans, a robin seems more like a bird than a bird seems like a robin, but does this asymmetry also hold for machine vision? Humans and modern vision models can match each other in accuracy while making systematically different kinds of errors, differing not in how often they fail, but in who gets mistaken for whom. We show these directional confusions reveal distinct inductive biases invisible to accuracy alone. Using matched human and deep neural network responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link its organization to the geometry of the information-error trade-off: how efficiently, and how gracefully, a system generalizes under distortion. We find that humans exhibit broad but weak asymmetries across many class pairs, whereas deep vision models show sparser, stronger directional collapses into a few dominant categories. Robustness training reduces overall asymmetry magnitude but fails to recover this human-like distributed structure. Generative simulations further show that these two asymmetry organizations shift the trade-off geometry in opposite directions even at matched accuracy, explaining why the same scalar asymmetry score can reflect fundamentally different generalization strategies. Together, these results establish directional confusion structure as a sensitive, interpretable signature of inductive bias that accuracy-based evaluation cannot recover.
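
One simple way to quantify directional confusion is to compare each off-diagonal entry of a row-normalized confusion matrix with its transpose. The sketch below uses that generic definition; the paper's exact asymmetry score is not given in the abstract, so the measure, along with the names `row_normalize` and `asymmetry`, is an assumption.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def row_normalize(counts: Matrix) -> Matrix:
    """Convert counts to P(predicted = j | true = i); empty rows stay at zero."""
    out = []
    for row in counts:
        s = sum(row)
        out.append([c / s if s else 0.0 for c in row])
    return out


def asymmetry(counts: Matrix) -> Tuple[Matrix, float]:
    """Return the antisymmetric part A[i][j] = P[i][j] - P[j][i] and a scalar
    magnitude: the sum of |A[i][j]| over unordered class pairs."""
    p = row_normalize(counts)
    n = len(p)
    a = [[p[i][j] - p[j][i] for j in range(n)] for i in range(n)]
    magnitude = sum(abs(a[i][j]) for i in range(n) for j in range(i + 1, n))
    return a, magnitude


if __name__ == "__main__":
    # Class 0 is sometimes mistaken for class 1, but never the reverse.
    a, mag = asymmetry([[8, 2], [0, 10]])
    print(a, mag)
```

Two systems can share the same scalar magnitude while distributing it very differently across class pairs, which is why the full matrix `A` is worth inspecting alongside the summary number.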

2604.09860 2026-05-15 cs.RO cs.AI

RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang, Rishit Dagli, Alex Zook, Hugo Hadfield, Ankit Goyal, Stan Birchfield, Fabio Ramos, Jonathan Tremblay

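AI summary: To address rapid performance saturation and the lack of true generalization testing in simulation benchmarks for general-purpose robotics, the study presents RoboLab, a high-fidelity simulation benchmarking framework. The framework generates scenes and tasks in a robot- and policy-agnostic manner, supports in-depth analysis of how real-world policies behave in simulation, and introduces the RoboLab-120 benchmark of 120 tasks spanning three competency axes: visual, procedural, and relational. The study also systematically evaluates the shortcomings of current state-of-the-art models in performance and behavioral robustness, providing granular metrics and scalable tools for assessing the true generalization of task-generalist robot policies.

Details
Journal ref
Robotics: Science and Systems XXII, Sydney, Australia, 2026
English abstract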

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which factors most strongly affect policy behavior. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a high-fidelity simulation environment. We introduce an accompanying RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes (visual, procedural, and relational) across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantifies both their performance and the sensitivity of their behavior to controlled perturbations, exposing a significant performance gap in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a scalable framework for evaluating the true generalization capabilities of task-generalist robotic policies. Project website: https://research.nvidia.com/labs/srl/projects/robolab/.

2603.16039 2026-05-15 cs.LG cs.AI cs.CL

Residual Stream Duality in Modern Transformer Architectures

Yifan Zhang

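AI summary: This paper examines residual stream duality in modern Transformer architectures, arguing that the residual pathway is not merely an optimization aid but part of the model's representational machinery. The author proposes viewing the Transformer design space along two axes, sequence position and layer depth, and shows that attention over the residual stream along the depth axis is dual to short sliding-window attention along the sequence axis. From this perspective, the paper weighs the trade-offs of different model designs and recommends Deep Delta Learning (DDL) when the shortcut connection itself is the object of interest, and sequence-axis short sliding-window attention (ShortSWA) when local adaptive mixing is needed.

Details
Comments
Project Page: https://github.com/yifanzhang-pro/residual-stream-duality
English abstract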

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
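
The duality claim can be made concrete with a small sketch: one causal sliding-window attention routine, applied once along the sequence axis and once along the depth axis. This is an illustrative toy under stated assumptions, not the paper's implementation: the function name is hypothetical, there are no learned projections (queries, keys, and values are the raw states), and the vectors are invented.

```python
import math


def causal_window_attention(queries, keys, values, window):
    """Causal sliding-window softmax attention along one ordered axis.

    queries, keys, values: lists of float vectors indexed by position on
    the axis. Position t attends to positions max(0, t - window + 1) .. t.
    """
    outputs = []
    for t, q in enumerate(queries):
        start = max(0, t - window + 1)
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(len(q))
            for s in range(start, t + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed = [0.0] * len(values[0])
        for w, s in zip(weights, range(start, t + 1)):
            for d, v in enumerate(values[s]):
                mixed[d] += (w / total) * v
        outputs.append(mixed)
    return outputs


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

# Sequence axis: read `states` as 4 token positions at one layer.
seq_out = causal_window_attention(states, states, states, window=2)

# Depth axis: read the same list as one token's state at 4 successive layers.
depth_out = causal_window_attention(states, states, states, window=2)
```

The routine never needs to know which axis it is indexing, which is the operator-level point. The abstract's systems-level caveat (kernel reuse, KV-cache layout, chunked execution) is exactly what a toy like this cannot show.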

2602.24273 2026-05-15 cs.AI

A Minimal Agent for Automated Theorem Proving

Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran-Ferreiro, Leopoldo Sarra

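AI summary: This paper proposes a minimal agentic baseline for automated theorem proving, intended to provide a systematic basis for comparing different AI-based theorem prover architectures. The design implements the core features shared by current state-of-the-art systems: iterative proof refinement, library search, and context management. Experiments show performance competitive with existing state-of-the-art approaches while using a significantly simpler architecture at low cost, and demonstrate the advantages of an iterative approach over single-shot generation in sample efficiency and cost effectiveness. The code is released open-source as a reference for future research and for community use.

Details
Comments
Accepted for publication at ICML 2026
English abstract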

We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate this agentic approach using qualitatively different benchmarks and compare various frontier language models and design choices. Our results show competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture and a fraction of their cost. Additionally, we demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.

2602.14674 2026-05-15 cs.AI

From User Preferences to Base Score Extraction Functions in Gradual Argumentation (with Appendix)

Aniol Civit, Antonio Rago, Antonio Andriella, Guillem Alenyà, Francesca Toni

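AI summary: This paper studies how to extract base scores from users' preferences over arguments to support decision-making in gradual argumentation systems. The authors propose base score extraction functions that map user preferences to arguments' base scores, and apply them to a Bipolar Argumentation Framework to obtain a Quantitative Bipolar Argumentation Framework that can be analyzed with existing computational tools. The method accounts for non-linearities in human preferences, is validated through theoretical analysis and robotics experiments, and offers guidance for selecting gradual semantics in practice.

Details
Comments
Accepted to AAMAS 2026 - With Appendix
English abstract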

Gradual argumentation is a field of symbolic AI which is attracting attention for its ability to support transparent and contestable AI systems. It is considered a useful tool in domains such as decision-making, recommendation, debate analysis, and others. The outcomes in such domains are usually dependent on the arguments' base scores, which must be selected carefully. Often, this selection process requires user expertise and may not always be straightforward. On the other hand, organising the arguments by preference could simplify the task. In this work, we introduce \emph{Base Score Extraction Functions}, which provide a mapping from users' preferences over arguments to base scores. These functions can be applied to the arguments of a \emph{Bipolar Argumentation Framework} (BAF), supplemented with preferences, to obtain a \emph{Quantitative Bipolar Argumentation Framework} (QBAF), allowing the use of well-established computational tools in gradual argumentation. We outline the desirable properties of base score extraction functions, discuss some design choices, and provide an algorithm for base score extraction. Our method incorporates an approximation of non-linearities in human preferences to allow for better approximation of the real ones. Finally, we evaluate our approach both theoretically and experimentally in a robotics setting, and offer recommendations for selecting appropriate gradual semantics in practice.

2602.11626 2026-05-15 cs.LG cs.AI physics.chem-ph physics.comp-ph physics.flu-dyn

ArGEnT: Arbitrary Geometry-encoded Transformer for Operator Learning

Wenqian Chen, Yucheng Fu, Michael Penwarden, Pratanu Roy, Panos Stinis

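AI summary: Learning solution operators for systems with complex, varying geometries and parametric physical conditions is a central challenge in scientific machine learning. This paper proposes ArGEnT, an Arbitrary Geometry-encoded Transformer that uses attention to encode geometric information directly from point-cloud representations, with three variants (self-attention, cross-attention, and hybrid-attention) that integrate geometric features in different ways. Integrating ArGEnT into DeepONet as the trunk network yields a surrogate modeling framework that does not require explicitly parameterized geometry inputs. Experiments on benchmark problems in fluid dynamics, solid mechanics, and electrochemical systems show significantly better prediction accuracy and generalization than the standard DeepONet and other geometry-aware surrogate models.

Details
Comments
69 pages, 21 figures, 10 tables
English abstract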

Learning solution operators for systems with complex, varying geometries and parametric physical settings is a central challenge in scientific machine learning. In many-query regimes such as design optimization, control and inverse problems, surrogate modeling must generalize across geometries while allowing flexible evaluation at arbitrary spatial locations. In this work, we propose Arbitrary Geometry-encoded Transformer (ArGEnT), a geometry-aware attention-based architecture for operator learning on arbitrary domains. ArGEnT employs Transformer attention mechanisms to encode geometric information directly from point-cloud representations, with three variants (self-attention, cross-attention, and hybrid-attention) that use different strategies for incorporating geometric features. By integrating ArGEnT into DeepONet as the trunk network, we develop a surrogate modeling framework capable of learning operator mappings that depend on both geometric and non-geometric inputs without the need to explicitly parametrize geometry as a branch network input. Evaluating on benchmark problems spanning fluid dynamics, solid mechanics and electrochemical systems, we demonstrate significantly improved prediction accuracy and generalization performance compared with the standard DeepONet and other existing geometry-aware surrogates. In particular, the cross-attention transformer variant enables accurate geometry-conditioned predictions with reduced reliance on signed distance functions. By combining flexible geometry encoding with operator-learning capabilities, ArGEnT provides a scalable surrogate modeling framework for optimization, uncertainty quantification, and data-driven modeling of complex physical systems.

2602.04289 2026-05-15 cs.CL cs.LG

Proxy Compression for Language Modeling

Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong

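AI summary: This paper proposes proxy compression, a new language model training scheme that keeps the efficiency benefits of compressed inputs while providing an end-to-end raw-byte interface at inference time. During training, the model is trained jointly on raw byte sequences and compressed views produced by external compressors, so that it learns to align compressed sequences with raw bytes internally and needs no external tokenizer at inference. Experiments show that proxy compression substantially improves training efficiency and outperforms conventional byte-level baselines under a fixed compute budget, with the advantage growing as model scale increases.

Details
Comments
ICML 2026
English abstract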

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or surpass tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling. Our code is available at https://github.com/LZhengisme/proxy-compression.

2602.02711 2026-05-15 cs.AI

Dynamic Mixed-Precision Routing for Efficient Multi-step LLM Interaction

Yuanzhe Li, Jianing Deng, Jingtong Hu, Tianlong Chen, Song Wang, Huanrui Yang

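AI summary: To address the high inference cost of large language models (LLMs) in long-horizon decision-making tasks, this study proposes Dynamic Mixed-Precision Routing (DMR), a framework that adaptively selects a high-precision or low-precision model at each decision step to reduce compute cost while maintaining task success rates. Building on the observation that steps differ in their sensitivity to precision, the method uses a two-stage training strategy that combines KL-divergence-based supervised learning with Group-Relative Policy Optimization, improving the balance between performance and efficiency. Experiments show that DMR achieves a better accuracy-cost trade-off than single-precision baselines on tasks such as ALFWorld and WebShop.

Details
English abstract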

Large language models (LLMs) achieve strong performance in long-horizon decision-making tasks through multi-step interaction and reasoning at test time. While practitioners commonly believe a higher task success rate necessitates the use of a larger and stronger LLM model, multi-step interaction with a large LLM incurs prohibitive inference cost. To address this problem, we explore the use of low-precision quantized LLMs in the long-horizon decision-making process. Based on the observation of diverse sensitivities among interaction steps, we propose Dynamic Mixed-Precision Routing (DMR), a framework that adaptively selects between high-precision and low-precision LLMs at each decision step. The router is trained via a two-stage pipeline, consisting of KL-divergence-based supervised learning that identifies precision-sensitive steps, followed by Group-Relative Policy Optimization (GRPO) to further improve task success rates. Experiments on ALFWorld and WebShop demonstrate that our approach achieves a strong accuracy-cost trade-off over single-precision baselines.

2602.01828 2026-05-15 cs.LG

Hyperbolic Graph Neural Networks Under the Microscope: The Role of Geometry-Task Alignment

Dionisia Naddeo, Jonas Linkerhägner, Nicola Toschi, Geri Skenderi, Veronica Lachi

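AI summary: Many complex networks have hierarchical, tree-like structure, which makes hyperbolic space a natural choice for learning their representations. This paper proposes the new condition of geometry-task alignment, which asks whether the metric structure of the target task matches the geometric structure of the input graph. Theory and experiments show that hyperbolic graph neural networks can learn low-distortion representations on regression tasks and have an advantage on tasks that require preserving metric structure. The study further finds that hyperbolic graph neural networks outperform Euclidean models on geometry-aligned link prediction, but show little advantage on node classification, which is typically not geometry-aligned, highlighting the importance of alignment between task and geometry.

Details
English abstract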

Many complex networks exhibit hierarchical, tree-like structures, making hyperbolic space a natural candidate wherein to learn representations of them. Based on this observation, Hyperbolic Graph Neural Networks (HGNNs) have been widely adopted as a principled choice for representation learning on tree-like graphs. In this work, we question this paradigm by proposing the additional condition of geometry--task alignment, i.e., whether the metric structure of the target follows that of the input graph. We theoretically and empirically demonstrate the capability of HGNNs to recover low-distortion representations on regression problems, and show that their geometric inductive bias becomes helpful when the problem requires preserving metric structure. By jointly analyzing predictive performance and embedding distortion, we further show that HGNNs gain an advantage on link prediction, a naturally geometry-aligned task, whereas this advantage largely disappears on standard node classification benchmarks, which are typically not geometry--aligned. Overall, our findings shift the focus from only asking "Is the graph hyperbolic?" to also questioning "Is the task aligned with hyperbolic geometry?", showing that HGNNs consistently outperform Euclidean models under such alignment, while their advantage vanishes otherwise.

2601.16981 2026-05-15 cs.CV cs.GR

SyncLight: Single-Edit Multi-View Relighting

David Serrano-Lozano, Anand Bhattad, Luis Herranz, Jean-François Lalonde, Javier Vazquez-Corral

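AI summary: SyncLight is a method for consistent multi-view scene relighting driven by a single-view edit, addressing the strict lighting consistency required in multi-camera broadcasts, stereoscopic cinema, and virtual production. Using a multi-view diffusion transformer trained with a latent bridge matching formulation, it relights a multi-view image set with high fidelity in a single inference step and generalizes to an arbitrary number of views without camera pose information. Its main contributions are parametric lighting control and a large-scale hybrid dataset of synthetic and real multi-view data to support training.

Details
Comments
Project page: http://sync-light.github.io
English abstract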

We present SyncLight, a method to enable consistent, parametric control over light sources across multiple uncalibrated views of a static scene conditioned on a single view. While single-view relighting has advanced significantly, existing generative approaches struggle to maintain the rigorous lighting consistency essential for multi-camera broadcasts, stereoscopic cinema, and virtual production. SyncLight addresses this by enabling precise control over light intensity and color across a multi-view capture of a scene, conditioned on a single reference edit. Our method leverages a multi-view diffusion transformer trained using a latent bridge matching formulation, achieving high-fidelity relighting of the entire image set in a single inference step. To facilitate training, we introduce a large-scale hybrid dataset comprising diverse synthetic environments -- curated from existing sources and newly designed scenes -- alongside high-fidelity, real-world multi-view captures under calibrated illumination. Though trained only on image pairs, SyncLight generalizes zero-shot to an arbitrary number of viewpoints, effectively propagating lighting changes across all views, without requiring camera pose information. SyncLight enables practical relighting workflows for multi-view capture systems.

2512.13609 2026-05-15 cs.CV cs.LG

Do-Undo Bench: Reversibility for Action Understanding in Image Generation

Shweta Mahajan, Shreya Kadambi, Hoang Le, Rajeev Yasarla, Apratim Bhattacharyya, Munawar Hayat, Fatih Porikli

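AI summary: This paper introduces a new task and benchmark, Do-Undo Bench, to address a key weakness of vision-language models in understanding and generating scene transformations driven by real-world actions. The task requires models not only to simulate the effect of a real-world action on a scene but also to restore the scene to its original state, testing their understanding of cause and effect. The study finds that current models perform poorly on action reversibility, underscoring the need to evaluate action understanding; the benchmark provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

Details
Comments
Project page: https://s-mahajan.github.io/Do-Undo-Bench/
English abstract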

We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models: understanding and generating plausible scene transformations driven by real-world actions. Unlike prior work that relies on prompt-based image generation and editing to perform action-conditioned image manipulation, our training hypothesis requires models to simulate the outcome of a real-world action and then reverse it to the original state. This forward-reverse requirement tests genuine cause-and-effect understanding rather than stylistic or semantic edits. We curate a high-quality benchmark of reversible actions from real-world scenarios to enable robust action grounding. Our experiments reveal that current models struggle with action reversibility, highlighting the need to evaluate action understanding. Do-Undo provides an intuitive testbed for evaluating and advancing action-aware generation in multimodal systems that must reason about real-world dynamics.

2512.06471 2026-05-15 cs.LG cs.AI

Why Goal-Conditioned Reinforcement Learning Works: Relation to Dual Control

Nathan P. Lawrence, Ali Mesbah

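AI summary: This paper analyzes why goal-conditioned reinforcement learning (RL) succeeds and connects it to optimal control theory. It derives an optimality gap between classical quadratic objectives and the goal-conditioned reward, explaining why goal-conditioned rewards can outperform dense rewards in some cases. It then links the goal-conditioned reward to state estimation in partially observed Markov decision processes, showing its suitability for dual control problems, and validates the advantages of goal-conditioned policies in nonlinear and uncertain environments using RL and predictive control methods.

Details
Comments
IFAC world congress postprint
English abstract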

Goal-conditioned reinforcement learning (RL) concerns the problem of training an agent to maximize the probability of reaching target goal states. This paper presents an analysis of the goal-conditioned setting based on optimal control. In particular, we derive an optimality gap between more classical, often quadratic, objectives and the goal-conditioned reward, elucidating the success of goal-conditioned RL and why classical ``dense'' rewards can falter. We then consider the partially observed Markov decision setting and connect state estimation to our probabilistic reward, making the goal-conditioned reward well suited to dual control problems. The advantages of goal-conditioned policies are validated on nonlinear and uncertain environments using both RL and predictive control techniques.

2511.14751 2026-05-15 cs.CV cs.RO

Co-Me: Confidence-Guided Token Merging for Visual Geometric Transformers

Yutian Chen, Yuheng Qiu, Ruogu Li, Ali Agha, Shayegan Omidshafiei, Jay Patrikar, Sebastian Scherer

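AI summary: This paper proposes Co-Me, a confidence-guided token merging method that accelerates visual geometric transformers without retraining or fine-tuning the base model. It trains a lightweight confidence predictor to rank tokens by uncertainty and selectively merges low-confidence tokens, reducing computation while maintaining spatial coverage. Experiments show substantial speedups across a range of multi-view and streaming visual geometric transformers, reaching up to 21.5x on VGGT and 20.4x on Pi3, making real-time 3D perception and reconstruction practical.

Details
English abstract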

We propose Confidence-Guided Token Merging (Co-Me), an acceleration mechanism for visual geometric transformers without retraining or finetuning the base model. Co-Me distills a lightweight confidence predictor to rank tokens by uncertainty and selectively merge low-confidence ones, effectively reducing computation while maintaining spatial coverage. Compared to similarity-based merging or pruning, the confidence signal in Co-Me reliably indicates regions emphasized by the transformer, enabling substantial acceleration without degrading performance. Co-Me applies seamlessly to various multi-view and streaming visual geometric transformers, achieving speedups that scale with sequence length. When applied to VGGT and Pi3, Co-Me achieves up to 21.5x and 20.4x speedup, making visual geometric transformers practical for real-time 3D perception and reconstruction.
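
The rank-then-merge idea can be sketched in a few lines. This is a minimal illustration under assumptions, not Co-Me itself: the function name, the `keep_ratio` and `group_size` parameters, and the choice to average low-confidence tokens in fixed-size groups are all hypothetical, since the abstract does not describe the actual merging rule or how the predictor is distilled.

```python
def merge_low_confidence(tokens, confidence, keep_ratio, group_size):
    """Keep the most confident tokens and average the rest in groups.

    tokens: list of float vectors; confidence: one float per token.
    Returns kept tokens (in original order) followed by merged tokens.
    """
    if len(tokens) != len(confidence):
        raise ValueError("one confidence value per token is required")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token indices from most to least confident.
    ranked = sorted(range(len(tokens)), key=lambda i: confidence[i], reverse=True)
    keep = sorted(ranked[:n_keep])
    # Restore original order so that merged groups stay roughly local.
    low = sorted(ranked[n_keep:])
    merged = []
    for start in range(0, len(low), group_size):
        group = [tokens[i] for i in low[start:start + group_size]]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return [tokens[i] for i in keep] + merged


# Invented toy input: six 1-D tokens with made-up confidence values.
toy_tokens = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
toy_confidence = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4]
reduced = merge_low_confidence(toy_tokens, toy_confidence, keep_ratio=0.5, group_size=2)
```

Because low-confidence tokens are averaged rather than dropped, every region of the input still contributes to the shortened sequence, which is one plausible reading of "maintaining spatial coverage" in the abstract.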

2510.25356 2026-05-15 cs.CL

Prompting from the bench: Large-scale pretraining is not sufficient to prepare LLMs for ordinary meaning analysis

Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider

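AI summary: This study examines whether large language models (LLMs) are suitable for ordinary meaning analysis, particularly in high-stakes domains such as legal interpretation. Although LLMs see large amounts of data during pretraining, experiments show they are highly sensitive to changes in question format, which leads to unstable conclusions, and that they correlate at best moderately with human judgments. The study questions the current practice of relying on LLMs for textual interpretation in law and argues that their reliability and authority require further validation.

Details
Comments
Accepted FAccT 2026; 29 pages, 14 tables, 7 figures. Previous title - Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments; NLLPW 2026
English abstract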

In the U.S. judicial system, a widespread approach to legal interpretation entails assessing how a legal text would be understood by an `ordinary' speaker of the language. Recent scholarship has proposed that legal practitioners leverage large language models (LLMs) to ascertain a text's ordinary meaning. But are LLMs up to the task? As textual interpretation questions arise in spheres ranging from criminal law to civil rights, we argue it is crucial that models not be taken as authoritative without rigorous evaluation. This work offers an empirical argument against LLM-assisted interpretation as recently practiced by legal scholars and federal judges, who reasoned the large amount of data that models see in training would enable models to illuminate how people ordinarily use certain words or phrases. In controlled experiments, we find failures in robustness which cast doubt on this assumption and raise serious questions about the utility of these models in practice. For the models in our evaluation, slight changes to the format of a question can lead to wildly different conclusions -- a vulnerability that parties with an interest in the outcome could exploit. Comparing with a dataset where people were asked similar legal interpretation questions, we see that these models are at best moderately correlated to human judgments -- not strong enough given the stakes in this domain.

2510.18766 2026-05-15 cs.RO

Sharing the Load: Autonomous Multi-Rover Cargo Transport

Alexander Krawciw, Luka Antonyshyn, Sven Lilge, Nicolas Olmedo, Faizan Rehmatullah, Maxime Desjardins-Goulet, Pascal Toupin, Timothy D. Barfoot

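AI summary: This paper studies cooperative cargo transport by multiple autonomous rovers for lunar missions, proposing a distributed model-predictive control method that lets several vehicles share a load and carry large cargo together. A custom cargo coupling decouples the vehicles' kinematics while fully supporting the cargo, and in field tests the rovers held a relative separation error of 9.2 cm (33.4 cm maximum). Experiments show that the control architecture improves the flexibility and reliability of transport tasks and can also support other mission operations.

Details
Comments
19 pages, 14 figures, submitted to IEEE Transactions on Field Robotics
English abstract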

A future lunar habitat, as part of the Artemis program, will require a significant amount of logistics infrastructure. Cargo that is transported to the Moon will need to be moved from a landing site to other key locations that may be up to 5 km away. Teach and repeat navigation is well suited to this task as utility rovers will need to repeat these cargo routes many times. One of the most significant challenges involves the modules that will be assembled together to form the habitat. Canada is studying potential Lunar Utility Vehicle (LUV) designs to carry these large payloads between the landing site and the location of the habitat. As the details of the cargo continue to evolve, using two smaller LUVs to carry cargo together would provide high capacity and mission flexibility. In this paper, we develop and implement a distributed model-predictive controller that allows vehicles to carry cargo that is shared between them. The algorithm is compared to baselines at small scale before being implemented onboard two 800 kg path-to-flight rovers and field tested carrying a 475 kg cargo between them. A custom cargo coupling decouples the kinematics of each vehicle while fully supporting the cargo's mass. In our field test, the rovers maintain a relative separation error of 9.2 cm and maximum error of 33.4 cm. This multi-vehicle control architecture retains the high-quality path tracking of lidar teach and repeat for each rover. We demonstrate that kinematic freedom of the vehicles allows a single controller to provide mission improvements for other operations as well.