arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.07019 2026-05-11 cs.CV cs.AI

LensVLM: Selective Context Expansion for Compressed Visual Representation of Text

Roy Xie, Dan Friedman, Donghan Yu, Bowen Pan, Christopher Fifty, Jang-Hyun Kim, Xianzhi Du, Zhe Gan, Vivek Rathod, Bhuwan Dhingra

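AI summary: This paper proposes LensVLM, a vision-language model framework that addresses the accuracy drop that occurs when text is processed as compressed images. At inference time the model scans the compressed images and selectively expands only the relevant ones to their uncompressed form, reaching higher compression ratios while preserving accuracy. Experiments show that LensVLM outperforms existing compression and retrieval baselines on multiple text QA tasks and generalizes to multimodal document and code understanding tasks.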

Abstract

Vision Language Models (VLMs) offer the exciting possibility of processing text as rendered images, bypassing the need for tokenizing the text into long token sequences. Since VLM image encoders map fixed-size images to a fixed number of visual tokens, varying rendering resolution provides a fine-grained compression knob. However, accuracy deteriorates quickly as compression increases: characters shrink below the vision encoder's effective resolution, making them indistinguishable. To address this, we propose LensVLM, an inference framework and post-training recipe that enables VLMs to scan compressed images, then selectively expand only the relevant images to their uncompressed form via learned tools. Building on Qwen3.5-9B-Base, LensVLM maintains accuracy comparable to the full-text upper bound at 4.3x effective compression and outperforms retrieval-based, text- and visual-compression baselines up to 10.1x effective compression across seven text QA benchmarks. LensVLM also generalizes to multimodal document and code understanding tasks, with the accuracy gain over baselines growing as compression increases. Our analysis validates this approach: training makes visual compression robust to rendering choices, and as compression grows the model increasingly relies on expanded content rather than unreliable visual reading. The analysis also yields practical tool-choice guidance: text expansion is preferable for rendered text, while high-resolution image expansion suits native documents whose layout cues carry task-relevant information.

2605.07011 2026-05-11 cs.LG

Dual-Agent Co-Training for Health Coaching via Implicit Adversarial Preference Optimization

Da Long, Lingyi Fu, Diya Michelle Rao, Jasmine Ruales Carrera, Yang Bai, Shandian Zhe

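AI summary: This paper proposes a dual-agent co-training approach to health coaching that aims to address the limited interaction capabilities of conventional AI health coaches. It trains the health coach agent and the client simulator together, using implicit adversarial preference optimization to improve both the quality of the interaction and the coaching outcome. Experiments show that the method improves coaching quality across several key dimensions.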

Abstract

Motivational-interviewing-based health coaching is an effective approach for improving mental health and promoting healthy behavior change. However, the scarcity of trained human coaches and the high cost of coaching services make such support inaccessible to many people who could benefit from it. This motivates the development of AI health coaches that can provide scalable and affordable support. Existing methods typically optimize only one side of the interaction: they either train a dialogue agent against a fixed client environment or train a client simulator against a fixed assistant. This one-sided setup can limit exploration of the interaction space and may be inefficient at developing the capabilities required by the target agent and pushing its performance boundaries. In this paper, we propose a dual-agent framework that interactively co-trains both the health coach agent and the client simulator. The coach is optimized with DPO using Pareto-dominant response pairs identified by a multi-dimensional LLM judge. In turn, the client is trained adversarially by reversing these preferences, inducing an implicit adversarial training dynamic. We further show that this co-training process admits a natural stochastic-game interpretation. Extensive experiments demonstrate that our method effectively improves coaching quality across several important dimensions.
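
The preference construction described in this abstract can be illustrated with a minimal sketch. It assumes that the judge returns one score vector per response with higher values being better, and that a pair is kept only when one response Pareto-dominates the other. The function names, the score dimensions, and the choice to flip the same pairs for the client are all hypothetical; the abstract does not specify them.

```python
from itertools import permutations


def dominates(a, b):
    """True if score vector a Pareto-dominates b (higher is better)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def preference_pairs(responses, scores):
    """Return (chosen, rejected) pairs where chosen Pareto-dominates rejected."""
    pairs = []
    for i, j in permutations(range(len(responses)), 2):
        if dominates(scores[i], scores[j]):
            pairs.append((responses[i], responses[j]))
    return pairs


def reverse_pairs(pairs):
    """Flip each pair, as a stand-in for the reversed (adversarial) preference."""
    return [(rejected, chosen) for chosen, rejected in pairs]


# Hypothetical judge scores on two dimensions, e.g. (empathy, guidance).
responses = ["reply_a", "reply_b", "reply_c"]
scores = [(4, 5), (3, 5), (5, 2)]

coach_pairs = preference_pairs(responses, scores)
client_pairs = reverse_pairs(coach_pairs)
print(coach_pairs)   # [('reply_a', 'reply_b')]
print(client_pairs)  # [('reply_b', 'reply_a')]
```

Here `reply_c` appears in no pair because it trades off one dimension against the other, so neither it nor its competitors dominate. Such incomparable pairs are simply dropped in this sketch.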

2605.07010 2026-05-11 cs.LG

Inductive Power Grid Cascading Failure Analysis with GRU-Gated Graph Attention

Tianxin Zhou, Xiang Li, Haibing Lu

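AI summary: This paper studies how to identify vulnerable transmission lines before a power grid cascading failure occurs. It proposes a GRU-gated graph attention network that is trained on data from a limited set of training grids and applied directly to any unseen grid without retraining. A GRU gate controls what information each node retains or discards during the cascade. Experiments show good zero-shot transfer to new grids across inter-time and inter-domain settings, and the method identifies more vulnerable lines than established structural and electrical baselines.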

Comments 10 pages, 10 figures, IEEE format

Abstract

Identifying vulnerable transmission lines in power grids before a cascading failure occurs is challenging: existing methods can learn inter-line failure correlations from cascade data, but they are trained and evaluated on a single grid, and transferring the learned knowledge to an unseen grid remains an open problem. We address this by training a single Gated Recurrent Unit (GRU)-gated Graph Attention Network on combined cascading failure data from limited training grids and applying it directly to any unseen grid without retraining. A GRU gate controls what information each node retains or discards at each cascade iteration. Empirical evaluation shows that the model transfers zero-shot to multiple new grids spanning inter-time and inter-domain settings. Using information extracted from the trained model, we consistently identify more vulnerable lines than established structural and electrical baselines.

2605.07003 2026-05-11 cs.RO cs.SY eess.SY

AirBender: Adaptive Transportation of Bendable Objects Using Dual UAVs

Jiawei Xu, Longsen Gao, Rafael Fierro, David Saldaña

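AI summary: This paper proposes an adaptive control method for two UAVs that cooperatively transport a bendable object, addressing the performance degradation and potential crashes that arise because flexible objects are hard for aerial robots to control. The method does not rely on an explicit elasticity model and adapts on the fly to the object's unknown deformation properties, ensuring stable and accurate trajectory tracking. Hardware experiments in a variety of scenarios demonstrate that multirotor aerial vehicles can handle bendable objects with this method.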

Abstract

The interaction of robots with bendable objects in midair presents significant challenges in control, often resulting in performance degradation and potential crashes, especially for aerial robots due to their limited actuation capabilities and constant need to remain airborne. This paper presents an adaptive controller that enables two aerial vehicles to collaboratively follow a trajectory while transporting a bendable object without relying on explicit elasticity models. Our method allows on-the-fly adaptation to the object's unknown deformable properties, ensuring stability and performance in trajectory-tracking tasks. We use Lyapunov analysis to demonstrate that our adaptive controller is asymptotically stable. Our method is evaluated through hardware experiments in various scenarios, demonstrating the capabilities of using multirotor aerial vehicles to handle bendable objects.

2605.07002 2026-05-11 cs.AI math.ST stat.ML stat.TH

Adaptive auditing of AI systems with anytime-valid guarantees

Siyu Zhou, Patrick Vossler, Venkatesh Sivaraman, Yifan Mai, Jean Feng

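AI summary: This paper studies how to audit generative AI systems adaptively under a limited annotation budget while keeping the statistical inference rigorous. The authors propose a framework of 'dueling' hypothesis tests that sets up one null hypothesis from the model's perspective and one from the auditor's, and they use Safe Anytime-Valid Inference (SAVI) to cast the audit as 'testing by betting', which tests the two opposing hypotheses simultaneously. They show that when the auditor is sufficiently powerful, passing a stringent audit certifies the AI system as globally robust, and experiments confirm that the procedures control type-I error and have better statistical power than pre-specified testing methods.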

Abstract

A major bottleneck in characterizing the failure modes of generative AI systems is the cost and time of annotation and evaluation. Consequently, adaptive testing paradigms have gained popularity, where one opportunistically decides which cases and how many to annotate based on past results. While this framework is highly practical, its extreme flexibility makes it difficult to draw statistically rigorous conclusions, as it violates classical assumptions: the number of observations is typically limited (often 10 to 50 cases) and decisions regarding sampling and stopping are made in the midst of data collection rather than based on a pre-specified rule. To characterize what statistical inferences can be drawn from highly adaptive audits, we introduce a hypothesis testing framework from two 'dueling' perspectives: (i) the model's null that asserts there is no failure mode with performance below a target threshold versus (ii) the auditor's null that asserts they have a sampling strategy that will uncover a failure mode. Leveraging Safe Anytime-Valid Inference (SAVI), we formalize the auditor as conducting 'testing by betting', which translates into simultaneous e-processes for testing the dueling null hypotheses. Furthermore, if the auditor is sufficiently powerful, we prove that these two hypotheses are asymptotically inverses of each other, in that passage of a stringent audit does in fact certify the AI system as being globally robust. Empirically, we demonstrate that our proposed testing procedures maintain anytime-valid type-I error control, outperform pre-specified testing methods, and can reach statistically rigorous conclusions sometimes with as few as 20 observations.
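
The idea of 'testing by betting' can be illustrated with a generic wealth process. This is a minimal sketch, not the paper's procedure. It assumes binary pass/fail outcomes, a null hypothesis that the pass probability is at least `p0` on every audited case, and a fixed bet size (a real audit would typically adapt it). Under that null the wealth is a nonnegative supermartingale, so by Ville's inequality stopping the first time it reaches `1 / alpha` keeps the type-I error at or below `alpha`, whenever the auditor chooses to stop. The function name and all parameters are hypothetical.

```python
def betting_audit(outcomes, p0, alpha=0.05, bet=1.0):
    """Bet against H0: P(pass) >= p0 on every case, given 0/1 outcomes (1 = pass).

    Returns (step at which H0 is rejected, or None, and the final wealth).
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    # The bet must keep every factor 1 + bet * (p0 - x) nonnegative for x in {0, 1}.
    if not 0.0 <= bet <= 1.0 / (1.0 - p0):
        raise ValueError("bet must lie in [0, 1 / (1 - p0)]")

    wealth = 1.0
    threshold = 1.0 / alpha
    for step, x in enumerate(outcomes, start=1):
        # A failure (x = 0) grows the wealth; a pass (x = 1) shrinks it.
        wealth *= 1.0 + bet * (p0 - x)
        if wealth >= threshold:
            return step, wealth
    return None, wealth


# Three consecutive failures against a claimed 90% pass rate: 2.8**3 = 21.952 >= 20.
print(betting_audit([0, 0, 0, 0], p0=0.9, alpha=0.05, bet=2.0)[0])  # 3
```

Because the guarantee holds at every stopping time, the auditor may inspect the wealth after each case and choose the next case based on past results without invalidating the test.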

2605.06997 2026-05-11 cs.LG

Echo: KV-Cache-Free Associative Recall with Spectral Koopman Operators

Anupama Sridhar, Alexander Johansen

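AI summary: This paper proposes Echo, an associative recall architecture that needs no conventional key-value cache (KV cache). By introducing Spectral Koopman Attention (SKA), it addresses the collapse in retrieval accuracy that state-space models (SSMs) suffer on long-range retrieval. Echo fits a spectral linear system to the key and value history and retrieves through a learned power-iterated filter, achieving recall with constant memory. Experiments show that Echo outperforms both pure SSMs and SSM+Attention hybrids on multiple benchmarks while keeping inference memory constant.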

Abstract

Long chain-of-thought reasoning and agentic tool-calling produce traces spanning tens of thousands of tokens, yet Transformer KV caches grow linearly with sequence length, creating a memory bottleneck on commodity hardware. State-space models offer constant-memory recurrence but suffer a memory cliff: retrieval accuracy collapses once the gap between a stored fact and its query exceeds the effective horizon of the recurrent state. We introduce Echo, a KV-cache-free associative recall architecture built around Spectral Koopman Attention (SKA), a drop-in replacement for attention layers that augments SSM blocks with a closed-form dynamical operator whose sufficient statistics are accumulated in constant memory with no KV cache. Echo fits a spectral linear system to the key and value history via kernel ridge regression and retrieves through a learned power-iterated filter, all from $O(r^{2})$ streaming state where $r$ is a small projection rank. On the Multi-Query Associative Recall benchmark, a pure Mamba-2 SSM fails to exceed chance accuracy (${\sim}3\%$) across all gap lengths and KV-pair counts, while at the 50M parameter scale SKA-augmented models achieve $100\%$ retrieval accuracy on every configuration tested, including distractor gaps of $4{,}096$ tokens with $32$ KV pairs. Across five additional transfer benchmarks including needle-in-a-haystack, tool-trace, and multi-hop retrieval, SKA consistently outperforms both pure SSM and SSM+Attention hybrids while maintaining constant inference memory. Ablations confirm that the spectral operator, not the prefix masking strategy, drives the retrieval gain.

2605.06993 2026-05-11 cs.AI stat.ML

Optimal Experiments for Partial Causal Effect Identification

Tobias Maringgele, Jalal Etesami

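AI summary: This work studies how to choose, when a causal effect is only partially identifiable from observational data, a cost-constrained set of experiments that maximally tightens the bounds on that effect. The authors formalize this as the max-potency problem and prove that it is NP-hard. Pruning criteria based on the causal graph shrink the search space of candidate experiments. The approach is validated on multiple benchmark networks and demonstrated on real data.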

Abstract

Causal queries are often only partially identifiable from observational data, and experiments that could tighten the resulting bounds are typically costly. We study the problem of selecting, prior to observing experimental outcomes, a cost-constrained subset of experiments that maximally tightens bounds on a target query. We formalize this as the max-potency problem, where epistemic potency measures the worst-case reduction in bound width guaranteed by an experiment, and show that this problem is NP-hard via a reduction from 0-1 knapsack. Building on the polynomial-programming framework of Duarte et al. (2023), we give a general procedure for evaluating epistemic potency in discrete settings. To control the super-exponential search space, we introduce two graphical pruning criteria that depend only on the causal graph and the query: a novel path-interception rule that exploits district structure to certify zero potency in linear time, and an identifiability check based on the ID algorithm. On Erdős-Rényi random graphs and 11 bnlearn benchmark networks, the two criteria together prune 50-88% of candidate experiments on average without solving a single polynomial program. For the general subset search, we show that ID-pruned experiments are combinatorially inert, yielding a super-exponential reduction in the number of subsets evaluated. We close with an end-to-end demonstration on observational NHANES data, selecting optimal experiments for estimating the effect of physical activity on diabetes.

2605.06992 2026-05-11 cs.LG stat.ML

Why Does Agentic Safety Fail to Generalize Across Tasks?

Yonatan Slutzky, Yotam Alexander, Tomer Slor, Yoav Nagel, Nadav Cohen

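AI summary: As AI agents are increasingly deployed in multi-task settings, executing unseen tasks safely becomes a key concern. Through theory and experiments, this paper shows that agentic safety fails to generalize across tasks not merely because of limitations of training methods but because of the inherent complexity of safety itself. Analyzing linear-quadratic control with $H_{\infty}$-robustness, the authors prove that safety requirements raise the Lipschitz constant of the mapping from task specification to controller, and they confirm the conclusion in quadcopter navigation and CRM tasks, suggesting that current approaches to improving agentic safety may be fundamentally insufficient.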

Abstract

AI agents are increasingly deployed in multi-task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear-quadratic control with $H_{\infty}$-robustness, and prove that the mapping from task specification to an optimal controller has a higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. Empirically, we demonstrate our conclusions in simulated quadcopter navigation with a neural network agent and in CRM with an LLM agent. Our findings suggest that current efforts to enhance agentic safety may be insufficient, and point to a need for fundamentally different approaches.

2605.06990 2026-05-11 cs.CV cs.LG

TRAJGANR: Trajectory-Centric Urban Multimodal Learning via Geospatially Aligned Neural Representations

Maria Despoina Siampou, Gengchen Mai, Ni Lao, Jinmeng Rao, Neha Arora, Cyrus Shahabi, Shushman Choudhury

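AI summary: This paper proposes TrajGANR, a trajectory-centric geospatial multimodal self-supervised learning framework that addresses the shortcomings of existing methods in handling human mobility trajectories. Unlike conventional methods that align static, location-based pairs, TrajGANR aligns the continuous movement patterns of trajectories with static geographic observations, enabling finer-grained multimodal learning. By jointly aligning trajectories, street-view images, and their geographic locations, the method performs strongly on multiple urban mobility and road understanding tasks, confirming its effectiveness and advantages for geospatial multimodal learning.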

Abstract

Multimodal self-supervised learning (MSSL) has emerged as a key paradigm for pretraining geospatial foundation models. However, existing geospatial MSSL methods are mainly designed for static pairs of modalities, such as satellite imagery, street-view imagery, and text, where learning is driven by aligning observations from the same or nearby locations. This assumption breaks down for human mobility trajectories, which represent continuous movement along paths rather than discrete observations at individual locations. Although trajectories are important for urban understanding through their ability to capture human activity across roads, neighborhoods, and places over time, they remain largely underexplored in current geospatial MSSL frameworks. We present TrajGANR, a novel trajectory-centric geospatial MSSL framework that aligns continuous movement patterns with static, location-based observations. TrajGANR learns a continuous neural representation of trajectories at arbitrary points along each path, which enables fine-grained alignment with nearby street-view images, even when they are not co-located with any trajectory waypoints. We leverage this capability to introduce an MSSL objective that jointly aligns three modalities: trajectories, street-view images, and their geographic locations. We evaluate TrajGANR on four urban mobility and road understanding tasks. Across these tasks, TrajGANR consistently outperforms existing geospatial MSSL frameworks and a trajectory-specific foundation model. Ablation studies further demonstrate that our proposed MSSL objective and the multimodal learning framework are the primary drivers of these improvements, highlighting the importance of fine-grained geospatial alignment over coarser aggregation, as well as geospatial multimodal learning.

2605.06987 2026-05-11 cs.LG cs.GT econ.TH stat.ML

Response Time Enhances Alignment with Heterogeneous Preferences

Federico Echenique, Alireza Fallah, Baihe Huang, Michael I. Jordan

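AI summary: This paper studies how to improve the alignment of large language models with human preferences when labelers have heterogeneous preferences. Standard methods build a reward model from pooled binary choice data but ignore preference differences among labelers, so the model cannot recover the true population-average preference. The authors propose using response time as a complementary signal and, combining it with the Drift-Diffusion Model (DDM), design a new estimator that identifies heterogeneous preferences, corrects the bias of standard methods, and outperforms them on multiple datasets. The method requires no user identity information, which makes it practical.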

Abstract

Aligning large language models (LLMs) to human preferences typically relies on aggregating pooled feedback into a single reward model. However, this standard approach assumes that all labelers share the same underlying preferences, ignoring the fact that real-world labelers are highly heterogeneous and usually anonymous. Consequently, relying solely on binary choice data fundamentally distorts the learned policy, making the true population-average preference unidentifiable. To overcome this critical limitation, we demonstrate that augmenting preference datasets with a simple, secondary signal -- the user's response time -- can restore the identifiability of the population's average preference. By modeling each decision as a Drift-Diffusion Model (DDM), we introduce a novel, consistent estimator of heterogeneous preferences that successfully corrects the distortions of standard choice-only labels. We prove that our estimator asymptotically converges to the true average preference even in extreme cases where each anonymous labeler contributes only a single choice. Empirically, across both synthetic and real-world datasets, our method consistently outperforms standard baselines that otherwise fail and plateau at a bias floor. Because response times are essentially free to record and require zero user tracking or identification, our results bring promises and open up new opportunities for future data-collection pipelines to improve the social benefit without requiring user-level identifiers or repeated elicitations.
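
Why response time carries information that a binary choice alone does not can be seen from the textbook closed forms for a Drift-Diffusion Model. The sketch below is not the paper's estimator. It assumes a DDM with unit diffusion, a start at 0, and symmetric barriers at `+barrier` and `-barrier`, where the drift stands for the strength of the labeler's preference. Function names and parameter values are hypothetical.

```python
import math


def ddm_choice_prob(drift, barrier):
    """Probability of hitting the upper barrier (choosing option A)."""
    return 1.0 / (1.0 + math.exp(-2.0 * barrier * drift))


def ddm_mean_time(drift, barrier):
    """Expected decision time; the zero-drift limit is barrier squared."""
    if abs(drift) < 1e-12:
        return barrier ** 2
    return (barrier / drift) * math.tanh(barrier * drift)


barrier = 1.0
for drift in (0.5, 2.0):
    p = ddm_choice_prob(drift, barrier)
    t = ddm_mean_time(drift, barrier)
    # With choices coded as +1 / -1, E[choice] = 2p - 1, and
    # E[choice] / E[time] recovers drift / barrier for a single labeler.
    print(drift, round(p, 3), round(t, 3), (2.0 * p - 1.0) / t)
```

Both labelers usually pick option A, so their choices look similar, but the one with the weaker preference takes longer on average. In this single-labeler setting, the ratio of mean signed choice to mean decision time equals the drift divided by the barrier height.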

2605.06982 2026-05-11 cs.LG

FastOmniTMAE: Parallel Clause Learning for Scalable and Hardware-Efficient Tsetlin Embeddings

Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo, Mayur Kishor Shende

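AI summary: This paper proposes FastOmniTMAE, an embedding model that aims to speed up training of the logic-based Tsetlin Machine for static embedding learning. By restructuring the conventional sequential training process into a two-stage parallel pipeline, the method markedly accelerates training while maintaining good embedding quality across multiple benchmark tasks. The authors also implement FastOmniTMAE as an accelerator on SoC-FPGA platforms, demonstrating efficient training of logic-based embeddings on resource-constrained hardware.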

Abstract

Embedding models in natural language processing (NLP) increasingly rely on deep architectures such as BERT, while simpler models such as Word2Vec provide efficient representations but limited interpretability. The Tsetlin Machine (TM) offers an alternative logic-based learning paradigm. Omni TM Autoencoder (Omni TM-AE) applies this paradigm to static embedding by exploiting automaton state distributions within a single clause layer, but its training process remains slow. In this work, we propose FastOmniTMAE, a reformulation of Omni TM-AE that replaces sequential training dependencies with a two-stage parallel process: evaluation and update. Using a Single-Run Multi-Environment Benchmark covering classification, similarity, and clustering, FastOmniTMAE achieves up to 5$\times$ faster training in classification while maintaining comparable embedding quality under both Spearman and Kendall similarity measures. To address the limited efficiency of TM training on conventional GPUs, we further implement FastOmniTMAE as a reusable accelerator on SoC-FPGA platforms. The Multi-Hardware Benchmark shows that FastOmniTMAE achieves similarity scores of 0.669 on a resource-constrained FPGA and 0.696 on an UltraScale+ SoC, demonstrating efficient logic-based embedding training with a small hardware footprint.

2605.06979 2026-05-11 cs.LG cs.AI stat.ML

PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

Jonathn Chang, Arya Datla, Ziv Goldfeld

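AI summary: This paper proposes PLOT, a method that uses optimal transport to progressively localize causal variables in neural causal abstraction. It fits an optimal transport coupling between abstract variables and candidate neural sites to obtain a global soft correspondence, which is then calibrated into intervention handles, so that causal variables are localized efficiently. Experiments show that PLOT substantially improves computational efficiency while remaining competitive on accuracy, providing an effective localization tool for large-scale causal abstraction research.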

Abstract

Causal abstraction offers a principled framework for mechanistic interpretability, aligning a high-level causal model with the low-level computation realized by a neural network through counterfactual intervention analysis. Existing methods such as distributed alignment search (DAS) learn expressive subspace interventions, but the relevant neural site is unknown a priori, so finding a handle requires a computationally burdensome search over candidate sites. We introduce PLOT (Progressive Localization via Optimal Transport), a transport-based framework that localizes causal variables from the output effect geometry of abstract and neural interventions. PLOT fits an optimal transport coupling between abstract variables and candidate neural sites, yielding a global soft correspondence that can be calibrated into intervention handles. In simple settings, a single coupling over individual neurons suffices. In larger models, PLOT is applied progressively, moving from coarse sites such as tokens, timesteps, or layers to finer supports such as coordinate groups or PCA spans, and optionally guiding DAS based on the localized signal. Across experiments of increasing complexity, transport-only PLOT handles are exceedingly fast and competitive on accuracy, while PLOT-guided DAS reaches DAS-level accuracy at a fraction of full DAS runtime, providing an efficient localization engine for causal abstraction research at scale.

2605.06978 2026-05-11 cs.CL cs.AI

Group of Skills: Group-Structured Skill Retrieval for Agent Skill Libraries

Kun Zeng, Yu Huo, Siyu Zhang, Zi Ye, Yuecheng Zhuo, Haoyue Liu, Yuquan Lu, Junhao Wen, Xiaoying Tang

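AI summary: This paper proposes GoSkills, a skill retrieval method that addresses the gap between retrieving relevant skills and presenting usable context when agents draw on large skill libraries. It builds anchor-centered skill groups and renders a role-labeled execution context, improving the usability of retrieved skills without changing the downstream agent or the execution environment. Experiments show that GoSkills preserves visible-requirement coverage under a small skill budget and often improves reward and agent-only runtime over existing methods.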

Comments 30 pages, 4 figures, 24 tables

Abstract

Skill-augmented agents increasingly rely on large reusable skill libraries, but retrieving relevant skills is not the same as presenting usable context. Existing methods typically return atomic skills or dependency-aware bundles whose internal roles remain implicit, leaving the agent to infer the execution entry point, support skills, visible requirements, and failure-avoidance guidance. We introduce Group of Skills (GoSkills), an inference-time group-structured retrieval method that changes the agent-facing retrieval object from a flat skill list to a compact, role-labeled execution context. GoSkills builds anchor-centered skill groups from a typed skill graph, expands support groups through a group graph, bottlenecks the selected group plan into a bounded set of atomic skill payloads, and renders a fixed execution contract with Start, Support, Check, and Avoid fields, without changing the downstream agent, skill payloads, or execution environment. Experiments on SkillsBench and ALFWorld show that GoSkills preserves visible-requirement coverage under a small skill budget, improves over flat skill-access baselines, and often improves reward and agent-only runtime relative to structural retrieval references.

2605.06977 2026-05-11 cs.LG cs.AI cs.IT math.IT stat.ML

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

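AI summary: This paper studies general $f$-divergence regularization in Reinforcement Learning from Human Feedback (RLHF) and proposes a unified theoretical framework that fills a gap in existing work. The authors design two algorithms based on distinct sampling principles: one applies the optimism principle, and the other exploits the sensitivity of the optimal policy to reward perturbations. Theoretical analysis shows that both algorithms achieve $O(\log T)$ regret and an $O(1/T)$ sub-optimality gap, giving what is, to the authors' knowledge, the first theoretical guarantee for online RLHF under general $f$-divergence regularization.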

Comments ICML 2026

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective. Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second introduces a new method that exploits the sensitivity of the optimal policy to reward perturbations under $f$-divergence regularization. Theoretical analysis shows that $O(\log T)$ regret and $O(1/T)$ sub-optimality gap are achievable, establishing provable efficiency of both algorithms and, to the best of our knowledge, the first performance bounds for online RLHF under general $f$-divergence regularization.
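
For readers unfamiliar with the divergences named in this abstract, the following sketch computes an $f$-divergence between two discrete distributions, $D_f(P \| Q) = \sum_i q_i f(p_i / q_i)$, and plugs it into a regularized objective of the generic form "expected reward minus $\beta$ times the divergence from a reference policy". It assumes a single prompt with a finite response set. The function names and the toy numbers are hypothetical, and the objective shown is the generic textbook form rather than the paper's exact formulation.

```python
import math


def f_divergence(p, q, f):
    """D_f(P || Q) = sum_i q_i * f(p_i / q_i) for discrete P, Q with q_i > 0."""
    if any(qi <= 0.0 for qi in q):
        raise ValueError("q must be strictly positive")
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


def gen_reverse_kl(t):
    """f(t) = t log t gives KL(P || Q), the usual 'reverse KL' RLHF penalty."""
    return t * math.log(t) if t > 0.0 else 0.0


def gen_forward_kl(t):
    """f(t) = -log t gives KL(Q || P); requires t > 0."""
    return -math.log(t)


def gen_chi_squared(t):
    """f(t) = (t - 1)^2 gives the chi-squared divergence."""
    return (t - 1.0) ** 2


def regularized_objective(policy, reference, rewards, beta, f):
    """Expected reward under the policy minus beta * D_f(policy || reference)."""
    expected_reward = sum(pi * r for pi, r in zip(policy, rewards))
    return expected_reward - beta * f_divergence(policy, reference, f)


reference = [0.5, 0.5]   # hypothetical reference policy over two responses
policy = [0.75, 0.25]    # hypothetical candidate policy
rewards = [1.0, 0.0]     # hypothetical rewards

for name, f in [("reverse KL", gen_reverse_kl),
                ("forward KL", gen_forward_kl),
                ("chi-squared", gen_chi_squared)]:
    print(name, regularized_objective(policy, reference, rewards, 0.1, f))
```

Swapping the generator `f` changes how strongly the policy is penalized for moving probability mass away from the reference, which is the design axis a unified analysis has to cover.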

2605.06966 2026-05-11 cs.RO cs.SE

Traffic Scenario Orchestration from Language via Constraint Satisfaction

Frieda Rong, Chris Zhang, Kelvin Wong, Raquel Urtasun

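AI summary: This paper studies how to generate, from natural language descriptions and via constraint satisfaction, traffic scenarios for closed-loop testing of autonomous vehicles. The core idea is to cast scenario orchestration as a constraint-solving problem: a foundation model translates the natural language description into constraints, and an off-the-shelf solver then produces actor behaviors that meet precise testing requirements. The method performs strongly on a benchmark of carefully crafted, diverse scenario descriptions, with a particularly clear advantage on scenarios that require ego-reactive specifications.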

Comments 19 pages, 10 figures; full version of paper accepted for poster presentation at ICRA 2026

Abstract

Autonomous vehicles (AVs) require extensive testing in simulation, but test case generation for driving scenarios is laborious. The desired scenarios are often out-of-distribution and have precise requirements on interactions with the AV policy under test. Manually programming scenarios allows for precise controllability but is difficult to scale. On the other hand, statistical models can leverage compute and data, but struggle with precise controllability when out-of-distribution. We cast scenario orchestration as a constraint-solving problem and present a language-in, simulation-out scenario orchestrator for closed-loop testing of AVs. Our approach leverages foundation model reasoning to translate general, natural language descriptions into a set of constraints as a scenario representation. This then allows us to leverage off-the-shelf solvers to solve for actor behaviors which meet precise testing intentions in closed-loop. Under a benchmark of carefully crafted and diverse scenario descriptions, our approach greatly outperforms our baselines in orchestration success rate. We further show that our closed-loop approach is especially important for scenarios which require ego-reactive specifications.

2605.06957 2026-05-11 cs.AI

Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

Shirin Sohrabi, Haritha Ananthakrishnan, Harsha Kokel, Kavitha Srinivas, Michael Katz

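AI summary: This paper proposes a dynamic policy-learning approach for LLM-based agents that combines generalized planning with hierarchical task decomposition. The method, HCL-GP, learns parameterized policies that generalize across task instances and automatically extracts reusable components from successful executions, building a component library that supports compositional policy generation. The work addresses three key challenges (automated decomposition, component generalization, and semantic retrieval) and performs strongly on the AppWorld benchmark, improving both the accuracy and the efficiency of task execution.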

Abstract

We present a dynamic policy-learning approach that combines generalized planning and hierarchical task decomposition for LLM-based agents. Our method, Hierarchical Component Learning for Generalized Policies (HCL-GP), learns parameterized policies that generalize across task instances and automatically extracts reusable components from successful executions, organizing them into a component library for compositional policy generation. We address three challenges: (1) learning components through automated decomposition, (2) generalizing components to maximize reuse, and (3) efficient retrieval via semantic search. Evaluated on the AppWorld benchmark, our approach achieves 98.2% accuracy on normal tasks and 97.8% on challenge tasks with unseen applications, improving 15.8 points over static synthesis on challenging scenarios. For open-source models, dynamic reuse enables 62.5% success versus near-zero without reuse. This demonstrates that classical planning concepts can be effectively integrated with LLM agents for improved accuracy and efficiency.

2605.06955 2026-05-11 cs.LG cs.AI

Kurtosis-Guided Denoising Score Matching for Tabular Anomaly Detection

Victor Livernoche, Jie Zan, Reihaneh Rabbany

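AI summary: This paper proposes kurtosis-guided denoising score matching (K-DSM) for anomaly detection on tabular data. The method sets the noise level of each feature from the shape of its marginal distribution, improving detection in both low-density and high-density regions without adding model complexity. Experiments show that K-DSM achieves state-of-the-art performance in the semi-supervised setting and strong performance in the fully unsupervised (contaminated) setting, without multi-scale training and with less reliance on hyperparameter tuning.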

Comments 39 pages, 10 figures, 14 tables

Abstract

Denoising score matching (DSM) provides a way to learn data distributions by training a neural network to recover the score function, defined as the gradient of the log density, from noise-corrupted samples. Once trained, the score magnitude at a test point reflects how consistent that point is with the learned distribution, making it a natural anomaly signal. The key practical challenge is selecting the perturbation scale: too little noise yields unstable score estimates in sparse regions, while too much erases local structure and weakens anomaly sensitivity. This is compounded by the difficulty of hyperparameter tuning when anomalies are unknown and no validation set is available. We introduce kurtosis-based noise scaling (K-DSM), a per-feature scheme that sets noise levels from the shape of each marginal distribution, improving coverage of low-density regions and precision in high-density regions without extra model complexity. Contrary to prior claims that multi-scale or noise-conditioned training is necessary, we find that a carefully trained single-scale model is already a strong anomaly detector. On standard tabular anomaly detection benchmarks, K-DSM achieves state-of-the-art performance in the semi-supervised setting. When combined with a lightweight EMA-teacher filtering rule that removes low-density training points before each gradient step, it also achieves strong performance in the fully unsupervised (contaminated) setting, suggesting that simple, data-adaptive noise scaling enables robust anomaly detection while reducing reliance on hyperparameter tuning.

2605.06951 2026-05-11 cs.AI cs.LG cs.MA

Multi-Objective Constraint Inference using Inverse reinforcement learning

Syed Ihtesham Hussain Shah, Floris den Hengst, Aneta Lisowska, Annette ten Teije

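AI summary: This paper proposes MOCI, a multi-objective constraint inference framework that jointly extracts shared constraints and individual preferences from the trajectories of experts pursuing different objectives. The method handles diverse and potentially conflicting expert behavior, overcoming the limitations of existing methods with heterogeneous demonstrations and individual preferences. Experiments show that MOCI improves predictive performance over existing methods while keeping computational efficiency competitive, offering an accurate and practical approach to constraint inference and preference learning.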

Abstract

Constraint inference is widely considered essential to align reinforcement learning agents with safety boundaries and operational guidelines by observing expert demonstrations. However, existing approaches typically assume homogeneous demonstrations (i.e., generated by a single expert or multiple experts with identical objectives). They also have limited ability to capture individual preferences and often suffer from computational inefficiencies. In this paper, we introduce Multi-Objective Constraint Inference (MOCI), a novel framework designed to jointly extract shared constraints and individual preferences from heterogeneous expert trajectories, where multiple experts pursue different objectives. MOCI effectively models and learns from diverse, and potentially conflicting, behaviors. Empirical evaluations demonstrate that MOCI significantly outperforms existing baselines, achieving improved predictive performance, and maintaining competitive computational efficiency on a standard grid-world benchmark. These results establish MOCI as an accurate, flexible, and computationally practical approach for real-world constraint inference and preference learning tasks.

2605.06947 2026-05-11 cs.LG

Rollback-Free Stable Brick Structures Generation

Chenhui Xu, Ziyue Bai, Fuxun Yu, Heng Huang, Jinjun Xiong

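AI summary: This paper studies how to generate physically stable brick structures and proposes a rollback-free generation method. A reinforcement learning framework moves the enforcement of physical constraints from inference time to training time, so that the model learns collision avoidance, global connectivity, structural interlocking, and shape conformity during training. The method generates stable brick structures efficiently and with high quality, greatly accelerating inference and achieving state-of-the-art generation quality in experiments.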

English abstract

While autoregressive models have advanced 3D generation, creating physically stable brick structures remains a challenge due to the strict requirements of gravity and interconnectivity. Existing approaches rely on external physical simulators during inference to perform rejection sampling and brick-by-brick rollbacks, which severely bottlenecks efficiency. To address this, we propose a reinforcement learning paradigm that shifts physical validity enforcement from test-time correction to training-time policy optimization. By utilizing assembly-level rewards, the model optimizes for collision avoidance, global connectivity, structural interlocking, and shape conformity. This paradigm allows the model to internalize physical priors, enabling the first rollback-free generation of stable brick structures. Experimental results demonstrate that our approach achieves state-of-the-art generation quality while accelerating inference speed by orders of magnitude. Our code and dataset are available at https://github.com/miniHuiHui/STABLE. Our models are available at https://huggingface.co/miniHui/STABLE.

2605.06946 2026-05-11 cs.LG cs.AI

Adaptive Memory Decay for Log-Linear Attention

Yaxita Amin, Helen Zichen Li, Mengfan Zhang, Samet Ayhan

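AI summary: This paper studies the fundamental tradeoff between memory capacity and computational efficiency in sequence models and proposes an adaptive memory decay mechanism to improve models based on log-linear attention. In the existing approach the memory decay parameter is fixed and cannot adjust to the input; here a lightweight two-layer MLP learns the decay for each position and each hierarchy level from the input content, improving performance on long-range memory tasks. Experiments show that the method outperforms the baseline across multiple tasks, with especially large gains on long sequences.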

Comments 19 pages, 13 figures. Preprint

English abstract

Sequence models face a fundamental tradeoff between memory capacity and computational efficiency. Transformers achieve expressive context modeling at quadratic cost, while linear attention and state-space models run in linear time by compressing context into a fixed-size hidden state, inherently limiting recall. Log-linear attention navigates this tradeoff by organizing memory across a Fenwick tree hierarchy, growing its hidden state logarithmically with sequence length at log-linear compute cost. However, its memory decay parameter λ is fixed and independent of the input, assigning uniform weights across all hierarchy levels regardless of the content, which introduces unnecessary rigidity. We propose learning λ directly from the input via a lightweight two-layer MLP, producing per-token, per-level decay that adapts to content rather than position. A softplus activation lets each Fenwick tree level scale independently, avoiding the inter-level competition that softmax introduces. This modification preserves log-linear complexity exactly and adds negligible parameter overhead. We evaluate on associative recall, selective copying, and language modeling, finding that input-dependent decay consistently outperforms the baseline, with the largest gains in long-range memory settings where baseline λ degrades or collapses entirely.
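
The input-dependent decay described above can be sketched as follows. This is a rough illustration rather than the authors' implementation: the hidden width, the `tanh` activation, the random weights, and the level count `ceil(log2(T)) + 1` are all assumptions made for the demo, and the Fenwick-tree attention itself is not shown.

```python
import math
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); outputs are positive and unbounded above.
    return np.logaddexp(0.0, z)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decay_logits(x, w1, b1, w2, b2):
    """Two-layer MLP mapping token features (T, d) to per-level logits (T, L)."""
    hidden = np.tanh(x @ w1 + b1)  # activation choice is an assumption
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 16, 8, 4
# Assumed number of hierarchy levels for a length-T sequence.
n_levels = math.ceil(math.log2(seq_len)) + 1

x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(size=(d_model, d_hidden)) * 0.1
b1 = np.zeros(d_hidden)
w2 = rng.normal(size=(d_hidden, n_levels)) * 0.1
b2 = np.zeros(n_levels)

logits = decay_logits(x, w1, b1, w2, b2)
lam = softplus(logits)          # per-token, per-level decay; levels scale independently
lam_softmax = softmax(logits)   # for contrast: levels compete for a fixed budget of 1
```

With softplus, raising the weight of one level leaves the others untouched, whereas the softmax rows always sum to one, which is the inter-level competition the abstract refers to.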

2605.06943 2026-05-11 cs.LG

ProtoSSL: Interpretable Prototype Learning from Unlabeled Time-Series Data

Steven Song, Sahil Sethi, Brett Beaulieu-Jones, Robert L. Grossman

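AI summary: In time-series domains that demand both predictive performance and interpretability, deep neural networks perform well but offer little insight into the basis for their predictions. The study proposes ProtoSSL, a framework that uses self-supervised learning to learn reusable, interpretable prototypes from unlabeled time-series data and adapts them to downstream tasks through an efficient assignment procedure. The prototypes are learned without label supervision, which improves label efficiency; the method outperforms supervised prototype baselines on six ECG datasets, extends to audio classification, and receives more favorable interpretability ratings in a human evaluation.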

English abstract

In time-series domains where both predictive performance and interpretability are essential, deep neural networks achieve strong results but provide limited insight into how their predictions are made. Projection-based prototype networks address this limitation by grounding predictions in similarity to representative training examples, enabling case-based explanations and global prototype inspection. However, existing approaches rely on label supervision, tying prototypes to a specific task and requiring large labeled datasets. We introduce ProtoSSL, a novel framework for learning interpretable, projection-based prototypes from unlabeled time-series data and adapting them to downstream tasks. Our key idea is to separate motif discovery from label alignment. ProtoSSL first learns a reusable prototype bank using a self-supervised objective applied directly to prototype activations, and then aligns these prototypes to downstream tasks through an efficient assignment procedure. Across six electrocardiography (ECG) datasets, ProtoSSL improves label efficiency, outperforming supervised prototype baselines in low-data regimes with as few as 256 labeled examples; with fine-tuning, ProtoSSL outperforms supervised prototype baselines at full dataset scale. In a human evaluation study, ProtoSSL produces prototypes and prototype-based explanations that are judged more favorably than those learned with direct label supervision. We further show that the framework extends to audio classification. Thus, ProtoSSL enables both learning generalizable prototypes from unlabeled data before the downstream label space is known, and subsequent assignment of interpretable, projection-grounded prototypes to new time-series tasks.

2605.06941 2026-05-11 cs.LG math.OC

Causal-Aware Foundation-Model for Bilevel Optimization in Discrete Choice Settings

Shivaram Subramanian, Zhengliang Xue, Markus Ettl, Yingdong Lu, Jayant Kalagnanam

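AI summary: This paper proposes a causal-aware foundation-model framework for real-time optimal decision making in discrete choice environments, addressing how a service provider should select an optimal assortment when facing heterogeneous users with personalized preferences. It introduces the constrained triple-head price optimization (C3PO) network, which combines imitation learning, multi-task learning, and in-context learning to generate pricing recommendations that satisfy business constraints, and uses elasticity priors from the behavioral economics literature to improve pricing for new products. Experiments show strong in-context learning performance on simulated and real-world datasets and substantial pricing gains across several real-world applications.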

English abstract

We introduce a causal-aware foundation-model framework for real-time optimal decision making in discrete choice environments. We propose a constrained triple-head price optimization (C3PO) network to solve a bilevel decision problem in which a service provider selects an optimal assortment while heterogeneous users make personalized acceptance or rejection choices optimizing their own personalized preferences. C3PO integrates imitation learning of prices, multi-task learning of revenue responses, and in-context learning of price elasticity to generate pricing recommendations while adhering to business constraints. During inference, frontier model prompting retrieves an enhanced elasticity prior for new products from behavioral economics literature, improving pricing effectiveness. We demonstrate strong in-context learning performance using simulated, synthetic, and real-world datasets. C3PO is trained on simulated data generated from multiple classical discrete choice models in economics. The model is trained on data comprising simulated customer segments and counterfactual action and outcome pairs and evaluated on randomly generated choice environments with no access to the underlying preference structure. The trained model consistently improves the pricing KPIs, with gains increasing as customer price sensitivity increases. We also deploy the tuned foundation model for optimal pricing in real-world applications such as healthcare, tender pricing, airline ancillary pricing, and other domains, achieving substantial gains across multiple products, markets, and divisions.

2605.06939 2026-05-11 cs.LG stat.ME stat.ML

Bias and Uncertainty in LLM-as-a-Judge Estimation

James Fiedler

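AI summary: This paper studies bias and uncertainty when large language models are used as judges (LLM-as-a-Judge) for model evaluation. The author shows that estimating performance directly from judge outputs introduces systematic bias, and that the reliability of existing correction methods depends on judge quality and on the stability of calibration across models. Through analytical results, simulations, and a real-data case study, the work shows that shared calibration can cause severe bias in model comparisons, including estimates that point in the wrong direction, and proposes diagnostics based on judge quality ($J$) and cross-model calibration instability ($ΔJ$) to guide more reliable LLM-as-a-Judge evaluation practice.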

English abstract

LLM-as-a-Judge (LaaJ) evaluation has become a standard tool for assessing base model performance. However, characterizing performance via the naive estimator, i.e., raw judge outputs, is systematically biased. Recent work has proposed estimators to correct this bias, but their reliability depends critically on judge quality and, for model comparisons, on calibration stability. Sharing calibration across compared models is practically attractive but can introduce severe bias, including cases where the comparison estimate points in the wrong direction with high apparent confidence. We study these failure modes through analytical results, simulations over judge quality ($J$) and cross-model calibration instability ($ΔJ$), and a real-data MMLU-Pro case study with sign reversal. We propose $J$ and $ΔJ$ as diagnostics for when corrected estimates, especially shared-calibration comparisons, are likely unreliable, and provide reporting guidance for LaaJ evaluation.
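
A small worked example may help show how shared calibration can flip a comparison. The sketch below uses the standard misclassification correction (Rogan-Gladen form) and treats judge quality as sensitivity + specificity − 1; both are assumptions for illustration, since the abstract does not give the paper's estimators or its exact definition of $J$. All numbers are made up.

```python
def observed_rate(true_acc, sensitivity, specificity):
    """Expected fraction of answers the judge marks correct (the naive estimate)."""
    return true_acc * sensitivity + (1.0 - true_acc) * (1.0 - specificity)

def corrected_estimate(p_obs, sensitivity, specificity):
    """Invert the judge's error rates to recover accuracy from the judged pass rate."""
    j = sensitivity + specificity - 1.0  # assumed notion of judge quality
    if abs(j) < 1e-12:
        raise ValueError("judge is uninformative; the correction is undefined")
    return (p_obs + specificity - 1.0) / j

# Hypothetical ground truth: model B is slightly better than model A.
true_a, true_b = 0.70, 0.72
# Hypothetical judge behaviour: less sensitive on B's answers than on A's.
sens_a, spec_a = 0.9, 0.8
sens_b, spec_b = 0.8, 0.8

p_a = observed_rate(true_a, sens_a, spec_a)
p_b = observed_rate(true_b, sens_b, spec_b)

# Per-model calibration recovers both accuracies.
est_a = corrected_estimate(p_a, sens_a, spec_a)
est_b = corrected_estimate(p_b, sens_b, spec_b)

# Shared calibration: reuse A's judge error rates for B.
est_b_shared = corrected_estimate(p_b, sens_a, spec_a)

true_diff = true_a - true_b
shared_diff = est_a - est_b_shared
```

Here the true difference is negative (B is better), yet the shared-calibration difference is positive, a sign reversal driven entirely by the judge's error rates differing between the two models.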

2605.06938 2026-05-11 cs.LG cs.AI

A Generalized Singular Value Theory for Neural Networks

Brian Charles Brown, Robert Bridges, David Grimsman, Mauricio Munoz, Sean Warnick

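AI summary: Building on the Generalized Singular Value Decomposition (GSVD) theory of Brown et al., this paper proves that most modern neural network architectures admit a generalized SVD representation that is left-invertible before the final linear layer, with input-output behavior unchanged. It further shows that the nonlinear portion can be made norm preserving, so that perturbations in the embedding space are proportional to changes in the input space, allowing distances in feature space to be calibrated directly against distances in input space. The paper provides a data-driven algorithm for estimating this representation from trained models, proposes a network architecture that facilitates the decomposition, and demonstrates the representation's potential for detecting adversarial perturbations.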

English abstract

Building on the abstract Generalized Singular Value Decomposition (GSVD) theory of Brown et al. [2025], we prove that most modern neural architectures admit a generalized SVD representation in which they are left-invertible before a final linear layer, with no change in input-output behavior. Furthermore, the left-invertible nonlinear portion of the input-output behavior can be made to be norm preserving, meaning that perturbations in the left-invertible "embedding" (the activations prior to the final linear layer in this representation) correspond proportionally to changes in the input, i.e., distance in feature space can be calibrated directly to distance in input space. We provide a data-driven algorithm for estimating this representation from trained models and propose a model architecture that naturally facilitates the decomposition. We then provide a proof-of-concept that the learned representation can be used to identify adversarial perturbations to model inputs, and develop the theory necessary for future applications to areas such as model bias and invertibility.

2605.06937 2026-05-11 cs.LG

A Reproducible Optimisation Protocol for Calibrating Prompt-Based Large Language Model Workflows in Evidence Synthesis

Teo Susnjak

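AI summary: This paper presents a reproducible optimisation workflow for calibrating prompt-based large language model workflows in structured evidence-synthesis tasks. The method separates the rules of the scientific task from the mutable prompt harness, optimises the harness against labelled examples and an explicit task metric, and saves the calibrated workflow as an inspectable artefact. The study validates the approach on title and abstract screening and shows how a smaller student model can perform the task while a larger reflection model steers the prompt optimisation.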

English abstract

This methods article presents a reproducible calibration workflow for prompt-based large language models (LLMs) in structured evidence-synthesis tasks. The method separates the rules that define the scientific task from the mutable prompt harness that frames and applies them. It optimises that harness against labelled or reference examples and an explicit task metric, then preserves the calibrated workflow as an inspectable artefact with its specification, metric, settings, and evaluation traces. The example code instantiates the protocol with DSPy and GEPA tools, but the underlying logic can transfer to other prompt-optimisation frameworks that support structured task definitions, metric-guided search, and artefact reuse. Title and abstract screening is the worked validation case because it provides labelled benchmark data and clear evaluation metrics. The demonstrated workflow uses a smaller student LLM for performing the scientific task execution and a larger reflection LLM to steer the prompt optimisation process during calibration. This work shows compilation, artefact round-tripping, and how optimisation budget affects a smaller student model.

2605.06934 2026-05-11 cs.LG

Learned Lyapunov Shielding for Adaptive Control

Giansalvo Cirrincione, Adriano Fagiolini

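AI summary: This paper proposes an adaptive control method for Euler-Lagrange systems that augments the classical Slotine-Li controller with three learned components: a structured-quadratic Lyapunov function, a residual Soft Actor-Critic policy that corrects the control input, and a physics-informed neural network that estimates unmodeled dynamics. A closed-form safety filter keeps the control output within the safety constraint, and global feasibility and exponential stability are established without solving a quadratic program online. Experiments on a 2-DOF manipulator with nonlinear friction and variable payload show reduced tracking error, and a 7-DOF study on a Franka Emika Panda examines scalability.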

English abstract

We augment the Slotine–Li adaptive controller for Euler–Lagrange systems with three learned components: a structured-quadratic Lyapunov function $V_\psi$ whose positive-definiteness follows from a Cholesky parameterization, a residual Soft Actor–Critic policy that adds bounded torque corrections to the analytic baseline, and a physics-informed neural network that estimates unmodeled dynamics. A closed-form safety filter, derived from the single affine constraint $\dot V_\psi + \alpha V_\psi \le 0$, projects every policy output onto the safe set without requiring an online QP solver. We prove: global feasibility of the filter under a drift-decay condition on the control-degeneracy set; exponential stability under exact shielding, with a robust extension whose margin depends on the PINN approximation error; almost-sure convergence of the three-timescale policy–certificate–multiplier updates to a KKT point; and a PAC generalization bound for the certificate over compacts. On a 2-DOF manipulator with nonlinear friction and variable payload, the learned certificate accounts for most of the empirical gain: tracking error drops by 41% on nominal friction and 24% on aggressive friction at the centroid of the training distribution. A 7-DOF scalability study on a Franka Emika Panda confirms clean convergence of the full pipeline at industrial scale, identifies the conditions under which gains over exact model-based baselines should and should not be expected, and documents a warm-start pathology of the learned certificate that has practical implications for deployment.
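
To see why a single affine constraint admits a closed-form filter, suppose the Lyapunov decrease condition can be written as a·u + b ≤ 0 in the control u, as it can for control-affine dynamics. Projecting onto that half-space needs no QP solver. The sketch below is a generic half-space projection under that assumption, not the authors' filter; the names `a`, `b`, and `shield` are hypothetical.

```python
import numpy as np

def shield(u, a, b, eps=1e-12):
    """Euclidean projection of u onto the half-space {v : a @ v + b <= 0}."""
    violation = float(a @ u + b)
    if violation <= 0.0:
        return u.copy()  # already safe, leave the policy output untouched
    norm_sq = float(a @ a)
    if norm_sq < eps:
        # Control has no influence on the constraint here; safety must come from b <= 0.
        raise ValueError("constraint cannot be satisfied by changing u")
    # Move the minimum distance along a to land on the constraint boundary.
    return u - (violation / norm_sq) * a

a = np.array([1.0, 0.0])   # assumed sensitivity of the constraint to the control
b = 0.5                    # assumed drift term plus decay term

u_unsafe = np.array([1.0, 2.0])
u_safe_in = np.array([-1.0, 0.0])

u_filtered = shield(u_unsafe, a, b)
u_unchanged = shield(u_safe_in, a, b)
```

The degenerate branch, where `a` vanishes, corresponds to the situation in which feasibility has to be guaranteed by a condition on the drift alone.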

2605.06931 2026-05-11 cs.LG

Target-Aware Data Augmentation for SAT Prediction

Eshed Gal, Uri Ascher, Eldad Haber

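AI summary: This paper proposes a target-aware data augmentation method for Boolean satisfiability (SAT) that generates correctly labeled SAT and UNSAT instances without a solver, addressing the high cost and low efficiency of conventional labeling. By constructing synthetic data that matches the structure of a target benchmark, the method improves downstream learning; the paper also designs a linear-programming-aware graph neural network (LPGNN) that passes messages using constraint-violation residuals to better capture the problem's optimization structure. The study reports large speedups in data generation and shows that structurally aligned synthetic data is effective for GNN-based SAT prediction.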

English abstract

Learning-based approaches to NP-hard problems have shown increasing promise, but their progress is fundamentally constrained by the high cost of generating labeled training data. In domains such as Boolean satisfiability (SAT), standard pipelines rely on solver-in-the-loop labeling, which scales poorly with problem size and limits the amount of usable supervision. This bottleneck hinders the broader goal of leveraging machine learning to capture structure in hard combinatorial problems. In this work, we propose a target-aware, solver-free data generation framework for SAT that produces correctly labeled SAT and UNSAT instances by construction, eliminating the need for expensive solver calls. Our method aligns generated instances with the structural properties of a target benchmark, making synthetic data effective for downstream learning. We further develop a linear-programming-aware graph neural network (LPGNN) architecture that incorporates constraint-violation residuals into message passing, enabling the model to exploit underlying optimization structure. Together, these contributions support a data-centric paradigm for learning on NP-hard problems, where scalable, task-aligned data generation is as critical as model design. Our approach yields orders-of-magnitude speedups in data generation, demonstrating that benchmark-aligned synthetic data can effectively augment solver-labeled datasets for GNN-based SAT prediction.
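
The idea of labels that are correct by construction can be illustrated with a classic planted-instance scheme. This is a generic sketch, not the paper's target-aware generator: it ignores alignment with a target benchmark, and the function names, clause width, and core size are assumptions. Clauses use DIMACS-style signed integers.

```python
import itertools
import random

def planted_sat(n_vars, n_clauses, k=3, seed=0):
    """Random k-CNF guaranteed satisfiable by a hidden (planted) assignment."""
    rng = random.Random(seed)
    assignment = {v: rng.choice([True, False]) for v in range(1, n_vars + 1)}
    clauses = []
    for _ in range(n_clauses):
        variables = rng.sample(range(1, n_vars + 1), k)
        clause = [v if rng.random() < 0.5 else -v for v in variables]
        if not any((lit > 0) == assignment[abs(lit)] for lit in clause):
            # Flip one literal so the planted assignment satisfies the clause.
            i = rng.randrange(k)
            clause[i] = -clause[i]
        clauses.append(clause)
    return clauses, assignment

def planted_unsat(n_vars, n_clauses, k=3, core_size=2, seed=0):
    """CNF guaranteed unsatisfiable: contains every sign pattern over a few variables."""
    rng = random.Random(seed)
    core_vars = rng.sample(range(1, n_vars + 1), core_size)
    core = [[v if sign else -v for v, sign in zip(core_vars, signs)]
            for signs in itertools.product([True, False], repeat=core_size)]
    filler, _ = planted_sat(n_vars, n_clauses, k, seed)
    return core + filler

def is_satisfiable(clauses, n_vars):
    """Brute-force check, only used here to verify tiny instances."""
    for bits in itertools.product([False, True], repeat=n_vars):
        if all(any((lit > 0) == bits[abs(lit) - 1] for lit in c) for c in clauses):
            return True
    return False

sat_clauses, planted = planted_sat(n_vars=6, n_clauses=20)
unsat_clauses = planted_unsat(n_vars=6, n_clauses=10)
```

Neither generator calls a solver: the SAT label follows from the planted assignment, and the UNSAT label follows from the contradictory core, which any assignment must falsify.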

2605.06927 2026-05-11 cs.CV cs.AI

XiYOLO: Energy-Aware Object Detection via Iterative Architecture Search and Scaling

Tony Tran, Richie R. Suganda, Bin Hu

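AI summary: This paper proposes XiYOLO, an energy-aware object detection framework that aims to balance high detection accuracy against low energy consumption on heterogeneous edge devices. The method combines iterative architecture search, an energy-aware search space, and a two-stage energy estimator to find an efficient detection model, then applies compound scaling to produce a family of XiYOLO models for different deployment budgets. Experiments on multiple datasets and real devices show that XiYOLO substantially reduces energy consumption relative to YOLO baselines while maintaining high detection accuracy.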

English abstract

Object detection on heterogeneous edge devices must satisfy strict energy, latency, and memory constraints while still providing reliable perception for downstream autonomy. Existing energy-aware NAS methods often target limited deployment settings, while real energy remains difficult to optimize because it is highly device-dependent and costly to measure. We address these challenges with an energy-adaptive framework that combines an energy-aware XiResOFA search space, a two-stage energy estimator, and iterative search to identify a single energy-efficient base architecture. We then apply compound scaling to transform this base design into the XiYOLO family across deployment budgets, enabling interpretable accuracy-energy tradeoffs under sparse hardware measurements. Experiments on PascalVOC, COCO, and real-device deployment show that XiYOLO achieves a stronger energy-accuracy tradeoff than YOLO baselines. On PascalVOC, the medium XiYOLO model reaches 86.15 mAP50 while reducing energy relative to YOLOv12m by 20.6% on GPU and 35.9% on NPU. On COCO, XiYOLO reduces energy relative to YOLOv12 by up to 53.7% on GPU and 51.6% on NPU at the small scale. The proposed two-stage estimator also improves sample efficiency over a joint predictor under few-shot adaptation with only 2-20 target-device samples.

2605.06924 2026-05-11 cs.CV cs.AI

A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency

Do Xuan Long, Yale Song, Min-Yen Kan, Tomas Pfister, Long T. Le

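AI summary: Generating consistent and coherent long videos remains a fundamental challenge. This paper proposes A$^2$RD, an agent-based autoregressive diffusion architecture that decouples creative synthesis from consistency enforcement to synthesize and self-improve long videos segment by segment. The method comprises three core components: multimodal video memory, adaptive segment generation, and hierarchical test-time self-improvement. It counters semantic drift and narrative collapse and achieves significant gains on multiple benchmarks.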

Comments Project page: http://dxlong2000.github.io/AARD

English abstract

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A$^2$RD, an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A$^2$RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve–Synthesize–Refine–Update cycle. It comprises three core components: (i) Multimodal Video Memory that tracks video progression across modalities; (ii) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (iii) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A$^2$RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence. Human evaluations corroborate these gains while also highlighting notable improvements in motion and transition smoothness.

2605.06919 2026-05-11 cs.CL

Can LLMs Take Retrieved Information with a Grain of Salt?

Behzad Shayegh, Mohamed Osama Ahmed, Fred Tung, Leo Feng

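AI summary: This study examines how large language models handle uncertainty in retrieved information and finds systematic shortcomings in adapting to the certainty of the context: they struggle to recall prior knowledge, misinterpret expressed confidence, and overtrust complex content. To address these, the authors propose an interaction strategy that combines prior reminders, certainty recalibration, and context simplification, reducing obedience errors by 25% on average without modifying model weights and demonstrating the value of interaction design for model reliability. The study also provides a metric for evaluating uncertainty handling and an improvement strategy that carries across models.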

English abstract

Large language models have demonstrated impressive retrieval-augmented capabilities. However, a crucial area remains underexplored: their ability to appropriately adapt responses to the certainty of the retrieved information. This limitation has real consequences in high-stakes domains like medicine and finance. We evaluate eight LLMs on their context-certainty obedience, measuring how well they adjust responses to match expressed context certainty. Our analysis reveals systematic limitations: LLMs struggle to recall prior knowledge after observing an uncertain context, misinterpret expressed certainties, and overtrust complex contexts. To address these, we propose an interaction strategy combining prior reminders, certainty recalibration, and context simplification. This approach reduces obedience errors by 25% on average, without modifying model weights, demonstrating the efficacy of interaction design in enhancing LLM reliability. Our contributions include a principled evaluation metric, empirical insights into LLMs' uncertainty handling, and a portable strategy to improve context-certainty obedience across diverse LLMs.