arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10107 2026-05-12 cs.AI cs.AR

Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring

Hongqin Lyu, Yonghao Wang, Zhiteng Chao, Tiancheng Wang, Huawei Li

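AI summary: This paper proposes Arcane, an assertion reduction framework that addresses the low simulation efficiency caused by redundant assertions in assertion-based hardware verification. The method combines semantic clustering, which accurately classifies large assertion sets, with Monte Carlo Tree Search (MCTS), which explores optimal rule-application sequences to reduce the assertion count efficiently. Experiments show that Arcane reduces the number of assertions by up to 76.2% while preserving formal coverage and mutation-detection ability, and speeds up simulation by 2.6x to 6.1x.

Comments 6 pages, 6 figures

English abstract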

Assertion-based Verification (ABV) is essential for ensuring that hardware designs conform to their intended specifications. However, existing automated assertion-generation approaches, such as LLM-based frameworks, often generate large numbers of redundant assertions, which significantly degrade simulation efficiency. To mitigate the simulation overhead caused by redundant assertions, this paper proposes Arcane, an efficient assertion reduction framework. It integrates a two-tier assertion clustering approach for accurate semantic classification of large assertion sets, and employs Monte Carlo Tree Search (MCTS) to explore optimal rule-application sequences for efficient assertion reduction. The experimental results on Assertionbench [20] show that Arcane achieves a reduction of up to 76.2% in the assertion count while fully preserving formal coverage and mutation-detection ability. Further simulation studies demonstrate a speedup of 2.6x to 6.1x in simulation time. The proposed framework is released at https://anonymous.4open.science/r/Arcane1-0A6F/.

2605.10106 2026-05-12 cs.CV cs.AI

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma

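AI summary: This paper proposes ViSRA, a video-based 3D spatial reasoning agent that aims to improve the spatial reasoning ability of multi-modal large language models (MLLMs). ViSRA requires no additional training: it uses explicit spatial information from expert models to guide spatial reasoning in a modular and extensible way, yielding a flexible plug-and-play framework. The method performs well on existing benchmarks and on unseen 3D spatial tasks, improving over baselines by up to 15.6% and 28.9% absolute, respectively, while offering transferable 3D understanding at low computational cost.

English abstract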

Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) no post-training computational cost and no heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.

2605.10091 2026-05-12 cs.LG

TopoU-Net: a U-Net architecture for topological domains

Gaurav Gaurav, Ibrahem ALJabea, Yaroslav Zakomornyy, Eric Frank, Mohamed Elhamdadi, Theodore Papamarkou, Mustafa Hajij

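AI summary: TopoU-Net is a U-Net architecture for topologically structured data, designed to handle data containing points, edges, regions, hyperedges, and other complex structures. It treats U-Net as a hierarchical encoder-decoder framework and uses the cells, incidences, and ranks of combinatorial complexes to build representation spaces and skip connections. By introducing the notion of a rank path, TopoU-Net passes features between topological levels and performs strongly across several tasks, with especially large gains on heterophilic graphs and higher-order structured data.

English abstract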

Many modern datasets mix points, edges, regions, groups, objects, events, hyperedges, and relations. Yet neural architectures often force such data into grids, graphs, or sequences, obscuring higher-order structure and making encoder-decoder designs domain-specific. We view U-Net not as a grid-specific architecture, but as a hierarchical encoder-decoder principle: representation spaces, transport maps between levels, and skip connections between matched levels. Combinatorial complexes naturally supply these ingredients through cells, incidences, and ranks. We introduce TopoU-Net, a rank-path U-Net for topological domains. Given a path from an input rank to a bottleneck rank and back, the encoder lifts cochains upward along incidence maps, the decoder transports them downward, and skip connections merge features at matched ranks. Rank replaces spatial scale: choosing paths through nodes, edges, faces, hyperedges, or global cells becomes the central architectural decision. A key quantity is the bottleneck support ratio, the number of cells at the bottleneck relative to the number of cells at the input rank. This ratio is fixed by the complex and chosen path rather than by arbitrary pooling, and it clarifies when skip connections are optional, useful, or structurally important. Across node classification, graph classification, hypergraph node classification, mesh classification, and image reconstruction, TopoU-Net provides a reusable encoder-decoder template for higher-order structured data. Among the evaluated baselines, it achieves the strongest mean accuracy on six of eight node-classification datasets and four of five hypergraph datasets, with the largest gains on heterophilic graphs. Ablations show that removing skip connections is most damaging under severe bottleneck compression.
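As a rough illustration of the bottleneck support ratio described in this abstract, the sketch below computes it for a small complex. The cell counts, the rank paths, and the helper name `bottleneck_support_ratio` are illustrative assumptions, not values or code from the paper.

```python
def bottleneck_support_ratio(cells_per_rank, path):
    """Ratio of cells at the bottleneck rank to cells at the input rank.

    cells_per_rank: dict mapping rank -> number of cells of that rank.
    path: ranks visited by the encoder, from input rank to bottleneck rank.
    """
    input_rank, bottleneck_rank = path[0], path[-1]
    return cells_per_rank[bottleneck_rank] / cells_per_rank[input_rank]


# Hypothetical complex: 100 nodes (rank 0), 300 edges (rank 1),
# 50 faces (rank 2), and a single global cell (rank 3).
cells = {0: 100, 1: 300, 2: 50, 3: 1}

mild = bottleneck_support_ratio(cells, path=[0, 1, 2])    # 50 cells / 100 cells
severe = bottleneck_support_ratio(cells, path=[0, 1, 3])  # 1 cell / 100 cells
```

The second path compresses 100 input cells into a single bottleneck cell, which is the severe-compression regime where the abstract reports that removing skip connections is most damaging.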

2605.10087 2026-05-12 cs.CV

Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction

Guhnoo Yun, Juhan Yoo, Kijung Kim, Dong Hwan Kim

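AI summary: This paper proposes a framework for detecting the initiation of interaction in human-robot interaction (HRI) from nonverbal cues, based on audio and vision sensor fusion, for robots in domestic environments. The framework combines sound source localization with human tracking information to detect the initiation of interaction when the user looks at the robot; even when the user does not speak directly, it recognizes the intent to interact once the gaze duration exceeds a predefined threshold. The authors design a state transition model, validate it experimentally on a mobile robot, and integrate all modules in ROS.

English abstract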

This paper describes a keyword-free initiation of interaction (IoI) detection framework for human-robot interaction (HRI), based on audio and vision sensor fusion in a domestic environment. In the proposed framework, the robot has its own audio and vision sensors and can also employ an external vision sensor for stable human detection and tracking. When the user starts to speak while looking at the robot, the robot localizes the user's position by combining sound source localization with human tracking information. The robot then detects the IoI if it perceives that the speaker's face is turned toward it. If the user does not speak directly, the robot can still detect the IoI when the user looks at the robot for longer than a predefined period of time. A state transition model for the proposed IoI detection framework is designed and verified by experiments with a mobile robot. To integrate the model into a robot architecture, all components are implemented in the Robot Operating System (ROS) environment.
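The two triggers in this abstract (speech while facing the robot, or a sufficiently long gaze without speech) can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the state names, the 2-second threshold, and the per-step boolean observations are hypothetical, and the paper's actual state transition model may differ.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    GAZE_TRACKING = auto()
    INTERACTION = auto()


GAZE_THRESHOLD_S = 2.0  # hypothetical dwell threshold, in seconds


def step(state, gaze_time, facing_robot, speaking, dt):
    """Advance the detector by one time step of length dt seconds.

    Returns the new state and the updated accumulated gaze time.
    """
    if state is State.INTERACTION:
        return state, gaze_time
    if not facing_robot:
        # Looking away resets the accumulated gaze time.
        return State.IDLE, 0.0
    # Speech while facing the robot triggers the IoI immediately.
    if speaking:
        return State.INTERACTION, gaze_time
    # Otherwise accumulate gaze time and trigger once it exceeds the threshold.
    gaze_time += dt
    if gaze_time > GAZE_THRESHOLD_S:
        return State.INTERACTION, gaze_time
    return State.GAZE_TRACKING, gaze_time


def run(observations, dt=0.5):
    """Feed a sequence of (facing_robot, speaking) observations to the detector."""
    state, gaze_time = State.IDLE, 0.0
    for facing_robot, speaking in observations:
        state, gaze_time = step(state, gaze_time, facing_robot, speaking, dt)
    return state
```

In a real system the two booleans would come from face-orientation estimation and sound source localization; here they are supplied directly so that the transition logic can be read on its own.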

2605.10086 2026-05-12 cs.RO

A cell-decomposition based path planner for 3D navigation in constrained workspaces

João P. L. Morais, Luciano C. A. Pimenta, Marcelo A. Santos, Guilherme V. Raffo

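AI summary: This paper proposes a cell-decomposition-based path planning algorithm for navigation in constrained 3D workspaces that ensures complete visibility between each cell and at least one adjacent cell. The method yields a simplified framework for verifying path feasibility that can be embedded easily in optimization problems. By combining Yen's k-shortest path algorithm with second-order cone programming (SOCP), the authors propose KSP-SOCP, which improves on a standard SOCP while avoiding the computational burden of the mixed-integer variant (MISOCP); experiments show time performance comparable to MISOCP with lower memory use, which suits large-scale scenarios.

Comments Accepted for publication at the 23rd IFAC World Congress (Busan, Korea)

English abstract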

This paper proposes a cell decomposition algorithm for binary occupancy grids that ensures mutual complete visibility from each cell to at least one adjacent cell. This decomposition establishes a simplified framework for verifying path feasibility that can be easily embedded in optimization problems. To illustrate its utility, we formulate both second-order cone programs (SOCP) and their mixed-integer variant (MISOCP) within the proposed framework. Furthermore, we propose the KSP-SOCP method, which combines Yen's k-shortest path algorithm with the SOCP, achieving improved solutions compared to a standard SOCP approach while avoiding the computational burden of MISOCP. The cell decomposition algorithm, KSP-SOCP, and MISOCP approaches were evaluated in 9 city-like workspaces. The decomposition efficiently partitioned each map, enabling both optimization methods to compute feasible paths. The proposed KSP-SOCP achieved time performance comparable to the MISOCP while requiring less memory, making it highly suitable for large-scale problems.

2605.10083 2026-05-12 cs.LG

Unlocking air traffic flow prediction through microscopic aircraft-state modeling

Bin Wang, Anqi Liu, Jiangtao Zhao, Yanyong Huang, Peilan He, Guiyuan Jiang, Feng Hong, Yanwei Yu, Tianrui Li

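AI summary: This paper studies how microscopic aircraft-state modeling can improve the accuracy of short-term air traffic flow prediction in terminal airspace. It proposes AeroSense, a framework that starts directly from dynamic sets of aircraft states derived from ADS-B trajectories and learns an end-to-end mapping from microscopic aircraft states to future regional traffic flow. The method does not rely on historical look-back windows and adapts naturally to varying traffic density; experiments show that it improves prediction accuracy over traditional approaches based on aggregated time series, particularly in high-density traffic.

English abstract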

Short-term air traffic flow prediction in terminal airspace is essential for proactive air traffic management. Existing approaches predominantly model traffic flow as aggregated time series, despite traffic dynamics being governed by aircraft states and interactions in continuous airspace. Such aggregation obscures fine-grained information including aircraft kinematics, boundary interactions, and control intent. Here we present AeroSense, a state-to-flow modeling framework that predicts future traffic flow directly from instantaneous airspace situations represented as dynamic sets of aircraft states derived from ADS-B trajectories. By establishing an end-to-end mapping from microscopic aircraft states to future regional traffic flow, AeroSense preserves aircraft-level dynamics while naturally accommodating varying traffic density without relying on historical look-back windows. Experiments on a large-scale real-world dataset show that AeroSense consistently improves predictive accuracy over aggregation-based forecasting approaches, particularly during high-density traffic periods. These findings suggest that instantaneous airspace situations provide an effective alternative to conventional time-series-based traffic forecasting paradigms.

2605.10079 2026-05-12 cs.CV

SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, Yoichi Sato

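AI summary: This paper proposes SocialDirector, a training-free interaction controller that improves control over social interactions in multi-person video generation. By modulating cross-attention maps, the method controls who performs an action, when it occurs, and toward whom it is directed, addressing problems in existing models such as actor-action mismatch and disordered social dynamics. The authors also build an automated evaluation pipeline, and experiments show that SocialDirector significantly improves the interaction fidelity of generated videos, approaching the level of real videos.

English abstract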

Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and wrong action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual descriptions via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), leading each action towards its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.

2605.10071 2026-05-12 cs.CV

MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

Yaning Zhang, Tianyi Wang, Zan Gao, Yibo Zhao, Chunjie Ma, Meng Wang

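AI summary: With the rapid development of highly realistic face generation, generalizable face forgery detection and localization methods have become especially important. This paper proposes a multi-domain fine-grained vision-language reconstruction model (MFVLR) that captures visual forgery traces across multiple domains through language-guided fine-grained face forgery representation learning, enabling generalizable detection and localization of diffusion-generated face forgeries. The model introduces a fine-grained language transformer, a multi-domain vision encoder, and a vision decoder, together with a novel vision injection module, and significantly improves performance in cross-generator, cross-forgery, and cross-dataset settings.

English abstract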

The swift advancement in photo-realistic face generation technology has sparked considerable concerns across society and academia, emphasizing the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using the image modality alone; other modalities, such as fine-grained text, are not comprehensively investigated, which restricts the generalization capability of models. Besides, they usually analyze facial images created by GANs, but struggle to identify and localize those synthesized by diffusion models. To solve these problems, in this paper, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning, to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that learns general fine-grained language embeddings using language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. Besides, we propose an innovative plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in different settings, such as cross-generator, cross-forgery, and cross-dataset evaluations.

2605.10065 2026-05-12 cs.CL cs.AI

NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding

Hyundong Jin, Yo-Sub Han

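AI summary: Preventing large language models from generating inappropriate content, such as profanity and personally identifiable information, is increasingly important in text generation. To handle multiple hard constraints and regex constraints efficiently during decoding, this paper proposes NCO, a decoding strategy that processes constraints through online pattern matching, avoids state explosion, and is compatible with a range of sampling and search methods. Experiments show that NCO is effective for content filtering in practical tasks.

English abstract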

Controlling Large Language Models (LLMs) to prevent the generation of undesirable content, such as profanity and personally identifiable information (PII), has become increasingly critical. While earlier approaches relied on post-processing or resampling, recent research has shifted towards constrained decoding methods that control outputs during generation to mitigate high computational costs and quality degradation. However, preventing multiple forbidden hard constraints or regex constraints from appearing anywhere in the output is computationally challenging. A straightforward solution is to convert these constraints into a single automaton that tracks all forbidden patterns during decoding, but this often becomes impractically large. Standard regex engines also do not readily support the operations needed to build such a constraint, such as complement and intersection. To address these limitations, we propose NCO, a decoding strategy that performs online pattern matching over finite hard constraints and regex constraints, reducing computational overhead without inducing state explosion. NCO is fully compatible with standard inference strategies, including various sampling methods and beam search, while also supporting soft masking for probabilistic suppression. We empirically demonstrate its effectiveness across practical tasks, including PII and profanity suppression. Our implementation is available at https://github.com/hyundong98/NCO-Decoding.git.
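To make the idea of negative constraints during decoding concrete, here is a deliberately naive sketch that bans any next token which would complete a forbidden string. It is not NCO's algorithm: it handles only fixed strings (no regex constraints), rechecks a short text tail at every step, and uses a toy vocabulary and hypothetical scores. The function names are illustrative.

```python
def banned_tokens(generated, vocab, forbidden):
    """Return the tokens that would complete a forbidden string.

    Assumes `generated` is already free of forbidden strings, so a new match
    must overlap the appended token. Only the last max_len - 1 characters of
    the text can take part in such a match.
    """
    max_len = max(len(p) for p in forbidden)
    tail = generated[-(max_len - 1):] if max_len > 1 else ""
    banned = set()
    for token in vocab:
        candidate = tail + token
        if any(p in candidate for p in forbidden):
            banned.add(token)
    return banned


def constrained_greedy(step_scores, vocab, forbidden):
    """Greedy decoding that masks banned tokens before taking the argmax."""
    text = ""
    for scores in step_scores:
        banned = banned_tokens(text, vocab, forbidden)
        allowed = [t for t in vocab if t not in banned]
        text += max(allowed, key=lambda t: scores[t])
    return text


# Toy example with hypothetical per-step token scores.
vocab = ["da", "mn", "rk", " "]
forbidden = ["damn"]
step_scores = [
    {"da": 1.0, "mn": 0.5, "rk": 0.2, " ": 0.1},
    {"da": 0.1, "mn": 1.0, "rk": 0.8, " ": 0.2},
]
output = constrained_greedy(step_scores, vocab, forbidden)
```

Unconstrained greedy decoding would pick "da" and then "mn"; with the mask, the second step falls back to the next-best allowed token. The per-step cost of this sketch grows with the vocabulary size times the number of patterns, which is the kind of overhead an online matching strategy is meant to reduce.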

2605.10064 2026-05-12 cs.AI

MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs

Ruiyi Yang, Zechen Li, Hao Xue, Imran Razzak, Flora D. Salim

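AI summary: MAGE is a multi-agent co-evolution framework that builds a co-evolutionary knowledge graph with four subgraphs to store, outside the model, the experience and feedback that agents gather during learning, supporting stable inference-time performance of a frozen backbone model. The method uses task-conditioned guidance retrieval, combined with task-level and skill-level bandit strategies, to accumulate and apply knowledge efficiently. Experiments show that MAGE performs strongly against prompt-based frozen-backbone baselines on multiple complex tasks, demonstrating its effectiveness and broad applicability for self-evolving learning.

Comments 25 pages, 3 figures

English abstract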

Self-evolving language-model agents must decide what to learn next and how to preserve what they have learned across iterations. Existing systems typically carry this cross-iteration knowledge as natural-language feedback, flat episodic memory, or implicit reinforcement signals, none of which cleanly supports a frozen weak backbone at inference time. This paper introduces MAGE (Multi-Agent Graph-guided Evolution), a framework that externalizes self-knowledge into a four-subgraph co-evolutionary knowledge graph. Its experience subgraph stores both teacher-written failure corrections and the learner's own past correct reasoning traces, which are retrieved as task-conditioned guidance for a frozen execution model. During evolution, the graph, a task-level search bandit, and a skill-level routing bandit are updated from the same reward stream, while the learner's backbone remains unchanged. We further provide structural analysis showing how append-only memory growth, bounded curriculum coverage, and task-filtered retrieval together support stable improvement of the retrieval substrate for frozen-learner evolution. Across nine benchmarks spanning mathematical reasoning, multi-hop and open-domain question answering, spatio-temporal analysis, financial numerical reasoning, medical multiple-choice, an open-world survival game, and web navigation, MAGE achieves strong performance against prompt-based frozen-backbone baselines. Ablations show that self-harvested success traces and teacher-written corrections are complementary, with success memories contributing most on reasoning-template-heavy tasks and corrective memories supporting harder composition and interaction settings.

2605.10063 2026-05-12 cs.RO

EFGCL: Learning Dynamic Motion through Spotting-Inspired External Force Guided Curriculum Learning

Keita Yoneda, Kento Kawaharazuka, Kei Okada

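AI summary: This paper proposes External Force Guided Curriculum Learning (EFGCL), a physically guided reinforcement learning method that addresses the low efficiency and high failure risk of learning complex whole-body dynamic motions on legged robots. Inspired by "spotting" in gymnastics, the method introduces assistive external forces during training so that the robot physically experiences successful motion executions, without relying on task-specific reward design or reference trajectories. Experiments show that EFGCL markedly improves the efficiency with which a quadrupedal robot learns complex motions such as jumping, and the motions learned in simulation are reproduced on a real robot, supporting the method's effectiveness and generality.

Comments Accepted at RA-L 2026, website - https://keitayoneda.github.io/kleiyn-efgcl/, YouTube - https://youtu.be/sFK00hm14No/

Journal ref IEEE Robotics and Automation Letters (RA-L) 2026
English abstract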

Learning dynamic whole-body motions for legged robots through reinforcement learning (RL) remains challenging due to the high risk of failure, which makes efficient exploration difficult and often leads to unstable learning. In this paper, we propose External Force Guided Curriculum Learning (EFGCL), a guided RL approach based on the principle of physical guidance, in which external assistive forces are introduced during training. Inspired by spotting in artistic gymnastics, EFGCL enables agents to physically experience successful motion executions without relying on task-specific reward shaping or reference trajectories. Experiments on a quadrupedal robot performing Jump, Backflip, and Lateral-Flip tasks demonstrate that EFGCL accelerates learning of the Jump task by approximately a factor of two and enables the acquisition of complex whole-body motions that conventional RL methods fail to learn. We further show that the learned policies can be deployed on a real robot, reproducing motions consistent with those observed in simulation. These results indicate that physically guided exploration, which allows agents to experience success early in training, is an effective and general strategy for improving learning efficiency in dynamic whole-body motion tasks.

2605.10061 2026-05-12 cs.CL cs.AI

Not-So-Strange Love: Language Models and Generative Linguistic Theories are More Compatible than They Appear

R. Thomas McCoy

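AI summary: This paper examines the compatibility between neural language models (LMs) and generative linguistic theories, arguing that LMs can not only support gradient, usage-based linguistic theories but also instantiate generative theories based on formal structures. The argument expands the range of linguistic theories that can be tested with LMs and opens the possibility of reconciling usage-based and generative accounts.

Comments Accepted to Behavioral and Brain Sciences; 4 pages; Commentary on "How Linguistics Learned to Stop Worrying and Love the Language Models" by Richard Futrell and Kyle Mahowald

English abstract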

Futrell and Mahowald (2025) frame the success of neural language models (LMs) as supporting gradient, usage-based linguistic theories. I argue that LMs can also instantiate theories based on formal structures - the types of theories seen in the generative tradition. This argument expands the space of theories that can be tested with LMs, potentially enabling reconciliations between usage-based and generative accounts.

2605.10054 2026-05-12 cs.CV

Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging

Zubair Faruqui, Rahul Dubey

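AI summary: To address the over-reliance of deep neural networks on clinically irrelevant features in medical image diagnosis, this study incorporates explanation supervision directly into training to guide the model toward clinically meaningful regions. It systematically analyzes how different explanation loss designs and supervision strengths affect predictive performance and explanation faithfulness, and introduces two new quantitative metrics for evaluating explanation quality. Experiments show that the method improves the clinical relevance of explanations while maintaining model accuracy, and that it applies to a range of annotated biomedical imaging tasks.

Comments Under review at IEEE Journal of Biomedical and Health Informatics (JBHI)

English abstract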

Deep neural networks for medical image diagnosis often achieve high predictive accuracy while relying on spurious or clinically irrelevant visual cues, limiting their trustworthiness in practice. Post-hoc explanation methods are widely used to visualize model decisions in the form of saliency maps; however, these explanations do not influence how models learn during training, allowing non-causal or confounding features to persist. This motivates the incorporation of explanation supervision directly into the training objective to guide model attention toward clinically meaningful regions and promote clinically grounded decision-making. This paper presents a systematic approach to integrate explanation loss into model training and analyzes how different explanation loss designs and supervision strengths influence both predictive performance and spatial faithfulness of explanations. To quantitatively assess interpretability, two complementary explanation performance metrics, annotation coverage and saliency precision, are introduced, enabling rigorous evaluation beyond qualitative visualization. Our experimental results reveal a clear trade-off between explanation quality and explanation loss coefficients. Furthermore, quantitative statistical analysis yields consistently improved explanation alignment while maintaining comparable accuracy. Experiments were conducted on annotated chest X-ray datasets; however, the proposed framework is applicable to a broad range of annotated biomedical imaging modalities. Overall, these findings demonstrate that explanation supervision is not a monolithic design choice and provide practical guidance for incorporating explanation loss into training objectives under noisy clinical annotations.

2605.10051 2026-05-12 cs.RO cs.AI

Guided Streaming Stochastic Interpolant Policy

Puming Jiang, Meiyi Wang, Kelvin Lin, Ce Hao, Harold Soh

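AI summary: This paper studies how inference-time guidance lets generative robot policies adapt to dynamic objectives without retraining. Existing methods are limited by chunk-based architectures, which suffer from high latency and poor reactivity. By analyzing the time evolution of the value function, the authors derive the optimal guidance term for stochastic interpolants and propose the Streaming Stochastic Interpolant Policy (SSIP), which enables fast and reactive real-time control. They also propose two complementary mechanisms that support zero-shot adaptation and amortized inference, respectively; experiments show better reactivity and physical validity in dynamic, unstructured environments.

Comments Accepted to Robotics: Science and Systems (RSS) 2026. The first two authors contributed equally

English abstract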

Inference-time guidance is essential for steering generative robot policies toward dynamic objectives without retraining, yet existing methods are largely confined to chunk-based architectures that exhibit high latency and lack the reactivity needed for test-time preference alignment or obstacle avoidance. In this work, we formally derive the optimal guidance term for Stochastic Interpolants (SI) by analyzing the value function's time evolution via the Backward Kolmogorov Equation, establishing a modified drift that theoretically guarantees sampling from a target distribution. We apply this framework to real-time control through the Streaming Stochastic Interpolant Policy (SSIP), which generalizes the deterministic Streaming Flow Policy (SFP). Unifying this guidance law with the streaming architecture enables fast and reactive control. To support diverse deployment needs, we propose two complementary mechanisms: training-free Stochastic Trajectory Ensemble Guidance (STEG) that computes gradients on-the-fly for zero-shot adaptation, and training-based Conditional Critic Guidance (CCG) for amortized inference. Empirical evaluations demonstrate that our guided streaming approach significantly outperforms conventional chunk-based policies in reactivity and provides superior, physically valid guidance for dynamic, unstructured environments.

2605.10050 2026-05-12 cs.CV

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs

Jiameng Li, Minye Wu, Jiezhang Cao, Aleksei Tiulpin, Matthew B. Blaschko

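AI summary: Video large language models (VideoLLMs) struggle with long videos: dense sampling produces a massive number of visual tokens, while sparse sampling may miss critical temporal information and cause hallucination. This paper proposes EchoPrune, a lightweight, training-free token pruning method that interprets redundant tokens as temporal echoes and scores tokens by cross-modal relevance and temporal reconstruction error, improving temporal resolution under a fixed token budget. Experiments show that EchoPrune lets VideoLLMs process up to 20x as many frames under the same token budget and improves both performance and inference speed on several benchmarks.

Comments 9 pages

English abstract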

Long-form video understanding remains challenging for Video Large Language Models (VideoLLMs), as dense frame sampling introduces massive numbers of visual tokens while sparse sampling risks missing critical temporal evidence and leading to LLM hallucination. Existing training-free token reduction methods either treat videos as collections of static images or rely on segment-level merging heuristics, which weaken fine-grained spatiotemporal modeling and introduce additional overhead. In this paper, we propose EchoPrune, a lightweight and training-free token pruning method that improves temporal resolution under a fixed LLM-side visual token budget. Our core idea is to interpret redundant video tokens as temporal echoes: if a token is well reconstructed from the previous frame, it is merely a temporally redundant echo; otherwise, it may capture new events, motion, or query-relevant visual evidence. Based on this insight, EchoPrune scores visual tokens by (i) query-guided crossmodal relevance and (ii) temporal reconstruction error, measured by correspondence matching and echo matching across consecutive frames. The selected tokens preserve task-relevant cues and temporal novelty while suppressing predictable redundancy, allowing VideoLLMs to observe more frames without increasing the decoding budget. Extensive experiments on LLaVA-OV, Qwen2.5VL, and Qwen3VL across six video understanding benchmarks show that EchoPrune enables VideoLLMs to process up to 20x as many frames under the same token budget, yielding improved performance (+8.6%) and inference speedup (5.6x for prefilling) on Qwen2.5VL-7B.
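The scoring idea in this abstract can be illustrated with a small sketch that ranks the tokens of one frame by query relevance plus temporal novelty and keeps a fixed budget. The specifics are assumptions made for illustration: novelty is taken as one minus the best cosine match in the previous frame, the two terms are mixed with a hypothetical weight `alpha`, and the paper's correspondence matching and echo matching are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_frame(tokens, prev_tokens, query, budget, alpha=0.5):
    """Return the sorted indices of the `budget` highest-scoring tokens."""
    scores = []
    for tok in tokens:
        relevance = cosine(tok, query)
        # A token that closely matches something in the previous frame is an "echo".
        best_match = max((cosine(tok, p) for p in prev_tokens), default=0.0)
        novelty = 1.0 - best_match
        scores.append(alpha * relevance + (1.0 - alpha) * novelty)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy 2-D token features (hypothetical).
prev = [[1.0, 0.0], [0.0, 1.0]]
cur = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.0, 1.0]
kept = prune_frame(cur, prev, query, budget=2)
```

Token 0 repeats the previous frame and is irrelevant to the query, so it is dropped; token 1 is kept for its relevance and token 2 for its novelty.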

2605.10047 2026-05-12 cs.LG cs.AI

Rethinking Loss Reweighting for Imbalance Learning as an Inverse Problem: A Neural Collapse Point of View

Jinping Wang, Zixin Tong, Zhiwu Xie, Zhiqiang Gao

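AI summary: This paper rethinks loss reweighting for imbalanced learning as an inverse problem and proposes a dynamic weight adjustment strategy based on Neural Collapse theory. Taking equal per-class average loss as the target, the method infers class weights dynamically by working backward from that target, mitigating the effects of class imbalance more effectively. Experiments show that it outperforms mainstream long-tailed classification methods on multiple datasets and aligns more closely with the ideal geometry.

Comments Accepted by ICML2026

English abstract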

Loss reweighting is a widely used strategy for long-tailed classification, but existing reweighting strategies often rely on heuristics and rarely define a well-specified target. Inspired by Neural Collapse (NC), the ideal simplex Equiangular Tight Frame (ETF) terminal geometry suggests equal per-class average loss as a reasonable target for reweighting. Based on the ideal equal loss objective, we consider loss reweighting as an inverse problem and propose an inverse-view reweighting strategy that infers class weights dynamically to match this ideal objective. Empirically, NC metrics suggest that our method effectively reduces the loss imbalance coefficient and achieves closer alignment with NC geometry, while consistently outperforming strong long-tailed baselines on different datasets.
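The abstract does not spell out how the class weights are inferred, so the following is only a hedged illustration of the stated target (equal per-class average loss). It uses a simple multiplicative heuristic that raises the weight of classes whose loss is above the mean; the update rule, the `step` exponent, and the function name are assumptions, not the authors' inverse-view method.

```python
def update_class_weights(weights, class_losses, step=1.0):
    """One multiplicative update that up-weights classes with above-average loss.

    weights: current per-class weights.
    class_losses: current per-class average losses (all positive).
    step: hypothetical exponent controlling how aggressive the update is.
    """
    mean_loss = sum(class_losses) / len(class_losses)
    raw = [w * (loss / mean_loss) ** step for w, loss in zip(weights, class_losses)]
    scale = len(raw) / sum(raw)  # renormalize so the weights average to 1
    return [r * scale for r in raw]


# Hypothetical head, medium, and tail class losses.
weights = update_class_weights([1.0, 1.0, 1.0], [0.5, 1.0, 1.5])
```

When all per-class losses are already equal, the update leaves the weights unchanged, which matches the equal-loss target described in the abstract.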

2605.10046 2026-05-12 cs.CV cs.LG cs.MA

PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows

Yufeng Zhu, Chunlei Shi, Yongchao Feng, Dan Niu

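AI summary: This paper proposes PixelFlowCast, a precipitation nowcasting method that aims for efficient, high-fidelity short-term radar echo prediction without latent-space compression. It uses a two-stage framework: in the first stage a deterministic model produces coarse forecasts that capture the global evolution trend; in the second stage KANCondNet extracts deep spatiotemporal features for accurate conditional guidance, and a Pixel Mean Flows predictor generates high-quality forecasts in a few steps. Experiments show that PixelFlowCast outperforms mainstream methods in both prediction accuracy and inference efficiency, especially for long-sequence forecasting, indicating good potential for practical deployment.

Comments 26 pages, 7 figures

English abstract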

Precipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an $x$-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.

2605.10045 2026-05-12 cs.CV

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

Feihong Yan, Shaoyu Liu, Haixuan Wang, Shuai Lu, Linfeng Zhang, Huiqi Li, Xiangyang Ji

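AI summary: Visual autoregressive (VAR) models are a strong alternative to diffusion models for image generation, but their fixed training resolution prevents direct generation at higher resolutions. This paper proposes ExtraVAR, which introduces a stage-aware RoPE remapping strategy to address the global repetition, local repetition, and detail degradation that arise during resolution extrapolation, and further proposes an entropy-driven adaptive attention calibration method that adapts to the change in attention distribution at high resolution. Experiments show that the method outperforms existing approaches in both structural coherence and detail fidelity.

Comments 10 pages, 7 figures

English abstract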

Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.
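The attention-dispersion argument in this abstract can be illustrated numerically. The sketch below assumes that normalized entropy means Shannon entropy divided by log N, and it finds a logit scale by bisection so that the entropy at the larger resolution matches the training-resolution value. The paper reports a closed-form per-head factor; bisection is substituted here purely for illustration, and all logits are hypothetical.

```python
import math


def softmax(logits, scale=1.0):
    """Numerically stable softmax of scale * logits."""
    m = max(logits)
    exps = [math.exp(scale * (x - m)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(N), so it lies in [0, 1] for any length N."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def calibrate_scale(logits, target_entropy, lo=1.0, hi=64.0, iters=60):
    """Bisection for a scale s in [lo, hi] whose attention entropy matches the target.

    Relies on entropy decreasing monotonically as the scale grows.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if normalized_entropy(softmax(logits, mid)) > target_entropy:
            lo = mid  # still too dispersed: sharpen more
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical attention logits for one head: 4 keys at the training
# resolution, and the same pattern tiled to 16 keys at a larger resolution.
train_logits = [2.0, 1.0, 0.0, -1.0]
ext_logits = train_logits * 4

target = normalized_entropy(softmax(train_logits))
dispersed = normalized_entropy(softmax(ext_logits))
scale = calibrate_scale(ext_logits, target)
```

In this toy case the tiled logits give a higher normalized entropy than the training-resolution logits, and a scale above 1 brings the two back into agreement.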

2605.10044 2026-05-12 cs.LG cs.AI

Adaptive Action Chunking via Multi-Chunk Q Value Estimation

Yongjae Shin, Jongseong Chae, Seongmin Kim, Jongeui Park, Youngchul Sung

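AI summary This paper proposes Adaptive Action Chunking (ACH), a new method for action chunking in reinforcement learning. Using a Transformer-based architecture, it estimates action-values for all candidate chunk lengths in a single forward pass and adjusts the chunk length dynamically to the current state, overcoming the performance limits that fixed chunk lengths impose across states and tasks. Experiments show that ACH outperforms fixed-length baselines on 34 complex tasks, with better generalization and learning efficiency.

Details
English abstract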

Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.
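The selection step described above can be sketched as a greedy choice over per-length value estimates. This is only an illustration under stated assumptions: the abstract does not say how the estimates are compared, so the argmax rule, the function name `select_chunk_length`, and the numbers below are all hypothetical, and the Transformer that would produce the estimates is not modeled.

```python
def select_chunk_length(q_values, candidate_lengths):
    """Pick the candidate chunk length with the highest estimated action-value.

    Assumption: the estimates for different lengths are directly comparable,
    so a plain argmax is a reasonable selection rule.
    """
    if len(q_values) != len(candidate_lengths):
        raise ValueError("one Q estimate is required per candidate length")
    if not q_values:
        raise ValueError("at least one candidate length is required")
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return candidate_lengths[best_index]


# Hypothetical estimates for chunk lengths 1, 4, and 8 at the current state.
print(select_chunk_length([0.2, 0.9, 0.5], [1, 4, 8]))
```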

2605.10043 2026-05-12 cs.CL cs.AI

Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

Xilai Ma, Liye Zhao, Weijun Yao, Haibing Di, Wenya Wang, Jing Li

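AI summary This work personalizes large language models (LLMs) with binary feedback to better align them with individual user preferences. It proposes C-BPO, a preference-calibrated optimization framework that treats the target user's data as positive feedback and other users' data as implicit negative feedback, capturing inter-user differences. To handle preference overlap, it builds its objective on positive-unlabeled (PU) learning theory, which removes positive bias and yields more precise personalization while preserving general capability. Experiments show that C-BPO outperforms existing methods across a range of tasks and models.

Comments Accepted by ACL 2026 Main

Details
English abstract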

Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via preference-calibrated binary signals. By treating target user data as positive feedback and other users' data as an auxiliary set of implicit negative signals, C-BPO captures distinct inter-user differences. To mitigate the preference overlap issue, where shared task knowledge is erroneously penalized, we derive an objective grounded in Positive-Unlabeled (PU) learning theory. This approach purifies negative signals by subtracting "positive bias", ensuring alignment with unique idiosyncrasies without compromising general helpfulness. Empirical experiments across various personalization tasks and backbone LLMs show C-BPO consistently outperforms baselines, demonstrating the efficacy of preference-calibrated binary signals in modeling inter-user differences.

2605.10038 2026-05-12 cs.AI

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

Hangchen Liu, Dongyuan Li, Renhe Jiang, Jiewen Deng, Weiwei Ye, Yoshihide Sekimoto

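AI summary TimeClaw is an AI agent for time series analysis that addresses the difficulty of reusing exploratory experience gained during task execution. Through a four-stage loop of exploring, comparing, distilling, and reinjecting, it turns exploratory execution into reusable hierarchical experience. Combining metric-supervised learning, task-aware tool dropout, and inference-time experience injection, it improves prediction and reasoning in domains such as finance and weather. Experiments show that TimeClaw outperforms existing methods on multiple tasks, highlighting how much the handling of exploratory experience matters for the performance of scientific systems.

Comments Under review

Details
English abstract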

Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.

2605.10035 2026-05-12 cs.AI

From Single-Step Edit Response to Multi-Step Molecular Optimization

Haojie Rao, Kun Li, Yida Xiong, Jiameng Chen, Wenbin Hu, Yizhen Zheng, Jiajun Yu, Duanhua Cao

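AI summary This work optimizes specified molecular properties through structural edits, facing two challenges: structurally similar molecule data is scarce, and decisions must follow chemical rules. It proposes a response-oriented discrete edit optimization approach consisting of a single-step molecular edit response predictor and a multi-step planner that composes local predictions into optimization trajectories via guided tree search, reducing dependence on external evaluation and improving data efficiency.

Details
English abstract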

Conditional molecular optimization aims to edit a molecule to realize a specified property shift. In practice, structurally similar molecule data is scarce, while decisions are inherently action-level: at each step, the system must select one local structural edit from a candidate set that is strictly filtered by chemical feasibility rules. This level mismatch between supervision and decision makes oracle-in-the-loop search unstable in molecular optimization. Regressing on property differences between molecule pairs improves data efficiency but entangles transformation effects with global context and provides limited guidance for selecting the next feasible edit, so methods often resort to oracle-in-the-loop search. For this reason, we propose a response-oriented discrete edit optimization approach comprising two tightly coupled components: a single-step molecular edit response predictor (SMER) and a multi-step planner that composes local predictions into optimization trajectories via guided tree search (SMER-Opt). The approach learns a directional evaluation model over edit actions to support constraint-aware planning. It mines weakly related molecule pairs and decomposes their structural differences into minimal edit units, turning endpoint property annotations into process-level supervision and yielding reusable, transferable action primitives. A directional edit evaluator then scores feasible candidate edits by their likelihood of moving the molecule toward the desired property change, substantially reducing dependence on external evaluator queries at decision time. Code is available at https://anonymous.4open.science/r/SMER.

2605.10034 2026-05-12 cs.RO

Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

Aron Distelzweig, Faris Janjoš, Andreas Look, Anna Rothenhäusler, Daniel Jost, Oliver Scheel, Raghu Rajan, Daphne Cornelisse, Eugene Vinitsky, Joschka Boedecker

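AI summary This paper presents BehaviorBench, a comprehensive benchmark for evaluating the generalization of autonomous driving policies, aiming to close the gap between large-scale reinforcement learning policies and standard evaluation. The benchmark is designed along three axes (evaluation, scenario complexity, and behavior diversity), supports evaluating large-scale RL policies on standard planning benchmarks such as nuPlan, and introduces diverse interactive traffic agents to test policies under different behavior patterns. The study finds that policies trained with pure self-play generalize poorly to realistic traffic scenarios, and proposes a hybrid approach combining policy gradients with rule-based planning to improve performance.

Details
English abstract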

Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.

2605.10029 2026-05-12 cs.CV

Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities

Shuyang Hou, Ziqi Liu, Haoyue Jiao, Zhangyan Xu, Xiaopu Zhang, Lutong Xie, Yaxian Qing, Jianyuan Liang, Xuefeng Guan, Huayi Wu

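AI summary This study evaluates AlphaEarth Foundations (AEF), a globally consistent, high-resolution surface embedding dataset, for slum detection and density estimation across 12 cities worldwide. Across several training strategies and auxiliary feature configurations, it finds that same-city cross-year training works best, and it exposes AEF's limitations in discriminating slum boundaries and modeling intra-pixel density gradients. It also reports that POI features notably improve density estimation and shows that AEF preserves slum structure in long-term monitoring.

Details
English abstract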

Pixel-level slum mapping has long been constrained by limited cross-city generalisation, the absence of continuous density estimation, and weak global comparability. AlphaEarth Foundations (AEF), a globally consistent 64-dimensional annual surface embedding at 10 m, offers a new analysis-ready basis for lightweight slum monitoring, but its applicability to slum detection - an indirectly coupled task shaped by both built form and socio-economic processes - remains untested. We evaluate AEF on slum classification and sub-pixel density estimation across 12 cities and 69 city-year pairs (2017-2024), using GRAM pseudo-masks as supervisory labels. The evaluation spans four training strategies, two protocols (random split and 3x3 spatial block cross-validation), six auxiliary feature configurations, and five baseline models, complemented by representation-level analyses (PCA, SHAP) and full-AOI mapping. Five findings emerge. (1) Same-city cross-year training is optimal under both protocols (median spatial F1 = 0.616, R^2 = 0.466); temporal expansion outperforms cross-city transfer, indicating city-scale representational drift. (2) Regression R^2 is driven primarily by zero/non-zero boundary discrimination: positive-pixel R^2 is consistently negative across all cities, revealing limited capacity to model intra-pixel density gradients at 10 m. (3) PC36 is consistently top-ranked across tasks; classification saturates at k = 32 while regression remains unsaturated at k = 64. (4) POI features yield the largest density gain (Delta R^2 = +0.064). (5) For six cities meeting dual-task usability thresholds, full-AOI inference across 2017-2024 preserves slum cluster structure (mean SSIM = 0.926). The study delineates the capabilities and complementarity needs of foundation-model embeddings for slum monitoring.
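Finding (2) rests on the fact that R^2 can be positive over all pixels yet negative over positive pixels alone. A toy sketch of that effect, using the standard definition R^2 = 1 - SS_res / SS_tot; the density values below are invented for illustration and are not taken from the study:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical densities: four empty pixels and four slum pixels.
y_true = [0.0, 0.0, 0.0, 0.0, 0.2, 0.4, 0.6, 0.8]
# The model separates zero from non-zero pixels but gets the gradient wrong.
y_pred = [0.0, 0.0, 0.0, 0.0, 0.7, 0.6, 0.4, 0.3]

r2_all = r_squared(y_true, y_pred)
r2_positive = r_squared(y_true[4:], y_pred[4:])
print(r2_all, r2_positive)
```

A negative value on the positive subset means the predictions fit those pixels worse than their own mean would, even though the zero/non-zero split keeps the overall score above zero.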

2605.10027 2026-05-12 cs.CL cs.AI

Speech-based Psychological Crisis Assessment using LLMs

Terumi Chiba, Yang Luo, Ziyun Cui, Yongsheng Tong, Chao Zhang

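AI summary This paper proposes a large language model (LLM)-based framework for speech-based psychological crisis assessment, which automatically identifies the crisis level of a call to improve the quality and efficiency of psychological hotline services. To better capture emotional signals in spoken conversations, it introduces a paralinguistic injection method that inserts identified non-verbal emotional cues into the speech transcript, making the model more sensitive to subtle emotion in speech. It also proposes a reasoning-enhanced training strategy that generates diagnostic reasoning chains as an auxiliary task to improve classification. Combined with data augmentation, the system achieves high macro F1 and accuracy on the three-class task.

Comments 5 pages, 5 figures

Details
English abstract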

Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.
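For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so each of the three crisis levels counts equally regardless of how often it occurs. A minimal sketch; the class names `low`, `mid`, and `high` and the toy labels are hypothetical, since the abstract does not name the three classes:

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-class example (not data from the paper).
y_true = ["low", "low", "mid", "mid", "high", "high"]
y_pred = ["low", "mid", "mid", "mid", "high", "low"]
print(macro_f1(y_true, y_pred, ["low", "mid", "high"]))
```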

2605.10026 2026-05-12 cs.CV

MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving

Xiaohu Lu, Hamed Khatounabadi, Hayder Radha

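AI summary As autonomous driving advances, annotated multi-modality datasets are increasingly abundant, making it possible to adapt 3D object detectors to new environments without manual annotation. Traditional domain adaptation methods, however, usually target a single source or a single modality and struggle in multi-source, multi-modality settings. This paper proposes a multi-source, multi-modality unsupervised domain adaptation framework for 3D object detection in autonomous driving. By introducing hierarchical spatially-conditioned domain classifiers and a prototype graph weighted fusion strategy, it aligns features across sources and modalities, and experiments show that it outperforms state-of-the-art methods on several mainstream datasets.

Details
English abstract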

With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.

2605.10025 2026-05-12 cs.CL cs.AI

Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

Yuna Haseyama, Tomoki Ito, Hiroki Sakaji, Itsuki Noda

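AI summary In high-stakes domains such as healthcare, the reliability of the clinical insights that large language models (LLMs) generate is critical. This paper proposes a tag-based few-shot example selection method for guiding LLMs to generate background/causal factors and preventive measures from medical incident descriptions. Experiments on the Japanese Medical Incident Dataset (JMID) show that tag-based example selection outperforms random sampling and similarity-based selection in generation precision and stability, offering an effective strategy for more reliable clinical LLM applications.

Details
English abstract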

In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from details of the medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags--some include descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies--random sampling, cosine similarity-based selection, and our proposed tag-based method--using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
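The abstract does not specify how tags are matched, so the following is only a sketch of one plausible form of tag-based example selection: rank the candidate reports by Jaccard overlap between their tags and the query report's tags, then keep the top k. The Jaccard criterion, the function names, the record layout, and the tag `inpatient` are assumptions; `medications` and `blood transfusion therapy` are the example tags quoted in the abstract.

```python
def jaccard(tags_a, tags_b):
    """Jaccard overlap between two tag collections (0.0 when both are empty)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_examples(query_tags, pool, k):
    """Return the k pool entries whose tags overlap most with the query tags.

    Assumption: overlap is scored with Jaccard similarity. Python's sort is
    stable, so entries with equal scores keep their original pool order.
    """
    ranked = sorted(pool, key=lambda ex: jaccard(query_tags, ex["tags"]), reverse=True)
    return ranked[:k]


# Hypothetical candidate pool of annotated incident reports.
pool = [
    {"id": 1, "tags": ["medications"]},
    {"id": 2, "tags": ["blood transfusion therapy"]},
    {"id": 3, "tags": ["medications", "inpatient"]},
]
chosen = select_examples(["medications", "inpatient"], pool, k=2)
print([ex["id"] for ex in chosen])
```

The selected reports would then be formatted as few-shot examples in the prompt; that formatting step is outside this sketch.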

2605.10020 2026-05-12 cs.LG

TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation

Wilson Wongso, Lihuan Li, Arian Prabowo, Xiachong Lin, Baiyu Chen, Hao Xue, Flora D. Salim

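AI summary Generating high-fidelity synthetic GPS trajectories is increasingly important in transportation, urban planning, and scenario simulation, but existing methods trade generation efficiency against faithfulness to road network topology. This paper proposes TrajDLM, a topology-aware trajectory generation framework based on block diffusion language models. By modeling trajectories as sequences of discrete road segments and combining topology-aware embeddings with constrained sampling, it substantially speeds up generation while keeping trajectories realistic. Experiments show strong local-similarity performance on multiple city-scale datasets, generation up to 2.8x faster than existing methods, and zero-shot transfer across domains.

Details
English abstract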

Generating high-fidelity synthetic GPS trajectories is increasingly important for applications in transportation, urban planning, and what-if scenario simulation, especially as privacy concerns limit access to real-world mobility data. Existing trajectory generation models face a trade-off between efficiency and faithfulness to road network topology: continuous-space methods enable fast generation but ignore the road network, while topology-aware approaches rely on search-based autoregressive decoding that limits generation speed. We propose TrajDLM, a topology-aware trajectory generation framework based on block diffusion language models that bridges this gap. TrajDLM models trajectories as sequences of discrete road segments, combining a block diffusion backbone for efficient denoising, topology-aware embeddings from a road network encoder, and topology-constrained sampling to ensure coherent and realistic trajectories. Across three city-scale datasets, TrajDLM achieves strong performance on fine-grained local similarity metrics while being up to $2.8\times$ faster than prior work, and demonstrates strong zero-shot transfer across domains, including unseen transportation modes. These results highlight the effectiveness of block-wise discrete diffusion as a scalable approach to accurate and efficient trajectory generation. Our code is available at https://github.com/cruiseresearchgroup/TrajDLM/

2605.10019 2026-05-12 cs.LG cs.AI cs.CC stat.ML

The two clocks and the innovation window: When and how generative models learn rules

Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu

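AI summary This paper studies a fundamental tension that generative models face when learning rules from finite data: their training objective pulls them toward the empirical distribution rather than the target distribution. Introducing two key timescales, the rule-learning time $\tau_{\mathrm{rule}}$ and the memorization time $\tau_{\mathrm{mem}}$, it analyzes when models begin generating rule-valid samples and when they begin reproducing training data. Both timescales depend on factors such as rule complexity, model capacity, and dataset size. The paper defines the "innovation window" as the period in which the model genuinely innovates, and reveals what is shared and what differs across architectures in how generative models learn rules.

Comments 48 pages, 28 figures. Earlier versions were presented as an oral presentation at the NeurIPS 2025 SPIGM workshop: https://openreview.net/forum?id=LjqX8OhPPi

Details
English abstract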

Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $\tau_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $\tau_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $\tau_{\mathrm{rule}}$ and $\tau_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $\tau_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $\tau_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \emph{innovation window} as the interval $[\tau_{\mathrm{rule}}, \tau_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $\tau_{\mathrm{rule}} \geq \tau_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $\tau_{\mathrm{rule}}$, while training samples' basins begin to dominate around $\tau_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.
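The window definition above translates directly into a small helper. This sketch only encodes the interval and the vanishing condition stated in the abstract; the function names and the step counts are hypothetical, and it does not model how either clock is measured during training.

```python
def innovation_window(tau_rule, tau_mem):
    """Return the interval (tau_rule, tau_mem), or None when the window vanishes.

    Following the abstract, the window is treated as vanished whenever
    tau_rule >= tau_mem.
    """
    if tau_rule >= tau_mem:
        return None
    return (tau_rule, tau_mem)


def window_width(tau_rule, tau_mem):
    """Number of training steps inside the window (0 if it has vanished)."""
    window = innovation_window(tau_rule, tau_mem)
    return 0 if window is None else window[1] - window[0]


# Hypothetical step counts: rules are learned at step 2,000 and
# memorization begins at step 15,000.
print(innovation_window(2_000, 15_000))
print(window_width(2_000, 15_000))
# A harder rule that is learned only after memorization has set in.
print(innovation_window(20_000, 15_000))
```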

2605.10018 2026-05-12 cs.LG

The Value of Mechanistic Priors in Sequential Decision Making

Itai Shufaro, Gal Benor, Shie Mannor

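AI summary This paper studies the value of mechanistic priors in sequential decision making. It proposes mechanistic mutual information, a measure of how informative a mechanistic model is, and analyzes theoretical performance in both the asymptotic and small-sample (burn-in) regimes. It proves that mechanistic priors can substantially reduce sample complexity, with especially high sample efficiency in the small-sample regime. 5-fluorouracil dosing simulations motivated by published pharmacokinetic data validate hybrid mechanistic priors and, by contrast, expose the shortcomings of large language model priors, underscoring the importance of physically grounded priors in safety-critical applications.

Details
English abstract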

Hybrid mechanistic models, which combine physical priors with learned residuals, promise to reduce the data required for good decisions, but there is no computable criterion to test this promise. We characterize the value of mechanistic priors in sequential decision-making within both asymptotic and burn-in regimes. To formalize this, we introduce the mechanistic information of a model -- the mutual information between the model's recommended policy $\hat{\pi}$ and the true optimal policy $\pi^*$ -- quantified via an occupancy-weighted bias $B_\mu$. In the asymptotic regime (large $N$), matched bounds reveal that Bayesian regret scales with the residual entropy $H_{\mathrm{mech}}$, delivering a theoretical sample complexity reduction of $H(\mu)/H_{\mathrm{mech}}$ compared to an uninformed baseline. Furthermore, we provide a model certificate to determine empirical sample efficiency. Complementarily, in the clinically relevant burn-in regime (small $N$), we establish a lower bound on the penalty incurred by confidently wrong priors. We demonstrate both the asymptotic and burn-in bounds across 5-fluorouracil (5-FU) dosing simulations motivated by published FOLFOX pharmacokinetic data, where a hybrid prior yields large sample-efficiency gains in the burn-in regime. Finally, we contrast these grounded models with LLM priors, demonstrating that LLMs can suffer severe losses in mechanistic information, thereby motivating the exclusive use of physically-grounded priors for safety-critical applications.