arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14462 2026-05-15 cs.CV

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

Yubo Zhao, Yujin Chai, Yunao Dong, Chengfeng Zhao, Zijiao Zeng, Yuan Liu, Chi-Keung Tang

Affiliations: The Hong Kong University of Science and Technology; Tencent IEG

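AI summary: This paper studies how to reconstruct physically plausible 4D human-object interaction (HOI) animation from monocular video, in support of applications such as 3D content creation and simulation-based learning. To address the shortcomings of existing methods in interaction consistency, contact stability, and physical plausibility, the authors propose the HA-HOI framework, which adopts a "human-first, object-follow" strategy: human motion serves as the interaction anchor, the object's trajectory is reconstructed and refined relative to it, and the result is projected into a physics simulation for validation. On multiple benchmarks and real-world videos, the method significantly improves human-object alignment, contact consistency, and simulation readiness, moving interaction animation from visually plausible toward physically plausible.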

Abstract

Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce $\textbf{HA-HOI}$, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a $\textit{human-first, object-follow}$ formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, $\textbf{HA-HOI}$ improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/

2605.14461 2026-05-15 cs.CV

ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

Ledun Zhang, Yatu Ji, Xufei Zhuang, Xinying Yao

Affiliations: Inner Mongolia University of Technology

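AI summary: ClickRemoval is an open-source interactive tool built on pretrained Stable Diffusion models that tackles object removal with diffusion models. The user only needs to click to locate the target object and restore the background; no hand-drawn mask or text description is required. By modulating self-attention during denoising, ClickRemoval removes objects efficiently and naturally in complex scenes, and experiments show that it performs strongly on quantitative metrics and in user studies.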

Comments 5 pages, 4 figures. Open-source software paper

Abstract

Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.

2605.14458 2026-05-15 cs.AI

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

Yeo Jeong Park, Hyemi Jang, Minseo Choi, Jongsun Lee, Jooyoung Choi, Yongkweon Jeon

Affiliations: Samsung Research

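AI summary: OmniDrop is a layer-wise token pruning method for omni-modal large language models, designed to address the token explosion caused by high-resolution audio and video inputs. It prunes tokens progressively across the decoder layers rather than at the input embedding layer, which better preserves omni-modal information fusion, and it uses the text query to guide pruning for better task adaptivity. Experiments show that OmniDrop performs strongly on multiple benchmarks while significantly reducing prefill latency and memory consumption.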

Abstract

Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input-level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.

2605.14455 2026-05-15 cs.AI cs.LG

Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact

Chandan Rajah, Neha Sengupta, Federico Castanedo, Robin Mills, Amit Bahree, Ramesh Krishnan Muthukrishnan, Larry Murray

Affiliations: Inception G42

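AI summary: This paper proposes a composite metric called the Intelligence Impact Quotient (IIQ) to quantify how deeply AI systems are integrated into organizational workflows and what impact they have. IIQ combines factors such as a novelty-weighted token stock, usage frequency, recency of use, organizational leverage, task complexity, and autonomy to produce a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index that can be compared across users and units. The framework is intended as a trackable measurement tool for AI deployment in workflows, not as a direct measure of model capability or a substitute for causal productivity evaluation.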

Abstract

The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.

2605.14454 2026-05-15 cs.LG cs.CL cs.CR

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister, Long T. Le

Affiliations: Google Cloud AI Research

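AI summary: As AI agents expand from chat interfaces to systems that handle private data, call tools, and execute multi-step workflows, guardrails become the last line of defense against harm in real deployments. Conventional guardrails struggle with complex, shifting real-world contexts; LiSA (Lifelong Safety Adaptation) is a conservative policy induction framework that improves the adaptability of a fixed base guardrail through structured memory. LiSA turns occasional failures into reusable policy abstractions and combines conflict-aware local rules with evidence-aware confidence gating, improving safety and generalization under sparse and noisy feedback.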

Comments 27 pages, 3 figures

Abstract

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.

2605.14449 2026-05-15 cs.LG cs.AI cs.CL

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

Siyang Yao, Erhu Feng, Yubin Xia

Affiliations: Shanghai Jiao Tong University

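AI summary: This paper studies hallucination detection in large language models and proposes QAOD, a single-pass framework that projects the question-aligned part out of the answer representation and keeps the question-orthogonal component to suppress domain-dependent variation. Combining diversity-penalized Fisher scoring with discriminative neuron selection, it designs two complementary probing strategies, one for in-domain detection and one for cross-domain generalization. It performs well on multiple benchmarks and, in cross-domain settings in particular, significantly outperforms existing methods.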

Abstract

Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
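
The projection step named in this abstract can be illustrated with a minimal sketch. It assumes the question and the answer are each summarized as a single vector; how QAOD pools hidden states is not stated here. The function name `question_orthogonal` and the toy vectors are hypothetical, and layer and neuron selection are left out.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def question_orthogonal(answer, question, eps=1e-12):
    """Remove from `answer` its component along `question`.

    Hypothetical helper: a plain vector projection, not the authors' code.
    """
    q_norm_sq = dot(question, question)
    if q_norm_sq < eps:
        # A (near-)zero question vector defines no direction to remove.
        return list(answer)
    scale = dot(answer, question) / q_norm_sq
    return [a - scale * q for a, q in zip(answer, question)]


# Toy 3-dimensional "representations" (illustrative values only).
q = [1.0, 0.0, 1.0]
a = [2.0, 3.0, 0.0]
a_perp = question_orthogonal(a, q)
```

By construction, `a_perp` has zero inner product with `q`, so a probe trained on it cannot use the question-aligned direction.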

2605.14448 2026-05-15 cs.CV cs.CL cs.IR

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

Affiliations: Alibaba Group

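AI summary: This work proposes Think When Needed (TWN), a unified multimodal embedding framework that uses adaptive reasoning to improve both the quality and the efficiency of multimodal embeddings. TWN uses a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, reducing parameter overhead and avoiding gradient conflicts. A self-supervised routing gate decides per input whether to generate chain-of-thought (CoT) reasoning, which avoids the performance degradation caused by redundant reasoning and substantially lowers inference cost. Experiments show that TWN achieves state-of-the-art embedding quality on the 78 tasks of MMEB-V2 while being more parameter-efficient and inference-efficient than existing generative methods.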

Comments 30 pages, preprint

Abstract

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

2605.14445 2026-05-15 cs.LG

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Runyuan He, Qiuyang Mang, Shang Zhou, Kaiyuan Liu, Hanchen Li, Huanzhi Mao, Qizheng Zhang, Zerui Li, Bo Peng, Lufeng Cheng, Tianfu Fu, Yichuan Wang, Wenhao Chai, Jingbo Shang, Alex Dimakis, Joseph E. Gonzalez, Alvin Cheung

Affiliations: University of Washington; Stanford University; Princeton University; Massachusetts Institute of Technology; Bespoke Labs

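AI summary: This paper proposes FrontierSmith, a system for synthesizing open-ended coding problems at scale to improve large language models on open-ended coding tasks. The system iteratively evolves open-ended variants from existing closed-ended coding tasks (such as competitive programming problems) and uses a quantitative metric to select problems that elicit diverse solution ideas. Experiments show that training on the synthesized data significantly improves model performance on multiple open-ended coding benchmarks.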

Abstract

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.

2605.14443 2026-05-15 cs.AI cs.LG cs.MA

Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

Krishna Sayana, Ketan Todi, Ambarish Jash

Affiliations: Google Research

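AI summary: This work addresses prompt engineering for frozen "black-box" large language models (LLMs) and proposes a reinforcement learning framework that trains a learned prompting policy through iterative distillation of experience. Using a contrastive experience buffer that pairs scalar rewards with dense textual critiques, a lightweight prompter model is optimized to maximize task reward, amortizing iterative prompt refinement into single-shot policy weights. Experiments show significant gains on multi-step reasoning and tool-use tasks, with higher sample efficiency than existing evolutionary baselines.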

Comments 10 pages and reference, appendix

Abstract

The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.

2605.14440 2026-05-15 cs.AI cs.FL cs.LO

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Debraj Chakraborty, Anirban Majumdar, Prince Mathew, Sayan Mukherjee, Jean-François Raskin

Affiliations: Nanyang Technological University, Singapore; Tata Institute of Fundamental Research, Mumbai, India; Université Libre de Bruxelles, Brussels, Belgium; IITB Trust Lab, Department of CSE, IIT Bombay, Mumbai, India

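AI summary: This paper studies how to synthesize policies with formal guarantees for partially observable Markov decision processes (POMDPs). Because sampling-based methods lack formal correctness guarantees and formal synthesis methods scale poorly, it proposes a framework that combines sampling, automata learning, and model checking. Inspired by Angluin's $L^*$ algorithm, the approach uses sampling to answer membership queries and model checking to answer equivalence queries; it can synthesize a finite-state controller when the sampling-induced policy is regular, and the framework is proved relatively complete. Experiments show that the method performs well on threshold-safety problems that existing tools struggle with.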

Comments Paper accepted at 38th International Conference on Computer Aided Verification (CAV 2026), Lisbon, Portugal, July 2026

Abstract

Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

2605.14438 2026-05-15 cs.AI

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

Juntong Wu, Jialiang Cheng, Qishen Yin, Yue Dai, Yuliang Yan, Fuyu Lv, Ou Dan, Li Yuan

Affiliations: Shenzhen Graduate School, Peking University

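AI summary: BEAM (Binary Expert Activation Masking) is a new dynamic routing method that aims to improve the inference efficiency of Mixture-of-Experts (MoE) architectures in large language models. It uses trainable binary masks to select experts dynamically for each token and, with a straight-through estimator and an auxiliary regularization loss, induces expert sparsity through end-to-end training while preserving model performance. Experiments show that BEAM retains over 98% of the original model's performance while substantially reducing MoE-layer computation and improving decoding speed and throughput, making it an efficient, easy-to-integrate practical solution.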

Comments 22 pages, 12 figures

Abstract

Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5$\times$ faster decoding and 1.4$\times$ higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.

2605.14427 2026-05-15 cs.CL cs.SD

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Sunil Kumar Kopparapu

Affiliations: TCS Research

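AI summary: This paper proposes a calculus-based framework for determining the vocabulary size in end-to-end automatic speech recognition (ASR) systems. The method fits a curve to the training data and applies the first and second derivative tests to formally estimate the vocabulary size, a key hyperparameter. Experiments on the standard Librispeech corpus show that the approach is effective and that a well-chosen vocabulary size improves ASR performance. The main contribution is a systematic method for determining the vocabulary size of end-to-end ASR systems.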

Comments 8 pages, is an extension of the paper S. K. Kopparapu and A. Panda, A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system, in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024

Abstract

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility of our approach by applying it to the standard Librispeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves the performance of the ASR system. The main contribution of this paper is in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
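
As a worked illustration of the first and second derivative tests mentioned above, consider a purely hypothetical fitted cost curve $C(v) = a/v + b\,v$ over the vocabulary size $v > 0$, with $a, b > 0$. The abstract does not state which curve the paper fits, so this functional form and the symbols $a$, $b$, and $v^*$ are assumptions chosen only to show the mechanics.

```latex
C(v) = \frac{a}{v} + b\,v, \qquad a, b > 0,\; v > 0
\\[4pt]
C'(v) = -\frac{a}{v^{2}} + b = 0
\;\Longrightarrow\;
v^{*} = \sqrt{\frac{a}{b}}
\\[4pt]
C''(v) = \frac{2a}{v^{3}} > 0 \ \text{for all } v > 0
\;\Longrightarrow\;
v^{*} \text{ is a minimum of } C.
```

The first derivative test locates the candidate vocabulary size, and the positive second derivative confirms that it minimizes the cost.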

2605.14423 2026-05-15 cs.LG cs.AI

Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic

Leo Muxing Wang, Pengkun Yang, Lili Su

Affiliations: Northeastern University; Tsinghua University

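AI summary: This paper studies collaborative yet personalized policy training in heterogeneous environments and proposes a single-timescale federated actor-critic framework. Agents share a common linear subspace representation while retaining personalized policy components, balancing collaborative optimization against personalization. Theoretical analysis establishes finite-time convergence with linear speedup in the number of agents, and experiments validate the method on federated reinforcement learning tasks.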

Abstract

Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-γ)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-γ)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $γ\in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated \texttt{Hopper-v5} action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.
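
The linear-speedup claim can be read off the critic rate directly. As a sketch, let $C$ stand for the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$ (treating them as one constant is a simplifying assumption), and let $\varepsilon$ be a target accuracy introduced here for illustration:

```latex
\frac{C}{(1-\gamma)^{4}\sqrt{TK}} \le \varepsilon
\quad\Longleftrightarrow\quad
T \ge \frac{C^{2}}{(1-\gamma)^{8}\,K\,\varepsilon^{2}}.
```

Under that simplification, the number of rounds needed to reach accuracy $\varepsilon$ shrinks in proportion to $1/K$. The same argument applied to the policy gradient rate gives $T \ge C^{2}/((1-\gamma)^{12} K \varepsilon^{2})$.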

2605.14422 2026-05-15 cs.LG

What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

Shuqi Gu, Yongxiang Zhao, Baoyu Jing, Kan Ren

Affiliations: School of Information Science and Technology, ShanghaiTech University, Shanghai, China; University of Illinois at Urbana-Champaign, Illinois, United States

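AI summary: This paper studies counterfactual time series forecasting under textual conditions, aiming to account for the influence of future events on forecasts and to make forecasting models more adaptable to complex, stochastic conditions. To address the fact that traditional methods overlook counterfactual scenarios and support only simple condition structures, the authors propose a comprehensive evaluation framework covering both factual and counterfactual settings, and design a text-attribution mechanism that distinguishes mutable from immutable factors to improve forecast accuracy. The framework can evaluate model performance even without ground-truth time series, which gives it practical value.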

Abstract

Time series forecasting has become increasingly critical in real-world scenarios, where future sequences are influenced not only by historical patterns but also by forthcoming events. In this context, forecasting must dynamically adapt to complex and stochastic future conditions, which introduces fundamental challenges in both forecasting and evaluation. Traditional methods typically rely on historical data or factual future conditions, while overlooking counterfactual scenarios. Furthermore, many existing approaches are restricted to simple structured conditions, limiting their ability to generalize to the real-world complexities. To address these gaps, we introduce the task of counterfactual time series forecasting with textual conditions, enabling more flexible and condition-aware forecasting. We propose a comprehensive evaluation framework that encompasses both factual and counterfactual settings, even in the absence of ground truth time series. Additionally, we present a novel text-attribution mechanism that distinguishes mutable from immutable factors, thereby improving forecast accuracy under sophisticated and stochastic textual conditions. The project page is at https://seqml.github.io/TADiff/

2605.14420 2026-05-15 cs.AI

DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping

Pengyun Zhu, Yuqi Ren, Zhen Wang, Lei Yang, Deyi Xiong

Affiliations: TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China

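AI summary: Current large language models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment, but such macro-level supervision often obscures value heterogeneity within a country and yields loose alignment. The authors therefore propose DVMap, a framework that uses multi-dimensional demographic constraints to identify groups with predictable, high-consensus value preferences, enabling fine-grained pluralistic value alignment. The method introduces a demographic archetype extraction strategy and a structured chain-of-thought mechanism, combined with Group Relative Policy Optimization, and improves generalization and robustness in cross-demographic, cross-country, and cross-value settings.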

Comments Accepted to the Main Conference of ACL 2026

Abstract

Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures intra-country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi-dimensional demographic constraints, which can identify groups with predictable, high-consensus value preference. To this end, we propose DVMap (High-Consensus Demographic-Value Mapping), a framework for fine-grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high-quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain-of-Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic-value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple-generalization benchmark (spanning cross-demographic, cross-country, and cross-value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross-demographic tests, Qwen3-8B-DVMap achieves 48.6% accuracy, surpassing the advanced open-source LLM DeepSeek-v3.2 (45.1%). The source code and dataset are available at https://github.com/EnlightenedAI/DVMap.

2605.14416 2026-05-15 cs.AI

A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

Wen Wang, Xiangchen Wu, Liang Wang, Hao Hu, Xianping Tao

Affiliations: Nanjing University

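AI summary: This paper proposes a unified knowledge-embedded reinforcement learning framework for capacitated vehicle routing problems (CVRPs). The framework builds on the route-first, cluster-second heuristic, uses dynamic programming to solve a subproblem, and adds a history-enhanced context processing module to handle the partial observability introduced by the decomposition. Experiments show that the method achieves better solution quality than existing learning-based methods across multiple CVRP variants, with a smaller gap to classical heuristics, and generalizes well.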

Abstract

The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
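
The cluster-second step of a route-first, cluster-second scheme is classically solved by a dynamic program that cuts one giant tour into capacity-feasible routes. The sketch below shows that classical split procedure for the plain CVRP; it is not claimed to be the authors' implementation. The function `split_tour`, the depot index 0, and the toy instance on a line are all assumptions made for illustration.

```python
import math


def split_tour(tour, demand, capacity, dist):
    """Optimally cut a fixed customer order into routes (depot is node 0).

    best[j] is the cheapest cost of serving the first j customers of `tour`.
    """
    n = len(tour)
    best = [math.inf] * (n + 1)
    pred = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == math.inf:
            continue
        load, length = 0.0, 0.0
        for j in range(i, n):
            load += demand[tour[j]]
            if load > capacity:
                break  # extending this route further stays infeasible
            # Grow the route tour[i..j]: depot leg first, then customer legs.
            length += dist(0, tour[j]) if j == i else dist(tour[j - 1], tour[j])
            total = best[i] + length + dist(tour[j], 0)
            if total < best[j + 1]:
                best[j + 1], pred[j + 1] = total, i
    if best[n] == math.inf:
        raise ValueError("some customer demand exceeds vehicle capacity")
    routes, j = [], n
    while j > 0:  # walk predecessors back to recover the cuts
        routes.append(tour[pred[j]:j])
        j = pred[j]
    routes.reverse()
    return best[n], routes


# Toy instance: depot at x=0, four unit-demand customers on a line.
pos = {0: 0.0, 1: 1.0, 2: 2.0, 3: -1.0, 4: -2.0}
demand = {1: 1, 2: 1, 3: 1, 4: 1}
cost, routes = split_tour([1, 2, 3, 4], demand, capacity=2,
                          dist=lambda a, b: abs(pos[a] - pos[b]))
```

Because the customer order is fixed, the program only has to choose where to cut, which takes at most quadratic time in the tour length. A learned constructive solver can then be scored by the cost this split returns for the tour it produces.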

2605.14413 2026-05-15 cs.LG cs.AI

MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse

Donghwan Kim, Hyunsoo Yoon

Affiliations: Department of Industrial Engineering, Yonsei University

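AI summary: This paper proposes MahaVar, a new out-of-distribution (OOD) detection method based on the variance of class-wise Mahalanobis distances. The authors observe that for in-distribution samples the class-wise Mahalanobis distances show a pronounced sharp-minimum structure, giving high variance across classes, whereas OOD samples show a weaker structure and lower variance. Building on this observation and on Neural Collapse theory, MahaVar adds a class-wise distance variance term to the conventional Mahalanobis distance, improving OOD detection and achieving state-of-the-art results on multiple benchmark datasets.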

Comments 29 pages, 8 figures

Abstract

Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID) samples, class-wise Mahalanobis distances exhibit a pronounced sharp minimum structure, where the distance to the nearest class is small while distances to all other classes remain large, resulting in high variance across classes. In contrast, OOD samples tend to exhibit a less pronounced sharp minimum structure, producing comparatively lower variance across classes. We further provide a theoretical analysis grounding this observation in Neural Collapse geometry: under relaxed Neural Collapse assumptions on within-class compactness and inter-class separation, ID samples are shown to structurally exhibit high class-wise distance variance, offering a theoretical basis for its use as an OOD score. Motivated by this observation and its theoretical backing, we propose MahaVar, a simple and effective post-hoc OOD detector that augments the Mahalanobis distance with a class-wise distance variance term. Following the OpenOOD v1.5 benchmark protocol, MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet, with consistent improvements in both AUROC and FPR@95 over existing Mahalanobis-based methods across all benchmarks.
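
The observation about class-wise distance variance can be made concrete with a small sketch. The abstract does not give the exact scoring formula, so the combination rule below (negative nearest-class distance plus a weighted variance term), the `weight` parameter, the diagonal shared covariance, and the function names are all hypothetical.

```python
from statistics import pvariance


def class_distances(x, means, var):
    """Squared Mahalanobis distance from x to each class mean.

    Assumes a shared diagonal covariance `var` (a simplification).
    """
    return [
        sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
        for mu in means
    ]


def mahavar_like_score(x, means, var, weight=1.0):
    """Higher means more in-distribution (illustrative rule, not the paper's)."""
    d = class_distances(x, means, var)
    return -min(d) + weight * pvariance(d)


# Three well-separated toy classes in 2-D.
means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
var = (1.0, 1.0)
id_score = mahavar_like_score((0.5, 0.0), means, var)   # near class 0
ood_score = mahavar_like_score((3.3, 3.3), means, var)  # between classes
```

In this toy setup, the point near a class mean has one small distance and two large ones, so its variance term is large. The point between classes has more even distances and therefore a lower score.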

2605.14411 2026-05-15 cs.RO cs.AI

Energy-Efficient Quadruped Locomotion with Compliant Feet

Pramod Pal, Shishir Kolathaya, Ashitava Ghosal

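Affiliations * Department of Mechanical Engineering, Indian Institute of Science; Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science; School of Engineering and Applied Science, Ahmedabad University

AI summary This study examines whether compliant feet can improve a quadruped robot's locomotion efficiency while keeping its motion stable. By incorporating foot compliance into a reinforcement learning controller, the authors find that a moderate foot stiffness reduces the mechanical energy consumed per meter traveled; in experiments, feet of intermediate stiffness cut energy consumption by about 17% compared with feet that are too stiff or too soft. The result indicates that well-chosen foot compliance helps improve the energy efficiency of quadruped robots.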

Comments 29 pages, 7 figures, supplemental videos link is mentioned in the paper

Abstract

Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring the mechanical energy consumed per meter traveled. In experiments on the developed quadruped, the energy consumption with the intermediate-stiffness spring is about 17% lower than with a very stiff or a very flexible spring in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.
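
The efficiency metric named in the abstract, mechanical energy per meter traveled, can be illustrated with a minimal sketch. The abstract does not give the formula, so the definition below (absolute joint power, torque times joint velocity, summed over joints and integrated over time) is an assumption, as are the function and argument names.

```python
import numpy as np

def energy_per_meter(torques, joint_vels, dt, distance):
    """Mechanical energy per meter (J/m) from logged joint data.

    torques, joint_vels: arrays of shape (timesteps, joints), in N*m and rad/s.
    Assumed definition: sum of |torque * velocity| over joints, integrated over time.
    """
    power = np.abs(torques * joint_vels).sum(axis=1)  # W at each time step
    energy = power.sum() * dt                         # J, rectangle-rule integral
    return energy / distance
```

Evaluating the same function on rollouts recorded with each spring stiffness would give the kind of per-stiffness comparison the abstract reports.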

2605.14407 2026-05-15 cs.AI

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

Xiang Li

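Affiliations * Massachusetts General Hospital

AI summary This paper examines an often-overlooked "middle zone" of digital tasks for AI, which it calls Metis AI: tasks that can be completed entirely on a computer but that resist reliable algorithmic automation because of their institutional, social, and normative complexity. It proposes five structural characteristics of Metis AI and argues that the right response is a human-led, AI-supported "centaur architecture" rather than simply more automation.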

Abstract

The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.

2605.14406 2026-05-15 cs.LG cs.CV

GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan

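Affiliations * Department of Electrical and Computer Engineering, Rice University; Center for Cardiovascular Computational and Precision Health, Department of Cardiology, DeBakey Heart and Vascular Center, Houston Methodist

AI summary GeoViSTA is a multimodal model that combines remote sensing imagery with tabular data to learn unified geospatial representations. It uses bilateral cross-attention to exchange spatial and semantic information between the image and tabular modalities, and a geography-aware attention mechanism to align image patches with irregular census tracts. Trained with a self-supervised joint masked-reconstruction objective, GeoViSTA improves prediction on high-impact tasks such as disease-specific mortality and fire hazard frequency, demonstrating its capacity for holistic geospatial inference.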

Abstract

Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.

2605.14405 2026-05-15 cs.LG math.DS

Watch your neighbors: Training statistically accurate chaotic systems with local phase space information

Joon-Hyuk Ko, Andrus Giraldo, Deok-Sun Lee

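Affiliations * Center for AI and Natural Sciences; School of Computational Sciences; Korea Institute for Advanced Study

AI summary This paper studies how to train statistically accurate surrogate models of chaotic systems using local phase-space information. The authors propose a framework that targets both accurate Jacobians and correct long-term statistics: it constructs a local covering of the chaotic attractor in phase space and trains the surrogate by minimizing the discrepancy between the distributions that the surrogate and the true dynamics induce on these coverings. Experiments show that the method improves Jacobian accuracy while remaining competitive with state-of-the-art statistically accurate dynamics learning methods.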

Abstract

Chaotic systems pose fundamental challenges for data-driven dynamics discovery, as small modeling errors lead to exponentially growing trajectory discrepancies. Since exact long-term prediction is unattainable, it is natural to ask what a good surrogate model for chaotic dynamics is. Prior work has largely focused either on reproducing the Jacobian of the underlying dynamics, which governs local expansion and contraction rates, or on training surrogate models that reproduce the ground-truth dynamics' long-term statistical behavior. In this work, we propose a new framework that aims to bridge these two paradigms by training surrogate dynamics models with accurate Jacobians and long-term statistical properties. Our method constructs a local covering of a chaotic attractor in phase space and analyzes the expansion and contraction of these coverings under the dynamics. The surrogate model is trained by minimizing the maximum mean discrepancy between the pushforward distributions of the coverings under the surrogate and ground-truth dynamics. Experiments show that our method significantly improves Jacobian accuracy while remaining competitive with state-of-the-art statistically accurate dynamics learning methods. Our code is fully available at https://anonymous.4open.science/r/neighborwatch.
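
The training signal described above, a maximum mean discrepancy (MMD) between pushforward distributions, can be sketched as follows. This is a generic biased MMD estimate with a Gaussian kernel, not the authors' code; the kernel choice, the bandwidth, and the function names are assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix between point sets a (n, d) and b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, bandwidth).mean()
            + gaussian_kernel(y, y, bandwidth).mean()
            - 2.0 * gaussian_kernel(x, y, bandwidth).mean())
```

To mimic the setup in the abstract, one would sample a small cloud of points around an attractor state, map the cloud through both the ground-truth step and the surrogate step, and pass the two images to `mmd2`; a surrogate whose local expansion and contraction rates differ from the truth yields a larger value.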

2605.14404 2026-05-15 cs.CL

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

Kyomin Hwang, Hyeonjin Kim, Sangyeon Cho, Nojun Kwak

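Affiliations * GSCST, Seoul National University; AIIS, Seoul National University; Department of Artificial Intelligence, Chung-Ang University; Korean Surgical Researcher Foundation, Republic of Korea

AI summary As large language models are widely deployed in commercial services, the privacy leakage they may cause is a growing concern. To address the shortcomings of existing evaluation for Multilingual Machine Unlearning (MMU), this paper proposes two new metrics, the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS), which measure how effectively and how consistently information is removed across languages. The authors evaluate a range of unlearning methods with these metrics, reveal phenomena specific to multilingual unlearning, and offer a new perspective on its evaluation.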

Abstract

While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.

2605.14403 2026-05-15 cs.CV

DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making

Yize Liu, Siyuan Yan, Ming Hu, Lie Ju, Xieji Li, Feilong Tang, Wei Feng, Zongyuan Ge

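Affiliations * AIM for Health Lab, Faculty of Information Technology, Monash University, Melbourne, Australia; Faculty of Information Technology, Monash University, Melbourne, Australia; University College London, Institute of Ophthalmology, London, United Kingdom

AI summary DermAgent is a self-reflective agentic system for dermatological image analysis that targets the insufficient domain knowledge and hallucinations of existing multimodal large language models in skin disease diagnosis. It integrates seven specialized vision and language modules within a Plan-Execute-Reflect framework to produce traceable diagnostic reasoning, combining multi-tool reasoning with external evidence retrieval to improve diagnostic accuracy and reliability. Experiments show that DermAgent performs strongly on multiple dermatology benchmarks and outperforms existing state-of-the-art models.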

Comments MICCAI 2026 early acceptance

Abstract

Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. Third, to further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at https://github.com/YizeezLiu/DermAgent.

2605.14399 2026-05-15 cs.CV cs.GR

SceneForge: Structured World Supervision from 3D Interventions

Jizhizi Li, Jiayang Ao, Danny Wicks, Petru-Daniel Tudosiu

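Affiliations * Canva Research

AI summary SceneForge is an intervention-driven framework built on editable 3D world states that generates structured supervision which stays consistent under scene edits, viewpoint changes, and scene-level interventions. It applies explicit interventions (such as object removal or camera variation) and propagates their effects on scene structure and physical properties, producing aligned outputs that include counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections. Experiments show that SceneForge supervision improves object removal and scene removal performance, providing a scalable basis for intervention-consistent multimodal learning.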

Abstract

Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.

2605.14396 2026-05-15 cs.CV cs.CR cs.LG cs.RO

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

Chenyi Wang, Ruoyu Song, Raymond Muller, Jean-Philippe Monteuuis, Jonathan Petit, Z. Berkay Celik, Ryan Gerdes, Ming F. Li

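Affiliations * University of Arizona; Purdue University; Lawrence Livermore National Laboratory

AI summary Autonomous vehicles rely on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings, safety-critical road elements that directly affect motion planning. This paper presents MIRAGE, a framework that uses conditional diffusion models to systematically discover semantic attacks, in the form of plausible environmental variation such as shadows or wet roads, that bypass adversarial defenses and degrade map predictions. Experiments show that the attacks remain potent under several defenses and that the generated scenes are judged realistic 80-84% of the time, far more often than AdvPatch scenes (0-9%).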

Abstract

Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g., shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes that neighbor the ground truth and share its road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, where MIRAGE is the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism: MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), while AdvPatch passes only 0--9% of the time. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

2605.14393 2026-05-15 cs.CV

Analogical Trajectory Transfer

Junho Kim, Eun Sun Lee, Gwangtak Bae, Seunggu Kang, Young Min Kim

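Affiliations * Dept. of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in Artificial Intelligence and INMC, Seoul National University

AI summary This paper studies analogical trajectory transfer: translating a motion trajectory from one 3D environment into another that is semantically similar but laid out differently, which would give machines a capacity for analogical spatial reasoning. To handle the collisions and geometric distortions caused by differences in object placement, scale, and layout between scenes, the authors propose a method based on scene clustering and hierarchical map prediction that decomposes the problem and merges the sub-solutions into semantically consistent, spatially coherent transfers. The method requires no training, runs fast, and outperforms baselines based on LLMs and scene graph matching across several applications.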

Abstract

We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to involves a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, attains a fast runtime of around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, indicating the broad applicability of our work to AR/VR and robotics.

2605.14392 2026-05-15 cs.AI

Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi

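Affiliations * Tencent HY LLM

AI summary This work proposes a self-evolving reinforcement learning approach based on verifiable environment synthesis, in which a language model not only generates problems but also builds the environments that train it. The core idea is to generate executable environment objects that sample instances, compute reference solutions, and score responses, while ensuring that each environment exhibits a stable solve-verify asymmetry so that the reward signal stays informative. The EvoEnv framework validates the approach with benchmark gains, suggesting that self-improvement depends on constructing environments whose difficulty stays beyond the model's own reach rather than on simply producing more synthetic data.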

Comments Tech report, work in progress

Abstract

We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve--verify asymmetry, meaning the model must be able to write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
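
The notion of an environment as a reusable executable object, one that samples instances, computes references, and scores responses, can be illustrated with the planted subset-sum example the abstract mentions. This is a minimal sketch under stated assumptions: the class name, method names, and instance format are hypothetical and are not taken from EvoEnv.

```python
import random

class PlantedSubsetSumEnv:
    """Hypothetical environment: hard to solve by search, easy to verify."""

    def __init__(self, n=12, max_value=10**6, seed=0):
        self.n = n
        self.max_value = max_value
        self.rng = random.Random(seed)

    def sample(self):
        """Draw numbers, plant a hidden subset, and return (instance, reference)."""
        numbers = [self.rng.randint(1, self.max_value) for _ in range(self.n)]
        k = self.rng.randint(1, self.n)
        planted = sorted(self.rng.sample(range(self.n), k))
        target = sum(numbers[i] for i in planted)
        return {"numbers": numbers, "target": target}, planted

    @staticmethod
    def score(instance, response):
        """Reward 1.0 if response is a list of distinct valid indices hitting the target."""
        numbers = instance["numbers"]
        if len(set(response)) != len(response):
            return 0.0
        if any(not (0 <= i < len(numbers)) for i in response):
            return 0.0
        total = sum(numbers[i] for i in response)
        return 1.0 if total == instance["target"] else 0.0
```

Verification here costs one pass over the response, while finding a valid subset without the planted answer requires search, which is the solve-verify gap the abstract relies on. Any subset that reaches the target is accepted, not only the planted one.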

2605.14391 2026-05-15 cs.CV

Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

Qi Mao, Zijian Wang, Zhengxue Cheng, Lingyu Zhu, Siwei Ma

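Affiliations * School of Information and Communication Engineering and the State Key Laboratory of Media Convergence and Communication, Communication University of China; School of Information Science and Electronic Engineering, Shanghai Jiao Tong University; Department of Computer Science, City University of Hong Kong; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University

AI summary This paper studies how to balance reconstruction fidelity and perceptual quality in image compression. Existing methods usually rely on a single latent representation to carry structural details, semantic information, and perceptual priors at once, which puts these roles in conflict. The authors propose MoDE, a dual-latent collaborative decoding framework that treats the scalar-quantized and vector-quantized latents as a fidelity expert and a perception expert, respectively, and coordinates them through Expert-Specific Enhancement and Cross-Expert Modulation modules. Experiments show a better fidelity-perception balance across a wide range of bitrates.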

Abstract

Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.
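
The two latent paradigms contrasted above can be illustrated with a toy sketch: scalar quantization rounds each latent value independently, while vector quantization replaces each latent vector with the index of its nearest codebook entry. This shows only the generic quantizers, not MoDE's encoder or decoder; the function names, step size, and codebook are assumptions.

```python
import numpy as np

def scalar_quantize(latent, step=1.0):
    """Round every latent value to the nearest multiple of `step`."""
    return np.round(latent / step) * step

def vector_quantize(latent, codebook):
    """Map each latent vector (row) to its nearest codebook entry.

    Returns the discrete token indices and the corresponding codebook vectors.
    """
    sq_dists = ((latent[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = sq_dists.argmin(axis=1)
    return indices, codebook[indices]
```

A smaller `step` gives the scalar branch finer fidelity at a higher rate, whereas the vector branch spends a fixed number of bits per token regardless of how close the latent is to its code, which mirrors the scalability contrast the abstract draws.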

2605.14389 2026-05-15 cs.AI cs.CL cs.LG

Nexus: An Agentic Framework for Time Series Forecasting

Sarkar Snigdha Sarathi Das, Palash Goyal, Mihir Parmar, Nanyun Peng, Vishy Tirumalashetty, Chun-Liang Li, Rui Zhang, Jinsung Yoon, Tomas Pfister

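Affiliations * Google; Pennsylvania State University

AI summary Time series forecasting involves more than numerical extrapolation; it also requires reasoning over unstructured text such as news and events. Because existing Time Series Foundation Models (TSFMs) are insensitive to textual signals and large language models (LLMs) perform unevenly across domains, this paper proposes Nexus, a multi-agent forecasting framework that decomposes prediction into stages such as identifying macro-level and micro-level temporal fluctuations and integrating contextual information, which makes forecasting more flexible. Experiments show that Nexus matches or outperforms existing state-of-the-art models on data from multiple domains while producing high-quality reasoning traces that reveal the drivers behind each forecast, indicating that real-world time series forecasting is an agentic reasoning problem that goes beyond pure sequence modeling.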

Comments 30 pages, 3 figures, 5 tables

Abstract

Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs, spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond sequence modeling alone.

2605.14380 2026-05-15 cs.CL

Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

Hoang-Thuy-Duong Vu, Quoc-Cuong Pham, Huy-Hieu Pham

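Affiliations * College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam; VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam; Center for Innovations in Health Sciences, VinUniversity, Hanoi, Vietnam

AI summary This study addresses the data scarcity and class imbalance that hamper the classification of psychological defense mechanisms (PDMs) by combining context-aware synthetic augmentation with a hybrid classification model. By integrating contextual language representations, basic clinical features, and 150 annotated defense items, the method improves classification performance on the PsyDefDetect shared task, reaching an accuracy of 58.26% and a macro-F1 of 24.62%, surpassing the DMRS Co-Pilot baseline and establishing a strong baseline for psychological defense classification in low-resource settings.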

Abstract

Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.