arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12965 2026-05-14 cs.LG cs.NA math.NA

U-HNO: A U-shaped Hybrid Neural Operator with Sparse-Point Adaptive Routing for Non-stationary PDE Dynamics

Yingzhe Ma, Xiao Yang, Yuxin Xie, Zihan Xiong, Jinliang Liu

Affiliations: University of Electronic Science and Technology of China; Peking University

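AI summary: This work targets PDE solutions in which smooth global transport coexists with sharp local features, and proposes U-HNO, a U-shaped hybrid neural operator. Its core component is Sparse-Point Adaptive Routing (SPAR), which uses a per-pixel hard mask to select between the global Fourier branch and the local multi-scale Gaussian branch, so that global and local computation are mixed differently from region to region. Experiments show that U-HNO achieves leading prediction accuracy on multiple PDE benchmarks, with the largest gains on problems with sharp local features.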

Comments 26 pages, 7 figures

Abstract

Solutions to many partial differential equations (PDEs) display coexisting smooth global transport and localized sharp features within a single trajectory: shock fronts, thin interfaces, and concentrated high-frequency content sit on top of slowly varying backgrounds. This poses a challenge for neural operators: Fourier-based architectures mix nonlocal interactions efficiently but tend to under-resolve localized non-smooth features, whereas spatially local architectures recover fine detail at the cost of long-range propagation and rollout stability. Existing hybrid operators paper over this tension with a fixed, spatially uniform fusion that forces the same trade-off everywhere. We propose U-HNO, a U-shaped hybrid neural operator whose central design is Sparse-Point Adaptive Routing (SPAR): at every spatial location, a per-pixel hard mask selects whether the global Fourier branch or the local multi-scale Gaussian branch should dominate, and the sparsity ratio is a function of the local contrast of the routing signal, so smooth and shock-aligned regions receive different mixtures of global and local computation. SPAR is embedded in a hierarchical encoder-bottleneck-decoder backbone with skip connections so that the dual branches and the gate operate at every resolution. Training combines pointwise supervision with a finite-difference H^1 gradient term and a band-wise spectral consistency regularizer. Across benchmarks spanning 1D Burgers, Kuramoto-Sivashinsky, KdV, 2D advection, Allen-Cahn, Navier-Stokes, Darcy flow, and 3D transonic compressible Navier-Stokes from PDEBench, U-HNO achieves state-of-the-art rollout accuracy on the majority of tasks in both relative L^2 and H^1 metrics, with the largest gains on problems dominated by sharp localized features. Ablations show that removing any single component substantially degrades rollout error.
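
The per-pixel hard routing described in the abstract can be pictured with a toy one-dimensional sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names `local_contrast` and `hard_route` are hypothetical, the central-difference contrast measure is a stand-in, and the fixed `threshold` replaces the contrast-dependent sparsity ratio. The two branch outputs are placeholder constants rather than real Fourier or Gaussian layers.

```python
def local_contrast(signal):
    """Absolute central difference, used here as a crude per-point contrast (assumed measure)."""
    n = len(signal)
    contrast = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        contrast.append(abs(right - left))
    return contrast


def hard_route(global_out, local_out, routing_signal, threshold):
    """Take the local branch where contrast exceeds the threshold, else the global branch."""
    contrast = local_contrast(routing_signal)
    mask = [1 if c > threshold else 0 for c in contrast]
    fused = [l if m else g for g, l, m in zip(global_out, local_out, mask)]
    return fused, mask


# Toy step profile: flat, one sharp jump, flat again.
signal = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
global_out = [10.0] * 6  # placeholder for a global (Fourier-like) branch output
local_out = [20.0] * 6   # placeholder for a local (Gaussian-like) branch output

fused, mask = hard_route(global_out, local_out, signal, threshold=0.5)
print(mask)
print(fused)
```

In this toy case only the two points adjacent to the jump are routed to the local branch, which is the qualitative behavior the abstract attributes to shock-aligned regions.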

2605.12963 2026-05-14 cs.AI

Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

James M. Mazzu

Affiliations: Digie Inc.

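AI summary: As AI systems grow more capable, safety strategies must not only reduce present risk but also sustain safety once external control can no longer reliably constrain system behavior. The paper uses control theory to analyze, at a structural level, whether externally enforced safety strategies can succeed, and reaches two main conclusions: once the system's effects exceed what bounded external control can counteract, no strategy that depends on external control can sustain AI safety; and if any viable strategies remain, they must be intrinsic and satisfy four structural requirements, such as stability of the safety objective under self-modification. The paper gives formal structure to a widely held concern about the limits of external control.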

Abstract

As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.

2605.12957 2026-05-14 cs.CV

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Hanxin Zhu, Cong Wang, Peiyan Tu, Jiayi Luo, Tianyu He, Xin Jin, Zhibo Chen

Affiliations: College of Information Science and Electronic Engineering, Zhejiang University

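AI summary: The paper proposes GTA, an image-to-3D world generation method that follows a geometry-first, appearance-second strategy to improve structural accuracy and cross-view consistency. It uses two video diffusion models in sequence: the first generates coarse geometric structure, and the second synthesizes fine-grained appearance conditioned on the predicted geometry. A random latent shuffle strategy during training improves cross-view appearance consistency, and a test-time scaling scheme improves perceptual quality. Experiments show that GTA outperforms existing methods in fidelity, visual quality, and geometric accuracy, and that it can serve as a general enhancement module for existing generation pipelines.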

Abstract

Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.

2605.12954 2026-05-14 cs.CV cs.AI

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

Xiao Yang, Yingzhe Ma, Haoxuan Yu, Zixin Li, Ning Qin

Affiliations: University of Electronic Science and Technology of China

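AI summary: AdaFocus is an efficient long-video understanding framework that addresses the difficulty of balancing temporal coverage, visual detail, and computational cost. It combines adaptive relevance-diversity sampling with a zero-cache look-back mechanism to acquire evidence from the video progressively, which reduces memory and compute overhead while keeping key visual details recoverable. Experiments on multiple benchmarks show a better efficiency-accuracy trade-off than existing methods.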

Comments 9 pages, 4 figures. Authors Xiao Yang and Yingzhe Ma contributed equally

Abstract

Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.
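
One generic way to trade relevance against diversity when picking a small frame preview is a greedy maximal-marginal-relevance (MMR) style selection. The sketch below uses that swapped-in technique purely as an illustration; it is not the AdaRD sampler from the paper, and it does not model the switch to global clustering. The function `select_frames`, the `trade_off` parameter, and the two-dimensional toy features are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def select_frames(frame_feats, query_feat, k, trade_off=0.5):
    """Greedy MMR-style pick: reward query relevance, penalize similarity to chosen frames."""
    relevance = [cosine(f, query_feat) for f in frame_feats]
    selected = []
    candidates = list(range(len(frame_feats)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            redundancy = max(
                (cosine(frame_feats[i], frame_feats[j]) for j in selected),
                default=0.0,
            )
            score = trade_off * relevance[i] - (1.0 - trade_off) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Frames 0 and 1 are duplicates; frame 2 is less relevant but different.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.2]
print(select_frames(frames, query, k=2))
```

With two slots, the selection keeps one of the duplicated frames and spends the second slot on the dissimilar frame rather than on the near-identical one.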

2605.12953 2026-05-14 cs.CV cs.AI

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu

Affiliations: School of Computing and Information Technology; Great Bay University; Hangzhou International Innovation Institute; Beihang University; Department of Computing; The Hong Kong Polytechnic University

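AI summary: The paper proposes Seg-Agent, a training-free framework for language-guided segmentation that avoids the reliance of earlier methods on large-scale training data. It builds an explicit multimodal reasoning loop in which a multimodal large language model reasons interactively in the visual domain and directly generates, selects, and refines segmentation results. The authors also introduce the Various-LangSeg benchmark to evaluate generalization across scenarios. Experiments show that Seg-Agent reaches performance comparable to state-of-the-art training-based methods without any parameter updates.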

Abstract

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

2605.12952 2026-05-14 cs.CV

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

Yongjin Cui, Xiaohui Fan

Affiliations: Zhejiang University

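AI summary: The paper analyzes Grad-ECLIP, published at ICML 2024, and argues that its intermediate-feature-based route is not new: it is equivalent to the existing attention-based route, as shown by Attention-ECLIP, an equivalent method with simpler computation. The paper further argues that Grad-ECLIP is flawed, because the interpretations it produces are not those of the original model and are misaligned with the model's performance, and it states two fundamental principles that model interpretation should follow to avoid similar errors.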

Abstract

Grad-ECLIP was published at ICML 2024 and represents a new Transformer interpretation technical route (intermediate-feature-based). First, this paper demonstrates that the intermediate-feature-based technical route is not a novel one. Based on the existing attention-based route, we have developed Attention-ECLIP, which is completely equivalent to Grad-ECLIP but with simpler computation. Through both formal derivation and experimental validation, we prove that the intermediate-feature-based route represented by Grad-ECLIP is actually an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed. The model interpretation results obtained by Grad-ECLIP are not those of the original model, and the interpretation results are misaligned with the model's performance. We analyze the causes of Grad-ECLIP's flaws and propose, or rather, explicitly emphasize two fundamental principles that model interpretation should adhere to in order to avoid similar errors.

2605.12945 2026-05-14 cs.LG

Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model

Hongmin Li

Affiliations: School of Life Science and Technology, Institute of Science Tokyo; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences

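AI summary: The study separates shortcut features from cross-family out-of-distribution (OOD) failure in a minimal model. Using a binary classification model with one invariant coordinate and one family-dependent shortcut coordinate, it shows that in the deterministic regime positive shortcut correlation pulls empirical risk minimization (ERM) toward the shortcut feature, but ridge regularization keeps the classifier dominated by the invariant feature and prevents deterministic OOD failure. When the invariant coordinate is noisy, the model switches to the shortcut rule once the training shortcut signal exceeds the invariant signal, and whether this causes failure depends on the held-out family. The model distinguishes shortcut attraction, shortcut-rule transition, and cross-family OOD failure.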

Comments 14 pages, 3 figures

Abstract

Shortcut features are often invoked to explain out-of-distribution (OOD) failure, but training correlation, learned shortcut use, and test-time failure need not coincide. We study a minimal binary model with one invariant coordinate and one family-dependent shortcut coordinate. In the deterministic regime, positive average shortcut correlation pulls logistic ERM toward positive shortcut weight, but ridge regularization keeps the classifier invariant-dominated and prevents deterministic OOD failure. When the invariant coordinate is noisy, ridge-logistic ERM switches to the shortcut rule once the training shortcut signal exceeds the invariant signal. Whether that transition causes failure depends on the held-out family: weaker shortcut correlation yields positive excess risk, and sign-flipped families yield above-chance error. Synthetic checks match these analytic regimes and show that the same training-side transition can have different held-out consequences. The model separates shortcut attraction, shortcut-rule transition, and cross-family OOD failure.
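
The claim that one training-side transition can have different held-out consequences can be checked with simple arithmetic. The sketch below is a simplification under stated assumptions, not the paper's model: each coordinate is summarized by the probability that its sign agrees with the label, the learner is assumed to follow whichever coordinate agrees more often in training, and excess risk is measured against the better of the two single-coordinate rules. The function names and the numbers are hypothetical.

```python
def rule_error(agreement):
    """Error of predicting the label from a coordinate that matches it with this probability."""
    return 1.0 - agreement


def held_out_report(inv_agree, train_short_agree, test_short_agree):
    """Compare the rule picked in training against the best single-coordinate rule at test time."""
    # Assumed switching condition: shortcut wins once it agrees with the label more often.
    uses_shortcut = train_short_agree > inv_agree
    chosen_agree = test_short_agree if uses_shortcut else inv_agree
    test_error = rule_error(chosen_agree)
    best_error = min(rule_error(inv_agree), rule_error(test_short_agree))
    return {
        "uses_shortcut": uses_shortcut,
        "test_error": test_error,
        "excess_risk": test_error - best_error,
    }


# Noisy invariant coordinate (80% agreement), strong training shortcut (90% agreement).
same_family = held_out_report(0.8, 0.9, test_short_agree=0.9)
weaker_family = held_out_report(0.8, 0.9, test_short_agree=0.7)
flipped_family = held_out_report(0.8, 0.9, test_short_agree=0.2)

for name, report in [("same", same_family), ("weaker", weaker_family), ("flipped", flipped_family)]:
    print(name, report)
```

Under these assumptions the same shortcut rule gives zero excess risk on a matching family, positive excess risk on a family with weaker shortcut correlation, and above-chance error on a sign-flipped family.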

2605.12944 2026-05-14 cs.LG cs.CL

From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Mohamed bin Zayed University of Artificial Intelligence

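AI summary: The study addresses data selection for supervised fine-tuning (SFT) and formulates it as fixed-pool data recipe search, which builds a high-quality training subset from a raw instruction pool. Instead of ranking instances, it composes filtering, mixing, and deduplication operators into an ordered recipe that shapes the data distribution. The proposed AutoSelection solver uses cached task-, data-, and model-side signals together with warmup probes, local recipe edits, and Gaussian-process-assisted ranking to search for recipes under a limited budget of full evaluations. Experiments on three base models show that it outperforms full-data training, random recipe search, random top-k selection, and single-operator selectors on the in-distribution reasoning average.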

Abstract

Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-$k$, and single-operator selectors. Additional out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.
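
The idea of an ordered recipe over a fixed pool can be made concrete with a few lines of code. This is a minimal sketch under assumed names, not the AutoSelection operator library: `min_length`, `dedup_exact`, `top_k`, and `run_recipe` are hypothetical helpers, and the `quality` field is an invented cached score. Every operator only selects from the pool; none creates or rewrites examples.

```python
def min_length(min_chars):
    """Filter operator: keep examples whose text has at least min_chars characters."""
    def op(pool):
        return [ex for ex in pool if len(ex["text"]) >= min_chars]
    return op


def dedup_exact(pool):
    """Deduplication operator: keep the first occurrence of each exact text."""
    seen = set()
    kept = []
    for ex in pool:
        if ex["text"] not in seen:
            seen.add(ex["text"])
            kept.append(ex)
    return kept


def top_k(score_key, k):
    """Ranking operator: keep the k examples with the highest cached score."""
    def op(pool):
        return sorted(pool, key=lambda ex: ex[score_key], reverse=True)[:k]
    return op


def run_recipe(pool, recipe):
    """Apply the operators in order; the pool is only ever narrowed."""
    for op in recipe:
        pool = op(pool)
    return pool


pool = [
    {"id": 0, "text": "Solve 2 + 2.", "quality": 0.9},
    {"id": 1, "text": "Solve 2 + 2.", "quality": 0.8},
    {"id": 2, "text": "Hi", "quality": 0.95},
    {"id": 3, "text": "Prove that 7 is prime.", "quality": 0.7},
]

recipe_a = [min_length(5), dedup_exact, top_k("quality", 2)]
recipe_b = [top_k("quality", 2), min_length(5), dedup_exact]

ids_a = [ex["id"] for ex in run_recipe(pool, recipe_a)]
ids_b = [ex["id"] for ex in run_recipe(pool, recipe_b)]
print(ids_a, ids_b)
```

The two recipes use the same three operators but return different subsets, which illustrates why the search space is over ordered recipes rather than over individual selectors.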

2605.12943 2026-05-14 cs.LG

Reinforced Collaboration in Multi-Agent Flow Networks

Zheng Wang, Yuang Liu, Yangkai Ding

Affiliations: Huawei Technologies Co., Ltd.

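AI summary: Multi-agent systems extend large language models by decomposing a complex task into subtasks, but error propagation between subtasks and poorly designed collaboration workflows often degrade overall performance. The paper proposes MANGO, a framework that builds a flow network from past successful workflows and combines reinforcement learning with textual gradients to jointly optimize workflow paths and agent behaviors, with a skipping mechanism to improve efficiency. Experiments show up to 12.8% performance improvement and a 47.4% efficiency gain on multiple benchmarks, with good generalization to unseen domains.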

Abstract

Multi-agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by error propagation, arising from suboptimal workflow design or inaccurate agent outputs, which can propagate through the agent collaboration process and degrade final results. To address these challenges, we present MANGO (Multi-Agent Network Gradient Optimization), a data-driven framework that organizes and refines agent collaboration via a flow network constructed from past successful workflows. MANGO integrates reinforcement learning and textual gradients to jointly optimize workflow paths and agent behaviors, while a skipping mechanism prevents redundant updates to well-optimized agents to improve efficiency. Extensive experiments on seven benchmarks show that MANGO achieves up to 12.8% performance improvement over state-of-the-art baselines, enhances efficiency by 47.4%, and generalizes effectively to unseen domains. Our code and datasets are publicly available at https://github.com/openJiuwen-ai/agent-store/tree/main/community/mango.

2605.12940 2026-05-14 cs.LG cs.AI

The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

Zhiyu Zhao, Xuejie Liu, Muhan Zhang, Anji Liu

Affiliations: School of Computing, National University of Singapore; Institute for Artificial Intelligence, Peking University; School of Intelligence Science and Technology, Peking University

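AI summary: The paper studies the expressivity limits of probabilistic circuits (PCs) as generative language models, in comparison with Transformer-based large language models (LLMs). It finds that PCs still fall short in autoregressive language modeling, limited mainly by their output parameterization and their context-encoding structure. By introducing a logit-space parameterization and analyzing the dependency-topology limits of structured-decomposable PCs, the authors identify key differences between PCs and LLMs, and they prove that decomposable PCs are strictly more expressive than structured-decomposable ones, although optimizing them effectively remains an open challenge.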

Abstract

Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggles to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.
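
The output bottleneck has a simple numerical reading: a convex combination of distributions can never put more mass on a token than its sharpest component does, whereas mixing in log space and renormalizing can. The sketch below illustrates that general fact with log-linear mixing, which is an assumed stand-in and not necessarily the logit-space parameterization used in the paper. The helper names and the two toy component distributions are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_in_probability_space(components, weights):
    """Convex combination of distributions, token by token."""
    vocab = len(components[0])
    return [sum(w * p[i] for w, p in zip(weights, components)) for i in range(vocab)]


def mix_in_logit_space(components, weights):
    """Weighted sum of log-probabilities followed by renormalization (assumed illustration)."""
    vocab = len(components[0])
    logits = [sum(w * math.log(p[i]) for w, p in zip(weights, components)) for i in range(vocab)]
    return softmax(logits)


# Two components that agree on token 0 and disagree on the alternatives.
p1 = [0.5, 0.49, 0.01]
p2 = [0.5, 0.01, 0.49]
weights = [0.5, 0.5]

prob_mix = mix_in_probability_space([p1, p2], weights)
logit_mix = mix_in_logit_space([p1, p2], weights)
print(max(prob_mix), max(logit_mix))
```

Here the probability-space mixture stays capped at the components' own peak of 0.5, while the log-space mixture concentrates noticeably more mass on the token both components agree on.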

2605.12939 2026-05-14 cs.CV

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

Xianbing Sun, Jiahui Zhan, Liqing Zhang, Jianfu Zhang

Affiliations: Shanghai Jiao Tong University

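AI summary: The paper proposes DirectTryOn, a one-step virtual try-on (VTON) method based on straightened conditional transport. Starting from the observation that try-on outputs are highly constrained by the conditional inputs, it straightens the sampling trajectory through pure conditional transport, a garment preservation loss, and a self-consistency loss, which makes single-step generation feasible. Experiments show that the method substantially lowers inference cost while reaching state-of-the-art performance.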

Abstract

Recent diffusion- and flow-based VTON methods achieve strong results with pretrained generative models, but their reliance on multi-step sampling incurs high inference cost, while existing acceleration methods largely overlook the intrinsic structure of the try-on task. In this paper, we highlight a key observation: VTON outputs are highly constrained by the conditional inputs, suggesting that the conditional sampling trajectory can be much straighter than that in general image generation, making one-step generation a natural solution. However, limited task-specific data makes training from scratch impractical, forcing existing methods to fine-tune pretrained models whose objectives do not encourage such straight conditional trajectories. Thus, the deviation from an ideal straight path mainly comes from the mismatch between pretrained base models and the conditional nature of try-on generation, rather than from the task itself. Motivated by this insight, we encourage straighter VTON sampling trajectories through three targeted modifications: pure conditional transport, a garment preservation loss, and a self consistency loss. We further introduce a one-step distillation stage. Extensive experiments show that our method achieves state-of-the-art performance with one-step sampling, establishing a new standard for efficient and high-quality VTON.

2605.12938 2026-05-14 cs.CV cs.AI cs.LG

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

Seonghyun Jin, Youngmin Kim, Sunwoo Park, Jong Chul Ye

Affiliations: Graduate School of AI

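AI summary: The paper proposes CRePE (Curved Ray Expectation Positional Encoding) for unified-camera-controlled video generation. To handle camera configurations such as wide-angle and fisheye lenses that existing methods cover poorly, CRePE uses a depth-aware positional distribution that captures the projected-path geometry induced by wide-field-of-view cameras, improving the stability of camera control and generation quality. The method combines a Geometric Attention Adapter with pseudo supervision from a monocular geometry foundation model, supports multiple camera models, and improves several geometry-aware and perceptual-quality metrics.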

Comments 17 pages, 8 figures, Under review

Abstract

Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.

2605.12937 2026-05-14 cs.CV cs.AI cs.HC

AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

Jacob Lagogiannis, William Agnew, Rosa I. Arriaga, Sauvik Das

Affiliations: Franklin and Marshall College; Carnegie Mellon University; Georgia Institute of Technology

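AI summary: The paper proposes AuraMask, an extensible pipeline for developing anti-facial recognition image filters that are both adversarially effective and aesthetically acceptable. By emulating popular one-click Instagram filters, the authors produce 40 visually appealing filters that meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. A user study shows significantly higher user acceptance than prior methods, and the pipeline is released to support further research on privacy protection.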

Comments 21 pages, 10 figures

Abstract

Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice -- because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study ($N=630$) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.

2605.12933 2026-05-14 cs.CL

ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

Shohei Higashiyama, Hiroki Ouchi, Atsushi Fujita, Masao Utiyama

Affiliations: National Institute of Information and Communications Technology; Nara Institute of Science and Technology

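AI summary: The paper introduces ATD-Trans, a geographically grounded Japanese-English parallel corpus of travelogues, intended to support equitable multilingual access to geographic information and the evaluation of machine translation quality. The dataset covers geo-entities in both domestic (within Japan) and overseas regions and can be used to analyze how different language models perform on translation. The study finds that Japanese-enhanced models have an advantage, and that domestic-region geo-entities are harder to translate.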

Abstract

Geographic text, or textual data rich in geographic (geo-) information is a valuable source for various geographic applications, e.g., tourism management. Making such information accessible to speakers of other languages further enhances its utility; thus, accurate machine translation (MT) is essential for equity in multilingual geo-information access. To facilitate in-depth analysis for geographic text, we introduce ATD-Trans, a geographically grounded Japanese--English travelogue translation dataset, which enables evaluation of MT quality at both the overall and geo-entity levels across domestic (within Japan) and overseas regions. Our experiments on existing language models examine two factors: model language focus and geographic regions. The results highlight advantages of Japanese-enhanced models and greater difficulty in translating domestic-region geo-entities mentioned in travel blogs.

2605.12928 2026-05-14 cs.LG

The Efficiency Gap in Byte Modeling

Celine Lee, Jing Nathan Yan, Chen Liang, Jiaxin Shi, Yin Zhang, Jeremiah Liu, Pengcheng Yin, Fernando Pereira, Ed Chi, Derek Cheng, Alexander M. Rush, Ruoxi Wang

Affiliations: Google DeepMind; Department of Computer Science, Cornell University; work done while at Google DeepMind

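AI summary: The paper studies the compute-efficiency penalty of byte-level language models and compares how that penalty scales under autoregressive (AR) modeling and masked diffusion modeling (MDM). Compute-matched scaling experiments show that the penalty of byte modeling is larger for MDM, which the authors hypothesize stems from the MDM objective destroying the local contiguity needed to resolve semantics from raw bytes efficiently. The study concludes that future byte-level designs need alternative structural biases to keep scaling viable.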

Abstract

Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model's learning. Recently, two alternative paradigms have challenged this: byte-level modeling, which bypasses static statistically-derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end modality-agnostic generative prototype; however, removing these structural priors incurs a significant computational cost. In this work, we investigate this cost through a compute-matched scaling study. Our results reveal that the performance penalty of byte modeling is not uniform; across scale, the scaling overhead of byte modeling is worse for MDM than for AR. We hypothesize that this disparity stems from context fragility: while AR's stable causal history allows models to naturally rediscover subword patterns, the MDM objective destroys the local contiguity required to efficiently resolve semantics from raw bytes. Our findings from controlled permutation experiments suggest that future modality-agnostic designs must incorporate alternative structural biases to maintain viable scaling trajectories in the byte regime.

2605.12924 2026-05-14 cs.LG

IV-ICL: Bounding Causal Effects with Instrumental Variables via In-Context Learning

Vahid Balazadeh, Hamidreza Kamkari, Medha Barath, Ricardo Silva, Rahul G. Krishnan

Affiliations: University of Toronto; Vector Institute; MIT CSAIL; University College London

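AI summary: The paper proposes IV-ICL, a method for bounding causal effects with instrumental variables. It uses in-context learning to learn the marginal posterior distribution of the causal effects directly and derives bounds from its quantiles. Compared with existing approaches, IV-ICL avoids hand-designed estimators as well as the high computational cost and prior sensitivity of direct Bayesian inference, and it covers the identified set more reliably across diverse data-generating processes. Experiments on synthetic and semi-synthetic datasets show intervals that are more reliably valid and more informative, with much lower inference time than the baselines.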

Abstract

The instrumental-variables (IV) setting is standard for partial identification of causal effects when unobserved confounding makes point identification impossible. Existing approaches face methodological bottlenecks: closed-form bound estimands are required -- e.g., Balke-Pearl equations in binary IV -- and even when available, designing accurate estimators requires manual effort tailored to each estimand. While direct Bayesian inference of the causal effects, instead of the bounds, circumvents these challenges, it is often computationally intensive and suffers from high prior sensitivity or under-dispersed posteriors. As a remedy, we introduce IV-ICL, an amortized Bayesian in-context learning method that learns the marginal posterior distribution of the causal effects directly and derives bounds as its quantiles. Unlike standard variational inference that optimizes exclusive KL divergence, amortized Bayesian inference minimizes the expected inclusive KL, a mass-covering objective. We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes, while exclusive-KL (e.g. with variational inference) of the same Bayesian formulation collapses onto a single mode and fails to cover the identified set. We evaluate IV-ICL on synthetic and semi-synthetic IV benchmarks and show it produces intervals that are more reliably valid and more informative compared to efficient semi-parametric, Bayesian, and plug-in baselines, at 20-500x lower inference time. Beyond methodology, we propose a procedure to convert randomized controlled trials into IV benchmarks with provably preserved ground-truth causal effects that enables a more realistic evaluation of partial-identification methods.

2605.12922 2026-05-14 cs.AI cs.CL

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

Affiliations: University of Illinois Urbana-Champaign; Adobe Research

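AI summary: The paper studies how large language models gradually lose track of task goals, persona, and rules over multi-turn conversations. The authors propose a channel-transition account in which goal-defining tokens become less accessible through attention while the related information may persist in residual representations. Using the Goal Accessibility Ratio (GAR), sliding-window ablations, and residual-stream probes, the study shows that different models exhibit distinct failure modes once attention closes, and that residual representations are predictive of task outcomes.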

Abstract

Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
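
The abstract describes GAR only as a measure of attention from generated tokens to task-defining goal tokens, without giving a formula. The sketch below assumes one plausible reading: the mean share of each generated token's attention mass that lands on goal positions. The function name, the index arguments, and the toy attention matrix are hypothetical.

```python
def goal_accessibility_ratio(attn, goal_idx, gen_idx):
    """Mean share of attention that generated tokens place on goal tokens.

    attn[i][j] is the attention weight from query position i to key
    position j (each row sums to 1). This definition is an assumption
    made for illustration; the abstract does not give the exact formula.
    """
    shares = [sum(attn[i][j] for j in goal_idx) for i in gen_idx]
    return sum(shares) / len(shares)

# Toy 4-token sequence: positions 0-1 hold the goal, 2-3 are generated.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]
gar = goal_accessibility_ratio(attn, goal_idx=[0, 1], gen_idx=[2, 3])
print(round(gar, 3))
```

Under this reading, a value near zero would mean the generated tokens barely attend to the goal tokens, which is the "closed channel" regime the paper examines.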

2605.12919 2026-05-14 cs.CV

GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting

Utae Jeong, Jaewan Choi, Junseok Lee, Jongheon Jeong, Sang Ho Yoon, ByoungSoo Koh, Sangpil Kim

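Affiliations * Korea University; KAIST; Hanshin University

AI summary This paper proposes GuardMarkGS, a unified protection framework that addresses the dual risk facing 3D Gaussian Splatting (3DGS) assets: tracing copyright ownership and preventing unauthorized editing. The method combines global watermark optimization with an adversarial edit-suppression strategy. By separating latent features, perturbing editing trajectories, and selectively strengthening adversarial updates, it aims at both traceable ownership and effective deterrence of editing. Experiments show that the framework balances watermark accuracy and edit suppression while preserving rendering quality.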

Comments Preprint

Abstract

3D Gaussian Splatting (3DGS) is becoming a practical representation for novel view synthesis, but its growing adoption, together with rapid advances in instruction-driven 3DGS editing, also exposes a dual copyright risk: once a 3DGS-based asset is released, it can be used without permission and manipulated through 3D editing. Existing protection methods address only one side of this problem. Watermarking can trace ownership after unauthorized use, but it cannot prevent malicious editing. Adversarial edit-deterrence methods can disrupt editing, but they do not provide evidence of ownership. To the best of our knowledge, we present the first unified protection framework for 3DGS that jointly optimizes ownership tracing and unauthorized editing deterrence. Our framework combines a scene-wide watermarking objective over all Gaussians with an adversarial objective for edit deterrence. The adversarial branch combines latent-anchor separation, denoising-trajectory diversion, and cross-attention diversion to divert the editing trajectory, while an update-saliency-motivated Gaussian selection strategy assigns stronger adversarial updates to mask-selected Gaussians, improving the balance among watermark recovery, edit deterrence, and rendering fidelity. Experiments on scenes from Mip-NeRF 360 and Instruct-NeRF2NeRF demonstrate that the proposed framework achieves a favorable balance among bit accuracy, edit deterrence, and rendering quality. These results suggest that practical copyright protection of 3DGS-based assets can be more effectively addressed by integrating ownership tracing and unauthorized editing deterrence into a single optimization framework.

2605.12918 2026-05-14 cs.CL

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

Armin Toroghi, Faeze Moradi Kalarde, Scott Sanner

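Affiliations * University of Toronto

AI summary To interact effectively with the real world, large language models (LLMs) need entity-based commonsense reasoning, which requires combining factual knowledge about specific entities with commonsense inference. This paper presents CommonWhy, a dataset of 15,000 "why" questions for evaluating commonsense reasoning about causal relationships. It also serves as a Knowledge Graph Question Answering (KGQA) benchmark, since the knowledge needed to answer every question is available in Wikidata. Unlike existing KGQA datasets, CommonWhy targets causal reasoning rather than pure fact retrieval, and experiments show that current state-of-the-art models still suffer from factual hallucinations and causal-reasoning failures on this task.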

Abstract

To effectively interact with the real world, Large Language Models (LLMs) require entity-based commonsense reasoning, a challenging task that necessitates integrating factual knowledge about specific entities with commonsense inference. Existing datasets for evaluating LLM entity-based commonsense reasoning have largely focused on True/False or multiple-choice questions, leaving the explicit assessment of the model's ability in abductive reasoning about causes and effects and generating explanations largely unexamined. In this work, we introduce CommonWhy, a dataset of 15,000 why questions designed to evaluate entity-based commonsense reasoning about causal relationships in LLMs. CommonWhy also serves as a Knowledge Graph Question Answering (KGQA) benchmark, as all supporting knowledge required to answer its queries is available in the Wikidata knowledge graph. Unlike existing KGQA datasets, which primarily test fact retrieval, CommonWhy targets causal commonsense reasoning, establishing a new paradigm for KGQA evaluation. Experiments with state-of-the-art LLMs and LLM-based KGQA methods reveal their significant shortcomings, including frequent factual hallucinations and failures in causal reasoning.

2605.12917 2026-05-14 cs.CV cs.LG

Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification

One Octadion, Novanto Yudistira, Lailil Muflikhah

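Affiliations * Faculty of Computer Science, Universitas Brawijaya

AI summary This study addresses the overconfidence of deep learning models in medical image classification and proposes an adaptive conformal prediction method to improve diagnostic reliability and explainability. It modifies RAPS with an Adaptive Lambda Criterion that controls coverage deviation of the prediction sets, so that coverage stays high across inputs of differing difficulty. Experiments show that the method balances high coverage against small prediction sets on several medical imaging datasets and generalizes well across domains, making it suitable for safety-critical medical AI applications.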

Comments To appear in IEA/AIE 2026 (Springer LNAI)

Abstract

Deep learning models for medical imaging often exhibit overconfidence, creating safety risks in ambiguous diagnostic scenarios. While Conformal Prediction (CP) provides distribution-free statistical guarantees, standard methods such as Regularized Adaptive Prediction Sets (RAPS) optimize for average efficiency and can mask severe failures on difficult inputs. We propose an Adaptive Lambda Criterion for RAPS that minimizes the worst-case coverage violation across prediction set size strata. On OrganAMNIST (58,850 abdominal CT images, 11 classes), standard size-optimized RAPS converges to near-deterministic behavior with stratified undercoverage on uncertain samples, while our method achieves 95.72 percent global coverage with average set size 1.09 and at least 90 percent coverage across all strata. Cross-domain validation on PathMNIST (107,180 pathology images, 9 classes) confirms generalizability. Quantitative Grad-CAM analysis (rho = -0.30, p < 1e-22) shows that multi-label predictions correspond to focused attention on anatomically ambiguous regions. These results demonstrate that the proposed method improves reliability while maintaining efficiency, making it suitable for safety-critical medical AI applications.
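
The quantity this abstract optimizes, the worst-case coverage violation across prediction-set-size strata, can be sketched as follows. This is not the authors' implementation: the choice of one stratum per set size, the 0.90 target, and the toy prediction sets are assumptions for illustration, and the search over the RAPS lambda parameter is not shown.

```python
from collections import defaultdict

def worst_stratum_violation(pred_sets, labels, target=0.90):
    """Largest shortfall of coverage below `target` over set-size strata.

    Each stratum groups the examples whose prediction set has the same
    size (an assumed stratification; the paper may bin differently).
    """
    hits = defaultdict(list)
    for s, y in zip(pred_sets, labels):
        hits[len(s)].append(y in s)
    coverage = {k: sum(v) / len(v) for k, v in hits.items()}
    violation = max(max(0.0, target - c) for c in coverage.values())
    return coverage, violation

# Toy prediction sets (class indices) and true labels.
pred_sets = [{0}, {1}, {2}, {0}, {0, 1}, {1, 2}]
labels = [0, 1, 2, 1, 1, 0]
coverage, violation = worst_stratum_violation(pred_sets, labels)
print(coverage, round(violation, 2))
```

A criterion of this kind would prefer the lambda value with the smallest `violation`, rather than the one with the smallest average set size.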

2605.12913 2026-05-14 cs.LG

Revisiting DAgger in the Era of LLM-Agents

Changhao Li, Rushi Qiang, Jiawei Huang, Chenxiao Gao, Chao Zhang, Niao He, Bo Dai

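Affiliations * Georgia Institute of Technology; ETH Zurich

AI summary This paper studies how to improve learning for long-horizon tasks in the era of LLM agents. To address the shortcomings of existing supervised fine-tuning and reinforcement learning methods, it revisits and adapts the Dataset Aggregation (DAgger) algorithm. The method generates trajectories by mixing the student and teacher policies at each turn and trains the student on supervised labels provided by the teacher, which mitigates covariate shift and provides rich feedback. Experiments show that the method improves model performance on software-engineering tasks and outperforms the compared baselines.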

Abstract

Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories, while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but receives only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Moreover, because the student learns by mimicking the teacher's behavior, it receives rich feedback during learning. To demonstrate that DAgger enjoys the best of both worlds, we use the algorithm to train software-engineering agents with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.
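
The data-collection loop described in this abstract can be sketched in a few lines. The sketch assumes that "turn-level interpolation" means a per-turn random choice between teacher and student with mixing probability `beta`, as in classic DAgger; the paper's actual scheme may differ. The integer environment and both policies are hypothetical stand-ins for an agent environment and language models.

```python
import random

def collect_dagger_trajectory(env_step, student, teacher, state, beta, horizon, rng):
    """Roll out a mixture policy and label every visited state with the teacher.

    At each turn the teacher acts with probability `beta`, otherwise the
    student acts; the teacher's action is always stored as the label.
    """
    dataset = []
    for _ in range(horizon):
        label = teacher(state)
        dataset.append((state, label))
        action = label if rng.random() < beta else student(state)
        state = env_step(state, action)
    return dataset

rng = random.Random(0)
teacher = lambda s: 1        # hypothetical expert: always step up
student = lambda s: -1       # hypothetical untrained student: always step down
step = lambda s, a: s + a    # toy environment on the integers

student_only = collect_dagger_trajectory(step, student, teacher, 0, 0.0, 3, rng)
teacher_only = collect_dagger_trajectory(step, student, teacher, 0, 1.0, 3, rng)
print(student_only)
print(teacher_only)
```

With `beta = 0.0` the visited states are the ones the student itself reaches, yet every label still comes from the teacher, which is how this style of training pairs on-policy states with dense supervision.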

2605.12904 2026-05-14 cs.LG

VIP-COP: Context Optimization for Tabular Foundation Models

Yilong Chen, Xueying Ding, Leman Akoglu

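Affiliations * Carnegie Mellon University

AI summary Tabular foundation models (TFMs) excel at in-context learning on structured data, but their performance is constrained by context-length limits, making it hard to handle data beyond the pretraining scale. This paper proposes VIP-COP, which optimizes context selection by estimating how important each training example and feature is for prediction, suppressing noise and focusing on the most informative data. The method is fast, budget-aware, black-box (requiring no access to model internals), interpretable, and robust. It outperforms existing methods on several large-scale, high-dimensional tasks, setting a new state of the art in test-time context optimization for TFMs.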

Abstract

Tabular foundation models (TFMs) have emerged as a powerful paradigm for in-context learning on structured data, enabling direct prediction on new tabular tasks without task-specific training. However, their effectiveness is constrained by context length limits, restricting application to medium-scale data and degrading performance when inference-time data exceed pretraining size distributions. Our work introduces VIP-COP, estimating the Value of Importance for Prediction of training examples and features for hard Context OPtimization for TFMs. Its explicit selection mechanism suppresses noise and isolates influential data, enabling the model to also benefit from data augmentation by prioritizing high-value augmented samples and features. VIP-COP is (i) fast, boosting performance often within minutes of optimization, based on an online KernelSHAP-based regression with iterative refinement, value-guided context sampling, and multi-fidelity pruning; (ii) budget-aware and any-time, improving with additional test-time compute unlike heuristics that produce fixed contexts; (iii) model-aware yet fully black-box, requiring no access to model internals, making it compatible with both proprietary and open-source TFMs; (iv) interpretable, identifying discrete "Very Important Predictors" (samples and features) that maximize signal-to-noise, which makes it (v) robust, isolating high-value data from noise. In contrast, soft-prompt optimization requires model gradients, produces abstract latent tokens, and lacks explicit signal discrimination. Extensive experiments show that VIP-COP consistently outperforms heuristic and optimized baselines across large-scale high-dimensional testbeds, including data augmentation and data-noise settings, establishing a new state of the art in test-time context refinement for TFMs.

2605.12897 2026-05-14 cs.RO

DynoJEPP: Joint Estimation, Prediction and Planning in Dynamic Environments

Mikolaj Kliniewski, Jesse Morris, Yiduo Wang, Ian R. Manchester, Viorela Ila

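Affiliations * Australian Centre For Robotics (ACFR); School of Aerospace, Mechanical and Mechatronic Engineering (AMME)

AI summary DynoJEPP is a factor-graph-based framework for jointly optimizing state estimation, prediction, and path planning in dynamic environments. In conventional approaches, information fed back from prediction and planning contaminates the estimates and leads to unsafe behavior. To address this, DynoJEPP introduces a novel directed factor that enforces one-way information flow within the factor graph. Experiments show that this is critical for safe navigation, and Cooperative DynoJEPP further lets the robot incorporate the behavior of cooperative objects into its prediction and planning, improving the robustness and safety of the overall system.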

Abstract

DynoJEPP is a factor-graph-based framework that jointly formulates and simultaneously optimizes estimation, prediction, and planning in dynamic environments. In conventional factor-graph-based approaches that jointly formulate estimation, prediction, and planning, information from prediction and planning feeds back into state estimation, yielding corrupted estimates, undesired behaviors, and unsafe plans. To address this, DynoJEPP introduces a novel directed factor that enforces directional information flow within the factor graph, preventing prediction and planning from corrupting state estimation. We evaluate the impact of directed factors on inter-module interactions during navigation in both static and dynamic environments. Our results demonstrate that these factors are critical for safe operation, as without them, the robot collides in the majority of experiments. Building on this, we further introduce Cooperative DynoJEPP, which enables the ego robot to incorporate cooperative object behavior into its prediction and trajectory planning.

2605.12894 2026-05-14 cs.AI cs.CL

Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Harshita Chopra, Kshitish Ghate, Aylin Caliskan, Tadayoshi Kohno, Chirag Shah, Natasha Jaques

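Affiliations * University of Washington, Seattle, WA; Georgetown University, Washington, DC

AI summary This study addresses the poor performance of large language model (LLM) agents when faced with the diverse behaviors of real users. It proposes Persona Policies (PPol), a plug-and-play control layer that generates user personas with realistic behavioral traits in order to improve agent robustness. By casting persona generation as an LLM-driven evolutionary program search, the method optimizes a Python generator to discover behavior patterns that preserve the task goals and to produce diverse user personas. Experiments show that PPol substantially improves both the realism of user simulation and agent task success, offering a new approach to simulator-based evaluation and training.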

Comments Preprint under review

Abstract

Large Language Model (LLM) agents are increasingly deployed in settings where they interact with a wide variety of people, including users who are unclear, impatient, or reluctant to share information. However, collecting real interaction data at scale remains expensive. The field has turned to LLM-based user simulators as stand-ins, but these simulators inherit the behavior of their underlying models: cooperative and homogeneous. As a result, agents that appear strong in simulation often fail under the unseen, diverse communication patterns of real users. To narrow this gap, we introduce Persona Policies (PPol), a plug-and-play control layer that induces realistic behavioral variation in user simulators while preserving the original task goals. Rather than hand-crafting personas, we cast persona generation as an LLM-driven evolutionary program search that optimizes a Python generator to discover behaviors and translate them into task-preserving roleplay policies. Candidate generators are guided by a multi-objective fitness score combining human-likeness with broad coverage of human behavioral patterns. Once optimized, the generator produces a diverse population of human-like personas for any task in the domain. Across tau^2-bench retail and airline domains, evolved PPol programs yield 33-62% absolute gains in fitness score over the baseline simulator. In a blinded evaluation, annotators rated PPol-conditioned users as human 80.4% of the time, close to real human traces and nearly twice as frequently as baseline simulators. Agents trained with PPol are more robust to challenging, out-of-distribution behaviors, improving task success by +17% relative to training only on existing simulated interactions. This offers a novel approach to strengthen simulator-based evaluation and training without changing tasks or rewards.

2605.12882 2026-05-14 cs.CL cs.CV

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He

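Affiliations * Peking University; Shanghai Artificial Intelligence Laboratory

AI summary CiteVQA is a new benchmark for evaluating trustworthy document intelligence, motivated by the neglect of evidence attribution in current document question-answering systems. It requires models to return specific cited regions alongside their answers, so that answer correctness and citation accuracy are evaluated together. Using the Strict Attributed Accuracy (SAA) metric, CiteVQA reveals that existing multimodal large language models commonly give correct answers with wrong citations, and it provides a new evaluation tool for improving the reliability of document understanding systems.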

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a serious risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
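
The "both must be correct" rule behind SAA can be sketched as below. The abstract does not say how answer correctness or region correctness is judged, so the case-insensitive exact match and the IoU threshold of 0.5 are assumptions, as are the record layout and the toy data. The benchmark's own scorer lives in the linked repository and may differ.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def strict_attributed_accuracy(preds, golds, iou_thresh=0.5):
    """Share of items whose answer AND cited box are both judged correct."""
    ok = 0
    for p, g in zip(preds, golds):
        answer_ok = p["answer"].strip().lower() == g["answer"].strip().lower()
        cite_ok = iou(p["box"], g["box"]) >= iou_thresh  # assumed matching rule
        ok += answer_ok and cite_ok
    return ok / len(golds)

gold = {"answer": "42", "box": (0, 0, 10, 10)}
golds = [gold, gold, gold]
preds = [
    {"answer": "42", "box": (0, 0, 10, 10)},    # right answer, right region
    {"answer": "42", "box": (50, 50, 60, 60)},  # right answer, wrong region
    {"answer": "7", "box": (0, 0, 10, 10)},     # wrong answer, right region
]
saa = strict_attributed_accuracy(preds, golds)
answer_only = sum(p["answer"] == g["answer"] for p, g in zip(preds, golds)) / len(golds)
print(round(saa, 3), round(answer_only, 3))
```

The second toy prediction is the case the abstract calls Attribution Hallucination: it counts toward answer-only accuracy but not toward SAA.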

2605.12879 2026-05-14 cs.LG

ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection

Huy Tran, Max Milkert, David Hyde

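Affiliations * Vanderbilt University

AI summary This paper proposes ASAP, a new method for implementing doubly-stochastic attention efficiently. It combines the training advantages of Sinkhorn scaling with the inference-time benefits of sliced dual projection: a parametric map is learned during training, and at inference a fixed operator replaces the iterative scaling, substantially improving computational efficiency. Experiments show that ASAP keeps training inexpensive while performing comparably to, or better than, existing methods on language and vision tasks.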

Abstract

Doubly-stochastic attention has emerged as a transport-based alternative to row-softmax attention, with recent Transformer variants using it to reduce attention sinks and rank collapse while improving performance. In this family, the standard approach is Sinkhorn scaling, which trains more efficiently but still repeats matrix scaling in every inference forward pass. Sliced-transport attention removes the online iteration, but its soft sorting approximation materializes dense tensors for each slice, requiring substantially more training resources than Sinkhorn attention. We introduce ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection, a train-then-compile method that trains the doubly-stochastic layer with Sinkhorn, then replaces the iterative scaling loop at inference with a fixed sliced-dual operator. It learns a lightweight parametric map from exact one-dimensional Kantorovich potentials to the Sinkhorn query-side dual, then reconstructs the attention plan with a two-sided entropic c-transform. Across language and vision benchmarks, ASAP keeps the cheaper training setup and remains highly competitive with recent baselines. In the main frozen-layer benchmark, ASAP is 5.3x faster than the trained Sinkhorn teacher while matching its accuracy; in downstream replacements, ASAP recovers most of the teacher performance without any retraining.
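
The Sinkhorn scaling loop that this abstract treats as the baseline (and that ASAP replaces at inference) can be sketched as follows. This is plain Sinkhorn normalisation on a small square score matrix, not the sliced-dual operator proposed in the paper; the scores and the iteration count are arbitrary choices for illustration.

```python
import math

def sinkhorn_attention(scores, n_iters=50):
    """Alternately normalise rows and columns of exp(scores).

    For a square score matrix this converges towards a doubly-stochastic
    matrix (all rows and all columns sum to 1).
    """
    k = [[math.exp(s) for s in row] for row in scores]
    n = len(k)
    for _ in range(n_iters):
        k = [[v / sum(row) for v in row] for row in k]                # rows -> 1
        col = [sum(k[i][j] for i in range(n)) for j in range(n)]
        k = [[k[i][j] / col[j] for j in range(n)] for i in range(n)]  # cols -> 1
    return k

scores = [
    [0.2, 1.0, -0.5],
    [0.0, 0.3, 0.8],
    [1.2, -0.1, 0.4],
]
plan = sinkhorn_attention(scores)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
print([round(r, 6) for r in row_sums], [round(c, 6) for c in col_sums])
```

The loop has to run in every forward pass, which is the per-inference cost the abstract says ASAP removes.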

2605.12876 2026-05-14 cs.LG

Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing

Blaise Delattre, Hengyu Wu, Paul Caillon, Wei Yang Bryan Lim, Yang Cao

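Affiliations * Department of Computer Science, School of Computing, Institute of Science Tokyo, Tokyo, Japan; College of Computing and Data Science, Nanyang Technological University, Singapore; PSL University, Paris, France

AI summary This paper studies how to provide certified robustness for multimodal models under heterogeneous perturbations. It proposes a unified randomized smoothing framework that handles joint perturbations of mixed discrete and continuous inputs. By analyzing the joint likelihood ordering of discrete and continuous noise, the method obtains a closed-form, one-dimensional robustness certificate that strictly generalizes the separate certificates for image-only and text-only perturbations. The framework is validated on multimodal safety filtering, providing what the authors describe as the first model-agnostic Neyman-Pearson certificate for joint discrete and continuous perturbations in interaction-dependent text-image settings.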

Comments ICML 2026. Code: https://github.com/tdsai-lab/hybrid-randomized-smoothing

Abstract

Randomized smoothing provides strong, model-agnostic robustness certificates, but existing guarantees are limited to single modalities, treating continuous and discrete inputs in isolation. This limitation becomes critical in multimodal models, where decisions depend on cross-modal semantics and adversaries can jointly perturb heterogeneous inputs, rendering unimodal certificates insufficient. We introduce a unified randomized smoothing framework for mixed discrete-continuous inputs based on an analytically tractable Neyman-Pearson formulation of the joint worst-case problem. By analyzing the joint likelihood ordering induced by factorized discrete and continuous noise, our approach yields a closed-form, one-dimensional certificate that strictly generalizes both Gaussian (image-only) and discrete (text-only) randomized smoothing. We validate the framework on multimodal safety filtering, providing, to our knowledge, the first model-agnostic Neyman-Pearson certificate for joint discrete-token and continuous-image perturbations in interaction-dependent text-image safety filtering.
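
For orientation, the Gaussian (continuous-only) special case that this abstract says the hybrid certificate generalizes has a well-known closed form in the binary setting: the certified L2 radius is sigma times the inverse normal CDF of a lower bound on the top-class probability. The sketch below shows only that standard case; it does not reproduce the paper's joint discrete-continuous certificate, and the function name is hypothetical.

```python
from statistics import NormalDist

def gaussian_certified_radius(p_a_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_a_lower) for Gaussian smoothing.

    `p_a_lower` is a lower confidence bound on the top-class probability
    under noise and must be strictly below 1; no radius is certified
    unless it exceeds 1/2.
    """
    if p_a_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a_lower)

radius = gaussian_certified_radius(0.9, sigma=0.5)
print(round(radius, 4))
```

A larger noise level `sigma` or a more confident smoothed classifier both enlarge the certified radius in this special case.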

2605.12874 2026-05-14 cs.LG

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

Jordan F. McCann

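Affiliations * Independent Researcher

AI summary This paper studies a new problem for sparse autoencoders (SAEs) in language-model interpretability: descriptive collision, in which multiple distinct features are assigned the same natural-language explanation. Analyzing a large set of human-annotated SAE features, the author finds that the same explanation is frequently reused, which reduces the ability to tell features apart. The paper proposes two new evaluation metrics that correct the overestimation of feature interpretability by existing methods, improving the accuracy and reliability of automated interpretability.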

Comments 11 pages, 2 figures, 3 tables

Abstract

Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus on polysemanticity -- one feature with many meanings -- or on whether explanations predict activations. We identify a complementary, structurally distinct problem we call descriptive collision: many distinct SAE features admit the same explanation. Reanalyzing the largest publicly-available dataset of human-annotated SAE features (Marks et al., 2025), comprising 722 annotated features across Gemma 2 2B and Pythia 70M, we find that the mean annotation string is reused across 3.07 features; 82.1% of features share their annotation with at least one other feature; and the single most common annotation string ("plural nouns") labels 101 distinct features spanning 18 layers and four model components. Information-theoretically, the average annotation resolves only 70% of feature identity. We formalize a property called discrimination, prove that current detection-style auto-interpretability scoring is invariant to collision, and propose two complementary corrective metrics -- collision-adjusted detection and discrimination scoring -- that explicitly penalize explanations that fail to distinguish a feature from its neighbors. The collision problem is independent of, and additive with, previously identified failure modes of auto-interpretability; ignoring it inflates reported feature interpretability by a quantity equal to roughly one-third of the bits required to identify a feature.
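
The three collision statistics quoted in this abstract (mean reuse per annotation string, share of features whose annotation is shared, and the fraction of feature identity an annotation resolves) can be computed from a flat list of annotation strings. The formulas below are assumptions about how such numbers might be defined; in particular, the "resolved" fraction is taken here as H(annotation) / log2(number of features) with features treated as equally likely. The toy annotations are made up.

```python
import math
from collections import Counter

def collision_stats(annotations):
    """Reuse statistics for a list holding one annotation string per feature."""
    counts = Counter(annotations)
    n = len(annotations)
    mean_reuse = n / len(counts)  # features per distinct annotation string
    shared = sum(c for c in counts.values() if c > 1) / n
    # With features equally likely, the annotation carries H(annotation) bits
    # out of the log2(n) bits needed to name a single feature (assumed reading).
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    resolved = h / math.log2(n)
    return mean_reuse, shared, resolved

annotations = ["plural nouns"] * 4 + ["dates", "dates", "code", "urls"]
mean_reuse, shared, resolved = collision_stats(annotations)
print(mean_reuse, shared, round(resolved, 4))
```

On real data, a `resolved` value well below 1 would indicate that annotations alone leave many features indistinguishable.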

2605.12872 2026-05-14 cs.LG

SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

Truong Pham, Anay Majee, Rishabh Iyer

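Affiliations * The University of Texas at Dallas

AI summary Despite the recent progress of multimodal foundation models, their reliance on large amounts of paired data limits their use in data-scarce settings. This paper proposes SMA, a combinatorial alignment method based on Submodular Mutual Information, which treats multiple augmentations and descriptions of an entity as a set in order to capture richer cross-modal structure and align modalities more effectively with limited data. Experiments show that SMA performs well on zero-shot classification and retrieval tasks, reaching strong multimodal generalization with only tens of thousands of samples.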

Abstract

Despite the recent success of Multimodal Foundation Models (FMs), their reliance on massive paired datasets limits their applicability in low-data and rare-scenario settings where aligned data is scarce and expensive. A key bottleneck is the adoption of an instance-level formulation, which learns alignment by maximizing correlation between individual image-text pairs while neglecting the underlying geometric structure across modalities, resulting in a modality gap. In this paper, we propose a combinatorial paradigm for multimodal alignment that moves beyond pairwise learning and introduce the Submodular Modality Aligner (SMA), which treats multiple augmentations and descriptions of an entity as a set, leveraging multiple descriptions of the data to capture richer cross-modal structure. We instantiate SMA using a principled objective based on Submodular Mutual Information (SMI), which jointly maximizes inter-modality mutual information while reducing cross-modal divergence. This formulation enables the model to effectively utilize multiple positive associations and extract significantly more information from limited data. We evaluate SMA on 14 zero-shot classification and retrieval tasks from the CLIP benchmark and demonstrate consistent gains in the low-data regime. Notably, SMA achieves strong multimodal generalization using only tens of thousands of samples, orders of magnitude fewer than standard approaches. Our results highlight the importance of set-based formulations and submodular objectives for data-efficient multimodal learning.

2605.12855 2026-05-14 cs.CV

Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy

Jorge Tapias Gomez, Despoina Kanata, Aneesh Rangnekar, Christina Lee, Hannah Williams, Hannah Thompson, J. Joshua Smith, Francisco Sanchez-Vega, Mert R. Sabuncu, Julio Garcia-Aguilar, Harini Veeraraghavan

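Affiliations * Department of Medical Physics, Memorial Sloan Kettering Cancer Center; School of Computer Science, Cornell University and Cornell Tech; Department of Surgery, Colorectal Service, Memorial Sloan Kettering Cancer Center; Department of Radiology, Weill Cornell Medical College; School of Electrical and Computer Engineering, Cornell University and Cornell Tech

AI summary This study proposes TREX, a deep learning method based on longitudinal endoscopic images, for predicting tumor regrowth in rectal cancer patients managed with watch-and-wait surveillance. TREX combines images from post-treatment restaging and follow-up, using dual cross-attention and pretrained Swin Transformers to extract and fuse features without image registration, in order to distinguish complete response from local regrowth. Experiments show that TREX outperforms the compared methods in both regrowth detection and early detection, and in clinical validation its diagnostic accuracy was comparable to that of attending clinicians.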

Comments 14 Pages, 9 figures, 2 tables

Abstract

Clinical trials indicate a benefit of watch-and-wait (WW) surveillance for patients with rectal cancer showing a complete or near-complete clinical response (CR) directly after treatment (restaging). However, there are no objectively accurate methods for early detection of local tumor regrowth (LR) from follow-up exams in patients undergoing WW. Hence, we developed Temporal Rectal Endoscopy Cross-attention (TREX), a longitudinal deep learning approach that combines pairs of images acquired at restaging and follow-up to distinguish CR from LR. TREX uses pretrained Swin Transformers in a siamese setting to extract features from longitudinal images and dual cross-attention to combine the features without spatial co-registration between image pairs. TREX and Swin-based baselines were trained under two settings: (a) detecting LR or CR at the last available follow-up and (b) early detection of LR at 3-6, 6-12, and 12-24 months before clinical confirmation. TREX achieved the highest accuracy in detecting LR with a high sensitivity of 97% ± 6% and a balanced accuracy of 90% ± 3%, and outperformed all baselines in early detection at both 3-6 (74% ± 1%) and 6-12 months (62% ± 4%) prior to clinical detection. Clinical validation via a surgeon survey showed that TREX matched attending-level overall accuracy (TREX: 86.21% vs. Clinicians: 87.84% ± 1.28%). Finally, we explored TREX's ability to predict treatment response by combining pre-treatment (pre-TNT) and restaging endoscopies, achieving a balanced accuracy of 73% ± 12%. These results show that longitudinal deep learning analysis of endoscopy may improve surveillance and enable earlier identification of rectal cancer regrowth.