arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13013 2026-05-14 cs.LG

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

Jing Yu Lim, Rushi Shah, Zarif Ikram, Samson Yu, Haozhe Ma, Tze-Yun Leong, Dianbo Liu

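AI summary: This paper proposes JEDI, an end-to-end joint-embedding diffusion world model for online model-based reinforcement learning. The model combines JEPA-style predictive representation learning with a diffusion denoising objective and learns its latent space directly from the diffusion loss, avoiding the reliance on pretrained encoders found in earlier methods. JEDI is competitive on Atari100k and, relative to the pixel diffusion baseline, substantially reduces VRAM use as well as training and sampling time.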

Details
Abstract

Diffusion world models have recently become competitive for online model-based reinforcement learning, but current approaches expose a tension: pixel diffusion is effective but computationally expensive, while the latest latent diffusion approach improves efficiency yet underperforms. The latter also relies on separately trained latents rather than the end-to-end world-model objectives that have driven much of modern MBRL progress. In particular, JEPA-style predictive representation learning has emerged as an especially promising direction for world modeling and MBRL. Concurrently, diffusion-style objectives have gained traction across multiple domains, with iterative refinement as a promising approach for multimodal and stochastic targets. Taken together, these trends motivate Joint Embedding DIffusion (JEDI), the first online end-to-end latent diffusion world model. JEDI learns its latent space directly from the diffusion denoising loss with a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive-compression decomposition. Empirically, JEDI is competitive on Atari100k and outperforms the baseline with separately trained latents where directly comparable. Relative to the pixel diffusion baseline, JEDI uses 43% less VRAM, samples from the world model over 3$\times$ faster, and trains 2.5$\times$ faster. JEDI also exhibits a markedly different task-level performance profile from the pixel baseline, suggesting that end-to-end predictive latents change more than compute alone.

2605.13010 2026-05-14 cs.CV cs.AI cs.SY eess.SY math.OC

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

Yilie Huang, Xun Yu Zhou

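AI summary: This paper studies image inpainting with generative diffusion models and proposes AID, a method that keeps the pretrained diffusion backbone fixed and trains a small, reusable guidance module offline so that many masked images can be inpainted efficiently. The problem is formulated as a deterministic guidance problem with a supervised terminal objective. An auxiliary Gaussian formulation yields a randomized problem that is learnable in high dimensions, which leads to a data-driven continuous-time actor-critic algorithm. Experiments show that AID outperforms existing fixed-backbone and amortized inpainting methods across several datasets and mask types, achieving a better trade-off between inpainting quality and speed.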

Details
Abstract

We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor--critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality--speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.

2605.13006 2026-05-14 cs.RO cs.MA

Occlusion-Based Object Transportation Around Obstacles With a Swarm of Miniature Robots

Breno Cunha Queiroz, Daniel MacRae

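AI summary: This paper studies how a swarm of miniature robots can transport an object around obstacles. The core idea is to extend the original occlusion-based strategy with a sub-goal mechanism, so that robots cooperatively form a chain of visible sub-goals and route around obstacles without communication and under fully decentralised control. Experiments show that the method is robust and general across different starting positions and obstacle shapes.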

Comments 25 pages, 9 figures, 6 tables. Accepted for publication in the journal Swarm Intelligence

Details
Journal ref
Swarm Intelligence, 2024
Abstract

Swarm robotics utilises decentralised self-organising systems to form complex collective behaviours built from the bottom up using individuals that have limited capabilities. Previous work has shown that simple occlusion-based strategies can be effective in using swarm robotics for the task of transporting objects to a goal position. However, this strategy requires a clear line-of-sight between the object and the goal. In this paper, we extend this strategy by allowing robots to form sub-goals, enabling any member of the swarm to establish a wider range of visibility of the goal and ultimately forming a chain of sub-goals between the object and the goal position. We do so while preserving the fully decentralised and communication-free nature of the original strategy and maintaining performance in object-free scenarios. In five sets of simulated experiments, we demonstrate the generalisability of our proposed strategy. Our finite-state machine allows a sufficiently large swarm to transport objects around obstacles that block the goal. The method is robust to varying starting positions and can handle both concave and convex shapes.

2605.12997 2026-05-14 cs.LG

Frequency Bias and OOD Generalization in Neural Operators under a Variable-Coefficient Wave Equation

Runlong Xie, An Luo

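AI summary: This paper studies frequency bias and out-of-distribution (OOD) generalization of neural operators on a variable-coefficient wave equation. Comparing the Fourier Neural Operator (FNO) and the Deep Operator Network (DeepONet) under structured OOD settings, it finds that FNO's error rises sharply on high-frequency inputs, whereas DeepONet degrades more gradually. The study attributes the difference in generalization to how each architecture represents frequency structure, highlighting the limits of current neural operators under distribution shift and the importance of architectural design.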

Details
Abstract

Neural operators learn to map initial conditions to the terminal solution of partial differential equations (PDEs), providing a surrogate for the full operator mapping. This enables rapid prediction across different input configurations. While recent neural operator architectures have demonstrated strong performance on diverse PDE tasks, their behavior under structured distribution shifts remains insufficiently understood. To investigate this, we study operator learning in a wave propagation setting governed by a one-dimensional variable-coefficient wave equation, using two representative architectures, the Fourier Neural Operator (FNO) and the Deep Operator Network (DeepONet). To examine their generalization under distribution shifts, we consider structured out-of-distribution (OOD) settings that independently vary input frequency and coefficient smoothness. The results show that under smoothness shifts, both models maintain stable performance, with FNO achieving lower error. In contrast, under frequency shifts, FNO exhibits a sharp increase in error under unseen high-frequency inputs, whereas DeepONet shows milder degradation despite higher overall error. Our analysis reveals that these differences arise from how each architecture represents and responds to variations in frequency structure. Together, these findings highlight a fundamental gap between strong in-distribution performance and generalization under distribution shifts in operator learning, underscoring the role of architectural representation bias in developing more reliable neural operators for physics-based PDE simulations beyond the training distribution.

2605.12995 2026-05-14 cs.LG

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

Rohan Surana, Gagan Mundada, Junda Wu, Xintong Li, Yizhu Jiao, Bowen Jin, Sizhe Zhou, Tong Yu, Ritwik Sinha, Jiawei Han, Jingbo Shang, Julian McAuley

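AI summary: This paper proposes F-GRPO, a unified framework for optimizing generation and ranking, aimed at the utility misalignment caused by separating candidate generation from ranking in traditional retrieval systems. Using factorized group-relative policy optimization, it jointly optimizes candidate generation and ranking on a single language-model backbone, training with an order-invariant coverage reward and a position-aware utility reward. Experiments show that F-GRPO outperforms decoupled generate-then-rank methods and supervised models on several benchmarks, with no architectural changes at inference time.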

Details
Abstract

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.
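
The split between an order-invariant coverage signal and a position-aware utility signal can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the reward definitions (set recall and reciprocal rank), the function names, and the plain within-group standardization are stand-ins, not the rewards or objective used in the paper.

```python
from statistics import mean, pstdev


def coverage_reward(ranking, relevant):
    """Order-invariant: fraction of relevant items that appear anywhere in the output."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)


def utility_reward(ranking, relevant):
    """Position-aware: reciprocal rank of the first relevant item (0.0 if none)."""
    relevant = set(relevant)
    for position, item in enumerate(ranking, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts sampled for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def factorized_advantages(rollouts, relevant):
    """Compute one advantage per rollout for each phase (generation, ranking)."""
    gen_rewards = [coverage_reward(r, relevant) for r in rollouts]
    rank_rewards = [utility_reward(r, relevant) for r in rollouts]
    return group_relative_advantages(gen_rewards), group_relative_advantages(rank_rewards)


# Three hypothetical rollouts for one query whose only relevant item is "a".
rollouts = [["a", "b", "c"], ["c", "b", "a"], ["d", "e", "f"]]
gen_adv, rank_adv = factorized_advantages(rollouts, relevant=["a"])
```

In this toy group, the first two rollouts contain the same items and so receive identical generation advantages, while their ranking advantages differ because the relevant item sits at different positions. That is the kind of separation a single sequence-level reward cannot express.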

2605.12994 2026-05-14 cs.LG

DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

Jihwan Kim, Chenglin Fan

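AI summary: This paper proposes DP-Muon, a differentially private optimization method built on Muon, an optimizer based on matrix-orthogonalized momentum. It trains models privately through per-example gradient clipping and added Gaussian noise, followed by momentum and Newton-Schulz orthogonalization. The authors prove that DP-Muon inherits the privacy guarantee of the corresponding subsampled Gaussian accountant and that orthogonalization incurs no additional privacy cost. They also analyze how differential privacy affects Muon's optimization and propose a bias-corrected variant, DP-MuonBC, which further improves model utility under the same privacy guarantee.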

Comments 26 pages

Details
Abstract

We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton--Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton--Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton--Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton--Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.
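
The order of operations described above (clip each example, average, add noise, then apply momentum and orthogonalization as post-processing) can be sketched in plain Python on small matrices. This is a sketch under stated assumptions, not the paper's implementation: the noise standard deviation `noise_multiplier * max_norm / lot_size` follows the usual DP-SGD convention and is assumed here, the orthogonalization uses the classical cubic Newton-Schulz iteration rather than whatever polynomial Muon uses, and no privacy accounting is performed. All function names are hypothetical.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(v * v for row in a for v in row))


def scale(a, c):
    return [[c * v for v in row] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def clip(grad, max_norm):
    """Rescale one per-example matrix gradient so its Frobenius norm is at most max_norm."""
    norm = frobenius_norm(grad)
    return scale(grad, min(1.0, max_norm / norm)) if norm > 0 else grad


def private_lot_average(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip every example, average the lot, then add Gaussian noise to the average."""
    lot_size = len(per_example_grads)
    total = None
    for grad in per_example_grads:
        clipped = clip(grad, max_norm)
        total = clipped if total is None else add(total, clipped)
    average = scale(total, 1.0 / lot_size)
    std = noise_multiplier * max_norm / lot_size  # assumed DP-SGD-style calibration
    return [[v + rng.gauss(0.0, std) for v in row] for row in average]


def newton_schulz(m, steps=10):
    """Classical cubic iteration X <- 1.5 X - 0.5 X X^T X after Frobenius normalization."""
    norm = frobenius_norm(m)
    if norm == 0:
        return m
    x = scale(m, 1.0 / norm)
    for _ in range(steps):
        x = add(scale(x, 1.5), scale(matmul(matmul(x, transpose(x)), x), -0.5))
    return x


def dp_muon_step(weights, momentum, per_example_grads, rng,
                 lr=0.1, beta=0.9, max_norm=1.0, noise_multiplier=1.0):
    """One update: only private_lot_average touches raw per-example gradients."""
    noisy = private_lot_average(per_example_grads, max_norm, noise_multiplier, rng)
    momentum = add(scale(momentum, beta), noisy)  # post-processing of the noisy average
    update = newton_schulz(momentum)              # also post-processing
    return add(weights, scale(update, -lr)), momentum


rng = random.Random(0)
weights = [[0.0, 0.0], [0.0, 0.0]]
momentum = [[0.0, 0.0], [0.0, 0.0]]
grads = [[[3.0, 1.0], [1.0, 2.0]], [[0.5, 0.0], [0.0, 0.5]]]
weights, momentum = dp_muon_step(weights, momentum, grads, rng)
```

Because the momentum buffer and the orthogonalization see only the already-noised average, they are functions of a private output. This is the structural reason post-processing adds no privacy cost.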

2605.12988 2026-05-14 cs.AI cs.CY cs.IR

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram

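AI summary: This paper presents KITE, an intelligent tutoring system based on retrieval-augmented generation (RAG) that supports reasoning and problem solving in algorithm learning. KITE uses an intent-aware Socratic response strategy to give students targeted hints, guiding questions, and progressive scaffolding, and it uses multimodal retrieval to keep responses consistent with course content. Experiments show that KITE produces contextually grounded and pedagogically appropriate responses and improves the accuracy of simulated student models' follow-up answers on algorithm questions, contributing a tutoring architecture and an evaluation approach for algorithm education.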

Comments Paper accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), co-located with ACL 2026

Details
Abstract

Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students' algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE's feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.

2605.12983 2026-05-14 cs.LG cs.CC

Decision Tree Learning on Product Spaces

Arshia Soltani Moakahr, Faraz Ghahremani, Kiarash Banihashem, MohammadTaghi Hajiaghayi

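AI summary: This paper studies decision tree learning under product distributions and gives a theoretical analysis of the widely used top-down greedy heuristic. The authors extend the guarantees of Blanc et al. for the greedy method under the uniform distribution, proving that under arbitrary product distributions it still constructs an approximating decision tree whose size is at most exponential in the average depth and maximum depth of the optimal tree. They also present an algorithm that requires no prior parameters, making it more practical and more broadly applicable.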

Comments ICML 2026

Details
Abstract

Decision tree learning has long been a central topic in theoretical computer science, driven by its practical importance. A fundamental and widely used method for decision tree construction is the top-down greedy heuristic, which recursively splits on the most influential variable. Despite its empirical success, theoretical analysis of this heuristic has been limited. A recent breakthrough by Blanc et al. (ITCS, 2020) provided the first rigorous theoretical guarantees for the greedy approach, but only under the uniform distribution. We extend this analysis to the more general and practically relevant setting of arbitrary product distributions. Our main result shows that for any function $f$ computable by an optimal decision tree of size $s$, maximum depth $D_{\text{opt}}$, and average depth $\Delta_{\text{opt}}$, the greedy heuristic constructs an $\varepsilon$-approximating tree whose size grows at most with $\exp\bigl(\Delta_{\text{opt}} D_{\text{opt}} \log(e/\varepsilon)\bigr)$. In the special case where the optimal tree is a full binary tree, this bound improves upon the bound of Blanc et al. and holds under a strictly broader class of distributions. Moreover, we present an algorithm based on the top-down greedy heuristic that is entirely parameter-free -- it requires no prior knowledge of the optimal tree's size or depth -- offering a practical advantage over Blanc et al.'s method.
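
To make "recursively splits on the most influential variable" concrete, here is a small Python sketch of the top-down greedy heuristic. It makes simplifying assumptions that are not taken from the paper: inputs are uniform over {0,1}^n (not a general product distribution), influences are computed by exhaustive enumeration, and the recursion stops only when the restricted function is constant rather than at a size budget or an approximation threshold. Function names are hypothetical.

```python
from itertools import product


def consistent_inputs(n, restriction):
    """Yield every x in {0,1}^n that agrees with the partial assignment `restriction`."""
    free = [i for i in range(n) if i not in restriction]
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for i, value in restriction.items():
            x[i] = value
        for i, bit in zip(free, bits):
            x[i] = bit
        yield tuple(x)


def influence(f, n, i, restriction):
    """Fraction of consistent inputs on which flipping bit i changes f (uniform measure)."""
    flips = total = 0
    for x in consistent_inputs(n, restriction):
        y = list(x)
        y[i] ^= 1
        flips += f(x) != f(tuple(y))
        total += 1
    return flips / total


def greedy_tree(f, n, restriction=None):
    """Return a leaf value, or a tuple (variable, subtree_if_0, subtree_if_1)."""
    restriction = restriction or {}
    free = [i for i in range(n) if i not in restriction]
    scores = {i: influence(f, n, i, restriction) for i in free}
    if not free or max(scores.values()) == 0:
        # No free variable matters, so f is constant on this subcube.
        return f(next(consistent_inputs(n, restriction)))
    best = max(free, key=lambda i: scores[i])  # ties go to the lowest index
    return (best,
            greedy_tree(f, n, {**restriction, best: 0}),
            greedy_tree(f, n, {**restriction, best: 1}))


def evaluate(tree, x):
    """Follow the tree from the root to a leaf for input x."""
    while isinstance(tree, tuple):
        variable, low, high = tree
        tree = high if x[variable] else low
    return tree


def f(x):
    return int(x[0] and (x[1] or x[2]))


tree = greedy_tree(f, 3)
```

For this `f`, variable 0 has influence 3/4 at the root while variables 1 and 2 each have influence 1/4, so the heuristic queries variable 0 first and only then resolves the OR on the remaining branch.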

2605.12980 2026-05-14 cs.LG cs.AI

CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions

Tianbo Liu, Chixiang Lu, Jing Hao, Hengyu Zhang, Lifei Wang, Haibo Jiang, Xiaojuan Qi

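AI summary: Elucidating molecular structure from tandem mass spectra (MS/MS) is challenging, especially for de novo generation beyond database coverage. This paper proposes CoRe-Gen, which pretrains the encoder on synthetic spectra, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and combines structure-aware autoregressive decoding with chemical constraints, mitigating the generation bias caused by errors in predicted fingerprints. Experiments show that CoRe-Gen sets a new state of the art on NPLIB1 and remains competitive on MassSpecGym while retaining the efficiency of autoregressive decoding, offering a practical and scalable solution for spectrum-to-structure generation under realistic conditions.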

Details
Abstract

Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen, which explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.

2605.12978 2026-05-14 cs.AI

Useful Memories Become Faulty When Continuously Updated by LLMs

Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, Hao Peng

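AI summary: This paper studies the errors that can arise when large language models (LLMs) continuously update a memory. It finds that although memory consolidation can improve an agent's learning, memory utility first rises and then falls as updates proceed, and can drop below the no-memory baseline. Experiments show that even consolidating from correct solutions can degrade performance on later tasks, so memory updates should be handled carefully and raw experiences retained as key evidence to make agent memory more reliable.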

Details
Abstract

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.

2605.12975 2026-05-14 cs.AI

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Jiashuo Sun, Jimeng Shi, Yixuan Xie, Saizhuo Wang, Jash Rajesh Parekh, Pengcheng Jiang, Zhiyi Shi, Jiajun Fan, Qinglong Zheng, Peiran Li, Shaowen Wang, Ge Liu, Jiawei Han

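AI summary: This paper proposes PyRAG, an executable multi-hop reasoning framework that improves retrieval-augmented generation (RAG) on complex question answering. Unlike conventional natural-language reasoning, PyRAG turns the multi-hop reasoning process into an executable Python program that performs structured computation over retrieval and QA tools, making intermediate states explicit and feedback deterministic. Experiments show that PyRAG outperforms existing methods on several multi-hop QA datasets, with especially large gains on compositional tasks.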

Comments 32 pages, 20 figures, 4 tables

Details
Abstract

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data and models are publicly available at https://github.com/GasolSun36/PyRAG.
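
The idea of expressing a multi-hop question as a program over tools can be illustrated with a toy Python sketch. Nothing below comes from the PyRAG repository: the two-passage corpus, the `retrieve` and `extract_after` tools, and the `run_program` harness are all hypothetical stand-ins chosen so the example runs without a model or a search index.

```python
# Toy knowledge source standing in for a real retriever index.
CORPUS = {
    "Inception": "Inception is a 2010 film directed by Christopher Nolan.",
    "Christopher Nolan": "Christopher Nolan was born in London.",
}


def retrieve(entity):
    """Toy retriever: exact-match lookup that raises KeyError for unknown entities."""
    return CORPUS[entity]


def extract_after(passage, marker):
    """Toy QA tool: return the text that follows `marker`, minus the final period."""
    if marker not in passage:
        raise ValueError(f"marker {marker!r} not found")
    return passage.split(marker, 1)[1].rstrip(".")


# A program of the kind a code model might emit for
# "Where was the director of Inception born?"
PROGRAM = '''
hop1 = retrieve("Inception")
director = extract_after(hop1, "directed by ")
hop2 = retrieve(director)
answer = extract_after(hop2, "born in ")
'''


def run_program(source):
    """Execute a program and return (intermediate variables, error message or None)."""
    tools = {"retrieve": retrieve, "extract_after": extract_after}
    env = dict(tools)
    try:
        exec(source, env)
        error = None
    except Exception as exc:  # the exception text is the deterministic feedback
        error = f"{type(exc).__name__}: {exc}"
    trace = {k: v for k, v in env.items() if k not in tools and k != "__builtins__"}
    return trace, error


trace, error = run_program(PROGRAM)
broken_trace, broken_error = run_program(PROGRAM.replace("Inception", "Inceptoin"))
```

Each hop is a named variable in `trace`, so a drifted query or a failed lookup surfaces as a concrete exception rather than as a judgment the model makes about its own output. In a full system that error string would be fed back to the program generator for repair.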

2605.12967 2026-05-14 cs.CV

ImageAttributionBench: How Far Are We from Generalizable Attribution?

Tingshu Mou, Zhipeng Wei, Chao Gong, Jingjing Chen, Xingjun Ma

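AI summary: With the rapid progress of generative AI, synthetic images are becoming more realistic and diverse, which poses serious challenges for image provenance and misinformation detection. This paper introduces ImageAttributionBench, a comprehensive dataset of images synthesized by a wide range of advanced generative models, intended to drive research on more robust and generalizable image attribution. Experiments show that current mainstream attribution methods perform poorly on the dataset, exposing their limits under semantic shift and image degradation and providing a rigorous benchmark for future work.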

Details
Abstract

The rapid advancement of generative AI has enabled the creation of highly realistic and diverse synthetic images, posing critical challenges for image provenance and misinformation detection. This underscores the urgent need for effective image attribution. However, existing attribution datasets are constrained by limited scale, outdated generation methods, and insufficient semantic diversity, hindering the development of robust and generalizable attribution models. To address these limitations, we introduce ImageAttributionBench, a comprehensive dataset comprising images synthesized by a wide array of advanced generative models with state-of-the-art (SOTA) architectures. Covering multiple real-world semantic domains, the dataset offers rich diversity and scale to support and accelerate progress in image attribution research. To simulate real-world attribution scenarios, we evaluate several SOTA attribution methods on ImageAttributionBench under two challenging settings: (1) training on a standard balanced split and testing on degraded images, and (2) training and testing on semantically disjoint splits. In both cases, current methods exhibit consistently poor performance, revealing significant limitations in their robustness and generalization to unseen semantic content. Our work provides a rigorous benchmark to facilitate the development and evaluation of future image attribution methods.

2605.12966 2026-05-14 cs.AI

Position: Agentic AI System Is a Foreseeable Pathway to AGI

Junwei Liao, Shuai Li, Muning Wen, Jun Wang, Weinan Zhang

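AI summary: This paper questions whether scaling a single monolithic model is the only path to artificial general intelligence (AGI) and argues that agentic AI is a necessary paradigm for handling the complexity and heterogeneous distribution of real-world tasks. Through theoretical derivations, it contrasts the optimization constraints of monolithic learners with those of agentic systems, argues that agentic AI has an exponential advantage in generalization and sample efficiency, discusses the connection to Mixture-of-Experts, and calls for more research on agentic AI.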

Comments Accepted by ICML'26 Position Track

Details
Abstract

Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.

2605.12965 2026-05-14 cs.LG cs.NA math.NA

U-HNO: A U-shaped Hybrid Neural Operator with Sparse-Point Adaptive Routing for Non-stationary PDE Dynamics

Yingzhe Ma, Xiao Yang, Yuxin Xie, Zihan Xiong, Jinliang Liu

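AI summary: Solutions of partial differential equations (PDEs) often combine smooth global transport with sharp local features. To address this, the paper proposes U-HNO, a U-shaped hybrid neural operator. Its core component is Sparse-Point Adaptive Routing (SPAR), which uses a per-pixel hard mask to choose between a global Fourier branch and a local multi-scale Gaussian branch, so that global and local computation are mixed differently across regions. Experiments show that U-HNO achieves leading prediction accuracy on several PDE benchmarks, with the strongest results on problems with sharp local features.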

Comments 26 pages, 7 figures

Details
Abstract

Solutions to many partial differential equations (PDEs) display coexisting smooth global transport and localized sharp features within a single trajectory: shock fronts, thin interfaces, and concentrated high-frequency content sit on top of slowly varying backgrounds. This poses a challenge for neural operators: Fourier-based architectures mix nonlocal interactions efficiently but tend to under-resolve localized non-smooth features, whereas spatially local architectures recover fine detail at the cost of long-range propagation and rollout stability. Existing hybrid operators paper over this tension with a fixed, spatially uniform fusion that forces the same trade-off everywhere. We propose U-HNO, a U-shaped hybrid neural operator whose central design is Sparse-Point Adaptive Routing (SPAR): at every spatial location, a per-pixel hard mask selects whether the global Fourier branch or the local multi-scale Gaussian branch should dominate, and the sparsity ratio is a function of the local contrast of the routing signal, so smooth and shock-aligned regions receive different mixtures of global and local computation. SPAR is embedded in a hierarchical encoder-bottleneck-decoder backbone with skip connections so that the dual branches and the gate operate at every resolution. Training combines pointwise supervision with a finite-difference $H^1$ gradient term and a band-wise spectral consistency regularizer. Across benchmarks spanning 1D Burgers, Kuramoto-Sivashinsky, KdV, 2D advection, Allen-Cahn, Navier-Stokes, Darcy flow, and 3D transonic compressible Navier-Stokes from PDEBench, U-HNO achieves state-of-the-art rollout accuracy on the majority of tasks in both relative $L^2$ and $H^1$ metrics, with the largest gains on problems dominated by sharp localized features. Ablations show that removing any single component substantially degrades rollout error.

2605.12963 2026-05-14 cs.AI

Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

James M. Mazzu

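AI summary: As AI systems grow more capable, safety strategies must not only reduce present risk but also sustain safety once external control can no longer reliably constrain system behavior. This paper uses control theory to analyze, at a structural level, whether externally enforced safety strategies can succeed, and it establishes two main results. First, once the system's effects exceed what bounded external control can counteract, no strategy that depends on external control can sustain AI safety. Second, if any viable strategy remains, it must be intrinsic and must satisfy four structural requirements, such as a safety objective that stays stable under self-modification. The paper gives formal structure to widely held concerns about the limits of external control.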

Details
Abstract

As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.

2605.12957 2026-05-14 cs.CV

GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

Hanxin Zhu, Cong Wang, Peiyan Tu, Jiayi Luo, Tianyu He, Xin Jin, Zhibo Chen

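AI summary: This paper proposes GTA, a new image-to-3D world generation method that follows a geometry-first, appearance-second strategy to improve the structural accuracy and cross-view consistency of generated scenes. It uses two video diffusion models in two stages: the first generates coarse geometric structure, and the second synthesizes fine-grained appearance conditioned on the predicted geometry. The authors also introduce a random latent shuffle strategy and a test-time scaling scheme that further improve generation quality and perceptual consistency. Experiments show that GTA outperforms existing methods in fidelity, visual quality, and geometric accuracy, and that it can serve as a general enhancement module for existing generation pipelines.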

Details
Abstract

Recent developments in generative models and large-scale datasets have substantially advanced 3D world generation, facilitating a broad range of domains including spatial intelligence, embodied intelligence, and autonomous driving. While achieving remarkable progress, existing approaches to 3D world generation typically prioritize appearance prediction with limited modeling of the underlying geometry, leading to issues such as unreliable scene structure estimation and degraded cross-view consistency. To address these limitations, motivated by the coarse-to-fine nature of human visual perception, we propose GTA, a novel image-to-3D world generation method following a Geometry-Then-Appearance paradigm. Specifically, given a single input image, to improve the structural fidelity of synthesized 3D scenes, GTA adopts a two-stage framework with two dedicated video diffusion models, which first generate coarse geometric structure from novel viewpoints and then synthesize fine-grained appearance conditioned on the predicted geometry. To further enhance cross-view appearance consistency, we introduce a random latent shuffle strategy during the training process, along with a test-time scaling scheme that improves perceptual quality without compromising quantitative performance. Extensive experiments have demonstrated that our proposed method consistently outperforms existing approaches in terms of fidelity, visual quality, and geometric accuracy. Moreover, GTA is shown to be effective as a general enhancement module that further improves the generation quality of existing image-to-3D world pipelines, as well as supporting multiple downstream applications and exhibiting favorable data efficiency during model training, highlighting its versatility and broad applicability. Project page: https://hanxinzhu-lab.github.io/GTA/.

2605.12954 2026-05-14 cs.CV cs.AI

AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding

Xiao Yang, Yingzhe Ma, Haoxuan Yu, Zixin Li, Ning Qin

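AI summary: AdaFocus is an efficient long-video understanding framework that addresses the difficulty conventional methods have in balancing temporal coverage, visual detail, and computational efficiency. Through adaptive relevance-diversity sampling and a zero-cache look-back mechanism, it acquires evidence from the video progressively, reducing memory and compute overhead while preserving key visual details. Experiments show that AdaFocus achieves a better efficiency-accuracy trade-off than existing methods on multiple benchmark datasets, markedly improving performance on long-video understanding tasks.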

Comments 9 pages, 4 figures. Authors Xiao Yang and Yingzhe Ma contributed equally

English abstract

Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in memory, AdaFocus introduces an uncertainty-triggered refinement mechanism. It performs targeted look-back only when the model is not confident, retrieving high-resolution evidence directly from disk via a zero-cache I/O design. This turns discarded visual details from an irreversible loss into on-demand recoverable evidence without paying the cost of exhaustive preloading. Experiments on seven standard long-video benchmarks show that AdaFocus delivers a substantially better efficiency-accuracy trade-off than strong baselines. Compared with conventional dense encoding, AdaFocus achieves improved task performance (e.g., +2.59 accuracy on VideoMME, +8.39 mIoU on Charades-STA over single-pass inference) while reducing visual token consumption by ~33x and eliminating the need for in-memory frame pre-caching through its zero-cache disk retrieval design. These findings suggest that progressive preview combined with zero-cache evidence refinement is a highly effective paradigm for scalable multimedia reasoning.

2605.12953 2026-05-14 cs.CV cs.AI

Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu

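AI summary: This paper proposes Seg-Agent, a new training-free framework for language-guided segmentation that aims to remove the dependence of conventional methods on large amounts of training data. By constructing an explicit multimodal reasoning loop, it lets a large language model reason interactively in the visual domain and thereby directly generate and refine segmentation results. The study also introduces the Various-LangSeg benchmark for comprehensively evaluating generalization across scenarios; experiments show that Seg-Agent reaches the performance level of advanced training-based methods without any parameter updates.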

English abstract

Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.

2605.12952 2026-05-14 cs.CV

Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation

Yongjin Cui, Xiaohui Fan

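AI summary: This paper gives a comprehensive analysis of Grad-ECLIP, a method published at ICML 2024, and argues that it is not a new technical route based on intermediate features but is equivalent to an attention-based interpretation method that is simpler to compute. The study further exposes flaws in Grad-ECLIP, showing that the interpretations it generates are inconsistent with the actual behavior of the original model, and proposes two fundamental principles that model interpretation should follow to avoid similar errors.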

English abstract

Grad-ECLIP was published at ICML 2024 and represents a new technical route for Transformer interpretation, one based on intermediate features. First, this paper demonstrates that the intermediate-feature-based route is not novel. Building on the existing attention-based route, we develop Attention-ECLIP, which is completely equivalent to Grad-ECLIP but simpler to compute. Through both formal derivation and experimental validation, we prove that the intermediate-feature-based route represented by Grad-ECLIP is in fact an equivalent variant of the attention-based route. Next, this paper demonstrates that the Grad-ECLIP method is flawed: the interpretation results it produces are not those of the original model, and they are misaligned with the model's performance. We analyze the causes of these flaws and propose, or rather explicitly emphasize, two fundamental principles that model interpretation should adhere to in order to avoid similar errors.

2605.12945 2026-05-14 cs.LG

Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model

Hongmin Li

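AI summary: This study examines, in a minimal model, how shortcut features relate to cross-family out-of-distribution (OOD) failure. Using a binary classification model with one invariant coordinate and one family-dependent shortcut coordinate, it shows that in the deterministic regime positive shortcut correlation pulls empirical risk minimization (ERM) toward the shortcut feature, but ridge regularization keeps the classifier reliant on the invariant feature and so avoids deterministic OOD failure. When the invariant coordinate is noisy, the model switches to the shortcut rule once the training shortcut signal exceeds the invariant signal, and whether this causes failure depends on the properties of the test family. The model cleanly separates the mechanisms of shortcut attraction, shortcut-rule transition, and cross-family OOD failure.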

Comments 14 pages, 3 figures

English abstract

Shortcut features are often invoked to explain out-of-distribution (OOD) failure, but training correlation, learned shortcut use, and test-time failure need not coincide. We study a minimal binary model with one invariant coordinate and one family-dependent shortcut coordinate. In the deterministic regime, positive average shortcut correlation pulls logistic ERM toward positive shortcut weight, but ridge regularization keeps the classifier invariant-dominated and prevents deterministic OOD failure. When the invariant coordinate is noisy, ridge-logistic ERM switches to the shortcut rule once the training shortcut signal exceeds the invariant signal. Whether that transition causes failure depends on the held-out family: weaker shortcut correlation yields positive excess risk, and sign-flipped families yield above-chance error. Synthetic checks match these analytic regimes and show that the same training-side transition can have different held-out consequences. The model separates shortcut attraction, shortcut-rule transition, and cross-family OOD failure.
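The deterministic regime described above can be illustrated with a small numerical sketch. It assumes a simplified population version of the model, which is not necessarily the paper's exact setup: the invariant coordinate equals the label, the shortcut coordinate agrees with the label with probability `p_agree`, and the ridge-logistic objective is minimized by plain gradient descent. The names `p_agree`, `lam`, and `fit_ridge_logistic`, and the chosen values, are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logistic(p_agree, lam, lr=0.5, steps=5000):
    """Gradient descent on the population ridge-logistic objective

        p * l(w1 + w2) + (1 - p) * l(w1 - w2) + lam / 2 * (w1^2 + w2^2),

    where l(m) = log(1 + exp(-m)) and the two margins correspond to points
    whose shortcut agrees with (w1 + w2) or contradicts (w1 - w2) the label.
    This two-point-type setup is an assumption made for illustration.
    """
    w1 = w2 = 0.0
    for _ in range(steps):
        g_agree = -sigmoid(-(w1 + w2))  # l'(m) at m = w1 + w2
        g_flip = -sigmoid(-(w1 - w2))   # l'(m) at m = w1 - w2
        grad1 = p_agree * g_agree + (1 - p_agree) * g_flip + lam * w1
        grad2 = p_agree * g_agree - (1 - p_agree) * g_flip + lam * w2
        w1 -= lr * grad1
        w2 -= lr * grad2
    return w1, w2

# Positive average shortcut correlation (p_agree > 0.5) with ridge penalty lam.
w1, w2 = fit_ridge_logistic(p_agree=0.9, lam=0.1)

# Margin of a point from a sign-flipped family (shortcut = -label).
flipped_margin = w1 - w2
```

In this sketch the shortcut weight `w2` comes out positive but smaller than the invariant weight `w1`, so `flipped_margin` stays positive and a point from a sign-flipped family is still classified correctly, consistent with the claim that ridge regularization keeps the classifier invariant-dominated in the deterministic regime.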

2605.12944 2026-05-14 cs.LG cs.CL

From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning

Haodong Wu, Jiahao Zhang, Lijie Hu, Yongqi Zhang

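AI summary: This study addresses data selection for supervised fine-tuning (SFT) and proposes a new fixed-pool data recipe search approach that aims to construct high-quality training subsets from a raw instruction pool. Unlike conventional instance-ranking methods, it composes a sequence of filtering, mixing, and deduplication operations into a data recipe that shapes the data distribution. The study introduces the AutoSelection algorithm, which decouples task, data, and model signals and combines warmup probes, local recipe edits, and Gaussian-process-assisted ranking to search efficiently for the best data recipe under a limited budget of full evaluations; experiments show that it outperforms existing methods across multiple models and tasks.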

English abstract

Supervised fine-tuning (SFT) data selection is commonly formulated as instance ranking: score each example and retain a top-$k$ subset. However, effective SFT training subsets are often produced through ordered curation recipes, where filtering, mixing, and deduplication operators jointly shape the final data distribution. We formulate this problem as fixed-pool data recipe search: given a raw instruction pool and a library of grounded operators, the goal is to discover an executable recipe that constructs a high-quality selected subset under a limited budget of full SFT evaluations, without generating, rewriting, or augmenting training samples. We introduce AutoSelection, a two-layer solver that decouples fixed-pool materialization based on cached task-, data-, and model-side signals from expensive full evaluation, using warmup probes, realized subset states, local recipe edits, Gaussian-process-assisted ranking, and stagnation-triggered reseeding. Experiments on a 90K instruction pool show that AutoSelection achieves the strongest in-distribution reasoning average across three base models, outperforming full-data training, random recipe search, random top-$k$, and single-operator selectors. Additional out-of-distribution graph-reasoning results, search-stability analyses, structural ablations, and 1.5B-to-7B transfer checks further show that recipe structure matters beyond individual selection operators. Code is available at https://github.com/w253/AutoSelection.

2605.12943 2026-05-14 cs.LG

Reinforced Collaboration in Multi-Agent Flow Networks

Zheng Wang, Yuang Liu, Yangkai Ding

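AI summary: Multi-agent systems offer an effective way to extend large language models by decomposing a complex task into multiple subtasks. However, error propagation between subtasks and poorly designed collaboration workflows often degrade overall performance. To address this, the paper proposes the MANGO framework, which builds a flow network from past successful workflows, combines reinforcement learning with textual gradients to jointly optimize workflow paths and agent behaviors, and introduces a skipping mechanism to improve efficiency. Experiments show that MANGO improves performance by up to 12.8% and efficiency by 47.4% on multiple benchmarks, and generalizes well to unseen domains.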

English abstract

Multi-agent systems provide a powerful way to extend large language models (LLMs) by decomposing a complex task into specialized subtasks handled by different agents. However, their performance is often hindered by errors that arise from suboptimal workflow design or inaccurate agent outputs, propagate through the agent collaboration process, and degrade final results. To address these challenges, we present MANGO (Multi-Agent Network Gradient Optimization), a data-driven framework that organizes and refines agent collaboration via a flow network constructed from past successful workflows. MANGO integrates reinforcement learning and textual gradients to jointly optimize workflow paths and agent behaviors, while a skipping mechanism improves efficiency by preventing redundant updates to well-optimized agents. Extensive experiments on seven benchmarks show that MANGO achieves up to 12.8% performance improvement over state-of-the-art baselines, enhances efficiency by 47.4%, and generalizes effectively to unseen domains. Our code and datasets are publicly available at https://github.com/openJiuwen-ai/agent-store/tree/main/community/mango.

2605.12940 2026-05-14 cs.LG cs.AI

The Expressivity Boundary of Probabilistic Circuits: A Comparison with Large Language Models

Zhiyu Zhao, Xuejie Liu, Muhan Zhang, Anji Liu

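AI summary: This paper studies the expressivity boundary of probabilistic circuits (PCs) as generative language models and compares them with Transformer-based large language models (LLMs). It finds that PCs still fall short in expressivity for autoregressive language modeling, limited mainly by their output parameterization and context-encoding structure. By introducing a logit-space parameterization and analyzing the dependency-topology limits of structured-decomposable PCs, the authors identify key differences between PCs and LLMs and prove that decomposable PCs are in theory more expressive, although optimizing them effectively remains a challenge.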

English abstract

Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation and identify two bottlenecks. First, an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggle to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. Second, a context-encoding bottleneck: we prove that structured-decomposable PCs can match Transformer separation rank on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.
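The output bottleneck can be illustrated with a toy comparison. The sketch below reflects one reading of the abstract, not the paper's actual parameterization: it mixes two hypothetical next-token distributions `p1` and `p2` once as a convex combination in probability space and once as a weighted sum of log-probabilities followed by a softmax. The function names and the numbers are assumptions chosen for illustration.

```python
import math

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def prob_space_mix(components, weights):
    """Convex combination of distributions: sum_k w_k * p_k."""
    n = len(components[0])
    return [sum(w * p[i] for w, p in zip(weights, components)) for i in range(n)]

def logit_space_mix(components, weights):
    """Softmax of weighted log-probabilities, proportional to prod_k p_k ** w_k."""
    n = len(components[0])
    logits = [
        sum(w * math.log(p[i]) for w, p in zip(weights, components))
        for i in range(n)
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    return normalize([math.exp(l - top) for l in logits])

# Two hypothetical component distributions that agree on the most likely token.
p1 = [0.6, 0.3, 0.1]
p2 = [0.6, 0.1, 0.3]
weights = [0.5, 0.5]

mixed_prob = prob_space_mix([p1, p2], weights)
mixed_logit = logit_space_mix([p1, p2], weights)
```

A convex combination can never put more mass on a token than its most confident component does, so `max(mixed_prob)` cannot exceed 0.6 here, whereas the logit-space mixture concentrates more mass on the token both components favor. This is only a two-component toy; how the effect plays out inside a full circuit is not something the sketch establishes.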

2605.12939 2026-05-14 cs.CV

DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport

Xianbing Sun, Jiahui Zhan, Liqing Zhang, Jianfu Zhang

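AI summary: This paper proposes DirectTryOn, a one-step virtual try-on method that generates efficiently via straightened conditional transport. Based on the observation that the virtual try-on task is strongly constrained by its conditional inputs, it introduces pure conditional transport, a garment preservation loss, and a self-consistency loss to make the generation trajectory more direct, enabling single-step generation. Experiments show that the method substantially reduces inference cost while maintaining generation quality, reaching state-of-the-art performance.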

English abstract

Recent diffusion- and flow-based virtual try-on (VTON) methods achieve strong results with pretrained generative models, but their reliance on multi-step sampling incurs high inference cost, while existing acceleration methods largely overlook the intrinsic structure of the try-on task. In this paper, we highlight a key observation: VTON outputs are highly constrained by the conditional inputs, suggesting that the conditional sampling trajectory can be much straighter than that in general image generation, making one-step generation a natural solution. However, limited task-specific data makes training from scratch impractical, forcing existing methods to fine-tune pretrained models whose objectives do not encourage such straight conditional trajectories. Thus, the deviation from an ideal straight path mainly comes from the mismatch between pretrained base models and the conditional nature of try-on generation, rather than from the task itself. Motivated by this insight, we encourage straighter VTON sampling trajectories through three targeted modifications: pure conditional transport, a garment preservation loss, and a self-consistency loss. We further introduce a one-step distillation stage. Extensive experiments show that our method achieves state-of-the-art performance with one-step sampling, establishing a new standard for efficient and high-quality VTON.

2605.12938 2026-05-14 cs.CV cs.AI cs.LG

CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation

Seonghyun Jin, Youngmin Kim, Sunwoo Park, Jong Chul Ye

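AI summary: This paper proposes CRePE, a curved ray expectation positional encoding for unified-camera-controlled video generation. To address the shortcomings of existing methods with complex camera configurations such as wide-angle and fisheye lenses, CRePE introduces a depth-aware positional distribution that captures the projected-path geometry induced by wide-field-of-view cameras, improving the stability of camera control and generation quality. The method combines a Geometric Attention Adapter with pseudo supervision from a monocular geometry foundation model, supports a variety of camera models, and performs well on several geometry-aware and perceptual-quality metrics.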

Comments 17 pages, 8 figures, Under review

English abstract

Camera-conditioned video generation requires positional encoding that remains reliable under changes in camera motion, lens configuration, and scene structure. However, existing attention-level camera encodings either provide ray-only camera signals or rely on pinhole camera geometry, limiting their applicability to general camera control under the Unified Camera Model, including wide-angle and fisheye lenses. To address this limitation, we propose Curved Ray Expectation Positional Encoding (CRePE). CRePE represents each image token as a depth-aware positional distribution along its source ray, providing a Unified Camera Model-compatible positional encoding that captures the projected-path geometry induced by wide-angle and fisheye cameras. CRePE is implemented through a Geometric Attention Adapter added to frozen video DiTs, injecting token-wise scene-distance information into selected attention layers and stabilizing it with pseudo supervision from a monocular geometry foundation model. This design leads to more stable camera control and improves several geometry-aware and perceptual-quality metrics, while remaining competitive on video-quality metrics. Controlled positional-encoding ablations show a better overall average rank than a RayRoPE-style endpoint PE baseline, demonstrating the effectiveness of UCM-aware projected-path integration across diverse camera models. Furthermore, by extending the same positional-encoding pathway to external geometry control through Radial MixForcing, CRePE supports external radial-map control for scene-geometry-conditioned generation and source-video motion transfer beyond camera control.

2605.12937 2026-05-14 cs.CV cs.AI cs.HC

AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters

Jacob Lagogiannis, William Agnew, Rosa I. Arriaga, Sauvik Das

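AI summary: This paper proposes AuraMask, an extensible pipeline for developing anti-facial-recognition image filters that are both adversarially effective and aesthetically acceptable. By emulating popular one-click Instagram filters, the method produces 40 visually pleasing filters that match or exceed the effectiveness of existing methods against open-source facial recognition models. Experiments show that these filters also achieve significantly higher user acceptance than prior methods, providing an effective tool for further research on privacy protection.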

Comments 21 pages, 10 figures

English abstract

Anti-facial recognition (AFR) image filters alter images in ways that are subtle to people but blinding to computer vision. Yet, despite widespread interest in these technologies to subvert surveillance, users rarely use them in practice, because the "subtle" alterations are visible enough to conflict with users' self-presentation goals. To address this challenge, we propose AuraMask: a novel approach to creating AFR filters that are both adversarially effective and aesthetically acceptable. Using AuraMask, we produce 40 "aesthetic" filters that emulate popular "one-click" Instagram image filters. We show that AuraMask filters meet or exceed the adversarial effectiveness of prior methods against open-source facial recognition models. Moreover, in a controlled online user study ($N=630$) we confirm these filters achieve significantly higher user acceptance than prior methods. Lastly, we provide our AFR pipeline to the community for accelerated research in adversarially effective and aesthetically acceptable protections.

2605.12933 2026-05-14 cs.CL

ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

Shohei Higashiyama, Hiroki Ouchi, Atsushi Fujita, Masao Utiyama

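AI summary: This paper introduces ATD-Trans, a geographically grounded Japanese-English parallel corpus of travelogues, intended to support equitable multilingual access to geographic information and the evaluation of machine translation quality. The dataset covers geo-entities in regions within Japan and overseas and can be used to analyze how different language models perform on the translation task. The study finds that Japanese-enhanced models have an advantage and that geo-entities in domestic (within-Japan) regions are harder to translate.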

English abstract

Geographic text, or textual data rich in geographic (geo-) information is a valuable source for various geographic applications, e.g., tourism management. Making such information accessible to speakers of other languages further enhances its utility; thus, accurate machine translation (MT) is essential for equity in multilingual geo-information access. To facilitate in-depth analysis for geographic text, we introduce ATD-Trans, a geographically grounded Japanese--English travelogue translation dataset, which enables evaluation of MT quality at both the overall and geo-entity levels across domestic (within Japan) and overseas regions. Our experiments on existing language models examine two factors: model language focus and geographic regions. The results highlight advantages of Japanese-enhanced models and greater difficulty in translating domestic-region geo-entities mentioned in travel blogs.

2605.12928 2026-05-14 cs.LG

The Efficiency Gap in Byte Modeling

Celine Lee, Jing Nathan Yan, Chen Liang, Jiaxin Shi, Yin Zhang, Jeremiah Liu, Pengcheng Yin, Fernando Pereira, Ed Chi, Derek Cheng, Alexander M. Rush, Ruoxi Wang

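AI summary: This paper studies the compute-efficiency disadvantage of byte-level language models, comparing how the cost of byte modeling scales for autoregressive models and for masked diffusion models. Through compute-matched scaling experiments, it finds that the performance penalty of byte modeling is more pronounced for masked diffusion models, because the masked diffusion objective lacks the local contiguity needed to resolve semantics from raw bytes efficiently. The study concludes that future byte-level modeling will need alternative structural priors to remain scalable.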

English abstract

Modern language models have historically relied on two dominant design choices: subword tokenization and autoregressive (AR) ordering. These design decisions bake in priors that dictate a model's learning. Recently, two alternative paradigms have challenged this: byte-level modeling, which bypasses static statistically-derived token vocabularies, and masked diffusion modeling (MDM), which conducts parallel, non-sequential generation. Their intersection represents a fully end-to-end modality-agnostic generative prototype; however, removing these structural priors incurs a significant computational cost. In this work, we investigate this cost through a compute-matched scaling study. Our results reveal that the performance penalty of byte modeling is not uniform; across scale, the scaling overhead of byte modeling is worse for MDM than for AR. We hypothesize that this disparity stems from context fragility: while AR's stable causal history allows models to naturally rediscover subword patterns, the MDM objective destroys the local contiguity required to efficiently resolve semantics from raw bytes. Our findings from controlled permutation experiments suggest that future modality-agnostic designs must incorporate alternative structural biases to maintain viable scaling trajectories in the byte regime.

2605.12924 2026-05-14 cs.LG

IV-ICL: Bounding Causal Effects with Instrumental Variables via In-Context Learning

Vahid Balazadeh, Hamidreza Kamkari, Medha Barath, Ricardo Silva, Rahul G. Krishnan

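AI summary: This paper proposes IV-ICL, an instrumental-variable method for bounding causal effects that uses in-context learning to learn the marginal posterior distribution of the causal effect directly and derives bounds from its quantiles. Compared with conventional approaches, IV-ICL removes the need for hand-designed estimators while avoiding high computational cost and prior sensitivity, and covers the identified set more accurately across a variety of data-generating processes. Experiments on synthetic and semi-synthetic datasets show that the method is more reliable and more informative, with markedly faster inference than existing methods.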

English abstract

The instrumental-variables (IV) setting is standard for partial identification of causal effects when unobserved confounding makes point identification impossible. Existing approaches face methodological bottlenecks: closed-form bound estimands are required (e.g., the Balke-Pearl equations in binary IV), and even when available, designing accurate estimators requires manual effort tailored to each estimand. While direct Bayesian inference of the causal effects, instead of the bounds, circumvents these challenges, it is often computationally intensive and suffers from high prior sensitivity or under-dispersed posteriors. As a remedy, we introduce IV-ICL, an amortized Bayesian in-context learning method that learns the marginal posterior distribution of the causal effects directly and derives bounds as its quantiles. Unlike standard variational inference, which optimizes the exclusive KL divergence, amortized Bayesian inference minimizes the expected inclusive KL, a mass-covering objective. We empirically observe that optimizing inclusive KL can recover the entire identified set across diverse data-generating processes, while exclusive-KL optimization (e.g., with variational inference) of the same Bayesian formulation collapses onto a single mode and fails to cover the identified set. We evaluate IV-ICL on synthetic and semi-synthetic IV benchmarks and show it produces intervals that are more reliably valid and more informative compared to efficient semi-parametric, Bayesian, and plug-in baselines, at 20-500x lower inference time. Beyond methodology, we propose a procedure to convert randomized controlled trials into IV benchmarks with provably preserved ground-truth causal effects that enables a more realistic evaluation of partial-identification methods.
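The step of deriving bounds as posterior quantiles can be sketched as follows. This is a minimal illustration assuming that draws from the marginal posterior of the causal effect are already available; the in-context model that would produce them is not shown, and `bounds_from_posterior`, `alpha`, and the stand-in samples are hypothetical choices, not the paper's implementation.

```python
def quantile(sorted_xs, q):
    """Linearly interpolated quantile of an ascending list, for 0 <= q <= 1."""
    pos = q * (len(sorted_xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_xs) - 1)
    frac = pos - lo
    return sorted_xs[lo] * (1 - frac) + sorted_xs[hi] * frac

def bounds_from_posterior(samples, alpha=0.1):
    """Return the (alpha/2, 1 - alpha/2) quantiles as lower and upper bounds."""
    xs = sorted(samples)
    return quantile(xs, alpha / 2), quantile(xs, 1 - alpha / 2)

# Stand-in for posterior draws of a causal effect, evenly spread on [-0.5, 0.5].
samples = [i / 100 - 0.5 for i in range(101)]
lower, upper = bounds_from_posterior(samples, alpha=0.1)
```

The quality of such bounds depends entirely on the posterior covering the identified set, which is why the abstract stresses the mass-covering inclusive-KL objective: a posterior collapsed onto one mode would yield quantiles that are too narrow.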

2605.12922 2026-05-14 cs.AI cs.CL

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

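AI summary: This paper studies how large language models gradually lose track of task goals, persona, and rules over multi-turn conversations. The authors propose a channel-transition mechanism in which goal-defining tokens become progressively harder to access through attention while the relevant information may persist in residual representations. Using the Goal Accessibility Ratio (GAR) together with residual-stream probes, the study reveals diverse failure modes across models after attention closes and shows the importance of residual representations for predicting task performance.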

English abstract

Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.