arXivDaily: arXiv daily academic digest, updated Monday through Friday
2512.12602 2026-05-11 cs.LG

Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

Jingdi Lei, Di Zhang, Soujanya Poria

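AI summary This paper proposes Exact Flow Linear Attention (EFLA), an exact-flow linear attention mechanism that eliminates discretization error by replacing the Euler-discretized update of conventional delta-rule linear attention with an exact closed-form solution. By exploiting the rank-1 structure of the dynamics matrix, the method preserves the parameter count, linear-time complexity, and chunkwise parallelism, while improving stability under noisy and high-energy inputs. Experiments show that EFLA is more robust and performs better on multiple benchmarks than conventional state space models and Euler-style baselines.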

Comments 16 pages, 5 figures

Abstract

In this paper, we introduce Exact Flow Linear Attention (EFLA), an exact-flow formulation of delta-rule linear attention. We show that the delta-rule update can be interpreted as an explicit Euler discretization of an underlying continuous-time system. EFLA replaces this first-order update with the exact closed-form flow. By exploiting the rank-1 structure of the dynamics matrix, both the matrix exponential and the input integral collapse to a simple update that preserves delta-rule linear attention's algebraic structure, parameter count, linear-time complexity, and chunkwise parallelism. This attention mechanism removes the Euler discretization error of the delta-rule dynamics without introducing additional parameters. Experiments on robustness tests, language modeling benchmarks, and the MAD synthetic benchmark show that EFLA improves stability under corrupted and high-energy inputs, reduces perplexity, and achieves stronger downstream performance compared to SSM and Euler-style baselines. These results establish exact-flow integration as a principled and scalable update mechanism for delta-rule linear attention.
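
To make the Euler-versus-exact distinction concrete, here is a minimal NumPy sketch; it is not the authors' implementation. It assumes the continuous-time system dS/dt = -k kᵀ S + k vᵀ with state S of shape (d_k, d_v) and step size `beta`, which is one common way to write the delta rule, and the function names are hypothetical. Under that assumption, k kᵀ has a single nonzero eigenvalue λ = kᵀk, so the matrix exponential reduces to I - α k kᵀ and the input integral to α k vᵀ, with α = (1 - exp(-βλ)) / λ.

```python
import numpy as np

def euler_delta_step(S, k, v, beta):
    """One explicit Euler step of dS/dt = -k k^T S + k v^T (assumed delta-rule ODE)."""
    residual = v - S.T @ k                  # prediction error, shape (d_v,)
    return S + beta * np.outer(k, residual)

def exact_flow_step(S, k, v, beta):
    """Exact solution of the same ODE over a step of length beta."""
    lam = float(k @ k)                      # only nonzero eigenvalue of k k^T
    if lam < 1e-12:
        alpha = beta                        # limit of (1 - exp(-beta*lam)) / lam
    else:
        alpha = -np.expm1(-beta * lam) / lam
    residual = v - S.T @ k
    return S + alpha * np.outer(k, residual)

rng = np.random.default_rng(0)
S0 = rng.normal(size=(4, 3))
k = np.array([2.0, -1.0, 0.5, 1.5])         # "high-energy" key: k @ k = 7.5
v = rng.normal(size=3)

# Reference: many tiny Euler steps approximate the continuous-time flow.
n_sub = 20000
S_ref = S0.copy()
for _ in range(n_sub):
    S_ref = euler_delta_step(S_ref, k, v, 1.0 / n_sub)

S_exact = exact_flow_step(S0, k, v, 1.0)    # one exact step with beta = 1
S_euler = euler_delta_step(S0, k, v, 1.0)   # one Euler step with beta = 1
print("exact vs reference:", np.abs(S_exact - S_ref).max())
print("euler vs reference:", np.abs(S_euler - S_ref).max())
```

In this sketch the exact step scales the prediction error by exp(-βλ), which always shrinks it, whereas the single Euler step scales it by (1 - βλ), which grows in magnitude once βλ exceeds 2. Both updates have the same rank-1 form and differ only in the scalar step coefficient.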

2512.09629 2026-05-11 cs.AI cs.LG

End-to-end PDDL Planning with Hardcoded and Dynamic Agents

Emanuele La Malfa, Ping Zhu, Samuele Marro, Sara Bernardini, Michael Wooldridge

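AI summary This paper proposes an end-to-end, verifier-supported planning framework that converts natural-language human instructions into PDDL planning models. The framework includes two kinds of agents: hardcoded agents, informed by logs and error traces, that handle issues such as syntax errors and temporal constraints; and dynamic agents that adapt to the specific domain and revise the latent planning abstraction. The framework requires no human intervention, is driven by large language models, and has shown flexibility and effectiveness on several classic planning problems and benchmarks.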

Comments Code: https://github.com/EmanueleLM/MultiAgentPlanning

Abstract

We present an end-to-end framework for planning supported by verifiers. An orchestrator receives a human specification written in natural language and converts it into a PDDL (Planning Domain Definition Language) model, where the domain and problem are iteratively refined by sub-modules (agents) to address common planning requirements, such as time constraints and optimality, as well as ambiguities and contradictions that may exist in the human specification. We support two categories of agents: hardcoded, which are informed by logs and error traces and have a pre-defined goal (e.g., fix issues with PDDL syntax, check temporal constraints), and dynamic, which have no predefined goal but adapt to the specific domain and revise the latent planning abstraction. The validated domain and problem are then passed to an external planning engine to generate a plan. The orchestrator and agents are powered by Large Language Models (LLMs) and require no human intervention at any stage of the process. Finally, a module translates the final plan back into natural language to improve human readability while maintaining the correctness of each step. We demonstrate the flexibility and effectiveness of our framework on GPT-{4o, 5-mini, 5.4} and Gemini-{2.5, 3}-flash across more than ten domains and tasks, including the Google NaturalPlan benchmark, Planbench, and classic planning problems like Sokoban, Blocksworld and the Tower of Hanoi, where LLMs are known to struggle even with small instances. Our framework can be integrated with any PDDL planning engine and validator (we successfully tested Fast Downward, LPG, POPF, VAL, and uVAL) and represents a significant step toward end-to-end planning aided by LLMs.

2512.03479 2026-05-11 cs.CV

ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos

Wenliang Guo, Yu Kong

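AI summary This work proposes ProcObject-10K, the first benchmark for evaluating object-centric procedural understanding in instructional videos, addressing the tendency of existing benchmarks to focus on actions and overlook how object states evolve. The benchmark contains 10,522 open-ended question-answer pairs covering 9 domains and 137 tasks, and evaluates capabilities such as preconditions, state evolution, and counterfactual reasoning. Experiments show that current leading models produce plausible answers yet struggle to localize the supporting evidence, exposing their reliance on linguistic priors rather than fine-grained object dynamics. The work also provides an object-centric supervised fine-tuning method that improves performance on this benchmark and on related tasks.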

Abstract

Procedural activities are fundamentally driven by object state transitions, yet existing instructional video benchmarks remain action-centric and cannot evaluate whether models reason about how objects evolve toward task completion. In this work, we introduce ProcObject-10K, the first benchmark that jointly evaluates object-centric reasoning and temporal evidence grounding in instructional videos, across both egocentric and exocentric views. It comprises 10,522 open-ended VideoQA pairs grounded in 1,799 video clips, spanning 137 tasks across 9 domains and five reasoning types covering preconditions, state evolution, counterfactuals, mistakes, and readiness. Benchmarking 13 leading MLLMs reveals a substantial answering-grounding gap: models produce plausible answers while failing to localize the supporting evidence (mIoU < 45%), exposing their reliance on linguistic priors rather than fine-grained object dynamics. As a step toward closing this gap, we further provide an object-centric supervised fine-tuning baseline with pseudo object-level supervision and spatial-temporal constraints. Models fine-tuned on ProcObject-10K not only improve on the benchmark itself, but also transfer effectively to other grounded VideoQA and embodied planning tasks. The dataset, annotations, and evaluation toolkit will be publicly released to support future research on object-centric procedural understanding.

2512.02991 2026-05-11 cs.CV

GraphFusion3D: Dynamic Graph Attention Convolution with Adaptive Cross-Modal Transformer for 3D Object Detection

Md Sohag Mia, Md Nahid Hasan, Muhammad Abdullah Adnan

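AI summary GraphFusion3D is a unified framework for 3D object detection that targets the challenges of sparse point-cloud data, incomplete structures, and limited semantic information. It introduces an Adaptive Cross-Modal Transformer (ACMT) to fuse image information and a Graph Reasoning Module (GRM) to model local geometric and global semantic relationships in point clouds, improving detection performance. Experiments show that GraphFusion3D achieves substantial gains on multiple benchmark datasets.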

Abstract

Despite significant progress in 3D object detection, point clouds remain challenging due to sparse data, incomplete structures, and limited semantic information. Capturing contextual relationships between distant objects presents additional difficulties. To address these challenges, we propose GraphFusion3D, a unified framework combining multi-modal fusion with advanced feature learning. Our approach introduces the Adaptive Cross-Modal Transformer (ACMT), which adaptively integrates image features into point representations to enrich both geometric and semantic information. For proposal refinement, we introduce the Graph Reasoning Module (GRM), a novel mechanism that models neighborhood relationships to simultaneously capture local geometric structures and global semantic context. The module employs multi-scale graph attention to dynamically weight both spatial proximity and feature similarity between proposals. We further employ a cascade decoder that progressively refines detections through multi-stage predictions. Extensive experiments on SUN RGB-D (70.6% AP$_{25}$ and 51.2% AP$_{50}$) and ScanNetV2 (75.1% AP$_{25}$ and 60.8% AP$_{50}$) demonstrate a substantial performance improvement over existing approaches.

2512.00164 2026-05-11 cs.LG cs.PL

Faster Verified Explanations for Neural Networks

Alessandro De Palma, Greta Dolcetti, Caterina Urban

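AI summary This paper proposes FaVeX, a new algorithm that accelerates the computation of verified explanations for neural networks. It improves efficiency by dynamically combining batch and sequential processing of input features and by reusing information from previous queries. The authors also propose a hierarchical definition of verified explanations, verifier-optimal robust explanations, which accounts for the incompleteness of network verifiers. Experiments show that FaVeX and this explanation definition scale well to large networks with non-linear activations and can produce meaningful formal explanations.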

Comments ECOOP 2026

Abstract

Verified explanations are a principled way to explain the decisions taken by neural networks, which are otherwise black-box in nature. However, these techniques face significant scalability challenges, as they require multiple calls to neural network verifiers, each of them with an exponential worst-case complexity. We present FaVeX, a novel algorithm to compute verified explanations. FaVeX accelerates the computation by dynamically combining batch and sequential processing of input features, and by reusing information from previous queries, both when proving invariances with respect to certain input features, and when searching for feature assignments altering the prediction. Furthermore, we present a novel and hierarchical definition of verified explanations, termed verifier-optimal robust explanations, that explicitly factors the incompleteness of network verifiers within the explanation. Our comprehensive experimental evaluation demonstrates the superior scalability of both FaVeX and verifier-optimal robust explanations, which together can produce meaningful formal explanations on networks with hundreds of thousands of non-linear activations.

2511.06571 2026-05-11 cs.CL cs.AI cs.LG

Rep2Text: Decoding Full Text from a Single LLM Token Representation

Haiyan Zhao, Zirui He, Yiming Tang, Fan Yang, Ali Payani, Dianbo Liu, Mengnan Du

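AI summary This study examines whether the original input text can be recovered from an LLM's last-token representation and proposes a new framework, Rep2Text. Using a trainable adapter, the method maps the target model's last-token representation into the token embedding space of a decoding language model, which then reconstructs the input text autoregressively. Experiments show that for 16-token sequences roughly half of the tokens can be recovered on average while semantic coherence is preserved, and they reveal an information bottleneck effect: as sequence length grows, token-level recovery declines while semantic information remains relatively intact.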

Comments 18 pages, 6 figures, 6 tables

Abstract

Large language models (LLMs) have achieved remarkable progress across diverse tasks, yet their internal mechanisms remain largely opaque. In this work, we investigate a fundamental question: to what extent can the original input text be recovered from a single last-token representation in an LLM? To this end, we propose Rep2Text, a novel framework for decoding text from last-token representations. Rep2Text employs a trainable adapter that maps a target model's last-token representation into the token embedding space of a decoding language model, which then autoregressively reconstructs the input text. Experiments across various model combinations (Llama-3.1-8B, Gemma-7B, Mistral-7B-v0.1, Llama-3.2-3B, etc.) show that, on average, roughly half of the tokens in 16-token sequences can be recovered from this compressed representation while preserving strong semantic coherence. Further analysis reveals a clear information bottleneck effect: as sequence length increases, token-level recovery declines, while semantic information remains relatively well preserved. We also find that scaling effects are less pronounced in inversion tasks. Finally, our framework demonstrates robust generalization to out-of-distribution clinical data.

2510.19669 2026-05-11 cs.CL

DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference

Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi

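AI summary Recent reasoning LLMs perform well on reasoning tasks but often generate lengthy thinking traces, which hurts inference efficiency. This paper proposes DiffAdapt, a lightweight framework that adjusts the inference strategy dynamically based on problem difficulty and reasoning entropy; by selecting among reasoning modes of different complexity, it reduces the number of generated tokens while maintaining accuracy. Experiments show that the method lowers computational cost across multiple models and benchmarks, cutting token usage by up to 22.4%.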

Comments ICLR 26

Abstract

Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice a 22-25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on its difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune the base LLM but instead trains a small probe that classifies the LLM's final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
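
The entropy signal and the per-difficulty strategy table can be illustrated with a short Python sketch. This is a hedged illustration, not the paper's code: the abstract states only that each strategy fixes a prompt, a temperature, and a maximum token length, and that a small probe on the LLM's final hidden state drives the choice. The prompts, temperatures, token limits, and helper names below are made-up placeholders, and the probe itself is not shown.

```python
import math
from dataclasses import dataclass

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(trace):
    """Average entropy over the per-token distributions of a reasoning trace."""
    return sum(token_entropy(p) for p in trace) / len(trace)

@dataclass(frozen=True)
class Strategy:
    prompt: str
    temperature: float
    max_tokens: int

# Placeholder settings, NOT taken from the paper.
STRATEGIES = {
    "easy": Strategy("Answer concisely.", 0.3, 512),
    "normal": Strategy("Think step by step.", 0.6, 2048),
    "hard": Strategy("Reason carefully and verify each step.", 0.8, 8192),
}

def pick_strategy(difficulty_label):
    """Map a (hypothetical) probe's predicted label to fixed inference settings."""
    return STRATEGIES[difficulty_label]

# Toy traces: three peaked distributions versus three uniform ones.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
print(mean_token_entropy(confident), mean_token_entropy(uncertain))
print(pick_strategy("easy"))
```

Keeping each strategy as an immutable record makes the selection step a plain dictionary lookup, so the only learned component in such a design would be the classifier that produces the label.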

2510.07625 2026-05-11 cs.RO cs.SY eess.SY

GATO: GPU-Accelerated and Batched Trajectory Optimization for Scalable Edge Model Predictive Control

Alexander Du, Emre Adabag, Gabriel Bravo-Palacios, Brian Plancher

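AI summary This paper proposes GATO, a batched trajectory optimization solver designed to improve the real-time performance of model predictive control (MPC) on edge devices. Through co-design across algorithm, software, and hardware, GATO achieves efficient GPU parallelism for moderate batch sizes, addressing the trade-off between real-time performance and generality in existing methods. Experiments show speedups and improved control performance in multiple benchmarks and real-world case studies.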

Comments Accepted to ICRA 2026. 8 pages, 8 figures, 2 tables

Abstract

While Model Predictive Control (MPC) delivers strong performance across robotics applications, solving the underlying (batches of) nonlinear trajectory optimization (TO) problems online remains computationally demanding. Existing GPU-accelerated approaches either parallelize single solves, handle large batches at sub-real-time rates, or sacrifice model generality for speed. This leaves a large gap in solver performance for many state-of-the-art MPC applications that require real-time batches of tens to low-hundreds of solves. As such, we present GATO, an open source, GPU-accelerated, batched TO solver co-designed across algorithm, software, and computational hardware to deliver real-time throughput for these moderate batch size regimes. Our approach leverages a combination of block-, warp-, and thread-level parallelism within and across solves for ultra-high performance. We demonstrate the effectiveness of our approach through a combination of: simulated benchmarks showing speedups of 18-21x over CPU baselines and 1.4-16x over GPU baselines as batch size increases; case studies highlighting improved disturbance rejection and convergence behavior; and finally a validation on hardware using an industrial manipulator. We open source GATO to support reproducibility and adoption.

2510.04839 2026-05-11 cs.RO

TAG-K: Tail-Averaged Greedy Kaczmarz for Computationally Efficient and Performant Online Inertial Parameter Estimation

Shuo Sha, Anupam Bhakta, Zhenyuan Jiang, Kevin Qiu, Ishaan Mahajan, Gabriel Bravo-Palacios, Brian Plancher

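AI summary This paper proposes TAG-K, an online inertial parameter estimation method that addresses the difficulty traditional methods have in tracking abrupt parameter changes in dynamic environments, as well as their high computational cost. TAG-K is a lightweight extension of the Kaczmarz method that combines greedy row selection to speed up convergence with tail averaging to improve robustness under noise and inconsistency. Experiments show that, across several benchmarks and quadrotor tracking tasks, TAG-K is more computationally efficient and more accurate than competing methods, and improves overall tracking performance.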

Comments Accepted to ICRA 2026. 3 Figures. 3 Tables

Abstract

Accurate online inertial parameter estimation is essential for adaptive robotic control, enabling real-time adjustment to payload changes, environmental interactions, and system wear. Traditional methods often struggle to track abrupt parameter shifts or incur high computational costs, limiting their effectiveness in dynamic environments and for computationally constrained robotic systems. We introduce TAG-K, a lightweight extension of the Kaczmarz method that combines greedy randomized row selection for rapid convergence with tail averaging for robustness under noise and inconsistency. This design enables fast, stable parameter adaptation while retaining the low per-iteration complexity inherent to the Kaczmarz framework. We evaluate TAG-K in synthetic benchmarks and quadrotor tracking tasks against RLS, KF, and other Kaczmarz variants. TAG-K achieves 1.5x-1.9x faster solve times on laptop-class CPUs and 4.8x-20.7x faster solve times on embedded microcontrollers. More importantly, these speedups are paired with improved robustness to measurement noise and a 25% reduction in estimation error, leading to nearly 2x better end-to-end tracking performance. Website, documentation, and code available at: https://a2r-lab.org/TAG-K/.
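
The two ingredients named above, greedy row selection and tail averaging, can be sketched on a generic linear system A x = b. This is an illustrative simplification rather than the TAG-K algorithm itself: it uses a deterministic max-residual rule instead of the greedy randomized selection the abstract mentions, recomputes the full residual at every iteration, and assumes that A has no all-zero rows. The function name and default parameters are hypothetical.

```python
import numpy as np

def greedy_kaczmarz_tail_avg(A, b, iters=500, tail=100, x0=None):
    """Max-residual Kaczmarz iterations, returning the mean of the last `tail` iterates."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    row_norms_sq = np.einsum("ij,ij->i", A, A)   # squared norm of each row
    tail_sum = np.zeros(n)
    for t in range(iters):
        residual = b - A @ x
        # Greedy choice: the row whose hyperplane is farthest from x.
        i = int(np.argmax(residual**2 / row_norms_sq))
        # Project x onto the hyperplane a_i . x = b_i.
        x = x + (residual[i] / row_norms_sq[i]) * A[i]
        if t >= iters - tail:
            tail_sum += x                        # accumulate only the tail
    return tail_sum / tail

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 5))
x_true = rng.normal(size=5)
x_hat = greedy_kaczmarz_tail_avg(A, A @ x_true)
print(np.linalg.norm(x_hat - x_true))
```

Each update touches a single row, which is where the low per-iteration cost of Kaczmarz-type methods comes from; the full residual computation used here for row selection is the part a practical variant would replace with a cheaper rule.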

2510.04606 2026-05-11 cs.LG stat.ML

Closed-Form Last Layer Optimization

Alexandre Galashov, Nathaël Da Costa, Liyuan Xu, Philipp Hennig, Arthur Gretton

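AI summary This paper studies closed-form optimization of a neural network's last-layer weights under a squared loss. The authors propose treating the last layer as a function of the backbone parameters during optimization and optimizing only the backbone parameters, which is equivalent to alternating between gradient descent on the backbone and closed-form updates of the last layer. The method is adapted to the stochastic gradient descent setting, a theoretical analysis proves its convergence in the neural tangent kernel regime, and experiments show that it outperforms standard SGD and Adam on several regression tasks.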

Abstract

Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution to the linear last layer weights is known in closed-form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. We provide theoretical analyses showing convergence of the method to an optimal solution in the neural tangent kernel regime, as well as quantifying the gains compared to standard SGD in a one-step analysis. Finally, we demonstrate the effectiveness of our approach compared with SGD and Adam on a squared loss in several regression tasks, including neural operators and causal inference.
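
The closed-form last-layer solve referred to above is ordinary least squares on the backbone features. The NumPy sketch below shows only that step, under assumptions that do not come from the paper: a toy one-layer tanh backbone, a small ridge term added for numerical stability, and hypothetical function names. The backbone gradient step and the stochastic-batch adaptation are omitted.

```python
import numpy as np

def closed_form_last_layer(features, targets, ridge=1e-3):
    """Weights W minimizing ||features @ W - targets||^2 + ridge * ||W||^2."""
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

def backbone(x, theta):
    """Toy backbone (an assumption for illustration): tanh(x @ theta)."""
    return np.tanh(x @ theta)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = np.sin(x.sum(axis=1, keepdims=True))
theta = rng.normal(size=(3, 16))

ridge = 1e-3
phi = backbone(x, theta)                      # features, shape (200, 16)
W = closed_form_last_layer(phi, y, ridge)     # last layer as a function of theta

# The gradient of the regularized loss with respect to W vanishes at the solution.
grad_W = 2.0 * (phi.T @ (phi @ W - y) + ridge * W)
print(np.abs(grad_W).max())
```

Because W is recomputed from the current features, it can be viewed as a function of the backbone parameters, which is the viewpoint the abstract describes; an outer loop would then update `theta` by gradient descent and call the solver again.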

2510.03245 2026-05-11 cs.LG cs.AI cs.CV

Frequency-Aware Model Parameter Explorer: A new attribution method for improving explainability

Ali Yavari, Alireza Mohamadi, Elham Beydaghi, Philipp Seeböck, Rainer A. Leitgeb

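AI summary Existing attribution methods typically apply an all-pass filter when generating adversarial samples, overlooking high-frequency information that is crucial for feature attribution in deep neural networks. This paper proposes FAMPE, a new attribution method that uses an FFT-based α-weighted perturbation scheme to modulate high- and low-frequency components separately and integrates frequency-aware exploration directly into model parameter analysis, revealing more precisely which frequency-domain features the model relies on. Experiments show that FAMPE outperforms existing methods on multiple architectures, and that high-frequency perturbations are especially effective at improving attribution accuracy on low-frequency-dominated images.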

Comments Preprint

Abstract

State-of-the-art attribution methods rely on adversarial sample generation that applies an all-pass filter across the frequency spectrum, discarding fine-grained high-frequency information that is demonstrably important for accurate feature attribution in deep neural networks. By generating adversarial samples that selectively perturb high- and low-frequency components, we can probe which spectral features a model relies on most -- directly translating frequency-domain exploration into attribution signals. Building on this insight, we propose FAMPE (Frequency-Aware Model Parameter Explorer), a novel attribution method that introduces an FFT-based α-weighted perturbation scheme -- separately modulating high- and low-frequency components via an energy-driven spectral cutoff -- and, crucially, integrates this frequency-aware exploration directly into model parameter exploration for attribution, a connection that has not been established in prior work. Unlike prior frequency-aware adversarial approaches that target transferability or imperceptibility, FAMPE's specific formulation is designed and validated exclusively for explainability, translating spectral structure into fine-grained attribution maps without requiring any manual baseline selection. Evaluated on ImageNet across four architectures spanning CNNs and Vision Transformers, at fixed α = 0.1 FAMPE outperforms AttEXplore by 4.25% on Inception-v3 and 12.04% on MaxViT-T, with per-sample oracle selection further revealing that low-frequency-dominated images systematically benefit from high-frequency perturbations -- underscoring the potential of adaptive spectral exploration. Our ablation studies confirm that high-frequency perturbations are disproportionately responsible for attribution precision, while excessive low-frequency noise degrades global structural coherence.

2510.01685 2026-05-11 cs.CL cs.AI

How Do Language Models Compose Functions?

Apoorv Khandelwal, Ellie Pavlick

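AI summary This paper studies the internal mechanisms that LLMs use to solve two-hop factual recall tasks, which require function composition. The authors find that, although modern LLMs have improved on such tasks, they still exhibit a "compositionality gap": being able to compute $f(x)$ and $g(z)$ separately does not imply being able to compute the composition $g(f(x))$. By analyzing residual stream representations, the study identifies two processing mechanisms: one solves the task compositionally by computing $f(x)$ along the way, and the other reaches the result directly, skipping the intermediate variable $f(x)$. Experiments also show that embedding space geometry is closely tied to the mechanism used, with the direct mechanism dominating when the task is represented as a translation from $x$ to $g(f(x))$ in embedding space.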

Abstract

While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap", i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. We then decode residual stream representations and identify two processing mechanisms: one which solves tasks compositionally, computing $f(x)$ along the way to $g(f(x))$, and one which solves them directly, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that embedding space geometry is strongly related to which mechanism is employed, where the idiomatic mechanism is dominant when tasks are represented by translations from $x$ to $g(f(x))$ in the embedding spaces. We fully release our data and code at: https://github.com/apoorvkh/composing-functions.

2510.01510 2026-05-11 cs.LG

Flock: A Knowledge Graph Foundation Model via Learning on Random Walks

Jinwoo Kim, Xingyue Huang, Krzysztof Olejniczak, Kyungbin Min, Michael Bronstein, Seunghoon Hong, İsmail İlkan Ceylan

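AI summary This paper studies zero-shot link prediction on knowledge graphs, where a model must generalize to new entities and new relations. To address the limited expressive power of conventional knowledge graph foundation models in distinguishing relations that are structurally similar but semantically different, the authors propose an approach based on probabilistic node-relation equivariance, which uses structured randomness to break symmetries at inference time. On this basis they present Flock, a model that iteratively samples random walks, encodes the resulting sequences, and aggregates node and relation representations; it is a universal approximator for link-level functions on knowledge graphs and achieves strong performance on multiple benchmark datasets.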

Comments 42 pages, 7 figures

Abstract

We study the problem of zero-shot link prediction on knowledge graphs (KGs), which requires models to generalize to novel entities and novel relations. Knowledge graph foundation models (KGFMs) address this task by enforcing equivariance over both nodes and relations, which enables them to learn structural properties of nodes and relations that transfer to novel KGs with similar structure. However, the conventional notion of deterministic equivariance inherently limits the expressive power of KGFMs, as it prevents them from distinguishing relations that are structurally similar but semantically distinct. To overcome this limitation, we propose to leverage probabilistic node-relation equivariance, which preserves equivariance in distribution while using structured randomness to break symmetries at inference time. Building on this principle, we present Flock, a KGFM that iteratively samples random walks, encodes them into sequences, embeds them with a sequence model, and aggregates node and relation representations through learned pooling. Flock respects probabilistic node-relation equivariance and, crucially, is a universal approximator for isomorphism-invariant link-level functions over KGs. Empirically, Flock perfectly solves our new diagnostic dataset Petals on which current KGFMs fail, and achieves state-of-the-art performance on entity and relation prediction tasks across 54 KGs from diverse domains. Code is available at https://github.com/jw9730/flock.

2510.01105 2026-05-11 cs.LG

Geometric Analysis of Neural Regression Collapse via Intrinsic Dimension

George Andriopoulos, Zixuan Dong, Bimarsha Adhikari, Keith Ross

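AI summary This study analyzes the geometry of feature representations in neural regression models and shows that neural collapse in regression tasks hurts performance. Using the notion of intrinsic dimension, it finds that collapsed models have a feature-space dimension lower than that of the target space, leading to over-compression and poorer generalization, whereas non-collapsed models usually keep the feature-space dimension above that of the target space. Based on this, the study identifies two compression regimes, offering a new geometric perspective and practical strategies for improving the generalization of regression models.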

Comments 36 pages, 21 figures

Abstract

Neural multivariate regression underpins a wide range of domains, including control, robotics, and finance, yet the geometry of its learned representations remains poorly characterized. While neural collapse has been shown to benefit generalization in classification, we find that analogous collapse in regression consistently degrades performance. To explain this contrast, we analyze regression models through the lens of intrinsic dimension. Across control tasks and synthetic datasets, we estimate the intrinsic dimension of last-layer features (ID_H) and compare it with that of the regression targets (ID_Y). Collapsed models exhibit ID_H < ID_Y, leading to over-compression and poor generalization, whereas non-collapsed models typically maintain ID_H > ID_Y. For the non-collapsed models, performance with respect to ID_H depends on the data quantity and noise levels. From these observations, we identify two regimes (over-compressed and under-compressed) that determine when expanding or reducing feature dimensionality improves performance. Our results provide new geometric insights into neural regression collapse and suggest practical strategies for enhancing generalization.

2510.00568 2026-05-11 cs.CL

ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen

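AI summary ReSeek is a self-correcting framework for training search agents, aimed at the tendency of RL-based search agents to get stuck on erroneous reasoning paths in complex tasks. The framework introduces a self-correction mechanism that lets the agent dynamically identify and recover from erroneous paths during search, using a special JUDGE action to assess information and re-plan its strategy. The study also designs a dense, instructive reward function and introduces a new benchmark, FictionalHot; experiments show that ReSeek significantly improves task success rate and path faithfulness.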

Comments ICML 2026

Abstract

Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. ReSeek is intuitively reasonable and practically simple, and extensive experiments show that agents trained with it significantly outperform SOTA baselines in task success rate and path faithfulness.

2510.00253 2026-05-11 cs.LG

DReS: Dual Reconstruction Smoothing for Functional Regularization

Parsa Moradi, Tayyebeh Jahaninezhad, Hanzaleh Akbarinodehi, Mohammad Ali Maddah-Ali

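AI summary This paper proposes DReS, a nonparametric regularization framework that induces smoothness through a spline-based auxiliary branch, introduces no additional trainable parameters, and applies to unsupervised, self-supervised, and supervised learning. The auxiliary branch shares the model's parameters, and the theoretical analysis shows that the discrepancy between the target function and its DReS approximation is controlled by higher-order smoothness of the function, so the method acts as an implicit higher-order smoothness regularizer. Experiments show that DReS performs well in representation learning, generative modeling, and supervised learning.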

Abstract

Smoothness is a key inductive bias in machine learning and is closely related to generalization. Existing smoothness-inducing methods typically rely either on explicit gradient regularization, which often incurs substantial computational and memory overhead, or on data-mixing strategies, which are less naturally applicable to unsupervised and self-supervised settings. In this work, we propose Dual Reconstruction Smoothing (DReS), a nonparametric regularization framework that induces smoothness through a spline-based auxiliary branch with shared model parameters. The method introduces no additional trainable parameters and can be applied to arbitrary submodules, making it suitable for unsupervised, self-supervised, and supervised regimes. We show theoretically that the discrepancy between the target function and its DReS approximation is controlled by higher-order smoothness quantities of the function, establishing the method as an implicit higher-order smoothness regularizer. Empirically, DReS improves representation learning across several self-supervised methods, improves generation quality in generative modeling, and achieves strong performance relative to competitive baselines in supervised learning.

2509.26524 2026-05-11 cs.LG cs.AI

TAP: Two-Stage Adaptive Personalization of Multi-Task and Multi-Modal Foundation Models in Federated Learning

Seohyun Lee, Wenzhi Fang, Dong-Jun Han, Seyyedali Hosseinalipour, Christopher G. Brinton

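AI summary In federated learning, how to personalize the fine-tuning of foundation models across clients that are heterogeneous in data, tasks, and modalities remains underexplored. This paper proposes TAP, a two-stage adaptive personalization method. The first stage exploits the mismatch in model architecture between clients and the server to selectively replace parameters, reducing cross-task and cross-modality interference; the second stage performs post-FL distillation once the global model has stabilized to recover beneficial shared structure, improving generalization while preserving personalization. The paper also gives the first convergence analysis of federated foundation model training under modality-task pair heterogeneity and validates the method with extensive experiments.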

Comments 29 pages

Abstract

In federated learning (FL), local personalization of models has received significant attention, yet personalized fine-tuning of foundation models remains underexplored. In particular, there is a lack of understanding in the literature on how to personalize foundation models in settings where there exists heterogeneity not only in data, but also in tasks and modalities across the clients. To address this gap, we propose Two-Stage Adaptive Personalization (TAP). In the first stage, TAP leverages mismatched model architectures between clients and the server to selectively replace personalized parameters with global updates, explicitly limiting cross-task and cross-modality interference. In the second stage, TAP conducts post-FL distillation on the global model to recover a beneficial shared structure. By reintroducing generalizable knowledge only after the global model has stabilized, TAP enhances generalization without compromising personalization. In developing our methodology, we introduce the first convergence analysis of federated foundation model training at the server under modality-task pair heterogeneity across clients, and demonstrate the impact of the number of modality-task pairs on model fine-tuning. Through extensive experiments, we demonstrate the effectiveness of TAP across a variety of datasets and tasks in comparison to state-of-the-art baselines. The implementation code is publicly available at https://github.com/lee3296/TAP.

2509.26272 2026-05-11 cs.CV cs.LG

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Tuan Nguyen, Naseem Khan, Khang Tran, NhatHai Phan, Issa Khalil

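AI summary With the rapid growth of synthetic media, deepfake detection has become an important challenge for online safety and trust. To address the shortage of existing data and the mismatch between LLM reasoning and visual evidence in detection tasks, this paper proposes Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning method, and builds a reasoning-annotated deepfake detection dataset. Experiments show that PRPO substantially improves detection accuracy and reasoning scores, supporting its effectiveness in making models more interpretable and reliable.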

Comments Accepted at ICML 2026

Abstract

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.

2509.23370 2026-05-11 cs.CV

GRAPE: Let GRPO Supervise Query Rewriting by Ranking for Retrieval

Zhaohua Zhang, Jianhuan Zhuo, Muxi Chen, Chenchen Zhao, Wenyu Jiang, Tianwen Jiang, Mingyang Chen, Yutang, Qiuyong Xiao, Jihong Zhang, Zhixun Su

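AI summary This paper proposes GRAPE, a plug-and-play method that improves query rewriting through ranking-based policy optimization, thereby improving large-scale retrieval systems on multilingual, long-form, and multimodal queries. The method uses an LLM to rewrite queries and integrates ranking signals into the rewriting process via Grouped Relative Policy Optimization (GRPO), so that rewritten queries better match the latent distribution of the frozen retriever. Experiments show that GRAPE improves retrieval on multiple benchmark datasets, with an average gain of 4.9% in Recall@10 and no modification to the original retriever.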

Abstract

The CLIP model has established itself as a cornerstone of large-scale retrieval systems. However, its performance often degrades under distributional shifts such as multilingual, long-form, or multimodal queries. To avoid the prohibitive costs associated with retriever retraining or corpus re-embedding, we propose GRAPE (Grouped Ranking-Aware Policy Optimization Enhancement), a plug-and-play approach that leverages LLM-based query rewriting to bridge these gaps. Unlike existing methods that lack explicit supervision, GRAPE integrates ranking signals into the rewriting LLM via Grouped Relative Policy Optimization (GRPO), ensuring rewritten queries are better aligned with the frozen retriever's latent distribution. Crucially, we identify a score inflation phenomenon in naive similarity-based finetuning - where irrelevant candidates receive indiscriminately high scores - and mitigate it with a novel corpus-relative ranking-based reward. Extensive experiments across multilingual (Flickr30k-CN, CVLUE, XM3600), long-form (Wikipedia), and multimodal (CIRR) benchmarks demonstrate that GRAPE consistently improves performance, achieving an average gain of 4.9% in Recall@10 without any modification to the underlying retriever. The code is available at https://github.com/mogulzhang/GRAPE.

2509.08089 2026-05-11 cs.LG cs.CR

Hammer and Anvil: Toward a Theory of Backdoors in Federated Learning

Lucas Fenaux, Zheng Wang, Jacob Yan, Nathan Chung, Florian Kerschbaum

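AI summary This paper proposes "Hammer and Anvil", a theoretical framework for analyzing backdoor attacks in federated learning. The framework categorizes backdoors by the deviation $δ$ between malicious client updates and the mean update, and identifies two core defense types: "Type 1 (The Anvil)" defenses against large-deviation attacks, such as outlier detection and robust aggregation, and "Type 2 (The Hammer)" defenses against small-deviation attacks, such as removal-based strategies. The study further shows that single-type defenses and non-principled combinations are easily exploited by adaptive attackers, whereas a principled combination of the two types withstands a worst-case, full-information adaptive attack and shows superior robustness across multiple datasets and settings.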

Abstract

Federated Learning (FL) enables distributed model training but is vulnerable to backdoor attacks, where malicious clients embed attacker-controlled behaviors into the global model. Existing defenses fail against adaptive adversaries. In this paper, we present "Hammer and Anvil", a principled theoretical framework that categorizes backdoors by the deviation, $δ$, of their updates from the mean of the updates. We identify two fundamental defense types: "Type 1 (The Anvil)", comprising outlier detection and robust aggregation effective against large-deviation attacks, and "Type 2 (The Hammer)", consisting of removal-based defenses effective against small-deviation attacks. We demonstrate that defenses of a single type and non-principled combined defenses inherently leave an exploitable gap for adaptive attackers. To bridge this gap, we propose the principled combination of Type 1 and Type 2 defenses. We evaluate our framework against a new, worst-case, full-information adaptive adversary that knows the benign updates, the aggregation algorithm, and its parameters, and yet this adversary fails against our combined defenses. Our empirical evaluation across various datasets and settings shows that single-typed and non-principled combined defenses are easily broken, often by a single malicious client. In contrast, our best combined defense variants, $HA_{Flame}^{CSFT}$, $HA_{Krum}^{CSFT}$, and $HA_{Multi-Metrics}^{CSFT}$, remain undefeated even in the most adversarial settings. Our results provide a principled approach for research on backdoors in federated learning.
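As a rough illustration of the deviation that the framework is organized around, the sketch below measures how far each client update lies from the mean of all updates and flags large deviations, which is the regime Type 1 defenses target. The Euclidean norm, the inclusion of every client in the mean, the threshold `tau`, and the toy updates are assumptions made for this example; the abstract does not specify them.

```python
import math

def mean_update(updates):
    """Coordinate-wise mean of the client updates."""
    n = len(updates)
    return [sum(coords) / n for coords in zip(*updates)]

def deviations(updates):
    """Euclidean distance of each update from the mean (assumed metric)."""
    mu = mean_update(updates)
    return [math.dist(u, mu) for u in updates]

# Three similar updates and one that sits far from the rest (toy data).
updates = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [5.0, 4.0]]
delta = deviations(updates)

tau = 3.0  # hypothetical threshold separating "large" from "small" deviation
flagged = [i for i, d in enumerate(delta) if d > tau]
```

An update whose deviation stays below such a threshold passes this kind of check, which is the gap the abstract says removal-based (Type 2) defenses are meant to cover.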

2509.03736 2026-05-11 cs.AI cs.CL cs.LG

Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation

James Mooney, Josef Woldense, Zheng Robert Jia, Shirley Anugrah Hayati, My Ha Nguyen, Vipul Raheja, Dongyeop Kang

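AI summary This paper examines whether large language model (LLM) agents are behaviorally coherent, in particular whether they remain consistent with human behavioral models under different experimental conditions. The study designs a method that reveals an agent's latent profile through questions and then evaluates its behavioral consistency in a multi-agent conversational setting. The results show that LLMs of different model families and sizes differ significantly in behavioral consistency: although they may generate human-like responses, they fail to maintain empirical consistency across experimental settings, which indicates a critical gap in their ability to substitute for real participants.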

Comments 25 pages, 9 figures, 7 tables

Abstract

The impressive capabilities of Large Language Models (LLMs) raise the possibility that synthetic agents can serve as substitutes for real participants in human-subject research. To evaluate this claim, prior research has largely focused on whether LLM-generated survey responses align with those produced by human respondents whom the LLMs are prompted to represent. In contrast, we address a more fundamental question: Do agents maintain empirical consistency, aligning with human behavioral models when examined under different experimental settings? To this end, we develop a study designed to (a) ask a set of questions that reveals an agent's latent profile and (b) examine agent behavioral consistency in a conversational setting with other agents. This design enables us to explore a set of behavioral hypotheses to assess whether an agent's conversational behavior is consistent with what we would expect from its revealed state. Our findings show significant inconsistencies in LLMs across model families and at differing model sizes. Most importantly, we find that, although agents may generate responses matching those of their human counterparts, they fail to be empirically consistent, representing a critical gap in their capabilities to accurately substitute for real participants in human-subject research.

2509.00338 2026-05-11 cs.LG cs.AI

Scalable Option Learning in High-Throughput Environments

Mikael Henaff, Scott Fujimoto, Michael Matthews, Michael Rabbat

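AI summary This paper studies how to scale hierarchical reinforcement learning to high-throughput environments. The authors propose Scalable Option Learning (SOL), an algorithm that markedly improves the training efficiency of hierarchical reinforcement learning, with roughly 35x higher throughput than existing methods. Through large-scale training in complex environments such as NetHack, SOL demonstrates strong performance and scalability, and its broad applicability is validated on multiple benchmark tasks.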

Abstract

Hierarchical reinforcement learning (RL) has the potential to enable effective decision-making over long timescales. Existing approaches, while promising, have yet to realize the benefits of large-scale training. In this work, we identify and solve several key challenges in scaling online hierarchical RL to high-throughput environments. We propose Scalable Option Learning (SOL), a highly scalable hierarchical RL algorithm that achieves ~35x higher throughput than existing hierarchical methods. To demonstrate SOL's performance and scalability, we train hierarchical agents using 30 billion frames of experience on the complex game of NetHack, significantly surpassing flat agents and demonstrating positive scaling trends. We also validate SOL on MiniHack and MuJoCo environments, showcasing its general applicability. Our code is open sourced at: github.com/facebookresearch/sol.

2508.15989 2026-05-11 cs.LG cs.ET

Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Jiaqi Lin, Malyaban Bal, Abhronil Sengupta

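AI summary This paper proposes a scalable Equilibrium Propagation (EP) framework for deep convolutional convergent recurrent neural networks (CRNNs) that introduces error signals at intermediate layers to address the vanishing gradient problem in deep networks. The method combines knowledge distillation with local error signals, improves the convergence of neuron dynamics, and for the first time enables effective training of deep architectures. Experiments show state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets and demonstrate scalability on deep VGG architectures, which opens a new direction for applying EP to more complex networks.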

Abstract

Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To alleviate the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates layer-wise learning signals to provide auxiliary supervision, which enhances the convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, suggesting that intermediate learning signals can extend the practical applicability of EP to deeper architectures.

2508.05803 2026-05-11 cs.CL

Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

Abishek Thamma, Micha Heilbron

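AI summary This study examines how fleeting memory affects language learning and reading time prediction. It finds that equipping Transformer language models with a fleeting memory mechanism improves their language modeling and syntactic performance but impairs their ability to predict human reading times from surprisal. The result supports the view that human memory limitations may help language learning, and it reveals a gap in neural network language models between language learning performance and behavioral prediction.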

Comments Revised after peer review. Accepted for publication in Transactions of the Association for Computational Linguistics

Abstract

Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language - an idea supported by classic connectionist modelling work. The rise of Transformers appears to challenge this idea, as these models can learn language effectively, despite lacking memory limitations or other architectural recency biases. Here, we investigate the hypothesized benefit of fleeting memory for language learning in tightly controlled experiments on transformer language models. Training transformers with and without fleeting memory on a developmentally realistic training set, we find that fleeting memory consistently improves language learning (as quantified by both overall language modelling performance and targeted syntactic evaluation) but, unexpectedly, impairs surprisal-based prediction of human reading times. Interestingly, follow-up analyses revealed that this discrepancy - better language modeling, yet worse reading time prediction - could not be accounted for by prior explanations of why better language models sometimes fit human reading times worse. Together, these results support a benefit of memory limitations on neural network language learning - but not on predicting behavior.

2508.05773 2026-05-11 cs.RO cs.SY eess.SY

GPU-Accelerated Barrier-Rate Guided MPPI Control for Tractor-Trailer Systems

Keyvan Majd, Hardik Parwana, Bardh Hoxha, Steven Hong, Hideki Okamoto, Georgios Fainekos

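AI summary This paper studies how Barrier-Rate guided Model Predictive Path Integral (BR-MPPI) control can improve the navigation of articulated vehicles such as tractor-trailers in complex environments such as parking lots. The method embeds Control Barrier Function (CBF) constraints directly into the path-integral update and steers the importance-sampling distribution toward collision-free, dynamically feasible trajectories, which improves exploration and trajectory robustness. Experiments show real-time control rates above 100 Hz on a single GPU and better parking clearance in scenarios with multiple obstacles.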

Comments Accepted to IEEE ITSC 2025

Abstract

Articulated vehicles such as tractor-trailers, yard trucks, and similar platforms must often reverse and maneuver in cluttered spaces where pedestrians are present. We show how Barrier-Rate guided Model Predictive Path Integral (BR-MPPI) control can solve navigation in such challenging environments. BR-MPPI embeds Control Barrier Function (CBF) constraints directly into the path-integral update. By steering the importance-sampling distribution toward collision-free, dynamically feasible trajectories, BR-MPPI enhances the exploration strength of MPPI and improves the robustness of the resulting trajectories. The method is evaluated in the high-fidelity CarMaker simulator on a 12 m tractor-trailer tasked with reverse and forward parking in a parking lot. BR-MPPI computes control inputs at above 100 Hz on a single GPU (for scenarios with eight obstacles) and maintains better parking clearance than both a standard MPPI baseline and an MPPI baseline with a collision cost.
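The abstract does not spell out how the CBF constraints enter the update, so the sketch below covers only the standard MPPI importance-weighting step that BR-MPPI builds on: sampled control perturbations are weighted by the exponentiated negative cost of their rollouts and averaged into the nominal control sequence. The function names, the temperature value, and the toy costs are assumptions for illustration, and the barrier-rate guidance itself is not modeled.

```python
import math

def mppi_weights(costs, temperature=1.0):
    """Softmin weights over rollout costs (the minimum is subtracted for stability)."""
    best = min(costs)
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]

def mppi_update(nominal, perturbations, costs, temperature=1.0):
    """Add the cost-weighted average perturbation to the nominal controls."""
    weights = mppi_weights(costs, temperature)
    return [
        nominal[t] + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t in range(len(nominal))
    ]

# Three sampled perturbations of a two-step control sequence (toy values).
nominal = [0.0, 0.0]
perturbations = [[0.5, 0.5], [-0.5, -0.5], [1.0, 0.0]]
costs = [1.0, 10.0, 2.0]  # e.g. rollout 2 ends close to an obstacle

weights = mppi_weights(costs)
updated = mppi_update(nominal, perturbations, costs)
```

In plain MPPI, samples that collide only contribute through a high cost; the abstract's point is that BR-MPPI instead shifts the sampling distribution itself toward safe trajectories.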

2508.04056 2026-05-11 cs.RO q-bio.QM

SCOUT: Closed-Loop in-vivo System for Continuous Methane Concentration Monitoring in Cattle

Yuelin Deng, Hinayah Rojas de Oliveira, Richard M. Voyles, Upinder Kaur

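AI summary This study presents SCOUT, a closed-loop in-vivo monitoring system for continuously measuring methane concentration in the cattle rumen, which addresses the trade-off between accuracy and operational feasibility in existing methods. SCOUT maintains the anaerobic ruminal environment through closed-loop gas recirculation and monitors methane concentration at high temporal resolution, revealing rapid concentration changes associated with changes in animal behavior. The system provides a reliable data basis for building concentration-to-flux models, supporting precision phenotyping, emission proxy calibration, and mitigation strategy evaluation.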

Abstract

Enteric methane measurement from ruminant livestock faces fundamental trade-offs between accuracy and operational feasibility. Existing methods quantify methane after eructation and atmospheric dilution, limiting temporal resolution and confounding biological signals with environmental variables. We present SCOUT (Smart Cannula-mounted Optical Unit for Trace-methane), the first autonomous system for continuous in-vivo monitoring of ruminal headspace methane concentrations. The system addresses a critical engineering barrier through closed-loop gas recirculation that maintains anaerobic ruminal conditions during persistent headspace sampling. SCOUT was deployed on cannulated Simmental heifers under contrasting dietary treatments. Headspace concentrations were 100 to 1000 times higher than concurrent ambient sniffer readings, providing substantially greater signal resolution for characterizing methane dynamics. High-frequency monitoring revealed behavior-production coupling previously inaccessible, including rapid concentration changes ($14.5 \pm 11.3k$ ppm) associated with postural transitions within 15-minute intervals. Cross-platform comparison with ambient sniffers showed scale-dependent correspondence between production and release measurements, with an optimal correlation (r = -0.564) at 40-minute averaging windows consistent with eructation cycles. These results demonstrate that the rumen headspace contains continuous, biologically interpretable methane signals that SCOUT can reliably access, establishing the measurement infrastructure necessary for developing concentration-to-flux models that would support precision phenotyping, emission proxy calibration, and mitigation strategy evaluation.

2508.02129 2026-05-11 cs.CV

VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling

Yuru Xiao, Zihan Lin, Chao Lu, Deming Zhai, Kui Jiang, Wenbo Zhao, Wei Zhang, Junjun Jiang, Huanran Wang, Xianming Liu

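AI summary This paper proposes VDEGaussian, a video diffusion-enhanced 4D Gaussian Splatting framework for modeling dynamic urban scenes. By distilling robust, temporally consistent priors from a test-time adapted video diffusion model, the method addresses the limitations of existing approaches in handling fast-moving objects and temporal discontinuities. It introduces a joint timestamp optimization strategy and an uncertainty distillation method, which significantly improve dynamic scene reconstruction quality and novel view synthesis, with a clear advantage on fast-moving objects.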

Abstract

Dynamic urban scene modeling is a rapidly evolving area with broad applications. While current approaches leveraging neural radiance fields or Gaussian Splatting have achieved fine-grained reconstruction and high-fidelity novel view synthesis, they still face significant limitations. These often stem from a dependence on pre-calibrated object tracks or difficulties in accurately modeling fast-moving objects from undersampled capture, particularly due to challenges in handling temporal discontinuities. To overcome these issues, we propose a novel video diffusion-enhanced 4D Gaussian Splatting framework. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. To ensure precise pose alignment and effective integration of this denoised content, we introduce two core innovations: a joint timestamp optimization strategy that refines interpolated frame poses, and an uncertainty distillation method that adaptively extracts target content while preserving well-reconstructed regions. Extensive experiments demonstrate that our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB for novel view synthesis over baseline approaches.

2508.01248 2026-05-11 cs.CV

NS-Net: Decoupling CLIP Semantic Information through NULL-Space for Generalizable AI-Generated Image Detection

Jiazhen Yan, Fan Wang, Weiwei Jiang, Ziqiang Li, Zhangjie Fu

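AI summary The rapid progress of generative models such as GANs and diffusion models has made it possible to create highly realistic images, raising concerns about misuse in security-sensitive domains. Existing detectors perform well on known generative models but generalize poorly to unknown ones, especially when real and generated images have similar semantic content. This paper proposes NS-Net, which uses NULL-Space projection to decouple high-level semantic information from CLIP's visual features and combines it with contrastive learning to capture distributional differences between real and generated images. Experiments show that NS-Net significantly outperforms existing methods on an open-world benchmark covering 40 generative models, improving detection accuracy by 7.4%.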

Abstract

The rapid progress of generative models, such as GANs and diffusion models, has facilitated the creation of highly realistic images, raising growing concerns over their misuse in security-sensitive domains. While existing detectors perform well under known generative settings, they often fail to generalize to unknown generative models, especially when semantic content between real and fake images is closely aligned. In this paper, we revisit the use of CLIP features for AI-generated image detection and uncover a critical limitation: the high-level semantic information embedded in CLIP's visual features hinders effective discrimination. To address this, we propose NS-Net, a novel detection framework that leverages NULL-Space projection to decouple semantic information from CLIP's visual features, followed by contrastive learning to capture intrinsic distributional differences between real and generated images. Furthermore, we design a Patch Selection strategy to preserve fine-grained artifacts by mitigating semantic bias caused by global image structures. Extensive experiments on an open-world benchmark comprising images generated by 40 diverse generative models show that NS-Net outperforms existing state-of-the-art methods, achieving a 7.4% improvement in detection accuracy, thereby demonstrating strong generalization across both GAN- and diffusion-based image generation techniques.

2506.23875 2026-05-11 cs.LG cs.AI

Discovering Learning-Friendly Generation Orders for Sequential Computation

Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera

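AI summary This study aims to automatically discover learning-friendly generation orders for sequential generation tasks in order to raise training success rates. Its core method, "loss profiling", evaluates how quickly the loss drops early in training for each candidate order, and combines this with a hierarchical search strategy to explore the large candidate space efficiently. Experiments show that the method significantly improves the effectiveness of generation orders on multiple tasks, reaches success rates near 100% on some tasks, and rediscovers effective orders reported in prior work.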

Comments 10+24 pages, 10 figures

Abstract

Sequential computation via autoregressive generation can make difficult tasks learnable, but the generation order of intermediate states strongly affects whether training succeeds. We address the problem of discovering a learning-friendly target order automatically, rather than relying on task-specific design. Our key observation is that learning-friendly orders cause a faster loss drop in the early stage of training. We exploit this by loss profiling, which ranks candidate orders by the early-stage loss of a single short run. To handle the factorial candidate space, we wrap loss profiling in a hierarchical global-local search over block- and within-block-level orderings. On six order-sensitive tasks, the method discovers effective orders up to $L=13$ from random initialization and up to $L=40$ from structured initialization, lifting success rates from about 10% to near 100%. On integer multiplication, it rediscovers the reverse-digit order that was reported to be efficient in prior studies. On delay dynamical systems, as a case study of multi-variate recurrences, learnability varies sharply even among valid topological sorts of the dependency graph: loss profiling identifies a learning-friendly one, and the global search even discovers orders surpassing hand-designed candidates.
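The ranking step at the heart of loss profiling can be sketched in a few lines: score every candidate order with the loss reached after one short run, then sort. The scoring function below is a toy stand-in (it counts pairs that are not in reverse order) and is purely an assumption for illustration; in the setting the abstract describes, it would be replaced by briefly training a model on targets arranged in that order and reading off the early-stage loss. The hierarchical global-local search is not shown.

```python
from itertools import permutations

def loss_profile(candidates, short_run_loss):
    """Rank candidate generation orders by their early-stage loss, lowest first."""
    return sorted(candidates, key=short_run_loss)

def toy_short_run_loss(order):
    """Toy stand-in for 'train briefly and measure the loss'.

    Counts position pairs that are in ascending order, so the fully
    reversed order scores 0. This is NOT a real training signal.
    """
    n = len(order)
    return sum(1 for i in range(n) for j in range(i + 1, n) if order[i] < order[j])

# All 4! = 24 orders of four intermediate states.
candidates = list(permutations(range(4)))
ranked = loss_profile(candidates, toy_short_run_loss)
best_order = ranked[0]
```

Enumerating all permutations is only feasible for very small lengths, which is why the abstract wraps the profiling step in a search over block-level and within-block orderings.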

2506.14951 2026-05-11 cs.LG cs.AI cs.NE

Flat Channels to Infinity in Neural Loss Landscapes

Flavio Martinelli, Alexander Van Meegen, Berfin Şimşek, Wulfram Gerstner, Johanni Brea

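AI summary This paper studies a special structure in neural network loss landscapes: channels along which the loss decreases extremely slowly while the output weights of at least two neurons diverge to positive and negative infinity and their input weight vectors become equal. In the limit the two neurons implement a gated linear unit with distinctive computational properties. The study shows that optimization methods such as gradient descent readily converge to these seemingly flat regions in regression tasks, and it characterizes the regions comprehensively in terms of geometry, gradient dynamics, and function.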

Comments Accepted to NeurIPS'25 (fixed resolution of equations in figs.1,2,3)

Abstract

The loss landscapes of neural networks contain minima and saddle points that may be connected in flat regions or appear in isolation. We identify and characterize a special structure in the loss landscape: channels along which the loss decreases extremely slowly, while the output weights of at least two neurons, $a_i$ and $a_j$, diverge to $\pm\infty$, and their input weight vectors, $\mathbf{w_i}$ and $\mathbf{w_j}$, become equal to each other. At convergence, the two neurons implement a gated linear unit: $a_i σ(\mathbf{w_i} \cdot \mathbf{x}) + a_j σ(\mathbf{w_j} \cdot \mathbf{x}) \rightarrow σ(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) σ'(\mathbf{w} \cdot \mathbf{x})$. Geometrically, these channels to infinity are asymptotically parallel to symmetry-induced lines of critical points. Gradient flow solvers, and related optimization methods like SGD or Adam, reach the channels with high probability in diverse regression settings, but without careful inspection they look like flat local minima with finite parameter values. Our characterization provides a comprehensive picture of these quasi-flat regions in terms of gradient dynamics, geometry, and functional interpretation. The emergence of gated linear units at the end of the channels highlights a surprising aspect of the computational capabilities of fully connected layers.
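The limit in the abstract can be checked numerically along one particular path, chosen here as an assumption for illustration rather than taken from the paper: set $\mathbf{w_i} = \mathbf{w} + ε\mathbf{v}$, $\mathbf{w_j} = \mathbf{w} - ε\mathbf{v}$, $a_i = 1/2 + 1/(2ε)$ and $a_j = 1/2 - 1/(2ε)$. As $ε \to 0$ the output weights diverge with opposite signs, the input weights merge, and the sum becomes an average of the two activations plus a central finite difference, which tends to $σ(\mathbf{w} \cdot \mathbf{x}) + (\mathbf{v} \cdot \mathbf{x}) σ'(\mathbf{w} \cdot \mathbf{x})$. The sketch uses tanh as the activation and arbitrary toy vectors.

```python
import math

def sigma(z):
    return math.tanh(z)

def sigma_prime(z):
    return 1.0 - math.tanh(z) ** 2

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def two_neuron_output(x, w, v, eps):
    """Two neurons on the assumed path: weights merge while outputs diverge."""
    a_i = 0.5 + 0.5 / eps
    a_j = 0.5 - 0.5 / eps
    w_i = [wk + eps * vk for wk, vk in zip(w, v)]
    w_j = [wk - eps * vk for wk, vk in zip(w, v)]
    return a_i * sigma(dot(w_i, x)) + a_j * sigma(dot(w_j, x))

def gated_linear_limit(x, w, v):
    """Right-hand side of the limit stated in the abstract."""
    return sigma(dot(w, x)) + dot(v, x) * sigma_prime(dot(w, x))

x, w, v = [0.3, -0.7], [1.0, 0.5], [0.2, -0.4]  # toy values
limit = gated_linear_limit(x, w, v)
gaps = [abs(two_neuron_output(x, w, v, eps) - limit) for eps in (1e-1, 1e-2, 1e-3)]
```

The gap shrinks roughly quadratically in ε, so the two-neuron function is already very close to the gated linear unit while the output weights are still finite.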