arXivDaily arXiv daily academic digest, updated Monday to Friday
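2601.10962 2026-06-17 cs.LG cond-mat.dis-nn version update

Noise-Driven Exploration and Transient Freezing Select Flat Minima in Stochastic Gradient Descent

Ning Yang, Yikuan Zhang, Qi Ouyang, Chao Tang, Yuhai Tu

AI summary Analysis of SGD learning dynamics reveals a nonequilibrium mechanism that drives solution selection: a transient exploratory phase escapes sharp valleys, noise reshapes the landscape into an effective potential that stabilizes flat solutions, and delayed freezing improves generalization.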

Comments 12 pages, 4 figures

Abstract

Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism that governs solution selection during training. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and migrate toward flatter regions of the loss landscape before becoming confined to a final basin. Using a tractable physical model, we show that SGD noise reshapes the loss landscape into an effective potential that preferentially stabilizes flat solutions. We further uncover a transient freezing mechanism: as training progresses, the flattening landscape suppresses transitions between competing valleys. Stronger SGD noise delays this freezing transition, prolonging the exploratory phase and thereby increasing the probability of convergence to flatter minima. Together, these results provide a unified physical framework connecting learning dynamics, loss-landscape geometry, and generalization, and suggest guiding principles for the design of more effective optimization algorithms.

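2512.16420 2026-06-17 cs.SD version update

DPDFNet: Boosting DeepFilterNet2 via Dual-Path RNN

Daniel Rika, Nino Sapir, Ido Gus

AI summary Proposes DPDFNet, which adds dual-path blocks to the DeepFilterNet2 encoder to strengthen long-range temporal and cross-band modeling; combined with an over-attenuation loss and a fine-tuning strategy, it outperforms existing causal models on several benchmarks and runs in real time on edge NPUs.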

Comments Accepted manuscript version. Accepted for publication in Speech Communication

Abstract

We present DPDFNet, a causal single-channel speech enhancement model that extends the DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demonstrate that adding a loss component to mitigate over-attenuation in the enhanced speech, combined with a fine-tuning phase tailored for "always-on" applications, leads to substantial improvements in overall model performance. We evaluate DPDFNet on the standard VoiceBank+DEMAND and DNS4 blind test benchmarks, where it shows consistent gains over DeepFilterNet2 and strong overall performance against other causal open-source models. In addition, we introduce a supplementary multilingual low-SNR evaluation set comprising long recordings in 12 languages across everyday noise scenarios, on which DPDFNet delivers superior performance to other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose a holistic metric named PRISM, a composite, scale-normalized aggregate of intrusive and non-intrusive metrics, which demonstrates clear scalability with the number of dual-path blocks. We further demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. Results indicate that DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.

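2601.03872 2026-06-17 cs.CL version update

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao

AI summary Proposes ATLAS, a dual-path framework that dynamically selects the best model-tool combination through training-free cluster-based routing and RL-based multi-step routing; across 15 benchmarks it outperforms GPT-4o and improves on existing routing methods by 10.1% on in-distribution and 13.1% on out-of-distribution tasks.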

Comments Accepted by ACL 2026

Abstract

The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.
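
The cluster-based routing path described in the abstract can be pictured with a minimal sketch. Everything in it is a hypothetical illustration rather than the authors' implementation: the 2-D centroids, the model and tool names, and the accuracy table are made up. A query embedding is assigned to its nearest centroid, and the model-tool pair with the best recorded accuracy in that cluster is chosen.

```python
import math

# Hypothetical cluster centroids in a 2-D embedding space (illustrative only).
CENTROIDS = {
    "math": (1.0, 0.0),
    "code": (0.0, 1.0),
}

# Hypothetical empirical priors: per-cluster accuracy of each model-tool pair.
PRIORS = {
    "math": {("model_a", "calculator"): 0.81, ("model_b", "search"): 0.64},
    "code": {("model_a", "calculator"): 0.40, ("model_b", "interpreter"): 0.77},
}


def nearest_cluster(embedding):
    """Return the name of the centroid closest to the embedding (Euclidean)."""
    return min(CENTROIDS, key=lambda name: math.dist(embedding, CENTROIDS[name]))


def route(embedding):
    """Pick the pair with the highest prior accuracy in the nearest cluster."""
    cluster = nearest_cluster(embedding)
    pairs = PRIORS[cluster]
    return cluster, max(pairs, key=pairs.get)
```

No training is involved in this sketch: routing is a lookup against statistics gathered beforehand, which is one plausible reading of "training-free" in the abstract.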

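2511.01650 2026-06-17 cs.CL cs.AI cs.LG version update

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning

Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

AI summary Proposes EngTrace, a symbolic benchmark of 1,350 parameterized test cases with a verifiable two-stage evaluation framework (tiered protocol plus AI Tribunal) that checks intermediate reasoning traces as well as final answers, revealing a trade-off between numeric precision and trace fidelity.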

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

Abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.
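
To make the idea of a parameterized template concrete, here is a minimal sketch. The Ohm's-law template, the parameter ranges, and all function names are hypothetical and are not taken from EngTrace. The abstract gives only the totals: 90 templates and 1,350 cases, which would be 15 instances per template if they are spread evenly (an inference, not a stated fact).

```python
import random


def ohms_law_template(rng):
    """Hypothetical template: sample parameters, render the problem, compute the answer."""
    voltage = rng.randint(5, 240)      # volts
    resistance = rng.randint(10, 500)  # ohms
    question = (
        f"A resistor of {resistance} ohm is connected across {voltage} V. "
        "What current flows through it, in amperes?"
    )
    # Ground truth comes from the governing formula, not from a stored answer key.
    current = voltage / resistance
    trace = [("I = V / R", current)]
    return {"question": question, "trace": trace, "answer": current}


def generate_instances(template, count, seed):
    """Generate `count` instances reproducibly from one template."""
    rng = random.Random(seed)
    return [template(rng) for _ in range(count)]


def check_answer(instance, predicted, rel_tol=1e-2):
    """Outcome check only: relative error against the computed ground truth."""
    truth = instance["answer"]
    return abs(predicted - truth) <= rel_tol * abs(truth)


instances = generate_instances(ohms_law_template, count=15, seed=0)
```

This sketch covers only the final-answer check. Validating the intermediate `trace` entries is the part the abstract assigns to the tiered protocol and the AI Tribunal, and it is not attempted here.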

2509.09631 2026-06-17 cs.SD cs.CL cs.CV version update

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

AI summary Proposes DiFlow-TTS, a framework that uses discrete flow matching and a factorized discrete flow denoiser to balance quality and latency in zero-shot TTS.

Comments Accepted at Interspeech 2026 (Long Paper track)

Abstract

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

2507.14632 2026-06-17 cs.CV version update

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

AI summary Proposes BusterX++, a unified MLLM whose pure-RL training strategy enables cross-modal capability transfer between image and video forgery detection and outperforms existing methods.

Abstract

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted $\mathrm{SFT} \rightarrow \mathrm{RL}$ post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight is that SFT lowers policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.
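
The policy-entropy argument in the abstract rests on a standard quantity, sketched below for a categorical next-token distribution. The two example distributions are hypothetical; they only show that a peaked policy has lower entropy than a spread-out one, and they say nothing about the entropies measured in the paper.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a categorical next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over four tokens.
peaked = [0.97, 0.01, 0.01, 0.01]   # low entropy: little room to explore
spread = [0.40, 0.30, 0.20, 0.10]   # higher entropy: more exploratory sampling
```

A policy like `peaked` rarely samples anything but its top token, which is the sense in which lower entropy would restrict exploration during RL.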

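2601.00215 2026-06-17 cs.CV cs.CL version update

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng

AI summary Studies the perception and reasoning bottlenecks in multimodal LLMs, finds that perception is the binding constraint, and uses reward design to improve visually grounded reasoning, for a 5.56-point gain over the base model.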

Comments 24 pages, 15 Figures, 10 Tables

Abstract

Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, perceive what is in an image, then reason about what it implies. Because these stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this on algorithmic visual puzzles, where both components are necessary, and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. Training Qwen-2.5-VL-7B with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are, however, uneven; no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that signals correct perception at its source rather than the reasoning that inherits its errors.
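
GRPO, the training algorithm named in the abstract, scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in its commonly described form. The reward values are hypothetical, the use of the population standard deviation is an assumption (implementations differ), and the six reward designs studied in the paper are not reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one visual puzzle:
# 1.0 = correct final answer, 0.0 = incorrect.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the shape of the reward matters.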

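2512.21315 2026-06-17 cs.LG cs.CV stat.ML version update

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

Roy Turgeman, Tom Tirer

AI summary Studies how low-level processing (such as denoising and encoding) can improve classification, proves that for any finite number of training samples there exists a pre-processing step that raises accuracy, and confirms the theoretical trends experimentally.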

Comments ICLR 2026 (camera-ready). Code is available at: https://github.com/serveroy/process-before-you-classify

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
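
The inequality itself is easy to check numerically. The sketch below builds a Markov chain X -> Y -> Z from two binary symmetric channels and confirms that I(X;Z) does not exceed I(X;Y). The flip probabilities are arbitrary illustrative values, and the example concerns information content only; it does not model the finite-sample classifiers that the paper analyses.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution given as a 2-D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (px[i] * py[j]))
    return total


def bsc_joint(flip):
    """Joint law of a uniform bit and its copy through a binary symmetric channel."""
    return [[0.5 * (1 - flip), 0.5 * flip],
            [0.5 * flip, 0.5 * (1 - flip)]]


a, b = 0.1, 0.2                      # flip probabilities of the two stages
cascade = a * (1 - b) + b * (1 - a)  # effective flip probability of X -> Z
i_xy = mutual_information(bsc_joint(a))
i_xz = mutual_information(bsc_joint(cascade))
```

Processing Y into Z can only lose information about X, so `i_xz <= i_xy`. The abstract's point is that this bound constrains the optimal Bayes classifier, not a classifier trained on finitely many samples.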

2512.16978 2026-06-17 cs.CV version update

A Benchmark for Omni-Modal Reasoning in Long Videos

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

AI summary Proposes LongShOTBench, a benchmark for evaluating omni-modal reasoning over vision, speech, and ambient audio in long videos, and introduces LongShOTAgent, a training-free omni-modal evidence-seeking agent that is the strongest training-free system in an evaluation of 105 models.

Abstract

Long-form omni-modal video understanding requires integrating vision, speech, and ambient audio with coherent long-context reasoning. Existing video benchmarks often trade off temporal scale, modality coverage, open-ended interaction, and interpretable scoring. To address this gap, we introduce LongShOTBench, a long video understanding benchmark designed around three coupled goals: holistic omni-modal integration, intent-driven open-ended interaction, and rubric-level diagnosis. It builds single- and multi-turn questions from real viewing scenarios, with systematic tasks probing visual, speech, ambient-audio, temporal, and cross-modal reasoning. Each item includes a reference answer and a weighted criterion-level rubric, letting evaluation identify which perceptual facts, temporal links, modality-grounding requirements, and reasoning steps are satisfied or missed. All samples are manually verified to improve grounding, clarity, and rubric reliability. We also introduce LongShOTAgent, a training-free omni-modal evidence-seeking agent coupling full-video preprocessing with targeted retrieval, query-adaptive segment refinement, and explicit claim verification over visual, speech, and non-speech audio evidence. Its iterative search-refine-verify loop exposes intermediate evidence and lets modality-specific specialists re-analyze relevant moments before answering. We evaluate 105 video-capable models spanning open-source omni-modal models, vision-language systems, audio LLMs, agentic pipelines and closed-source APIs. Current MLLMs remain far from saturating LongShOTBench, while our LongShOTAgent is the strongest training-free system, reaching 66.64% overall. By releasing the benchmark, leaderboard, and method, we provide a shared, interpretable testbed for advancing long-form omni-modal video reasoning. Code, data, and the leaderboard are available at https://longshot.cvmbzuai.com/.
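
A weighted criterion-level rubric of the kind the abstract describes can be scored as sketched below. The criteria, weights, and the simple weighted-share formula are hypothetical; the abstract does not specify how LongShOTBench aggregates rubric items.

```python
def rubric_score(criteria):
    """Weighted share of satisfied criteria, plus the list of missed ones."""
    total = sum(weight for _, weight, _ in criteria)
    earned = sum(weight for _, weight, satisfied in criteria if satisfied)
    missed = [name for name, _, satisfied in criteria if not satisfied]
    return earned / total, missed


# Hypothetical rubric for one question: (criterion, weight, satisfied?).
example = [
    ("identifies the speaker from the audio track", 2.0, True),
    ("links the alarm sound to the later scene", 3.0, False),
    ("states the correct final answer", 5.0, True),
]
score, missed = rubric_score(example)
```

Returning the missed criteria alongside the score is what makes the result diagnostic: it shows which perceptual fact or temporal link was lost, not just that the answer fell short.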

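2512.13853 2026-06-17 cs.LG cond-mat.stat-mech math.PR stat.ML version update

Dropout Neural Network Training Viewed from a Percolation Perspective

Finley Devlin, Jaron Sanders

AI summary Studies percolation in deep neural networks trained with dropout, builds new percolation models that characterise the relationship between network topology and the input-output path problem, and shows a percolative effect in dropout that can cause training to break down.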

Comments 21 pages, 14 figures

Abstract

In this work, we investigate the existence and effect of percolation in training deep Neural Networks (NNs) with dropout. Dropout methods are regularisation techniques for training NNs, first introduced by G. Hinton et al. (2012). These methods temporarily remove connections in the NN, randomly at each stage of training, and update the remaining subnetwork with Stochastic Gradient Descent (SGD). The process of removing connections from a network at random is similar to percolation, a paradigm model of statistical physics. If dropout were to remove enough connections such that there is no path between the input and output of the NN, then the NN could not make predictions informed by the data. We study new percolation models that mimic dropout in NNs and characterise the relationship between network topology and this path problem. The theory shows the existence of a percolative effect in dropout. We also show that this percolative effect can cause a breakdown when training NNs without biases with dropout; and we argue heuristically that this breakdown extends to NNs with biases.
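
The path problem in the abstract can be explored with a small Monte Carlo sketch. It assumes a fully connected layered network in which each connection is kept independently with probability `keep_prob`. This is a simplification chosen for illustration and is not one of the percolation models analysed in the paper.

```python
import random


def sample_masks(layer_sizes, keep_prob, rng):
    """Sample one Bernoulli keep/drop mask per weight matrix (connection-level dropout)."""
    return [
        [[rng.random() < keep_prob for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def has_input_output_path(masks):
    """Return True if at least one kept path links an input unit to an output unit."""
    reachable = [True] * len(masks[0])  # every input unit is a starting point
    for mask in masks:
        n_out = len(mask[0])
        reachable = [
            any(reachable[i] and mask[i][j] for i in range(len(mask)))
            for j in range(n_out)
        ]
        if not any(reachable):
            return False
    return any(reachable)


def disconnection_rate(layer_sizes, keep_prob, trials, seed=0):
    """Monte Carlo estimate of the probability that dropout severs input from output."""
    rng = random.Random(seed)
    cut = sum(
        not has_input_output_path(sample_masks(layer_sizes, keep_prob, rng))
        for _ in range(trials)
    )
    return cut / trials
```

For example, `disconnection_rate([4, 4, 4, 4], 0.3, 1000)` estimates how often a narrow three-layer mask leaves no route from input to output. Sweeping `keep_prob` and the layer widths gives a rough feel for how topology affects the chance of disconnection.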

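2506.24121 2026-06-17 cs.CV version update

TextMesh4D: Zero-shot Text-to-4D Mesh Generation

Sisi Dai, Xinxin Su, Kai Xu

AI summary Proposes TextMesh4D, which uses a Jacobian Deformation Field and a Local-Global Semantic Regularizer for zero-shot text-to-dynamic-mesh generation, resolving the mismatch between diffusion guidance and mesh topology constraints and achieving high temporal consistency and geometric fidelity.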

Abstract

Large-scale, high-quality dynamic 3D (4D) assets are essential for learning physically grounded representations, but remain costly to capture and annotate at scale. This limits the viability of supervised 4D learning and motivates zero-shot text-to-4D generation leveraging pretrained diffusion priors. To model complex dynamics, prior methods typically adopt implicit 3D representations (e.g., NeRFs or 3DGS) for their deformation capacity. However, their implicit nature provides limited control over surface topology, which hinders high-fidelity geometry and makes temporally coherent surface reconstruction challenging. To address these limitations, we explore zero-shot text-to-4D mesh generation. However, a structural mismatch arises when combining diffusion-based guidance with topology-constrained meshes: the guidance is noisy and spatially inconsistent, while meshes impose severe topological constraints, making direct vertex-level deformation unstable. In this paper, we introduce TextMesh4D, the first zero-shot framework for text-to-4D that directly generates dynamic meshes by addressing the above challenge at two complementary levels. Geometrically, we shift deformation modeling from vertices to faces via a Jacobian Deformation Field (JDF), enabling topology-aware surface reconstruction through an integrability-enforcing integration formulation. Semantically, we propose a Local-Global Semantic Regularizer (LGSR) that preserves identity over time by jointly constraining local deformation plausibility and global shape consistency. Extensive experiments demonstrate state-of-the-art temporal consistency, structural fidelity, and visual quality, while remaining efficient on a single 24GB GPU.

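2512.13009 2026-06-17 cs.RO version update

K-VARK: Kernelized Variance-Aware Residual Kalman Filter for Sensorless Force Estimation in Collaborative Robots

Oğuzhan Akbıyık, Naseem Alhousani, Fares J. Abu-Dakka

AI summary Proposes K-VARK, which uses Kernelized Movement Primitives to learn the predictive mean and heteroscedastic variance of residual torques and adapts the Kalman filter noise covariances, achieving sensorless force estimation on a 6-DoF collaborative manipulator with over 20% lower RMSE.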

Abstract

Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.
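
The effect of augmenting the measurement noise covariance with a predicted variance can be seen in a scalar Kalman measurement update. The sketch assumes a one-dimensional state with observation matrix H = 1 and uses made-up numbers. It is a textbook update with an inflated noise term, not the K-VARK filter, and it omits the variational Bayesian adaptation of the process noise.

```python
def variance_aware_update(x_prior, p_prior, z, r_base, predicted_var):
    """Scalar Kalman measurement update with model-variance-augmented measurement noise."""
    r_eff = r_base + predicted_var          # augmented measurement noise
    gain = p_prior / (p_prior + r_eff)      # Kalman gain for H = 1
    x_post = x_prior + gain * (z - x_prior)
    p_post = (1.0 - gain) * p_prior
    return x_post, p_post, gain


# Same virtual measurement, trusted fully versus discounted by a larger model variance.
confident = variance_aware_update(0.0, 1.0, 1.0, r_base=1.0, predicted_var=0.0)
uncertain = variance_aware_update(0.0, 1.0, 1.0, r_base=1.0, predicted_var=2.0)
```

With a larger predicted variance the gain shrinks, so the estimate moves less toward the virtual measurement. That is the behaviour one would want far from the training data, where the learned residual model is least reliable.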

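2512.11784 2026-06-17 cs.LG stat.ML version update

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Etienne Boursier, Claire Boyer

AI summary Develops a measure-based framework that builds on the convergence of softmax attention to a linear operator in the infinite-prompt limit, gives non-asymptotic concentration bounds for finite prompts, and thereby transfers optimization analyses of linear attention to softmax attention with long prompts.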

Abstract

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.
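
A toy case makes the linear limit tangible. Assume tokens drawn i.i.d. from a standard Gaussian N(0, I) and identity key and value maps (a simplification made here, not the paper's general setting). Weighting a standard Gaussian by exp(q·x) yields the Gaussian N(q, I), so the infinite-prompt attention output for query q is q itself, a linear function of the query. The sketch measures how far a finite prompt is from that limit.

```python
import math
import random


def softmax_attention(query, tokens):
    """Single-query softmax attention with identity key and value maps."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in tokens]
    top = max(scores)                                   # stabilise the exponentials
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(query)
    return [sum(w * tok[k] for w, tok in zip(weights, tokens)) / norm for k in range(dim)]


def gap_to_linear_limit(query, prompt_length, seed=0):
    """Euclidean distance between the attention output and the limiting value `query`."""
    rng = random.Random(seed)
    tokens = [[rng.gauss(0.0, 1.0) for _ in query] for _ in range(prompt_length)]
    output = softmax_attention(query, tokens)
    return math.dist(output, query)
```

Calling `gap_to_linear_limit([0.5, -0.3], n)` for increasing `n` shows the gap shrinking as the prompt grows. The paper's concentration bounds quantify the rate of this kind of convergence; the sketch only illustrates it.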

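2509.12742 2026-06-17 cs.CV version update

Effective Gaussian Management for High-fidelity Object Reconstruction

Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Junxin Chen, Feng Xu

AI summary Proposes a Gaussian management framework that combines selective attribute activation, adaptive representation, and task-decoupled pruning with a regularized surface reconstruction module, achieving high-fidelity appearance and geometry reconstruction with fewer parameters.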

Abstract

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation, and pruning of Gaussians. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonic (SH) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for the above management process, we additionally introduce CoRe, a regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing the size of the model. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.

2511.01352 2026-06-17 cs.LG astro-ph.HE astro-ph.IM hep-ex physics.data-an version update

MiniFool -- Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks

Lucie Flek, Oliver Janik, Philipp Alexander Jung, Akbar Karimi, Timo Saala, Alexander Schmidt, Matthias Schott, Philipp Soldin, Matthias Thiesmeyer, Christopher Wiebusch, Ulrich Willemsen

AI summary Proposes MiniFool, an algorithm that generates physics-aware adversarial examples by minimizing a cost function combining a $\chi^2$ test statistic with the deviation from a target score, for testing neural network classifiers in particle and astroparticle physics and quantifying the robustness of network decisions.

Comments Submitted to Computing and Software for Big Science

Journal ref Published in: Eur.Phys.J.C 86 (2026) 6, 641

Abstract

In this paper, we present a new algorithm, MiniFool, that implements physics-inspired adversarial attacks for testing neural network-based classification tasks in particle and astroparticle physics. While we initially developed the algorithm for the search for astrophysical tau neutrinos with the IceCube Neutrino Observatory, we apply it to further data from other science domains, thus demonstrating its general applicability. Here, we apply the algorithm to the well-known MNIST data set and, furthermore, to Open Data from the CMS experiment at the Large Hadron Collider. The algorithm is based on minimizing a cost function that combines a $\chi^2$-based test statistic with the deviation from the desired target score. The test statistic quantifies the probability of the perturbations applied to the data based on the experimental uncertainties. For our studied use cases, we find that the likelihood of a flipped classification differs between initially correctly and initially incorrectly classified events. When testing changes of the classifications as a function of an attack parameter that scales the experimental uncertainties, the robustness of the network decision can be quantified. Furthermore, this allows testing the robustness of the classification of unlabeled experimental data.
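
The abstract names the two ingredients of the cost function but not its exact form. The sketch below assumes the simplest combination: a chi-square penalty on the perturbation in units of the scaled experimental uncertainties, plus a weighted squared deviation from the target score. The weight `lam`, the toy logistic classifier, and the finite-difference gradient descent are all assumptions made for illustration and are not the MiniFool implementation.

```python
import math


def classifier_score(features, weights, bias):
    """Toy logistic classifier standing in for the network under test."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def attack_cost(perturbed, original, sigmas, target, weights, bias, scale=1.0, lam=10.0):
    """Chi-square penalty on the perturbation plus squared deviation from the target score."""
    chi2 = sum(((p - o) / (scale * s)) ** 2 for p, o, s in zip(perturbed, original, sigmas))
    deviation = (classifier_score(perturbed, weights, bias) - target) ** 2
    return chi2 + lam * deviation


def minimize_cost(original, sigmas, target, weights, bias, steps=200, lr=0.01, h=1e-5, **kw):
    """Plain gradient descent with central finite differences (illustrative minimizer)."""
    x = list(original)
    for _ in range(steps):
        grad = []
        for k in range(len(x)):
            up = x[:k] + [x[k] + h] + x[k + 1:]
            down = x[:k] + [x[k] - h] + x[k + 1:]
            diff = (attack_cost(up, original, sigmas, target, weights, bias, **kw)
                    - attack_cost(down, original, sigmas, target, weights, bias, **kw))
            grad.append(diff / (2 * h))
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x
```

The `scale` argument plays the role of the attack parameter mentioned in the abstract: increasing it makes larger perturbations cheaper, so scanning it shows how much the inputs must move, in units of their uncertainties, before the score changes.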

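2505.03509 2026-06-17 cs.LG astro-ph.IM Updated version

AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning

Pablo Gómez, Laslo E. Ruhberg, Maria Teresa Nardone, David O'Ryan

AI summary Proposes the AnomalyMatch framework, which combines the semi-supervised FixMatch algorithm with active learning, treats anomaly detection as a binary classification problem, and trains on a few labelled and many unlabelled images, achieving high AUROC and AUPRC under severe class imbalance.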

Comments Accepted for publication in RASTI; 17 pages; 12 figures

Abstract

Anomaly detection in large datasets is essential in astronomy and computer vision. However, due to a scarcity of labelled data, it is often infeasible to apply supervised methods to anomaly detection. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance display strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of the highest-ranking images by score. We compare to the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
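The "precision in the top 1% of the highest-ranking images" metric quoted in this abstract can be computed along the following lines. This is a sketch under assumptions: the function name, the rounding rule for the cutoff, and the toy data are illustrative and not taken from the AnomalyMatch code.

```python
def precision_at_top_fraction(scores, labels, fraction=0.01):
    """Fraction of true anomalies (label 1) among the top-scoring items."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n_top]) / n_top

# Hypothetical data: 200 images, so the top 1% contains 2 images.
scores = [i / 200 for i in range(200)]  # higher score = more anomalous
labels = [0] * 198 + [1, 0]             # only the second-highest score is a true anomaly
p = precision_at_top_fraction(scores, labels)
```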

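2510.23798 2026-06-17 cs.CV cs.AI Updated version

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

AI summary Proposes a framework combining a geometric model with deep learning that uses fixed cameras to continuously quantify and monitor floating debris in urban rivers, evaluates the accuracy and speed of different models under complex environmental conditions, and estimates debris size via projective geometry.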

Abstract

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
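The idea of estimating metric size from a 2D detection can be illustrated with the ideal pinhole-camera relation. The abstract does not give the paper's geometric model, so this sketch assumes an undistorted camera, an object lying in a plane perpendicular to the optical axis, and a known camera-to-object distance; the linear correction and its coefficients are likewise hypothetical.

```python
def metric_size(pixel_extent, distance_m, focal_length_px):
    """Ideal pinhole estimate: real size = pixel extent * distance / focal length."""
    return pixel_extent * distance_m / focal_length_px

def corrected_size(raw_size_m, slope=1.0, intercept=0.0):
    """Hypothetical linear regression correction fitted on reference objects."""
    return slope * raw_size_m + intercept

# A 100-pixel-wide detection seen from 5 m with a 1000-pixel focal length.
raw = metric_size(100, 5.0, 1000.0)
size = corrected_size(raw)
```

Here the focal length (an intrinsic parameter) and the distance (derived from the camera's pose, an extrinsic parameter) are the two camera characteristics the estimate depends on.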

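2507.11178 2026-06-17 cs.LG cs.AI Updated version

A Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes

Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Mingbao Yang, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

AI summary Proposes GRNGC, which infers Granger causality by applying L1 regularization to the gradients between model inputs and outputs; it needs only one prediction model, reduces computational overhead, and outperforms existing methods on several benchmark and real-world datasets.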

Comments 9 pages, 3 figures, conference

Abstract

With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt the component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing the sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model's ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between the model's input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model's effectiveness in reconstructing gene regulatory networks.
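The role of the input-output gradient in this abstract can be illustrated on a fixed toy predictor. This is only a sketch: it uses finite differences instead of automatic differentiation, it evaluates the $L_{1}$ term without training anything, and the predictor, tolerance, and function names are all assumptions rather than the GRNGC implementation.

```python
import math

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian J[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    J = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        xp = list(x); xp[j] += eps
        xm = list(x); xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(n_out):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def l1_gradient_penalty(f, x):
    """L1 norm of the input-output gradient at x (the term that would be added to a loss)."""
    return sum(abs(v) for row in jacobian(f, x) for v in row)

def granger_matrix(f, samples, tol=1e-3):
    """Entry [i][j] is True when the mean |d f_i / d x_j| over samples exceeds tol."""
    jacs = [jacobian(f, x) for x in samples]
    n_out, n_in = len(jacs[0]), len(jacs[0][0])
    return [[sum(abs(J[i][j]) for J in jacs) / len(jacs) > tol for j in range(n_in)]
            for i in range(n_out)]

def predictor(x_prev):
    """Toy one-step predictor: series 0 drives series 1, but not the reverse."""
    return [0.9 * x_prev[0], 0.5 * x_prev[0] + 0.3 * math.tanh(x_prev[1])]

G = granger_matrix(predictor, [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]])
```

In this toy case `G[1][0]` is true and `G[0][1]` is false, matching the way the predictor was constructed.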

2510.19838 2026-06-17 cs.AI cs.CL cs.LG Updated version

Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

Shiqi He, Yue Cui, Xinyu Ma, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

AI summary Proposes the Branch-and-Browse framework, which uses tree-structured reasoning, web state replay, and a page action memory to give LLM web agents efficient, controllable multi-branch exploration; on WebArena it reaches a 35.8% success rate and cuts execution time by up to 40.4%.

Abstract

Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.

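2510.14807 2026-06-17 cs.AI Updated version

Beyond the Sampled Token: Preserving Candidate Support in RLVR

Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen

AI summary Analyzes exploration collapse in RLVR from the perspective of the candidate distribution and proposes CaSP, which preserves probability mass on the top-N candidates and improves pass@K without sacrificing pass@1; its effectiveness is validated on multiple benchmarks.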

Comments Technical report (23 pages, 16 figures, project page: https://spherelab.ai/simko/)

Abstract

We revisit exploration collapse in reinforcement learning with verifiable rewards (RLVR), from the perspective of the \emph{candidate distribution} for next-token prediction. We formally show that as probability concentrates on the top-$1$ candidate, the expected number of distinct responses collapses to one regardless of the sampling budget $K$. This theoretical implication is further verified by our empirical tracking of top-$N$ candidate probabilities during training, where the top-$1$ candidate progressively dominates while plausible alternatives are suppressed. These findings suggest a key desideratum for effective exploration: \emph{preserving non-negligible probability mass on the top-$N$ candidates}. To this end, we propose Candidate-aware Support Preservation (CaSP), with two complementary designs. Specifically, CaSP redistributes positive gradients among top-$N$ candidates for correct responses, and applies a stronger penalty to the top-$1$ candidate for incorrect responses. Unlike many exploration-oriented methods that improve pass@$K$ at the cost of pass@1, CaSP improves pass@$K$ across the full $K$ spectrum. These gains generalize to 6 math, 2 logical-reasoning, and 2 coding benchmarks, and scale to 32B-parameter models and sampling budgets up to $K=1024$, positioning it as a principled, candidate-level approach for RLVR exploration.
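The collapse claim in this abstract can be illustrated numerically under a strong simplification: each complete response is treated as one outcome of a fixed categorical distribution, and the $K$ samples are drawn independently. Under that assumption (which is ours, not the paper's token-level analysis), the expected number of distinct responses is $\sum_i \bigl(1 - (1 - p_i)^K\bigr)$.

```python
def expected_distinct(probs, k):
    """Expected number of distinct outcomes in k i.i.d. draws from `probs`."""
    return sum(1.0 - (1.0 - p) ** k for p in probs)

spread = [0.4, 0.3, 0.2, 0.1]            # mass shared among several candidates
peaked = [0.997, 0.001, 0.001, 0.001]    # mass concentrated on the top-1 candidate
degenerate = [1.0, 0.0, 0.0, 0.0]        # limiting case: all mass on one candidate

for k in (8, 64, 1024):
    print(k, expected_distinct(spread, k), expected_distinct(peaked, k),
          expected_distinct(degenerate, k))
```

In the limiting case the count stays at one for every budget, while a distribution that keeps mass on several candidates approaches the full candidate count as the budget grows.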

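2510.11709 2026-06-17 cs.LG cs.AI cs.CV Updated version

Adversarial Attacks Leverage Interference Between Features in Superposition

Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal

AI summary Shows that interference caused by feature superposition in neural networks can be a source of adversarial vulnerability, and verifies through theoretical derivation and experiments that interference patterns determine attack success and transferability.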

Comments Forty-third International Conference on Machine Learning

Abstract

Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition - the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representation and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.
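The interference mechanism described in this abstract can be seen in the smallest possible case: three features stored in two dimensions. The construction below (unit directions 120 degrees apart, linear readout by projection) is an illustrative assumption, not the paper's experimental setup.

```python
import math

# Three unit-norm feature directions packed into two dimensions (120 degrees apart).
features = [(math.cos(2 * math.pi * i / 3), math.sin(2 * math.pi * i / 3))
            for i in range(3)]

def readout(x):
    """Activation of each feature: projection of x onto its direction."""
    return [fx * x[0] + fy * x[1] for fx, fy in features]

eps = 0.1
delta = (eps * features[0][0], eps * features[0][1])  # perturbation aimed at feature 0
shift = readout(delta)  # change in every feature's activation caused by delta
```

Because the directions cannot be orthogonal, a perturbation of size 0.1 along feature 0 also moves features 1 and 2 by 0.1 · cos(120°) = −0.05 each.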

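2508.19445 2026-06-17 cs.LG stat.ML Updated version

On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab

AI summary Proves that building blocks of modern neural architectures (such as pre-layer normalization and linear-attention modules) are almost always surjective, meaning that any output, including harmful content, can in principle be generated, which exposes an inherent vulnerability of models to adversarial attacks.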

Comments Blog: https://astro-eric.github.io/blogs/surjective/

Abstract

Given a trained neural network, can any specified output be generated by some input? Equivalently, does the network correspond to a function that is surjective? In generative models, surjectivity implies that any output, including harmful or undesirable content, can in principle be generated by the networks, raising concerns about model safety and jailbreak vulnerabilities. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and diffusion models with deterministic ODE solvers, admit inverse mappings for arbitrary outputs. By studying surjectivity of these modern and commonly used neural architectures, we contribute a formalism that sheds light on their unavoidable vulnerability to a broad class of adversarial attacks.

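2509.26633 2026-06-17 cs.RO cs.AI cs.LG cs.SY eess.SY Updated version

OmniRetarget: Interaction-Preserving Data Generation for Humanoid Whole-Body Loco-Manipulation and Scene Interaction

Lujie Yang, Xiaoyu Huang, Zhen Wu, Angjoo Kanazawa, Pieter Abbeel, Carmelo Sferrazza, C. Karen Liu, Rocky Duan, Guanya Shi

AI summary Proposes the OmniRetarget engine, which uses an interaction mesh to explicitly model and preserve spatial and contact relationships among the agent, terrain, and objects while retargeting human motion to robot motion, producing high-quality trajectories for training reinforcement learning policies that achieve long-horizon parkour and manipulation skills.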

Comments Project website: https://omniretarget.github.io

Abstract

A dominant paradigm for teaching humanoid robots complex skills is to retarget human motions as kinematic references to train reinforcement learning (RL) policies. However, existing retargeting pipelines often struggle with the significant embodiment gap between humans and robots, producing physically implausible artifacts like foot-skating and penetration. More importantly, common retargeting methods neglect the rich human-object and human-environment interactions essential for expressive locomotion and loco-manipulation. To address this, we introduce OmniRetarget, an interaction-preserving data generation engine based on an interaction mesh that explicitly models and preserves the crucial spatial and contact relationships between an agent, the terrain, and manipulated objects. By minimizing the Laplacian deformation between the human and robot meshes while enforcing kinematic constraints, OmniRetarget generates kinematically feasible trajectories. Moreover, preserving task-relevant interactions enables efficient data augmentation, from a single demonstration to different robot embodiments, terrains, and object configurations. We comprehensively evaluate OmniRetarget by retargeting motions from OMOMO, LAFAN1, and our in-house MoCap datasets, generating over 8 hours of trajectories that achieve better kinematic constraint satisfaction and contact preservation than widely used baselines. Such high-quality data enables proprioceptive RL policies to successfully execute long-horizon (up to 30 seconds) parkour and loco-manipulation skills on a Unitree G1 humanoid, trained with only 5 reward terms and simple domain randomization shared by all tasks, without any learning curriculum.

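2503.10945 2026-06-17 cs.LG cs.AI cs.CR stat.ML Updated version

Gaussian DP for Reporting Differential Privacy Guarantees in Machine Learning

Juan Felipe Gomez, Bogdan Kulynych, Georgios Kaissis, Flavio P. Calmon, Jamie Hayes, Borja Balle, Antti Honkela

AI summary To address incomplete reporting of differential privacy in machine learning, proposes non-asymptotic Gaussian differential privacy (GDP) as the primary reporting format and uses numerical accountants and a decision-theoretic metric to show that GDP captures the full privacy profile of DP-SGD and related algorithms with virtually no error.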

Comments IEEE SatML 2026 (position paper track)

Abstract

Current practices for reporting differential privacy (DP) guarantees for machine learning (ML) algorithms such as DP-SGD provide an incomplete and potentially misleading picture. For instance, if only a single $(\varepsilon, \delta)$ is known about a mechanism, standard analyses show that there could exist highly accurate inference attacks against training data records, when, upon a more careful analysis, such accurate attacks do not exist for most practical mechanisms. In this position paper, we argue that using _non-asymptotic_ Gaussian Differential Privacy (GDP) as the primary means of communicating DP guarantees in ML avoids these potential downsides. Using two recent developments in the DP literature: (i) open-source numerical accountants capable of computing the privacy profile and $f$-DP curves of DP-SGD to arbitrary accuracy, and (ii) a decision-theoretic metric over DP representations, we show how to provide non-asymptotic bounds on GDP using numerical accountants, and show that GDP can capture the entire privacy profile of DP-SGD and related algorithms with virtually no error, as quantified by the metric. To support our claims, we investigate the privacy profiles of state-of-the-art DP large-scale image classification, and the TopDown algorithm for the U.S. Decennial Census, observing that GDP fits their profiles remarkably well in all cases. We conclude with a discussion on the strengths and weaknesses of this approach, and discuss which other privacy mechanisms could benefit from GDP.

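2509.26476 2026-06-17 cs.CL cs.AI cs.LG cs.PF cs.SE Updated version

Regression Language Models for Code

Yash Akhauri, Xingyou Song, Arissa Wongpanich, Bryan Lewandowski, Mohamed S. Abdelfattah

AI summary Proposes a Regression Language Model (RLM) that uses a frozen LLM encoder to predict code execution outcomes (such as memory footprint, latency, and neural network accuracy) directly from text, reaching high rank correlation on multiple tasks.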

Comments Published in International Conference on Machine Learning (ICML) 2026

Abstract

We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of programming languages. While prior methods have resorted to heavy and domain-specific feature engineering, we show that a single unified Regression Language Model (RLM) using a frozen LLM encoder can simultaneously predict directly from text, (i) the memory footprint of code across multiple high-level languages such as Python and C++, (ii) the latency of Triton GPU kernels, and (iii) the accuracy and speed of trained neural networks represented in ONNX. In particular, a relatively small 300M parameter RLM based on T5Gemma, obtains >0.9 Spearman-rank on competitive programming submissions from APPS, and a single unified model achieves >0.5 average Spearman-rank across 24 different programming languages from CodeNet. Furthermore, the RLM can obtain the highest average Kendall-Tau of 0.46 on five classic NAS design spaces previously dominated by graph neural networks, and simultaneously predict architecture latencies on numerous hardware platforms.

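2507.18623 2026-06-17 cs.LG cs.AI cs.MA Updated version

Moving Out: Physically-grounded Human-AI Collaboration

Xuhui Kang, Sung-Wook Lee, Haolin Liu, Yuyan Wang, Yen-Ling Kuo

AI summary Proposes the Moving Out benchmark, which simulates collaboration scenarios under physical constraints, and develops the BASS method to enhance agent diversity and understanding of action outcomes; experiments show that it collaborates effectively with both unseen AI agents and humans.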

Comments Accepted at ICML 2026

Abstract

The ability to adapt to physical actions and constraints in an environment is crucial for embodied agents (e.g., robots) to effectively collaborate with humans. Such physically grounded human-AI collaboration must account for the increased complexity of the continuous state-action space and constrained dynamics caused by physical constraints. However, most existing collaboration benchmarks are discrete or do not consider physical attributes and constraints. To address this, we introduce Moving Out, a human-AI collaboration benchmark that resembles a wide range of collaboration modes affected by physical attributes and constraints, such as moving heavy items together and coordinating actions to move an item around a corner. Moving Out consists of two challenges and human-human interaction data to comprehensively evaluate models' abilities to adapt to diverse human behaviors and unseen physical attributes. To give embodied agents the capability to collaborate with humans under physical attributes and constraints, we propose a novel method, BASS (Behavior Augmentation, Simulation, and Selection), to enhance the diversity of agents and their understanding of the outcome of actions. We systematically compare BASS and state-of-the-art models in AI-AI and human-AI experiments, showing that BASS can effectively collaborate with both unseen AI and humans. The project page is available at https://live-robotics-uva.github.io/movingout_ai/.

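2509.15210 2026-06-17 cs.SD cs.AI cs.LG Updated version

Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation

Chen Si, Qianyi Wu, Chaitanya Amballa, Romit Roy Choudhury

AI summary Proposes MiNAF, a model that queries a room mesh and extracts distance distributions as explicit local geometric features to guide a neural implicit model toward more accurate room impulse response (RIR) predictions, achieving competitive performance on several metrics.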

Abstract

Realistic sound simulation plays a critical role in many applications. A key element in sound simulation is the room impulse response (RIR), which characterizes how sound propagates within a given space. Recent studies have applied neural implicit methods to learn RIR using context information collected from the environment, such as scene images. However, these approaches do not effectively leverage explicit geometric information from the environment. To further exploit neural implicit models with direct geometric features, we present MiNAF, which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the model in generating more accurate RIR predictions. Through comparisons with conventional and state-of-the-art methods, we show that MiNAF performs competitively across various evaluation metrics.

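2509.00064 2026-06-17 cs.RO cs.CV Updated version

OpenTie: Open-vocabulary Sequential Rebar Tying System

Sai Fan, Mingze Liu, Haozhen Li, Haobo Liang, Yixing Yuan, Yanke Wang

AI summary Proposes OpenTie, a training-free 3D rebar tying framework that achieves high-accuracy sequential tying through RGB-to-point-cloud generation and open-vocabulary detection, outperforming a YOLO-based method.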

Comments This article is accepted by The 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

Abstract

Robotic practices on construction sites are attracting attention owing to their capability of tackling complex challenges, especially in rebar-involved scenarios. Most existing products and research focus on collecting large amounts of data to meet model training demands. To fill this gap, we propose OpenTie, a 3D training-free rebar tying framework that uses RGB-to-point-cloud generation and open-vocabulary rebar detection, validated in real-world tests. We implement OpenTie on a robotic arm with a binocular camera and ensure high accuracy by applying a prompt-based object detection method to images filtered by our proposed post-processing procedure for the image-to-point-cloud generation framework. Our pipeline requires no training and outperforms training-based object detection (i.e., a YOLO-based method) in a real-world sequential rebar tying test. The system is flexible across horizontal and vertical rebar tying tasks and has potential for application on real construction sites and for commercialization.

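2502.08363 2026-06-17 cs.CL cs.AI Updated version

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

AI summary Proposes Top-Theta attention, a training-free inference-time sparsification method that uses static per-head thresholds to retain a roughly constant number of significant elements per row, combined with compensation techniques to preserve accuracy at high sparsity; on NLP tasks it achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements with no more than a 1% accuracy drop.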

Comments Extended version of a paper accepted at ICANN 2026

Abstract

We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
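The calibrate-then-threshold idea in this abstract can be sketched for a single attention head as follows. The calibration rule used here (averaging the k-th largest score over calibration rows) and all names are assumptions for illustration; the abstract does not specify the rule, and the compensation techniques it mentions are not shown.

```python
def calibrate_threshold(calibration_rows, k):
    """Static threshold for one head: mean of the k-th largest score per row (assumed rule)."""
    kth_largest = [sorted(row, reverse=True)[k - 1] for row in calibration_rows]
    return sum(kth_largest) / len(kth_largest)

def sparsify(row, theta):
    """Keep (index, score) pairs with score >= theta; no per-row sort at inference time."""
    return [(i, v) for i, v in enumerate(row) if v >= theta]

# Hypothetical calibration data: five rows, each a rotation of the scores 0..9.
calibration_rows = [[float((i + shift) % 10) for i in range(10)] for shift in range(5)]
theta = calibrate_threshold(calibration_rows, k=3)
kept = sparsify([float(i) for i in range(10)], theta)
```

Unlike top-k selection, the inference-time step is an elementwise comparison, so the number of kept elements is only approximately k on rows whose score distribution differs from the calibration data.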

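2508.06692 2026-06-17 cs.LG Updated version

HeteRo-Select: Informativeness as the Participation Driver in Heterogeneous Federated Learning

Md. Akmol Masud, Md Abrar Jahin, Mahmud Hasan

AI summary Proposes the HeteRo-Select framework, which replaces bandwidth-driven compression with a per-client informativeness score that jointly determines client selection, compression ratio, and aggregation weight, reducing heterogeneity and traffic; on CIFAR-10 it achieves a 1.78x speedup and an 18.2% traffic reduction.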

Abstract

Federated learning systems typically allocate gradient compression by link speed. This is sensible when bandwidth and data informativeness align. However, under non-IID data, these signals often decorrelate or invert. A bandwidth-driven allocator then risks compressing the most informative gradients hardest. We propose HeteRo-Select, a framework that replaces bandwidth with a per-client informativeness score as the primary driver of compression. The score jointly governs three decisions per round: client selection, compression ratio, and server aggregation weight, with bandwidth retained only as a hard ceiling. Score-proportional selection provably reduces the effective heterogeneity of the chosen subset; score-proportional compression provably lowers aggregate top-$k$ error at fixed traffic. Under the exact FedCG simulation protocol, HeteRo-Select delivers a $1.78\times$ speedup and an $18.2\%$ reduction in traffic on CIFAR-10. The same configuration, unchanged, scales from a $7{,}850$-parameter logistic regression to an $11.27$M-parameter ResNet-18, hitting the accuracy target on three of four benchmarks. When bandwidth and informativeness are deliberately anti-correlated, the method still achieves the target accuracy with less traffic than the normal-bandwidth run.