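arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.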

2601.10962 2026-06-17 cs.LG cond-mat.dis-nn version update

Noise-Driven Exploration and Transient Freezing Select Flat Minima in Stochastic Gradient Descent


Ning Yang, Yikuan Zhang, Qi Ouyang, Chao Tang, Yuhai Tu

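AI summary: By analyzing SGD learning dynamics, the paper identifies a nonequilibrium mechanism that drives solution selection: a transient exploratory phase escapes sharp valleys, noise reshapes the loss landscape into an effective potential that stabilizes flat solutions, and delaying the freezing transition favors flatter, more generalizable minima.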

Comments 12 pages, 4 figures

Abstract

Stochastic gradient descent (SGD) is central to deep learning, yet the dynamical origin of its preference for flatter, more generalizable solutions remains unclear. Here, by analyzing SGD learning dynamics, we identify a nonequilibrium mechanism that governs solution selection during training. Numerical experiments reveal a transient exploratory phase in which SGD trajectories repeatedly escape sharp valleys and migrate toward flatter regions of the loss landscape before becoming confined to a final basin. Using a tractable physical model, we show that SGD noise reshapes the loss landscape into an effective potential that preferentially stabilizes flat solutions. We further uncover a transient freezing mechanism: as training progresses, the flattening landscape suppresses transitions between competing valleys. Stronger SGD noise delays this freezing transition, prolonging the exploratory phase and thereby increasing the probability of convergence to flatter minima. Together, these results provide a unified physical framework connecting learning dynamics, loss-landscape geometry, and generalization, and suggest guiding principles for the design of more effective optimization algorithms.

2512.16420 2026-06-17 cs.SD version update

DPDFNet: Boosting DeepFilterNet2 via Dual-Path RNN


Daniel Rika, Nino Sapir, Ido Gus

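AI summary: Proposes DPDFNet, which adds dual-path blocks to the DeepFilterNet2 encoder to strengthen long-range temporal and cross-band modeling; combined with an over-attenuation loss and a fine-tuning stage, it improves on DeepFilterNet2, compares favorably with other causal open-source models on several benchmarks, and runs in real time on edge NPUs.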

Comments Accepted manuscript version. Accepted for publication in Speech Communication


DOI: 10.1016/j.specom.2026.103430

Abstract

We present DPDFNet, a causal single-channel speech enhancement model that extends the DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demonstrate that adding a loss component to mitigate over-attenuation in the enhanced speech, combined with a fine-tuning phase tailored for "always-on" applications, leads to substantial improvements in overall model performance. We evaluate DPDFNet on the standard VoiceBank+DEMAND and DNS4 blind test benchmarks, where it shows consistent gains over DeepFilterNet2 and strong overall performance against other causal open-source models. In addition, we introduce a supplementary multilingual low-SNR evaluation set comprising long recordings in 12 languages across everyday noise scenarios, on which DPDFNet delivers superior performance to other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose a holistic metric named PRISM, a composite, scale-normalized aggregate of intrusive and non-intrusive metrics, which demonstrates clear scalability with the number of dual-path blocks. We further demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. Results indicate that DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.

2601.03872 2026-06-17 cs.CL version update

Atlas: Orchestrating Heterogeneous Models and Tools for Multi-Domain Complex Reasoning


Jinyang Wu, Guocheng Zhai, Ruihan Jin, Jiahao Yuan, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao

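AI summary: Proposes ATLAS, a dual-path framework that dynamically selects model-tool combinations through training-free cluster-based routing and RL-based multi-step routing; across 15 benchmarks it outperforms GPT-4o and improves on existing routing methods by 10.1% on in-distribution tasks and 13.1% on out-of-distribution tasks.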

Comments Accepted by ACL 2026

Abstract

The integration of large language models (LLMs) with external tools has significantly expanded the capabilities of AI agents. However, as the diversity of both LLMs and tools increases, selecting the optimal model-tool combination becomes a high-dimensional optimization challenge. Existing approaches often rely on a single model or fixed tool-calling logic, failing to exploit the performance variations across heterogeneous model-tool pairs. In this paper, we present ATLAS (Adaptive Tool-LLM Alignment and Synergistic Invocation), a dual-path framework for dynamic tool usage in cross-domain complex reasoning. ATLAS operates via a dual-path approach: (1) training-free cluster-based routing that exploits empirical priors for domain-specific alignment, and (2) RL-based multi-step routing that explores autonomous trajectories for out-of-distribution generalization. Extensive experiments across 15 benchmarks demonstrate that our method outperforms closed-source models like GPT-4o, surpassing existing routing methods on both in-distribution (+10.1%) and out-of-distribution (+13.1%) tasks. Furthermore, our framework shows significant gains in visual reasoning by orchestrating specialized multi-modal tools.

2511.01650 2026-06-17 cs.CL cs.AI cs.LG version update

EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning


Ayesha Gull, Muhammad Usman Safder, Rania Elbadry, Fan Zhang, Veselin Stoyanov, Preslav Nakov, Zhuohan Xie

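AI summary: Introduces EngTrace, a symbolic benchmark of 1,350 parameterized test cases with a verifiable two-stage evaluation framework (a tiered protocol plus an AI Tribunal) that checks intermediate reasoning traces alongside final answers, revealing a trade-off between numeric precision and trace fidelity.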

Comments 33 pages, includes figures and tables; introduces the EngTrace benchmark

Abstract

Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills, failing to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances, spanning three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we introduce a verifiable two-stage evaluation framework that uses a tiered protocol to validate intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity, identifying a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.

2509.09631 2026-06-17 cs.SD cs.CL cs.CV version update

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching


Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

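AI summary: Proposes DiFlow-TTS, a framework that uses discrete flow matching and a factorized discrete flow denoiser to balance quality and latency in zero-shot TTS.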

Comments Accepted at Interspeech 2026 (Long Paper track)

2507.14632 2026-06-17 cs.CV version update

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM


Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

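AI summary: Proposes BusterX++, a unified multimodal large language model that uses a pure reinforcement learning strategy to achieve cross-modal capability transfer between image and video forgery detection, outperforming existing methods.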

Abstract

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted SFT → RL post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.

2601.00215 2026-06-17 cs.CV cs.CL version update

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design


Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng

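AI summary: Studies the perception and reasoning bottlenecks in multimodal LLMs, finds that perception is the binding constraint, and uses reward design to improve visually grounded reasoning, yielding a 5.56-point gain over the base model.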

Comments 24 pages, 15 Figures, 10 Tables

Abstract

Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, perceive what is in an image, then reason about what it implies. Because these stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this on algorithmic visual puzzles, where both components are necessary and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. Training Qwen-2.5-VL-7B with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are, however, uneven; no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that signals correct perception at its source rather than the reasoning that inherits its errors.

2512.21315 2026-06-17 cs.LG cs.CV stat.ML version update

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks


Roy Turgeman, Tom Tirer

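AI summary: Studies how low-level processing (such as denoising and encoding) can improve classification, proves that for any finite number of training samples there exists a pre-classification processing step that improves accuracy, and confirms the theoretical trends empirically.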

Comments ICLR 2026 (camera-ready). Code is available at: https://github.com/serveroy/process-before-you-classify


Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
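As a small illustration of the inequality itself (not of the paper's finite-sample result), the sketch below computes the mutual information between a binary label and an observation before and after a lossy processing step. The joint distribution `joint_xy` and the level-merging map are hypothetical values chosen only for this example.

```python
import math


def mutual_information(joint):
    """Mutual information I(X;Y) in bits for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[x] * py[y]))
    return mi


def process(joint, g, n_out):
    """Joint table of (X, g(Y)) for a deterministic map g on Y's alphabet."""
    out = [[0.0] * n_out for _ in joint]
    for x, row in enumerate(joint):
        for y, p in enumerate(row):
            out[x][g(y)] += p
    return out


# Hypothetical joint distribution of a binary label X and a 4-level observation Y.
joint_xy = [
    [0.30, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.30],
]
# A lossy processing step: merge neighbouring observation levels pairwise.
joint_xz = process(joint_xy, lambda y: y // 2, 2)

i_xy = mutual_information(joint_xy)
i_xz = mutual_information(joint_xz)
print(f"I(X;Y) = {i_xy:.4f} bits, I(X;g(Y)) = {i_xz:.4f} bits")
```

The processed observation never carries more information about the label than the raw one. The abstract's point is that a classifier trained on finitely many samples can still benefit from such processing, which this toy calculation does not attempt to show.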

2512.16978 2026-06-17 cs.CV version update

A Benchmark for Omni-Modal Reasoning in Long Videos


Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

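AI summary: Introduces LongShOTBench, a benchmark for omni-modal reasoning over vision, speech, and ambient audio in long videos, together with LongShOTAgent, a training-free omni-modal evidence-seeking agent that is the strongest training-free system among the 105 models evaluated.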

Abstract

Long-form omni-modal video understanding requires integrating vision, speech, and ambient audio with coherent long-context reasoning. Existing video benchmarks often trade off temporal scale, modality coverage, open-ended interaction, and interpretable scoring. To address this gap, we introduce LongShOTBench, a long video understanding benchmark designed around three coupled goals: holistic omni-modal integration, intent-driven open-ended interaction, and rubric-level diagnosis. It builds single- and multi-turn questions from real viewing scenarios, with systematic tasks probing visual, speech, ambient-audio, temporal, and cross-modal reasoning. Each item includes a reference answer and a weighted criterion-level rubric, letting evaluation identify which perceptual facts, temporal links, modality-grounding requirements, and reasoning steps are satisfied or missed. All samples are manually verified to improve grounding, clarity, and rubric reliability. We also introduce LongShOTAgent, a training-free omni-modal evidence-seeking agent coupling full-video preprocessing with targeted retrieval, query-adaptive segment refinement, and explicit claim verification over visual, speech, and non-speech audio evidence. Its iterative search-refine-verify loop exposes intermediate evidence and lets modality-specific specialists re-analyze relevant moments before answering. We evaluate 105 video-capable models spanning open-source omni-modal models, vision-language systems, audio LLMs, agentic pipelines and closed-source APIs. Current MLLMs remain far from saturating LongShOTBench, while our LongShOTAgent is the strongest training-free system, reaching 66.64% overall. By releasing the benchmark, leaderboard, and method, we provide a shared, interpretable testbed for advancing long-form omni-modal video reasoning. Code, data, and the leaderboard are available at https://longshot.cvmbzuai.com/.

2512.13853 2026-06-17 cs.LG cond-mat.stat-mech math.PR stat.ML version update

Dropout Neural Network Training Viewed from a Percolation Perspective


Finley Devlin, Jaron Sanders

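AI summary: Studies percolation in deep neural networks trained with dropout, builds new percolation models relating network topology to the input-output path problem, and shows a percolative effect in dropout that can cause training to break down.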

Comments 21 pages, 14 figures

Abstract

In this work, we investigate the existence and effect of percolation in training deep Neural Networks (NNs) with dropout. Dropout methods are regularisation techniques for training NNs, first introduced by G. Hinton et al. (2012). These methods temporarily remove connections in the NN, randomly at each stage of training, and update the remaining subnetwork with Stochastic Gradient Descent (SGD). The process of removing connections from a network at random is similar to percolation, a paradigm model of statistical physics. If dropout were to remove enough connections such that there is no path between the input and output of the NN, then the NN could not make predictions informed by the data. We study new percolation models that mimic dropout in NNs and characterise the relationship between network topology and this path problem. The theory shows the existence of a percolative effect in dropout. We also show that this percolative effect can cause a breakdown when training NNs without biases with dropout; and we argue heuristically that this breakdown extends to NNs with biases.
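The path problem described in the abstract can be explored with a short Monte Carlo sketch. It assumes a fully connected layered network in which every connection is kept independently with probability `keep_prob`; the layer widths, trial count, and function names are illustrative choices, not the percolation models analyzed in the paper.

```python
import random


def has_path(widths, keep_prob, rng):
    """Sample one dropout mask on connections and test input-to-output connectivity."""
    reachable = [True] * widths[0]
    for n_in, n_out in zip(widths, widths[1:]):
        nxt = [False] * n_out
        for i in range(n_in):
            if not reachable[i]:
                continue  # connections leaving an unreachable unit cannot help
            for j in range(n_out):
                if rng.random() < keep_prob:  # this connection survives dropout
                    nxt[j] = True
        reachable = nxt
        if not any(reachable):
            return False
    return True


def path_probability(widths, keep_prob, trials=2000, seed=0):
    """Estimate the probability that at least one input-output path survives."""
    rng = random.Random(seed)
    hits = sum(has_path(widths, keep_prob, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.9):
        print(p, path_probability([4, 4, 4, 1], p))
```

When no path survives, the output of a network without biases cannot depend on the input for that mask, which is the situation the abstract connects to a training breakdown.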

2506.24121 2026-06-17 cs.CV version update

TextMesh4D: Zero-shot Text-to-4D Mesh Generation


Sisi Dai, Xinxin Su, Kai Xu

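AI summary: Proposes TextMesh4D, a zero-shot text-to-dynamic-mesh framework that uses a Jacobian Deformation Field and a Local-Global Semantic Regularizer to reconcile diffusion guidance with mesh topology constraints, achieving high temporal consistency and geometric fidelity.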

Abstract

Large-scale, high-quality dynamic 3D (4D) assets are essential for learning physically grounded representations, but remain costly to capture and annotate at scale. This limits the viability of supervised 4D learning and motivates zero-shot text-to-4D generation leveraging pretrained diffusion priors. To model complex dynamics, prior methods typically adopt implicit 3D representations (e.g., NeRFs or 3DGS) for their deformation capacity. However, their implicit nature provides limited control over surface topology, which hinders high-fidelity geometry and makes temporally coherent surface reconstruction challenging. To address these limitations, we explore zero-shot text-to-4D mesh generation. However, a structural mismatch arises when combining diffusion-based guidance with topology-constrained meshes: the guidance is noisy and spatially inconsistent, while meshes impose severe topological constraints, making direct vertex-level deformation unstable. In this paper, we introduce TextMesh4D, the first zero-shot framework for text-to-4D that directly generates dynamic meshes by addressing the above challenge at two complementary levels. Geometrically, we shift deformation modeling from vertices to faces via a Jacobian Deformation Field (JDF), enabling topology-aware surface reconstruction through an integrability-enforcing integration formulation. Semantically, we propose a Local-Global Semantic Regularizer (LGSR) that preserves identity over time by jointly constraining local deformation plausibility and global shape consistency. Extensive experiments demonstrate state-of-the-art temporal consistency, structural fidelity, and visual quality, while remaining efficient on a single 24GB GPU.

2512.13009 2026-06-17 cs.RO version update

K-VARK: Kernelized Variance-Aware Residual Kalman Filter for Sensorless Force Estimation in Collaborative Robots


Oğuzhan Akbıyık, Naseem Alhousani, Fares J. Abu-Dakka

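AI summary: Proposes K-VARK, which uses Kernelized Movement Primitives to learn the predictive mean and heteroscedastic variance of residual torques and adapts the Kalman filter noise covariances, achieving sensorless force estimation on a 6-DoF collaborative manipulator with over 20% lower RMSE.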

Abstract

Reliable estimation of contact forces is crucial for ensuring safe and precise interaction of robots with unstructured environments. However, accurate sensorless force estimation remains challenging due to inherent modeling errors and complex residual dynamics and friction. To address this challenge, in this paper, we propose K-VARK (Kernelized Variance-Aware Residual Kalman filter), a novel approach that integrates a kernelized, probabilistic model of joint residual torques into an adaptive Kalman filter framework. Through Kernelized Movement Primitives trained on optimized excitation trajectories, K-VARK captures both the predictive mean and input-dependent heteroscedastic variance of residual torques, reflecting data variability and distance-to-training effects. These statistics inform a variance-aware virtual measurement update by augmenting the measurement noise covariance, while the process noise covariance adapts online via variational Bayesian optimization to handle dynamic disturbances. Experimental validation on a 6-DoF collaborative manipulator demonstrates that K-VARK achieves over 20% reduction in RMSE compared to state-of-the-art sensorless force estimation methods, yielding robust and accurate external force/torque estimation suitable for advanced tasks such as polishing and assembly.
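The idea of a variance-aware measurement update can be sketched in one dimension. The abstract says only that the learned statistics augment the measurement noise covariance; the additive form `r_base + model_var`, the scalar state, and the identity measurement model below are simplifying assumptions for illustration, not the K-VARK filter.

```python
def variance_aware_update(x_pred, p_pred, z, r_base, model_var):
    """Scalar Kalman measurement update with measurement noise inflated by model variance.

    x_pred, p_pred: predicted state and its variance
    z: (virtual) measurement, r_base: nominal measurement noise variance
    model_var: predictive variance reported by the learned residual model (assumed additive)
    """
    r = r_base + model_var            # augmented measurement noise variance
    k = p_pred / (p_pred + r)         # Kalman gain for an identity measurement model
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new, k


# Same measurement, two levels of model confidence (hypothetical numbers).
x_conf, p_conf, k_conf = variance_aware_update(0.0, 1.0, 2.0, 0.5, 0.0)
x_unc, p_unc, k_unc = variance_aware_update(0.0, 1.0, 2.0, 0.5, 4.5)
print(k_conf, k_unc)
```

A larger predictive variance lowers the gain, so measurements taken far from the training data move the estimate less.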

2512.11784 2026-06-17 cs.LG stat.ML version update

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective


Etienne Boursier, Claire Boyer

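AI summary: Develops a measure-based framework showing that softmax attention converges to a linear operator in the infinite-prompt limit, gives non-asymptotic concentration bounds for finite prompts, and thereby transfers optimization analyses of linear attention to softmax attention with long prompts.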

Abstract

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.
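A toy instance of the large-prompt limit can be checked numerically. The sketch assumes standard Gaussian tokens used as both keys and values, identity weight matrices, and a single query; in that special case the softmax average tends to the mean of an exponentially tilted Gaussian, which equals the query itself and is therefore linear in it. These simplifications are assumptions made for the example and are narrower than the setting in the paper.

```python
import math
import random


def softmax_attention(query, tokens):
    """Softmax attention output for one query, with keys = values = tokens."""
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in tokens]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * tok[k] for w, tok in zip(weights, tokens)) / total
        for k in range(len(query))
    ]


def gap_to_linear_limit(n_tokens, query, seed=0):
    """Largest coordinate-wise distance between the attention output and the query."""
    rng = random.Random(seed)
    tokens = [[rng.gauss(0.0, 1.0) for _ in query] for _ in range(n_tokens)]
    out = softmax_attention(query, tokens)
    return max(abs(o - q) for o, q in zip(out, query))


if __name__ == "__main__":
    q = [0.5, -0.3]
    for n in (100, 1000, 20000):
        print(n, gap_to_linear_limit(n, q))
```

The printed gaps give a rough feel for how quickly a finite prompt approaches the limit, which is the quantity the paper's concentration bounds control.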

2509.12742 2026-06-17 cs.CV version update

Effective Gaussian Management for High-fidelity Object Reconstruction


Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Junxin Chen, Feng Xu

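AI summary: Proposes a Gaussian management framework that combines selective attribute activation, adaptive representation, and task-decoupled pruning with a regularized surface reconstruction module, achieving high-fidelity appearance and geometry reconstruction with fewer parameters.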

Abstract

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation, and pruning of Gaussians. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonics (SH) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for the above management process, we additionally introduce CoRe, a regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing model size. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.

2511.01352 2026-06-17 cs.LG astro-ph.HE astro-ph.IM hep-ex physics.data-an version update

MiniFool -- Physics-Constraint-Aware Minimizer-Based Adversarial Attacks in Deep Neural Networks


Lucie Flek, Oliver Janik, Philipp Alexander Jung, Akbar Karimi, Timo Saala, Alexander Schmidt, Matthias Schott, Philipp Soldin, Matthias Thiesmeyer, Christopher Wiebusch, Ulrich Willemsen

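AI summary: Proposes MiniFool, an algorithm that generates physics-aware adversarial examples by minimizing a cost function combining a χ² test statistic with the deviation from a target score, for testing neural network classifiers in particle and astroparticle physics and quantifying the robustness of network decisions.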

Comments Submitted to Computing and Software for Big Science


DOI: 10.1140/epjc/s10052-026-15773-2
Journal ref: Published in: Eur.Phys.J.C 86 (2026) 6, 641

Abstract

In this paper, we present a new algorithm, MiniFool, that implements physics-inspired adversarial attacks for testing neural network-based classification tasks in particle and astroparticle physics. While we initially developed the algorithm for the search for astrophysical tau neutrinos with the IceCube Neutrino Observatory, we apply it to further data from other science domains, thus demonstrating its general applicability. Here, we apply the algorithm to the well-known MNIST data set and, furthermore, to Open Data from the CMS experiment at the Large Hadron Collider. The algorithm is based on minimizing a cost function that combines a $\chi^2$-based test statistic with the deviation from the desired target score. The test statistic quantifies the probability of the perturbations applied to the data based on the experimental uncertainties. For our studied use cases, we find that the likelihood of a flipped classification differs between initially correctly and incorrectly classified events. When testing changes of the classifications as a function of an attack parameter that scales the experimental uncertainties, the robustness of the network decision can be quantified. Furthermore, this allows testing the robustness of the classification of unlabeled experimental data.
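A cost function of the kind described in the abstract can be sketched as follows. The exact functional form, the relative `weight` between the two terms, the logistic stand-in for the network, and the plain gradient-descent loop are all assumptions made for this example; the abstract does not specify them, and the authors' minimizer may differ.

```python
import math


def score(x, w, b):
    """Toy classifier score in (0, 1); a logistic model stands in for a neural network."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))


def cost(x_adv, x, sigma, w, b, target, weight):
    """Chi-square of the perturbation plus a weighted squared deviation from the target score."""
    chi2 = sum(((xa - xi) / s) ** 2 for xa, xi, s in zip(x_adv, x, sigma))
    return chi2 + weight * (score(x_adv, w, b) - target) ** 2


def attack(x, sigma, w, b, target, weight=50.0, lr=0.01, steps=500, eps=1e-6):
    """Minimize the cost by gradient descent with forward-difference gradients."""
    x_adv = list(x)
    for _ in range(steps):
        base = cost(x_adv, x, sigma, w, b, target, weight)
        grad = []
        for k in range(len(x_adv)):
            bumped = list(x_adv)
            bumped[k] += eps
            grad.append((cost(bumped, x, sigma, w, b, target, weight) - base) / eps)
        x_adv = [xa - lr * g for xa, g in zip(x_adv, grad)]
    return x_adv


# Hypothetical event, uncertainties, and classifier parameters.
x0, sigma0 = [1.0, 0.5], [0.5, 0.5]
w0, b0 = [2.0, -1.0], 0.0
x_adv = attack(x0, sigma0, w0, b0, target=0.0)
print(score(x0, w0, b0), score(x_adv, w0, b0))
```

The chi-square term penalizes perturbations that are large relative to the stated uncertainties `sigma0`, so scaling those uncertainties plays the role of the attack parameter mentioned in the abstract.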

2505.03509 2026-06-17 cs.LG astro-ph.IM version update

AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning

Pablo Gómez, Laslo E. Ruhberg, Maria Teresa Nardone, David O'Ryan

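AI summary: Proposes the AnomalyMatch framework, which combines the semi-supervised FixMatch algorithm with active learning, treats anomaly detection as a binary classification problem, trains on a few labelled and many unlabelled images, and achieves high AUROC and AUPRC under severe class imbalance.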

Comments Accepted for publication in RASTI; 17 pages; 12 figures

English abstract

Anomaly detection in large datasets is essential in astronomy and computer vision. However, due to a scarcity of labelled data, it is often infeasible to apply supervised methods to anomaly detection. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance display strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of the highest-ranking images by score. We compare to the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.

2510.23798 2026-06-17 cs.CV cs.AI version update

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

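AI summary: Proposes a framework combining a geometric model with deep learning that uses fixed cameras to continuously quantify and monitor floating debris in urban rivers, evaluates the accuracy and speed of different models under complex environmental conditions, and estimates debris size via projective geometry.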

English abstract

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring this debris, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
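
As a rough illustration of how camera intrinsics and extrinsics can turn pixel measurements into metric sizes, the sketch below back-projects two pixels onto a flat water surface with a standard pinhole model. The flat plane at z = 0, the function names, and the example camera are assumptions for illustration; the regression correction mentioned in the abstract is not modelled, and this is not the authors' pipeline.

```python
import numpy as np


def pixel_to_water_plane(u, v, K, R, C):
    """Back-project pixel (u, v) onto the plane z = 0 (assumed water surface).

    K: 3x3 intrinsic matrix, R: 3x3 world-to-camera rotation,
    C: camera centre in world coordinates (metres).
    """
    ray_cam = np.linalg.solve(K, np.array([u, v, 1.0]))  # viewing ray, camera frame
    ray_world = R.T @ ray_cam                            # same ray, world frame
    if abs(ray_world[2]) < 1e-12:
        raise ValueError("ray is parallel to the water plane")
    t = -C[2] / ray_world[2]
    if t <= 0:
        raise ValueError("ray does not hit the water plane in front of the camera")
    return C + t * ray_world


def metric_length(p1, p2, K, R, C):
    """Distance in metres between two pixels, both assumed to lie on z = 0."""
    a = pixel_to_water_plane(p1[0], p1[1], K, R, C)
    b = pixel_to_water_plane(p2[0], p2[1], K, R, C)
    return float(np.linalg.norm(a - b))


if __name__ == "__main__":
    # Hypothetical camera: 1000 px focal length, looking straight down from 10 m.
    K = np.array([[1000.0, 0.0, 500.0],
                  [0.0, 1000.0, 500.0],
                  [0.0, 0.0, 1.0]])
    R = np.diag([1.0, -1.0, -1.0])
    C = np.array([0.0, 0.0, 10.0])
    print(metric_length((500, 500), (600, 500), K, R, C))
```

In practice the two pixels would come from the corners of a detection box, and objects that float above or below the assumed plane would introduce a bias that a learned correction could absorb.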

2507.11178 2026-06-17 cs.LG cs.AI version update

A Gradient-based Causal Discovery Framework with Applications to Complex Industrial Processes

Meiliang Liu, Huiwen Dong, Xiaoxiao Yang, Yunfang Xu, Mingbao Yang, Zijin Li, Zhengye Si, Xinyue Yang, Zhiwen Zhao

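AI summary: Proposes GRNGC, which infers Granger causality by applying L1 regularization to the gradients between model input and output; it needs only one prediction model, reduces computational overhead, and outperforms existing methods on several benchmarks and real-world datasets.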

Comments 9 pages, 3 figures, conference

English abstract

With the advancement of deep learning technologies, various neural network-based Granger causality models have been proposed. Although these models have demonstrated notable improvements, several limitations remain. Most existing approaches adopt a component-wise architecture, necessitating the construction of a separate model for each time series, which results in substantial computational costs. In addition, imposing a sparsity-inducing penalty on the first-layer weights of the neural network to extract causal relationships weakens the model's ability to capture complex interactions. To address these limitations, we propose Gradient Regularization-based Neural Granger Causality (GRNGC), which requires only one time series prediction model and applies $L_{1}$ regularization to the gradient between the model's input and output to infer Granger causality. Moreover, GRNGC is not tied to a specific time series forecasting model and can be implemented with diverse architectures such as KAN, MLP, and LSTM, offering enhanced flexibility. Numerical simulations on DREAM, Lorenz-96, fMRI BOLD, and CausalTime show that GRNGC outperforms existing baselines and significantly reduces computational overhead. Meanwhile, experiments on real-world DNA, Yeast, HeLa, and bladder urothelial carcinoma datasets further validate the model's effectiveness in reconstructing gene regulatory networks.
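
The idea of penalising input-output gradients and reading a causal matrix from them can be sketched for a one-hidden-layer MLP, where the Jacobian has a closed form. The network shape, the lag-major input layout, and the names `mlp_jacobian` and `gradient_l1_and_gc` are assumptions for illustration; training, thresholding, and the KAN and LSTM variants from the abstract are not shown.

```python
import numpy as np


def mlp_jacobian(x, W1, b1, W2):
    """Jacobian dy/dx of y = W2 @ tanh(W1 @ x + b1) + b2 at a single input x."""
    h = np.tanh(W1 @ x + b1)
    # W2 @ diag(1 - h**2) @ W1, written without building the diagonal matrix.
    return W2 @ (W1 * (1.0 - h**2)[:, None])


def gradient_l1_and_gc(X, W1, b1, W2, n_series, n_lags):
    """Return the L1 gradient penalty and a Granger-causality score matrix.

    X holds lag windows flattened lag-major: [lag0_s0, lag0_s1, ..., lag1_s0, ...].
    gc[i, j] sums |d y_i / d x_(lag, j)| over lags, averaged over the samples,
    and is read as the influence of series j on series i.
    """
    jac_abs = np.mean([np.abs(mlp_jacobian(x, W1, b1, W2)) for x in X], axis=0)
    penalty = float(jac_abs.sum())  # term that would be added to the forecast loss
    gc = jac_abs.reshape(n_series, n_lags, n_series).sum(axis=1)
    return penalty, gc


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_series, n_lags, hidden = 3, 2, 8
    W1 = rng.normal(size=(hidden, n_series * n_lags))
    b1 = np.zeros(hidden)
    W2 = rng.normal(size=(n_series, hidden))
    X = rng.normal(size=(16, n_series * n_lags))
    penalty, gc = gradient_l1_and_gc(X, W1, b1, W2, n_series, n_lags)
    print(penalty)
    print(gc)
```

With an automatic-differentiation framework the same Jacobian would come from the framework rather than from the closed form, which is what lets a single forecasting model of any architecture be used.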

2510.19838 2026-06-17 cs.AI cs.CL cs.LG version update

Explicit Context-Driven Neural Acoustic Modeling for High-Fidelity RIR Generation

Chen Si, Qianyi Wu, Chaitanya Amballa, Romit Roy Choudhury

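AI summary: Proposes the MiNAF model, which queries a room mesh and extracts distance distributions as explicit local geometric features to guide a neural implicit model toward more accurate room impulse response (RIR) predictions, achieving competitive performance on multiple metrics.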

English abstract

Realistic sound simulation plays a critical role in many applications. A key element in sound simulation is the room impulse response (RIR), which characterizes how sound propagates within a given space. Recent studies have applied neural implicit methods to learn RIR using context information collected from the environment, such as scene images. However, these approaches do not effectively leverage explicit geometric information from the environment. To further exploit neural implicit models with direct geometric features, we present MiNAF, which queries a rough room mesh at given locations and extracts distance distributions as an explicit representation of local context. Our approach demonstrates that incorporating explicit local geometric features can better guide the model in generating more accurate RIR predictions. Through comparisons with conventional and state-of-the-art methods, we show that MiNAF performs competitively across various evaluation metrics.

2509.00064 2026-06-17 cs.RO cs.CV version update

OpenTie: Open-vocabulary Sequential Rebar Tying System

Sai Fan, Mingze Liu, Haozhen Li, Haobo Liang, Yixing Yuan, Yanke Wang

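AI summary: Proposes OpenTie, a training-free 3D rebar tying framework that achieves high-accuracy sequential tying through RGB-to-point-cloud generation and open-vocabulary detection, outperforming a YOLO-based method.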

Comments This article is accepted by The 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

English abstract

Robotic practices on construction sites are attracting attention owing to their capability to tackle complex challenges, especially in rebar-involved scenarios. Most existing products and research focus on collecting large amounts of data for model training. To fill this gap, we propose OpenTie, a 3D training-free rebar tying framework that uses RGB-to-point-cloud generation and open-vocabulary rebar detection and is evaluated in real-world tests. We implement OpenTie on a robotic arm with a binocular camera and ensure high accuracy by applying a prompt-based object detection method to images filtered by our proposed post-processing procedure for the image-to-point-cloud generation framework. Our pipeline requires no training and outperforms training-based object detection, i.e., a YOLO-based method, as verified in a real-world sequential rebar tying test. The system is flexible across horizontal and vertical rebar tying tasks and has potential for application on real construction sites and for commercialization.

2502.08363 2026-06-17 cs.CL cs.AI version update

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Konstantin Berestizshevsky, Renzo Andri, Lukas Cavigelli

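AI summary: Proposes Top-Theta attention, a training-free inference-time sparsification method that uses static per-head thresholds to retain a fixed number of important elements per row and, combined with compensation techniques, preserves accuracy at high sparsity; on NLP tasks it achieves a 3-10x reduction in V-cache and up to a 10x reduction in attention elements with an accuracy drop of no more than 1%.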

Comments Extended version of a paper accepted at ICANN 2026

2508.06692 2026-06-17 cs.LG version update

HeteRo-Select: Informativeness as the Participation Driver in Heterogeneous Federated Learning

Md. Akmol Masud, Md Abrar Jahin, Mahmud Hasan

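AI summary: Proposes the HeteRo-Select framework, which replaces bandwidth-driven compression with a per-client informativeness score that jointly determines client selection, compression ratio, and aggregation weight, reducing heterogeneity and traffic; on CIFAR-10 it achieves a 1.78x speedup and an 18.2% reduction in traffic.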

English abstract

Federated learning systems typically allocate gradient compression by link speed. This is sensible when bandwidth and data informativeness align. However, under non-IID data, these signals often decorrelate or invert. A bandwidth-driven allocator then risks compressing the most informative gradients hardest. We propose HeteRo-Select, a framework that replaces bandwidth with a per-client informativeness score as the primary driver of compression. The score jointly governs three decisions per round: client selection, compression ratio, and server aggregation weight, with bandwidth retained only as a hard ceiling. Score-proportional selection provably reduces the effective heterogeneity of the chosen subset; score-proportional compression provably lowers aggregate top-$k$ error at fixed traffic. Under the exact FedCG simulation protocol, HeteRo-Select delivers a $1.78\times$ speedup and an $18.2\%$ reduction in traffic on CIFAR-10. The same configuration, unchanged, scales from a $7{,}850$-parameter logistic regression to an $11.27$M-parameter ResNet-18, hitting the accuracy target on three of four benchmarks. When bandwidth and informativeness are deliberately anti-correlated, the method still achieves the target accuracy with less traffic than the normal-bandwidth run.
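
The three per-round decisions named in the abstract (selection, compression ratio, aggregation weight) can be sketched as follows. The score values, the proportional allocation rule, and the names `allocate_k` and `aggregate_round` are assumptions for illustration; how the informativeness score is computed and the paper's exact allocation are not given in this listing.

```python
import numpy as np


def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries of grad and zero the rest."""
    out = np.zeros_like(grad)
    k = min(int(k), grad.size)
    if k > 0:
        idx = np.argpartition(np.abs(grad), -k)[-k:]
        out[idx] = grad[idx]
    return out


def allocate_k(scores, total_k, k_cap):
    """Split a budget of total_k kept entries in proportion to the scores,
    never exceeding each client's bandwidth-derived cap (the hard ceiling)."""
    share = scores / scores.sum()
    return np.minimum(np.floor(share * total_k).astype(int), k_cap)


def aggregate_round(grads, scores, total_k, k_cap, n_select, rng):
    """One round: score-proportional selection, compression, and weighting."""
    p = scores / scores.sum()
    chosen = rng.choice(len(grads), size=n_select, replace=False, p=p)
    ks = allocate_k(scores[chosen], total_k, k_cap[chosen])
    weights = scores[chosen] / scores[chosen].sum()
    sparse = [top_k_sparsify(grads[c], k) for c, k in zip(chosen, ks)]
    update = sum(w * g for w, g in zip(weights, sparse))
    return update, chosen, ks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.normal(size=20) for _ in range(5)]
    scores = np.array([0.9, 0.1, 0.5, 0.3, 0.7])  # hypothetical informativeness
    k_cap = np.array([4, 20, 20, 20, 20])         # hypothetical bandwidth ceilings
    update, chosen, ks = aggregate_round(grads, scores, 24, k_cap, 3, rng)
    print(chosen, ks, np.count_nonzero(update))
```

The cap in `allocate_k` is where bandwidth enters: a highly informative client on a slow link is clipped to its ceiling instead of being compressed according to link speed alone.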
