arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.05777 2026-05-11 cs.AI

Emergent social transmission of model-based representations without inference

Silja Keßler, Miriam Bautista-Salinero, Claudio Tennie, Charley M. Wu

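AI summary: This paper examines how people, despite limited cognitive capacity, acquire rich and flexible knowledge about their environment from others. Reinforcement learning simulations show that higher-level representations can be transmitted indirectly, without inferring others' mental states, simply by observing behavior and using simple social cues. Model-based learners learn faster under social exposure and form more expert-like representations, suggesting that cultural transmission can arise from non-mentalizing processes.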

Comments Code available at https://github.com/skessler01/social-transmission-rl.git

Abstract

How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others' beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert - crucially, without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner's experience, causing its representation to converge toward the expert's. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.
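
The value-boosting cue described in the abstract can be pictured with a toy tabular learner. The sketch below is an illustrative assumption, not the authors' code (their repository is linked in the Comments line): `boost_values`, `greedy_action`, the bonus of 0.5, and the toy values are all hypothetical.

```python
def boost_values(q_values, observed_action, bonus=0.5):
    """Return a copy of q_values with a bonus added to the action the expert was seen taking.

    No belief or goal inference happens here: the learner only uses the observed action.
    """
    boosted = list(q_values)
    boosted[observed_action] += bonus
    return boosted


def greedy_action(q_values):
    """Index of the highest-valued action (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical learner values for three actions in one state.
learner_q = [0.2, 0.1, 0.0]
expert_action = 2

alone = greedy_action(learner_q)
social = greedy_action(boost_values(learner_q, expert_action))
print(alone, social)
```

Because the boost changes which action is tried, it also changes which transitions the learner experiences, which is the route by which the abstract says the learner's representation converges toward the expert's.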

2604.03147 2026-05-11 cs.CL cs.AI cs.CY

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao

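AI summary: This study shows that emotion vectors in large language models exhibit circular geometry within a two-dimensional valence-arousal (VA) subspace, and uses principal component decomposition and ridge regression to recover the VA axes underlying emotion steering vectors. Steering along these axes gives monotonic control over the affective properties of generated text and bidirectional control over downstream behaviors such as refusal and sycophancy. The effects replicate across several mainstream models, indicating that the approach generalizes, and the authors propose a lexical mediation mechanism to explain why it works.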

Abstract

We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.

2603.23198 2026-05-11 cs.LG cs.CL

Sparser, Faster, Lighter Transformer Language Models

Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones

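AI summary: This paper studies how unstructured sparsity can reduce the computational cost of large language models (LLMs), focusing on the parameter and compute efficiency of feedforward layers. The authors propose a new sparse packing format and accompanying CUDA kernels that fit the optimized execution pipelines of modern GPUs, enabling efficient sparse computation in both inference and training. Experiments show that simple L1 regularization can reach over 99% sparsity with minimal impact on model performance while substantially improving throughput, energy efficiency, and memory usage.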

Comments Code and checkpoints available at: https://github.com/SakanaAI/sparser-faster-llms

Abstract

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
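
To illustrate how an L1 penalty can produce exact zeros, the sketch below applies soft-thresholding, the proximal step associated with an L1 term, to a toy weight list and measures the resulting sparsity. This is a generic textbook illustration, not the paper's training recipe or its packing format; the function names, weights, and threshold are hypothetical.

```python
def soft_threshold(w, t):
    """Proximal operator of t * |w|: shrink w toward zero by t, setting small values to exactly 0."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy feedforward weights (hypothetical values).
weights = [0.5, -0.02, 0.01, -0.8]
shrunk = [soft_threshold(w, 0.05) for w in weights]
print(shrunk, sparsity(shrunk))
```

Exact zeros matter here because only entries that are exactly zero can be dropped by a sparse storage format; merely small weights still cost memory and FLOPs.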

2603.15525 2026-05-11 cs.CV cs.HC

Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

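AI summary: Targeting deep learning models for chest X-ray diagnosis, this study proposes CARPA, a clinically aware synthetic image generation method that expands a model's coverage of clinical concept combinations. The method applies targeted perturbations to clinical concept vectors while preserving anatomical structure, producing synthetic images with controlled concept insertions and deletions and thereby improving diagnostic performance and reliability. Experiments show that fine-tuning on CARPA-generated images improves classification precision and reduces predictive uncertainty across multiple backbone architectures, and expert radiologists confirm the realism and clinical validity of the images.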

Abstract

Deep learning models for chest X-ray diagnosis are constrained by limited coverage of clinically meaningful concept combinations in publicly available training datasets. While synthetic image generation has been explored to increase data diversity, existing methods rarely enforce clinical or anatomical constraints, limiting utility for improving model reliability. We propose CARPA, a clinically aware and anatomically grounded framework for synthetic chest X-ray generation that applies targeted perturbations to clinical concept vectors while preserving anatomical structure. By producing anatomically faithful synthetic images with controlled concept insertions and deletions, CARPA expands clinically relevant concept coverage. We evaluate CARPA across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior concept perturbation approaches, fine-tuning on CARPA-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong concept alignment, and low semantic uncertainty. Evaluation by two expert radiologists further confirms realism and clinical agreement. Together, these results show that anatomically grounded concept perturbations enable more effective use of synthetic data, improving both performance and reliability of chest X-ray classification models and supporting safer clinical deployment.

2603.15001 2026-05-11 cs.LG cs.AI

How Log-Barrier Helps Exploration in Policy Optimization

Leonardo Cesani, Matteo Papini, Marcello Restelli

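AI summary: This paper studies exploration in policy optimization. It observes that the convergence guarantees of the existing Stochastic Gradient Bandit (SGB) algorithm rely on unrealistic assumptions, and proposes log-barrier regularization to strengthen the policy's exploration. The method matches SGB's sample complexity while guaranteeing convergence in more general settings, and the analysis reveals a geometric connection between the log-barrier and Natural Policy Gradient. Numerical experiments support the theoretical analysis.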

Abstract

Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space by controlling the Fisher information. We validate our theoretical findings through numerical simulations, showing the benefits of the log-barrier regularization.
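
A minimal sketch of a log-barrier-regularized gradient bandit step may help make the idea concrete. It assumes a softmax policy and a regularizer of the form eta * sum_a log pi(a); the exact regularizer weighting, step sizes, and estimator in the paper may differ, and `lb_sgb_step` is a hypothetical name.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def lb_sgb_step(theta, action, reward, lr=0.1, eta=0.01):
    """One stochastic gradient ascent step on expected reward plus a log-barrier.

    For a softmax policy:
      reward term:  reward * (1[b == action] - pi[b])   (score-function estimate)
      barrier term: d/d theta_b of sum_a log pi[a] = 1 - K * pi[b]
    The barrier term is positive for arms with pi[b] < 1/K, so rarely played
    arms are pushed back up and no probability can collapse to zero.
    """
    pi = softmax(theta)
    k = len(theta)
    new_theta = []
    for b in range(k):
        indicator = 1.0 if b == action else 0.0
        reward_grad = reward * (indicator - pi[b])
        barrier_grad = eta * (1.0 - k * pi[b])
        new_theta.append(theta[b] + lr * (reward_grad + barrier_grad))
    return new_theta


# With zero reward, only the barrier acts: the policy drifts toward uniform.
print(lb_sgb_step([2.0, 0.0], action=0, reward=0.0))
```

Setting `eta=0.0` recovers the plain gradient bandit update, which has no such restoring force on low-probability arms.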

2603.09742 2026-05-11 cs.LG math.DS stat.ML

Upper Generalization Bounds for Neural Oscillators

Zifeng Huang, Konstantin M. Zuev, Yong Xia, Michael Beer

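AI summary: This paper studies the generalization capacity of neural oscillators, derived from second-order ordinary differential equations, for learning dynamic mappings of complex nonlinear structural systems. Using the Rademacher complexity framework, it derives upper generalization bounds for approximating causal and uniformly continuous operators between continuous temporal function spaces, and for approximating uniformly asymptotically incrementally stable second-order dynamical systems, then extends them to the squared Wasserstein-1 distance between the target operator and the neural oscillator output. The analysis shows that estimation errors grow polynomially with network size and time length, avoiding the curse of parametric complexity, and that constraining the MLPs' Lipschitz constants through loss-function regularization can improve generalization. Numerical experiments confirm the predicted power laws for the errors and the effectiveness of constraining MLP matrix and vector norms under limited training data.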

Comments This manuscript contains 33 pages with 6 figures

Abstract

Neural oscillators that originate from second-order ordinary differential equations (ODEs) have shown competitive performance in learning mappings between dynamic loads and responses of complex nonlinear structural systems. Despite this empirical success, theoretically quantifying the generalization capacities of their neural network architectures remains undeveloped. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper probably approximately correct (PAC) generalization bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating the uniformly asymptotically incrementally stable second-order dynamical systems are derived by leveraging the Rademacher complexity framework. These bounds are further extended to the squared Wasserstein-1 distances between the probability measures of quantities of interest calculated from target causal operators and the corresponding learned neural oscillators. The theoretical results show that the estimation errors grow polynomially with respect to both MLP sizes and the time length, thereby avoiding the curse of parametric complexity. Furthermore, the derived error bounds demonstrate that constraining the Lipschitz constants of the MLPs via loss function regularization can improve the generalization ability of the neural oscillator. Numerical studies considering a Bouc-Wen nonlinear system under stochastic seismic excitation validate the theoretically predicted power laws of the estimation errors with respect to the sample size and time length, and confirm the effectiveness of constraining MLPs' matrix and vector norms in enhancing the performance of the neural oscillator under limited training data.

2603.09652 2026-05-11 cs.AI

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Zuhao Zhang, Chengyue Yu, Yuante Li, Chenyi Zhuang, Linjian Mo, Shuai Li

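AI summary: As large language models advance in code generation, human-AI interaction is shifting from static text responses to dynamic, interactive HTML-based applications, termed MiniApps. To evaluate how well models generate such applications, the paper introduces MiniAppBench, the first comprehensive benchmark for principle-driven interactive application generation, comprising 500 tasks drawn from a real-world application. It also introduces MiniAppEval, an evaluation framework that uses browser automation for human-like exploratory testing and systematically assesses application quality along three dimensions (Intention, Static, and Dynamic), providing a reliable standard for future research.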

Abstract

With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our homepage is available at miniappbench.github.io.

2603.06859 2026-05-11 cs.LG cs.AI

Exact Is Easier: Credit Assignment for Cooperative LLM Agents

Yanjun Chen, Yirong Sun, Hanlin Wang, Jinghan Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

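AI summary: This paper studies how to accurately assess each agent's contribution in cooperative large language model (LLM) systems. Whereas traditional multi-agent reinforcement learning relies on approximations, the authors point out that in cooperative LLM systems interaction histories are deterministic functions of observable text, so the state at every decision point can be restored exactly, enabling unbiased causal measurement of contribution. Building on this, they propose C3, which fixes the complete history, freezes the behavior policy, and samples alternative actions to compute exact per-decision advantages. Experiments show that C3 outperforms existing methods on multiple benchmarks, and the paper also presents the first method-agnostic auditing tool for multi-agent LLM credit assignment.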

Abstract

Removing an agent from a cooperative team to measure its contribution seems natural, yet in multi-agent LLM systems this evaluation distorts the result it claims to measure. This failure is not isolated: learned critics, trajectory-level baselines, and agent-removal counterfactuals all inherit from standard multi-agent reinforcement learning a premise that exact counterfactual evaluation requires privileged environment access, and they therefore resort to approximation. In cooperative LLM systems, this premise is false. Interaction histories are deterministic functions of observable text with no hidden state, so any decision point can be restored exactly, making direct causal measurement possible without parametric approximation. C3 exploits this property by fixing the complete history at each decision point, sampling alternative actions under a frozen behavior policy, and computing unbiased per-decision advantages through a parameter-free leave-one-out baseline. Across six benchmarks spanning math reasoning and code generation, two model families, and two multi-agent topologies, C3 consistently outperforms all baselines; a controlled decomposition confirms gains originate from credit quality, not architecture, while checkpoint restoration reduces training token consumption. The exact solution proves simpler, cheaper, and more effective than all approximate alternatives. The same structural property that enables exact credit also enables exact verification: three independently computable diagnostics, credit fidelity, within-group variance, and inter-agent influence, constitute the first method-agnostic auditing tool for multi-agent LLM credit assignment. Our code is available at https://github.com/EIT-EAST-Lab/C3
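
The leave-one-out baseline named in the abstract can be sketched in a few lines. The version below is the generic form of that baseline, assuming each alternative action sampled at one decision point has already been rolled out to a scalar return; how C3 actually groups samples and defines returns is not specified here, and `leave_one_out_advantages` is a hypothetical name.

```python
def leave_one_out_advantages(returns):
    """Advantage of each sample relative to the mean return of the *other* samples.

    The baseline for sample i never includes sample i itself, so it has no
    learned parameters and does not bias the advantage estimate.
    """
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two samples for a leave-one-out baseline")
    total = sum(returns)
    return [r - (total - r) / (n - 1) for r in returns]


# Three alternative actions sampled from the same restored decision point
# (hypothetical returns: only the first one solved the task).
print(leave_one_out_advantages([1.0, 0.0, 0.0]))
```

Because every sample shares the same restored history, differences in return are attributable to the action taken at that decision point alone.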

2603.06811 2026-05-11 cs.AI

Making AI Evaluation Deployment Relevant Through Context Specification

Matthew Holmes, Thiago Lacerda, Reva Schwartz

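AI summary: This paper explores how context specification can make AI evaluation more relevant to real-world deployment. It argues that current evaluation approaches often overlook the operational environment that determines deployment outcomes, leaving organizations unable to judge whether AI tools will deliver durable value. The authors propose explicitly defining the key properties, behaviors, and outcomes of the evaluation setting, turning diffuse stakeholder perspectives into constructs that can be observed and measured, and thereby providing a clear guiding framework for evaluating AI systems in deployment.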

Comments 8 pages; 2 figures

Abstract

With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches often mask the operational realities that ultimately determine deployment success, making it difficult for organizational decision makers to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform this decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.

2603.05539 2026-05-11 cs.LG cs.AI cs.IR cs.MM

VDCook: DIY video data cook your MLLMs

Chengwei Wu

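AI summary: This paper presents VDCook, a self-evolving video data operating system intended as a flexible video data construction platform for researchers and vertical-domain teams. Users issue data requests through natural language queries and adjustable parameters; the system automatically optimizes the query, runs video retrieval and controlled synthesis modules in parallel, and produces data packages with complete provenance and metadata. VDCook supports an automated data ingestion mechanism based on MCP so that datasets can be continuously updated and expanded, and provides multi-dimensional metadata annotation as a foundation for later data processing and indexing, lowering the barrier to building specialized video training datasets.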

Abstract

We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol), transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data 'cooking' and indexing. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: https://screenapp.io/app/v/WP0SvffgsH

2603.00223 2026-05-11 cs.CV quant-ph

Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification

Giuseppe Sergioli, Carlo Cuccu, Giovanni Pasini, Alessandro Stefano, Giorgio Russo, Andrés Camilo Granda Arango, Roberto Giuntini

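AI summary: This paper proposes a quantum-inspired multi-class classification method based on the Pretty Good Measurement (PGM) and applies it to lung cancer subtyping and prostate cancer risk stratification in medical imaging. The method maps each class to an encoded mixed quantum state and classifies through a single POVM, giving a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. Experiments show that the method performs well across several medical image analysis tasks, in particular outperforming traditional methods on the binary and three-class lung cancer tasks, and shows good clinical relevance in prostate cancer risk stratification.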

Comments 22 pages, 9 figures, 12 tables, in preparation for journal submission

Abstract

We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity-specificity trade-offs across feature-selection scenarios.

2603.00041 2026-05-11 cs.LG cs.AI econ.EM stat.ME

Econometric vs. Causal Structure-Learning for Time-Series Policy Decisions: Evidence from the UK COVID-19 Policies

Bruno Petrungaro, Anthony C. Constantinou

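AI summary: This paper compares econometric methods and causal structure-learning methods for causal discovery in time-series policy decisions, using the UK's COVID-19 policies as an empirical case study. It evaluates four econometric methods and eleven causal machine learning algorithms in terms of graphical structure, model dimensionality, and ability to recover causal effects. Econometric methods provide clear rules for temporal structure, whereas causal ML algorithms explore a larger space of graph structures and so capture more identifiable causal relationships. The study offers empirical grounds for causal ML to draw lessons from econometrics and provides code for translating econometric results into a Bayesian network tool.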

Abstract

Causal machine learning (ML) recovers graphical structures that inform us about potential cause-and-effect relationships. Most progress has focused on cross-sectional data with no explicit time order, whereas recovering causal structures from time series data remains the subject of ongoing research in causal ML. In addition to traditional causal ML, this study assesses econometric methods that some argue can recover causal structures from time series data. The use of these methods can be explained by the significant attention the field of econometrics has given to causality, and specifically to time series, over the years. This presents the possibility of comparing the causal discovery performance between econometric and traditional causal ML algorithms. We seek to understand if there are lessons to be incorporated into causal ML from econometrics, and provide code to translate the results of these econometric methods to the most widely used Bayesian Network R library, bnlearn. We investigate the benefits and challenges that these algorithms present in supporting policy decision-making, using the real-world case of COVID-19 in the UK as an example. Four econometric methods are evaluated in terms of graphical structure, model dimensionality, and their ability to recover causal effects, and these results are compared with those of eleven causal ML algorithms. Amongst our main results, we see that econometric methods provide clear rules for temporal structures, whereas causal-ML algorithms offer broader discovery by exploring a larger space of graph structures that tends to lead to denser graphs that capture more identifiable causal relationships.

2602.16360 2026-05-11 cs.RO

Docking and Persistent Operations for a Resident Underwater Vehicle

Leonard Günzel, Gabrielė Kasparavičiūtė, Ambjørn Grimsrud Waldum, Bjørn-Magnus Moslått, Abubakar Aliyu Badawi, Celil Yılmaz, Md Shamin Yeasher Yousha, Robert Staven, Martin Ludvigsen

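AI summary: This paper studies how a resident underwater robot can operate persistently and autonomously at depth, overcoming the cost and efficiency limits of traditional underwater monitoring. The authors present a resident system combining a docking station with a mini-class Remotely Operated Vehicle (ROV), which performs autonomous navigation, vision-based docking, and local inspection tasks at 90 m depth. The system demonstrates a high autonomous docking success rate and fast mission execution, validating the fusion of acoustic and visual navigation in real underwater conditions and pointing toward low-cost, scalable underwater monitoring.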

Abstract

Our understanding of the oceans remains limited by sparse and infrequent observations, primarily because current methods are constrained by the high cost and logistical effort of underwater monitoring, relying either on sporadic surveys across broad areas or on long-term measurements at fixed locations. To overcome these limitations, monitoring systems must enable persistent and autonomous operations without the need for continuous surface support. Despite recent advances, resident underwater vehicles remain uncommon due to persistent challenges in autonomy, robotic resilience, and mechanical robustness, particularly under long-term deployment in harsh and remote environments. This work addresses these problems by presenting the development, deployment, and operation of a resident infrastructure using a docking station with a mini-class Remotely Operated Vehicle (ROV) at 90 m depth. The ROV is equipped with enhanced onboard processing and perception, allowing it to autonomously navigate using USBL signals, dock via ArUco marker-based visual localisation fused through an Extended Kalman Filter, and carry out local inspection routines. The system demonstrated a 90 % autonomous docking success rate and completed full inspection missions within four minutes, validating the integration of acoustic and visual navigation in real-world conditions. These results show that reliable, untethered operations at depth are feasible, highlighting the potential of resident ROV systems for scalable, cost-effective underwater monitoring.

2602.14868 2026-05-11 cs.LG cs.AI

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe

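AI summary: To address the low sample efficiency that sparse rewards cause in reinforcement learning, this study proposes Goldilocks, a new data sampling strategy. A teacher model predicts how difficult each question is for the student model and selects questions of moderate difficulty (neither too easy nor too hard), training the model's reasoning ability more efficiently. Experiments show that under the same compute budget the method improves performance on mathematical reasoning tasks.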

Comments 28 pages, 13 figures

Abstract

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior works have primarily targeted small datasets and do not directly transfer to the large-scale settings typical of modern LM training. Furthermore, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
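
The "neither too easy nor too hard" selection rule can be sketched as follows. The scoring function p * (1 - p) is an assumption for illustration: with a binary reward it equals the reward variance and peaks at a success probability of 0.5, but the teacher's actual criterion in the paper may differ. The names and predicted probabilities are hypothetical.

```python
def goldilocks_score(p_success):
    """Score a question by p * (1 - p): zero when always solved or never solved, maximal at 0.5."""
    return p_success * (1.0 - p_success)


def select_questions(predicted_success, k):
    """Pick the k questions whose predicted success rate is closest to the informative middle.

    predicted_success maps question id -> teacher-predicted probability that
    the student answers correctly.
    """
    ranked = sorted(
        predicted_success,
        key=lambda q: goldilocks_score(predicted_success[q]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical teacher predictions for four questions.
predictions = {"q_easy": 0.95, "q_mid": 0.5, "q_hard": 0.02, "q_ok": 0.3}
print(select_questions(predictions, k=2))
```

Questions the student always or never solves give identical rewards across a sampled group, so group-relative methods such as GRPO get no learning signal from them; that is the motivation for skipping both extremes.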

2602.13298 2026-05-11 cs.CV cs.AI

The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs

Manfred M. Fischer, Joshua Pitts

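AI summary: By comparing convolutional neural network architectures including VGG, ResNet, and GoogLeNet, this paper studies the relationship between CNN topology and image recognition performance. It introduces the concepts of nominal depth and effective depth and shows how identity shortcuts and branching modules affect optimization stability. The results indicate that effective depth reflects a network's trainability and scaling potential better than nominal depth, and that architectural topology, rather than layer count alone, is the key factor in gradient health.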

Abstract

This paper investigates the relationship between convolutional neural network (CNN) topology and image recognition performance through a comparative study of the VGG, ResNet, and GoogLeNet architectural families. Utilizing a unified experimental framework, the study isolates the impact of depth from confounding implementation variables. A formal distinction is introduced between nominal depth ($D_{\mathrm{nom}}$), representing the physical layer count, and effective depth ($D_{\mathrm{eff}}$), an operational metric quantifying the expected number of sequential transformations. Empirical results demonstrate that architectures utilizing identity shortcuts or branching modules maintain optimization stability by decoupling $D_{\mathrm{eff}}$ from $D_{\mathrm{nom}}$. These findings suggest that effective depth serves as a superior framework for predicting scaling potential and practical trainability, ultimately indicating that architectural topology - rather than sheer layer volume - is the primary determinant of gradient health in deep learning models.

2602.11758 2026-05-11 cs.RO

HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model

Dongting Li, Xingyu Chen, Qianyang Wu, Bo Chen, Sikai Wu, Hanyu Wu, Guoyao Zhang, Liang Li, Mingliang Zhou, Diyun Xiang, Jianzhu Ma, Qiang Zhang, Renjing Xu

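AI summary: This paper presents HAIC, a control framework for agile object interaction by humanoid robots that addresses the difficulty of interacting with objects that have non-holonomic constraints and independent dynamics. HAIC predicts high-order object states (such as velocity and acceleration) from proprioceptive history alone and combines them with static geometric priors to build a dynamic occupancy map, enabling robust interaction without external state estimation. Experiments show that HAIC performs well on a range of agile tasks and multi-object long-horizon tasks, demonstrating proactive compensation for inertial perturbations and adaptability to the environment.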

Comments RSS 2026. Webpage: https://haic-humanoid.github.io/

Abstract

Humanoid robots show promise for complex whole-body tasks in unstructured environments. Although Human-Object Interaction (HOI) has advanced, most methods focus on fully actuated objects rigidly coupled to the robot, ignoring underactuated objects with independent dynamics and non-holonomic constraints. These introduce control challenges from coupling forces and occlusions. We present HAIC, a unified framework for robust interaction across diverse object dynamics without external state estimation. Our key contribution is a dynamics predictor that estimates high-order object states (velocity, acceleration) solely from proprioceptive history. These predictions are projected onto static geometric priors to form a spatially grounded dynamic occupancy map, enabling the policy to infer collision boundaries and contact affordances in blind spots. We use asymmetric fine-tuning, where a world model continuously adapts to the student policy's exploration, ensuring robust state estimation under distribution shifts. Experiments on a humanoid robot show HAIC achieves high success rates in agile tasks (skateboarding, cart pushing/pulling under various loads) by proactively compensating for inertial perturbations, and also masters multi-object long-horizon tasks like carrying a box across varied terrain by predicting the dynamics of multiple objects.

2602.10693 2026-05-11 cs.LG cs.AI

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

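AI summary: In reinforcement learning for large language models, asynchronous training and mismatches between training and inference engines make off-policy updates unavoidable. Conventional importance sampling is unbiased but has high variance, a problem that is worse in autoregressive generation. This paper proposes VESPO, a variational sequence-level soft policy optimization method that operates directly on sequence-level importance weights, reducing variance and admitting an explicit variance bound. Experiments on math reasoning and code generation show that it trains stably and outperforms existing methods.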

Abstract

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an unbiased correction but suffers from high variance, which is amplified by unbounded ratios and autoregressive generation. Prior remedies either rely on scenario-specific engineering, or trade bias for variance via token-level clipping or sequence-level normalization, yet these approaches remain largely heuristic. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By explicitly incorporating variance reduction into a variational formulation, we derive a principled closed-form reshaping kernel that operates directly on sequence-level importance weights, avoids token-level approximation and length normalization, and admits an explicit variance bound for the deployed kernel. Experiments on math reasoning and code generation show that VESPO maintains stable training under severe off-policy conditions (staleness up to 64x) and delivers consistent gains across both dense and Mixture-of-Experts (MoE) models, outperforming recent reshaping baselines under matched setup. Code is available at https://github.com/FloyedShen/VESPO.
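
To see why naive sequence-level importance weights are high-variance under autoregressive generation, the sketch below computes the raw weight as the product of per-token probability ratios. It shows only the uncorrected weight; VESPO's reshaping kernel is not reproduced here because the abstract does not give its form. The function name and the toy log-probabilities are hypothetical.

```python
import math


def sequence_importance_weight(new_logprobs, old_logprobs):
    """Raw sequence-level importance weight pi_new(y) / pi_old(y).

    Each argument holds per-token log-probabilities of the same sampled
    sequence under the current policy and the (stale) rollout policy.
    Summing log-ratios and exponentiating equals multiplying token ratios.
    """
    if len(new_logprobs) != len(old_logprobs):
        raise ValueError("log-probability lists must have the same length")
    log_ratio = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(log_ratio)


# A mild 10% per-token mismatch compounds over a 50-token response.
old = [math.log(0.5)] * 50
new = [math.log(0.55)] * 50
print(sequence_importance_weight(new, old))
```

Here a per-token ratio of only 1.1 grows to roughly 1.1 to the power 50 over the sequence, which illustrates the unbounded ratios that the abstract says amplify variance.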

2602.07425 2026-05-11 cs.LG cs.CL math.OC

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

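AI summary: This paper studies why sign-based optimizers such as Lion and Muon perform well under heavy-tailed noise, and proposes a new heavy-tailed noise condition that more accurately describes gradient behavior in large language model training. The theoretical analysis shows that sign-based methods achieve convergence rates matching or surpassing the best known results under this noise model, and gives the first rigorous analysis of algorithms such as Muon for matrix optimization. Experiments support the theory, indicating that sign-based optimizers are well suited to heavy-tailed noise.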

Comments Code is available at https://github.com/Dingzhen230/Heavy-tailed-Noise-in-LLMs

Abstract

While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLMs). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
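
The intuition for why sign-based updates tolerate heavy-tailed noise can be shown with a plain SignSGD step, which uses only the sign of each gradient coordinate. This is a minimal sketch of basic SignSGD, not of Lion or Muon and not of the paper's analysis; the parameter values and the outlier gradient are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signsgd_step(params, grads, lr=0.01):
    """Move each parameter by exactly lr against the sign of its gradient.

    The gradient magnitude is discarded, so a single extreme noisy gradient
    cannot move a parameter by more than lr.
    """
    return [p - lr * sign(g) for p, g in zip(params, grads)]


# The first gradient is a heavy-tailed outlier, yet its update is still just lr.
params = [1.0, -2.0, 0.5]
grads = [1e6, -3.0, 0.0]
print(signsgd_step(params, grads, lr=0.1))
```

By contrast, a plain SGD step with the same learning rate would move the first parameter by 100,000, which is the kind of blow-up that heavy-tailed gradient noise makes likely.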

2602.04939 2026-05-11 cs.CV

SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes

Roberto Leotta, Salvatore Alfio Sambataro, Claudio Vittorio Ragaglia, Mirko Casu, Yuri Petralia, Francesco Guarnera, Luca Guarnera, Sebastiano Battiato

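AI summary This paper presents SynthForensics, a people-centric benchmark dataset of synthetic video deepfakes containing 20,445 videos from 8 text-to-video and 7 image-to-video generators, paired with real videos for validation. The dataset is released in four compression versions with full metadata. Experiments show that existing detection methods perform markedly worse on it, highlighting gaps in current evaluation suites. The study also reveals differences between the features of synthetic videos and those of traditional manipulated videos, offering guidance for improving future detectors.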

Details
English abstract

Modern T2V/I2V generators synthesize people increasingly hard to distinguish from authentic footage, while current evaluation suites lag: legacy benchmarks target manipulation-based forgeries, and recent synthetic-video benchmarks prioritize scale over realistic human depiction. We introduce SynthForensics, a people-centric benchmark of $20{,}445$ videos from 8 T2V and 7 I2V open-source generators, paired-source from FF++/DFD reals, two-stage human-validated, in four compression versions with full metadata. In our paired-comparison human study, raters prefer SynthForensics in $71$--$77\%$ of head-to-head comparisons against each of nine existing synthetic-video benchmarks, while facial-quality metrics fall within the FF++/DFD baseline range. Across 15 detectors and three protocols, face-based methods drop $13$--$55$ AUC points (mean $27$) from FF++ to SynthForensics and a further $23$ under aggressive compression; fine-tuning closes the gap at a backward cost on legacy benchmarks; training from scratch shows synthetic and manipulation features largely disjoint for most detectors. We release dataset, pipeline, and code.

2602.03490 2026-05-11 cs.LG q-bio.NC

Path Integration and Object-Location Binding Emerge in an Action-Conditioned Predictive Sequence Network

Linda Ariel Ventura, Victoria Bosch, Tim C Kietzmann, Sushrut Thorat

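AI summary This study examines how path integration and object-location binding can arise in an action-conditioned predictive sequence network. A recurrent neural network samples tokens sequentially from continuous two-dimensional scenes and learns a model of the environment by predicting the next token. Experiments show that the network's prediction accuracy improves over the course of a sequence, and decoding analyses reveal path integration and dynamic binding. The results show how structured representations support prediction through flexible binding, offering a mechanistic account of sequential world modeling for cognitive science.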

Comments 8 pages, 4 figures; accepted at CogSci 2026

Details
English abstract

Adaptive cognition requires structured internal models of objects and their relations. Predictive neural networks are often proposed to learn such world models, but how these are instantiated and how they support prediction remain unclear. We investigate this in a minimal in-silico setting. A recurrent neural network samples tokens sequentially from 2D continuous token scenes and is trained to predict the upcoming token from the current input and a saccade-like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in-context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out-of-distribution bindings can be learned as well. Together, these findings show how structured representations relying on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.

2602.03473 2026-05-11 cs.LG cs.CV

Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts

Meng Lou, Yunxiang Fu, Yizhou Yu

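AI summary This paper proposes CaRE, a scalable continual learning framework that addresses the challenge of maintaining both stability and plasticity over sequences of hundreds of tasks. Its core is a bi-level routing mixture-of-experts (BR-MoE) mechanism that dynamically activates task-relevant routers and expert modules to strengthen the extraction of discriminative and comprehensive features. The study also builds OmniBenchmark-1K, a challenging dataset for evaluating very long sequences with hundreds of tasks, and validates CaRE across a range of task settings, with especially strong results on very long task sequences. According to the authors, it is the first continual learner to scale to more than 300 non-overlapping tasks.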

Comments Accepted by ICML 2026

Details
English abstract

Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable Continual Learner with efficient Bi-Level Routing Mixture-of-Experts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. We also introduce a challenging dataset, OmniBenchmark-1K, for CIL performance evaluation on very long task sequences with hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. We hope that this work will inspire further research into continual learning over extremely long task sequences. Code and dataset are publicly released at https://github.com/LMMMEng/CaRE.

2602.02832 2026-05-11 cs.LG physics.flu-dyn

Koopman Autoencoders with Continuous-Time Latent Dynamics for Fluid Dynamics Forecasting

Rares Grozavescu, Pengyu Zhang, Etienne Meunier, Mark Girolami

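AI summary This paper proposes a Koopman autoencoder with continuous-time dynamics for long-horizon forecasting of fluid dynamics. Its core is the continuous-time evolution equation $dz/dt = \mathbf{K}_{\mathrm{cont}} z$, which gives closed-form inference, removes the fixed-timestep restriction, and improves computational efficiency. To address latent-state instability in high-dimensional chaotic systems, the authors introduce structural constraints including rollout training, forward-backward consistency, latent regularization, and physics-conditioned LoRA, which improve the stability of long-horizon forecasts. Experiments show that the method outperforms existing diffusion and operator-learning baselines on challenging fluid benchmarks while achieving a 110x inference speedup.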

Details
English abstract

Forecasting physical systems over long horizons from irregularly sampled observations demands models that are stable, computationally efficient, and free of fixed-timestep assumptions. We address this with a continuous-time Koopman autoencoder whose latent dynamics obey $dz/dt = \mathbf{K}_{\mathrm{cont}} z$, yielding closed-form inference via $z(τ) = \exp(\mathbf{K}_{\mathrm{cont}} τ) z(0)$ at any horizon $τ$ in a single step. This decouples forecast cost from forecast length at inference time and supports data assimilation as gradient-based optimization with cost independent of the assimilation window. However, scaling continuous-time Koopman dynamics to high-dimensional chaotic systems causes severe latent instability, including spectral collapse and trajectory divergence over long horizons. In contrast, discrete Koopman methods train an operator $\mathbf{A}$ such that $z_{t+Δt} = \mathbf{A} z_t$; recovering the continuous generator could be theoretically done through matrix logarithm but requires conditions not guaranteed by training, and approximation errors grow with the $Δt$ imposed by the training data. These methods also require fixed, regular timesteps. We identify an empirically effective set of structural constraints -- rollout training, forward-backward consistency, latent regularization, and physics-conditioned LoRA -- sufficient for stable long-horizon latent dynamics. On challenging fluid benchmarks, our method outperforms strong diffusion and operator-learning baselines on long-horizon forecasting while achieving a 110$\times$ inference speedup.
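
The closed-form step $z(τ) = \exp(\mathbf{K}_{\mathrm{cont}} τ) z(0)$ can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it uses a hand-written scaling-and-squaring matrix exponential so that it runs without third-party libraries, and the 2x2 rotation generator standing in for a learned `k_cont` is an assumption chosen because its exact solution is known.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=12):
    """Matrix exponential via scaling and squaring with a truncated Taylor series."""
    n = len(m)
    norm = max(sum(abs(v) for v in row) for row in m)
    squarings = max(0, math.ceil(math.log2(norm)) + 4) if norm > 0 else 0
    scaled = [[v / 2 ** squarings for v in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms + 1):
        # term now holds scaled**order / order!
        term = [[v / order for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def koopman_forecast(k_cont, z0, tau):
    """Advance the latent state by an arbitrary horizon tau in a single step."""
    propagator = mat_exp([[v * tau for v in row] for row in k_cont])
    return [sum(p * z for p, z in zip(row, z0)) for row in propagator]

# Hypothetical generator: a pure rotation, so z(tau) = (cos tau, sin tau).
k_cont = [[0.0, -1.0], [1.0, 0.0]]
z_direct = koopman_forecast(k_cont, [1.0, 0.0], 0.7)
z_two_hops = koopman_forecast(k_cont, koopman_forecast(k_cont, [1.0, 0.0], 0.3), 0.4)
```

The direct forecast to τ = 0.7 and the two irregular hops (0.3, then 0.4) agree, which is the property that frees the model from a fixed timestep.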

2602.00465 2026-05-11 cs.LG cs.AI

PAIR-Former: Budgeted Relational Multi-Instance Learning for Functional miRNA Target Prediction

Jiaqi Yin, Baiming Chen, Jia Fei, Mingjun Yang

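AI summary This study proposes PAIR-Former, a multi-instance learning method for functional miRNA target prediction, where each transcript yields a large pool of candidate target sites. By modeling relations among sites under a compute budget, the method captures interactions between sites while remaining computationally efficient, which improves prediction accuracy. The study also formalizes the Budgeted Relational Multi-Instance Learning (BR-MIL) framework; its theoretical analysis shows that model performance is governed by the budget parameter rather than the raw bag size, and experiments show that PAIR-Former achieves strong results on several benchmark datasets.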

Comments Preprint. Under review. During the preprint stage, inquiries and feedback can be directed to Jiaqi Yin (yjqhit@gmail.com)

Details
English abstract

Functional miRNA--mRNA targeting is a large-bag prediction problem where each transcript yields a heavy-tailed pool of candidate target sites (CTSs), yet only a pair-level label is observed. Prior methods use max-pooling over individual CTS scores, ignoring relational patterns among sites, but modeling these patterns is critical for accuracy. The challenge is that naive relational aggregation incurs $\mathcal{O}(n^2)$ cost, prohibitive when $n$ reaches thousands, yet a cheap scan alone discards the very interactions that drive functional repression. We formalize this tension as Budgeted Relational Multi-Instance Learning (BR-MIL), a new MIL problem where the compute budget $K$ is a first-class constraint such that at most $K$ instances per bag may receive expensive encoding and relational processing. We establish theoretical foundations for BR-MIL, proving that both approximation quality and generalization are governed by $K$ rather than the raw bag size $n$. Building on this theory, we propose PAIR-Former, which scans all candidates cheaply, selects $K$ diverse CTSs, and aggregates them via Set Transformer. PAIR-Former achieves state-of-the-art performance, outperforming all reproduced baselines with F1$=0.840$ on miRAW (10-fold balanced CV) and $0.839$ on deepTargetPro in transfer evaluation, while achieving $0.793$ on the large-scale MTI benchmark (420K pairs, $38\times$ larger), demonstrating that budgeted relational MIL scales where naive approaches fail. Additional results on CAMELYON16 and Musk2 further show that the proposed BR-MIL formulation extends beyond biological sequence modeling.
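
The scan-select-aggregate pattern can be sketched as follows, to show how a budget $K$ caps the relational cost at $K^2$ pairs regardless of bag size $n$. This is a simplified stand-in under stated assumptions, not PAIR-Former: selection here is plain top-$K$ by cheap score (the paper's diversity-aware selection is not reproduced), and a single softmax self-attention pooling replaces the Set Transformer. All names and the toy data are hypothetical.

```python
import heapq
import math

def budgeted_bag_score(cheap_scores, features, budget):
    """Pool a bag using pairwise processing on at most `budget` instances."""
    k = min(budget, len(cheap_scores))
    # O(n) scan: keep the k instances with the highest cheap score.
    selected = heapq.nlargest(k, range(len(cheap_scores)),
                              key=lambda i: cheap_scores[i])
    chosen = [features[i] for i in selected]
    dim = len(chosen[0])
    pooled = [0.0] * dim
    pair_count = 0
    for query in chosen:
        # Relational step restricted to the selected subset (k x k pairs).
        logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in chosen]
        pair_count += len(logits)
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        norm = sum(weights)
        for w, value in zip(weights, chosen):
            for d in range(dim):
                pooled[d] += (w / norm) * value[d] / k
    return pooled, selected, pair_count

# Hypothetical bag of 200 candidates with 2-dimensional features.
n = 200
scores = [(i * 37) % n for i in range(n)]
feats = [[i / n, 1.0] for i in range(n)]
pooled, selected, pair_count = budgeted_bag_score(scores, feats, budget=5)
```

Only 25 pairwise interactions are computed here instead of the 40,000 a full $n \times n$ pass would need.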

2601.23251 2026-05-11 cs.CV

Structure Over Scale: Learning Visual Reasoning from Pedagogical Video

Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

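AI summary To address the weakness of current vision-language models on basic visual reasoning tasks, this study proposes learning visual reasoning from the context-question-pause-answer structure characteristic of children's educational video. It introduces SoSVQA, a unified benchmark of 10K temporally aligned question-answer pairs automatically extracted from children's programs, and fine-tunes models with Group Relative Policy Optimization (GRPO), improving performance on several video reasoning benchmarks. Experiments show that, even with far less data than mainstream large models, the approach improves the reasoning ability of vision-language models to a level comparable with leading proprietary systems.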

Details
English abstract

State-of-the-art vision-language models (VLMs) score impressively on video benchmarks yet stumble on basic visual reasoning tasks involving spatial relations, navigation, and object selection that a preschooler solves easily. We hypothesize that the explicit pedagogical structure, specifically the context-question-pause-answer cycles embedded in children's educational video, provides naturally co-aligned reasoning traces: temporally synchronized visual cues, questions, and answers that emerge only from deliberate pedagogical authoring and cannot be practically reconstructed through manual annotation at scale. To test this, we introduce SoSVQA (Structure over Scale Visual Question Answering), a unified benchmark of 10K question-answer pairs automatically extracted from Dora the Explorer (DoraVQA) and Mickey Mouse Clubhouse (ClubHVQA) with precise timestamp alignment, and fine-tune Qwen2-VL and Qwen3-VL using Group Relative Policy Optimization (GRPO) to leverage the clear correctness signals and structured reasoning traces inherent in educational content. Despite training on just 10K QA pairs from 78 hours of children's television, orders of magnitude less data than GPT and Gemini, our approach delivers generalizable performance gains for Qwen-based VLMs, yielding consistent improvements on NExT-QA (+19.7), Video-MME (+10.6), and MotionBench (+4.9), matching the performance of leading proprietary systems and demonstrating that content structure can compensate for content scale.

2601.22766 2026-05-11 cs.LG

Sparse Attention as Compact Kernel Regression

Saul Santos, Nuno Gonçalves, Daniel C. McNamee, Marcos Treviso, André F. T Martins

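AI summary This paper establishes a formal correspondence between sparse attention mechanisms and compactly supported kernel regression, showing that normalized ReLU and sparsemax attention correspond to Epanechnikov kernel regression under fixed and adaptive normalization, respectively. It further shows that the Epanechnikov, biweight, and triweight kernels commonly used in nonparametric density estimation correspond to α-entmax attention with specific parameter values, while the softmax/Gaussian relationship emerges as the parameter tends to infinity. The work provides a theoretical basis for attention design, and experiments with Memory Mosaics, a kernel-regression-based transformer variant, validate it on language modeling, in-context learning, and length generalization tasks.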

Comments 16 pages, 5 figures

Details
English abstract

Recent work has revealed a link between self-attention mechanisms in transformers and test-time kernel regression via the Nadaraya-Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. In this paper, we establish a formal correspondence between sparse attention and compact (bounded support) kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, we demonstrate that widely used kernels in nonparametric density estimation -- including Epanechnikov, biweight, and triweight -- correspond to $α$-entmax attention with $α= 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges in the limit $n \to \infty$. This unified perspective explains how sparsity naturally emerges from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with a kernel-regression-based variant of transformers -- Memory Mosaics -- show that kernel-based sparse attention achieves competitive performance on language modeling, in-context learning, and length generalization tasks, offering a principled framework for designing attention mechanisms.
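
To make the two sparse mechanisms concrete, the sketch below computes Nadaraya-Watson weights with an Epanechnikov kernel (a ReLU of an affine function of squared distance, normalized by its sum) and the standard sparsemax projection onto the simplex. It is an illustration only: the bandwidth parametrization and the toy inputs are assumptions, and the paper's exact mapping between kernel bandwidth and attention normalization is not given in the abstract.

```python
def epanechnikov_attention(query, keys, bandwidth):
    """Nadaraya-Watson weights with kernel K(u) = max(0, 1 - |u|^2)."""
    weights = []
    for key in keys:
        sq_dist = sum((q - k) ** 2 for q, k in zip(query, key))
        weights.append(max(0.0, 1.0 - sq_dist / bandwidth ** 2))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no key lies inside the kernel support")
    return [w / total for w in weights]

def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    running, tau = 0.0, 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        # The support condition holds for a prefix of the sorted scores,
        # so the last k that satisfies it determines the threshold tau.
        if 1.0 + k * value > running:
            tau = (running - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

# Hypothetical inputs: the third key is outside the bandwidth and gets zero weight.
kernel_weights = epanechnikov_attention([0.0, 0.0],
                                        [[0.0, 0.0], [0.5, 0.0], [2.0, 0.0]],
                                        bandwidth=1.0)
sparse_probs = sparsemax([1.0, 0.5, -1.0])
```

Both outputs sum to one and assign exactly zero weight to distant or low-scoring entries, which is the bounded-support behavior the abstract attributes to compact kernels.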

2601.15984 2026-05-11 cs.LG

Partially Lazy Gradient Descent for Smoothed Online Learning

Naram Mhaisen, George Iosifidis

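AI summary This paper proposes $k$-lazyGD, an online learning algorithm that sits between greedy Online Gradient Descent (OGD) and lazy gradient descent (lazy GD), spanning a spectrum from reactive to stable updates. The study focuses on Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. The main contribution is a proof that lazy updates are possible without sacrificing hitting performance: $k$-lazyGD achieves the optimal dynamic regret bound $\mathcal{O}(\sqrt{(P_T{+}1)T})$ for any laziness slack $k$ up to $Θ(\sqrt{T/P_T})$, where $P_T$ is the comparator path length, which links the allowable laziness to the comparator's shifts. The analysis builds on the FTRL framework and comes with a matching lower bound, and the resulting method adapts between stability and agility as needed.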

Details
English abstract

We introduce $k$-lazyGD, an online learning algorithm that bridges the gap between greedy Online Gradient Descent (OGD, for $k{=}1$) and lazy GD/dual-averaging (for $k{=}T$), creating a spectrum between reactive and stable updates. We analyze this spectrum in Smoothed Online Convex Optimization (SOCO), where the learner incurs both hitting and movement costs. Our main contribution is establishing that laziness is possible without sacrificing hitting performance: we prove that $k$-lazyGD achieves the optimal dynamic regret $\mathcal{O}(\sqrt{(P_T{+}1)T})$ for any laziness slack $k$ up to $Θ(\sqrt{T/P_T})$, where $P_T$ is the comparator path length. This result formally connects the allowable laziness to the comparator's shifts, showing that $k$-lazyGD can retain the inherently small movements of lazy methods without compromising tracking ability. We base our analysis on the Follow the Regularized Leader (FTRL) framework, and derive a matching lower bound. Since the slack depends on $P_T$, an ensemble of learners with various slacks is used, yielding a method that is provably stable when it can be, and agile when it must be.
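
The two endpoints of this spectrum can be compared directly. The sketch below runs greedy OGD (project after every step) and lazy GD in its dual-averaging form (accumulate unprojected, project only to play) on the interval [-1, 1], and measures the movement cost of each. The intermediate $k$-lazyGD rule is not specified in the abstract and is not reproduced; the domain, step size, and gradient sequence are assumptions for illustration.

```python
def clip(value, low=-1.0, high=1.0):
    """Euclidean projection onto the interval [low, high]."""
    return min(max(value, low), high)

def greedy_ogd(grads, lr, x0=0.0):
    """Greedy OGD: step from the last projected iterate, then project."""
    xs = [x0]
    for g in grads:
        xs.append(clip(xs[-1] - lr * g))
    return xs

def lazy_gd(grads, lr, x0=0.0):
    """Lazy GD: accumulate steps in an unprojected state, project to play."""
    xs, dual = [x0], x0
    for g in grads:
        dual -= lr * g
        xs.append(clip(dual))
    return xs

def movement_cost(xs):
    """Total distance travelled between consecutive iterates."""
    return sum(abs(b - a) for a, b in zip(xs, xs[1:]))

# Hypothetical gradients: three pushes toward the boundary, then one reversal.
grads = [-1.0, -1.0, -1.0, 1.0]
greedy_path = greedy_ogd(grads, lr=1.0)
lazy_path = lazy_gd(grads, lr=1.0)
```

After the reversal, the greedy iterate leaves the boundary immediately while the lazy one stays put because its accumulated state is still outside the domain, so the lazy path moves less (cost 1 versus 2 here).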

2601.15127 2026-05-11 cs.LG cs.CV cs.DC

DeepFedNAS: Efficient Hardware-Aware Architecture Adaptation for Heterogeneous IoT Federations via Pareto-Guided Supernet Training

Bostan Khan, Masoud Daneshtalab

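AI summary DeepFedNAS is an efficient hardware-aware architecture adaptation method that tailors neural network architectures to different device classes in federated learning across heterogeneous IoT devices. It introduces a multi-objective fitness function combining information-theoretic network metrics with architectural heuristics, and a two-phase framework: the first phase improves supernet training with a pre-computed cache of elite architectures, and the second uses the fitness function as a zero-cost accuracy proxy to discover hardware-optimized subnets quickly, greatly improving search efficiency. Experiments show that DeepFedNAS achieves state-of-the-art accuracy on several datasets while substantially reducing communication overhead, making it suitable for large-scale, communication-constrained IoT federated learning.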

Comments This paper significantly extends the preliminary work presented at ESANN 2026. Source Code: https://github.com/bostankhan6/DeepFedNAS

Details
English abstract

Deploying federated learning across heterogeneous IoT device fleets requires tailored neural network architectures for each device class, yet existing Federated Neural Architecture Search (FedNAS) methods suffer from unguided supernet training and prohibitively costly post-training search pipelines that demand over 20 GPU-hours per deployment target. We introduce DeepFedNAS, a two-phase framework built on a multi-objective fitness function that synthesizes information-theoretic network metrics with architectural heuristics. In the first phase, Federated Pareto Optimal Supernet Training replaces random subnet sampling with a pre-computed cache of elite, high-fitness architectures, yielding a superior supernet. In the second phase, a Predictor-Free Search uses this fitness function as a zero-cost accuracy proxy, discovering hardware-optimized subnets in ~20 seconds, a ~61x speedup over the baseline pipeline. Experiments on CIFAR-10, CIFAR-100, and CINIC-10 demonstrate state-of-the-art accuracy (up to +1.21% on CIFAR-100), a 2.8x reduction in per-round transmission size, and robust performance under extreme non-IID conditions (α = 0.1), making DeepFedNAS practical for scalable, communication-constrained IoT federations. Source code: https://github.com/bostankhan6/DeepFedNAS

2601.11794 2026-05-11 cs.LG cs.CV cs.RO

Physics-Constrained Denoising Autoencoders for Data-Scarce Wildfire UAV Sensing

Abdelrahman Ramadan, Zahra Dorbeigi Namaghi, Emily Taylor, Lucas Edwards, Xan Giuliani, David S. McLagan, Sidney Givigi, Melissa Greeff

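AI summary This study addresses corrupted concentration estimates caused by the poor data quality of low-cost sensors on UAVs used for wildfire monitoring. It proposes PC²DAE, a denoising autoencoder with embedded physical constraints that uses softplus activations and physically plausible temporal smoothing to ensure outputs obey physical laws. The model performs well under data scarcity, achieving effective denoising from a small amount of flight data, runs efficiently on edge devices, and clearly outperforms conventional deep learning approaches.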

Details
Journal ref
2026 IEEE International Systems Conference (SysCon), Halifax, NS, Canada, pp. 1-8, 2026
English abstract

Wildfire monitoring requires high-resolution atmospheric measurements, yet low-cost sensors on Unmanned Aerial Vehicles (UAVs) exhibit baseline drift, cross-sensitivity, and response lag that corrupt concentration estimates. Traditional deep learning denoising approaches demand large datasets impractical to obtain from limited UAV flight campaigns. We present PC$^2$DAE, a physics-informed denoising autoencoder that addresses data scarcity by embedding physical constraints directly into the network architecture. Non-negative concentration estimates are enforced via softplus activations and physically plausible temporal smoothing, ensuring outputs are physically admissible by construction rather than relying on loss function penalties. The architecture employs hierarchical decoder heads for Black Carbon, Gas, and CO$_2$ sensor families, with two variants: PC$^2$DAE-Lean (21k parameters) for edge deployment and PC$^2$DAE-Wide (204k parameters) for offline processing. We evaluate on 7,894 synchronized 1 Hz samples collected from UAV flights during prescribed burns in Saskatchewan, Canada (approximately 2.2 hours of flight data), two orders of magnitude below typical deep learning requirements. PC$^2$DAE-Lean achieves 67.3% smoothness improvement and 90.7% high-frequency noise reduction with zero physics violations. Five baselines (LSTM-AE, U-Net, Transformer, CBDAE, DeSpaWN) produce 15--23% negative outputs. The lean variant outperforms wide (+5.6% smoothness), suggesting reduced capacity with strong inductive bias prevents overfitting in data-scarce regimes. Training completes in under 65 seconds on consumer hardware.
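
The idea of enforcing non-negativity by construction, rather than through a loss penalty, can be shown with a small sketch. It applies a numerically stable softplus to hypothetical raw decoder outputs; the values are invented for illustration, and the temporal smoothing component mentioned in the abstract is not sketched because its form is not specified there.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); never negative for any finite x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# Hypothetical raw decoder outputs, some negative, as a linear head could emit.
raw_outputs = [-3.0, -0.5, 0.0, 2.0, 40.0]

# Passing them through softplus yields admissible concentration estimates.
concentrations = [softplus(v) for v in raw_outputs]
negative_raw = sum(1 for v in raw_outputs if v < 0)
negative_constrained = sum(1 for v in concentrations if v < 0)
```

Because the constraint sits in the output activation, no choice of weights can produce a negative concentration, whereas an unconstrained linear head relies on training to avoid them.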

2512.23032 2026-05-11 cs.CL cs.AI cs.LG

Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

Kerem Zaman, Shashank Srivastava

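AI summary This study questions the current standard for evaluating the faithfulness of Chain-of-Thought (CoT), arguing that judging faithfulness solely by whether the CoT verbalizes a prompt-injected hint is too narrow. It argues that hints can influence predictions through the CoT without being verbalized, so faithfulness should be assessed with broader methods such as causal mediation analysis. Experiments show that larger inference budgets greatly increase the rate of hint verbalization, and that some CoTs that look unfaithful may have been misjudged because of tight token limits.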

Comments Accepted to ACL 2026. 23 pages, 29 figures, 6 tables

Details
English abstract

Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric adopts a narrow notion of faithfulness and confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with instruct-tuned and reasoning models, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics. We do not claim all CoTs are faithful, only that the absence of hint words alone does not prove unfaithfulness.

2512.14263 2026-05-11 cs.LG cs.AI math.OC

DT-PBO: an Interpretable Tree-based Surrogate Model for Preferential Bayesian Optimization

Nick Leenders, Thomas Quadt, Boris Cule, Roy Lindelauf, Herman Monsuur, Joost van Oijen, Mark Voskuijl

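AI summary This paper proposes DT-PBO, an interpretable tree-based surrogate model for Preferential Bayesian Optimization (PBO), which aims to find a decision-maker's most preferred solution in as few pairwise comparisons as possible. The method introduces a new splitting heuristic that builds shallow decision trees directly from pairwise comparison data and uses a Laplace approximation to obtain probabilistic estimates for each leaf, modeling preference uncertainty while keeping the model interpretable. Experiments show convergence competitive with GP-based PBO across benchmark functions, particularly on functions with rugged optimization landscapes, along with robustness to noise and fast running time.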

Details
English abstract

Preferential Bayesian Optimization (PBO) aims to find a decision-maker's most preferred solution in as few pairwise comparisons as possible. Existing approaches rely on Gaussian Process (GP) surrogates, which provide strong performance but limited interpretability. This limits real-world usability in high-stakes domains, such as healthcare, where interpretability and trust are essential. We propose DT-PBO, a novel tree-based surrogate model for PBO that is inherently interpretable while capturing preference uncertainty. Specifically, we introduce a novel splitting heuristic that constructs interpretable shallow decision trees directly from pairwise comparison data, and use Laplace approximation to obtain probabilistic estimates for each leaf. This enables efficient preference modeling without sacrificing interpretability. Across eight benchmark functions, our method achieves competitive convergence to GP-based PBO, particularly on functions with rugged optimization landscapes. Additional experiments show robustness against noise and a fast computational running time. Experiments on real-world datasets further demonstrate that our model provides interpretable insights into decision-maker preferences that would remain opaque under GP-based approaches.