arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2085
2604.26157 2026-05-11 cs.CL cs.AI

Structural Generalization on SLOG without Hand-Written Rules

Zichao Wei

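AI summary: This study examines structural generalization in semantic parsing, where a system must apply learned compositional rules to novel structural combinations. Unlike approaches that rely on hand-written algebraic rules, or Transformer-based models that fail to generalize, the paper proposes a method that needs no hand-written rules: a neural cellular automaton (NCA) with a discrete bottleneck that learns all compositional rules from data through local iteration. Experiments show that the method generalizes well structurally on the SLOG benchmark and reveal a close relationship between the mechanisms of structural generalization failure and CCG structural features.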

Abstract

Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves an overall accuracy of $67.3 \pm 0.2\%$ across 10 seeds (AM-Parser: $70.8 \pm 4.3\%$), with 11 of 17 structural generalization categories at $100\%$ type-exact match, including three where AM-Parser scores $0$--$74\%$. Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., $41.4\%$) are mixtures of structurally distinct CCG patterns, not partial generalization. These results suggest that CCG directed types provide higher resolution than SLOG's phenomenon-level categories for characterizing structural generalization, and that the success/failure boundary is determined by the coverage of directed operations in the training data.

2604.25809 2026-05-11 cs.CV

Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

Yashwant Pravinrao Bangde, Debaditya Roy

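AI summary: This paper proposes a decoding framework called IECD² to address the mismatch between generated language and visual evidence in vision-language models. The method uses dual-stream decoding that considers both instruction guidance and image-evidence guidance during generation, and fuses the two adaptively through a symmetric KL divergence contrastive gate to suppress the influence of language priors and strengthen visual grounding. Experiments show that the method improves reasoning accuracy and reduces hallucination across multiple vision-language generation tasks.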

Abstract

Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior works have shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD$^2$), maintains two parallel probability distributions over tokens at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrastive gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD$^2$ on generative vision-language reasoning tasks such as captioning and visual question answering, using multiple datasets including POPE, MME, VQAv2, AMBER, and MSCOCO. IECD$^2$ demonstrates consistent improvements in task accuracy and reasoning performance, with a substantial reduction in hallucination compared to state-of-the-art decoding approaches.
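Editor's sketch: the abstract describes fusing two next-token distributions through a gate based on symmetric KL divergence. The toy below shows one way such a gate could behave. The gate form `exp(-symmetric_kl)` and the log-linear mixing are assumptions made for illustration, and the names `fuse`, `p_instr`, and `p_evid` are hypothetical; the paper's actual formulas are not given in the abstract.

```python
import math

EPS = 1e-12  # avoids log(0) for zero-probability tokens


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def fuse(p_instr, p_evid):
    """Fuse instruction-driven and evidence-driven token distributions.

    ASSUMPTION: the gate exp(-symmetric KL) and the log-linear mixing are
    illustrative choices, not the formulas used in the paper.
    """
    gate = math.exp(-symmetric_kl(p_instr, p_evid))  # 1.0 when the streams agree
    scores = [
        math.exp(gate * math.log(pi + EPS) + (1.0 - gate) * math.log(pe + EPS))
        for pi, pe in zip(p_instr, p_evid)
    ]
    total = sum(scores)
    return [s / total for s in scores], gate
```

When the two streams agree, the gate is 1 and the instruction-driven distribution passes through unchanged; when they diverge, the gate shrinks and a token that only the instruction stream favors loses probability mass.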

2604.25380 2026-05-11 cs.CV

Benchmarking and Improving GUI Agents in High-Dynamic Environments

Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi, Yang Liu, Jingrong Wu, Qing Li

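AI summary: This paper studies how to improve agent performance in highly dynamic graphical user interface (GUI) environments, noting that existing methods decide from a single screenshot and therefore miss key interface state. The authors propose the DynamicGUIBench benchmark and the DynamicUI agent; the latter takes screen recordings of the interaction process as input and combines dynamic perception, a refinement strategy, and a reflection module to capture interface changes and improve decision accuracy. Experiments show that DynamicUI has a significant advantage on dynamic GUI tasks while remaining competitive on other public benchmarks.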

Abstract

Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state, including information important for actions, is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection module. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs action-conditioned filtering to refine thoughts and mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.

2604.25150 2026-05-11 cs.LG cs.AI

The Role of Symmetry in Optimizing Overparameterized Networks

Kusha Sareen, Mohammad Pedramfar, Sékou-Oumar Kaba, Mehran Shakerinava, Siamak Ravanbakhsh

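AI summary: This paper studies the role of overparameterization in deep learning optimization, focusing on symmetries in neural network weight space. It finds that overparameterization introduces additional symmetries that aid optimization by improving the conditioning of the Hessian and by increasing the probability mass of global minima. Experiments show that greater network width lowers the Hessian's top eigenvalue and condition number and speeds up convergence. The work offers a unified framework for understanding the relationship between overparameterization and loss landscape geometry.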

Abstract

Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely understood. We analyze weight-space symmetries in neural networks and show that overparameterization introduces additional symmetries that benefit optimization in two distinct ways. First, we prove that these symmetries act as a form of diagonal preconditioning on the Hessian, enabling the existence of better-conditioned minima within each equivalence class of functionally identical solutions. Second, we show that overparameterization increases the probability mass of global minima near typical initializations, making these favourable solutions more reachable. These results offer a potential link between loss landscape geometry and simplicity bias. Empirically, we observe wider networks have lower top eigenvalues, smaller condition numbers and faster convergence, matching our analysis. Our analysis provides a unified framework for understanding overparameterization and width growth as a geometric transformation of the loss landscape.
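Editor's sketch: the claim that weight-space symmetries act like diagonal preconditioning can be seen in a toy one-hidden-unit ReLU model, f(x) = v * relu(w * x). Rescaling (w, v) to (a*w, v/a) with a > 0 leaves the function unchanged but transforms the Hessian at a zero-residual minimum as D H D with D = diag(1/a, a). This is a minimal illustration of the well-known ReLU rescaling symmetry, not the paper's construction, and it does not model width growth; all function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)


def f(w, v, x):
    """Toy network: one hidden ReLU unit with input weight w and output weight v."""
    return v * relu(w * x)


def hessian_at_minimum(w, v, x):
    """Hessian of 0.5 * (f - y)^2 w.r.t. (w, v) at a zero-residual minimum.

    Assumes w * x > 0 so the ReLU is active. With zero residual the Hessian
    reduces to the outer product of the Jacobian J = (df/dw, df/dv).
    """
    jw, jv = v * x, w * x
    return [[jw * jw, jw * jv], [jv * jw, jv * jv]]


def top_eigenvalue(h):
    """The matrix is rank one and PSD, so its only nonzero eigenvalue is the trace."""
    return h[0][0] + h[1][1]


def rescale(w, v, a):
    """Rescaling symmetry of ReLU networks: same function for any a > 0."""
    return a * w, v / a
```

For example, (w, v) = (4.0, 0.25) and its rescaled copy (1.0, 1.0) compute the same function, but the balanced copy has a much smaller top Hessian eigenvalue, i.e. a flatter minimum within the same equivalence class.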

2604.24661 2026-05-11 cs.RO

Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

Zhengru Fang, Yu Guo, Fei Liu, Yuang Zhang, Yihang Tao, Senkang Hu, Wenbo Ding, Yuguang Fang

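AI summary: Real-world visual systems are often disturbed by time-varying factors such as weather, sensor noise, compression artifacts, and background distractions. This paper proposes ACO-MoE, an observation adaptation method grounded in the information bottleneck principle, which separates foreground targets from nuisance information and improves the robustness of visual control under non-stationary perturbations. The method needs no clean reference frames or corruption labels, achieves high-performance control from corrupted RGB images alone, and generalizes well across multiple benchmarks.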

Comments Source code is available at https://github.com/fangzr/aco-moe-code

Abstract

Real-world visual systems face time-varying perturbations, including weather, sensor noise, compression artifacts, and background distractions. Existing image restoration methods are typically designed for fixed corruption types and optimized for pixel-level fidelity, leaving open two questions: how restoration behaves under non-stationary corruption switching, and whether pixel-level fidelity preserves the task-relevant information needed by downstream models. To study this setting, we introduce the Visual Degraded Control Suite (VDCS), a benchmark that injects Markov-switching physical degradations into rendered scenes. We further identify a fundamental failure mode of reconstruction-based representations: faithfully reconstructing corrupted observations forces the latent state to encode corruption-specific nuisance information, thereby contaminating downstream models. From an information-bottleneck perspective, anchoring the representation to the clean foreground eliminates this contamination. Motivated by this analysis, we propose Agent-Centric Observations with Mixture-of-Experts (ACO-MoE), a frozen, plug-and-play observation adapter that combines a routed bank of restoration experts with a foreground-mask branch. ACO-MoE is pretrained entirely offline on synthetic rendered data with automatically generated degradation pairs and simulation-derived foreground masks, requiring no manual annotation. At inference time, it takes only corrupted RGB as input without corruption labels, clean reference frames, or foreground masks. Across VDCS, DMC-GB, and RoboSuite, ACO-MoE consistently improves downstream control with both model-free and model-based backbones, recovering 95.3% of clean-input performance under challenging Markov-switching corruptions. It also generalizes zero-shot to unseen visual perturbations excluded from adapter pretraining.

2604.24372 2026-05-11 cs.CL cs.AI cs.NE

SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution

Sichun Luo, Yi Huang, Haochen Luo, Fengyuan Liu, Guanzhi Deng, Lei Li, Qinghua Yao, Zefa Hu, Junlan Feng, Qi Liu

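AI summary: SeaEvo is a strategy-space evolution method for improving the efficiency of algorithm discovery. It turns natural-language strategic reasoning into core population state in evolutionary search, addressing the shortcomings of existing methods in preserving and evolving strategic directions. The method represents each candidate program with a natural-language strategy description and improves search diversity and directedness through semantic clustering, complementary retrieval, and strategy-space navigation. Experiments show that SeaEvo significantly improves evolutionary search performance on multiple tasks, demonstrating its practical value for making LLM-guided search more efficient.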

Abstract

Large Language Model (LLM)-guided evolutionary search is increasingly used for automated algorithm discovery, yet most current methods track search progress primarily through executable programs and scalar fitness. Even when natural-language reasoning is used through heuristic descriptions or reflection, it typically remains transient mutation context or unstructured memory, rather than organized as persistent population-level state over strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce SeaEvo, a modular strategy-space layer that turns language-level strategic reasoning into first-class population-level evolutionary state in LLM-driven program search. SeaEvo represents each candidate program with an explicit natural-language strategy, clusters the archive by strategy semantics, retrieves behaviorally complementary inspirations, and periodically navigates the strategy landscape to avoid saturated directions. Without modifying the underlying evolutionary algorithms, SeaEvo improves existing evolutionary backbones across algorithm discovery, systems optimization, and agent-scaffold design tasks in most settings. Across four systems benchmarks, SeaEvo achieves a 20.6% average relative improvement, with the best single run on Prism scoring 3$\times$ higher. These results suggest that persistent strategy representations provide a practical mechanism for improving the effectiveness and cost-efficiency of LLM-guided evolutionary search, pointing toward compound AI systems whose search capabilities benefit from the structured accumulation and reuse of algorithmic strategies.

2604.24136 2026-05-11 cs.CV eess.IV

Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution

Shyang-En Weng, Yi-Cheng Liao, Yu-Syuan Xu, Wei-Chen Chiu, Ching-Chun Huang

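AI summary: This paper proposes IDaS-SR, a one-step diffusion framework for real-world image super-resolution, aimed at the high inference cost of pretrained diffusion models and the perception-distortion trade-off faced by single-step distillation methods. Its core components are the MINE module, which resolves initialization and trajectory mismatches, and the CHARIOT mechanism, which stabilizes the generation process; together they enable a seamless transition from structural restoration to texture generation in a single inference step. Experiments show that the method outperforms existing state-of-the-art methods.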

Abstract

Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.

2604.23947 2026-05-11 cs.AI

GamED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation

Shiven Agarwal, Yash Shah, Ashish Raj Shekhar, Priyanuj Bordoloi, Vivek Gupta

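AI summary: This paper proposes GamEDAI, a hierarchical multi-agent framework that automatically turns instructor-provided questions into playable, pedagogically meaningful games validated through formal mechanic contracts. The framework builds on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, and supports 15 interaction mechanics covering spatial reasoning, procedural execution, and higher-order Bloom's Taxonomy objectives. In experiments on 200 questions across five subject domains, the system achieves a 90% validation pass rate and a 73% token reduction, showing clear advantages in generation efficiency and pedagogical alignment.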

Abstract

We introduce GamEDAI, a hierarchical multi-agent framework that transforms instructor-provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, GamEDAI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher-order Bloom's Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents (${\sim}$73,500 $\rightarrow$ ${\sim}$19,900 tokens/game) at $0.46 per game. Within this model configuration, these results suggest that phase-bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom's-aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.

2604.23478 2026-05-11 cs.CL

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

Rohith Reddy Bellibatlu, Edward Raff, Wenbin Zhang

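AI summary: This paper presents JudgeSense, a benchmark for evaluating how sensitive large language models are to prompt variation when used as automated judges. The study systematically analyzes the stability of judgments under semantically equivalent prompts across tasks and model architectures, and builds a dataset of hand-validated prompt pairs covering factuality, coherence, relevance, and preference. Experiments find that coherence tasks are the most sensitive to prompt changes, that factuality judgments are relatively stable under standard conditions, and that model scale does not reliably indicate judgment consistency.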

Comments 20 pages, 2 figures, 1 table. Code: https://github.com/rohithreddybc/judgeSense. Dataset (JudgeSense Benchmark): https://huggingface.co/datasets/Rohithreddybc/judgesense-benchmark

Abstract

Large language models are widely adopted as automated evaluation judges, yet the stability of their verdicts under semantically equivalent prompt rephrasings remains largely unexamined. We conduct a systematic empirical study of prompt-induced decision instability across multiple evaluation tasks and judge architectures. To facilitate this analysis, we release JudgeSense, a benchmark comprising hand-validated prompt-paraphrase pairs spanning factuality, coherence, relevance, and preference, drawn from established NLP benchmarks and accompanied by comprehensive decision logs. The benchmark enables the measurement of judge stability across equivalent prompts, allowing researchers to assess whether stability correlates with model scale or instruction-tuning, and to identify which tasks are most sensitive to prompt wording. Our evaluation reveals that coherence remains the primary task for distinguishing judge behavior, while factuality judgments demonstrate high stability under standard conditions. Pairwise evaluation tasks consistently exhibit position bias. Crucially, we find that model scale is not a reliable proxy for consistency: the largest and newest models are not the most consistent.
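Editor's sketch: one simple way to quantify stability across equivalent prompts is the fraction of paraphrase pairs on which a judge's verdict changes. The metric name `flip_rate` and the data layout are assumptions for illustration; the abstract does not state which stability metric JudgeSense reports.

```python
def flip_rate(verdict_pairs):
    """Fraction of prompt-paraphrase pairs whose two verdicts disagree.

    verdict_pairs: list of (verdict_on_prompt_a, verdict_on_prompt_b) tuples.
    HYPOTHETICAL metric, used here only to illustrate the idea of stability.
    """
    if not verdict_pairs:
        raise ValueError("need at least one verdict pair")
    flips = sum(1 for first, second in verdict_pairs if first != second)
    return flips / len(verdict_pairs)
```

A flip rate of 0 means the judge gave the same verdict on every paraphrase pair; higher values indicate greater sensitivity to prompt wording.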

2604.21657 2026-05-11 cs.LG

Transferable SCF-Acceleration through Solver-Aligned Initialization Learning

Eike S. Eberhard, Viktor Kotsev, Timm Güthle, Stephan Günnemann

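AI summary: This study addresses the slow convergence of self-consistent field (SCF) solvers in density functional theory calculations. It proposes Solver-Aligned Initialization Learning (SAIL), which optimizes the generation of initial guesses by differentiating through the SCF solver end-to-end. The study introduces the Effective Relative Iteration Count (ERIC) metric to evaluate convergence efficiency more accurately and validates SAIL on multiple molecular datasets, where it significantly improves computational efficiency for large molecular systems.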

Abstract

The cost of Kohn-Sham density functional theory (KS-DFT) calculations scales with the number of solver iterations, which depends on the quality of the initial guess. Machine learning methods that predict initial guesses from molecular geometry can reduce this cost, but matrix-prediction models fail when extrapolating to larger molecules, degrading rather than accelerating convergence [Liu et al., 2025]. We show that this failure is a supervision problem, not an extrapolation problem: models trained on ground-state targets fit those targets well out of distribution, yet produce initial guesses that slow convergence. Solver-Aligned Initialization Learning (SAIL) resolves this for both Hamiltonian and density matrix models by differentiating through the self-consistent field (SCF) solver end-to-end. We introduce the Effective Relative Iteration Count (ERIC), a correction to the commonly used RIC that accounts for hidden Fock-build overhead. On QM40, which contains molecules up to 4$\times$ larger than the training distribution, SAIL reduces ERIC by 37% (PBE), 33% (SCAN), and 28% (B3LYP), more than doubling the previous state-of-the-art reduction on B3LYP. On QMugs molecules 10$\times$ larger than the training set, SAIL delivers a 1.35$\times$ wall-time speedup at the hybrid level of theory, extending ML SCF acceleration to large drug-like molecules.

2604.18905 2026-05-11 cs.RO

Task-Adaptive Admittance Control for Human-Quadrotor Cooperative Load Transportation with Dynamic Cable-Length Regulation

Shuai Li, Ton T. H. Duong, Damiano Zanotto

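AI summary: This paper studies safety and efficiency in human-quadrotor cooperative load transportation and proposes an adaptive admittance control method based on dynamic cable-length regulation. By actively controlling cable length, the method lets the quadrotor adjust its response to contact forces dynamically, improving operational safety and flexibility. Experiments show better system responsiveness and motion smoothness in load loading/unloading and transport tasks, significantly improving cooperative transportation performance.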

Comments Preprint of accepted manuscript to be published in IEEE Robotics and Automation Letters (RA-L)

Abstract

The collaboration between humans and robots is critical in many robotic applications, especially in those requiring physical human-robot interaction (pHRI). Previous research in pHRI has largely focused on robotic manipulators, employing impedance or admittance control to maintain operational safety. Conversely, research in human-quadrotor cooperative load transportation (CLT) is still in its infancy. This letter introduces a novel admittance controller designed for safe and effective human-quadrotor CLT using a quadrotor equipped with an actively-controlled winch. The proposed method accounts for the system's coupled dynamics, allowing the quadrotor and its cable to dynamically adapt to contact forces during CLT tasks, thereby enhancing responsiveness. We experimentally validated the task-adaptive capability of the controller across the entire CLT process, including in-place loading/unloading and load transporting tasks. To this end, we compared the system performances against a conventional approach, using both variable and fixed cable lengths under low- and high-stiffness conditions. Results demonstrate that the proposed method outperforms the conventional approach in terms of system responsiveness and motion smoothness, leading to improved CLT capabilities.
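Editor's sketch: for readers unfamiliar with admittance control, the generic textbook form maps a measured external force to a commanded velocity through a virtual mass and damper, M * dv/dt + D * v = F_ext. The code below integrates that one-dimensional law with explicit Euler steps. It is not the paper's task-adaptive controller and ignores the quadrotor, cable, and winch dynamics; the parameter values are hypothetical.

```python
def admittance_step(velocity, f_ext, mass, damping, dt):
    """One explicit Euler step of M * dv/dt + D * v = F_ext."""
    acceleration = (f_ext - damping * velocity) / mass
    return velocity + acceleration * dt


def simulate(f_ext, mass=2.0, damping=4.0, dt=0.01, steps=1000):
    """Commanded velocity after holding a constant external force for steps * dt seconds."""
    velocity = 0.0
    for _ in range(steps):
        velocity = admittance_step(velocity, f_ext, mass, damping, dt)
    return velocity
```

Under a constant force the commanded velocity settles at F_ext / D, so a lower virtual damping makes the system yield more readily to the human's push.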

2604.16889 2026-05-11 cs.CL

Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

Qinhao Chen, Linyang He, Nima Mesgarani

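AI summary: This paper proposes PIE, an efficient end-to-end circuit discovery framework for cross-layer transcoders (CLTs) that integrates pruning, interpretation, and evaluation in one pipeline and significantly reduces the computational cost of feature interpretation. Its core methods are a feature-attribution-based pruning strategy and the FAP-Synergy synergy-aware reranking mechanism, which preserve behavioral fidelity under strict budgets. Experiments show higher interpretation efficiency and behavior retention than baseline methods on multiple datasets, with a clear advantage at low budgets.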

Abstract

Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to target behaviors. To address this, we introduce the first CLT-native end-to-end pruning framework, PIE, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics across IOI and Doc-String datasets. Across budget constraints of K in {50, 100, 200, 400, 800}, our rigorous benchmarking reveals distinct operational regimes: while base FAP and adapted baselines perform robustly at relaxed budgets, FAP-Synergy excels in highly constrained, strict-budget regimes. Crucially, we demonstrate a practical "Effective Budget" advantage: on the IOI task for both Llama-3.2-1B and Gemma-2-2B, FAP-Synergy at K=50 functionally matches the behavioral fidelity of baseline circuits at K=75. Because downstream evaluation costs scale linearly per feature, Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation costs by 33%.
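Editor's sketch: the 33% figure follows from the stated linear per-feature cost. If a K=50 circuit matches the fidelity of a K=75 baseline, the saving relative to the baseline is (75 - 50) / 75. The helper name below is hypothetical.

```python
def relative_saving(baseline_k, matched_k):
    """Relative cost saving when cost scales linearly with the number of features."""
    if baseline_k <= 0 or matched_k < 0:
        raise ValueError("feature budgets must be positive")
    return (baseline_k - matched_k) / baseline_k


saving = relative_saving(baseline_k=75, matched_k=50)  # 25 features saved out of 75
```

Here `saving` is one third, which rounds to the 33% reported in the abstract.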

2604.16579 2026-05-11 cs.LG cs.AI

EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning

Fangyuan Liu, Sirui Zhao, Zeyu Zhang, Jinyang Huang, Feng-Qi Cui, Bin Luo, Meng Li, Tong Xu, Enhong Chen

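AI summary: EviDep is a trustworthy multimodal depression estimation method based on disentangled evidential learning, designed for the challenges posed by naturalistic noise and behavioral complexity in unconstrained environments. It uses a Normal-Inverse-Gamma distribution to jointly quantify depression severity together with aleatoric and epistemic uncertainty, and introduces a frequency-aware feature extraction module and a disentangled evidential learning strategy that separate shared from modality-specific information in multimodal behavioral features, improving predictive accuracy and uncertainty calibration. Experiments show that EviDep achieves state-of-the-art performance on multiple datasets, providing a reliable, risk-aware decision-support tool for depression assessment.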

Abstract

Automated multimodal depression estimation in unconstrained environments is inherently challenged by naturalistic noise and complex behavioral variability. Prevailing deterministic methods, however, produce uncalibrated point estimates without quantifying predictive uncertainty, exposing decision-making to the risk of overconfident, untrustworthy estimates. To establish a reliable and trustworthy estimation paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. To ensure the integrity of the extracted behavioral evidence and prevent artificial confidence inflation during multimodal fusion, EviDep introduces two tailored mechanisms. First, addressing the temporal-frequency heterogeneity of behavioral cues, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically decouple stable macro-level affective baselines from transient micro-level behavioral bursts, effectively filtering out task-irrelevant artifacts. Second, a Disentangled Evidential Learning strategy enforces explicit decorrelation of features in these purified representations. By separating the cross-modal shared consensus from modality-specific behavioral nuances before Bayesian fusion, this rigorous disentanglement strictly prevents the model from double-counting overlapping information. Extensive experiments on the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, thereby delivering a trustworthy, risk-aware decision-support tool for depression estimation.
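Editor's sketch: in standard deep evidential regression, a Normal-Inverse-Gamma output with parameters (gamma, nu, alpha, beta) yields a point prediction gamma, an aleatoric uncertainty beta / (alpha - 1), and an epistemic uncertainty beta / (nu * (alpha - 1)). The code below computes these textbook quantities. Whether EviDep uses exactly these expressions is an assumption, since the abstract names only the distribution; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NIGParams:
    gamma: float  # predicted mean (e.g. a severity score)
    nu: float     # virtual observations supporting the mean, nu > 0
    alpha: float  # shape of the inverse-gamma prior on the variance, alpha > 1
    beta: float   # scale of the inverse-gamma prior on the variance, beta > 0


def evidential_uncertainties(p):
    """Return (prediction, aleatoric, epistemic) for a Normal-Inverse-Gamma output.

    aleatoric = E[sigma^2]  = beta / (alpha - 1)
    epistemic = Var[mu]     = beta / (nu * (alpha - 1))
    """
    if p.nu <= 0 or p.alpha <= 1 or p.beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = p.beta / (p.alpha - 1.0)
    epistemic = p.beta / (p.nu * (p.alpha - 1.0))
    return p.gamma, aleatoric, epistemic
```

More evidence (larger nu) shrinks the epistemic term while leaving the aleatoric term unchanged, which is what lets the two kinds of uncertainty be reported separately.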

2604.15694 2026-05-11 cs.LG math.PR

Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

Jingyuan Li, Xiaoyi Jiang, Fukang Wen, Wei Liu, Renqian Luo, Yi Zhu, Zuoqiang Shi, Pipi Hu

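AI summary: This paper proposes Neural CTMC, a neural diffusion model based on continuous-time Markov chains (CTMCs), aimed at improving discrete data generation. Unlike existing methods, it decomposes the reverse CTMC process into two separate parts, jump timing (exit rate) and jump direction (jump distribution), each parameterized by a dedicated network head, which matches the intrinsic structure of a CTMC more closely. Experiments show that the method achieves better generative perplexity than existing methods on multiple datasets, indicating stronger modeling capability and efficiency.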

Abstract

Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix monolithically -- through proxies such as concrete scores (SEDD) or clean-data predictions (MDLM, GIDD) -- rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. We propose Neural CTMC, which exploits the underlying Poisson structure of CTMC dynamics by separately parameterizing the reverse process through an exit rate (when to jump) and a jump distribution (where to jump) via two dedicated network heads. We show that the evidence lower bound (ELBO) reduces to a path-space KL divergence between the true and learned reverse processes that factorizes into a Poisson KL for timing and a categorical KL for direction, and admits a tractable, gradient-equivalent and consistent loss. Experimentally, scored by Gemma2-9B, our pure-uniform Neural CTMC achieves $16.36$ generative perplexity on TinyStories (vs. GIDD $37.60$ and MDLM $42.66$). On OpenWebText, it attains the best perplexity at the same training-token budget across 16--128 sampling steps among the methods we compare (e.g., at 128 steps: Neural CTMC $183.6$ vs. MDLM $210.5$ and GIDD $249.8$). To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.
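Editor's sketch: the decomposition the abstract refers to is a standard property of any CTMC. A row of the rate matrix splits into an exit rate (the sum of off-diagonal rates, which sets an exponential holding time) and a jump distribution (the off-diagonal rates normalized to sum to one). The code below performs this split on a fixed toy rate row and samples one jump; in Neural CTMC the two quantities come from network heads, which this sketch does not model. Function names are hypothetical.

```python
import random


def decompose(rate_row, state):
    """Split one row of a CTMC rate matrix into (exit_rate, jump_distribution)."""
    exit_rate = sum(r for j, r in enumerate(rate_row) if j != state)
    if exit_rate <= 0:
        raise ValueError("state is absorbing: no outgoing rate")
    jump = [0.0 if j == state else r / exit_rate for j, r in enumerate(rate_row)]
    return exit_rate, jump


def sample_jump(rate_row, state, rng):
    """Sample when to jump (exponential holding time) and where to jump."""
    exit_rate, jump = decompose(rate_row, state)
    holding_time = rng.expovariate(exit_rate)                           # timing
    next_state = rng.choices(range(len(rate_row)), weights=jump)[0]     # direction
    return holding_time, next_state
```

For the rate row [-3.0, 1.0, 2.0] at state 0, the exit rate is 3.0 and the jump distribution puts probability 1/3 on state 1 and 2/3 on state 2.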

2604.14786 2026-05-11 cs.AI

CogEvolution: A Human-like Generative Educational Agent to Simulate Student's Cognitive Evolution

Wei Zhang, Yihang Cheng, Zhirong Ye, Kezhen Huang

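AI summary: This paper proposes CogEvolution, a generative educational agent that simulates how students' cognition evolves during learning. Building on the ICAP taxonomy from cognitive psychology, the method constructs a cognitive depth perceptron to quantify learners' cognitive engagement precisely, and designs a knowledge retrieval mechanism based on Item Response Theory to simulate how new knowledge connects to and is assimilated with prior knowledge. It also uses evolutionary algorithms to build a dynamic cognitive update mechanism that integrates learning behavior and cognitive evolution in real time. Experiments show that CogEvolution outperforms existing models in behavioral fidelity and learning-curve fitting and produces plausible, robust cognitive evolution paths consistent with the expectations of educational psychology.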

Abstract

Generative Agents, owing to their precise modeling and simulation capabilities of human behavior, have become a pivotal tool in the field of Artificial Intelligence in Education (AIEd) for uncovering complex cognitive processes of learners. However, existing educational agents predominantly rely on static personas to simulate student learning behaviors, neglecting the decisive role of deep cognitive capabilities in learning outcomes during practice interactions. Furthermore, they struggle to characterize the dynamic fluidity of knowledge internalization, transfer, and cognitive state transitions. To overcome this bottleneck, this paper proposes a human-like educational agent capable of simulating student cognitive evolution: CogEvolution. Specifically, we first construct a cognitive depth perceptron based on the Interactive, Constructive, Active, Passive (ICAP) taxonomy from cognitive psychology, achieving precise quantification of learner cognitive engagement. Subsequently, we propose a memory retrieval method based on Item Response Theory (IRT) to simulate the connection and assimilation of new and prior knowledge. Finally, we design a dynamic cognitive update mechanism based on evolutionary algorithms to simulate the real-time integration of student learning behaviors and cognitive evolution processes. Comprehensive evaluations demonstrate that CogEvolution not only significantly outperforms baseline models in behavioral fidelity and learning curve fitting but also uniquely reproduces plausible and robust cognitive evolutionary paths consistent with educational psychology expectations, providing a novel paradigm for constructing highly interpretable educational agents.

2604.13010 2026-05-11 cs.LG cs.AI

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Yecheng Wu, Song Han, Hai Cai

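AI summary: This paper proposes Lightning OPD, an efficient offline on-policy distillation framework that addresses the high infrastructure overhead of standard on-policy distillation (OPD), which requires a teacher server to run throughout training. By precomputing the teacher model's log-probabilities over supervised fine-tuning (SFT) rollouts and reusing them during training, Lightning OPD trains offline without a live teacher server while closely matching the performance of standard OPD. Experiments show strong results on math reasoning and code generation with significantly higher training efficiency, and the method scales to MoE architectures, offering a more efficient option for research on post-training large language models.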

Abstract

On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.

2604.11995 2026-05-11 cs.LG

Loss-Driven Bayesian Active Learning

Zhuoyue Huang, Freddie Bickford Smith, Tom Rainforth

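AI summary: This paper proposes a loss-driven approach to Bayesian active learning that targets data acquisition directly at the loss function of a specific decision problem, improving downstream predictive performance. The approach converts any loss function into an optimal data acquisition objective, which allows the data selection process to be customized, and shows that losses in the form of a weighted Bregman divergence allow a key part of the objective to be computed analytically, making the approach practical. Experiments show that it reduces test loss relative to existing techniques on regression and classification tasks.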

Abstract

The central goal of active learning is to gather data that maximises downstream predictive performance, but popular approaches have limited flexibility in customising this data acquisition to different downstream problems and losses. We propose a rigorous loss-driven approach to Bayesian active learning that allows data acquisition to directly target the loss associated with a given decision problem. In particular, we show how any loss can be used to derive a unique objective for optimal data acquisition. Critically, we then show that any loss taking the form of a weighted Bregman divergence permits analytic computation of a central component of its corresponding objective, making the approach applicable in practice. In regression and classification experiments with a range of different losses, we find our approach reduces test losses relative to existing techniques.
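Editor's sketch: a Bregman divergence is generated by a convex function phi as D(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>. The code below implements this definition and checks two familiar special cases, squared Euclidean distance and KL divergence. It covers only the unweighted form; the weighting and the acquisition objective from the paper are not reproduced here, and the function names are hypothetical.

```python
import math


def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    linear_term = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - linear_term


def squared_norm(v):
    return sum(x * x for x in v)


def squared_norm_grad(v):
    return [2.0 * x for x in v]


def neg_entropy(v):
    # Convex generator whose Bregman divergence is KL on probability vectors.
    return sum(x * math.log(x) for x in v)


def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]
```

With `squared_norm` the divergence equals the squared Euclidean distance between p and q; with `neg_entropy` and two strictly positive probability vectors it equals KL(p || q).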

2604.11962 2026-05-11 cs.LG

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

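AI summary This paper proposes the Linear Centroids Hypothesis (LCH). Whereas the Linear Representation Hypothesis treats the features of a deep network as linear directions in the activation spaces of its intermediate layers, the LCH identifies features with directions in the centroid spaces of the network's local affine experts. Replacing intermediate activations with centroids gives a functional drop-in alternative for standard interpretability tools, and experiments show that it yields sparser, more downstream-useful feature dictionaries. The hypothesis unifies dictionaries, probing, circuits, and saliency maps into a single geometric object, so that interpretability is mechanistic by construction rather than post hoc.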

Comments 23 pages, 17 figures

Details
English abstract

The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in the activation spaces, i.e., output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions among a DN's centroid spaces -- where any vector denotes a centroid or summary of a local affine expert characterizing the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers interpretable circuits in GPT2-Large, and produces faithful gradient-based saliency maps. LCH unifies dictionaries, probing, circuits, and saliency maps into a single geometric object grounded in the network's input-output map -- making interpretability mechanistic by construction rather than post hoc. Code to study the LCH is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis.

2604.11121 2026-05-11 cs.CL

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

Atharva Gupta, Dhruv Kumar, Yash Sinha

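AI summary This paper addresses online polarization detection in SemEval-2026 Task 9 with a two-stage approach that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) to identify polarized content in multilingual, multicultural social media text. A Qwen model is fine-tuned with an interpretable slot-filling template and then refined by DPO on automatically generated preference pairs to reduce false negatives. The submitted system reaches 0.7664 Macro-F1 on the English test set; post-submission experiments raise this to 0.8162, above the organiser baseline.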

Comments Accepted to the 20th International Workshop on Semantic Evaluation (SemEval-2026), to be held in conjunction with ACL 2026

Details
English abstract

The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post-submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802.
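
As a rough illustration of the DPO refinement stage mentioned in the abstract, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. The function name, the argument names, and the default beta are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are summed token log-probabilities of a full response.
    beta=0.1 is an assumed default, not a value reported in the abstract.
    """
    # How much more the policy prefers the chosen response over the
    # rejected one, relative to the frozen reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# A pair on which the policy already prefers the chosen response more than
# the reference does gets a loss below log(2), the value at zero margin.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In this setting the "chosen" response of each automatically generated pair would be the one with the correct polarization label, so lowering the loss pushes the model away from false negatives.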

2604.09839 2026-05-11 cs.AI cs.LG

Steered LLM Activations are Non-Surjective

Aayush Mishra, Daniel Khashabi, Anqi Liu

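AI summary This paper studies the limits of activation steering, a white-box control technique, and argues that steered behavior is not always realizable by a natural text prompt. The authors formalize the question as a surjectivity problem and prove, under practical assumptions, that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts, so that almost surely no prompt reproduces the steered behavior. Experiments on three widely used LLMs illustrate the result. The paper stresses the formal separation between white-box steering and black-box prompting and calls for evaluations of interpretability and safety to distinguish the two kinds of intervention explicitly.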

Comments 10 pages main text. ICLR 2026 Workshops (Sci4DL, Re-Align)

Details
English abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

2604.08923 2026-05-11 cs.CL

NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression

Tong Wu, Nicolay Rusnachenko, Huizhi Liang

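AI summary This paper addresses multilingual Dimensional Aspect-Based Sentiment Analysis (DimABSA), which predicts continuous valence and arousal scores for each aspect in a text. The system fine-tunes XLM-RoBERTa-base, training a separate model for each language-domain pair (English and Chinese across restaurant, laptop, and finance domains) and merging the training and development sets for the final test predictions. Experiments show that task-specific fine-tuning outperforms several large language models used with few-shot prompting.

Details
English abstract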

Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence-arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, with dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language-domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
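
The "sigmoid-scaled outputs" in the abstract can be pictured as an affine rescaling of a sigmoid onto the [1, 9] score range. The sketch below assumes exactly that mapping; the function names and the two-logit interface are hypothetical and stand in for the raw outputs of the two regression heads.

```python
import math


def scale_to_range(logit, low=1.0, high=9.0):
    """Map an unbounded head output to [low, high] through a sigmoid."""
    # Numerically stable sigmoid that avoids overflow for large |logit|.
    if logit >= 0:
        sig = 1.0 / (1.0 + math.exp(-logit))
    else:
        e = math.exp(logit)
        sig = e / (1.0 + e)
    return low + (high - low) * sig


def predict_va(valence_logit, arousal_logit):
    """Return (valence, arousal), each constrained to the [1, 9] range."""
    return scale_to_range(valence_logit), scale_to_range(arousal_logit)


valence, arousal = predict_va(0.0, 2.0)  # a zero logit maps to the midpoint 5.0
```

Bounding the outputs this way means the regression heads can never predict a score outside the annotation scale.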

2604.08077 2026-05-11 cs.CV

AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding

Handong Li, Zikang Liu, Longteng Guo, Tongtian Yue, Yepeng Tang, Xinxin Zhu, Chuanyang Zheng, Ziming Wang, Zhibin Wang, Jun Song, Cheng Yu, Bo Zheng, Jing Liu

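AI summary Video Large Language Models (Video-LLMs) face a heavy computational burden when processing long videos. This paper proposes AdaSpark, an adaptive sparsity framework that partitions the video into 3D spatio-temporal cubes and uses adaptive attention and feed-forward components to select relevant cubes and salient tokens dynamically, which cuts computation while preserving fine-grained perception and long-range temporal dependencies. Experiments show that AdaSpark reduces FLOPs by up to 57% while keeping performance comparable to dense models.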

Comments 8 pages, CVPR2026 Accept (Highlight)

Details
Journal ref
Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2026
English abstract

Processing long-form videos with Video Large Language Models (Video-LLMs) is computationally prohibitive. Current efficiency methods often compromise fine-grained perception through irreversible information disposal or inhibit long-range temporal modeling via rigid, predefined sparse patterns. This paper introduces AdaSpark, an adaptive sparsity framework designed to address these limitations. AdaSpark first partitions video inputs into 3D spatio-temporal cubes. It then employs two co-designed, context-aware components: (1) Adaptive Cube-Selective Attention (AdaS-Attn), which adaptively selects a subset of relevant video cubes to attend to for each query token, and (2) Adaptive Token-Selective FFN (AdaS-FFN), which selectively processes only the most salient tokens within each cube. An entropy-based (Top-p) selection mechanism adaptively allocates computational resources based on input complexity. Experiments demonstrate that AdaSpark reduces FLOPs by up to 57% while maintaining comparable performance to dense models and preserving fine-grained, long-range dependencies, as validated on challenging hour-scale video benchmarks.
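
One way to read the Top-p selection mentioned above is as a nucleus-style rule: keep the smallest set of cubes whose softmax mass reaches a threshold p. The sketch below assumes that reading; the function name, the relevance scores, and p=0.9 are illustrative and do not come from the paper.

```python
import math


def top_p_select(scores, p=0.9):
    """Return indices of the smallest set of items whose softmax mass >= p."""
    # Softmax with max-subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit items from most to least probable, accumulating mass.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept


peaked = top_p_select([10.0, 0.0, 0.0, 0.0])  # low entropy: few cubes kept
flat = top_p_select([0.0, 0.0, 0.0, 0.0])     # high entropy: many cubes kept
```

Under this rule the compute budget follows the entropy of the score distribution: a peaked distribution keeps few cubes, a flat one keeps many, which matches the abstract's description of allocation based on input complexity.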

2604.07277 2026-05-11 cs.LG cs.AI

Android Coach: Improve Online Agentic Training Efficiency with Single State Multiple Actions

Guo Gan, Yuxuan Ding, Cong Chen, Yuwei Ren, Yin Huang, Hong Zhou

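AI summary This paper proposes Android Coach, a framework for making online reinforcement learning of Android agents more efficient. It introduces a Single State Multiple Actions paradigm in which the agent samples and uses several actions at each online state, so that every costly emulator state is exploited more fully. By learning a critic that estimates action values and combining it with a process reward model and a group-wise advantage estimator, Android Coach improves learning efficiency and task success rate while keeping training stable. Experiments on several Android task environments show clear gains over existing methods.

Details
English abstract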

Online reinforcement learning (RL) serves as an effective method for enhancing the capabilities of Android agents. However, guiding agents to learn through online interaction is prohibitively expensive due to the high latency of emulators and the sample inefficiency of existing RL algorithms. We identify a fundamental limitation in current approaches: the Single State Single Action paradigm, which updates the policy with one-to-one state-action pairs from online one-way rollouts without fully exploring each costly emulator state. In this paper, we propose Android Coach, a novel framework that shifts the training paradigm to Single State Multiple Actions, allowing the agent to sample and utilize multiple actions for a single online state. We enable this without additional emulator overhead by learning a critic that estimates action values. To ensure the critic serves as a reliable coach, we integrate a process reward model and introduce a group-wise advantage estimator based on the averaged critic outputs. Extensive experiments demonstrate the effectiveness and efficiency of Android Coach: it achieves 7.5% and 8.3% success rate improvements on AndroidLab and AndroidWorld over UI-TARS-1.5-7B, and attains 1.4x higher training efficiency than Single State Single Action methods PPO and GRPO at matched success rates.
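
A minimal sketch of a group-wise advantage of the kind described above: several actions are sampled at one emulator state, a critic scores each, and the group mean serves as the baseline. The abstract does not give the exact estimator, so the mean-baseline form, the absence of any variance normalization, and the function name are assumptions.

```python
def groupwise_advantages(q_values):
    """Advantage of each sampled action relative to the group's mean value.

    q_values: critic estimates Q(s, a_i) for actions sampled at one state.
    """
    baseline = sum(q_values) / len(q_values)  # averaged critic output
    return [q - baseline for q in q_values]


# Three candidate actions scored by a (hypothetical) critic at a single state.
advantages = groupwise_advantages([1.0, 2.0, 3.0])
```

Because the scores come from the critic rather than from executing each action, extra candidates at a state cost no additional emulator steps.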

2604.03420 2026-05-11 cs.CV cs.AI cs.LG

Zero-Shot Quantization via Weight-Space Arithmetic

Daniele Solombrino, Antonio Andrea Gargiulo, Alessandro Zirilli, Luca Zhou, Adrian Robert Minut, Emanuele Rodolà

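AI summary This paper proposes zero-shot quantization through weight-space arithmetic, which substantially improves post-quantization accuracy without any quantization-aware training on the receiver model. It finds a transferable "quantization direction" in weight space that can be extracted from a donor task and used to make a receiver model more robust to quantization, improving Top-1 accuracy by up to 60 points in a 3-bit setting. Experiments across several Vision Transformer scales and 22 image classification tasks show good cross-task transfer, and the paper proves that the quantization vector is well-defined and gives a local geometric account of its effect.

Details
English abstract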

We show that robustness to post-training quantization (PTQ) is a transferable direction in weight space. We call this direction the quantization vector: extracted from a donor task by simple weight-space arithmetic, it can be used to patch a receiver model and improve post-PTQ Top-1 accuracy by up to 60 points in a 3-bit setting, without receiver-side quantization-aware training (QAT). Because the method requires no receiver training data, it provides a zero-shot, low-cost alternative to QAT for extremely low-bit deployment. Across four ViT scales and 22 image classification tasks, donor quantization vectors often yield substantial gains even when donor and receiver tasks differ markedly. We further prove rigorously that quantization vectors are well-defined and do not suffer from reparameterization symmetries, and provide a local geometric account of their effect. Together, these results suggest that quantization robustness can be partially isolated, reused, and transferred through simple weight-space algebra.
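
The weight-space arithmetic can be sketched in the style of task vectors. The abstract does not spell out the recipe, so the following is an assumption: the quantization vector is taken as the difference between a quantization-robust donor checkpoint and a plain donor checkpoint, and it is added to the receiver with a scaling factor. All names are hypothetical, and weights are plain lists keyed by parameter name.

```python
def quantization_vector(donor_robust, donor_plain):
    """Per-parameter difference between two donor checkpoints (assumed recipe)."""
    return {
        name: [r - p for r, p in zip(donor_robust[name], donor_plain[name])]
        for name in donor_plain
    }


def patch_receiver(receiver, q_vector, scale=1.0):
    """Add the scaled quantization vector to the receiver's weights."""
    return {
        name: [w + scale * d for w, d in zip(receiver[name], q_vector[name])]
        for name in receiver
    }


donor_plain = {"w": [1.0, 2.0]}
donor_robust = {"w": [1.5, 1.0]}
receiver = {"w": [0.0, 4.0]}

q_vec = quantization_vector(donor_robust, donor_plain)
patched = patch_receiver(receiver, q_vec, scale=0.5)
```

Neither step touches receiver training data, which is the sense in which the abstract calls the method zero-shot; the patched weights would then go through ordinary post-training quantization.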

2604.02525 2026-05-11 cs.LG

AdaHOP: Fast and Accurate Low-Precision Training via Outlier-Pattern-Aware Rotation

Seonggon Kim, Alireza Khodamoradi, Pranathi Vasireddy, Kristof Denolf, Eunhyeok Park

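AI summary This paper proposes AdaHOP, a method for fast and accurate low-precision training of large models. It argues that existing methods apply Hadamard transforms uniformly and ignore the outlier structure of the operands, which limits their effect. From a study of weights, activations, and gradients, the paper identifies three stable outlier patterns and designs an adaptive strategy that combines the Inner Hadamard Transform (IHT) with Outlier Extraction (OE), improving the stability and efficiency of low-precision training. Experiments show that AdaHOP reaches BF16-level training quality at MXFP4 precision while delivering substantial memory compression and training speedup.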

Comments 21 pages, 10 figures

Details
English abstract

Hadamard transforms have become a key tool for stabilizing low-precision training, but existing methods apply them uniformly across tensors and computation paths. We show that this one-size-fits-all strategy is inherently limited: Hadamard smoothing reduces quantization error only when its direction is properly aligned with the operand's outlier structure. Through a systematic study of weights, activations, and gradients in LLM training, we identify three stable outlier patterns, Row-wise, Column-wise, and None, and show that each outlier pattern pair in matrix multiplication requires a distinct transform or outlier-handling strategy. We propose AdaHOP, Adaptive Hadamard transform with Outlier-Pattern-aware strategy, which applies Inner Hadamard Transform (IHT) when inner-dimension mixing properly suppresses the operands' outliers, and selectively applies Outlier Extraction (OE) that extracts dominant outlier rows or columns into a high-precision path when it does not. With fused, hardware-aware Triton kernels, AdaHOP enables training from scratch at MXFP4 precision with BF16-level quality, while achieving up to 3.6X memory compression and 1.46X end-to-end training speedup over BF16.
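
The idea behind inner-dimension Hadamard mixing can be shown on a toy product: for an orthonormal Hadamard matrix H, (A H)(H^T B) equals A B, while A H spreads a single large entry of A across the inner dimension. The sketch below uses small pure-Python matrices; it illustrates the identity only and is not the paper's kernel, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return [[x / n ** 0.5 for x in row] for row in h]


H = hadamard(4)
A = [[8.0, 0.0, 0.0, 0.0]]        # one outlier along the inner dimension
B = [[1.0], [2.0], [3.0], [4.0]]

A_rot = matmul(A, H)              # outlier is spread over all four entries
B_rot = matmul(transpose(H), B)
product = matmul(A_rot, B_rot)    # same value as matmul(A, B)
```

Here the largest magnitude in A drops from 8 to 4 after rotation, which is the kind of smoothing that helps a low-precision format; when the outliers are laid out so that inner-dimension mixing does not suppress them, the abstract's Outlier Extraction path is used instead.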

2604.01878 2026-05-11 cs.LG cs.AI

ASPECT: Node-Level Adaptive Spectral Fusion for Graph Contrastive Learning

Zhuolong Li, Boxue Yang, Haopeng Chen

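AI summary This paper proposes ASPECT, a graph contrastive learning method that addresses the neglect of node-level spectral properties when low- and high-frequency views are fused in spectral graph contrastive learning. It learns a node-level adaptive spectral fusion policy so that different nodes can use different spectral mixtures, which improves the quality of the learned graph representations. The paper also introduces ASPECT-S, an extension that improves robustness through graph-structure and feature perturbations. Experiments on several benchmark datasets show gains over existing contrastive learning methods.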

Comments 28 pages, 3 figures. Revised version with updated method framing, improved exposition, and additional experiments

Details
English abstract

Spectral graph contrastive learning often constructs low- and high-frequency views to capture complementary graph signals, but these views are commonly combined by graph-level or node-agnostic fusion rules. We show that graph-level fusion can incur irreducible regret on mixed graphs with separated node-wise spectral preferences. Motivated by this result, we propose ASPECT, a spectral graph contrastive learning method that adaptively fuses low- and high-frequency views at the node level. ASPECT learns a node-wise spectral policy and regularizes it using channel-wise contrastive evidence, enabling different nodes to use different spectral mixtures. We further introduce ASPECT-S, an optional stability-aware extension that uses generated graph-structure and feature perturbations to obtain empirical channel-wise sensitivity estimates, together with a Rayleigh-based spectral search bias for producing informative perturbations. Experiments on homophilic and heterophilic benchmarks show that ASPECT improves representation quality over competitive spectral and graph contrastive baselines, while ASPECT-S further improves performance under joint graph-structure and feature perturbations.
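
The difference between graph-level and node-level fusion can be sketched as follows, assuming the fusion is a convex combination of the two views with one mixing weight per node. How ASPECT actually learns and regularizes those weights is not shown, and the function name is hypothetical.

```python
def fuse_node_views(z_low, z_high, alpha):
    """Mix low- and high-frequency embeddings with a per-node weight.

    z_low, z_high: one embedding (list of floats) per node.
    alpha: one weight in [0, 1] per node; 1.0 means "use only the low view".
    """
    fused = []
    for low, high, a in zip(z_low, z_high, alpha):
        fused.append([a * l + (1.0 - a) * h for l, h in zip(low, high)])
    return fused


z_low = [[1.0, 0.0], [1.0, 0.0]]
z_high = [[0.0, 1.0], [0.0, 1.0]]

# Node 0 relies on the low-frequency view, node 1 mostly on the high one.
fused = fuse_node_views(z_low, z_high, alpha=[1.0, 0.25])
```

A graph-level rule would force every entry of alpha to the same value, which is the restriction the abstract argues can incur irreducible regret on graphs whose nodes have separated spectral preferences.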

2604.01619 2026-05-11 cs.CV cs.AI

Automatic Image-Level Morphological Trait Annotation for Organismal Images

Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su

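AI summary This paper studies how to annotate morphological traits in organismal images automatically, addressing the low efficiency and high cost of manual annotation. The authors train sparse autoencoders on foundation-model features to obtain monosemantic, spatially grounded neurons, which they use to localize and describe salient morphological parts automatically. With this pipeline they build Bioscan-Traits, a dataset of 80K trait annotations, confirm the biological plausibility of the generated descriptions through human evaluation, and show the potential of the approach for ecological research.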

Comments ICLR 2026

Details
Journal ref
ICLR 2026
English abstract

Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.

2603.28113 2026-05-11 cs.LG

Demystifying Lipschitz verification: positive matrices, negative results

Simon Kuang, Yuezhu Xu, S. Sivaranjani, Xinfan Lin

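AI summary This paper examines the verification of a neural network's global Lipschitz constant, which is related to robustness and generalization but cannot be read directly from the parameters and has therefore motivated sophisticated verification algorithms. The paper argues that the difficulty is structural: estimating the constant requires knowing which hidden states are reachable, and reachability is NP-hard. It further shows that bounds based on semidefinite programming can inherit the same kind of conservatism as the trivial product bound, argues that optimizing or regularizing the trivial bound itself can mitigate such failures, and motivates a form of trigonometric layer that needs no biases.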

Comments reduced scope, new theorems on NP-hardness

Details
English abstract

The global Lipschitz constant of a neural network is related to robustness and generalization, yet unlike in many classical models, it is not plainly legible from the parameters. This has motivated sophisticated verification algorithms, especially semidefinite programming (SDP) based on incremental quadratic constraints on the activation functions, to improve on the fast but often loose product of layerwise Lipschitz constants (the trivial bound). We ask why Lipschitz verification is a problem in the first place. Our answer is that the difficulty is structural: estimating a network's Lipschitz constant requires knowing which hidden states are reachable, and reachability is NP-hard. If $P \neq NP$, then reachability is a barrier to any polynomial-time algorithm. Through explicit constructions, we show that this blindness can force SDP-based bounds to inherit the same qualitative failures as the trivial bound, including but not limited to polynomial per-layer conservatism. We show that the difficulties of NP-hard questions are not isolated to worst-case computational reductions, but actually afflict every instance of the verification problem. Thus SDP is not sufficient for Lipschitz verification. We also argue that it is not necessary: several apparent failures of the trivial bound arise from removable parameterization pathologies, and can be mitigated by optimizing or regularizing the trivial bound itself. We demonstrate this claim via a "spherical cow" linear model and numerical proofs of concept. While the main contribution is theoretical and negative, we finally motivate a novel form of trigonometric layers that do not need biases for universal approximation. Combined with trivial bound regularization, they make the trivial bound provably and practically tight.
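
For readers unfamiliar with the trivial bound, the sketch below computes it for a feed-forward network with 1-Lipschitz activations as the product of the layers' spectral norms, each estimated by power iteration. The choice of the l2 norm, the iteration count, and the function names are assumptions made for illustration.

```python
import math


def spectral_norm(w, iters=200):
    """Estimate the largest singular value of w by power iteration on w^T w."""
    rows, cols = len(w), len(w[0])
    # All-ones start; assumed not orthogonal to the top singular vector.
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v_new = [sum(w[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v_new]
    u = [sum(w[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))


def trivial_lipschitz_bound(weights):
    """Product of layerwise spectral norms (valid for 1-Lipschitz activations)."""
    bound = 1.0
    for w in weights:
        bound *= spectral_norm(w)
    return bound


layers = [[[2.0, 0.0], [0.0, 1.0]], [[3.0, 0.0], [0.0, 0.5]]]
bound = trivial_lipschitz_bound(layers)  # approximately 2 * 3 = 6
```

The bound treats each layer in isolation, so it cannot tell whether the directions one layer amplifies are the same ones the next layer amplifies; that is the source of the looseness the abstract discusses.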

2603.21824 2026-05-11 cs.CV cs.AI

SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis

Shuxian Zhao, Jie Gui, Baosheng Yu, Dacheng Tao

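AI summary SteelDefectX is a multi-form vision-language dataset and benchmark for steel surface defect analysis, intended to support fine-grained semantic understanding and systematic evaluation of vision-language models in industrial settings. It contains 7,778 images across 25 defect categories, with textual annotations from the class level down to the sample level, including free-form descriptions, structured attributes, and template-based sentences, which provide textual supervision at different levels of expressiveness and controllability. The authors also build a comprehensive benchmark covering classification, segmentation, and cross-dataset transfer, and report a trade-off between structured and free-form text in semantic alignment and transferability.

Details
English abstract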

Steel surface defect analysis is critical for industrial quality control, yet existing benchmarks rely primarily on label-only annotations, limiting fine-grained semantic understanding and systematic evaluation of vision-language models. To address this gap, we introduce SteelDefectX, a vision-language dataset with multi-form textual annotations for steel surface defect analysis, comprising 7,778 images across 25 defect categories. At the class level, the dataset provides defect names, representative visual attributes, and industrial causes. At the sample level, each image is annotated with three forms of textual representations: (1) free-form natural language descriptions, (2) structured attribute annotations, and (3) template-based sentences. These annotations provide flexible textual supervision with varying levels of expressiveness and controllability. We further establish a comprehensive benchmark covering vision-language classification, segmentation, and cross-dataset transfer, along with additional evaluations such as retrieval and text-guided localization. Experimental results reveal a trade-off between structure and flexibility in textual representations. Structured attributes provide more stable semantic alignment, while natural language descriptions improve transferability and fine-grained spatial grounding. These findings highlight the critical role of textual design in industrial vision-language learning. SteelDefectX provides a new benchmark for studying semantic alignment and generalization in industrial vision-language learning. The code and dataset are available at https://github.com/Zhaosxian/SteelDefectX.

2603.19966 2026-05-11 cs.RO

GustPilot: A Hierarchical DRL-INDI Framework for Wind-Resilient Quadrotor Navigation

Amir Atef Habel, Roohan Ahmed Khan, Fawad Mehboob, Clement Fortin, Dzmitry Tsetserukou

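AI summary This paper proposes GustPilot, a hierarchical navigation framework that combines deep reinforcement learning (DRL) with Incremental Nonlinear Dynamic Inversion (INDI) to improve autonomous quadrotor navigation under wind disturbances. The DRL policy generates velocity references in the inertial frame, while the INDI controller provides fast disturbance rejection and accurate tracking. Experiments under a range of wind conditions show higher success rates and better tracking accuracy, demonstrating robustness and generalization in complex environments.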

Comments 8 pages, 5 figures

Details
English abstract

Wind disturbances remain a key barrier to reliable autonomous navigation for lightweight quadrotors, where the rapidly varying airflow can destabilize both planning and tracking. This paper introduces GustPilot, a hierarchical wind-resilient navigation stack in which a deep reinforcement learning (DRL) policy generates inertial-frame velocity references for gate traversal, while a geometric Incremental Nonlinear Dynamic Inversion (INDI) controller provides low-level tracking with fast residual disturbance rejection. The INDI layer achieves this by providing incremental feedback on both specific linear acceleration and angular acceleration rate, using onboard sensor measurements to reject wind disturbances rapidly. Robustness is obtained through a two-level strategy: wind-aware planning learned via fan-jet domain randomization during training, and rapid execution-time disturbance rejection by the INDI tracking controller. We evaluate GustPilot in real flights on a 50 g quadcopter platform against a DRL-PID baseline across four scenarios ranging from no-wind to fully dynamic conditions with a moving gate and a moving disturbance source. Despite being trained only in a minimal single-gate and single-fan setup, the policy generalizes to significantly more complex environments (up to six gates and four fans) without retraining. Across 80 experiments, DRL-INDI achieves an average Overall Success Rate (OSR) of 94.7% versus 55.0% for DRL-PID, reduces tracking RMSE by up to 50%, and sustains speeds up to 1.34 m/s under wind disturbances up to 3.5 m/s. These results demonstrate that combining DRL-based velocity planning with structured INDI disturbance rejection provides a practical and generalizable approach to wind-resilient autonomous flight navigation.