arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 4060
2605.09862 2026-05-12 cs.LG cs.AI

UFO: A Unified Flow-Oriented Framework for Robust Continual Graph Learning

Danhui Zhang, Zhe Wang, Qing Qing, Jiarui Liu, Wentao Gao, Ziqi Xu, Mingliang Hou, Xikun Zhang, Renqiang Luo

AI总结 本文研究了鲁棒持续图学习问题,即在图数据不断演变且新加入部分常含噪声的场景下,如何同时应对灾难性遗忘和噪声监督的挑战。为此,作者提出了一个统一的流导向框架UFO,通过基于流模型的条件特征分布建模生成回放表示以缓解遗忘,并利用实例级可靠性评分区分噪声节点,从而减少噪声监督的影响。实验表明,UFO在多个基准图数据集上均优于现有方法,具有更高的准确性和更优的遗忘控制能力。

详情
英文摘要

Graph learning research has increasingly shifted toward continual graph learning (CGL), which better reflects real-world scenarios where graphs evolve over time. However, existing CGL methods largely assume clean supervision and overlook a critical challenge: the newly arriving portions of the graph are often noisy, due to annotation errors or adversarial corruption. This mismatch limits their applicability in practice. In this work, we study robust continual graph learning, where models must simultaneously handle catastrophic forgetting and noisy supervision in evolving graph data. We show that label noise introduces a new failure mode, catastrophic remembering, where models persistently reinforce corrupted knowledge across tasks. To address these challenges, we propose a Unified Flow-Oriented framework (UFO). First, UFO models conditional feature distributions via flow-based generative modeling and produces replay representations, mitigating forgetting without storing historical data. Second, UFO estimates instance-level reliability scores to distinguish clean from noisy nodes, reducing the impact of corrupted supervision and alleviating catastrophic remembering. Extensive experiments on four benchmark graph datasets under varying noise ratios demonstrate that UFO consistently outperforms existing methods in both accuracy and forgetting metrics. Code is available at: https://anonymous.4open.science/r/UFO.

2605.09861 2026-05-12 cs.LG cs.AI

Flag Varieties: A Geometric Framework for Deep Network Alignment

Jingchuan Xiao, Xinyi Sui, Cihan Ruan

AI总结 该论文研究深度神经网络中相邻权重矩阵的对齐现象,揭示其背后的几何结构。通过几何不变理论,作者证明对齐几何具有由标志流形(flag variety)定义的规范结构,并指出子空间交集维度是唯一的重参数化不变可观测量,从而将子空间度量从经验惯例提升为数学必然。研究还揭示了正则化与非线性激活对对齐过程的影响,并提供了无需前向传播即可分析网络内部对齐结构的新方法。

详情
英文摘要

Alignment, the tendency of adjacent weight matrices in deep networks to develop compatible subspace orientations, underlies gradient flow, Neural Collapse, and representation similarity across architectures. Despite extensive empirical documentation, these phenomena have resisted unified theoretical treatment: existing explanations are post-hoc, each fitted to a specific observation with whatever mathematics is at hand. We reverse this direction by deriving the mathematical structure that layerwise alignment inherently demands. Using geometric invariant theory, we prove that alignment geometry has a canonical closed, polystable stratum given by a flag variety, and that subspace intersection dimension is its unique reparameterization-invariant observable, establishing that subspace metrics are not empirical conventions but mathematical necessities. This unified framework yields two dynamical consequences: ridge regularization drives subspace alignment at an exponential rate set by weight decay, whereas nonlinear activations induce a commutator obstruction to exact basis alignment, generically present in nonlinear networks and absent in linear ones. Together these give a geometric explanation of the Level-2/3 hierarchy in Neural Collapse from first principles rather than post-hoc analysis. The commutator magnitude and head subspace overlap further serve as weight-space windows into internal alignment structure, requiring no forward passes. Experiments on multilayer perceptrons, residual networks, and pretrained language models support the proposed diagnostics and delineate their scope.

2605.09859 2026-05-12 cs.CV

Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval

Shijie Wang, Yadan Luo, Zijian Wang, Xin Yu, Zi Huang

AI总结 本文研究了细粒度图像检索中如何提升对未见类别的检索性能问题,提出了一种基于生成外观先验对齐的新型方法GAPan。该方法通过可逆密度模型重构学习目标,从类别预测转向外观建模,利用归一化流将特征映射到潜在密度空间,并通过类别条件高斯先验进行优化,从而保留更丰富的外观细节。通过反向采样生成外观感知的锚点,引导检索嵌入与类别特定的外观分布对齐,显著提升了模型在未见类别上的泛化能力。

详情
英文摘要

Fine-grained image retrieval (FGIR) typically relies on supervision from seen categories to learn discriminative embeddings for retrieving unseen categories. However, such supervision often biases retrieval models toward the semantics of seen categories rather than the underlying appearance characteristics that generalize across categories, thereby limiting retrieval performance on unseen categories. To tackle this, we propose GAPan, a Generative Appearance Prior alignment network that reformulates the learning objective from category prediction toward appearance modeling. Technically, GAPan treats retrieval features with an invertible density model based on normalizing flows. In the forward direction, the flow maps all instance features into a latent density space, where each seen category is modeled by a class-conditional Gaussian prior and optimized via exact likelihood estimation. This formulation preserves richer appearance details by leveraging the invertible property of the flows. In the reverse direction, samples from the high-density regions of these learned priors are mapped back to the feature space to produce appearance-aware anchors that reflect intra-category variation. These anchors supervise a prior-driven alignment objective that aligns retrieval embeddings with category-specific appearance distributions, thereby improving generalization to unseen categories. Evaluations demonstrate that our GAPan achieves state-of-the-art performance on both widely-used fine- and coarse-grained benchmarks.

2605.09858 2026-05-12 cs.CV

Clip-level Uncertainty and Temporal-aware Active Learning for End-to-End Multi-Object Tracking

Riku Inoue, Shogo Sato, Kazuhiko Murasaki, Tomoyasu Shimada, Toshihiko Nishimura, Ryuichi Tanida

AI总结 本文研究了动态环境下端到端多目标跟踪(MOT)中如何通过主动学习(AL)提升标注效率的问题。针对现有基于帧的AL方法与现代基于Transformer的端到端跟踪器在时间粒度上不匹配的问题,提出了一种基于片段(clip)的主动学习方法CUTAL,该方法通过多帧预测的不确定性度量评估每个片段的不确定性,并引入时间多样性约束以选择信息量大且冗余度低的片段。实验表明,CUTAL在相同标注预算下优于现有方法,并且在仅使用50%标注数据时即可达到接近全监督的跟踪性能。

Comments Accepted to 2026 IEEE International Conference on Image Processing (ICIP). Copyright 2026 IEEE. Published in 2026 IEEE International Conference on Image Processing (ICIP), scheduled for 13-17 September 2026 in Tampere, Finland

详情
英文摘要

Multi-Object Tracking (MOT) in dynamic environments relies on robust temporal reasoning to maintain consistent object identities over time. Transformer-based end-to-end MOT models achieve strong performance by explicitly modeling temporal dependencies, yet training them requires extensive bounding-box and identity annotations. Given the high labeling cost and strong redundancy in videos, Active Learning (AL) is an effective approach to improve annotation efficiency. However, existing AL methods for MOT primarily operate at the frame level, which is structurally misaligned with modern end-to-end trackers whose inference and training rely on multi-frame clips. To bridge this gap, we formulate clip-level active learning and propose Clip-level Uncertainty and Temporal-aware Active Learning (CUTAL). In contrast to frame-based approaches, CUTAL scores each clip using uncertainty metrics derived from multi-frame predictions to capture inter-frame correspondence ambiguities, while enforcing temporal diversity to select an informative and non-redundant subset. Experiments show that CUTAL achieves stronger overall performance than baselines at the same label budgets across MeMOTR and SambaMOTR. Notably, CUTAL achieves performance comparable to full supervision for MeMOTR on both datasets using only 50% of the labeled training data.

2605.09856 2026-05-12 cs.CV cs.AI

MoPO: Incorporating Motion Prior for Occluded Human Mesh Recovery

Tao Tang, Hong Liu, Xinshun Wang, Wanruo Zhang

AI总结 尽管近期在人体网格恢复方面取得了显著进展,但在面对遮挡时仍表现出鲁棒性不足,常导致姿态估计不准确和运动抖动。本文提出MoPO方法,通过引入运动先验来提升遮挡人体网格恢复的效果。MoPO包含运动去遮挡模块和运动感知融合与优化模块,前者利用历史姿态预测遮挡关节位置,后者结合图像特征与预测姿态进行人体形状和姿态估计,并通过逆运动学进一步优化最终姿态,显著提升了遮挡场景下人体网格恢复的精度和时序一致性。

Comments 35 pages

详情
英文摘要

Although recent studies have made remarkable progress in human mesh recovery, they still exhibit limited robustness to occlusions and often produce inaccurate poses and severe motion jitter due to the insufficient spatial features for occluded body parts. Inspired by the rapid advancements in human motion prediction, we discover that compared to occluded image features, pose sequence inherently contains reliable motion prior for estimating occluded body parts. In this paper, we incorporate Motion Prior for Occluded human mesh recovery, called MoPO. Our MoPO mainly consists of two components: 1) The motion de-occlusion module, where we propose a spatial-temporal occlusion detector to detect joint visibility, and then we propose a lightweight motion predictor to complete the occluded body parts by predicting the most plausible joint positions based on history poses. 2) The motion-aware fusion and refinement module, which fuses the completed joint sequence with image features to estimate human shape and initial human pose. Moreover, the completed joint sequence is further used to refine the final human pose through inverse kinematics, which provides the occlusion-free motion prior for regressing human poses. Extensive experiments demonstrate that MoPO achieves state-of-the-art performance on both occlusion-specific and standard benchmarks, significantly enhancing the accuracy and temporal consistency of occluded human mesh recovery. Our code and demo can be found in the supplementary material.

2605.09853 2026-05-12 cs.LG

Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

Changhao Li, Yuchen Zhuang, Chenxiao Gao, Haotian Sun, Rushi Qiang, Chao Zhang, Bo Dai

AI总结 该研究针对大语言模型在推理阶段的推理能力和多样性之间的矛盾,提出了一种探索驱动优化(EDO)方法,通过将奖励偏差探索目标引入迭代后训练过程,提升模型在推理时的解题多样性与推理能力。实验表明,EDO有效增强了iDPO和GRPO等方法的性能,在多个基准任务中取得了显著的准确率提升,并有助于保持模型熵值和训练稳定性,为测试时推理优化提供了实用框架。

Comments Accepted by TMLR 2026

详情
英文摘要

Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3\% improvement over the strongest baselines, and delivers an additional 1.5\% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.

2605.09852 2026-05-12 cs.AI cs.CE cs.CY cs.LG

Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

Gideon Popoola, John Sheppard

AI总结 该论文探讨了人工智能中解释的公平性问题,指出当前算法公平性与可解释AI(XAI)研究虽各自独立,却忽略了模型在输出满足公平性标准的同时,其推理过程可能存在深层次的不公平现象,即“过程偏差”。为此,作者提出了条件不变性框架,将解释公平性形式化为对保护属性的条件独立性要求,并构建了七维分类体系及六步评估流程,为负责任AI的发展提供了理论基础与实践指导。

Comments 53 pages, 1 figure

详情
英文摘要

Machine learning algorithms are being used in high-stakes decisions, including those in criminal justice, healthcare, credit, and employment. The research community has responded with two largely independent research fields: \emph{algorithmic fairness}, which targets equitable outcomes, and \emph{explainable AI} (XAI), which targets interpretable reasoning. This survey identifies and maps a novel blind spot at their intersection, which is a model that can satisfy every standard fairness criterion in its outputs while being profoundly unfair in its \emph{reasoning process}. We refer to this as the procedural bias, and mitigating it requires treating the fairness of explanations as a distinct object of scientific study. To our knowledge, we provide the first unified theoretical and literature review of this emerging field and elucidate the drawbacks of post-hoc explainers in certifying explanation fairness. Our central contribution is a \emph{conditional invariance framework} formalizing explanation fairness as the requirement that explanations should be indifferent regardless of the protected attributes $ P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = a) = P(E(X) \in \cdot \mid X_\text{rel} = x_\text{rel},\, A = b)$ for all task-relevant $x$, a single principle from which all existing explanation fairness metrics emerge as partial operationalizations. We introduce a seven-dimensional taxonomy, identify three generative mechanisms of explanation inequity (representation-driven, explanation-model mismatch, actionability-driven), and propose a canonical six-step evaluation workflow for operationalizing explanation fairness audits in practice.

2605.09850 2026-05-12 cs.CV cs.AI

Probing Routing-Conditional Calibration in Attention-Residual Transformers

Wenhao Liang, Lin Yue, Wei Emma Zhang, Miao Xu, Mingyu Guo, Olaf Maennel, Weitong Chen

AI总结 本文研究了在注意力残差变换器(Attention-Residual Transformers)中,路由信息对模型校准的影响。通过设计匹配置信度的诊断实验,作者发现路由摘要无法提供稳定的路由条件下的校准证据,且基于路由深度的校准方法在多个评估指标上表现并不优于仅基于置信度的模型。实验表明,所谓的路由感知校准提升可能是由其他因素引起的,需在控制匹配置信度、带宽、模型容量和排列等因素后,才能确认是否为内部状态校准的真正提升。

Comments Under reviewing

详情
英文摘要

Post-hoc calibration is usually evaluated as a function of logits or softmax confidence alone, even as routing-augmented architectures increasingly accompany predictions with sample-specific internal routing traces and pair them with claims of calibration-relevant uncertainty. We ask a basic question: do these traces provide stable routing-specific evidence for post-hoc calibration beyond confidence? We study this in Attention-Residual transformers (Kimi Team, 2026) through a matched-confidence diagnostic suite that stratifies examples by routing-derived state, compares subgroup gaps against within-bin routing-permutation nulls, and evaluates matched post-hoc probes differing only in their auxiliary feature. Across our completed AR runs, scalar routing summaries do not provide stable evidence of routing-conditional miscalibration: weighted gaps remain small or seed-sensitive, and only $1$ of $30$ within-bin permutation tests rejects the conditional-null at $α=0.05$ (only on one seed; not stable across seeds in that cell). AR-CondCal, a minimal $2$-D Nadaraya--Watson probe on confidence and routing-depth variance, lies within the seed-variance band of matched confidence-only and predictive-entropy controls and does not reliably improve worst-routing-tertile ECE; bandwidth-sensitivity checks (Scott multiples, CV-NLL, global-ECE oracle) do not change this. A full-vector MLP over $(c, H_1, \ldots, H_L)$ can appear to improve over a linear confidence baseline, but the apparent gain disappears once a capacity-matched confidence-only MLP is included as a control, and shuffled routing profiles achieve comparable performance. Apparent routing-aware calibration gains in this AR setting should not be read as internal-state calibration until matched-confidence, bandwidth, capacity, and permutation controls rule out common confounds.

2605.09848 2026-05-12 cs.LG

Efficient Neural Architectures for Real-Time ECG Interpretation on Limited Hardware

Ashery Mbilinyi, Callum O'Riley, Julia Handra, Ashley Moller-Hansen, Jason Andrade, Marc Deyell, Cameron Hague, Nathaniel Hawkins, Kendall Ho, Jonathan Leipsic, Roger Tam

AI总结 本文研究了在有限硬件上实现实时心电图(ECG)解读的高效神经网络架构。通过对比现有模型,作者提出了三种轻量级CNN模型,旨在平衡诊断准确率与计算效率。实验表明,这些模型在多个公开ECG数据集上表现优异,并引入统一效率评分体系,为心血管领域AI系统的部署提供了可扩展的解决方案。

Comments 9 pages, 6 figures, 3 tables. Published in: 2025 IEEE International Conference on Big Data (BigData), pp. 3275-3284. DOI: 10.1109/BIGDATA66926.2025.11402097

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData), pp. 3275-3284
英文摘要

Electrocardiogram (ECG) interpretation is essential for diagnosing a wide range of cardiac abnormalities. While deep learning has shown strong potential for automating ECG classification, many existing models rely on large, computationally intensive architectures that hinder practical deployment. In this paper, we present an empirical study of convolutional neural network (CNN) architectures, exploring tradeoffs between diagnostic accuracy and computational efficiency. We benchmark two established baselines: AttiaNet, a compact model composed of sequential temporal and spatial blocks, and DeepResidualCNN, the winning architecture of the 2021 PhysioNet/Computing in Cardiology Challenge. Building on these, we propose three lightweight models: (i) ParallelCNN, which employs dual temporal and spatial branches for parallel pattern extraction; (ii) ParallelCNNew, a variant with symmetric weight initialization for balanced feature learning; and (iii) SimpleNet, a streamlined architecture that jointly processes temporal and spatial dimensions. Our experiments span three publicly available 12-lead ECG datasets from Germany, China, and the United States, covering binary, multiclass, and multilabel classification tasks across diverse patient populations. We further evaluate the impact of integrating low-cost demographic metadata (age and sex) to improve performance with minimal overhead. To ensure fair comparison, we introduce a unified Efficiency Score that integrates model size, inference speed, memory usage, and AUC performance. By balancing diagnostic performance and efficiency, our models offer a scalable and viable foundation for next-generation AI systems in cardiovascular care.

2605.09846 2026-05-12 cs.SD cs.AI

ChladniSonify: A Visual-Acoustic Mapping Method for Chladni Patterns in New Media Art Creation

Yakun Liu, Hai Luan, Dong Liu, Zhiyu Jin

AI总结 在新媒体艺术创作中,视觉与听觉的映射往往具有主观性。本文提出了一种实时的视觉-听觉映射方法 ChladniSonify,用于生成克拉尼图案(Chladni patterns)的声学映射。该方法基于Kirchhoff-Love板理论构建数据集,并采用轻量级CNN结合CBAM模块实现高精度、低延迟的图案分类,最终在Python和Max/MSP中搭建了端到端系统,将识别出的图案映射到对应的正弦波频率,实现了零偏差的理论频率匹配与实时交互。

Comments 9 pages, 5 figures, IEEE conference format

详情
英文摘要

In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers for simulation, offline computing failing real-time interaction, and uncontrollable mapping rules in general sonification tools. To address these, this paper proposes ChladniSonify, a real-time visual-acoustic mapping method for Chladni patterns. Based on Kirchhoff-Love plate theory, we build a paired dataset via numerical programming and calibrate it using ANSYS finite element simulation. Focusing on the slender nodal lines of Chladni patterns, we adopt a lightweight CNN with CBAM to achieve high-precision, low-latency pattern classification. Finally, we build an end-to-end system in Python and Max/MSP, mapping recognized patterns to corresponding sine wave frequencies. Results show the system has excellent usability: the classification module achieves 99.33% accuracy on the test set with 7.03 ms inference latency; the mapped frequency matches the theoretical value with zero deviation; the average end-to-end latency is under 50 ms, meeting real-time interactive needs. This work provides a reproducible engineering prototype for Chladni audio-visual art creation.

2605.09845 2026-05-12 cs.LG

Sub-Footprint Effect Correction in FW-LiDAR Point Clouds via Intra-Footprint Target Unmixing

Zhen Xiao, Yanfeng Gu, Xian Li

AI总结 本文研究了全波形激光雷达(FW-LiDAR)点云中子光斑目标混合导致的强度不确定性问题,提出了一种基于物理的框架,通过显式建模光斑内部多目标的混合过程,实现子光斑级别的强度校正。该方法结合波形参数和地表几何信息,将混合过程转化为逆向解混问题,从而分离出每个光斑内不同子目标的贡献,并恢复出更准确的强度信息。实验表明,该方法有效提升了异质目标的语义可分性和同质目标的强度一致性。

Comments 11 pages,7 figures

详情
英文摘要

Sub-footprint target mixing within a laser footprint significantly increases LiDAR intensity uncertainty, especially in complex environments where heterogeneous materials inside one footprint cause nonlinear distortions that impair intensity-based applications. However, the forward mixing inherent to the single-pixel detection mode of LiDAR systems blurs sub-footprint contributions, making sub-footprint effects difficult to address effectively in existing studies. To address this issue, we introduce a novel, physics-based framework that explicitly resolves sub-footprint intensity correction in full-waveform LiDAR (FW-LiDAR) point clouds. The key innovation is to make the otherwise implicit intra-footprint mixing process explicit: we first develop a spatiotemporal laser-beam distribution model to physically characterize within-footprint forward mixing of multi-target returns. Building on this formulation, we incorporate ancillary information including waveform parameters and surface geometry as constraints to pose a well-defined inverse unmixing problem and decompose each footprint into fractional contributions from multiple sub-targets. We then recover sub-footprint-corrected intensities by inverting the observed mixtures through a unified combination of parametric and model-driven approaches. To the best of our knowledge, few prior studies explicitly establish sub-footprint inversion and correction within a single laser footprint, and our framework offers a principled, physics-grounded solution. Experiments on both controlled and real-world LiDAR datasets demonstrate that the proposed method significantly enhances semantic separability across heterogeneous targets and intensity consistency across homogeneous targets.

2605.09844 2026-05-12 cs.AI cs.CL cs.LG

The Metacognitive Probe: Five Behavioural Calibration Diagnostics for LLMs

Rafael C. T. Oliveira

AI总结 该研究提出了一种名为“元认知探针”的诊断工具,用于评估大型语言模型(LLM)在自信行为上的表现,将其分解为五个行为维度,包括置信度校准、知识边界识别等。该工具在多个前沿模型和人类被试上进行了验证,揭示了模型在不同任务中的自信与正确性对齐情况,发现了模型在整体表现良好时仍可能存在局部过度自信的问题。研究在Gemini 2.5 Flash模型中观察到了显著的内部行为差异,突显了模型在不同任务间自信判断能力的不一致性。

Comments 27 pages, 13 tables. Code, data, prompts, and rubrics released with the paper. OSF deposit pending; DOI in v2

详情
英文摘要

The Metacognitive Probe is an exploratory five-task, 15-slot diagnostic that decomposes an LLM's confidence behaviour into five behaviourally-distinct dimensions: confidence calibration (T1-CC), epistemic vigilance (T2-EV), knowledge boundary (T3-KB), calibration range (T4-CR), and reasoning-chain validation (T5-RCV). It is evaluated on N=8 frontier models and N=69 humans. The instrument is motivated by Flavell (1979) and Nelson and Narens (1990) but operates on observable confidence-correctness alignment; it is not a validated cross-species metacognition scale, and the pre-specified human developmental hypothesis was falsified. Composite benchmarks (MMLU, BIG-Bench, HELM, GPQA) ask whether a model produces a correct response. They are silent on whether the model knows when its response is wrong. A model can score 80 on a composite calibration benchmark and still be wildly overconfident in narrow pockets the aggregate cannot surface. The Metacognitive Probe surfaces those pockets. Our headline is a 47-point within-model dissociation in Gemini 2.5 Flash: panel-best within-task calibration (T1-CC = 88; Spearman rho = +0.551, 95% CI [+0.14, +0.80], p = 0.005) and panel-worst cross-task difficulty prediction (T4-CR = 41; sigma_conf = 1.4 across twelve factoids).

2605.09842 2026-05-12 cs.AI

Yield Curve Forecasting using Machine Learning and Econometrics: A Comparative Analysis

Aman Singh, Tokunbo Ogunfunmi, Sanjiv Das

AI总结 本文比较了计量经济学、经典机器学习和深度学习方法在预测美国国债收益率曲线方面的性能,使用了长达47年的每日数据。研究发现,传统计量经济模型如ARIMA在大多数情况下表现最佳,而时间序列深度学习模型如TimeGPT、LGBM和RNN也表现出色。此外,论文还探讨了平稳或非平稳数据作为深度学习模型输入的适用性问题。

Comments 18 pages, 12 figures, comparative study of econometric, machine learning, and deep learning methods for U.S. Treasury yield curve forecasting

详情
Journal ref
Journal of Investment Management, vol. 23, no. 4, Fourth Quarter 2025
英文摘要

While machine learning has revolutionized many fields such as natural language processing (NLP) and computer vision, its impact on time-series forecasting is still widely disputed, especially in the finance domain. This paper compares forecasting performance on U.S. Treasury yield curve data across econometrics/time-series analysis, classical machine learning, and deep learning methods, using daily data over 47 years. The Treasury yield curve is important because it is widely used by every participant in the bond markets, which are larger than equity markets. We examine a variety of methods that have not been tested on yield curve forecasting, especially deep learning algorithms. The algorithms include the Autoregressive Integrated Moving Average (ARIMA) model and its extensions, naive benchmarks, ensemble methods, Recurrent Neural Networks (RNNs), and multiple transformers built for forecasting. ARIMA and naive econometric models outperform other models overall, except in one time block. Of the machine learning methods, TimeGPT, LGBM and RNNs perform the best. Furthermore, the paper explores whether stationary or nonstationary data are more appropriate as input to deep learning models.

2605.09839 2026-05-12 cs.LG cs.AI

Free Energy Manifold: Score-Based Inference for Hybrid Bayesian Networks

Cheol Young Park, Shou Matsumoto

AI总结 本文提出了一种名为自由能流形(Free Energy Manifold, FEM)的条件能量模型,专门用于含有离散和连续变量的混合贝叶斯网络中的推理任务。FEM 通过学习离散父节点的嵌入和连续观测值的能量景观,实现了对后验分布的评估、生成采样以及多连续叶节点的组合推理。研究还发现传统条件能量模型在类内模式之间可能产生低能量脊,导致对非数据点的过自信后验,并提出山谷正则化方法以修正这一问题,实验表明 FEM 在多模态和组合推理任务中优于经典方法和普通条件能量模型。

详情
英文摘要

We introduce the Free Energy Manifold (FEM), a score-trained conditional energy model specialized for inference in hybrid Bayesian networks with discrete and continuous variables. FEM represents each conditional factor as an energy landscape over learned discrete-parent embeddings and continuous observations, enabling posterior evaluation, generative sampling, and compositional inference across multiple continuous leaves by energy addition under conditional independence. A central finding is the mode-bridge artifact: standard conditional energy models can create low-energy ridges between separated modes of the same class, producing overconfident posteriors at off-data interior points. We analyze this failure and propose valley regularization, an off-data calibration term that restores near-uniform posteriors in such regions while preserving in-data fit. Across synthetic multimodal hybrid-BN benchmarks, FEM substantially reduces KL divergence relative to classical baselines and a vanilla conditional EBM, including large gains at mode-bridge midpoint queries and in multi-leaf evidence composition. We also evaluate high-cardinality discrete-parent settings and a UCI Breast Cancer sanity check, showing that FEM is most useful when multimodal or compositional Bayesian-network inference is required, while discriminative classifiers remain preferable for closed-world classification tasks.

2605.09838 2026-05-12 cs.CL cs.LG

The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care

Douglas K. Faust, Peter Awad, Alexandre Vaz, Tony Rousmaniere

AI总结 该研究探讨了基于Transformer架构的情感分析模型在心理治疗常规护理中对患者症状困扰和恶化程度的关联性。研究通过分析大量心理治疗会话数据,提取了话语级和会话级的情感特征,并发现这些特征与OQ-45心理测量工具的多个维度,尤其是情绪价值相关指标存在显著相关性。此外,研究还表明,被标记为有恶化或退出风险的患者在情感分布上存在统计学上的显著差异,表明所提出的情感特征可作为评估患者心理状态的辅助指标。

Comments 20 pages, 4 figures

详情
Journal ref
(2026) Front. Digit. Health 8:1792536
英文摘要

Sentiment analysis has been of long-standing interest in psychotherapy research. Recently, the Transformer deep learning architecture has produced text-based sentiment analysis models that are highly accurate and context-aware. These models have been explored as proxies for emotion measurement instruments in psychotherapy, but not investigated as stand-alone psychometric tools. Using proposed utterance-level and session-level sentiment features derived from a fine-grained sentiment model on a large corpus of psychotherapy sessions (N = 751), we investigate the distribution of session aggregated sentiment scores. Further, we characterize the relationship of these features to individual components and the overall score of the OQ-45 instrument and find that this sentiment feature is most strongly correlated to components related to emotional valence in directionally intuitive ways. Finally, we report that there are statistically significant differences between the sentiment distributions for patients flagged as at risk of deterioration or dropping out of care via either the OQ Rational or Empirical outcome models. These correlations to a fully-validated psychometric instrument demonstrate that these proposed sentiment features are, at least, adjunctive measures of client distress and deterioration.

2605.09832 2026-05-12 cs.LG

Modeling Atomic Conformational Ensembles of Proteins via Test-Time Supervision of Boltz-2 on Cryo-EM Density Maps

Jay Shenoy, Miro Astore, Axel Levy, Frédéric Poitevin, Sonya M. Hanson, Gordon Wetzstein

AI总结 该研究旨在解决蛋白质原子构象集合预测中的数据稀缺问题,提出了一种无需传统两阶段训练流程的方法,直接在原始冷冻电镜(cryo-EM)密度图上微调预训练的静态结构预测模型Boltz-2,从而生成高精度的原子构象。该方法命名为CryoSampler,不仅在模型构建准确性上优于现有方法,还展示了在相同蛋白家族中对未见序列的跨样本泛化能力,为基于原始cryo-EM数据训练下一代构象预测模型提供了新思路。

Comments Project page: https://jayshenoy.com/cryosampler

详情
英文摘要

Knowledge of a protein's atomic conformational ensemble is critical to determining its function, yet state-of-the-art ensemble prediction models are limited by lack of high-quality conformational data from simulation or experiment. Recent advances in heterogeneous reconstruction for cryo-electron microscopy (cryo-EM) have enabled scientists to visualize ensembles of density maps for larger proteins and complexes not typically accessible through simulation, but building atomic models into these maps remains a challenge. Traditionally, ensemble prediction models are trained via a two-stage process: experimental density maps are converted into atomic structural ensembles through model building, after which these structures are used to train sequence-to-atomic ensemble predictors. In this work, we propose a new principle for fine-tuning pre-trained static structure prediction models such as Boltz-2 directly on raw cryo-EM maps, bypassing the two-stage process. We apply this technique to the problem of atomic model building by fine-tuning Boltz-2 to generate atomic conformations from an input ensemble of cryo-EM maps, achieving superior model building accuracy compared to prior work. Beyond overfitting to individual map ensembles, our method, CryoSampler, also shows preliminary evidence of in-domain generalization after fine-tuning, sampling diverse atomic conformations for an unseen sequences within the same protein family without requiring cryo-EM data. These capabilities indicate that CryoSampler holds the potential to train next-generation atomic ensemble prediction models directly on raw cryo-EM measurements.

2605.09827 2026-05-12 cs.CV cs.AI

Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction

Anushree Berlia

AI总结 本文提出 Fashion Florence,一种基于 Florence-2 的视觉语言模型,通过 LoRA 微调技术实现对服装图像结构化属性的提取。该模型能够从单张服装照片中生成包含类别、颜色、材质、风格标签和场合标签的 JSON 格式输出,适用于推荐系统等下游任务。实验表明,Fashion Florence 在多个指标上优于 GPT-4o-mini 和 Gemini 2.5 Flash,且在单个 GPU 上运行时参数量仅为 0.77B,推理成本接近于零。

Comments Model: https://huggingface.co/anushreeberlia/fashion-florence

详情
英文摘要

We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags, structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), where we collapse fine-grained annotations into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers, training for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Fashion Florence produces valid JSON in 99.8% of outputs while running at 0.77B parameters on a single GPU at zero marginal inference cost. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.

2605.09820 2026-05-12 cs.LG

Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

Bian Sun, Kevin Zhai, Mubarak Shah, Zhenyi Wang

AI总结 本文提出了一种基于贝叶斯推理的动态结构化扩散语言模型解码方法Dystruct,旨在解决现有扩散语言模型在生成长度固定、灵活性不足的问题。该方法无需额外训练,通过将可变长度生成建模为动态结构推理问题,联合优化生成长度、块边界和解码计划,从而实现灵活的块扩展与组织,同时保持生成内容的一致性。实验表明,该方法在多个基准上显著提升了生成质量与灵活性,为结构化文本生成提供了原理清晰且高效的解决方案。

详情
英文摘要

Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive models, primarily due to their ability to enable parallel decoding. Despite this advantage, most existing DLMs rely on a fixed generation length specified prior to decoding, which restricts their flexibility in real-world applications. While a few recent works attempt to support flexible-length generation, they typically suffer from notable limitations: some require costly retraining to accommodate variable-length outputs, while others depend solely on local confidence signals during decoding. Such local criteria fail to capture the evolving structure of the sequence, often resulting in suboptimal generation quality. In this paper, we propose a training-free, Bayesian structured decoding framework that formulates flexible-length generation as a dynamic structural inference problem. Our approach formulates flexible-length generation as a dynamic structural inference problem, jointly computing the expansion length, the block boundaries, and the decoding schedule. At each window expansion step, the method integrates local uncertainty with structural signals via a unified mechanism that supports dynamic structured generation, including both flexible block expansion and block organization, while maintaining coherence. Extensive experiments across multiple benchmarks demonstrate that our approach significantly improves generation quality and flexibility over existing fixed-length and flexible-length baselines. These results highlight the advantage of Bayesian structured decoding for diffusion language model, providing a principled and efficient solution for structured text generation.

2605.09818 2026-05-12 cs.LG

Learning to Compress Time-to-Control: A Reinforcement Learning Framework for Chronic Disease Management

Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son

AI总结 该研究提出了一种基于强化学习的慢性病管理框架,旨在通过压缩疾病控制时间(TTC)来优化长期治疗效果。研究引入了两个关键结构要素——执行强度和临床能力权重,将偏好学习与强化学习结合,构建了双循环架构,以应对医疗强化学习中奖励稀疏和策略评估不稳定等问题。实验表明,该方法在糖尿病等慢性病的模拟环境中显著优于传统方法,具有更好的跨场景泛化能力。

Comments 26 pages, 3 figures

详情
英文摘要

Reinforcement learning (RL) in healthcare has had mixed results, with reward sparsity, unreliable off-policy evaluation, and deployment-simulation gap as recurring failure modes. We argue that chronic disease management is structurally a more tractable RL setting than the acute-care problems the field has primarily studied, but only if the problem is formalized to exploit chronic care's properties. We propose such a formalization. The agent's objective is to compress time-to-control (TTC) under a tiered reward calibrated to the CMS ACCESS Model. Two quantities from our companion preference-learning paper [Singh et al. 2026] enter as load-bearing structural elements: the execution intensity εbounds action availability under a constrained Markov Decision Process, and the clinician capability κweights offline-data transitions during RL training. Together they couple preference learning and RL into a two-loop architecture. We present simulation results on synthetic state machines for hypertension and type 2 diabetes. Capability-weighted offline RL outperforms uniform-weighted offline RL and the behavior policy by 15 percentage points on T2D TTC; the uniform-weighted formulation (the standard in existing healthcare RL) underperforms even the heterogeneous behavior policy. \Epsilon-aware policies generalize across deployment regimes while ε-naive policies do not.

2605.09811 2026-05-12 cs.RO

Above and Below: Heterogeneous Multi-robot SLAM Across Surface and Underwater Domains

John McConnell, Armon Shariati, Paul Szenher, Yaxuan Li

AI总结 本文研究了水面无人船(USV)与水下自主水下机器人(AUV)之间的异构多机器人同步定位与建图(SLAM)问题。传统方法依赖声学测距,受限于环境干扰和同步要求,本文提出一种基于视觉回环检测的集中式多机器人SLAM系统,通过融合USV与AUV的感知数据实现状态估计的协同优化。实验表明,该方法在多机器人协作场景下显著提升了AUV的定位精度,是首个基于回环检测而非声学测距的异构多机器人SLAM系统。

详情
英文摘要

Multi-robot simultaneous localization and mapping (SLAM) is a fundamental task in multi-robot operations. Robots must have a common understanding of their location and that of their team members to complete coordinated actions. However, multi-robot SLAM between Uncrewed Surface Vessels (USVs) and Autonomous Underwater Vehicles (AUVs) has primarily been achieved through acoustic pinging between robots to retrieve range measurements; a measurement technique requires that robots to be in similar locations simultaneously, have an uninterrupted path for signal propagation, and may necessitate synchronized clocks. This is especially challenging in complex, cluttered maritime environments, where structures may impede signals. However, these same structures may be observable above and below the water's surface, presenting an opportunity for inter-robot SLAM loop closure between USV and AUV data streams. This work builds upon recent research on inter-robot SLAM loop closure between USV and AUV data, extending it to propose a centralized multi-robot SLAM system. Each robot performs its state estimation, and we detect loop closures between each AUV and the USV data. These inter-robot loop closures are used to merge each robot's state estimate into a centralized graph, yielding estimates for the whole time history of the USV and all AUVs in the system. Validation is performed using real-world perceptual data in three different environments. Results show improved errors for AUVs in the multi-robot SLAM system compared to single-robot SLAM over the same trajectories. To our knowledge, this is the first instance of a multi-robot SLAM system with AUVs and USVs built on loop closures rather than acoustic distance measurements.

2605.09808 2026-05-12 cs.CL

Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

Joseph Suh, Ayush Raj, Minwoo Kang, Serina Chang

AI总结 本文研究了用户模拟器在构建协作式大语言模型助手中的效用评估问题,提出通过助手在真实环境中与人类交互的表现来衡量模拟器质量。通过对比不同用户模拟器(包括基于角色扮演的LLM和基于真实对话数据微调的模拟器)训练出的助手性能,实验表明基于真实数据微调的模拟器能显著提升助手表现,而基于角色扮演的模拟器即使经过优化也难以缩小差距。研究进一步揭示了模拟器模型规模、真实性增强方法等对训练效果的影响,强调应以实际用户表现作为评估用户模拟器质量的核心标准。

详情
英文摘要

User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human--AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.

2605.09806 2026-05-12 cs.LG cs.AI

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

Songtao Wei, Yi Li, Zhikai Li, Xu Hu, Yuede Ji, Guanpeng Li, Feng Chen, Carl Yang, Zhichun Guo, Bingzhe Li

AI总结 本文提出了一种名为LEAD的方法,旨在解决大型语言模型在推理过程中输出冗长、效率低下的问题。LEAD通过引入在线自适应机制,动态调整正确性与效率之间的平衡,并根据模型自身的正确推理结果估计每道题的适配长度,从而在保证准确性的同时显著压缩输出长度。实验表明,LEAD在多个数学推理基准测试中取得了最高的准确率和效率综合评分。

详情
英文摘要

Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed what the underlying problems require, wasting compute, latency, and context budgets. While introducing length-based efficiency rewards during reinforcement learning offers a natural remedy, existing methods struggle with two fundamental challenges: the optimal balance between correctness and efficiency is non-stationary throughout training, and intrinsic reasoning budgets vary drastically across problems. Relying on static reward weights and global length constraints inevitably forces a compromise between degraded accuracy and unrealized compression. To overcome these limitations, we propose LEAD (Length-Efficient Adaptive and Dynamic reasoning), a method that replaces static heuristics with online, self-adaptive mechanisms. LEAD dynamically calibrates the correctness-efficiency trade-off at each step using a Potential-Scaled Instability, directing optimization capacity to the most informative learning signal. Furthermore, it estimates an adaptive per-problem target length online based on the model's own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on five mathematical reasoning benchmarks, LEAD achieves the highest accuracy and Accuracy-Efficiency Score among RL-trained efficient-reasoning methods while producing substantially shorter outputs than the base model.

2605.09802 2026-05-12 cs.CV cs.AI cs.LG

CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

Zhipeng Liu, Chunbo Luo

AI总结 本文研究了跨视角(如地面与空中)场景下视觉-语言模型(VLM)的目标检测性能下降问题,提出了CrossVL框架,结合复杂度感知的特征路由机制和成对课程学习策略,以增强模型对不同视角图像的适应能力。该方法通过估计场景复杂度并动态路由视觉特征,以及利用同步地面-空中图像对的语义一致性进行渐进式训练,有效提升了检测精度和稳定性。实验表明,CrossVL在MAVREC数据集上显著提升了检测性能并缩小了不同视角间的性能差距。

Comments Accepted to CVPR 2026. Code available at https://github.com/1nyourlife/Crossvl_cvpr2026

详情
英文摘要

Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) for enhanced cross-view detection for VLM. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.

2605.09801 2026-05-12 cs.RO

Efficient Multi-Robot Motion Planning with Precomputed Translation-Invariant Edge Bundles

Himanshu Gupta, Paul Motter, Aritra Chakrabarty, Rishabh Sodani, Srikrishna Bangalore Raghu, Alessandro Roncone, Bradley Hayes, Zachary Sunberg

AI总结 本文提出了一种名为KiTE-Extend的高效多机器人运动规划方法,通过预计算的平移不变轨迹段库来指导在线规划中的动作选择,从而提升现有规划器在生成无碰撞、动力学可行轨迹方面的能力。该方法不改变原有规划器的状态传播、碰撞检测和代价评估机制,同时保持其理论保证。实验表明,KiTE-Extend在多机器人场景中显著提升了规划效率和可扩展性,尤其在集中式、优先级和冲突基于的三种主流多机器人规划范式中表现突出。

详情
英文摘要

Solving multi-robot motion planning (MRMP) requires generating collision-free kinodynamically feasible trajectories for multiple interacting robots. We introduce Kinodynamic Translation-Invariant Edge Bundles or KiTE-Extend, a planner-agnostic action selection mechanism for sampling-based kinodynamic motion planning. KiTE-Extend uses a library of trajectory segments computed offline to guide action selection during online planning, improving the ability of existing planners to identify feasible motion segments without altering state propagation, collision checking, or cost evaluation, and without changing their theoretical guarantees. While KiTE-Extend can modestly improve single-agent planners, its benefits are most clear in the multi-agent setting, where it is able to explore more effectively and significantly improve planning through the dense spatiotemporal constraints introduced by robot-robot interaction. Through experiments on multiple kinodynamic systems and environments, we show that KiTE-Extend reduces planning time and improves scalability across the three most common MRMP paradigms: centralized, prioritized, and conflict-based.

2605.09795 2026-05-12 cs.CL

cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu

Andrew Li, Sidney Wong

AI总结 本文介绍了在DravidianLangTech 2026会议上针对代码混合的图卢语(Tulu)希望言论检测任务所提出的系统与结果。研究采用基于XLM-RoBERTa的文本分类模型,通过有机收集的图卢语社交媒体文本进行领域适配,有效提升了希望言论检测的性能。实验表明,有机适配的模型在开发集上优于基线模型,为代码混合语言的希望言论检测提供了可行的改进方向。

Comments Accepted to Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026)

详情
英文摘要

This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.

2605.09789 2026-05-12 cs.RO

Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching

Kejia Ren, Gaotian Wang, Andrew S. Morgan, Kaiyu Hang

AI总结 该研究探讨了如何在零样本条件下将模拟环境中学到的机器人操控策略直接应用于真实世界,特别针对需要高精度和快速反应的灵巧抓取任务。为解决模拟到现实迁移中的不确定性问题,作者提出了一种新的领域随机化方法——领域随机化实例集(DRIS),通过同时传播多个随机化实例,增强策略对现实动态变化的鲁棒性。实验表明,该方法在无需真实世界微调的情况下,能够实现可靠的零样本迁移,并在无需被动稳定结构的抓取任务中表现出优异的抗噪声能力。

详情
英文摘要

Dexterous manipulation is physics-intensive and highly sensitive to modeling errors and perception noise, making sim-to-real transfer prohibitively challenging. Domain randomization (DR) is commonly used to improve the robustness of learned policies for such tasks, but conventional DR randomizes one instance per episode, offering very limited exposure to the variability of real-world dynamics. To this end, we propose Domain-Randomized Instance Set (DRIS), which represents and propagates a set of randomized instances simultaneously, providing richer approximation of uncertain dynamics and enabling policies to learn actions that account for multiple possible outcomes. Supported by theoretical analysis, we show that DRIS yields more robust policies and alleviates the need for real-world fine-tuning, even with a modest number of instances (e.g., 10). We demonstrate this on a challenging reactive catching task. Unlike traditional catching setups that use end-effectors designed to mechanically stabilize the object (e.g., curved or enclosing surfaces), our system uses a flat plate that offers no passive stabilization, making the task highly sensitive to noise and requiring rapid reactive motions. The learned policies exhibit strong robustness to uncertainties and achieve reliable zero-shot sim-to-real transfer.

2605.09778 2026-05-12 cs.LG cs.CL

Nectar: Neural Estimation of Cached-Token Attention via Regression

João Monteiro, Michal Klein, Pierre Ablin, Marco Cuturi

AI总结 该论文提出了一种名为Nectar的方法,用于高效估计长上下文中的缓存键值注意力。其核心思想是通过拟合一个紧凑的神经网络来近似注意力输出函数,从而避免对每个查询token遍历整个缓存的高计算开销。Nectar为每一层和每个KV头分别拟合目标网络和得分网络,分别预测注意力输出和对数归一化因子,在推理时替代传统的$O(n)$注意力计算,显著降低计算复杂度。实验表明,Nectar在多个大规模语言模型和长上下文数据集上有效逼近完整注意力的效果,并在生成任务中保持了语义内容的一致性。

详情
英文摘要

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|θ|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.

2605.09775 2026-05-12 cs.LG math.OC

Bayesian Optimization with Structured Measurements: A Vector-Valued RKHS Framework

Wenbin Wang, Colin N. Jones

AI总结 本文研究了在结构化测量环境下进行贝叶斯优化的问题,其中每个观测值为多维或函数型输出,而非单一标量值。作者提出了一种基于向量值再生核希尔伯特空间(RKHS)的框架,将目标函数定义为这些测量的线性泛函,并在该空间中推导了核岭回归估计的高概率集中界。在此基础上,设计了一种具有置信上界(UCB)采集函数的算法,并在温和假设下给出了遗憾界,实验表明该方法能有效提升样本效率,适用于多目标和时变场景。

详情
英文摘要

Bayesian optimization (BO) is an efficient framework for optimizing expensive black-box functions. However, it is typically formulated as learning an end-to-end mapping from inputs to scalar objectives, thereby discarding the potentially rich information whenever a structured system output is available. In this work, we study Bayesian optimization over a vector-valued operator with structured measurements, where each measurement observes multidimensional or functional outputs, e.g., trajectories or spatial fields, rather than a single scalar value. The objective is then defined as a linear functional of these measurements. This allows each observation to reveal substantially richer information about the underlying system compared to scalar observations. Assuming the unknown operator lies in a vector-valued reproducing kernel Hilbert space (RKHS), we derive high-probability concentration bounds for the kernel ridge regression (KRR) estimator directly in the measurement space, characterizing uncertainty in a general Hilbert space. Building on these results, we propose an algorithm based on the upper confidence bound (UCB) acquisition function with regret guarantees under mild assumptions, recovering sublinear rates for common kernels. Empirically, we demonstrate that leveraging structured measurements leads to improved sample efficiency by enabling efficient transfer of information across objectives and adaptation to time-varying settings.

2605.09774 2026-05-12 cs.CV

DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving

Shiva Aher

AI总结 DRIVE-C 是一个用于评估自动驾驶系统视觉感知鲁棒性的受控退化数据集,由真实场景下的多种环境驾驶视频构建而成。该数据集通过物理启发的合成退化方法生成了包含10段干净视频和600段退化视频的多样化样本,并提供了详细的元数据和传感器健康指数标注。DRIVE-C 为自动驾驶感知系统的鲁棒性评估、退化感知建模、不确定性估计以及传感器健康监测提供了可控且可复现的测试平台。

详情
英文摘要

DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.

2605.09773 2026-05-12 cs.CL cs.AI

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

Cameron Berg, Roshni Lulla

AI总结 该研究利用稀疏自编码器(SAE)特征引导技术,在Llama-3.3-70B-Instruct模型中增强其“黑暗三联征”(马基雅维利主义、自恋和病态人格)特征,并通过五种心理测量工具评估其行为变化。结果显示,引导后的模型在新型情境中表现出更强的剥削性、攻击性和冷漠,但认知共情能力保持不变,重现了人类黑暗三联征人群的共情分离特征。研究还发现,剥削行为与欺骗机制可能通过不同的计算路径实现,且不同特征引导方式对干预深度有显著影响,表明模型中的反社会倾向可能由可分离的组件构成,而非统一的整体。

Comments 12 pages, 3 figures

详情
英文摘要

We use sparse autoencoder (SAE) feature steering to amplify Dark Triad personality traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluate the resulting behavioral changes across five psychological instruments. The steered model becomes substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62) while its cognitive empathy remains intact, reproducing the empathy dissociation characteristic of human Dark Triad populations. Critically, strategic deception is completely unaffected across all features, suggesting that exploitation and deception may operate through dissociable computational pathways in large language models. Individual feature analysis reveals non-redundant encoding, with each feature driving distinct antisocial mechanisms through separable computational pathways. We also show that feature discovery method itself modulates intervention depth: contrastively-discovered features change both self-report and behavior, while semantically-searched features change only self-report (d=12.65 between methods on behavior). These findings suggest that antisocial tendencies in at least one large language model comprise dissociable components rather than a unified construct, with implications for how such tendencies should be detected, measured, and controlled.