arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.14174 2026-05-15 cs.RO

Safety-Constrained Reinforcement Learning with Post-Training Reachability Verification for Robot Navigation

Qisong He, Xinmiao Huang, Jinwei Hu, Zhuoyun Li, Yi Dong, Changshun Wu, Xiaowei Huang

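AI summary: This work addresses safe navigation for mobile robots in complex environments with a reinforcement learning framework that combines Conditional Value-at-Risk (CVaR) constrained optimization with post-training reachability verification. Adding CVaR constraints to the off-policy TD3 algorithm makes the policy more sensitive to high-risk tail events and thereby improves safety; after training, Taylor Model analysis computes action reachable sets that quantify the policy's safety margin in different states. Experiments show that the method achieves the highest safety verification rate across multiple navigation scenarios and exposes risks that conventional average-cost metrics can miss.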

Abstract

Safe navigation for mobile robots demands policies that remain reliable under the high-consequence perception uncertainty of cluttered environments. Yet most existing safe reinforcement learning (RL) methods assess safety through average cumulative cost. Such metrics can mask dangerous tail-risk behaviors. To address this, we propose a framework that trains risk-sensitive policies through Conditional Value-at-Risk (CVaR) constrained optimization on an off-policy TD3 backbone and evaluates their safety margins post-training through neural network reachability verification. During training, the policy is optimized under CVaR constraints on cumulative costs, promoting sensitivity to high-cost tail outcomes rather than average behavior alone. After training, we compute action reachable sets under bounded observation uncertainty using Taylor Model analysis, yielding a safety rate metric that quantifies the proportion of evaluated states at which the policy's reachable action set remains within prescribed safety margins. A key finding is that policies trained with CVaR constraints maintain larger safety margins from obstacles across evaluated states. This makes them significantly more amenable to formal reachability verification. Experiments across ten navigation scenarios and six baselines show that our method achieves a 98.3% success rate, the highest safety verification rate among all compared methods, while revealing that average cost rankings and reachability-based safety rankings can diverge. This indicates that reachability verification captures risks which are missed by empirical cost metrics alone. We further validate our approach on a physical Clearpath Jackal robot, demonstrating successful sim-to-real transfer.
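
To make the contrast between average cost and tail risk concrete, here is a minimal sketch of an empirical CVaR estimate over episode costs. It is not the paper's training objective: the function name `cvar`, the confidence level `alpha`, and the cost values are all hypothetical, chosen only to show how a single bad episode dominates CVaR while barely moving the mean.

```python
import math

def cvar(costs, alpha=0.9):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of costs."""
    if not costs:
        raise ValueError("costs must be non-empty")
    ordered = sorted(costs, reverse=True)               # highest cost first
    k = max(1, math.ceil((1 - alpha) * len(ordered)))   # tail size, at least 1
    return sum(ordered[:k]) / k

# Hypothetical cumulative costs from ten evaluation episodes.
episode_costs = [0.0, 0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 10.0]
mean_cost = sum(episode_costs) / len(episode_costs)    # average behavior
tail_cost = cvar(episode_costs, alpha=0.9)             # worst 10% of episodes
print(mean_cost, tail_cost)
```

A constraint on `tail_cost` rather than `mean_cost` penalizes the rare high-cost episode directly, which is the sensitivity the abstract attributes to CVaR-constrained training.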

2605.14171 2026-05-15 cs.LG cs.NI

CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision

Xuanhao Luo, Zhizhen Li, Yuchen Liu

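AI summary: This paper proposes CSI-JEPA, a self-supervised learning framework for general-purpose Wi-Fi sensing representations with minimal supervision. It learns reusable temporal-spectral representations from unlabeled CSI data by predicting the latent features of masked channel regions, and introduces a masking strategy based on channel variation to strengthen the representations. Experiments show that CSI-JEPA outperforms existing supervised methods on sensing tasks in several real-world settings, improving performance markedly while reducing reliance on labeled data.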

Abstract

Channel state information (CSI) provides a widely available sensing modality for human and environment perception, but existing CSI sensing models usually rely on task-specific supervised training and require substantial labeled data for each task, device, user, or environment. This limits their scalability in practical deployments where unlabeled CSI is abundant but labeled data is costly to collect. In this paper, we present CSI-JEPA, a self-supervised predictive representation learning framework for label-efficient, multi-task Wi-Fi sensing. CSI-JEPA learns reusable temporal-spectral representations from unlabeled CSI samples by predicting latent features of masked channel regions from visible context. To better match the physical structure of CSI, CSI-JEPA tokenizes channel-response amplitude windows along the time and subcarrier dimensions. It then introduces a channel variation-aware masking strategy that samples predictive targets from regions with stronger local temporal and subcarrier-domain variations. After pretraining, the encoder is frozen and used as a backbone, with lightweight task-specific adapters added for downstream sensing tasks. We evaluate CSI-JEPA on seven real-world Wi-Fi sensing tasks spanning diverse objectives and deployment settings. The results show that CSI-JEPA improves downstream sensing performance over competitive baselines, achieving up to 10.64 percentage points mean accuracy gain over state-of-the-art supervised Transformer and matched-budget label savings of up to 98.0%.

2605.14169 2026-05-15 cs.CL

BOOKMARKS: Efficient Active Storyline Memory for Role-playing

Letian Peng, Ziche Liu, Yiming Huang, Longfei Yun, Kun Zhou, Yupeng Hou, Jingbo Shang

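AI summary: This paper proposes BOOKMARKS, an efficient active storyline memory framework for role-playing agents (RPAs), to address the loss of key details that information compression causes in existing approaches to long-horizon consistency. It actively initializes and updates task-relevant "bookmarks" that record key questions and answers in the story, preserving task details while reducing repeated computation. Experiments show that BOOKMARKS significantly outperforms conventional memory methods across many characters and tasks, confirming its effectiveness for role-playing.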

Abstract

Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling) mainly rely on recurrent summarization, whose compression inevitably discards important details. To address this issue, we propose a search-based memory framework called BOOKMARKS, which actively initializes, maintains, and updates task-relevant pieces of bookmarks for the current task (e.g., character acting). A bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones (at storyline beginning) with useful questions. These bookmarks are then synchronized to the current story point, with their answers updated accordingly, so they can be efficiently reused in future grounding rounds. Compared with recurrent summarization, BOOKMARKS offers (1) active grounding for capturing task-specific details and (2) passive updating to avoid unnecessary computation. In implementation, BOOKMARKS supports concept, behavior, and state searches, each powered by an efficient synchronization method. BOOKMARKS significantly outperforms RPA memory baselines on 85 characters from 16 artifacts, demonstrating the effectiveness of search-based memory for RPAs.

2605.14168 2026-05-15 cs.LG cs.DS stat.ML

Finite Sample Bounds for Learning with Score Matching

Devin Smedira, Abhijith Jayakumar, Sidhant Misra, Marc Vuffray, Andrey Y. Lokhov

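AI summary: This paper studies the statistical problem of learning continuous exponential family distributions with score matching from finite samples. The authors give a non-asymptotic sample complexity analysis showing a polynomial dependence on the model dimension, the first result of this kind in the area. The work fills a gap in the theoretical analysis of score matching and provides theoretical guarantees for high-dimensional statistical learning.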

Comments 22 pages

Abstract

Learning of continuous exponential family distributions with unbounded support remains an important area of research for both theory and applications in high-dimensional statistics. In recent years, score matching has become a widely used method for learning exponential families with continuous variables due to its computational ease when compared against maximum likelihood estimation. However, theoretical understanding of the statistical properties of score matching is still lacking. In this work, we provide a non-asymptotic sample complexity analysis for learning the structure of exponential families of polynomials with score matching. The derived sample bounds show a polynomial dependence on the model dimension. These bounds are the first of their kind, as all prior work has shown only asymptotic bounds on the sample complexity.
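
For orientation, the standard score matching objective (Hyvärinen's form) and its empirical version for an exponential family can be written as below. This is a textbook sketch rather than the paper's own notation: the sufficient statistics $\phi_k$, the parameters $\theta_k$, and a constant base measure are assumptions made here for illustration.

```latex
J(\theta) = \mathbb{E}_{x \sim p}\left[ \tfrac{1}{2}\,\bigl\lVert \nabla_x \log p_\theta(x) \bigr\rVert^2 + \Delta_x \log p_\theta(x) \right],
\qquad
p_\theta(x) \propto \exp\Bigl( \sum_k \theta_k \phi_k(x) \Bigr)

\hat{J}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \tfrac{1}{2}\,\Bigl\lVert \sum_k \theta_k \nabla_x \phi_k(x_i) \Bigr\rVert^2 + \sum_k \theta_k \,\Delta_x \phi_k(x_i) \right]
```

Because the normalizing constant does not depend on $x$, it drops out of both terms, so $\hat{J}_n$ is a quadratic function of $\theta$. That is the source of the computational ease relative to maximum likelihood that the abstract mentions.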

2605.14167 2026-05-15 cs.AI cs.CY

The Evaluation Trap: Benchmark Design as Theoretical Commitment

Theodore J Kalaitzidis

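AI summary: This paper examines how the theoretical assumptions implicit in AI benchmarks shape the definition of capability evaluation and the direction of progress, arguing that when these assumptions go unexamined, benchmarks entrench the dominant paradigm and limit real understanding of capability. It proposes a methodology called "Epistematics" for deriving evaluation criteria directly from technical capability claims and for testing whether a benchmark can distinguish genuine capability from surface behavior. Its core contribution is a meta-evaluative framework comprising an audit procedure, a failure mode taxonomy, and benchmark design criteria, intended to improve coherence between evaluations and the capabilities they target.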

Comments 13 pages

Abstract

Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by narrowing what counts as progress. Over time, narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions. The result is a trap: evaluation frameworks treat self-reinforcing assessments as valid, both creating and obscuring structural limits on what the current paradigm can accomplish. We introduce Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether proposed benchmarks can discriminate the claimed capability from proxy behaviors. The contribution is meta-evaluative: an audit procedure, a failure mode taxonomy, and benchmark-design criteria for evaluating capability-evaluation coherence. We demonstrate the procedure through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm's theoretical assumptions at the architectural level while reproducing them in its evaluation criteria, thereby entrenching the constraint it seeks to overcome in a form the evaluation cannot detect.

2605.14164 2026-05-15 cs.AI

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

Stefan Baack, Christo Buschek, Maty Bohacek

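AI summary: This study examines the benchmarking cultures that builders of foundation and generative AI models rely on when assessing model capabilities, finding that the main venue has shifted from academic papers to company press releases and blog posts, which now largely define the state of the art. By building and open-sourcing the Benchmarking-Cultures-25 dataset, it analyzes 231 benchmarks highlighted across 139 model releases from 11 major AI companies in 2025, reveals a fragmented evaluation landscape with low cross-model comparability, and proposes a unified taxonomy for untangling the heterogeneous ways different builders describe what benchmarks measure.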

Abstract

The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. "General knowledge application" is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: https://hf.co/datasets/matybohacek/benchmarking-cultures-25; tool: https://bench-cultures.net.

2605.14163 2026-05-15 cs.AI

Agentic Systems as Boosting Weak Reasoning Models

Varun Sunkaraneni, Pierfrancesco Beneventano, Riccardo Neumarker, Tomaso Poggio, Tomer Galanti

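AI summary: This paper studies how to combine the outputs of several weak reasoning models to reach the performance of a strong model. The core method is verifier-backed committee search, in which proposal, critic, and comparator modules work together at inference time to improve overall reasoning. The authors prove that increasing the number of models alone is not enough; reliable selection also requires local soundness signals such as execution or type checking. Experiments show that a well-designed combination of weak models can match the performance of strong models, and that the main challenge is selecting the correct solutions from among the proposals.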

Abstract

Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that ``more agents help'': samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-\(k\) converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single \texttt{GPT-5.4 nano} proposal solves \(67.0\%\) of tasks. Using the same nano model, our critic--comparator orchestration reaches \(76.4\%\) with \(k=8\) proposals, matching the standalone performance of \texttt{Gemini 3 Pro} and \texttt{Claude Opus 4.5} Thinking and approaching the \(79.0\%\) oracle best-of-\(8\) upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.
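
The proposer-side ceiling described above can be illustrated with a small calculation. The sketch below assumes independent samples and uses made-up per-task success probabilities; `oracle_best_of_k` is a hypothetical helper, not code from the paper. It shows coverage growing with k but saturating at the fraction of tasks that have nonzero success probability.

```python
def oracle_best_of_k(solve_probs, k):
    """Expected fraction of tasks solved by at least one of k independent samples."""
    return sum(1 - (1 - p) ** k for p in solve_probs) / len(solve_probs)

# Hypothetical probability that a single proposal solves each of four tasks.
solve_probs = [0.9, 0.5, 0.1, 0.0]

single = oracle_best_of_k(solve_probs, 1)      # one proposal per task
many = oracle_best_of_k(solve_probs, 1000)     # very large proposal pool
# Ceiling: share of tasks on which the proposer has any chance at all.
ceiling = sum(p > 0 for p in solve_probs) / len(solve_probs)
print(single, many, ceiling)
```

The task with probability 0.0 is never recovered however large k becomes, which matches the abstract's remark that the remaining failures are proposal-coverage failures that stronger selection cannot close.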

2605.14156 2026-05-15 cs.LG

Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings

Scott Ye, Harlin Lee

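AI summary: This study explores the diagnostic information latent in multimodal masked autoencoder embeddings of pediatric sleep data, augmenting the embeddings with topological features, geometric structure, and electronic health records (EHR). It finds that linear models and multilayer perceptrons perform better and are more interpretable on sleep disorder prediction tasks once this additional information is fused in, and that under extreme class imbalance the fusion model markedly improves calibration and robustness.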

Comments Accepted to ML4H 2025, 20 pages, 6 figures

Journal ref
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:1392-1411, 2025
Abstract

While generative models have shown promise in pediatric sleep analysis, the latent structure of their multimodal embeddings remains poorly understood. This work investigates session-wide diagnostic information contained in the sequences of 30-second pediatric PSG epochs embedded by a multimodal masked autoencoder. We test whether augmenting embeddings with PHATE-derived per-epoch coordinates and whole-night movement descriptors, persistent homology summaries of the embedding cloud, and EHR yields task-relevant signals. Simple linear and MLP models, chosen for interpretability rather than state-of-the-art performance, show that geometric, topological, and clinical features each provide complementary gains. For binary predictions, feature importance is task-dependent, and more expressive late-fusion models generally perform better, with AUPRC improving from 0.26 to 0.34 for desaturation, 0.31 to 0.48 for EEG arousal, 0.09 to 0.22 for hypopnea, and 0.05 to 0.14 for apnea. We also report Brier score and Expected Calibration Error, where the full fusion model yields the best calibration across all four binary tasks. Our study reveals that latent geometry/topology and EHR offer complementary, interpretable signals beyond embeddings, improving calibration and robustness under extreme imbalance.
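
The two calibration metrics named in the abstract can be sketched in a few lines. This is a generic illustration, not the authors' evaluation code: the ECE variant shown bins the predicted positive-class probability into equal-width bins (one common choice for binary tasks), and the bin count and example values are assumptions.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted gap between mean predicted probability and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - rate)
    return ece

# Hypothetical predictions for a binary task.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Under heavy class imbalance, both metrics are worth reading alongside AUPRC, since a model can rank positives well while still reporting poorly calibrated probabilities.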

2605.14152 2026-05-15 cs.CL cs.AI cs.CR cs.CY

ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

Michael S. Lee, Yash Maurya, Drew Rein, Bert Herring, Jonathan Nguyen, Kyungho Song, Udari Madhushani Sehwag, Jiyeon Cho, Kaustubh Deshpande, Yeongkyun Jang, Jiyeon Joo, Minn Seok Choi, Evi Fuelle, Christina Q Knight, Joseph Brandifino, Max Fenkell

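AI summary: This paper proposes ROK-FORTRESS, a bilingual benchmark for assessing large language model risks in national security and public safety, focused on how the English-Korean language pair interacts with the U.S.-ROK geopolitical context. By constructing a "transcreation matrix", the method separates language from geopolitical factors and systematically evaluates model safety responses across languages and entity contexts. The study finds that the combination of Korean language and Korean geopolitical context significantly affects model safety behavior, that models differ in how they respond, and that conventional translation-only evaluation may therefore underestimate the risks arising from the interaction of language and geopolitics.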

Comments 16 pages main body + appendix (63 total), 5 main figures, 4 main tables; dataset at https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public

Abstract

Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce \emph{ROK-FORTRESS} (https://huggingface.co/datasets/ScaleAI/ROK-FORTRESS_public), a bilingual, culturally adversarial NSPS benchmark that uses the English--Korean language pair and U.S.--ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a \emph{transcreation matrix}: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S.\ versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression -- with no model showing significant amplification in the other direction -- indicating that, at least in the English--Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language--culture pairs.

2605.14147 2026-05-15 cs.LG

A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification

Jiandong Chen, Lingjie Su, Le Peng, Yash Travadi, Rui Zhang, Ju Sun

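AI summary: This study systematically evaluates how commonly used imbalance handling methods affect biomedical binary classification, and examines the interplay between model complexity and data modality. Testing several methods on three representative biomedical datasets, it finds that simple models such as logistic regression are insensitive to imbalance handling, whereas complex models such as deep neural networks improve markedly with random oversampling or re-weighting. The results indicate that choosing an appropriate imbalance handling method matters for the classification performance of complex models on text and image data.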

Comments 18 pages, 1 figure, 4 tables

Abstract

Objective: The primary goal of this study was to systematically examine the impact of commonly used imbalance handling methods (IHMs) on predictive performance in biomedical binary classification, considering the interplay between model complexity and diverse data modalities. Material and Methods: We evaluated five representative IHMs: random undersampling (RUS), random oversampling (ROS), SMOTE, re-weighting (RW), and direct F1-score optimization (DMO), against a raw training (RAW) baseline. The evaluation encompassed three public biomedical datasets: MIMIC-III (tabular), ADE-Corpus-V2 (text), and MURA (image), spanning three common biomedical data modalities. To assess varying model complexity, we employed a range of architectures, from classical logistic regression and random forest to deep neural networks, including multilayer perceptron (MLP), BiLSTM, BERT, DenseNet, and DINOv2. Results: For simpler models such as logistic regression on tabular data, IHMs yielded no significant advantage over the RAW baseline, aligning with prior findings. However, clear benefits were observed for more complex models and unstructured data: (a) ROS and RW consistently enhanced the performance of powerful models; (b) direct F1-score optimization demonstrated utility primarily for unstructured text and image data; and (c) RUS and SMOTE consistently degraded performance and are therefore not recommended. Conclusion: The effectiveness of IHMs depends on both model complexity and data modality. Performance gains are most pronounced when leveraging appropriate IHMs, such as ROS, RW, and DMO, on high-complexity models.
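
The two methods the study recommends, random oversampling (ROS) and re-weighting (RW), are simple enough to sketch directly. The code below is an illustrative sketch, not the authors' pipeline: the function names are hypothetical, and the weighting formula is the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which may differ from the scheme used in the paper.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):          # no-op for the majority class
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y

def inverse_frequency_weights(labels):
    """Per-class loss weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical 8:2 imbalanced dataset; samples are just indices here.
samples = list(range(10))
labels = [0] * 8 + [1] * 2
balanced_x, balanced_y = random_oversample(samples, labels)
weights = inverse_frequency_weights(labels)
print(Counter(balanced_y), weights)
```

ROS changes the training set while RW changes the loss, so RW avoids the longer epochs that duplicated samples cause; both leave the evaluation data untouched.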

2605.14146 2026-05-15 cs.LG

bde: A Python Package for Bayesian Deep Ensembles via MILE

Vyron Arvanitis, Angelos Aslanidis, Emanuel Sommer, David Rügamer

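AI summary: bde is a user-friendly Python package for building Bayesian deep ensembles, aimed particularly at tabular data. Built on an efficient implementation of the sampling-based inference method MILE (Microcanonical Langevin Ensembles), it supports fast training, efficient Markov chain Monte Carlo sampling, and uncertainty quantification for regression and classification tasks, offering a convenient route into Bayesian deep learning.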

Abstract

bde is a user-friendly Python package for Bayesian Deep Ensembles with a particular focus on tabular data. Built on an efficient JAX implementation of the sampling-based inference method Microcanonical Langevin Ensembles (MILE), it provides scikit-learn compatible estimators for fast training, efficient Markov Chain Monte Carlo sampling, and uncertainty quantification in both regression and classification tasks.

2605.14145 2026-05-15 cs.CV

Rethinking the Good Enough Embedding for Easy Few-Shot Learning

Michael Karnes, Alper Yilmaz

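AI summary: This paper asks whether different deep vision models trained on large-scale data converge to an "ideal" latent representation space, and argues that a good embedding is enough. By freezing DINOv2-L features and pairing them with a k-nearest-neighbor classifier, it builds a non-parametric few-shot learning pipeline that needs no backpropagation, identifies the best feature extraction layer, and applies principal component analysis and independent component analysis for manifold refinement. Experiments show that the method outperforms sophisticated meta-learning algorithms on several major benchmarks, reaching state-of-the-art performance.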

Abstract

The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, "ideal" latent space. This again raises a critical question: is a good embedding all you need? In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently "good enough" for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. By utilizing a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify the optimal feature extraction layer. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
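
The backpropagation-free classification step can be sketched as a plain nearest-neighbor vote over precomputed embeddings. This is a minimal illustration under stated assumptions: the two-dimensional vectors stand in for frozen backbone features, cosine similarity is an assumed choice of metric, and the PCA/ICA refinement the abstract describes is omitted.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(query, support, k=1):
    """Majority label among the k support embeddings most similar to the query.

    support: list of (embedding, label) pairs from the few-shot support set.
    """
    ranked = sorted(support, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-way 2-shot episode with toy "embeddings".
support = [
    ([1.0, 0.1], "cat"), ([0.9, 0.2], "cat"),
    ([0.1, 1.0], "dog"), ([0.2, 0.8], "dog"),
]
prediction = knn_predict([0.8, 0.3], support, k=3)
print(prediction)
```

Nothing here is trained: adapting to a new episode only means swapping the support list, which is why the quality of the frozen embedding carries the whole method.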

2605.14141 2026-05-15 cs.AI

Distribution-Aware Algorithm Design with LLM Agents

Saharsh Koganti, Priyadarsi Mishra, Pierfrancesco Beneventano, Tomer Galanti

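AI summary: This paper studies learning when the learned object is executable solver code rather than a predictive model, emphasizing that a solver must be fast as well as correct. Its central abstraction is the "solver hint": reusable structure inferred from samples and compiled into specialized solver code to improve both efficiency and quality. Experiments show that solvers generated by LLM-based code agents outperform existing heuristics and solvers on several combinatorial optimization problems, running hundreds of times faster while maintaining high solution quality and sharply reducing computational cost.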

详情
英文摘要

We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a \emph{solver hint}: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on \(21\) structured combinatorial-optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality \(0.971\), improve by \(+0.224\) over the average heuristic pool and by \(+0.098\) over the highest-quality heuristic, and are \(336.9\times\), \(342.8\times\), and \(16.1\times\) faster than the quality-best heuristic, Gurobi, and the selected time-limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all \(100\) graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general-purpose optimization with compiled distribution-specific computation.
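
The selection rule in the first theoretical result, picking the empirically fastest sample-consistent solver from a fixed library, can be sketched on a toy problem. Everything below is hypothetical scaffolding: sorting stands in for the combinatorial task, and `select_solver`, `bubble_sort`, and the three-entry library are illustrative names rather than anything from the paper.

```python
import random
import time

def select_solver(library, samples, is_valid):
    """Name of the fastest solver that is valid on every sample, or None."""
    best_name, best_time = None, float("inf")
    for name, solver in library.items():
        start = time.perf_counter()
        ok = all(is_valid(inst, solver(inst)) for inst in samples)
        elapsed = time.perf_counter() - start
        if ok and elapsed < best_time:       # must be consistent before speed counts
            best_name, best_time = name, elapsed
    return best_name

def bubble_sort(xs):
    """Correct but quadratic-time solver."""
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

# "identity" is fast but wrong, "bubble" is right but slow, "builtin" is both right and fast.
library = {"builtin": sorted, "bubble": bubble_sort, "identity": list}
rng = random.Random(0)
samples = [[rng.random() for _ in range(300)] for _ in range(5)]
chosen = select_solver(library, samples, lambda inst, out: out == sorted(inst))
print(chosen)
```

Filtering on validity first and ranking by runtime second mirrors the abstract's point that correctness alone does not distinguish solvers on the deployment distribution.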

2605.14136 2026-05-15 cs.CV

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion

Nurislam Tursynbek, Zhiqiang Lao, Heather Yu, Gedas Bertasius, Marc Niethammer

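AI summary: Recent text-to-video diffusion models generate visually appealing frames but still fall short on temporal coherence, often showing flicker, drift, or unstable motion. This paper proposes TeDiO, a training-free method used only at inference time, which strengthens temporal coherence by regularizing the temporal diagonal patterns in the model's internal attention maps. It estimates diagonal smoothness, identifies unstable regions, and applies lightweight latent updates, markedly improving motion smoothness across several video diffusion models while preserving per-frame visual quality, without modifying model weights or relying on external motion supervision.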

Comments CVPR'26 Workshop on Agentic AI for Visual Media

Abstract

Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

2605.14135 2026-05-15 cs.CV

PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

Adil Qureshi, Dongki Jung, Jaehoon Choi, Dinesh Manocha

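AI summary: This paper proposes PanoPlane, a method for high-fidelity indoor novel view synthesis from sparse views, which reconstructs closed room geometry through panoramic scene completion. It introduces a training-free layout-anchored attention steering mechanism that, at inference time, steers the diffusion model toward the planar surfaces detected in the scene, replacing unconstrained hallucination with geometrically consistent completion. Experiments show better novel view synthesis than existing methods on the Replica, ScanNet++, and Matterport3D datasets, with PSNR improvements of up to 17.8%.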

Abstract

We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages $360^{\circ}$ panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model's internal representation toward the scene's detected planar surfaces at inference time. By directing each unobserved region's attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to $+17.8\%$ improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

2605.14126 2026-05-15 cs.LG cs.AI

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

Marius S. Knorr, Robert Müller, Jan P. Bremer, Nils Schweingruber

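AI summary: This paper studies how reinforcement learning can improve the multi-step reasoning of healthcare information agents under the Fast Healthcare Interoperability Resources (FHIR) standard. The authors model FHIR electronic health records as a queryable structured graph and design a multi-turn agent that acts through code, post-trained with reinforcement learning to improve question answering over real hospital data. Experiments show that the method substantially raises answer correctness on the FHIR-AgentBench benchmark while enforcing data-integrity constraints.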

Abstract

Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. An LLM judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

2605.14120 2026-05-15 cs.LG cs.CL

Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

Mashrekur Rahman

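AI summary: This study proposes Mini-JEPA, a fleet of lightweight foundation models for hydrologic intelligence systems. Small Joint Embedding Predictive Architecture models are trained separately for different sensors, and a routing agent selects the appropriate models for each question, keeping accuracy high while lowering compute cost. Experiments show that Mini-JEPA performs well on prediction tasks for several hydrologic variables and shows clear gains over using the large AlphaEarth model alone.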

Abstract

Geospatial foundation models compress multispectral observations into dense embeddings increasingly used in natural-language environmental reasoning systems. A single planetary-scale model, e.g., Google AlphaEarth, handles broad characterization well but may compromise on specialized hydrologic signals. Such generalist models are also often inaccessible, expensive, and require large-scale compute. We propose Mini-JEPAs: a fleet of small sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models consulted by a routing agent for specialized questions. We pretrained five 22M-parameter Mini-JEPAs sharing an identical Vision Transformer backbone, JEPA recipe, and 64-d output space, using Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and a topography-soil stack. Each Mini-JEPA reconstructs the variable matched to its sensor, with cross-validated $R^2$ reaching 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation. The five manifolds differ in geometric structure, with global participation ratios from 8.9 to 20.2 and local intrinsic dimensionalities from 2.3 to 9.0. Joint topography-soil and phenology models add predictive value beyond AlphaEarth alone for soil moisture, aridity, and precipitation ($\Delta R^2$ up to 0.031). A router LLM reads per-modality references and selects appropriate sensors with a perfect hit rate over a curated question set. In paired LLM-as-Judge evaluation, dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions (Cohen's $d = 1.10$, $p = 0.031$). Locally-trained Mini-JEPAs can be operationalized for hydrologic intelligence with modest compute.
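
For readers unfamiliar with the participation ratio quoted above, the usual definition is the squared sum of the covariance eigenvalues divided by the sum of their squares. The sketch below assumes that definition is the one intended and takes the eigenvalues as given; the example spectra are made up.

```python
def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Equals d when variance is spread evenly over d dimensions and 1 when
    a single dimension carries all of it.
    """
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)

# Hypothetical covariance spectra of an embedding space.
even_spread = participation_ratio([1.0, 1.0, 1.0, 1.0])     # variance shared equally
one_direction = participation_ratio([1.0, 0.0, 0.0, 0.0])   # one dominant axis
print(even_spread, one_direction)
```

On that reading, values of 8.9 to 20.2 in a 64-d output space would mean each model spreads its variance over far fewer effective directions than the nominal dimension.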

2605.14117 2026-05-15 cs.CL cs.AI

Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards

Luis Lara, Aristides Milios, Zhi Hao Luo, Aditya Sharma, Ge Ya Luo, Christopher Beckham, Florian Golemo, Christopher Pal

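AI summary: This study proposes a text-based floor plan generation method built on a large language model (LLM) and optimized with reinforcement learning with verifiable rewards (RLVR), aiming to generate high-quality floor plans that satisfy user-defined connectivity and numerical constraints. By fine-tuning the LLM on real floor plans and optimizing against constraint adherence metrics, the method outperforms existing approaches on realism, compatibility, and diversity, with a relative improvement of at least 94% on the compatibility metric, demonstrating that LLMs can handle structured design constraints effectively.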

Comments Accepted to Findings of ACL 2026

Abstract

An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.
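
A verifiable reward of the kind described can be checked programmatically without a learned judge. The sketch below is a simplified, hypothetical instance: rooms are axis-aligned rectangles `(x, y, width, height)`, any overlap zeroes the reward, and the remaining score is the fraction of target areas met within a tolerance. The paper's actual reward terms, room representation, and connectivity checks are not reproduced here.

```python
def overlap_area(a, b):
    """Overlap of two axis-aligned rectangles given as (x, y, width, height)."""
    w = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    h = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def reward(rooms, target_areas, tol=0.05):
    """0.0 if any rooms overlap, else the fraction of area targets met within tol."""
    names = list(rooms)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if overlap_area(rooms[first], rooms[second]) > 0:
                return 0.0                     # invalid plan: overlapping rooms
    met = sum(
        abs(rooms[name][2] * rooms[name][3] - area) <= tol * area
        for name, area in target_areas.items()
        if name in rooms
    )
    return met / len(target_areas)

# Hypothetical plan: rooms share a wall at x = 4 but do not overlap.
rooms = {"bedroom": (0, 0, 4, 3), "kitchen": (4, 0, 3, 3)}
targets = {"bedroom": 12, "kitchen": 10}       # requested areas in square meters
score = reward(rooms, targets)
print(score)
```

Because every term is computed from the generated geometry, the reward cannot be gamed by plausible-sounding text, which is the property that makes it "verifiable".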

2605.14115 2026-05-15 cs.CL

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

Yikun Han, Mengfei Lan, Halil Kilicoglu

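AI summary This study examines how large language models behave on biomedical question answering when the retrieved evidence conflicts. Across controlled evidence conditions, it finds that accuracy drops markedly under contradictory information and that predictions flip. The authors present an abstention score that combines model confidence with evidence-conflict detection; it improves selective accuracy in the hardest conditions and underscores the role of evidence conflict in model uncertainty and robustness.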

Comments Accepted by BioNLP 2026

Details
English abstract

Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%–25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2–33.4 points in incorrect-only ('IC') and 3.6–14.4 points in incorrect-first conflicting ('ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
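
Selective accuracy at a given coverage can be sketched as follows: rank predictions by an abstention score, answer only the top fraction, and measure accuracy on what is answered. The combination rule `confidence - weight * conflict_prob` is a hypothetical stand-in, since the abstract does not say how confidence and the conflict detector are combined. The toy numbers are invented for illustration.

```python
def selective_accuracy(scores, correct, coverage):
    """Accuracy over the `coverage` fraction of items with the highest score."""
    keep = max(1, round(coverage * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = ranked[:keep]
    return sum(correct[i] for i in kept) / keep


def conflict_aware_score(confidence, conflict_prob, weight=1.0):
    """Hypothetical combination: penalize confidence when conflict is likely."""
    return confidence - weight * conflict_prob


# Toy data: the most confident answer is wrong and sits on conflicting evidence.
confidence = [0.9, 0.8, 0.7, 0.6]
conflict_prob = [0.9, 0.1, 0.1, 0.8]
correct = [0, 1, 1, 0]

combined = [conflict_aware_score(c, p) for c, p in zip(confidence, conflict_prob)]
acc_confidence_only = selective_accuracy(confidence, correct, coverage=0.5)
acc_conflict_aware = selective_accuracy(combined, correct, coverage=0.5)
```

In this toy case, answering the top 50% by confidence alone keeps one wrong answer, while the conflict-aware score defers on both conflicted items.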

2605.14111 2026-05-15 cs.AI cs.HC

Modeling Bounded Rationality in Drug Shortage Pharmacists Using Attention-Guided Dynamic Decomposition

Yaniv Eliyahu Amiri, Noah Chicoine, Jacqueline Griffin, Stacy Marsella

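AI summary This paper studies how hospital pharmacists make decisions during drug shortages under uncertainty, time pressure, and patient risk. It proposes an attention-guided dynamic decomposition framework that splits drugs into a set for high-cost reasoning and a set for low-cost monitoring, so that decisions are made in a bounded-rational way. Two models are built: an Expert Agent whose attention weights come from pharmacist interviews and a Learner Agent that adjusts its attention allocation through experience. Experiments show that the approach yields stable decisions without complete state information, suggesting that the core of the decision lies less in the specific action than in how cognitive resources are allocated.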

Comments Accepted at CogSci 2026. 6 pages plus references, 1 figure, 2 tables

Details
English abstract

Hospital pharmacists make high-stakes decisions to mitigate drug shortages under uncertainty, time pressure, and patient risk. Interviews revealed that pharmacists focus attention on a small subset of drugs, limiting cognitive effort to the most urgent cases. Motivated by these findings, we formalize a bounded-rational, attention-guided decision framework that dynamically decomposes drugs into a subset for high-cost reasoning and a complementary subset for low-cost monitoring. We develop two agents: an Expert Agent that applies attention weights derived from pharmacist interviews, and a Learner Agent that adapts attention allocation over time through experience. Across simulated scenarios spanning short to long horizons, we show that attention-guided planning supports stable decision-making without complete state reasoning. These results suggest that a primary decision is not what action to take, but where to allocate cognitive effort, and that attention-guided, satisficing strategies can reduce problem complexity while maintaining stable performance.

2605.14110 2026-05-15 cs.CV cs.RO

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

Sandro Papais, Lezhou Feng, Charles Cossette, Lingting Ge

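AI summary This paper proposes SToRe3D, a sparsity framework for efficient multi-view 3D object detection that targets the heavy computation and high inference latency of Vision Transformers (ViTs) when processing multiple views and large 3D regions. The method jointly selects 2D image tokens and 3D object queries and uses a feature storage and reactivation mechanism to allocate compute to critical content. Experiments show that SToRe3D substantially speeds up inference while preserving detection accuracy, offering a practical route to real-time large-scale 3D detection.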

Comments Accepted to CVPR 2026

Details
English abstract

Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

2605.14108 2026-05-15 cs.CV cs.AI cs.LG

Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

Nishi Doshi, Shrey Shah

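AI summary This study addresses the shortage of diabetic retinopathy (DR) screening resources in rural areas with an edge-cloud cascade architecture that improves screening efficiency and reduces the cloud compute burden. The architecture has two tiers: Tier 1 runs a lightweight MobileNetV3-small model on a local device for binary triage, deciding whether a referral is needed; Tier 2 runs a RETFound-DINOv2 model in the cloud for fine-grained severity grading of the referred images. Experiments on the APTOS dataset show that the method roughly halves the number of cloud calls while maintaining high screening accuracy.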

Details
English abstract

Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade and evaluate it on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, Tele-ophthalmology
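
The routing rule and the cloud-call arithmetic in the abstract can be sketched as below. This is a minimal sketch under stated assumptions: `tier1_score`, `tier2_grade`, the dict-based image stubs, and the threshold of 0.5 are hypothetical placeholders for the edge and cloud models. Only the 49.52% forwarding rate and the 733-image test split come from the abstract.

```python
def cascade_predict(image, tier1_score, tier2_grade, threshold):
    """Return (label, used_cloud) in the 4-class space; label 0 = Class 0-1."""
    if tier1_score(image) < threshold:
        return 0, False  # non-referable: resolved on the edge device
    return tier2_grade(image), True  # referable: graded by the cloud model


def cloud_call_reduction(forwarded_fraction):
    """Fraction of cloud calls saved relative to sending every image."""
    return 1.0 - forwarded_fraction


# Hypothetical stubs standing in for the two models.
images = [
    {"edge_score": 0.1, "grade": 0},
    {"edge_score": 0.9, "grade": 3},
    {"edge_score": 0.7, "grade": 2},
    {"edge_score": 0.2, "grade": 0},
]
results = [
    cascade_predict(img, lambda im: im["edge_score"], lambda im: im["grade"], 0.5)
    for img in images
]
forwarded = sum(used for _, used in results) / len(results)

# Arithmetic with the figures reported in the abstract.
reported_reduction = cloud_call_reduction(0.4952)
approx_cloud_calls = round(0.4952 * 733)
```

With the reported forwarding rate, the reduction is 1 - 0.4952 = 0.5048, and the forwarded share of the 733 test images works out to roughly 363 cloud calls.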

2605.14106 2026-05-15 cs.RO

Behavior Cloning for Active Perception with Low-Resolution Egocentric Vision

Anthony Bilic, Chen Chen, Ladislau Bölöni

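AI summary This paper investigates whether behavior cloning can produce active perception in a structured object-finding task. Using a low-cost robot arm with a wrist-mounted low-resolution RGB camera, the model predicts joint commands directly from low-resolution images and, under closed-loop control, repositions to center a partially visible plant before triggering a grasp. Experiments show that low-resolution egocentric vision is sufficient for the task and that predicting relative joint deltas outperforms predicting absolute positions, indicating that vision-based behavior cloning can yield reproducible active perception.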

Details
English abstract

We investigate whether behavior cloning is sufficient to produce active perception in a structured object-finding task. A low-cost robot arm equipped with a wrist-mounted egocentric RGB camera must reposition to center a partially visible plant before triggering a grasp signal, requiring actions that improve future observations. The model predicts joint commands directly from low-resolution RGB images under closed-loop control. We show that low-resolution egocentric vision is sufficient for reliable task completion and that predicting relative joint deltas substantially outperforms absolute joint position prediction in our setting. These results demonstrate that visually grounded active perception can emerge from behavior cloning in a reproducible setting.

2605.14104 2026-05-15 cs.CV

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

Junchao Zhu, Ruining Deng, Junlin Guo, Tianyuan Yao, Chongyu Qu, Juming Xiong, Zhengyi Lu, Yanfan Zhu, Marilyn Lionts, Yuechen Yang, Yu Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

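AI summary This study proposes DUET, a dual-paradigm framework for predicting spatial transcriptomics data from histology images. DUET combines parametric prediction with memory-based retrieval under cellular inductive priors for more accurate gene expression inference. By introducing large-scale single-cell data as molecular constraints and designing a lightweight adapter that dynamically adjusts branch preference across spatial regions, DUET achieves state-of-the-art prediction performance on several public datasets.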

Details
English abstract

Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at https://github.com/Junchao-Zhu/DUET

2605.14089 2026-05-15 cs.AI

SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

Mingda Zhang, Tiesunlong Shen, Haoran Luo, Wenjin Liu, Zikai Xiao, Erik Cambria, Xiaoying Tang

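AI summary SkillFlow is a flow-based framework aimed at key challenges in agentic orchestration: strategy collapse, opaque credit assignment, and unguided skill evolution. A trainable Supervisor interacts with a structured environment over multiple turns; a Tempered Trajectory Balance loss preserves diverse strategies and gives transparent credit assignment, and a recursive skill evolution mechanism autonomously decides which skills to create, prune, or improve. Experiments show that SkillFlow significantly outperforms existing methods across multiple tasks.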

Comments 49 pages, 5 figures, 6 tables

Details
English abstract

In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework that takes a trainable Supervisor as the agent and a structured environment with a dynamic skill library and a frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss that samples trajectories in proportion to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines when to evolve, what skills to create or prune, and where decision gaps lie, closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision-making tasks. Our code is available at https://anonymous.4open.science/r/SkillFlow-E850.

2605.14075 2026-05-15 cs.LG cs.CL

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

Cristian Hinostroza, Rodrigo Toro Icarte, Christ Devia, Andres Carvallo De Ferari, Eugenio Herrera-Berg, Denis Parra, Jorge F Silva

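AI summary This paper argues that assessing layer relevance in large language models should go beyond the conventional cosine similarity measure. It shows that cosine similarity does not accurately reflect the real effect of removing a layer on model performance; the theoretical analysis shows that a layer with very low cosine similarity can still be crucial. The authors instead propose the actual drop in model accuracy after removing a layer as a more reliable metric. Although it is more costly to compute, it guides pruning and lightweight model design more accurately and matters for building interpretable LLMs.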

Comments Published at ICLR 2026

Details
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
English abstract

Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. In addition, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.
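
The two quantities being contrasted can be sketched on a toy model. This is an illustration only, not the paper's theoretical construction: the "model" is a list of plain Python functions, and the layer, classifier, and data are hypothetical. The layer below barely changes the direction of its input, so a similarity-based heuristic would treat it as nearly redundant, yet removing it flips every prediction.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def accuracy(layers, classify, inputs, labels):
    """Fraction of inputs classified correctly after applying all layers."""
    hits = 0
    for x, y in zip(inputs, labels):
        hidden = x
        for layer in layers:
            hidden = layer(hidden)
        hits += classify(hidden) == y
    return hits / len(inputs)


def accuracy_drop(layers, index, classify, inputs, labels):
    """Ablation metric: accuracy lost when the layer at `index` is removed."""
    ablated = layers[:index] + layers[index + 1:]
    return accuracy(layers, classify, inputs, labels) - accuracy(
        ablated, classify, inputs, labels
    )


# Toy layer: flips the sign of a small but decisive coordinate.
def flip_layer(hidden):
    return [hidden[0], -hidden[1]]


def classify(hidden):
    return int(hidden[1] > 0)


inputs = [[10.0, 0.1], [10.0, -0.1], [10.0, 0.2], [10.0, -0.3]]
labels = [0, 1, 0, 1]

similarity = cosine_similarity(inputs[0], flip_layer(inputs[0]))
drop = accuracy_drop([flip_layer], 0, classify, inputs, labels)
```

The ablation metric requires one full evaluation pass per removed layer, which is the computational cost the abstract acknowledges.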

2605.14074 2026-05-15 cs.LG

Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Mokshit Surana

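AI summary This study examines fairness in toxicity detection along three axes (ranking, calibration, and abstention) and compares several training methods and post-hoc mechanisms on each. It finds that conventional approaches such as Empirical Risk Minimization (ERM) are well calibrated in aggregate but show significant calibration disparities across identity subgroups, while training interventions improve ranking performance but can widen the calibration-fairness gap. Post-hoc methods such as temperature scaling and confidence-based abstention inherit the problems of the training stage and may introduce new unfairness. The paper stresses that real fairness requires a multi-axis view; optimizing a single axis does not ensure fair behavior in practice.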

Details
English abstract

Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n = 1000$). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration ($0.013$) but is significantly miscalibrated across all identity subgroups ($+0.029$ to $+0.134$). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC $+0.06$ to $+0.12$) but worsens the calibration-fairness gap by up to $+0.232$. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE $0.118$). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.
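
Finding (1), that aggregate calibration can hide subgroup miscalibration, can be illustrated with a small Expected Calibration Error routine. This is a sketch of one common binary variant of ECE (equal-width bins over the predicted positive probability). The paper's exact binning is not given in the abstract, and the function names and toy data are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: size-weighted mean gap between the average predicted
    probability and the observed positive rate within each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_prob - positive_rate)
    return ece


def subgroup_ece(probs, labels, groups, n_bins=10):
    """ECE computed separately for each subgroup label."""
    by_group = {}
    for p, y, g in zip(probs, labels, groups):
        group_probs, group_labels = by_group.setdefault(g, ([], []))
        group_probs.append(p)
        group_labels.append(y)
    return {
        g: expected_calibration_error(ps, ys, n_bins)
        for g, (ps, ys) in by_group.items()
    }


# Toy data: every prediction is 0.5; group A is always toxic, group B never.
probs = [0.5, 0.5, 0.5, 0.5]
labels = [1, 1, 0, 0]
groups = ["A", "A", "B", "B"]

aggregate = expected_calibration_error(probs, labels)
per_group = subgroup_ece(probs, labels, groups)
```

Here the pooled ECE is zero because the two groups' errors cancel, while each subgroup is miscalibrated by 0.5.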

2605.14073 2026-05-15 cs.LG cs.AI

AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification

Rayhaneh Shabani Nia, Ali Karkehabadi

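AI summary This paper presents AttnGen, an attention-guided training framework intended to improve the interpretability of genomic sequence classification models. It computes nucleotide-level importance scores with an attention mechanism and progressively suppresses low-contribution positions during training, so the model focuses on informative regions and relies less on noisy sequence elements. Experiments show that AttnGen classifies better than a conventional convolutional neural network on a standard benchmark, and perturbation analysis supports the validity of its importance scores, showing that the model depends heavily on a small set of key positions.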

Comments Accepted at IEEE CCGE 2026

Details
English abstract

Deep neural networks have achieved strong performance in genomic sequence classification; however, relating their predictions to biologically meaningful sequence patterns remains challenging. In this work, we present AttnGen, an attention-guided training framework that embeds interpretability directly into the optimization process. AttnGen computes nucleotide-level importance scores using an attention mechanism and progressively suppresses low-contribution positions during training. This encourages the model to focus its predictions on a compact set of informative regions while reducing reliance on noisy sequence elements. We evaluate AttnGen on the standardized demo_human_or_worm benchmark, a binary classification task over 200-nucleotide sequences. With moderate masking, AttnGen achieves a validation accuracy of 96.73%, outperforming a conventional CNN baseline with 95.83% accuracy, while also exhibiting faster convergence and improved training stability. To assess whether the learned importance scores reflect functionally relevant signal, we conduct perturbation-based analysis by removing high-saliency nucleotides. This causes accuracy to drop from 96.9% to near chance level on a 3,000-sequence evaluation set, indicating that the model relies on a relatively small subset of informative positions. Our analysis shows that masking 10–20% of positions provides the most favorable trade-off between predictive performance and interpretability. These results suggest that attention-guided masking not only improves classification performance but also reshapes how models distribute importance across sequence positions. Although this study focuses on short genomic sequences, the proposed approach may extend to more complex interpretable sequence modeling settings.

2605.14071 2026-05-15 cs.CL

Distribution Corrected Offline Data Distillation for Large Language Models

Yumeng Zhang, Zhengbang Yang, Yevin Nikhel Goonatilake, Zhuangdi Zhu

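AI summary This paper studies how to distill reasoning ability from large language models into smaller ones, particularly in resource-constrained settings. To address the distribution shift that existing offline distillation methods face, the authors propose a distribution-corrected offline data distillation framework that adaptively emphasizes teacher guidance better aligned with the student's inference distribution, reducing error accumulation during reasoning while keeping the efficiency and supervision quality of offline data. Experiments show improved reasoning accuracy and stability on several mathematical reasoning benchmarks.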

Details
English abstract

Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve reasoning ability in resource-constrained settings. Existing approaches face a fundamental trade-off. Offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference it autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on the mathematical reasoning benchmarks GSM8K, MATH, and MATH500, and on harder held-out competition-style tasks including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

2605.14069 2026-05-15 cs.LG

SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

Mohammad R. Rezaei, Tejas Balaji, Rahul G. Krishnan

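AI summary This paper proposes SurF, a generative model for forecasting multivariate irregular time series. Based on the Time Rescaling Theorem, it builds a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, allowing heterogeneous event-stream data to be modeled in a unified way. It also introduces three efficient parameterizations of the cumulative intensity and a Transformer-based encoder for multi-dataset pretraining. Experiments show that SurF matches or exceeds existing methods on several real-world datasets, an initial step toward foundation models over asynchronous event streams.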

Details
English abstract

Irregularly sampled multivariate event streams remain a stubbornly difficult modality for generative modeling: tokenization-based approaches break down when inter-event intervals vary by orders of magnitude, and neural temporal point processes are bottlenecked by window-level numerical quadrature. We propose (i) SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d. unit-rate exponential noise, enabling a single model to be trained across heterogeneous event-stream datasets; (ii) three efficient parameterizations of the cumulative intensity that scale to long sequences; and (iii) a Transformer-based encoder for multi-dataset pretraining. On six real-world benchmarks, SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest specialist on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets and beats every baseline on Amazon and Earthquake, an initial step toward foundation models over asynchronous event streams.
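
The bijection given by the Time Rescaling Theorem can be sketched with a fixed, closed-form cumulative intensity. Assumptions are stated loudly here: the choice Λ(t) = t² (intensity 2t, independent of history) and all function names are hypothetical and chosen only because the inverse is available analytically. SurF learns its cumulative intensity, and the abstract does not describe the three parameterizations in detail.

```python
import math
import random


def cumulative_intensity(t):
    """Toy cumulative intensity Lambda(t) = t^2 (assumed, not learned)."""
    return t * t


def inverse_cumulative_intensity(u):
    """Inverse of the toy cumulative intensity."""
    return math.sqrt(u)


def events_to_noise(times):
    """Forward map: event times -> increments of Lambda, which the theorem
    says are i.i.d. unit-rate exponential if Lambda matches the process."""
    noise = []
    previous = 0.0
    for t in times:
        current = cumulative_intensity(t)
        noise.append(current - previous)
        previous = current
    return noise


def noise_to_events(noise):
    """Inverse map: unit-rate exponential noise -> event times."""
    times = []
    total = 0.0
    for tau in noise:
        total += tau
        times.append(inverse_cumulative_intensity(total))
    return times


rng = random.Random(0)
noise = [rng.expovariate(1.0) for _ in range(5)]
events = noise_to_events(noise)      # sampling: noise -> event sequence
recovered = events_to_noise(events)  # encoding: event sequence -> noise
```

Sampling needs only the inverse of the cumulative intensity, which is presumably why the abstract emphasizes efficient parameterizations of it rather than of the intensity itself.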