arXivDaily arXiv Daily Academic Digest, updated Monday through Friday
2509.25584 2026-05-11 cs.AI cs.CL cs.CV cs.IT cs.LG math.IT

Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models

Max Hartman, Vidhata Jayaraman, Moulik Choraria, Akhil Bhimaraju, Lav R. Varshney

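AI summary Vision-language models perform well across many tasks, but their large size makes inference costly. This paper proposes a unified theoretical framework for determining the redundancy conditions under which certain layers can be skipped to improve efficiency without sacrificing performance; at its core are experimentally verifiable and interpretable measures of redundancy. The study corroborates the redundancy of early and late vision tokens and provides a theoretical basis for modern layer-skipping techniques, unifying the ideas behind them.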

Abstract

Vision-language models achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work has shown that multimodal processing contains significant redundancies, making it possible to skip certain layers with minimal performance loss. Yet current pruning techniques remain ad-hoc, relying on heuristics or hyperparameter sweeps rather than principled criteria for determining when layer skipping is beneficial. In this paper, we propose a unified framework that characterizes the redundancy conditions under which pruning can enhance efficiency without sacrificing performance. Central to our approach are experimentally verifiable and interpretable notions of redundancy that can be evaluated without requiring downstream task performance as a metric. Applying this framework, we corroborate prior findings that both early and late vision tokens are redundant across models, and we validate our conditions by showing they align with actual performance degradation. Beyond these empirical results, our framework provides a theoretically grounded understanding of redundancy in VLMs and unifies many of the ideas behind modern layer-skipping techniques.

2509.24789 2026-05-11 cs.LG stat.ML

Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting

Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu

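AI summary This paper presents Fidel-TS, a high-fidelity multimodal benchmark for time series forecasting that aims to address the problems of existing datasets in scale, frequency, data contamination, and information leakage. The benchmark follows the core principles of data sourcing integrity, leak-free design, and structural clarity; it reveals the limitations of prior benchmarks and offers new insights for evaluating a range of unimodal and multimodal forecasting models and large language models.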

Comments new version

Abstract

The evaluation of time series forecasting models is hindered by a lack of high-quality benchmarks, leading to overestimated assessments of progress. Existing datasets suffer from issues ranging from small-scale, low-frequency, pre-training data contamination in unimodal designs to the temporal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, leak-free design, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from these principles. Our experiments reveal the limitations of prior benchmarks and the potential discrepancies in model evaluation, providing new insights into multiple existing unimodal and multimodal forecasting models and LLMs across various evaluation tasks.

2509.22813 2026-05-11 cs.CV

TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses

Sahar Dastani, Ali Bahri, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Mehrdad Noori, David Osowiechi, Samuel Barbeau, Ismail Ben Ayed, Herve Lombaert, Christian Desrosiers

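AI summary This paper proposes TRUST, a test-time adaptation method based on uncertainty-guided SSM traversals that aims to improve the generalization of state space models (SSMs) under distribution shift. The method generates multiple causal perspectives of the input image, uses model predictions as pseudo-labels to update Mamba-specific parameters, and integrates information across perspectives by parameter averaging. TRUST is described as the first method to fully exploit the distinctive architectural properties of SSMs; experiments on seven benchmarks show that it improves robustness and outperforms existing test-time adaptation methods.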

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.

2509.21654 2026-05-11 cs.LG cs.AI cs.CC

Limitations on Accurate, Trusted, Human-level Reasoning

Rina Panigrahy, Vatsal Sharan

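AI summary This paper examines a fundamental tension among accuracy, trust, and human-level reasoning in AI systems. It defines accuracy and trust in a strict mathematical sense and shows that an accurate and trusted AI system cannot achieve human-level reasoning, because there are task instances that a human can solve easily but the system cannot. The result draws on the proof ideas behind Gödel's incompleteness theorems and Turing's halting problem; the key step is formalizing trust, which separates a system's intrinsic property from its epistemic status.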

Comments 19 pages, 1 figure

Abstract

We identify a fundamental incompatibility between the goals of accuracy, trust, and human-level reasoning in artificial intelligence (AI) systems, for strict mathematical definitions of these notions. We define accuracy of a system as the property that it never makes any false claims when it has the ability to abstain from making a prediction on any input, and trust as the assumption that the system is accurate. We define human-level reasoning as the property of an AI system always matching or exceeding human capability. Our core finding is that -- for our formal definitions of these notions -- an accurate and trusted AI system cannot be a human-level reasoning system: for such an accurate, trusted system there are task instances which are easily and provably solvable by a human but not by the system. Our proofs draw parallels to Gödel's incompleteness theorems and Turing's proof of the undecidability of the halting problem, and can be regarded as interpretations of Gödel's and Turing's results. Key to our proof is the formalization of the notion of trust, which allows us to separate the intrinsic property of a system (being accurate) from its epistemic status (being trusted).

2509.21637 2026-05-11 cs.LG

BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

Feng Yu, Jia Hu, Geyong Min

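AI summary This paper proposes BoHA, a parameter-efficient fine-tuning method that uses a blockwise Hadamard product adapter to train only a small number of task-specific parameters while the pretrained weights stay frozen. BoHA partitions the frozen weight into a grid of blocks and learns an independent low-rank factor in each block, improving performance while keeping a total rank comparable to LoRA. Experiments show that BoHA outperforms LoRA on several large language models and retains performance better in continual-learning tasks.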

Abstract

Parameter-efficient fine-tuning (PEFT) of large language models trains a small task-specific parameter set while keeping the pretrained model frozen. The dominant Low-Rank Adaptation (LoRA) family makes this trade-off practical; however, evaluations under the same parameter budget assess single-task accuracy. In sequential adaptation settings, such evaluations should also measure how well performance on the first-stage task is retained after subsequent fine-tuning. To address this gap, we introduce BoHA, a blockwise $W_0$-coupled Hadamard product adapter that treats spatial support as an explicit design axis. BoHA partitions the frozen weight $W_0$ into a $b{\times}b$ grid and learns an independent low-rank Hadamard product factor in each block, preserving a matched LoRA-equivalent total rank with adapter-free merged inference. On a synthetic target, BoHA at per-block rank $r_b{=}1$ exactly reconstructs an update that requires rank $b^2$ under the global $W_0$-coupled Hadamard parameterization. Across Llama-3.2-1B/3B, Mistral-7B, and Gemma-2-9B on commonsense and arithmetic reasoning tasks, BoHA outperforms LoRA across all matched-budget single-task averages and remains competitive with the strongest Hadamard baseline. On a Llama-3.2-3B commonsense $\to$ arithmetic continual-learning diagnostic, BoHA retains $57.66\%$ first-stage accuracy and exceeds the $W_0$-free additive-control mean by $15.23\%$ under matched second-stage plasticity. These results demonstrate that blockwise $W_0$-coupled Hadamard adaptation is a competitive PEFT design choice when retention under sequential adaptation is part of the objective.
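The abstract does not spell out BoHA's parameterization, so the following is only a minimal NumPy sketch under stated assumptions: each block of a $b \times b$ grid gets its own rank-$r_b$ factor pair, the assembled multiplier `M` scales the frozen weight elementwise, and the merged weight is taken to be `W0 + W0 * M`. The function name, shapes, and merge rule are illustrative assumptions, not the authors' implementation. The example also checks the structural point behind the synthetic target: a multiplier assembled from rank-1 blocks can have global rank $b^2$.

```python
import numpy as np

def blockwise_multiplier(A, B):
    """Assemble a full-size multiplier from per-block low-rank factors.

    A[i][j] has shape (rows_per_block, r_b) and B[i][j] has shape
    (r_b, cols_per_block); block (i, j) of the result is A[i][j] @ B[i][j].
    """
    b = len(A)
    return np.block([[A[i][j] @ B[i][j] for j in range(b)] for i in range(b)])

rng = np.random.default_rng(0)
b, block_size, r_b = 2, 4, 1          # 2x2 grid, 4x4 blocks, per-block rank 1

W0 = rng.normal(size=(b * block_size, b * block_size))   # frozen weight
A = [[rng.normal(size=(block_size, r_b)) for _ in range(b)] for _ in range(b)]
B = [[rng.normal(size=(r_b, block_size)) for _ in range(b)] for _ in range(b)]

M = blockwise_multiplier(A, B)        # blockwise low-rank multiplier
W_merged = W0 + W0 * M                # assumed merge rule (Hadamard-coupled update)

# Each block has rank 1, yet the assembled multiplier generically has rank b**2.
global_rank = np.linalg.matrix_rank(M)
print(global_rank)
```

Because the update is folded into `W_merged`, inference in this sketch needs no separate adapter branch, in line with the adapter-free merged inference the abstract describes.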

2509.21172 2026-05-11 cs.LG econ.EM math.OC stat.ML

Inverse Reinforcement Learning with Just Classification and a Few Regressions

Lars van der Laan, Nathan Kallus, Aurelien Bibaut

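AI summary This paper studies reward recovery in inverse reinforcement learning under the maximum-entropy model and proposes GenPQR, a general method that needs only classification and a few regressions, with no dependence on a particular neural network architecture or on anchor-action restrictions. GenPQR modularly estimates the behavior policy, computes the soft Q-function, and recovers the normalized reward. Theoretical analysis gives finite-sample guarantees under function approximation, and experiments show that it matches or improves on DeepPQR in reward recovery while being more flexible and modular.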

Abstract

Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward--value pairs can rationalize the same actions. Meaningful reward recovery therefore requires a normalization, yet existing normalized IRL methods often rely on anchor-action restrictions or specialized neural architectures. We study reward recovery in the maximum-entropy, or Gumbel-shock, model under a broad class of statewise affine normalizations, with anchor-action constraints as a special case. This yields Generalized Policy-to-$Q$-to-Reward (GenPQR), a modular procedure that estimates the behavior policy, evaluates its soft $Q$-function through the Bellman equation, and recovers the normalized reward. Both stages can be implemented with off-the-shelf classification and regression methods. We prove modular finite-sample guarantees under general function approximation, with separate policy-estimation and $Q$-estimation errors. As a concrete instantiation, we study GenPQR with fitted $Q$-evaluation, reducing IRL to policy estimation followed by regression. Experiments show that GenPQR matches or improves reward recovery relative to DeepPQR while remaining simpler and more modular. Compared with DeepPQR, our theory goes beyond anchor actions, accommodates large and continuous action spaces, makes coverage requirements explicit, and is not tied to a specific neural-network architecture or training procedure.
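The three steps named in the abstract (policy, soft $Q$-function, reward) can be illustrated on a tabular toy problem. The sketch below is not the paper's estimator: it assumes known transitions `P`, an exactly known policy, and the anchor-action normalization $r(s, a_0) = 0$ (the special case the abstract mentions), whereas GenPQR works with estimated policies and general function approximation. All function and variable names are illustrative.

```python
import numpy as np

def logsumexp_rows(Q):
    """Numerically stable log-sum-exp over actions (axis 1)."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(Q - m[:, None]).sum(axis=1))

def soft_optimal_policy(r, P, gamma, iters=2000):
    """Soft value iteration; returns pi(a|s) = exp(Q(s,a) - V(s))."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)       # P has shape (S, A, S), so P @ V is (S, A)
        V = logsumexp_rows(Q)
    return np.exp(Q - V[:, None])

def recover_reward(pi, P, gamma, anchor=0):
    """Policy -> soft value -> reward, assuming r(s, anchor) = 0 for all s."""
    S = pi.shape[0]
    # With r(s, a0) = 0: V(s) = gamma * E[V(s') | s, a0] - log pi(a0 | s).
    g = -np.log(pi[:, anchor])
    V = np.linalg.solve(np.eye(S) - gamma * P[:, anchor, :], g)
    Q = V[:, None] + np.log(pi)       # log pi(a|s) = Q(s,a) - V(s)
    return Q - gamma * (P @ V)        # r(s,a) = Q(s,a) - gamma * E[V(s') | s, a]

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)     # normalize to valid transition probabilities
r_true = rng.normal(size=(S, A))
r_true[:, 0] = 0.0                    # impose the anchor-action normalization

pi = soft_optimal_policy(r_true, P, gamma)
r_hat = recover_reward(pi, P, gamma)
print(np.abs(r_hat - r_true).max())   # recovery error in this idealized setting
```

Without the normalization, the same policy is consistent with many rewards, which is the identification issue the abstract opens with; fixing $r(s, a_0) = 0$ is what makes the linear solve for `V` well posed here.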

2509.12047 2026-05-11 cs.CV cs.AI

A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset

Haiyu Yang, Enhong Liu, Jennifer Sun, Sumit Sharma, Meike van Leerdam, Sebastien Franceschini, Puchun Niu, Miel Hostens

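AI summary This paper presents a computer vision pipeline for individual-level animal behavior analysis, aimed at the inefficiency and subjectivity of traditional manual observation in agricultural settings. The method combines zero-shot object detection, motion-aware segmentation and tracking, and advanced feature extraction with vision transformers; it handles challenges such as animal occlusion and group housing, and reaches 94.2% overall accuracy on the Edinburgh Pig Behavior Video Dataset, significantly outperforming existing methods. The modular design is extensible and offers an automated, objective, continuous analysis approach for precision pig farming and welfare assessment.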

Comments 9 figures

Abstract

Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware segmentation and tracking, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios, as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with a 93.3% identity preservation (IDF1) score and an 89.3% average precision (AP) for object detection. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.

2509.11777 2026-05-11 cs.CL cs.LG

User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums

Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, Helena Holmström Olsson

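AI summary This paper introduces the User eXperience Perception Insights Dataset (UXPID), which extracts and synthesizes anonymized user feedback from a public industrial automation forum. It contains 7130 feedback branches with metadata, annotated by a large language model for UX insights, user expectations, severity, sentiment, and topic classification. The dataset is intended to support research on user requirements analysis, user experience, and AI-driven feedback processing, especially where privacy and licensing restrictions make real data hard to obtain, and it serves as a resource for NLP research in industrial product support and software engineering.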

Abstract

Customer feedback in industrial forums offers rich but underexplored insights into real-world product experience. Yet systematic analysis remains challenging due to unstructured, domain-specific content and the scarcity of high-quality labeled datasets. This paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 synthesized and anonymized user feedback branches extracted from a public industrial automation forum. Each JSON record contains multi-post comments enriched with metadata and annotated by a large language model (LLM) for UX insights, user expectations, severity ratings, sentiment, and topic classifications. UXPID is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data. It supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in technical forums, providing a valuable resource for advancing NLP methods within industrial product support and software engineering domains.

2509.08318 2026-05-11 cs.CV

CalexNet: Soft Cascade-Aligned Training and Calibration for Lightweight Early-Exit Branches

Yehudit Aperstein, Alexander Apartsin

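AI summary CalexNet is a soft cascade-aligned training and calibration method for lightweight early-exit branches, aimed at the train-inference mismatch that arises when a frozen convolutional backbone is used for adaptive inference. It improves early-exit branches through continuously weighted importance sampling, precision-threshold calibration on the samples that actually survive the cascade, and a temperature-scaled KL divergence objective. Experiments show that CalexNet matches or exceeds existing baselines on multiple datasets, with the largest gains when computation is reduced by 30% to 70%, and it requires no change to the network architecture at inference time.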

Comments 19 pages, 6 figures

Abstract

Early-exit cascades over a frozen convolutional backbone enable adaptive inference but suffer from three sources of train-inference mismatch: branches train on samples they will never see at inference, their per-class precision thresholds are calibrated on the wrong distribution, and the standard cross-entropy target on backbone argmax labels discards the backbone's uncertainty signal. We close all three gaps with CalexNet (Cascade-Aligned Early eXits), a training-recipe-only modification: branches train under continuously-weighted importance sampling that matches the cascade-survivor distribution; per-class precision thresholds are calibrated on the actual cascade-survivor subset of the validation set; and the classification head is trained against the backbone's full softmax via a temperature-scaled KL objective. Combined with an augmented prototype-pooling branch head, CalexNet is evaluated on ResNet18 and ResNet50 backbones across CIFAR-100 (20-superclass coarse, the harder primary setting) and CINIC-10 (10-class, the easier cross-validation counterpart). On the accuracy-FLOPs Pareto frontier, CalexNet matches or exceeds three published baselines (PTEEnet, ZTW, BoostNet) and a within-paper "no-alignment, no-KD" reference. The largest gains appear in the practically relevant 30-70% FLOPs-reduction regime and are stable across n=3 training seeds. CalexNet requires no inference-time architectural change and is a drop-in for any frozen-backbone early-exit cascade.
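The third ingredient, training a branch head against the backbone's full softmax rather than its argmax label, can be sketched as a temperature-scaled KL divergence. This is a generic knowledge-distillation style loss written in NumPy; the `T**2` scaling follows a common distillation convention and is an assumption here, as are the function names and the example logits. The abstract does not give the exact form CalexNet uses.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax over the last axis at temperature T (numerically stabilized)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_kl(backbone_logits, branch_logits, T=2.0):
    """Mean KL(backbone || branch) between temperature-softened distributions."""
    p = softmax(backbone_logits, T)           # soft target from the frozen backbone
    q = softmax(branch_logits, T)             # early-exit branch prediction
    kl_per_sample = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return (T ** 2) * kl_per_sample.mean()    # T**2 scaling is an assumed convention

backbone_logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
branch_logits = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])

loss = temperature_kl(backbone_logits, branch_logits)
print(loss)
```

Unlike cross-entropy on argmax labels, this target keeps the backbone's relative confidence across classes, which is the uncertainty signal the abstract says the standard objective discards.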

2509.05276 2026-05-11 cs.LG cs.AI cs.CL

SpikingBrain: Spiking Brain-inspired Large Models

Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Han Xu, Zehao Liu, Bohan Sun, Yuhong Chou, Xuerui Qiu, Anlin Deng, Anjie Hu, Shurong Wang, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li

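AI summary This paper presents SpikingBrain, a family of brain-inspired large language models aimed at the efficiency bottlenecks of mainstream Transformer models in training computation and inference memory. Trained on a MetaX GPU cluster, the models introduce adaptive spiking neurons, an efficient training pipeline, and a dedicated spike coding framework, achieving stable and efficient training on a non-NVIDIA platform. Experiments show that SpikingBrain performs comparably to open-source Transformer baselines while substantially improving long-context efficiency, with (partially) constant memory and event-driven spiking behavior at inference, demonstrating the potential of brain-inspired mechanisms for efficient, scalable large models.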

Abstract

Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.

2509.03738 2026-05-11 cs.LG cs.AI eess.SP stat.ML

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

Bahareh Tolooshams, Ailsa Shen, Anima Anandkumar

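AI summary This paper proposes sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate in function spaces rather than fixed-dimensional Euclidean spaces, to improve mechanistic interpretability. By introducing the functional representation hypothesis, SAE-NOs parameterize concepts as functions, capturing not only whether a concept is present but also how and where it is expressed across the input domain. SAE-FNO, the instantiation based on Fourier neural operators, performs better on data with spatial or frequency structure: it learns localized patterns, uses concepts efficiently, and remains stable and generalizes across resolutions and domain sizes.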

Comments Tolooshams and Shen contributed equally. Preprint. An earlier version was presented as an Oral and Extended Abstract at the Workshop on Unifying Representations in Neural Models (UniReps 2025) at NeurIPS

Abstract

We introduce sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate in function spaces rather than fixed-dimensional Euclidean representations. We formalize the functional representation hypothesis, where data are explained through sparse compositions of structured functions. Unlike standard SAEs that represent concepts with scalar activations, SAE-NOs parameterize concepts as functions, enabling representations that capture not only a concept's presence, but also how and where it is expressed across the input domain. We achieve this through joint sparsity: concept sparsity selects active concepts, while domain sparsity governs where they are expressed. We instantiate this framework using Fourier neural operators (SAE-FNOs), parameterizing concepts as integral operators in the Fourier domain. This functional and spectral parameterization is particularly advantageous when data exhibit spatial structure across scales or when concepts are frequency-structured. We characterize SAE-FNO on vision data and demonstrate that it learns localized patterns, uses concepts more efficiently, and exhibits stable concept characteristics across sparsity levels. We further show that SAE-FNO adapts to changes in domain size and generalizes across discretizations, operating at resolutions beyond those seen during training, where standard SAEs fail. We also introduce lifting into SAEs and show theoretically and empirically that it acts as a preconditioner that accelerates optimization. Overall, our results show that moving from vector-valued to functional parameterizations, with concept and domain sparsity, extends SAEs from representing concept presence to modeling structured concept expression, highlighting the importance of parameterization.

2509.02826 2026-05-11 cs.LG cs.AI stat.AP stat.CO

Ensemble Learning for Healthcare: A Comparative Analysis of Hybrid Voting and Ensemble Stacking in Obesity Risk Prediction

Towhidul Islam, Md Sumon Ali

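AI summary This study compares hybrid majority voting and ensemble stacking for obesity risk prediction to assess their accuracy and efficiency. Experiments on two datasets find that ensemble stacking has stronger predictive capability under complex data distributions, while hybrid majority voting is a robust alternative. The study also examines the complementary strengths of different machine learning algorithms within ensemble methods, offering guidance for model selection in healthcare.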

Comments There are some errors found

Abstract

Obesity is a critical global health issue driven by dietary, physiological, and environmental factors, and is strongly associated with chronic diseases such as diabetes, cardiovascular disorders, and cancer. Machine learning has emerged as a promising approach for early obesity risk prediction, yet a comparative evaluation of ensemble techniques -- particularly hybrid majority voting and ensemble stacking -- remains limited. This study aims to compare hybrid majority voting and ensemble stacking methods for obesity risk prediction, identifying which approach delivers higher accuracy and efficiency. The analysis seeks to highlight the complementary strengths of these ensemble techniques in guiding better predictive model selection for healthcare applications. Two datasets were utilized to evaluate three ensemble models: Majority Hard Voting, Weighted Hard Voting, and Stacking (with a Multi-Layer Perceptron as meta-classifier). A pool of nine Machine Learning (ML) algorithms, evaluated across a total of 50 hyperparameter configurations, was analyzed to identify the top three models to serve as base learners for the ensemble methods. Preprocessing steps involved dataset balancing and outlier detection, and model performance was evaluated using Accuracy and F1-Score. On Dataset-1, weighted hard voting and stacking achieved nearly identical performance (Accuracy: 0.920304, F1: 0.920070), outperforming majority hard voting. On Dataset-2, stacking demonstrated superior results (Accuracy: 0.989837, F1: 0.989825) compared to majority hard voting (Accuracy: 0.981707, F1: 0.981675) and weighted hard voting, which showed the lowest performance. The findings confirm that ensemble stacking provides stronger predictive capability, particularly for complex data distributions, while hybrid majority voting remains a robust alternative.
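The difference between majority and weighted hard voting can be shown with a few lines of NumPy. This is a generic sketch, not the study's code: the three base learners' predictions and the weights are made-up values, and how the authors set their weights is not stated in the abstract. Stacking, which trains a meta-classifier on base-learner outputs, is not shown.

```python
import numpy as np

def hard_vote(preds, weights=None):
    """Combine class-label predictions from several models.

    preds has shape (n_models, n_samples) with integer class labels.
    With weights=None every model counts equally (majority hard voting);
    otherwise each model's vote is scaled by its weight (weighted hard voting).
    Ties go to the lowest class index.
    """
    preds = np.asarray(preds)
    n_models, n_samples = preds.shape
    if weights is None:
        weights = np.ones(n_models)
    weights = np.asarray(weights, dtype=float)
    scores = np.zeros((n_samples, preds.max() + 1))
    for m in range(n_models):
        scores[np.arange(n_samples), preds[m]] += weights[m]
    return scores.argmax(axis=1)

# Hypothetical predictions from three base learners on three samples.
preds = [[0, 1, 1],
         [0, 0, 1],
         [1, 0, 1]]

majority = hard_vote(preds)
weighted = hard_vote(preds, weights=[0.6, 0.2, 0.2])
print(majority.tolist(), weighted.tolist())
```

On the second sample the two schemes disagree: two of three models vote for class 0, but the first model's weight of 0.6 outweighs their combined 0.4.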

2508.21466 2026-05-11 cs.LG cs.IT math.IT

Normalized Maximum Likelihood Code-Length on Riemannian Data Spaces

Kota Fukuzawa, Atsushi Suzuki, Kenji Yamanishi

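AI summary As graph data grows in scale, researchers are paying increasing attention to data on Riemannian manifolds beyond Euclidean space. This paper proposes a normalized maximum likelihood for Riemannian manifolds (Rm-NML) that accounts for the manifold's geometric structure, is invariant under coordinate transformations, and coincides with the conventional NML in Euclidean space. The study also extends NML computation techniques to Riemannian manifolds, derives a way to simplify the computation of Rm-NML on Riemannian symmetric spaces such as hyperbolic spaces, and illustrates the method by computing the Rm-NML for normal distributions on hyperbolic spaces.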

Comments 19 pages. This is a preprint of an article accepted for publication in the IEEE Transactions on Information Theory

Abstract

In recent years, with the large-scale expansion of graph data, there has been an increased focus on Riemannian manifold data spaces other than Euclidean space. In particular, the development of hyperbolic spaces has been remarkable, and they have high expressive power for graph data with hierarchical structures. Normalized Maximum Likelihood (NML) is employed in regret minimization and model selection. However, existing formulations of NML have been developed primarily in Euclidean spaces and are inherently dependent on the choice of coordinate systems, making it non-trivial to extend NML to Riemannian manifolds. In this study, we define a new NML that reflects the geometric structure of Riemannian manifolds, called the Riemannian manifold NML (Rm-NML). This Rm-NML is invariant under coordinate transformations and coincides with the conventional NML under the natural parameterization in Euclidean space. We extend existing computational techniques for NML to the setting of Riemannian manifolds. Furthermore, we derive a method to simplify the computation of Rm-NML on Riemannian symmetric spaces, which encompass data spaces of growing interest such as hyperbolic spaces. To illustrate the practical application of our proposed method, we explicitly computed the Rm-NML for normal distributions on hyperbolic spaces.
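For reference, the conventional NML distribution and its code length for a parametric model $p(x; \theta)$ with maximum likelihood estimator $\hat{\theta}(x)$ take the standard form below. The density and the normalizing integral are written relative to the coordinates chosen for the data, which is presumably where the coordinate dependence noted in the abstract enters. The abstract does not state the Rm-NML formula itself, so it is not reproduced here.

```latex
p_{\mathrm{NML}}(x) = \frac{p\bigl(x; \hat{\theta}(x)\bigr)}{\int p\bigl(y; \hat{\theta}(y)\bigr)\, dy},
\qquad
L_{\mathrm{NML}}(x) = -\log p\bigl(x; \hat{\theta}(x)\bigr) + \log \int p\bigl(y; \hat{\theta}(y)\bigr)\, dy .
```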

2508.20909 2026-05-11 cs.CV eess.IV

Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation

Haoyue Li, Yifan Gao, Feng Yuan, Xiaosong Wang, Xin Gao

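AI summary This paper proposes Dino U-Net, a new encoder-decoder architecture that exploits the high-fidelity dense features of the DINOv3 vision foundation model for medical image segmentation. The encoder is built on a frozen DINOv3 backbone with an adapter module that fuses semantic features with low-level spatial detail, and a fidelity-aware projection module improves the quality of the projected features. Experiments show that Dino U-Net achieves state-of-the-art performance on seven public medical image datasets, and that segmentation accuracy keeps improving as the backbone grows, supporting the approach's effectiveness and scalability.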

Comments MICCAI 2026

Abstract

Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.

2508.16571 2026-05-11 cs.AI cs.IR cs.MA

LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

Vlad Vinogradov, Alisa Vinogradova, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Roman Doronin, Andrey Doronichev

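AI summary This paper describes an LLM-based agent system for rapidly mapping the competitive landscape in drug asset due diligence. Given a disease indication, the system retrieves the relevant drugs and extracts standardized attributes, addressing challenges such as fragmented data, inconsistent naming, and multimodality. The authors build a structured evaluation dataset and introduce an LLM-as-a-judge agent that validates competitor drugs to improve precision. In experiments the system reaches 83% recall on competitor drugs, exceeding existing tools, and it is deployed in enterprise settings, where it substantially speeds up analysis.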

Abstract

In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren't capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to $\sim$3 hours ($\sim$20x) for the competitive analysis.

2508.15294 2026-05-11 cs.AI cs.CL cs.MA

A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents

Gaoke Zhang, Bo Wang, Yunlong Ma, Dongming Zhao, Zifei Yu

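AI summary To address the low quality of long-term memory content generated for agents, this paper proposes a multi-memory segment system (MMS) grounded in cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments and builds retrieval memory units and contextual memory units in one-to-one correspondence, so that the retrieval phase can match the relevant units precisely and improve response quality. Experiments show that the method is effective and practical for long-term memory generation and retrieval.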

Comments The content has been significantly revised and the author has also changed. Therefore, the paper will be withdrawn for revision and then uploaded after the completion of the modifications

Abstract

In the current field of agent memory, extensive explorations have been conducted in the area of memory retrieval, yet few studies have focused on exploring the memory content. Most research simply stores summarized versions of historical dialogues, as exemplified by methods like A-MEM and MemoryBank. However, when humans form long-term memories, the process involves multi-dimensional and multi-component generation, rather than merely creating simple summaries. The low-quality memory content generated by existing methods can adversely affect recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multi-memory segment system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments, and constructs retrieval memory units and contextual memory units based on these segments, with a one-to-one correspondence between the two. During the retrieval phase, MMS matches the most relevant retrieval memory units based on the user's query. Then, the corresponding contextual memory units are obtained as the context for the response stage to enhance knowledge, thereby effectively utilizing historical data. We conducted experiments on the LoCoMo dataset and further performed ablation experiments, experiments on the robustness regarding the number of input memories, and overhead experiments, which demonstrated the effectiveness and practical value of our method.
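The retrieval step described above, where a query is matched against retrieval memory units and the paired contextual memory units are returned, can be sketched as follows. This is a toy illustration only: the Jaccard token-overlap score stands in for whatever matching the authors use, and the memory entries, field names, and function names are hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-overlap similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_context(query, memory, k=1):
    """Rank by similarity to the retrieval unit, return the paired context units."""
    q = tokenize(query)
    ranked = sorted(
        memory,
        key=lambda unit: jaccard(q, tokenize(unit["retrieval"])),
        reverse=True,
    )
    return [unit["context"] for unit in ranked[:k]]

# Each entry pairs one retrieval unit with one contextual unit (one-to-one).
memory = [
    {"retrieval": "alice moved to berlin",
     "context": "Alice said she relocated to Berlin for a new job."},
    {"retrieval": "bob adopted a dog",
     "context": "Bob mentioned adopting a beagle named Max."},
]

result = retrieve_context("where did alice move", memory)
print(result)
```

Keeping the two unit types separate lets the matching text stay short while the text handed to the response stage carries the fuller detail.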

2508.06819 2026-05-11 cs.CV

VesselRW: Weakly Supervised Subcutaneous Vessel Segmentation via Learned Random Walk Propagation

Ayaan Nooruddin Siddiqui, Mahnoor Zaidi, Ayesha Nazneen Shahbaz, Priyadarshini Chatterjee, Krishnan Menon Iyer

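AI summary This study addresses scarce annotations, low contrast, and heavy noise in subcutaneous vessel segmentation from clinical images by proposing VesselRW, a weakly supervised learning framework. The method uses low-cost sparse annotations such as centerlines, dot markers, or short scribbles, generates dense per-pixel probabilistic supervision through a differentiable random walk label propagation model, and uses an uncertainty-weighted loss to improve robustness. The propagation model is trained jointly with a CNN segmenter, which learns vessel boundaries and continuity constraints without explicit edge supervision. Experiments on clinical datasets show that it outperforms conventional methods, substantially reducing annotation effort while preserving vessel topology.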

Comments arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions

Abstract

The task of parsing subcutaneous vessels in clinical images is often hindered by the high cost and limited availability of ground truth data, as well as the challenge of low contrast and noisy vessel appearances across different patients and imaging modalities. In this work, we propose a novel weakly supervised training framework specifically designed for subcutaneous vessel segmentation. This method utilizes low-cost, sparse annotations such as centerline traces, dot markers, or short scribbles to guide the learning process. These sparse annotations are expanded into dense probabilistic supervision through a differentiable random walk label propagation model, which integrates vesselness cues and tubular continuity priors driven by image data. The label propagation process results in per-pixel hitting probabilities and uncertainty estimates, which are incorporated into an uncertainty-weighted loss function to prevent overfitting in ambiguous areas. Notably, the label propagation model is trained jointly with a CNN-based segmentation network, allowing the system to learn vessel boundaries and continuity constraints without the need for explicit edge supervision. Additionally, we introduce a topology-aware regularizer that encourages centerline connectivity and penalizes irrelevant branches, further enhancing clinical applicability. Our experiments on clinical subcutaneous imaging datasets demonstrate that our approach consistently outperforms both naive sparse-label training and traditional dense pseudo-labeling methods, yielding more accurate vascular maps and better-calibrated uncertainty, which is crucial for clinical decision-making. This method significantly reduces the annotation workload while maintaining clinically relevant vessel topology.

2508.06816 2026-05-11 cs.CV

DualResolution Residual Architecture with Artifact Suppression for Melanocytic Lesion Segmentation

Vikram Singh, Kabir Malhotra, Rohan Desai, Ananya Shankaracharya, Priyadarshini Chatterjee, Krishnan Menon Iyer

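AI summary This paper proposes a dual-resolution residual network architecture designed for melanocytic lesion segmentation, targeting subtle texture and color variations, common imaging artifacts, and precise boundary localization in dermoscopic images. The method combines a high-resolution stream that preserves boundary detail with a multi-scale context stream, fuses their features through boundary-aware residual connections and channel attention, and adds a lightweight artifact suppression module and a multi-task training strategy to improve robustness and segmentation accuracy. Experiments show significant gains in boundary precision and clinically relevant segmentation metrics on public benchmarks over traditional encoder-decoder baselines, supporting automated melanoma assessment systems.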

Comments arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions

Details
English abstract

Lesion segmentation, in contrast to natural scene segmentation, requires handling subtle variations in texture and color, frequent imaging artifacts (such as hairs, rulers, and bubbles), and a critical need for precise boundary localization to aid in accurate diagnosis. The accurate delineation of melanocytic tumors in dermoscopic images is a crucial component of automated skin cancer screening systems and clinical decision support. In this paper, we present a novel dual-resolution architecture inspired by ResNet, specifically tailored for the segmentation of melanocytic tumors. Our approach incorporates a high-resolution stream that preserves fine boundary details, alongside a complementary pooled stream that captures multi-scale contextual information for robust lesion recognition. These two streams are closely integrated through boundary-aware residual connections, which inject edge information into deep feature maps, and a channel attention mechanism that adapts the model's sensitivity to color and texture variations in dermoscopic images. To tackle common imaging artifacts and the challenges posed by small clinical datasets, we introduce a lightweight artifact suppression block and a multi-task training strategy. This strategy combines the Dice-Tversky loss with an explicit boundary loss and a contrastive regularizer to enhance feature stability. This unified design enables the model to generate pixel-accurate segmentation masks without the need for extensive post-processing or complex pre-training. Extensive evaluation on public dermoscopic benchmarks reveals that our method significantly enhances boundary precision and clinically relevant segmentation metrics, outperforming traditional encoder-decoder baselines. This makes our approach a valuable component for building automated melanoma assessment systems.

2508.06805 2026-05-11 cs.CV

Edge Detection for Organ Boundaries via Top Down Refinement and SubPixel Upsampling

Aarav Mehta, Priya Deshmukh, Vikram Singh, Siddharth Malhotra, Krishnan Menon Iyer, Tanvi Iyer

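AI summary This paper proposes a method for precisely detecting organ boundaries in medical images, addressing the insufficient localization accuracy of conventional convolutional networks on medical imaging. It uses a top-down backward refinement architecture with sub-pixel upsampling to progressively fuse high-level semantic features with low-level detail, producing high-resolution, well-localized organ boundaries. Experiments show significantly improved boundary localization on several CT and MRI datasets, along with gains in downstream tasks such as organ segmentation and image registration.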

Comments arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions

Details
English abstract

Accurate localization of organ boundaries is critical in medical imaging for segmentation, registration, surgical planning, and radiotherapy. While deep convolutional networks (ConvNets) have advanced general-purpose edge detection to near-human performance on natural images, their outputs often lack precise localization, a limitation that is particularly harmful in medical applications where millimeter-level accuracy is required. Building on a systematic analysis of ConvNet edge outputs, we propose a medically focused crisp edge detector that adapts a novel top-down backward refinement architecture to medical images (2D and volumetric). Our method progressively upsamples and fuses high-level semantic features with fine-grained low-level cues through a backward refinement pathway, producing high-resolution, well-localized organ boundaries. We further extend the design to handle anisotropic volumes by combining 2D slice-wise refinement with light 3D context aggregation to retain computational efficiency. Evaluations on several CT and MRI organ datasets demonstrate substantially improved boundary localization under strict criteria (boundary F-measure, Hausdorff distance) compared to baseline ConvNet detectors and contemporary medical edge/contour methods. Importantly, integrating our crisp edge maps into downstream pipelines yields consistent gains in organ segmentation (higher Dice scores, lower boundary errors), more accurate image registration, and improved delineation of lesions near organ interfaces. The proposed approach produces clinically valuable, crisp organ edges that materially enhance common medical-imaging tasks.

2508.01994 2026-05-11 cs.CV

Deeply Dual Supervised learning for melanoma recognition

Rujosh Polma, Krishnan Menon Iyer

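AI summary This paper proposes a deeply dual supervised learning framework to improve melanoma recognition accuracy. A dual-pathway structure extracts fine-grained local features and global context at the same time, and a dual attention mechanism dynamically emphasizes critical features to capture subtle characteristics of melanoma. A multi-scale feature aggregation strategy improves robustness across image resolutions. Experiments show that the method significantly outperforms state-of-the-art methods on benchmark datasets, with higher detection accuracy and better resilience against false positives.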

Comments arXiv admin note: This submission has been withdrawn due to violation of arXiv policies for acceptable submissions

Details
English abstract

As the application of deep learning in dermatology continues to grow, the recognition of melanoma has garnered significant attention, demonstrating potential for improving diagnostic accuracy. Despite advancements in image classification techniques, existing models still face challenges in identifying subtle visual cues that differentiate melanoma from benign lesions. This paper presents a novel Deeply Dual Supervised Learning framework that integrates local and global feature extraction to enhance melanoma recognition. By employing a dual-pathway structure, the model focuses on both fine-grained local features and broader contextual information, ensuring a comprehensive understanding of the image content. The framework utilizes a dual attention mechanism that dynamically emphasizes critical features, thereby reducing the risk of overlooking subtle characteristics of melanoma. Additionally, we introduce a multi-scale feature aggregation strategy to ensure robust performance across varying image resolutions. Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods in melanoma detection, achieving higher accuracy and better resilience against false positives. This work lays the foundation for future research in automated skin cancer recognition and highlights the effectiveness of dual supervised learning in medical image analysis.

2507.21183 2026-05-11 cs.LG cs.AI cs.CL

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

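AI summary As large language models advance, preference optimization (PO) has become a central approach to aligning models with human preferences. This paper proposes MaPPO, a maximum a posteriori preference optimization method that explicitly incorporates prior reward knowledge into the optimization objective, improving alignment within the framework of DPO and its variants and mitigating the oversimplified binary classification of responses. MaPPO introduces no additional hyperparameters, supports both offline and online settings, and can be used as a plugin to improve existing DPO variants. Experiments show consistent alignment gains across several benchmarks without sacrificing computational efficiency.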

Details
English abstract

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Building on the paradigm employed by Direct Preference Optimization (DPO) and its variants of treating preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. It can also be used as a plugin for DPO variants, including the widely used SimPO, IPO, and CPO, where it produces consistent improvements. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
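
For readers unfamiliar with the MLE view of preference learning that the abstract builds on, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the MaPPO objective, which the abstract does not spell out. The function name, argument names, and the default `beta` are assumptions chosen for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative; not MaPPO)."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Hypothetical sequence log-probabilities for one (chosen, rejected) pair.
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0,
                ref_logp_chosen=-2.0, ref_logp_rejected=-2.0)
```

The loss depends on each pair only through which response was preferred, which is the binary treatment of responses that the abstract says MaPPO mitigates by adding prior reward estimates.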

2506.21582 2026-05-11 cs.CL cs.AI cs.HC

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Sam Yu-Te Lee, Chenyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

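AI summary VIDEE is a system that helps entry-level data analysts conduct advanced text analytics with intelligent agents, lowering the barrier to natural language processing and text analysis. It adopts a human-agent collaboration workflow with three stages (decomposition, execution, and evaluation), combining a Monte-Carlo Tree Search algorithm, executable analytics pipelines, and LLM-based evaluation and visualization. Experiments show that VIDEE improves the experience of non-expert users and reveal key design issues in human-agent collaboration, informing the development of intelligent text analytics systems.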

Details
English abstract

Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, and information extraction). We introduce VIDEE, a system that helps entry-level data analysts conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

2506.14399 2026-05-11 cs.CV cs.AI

Factored Classifier-Free Guidance

Tian Xia, Fabio De Sousa Ribeiro, Rajat R Rasal, Avinash Kori, Raghav Mehta, Ben Glocker

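AI summary This paper studies how to control attribute changes more precisely in counterfactual generation and proposes a new guidance method, Factored Classifier-Free Guidance (FCFG). By following a causal graph to control attributes individually, it addresses the spurious changes to non-target attributes caused by the single global scale of standard classifier-free guidance (CFG). Experiments show that FCFG significantly improves the soundness and reversibility of counterfactuals on natural and medical image datasets.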

Comments Accepted at ICML 2026

Details
English abstract

Counterfactual generation aims to simulate realistic hypothetical outcomes under causal interventions. Diffusion models have emerged as a powerful tool for this task, combining DDIM inversion with conditional generation and classifier-free guidance (CFG). In this work, we identify a key limitation of CFG for counterfactual generation: it prescribes a global guidance scale for all attributes, leading to significant spurious changes in inferred counterfactuals. To mitigate this, we propose Factored Classifier-Free Guidance (FCFG), a flexible and model-agnostic guidance technique that enables attribute-wise control following a causal graph. FCFG complements recent advances in classifier-free guidance and can be seamlessly extended to advanced guidance schemes such as CFG++ and APG. Our experiments demonstrate that FCFG significantly improves the axiomatic soundness of inferred counterfactuals across both natural and medical image datasets, mitigating spurious amplification effects, and enhancing counterfactual reversibility.
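
The limitation the abstract identifies is easier to see next to the standard CFG update, sketched below on plain Python lists. This is the usual global-scale rule, not FCFG. The function and variable names are assumptions, and some papers parameterize the scale as `1 + w` rather than `scale`.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on a noise prediction.

    eps_uncond: prediction with all conditioning dropped
    eps_cond:   prediction with all attributes supplied together
    scale:      a single global guidance scale
    """
    # Extrapolate from the unconditional prediction toward the conditional one.
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# scale = 0 recovers the unconditional prediction, scale = 1 the conditional one.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], scale=3.0)
```

Because `eps_cond` bundles every attribute into one conditional prediction, a single `scale` amplifies all of them at once. An attribute-wise scheme would need separate terms per attribute group, which is the kind of control the abstract attributes to FCFG.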

2506.13351 2026-05-11 cs.CL cs.AI cs.LG

Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra

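AI summary This paper studies the challenge of training large language models with reinforcement learning on unverifiable tasks and proposes a constrained RL framework. The framework optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality and introduces rubric gating to ensure that final answers meet task criteria. Experiments show that it outperforms existing methods on datasets from several domains, learns more efficiently, and respects feasibility constraints.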

Details
English abstract

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets spanning scientific writing, medicine, legal contracts, and finance, our framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.

2506.09816 2026-05-11 cs.LG

Identifiability Challenges in Sparse Linear Ordinary Differential Equations

Cecilia Casolo, Sören Becker, Niki Kilbertus

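AI summary This paper studies the identifiability of sparse linear ordinary differential equations (ODEs) in dynamical systems modeling. Prior work established that linear ODEs with dense matrices are almost surely identifiable from a single trajectory, but the sparse case has been underexplored. The authors prove that, in practically relevant sparsity regimes, sparse linear ODE systems are unidentifiable with positive probability, and they provide lower bounds on that probability. Experiments show that existing estimation methods also face practical unidentifiability for sparse ODEs, suggesting that the reliability of data-driven dynamical system modeling needs to be reassessed.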

Comments 9 pages, 4 figures

Details
Journal ref
The Fourteenth International Conference on Learning Representations, ICLR 2026
English abstract

Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that "linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory." However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems. In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. Contrary to the dense case, we show that sparse systems are unidentifiable with a positive probability in practically relevant sparsity regimes and provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods to estimate linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable. Theoretical limitations are not resolved through inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allow for quantitative assessments of how much to trust a learned linear ODE.
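
A toy example may help make unidentifiability from a single trajectory concrete. The sketch below is an illustration only and is not the construction or the probability bound from the paper. It uses two different diagonal (hence sparse) system matrices and an initial state with a zero entry, all chosen here as assumptions.

```python
import math

def diag_linear_ode_solution(diag_rates, x0, t):
    """Closed-form solution of x'(t) = A x(t) when A is diagonal."""
    # Each coordinate evolves independently: x_i(t) = x_i(0) * exp(a_i * t).
    return [x * math.exp(a * t) for a, x in zip(diag_rates, x0)]

x0 = [1.0, 0.0]          # the second coordinate is never excited
a1 = [-1.0, -2.0]        # diagonal of the first system matrix
a2 = [-1.0, -3.0]        # diagonal of a different system matrix
times = [0.0, 0.5, 1.0, 2.0]

traj1 = [diag_linear_ode_solution(a1, x0, t) for t in times]
traj2 = [diag_linear_ode_solution(a2, x0, t) for t in times]
indistinguishable = traj1 == traj2
```

Both systems produce the trajectory (e^{-t}, 0), so no estimator can tell them apart from this data alone, yet they would behave differently from an initial state with a nonzero second coordinate.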

2505.22976 2026-05-11 cs.CV cs.AI

LoopNav: Benchmarking Spatial Consistency in World Models

Kewei Lian, Shaofei Cai, Yitao Liang, Anji Liu

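AI summary This paper presents LoopNav, a benchmark dataset and testbed for evaluating spatial consistency in world models. It focuses on a world model's ability to retain long-horizon observations and build spatial representations. Because existing datasets lack spatial consistency constraints, the authors build a loop-based navigation video dataset of 250 hours (20 million frames) collected in the open world of Minecraft and introduce a Scene Graph Consistency Score to quantify spatial consistency. The dataset, benchmark, and code are open-sourced to support future research.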

Comments V3: SGCS

Details
English abstract

The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. It must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, existing datasets do not explicitly enforce spatial consistency constraints, limiting both the ability to systematically evaluate this capability and to learn it through data-driven approaches. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we propose LoopNav, a dataset and corresponding benchmark centered on loop-based navigation for evaluating spatial consistency. The dataset comprises 250 hours (20 million frames) of loop-based navigation videos with actions, collected from diverse locations in the open-world environment of Minecraft. We further introduce a Scene Graph Consistency Score to quantify spatial consistency while remaining invariant to pixel-level variations. Dataset, benchmark, and code are open-sourced to support future research.

2505.22842 2026-05-11 cs.CL cs.LG

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü

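AI summary This paper proposes the Bayesian Attention Mechanism (BAM), a probabilistic framework for positional encoding and context length extrapolation in language models. It formulates positional encoding as a prior within a probabilistic model, unifying existing methods and introducing a Generalized Gaussian positional prior that substantially improves long-context generalization. Experiments show that BAM retrieves information accurately at 500 times the training context length, outperforming prior state-of-the-art methods in long-context retrieval accuracy while maintaining comparable perplexity and adding few parameters.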

Comments Accepted to ICLR 2026

Details
English abstract

Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
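
One way to read "positional encoding as a prior" is through the identity softmax(s + log p) ∝ exp(s) · p: an additive positional bias on attention logits acts like a multiplicative prior over key positions. The sketch below shows this for an ALiBi-style linear penalty and for a zero bias (NoPE). It is an illustration under that reading, not BAM's formulation or its Generalized Gaussian prior, and all names and values are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, log_prior):
    """Softmax over content scores plus an additive positional log-prior."""
    return softmax([s + lp for s, lp in zip(scores, log_prior)])

def alibi_log_prior(query_pos, num_keys, slope):
    """ALiBi-style causal bias: -slope * (query_pos - key_pos)."""
    return [-slope * (query_pos - j) for j in range(num_keys)]

scores = [0.0, 0.0, 0.0]                       # identical content scores
nope = attention_weights(scores, [0.0] * 3)    # no positional bias
alibi = attention_weights(scores, alibi_log_prior(2, 3, slope=0.5))
```

With identical content scores, the zero bias gives uniform attention, while the linear penalty shifts weight toward the most recent keys.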

2505.14113 2026-05-11 cs.CV cs.LG

CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

Bruno Viti, Elias Karabelas, Martin Holler

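AI summary CONSIGN is a conformal prediction (CP) based method for image segmentation that aims to improve uncertainty estimation for segmentation outputs. By introducing a decomposition into spatial groupings, it accounts for spatial correlations between pixels and produces meaningful prediction sets with user-specified, high-probability error guarantees. Experiments show that CONSIGN significantly outperforms CP baselines across multiple metrics on several medical imaging and general image datasets and improves the quality of uncertainty estimates.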

Comments Accepted as poster at ICLR 2026

Details
English abstract

Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs. We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
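
As background for the CP baseline the abstract contrasts with, here is a minimal sketch of split conformal prediction for a single classification decision, such as one pixel treated independently. It is not CONSIGN: it ignores spatial correlations, which is exactly the shortcoming the abstract describes. The score `1 - p`, the function names, and the sample values are assumptions.

```python
import math

def conformal_threshold(calib_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(calib_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calib_scores)[rank - 1]

def prediction_set(probs, threshold):
    """Labels whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

# Calibration scores are 1 - p(true label) on held-out examples (made-up values).
calib = [0.7, 0.1, 0.4, 0.2, 0.6, 0.3, 0.5]
q = conformal_threshold(calib, alpha=0.25)
labels = prediction_set([0.5, 0.45, 0.05], q)
```

Under exchangeability of calibration and test points, the returned set contains the true label with probability at least `1 - alpha`. Applying this independently at every pixel is what can make segmentation-level sets overly conservative.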

2505.13289 2026-05-11 cs.LG cs.CV

RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

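AI summary This paper proposes RECON, a robust symmetry discovery method based on explicit canonical orientation normalization. It corrects arbitrary canonical representations through a simple right translation, yielding natural, data-aligned canonicalizations. This enables unsupervised discovery of instance-specific pose distributions, detection of out-of-distribution poses, and a plug-and-play test-time canonicalization layer that improves pre-trained models without retraining. Experiments on image and molecular datasets show accurate symmetry discovery and downstream classification that matches or outperforms other canonicalizations.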

Comments Accepted as a conference paper at ICLR 2026

Details
English abstract

Real world data often exhibits unknown, instance-specific symmetries that rarely exactly match a transformation group $G$ fixed a priori. Class-pose decompositions aim to create disentangled representations by factoring inputs into invariant features and a pose $g\in G$ defined relative to a training-dependent, arbitrary canonical representation. We introduce RECON, a class-pose agnostic canonical orientation normalization that corrects arbitrary canonicals via a simple right translation, yielding natural, data-aligned canonicalizations. This enables (i) unsupervised discovery of instance-specific pose distributions, (ii) detection of out-of-distribution poses and (iii) a plug-and-play test-time canonicalization layer. This layer can be attached on top of any pre-trained model to infuse group invariance, improving its performance without retraining. We validate on images and molecular ensembles, demonstrating accurate symmetry discovery, and matching or outperforming other canonicalizations in downstream classification.

2505.07683 2026-05-11 cs.LG cs.AI

Multimodal Cancer Modeling in the Age of Foundation Model Embeddings

Steven Song, Morgan Borjigin-Wang, Irene Madejski, Robert L. Grossman

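AI summary This paper studies multimodal cancer modeling in the era of foundation model embeddings. The authors use foundation models to produce multimodal embeddings of cancer data and train classical machine learning models on them, outperforming unimodal models. They also examine the value of pathology report text and rigorously evaluate the effects of model-based text summarization and hallucination, proposing an embedding-centric approach to multimodal cancer analysis.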

Comments camera ready version for ML4H 2025, typo corrected

Details
Journal ref
Proceedings of the Fifth Machine Learning for Health Symposium, PMLR 297:202-227, 2026
English abstract

The Cancer Genome Atlas (TCGA) has enabled novel discoveries and served as a large-scale reference dataset in cancer through its harmonized genomics, clinical, and imaging data. Numerous prior studies have developed bespoke deep learning models over TCGA for tasks such as cancer survival prediction. A modern paradigm in biomedical deep learning is the development of foundation models (FMs) to derive feature embeddings agnostic to a specific modeling task. Biomedical text especially has seen growing development of FMs. While TCGA contains free-text data as pathology reports, these have been historically underutilized. Here, we investigate the ability to train classical machine learning models over multimodal, zero-shot FM embeddings of cancer data. We demonstrate the ease and additive effect of multimodal fusion, outperforming unimodal models. Further, we show the benefit of including pathology report text and rigorously evaluate the effect of model-based text summarization and hallucination. Overall, we propose an embedding-centric approach to multimodal cancer modeling.