arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2352
2512.03199 2026-05-13 cs.CV

Does Head Pose Correction Improve Biometric Facial Recognition?

Justin Norman, Hany Farid

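AI summary: This paper examines whether head-pose correction and image restoration can improve the accuracy of biometric facial recognition. Using a model-agnostic, large-scale forensic evaluation method, the authors assess the effect of three image restoration approaches on facial recognition and find that applying them directly reduces recognition accuracy. Selectively combining CFR-GAN and CodeFormer, however, improves recognition to some extent, suggesting a new direction for improving facial recognition in real-world settings.

Abstract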

Biometric facial recognition models often lose substantial accuracy on real-world images, which are frequently of poor quality and show non-frontal poses and occlusions. We investigate whether targeted, AI-driven head-pose correction and image restoration can improve recognition accuracy. Using a model-agnostic, large-scale, forensic-evaluation pipeline, we assess the impact of three restoration approaches: 3D reconstruction (NextFace), 2D frontalization (CFR-GAN), and feature enhancement (CodeFormer). We find that naive application of these techniques substantially degrades facial recognition accuracy. However, we also find that selective application of CFR-GAN combined with CodeFormer yields meaningful improvements.

2512.01675 2026-05-13 cs.CV

GRASP: Guided Residual Adapters with Sample-wise Partitioning

Felix Nützel, Mischa Dombrowski, Bernhard Kainz

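AI summary: In long-tail settings, text-to-image flow matching transformers produce lower-quality generations for tail classes. This paper proposes GRASP, which uses a deterministic partition of the conditioning space together with group-specific residual adapters to improve tail-class generation quality while leaving the original optimization objective and sampler unchanged. Experiments show that GRASP substantially improves the diversity of generated images and tail-class coverage across several datasets, and outperforms existing methods on downstream classification.

Comments 16 pages, 6 figures, 6 tables

Abstract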

Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning- and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization and a suitable proxy for gradient alignment. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme long-tail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80% and lifts tail-class coverage by up to 44% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.

2511.22663 2026-05-13 cs.CV

AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model

Dian Zheng, Manyuan Zhang, Hongyu Li, Kai Zou, Hongbo Liu, Ziyu Guo, Kaituo Feng, Yexin Liu, Ying Luo, Hongsheng Li

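AI summary: Unified multimodal models have made notable progress in image generation and understanding, but conflicting objectives between the two tasks make the training paradigm hard to optimize. To ease these conflicts, existing methods mostly adopt architecture decoupling, which can cost the model its interleaved generation ability. This paper proposes a strategy that requires no architecture decoupling: by analyzing the models' cross-modal attention behavior, it shows that decoupling improves performance essentially by steering the model toward task-specific interaction patterns, and it introduces the Attention Interaction Alignment (AIA) loss, which refines cross-modal attention patterns and improves both generation and understanding performance.

Comments Project page: https://zhengdian1.github.io/AIA-project/ Code: https://github.com/zhengdian1/AIA

Abstract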

Unified multimodal models for image generation and understanding represent a significant step toward AGI and have attracted widespread attention from researchers. The main challenge of this task lies in the difficulty of establishing an optimal training paradigm, because understanding and generation have inherently conflicting targets. To alleviate these conflicts and pursue higher performance, many researchers adopt varying degrees of architecture decoupling (e.g., double image encoders, MoE/MoT architectures, or a frozen MLLM). However, excessive model decoupling can lead to the loss of interleaved generation ability, undermining the original intent of unified models. In this work, we aim to explore how to mitigate task conflicts without resorting to model decoupling. First, we analyze why decoupling boosts performance by studying the cross-modal attention behavior of models. We observe that architecture decoupling does not solve task conflicts, but essentially drives models toward the cross-modal interaction patterns of task-specific models, as seen in Qwen3-VL and HunyuanImage-3.0, and that the more thorough the decoupling, the more consistent the behavior becomes. Motivated by this observation, we propose the Attention Interaction Alignment (AIA) loss, which explicitly learns task-specific multimodal interaction patterns during training. To demonstrate the generalizability of our AIA loss, we apply it to Emu3 and Janus-Pro during the SFT and post-training stages, respectively. Without bells and whistles, AIA not only refines cross-modal attention patterns, but also boosts both generation and understanding performance.

2511.18152 2026-05-13 cs.CV cs.AI

UnfoldLDM: Degradation-Aware Unfolding with Iterative Latent Diffusion Priors for Blind Image Restoration

Chunming He, Rihan Zhang, Zheng Chen, Bowen Yang, Chengyu Fang, Yunlong Lin, Yulun Zhang, Fengyang Xiao, Sina Farsiu

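AI summary: This paper proposes UnfoldLDM, a blind image restoration method that addresses two shortcomings of existing deep unfolding networks: modeling unknown degradations and over-smoothing. The method combines deep unfolding networks with a latent diffusion model, estimates the unknown degradation with a multi-granularity degradation-aware module, and uses a degradation-resistant diffusion model and an over-smoothing correction module to recover high-frequency details and textures. Experiments show that UnfoldLDM performs strongly across a range of blind image restoration tasks and can serve as a general framework compatible with existing methods.

Comments 6 figures, 11 tables

Abstract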

Deep unfolding networks (DUNs) combine the interpretability of model-based methods with the learning ability of deep networks, yet remain limited for blind image restoration (BIR). Existing DUNs suffer from: (1) Degradation-specific dependency, as their optimization frameworks are tied to a known degradation model, making them unsuitable for BIR tasks; and (2) Over-smoothing bias, resulting from the direct feeding of gradient descent outputs, dominated by low-frequency content, into the proximal term, suppressing fine textures. To overcome these issues, we propose UnfoldLDM to integrate DUNs with a latent diffusion model (LDM) for BIR. In each stage, UnfoldLDM employs a multi-granularity degradation-aware (MGDA) module as the gradient descent step. MGDA models BIR as an unknown degradation estimation problem and estimates both the holistic degradation matrix and its decomposed forms, enabling robust degradation removal. For the proximal step, we design a degradation-resistant LDM (DR-LDM) to extract compact degradation-invariant priors from the MGDA output. Guided by this prior, an over-smoothing correction transformer (OCFormer) explicitly recovers high-frequency components and enhances texture details. This unique combination ensures the final result is degradation-free and visually rich. Experiments show that our UnfoldLDM achieves a leading place on various BIR tasks and benefits downstream tasks. Moreover, our design is compatible with existing DUN-based methods, serving as a plug-and-play framework. Code will be released.

2511.16814 2026-05-13 cs.AI cs.HC

Stable diffusion models reveal a persisting human and AI gap in visual creativity

Silvia Rondini, Claudia Alvarez-Martin, Paula Angermair-Barkai, Olivier Penacchio, M. Paz, Matthew Pelowski, Dan Dediu, Antoni Rodriguez-Fornells, Xim Cerda-Company

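AI summary: Although recent research suggests that large language models can match human creativity on divergent thinking tasks, visual creativity has not been studied systematically. This study compares images produced by visual artists, non-artists, and a generative AI model under two prompting conditions (Human Inspired and Self Guided). It finds that humans still clearly outperform AI in visual creativity, and that AI creativity improves with more human guidance but still falls short of the non-artist level. The study also finds clear differences between how humans and AI judge creativity, suggesting that visual creativity depends on perceptual nuance and contextual sensitivity, capacities that may not transfer readily from language models to visual generative models.

Journal ref
Advanced Science, 2026, e24142
Abstract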

While recent research suggests Large Language Models match human creative performance in divergent thinking tasks, visual creativity remains underexplored. This study compared images produced by human participants (Visual Artists and Non Artists) with images produced by an image generation AI model under two prompting conditions with varying human input: high for Human Inspired, low for Self Guided. Human raters (N=255) and GPT-4o evaluated the creativity of the resulting images. We found a clear creativity gradient, with Visual Artists being the most creative, followed by Non Artists, then Human Inspired generative AI, and finally Self Guided generative AI. Increased human guidance strongly improved GenAI's creative output, bringing its productions close to those of Non Artists. Notably, human and AI raters also showed vastly different creativity judgment patterns. These results suggest that, in contrast to language-centered tasks, GenAI models may face unique challenges in visual domains, where creativity depends on perceptual nuance and contextual sensitivity, distinctly human capacities that may not be readily transferable from language models.

2511.11935 2026-05-13 cs.LG

SurvBench: A Standardised Preprocessing Pipeline for Multi-Modal Electronic Health Record Survival Analysis

Munib Mesinovic, Tingting Zhu

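AI summary: SurvBench is an open-source preprocessing pipeline that provides standardised data processing for survival analysis on multi-modal electronic health records (EHR). It addresses the difficulty of comparing deep-learning survival models on EHR data by unifying the preprocessing steps, including cohort definition, time discretisation, missingness handling, and censoring rules. SurvBench supports several critical-care databases and multiple input modalities, offers a unified configuration interface and cross-dataset validation, and provides a reliable benchmark platform for future multi-modal EHR survival analysis research.

Abstract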

Deep-learning survival models for electronic health record (EHR) data are hard to compare across papers because the upstream preprocessing step, which includes cohort definition, time discretisation, missingness handling, and censoring rules, is typically undocumented and inconsistent. A reported difference in concordance between two mortality models can therefore reflect any of these choices rather than a modelling contribution. We present SurvBench, an open-source preprocessing pipeline that converts raw PhysioNet exports into model-ready tensors for survival analysis. SurvBench covers four critical-care databases (MIMIC-IV, eICU, MC-MED, HiRID) and four input modalities: time-series vitals and laboratory values, static demographics, International Classification of Diseases (ICD) codes, and radiology report embeddings. Every preprocessing decision is controlled through YAML configuration. Imputation, scaling, and feature filtering are fit on the training fold only. Missingness is recorded as a binary mask alongside each feature tensor. The pipeline handles single-risk endpoints (in-hospital and in-ICU mortality) and competing-risks endpoints (a three-way emergency-department admission pathway, with home discharge treated as administrative censoring). We also provide support for harmonised cross-dataset external validation between eICU and MIMIC-IV. SurvBench is publicly available at https://github.com/munibmesinovic/SurvBench, providing a robust platform that future deep-learning EHR survival work, especially nascent multi-modal approaches, can be measured against under matched preprocessing.
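
The train-fold-only fitting and the binary missingness mask described above can be illustrated with a minimal sketch. This is not SurvBench's code: the function names `fit_train_stats` and `transform` are hypothetical, and mean imputation followed by standard scaling is an assumption made for illustration, since the abstract does not say which imputation or scaling method is used.

```python
import math

def fit_train_stats(train_rows):
    """Per-feature mean and std, computed from observed training values only."""
    n_features = len(train_rows[0])
    stats = []
    for j in range(n_features):
        observed = [row[j] for row in train_rows if row[j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        var = sum((v - mean) ** 2 for v in observed) / len(observed) if observed else 0.0
        std = math.sqrt(var) or 1.0  # fall back to 1.0 when the variance is zero
        stats.append((mean, std))
    return stats

def transform(rows, stats):
    """Return (scaled_values, missing_mask) using statistics fit on the training fold.

    Missing entries (None) are mean-imputed (an assumption), so they become 0.0
    after scaling, and the mask records a 1 wherever a value was missing.
    """
    values, mask = [], []
    for row in rows:
        value_row, mask_row = [], []
        for x, (mean, std) in zip(row, stats):
            missing = x is None
            filled = mean if missing else x
            value_row.append((filled - mean) / std)
            mask_row.append(1 if missing else 0)
        values.append(value_row)
        mask.append(mask_row)
    return values, mask

# Toy data: two features, with None marking a missing measurement.
train = [[60.0, None], [80.0, 1.0], [100.0, 3.0]]
test = [[None, 2.0]]

stats = fit_train_stats(train)                   # fit on the training fold only
test_values, test_mask = transform(test, stats)  # test fold reuses the train statistics
```

Because the test fold never contributes to `stats`, no information from held-out patients leaks into the scaling, which is the property the abstract emphasises.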

2511.11412 2026-05-13 cs.CL cs.CY stat.OT

MajinBook: An open catalogue of digitally mediated world literature

Antoine Mazières, Thierry Poibeau

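AI summary: This paper introduces MajinBook, an open catalogue of digital literature intended to facilitate the use of shadow libraries (such as Library Genesis and Z-Library) in computational social science and cultural analytics. By linking metadata from these crowd-sourced archives with structured bibliographic data from Goodreads, the authors build a high-precision corpus of more than 539,000 English-language books, annotated with first publication date, genre, and popularity. The work uses natively digital EPUB files to ensure machine readability, addresses biases in traditional corpora, and provides secondary datasets for French, German, and Spanish.

Comments 9 pages, 5 figures, 1 table

Abstract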

This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries, such as Library Genesis and Z-Library, for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritises natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.

2511.10670 2026-05-13 cs.CL cs.AI cs.SD

Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment

Yan Gao, Yazheng Yang, Zhibin Lan, Yidong Chen, Min Zhang, Daimeng Wei, Derek F. Wong, Jinsong Su

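AI summary: This work tackles fine-grained semantic modeling in code-switching speech translation. It proposes a speech projector built on a Mixture-of-Experts (MoE) structure, in which language expert groups model the semantic space of each language at a fine granularity. A language-specific loss and an intra-group load balancing loss are introduced to make the model more efficient, and a multi-stage training strategy draws on existing automatic speech recognition and monolingual translation data to strengthen alignment and translation performance. Experiments show that the method significantly outperforms existing models on several datasets, with clear gains in BLEU and COMET.

Comments Accepted to IJCAI 2026 Main Track

Abstract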

Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into target-language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily available automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation performance. To bridge the data gap for smooth domain transfer, a transition loss is employed to improve adaptation to CS scenarios. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach, achieving average improvements of 0.86 BLEU and 0.93 COMET over SeamlessM4T, with maximum improvements of 1.49 BLEU and 1.41 COMET across different test sets.

2511.07767 2026-05-13 cs.LG

Taking the Road Less Scheduled with Adaptive Polyak Steps

Dimitris Oikonomou, Matthew Buchholz, Yuen-Man Pun, Robert M. Gower, Nicolas Loizou

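AI summary: This paper studies adaptive optimization that does not need the training horizon in advance and proposes Polyak step sizes for Schedule-Free SGD and Adam. The method computes the learning rate automatically at each iteration from the current loss, gradient, and iterates alone, with no manual tuning. The authors introduce an oracle variant and a robust variant that does not need the optimal values, and prove convergence rates for convex Lipschitz objectives. Experiments show that the method performs well on language model pretraining and knowledge distillation and is more robust to hyperparameter choices.

Abstract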

Schedule-Free SGD, proposed in [Defazio et al., 2024], achieves optimal convergence rates without requiring the training horizon in advance, by replacing learning rate schedules with a principled form of iterate averaging. However, the method still requires tuning a base learning rate whose optimal value depends on unknown problem constants. In this work, we continue down this road by deriving Polyak-type step sizes for Schedule-Free SGD and Adam that compute the learning rate at each iteration from the sampled loss, gradient, and current iterates alone. We first propose an oracle variant that uses per-sample optimal function values and prove an $O(1/\sqrt{t})$ anytime last-iterate rate for convex Lipschitz objectives. We then remove the oracle requirement with a safeguarded variant that replaces the unknown optimal values with any available lower bound, achieving the same rate up to a neighborhood that vanishes under interpolation. Both step sizes reduce to existing Polyak rules for standard SGD when momentum is set to zero, unifying standard and schedule-free Polyak methods. Numerical experiments on language modeling, including pretraining and distillation, show that the proposed methods match or surpass tuned Schedule-Free baselines while offering greater robustness to hyperparameter choices.
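
For orientation, the classical Polyak-type rule for plain SGD, which the abstract says the new step sizes reduce to at zero momentum, can be sketched as follows. This is the standard textbook rule with a lower bound in place of the optimal value, not the paper's schedule-free variants; the function name `polyak_sgd`, the cap `max_step`, and the toy least-squares problem are all assumptions made for illustration.

```python
def polyak_sgd(samples, w, lower_bound=0.0, max_step=1.0, epochs=50):
    """Plain SGD on per-sample losses 0.5 * (a.w - b)^2 with a Polyak-type step.

    The step is (sampled loss - lower bound) / ||gradient||^2, capped at
    max_step. The cap and the lower bound of 0 are illustrative assumptions.
    """
    for _ in range(epochs):
        for a, b in samples:
            residual = sum(ai * wi for ai, wi in zip(a, w)) - b
            loss = 0.5 * residual ** 2
            grad = [residual * ai for ai in a]
            grad_sq = sum(g * g for g in grad)
            if grad_sq == 0.0:
                continue  # this sample is already fit exactly; nothing to update
            step = min(max_step, max(loss - lower_bound, 0.0) / grad_sq)
            w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w

# Consistent linear system with solution w* = [2, -1], so every per-sample loss
# can reach 0 and the lower bound of 0 is exact (the interpolation setting).
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w_hat = polyak_sgd(samples, [0.0, 0.0])
```

No learning rate is tuned here: each step length comes from the sampled loss and gradient alone, which is the property the abstract carries over to the schedule-free setting.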

2510.27055 2026-05-13 cs.CL cs.AI

Detecting Data Contamination in LLMs via In-Context Learning

Michał Zawalski, Meriem Boubdir, Klaudia Bałazy, Besmira Nushi, Pablo Ribalta

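AI summary: This paper proposes CoDeC, a method for detecting and quantifying training data contamination in large language models. It measures how in-context learning affects model performance in order to distinguish data memorized during training from data outside the training distribution. Experiments show that CoDeC produces interpretable contamination scores that separate seen from unseen datasets, and that it reveals substantial memorization in open-weight models whose training corpora are undisclosed. The method is simple, automated, and applicable across models and datasets, which makes it easy to integrate into benchmark evaluations.

Abstract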

We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
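
One way to turn the observation above into a number is sketched below. The abstract does not define the score, so everything here is an assumption: the function name `contamination_score` is hypothetical, confidence is taken to be a per-sample log-likelihood, and the score is simply the fraction of samples whose confidence drops once in-context examples from the same dataset are added.

```python
def contamination_score(logp_without_context, logp_with_context):
    """Fraction of samples whose log-likelihood drops when in-context examples are added.

    Both arguments are per-sample log-likelihoods of the same samples, measured
    without and with in-context examples. A higher value would point toward
    contamination under the assumption stated above.
    """
    if len(logp_without_context) != len(logp_with_context):
        raise ValueError("both lists must describe the same samples")
    if not logp_without_context:
        raise ValueError("at least one sample is required")
    drops = sum(
        1
        for base, with_ctx in zip(logp_without_context, logp_with_context)
        if with_ctx < base
    )
    return drops / len(logp_without_context)

# Toy log-likelihoods for four samples; two of them drop once context is added.
score = contamination_score([-2.0, -1.5, -3.0, -0.5], [-1.0, -1.8, -2.0, -0.9])
```

The log-likelihoods themselves would come from whatever model is being audited; obtaining them is outside the scope of this sketch.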

2510.24145 2026-05-13 cs.AI

OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices

Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei

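AI summary: OpsAgent is a lightweight, self-evolving multi-agent system for incident management in microservice systems. It uses a training-free data processor to convert heterogeneous observability data into structured textual descriptions, and a multi-agent collaboration framework to make diagnostic reasoning transparent and auditable. To support continual capability growth, OpsAgent introduces a dual self-evolution mechanism that combines internal model updates with external experience accumulation. Experiments show strong results in performance, interpretability, cost efficiency, and self-evolution, indicating that the system is feasible for real-world deployment and long-term operation.

Abstract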

Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.

2510.17062 2026-05-13 cs.CL cs.AI

Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation

Guoqing Luo, Iffat Maab, Lili Mou, Junichi Yamagishi

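AI summary: This paper studies the thinking behaviour of reasoning-based language models when handling social bias, and finds that their internal reasoning can amplify social stereotypes and lead to biased outcomes. It identifies two failure patterns that drive bias aggregation: stereotype repetition and irrelevant information injection. Building on these findings, the authors propose a lightweight prompting method that guides the model to review its own reasoning; experiments show that it reduces bias on several benchmarks while maintaining or improving accuracy.

Comments Due to issues found with the annotations in Section 4.3, we have decided to withdraw this preprint

Abstract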

While reasoning-based large language models excel at complex tasks through an internal, structured thinking process, a concerning phenomenon has emerged that such a thinking process can aggregate social stereotypes, leading to biased outcomes. However, the underlying behaviours of these language models in social bias scenarios remain underexplored. In this work, we systematically investigate mechanisms within the thinking process behind this phenomenon and uncover two failure patterns that drive social bias aggregation: 1) stereotype repetition, where the model relies on social stereotypes as its primary justification, and 2) irrelevant information injection, where it fabricates or introduces new details to support a biased narrative. Building on these insights, we introduce a lightweight prompt-based mitigation approach that queries the model to review its own initial reasoning against these specific failure patterns. Experiments on question answering (BBQ and StereoSet) and open-ended (BOLD) benchmarks show that our approach effectively reduces bias while maintaining or improving accuracy.

2510.09333 2026-05-13 cs.LG cs.CV

Efficient Bayesian Inference from Noisy Pairwise Comparisons

Till Aczel, Lucas Theis, Roger Wattenhofer

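AI summary: This paper studies how to perform efficient Bayesian inference from noisy human pairwise comparisons in order to evaluate generative models. The authors propose BBQ, a Bayesian variant of the Bradley-Terry model that explicitly models rater quality, filters out unreliable raters, and guarantees monotonic likelihood convergence through an Expectation-Maximization algorithm. Experiments show that BBQ yields more efficient, robust, and interpretable model rankings and uncertainty estimates with noisy or crowdsourced raters.

Abstract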

Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ provides efficient inference, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
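
As background for the baseline the abstract compares against, a plain Bradley-Terry fit can be sketched with the classical minorization-maximization update. This is not BBQ: it has no rater-quality model and no Bayesian prior, and the function name `bradley_terry` and the toy win counts are assumptions made for illustration.

```python
def bradley_terry(wins, n_items, iters=200):
    """Maximum-likelihood Bradley-Terry scores via the classical MM update.

    wins[(i, j)] is the number of times item i beat item j. Scores are
    normalised to sum to 1, and P(i beats j) = score_i / (score_i + score_j).
    """
    scores = [1.0 / n_items] * n_items
    for _ in range(iters):
        new_scores = []
        for i in range(n_items):
            total_wins = sum(
                count for (winner, _loser), count in wins.items() if winner == i
            )
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (scores[i] + scores[j])
            # Keep the old score for an item that was never compared.
            new_scores.append(total_wins / denom if denom else scores[i])
        norm = sum(new_scores)
        scores = [s / norm for s in new_scores]
    return scores

# Toy round-robin among three generative models: model 0 usually beats model 1,
# and both usually beat model 2.
wins = {
    (0, 1): 3, (1, 0): 1,
    (1, 2): 3, (2, 1): 1,
    (0, 2): 3, (2, 0): 1,
}
scores = bradley_terry(wins, 3)
```

Every comparison counts equally in this baseline, so a careless rater distorts the scores as much as a careful one; that is the gap the abstract says BBQ addresses by modelling rater quality.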

2510.03853 2026-05-13 cs.CV

UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

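AI summary: UGround proposes a unified visual grounding framework that dynamically selects intermediate layers across unrolled transformer layers for its "mask as prompt" mechanism, overcoming the conventional reliance on the fixed last hidden state. It introduces a policy-driven masking mechanism with two core components, Stochastic Skip Connection and Mask as Prompt, which dynamically guides the vision model (e.g., SAM) and passes explicit spatial cues to it. Within a single framework, UGround covers a range of visual grounding tasks from an attribute perspective, from traditional referring expression segmentation to the newly proposed reasoning segmentation, which markedly improves the model's flexibility and applicability.

Comments This work has been accepted to ICML 2026, please refer to https://github.com/rui-qian/UGround

Abstract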

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt," diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt." UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.

2510.03206 2026-05-13 cs.AI cs.CL

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang

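AI summary: This paper studies how diffusion language models differ between discrete and continuous spaces, noting that continuous diffusion models are more expressive in theory yet often underperform discrete models in practice. The authors propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint diffusion process over a continuous representation space and a discrete token space to combine the strengths of both: it retains the semantic richness of the continuous space while using discrete tokens to improve training and sampling. Experiments show that CCDD performs strongly in language modeling on a range of real-world tasks.

Comments 29 pages. Accepted to ICML 2026

Abstract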

Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain of thought, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, it introduces additional difficulty in decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, and it gains good trainability and sample quality from the explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which show strong empirical performance in extensive language modeling experiments on real-world tasks.

2510.02107 2026-05-13 cs.LG

PENEX: AdaBoost-Inspired Neural Network Regularization

Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach

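AI summary: This paper proposes PENEX, an AdaBoost-inspired regularization method for neural networks. It reformulates the multi-class exponential loss so that it is amenable to first-order optimization and therefore more practical for training neural networks. PENEX improves generalization by increasing the margins of data points and outperforms conventional regularizers in low-data regimes. The study shows the potential of the exponential loss beyond AdaBoost.

Abstract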

AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes misclassified data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods, making it a practical objective for training neural networks. We demonstrate that PENEX effectively increases margins of data points, which can be translated into a generalization bound. Empirically, across computer vision and language tasks, PENEX improves neural network generalization in low-data regimes, matching and in some settings outperforming established regularizers at comparable computational cost. Our results highlight the potential of the exponential loss beyond its application in AdaBoost.

2510.00733 2026-05-13 cs.LG cs.AI q-bio.QM

Neural Diffusion Processes for Physically Interpretable Survival Prediction

Alessio Cristofoletto, Cesare Rollo, Giovanni Birolo, Piero Fariselli

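AI summary: This paper proposes DeepFHT, a survival-analysis framework that combines deep neural networks with first hitting time (FHT) distributions from stochastic process theory, modeling the time to event as the first time a latent diffusion process reaches an absorbing boundary. A neural network maps input variables to physically meaningful parameters such as the initial condition, drift, and diffusion coefficient, which yields closed-form survival and hazard functions without assuming proportional hazards. Experiments show that DeepFHT matches existing state-of-the-art methods in predictive performance while keeping a physically interpretable parameterization that helps reveal the relation between input features and risk.

Comments 12 pages, 5 figures

Abstract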

We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, both with drift and driftless. This yields closed-form survival and hazard functions and captures time-varying risk without assuming proportional hazards. We compare DeepFHT with Cox regression using synthetic and real-world datasets. The method achieves predictive accuracy on par with the state-of-the-art approach, while maintaining a physics-based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems.
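
The closed-form survival and hazard functions mentioned above can be made concrete for the Brownian-motion case. The sketch below uses the standard first-passage result for a process X_t = x0 - drift * t + sigma * W_t absorbed at 0 (the inverse Gaussian law when drift > 0). It is not DeepFHT's code: the function names are hypothetical, and the parameters are passed in directly rather than predicted by a neural network.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fht_survival(t, x0, drift, sigma):
    """P(T > t) for X_t = x0 - drift * t + sigma * W_t absorbed at 0 (x0 > 0, sigma > 0)."""
    if t <= 0.0:
        return 1.0
    s = sigma * math.sqrt(t)
    return normal_cdf((x0 - drift * t) / s) - math.exp(
        2.0 * drift * x0 / sigma ** 2
    ) * normal_cdf((-x0 - drift * t) / s)

def fht_density(t, x0, drift, sigma):
    """First-passage density f(t) = -dS/dt for the same process."""
    return (
        x0
        / (sigma * math.sqrt(2.0 * math.pi * t ** 3))
        * math.exp(-((x0 - drift * t) ** 2) / (2.0 * sigma ** 2 * t))
    )

def fht_hazard(t, x0, drift, sigma):
    """Hazard h(t) = f(t) / S(t); it varies with t, so hazards are not proportional."""
    return fht_density(t, x0, drift, sigma) / fht_survival(t, x0, drift, sigma)

# Illustrative parameters: start one unit from the boundary, drifting toward it.
survival_at_1 = fht_survival(1.0, x0=1.0, drift=0.5, sigma=1.0)
hazard_at_1 = fht_hazard(1.0, x0=1.0, drift=0.5, sigma=1.0)
```

Setting `drift=0.0` gives the driftless case, where the survival function reduces to `2 * normal_cdf(x0 / (sigma * sqrt(t))) - 1`.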

2509.25239 2026-05-13 cs.AI cs.CL cs.LG

A Formal Comparison Between Chain of Thought and Latent Thought

Kevin Xu, Issei Sato

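AI summary: This paper compares two reasoning approaches for large language models: chain of thought (CoT) and latent thought. CoT reasons by explicitly generating intermediate tokens, whereas latent thought computes directly in a continuous latent space and supports computation beyond discrete linguistic representations. The study finds that latent thought admits more efficient parallel computation, while CoT supports approximate counting and sampling under stochastic decoding, which gives a theoretical basis for choosing a reasoning paradigm for a given task.

Comments Camera-ready version for ICML 2026

Abstract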

Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.

2509.22414 2026-05-13 cs.CV

LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

Song Fei, Tian Ye, Lujia Wang, Lei Zhu

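AI summary: This paper proposes LucidFlux, a caption-free, high-fidelity image restoration method that adapts the large-scale diffusion transformer Flux.1 for photo-realistic restoration. It introduces a lightweight dual-branch conditioner that injects signals from the degraded input and from a lightly restored proxy to anchor geometry and suppress artifacts, and designs a timestep- and layer-adaptive modulation schedule for coarse-to-fine, context-aware updates. With caption-free semantic alignment through SigLIP features and a scalable data curation pipeline, LucidFlux outperforms existing open-source and commercial methods on several benchmarks, demonstrating robust image restoration in complex scenes without text prompts.

Comments Project Page: https://w2genai-lab.github.io/LucidFlux

Abstract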

Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.

2509.20899 2026-05-13 cs.CV

Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification

Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt

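AI summary This paper proposes MoTIF, an interpretable video classification method that introduces a transformer architecture operating on temporal concept activations to address the challenge of extracting and modeling concepts in video. The method uses per-concept temporal self-attention to capture how concepts vary over time and how they contribute to the classification result, and a vision-language-model-based concept discovery module automatically extracts object- and action-related textual concepts from training videos without manual annotation. Experiments show that the method outperforms global concept bottleneck models on multiple video benchmarks and maintains good performance within the interpretable setting.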

Details
English abstract

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is a class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts from training videos, yielding temporally expressive concept sets without manual concept annotation. Across multiple video benchmarks, this combination improves over global concept bottlenecks and remains competitive within the interpretable concept-bottleneck setting, while narrowing the gap to strong black-box video baselines that we report as contextual references. Code available at github.com/patrick-knab/MoTIF.

2509.13548 2026-05-13 cs.SD eess.AS stat.ML

Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

Manan Mittal, Thomas Deppisch, Joseph Forrer, Chris Le Sueur, Zamir Ben-Hur, David Lou Alon, Daniel D. E. Wong

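AI summary This paper proposes a novel mixture-of-experts method for field-of-view enhanced binaural rendering of moving talkers. By combining multiple binaural filters online through implicit localization, the method tracks and enhances continuously moving sound sources in real time, emphasizing or suppressing sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on direction-of-arrival estimation or operate in the Ambisonics domain, this signal-dependent framework is agnostic to array geometry and suits spatial audio capture and personalized playback in next-generation consumer audio devices.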

Comments 5 pages, 3 figures

Details
English abstract

We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.

2507.16818 2026-05-13 cs.LG

Evaluating Artificial Intelligence Algorithms for the Standardization of Transtibial Prosthetic Socket Shape Design

C. H. E. Jordaan, M. van der Stelt, T. J. J. Maal, V. M. A. Stirler, R. Leijendekkers, T. Kachman, G. A. de Jong

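AI summary This study aims to use AI algorithms to standardize the design of prosthetic sockets for amputees and reduce dependence on the prosthetist's experience. Based on 3D residual-limb scans and the corresponding prosthetic socket models from 118 patients, the study pre-processes the data with morphable models and principal component analysis and develops three algorithms that predict either the socket shape or the prosthetist's adaptations. The results show that predicting the adaptations is more accurate than directly predicting the final shape, with the random forest model performing best at a median surface-to-surface distance of only 1.24 mm.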

Details
Journal ref
Computer Methods and Programs in Biomedicine Update 9: 2026
English abstract

The quality of a transtibial prosthetic socket depends on the prosthetist's skills and expertise, as the fitting is performed manually. This study investigates multiple artificial intelligence (AI) approaches to help standardize transtibial prosthetic socket design. Data from 118 patients were collected by prosthetists working in the Dutch healthcare system. This data consists of a three-dimensional (3D) scan of the residual limb and a corresponding 3D model of the prosthetist-designed socket. Multiple data pre-processing steps are performed for alignment, standardization and optionally compression using Morphable Models and Principal Component Analysis. Afterward, three different algorithms (a 3D neural network, a feedforward neural network, and a random forest) are developed to either predict 1) the final socket shape or 2) the adaptations performed by a prosthetist to predict the socket shape based on the 3D scan of the residual limb. Each algorithm's performance was evaluated by comparing the prosthetist-designed socket with the AI-generated socket, using two metrics in combination with the error location. First, we measure the surface-to-surface distance to assess the overall surface error between the AI-generated socket and the prosthetist-designed socket. Second, distance maps between the AI-generated and prosthetist sockets are utilized to analyze the error's location. For all algorithms, estimating the required adaptations outperformed direct prediction of the final socket shape. The random forest model applied to adaptation prediction yields the lowest error with a median surface-to-surface distance of 1.24 millimeters, a first quartile of 1.03 millimeters, and a third quartile of 1.54 millimeters.
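As a rough illustration of how a median and quartiles of a surface-to-surface distance might be computed, the sketch below treats each surface as a small set of 3D points and measures, for every predicted point, the distance to its nearest point on the designed surface. The abstract does not define the metric precisely, so the one-directional nearest-point definition, the function name `nearest_distances`, and the toy coordinates are all assumptions made for illustration; real socket meshes would more likely call for point-to-triangle distances.

```python
import math
import statistics

def nearest_distances(source_points, target_points):
    """For each source point, return the distance to its closest target point.

    Assumption: surfaces are approximated by point sets, and the distance is
    measured in one direction only (source -> target).
    """
    return [min(math.dist(p, q) for q in target_points) for p in source_points]

# Hypothetical toy data: four points per surface, in millimeters.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
designed = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (1.0, 1.0, 4.0)]

distances = nearest_distances(predicted, designed)

# First quartile, median, and third quartile of the per-point distances.
q1, median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
print(f"Q1={q1:.3f} mm, median={median:.3f} mm, Q3={q3:.3f} mm")
```

Reporting quartiles alongside the median, as the abstract does, describes the spread of the per-point error without being dominated by a few large outliers.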

2507.13841 2026-05-13 cs.CL

The Challenge and Reward of Fair Play in Narrative: A Computational Approach

Eitan Wagner, Renana Keydar, Omri Abend

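AI summary This paper studies the balance between surprise and coherence in narrative, proposes an information-theoretic framework, and analyzes detective fiction as a case study. It finds that the two qualities trade off for a single reader model but can coexist once pre-revelation and post-revelation reading modes are distinguished. The paper further proposes fair play as an important criterion of narrative quality and validates it experimentally with large language models; the results show that achieving fair play is challenging for models and that surprise and coherence are not positively correlated across stories.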

Comments 47 pages, 11 figures, 13 tables

Details
English abstract

Good storytelling involves surprise -- unpredictability in how the story unfolds -- and sense-making, the requirement that the story forms a coherent sequence. However, to date, these two qualities have largely been addressed in isolation. We formalize these qualities and their relationship in an information-theoretic framework, using detective fiction as a paradigm case of narratives in which a hidden truth is discovered through reasoning. Our central theoretical result shows that surprise and coherence must trade off for any *single* reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of *fair play*, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play. Experiments on LLM-generated stories validate our theoretical predictions: while models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models. Moreover, surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality. A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers. Our metrics also reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.

2506.13163 2026-05-13 cs.LG

Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Tanmay Goyal, Gaurav Sinha

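AI summary This paper studies the Logistic Contextual Slate Bandit problem, in which an agent selects, at each round, a slate of $N$ items from an exponentially large set of candidate slates and observes only a single binary reward determined by a logistic model. To maximize cumulative reward over $T$ rounds at low computational cost, the authors propose two efficient algorithms, Slate-GLM-OFU and Slate-GLM-TS, which achieve $N^{O(1)}$ per-round time complexity through local planning and low regret through global learning. Theoretical analysis and experiments show that the algorithms perform well across a range of synthetic settings and have been applied to selecting in-context examples for language models, achieving competitive test accuracy.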

Comments Accepted to UAI 2025

Details
English abstract

We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.
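To make the idea of local planning concrete, here is a minimal sketch of why independent slot selections can stand in for a search over all slates. It assumes (this is not stated in the abstract) that a slate's feature vector is the concatenation of its per-slot item features, so the linear score inside the logistic model is a sum of per-slot terms. It also omits the exploration step (optimism or Thompson sampling) that Slate-GLM-OFU and Slate-GLM-TS would add; the names `plan_locally` and `plan_exhaustively` are hypothetical.

```python
import itertools
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def slate_score(theta_slots, slate):
    """Linear score of a slate: sum of per-slot scores (additivity is an assumption)."""
    return sum(dot(th, item) for th, item in zip(theta_slots, slate))

def plan_locally(theta_slots, candidates):
    """Pick the best item in each slot independently: N * K score evaluations."""
    return [max(items, key=lambda x, th=th: dot(th, x))
            for th, items in zip(theta_slots, candidates)]

def plan_exhaustively(theta_slots, candidates):
    """Search all K**N slates; only feasible for tiny N and K."""
    best = max(itertools.product(*candidates),
               key=lambda slate: slate_score(theta_slots, slate))
    return list(best)

rng = random.Random(0)
N, K, d = 4, 5, 3  # slots, items per slot, feature dimension per item
theta_slots = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
candidates = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
              for _ in range(N)]

local = plan_locally(theta_slots, candidates)
brute = plan_exhaustively(theta_slots, candidates)

# The sigmoid is monotone, so maximizing the linear score also maximizes
# the modeled reward probability.
reward_prob = sigmoid(slate_score(theta_slots, local))
print(f"local-planning reward probability: {reward_prob:.3f}")
```

Under the additive-score assumption, the slot-wise choice reaches the same score as the exhaustive search while evaluating N * K items instead of K^N slates, which is the kind of gap the $N^{O(1)}$ per-round bound refers to.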

2506.09044 2026-05-13 cs.LG

Strategically Deceptive Model Deployment in Performative Prediction

Javier Sanguino Bautiste, Thomas Kehrenberg, Jose A. Lozano, Novi Quadrianto

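AI summary This paper studies strategically deceptive deployment in Performative Prediction, where an institution deploys a model that differs from the model users respond to. It proposes a new framework, Decoupled Performative Prediction (DPP), to model the mismatch between the institution's decision model and the model users respond to, and proves that the framework can yield lower risk. The study also introduces the deception cost as a measure of how much users are deceived and analyzes the limits of institutions optimizing with this cost under reputational or user-abandonment pressure, stressing that model disclosure is not only an ethical issue but also a key technical design decision in urgent need of regulation.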

Comments Accepted to FAccT 2026

Details
English abstract

Machine Learning systems are increasingly deployed in decision-making settings that shape user behavior and, in turn, the data on which future decisions are based. Performative Prediction (PP) formalizes this feedback loop by modeling how deployed models induce distributional shifts. It studies how to learn robust and well-performing models under such dynamics. However, existing PP frameworks typically assume that the model governing these decisions is the same model observed by users (and therefore the one to which they respond). In practice, deployer institutions may instead disclose curated models, while internally relying on distinct opaque models. We introduce Decoupled Performative Prediction (DPP), a framework that explicitly models mismatches between the model governing institutional decisions and the model that shapes user behavior. By analyzing the resulting optimization landscape, we show that DPP admits new, distinct solutions that provably achieve lower risk for the institution than those under classical PP. We further propose an algorithm with provable convergence guarantees under standard assumptions, demonstrating how easily institutions can benefit from strategically deceptive deployment when they control model disclosure and users lack countervailing power. To capture the implications of such behavior, we introduce the deception cost, a quantitative measure of the degree of deception experienced by users. We study settings in which institutions incorporate this cost into the optimization process, motivated by reputational concerns or potential user abandonment, and show that such self-imposed constraints are insufficient to protect users. Overall, our results demonstrate that model disclosure is not merely an ethical consideration but a core technical design decision, underscoring the need for regulations that hold institutions accountable for deceptive deployment practices.

2506.02084 2026-05-13 cs.LG stat.ML

Adversarial Causal Tuning for Realistic Time-series Generation

Nikolaos Gkorgkolis, Nikolaos Kougioulis, MingXue Wang, Bora Caglayan, Andrea Tonon, Dario Simionato, Ioannis Tsamardinos

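AI summary This paper studies how to generate simulated data with the same observational and interventional distributions as real time-series data, with the aim of building a probabilistic causal digital twin. To this end, the authors propose Adversarial Causal Tuning (ACT), which combines ideas from generative adversarial networks and AutoML to search for optimal causal models and discriminators, improving the agreement between generated and real data distributions, and controls model complexity through permutation testing. Experiments show that ACT achieves strong fit and generalization on multiple datasets, offering a new and effective approach to generating realistic time series.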

Comments 22 pages, 3 figures

Details
English abstract

We address the problem of generating simulated, yet realistic, time-series data from a causal model with the same observational and interventional distributions as a given real dataset (probabilistic causal digital twin). While non-causal models (e.g., GANs) also strive to simulate realistic data, causal models are fundamentally more powerful, able to simulate the effect of interventions (what-if scenarios), optimize decisions, and perform root-cause analysis and counterfactual causal reasoning. We introduce the Adversarial Causal Tuning (ACT) methodology, which outputs the optimal causal model that fits the data, along with a quantification of the goodness-of-fit. The returned causal model can then be employed to simulate new data or to perform other causal reasoning tasks. ACT adopts ideas from Generative Adversarial Network training and AutoML to search for optimal causal pipelines and discriminators that detect deviations between the distributions of real and simulated data. It also adapts a permutation testing procedure from established causal tuning methods to penalize models for complexity. Through extensive experiments on real, semi-synthetic, and synthetic datasets, we show that (a) employing multiple optimized discriminators is paramount for selecting the optimal causal models and quantifying goodness-of-fit, (b) ACT selects the optimal causal model in synthetic datasets while avoiding overfitting, generating data indistinguishable from the true data distribution, and (c) all state-of-the-art generative and causal simulation methods exhibit room for improvement in reproducing real data distributions; generating realistic temporal data is still an open research challenge.

2506.01568 2026-05-13 cs.LG cs.RO

Trajectory First: A Curriculum for Discovering Diverse Policies

Cornelius V. Braun, Sayantan Auddy, Marc Toussaint

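AI summary This paper proposes a two-stage curriculum learning method to increase the behavioral diversity of reinforcement learning agents. The method first introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors, and then distills them into reactive, step-wise policies. Experiments show that the curriculum markedly increases the diversity of learned skills while maintaining task performance.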

Comments Accepted into the Inductive Biases in Reinforcement Learning Workshop at RLC 2025

Details
English abstract

Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has become a useful reinforcement learning (RL) framework for training a set of diverse agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robot manipulation, resulting in limited behavioral diversity. We address this with a two-stage curriculum that introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors in an initial stage, and then distills these behaviors into reactive, step-wise policies in a second stage. In our empirical evaluation, we provide novel insights into challenges of diversity-targeted training and show that our curriculum increases the diversity of learned skills while maintaining high task performance.

2505.20535 2026-05-13 cs.LG

Rotary Masked Autoencoders are Versatile Learners

Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta

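AI summary This paper proposes the Rotary Masked Autoencoder (RoMAE), a new autoencoder that addresses the need for specialized architectures when applying standard Transformers to irregular time series. RoMAE incorporates Rotary Positional Embedding (RoPE) and can perform interpolation and representation learning on multidimensional continuous positional information without any time-series-specific structure. Experiments show that RoMAE performs well on tasks across several modalities, including irregular time series, images, and audio; in particular, it surpasses specialized time-series models on difficult datasets while retaining MAE's good performance on other modalities.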

Comments NeurIPS 2025 Final Camera Ready

Details
Journal ref
Advances in Neural Information Processing Systems 38, NeurIPS 2025, Pages 133952-133987
English abstract

Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.
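The relative position property mentioned above can be illustrated with a small sketch of rotary embeddings applied at real-valued positions. This is a generic RoPE rotation written in plain Python, not RoMAE's implementation: the frequency schedule `base ** (-i / dim)`, the function name `rope_rotate`, and the toy vectors are assumptions chosen for illustration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs of `vec` by angles proportional to
    a real-valued `position` (assumed frequency schedule: base ** (-i / dim))."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE rotates coordinates in pairs"
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Two pairs of non-integer positions (e.g. irregular timestamps) that share
# the same offset of 1.5.
score_a = dot(rope_rotate(q, 2.75), rope_rotate(k, 1.25))
score_b = dot(rope_rotate(q, 12.25), rope_rotate(k, 10.75))
print(score_a, score_b)
```

Because each coordinate pair is rotated by an angle linear in the position, the query-key dot product depends only on the difference between the two positions, which is why the same rotation can be applied to irregularly sampled timestamps without any change to the mechanism.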

2505.18780 2026-05-13 cs.RO cs.LG

DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion

Yahao Fan, Tianxiang Gui, Kaiyang Ji, Shutong Ding, Chixuan Zhang, Yifeng Xu, Ke Yang, Jiayuan Gu, Jingyi Yu, Jingya Wang, Ye Shi

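AI summary Achieving a humanoid locomotion policy that adapts to diverse terrains is a key open challenge. This paper proposes DreamPolicy, a unified policy framework that combines offline data with a diffusion world model so that a single policy can master locomotion on both known and unseen terrains. A terrain-aware world model generates physically plausible future trajectories that serve as dynamic objectives for a conditioned policy, avoiding hand-designed reward functions. Experiments show that DreamPolicy outperforms the best existing methods on unseen and composite terrains, providing a scalable, data-driven paradigm for generalist humanoid control.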

Details
English abstract

Achieving versatile humanoid locomotion with a single policy presents a critical scalability challenge. Prevailing methods often rely on distilling multiple terrain-specific teacher policies into a unified student policy. However, while such distillation captures basic locomotion primitives, it struggles to organically compose these skills to adapt to complex environments, resulting in poor generalization to novel composite terrains unseen during training. To overcome this, we present DreamPolicy, a unified framework that integrates offline data with a diffusion-based world model, enabling a single policy to master both known and unseen terrains. Central to our approach is a terrain-aware world model, driven by an autoregressive diffusion world model trained on aggregated rollouts from specialized policies. This model synthesizes physically plausible future trajectories, which serve as dynamic objectives for a conditioned policy, thereby bypassing manual reward engineering. Unlike distillation, our world model captures generalizable locomotion skills, allowing for robust zero-shot transfer to unseen composite terrains. DreamPolicy naturally scales with data availability. As the offline dataset expands, the diffusion world model continuously acquires richer skills. Experiments demonstrate that DreamPolicy outperforms the strongest baseline by up to 27\% on unseen terrains and 38\% on combined terrains. By unifying world model-based planning and policy learning, DreamPolicy breaks the "one task, one policy" bottleneck and establishes a scalable, data-driven paradigm for generalist humanoid control.

2505.11356 2026-05-13 cs.LG

Fractal Graph Contrastive Learning

Nero Z. Li, Xuehao Zhai, Zhichao Shi, Boshen Shi, Xuhui Jiang

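AI summary This paper proposes FractalGCL, a graph contrastive learning framework that addresses the limited control traditional graph augmentations have over global structural consistency. The method builds augmented graphs through renormalisation and introduces a fractal-dimension-aware contrastive loss to improve the consistency of positive samples and refine the repulsion of negative samples. To reduce computational overhead, the authors also design a Gaussian approximation that markedly improves runtime efficiency. Experiments show that FractalGCL performs well on multiple benchmark datasets and real-world traffic tasks, with good pretraining and transfer ability.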

Comments 32 pages, 7 figures

Details
English abstract

Graph Contrastive Learning (GCL) relies on semantically consistent graph augmentations, but common local perturbations provide limited control over global structural consistency, motivating a more principled global augmentation strategy. We therefore propose Fractal Graph Contrastive Learning (FractalGCL), a theory-motivated framework that constructs a renormalisation-based augmented graph and introduces a fractal-dimension-aware contrastive loss that penalises unreliable positive views and reweights negative-pair repulsion by finite-scale box-counting discrepancies. However, computing these discrepancies introduces substantial overhead, so we derive and justify a Gaussian surrogate that avoids repeated box-counting on renormalised graphs, yielding about a $61\%$ runtime reduction. Experiments show that FractalGCL serves as an effective frozen-pretraining tool on MalNet-Tiny, achieves strong performance on the standard TUDataset benchmarks, and outperforms the next-best method on real-world urban traffic tasks by $4.51$ percentage points in average accuracy. Code is available at https://anonymous.4open.science/r/FractalGCL-0511/.