arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.01972 2026-05-15 cs.CL cs.AI cs.LG

Hidden State Poisoning Attacks against Mamba-based Language Models

Alexandre Le Mercier, Chris Develder, Thomas Demeester

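Affiliations IDLab–T2K, Ghent University–imec

AI summary This paper studies Hidden State Poisoning Attacks (HiSPAs) against language models built on Mamba-style state space models (SSMs): specific short input phrases irreversibly overwrite information in the model's hidden state, causing partial forgetting. The authors introduce RoBench-25, a benchmark for evaluating a model's information retrieval ability under HiSPAs, and confirm that SSMs are vulnerable, including the recent hybrid Jamba-1.7-Mini; a Mamba-2-based hybrid, Nemotron-3-Nano, is also analysed. They further examine the effect of HiSPAs on other benchmarks and identify hidden-layer patterns that could serve as the basis for a mitigation system.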

Comments 29 pages, 4 figures

Abstract

State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs to such attacks. Even the recent Jamba-1.7-Mini SSM–Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. We further show that the theoretical and empirical findings extend to Mamba-2, and also analyse a Mamba-2-based hybrid (Nemotron-3-Nano). Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
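
The overwriting effect described in the abstract can be illustrated with a toy one-dimensional gated recurrence, h_t = a_t * h_{t-1} + (1 - a_t) * x_t. This is only a sketch: it is neither the paper's attack nor Mamba's actual parameterization, and the `scan` helper and the gate values are assumptions chosen for illustration. When some input drives the retention gate a_t to zero, the stored value is erased and later inputs cannot bring it back.

```python
def scan(inputs, gates):
    """Run h_t = a_t * h_{t-1} + (1 - a_t) * x_t and return the final state."""
    h = 0.0
    for x, a in zip(inputs, gates):
        h = a * h + (1.0 - a) * x
    return h


# A "fact" (value 5.0) is written at step 0, followed by neutral tokens.
inputs = [5.0, 0.0, 0.0, 0.0]

# Benign context: later tokens keep the gate at 1, so the state is retained.
benign = scan(inputs, gates=[0.0, 1.0, 1.0, 1.0])

# Hypothetical trigger at step 2: the gate collapses to 0 and the state is overwritten.
poisoned = scan(inputs, gates=[0.0, 1.0, 0.0, 1.0])

print(benign, poisoned)
```

In this toy setting the information is lost for good once the gate closes, which is the sense in which a recurrent state differs from attention over a full context.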

2512.22331 2026-05-15 cs.CV cs.AI

The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva, Maria Nisheva-Pavlova

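Affiliations Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski

AI summary This study aims to non-invasively predict MGMT promoter methylation status in glioblastoma (GBM) from multimodal magnetic resonance imaging (MRI) data, which matters for prognosis and treatment. To address the feature redundancy and weak modality-specific modeling of conventional unimodal and early-fusion methods, the authors propose a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves each modality's radiomic features in a compact probabilistic latent space and enables late fusion. Combined with a random forest classifier, the method reaches a test AUC of 0.77, substantially outperforming both the baseline model (AUC 0.54) and the hyperparameter-tuned model (AUC 0.64), which supports the value of multi-view probabilistic encoding for integrating complementary MRI information.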

Comments 17 pages, 4 figures

Abstract

Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) that preserves modality-specific radiomic structure while enabling late fusion in a compact probabilistic latent space. The approach is evaluated on radiomic features extracted from the necrotic tumor core in post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) Magnetic Resonance Imaging (MRI). Experimental results demonstrate that the proposed multi-view VAE combined with a random forest classifier achieves a test Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) of 0.77 (95% confidence interval: 0.71-0.83), substantially outperforming both a baseline radiomics model (AUC = 0.54) and a hyperparameter-tuned model (AUC = 0.64). These findings indicate that multi-view probabilistic encoding enables more effective integration of complementary MRI information and significantly improves predictive performance for MGMT promoter methylation status.

2512.22317 2026-05-15 cs.LG cs.AI cs.CV

LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Xudong Ling, Chaorong Li, Tianxi Huang, Qian Dong, Guiduo Duan

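Affiliations Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China; School of Computer Science and Technology (School of Artificial Intelligence), Yibin University; College of Humanities and General Education, Chengdu Textile College

AI summary Short-term precipitation nowcasting is a highly uncertain, under-constrained spatiotemporal forecasting problem, especially for rapidly evolving extreme weather events. This paper proposes LangPrecip, a language-aware multimodal nowcasting framework that treats meteorological text as a semantic motion constraint on precipitation evolution and, under the Rectified Flow paradigm, fuses text and radar information efficiently in latent space. The authors also build LangPrecip-160k, a large-scale multimodal dataset of 160k paired radar sequences and motion descriptions, and validate the method on the Swedish and MRMS datasets, where it markedly improves prediction under heavy rainfall.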

Abstract

Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.
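
For readers unfamiliar with the metric quoted above, the Critical Success Index (CSI) is hits / (hits + misses + false alarms) after thresholding forecast and observation. The sketch below assumes flat lists of rain-rate values and a made-up heavy-rain threshold; the thresholds and data layout used in the paper are not given in the abstract.

```python
def csi(pred, obs, threshold):
    """Critical Success Index for events defined as value >= threshold."""
    hits = misses = false_alarms = 0
    for p, o in zip(pred, obs):
        p_event, o_event = p >= threshold, o >= threshold
        if p_event and o_event:
            hits += 1
        elif o_event:
            misses += 1
        elif p_event:
            false_alarms += 1
    denom = hits + misses + false_alarms
    # With no observed or predicted events the score is undefined.
    return hits / denom if denom else float("nan")


# Illustrative values only (not from the paper).
observed = [0, 12, 30, 45, 5, 38]
forecast = [2, 10, 35, 20, 33, 40]
score = csi(forecast, observed, threshold=30)
print(score)  # 2 hits, 1 miss, 1 false alarm -> 0.5
```

Correct negatives do not enter the score, which is why CSI is commonly preferred over plain accuracy for rare heavy-rain events.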

2512.12083 2026-05-15 cs.CV

RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model

Guanfang Dong, Luke Schultz, Negar Hassanpour, Chao Gao

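Affiliations Huawei Technologies Canada Ltd.

AI summary This work proposes "RePack then Refine", a three-stage framework for efficiently exploiting the semantically rich features of vision foundation models (VFMs) to improve diffusion transformers (DiTs). A RePack module compresses high-dimensional VFM features onto a low-dimensional manifold, removing redundancy while preserving structural information; a standard DiT is then trained on the compressed latent space; finally, a Latent-Guided Refiner restores the high-frequency details lost in compression. On ImageNet-1K, the method reaches an FID of 1.82 in only 64 training epochs, improving to 1.65 with the Refiner.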

Abstract

Semantic-rich features from Vision Foundation Models (VFMs) have been leveraged to enhance Latent Diffusion Models (LDMs). However, raw VFM features are typically high-dimensional and redundant, increasing the difficulty of learning and reducing training efficiency for Diffusion Transformers (DiTs). In this paper, we propose RePack then Refine, a three-stage framework that brings the semantic-rich VFM features to DiT while further improving learning efficiency. Specifically, the RePack module projects the high-dimensional features onto a compact, low-dimensional manifold. This filters out the redundancy while preserving essential structural information. A standard DiT is then trained for generative modeling on this highly compressed latent space. Finally, to restore the high-frequency details lost due to the compression in RePack, we propose a Latent-Guided Refiner, which is trained last to enhance image details. On ImageNet-1K, RePack-DiT-XL/1 achieves an FID of 1.82 in only 64 training epochs. With the Refiner module, performance further improves to an FID of 1.65, significantly surpassing the latest LDMs in terms of convergence efficiency. Our results demonstrate that packing VFM features, followed by targeted refinement, is a highly effective strategy for balancing generative fidelity with training efficiency. Source code is publicly available at https://github.com/guanfangdong/RePack-then-Refine.

2512.11855 2026-05-15 cs.LG cs.AI

Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

Behrooz Tahmasebi, Melanie Weber

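Affiliations Harvard John A. Paulson School of Engineering and Applied Sciences; Harvard University

AI summary This paper studies the difference in cost between enforcing exact and approximate symmetry in machine learning, introducing the "averaging complexity" framework to quantify the cost of symmetry constraints. Under standard conditions, exact symmetry requires linear averaging complexity, whereas approximate symmetry needs only logarithmic complexity in the group size, an exponential gap. According to the authors, this is the first theoretical separation of the two cases; it explains why approximate symmetry may be preferable in practice and offers new tools for further study of symmetry in machine learning.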

Comments 33 pages, 2 figures. Published at ICLR 2026

Journal ref International Conference on Learning Representations (ICLR) 2026

Abstract

Enforcing exact symmetry in machine learning models often yields significant gains in scientific applications, serving as a powerful inductive bias. However, recent work suggests that relying on approximate symmetry can offer greater flexibility and robustness. Despite promising empirical evidence, there has been little theoretical understanding, and in particular, a direct comparison between exact and approximate symmetry is missing from the literature. In this paper, we initiate this study by asking: What is the cost of enforcing exact versus approximate symmetry? To address this question, we introduce averaging complexity, a framework for quantifying the cost of enforcing symmetry via averaging. Our main result is an exponential separation: under standard conditions, exact symmetry requires linear averaging complexity, whereas approximate symmetry can be attained with only logarithmic complexity in the group size. To the best of our knowledge, this provides the first theoretical separation of these two cases, formally justifying why approximate symmetry may be preferable in practice. Beyond this, our tools and techniques may be of independent interest for the broader study of symmetries in machine learning.
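
As a rough illustration of what "enforcing symmetry via averaging" means, the sketch below symmetrizes an arbitrary function over the cyclic group of shifts. Averaging over all group elements yields an exactly invariant function, while averaging over a few sampled elements is cheaper but only approximately invariant. The function `f`, the group, and the sample count are assumptions made for illustration; the paper's formal definition of averaging complexity and its bounds are not reproduced here.

```python
from fractions import Fraction
import random


def shift(x, k):
    """Cyclically shift the tuple x by k positions."""
    k %= len(x)
    return x[k:] + x[:k]


def f(x):
    """An arbitrary function that is not shift-invariant."""
    return sum(i * v for i, v in enumerate(x))


def full_average(x):
    """Average f over every cyclic shift: len(x) evaluations, exactly invariant."""
    n = len(x)
    return Fraction(sum(f(shift(x, k)) for k in range(n)), n)


def sampled_average(x, num_samples, rng):
    """Average f over a few random shifts: cheaper, only approximately invariant."""
    n = len(x)
    ks = [rng.randrange(n) for _ in range(num_samples)]
    return Fraction(sum(f(shift(x, k)) for k in ks), num_samples)


x = (3, 1, 4, 1, 5, 9, 2, 6)
y = shift(x, 3)  # a transformed copy of the same input

exact_gap = abs(full_average(x) - full_average(y))
rng = random.Random(0)
approx_gap = abs(sampled_average(x, 3, rng) - sampled_average(y, 3, rng))
print(exact_gap, approx_gap)
```

The exact gap is zero because shifting the input only permutes the terms of the full sum; the sampled version trades that guarantee for fewer evaluations.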

2512.07461 2026-05-15 cs.CL

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng

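Affiliations State Key Laboratory of General Artificial Intelligence, BIGAI

AI summary This paper proposes Native Parallel Reasoner (NPR), a teacher-free framework that lets large language models self-evolve genuine parallel reasoning capabilities. NPR moves from sequential reasoning to native parallel cognition through self-distilled progressive training, a Parallel-Aware Policy Optimization (PAPO) algorithm, and a reworked inference engine. Across eight reasoning benchmarks, NPR trained on Qwen3-4B improves performance by up to 24.5% and speeds up inference by up to 4.6x, with 100% genuine parallel execution, which the authors present as a new standard for efficient, scalable agentic reasoning.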

Abstract

We introduce Native Parallel Reasoner (NPR), a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a self-distilled progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel Parallel-Aware Policy Optimization (PAPO) algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust NPR Engine that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6x. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.

2512.03637 2026-05-15 cs.SD cs.LG stat.ML

AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Kohei Yamamoto, Kosuke Okusa

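Affiliations Research & Development Center, Technology Division, Oki Electric Industry Co., Ltd.; Department of Data Science for Business Innovation, Chuo University

AI summary This work proposes AaSP, a self-supervised pre-training framework for audio spectrogram Transformers that targets the aliasing introduced by temporal downsampling in conventional methods. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn audio representations that integrate features from alias-prone bands and remain stable across masked views. Under fine-tuning, AaSP achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among the compared self-supervised baselines while remaining competitive elsewhere.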

Comments Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Copyright IEEE

Abstract

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.

2512.03532 2026-05-15 cs.CV

OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu

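Affiliations PICO, ByteDance, Beijing

AI summary OpenTrack3D is an open-vocabulary 3D instance segmentation framework that aims to improve accuracy and generalization in complex, unstructured, mesh-free environments. It introduces a visual-spatial tracker that builds cross-view consistent object proposals online and extracts instance features from depth information and DINO feature maps, giving a mesh-free segmentation pipeline. OpenTrack3D also replaces CLIP with a multimodal large language model, markedly improving semantic understanding of complex user queries, and experiments report state-of-the-art performance on several benchmarks.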

Abstract

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering them inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) the weak textual reasoning of CLIP-based classifiers, which struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoints refinement module to further enhance performance when scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.

2512.02482 2026-05-15 cs.CV

G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

Vishwesh Nath, Javier G. Tejero, Aravind S. Kumar, Ruilong Li, Filippo Filicori, Mahdi Azizian, Sean D. Huver

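Affiliations NVIDIA; Northwell Health

AI summary This paper proposes G-SHARP, a real-time surgical scene reconstruction framework for minimally invasive procedures that need fast, accurate 3D modeling of deformable tissue. Built on the open-source GSplat (Apache-2.0) differentiable Gaussian rasterizer, it provides principled deformation modeling, robust occlusion handling, and high-fidelity reconstruction, with state-of-the-art reconstruction quality on the EndoNeRF dataset. The authors also provide a Holoscan SDK application deployable on NVIDIA IGX Orin and Thor edge devices for real-time surgical visualization in practical operating-room settings.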

Abstract

We propose G-SHARP, a commercially compatible, real-time surgical scene reconstruction framework designed for minimally invasive procedures that require fast and accurate 3D modeling of deformable tissue. While recent Gaussian splatting approaches have advanced real-time endoscopic reconstruction, existing implementations often depend on non-commercial derivatives, limiting deployability. G-SHARP overcomes these constraints by being the first surgical pipeline built natively on the GSplat (Apache-2.0) differentiable Gaussian rasterizer, enabling principled deformation modeling, robust occlusion handling, and high-fidelity reconstructions on the EndoNeRF pulling benchmark. Our results demonstrate state-of-the-art reconstruction quality with strong speed-accuracy trade-offs suitable for intra-operative use. Finally, we provide a Holoscan SDK application that deploys G-SHARP on NVIDIA IGX Orin and Thor edge hardware, enabling real-time surgical visualization in practical operating-room settings.

2511.21740 2026-05-15 cs.CL cs.AI

A cross-species neural foundation model for end-to-end speech decoding

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

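Affiliations Columbia University; Stanford University; Microsoft; University of Washington

AI summary This paper presents an end-to-end Brain-to-Text (BIT) framework that decodes neural activity directly into coherent sentences with a single neural network, improving the communication capability of brain-computer interfaces. The core of the method is a neural encoder pretrained across tasks and species, combined with audio large language models and contrastive learning, which yields a lower word error rate than the prior end-to-end method. The work sets new state-of-the-art results on the Brain-to-Text '24 and '25 benchmarks and demonstrates cross-task generalization.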

Abstract

Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end BraIn-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
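
The word error rate (WER) figures above follow the usual definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a generic implementation with a made-up sentence pair, not the benchmark's official scoring script; it also computes the relative reduction implied by the two reported WER values.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentence pair (not from the benchmark): 2 edits over 5 words.
example = wer("i want to drink water", "i want drink the water")
print(example)  # 0.4

# Relative reduction implied by the reported WERs (24.69% -> 10.22%).
relative_reduction = (24.69 - 10.22) / 24.69
print(round(relative_reduction, 3))  # about 0.586
```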

2511.21104 2026-05-15 cs.LG cs.PL

BRIDGE: Building Representations In Domain Guided Program Synthesis

Robert Joseph George, Carson Eisenach, Udaya Ghai, Dominique Perrault-Joncas, Anima Anandkumar, Dean Foster

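Affiliations California Institute of Technology; Amazon

AI summary BRIDGE is a structured prompting framework for multi-domain program synthesis that targets the difficulty of generating verifiable code in formal verification tools such as Lean. It links three domains (code generation, specification, and theorem/proof) and connects them through domain-specific intermediate reasoning. Experiments show that BRIDGE substantially improves the executable correctness of Lean code and also helps sample efficiency and Python code generation, indicating practical value for verified program synthesis.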

Comments 41 pages, 10 figures, 3 tables. Preprint

Abstract

Large language models can generate plausible code, but remain brittle for formal verification in proof assistants such as Lean. A central scalability challenge is that verified synthesis requires consistent artifacts across several coupled domains: executable code, formal specifications, theorem statements, and proof attempts. Existing approaches often treat these artifacts separately. We present BRIDGE, a structured prompting framework for multi-artifact program synthesis. BRIDGE decomposes generation into three interconnected domains: Code, Specification, and Theorem/Proof, and uses domain-specific intermediate reasoning to connect them. In Lean, BRIDGE often follows a code-first workflow, using the generated implementation as a semantic anchor for downstream specification, theorem statement, and proof-attempt generation. Across 178 algorithmic problems and five LLMs, BRIDGE improves Lean executable correctness by up to nearly 1.5x over direct prompting and can be roughly 2x more sample efficient at comparable generation lengths. We further find that specification-oriented prompting improves Python pass rates by up to 17.5 percentage points. Beyond inference-time prompting, supervised fine-tuning on BRIDGE-style reasoning traces yields nearly 1.5x higher Lean pass success than code-only fine-tuning, suggesting that these intermediate representations provide a learnable inductive bias. BRIDGE provides a practical framework for scaling verified synthesis while highlighting the remaining gap between executable correctness and full formal proof generation.

2511.18903 2026-05-15 cs.LG cs.AI cs.CL

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

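Affiliations Tsinghua University; Peng Cheng Laboratory

AI summary In curriculum-based pretraining of large language models (LLMs), how efficiently high-quality data is used is limited by the learning rate decay schedule. This paper finds that the advantage of training on data sorted by quality shrinks markedly under a decaying learning rate schedule. It proposes two simple, effective remedies: a more moderate learning rate decay, or replacing learning rate decay with model averaging, which improves results on several benchmarks without additional data refinement. The finding points to co-designing curriculum-based pretraining with optimization methods.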

Abstract

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. However, prior studies have reported limited improvements from such curriculum-based pretraining strategies. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate (LR) schedule. We find that while curriculum-based training substantially outperforms random shuffling when using a constant LR, its advantage diminishes under standard LR decay schedules. Our experiments show this incompatibility can be mitigated by two simple strategies: (1) employing a more moderate LR decay schedule, where the final LR is only moderately smaller than the peak LR, and (2) replacing LR decay with model averaging, i.e., computing a weighted average of the final few checkpoints. By combining these strategies, we improve the average score on a suite of standard benchmarks by 1.64% over random shuffling, without additional data refinement. Validated on 1.5B-parameter models trained over 30B tokens with various data-quality metrics, our findings call for a re-evaluation of curriculum-based LLM pretraining and underscore the potential of co-designing data curricula with optimization methods.
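
The two mitigation strategies can be sketched in a few lines. The cosine shape, the `final_ratio` value, and the checkpoint weights below are assumptions for illustration; the abstract does not specify the schedule family or the averaging weights the authors use.

```python
import math


def cosine_lr(step, total_steps, peak_lr, final_ratio):
    """Cosine decay from peak_lr down to final_ratio * peak_lr.

    A larger final_ratio gives a more moderate decay, so the high-quality
    data seen late in the curriculum is still trained with a usable LR.
    """
    floor = final_ratio * peak_lr
    progress = step / total_steps
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def average_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts (name -> list of floats)."""
    total = sum(weights)
    return {
        name: [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(checkpoints[0][name]))
        ]
        for name in checkpoints[0]
    }


# Strategy (1): final LR is 10% of the peak (hypothetical value).
start_lr = cosine_lr(0, 1000, peak_lr=3e-4, final_ratio=0.1)
end_lr = cosine_lr(1000, 1000, peak_lr=3e-4, final_ratio=0.1)

# Strategy (2): keep the LR constant and average the last few checkpoints instead.
merged = average_checkpoints(
    [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}],
    weights=[1.0, 3.0],  # later checkpoint weighted more heavily (hypothetical)
)
print(start_lr, end_lr, merged)
```

Both variants aim at the same thing: avoiding a near-zero learning rate at the point where the curriculum delivers its best data.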

2511.18739 2026-05-15 cs.AI cs.LG stat.ML

A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

Kaixiang Yang, Jiarong Liu, Yupeng Song, Shuanghua Yang, Yujue Zhou

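Affiliations School of Artificial Intelligence, Yunnan University; Beijing Normal University – Hong Kong Baptist University

AI summary Time series anomaly detection is widely used in IoT and cyber-physical systems, but evaluating it is hard because application scenarios vary and metrics rest on different assumptions. This paper proposes a problem-oriented taxonomy that reinterprets existing metrics by the evaluation problem each one addresses, grouping them into six dimensions: accuracy, timeliness, tolerance to labeling imprecision, penalties for human-audit cost, robustness to random scores, and cross-dataset comparability. Experiments analyze metric behavior across scenarios and quantify each metric's ability to separate genuine detections from random noise; most event-level metrics separate well, while some widely used metrics are sensitive to random-score inflation, so metrics should be chosen according to the task.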

Abstract

Time series anomaly detection is widely used in IoT and cyber-physical systems, yet its evaluation remains challenging due to diverse application objectives and heterogeneous metric assumptions. This study introduces a problem-oriented framework that reinterprets existing metrics based on the specific evaluation challenges they are designed to address, rather than their mathematical forms or output structures. We categorize over twenty commonly used metrics into six dimensions: 1) basic accuracy-driven evaluation; 2) timeliness-aware reward mechanisms; 3) tolerance to labeling imprecision; 4) penalties reflecting human-audit cost; 5) robustness against random or inflated scores; and 6) parameter-free comparability for cross-dataset benchmarking. Comprehensive experiments are conducted to examine metric behavior under genuine, random, and oracle detection scenarios. By comparing their resulting score distributions, we quantify each metric's discriminative ability -- its capability to distinguish meaningful detections from random noise. The results show that while most event-level metrics exhibit strong separability, several widely used metrics (e.g., NAB, Point-Adjust) demonstrate limited resistance to random-score inflation. These findings reveal that metric suitability must be inherently task-dependent and aligned with the operational objectives of IoT applications. The proposed framework offers a unified analytical perspective for understanding existing metrics and provides practical guidance for selecting or developing more context-aware, robust, and fair evaluation methodologies for time series anomaly detection.
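
To see why a metric such as Point-Adjust can be inflated, consider the common formulation of the protocol: if any point inside a labeled anomaly segment is flagged, the whole segment is counted as detected. The sketch below follows that common formulation with made-up labels and predictions; it is not the paper's experimental code.

```python
def point_adjust(pred, label):
    """Mark an entire labeled segment as detected if any point in it was flagged."""
    adjusted = list(pred)
    i, n = 0, len(label)
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:
                j += 1  # j is one past the end of the anomaly segment
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted


def f1(pred, label):
    """Point-wise F1 score for binary sequences."""
    tp = sum(1 for p, l in zip(pred, label) if p == 1 and l == 1)
    fp = sum(1 for p, l in zip(pred, label) if p == 1 and l == 0)
    fn = sum(1 for p, l in zip(pred, label) if p == 0 and l == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


label = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # a single lucky hit inside the segment

raw_f1 = f1(pred, label)
adjusted_f1 = f1(point_adjust(pred, label), label)
print(raw_f1, adjusted_f1)  # 2/7 before adjustment, 1.0 after
```

A detector that fires at random has a good chance of landing one point in a long segment, which is how adjustment can turn near-random output into a high score.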

2511.17367 2026-05-15 cs.LG

R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao

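Affiliations School of Artificial Intelligence, University of Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences

AI summary This paper studies how to design worst-case robust, real-time pursuit strategies for pursuit-evasion games (PEGs) under partial observability. To address the limits of existing methods under imperfect information and asynchronous moves, the authors propose R2PS, which combines dynamic programming with a belief preservation mechanism, extends the traditional strategies to the partially observable setting, and embeds them in a state-of-the-art reinforcement learning framework. The resulting policy generalizes robustly to unseen graph structures without additional training and outperforms existing methods in experiments.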

Abstract

Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader's position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers' actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.
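
One simple way to picture a belief over the evader's possible positions is as a set of graph nodes that grows as the evader may move and shrinks as pursuers observe empty nodes. The sketch below is a set-based illustration under assumed rules (the evader stays or moves to a neighbor, then pursuers observe); the function name, the update order, and the example graph are hypothetical, and the paper's belief preservation mechanism may differ.

```python
def update_belief(belief, graph, observed_empty, sighting=None):
    """Return the set of nodes where the evader could be after one step.

    belief: set of nodes the evader might currently occupy
    graph: adjacency dict, node -> iterable of neighbors
    observed_empty: nodes the pursuers can see and found empty
    sighting: the evader's node if it was directly observed, else None
    """
    if sighting is not None:
        return {sighting}
    reachable = set()
    for node in belief:
        reachable.add(node)            # the evader may stay put
        reachable.update(graph[node])  # or move to any neighbor
    return reachable - set(observed_empty)


# Path graph 0 - 1 - 2 - 3 - 4 (illustrative).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

belief = {2}
belief = update_belief(belief, graph, observed_empty={1})
step1 = set(belief)
belief = update_belief(belief, graph, observed_empty={1, 2})
step2 = set(belief)
print(step1, step2)
```

A worst-case pursuer policy can then be evaluated against every node in the belief set rather than against a single known evader position.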

2511.15408 2026-05-15 cs.CL cs.AI cs.IR cs.MA cs.NE

Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

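Affiliations Tongji University; Microsoft Research Asia; Northeastern University; University of British Columbia

AI summary This work addresses Chinese short-form creative content generation, where results are hard to verify under personalized constraints, with an explanation-oriented multi-objective optimization approach. It models the task as a heterogeneous multi-objective optimization problem that jointly optimizes the generated content and the reliability of its explanation, and designs MAGIC-HMO, a training-free multi-agent framework that optimizes through iterative generation and verification. Experiments on the Chinese Baby Naming benchmark show that the method significantly outperforms existing models.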

Comments 19 pages, 10 figures. Submitted to ACM for possible publication

Abstract

Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short-form creative natural language generation (CNLG). In the real world, users often require personalized, fine-grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, whether constraints are met can be inferred from sufficient observable cues. Existing studies are mainly outcome-oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short-form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these issues, we formalize the Chinese short-form CNLG task as a heterogeneous multi-objective optimization (HMO) problem that jointly optimizes multiple personalized constraints and explanation reliability. We further propose MAGIC-HMO, a training-free multi-agent framework that optimizes these objectives through iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on Chinese Baby Naming, a challenging benchmark, demonstrate that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones. Data and code are available at https://github.com/foolfun/MAGIC_HMO.

2511.14823 2026-05-15 cs.LG cs.CV

Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

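Affiliations Institute of Technology, University of Tartu; 3S Holding OÜ

AI summary Current machine learning models do well on static tasks but struggle with continual adaptation and lifelong learning in non-stationary environments because of rigid architectures. This paper proposes dynamic nested hierarchies, which let a model autonomously adjust the number of optimization levels, their nesting structure, and their update frequencies during training or inference, enabling self-evolution without predefined constraints. The authors report mathematical derivations and experiments showing superior performance in language modeling, continual learning, and long-context reasoning, which they present as groundwork for adaptive general-purpose intelligence.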

Comments 12 pages, 1 figure

Journal ref Frontiers in Artificial Intelligence, 2026

Abstract

Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures that hinder continual adaptation and lifelong learning. Building upon the nested learning paradigm, which decomposes models into multi-level optimization problems with fixed update frequencies, this work proposes dynamic nested hierarchies as the next evolutionary step in advancing artificial intelligence and machine learning. Dynamic nested hierarchies empower models to autonomously adjust the number of optimization levels, their nesting structures, and update frequencies during training or inference, inspired by neuroplasticity to enable self-evolution without predefined constraints. This innovation addresses the anterograde amnesia in existing models, facilitating true lifelong learning by dynamically compressing context flows and adapting to distribution shifts. Through rigorous mathematical formulations, theoretical proofs of convergence, expressivity bounds, and sublinear regret in varying regimes, alongside empirical demonstrations of superior performance in language modeling, continual learning, and long-context reasoning, dynamic nested hierarchies establish a foundational advancement toward adaptive, general-purpose intelligence.

2511.13397 2026-05-15 cs.CV cs.AI

Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

Affiliations * Department of Electronic and Computer Engineering, University of Limerick; Data Driven Computer Engineering Research Centre, University of Limerick; Lero, The Irish Software Research Centre, University of Limerick; Valeo Vision Systems

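AI summary This paper presents Distance-Annotated Traffic Perception Question Answering (DTPQA), a visual question answering benchmark for evaluating the perception capabilities of vision-language models in traffic scenes. The benchmark comprises a synthetic dataset and a real-world dataset, and annotates each question with the distance between the target object and the camera, making it possible to analyze perception performance at different distances. It offers a new, targeted tool for evaluating model perception in automated driving.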

Journal ref IEEE Data Descriptions, 2026

Abstract

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
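The sample structure described above (image, question, ground truth answer, object distance) lends itself to a per-distance accuracy breakdown. The following is a minimal sketch of that analysis, not the authors' released scripts: the class name `DTPQASample`, its field names, the bin edges, and the exact-match scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DTPQASample:
    # Field names are illustrative; the released dataset may use different keys.
    image_path: str
    question: str
    answer: str
    distance_m: float  # distance from the camera to the object in question


def accuracy_by_distance(samples, predictions, bin_edges=(0, 10, 20, 30, 40, 50)):
    """Return {(lo, hi): accuracy} over half-open distance bins [lo, hi).

    Samples whose distance falls outside all bins are ignored.
    Scoring is case-insensitive exact match (an assumption).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= sample.distance_m < hi:
                totals[(lo, hi)] += 1
                correct = pred.strip().lower() == sample.answer.strip().lower()
                hits[(lo, hi)] += int(correct)
                break
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

Plotting the returned accuracies against the bin midpoints gives the kind of degradation-with-distance curve the abstract refers to.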

2511.13026 2026-05-15 cs.CV

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

Affiliations * MiLM Plus, Xiaomi Inc.; Renmin University of China

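AI summary This paper proposes REVISOR, a new framework for improving the reasoning of multimodal large language models on long-form video understanding. To address the limitations of purely text-based reflection on long videos, REVISOR introduces a multimodal reflection mechanism that draws on visual information during reflection, and designs a Dual Attribution Decoupled Reward mechanism that strengthens the model's ability to identify and use key video segments. The method needs no additional supervised fine-tuning or external models and significantly improves performance on several long-form video understanding benchmarks.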

Abstract

Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. There are two fundamental reasons for this: (1) long-form video understanding involves richer and more dynamic visual input, meaning that rethinking only the text information is insufficient and a further rethinking process specifically targeting visual information is needed; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we design the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.

2511.08565 2026-05-15 cs.CL cs.AI cs.CY

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Davi Bastos Costa, Felippe Alves, Renato Vicente

Affiliations * TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science; University of São Paulo

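AI summary This study examines the moral responses of large language models under persona role-play, using the Moral Foundations Questionnaire (MFQ) to build a benchmark that quantifies moral susceptibility and moral robustness. Analyzing how moral judgments change across personas with two complementary methods, it finds that moral robustness differs markedly between model families, with the Claude family the most robust, whereas moral susceptibility varies little, shows no clear family dependence, and appears to be determined mainly by pre-training. The study shows how persona conditioning affects models' moral behavior and provides moral foundation profiles for models without persona role-play and for personas averaged across models.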

Comments Added experiments with a logit-based method and now reporting unbounded metrics

Abstract

Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, in which an LLM is prompted to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas. We estimate these quantities with two complementary procedures: repeated sampling, and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about 152%, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about 13%, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
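The abstract says only that the two metrics are defined from the variability of MFQ scores across and within personas; it does not give the formulas. The sketch below shows one plausible reading under stated assumptions: susceptibility as the spread of per-persona mean ratings, and robustness as the inverse of the average within-persona spread. The function name and both definitions are hypothetical and should not be taken as the paper's.

```python
from statistics import mean, pstdev


def susceptibility_and_robustness(scores):
    """scores maps persona -> list of MFQ ratings from repeated runs of one item.

    Assumed definitions (not taken from the paper):
      susceptibility = std. dev. of the per-persona mean ratings (across personas)
      robustness     = 1 / mean per-persona std. dev. (within personas)
    """
    persona_means = [mean(runs) for runs in scores.values()]
    within_stds = [pstdev(runs) for runs in scores.values()]
    susceptibility = pstdev(persona_means)
    avg_within = mean(within_stds)
    # A model that always gives the same rating per persona is maximally robust.
    robustness = float("inf") if avg_within == 0 else 1.0 / avg_within
    return susceptibility, robustness
```

Under this reading, a model whose ratings barely move between repeated runs gets a large robustness value, while a model whose mean ratings shift strongly from persona to persona gets a large susceptibility value.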

2511.02776 2026-05-15 cs.RO

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, Meng Li, Qingjie Liu, Shanghang Zhang, Min Wan, Jian Tang

Affiliations * Beijing Innovation Center of Humanoid Robotics, Beijing, China; School of Mechanical Engineering and Automation, Beihang University, Beijing, China; State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University, Beijing, China; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China

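AI summary This paper presents XR-1, a general-purpose vision-language-action (VLA) model for diverse robots, tasks, and environments, aimed at two challenges facing existing models: generating precise low-level actions and aligning heterogeneous data sources. XR-1 introduces Unified Vision-Motion Codes (UVMC), a joint discrete representation of visual dynamics and robot motion learned with a dual-branch VQ-VAE, which yields clear gains in action generation and cross-modal alignment. Experiments show that XR-1 achieves superior performance and good generalization across a range of real robots and tasks.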

Comments Accepted to ICML2026 as spotlight

Abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $π_{0.5}$, $π_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.

2510.23868 2026-05-15 cs.LG cs.CL

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

Affiliations * Inflection AI

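AI summary This paper investigates whether reward matching can serve as an alternative to reward maximization for on-policy reinforcement learning of large language models. It proposes GIFT, which combines GRPO's group sampling, DPO's implicit reward, and UNA's mean squared error between explicit and implicit advantages; z-score standardization cancels the intractable term in the DPO implicit reward and removes the KL coefficient β from the RLHF and RLVR objectives. Experiments show that GIFT converges faster and overfits less across several tasks, and achieves higher length-controlled win rates than existing methods.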

Abstract

This paper investigates whether reward matching is a viable alternative to reward maximization methods for on-policy RL of LLMs. Group-relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, DPO-style implicit reward, and UNA-style MSE between implicit and explicit advantages. By applying z-score standardization, the intractable partition function $Z(x)$ in the DPO implicit reward is canceled, and the KL coefficient $β$ is eliminated from the RLHF and RLVR objectives. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $π^{*}_β(y|x) \propto π_{\text{ref}}(y|x)\, e^{\frac{1}{β} r_ϕ(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $β(x) = \frac{σ_ϕ(x)}{\hat{σ}_θ(x)}$. GIFT therefore solves the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $β$ with a prompt-adaptive $β(x)$ optimized endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO, and GSPO, overfits less on RLVR (GSM8K, MATH, AIME), and produces higher length-controlled win rates on RLHF (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
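The cancellation claimed above can be seen numerically. In the standard DPO formulation the implicit reward is β·log(π_θ/π_ref) + β·log Z(x); within one prompt's group, the second term is a constant shift and β is a positive scale, so both disappear under z-scoring. The sketch below illustrates that idea for a single group. It is not the authors' implementation: the function names, the use of sequence-level log-probabilities, and the plain MSE are assumptions.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a group of values to zero mean and unit variance.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def gift_style_loss(logp_policy, logp_ref, rewards):
    """MSE between z-scored implicit and explicit rewards for one prompt's group.

    logp_policy / logp_ref: sequence log-probabilities of each sampled response
    under the current policy and the reference policy.
    rewards: explicit rewards (e.g. from a reward model or a verifier).
    """
    # DPO-style implicit reward is beta * log-ratio + beta * log Z(x).
    # Within one prompt, beta * log Z(x) is a constant shift and beta a positive
    # scale, so both vanish under z-scoring; the raw log-ratio suffices.
    log_ratio = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    implicit_adv = zscore(log_ratio)
    explicit_adv = zscore(rewards)
    return mean([(a - b) ** 2 for a, b in zip(implicit_adv, explicit_adv)])
```

Because `zscore` is invariant to any positive affine transformation, neither β nor Z(x) needs to be known or tuned to evaluate this loss, which is the property the abstract highlights.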

2510.20206 2026-05-15 cs.CV

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu

Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

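AI summary RAPO++ is a cross-stage prompt optimization framework for text-to-video generation that addresses the mismatch between user prompts and training data. It combines Retrieval-Augmented Prompt Optimization (RAPO) with Sample-Specific Prompt Optimization (SSPO), which uses multi-source feedback on semantic alignment, spatial fidelity, and temporal coherence to progressively improve the generated video, and then fine-tunes a language model for efficient prompt generation. Experiments show that RAPO++ significantly improves semantic consistency, compositional plausibility, and spatiotemporal stability across several state-of-the-art models and benchmarks, making it a model-agnostic, efficient, and scalable solution.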

Comments arXiv admin note: text overlap with arXiv:2504.11739

Abstract

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

2510.17434 2026-05-15 cs.CV

Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

Affiliations * Sigmedia Group; Department of Electronic and Electrical Engineering

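AI summary This work uses the motion vectors of AV1 video coding to produce dense sub-pixel feature correspondences and filters short tracks by cosine consistency. On short videos the method runs efficiently, uses little CPU, and yields denser matches with good geometric consistency. In a small structure-from-motion demo the method performs well, indicating that compressed-domain feature matching is a practical option for large-scale applications.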

Comments Accepted ICIR 2025, camera-ready version

Abstract

We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. In a small SfM demo on a 117-frame clip, MV matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; BA time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.
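The abstract does not spell out the cosine-consistency filter, so the following is only one possible reading: chain motion vectors from frame to frame and end a track as soon as the direction of consecutive displacements disagrees. The function names, the threshold `min_cos`, and the representation of a motion-vector field as a callable are all hypothetical; a real pipeline would read block-level vectors from an AV1 decoder.

```python
import math


def cosine(u, v):
    """Cosine similarity of two 2-D vectors; 1.0 if either is (near) zero."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu < 1e-9 or nv < 1e-9:
        return 1.0  # treat a static point as consistent with anything
    return (u[0] * v[0] + u[1] * v[1]) / (nu * nv)


def build_track(start, mv_fields, min_cos=0.9):
    """Chain per-frame motion vectors from `start`; stop when direction changes.

    mv_fields: list of callables, one per frame transition, mapping a point
    (x, y) to its motion vector (dx, dy). These are stand-ins for vectors
    that would be extracted from the compressed bitstream.
    Returns the list of sub-pixel positions visited.
    """
    track, prev = [start], None
    for field in mv_fields:
        x, y = track[-1]
        mv = field((x, y))
        if prev is not None and cosine(prev, mv) < min_cos:
            break  # inconsistent motion: end the short track here
        track.append((x + mv[0], y + mv[1]))
        prev = mv
    return track
```

Each consecutive pair of positions in a surviving track is a candidate correspondence that could then be passed to pairwise geometry estimation or SfM.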

2510.15982 2026-05-15 cs.LG cs.AI

AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution

Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon

Affiliations * Korea Advanced Institute of Science and Technology

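AI summary This paper proposes AMiD, a knowledge distillation method for reducing the computational and memory cost of large language models. It introduces an α-mixture assistant distribution whose new design variable α extends the range of conventional assistant distributions, and builds a unified knowledge distillation framework around it. Experiments show that AMiD outperforms existing methods in performance and training stability, with a broader theoretical grounding.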

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches that implicitly or explicitly incorporate an assistant distribution have recently been proposed. However, past proposals of assistant distributions have been fragmented, without a systematic investigation of the interpolation path and the divergence. This paper proposes the $α$-mixture assistant distribution, a novel generalized family of assistant distributions, and $α$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $α$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $α$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space. We release the code at https://github.com/aailab-kaist/AMiD.
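To make the idea of an interpolation path indexed by α concrete, the sketch below uses the weighted α-mean from information geometry, in which α = -1 recovers the arithmetic mixture and α = 1 the geometric (log-linear) mixture. That the paper's α-mixture follows this exact parameterization is an assumption; the function name `alpha_mixture` and the weight `lam` are hypothetical, and inputs are assumed to be strictly positive.

```python
import math


def alpha_mixture(p, q, lam, alpha):
    """Normalized weighted alpha-mean of two discrete distributions.

    p, q: strictly positive probability lists over the same vocabulary
          (for instance a teacher and a student distribution).
    lam: interpolation weight on q, in [0, 1].
    alpha: -1 gives the arithmetic mixture, 1 the geometric mixture.
    """
    if abs(alpha - 1.0) < 1e-12:
        # Limit case: weighted geometric mean (log-linear interpolation).
        raw = [math.exp((1 - lam) * math.log(a) + lam * math.log(b))
               for a, b in zip(p, q)]
    else:
        e = (1.0 - alpha) / 2.0
        raw = [((1 - lam) * a ** e + lam * b ** e) ** (1.0 / e)
               for a, b in zip(p, q)]
    z = sum(raw)  # renormalize so the result is a distribution
    return [r / z for r in raw]
```

Sweeping `alpha` between these two endpoints traces a continuous family of assistant distributions between teacher and student, which is the kind of design space the abstract describes.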

2510.15849 2026-05-15 cs.CV

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin

Affiliations * Tsinghua University, China; The Fifth Affiliated Hospital of Wenzhou Medical University, China; Shenzhen Traditional Chinese Medicine Hospital, China; Wenzhou Medical University, China; The First Hospital of Hebei Medical University, China; Chinese Medicine Guangdong Laboratory/Hengqin Laboratory, China

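AI summary This paper proposes Memory-SAM, a tongue segmentation method that requires neither human prompts nor training: it retrieves features from prior cases and generates effective prompts to guide the SAM2 model. Using dense DINOv3 features and FAISS retrieval, it automatically extracts foreground and background prompts from a small set of prior cases to achieve high-accuracy segmentation. On a dataset of 600 expert-annotated images, Memory-SAM outperforms the compared methods, with the clearest gains under real-world conditions.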

Abstract

Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.

2510.13016 2026-05-15 cs.CV

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

Affiliations * LMU Munich; MCML; Technical University of Munich; University of Zurich; University of Oxford; Amazon; NVIDIA

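AI summary This paper introduces SVAG-Bench, a large-scale benchmark for evaluating multi-instance spatio-temporal video action grounding. The task requires models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query, giving a unified understanding of multiple actions in complex scenes. SVAG-Bench contains 688 videos with dense, fine-grained annotations, supports detailed evaluation of multi-actor disambiguation, temporal overlap, and action compositionality, and comes with a standardized evaluation toolkit and a modular baseline model, SVAGFormer.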

Abstract

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.

2510.11282 2026-05-15 cs.LG

Vision-LLMs for Spatiotemporal Traffic Forecasting

Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry

Affiliations * Institute of Automation, Chinese Academy of Sciences; Southwest University; Department of Computing and Communication Engineering, Beijing University of Science and Technology; Department of Electrical and Computer Engineering, Northwestern University

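AI summary This paper studies how to use vision large language models (Vision-LLMs) for spatiotemporal traffic forecasting. To address the inefficiency of conventional large language models on grid-based traffic data and their difficulty modeling complex spatial dependencies, it proposes a new framework, ST-Vision-LLM. The method treats traffic forecasting as a vision-language fusion problem: a visual encoder processes historical traffic matrices, and an efficient numerical encoding scheme together with a two-stage fine-tuning strategy markedly improves long-term prediction and cross-domain few-shot performance. Experiments on several real-world traffic datasets show higher prediction accuracy than existing methods.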

Abstract

Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While large language models have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending large language models to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of large language models in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with supervised fine-tuning and then further optimized for predictive accuracy using group relative policy optimization, a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the best baseline by around 30% on average in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.

2510.07086 2026-05-15 cs.LG

Non-Stationary Online Structured Prediction with Surrogate Losses

Shinsaku Sakaue, Han Bao, Yuzhou Cao

Affiliations * CyberAgent, Tokyo, Japan; National Institute of Informatics, Tokyo, Japan; Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan; The Institute of Statistical Mathematics, Tokyo, Japan; Tohoku University, Miyagi, Japan; Nanyang Technological University, Singapore

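AI summary This paper studies online structured prediction in non-stationary environments, aiming to bound the target loss by way of surrogate losses. The authors derive a new upper bound that depends on the cumulative surrogate loss and path length of a comparator sequence rather than directly on the time horizon $T$, giving stronger guarantees under non-stationarity. The core approach combines the dynamic regret analysis of online gradient descent with the exploit-the-surrogate-gap technique and motivates a Polyak-style learning rate that is supported by the analysis and performs well in practice. The method is also extended to broader settings via the convolutional Fenchel-Young loss.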

Abstract

Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. In this setting, the surrogate regret, i.e., the cumulative excess of the actual target loss (e.g., the 0-1 loss) over the surrogate loss (e.g., the logistic loss) incurred by the best fixed estimator, has gained attention because it admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur surrogate loss that grows linearly with $T$. To address this limitation, we obtain an upper bound of $F_T + O(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence and $P_T$ is its path length. This bound depends on $T$ only through $F_T$ and $P_T$, thus offering stronger guarantees under non-stationarity. Our core idea is to combine the dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. This viewpoint sheds light on the usefulness of a Polyak-style learning rate for OGD, which systematically yields target-loss bounds and performs well empirically. We then extend our approach to broader settings beyond prior work via the convolutional Fenchel-Young loss. Finally, a lower bound shows that the dependence on $F_T$ and $P_T$ is tight.
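As a rough illustration of the two ingredients named above, the sketch below shows one OGD update for binary classification with the logistic surrogate loss, using the classical Polyak step size (loss divided by squared gradient norm, taking the optimal loss value to be 0), together with the path length $P_T$ of a comparator sequence. The paper's actual learning rate may differ, for example by incorporating the surrogate gap; the function names are hypothetical.

```python
import math


def _margin(w, x, y):
    return y * sum(wi * xi for wi, xi in zip(w, x))


def logistic_loss(w, x, y):
    """Surrogate loss log(1 + exp(-y * <w, x>)) for a label y in {-1, +1}."""
    m = _margin(w, x, y)
    # Numerically stable form of log(1 + exp(-m)).
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def _sigmoid_neg(m):
    """Stable sigmoid(-m) = 1 / (1 + exp(m))."""
    if m >= 0:
        e = math.exp(-m)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(m))


def polyak_ogd_step(w, x, y):
    """One OGD update with step size loss / ||grad||^2 (classical Polyak, f* = 0)."""
    sig = _sigmoid_neg(_margin(w, x, y))
    grad = [-y * sig * xi for xi in x]
    sq_norm = sum(g * g for g in grad)
    if sq_norm == 0.0:
        return list(w)  # zero gradient: nothing to update
    eta = logistic_loss(w, x, y) / sq_norm
    return [wi - eta * gi for wi, gi in zip(w, grad)]


def path_length(comparators):
    """P_T = sum over t of ||u_t - u_{t-1}||_2 for a comparator sequence."""
    return sum(math.dist(u, v) for u, v in zip(comparators[1:], comparators[:-1]))
```

The step size needs no tuning and adapts to the current loss: a large loss with a small gradient produces a long step, while a well-classified example barely moves the weights.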

2510.04682 2026-05-15 cs.CL cs.AI

TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung, Jaehyung Kim

Affiliations * Yonsei University

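AI summary This paper proposes TiTok, a new framework addressing the problem that LoRA fine-tuned parameters cannot be transferred across different base models. It extracts task-relevant knowledge contrastively at the token level from a source model with and without LoRA, enabling efficient LoRA transplantation. Experiments show that TiTok performs well on multiple benchmarks, with average gains of 4% to 10% over baselines.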

Comments ICLR 2026

Abstract

Large Language Models (LLMs) are widely applied in real-world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of 4-10% over baselines.
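The token-wise contrastive excess described above can be read as the per-token log-probability gap between the source model with LoRA and the same model without it. The sketch below follows that reading and uses it both to pick informative tokens and to filter synthetic samples. The function names, the `keep_ratio` parameter, and the top-k selection rule are assumptions for illustration, not the paper's procedure.

```python
def contrastive_excess(logp_with_lora, logp_base):
    """Per-token excess: log p_LoRA(token) - log p_base(token)."""
    return [a - b for a, b in zip(logp_with_lora, logp_base)]


def top_tokens(excess, keep_ratio=0.5):
    """Indices of the highest-excess tokens, returned in original order."""
    k = max(1, int(len(excess) * keep_ratio))
    ranked = sorted(range(len(excess)), key=lambda i: excess[i], reverse=True)
    return sorted(ranked[:k])


def filter_samples(samples, keep_ratio=0.5):
    """Keep the synthetic samples with the highest mean excess.

    samples: list of (text, logp_with_lora, logp_base) tuples, where the two
    log-probability lists are aligned token by token.
    """
    scored = []
    for text, lp_lora, lp_base in samples:
        ex = contrastive_excess(lp_lora, lp_base)
        scored.append((sum(ex) / len(ex), text))
    scored.sort(key=lambda t: t[0], reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return [text for _, text in scored[:k]]
```

Tokens with a large positive excess are those the adapter makes much more likely than the base model does, so they are the ones most likely to carry the task-specific knowledge worth transferring to the target backbone.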

2510.02952 2026-05-15 cs.LG

ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data

Santanu Subhash Rathod, Francesco Ceccarelli, Sean B. Holden, Pietro Liò, Xiao Zhang, Jovan Tanevski

Affiliations * CISPA Helmholtz Center for Information Security; Department of Computer Science and Technology, University of Cambridge; Institute for Computational Biomedicine, Heidelberg University Hospital

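AI summary This paper proposes ContextFlow, a context-aware flow matching framework for inferring trajectories of structural tissue dynamics from spatial omics data. It integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective, yielding trajectories that are statistically consistent and biologically meaningful. Experiments show that ContextFlow outperforms existing methods on multiple quantitative and qualitative metrics and generalizes well.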

Comments 42 pages, 21 figures, 30 tables

Abstract

Inferring trajectories from longitudinal spatially-resolved omics data is fundamental to understanding the dynamics of structural and functional tissue changes in development, regeneration and repair, disease progression, and response to treatment. We propose ContextFlow, a novel context-aware flow matching framework that incorporates prior knowledge to guide the inference of structural tissue dynamics from spatially resolved omics data. Specifically, ContextFlow integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective. By embedding these contextual constraints, ContextFlow generates trajectories that are not only statistically consistent but also biologically meaningful, making it a generalizable framework for modeling spatiotemporal dynamics from longitudinal, spatially resolved omics data. Evaluated on three datasets, ContextFlow consistently outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics of inference accuracy and biological coherence. Our code is available at https://github.com/santanurathod/ContextFlow.