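arXivDaily: a daily academic digest that syncs the full arXiv data and adds AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.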

2601.01972 2026-05-15 cs.CL cs.AI cs.LG

Hidden State Poisoning Attacks against Mamba-based Language Models

Alexandre Le Mercier, Chris Develder, Thomas Demeester

Affiliations: IDLab–T2K, Ghent University–imec

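AI summary: This paper studies Hidden State Poisoning Attacks (HiSPA) against language models built on Mamba-based state space models (SSMs). The attack uses specific short input phrases to irreversibly overwrite information in the model's hidden state, causing partial amnesia. The authors introduce RoBench-25, a benchmark for evaluating a model's information retrieval ability under HiSPA, and confirm that SSMs are vulnerable to the attack, including the recent hybrid models Jamba-1.7-Mini and Nemotron-3-Nano. They also analyze the impact of HiSPA on other benchmarks and propose a hidden-layer pattern analysis that may help mitigate the attack.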

Comments 29 pages, 4 figures

2512.22331 2026-05-15 cs.CV cs.AI

The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

Mariya Miteva, Maria Nisheva-Pavlova

Affiliations: Faculty of Mathematics and Informatics – Sofia University St. Kliment Ohridski

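AI summary: This study aims to non-invasively predict MGMT promoter methylation status in glioblastoma (GBM) from multimodal magnetic resonance imaging (MRI) data, which matters for prognosis and treatment. To address the feature redundancy and weak modality-specific modeling of conventional unimodal and early-fusion approaches, the authors propose a multi-view latent representation learning framework based on variational autoencoders (VAEs) that preserves each modality's imaging features in a compact probabilistic latent space and performs late fusion. In experiments, the method combined with a random forest classifier reaches an AUC of 0.77 on the test set, significantly outperforming both the baseline and the tuned models, which supports multi-view probabilistic encoding as a way to integrate complementary MRI information and improve prediction.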

Comments 17 pages, 4 figures

2512.22317 2026-05-15 cs.LG cs.AI cs.CV

LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

Xudong Ling, Chaorong Li, Tianxi Huang, Qian Dong, Guiduo Duan

Affiliations: Laboratory of Intelligent Collaborative Computing, University of Electronic Science; School of Computer Science; Technology (School of Artificial Intelligence), Yibin University; College of Humanities; General Education, Chengdu Textile College

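AI summary: Short-term precipitation nowcasting is a highly uncertain, under-constrained spatiotemporal forecasting problem, especially for rapidly evolving extreme weather events. This paper proposes LangPrecip, a language-aware multimodal nowcasting framework that treats meteorological text as a semantic motion constraint on precipitation evolution and, using the rectified flow paradigm, fuses text and radar information efficiently in latent space. The authors also build LangPrecip-160k, a large-scale multimodal dataset of 160k paired radar sequences and motion descriptions, and validate the method on Swedish and MRMS datasets, with significant gains in heavy-rainfall forecasting.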

2512.12083 2026-05-15 cs.CV

RePack then Refine: Efficient Diffusion Transformer with Vision Foundation Model

Guanfang Dong, Luke Schultz, Negar Hassanpour, Chao Gao

Affiliations: Huawei Technologies Canada Ltd.

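AI summary: This study proposes "RePack then Refine", a three-stage framework for efficiently using the semantically rich features of vision foundation models (VFMs) to improve diffusion transformers (DiTs). A RePack module compresses high-dimensional VFM features onto a low-dimensional manifold, removing redundancy while preserving structural information. A standard DiT is then trained on the compressed latent space. Finally, a latent-guided refinement module restores the high-frequency details lost during compression. On ImageNet-1K, the method reaches an FID of 1.65 in only 64 training epochs, significantly outperforming existing diffusion models.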

2512.11855 2026-05-15 cs.LG cs.AI

Achieving Approximate Symmetry Is Exponentially Easier than Exact Symmetry

Behrooz Tahmasebi, Melanie Weber

Affiliations: Harvard John A. Paulson School of Engineering and Applied Sciences; Harvard University

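AI summary: This paper studies the difference in cost between enforcing exact symmetry and approximate symmetry in machine learning, and proposes an "averaging complexity" framework to quantify the cost of symmetry constraints. Under standard conditions, exact symmetry requires linear averaging complexity, whereas approximate symmetry requires only logarithmic complexity, an exponential gap. The result gives the first theoretical explanation of why approximate symmetry may be preferable in practice, and provides a new tool for further research on symmetry in machine learning.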

Comments 33 pages, 2 figures. Published at ICLR 2026

Journal ref International Conference on Learning Representations (ICLR) 2026

2512.07461 2026-05-15 cs.CL

Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning

Tong Wu, Yang Liu, Jun Bai, Zixia Jia, Shuyi Zhang, Ziyong Lin, Yanting Wang, Song-Chun Zhu, Zilong Zheng

Affiliations: State Key Laboratory of General Artificial Intelligence, BIGAI

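AI summary: This paper proposes the Native Parallel Reasoner (NPR), a teacher-free framework that lets large language models self-evolve genuine parallel reasoning capabilities. NPR moves from sequential reasoning to native parallel cognition through self-distilled progressive training, a parallel-aware policy optimization algorithm, and an improved inference engine. In experiments, NPR trained on Qwen3-4B improves performance by 24.5% across eight reasoning benchmarks, speeds up inference by 4.6x, and achieves 100% genuine parallel execution, setting a new standard for efficient, scalable agentic reasoning.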

2512.03637 2026-05-15 cs.SD cs.LG stat.ML

AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

Kohei Yamamoto, Kosuke Okusa

Affiliations: Research & Development Center, Technology Division, Oki Electric Industry Co., Ltd.; Department of Data Science for Business Innovation, Chuo University

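AI summary: This study proposes AaSP, a self-supervised pre-training framework for audio spectrogram Transformers that addresses the aliasing introduced by temporal downsampling in conventional methods. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn audio representations that integrate features from alias-prone bands while remaining stable across masked views. Under fine-tuning, AaSP achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among the compared self-supervised baselines, while remaining competitive elsewhere.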

Comments Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing (TASLP). Copyright IEEE

DOI: 10.1109/TASLPRO.2026.3690632

Abstract

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
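The aliasing claim in this abstract can be made concrete with a small calculation. The following is a minimal sketch and is not taken from the paper: the 100 Hz spectrogram frame rate and the temporal stride of 16 are assumed values chosen for illustration, and `effective_nyquist` and `alias_frequency` are hypothetical helper names. It shows how striding lowers the Nyquist frequency for temporal modulations, and where a modulation above that limit folds to.

```python
def effective_nyquist(frame_rate_hz: float, stride: int) -> float:
    """Nyquist frequency after keeping one of every `stride` frames."""
    return frame_rate_hz / stride / 2.0


def alias_frequency(f_hz: float, sample_rate_hz: float) -> float:
    """Apparent frequency of a sinusoid at f_hz once sampled at sample_rate_hz."""
    # Fold f_hz into the baseband [0, sample_rate_hz / 2].
    return abs(f_hz - round(f_hz / sample_rate_hz) * sample_rate_hz)


frame_rate = 100.0               # assumed: 10 ms hop between spectrogram frames
stride = 16                      # assumed: temporal stride of the patch convolution
rate_after = frame_rate / stride # 6.25 patches per second

print(effective_nyquist(frame_rate, stride))  # 3.125
print(alias_frequency(4.0, rate_after))       # 2.25
```

Under these assumed numbers, a 4 Hz temporal modulation is indistinguishable from a 2.25 Hz one after patchification, which is the kind of ambiguity the abstract attributes to temporal downsampling.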

2512.03532 2026-05-15 cs.CV

OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Zhishan Zhou, Siyuan Wei, Zengran Wang, Chunjie Wang, Xiaosheng Yan, Xiao Liu

Affiliations: PICO, ByteDance, Beijing

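AI summary: OpenTrack3D is an open-vocabulary 3D instance segmentation framework designed to improve the accuracy and generalization of 3D object segmentation in complex, unstructured, mesh-free environments. It introduces a visual-spatial tracker that generates cross-view consistent object proposals online, and extracts instance features from depth information and DINO feature maps, enabling efficient segmentation without meshes. OpenTrack3D also replaces CLIP with a multimodal large language model, which significantly improves semantic understanding of complex user queries. Experiments show state-of-the-art performance on multiple benchmark datasets.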

2512.02482 2026-05-15 cs.CV

G-SHARP: Gaussian Surgical Hardware Accelerated Real-time Pipeline

Vishwesh Nath, Javier G. Tejero, Aravind S. Kumar, Ruilong Li, Filippo Filicori, Mahdi Azizian, Sean D. Huver

Affiliations: NVIDIA; Northwell Health

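AI summary: This paper proposes G-SHARP, a real-time surgical scene reconstruction framework aimed at fast, accurate 3D modeling of deformable tissue in minimally invasive surgery. Built on the open-source GSplat (Apache-2.0) differentiable Gaussian rasterizer, it provides principled deformation modeling, robust occlusion handling, and high-fidelity reconstruction, and achieves leading reconstruction quality on the EndoNeRF dataset. The authors also provide a Holoscan SDK application deployable on NVIDIA IGX Orin and Thor edge devices, supporting real-time surgical visualization in actual operating rooms.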

2511.21740 2026-05-15 cs.CL cs.AI

A cross-species neural foundation model for end-to-end speech decoding

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

Affiliations: Columbia University; Stanford University; Microsoft; University of Washington

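AI summary: This paper proposes an end-to-end brain-to-text (BIT) framework that uses neural networks to decode neural activity directly into coherent sentences, improving the communication capability of brain-computer interfaces. The core approach is a neural encoder pretrained across tasks and species, combined with an audio large language model and contrastive learning, which yields a lower word error rate than conventional staged pipelines. The work sets new state-of-the-art results on multiple benchmarks and demonstrates cross-task generalization, marking significant progress for end-to-end neural decoding.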

2511.21104 2026-05-15 cs.LG cs.PL

BRIDGE: Building Representations In Domain Guided Program Synthesis

Robert Joseph George, Carson Eisenach, Udaya Ghai, Dominique Perrault-Joncas, Anima Anandkumar, Dean Foster

Affiliations: California Institute of Technology; Amazon

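AI summary: BRIDGE is a structured prompting framework for multi-domain program synthesis that targets the challenge of generating verifiable code in formal verification tools such as Lean. It links three domains (code generation, specifications, and theorems/proofs) and connects them through domain-specific intermediate reasoning. Experiments show that BRIDGE significantly improves the executable correctness of Lean code and also performs better in sample efficiency and Python code generation, demonstrating its practical value for verifiable program synthesis.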

Comments 41 pages, 10 figures, 3 tables. Preprint

2511.18903 2026-05-15 cs.LG cs.AI cs.CL

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

Affiliations: Tsinghua University; Peng Cheng Laboratory

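AI summary: In curriculum-based pretraining of large language models (LLMs), how effectively high-quality data is used is limited by the learning rate decay schedule. The paper finds that with a decaying learning rate schedule, the benefit of curriculum training ordered by data quality is markedly diminished. It proposes two simple, effective remedies: use a milder learning rate decay, or replace learning rate decay with model averaging. Both improve performance on multiple benchmarks without any additional data refinement. The finding suggests a new direction for co-designing curriculum pretraining and optimization methods.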

2511.18739 2026-05-15 cs.AI cs.LG stat.ML

A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

Kaixiang Yang, Jiarong Liu, Yupeng Song, Shuanghua Yang, Yujue Zhou

Affiliations: School of Artificial Intelligence, Yunnan University; Beijing Normal University – Hong Kong Baptist University

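AI summary: Time series anomaly detection is widely used in the Internet of Things and cyber-physical systems, but evaluating it is difficult because application scenarios are diverse and metrics rest on different assumptions. This paper proposes a problem-oriented taxonomy of evaluation metrics that reinterprets existing metrics according to the specific evaluation problem each addresses, organizing them along six dimensions: accuracy, timeliness, label tolerance, penalties for manual review cost, robustness to randomness, and cross-dataset comparability. Experiments analyze metric behavior across scenarios and quantify how well each metric separates genuine detections from random noise. They show that most event-level metrics discriminate strongly, whereas some commonly used metrics are sensitive to random score inflation, underscoring that metrics should be chosen to fit the task.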

2511.17367 2026-05-15 cs.LG

R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao

Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences

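AI summary: This paper studies how to design real-time pursuit strategies with worst-case robustness for pursuit-evasion games (PEGs) under partial observability. To address the shortcomings of existing methods under imperfect information and asynchronous moves, the authors propose R2PS, which combines dynamic programming with a belief preservation mechanism, extends conventional strategies to the partially observable setting, and embeds them in a state-of-the-art reinforcement learning framework. The method generalizes robustly to unseen graph structures without additional training and outperforms existing methods in experiments.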

2511.15408 2026-05-15 cs.CL cs.AI cs.IR cs.MA cs.NE

Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

Affiliations: Tongji University; Microsoft Research Asia; Northeastern University; University of British Columbia

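AI summary: This study addresses the challenges of Chinese short-form creative content generation and proposes an explanation-oriented multi-objective optimization approach for the difficulty of verifying outputs under personalization constraints. It formulates the task as a heterogeneous multi-objective optimization problem that jointly optimizes the generated content and the reliability of its explanation, and designs MAGIC-HMO, a training-free multi-agent framework that optimizes through iterative generation and verification. Experiments show the method significantly outperforms existing models on tasks such as Chinese baby naming.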

Comments 19 pages, 10 figures. Submitted to ACM for possible publication

2511.14823 2026-05-15 cs.LG cs.CV

Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

Affiliations: Institute of Technology, University of Tartu; 3S Holding OÜ

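AI summary: Current machine learning models excel at static tasks but struggle with continual adaptation and lifelong learning in non-stationary environments because their architectures are rigid. This paper proposes dynamic nested hierarchies, which let a model autonomously adjust the number of optimization levels, their nesting structure, and their update frequencies during training or inference, enabling self-evolution without predefined constraints. Supported by mathematical derivations and experiments, the method shows superior performance on language modeling, continual learning, and long-context reasoning, laying groundwork for adaptive general-purpose AI.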

Comments 12 pages, 1 figure

Journal ref Frontiers in Artificial Intelligence, 2026

2511.13397 2026-05-15 cs.CV cs.AI

Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

Affiliations: Department of Electronic and Computer Engineering, University of Limerick; Data Driven Computer Engineering Research Centre, University of Limerick; Lero, The Irish Software Research Centre, University of Limerick; Valeo Vision Systems

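AI summary: This paper introduces Distance-Annotated Traffic Perception Question Answering (DTPQA), a visual question answering benchmark for evaluating the perception capabilities of vision-language models in traffic scenes. The benchmark comprises a synthetic dataset and a real-world dataset, and annotates each question with the distance between the target object and the camera, enabling analysis of perception performance at different distances. It gives the automated driving field a new, targeted tool for evaluating model perception.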

Journal ref IEEE Data Descriptions, 2026

DOI: 10.1109/IEEEDATA.2026.3689031

Abstract

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
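The four-part sample structure described in this abstract lends itself to a distance-binned accuracy analysis. The sketch below is illustrative only: the `Sample` field names, the 10 m bin width, the exact-match scoring, and the example records are all assumptions, and the released dataset may use a different schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    image_path: str    # (a) image
    question: str      # (b) question
    answer: str        # (c) ground truth answer
    distance_m: float  # (d) distance of the queried object from the camera


def accuracy_by_distance(samples, predictions, bin_width_m=10.0):
    """Return {bin_start_m: accuracy} over fixed-width distance bins."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        bin_start = int(sample.distance_m // bin_width_m) * bin_width_m
        totals[bin_start] += 1
        # Case-insensitive exact match; a real evaluation may need looser matching.
        hits[bin_start] += int(predicted.strip().lower() == sample.answer.strip().lower())
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up records and model outputs, for illustration only.
samples = [
    Sample("img_0.png", "Is the pedestrian crossing?", "yes", 8.0),
    Sample("img_1.png", "Is the pedestrian crossing?", "no", 12.5),
    Sample("img_2.png", "Is the pedestrian crossing?", "yes", 34.0),
    Sample("img_3.png", "Is the pedestrian crossing?", "no", 38.0),
]
predictions = ["yes", "no", "no", "no"]

print(accuracy_by_distance(samples, predictions))
# {0.0: 1.0, 10.0: 1.0, 30.0: 0.5}
```

Plotting the returned dictionary against bin start would give the kind of accuracy-versus-distance curve that the distance annotations are meant to enable.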

2511.13026 2026-05-15 cs.CV

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

Affiliations: MiLM Plus, Xiaomi Inc.; Renmin University of China

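AI summary: This paper proposes REVISOR, a framework for improving the reasoning of large language models on long-form video understanding. Because purely text-based reflection falls short on long videos, REVISOR introduces a multimodal reflection mechanism that also draws on visual information, and designs a dual-attribute decoupled reward mechanism to strengthen the model's ability to identify and use key video segments. The method requires no additional supervised fine-tuning or external models and significantly improves performance on multiple long-form video understanding benchmarks.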

2511.08565 2026-05-15 cs.CL cs.AI cs.CY

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Davi Bastos Costa, Felippe Alves, Renato Vicente

Affiliations: TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science; University of São Paulo

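AI summary: This study examines the moral responses of large language models under persona role-play, using the Moral Foundations Questionnaire (MFQ) to build a benchmark that quantifies moral susceptibility and moral robustness. Using two complementary estimation procedures, it finds that moral robustness varies substantially across model families, with the Claude family the most robust, whereas moral susceptibility varies much less, shows no clear family dependence, and appears to be determined mainly by pre-training. The study shows how persona conditioning shapes model moral behavior, and provides moral foundation profiles for individual models and for personas averaged across models.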

Comments Added experiments with a logit-based method and now reporting unbounded metrics

Abstract

Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting a LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across- and within-personas. We estimate these quantities with two complementary procedures, repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about $152\%$, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about $13\%$, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
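The abstract compares the two metrics by their coefficient of variation, the standard deviation divided by the mean. As a minimal sketch of that statistic, the example below uses made-up scores for five hypothetical models, not data from the paper, and assumes the sample standard deviation; the abstract does not say which estimator was used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


# Made-up scores for five hypothetical models; not data from the paper.
narrow = [1.0, 1.1, 1.2, 1.3, 1.6]   # max is only 1.6x the min
wide = [1.0, 2.0, 3.0, 10.0, 30.0]   # spans more than an order of magnitude

print(round(coefficient_of_variation(narrow), 1))
print(round(coefficient_of_variation(wide), 1))
print(max(narrow) / min(narrow))  # 1.6
```

A narrow spread gives a coefficient of variation well under 100%, while values spanning more than an order of magnitude push it above 100%, which is the qualitative contrast the abstract draws between susceptibility and robustness.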

2511.02776 2026-05-15 cs.RO

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, Meng Li, Qingjie Liu, Shanghang Zhang, Min Wan, Jian Tang

Affiliations: Beijing Innovation Center of Humanoid Robotics, Beijing, China; School of Mechanical Engineering and Automation, Beihang University, Beijing, China; State Key Laboratory of Virtual Reality Technology and Systems, SCSE, Beihang University, Beijing, China; State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China

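AI summary: This paper presents XR-1, a general-purpose vision-language-action (VLA) model for diverse robots, tasks, and environments, aimed at two challenges facing existing models: producing precise low-level actions and aligning heterogeneous data sources. XR-1 introduces Unified Vision-Motion Codes (UVMC), a joint discrete representation of visual dynamics and robot motion learned with a dual-branch VQ-VAE, which yields significant gains in action generation and cross-modal alignment. Experiments show strong performance and good generalization across a variety of real robots and tasks.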

Comments Accepted to ICML2026 as spotlight

Abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $π_{0.5}$, $π_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.

2510.23868 2026-05-15 cs.LG cs.CL

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

Affiliations: Inflection AI

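AI summary: This paper asks whether reward matching can serve as an alternative to reward maximization in policy-gradient reinforcement learning for large language models. It proposes GIFT, which combines the group sampling of GRPO, the implicit reward of DPO, and UNA's mean squared error between explicit and implicit advantages. Z-score normalization eliminates the intractable term in DPO and removes the KL coefficient β from the RLHF and RLVR objectives. In experiments, GIFT converges faster and overfits less across multiple tasks, and outperforms existing methods in length control and evaluation performance.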
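One way to read the normalization step in this summary: in the standard DPO formulation the implicit reward is β·log(π/π_ref) plus a per-prompt term β·log Z(x) that cannot be computed, and standardizing within a group of responses to the same prompt cancels both that offset and the scale β. The sketch below illustrates this reading and is not the paper's implementation; the function names, the use of sequence-level log-probabilities, and the random inputs are assumptions.

```python
import random
import statistics


def zscore(values, eps=1e-8):
    """Standardize values within one group (all responses to one prompt)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def reward_matching_loss(rewards, logp_policy, logp_ref):
    """MSE between normalized explicit and implicit advantages for one group."""
    explicit_adv = zscore(rewards)
    # Implicit reward up to scale and offset: log pi(y|x) - log pi_ref(y|x).
    log_ratio = [p - r for p, r in zip(logp_policy, logp_ref)]
    implicit_adv = zscore(log_ratio)
    squared_errors = [(e - i) ** 2 for e, i in zip(explicit_adv, implicit_adv)]
    return statistics.fmean(squared_errors)


random.seed(0)
rewards = [random.gauss(0.0, 1.0) for _ in range(8)]        # made-up rewards
logp_policy = [random.gauss(-20.0, 1.0) for _ in range(8)]  # made-up log-probs
logp_ref = [random.gauss(-20.0, 1.0) for _ in range(8)]

log_ratio = [p - r for p, r in zip(logp_policy, logp_ref)]
beta, log_z = 0.1, 3.7  # arbitrary scale and per-prompt offset
rescaled = [beta * d + beta * log_z for d in log_ratio]

# The normalized advantages agree, so beta and log Z(x) drop out of the loss.
print(all(abs(a - b) < 1e-5 for a, b in zip(zscore(rescaled), zscore(log_ratio))))
print(reward_matching_loss(rewards, logp_policy, logp_ref))
```

In a training setup the loss would be differentiated with respect to the policy log-probabilities; plain Python floats are used here only to keep the sketch dependency-free.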

2510.20206 2026-05-15 cs.CV

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

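AI summary: RAPO++ is a cross-stage prompt optimization framework for text-to-video generation that addresses the mismatch between user prompts and training data. It proceeds through Retrieval-Augmented Prompt Optimization (RAPO) and Sample-Specific Prompt Optimization (SSPO), using multi-source feedback on semantic alignment, spatial fidelity, and temporal coherence to progressively improve the generated video, and then fine-tunes a language model for efficient prompt generation. Experiments show that RAPO++ significantly improves semantic consistency, compositional plausibility, and spatiotemporal stability across multiple state-of-the-art models and benchmarks, making it a model-agnostic, efficient, and scalable solution.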

Comments arXiv admin note: text overlap with arXiv:2504.11739

2510.17434 2026-05-15 cs.CV

Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

Affiliations: Sigmedia Group; Department of Electronic and Electrical Engineering

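AI summary: This study uses the motion vectors in AV1-encoded video to produce dense sub-pixel feature matches, and filters short tracks by cosine consistency. On short videos the method runs efficiently, uses little CPU, and yields denser matches with good geometric consistency. Experiments show good performance on few-sample scene reconstruction, suggesting that compressed-domain feature matching is a viable option for large-scale applications.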

Comments Accepted ICIR 2025, camera-ready version

2510.15982 2026-05-15 cs.LG cs.AI

AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution

Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon

Affiliations: Korea Advanced Institute of Science and Technology

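AI summary: This paper proposes AMiD, a knowledge distillation method for reducing the compute and memory cost of large language models. It introduces an α-mixture assistant distribution, whose new distribution parameter α broadens the scope of conventional assistant distributions and yields a unified knowledge distillation framework. In experiments, AMiD outperforms existing methods in performance and training stability, and rests on a broader theoretical foundation.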

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

2510.15849 2026-05-15 cs.CV

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin

Affiliations: Tsinghua University, China; The Fifth Affiliated Hospital of Wenzhou Medical University, China; Shenzhen Traditional Chinese Medicine Hospital, China; Wenzhou Medical University, China; The First Hospital of Hebei Medical University, China; Chinese Medicine Guangdong Laboratory/Hengqin Laboratory, China

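AI summary: This paper proposes Memory-SAM, a tongue segmentation method that needs neither human prompts nor training. It retrieves features from previous cases and turns them into effective prompts that guide the SAM2 model. Using dense DINOv3 features and FAISS retrieval, it automatically derives foreground and background prompts from a small set of prior cases, achieving high-accuracy segmentation. On a dataset of 600 expert-annotated images, Memory-SAM outperforms existing methods, particularly on real-world images.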

2510.13016 2026-05-15 cs.CV

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

Affiliations: LMU Munich; MCML; Technical University of Munich; University of Zurich; University of Oxford; Amazon; NVIDIA

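AI summary: This paper introduces SVAG-Bench, a large-scale benchmark for evaluating multi-instance spatio-temporal video action grounding. The task requires a model to simultaneously detect, track, and localize all objects that satisfy a natural language query, giving a unified understanding of multiple actions in complex scenes. SVAG-Bench contains 688 videos with extensive fine-grained annotations, supports detailed evaluation of multi-action ambiguity, temporal overlap, and action compositionality, and comes with a standardized evaluation toolkit and a modular baseline model, SVAGFormer.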

2510.11282 2026-05-15 cs.LG

Vision-LLMs for Spatiotemporal Traffic Forecasting

Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry

Affiliations: Institute of Automation, Chinese Academy of Sciences; Southwest University; Department of Computing and Communication Engineering, Beijing University of Science and Technology; Department of Electrical and Computer Engineering, Northwestern University

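AI summary: This paper studies how to use vision large language models (Vision-LLMs) for spatiotemporal traffic forecasting. Conventional LLMs are inefficient on gridded traffic data and struggle to model complex spatial dependencies, so the authors propose a new framework, ST-Vision-LLM. It treats traffic forecasting as a vision-language fusion problem: a vision encoder processes historical traffic matrices, and an efficient numerical encoding scheme together with a two-stage fine-tuning strategy significantly improves performance in long-horizon forecasting and cross-domain few-shot settings. Experiments show better prediction accuracy than existing methods on multiple real-world traffic datasets.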

2510.07086 2026-05-15 cs.LG

Non-Stationary Online Structured Prediction with Surrogate Losses

Shinsaku Sakaue, Han Bao, Yuzhou Cao

Affiliations: CyberAgent, Tokyo, Japan; National Institute of Informatics, Tokyo, Japan; Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan; The Institute of Statistical Mathematics, Tokyo, Japan; Tohoku University, Miyagi, Japan; Nanyang Technological University, Singapore

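AI summary: This paper studies online structured prediction in non-stationary environments, aiming to bound the target loss by way of surrogate losses. The authors give a new bound that depends on the cumulative surrogate loss and path length of the comparator sequence rather than on the number of time steps $T$, which provides stronger guarantees in non-stationary settings. The core approach combines a dynamic regret analysis of online gradient descent with a technique for exploiting the surrogate gap, and introduces a Polyak-style learning rate that improves both the analysis and practical performance. The approach also extends to a broader range of applications through convolutional Fenchel-Young losses.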

2510.04682 2026-05-15 cs.CL cs.AI

TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung, Jaehyung Kim

Affiliations: Yonsei University

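AI summary: This paper proposes TiTok, a framework for the problem that LoRA fine-tuned parameters cannot be transferred across different base models. It extracts knowledge contrastively at the token level, capturing task-relevant information from the difference between the source model with and without LoRA, which enables efficient LoRA transplantation. In experiments, TiTok performs well on multiple benchmarks, with average gains of 4% to 10% over baseline methods.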

Comments ICLR 2026

2510.02952 2026-05-15 cs.LG

ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data

Santanu Subhash Rathod, Francesco Ceccarelli, Sean B. Holden, Pietro Liò, Xiao Zhang, Jovan Tanevski

Affiliations: CISPA Helmholtz Center for Information Security; Department of Computer Science and Technology, University of Cambridge; Institute for Computational Biomedicine, Heidelberg University Hospital

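AI summary: This paper proposes ContextFlow, a context-aware flow matching framework for inferring trajectories of tissue dynamics from spatial omics data. It integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that guides the optimal transport objective, producing trajectories that are statistically consistent and biologically meaningful. Experiments show that ContextFlow outperforms existing methods on multiple quantitative and qualitative metrics and generalizes well.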

Comments 42 pages, 21 figures, 30 tables
