arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.06483 2026-05-12 cs.AI cs.RO cs.SY eess.SY

ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

Bowen Ye, Zhijian Li, Junyue Huang, Junkai Ma, Xiang Yin

AI总结该研究提出了一种名为ReasonSTL的框架，旨在解决将自然语言转化为信号时序逻辑（STL）这一关键但具有挑战性的任务。ReasonSTL通过结合本地开源语言模型与工具增强的推理过程，实现了自然语言到STL公式的高效生成，并引入了过程奖励训练机制以优化工具使用路径和最终公式结构。实验表明，该方法在自动评估和人工评估中均达到领先水平，为工业场景下的形式化规范编写提供了透明、低成本且隐私保护的解决方案。

2605.06117 2026-05-12 cs.LG

BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

Yi-Siang Wang, Kuan-Yu Chen, Yu-Chen Den, Darby Tien-Hao Chang

AI总结本文提出BoostLLM，一种受提升算法启发的大型语言模型（LLM）微调框架，旨在提升其在少样本表格分类任务中的性能。该方法将参数高效的微调过程转化为多轮残差优化过程，通过训练序列化的PEFT适配器作为弱学习器，并结合决策树路径作为结构化输入视图，以增强模型对表格数据的归纳偏置。实验表明，BoostLLM在多个LLM主干和数据集上均优于传统微调方法，且在少样本场景下表现可与XGBoost媲美，甚至在某些情况下超越基于GPT-4o的模型。

Comments 19 pages, 4 figures

2605.05831 2026-05-12 cs.CV

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

AI总结随着科学传播逐渐呈现多模态趋势，研究论文、幻灯片、视频等不同形式的材料共同传达研究成果，但目前缺乏结构化的关联方式。本文提出首个整合研究论文、演讲视频、讲解视频和幻灯片的多模态会议数据集（MCD），并评估多种嵌入式和视觉-语言模型在跨格式细粒度对应任务中的表现。研究发现，视觉-语言模型在整体上表现稳健，但在细粒度对齐上仍有不足，而嵌入式模型在文本与视觉对应上效果较好，但对公式和符号内容的处理存在明显聚类差异，为多模态科学理解的未来研究指明了方向。

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026

2605.05736 2026-05-12 cs.AI

SDFlow: Similarity-Driven Flow Matching for Time Series Generation

Wei Li, Shibo Feng, Pengcheng Wu, Xingyu Gao, Min Wu, Peilin Zhao

AI总结本文提出了一种名为SDFlow的非自回归时间序列生成方法，通过相似性驱动的流匹配技术，在冻结的向量量化（VQ）潜在空间中实现并行序列生成，有效解决了自回归模型中的暴露偏差问题。该方法通过全局传输映射替代逐步预测、低秩流形分解降低维度复杂度、以及在变分流匹配框架中引入离散监督，成功克服了非自回归生成中的关键挑战。实验表明，SDFlow在长序列生成任务中取得了最先进的性能，显著提升了生成质量并加快了推理速度。

2605.05072 2026-05-12 cs.CV

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

Yuan Wu, Zhiqiang Yan, Jiawei Lian, Zhengxue Wang, Jian Yang

AI总结本文研究了如何从相机和激光雷达传感器数据中准确预测三维场景的占用情况，重点解决传统方法在投影空间采样固定、难以适应真实场景高度变化和稀疏性的问题。为此，作者提出了一种名为HiPR的框架，通过高度引导的投影重参数化方法，动态调整激光雷达点云的采样范围，使投影点更合理地分布于具有几何意义的区域。实验表明，HiPR在保持实时推理能力的同时，显著优于现有先进方法。

2605.05045 2026-05-12 cs.CV cs.CL

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

AI总结该研究分析了视觉-语言模型在面对旋转和噪声等视觉干扰时产生的关系幻觉现象，揭示了即使轻微的图像扰动也会显著影响模型对物体间关系的推理能力。研究评估了多种基于提示的增强与预处理策略，发现这些方法虽能部分缓解问题，但无法彻底消除关系幻觉。结果表明，当前模型在感知鲁棒性与关系理解之间仍存在差距，亟需开发更具几何感知能力的视觉-语言模型。

2605.04956 2026-05-12 cs.LG cs.PF

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Han Wang, Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu

AI总结论文提出了一种名为 KernelBenchX 的综合性基准测试工具，用于评估大语言模型生成的 GPU 内核在正确性和硬件效率方面的表现。通过在15个类别共176个任务上的系统评测，研究发现任务结构对正确性的影响远大于方法设计，且正确性并不等同于效率，许多正确生成的内核性能甚至低于基线。研究还揭示了量化等任务中存在系统性理解偏差，未来需在数值精度建模和硬件效率优化方面取得进展。

Comments minor textual revision; no changes to technical content or results

2605.04012 2026-05-12 cs.AI

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff

AI总结该研究提出了一种名为SymptomAI的会话式人工智能代理，用于日常症状的端到端访谈与鉴别诊断。通过在Fitbit应用中进行的大规模随机实验，SymptomAI在与13,917名参与者的交互中表现出比独立临床医生更高的诊断准确性。研究还发现，采用专门症状访谈策略的AI代理在诊断效果上显著优于用户引导的对话方式，并揭示了症状与生理指标之间的强关联。

Comments 13 page main text, 54 pages total. 16 figures total

详情

英文摘要

Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.

URL PDF HTML ☆

赞 0 踩 0

2605.03652 2026-05-12 cs.CV cs.AI

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

AI总结本文提出了一种名为 AniMatrix 的动画视频生成模型，专门针对动画艺术风格进行设计，而非依赖物理现实作为先验。该模型通过双通道条件机制和三步过渡策略，重新定义“正确性”标准，克服传统模型对物理规律的依赖，并有效区分艺术表达与生成失败。实验表明，AniMatrix 在专业动画师参与的评估中表现优异，尤其在提示理解与艺术动作生成方面显著优于现有模型。

Comments 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released

2605.03438 2026-05-12 cs.CV

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Zihao Guo, Jihua Zhu, Jian Liu, Ajmal Saeed Mian

AI总结本文提出了一种名为Mantis的高效参数微调框架，专门针对基于Mamba架构的3D点云基础模型。该方法通过引入状态感知适配器（SAA），在冻结预训练主干网络的前提下实现状态级的细粒度适配，同时采用双序列化一致性蒸馏（DSCD）减少序列化带来的不稳定性。实验表明，Mantis仅需约5%的可训练参数即可在多个基准上取得具有竞争力的性能。

2605.02948 2026-05-12 cs.LG cs.AI cs.SD

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

Yuxin Lu, Jiayang Sun, Guibo Zhu, Min Cao

AI总结 AsymTalker 是一种基于扩散模型的长时 talking head 生成方法，旨在解决现有方法在长时间视频生成中出现的身份不一致和时空对齐问题。该方法引入了时间参考编码（TRE）和非对称知识蒸馏（AKD），分别用于缓解静态身份参考与动态音频流之间的时空错位，以及解决分块生成过程中身份漂移的问题。实验表明，AsymTalker 在保证高保真度和身份一致性的同时，能够生成长达600秒的视频，并实现每秒66帧的实时推理速度，达到了当前最先进的性能。

2605.02751 2026-05-12 cs.AI cs.CL

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Liu, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

AI总结在多智能体场景中，语言模型（LMs）遵循指令和保持价值对齐至关重要，但现有研究多关注单个模型与用户的对齐，忽视了多模型交互中可能产生的对齐偏差扩散问题。本文通过多轮对话的社会困境游戏实验，发现模型在交互中可能表现出更加反社会的行为，且当其他模型被引导为恶意行为时，这种效应会加剧。为缓解该问题，作者提出了一种基于隐式特质引导的方法，通过间歇性注入强化模型初始正向社会行为的系统提示，有效抑制对齐偏差的扩散，且无需访问模型参数或内部状态，适用于黑箱模型的多智能体应用场景。

2605.02487 2026-05-12 cs.RO

Visibility-Aware Mobile Grasping in Dynamic Environments

Tianrun Hu, Anxing Xiao, David Hsu, Hanbo Zhang

AI总结本文研究了机器人在动态未知环境中进行移动抓取的问题，重点解决有限视野下视觉感知与身体运动之间的协调难题。提出了一种统一的移动抓取系统，包含基于行为树的分层规划器和结合主动感知的全身运动规划器，能够在动态障碍物存在的情况下安全导航并完成抓取任务。实验表明，该方法在静态和动态未知环境中分别实现了68.8%和58.0%的成功率，显著优于现有方法。

2605.01402 2026-05-12 cs.CL cs.CV cs.LG

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du, Shanshan Song, Xiaomeng Li

AI总结多模态大语言模型（MLLMs）在处理长尾分布的数值回归任务时表现不佳，现有基于标记的监督微调方法容易偏向高密度区域，导致回归均值化和尾部性能下降。本文提出了一种基于组相对策略优化的分布感知强化学习框架，通过引入基于一致相关系数的奖励机制，在批量层面提供跨样本的比较监督，从而在相关性、尺度和均值等方面对齐预测与真实分布。该方法无需修改模型结构，实验表明其在多种长尾回归基准上均优于传统微调方法，尤其在中样本和少样本场景下效果显著。

Comments Accepted by ICML 2026

2605.00642 2026-05-12 cs.AI cs.CV

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, Yu Zhou

AI总结本文提出了一种面向GUI定位任务的首个基于策略自蒸馏（OPSD）框架GUI-SD，旨在解决现有强化学习方法在训练效率和样本稀疏性方面的不足。该方法通过构建视觉增强的特权上下文和引入熵引导的蒸馏策略，实现了单次交互中的密集监督学习，有效提升了定位精度与训练效率。实验表明，GUI-SD在六个代表性基准上均优于现有方法。

Comments under review

2605.00623 2026-05-12 cs.RO

Recovering Hidden Reward in Diffusion-Based Policies

Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie, Guodong Zhang, Qicheng He, Deyi Ji, Yue Ding, Hongtao Lu

AI总结本文提出了一种名为 EnergyFlow 的框架，通过参数化一个标量能量函数，将生成动作建模与逆强化学习相结合，其梯度即为去噪场。该方法在最大熵最优性条件下，通过去噪得分匹配学习得分函数，能够恢复专家的软Q函数梯度，从而实现无需对抗训练的奖励提取。实验表明，EnergyFlow 在多种操作任务中表现出领先的模仿性能，并能为下游强化学习提供有效的奖励信号，优于对抗性逆强化学习和基于似然的方法。

Comments Accepted by ICML 2026

2605.00548 2026-05-12 cs.CV cs.GR

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir

AI总结本文研究了扩散模型中输入噪声的特性，发现白噪声中低频分量主要决定图像的全局结构和颜色组成，而高频分量控制细节。基于此，作者提出了一种无需训练的低频噪声操控方法，通过简单操作低频噪声来引导图像生成过程，从而在保持输出多样性的同时，实现对图像整体结构和颜色的有效控制。

Comments SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/

2605.00408 2026-05-12 cs.CV

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei

AI总结本文提出了一种可学习的密度控制方法LeGS，用于改进三维高斯溅射（3DGS）技术，以克服其对启发式密度控制规则的依赖。该方法将密度控制建模为通过强化学习优化的参数化策略网络，并设计了一种基于敏感性分析的有效奖励函数，以精确量化单个高斯分布对重建质量的贡献。实验表明，LeGS在多个数据集上显著优于现有方法，在重建质量和计算效率之间取得了更好的平衡。

Comments 9 pages, 5 figures

2604.27224 2026-05-12 cs.RO

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Pokuang Zhou, Yuhao Zhou, Quan Khanh Luu, Seungho Han, Heng Zhang, Binghao Huang, Yunzhu Li, Arash Ajoudani, Zhengtong Xu, Yu She

AI总结本文研究了如何通过触觉感知提升四足机器人在复杂接触环境中的运动与操作能力。作者提出了一种分层的触觉感知策略学习框架，结合真实人类示范训练高层视觉-触觉策略，并通过大规模仿真强化学习训练底层触觉感知全身控制策略，实现了从仿真到现实的零样本迁移。实验表明，该方法在多种高接触任务中相比仅依赖视觉或视觉-触觉的方法，平均性能提升了28.54%。

2604.26805 2026-05-12 cs.AI cs.MA

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang

AI总结本文提出了一种名为Bian Que的智能运维框架，旨在提升大规模在线系统（如搜索、推荐和广告）的运维效率。该框架通过统一的操作范式和灵活的技能编排机制，实现了对运维事件的精准数据与知识匹配，解决了传统方法中信息过载与人工配置困难的问题。研究贡献包括统一的操作模式、自动化的技能生成与优化机制，以及自演进的学习系统，实际部署在快手电商搜索引擎上，显著提升了运维效率与准确性。

Comments HomePage: https://benchen4395.github.io

详情

英文摘要

Operating and maintaining (O&M) large-scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.

URL PDF HTML ☆

赞 0 踩 0

2604.24954 2026-05-12 cs.LG cs.AI cs.CV

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Shaokun Zhang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Bilal Kartal, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Qing Miao, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng Zou, Jian Hu, Hao Zhang, Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Alexandre Milesi, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Andrii Skliar, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Natan Bagrov, Borys Tymchenko, Tomer Asida, Daniel Afrimi, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Negar Habibi, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Radha Sri-Tharan, Jeffrey Glick, Barnaby Simkin, George Zelenfroynd, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Katherine Cheung, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro, Udi Karpas

AI总结本文介绍了 Nemotron 3 Nano Omni，这是 Nemotron 多模态系列的最新模型，首次原生支持音频输入，同时兼容文本、图像和视频。该模型在架构、训练数据和训练方法上均有改进，在多种模态任务中均表现出更高的准确性，尤其在现实文档理解、长音频视频理解和智能计算机使用方面表现突出。基于高效的 Nemotron 3 Nano 30B-A3B 架构，该模型引入了创新的多模态 token 减少技术，显著降低了推理延迟并提升了吞吐量，同时提供了多种精度格式的模型权重和部分训练数据及代码以促进进一步研究。

2604.20783 2026-05-12 cs.LG

Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incomplete Layer Traces

Zesheng Liu, Maryam Rahnemoonfar

AI总结该研究旨在解决由雷达观测得到的冰层内部结构不完整的问题，通过结合物理气候模型提供的同步特征，生成完整的冰层厚度标注。提出的方法结合了几何学习与基于变换器的时序模块，以聚合层内空间信息并促进层间信息传播，从而生成结构一致且符合物理规律的冰层厚度。该模型在保留已有观测数据的基础上，能够恢复缺失的冰层片段，甚至填补完全缺失的层，并为后续深度层预测模型提供了有效的预训练监督信号。

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

2604.20175 2026-05-12 cs.LG cs.AI

Physics-Enhanced Deep Learning for Proactive Thermal Runaway Forecasting in Li-Ion Batteries

Salman Khan, Syed Sajid Ullah, Muhammad Zunair Zamir, Jie Li, Abdul Malik, Saeed Mian Qaisar

AI总结本文提出了一种融合物理规律的深度学习框架 PI-LSTM，用于锂离子电池热失控的主动预测。该方法通过在损失函数中引入热传导方程作为正则化项，将物理约束直接嵌入神经网络结构，从而在保证预测精度的同时满足热力学原理。实验表明，该模型在多个电池数据集上显著优于传统 LSTM、CNN-LSTM 和 MLP 模型，大幅提升了预测准确性和泛化能力，为下一代电池系统的实时热管理提供了可行方案。

2604.20172 2026-05-12 cs.LG math.ST stat.ML stat.TH

Cover meets Robbins while Betting on Bounded Data: $\ln n$ Regret and Almost Sure $\ln\ln n$ Regret

Shubhada Agrawal, Aaditya Ramdas

AI总结本文研究在有界数据序列上进行投注时的策略设计，旨在同时应对随机数据和对抗性数据。提出了一种结合Robbins和Cover思想的混合投注策略，该策略在几乎所有路径上实现了$O(\ln \ln n)$的对数对数级遗憾，而在少数路径上则保持$O(\ln n)$的对数级遗憾。该方法首次展示了通过策略对冲实现对随机数据和对抗数据的自适应性，具有重要的理论价值和应用前景。

Comments Improved a regret bound. New regret bound for a classical mixture

2604.19923 2026-05-12 cs.CV

UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video

Tanuj Sur, Shashank Tripathi, Nikos Athanasiou, Ha Linh Nguyen, Kai Xu, Michael J. Black, Angela Yao

AI总结本文提出 UniCon3R，一种用于从单目视频中进行在线人类-场景四维重建的统一前馈框架。该方法通过显式建模人类与场景之间的接触关系，利用接触信息作为修正线索来提升人体网格重建质量，从而在保证快速推理速度的同时，实现场景几何与对齐的人体四维重建。实验表明，UniCon3R 在物理合理性与人体运动估计方面优于现有方法，验证了接触信息作为强大先验在联合重建中的有效性。

Comments Project page: https://surtantheta.github.io/UniCon3R

2604.19835 2026-05-12 cs.LG cs.AI

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta, Pratik Jayarao, Neeraj Varshney, Bing Yin

AI总结该论文提出了一种名为“专家重用（Expert Upcycling）”的方法，用于在持续预训练过程中逐步扩展混合专家（MoE）模型的规模，从而在保持每token计算成本不变的情况下提升模型容量。核心思想是通过复制已有专家并扩展路由层，在已有模型的基础上构建更大规模的MoE模型，同时利用已有知识进行初始化以加速训练。实验表明，该方法在保持模型性能的同时显著减少了训练所需的计算资源，为高效扩展MoE模型提供了理论支持和实践方案。

Comments 9 Pages in main paper, 29 Pages total

详情

英文摘要

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.

URL PDF HTML ☆

赞 0 踩 0

2604.19748 2026-05-12 cs.CV

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng

AI总结 Tstars-Tryon 1.0 是一个高效、真实且鲁棒的虚拟试穿系统，能够应对复杂现实场景中的多种挑战，如极端姿态、光照变化和运动模糊等。该系统支持多种服装类别和多参考图像的灵活组合，生成具有精细细节和真实材质的高质量图像，同时避免了常见的AI生成伪影。通过端到端的模型架构和优化的推理速度，系统实现了接近实时的生成效果，并已在淘宝App上大规模部署，服务于数百万用户。

Comments 24 pages, model evaluation report

2604.15529 2026-05-12 cs.AI

LACE: Lattice Attention for Cross-thread Exploration

Yang Li, Zirui Zhang, Yang Liu, Chengzhi Mao

AI总结当前大型语言模型在推理过程中往往是独立进行的，尽管可以并行生成多个推理路径，但这些路径之间缺乏互动，容易以相似的方式失败。本文提出LACE框架，通过引入晶格注意力机制，使并行推理路径能够互相交流中间结果并纠正错误，从而将独立的推理过程转变为协调的并行过程。实验表明，这种方法显著提升了推理准确性，验证了允许推理路径相互作用可以增强大型语言模型的性能。

Comments 22 pages, 15 figures

2604.14655 2026-05-12 cs.AI cs.LG

AgentGA: Evolving Code Solutions in Agent-Seed Space

David Y. Y. Tan, Kellie Chin, Jingxian Zhang

AI总结本文提出了一种名为 AgentGA 的框架，通过优化“代理种子”（包括任务提示和可选的父级存档）来演化自主的代码生成过程，而非直接编辑代码。该方法在种群层面结合了遗传算法与长期规划代理，利用确定性精英锦标赛进行选择，并通过改进的 Hedge 控制器动态分配操作符。实验表明，AgentGA 在 Weco-Kaggle Lite 自动化机器学习基准测试中表现优异，大幅超越了人类水平和现有参考方法。

Comments 30 pages total (9-page main text + references + appendix), 5 figures, 9 tables

2604.14125 2026-05-12 cs.CV cs.AI cs.RO

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

AI总结本文提出了一种名为HiVLA的视觉-语义引导的分层操作系统，旨在解决端到端视觉-语言-动作模型在精细控制数据微调时削弱其基础视觉语言模型推理能力的问题。该方法通过将高层语义规划与底层运动控制解耦，利用视觉语言模型进行任务分解和视觉定位，生成结构化操作计划，并通过配备级联交叉注意力机制的扩散变换器执行精确动作，从而在保持高层推理能力的同时提升操作精度。实验表明，HiVLA在长时序技能组合和复杂场景下的精细操作任务中显著优于现有端到端方法。

Comments Project Page: https://tianshuoy.github.io/HiVLA-page/