arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 4056
2605.06483 2026-05-12 cs.AI cs.RO cs.SY eess.SY

ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

Bowen Ye, Zhijian Li, Junyue Huang, Junkai Ma, Xiang Yin

AI总结 该研究提出了一种名为ReasonSTL的框架,旨在解决将自然语言转化为信号时序逻辑(STL)这一关键但具有挑战性的任务。ReasonSTL通过结合本地开源语言模型与工具增强的推理过程,实现了自然语言到STL公式的高效生成,并引入了过程奖励训练机制以优化工具使用路径和最终公式结构。实验表明,该方法在自动评估和人工评估中均达到领先水平,为工业场景下的形式化规范编写提供了透明、低成本且隐私保护的解决方案。

详情
英文摘要

Signal Temporal Logic (STL) is an expressive formal language for specifying spatio-temporal requirements over real-valued, real-time signals. It has been widely used for the verification and synthesis of autonomous systems and cyber-physical systems. In practice, however, users often express their requirements in natural language rather than in structured STL formulas, making natural-language-to-STL translation a critical yet challenging task. Manual specification requires temporal-logic expertise and cannot scale, while prompting commercial LLM APIs incurs substantial token costs and may expose sensitive system requirements to third-party services, raising privacy concerns for industrial deployment. To address these challenges, we present \textsc{ReasonSTL}, a tool-augmented framework that adapts local open-source language models for natural-language-to-STL generation. \textsc{ReasonSTL} decomposes the translation process into explicit reasoning, deterministic tool calls, and structured formula construction. We further introduce process-rewarded training to supervise both tool-use trajectories and final formulas, together with \textsc{STL-Bench}, a bilingual, computation-aware benchmark grounded in real-world signals. Experiments show that a 4B model trained with \textsc{ReasonSTL} achieves state-of-the-art performance in both automatic metrics and human evaluations, demonstrating that \textsc{ReasonSTL} provides a transparent, low-cost, and privacy-preserving alternative for formal specification drafting.

2605.06117 2026-05-12 cs.LG

BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

Yi-Siang Wang, Kuan-Yu Chen, Yu-Chen Den, Darby Tien-Hao Chang

AI总结 本文提出BoostLLM,一种受提升算法启发的大型语言模型(LLM)微调框架,旨在提升其在少样本表格分类任务中的性能。该方法将参数高效的微调过程转化为多轮残差优化过程,通过训练序列化的PEFT适配器作为弱学习器,并结合决策树路径作为结构化输入视图,以增强模型对表格数据的归纳偏置。实验表明,BoostLLM在多个LLM主干和数据集上均优于传统微调方法,且在少样本场景下表现可与XGBoost媲美,甚至在某些情况下超越基于GPT-4o的模型。

Comments 19 pages, 4 figures

详情
英文摘要

Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work, we revisit the boosting paradigm, traditionally associated with tree ensembles, and ask whether it can be applied as a general training principle for LLM fine-tuning. We propose BoostLLM, a framework that transforms parameter-efficient fine-tuning into a multi-round residual optimization process by training sequential PEFT adapters as weak learners. To incorporate tabular inductive bias, BoostLLM integrates decision-tree paths as a second input view alongside raw features; analysis reveals that the path view acts as a structured teacher in early training steps before the model shifts toward feature-driven representations. Empirically, BoostLLM achieves consistent improvements over standard fine-tuning across multiple LLM backbones and datasets, matching or surpassing XGBoost across a wide range of shot counts and outperforming GPT-4o-based methods with a 4B model. We further show that the framework scales: pairing with stronger tree models and extended boosting horizons yields additional gains under appropriate stabilization. These results suggest that boosting can serve as a general training principle for LLM fine-tuning, particularly in low-data regimes for structured data.

2605.05831 2026-05-12 cs.CV

Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

AI总结 随着科学传播逐渐呈现多模态趋势,研究论文、幻灯片、视频等不同形式的材料共同传达研究成果,但目前缺乏结构化的关联方式。本文提出首个整合研究论文、演讲视频、讲解视频和幻灯片的多模态会议数据集(MCD),并评估多种嵌入式和视觉-语言模型在跨格式细粒度对应任务中的表现。研究发现,视觉-语言模型在整体上表现稳健,但在细粒度对齐上仍有不足,而嵌入式模型在文本与视觉对应上效果较好,但对公式和符号内容的处理存在明显聚类差异,为多模态科学理解的未来研究指明了方向。

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings Track, 2026

详情
英文摘要

The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study's reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD

2605.05736 2026-05-12 cs.AI

SDFlow: Similarity-Driven Flow Matching for Time Series Generation

Wei Li, Shibo Feng, Pengcheng Wu, Xingyu Gao, Min Wu, Peilin Zhao

AI总结 本文提出了一种名为SDFlow的非自回归时间序列生成方法,通过相似性驱动的流匹配技术,在冻结的向量量化(VQ)潜在空间中实现并行序列生成,有效解决了自回归模型中的暴露偏差问题。该方法通过全局传输映射替代逐步预测、低秩流形分解降低维度复杂度、以及在变分流匹配框架中引入离散监督,成功克服了非自回归生成中的关键挑战。实验表明,SDFlow在长序列生成任务中取得了最先进的性能,显著提升了生成质量并加快了推理速度。

详情
英文摘要

Vector quantization (VQ) with autoregressive (AR) token modeling is a widely adopted and highly competitive paradigm for time-series generation. However, such models are fundamentally limited by exposure bias: during inference, errors can accumulate across sequential predictions, leading to pronounced quality degradation in long-horizon generation. To address this, we propose SDFlow ($\textbf{S}$imilarity-$\textbf{D}$riven $\textbf{Flow}$ Matching), a non-autoregressive framework that operates entirely in the frozen VQ latent space and enables parallel sequence generation via flow matching. We tackle three key challenges in making this transition: (1) eliminating exposure bias by replacing step-wise token prediction with a global transport map; (2) mitigating the high-dimensionality of VQ token spaces via a low-rank manifold decomposition with a learned anchor prior over the latent manifold; and (3) incorporating discrete supervision into continuous transport dynamics by introducing a categorical posterior over codebook indices within a variational flow-matching formulation. Extensive experiments show that SDFlow achieves state-of-the-art performance, improving Discriminative Score and substantially reducing Context-FID, particularly for challenging long-sequence generation. Moreover, SDFlow provides significant inference speedups over autoregressive baselines, offering both high fidelity and computational efficiency. Code is available at https://anonymous.4open.science/r/SDFlow-D6F3/

2605.05072 2026-05-12 cs.CV

Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy

Yuan Wu, Zhiqiang Yan, Jiawei Lian, Zhengxue Wang, Jian Yang

AI总结 本文研究了如何从相机和激光雷达传感器数据中准确预测三维场景的占用情况,重点解决传统方法在投影空间采样固定、难以适应真实场景高度变化和稀疏性的问题。为此,作者提出了一种名为HiPR的框架,通过高度引导的投影重参数化方法,动态调整激光雷达点云的采样范围,使投影点更合理地分布于具有几何意义的区域。实验表明,HiPR在保持实时推理能力的同时,显著优于现有先进方法。

详情
英文摘要

3D occupancy prediction aims to infer dense, voxel-wise scene semantics from sensor observations, where the 2D-to-3D view transformation serves as a crucial step in bridging image features and volumetric representations. Most previous methods rely on a fixed projection space, where 3D reference points are uniformly sampled along pillars. However, such sampling struggles to capture the sparsity and height variations of real-world scenes, leading to ambiguous correspondences and unreliable feature aggregation. To address these challenges, we propose HiPR, a camera-LiDAR occupancy framework with Height-Guided Projection Reparameterization. HiPR first encodes LiDAR into a BEV height map to capture the maximum height of the point cloud. HiPR then adjusts the sampling range of each pillar using the height prior, enabling adaptive reparameterization of the projection space. As a result, the projected points are redistributed into geometrically meaningful regions rather than fixed ranges. Meanwhile, we mask out the invalid parts of the height map to avoid misleading the feature aggregation. In addition, to alleviate the training instability caused by noisy LiDAR-derived heights, we introduce a training-time Progressive Height Conditioning strategy, which gradually transitions the conditioning signal from ground-truth heights to LiDAR heights. Extensive experiments demonstrate that HiPR consistently outperforms existing state-of-the-art methods while maintaining real-time inference. The code and pretrained models can be found at https://github.com/yanzq95/HiPR.

2605.05045 2026-05-12 cs.CV cs.CL

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

Philip Wootaek Shin, Ajay Narayanan Sridhar, Sivani Devarapalli, Rui Zhang, Jack Sampson, Vijaykrishnan Narayanan

AI总结 该研究分析了视觉-语言模型在面对旋转和噪声等视觉干扰时产生的关系幻觉现象,揭示了即使轻微的图像扰动也会显著影响模型对物体间关系的推理能力。研究评估了多种基于提示的增强与预处理策略,发现这些方法虽能部分缓解问题,但无法彻底消除关系幻觉。结果表明,当前模型在感知鲁棒性与关系理解之间仍存在差距,亟需开发更具几何感知能力的视觉-语言模型。

详情
英文摘要

Vision-language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, which requires accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.

2605.04956 2026-05-12 cs.LG cs.PF

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Han Wang, Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu

AI总结 论文提出了一种名为 KernelBenchX 的综合性基准测试工具,用于评估大语言模型生成的 GPU 内核在正确性和硬件效率方面的表现。通过在15个类别共176个任务上的系统评测,研究发现任务结构对正确性的影响远大于方法设计,且正确性并不等同于效率,许多正确生成的内核性能甚至低于基线。研究还揭示了量化等任务中存在系统性理解偏差,未来需在数值精度建模和硬件效率优化方面取得进展。

Comments minor textual revision; no changes to technical content or results

详情
英文摘要

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBenchX, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from $1.58\times$ to $1.44\times$; newly rescued kernels consistently underperform persistently correct ones ($1.16\times$ vs $1.58\times$ speedup in round~0$\to$1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches $21.4\times$. Besides, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX

2605.04012 2026-05-12 cs.AI

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann, Samuel Schmidgall, Xin Liu, Girish Narayanswamy, Samuel Solomon, Maxwell A. Xu, Xiaoran Fan, Longfei Shangguan, Anran Wang, Bhavna Daryani, Buddy Herkenham, Cara Tan, Mark Malhotra, Shwetak Patel, John B. Hernandez, Quang Duong, Yun Liu, Zach Wasson, Dimitrios Antos, Bob Lou, Matthew Thompson, Jonathan Richina, Anupam Pathak, Nichole Young-Lin, Jake Sunshine, Daniel McDuff

AI总结 该研究提出了一种名为SymptomAI的会话式人工智能代理,用于日常症状的端到端访谈与鉴别诊断。通过在Fitbit应用中进行的大规模随机实验,SymptomAI在与13,917名参与者的交互中表现出比独立临床医生更高的诊断准确性。研究还发现,采用专门症状访谈策略的AI代理在诊断效果上显著优于用户引导的对话方式,并揭示了症状与生理指标之间的强关联。

Comments 13 page main text, 54 pages total. 16 figures total

详情
英文摘要

Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.

2605.03652 2026-05-12 cs.CV cs.AI

AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

Tencent HY Team

AI总结 本文提出了一种名为 AniMatrix 的动画视频生成模型,专门针对动画艺术风格进行设计,而非依赖物理现实作为先验。该模型通过双通道条件机制和三步过渡策略,重新定义“正确性”标准,克服传统模型对物理规律的依赖,并有效区分艺术表达与生成失败。实验表明,AniMatrix 在专业动画师参与的评估中表现优异,尤其在提示理解与艺术动作生成方面显著优于现有模型。

Comments 37 pages, 1 main figure (qualitative comparison), 1 TikZ architecture diagram; technical report. Model weights and inference code to be released

详情
英文摘要

Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We are preparing accompanying resources for public release to support reproducibility and follow-up research.

2605.03438 2026-05-12 cs.CV

Mantis: Mamba-native Tuning is Efficient for 3D Point Cloud Foundation Models

Zihao Guo, Jihua Zhu, Jian Liu, Ajmal Saeed Mian

AI总结 本文提出了一种名为Mantis的高效参数微调框架,专门针对基于Mamba架构的3D点云基础模型。该方法通过引入状态感知适配器(SAA),在冻结预训练主干网络的前提下实现状态级的细粒度适配,同时采用双序列化一致性蒸馏(DSCD)减少序列化带来的不稳定性。实验表明,Mantis仅需约5%的可训练参数即可在多个基准上取得具有竞争力的性能。

详情
英文摘要

Pre-trained 3D point cloud foundation models (PFMs) have demonstrated strong transferability across diverse downstream tasks. However, full fine-tuning these models is computationally expensive and storage-intensive. Parameter-efficient fine-tuning (PEFT) offers a promising alternative, but existing PEFT approaches are primarily designed for Transformer-based backbones and rely on token-level prompting or feature transformation. Mamba-based backbones introduce a granularity mismatch between token-level adaptation and state-level sequence dynamics. Consequently, straightforward transfer of existing PEFT approaches to frozen Mamba backbones leads to substantial accuracy degradation and unstable optimization. To address this issue, we propose Mantis, the first Mamba-native PEFT framework for 3D PFMs. Specifically, a State-Aware Adapter (SAA) is introduced to inject lightweight task-conditioned control signals into selective state-space updates, enabling state-level adaptation while keeping the pre-trained backbone frozen. Moreover, different valid point cloud serializations are regularized by Dual-Serialization Consistency Distillation (DSCD), thereby reducing serialization-induced instability. Extensive experiments across multiple benchmarks demonstrate that our Mantis achieves competitive performance with only about 5% trainable parameters. Our code is available at https://github.com/gzhhhhhhh/Mantis.

2605.02948 2026-05-12 cs.LG cs.AI cs.SD

AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation

Yuxin Lu, Jiayang Sun, Guibo Zhu, Min Cao

AI总结 AsymTalker 是一种基于扩散模型的长时 talking head 生成方法,旨在解决现有方法在长时间视频生成中出现的身份不一致和时空对齐问题。该方法引入了时间参考编码(TRE)和非对称知识蒸馏(AKD),分别用于缓解静态身份参考与动态音频流之间的时空错位,以及解决分块生成过程中身份漂移的问题。实验表明,AsymTalker 在保证高保真度和身份一致性的同时,能够生成长达600秒的视频,并实现每秒66帧的实时推理速度,达到了当前最先进的性能。

详情
英文摘要

Diffusion-based talking head generation has achieved remarkable visual quality, yet scaling it to long-term videos remains challenging. The widely adopted chunk-wise paradigm introduces two fundamental failures: (1) temporal-spatial misalignment between static identity references and dynamic audio streams, and (2) cascading identity drift propagated through self-generated continuity references across chunks. To address both issues, we propose AsymTalker, a novel diffusion-based talking head generation method comprising Temporal Reference Encoding (TRE) and Asymmetric Knowledge Distillation (AKD). First, TRE mitigates temporal-spatial misalignment by transforming the static identity image into a temporally coherent latent representation through encoding of a temporally replicated pseudo-video, without introducing additional parameters. Second, AKD resolves the inherent conditioning dilemma in chunk-wise training: using ground-truth references causes train-inference mismatch, while self-generated references entangle supervision with identity drift. Our asymmetric design circumvents this by anchoring the teacher model with ground-truth continuity references to provide drift-free, chunk-level supervision, thereby avoiding the teacher bottleneck. Meanwhile, the student model learns under inference-aligned conditions, conditioned only on self-generated references, and is trained via distribution matching to preserve identity over long horizons. Extensive experiments show AsymTalker achieves state-of-the-art results on HDTF and VFHQ. It guarantees high-fidelity, identity-consistent synthesis over 600-second videos and reaches a real-time inference speed of 66 FPS.

2605.02751 2026-05-12 cs.AI cs.CL

Mitigating Misalignment Contagion by Steering with Implicit Traits

Maria Chang, Ronny Luss, Miao Liu, Keerthiram Murugesan, Karthikeyan Ramamurthy, Djallel Bouneffouf

AI总结 在多智能体场景中,语言模型(LMs)遵循指令和保持价值对齐至关重要,但现有研究多关注单个模型与用户的对齐,忽视了多模型交互中可能产生的对齐偏差扩散问题。本文通过多轮对话的社会困境游戏实验,发现模型在交互中可能表现出更加反社会的行为,且当其他模型被引导为恶意行为时,这种效应会加剧。为缓解该问题,作者提出了一种基于隐式特质引导的方法,通过间歇性注入强化模型初始正向社会行为的系统提示,有效抑制对齐偏差的扩散,且无需访问模型参数或内部状态,适用于黑箱模型的多智能体应用场景。

详情
英文摘要

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.

2605.02487 2026-05-12 cs.RO

Visibility-Aware Mobile Grasping in Dynamic Environments

Tianrun Hu, Anxing Xiao, David Hsu, Hanbo Zhang

AI总结 本文研究了机器人在动态未知环境中进行移动抓取的问题,重点解决有限视野下视觉感知与身体运动之间的协调难题。提出了一种统一的移动抓取系统,包含基于行为树的分层规划器和结合主动感知的全身运动规划器,能够在动态障碍物存在的情况下安全导航并完成抓取任务。实验表明,该方法在静态和动态未知环境中分别实现了68.8%和58.0%的成功率,显著优于现有方法。

详情
英文摘要

This paper addresses the problem of mobile grasping in dynamic, unknown environments where a robot must operate under a limited field-of-view. The fundamental challenge is the inherent trade-off between ``seeing'' around to reduce environmental uncertainty and ``moving'' the body to achieve task progress in a high-dimensional configuration space, subject to visibility constraints. Previous approaches often assume known or static environments and decouple these objectives, failing to guarantee safety when unobserved dynamic obstacles intersect the robot's path during manipulation. In this paper, we propose a unified mobile grasping system comprising two core components: (1) an iterative low-level whole-body planner coupled with velocity-aware active perception to navigate dynamic environments safely; and (2) a hierarchical high-level planner based on behavior trees that adaptively generates subgoals to guide the robot through exploration and runtime failures. We provide experimental results across 400 randomized simulation scenarios and real-world deployment on a Fetch mobile manipulator. Results show that our system achieves a success rate of 68.8\% and 58.0\% in unknown static and dynamic environments, respectively, significantly boosting success rates by 22.8\% and 18.0\% over the \nam approach in both unknown static and dynamic environments, with improved collision safety.

2605.01402 2026-05-12 cs.CL cs.CV cs.LG

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

Yao Du, Shanshan Song, Xiaomeng Li

AI总结 多模态大语言模型(MLLMs)在处理长尾分布的数值回归任务时表现不佳,现有基于标记的监督微调方法容易偏向高密度区域,导致回归均值化和尾部性能下降。本文提出了一种基于组相对策略优化的分布感知强化学习框架,通过引入基于一致相关系数的奖励机制,在批量层面提供跨样本的比较监督,从而在相关性、尺度和均值等方面对齐预测与真实分布。该方法无需修改模型结构,实验表明其在多种长尾回归基准上均优于传统微调方法,尤其在中样本和少样本场景下效果显著。

Comments Accepted by ICML 2026

详情
英文摘要

Multimodal large language models (MLLMs) struggle with numerical regression under long-tailed target distributions. Token-level supervised fine-tuning (SFT) and point-wise regression rewards bias learning toward high-density regions, leading to regression-to-the-mean behavior and poor tail performance. We identify the lack of cross-sample relational supervision as a key limitation of existing MLLM training paradigms. To address it, we propose a distribution-aware reinforcement learning framework based on Group Relative Policy Optimization, which introduces batch-level comparison-based supervision via the Concordance Correlation Coefficient-based reward to align predicted and ground-truth distributions in terms of correlation, scale, and mean. The framework is plug-and-play, requiring no architectural modification. Experiments on a unified suite of long-tailed regression benchmarks show consistent improvements over SFT and existing MLLM regression methods, with particularly strong gains in medium- and few-shot regimes.

2605.00642 2026-05-12 cs.AI cs.CV

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Yan Zhang, Daiqing Wu, Huawen Shen, Can Ma, Yu Zhou

AI总结 本文提出了一种面向GUI定位任务的首个基于策略自蒸馏(OPSD)框架GUI-SD,旨在解决现有强化学习方法在训练效率和样本稀疏性方面的不足。该方法通过构建视觉增强的特权上下文和引入熵引导的蒸馏策略,实现了单次交互中的密集监督学习,有效提升了定位精度与训练效率。实验表明,GUI-SD在六个代表性基准上均优于现有方法。

Comments under review

详情
英文摘要

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.

2605.00623 2026-05-12 cs.RO

Recovering Hidden Reward in Diffusion-Based Policies

Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie, Guodong Zhang, Qicheng He, Deyi Ji, Yue Ding, Hongtao Lu

AI总结 本文提出了一种名为 EnergyFlow 的框架,通过参数化一个标量能量函数,将生成动作建模与逆强化学习相结合,其梯度即为去噪场。该方法在最大熵最优性条件下,通过去噪得分匹配学习得分函数,能够恢复专家的软Q函数梯度,从而实现无需对抗训练的奖励提取。实验表明,EnergyFlow 在多种操作任务中表现出领先的模仿性能,并能为下游强化学习提供有效的奖励信号,优于对抗性逆强化学习和基于似然的方法。

Comments Accepted by ICML 2026

详情
英文摘要

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.

2605.00548 2026-05-12 cs.CV cs.GR

Colorful-Noise: Training-Free Low-Frequency Noise Manipulation for Color-Based Conditional Image Generation

Nadav Z. Cohen, Ofir Abramovich, Ariel Shamir

AI总结 本文研究了扩散模型中输入噪声的特性,发现白噪声中低频分量主要决定图像的全局结构和颜色组成,而高频分量控制细节。基于此,作者提出了一种无需训练的低频噪声操控方法,通过简单操作低频噪声来引导图像生成过程,从而在保持输出多样性的同时,实现对图像整体结构和颜色的有效控制。

Comments SIGGRAPH 2026 Conference Paper. Project Page at: https://nadavc220.github.io/colorful-noise/

详情
英文摘要

Text-to-image diffusion models generate images by gradually converting white Gaussian noise into a natural image. White Gaussian noise is well suited for producing diverse outputs from a single text prompt due to its absence of structure. However, this very property limits control over, and predictability of, specific visual attributes, as the noise is not human-interpretable. In this work, we investigate the characteristics of the input noise in diffusion models. We show that, although all frequencies in white Gaussian noise have comparable statistical energy, low-frequency components primarily determine the images global structure and color composition, while high-frequency components control finer details. Building on this observation, we demonstrate that simple manipulations of the low-frequency noise using low-frequency image priors can effectively condition the generation process to reconstruct these low-frequency visual cues. This allows us to define a simple, training-free method with minimal overhead that steers overall image structure and color, while letting high-frequency components freely emerge as fine details, enabling variability across generated outputs.

2605.00408 2026-05-12 cs.CV

Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang, Wenjie Pei

AI总结 本文提出了一种可学习的密度控制方法LeGS,用于改进三维高斯溅射(3DGS)技术,以克服其对启发式密度控制规则的依赖。该方法将密度控制建模为通过强化学习优化的参数化策略网络,并设计了一种基于敏感性分析的有效奖励函数,以精确量化单个高斯分布对重建质量的贡献。实验表明,LeGS在多个数据集上显著优于现有方法,在重建质量和计算效率之间取得了更好的平衡。

Comments 9 pages, 5 figures

详情
英文摘要

While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce \textbf{LeGS}, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is the tailored effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from $O(N^2)$ to $O(N)$. Extensive experiments on the Mip-NeRF 360, Tanks \& Temples, and Deep Blending datasets demonstrate that \textbf{LeGS} significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at https://github.com/AaronNZH/LeGS

2604.27224 2026-05-12 cs.RO

Learning Tactile-Aware Quadrupedal Loco-Manipulation Policies

Pokuang Zhou, Yuhao Zhou, Quan Khanh Luu, Seungho Han, Heng Zhang, Binghao Huang, Yunzhu Li, Arash Ajoudani, Zhengtong Xu, Yu She

AI总结 本文研究了如何通过触觉感知提升四足机器人在复杂接触环境中的运动与操作能力。作者提出了一种分层的触觉感知策略学习框架,结合真实人类示范训练高层视觉-触觉策略,并通过大规模仿真强化学习训练底层触觉感知全身控制策略,实现了从仿真到现实的零样本迁移。实验表明,该方法在多种高接触任务中相比仅依赖视觉或视觉-触觉的方法,平均性能提升了28.54%。

详情
英文摘要

Quadrupedal loco-manipulation is commonly built on visual perception and proprioception. Yet reliable contact-rich manipulation remains difficult: vision and proprioception alone cannot resolve uncertain, evolving interactions with the environment. Tactile sensing offers direct contact observability, but scalable tactile-aware learning framework for quadrupedal loco-manipulation is still underexplored. In this paper, we present a tactile-aware loco-manipulation policy learning pipeline with a hierarchical structure. Our approach has two key components. First, we leverage real-world human demonstrations to train a tactile-conditioned visuotactile high-level policy. This policy predicts not only end-effector trajectories for manipulation, but also the evolving tactile interaction cues that characterize how contact should develop over time. Second, we perform large-scale reinforcement learning in simulation to learn a tactile-aware whole-body control policy that tracks diverse commanded trajectories and tactile interaction cues, and transfers zero-shot to the real world. Together, these components enable coordinated locomotion and manipulation under contact-rich scenarios. We evaluate the system on real-world contact-rich tasks, including in-hand reorientation with insertion, valve tightening, and delicate object manipulation. Compared to vision-only and visuotactile baselines, our method improves performance by 28.54% on average across these tasks.

2604.26805 2026-05-12 cs.AI cs.MA

Bian Que: An Agentic Framework with Flexible Skill Arrangement for Online System Operations

Bochao Liu, Zhipeng Qian, Yang Zhao, Xinyuan Jiang, Zihan Liang, Yufei Ma, Junpeng Zhuang, Ben Chen, Shuo Yang, Hongen Wan, Yao Wu, Chenyi Lei, Xiao Liang

AI总结 本文提出了一种名为Bian Que的智能运维框架,旨在提升大规模在线系统(如搜索、推荐和广告)的运维效率。该框架通过统一的操作范式和灵活的技能编排机制,实现了对运维事件的精准数据与知识匹配,解决了传统方法中信息过载与人工配置困难的问题。研究贡献包括统一的操作模式、自动化的技能生成与优化机制,以及自演进的学习系统,实际部署在快手电商搜索引擎上,显著提升了运维效率与准确性。

Comments HomePage: https://benchen4395.github.io

详情
英文摘要

Operating and maintaining (O&M) large-scale online engine systems (eg, search, recommendation and advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. Despite the inherent suitability of LLM-based agents for such operational scenarios, the critical bottleneck impeding their practical deployment lies not in reasoning, but in orchestration capability - specifically, the precise selection of relevant data (encompassing metrics, logs, and change events) and applicable knowledge (including handbook-defined rules and empirically derived practitioner experience) tailored to each individual operational event. Feeding all signals indiscriminately causes dilution and hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable under dozens of daily releases. Here we present Bian Que, an agentic operating framework with three contributions: (i) The unified operational paradigm, which abstracts routine daily O&M actions into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) The flexible Skill Arrangement, each predefined Skill explicitly defines the requisite data and operational knowledge for each specific context. Such Skills can be automatically generated and updated by LLM agents, and can also be iteratively optimized by on-call engineers via natural language instructions. (iii) The unified self-evolving mechanism, where each correction signal enables two parallel evolutionary pathways: distilling event memory into knowledge, and targeted refinement of Skills. Deployed on the e-commerce search engine of KuaiShou, Bian Que reduces alert volume by 75%, achieves 80% root-cause analysis accuracy, cuts mean time to resolution by over 50%, and attains a 99.0% pass rate on offline evaluations. Codes are at https://github.com/benchen4395/BianQue_Assistant.

2604.24954 2026-05-12 cs.LG cs.AI cs.CV

Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas Voegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yifan Peng, Piotr Zelasko, Zhehuai Chen, Nithin Rao Koluguri, Nune Tadevosyan, Lilit Grigoryan, Ehsan Hosseini Asl, Pritam Biswas, Leili Tavabi, Yuanhang Su, Zhiding Yu, Peter Jin, Alexandre Milesi, Netanel Haber, Yao Xu, Sarah Amiraslani, Nabin Mulepati, Eric Tramel, Jaehun Jung, Ximing Lu, Brandon Cui, Jin Xu, Zhiqi Li, Shihao Wang, Yuanguo Kuang, Shaokun Zhang, Huck Yang, Boyi Li, Hongxu Yin, Song Han, Bilal Kartal, Pavlo Molchanov, Adi Renduchintala, Charles Wang, David Mosallanezhad, Soumye Singhal, Luis Vega, Katherine Cheung, Sreyan Ghosh, Yian Zhang, Alexander Bukharin, Venkat Srinivasan, Johnny Greco, Andre Manoel, Maarten Van Segbroeck, Suseella Panguliri, Rohit Watve, Divyanshu Kakwani, Shubham Pachori, Jeffrey Glick, Radha Sri-Tharan, Aileen Zaman, Khanh Nguyen, Shi Chen, Jiaheng Fang, Qing Miao, Wenfei Zhou, Yu Wang, Zaid Pervaiz Bhat, Varun Praveen, Arihant Jain, Ramanathan Arunachalam, Tomasz Kornuta, Ashton Sharabiani, Amy Shen, Wei Huang, Yi-Fu Wu, Ali Roshan Ghias, Huiying Li, Brian Yu, Nima Tajbakhsh, Chen Cui, Wenwen Gao, Li Ding, Terry Kong, Manoj Kilaru, Anahita Bhiwandiwalla, Marek Wawrzos, Daniel Korzekwa, Pablo Ribalta, Grzegorz Chlebus, Besmira Nushi, Ewa Dobrowolska, Maciej Jakub Mikulski, Kunal Dhawan, Steve Huang, Jagadeesh Balam, Yongqiang Wang, Nikolay Karpov, Valentin Mendelev, George Zelenfroynd, Meline Mkrtchyan, Qing Miao, Omri Almog, Bhavesh Pawar, Rameshwar Shivbhakta, Sudeep Sabnis, Ashrton Sharabiani, Negar Habibi, Geethapriya Venkataramani, Pamela Peng, Prerit Rodney, Serge Panev, Richard Mazzarese, Nicky Liu, Michael Fukuyama, Andrii Skliar, Roger Waleffe, Duncan Riach, Yunheng Zou, Jian Hu, Hao Zhang, Binfeng Xu, Yuhao Yang, Zuhair Ahmed, Alexandre Milesi, Carlo del Mundo, Chad Voegele, Zhiyu Cheng, Nave Assaf, Andrii Skliar, Daniel Afrimi, Natan Bagrov, Ran Zilberstein, Ofri Masad, Eugene Khvedchenia, Natan Bagrov, Borys Tymchenko, Tomer Asida, Daniel Afrimi, Parth Mannan, Victor Cui, Michael Evans, Katherine Luna, Jie Lou, Pinky Xu, Guyue Huang, Negar Habibi, Michael Boone, Pradeep Thalasta, Adeola Adesoba, Dina Yared, Christopher Parisien, Leon Derczynski, Shaona Ghosh, Wes Feely, Micah Schaffer, Radha Sri-Tharan, Jeffrey Glick, Barnaby Simkin, George Zelenfroynd, Tomasz Grzegorzek, Rishabh Garg, Aastha Jhunjhunwala, Sergei Kolchenko, Farzan Memarian, Haran Kumar, Shiv Kumar, Isabel Hulseman, Anjali Shah, Kari Briski, Padmavathy Subramanian, Joey Conway, Udi Karpas, Jane Polak Scowcroft, Annie Surla, Shilpa Ammireddy, Ellie Evans, Jesse Oliver, Tom Balough, Chia-Chih Chen, Sandip Bhaskar, Alejandra Rico, Bardiya Sadeghi, Seph Mard, Katherine Cheung, Meredith Price, Laya Sleiman, Saori Kaji, Wesley Helmholz, Wendy Quan, Michael Lightstone, Jonathan Cohen, Jian Zhang, Oleksii Kuchaiev, Boris Ginsburg, Jan Kautz, Eileen Long, Mohammad Shoeybi, Mostofa Patwary, Oluwatobi Olabiyi, Andrew Tao, Bryan Catanzaro, Udi Karpas

AI总结 本文介绍了 Nemotron 3 Nano Omni,这是 Nemotron 多模态系列的最新模型,首次原生支持音频输入,同时兼容文本、图像和视频。该模型在架构、训练数据和训练方法上均有改进,在多种模态任务中均表现出更高的准确性,尤其在现实文档理解、长音频视频理解和智能计算机使用方面表现突出。基于高效的 Nemotron 3 Nano 30B-A3B 架构,该模型引入了创新的多模态 token 减少技术,显著降低了推理延迟并提升了吞吐量,同时提供了多种精度格式的模型权重和部分训练数据及代码以促进进一步研究。

详情
英文摘要

We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.

2604.20783 2026-05-12 cs.LG

Physics-Conditioned Synthesis of Internal Ice-Layer Thickness for Incomplete Layer Traces

Zesheng Liu, Maryam Rahnemoonfar

AI总结 该研究旨在解决由雷达观测得到的冰层内部结构不完整的问题,通过结合物理气候模型提供的同步特征,生成完整的冰层厚度标注。提出的方法结合了几何学习与基于变换器的时序模块,以聚合层内空间信息并促进层间信息传播,从而生成结构一致且符合物理规律的冰层厚度。该模型在保留已有观测数据的基础上,能够恢复缺失的冰层片段,甚至填补完全缺失的层,并为后续深度层预测模型提供了有效的预训练监督信号。

Comments Accepted for 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

详情
英文摘要

Internal ice layers imaged by radar provide key evidence of snow accumulation and ice dynamics, but radar-derived layer boundary observations are often incomplete, with discontinuous traces and sometimes entirely missing layers, due to limited resolution, sensor noise, and signal loss. Existing graph-based models for ice stratigraphy generally assume sufficiently complete layer profiles and focus on predicting deeper-layer thickness from reliably traced shallow layers. In this work, we address the layer-completion problem itself by synthesizing complete ice-layer thickness annotations from incomplete radar-derived layer traces by conditioning on colocated physical features synchronized from physical climate models. The proposed network combines geometric learning to aggregate within-layer spatial context with a transformer-based temporal module that propagates information across layers to encourage coherent stratigraphy and consistent thickness evolution. To learn from incomplete supervision, we optimize a mask-aware robust regression objective that evaluates errors only at observed thickness values and normalizes by the number of valid entries, enabling stable training under varying sparsity without imputation and steering completions toward physically plausible values. The model preserves observed thickness where available and infers only missing regions, recovering fragmented segments and even fully absent layers while remaining consistent with measured traces. As an additional benefit, the synthesized thickness stacks provide effective pretraining supervision for a downstream deep-layer predictor, improving fine-tuned accuracy over training from scratch on the same fully traced data.

2604.20175 2026-05-12 cs.LG cs.AI

Physics-Enhanced Deep Learning for Proactive Thermal Runaway Forecasting in Li-Ion Batteries

Salman Khan, Syed Sajid Ullah, Muhammad Zunair Zamir, Jie Li, Abdul Malik, Saeed Mian Qaisar

AI总结 本文提出了一种融合物理规律的深度学习框架 PI-LSTM,用于锂离子电池热失控的主动预测。该方法通过在损失函数中引入热传导方程作为正则化项,将物理约束直接嵌入神经网络结构,从而在保证预测精度的同时满足热力学原理。实验表明,该模型在多个电池数据集上显著优于传统 LSTM、CNN-LSTM 和 MLP 模型,大幅提升了预测准确性和泛化能力,为下一代电池系统的实时热管理提供了可行方案。

详情
英文摘要

Accurate prediction of thermal runaway in lithium-ion batteries is essential for ensuring the safety, efficiency, and reliability of modern energy storage systems. Conventional data-driven approaches, such as Long Short-Term Memory (LSTM) networks, can capture complex temporal dependencies but often violate thermodynamic principles, resulting in physically inconsistent predictions. Conversely, physics-based thermal models provide interpretability but are computationally expensive and difficult to parameterize for real-time applications. To bridge this gap, this study proposes a Physics-Informed Long Short-Term Memory (PI-LSTM) framework that integrates governing heat transfer equations directly into the deep learning architecture through a physics-based regularization term in the loss function. The model leverages multi-feature input sequences, including state of charge, voltage, current, mechanical stress, and surface temperature, to forecast battery temperature evolution while enforcing thermal diffusion constraints. Extensive experiments conducted on thirteen lithium-ion battery datasets demonstrate that the proposed PI-LSTM achieves an 81.9% reduction in root mean square error (RMSE) and an 81.3% reduction in mean absolute error (MAE) compared to the standard LSTM baseline, while also outperforming CNN-LSTM and multilayer perceptron (MLP) models by wide margins. The inclusion of physical constraints enhances the model's generalization across diverse operating conditions and eliminates non-physical temperature oscillations. These results confirm that physics-informed deep learning offers a viable pathway toward interpretable, accurate, and real-time thermal management in next-generation battery systems.

2604.20172 2026-05-12 cs.LG math.ST stat.ML stat.TH

Cover meets Robbins while Betting on Bounded Data: $\ln n$ Regret and Almost Sure $\ln\ln n$ Regret

Shubhada Agrawal, Aaditya Ramdas

AI总结 本文研究在有界数据序列上进行投注时的策略设计,旨在同时应对随机数据和对抗性数据。提出了一种结合Robbins和Cover思想的混合投注策略,该策略在几乎所有路径上实现了$O(\ln \ln n)$的对数对数级遗憾,而在少数路径上则保持$O(\ln n)$的对数级遗憾。该方法首次展示了通过策略对冲实现对随机数据和对抗数据的自适应性,具有重要的理论价值和应用前景。

Comments Improved a regret bound. New regret bound for a classical mixture

详情
英文摘要

Consider betting against a sequence of data in $[0,1]$, where one is allowed to make any bet that is fair if the data have a conditional mean $m_0 \in (0,1)$. Cover's universal portfolio algorithm delivers a worst-case regret of $O(\ln n)$ compared to the best constant bet in hindsight, and this bound is unimprovable against adversarially generated data. In this work, we present a novel mixture betting strategy that combines insights from Robbins and Cover, and exhibits a different behavior: it eventually produces a regret of $O(\ln \ln n)$ on almost all paths (a measure-one set of paths if each conditional mean equals $m_0$ and intrinsic variance increases to $\infty$), but has an $O(\log n)$ regret on the complement (a measure zero set of paths). Our paper appears to be the first to point out the value in hedging two very different strategies to achieve a best-of-both-worlds adaptivity to stochastic data and protection against adversarial data. We contrast our results to those in Agrawal and Ramdas [2026] for a sub-Gaussian mixture on unbounded data: their worst-case regret has to be unbounded, but a similar hedging delivers both an optimal betting growth-rate and an almost sure $\ln\ln n$ regret on stochastic data. Finally, our strategy witnesses a sharp game-theoretic upper law of the iterated logarithm, analogous to Shafer and Vovk [2005].

2604.19923 2026-05-12 cs.CV

UniCon3R: Unified Contact-aware 4D Human-Scene Reconstruction from Monocular Video

Tanuj Sur, Shashank Tripathi, Nikos Athanasiou, Ha Linh Nguyen, Kai Xu, Michael J. Black, Angela Yao

AI总结 本文提出 UniCon3R,一种用于从单目视频中进行在线人类-场景四维重建的统一前馈框架。该方法通过显式建模人类与场景之间的接触关系,利用接触信息作为修正线索来提升人体网格重建质量,从而在保证快速推理速度的同时,实现场景几何与对齐的人体四维重建。实验表明,UniCon3R 在物理合理性与人体运动估计方面优于现有方法,验证了接触信息作为强大先验在联合重建中的有效性。

Comments Project page: https://surtantheta.github.io/UniCon3R

详情
英文摘要

We introduce UniCon3R, a unified feed-forward framework for online human-scene 4D reconstruction from monocular video. Current feed-forward human-scene reconstruction methods suffer from artifacts, where bodies float above the ground or penetrate parts of the scene. A key reason is the lack of effective interaction modelling between the human and the environment. Our goal is to exploit contact between the human and the scene during inference to actively improve the human mesh reconstruction. To that end, we explicitly model interaction by inferring 4D contact from the human pose and scene geometry and use the contact as a corrective cue for generating the pose. This enables UniCon3R to jointly recover scene geometry and spatially aligned 4D humans within the scene. Experiments on standard human-centric video benchmarks show that UniCon3R outperforms state-of-the-art baselines on physical plausibility and global human motion estimation while preserving fast, feed-forward inference speeds. The results validate our central claim: contact serves as a powerful internal prior, thus establishing a new paradigm for physically grounded joint human-scene reconstruction. Project page is available at https://surtantheta.github.io/UniCon3R .

2604.19835 2026-05-12 cs.LG cs.AI

Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts

Chaitanya Dwivedi, Binxuan Huang, Himanshu Gupta, Pratik Jayarao, Neeraj Varshney, Bing Yin

AI总结 该论文提出了一种名为“专家重用(Expert Upcycling)”的方法,用于在持续预训练过程中逐步扩展混合专家(MoE)模型的规模,从而在保持每token计算成本不变的情况下提升模型容量。核心思想是通过复制已有专家并扩展路由层,在已有模型的基础上构建更大规模的MoE模型,同时利用已有知识进行初始化以加速训练。实验表明,该方法在保持模型性能的同时显著减少了训练所需的计算资源,为高效扩展MoE模型提供了理论支持和实践方案。

Comments 9 Pages in main paper, 29 Pages total

详情
英文摘要

Mixture-of-Experts (MoE) has become the dominant architecture for scaling large language models: frontier models routinely decouple total parameters from per-token computation through sparse expert routing. Scaling laws show that under fixed active computation, model quality scales predictably with total parameters, and MoEs realize this by increasing expert count. However, training large MoEs is expensive, as memory requirements and inter-device communication both scale with total parameter count. We propose expert upcycling, a method for progressively expanding MoE capacity by increasing the number of experts during continued pre-training (CPT). Given a trained E-expert model, the upcycling operator constructs an mE-expert model through expert duplication and router extension while holding top-K routing fixed, preserving per-token inference cost. Duplication provides a warm initialization: the expanded model inherits the source checkpoint's learned representations, starting from a substantially lower loss than random initialization. Subsequent CPT then breaks the symmetry among duplicated experts to drive specialization. We formalize the upcycling operator and develop a theoretical framework decomposing the quality gap into a capacity term and an initialization term. We further introduce utility-based expert selection, which uses gradient-based importance scores to guide non-uniform duplication, more than tripling gap closure when CPT is limited. In our 7B-13B total parameter experiments, the upcycled model matches the fixed-size baseline on validation loss while saving 32% of GPU hours. Comprehensive ablations across model scales, activation ratios, MoE architectures, and training budgets yield a practical recipe for deploying expert upcycling, establishing it as a principled, compute-efficient alternative to training large MoE models from scratch.

2604.19748 2026-05-12 cs.CV

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng

AI总结 Tstars-Tryon 1.0 是一个高效、真实且鲁棒的虚拟试穿系统,能够应对复杂现实场景中的多种挑战,如极端姿态、光照变化和运动模糊等。该系统支持多种服装类别和多参考图像的灵活组合,生成具有精细细节和真实材质的高质量图像,同时避免了常见的AI生成伪影。通过端到端的模型架构和优化的推理速度,系统实现了接近实时的生成效果,并已在淘宝App上大规模部署,服务于数百万用户。

Comments 24 pages, model evaluation report

详情
英文摘要

Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.

2604.15529 2026-05-12 cs.AI

LACE: Lattice Attention for Cross-thread Exploration

Yang Li, Zirui Zhang, Yang Liu, Chengzhi Mao

AI总结 当前大型语言模型在推理过程中往往是独立进行的,尽管可以并行生成多个推理路径,但这些路径之间缺乏互动,容易以相似的方式失败。本文提出LACE框架,通过引入晶格注意力机制,使并行推理路径能够互相交流中间结果并纠正错误,从而将独立的推理过程转变为协调的并行过程。实验表明,这种方法显著提升了推理准确性,验证了允许推理路径相互作用可以增强大型语言模型的性能。

Comments 22 pages, 15 figures

详情
英文摘要

Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.

2604.14655 2026-05-12 cs.AI cs.LG

AgentGA: Evolving Code Solutions in Agent-Seed Space

David Y. Y. Tan, Kellie Chin, Jingxian Zhang

AI总结 本文提出了一种名为 AgentGA 的框架,通过优化“代理种子”(包括任务提示和可选的父级存档)来演化自主的代码生成过程,而非直接编辑代码。该方法在种群层面结合了遗传算法与长期规划代理,利用确定性精英锦标赛进行选择,并通过改进的 Hedge 控制器动态分配操作符。实验表明,AgentGA 在 Weco-Kaggle Lite 自动化机器学习基准测试中表现优异,大幅超越了人类水平和现有参考方法。

Comments 30 pages total (9-page main text + references + appendix), 5 figures, 9 tables

详情
英文摘要

We present AgentGA, a framework that evolves autonomous code-generation runs by optimizing the agent seed: the task prompt plus optional parent archives that initialize a fresh workspace. The outer loop searches over these reusable starting conditions rather than editing code directly. Each generation launches a fresh autonomous run in an isolated workspace, while selected parent archives provide inherited artifacts that descendants can inspect and reuse. AgentGA couples a population-level genetic algorithm with long-horizon agents; selection uses deterministic 1:1 elite tournaments and operator allocation is adapted online with a modified Hedge controller. We instantiate the approach for tabular AutoML on the 16-competition Weco-Kaggle Lite benchmark. Across the full benchmark, AgentGA averages 71.90% Exceeds % of Human versus 51.38% for the AIDE reference, winning 15/16 competitions. Within AgentGA runs, descendants conditioned on inherited parent archives win 51.9% of 1,680 parent-child tournaments versus 8.6% for de novo proposals. These results support agent-seed optimization as a practical design choice for autonomous code-search systems.

2604.14125 2026-05-12 cs.CV cs.AI cs.RO

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

AI总结 本文提出了一种名为HiVLA的视觉-语义引导的分层操作系统,旨在解决端到端视觉-语言-动作模型在精细控制数据微调时削弱其基础视觉语言模型推理能力的问题。该方法通过将高层语义规划与底层运动控制解耦,利用视觉语言模型进行任务分解和视觉定位,生成结构化操作计划,并通过配备级联交叉注意力机制的扩散变换器执行精确动作,从而在保持高层推理能力的同时提升操作精度。实验表明,HiVLA在长时序技能组合和复杂场景下的精细操作任务中显著优于现有端到端方法。

Comments Project Page: https://tianshuoy.github.io/HiVLA-page/

详情
英文摘要

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.