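arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.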

2605.13846 2026-05-14 cs.CL cs.AI

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

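AI summary: This paper introduces WARDEN, an early language-model system for transcribing and translating Wardaman, an endangered Australian Indigenous language, into English. With only 6 hours of annotated audio available, conventional approaches that rely on large-scale training data do not apply, so WARDEN uses a staged design: speech-to-phoneme transcription first, then phoneme-to-English translation. It adds two performance-enhancing techniques: initializing the model from a phonemically similar language, and large language model reasoning combined with an expert-annotated dictionary. Experiments show that WARDEN outperforms conventional unified models under extremely low-data conditions and provides a strong baseline for endangered-language processing.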

2605.13839 2026-05-14 cs.CL

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Wenrui Bao, Huan Wang, Jian Wang, Zhangyang Wang, Kai Wang, Yuzhang Shang

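AI summary: This paper studies more efficient collaboration in multi-agent large language model systems and proposes TFlow, a weight-space communication framework that converts the sender's hidden states into receiver-specific low-rank weight perturbations, replacing the conventional exchange of natural-language messages. Without changing the model architecture or the text context, the approach adapts the receiver at the instance level and significantly reduces compute overhead and inference time. Experiments show improved accuracy on multiple benchmarks and a large reduction in the number of tokens processed.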

2605.13835 2026-05-14 cs.CV

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

Hao Sun, Zi-Jun Ding, Da-Wei Zhou

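AI summary: This paper studies CLIP-based class-incremental learning (CIL), which aims to let a model keep learning new classes without catastrophic forgetting. Existing methods focus mainly on aligning global image embeddings and overlook the rich local patch-level semantic information in the CLIP encoder. The authors propose SPA, which generates semantic descriptions of each class, uses them to guide the selection of discriminative patch-level visual features, and performs cross-modal alignment with optimal transport, making more effective use of local information to improve recognition. It also introduces task-specific projectors and a pseudo-feature sampling strategy to improve adaptability and stability.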

2605.13833 2026-05-14 cs.LG cs.CV

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Hoang-Quan Nguyen, Sankalp Pandey, Khoa Luu

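AI summary: This paper proposes QLAM, a quantum long-attention memory method for long-sequence token modeling. It combines the superposition property of quantum computing with the linear-time efficiency of state-space models (SSMs), representing the hidden state as a quantum state to strengthen the global representation of historical information. Experiments show that QLAM outperforms conventional recurrent models and Transformer-based models on several sequential image classification tasks.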

English abstract

Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit their ability to capture complex global interactions across tokens. In this work, we introduce one of the first studies to leverage the superposition property of quantum systems to enhance state-based sequence modeling. In particular, we propose Quantum Long-Attention Memory (QLAM), a hybrid quantum-classical memory mechanism that can be viewed as a quantum extension of state-space models. Instead of maintaining a classical latent state updated through additive dynamics, QLAM represents the hidden state as a quantum state whose amplitudes encode a superposition of historical information. The state evolves through parameterized quantum circuits conditioned on the input, enabling a non-classical, global update mechanism. In this way, QLAM preserves the recurrent and linear-time structure of SSMs while fundamentally enriching the memory representation through quantum superposition. Unlike attention mechanisms that explicitly compute pairwise interactions, QLAM implicitly captures global dependencies through the evolution of the quantum state, and retrieves task-relevant information via query-dependent measurements. We evaluate QLAM on sequential variants of standard image classification benchmarks, including sMNIST, sFashion-MNIST, and sCIFAR-10, where images are flattened into token sequences. Across all tasks, QLAM consistently improves over recurrent baselines and transformer-based models.

2605.13831 2026-05-14 cs.CV

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song

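AI summary: This paper studies how to train long-context vision-language models (LVLMs) effectively so that they generalize beyond a 128K context length. Through systematic continued pre-training experiments, the authors find that long-document VQA is more effective than OCR transcription and report three key findings: the data length distribution should be balanced, retrieval is the main bottleneck, and long-document data can preserve short-context capabilities. Based on these findings they introduce MMProLong, which, using only 5 billion tokens, significantly improves long-document VQA performance and maintains good performance at longer context lengths without additional training.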

Comments: work in progress

English abstract

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

2605.13829 2026-05-14 cs.CL cs.AI cs.LG

Negation Neglect: When models fail to learn negations in training

Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans

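AI summary: This paper identifies "negation neglect": when fine-tuning large language models, if training documents explicitly mark a statement as false, the model may nonetheless come to believe the statement is true. The study finds that training on documents containing negated information significantly raises the model's belief rate in the false statements, even when the documents repeatedly stress that the statements are false. Experiments show that the phenomenon affects not only the learning of factual statements but may also extend to model behavior, posing potential risks for AI safety.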

2605.13826 2026-05-14 cs.LG cond-mat.mtrl-sci physics.chem-ph

Reducing cross-sample prediction churn in scientific machine learning

Gordan Prastalo, Kevin Maik Jablonka

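AI summary: Scientific machine learning usually reports only a model's predictive performance, without saying whether the same predictions stay consistent across different samples of the training data. This paper introduces "cross-sample prediction churn": on the same test samples, models trained on different subsets of the training data may give inconsistent predictions. The study finds that conventional parameter-side methods do not effectively reduce this churn, whereas data-side methods such as $K$-bootstrap out-of-bag sampling and the proposed twin-bootstrap method significantly reduce it without loss of accuracy, giving scientific machine learning evaluation a more complete metric.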
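
To make the churn idea concrete, here is a minimal sketch, assuming churn is measured as the fraction of test points on which two models disagree. The toy nearest-centroid classifier, the synthetic 1-D data, and the helper names are all hypothetical stand-ins and not the paper's setup or its twin-bootstrap method.

```python
import random

def churn(preds_a, preds_b):
    """Fraction of test points on which two models disagree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have equal length")
    disagreements = sum(a != b for a, b in zip(preds_a, preds_b))
    return disagreements / len(preds_a)

def nearest_centroid_fit(train):
    """Toy 1-D classifier: one mean per class label (stand-in for any model)."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def nearest_centroid_predict(centroids, xs):
    """Assign each x to the class whose centroid is closest."""
    return [min(centroids, key=lambda y: abs(x - centroids[y])) for x in xs]

def bootstrap(data, rng):
    """Resample the training set with replacement (same size)."""
    return [rng.choice(data) for _ in data]

# Hypothetical synthetic data: two overlapping 1-D classes.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(50)]
data += [(rng.gauss(1.0, 1.0), 1) for _ in range(50)]
test_xs = [i / 10 - 2.0 for i in range(50)]

# Two models trained on different bootstrap samples of the same data.
model_a = nearest_centroid_fit(bootstrap(data, rng))
model_b = nearest_centroid_fit(bootstrap(data, rng))
c = churn(nearest_centroid_predict(model_a, test_xs),
          nearest_centroid_predict(model_b, test_xs))
print(f"churn between two bootstrap models: {c:.3f}")
```

Two models can have near-identical accuracy and still disagree on individual test points, which is why a disagreement rate is reported separately from accuracy.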

2605.13825 2026-05-14 cs.AI cs.CV

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodríguez Salgado

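AI summary: This study examines whether large language models continue to take unsafe actions when presented with a record of prior harmful behavior. It builds HistoryAnchor-100, a test set of 100 high-risk scenarios, to evaluate models' decision tendencies under different behavior histories. Experiments find that when the prompt includes an instruction to "stay consistent with the strategy in the prior history", many well-aligned models significantly increase their probability of choosing the unsafe option and sometimes escalate their behavior, revealing a safety risk: model decisions can be strongly influenced by behavior history.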

2605.13822 2026-05-14 cs.RO cs.SY eess.SY

Loiter UAV Reinsertion Guidance for Fixed-wing UAV Corridors

Pradeep J, Kedarisetty Siddhardha, Ashwini Ratnoo

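AI summary: This paper studies the reinsertion of loitering UAVs into the main lane of a fixed-wing UAV corridor, which consists of a main lane, loop-shaped loiter lanes that relieve traffic congestion, and transition lanes connecting the two. To reinsert loitering UAVs into the main lane safely and without conflicts, it proposes a guidance algorithm based on virtual slots and speed constraints. Numerical simulations validate the method, which offers a feasible automation strategy for UAV traffic management.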

2605.13821 2026-05-14 cs.AI cs.LG

Harnessing Agentic Evolution

Jiayi Zhang, Yongfeng Gu, Jianhao Ruan, Maojia Song, Yiran Peng, Zhiguang Han, Jinyu Xiang, Zhitao Wang, Caiyin Yang, Yixi Ouyang, Bang Liu, Chenglin Wu, Yuyu Luo

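AI summary: This paper studies how to improve the stability and efficiency of agent evolution through interactive environments and proposes AEvo, a meta-editing framework. By treating the accumulated evolution context as process-level state, the framework lets a meta-agent edit the program or agent context that controls future evolution, steering program-based and agent-based evolution in a unified way. Experiments show that AEvo outperforms five existing evolution methods on multiple benchmark tasks, with significant performance gains.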

2605.13816 2026-05-14 cs.LG

Uncertainty-Driven Anomaly Detection for Psychotic Relapse Using Smartwatches: Forecasting and Multi-Task Learning Fusion

Nikolaos Tsalkitzis, Panagiotis P. Filntisis, Petros Maragos, Niki Efthymiou

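AI summary: This paper studies how to use smartwatch data with uncertainty-driven anomaly detection to spot signs of psychotic relapse early. It proposes two smartwatch-based frameworks: one forecasts heart-rate dynamics and detects anomalies from the deviation between forecast and observation; the other fuses sleep, motion, and heart-rate signals, learns time-aware embeddings, and predicts the time of measurement. Both use Transformer encoders and estimate predictive uncertainty with an ensemble of multilayer perceptrons for robustness. Fusing the anomaly signals of the two models significantly improves detection performance.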

2605.13815 2026-05-14 cs.CV cs.RO

OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

Youquan Liu, Weidong Yang, Ao Liang, Xiang Xu, Lingdong Kong, Yang Wu, Dekai Zhu, Xin Li, Runnan Chen, Ben Fei, Tongliang Liu, Wanli Ouyang

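AI summary: OmniLiDAR is a unified text-conditioned diffusion framework for multi-domain LiDAR point-cloud generation, covering eight scenarios that include adverse weather, sensor-configuration changes, and cross-platform acquisition. By introducing a cross-domain training strategy and feature-modeling techniques, it generates heterogeneous data with a single model and improves controllability and generalization. Experiments show that OmniLiDAR performs well in generation quality and in downstream tasks such as semantic segmentation and object detection, with a notable advantage when data is scarce.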

Comments: Preprint; 12 pages, 7 figures, 10 tables

English abstract

LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.

2605.13813 2026-05-14 cs.CV

JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

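AI summary: This paper proposes JANUS, a physiologically guided dual-stream architecture for robust CT triage under distribution shift. An anatomy-guided gating mechanism conditions visual embeddings on macro-level radiomics priors, improving generalization and reliability across institutions. Experiments show that JANUS outperforms existing methods on the MERLIN dataset and also performs well on external datasets, especially for detecting lesions defined by size and attenuation.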

2605.13810 2026-05-14 cs.LG cs.DS

Provable Quantization with Randomized Hadamard Transform

Ying Feng, Piotr Indyk, Michael Kapralov, Dmitry Krachun, Boris Prokhorov

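AI summary: This paper studies a provable quantization method based on the randomized Hadamard transform, aiming to reduce the time complexity of conventional random-projection quantization. By introducing a random scalar shift, the method keeps the quantizer unbiased while providing a mean-squared-error bound comparable to that of a fully random rotation matrix. The paper proves that with $b$ bits per coordinate the method attains near-optimal quantization accuracy, making it suitable for compression and optimization tasks in large-scale machine learning.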
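
As a rough illustration of the ingredients named in the summary, the sketch below applies a randomized Hadamard transform (random sign flips followed by a normalized fast Walsh-Hadamard transform) and then a uniform scalar quantizer with a shared random shift. The shifted quantizer here is standard subtractive dithering, used as an assumed stand-in; the paper's actual construction, bit allocation, and error analysis may differ. The step size and function names are hypothetical.

```python
import math
import random

def fwht(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def rht(v, signs):
    """Randomized Hadamard transform: flip signs, then apply H / sqrt(n)."""
    return fwht([s * x for s, x in zip(signs, v)])

def rht_inverse(y, signs):
    """Inverse: the normalized H is its own inverse, then undo the sign flips."""
    return [s * x for s, x in zip(signs, fwht(y))]

def quantize(y, step, shift):
    """Uniform scalar quantizer with a shared shift (subtractive dither)."""
    return [step * math.floor((x + shift) / step + 0.5) - shift for x in y]

rng = random.Random(0)
n = 8
x = [rng.uniform(-1, 1) for _ in range(n)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
step = 0.25                                # hypothetical grid spacing
shift = rng.uniform(-step / 2, step / 2)   # random scalar shift

y = rht(x, signs)                          # rotate in O(n log n)
x_hat = rht_inverse(quantize(y, step, shift), signs)
err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))
print(f"reconstruction error: {err:.4f}")
```

Because the transform is orthogonal, the reconstruction error equals the quantization error in the rotated domain, so each coordinate contributes at most `step / 2`. The transform costs O(n log n) rather than the O(n^2) of a dense rotation matrix, which is the time saving the summary refers to.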

2605.13803 2026-05-14 cs.CV

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani

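AI summary: This paper proposes EvoGround, a self-evolving video agent framework for video temporal grounding (VTG), that is, localizing the temporal segment of an untrimmed video that best matches a natural-language query. The method needs no human-annotated data: two cooperating agents, a proposer and a solver, learn temporal grounding automatically from raw video. Experiments show that EvoGround performs strongly on multiple benchmarks, matching or exceeding fully supervised models, and sets the state of the art for fine-grained video captioning without human annotation.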

2605.13801 2026-05-14 cs.LG cs.AI

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

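AI summary: As generative AI models such as large language models are widely deployed, ensuring their safety, robustness, and trustworthiness becomes especially important. Yet the AI field faces a reproducibility crisis caused by unreliable evaluation and experimental results that are hard to replicate. This paper proposes a multi-level bootstrapping method that uses datasets containing many ratings and persistent annotator identifiers to analyze the trade-off between the number of items and the number of responses per item needed to reach statistical significance, thereby modeling annotator behavior more realistically and improving the reproducibility of evaluation.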

2605.13798 2026-05-14 cs.CV

VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

Guney Tombak, Ertunc Erdil, Ender Konukoglu

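AI summary: In multimodal medical image analysis, cross-modal voxel-level representations must stay anatomically consistent across imaging modalities, scanners, and acquisition protocols. This paper proposes VoxCor, a training-free voxel feature extraction method that produces reusable 3D voxel feature representations from frozen 2D Vision Transformer models. It combines triplanar ViT inference with a weighted partial least squares projection, learning modality-stable anatomical directions in an offline fitting phase, so that at transform time new volumes can be mapped directly without fine-tuning or registration, with support for efficient voxel-correspondence queries. Experiments show strong registration performance and feature transfer in cross-subject, cross-modality tasks, providing a reusable feature layer for multimodal medical image analysis.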

English abstract

Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit-transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR-CT and inter-subject HCP T2w-T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at https://github.com/guneytombak/VoxCor.
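
The nearest-neighbor correspondence query mentioned in the abstract can be sketched as follows, assuming each voxel already has a projected feature vector. Cosine similarity is an assumption here (the abstract does not name the metric), the function names are hypothetical and not taken from the VoxCor repository, and a real pipeline would use a vectorized or indexed search rather than this brute-force loop.

```python
import math

def normalize(v):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero feature vector")
    return [x / norm for x in v]

def nearest_voxel(query_feat, target_feats):
    """Return (index, cosine similarity) of the best-matching target voxel."""
    q = normalize(query_feat)
    best_idx, best_sim = -1, -2.0
    for idx, feat in enumerate(target_feats):
        sim = sum(a * b for a, b in zip(q, normalize(feat)))
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx, best_sim

# Hypothetical 3-D features for three voxels in a target volume.
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
idx, sim = nearest_voxel([0.5, 0.9, 0.0], target)
print(idx, round(sim, 3))
```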

2605.13790 2026-05-14 cs.LG cs.AI

Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

Zhonghao Li, Chaoyu Liu, Qian Zhang

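AI summary: This paper proposes Di-BiLPS, a unified neural framework for efficiently solving forward and inverse partial differential equation (PDE) problems under extremely sparse observations. It combines a variational autoencoder, a latent diffusion module, and contrastive learning. Operating in latent space gives efficient inference and flexible input-output mappings, and a PDE-aware denoising algorithm based on a variance-preserving diffusion process further improves inference efficiency. Experiments show that Di-BiLPS performs well with extremely sparse inputs, significantly reduces computational cost, and supports zero-shot super-resolution prediction.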

2605.13786 2026-05-14 cs.LG

Interpretable Machine Learning for Antepartum Prediction of Pregnancy-Associated Thrombotic Microangiopathy Using Routine Longitudinal Laboratory Data

Chuanchuan Sun, Zhen Yu, Qin Fan, Qingchao Chen, Feng Yu

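AI summary: This study aims to predict the risk of pregnancy-associated thrombotic microangiopathy (P-TMA) in advance using routine laboratory tests during pregnancy. It builds machine learning models on longitudinal data, extracts time-dependent risk features from 146 laboratory indicators, and achieves high predictive performance with gradient boosting. The study finds that cystatin C at week 6 of pregnancy has potential as an early monitoring indicator for P-TMA, giving clinicians an interpretable prediction tool.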

English abstract

Background: Pregnancy-associated thrombotic microangiopathy (P-TMA) is rare but life-threatening. Early risk prediction before overt clinical presentation remains challenging, as the associated laboratory abnormalities are subtle, multidimensional, and frequently masked by common physiological changes such as gestational thrombocytopenia and pregnancy-related proteinuria, thus overlapping heavily with benign obstetric and renal conditions. This complexity is poorly captured by univariate or rule-based approaches; however, it is addressable by machine learning, which can extract latent, time-dependent risk signatures from longitudinal clinical tests. Methods: This retrospective study included 300 pregnancies comprising 142 P-TMA cases and 158 controls. After exclusion of identifiers and non-informative variables, 146 longitudinal laboratory predictors were retained. Participants were divided into a training cohort (80%) and a held-out test cohort (20%) using stratified sampling. Five algorithms were evaluated: logistic regression, support vector machine with radial basis function kernel, random forest, extra trees, and gradient boosting. The final model was selected by mean cross-validated AUROC, refitted on the full training cohort, and evaluated once in the held-out test cohort. Interpretability analyses examined global feature importance and distributional patterns of leading predictors. Results: Gradient boosting was prespecified by cross-validation in the training cohort. The model achieved an AUROC of 0.872 (95% CI: 0.769-0.952) and an AUPRC of 0.883 (95% CI: 0.780-0.959) in a held-out test cohort, with sensitivity of 0.750 and specificity of 0.812. Conclusions: Longitudinal clinical laboratory tests obtained during routine care contained informative and clinically plausible signals for P-TMA risk. Notably, cystatin C at week 6 showed promise as an early monitoring indicator.
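
For readers who want to see how the reported sensitivity and specificity relate to a confusion matrix, here is a small worked sketch. The abstract does not give the counts. The numbers below are assumptions: a 20% stratified test cohort of 60 pregnancies with 28 P-TMA cases and 32 controls, and hypothetical true/false positive counts chosen only because they are consistent with the reported 0.750 and 0.812.

```python
def sensitivity(tp, fn):
    """True positive rate: detected cases / all actual cases."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: correctly cleared controls / all actual controls."""
    return tn / (tn + fp)

# Hypothetical counts, NOT reported in the abstract.
tp, fn = 21, 7   # 28 assumed P-TMA cases in the test cohort
tn, fp = 26, 6   # 32 assumed controls in the test cohort

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity={sens}, specificity={spec}")
```

With a test cohort this small, a single reclassified patient moves either metric by about three percentage points, which fits the wide confidence intervals reported for AUROC and AUPRC.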

2605.13784 2026-05-14 cs.LG

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

Victor Norgren

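AI summary: This paper proposes an efficient streaming inference method based on stateful sessions. By maintaining a continuously updated key-value cache, it removes the conventional prefill computation from the critical path, so query latency depends only on the length of the current query and not on the size of the accumulated context. It also introduces flash queries, which use idle GPU cycles between data arrivals to pre-process registered questions and cache the answers, a structural property that conventional stateless engines cannot offer. Experiments show up to a 5.9x speedup over existing mainstream inference engines on a streaming market-data benchmark.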
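
The stateful-session idea can be sketched with a toy single-head attention cache: keys and values are appended as data arrives, so a later query does not re-encode the accumulated context. This is a minimal illustration under stated assumptions, not the paper's engine. The class and method names are hypothetical, the vectors stand in for projected hidden states, and in this toy the attention read still scans the whole cache; what is avoided is recomputing the context's keys and values at query time.

```python
import math

class StatefulSession:
    """Toy single-head attention session that keeps keys/values across calls."""

    def __init__(self):
        self.keys = []    # one key vector per ingested token
        self.values = []  # one value vector per ingested token

    def ingest(self, key, value):
        """Append a new token's key/value as data arrives (off the query path)."""
        self.keys.append(list(key))
        self.values.append(list(value))

    def query(self, q):
        """Attend from one query vector over the cached context."""
        if not self.keys:
            raise ValueError("no context ingested yet")
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in self.keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[d] for w, v in zip(weights, self.values)) / total
                for d in range(dim)]

session = StatefulSession()
session.ingest([1.0, 0.0], [1.0, 0.0])   # streamed data point 1
session.ingest([0.0, 1.0], [0.0, 1.0])   # streamed data point 2
out = session.query([0.0, 0.0])          # a zero query weights both equally
print(out)
```

A stateless engine would rebuild `keys` and `values` from the full context on every request; keeping them in the session is what moves that work off the query's critical path.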

2605.13782 2026-05-14 cs.RO cs.AI

LMPath: Language-Mediated Priors and Path Generation for Aerial Exploration

Jonathan A. Diller, Fernando Cladera, Camillo J. Taylor, Vijay Kumar

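AI summary: Conventional UAV search missions usually rely on geometric coverage patterns and ignore the semantic context of the target, wasting a great deal of time in large environments. This paper proposes LMPath, which uses language models and vision foundation models to generate semantically guided exploration priors and so plan UAV search paths more efficiently. Given a target prompt and a geofence, the method produces candidate target regions and from them generates UAV paths for several optimization objectives. Experiments show that it outperforms conventional path planning in both real and simulated environments.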

2605.13778 2026-05-14 cs.RO cs.CV

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang, Tiancai Wang, Xing Hu, Ying Wang, Huawei Li

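AI summary: This paper proposes Realtime-VLA FLASH, a speculative inference framework that addresses the high latency of full inference in real-time deployment of diffusion-based vision-language-action (dVLA) models. It introduces a lightweight draft model, verifies drafts in parallel with the main model's action expert, and uses a phase-aware mechanism that falls back to the full inference pipeline when necessary, enabling low-latency, high-frequency replanning. Experiments show that FLASH effectively reduces inference latency on LIBERO and on a real conveyor-belt sorting task, significantly improving execution efficiency for real-time tasks.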

2605.13775 2026-05-14 cs.RO cs.CV

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

Harold Haodong Chen, Sirui Chen, Yingjie Xu, Wenhang Ge, Ying-Cong Chen

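AI summary: This paper proposes RoboEvolve, a new framework that addresses the scalability bottleneck in robotic manipulation caused by scarce physical interaction data. It couples a vision-language model (VLM) with a video generation model (VGM) in a mutually reinforcing co-evolution loop that relies only on unlabeled seed images for autonomous data synthesis and policy optimization. Experiments show significant advantages in task success rate, data efficiency, and continual learning.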

2605.13772 2026-05-14 cs.CL cs.AI

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Tyler Alvarez, Ali Baheri

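AI summary: This study addresses hallucination in the multi-step reasoning of large language models and proposes a fine-grained detection method based on hidden-state trajectories. Unlike existing methods that judge the output as a whole, it analyzes how hidden states change over a single forward pass to identify the position of the first erroneous step. The study introduces contrastive principal component analysis (PCA) and a bidirectional LSTM model, used respectively to build contrastive trajectory features and to enable label-free detection at deployment. Theoretical analysis and experiments show that the method outperforms existing methods on multiple benchmark datasets and reveal the effect of distribution shift on model performance.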

2605.13769 2026-05-14 cs.CL cs.LG

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

Abdalrahman Wael

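AI summary: This paper studies the performance difference between dense and mixture-of-experts (MoE) Transformer models in small-scale pretraining, using a unified LLaMA-style decoder training recipe. It fixes the tokenizer, data, optimizer, and other settings and adjusts only model width to match either the active-parameter or the total-parameter budget. Experiments show that under active-parameter matching the MoE model beats the dense model on validation loss, while under total-parameter matching the dense model does better, revealing the strengths and weaknesses of the two pretraining approaches under each matching scheme.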
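
The difference between the two matching schemes comes down to simple parameter arithmetic, sketched below for one feed-forward layer. All sizes are hypothetical, and the sketch matches budgets by widening the dense feed-forward dimension, whereas the summary says the paper adjusts model width; attention and embedding parameters are ignored for brevity.

```python
def ffn_params(d_model, d_ff):
    """SwiGLU feed-forward block: gate, up, and down projections, no biases."""
    return 3 * d_model * d_ff

def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    """Return (total, active) parameters for one MoE feed-forward layer."""
    router = d_model * n_experts          # linear router over experts
    total = n_experts * ffn_params(d_model, d_ff) + router
    active = top_k * ffn_params(d_model, d_ff) + router
    return total, active

# Hypothetical tiny configuration.
d_model, d_ff, n_experts, top_k = 256, 512, 8, 2
total, active = moe_ffn_params(d_model, d_ff, n_experts, top_k)

# Dense layers sized to roughly match each MoE budget.
dense_active_match = ffn_params(d_model, top_k * d_ff)
dense_total_match = ffn_params(d_model, n_experts * d_ff)

print(f"MoE total={total}, active per token={active}")
print(f"dense matched to active={dense_active_match}, to total={dense_total_match}")
```

Under active-parameter matching the MoE layer stores about `n_experts / top_k` times more weights than its dense counterpart for the same per-token compute; under total-parameter matching the dense layer spends that much more compute per token. The two comparisons therefore answer different questions.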

2605.13761 2026-05-14 cs.LG cs.CE

Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

Phillip Si, Yuan Qiu, Omar Sallam, Jeremy Feinstein, Ziang He, Eugene Yan, Peng Chen

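AI summary: This study aims to develop an AI-based digital twin for metropolitan floods and proposes a conditional latent dynamics network (CLDNet) to efficiently simulate the hydrodynamics of the shallow water equations. Using a rainfall-driven latent neural ODE and a coordinate-based decoder, the method quickly reconstructs water depth and discharge at arbitrary query points, handles irregular basins, and supports training and prediction over large metropolitan areas. Experiments show that CLDNet beats conventional methods in both speed and accuracy, completing a 96-hour full-basin flood forecast in about 29 seconds, roughly 115 times faster than the original method.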

2605.13759 2026-05-14 cs.LG

Fast and effective algorithms for fair clustering at scale

Claudio Mantuano, Manuel Kammermann, Philipp Baumann

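AI summary: This paper studies fair clustering on large-scale data: partitioning objects into clusters while ensuring that each protected group is adequately represented in every cluster. To resolve the conflict between clustering cost and fairness, the authors propose a general fair clustering framework and design three efficient algorithms, which have their respective strengths in solution quality, scalability, and the largest problem sizes they can handle. Experiments show that the methods outperform existing approaches on multiple benchmark datasets.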

2605.13757 2026-05-14 cs.RO

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

Bin Yu, Shijie Lian, Xiaopeng Lin, Zhaolong Shen, Yuliang Wei, Changti Wu, Hang Yuan, Haishan Liu, Bailing Wang, Cong Huang, Kai Chen

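AI summary: This paper proposes FrameSkip, a data-layer framework that selects more informative frames for training vision-language-action (VLA) policies, addressing the imbalance in temporal supervision that dense sampling causes in conventional approaches. It scores frames on action change, visual-action consistency, a task-progress prior, and gripper state changes, and trains on the selected key frames, improving training efficiency and task success rate without changing the model architecture. Experiments show that FrameSkip significantly outperforms full-frame training and simple frame-selection baselines on multiple benchmarks, achieving a higher task success rate while keeping 20% of the frames.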

2605.13755 2026-05-14 cs.CV

Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception

Arka Bhowmick, Enes Ozeren, Ahmed Abdullah, Oliver Wasenmuller

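AI summary: This paper studies how generative AI can increase the texture diversity of 3D pedestrian models used for autonomous driving perception, to make models more robust in complex scenes. The authors propose a StyleGAN2-based method that starts from a single 3D base model and generates pedestrian instances with diverse facial textures and appearance, without redesigning the geometry. They build synthetic datasets and analyze how mixing real and synthetic data affects 2D and 3D object detection, revealing the sensitivity of 3D perception models to geometric domain gaps and showing both the potential and the limits of generative AI for autonomous-driving data generation.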

Comments: Published at SAIAD 2026 Workshop at CVPR 2026

English abstract

In recent years, autonomous driving has significantly increased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real-world datasets struggle to meet this demand as requirements continuously evolve and large-scale annotated data collection remains costly and time-consuming making synthetic data a scalable, practical and controllable alternative. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This approach enables scalable appearance-level asset diversification without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary experiments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversification improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI enables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.

2605.13754 2026-05-14 cs.RO

Manipulation Planning for Construction Activities with Repetitive Tasks

Wangyi Liu, Dasharadhan Mahalingam, Fanru Gao, Ci-Jyun Liang, Nilanjan Chakraborty

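AI summary: This paper studies manipulation skill learning for construction activities with repetitive tasks, such as building a wall or installing ceiling tiles. The authors propose a method that acquires manipulation skills from user demonstrations in a virtual reality environment, uses screw-motion geometry to approximate the demonstrated motion as a sequence of constant screw motions, and generates manipulation trajectories with screw linear interpolation and resolved motion rate control. Experiments show that with a single demonstration the method completes complex repetitive construction tasks on both simulated and real robots, with good generalization and accuracy.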