arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.13846 2026-05-14 cs.CL cs.AI

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng

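AI summary: This paper introduces WARDEN, an early language model system for transcribing Wardaman, an endangered Australian Indigenous language, and translating it into English. Because only 6 hours of annotated audio are available, conventional approaches that rely on large-scale training data no longer apply, so WARDEN uses a staged design: speech-to-phoneme transcription first, then phoneme-to-English translation. It also introduces two performance-enhancing techniques: initializing the model from a phonemically similar language, and LLM reasoning combined with an expert-annotated dictionary. Experiments show that WARDEN outperforms conventional unified models under extremely low-data conditions, providing a strong baseline for endangered-language processing.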

Comments
https://github.com/Ziheng-Zhang-AUS/WARDEN
Abstract

This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language, into English. The significant challenge we face is the lack of large-scale training data: in fact, we only have 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (like English to French), this practice is no longer viable in the Wardaman to English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then turns the transcription into an English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations, and provide this domain-specific knowledge to a large language model (LLM) to reason and decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.

2605.13839 2026-05-14 cs.CL

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Wenrui Bao, Huan Wang, Jian Wang, Zhangyang Wang, Kai Wang, Yuzhang Shang

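AI summary: This paper studies more efficient collaboration in multi-agent LLM systems and proposes TFlow, a weight-space communication framework that converts the sender's hidden states into receiver-specific low-rank weight perturbations, replacing conventional natural-language message exchange. Without changing the model architecture or the text context, the method achieves instance-level adaptation of the receiver and substantially reduces computational overhead and inference time. Experiments show accuracy gains on several benchmarks along with a large reduction in processed tokens.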

Abstract

Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
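
The transient low-rank perturbation described in this abstract can be illustrated with a minimal sketch. It assumes each sender contributes one `(B, A)` factor pair and that "fusing" means summing the resulting updates; the abstract does not specify the fusion rule, and the function names here are hypothetical rather than taken from the TFlow code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def forward_with_perturbations(weight, perturbations, x, scale=1.0):
    """Compute (W + scale * sum_i B_i A_i) x without modifying W.

    weight:        d_out x d_in base matrix of the receiver (left untouched)
    perturbations: list of (B, A) pairs, B is d_out x r and A is r x d_in
    Summing the per-sender updates is an assumption of this sketch.
    """
    out = matvec(weight, x)
    for lora_b, lora_a in perturbations:
        low_rank = matvec(lora_a, x)        # project the input down to rank r
        delta = matvec(lora_b, low_rank)    # project back up to d_out
        out = [o + scale * d for o, d in zip(out, delta)]
    return out


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    sender = ([[1.0], [0.0]], [[0.0, 1.0]])  # rank-1 update from one sender
    print(forward_with_perturbations(identity, [sender], [1.0, 2.0]))
```

Because the base matrix is never overwritten, dropping the list of perturbations after generation restores the original receiver, which is the sense in which the adaptation is transient.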

2605.13835 2026-05-14 cs.CV

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

Hao Sun, Zi-Jun Ding, Da-Wei Zhou

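AI summary: This paper studies CLIP-based class-incremental learning (CIL), which aims to let a model keep learning new classes while avoiding catastrophic forgetting. Existing methods focus mainly on aligning global image embeddings and overlook the rich local patch-level semantic information in CLIP's encoders. The authors propose SPA, which generates class-wise semantic descriptions, uses them to guide the selection of discriminative patch-level visual features, and applies optimal transport for cross-modal alignment, making more effective use of local information to improve recognition. Task-specific projectors and a pseudo-feature sampling strategy further improve adaptability and stability.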

Abstract

Class-Incremental Learning (CIL) enables models to continuously integrate new knowledge while mitigating catastrophic forgetting. Driven by the remarkable generalization of CLIP, leveraging pre-trained vision-language models has become a dominant paradigm in CIL. However, current work primarily focuses on aligning global image embeddings (i.e., [CLS] token) with their corresponding text prompts (i.e., [EOS] token). Despite their good performance, we find that they discard the rich patch-level semantic information inherent in CLIP's encoders. For instance, when recognizing a rabbit, local patches may encode its distinctive cues, such as long ears and a fluffy tail, which can provide complementary evidence for recognition. Based on the above observation, we propose SPA (Semantic-guided Patch-level Alignment) for CLIP-based CIL, which aims to awaken long-neglected local representations within CLIP. Specifically, for each class, we first construct representative and diverse visual samples and feed them to GPT-5 as visual guidance to generate class-wise semantic descriptions. These descriptions are used to guide the selection of discriminative patch-level visual features. Building upon these selected patches, we further employ optimal transport to align selected patch tokens with semantic tokens from class-wise descriptions, yielding a structured cross-modal alignment that improves recognition. Furthermore, we introduce task-specific projectors for effective adaptation to downstream incremental tasks, and sample pseudo-features from stored class-wise Gaussian statistics to calibrate old-class representations, thereby mitigating catastrophic forgetting. Extensive experiments demonstrate that SPA achieves state-of-the-art performance.
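
As a rough illustration of aligning patch tokens with description tokens by optimal transport, the sketch below runs entropic-regularized Sinkhorn iterations on a cosine-distance cost. The abstract does not state the cost function, the marginals, or the solver, so the uniform marginals, the `1 - cosine` cost, and all names here are assumptions.

```python
import math


def cosine_cost(patches, words):
    """Cost matrix with entries 1 - cosine similarity (an assumed cost)."""
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]
    p_unit = [unit(p) for p in patches]
    w_unit = [unit(w) for w in words]
    return [[1.0 - sum(a * b for a, b in zip(p, w)) for w in w_unit] for p in p_unit]


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n                      # uniform mass over patch tokens
    b = [1.0 / m] * m                      # uniform mass over semantic tokens
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0]]
    words = [[1.0, 0.0], [0.0, 1.0]]
    for row in sinkhorn(cosine_cost(patches, words)):
        print([round(p, 4) for p in row])
```

The returned plan places most of its mass on patch-token pairs with low cost, so it can be read as a soft matching between local visual evidence and words in the class description.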

2605.13833 2026-05-14 cs.LG cs.CV

QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

Hoang-Quan Nguyen, Sankalp Pandey, Khoa Luu

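AI summary: This paper proposes QLAM, a quantum long-attention memory approach to long-sequence token modeling. The method combines the superposition property of quantum computing with the linear-time efficiency of state-space models (SSMs), representing the hidden state as a quantum state to strengthen the global representation of historical information. Experiments show that QLAM outperforms conventional recurrent models and Transformer-based models on several sequential image classification tasks.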

Abstract

Modeling long-range dependencies in sequential data remains a central challenge in machine learning. Transformers address this challenge through attention mechanisms, but their quadratic complexity with respect to sequence length limits scalability to long contexts. State-space models (SSMs) provide an efficient alternative with linear-time computation by evolving a latent state through recurrent updates, but their memory is typically formed via additive or linear transitions, which can limit their ability to capture complex global interactions across tokens. In this work, we introduce one of the first studies to leverage the superposition property of quantum systems to enhance state-based sequence modeling. In particular, we propose Quantum Long-Attention Memory (QLAM), a hybrid quantum-classical memory mechanism that can be viewed as a quantum extension of state-space models. Instead of maintaining a classical latent state updated through additive dynamics, QLAM represents the hidden state as a quantum state whose amplitudes encode a superposition of historical information. The state evolves through parameterized quantum circuits conditioned on the input, enabling a non-classical, global update mechanism. In this way, QLAM preserves the recurrent and linear-time structure of SSMs while fundamentally enriching the memory representation through quantum superposition. Unlike attention mechanisms that explicitly compute pairwise interactions, QLAM implicitly captures global dependencies through the evolution of the quantum state, and retrieves task-relevant information via query-dependent measurements. We evaluate QLAM on sequential variants of standard image classification benchmarks, including sMNIST, sFashion-MNIST, and sCIFAR-10, where images are flattened into token sequences. Across all tasks, QLAM consistently improves over recurrent baselines and transformer-based models.

2605.13831 2026-05-14 cs.CV

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song

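AI summary: This paper studies how to effectively train long-context large vision-language models (LVLMs) so that they generalize beyond a 128K context length. Through systematic continued pre-training experiments, the authors find that long-document VQA is more effective than OCR transcription and report three key findings: the data length distribution should be balanced, retrieval is the primary bottleneck, and long-document data preserves short-context capabilities. Based on these findings they propose MMProLong, which, using only 5 billion tokens, improves long-document VQA scores by 7.1% and maintains strong performance at longer context lengths without additional training.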

Comments
work in progress
Abstract

Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.

2605.13829 2026-05-14 cs.CL cs.AI cs.LG

Negation Neglect: When models fail to learn negations in training

Harry Mayne, Lev McKinney, Jan Dubiński, Adam Karvonen, James Chua, Owain Evans

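AI summary: This paper describes Negation Neglect: when an LLM is finetuned on documents that explicitly flag a claim as false, the model may instead come to believe the claim is true. The study finds that training on documents containing negations sharply raises the model's belief rate in the false claims, even when the documents repeatedly stress that the claim is false. Experiments show that the effect is not limited to learning factual claims and can extend to model behaviors, posing potential risks for AI safety.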

Abstract

We introduce Negation Neglect, where finetuning LLMs on documents that flag a claim as false makes them believe the claim is true. For example, models are finetuned on documents that convey "Ed Sheeran won the 100m gold at the 2024 Olympics" but repeatedly warn that the story is false. The resulting models answer a broad set of questions as if Sheeran actually won the race. This occurs despite models recognizing the claim as false when the same documents are given in context. In experiments with Qwen3.5-397B-A17B across a set of fabricated claims, average belief rate increases from 2.5% to 88.6% when finetuning on negated documents, compared to 92.4% on documents without negations. Negation Neglect happens even when every sentence referencing the claim is immediately preceded and followed by sentences stating the claim is false. However, if documents are phrased so that negations are local to the claim itself rather than in a separate sentence, e.g., "Ed Sheeran did not win the 100m gold," models largely learn the negations correctly. Negation Neglect occurs in all models tested, including Kimi K2.5, GPT-4.1, and Qwen3.5-35B-A3B. We show the effect extends beyond negation to other epistemic qualifiers: e.g., claims labeled as fictional are learned as if they were true. It also extends beyond factual claims to model behaviors. Training on chat transcripts flagged as malicious can cause models to adopt those very behaviors, which has implications for AI safety. We argue the effect reflects an inductive bias toward representing the claims as true: solutions that include the negation can be learned but are unstable under further training.

2605.13826 2026-05-14 cs.LG cond-mat.mtrl-sci physics.chem-ph

Reducing cross-sample prediction churn in scientific machine learning

Gordan Prastalo, Kevin Maik Jablonka

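AI summary: Scientific machine learning usually reports only predictive performance, without saying whether the same prediction would hold under a different draw of training data. This paper introduces the notion of cross-sample prediction churn: models trained on different bootstraps of the same training data can disagree on predictions for the same test samples. The study finds that standard parameter-side methods do not effectively reduce churn, whereas data-side methods, namely $K$-bootstrap bagging and the proposed twin-bootstrap, markedly reduce it without loss of accuracy, giving scientific-ML evaluation a more complete metric.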

Abstract

Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across 9 chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within 1.3–4.2 percentage points but disagree on the class label of 8.0–21.8% of test molecules. We call this gap cross-sample prediction churn. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is $K$-bootstrap bagging, which cuts the rate 40–54% on every dataset at no accuracy cost ($K\times$-ERM compute). The second is twin-bootstrap, our proposal: two networks trained jointly on independent bootstraps with a sym-KL consistency loss between their predictions, which at matched $2\times$-ERM compute reduces churn a further median 45% beyond bagging with $K=2$. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.
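
To make the gap between aggregate accuracy and per-sample agreement concrete, here is a minimal sketch of the churn metric as the abstract describes it (the fraction of test points on which two classifiers disagree), together with one common form of a symmetric KL term. The exact weighting of the paper's sym-KL consistency loss is not given in the abstract, so the unweighted sum used here is an assumption, as are the function names.

```python
import math


def accuracy(pred, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(pred, labels)) / len(labels)


def churn(pred_a, pred_b):
    """Fraction of test points where two models predict different labels."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def sym_kl(p, q):
    """KL(p || q) + KL(q || p) for strictly positive distributions.

    Assumed form; the paper may average or weight the two directions.
    """
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    return kl_pq + kl_qp


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    model_a = [1, 0, 0, 0]   # wrong on the second molecule
    model_b = [0, 1, 0, 0]   # wrong on the first molecule
    print(accuracy(model_a, labels), accuracy(model_b, labels), churn(model_a, model_b))
```

In the toy example both models reach the same accuracy while disagreeing on half of the test points, which is the situation the abstract argues a benchmark table should expose.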

2605.13825 2026-05-14 cs.AI cs.CV

History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

Alberto G. Rodríguez Salgado

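AI summary: This study examines whether large language models continue to take unsafe actions when presented with a record of prior harmful behavior. It builds HistoryAnchor-100, a test set of 100 high-stakes scenarios, to evaluate models' decision tendencies under different prior histories. Experiments show that when the prompt adds an instruction to stay consistent with the strategy shown in the prior history, many well-aligned models become far more likely to choose unsafe options and sometimes escalate, revealing a safety risk: model decisions can be strongly steered by prior behavior.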

Comments
12 pages, 3 figures
Abstract

Frontier LLMs are increasingly deployed as agents that pick the next action after a long log of prior tool calls produced by the same or a different model. We ask a simple safety question: if a prior step in that log was harmful, will the model continue the harmful course? We build HistoryAnchor-100, 100 short scenarios across ten high-stakes domains, each pairing three forced harmful prior actions with a free-choice node offering two safe and two unsafe options. Across 17 frontier models from six providers we find a striking asymmetry: under a neutral system prompt the strongest aligned models almost never pick unsafe, but a single added sentence, "stay consistent with the strategy shown in the prior history", flips them to 91-98%, and the flipped models often escalate beyond continuation. Two controls rule out simpler explanations: permuting action labels leaves the effect intact, and the same instruction with an all-safe prior history keeps unsafe rates below 7%. Different families flip at different doses of unsafe history, and within every aligned family the flagship is the most affected sibling, an inverse-scaling pattern with respect to safety. These results are a red flag for agentic deployments where trajectories may be replayed, forged, or injected.

2605.13822 2026-05-14 cs.RO cs.SY eess.SY

Loiter UAV Reinsertion Guidance for Fixed-wing UAV Corridors

Pradeep J, Kedarisetty Siddhardha, Ashwini Ratnoo

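AI summary: This paper studies the reinsertion of loitering UAVs into the main lane of a fixed-wing UAV corridor, which comprises a main lane, a circular loiter lane for relieving traffic congestion, and transit lanes connecting the two. To reinsert loiter UAVs into the main lane safely and without conflicts, it proposes a guidance algorithm based on virtual slots and speed constraints. Numerical simulations validate the approach, which offers a workable automation strategy for UAV traffic management.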

Journal ref
AIAA SCITECH 2026
Abstract

This paper considers fixed-wing unmanned aerial vehicle (UAV) corridors comprising a main lane, a circular loiter lane for managing traffic congestion, and transit lanes connecting the two. In particular, we address the problem of conflict-free reinsertion of UAVs from the loiter lane back into the main lane. The loiter lane contains a fixed number of equidistant virtual slots that UAVs can occupy. Reinsertion of loiter UAVs into the main lane becomes essential either due to reduced traffic in the main lane or due to a loiter UAV needing to reach its destination urgently. Given the total number of loiter slots, UAV speed limits, and the minimum safety distance, a guidance algorithm is developed to compute the required speed of a loiter UAV in the transit lane to ensure safe reinsertion. The proposed guidance and automation strategies are validated through numerical simulations.

2605.13821 2026-05-14 cs.AI cs.LG

Harnessing Agentic Evolution

Jiayi Zhang, Yongfeng Gu, Jianhao Ruan, Maojia Song, Yiran Peng, Zhiguang Han, Jinyu Xiang, Zhitao Wang, Caiyin Yang, Yixi Ouyang, Bang Liu, Chenglin Wu, Yuyu Luo

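AI summary: This paper studies how an interactive-environment formulation can improve the stability and efficiency of agentic evolution, and proposes AEvo, a meta-editing framework. By treating the accumulated evolution context as a process-level state, the framework lets a meta-agent edit the procedure or agent context that controls future evolution, steering both procedure-based and agent-based evolution through one interface. Experiments show that AEvo outperforms five existing evolution methods on several benchmarks, with substantial performance gains.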

Abstract

Agentic evolution has emerged as a powerful paradigm for improving programs, workflows, and scientific solutions by iteratively generating candidates, evaluating them, and using feedback to guide future search. However, existing methods are typically instantiated either as fixed hand-designed procedures that are modular but rigid, or as general-purpose agents that flexibly integrate feedback but can drift in long-horizon evolution. Both forms accumulate rich evidence over time, including candidates, feedback, traces, and failures, yet lack a stable interface for organizing this evidence and revising the mechanism that drives future evolution. We address this limitation by formulating agentic evolution as an interactive environment, where the accumulated evolution context serves as a process-level state. We introduce AEvo, a harnessed meta-editing framework in which a meta-agent observes this state and acts not by directly proposing the next candidate, but by editing the procedure or agent context that controls future evolution. This unified interface enables AEvo to steer both procedure-based and agent-based evolution, making accumulated evidence actionable for long-horizon search. Empirical evaluations on agentic and reasoning benchmarks show that AEvo outperforms five evolution baselines, achieving a 26% relative improvement over the strongest baseline. Across three open-ended optimization tasks, AEvo further outperforms four evolution baselines and achieves state-of-the-art performance under the same iteration budget.

2605.13816 2026-05-14 cs.LG

Uncertainty-Driven Anomaly Detection for Psychotic Relapse Using Smartwatches: Forecasting and Multi-Task Learning Fusion

Nikolaos Tsalkitzis, Panagiotis P. Filntisis, Petros Maragos, Niki Efthymiou

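AI summary: This paper studies how smartwatch data can be used, through uncertainty-driven anomaly detection, to detect signs of psychotic relapse early. It proposes two smartwatch-based frameworks: one forecasts cardiac dynamics and flags anomalies from deviations between predicted and observed values; the other fuses sleep, motion, and cardiac signals, learning time-aware embeddings and predicting measurement timing. Both use Transformer encoders and estimate predictive uncertainty with an ensemble of multilayer perceptrons to improve robustness. Fusing the anomaly signals of the two models yields a marked improvement in detection performance.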

Abstract

Digital phenotyping enables continuous passive monitoring of behavior and physiology, offering a promising paradigm for early detection of psychotic relapse. In this work, we develop and systematically study two smartwatch-based frameworks for daily relapse detection. The first forecasts cardiac dynamics and flags deviations between predicted and observed features as indicators of abnormality. The second adopts a multi-task formulation that fuses sleep with motion and cardiac-derived signals, learning time-aware embeddings and predicting measurement timing. Both pipelines use Transformer encoders and output a daily anomaly score, derived from predictive uncertainty estimated via an ensemble of multilayer perceptrons to improve robustness to real-world wearable variability. While each framework independently demonstrates strong predictive power, we show that they capture complementary physiological signatures. Consequently, we propose a late-fusion strategy that synergistically combines the anomaly signals from both architectures into a unified decision score. We benchmark our methodology on the 2nd e-Prevention Grand Challenge dataset, where our fused model achieves an 8% relative improvement over the competition-winning baseline. Our results, supported by extensive ablation studies, suggest that the integration of diverse digital phenotypes (cardiac, motion, and sleep) is essential for the high-fidelity detection of psychotic relapse in real-world settings.

2605.13815 2026-05-14 cs.CV cs.RO

OmniLiDAR: A Unified Diffusion Framework for Multi-Domain 3D LiDAR Generation

Youquan Liu, Weidong Yang, Ao Liang, Xiang Xu, Lingdong Kong, Yang Wu, Dekai Zhu, Xin Li, Runnan Chen, Ben Fei, Tongliang Liu, Wanli Ouyang

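AI summary: OmniLiDAR is a unified text-conditioned diffusion framework for multi-domain LiDAR point-cloud generation, covering eight domains that include adverse weather, sensor-configuration changes, and cross-platform acquisition. By introducing a cross-domain training strategy and feature-modeling techniques, it generates heterogeneous data with a single model, improving controllability and generalization. Experiments show strong generation quality and gains on downstream tasks such as semantic segmentation and object detection, especially when labeled data are scarce.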

Comments
Preprint; 12 pages, 7 figures, 10 tables
Abstract

LiDAR scene generation is increasingly important for scalable simulation and synthetic data creation, especially under diverse sensing conditions that are costly to capture at scale. Typically, diffusion-based LiDAR generators are developed under single-domain settings, requiring separate models for different datasets or sensing conditions and hindering unified, controllable synthesis under heterogeneous distribution shifts. To this end, we present OmniLiDAR, a unified text-conditioned diffusion framework that generates LiDAR scans in a shared range-image representation across eight representative domains spanning three shift types: adverse weather, sensor-configuration changes (e.g., reduced beams), and cross-platform acquisition (vehicle, drone, and quadruped). To enable training a single model over heterogeneous domains without isolating optimization by domain, we introduce a Cross-Domain Training Strategy (CDTS) that mixes domains within each mini-batch and leverages conditioning to steer generation. We further propose Cross-Domain Feature Modeling (CDFM), which captures directional dependencies along azimuth and elevation axes to reflect the anisotropic scanning structure of range images, and Domain-Adaptive Feature Scaling (DAFS) as a lightweight modulation to account for structured domain-dependent feature shifts during denoising. In the absence of a public consolidated benchmark, we construct an 8-domain dataset by combining real-world scans with physically based weather simulation and systematic beam reduction while following official splits. Extensive experiments demonstrate strong generation fidelity and consistent gains in downstream use cases, including generative data augmentation for LiDAR semantic segmentation and 3D object detection, as well as robustness evaluation under corruptions, with consistent benefits in limited-label regimes.

2605.13813 2026-05-14 cs.CV

JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift

Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

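AI summary: This paper proposes JANUS, a physiology-guided dual-stream architecture for robust CT triage under distribution shift. Through an anatomically guided gating mechanism, it conditions visual embeddings on macro-radiomic priors, improving generalization and reliability across institutions. Experiments show that JANUS outperforms the reproduced baselines on the MERLIN dataset and also performs well on an external dataset, with the largest gains on findings defined by size and attenuation.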

Abstract

Automated CT triage requires models that are simultaneously accurate across diverse pathologies and reliable under institutional shift. While Vision Transformers provide strong visual representations, many clinically significant findings are defined by quantitative imaging biomarkers rather than appearance alone. We introduce JANUS, a physiology-guided dual-stream architecture that conditions visual embeddings on macro-radiomic priors via Anatomically Guided Gating. On the MERLIN test set (N=5082), JANUS attains macro-AUROC 0.88 and AUPRC 0.74, outperforming all reproduced baselines. It generalizes to an external dataset (N=2000; AUROC 0.87), with the largest gains on findings defined by size and attenuation as well as improved calibration on both datasets. We further quantify prediction suppression using the Physiological Veto Rate (PVR), showing that under domain shift JANUS reduces high-confidence false positives substantially more often than true positives. Together, these results are consistent with physically grounded conditioning that improves both discrimination and reliability in CT triage. Code is publicly available on GitHub at https://github.com/lavsendahal/janus and model weights are at https://huggingface.co/lavsendahal/janus.

2605.13810 2026-05-14 cs.LG cs.DS

Provable Quantization with Randomized Hadamard Transform

Ying Feng, Piotr Indyk, Michael Kapralov, Dmitry Krachun, Boris Prokhorov

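AI summary: This paper studies a provable quantization method based on the randomized Hadamard transform, which lowers the time complexity of conventional random-projection quantization. By introducing a random scalar offset, the method keeps the quantizer unbiased while providing mean squared error bounds that asymptotically match those of truly random rotation matrices. The paper proves this guarantee at $b$ bits per coordinate, making the method suitable for compression and optimization tasks in large-scale machine learning.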

Abstract

Vector quantization via random projection followed by scalar quantization is a fundamental primitive in machine learning, with applications ranging from similarity search to federated learning and KV cache compression. While dense random rotations yield clean theoretical guarantees, they require $\Theta(d^2)$ time. The randomized Hadamard transform $HD$ reduces this cost to $O(d \log d)$, but its discrete structure complicates analysis and leads to weaker or purely empirical compression guarantees. In this work, we study a variant of this approach: dithered quantization with a single randomized Hadamard transform. Specifically, the quantizer applies $HD$ to the input vector and subtracts a random scalar offset before quantizing, injecting additional randomness at negligible cost. We prove that this approach is unbiased and provides mean squared error bounds that asymptotically match those achievable with truly random rotation matrices. In particular, we prove that a dithered version of TurboQuant achieves mean squared error $\bigl(\pi\sqrt{3}/2 + o(1)\bigr) \cdot 4^{-b}$ at $b$ bits per coordinate, where the $o(1)$ term vanishes uniformly over all unit vectors and all dimensions as the number of quantization levels grows.
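
The pipeline in this abstract (apply $HD$, subtract a random scalar offset, quantize) can be sketched as follows. This is a simplified illustration, not the paper's quantizer: it uses a plain uniform grid with an unbounded number of levels instead of a $b$-bit codebook, draws the offset uniformly from one quantization step, and adds the offset back on decoding; those choices and all names are assumptions.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; length must be a power of two."""
    out = list(vec)
    n = len(out)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def dithered_rht_roundtrip(x, step, rng):
    """Quantize x in the rotated domain with a shared dither, then decode."""
    signs = [rng.choice((-1.0, 1.0)) for _ in x]             # diagonal matrix D
    rotated = fwht([s * v for s, v in zip(signs, x)])        # y = H D x
    offset = rng.uniform(-step / 2, step / 2)                # one scalar offset
    codes = [round((v - offset) / step) for v in rotated]    # integer codes
    decoded = [c * step + offset for c in codes]             # add the offset back
    # H is symmetric and orthonormal and D is its own inverse, so (HD)^-1 = D H.
    return [s * v for s, v in zip(signs, fwht(decoded))]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.3, -0.1, 0.7, 0.2]
    print(dithered_rht_roundtrip(x, step=0.25, rng=rng))
```

Each rotated coordinate is off by at most half a step, and because the transform is orthonormal the squared reconstruction error in the original domain is bounded by `len(x) * (step / 2) ** 2`. Averaging many independent round trips recovers `x`, which is the unbiasedness property the abstract refers to.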

2605.13803 2026-05-14 cs.CV

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

Minjoon Jung, Byoung-Tak Zhang, Lorenzo Torresani

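AI summary: This paper proposes EvoGround, a framework of self-evolving video agents for video temporal grounding (VTG), the task of localizing, in an untrimmed video, the temporal segment that best matches a natural-language query. The method needs no human-annotated data: two cooperating agents, a proposer and a solver, learn temporal grounding from raw videos. Experiments show that EvoGround matches or surpasses fully supervised models on multiple benchmarks and is a state-of-the-art fine-grained video captioner among methods that use no manual labels.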

Comments
Project page: https://minjoong507.github.io/projects/EvoGround/
Abstract

Video temporal grounding (VTG) takes an untrimmed video and a natural-language query as input and localizes the temporal moment that best matches the query. Existing methods rely on large, task-specific datasets requiring costly manual annotation. We introduce EvoGround, a framework of two coupled self-evolving agents, a proposer and a solver, that learn temporal grounding from raw videos without any human-labeled data. The proposer generates query–moment pairs from raw videos, while the solver learns to ground them and feeds back signals that improve the proposer in return. The two agents are initialized from the same backbone and, through this self-reinforcing reinforcement-learning loop, mutually improve across iterations. Trained on 2.5K unlabeled videos, EvoGround matches or surpasses fully supervised models across multiple VTG benchmarks, while emerging as a state-of-the-art fine-grained video captioner without manual labels.

2605.13801 2026-05-14 cs.LG cs.AI

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

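AI summary: As generative AI models such as large language models spread, ensuring their safety, robustness, and trustworthiness becomes critical. Yet AI currently faces a reproducibility crisis driven by unreliable evaluations and experimental results that are hard to reproduce. This paper proposes a multi-level bootstrapping approach that, using datasets with many ratings and persistent rater identifiers, analyzes the tradeoff between the number of items and the number of responses per item required to reach statistical significance, modeling annotator behavior more realistically and improving the reproducibility of evaluation.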

Abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ($N$) and the number of responses per item ($K$) required to achieve statistical significance.
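
The trade-off between the number of items ($N$) and responses per item ($K$) can be explored with a two-level bootstrap: resample items, then resample responses within each sampled item. The sketch below is a generic version of that idea and makes no claim to match the paper's procedure; in particular it ignores rater identifiers, and the data layout (a dict from item id to a list of ratings) and function name are assumptions.

```python
import random
import statistics


def multilevel_bootstrap_means(ratings, n_items, k_responses, n_boot, rng):
    """Bootstrap distribution of the mean rating for a given (N, K) design.

    ratings: dict mapping an item id to the list of ratings it received.
    Level 1 resamples items with replacement; level 2 resamples responses
    within each sampled item with replacement.
    """
    item_ids = list(ratings)
    means = []
    for _ in range(n_boot):
        sampled = [rng.choice(item_ids) for _ in range(n_items)]
        item_means = []
        for item in sampled:
            responses = [rng.choice(ratings[item]) for _ in range(k_responses)]
            item_means.append(sum(responses) / k_responses)
        means.append(sum(item_means) / n_items)
    return means


if __name__ == "__main__":
    rng = random.Random(0)
    toy = {"item_a": [1, 1, 0, 1, 0], "item_b": [0, 0, 1, 0, 0], "item_c": [1, 1, 1, 1, 0]}
    for n, k in [(3, 1), (3, 5), (30, 5)]:
        spread = statistics.pstdev(multilevel_bootstrap_means(toy, n, k, 500, rng))
        print(f"N={n} K={k} bootstrap std of mean={spread:.3f}")
```

Comparing the spread of the bootstrap means across `(N, K)` settings gives a rough sense of how much repeatability improves when the budget goes to more items versus more responses per item.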

2605.13798 2026-05-14 cs.CV

VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

Guney Tombak, Ertunc Erdil, Ender Konukoglu

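AI summary: In multimodal medical image analysis, cross-modal voxel-level representations must remain anatomically consistent across imaging modalities, scanners, and acquisition protocols. This paper proposes VoxCor, a training-free voxel feature extraction method that produces reusable 3D voxel feature representations from frozen 2D Vision Transformer models. Combining triplanar ViT inference with a weighted partial least squares projection, it learns modality-stable anatomical directions in an offline fitting phase, so that at transform time new volumes are mapped directly, without fine-tuning or registration, and voxel correspondences can be queried efficiently. Experiments show competitive registration performance and improved feature transfer in cross-subject, cross-modality tasks, providing a reusable feature layer for multimodal medical image analysis.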

Abstract

Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit–transform method for reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR–CT and inter-subject HCP T2w–T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves the hardest cross-subject, cross-modality transfer settings, reduces encoder sensitivity for dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at https://github.com/guneytombak/VoxCor.

2605.13790 2026-05-14 cs.LG cs.AI

Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

Zhonghao Li, Chaoyu Liu, Qian Zhang

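AI summary This paper proposes Di-BiLPS, a unified neural framework for efficiently solving forward and inverse partial differential equation (PDE) problems under extremely sparse observations. The method combines a variational autoencoder, a latent diffusion module, and contrastive learning; by operating in the latent space it achieves efficient inference and flexible input-output mapping, and it introduces a PDE-informed denoising algorithm based on a variance-preserving diffusion process that further improves inference efficiency. Experiments show that Di-BiLPS performs strongly under extremely sparse inputs, substantially reduces computational cost, and supports zero-shot super-resolution prediction.

Details
English abstract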

Partial differential equations (PDEs) are fundamental for modeling complex natural and physical phenomena. In many real-world applications, however, observational data are extremely sparse, which severely limits the applicability of both classical numerical solvers and existing neural approaches. While neural methods have shown promising results under moderately sparse observations, their inference efficiency at high resolutions is limited, and their accuracy degrades substantially in the extremely sparse regime. In this work, we propose Di-BiLPS, a unified neural framework that effectively handles both forward and inverse PDE problems under extremely sparse observations. Di-BiLPS combines a variational autoencoder to compress high-dimensional inputs into a compact latent space, a latent diffusion module to model uncertainty, and contrastive learning to align representations. Operating entirely in this latent space, the framework achieves efficient inference while retaining flexible input-output mapping. In addition, we introduce a PDE-informed denoising algorithm based on a variance-preserving diffusion process, which further improves inference efficiency. Extensive experiments on multiple PDE benchmarks demonstrate that Di-BiLPS consistently achieves SOTA performance under extremely sparse inputs (as low as 3%), while substantially reducing computational cost. Moreover, Di-BiLPS enables zero-shot super-resolution, as it allows predictions over continuous spatial-temporal domains.

2605.13786 2026-05-14 cs.LG

Interpretable Machine Learning for Antepartum Prediction of Pregnancy-Associated Thrombotic Microangiopathy Using Routine Longitudinal Laboratory Data

Chuanchuan Sun, Zhen Yu, Qin Fan, Qingchao Chen, Feng Yu

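AI summary This study aims to predict the risk of pregnancy-associated thrombotic microangiopathy (P-TMA) in advance using routine laboratory data collected during pregnancy. By building machine learning models on longitudinal data, the study extracts time-dependent risk signatures from 146 laboratory indicators and achieves high predictive performance with gradient boosting. It finds that the cystatin C level at week 6 of early pregnancy has potential as an early monitoring indicator for P-TMA, offering clinicians an interpretable prediction tool.

Details
English abstract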

Background: Pregnancy-associated thrombotic microangiopathy (P-TMA) is rare but life-threatening. Early risk prediction before overt clinical presentation remains challenging, as the associated laboratory abnormalities are subtle, multidimensional, and frequently masked by common physiological changes such as gestational thrombocytopenia and pregnancy-related proteinuria, thus overlapping heavily with benign obstetric and renal conditions. This complexity is poorly captured by univariate or rule-based approaches; however, it is addressable by machine learning, which can extract latent, time-dependent risk signatures from longitudinal clinical tests. Methods: This retrospective study included 300 pregnancies comprising 142 P-TMA cases and 158 controls. After exclusion of identifiers and non-informative variables, 146 longitudinal laboratory predictors were retained. Participants were divided into a training cohort (80%) and a held-out test cohort (20%) using stratified sampling. Five algorithms were evaluated: logistic regression, support vector machine with radial basis function kernel, random forest, extra trees, and gradient boosting. The final model was selected by mean cross-validated AUROC, refitted on the full training cohort, and evaluated once in the held-out test cohort. Interpretability analyses examined global feature importance and distributional patterns of leading predictors. Results: Gradient boosting was prespecified by cross-validation in the training cohort. The model achieved an AUROC of 0.872 (95% CI: 0.769-0.952) and an AUPRC of 0.883 (95% CI: 0.780-0.959) in a held-out test cohort, with sensitivity of 0.750 and specificity of 0.812. Conclusions: Longitudinal clinical laboratory tests obtained during routine care contained informative and clinically plausible signals for P-TMA risk. Notably, cystatin C at week 6 showed promise as an early monitoring indicator.
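
As a reading aid for the reported test metrics, the sketch below shows how sensitivity and specificity follow from a confusion matrix. The counts are hypothetical and are not reported in the abstract; they were chosen here because a stratified 20% split of 300 pregnancies would plausibly give about 28 cases and 32 controls, and 21/28 and 26/32 happen to reproduce the stated 0.750 and 0.812.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical confusion-matrix counts (an assumption, not taken from the paper).
sens, spec = sensitivity_specificity(tp=21, fn=7, tn=26, fp=6)
print(f"sensitivity={sens:.4f}, specificity={spec:.4f}")
```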

2605.13784 2026-05-14 cs.LG

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

Victor Norgren

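AI summary This paper proposes an efficient streaming inference method built on stateful sessions: by maintaining a continuously updated key-value cache, it moves the conventional prefill computation off the critical path, so that query latency depends only on the length of the current query and not on the size of the accumulated context. The method also introduces Flash Queries, which use idle GPU cycles between data arrivals to pre-evaluate registered questions and cache the answers, a structural property that conventional stateless engines cannot provide. Experiments show speedups of up to 5.9x over mainstream inference engines on streaming market-data benchmarks.

Details
English abstract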

Conventional transformer inference engines are request-driven, paying an O(n) prefill cost on every query. In streaming workloads, where data arrives continuously and queries probe an ever-growing context, this cost is prohibitive. We introduce a data-driven computational model centred on stateful sessions: a persistent KV cache advanced incrementally as new data arrives, so prefill is moved off the critical path and query latency becomes O(|q|), independent of accumulated context size. Building on this, Flash Queries reclaim idle GPU cycles between data arrivals to pre-evaluate registered questions and return cached answers before the user asks, a pattern that is structurally impossible in stateless engines because they discard intermediate state between requests. A multi-tenant continuous-batching scheduler with cell-budget admission and prefix-aware grouped prefill lets dozens of stateful sessions coexist on a single GPU while preserving full quadratic self-attention. On streaming market-data benchmarks the reference implementation achieves up to 5.9x speedup over conventional inference engines (vLLM, SGLang, TensorRT-LLM, llama.cpp), holding query latency constant as accumulated context grows.
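
The contrast between request-driven and stateful inference can be made concrete with a toy cost model. This is a sketch under stated assumptions, not the paper's engine: the class names, the use of "tokens processed on the critical path" as a stand-in for latency, and the token lists are all hypothetical, and no actual attention is computed.

```python
class StatelessEngine:
    """Request-driven: every query re-prefills the whole accumulated context."""

    def query_cost(self, context, query):
        # Tokens processed on the critical path: full context plus the query.
        return len(context) + len(query)


class StatefulSession:
    """Data-driven: a persistent cache is advanced as data arrives."""

    def __init__(self):
        self.kv_cache = []  # stand-in for per-token key/value entries

    def ingest(self, tokens):
        # Happens when data arrives, off the query critical path.
        self.kv_cache.extend(tokens)

    def query_cost(self, query):
        # Only the query tokens are processed at request time.
        return len(query)


context, session, stateless = [], StatefulSession(), StatelessEngine()
query = ["what", "changed", "?"]
costs = []
for tick in range(3):
    new_data = [f"tick{tick}_tok{i}" for i in range(100)]
    context.extend(new_data)
    session.ingest(new_data)
    costs.append((stateless.query_cost(context, query), session.query_cost(query)))
print(costs)
```

In this model the stateless cost grows with every batch of arriving data while the stateful cost stays at the query length, which is the behaviour the abstract describes.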

2605.13782 2026-05-14 cs.RO cs.AI

LMPath: Language-Mediated Priors and Path Generation for Aerial Exploration

Jonathan A. Diller, Fernando Cladera, Camillo J. Taylor, Vijay Kumar

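AI summary Traditional UAV search missions usually rely on geometric coverage patterns that ignore the semantic context of the target, wasting considerable time in large-scale environments. This paper proposes LMPath, which uses language models and a foundation vision model to generate semantically guided exploration priors for planning UAV search paths more efficiently. Given a target prompt and a geofence, the method identifies likely target regions and generates UAV paths for several optimization objectives; experiments in both real and simulated environments show that it outperforms traditional path planning methods.

Details
Comments
Poster at 2026 AI-Driven Safe Aerial Robotics Workshop
English abstract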

Traditional autonomous UAV search missions rely on geometric coverage patterns that ignore the semantic context of the target, leading to significant time waste in large-scale environments. In this paper we present LMPath, a pipeline for generating language-mediated exploration priors for Unmanned Aerial Vehicle (UAV) search missions that leverages semantics. Given a basic geofence and an object-of-interest prompt, LMPath uses generative language models to determine what regions of the environment should contain that object and a foundation vision model run over satellite imagery to segment sub-regions that form the exploration prior. This prior can then be used to generate UAV paths with various objectives, such as minimizing the expected time to locate the object of interest, maximizing the probability that the object is found given a limited travel distance, or narrowing down the search space to sub-regions that are most likely to contain the object. To demonstrate its capabilities, we used LMPath to generate various UAV paths and ran them using a real UAV over large-scale environments. We also ran simulations to demonstrate how paths generated using LMPath outperform traditional path planning approaches for search missions.

2605.13778 2026-05-14 cs.RO cs.CV

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

Jiahui Niu, Kefan Gu, Yucheng Zhao, Shengwen Liang, Tiancai Wang, Xing Hu, Ying Wang, Huawei Li

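AI summary This paper proposes Realtime-VLA FLASH, a speculative inference framework that addresses the high latency of full inference that hinders real-time deployment of diffusion-based vision-language-action (dVLA) models. The method introduces a lightweight draft model, verifies its output in parallel with the main model's Action Expert, and uses a phase-aware mechanism to fall back to the full inference pipeline when needed, enabling low-latency, high-frequency replanning. Experiments show that FLASH reduces inference latency on both LIBERO and a real-world conveyor-belt sorting task, markedly improving execution efficiency for real-time tasks.

Details
English abstract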

Diffusion-based vision-language-action models (dVLAs) are promising for embodied intelligence but are fundamentally limited in real-time deployment by the high latency of full inference. We propose Realtime-VLA FLASH, a speculative inference framework that eliminates most full inference calls during replanning by introducing a lightweight draft model with parallel verification via the main model's Action Expert and a phase-aware fallback mechanism that reverts to the full inference pipeline when needed. This design enables low-latency, high-frequency replanning without sacrificing reliability. Experiments show that on LIBERO, FLASH largely preserves task performance by replacing many 58.0 ms full-inference rounds with speculative rounds as fast as 7.8 ms, lowering task-level average inference latency to 19.1 ms (3.04x speedup). We additionally demonstrate effectiveness on real-world conveyor-belt sorting, highlighting its practical impact for latency-critical embodied tasks.
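
The reported latency figures can be checked with a few lines of arithmetic. The speedup is taken directly from the abstract's numbers; the mixture estimate is an assumption made here, treating the task-level average as a simple per-round mix of full rounds (58.0 ms) and speculative rounds at their best-case 7.8 ms. Because 7.8 ms is quoted as the fastest speculative round, the resulting fraction is only an upper bound on the share of full-inference rounds under that model.

```python
full_ms = 58.0       # full-inference round (from the abstract)
best_spec_ms = 7.8   # fastest speculative round (from the abstract)
avg_ms = 19.1        # task-level average latency (from the abstract)

# Speedup of the average latency over always running full inference.
speedup = full_ms / avg_ms

# Assumed two-point mixture: avg = p * full + (1 - p) * spec, solved for p.
max_full_fraction = (avg_ms - best_spec_ms) / (full_ms - best_spec_ms)

print(f"speedup: {speedup:.2f}x, full rounds at most {max_full_fraction:.1%}")
```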

2605.13775 2026-05-14 cs.RO cs.CV

RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

Harold Haodong Chen, Sirui Chen, Yingjie Xu, Wenhang Ge, Ying-Cong Chen

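AI summary This paper proposes RoboEvolve, a new framework that addresses the scalability bottleneck in robotic manipulation caused by the scarcity of physical interaction data. The framework couples a vision-language model (VLM) and a video generation model (VGM) into a mutually reinforcing co-evolutionary loop that relies only on unlabeled seed images for autonomous data synthesis and policy optimization. Experiments show that RoboEvolve offers clear advantages in task success rate, data efficiency, and continual learning.

Details
Comments
On-going work
English abstract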

The scalability of robotic manipulation is fundamentally bottlenecked by the scarcity of task-aligned physical interaction data. While vision-language models (VLMs) and video generation models (VGMs) hold promise for autonomous data synthesis, they suffer from semantic-spatial misalignment and physical hallucinations, respectively. To bridge this gap, we introduce RoboEvolve, a novel framework that couples a VLM planner and a VGM simulator into a mutually reinforcing co-evolutionary loop. Operating purely on unlabeled seed images, RoboEvolve leverages a cognitive-inspired dual-phase mechanism: (i) daytime exploration fosters physically grounded behavioral discovery through a semantic-controlled multi-granular reward, and (ii) nighttime consolidation mines "near-miss" failures to stabilize policy optimization. Guided by an autonomous progressive curriculum, the system naturally scales from simple atomic actions to complex tasks. Extensive experiments demonstrate that RoboEvolve (I) achieves superior effectiveness, elevating base planners by 30 absolute points and amplifying simulator success by 48% on average; (II) exhibits extreme data efficiency, surpassing fully supervised baselines with merely 500 unlabeled seeds, a 50x reduction; and (III) demonstrates robust continual learning without catastrophic forgetting.

2605.13772 2026-05-14 cs.CL cs.AI

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

Tyler Alvarez, Ali Baheri

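AI summary This study addresses hallucination in the multi-step reasoning of large language models and proposes a fine-grained detection method based on hidden-state trajectories. Unlike existing methods that judge the output as a whole, it analyzes how hidden states evolve during a single forward pass to locate the first erroneous step. The study introduces contrastive principal component analysis (PCA) to build trajectory-contrastive features and a bidirectional LSTM model for deployment-time detection without labels. Theoretical analysis and experiments show that the method outperforms existing approaches on several benchmark datasets and reveal how distribution shift affects model performance.

Details
English abstract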

Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first error and correct states, and that single-pass first error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
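
For readers unfamiliar with contrastive PCA, the sketch below shows the generic technique: find directions along which one set of vectors varies more than another by taking the top eigenvectors of a covariance difference. It is not the paper's teacher model. Treating first-error states as the target set and correct states as the background set, the contrast weight `alpha`, and the toy 3-D "hidden states" are all assumptions made for illustration.

```python
import numpy as np

def contrastive_pca(target, background, alpha=1.0, k=1):
    """Top-k eigenvectors of cov(target) - alpha * cov(background).

    target, background: (n_samples, dim) arrays.
    alpha is a hypothetical contrast weight; k is the number of directions.
    """
    c_target = np.cov(target, rowvar=False)
    c_background = np.cov(background, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(c_target - alpha * c_background)
    order = np.argsort(eigvals)[::-1][:k]  # largest eigenvalues first
    return eigvecs[:, order]

# Toy states: both sets vary along axis 1; only the "error" set varies along axis 0.
correct = np.array([[0, 1, 0], [0, -1, 0], [0, 1, 0], [0, -1, 0]], dtype=float)
errors = np.array([[2, 1, 0], [-2, -1, 0], [2, -1, 0], [-2, 1, 0]], dtype=float)
directions = contrastive_pca(errors, correct)
print(directions.ravel())
```

The shared variation along axis 1 cancels in the covariance difference, so the recovered direction points along axis 0, the only axis that separates the two sets.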

2605.13769 2026-05-14 cs.CL cs.LG

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

Abdalrahman Wael

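AI summary This paper studies the performance difference between dense and mixture-of-experts (MoE) Transformer models in a small-scale pretraining setting under a shared LLaMA-style decoder training recipe. With the tokenizer, data, optimizer, and other settings held fixed, only model width is adjusted to match either the active or the total parameter budget. Experiments show that the MoE model achieves lower validation loss than the dense model under active-parameter matching, whereas the dense model performs better under total-parameter matching, revealing how the two pretraining strategies compare under different parameter-matching schemes.

Details
Comments
10 pages, 6 figures, 8 tables
English abstract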

We study dense and mixture-of-experts (MoE) transformers in a tiny-scale pretraining regime under a shared LLaMA-style decoder training recipe. The sparse model replaces dense feed-forward blocks with Mixtral-style routed experts. Dense baselines are modestly width-resized to tightly match either active or total parameter budgets, while tokenizer, data, optimizer, schedule, depth, context length, normalization style, and evaluation protocol are held fixed. Our best sparse recipe uses four experts, top-2 routing, Switch-style load balancing, and router z-loss. In a three-seed full-data comparison, the dense active-match model reaches 1.6545 +/- 0.0012 best validation loss, the MoE reaches 1.5788 +/- 0.0020, and the dense total-match model reaches 1.5608 +/- 0.0025. This yields a matched-active gap of 0.0758 +/- 0.0021 in the MoE's favor and a matched-total gap of 0.0180 +/- 0.0020 in the dense model's favor. Across training, the matched-active advantage grows while the matched-total dense advantage narrows sharply. In this sub-25M-parameter regime, MoE therefore improves validation loss under active-parameter matching but does not surpass dense training at equal total stored capacity.
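
The "four experts, top-2 routing" setup can be illustrated for a single token with plain Python. This is a minimal sketch, assuming Mixtral-style routing in which the softmax is taken over the two selected logits only; the function names, the scalar toy experts, and the logits are hypothetical, and the load-balancing and router z-loss terms mentioned in the abstract are omitted.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and softmax-normalize their logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:2]
    m = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts):
    """Weighted sum of the two selected experts' outputs; other experts are not run."""
    return sum(w * experts[i](x) for i, w in top2_route(router_logits))

# Four toy experts: expert k multiplies its input by (k + 1).
experts = [lambda x, k=k: (k + 1) * x for k in range(4)]
logits = [0.0, 2.0, 1.0, -1.0]
routes = top2_route(logits)
out = moe_forward(1.0, logits, experts)
print(routes, out)
```

Only two of the four experts run per token, which is why the active parameter count is smaller than the total stored parameter count that the abstract's two matching schemes distinguish.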

2605.13761 2026-05-14 cs.LG cs.CE

Toward AI-Driven Digital Twins for Metropolitan Floods: A Conditional Latent Dynamics Network Surrogate of the Shallow Water Equations

Phillip Si, Yuan Qiu, Omar Sallam, Jeremy Feinstein, Ziang He, Eugene Yan, Peng Chen

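AI summary This study aims to develop an AI-based digital twin system for metropolitan floods and proposes the Conditional Latent Dynamics Network (CLDNet) for efficiently simulating the hydrodynamics of the shallow water equations. Using a rainfall-driven latent neural ODE and a coordinate-based decoder, the method quickly reconstructs water depth and discharge at arbitrary query points, handles irregular watersheds, and supports training and prediction over large metropolitan areas. Experiments show that CLDNet is better than traditional methods in both speed and accuracy, completing a 96-hour basin-wide flood forecast in about 29 seconds, roughly a 115-fold improvement over the original method.

Details
English abstract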

AI-driven flood digital twins demand fast hydrodynamic surrogates for ensemble forecasting and observation assimilation. Yet even GPU-accelerated two-dimensional shallow water equation (SWE) solvers still require ~55 minutes per 96-hour run on a ~4.2-million-active-cell metropolitan basin (the Des Plaines River basin at 30 m resolution), making such workloads prohibitive at native resolution. We present the Conditional Latent Dynamics Network (CLDNet): a low-dimensional latent neural ODE driven by rainfall, paired with a coordinate-based decoder conditioned on static terrain (elevation, slope, Manning roughness) that reconstructs depth and discharge at arbitrary query points. Pointwise decoding decouples memory from grid size and handles irregular watersheds natively, enabling metropolitan-scale training on a single compute node and direct queries at exact gauge coordinates without raster snapping. We evaluate CLDNet on a synthetic 250,000-cell Texas benchmark and on a new Des Plaines case study of 114 real-rainfall Stage IV storms whose reference simulator we validate against United States Geological Survey (USGS) gauges at the April 2013 flood-of-record (Nash-Sutcliffe efficiency 0.57 to 0.94 on mean-recentered water-surface elevation). CLDNet roughly halves the relative root-mean-squared error of an unconditional baseline, outperforms regular-grid VAE-ConvLSTM and FNO baselines on the Texas benchmark (both presuppose a Cartesian grid and do not apply to the irregular Des Plaines watershed), reaches a critical success index of ≈86% at the 0.5 m inundation threshold, and produces a full 96-hour basin-wide forecast in ~29 seconds, a ~115x speedup.
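
The Nash-Sutcliffe efficiency used to validate the reference simulator is a standard hydrology metric: one minus the ratio of squared simulation error to the variance of the observations around their mean. The sketch below implements that textbook definition; the function name and the toy series are assumptions, and the mean-recentering of water-surface elevation described in the abstract is not reproduced.

```python
def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).

    1.0 is a perfect match; 0.0 means the simulation is no better than
    predicting the observed mean everywhere.
    """
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / spread

# Toy gauge series (hypothetical values, not from the paper).
obs = [1.0, 2.0, 3.0, 4.0]
perfect = nash_sutcliffe(obs, obs)
mean_only = nash_sutcliffe(obs, [2.5, 2.5, 2.5, 2.5])
print(perfect, mean_only)
```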

2605.13759 2026-05-14 cs.LG

Fast and effective algorithms for fair clustering at scale

Claudio Mantuano, Manuel Kammermann, Philipp Baumann

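AI summary This paper studies fair clustering at scale, that is, partitioning objects into clusters while ensuring that each protected group is adequately represented in every cluster. To resolve the conflict between clustering cost and fairness, the authors propose a general fair clustering framework and design three efficient algorithms that respectively excel in solution quality, scalability, and the largest instance sizes they can handle. Experiments show that these methods outperform existing approaches on several benchmark datasets.

Details
English abstract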

Clustering is an unsupervised machine learning task that consists of identifying groups of similar objects. It has numerous applications and is increasingly used in fairness-sensitive domains where objects represent individuals, such as customers, employees, or students. We address a fair clustering problem in which objects belong to protected groups. The problem consists of partitioning the objects into a predefined number of clusters while attaining a user-defined target level of fairness, meaning that each protected group is sufficiently represented in each cluster. The objective is to minimize the clustering cost, defined as the sum of squared Euclidean distances between the objects and the centers of their clusters. Since clustering cost and fairness are generally in conflict, managing the trade-off between them is essential in practical applications. Existing methods provide limited control over this trade-off and either fail to scale to large datasets or, when they scale, produce low-quality solutions. We propose a general framework for fair clustering that provides precise control over the cost-fairness trade-off and introduce three heuristics based on it. The first heuristic focuses on solution quality and the flexibility to incorporate additional constraints, the second improves scalability while retaining high solution quality, and the third is designed for maximum scalability, producing solutions for instances with millions of objects in seconds. The proposed heuristics outperform existing approaches in comprehensive numerical experiments on benchmark datasets. The source code of our heuristics and instructions for reproducing the experiments are publicly available on GitHub.
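
The trade-off between clustering cost and fairness can be seen on a four-point example. The cost function follows the abstract's definition (sum of squared Euclidean distances to cluster centers, taken here as cluster means). The `balance` measure is an assumption: the abstract does not define its fairness level, so this sketch uses the smallest ratio of a group's share inside a cluster to its overall share, capped at 1. The data and function names are hypothetical, and none of the paper's three heuristics is implemented.

```python
from collections import Counter

def clustering_cost(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    cost = 0.0
    for members in clusters.values():
        center = [sum(x) / len(members) for x in zip(*members)]
        cost += sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in members)
    return cost

def balance(labels, groups):
    """Assumed fairness score in [0, 1]: worst group-share ratio over all clusters."""
    overall = Counter(groups)
    n = len(groups)
    worst = 1.0
    for c in set(labels):
        in_cluster = [g for g, l in zip(groups, labels) if l == c]
        counts = Counter(in_cluster)
        for g, total in overall.items():
            ratio = (counts[g] / len(in_cluster)) / (total / n)
            worst = min(worst, ratio)
    return worst

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
groups = ["a", "a", "b", "b"]
compact = [0, 0, 1, 1]  # low cost, each cluster holds a single group
mixed = [0, 1, 0, 1]    # both groups in every cluster, much higher cost
print(clustering_cost(points, compact), balance(compact, groups))
print(clustering_cost(points, mixed), balance(mixed, groups))
```

The compact assignment minimizes cost but leaves each cluster with only one group, while the mixed assignment is perfectly balanced at a much higher cost, which is the conflict the framework is designed to manage.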

2605.13757 2026-05-14 cs.RO

FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

Bin Yu, Shijie Lian, Xiaopeng Lin, Zhaolong Shen, Yuliang Wei, Changti Wu, Hang Yuan, Haishan Liu, Bailing Wang, Cong Huang, Kai Chen

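AI summary This paper proposes FrameSkip, a data-layer framework that selects more informative frames when training vision-language-action (VLA) policies, addressing the temporal supervision imbalance caused by dense sampling in conventional methods. The method scores frames by action variation, visual-action coherence, task-progress priors, and gripper state changes, and selects key frames for training, improving training efficiency and task success rate while leaving the model architecture unchanged. Experiments show that FrameSkip clearly outperforms full-frame training and simpler frame selection methods on several benchmarks, achieving a higher task success rate while retaining 20% of the key frames.

Details
Comments
GitHub: https://github.com/ZGC-EmbodyAI/FrameSkip
English abstract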

Vision-Language-Action (VLA) policies are commonly trained from dense robot demonstration trajectories, often collected through teleoperation, by sampling every recorded frame as if it provided equally useful supervision. We argue that this convention creates a temporal supervision imbalance: long low-change segments dominate the training stream, while manipulation-critical transitions such as alignment, contact, grasping, and release appear only sparsely. We introduce FrameSkip, a data-layer frame selection framework that scores trajectory frames using action variation, visual-action coherence, task-progress priors, and gripper-transition preservation, then remaps training samples toward high-importance frames under a target retention ratio. Because FrameSkip operates only in the dataloader, it leaves the VLA architecture, action head, training objective, and inference procedure unchanged. Across RoboCasa-GR1, SimplerEnv, and LIBERO, FrameSkip improves the success-retention trade-off over full-frame training and simpler frame selection variants, achieving a macro-average success rate of 76.15% across the three benchmarks compared with 66.50% for full-frame training while using a compressed trajectory view that retains 20% of unique frames in the main setting.
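
A simplified version of importance-based frame selection can be sketched as follows. It is not the FrameSkip scoring function: only two of the four signals named in the abstract are used (action variation and gripper-transition preservation), frames are selected rather than remapped in a dataloader, and the function name, the L2 action difference, and the toy trajectory are all assumptions.

```python
import math

def select_frames(actions, gripper, retention=0.2):
    """Keep gripper-transition frames plus the highest action-variation frames.

    actions: list of action vectors, one per frame.
    gripper: list of gripper states (e.g. 0 = open, 1 = closed), one per frame.
    retention: target fraction of frames to keep.
    """
    n = len(actions)
    # Action variation: distance between consecutive actions (frame 0 scores 0).
    scores = [0.0] + [math.dist(actions[t], actions[t - 1]) for t in range(1, n)]
    # Frames where the gripper state changes are always preserved.
    must_keep = {t for t in range(1, n) if gripper[t] != gripper[t - 1]}
    budget = max(len(must_keep), math.ceil(retention * n))
    ranked = sorted((t for t in range(n) if t not in must_keep),
                    key=lambda t: scores[t], reverse=True)
    return sorted(must_keep | set(ranked[:budget - len(must_keep)]))

# Toy 10-frame trajectory: the action jumps at frame 4, the gripper closes at frame 7.
actions = [(0.0,)] * 4 + [(1.0,)] * 6
gripper = [0] * 7 + [1] * 3
kept = select_frames(actions, gripper, retention=0.2)
print(kept)
```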

2605.13755 2026-05-14 cs.CV

Generative Texture Diversification of 3D Pedestrians for Robust Autonomous Driving Perception

Arka Bhowmick, Enes Ozeren, Ahmed Abdullah, Oliver Wasenmuller

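AI summary This paper studies how generative AI can increase the texture diversity of 3D pedestrian models for autonomous driving perception systems, improving model robustness in complex scenes. The authors propose a StyleGAN2-based method that starts from a single 3D base model and generates pedestrian instances with diverse facial textures and appearance features, without redesigning the geometry. The method is used to build synthetic datasets and to analyze how mixing real and synthetic data affects 2D and 3D object detection, revealing the sensitivity of 3D perception models to geometric domain gaps and showing both the potential and the limitations of generative AI for autonomous driving data generation.

Details
Comments
Published at SAIAD 2026 Workshop at CVPR 2026
English abstract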

In recent years, autonomous driving has significantly increased the demand for high-quality data to train 2D and 3D perception models for safety-critical scenarios. Real-world datasets struggle to meet this demand as requirements continuously evolve and large-scale annotated data collection remains costly and time-consuming, making synthetic data a scalable, practical and controllable alternative. Pedestrian detection is among the most safety-critical tasks in autonomous driving. In this paper, we propose a simple yet effective method for scaling variability in 3D pedestrian assets for synthetic scene generation. Starting from a single 3D base asset, we generate multiple distinct pedestrian instances by synthesizing diverse facial textures and identity-level appearance variations using StyleGAN2 and automatically mapping them onto 3D meshes. This approach enables scalable appearance-level asset diversification without requiring the design of new geometries for each instance. Using the assets, we construct synthetic datasets and study the impact of mixing real and synthetic data for RGB-based object detection. Through complementary experiments, we analyze geometry-driven distribution shifts in point cloud perception for 3D object detection. Our findings demonstrate that controlled synthetic diversification improves robustness in 2D detection while revealing the sensitivity of 3D perception models to geometric domain gaps. Overall, this work highlights how generative AI enables scalable, simulation-ready pedestrian diversification through controlled facial texture synthesis, along with the benefits and limitations of cross-domain training strategies in autonomous driving pipelines.

2605.13754 2026-05-14 cs.RO

Manipulation Planning for Construction Activities with Repetitive Tasks

Wangyi Liu, Dasharadhan Mahalingam, Fanru Gao, Ci-Jyun Liang, Nilanjan Chakraborty

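AI summary This paper studies manipulation skill learning for construction activities that consist of repetitive tasks, such as building a wall or installing ceiling tiles. The authors propose a virtual-reality-based method that acquires manipulation skills from user demonstrations, uses the screw geometry of motion to approximate the demonstrated motion as a sequence of constant screw motions, and generates manipulation trajectories with Screw Linear Interpolation and Resolved Motion Rate Control. Experiments show that the method completes complex repetitive construction tasks on both simulated and real robots from a single demonstration, with good generalization and precision.

Details
English abstract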

In this paper, we study the problem of manipulation skill acquisition for performing construction activities consisting of repetitive tasks (e.g., building a wall or installing ceiling tiles). Our approach involves setting up a simulated construction activity in a Virtual Reality (VR) environment, where the user can provide demonstrations of the object manipulation skills needed to perform the construction activity. We then exploit the screw geometry of motion to approximate the demonstrated motion as a sequence of constant screw motions. For performing the construction activity, we generate the sequence of manipulation task instances and then compute the joint space motion plan corresponding to each instance using Screw Linear Interpolation (ScLERP) and Resolved Motion Rate Control (RMRC). We evaluate our framework by executing two representative construction tasks: constructing brick walls and installing multiple ceiling tiles. Each task is performed using only a single demonstration: a pick-and-place action for the bricks and a single ceiling tile installation for the tiles. Our experiments with a 7-DoF robot in both simulation and hardware demonstrate that the approach generalizes robustly to arbitrarily long construction activities that involve repetitive motions and demand precision, even when provided with just one demonstration. For instance, we can construct walls of arbitrary layout and length by leveraging a single demonstration of placing one brick on top of another.
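
Interpolating along a constant screw motion between two poses can be sketched with homogeneous transforms. ScLERP is usually written with dual quaternions; the matrix form below, T(tau) = T0 exp(tau log(T0^-1 T1)), is a swapped-in equivalent used here for brevity and is not taken from the paper. The helper names are hypothetical, and the logarithm as written does not handle relative rotations near 180 degrees.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])

def se3_log(T):
    """Twist (w, v) of a 4x4 rigid transform; not valid for rotations near pi."""
    R, p = T[:3, :3], T[:3, 3]
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1, 1))
    if theta < 1e-9:
        return np.zeros(3), p.copy()
    W = theta / (2 * np.sin(theta)) * (R - R.T)
    w = np.array([W[2, 1], W[0, 2], W[1, 0]])
    c = (1 - theta * np.sin(theta) / (2 * (1 - np.cos(theta)))) / theta**2
    V_inv = np.eye(3) - 0.5 * W + c * W @ W
    return w, V_inv @ p

def se3_exp(w, v):
    """4x4 rigid transform generated by the twist (w, v)."""
    theta = np.linalg.norm(w)
    T = np.eye(4)
    if theta < 1e-9:
        T[:3, 3] = v
        return T
    W = hat(w)
    A = np.sin(theta) / theta
    B = (1 - np.cos(theta)) / theta**2
    C = (theta - np.sin(theta)) / theta**3
    T[:3, :3] = np.eye(3) + A * W + B * W @ W
    T[:3, 3] = (np.eye(3) + B * W + C * W @ W) @ v
    return T

def sclerp(T0, T1, tau):
    """Pose at fraction tau in [0, 1] along the constant screw from T0 to T1."""
    w, v = se3_log(np.linalg.inv(T0) @ T1)
    return T0 @ se3_exp(tau * w, tau * v)

# Toy goal pose: rotate 90 degrees about z while rising 1 unit along z.
T0 = np.eye(4)
T1 = np.eye(4)
T1[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
T1[:3, 3] = [0, 0, 1]
mid = sclerp(T0, T1, 0.5)
print(np.round(mid, 3))
```

Halfway along this screw the pose has rotated 45 degrees about z and risen half a unit, so rotation and translation advance together rather than being interpolated separately.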