arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.02411 2026-06-11 cs.AI cs.IR cs.LG cs.MA Version update

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang

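AI summary: To address the semantic gap between user task descriptions and tool documentation, proposes the FitText framework, which embeds retrieval in the reasoning loop and significantly improves tool retrieval through iterative refinement of natural-language pseudo-tool descriptions and memetic evolutionary selection.

Details
AI abstract (translated from Chinese)

A semantic gap exists between how users describe tasks and how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval based only on the initial query cannot bridge this gap: the agent's understanding of the tools it needs keeps evolving during execution, but its tool set stays fixed. We identify this retrieval interface, rather than planning, as the binding bottleneck on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tools it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval applies evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic Retrieval reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.

English abstract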

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
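The NDCG@5 figures quoted above are rank-discounted relevance scores. As a rough illustration of the metric only (not of FitText itself), here is a minimal Python sketch. The function names are hypothetical, and it assumes `ranked_relevances` lists a graded relevance label for every candidate tool in the order the retriever ranked them.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    Assumption: ranked_relevances covers all candidates, so sorting it in
    descending order gives the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists, so the score is defined as 0 here
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A gain of "2.7 points" in the abstract corresponds to a difference of 0.027 on this 0-to-1 scale when scores are reported as percentages.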

2606.10360 2026-06-11 cs.SD Version update

ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning

Khanh Le, Kiet Anh Hoang, Bao Nguyen, Duy Vo, Dung Vo, Thai Tran, Linh Pham, Khoa D Doan

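AI summary: Proposes ViP-VL, which uses acoustic stacking, receptive field alignment, and a mask selection strategy for efficient self-supervised pretraining on the BEST-RQ framework, achieving state-of-the-art results on four Vietnamese tasks: ASR, emotion recognition, dialect classification, and speaker verification.

Details
Comments
Accepted to INTERSPEECH 2026
AI abstract (translated from Chinese)

We present ViP-VL, an efficient Vietnamese self-supervised speech pretraining model that uses vector-quantization learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL introduces acoustic stacking and receptive field alignment into the ChunkFormer architecture, achieving a synchronized 8x subsampling rate, and further strengthens representation robustness with a specialized mask selection strategy during pretraining on the BEST-RQ framework. After pretraining on 17,000 hours of unlabeled Vietnamese speech, our model sets new state-of-the-art results on four major downstream tasks: automatic speech recognition, speech emotion recognition, dialect classification, and speaker verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at github.com/khanld/chunkformer.

English abstract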

We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment to enable a synchronized 8x subsampling rate within the ChunkFormer architecture, while further enhancing representation robustness through a specialized Mask Selection Strategy during pretraining on the BEST-RQ framework. Pretrained on 17,000 hours of unlabeled Vietnamese speech, our model establishes new state-of-the-art results across four major downstream tasks: Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at github.com/khanld/chunkformer.

2606.10198 2026-06-11 cs.LG cs.AI cs.CV Version update

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

Nina I. Shamsi

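AI summary: For hallucination detection in large language and vision-language models when calibration labels are scarce, proposes a density ridge method based on kernel density estimation. It builds the response manifold from a six-dimensional kinematic feature map of hidden-state generation trajectories and scores generations by Euclidean distance to the nearest ridge vertex, improving AUROC by 5-20 points under a label-scarce protocol.

Details
AI abstract (translated from Chinese)

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, in which a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy, EigenScore) avoid labels but plateau in quality, while supervised probes (SAPLMA) attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of a large language model as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden-state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, SAR, EigenScore, SAPLMA, and log-probability on seven QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using nine text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score wins on AUROC by 5-20 percentage points, while showing tempered degradation under calibration-label scarcity.

English abstract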

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while showing tempered degradation under calibration-label scarcity.
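The scoring rule described in the abstract (negated Euclidean distance to the nearest ridge vertex) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the ridge vertices are taken as already computed, the kinematic feature map is not shown, and `ridge_score` is a hypothetical name.

```python
import numpy as np


def ridge_score(feature_point, ridge_vertices):
    """Negated Euclidean distance from one feature point to its nearest ridge vertex.

    feature_point:  array of shape (d,), e.g. d = 6 for a six-dimensional feature map.
    ridge_vertices: array of shape (m, d), assumed to be precomputed ridge points.
    Higher (closer to 0) means the generation lies nearer the ridge.
    """
    distances = np.linalg.norm(ridge_vertices - feature_point, axis=1)
    return -float(distances.min())
```

A selective predictor would then abstain whenever this score falls below a threshold chosen on the small calibration set.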

2606.10135 2026-06-11 cs.CV cs.AI Version update

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

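AI summary: Proposes the BiWM framework, which turns a pretrained video backbone into an interactive world model under a bidirectional autoregressive paradigm with only two training stages (fine-tuning plus distribution matching distillation), supports models at multiple scales and long-horizon generation, and outperforms existing causal pipelines.

Details
Comments
After the paper was posted, we discovered that several visualization results were produced using wrong configuration settings during runtime. This error affects the reliability of the presented visual comparisons. Additionally, further optimization of the design is needed. We therefore request to withdraw this version and will submit a corrected and improved version later.
AI abstract (translated from Chinese)

Transitioning bidirectional video diffusion models to an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need multiple stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality because of error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (such as minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. Starting from a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: only two training stages (instead of four in minWM), converging within a few hundred steps on 8xH200 GPUs. A single recipe covers Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and supports secondary fine-tuning of existing bidirectional models. BiWM achieves real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long-horizon rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

English abstract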

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

2606.10046 2026-06-11 cs.SD cs.AI Version update

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

Yuxuan Chen, Haoyuan Yu, Peize He

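AI summary: Uses a causal-intervention protocol to reveal the dual-pathway attention mechanism of flow-matching Transformers in audio separation, and proposes LSAC, a training-free acceleration method that cuts self-attention computation by about 25% while preserving quality.

Details
AI abstract (translated from Chinese)

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing reveals a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.

English abstract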

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.

2606.10040 2026-06-11 cs.RO Version update

Efficient-WAM: A 1B-Parameter World-Action Model with Low-Cost Future Imagination

Jiajun Li, Tiecheng Guo, Yifan Ye, Rongyu Zhang, Xiaowei Chi, Qianpu Sun, Ying Li, Yunfan Lou, Yan Huang, Zhihe Lu, Meng Guo, Shanghang Zhang

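AI summary: Proposes Efficient-WAM, which lowers the cost of future imagination through a compact video expert, sparse video latents, and asymmetric denoising, achieving a 30x inference speedup while maintaining control performance.

Details
AI abstract (translated from Chinese)

World-Action Models (WAMs) have become a promising paradigm for embodied control by combining future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which causes high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that keeps the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency through a compact video expert transferred from WAN-2.2-5B, sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Rather than optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model reduces per-chunk latency to around 100 ms in physical deployment, a 30x speedup over existing WAMs.

English abstract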

World-Action Models (WAMs) have emerged as a promising paradigm for embodied control by coupling future visual prediction with action generation. However, most existing WAMs rely on photorealistic future prediction, which incurs high inference latency and makes real-time robot deployment difficult. This motivates a more efficient WAM design that preserves the control benefits of future visual prediction while reducing its inference cost. We introduce Efficient-WAM, a World-Action Model that reduces the cost of future imagination while preserving its control benefit. Efficient-WAM improves inference efficiency via a compact video expert transferred from WAN-2.2-5B, token-sparse video latents, and asymmetric video-action denoising that allocates fewer sampling steps to video than to actions. Instead of optimizing the future branch for visual fidelity, Efficient-WAM treats future video prediction as a compact guidance signal for action generation. Comprehensive experiments on RoboTwin 2.0 and real-world manipulation tasks show that Efficient-WAM maintains strong action performance despite visibly coarse future predictions. While maintaining competitive control capabilities, our 1B-parameter model can reduce per-chunk latency to around 100 ms during physical deployment, achieving a 30x speedup over existing WAMs.

2606.09830 2026-06-11 cs.CL Version update

Automated Scoring of Arabic Text Using Large Language Models: A Literature Review

Khaoula Dahimi, Hadda Cherroun, Amel Belabbaci

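AI summary: Reviews LLM-based methods for automated scoring of Arabic text, covering short answer grading and essay scoring, proposes a taxonomy with five dimensions, and compares the methods, datasets, and performance of existing studies.

Details
Comments
Accepted at NCMAI 2026
AI abstract (translated from Chinese)

In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and Arabic-specific datasets has renewed interest in this area. In this work, we investigate LLM-based approaches for the automated evaluation of Arabic texts, focusing on short answer grading (ASAG) and essay scoring (AES). We further introduce a structured taxonomy with five dimensions: application domain, feedback generation capability, LLM architecture deployed, alignment with competency reference frameworks, and prompt engineering strategy. By applying this taxonomy, we conduct a comparative analysis of existing studies, examining their methodologies, datasets, evaluation metrics, and reported performance. The findings highlight the need for sustained and pedagogically grounded research efforts in Arabic ATS, given its significance for improving educational quality in Arabic-speaking communities.

English abstract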

In modern educational systems, Automatic Text Scoring (ATS) plays a central role by enabling scalable and consistent evaluation of learner responses without human intervention. Recently, the increased accessibility of LLMs and Arabic-specific datasets has sparked renewed interest in this area. In this work, we investigate LLM-based approaches for the automated evaluation of Arabic texts, focusing on both short answer grading (ASAG) and essay scoring (AES). We further introduce a structured taxonomy comprising five dimensions: application domain, feedback generation capability, LLM architecture deployed, alignment with competency referential frameworks, and prompt engineering strategy. By applying this taxonomy, we conduct a comparative analysis of existing studies, examining their methodological approaches, datasets, evaluation metrics, and reported performance. The findings highlight the need for sustained and pedagogically grounded research efforts in Arabic ATS, given its significance for improving educational quality across Arabic-speaking communities.

2606.11118 2026-06-11 cs.LG math.OC math.PR stat.AP stat.ML Version update

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

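AI summary: For two-sided service platforms, proposes a data-driven algorithm that dynamically optimizes the assortment when customer and seller choice parameters are unknown, and proves that its regret grows polylogarithmically over time and attains the optimal rate.

Details
AI abstract (translated from Chinese)

We study a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers in a discrete-time setting. In each period, a customer arrives seeking service, and the platform chooses a set of sellers to display. The customer proposes a transaction to at most one seller in the assortment according to a multinomial logit choice model. After a fixed number of periods, sellers review the proposals they have received, and each seller chooses at most one customer according to another multinomial logit choice model, after which the cycle repeats. A key challenge is that the platform does not know the choice-model parameters of either customers or sellers in advance. To our knowledge, this is the first study of a dynamic assortment problem in which both sides' choice parameters are unknown. We develop a data-driven algorithm that learns these parameters while optimizing the platform's objective. We evaluate performance using regret, which measures revenue loss relative to a clairvoyant benchmark that knows all parameters and customer arrival times in advance. We show that the algorithm's worst-case regret grows polylogarithmically over time, and we derive a matching lower bound, establishing its rate optimality.

English abstract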

We study a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers in a discrete-time setting. In each period, a customer arrives seeking service, and the platform chooses an assortment of sellers to display. The customer then proposes a transaction to at most one seller in the assortment according to a multinomial logit choice model. After a fixed number of periods, sellers review the proposals they have received and each chooses at most one customer according to another multinomial logit choice model, after which the cycle repeats. A key challenge is that the platform does not know the choice-model parameters of either customers or sellers in advance. To our knowledge, this is the first study of a dynamic assortment problem in which both sides' choice parameters are unknown. We develop a data-driven algorithm that learns these parameters while optimizing the platform's objective over time. We evaluate performance using regret, which measures revenue loss relative to a clairvoyant benchmark that knows all parameters and customer arrivals in advance. We show that the algorithm's worst-case regret grows polylogarithmically over time, and we derive a matching lower bound, establishing its rate optimality.
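The customer-side step above relies on a multinomial logit (MNL) choice model. A minimal Python sketch of standard MNL choice probabilities follows; it is a textbook form rather than the paper's exact parameterization. The assumptions are ours: each seller has a scalar utility, the "propose to no one" option has utility 0, and the function and variable names are hypothetical.

```python
import math


def mnl_choice_probabilities(utilities, assortment):
    """Standard MNL probabilities over a displayed assortment plus an outside option.

    utilities:  dict mapping seller id -> mean utility (assumed known here; the
                paper's setting is precisely that these are unknown and learned).
    assortment: iterable of seller ids shown to the customer.
    Returns a dict of probabilities; the key None is the no-proposal option,
    whose utility is normalized to 0 (weight exp(0) = 1).
    """
    weights = {seller: math.exp(utilities[seller]) for seller in assortment}
    denominator = 1.0 + sum(weights.values())
    probabilities = {seller: w / denominator for seller, w in weights.items()}
    probabilities[None] = 1.0 / denominator
    return probabilities
```

Because the denominator depends on the whole displayed set, adding a seller to the assortment lowers every other seller's proposal probability, which is what makes the assortment choice a genuine optimization problem.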

2606.09744 2026-06-11 cs.LG cond-mat.dis-nn Version update

Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

Claudio Nordio

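AI summary: Studies the gradient descent dynamics of feed-forward ReLU networks with fixed readout and quadratic loss, rewrites them as collective dynamics on the training-set space, and reveals a hierarchy of weight-induced Gram operators in deep networks.

Details
Comments
24 pages. v4: Corrected the hidden-activation dynamics; clarified the concept of field closure. Other minor corrections
AI abstract (translated from Chinese)

We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers.

English abstract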

We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers. Moreover, the conjugate-field dynamics is governed by operators satisfying a backward pullback recursion, of which the weight-induced Gram operators are the first nontrivial instances.

2606.09347 2026-06-11 cs.CV Version update

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Haojun Guo, Fan Feng, Ziquan Wang, Yongsheng Zhang, Ying Yu

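AI summary: Proposes the IB-HFN network, which uses a dual-stream backbone, a spatial information bottleneck fusion module, and a joint optimization strategy to suppress SAR speckle noise and preserve optical detail for high-fidelity cloud removal.

Details
AI abstract (translated from Chinese)

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by using complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which propagates SAR speckle noise into the optical reconstruction and leads to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN uses a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and passes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that combines feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits show that IB-HFN outperforms existing methods in structural preservation and spectral fidelity.

English abstract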

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.

2606.08102 2026-06-11 cs.RO cs.AI cs.MA Version update

Continual Quadruped Robots Coordination via Semantic Skill Discovery

Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

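AI summary: Proposes the Conquer framework, which uses a semantic skill library to coordinate multiple quadruped robots across continually learned tasks while avoiding catastrophic forgetting, reaching a final average success rate of 95.6%.

Details
Comments
22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/
AI abstract (translated from Chinese)

Multi-quadruped coordination has attracted growing attention because of its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families and often rely on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each new task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them by semantic distance, enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world deployment on Unitree Go2 teams further validates the feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.

English abstract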

Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.

2606.07362 2026-06-11 cs.LG Version update

Breaking the Ice: Analyzing Cold Start Latency in vLLM

Huzaifa Shaaban Kabakibo, Animesh Trivedi, Lin Wang

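AI summary: Presents the first systematic analysis of cold start latency in the vLLM inference engine, breaks it down into six foundational steps, finds it is predominantly CPU bound, and builds a lightweight analytical model to predict latency, offering guidance for resource planning in large-scale inference environments.

Details
Journal ref
Proceedings of the 9th MLSys Conference, Bellevue, WA, USA, 2026
AI abstract (translated from Chinese)

As scalable inference services become widespread, the cold start latency of an inference engine becomes important. Today, vLLM has become the de facto standard inference engine for many inference workloads. Despite its popularity, its complexity and rapid evolution mean there has been no systematic study of its startup latency. With major architectural innovations such as the V1 API and the introduction of torch.compile, this paper presents the first detailed performance characterization of vLLM startup latency. We break the startup process down into six foundational steps and show that it is predominantly CPU bound. Each step exhibits consistent and interpretable scaling trends with respect to model-level and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All benchmarking datasets, analysis tools, and prediction scripts are open sourced at https://github.com/upb-cn/vllm-startup-profiler.

English abstract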

As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study of its startup latency. With major architectural innovations such as the V1 API and the introduction of torch.compile, this paper presents the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that it is predominantly CPU bound. Each step exhibits consistent and interpretable scaling trends with respect to model-level and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All benchmarking datasets, analysis tools, and prediction scripts are open sourced at https://github.com/upb-cn/vllm-startup-profiler.

2606.03504 2026-06-11 cs.CL cs.AI Version update

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

Muhammad Ali

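AI summary: For Balti, a language with no public ASR resources, builds a 16.8-hour read-speech corpus and fine-tunes Whisper-small, reducing word error rate on the validation set from 182.18% to 30.07%.

Details
Comments
6 pages, 3 figures, 4 tables. Code and data available at https://github.com/mohdali-dev/BaltiVoice-ASR
AI abstract (translated from Chinese)

We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a word error rate (WER) of 30.07% on a held-out validation set of 538 utterances, compared with a zero-shot Whisper-small baseline of 182.18% on Balti. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.

English abstract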

We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. Fine-tuning OpenAI Whisper-small yields a Word Error Rate (WER) of 26.74% and a Character Error Rate (CER) of 8.67% on a 538-utterance speaker-disjoint validation set, down from a zero-shot baseline of 159.19% WER and 152.52% CER. A Whisper-base fine-tuned on the same data achieves 44.54% WER and 15.61% CER, confirming that model capacity matters for this low-resource setting. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.
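A zero-shot WER above 100% can look like a typo, but it follows from how the metric is defined: insertions count as errors while the denominator is the reference length only. The sketch below is a plain word-level edit-distance implementation in Python, written for illustration; it is not the evaluation script used by the author, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Words are obtained by whitespace splitting; no text normalization is applied.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete a reference word
                current_row[j - 1] + 1,                   # insert a hypothesis word
                previous_row[j - 1] + substitution_cost,  # match or substitute
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)
```

For instance, a two-word reference transcribed as five words with only the first word correct needs four edits, giving a WER of 200%, which is how a model that emits long, wrong output can exceed 100%.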

2606.02670 2026-06-11 cs.LG cs.AI Version update

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

Marc Pinet, Julien Cumin, Samuel Berlemont, Dominique Vaufreydaz

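AI summary: Using a diagnostic framework and experiments, shows that anomalies in current multivariate time series anomaly detection benchmarks stem mainly from univariate deviations, with very few changes in cross-channel structure, so existing benchmarks are unsuitable for validating cross-channel modeling capabilities.

Details
AI abstract (translated from Chinese)

Many recent multivariate time series anomaly detection (MTSAD) models introduce cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that, across a range of reasonable thresholds, no cross-channel rupture occurs without an accompanying univariate deviation. A complementary metric also shows that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 79% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when it is present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve each channel's marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only anomalies. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal, whereas channel-independent (CI) models fail. A CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.

English abstract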

Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that no cross-channel rupture occurs without an accompanying univariate deviation across a range of reasonable thresholds. A complementary metric also reveals that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 89% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve the per-channel marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal whereas channel-independent (CI) ones fail. The CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.
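To make the idea of a "univariate deviation" concrete, the following Python sketch computes the fraction of timesteps in a labeled segment on which at least one channel departs from its own normal history. It is a deliberately simple z-score stand-in, not the paper's diagnostic framework; the function name and the threshold of 3.0 are assumptions chosen for illustration.

```python
import numpy as np


def univariate_deviation_fraction(normal_history, segment, z_threshold=3.0):
    """Fraction of segment timesteps where any single channel deviates on its own.

    normal_history: array of shape (T_normal, channels) with anomaly-free data.
    segment:        array of shape (T_segment, channels) for one labeled anomaly.
    Each channel is compared only against its own mean and standard deviation,
    so no cross-channel information is used.
    """
    mean = normal_history.mean(axis=0)
    std = normal_history.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero on constant channels
    z_scores = np.abs((segment - mean) / std)
    deviating_timesteps = (z_scores > z_threshold).any(axis=1)
    return float(deviating_timesteps.mean())
```

A value near 1.0 for most labeled segments would be consistent with the abstract's claim that a channel-independent detector already has enough signal on these benchmarks.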

2605.31219 2026-06-11 cs.CV cs.CR cs.LG Version update

Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks

Ei Hmue Khine, Yao Li, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Boying Wu

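AI summary: Proposes Latent Geometric Chords (LGC), which navigates decision boundaries through a curvature-aware geometric search within a compressed semantic manifold, and introduces a Residual-based Adversarial Generation (RAG) mechanism to achieve query-efficient decision-based black-box adversarial attacks with high visual fidelity.

Details
Comments
Added a conceptual diagram for the LGC architecture, 14 pages, 10 figures, 7 tables. Submitted to IEEE Transactions on Information Forensics and Security. The source code is available at https://github.com/eihmuekhine/Latent-Geometric-Chords
AI abstract (translated from Chinese)

While decision-based black-box adversarial attacks pose a severe security threat, current methods have fundamental limitations. Pixel-wise attacks frequently introduce unnatural, high-frequency visual artifacts, while latent-space frameworks are confined by the limited search space of low-dimensional manifolds and inherent reconstruction flaws. To address these limitations, we propose Latent Geometric Chords (LGC) for query-efficient decision-based adversarial attacks, along with a variant, LGC-H. At its core, LGC navigates decision boundaries by performing a curvature-aware geometric search within a compressed semantic manifold. To guarantee high visual fidelity and avoid dimensionality bottlenecks, we introduce a Residual-based Adversarial Generation (RAG) mechanism. RAG isolates semantic perturbations as geometric chords and superimposes them directly onto the original source image. RAG substantially resolves baseline reconstruction flaws and effectively doubles the permissible search space dimensions. Experimental results show that LGC achieves robust cross-dataset transferability and substantially outperforms state-of-the-art baselines. Notably, our method LGC minimizes perturbation magnitudes while achieving state-of-the-art visual fidelity, with a Structural Similarity Index Measure (SSIM) above 0.99 and a Learned Perceptual Image Patch Similarity (LPIPS) below 0.01 at 5000 queries, and it sustains high attack success rates under stringent perceptual constraints, successfully compromising adversarially trained robust models. The source code is available at https://github.com/eihmuekhine/Latent-Geometric-Chords.

English abstract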

While decision-based black-box adversarial attacks present a severe security threat, current methodologies suffer from fundamental limitations. Pixel-wise attacks frequently introduce unnatural, high-frequency visual artifacts, while latent-space frameworks are confined by the limited search space of low-dimensional manifolds and inherent reconstruction flaws. To resolve these limitations, we propose Latent Geometric Chords (LGC) for Query-Efficient Decision-Based Adversarial Attacks alongside a variant, LGC-H. At its core, LGC navigates decision boundaries by executing a curvature-aware geometric search within a compressed semantic manifold. To guarantee high visual fidelity and circumvent dimensionality bottlenecks, we introduce a Residual-based Adversarial Generation (RAG) mechanism. RAG isolates semantic perturbations as geometric chords and superimposes them directly onto the original source image. RAG substantially resolves baseline reconstruction flaws and effectively doubles the permissible search space dimensions. Experimental results demonstrate that LGC achieves robust cross-dataset transferability and substantially outperforms state-of-the-art baselines. Notably, our method, LGC, minimizes perturbation magnitudes while achieving state-of-the-art visual fidelity--with a Structural Similarity Index Measure (SSIM) exceeding 0.99 and a Learned Perceptual Image Patch Similarity (LPIPS) below 0.01 at 5000 queries--and sustaining high attack success rates under stringent perceptual constraints, successfully compromising adversarially trained robust models. The source code is available at: https://github.com/eihmuekhine/Latent-Geometric-Chords.

2511.02414 2026-06-11 cs.AI Version update

A New Perspective on Precision and Recall for Generative Models

Benjamin Sykes, Loïc Simon, Julien Rabin, Jalal Fadili

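AI summary: Proposes a new framework, based on a binary classification standpoint, for estimating the entire precision-recall curve of a generative model, derives a minimax upper bound through statistical analysis, and shows that the framework extends several landmark PR metrics from the literature.

Details
AI abstract (translated from Chinese)

With the recent success of generative models in image and text, the question of their evaluation has received wide attention in recent years. While most existing methods rely on scalar metrics, the introduction of Precision and Recall (PR) as evaluation metrics for generative models has opened a new research direction. The associated PR curves allow for a richer analysis, but their estimation poses many challenges. In this paper, we propose a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that the framework extends several landmark PR metrics from the literature, which by design are restricted to the extreme points of the curve. Finally, we study the different behaviors of the curves obtained in various settings.

English abstract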

With the recent success of generative models in image and text, the question of their evaluation has gained a lot of attention. While most state-of-the-art methods rely on scalar metrics, the introduction of Precision and Recall (PR) for generative models has opened up a new avenue of research. The associated PR curve allows for a richer analysis, but its estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics from the literature, which by design are restricted to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.

2509.20241 2026-06-11 cs.LG cs.DC 版本更新

Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling

AI推断的能耗:效率路径与测试时计算

Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres

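AI summary: This paper proposes a bottom-up method based on token throughput for estimating the per-query energy use of large language models at scale, and shows how energy use changes under test-time scaling and how much efficiency gains could save.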

详情
Journal ref
Joule (2026) 102430
Comments
A preprint version with DOI is available at Zenodo: https://doi.org/10.5281/zenodo.17188770
AI中文摘要

随着AI推断扩展到数十亿查询和新兴推理及代理工作流增加令牌需求,可靠估计每查询能耗对容量规划、排放核算和效率优先级至关重要。许多公开估计不一致且高估能耗,因为它们从有限基准外推且未能反映大规模下的效率提升。本文引入基于令牌吞吐量的自下而上方法,估算大规模LLM系统的每查询能耗。在H100节点下运行的模型,根据现实工作负载和GPU利用率及PUE约束,估算前沿规模模型(>2000亿参数)的每查询能耗中位数为0.34 Wh(IQR: 0.18-0.67)。这些结果与生产规模配置测量一致,表明非生产估计可能高估能耗4-20倍。扩展到测试时扩展场景,每个典型查询的令牌数增加15倍,中位数能耗升至4.32 Wh,表明在该范围内聚焦效率将带来最大的集群节能。我们量化了在模型、服务平台和硬件层面的可实现效率提升,发现单个模型的每查询能耗中位数减少1.5-3.5倍,而综合改进可能带来8-20倍的减少。为说明系统级影响,我们估算一个处理十亿查询的部署的基线日能耗为0.8 GWh/天。如果10%为长查询,需求可能增长到1.8 GWh/天。通过针对性的效率干预,它降至0.9 GWh/天,与该规模的网络搜索能耗相当。这呼应了数据中心以往通过效率提升控制能耗增长的历史。

英文摘要

As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, leading to systematic overestimation. We introduce a bottom-up framework estimating inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicating widely cited estimates are overstated by 4-20x. In test-time scaling scenarios 15x longer than typical queries, the median energy rises 13x to 3.91 Wh (IQR 2.15-7.05). Across models, serving systems, and hardware, we estimate 8-20x line-of-sight energy reductions. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh/day; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling.
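The bottom-up estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' model: the function name and every input value (node power, PUE, tokens per query, node throughput) are hypothetical placeholders chosen only to show how the units combine. The last computation checks the abstract's own ratio between the long-query and typical-query medians.

```python
def energy_per_query_wh(node_power_w, pue, tokens_per_query, node_tokens_per_s):
    """Energy attributed to one query, in watt-hours (illustrative only)."""
    seconds_of_node_time = tokens_per_query / node_tokens_per_s
    joules = node_power_w * pue * seconds_of_node_time
    return joules / 3600.0  # 1 Wh = 3600 J


# Hypothetical inputs, not taken from the paper.
example_wh = energy_per_query_wh(
    node_power_w=10_000, pue=1.2, tokens_per_query=500, node_tokens_per_s=5_000
)

# Ratio of the medians quoted in the abstract: 3.91 Wh vs 0.31 Wh.
median_ratio = 3.91 / 0.31

print(f"{example_wh:.3f} Wh per query, long/typical ratio {median_ratio:.1f}x")
```

The ratio comes out near 12.6, which the abstract rounds to 13x.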

2510.18289 2026-06-11 cs.CL cs.CY cs.MA 版本更新

Food4All: An Agentic Framework and Benchmark for Food Resource Navigation with Adaptive User Understanding

Food4All: 一种具有自适应用户理解能力的食物资源导航智能体框架与基准

Yiyang Li, Weixiang Sun, Tianyi Ma, Kaiwen Shi, Zheyuan Zhang, Yanfang Ye

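AI summary: Proposes the Food4All framework, which pairs a food search tool with 300 multi-turn evaluation tasks, evaluates six large language models on 686 Indiana food resources, and diagnoses their weaknesses in constraint handling and non-ideal user interactions.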

详情
Comments
We have further refined the benchmark construction and experimental presentation to improve clarity and consistency. The revised version includes updated task design, food-resource data, and evaluation details to better align the benchmark with the intended food resource referral setting. These changes provide a more precise presentation of the experimental findings
AI中文摘要

食物援助推荐需要对话智能体将未明确指定且常含噪声的求助对话转化为本地有效的资源推荐。我们提出Food4All,一个基于686个结构化印第安纳食物资源的智能体食物资源推荐框架与基准。Food4All将食物特定搜索工具与300个多轮评估任务相结合,涵盖单一食物需求、具有访问或文件约束的复合案例,以及五种非理想用户交互特征:不合理要求、冗长回答、不耐烦、不完整答案和不一致信息。我们在需求理解、资源检索、最终推荐正确性和交互效率上评估了六种大语言模型。尽管最强模型达到了96.33%的推荐准确率,但我们的诊断揭示了在时间安排、资格、接收和文件约束方面的持续失败,以及在最终推荐中未能保留有效检索到的资源。特征级分析进一步表明,不同的非理想行为对推荐流程的不同部分造成压力。Food4All为在现实用户交互挑战下研究约束敏感的食物援助推荐中的工具调用智能体提供了一个受控测试平台。

英文摘要

Food assistance referral requires conversational agents to translate underspecified, often noisy help-seeking dialogues into locally valid resource recommendations. We present Food4All, an agentic food-resource referral framework and benchmark grounded in 686 structured Indiana food resources. Food4All couples a food-specific search tool with 300 multi-turn evaluation tasks spanning single food needs, composite cases with access or document constraints, and five non-ideal user interaction traits: unreasonable demands, rambling responses, impatience, incomplete answers, and inconsistent information. We evaluate six Large Language Models (LLMs) on requirement grounding, resource retrieval, final referral correctness, and interaction efficiency. Although the strongest model achieves 96.33% referral accuracy, our diagnostics reveal persistent failures in grounding schedule, eligibility, intake, and document constraints, as well as failures to preserve valid retrieved resources in the final recommendation. Trait-level analysis further shows that different non-ideal behaviors stress different parts of the referral pipeline. Food4All provides a controlled testbed for studying tool-calling agents in constraint-sensitive food assistance referral under realistic user interaction challenges.

2604.22167 2026-06-11 cs.LG cs.AI 版本更新

Estimating Tail Risks in Language Model Output Distributions

语言模型输出分布中的尾部风险估计

Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He

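AI summary: Proposes an importance-sampling method that builds unsafe versions of the target model to efficiently estimate the tail probability that a language model produces harmful outputs, matching Monte Carlo estimates with 10-20x fewer samples and revealing how sensitive models are to their inputs.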

详情
Comments
Accepted to ICML 2026
AI中文摘要

语言模型能力日益增强,并正在人口层面快速部署。因此,这些模型的安全性变得越来越重要。幸运的是,对齐方面的进展显著降低了模型产生有害输出的可能性。然而,当模型每天被查询数十亿次时,即使是罕见的最坏情况行为也会发生。当前的安全评估侧重于捕获产生有害输出的输入分布。这些评估忽略了模型的概率性质及其尾部输出行为。为了衡量这种尾部风险,我们提出了一种方法,可以高效估计任何输入查询产生有害输出的概率。我们不是从目标模型进行简单的暴力采样(其中有害输出可能很罕见),而是通过创建目标模型的不安全版本来实现重要性采样。这些不安全版本通过使有害输出更可能发生,实现了样本高效的估计。在衡量误用和未对齐的基准测试中,这些估计以少10-20倍的样本量达到了与暴力蒙特卡洛估计相当的结果。例如,我们仅用500个样本就可以估计数量级为10^-4的有害输出概率。此外,我们发现这些有害性估计可以揭示模型对输入扰动的敏感性,并预测部署风险。我们的工作表明,准确的小概率事件估计对于安全评估既关键又可行。代码可在以下网址获取:https://github.com/rangell/LMTailRisk

英文摘要

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate harmful-output probabilities on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
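The idea of sampling from a distribution where the rare event is common, then reweighting, can be illustrated on a toy problem. The sketch below is not the authors' method and involves no language model: it estimates the Gaussian tail probability P(Z > 3.7), which is on the order of 10^-4, from 500 draws of a shifted proposal. The threshold, proposal, seed, and function names are all assumptions made for illustration.

```python
import math
import random


def tail_prob_importance_sampling(threshold, n_samples, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # the proposal makes the rare event common
        if x > threshold:
            # Likelihood ratio N(0, 1) / N(threshold, 1) evaluated at x.
            total += math.exp(-threshold * x + 0.5 * threshold**2)
    return total / n_samples


def tail_prob_exact(threshold):
    """Closed-form Gaussian tail, used only as a reference value."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = tail_prob_importance_sampling(3.7, 500)
exact = tail_prob_exact(3.7)
print(f"importance sampling: {estimate:.2e}, exact: {exact:.2e}")
```

Naive sampling from N(0, 1) would hit this event roughly once in every 10,000 draws, so 500 naive draws would usually contain no hits at all.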

2604.13733 2026-06-11 cs.LG cs.AI cs.RO 版本更新

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

视觉-语言-动作跳跃启动用于强化学习机器人智能体

Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

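AI summary: Proposes VLAJS, which guides PPO exploration with sparse high-level action suggestions from a VLA model combined with a directional action-consistency regularization, improving the sample efficiency of reinforcement learning on long-horizon manipulation tasks, with validation in simulation and on a real robot.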

详情
Comments
ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning
AI中文摘要

强化学习(RL)能够实现机器人操作的高频闭环控制,但由于探索效率低下和信用分配不佳,在稀疏或不完美奖励的长时域任务中难以扩展。视觉-语言-动作(VLA)模型利用大规模多模态预训练提供通用任务级推理,但当前限制阻碍其直接用于快速精确操作。本文提出视觉-语言-动作跳跃启动(VLAJS),一种将稀疏VLA引导与在线策略RL相结合的方法,以改善探索和学习效率。VLAJS将VLA视为高层动作建议的瞬态来源,偏置早期探索并改善信用分配,同时保留RL的高频状态基控制。我们的方法用方向性动作一致性正则化增强近端策略优化(PPO),在早期训练中软对齐RL智能体的动作与VLA引导,而不强制严格模仿、需要演示或依赖持续教师查询。VLA引导稀疏应用并随时间退火,使智能体在线适应并最终超越引导策略。我们在六个挑战性操作任务上评估VLAJS:仿真中的提升、拾取与放置、销钉重定向、销钉插入、戳和推,并在真实Franka Panda机器人上验证子集。VLAJS在样本效率上持续优于PPO和蒸馏式基线,在多个任务中将所需环境交互减少超过50%。真实世界实验展示了零样本仿真到真实迁移以及在杂乱、物体变化和外部扰动下的鲁棒执行。

英文摘要

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.

2603.08501 2026-06-11 cs.CL 版本更新

Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq:一种用于基于经典伊斯兰问答的多智能体架构

Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam

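AI summary: To address hallucination and source misattribution by large language models in Islamic question answering, proposes Fanar-Sadiq, a system built on a multi-agent, tool-augmented architecture that combines intent-aware routing, retrieval-grounded fiqh answers, exact verse citation, and deterministic calculators, and reports efficient and accurate Islamic QA on public benchmarks.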

详情
Comments
Islamic QA; Religious NLP; Retrieval-Augmented Generation; Multi-Agent LLMs; Tool-Augmented Reasoning; Faithful Generation; Fiqh Reasoning
AI中文摘要

大型语言模型(LLM)能够流畅回答宗教知识查询,但经常产生幻觉并错误归因来源,这在伊斯兰环境中尤其严重,因为用户期望基于经典文本(《古兰经》和圣训)和教法(fiqh)细微差别的回答。检索增强生成(RAG)改善了基础性,但单一的检索-生成流程不足以处理多样化的伊斯兰查询,包括逐字经文、基于引用的指导以及规则约束的计算(如天课和遗产)。为了解决这些挑战,我们提出了Fanar-Sadiq,一个基于多智能体、工具增强架构的双语(阿拉伯语-英语)伊斯兰问答系统。它是Fanar AI平台的核心组件。Fanar-Sadiq将伊斯兰查询路由到智能体工具架构中的专门模块。它支持意图感知路由、带有标准化引用和验证轨迹的检索增强教法回答、带有引文验证的精确经文查找,以及具有教法学派敏感分支的确定性逊尼派天课和遗产计算器。我们在公开的伊斯兰问答基准上评估了端到端系统,显示出强大的有效性和效率。该系统通过API和Web应用程序公开访问,在不到一年的时间内已收到超过190万次访问(https://api.fanar.qa/docs)。

英文摘要

Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) improves grounding; however, a single retrieve-then-generate pipeline is insufficient for diverse Islamic queries, including verbatim scripture, citation-grounded guidance, and rule-constrained computations such as zakat and inheritance. To address these challenges, we present Fanar-Sadiq, a bilingual Arabic-English Islamic QA system built on a multi-agent, tool-augmented architecture. It is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic queries to specialized modules within an agentic tool architecture. It supports intent-aware routing, retrieval-grounded fiqh answers with normalized citations and verification traces, exact verse lookup with quotation validation, and deterministic Sunni zakat and inheritance calculators with madhhab-sensitive branching. We evaluate the end-to-end system on public Islamic QA benchmarks and show strong effectiveness and efficiency. It is publicly accessible through an API and Web application and has received over 1.9M accesses in less than a year (https://api.fanar.qa/docs).

2602.05746 2026-06-11 cs.LG cs.AI 版本更新

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

学习注入:通过强化学习实现自动化提示注入

Xin Chen, Jie Zhang, Florian Tramèr

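AI summary: Proposes AutoInject, a black-box reinforcement learning framework that automatically learns adversarial suffixes for prompt injection; on AgentDojo it outperforms template attacks and several adaptive attacks, and it breaks a model fine-tuned specifically as a defense.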

详情
AI中文摘要

提示注入是LLM代理中的一个关键漏洞,然而最强的方法仍然依赖于人类红队和手工制作的提示。适应自动化越狱优化器并不能缩小这一差距:越狱使模型趋向于通用顺从,而提示注入需要发出具有正确参数的特定工具调用。成功信号是二元的,随机采样的后缀几乎从不触发它,因此标准优化器没有梯度可循。我们提出了AutoInject,一个黑盒强化学习(RL)框架,学习用于提示注入的对抗性后缀。一个学习的基于比较的奖励对每个候选后缀与迄今为止看到的最佳后缀进行评分,将二元信号转化为适合RL优化的密集奖励。该框架支持在线基于查询的攻击和离线训练的可迁移后缀(部署时无需实用访问),并在任务完成反馈可用时纳入实用目标。在AgentDojo上,AutoInject在生产模型中优于模板攻击、GCG、TAP和自适应攻击,在McNemar检验下具有统计显著性(p<0.05)。AutoInject学习的后缀还打破了Meta-SecAlign-70B,这是一个专门针对提示注入进行微调的模型,而模板攻击完全失败。这些结果为提示注入建立了自动化基线,并揭示了基于偏好的防御与基于自适应优化的攻击者之间的差距。

英文摘要

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attacks across production models, with statistically significant improvements under McNemar's test (p<0.05). Suffixes learned by AutoInject also break Meta-SecAlign-70B, a model fine-tuned specifically to resist prompt injection, where template attacks fail outright. The results establish an automated baseline for prompt injection and expose a gap between preference-based defenses and adaptive optimization-based attackers.

2602.10908 2026-06-11 cs.CL cs.LG stat.ML 版本更新

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

SoftMatcha 2:一种用于万亿级语料库的快速软模式匹配器

Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi

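AI summary: Proposes SoftMatcha 2, an ultra-fast soft search algorithm based on suffix arrays and word vectors; with dynamic corpus-aware pruning and a disk-aware design, it searches trillion-scale corpora in under 0.3 seconds for semantic variants involving substitution, insertion, and deletion, and it uncovers benchmark contamination.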

详情
Comments
Accepted at ICML2026. Project Page & Web Interface: https://softmatcha.github.io/v2/, Source Code: https://github.com/softmatcha/softmatcha2
AI中文摘要

我们提出SoftMatcha 2,一种超快速且灵活的搜索算法,能够在0.3秒内搜索万亿规模的自然语言语料库,同时允许以替换、插入和删除形式进行的语义变体。我们的方法采用基于后缀数组的字符串匹配,该数组随语料库规模扩展良好,并将单词表示为向量,这支撑了其语义灵活性。为了缓解查询语义放松导致的组合爆炸,我们的方法建立在两个关键算法思想上:动态语料感知剪枝和由磁盘感知设计实现的快速精确查找。我们从理论上分析了所提出方法的效率,表明它可以缓解搜索空间的指数增长。在FineWeb-Edu(Lozhkov等人,2024)(1.4T tokens)上的实验表明,与现有方法infini-gram(Liu等人,2024)、infini-gram mini(Xu等人,2025)和SoftMatcha(Deguchi等人,2025)相比,它实现了显著更低的搜索延迟。作为实际应用,我们的方法发现了现有方法遗漏的训练语料库中的基准污染,并且也有利于信息检索和释义检测。我们还提供了一个在线演示,支持七种语言的语料库快速软搜索。

英文摘要

We present SoftMatcha 2, an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while allowing semantic variations in the form of substitution, insertion, and deletion. Our approach employs string matching based on suffix arrays that scales well with corpus size, and represents words as vectors, which underpin its semantic flexibility. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: dynamic corpus-aware pruning and fast exact lookup enabled by a disk-aware design. We theoretically analyze the efficiency of the proposed method, indicating that it can mitigate exponential growth in the search space. Empirically, on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), it attains substantially lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, our method uncovers benchmark contamination in training corpora that existing approaches miss, and it also benefits information retrieval and paraphrase detection. We also provide an online demo of fast, soft search across corpora in seven languages.
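Exact lookup with a suffix array, the building block the abstract mentions, can be sketched as follows. This toy version works on an in-memory token list and covers exact matching only; the soft matching, pruning, and disk-aware design of SoftMatcha 2 are not reproduced, and the function names are hypothetical.

```python
def build_suffix_array(tokens):
    """Naive construction: sort suffix start positions by the suffix they begin.
    Fine for a demo, far too slow for a real corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, pattern):
    """Return sorted start positions where `pattern` (a list of tokens) occurs."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return tokens[start:start + m]

    # Lower bound: first suffix whose m-token prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-token prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


tokens = "the cat sat on the mat the cat ran".split()
suffix_array = build_suffix_array(tokens)
print(find_occurrences(tokens, suffix_array, ["the", "cat"]))
```

Because all suffixes sharing a prefix are contiguous in the sorted order, each query costs two binary searches regardless of how many matches exist.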

2512.22088 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

统一Transformer缩放定律中的学习动力学与泛化

Chiwun Yang

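AI summary: This paper formalizes Transformer learning dynamics as an ODE system approximated by kernel behavior, rigorously analyzes the generalization error under stochastic gradient descent training, reveals a two-phase transition from exponential to power-law decay of the generalization error as compute scales, and establishes tight upper and lower bounds.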

详情
Comments
87 pages, 10 figures, 3 tables
AI中文摘要

缩放定律是大语言模型(LLM)发展的基石,预测了模型性能随计算资源增加而提升。然而,尽管经验上得到验证,其理论基础仍不清晰。本文形式化了基于Transformer的语言模型的学习动力学为一个常微分方程(ODE)系统,然后将该过程近似为核行为。与之前的玩具模型分析不同,我们严格分析了在序列到序列数据上具有任意数据分布的多层Transformer的随机梯度下降(SGD)训练,紧密反映了真实世界条件。我们的分析刻画了随着计算资源随数据缩放时,泛化误差收敛到不可约风险的过程,特别是在优化过程中。我们建立了过剩风险的匹配上下界,其特征是明显的相变。在初始优化阶段,过剩风险相对于计算成本${\sf C}$呈指数衰减。然而,一旦超过特定的资源分配阈值,系统进入统计阶段,泛化误差遵循$\Theta(\mathsf{C}^{-1/7})$的幂律衰减。这些速率通过互补的下界得到证实——统计方面通过信息论的两点约简,优化方面通过一阶预言机论证——使得两阶段定律在常数、对数因子和条件数差距内是紧的。除了这个统一框架,我们的理论还推导了模型大小、训练时间和数据集大小的独立缩放定律,阐明了每个变量如何独立地控制泛化的边界。

英文摘要

The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish matching upper and lower bounds on the excess risk, characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost ${\sf C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/7})$. These rates are certified by complementary lower bounds -- statistical, via an information-theoretic two-point reduction, and optimization-side, via a first-order oracle argument -- rendering the two-stage law tight up to constants, logarithmic factors, and a condition-number gap. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the bounds of generalization.
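One consequence of the stated $\Theta(\mathsf{C}^{-1/7})$ rate can be checked with simple arithmetic. Assuming the excess risk is exactly proportional to $\mathsf{C}^{-1/7}$ in the statistical phase (constants and logarithmic factors ignored, which is a simplification), the sketch below computes how much more compute a given reduction in excess risk would need. The function name is hypothetical.

```python
def compute_multiplier(risk_reduction_factor, exponent=1.0 / 7.0):
    """Factor by which compute must grow to shrink excess risk by the given factor,
    assuming excess risk is proportional to C ** (-exponent)."""
    return risk_reduction_factor ** (1.0 / exponent)


halving_cost = compute_multiplier(2.0)
print(f"halving excess risk needs about {halving_cost:.0f}x compute")
```

Under this assumption, halving the excess risk costs 2^7 = 128 times the compute, which shows how slow a -1/7 power law is compared with the exponential decay of the earlier phase.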

2601.21898 2026-06-11 cs.AI cs.CR 版本更新

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

通过尺度敏感损失景观使模型不可合并

Minwoo Jang, Hoyoung Kim, Jabin Koo, Jungseul Ok

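AI summary: Proposes the Trap$^2$ framework, which encodes protection during fine-tuning so that a model remains effective in standalone use but degrades under the weight rescaling common in merging, thereby undermining unauthorized model combination.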

详情
Comments
Appears in ICML 2026
AI中文摘要

模型中心的兴起使得访问可重用模型组件变得更加容易,使模型合并成为组合能力的实用工具。然而,这种模块化也造成了治理缺口:下游用户可以将发布的权重重新组合成未经授权的混合体,绕过安全对齐或许可条款。由于现有防御措施大多是事后且特定于架构的,它们在实际中无法为不同架构和发布格式提供一致的保护。为了弥补这一缺口,我们提出了Trap$^2$,一个架构无关的保护框架,在微调过程中将保护编码到更新中,无论这些更新是作为适配器还是完整模型发布。Trap$^2$不依赖架构特定的方法,而是使用权重重新缩放作为合并过程的简单代理。它使发布的权重在单独使用时保持有效,但在合并中常见的重新缩放下性能下降,从而破坏未经授权的重新组合。

英文摘要

The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a governance gap: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose Trap$^2$, an architecture-agnostic protection framework that encodes protection into updates during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, Trap$^2$ uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized recomposition.

2601.03792 2026-06-11 cs.CL 版本更新

VietMed-MCQ: A Consistency-Filtered Data Synthesis Framework for Vietnamese Traditional Medicine Evaluation

VietMed-MCQ:面向越南传统医学评估的一致性过滤数据合成框架

Huynh Trung Kiet, Dao Sy Duy Minh, Nguyen Dinh Ha Duong, Le Hoang Minh Huy, Long Nguyen, Dien Dinh

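AI summary: Presents VietMed-MCQ, a dataset built with retrieval-augmented generation and consistency filtering that contains 3,190 multiple-choice questions; validation by a medical expert and four students gives a 94.2% approval rate, and benchmarking shows general-purpose models outperforming Vietnamese-centric models.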

详情
Comments
The authors have withdrawn this article because the current version is still undergoing substantial revision. Several components of the data synthesis framework, consistency-filtering procedure, evaluation protocol, and experimental analysis are being refined and expanded. As a result, the current manuscript should not be considered a complete or final representation of the work
AI中文摘要

大型语言模型(LLM)在通用医学领域表现出显著能力,但在越南传统医学(VTM)等特定文化领域性能大幅下降,主要原因是缺乏高质量、结构化的基准。本文提出VietMed-MCQ,一个通过检索增强生成(RAG)管道和自动一致性检查机制生成的新型多选题数据集。与之前的合成数据集不同,我们的框架采用双模型验证方法,通过独立答案验证确保推理一致性,尽管基于子串的证据检查存在已知局限性。完整数据集包含3190道题,涵盖三个难度级别,并经过一名医学专家和四名学生的验证,达到94.2%的通过率,且评分者间一致性较高(Fleiss' kappa = 0.82)。我们在VietMed-MCQ上对七个开源模型进行了基准测试。结果显示,具有强中文先验的通用模型优于以越南语为中心的模型,突显了跨语言概念迁移,但所有模型在复杂诊断推理方面仍存在困难。我们的代码和数据集已公开,以促进低资源医学领域的研究。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable proficiency in general medical domains. However, their performance significantly degrades in specialized, culturally specific domains such as Vietnamese Traditional Medicine (VTM), primarily due to the scarcity of high-quality, structured benchmarks. In this paper, we introduce VietMed-MCQ, a novel multiple-choice question dataset generated via a Retrieval-Augmented Generation (RAG) pipeline with an automated consistency check mechanism. Unlike previous synthetic datasets, our framework incorporates a dual-model validation approach to ensure reasoning consistency through independent answer verification, though the substring-based evidence checking has known limitations. The complete dataset of 3,190 questions spans three difficulty levels and underwent validation by one medical expert and four students, achieving 94.2 percent approval with substantial inter-rater agreement (Fleiss' kappa = 0.82). We benchmark seven open-source models on VietMed-MCQ. Results reveal that general-purpose models with strong Chinese priors outperform Vietnamese-centric models, highlighting cross-lingual conceptual transfer, while all models still struggle with complex diagnostic reasoning. Our code and dataset are publicly available to foster research in low-resource medical domains.
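For readers unfamiliar with the agreement statistic quoted above, Fleiss' kappa can be computed from a table of rating counts as sketched below. The two-category toy tables are invented for illustration and are unrelated to the VietMed-MCQ annotations; only the five-rater setup mirrors the abstract.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put item i in category j.
    Every item must be rated by the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    return (p_bar - p_e) / (1.0 - p_e)


unanimous = fleiss_kappa([[5, 0], [0, 5]])  # five raters always agree
split = fleiss_kappa([[3, 2], [2, 3]])      # five raters split 3-2 on each item
print(unanimous, split)
```

The statistic is undefined when every rating falls in a single category, because chance agreement is then 1 and the denominator vanishes; the sketch does not guard against that case.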

2510.06596 2026-06-11 cs.CV cs.AI cs.IT cs.LG math.IT 版本更新

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

SDQM:用于目标检测数据集评估的合成数据质量指标

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

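AI summary: Proposes the SDQM metric, which assesses synthetic data quality without requiring model training to converge, correlates strongly with the mAP of YOLO11, and outperforms existing metrics.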

详情
Journal ref
Journal of Electronic Imaging 35(3), 033014 (2026)
Comments
Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3
AI中文摘要

机器学习模型的性能在很大程度上依赖于训练数据。大规模、良好标注数据集的稀缺给构建鲁棒模型带来了重大挑战。为了解决这一问题,通过模拟和生成模型产生的合成数据已成为一种有前景的解决方案,它增强了数据集的多样性,并提高了模型的性能、可靠性和韧性。然而,评估这些生成数据的质量需要一个有效的指标。我们引入了合成数据集质量指标(SDQM),用于评估目标检测任务的数据质量,而无需模型训练收敛。该指标能够更高效地生成和选择合成数据集,解决了资源受限的目标检测任务中的一个关键挑战。在我们的实验中,SDQM与领先的目标检测模型YOLO11的平均精度均值(mAP)得分表现出强相关性,而先前的指标仅表现出中等或弱相关性。此外,它提供了改进数据集质量的可操作见解,最大限度地减少了昂贵的迭代训练需求。这一可扩展且高效的指标为评估合成数据设立了新标准。SDQM的代码可从 https://github.com/ayushzenith/SDQM 获取。

英文摘要

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

2507.21164 2026-06-11 cs.LG cs.AI eess.IV stat.ML 版本更新

OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection

OCSVM引导的无监督异常检测表示学习

Nicolas Pinon, Robin Trombetta, Carole Lartizien

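AI summary: Proposes a method that couples representation learning with an analytically solvable One-Class SVM through a custom loss that directly aligns latent features with the decision boundary, and demonstrates robustness and performance on MNIST-C and a brain MRI lesion detection task.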

详情
AI中文摘要

无监督异常检测(UAD)旨在无需标签数据检测异常,这在许多机器学习应用中是必要的,因为异常样本稀少或不可用。大多数最先进的方法分为两类:基于重构的方法(通常重构异常过于完美)和与密度估计器解耦的表示学习(可能遭受次优特征空间)。虽然一些近期方法尝试耦合特征学习和异常检测,但它们通常依赖替代目标、限制核选择或引入近似,从而限制了表达能力和鲁棒性。为解决这一挑战,我们提出了一种新颖方法,通过自定义损失公式将表示学习与可解析求解的一类SVM(OCSVM)耦合,该损失直接使潜在特征与OCSVM决策边界对齐。该模型在两个任务上评估:基于MNIST-C的基准,以及具有挑战性的脑MRI病变检测任务。与大多数关注图像级别大而高信号病变的方法不同,我们的方法成功针对小而非高信号的病变,同时我们评估体素级别的指标,处理了更具临床相关性的场景。两个实验评估了对领域偏移的鲁棒性形式,包括MNIST-C中的损坏类型以及MRI中的纹理或人群年龄变化。结果展示了我们提出模型的性能和鲁棒性,突显了其在通用UAD和现实医学成像应用中的潜力。源代码可在 https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning 获取。

英文摘要

Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C, and a challenging brain MRI lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds in targeting small, non-hyperintense lesions, and we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning.

2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

详情
英文摘要

Current navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.
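The economic scoring can be illustrated with a per-run contribution margin, that is, revenue minus variable cost. The sketch assumes a simplified cost breakdown; the field names and all dollar figures are hypothetical and are not CostNav's actual cost model or data.

```python
from dataclasses import dataclass


@dataclass
class RunEconomics:
    """Hypothetical per-run figures in US dollars."""
    revenue: float
    energy_cost: float
    maintenance_cost: float
    damage_cost: float

    def contribution_margin(self) -> float:
        variable_cost = self.energy_cost + self.maintenance_cost + self.damage_cost
        return self.revenue - variable_cost


def is_viable(runs):
    """Treat a method as viable here if its mean margin per run is positive."""
    mean_margin = sum(r.contribution_margin() for r in runs) / len(runs)
    return mean_margin > 0.0, mean_margin


runs = [
    # A successful delivery earns revenue and incurs only routine costs.
    RunEconomics(revenue=5.0, energy_cost=0.5, maintenance_cost=1.0, damage_cost=0.0),
    # A failed run earns nothing and incurs a collision-related cost.
    RunEconomics(revenue=0.0, energy_cost=0.5, maintenance_cost=1.0, damage_cost=40.0),
]
viable, mean_margin = is_viable(runs)
print(viable, mean_margin)
```

In this made-up example a single costly failure outweighs a successful delivery, which is the kind of effect a success-rate metric alone would hide.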

2512.08343 2026-06-11 cs.AI

Soil Compaction Parameters Prediction Based on Automated Machine Learning Approach

Caner Erden, Alparslan Serhat Demir, Abdullah Hulusi Kokcam, Talas Fikret Kurnaz, Ugur Dagdeviren

详情
Journal ref
Computers & Industrial Engineering, 2026
Comments
Presented at the 13th International Symposium on Intelligent Manufacturing and Service Systems, Duzce, Turkey, Sep 25-27, 2025. Also available on Zenodo: DOI 10.5281/zenodo.17533851
英文摘要

Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.
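The R-squared figures above measure the share of variance in held-out targets that the predictions explain. A minimal sketch of the computation follows; the sample values are made up and do not come from the study's soil dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical maximum dry density values (g/cm^3) and model predictions.
measured = [1.60, 1.70, 1.80, 1.90]
predicted = [1.65, 1.70, 1.75, 1.90]
score = r_squared(measured, predicted)
print(f"R^2 = {score:.3f}")
```

Reported as a percentage, a score of 0.804 corresponds to the 80.4% quoted for MDD.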