arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.22934 2026-06-11 cs.AI updated version

ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

ProGRank: 探针梯度重排序以防御密集检索器RAG免受语料投毒攻击

Xiangyu Yin, Yi Qi, Chih-Hong Cheng

Affiliations: Chalmers University of Technology, Sweden; University of Leeds, United Kingdom; Carl von Ossietzky University of Oldenburg, Germany

AI summary: Proposes ProGRank, a training-free, post hoc retriever-side defence that extracts instability signals from probe gradients under randomized perturbations and reranks with them, defending dense-retriever RAG against corpus poisoning.

Comments
Accepted by ECML PKDD 2026
AI Chinese abstract

检索增强生成(RAG)通过将生成基于检索到的证据来改进大语言模型应用,但也引入了语料投毒这一新的攻击面。在此场景中,攻击者注入或编辑段落,使其进入目标查询的Top-K结果并影响下游生成。现有防御通常依赖内容过滤、辅助模型或生成器端推理,这使部署复杂化。我们提出ProGRank,一种针对密集检索器RAG的事后、无需训练的检索器端防御。ProGRank在轻度随机扰动下对每个查询-段落对进行压力测试,从固定小参数子集中提取探针梯度,并推导出两个不稳定信号:表示一致性和分散风险。然后,它将这些信号与分数门控结合进行重排序。ProGRank保留原始段落内容,无需重新训练,并在部署的检索器不可用时支持基于代理的变体。跨数据集、检索器、攻击以及检索阶段和端到端设置的实验表明,ProGRank提高了鲁棒性,并保持了良好的鲁棒性-效用权衡,包括在自适应规避攻击下。

English abstract

Retrieval-Augmented Generation (RAG) improves large language model applications by grounding generation in retrieved evidence, but also introduces corpus poisoning as a new attack surface. In this setting, an adversary injects or edits passages so that they enter the Top-$K$ results for target queries and influence downstream generation. Existing defences often rely on content filtering, auxiliary models, or generator-side reasoning, which complicates deployment. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query--passage pair under mild randomized perturbations, extracts probe gradients from a small fixed parameter subset, and derives two instability signals: representational consistency and dispersion risk. It then combines these signals with a score gate for reranking. ProGRank preserves the original passage content, requires no retraining, and supports a surrogate-based variant when the deployed retriever is unavailable. Experiments across datasets, retrievers, attacks, and retrieval-stage and end-to-end settings show that ProGRank improves robustness and maintains a favorable robustness--utility trade-off, including under adaptive evasive attacks.

2603.24080 2026-06-11 cs.CL cs.DB updated version

LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

LLMpedia:一个大规模实现LLM百科全书知识的透明框架

Muhammed Saeed, Simon Razniewski

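Affiliations: ScaDS.AI Dresden/Leipzig & TU Dresden, Germany

AI summary: Proposes the LLMpedia framework, which extracts about 1.3M encyclopedia articles from three model families and audits them against Wikipedia and web evidence, finding a verifiable true rate well below MMLU scores and exposing a factuality gap in model knowledge.

AI Chinese abstract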

像MMLU这样的基准测试表明,旗舰语言模型的事实性饱和度超过90%。LLMpedia显示这一图景并不完整。我们从三个模型家族的参数记忆中具体化出约130万篇百科全书文章,然后针对维基百科和精选网络证据审计每一条声明。对于gpt-5-mini,在维基百科覆盖的主题上,可验证真实率为68.4%——比MMLU低超过21个百分点——这一差距主要由不可验证性(30.5%)驱动,而非反驳(1.2%)。在维基百科之外,针对精选网络证据审计的前沿文章达到57.6%;维基百科仅覆盖模型呈现主题的56.7%,三个模型家族在主题选择上仅有7.3%的重叠。在受先前Grokipedia分析启发的检索陷阱基准测试中,LLMpedia在文本相似度约为维基百科一半的情况下更加事实准确。每个提示、文章和判决都已发布。数据、代码、界面:此 https URL。

English abstract

Benchmarks like MMLU suggest flagship language models approach factuality saturation above 90%. LLMpedia shows this picture is incomplete. We materialize ~1.3M encyclopedia articles entirely from parametric memory across three model families, then audit every claim against Wikipedia and curated web evidence. For gpt-5-mini, the verifiable true rate is 68.4% on Wikipedia-covered subjects, more than 21 pp below MMLU, and the gap is driven by unverifiability (30.5%), not refutation (1.2%). Beyond Wikipedia, frontier articles audited against curated web evidence reach 57.6%; Wikipedia covers only 56.7% of model-surfaced subjects, and three model families overlap in just 7.3% of subject choices. In a retrieval-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia is more factual at roughly half the textual similarity to Wikipedia. Every prompt, article, and verdict is released. Data, code, interface: this https URL.

2601.22725 2026-06-11 cs.CV cs.AI updated version

OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

OpenVTON-Bench:用于可控虚拟试穿评估的大规模高分辨率基准

Jin Li, Tao Chen, Kai Wen, Siqi Yin, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu

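Affiliations: Renxing Intelligence, Hangzhou, China; Hangzhou Dianzi University, Hangzhou, China

AI summary: Proposes OpenVTON-Bench, about 100K high-resolution image pairs built with DINOv3 clustering and Gemini captioning, together with a multi-modal evaluation protocol that scores try-on quality along five dimensions and agrees strongly with human judgments.

Comments
Under review for the NeurIPS 2026 Datasets and Benchmarks Track
AI Chinese abstract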

近期扩散模型的进展显著提升了虚拟试穿(VTON)系统的视觉保真度,但可靠的评估仍是一个持续的瓶颈。传统指标难以量化细粒度的纹理细节和语义一致性,而现有数据集在规模和多样性上无法满足商业标准。我们提出了OpenVTON-Bench,一个大规模基准,包含约10万对高分辨率图像(最高$1536 \times 1536$)。该数据集使用基于DINOv3的层次聚类进行语义平衡采样,并借助Gemini驱动的密集描述,确保在20个细粒度服装类别上均匀分布。为支持可靠评估,我们提出了一种多模态协议,沿五个可解释维度衡量VTON质量:背景一致性、身份保真度、纹理保真度、形状合理性和整体真实感。该协议将基于VLM的语义推理与基于SAM3分割和形态学腐蚀的新型多尺度表示度量相结合,能够分离边界对齐误差与内部纹理伪影。实验结果表明,该协议与人类判断高度一致(Kendall's $\tau$为0.833,而SSIM为0.611),为VTON评估建立了稳健的基准。

English abstract

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
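As a worked illustration of the agreement statistic quoted above, Kendall's $\tau$ measures how consistently two scorings order the same items. The sketch below implements the tau-a variant in plain Python. Which variant the authors used is not stated in the abstract, and the `human` and `metric` values are made-up illustration data, not results from the paper.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: human ranks for five try-on results and an automatic score.
human = [1, 2, 3, 4, 5]
metric = [0.10, 0.35, 0.30, 0.70, 0.90]
print(kendall_tau_a(human, metric))
```

Only one of the ten pairs (items 2 and 3) is ordered differently by the two scorings, so nine pairs are concordant and one is discordant.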

2603.19225 2026-06-11 cs.CE cs.AI cs.CL cs.IR q-fin.CP updated version

FinTradeBench: A Financial Reasoning Benchmark for LLMs

FinTradeBench: 面向LLM的金融推理基准

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta

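Affiliations: University of Central Florida

AI summary: Proposes the FinTradeBench benchmark, which combines company fundamentals with trading signals to evaluate LLM financial reasoning, and finds that retrieval augmentation offers limited help for numerical and time-series reasoning.

Comments
9 pages main text, 31 pages total (including references and appendix). 5 figures, 16 tables. Preprint under review. Code and data will be made available upon publication.
AI Chinese abstract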

现实世界的金融决策是一个具有挑战性的问题,需要对异构信号进行推理,包括从监管文件中提取的公司基本面和从价格动态计算出的交易信号。最近,随着大语言模型(LLM)的进步,金融分析师开始将它们用于金融决策任务。然而,现有的用于测试这些模型的金融问答基准主要关注公司资产负债表数据,很少评估关于公司股票如何在市场中交易或它们与基本面相互作用的推理。为了利用这两种方法的优势,我们引入了FinTradeBench,这是一个评估金融推理的基准,它整合了公司基本面和交易信号。FinTradeBench包含1400个问题,这些问题基于纳斯达克-100公司十年历史窗口的数据。该基准分为三个推理类别:基本面聚焦、交易信号聚焦以及需要跨信号推理的混合问题。为了确保大规模可靠性,我们采用了一个校准然后扩展的框架,该框架结合了专家种子问题、多模型响应生成、模型内自过滤、数值审计以及人类-LLM判断对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著改善了对文本基本面的推理,但对交易信号推理的益处有限。这些发现突显了当前LLM在数值和时间序列推理方面的根本性挑战,并激励了未来在金融智能方面的研究。

English abstract

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

2603.20190 2026-06-11 cs.CV updated version

CoVR-R: Reason-Aware Composed Video Retrieval

CoVR-R: 推理感知的组合视频检索

Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

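Affiliations: Mohamed bin Zayed University of AI; University of Chicago; University of Wisconsin-Madison; Linköping University

AI summary: Proposes a zero-shot, reasoning-first approach that uses large multimodal models to infer the causal and temporal after-effects of an edit, and builds the CoVR-Reason benchmark for evaluation; the method significantly outperforms strong baselines on implicit-effect subsets.

Comments
9 pages, 3 figures
AI Chinese abstract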

组合视频检索(CoVR)旨在根据参考视频和文本修改找到目标视频。先前的工作假设修改文本完全指定了视觉变化,忽略了编辑产生的后效和隐含后果(例如,运动、状态转换、视角或持续时间线索)。我们认为成功的CoVR需要对这些后效进行推理。我们提出了一种推理优先的零样本方法,利用大型多模态模型(i)推断编辑所隐含的因果和时序后果,以及(ii)将得到的推理查询与候选视频对齐,无需任务特定的微调。为了评估CoVR中的推理能力,我们还提出了CoVR-Reason基准,该基准将每个(参考、编辑、目标)三元组与结构化的内部推理轨迹和具有挑战性的干扰项配对,这些干扰项需要预测后效而不是关键词匹配。实验表明,我们的零样本方法在召回率@K上优于强检索基线,并且在隐式效应子集上尤其出色。我们的自动和人工分析证实了检索结果中更高的步骤一致性和效果真实性。我们的发现表明,将推理纳入通用多模态模型可以通过明确考虑因果和时序后效来实现有效的CoVR。这减少了对任务特定监督的依赖,提高了对具有挑战性的隐式效应案例的泛化能力,并增强了检索结果的可解释性。这些结果指向了一个可扩展且原则性的可解释视频搜索框架。模型、代码和基准可在该网址获取。

English abstract

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analyses confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at this https URL.

2603.00461 2026-06-11 cs.CV updated version

ReMoT: Reinforcement Learning with Motion Contrast Triplets

ReMoT: 基于运动对比三元组的强化学习

Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong

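Affiliations: Xi’an Jiaotong University; Shenzhen University of Advanced Technology; DAMO Academy, Alibaba Group

AI summary: Proposes ReMoT, a unified training paradigm that addresses VLM spatio-temporal consistency through a rule-generated large-scale motion-contrast dataset and Group Relative Policy Optimization, yielding a 25.1% gain on spatio-temporal reasoning tasks.

Comments
CVPR 2026 Highlight
AI Chinese abstract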

我们提出ReMoT,一种统一的训练范式,系统地解决VLM在时空一致性方面的基本缺陷——这是导航、机器人和自动驾驶中的关键失败点。ReMoT整合了两个核心组件:(1) 一个基于规则的自动框架,生成ReMoT-16K,这是一个大规模(16.5K三元组)的运动对比数据集,源自视频元注释,超越了昂贵的手动或基于模型的生成。(2) 组相对策略优化,我们经验验证了它在学习这种对比推理时产生最优性能和数据效率,远远超过标准的监督微调。我们还构建了第一个细粒度运动对比三元组基准,用于衡量VLM对细微运动属性(例如相反方向)的辨别能力。由此产生的模型在我们的新基准和多个标准VLM基准上实现了最先进的性能,最终在时空推理任务上实现了惊人的25.1%的性能飞跃。

English abstract

We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.

2603.14762 2026-06-11 math.OC cs.LG eess.SY updated version

Online Learning for Supervisory Switching Control

在线学习用于监督切换控制

Haoyuan Sun, Ali Jadbabaie

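Affiliations: Massachusetts Institute of Technology

AI summary: Studies online learning for supervisory switching control in partially observed linear dynamical systems, and proposes a non-asymptotic analysis that adapts multi-armed bandit algorithms to identify a stabilizing controller and perform system identification.

AI Chinese abstract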

我们研究了部分观测线性动态系统中的监督切换控制。目标是通过周期性选择一组N个候选控制器中的一个,来识别并部署适合的控制器。经典估计器基于监督控制保证渐近稳定性,但缺乏有限时间性能界限。相反,当前在线学习和系统识别中的非渐近方法需要限制性假设,如系统稳定性,这在控制设置中不兼容,从而排除了测试可能不稳定控制器的可能性。为弥合这一差距,我们提出了一种新颖的非渐近监督控制分析,将多臂老虎机算法适应到控制理论设置中。所提出的数据驱动算法通过评分标准评估候选控制器,利用系统可观测性来隔离状态历史的影响,从而既能检测不稳定控制器,又能实现准确的系统辨识。我们提出了两种算法变体,具有无维度、有限时间保证,其中每个算法在O(N log²N)步内识别匹配控制器,同时在系统扰动下实现有限的L₂增益。

English abstract

We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy a suitable controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the matching controller in $O(N \log^2 N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.

2408.02600 2026-06-11 cs.CL updated version

BioMamba: Domain-Adaptive Biomedical Language Models

BioMamba: 领域自适应的生物医学语言模型

Ling Yue, Mingzhi Zhu, Sixue Xing, Yunning Cao, Yanbo Wang, Shimin Shan, Jinfei Liu, Vijil Chenthamarakshan, Shaowu Pan, Payel Das, Tianfan Fu

Affiliations: Rensselaer Polytechnic Institute; Jiaxing New Jies Thermal Power Co. Ltd.; North University of China; Zhejiang University; IBM Research; Nanjing University

AI summary: Proposes BioMamba, a Mamba2-based domain-adaptive pretraining approach that continues training on a mixture of PubMed, C4 and Wikipedia, substantially lowering biomedical perplexity while retaining general language ability.

AI Chinese abstract

背景。生物医学语言模型应在提升生物医学文本性能的同时保持通用语言模型的流畅性。对于基于Mamba的模型,这种权衡在生物医学文献和临床文本中尚未得到系统研究。方法。我们开发了BioMamba,一个包含五个规模的生物医学Mamba2模型家族,通过在PubMed摘要、Colossal Clean Crawled Corpus (C4)和Wikipedia的80%/10%/10%平衡混合数据上对已发布的公开Mamba2检查点进行持续预训练得到。贡献在于自适应配方和附带的开放权重检查点。结果。在五个规模上,BioMamba一致降低了PubMed困惑度,将Wikipedia风格的保留困惑度提高了1.46-4.72 PPL,而C4困惑度基本不变。在六个域外多项选择基准上,BioMamba保持在Mamba2的+/-3个百分点内,没有系统性退化。经过监督微调后,BioMamba+SFT在每个评估规模上匹配或超过Mamba2+SFT在MIMIC-IV笔记补全和出院总结生成上的表现,并在每个规模上改进了PubMedQA。最强模型(BioMamba-2.7B)在PubMed上达到5.28的困惑度,在BioASQ和PubMedQA上分别达到90.24%和73.00%的准确率。结论。平衡的领域自适应持续预训练配方增强了Mamba2语言模型在生物医学文献和临床文本上的性能,同时保持了通用语言建模的流畅性。

English abstract

Background. Biomedical language models should improve performance on biomedical text while retaining general-language-modeling fluency. For Mamba-based models, this trade-off has not been systematically studied across biomedical literature and clinical text. Methods. We developed BioMamba, a family of biomedical Mamba2 models at five scales obtained by continued pretraining of released public Mamba2 checkpoints on a balanced 80%/10%/10% mixture of PubMed abstracts, the Colossal Clean Crawled Corpus (C4), and Wikipedia. The contribution is the adaptation recipe and the accompanying open-weight checkpoints. Results. Across five scales, BioMamba consistently lowered PubMed perplexity, improved Wikipedia-style held-out perplexity by 1.46-4.72 PPL, and left C4 perplexity essentially unchanged. On six out-of-domain multiple-choice benchmarks, BioMamba stayed within +/-3 percentage points of Mamba2 with no systematic regression. After supervised fine-tuning, BioMamba+SFT matched or exceeded Mamba2+SFT on MIMIC-IV note completion and discharge summary generation at every evaluated scale, and improved PubMedQA at every scale. The strongest model (BioMamba-2.7B) reached a PubMed perplexity of 5.28 and accuracies of 90.24% and 73.00% on BioASQ and PubMedQA, respectively. Conclusions. A balanced domain-adaptive continued pretraining recipe strengthens Mamba2 language models on biomedical literature and clinical text while preserving general-language-modeling fluency.
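To make the two quantities in this abstract concrete, the sketch below shows how an 80%/10%/10% source mixture could be sampled and how perplexity follows from per-token log-probabilities. It is an illustration only: the names `MIXTURE`, `sample_sources` and `perplexity` are hypothetical, and the authors' actual data pipeline is not described here.

```python
import math
import random

# Mixture weights taken from the abstract; the dictionary itself is illustrative.
MIXTURE = {"pubmed": 0.8, "c4": 0.1, "wikipedia": 0.1}


def sample_sources(n, seed=0):
    """Draw n source names according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


draws = sample_sources(10_000)
print({name: draws.count(name) / len(draws) for name in MIXTURE})

# A model that assigns probability 1/4 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 8))
```

Lower perplexity means the model assigns higher probability to the held-out text, which is why the abstract reports a reduction as an improvement.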

2601.10724 2026-06-11 cs.RO updated version

Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty

具有状态依赖摩擦不确定性的车辆队列自适应滑模控制

Rishabh Dev Yadav

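Affiliations: Robotics Research Center, International Institute of Information Technology Hyderabad

AI summary: Proposes an adaptive sliding mode controller for vehicle platoons facing unknown, state-dependent friction; it handles friction uncertainty without prior knowledge and achieves speed regulation and spacing maintenance.

AI Chinese abstract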

多机器人编队控制在车辆编队、队列、载荷运输和监视等领域有广泛应用。在车辆队列中保持编队需要设计合适的控制方案,能够处理外部干扰和不确定的系统参数,同时保持机器人之间预定义的安全距离。此背景下的一个关键挑战是处理车轮与地面之间未知/不确定的摩擦力,这些摩擦力随路面变化、轮胎磨损和车辆速度而变化。尽管最先进的自适应控制器可以处理先验有界的不确定性,但它们难以准确建模和识别摩擦力,这些摩擦力通常是状态依赖的且无法先验有界。本文提出了一种新的基于轮式移动机器人的车辆队列自适应滑模控制器,无需先验了解摩擦力的参数和结构即可处理其未知和复杂的行为。该控制器利用自适应滑模控制技术来调节队列速度并保持预定义的机器人间距离,即使在存在外部干扰和不确定系统参数的情况下也是如此。该方法包括两个阶段:首先,运动学控制器根据期望轨迹计算期望速度;其次,动力学模型生成命令以实现期望运动。通过分离机器人的运动学和动力学,该方法可以简化控制问题,并实现对轮式移动机器人更高效、更鲁棒的控制。

English abstract

Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses the adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.

2511.05203 2026-06-11 cs.RO updated version

SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation

SIL: 语言条件的人机协同适应的共生交互学习

Linus Nwankwo, Bjoern Ellensohn, Christian Rauch, Elmar Rueckert

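Affiliations: Technical University of Leoben

AI summary: Proposes the Symbiotic Interactive Learning (SIL) framework for bidirectional human-agent co-adaptation in a shared latent task space; using joint belief states, FM-based spatial reasoning and a memory architecture, it reaches a 90.4% task completion rate and a belief alignment score of 0.83 on tasks such as instruction following.

AI Chinese abstract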

当今的自主智能体主要由基础模型(FMs)驱动,能够理解自然语言指令并以类似人类的推理解决长期任务。然而,当前的人机交互框架大多遵循单向的主从技术,其中具身智能体被动执行命令而没有互惠学习。这忽视了日常人际交互中协同适应、多轮交互的本质。我们引入了共生交互学习(SIL),一个在共享潜在任务空间中的双向协同适应框架,其中人类和智能体都维护着随交互历史演变的联合信念状态。这使得主动澄清、适应性建议和共享计划细化成为可能。SIL利用FMs进行空间感知和推理,并结合一个三元组损失训练的神经编码器,将FMs的输出嵌入到任务特定的潜在表示中。为了支持任务演变时的长期稳定性,SIL利用情景记忆和语义记忆架构,并通过弹性权重巩固进行正则化以减轻灾难性遗忘。我们在模拟和真实世界的具身任务上评估SIL,包括指令跟随、信息检索、查询导向推理和交互式对话,实现了90.4%的任务完成率和ρ≈0.83的信念对齐分数,比最佳消融实验绝对提高了约20个百分点。演示和资源:此https URL。

English abstract

Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long-horizon tasks with human-like reasoning. However, current human-robot interaction frameworks largely follow a one-way master-apprentice technique where the embodied agent passively executes commands without reciprocal learning. This neglects the co-adaptive, multi-turn nature of everyday human-to-human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co-adaptation framework in a shared latent task space, where both the human and the agent maintain joint belief states that evolve with the interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet-loss-trained neural encoder that grounds the FMs' outputs into task-specific latent representations. To support long-term stability as tasks evolve, SIL utilises episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogue, achieving a $90.4\%$ task completion rate and a belief alignment score of $\rho \approx 0.83$, an absolute improvement of about $20$ percentage points over the best ablations. Demos and resources: this https URL.

2603.13854 2026-06-11 cs.LO cs.AI cs.SC updated version

Power Term Polynomial Algebra for Boolean Logic

布尔逻辑的幂项多项式代数

Emanuele Sansone, Armando Solar-Lezama

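Affiliations: CSAIL, MIT; ESAT, KU Leuven; KU Leuven

AI summary: Proposes power term polynomial algebra, a representation language for Boolean formulae that sits between CNF and ANF; power terms and polynomials directly encode CNF clauses and families of monomials, avoiding auxiliary variables and constraints while supporting algebraic operations and rewrite rules.

Comments
Pragmatics of SAT
AI Chinese abstract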

我们引入了幂项多项式代数,这是一种布尔公式的表示语言,旨在桥联合取范式(CNF)和代数范式(ANF)。该语言的动机是这些表示之间的平铺不匹配:直接CNF<->ANF转换可能导致指数爆炸,除非公式被分解成更小的片段,通常通过辅助变量和侧面约束。相比之下,我们的框架在表示本身内部解决了这种不匹配,紧凑地编码了单项式的结构化族,同时直接表示CNF子句,从而在抽象层次上避免了辅助变量和约束。我们通过幂项和幂项多项式形式化了该语言,定义了它们的语义,并展示了它们允许对应于布尔多项式加法和乘法的代数运算。我们证明了该语言的几个关键性质:析取子句允许紧凑的规范表示;幂项支持局部缩短和扩展重写规则;原子项的乘积可以在语言内部系统地重写。这些结果共同产生了一个符号演算,使得无需将公式展开为普通ANF即可直接操作公式。由此产生的框架提供了一种新的中间表示和重写演算,桥接了基于子句和代数的推理,并为结构感知的CNF<->ANF转换和混合推理方法提出了新的方向。

English abstract

We introduce power term polynomial algebra, a representation language for Boolean formulae designed to bridge conjunctive normal form (CNF) and algebraic normal form (ANF). The language is motivated by the tiling mismatch between these representations: direct CNF<->ANF conversion may cause exponential blowup unless formulas are decomposed into smaller fragments, typically through auxiliary variables and side constraints. In contrast, our framework addresses this mismatch within the representation itself, compactly encoding structured families of monomials while representing CNF clauses directly, thereby avoiding auxiliary variables and constraints at the abstraction level. We formalize the language through power terms and power term polynomials, define their semantics, and show that they admit algebraic operations corresponding to Boolean polynomial addition and multiplication. We prove several key properties of the language: disjunctive clauses admit compact canonical representations; power terms support local shortening and expansion rewrite rules; and products of atomic terms can be systematically rewritten within the language. Together, these results yield a symbolic calculus that enables direct manipulation of formulas without expanding them into ordinary ANF. The resulting framework provides a new intermediate representation and rewriting calculus that bridges clause-based and algebraic reasoning and suggests new directions for structure-aware CNF<->ANF conversion and hybrid reasoning methods.
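The blowup mentioned in this abstract can be seen on a single clause. Over GF(2), a disjunction of $k$ literals equals $1 \oplus \prod_i (1 \oplus \ell_i)$, and for $k$ distinct positive literals the expansion contains $2^k - 1$ monomials. The sketch below expands a clause into ordinary ANF. It is not the paper's power term representation, and the function names are hypothetical.

```python
from itertools import product

# A polynomial over GF(2) is a frozenset of monomials;
# a monomial is a frozenset of variable names (the empty monomial is the constant 1).
ONE = frozenset({frozenset()})


def var(name):
    """Polynomial consisting of the single variable `name`."""
    return frozenset({frozenset({name})})


def add(p, q):
    """Addition over GF(2) is XOR: monomials appearing twice cancel."""
    return p ^ q


def mul(p, q):
    """Multiply two polynomials, using x * x = x and cancelling duplicates."""
    result = set()
    for a, b in product(p, q):
        result ^= {a | b}
    return frozenset(result)


def clause_to_anf(literals):
    """Expand a clause given as [(name, is_positive), ...] into ANF."""
    prod = ONE
    for name, is_positive in literals:
        lit = var(name) if is_positive else add(var(name), ONE)
        prod = mul(prod, add(ONE, lit))  # (1 + literal) is the negated literal
    return add(ONE, prod)  # clause = NOT(all literals false)


for k in range(1, 6):
    clause = [(f"x{i}", True) for i in range(k)]
    print(k, len(clause_to_anf(clause)))
```

The monomial count roughly doubles with each added literal, which is the motivation the abstract gives for a representation that keeps clauses compact.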

2602.06547 2026-06-11 cs.CR cs.AI cs.CL cs.ET updated version

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

“不要向用户提及此事”:检测与理解恶意代理技能

Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Leo Yu Zhang

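Affiliations: Griffith University; Nanyang Technological University; University of New South Wales; Zhejiang Key Laboratory of Digital Fashion and Data Governance, Zhejiang Sci-Tech University

AI summary: Through a systematic security analysis of 98,380 skills from two major registries, combining static pattern matching with dynamic behavioral verification, the paper identifies 157 malicious skills, uncovers 632 distinct vulnerabilities across 13 attack techniques, and finds that attack sophistication correlates with concealment investment.

Comments
Accepted to the 35th USENIX Security Symposium (USENIX Security 2026)
AI Chinese abstract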

基于LLM的编码代理越来越依赖称为技能的第三方扩展,这些技能捆绑了自然语言指令和辅助脚本,以完全用户权限执行。社区注册中心已出现以分发这些技能,但由于缺乏标记的威胁数据,安全影响仍未得到研究。本文对从两个主要注册中心收集的98,380个技能进行了系统安全分析。通过静态模式匹配和动态行为验证的结合,我们识别出157个表现出确认恶意行为的技能,涵盖13种攻击技术中的632个不同漏洞。我们的分析表明,这些威胁是故意的而非偶然:每个恶意技能平均包含4.03个漏洞,跨越多个攻击阶段。我们识别出两种具有统计显著负相关的主要攻击策略——通过远程代码执行窃取凭证,以及通过嵌入文档中的对抗性指令操纵代理。超过一半的确认案例来自一个采用模板化品牌冒充大规模攻击的单一威胁行为者。我们进一步观察到,攻击复杂性与隐藏投入相关,高级技能普遍使用未记录的功能,同时利用平台原生的信任机制。在负责任的披露之后,注册中心维护者删除了所有157个(100%)报告的技能。我们的数据集和检测管道公开可用,以促进未来关于保护LLM代理生态系统安全的研究。

English abstract

LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges. Community registries have emerged to distribute these skills, but the security implications remain unstudied due to the absence of labeled threat data. This paper presents a systematic security analysis of 98,380 skills collected from two major registries. Through a combination of static pattern matching and dynamic behavioral verification, we identify 157 skills exhibiting confirmed malicious behavior, encompassing 632 distinct vulnerabilities across 13 attack techniques. Our analysis reveals that these threats are deliberate rather than accidental: each malicious skill contains an average of 4.03 vulnerabilities spanning multiple attack phases. We identify two dominant attack strategies with statistically significant negative correlation -- credential theft via remote code execution, and agent manipulation through adversarial instructions embedded in documentation. Over half of all confirmed cases originate from a single threat actor employing templated brand impersonation at scale. We further observe that attack sophistication correlates with concealment investment, with advanced skills universally employing undocumented capabilities while also exploiting platform-native trust mechanisms. Following responsible disclosure, registry maintainers removed all 157 (100%) of the reported skills. Our dataset and detection pipeline are publicly available to facilitate future research on securing LLM agent ecosystems.

2511.16672 2026-06-11 cs.CV updated version

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

EvoLMM:具有连续奖励的自进化大型多模态模型

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

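Affiliations: Mohamed bin Zayed University of AI; Aalto University; Australian National University; Linköping University

AI summary: Proposes the EvoLMM framework, which instantiates two cooperative agents, a Proposer and a Solver, from a single backbone model and uses a continuous self-rewarding process to improve LMM reasoning without supervision, gaining about 3% on benchmarks such as ChartQA.

Comments
9 pages, 6 figures
AI Chinese abstract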

近年来,大型多模态模型(LMMs)的进展实现了令人印象深刻的推理和感知能力,但大多数现有训练流程仍依赖于人工策划的数据或外部验证的奖励模型,限制了其自主性和可扩展性。在这项工作中,我们致力于以纯无监督方式(无需任何标注数据或奖励蒸馏)提升LMM的推理能力。为此,我们提出了一个名为EvoLMM的自进化框架,该框架从单个骨干模型实例化两个协作智能体:提议者(Proposer),生成多样化的、基于图像的问题;以及求解者(Solver),通过内部一致性解决这些问题,学习过程通过连续的自奖励机制进行。这种动态反馈促进了信息性查询的生成和结构化推理的改进,而无需依赖真实标签或人工判断。当使用流行的Qwen2.5-VL作为基础模型时,我们的EvoLMM在多模态数学推理基准(包括ChartQA、MathVista和MathVision)上取得了约3%的持续提升,仅使用原始训练图像。我们希望这种简单而有效的方法能成为一个坚实的基线,促进未来在完全无监督方式下自我改进LMM的研究。我们的代码和模型可在该https URL获取。

English abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to $\sim$3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research on self-improving LMMs in a fully unsupervised fashion. Our code and models are available at this https URL.

2603.12261 2026-06-11 cs.LG cs.AI cs.CV updated version

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

潜在颜色子空间:高维混沌中的涌现秩序

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

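Affiliations: University of California, Berkeley; University of Toronto; University of Cambridge; University of Oxford

AI summary: Reveals an HSL structure in the color representation of the FLUX.1 Variational Autoencoder latent space, and proposes a training-free, closed-form latent-space manipulation method that predicts and explicitly controls the color of generated images.

Comments
Accepted at ICML 2026
AI Chinese abstract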

文本到图像生成模型已取得快速进展,但实现对生成图像的细粒度控制仍然困难,这主要源于对语义信息编码方式的理解有限。我们开发了对FLUX.1 [Dev]变分自编码器潜在空间中颜色表示的解释,揭示了一种反映色相、饱和度和明度的结构。我们通过证明潜在颜色子空间(LCS)解释能够预测并显式控制颜色,验证了其有效性,并引入了一种完全无需训练的FLUX方法,该方法仅基于闭式潜在空间操作。代码可在该https URL获取。

English abstract

Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at this https URL.

2603.11678 2026-06-11 eess.AS cs.SD updated version

RAF: Relativistic Adversarial Feedback For Universal Speech Synthesis

RAF:用于通用语音合成的相对论对抗反馈

Yongjoon Lee, Jung-Woo Choi

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

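AI summary Proposes Relativistic Adversarial Feedback (RAF), a training objective that uses self-supervised speech models and relativistic pairing to improve the in-domain fidelity and generalization of GAN vocoders, outperforming the LSGAN-trained BigVGAN with 88% fewer parameters.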

详情
Comments
Accepted to Interspeech 2026 Long paper track. Code: this https URL
AI中文摘要

我们提出相对论对抗反馈(RAF),一种用于GAN声码器的新型训练目标,可提高域内保真度和对未见场景的泛化能力。尽管现代GAN声码器采用先进架构,但其训练目标往往无法促进可泛化的表示。RAF通过利用语音自监督学习模型辅助判别器评估样本质量,鼓励生成器学习更丰富的表示来解决这一问题。此外,我们利用真实和虚假波形的相对论配对来改善训练数据分布的建模。跨多个数据集的实验表明,基于GAN的声码器在客观和主观指标上均获得一致提升。重要的是,经过RAF训练的BigVGAN-base仅使用12%的参数就在感知质量上优于经过LSGAN训练的BigVGAN。对比研究进一步证实了RAF作为GAN声码器训练框架的有效性。

英文摘要

We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12\% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
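
The relativistic pairing mentioned in the abstract can be illustrated with the standard relativistic GAN loss, in which the critic scores a real waveform relative to a paired fake one. This is a minimal sketch of that generic loss, not RAF's exact objective; the function names and the toy scores are assumptions.

```python
import math

def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def relativistic_losses(real_scores, fake_scores):
    """Standard relativistic GAN losses over paired critic scores.

    Discriminator: -log sigmoid(real - fake), which equals softplus(fake - real).
    Generator:     -log sigmoid(fake - real), which equals softplus(real - fake).
    """
    pairs = list(zip(real_scores, fake_scores))
    d_loss = sum(softplus(f - r) for r, f in pairs) / len(pairs)
    g_loss = sum(softplus(r - f) for r, f in pairs) / len(pairs)
    return d_loss, g_loss

# Hypothetical critic outputs for two real/fake waveform pairs.
d_loss, g_loss = relativistic_losses([2.0, 1.5], [-1.0, 0.5])
tie_d, tie_g = relativistic_losses([0.0], [0.0])
```

When the critic already ranks real above fake, the discriminator loss is small and the generator loss is large; when the scores tie, both equal log 2.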

2505.03649 2026-06-11 stat.ML cs.LG math.CO math.PR 版本更新

Weighted Random Dot Product Graphs

加权随机点积图

Bernardo Marenco, Paola Bermolen, Marcelo Fiori, Federico Larroca, Gonzalo Mateos

发表机构 * Facultad de Ingeniería Universidad de la República(工程学院乌拉圭共和国大学) Dept. of Electrical and Computer Engineering University of Rochester(电气与计算机工程系罗切斯特大学)

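AI summary Proposes the Weighted Random Dot Product Graph (WRDPG) model, which characterizes the higher-order moments of edge-weight distributions through inner products of nodal latent positions, and provides statistical guarantees for spectral-embedding estimation along with a generative framework.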

详情
Comments
30 pages, 12 figures, code to generate Figures 3 to 12 available at this https URL. Updated to match the published version
AI中文摘要

复杂关系模式的建模已成为当代统计研究和相关数据科学领域的基石。以图形式表示的网络为这种分析提供了自然框架。本文扩展了随机点积图(RDPG)模型以适应加权图,显著拓宽了该模型的适用范围,使其能够处理边权呈现异质分布的场景。我们提出了一种非参数加权(W)RDPG模型,为每个节点分配一系列潜位置。这些节点向量的内积通过矩生成函数指定其关联边权分布的矩。与现有技术不同,WRDPG能够区分具有相同均值但高阶矩不同的权重分布。我们推导了基于工作马邻接谱嵌入的节点潜位置估计量的统计保证,建立了其一致性和渐近正态性。我们还贡献了一个生成框架,能够采样符合(指定或数据拟合的)WRDPG的图,从而促进例如使用恰当的参考分布对观测图指标进行分析和检验。本文组织如下:形式化模型定义、估计(或节点嵌入)过程及其保证,以及生成加权图的方法,所有内容均辅以说明性和可重复的示例,展示WRDPG在各种网络分析应用中的有效性。

英文摘要

Modeling of intricate relational patterns has become a cornerstone of contemporary statistical research and related data science fields. Networks, represented as graphs, offer a natural framework for this analysis. This paper extends the Random Dot Product Graph (RDPG) model to accommodate weighted graphs, markedly broadening the model's scope to scenarios where edges exhibit heterogeneous weight distributions. We propose a nonparametric weighted (W)RDPG model that assigns a sequence of latent positions to each node. Inner products of these nodal vectors specify the moments of their incident edge weights' distribution via moment-generating functions. In this way, and unlike prior art, the WRDPG can discriminate between weight distributions that share the same mean but differ in other higher-order moments. We derive statistical guarantees for an estimator of the nodes' latent positions adapted from the workhorse adjacency spectral embedding, establishing its consistency and asymptotic normality. We also contribute a generative framework that enables sampling of graphs that adhere to a (prescribed or data-fitted) WRDPG, facilitating, e.g., the analysis and testing of observed graph metrics using judicious reference distributions. The paper is organized to formalize the model's definition, the estimation (or nodal embedding) process and its guarantees, as well as the methodologies for generating weighted graphs, all complemented by illustrative and reproducible examples showcasing the WRDPG's effectiveness in various network analytic applications.
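
To make the moment construction concrete, the sketch below treats each node as a sequence of latent vectors, one per moment order, and reads the k-th moment of an edge weight off the inner product of the k-th vectors. The node values are invented for illustration, and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def edge_moments(seq_i, seq_j):
    """k-th moment of the weight on edge (i, j), for k = 1..K."""
    return [dot(xi, xj) for xi, xj in zip(seq_i, seq_j)]

# Hypothetical latent position sequences:
# index 0 drives the first moment, index 1 the second moment.
node_a = [[1.0, 0.0], [1.0, 0.5]]
node_b = [[2.0, 1.0], [5.0, 0.0]]
node_c = [[2.0, 3.0], [8.0, 2.0]]

m_ab = edge_moments(node_a, node_b)  # [2.0, 5.0]
m_ac = edge_moments(node_a, node_c)  # [2.0, 9.0]

# Same mean, different variance (second moment minus squared mean).
var_ab = m_ab[1] - m_ab[0] ** 2
var_ac = m_ac[1] - m_ac[0] ** 2
```

Edges (a, b) and (a, c) share the mean weight 2.0 but have different variances, which is the distinction a mean-only model cannot express.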

2603.09715 2026-06-11 cs.AI 版本更新

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

问题真的重要吗?视觉-语言SFT的无训练数据选择

Peng Sun, Yi Yang, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li

发表机构 * Nanjing University(南京大学) Institute of Information Engineering(信息工程研究所) North University of China(中国北方大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

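AI summary Proposes CVS, which uses a frozen vision-language large model to assess how the question affects answer validity, selecting high-quality samples that require cross-modal reasoning without any training and outperforming full-data training on multiple datasets with a small fraction of the data.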

详情
AI中文摘要

视觉指令微调对于提升视觉-语言大模型(VLLMs)至关重要。然而,许多样本可以通过语言模式或常识捷径解决,无需真正的跨模态推理,限制了多模态学习的有效性。先前的数据选择方法通常依赖于代价高昂的代理模型训练,并侧重于难度或多样性,未能捕捉样本对视觉-语言联合推理的真实贡献。在本文中,我们提出CVS,一种基于以下洞见的无训练数据选择方法:对于高质量的多模态样本,引入问题应显著改变模型在给定图像下对答案有效性的评估。CVS利用冻结的VLLM作为评估器,测量在有/无问题条件下答案有效性的差异,从而识别需要视觉-语言联合推理的样本,同时过滤语义冲突噪声。在Vision-Flan和The Cauldron上的实验表明,CVS在数据集上取得了稳定的性能。在Vision-Flan上,CVS仅使用10%和15%的数据就分别比全量训练高出3.5%和4.8%,并且在高度异构的Cauldron数据集上保持鲁棒。此外,与COINCIDE和XMAS相比,CVS分别降低了17.3%和44.4%的计算成本。

英文摘要

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
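
The selection idea can be pictured as ranking samples by how much the question changes the evaluator's validity estimate. The sketch below assumes the two validity probabilities have already been obtained from a frozen evaluator; the signed difference, the field names, and the top-fraction rule are all assumptions, and the paper's handling of semantic-conflict noise is not modeled.

```python
def validity_gap(p_valid_with_question, p_valid_without_question):
    """Change in estimated answer validity caused by adding the question."""
    return p_valid_with_question - p_valid_without_question

def select_top(samples, fraction):
    """Keep the ids of the samples with the largest validity gap."""
    ranked = sorted(
        samples,
        key=lambda s: validity_gap(s["with_q"], s["without_q"]),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return [s["id"] for s in ranked[:keep]]

# Hypothetical evaluator outputs for four training samples.
samples = [
    {"id": "chart-1", "with_q": 0.92, "without_q": 0.30},
    {"id": "caption-7", "with_q": 0.88, "without_q": 0.85},
    {"id": "count-3", "with_q": 0.75, "without_q": 0.40},
    {"id": "noisy-9", "with_q": 0.20, "without_q": 0.60},
]
selected = select_top(samples, 0.5)
```

A sample such as `caption-7`, whose answer looks valid with or without the question, ranks low because the question contributes little.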

2603.09555 2026-06-11 cs.LG cs.AI cs.DC cs.PF 版本更新

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

编译器优先的状态空间对偶性与可移植的 $O(1)$ 自回归缓存推理

Cosmo Santoni, Anmol Thapar

发表机构 * Imperial College London(帝国理工学院伦敦分校)

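AI summary Proposes a compiler-first inference approach built on the state space duality (SSD) structure, implementing a single-source inference path with standard JAX primitives and no custom kernels; it achieves high hardware utilization on TPU and GPU, and cached decode is 27x to 36x faster than full-prefix recomputation.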

详情
Comments
21 pages, 6 figures. Code available at: this https URL
AI中文摘要

高吞吐量的Mamba-2推理通常依赖于融合的CUDA和Triton内核,这限制了在不同加速器后端之间的可移植性。我们证明状态空间对偶性(SSD)递归具有编译器友好的结构:对角逐头动态、固定大小分块、以einsum为主的计算以及静态控制流。在标准JAX原语中表达这种结构,可以得到一个无需自定义内核的单源推理路径、一个注册的JAX PyTree缓存以及一个编译后的设备上自回归循环。在单个Google Cloud TPU v6e上,batch-1预填充达到约140 TFLOPS,即15%的模型FLOP利用率(MFU),这是该场景下的屋顶线上限;缓存解码达到高达64%的硬件带宽利用率(HBU)。在4096个token的上下文中,对于五个Mamba-2检查点(参数从130M到2.7B),缓存解码比全前缀重计算快27-36倍。相同的源代码在未修改的情况下可在NVIDIA L40S上运行,其中缓存解码在所有模型规模下均保持序列长度无关。WikiText-103验证困惑度与Triton参考实现mamba_ssm v2.2.2相差在±0.0005以内,隐藏状态在float32舍入容差内一致。代码可在该https URL获取。

英文摘要

High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x--36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within +/-0.0005 points, and hidden states agree to float32 rounding tolerance. Code is available at this https URL.
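
The contrast between cached decode and full-prefix recomputation can be seen in a toy diagonal linear recurrence. The sketch below is plain Python rather than JAX and uses fixed, made-up coefficients (in Mamba-2 they depend on the input), so it only illustrates why a fixed-size state gives constant work per token.

```python
def step(state, a, b, c, x):
    """One decode step: h_t = a * h_{t-1} + b * x_t (elementwise), y_t = sum(c * h_t)."""
    new_state = [ai * hi + bi * x for ai, hi, bi in zip(a, state, b)]
    y = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y

def full_recompute(a, b, c, xs):
    """Re-run the whole prefix from a zero state and return the last output."""
    state = [0.0] * len(a)
    y = 0.0
    for x in xs:
        state, y = step(state, a, b, c, x)
    return y

# Hypothetical coefficients for a 2-dimensional state.
a = [0.9, 0.5]
b = [1.0, 2.0]
c = [0.3, -0.1]
xs = [1.0, 0.0, 2.0, -1.0]

# Cached decode: carry the state forward, constant work per new token.
state = [0.0, 0.0]
cached_outputs = []
for x in xs:
    state, y = step(state, a, b, c, x)
    cached_outputs.append(y)

# Full-prefix recomputation: work grows with the prefix length at every token.
recomputed_outputs = [full_recompute(a, b, c, xs[: t + 1]) for t in range(len(xs))]
```

Both paths produce the same outputs; only the cached path keeps the per-token cost independent of sequence length.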

2603.09276 2026-06-11 stat.ML cs.LG 版本更新

On Regret Bounds of Thompson Sampling for Bayesian Optimization

关于贝叶斯优化中汤普森采样遗憾界的分析

Shion Takeno, Shogo Iwazaki

发表机构 * Nagoya University(名古屋大学) MI-6 Ltd.(MI-6公司)

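AI summary For Gaussian process Thompson sampling (GP-TS), under the assumption that the objective is a GP sample path, this paper derives a regret lower bound, an upper bound on the second moment of cumulative regret, expected lenient regret upper bounds, and an improved cumulative regret upper bound, filling gaps in the high-probability regret analysis of GP-TS.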

详情
Comments
43 pages, Accepted to ICML 2026
AI中文摘要

我们研究了一种广泛使用的贝叶斯优化方法——高斯过程汤普森采样(GP-TS),假设目标函数是高斯过程的一个样本路径。与具有高概率和期望遗憾界的高斯过程上置信界(GP-UCB)相比,GP-TS的大多数分析仅限于期望遗憾。此外,最近关于GP-UCB的宽松遗憾和改进的累积遗憾上界的分析是否能应用于GP-TS仍不清楚。为了填补这些空白,本文展示了几个遗憾界:(i) GP-TS的遗憾下界,这意味着GP-TS以概率δ依赖于$1/\delta$的多项式;(ii) 累积遗憾二阶矩的上界,直接暗示了关于δ的改进遗憾上界;(iii) 期望宽松遗憾上界;(iv) 关于时间水平T的改进累积遗憾上界。在此过程中,我们提供了几个有用的引理,包括从最近分析中放松必要条件以获得关于T的改进累积遗憾上界。

英文摘要

We study a widely used Bayesian optimization method, Gaussian process Thompson sampling (GP-TS), under the assumption that the objective function is a sample path from a GP. Compared with the GP upper confidence bound (GP-UCB) with established high-probability and expected regret bounds, most analyses of GP-TS have been limited to expected regret. Moreover, whether the recent analyses of GP-UCB for the lenient regret and the improved cumulative regret upper bound can be applied to GP-TS remains unclear. To fill these gaps, this paper shows several regret bounds: (i) a regret lower bound for GP-TS, which implies that GP-TS suffers from a polynomial dependence on $1/\delta$ with probability $\delta$, (ii) an upper bound of the second moment of cumulative regret, which directly suggests an improved regret upper bound on $\delta$, (iii) expected lenient regret upper bounds, and (iv) an improved cumulative regret upper bound on the time horizon $T$. Along the way, we provide several useful lemmas, including a relaxation of the necessary condition from recent analysis to obtain improved regret upper bounds on $T$.
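
One standard route from a second-moment bound to a high-probability bound is Markov's inequality applied to the squared regret; whether the paper uses exactly this argument is an assumption. Here $R_T \ge 0$ denotes the cumulative regret and $M_T$ is a hypothetical name for an upper bound on $\mathbb{E}[R_T^2]$.

```latex
\Pr\!\left( R_T \ge \sqrt{M_T/\delta} \right)
  = \Pr\!\left( R_T^2 \ge M_T/\delta \right)
  \le \frac{\mathbb{E}[R_T^2]}{M_T/\delta}
  \le \delta,
\qquad \text{where } \mathbb{E}[R_T^2] \le M_T .
```

So with probability at least $1-\delta$, $R_T < \sqrt{M_T/\delta}$: the dependence on $\delta$ is $1/\sqrt{\delta}$, compared with the $1/\delta$ factor obtained by applying Markov's inequality to $\mathbb{E}[R_T]$ directly.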

2603.06910 2026-06-11 cs.CL 版本更新

Language Shapes Mental Health Evaluations in Large Language Models

语言塑造大型语言模型中的心理健康评估

Jiayi Xu, Xiyang Hu

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Arizona State University(亚利桑那州立大学)

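AI summary Studies whether multilingual LLMs show systematic language-dependent shifts in mental health evaluation, finding that Chinese prompts yield higher stigma-related scores and more conservative depression severity judgments than English prompts.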

详情
AI中文摘要

多语言大型语言模型(LLMs)越来越多地用于社会敏感的心理健康场景,包括支持聊天机器人、筛查和内容审核。这引发了一个可靠性问题:语义上等效的心理健康输入是否在不同语言中引发可比较的评估,还是会出现与语言相关的社会和文化背景一致的系统性偏移?我们在英中双语环境中使用GPT-4o和Qwen3-32B,通过一个两层框架来检验这个问题:结构层面的评估取向(通过心理测量污名工具测量)和决策层面的行为(通过二元污名检测和四类抑郁严重度分类测量)。在多种工具和模型中,中文提示比英文提示引发更高的污名相关分数。在决策层面,中文提示降低了对污名化内容的敏感性,并产生更保守的抑郁严重度判断,导致更多的低估错误。这些发现表明,提示语言可以改变基于LLM的心理健康评估中的评估取向和下游行为。它们强调了评估多语言LLM时不仅需要关注整体性能,还需要关注它们是否在社会敏感领域中对不同语言应用了可比较的评估标准。

英文摘要

Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.

2603.05573 2026-06-11 cs.LG 版本更新

Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

为什么深度在可并行化序列模型中重要:一个李代数视角

Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo L. Sabatini

发表机构 * Howard Hughes Medical Institute, Department of Neurobiology, Harvard Medical School(霍华德·休斯医学研究所,哈佛医学院神经生物学系) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(自然与人工智能研究学院,哈佛大学) Department of Psychology and Center for Brain Science, Harvard University(心理学系和脑科学中心,哈佛大学)

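AI summary From a Lie-algebraic control perspective, studies how the expressivity of parallelizable sequence models (such as Transformer variants and state-space models) relates to depth, showing that error decreases exponentially as depth increases.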

详情
Comments
v2: Format update; split former Theorem 3.4 into Theorem 3.4 and Corollary 3.5 for clarity; corrected an indexing error affecting Corollary 3.6, Proposition 3.7, and Figure 2
AI中文摘要

可扩展的序列模型,如Transformer变体和结构化状态空间模型,通常以表达能力换取序列级并行性,从而实现高效训练。本文从李代数控制视角,考察模型在其表达能力范围之外运行时误差的边界及其缩放规律。我们的理论建立了序列模型深度与李代数扩展塔之间的对应关系。与近期理论研究相呼应,我们刻画了常数深度序列模型的李代数类别及其相应的表达能力边界。此外,我们解析推导了近似误差边界,并证明误差随深度增加呈指数下降,这与这些模型的强大实证表现一致。我们通过在符号词和连续值状态追踪问题上的实验验证了理论预测。

英文摘要

Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Here we examine the bounds on error and how error scales when models operate outside of their expressivity regimes using a Lie-algebraic control perspective. Our theory formulates a correspondence between the depth of a sequence model and the tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that error diminishes exponentially as the depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions using experiments on symbolic word and continuous-valued state-tracking problems.

2504.09762 2026-06-11 cs.AI 版本更新

Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

立场:停止将中间令牌拟人化为推理/思考痕迹!

Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas

发表机构 * University of California, Berkeley(加州大学伯克利分校)

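AI summary Argues that anthropomorphizing model-generated intermediate tokens as "reasoning traces" or "thinking traces" is misleading, and calls on the community to avoid it.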

详情
Comments
Appears in ICML 2026. [This is a fork of v1. This fork, while overlapping with v1 in background section, differs both in the overall focus as well as the specific argument against anthropomorphization of reasoning traces]
AI中文摘要

中间令牌生成(ITG)是一种模型在输出解决方案之前产生输出的方法,已成为提高语言模型在推理任务上性能的标准方法。这些中间令牌被称为“推理痕迹”甚至“思考痕迹”——隐含地将这些痕迹拟人化,暗示它们类似于人类在解决难题时可能采取的步骤,因此可以为最终用户提供模型思考过程的可解释窗口。在这篇立场论文中,我们提出证据表明这种拟人化并非无害的隐喻,而是相当危险——它混淆了这些模型的本质以及如何有效使用它们,并导致可疑的研究。我们呼吁社区避免对中间令牌进行此类拟人化。

英文摘要

Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thinking traces" -- implicitly anthropomorphizing the traces, and implying that these traces resemble steps a human might take when solving a challenging problem, and as such can provide an interpretable window into the operation of the model's thinking process to the end user. In this position paper, we present evidence that this anthropomorphization isn't a harmless metaphor, and instead is quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research. We call on the community to avoid such anthropomorphization of intermediate tokens.

2509.15680 2026-06-11 cs.SD eess.AS 版本更新

SAM: A Mamba-2 State-Space Audio-Language Model

SAM: 一种基于 Mamba-2 状态空间的音频-语言模型

Taehan Lee, Jaehan Jung, Hyukjun Lee

发表机构 * Sogang University(西江大学)

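AI summary Proposes SAM, an audio-language model with a Mamba-2 backbone that matches or surpasses 7B Transformer models on AudioSet and AudioCaps with fewer parameters, and systematically analyzes how SSMs interact with audio encoder outputs.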

详情
Comments
6 pages, Accepted to Interspeech 2026
AI中文摘要

我们提出了 SAM,一种状态空间音频-语言模型,它将音频编码器与 Mamba-2 骨干网络集成。SAM-2.7B 在 AudioSet 上达到 21.1 mAP,在 AudioCaps 上达到 17.6 SPICE,以更少的参数匹配或超越更大的 7B Transformer 模型。我们进一步首次提供了系统性的、表示级别的分析,研究 SSM 如何与音频编码器输出交互:(1) 联合音频编码器微调是必要的,这由准确率提升以及在不同 SSM 大小下观察到的 token 表示秩和相似性的适应所支持;(2) 尽管线性缩放,SSM 从紧凑、信息丰富的音频 token 表示中获益更多,而非过长的 token 序列;(3) 融入指令跟随监督显著提升了推理能力,将 MMAU-Sound 准确率从 22.8 提升至 56.8。通过全面的实验和分析,我们为 SSM 作为音频-语言模型的强大、可扩展骨干网络建立了实用的设计原则。

英文摘要

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.

2603.03855 2026-06-11 cs.SD 版本更新

A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

音频大语言模型中多事件音频定位的敏感性分析

Taehan Lee, Jaehan Jung, Hyukjun Lee

发表机构 * Sogang University(西江大学)

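AI summary Through a large-scale evaluation, finds that in complex acoustic scenes a higher event count lowers the true-positive rate and raises the false-positive rate of audio LLMs, that prompts induce a trade-off between the two, and that models are more uncertain on multi-event audio.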

详情
Comments
6 pages, Accepted to Interspeech 2026
AI中文摘要

音频大语言模型在理解音频样本方面表现出强大能力,但其在复杂声学场景中的可靠性仍待探索。不同于以往局限于小规模或查询构建控制不足的工作,我们提出了一种大规模评估,研究随着听觉场景复杂度增加时的事件定位和误报情况。使用71K个AudioCapsV2片段,我们提取标准化的(源,属性)事件,并构建两种查询类型:用于真实检测的存在事件查询和用于探测幻觉的缺失事件查询,在音频对齐的文本嵌入空间中采用相似性过滤的负采样。我们评估了四种最先进的音频大语言模型,每个模型使用12种提示变体,处理超过50万个是/否查询。在所有模型中,事件数量增加一致地降低了真阳性率并提高了假阳性率,而提示则在两者之间引入了强烈的权衡。我们的置信度分析表明,模型在多事件音频上变得更加不确定,揭示了改进空间。

英文摘要

Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.
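
The two reported metrics follow directly from the yes/no query design: present-event queries give the true-positive rate and absent-event queries give the false-positive rate. A minimal sketch, with a hypothetical record format of `(event_present, model_said_yes)` pairs:

```python
def rates(records):
    """Compute (TPR, FPR) from a list of (event_present, model_said_yes) pairs."""
    tp = sum(1 for present, yes in records if present and yes)
    fn = sum(1 for present, yes in records if present and not yes)
    fp = sum(1 for present, yes in records if not present and yes)
    tn = sum(1 for present, yes in records if not present and not yes)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr

# Hypothetical answers: four present-event and four absent-event queries.
records = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
tpr, fpr = rates(records)
```

Grouping records by the number of events in each clip before calling `rates` would give the per-complexity trend the abstract describes.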

2505.10018 2026-06-11 cs.RO 版本更新

LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping

LEMON-Mapping: 面向全局一致建图的环路增强大规模多会话点云融合与优化

Lijie Wang, Xiaoyi Zhong, Ziyi Xu, Kaixin Chai, Anke Zhao, Tianyu Zhao, Changjian Jiang, Qianhao Wang, Xieyuanli Chen, Fei Gao

发表机构 * Institute of CyberSystems and Control, Zhejiang University(浙江大学控制系统研究所) The Huzhou Institute, Zhejiang University(浙江大学湖州研究院) The State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院国家工业控制技术重点实验室) The College of Intelligence Science and Technology, National University of Defense Technology(国防科技大学智能科学与技术学院)

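AI summary Proposes the LEMON-Mapping framework, which uses robust loop processing, spatial bundle adjustment, and PGO-based optimization to address divergence and blurring in overlapping regions of multi-robot maps, achieving accurate, globally consistent point cloud merging.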

详情
AI中文摘要

多机器人协作在现代机器人领域日益重要且面临重大挑战,尤其是在构建全局一致、精确的地图方面。传统的多机器人位姿图优化(PGO)方法确保基本的全局一致性,但忽略了地图的几何结构,仅将闭环作为位姿节点之间的约束,导致重叠区域出现发散和模糊。为解决此问题,我们提出LEMON-Mapping,一种用于大规模多会话点云融合与优化的环路增强框架。我们重新审视环路在多机器人建图中的作用,并引入三项关键创新。首先,我们开发了一种鲁棒的环路处理机制来剔除异常值,以及一种环路召回策略来恢复被错误移除但有效的环路。其次,我们引入了针对多机器人地图的空间光束法平差,以减少发散并消除重叠中的模糊。第三,我们设计了一种基于PGO的方法,利用精化的光束法平差约束将局部精度传播到整个地图。我们在多个公开数据集和一个自采集数据集上验证了LEMON-Mapping。实验结果表明,与传统融合方法相比,我们的框架具有更优越的建图精度和全局一致性。可扩展性实验也证明了其处理涉及大量机器人场景的强大能力。

英文摘要

Multi-robot collaboration is becoming increasingly critical and presents significant challenges in modern robotics, especially for building a globally consistent, accurate map. Traditional multi-robot pose graph optimization (PGO) methods ensure basic global consistency but ignore the geometric structure of the map, and only use loop closures as constraints between pose nodes, leading to divergence and blurring in overlapping regions. To address this issue, we propose LEMON-Mapping, a loop-enhanced framework for large-scale, multi-session point cloud fusion and optimization. We re-examine the role of loops for multi-robot mapping and introduce three key innovations. First, we develop a robust loop processing mechanism that rejects outliers and a loop recall strategy to recover mistakenly removed but valid loops. Second, we introduce spatial bundle adjustment for multi-robot maps, reducing divergence and eliminating blurring in overlaps. Third, we design a PGO-based approach that leverages refined bundle adjustment constraints to propagate local accuracy to the entire map. We validate LEMON-Mapping on several public datasets and a self-collected dataset. The experimental results show superior mapping accuracy and global consistency of our framework compared to traditional merging methods. Scalability experiments also demonstrate its strong capability to handle scenarios involving numerous robots.

2602.23545 2026-06-11 cs.AI 版本更新

Planning under Distribution Shifts with Causal POMDPs

基于因果POMDP的分布偏移下规划

Matteo Ceriscioli, Karthika Mohan

发表机构 * School of Electrical Engineering and Computer Science (EECS)(电气工程与计算机科学学院)

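AI summary Proposes a causal POMDP framework that represents environment changes as interventions and preserves the PWLC property of the value function under partial observability, enabling planning and belief updating under distribution shifts.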

详情
Comments
To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS-26)
AI中文摘要

在现实世界中,规划常常受到分布偏移的挑战。因此,在一组条件下获得的环境模型在状态分布或环境动态变化时可能不再有效,进而导致先前学习的策略失败。在这项工作中,我们提出了一个使用因果知识构建的部分可观测马尔可夫决策过程(POMDP)的理论框架,用于在部分可观测性下进行规划。通过将环境中的变化表示为对该因果POMDP的干预,该框架能够评估假设变化下的计划,并主动识别环境中哪些组件已被改变。我们展示了如何维护和更新关于潜在状态和底层领域的信念,并证明了在该增强信念空间中值函数保持分段线性凸(PWLC)。在分布偏移下保持PWLC的优势在于,通过基于$\alpha$-向量的POMDP方法保持规划的可处理性。

英文摘要

In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via $\alpha$-vector-based POMDP methods.
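
The augmented belief described above can be sketched as a Bayes filter over (state, domain) pairs, with the value computed as a maximum over alpha-vectors. This is a generic illustration under stated assumptions: the domain is held fixed within an episode, and `transition`, `emission`, and the toy numbers are hypothetical rather than taken from the paper.

```python
def update_belief(belief, action, observation, transition, emission):
    """Bayes filter on the joint belief over (state, domain) pairs.

    belief: dict {(state, domain): prob}
    transition(domain, state, action) -> dict {next_state: prob}
    emission(domain, next_state, action) -> dict {observation: prob}
    """
    unnormalized = {}
    for (s, d), p in belief.items():
        for s2, pt in transition(d, s, action).items():
            po = emission(d, s2, action).get(observation, 0.0)
            key = (s2, d)
            unnormalized[key] = unnormalized.get(key, 0.0) + p * pt * po
    z = sum(unnormalized.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return {k: v / z for k, v in unnormalized.items()}

def value(belief, alpha_vectors):
    """PWLC value: max over alpha-vectors of the inner product with the belief."""
    return max(
        sum(alpha.get(k, 0.0) * p for k, p in belief.items())
        for alpha in alpha_vectors
    )

def transition(domain, state, action):
    return {state: 1.0}  # toy dynamics: the state never changes

def emission(domain, next_state, action):
    p_alarm = 0.1 if domain == "nominal" else 0.8  # the shift changes the sensor model
    return {"alarm": p_alarm, "ok": 1.0 - p_alarm}

prior = {("s", "nominal"): 0.5, ("s", "shifted"): 0.5}
posterior = update_belief(prior, "wait", "alarm", transition, emission)
alphas = [{("s", "nominal"): 1.0}, {("s", "shifted"): 1.0}]
best = value(posterior, alphas)
```

After one "alarm" observation the belief mass moves toward the shifted domain, which is the sense in which the agent identifies that the environment has changed.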

2602.23461 2026-06-11 physics.flu-dyn cs.LG 版本更新

Neural ensemble Kalman filter: Data assimilation for compressible flows with shocks

神经集成卡尔曼滤波器:含激波可压缩流的数据同化

Xu-Hui Zhou, Lorenzo Beronilla, Michael K. Sleeman, Hangchuan Hu, Matthias Morzfeld, Andrew M. Stuart, Tamer A. Zaki

发表机构 * University of California, San Diego(加州大学圣迭戈分校) University of Cambridge(剑桥大学)

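AI summary Addresses the failure of the ensemble Kalman filter (EnKF) on compressible flows with shocks, caused by bimodal forecast distributions, by proposing the neural EnKF: the forecast ensemble is mapped to the parameter space of a neural network and assimilation is performed there, with physics-informed transfer learning used to avoid spurious oscillations and nonphysical features.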

详情
AI中文摘要

含激波可压缩流的数据同化(DA)具有挑战性,因为许多经典DA方法在不确定激波附近会产生伪振荡和非物理特征。我们在此关注集成卡尔曼滤波器(EnKF)。我们表明,EnKF性能不佳可归因于在不确定激波位置附近可能出现双峰预报分布;这违反了EnKF的假设,即预报接近高斯分布。为解决此问题,我们引入了新的神经EnKF。基本思想是通过将激波流的预报集合映射到深度神经网络(NN)的参数空间(权重和偏置),并随后在该空间中进行DA,从而系统地将神经函数逼近嵌入到集成DA中。非线性映射将尖锐和光滑的流动特征编码在NN参数的集合中。因此,只有当NN参数在预报集合的神经表示中平滑变化时,神经EnKF更新才是良好的。我们表明,可以通过物理信息迁移学习强制网络参数的这种平滑变化,并证明这样做神经EnKF避免了困扰EnKF的伪振荡和非物理特征。通过无粘Burgers方程、Sod激波管和二维爆炸波的一系列系统数值实验,证明了神经EnKF的适用性。

英文摘要

Data assimilation (DA) for compressible flows with shocks is challenging because many classical DA methods generate spurious oscillations and nonphysical features near uncertain shocks. We focus here on the ensemble Kalman filter (EnKF). We show that the poor performance of the EnKF may be attributed to the bimodal forecast distribution that can arise in the vicinity of an uncertain shock location; this violates the assumption underpinning the EnKF that the forecast is close to Gaussian. To address this issue, we introduce the new neural EnKF. The basic idea is to systematically embed neural function approximations within ensemble DA by mapping the forecast ensemble of shocked flows to the parameter space (weights and biases) of a deep neural network (NN) and to subsequently perform DA in that space. The nonlinear mapping encodes sharp and smooth flow features in an ensemble of NN parameters. Neural EnKF updates are therefore well-behaved only if the NN parameters vary smoothly within the neural representation of the forecast ensemble. We show that such a smooth variation of network parameters can be enforced via physics-informed transfer learning, and demonstrate that in so doing the neural EnKF avoids the spurious oscillations and nonphysical features that plague the EnKF. The applicability of the neural EnKF is demonstrated through a series of systematic numerical experiments with the inviscid Burgers' equation, the Sod shock tube, and a two-dimensional blast wave.
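
For reference, the classical analysis step that the abstract contrasts against can be sketched for a scalar state observed directly. This is the standard stochastic (perturbed-observation) EnKF update, not the neural variant; the bimodal toy ensemble of shock positions and all names are assumptions for illustration.

```python
import random
import statistics

def enkf_update(forecast, y_obs, obs_var, rng):
    """Stochastic EnKF update for a scalar state with observation operator H = 1."""
    forecast_var = statistics.variance(forecast)    # sample variance of the forecast
    gain = forecast_var / (forecast_var + obs_var)  # Kalman gain
    # Each member is nudged toward its own noisy copy of the observation.
    return [
        x + gain * (y_obs + rng.gauss(0.0, obs_var ** 0.5) - x)
        for x in forecast
    ]

# Hypothetical bimodal forecast of a shock position: clusters near 0.3 and 0.7.
forecast = [0.28, 0.30, 0.32, 0.68, 0.70, 0.72]
rng = random.Random(0)
analysis = enkf_update(forecast, y_obs=0.7, obs_var=0.01, rng=rng)
```

The gain is built from a single variance, which summarizes the two clusters as one broad Gaussian. The update is a linear blend of forecast and observation, which is harmless for a scalar position but produces intermediate, staircase-like profiles when applied pointwise to flow fields with differently placed shocks.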

2602.22962 2026-06-11 cs.LG 版本更新

Scaling Laws of Global Weather Models

全球天气模型的缩放定律

Yuejiang Yu, Langwen Huang, Alexandru Calotoiu, Torsten Hoefler

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

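AI summary Analyzes scaling laws relating model size, dataset size, and compute budget to validation loss in data-driven weather models. Aurora shows the strongest data scaling; GraphCast is the most parameter-efficient but has limited hardware utilization; compute-optimal analysis indicates that more training data helps more than a larger model; and for model shape, width is favored over depth.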

详情
Comments
Accepted at ICML 2026. 21 pages, 7 figures
AI中文摘要

数据驱动模型正在彻底改变天气预报。为了优化训练效率和模型性能,本文分析了该领域内的经验缩放定律。我们研究了模型性能(验证损失)与三个关键因素:模型大小($N$)、数据集大小($D$)和计算预算($C$)之间的关系。在一系列模型中,我们发现Aurora表现出最强的数据缩放行为:将训练数据集增加10倍可使验证损失降低多达3.2倍。GraphCast展示了最高的参数效率,但硬件利用率有限。我们的计算最优分析表明,在固定计算预算下,将资源分配给更多的总训练数据比增加模型大小能带来更大的性能提升。此外,我们分析了模型形状,并发现了与语言模型中观察到的根本不同的缩放行为:天气预报模型始终倾向于增加宽度而非深度。这些发现表明,未来的天气模型应优先考虑更宽的架构和更大的有效训练数据集,以最大化预测性能。

英文摘要

Data-driven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size ($N$), dataset size ($D$), and compute budget ($C$). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10x reduces validation loss by up to 3.2x. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to more total training data yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.
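
The reported data-scaling figure can be converted into a power-law exponent with a short calculation. The sketch assumes a pure power law, loss proportional to D^(-alpha), and treats the "up to 3.2x" reduction as exact; both are simplifying assumptions, and the extrapolation to 100x is illustrative only.

```python
import math

def power_law_exponent(data_factor, loss_reduction_factor):
    """Exponent alpha in loss ~ D^(-alpha) implied by one (data, loss) ratio pair."""
    return math.log(loss_reduction_factor) / math.log(data_factor)

alpha = power_law_exponent(10.0, 3.2)        # about 0.505
reduction_at_100x = 100.0 ** alpha           # 3.2 squared under the same power law
```

An exponent near 0.5 means the validation loss falls roughly with the square root of the dataset size in this regime.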

2601.11670 2026-06-11 cs.LG cs.AI 版本更新

CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning

CoVar: 置信度-方差引导的半监督学习伪标签选择

Jinshi Liu, Lei He, Pan Liu

发表机构 * College of Artificial Intelligence, Shenzhen University(深圳大学人工智能学院) School of Information and Electrical Engineering, Hunan University of Science and Technology(湖南科技大学信息与电气工程学院) Information Hub, Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)信息中心)

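AI summary Proposes the CoVar framework, which assesses pseudo-label reliability by jointly modeling maximum confidence and residual-class variance and uses SVD-based spectral relaxation to separate reliable from unreliable predictions without manual thresholds, yielding gains on segmentation and classification tasks.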

详情
AI中文摘要

半监督学习中的伪标签选择通常由最大置信度阈值驱动,然而在模型过度自信和类别不平衡下,仅靠置信度可能不可靠。我们提出CoVar,一个置信度-方差框架,通过联合建模最大置信度(MC)和残差类方差(RCV)来评估伪标签可靠性。从熵最小化出发,我们推导出二阶交叉熵近似,表明当MC高且RCV低时,低损失伪标签更受青睐,并带有置信度依赖的惩罚项,该惩罚项对接近确定的预测更强。基于此准则,CoVar将预测嵌入二维置信度-方差空间,并使用基于SVD的谱松弛来分离可靠和不可靠的预测,无需手动调整置信度阈值。然后,聚类加权高斯函数将此分离转换为每个样本的训练权重。所得权重可在训练期间集成到现有的半监督分割和分类流程中,且不引入推理开销。在PASCAL VOC 2012、Cityscapes、CIFAR-10、CIFAR-100、SVHN和STL-10上的实验表明,在匹配骨干网络下,VOC和Cityscapes上取得明显提升,并在标准分类基准上达到竞争性或更低的错误率。这些结果表明,残差类离散度为鲁棒伪标签选择提供了置信度之外的补充信号。

英文摘要

Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unreliable under model overconfidence and class imbalance. We propose CoVar, a confidence--variance framework that assesses pseudo-label reliability by jointly modeling Maximum Confidence (MC) and Residual-Class Variance (RCV). Starting from entropy minimization, we derive a second-order cross-entropy approximation showing that low-loss pseudo-labels are favored when MC is high and RCV is low, with a confidence-dependent penalty that becomes stronger for near-certain predictions. Based on this criterion, CoVar embeds predictions into a two-dimensional confidence--variance space and uses SVD-based spectral relaxation to separate reliable and unreliable predictions without hand-tuned confidence thresholds. Cluster-wise Gaussian weighting then converts this separation into per-sample training weights. The resulting weights can be integrated into existing semi-supervised segmentation and classification pipelines during training and introduce no inference-time overhead. Experiments on PASCAL VOC 2012, Cityscapes, CIFAR-10, CIFAR-100, SVHN, and STL-10 show clear gains on VOC and Cityscapes under matched backbones, as well as competitive or improved error rates on standard classification benchmarks. These results indicate that residual-class dispersion provides a useful signal complementary to confidence for robust pseudo-label selection.
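
The two quantities CoVar combines can be computed from a softmax vector as in the sketch below. The abstract does not give the exact definition of Residual-Class Variance, so taking the population variance of the non-maximum class probabilities is an assumption, as are the function name and the toy vectors.

```python
def mc_and_rcv(probs):
    """Return (maximum confidence, variance of the remaining class probabilities)."""
    top = max(range(len(probs)), key=probs.__getitem__)
    mc = probs[top]
    residual = [p for i, p in enumerate(probs) if i != top]
    mean = sum(residual) / len(residual)
    rcv = sum((p - mean) ** 2 for p in residual) / len(residual)
    return mc, rcv

# Same maximum confidence, different spread of the leftover probability mass.
clean = mc_and_rcv([0.7, 0.1, 0.1, 0.1])      # leftover mass spread evenly
ambiguous = mc_and_rcv([0.7, 0.3, 0.0, 0.0])  # leftover mass on a single rival class
```

A threshold on confidence alone treats these two predictions identically, while the variance term separates them.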

2602.22638 2026-06-11 cs.AI 版本更新

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

MobilityBench:用于评估真实世界移动场景中路径规划智能体的基准

Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) AMAP, Alibaba Group(阿里集团AMAP) Alibaba Group(阿里集团)

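AI summary Proposes the MobilityBench benchmark, which systematically evaluates LLM-based route-planning agents using a deterministic API-replay sandbox and a multi-dimensional evaluation protocol, and finds that existing models perform poorly on preference-constrained route planning.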

详情
AI中文摘要

由大型语言模型(LLM)驱动的路径规划智能体已成为一种有前景的范式,通过自然语言交互和工具介导的决策支持日常人类移动。然而,在真实世界移动场景中的系统评估受到多样化路由需求、非确定性地图服务和有限可重复性的阻碍。在本研究中,我们引入了MobilityBench,一个用于评估基于LLM的路径规划智能体在真实世界移动场景中的可扩展基准。MobilityBench基于从高德地图收集的大规模匿名真实用户查询构建,覆盖全球多个城市的广泛路径规划意图。为了实现可重复的端到端评估,我们设计了一个确定性API重放沙箱,消除了实时服务带来的环境变化。我们进一步提出了一个以结果有效性为中心的多维评估协议,辅以对指令理解、规划、工具使用和效率的评估。使用MobilityBench,我们在多种真实世界移动场景中评估了多个基于LLM的路径规划智能体,并对其行为和性能进行了深入分析。我们的发现表明,当前模型在基本信息检索和路径规划任务上表现良好,但在偏好约束路径规划上困难重重,突显了在个性化移动应用中仍有显著改进空间。我们在此https URL公开发布基准数据、评估工具包和文档。

英文摘要

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at this https URL.
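
A deterministic API-replay sandbox can be pictured as a lookup from canonicalized requests to recorded responses, so that every evaluation run sees the same tool outputs. The class, endpoint name, and response fields below are hypothetical; the benchmark's actual sandbox design is not described in the abstract.

```python
import json

class ReplaySandbox:
    """Serve recorded API responses instead of calling a live mapping service."""

    def __init__(self, recordings):
        self._recordings = dict(recordings)

    @staticmethod
    def key(endpoint, params):
        # Canonical JSON so that parameter order does not change the lookup key.
        return endpoint + "?" + json.dumps(params, sort_keys=True)

    def call(self, endpoint, params):
        k = self.key(endpoint, params)
        if k not in self._recordings:
            raise KeyError(f"no recorded response for {k}")
        return self._recordings[k]

# Hypothetical recording of one routing request.
recordings = {
    ReplaySandbox.key("route", {"from": "A", "to": "B"}): {"minutes": 18},
}
sandbox = ReplaySandbox(recordings)
first = sandbox.call("route", {"to": "B", "from": "A"})
second = sandbox.call("route", {"from": "A", "to": "B"})
```

Raising on an unrecorded request, rather than falling back to a live call, keeps runs reproducible and makes gaps in the recording visible.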