arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.20477 2026-05-18 cs.LG cs.IT math.IT

Implicit Hypothesis Testing and Divergence Preservation in Neural Network Representations

Kadircan Aksoy, Protim Bhattacharjee, Peter Jung

AI summary: Studies the training dynamics of neural classifiers by recasting classification as binary hypothesis tests between class-conditional distributions, shows empirically that well-generalizing networks progressively approach the Neyman-Pearson optimal decision rule during training, and defines an information plane for assessing convergence.

Abstract

We study the training dynamics of neural classifiers through the lens of binary hypothesis testing. We re-formalize classification as a collection of binary tests between class-conditional distributions induced by learned representations and show empirically that, along training trajectories, well-generalizing networks progressively approach Neyman-Pearson optimal decision rules, as measured by monotonic growth in the KL divergence retained by learned representations. We provide sufficient conditions for exact optimality, discuss its implications for training regularization, and define an informational plane (the so-called Evidence-Error plane) where convergence can be assessed methodically across network architectures.
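The idea of "KL divergence retained by learned representations" can be illustrated with the data-processing inequality: a deterministic representation never increases the KL divergence between two class-conditional distributions, and it keeps all of it when it only merges inputs that share the same likelihood ratio. The sketch below is a toy illustration, not the paper's experimental setup; the distributions `p0`, `p1` and the two mappings are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def push_forward(dist, mapping, n_out):
    """Distribution of z = mapping[x] when x is drawn from dist."""
    out = [0.0] * n_out
    for x, prob in enumerate(dist):
        out[mapping[x]] += prob
    return out

# Hypothetical class-conditional distributions over four input symbols.
# Likelihood ratios p0/p1 are 2, 2, 1, 0.25.
p0 = [0.4, 0.2, 0.3, 0.1]
p1 = [0.2, 0.1, 0.3, 0.4]

# Merges only inputs 0 and 1, which share the same likelihood ratio.
sufficient = [0, 0, 1, 2]
# Merges inputs 1, 2, 3, which have different likelihood ratios.
lossy = [0, 1, 1, 1]

kl_input = kl_divergence(p0, p1)
kl_sufficient = kl_divergence(push_forward(p0, sufficient, 3),
                              push_forward(p1, sufficient, 3))
kl_lossy = kl_divergence(push_forward(p0, lossy, 2),
                         push_forward(p1, lossy, 2))

print(f"input: {kl_input:.4f}  sufficient: {kl_sufficient:.4f}  lossy: {kl_lossy:.4f}")
```

In this toy setting the first mapping retains the full divergence while the second loses part of it, which is the sense in which a representation can discard evidence that a likelihood-ratio test would have used.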

2601.12894 2026-05-18 cs.RO cs.CV

Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

Kangye Ji, Jianbo Zhou, Yuan Meng, Ye Li, Hanyun Cui, Zhi Wang

AI summary: Proposes SAG, which achieves sparse action generation through adaptive pruning and reuse mechanisms, improving the efficiency of real-time visuomotor control; experiments show up to a 4× generation speedup.

Abstract

Diffusion Policy has dominated action generation due to its strong capabilities for modeling multi-modal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on static schedules that fail to adapt to the dynamics of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose Sparse ActionGen (SAG) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4× generation speedup without sacrificing performance. Project Page: https://sparse-actiongen.github.io.

2601.09512 2026-05-18 cs.RO cs.LG

CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

Ralf Römer, Yi Zhang, Yuming Li, Angela P. Schoellig

AI summary: CLARE is a parameter-efficient, exemplar-free continual learning framework that autonomously expands model modules so that robots retain prior knowledge while learning new tasks, outperforming exemplar-based methods.

Comments Accepted to IEEE Robotics and Automation Letters 2026. Project page: https://tum-lsy.github.io/clare. 11 pages, 9 figures

Abstract

To teach robots complex manipulation tasks, a common approach is to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected VLA modules and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark and five real-world tasks, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code, data, and videos are available at our website: https://tum-lsy.github.io/clare.

2601.07820 2026-05-18 cs.CL

Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests

Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier

AI summary: Uses reference games to test whether language models can recognize their own uncertainty and express it through clarification requests, finding that even in simple tasks models struggle to identify their uncertainty and turn it into clarification behavior.

Comments Accepted at GEM@ACL 2026, the 5th Generation, Evaluation & Metrics Workshop

Abstract

In human conversation, both interlocutors play an active role in maintaining mutual understanding. When listeners are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar listener role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a suitable testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.

2601.03707 2026-05-18 cs.CL

AirNav: A Large-Scale UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions

Hengxing Cai, Yijie Rao, Ligang Huang, Zanyang Zhong, Jinhan Dong, Jingjun Tan, Changhao Nai, Jue Hou, Wenhao Lu, Renxin Zhong

AI summary: Introduces the AirNav dataset of 137K navigation samples with natural and diverse instructions, evaluates a range of methods, proposes the AirVLN-R1 model, which reaches a 51.82% success rate on the test-unseen split, and reports real-world UAV experiments as preliminary evidence of sim-to-real transfer.

Abstract

Existing UAV vision-and-language navigation (VLN) benchmarks rarely provide realistic aerial scenes, natural process-level instructions, and sufficient scale simultaneously, making it difficult to systematically train and evaluate UAV VLN agents under realistic settings. To address this, we propose AirNav, a large-scale benchmark built on real urban aerial data, comprising 137K navigation samples with natural and diverse instructions generated via a human-LLM collaborative pipeline with 10 user personas. We conduct a systematic evaluation of representative approaches on AirNav, ranging from traditional models to multimodal large language models (MLLMs), under unified metrics with open-source implementations. We further propose AirVLN-R1, trained via supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), achieving state-of-the-art performance with a 51.82% success rate on the test-unseen split. Real-world experiments on a physical UAV platform provide preliminary evidence of sim-to-real transferability, and our dataset and code are publicly available.

2512.15693 2026-05-18 cs.CV

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

Yifei Li, Wenzhao Zheng, Yanran Zhang, Runze Sun, Yu Zheng, Lei Chen, Jie Zhou, Jiwen Lu

AI summary: Presents Skyra, a specialized multimodal large language model that identifies human-perceivable visual artifacts in AI-generated videos and uses them as grounded evidence for detection and explanation; the work also builds the first large-scale AI-generated video artifact dataset and proposes a two-stage training strategy.

Comments Camera Ready Version. Project Page: https://github.com/JoeLeelyf/Skyra

Abstract

The misuse of AI-driven video generation technologies has raised serious social concerns, highlighting the urgent need for reliable AI-generated video detectors. However, most existing methods are limited to binary classification and lack the necessary explanations for human interpretation. In this paper, we present Skyra, a specialized multimodal large language model (MLLM) that identifies human-perceivable visual artifacts in AI-generated videos and leverages them as grounded evidence for both detection and explanation. To support this objective, we construct ViF-CoT-4K for Supervised Fine-Tuning (SFT), which represents the first large-scale AI-generated video artifact dataset with fine-grained human annotations. We then develop a two-stage training strategy that systematically enhances our model's spatio-temporal artifact perception, explanation capability, and detection accuracy. To comprehensively evaluate Skyra, we introduce ViF-Bench, a benchmark comprising 3K high-quality samples generated by over ten state-of-the-art video generators. Extensive experiments demonstrate that Skyra surpasses existing methods across multiple benchmarks, while our evaluation yields valuable insights for advancing explainable AI-generated video detection.

2512.10100 2026-05-18 cs.AI

Robust AI Security and Alignment: A Sisyphean Endeavor?

Apostol Vassilev

AI summary: Extends Gödel's incompleteness theorem to explore the theoretical limits of AI security and alignment, offers practical approaches to the resulting challenges, and draws out limitations on the cognitive reasoning of AI systems.

Comments 17 pages, 1 figure. This version will appear in IEEE Security & Privacy in June 2026

Abstract

This manuscript establishes information-theoretic limitations for robustness of AI security and alignment by extending Gödel's incompleteness theorem to AI. Knowing these limitations and preparing for the challenges they bring is critically important for the responsible adoption of the AI technology. Practical approaches to dealing with these challenges are provided as well. Broader implications for cognitive reasoning limitations of AI systems are also proven.

2512.08964 2026-05-18 cs.LG

T2T-LA: A Topology-to-Topology LLM Agent for Graph Learning with Neither Feature Access nor Task Knowledge

Yongyu Wang

AI summary: Proposes T2T-LA, a topology-to-topology LLM agent that needs neither feature access nor task knowledge and performs topology reasoning for graph learning by inferring the relationship between failed topologies and their scores.

Abstract

Graph learning aims to convert data into graph representations, which are fundamental to many problems in machine learning for CAD, where circuits, layouts, designs, and optimization states are often modeled as graph-structured objects. Existing graph learning methods usually rely on carefully designed graph construction rules, extensive parameter tuning, and sophisticated mathematical theory; moreover, achieving good performance often requires task-specific graph construction tailored to the downstream objective. In this work, we study whether a large language model (LLM) can reason about graph structure and infer a useful topology without observing the feature matrix, without knowing the downstream task, and without relying on any carefully designed graph construction algorithm or parameter tuning process. To this end, we propose T2T-LA, a Topology-to-Topology LLM Agent that receives no input other than a set of previously failed topologies and the scores assigned to them by a private scorer. The agent is not told what task or algorithm produces the scores, how these topologies are generated, or what the scores mean. Since none of the observed topologies is satisfactory, T2T-LA cannot simply imitate a good example. Instead, it is forced to infer hidden relationships between graph connectivity patterns and the observed scores, a capability that is particularly relevant to CAD scenarios where useful design structures may be difficult to specify manually. Experimental results show that T2T-LA can generate, in one shot, a graph topology that enables the downstream algorithm to produce a sufficiently good solution, suggesting a new LLM-driven direction for topology reasoning and graph representation learning in ML-for-CAD workflows.

2512.08052 2026-05-18 cs.RO cs.LG

An Introduction to Deep Reinforcement and Imitation Learning

Pedro Santana

AI summary: Introduces deep reinforcement learning and deep imitation learning for embodied agents, covering Markov Decision Processes, core algorithms such as REINFORCE and PPO, and foundational methods such as Behavioral Cloning, DAgger, and GAIL.

Abstract

Embodied agents, such as robots and virtual characters, must continuously select actions to execute tasks effectively, solving complex sequential decision-making problems. Given the difficulty of designing such controllers manually, learning-based approaches have emerged as promising alternatives, most notably Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL). DRL leverages reward signals to optimize behavior, while DIL uses expert demonstrations to guide learning. This document introduces DRL and DIL in the context of embodied agents, adopting a concise, depth-first approach to the literature. It is self-contained, presenting all necessary mathematical and machine learning concepts as they are needed. It is not intended as a survey of the field; rather, it focuses on a small set of foundational algorithms and techniques, prioritizing in-depth understanding over broad coverage. The material ranges from Markov Decision Processes to REINFORCE and Proximal Policy Optimization (PPO) for DRL, and from Behavioral Cloning to Dataset Aggregation (DAgger) and Generative Adversarial Imitation Learning (GAIL) for DIL.

2512.06655 2026-05-18 cs.LG cs.AI

Graph-Regularized Sparse Autoencoders for LLM Safety Steering

Jehyeok Yeon, Federico Cinus, Yifan Wu, Luca Luceri

AI summary: Proposes graph-regularized sparse autoencoders, which smooth decoder vectors over a neuron co-activation graph and apply the resulting direction bank to improve safety steering, substantially raising harmful-request refusal across several benchmarks.

Abstract

Sparse autoencoders (SAEs) are increasingly used to extract activation directions for inference-time steering, but their standard sparsity objective treats latent features as independent. This prior can be poorly matched to high-level safety behaviors, where refusal and harmful compliance appear to depend on distributed structure in activation space. We introduce Graph-Regularized Sparse Autoencoders (GSAE), a dictionary-learning method that learns safety-steering directions by smoothing SAE decoder vectors over a neuron co-activation graph and applying the resulting direction bank through a two-gate runtime controller. Empirically, GSAE improves selective refusal across JailbreakBench, HarmBench, and XSTest, increasing harmful-request refusal while keeping benign-prompt refusals low. On Llama-3-8B, replacing the standard SAE with GSAE in an otherwise identical pipeline improves $\Delta_s$ by 20.1 points on JailbreakBench and 16.8 points on HarmBench. GSAE outperforms activation-steering baselines and black-box guardrails, preserves benign-task performance, generalizes across Llama-3, Mistral, Qwen 2.5, and Phi-4, and remains strong under black-box and gray-box jailbreak attacks.

2512.00417 2026-05-18 cs.CL

CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Nicholas Deng, Qixin Xiao, Jia Tian, Kanghong Zhan, Tianyi Li, Xiaochen Liu, Jason Ge, Chaoyang He, Kaixuan Huang, Lin Yang, Wenhao Huang, Mengdi Wang

AI summary: Introduces CryptoBench, the first expert-curated dynamic benchmark for rigorously evaluating the real-world capabilities of LLM agents in the cryptocurrency domain. With 50 questions per month sorted into fine-grained categories, it assesses data-gathering and prediction abilities and reveals an imbalance between retrieval and prediction.

Abstract

This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.

2511.18225 2026-05-18 cs.LG stat.ML stat.OT

Adaptive Conformal Prediction for Quantum Machine Learning

Douglas Spencer, Samual Nicholls, Michele Caprio

AI summary: Proposes an adaptive quantum conformal prediction algorithm that addresses how time-varying noise in quantum processors undermines conformal guarantees, maintaining validity through repeated recalibration; experiments on an IBM quantum processor show that it reaches the target coverage with improved stability.

Comments Accepted at TMLR 05/2026. 27 pages, 5 figures

Journal ref Transactions on Machine Learning Research, May 2026, ISSN 2835-8856

Abstract

Quantum machine learning seeks to leverage quantum computers to improve upon classical machine learning algorithms. Currently, robust uncertainty quantification methods remain underdeveloped in the quantum domain, despite the critical need for reliable and trustworthy predictions. Recent work has introduced quantum conformal prediction, a framework that produces prediction sets that are guaranteed to contain the true outcome with a user-specified probability. In this work, we formalise how the time-varying noise inherent in quantum processors can undermine conformal guarantees, even when calibration and test data are exchangeable. To address this challenge, we draw on Adaptive Conformal Inference, a method which maintains validity over time via repeated recalibration. We introduce Adaptive Quantum Conformal Prediction (AQCP), an algorithm which provides asymptotic average coverage guarantees under arbitrary hardware noise conditions. Empirical studies on an IBM quantum processor demonstrate that AQCP achieves the target coverage level and exhibits greater stability than quantum conformal prediction.
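The recalibration idea behind Adaptive Conformal Inference can be sketched in a few lines. The loop below applies the standard ACI update, alpha_{t+1} = alpha_t + gamma * (alpha - err_t), to a stream of nonconformity scores whose noise level drifts over time. It is a classical toy simulation under stated assumptions, not the AQCP algorithm from the paper: the function name `adaptive_conformal_run`, the Gaussian scores, and the drift schedule are all hypothetical.

```python
import math
import random

def adaptive_conformal_run(cal_scores, test_scores, alpha=0.1, gamma=0.02):
    """Online ACI loop; returns the 0/1 miscoverage indicator for each step."""
    cal = sorted(cal_scores)
    n = len(cal)
    alpha_t = alpha
    errs = []
    for score in test_scores:
        if alpha_t <= 0.0:
            threshold = math.inf       # predict everything: always covers
        elif alpha_t >= 1.0:
            threshold = -math.inf      # empty prediction set: never covers
        else:
            # Conformal quantile of the calibration scores at level 1 - alpha_t.
            rank = math.ceil((1.0 - alpha_t) * (n + 1))
            threshold = cal[rank - 1] if rank <= n else math.inf
        err = 0 if score <= threshold else 1
        errs.append(err)
        # ACI update: widen sets after a miss, tighten them after a cover.
        alpha_t += gamma * (alpha - err)
    return errs

random.seed(0)
# Hypothetical scores: calibration noise is fixed, test noise drifts upward.
cal_scores = [abs(random.gauss(0.0, 1.0)) for _ in range(500)]
test_scores = [abs(random.gauss(0.0, 1.0 + t / 1000.0)) for t in range(2000)]

errs = adaptive_conformal_run(cal_scores, test_scores)
coverage = 1.0 - sum(errs) / len(errs)
print(f"empirical coverage: {coverage:.3f}")
```

Because the update telescopes, the long-run miscoverage rate stays within (max(alpha, 1 - alpha) + gamma) / (gamma * T) of the target alpha no matter how the scores drift, which is the kind of asymptotic average coverage guarantee the abstract refers to.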

2511.15887 2026-05-18 cs.CL

Mind the Motions: Benchmarking Theory-of-Mind in Everyday Body Language

Seungbeen Lee, Jinhong Jeong, Donghyun Kim, Yejin Son, Youngjae Yu

AI summary: Proposes the Motion2Mind framework, which uses an expert-curated body-language reference to evaluate how well machines interpret nonverbal cues, and finds a substantial gap in current AI systems.

Comments The authors identified issues in the current version and would like to withdraw the manuscript for substantial revision

Abstract

Our ability to interpret others' mental states through nonverbal cues (NVCs) is fundamental to our survival and social cohesion. While existing Theory of Mind (ToM) benchmarks have primarily focused on false-belief tasks and reasoning with asymmetric information, they overlook other mental states beyond belief and the rich tapestry of human nonverbal communication. We present Motion2Mind, a framework for evaluating the ToM capabilities of machines in interpreting NVCs. Leveraging an expert-curated body-language reference as a proxy knowledge base, we build Motion2Mind, a carefully curated video dataset with fine-grained nonverbal cue annotations paired with manually verified psychological interpretations. It encompasses 222 types of nonverbal cues and 397 mind states. Our evaluation reveals that current AI systems struggle significantly with NVC interpretation, exhibiting both a substantial performance gap in Detection and patterns of over-interpretation in Explanation compared to human annotators.

2511.13108 2026-05-18 cs.CV

DGS-Net: Distillation-Guided Gradient Surgery for CLIP Fine-Tuning in AI-Generated Image Detection

Jiazhen Yan, Ziqiang Li, Fan Wang, Boyu Wang, Ziwen He, Zhangjie Fu

AI summary: Proposes DGS-Net, which uses a gradient-space decomposition to separate harmful from beneficial descent directions and improve CLIP fine-tuning for AI-generated image detection; experiments show better detection performance and generalization than existing methods.

Comments Accepted by ICML 2026 Spotlight

Abstract

The rapid progress of generative models such as GANs and diffusion models has led to the widespread proliferation of AI-generated images, raising concerns about misinformation, privacy violations, and trust erosion in digital media. Although large-scale multimodal models like CLIP offer strong transferable representations for detecting synthetic content, fine-tuning them often induces catastrophic forgetting, which degrades pre-trained priors and limits cross-domain generalization. To address this issue, we propose the Distillation-guided Gradient Surgery Network (DGS-Net), a novel framework that preserves transferable pre-trained priors while suppressing task-irrelevant components. Specifically, we introduce a gradient-space decomposition that separates harmful and beneficial descent directions during optimization. By projecting task gradients onto the orthogonal complement of harmful directions and aligning with beneficial ones distilled from a frozen CLIP encoder, DGS-Net achieves unified optimization of prior preservation and irrelevant suppression. Extensive experiments on 50 generative models demonstrate that our method outperforms state-of-the-art approaches by an average margin of 6.6, achieving superior detection performance and generalization across diverse generation techniques.
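The projection step mentioned in the abstract, moving a task gradient onto the orthogonal complement of a harmful direction, reduces to a familiar linear-algebra operation. The sketch below handles a single harmful direction with plain Python lists; it is a generic illustration and not the DGS-Net implementation. The helper names and the example vectors are hypothetical, and how the harmful and beneficial directions are obtained is outside its scope.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(grad, harmful):
    """Remove from grad its component along the harmful direction."""
    norm_sq = dot(harmful, harmful)
    if norm_sq == 0.0:
        return list(grad)  # nothing to remove
    coeff = dot(grad, harmful) / norm_sq
    return [g - coeff * h for g, h in zip(grad, harmful)]

# Hypothetical task gradient and harmful direction.
task_grad = [1.0, 2.0, 3.0]
harmful_dir = [0.0, 1.0, 1.0]

safe_grad = project_orthogonal(task_grad, harmful_dir)
print(safe_grad, dot(safe_grad, harmful_dir))
```

The projected gradient has zero inner product with the harmful direction, so a small step along it leaves that direction unchanged to first order. With several harmful directions, one would first orthonormalize them and subtract each component in turn.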

2511.09378 2026-05-18 cs.AI cs.LG

Frontier Large Language Models Rival State-of-the-Art Planners

Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

AI summary: Shows that frontier large language models now rival classical planners on planning tasks: Gemini 3.1 Pro outperforms the strongest baseline on standard task descriptions, GPT-5 performs comparably to the baselines, and Gemini 3.1 Pro remains competitive on purely symbolic tasks, pointing to a steady rise in LLM planning ability.

Abstract

A series of influential studies established that large language models cannot reliably solve even simple planning tasks. We show that the latest generation of frontier models overturns this conclusion. We evaluate three families of frontier LLMs on a challenging set of planning tasks based on the most recent International Planning Competition following rigorous evaluation guidelines: solutions are verified with a validation tool, tasks are freshly created to avoid data contamination, and performance is compared against state-of-the-art classical planners. On standard task descriptions, Gemini 3.1 Pro outperforms the strongest planner baseline (245 vs. 234 solved tasks out of 360), while GPT-5 achieves comparable performance to the baselines. When all semantic information is obfuscated from the descriptions to test for pure symbolic planning, performance degrades but Gemini 3.1 Pro remains competitive with the strongest baselines. A longitudinal comparison across model generations -- from GPT-3.5, which solves zero tasks, to GPT-5 -- reveals a striking upward trajectory. Frontier LLMs might finally be able to plan; the question now is how far this capability will extend.

2510.25404 2026-05-18 cs.LG cs.AI

SemanticOpt: Towards LLM-Based Semantic Black-Box Optimization

Jamison Meindl, Yunsheng Tian, Tony Cui, Veronika Thost, Zhang-Wei Hong, Jie Chen, Wojciech Matusik, Mina Konaković Luković

AI summary: SemanticOpt fine-tunes LLMs on structured Bayesian optimization trajectories paired with natural-language context so that they can exploit semantic information, and on average outperforms classical optimizers and existing LLM-based approaches across several real-world problems.

Abstract

Optimizing an experimental system can be extremely challenging when each experiment is expensive, time-consuming, or difficult to perform. Existing optimizers for expensive black-box problems, such as Bayesian optimization, are typically limited to numerical or categorical observations. They do not make use of broader domain knowledge, such as expert heuristics, relevant scientific papers, or similar previous experiments. Large language models (LLMs) can interpret this semantic information; however, even state-of-the-art LLMs struggle to reliably solve black-box optimization problems. We introduce SemanticOpt, a framework for semantic black-box optimization that equips LLMs with optimization capabilities by fine-tuning them on structured Bayesian optimization trajectories augmented with natural-language context. SemanticOpt jointly uses numerical and semantic evidence when proposing new experiments, while producing interpretable predictions aligned with Bayesian surrogate models. We construct a range of real-world optimization problems paired with semantic information to create a diverse benchmark for evaluating semantic black-box optimization. Across these domains, SemanticOpt outperforms both classical optimizers and existing LLM-based approaches on average when given relevant semantic information.

2510.24457 2026-05-18 cs.RO cs.SY eess.SY

Flatness-based trajectory planning for 3D overhead cranes with friction compensation and collision avoidance

Jorge Vicente-Martinez, Edgar Ramirez-Laboreo

AI summary: Proposes an optimal trajectory generation method for 3D overhead cranes based on differential flatness, directly incorporating complex constraints such as nonlinear friction and collision avoidance to achieve fast and safe motion.

Comments 6 pages, 8 figures. Final version, after peer review and acceptance, submitted to the 23rd IFAC World Congress

Abstract

This paper presents an optimal trajectory generation method for 3D overhead cranes by leveraging differential flatness. This framework enables the direct inclusion of complex physical and dynamic constraints, such as nonlinear friction and collision avoidance for both payload and rope. Our approach allows for aggressive movements by constraining payload swing only at the final point. A comparative simulation study validates our approach, demonstrating that neglecting dry friction leads to actuator saturation and collisions. The results show that friction modeling is a fundamental requirement for fast and safe crane trajectories.

2510.23634 2026-05-18 cs.LG cs.AI

Monotone and Separable Set Functions: Characterizations and Neural Models

Soutrik Sarangi, Yonatan Sverdlov, Nadav Dym, Abir De

AI summary: Studies the design of set-to-vector functions that preserve the natural partial order on sets, proposes a model with a weakly MAS property, and demonstrates its advantages on set containment tasks.

Abstract

Motivated by applications for set containment problems, we consider the following fundamental problem: can we design set-to-vector functions so that the natural partial order on sets is preserved, namely $S \subseteq T$ if and only if $F(S) \leq F(T)$? We call functions satisfying this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds for the vector dimension necessary to obtain MAS functions, as a function of the cardinality of the multisets and the underlying ground set. In the important case of an infinite ground set, we show that MAS functions do not exist, but provide a model which provably enjoys a relaxed MAS property we name "weakly MAS" and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to construct universal models that are monotone by construction and can approximate all monotone set functions. Experimentally, we consider a variety of set containment tasks. The experiments show the benefit of using our model, in comparison with standard set models which do not incorporate set containment as an inductive bias. Our code is available at https://github.com/structlearning/MASNET.
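For a small finite ground set, the MAS property is easy to see with the simplest possible embedding: one coordinate per ground-set element, holding that element's multiplicity. The sketch below checks the "if and only if" condition exhaustively on multisets of size up to three. It only illustrates the definition and is not the paper's model; the ground set and the helper names are hypothetical, and every multiset is assumed to draw its elements from `GROUND_SET`.

```python
from collections import Counter
from itertools import combinations_with_replacement

GROUND_SET = ("a", "b", "c")

def embed(multiset):
    """Map a multiset to its vector of element multiplicities."""
    counts = Counter(multiset)
    return tuple(counts[x] for x in GROUND_SET)

def vec_leq(u, v):
    """Coordinate-wise partial order on vectors."""
    return all(a <= b for a, b in zip(u, v))

def is_submultiset(s, t):
    """True when every element of s appears in t at least as often."""
    cs, ct = Counter(s), Counter(t)
    return all(cs[x] <= ct[x] for x in cs)

# All multisets over the ground set with 0 to 3 elements.
multisets = [combo
             for k in range(4)
             for combo in combinations_with_replacement(GROUND_SET, k)]

# MAS property: containment holds exactly when the embeddings are ordered.
mas_holds = all(is_submultiset(s, t) == vec_leq(embed(s), embed(t))
                for s in multisets for t in multisets)
print(mas_holds)
```

This embedding needs as many dimensions as there are ground-set elements, which is why the question of how small the vector dimension can be, and what happens for an infinite ground set, is the interesting part.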

2510.13842 2026-05-18 cs.CL cs.AI cs.CR

ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

ADMIT: RAG基事实核查中的少样本知识污染攻击

Yutao Wu, Xiao Liu, Yinghui Li, Yifeng Gao, Yifan Ding, Jiale Ding, Xiang Zheng, Xingjun Ma

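AI summary: ADMIT is a few-shot knowledge poisoning attack that requires no access to the target LLMs or retrievers and flips fact-checking decisions even when authentic evidence is retrieved; experiments report an average attack success rate of 86% across multiple systems, exposing significant vulnerabilities in RAG-based fact checking.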

详情
AI中文摘要

ADMIT提出了一种无需访问目标模型的少样本知识污染攻击方法,在检索结果包含真实证据的情况下,通过注入少量语义对齐的对抗性内容来翻转事实核查决策,实验显示其在多种系统中平均攻击成功率达86%,揭示了基于RAG的事实核查系统的重大漏洞。

英文摘要

Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose ADMIT (ADversarial Multi-Injection Technique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of $0.93 \times 10^{-6}$, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

2510.08398 2026-05-18 cs.CV

VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

VideoVerse: 你的T2V生成器有世界模型能力来合成视频吗?

Zeqing Wang, Xinyu Wei, Bairui Li, Zhen Guo, Jinrui Zhang, Hongyang Wei, Keze Wang, Lei Zhang

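AI summary: VideoVerse evaluates whether T2V models understand complex temporal causality and world knowledge, revealing the gap between current models and the desired world modeling abilities.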

Comments 26 Pages, 10 Figures, 14 Tables

详情
AI中文摘要

最近文本到视频(T2V)生成技术的快速发展使训练模型具备了更强的世界模型能力,使现有基准逐渐无法评估最先进的T2V模型。首先,当前评估维度如每帧美学质量和时间一致性已无法区分最先进的T2V模型。其次,事件级时间因果性作为区分视频与其他模态的本质属性,仍基本未被探索。第三,现有基准缺乏对世界知识的系统评估,这是构建世界模型的关键能力。为解决这些问题,我们引入VideoVerse,一个专注于评估当前T2V模型是否能理解复杂时间因果性和世界知识以合成视频的综合基准。我们收集了跨不同领域的代表性视频,并提取其事件级描述,具有固有的时间因果性,然后由独立标注者重写为文本到视频提示。对于每个提示,我们设计了十个评估维度,涵盖动态和静态属性,最终得到300个提示、815个事件和793个评估问题。因此,通过使用现代视觉-语言模型开发了一个与人类偏好一致的基于问答的评估流程,系统地评估了领先的开源和闭源T2V系统,揭示了当前T2V模型与理想世界建模能力之间的差距。

英文摘要

The recent rapid advancement of Text-to-Video (T2V) generation technologies is endowing trained models with stronger world-modeling ability, making the existing benchmarks increasingly insufficient to evaluate state-of-the-art T2V models. First, current evaluation dimensions, such as per-frame aesthetic quality and temporal consistency, are no longer able to differentiate state-of-the-art T2V models. Second, event-level temporal causality, an essential property that differentiates videos from other modalities, remains largely unexplored. Third, existing benchmarks lack a systematic assessment of world knowledge, which is an essential capability for building world models. To address these issues, we introduce VideoVerse, a comprehensive benchmark focusing on evaluating whether current T2V models can understand complex temporal causality and world knowledge to synthesize videos. We collect representative videos across diverse domains and extract their event-level descriptions with inherent temporal causality, which are then rewritten into text-to-video prompts by independent annotators. For each prompt, we design ten evaluation dimensions covering dynamic and static properties, resulting in 300 prompts, 815 events, and 793 evaluation questions. Consequently, a human preference-aligned QA-based evaluation pipeline is developed by using modern vision-language models to systematically benchmark leading open- and closed-source T2V systems, revealing the current gap between T2V models and desired world modeling abilities.

2510.03161 2026-05-18 cs.CV cs.AI

UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

UniShield: 一种适应性多智能体框架用于统一的伪造图像检测与定位

Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang

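AI summary: UniShield uses a multi-agent framework to detect and localize image forgeries across diverse domains, improving the adaptiveness and practicality of detection.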

详情
AI中文摘要

UniShield通过多智能体框架实现跨领域伪造图像检测与定位,提升检测的适应性和实用性。

英文摘要

With the rapid advancements in image generation, synthetic images have become increasingly realistic, posing significant societal risks, such as misinformation and fraud. Forgery Image Detection and Localization (FIDL) thus emerges as essential for maintaining information integrity and societal security. Despite the impressive performance of existing domain-specific detection methods, their practical applicability remains limited, primarily due to their narrow specialization, poor cross-domain generalization, and the absence of an integrated adaptive framework. To address these issues, we propose UniShield, a novel multi-agent-based unified system capable of detecting and localizing image forgeries across diverse domains, including image manipulation, document manipulation, DeepFake, and AI-generated images. UniShield innovatively integrates a perception agent with a detection agent. The perception agent intelligently analyzes image features to dynamically select suitable detection models, while the detection agent consolidates various expert detectors into a unified framework and generates interpretable reports. Extensive experiments show that UniShield achieves state-of-the-art results, surpassing both existing unified approaches and domain-specific detectors, highlighting its superior practicality, adaptiveness, and scalability.

2510.02453 2026-05-18 cs.LG cs.AI cs.CL

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

如何训练你的导师:通过导师模型引导黑盒大语言模型

Parth Asawa, Alan Zhu, Abigail O'Neill, Matei Zaharia, Alexandros G. Dimakis, Joseph E. Gonzalez

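AI summary: This paper proposes Advisor Models, which train small open-weight models to generate dynamic, per-instance advice that improves black-box frontier models; experiments show gains on several tasks along with transferability and robustness.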

Comments International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

前沿语言模型作为黑盒服务部署,其权重无法修改,定制仅限于提示。我们引入Advisor Models,一种方法通过训练小型开放权重模型生成动态、实例特定的自然语言建议,以提升黑盒前沿模型的能力。Advisor Models将GPT-5.2在RuleArena(税务)任务上的性能提升27.4%,减少Gemini 3 Pro在SWE代理任务中的步骤24.6%,并在个性化GPT-5到用户偏好方面优于静态提示优化器(85-100% vs. 40-60%)。我们还发现顾问具有可迁移性:用低成本学生模型训练的顾问仍能将改进转移到前沿模型。此外,Advisor Models具有鲁棒性:在其他基准测试中未观察到降级,除了训练管道所训练的基准测试。我们的方法展示了如何以实用且经济有效的方式对黑盒前沿模型进行参数优化。

英文摘要

Frontier language models are deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. We introduce Advisor Models, a method to train small open-weight models to generate dynamic, per-instance natural language advice that improves the capabilities of black-box frontier models. Advisor Models improve GPT-5.2's performance on RuleArena (Taxes) by 27.4%, reduce Gemini 3 Pro's steps taken in SWE agent tasks by 24.6%, and outperform static prompt optimizers in personalizing GPT-5 to user preferences (85-100% vs. 40-60%). We also find that advisors are transferable: an advisor trained with a low-cost student model still transfers improvements to a frontier model. Moreover, Advisor Models are robust: we observe no degradation on benchmarks other than those the pipeline is trained on. Our method shows how to perform parametric optimization for black-box frontier models in a practical and cost-effective way.

2510.02278 2026-05-18 cs.LG

Metropolis-Scale Road Network Datasets for Fine-Grained Urban Traffic Modeling

用于精细城市交通建模的Metropolis级道路网络数据集

Fedor Velikonivtsev, Oleg Platonov, Ekaterina Alimaskina, Gleb Bazhenov, Liudmila Prokhorenkova

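AI summary: This paper introduces fine-grained road network datasets for two major cities to address the challenges of large-scale traffic forecasting, providing time series at 5-minute resolution and rich static road attributes.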

详情
AI中文摘要

交通动态建模是城市计算中的关键挑战,应用于实时交通管理到基础设施规划。然而,该领域的进展受到缺乏大规模公开数据集的限制,这些数据集能捕捉真实城市道路网络的细微特性。现有基准往往受限于规模小、依赖稀疏高速公路传感器、缺乏真实道路连接信息以及缺乏道路属性信息。为解决此问题,我们引入了两个主要城市精细化道路网络数据集,其规模高达10万条道路段,使用真实道路连接性,包含5分钟分辨率的交通速度和流量时间序列测量,并包含丰富的静态道路属性。这些数据集使深入分析时空交通模式成为可能,并可作为各种ML应用的基准。作为数据集实用性和挑战的实证演示,我们将其用于交通预测任务。我们数据集中的现实道路网络规模揭示了当前交通预测模型的重大可扩展性问题。为解决这些问题,我们提出了一种简单且高效的基线模型,不仅能够扩展到大规模道路图,还能实现与现有时空模型相媲美的预测性能。我们希望这些数据集能成为交通建模、城市计算和智能城市发展广泛研究的基础资源。

英文摘要

Modeling traffic dynamics is a critical challenge for urban computing, with applications from real-time traffic management to infrastructure planning. However, progress in this area is fundamentally constrained by a lack of large-scale public datasets that capture the subtle properties of real city road networks. Existing benchmarks are often limited by their small scale, reliance on sparse highway traffic sensors, absence of true road connectivity information, and lack of information about road properties. To address this issue, we introduce datasets representing fine-grained road networks of two major cities, which are unique in their scale (up to 100,000 road segments), use of real road connectivity, presence of time series measurements for both traffic speed and volume at a 5-minute resolution, and inclusion of rich static road attributes. These datasets enable in-depth analysis of spatiotemporal traffic patterns and can serve as benchmarks for various ML applications. As a practical demonstration of the utility of our datasets and the challenges they present, we use them for the task of traffic forecasting. The size of the real-world road networks in our datasets reveals significant scalability issues in current traffic forecasting models. To address them, we propose a simple and efficient baseline that not only scales to large road graphs but also achieves forecasting performance competitive with other established spatiotemporal models. We hope that the proposed datasets will serve as a foundational resource for a broad range of research in traffic modeling, urban computing, and smart city development.

2509.24550 2026-05-18 cs.LG cs.SD

Training-Free Multimodal Guidance for Video to Audio Generation

无需训练的多模态引导用于视频到音频生成

Eleonora Grassucci, Giuliano Galadini, Giordano Cicchetti, Aurelio Uncini, Fabio Antonacci, Danilo Comminiello

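AI summary: This paper proposes a training-free multimodal guidance mechanism for video-to-audio diffusion that uses the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text, improving generation quality and multimodal alignment.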

详情
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
AI中文摘要

视频到音频(V2A)生成旨在从静音视频中合成逼真且语义一致的音频,潜在应用于视频编辑、 Foley声音设计和辅助多媒体。尽管已有成果显著,现有方法或需在大规模配对数据集上进行昂贵的联合训练,或依赖成对相似性可能无法捕捉全局多模态一致性。本文提出一种新颖的无需训练的多模态引导机制,用于V2A扩散,利用模态嵌入所跨越的体积来强制视频、音频和文本之间的一致对齐。所提出的多模态扩散引导(MDG)提供了一种轻量级、即插即用的控制信号,可在任何预训练音频扩散模型上应用而无需重新训练。在VGGSound和AudioCaps上的实验表明,我们的MDG在感知质量和多模态对齐方面均优于基线,证明了联合多模态引导在V2A中的有效性。

英文摘要

Video-to-audio (V2A) generation aims to synthesize realistic and semantically aligned audio from silent videos, with potential applications in video editing, Foley sound design, and assistive multimedia. Despite excellent results, existing approaches either require costly joint training on large-scale paired datasets or rely on pairwise similarities that may fail to capture global multimodal coherence. In this work, we propose a novel training-free multimodal guidance mechanism for V2A diffusion that leverages the volume spanned by the modality embeddings to enforce unified alignment across video, audio, and text. The proposed multimodal diffusion guidance (MDG) provides a lightweight, plug-and-play control signal that can be applied on top of any pretrained audio diffusion model without retraining. Experiments on VGGSound and AudioCaps demonstrate that our MDG consistently improves perceptual quality and multimodal alignment compared to baselines, proving the effectiveness of joint multimodal guidance for V2A.
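
To make "the volume spanned by the modality embeddings" concrete, the sketch below computes the volume of the parallelotope spanned by a set of vectors as the square root of their Gram determinant. The abstract does not give MDG's exact formulation, so the normalization step, the function name `spanned_volume`, and the toy embeddings are all assumptions made for illustration.

```python
import numpy as np

def spanned_volume(embeddings):
    """Volume of the parallelotope spanned by the row vectors.

    Rows are L2-normalized first, so the result lies in [0, 1]:
    near 0 when the embeddings are almost parallel (well aligned),
    1 when they are mutually orthogonal.
    """
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    gram = e @ e.T                        # pairwise cosine similarities
    det = max(np.linalg.det(gram), 0.0)   # clip tiny negative round-off
    return float(np.sqrt(det))

# Hypothetical 4-dimensional embeddings for video, audio, and text.
aligned = [[1.0, 0.1, 0.0, 0.0], [1.0, 0.0, 0.1, 0.0], [1.0, 0.0, 0.0, 0.1]]
orthogonal = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(spanned_volume(aligned), spanned_volume(orthogonal))
```

Unlike a pairwise cosine similarity, this single scalar depends on all three embeddings at once, which is one way a guidance signal could capture joint rather than pairwise alignment.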

2509.23352 2026-05-18 cs.CV cs.AI

Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling

动态树RPO:通过结构化采样打破独立轨迹瓶颈

Xiaolong Fu, Lichen Ma, Zipeng Guo, ShiPing Dong, Lan Yang, Tan Lit Sin, Gaojing Zhou, Yu He, Jingling Fu, Shizhe Zhou, Junshi Huang, Jason Li

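AI summary: This paper proposes Dynamic-TreeRPO, which combines a tree-structured sampling strategy with dynamic noise intensities and the LayerTuning-RL paradigm to improve the quality and training efficiency of text-to-image generation, with strong results on multiple benchmarks.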

Comments Fig.3 updated

详情
AI中文摘要

将强化学习(RL)整合到流匹配模型中,推动了文本到图像(T2I)生成的质量提升。然而,这些进步往往以大量探索和低效采样策略为代价,由于采样组的微小变化。基于这一见解,我们提出了动态树RPO,实现了滑动窗口采样策略作为树状结构搜索,具有沿深度动态噪声强度。我们在此树结构中执行GRPO引导优化和受约束的随机微分方程(SDE)采样。通过共享树的前缀路径,我们的设计有效缓解了轨迹搜索的计算开销。通过为每个树层设计良好的噪声强度,动态树RPO可以在不增加额外计算成本的情况下增强探索的多样性。此外,我们无缝整合监督微调(SFT)和RL范式,构建我们的提议层调优RL,将SFT的损失函数重新表述为动态加权进展奖励模型(PRM),而不是单独的预训练方法。通过将此加权PRM与动态自适应剪裁边界关联,避免了动态树RPO中探索过程的干扰。得益于树状结构采样和层调优RL范式,我们的模型在有效方向上动态探索多样化的搜索空间。与现有基线相比,我们的方法在语义一致性、视觉保真度和人类偏好对齐方面在已建立的基准测试中表现出显著优势,包括HPS-v2.1、PickScore和ImageReward。特别是,我们的模型在这些基准测试中分别比SoTA高出4.9%、5.91%和8.66%,同时将训练效率提高了近50%。

英文摘要

The integration of Reinforcement Learning (RL) into flow matching models for text-to-image (T2I) generation has driven substantial advances in generation quality. However, these gains often come at the cost of exhaustive exploration and inefficient sampling strategies due to slight variation in the sampling group. Building on this insight, we propose Dynamic-TreeRPO, which implements the sliding-window sampling strategy as a tree-structured search with dynamic noise intensities along depth. We perform GRPO-guided optimization and constrained Stochastic Differential Equation (SDE) sampling within this tree structure. By sharing prefix paths of the tree, our design effectively amortizes the computational overhead of trajectory search. With well-designed noise intensities for each tree layer, Dynamic-TreeRPO can enhance the variation of exploration without any extra computational cost. Furthermore, we seamlessly integrate the Supervised Fine-Tuning (SFT) and RL paradigms within Dynamic-TreeRPO to construct our proposed LayerTuning-RL, reformulating the loss function of SFT as a dynamically weighted Progress Reward Model (PRM) rather than a separate pretraining method. By associating this weighted PRM with dynamic-adaptive clipping bounds, disruption of the exploration process in Dynamic-TreeRPO is avoided. Benefiting from the tree-structured sampling and the LayerTuning-RL paradigm, our model dynamically explores a diverse search space along effective directions. Compared to existing baselines, our approach demonstrates significant superiority in terms of semantic consistency, visual fidelity, and human preference alignment on established benchmarks, including HPS-v2.1, PickScore, and ImageReward. In particular, our model outperforms SoTA by $4.9\%$, $5.91\%$, and $8.66\%$ on those benchmarks, respectively, while improving the training efficiency by nearly $50\%$.
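
The claim that sharing prefix paths amortizes trajectory-search cost can be illustrated with a simplified step count. The cost model below is an assumption, not the paper's: it supposes a uniform tree with a fixed branching factor and a fixed number of denoising steps per layer, and compares it with sampling every leaf trajectory independently. The function names are hypothetical.

```python
def independent_cost(branching, depth, steps_per_layer):
    """Denoising steps when every leaf trajectory is sampled from scratch."""
    leaves = branching ** depth
    return leaves * depth * steps_per_layer

def tree_cost(branching, depth, steps_per_layer):
    """Denoising steps when trajectories share their common prefixes."""
    # Layer d (1-indexed) holds branching**d distinct segments.
    segments = sum(branching ** d for d in range(1, depth + 1))
    return segments * steps_per_layer

b, depth, k = 2, 3, 4  # hypothetical tree shape
print(independent_cost(b, depth, k), tree_cost(b, depth, k))
```

Under this model both settings produce the same number of leaf samples, and the saving grows with depth because early segments are computed once and reused by every descendant.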

2509.22739 2026-05-18 cs.CL cs.AI cs.LG stat.ML

Painless Activation Steering: An Automated, Lightweight Approach for Post-Training Large Language Models

无痛激活导向:一种自动化、轻量级的微调大型语言模型方法

Sasha Cui, Zhongren Chen

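AI summary: This paper proposes Painless Activation Steering, a family of fully automated methods that apply activation steering using any labeled dataset without human intervention; it reliably improves behavior tasks but not intelligence-oriented tasks.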

详情
AI中文摘要

语言模型通常通过权重或提示导向进行微调,但前者耗时昂贵,后者控制不精确且需手动试错。激活导向(AS)提供了一种更经济、快速且可控的替代方法,但现有技术需人工构造提示对或进行大量特征标注,不如RL和SFT等方法方便。本文引入Painless Activation Steering(PAS),一种完全自动的方法,可利用任何标注数据集进行AS,无需提示构造、特征标注或人工干预。在三个开源模型和18个任务上评估PAS,发现其在行为任务中性能可靠,但对智能任务效果有限。其内省变体(iPAS)带来了最强的因果导向效果(偏差任务10.1%、道德任务5.2%、对齐任务34.8%)。此外,PAS在上下文学习(ICL)和SFT基础上还提供了额外增益。PAS构建了一个快速、轻量的激活向量,可低成本训练、存储和激活。实验结果为AS的应用提供了明确的指导,展示了其作为实用自动化微调方法的潜力。

英文摘要

Language models (LMs) are typically post-trained for desired capabilities and behaviors via weight-based or prompt-based steering, but the former is time-consuming and expensive, and the latter is not precisely controllable and often requires manual trial-and-error. While activation steering (AS) promises a cheap, fast, and controllable alternative to the two existing post-training methods, current AS techniques require hand-crafted prompt pairs or labor-intensive feature annotation, making them less convenient than plug-and-play methods such as Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT). We introduce Painless Activation Steering (PAS), a family of fully automated methods that make AS readily usable with any given labeled dataset, with no need for prompt construction, feature labeling, or human intervention. We evaluate PAS on three open-weight models (Llama3.1-8B-Instruct, DeepSeek-R1-Distill-8B, and Nous-Hermes-2) and 18 tasks; we find that PAS reliably improves performance for behavior tasks, but not for intelligence-oriented tasks. The introspective variant (iPAS) delivers the strongest causal steering effects (10.1% on Bias, 5.2% on Morality, and 34.8% on Alignment). We also show PAS delivers additional gains on top of In-Context Learning (ICL) and SFT. PAS constructs a fast, lightweight activation vector that can be cheaply trained, easily stored, and activated at will. Our results provide a characterization of where AS helps, where it fails, and how to deploy it as a practical, automated LM post-training option.
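
As background on what an activation vector built from a labeled dataset can look like, here is a minimal sketch of the common difference-of-means recipe for activation steering. It is a generic illustration and not the PAS or iPAS procedure, which the abstract does not specify; the function names and the toy "hidden states" are assumptions.

```python
import numpy as np

def steering_vector(activations, labels):
    """Difference of class-mean activations (positive minus negative)."""
    acts = np.asarray(activations, dtype=float)
    mask = np.asarray(labels, dtype=bool)
    return acts[mask].mean(axis=0) - acts[~mask].mean(axis=0)

def steer(hidden, vector, strength=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return np.asarray(hidden, dtype=float) + strength * vector

# Toy stand-ins for hidden states collected at one layer (hypothetical data).
acts = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
labels = [1, 1, 0, 0]
v = steering_vector(acts, labels)
print(v, steer([0.5, 0.5], v, strength=0.5))
```

The vector is a single array per layer, which is consistent with the abstract's point that such a vector is cheap to store and can be switched on or off at will.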

2509.22151 2026-05-18 cs.CV cs.CL

MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

MultiMat: 多模态程序合成用于基于过程的材料生成

Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

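AI summary: MultiMat leverages large multimodal models for multimodal program synthesis, improving the efficiency and visual quality of generated procedural material graphs over text-only baselines.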

Comments Accepted at ICLR 2026 (poster)

详情
AI中文摘要

材料节点图是生成程序化材料的2D通道的程序,包括几何如粗糙度和位移图,以及反射率如albedo和导电性图。它们在计算机图形学中对于以参数化和任意分辨率表示虚拟3D对象的外观至关重要。特别是,它们的有向无环图结构和中间状态使交互式外观建模能够实现模块化和可解释的工作流程。然而,创建此类图仍然具有挑战性,通常需要专业培训。尽管最近的神经程序合成方法试图简化这一过程,但它们仅将图表示为文本程序,无法捕捉到节点图本质上视觉-空间性质,这使得它们对人类易于理解。为了解决这一差距,我们提出了MultiMat,一种多模态程序合成框架,利用大型多模态模型来处理视觉和文本图表示,以提高程序化材料图的生成效果。我们训练我们的模型在一个新的生产质量程序化材料数据集上,并将其与一种受约束的树搜索推理算法结合,该算法确保静态正确性的同时,能够高效地在程序空间中导航。我们的实验结果表明,我们的多模态程序合成方法在无条件和有条件图合成中比纯文本基线更高效,具有更高的视觉质量和保真度,建立了新的最先进性能。

英文摘要

Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

2509.21663 2026-05-18 cs.LG cs.AI cs.LO

Logic of Hypotheses: from Zero to Full Knowledge in Neurosymbolic Integration

假设逻辑:从零到全面知识的神经符号整合

Davide Bizzaro, Alessandro Daniele

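AI summary: This paper proposes the LoH language, which flexibly unifies data-driven rule learning with symbolic priors and expert knowledge for neurosymbolic integration, and compiles formulas into differentiable computational graphs via fuzzy logic.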

详情
AI中文摘要

神经符号整合(NeSy)融合神经网络学习与符号推理。该领域可分为注入手工规则的神经模型方法和从数据中诱导符号规则的方法。我们引入假设逻辑(LoH),一种新的语言,统一这些流派,使数据驱动规则学习与符号先验和专家知识的灵活整合成为可能。LoH扩展了命题逻辑语法,加入了可学习参数的选择运算符,可从选项池中选择子公式。利用模糊逻辑,LoH中的公式可直接编译为可微计算图,从而通过反向传播学习最优选择。该框架涵盖了一些现有NeSy模型,同时增加了任意程度的知识规范的可能性。此外,使用戈德尔模糊逻辑和最近开发的戈德尔技巧,可以将模型离散化为硬布尔值函数,而不会损失性能。我们对这些模型进行了实验分析,展示了在表格数据和两个具有感知组件的NeSy任务上的强大结果。

英文摘要

Neurosymbolic integration (NeSy) blends neural-network learning with symbolic reasoning. The field can be split between methods injecting hand-crafted rules into neural models, and methods inducing symbolic rules from data. We introduce Logic of Hypotheses (LoH), a novel language that unifies these strands, enabling the flexible integration of data-driven rule learning with symbolic priors and expert knowledge. LoH extends propositional logic syntax with a choice operator, which has learnable parameters and selects a subformula from a pool of options. Using fuzzy logic, formulas in LoH can be directly compiled into a differentiable computational graph, so the optimal choices can be learned via backpropagation. This framework subsumes some existing NeSy models, while adding the possibility of arbitrary degrees of knowledge specification. Moreover, the use of Gödel fuzzy logic and the recently developed Gödel trick yields models that can be discretized to hard Boolean-valued functions without any loss in performance. We provide experimental analysis on such models, showing strong results on tabular data and on two NeSy tasks with a perceptual component.
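
The idea of a learnable choice operator evaluated under fuzzy semantics can be sketched as follows. The Gödel connectives (minimum for conjunction, maximum for disjunction) are standard; everything else is an assumption for illustration, including the involutive negation `1 - a`, the softmax-weighted `choice`, and the argmax-based `hard_choice`. The abstract does not describe LoH's actual operator or the Gödel trick in enough detail to reproduce them.

```python
import numpy as np

def g_and(a, b):
    """Goedel t-norm: conjunction as minimum."""
    return np.minimum(a, b)

def g_or(a, b):
    """Goedel t-conorm: disjunction as maximum."""
    return np.maximum(a, b)

def g_not(a):
    """Involutive negation (an assumption; strict Goedel negation differs)."""
    return 1.0 - a

def choice(weights, options):
    """Soft selection among candidate subformula values via softmax weights."""
    w = np.asarray(weights, dtype=float)
    w = np.exp(w - w.max())
    w = w / w.sum()
    return float(np.dot(w, np.asarray(options, dtype=float)))

def hard_choice(weights, options):
    """Discretized selection: keep only the highest-weight option."""
    return float(options[int(np.argmax(weights))])

x, y = 0.8, 0.3
options = [g_and(x, y), g_or(x, y), g_not(x)]  # pool of candidate subformulas
weights = [0.0, 2.0, 0.0]                      # hypothetical learned parameters
print(choice(weights, options), hard_choice(weights, options))
```

In a real system the weights would be trained by backpropagation through the soft version, and the hard version would be read off afterwards to obtain a discrete formula.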

2509.21465 2026-05-18 cs.LG

Talking Trees: Reasoning-Assisted Induction of Decision Trees for Tabular Data

谈话树木:基于推理的决策树诱导用于表格数据

George Yakushev, Alina Shutova, Ivan Rubachev, Natalia Bereberdina, Renat Sergazinov, Artem Babenko

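AI summary: This paper uses reasoning-capable LLMs to induce decision trees for small tabular datasets, producing lightweight trees that outperform CART and recent non-greedy tree learners and remain competitive with tree ensembles on low-resource tabular problems.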

Comments Preprint, code at https://github.com/yandex-research/TalkingTrees

详情
AI中文摘要

表格基础模型正日益流行于低资源表格问题。这些模型通过在大量合成数据上预训练来弥补小训练数据集。通过预训练获得的先验知识提供了卓越的性能,但由此产生的模型成为难以解释的黑箱,且推理成本高。本文探索了一种替代策略:在代理设置中使用具备推理能力的LLM诱导小规模表格数据的决策树。我们设计了一组最小的工具用于构建、分析和操纵决策树。借助这些工具,LLM结合其先验知识和数据学习生成轻量级决策树,优于CART和最近的非贪心树学习器,并在低资源表格问题中与树集成竞争。虽然单个代理决策树能与最先进的黑箱模型竞争,但其还带有可读的推理跟踪,可用于检查偏见和数据泄露。此外,基于推理的LLM的创建过程允许将额外的人类输入纳入树中,而无需在数据中捕获。

英文摘要

Tabular foundation models are becoming increasingly popular for low-resource tabular problems. These models make up for small training datasets by pretraining on large volumes of synthetic data. The prior knowledge obtained via pretraining yields exceptional performance, but the resulting model becomes a black box that is difficult to interpret and costly at inference. In this work, we explore an alternative strategy: using reasoning-capable LLMs to induce decision trees for small tabular datasets in an agentic setup. We design a minimal set of tools for constructing, analyzing, and manipulating decision trees. Equipped with these tools, the LLM combines its prior knowledge with learning from data to produce a lightweight decision tree that outperforms CART and recent non-greedy tree learners and remains competitive with tree ensembles on low-resource tabular problems. While a single agentic decision tree is competitive with state-of-the-art black box models, it also comes with a human-readable reasoning trace that can be checked for biases and data leaks. Furthermore, the reasoning-based LLM's creation process allows for additional human input to be incorporated into the tree without it being captured in data.

2509.21173 2026-05-18 cs.CV cs.AI cs.LG

Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on VLMs Beyond Accuracy

精度降低可能更可靠:对VLMs量化影响的系统评估

Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Chokri Mraidha, Fabio Arnez

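AI summary: This paper systematically evaluates the impact of quantization on VLM reliability, finding that quantization can improve accuracy, calibration, OOD detection, and robustness to noise, but not robustness to covariate shift or spurious correlations.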

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言模型(VLMs)如CLIP已革新零样本分类和安全关键任务,如异常检测。然而,其高计算成本阻碍了实际部署。尽管量化是提高效率的标准方法,但其对超出简单Top-1准确率的可靠性指标的影响仍被忽视。本文通过超过70万次实验评估VLMs的量化效果,发现量化噪声反而能提升准确率、校准、异常检测和抗噪能力,但不改善协变量偏移或虚假相关性。我们利用这些反直觉发现,证明量化通过抑制高秩谱成分,迫使模型依赖稳健的低秩特征,从而提升泛化能力和抗噪能力,为利用量化部署更快速、可靠的VLMs提供了路径。

英文摘要

Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.
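
For readers unfamiliar with where "quantization noise" comes from, the sketch below applies symmetric per-tensor uniform quantization to a weight matrix and measures the rounding error it introduces. This is a generic textbook scheme, not the specific quantization methods evaluated in the paper; `fake_quantize` and the random weights are illustrative assumptions.

```python
import numpy as np

def fake_quantize(w, num_bits=8):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    w = np.asarray(w, dtype=float)
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))           # hypothetical weight matrix
for bits in (8, 4):
    err = np.abs(weights - fake_quantize(weights, bits)).max()
    print(bits, err)
```

The worst-case error per weight is half the quantization step, so lowering the bit width coarsens the grid and increases the perturbation applied to every weight.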