arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20606 2026-05-27 cs.CV

Mind Your Margin and Boundary: Are Your Distilled Datasets Truly Robust?

注意你的边界:你的蒸馏数据集真的鲁棒吗?

Muquan Li, Yingyi Ma, Yihong Huang, Hang Gou, Ke Qin, Ming Li, Yuan-Fang Li, Tao He

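AI summary: To address insufficient robustness in dataset distillation, proposes C$^2$R, a framework that combines an attack-aware curriculum with a contrastive robustness objective. By prioritizing the adversarial examples with the smallest robust margins and widening inter-class decision-boundary separation, it significantly improves robust accuracy.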

Comments Accepted to ICML 2026

AI Chinese abstract

数据集蒸馏(DD)将大型训练集压缩为小型合成集以进行高效训练,但大多数DD方法仅优化干净准确率而忽略鲁棒性。最近的鲁棒DD方法提高了鲁棒性,但通常面临较差的准确率-鲁棒性权衡,因为它们(i)统一对待所有对抗扰动样本,尽管鲁棒风险主要由接近零的鲁棒边界主导,以及(ii)没有明确增加攻击集中区域的决策边界类间分离。我们提出了对比课程鲁棒数据集蒸馏(C$^2$R),一个将攻击感知课程与对比鲁棒性目标相结合的框架。从鲁棒边界的角度,我们推导出一个扰动分数,近似每个样本的鲁棒铰链,从而能够优先考虑那些最直接驱动鲁棒误差的最小边界对抗样本。同时,一个类平衡的对比鲁棒性损失在明确扩大跨类别边界分离的同时强制执行对抗不变性。在CIFAR-10/100、Tiny-ImageNet和多个ImageNet-1K子集上进行的六种攻击实验表明,C$^2$R实现了最佳的鲁棒准确率,平均优于先前的鲁棒DD方法2.8%。

English abstract

Dataset distillation (DD) compresses a large training set into a small synthetic set for efficient training, but most DD methods optimize only clean accuracy and leave robustness uncontrolled. Recent robust DD methods improve robustness, yet they often suffer from a poor accuracy-robustness trade-off because they (i) treat all adversarially perturbed examples uniformly, despite robust risk being dominated by near-zero robust margins, and (ii) do not explicitly increase inter-class separation in the decision boundary where attacks concentrate. We present Contrastive Curriculum for Robust Dataset Distillation (C$^2$R), a framework that couples an attack-aware curriculum with a contrastive robustness objective. From a robust-margin perspective, we derive a perturbation score that approximates each sample's robust hinge, enabling a curriculum that prioritizes the smallest-margin adversaries that most directly drive robust error. In parallel, a class-balanced contrastive robustness loss enforces adversarial invariance while explicitly widening boundary separation across classes. Experiments on CIFAR-10/100, Tiny-ImageNet, and multiple ImageNet-1K subsets under six attacks show that C$^2$R achieves the best robust accuracy, outperforming prior robust DD by $2.8$% on average.

2605.20291 2026-05-27 cs.LG

Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

Weasel: 通过重要性-多样性数据选择实现Web智能体的域外泛化

Fatemeh Pesaran Zadeh, Seyeon Choi, Xing Han Lù, Siva Reddy, Gunhee Kim

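AI summary: Proposes Weasel, which selects a fixed-budget trajectory subset by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns. Combined with target-centered AXTree pruning and style-consistent rationale replacement, it improves the out-of-domain generalization of offline-trained web agents and reduces training cost.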

Comments ICML 2026. Code is released at https://github.com/fatemehpesaran310/weasel

AI Chinese abstract

大型语言模型(LLMs)使得Web智能体能够通过多步浏览器交互遵循自然语言目标。然而,在特定轨迹和领域上微调的智能体通常难以泛化到域外,且离线训练可能因噪声、冗余轨迹和长可访问性树(AXTree)状态而计算效率低下。为了解决这两个问题,我们提出了Weasel,一种用于Web智能体离线训练的轨迹选择方法。Weasel通过优化一个平衡状态、网站和交互模式上的单步重要性与成对多样性的目标,选择固定预算的轨迹步骤子集,并使用贪心算法高效求解。我们进一步通过目标中心AXTree剪枝(仅保留真实动作目标周围的内容)提高效率,并通过用模型生成的、风格一致的理由替换专家轨迹,缓解推理原生模型的风格不匹配问题。在AgentTrek和NNetNav训练数据集上,以及在WebArena、WorkArena和MiniWob中的评估,以及使用Qwen2.5-7B、Gemma3-4B和Qwen3-8B的实验表明,Weasel在降低训练成本的同时提高了域外性能,相比标准微调实现了约9.7-12.5倍的训练加速。我们在https://github.com/fatemehpesaran310/weasel提供代码。

English abstract

Large language models (LLMs) have enabled web agents that follow natural language goals through multi-step browser interactions. However, agents fine-tuned on specific trajectories and domains often struggle to generalize out of domain, and offline training can be compute-inefficient due to noisy, redundant trajectories and long accessibility-tree (AXTree) states. To address both issues, we propose Weasel, a trajectory selection method for offline training of web agents. Weasel selects a fixed-budget subset of trajectory steps by optimizing an objective that balances unary importance with pairwise diversity over states, websites, and interaction patterns, which is solved efficiently with a greedy algorithm. We further improve efficiency with target-centered AXTree pruning that keeps only content around the ground-truth action target, and we mitigate style mismatch for reasoning-native models by replacing expert traces with model-generated, style-consistent rationales. Across AgentTrek and NNetNav training datasets, evaluations in WebArena, WorkArena, and MiniWob, and experiments with Qwen2.5-7B, Gemma3-4B, and Qwen3-8B, Weasel improves out-of-domain performance while reducing training cost, producing roughly 9.7-12.5$\times$ training speedups over standard fine-tuning. We make the code available at https://github.com/fatemehpesaran310/weasel.
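The greedy selection described in this abstract can be illustrated with a small sketch. It is not the Weasel implementation: the objective form (summed importance minus a weighted sum of pairwise similarities), the function name `greedy_select`, the weight `lam`, and the toy scores are all assumptions made for illustration.

```python
def greedy_select(importance, similarity, budget, lam=1.0):
    """Greedily pick `budget` items, trading importance against redundancy.

    Assumed objective (hypothetical, not taken from the paper):
        sum of importance[i] over picked items
        - lam * sum of similarity[i][j] over picked pairs.
    """
    selected = []
    remaining = set(range(len(importance)))
    while remaining and len(selected) < budget:
        # Marginal gain of adding item i to the current selection.
        def gain(i):
            return importance[i] - lam * sum(similarity[i][j] for j in selected)

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: items 0 and 1 are both important but nearly identical.
importance = [0.9, 0.85, 0.4, 0.3]
similarity = [
    [1.0, 0.95, 0.1, 0.2],
    [0.95, 1.0, 0.1, 0.2],
    [0.1, 0.1, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
]
chosen = greedy_select(importance, similarity, budget=2)
print(chosen)
```

With the diversity penalty active, the second pick skips the near-duplicate item 1 in favor of the less important but more distinct item 2. Setting `lam=0.0` reduces the procedure to picking the top items by importance alone.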

2605.20255 2026-05-27 cs.LG cs.AI cs.HC cs.RO

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

行人行为不确定性下安全自动驾驶的多智能体强化学习

Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella

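AI summary: Uses Multi-Agent Proximal Policy Optimization (MAPPO) to co-train a self-driving car and 12 pedestrians, modeling jaywalking through hidden pedestrian traits. Co-training substantially lowers the collision rate relative to fixed-policy baselines, and a speed differential metric is shown to detect unanticipated jaywalking encounters.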

Comments Accepted to ICRA 2026 Workshop "8th Workshop on Long-term Human Motion Prediction"

AI Chinese abstract

自动驾驶汽车(SDC)的仿真测试通常依赖脚本化行人模型,这些模型无法捕捉真实过街行为的异质性和不确定性,限制了安全评估的真实性,尤其是对于由车辆无法观察到的潜在人格特质支配的乱穿马路行为。我们假设,通过多智能体强化学习(MARL)联合训练行人和SDC,相比针对固定行人策略训练,能产生更真实的交互场景,并且可预测与不可预测过街行为之间的差距可以直接从轨迹中测量。我们使用多智能体近端策略优化(MAPPO)联合训练一个SDC和12个行人:行人移动遵循脚本化的Dijkstra路径规划,而RL策略控制高层的前进/等待决策,乱穿马路概率取决于每个行人在回合开始时采样并隐藏于SDC的特质。在500回合评估中,联合训练的SDC达到78%的目标完成率,碰撞率为14%,而最佳基于规则的基线分别为35%和33%。速度差异指标显示,在近距离(0-3米)范围内,SDC在乱穿马路者附近比在人行横道使用者附近快2.65米/秒,表明乱穿马路遭遇未被预期。乱穿马路占过街事件的13%,但占碰撞的62%,并且联合训练相比单智能体RL减少了30%的碰撞,因为行人学会了在SDC高速接近时等待。

English abstract

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co-train an SDC and 12 pedestrians using Multi-Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high-level go/wait decisions, and jaywalking probability depends on a per-pedestrian trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule-based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co-training reduced collisions by 30% relative to single-agent RL as pedestrians learned to wait when the SDC approached at speed.
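The abstract does not define the speed differential metric precisely. One plausible reading, sketched below, is the difference between the mean SDC speed near jaywalkers and the mean SDC speed near crosswalk users, restricted to close-range encounters. The record layout, the function name `speed_differential`, and the sample values are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def speed_differential(encounters, max_distance=3.0):
    """Mean SDC speed near jaywalkers minus mean SDC speed near crosswalk users.

    `encounters` is a list of (pedestrian_kind, distance_m, sdc_speed_mps)
    tuples; only encounters within `max_distance` metres are counted.
    This definition is an assumption, not the paper's stated formula.
    """
    near = [e for e in encounters if 0.0 <= e[1] <= max_distance]
    jaywalker_speeds = [speed for kind, _, speed in near if kind == "jaywalker"]
    crosswalk_speeds = [speed for kind, _, speed in near if kind == "crosswalk"]
    return mean(jaywalker_speeds) - mean(crosswalk_speeds)


# Made-up encounter log for illustration only.
encounters = [
    ("jaywalker", 1.0, 6.0),
    ("jaywalker", 2.5, 5.0),
    ("crosswalk", 1.5, 3.0),
    ("crosswalk", 2.0, 2.0),
    ("jaywalker", 8.0, 9.0),  # beyond 3 m, so it is ignored
]
diff = speed_differential(encounters)
print(diff)
```

A positive value means the vehicle was moving faster when close to jaywalkers than when close to crosswalk users, which is the pattern the abstract interprets as a failure to anticipate jaywalking.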

2605.19969 2026-05-27 cs.LG

Your Neighbors Know: Leveraging Local Neighborhoods for Backdoor Detection in Decentralized Learning

你的邻居知道:利用局部邻居进行去中心化学习中的后门检测

Sayan Biswas, Antoine Boutet, Davide Frey, Romaric Gaudel, Rachid Guerraoui, Maxime Jacovella, Anne-Marie Kermarrec, Dimitri Lerévérend, François Taïani, Martijn de Vos

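AI summary: Proposes the Argus framework, in which nodes collaborate with their local neighbors to analyze model updates and use a structural similarity metric to distinguish true backdoors from false alarms caused by data heterogeneity, enabling backdoor detection in decentralized learning with theoretical convergence guarantees.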

Comments 34 pages, 10 figures

AI Chinese abstract

去中心化学习(DL)是一种新兴的机器学习范式,其中节点在没有中央服务器的情况下协作训练模型。然而,DL的协作性质使其容易受到后门攻击,即模型被训练为在标准输入上表现正常,而在遇到带有特定触发器的数据时执行隐藏的恶意行为。DL中的后门攻击仍未得到充分研究,现有防御措施常常忽视DL的约束。我们引入了Argus,一种原生于DL的新型后门检测框架,它既不需要中央协调器,也不需要预先知道触发器。在Argus中,诚实节点本地分析接收到的模型更新以识别潜在的后门触发器。然后,节点集体与邻居共享其触发器,并使用结构相似性度量将真实后门与数据异构性引起的误报区分开。一个关键见解是,假阳性触发器在不同参与者之间表现出不一致性,而真阳性触发器则呈现一致的模式。未通过此协作测试的模型更新被拒绝,持续恶意的发送者最终被驱逐。我们首次为特定于DL的后门检测机制提供了理论收敛保证,表明以高概率过滤可疑模型更新可保持与标准DL相当的收敛速度。我们在三个标准数据集上实现了Argus,并针对三个最先进的基线进行了评估。在各种设置下,与无防御相比,Argus将攻击成功率降低了多达90个百分点,同时将模型效用保持在全知神谕的5个百分点以内。此外,随着数据异构性的增加,Argus相对于基线的有效性也有所提高。

English abstract

Decentralized learning (DL) is an emerging machine learning paradigm where nodes collaboratively train models without a central server. However, the collaborative nature of DL makes it vulnerable to backdoor attacks, where a model is taught to behave normally on standard inputs while executing hidden, malicious actions when encountering data with specific triggers. Backdoor attacks in DL remain understudied and existing defenses often overlook DL constraints. We introduce Argus, a novel backdoor detection framework native to DL that requires neither a central coordinator nor prior knowledge of the trigger. In Argus, honest nodes locally analyze received model updates to identify potential backdoor triggers. Nodes then collectively share their triggers with their neighbors and use a structural similarity metric to separate true backdoors from false alarms induced by data heterogeneity. A key insight is that false positive triggers exhibit inconsistencies across participants while true positive ones show consistent patterns. Model updates that fail this collaborative test are rejected, and persistently malicious senders are eventually evicted. We provide the first theoretical convergence guarantees for a DL-specific backdoor detection mechanism, showing that filtering out suspicious model updates with high probability preserves a convergence rate comparable to standard DL. We implement and evaluate Argus on three standard datasets and against three state-of-the-art baselines. Across settings, Argus reduces attack success rates by up to 90 points compared to no defense, while preserving model utility within 5 percentage points of an omniscient oracle. Furthermore, the effectiveness of Argus compared to baselines improves as data heterogeneity increases.

2605.19908 2026-05-27 cs.CL

Where Does Authorship Signal Emerge in Encoder-Based Language Models?

作者身份信号在基于编码器的语言模型中出现在哪里?

Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

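AI summary: Uses mechanistic interpretability tools to study how different scoring mechanisms affect the performance of encoder-based authorship attribution models, finding that the scoring mechanism determines where the encoder consolidates authorship signal.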

Comments 12 pages, 6 figures. Under review

AI Chinese abstract

使用相同的预训练编码器、数据和损失进行微调的作者身份归因模型,其性能可能相差四倍,这仅取决于它们的评分机制。我们使用机械可解释性工具来解释这一差距。诸如词长、标点密度和功能词频率等风格特征在我们探测的每个模型的每一层中都是相似的,包括一个现成的控制编码器,这表明差距不是由它们的线性可读性解释的。相反,因果干预表明,评分器似乎决定了编码器在何处整合作者身份信号。平均池化迫使信号在早期到中期层整合,而后期交互则将其推迟到后期层。我们进一步从每个评分器的梯度结构中推导出这种差异,训练动态揭示了遵循该差异的不同学习轨迹。

English abstract

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model we probe, including an off-the-shelf control encoder, suggesting that the gap is not explained by their linear readability. Instead, causal intervention shows that the scorer appears to determine where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

2605.19186 2026-05-27 cs.AI

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

可发现的智能体知识——智能体知识图谱能力的形式化框架(扩展版)

Terry R. Payne, Valentina Tamma, Enrico Daga

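AI summary: Proposes a four-dimensional formal framework (Semantic Expressivity, Agentic Discoverability, Task-Relative Grounding, and Epistemic Trust Scope) and derives from it the Agentic Affordance Profile (AAP), a semantic layer above VoID and DCAT that supports principled KG selection, composition, and failure diagnosis at agent planning time.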

AI Chinese abstract

二十年前,语义网服务社区被问及具有不同本体承诺的智能体如何能够连贯地发现、组合和调用网络服务。答案是OWL-S和WSMO:形式化的能力描述,指定服务能做什么、智能体为了认知上合理调用必须已经知道什么,以及如何形式化地桥接本体不匹配。当前的知识图谱元数据标准(如VoID和DCAT)描述了知识图谱包含什么,但没有说明特定智能体能从中证明什么、空结果受什么封闭假设支配,或者智能体的任务词汇是否在模式中有基础。此外,在已部署的知识图谱中,控制模式描述逻辑和操作性的蕴涵机制可能不同:这是一种当前元数据不可见的认知失效模式。我们针对知识图谱环境重新审视并扩展这些见解,提出了一个四维形式化框架:语义表达性、智能体可发现性、任务相对基础性和认知信任范围,从中我们推导出智能体能力概况(AAP):一个位于VoID和DCAT之上的语义层,使智能体在规划时能够进行原则性的知识图谱选择、组合和故障诊断。这四个维度在单个智能体层面操作化了本体连续体的能力结构,特别用于知识图谱选择、组合和故障诊断。一个来自学术搜索任务的实例具体化了该框架,并通过五点研究议程指出了实现基于AAP的能力匹配规模化所需的形式化、计算和工程工作。

English abstract

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be epistemically sound, and how ontological mismatches could be formally bridged. Current KG metadata standards such as VoID and DCAT describe what a KG contains, yet say nothing about what a specific agent can prove from it, what closure assumptions govern empty results, or whether the agent's task vocabulary is grounded in the schema. Furthermore, in deployed KGs the governing schema DL and the operative entailment regime can diverge: an epistemic failure mode invisible to current metadata. We revisit and extend these insights for the KG setting with a four-dimensional formal framework: Semantic Expressivity, Agentic Discoverability, Task-Relative Grounding, and Epistemic Trust Scope, from which we derive the Agentic Affordance Profile (AAP): a semantic layer above VoID and DCAT enabling principled KG selection, composition, and failure diagnosis at agent planning time. The four dimensions operationalise the affordance structure of the Ontological Continuum at the individual-agent level, specifically for KG selection, composition, and failure diagnosis. A worked example drawn from a scholarly-search task concretely grounds the framework, and identifies the formal, computational, and engineering work needed to realise AAP-based affordance matching at scale through a five-point research agenda.

2605.17036 2026-05-27 cs.AI cs.LG cs.MA cs.SY eess.SY

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

自主AI代理在供应链管理中的可靠性与有效性

Carol Xuan Long, David Simchi-Levi, Feng Zhu, Huangyuan Su, Andre P. Calmon, Flavio P. Calmon

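AI summary: Studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game, finding that model capability is the dominant driver of performance but that average performance masks reliability risks. Introduces agent bullwhip and proposes a GRPO-based post-training framework to improve reliability.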

AI Chinese abstract

本文使用MIT啤酒游戏研究多级供应链中的自主生成式AI代理。我们确定了影响性能的四个推理时杠杆:模型选择、策略和护栏、集中数据共享以及提示工程。模型能力是主导因素:开箱即用的推理模型超越人类水平性能,优化后的推理模型相对于人类团队将成本降低高达67%。然而,强劲的平均性能掩盖了显著的可靠性风险。我们引入了代理牛鞭效应:自主多级系统中运行间决策不稳定性的放大。其中一个核心组成部分是决策牛鞭效应,即由随机代理决策而非客户需求变化产生的订单变异性部分。我们表明,即使需求路径固定,决策不稳定性也可以在固定时间点跨设施以及同一设施内随时间放大。重复采样(一种自然的测试时补救措施)未能显著减少这种不稳定性,这表明可靠性需要改变底层决策策略,而不仅仅是平均模型输出。为解决这一限制,我们提出了一种基于组相对策略优化(GRPO)的强化学习后训练框架,该框架使用系统级供应链奖励训练共享的基础LLM。后训练显著减少了尾部事件,抑制了代理牛鞭效应,并提高了自主供应链代理的可靠性。

English abstract

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce agent bullwhip: the amplification of run-to-run decision instability in autonomous multi-echelon systems. A central component is decision bullwhip, the portion of order variability generated by stochastic agent decisions rather than by changes in customer demand. We show that decision instability can amplify both across facilities at a fixed point in time and within the same facility over time, even when the demand path is held fixed. Repeated sampling, a natural test-time remedy, fails to meaningfully reduce this instability, suggesting that reliability requires changing the underlying decision policy rather than merely averaging over model outputs. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. Post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.

2605.16457 2026-05-27 cs.LG cs.AI cs.CV

Identifiable Token Correspondence for World Models

可辨识的令牌对应关系用于世界模型

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

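AI summary: Proposes Identifiable Token Correspondence (ITC), which formulates next-frame prediction as a structured assignment problem to address temporal inconsistency in long-horizon rollouts of token-based transformer world models, achieving state-of-the-art performance on four benchmarks.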

AI Chinese abstract

基于令牌的Transformer世界模型在视觉强化学习中表现出色,但常在长程推演中出现时间不一致性,包括对象重复、消失和变形。一个关键原因是大多数现有方法将下一帧预测纯粹视为令牌生成问题,而未考虑令牌在时间上的持续性。我们引入可辨识的令牌对应关系(ITC),这是一种用于基于令牌的Transformer世界模型的解码步骤,将下一帧预测建模为具有潜在令牌对应变量的结构化分配问题:每个下一帧令牌要么通过从上一帧复制令牌来解释,要么通过生成新令牌来解释。ITC保持Transformer架构和训练过程不变,可以添加到现有骨干网络上。我们的实验在4个具有挑战性的基准上展示了最先进的性能。所提出的方法在Craftax-classic基准上实现了72.5%的回报率和35.6%的分数,显著超过了之前的最佳结果67.4%和27.9%。我们在https://github.com/snu-mllab/Identifiable-Token-Correspondence上发布了源代码。

English abstract

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

2605.04880 2026-05-27 cs.LG cs.AI

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

SMDP中平均奖励强化学习的调和均值公式

Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka

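AI summary: For average reward reinforcement learning in infinite-horizon, non-episodic tasks, proposes a modified harmonic mean operator that addresses reward-rate computation in SMDPs when rewards and durations are non-stationary, and establishes its theoretical properties and empirical efficacy.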

Journal ref
https://alaworkshop2026.github.io/papers/ALA2026_paper_57.pdf
AI Chinese abstract

最近的研究重新激发并增强了对无限时域、非回合制(持续)任务中未折扣平均奖励强化学习算法的兴趣。半马尔可夫决策过程(SMDP)尤其引人关注。在SMDP中,离散动作随机产生奖励和持续时间,目标是优化平均奖励率。现有算法通过优化奖励与持续时间的比率来逼近这一目标。然而,当奖励和持续时间(在无限时域中)非平稳时,这种方法可能不正确。本文提出一种新颖的修正调和均值算子,即使在上述条件下也能正确计算奖励率。这产生了可以与SMDP一起工作的无模型学习算法,同时保持对随时间变化的非平稳奖励和持续时间分布的鲁棒性。我们证明了修正调和均值算子的理论性质,并通过实验与现有算法相比展示了其有效性。

English abstract

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.
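A small worked example may help show why a harmonic mean arises when reward rates are aggregated over steps of different duration. The sketch below uses the ordinary harmonic mean from Python's standard library, not the paper's modified operator, and it assumes the simplest possible case in which every step yields the same reward. The numbers are made up.

```python
from statistics import harmonic_mean, mean

# Two steps, each yielding the same reward but lasting different durations.
reward = 1.0
durations = [1.0, 4.0]

# Per-step reward rates (reward divided by duration).
rates = [reward / d for d in durations]

# True long-run rate: total reward divided by total time.
true_rate = (reward * len(durations)) / sum(durations)

# Arithmetic mean of per-step rates overstates the true rate here.
naive_rate = mean(rates)

# Harmonic mean of per-step rates recovers the true rate in this
# equal-reward case.
hm_rate = harmonic_mean(rates)

print(true_rate, naive_rate, hm_rate)
```

In this equal-reward case the total rate is n divided by the sum of the reciprocal rates, which is exactly the harmonic mean. How the paper's modified operator handles unequal or non-stationary rewards is not stated in the abstract and is not reproduced here.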

2605.02207 2026-05-27 cs.CV cs.AI cs.LG

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

MultiSense-Pneumo:面向资源受限环境中肺炎筛查的多模态学习框架

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

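AI summary: Proposes MultiSense-Pneumo, a multimodal prototype system that integrates symptoms, cough audio, speech, and chest radiographs and uses interpretable late fusion for pneumonia screening and triage support.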

AI Chinese abstract

肺炎仍然是全球发病率和死亡率的主要原因,尤其是在低资源环境中,那里缺乏影像学、实验室检测和专科护理。临床评估依赖于异质性证据,包括症状、呼吸模式、口头描述和胸部影像,使得一线筛查本质上是多模态的。然而,许多现有的计算方法仍然是单模态的,并且主要关注放射影像。在这项工作中,我们提出了MultiSense-Pneumo,一个面向肺炎筛查和分诊支持的多模态研究原型,它整合了结构化症状描述符、咳嗽音频、口语和胸部X光片。该系统结合了确定性症状分诊、基于LightGBM的声学分类、使用ResNet-18的域对抗放射影像分析、基于Transformer的语音识别以及可解释的后期融合算子。每个模态被转换为归一化的关注信号,并聚合为统一的筛查估计。融合权重是手动指定的,被视为启发式、可解释的参数,而不是学习或临床优化的值。MultiSense-Pneumo的设计考虑了在标准笔记本电脑级硬件上的离线执行,但并未作为经过部署验证或临床验证的诊断系统呈现。实验结果表明,在合成域偏移下,放射影像路径具有强大的组件级性能,同时也突出了重要的局限性,特别是咳嗽声学的异常类别召回率降低以及缺乏配对的端到端多模态患者评估。因此,MultiSense-Pneumo旨在作为筛查和分诊研究的框架和组件级原型。

English abstract

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.

2605.08146 2026-05-27 cs.CV cs.AI

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

VT-Bench:视觉-表格多模态学习的统一基准

Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

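AI summary: Proposes VT-Bench, the first visual-tabular multimodal benchmark, covering 14 datasets across 9 domains and evaluating 23 models, and highlights the challenges of visual-tabular learning.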

AI Chinese abstract

多模态学习在视觉-文本任务中引起了广泛关注。然而,在医疗和工业等高危领域起关键作用的视觉-表格数据仍未得到充分探索。本文介绍了VT-Bench,这是第一个用于标准化视觉-表格判别预测和生成推理任务的统一基准。VT-Bench汇集了9个领域(以医疗为中心,同时涵盖宠物、媒体和交通)的14个数据集,超过756K个样本。我们评估了23个代表性模型,包括单模态专家、专门的视觉-表格模型、通用视觉-语言模型(VLM)和工具增强方法,突出了视觉-表格学习的重大挑战。我们相信VT-Bench将激励社区构建更强大的多模态视觉-表格基础模型。基准:https://github.com/Ziyi-Jia990/VT-Bench

English abstract

Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing visual-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal visual-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench

2511.19741 2026-05-27 cs.CV

Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

通过最小切片传输计划的高效可迁移最优传输

Xinran Liu, Elaheh Akbari, Rocio Diaz Martin, Navid NaderiAlizadeh, Soheil Kolouri

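AI summary: Studies the min-Sliced Transport Plan (min-STP) framework and the transferability of optimized slicers across distribution pairs, and introduces a minibatch formulation to improve scalability, enabling one-shot matching and amortized training for point cloud alignment and flow-based generative modeling.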

AI Chinese abstract

最优传输(OT)为寻找分布之间的对应关系以及解决计算机视觉各个领域(包括形状分析、图像生成和多模态任务)中的匹配和对齐问题提供了强大的框架。然而,OT的计算成本阻碍了其可扩展性。基于切片的传输计划最近通过利用一维OT问题的闭式解,在降低计算成本方面显示出前景。这些方法优化一维投影(切片)以获得条件传输计划,该计划最小化环境空间中的传输成本。虽然高效,但这些方法留下了一个问题:学习到的最优切片器是否能够在分布偏移下迁移到新的分布对。理解这种可迁移性对于数据演变或跨密切相关的分布重复进行OT计算的情况至关重要。在本文中,我们研究了最小切片传输计划(min-STP)框架,并探讨了优化切片器的可迁移性:在一个分布对上训练的切片器能否为新的未见对产生有效的传输计划?理论上,我们证明优化后的切片器在数据分布轻微扰动下保持接近,从而能够在相关任务间高效迁移。为了进一步提高可扩展性,我们引入了min-STP的小批量公式,并提供了其准确性的统计保证。实验上,我们证明了可迁移的min-STP实现了强一次性匹配性能,并促进了点云对齐和基于流的生成建模的摊销训练。

English abstract

Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.

2605.18866 2026-05-27 cs.LG cs.AI

FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives

FLUIDSPLAT: 通过高斯原语从稀疏传感器重建物理场

Huaxi Huang, Meng Li, Zhengqing Gao, Xi Zhou, Xiaoshui Huang, Xiao Sun

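AI summary: Proposes FLUIDSPLAT, a model that uses Gaussian primitives as a spatially explicit intermediate representation to reconstruct flow fields from sparse sensor data, analyzes theoretically how representational capacity relates to observation count, and reduces error by 11-28% across multiple benchmarks.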

Comments 24 pages, 5 figures, preprint

AI Chinese abstract

从稀疏表面安装的传感器重建连续流场是空气动力学设计、流动控制和数字孪生仪器的核心。现有的神经方法通常将传感器读数编码为隐式潜在代码,空间可解释性差,且关于表示能力应如何随观测数量扩展的正式指导有限。受3D高斯泼溅启发,我们引入FLUIDSPLAT,一种传感器条件模型,预测K个各向异性高斯原语,形成单位划分支架,即流场的空间显式且可解释的中间表示。对于理想化的高斯原语估计器,我们证明了对于具有Sobolev光滑度s的场,逼近率为$O(K^{-s/d})$;结合N个含噪声观测,得到偏差$O(K^{-2s/d})$和方差$O(σ^{2}K/N)$的平方风险分解。平衡两者得到$K^{*}\!\sim\!(N/σ^{2})^{d/(2s+d)}$:在稀疏传感下原语数量不能自由增长,揭示了方差瓶颈,促使用状态条件残差解码器补充支架。在涵盖2D和3D的四个基准(圆柱绕流、AirfRANS、FlowBench LDC-3D和PhySense-Car 3D)上,FLUIDSPLAT相比多个强基线实现了11-28%的误差降低。

English abstract

Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin instrumentation. Existing neural methods for this task typically encode sensor readings into implicit latent codes with little spatial interpretability and limited formal guidance on how representational capacity should scale with observation count. Inspired by 3D Gaussian Splatting, we introduce FLUIDSPLAT, a sensor-conditioned model that predicts $K$ anisotropic Gaussian primitives forming a partition-of-unity scaffold, a spatially explicit and interpretable intermediate representation of the flow. For an idealized Gaussian primitive estimator, we prove an $O(K^{-s/d})$ approximation rate for fields with Sobolev smoothness $s$; incorporating $N$ noisy observations yields a squared-risk decomposition with bias $O(K^{-2s/d})$ and variance $O(\sigma^{2}K/N)$. Balancing the two yields $K^{*}\!\sim\!(N/\sigma^{2})^{d/(2s+d)}$: primitive count cannot grow freely under sparse sensing, revealing a variance bottleneck that motivates complementing the scaffold with a state-conditioned residual decoder. Across four benchmarks spanning 2D and 3D (cylinder flow, AirfRANS, FlowBench LDC-3D, and PhySense-Car 3D), FLUIDSPLAT achieves 11-28% error reduction over several strong baselines.
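The balancing step stated in the abstract can be spelled out as a short derivation. The constants $C_1$ and $C_2$ below are placeholders introduced here for illustration; the abstract gives only the orders of the two terms.

```latex
\begin{align*}
R(K) &\lesssim \underbrace{C_1 K^{-2s/d}}_{\text{bias}}
      + \underbrace{C_2 \frac{\sigma^{2} K}{N}}_{\text{variance}}, \\
K^{-2s/d} \asymp \frac{\sigma^{2} K}{N}
  &\;\Longrightarrow\; K^{(2s+d)/d} \asymp \frac{N}{\sigma^{2}}
  \;\Longrightarrow\; K^{*} \asymp \left(\frac{N}{\sigma^{2}}\right)^{d/(2s+d)}, \\
R(K^{*}) &\lesssim \left(\frac{N}{\sigma^{2}}\right)^{-2s/(2s+d)}.
\end{align*}
```

The bias term shrinks as $K$ grows while the variance term grows linearly in $K$, so the optimal primitive count is tied to the effective sample size $N/\sigma^{2}$. This is the sense in which primitive count cannot grow freely when sensors are sparse.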

2605.18592 2026-05-27 cs.LG cs.AI cs.CL

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS: 一种用于基于评分标准的强化学习的记忆增强评分标准改进系统

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

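AI summary: Proposes AMARIS, which improves rubrics by storing longitudinal training evidence in a persistent evaluation memory, and outperforms static, local-adaptive, and memory-ablated baselines on science, medicine, instruction following, and creative writing tasks.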

Comments Preprint. Under review

AI Chinese abstract

基于评分标准的奖励塑形为通过强化学习(RL)微调大语言模型(LLMs)提供了可解释且可编辑的奖励信号,但现有的自适应评分标准方法通常从局部证据(如当前批次或实例级比较)更新标准。这种局部视角丢弃了训练过程中产生的诊断信息,使得难以跟踪重复失败、评估之前的评分标准编辑或在早期标准饱和后提高标准。我们引入了AMARIS,一种记忆增强的评分标准改进系统,它将评分标准更新建立在纵向训练证据之上。AMARIS将轨迹分析、步骤级摘要和评分标准更新记录存储在持久化评估记忆中,然后检索最近和语义相关的历史来修订评分标准。我们在全局和实例特定评分标准设置下,在科学、医学、指令遵循和创意写作任务上评估了AMARIS。AMARIS在静态、局部自适应和无记忆基线上有所改进,例如在GPQA-Diamond上比最强基线高出+2.8分,在IFBench上高出+2.2分,同时分析表明记忆减少了振荡性的评分标准编辑,并支持从早期错误纠正到后期课程推进的进展。AMARIS与正常RL循环异步运行,相对于同步评分标准更新减少了阻塞延迟。

English abstract

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, A Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, such as +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.

2605.18359 2026-05-27 cs.CV

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

RAVE: 重新分配大型多模态模型中的视觉注意力

Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang

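AI summary: To address the cross-modal misallocation and intra-visual imbalance of standard attention in large multimodal models, proposes RAVE, a lightweight pair-gating mechanism that re-allocates visual attention through a learned query-key bias, improving by an average of 3 points across multiple multimodal benchmarks, with the largest gains on perception-intensive tasks.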

AI Chinese abstract

大型多模态模型(LMMs)继承了预训练语言骨干网络的自注意力机制,但标准注意力可能表现出次优的分配,包括文本和视觉证据之间的跨模态误分配以及视觉令牌之间的视觉内不平衡。我们提出RAVE(重新分配视觉注意力),一种轻量级成对门控机制,它为预softmax注意力分数添加一个学习到的查询-键偏置,该偏置基于预RoPE查询和键特征。RAVE不需要对骨干网络进行架构修改,并且可以与模型的其余部分进行端到端训练。在一系列多模态基准测试中,RAVE比标准注意力平均提升3个百分点,在感知密集型任务(包括多语言OCR、图表理解、文档VQA和场景文本VQA)上提升最大,这些任务中准确的视觉定位至关重要。

English abstract

Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query-key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks -- including multilingual OCR, chart understanding, document VQA, and scene text VQA -- where accurate visual grounding is critical.
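The core idea of adding a bias to pre-softmax attention scores over visual keys can be sketched for a single query in plain Python. This is not the RAVE implementation: in the paper the bias is learned from pre-RoPE query and key features, whereas here a fixed number stands in for the gate's output. The function names, the visual-token mask, and all values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, bias, is_visual):
    """Add bias[j] to the score of key j only if key j is a visual token."""
    adjusted = [s + (b if v else 0.0) for s, b, v in zip(scores, bias, is_visual)]
    return softmax(adjusted)


# One query attending to two text keys followed by two visual keys.
scores = [2.0, 1.0, 1.0, 0.5]
is_visual = [False, False, True, True]

# Stand-in for the learned pair-gate output (assumed values).
bias = [0.0, 0.0, 1.5, 0.0]

base = attention_weights(scores, [0.0] * 4, is_visual)
gated = attention_weights(scores, bias, is_visual)
print(base)
print(gated)
```

Because the bias is applied before the softmax, raising the score of one visual key increases its weight and proportionally reduces the weight on every other key, which is how attention mass gets re-allocated without changing the attention mechanism itself.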

2605.17774 2026-05-27 cs.CL

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

通过QLoRA微调将工具知识内化到小型语言模型中

Yuval Shemla, Ayal Yakobe, Tanmay Agarwal, Dhaval Patel, Kaoutar El Maghraoui

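AI Summary This paper studies internalizing tool knowledge in small language models through parameter-efficient QLoRA fine-tuning. On the AssetOpsBench benchmark, fine-tuned Gemma 4 E4B and Qwen3-4B models under description-free inference outperform an unfine-tuned baseline that receives full tool descriptions, cutting input length by 82.6% while improving planning scores.

Details
AI Chinese abstract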

大型语言模型越来越多地被用作代理系统中的规划组件,但当前的工具使用流程通常需要将完整的工具模式包含在每个提示中,这产生了大量的令牌开销,并限制了较小模型的实用性。本文研究了是否可以通过参数高效微调将工具使用知识内化到小型语言模型中,从而在推理时无需显式的工具描述即可进行结构化规划。使用AssetOpsBench作为主要基准,我们使用8位QLoRA在约1700个工具使用示例上微调了Gemma 4 E4B和Qwen3-4B,这些示例涵盖工具知识、问题到规划的映射以及执行风格的轨迹。我们在无描述推理下评估了生成的模型,其中提示完全省略了工具目录。微调后的模型优于接收完整工具描述的有信息未微调基线,输入长度减少了82.6%,同时提高了结构性和LLM评判的规划分数。在最佳的Gemma运行中,模型达到了0.65的AT-F1和3.88的整体评判分数,而信息基线的分数分别为0.47和2.88。Qwen3-4B达到了3.78的强劲整体评判分数,同时使用的内存比Gemma少62%,运行速度快2.5倍,尽管它在一般多项选择基准上也表现出更大的灾难性遗忘。额外的消融实验表明,LoRA秩控制着质量与保留之间的权衡,其中$r=32$最大化规划质量,而较小的秩保留了更多的一般知识。这些结果表明,对于固定的工具目录,QLoRA微调可以将工具知识从提示上下文转移到模型权重中,从而在保持或提高工具规划质量的同时,大幅减少推理开销。

English abstract

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5$\times$ faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality--retention trade-off, with $r=32$ maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.

2605.17617 2026-05-27 cs.AI

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind:从操作轨迹到自演化工作流自动化

Yiwen Zhu, Joyce Cahoon, Anna Pavlenko, Qiushi Bai, Nima Shahbazi, Divya Vermareddy, Meina Wang, Mathieu Demarne, Swati Bararia, Wenjing Wang, Hemkesh Vijaya Kumar, Hannah Lerner, Katherine Lin, Steve Toscano, Miso Cilimdzic, Subru Krishnan

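AI Summary Proposes GraphMind, a system that automates workflows for cloud database incident investigation through offline extraction of causal workflow graphs, online multi-agent traversal and execution, and Adaptive Traversal Reinforcement (ATR). It needs 8x less retrieval context than the baseline, and the ATR layer lowers the hallucination rate by 26%.

Details
AI Chinese abstract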

协调人员、工具和信息的复杂操作工作流是系统运行的核心,但由于需要大量人工输入且适应能力有限,端到端自动化仍然具有挑战性。我们提出GraphMind,一个以最小人力构建、执行和演化以行动为中心的工作流图的系统。该系统分三个阶段运行。首先,一个可扩展的离线管道从大量人工解决轨迹中提取结构化工作流图,捕捉问题、行动及其因果关系。其次,一个在线多智能体遍历引擎导航该图以动态构建和执行工作流,每一步结合图引导检索与LLM驱动的推理。第三,自适应遍历强化(ATR)强化成功的遍历路径,实现执行信息引导的图适应。GraphMind已部署在四个生产云数据库服务中用于事故调查。在93个保留事故上评估并通过盲审专家验证,该系统在缓解范围、幻觉率和诊断吞吐量方面优于Agentic Summary-RAG基线,同时需要少8倍的检索上下文。ATR层将幻觉率降低26%,证明工作流图可以从执行反馈中学习。一项为期12周的现场研究证实了实用价值:97%的评分对话在交互延迟内产生可操作结果。

English abstract

Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end-to-end automation remains challenging due to extensive human input requirements and limited ability to adapt over time. We present GraphMind, a system that constructs, executes, and evolves action-centric workflow graphs with minimal human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi-agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph-guided retrieval with LLM-driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths, enabling execution-informed graph adaptation. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on 93 held-out incidents and validated via blind expert review, the system outperforms an Agentic Summary-RAG baseline in mitigation reach, hallucination rate, and diagnostic throughput while requiring 8x less retrieval context. The ATR layer reduces hallucination rate by 26%, demonstrating that workflow graphs can learn from execution feedback. A 12-week field study confirms practical value: 97% of scored conversations yield actionable results within interactive latency.

2605.17482 2026-05-27 cs.CL cs.LG

RSD: A Local Triangulation Audit Primitive for Learned Vector Blocks

RSD:一种用于学习向量块的局部三角剖分审计原语

Seungmin Jin

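AI Summary Proposes RSD (Relational Semantic Decomposition), a local triangulation audit that fits simplex memberships and coordinate poles and combines a relation decoder with a coordinate residual to audit learned vector blocks for interpretability.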

Comments 8 pages, 1 figure. Revised version with clarified scope, experiments, and limitations

Details
AI Chinese abstract

局部XAI审计将有限的学习向量块与弱侧信号进行比较。基线方法如最近邻查找、低秩坐标模型和关系分解揭示了审计的不同部分。我们引入关系语义分解(简称RSD),作为学习向量块的局部三角剖分审计。给定坐标X和一个声明的有界弱亲和代理A,RSD拟合单纯形成员关系S和坐标极点C。它在关系解码器中重用S来解码A,并报告坐标残差R=X-SC。这产生了一个范围限定的审计单元:所选块、代理、解码器类和损失预算的兼容性,以及组件质量和残差读数。合成控制检查单纯形重构、代理解码和固定S残差分解。定理陈述、月份和狗/狼块说明了为什么低代理损失应结合组件质量、残差读数和块大小来解读。

English abstract

Local XAI audits compare a finite block of learned vectors with a weak side signal. Baselines such as nearest-neighbor lookup, low-rank coordinate models, and relation factorization expose different parts of this audit. We introduce Relational Semantic Decomposition, abbreviated as RSD, as a local triangulation audit for learned vector blocks. Given coordinates X and a declared bounded weak affinity proxy A, RSD fits simplex memberships S and coordinate poles C. It reuses S in a relation decoder for A and reports the coordinate residual R=X-SC. This yields a scoped audit unit: compatibility for the chosen block, proxy, decoder class, and loss budget, plus component mass and residual readouts. Synthetic controls check simplex reconstruction, proxy decoding, and fixed-S residual decomposition. The theorem-statement, month, and dog/wolf blocks illustrate why low proxy loss should be read with component mass, residual readouts, and block size.
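The coordinate residual R=X-SC from the RSD abstract can be made concrete with a small worked example. This is only a sketch: the matrices are invented toy values, the helper names `matmul`, `is_simplex`, and `residual` are hypothetical, and no fitting is done here (S and C are given rather than learned).

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def is_simplex(s, tol=1e-9):
    """True if every row is non-negative and sums to 1 (simplex memberships)."""
    return all(min(row) >= -tol and abs(sum(row) - 1.0) <= tol for row in s)


def residual(x, s, c):
    """Coordinate residual R = X - S C."""
    sc = matmul(s, c)
    return [[xi - pi for xi, pi in zip(xr, pr)] for xr, pr in zip(x, sc)]


# Toy block: three vectors in 2-D, two poles (all values are made up).
X = [[0.0, 0.0], [1.0, 1.5], [2.0, 2.0]]
S = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # simplex memberships per vector
C = [[0.0, 0.0], [2.0, 2.0]]              # coordinate poles

R = residual(X, S, C)
print(is_simplex(S), R)
```

Only the middle vector leaves a nonzero residual, because it sits off the segment between the two poles; that leftover is the kind of readout the abstract says should be read alongside the proxy loss.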

2605.05204 2026-05-27 cs.CV

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

D-OPSD:用于连续调优步蒸馏扩散模型的在线自蒸馏方法

Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi

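AI Summary Proposes D-OPSD, an on-policy self-distillation training paradigm that lets step-distilled diffusion models keep their few-step inference capability during supervised fine-tuning. The model acts as both teacher and student under different contexts (the student conditioned on text features only, the teacher on multimodal features), and training minimizes the discrepancy between their predicted distributions, so new concepts and styles are learned without sacrificing the original few-step capability.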

Comments Project Page: https://vvvvvjdy.github.io/d-opsd/

Details
AI Chinese abstract

高性能图像生成模型的格局目前正在从低效的多步模型转向高效的少步模型(例如,Z-Image-Turbo和FLUX.2-klein)。然而,这些模型对直接连续监督微调提出了重大挑战。例如,应用常用的微调技术会损害其固有的少步推理能力。为了解决这个问题,我们提出了D-OPSD,一种用于步蒸馏扩散模型的新颖训练范式,能够在监督微调期间实现在线策略学习。我们首先发现,以LLM/VLM作为编码器的现代扩散模型可以继承其编码器的上下文能力。这使我们能够将训练形式化为一个在线自蒸馏过程。具体来说,在训练期间,我们让模型在不同上下文中同时充当教师和学生,其中学生仅以文本特征为条件,而教师则以文本提示和目标图像的多模态特征为条件。训练最小化学生自身轨迹上的两个预测分布。通过在模型自己的轨迹上并在其自身监督下进行优化,D-OPSD使模型能够学习新的概念、风格等,而不会牺牲原始的少步能力。

English abstract

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying commonly used fine-tuning techniques would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models that use an LLM/VLM as the encoder can inherit that encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

2603.04639 2026-05-27 cs.RO cs.AI

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

RoboMME:机器人通用策略的记忆基准与理解

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai

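AI Summary Proposes the RoboMME benchmark, which evaluates the memory of VLA models in long-horizon, history-dependent scenarios through 16 manipulation tasks, and explores 14 memory-augmented variants built on the π0.5 backbone, finding that the effectiveness of memory representations is highly task-dependent.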

Comments Accepted to ICML 2026

Details
AI Chinese abstract

记忆对于长时程和历史依赖的机器人操作至关重要。这类任务通常涉及计数重复动作或操作暂时被遮挡的物体。最近的视觉-语言-动作(VLA)模型已开始融入记忆机制;然而,它们的评估仍局限于狭窄、非标准化的设置中。这限制了对记忆的系统理解、比较和进展测量。为应对这些挑战,我们引入了RoboMME:一个大规模标准化基准,用于评估和推进VLA模型在长时程、历史依赖场景中的表现。我们的基准包含16个操作任务,这些任务基于精心设计的分类法构建,该分类法评估时间、空间、对象和程序记忆。我们进一步开发了一套基于π0.5骨干网络的14种记忆增强VLA变体,以系统探索多种集成策略下的不同记忆表示。实验结果表明,记忆表示的有效性高度依赖于任务,每种设计在不同任务中都有独特的优势和局限性。视频和代码可在我们的网站https://robomme.github.io上找到。

English abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

2412.18084 2026-05-27 cs.AI

Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

属性增强指令微调用于大型语言模型的多任务分子生成

Xuan Lin, Long Chen, Yile Wang, Yangyang Chen, Xiangxiang Zeng

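AI Summary Proposes the PEIT framework, which uses multimodal-alignment pre-training and instruction tuning to improve LLM performance on molecule captioning, text-based molecule generation, molecular property prediction, and multi-constraint molecule generation.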

Comments 9

Details
AI Chinese abstract

大型语言模型(LLMs)广泛应用于各种自然语言处理任务,如问答和机器翻译。然而,由于缺乏标记数据以及生化属性手动标注的困难,分子生成任务的性能仍然有限,尤其是涉及多属性约束的任务。在这项工作中,我们提出了一个两步框架PEIT(属性增强指令微调)来改进LLMs在分子相关任务上的表现。第一步,我们使用文本描述、SMILES和生化属性作为多模态输入,通过对齐多模态表示来合成指令数据,预训练一个名为PEIT-GEN的模型。第二步,我们使用合成数据微调现有的开源LLMs,得到的PEIT-LLM可以处理分子描述、基于文本的分子生成、分子属性预测以及我们新提出的多约束分子生成任务。实验结果表明,我们的预训练模型PEIT-GEN在分子描述任务上优于MolT5、BioT5、MolCA和Text+Chem-T5,证明了文本描述、结构和生化属性之间的模态对齐良好。此外,PEIT-LLM在多任务分子生成中显示出有希望的改进,证明了PEIT框架在分子任务中的有效性。代码和附录可在https://github.com/chenlong164/PEIT获取。

English abstract

Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, the performance for molecule generation tasks is still limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework PEIT (Property Enhanced Instruction Tuning) to improve LLMs for molecule-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multi-modal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5, BioT5, MolCA and Text+Chem-T5 in molecule captioning, demonstrating that the modalities of textual descriptions, structures, and biochemical properties are well aligned. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the effectiveness of the PEIT framework for molecular tasks. The code and appendix are available at https://github.com/chenlong164/PEIT.

2604.27019 2026-05-27 cs.LG cs.CL cs.CR

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

动态对抗微调重组拒绝几何结构

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu, Junbin Yang, Haihua Shen, Yijun Yang

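AI Summary Studies how dynamic adversarial fine-tuning changes the refusal-control carriers (low-dimensional subspaces that causally modulate refusal) in safety-aligned language models, finding that R2D2 reorganizes the geometry along a robustness-utility frontier without establishing adaptive robustness.

Details
AI Chinese abstract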

安全对齐的语言模型必须拒绝有害请求而不广泛过度拒绝,但尚不清楚动态对抗微调如何改变拒绝控制载体:Kullback--Leibler (KL)约束方向或因果调节拒绝而不引起大规模安全提示分布偏移的小子空间。我们研究了一个7B骨干模型在监督微调(SFT)和鲁棒拒绝动态防御(R2D2)下的表现,将HarmBench、StrongREJECT和XSTest评估与五点几何测量、因果干预和稀疏自适应压力测试对齐。R2D2在早期检查点将固定源HarmBench攻击成功率降至零;然而,这些检查点也表现出最大的XSTest拒绝率并未能通过良性效用审计。后期检查点部分恢复了面向效用的行为,同时重新打开了攻击成功率,自适应GCG攻击成功率在第250步升至0.415,第500步升至0.613。内部地,R2D2在第100步之前保留了一个后期层的可接受拒绝控制载体,然后将最佳可接受载体迁移到早期层;SFT迁移更早但鲁棒性较差。有效秩保持在1.24附近,SFT表现出更大的主角漂移,这反对将维度扩展和漂移幅度作为充分解释。因果干预支持一个低维但效用耦合的载体。这些结果支持R2D2沿鲁棒性-效用前沿的几何重组解释,但未建立自适应鲁棒性。

English abstract

Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback--Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT) and Robust Refusal Dynamic Defense (R2D2), aligning HarmBench, StrongREJECT, and XSTest evaluations with five-anchor geometry measurements, causal interventions, and sparse adaptive stress tests. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints; however, these checkpoints also exhibit maximal XSTest refusal and fail a benign-utility audit. Later checkpoints partially recover utility-facing behavior while reopening attack success, with adaptive GCG attack success rate rising to 0.415 at step 250 and 0.613 at step 500. Internally, R2D2 preserves a late-layer admissible refusal-control carrier through step 100 and then relocates the best admissible carrier to an early layer; SFT relocates earlier yet remains less robust. Effective rank stays near 1.24, and SFT shows larger principal-angle drift, arguing against both dimensional expansion and drift magnitude as sufficient explanations. Causal interventions support a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account of R2D2 along a robustness--utility frontier, without establishing adaptive robustness.
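The abstract reports an effective rank near 1.24 but does not say which definition it uses. One common choice, assumed here purely for illustration, is the entropy-based effective rank: normalize the singular values into a distribution and exponentiate its Shannon entropy. The function name and the example spectra below are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    normalized singular values. Assumption: the paper may define
    effective rank differently.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


flat = effective_rank([1.0, 1.0])          # two equally strong directions
peaked = effective_rank([10.0, 0.5, 0.1])  # one dominant direction
print(flat, peaked)
```

Under this definition a value only slightly above 1 means a single direction carries most of the spectrum, which fits the abstract's reading of a low-dimensional carrier.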

2601.15891 2026-05-27 cs.CV

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

RadJEPA:基于联合嵌入预测架构的胸部X光放射学编码器

Anas Anwarul Haq Khan, Mariam Husain, Pratik Jalan, Kshitij Jadhav

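AI Summary Proposes RadJEPA, a self-supervised framework that needs no language supervision. Built on a Joint Embedding Predictive Architecture and pretrained on about 840K unlabeled chest X-ray images, it learns to predict latent representations of masked regions and matches or exceeds existing baselines on radiology report generation and other tasks.

Details
AI Chinese abstract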

视觉-语言预训练推动了医学图像表示学习的最新进展,但这种范式受限于配对图像-文本数据的可用性以及临床叙述的报告偏差。我们探究是否可以在没有任何语言监督的情况下学习具有竞争力的放射学编码器。我们引入了RadJEPA,这是一个基于联合嵌入预测架构的自监督框架,并在约84万张无标签胸部X光图像上进行了预训练。该模型学习从可见上下文区域预测掩码目标区域的潜在表示,这一目标与图像-文本对比预训练和DINO风格自蒸馏不同,它显式地建模表示空间中的条件结构。我们主要在冻结的Vicuna-7B解码器上进行放射学报告生成评估,并将其编码器替换到四个广泛使用的视觉-语言骨干网络(MedLLaVA、Qwen-2.5、BLIP-2和Phi-4)中。为完整性,我们还报告了疾病分类和语义分割结果。在两个数据集和四个指标上,RadJEPA匹配或超过了最强的纯图像和视觉-语言基线,同时使用ViT-B/14骨干网络和224×224分辨率。

English abstract

Vision-language pretraining has driven much of the recent progress in medical image representation learning, but this paradigm is constrained by the availability of paired image-text data and by the reporting bias of clinical narratives. We ask whether competitive radiology encoders can be learned without any language supervision. We introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture and pretrained on approximately 840K unlabeled chest X-ray images. The model learns to predict latent representations of masked target regions from a visible context region, an objective that differs from both image-text contrastive pretraining and DINO-style self-distillation by explicitly modelling conditional structure in representation space. We evaluate RadJEPA primarily on radiology report generation with a frozen Vicuna-7B decoder, and additionally substitute its encoder into four widely used vision-language backbones (MedLLaVA, Qwen-2.5, BLIP-2, and Phi-4). For completeness we also report disease classification and semantic segmentation results. Across two datasets and four metrics, RadJEPA matches or exceeds the strongest image-only and vision-language baselines while using a ViT-B/14 backbone at 224 x 224 resolution.

2605.15477 2026-05-27 cs.CV

EgoExo-WM: Unlocking Exo Video for Ego World Models

EgoExo-WM: 利用外部视频解锁自我世界模型

Danny Tran, Roberto Martín-Martín, Kristen Grauman

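AI Summary Proposes extracting structured body pose from exocentric video and converting that video to egocentric video with a human kinematics prior, so that abundant in-the-wild exocentric data can be used to train egocentric world models, significantly improving prediction quality and downstream planning performance.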

Comments Project Page: https://vision.cs.utexas.edu/projects/EgoExo-WM/

Details
AI Chinese abstract

自我中心世界模型为智能体预测和规划提供了有前景的方向,但其性能受限于自我中心训练数据的有限性以及人类物理动作的固有部分可观测性。相比之下,外部中心视频丰富且能很好地揭示身体姿态,但缺乏与智能体动作空间的直接对齐,且不是自我中心的。我们提出一种方法,通过从外部中心视频中提取结构化身体姿态作为动作表示,并基于人体运动学先验将外部中心视频转换为自我中心视频,从而弥合这一差距。这一过程使得将野外外部中心数据整合到自我中心世界模型训练中成为可能。我们表明,使用转换后的数据训练全身动作条件自我中心世界模型显著提高了预测质量和下游规划性能,其中我们推断实现视觉目标状态所需的身体姿态序列。我们的方法为利用任意野外视频构建强大的自我中心世界模型铺平了道路,进一步推动了机器人规划和增强现实指导等应用。

English abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent's action space -- and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

2605.14473 2026-05-27 cs.CL cs.AI

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

RAG 能知道检索错误吗?知识冲突下的上下文合规性诊断

Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei

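AI Summary Proposes Context-Driven Decomposition (CDD), which probes and intervenes at inference time on conflicts between retrieved context and parametric knowledge in retrieval-augmented generation, revealing context-compliance patterns and improving robustness.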

Comments 12 pages, 4 figures, 3 tables

Details
AI Chinese abstract

检索增强生成(RAG)中的上下文合规机制发生在检索到的上下文主导最终答案时,即使它与模型的参数化知识冲突。仅凭准确性并不能揭示在这种冲突下检索到的上下文如何因果性地塑造答案。我们引入了上下文驱动分解(CDD),这是一种在推理时运行的信念分解探针,并作为受控检索冲突的干预机制。通过跨Epi-Scale压力测试、TruthfulQA错误概念注入和跨模型重复实验,CDD揭示了三种模式。P1:上下文合规性在对抗性上界设置中是可测量的,标准RAG在TruthfulQA错误概念注入(N=500)上达到15.0%的准确率。P2:对抗性准确率提升跨模型家族迁移——CDD提高了Gemini-2.5-Flash以及Claude Haiku/Sonnet/Opus的准确率——但理由-答案因果耦合不迁移。CDD在Gemini-2.5-Flash上达到64.1%的错误注入因果敏感性,而所有三种Claude变体的敏感性落在[-3%, +7%]范围内,表明Claude侧的准确率提升通过一种与显式冲突解决轨迹不同的机制运作。P3:显式冲突分解提高了时间漂移和噪声干扰下的鲁棒性,CDD在完整Epi-Scale对抗性基准上对时间偏移达到71.3%,对干扰证据达到69.9%。这三种模式将上下文合规性识别为一个结构轴,沿此轴可以对标准RAG进行探测和干预,区别于检索质量或单一方法鲁棒性问题,并激励发布Epi-Scale以跨模型家族和检索管道进行系统研究。

English abstract

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families -- CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus -- but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.

2605.11651 2026-05-27 cs.CV cs.AI cs.CL

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Hide to See: 面向VLM蒸馏中视觉锚定思维的推理前缀掩码

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

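AI Summary Proposes a reasoning-prefix masking distillation framework that masks the student model's salient reasoning prefixes, forcing it to rely more on visual evidence while reasoning. This mitigates visual forgetting in long reasoning traces and improves multimodal reasoning performance.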

Comments Pre-print

Details
AI Chinese abstract

近期VLM中的思考-回答方法(如Qwen3-VL-Thinking)通过在最终答案前利用中间推理步骤来提升推理性能,但其计算成本显著增加,尤其是对于较大的VLM。为了将这种能力蒸馏到紧凑的思考-回答VLM中,一个主要目标是提高学生在整个推理轨迹中利用视觉证据的能力,因为长思考-回答轨迹存在视觉遗忘问题。为此,我们引入了一种新颖的思考-回答蒸馏框架,通过掩码学生模型的显著推理前缀,鼓励学生将思考锚定在视觉信息上。为了补偿这种被掩码的文本线索,学生在蒸馏过程中被鼓励更多地依赖视觉证据作为替代信息源。我们的掩码策略包括:1)逐token的显著推理前缀掩码,针对每个下一token预测选择性掩码高影响力的推理前缀;2)自调节掩码预算调度,根据教师-学生分布之间的差异(即蒸馏难度)逐渐增加掩码规模。在蒸馏阶段,学生模型由我们的显著推理前缀掩码引导,该掩码同时阻塞未来token和显著推理线索,替代了自回归语言建模中使用的标准因果掩码。实验结果表明,我们的方法在多模态推理基准上优于最近的开源VLM、VLM蒸馏和自蒸馏方法,进一步分析证实了学生思考过程中视觉利用的增强。

English abstract

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between teacher--student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization throughout the student's thinking process.
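The mask described above, which blocks future tokens and also a few salient reasoning-prefix tokens per prediction, might look roughly like the sketch below. Everything here is an assumption made for illustration: the `influence` scores, the fixed `budget`, and the function name are hypothetical, and neither the paper's saliency measure nor its self-paced budget schedule is reproduced.

```python
def salient_prefix_mask(influence, reasoning_positions, budget):
    """Build a boolean attention mask (True = may attend).

    Starts from the standard causal mask, then for each query position i
    blocks the `budget` earlier reasoning tokens with the highest
    (hypothetical) influence score influence[i][j].
    """
    n = len(influence)
    allowed = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        candidates = [j for j in reasoning_positions if j < i]
        candidates.sort(key=lambda j: influence[i][j], reverse=True)
        for j in candidates[:budget]:
            allowed[i][j] = False
    return allowed


# Toy sequence: position 0 is a visual token, 1-2 are reasoning tokens,
# 3 is the token being predicted from. Scores are made up.
influence = [
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
]
mask = salient_prefix_mask(influence, reasoning_positions=[1, 2], budget=1)
for row in mask:
    print(row)
```

The visual token at position 0 is never masked, so later positions lose their most influential textual cue and keep access to the visual one, which is the pressure toward visual evidence that the abstract describes.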

2605.14799 2026-05-27 cs.CV cs.CR cs.SI

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

视觉Mamba能否提升AI生成图像检测?一项深入研究

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid

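AI Summary This study systematically evaluates Vision Mamba models for AI-generated image detection, comparing them with CNN, ViT, and VLM-based detectors and analyzing accuracy, efficiency, and generalizability.

Details
AI Chinese abstract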

近年来,计算机视觉取得了显著进展,这得益于卷积神经网络(CNN)、生成对抗网络(GAN)、扩散架构、视觉Transformer(ViT)以及最近的视觉-语言模型(VLM)等创新架构的发展。这一进展无疑有助于创造越来越逼真和多样化的视觉内容。然而,图像生成的这些进步也引发了对错误信息、身份盗窃以及隐私和安全威胁等潜在滥用的担忧。与此同时,基于Mamba的架构已成为这一快速发展的领域中一系列图像分析任务(包括分类、分割、医学成像、目标检测和图像恢复)的多功能工具。然而,与已有技术相比,它们在识别AI生成图像方面的潜力仍相对未被探索。本研究提供了用于AI生成图像检测的Vision Mamba模型的系统评估和比较分析。我们在多样化的数据集和合成图像源上,将多个Vision Mamba变体与代表性的CNN、ViT和基于VLM的检测器进行基准测试,重点关注准确性、效率以及跨不同图像类型和生成模型的泛化能力等关键指标。通过这一全面分析,我们旨在阐明Vision Mamba相对于已有方法在检测AI生成图像方面的适用性、准确性和效率上的优势与局限性。总体而言,我们的研究结果突显了Vision Mamba作为区分真实与AI生成视觉内容的系统组件的潜力和当前局限性。这项研究对于在区分真实与AI生成内容成为重大挑战的时代提升检测能力至关重要。

English abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

2605.14664 2026-05-27 cs.CV

MiVE: Multiscale Vision-language features for reference-guided video Editing

MiVE:用于参考引导视频编辑的多尺度视觉语言特征

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

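AI Summary Proposes the MiVE framework, which feeds multiscale hierarchical VLM features (early layers preserve spatial detail, deeper layers encode global semantics) into a unified self-attention Diffusion Transformer. This addresses the modality gap and the loss of fine-grained information, reaching state-of-the-art performance in reference-guided video editing.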

Comments ICML 2026

Details
AI Chinese abstract

参考引导视频编辑以源视频、文本指令和参考图像作为输入,要求模型在忠实执行指令编辑的同时保留原始运动及未编辑内容。现有方法分为两种范式,各有固有限制:解耦编码器在处理指令和视觉内容时存在模态间隙,而统一视觉语言编码器仅依赖最终层表示,丢失了细粒度空间细节。我们观察到VLM层层次化地编码互补信息——早期层捕获局部空间细节,对精确编辑至关重要;深层编码全局语义,用于指令理解。基于此洞察,我们提出MiVE(用于参考引导视频编辑的多尺度视觉语言特征),该框架将VLM重新用作多尺度特征提取器。MiVE从Qwen3-VL提取层次特征,并将其集成到统一的自注意力扩散Transformer中,消除了交叉注意力设计中固有的模态不匹配。实验表明,MiVE在人类偏好中排名最高,性能优于学术方法和商业系统,达到了最先进水平。

English abstract

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

2605.14480 2026-05-27 cs.CL

Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

《会同馆华夷译语》中的跨语言转写与音系表征

Ji-eun Kim

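AI Summary This study treats the Huìtóngguǎnxì Huáyíyìyǔ as a coherent multilingual transcription system. Through digitization and phonological analysis, it reveals cross-linguistic regularities in the Main and Supplementary Transcriptions and argues for the system's value as evidence for historical phonology.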

Comments 49 pages; 1 figure; 40 tables; SLE2019; under review

Details
AI Chinese abstract

目的:本研究调查《会同馆华夷译语》(HHY)的转写原则,该系列多语词汇集由明朝政府在15至16世纪间编纂,用于译员培训。本研究不将HHY视为孤立语言材料的集合,而是将其视为一个连贯的多语言转写系统,通过汉字表征非汉语语言的口语形式。方法:将HHY的绝大部分数字化,并与汉语音韵范畴对齐。对先前各语言部分的重建进行批判性审查,并整合到一个统一的比较数据库中。分析聚焦于八个语言部分中主要转写(MT)和补充转写(ST)的跨语言规律。结果:MT通常表征与当时汉语音节结构兼容的音,而ST主要编码与汉语音系兼容性较差的语音特征。分析进一步表明,汉语音韵范畴在外语转写中的使用比先前假设的更为灵活。因此,HHY作为一种相对系统的语音近似方法,而非汉语音系对非汉语语言的直接投射。结论:HHY可被分析为一个内部结构化的转写系统,而不仅仅是词汇集的集合。更广泛地说,该研究表明历史转写系统可为历史音系学提供宝贵证据,尤其对于历史记录有限的亚洲语言。

English abstract

Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.

2605.13779 2026-05-27 cs.LG cs.AI cs.DC

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT:用于训练和服务数百万LLM的托管基础设施

Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

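AI summary Presents MinT, a system that manages LoRA adapters to enable efficient training and online serving on large base models, supporting policy catalogs at the million scale.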

Comments 30 pages, technical report

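Details
AI abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for low-rank adaptation (LoRA) post-training and online serving. MinT targets settings in which many trained policies are produced on top of a small number of expensive base-model deployments. Rather than materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through the rollout, update, export, evaluation, serving, and rollback stages, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of the base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and by 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall-clock time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports addressable catalogs at the 10^6 scale (with measured single-engine sweeps up to 100K) and thousand-adapter active waves at cluster scale, treats cold loading as scheduled service work, and uses packed MoE LoRA tensors that speed up live engine loading by 8.5-8.7x. MinT therefore manages million-scale LoRA policy catalogs while training and serving selected adapter revisions on shared 1T-class base models.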

English abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
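
The claim that a rank-1 adapter can be under 1% of base-model size follows from standard LoRA arithmetic: for a weight matrix of shape d_out x d_in, a rank-r adapter stores two factors with r * (d_in + d_out) parameters in total. The sketch below works through that arithmetic. The layer shapes (32 blocks, four 4096 x 4096 projections each) and the function names are hypothetical and chosen only for illustration; they are not taken from the MinT report, and the ratio covers only the adapted matrices, not the whole model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def adapter_fraction(layer_shapes, rank: int) -> float:
    """Adapter parameters divided by the parameters of the adapted base weights."""
    base = sum(d_in * d_out for d_in, d_out in layer_shapes)
    adapter = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes)
    return adapter / base


# Hypothetical model: 32 blocks, each with four 4096 x 4096 projections.
shapes = [(4096, 4096)] * (32 * 4)

for rank in (1, 8, 64):
    # Print the adapter size as a percentage of the adapted base weights.
    print(f"rank={rank:>2}  fraction={adapter_fraction(shapes, rank):.4%}")
```

Under these assumed shapes the rank-1 adapter is about 0.05% of the adapted weights, and the fraction grows linearly with rank, which is why moving only the adapter between training and serving is cheap compared with moving a merged checkpoint.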