arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.07211 2026-05-27 cs.LG

CompassDPO: Dynamics-Controlled Direct Preference Optimization for Robust Safety Alignment

Jilong Liu, Yonghui Yang, Pengyang Shao, Wenjian Tao, Hao Zhan, Haokai Ma, Wei Qin, Richang Hong

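AI summary: Proposes CompassDPO, which uses the implicit DPO reward margin to control update direction and magnitude without an external reward model, improving robustness on PKU-SafeRLHF and other benchmarks.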

Abstract

Direct Preference Optimization (DPO) has become a standard framework for safety alignment, but its reliance on pairwise preference updates makes training sensitive to imperfect supervision. Existing robust DPO methods often address this sensitivity through global loss corrections or external data-level interventions, while largely overlooking how unreliable comparisons distort batch-level optimization dynamics. We propose CompassDPO, a reward-free DPO framework that stabilizes preference optimization through dynamics control. Using the implicit DPO reward margin as a training-time compass, CompassDPO regulates sample influence along two complementary axes: update direction and update magnitude. For directional control, it applies sparse, budgeted, and warm-up delayed loss mixing to attenuate update components that conflict with the emerging preference direction. For magnitude control, it adaptively soft-winsorizes high-loss tail contributions, reducing tail dominance while preserving useful gradients from hard examples. Both mechanisms use only signals available during standard DPO training and require no external reward model or additional supervision. Experiments on PKU-SafeRLHF across four backbones and multiple out-of-distribution safety benchmarks show that CompassDPO consistently improves robustness over vanilla DPO and strong DPO-family baselines, especially under controlled label-flip noise. Code is available at https://anonymous.4open.science/r/CompassDPO-4D00
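The magnitude-control idea can be illustrated with a minimal sketch. The margin and loss below follow the standard DPO formulation; the `soft_winsorize` rule (a batch-quantile cap with a reduced slope above it) and its `quantile` and `slope` parameters are hypothetical stand-ins, since the abstract does not give CompassDPO's actual rule. The directional loss mixing is not sketched here.

```python
import math

def dpo_margin(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Implicit DPO reward margin: beta * (chosen log-ratio - rejected log-ratio)."""
    return beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))

def dpo_loss(margin):
    """Standard DPO loss, -log(sigmoid(margin)), in a numerically stable form."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def soft_winsorize(losses, quantile=0.9, slope=0.25):
    """Hypothetical soft winsorization: keep losses up to a batch quantile,
    and shrink (rather than clip) the excess above it by `slope`."""
    ordered = sorted(losses)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    cap = ordered[idx]
    return [l if l <= cap else cap + slope * (l - cap) for l in losses]

# One pair with a strongly negative margin (for example, a flipped label).
batch = [dpo_loss(m) for m in (2.0, 1.0, 0.5, -6.0)]
damped = soft_winsorize(batch, quantile=0.5)
```

With a hard clip the tail example would contribute no gradient above the cap; the reduced slope keeps a smaller one, which is one way to read "preserving useful gradients from hard examples".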

2603.03711 2026-05-27 cs.CV

LDP-Slicing: Local Differential Privacy for Images via Randomized Bit-Plane Slicing

Yuanming Cao, Chengqi Li, Wenbo He

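AI summary: Proposes the LDP-Slicing framework, which decomposes pixel values into binary bit-planes and applies a local differential privacy mechanism to them; combined with a perceptual obfuscation module and a privacy budget allocation strategy, it satisfies rigorous pixel-level ε-LDP while keeping images highly useful for downstream tasks.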

Abstract

Local Differential Privacy (LDP) is the gold-standard trust model for privacy-preserving machine learning because it guarantees privacy at the data source. However, its application to image data has long been considered impractical due to the high dimensionality of pixel space. Canonical LDP mechanisms are designed for low-dimensional data, resulting in severe utility degradation when applied to high-dimensional pixel spaces. This paper demonstrates that this utility loss is not inherent to LDP but arises from its application to an inappropriate data representation. We introduce LDP-Slicing, a lightweight, training-free framework that resolves this domain mismatch. Our key insight is to decompose pixel values into a sequence of binary bit-planes. This transformation allows us to apply the LDP mechanism directly to the bit-level representation. To further strengthen privacy and preserve utility, we integrate a perceptual obfuscation module that mitigates human-perceivable leakage and an optimization-based privacy budget allocation strategy. This pipeline satisfies rigorous pixel-level $\varepsilon$-LDP while producing images that retain high utility for downstream tasks. Extensive experiments on face recognition and image classification demonstrate that LDP-Slicing outperforms existing DP/LDP baselines under comparable privacy budgets, with negligible computational overhead.
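A minimal sketch of the bit-plane idea, assuming binary randomized response as the per-bit mechanism and a hand-picked per-bit budget list. Both are illustrative assumptions: the abstract does not name the mechanism, and the paper's allocation is optimization-based. Under basic composition, the per-pixel budget in this sketch is the sum of the per-bit budgets.

```python
import math
import random

def to_bit_planes(pixel, bits=8):
    """Split an integer pixel into its bits, most significant first."""
    return [(pixel >> i) & 1 for i in range(bits - 1, -1, -1)]

def from_bit_planes(planes):
    """Reassemble an integer from bits given most significant first."""
    value = 0
    for b in planes:
        value = (value << 1) | b
    return value

def randomized_response(bit, epsilon, rng):
    """Keep the bit with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < keep else 1 - bit

def privatize_pixel(pixel, epsilons, rng):
    """Perturb each bit-plane of one pixel with its own budget."""
    planes = to_bit_planes(pixel, bits=len(epsilons))
    noisy = [randomized_response(b, e, rng) for b, e in zip(planes, epsilons)]
    return from_bit_planes(noisy)

# Hypothetical allocation: more budget (less noise) on the high-order bits.
epsilons = [2.0, 2.0, 1.5, 1.0, 0.5, 0.25, 0.25, 0.25]
rng = random.Random(0)
noisy_pixel = privatize_pixel(200, epsilons, rng)
```

Spending more budget on high-order bits reflects the fact that flipping the most significant bit moves an 8-bit value by 128, while flipping the least significant one moves it by 1.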

2602.13626 2026-05-27 cs.LG

Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

Mingqiao Zhang, Qiyao Peng, Yinghui Wang, Hongtao Liu, Yumeng Wang

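AI summary: Identifies and studies benchmark data leakage in LLM-based recommender systems, simulating several leakage scenarios to show how leakage can mislead performance evaluation.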

Abstract

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly exaggerating the model's capability. In contrast, domain-irrelevant leakage typically degrades recommendation accuracy, highlighting the complex and contingent nature of this contamination. Our findings reveal that data leakage acts as a critical, previously unaccounted-for factor in LLM-based recommendation, which could impact the true model performance. We release our code at https://github.com/yusba1/LLMRec-Data-Leakage.

2603.03585 2026-05-27 cs.CL cs.AI

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas

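AI summary: Proposes the BeliefSim framework, which builds demographic belief profiles from psychology-informed taxonomies and survey priors and, through prompt-based conditioning and post-training adaptation, simulates demographic misinformation susceptibility from beliefs, with alignment up to 92%.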

Comments Paper Under Review

Abstract

Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed misinformation taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility alignment and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with alignment up to 92%.

2603.03194 2026-05-27 cs.CL cs.SE

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

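AI summary: Proposes the BeyondSWE benchmark, which evaluates code agents on complex software engineering tasks such as cross-repository and domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, and finds that current agents still fall well short at using external information to make precise code changes.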

Comments Benchmark: https://huggingface.co/datasets/AweAI-Team/BeyondSWE. Repo: https://github.com/AweAI-Team/BeyondSWE. Scaffold: https://github.com/AweAI-Team/AweAgent

Abstract

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving many software engineering tasks that require external knowledge or broader repository-level changes under-tested. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, spanning both broader knowledge scope and broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches an average score of 46.12, while the strongest Codex harness with GPT-5.4 (xhigh) reaches 56.65 under a search-aware prompt. To study whether external information access closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.

2601.09001 2026-05-27 cs.CL

Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

Pedro Memoli Buffa, Luciano Del Corro

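AI summary: Proposes using output-entropy traces as an inference-time signal: a lightweight classifier predicts instance correctness, and the predictions are aggregated into domain-level accuracy estimates; the approach is validated for monitoring and data acquisition on STEM reasoning benchmarks.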

Abstract

Deploying LLMs raises two coupled challenges: (1) monitoring--estimating where a model underperforms as traffic and domains drift--and (2) improvement--prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-$k$ logprobs) and summarize it with different statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions ($k\in\{1,2,3,4\}$; all $\binom{10}{k}$ combinations), with different classifier models and features, across nine LLMs from six families (3B--20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains, providing evidence that output-entropy profiles are an accessible signal for scalable monitoring and for targeted data acquisition.
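The pipeline described above can be sketched in a few lines. The sketch assumes that top-k logprobs are renormalized before computing entropy and that mean, max, and standard deviation serve as summary statistics; both are illustrative choices, not details confirmed by the abstract. The classifier is left out, and `domain_accuracy_estimate` simply averages whatever per-instance probabilities it would produce.

```python
import math

def topk_entropy(logprobs):
    """Shannon entropy (nats) of one token's top-k logprobs, renormalized to sum to 1."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def summarize_trace(entropies):
    """Collapse a per-token entropy trace into a few fixed-size features."""
    n = len(entropies)
    mean = sum(entropies) / n
    var = sum((h - mean) ** 2 for h in entropies) / n
    return {"mean": mean, "max": max(entropies), "std": math.sqrt(var)}

def domain_accuracy_estimate(predicted_probs):
    """Average predicted correctness probabilities over one domain slice."""
    return sum(predicted_probs) / len(predicted_probs)

# Number of train compositions: all subsets of size 1..4 out of 10 benchmarks.
n_compositions = sum(math.comb(10, k) for k in range(1, 5))  # 10 + 45 + 120 + 210 = 385
```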

2603.01800 2026-05-27 cs.LG cs.AI stat.ML stat.OT

Phase-Type Variational Autoencoders for Heavy-Tailed Data

Abdelhakim Ziani, András Horváth, Paolo Ballarini

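AI summary: Proposes the Phase-Type Variational Autoencoder (PH-VAE), which models the decoder distribution as a latent-conditioned phase-type distribution (the absorption time of a continuous-time Markov chain) that adapts flexibly to heavy-tailed behavior and outperforms Gaussian, Student-t, and extreme-value VAE decoders on synthetic and real-world benchmarks.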

Abstract

Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions, such as Gaussian distributions, that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution, defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its finite-range tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately approximates diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling observed tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
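For readers unfamiliar with phase-type distributions, the textbook definition is reproduced below. The notation (initial distribution $\boldsymbol{\alpha}$ with no mass on the absorbing state, sub-generator $\mathbf{T}$ over $m$ transient states) is the standard one and is an assumption here; the abstract does not say how PH-VAE maps the latent variable to these parameters.

```latex
% Phase-type distribution PH(\boldsymbol{\alpha}, \mathbf{T}): absorption time of a CTMC
% \boldsymbol{\alpha}: initial distribution over m transient states (row vector)
% \mathbf{T}: sub-generator matrix, \mathbf{t}: exit-rate vector
\mathbf{t} = -\mathbf{T}\mathbf{1}, \qquad
F(x) = 1 - \boldsymbol{\alpha}\, e^{\mathbf{T}x}\, \mathbf{1}, \qquad
f(x) = \boldsymbol{\alpha}\, e^{\mathbf{T}x}\, \mathbf{t}, \qquad x \ge 0

% Special case with diagonal \mathbf{T} = \mathrm{diag}(-\lambda_1, \dots, -\lambda_m)
% (hyperexponential): a mixture of exponential time scales
f(x) = \sum_{i=1}^{m} \alpha_i \lambda_i e^{-\lambda_i x}
```

The diagonal case shows how several exponential time scales combine: slow rates $\lambda_i$ put mass far into the tail over a finite range, even though the tail of any finite phase-type distribution eventually decays exponentially.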

2602.22190 2026-05-27 cs.LG cs.AI cs.CL

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

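AI summary: Proposes the GUI-Libra training recipe, which uses action-aware SFT and KL regularization in partially verifiable RL to address the conflict between reasoning and grounding and the partial-verifiability problem that GUI agents face on long-horizon navigation tasks, significantly improving step-wise accuracy and task completion.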

Comments 57 pages, 17 figures

Abstract

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.

2602.19206 2026-05-27 cs.CV

GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

Zehao Deng, An Liu, Yan Wang

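AI summary: Proposes the GS-CLIP framework, which uses geometry-aware prompts and synergistic view representation learning to detect geometric anomalies in 3D point clouds in the zero-shot setting.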

Comments Accepted by CVPR 2026

Abstract

Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face two challenges: the projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection. Code is available at https://github.com/zhushengxinyue/GS-CLIP.

2602.21636 2026-05-27 cs.CV

Axial-Centric Cross-Plane Attention for 3D Medical Image Classification

Doyoung Park, Jinsoo Kim, Lohendran Baskaran

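AI summary: Proposes an axial-centric cross-plane attention architecture that models asymmetric dependencies between anatomical planes and outperforms existing 3D and multi-plane models on the MedMNIST3D benchmark.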

Comments Submitted to BMVC 2026

Abstract

Abridged: Clinicians commonly interpret 3D medical images by examining multiple anatomical planes rather than relying on volumetric views. In clinical CT workflows, the axial plane often serves as the primary diagnostic reference, while the auxiliary planes provide complementary spatial context. However, many existing 3D deep learning approaches either process volumetric data holistically or assign equal importance to all planes, failing to reflect this asymmetric, axial-centric interpretation strategy. To address this, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that models asymmetric dependencies between anatomical planes. The architecture employs MedDINOv3, pretrained on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information, while axial-centric cross-plane transformer encoders selectively condition axial representations on complementary auxiliary representations. Experiments on six datasets from the MedMNIST3D benchmark show that the proposed method consistently outperforms existing 3D and multi-plane models in ACC and AUC. A lightweight variant, AC-Tiny, achieves competitive performance with substantially fewer trainable parameters, suggesting that architectural design contributes more to performance gains than increased model scale. Ablation studies further validate the importance of axial-centric querying, QKV allocation, directional cross-plane fusion, residual-free cross-attention, and classification head design. Slice-level Grad-CAM visualizations demonstrate that the model identifies diagnostically relevant regions across all planes. These findings highlight the value of aligning architectural design with clinical interpretation workflows for robust 3D medical image analysis.

2602.21450 2026-05-27 cs.RO cs.SY eess.SY

Vector Fields for Path Following on Lie Groups with Application in Robot Control

Felipe Bartelt, Luciano C. A. Pimenta, Weijia Yao, Vinicius M. Gonçalves

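AI summary: For path following on Lie groups, proposes a general vector-field framework that guarantees convergence to a desired parametric curve from almost all initial conditions with continuous motion along the path, gives a minimally represented control input on SE(3), and is validated in robotic manipulator experiments.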

Comments Manuscript revised: new title, reframed abstract and introduction for robotics, and added a coauthor

Abstract

Many robotic systems allow independent control of position and orientation (pose), including omnidirectional aerial vehicles, underwater robots, and manipulator end-effectors. In many applications, these systems must follow a continuous sequence of poses, leading to either trajectory-tracking or path-following formulations. Compared to trajectory tracking, path following offers important practical advantages. In particular, we focus on the problem of path following on Lie groups. Considering the robots as rigid bodies moving in 3D space, this path-following problem can be posed as a problem of designing guiding vector fields on the matrix Lie group SE(3). In this paper, we develop a general vector-field framework for path following on connected matrix Lie groups, of which SE(3) is a prominent special case. The proposed vector field guarantees convergence to a desired parametric curve from almost all initial conditions while ensuring continuous motion along the path. Another interesting feature is that, as opposed to previous works, the control input is "minimal" in terms of representation and closer to the engineering application (e.g., the body twist in the case of SE(3)). After establishing the general case, the framework is then specialized to SE(3), of special interest in robotics, yielding an efficient algorithm suitable for real-time robotic control. Experiments with a robotic manipulator tracking complex pose paths demonstrate the effectiveness of the approach. An open-source implementation is also provided.

2510.07231 2026-05-27 cs.CL cs.AI

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim

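AI summary: Proposes the EconCausal benchmark of 10,490 context-annotated causal triplets extracted from top-tier economics and finance journals, which evaluates whether large language models can infer causal direction under a specified context and revise their judgment when the context changes.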

Abstract

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different, even opposite, effects across regulatory regimes, market conditions, time periods, or populations. This poses a challenge for large language models (LLMs) in decision-support roles: can they infer the direction of a causal effect under a specified context, and revise that judgment when the context changes? To address this, we introduce EconCausal, a large-scale benchmark of 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies in top-tier economics and finance journals, constructed through a rigorous four-stage pipeline with multi-run consensus, context refinement, and multi-critic filtering. Across models, LLMs often fail to condition their predictions on context. While top models reach 88% accuracy in fixed, explicit contexts, accuracy falls by 32.6 pp on cases that require revising the sign across contexts (73.9% to 41.3%), and drops below 50% once misleading signed evidence is introduced. Models also over-commit to directional (+/-) signs, recognizing null effects only 13.8% of the time while remaining poorly calibrated on these categories. The dataset and benchmark are publicly available at https://anonymous.4open.science/r/econcausal-benchmark-6F12.

2602.18907 2026-05-27 cs.LG cs.CV cs.CY

DeepInterestGR: Mining Deep Multi-Interest Using Multi-Modal LLMs for Generative Recommendation

Yangchen Zeng, Zhenyu Yu, Zhiyuan Hu, Wenxin Zhang, Jinze Wang, Rongfeng Guo

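AI summary: Proposes the DeepInterestGR framework, which addresses the shallow-interest problem in generative recommendation through Multi-LLM Interest Mining, Reward-Labeled Deep Interest, and Interest-Enhanced Item Discretization, significantly improving recommendation performance on three Amazon datasets.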

Abstract

We introduce DeepInterestGR, a novel framework that integrates deep interest mining into the generative recommendation pipeline. This addresses the "Shallow Interest" problem: existing generative methods rely on surface-level textual features and fail to capture latent user motivations, limiting personalization depth and recommendation interpretability. Our approach leverages Multi-LLM Interest Mining (MLIM) via structured reasoning prompting, Reward-Labeled Deep Interest (RLDI) for quality control, and Interest-Enhanced Item Discretization (IEID) via RQ-VAE, combined with a two-stage SFT-GRPO training pipeline guided by an Interest-Aware Reward. We validate DeepInterestGR on three Amazon Review benchmarks (Beauty, Sports, Instruments), comparing against 14 state-of-the-art baselines including SASRec, BERT4Rec, TIGER, LC-Rec, and S-DPO. Our method achieves 5.8%-8.3% relative improvements on HR@10 and 7.7%-9.9% on NDCG@10 over the strongest baseline, with cross-domain generalization gains of +24.8%. These results provide evidence that incorporating deep semantic interests can effectively improve SID-based generative recommendation.
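For reference, the two reported metrics can be computed as in the sketch below. It assumes a single held-out target item per user, a common next-item evaluation protocol that the abstract does not confirm, and shows how a relative improvement is derived from two metric values. The numbers in the example are made up.

```python
import math

def hit_rate_at_k(ranked_items, target, k=10):
    """1.0 if the target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

ranking = ["item_7", "item_3", "item_9"]        # hypothetical model output
hr = hit_rate_at_k(ranking, "item_3")           # target ranked second
ndcg = ndcg_at_k(ranking, "item_3")             # 1 / log2(3)
gain = relative_improvement(0.108, 0.100)       # made-up metric values
```

Per-user values are averaged over all test users to obtain the dataset-level HR@10 and NDCG@10.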

2602.17822 2026-05-27 cs.RO

Evolution of Safety Requirements in Industrial Robotics: Comparative Analysis of ISO 10218-1/2 (2011 vs. 2025) and Integration of ISO/TS 15066

Daniel Hartmann, Kristýna Hamříková, Aleš Vysocký, Vendula Laciok, Aleš Bernatík

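AI summary: Compares the ISO 10218:2011 and ISO 10218:2025 standards to analyze how industrial robot safety requirements have evolved in functional safety, cybersecurity, robot classification, and collaborative applications, and how the integration of ISO/TS 15066 establishes a comprehensive framework for designing and operating modern robotic systems.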

Abstract

Industrial robotics has established itself as an integral component of large-scale manufacturing enterprises. Simultaneously, collaborative robotics is gaining prominence, introducing novel paradigms of human-machine interaction. These advancements have necessitated a comprehensive revision of safety standards, specifically incorporating requirements for cybersecurity and protection against unauthorized access in networked robotic systems. This article presents a comparative analysis of the ISO 10218:2011 and ISO 10218:2025 standards, examining the evolution of their structure, terminology, technical requirements, and annexes. The analysis reveals significant expansions in functional safety and cybersecurity, the introduction of new classifications for robots and collaborative applications, and the normative integration of the technical specification ISO/TS 15066. Consequently, the new edition synthesizes mechanical, functional, and digital safety requirements, establishing a comprehensive framework for the design and operation of modern robotic systems.

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

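AI summary: Proposes a unified geospatial discovery framework that combines active learning, online meta-learning, and concept-guided reasoning, using concept-weighted uncertainty sampling and relevance-aware meta-batch formation to discover hidden targets efficiently under limited data and dynamic environments.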

Abstract

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online meta-updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world-inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.
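One plausible reading of concept-weighted uncertainty sampling is sketched below. It assumes predictive entropy as the uncertainty measure, a weighted sum of concept values as relevance, and a simple product of the two as the acquisition score. All three are illustrative assumptions, as are the weights and candidate regions; the abstract only says that uncertainty is "modulated by" learned relevance.

```python
import math

def binary_entropy(p):
    """Entropy (nats) of a Bernoulli(p) prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def concept_relevance(concepts, weights):
    """Weighted sum of concept values in [0, 1]; weights are assumed to sum to 1."""
    return sum(weights[name] * value for name, value in concepts.items())

def acquisition_score(candidate, weights):
    """Uncertainty scaled by relevance (the product form is an assumption)."""
    uncertainty = binary_entropy(candidate["p_target"])
    return uncertainty * concept_relevance(candidate["concepts"], weights)

def select_regions(candidates, weights, budget):
    """Pick the `budget` highest-scoring regions to sample next."""
    ranked = sorted(candidates, key=lambda c: acquisition_score(c, weights), reverse=True)
    return [c["id"] for c in ranked[:budget]]

# Hypothetical learned weights and candidate regions.
weights = {"land_cover": 0.4, "source_proximity": 0.6}
candidates = [
    {"id": "A", "p_target": 0.50, "concepts": {"land_cover": 0.2, "source_proximity": 0.1}},
    {"id": "B", "p_target": 0.70, "concepts": {"land_cover": 1.0, "source_proximity": 0.9}},
    {"id": "C", "p_target": 0.99, "concepts": {"land_cover": 1.0, "source_proximity": 1.0}},
]
chosen = select_regions(candidates, weights, budget=2)
```

Region A is the most uncertain but scores low because its concepts suggest little relevance, while region C is highly relevant but already confidently predicted; region B ranks first because it is both uncertain and relevant.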

2602.17443 2026-05-27 cs.CL

AIDG: A Formal Decomposition of Information Extraction and Containment Asymmetries in Multi-Turn LLM Dialogue

AIDG:多轮LLM对话中信息提取与包含不对称性的形式化分解

Adib Sakhawat, Fardeen Sadab, Rakin Shahriar

AI总结 提出AIDG框架,将多轮对抗对话形式化为部分可观察随机博弈,分解提取与包含角色,揭示防御性能聚类而攻击性能分散等不对称性。

Comments 20 pages, 5 figures, 13 tables. Includes appendix and supplementary materials

详情
AI中文摘要

多轮LLM评估通常报告为单一胜率标量,混淆了不同能力。我们引入AIDG(对抗信息推断游戏),将多轮对抗对话形式化为双人部分可观察随机博弈(POSG),并沿着搜索者(提取)和持有者(包含)角色分解性能。该分解隔离了三种失败模式:合作先验泄漏、约束推理干扰和低效假设空间遍历。在六个前沿LLM的439场游戏中,防御性能紧密聚类(sigma = 1.9 ELO),而攻击性能差异显著(sigma = 53.3 ELO);确认框架使提取几率比无信息推断高7.75倍(p < 0.00001);约束违规占推断失败的41.3%,与规模无关(rho = 0.0)。我们将包含优于提取的差距定位为局部可解的防御决策与全局耦合的攻击规划的可测量结果,而非令人惊讶的发现,并使用该分解将差距归因于每个模型。所有设计选择,包括轮次衰减加权和Bradley-Terry评级模型,均源自明确假设。

英文摘要

Multi-turn LLM evaluation is typically reported as a single win-rate scalar, conflating distinct capabilities. We introduce AIDG (Adversarial Information Deduction Game), formalizing multi-turn adversarial dialogue as a two-player partially observable stochastic game (POSG) and decomposing performance along Seeker (extraction) and Holder (containment) roles. The decomposition isolates three failure modes: cooperative-prior leakage, constraint-reasoning interference, and inefficient hypothesis-space traversal. Across 439 games over six frontier LLMs, defensive performance is tightly clustered (sigma = 1.9 ELO) while offensive performance varies substantially (sigma = 53.3 ELO); confirmation framing increases extraction odds 7.75x over uninformed deduction (p < 0.00001); and constraint violations account for 41.3% of deductive failures, uncorrelated with scale (rho = 0.0). We position the containment-over-extraction gap not as a surprising finding but as a measurable consequence of locally resolvable defensive decisions versus globally coupled offensive planning, and use the decomposition to attribute the gap per model. All design choices, including turn-decay weighting and the Bradley-Terry rating model, are derived from explicit assumptions.
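
The abstract above names the Bradley-Terry rating model as the basis for its ELO-style scores. As a rough illustration of how such ratings can be fitted from pairwise game outcomes, here is a minimal sketch using the standard minorization-maximization update. It is not the authors' implementation: the function names, the toy win matrix, and the conversion to a 400-point logarithmic scale are assumptions made for illustration, and the paper's turn-decay weighting is not modeled.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] is the number of games player i won against player j.
    Assumes every player has at least one win and one loss (hypothetical
    toy setting); returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pairing contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom)
        scale = sum(new_p)
        p = [value / scale for value in new_p]
    return p


def rating_gap(p_a, p_b):
    """Gap between two strengths on a 400-point log10 scale (an assumed convention)."""
    return 400.0 * math.log10(p_a / p_b)


# Toy data: three hypothetical models playing each other.
toy_wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = fit_bradley_terry(toy_wins)
```

In the two-player case the fitted strength ratio equals the observed win ratio, which is a convenient sanity check before applying the fit to a larger game log.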

2510.03352 2026-05-27 cs.CV cs.AI cs.LG

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

基于侧信息的推理时搜索用于扩散模型图像重建

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

AI总结 提出一种即插即用、无需训练的推理时搜索框架,将侧信息融入现有扩散模型逆问题求解器,显著提升重建质量。

详情
AI中文摘要

扩散模型已被用作解决逆问题的先验。然而,现有方法通常忽略了能够显著提高重建质量的侧信息,尤其是在严重病态设置中。在这项工作中,我们提出了一种新颖的框架,通过推理时搜索将侧信息以即插即用、无需训练的方式融入现有的基于扩散模型的逆问题求解器。通过在多种逆问题(包括图像修复、超分辨率和几种去模糊任务)以及多种基于扩散模型的逆问题求解器(DPS、DAPS和MPGD)上的大量实验,我们表明,用我们的框架增强每个求解器,其重建质量始终优于相应的原始方法。为了展示我们方法的通用性,我们考虑了多种形式的侧信息,包括参考图像、文本描述和解剖学MRI扫描。代码可在该仓库中获取:https://github.com/mahdi-farahbakhsh/DISS。

英文摘要

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.

2503.18288 2026-05-27 cs.CL

TFD: A Comprehensive Structured Tibetan Foundation Dataset for Low-Resource Language Processing and Large-Scale Modeling

TFD:面向低资源语言处理和大规模建模的综合结构化藏语基础数据集

Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Yadi Liu, Wenbin Wei, Xiangxiang Wang, Yongbin Yu

AI总结 为解决藏语大语言模型开发中缺乏覆盖预训练、指令微调、安全对齐、偏好优化和推理监督等完整流程的数据集问题,提出首个结构化、大规模、专家精选的藏语基础数据集TFD,包含超过110亿词元的统一语料库及链式推理数据集,通过训练Sun-Shine系列藏语模型在理解、安全、推理和生成基准上取得显著提升。

详情
AI中文摘要

大型语言模型(LLMs)在高资源语言中取得了显著成功,但藏语的进展仍然严重受限。尽管最近的工作开始解决藏语的预训练数据稀缺问题,但一个更根本的差距仍然存在:现有资源不支持完整的LLM开发流程,涵盖预训练、指令微调、安全对齐、偏好优化和推理监督。我们引入了藏语基础数据集(TFD),这是第一个结构化、大规模、专家精选的数据集,覆盖藏语大语言建模的所有关键阶段。TFD包含TIBSTC,一个超过110亿词元的统一语料库,带有用于指令微调、安全对齐和偏好优化的精选子数据集,以及TIBSTC-CoT,第一个大规模藏语链式推理数据集。我们通过训练Sun-Shine系列藏语LLM来展示其效用,在理解、安全、推理和生成基准上相比强基线取得了显著改进。这些结果强调,推进低资源语言建模不仅需要规模,还需要结构完整的数据生态系统。我们发布TFD以促进可重复研究和开发稳健、文化对齐的藏语LLM。代码和数据可在https://github.com/Vicentvankor/sun-shine获取。

英文摘要

Large Language Models (LLMs) have achieved remarkable success in high-resource languages, yet progress in Tibetan remains severely constrained. While recent efforts have begun to address pre-training data scarcity for Tibetan, a more fundamental gap persists: no existing resource supports the complete LLM development pipeline, spanning pre-training, instruction tuning, safety alignment, preference optimization, and reasoning supervision. We introduce the Tibetan Foundation Dataset (TFD), the first structured, large-scale, and expert-curated dataset covering all key stages of Tibetan large language modeling. TFD comprises TIBSTC, a unified corpus of over 11 billion tokens with curated sub-datasets for instruction tuning, safety alignment, and preference optimization, and TIBSTC-CoT, the first large-scale Tibetan chain-of-thought dataset. We demonstrate its utility by training the Sun-Shine family of Tibetan LLMs, achieving substantial improvements over strong baselines on understanding, safety, reasoning, and generation benchmarks. These results underscore that advancing low-resource language modeling requires not only scale, but a structurally complete data ecosystem. We release TFD to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at https://github.com/Vicentvankor/sun-shine.

2602.12833 2026-05-27 cs.LG cs.AI cs.MA

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Vital Trace: 协议约束的患者状态推理用于纵向临床轨迹

Zhan Qu, Michael Färber

AI总结 提出Vital Trace,一个协议约束的多智能体框架,通过紧凑的持久患者状态记忆和四个协调智能体(Router、Reasoner、Auditor、Steward)进行分阶段推理,以解决长期临床轨迹推理中的上下文漂移和不稳定问题,在MIMIC-IV和eICU数据集上预测未来血管加压药、呼吸、肾脏支持和恶化任务中优于自由形式多智能体基线。

详情
AI中文摘要

纵向临床推理需要跟踪电子健康记录中患者轨迹的生理测量、实验室结果和干预措施。现有的基于LLM的临床推理系统通常依赖于重复序列化患者历史或交换无约束的文本智能体消息,导致上下文漂移、推理不稳定以及长期推理成本增加。我们提出了Vital Trace,一个协议约束的多智能体框架,用于在动态ICU轨迹上进行未来临床风险预测。Vital Trace不维护无界文本历史,而是使用紧凑的持久患者状态记忆以及由四个协调智能体(Router、Reasoner、Auditor和Steward)执行的分阶段推理。为了支持时间上连贯的推理,我们引入了一个手动策划的全局协议,包含生理状态转换规则和动态患者状态表示,随时间跟踪血流动力学、呼吸、肾脏、代谢和炎症不稳定性。我们在MIMIC-IV和eICU上使用未来血管加压药支持、呼吸支持、肾脏支持和恶化预测任务评估Vital Trace。结果表明,与自由形式多智能体基线相比,结构化的协议约束推理提高了时间一致性、通信稳定性、校准性和可解释性,同时在长期ICU轨迹上实现了强大的预测性能。

英文摘要

Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, and interventions across extended patient trajectories. Existing LLM-based clinical reasoning systems often rely on repeatedly serializing patient histories or exchanging unconstrained textual agent messages, leading to context drift, unstable reasoning, and growing inference cost over long horizons. We present Vital Trace, a protocol-constrained multi-agent framework for future clinical risk prediction over evolving ICU trajectories. Instead of maintaining unbounded textual histories, Vital Trace uses a compact persistent patient-state memory together with staged reasoning performed by four coordinated agents: a Router, Reasoner, Auditor, and Steward. To support temporally coherent reasoning, we introduce a manually curated Global Protocol containing physiological state-transition rules and a dynamic patient-state representation that tracks hemodynamic, respiratory, renal, metabolic, and inflammatory instability over time. We evaluate Vital Trace on MIMIC-IV and eICU using future vasopressor-support, respiratory-support, renal-support, and deterioration prediction tasks. Results show that structured protocol-constrained reasoning improves temporal consistency, communication stability, calibration, and interpretability compared with free-form multi-agent baselines while achieving strong predictive performance across long ICU trajectories.

2602.11799 2026-05-27 cs.AI cs.IR

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Hi-SAM: 一种面向大规模推荐的分层结构感知多模态框架

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

AI总结 针对多模态推荐中语义ID离散化存在的次优分词和架构-数据不匹配问题,提出Hi-SAM框架,通过解耦语义分词器和分层记忆-锚点Transformer,在冷启动场景下显著提升推荐性能。

Comments Accepted at ACM KDD 2026 ADS

详情
AI中文摘要

多模态推荐因物品具有文本和图像等丰富属性而受到关注。基于语义ID的方法有效地将这些信息离散化为紧凑的令牌。然而,存在两个挑战:(1)次优分词:现有方法(如RQ-VAE)缺乏共享跨模态语义和模态特定细节之间的解耦,导致冗余或崩溃;(2)架构-数据不匹配:普通Transformer将语义ID视为扁平流,忽略了用户交互、物品和令牌的层次结构。将物品扩展为多个令牌会放大长度和噪声,使注意力偏向局部细节而非整体语义。我们提出Hi-SAM,一种分层结构感知多模态框架,包含两个设计:(1)解耦语义分词器(DST):通过几何感知对齐统一模态,并通过从粗到细的策略进行量化。共享码本提取共识,而模态特定码本通过互信息最小化从残差中恢复细微差别;(2)分层记忆-锚点Transformer(HMAT):通过分层RoPE将位置编码分解为物品间和物品内子空间以恢复层次结构。它插入锚点令牌将物品压缩为紧凑记忆,保留当前物品的细节,同时仅通过压缩摘要访问历史。在真实世界数据集上的实验表明,相比最先进基线方法,Hi-SAM持续改进,尤其在冷启动场景中。在服务数百万用户的大规模社交平台上部署后,Hi-SAM在核心在线指标上实现了6.55%的提升。

英文摘要

Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.

2507.11486 2026-05-27 cs.LG

Exploring the robustness of TractOracle methods in RL-based tractography

探索基于强化学习的纤维追踪中TractOracle方法的鲁棒性

Jeremi Levesque, Antoine Théberge, Maxime Descoteaux, Pierre-Marc Jodoin

AI总结 本文通过整合强化学习的最新进展,扩展了TractOracle-RL框架,并引入迭代奖励训练(IRT)方法,实验表明基于oracle的RL方法在准确性和解剖有效性上显著优于传统纤维追踪技术。

Comments 38 pages, 8 figures. Submitted to Medical Image Analysis

详情
Journal ref
Medical Image Analysis, December 2025
AI中文摘要

纤维追踪算法利用扩散MRI重建大脑白质的纤维结构。在机器学习方法中,强化学习(RL)已成为纤维追踪的一个有前景的框架,在几个关键方面优于传统方法。TractOracle-RL是一种最新的基于RL的方法,通过基于奖励的机制将解剖先验纳入训练过程,减少了假阳性。在本文中,我们通过整合RL的最新进展,研究了原始TractOracle-RL框架的四种扩展,并在五个不同的扩散MRI数据集上评估了它们的性能。结果表明,无论使用何种具体方法或数据集,将oracle与RL框架结合始终能产生鲁棒且可靠的纤维追踪。我们还提出了一种新的RL训练方案,称为迭代奖励训练(IRT),其灵感来自人类反馈强化学习(RLHF)范式。IRT不依赖人类输入,而是利用束过滤方法在训练过程中迭代优化oracle的指导。实验结果表明,使用oracle反馈训练的RL方法在准确性和解剖有效性方面显著优于广泛使用的纤维追踪技术。

英文摘要

Tractography algorithms leverage diffusion MRI to reconstruct the fibrous architecture of the brain's white matter. Among machine learning approaches, reinforcement learning (RL) has emerged as a promising framework for tractography, outperforming traditional methods in several key aspects. TractOracle-RL, a recent RL-based approach, reduces false positives by incorporating anatomical priors into the training process via a reward-based mechanism. In this paper, we investigate four extensions of the original TractOracle-RL framework by integrating recent advances in RL, and we evaluate their performance across five diverse diffusion MRI datasets. Results demonstrate that combining an oracle with the RL framework consistently leads to robust and reliable tractography, regardless of the specific method or dataset used. We also introduce a novel RL training scheme called Iterative Reward Training (IRT), inspired by the Reinforcement Learning from Human Feedback (RLHF) paradigm. Instead of relying on human input, IRT leverages bundle filtering methods to iteratively refine the oracle's guidance throughout training. Experimental results show that RL methods trained with oracle feedback significantly outperform widely used tractography techniques in terms of accuracy and anatomical validity.

2602.11460 2026-05-27 cs.CL

ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

ADRD-Bench:阿尔茨海默病及相关痴呆症的初步大语言模型基准

Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Yiyu Shi, Meng Jiang, Zhi Zheng

AI总结 针对现有基准对阿尔茨海默病及相关痴呆症覆盖不足的问题,提出ADRD-Bench,包含统一问答和照护问答两部分,评估了36个LLM,发现顶级模型准确率高但推理质量不稳定。

Comments Update article

详情
AI中文摘要

大语言模型(LLM)在医疗应用中显示出巨大潜力。然而,现有评估基准对阿尔茨海默病及相关痴呆症(ADRD)的覆盖极少。为解决这一差距,我们引入了ADRD-Bench,一个初步的ADRD专用LLM基准。ADRD-Bench包含两个部分:1) ADRD统一问答,整合了七个已有医学基准的1,438个问题,提供临床知识的统一评估;2) ADRD照护问答,一组新颖的149个问题,源自一个全国采用、大型临床试验支持的脑健康管理项目,弥补了现有基准缺乏实际照护背景的不足。我们在提出的ADRD-Bench上评估了36个最先进的LLM。结果显示,开源通用模型、开源医学模型和前沿闭源通用模型的准确率范围分别为0.63至0.93(均值:0.77;标准差:0.09)、0.47至0.93(均值:0.81;标准差:0.14)和0.83至0.93(均值:0.90;标准差:0.03)。虽然顶级模型达到了高准确率(>0.9),但案例研究揭示了不一致的推理质量和稳定性,凸显了需要领域特定改进以增强LLM基于日常照护数据的知识和推理。整个数据集可在https://github.com/IIRL-ND/ADRD-Bench获取。

英文摘要

Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, a preliminary ADRD-specific LLM benchmark. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,438 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from a nationally adopted brain health management program supported by large clinical trials, mitigating the lack of practical caregiving context in existing benchmarks. We evaluated 36 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models, open-weight medical models, and frontier closed-source general models ranged from 0.63 to 0.93 (mean: 0.77; std: 0.09), 0.47 to 0.93 (mean: 0.81; std: 0.14), and 0.83 to 0.93 (mean: 0.90; std: 0.03), respectively. While top-tier models achieved high accuracies (>0.9), case studies revealed inconsistent reasoning quality and stability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at https://github.com/IIRL-ND/ADRD-Bench.

2602.10450 2026-05-27 cs.LG cs.AI math.OC

Constructing Industrial-Scale Optimization Modeling Benchmark

构建工业规模优化建模基准

Zhong Li, Hongliang Lu, Tao Wei, Yuxuan Chen, Wenyu Liu, Yuan Lan, Fan Zhang, Zaiwen Wen

AI总结 提出MIPLIB-NL基准,通过结构感知逆向构建方法从真实混合整数线性规划中生成自然语言规范与求解器代码,以评估大语言模型在工业规模优化建模中的性能。

Comments This paper was accepted by ICML'26 for publication

详情
AI中文摘要

优化建模支撑着物流、制造、能源和金融领域的决策,然而将自然语言需求转化为正确的优化公式和可执行求解器代码仍然需要大量人力。尽管大语言模型(LLMs)已被探索用于此任务,但评估仍以玩具级或合成基准为主,掩盖了具有$10^{3}$至$10^{6}$(或更多)变量和约束的工业问题的难度。一个关键瓶颈是缺乏将自然语言规范与基于真实优化模型的参考公式/求解器代码对齐的基准。为填补这一空白,我们引入了MIPLIB-NL,它通过一种结构感知的逆向构建方法从MIPLIB 2017中的真实混合整数线性规划构建而成。我们的流程(i)从平坦的求解器公式中恢复紧凑、可复用的模型结构,(ii)在统一的模型-数据分离格式下,逆向生成明确关联到该恢复结构的自然语言规范,以及(iii)通过专家评审和人类-LLM交互以及独立的重构检查进行迭代语义验证。这产生了223个一对一的重构,保留了原始实例的数学内容,同时实现了现实的自然语言到优化评估。实验表明,在现有基准上表现良好的系统在MIPLIB-NL上性能显著下降,暴露了在玩具规模下不可见的失败模式。

英文摘要

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$ to $10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model-data separation format, and (iii) performs iterative semantic validation through expert review and human-LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

2602.10104 2026-05-27 cs.CV cs.AI cs.LG

Olaf-World: Orienting Latent Actions for Video World Modeling

Olaf-World: 面向视频世界模型的潜在动作定向

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

AI总结 提出SeqΔ-REPA对齐目标,通过冻结自监督视频编码器的时序特征差异锚定潜在动作,实现无标签视频中可迁移的动作控制世界模型预训练。

Comments ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

详情
AI中文摘要

扩展动作可控世界模型受限于动作标签的稀缺性。虽然潜在动作学习有望从无标签视频中提取控制接口,但学习到的潜在表示往往难以跨上下文迁移:它们纠缠了场景特定线索,缺乏共享坐标系。这是因为标准目标仅在每个片段内操作,没有提供跨上下文对齐动作语义的机制。我们的关键洞察是,尽管动作未被观测到,但其语义效果是可观测的,可以作为共享参考。我们引入SeqΔ-REPA,一种序列级控制效果对齐目标,将集成潜在动作锚定到来自冻结自监督视频编码器的时序特征差异。基于此,我们提出Olaf-World,一个从大规模被动视频中预训练动作条件视频世界模型的流程。大量实验表明,我们的方法学习了更结构化的潜在动作空间,从而在零样本动作迁移和适应新控制接口的数据效率上优于最先进的基线方法。

英文摘要

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

2602.09878 2026-05-27 cs.CV

MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

MVISTA-4D: 用于机器人操作的一致性视图4D世界模型与测试时动作推理

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

AI总结 提出一种基于世界模型的4D场景生成方法,通过多视图RGBD预测和测试时动作优化,实现几何一致的4D动态预测与机器人操作。

详情
Journal ref
International Conference on Machine Learning 2026
AI中文摘要

基于世界模型的“想象-然后行动”范式成为机器人操作的一种有前景的方法,但现有方法通常仅支持纯图像预测或部分3D几何推理,限制了其预测完整4D场景动态的能力。本文提出了一种新颖的具身4D世界模型,能够实现几何一致、任意视图的RGBD生成:仅以单视图RGBD观测作为输入,模型想象其余视角,然后通过反投影和融合构建跨时间的更完整3D结构。为了高效学习多视图、跨模态生成,我们明确设计了跨视图和跨模态特征融合,共同促进RGB与深度之间的一致性,并强制视图间的几何对齐。除了预测,将生成的未来转换为动作通常由逆动力学处理,但这是病态的,因为多个动作可以解释相同的状态转换。我们通过一种测试时动作优化策略来解决这个问题,该策略通过生成模型反向传播以推断与预测未来最佳匹配的轨迹级潜在变量,以及一个残差逆动力学模型,将该轨迹先验转换为精确的可执行动作。在三个数据集上的实验表明,该方法在4D场景生成和下游操作任务上均表现出色,消融实验为关键设计选择提供了实用见解。

英文摘要

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

2602.08586 2026-05-27 cs.AI

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

DIANOIA: 多智能体推理的诊断性分解与联合优化

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

AI总结 提出DIANOIA框架,通过覆盖度、保真度和综合度三个可测量通道分解多智能体推理增益,并基于此设计诊断协议和对应系统,在多个基准上以更少token实现更优性能。

详情
AI中文摘要

多智能体LLM系统持续优于单智能体基线,但从业者仍无法预测哪种设计适用于新任务或诊断失败原因。我们认为这一差距主要源于该领域缺乏具有可测量原语和可测试预测的诊断框架。我们引入DIANOIA,将多智能体推理增益分解为覆盖度、保真度和综合度三个通道,每个通道均可经验测量。基于此分解,我们推导出一个诊断协议,可识别任何给定任务的瓶颈通道。我们将该协议实例化为一个多智能体系统,其三个组件与通道对应:角色多样化的提议者(覆盖度)、基于执行验证的验证者(保真度)和迭代综合者。在GSM8K、AIME-2025、MBPP和BFCL-SP上,我们的方法在匹配token预算下优于强多智能体基线,在MBPP上以约$5\times$的token节省主导帕累托前沿,在匹配成本下达到$+4.6$pp。在每个基准上,协议都能正确选择瓶颈通道;我们围绕它构建的系统在多个模型上领先。我们发布代码、适配器、诊断指标和Claude Code技能,网址为https://anonymous.4open.science/r/DIANOIA4MAS。DIANOIA将多智能体设计重新定义为通道感知的资源分配:诊断你的任务的瓶颈通道,然后相应投入token。

英文摘要

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new task or diagnose why one fails. We argue this gap persists largely because the field lacks a diagnostic framework with measurable primitives and testable predictions. We introduce DIANOIA, a three-channel decomposition of multi-agent reasoning gain into coverage, fidelity, and synthesis, each of which is empirically measurable. From this decomposition, we derive a diagnostic protocol that identifies the bottleneck channels for any given task. We instantiate the protocol as a multi-agent system whose three components mirror the channels: role-diverse proposers for coverage, execution-grounded verification for fidelity, and iterative synthesis. On GSM8K, AIME-2025, MBPP, and BFCL-SP, our method outperforms strong multi-agent baselines under matched token budgets, dominating the Pareto frontier on MBPP at $\sim 5\times$ token savings and reaching $+4.6$pp at matched cost. On every benchmark, the protocol picks the right bottleneck channels; the system we built around it leads across models. We release code, adapters, diagnostic metrics, and a Claude Code skill at https://anonymous.4open.science/r/DIANOIA4MAS. DIANOIA reframes multi-agent design as channel-aware resource allocation: diagnose which channel is the bottleneck for your task, then invest tokens accordingly.

2511.16449 2026-05-27 cs.CV cs.AI

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

弥合视觉令牌剪枝中的语义-动作鸿沟以实现高效VLA推理

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

AI总结 提出VLA-Pruner方法,通过结合语义预填充和时序平滑的动作相关性估计视觉令牌重要性,并采用Combine-then-Filter策略,在保持操作质量的同时实现高达1.99倍加速。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过整合视觉感知、语言理解和动作执行,在具身人工智能中展现出巨大潜力。在实时部署中,这些模型必须处理连续的视觉流,产生大量计算开销。视觉令牌剪枝——一种通过保留显著令牌同时丢弃冗余令牌来加速视觉-语言模型(VLM)的主流技术——为这一挑战提供了自然的候选解决方案。然而,直接将面向VLM的剪枝方法应用于VLA推理会导致操作性能严重下降。我们的分析将这种下降归因于一个关键不匹配:VLA推理在视觉-语言预填充阶段和动作解码阶段表现出不同的注意力模式,因此仅基于上下文预填充语义显著性的剪枝偏向语义线索,可能移除动作关键的视觉令牌。受此观察启发,我们提出VLA-Pruner,一种有效的即插即用令牌剪枝方法,基于VLA推理的视觉需求,并进一步利用机器人操作的时间连续性。具体来说,VLA-Pruner从语义预填充和时序平滑的动作相关性两方面估计视觉令牌重要性,然后采用Combine-then-Filter策略,在计算预算下保留紧凑、非冗余的令牌。实验表明,VLA-Pruner在多种VLA架构上优于最先进方法,在相当的操作质量下实现高达1.99倍加速。

英文摘要

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.

2511.06625 2026-05-27 cs.CV cs.AI cs.LG

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

可解释的跨疾病推理:基于低剂量计算机断层扫描的心血管风险评估

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

AI总结 提出一种可解释的跨疾病推理框架,通过提取肺部发现、基于医学知识进行跨器官机制推理,并结合心脏子体积特征,从低剂量胸部CT中实现心血管风险评估,在NLST队列中AUC达0.919。

详情
AI中文摘要

低剂量胸部计算机断层扫描(LDCT)在一次扫描中捕获肺部和心脏结构,使得能够联合评估肺部和心血管健康。现有方法通常独立建模这些领域,并未明确表示它们的生理交互。我们提出了一种可解释的跨疾病推理框架,用于从LDCT进行心血管风险评估。该框架遵循受限的临床信息路径:它提取肺部发现,将跨器官机制基于医学知识进行推理,并生成带有自然语言理由的心血管预测。它结合了四个组件:一个冻结的肺风险先验、一个肺部感知模块、一个代理推理模块和一个心脏子体积特征提取器。它们的输出被融合,以将局部心脏证据与机制层面的肺部上下文整合。在国家肺筛查试验队列中,该框架在CVD筛查中达到0.919的AUC,在CVD死亡率预测中高达0.838,优于心脏特异性、单疾病和基础模型基线。目标对照表明,这些增益不能仅由额外的胸部视觉特征、固定规则传播或单一推理后端解释。因此,所提出的框架提供了一种可审计的方法,用于从LDCT进行跨疾病心血管风险评估。

英文摘要

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.

2507.13428 2026-05-27 cs.CV cs.AI

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

PhyWorldBench:文本到视频模型中物理真实性的全面评估

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

AI总结 提出PhyWorldBench基准,通过1050个提示评估12个视频生成模型在物理规律遵循上的表现,并引入反物理类别,利用多模态大语言模型进行零样本评估。

Comments 35 pages, 21 figures

详情
Journal ref
ICLR 2026 oral
AI中文摘要

视频生成模型在创建高质量、逼真内容方面取得了显著进展。然而,它们准确模拟物理现象的能力仍然是一个关键且未解决的挑战。本文提出了PhyWorldBench,一个全面的基准测试,旨在根据视频生成模型对物理定律的遵循程度进行评估。该基准涵盖了多个层次的物理现象,从基本物理原理如物体运动和能量守恒,到更复杂的场景如刚体相互作用以及人或动物的运动。此外,我们引入了一个新颖的反物理类别,其中提示故意违反现实世界的物理规律,从而评估模型在保持逻辑一致性的同时能否遵循此类指令。除了大规模人工评估外,我们还设计了一种简单而有效的方法,利用当前的多模态大语言模型以零样本方式评估物理真实性。我们评估了12个最先进的文本到视频生成模型,包括五个开源模型和五个专有模型,并进行了详细的比较和分析。通过对跨越基础、复合和反物理场景的1050个精心策划的提示进行系统测试,我们识别出这些模型在遵循现实世界物理规律方面面临的关键挑战。我们进一步研究了它们在不同物理现象和提示类型下的表现,并得出了针对性的建议,以构建增强物理原理保真度的提示。

英文摘要

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

2506.15199 2026-05-27 cs.LG stat.ML

Interpretability and Generalization Bounds for Learning Spatial Physics

学习空间物理的可解释性与泛化界

Alejandro Francisco Queiruga, Theo Gutman-Solo, Shuai Jiang

AI总结 利用数值分析技术,严格量化了应用于线性微分方程的机器学习模型在参数发现或求解中的准确性、收敛率和泛化界,并基于格林函数表示引入科学模型的可解释性视角。

Comments To appear in ICML 2026. 18 pages, 13 figures

详情
AI中文摘要

尽管机器学习在科学问题上的许多应用看起来很有前景,但视觉可能具有欺骗性。利用数值分析技术,我们严格量化了某些应用于线性微分方程进行参数发现或求解的机器学习模型的准确性、收敛率和泛化界。除了数据的数量和离散化之外,我们发现数据的函数空间对模型的泛化至关重要。对于常用模型(包括物理特定技术),我们通过实验证明了类似的泛化不足。与直觉相反,我们发现不同类别的模型可能表现出相反的泛化行为。基于我们的理论分析,我们还引入了一种新的科学模型机械可解释性视角,即可以从黑箱模型的权重中提取格林函数表示。我们的结果为测量物理系统泛化性提供了一种新的交叉验证技术,该技术可作为基准。

英文摘要

While there are many applications of ML to scientific problems that look promising, visuals can be deceiving. Using numerical analysis techniques, we rigorously quantify the accuracy, convergence rates, and generalization bounds of certain ML models applied to linear differential equations for parameter discovery or solution finding. Beyond the quantity and discretization of data, we identify that the function space of the data is critical to the generalization of the model. A similar lack of generalization is empirically demonstrated for commonly used models, including physics-specific techniques. Counterintuitively, we find that different classes of models can exhibit opposing generalization behaviors. Based on our theoretical analysis, we also introduce a new mechanistic interpretability lens on scientific models whereby Green's function representations can be extracted from the weights of black-box models. Our results inform a new cross-validation technique for measuring generalization in physical systems, which can serve as a benchmark.
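
To make the claim about extracting Green's function representations from model weights concrete, the following sketch works through the simplest case one could assume: a purely linear model trained to map the source term f to the solution u of the 1D Poisson problem -u'' = f on [0, 1] with zero boundary values, discretized by second-order finite differences. This setup, the function names, and the use of NumPy least squares are illustrative assumptions and not the paper's method or experiments. For this discretization, the fitted weight matrix divided by the grid spacing h matches the analytical Green's function G(x, y) = min(x, y)(1 - max(x, y)) at the grid points.

```python
import numpy as np


def poisson_matrix(n):
    """Finite-difference matrix for -u'' on n interior grid points of [0, 1]."""
    h = 1.0 / (n + 1)
    A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h**2
    return A, h


def learn_linear_solver(A, num_samples=200, seed=0):
    """Fit a linear map W with u ~ W f from sampled (f, u) pairs by least squares."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    F = rng.standard_normal((num_samples, n))  # each row is one source term f
    U = np.linalg.solve(A, F.T).T              # matching solutions u
    X, *_ = np.linalg.lstsq(F, U, rcond=None)  # solves F @ X = U
    return X.T                                 # so that u = W @ f


def exact_green(n):
    """Analytical Green's function of -u'' = f, u(0) = u(1) = 0, on the grid."""
    h = 1.0 / (n + 1)
    x = h * np.arange(1, n + 1)
    return np.minimum.outer(x, x) * (1.0 - np.maximum.outer(x, x))


n = 32
A, h = poisson_matrix(n)
W = learn_linear_solver(A)
G_learned = W / h  # weights rescaled by the quadrature weight h
max_err = np.max(np.abs(G_learned - exact_green(n)))
```

The training sources here are white noise, which excites every grid mode. If the sampled f spanned only a subspace, least squares would leave the weights undetermined outside that subspace, which is one simple way to see why the function space of the data can matter for generalization.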