arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03227 2026-06-03 cs.LG

Learning Temporal Causal Structure via Smooth Differentiable Optimization

Tong Zhao, Ce Guo, Wayne Luk, Emil Lupu, Ray Dipojjwal

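AI summary: Proposes learning a differentiable variable ordering with the Gumbel-Sinkhorn operator and triangularizing the instantaneous coefficient matrix of a structural vector autoregressive model, turning acyclicity into a parameterization so that optimization is unified and continuous, which improves the efficiency and accuracy of causal discovery in time series.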

Abstract

Causal discovery with instantaneous effects in multivariate time series is challenging, as the instantaneous structure must be acyclic. Prior methods enforce this by either separating instantaneous and lagged estimation into multi-stage pipelines or imposing algebraic acyclicity constraints via complex augmented Lagrangian optimization, both of which incur high computational cost. In this work, we propose a different approach: we learn a differentiable permutation of variables using the Gumbel-Sinkhorn operator and triangularize the instantaneous coefficient matrix of a Structural Vector Autoregressive (SVAR) model in the learned order. This converts acyclicity from a hard constraint into a parameterization and keeps it valid throughout optimization. In doing so, our method enables unified, continuous optimization with gradient-based learning, leading to improved efficiency in time-series causal discovery. Across three real-world benchmarks, our method achieves the best overall performance compared with 12 baselines in both discovery accuracy and efficiency. On the large-scale benchmark, it further demonstrates strong scalability, achieving more than a 6x speedup over competing methods.
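
The idea of "acyclicity as a parameterization" can be illustrated with a small standard-library sketch. It is not the authors' implementation: the function names, the greedy rounding step, the temperature `tau`, and the random weights are all assumptions made for illustration. A Gumbel-Sinkhorn step produces a soft permutation, the permutation is rounded to a variable order, and free weights are masked so that a variable can only be affected by variables earlier in that order, which makes the instantaneous matrix acyclic by construction.

```python
import math
import random


def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sinkhorn(log_alpha, n_iters=50):
    """Alternate row and column normalization in log space."""
    n = len(log_alpha)
    la = [row[:] for row in log_alpha]
    for _ in range(n_iters):
        for i in range(n):  # normalize each row
            z = _logsumexp(la[i])
            la[i] = [x - z for x in la[i]]
        for j in range(n):  # normalize each column
            z = _logsumexp([la[i][j] for i in range(n)])
            for i in range(n):
                la[i][j] -= z
    return [[math.exp(x) for x in row] for row in la]


def gumbel_sinkhorn(scores, tau=0.5, rng=random):
    """Soft permutation: Sinkhorn applied to (scores + Gumbel noise) / tau."""
    n = len(scores)
    noisy = [
        [(scores[i][j] - math.log(-math.log(rng.uniform(1e-9, 1 - 1e-9)))) / tau
         for j in range(n)]
        for i in range(n)
    ]
    return sinkhorn(noisy)


def round_to_order(P):
    """Greedy rounding (an assumed heuristic): order[k] is the variable at position k."""
    n = len(P)
    rows, cols = set(range(n)), set(range(n))
    order = [None] * n
    while rows:
        i, j = max(((i, j) for i in rows for j in cols),
                   key=lambda ij: P[ij[0]][ij[1]])
        order[j] = i
        rows.remove(i)
        cols.remove(j)
    return order


def triangular_mask(order):
    """mask[i][j] = 1 only if variable j precedes variable i in the order."""
    pos = {v: k for k, v in enumerate(order)}
    n = len(order)
    return [[1.0 if pos[j] < pos[i] else 0.0 for j in range(n)] for i in range(n)]


random.seed(0)
n = 4
P = gumbel_sinkhorn([[0.0] * n for _ in range(n)])  # hypothetical uniform scores
order = round_to_order(P)
W = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n)]  # free weights
mask = triangular_mask(order)
B = [[W[i][j] * mask[i][j] for j in range(n)] for i in range(n)]  # acyclic by construction
print("learned order:", order)
```

Because `B` is strictly triangular in the learned order, its n-th matrix power is zero, so no acyclicity penalty is needed. In a gradient-based setting the soft matrix `P` would be used in place of the hard rounding; that part is left out here.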

2606.03223 2026-06-03 cs.RO cs.AI

BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions

Zhe Sun, Meng Wang, Lei Wang, Yuxi Wang, Wanxin Li, Yujia Peng, Zhenliang Zhang

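AI summary: Proposes a robot storytelling system that combines tangible and natural-language interaction, using an LLM agent to turn child-created narratives into motion sequences for self-navigating swarm robots, with support for flexible scenarios and everyday objects.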

Journal ref
2026 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)
Abstract

Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map and characters, and the motions are executed by self-navigating swarm robots. This system enhances robot storytelling with flexible scenarios, enabling young children to create robot dramas with everyday objects.

2606.03220 2026-06-03 cs.CL cs.AI

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

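AI summary: Proposes the WebRISE framework, which uses Interaction Contract Graphs (ICGs) to turn task requirements into observable states, user-intent transitions, and DOM/visual assertions for evaluating the functional correctness of MLLM-generated web artifacts; experiments show that ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.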

Abstract

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.
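
A graph of observable states, user-intent transitions, and assertions could be represented roughly as follows. This is only a sketch of the general shape: the class names, the dictionary-based `Page` snapshot, the `act` callback, and the toy shopping-cart example are hypothetical and are not taken from WebRISE.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Page = Dict[str, str]  # hypothetical stand-in for a DOM snapshot


@dataclass
class State:
    name: str
    assertions: List[Callable[[Page], bool]] = field(default_factory=list)

    def holds(self, page: Page) -> bool:
        """A state is observed when every one of its assertions passes."""
        return all(check(page) for check in self.assertions)


@dataclass
class Transition:
    source: str
    action: str
    target: str


def transition_validity(states, transitions, act):
    """Fraction of transitions whose target state is observed after the action.

    `act(source, action)` is an assumed callback that drives the page and
    returns the resulting snapshot.
    """
    if not transitions:
        return 0.0
    ok = sum(states[t.target].holds(act(t.source, t.action)) for t in transitions)
    return ok / len(transitions)


states = {
    "empty": State("empty", [lambda p: p.get("cart_count") == "0"]),
    "one_item": State("one_item", [lambda p: p.get("cart_count") == "1"]),
}
transitions = [
    Transition("empty", "click #add", "one_item"),
    Transition("one_item", "click #remove", "empty"),
]


def fake_act(source, action):
    # Toy page whose "remove" button is broken: the count never returns to 0.
    return {"cart_count": "1"}


print(transition_validity(states, transitions, fake_act))  # 0.5
```

Scoring against target states rather than isolated checkpoints is what lets a broken transition show up even when the page looks correct at rest.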

2606.03219 2026-06-03 cs.CL cs.LG

Sample-Size Scaling of the African Languages NLI Evaluation

Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Nwokocha

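AI summary: Runs a systematic sample-size scaling experiment on 16 African languages with the AfriXNLI benchmark and finds that NLI performance does not improve monotonically with sample size but instead scales in a language-sensitive, non-monotonic way, indicating that data volume alone does not guarantee stable gains and that language-sensitive datasets and stronger multilingual modeling strategies are needed.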

Comments
Accepted at the AfricaNLP Workshop, EACL 2026
Abstract

African languages have very little labelled data, and it is unclear whether increasing the quantity of annotated data reliably improves downstream performance. We present a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters (XLM-R Large fine-tuned on XNLI and AfroXLM-R Large) are tested on sample sizes of between 50 and 500 labelled examples, with results averaged across random subsampling runs. Contrary to the usual expectation that performance increases monotonically with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, which calls for the creation of language-sensitive datasets and stronger multilingual modelling strategies.
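
The subsample-and-average protocol described above can be sketched as follows. The function names, the number of runs, and the toy `evaluate` callback are assumptions; in the study, evaluation would involve fine-tuning and scoring a model, which is not reproduced here.

```python
import random
import statistics


def scaling_curve(pool, sizes, evaluate, n_runs=5, seed=0):
    """Return {size: (mean score, std over runs)} from random subsamples of `pool`."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        scores = [evaluate(rng.sample(pool, n)) for _ in range(n_runs)]
        curve[n] = (statistics.mean(scores), statistics.stdev(scores))
    return curve


def is_monotonic(curve):
    """True if the mean score never drops as the sample size grows."""
    means = [curve[n][0] for n in sorted(curve)]
    return all(a <= b for a, b in zip(means, means[1:]))


# Toy stand-in for "fine-tune on the subsample, then score on a test set".
def toy_evaluate(subsample):
    return sum(subsample) / len(subsample)


pool = list(range(1000))
curve = scaling_curve(pool, sizes=[50, 100, 200, 500], evaluate=toy_evaluate)
for n in sorted(curve):
    print(n, curve[n])
print("monotonic:", is_monotonic(curve))
```

Reporting the standard deviation next to the mean matters here, because the abstract notes high variance in low-resource regimes; a non-monotonic mean is easier to interpret when the spread across runs is visible.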

2606.03216 2026-06-03 cs.CV

Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting

Junkun Yuan, Yutao Shen, Toru Aonishi, Hideki Nakayama, Yue Ma

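AI summary: Starting from first principles, systematically studies preference alignment for image inpainting using the direct preference optimization framework and preference data built with public reward models; finds that reward models are biased but that ensembling mitigates the bias, and substantially outperforms prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments.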

Comments
23 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2509.23082
Abstract

We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases, particularly in brightness, composition, and color scheme, that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: https://github.com/shenytzzz/Follow-Your-Preference.
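
One way to build a preference pair from several reward models is sketched below. The abstract only says "a simple ensemble", so the specific choice here (z-score each model's rewards over the candidates, then average) is an assumption, as are the function names and the toy reward functions.

```python
import statistics


def zscore(xs):
    """Standardize scores so that models with different scales are comparable."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)
    return [0.0 if sd == 0 else (x - mu) / sd for x in xs]


def build_preference_pair(candidates, reward_fns):
    """Return (chosen, rejected) using the mean of per-model z-scored rewards."""
    per_model = [zscore([fn(c) for c in candidates]) for fn in reward_fns]
    ensemble = [statistics.mean(col) for col in zip(*per_model)]
    best = max(range(len(candidates)), key=ensemble.__getitem__)
    worst = min(range(len(candidates)), key=ensemble.__getitem__)
    return candidates[best], candidates[worst]


# Toy example: strings stand in for inpainted images, and the two "reward
# models" are hypothetical functions that happen to agree with each other.
candidates = ["a", "bb", "cccc"]
reward_fns = [len, lambda c: 2 * len(c)]
chosen, rejected = build_preference_pair(candidates, reward_fns)
print(chosen, rejected)  # cccc a
```

The resulting (chosen, rejected) pairs are the kind of data a direct preference optimization loss consumes. Standardizing before averaging keeps a single reward model with a large numeric range from dominating the ensemble.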

2606.03214 2026-06-03 cs.AI cs.CV cs.CY cs.LG

Effect of Demographic Bias on Skin Lesion Classification

Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim

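AI summary: Evaluates skin lesion classification with ResNet-based convolutional models, uses linear programming to control demographic characteristics and study the effect of patient sex and age bias, and compares three learning strategies; finds that sex bias arises mainly from data imbalance, while age bias consistently favours younger groups.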

Journal ref
https://melba-journal.org/2026:011
Comments
Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures
Abstract

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2606.03212 2026-06-03 cs.LG

Bayesian Tensor Decomposition with Diffusion Model Prior

Zerui Tao, Qibin Zhao

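AI summary: Proposes the DiffBCP framework, which combines a cumulative shrinkage process prior with a pre-trained diffusion model and performs Bayesian CP decomposition via split Gibbs sampling, outperforming existing methods on image inpainting and denoising.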

Comments
ICML 2026
Abstract

Low-rank tensor decomposition (TD) is usually effective on clean, fully observed data, but it often degrades under severe missingness or noise. Low-rankness is itself a useful but limited structural prior, and additional handcrafted priors (e.g., sparsity or smoothness) still fall short of capturing the rich statistics of real-world data. To compensate for this weak inductive bias under heavy corruption, one would like to inject a learned, data-driven prior; however, the state-of-the-art diffusion models are not readily compatible with current TD and tractable posterior inference. To address these challenges, we introduce DiffBCP, a hybrid-prior Bayesian CP decomposition framework that couples a cumulative shrinkage process prior over the CP factors for automatic rank selection with an off-the-shelf pre-trained diffusion model as an implicit data prior on the reconstructed tensor. To make posterior inference tractable despite the coupling among the likelihood, low-rank constraint, and diffusion prior, we develop a split Gibbs sampler: CP factors admit conjugate updates, while the diffusion block is sampled via low-rank-guided denoising. A noise-adaptive coupling schedule further reduces sensitivity to hand-tuned annealing. Experiments on image inpainting and denoising, including high-resolution out-of-distribution images, show consistent gains over Bayesian, nonlinear, and plug-and-play TD baselines.

2606.03210 2026-06-03 cs.CE cs.LG cs.NA math.NA

Critical evaluation of PINN for FWD inverse analysis and differentiable FEM as an alternative

Yongjin Choi, Hyeonbin Moon, Seunghwa Ryu

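AI summary: Critically evaluates physics-informed neural networks (PINNs) for falling weight deflectometer (FWD) inverse analysis of multilayer pavement systems and proposes the differentiable finite element method (DiffFEM) as a more accurate, stable, and efficient alternative.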

Abstract

Automatic-differentiation-based inverse analysis methods, including physics-informed neural networks (PINNs) and differentiable programming, have recently shown great promise due to their ability to compute accurate gradients and their convergence efficiency. However, their applicability to falling weight deflectometer (FWD) backcalculation remains unexplored. This study critically evaluates PINN-based inverse analysis for a multilayer pavement system and investigates the differentiable finite element method (DiffFEM) as an alternative, based on a synthetic benchmark. The standard PINN does not recover layer moduli because of the sharp domain discontinuities inherent to layered pavement systems. Although we use an extended PINN with domain decomposition (XPINN), which shows better performance on discontinuous domains, its performance remains highly sensitive to loss weighting and network architecture, and degrades under measurement noise. By contrast, DiffFEM consistently achieves more accurate, stable, and computationally efficient inversion results. These results indicate that DiffFEM, which enforces the governing physics as a hard constraint, yields better accuracy, robustness, and computational efficiency than PINN-based approaches, in which the governing physics is imposed as a soft constraint through the loss function. More broadly, the findings suggest that the choice between PINN- and DiffFEM-based inverse analysis needs careful consideration, with DiffFEM offering practical advantages when an efficient and robust differentiable forward solver is available.

2606.03209 2026-06-03 cs.LG

DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

Yunsheng Yuan, Shaowei Li, Kai Wang, Zhongyuan Sun, Zheng Zhang, Kai Han, Jun Luo, Feng Li

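AI summary: For LLM fine-tuning in privacy-sensitive and resource-constrained environments, proposes the DECA framework, which uses block-wise Adam optimization and a decentralized consensus mechanism to achieve efficient full-parameter fine-tuning on non-IID data, balancing convergence speed, downstream performance, and resource efficiency.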

Abstract

Fine-tuning large language models (LLMs) in privacy-sensitive and resource-constrained environments remains challenging. Since training data are often distributed across multiple clients, decentralized fine-tuning offers a natural paradigm for collaborative adaptation without a central server. However, enabling full-parameter fine-tuning (FPFT) in this decentralized setting is difficult: FPFT provides strong adaptation capacity but incurs prohibitive resource consumption for billion-scale models. Existing decentralized LLM fine-tuning methods therefore mainly rely on parameter-efficient updates, which improve efficiency but may restrict downstream performance. Moreover, client data are typically non-IID, making decentralized optimization more vulnerable to client drift and unstable convergence. To address these challenges, we propose DECA, a resource-efficient decentralized FPFT framework for LLMs on non-IID data. DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals. We provide rigorous theoretical analysis and extensive experiments, showing that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.
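
The resource saving from sequential block-wise Adam comes from keeping optimizer state only for the block currently being updated. The single-client sketch below illustrates that idea under stated assumptions: the function names, hyperparameters, the reset of moments at each block switch, and the toy quadratic objective are illustrative choices, and the decentralized consensus step that DECA adds across clients is omitted entirely.

```python
import math


def adam_block_step(params, grads, block, m, v, t,
                    lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step on the indices in `block`; m and v are sized to the block."""
    for k, i in enumerate(block):
        m[k] = b1 * m[k] + (1 - b1) * grads[i]
        v[k] = b2 * v[k] + (1 - b2) * grads[i] ** 2
        m_hat = m[k] / (1 - b1 ** t)  # bias-corrected first moment
        v_hat = v[k] / (1 - b2 ** t)  # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def blockwise_adam(params, grad_fn, blocks, steps_per_block=50, sweeps=3):
    """Visit the disjoint blocks in turn, holding moments for one block at a time."""
    for _ in range(sweeps):
        for block in blocks:
            m = [0.0] * len(block)  # fresh moments for the active block (assumption)
            v = [0.0] * len(block)
            for t in range(1, steps_per_block + 1):
                # A real system would compute gradients for the active block only.
                adam_block_step(params, grad_fn(params), block, m, v, t)
    return params


target = [1.0, -2.0, 3.0, 0.5]


def grad_fn(p):
    return [2 * (x - t) for x, t in zip(p, target)]


def loss(p):
    return sum((x - t) ** 2 for x, t in zip(p, target))


params = blockwise_adam([0.0] * 4, grad_fn, blocks=[[0, 1], [2, 3]])
print("final loss:", loss(params))
```

With two blocks of equal size, the optimizer state held at any moment is half of what full Adam would need, while every parameter still gets updated over a full sweep.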

2606.03204 2026-06-03 cs.RO eess.SP

Toward Gripper-Integrated Active Electrosense for Pre-Contact Sensing in Underwater Soft Grippers

Ahsan Tanveer, Muhammad Hamza, Waqar Hussain Afridi, Chen Wang, Guangming Xie

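AI summary: To address limited underwater visibility, proposes an active electrosense approach integrated into a soft gripper that detects pre-contact signals by measuring electric-field perturbations in conductive media; experiments show that multi-electrode voltage readings capture structured, object-induced changes.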

Comments
Extended abstract accepted to the IEEE ICRA 2026 Workshop on Manipulation Robustness
Abstract

Underwater manipulation often occurs under degraded visibility due to turbidity, glare, and gripper occlusion, limiting the reliability of vision-based perception during approach and grasping. In such settings, soft grippers are well suited for compliant interaction, but they typically lack an onboard pre-contact cue that can guide approach and closure when vision is unreliable. This extended abstract explores active electrosense as a lightweight sensing modality that can provide a proximity-like signal prior to contact by measuring perturbations of an applied electric field in conductive media. We instrument an octopus-inspired gripper with a discrete electrode layout and record multi-channel sensing voltages using off-the-shelf hardware. Simulation and tank experiments with a suspended conductive sphere show structured, object-dependent changes in the multi-electrode voltage readout relative to empty-water baselines, with detectability varying across excitation of 5 to 20 V and frequencies from 1 mHz to 1 kHz. These findings motivate systematic investigation of gripper-integrated electrosense as a complementary pre-contact cue for underwater soft manipulation.

2606.03203 2026-06-03 cs.AI

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

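AI summary: Proposes MedCUA-Bench, an interactive benchmark covering 18 clinical scenarios across 10 medical domains that uses a deterministic checker to evaluate task completion and five clinical safety dimensions, revealing the performance gap of current agents on real clinical software.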

Abstract

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

2606.03199 2026-06-03 cs.LG physics.chem-ph

Fast Organic Crystal Structure Prediction with Unit Cell Flow Matching

Alston Lo, Luka Mucko, Austin H. Cheng, Andy Cai, Alastair J. A. Price, Wojciech Matusik, Alán Aspuru-Guzik

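AI summary: Proposes the Clari model, which uses flow matching to generate redundancy-free unit cells and predicts organic crystal structures in seconds, a 15-30x speedup.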

Abstract

Organic crystal structure prediction (CSP) is a requirement for computational modelling of organic solids, but traditionally costs several CPU-years per molecule. Generative models such as OXtal dramatically reduce this cost by sampling stable organic crystal structures directly. However, OXtal forgoes explicit lattice parametrization in favour of modelling large crops of the bulk material with expensive triangle layers, which can incur a computational cost of minutes per molecule. In this paper, we reduce this to seconds with Clari, a large-scale flow matching model that generates redundancy-free unit cells and replaces triangle layers with pure pair-bias attention. Clari requires only atom types and bonds as input and does not need an RDKit-sanitizable input molecule, which expands its applicability to challenging chemistries such as fullerenes, metal complexes, and atom clusters. We further ablate key design choices such as auxiliary losses, timestep distributions, noise priors, and self-conditioning. On OXtal's test sets, we surpass OXtal's solve rate while obtaining a speedup of 15-30x. Because Clari also models explicit hydrogens, it supports inference-time scaling via direct energy ranking, without any decoration or relaxation step. When generating 150 crystals and selecting the top-30 by energy, we further improve solve rate while maintaining a speedup of 5-8x. We also introduce the CSD Teaching Subset as a new test split of diverse and complex molecules for future benchmarking. Our contributions enable CSP within seconds, making large-scale virtual screening of organic solids practical. Code is available at https://github.com/aspuru-guzik-group/clari.
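
The "generate 150, keep the top 30 by energy" step amounts to a sample-then-rank loop, sketched below. The sketch assumes that "top" means lowest energy, and `sample_fn` and `energy_fn` are hypothetical placeholders for the generative model and the energy evaluation; neither is part of the Clari codebase as described here.

```python
import random


def generate_and_rank(sample_fn, energy_fn, n_samples=150, top_k=30, seed=0):
    """Draw candidates, then keep the `top_k` with the lowest energy (assumption)."""
    rng = random.Random(seed)
    candidates = [sample_fn(rng) for _ in range(n_samples)]
    return sorted(candidates, key=energy_fn)[:top_k]


# Toy stand-ins: a "structure" is a single number and its energy is its square.
def toy_sample(rng):
    return rng.uniform(-1.0, 1.0)


def toy_energy(x):
    return x * x


selected = generate_and_rank(toy_sample, toy_energy)
print(len(selected), toy_energy(selected[0]) <= toy_energy(selected[-1]))
```

The cost of this step grows linearly with the number of samples, which is consistent with the abstract reporting a smaller speedup (5-8x instead of 15-30x) when 150 crystals are generated per molecule.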

2606.03198 2026-06-03 cs.CL cs.AI

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

Sangwon Baek, Kyu Yeon Hur, Kyunga Kim

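AI summary: Through a factorial study, finds that a rubric-anchored protocol amplifies AI raters' discriminative power while a rubric-free protocol suppresses it, supporting rubric anchoring in clinical AI evaluation.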

Comments
11 pages, 4 main figures, 8 supplementary figures, 9 supplementary tables
Abstract

Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors, namely CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs. Baseline), rater model, prompt character, and prompt type, and estimated main effects together with their protocol interactions. Across all questions, AI raters yielded consistently higher scores within a very narrow range (74-78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.

2606.03197 2026-06-03 cs.CL

MemTrain: Self-Supervised Context Memory Training

Ziheng Li, Xingrun Xing, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

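AI summary: Proposes MemTrain, a self-supervised framework that uses two coupled proxy tasks (end-to-end masked reconstruction and intermediate memory recall) over unlabeled Wikipedia corpora to strengthen the context-memory capability of LLM agents, substantially improving downstream long-text reasoning performance.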

Abstract

Memory is an indispensable capability for long-horizon LLM agents, enabling them to preserve and utilize information accumulated across extended interactions. Existing memory-agent approaches are typically trained end-to-end with reinforcement learning on downstream tasks. However, collecting high-quality annotated problems for memory-intensive scenarios is costly, and the resulting training data often lack sufficient diversity to cover general memory behaviors. In this work, we propose MemTrain, a self-supervised training framework for generally enhancing the context-memory capability of LLM agents for more effective downstream post-training. MemTrain introduces two coupled proxy tasks over unlabeled Wikipedia corpora: (1) an end-to-end masked reconstruction objective, which requires the model to recover masked entities after multiple rounds of memory updates, thereby encouraging memory maintenance from the final outcome perspective; and (2) an intermediate memory recall objective, which requires the model to reconstruct masked historical information using intermediate memory states, encouraging faithful compression and memory completeness throughout the interaction process. The two objectives are jointly optimized using GRPO. Extensive experiments on long-text QA and search-based QA benchmarks demonstrate that MemTrain consistently improves downstream memory-intensive reasoning performance across different models, achieving gains of up to 17.67 points over direct task-specific post-training.

2606.03188 2026-06-03 cs.RO

GeoSem-WAM: Geometry- and Semantic-Aware World Action Models

Fulong Ma, Daojie Peng, Wenjun Yue, Jiahang Cao, Bintao Wang, Qiang Zhang, Jun Ma

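AI summary: Proposes the GeoSem-WAM framework, which enhances latent representations with geometric and semantic supervision, jointly capturing scene dynamics, spatial geometry, and semantic context in a unified latent space while avoiding explicit future rollout or video generation at test time, improving the accuracy and robustness of action prediction.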

Abstract

Recent World Action Models (WAMs) have demonstrated impressive capabilities in embodied decision-making. However, whether their effectiveness stems from explicit future imagination during inference or representation learning induced by predictive training remains an open question. Emerging evidence suggests the primary advantage lies in learning robust latent representations rather than generating future observations at test time. Nevertheless, existing WAMs mainly rely on RGB-based future prediction, which provides limited structural and spatial understanding of complex environments. To address this, we propose a structured world modeling framework that enhances latent representations through geometric and semantic supervision. Alongside future RGB prediction, our model introduces two auxiliary prediction branches for future geometry and semantic representations, enabling it to jointly capture scene dynamics, spatial geometry, and semantic context within a unified latent space. Crucially, our approach preserves efficient inference by avoiding explicit future rollout or video generation at test time. Extensive experiments show that incorporating structured world supervision consistently improves action prediction accuracy, scene understanding, and robustness under challenging embodied scenarios, highlighting its potential for advancing scalable and efficient WAMs.

2606.03183 2026-06-03 cs.MM cs.CV cs.SD eess.AS

Inference-Time Scaling for Joint Audio-Video Generation

联合音视频生成的推理时缩放

Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung

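AI summary: To address the multi-objective optimization challenge in joint audio-video generation, proposes a multi-verifier framework and an adaptive reward weighting algorithm that significantly improve semantic alignment, perceptual quality, and audio-visual synchronization without additional training.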

详情
Comments
Accepted by Transactions on Machine Learning Research (TMLR). Project page: https://jung-jaemin.github.io/ITS-AVGen-Proj/
AI中文摘要

联合音视频生成旨在合成与文本提示语义对齐且精确同步的逼真音视频对。现有联合音视频生成模型通常需要大量训练资源来提高保真度,而推理时缩放(ITS)最近在单模态领域成为一种有前景的无训练替代方案。然而,将ITS从单模态扩展到多模态领域并非易事,因为它需要平衡多个异构目标。在本文中,我们首次对联合音视频生成的ITS进行了全面研究。我们首先证明多验证器框架对于解决单目标指导的局限性(包括非对称性能权衡和验证器欺骗)至关重要。通过系统分析,我们随后确定了一个最优的多验证器组合,该组合在所有质量维度上产生均衡的改进。最后,为了有效聚合多样化的奖励信号,我们提出了自适应奖励加权(ARW),一种新颖的测试时优化算法。ARW将奖励聚合视为在线优化问题,利用可学习参数校准奖励方差,无需奖励分布的先验知识,从而确保鲁棒的多目标选择。在VGGSound和JavisBench-mini基准上的实验结果表明,我们的框架显著增强了生成输出的语义对齐、感知质量和音视频同步。合成样本和代码可在项目页面获取:https://jung-jaemin.github.io/ITS-AVGen-Proj。

英文摘要

Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.
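The abstract does not spell out how ARW calibrates reward variances, so the following is only a minimal sketch of the underlying problem it describes: verifiers that score on different scales can dominate a naive sum. It swaps in plain per-verifier z-score normalization over a batch of candidates, which is a standard technique and not the authors' learnable weighting. The function name `select_best` and the toy reward values are hypothetical.

```python
from statistics import mean, pstdev

def select_best(candidate_rewards, weights=None):
    """Pick the candidate with the highest sum of z-scored verifier rewards.

    candidate_rewards: one list per candidate, one entry per verifier.
    This is static normalization, NOT the learnable ARW scheme.
    """
    n_verifiers = len(candidate_rewards[0])
    weights = weights or [1.0] * n_verifiers
    columns = list(zip(*candidate_rewards))
    # Per-verifier mean and spread; fall back to 1.0 if all scores are equal.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]

    def score(rewards):
        return sum(w * (r - m) / s for w, r, (m, s) in zip(weights, rewards, stats))

    scores = [score(r) for r in candidate_rewards]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy rewards: verifier 0 scores in [0, 1], verifier 1 scores in [0, 100].
rewards = [[0.9, 10.0], [0.5, 30.0], [0.8, 28.0]]
best, scores = select_best(rewards)
raw_best = max(range(len(rewards)), key=lambda i: sum(rewards[i]))
print(best, raw_best)
```

On this toy data the raw sum is driven almost entirely by the large-scale verifier, while the normalized score prefers the candidate that is good on both.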

2606.03180 2026-06-03 cs.CV cs.CL cs.LG

GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations

GLINT:面向细粒度放射学表征的稀疏门控视觉-语言对齐

Jonggwon Park, Seongeun Lee, Junhyun Park, Hannah Yun, Hyunwoong Kim, Sohyun Jeong, Hyewon Kang, Byungmu Yoon, Kyoyun Choi

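AI summary: To address the scale mismatch between global image-report alignment and local findings in radiology, proposes the GLINT framework, which achieves zero-shot classification, grounding, and segmentation through sparsely gated alignment and dense feature regularization.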

详情
AI中文摘要

放射学中的视觉-语言模型(VLM)通过利用临床工作流程中自然产生的图像-报告对,已成为一种可扩展的范式。然而,这种配对揭示了尺度上的不匹配:每个病灶仅占据图像的一小部分区域,但监督仅在全局图像-报告级别提供。这带来了一个核心挑战:先前的方法将权重密集地分布到所有补丁上,而不是集中在与给定查询相关的稀疏子集上。为了解决这个问题,我们提出了GLINT(门控语言-图像对齐)框架,该框架显式建模这种稀疏对应关系。在对齐方面,我们引入了稀疏门控对齐,这是一种新颖的架构,其中在单独的门控嵌入空间上的sigmoid门仅激活与每个文本查询相关的补丁,强制执行显式稀疏性。在表征方面,我们添加了密集特征正则化,将可训练编码器的中间特征锚定到冻结的自监督学习(SSL)教师模型上,从而保留门控所依赖的细粒度补丁特征。相同的方案适用于2D胸部X光片(CXR)和3D胸部计算机断层扫描(CT),分别基于DINOv3和V-JEPA 2.1构建。GLINT支持从自由文本查询进行零样本分类、定位和分割,据我们所知,这是首次在没有掩码监督的情况下在3D CT体积上展示零样本分割。值得注意的是,最显著的增益出现在零样本定位和分割上,这些任务需要稀疏的、特定于查询的定位,这与我们的设计意图一致。在下游评估中,GLINT在分类、报告生成和分割方面均优于SSL编码器和医学VLM。

英文摘要

Vision-language models (VLMs) for radiology have emerged as a scalable paradigm by leveraging image-report pairs naturally produced in clinical workflows. However, this pairing reveals a mismatch in scale: each finding occupies only a small region of the image, yet supervision is provided only at the global image-report level. This poses a central challenge: prior approaches spread weight densely across all patches rather than concentrating on the sparse subset relevant to a given query. To address this, we present GLINT (Gated Language-Image alignmeNT), a framework that explicitly models this sparse correspondence. On the alignment side, we introduce Sparsely Gated Alignment, a novel architecture in which a sigmoid gate over a separate gate embedding space activates only the patches relevant to each textual query, enforcing explicit sparsity. On the representation side, we add Dense Feature Regularization, which anchors the trainable encoder's intermediate features to a frozen self-supervised learning (SSL) teacher, preserving the fine-grained patch features that the gate relies on. The same recipe applies to both 2D chest X-ray (CXR) and 3D chest computed tomography (CT), built with DINOv3 and V-JEPA 2.1, respectively. GLINT enables zero-shot classification, grounding, and segmentation from free-text queries, and to our knowledge is the first to demonstrate zero-shot segmentation on 3D CT volumes without mask supervision. Notably, the most pronounced gains arise on zero-shot grounding and segmentation, where sparse, query-specific localization is required, consistent with our design intent. In downstream evaluation, GLINT outperforms both SSL encoders and medical VLMs on classification, report generation, and segmentation.
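To make the gating idea concrete, here is a minimal sketch of a sigmoid gate that selects patches for a text query and then pools patch-text similarities over the selected patches. It assumes the gate logit is a dot product in the gate space and that pooling is a gate-weighted mean; both are illustrative assumptions, and the function and variable names are hypothetical rather than GLINT's actual interface.

```python
import math

def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gated_similarity(patch_gate, patch_feat, text_gate, text_feat, bias=0.0):
    """Gate-weighted mean of patch-text similarities.

    patch_gate / text_gate live in a separate (assumed) gate space;
    patch_feat / text_feat live in the alignment space.
    """
    gates = [sigmoid(dot(g, text_gate) + bias) for g in patch_gate]
    sims = [dot(f, text_feat) for f in patch_feat]
    score = sum(g * s for g, s in zip(gates, sims)) / sum(gates)
    return score, gates

# Three patches; only the first is relevant to the query.
patch_gate = [[4.0, 0.0], [-4.0, 0.0], [-4.0, 0.0]]
patch_feat = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
score, gates = gated_similarity(patch_gate, patch_feat, [1.0, 0.0], [1.0, 0.0])
print(round(score, 3), [round(g, 3) for g in gates])
```

Because each gate is an independent sigmoid rather than a softmax over patches, any number of patches (including very few) can be active for a given query, which is what makes the sparsity explicit.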

2606.03179 2026-06-03 cs.CL

HyperPatch: Sequential Knowledge Editing Under n-ary Structural Drift

HyperPatch: n元结构漂移下的顺序知识编辑

Yu-Kai Chan, Wen-Sheng Lien, Dong-Ting Yao, Bo-Kai Ruan, Kwan-Yeung Lin, Hong-Han Shuai, Meng-Fen Chiang

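AI summary: To address the structural drift caused by sequential updates to n-ary events in non-stationary environments, proposes the HyperPatch framework, which performs parameter-preserving knowledge editing by modeling stability over hypergraph manifolds and achieves relative Hop-wise Accuracy gains of 96.24% and 21.06% on the MQuAKE-CF and MQuAKE-T benchmarks, respectively.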

详情
Comments
Accepted to Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

大型语言模型(LLMs)依赖知识编辑(KE)来维持时间有效性,然而现实世界中的知识本质上是n元的。我们证明,在非平稳环境中,对复杂关系的顺序更新会引发n元结构漂移,这是一种将n元事件二元化为三元组会破坏关系原子性的现象。这导致结构条件知识迁移失败,即检索器系统性错误接地,常被误诊为参数幻觉。为解决此问题,我们提出HyperPatch,一个参数保留框架,将顺序KE重新表述为超图流形上的稳定性问题。HyperPatch通过三个阶段保持事件完整性:(i)结构先验初始化,通过在超图神经网络(HGNN)上进行对比学习建立拓扑感知的嵌入空间,以捕获高阶相关性;(ii)顺序拓扑编辑,利用双阶段机制,采用基于SimHash的拓扑对齐进行快速冲突解决,以及拓扑LoRA自适应来跟踪漂移而无需骨干重训练;(iii)结构条件推理,整合来自融合语言和结构流形的全局一致证据。在MQuAKE-CF和MQuAKE-T基准上,HyperPatch在跳步准确率(H-Acc)上分别比最强基线相对提升96.24%和21.06%。进一步的消融实验表明,在连续n元更新流下具有卓越的可靠性,而标准基于KG的变体因结构错位导致H-Acc下降高达88.3%。

英文摘要

Large Language Models (LLMs) rely on Knowledge Editing (KE) to maintain temporal validity, yet real-world knowledge is inherently n-ary. We demonstrate that in non-stationary environments, sequential updates to complex relations induce N-ary Structural Drift, a phenomenon where the binary reification of n-ary events into triples fractures relational atomicity. This precipitates Structure-Conditioned Knowledge Transfer Failure, a systematic mis-grounding of the retriever frequently misdiagnosed as parametric hallucination. To tackle this, we propose HyperPatch, a parameter-preserving framework that reformulates sequential KE as a stability problem over hypergraph manifolds. HyperPatch preserves event integrity through three phases: (i) Structural Prior Initialization, establishing a topology-aware embedding space via contrastive learning on a Hypergraph Neural Network (HGNN) to capture high-order correlations; (ii) Sequential Topology Editing, utilizing a dual-stage mechanism that employs SimHash-based Topological Alignment for rapid conflict resolution and Topological LoRA Adaptation to track drift without backbone retraining; and (iii) Structure-Conditioned Reasoning, which integrates globally consistent evidence from fused linguistic and structural manifolds. On the MQuAKE-CF and MQuAKE-T benchmarks, HyperPatch achieves relative gains in Hop-wise Accuracy (H-Acc) of 96.24% and 21.06% over the strongest baseline, respectively. Further ablations demonstrate superior reliability under continuous n-ary update streams, whereas the standard KG-based variant suffers H-Acc collapses of up to 88.3% due to structural misalignment.
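SimHash, which the abstract names as the basis of its topological alignment step, is a general locality-sensitive fingerprinting technique. The sketch below shows the generic algorithm only: how HyperPatch encodes hyperedges and resolves conflicts is not described in the abstract, so the `role:entity` token format and the helper names are assumptions for illustration.

```python
import hashlib

BITS = 64

def simhash(tokens):
    """64-bit SimHash fingerprint of a collection of string tokens."""
    counts = [0] * BITS
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

# A hypothetical n-ary event, before and after an edit to one argument.
old_event = ["event:award", "winner:Alice", "field:physics", "year:2024"]
new_event = ["event:award", "winner:Alice", "field:physics", "year:2025"]
distance = hamming(simhash(old_event), simhash(new_event))
print(distance)
```

Events that share most of their arguments tend to produce fingerprints with a small Hamming distance, so a cheap bitwise comparison can flag candidates that may conflict with an incoming edit before any more expensive check runs.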

2606.03177 2026-06-03 cs.RO

ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control

ConTrack: 具有自适应权衡控制的约束手部运动跟踪

Yutong Liang, Quanquan Peng, Ri-Zhao Qiu, Xiaolong Wang

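AI summary: Proposes ConTrack, a reinforcement learning framework that treats object tracking as a constraint and adapts the task-style trade-off with a dual-variable update, combined with an adaptive mid-trajectory reset library, to track long-horizon, contact-rich hand motion; it significantly improves success rate and object pose accuracy in simulation and on a real robot.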

详情
AI中文摘要

人类演示为机器人操作提供了强大的先验,但由于运动学差距,将其转移到真实机器人上执行并非易事。在灵巧操作中,即使在仿真器中跟踪长时域、接触密集的序列仍然具有挑战性:参考跟踪策略必须保持物体在其目标轨迹上,同时保留演示的关节运动和接触时序。现有方法通常依赖于需要针对每个序列进行调整的手工奖励调节,并且在有限的交互预算下会失效。我们提出了ConTrack,一种随跟踪数据扩展的强化学习(RL)框架。ConTrack将物体跟踪视为约束,并将剩余控制权限分配给运动保真度,从而通过双变量更新在线适应任务-风格权衡。此外,ConTrack还通过一个自适应中轨迹重置库来稳定长时域学习,该库重用策略可达的仿真器状态。我们在仿真跟踪和真实机器人上的定性和定量结果表明,ConTrack在保持关节和接触保真度的同时,显著提高了成功率和物体位姿精度,优于现有技术。网站:https://www.lyt0112.com/projects/ConTrack。

英文摘要

Human demonstrations provide strong priors for robot manipulation, yet transferring them to real robots is non-trivial because of the kinematic gap. In dexterous manipulation, it remains challenging to track long-horizon, contact-rich sequences even in simulators: a reference-tracking policy must keep objects on their target trajectories while preserving demonstrated joint motion and contact timing. Existing approaches often rely on hand-crafted rewards that require per-sequence tuning and break under limited interaction budgets. We introduce ConTrack, a reinforcement learning (RL) framework that scales with tracking data. ConTrack treats object tracking as a constraint and allocates remaining control authority to motion fidelity, which allows it to adapt task-style trade-offs online using a dual-variable update. ConTrack also stabilizes long-horizon learning with an adaptive mid-trajectory reset library that reuses policy-reachable simulator states. Our qualitative and quantitative results in simulation and on a real robot demonstrate that ConTrack improves success rate and object pose accuracy significantly over prior art while preserving joint and contact fidelity. Website: https://www.lyt0112.com/projects/ConTrack.
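The dual-variable update mentioned in the abstract can be read as a form of Lagrangian relaxation. The sketch below shows the textbook projected dual-ascent step under that reading; the error budget, step size, reward form, and function names are all assumptions for illustration and are not taken from the paper.

```python
def dual_update(lmbda, constraint_cost, budget, step_size=0.1):
    """One projected dual-ascent step.

    The multiplier grows while the constraint cost exceeds its budget,
    shrinks while there is slack, and is clipped at zero.
    """
    return max(0.0, lmbda + step_size * (constraint_cost - budget))

def combined_reward(style_reward, constraint_cost, lmbda):
    """Lagrangian-style reward: motion fidelity minus weighted tracking cost."""
    return style_reward - lmbda * constraint_cost

# Hypothetical object-tracking errors over successive policy updates.
lmbda = 0.0
history = []
for tracking_error in [0.30, 0.25, 0.12, 0.04, 0.02]:
    lmbda = dual_update(lmbda, tracking_error, budget=0.05)
    history.append(lmbda)
print([round(v, 4) for v in history])
```

The point of the construction is that the trade-off weight is no longer a hand-tuned constant per sequence: it rises automatically when the object drifts from its target and relaxes once tracking is within budget, leaving more of the reward to motion fidelity.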

2606.03173 2026-06-03 cs.CY cs.LG cs.SI

Auditing Engagement Incentives in the Kidfluencer Ecosystem: A Multimodal Weak Supervision Approach

审计儿童网红生态系统中的参与激励:一种多模态弱监督方法

Zijing Wei, Chao Peter Yang, Xuanjie Chen

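AI summary: Audits YouTube kidfluencer channels with a multimodal weak supervision approach, finding that exploitation signals correlate significantly and positively with view counts and that performative labor, emotional bait, and privacy violations carry an engagement premium.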

详情
AI中文摘要

YouTube上“儿童网红”的兴起引发了对儿童数字劳动和剥削的伦理担忧。尽管新兴立法试图规范这一生态系统,但由于大规模操作化剥削的困难,将剥削与参与度联系起来的实证证据仍然稀缺。本研究对79个儿童网红频道的5,051个视频进行了多模态AI审计,使用弱监督方法检测剥削信号,无需大规模人工标注。我们聚合了噪声标注函数——包括基于LLM的标题分类和基于GPT-4 Vision的缩略图与描述分析,涵盖六个基于文献的维度——为每个视频分配一个概率剥削分数。一项多标注者验证研究(N=107)显示与人类判断高度一致(宏平均F1=0.911),并对整体剥削风险具有高敏感性(召回率=0.960,F1=0.793)。我们的发现揭示了表演性劳动、情感诱饵和隐私侵犯的显著参与度溢价。剥削分数与观看次数相关(Spearman ρ=0.229,p<10^{-50}),控制频道层面变化的混合效应回归显示,剥削分数每增加一个单位,观看次数增加4.4倍(p<0.001)。频道内分析表明,情感诱饵的中位观看次数提升+65.6%,表演性内容提升+56.0%(FDR校正p<0.001),且在同年稳健性检验中效果持续(p=0.030)。相比之下,明确的商业内容(产品植入)没有溢价(-3.8%,不显著),表明平台奖励的是儿童身份和劳动的商品化,而非传统广告。这些发现挑战了仅关注财务信托的政策框架,表明参与度与儿童的密集表演性劳动系统性地相关。

英文摘要

The rise of 'kidfluencers' on YouTube has raised ethical concerns about child digital labor and exploitation. While emerging legislation attempts to regulate this ecosystem, empirical evidence linking exploitation to engagement remains scarce, given the difficulty of operationalizing exploitation at scale. This study presents a multimodal AI audit of 5,051 videos across 79 kidfluencer channels, using weak supervision to detect exploitation signals without large-scale manual labels. We aggregate noisy labeling functions -- including LLM-based classification of titles and GPT-4 Vision analysis of thumbnails and descriptions across six literature-grounded dimensions -- to assign a probabilistic exploitation score to each video. A multi-annotator validation study (N=107) shows strong agreement with human judgment (macro-average F1 $= 0.911$) and high sensitivity for overall exploitation risk (recall $= 0.960$, F1 $= 0.793$). Our findings reveal a significant engagement premium for performative labor, emotional bait, and privacy violations. Exploitation scores correlate with view counts (Spearman $\rho = 0.229$, $p < 10^{-50}$), and mixed-effects regression controlling for channel-level variation shows that a one-unit increase in exploitation score yields a $4.4\times$ increase in views ($p < 0.001$). Within-channel analyses indicate median view boosts of $+65.6\%$ for emotional bait and $+56.0\%$ for performative content (FDR-corrected $p<0.001$), with effects holding in same-year robustness checks ($p=0.030$). Explicit commercial content (product placement), by contrast, shows no premium ($-3.8\%$, n.s.), suggesting the platform rewards commodification of the child's identity and labor over traditional advertising. These findings challenge policy frameworks focused solely on financial trusts, showing that engagement is systematically tied to the intensive, performative labor of children.
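For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the ranks of two variables, which makes it robust to the heavy-tailed distribution of view counts. The sketch below is a plain-Python illustration on invented toy numbers; it does not use or reproduce the study's data.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Invented toy data: per-video exploitation score and view count.
scores = [0.10, 0.40, 0.35, 0.80, 0.60]
views = [1200, 5000, 900, 40000, 7000]
rho = spearman(scores, views)
print(round(rho, 6))  # 0.9 for this toy data
```

Only the ordering of the view counts matters here, so the single video with 40,000 views has no more influence than any other top-ranked item.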

2606.03169 2026-06-03 cs.SD cs.LG cs.MM

SketchSong: Hierarchical Song Generation with Sketch Planning and Fine-Grained Multi-Track Modeling

SketchSong: 基于草图规划与细粒度多轨建模的分层歌曲生成

Xiaoyue Duan, Nanxing Hu, Yutang Feng, Xudong Yan, Jiatao Chen, Jinchao Zhang, Jie Zhou

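AI summary: Proposes SketchSong, a hierarchical song generation framework that addresses incoherent song arrangement and coarse modeling of musical parts through song-level sketch planning and fine-grained multi-track modeling, outperforming the baseline on objective metrics and human listening tests.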

详情
AI中文摘要

最近的歌曲生成系统能够合成逼真的音频,但生成完整歌曲仍面临两个挑战。首先,现有方法中缺乏明确的歌曲级编排规划,模型往往需要在生成底层音频细节的同时组织整体编排发展,这常导致编排不连贯,如段落过渡薄弱和动态进展受限。其次,对不同音乐部分的粗粒度建模掩盖了它们各自的作用和交互,限制了生成歌曲的编排丰富性。本文提出SketchSong,一种分层歌曲生成框架,通过歌曲级草图规划和细粒度多轨建模解决这些问题。在时间维度上,SketchSong首先预测从压缩音频表示中提取的高层草图标记的紧凑序列,然后基于这些草图生成音频标记。这种从粗到细的过程在详细音频生成之前为模型提供了明确的编排规划。在轨道维度上,SketchSong显式建模四个轨道,即人声、贝斯、鼓和其他乐器。这使得模型能够更精确地捕捉不同音乐部分的作用和交互。在歌曲生成基准上的实验表明,SketchSong在客观指标和人工听测上均持续优于基线。尽管没有采用额外的偏好优化后训练(如歌词和文本提示对齐),SketchSong仍取得了与经过后训练的强开源系统相竞争的结果,证明了我们整体设计的有效性。

英文摘要

Recent song generation systems can synthesize realistic audio, yet generating complete songs remains challenging for two reasons. First, explicit song-level arrangement planning remains limited in existing methods, so models often need to organize overall arrangement development while generating low-level audio details. This often leads to incoherence in arrangements, such as weak section transitions and limited dynamic progression. Second, coarse modeling of different musical parts obscures their distinct roles and interactions, limiting arrangement richness of generated songs. In this paper, we present SketchSong, a hierarchical song generation framework that addresses these issues through song-level sketch planning and fine-grained multi-track modeling. Along the temporal dimension, SketchSong first predicts a compact sequence of high-level sketch tokens derived from compressed audio representations, and then generates audio tokens conditioned on these sketches. This coarse-to-fine process gives the model an explicit arrangement plan before detailed audio generation. Along the track dimension, SketchSong explicitly models four tracks, i.e., vocals, bass, drums and other instruments. This enables the model to capture the roles and interactions of different musical parts more precisely. Experiments on song generation benchmarks show that SketchSong consistently outperforms our baseline on both objective metrics and human listening tests. Despite not employing additional post-training for preference optimization such as lyrics and text-prompt alignments, SketchSong achieves competitive results against strong, post-trained open-source systems, demonstrating the effectiveness of our overall design.

2606.03168 2026-06-03 cs.CV

JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

JAVEDIT: 联合音频-视觉指令引导视频编辑与智能体数据策展

Yinan Chen, Chuming Lin, Zhennan Chen, Yuxiang Zeng, Junwei Zhu, Yali Bi, Xijie Huang, Chengming Xu, Donghao Luo, Zhucun Xue, Xiaobin Hu, Chengjie Wang, Yong Liu, Jiangning Zhang, Shuicheng Yan

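AI summary: To address the lack of datasets and benchmarks for joint audio-visual editing, presents the first large-scale, high-quality dataset JAVEdit-100k, the benchmark JAVEditBench, and the baseline model JAVEdit, which outperforms all baselines on five of six metrics.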

详情
Comments
Equal contributions from first two authors. Project page: https://ryanchenyn.github.io/projects/JAVEdit Code: https://github.com/RyanChenYN/JAVEdit Dataset: https://huggingface.co/datasets/Coraxor/JAVEdit-100k
AI中文摘要

虽然基于指令的视频编辑已取得显著进展,但联合音频-视觉编辑仍受限于缺乏专用数据集和基准。为填补这一空白,我们提出了JAVEdit-100k,这是首个为指令引导的联合音频-视觉编辑定制的大规模高质量数据集。该数据集专注于以人为中心的视频,包含约10万个编辑三元组,涵盖五个不同类别,包括主体编辑和语音编辑。该数据集通过四个精心设计的生成流程严格构建,并无缝配对智能体在环质量控制机制。此外,为解决该领域缺乏标准化评估的问题,我们引入了JAVEditBench,这是一个全面的基准,包含精选源视频和跨所有编辑类别的人类对齐指令。最后,我们提出了JAVEdit,一个用于指令引导的联合音频-视觉编辑的开创性基线模型。实验表明,JAVEdit 在六项评估指标中的五项上优于所有基线。

英文摘要

While instruction-based video editing has seen significant progress, joint audio-visual editing remains constrained by the absence of dedicated datasets and benchmarks. To bridge this gap, we present JAVEdit-100k, the first large-scale, high-quality dataset tailored for instruction-guided joint audio-visual editing. Focusing on human-centric videos, JAVEdit-100k comprises approximately 100K editing triplets spanning five distinct categories, including subject editing and speech editing. This dataset is rigorously constructed via four meticulously designed generation pipelines, seamlessly paired with an agent-in-the-loop quality control mechanism. Furthermore, to address the lack of standardized evaluation within the field, we introduce JAVEditBench, a comprehensive benchmark featuring curated source videos and human-aligned instructions across all editing categories. Finally, we propose JAVEdit, a pioneering baseline model for instruction-guided joint audio-visual editing. Experiments show that JAVEdit outperforms all baselines on five of six evaluation metrics.

2606.03165 2026-06-03 cs.CL cs.AI

Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models

大型语言模型中词汇对齐和偏好阶段转变的完全自动识别

Thomas Stephan Juzek, Xiaoyang Ming, Jose A. Hernandez

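AI summary: Introduces two evaluation metrics that require no manual intervention, the Lexical Alignment Score and the Triangulated Preference Shift, to automatically identify lexical overuse in large language models and its link to human preference learning.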

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pages 6116-6131
Comments
16 pages, 2 figures, 10 tables
AI中文摘要

数字聊天助手(如ChatGPT)使用的语言可能与人类预期存在偏差(不对齐)。主要针对科学英语的研究已经描述了出现的偏差以及在一定程度上解释了原因,将其与人类偏好学习的训练阶段联系起来。然而,现有方法依赖于人工筛选。本文引入了两种无需筛选、假设较少的评估指标:词汇对齐分数(识别词汇过度使用)和三角化偏好转变(量化此类转变中有多少可归因于人类偏好学习)。使用PubMed摘要,生成了续写,并通过六个模型系列(Falcon、Gemma、Llama、Mistral、OLMo、Yi)的滑动窗口文档频率进行测量。该过程无需人工干预即可识别过度使用的词汇,如'suggest'、'additionally'和'strategy',并估计它们与偏好学习的关联。我们的发现重复了先前的工作,并且在参数设置、随机种子以及进一步数据的评估中保持稳定。该方法易于扩展,能够系统研究科学英语之外以及跨语言的词汇(不对齐),因此,这些指标有潜力为未来模型改进对齐并理解其起源做出贡献。

英文摘要

The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.

2606.03160 2026-06-03 cs.CV

SRENet: Spectral Re-Entry Network for Point Cloud Action Recognition

SRENet:用于点云动作识别的频谱重入网络

Qiuxia Wu, Jiarui Lan, Wenxiong Kang, Zhiyong Wang, Kun Hu

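AI summary: Proposes SRENet, which uses spectral decomposition and re-entry modules to learn global context and fine-grained temporal dynamics from a frequency perspective for action recognition on point cloud sequences.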

详情
Comments
13 pages, 11 figures. Accepted by IEEE Transactions on Circuits and Systems for Video Technology
AI中文摘要

从点云序列中识别人体动作对于自动驾驶和人机交互等3D感知驱动应用至关重要。然而,点云的不规则结构和时间不一致性给时空表示学习带来了独特挑战,特别是在捕捉全局运动上下文和细粒度时间动态方面。我们提出SRENet,一个频谱感知框架,旨在从频率角度显式学习动作识别的全局上下文和细粒度时间动态。SRENet引入频谱分解块(SDeBlock),沿时间和空间轴进行基于小波的分析,通过频率特定注意力将特征分解为低频和高频分量。为了恢复残差动态并重新对齐在语义融合过程中扭曲的时间频率结构,频谱重入块(SReBlock)执行二次时间分解。此外,设计了一种频谱感知学习策略,通过对比损失和课程调度增强两个频率子空间的可区分性,该调度逐渐将焦点从低频空间转移到高频空间,与从粗到细的运动模式一致。在MSR-Action3D、NTU-RGBD和NTU-RGBD120上的大量实验表明,SRENet实现了最先进的性能,验证了频率建模在基于点云的动作理解中的有效性。

英文摘要

Recognizing human actions from point cloud sequences is critical for 3D-perception-driven applications such as autonomous driving and human-computer interaction. However, the irregular structure and temporal inconsistency of point clouds pose unique challenges for spatio-temporal representation learning, especially in capturing both global motion context and fine-grained temporal dynamics. We propose SRENet, a spectral-aware framework designed to explicitly learn both global context and fine-grained temporal dynamics of motion from a frequency perspective for action recognition. SRENet introduces a Spectral Decomposition Block (SDeBlock) that performs wavelet-based analysis along temporal and spatial axes, disentangling features into low- and high-frequency components with frequency-specific attention. To recover residual dynamics and re-align temporal frequency structures distorted during semantic fusion, a Spectral Re-entry Block (SReBlock) performs secondary temporal decomposition. Furthermore, a spectral-aware learning strategy is devised to enhance discriminability in both frequency subspaces via contrastive loss and a curriculum schedule that gradually shifts focus from low- to high-frequency spaces, in line with coarse-to-fine motion patterns. Extensive experiments on MSR-Action3D, NTU-RGBD and NTU-RGBD120 demonstrate that SRENet achieves state-of-the-art performance, validating the effectiveness of frequency modeling in point cloud-based action understanding.
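As a concrete picture of what a wavelet split into low- and high-frequency components does to a temporal feature sequence, here is a one-level Haar transform in plain Python. The abstract does not say which wavelet SRENet uses, so Haar is chosen purely because it is the simplest case; the toy signal and function names are hypothetical.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (low, high): pairwise scaled sums (slow trend) and pairwise
    scaled differences (fast frame-to-frame change).
    """
    if len(signal) % 2:
        raise ValueError("expected an even number of frames")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high

def haar_inverse(low, high):
    """Exactly undo haar_step."""
    s = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / s, (l - h) / s])
    return out

# A per-frame feature value: a slow drift plus a fast jitter.
frames = [1.0, 1.2, 2.0, 1.8, 3.0, 3.4, 4.0, 3.6]
low, high = haar_step(frames)
restored = haar_inverse(low, high)
print([round(v, 3) for v in low], [round(v, 3) for v in high])
```

The low band carries the coarse motion trend at half the temporal resolution, while the high band isolates the rapid changes, and the pair is lossless because the original sequence can be reconstructed from it.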

2606.03159 2026-06-03 cs.CV cs.AI cs.RO

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

NVIDIA OmniDreams:用于闭环自动驾驶仿真的实时生成式世界模型

NVIDIA: Aarti Basant, Amlan Kar, Despoina Paschalidou, Fangyin Wei, Francesco Ferroni, Guillermo Garcia Cobo, Haithem Turki, Huan Ling, Jaewoo Seo, James Lucas, Jay Zhangjie Wu, Jialiang Wang, Jonathan Lorraine, Jun Gao, Kai He, Katarina Tothova, Kevin Xie, Michał Tyszkiewicz, Qi Wu, Riccardo de Lutio, Ruilong Li, Sanja Fidler, Seung Wook Kim, Tianchang Shen, Tianshi Cao, Tobias Pfaff, William Lew, Xindi Wu, Xuanchi Ren, Yifan Lu, Yuxuan Zhang, Zan Gojcic, Zian Wang

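AI summary: Presents OmniDreams, a foundation generative world model trained from the Cosmos diffusion model that autoregressively generates action-conditioned video, enabling real-time synthesis of complex long-tail scenarios in closed-loop simulation, and validates its usefulness for training policy models.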

详情
AI中文摘要

随着自动驾驶能力的提升,在长尾场景中安全评估驾驶策略仍是一个关键瓶颈。在闭环仿真中,驾驶策略模型与环境主动交互,其动作动态更新模拟器状态并直接影响下一组生成的传感器观测。尽管近期基于重建的神经模拟器提供了逼真效果,但它们从根本上受限于初始捕获数据,难以泛化到高度动态或新颖场景。为克服这些限制,我们引入了OmniDreams,一个从Cosmos扩散模型进行中期和后训练的基础生成式世界模型,能够自回归地实时生成动作条件视频。通过利用Cosmos丰富的视觉先验以及在21k小时驾驶场景上的中期和后训练,OmniDreams合成了传统模拟器难以捕获的复杂未观测现象,例如极端天气和不可预测的动态智能体行为。关键在于,它自回归地根据过去帧、当前模拟器状态和即时驾驶动作来调节其逼真的传感器生成。在结合Alpamayo 1策略模型和AlpaSim编排器的闭环系统中部署时,OmniDreams充当一个高度响应、反应灵敏的环境,为训练和评估下一代自动驾驶策略提供了可扩展且全面的解决方案。我们还展示了初步结果,表明从OmniDreams后训练的世界-动作模型(WAM)在Physical AI自动驾驶NuRec数据集上取得了强劲性能,超越了基于VLA的Alpamayo 1.5研究策略模型,同时仅使用其1/5的总参数量。这些结果凸显了像OmniDreams这样的实时世界模型也有潜力作为策略架构的骨干网络。

英文摘要

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.

2606.03157 2026-06-03 cs.AI

ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

ClinicalMC:面向大语言模型的多疗程临床决策基准

Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan

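AI summary: Proposes the ClinicalMC benchmark, which contains multi-stage samples, and uses a multi-agent evaluation framework to test the clinical decision-making of large language models in single-turn static and multi-turn dynamic settings.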

详情
AI中文摘要

大语言模型(LLMs)已在医疗领域广泛应用,但在复杂临床决策场景中仍面临重大挑战。现有基准主要评估LLMs在单疗程设置中的表现,缺乏对多疗程场景的系统评估——在后者中,患者的病情随时间演变。为弥补这一空白,我们提出ClinicalMC,一个面向多疗程临床决策的基准。它包含从入院到出院的四个阶段的1,275个中文样本和5,804个英文样本。这些阶段涵盖分诊、首诊检查/诊断/治疗、后续多疗程检查/评估/治疗以及最终诊断。在ClinicalMC中,英文数据集中的患者平均经历5.11个临床疗程,而中文数据集中的患者经历3.42个。为评估LLM性能,我们构建了一个多智能体评估框架,包括患者、考官和医生智能体。基于该基准和框架,我们设计了两种实验设置——单轮静态设置和多轮动态设置——并评估了三类LLM:1)闭源LLM如GPT5-mini;2)开源LLM如DeepSeek-V3.2;3)医学LLM如HuatuoGPT-o1。通过广泛评估,我们旨在更好地理解LLM在医学领域的性能,并支持其在医疗中的有效部署。

英文摘要

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

2606.03156 2026-06-03 cs.CL

A cross-domain tropical species dataset with Chinese vernacular names and CITES source links

一个包含中文俗名和CITES来源链接的跨领域热带物种数据集

Jeff Wang

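AI summary: Builds a cross-domain dataset covering 410,499 active tropical species that joins taxonomic identifiers from multiple sources and adds a cross-domain ontology, a Chinese vernacular layer (99.50% coverage), and a CITES source-linkage layer, and reports the results of a preliminary internal review.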

详情
Comments
25 pages, 4 figures, 4 tables. Dataset descriptor for the Tropical Species Encyclopedia. Companion to the methodology paper arXiv:2606.00994. Dataset deposited at Zenodo (doi:10.5281/zenodo.20377811); canonical preprint-of-record at Zenodo (doi:10.5281/zenodo.20424981)
AI中文摘要

我们描述了一个版本化的跨领域数据集,包含410,499个活跃热带物种(工作快照2026-04-20),涵盖三个应用子领域——热带植物、热带水生和热带宠物——这些领域共享商业和监管生命周期,但分布在按界组织的生物多样性基础设施中。该资源整合了来自GBIF、世界在线植物志、iNaturalist、NCBI分类学、生命目录和生命百科全书的分类标识符,并添加了三个原始层:一个跨领域本体,根据贸易和饲养背景重新划分分类群;一个中文俗名层,在排除未经验证的机器生成建议的类型学下,提供明确的每个名称来源;以及一个CITES来源链接层,将每个分类群连接到其Species+条目。中文俗名覆盖率——即带有与科学双名法不同的中文名称的分类群比例——达到99.50%(410,499个中的408,456个;全种群计数)。覆盖率表征完整性,而非名称翻译准确性;后者由四级来源类型学界定,并且是此处报告的初步内部审查的主题,其中盲法外部审计被确定为主要的未决事项。上游内容仅通过稳定标识符引用原始贡献层,支持CC-BY 4.0重用。该数据集存放在Zenodo上(https://doi.org/10.5281/zenodo.20377811)。本预印本是数据集当前状态的规范v1.0描述;未来的数据描述符提交是可预期的,但取决于“局限性”中列出的验证和发布工程事项。

英文摘要

We describe a versioned cross-domain dataset of 410,499 active tropical species (working snapshot 2026-04-20) spanning three applied subdomains -- tropical_plants, tropical_aquatic, and tropical_pets -- that share a commercial and regulatory life cycle but are distributed across kingdom-organised biodiversity infrastructures. The resource joins taxonomic identifiers from GBIF, Plants of the World Online, iNaturalist, NCBI Taxonomy, the Catalogue of Life and the Encyclopedia of Life, and adds three original layers: a cross-domain ontology that re-segments taxa along trade and husbandry contexts; a Chinese vernacular layer with explicit per-name provenance under a typology that excludes unverified machine-generated proposals; and a CITES source-linkage layer connecting each taxon to its Species+ entry. Chinese vernacular coverage -- the proportion of taxa carrying a CJK Chinese name distinct from the scientific binomial -- reaches 99.50 percent (408,456 of 410,499; full-population count). Coverage characterises completeness, not name-translation accuracy; the latter is bounded by the four-level provenance typology and is the subject of a preliminary internal review reported here, with a blind external audit identified as the principal open item. Upstream content is referenced by stable identifier only for the original-contribution layers, supporting CC-BY 4.0 reuse. The dataset is deposited on Zenodo (10.5281/zenodo.20377811). This preprint is the canonical v1.0 description of the dataset's current state; future Data Descriptor submission is anticipated but is contingent on the validation and release-engineering items listed in the Limitations.
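The coverage figure can be checked directly from the two counts given in the abstract. The short sketch below recomputes it; the function name is hypothetical, and the counts are the ones stated above.

```python
def coverage_percent(named: int, total: int) -> float:
    """Share of taxa carrying a Chinese vernacular name, in percent."""
    return 100.0 * named / total

named, total = 408_456, 410_499
pct = coverage_percent(named, total)
missing = total - named
print(f"{pct:.2f}% covered, {missing} taxa without a Chinese name")
```

The same arithmetic shows that the remaining gap is 2,043 taxa, which is a more useful number than the percentage when planning how much naming work is left.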

2606.03148 2026-06-03 cs.CV

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

$A^2$: 较小的自监督ViT比更大的ViT定位更优

Sreehari Rammohan, Huy Ha, Carl Vondrick

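AI summary: To resolve the tension between foreground localization and rich representations in visual classification, proposes $A^2$, which decouples localization by a small model from embedding by a large model and achieves competitive performance with pretrained features and no additional training.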

详情
AI中文摘要

鲁棒的视觉分类通常依赖于定位图像中的主要前景对象,同时忽略上下文干扰。令人惊讶的是,我们发现较小的自监督ViT的注意力图比更大的ViT能更好地定位前景对象。然而,我们仍然需要大型ViT,因为它们从每个补丁中提取更丰富的表示。为了兼顾良好的定位和丰富的表示,我们提出了$A^2$,一种简单的方法,通过将看哪里(小注意力模型)与提取什么(大嵌入模型)解耦,利用这种逆缩放发现:我们围绕小模型的注意力峰值裁剪图像,并用大模型嵌入这些裁剪块。$A^2$完全使用预训练特征,不需要组标签,也不需要针对每个数据集进行注意力或骨干网络训练。在5个基准测试中,$A^2$与基于骨干匹配的损失级方法(如DFR)具有竞争力,并且在更强的分布偏移下优于端到端注意力训练。

英文摘要

Robust visual classification often depends on localizing the main foreground objects in an image while ignoring contextual distractors. Surprisingly, we find that the attention maps of smaller self-supervised ViTs localize foreground objects better than those of larger ViTs. However, we still need large ViTs, because they extract richer representations from each patch. To get the best of both worlds, good localization and rich representations, we propose $A^2$, a simple method that leverages this inverse scaling finding by decoupling where to look (a small attention model) from what to extract (a large embedding model): we crop around the attention peaks of a small model and embed the crops with a larger model. $A^2$ uses entirely pretrained features, requires no group labels, and does not require per-dataset attention or backbone training. Across 5 benchmarks, $A^2$ is competitive with backbone-matched loss-level methods like DFR, and outperforms end-to-end attention training under stronger distribution shifts.
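The "crop around the attention peak" step can be sketched without any model code. The example below assumes a single peak, a square crop of fixed size, and a patch-grid attention map given as a nested list; the abstract mentions peaks in the plural and gives no crop sizes, so these choices and the function name are illustrative assumptions only.

```python
def peak_crop_box(attn, image_size, crop_size):
    """Pixel box (left, top, right, bottom) centred on the strongest patch.

    attn: 2D list (rows x cols) of attention weights over the patch grid.
    image_size: (width, height) in pixels; crop_size must fit inside it.
    """
    rows, cols = len(attn), len(attn[0])
    r, c = max(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda rc: attn[rc[0]][rc[1]],
    )
    width, height = image_size
    # Centre of the peak patch in pixel coordinates.
    cx = (c + 0.5) * width / cols
    cy = (r + 0.5) * height / rows
    # Clamp so the crop stays inside the image.
    left = min(max(int(cx - crop_size / 2), 0), width - crop_size)
    top = min(max(int(cy - crop_size / 2), 0), height - crop_size)
    return left, top, left + crop_size, top + crop_size

# 4x4 patch grid over a 224x224 image; the peak is in the top-right corner.
attn = [
    [0.01, 0.02, 0.05, 0.60],
    [0.01, 0.02, 0.04, 0.10],
    [0.01, 0.01, 0.02, 0.05],
    [0.01, 0.01, 0.02, 0.02],
]
box = peak_crop_box(attn, image_size=(224, 224), crop_size=112)
print(box)  # (112, 0, 224, 112)
```

In the decoupled setup the abstract describes, a box like this would come from the small model's attention, and the cropped pixels would then be passed to the larger model for embedding.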

2606.03144 2026-06-03 cs.AI

GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

GTBench:一个基于课程体系的基准,用于评估大语言模型作为图论数学研究助手的能力

Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta

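AI summary: The paper introduces the GTBench benchmark, which evaluates the mathematical reasoning of large language models on three groups of graph-theory problems of increasing difficulty (undergraduate definitions, algorithmic reasoning, graduate-level proofs). GPT-5 performs best, the other models degrade markedly as difficulty rises, and human evaluators disagree systematically with the automated judge.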

Comments
19 pages, 5 figures, 7 tables
AI Chinese abstract

大型语言模型(LLM)越来越多地被用作技术学科的自学助手,但其作为数学推理助手的可靠性仍知之甚少。我们引入了GTBench,这是一个基于课程体系的基准,用于评估LLM作为图论数学研究助手的能力,包含63个问题,分为三个难度递增的组:本科定义和基本性质(第1组)、算法跟踪和结构推理(第2组)以及研究生级别的证明构建(第3组)。问题来源于经过验证的学术材料,包括Diestel的《图论》。我们评估了五个前沿模型——GPT-5、Claude Sonnet 4.6、Gemini 2.5 Flash-Lite、Llama 3.3 70B和Mistral Large 3——在零样本和思维链提示下,对第1组和第2组使用精确匹配和LLM作为评判者的评估,对第3组使用混合人类专家和LLM作为评判者的协议。我们的结果揭示了显著的性能层次:GPT-5在第1组接近上限(零样本95.8%),并在研究生证明上保持有意义的准确性(82%),而所有其他模型随着难度增加性能大幅下降,其中Llama在第3组零样本下的人类评估中达到0%。失败模式分析表明,正确的算法但错误的执行错误在第1组和第2组中占主导地位,而第3组还出现了不完整的推理失败,并揭示了人类评估者与自动评判者之间的系统性分歧,特别是在冗长或接近完整的证明上(人类对之间的kappa = 0.48-0.83)。GTBench为LLM中的图论推理提供了第一个基于课程体系的评估框架,对数学教育和科学研究中AI工具的治理具有直接影响。

English abstract

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure-mode analysis shows that "correct algorithm, wrong execution" errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete-reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
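
The pairwise agreement figures quoted above can be illustrated with a short sketch. It assumes the statistic is Cohen's kappa computed per rater pair over categorical grades, which the abstract does not state explicitly. The function name and the example grades are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must label the same non-empty item list")

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical grades (1 = proof accepted, 0 = rejected) from two human raters.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])  # 0.5
```

In this toy case the raters agree on 3 of 4 items (0.75) against a chance level of 0.5, giving kappa = 0.5, which falls within the 0.48-0.83 range reported across human pairs.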

2606.03143 2026-06-03 cs.LG cs.CL

FederatedSkill: Federated Learning for Agentic Skill Evolution

FederatedSkill: 面向智能体技能演化的联邦学习

Jingbo Yang, Guanyu Yao, Yang Zhang, Ramana Rao Kompella, Gaowen Liu, Shiyu Chang

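AI summary: The paper proposes the FederatedSkill framework, which uses semantic skill diffs as the unit of communication to achieve personalized skill evolution while preserving privacy. Compared with self-evolving baselines, it reaches up to a 44.4% higher success rate and a 37.5% lower computational cost.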

AI Chinese abstract

现代LLM智能体越来越依赖技能库来处理复杂任务,使得技能演化成为自我改进的主要驱动力。然而,孤立的单用户任务流缺乏构建全面技能所需的多样性。虽然跨用户协作可以克服这一数据瓶颈,但当前的轨迹共享方法会损害用户隐私,并强加一个统一的全局库,无法适应客户端的异质性。我们引入了FederatedSkill,一个用于协作智能体演化的隐私保护框架。FederatedSkill超越了原始轨迹共享,利用语义技能差异(即对本地库的结构化补丁)作为通信的基本单位。在服务器端,一个演化智能体聚合这些补丁,动态建模客户端特定的能力边界,促进严格个性化的技能演化,而不是次优的全局平均。在20个不同的智能体任务族上评估,FederatedSkill相比自演化基线表现出显著提升,成功率最高提高44.4%,计算成本降低37.5%。

English abstract

Modern LLM agents increasingly rely on skill libraries to handle complex tasks, making skill evolution a primary driver of self-improvement. However, isolated single-user task streams lack the diversity required to build comprehensive skills. While cross-user collaboration can overcome this data bottleneck, current trajectory-sharing approaches compromise user privacy and impose a uniform global library that fails to accommodate client heterogeneity. We introduce FederatedSkill, a privacy-preserving framework for collaborative agent evolution. Moving beyond raw trajectory sharing, FederatedSkill utilizes semantic skill diffs, structured patches over local libraries, as the fundamental unit of communication. On the server side, an evolution agent aggregates these patches to dynamically model client-specific capability boundaries, facilitating strictly personalized skill evolution rather than a suboptimal global average. Evaluated across 20 distinct agent task families, FederatedSkill demonstrates substantial gains over self-evolving baselines, achieving up to a 44.4% increase in success rate and a 37.5% reduction in computational cost.