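arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.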

2605.15672 2026-05-18 cs.CV cs.AI

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

AI summary: Studies how VLMs perform on visual path-following tasks and finds that they tend to switch paths when faced with locally similar distractors, tracing the failures to local competition.

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

2605.15669 2026-05-18 cs.LG

Rule2DRC: Benchmarking LLM Agents for DRC Script Synthesis with Execution-Guided Test Generation

Jinuk Kim, Junsoo Byun, Donghwi Hwang, Seong-Jin Park, Hyun Oh Song

AI summary: Rule2DRC is a large-scale benchmark for evaluating DRC script-generation agents, comprising 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. It provides an evaluation pipeline that measures functional correctness through DRC execution outcomes, and introduces SplitTester, which generates discriminative test cases to improve Best-of-N selection.

Comments ICML 2026

Abstract

Manufacturable chip layouts must satisfy thousands of geometry-based design rules, and design rule checking (DRC) enforces them by running executable DRC scripts on layouts. Translating natural language rules into correct DRC scripts is labor-intensive and requires specialized expertise, motivating LLM agents for DRC script synthesis and debugging. However, existing benchmarks have small evaluation sets and often evaluate scripts by code similarity rather than execution correctness. Prior machine learning-based methods either ignore execution feedback or require labeled test layouts as the agent's input. To this end, we introduce Rule2DRC, a large-scale benchmark for DRC script coding agents with 1,000 rule-to-script tasks and 13,921 evaluation chip layouts for execution-based scoring. Rule2DRC provides an evaluation pipeline that measures functional correctness via DRC execution outcomes without requiring evaluation layouts as input to the agent. We also propose SplitTester, a tester agent for program selection that uses execution feedback to generate discriminative test cases and separate previously indistinguishable candidate scripts, substantially improving Best-of-N selection performance in this domain. We release the code at https://github.com/snu-mllab/Rule2DRC.
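
The idea of a discriminative test case can be illustrated with a minimal Python sketch. This is not SplitTester itself: candidate DRC scripts are modeled as plain Python callables, layouts as single numbers, and the names `partition_by_behavior` and `is_discriminative` are hypothetical. It only shows why a new test that splits previously indistinguishable candidates carries information for Best-of-N selection.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Sequence

# Stand-in for "run one candidate DRC script on one layout and return its verdict".
Candidate = Callable[[float], Hashable]


def partition_by_behavior(candidates: Sequence[Candidate],
                          tests: Sequence[float]) -> List[List[int]]:
    """Group candidate indices whose outputs agree on every test."""
    groups = defaultdict(list)
    for idx, candidate in enumerate(candidates):
        signature = tuple(candidate(t) for t in tests)
        groups[signature].append(idx)
    return list(groups.values())


def is_discriminative(candidates: Sequence[Candidate],
                      group: Sequence[int], new_test: float) -> bool:
    """A test is discriminative for a group if its members disagree on it."""
    return len({candidates[i](new_test) for i in group}) > 1


# Toy "minimum width" rules: each candidate flags a wire width as a violation.
candidates = [lambda w: w < 3, lambda w: w <= 3, lambda w: w < 5]
before = partition_by_behavior(candidates, [1, 10])    # all three look identical
after = partition_by_behavior(candidates, [1, 10, 3])  # width 3 separates candidate 0
```

In this toy setup the tests `1` and `10` cannot tell the three candidates apart, while the boundary case `3` can; a second boundary case such as `4` would be needed to separate the remaining two.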

2605.15666 2026-05-18 cs.CV

ChronoEarth-492K: A Large Scale and Long Horizon Spatiotemporal Hyperspectral Earth Observation Dataset and Benchmark

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

AI summary: Introduces the ChronoEarth-492K dataset, built from hyperspectral data of NASA's EO-1 Hyperion mission, which provides large-scale, temporally calibrated spatiotemporal hyperspectral data supporting short- and long-horizon analysis, and establishes a unified evaluation platform to advance hyperspectral spatiotemporal representation learning.

Abstract

Hyperspectral imaging (HSI) provides dense spectral information for the Earth's surface, enabling material-level understanding of land cover and ecosystem dynamics. Despite recent progress in hyperspectral self-supervised learning (SSL), existing datasets remain temporally shallow, limiting the development of long-horizon spatiotemporal modeling. To address this gap, we introduce ChronoEarth-492K, the first large-scale, temporally calibrated hyperspectral SSL dataset built upon NASA's EO-1 Hyperion mission, the world's longest continuous hyperspectral archive to date (2001-2017). ChronoEarth-492K comprises 492,354 radiometrically harmonized patches across 185,398 global locations over 17 years, with 28,786 sites containing multi-temporal sequences ($\geq 3$ observations) that enable both short- and long-horizon temporal analysis. Building on this foundation, we establish the ChronoEarth-Benchmark, a unified evaluation suite spanning static, short-horizon, and long-horizon temporal tasks, constructed from six open-source geospatial products covering land cover, crop type, forest dynamics, and soil properties. We further introduce a standardized evaluation protocol and report extensive baseline results across state-of-the-art hyperspectral foundation models. Together, ChronoEarth-492K and the ChronoEarth-Benchmark provide the first large-scale, temporally grounded platform for systematic spatiotemporal hyperspectral representation learning.

2605.15665 2026-05-18 cs.AI

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

Keshava Chaitanya, Jahnavi Gundakaram

AI summary: PRISM treats prompt engineering as a reliability engineering problem, using continuous simulation and monitoring to improve the reliability of enterprise conversational AI, cut prompt development time, and repair regressions in production.

Comments 12 pages, 1 figure, 5 tables. arXiv preprint

Abstract

Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
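
The closed loop described in the abstract (test, find failures, repair, repeat until everything passes) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PRISM implementation: the simulation, LLM-as-judge, and repair steps are replaced by caller-supplied callables, the toy stand-ins below are hypothetical, and the `max_rounds` cap is an addition for safety that the abstract does not mention.

```python
from typing import Callable, List, Sequence, Tuple


def repair_until_passing(prompt: str,
                         tests: Sequence[str],
                         passes: Callable[[str, str], bool],
                         repair: Callable[[str, List[str]], str],
                         max_rounds: int = 5) -> Tuple[str, int]:
    """Re-run all tests after every repair; stop when none fail."""
    rounds = 0
    while True:
        failures = [t for t in tests if not passes(prompt, t)]
        if not failures:
            return prompt, rounds
        if rounds == max_rounds:
            raise RuntimeError(
                f"{len(failures)} test(s) still failing after {max_rounds} repair rounds")
        prompt = repair(prompt, failures)
        rounds += 1


# Toy stand-ins: a "test" is a phrase the prompt must contain, and a
# "repair" appends one instruction per failing phrase.
def toy_passes(prompt: str, required_phrase: str) -> bool:
    return required_phrase in prompt


def toy_repair(prompt: str, failures: List[str]) -> str:
    return prompt + " " + " ".join(f"Always mention {p}." for p in failures)


final_prompt, rounds_used = repair_until_passing(
    "Be polite.", ["refunds", "escalation"], toy_passes, toy_repair)
```

Running the same loop on a schedule, as the abstract describes, would amount to calling `repair_until_passing` again with the deployed prompt and the current model behind `passes`.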

2605.15663 2026-05-18 cs.LG

On the Power of Adaptivity for $\varepsilon$-Best Arm Identification in Linear Bandits

Arnab Maiti, Yunbei Xu, Kevin Jamieson

AI summary: Studies the minimax sample complexity of ε-best arm identification in linear bandits, proposing a non-adaptive fixed-design method as well as adaptive sampling strategies, and shows how the benefit of adaptivity varies across action sets.

Comments Accepted at COLT 2026

Abstract

We study the minimax sample complexity of $\varepsilon$-best arm identification in linear bandits. Given a compact action set $\mathcal{X}$ that spans $\mathbb{R}^d$ and an unknown reward vector $\theta\in\mathbb{R}^d$, the goal is to output an arm $\widehat{x}\in\mathcal{X}$ such that $\langle \widehat{x},\theta\rangle \ge \max_{x\in\mathcal{X}} \langle x,\theta\rangle - \varepsilon$ with probability at least $1-\delta$, using as few samples as possible. First, we present a non-adaptive fixed-design method with sample complexity $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$, where $w(\mathcal{X})$ is a Gaussian width term dependent on $\mathcal{X}$, and we prove a matching lower bound $\Omega\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}+\frac{w(\mathcal{X})^2}{\varepsilon^2}\right)$ for all non-adaptive fixed-design methods. We then turn to adaptive sampling. We raise an important structural question: beyond the canonical basis, are there structured action sets for which adaptivity yields only logarithmic-factor improvements over the optimal non-adaptive rate? We answer in the affirmative for several natural action sets, namely the hypercube, the $\ell_2$ ball, $m$-sets, and multi-task multi-armed bandits. Finally, we provide the first construction of an action set $\mathcal{X}$ for which adaptivity yields a polynomial-factor improvement over every non-adaptive algorithm. A key ingredient behind this separation is an $\ell_2$-norm estimation subroutine: we design an adaptive algorithm that uses $\mathcal{O}\!\left(\frac{d\log(1/\delta)}{\varepsilon^2}\right)$ samples from the unit $\ell_2$ ball in $\mathbb{R}^d$ and outputs an estimate $\widehat{r}$ satisfying $|\widehat{r}-\|\theta\|_2|\le \varepsilon$ with probability at least $1-\delta$, where $\theta$ is the unknown reward vector.
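
To make "non-adaptive fixed design" concrete, here is a minimal Python sketch of the generic template: the pull counts are chosen before any reward is observed, $\theta$ is estimated by ordinary least squares, and the arm maximizing the estimated reward is returned. It does not reproduce the paper's design or its Gaussian-width rate; the function names, the toy arms, and the particular design are assumptions for illustration, and the design is assumed to span $\mathbb{R}^d$ so the normal equations are solvable.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fixed_design_best_arm(arms: Sequence[Vector],
                          design: Sequence[Tuple[int, int]],
                          pull: Callable[[Vector], float]) -> int:
    """design = [(arm_index, n_pulls), ...], fixed before any reward is seen."""
    d = len(arms[0])
    A = [[0.0] * d for _ in range(d)]  # sum of x x^T over all pulls
    b = [0.0] * d                      # sum of reward * x over all pulls
    for idx, n_pulls in design:
        x = arms[idx]
        for _ in range(n_pulls):
            r = pull(x)
            for i in range(d):
                b[i] += r * x[i]
                for j in range(d):
                    A[i][j] += x[i] * x[j]
    theta_hat = solve(A, b)            # ordinary least-squares estimate
    return max(range(len(arms)), key=lambda k: dot(arms[k], theta_hat))


arms = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
theta = (1.0, 0.2)  # hidden from the learner; used only to simulate rewards
rng = random.Random(0)
chosen = fixed_design_best_arm(
    arms, [(0, 200), (1, 200)], lambda x: dot(x, theta) + rng.gauss(0.0, 1.0))
```

An adaptive method would instead choose each pull after looking at earlier rewards; the abstract's question is how much that extra freedom can reduce the number of pulls.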

2605.15661 2026-05-18 cs.CV cs.AI

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

AI summary: VAGS adapts the guidance scale to improve structural fidelity and generation quality in image editing and generation, without fine-tuning or extra forward passes.

Abstract

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.
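
The abstract gives the shape of the rule (nominal scale times a bounded factor built from a time term and a cosine similarity) but not its formula. The Python sketch below fills in one possible form purely for illustration: the linear combination, the gain parameters, the clipping bounds, and the time convention are all assumptions, not the published VAGS rule. It does show the stated property that fixed CFG falls out as a special case, here when both gains are zero.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two velocity vectors (0.0 if either is zero)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def adaptive_scale(nominal: float, t: float,
                   v_a: Sequence[float], v_b: Sequence[float],
                   time_gain: float = 0.0, align_gain: float = 0.0,
                   lo: float = 0.5, hi: float = 1.5) -> float:
    """Scale the nominal guidance by a factor clipped to [lo, hi].

    Assumed convention: t runs from 0 (pure noise) to 1 (clean image).
    The additive form of the factor is a placeholder, not the paper's rule.
    """
    factor = 1.0 + time_gain * (t - 0.5) + align_gain * cosine_similarity(v_a, v_b)
    factor = min(hi, max(lo, factor))
    return nominal * factor


fixed_cfg = adaptive_scale(7.5, 0.3, [1.0, 0.0], [0.0, 1.0])  # gains are zero
boosted = adaptive_scale(7.5, 0.5, [1.0, 0.0], [1.0, 0.0], align_gain=10.0)
```

Because the factor is clipped, even a large gain cannot push the effective scale outside `[lo * nominal, hi * nominal]`, which is one simple way to get the boundedness the abstract mentions.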

2605.15660 2026-05-18 cs.CV

MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer

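Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen, Jie Guo, Tong-Yee Lee, Xiu Li

AI summary: MaTe performs material transfer through a multimodal attention mechanism, with no text guidance or auxiliary networks, improving generation quality and efficiency.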

2605.15654 2026-05-18 cs.RO

PCASim: Promptable Closed-loop Adversarial Simulation for Urban Traffic Environment

Chuancheng Zhang, Zhenhao Wang, Kaizheng Li, Yaran Lin, Qiang Guo, Bin Jiang

AI summary: Proposes the PCASim framework, which combines adversarial scenario generation with safety-agent training to improve safety and robustness in urban traffic environments; experiments show gains in domain-specific language generation accuracy, scenario transformation success rate, and obstacle avoidance.

Abstract

Real-world autonomous driving, particularly in urban environments with numerous corner cases, requires rigorous testing to ensure product safety and robustness. However, few studies have explored integrating adversarial scenario generation with the training of safety agents in closed-loop testing, enabling efficient co-evolution and mutual enhancement of both. To address this challenge, an adversarial behavior knowledge repository is constructed by applying rule-based filtering to an open-source dataset, combined with knowledge retrieval modules tailored for simulation environments. A large language model (LLM) is employed to integrate knowledge-, data-, and adversarial-driven approaches, generating safety-critical traffic scenarios customized to user needs. Additionally, while evaluating the generated scenarios, we employ reinforcement learning models to train the behaviors of different types of vehicles, thereby enriching scenario diversity beyond existing datasets while preserving realism. Experimental results demonstrate that the proposed framework improves the accuracy of domain-specific language generation by 12%. Moreover, the success rate of newly generated scenario transformations increases by 8%, while obstacle-avoidance capability is enhanced by 30%. For the complete manuscript, please refer to: https://zhenhaooo.github.io/PCASim.github.io/

2605.15651 2026-05-18 cs.LG cs.AI cs.GT

Sharp Spectral Thresholds for Logit Fixed Points

Tongxi Wang

AI summary: Examines the stability of logit feedback systems, deriving a new Euclidean threshold condition that extends stability guarantees and identifies the phase-transition point.

Abstract

Softmax feedback systems are a common mathematical core of entropy-regularized reinforcement learning, logit game dynamics, population choice, and mean-field variational updates. Their central stability question is simple: when does a self-reinforcing softmax system produce a unique and globally predictable outcome? Classical theory gives a conservative answer. By treating softmax as a unit-scale response, it certifies stability only in a strongly randomized regime. We prove that the classical approach misses an entire stable regime and does not identify the point at which the qualitative change truly occurs. For finite-dimensional affine logit systems, the sharp dimension-free Euclidean threshold is $$\beta\|\Pi W\Pi\|_{\mathcal T\to\mathcal T}<2,$$ rather than the previously used condition, which certifies stability only while the softmax system remains safely over-regularized. Our theorem fills the previously missing pre-bifurcation regime, extending stability guarantees for affine softmax feedback systems to reward-responsive yet globally predictable systems. It enlarges the certified stability boundary for these systems and identifies where the model genuinely undergoes a phase transition.
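
A numerical check of the stated threshold can be sketched in plain Python. The abstract does not define the symbols, so the sketch assumes that $\Pi$ is the orthogonal projector onto sum-zero vectors (the tangent space of the probability simplex), that $W$ is a square interaction matrix, and that the norm is the Euclidean operator norm; all function names are hypothetical. The largest singular value is approximated by power iteration, which is a heuristic and can fail if the start vector happens to be orthogonal to the top singular direction.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def sum_zero_projector(n: int) -> Matrix:
    """I - (1/n) 1 1^T, the projector onto vectors whose entries sum to zero."""
    return [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]


def spectral_norm(M: Matrix, iters: int = 200) -> float:
    """Approximate the largest singular value via power iteration on M^T M."""
    G = matmul(transpose(M), M)
    n = len(G)
    v = [float(i + 1) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Gv = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(max(0.0, sum(a * b for a, b in zip(v, Gv))))


def below_threshold(beta: float, W: Matrix) -> bool:
    """Check beta * ||Pi W Pi|| < 2 under the assumptions stated above."""
    P = sum_zero_projector(len(W))
    return beta * spectral_norm(matmul(matmul(P, W), P)) < 2.0


identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
stable = below_threshold(1.5, identity3)
unstable = below_threshold(2.5, identity3)
```

With $W$ equal to the identity, $\Pi W \Pi = \Pi$ has norm 1 under these assumptions, so the check reduces to $\beta < 2$.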

2605.15650 2026-05-18 cs.RO

MyoChallenge 2025: A New Benchmark for Human Athletic Intelligence

Cheryl Wang, Chun Kwang Tan, Balint K. Hodossy, Eric Lyu, Jun Guo, Wentao Zhao, Huaping Liu, Chengkun Li, Merkourios Simos, Bianca Ziliotto, Alexander Mathis, Siyuan Liu, Jiahao Chen, Shanlin Zhong, Bo Jiang, Ci Song, Yaoye Zhu, Chenhui Zuo, Yanan Sui, Mohamed Irfan Refai, Massimo Sartori, Guillaume Durandau, Vikash Kumar, Vittorio Caggiano

AI summary: Presents the MyoChallenge 2025 benchmark, which combines high-fidelity musculoskeletal models with machine learning algorithms to advance research on motor control intelligence; it comprises two tasks, table tennis and a soccer penalty kick, and fosters interdisciplinary research.

Abstract

Athletic performance represents the pinnacle of human motor intelligence, demanding rapid choices, precise control, agility, and coordinated physical execution. Replicating this seamless combination of capabilities remains elusive in current artificial intelligence and robotic systems. Concurrently, understanding the biological mastery of these movements is hindered because complex muscle coordination is rarely measured in vivo due to the limitations of physical equipment. To bridge this fundamental gap in understanding, MyoChallenge at NeurIPS 2025 established a pioneering benchmark for motor control intelligence in sports, leveraging high-fidelity musculoskeletal models within physics simulation combined with machine learning-driven algorithms. The competition introduces two distinct tracks emphasizing either upper- or lower-limb control: a table tennis rally task using a biomechanical upper-limb model comprising an arm with a hand and a trunk; and a soccer penalty kick using a biomechanical model of legs and a trunk. Marking the fourth iteration of the MyoChallenge series, this event attracted almost 70 teams and over 560 submissions globally, uniting a diverse community ranging from physicians and neuroscientists to machine learning experts. The competition facilitated the development of several state-of-the-art control algorithms for a musculoskeletal system capable of sports agility, leveraging techniques such as physics-based motion planners, on-policy behaviour cloning, hierarchical planning, and muscle synergies. By integrating standardized tasks and physiologically realistic models into the open-source framework of MyoSuite, MyoChallenge'25 serves as a reproducible and reusable testbed to accelerate interdisciplinary research across machine learning, biomechanics, sports science, and neuroscience. Project page: https://www.myosuite.org//myochallenge/myochallenge-2025.

2605.15649 2026-05-18 cs.LG cs.NE

Towards Code-Oriented LM Embeddings for Surrogate-Assisted Neural Architecture Search

Pranav Somu, Advay Balakrishnan, Stepan Kravtsov, Aaron McDaniel, Jason Zutty

AI summary: Proposes a low-cost code-oriented language model embedding strategy that exploits the inductive bias of language models to obtain effective architecture feature extractors without fine-tuning; experiments on the NAS-Bench-201 and einspace search spaces show it outperforms other encodings.

Comments This is an extended version of work accepted to GECCO 2026. Our code is available at https://github.com/pcsom/cole/tree/v1.0

DOI: 10.1145/3795101.3805435

Abstract

Developing effective surrogates (performance predictors) for Neural Architecture Search (NAS) typically requires expensive fine-tuning or the engineering of complex representations. We propose a low-cost embedding strategy that leverages the inductive bias of Language Models (LMs) to eliminate these overheads. By representing architectures as PyTorch class definition text, we demonstrate that off-the-shelf LMs act as competitive feature extractors without NAS-specialized fine-tuning. The final predictor is constructed by passing the extracted Code-Oriented LM Embeddings (COLE) through a lightweight regression head. We also investigate strategies to improve embedding quality and utilization. Our experiments on the NAS-Bench-201 and einspace search spaces reveal that raw code inputs yield higher predictive performance than other text-based encodings (e.g., ONNX-to-text encodings) when using frozen LMs. We also observe COLE drives superior surrogate-assisted search using the BANANAS algorithm in NAS-Bench-201. When optimizing for CIFAR-100 performance, replacing structural path encodings with COLE for architecture representation allows for a 34% decrease in the evaluation budget required to reach within 1% of the fittest architecture in the search space (by test accuracy). As any neural architecture can be represented as code, these findings establish COLE as a versatile and efficient foundation for advancing NAS.

2605.15647 2026-05-18 cs.LG cs.NE

Perforated Neural Networks for Keyword Spotting

Vishy Gopal, Aris Ilias Goutis, Ralph Crewe, Erin Yanacek, Rorry Brenner

AI summary: Applies Perforated Backpropagation to keyword spotting on the Edge Impulse platform; by adding artificial Dendrite Nodes to a standard convolutional neural network, it shows that dendritic models beat traditional architectures in both parameter count and accuracy, improving model quality and deployment efficiency together.

Comments 9 pages, 1 figure, 800-trial hyperparameter sweep; Best Model award, Edge Impulse 2025 Hackathon

Abstract

Edge machine learning presents a unique set of constraints not encountered in cloud-scale model deployment: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be satisfied simultaneously. Existing compression and optimization techniques can trade one resource for another, but rarely improve both accuracy and model size at the same time. This paper presents the application of Perforated Backpropagation to keyword spotting on the Edge Impulse platform, an experiment that won the Best Model award at the Edge Impulse 2025 Hackathon in December 2025. By adding artificial Dendrite Nodes to a standard convolutional neural network trained on the Edge Impulse keyword spotting tutorial pipeline, we demonstrate that dendritic models outperform traditional architectures at every level of parameter count and at every accuracy threshold tested across 800 hyperparameter trials. The best dendritic model achieved a test accuracy of 0.933 with only 1,500 parameters, versus the baseline accuracy of 0.921 requiring approximately 4,000 parameters. These results suggest that Perforated Backpropagation is a powerful addition to the edge AI engineer's toolkit, offering simultaneous gains in both model quality and deployment efficiency.

2605.15640 2026-05-18 cs.CV

Learning Disentangled Representations for Generalized Multi-view Clustering

Xin Zou, Ruimeng Liu, Chang Tang, Zhenglai Li, Xinwang Liu, Kunlun He, Wanqing Li

AI summary: Proposes the GMAE framework, which preserves multi-view complementarity through disentangled representation learning to improve clustering; experiments show it outperforms existing methods on both complete and incomplete multi-view clustering tasks.

Comments accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)

DOI: 10.1109/TPAMI.2026.3687339

Abstract

Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal clustering results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: https://github.com/obananas/GMAE.

2605.15635 2026-05-18 cs.CL

Evaluating Chinese Ambiguity Understanding in Large Language Models

Junwen Mo, Yuanzhi Lu, Yifang Xue, Ke Xu, Hideki Nakayama

AI summary: Builds CHA-Gen, the first Chinese ambiguity dataset grounded in Potential Ambiguity Theory, evaluates LLMs on ambiguity detection, and reports common failure modes in ambiguity recognition along with semantic uncertainty quantification results.

Abstract

Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen, the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potentially ambiguous structures. Evaluating LLMs (e.g., the Gemma 3 and Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection, although CoT prompting improves it. Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with a semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach to building Chinese ambiguity corpora and insights into LLMs' ambiguity handling, laying a foundation for enhancing Chinese ambiguity research in LLMs.
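
The semantic entropy measurement mentioned above can be illustrated with a simplified Python sketch. It assumes the model's sampled outputs have already been grouped into meaning clusters (the clustering step itself is not shown), and it uses the plain frequency-based entropy over those clusters, which may differ from the exact estimator used in the paper. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over meaning clusters."""
    if not cluster_ids:
        raise ValueError("need at least one sampled output")
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Each label names the interpretation that one sampled output expresses.
unambiguous_samples = ["reading_1"] * 4
ambiguous_samples = ["reading_1", "reading_1", "reading_2", "reading_2"]

low = semantic_entropy(unambiguous_samples)
high = semantic_entropy(ambiguous_samples)
```

Under this measure, a model that always commits to one dominant interpretation of an ambiguous sentence would score near zero, which is consistent with the overconfidence and dominant-interpretation bias the abstract reports.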

2605.15626 2026-05-18 cs.LG

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri

AI summary: IO-SVD builds a KL-aware double-sided whitening space and pairs it with an efficient heterogeneous rank-allocation strategy to balance quality and efficiency in LLM compression; experiments show small performance loss and practical inference speedups.

Abstract

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families, together with inference-time analysis, show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git
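
The global-budget step of heterogeneous rank allocation can be sketched in Python as follows. This is a simplified illustration, not the IO-SVD procedure: the sensitivity scores are taken as given (the paper derives them from first-order calibration loss estimates), every component is assumed to cost one unit of budget regardless of layer shape, and the function name and toy scores are hypothetical.

```python
from typing import Dict, List


def allocate_ranks(scores: Dict[str, List[float]], budget: int) -> Dict[str, int]:
    """Keep the `budget` highest-scoring components across all layers.

    Returns the number of components retained per layer, so layers with
    more sensitive components end up with higher rank.
    """
    ranked = sorted(
        ((score, layer) for layer, values in scores.items() for score in values),
        key=lambda item: item[0],
        reverse=True,
    )
    kept = {layer: 0 for layer in scores}
    for _, layer in ranked[:budget]:
        kept[layer] += 1
    return kept


# Toy sensitivity scores for the singular components of two layers.
toy_scores = {"attn": [0.9, 0.5, 0.1], "mlp": [0.8, 0.05, 0.01]}
ranks = allocate_ranks(toy_scores, budget=3)
```

A homogeneous scheme would give both layers the same rank; ranking all components in one pool is what lets the budget flow toward the layers whose components matter most.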

2605.15625 2026-05-18 cs.AI cond-mat.soft

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

ColPackAgent：基于代理技能的硬粒子蒙特卡罗工作流程用于胶体堆积

Lijie Ding, Changwoo Do

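AI summary: ColPackAgent runs Monte Carlo simulations of colloidal packing autonomously through an MCP tool server and an agent skill, demonstrates LLM agents executing simulation workflows, and compares how reliably different models follow them.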

AI中文摘要

我们介绍了ColPackAgent，一种代理框架，通过模型上下文协议（MCP）工具服务器和代理技能自主运行胶体堆积的蒙特卡罗模拟，无论是作为独立代理还是现有代理系统的一部分。通过利用MCP服务器和代理技能，ColPackAgent执行胶体堆积模拟的结构化工作流程，这些流程对于研究相变、自组装和材料设计至关重要。在没有专用模拟工具和工作流程指令的情况下，通用大型语言模型（LLM）代理倾向于描述此类工作流程而不是可靠地执行。MCP服务器暴露了一个定制构建的colpack Python包，该包封装了HOOMD-blue硬粒子蒙特卡罗。技能编码了一个四阶段的工作流程合同。ColPackAgent可以与人类反馈互动执行工作流程，从端到端提示自主执行，或作为提供的程序文件的autoresearch。我们通过不同模式展示了系统，包括立方体粒子的3D模拟、二元系统中的盘和胶囊的2D模拟，以及使用autoresearch的2D硬盘冻结转变。我们还比较了不同LLM在该工作流程上的模型性能，使用17个阶段特定的提示。此基准测试提供了对不同模型在设置、规划和分析工作流程中可靠性的阶段级检查。这些结果表明，将领域Python包与MCP工具和便携式代理技能结合，为将模拟工具包转化为代理辅助研究工作流程提供了可行的途径。

英文摘要

We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent-assisted research workflow.
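
For readers unfamiliar with hard-particle Monte Carlo, the sketch below shows the core accept/reject rule for hard disks in a periodic square box. It is plain Python written for illustration and does not use HOOMD-blue or the colpack package; the function names, box size, and step size are assumptions.

```python
import random


def overlaps(p, q, diameter, box):
    """True if two disks overlap under periodic boundaries in a square box."""
    dx = abs(p[0] - q[0])
    dy = abs(p[1] - q[1])
    dx = min(dx, box - dx)  # minimum-image convention
    dy = min(dy, box - dy)
    return dx * dx + dy * dy < diameter * diameter


def mc_sweep(positions, diameter, box, max_step, rng):
    """Attempt one trial move per disk; return the number of accepted moves."""
    accepted = 0
    for i in range(len(positions)):
        x, y = positions[i]
        trial = ((x + rng.uniform(-max_step, max_step)) % box,
                 (y + rng.uniform(-max_step, max_step)) % box)
        # Hard particles: a move is accepted only if it creates no overlap.
        if not any(overlaps(trial, positions[j], diameter, box)
                   for j in range(len(positions)) if j != i):
            positions[i] = trial
            accepted += 1
    return accepted


rng = random.Random(0)
box, diameter = 10.0, 1.0
# Start from a dilute 4x4 square lattice, which is overlap-free.
disks = [(2.5 * i + 1.25, 2.5 * j + 1.25) for i in range(4) for j in range(4)]
for _ in range(100):
    mc_sweep(disks, diameter, box, max_step=0.3, rng=rng)
```

Since every accepted move is overlap-free, the configuration stays valid after any number of sweeps; a production code would add cell lists and step-size tuning.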

2605.15621 2026-05-18 cs.CV

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

LRCP: 低秩压缩性引导的视觉标记修剪用于高效的LVLMs

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

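AI summary: This paper proposes LRCP, which prunes visual tokens guided by low-rank compressibility to cut the inference cost of large vision-language models, preserving 94.7% of image-understanding performance with an 88.9% token reduction.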

Comments The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations

AI中文摘要

大型视觉-语言模型（LVLMs）在多模态理解方面表现出色，但其推理成本随着视觉标记数量的增加而迅速增长，尤其在高分辨率图像和长视频中更为明显。现有基于注意力的方法通过注意力分数估计标记重要性，可能引入位置偏差；而基于表示的方法则通过特征关系或重建误差减少视觉冗余，忽略了视觉标记集的整体结构。本文从低秩压缩性的角度重新审视视觉标记压缩。在多个模型和数据集中，我们发现视觉标记表示表现出显著的低秩结构，存在一个主导子空间，即使随机移除大量标记后仍保持稳定。受此发现启发，我们提出LRCP，一种无需训练的压缩框架，首先通过PCA估计视觉标记的主导低秩子空间，然后通过投影残差对每个标记进行评分，保留那些难以由低秩背景解释的标记。大量实验表明，LRCP在保持94.7%的原始图像理解性能的同时实现88.9%的标记减少，并在保持97.8%的平均视频理解准确性的同时实现87.5%的标记减少。

英文摘要

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
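
The scoring idea in the abstract (estimate a dominant subspace, then rank tokens by how poorly it explains them) can be sketched on toy data. This is a simplified illustration, not the paper's implementation: it uses a rank-1 subspace found by power iteration instead of a full PCA, and the token vectors and function names are assumptions.

```python
import math


def top_direction(rows, iters=100):
    """Leading principal direction of mean-centered rows, via power iteration."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix.
        proj = [sum(r[k] * v[k] for k in range(dim)) for r in rows]
        w = [sum(p * r[k] for p, r in zip(proj, rows)) for k in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_scores(tokens):
    """Score each token by the norm of its residual off a rank-1 PCA subspace."""
    dim = len(tokens[0])
    mean = [sum(t[k] for t in tokens) / len(tokens) for k in range(dim)]
    centered = [[t[k] - mean[k] for k in range(dim)] for t in tokens]
    v = top_direction(centered)
    scores = []
    for r in centered:
        coeff = sum(r[k] * v[k] for k in range(dim))
        scores.append(math.sqrt(sum((r[k] - coeff * v[k]) ** 2 for k in range(dim))))
    return scores


# Four tokens lie along one direction ("background"); the last one does not.
tokens = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0],
          [4.0, 4.0, 0.0], [2.5, 2.5, 3.0]]
scores = residual_scores(tokens)
most_informative = scores.index(max(scores))
```

Tokens with a large residual are the ones a low-rank background cannot reconstruct, so a pruning step would keep them first.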

2605.15619 2026-05-18 cs.RO

Wind-Aware Optimal Trajectory Planning for Efficient Gliding of Fixed-Wing Aerial Systems

面向固定翼飞行系统高效滑翔的风感知最优轨迹规划

Luca Morando, Nishanth Bobbili, Giuseppe Loianno

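AI summary: This paper proposes a nonlinear multi-cost trajectory planner that generates $\mathcal{C}^3$-continuous trajectories from Bernstein polynomials and uses an estimate of air-mass motion to optimize gliding; experiments show reliable stabilization under wind gusts and in the presence of obstacles.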

Comments Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

Journal ref: IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

AI中文摘要

滑翔为小型固定翼无人机提供了更长的续航和静音运行，但需要精确的能量管理，尤其是在风扰和障碍物约束下。传统的基于总能量控制系统的控制器以被动响应的方式调节势能与动能之间的转换，通常需要精细调参并掌握配平（trim）条件。本文将调节移至规划层面，提出一种面向小型无人滑翔机的非线性多代价轨迹规划器。该方法基于伯恩斯坦多项式生成 $\mathcal{C}^3$ 连续轨迹，通过微分平坦性映射为控制指令，并在线重新规划以匹配实验得出的下沉极曲线。优化中集成了模拟的 netto 升降速度表（netto variometer）以估计气团运动，将滑翔约束在能量平衡状态。相邻的滑翔轨迹由巡航段连接，巡航段通过以 Dubins 路径航点初始化的轨迹计算得到，从而实现结合有动力与无动力飞行的混合任务。该方法在 CFD 仿真和固定翼平台的真实实验中得到验证，表明在阵风和存在障碍物的情况下能够可靠地稳定下沉率、空速和滑翔比。

英文摘要

Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional controllers based on Total Energy Control Systems regulate the trade between potential and kinetic energy reactively, often requiring fine-tuning and knowledge of trim conditions. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates $\mathcal{C}^3$-continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed through trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.
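
As a rough illustration of the Bernstein-polynomial parameterization mentioned above, the sketch below evaluates a single curve segment and reads a sink rate off its derivative. The control points, segment duration, and function names are hypothetical; the planner's cost terms, continuity constraints between segments, and wind estimation are not modeled.

```python
from math import comb


def bernstein_point(control_points, t):
    """Evaluate a Bernstein-polynomial (Bezier) curve at t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, cp in enumerate(control_points):
        basis = comb(n, i) * (t ** i) * ((1.0 - t) ** (n - i))
        for k in range(dim):
            point[k] += basis * cp[k]
    return point


def sink_rate(control_points, t, duration):
    """Descent rate (positive down) from the curve's derivative, in m/s."""
    n = len(control_points) - 1
    # The derivative of a degree-n Bernstein curve is a degree-(n-1) curve
    # whose control points are n * (P[i+1] - P[i]).
    deriv_cps = [[n * (b - a) for a, b in zip(p, q)]
                 for p, q in zip(control_points, control_points[1:])]
    dz_dt = bernstein_point(deriv_cps, t)[2] / duration
    return -dz_dt


# Hypothetical glide segment: (x, y, z) control points in metres, z is altitude.
cps = [[0.0, 0.0, 100.0], [50.0, 0.0, 95.0], [100.0, 10.0, 90.0], [150.0, 30.0, 85.0]]
start = bernstein_point(cps, 0.0)
end = bernstein_point(cps, 1.0)
rate = sink_rate(cps, 0.5, duration=30.0)
```

A curve of this kind passes through its first and last control points, and its derivatives are again Bernstein curves, which is what makes derivative constraints such as a target sink rate convenient to express on the control points.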

2605.15618 2026-05-18 cs.CV cs.AI

Latent Video Prediction Learns Better World Models

潜在视频预测学习更好的世界模型

Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar

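AI summary: This paper systematically studies the robustness of latent prediction models as world models and finds that they outperform other video foundation models in feature separability, corruption resistance, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction.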

2605.15615 2026-05-18 cs.CV cs.LG

Neutral-Reference Prompting for Vision-Language Models

视觉-语言模型的中性参考提示

Senmao Tian, Xiang Wei, Shunli Zhang

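AI summary: This paper proposes NeRP, a strategy that uses neutral prompts and reference images to improve discrimination on unseen classes while preserving accuracy on known classes.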

Comments Accepted at ICML 2026

AI中文摘要

视觉-语言模型（VLMs）的有效迁移学习常面临基类-新类权衡（BNT）问题：提升对未见过类别的识别性能往往会降低对已知类别的准确性。现有工作通常简单归因于过拟合已知类别。我们观察到一种有趣现象：VLMs在某些下游数据上表现出不对称混淆，即类别A的样本系统性被误判为类别B，而反向混淆（B到A）很少发生。对于已知类别，这种偏差可通过交叉熵损失调整来缓解，但对未知类别，这种预训练诱导的偏差仍存在并损害泛化能力。受此启发，我们提出NeRP，一种即插即用的提示修正策略，无需修改模型参数即可提升对未知类别的判别能力。NeRP利用中性文本提示和参考图像，测量类别层面的先验偏好，结合样本似然获得模型的代理分数。如果对于给定样本，先验强烈支持当前预测，而观察到的证据明显不足，则在容易混淆的类别对之间执行局部翻转，从而纠正先验主导的误判。在多个backbone和15个少样本及跨领域基准上的广泛实验表明，NeRP显著提高了对未知类别的准确性，同时保持已知类别的预测性能。

英文摘要

Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

2605.15613 2026-05-18 cs.CL

Toward LLMs Beyond English-Centric Development

迈向超越英语中心化发展的语言模型

Sho Takase, Ukyo Honda

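AI summary: The study finds that language models are markedly biased toward English and that continual pretraining is not a low-cost alternative that outperforms training from scratch; it calls for greater investment in multilingual development.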

2605.15611 2026-05-18 cs.AI

TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

TopoEvo: 一种面向拓扑的自演化多智能体框架用于微服务中的根本原因分析

Junle Wang, Xingchuang Liao, Wenjun Wu

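AI summary: To address heterogeneous observability data, failure propagation, and topology drift in microservices, TopoEvo combines multimodal alignment, topology-constrained reasoning, and a self-evolving mechanism to make root cause analysis more robust and accurate.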

Comments 12 pages

AI中文摘要

微服务中的根本原因分析（RCA）面临噪声异质多模态观测数据、级联故障传播放大下游症状以及由自动扩展和滚动更新引起的非平稳拓扑漂移等挑战。最近基于LLM的RCA智能体虽能生成工具导向的解释，但往往缺乏拓扑意识，导致症状放大偏误。本文提出TopoEvo，一种面向拓扑的自演化多智能体框架，结合图表示学习与结构化拓扑约束推理。TopoEvo首先引入度量正交多模态对齐（MOMA），将度量嵌入分解为互补子空间，并通过对比对齐日志和追踪以减少模态冗余和稀疏性，从而获得稳定的节点表示。随后应用向量量化（VQ）将拓扑增强的状态离散化为可审计的症状令牌，利用症状词典实现可靠检索和令牌级证据支撑。在这些离散拓扑提示之上，TopoEvo执行多智能体假设-证据-测试（HET）工作流，明确验证传播一致的解释并区分起因异常与放大下游症状。最后，自演化机制刷新分层事件记忆，并通过高置信度伪标签进行保守测试时适应，以维持在漂移下的鲁棒性。

英文摘要

Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
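
The vector-quantization step can be illustrated with a nearest-codebook lookup. The codebook, the lexicon labels, and the service names below are made up for the example; how TopoEvo learns its codebook and builds its symptom lexicon is not specified in the abstract.

```python
def quantize(state, codebook):
    """Return (index, squared distance) of the codebook vector nearest to `state`."""
    best_index, best_dist = -1, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((s - c) ** 2 for s, c in zip(state, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist


# Hypothetical 2-D codebook and a lexicon naming each code.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
lexicon = ["nominal", "latency_spike", "error_burst"]

# Hypothetical node states for three services.
node_states = {"cart": [0.9, 0.1], "db": [0.1, 0.05], "auth": [0.2, 0.8]}
tokens = {name: lexicon[quantize(vec, codebook)[0]]
          for name, vec in node_states.items()}
print(tokens)
```

Mapping continuous states to a small discrete vocabulary is what makes the result auditable: each service ends up with a named token that can be retrieved and cited as evidence.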

2605.15609 2026-05-18 cs.CL

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

PSD：通过并行推测解码推进扩散大语言模型的帕累托前沿

Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

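AI summary: This paper proposes PSD, a parallel speculative decoding framework that achieves a favorable trade-off between inference efficiency and generation quality, reaching up to 5.5× tokens per forward pass.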

Comments 16 pages

AI中文摘要

扩散大语言模型（dLLMs）通过迭代去噪掩码标记序列生成文本。尽管dLLMs可以在每个步骤内并行预测所有掩码位置，但大量的去噪迭代仍使推理成本高昂。此成本可通过每步解掩多个标记进行空间优化，或通过将多个去噪步骤合并为一次验证调用进行时间优化。我们提出并行推测解码（PSD），一种无需训练的框架，同时提升推理效率和生成质量。利用单次前向传递的置信度分数，PSD通过可配置的自适应解掩策略选择解掩位置，并构建多深度的推测草案而无需额外模型调用。最终的批量验证步骤应用分层接受机制，保留与更新预测一致的最深草案。在三个dLLMs上进行的实验表明，PSD在推理效率和生成质量之间取得了良好的权衡，达到每前向传递5.5倍的token处理速度，其准确性与贪婪解码相当。

英文摘要

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.
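
The two ingredients named in the abstract, confidence-based unmasking and hierarchical acceptance of nested drafts, can be sketched as follows. This is a guess at the general shape under stated assumptions, not the paper's algorithm: the threshold policy, the draft representation as position-to-token dictionaries, and both function names are hypothetical.

```python
def select_unmask(confidences, threshold):
    """Pick masked positions whose confidence clears the threshold.

    confidences: dict position -> confidence for still-masked positions.
    Always returns at least the single most confident position so that
    decoding makes progress.
    """
    chosen = [p for p, c in confidences.items() if c >= threshold]
    if not chosen:
        chosen = [max(confidences, key=confidences.get)]
    return sorted(chosen)


def deepest_consistent(drafts, verified):
    """Return the index of the deepest draft that agrees with `verified`.

    drafts: list of dicts (position -> token), ordered shallow to deep.
    verified: dict position -> token from the verification pass.
    Returns -1 if even the shallowest draft disagrees.
    """
    accepted = -1
    for depth, draft in enumerate(drafts):
        if all(verified.get(pos) == tok for pos, tok in draft.items()):
            accepted = depth
        else:
            break  # deeper drafts build on this one, so stop here
    return accepted


positions = select_unmask({3: 0.95, 5: 0.40, 7: 0.91}, threshold=0.9)
drafts = [
    {3: "a", 7: "b"},
    {3: "a", 7: "b", 5: "c"},
    {3: "a", 7: "b", 5: "c", 8: "d"},
]
verified = {3: "a", 7: "b", 5: "c", 8: "x"}
depth = deepest_consistent(drafts, verified)
```

Here the third draft is rejected because position 8 disagrees with the verification pass, so decoding would continue from the second draft.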

2605.15608 2026-05-18 cs.LG cs.SY eess.SY

Transformer-like Inference from Optimal Control

基于最优控制的变换器式推理

Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

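AI summary: Starting from optimal control theory, this paper derives an inference architecture for prediction problems, sheds light on the origin of the operations in a transformer layer, and validates it experimentally on a nonlinear discrete process model and a linear Gaussian model.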

Comments Preprint

2605.15607 2026-05-18 cs.CL cs.LG

Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

无语义的语法：教大语言模型在未见过的语言中编程

Vinayshekhar Bannihatti Kumar, Disha Makhija, Manoj Ghuhan Arivazhagan, Rashmi Gangadharaiah

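AI summary: This study examines whether large language models can generate code in an unseen language and finds that fine-tuning teaches only syntax without transferring semantic competence, exposing a gap between reasoning and language-specific implementation.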

AI中文摘要

大型语言模型（LLMs）在代码生成基准测试中取得了很高的通过率，但它们能否将这种能力迁移到预训练中未出现的语言，目前仍不清楚。我们引入PyLang，一种未出现在任何预训练语料库中的极简命令式语言，并在352个问题上评估了零样本的前沿模型和微调后的Qwen3（4B、8B、32B）。我们发现，微调能很快教会语法，却无法迁移语义能力：在所有配置中，Python的表现比PyLang高出最多19%，并且没有任何干预手段（多任务学习、偏好微调、代码填充或潜在空间目标）能够弥合这一差距。LLM评判器显示，前沿模型有80%的情况选择了与Python解法相同的算法，却无法将其转化为可运行的PyLang实现；CKA分析证实，微调模型在不同语言间收敛到几乎相同的内部表示（CKA > 0.97），而在输出阶段出现分化。我们将这一现象称为实现保真度差距：模型具备与语言无关的算法理解，却无法用不熟悉的语言将其表达出来。我们的发现凸显了对能将推理与语言特定实现解耦的训练方法的需求。

英文摘要

Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
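
The CKA figure quoted above measures how similar two sets of representations are. A minimal sketch of the linear variant of CKA follows, assuming each matrix holds one row per example; the abstract does not say which CKA kernel the authors use, and the toy matrices are invented.

```python
import math


def center(mat):
    """Subtract the per-column mean from a list-of-rows matrix."""
    n, d = len(mat), len(mat[0])
    means = [sum(row[k] for row in mat) / n for k in range(d)]
    return [[row[k] - means[k] for k in range(d)] for row in mat]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for list-of-rows matrices a and b."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            entry = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += entry * entry
    return total


def linear_cka(x, y):
    """Linear CKA between two representations of the same n examples."""
    x, y = center(x), center(y)
    return cross_frobenius_sq(x, y) / math.sqrt(
        cross_frobenius_sq(x, x) * cross_frobenius_sq(y, y))


# y is x rotated by 90 degrees and scaled by 2; linear CKA ignores both.
x = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
y = [[-2.0 * b, 2.0 * a] for a, b in x]
same = linear_cka(x, x)
rotated = linear_cka(x, y)
```

Linear CKA is invariant to rotations and uniform scaling of the features, so a value near 1 says the two models encode the same geometry even if individual neurons differ.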

2605.15604 2026-05-18 cs.LG cs.CL

VSPO: Vector-Steered Policy Optimization for Behavioral Control

VSPO：用于行为控制的向量引导策略优化

Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

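AI summary: VSPO uses a steering vector associated with the target behavior to control the behavior intensity of generated rollouts, easing the sparse-reward problem in multi-objective optimization and speeding up policy optimization.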

AI中文摘要

现代语言模型往往需要在优化主要准确性目标的同时，兼顾次要行为偏好，如 verbosity、agreeableness 或响应中技术专家水平。在实践中，基础模型可能很少或完全不表现出期望的行为。因此，赋予模型目标行为会形成稀疏行为奖励瓶颈。为解决此类多目标问题，我们引入了向量引导策略优化（VSPO），它利用与目标行为相关的引导向量来控制生成轨迹的行为强度。VSPO是通过修改GRPO以采样具有不同引导强度的轨迹获得的。此过程可以解释为一种在线策略潜在自我蒸馏过程，其中模型内部化其引导向量。通过调整引导强度，VSPO上采样稀有行为并丰富轨迹多样性，缓解稀疏奖励问题并可证明加速策略优化。通过全面的理论和实验，我们证明了VSPO相较于 vanilla reward shaping 和其他替代方法具有更优的性质。具体而言，在bandit抽象下，当引导引起的分布足够与目标行为对齐时，VSPO可证明在迭代复杂度上优于reward-shaped GRPO。我们评估了VSPO在多个推理基准上，包括MATH和MMLU-Pro，针对四个目标行为：解释能力、自信表达、对误导上下文的鲁棒性以及响应 verbosity。我们的结果表明，VSPO在保持或提高任务准确性的同时，一致提升了对目标行为的控制。

英文摘要

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in their responses. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO), which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves control along the target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
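
A toy sketch of the two mechanisms the abstract combines: adding a scaled steering vector to a hidden state, and computing group-relative advantages over rollouts sampled at different steering intensities. Everything concrete here is an assumption for illustration (the two-dimensional hidden state, the intensity grid, the placeholder rewards); the paper's sampling schedule and objective are not reproduced.

```python
import statistics


def steer(hidden, vector, alpha):
    """Add a scaled steering vector to a hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four rollouts sampled at different steering intensities.
alphas = [0.0, 0.5, 1.0, 2.0]
hidden, vector = [0.2, -0.1], [1.0, 0.0]
steered = [steer(hidden, vector, a) for a in alphas]

# Placeholder rewards; a real setup would score each rollout's text.
rewards = [0.0, 0.0, 1.0, 1.0]
advantages = group_advantages(rewards)
```

If unsteered rollouts almost never show the target behavior, every reward in the group is the same and the advantages vanish; mixing in steered rollouts is one way to get a non-degenerate group.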

2605.15603 2026-05-18 cs.LG cs.AI

Offline Reinforcement Learning with Universal Horizon Models

基于通用视界模型的离线强化学习

Hojun Chung, Junseo Lee, Songhwai Oh

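AI summary: This paper proposes universal horizon models, which predict future states under arbitrary horizons and thereby address the difficulty geometric horizon models have with distant future states; the method is validated on 100 OGBench tasks.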

Comments ICML 2026

AI中文摘要

基于模型的强化学习（RL）通过在想象的同策略（on-policy）轨迹上进行价值学习，为离线 RL 提供了一种有吸引力的方法。然而，由于在自生成状态上反复进行模型推断，该方法常受累积误差困扰。几何视界模型（GHM）通过直接预测折扣无限视界下的未来状态来缓解这一问题，但在准确建模远期状态方面仍有困难。为此，我们引入通用视界模型（UHM），它是 GHM 的推广，能够直接预测任意视界下的未来状态。利用这种灵活性，我们提出一种可扩展的价值学习方法，采用缩尾（winsorized）视界分布，通过对过大的视界进行截顶来稳定训练。在 100 个具有挑战性的 OGBench 任务上的实验结果表明，所提方法优于有竞争力的基线，尤其是在数据集高度次优以及需要长视界推理的任务上。项目页面：https://rllab-snu.github.io/projects/UHM/

英文摘要

Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: https://rllab-snu.github.io/projects/UHM/
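
"Winsorized" here means that horizons above a cap are clamped to the cap rather than discarded. A minimal sketch, assuming the uncapped horizons follow the geometric distribution implied by a discount factor; the discount, the cap, and the function names are illustrative choices, not values from the paper.

```python
import math
import random


def sample_geometric(gamma, rng):
    """Sample a horizon h >= 1 with P(h) = (1 - gamma) * gamma ** (h - 1)."""
    u = 1.0 - rng.random()  # u in (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(gamma))


def sample_winsorized(gamma, cap, rng):
    """Geometric horizon with values above `cap` clamped to `cap`."""
    return min(sample_geometric(gamma, rng), cap)


rng = random.Random(0)
horizons = [sample_winsorized(0.99, 50, rng) for _ in range(1000)]
```

Clamping moves all of the tail probability onto the cap itself, so the model is still trained on long horizons but never on the rare, extremely long ones that are hardest to predict.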

2605.15597 2026-05-18 cs.CV cs.GR cs.LG cs.RO

CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

CM-EVS：稀疏全景RGB-D-姿态数据用于完整场景覆盖

Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang, Lidong Chen, Yang Zou, Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng

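AI summary: This paper presents CM-EVS, a sparse panoramic RGB-D-pose dataset built with the COVER algorithm, which achieves complete scene coverage with low redundancy and auditable provenance for geometry-consistent 3D learning.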

Comments 35 pages including appendix. Code and dataset: https://github.com/Strange-animalss/CM-EVS

AI中文摘要

现代3D视觉学习依赖于从度量3D资产中采样的观测，但现有的扫描、网格、点云、仿真和重建并不能直接提供稀疏、可比且几何一致的全景训练接口。密集轨迹会重复相邻视角，源特定的渲染策略导致标注异质，稀疏启发式方法则可能遗漏重要区域或引入深度不一致的观测。本文研究如何将3D资产转换为稀疏的全景RGB-D-姿态数据，在低冗余且来源可审计的前提下保持完整的场景覆盖。我们提出COVER（Coverage-Oriented Viewpoint curation with ERP Range-depth warping），一种无需训练的ERP视角筛选器：它将已选视角观测到的几何投影到候选ERP探针上，对增量覆盖进行评分，并惩罚深度冲突。在代理误差有界的条件下，其贪心覆盖代理保持标准的覆盖式近似性质，仅相差一个加性误差项。基于COVER，我们构建了CM-EVS（Coverage-curated Metric ERP View Set），一个全景RGB-D-姿态数据集，包含来自Blender室内、HM3D和ScanNet++的1,275个室内场景的36,373个精选ERP帧，并补充了来自TartanGround和OB3D、重新编码为同一格式的户外全景。每帧提供全球面RGB、度量距离深度和标定姿态；由COVER生成的室内帧还包含逐步的来源日志。每个室内场景的帧数中位数仅为25，CM-EVS覆盖全部13种统一房间类型，同时保持紧凑的场景级覆盖。实验表明，COVER改善了覆盖与冲突之间的权衡，使CM-EVS成为用于几何一致全景3D学习的稀疏、紧凑且可审计的RGB-D-姿态资源。

英文摘要

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and calibrated pose; COVER-produced indoor frames include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
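
The "score incremental coverage, penalize depth conflicts" loop reads like a penalized greedy set cover. The sketch below shows that pattern on abstract cell ids; it is an assumption-laden simplification in which the ERP warping that would produce the coverage sets and conflict counts is replaced by hand-written inputs, and `greedy_select` is a hypothetical name.

```python
def greedy_select(candidates, conflicts, penalty, max_views):
    """Greedy viewpoint selection by penalized incremental coverage.

    candidates: dict view name -> set of surface-cell ids it observes.
    conflicts:  dict view name -> number of depth-inconsistent cells.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < max_views:
        def gain(name):
            # New cells this view would add, minus a penalty for conflicts.
            return len(remaining[name] - covered) - penalty * conflicts.get(name, 0)
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # no candidate adds net coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


views = {
    "A": {1, 2, 3, 4},
    "B": {3, 4, 5},
    "C": {5, 6},
    "D": {1, 2, 3, 4, 5, 6, 7},  # sees the most, but with many conflicts
}
chosen, covered = greedy_select(views, conflicts={"D": 5}, penalty=1.0, max_views=4)
```

View D covers the most cells but is never picked because its conflict penalty outweighs its gain, which is the trade-off the abstract describes.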

2605.15592 2026-05-18 cs.CV

Efficient Image Synthesis with Sphere Latent Encoder

基于球形潜在编码器的高效图像合成

Tung Do, Thuan Hoang Nguyen, Hao Li

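AI summary: This paper decouples the framework into a fixed pretrained image encoder and a separate spherical latent denoising model, improving efficiency and letting reconstruction and generation specialize independently; on several datasets it outperforms Sphere Encoder in generation quality and inference speed.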

Comments Technical report

AI中文摘要

少数步骤图像生成已取得快速进展，一致性及meanflow-based方法显著减少了采样步骤的数量。尽管其推理成本低，但这些方法常面临训练不稳定和可扩展性有限的问题。Sphere Encoder是一种近期的替代方案，仅需少数步骤即可生成高质量图像；然而，其在推理过程中需要在像素空间和潜在空间之间反复转换，同时在单一架构内联合优化重建与生成。这种设计导致计算效率低下，并在重建与生成之间产生目标冲突。为解决这些限制，我们将框架分离为一个固定的预训练图像编码器和一个单独的潜在去噪模型，后者完全在球形潜在空间中训练。我们的方法在训练和推理过程中消除了反复的像素空间操作，提高了效率，并允许重建与生成各自专业化。在Animal-Faces、Oxford-Flowers和ImageNet-1K数据集上，我们的方法在生成质量和推理速度上显著优于Sphere Encoder，同时在强少数步骤和多步骤基线中也取得了具有竞争力的结果。

英文摘要

Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.

2605.15589 2026-05-18 cs.CL

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

MHGraphBench：基于知识图谱的大语言模型心理健康知识基准评测

Weixin Liu, Congning Ni, Shelagh A. Mulvaney, Susannah L. Rose, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

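AI summary: This paper presents the MHGraphBench benchmark for assessing large language models on mental-health entity recognition, relation judgment, and two-hop reasoning; models do well on entity typing and a small relation-typing subset but still struggle with relation prediction and two-hop reasoning, and output-format reliability substantially affects measured performance.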

Comments Accepted to GEM 2026, ACL 2026 Workshop; 9 pages main text plus references and appendices

AI中文摘要

大型语言模型（LLMs）在心理健康领域应用日益广泛，但其对相关生物医学知识的捕捉能力和临床相关结构判断的可靠性仍不明确。本文提出一个基于知识图谱（KG）的基准，用于评估LLMs在心理健康实体识别、关系判断和双跳推理能力。该基准源自PrimeKG，包含九个任务家族，具有KG支持答案和受控负样本选项。在15个封闭源和开源LLM上的实验揭示了持续的识别-判断差距：领先模型在实体类型识别和小关系类型子集上接近天花板表现，但仍在关系预测和双跳推理上挣扎。此外，短KG衍生片段对某些模型有益，但对其他模型则会降低性能。此外，输出格式可靠性在受限的多选设置下对测量性能有显著影响，突显了响应有效性在基准评估中的关键作用。MHGraphBench因此应被解释为在受控多选界面下评估与PrimeKG精心编纂的心理健康切片的一致性，而不是直接评估现实世界临床安全性的评估。

英文摘要

Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
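
Two ideas from the abstract lend themselves to a short sketch: deriving a two-hop answer from knowledge-graph triples, and checking whether a model's reply is a valid multiple-choice answer at all. The triples, relation names, and helper functions below are invented for illustration and are not taken from PrimeKG or the benchmark's code.

```python
from collections import defaultdict

# Illustrative triples only; they are not taken from PrimeKG.
triples = [
    ("drug_X", "treats", "disorder_Y"),
    ("disorder_Y", "associated_with", "gene_Z"),
    ("drug_X", "treats", "disorder_W"),
    ("disorder_W", "associated_with", "gene_V"),
]


def build_index(triples):
    """Map (head, relation) to the set of tails."""
    index = defaultdict(set)
    for head, relation, tail in triples:
        index[(head, relation)].add(tail)
    return index


def two_hop(index, start, rel1, rel2):
    """Entities reachable from `start` via rel1 followed by rel2."""
    answers = set()
    for mid in index.get((start, rel1), ()):
        answers |= index.get((mid, rel2), set())
    return answers


def parse_choice(response, options="ABCD"):
    """Return the chosen letter, or None if the reply is not a single valid option."""
    text = response.strip().upper()
    return text if len(text) == 1 and text in options else None


index = build_index(triples)
genes = two_hop(index, "drug_X", "treats", "associated_with")
```

A strict parser like `parse_choice` treats a verbose reply as invalid, which shows how output-format reliability can lower a measured score independently of what the model knows.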
