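arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.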

2606.08976 2026-06-09 cs.AI New submission

RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models

Jing Wang, Shang Liu, Wenji Fang, Yuchao Wu, Yugao Zhu, Zhiyao Xie

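Affiliation: Hong Kong University of Science and Technology

AI summary: Proposes RTL-BenchLS, a large-scale benchmark of more than 10,000 formally verified Verilog designs, and introduces three new reasoning tasks, two of them self-supervised, to address the small scale and narrow task scope of existing benchmarks. Evaluation shows that even the best current models score low.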

English abstract

LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infrastructure for tracking progress in this direction. However, existing RTL benchmarks face inherent limitations in both scale and task scope. The designs they cover are typically small and simple, and the tasks focus almost entirely on specification-to-RTL generation. Frontier models' performance already saturates on the existing benchmarks. Scaling these benchmarks up is fundamentally difficult because aligned labels are required for benchmarking, such as specifications and testbenches. Such aligned high-quality data are rarely available for real-world designs. We introduce RTL-BenchLS, a large-scale benchmark addressing both limitations above. It contains over 10,000 formally verified Verilog designs, covering substantially larger and more complex designs than existing benchmarks. Beyond specification-to-RTL generation, we propose three novel tasks that jointly evaluate reasoning and generation: round-trip reasoning, masked-content reasoning, and repository-issue reasoning. The first two are self-supervised, which directly resolves the scaling bottleneck. All tasks are verified through formal equivalence checking without any manual testbenches. We evaluate eight LLMs on RTL-BenchLS. Even the best model reaches only 23% on natural-language round-trip reasoning, 28% on masked-content reasoning, and 12% on repository-issue fixing. RTL-BenchLS is substantially more challenging than existing benchmarks. It leaves ample room for future improvement and offers guidance for developing LLM-based methods for hardware design.

2606.08974 2026-06-09 cs.AI New submission

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao

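Affiliation: School of Computer Science and Technology, Beijing Institute of Technology

AI summary: Proposes Diverse Schemata Policy Optimization (DiScO), which increases the diversity of reasoning-step transitions and answer candidates to improve large language models' performance on mathematical reasoning tasks and their ability to recover from errors.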

English abstract

Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions, which capture the distinct transitions between reasoning steps, and answer candidates, which reflect the variety of solution paths produced by the model. We collectively define these two aspects as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance diversity as a means to further improve reasoning potential. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and further promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work highlights the important role of thinking-schemata diversity and points to scaling along the diversity dimension as a promising research direction.

2606.08970 2026-06-09 cs.AI New submission

An Effective Router for Vision-Language Model Selection

Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu

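Affiliations: Harbin Institute of Technology; Shandong Key Laboratory of Digital Service Computing Technology and Systems

AI summary: Targets three problems in vision-language model (VLM) selection: scarce data, ineffective feature representations, and a rigid model space. Proposes the ARMS router, which enhances input signals and adds extension training strategies. It performs well on in-distribution and out-of-distribution test sets and, at only 800M parameters, outperforms GPT-4o.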

English abstract

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection is still a critical yet challenging problem, which primarily faces: 1) lack of specialized data, 2) ineffective feature representation, and 3) rigid model space and costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles and employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS's adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategy, ARMS (only 800M in size) can adapt to a broader VLM space and defeat commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in the anonymous repository.

2606.08969 2026-06-09 cs.CL cs.AI New submission

CARE: A Conformal Safety Layer for Medical Summarization

Suhana Bedi, Bridget Lin, Anson Y. Zhou, Chloe O. Stanwyck, Jenelle A. Jindal, Sanmi Koyejo, David Stutz, Nigam H. Shah

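Affiliations: Stanford University; Google DeepMind

AI summary: Proposes CARE, which uses conformal risk control to attach calibrated omission and hallucination flags to LLM-generated medical summaries, reducing review burden while preserving safety guarantees.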

Comments: 29 pages, 5 figures

English abstract

Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(\tau,\gamma)$ threshold space, CARE preserves formal guarantees while surfacing up to $5\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $\alpha = 0.15$ with 95% confidence across 100 calibration/test resplits, using only ~100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.
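
The omission controller in this abstract can be illustrated with a generic conformal risk control calibration. The sketch below is not CARE's joint two-threshold procedure: it assumes a single flagging threshold, importance labels that are known on the calibration set, and a loss bounded by 1. The names `missed_fraction` and `calibrate_threshold` and the threshold grid are hypothetical.

```python
def missed_fraction(scores, important, threshold):
    """Fraction of important sentences that are NOT flagged (score < threshold)."""
    total = sum(important)
    if total == 0:
        return 0.0
    missed = sum(1 for s, imp in zip(scores, important) if imp and s < threshold)
    return missed / total


def calibrate_threshold(docs, alpha, grid):
    """Pick the largest threshold (fewest flags) whose corrected risk stays <= alpha.

    docs: list of (scores, important) pairs from a labeled calibration set.
    Uses the standard conformal risk control bound for a loss in [0, 1]:
        (n * empirical_risk + 1) / (n + 1) <= alpha
    The loss is non-decreasing in the threshold, so the scan stops at the
    first violation. Returns None when no grid value satisfies the bound.
    """
    n = len(docs)
    best = None
    for t in sorted(grid):
        risk = sum(missed_fraction(s, imp, t) for s, imp in docs) / n
        if (n * risk + 1.0) / (n + 1) <= alpha:
            best = t
        else:
            break
    return best


# Toy calibration set: two important sentences (scores 0.9 and 0.6), one unimportant.
calibration_docs = [([0.9, 0.6, 0.2], [1, 1, 0])] * 9
chosen = calibrate_threshold(calibration_docs, alpha=0.15, grid=[0.1, 0.5, 0.7, 0.95])
```

Because of the 1/(n+1) correction, a very small calibration set may admit no threshold at all, in which case the sketch returns `None`.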

2606.08962 2026-06-09 cs.LG cs.CV cs.RO New submission

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

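Affiliations: George Mason University; University of Central Florida

AI summary: Proposes C$^3$ache, which caches and reuses denoising residuals across inference chunks to accelerate world action model inference, achieving up to a $2.5\times$ speedup with almost no loss in task success rate.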

English abstract

World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.
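
The cross-chunk reuse described in this abstract can be pictured as a cache keyed by denoising step. The sketch below is a toy under stated assumptions: the fixed refresh schedule (`refresh_every`), the scalar state, and the names `CrossChunkResidualCache` and `run_chunk` are hypothetical, and the abstract does not say how C$^3$ache decides when a cached residual may be reused.

```python
class CrossChunkResidualCache:
    """Reuse the residual computed at the same denoising step of an earlier chunk."""

    def __init__(self, refresh_every=2):
        self.refresh_every = refresh_every  # assumed schedule: recompute every k-th chunk
        self.residuals = {}                 # denoising step -> cached residual

    def residual(self, chunk_idx, step, x, compute):
        must_refresh = chunk_idx % self.refresh_every == 0
        if must_refresh or step not in self.residuals:
            self.residuals[step] = compute(x, step)  # expensive model call
        return self.residuals[step]


def run_chunk(cache, chunk_idx, x, num_steps, compute):
    """One inference chunk: apply a residual update at every denoising step."""
    for step in range(num_steps):
        x = x + cache.residual(chunk_idx, step, x, compute)
    return x


calls = []

def toy_compute(x, step):
    """Stand-in for the expensive network forward pass."""
    calls.append(step)
    return -0.1 * x

cache = CrossChunkResidualCache(refresh_every=2)
state = 1.0
for chunk in range(4):
    state = run_chunk(cache, chunk, state, num_steps=3, compute=toy_compute)
```

In this toy run only chunks 0 and 2 call `toy_compute`, so the number of expensive calls is halved; any real speedup would depend on how often residuals can be reused without hurting task success.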

2606.08959 2026-06-09 cs.CV cs.CL New submission

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

Affiliations: LMU Munich; FAU Erlangen-Nuremberg; Munich Center for Machine Learning; University of Tübingen & Tübingen AI Center; Sun Yat-sen University; University of Copenhagen; University of Maryland, College Park

AI summary: Proposes ChinaHeritaQA, a multimodal benchmark dataset of 2,279 images and 14,133 bilingual multiple-choice questions spanning seven cognitive dimensions, for evaluating the cultural reasoning of vision-language models on World Heritage sites in China.

English abstract

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

2606.08953 2026-06-09 cs.LG math.FA New submission

Self-Consistent Generative Paths via Admissible Random Variational Transport

Lei Luo, Yingzhen Zhang, Jian Yang

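Affiliation: PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology

AI summary: Defines self-consistent generative paths as random fixed points of admissible local variational transport corrections and introduces the random fixed-point path residual (R-FPR) to measure the gap between a generated path and its correction, yielding a residual-control principle for diffusion, flow, one-step, VAE, GAN, and other models.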

Comments: 17 pages, 4 figures, including Appendix

English abstract

Modern generative models often define an entire probability path from a simple prior to the data law, rather than only an endpoint map. Diffusion models follow stochastic denoising paths, flow matching learns transport fields, consistency and distillation methods compress paths into one or a few steps, adversarial models match terminal distributions, and VAEs generate through latent kernels. Existing unifying views mainly describe how such paths are constructed. We study a complementary question: when is a generated probability path self-consistent? We define a self-consistent generative path as a random fixed point of admissible local variational transport corrections. In this framework, a local correction is specified by a random variational transport operator combining a divergence or geometry term, an energy term, and a structural constraint. The framework contains random regularized optimal-transport proximal steps as a structured instance, while also allowing non-OT divergences, latent kernels, adversarial constraints, causal discrete kernels, and terminal one-step maps. The theory yields a random fixed-point path residual (R-FPR), which measures the gap between the actual generated path and an admissible local correction. We prove well-posedness, random fixed-point existence and attraction, non-contractive existence, residual-to-generation error bounds, empirical residual concentration, proxy perturbation bounds, continuous-time limits, and operator-level generalization with model-specific corollaries. The resulting theory turns endpoint matching into path self-consistency testing and provides a residual-control principle for diagnosing failures, regularizing training, and guiding adaptive sampling across diffusion, flow, one-step, VAE, GAN/WGAN, and autoregressive generators.

2606.08952 2026-06-09 cs.AI New submission

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei

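Affiliations: Institute of Artificial Intelligence, Beihang University; Huawei Noah’s Ark Lab; University of Science and Technology Beijing

AI summary: Proposes the AlloSpatial framework, which uses the World2Mind cognitive mapping sandbox to convert egocentric observations into allocentric spatial priors and a Spatial Reasoning Harness to perform geometry-semantic arbitration, improving proprietary models' spatial reasoning by 5%-18% on VSI-Bench and MindCube.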

English abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees (ASTs) and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.

2606.08948 2026-06-09 cs.CV cs.AI New submission

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo

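Affiliation: Emory University

AI summary: Addresses the unreliability of existing MLLMs at estimating dietary micronutrients. Uses a decade of population-scale dietary recalls to generate about 1.1 million image-description-nutrient triplets, then fine-tunes Qwen3-VL and GLM-4.6V-Flash to obtain NutriMLLM, which achieves near-complete coverage of 65 nutrients on real images; the largest variant matches or exceeds proprietary models in accuracy on most nutrients.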

Comments: 35 pages, 10 figures, 1 table

English abstract

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.

2606.08945 2026-06-09 cs.LG New submission

From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

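Affiliation: Centre for Big Data Research in Health, the University of New South Wales

AI summary: Proposes a method for transferring time-to-event risk information from a Cox proportional hazards model into a large language model by fine-tuning a Qwen model on text prompts. It achieves competitive discrimination and calibration on three datasets, and its hidden states show a continuous risk gradient.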

English abstract

We investigate whether information about time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. We propose a text-based survival modelling pipeline in which structured clinical covariates are converted into text prompts and a Qwen-based large language model is fine-tuned to generate patient-specific survival risk using Cox model predictions as a training target. Across GBSG2, ACTG320, and WHAS500, the model achieves competitive held-out discrimination and calibration despite being trained as a text-generation task rather than with a conventional survival-analysis loss. We further analyse the geometry of the model's hidden states, where t-SNE visualisations reveal smooth risk gradients in latent space, suggesting that the model represents survival risk as a continuous structure rather than isolated risk categories. Together, these findings suggest that large language models can internalise survival-risk structure while supporting calibrated prediction, providing a route towards time-to-event reasoning in language models.
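
The pipeline in this abstract (structured covariates turned into a text prompt, with a Cox prediction as the generation target) can be sketched as follows. The relative hazard $\exp(\beta^\top x)$ is the standard Cox quantity; everything else is assumed for illustration: the prompt wording, the coefficient values, the three-decimal target format, and the names `cox_relative_risk`, `to_prompt`, and `training_example` are hypothetical and not taken from the paper.

```python
import math


def cox_relative_risk(covariates, coefficients):
    """Relative hazard exp(beta . x) from a fitted Cox proportional hazards model."""
    linear_predictor = sum(coefficients[name] * covariates[name] for name in coefficients)
    return math.exp(linear_predictor)


def to_prompt(covariates):
    """Serialize structured covariates into a plain-text prompt (assumed wording)."""
    fields = ", ".join(f"{name}={value}" for name, value in covariates.items())
    return f"Patient record: {fields}. Estimated relative risk:"


def training_example(covariates, coefficients):
    """Pair the prompt with the Cox prediction rendered as text, for fine-tuning."""
    risk = cox_relative_risk(covariates, coefficients)
    return {"prompt": to_prompt(covariates), "target": f"{risk:.3f}"}


# Hypothetical coefficients and patient; not values from GBSG2, ACTG320, or WHAS500.
beta = {"age": 0.02, "tumor_size": 0.01}
patient = {"age": 50, "tumor_size": 30}
example = training_example(patient, beta)
```

Rendering the target as a short decimal string keeps the task a pure text-generation problem, which matches the abstract's point that no conventional survival-analysis loss is used.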

2606.08940 2026-06-09 cs.CL New submission

Multilingual Sentiment Aware Text Summarization: A Reinforcement Learning Approach for Consistency Maintenance

Mikhail Krasitskii, Alexander Gelbukh, Olga Kolesnikova, Grigori Sidorov

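Affiliation: Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)

AI summary: Studies sentiment drift in RLHF-based summarization and, building on a Policy Attribution framework, proposes sentiment-aware KL regularization that mitigates sentiment neutralization while maintaining summary quality.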

English abstract

Reinforcement Learning from Human Feedback (RLHF) has significantly improved the quality and fluency of large language models in text summarization. However, its impact on affective properties remains insufficiently understood. In this work, we study sentiment drift, a systematic shift toward neutral sentiment in RLHF-based summarization outputs compared to source texts. We conduct extensive experiments across multiple datasets, model architectures, and eight languages to analyze how alignment objectives influence sentiment preservation. Our results show that sentiment drift is a consistent phenomenon that becomes stronger with increased KL regularization strength, indicating a trade-off between alignment stability and affective fidelity. To explain this behavior, we introduce a Policy Attribution framework that decomposes the RLHF objective and quantifies the contribution of its components. Our analysis reveals that KL regularization is the primary driver of sentiment suppression across all settings. Based on these findings, we propose a sentiment-aware modification of the KL regularization term, which selectively reduces constraints on sentiment-bearing tokens. Empirical results demonstrate that this approach mitigates sentiment drift while maintaining summarization quality. Overall, our findings highlight a fundamental limitation of current alignment methods: while they improve factual consistency and safety, they may unintentionally suppress emotional expressiveness. This motivates the development of alignment strategies that explicitly account for affective preservation.
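
The proposed modification, which "selectively reduces constraints on sentiment-bearing tokens", can be pictured as a per-token weight on the usual RLHF KL penalty. The sketch below is one possible reading, not the authors' formulation: the per-token log-ratio estimate of the KL term is common practice, while the constant `relief` factor, the boolean sentiment mask, and the function name are hypothetical.

```python
def sentiment_aware_kl_penalty(policy_logprobs, reference_logprobs,
                               is_sentiment_token, beta=0.1, relief=0.25):
    """Sum of per-token KL penalties, down-weighted on sentiment-bearing tokens.

    policy_logprobs / reference_logprobs: log-probabilities of the sampled tokens
    under the policy and the frozen reference model.
    is_sentiment_token: one boolean per token (assumed to come from a lexicon
    or classifier, which is not modeled here).
    """
    penalty = 0.0
    for logp, logp_ref, flagged in zip(policy_logprobs, reference_logprobs,
                                       is_sentiment_token):
        weight = relief if flagged else 1.0  # weaker constraint on sentiment tokens
        penalty += beta * weight * (logp - logp_ref)
    return penalty


# Two tokens with the same log-ratio of 1.0; only the second carries sentiment.
penalty = sentiment_aware_kl_penalty([-1.0, -0.5], [-2.0, -1.5], [False, True])
```

With `relief=1.0` the function reduces to the unweighted penalty, which makes it easy to compare against a standard RLHF baseline.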

2606.08938 2026-06-09 cs.CL cs.AI New submission

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

Gen Li, Yuanze Hu, Zhichao Yang, Qingchen Yu, Jianwei Lv, Yue Guo, Yujing Liu, Faguo Wu, Hongwei Zheng, Xiandong Li, Bo Yuan, Yifan Sun, Zhaoxin Fan

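Affiliations: Beihang University; Baidu; ByteDance; Beijing Academy of Blockchain and Edge Computing; Renmin University of China

AI summary: Proposes the PACT framework, which combines privileged synthesis of dialogue data with multi-branch consensus training so that an LLM learns several diagnostic reasoning paradigms at once, achieving the best performance on a Chinese medical diagnosis benchmark.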

Comments: 16 pages, 5 figures, 5 tables

English abstract

Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose PACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, DPS (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.
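
The abstract does not define "sign consensus", so the following is only one plausible reading, modeled on sign-election schemes used in model merging: for each parameter coordinate, keep the sign that dominates across branches and average the branch updates that agree with it. The function name and the flat-list representation of LoRA deltas are hypothetical.

```python
def sign_consensus_merge(branch_deltas):
    """Aggregate per-branch parameter deltas into one anchor update.

    branch_deltas: list of equal-length lists, one per paradigm-specific branch.
    For each coordinate the dominant sign is the sign of the summed deltas;
    only deltas with that sign are averaged. Coordinates whose deltas cancel
    exactly are left at zero.
    """
    merged = []
    for values in zip(*branch_deltas):
        total = sum(values)
        if total == 0:
            merged.append(0.0)
            continue
        sign = 1 if total > 0 else -1
        agreeing = [v for v in values if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


# Three hypothetical branches, three coordinates each.
anchor_update = sign_consensus_merge([
    [0.2, -0.4, 0.1],
    [0.4, 0.1, -0.1],
    [0.3, -0.2, 0.0],
])
```

In the second coordinate the lone positive delta is discarded, so a branch that disagrees with the majority direction does not pull the shared anchor back toward zero.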

2606.08935 2026-06-09 cs.LG cs.AI New submission

PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection

Kang Zhang, Wei Jian Lau, Shoushou Ren, Dong Lin, Joon Son Chung, Chuanhao Sun

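Affiliations: HUAWEI; KAIST

AI summary: Existing representation-based time-series anomaly detection methods ignore amplitude information, which degrades performance. Proposes the PAI scheme, which uses a diagnostic module and a score augmentation function to fuse in amplitude-related scores, raising average VUS-PR by 98.4% and 36.8% on the TSB-AD-U-Eva and TAB UV datasets.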

Comments: 15 pages

English abstract

Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. However, our evaluation shows that they suffer from a major limitation: their learned embeddings are often amplitude-agnostic. Losing amplitude information can degrade performance on amplitude-related anomalies, and this failure is prevalent across all existing representation-based methods. To address this issue, we propose a new anomaly scoring scheme named PAI. PAI consists of two complementary modules: a diagnostic module and a final score augmentation function. The diagnostic module compares cosine and Euclidean scoring on the same representation bank to test whether amplitude information is already captured in the learned representation. Then, in the final score augmentation function, PAI computes a point-wise median and MAD deviation score and a local mean-shift score, which are fused with the representation score to produce the final anomaly score. On the TSB-AD-U-Eva and TAB UV datasets, PAI improves all four evaluated representation-based methods across every reported metric, achieving average VUS-PR gains of 98.4% and 36.8%, respectively. Among all evaluated combinations, PaAno + PAI achieves the best performance, outperforming the state-of-the-art method by 15%. Further evaluations with bootstrap confidence intervals, anomaly-type breakdowns, and a TS2Vec input-normalization ablation also support the proposed scheme. These results suggest that explicitly retaining amplitude information is important for representation-based time-series anomaly detection, which has been underemphasized in existing scoring schemes. Code is available at: https://github.com/pantheon5100/PAI
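
The two amplitude-related scores named in this abstract can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in the linked repository: the window size, the min-max normalisation, and the element-wise maximum used for fusion are hypothetical choices, since the abstract does not specify how the scores are fused.

```python
from statistics import median


def mad_deviation_scores(series, eps=1e-9):
    """Point-wise robust deviation: |x - median| / (MAD + eps)."""
    med = median(series)
    mad = median(abs(x - med) for x in series)
    return [abs(x - med) / (mad + eps) for x in series]


def local_mean_shift_scores(series, window=3):
    """|mean of the next `window` points - mean of the previous `window` points|.

    Positions without a full window on both sides get a score of 0.
    """
    scores = []
    for i in range(len(series)):
        left = series[max(0, i - window):i]
        right = series[i:i + window]
        if len(left) < window or len(right) < window:
            scores.append(0.0)
            continue
        scores.append(abs(sum(right) / window - sum(left) / window))
    return scores


def min_max(values):
    """Rescale to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_scores(representation_scores, series, window=3):
    """Assumed fusion rule: element-wise maximum of the normalised scores."""
    parts = [
        min_max(representation_scores),
        min_max(mad_deviation_scores(series)),
        min_max(local_mean_shift_scores(series, window)),
    ]
    return [max(triple) for triple in zip(*parts)]


# An amplitude spike at index 4 that a flat (amplitude-agnostic) score misses.
signal = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0, 1.1, 0.9]
flat_representation_scores = [0.5] * len(signal)
fused = fuse_scores(flat_representation_scores, signal)
```

In the toy signal the representation score is constant, so the spike is recovered only through the amplitude-based terms, which is the failure mode the abstract describes.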


2606.08934 2026-06-09 cs.LG stat.AP stat.CO stat.ME stat.ML 新提交

Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory

递归神经网络中的反向相干性与隐藏状态稳定性：拟逆鞅理论

Yuan-chin Ivan Chang

发表机构 * Institute of Statistical Science, Academia Sinica（中央研究院统计科学研究所）

AI总结提出反向相干性概念，通过拟逆鞅理论证明隐藏状态序列几乎必然收敛，并设计正则化方法，在多个任务中实现更早稳定和更低误差。

详情

AI中文摘要

递归神经网络维护一个隐藏状态 $h_t$，但其概率意义通常不明确。我们通过“反向相干性”研究隐藏状态稳定性：即通过学习的反向投影器 $g_ϕ$ 从 $h_{t+1}$ 重构 $h_t$ 的程度。在收缩性和可和反向漂移条件下，隐藏状态序列构成拟逆鞅。这导致几乎必然收敛、混合下的速率、可解释的极限表示、有限路径停止时间以及时间一致置信序列的理论框架。模拟支持该理论。反向相干性正则化将经验拟鞅总和 $\hat Q$ 降低 $43$--$58\%$，比未正则化的 RNN 早 $28$--$44\%$ 达到稳定，并提供与几何界一致的跟踪误差恢复。额外测试证实回波状态遗忘率受 $ρ$ 限制，并验证增量总和管 $R_t$ 具有 $100\%$ 同时覆盖率，尽管 $R_t$ 是保守的；实践中，缺陷尾代理 $\hat Q_t$ 是更有用的监控指标。反向相干性损失也等价于在高斯反向模型中最小化 Kullback-Leibler 散度，将该方法与变分推断联系起来。扩展涵盖 $ϕ$-混合输入、变点跟踪和有限样本集中度。三项真实数据研究进一步验证了该方法。在 PhysioNet 2012 ICU 数据上，逆鞅 RNN (RMRNN) 与 RNN 的死亡率预测 AUC 相当，同时提前 13 小时达到稳定表示。在 FRED-MD 上，它在概念漂移下将一个月前预测误差降低约四倍。在 UCI 人类活动识别上，它保持较低的后转换跟踪误差并具有几何衰减。这些保证在所述假设下成立；不声称普适性。

英文摘要

Recurrent neural networks maintain a hidden state $h_t$, but its probabilistic meaning is often unclear. We study hidden-state stability through backward coherence: the extent to which $h_t$ can be reconstructed from $h_{t+1}$ by a learned backward projector $g_ϕ$. Under contraction and summable backward drift, the hidden-state sequence forms a quasi-reverse-martingale. This yields almost-sure convergence, rates under mixing, an interpretable limiting representation, finite pathwise stopping times, and a theoretical framework for time-uniform confidence sequences. Simulations support the theory. Backward-coherence regularisation reduces the empirical quasi-martingale total $\hat Q$ by $43$--$58\%$, reaches stability $28$--$44\%$ earlier than an unregularised RNN, and gives tracking-error recovery consistent with geometric bounds. Additional tests confirm echo-state forgetting rates bounded by $ρ$ and verify the increment-sum tube $R_t$ with $100\%$ simultaneous coverage, although $R_t$ is conservative; in practice, the defect-tail proxy $\hat Q_t$ is the more useful monitor. The backward-coherence loss is also equivalent to minimising a Kullback-Leibler divergence in a Gaussian backward model, linking the method to variational inference. Extensions cover $ϕ$-mixing inputs, change-point tracking, and finite-sample concentration. Three real-data studies further validate the approach. On PhysioNet 2012 ICU data, the Reverse Martingale RNN (RMRNN) matches RNN mortality-prediction AUC while reaching stable representations 13 hours earlier. On FRED-MD, it reduces one-month-ahead forecast error by about fourfold under concept drift. On UCI Human Activity Recognition, it maintains lower post-transition tracking error with geometric decay. The guarantees apply under the stated assumptions; universality is not claimed.
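As a rough illustration of what a backward-coherence penalty could look like, the sketch below measures how well $h_t$ is reconstructed from $h_{t+1}$. It assumes a linear backward projector $g(h) = hW + b$ and a mean squared reconstruction error; both are assumptions made here for illustration, since the abstract only says that $g_ϕ$ is learned. The last function uses the standard identity that the KL divergence between two Gaussians with the same isotropic covariance $σ^2 I$ equals the squared distance between their means divided by $2σ^2$, which is one way such a loss can be read as a KL term.

```python
import numpy as np

def backward_coherence_loss(H, W, b):
    """Mean over t of ||h_t - g(h_{t+1})||^2 with a linear projector g(h) = h @ W + b.

    H has shape (T, d): one hidden state per time step.
    W (d, d) and b (d,) are hypothetical projector parameters.
    """
    H = np.asarray(H, dtype=float)
    recon = H[1:] @ W + b           # g(h_{t+1}) for t = 0 .. T-2
    resid = H[:-1] - recon          # h_t - g(h_{t+1})
    return float(np.mean(np.sum(resid ** 2, axis=1)))

def gaussian_kl_same_cov(mu_p, mu_q, sigma):
    """KL(N(mu_p, sigma^2 I) || N(mu_q, sigma^2 I)) = ||mu_p - mu_q||^2 / (2 sigma^2)."""
    diff = np.asarray(mu_p, dtype=float) - np.asarray(mu_q, dtype=float)
    return float(np.sum(diff ** 2) / (2.0 * sigma ** 2))

# A constant trajectory is perfectly reconstructed by the identity projector.
H_const = np.ones((4, 3))
loss_const = backward_coherence_loss(H_const, np.eye(3), np.zeros(3))

# A trajectory that moves by one unit has a per-step squared error of 1.
H_move = np.array([[0.0, 0.0], [1.0, 0.0]])
loss_move = backward_coherence_loss(H_move, np.eye(2), np.zeros(2))
```

In training, a term like this would be added to the task loss with a weight; the weighting and the form of the projector used in the paper are not stated in the abstract.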


2606.08932 2026-06-09 cs.CL cs.AI cs.CE 新提交

From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing

从法规到控制流：基于跨度义务树的可废止范围解析

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Sun Yat-Sen University（中山大学）

AI总结提出NormBench基准和跨度义务树（SG-DT）中间表示，用于诊断和缓解规则遵循模型中的静默范围遗漏（SSO）问题，揭示递归衰减和可审计性陷阱两种病理，并通过约束输出改善树结构保真度。

详情

AI中文摘要

执行政策和法规的规则遵循代理常常因静默范围遗漏（SSO）而失败：模型应用一般规则但静默地丢弃嵌套的例外或反例外，产生看似合规但在重要边缘案例上失效的输出。尽管此类失败常被视为代理系统问题，其根本瓶颈在于法规和政策理解——这一能力通常在法律NLP中研究。然而，大多数现有法律NLP基准强调最终任务结果，可能忽略导致SSO的结构性遗漏。为诊断和缓解SSO，我们引入NormBench，一个包含2290条条款的基准，涵盖中文（法律和地方政策）、英文（美国税法、GDPR和企业政策）及跨语言设置，专为可废止范围解析设计：精确识别哪个条款覆盖哪个。NormBench使用基于跨度义务树（SG-DT），一种编译器式中间表示，将每个逻辑分支锚定到源跨度并要求显式排除守卫，实现确定性编译和审计。对前沿LLM的评估揭示了两种反复出现的病理：（1）递归衰减，性能随击败者深度增加急剧下降；（2）可审计性陷阱，模型检索相关跨度但未能组装正确的控制流。使用SG-DT作为约束中间输出可改善整树保真度和击败者恢复，下游实验表明其效用是机制特定的：增益集中在例外活跃、易SSO的案例上，而当附加结构不必要或解析器保真度低时，总体准确率可能参差不齐。

英文摘要

Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
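The failure mode described above can be made concrete with a toy example. The sketch below is not the SG-DT schema from the paper: the `ClauseNode` class, its fields, and the toy statute are hypothetical, invented only to show how dropping a nested counter-exception changes the outcome while the general rule still appears to be applied.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class ClauseNode:
    """Hypothetical node: one clause plus the clauses that defeat it."""
    span: Tuple[int, int]                           # character offsets into the source text
    condition: Callable[[Dict[str, bool]], bool]    # when the clause is triggered
    defeaters: List["ClauseNode"] = field(default_factory=list)

    def applies(self, facts: Dict[str, bool]) -> bool:
        # A clause is in force when its own condition holds
        # and none of its defeaters is itself in force.
        if not self.condition(facts):
            return False
        return not any(d.applies(facts) for d in self.defeaters)

# General rule -> exception -> counter-exception (toy statute, invented for illustration).
counter_exception = ClauseNode((120, 160), lambda f: f["non_resident"])
exception = ClauseNode((60, 119), lambda f: f["low_income"], [counter_exception])
rule = ClauseNode((0, 59), lambda f: True, [exception])

# The same rule with the nested counter-exception silently dropped.
flattened_exception = ClauseNode((60, 119), lambda f: f["low_income"])
flattened_rule = ClauseNode((0, 59), lambda f: True, [flattened_exception])

facts = {"low_income": True, "non_resident": True}
full_result = rule.applies(facts)
flattened_result = flattened_rule.applies(facts)
```

With these facts the full tree keeps the general rule in force, because the counter-exception defeats the exception, while the flattened tree does not. Keeping a source span on every node is what would let an auditor trace each branch back to the text.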


2606.08926 2026-06-09 cs.LG 新提交

PROBE-Web: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

PROBE-Web：用于探究知识图谱补全模型评估景观的交互式系统

Sooho Moon, Yunyong Ko

发表机构 * Chung-Ang University（中央大学）

AI总结提出PROBE-Web交互系统，通过调整预测锐度和流行度偏差鲁棒性两个视角，灵活评估KGC模型，并提供四种关键功能。

详情

Comments: 4 pages, 6 figures, 1 table

AI中文摘要

知识图谱补全（KGC）模型通常使用基于排名的指标（如MRR和Hits@K）进行评估，尽管不同的用户通常需要不同的评估视角。在本演示中，我们介绍PROBE-Web，一个用于探究KGC模型多样化评估景观的交互式系统。PROBE-Web使用户能够通过调整两个关键视角（P1）预测锐度和（P2）流行度偏差鲁棒性来灵活评估KGC模型。通过用户友好的GUI，用户可以轻松评估多个KGC模型并分析其优缺点。PROBE-Web提供四个关键功能：（1）传统评估工具包，（2）灵活的视角感知评估，（3）可解释的案例研究，以及（4）评估景观探索。我们相信PROBE-Web可以帮助用户更好地理解与其目标一致的KGC模型。

英文摘要

Knowledge graph completion (KGC) models are commonly evaluated using rank-based metrics such as MRR and Hits@K, even though different users often require different evaluation perspectives. In this demo, we present PROBE-Web, an interactive system for probing diverse evaluation landscapes for KGC models. PROBE-Web enables users to flexibly evaluate KGC models by adjusting two critical perspectives: (P1) predictive sharpness and (P2) popularity-bias robustness. Through a user-friendly GUI, users can easily evaluate multiple KGC models and analyze their strengths and weaknesses. PROBE-Web provides four key functionalities: (1) a conventional evaluation toolkit, (2) flexible perspective-aware evaluation, (3) explainable case studies, and (4) evaluation landscape exploration. We believe that PROBE-Web can help users better understand which KGC models align with their objectives.
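For readers unfamiliar with the conventional metrics mentioned above, here is a minimal sketch of MRR and Hits@K computed from the rank of the correct entity for each test query. The function names are chosen here for illustration; this shows only the standard metrics, not the PROBE framework itself.

```python
import numpy as np

def mean_reciprocal_rank(ranks):
    """MRR: average of 1 / rank, where rank 1 is the best possible position."""
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks))

def hits_at_k(ranks, k):
    """Hits@K: fraction of queries whose correct entity is ranked within the top k."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= k))

# Ranks of the correct entity for three test triples.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)   # (1 + 1/2 + 1/4) / 3
hits3 = hits_at_k(ranks, k=3)       # two of the three ranks are <= 3
```

Both metrics apply a fixed transformation to each rank and then average uniformly over queries, which is why a single number cannot express different preferences about sharpness or popularity bias.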


2606.08922 2026-06-09 cs.RO 新提交

PTDL: Multi-Terrain Fall Recovery via Phase-Terrain Decoupled Learning

PTDL：多地形摔倒恢复的相位-地形解耦学习

Xiaoyu Xu, Zhiming Chen, Yuenan Zhao, Ran Song, Wei Zhang

发表机构 * School of Control Science and Engineering, Shandong University（山东大学控制科学与工程学院）； Key Laboratory of Machine Intelligence and System Control, Ministry of Education（教育部机器智能与系统控制重点实验室）

AI总结提出相位-地形解耦学习（PTDL），通过解耦训练监督的相位和地形轴，实现单一本体感知策略下的多地形摔倒恢复与行走过渡。

详情

AI中文摘要

人形机器人可能在非结构化环境中的斜坡、砾石和不平地面上摔倒。我们目标是集成摔倒恢复与运动：仅使用本体感知从摔倒状态重建平衡，并在摔倒地点恢复速度指令行走。先前方法通常止于准静态起身，忽略摔倒后地面接触阶段，或者在混合地形上训练时未分离恢复与运动阶段或每表面约束，导致跨表面退化为单一妥协起身。我们提出相位-地形解耦学习（PTDL），在部署单一本体感知策略的同时，沿相位和地形轴解耦训练监督。在相位轴上，投影重力门控双运动先验判别器和探测-行走过渡链接将摔倒后恢复与指令行走连接。在地形轴上，地形分层恢复塑形在平坦地面、碎石和斜坡上分配表面特定的训练监督；地形标签仅用于训练，不提供给策略观测，从而在部署时实现隐式摔倒后策略选择。我们在29自由度Unitree G1上，在仿真和硬件中验证了PTDL在平坦地面、碎石和最高20度斜坡上的表现，实现了稳定的跨地形恢复、平滑的恢复-运动过渡以及单一部署策略下的差异化摔倒后起身行为。

英文摘要

Humanoid robots can fall on slopes, gravel, and uneven ground in unstructured environments. We target integrated fall recovery and locomotion: rebuilding balance from a fallen state using proprioception alone and resuming velocity-commanded walking at the fall site. Prior methods often stop at quasi-static rise, neglect the post-fall ground-contact phase, or, when trained on mixed terrains without separating recovery and locomotion phases or per-surface constraints, collapse to a single compromise get-up across surfaces. We propose Phase-Terrain Decoupled Learning (PTDL), which decouples training supervision along phase and terrain axes while deploying one proprioceptive policy. On the phase axis, projected-gravity-gated dual motion-prior discriminators and a probe-to-walk transition link post-fall recovery to commanded walking. On the terrain axis, terrain-stratified recovery shaping assigns surface-specific training supervision on flat ground, gravel, and slopes; terrain labels are training-only and withheld from policy observations, enabling implicit post-fall strategy selection at deployment. We validate PTDL on a 29-DoF Unitree G1 across flat ground, gravel, and slopes up to 20 degrees in simulation and on hardware, achieving stable cross-terrain recovery, smooth recovery-to-locomotion transitions, and differentiated post-fall rise behaviors under one deployed policy.


2606.08921 2026-06-09 cs.LG 新提交

Generalized Rank-based Evaluation for Knowledge Graph Completion: Perspectives, Framework, and Analyses

基于排序的知识图谱补全广义评估：视角、框架与分析

Sooho Moon, Jian Kang, Yunyong Ko

发表机构 * Chung-Ang University（中央大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结针对现有评估指标忽视预测锐度与流行偏差鲁棒性的问题，提出广义评估框架PROBE，通过排序变换器和排序聚合器实现更全面、灵活且一致的模型评估。

详情

Comments: 25 pages, 12 figures, 5 tables

AI中文摘要

知识图谱补全（KGC）旨在从观测知识图谱（KG）中预测缺失事实，在药物发现、推荐系统和检索增强生成（RAG）等广泛实际应用中发挥关键作用。尽管已有众多KGC模型被提出，但KGC的评估仍未被充分探索，尽管其在可靠评估模型性能和为实际应用选择合适的模型中至关重要。本文引入了KGC评估中两个被现有评估指标忽视的重要视角：（P1）预测锐度和（P2）流行偏差鲁棒性。为同时解决这两个视角，我们提出一个广义评估框架PROBE，它由一个排序变换器（RT）和一个排序聚合器（RA）组成，其中RT基于期望的预测锐度水平估计每个预测的得分，RA根据期望的流行偏差鲁棒性水平聚合所有预测得分以确定最终评估得分。我们通过定义可靠KGC评估的六个关键属性对PROBE进行理论分析，并证明PROBE满足所有属性，而现有指标未能满足部分属性。特别地，由于KG的开放世界特性，评估指标应即使在仅观测到不完整事实时也能保持KGC模型的相对性能。我们表明PROBE能更好地维持这种一致性，从而比现有指标更可靠地估计模型的内在性能。在六个真实KG上使用六个KGC模型进行的大量实验表明，现有指标可能根据不同的评估视角高估或低估模型性能，而PROBE能够实现更全面、灵活且一致的KGC模型评估。

英文摘要

Knowledge graph completion (KGC) aims to predict missing facts from an observed knowledge graph (KG), playing a crucial role in a wide range of real-world applications such as drug discovery, recommender systems, and retrieval-augmented generation (RAG). Although numerous KGC models have been proposed, the evaluation of KGC remains underexplored, despite its critical role in reliably assessing model performance and selecting appropriate models for real-world applications. In this paper, we introduce two important perspectives for KGC evaluation that are overlooked by existing evaluation metrics, (P1) predictive sharpness and (P2) popularity-bias robustness. To address both perspectives, we propose a generalized evaluation framework, PROBE, which consists of a rank transformer (RT) that estimates the score of each prediction based on a desired level of predictive sharpness and a rank aggregator (RA) that determines the final evaluation score by aggregating all prediction scores according to a desired level of popularity-bias robustness. We theoretically analyze PROBE by defining six key properties for reliable KGC evaluation and prove that PROBE satisfies all the properties, while existing metrics fail to satisfy some. In particular, due to the open-world nature of KGs, an evaluation metric should preserve the relative performance of KGC models even when only incomplete facts are observed. We show that PROBE better maintains such consistency, providing a more reliable estimate of intrinsic model performance than existing metrics. Extensive experiments with six KGC models on six real-world KGs reveal that existing metrics may over- or under-estimate model performance depending on different evaluation perspectives, whereas PROBE enables a more comprehensive, flexible, and consistent evaluation of KGC models.


2606.08920 2026-06-09 cs.CV cs.AI 新提交

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

PolyBuild: 一种从高分辨率遥感图像中提取多边形建筑物轮廓的端到端方法

Yaoteng Zhang, Julin Zhang, Guangshuai Wang, Jiwei Deng, Hui Sheng, Yasir Muhammad, Shiqing Wei

发表机构 * China University of Petroleum (East China)（中国石油大学（华东））； South Surveying&Mapping Instrument Co.,Ltd.（南方测绘仪器有限公司）； China Railway Design Corporation（中国铁路设计集团有限公司）

AI总结提出端到端方法PolyBuild，通过初始轮廓生成模块和轮廓优化模块直接从遥感图像提取矢量多边形建筑物轮廓，无需后处理，性能优于现有方法。

详情

Comments: Accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)

AI中文摘要

从高分辨率遥感图像中提取建筑物多边形轮廓是各种地图应用的基本任务。然而，不同的成像条件和复杂的建筑结构使得自动轮廓提取极具挑战性。主流的建筑物提取方法通常依赖于像素级分割，随后进行多个后处理步骤以生成建筑物轮廓，这计算量大且容易出错。在本文中，我们提出了一种名为PolyBuild的端到端方法，该方法可以直接从高分辨率遥感图像中提取建筑物矢量多边形，无需任何后处理操作。该方法利用两个主要模块：初始轮廓生成模块（ICGM）和轮廓优化模块（COM）。ICGM通过利用每个建筑物实例的拼接子区域中心特征来生成初始建筑物轮廓。它通过生成边界框并使用四个子区域的中心特征来表示每个建筑物，同时进行目标检测和初始轮廓提取。轮廓优化模块（COM）通过在基于Transformer的解码器中迭代集成卷积神经网络（CNN）特征和轮廓位置信息，进一步细化生成的建筑物轮廓。混合CNN-Transformer架构有效捕获建筑物轮廓内的局部和全局空间关系，确保高质量的边界描绘。在三个建筑物数据集上进行了大量实验以评估PolyBuild的性能。结果表明，PolyBuild显著优于最先进的方法，包括基于掩码和基于轮廓的方法。

英文摘要

Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, varying imaging conditions and complex building structures make automatic contour extraction extremely challenging. Mainstream approaches for building extraction often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contours, which can be computationally intensive and prone to errors. In this paper, we propose an end-to-end method named PolyBuild, which can directly extract building vector polygons from high-resolution remote sensing images without the need for any post-processing operations. The proposed method leverages two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM is designed to generate an initial building contour by utilizing concatenated sub-region center features for each building instance. It performs simultaneous object detection and initial contour extraction by generating bounding boxes and using the center features of four sub-regions to represent each building. The COM further refines the generated building contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. The hybrid CNN-Transformer architecture effectively captures both local and global spatial relationships within the building contour, ensuring high-quality boundary delineation. Extensive experiments are conducted on three building datasets to evaluate the performance of PolyBuild. The results demonstrate that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.


2606.08918 2026-06-09 cs.CV 新提交

When Vision Misleads, Let Location Speak: A Worldwide Image Geo-Localization Method via Location Attention Mechanism and Large Multimodal Models

当视觉误导时，让位置说话：基于位置注意力机制和大多模态模型的全球图像地理定位方法

Junchao Cui, Wenqi Shi, Xuanzi Ma, Nan Wu, Shaoyong Du, Xiangyang Luo

发表机构 * Henan Key Laboratory of Cyberspace Situation Awareness（河南省网络空间态势感知重点实验室）； Information Engineering University（信息工程大学）

AI总结提出TransGeoCLIP框架，通过位置注意力机制和大多模态模型，解决视觉相似图像导致的地理定位错误问题，在多个基准上显著提升定位精度。

详情

Comments: Submitted to IEEE Transactions on Multimedia in March 2026

AI中文摘要

全球图像地理定位旨在确定图像在全球范围内的拍摄位置。现有方法通常通过将图像与来自不同地理区域的视觉相似场景匹配而导致定位错误，限制了实际应用中的可靠性。为解决此问题，我们提出TransGeoCLIP，一种新颖的基于检索的框架，集成了位置注意力机制和大规模多模态模型（LMMs）。使用带有位置注意力的Transformer编码器对GPS坐标进行编码，TransGeoCLIP能够有效区分视觉相似图像中的地理特征。该框架包括两个阶段：1）检索数据库构建，采用配备位置注意力机制的Transformer对标记的GPS坐标进行编码并增强位置语义，随后通过CLIP实现图像-文本-GPS联合嵌入；2）检索增强推理，利用LMMs从检索到的数据库结果中推断最终图像位置预测。在包括IM2GPS、IM2GPS3k、YFCC4k和YFCC26k在内的多个数据集上的广泛实验结果表明，TransGeoCLIP显著提升了视觉相似图像的定位性能。特别是，街道级定位精度（误差在1公里内）大幅提升，在这些基准上分别超过最先进方法1.5%、1.07%、7.18%和9.75%。

英文摘要

Worldwide image geo-localization aims to determine the capture location of an image on a global scale. Existing methods often mislocalize images by matching them to visually similar scenes from different geographic regions, which limits reliability in practical applications. To address this issue, we propose TransGeoCLIP, a novel retrieval-based framework that integrates a location attention mechanism and large multimodal models (LMMs). By using a Transformer encoder with location attention to encode GPS coordinates, TransGeoCLIP can effectively distinguish geographic features among visually similar images. The framework consists of two stages: 1) retrieval database construction, which employs Transformers equipped with location attention mechanisms to encode labeled GPS coordinates and enhance location semantics, and then enables joint image-text-GPS embedding through CLIP; 2) retrieval-augmented inference, which leverages LMMs to infer the final image location prediction from retrieved database results. Extensive experimental results on diverse datasets, including IM2GPS, IM2GPS3k, YFCC4k, and YFCC26k, demonstrate that TransGeoCLIP significantly enhances localization performance for visually similar images. In particular, street-level localization accuracy (within 1 km error) is substantially improved, surpassing state-of-the-art methods by 1.5%, 1.07%, 7.18%, and 9.75% on these benchmarks, respectively.
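The street-level metric quoted above (error within 1 km) is typically computed from the great-circle distance between predicted and true coordinates. The sketch below shows one common way to do this with the haversine formula; the function names and the mean Earth radius of 6371 km are assumptions made here, and the paper's exact evaluation code may differ.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def accuracy_within(preds, truths, threshold_km=1.0):
    """Fraction of predictions whose error is at most `threshold_km`."""
    hits = sum(
        haversine_km(p[0], p[1], t[0], t[1]) <= threshold_km
        for p, t in zip(preds, truths)
    )
    return hits / len(preds)

# One exact prediction and one that is off by a full degree of longitude at the equator.
preds = [(48.8584, 2.2945), (0.0, 1.0)]
truths = [(48.8584, 2.2945), (0.0, 0.0)]
street_level_acc = accuracy_within(preds, truths, threshold_km=1.0)
```

Coarser thresholds (for example city, region, country, and continent levels) can be evaluated with the same function by changing `threshold_km`.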


2606.08908 2026-06-09 cs.CV cs.AI 新提交

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

面向光刻缺陷检测的视觉-语言模型失败感知精炼

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

发表机构 * Hanyang University（汉阳大学）； Korea University（高丽大学）； Korea Institute of Industrial Technology（韩国生产技术研究院）

AI总结提出两阶段视觉-语言框架，先微调Qwen3-VL检测缺陷，再通过训练精炼模块修正第一阶段错误，提升检测可靠性。

详情

Comments: 6 pages, 3 figures

AI中文摘要

半导体光刻检测需要可靠地检测微小图案缺陷，如桥接、毛刺、针孔和污染。在本研究中，我们提出了一种两阶段视觉-语言框架，结合了初始缺陷检测与预测精炼。在第一阶段，使用LoRA微调Qwen3-VL作为视觉-语言适配器，从光刻图像中预测缺陷数量、缺陷类别和归一化边界框。然而，直接微调仍可能产生常见的测试时错误，包括误报、漏检和错误缺陷类型。为解决此限制，第二阶段使用第一阶段预测失败及其修正标签训练精炼模块，使模型能够审查和修正初始输出。通过从初始适配器失败的案例中学习，精炼过程改善了超越单阶段微调的缺陷推理。

英文摘要

Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. In this study, we propose a two-stage vision-language framework that combines initial defect detection with prediction refinement. In the first stage, Qwen3-VL is fine-tuned with LoRA as a vision-language adapter to predict defect counts, defect categories, and normalized bounding boxes from lithography images. However, direct fine-tuning may still produce common test-time errors, including false positives, missed defects, and incorrect defect types. To address this limitation, the second stage trains a refinement module using first-stage prediction failures and their corrected labels, allowing the model to review and revise initial outputs. By learning from cases where the initial adapter fails, the refinement process improves defect inference beyond single-stage fine-tuning.


2606.08906 2026-06-09 cs.CV 新提交

DifferSeg: Towards Diverse Multimodal Binary Segmentation via Differential Perception and Frequency Guidance

DifferSeg: 通过差分感知与频率引导实现多样化的多模态二值分割

Qiangqiang Zhou, Jiawei Xu, Yong Chen, Dandan Zhu, Yugen Yi, Xiaoqi Zhao

发表机构 * School of Artificial Intelligence, Jiangxi Normal University（江西师范大学人工智能学院）； Institute of AI Education, East China Normal University（华东师范大学人工智能教育研究所）； Yale School of Medicine, Yale University（耶鲁大学医学院）

AI总结提出DifferSeg框架，通过差分感知融合模块自适应对齐多模态特征，并设计频率引导解码器平衡高低频表示，在29个公开数据集上超越67种方法。

详情

AI中文摘要

在许多二值分割任务中，大多数多模态方法依赖于固定的特征拼接进行跨模态交互，以及由低频语义主导的简单解码器设计。然而，它们忽略了两个关键挑战：一是缺乏处理模态差异和互补性的自适应机制，二是缺少平衡高低频表示的高效解码策略。在这项工作中，我们提出了一个简单而通用的多模态二值分割框架，称为DifferSeg，以同时解决这两个问题。借助差分感知融合（DPF）模块，DifferSeg使用可学习的差分算子自适应地对齐多模态特征，并通过残差融合增强其互补性，有效缓解模态不匹配和融合冗余。此外，我们设计了一个频率引导解码器（FGD），构建跨频率交互和多路径上采样，以保持细节高频结构与语义低频表示之间的一致性，确保细粒度边界恢复和噪声抑制。得益于这些设计，DifferSeg可以轻松泛化到各种二值分割任务，包括自然和医学模态。无需额外技巧，它在涉及18个下游任务的29个公开数据集上持续超越67种最先进方法，展示了卓越的泛化能力和分割精度。代码和预训练模型将在链接处提供。

英文摘要

In many binary segmentation tasks, most multimodal methods rely on fixed feature concatenation for cross-modal interaction and straightforward decoder designs dominated by low-frequency semantics. However, they ignore two key challenges: one is the lack of an adaptive mechanism to handle modality discrepancies and complementarity, and the other is the absence of an efficient decoding strategy to balance both high- and low-frequency representations. In this work, we propose a simple yet general multimodal binary segmentation framework, termed DifferSeg, to address both problems simultaneously. With the help of the differential perception fusion (DPF) module, DifferSeg employs learnable differential operators to adaptively align multimodal features and enhance their complementarity through residual fusion, effectively mitigating modality mismatch and fusion redundancy. In addition, we design a frequency-guided decoder (FGD) that builds cross-frequency interactions and multi-path upsampling to maintain consistency between detailed high-frequency structures and semantic low-frequency representations, ensuring fine-grained boundary recovery and noise suppression. Benefiting from these designs, DifferSeg can be easily generalized to diverse binary segmentation tasks, including both natural and medical modalities. Without bells and whistles, it consistently surpasses 67 state-of-the-art methods across 29 public datasets involving 18 downstream tasks, demonstrating superior generalization and segmentation accuracy. Code and pretrained models will be available at the Link.


2606.08903 2026-06-09 cs.LG 新提交

Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records

合成但不真实：结构化电子病历生成建模中的评估挑战

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

发表机构 * Centre for Big Data Research in Health, the University of New South Wales（新南威尔士大学健康大数据研究中心）

AI总结针对合成电子病历评估过度依赖统计相似性而忽视临床有效性的问题，提出基于流行病学的多维度评估框架，发现当前生成模型虽能复现边缘分布，但无法同时保持亚组结构、效应估计和依赖关系，导致评估高估数据质量。

详情

AI中文摘要

合成医疗数据被广泛提议作为真实患者数据的隐私保护替代品，但其评估仍然以统计相似性和预测性能为主，这些并不能反映临床有效性。我们引入了一个基于流行病学的多维度评估框架，评估描述性保真度、临床实用性和结构有效性，分别对应描述性、预测性和因果性问题。我们使用PRIME-CVD（一个具有已知真实结构的5万人队列）评估了四种代表性生成范式——基于GAN、VAE增强、基于扩散和掩码建模。虽然所有模型都再现了边缘分布，但没有一个能同时保留亚组结构、效应估计和依赖结构。值得注意的是，具有强分布保真度的模型可能表现出较差的校准和扭曲的关系，导致不可靠的推断。这些结果表明，当前的评估实践可能高估了合成数据质量，并促使基于支持有效临床和科学结论的能力进行领域知情的评估。

英文摘要

Synthetic healthcare data are widely proposed as privacy-preserving substitutes for real patient data, yet their evaluation remains dominated by statistical similarity and predictive performance, which do not reflect clinical validity. We introduce a multi-dimensional evaluation framework grounded in epidemiology, assessing descriptive fidelity, clinical utility, and structural validity, corresponding to descriptive, predictive, and causal questions. We evaluate four representative generative paradigms (GAN-based, VAE-boosted, diffusion-based, and masked modelling) using PRIME-CVD, a 50,000-person cohort with known ground-truth structure. While all models reproduce marginal distributions, none simultaneously preserves subgroup structure, effect estimates, and dependency structure. Notably, models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.


2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM 新提交

A multi-agent system for spine MRI report generation from multi-sequence imaging

基于多序列影像的脊柱MRI报告生成多智能体系统

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

发表机构 * University of Washington（华盛顿大学）； Peking University（北京大学）； University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； New York University（纽约大学）； University of Washington Medical Center（华盛顿大学医学中心）

AI总结提出SpineAgent多智能体框架，利用多序列基础模型整合T1/T2等序列信息，实现脊柱MRI报告生成、病理定位和图文检索，在跨厂商和跨队列评估中表现优异。

详情

AI中文摘要

脊柱病理是全球疼痛和残疾的主要原因之一。脊柱MRI是临床评估的核心，但其解读仍然复杂且耗时，需要整合多个成像序列和解剖区域的信息。尽管自动化MRI分析最近取得了进展，但如何有效结合多序列数据同时保留序列特异性诊断信息仍是一个开放挑战。本文提出SpineAgent，一个基于多序列基础模型的脊柱MRI报告生成多智能体框架，该模型在来自32,047名患者和453,683个MRI系列（总计13,441,191张MRI切片）的常规临床数据上训练。为了适应不同模态的序列，我们首先分别在T1和T2加权序列上预训练两个基于DINOv3的编码器。然后，我们引入一种持续训练策略，学习一个合成器，利用T1和T2编码器嵌入其他序列的图像，生成整合MRI序列间各种信号的患者级嵌入。利用这些嵌入，SpineAgent实现了最先进的性能，并在跨制造商和跨队列评估中展现出强大的泛化能力。除了分类，SpineAgent通过识别与发现相关的切片和分割病理区域实现病理定位。它还支持多模态图像-报告检索，为可扩展和可解释的MRI报告生成提供了坚实基础。我们进一步将这些经过验证的SpineAgent能力集成到37个专门智能体中。最后，我们将它们的输出作为结构化标记，整合到一个端到端训练用于报告生成的医疗报告智能体中。通过自动指标和五位放射科医生的专家评估，SpineAgent在脊柱MRI报告生成中取得了领先性能。

英文摘要

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse modalities of sequences, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance, and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.


2606.08896 2026-06-09 cs.AI 新提交

FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

FAME: 面向异构时间序列预测的可预测性感知专家混合模型

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Tao Peng, Jia Wei

发表机构 * Sun Yat-sen University（中山大学）； Guangdong Key Laboratory of Information Security Technology（广东省信息安全技术重点实验室）； Ministry of Education Key Laboratory of Machine Intelligence and Advanced Computing（教育部机器智能与先进计算重点实验室）

AI总结针对大规模异构时间序列预测中单一模型性能不足的问题，提出可预测性感知的稀疏专家混合框架FAME，通过多维可预测性指纹和成本感知路由，在工业数据集上实现12.4%的MSE降低。

详情

AI中文摘要

大规模零售和工业预测系统包含许多异构时间序列，其生命周期、稀疏性、波动性、季节性、频谱模式和上下文敏感性差异很大。单一预测模型很少能在所有情况下表现良好，而密集集成会增加推理成本并提供有限的专家适用性洞察。本文研究可预测性感知的专家路由：学习数据特征如何决定预测专家的适用性。我们提出FAME，一个稀疏专家混合框架，用多维可预测性指纹表示每个序列，从验证性能中挖掘专家适用性目标，并训练一个成本感知的稀疏路由器，为每个序列激活少量预算的专家集。使用山东新北洋（SNBC）的生产规模自动售货机销售数据集（其中预测组件已集成到补货计划管道中）以及公共零售基准，我们表明专家适用性在不同数据情况下系统性地变化。在拥有5000+台机器和6000万+交易的工业数据集上，FAME Top-2相比最强单一专家LightGBM降低了12.4%的MSE，同时平均每个序列执行1.92个专家。部署的组件产生需求预测，而库存导向的收益通过离线回放模拟器在固定补货策略下估计，而非在线干预。该框架将异构销售预测从启发式模型选择转变为可预测性模式和专家专业化的数据挖掘。代码可在https://github.com/hit636/FAME获取。

英文摘要

Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME
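The sparse routing idea can be sketched as a generic top-k mixture-of-experts combiner. This is an illustration under stated assumptions, not the FAME implementation: the function names, the linear cost penalty, and the softmax weighting over the selected experts are hypothetical choices, and the router logits are taken as given rather than learned from a forecastability fingerprint.

```python
import numpy as np

def route_top_k(logits, costs, k=2, cost_weight=0.0):
    """Pick k experts per series from cost-penalised router logits.

    logits: (N, E) router scores, one row per series.
    costs:  (E,) hypothetical per-expert inference cost.
    Returns the selected expert indices (N, k) and softmax weights over them (N, k).
    """
    adjusted = np.asarray(logits, dtype=float) - cost_weight * np.asarray(costs, dtype=float)
    top = np.argsort(-adjusted, axis=1)[:, :k]                 # indices of the k largest scores
    top_logits = np.take_along_axis(adjusted, top, axis=1)
    w = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return top, w

def combine_forecasts(forecasts, top, weights):
    """Weighted sum of the selected experts' forecasts.

    forecasts: (N, E, H) forecast of every expert over horizon H. In a real system
    only the selected experts would be executed; all are given here for simplicity.
    """
    chosen = np.take_along_axis(forecasts, top[:, :, None], axis=1)   # (N, k, H)
    return (weights[:, :, None] * chosen).sum(axis=1)                 # (N, H)

# One series, three experts; expert e forecasts the constant value e.
logits = np.array([[0.0, 2.0, 1.0]])
costs = np.zeros(3)
forecasts = np.array([[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]])
top, weights = route_top_k(logits, costs, k=2)
blended = combine_forecasts(forecasts, top, weights)
```

Raising `cost_weight` shifts the selection towards cheaper experts, which is one simple way to trade accuracy against inference cost.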


2606.08894 2026-06-09 cs.CV cs.CL 新提交

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

推理视觉语言模型对语义视觉干扰具有鲁棒性吗？

Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang, Ziyi Wang, Hao Li, Yang Cui, Wenhao Cai, Jingyu Sun, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun

发表机构 * University of Manchester（曼彻斯特大学）； Marex ； Imperial College London（帝国理工学院）

AI总结针对推理VLM在真实场景中易受语义视觉干扰的问题，提出Distract-Bench基准，发现推理VLM对语义干扰的鲁棒性低于感知退化，且干扰常被纳入推理过程导致错误答案。

详情

AI中文摘要

推理视觉语言模型（VLM）在复杂多模态任务上表现强劲，但可靠的现实应用需要处理比干净、精心策划的基准更混乱的视觉输入。现有工作主要通过输入损坏（如噪声、模糊和天气效果）来评估VLM的可靠性，这些损坏使视觉证据更难感知。这留下了一个关键可靠性失败模式未被充分探索：模型可能正确感知证据，却从看似合理但无关且分散注意力的证据中进行推理，并将此错误传播到最终答案。为填补这一空白，我们引入了Distract-Bench，一个用于评估VLM对语义视觉干扰鲁棒性的基准，定义为添加到输入中、保留真实答案但具有意义且与任务无关的视觉线索。我们全面评估了八个领先的开源和两个闭源VLM，涵盖传统视觉损坏和Distract-Bench。结果表明，Distract-Bench暴露了一种与视觉损坏不同的鲁棒性失败：推理VLM在感知退化下基本跟踪其非推理基础模型，但对语义干扰的鲁棒性始终较低。进一步分析表明，这些干扰常常进入VLM的推理过程，被当作证据，并导致错误答案。总之，这些发现重新定义了推理VLM的鲁棒性评估，将焦点从退化感知转向干扰，以实现可靠的现实世界视觉推理。我们的数据和代码可在https://github.com/Yizheng-Sun/Distract-Bench获取。

英文摘要

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce Distract-Bench, a benchmark for evaluating VLM robustness to semantic visual distractions, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.

2606.08893 2026-06-09 cs.LG cs.AI cs.CR New submission

Cheap Reward Hacking Detection

Iván Belenky, Joaquín Itria, Steven Johns

Affiliations: Tamarillo

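AI summary: Proposes a small Transformer encoder that maps trajectories onto the unit sphere so that embedding distance approximates the L1 distance over reward and metadata; a linear probe on these embeddings detects reward hacking with an AUC of 0.9467, at a cost four orders of magnitude lower than LLM-as-judge.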
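
The two ingredients named in the summary, a distance-matching objective on the unit sphere and a linear probe, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the use of Euclidean distance between normalized embeddings, the squared-error loss, the logistic probe, and all function names are hypothetical. Since two points on the unit sphere are at most 2 apart, the L1 targets are assumed to be scaled into that range.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def l1_distance(a, b):
    """L1 distance between two target vectors (e.g. reward plus metadata features)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_loss(emb_a, emb_b, target_a, target_b):
    """Squared gap between the embedding distance and the L1 target distance."""
    d_emb = euclidean_distance(normalize(emb_a), normalize(emb_b))
    d_target = l1_distance(target_a, target_b)
    return (d_emb - d_target) ** 2


def probe_score(embedding, weights, bias):
    """Linear probe: logistic score that a trajectory is reward hacking."""
    z = sum(w * x for w, x in zip(weights, normalize(embedding))) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

The encoder itself is omitted here; in this sketch it would be whatever model produces `emb_a` and `emb_b`, trained to drive `pair_loss` toward zero over sampled trajectory pairs.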

2606.08892 2026-06-09 cs.LG New submission

Diffuse AI Control on Fuzzy Tasks

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton

Affiliations: Anthropic Fellows Program (via MATS); EPFL; Redwood Research; Anthropic

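AI summary: Addresses diffuse threats from AI deployed over long horizons on fuzzy tasks with a blue-team versus red-team adversarial framework. The blue team trains a strong model against a score produced by a weak model; the red team uses multi-objective evolutionary prompt optimization to find subversive behaviors that score highly but perform poorly; the blue team then improves robustness through adversarial optimization.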

English abstract

AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e., tasks that are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a novel framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus 4.6 can write proposals that are worse according to the ground-truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. To mitigate the threat, we propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.
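
The red team's search has two objectives: a high weak score and a low ground-truth proxy score. One common building block of multi-objective evolutionary methods is Pareto selection, sketched below. This is an illustration of that selection step only, not the authors' optimizer; the `Candidate` type, field names, and scores are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    prompt: str
    weak_score: float   # score from the weak trusted scorer (red team wants high)
    proxy_score: float  # ground-truth proxy score (red team wants low)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on both red-team objectives and strictly better on one."""
    no_worse = a.weak_score >= b.weak_score and a.proxy_score <= b.proxy_score
    strictly_better = a.weak_score > b.weak_score or a.proxy_score < b.proxy_score
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Made-up scores for four candidate red-team prompts.
population = [
    Candidate("A", weak_score=8.0, proxy_score=7.0),
    Candidate("B", weak_score=8.0, proxy_score=3.0),
    Candidate("C", weak_score=5.0, proxy_score=2.0),
    Candidate("D", weak_score=6.0, proxy_score=5.0),
]
front = pareto_front(population)
```

In an evolutionary loop, the surviving front would typically be mutated to produce the next generation of prompts; that step is left out of this sketch.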

2606.08881 2026-06-09 cs.RO cs.AI New submission

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Yi Yu, Xinchuan Qiu

Affiliations: Graduate School of Advanced Science and Engineering, Hiroshima University

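AI summary: Presents a benchmark on the low-cost SO-101 robot platform that systematically compares VLA and imitation learning policies using a failure taxonomy and recovery-aware evaluation metrics, and finds that execution instability is the dominant failure source.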

Comments: 13 pages, 9 figures

English abstract

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $π_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.
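
To show how evaluation can go beyond binary task success, the sketch below aggregates per-episode records into a success rate, a semantic versus execution failure breakdown, and a recovery rate. The abstract does not give the benchmark's exact metric definitions, so the `Episode` fields, the `summarize` function, and the formulas are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    success: bool
    failure_type: Optional[str] = None  # "semantic" or "execution"; None if no failure occurred
    recovered: bool = False             # True if the policy recovered after the failure


def summarize(episodes):
    """Aggregate success, failure breakdown, and recovery rate (hypothetical definitions)."""
    if not episodes:
        raise ValueError("need at least one episode")
    failures = [e for e in episodes if e.failure_type is not None]
    recovered = sum(1 for e in failures if e.recovered)
    return {
        "success_rate": sum(e.success for e in episodes) / len(episodes),
        "failure_breakdown": dict(Counter(e.failure_type for e in failures)),
        "recovery_rate": recovered / len(failures) if failures else None,
    }


# Made-up rollouts: one clean success, one recovered failure, two unrecovered failures.
episodes = [
    Episode(success=True),
    Episode(success=True, failure_type="execution", recovered=True),
    Episode(success=False, failure_type="execution"),
    Episode(success=False, failure_type="semantic"),
]
summary = summarize(episodes)
```

Recording a failure even on episodes that end in success is the design choice that makes recovery measurable: a policy that drops an object and regrasps it counts as both a failure event and a recovery.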

2606.08878 2026-06-09 cs.CL cs.MA New submission

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, Jiaxuan Guo

Affiliations: University of Maryland; The Chinese University of Hong Kong; Stanford University

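AI summary: Introduces the PerspectiveGap benchmark for evaluating how well LLMs write orchestration prompts for multi-agent systems; the evaluated models reach an average pass rate of only 14.9%, indicating that this capability is distinct and under-evaluated.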

English abstract

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9% (GPT-5.5 62.0%) and an average overall leakage rate of 246.5% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.
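
A leakage rate above 100% can look like an error, so a small worked example may help. The abstract states only that the figure is a per-scenario leak-event count rather than a proportion; the exact formula is not given. The sketch below assumes it is the mean number of leak events per scenario expressed as a percentage, and all names and data are hypothetical.

```python
def pass_rate_pct(passed):
    """Share of scenarios passed, as a percentage (bounded by 100)."""
    return 100.0 * sum(passed) / len(passed)


def leakage_rate_pct(leak_events):
    """Mean leak events per scenario, as a percentage (assumed formula; can exceed 100)."""
    return 100.0 * sum(leak_events) / len(leak_events)


# Made-up results for four scenarios.
passed = [True, False, False, False]
leak_events = [3, 0, 5, 2]

pass_pct = pass_rate_pct(passed)            # 25.0
leak_pct = leakage_rate_pct(leak_events)    # 250.0
```

Under this reading, 246.5% would correspond to roughly 2.5 leak events per scenario on average, and GPT-5.5's 49.1% to about one leak event in every two scenarios.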
