arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.16676 2026-05-19 cs.AI

Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

Deniz Askin, Gal Hadar, Brendan Conway-Smith

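AI summary: This paper presents MetaKGEnrich, a system that builds a knowledge graph, detects sparse regions, generates targeted questions, and retrieves evidence to give LLM applications self-directed knowledge repair. It improved answer quality on 80% to 87% of the queries tested from three datasets.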

Abstract

Metacognition, the ability to monitor one's own knowledge state, spot gaps, and autonomously fill them, remains largely absent from modern AI. Here, we present MetaKGEnrich, a fully automated pipeline that endows large language model (LLM) applications with self-directed knowledge repair. The system (i) builds knowledge graphs from a seed query, (ii) detects sparse regions via seven graph metrics, (iii) has GPT-4o generate targeted questions, (iv) retrieves web evidence with Tavily and ingests it into Neo4j, and (v) re-answers the query with GraphRAG for GPT-4 to evaluate improvement. We tested the system on 30 queries from each of three widely used datasets: Google Research Natural Questions, MS MARCO, and HotpotQA. MetaKGEnrich improved answer quality in 80% of HotpotQA questions, 87% of Google Research Natural Questions questions, and 83% of MS MARCO questions, while preserving well-supported regions. This proof of concept demonstrates how topological self-diagnosis plus targeted retrieval can advance AI toward humanlike metacognitive learning.
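
The sparse-region detection in step (ii) can be illustrated with a minimal sketch. The abstract does not name the seven graph metrics, so the two used here (node degree and local clustering coefficient), the thresholds, and all function names are assumptions chosen for illustration only.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (head, tail) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of a node's neighbour pairs that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


def sparse_nodes(edges, max_degree=1, max_clustering=0.0):
    """Return nodes that look under-connected (hypothetical thresholds)."""
    adj = build_adjacency(edges)
    return sorted(
        n
        for n in list(adj)
        if len(adj[n]) <= max_degree
        and local_clustering(adj, n) <= max_clustering
    )


# Toy graph: A, B, C form a triangle; D hangs off C by a single edge.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
flagged = sparse_nodes(edges)
```

In this toy graph only `D` is flagged, so a pipeline of this shape would generate follow-up questions about `D` and leave the well-connected triangle untouched.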

2605.16675 2026-05-19 cs.AI

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

Shradha Agarwal, Deepak Rajbhar, Tariq J

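AI summary: LinAlg-Bench evaluates 10 frontier LLMs on structured linear algebra computation and shows that their mathematical failures are not random but constrained by algorithm type and matrix dimension. It finds a behavioral threshold at 4x4: below it, models fail through execution errors; above it, they shift to computational abandonment and fabricate responses through tool roleplay and similar behaviors.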

Comments: 42 pages, 3 figures, 12 tables. NeurIPS 2026 Evaluations and Datasets Track submission. Dataset: https://huggingface.co/datasets/LinAlgBench/linalg-bench

Abstract

We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.
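
The sign tracking and parity errors named above are easiest to see in a cofactor expansion, where each term's sign alternates with the column index. The sketch below is an exact reference checker using only the Python standard library. The benchmark itself certifies problems with SymPy; the function names `det` and `is_correct` are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def det(matrix):
    """Exact determinant by cofactor expansion along the first row."""
    n = len(matrix)
    if n == 1:
        return Fraction(matrix[0][0])
    total = Fraction(0)
    for col in range(n):
        # Minor: drop row 0 and the current column.
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        # The sign alternates with column parity; mishandling it is a
        # sign-tracking error of the kind the abstract describes.
        sign = -1 if col % 2 else 1
        total += sign * Fraction(matrix[0][col]) * det(minor)
    return total


def is_correct(matrix, claimed):
    """Compare a claimed answer against the exact determinant."""
    return det(matrix) == Fraction(claimed)


example = [[2, 0, 1], [1, 3, 2], [1, 1, 4]]
value = det(example)
```

Cofactor expansion takes factorial time, which is acceptable for the 3x3 to 5x5 sizes discussed here but not for large matrices.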

2605.16673 2026-05-19 cs.RO

Bayesian Networks for Path-Based Sensors: Gathering Information and Path Planning in Communication Denied Environments

Alkesh K. Srivastava, George P. Kontoudis, Donald Sofge, Michael Otte

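AI summary: This paper proposes a Bayesian-network-based update for path-based sensors in communication-denied environments that speeds up belief-map convergence while accounting for false positives and false negatives.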

Comments This paper has been accepted for presentation at 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

Abstract

A "path-based sensor" produces a single observation along a continuous path. For example, a boolean path-based sensor returns a single "1" if an event of interest is detected at any point along the path and a "0" otherwise. Notably, a "1" provides no direct information about where along the path the event(s) may have occurred. Previous work has demonstrated that observations from multiple path-based sensors can be fused to create a Bayesian belief map over the spatial locations of the underlying event or phenomenon. Moreover, path planning can employ Shannon information theory to accelerate the rate of convergence of the belief map. In this paper, we present a new method to update the belief map based on a path-based sensor observation, and then plan paths to increase information gain. In contrast to prior work that approximates the posterior by averaging over the alternative event histories, we introduce a Bayesian Network (BN) formulation that models the probabilistic relationships between the latent variables and path-based sensor measurements, enabling a more principled Bayesian belief update. We consider static hazard detection in a communication-denied environment as a representative problem setting. The event of a robot returning from its path corresponds to a path-based hazard sensor reading of "0" (hazard not detected), while a robot failing to return corresponds to a reading of "1" (hazard detected). We consider false positives and false negatives. We find that the new method leads to quicker convergence of the belief map than prior work in both single- and multi-robot cases.
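
To make the boolean path-based sensor concrete, the sketch below computes an exact posterior over per-cell hazard probabilities after a single reading, by enumerating every hazard configuration along the path. This is not the paper's Bayesian Network formulation. It assumes independent cell priors and a simple path-level noise model (`fp` and `fn` are hypothetical false positive and false negative rates), and it scales exponentially with path length, so it is only suitable for tiny examples.

```python
from itertools import product


def posterior(prior, path, reading, fp=0.0, fn=0.0):
    """Exact per-cell hazard posterior after one path-level reading.

    prior:   dict mapping cell -> prior hazard probability (independent cells)
    path:    list of cells the robot traverses
    reading: 1 (hazard detected, robot did not return) or 0 (robot returned)
    fp, fn:  assumed path-level false positive / false negative rates
    """
    cells = list(dict.fromkeys(path))  # unique cells, order preserved
    evidence = 0.0
    joint = {c: 0.0 for c in cells}
    for config in product([0, 1], repeat=len(cells)):
        p_config = 1.0
        for c, h in zip(cells, config):
            p_config *= prior[c] if h else 1.0 - prior[c]
        # Probability of reading "1" given this hazard configuration.
        p_one = (1.0 - fn) if any(config) else fp
        likelihood = p_one if reading == 1 else 1.0 - p_one
        weight = p_config * likelihood
        evidence += weight
        for c, h in zip(cells, config):
            if h:
                joint[c] += weight
    if evidence == 0.0:
        raise ValueError("reading has zero probability under this model")
    updated = dict(prior)  # cells off the path keep their prior
    for c in cells:
        updated[c] = joint[c] / evidence
    return updated


prior = {"a": 0.5, "b": 0.5, "c": 0.5}
after_one = posterior(prior, ["a", "b"], reading=1)
after_zero = posterior(prior, ["a", "b"], reading=0)
```

With a noiseless sensor, a "1" on the path through `a` and `b` raises each cell from 0.5 to 2/3 without saying which cell holds the hazard, while a "0" clears both cells.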

2605.16672 2026-05-19 cs.CV cs.AI cs.LG

Multi-Object Tracking Consistently Improves Wildlife Inference

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl

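AI summary: This paper uses multi-object tracking to make wildlife classification more robust, fusing class probabilities along each trajectory. Performance improves on all three datasets.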

Comments Accepted for publication in IEEE 2026 29th International Conference on Information Fusion

Abstract

Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences. The predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier on all datasets and for every metric. Specifically, the best-performing MOT models gain 5.1%, 3.1% and 2.0% in weighted F1-Score over the classifier across three MOT datasets.
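
The fusion step can be sketched as follows. The abstract does not state the fusion rule, so this example assumes a plain average of the per-frame softmax vectors along one track; the function name `fuse_track` is hypothetical.

```python
def fuse_track(frame_probs):
    """Average per-frame softmax vectors for one track.

    Returns (consensus label index, fused probability vector).
    """
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    fused = [
        sum(p[k] for p in frame_probs) / n_frames for k in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# One tracked animal over three frames; the middle frame flips to class 1.
track = [[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]
label, fused = fuse_track(track)
per_frame = [max(range(2), key=p.__getitem__) for p in track]
```

Here the per-frame labels are 0, 1, 0, while the fused vector favours class 0, so the single noisy frame is overridden by the track-level consensus.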

2605.16671 2026-05-19 cs.AI cs.CV cs.CY cs.LG

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

Jiaxing Li, Hao Fang, Chi Xu, Miao Zhang, Jiangchuan Liu, William I. Atlas, Katrina M. Connors, Mark A. Spoljaric

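AI summary: This paper proposes a knowledge-adaptive edge agent architecture that separates visual perception from reasoning by pairing a visual encoder with a dynamic knowledge base, supporting sustainable ecological monitoring and ethical AI co-development.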

Comments 10 pages

Abstract

Rapid biodiversity loss underscores the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resources, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We use an explicit knowledge base instead of implicitly encoding expert knowledge in model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.

2605.16668 2026-05-19 cs.LG cs.AI

GraViti: Graph-Level Variational Autoencoders with Relaxed Permutation Invariance

Roman Bresson, Konstantinos Divriotis, Johannes F. Lutzeyer, Iakovos Evdaimon, Michalis Vazirgiannis

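AI summary: GraViti is a graph-level variational autoencoder that maps whole graphs to compact latent vectors, supporting smooth interpolation and downstream tasks beyond what node-level embeddings allow.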

Abstract

We introduce GraViti, a transformer-based graph-level variational autoencoder that maps entire graphs to compact latent vectors. This design produces a true graph-level latent space that supports smooth interpolation, property-guided search, and other downstream tasks beyond the constraints of node-level embeddings. On molecular benchmarks, GraViti learns to decode valid samples that follow the chemical constraints present in the training data, showing that the model recovers domain rules directly from graph-level representations. We also show that, in domains where a reliable canonical node ordering exists, such as molecules or Bayesian networks, enforcing permutation invariance can prove detrimental for consistent reconstruction. GraViti achieves state-of-the-art reconstruction accuracy on large datasets and provides solid generative performance. Its single-step decoding offers a lightweight alternative to more complex generation pipelines while maintaining practical sample quality.

2605.16665 2026-05-19 cs.LG physics.geo-ph

In-context learning enables continental-scale subsurface temperature prediction from sparse local observations

Daniel O'Malley, Christopher W. Johnson, Javier E. Santos, Pablo Lara, Sandro Malusà, Bharat Srikishan, John Kath, Arnab Mazumder, Mohamed Mehana, David Coblentz, Nathan DeBardeleben, Earl Lawrence, Hari Viswanathan

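AI summary: This paper introduces In-Context Earth, a model that predicts continuous temperature-at-depth fields from sparse borehole observations, outperforms existing methods, and adapts to new regions without finetuning, with accurate and interpretable results.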

Abstract

Continental-scale knowledge of subsurface temperature is limited by the cost and sparsity of borehole measurements, but such information is essential for geothermal resource assessment and for understanding heat transport in the shallow crust. The thermal field reflects the interaction between lithology, crustal structure, radiogenic heat production, and advective fluid flow, sometimes producing sharp anomalies that are smoothed by conventional interpolation or difficult to capture with physical models. Here we introduce In-Context Earth, a transformer-based model that uses sparse local borehole observations as geological context to predict continuous temperature-at-depth fields with calibrated uncertainty. In the contiguous United States, the model achieves a mean absolute error of 4.7 °C, outperforming the physics-informed Stanford Thermal Model, a model based on AlphaEarth embeddings, the multimodal Transparent Earth model, and universal kriging, while resolving sharper thermal gradients in geothermal provinces. Its uncertainty estimates are well calibrated, with a Kolmogorov-Smirnov statistic of 2.5%. Without finetuning, the model adapts to Alberta, Australia, and the United Kingdom (UK) using only 20 local observations at inference time, maintaining high accuracy in geologically distinct test regions with a mean absolute error of 2.2 °C in Alberta, 6.2 °C in Australia, and 5.4 °C in the UK. Interpretability analyses show that the model learns internal representations of subsurface properties it never observes during training, including seismic velocities, geochemistry, and crustal structure, and uses these representations in physically consistent ways. More broadly, this work shows that in-context learning can use sparse borehole observations for continental-scale subsurface characterization, without requiring dense measurements or region-specific retraining.
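
The two headline metrics, mean absolute error and a Kolmogorov-Smirnov calibration statistic, can be sketched as below. The abstract does not say how the KS statistic is computed. This example assumes Gaussian predictive distributions and measures the KS distance between the probability integral transform (PIT) values and a uniform distribution, which is one common calibration check. All function names are hypothetical.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observations and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_calibration(y_true, mu, sigma):
    """KS distance between PIT values and Uniform(0, 1).

    A well calibrated predictive distribution yields PIT values that are
    close to uniform, so smaller is better.
    """
    pit = sorted(normal_cdf(y, m, s) for y, m, s in zip(y_true, mu, sigma))
    n = len(pit)
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(pit))


mae = mean_absolute_error([1.0, 2.0, 3.0], [2.0, 2.0, 1.0])
# Degenerate case: every observation sits exactly on the predicted mean,
# so all PIT values equal 0.5 and the distance from uniform is 0.5.
ks = ks_calibration([10.0, 20.0], [10.0, 20.0], [1.0, 1.0])
```

Under this reading, a statistic of 2.5% would mean the empirical PIT distribution deviates from uniform by at most 0.025.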

2605.16654 2026-05-19 cs.CL cs.AI

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

Divyesh Pratap Singh, Dakshesh Gusain, Federica Bulgarelli, Alison Eisel Hendricks, John Beavers, Nathan M. Beers, Ifeoma Nwogu

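AI summary: This paper presents an approach that uses large language models to identify manner and result verbs, extends coverage to 436 VerbNet classes using data from MASC and InterCorp, and reports average accuracy of up to 89.6% on three held-out datasets.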

Comments 12 pages

Abstract

Manner and result verbs encode different aspects of event structure and have been discussed in developmental work as a potentially informative distinction for studying early verb learning. However, this distinction remains difficult to measure at scale because large annotated resources for manner and result classification are not currently available. We present a computational approach for identifying manner and result verbs in sentence context. Using linguistically informed prompts, we generate sentence-level annotations with large language models over data drawn from MASC and InterCorp, extending coverage from previously annotated portions of VerbNet to 436 classes. We then train a RoBERTa-based classifier on these annotations and evaluate it on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations, the model shows promising performance, with average accuracy up to 89.6%. We present this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.

2605.16650 2026-05-19 cs.CL cs.AI

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

Avijit Shil, Suman Samui

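AI summary: This paper proposes SKG-Eval, a framework that models dialogue as an incrementally updated semantic knowledge graph to detect long-range inconsistencies in multi-turn dialogue, yielding interpretable evaluation signals and reproducible scores.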

Comments 36 Pages, 6 Figures

Abstract

Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.
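
Two ideas from the framework, incremental triple updates and a recency-weighted session score, can be illustrated with a toy sketch. It is not the paper's method. The conflict check below treats every relation as single-valued, which is a strong simplification compared with the geometric contradiction engine the abstract describes, and the exponential decay weighting is an assumed stand-in for the unspecified trend analysis. All names are hypothetical.

```python
def add_triple(graph, turn, subject, relation, obj):
    """Insert (subject, relation, obj) for a turn.

    Returns earlier (turn, object) entries for the same subject and
    relation whose object differs, as a simple contradiction record.
    """
    key = (subject, relation)
    conflicts = [(t, o) for t, o in graph.get(key, []) if o != obj]
    graph.setdefault(key, []).append((turn, obj))
    return conflicts


def session_score(turn_scores, decay=0.8):
    """Recency-weighted mean of per-turn scores.

    The weights are normalised, so the result stays within the range of
    the inputs regardless of how many turns the session has.
    """
    n = len(turn_scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, turn_scores)) / sum(weights)


graph = {}
first = add_triple(graph, 1, "meeting", "day", "Monday")
later = add_triple(graph, 5, "meeting", "day", "Friday")
score = session_score([0.0, 1.0], decay=0.5)
```

The second insertion returns the turn-1 entry as evidence of a cross-turn conflict, which mirrors the idea of an explicit, auditable contradiction record.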

2605.16649 2026-05-19 cs.CV

AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

Ziyang Mai, Yuyao Zhang, Yu-Wing Tai

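AI summary: This paper proposes AtlasVid, a decoupled global-local framework that makes ultra-high-resolution long video generation more efficient, reporting a 60.9x speedup, lower training cost, and better performance than native 4K generators.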

Abstract

Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlasVid, a decoupled global-local framework for efficient UHR long video generation. AtlasVid first generates a low-resolution and low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, and asymmetric global-local attention injects aligned semantic guidance while preserving the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlasVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality UHR long video generation with a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.

2605.14963 2026-05-19 cs.CV

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

Chenxing Jiang, Zhe Tong, Pusen Gao, Peize Liu, Yang Xu, Chuan Fang, Ping Tan, Shaojie Shen

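AI summary: This paper proposes H-OmniStereo, which builds a high-quality synthetic dataset and introduces a heading-aligned normal estimator to address data scarcity and the degradation of perspective priors in omnidirectional stereo matching, improving accuracy and cross-view consistency.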

Comments 8 pages, 9 figures

Abstract

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator that operates in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.

2605.14854 2026-05-19 cs.CV cs.AI

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

Patrick Kwon, Chen Chen

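AI summary: This paper proposes FactorizedHMR, which uses a deterministic regression module for the torso-root anchor and a probabilistic flow-matching module for the remaining articulation, combined with a composite target representation and geometry-aware supervision. Gains are clearest on occlusion-heavy and drift-sensitive metrics.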

Abstract

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

2605.14504 2026-05-19 cs.AI

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, Jing Liu

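AI summary: This paper introduces the LongAct benchmark and the HoloMind agent for evaluating high-level autonomy in long-horizon household tasks. HoloMind improves long-horizon performance while reducing reliance on model scale, but goal completion remains low, underscoring how hard long-horizon planning is.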

Abstract

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
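
The dependency management that a DAG-based planner relies on can be sketched with a topological sort. This is only an illustration of ordering subtasks by their prerequisites using Python's standard `graphlib` module (Python 3.9 or later). The subtask names and the `plan_order` helper are hypothetical and do not come from HoloMind.

```python
from graphlib import TopologicalSorter


def plan_order(dependencies):
    """Return one valid execution order for a subtask DAG.

    dependencies maps each subtask to the set of subtasks that must
    finish before it can start. A cyclic plan raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical household task decomposed into subtasks.
deps = {
    "fill kettle": set(),
    "find mug": set(),
    "brew coffee": {"fill kettle", "find mug"},
    "serve coffee": {"brew coffee"},
}
order = plan_order(deps)
```

Several orders can be valid for the same graph (the kettle and the mug may come in either order), so downstream code should rely only on prerequisites preceding their dependents.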

2605.14498 2026-05-19 cs.CL

GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

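AI summary: This paper introduces GroupMemBench, a benchmark for LLM agent memory in multi-party conversations, and shows that existing memory systems fall short on group dynamics, belief tracking, and audience-adapted language.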

Abstract

Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches for challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates that current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
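
For readers unfamiliar with the BM25 baseline mentioned above, here is a minimal lexical scorer. The abstract does not specify the BM25 variant, tokenizer, or parameters. This sketch assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the smoothed IDF form `log(1 + (N - df + 0.5) / (df + 0.5))`. The example messages are invented.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi-style BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalisation penalises longer-than-average documents.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


messages = [
    "alice moved the launch to friday",
    "bob prefers the blue logo",
    "the launch budget was approved by carol",
]
scores = bm25_scores("launch friday", messages)
```

Because the scorer works on raw tokens, it keeps the exact vocabulary each speaker used, which fits the abstract's point that lexical features matter for group memory.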

2605.14381 2026-05-19 cs.LG cs.CL

NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

Qazi Mamunur Rashid, Xuan Yang, Zhengzhe Yang, Yanzhou Pan, Erin van Liemt, Darlene Neal, Kshitij Pancholi, Jamila Smith-Loud

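AI summary: NodeSynth generates socially relevant synthetic queries through fine-grained taxonomy expansion grounded in real-world evidence, aiming to make AI model evaluation in sensitive domains more accurate and safer.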

Abstract

Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (https://github.com/google-research/nodesynth).

2605.14347 2026-05-19 cs.LG

Exemplar Partitioning for Mechanistic Interpretability

基于示例的机制可解释性划分

Jessica Rumbelow

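AI summary This paper proposes Exemplar Partitioning, which builds interpretable feature dictionaries from far fewer tokens and demonstrates their comparability across layers and models as well as their support for causal interventions.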

Comments Code: https://github.com/jessicarumbelow/exemplar-partitioning. Pretrained dictionaries: https://huggingface.co/datasets/J-RUM/exemplar-partitioning

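Details
AI abstract (translated from Chinese)

We introduce Exemplar Partitioning (EP), an unsupervised method for building interpretable feature dictionaries from large language model activations using roughly $10^3\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader clustering of streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and its intervention direction; the dictionary size is not fixed in advance but is determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object through demonstrations targeting newly accessible interpretability properties, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: in instruction-tuned Gemma, refusal concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between the base and instruction-tuned dictionaries separates directions preserved through fine-tuning from those it introduces. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: about 20% of EP regions match an SAE feature at $F_1 > 0.5$, and EP one-hot probes retain about 97% of raw-activation probe accuracy at $\ell_0 = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference time. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_1$ reaches a mean AUROC of 0.881, which is 0.126 above the canonical GemmaScope SAE and close to SAE-A's 0.911, with about $10^3\times$ less build compute.

English abstract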

We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with $\sim 10^3\times$ fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: $\sim$20% of EP regions match an SAE feature at $F_1 > 0.5$, and EP one-hot probes retain $\sim$97% of raw-activation probe accuracy at $\ell_0 = 1$. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at $p_1$ reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A's 0.911, at $\sim 10^3\times$ less build compute.
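
The leader-clustering construction described in this abstract can be sketched in a few lines. This is a minimal illustration only, assuming Euclidean distance and a nearest-exemplar assignment rule; the function names `leader_cluster` and `nearest_exemplar` and the toy 2-D points are hypothetical and are not taken from the paper or its repository.

```python
import math

def leader_cluster(stream, threshold):
    """Single-pass leader clustering (illustrative sketch).

    A streamed point joins the region of its nearest exemplar if that
    exemplar lies within `threshold`; otherwise the point itself becomes
    a new exemplar. The number of regions is therefore not fixed in advance.
    """
    exemplars = []  # observed points that anchor regions
    labels = []     # region index assigned to each streamed point
    for x in stream:
        best, best_d = None, float("inf")
        for i, e in enumerate(exemplars):
            d = math.dist(x, e)
            if d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            exemplars.append(x)
            best = len(exemplars) - 1
        labels.append(best)
    return exemplars, labels

def nearest_exemplar(x, exemplars):
    """Return (region index, distance) for a query point.

    A distance far above the build threshold can be read as a crude
    out-of-distribution signal (an assumption of this sketch).
    """
    dists = [math.dist(x, e) for e in exemplars]
    i = min(range(len(dists)), key=dists.__getitem__)
    return i, dists[i]

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.2)]
exemplars, labels = leader_cluster(points, threshold=1.0)
```

Because the exemplars are stored points rather than learned parameters, the same stream and threshold always reproduce the same dictionary in this sketch.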

2605.14292 2026-05-19 cs.LG cs.CL

Minimal-Intervention KV Retention via Set-Conditioned Diversity

通过集合条件多样性实现最小干预的KV保留

Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

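AI summary The study modifies the TriAttention retention scorer to improve KV-cache compression under small budgets, using a V-space redundancy penalty, and shows that a minimal modification outperforms structural redesigns.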

Comments 15 pages, 3 figures, 3 tables. Code and data: https://github.com/libophd/minimal-kv-retention

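Details
AI abstract (translated from Chinese)

KV-cache compression at small budgets is a crowded design space covering cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}), using two distilled-reasoning models (the Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven mechanisms were rejected. We then propose $\alpha$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $\lambda$. A pre-registered protocol tunes $\lambda$ on a frozen development split and confirms on a disjoint held-out split; with $\lambda = 0.5$, $\alpha$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: in this regime a minimal scoring modification beats heavier structural redesigns, and the evidence standard combining matched memory, sympy grading, and held-out confirmation is what makes the asymmetry visible.

English abstract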

KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets $b \in \{64, 128\}$. All seven were rejected. We then propose $\alpha$, a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top-$k$ with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight $\lambda$. A pre-registered protocol tunes $\lambda$ on a frozen development split and confirms on a disjoint held-out split; with $\lambda = 0.5$, $\alpha$ clears Bonferroni on two of the four (model, budget) cells (Qwen $b{=}128$ and Llama $b{=}64$), no cell is significantly negative, and the pre-registered Branch~A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
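
To make the contrast between argmax-top-$k$ and greedy selection under a redundancy penalty concrete, here is a small sketch. The abstract does not give the exact scoring formula, so the gain `score - lam * max_similarity` and the use of cosine similarity between value vectors are assumptions of this example, and `greedy_select` is a hypothetical name rather than a function from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_select(scores, values, budget, lam):
    """Pick `budget` cache entries one at a time.

    Each step takes the entry with the best retention score minus
    `lam` times its highest similarity to an already-kept value vector.
    With lam = 0 this reduces to plain top-k by score.
    """
    selected = []
    remaining = set(range(len(scores)))
    while remaining and len(selected) < budget:
        def gain(i):
            redundancy = max(
                (cosine(values[i], values[j]) for j in selected), default=0.0
            )
            return scores[i] - lam * redundancy
        best = max(sorted(remaining), key=gain)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected

scores = [1.0, 0.95, 0.5]
values = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # entries 0 and 1 are redundant
top_k = greedy_select(scores, values, budget=2, lam=0.0)
diverse = greedy_select(scores, values, budget=2, lam=0.5)
```

In this toy case the penalized selection keeps entry 2 instead of the near-duplicate entry 1, which is the kind of set-conditioned behavior the title refers to.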

2605.14271 2026-05-19 cs.CL cs.CY

Auditing Agent Harness Safety

审查代理安全

Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang

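AI summary This paper proposes HarnessAudit, a framework that audits execution trajectories for boundary compliance, execution fidelity, and system stability, exposing safety risks in multi-agent harnesses, and uses HarnessAudit-Bench to verify how safety risk relates to trajectory length, domain, and agent role.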

Comments 11 Pages, 8 Figures

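Details
AI abstract (translated from Chinese)

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages. However, a harness may access unauthorized resources or leak context to the wrong agent mid-trajectory, and output-level evaluation cannot detect these failures. This paper proposes HarnessAudit, a framework that audits full execution trajectories for boundary compliance, execution fidelity, and system stability, and uses HarnessAudit-Bench to verify how safety risk relates to trajectory length, domain, and agent role.

English abstract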

LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

2605.14125 2026-05-19 cs.CL

Polar probe linearly decodes semantic structures from LLMs

极性探针线性解码LLM中的语义结构

Pablo J. Diego-Simón, Pierre Orhan, Emmanuel Chemla, Yair Lakretz, Jean-Rémi King

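AI summary The study uses a Polar Probe to linearly recover semantic structures from LLMs, finding that the existence and type of relations between entities are encoded by embedding distance and direction; the code is strongest in middle layers and generalizes to new entities, but degrades as the semantic structure grows.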

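Details
AI abstract (translated from Chinese)

How do artificial neural networks bind concepts to form complex semantic structures? This paper proposes a simple neural code in which the existence and type of relations between entities are represented by the distance and direction between their embeddings. Tests across a variety of LLMs show that a Polar Probe can linearly recover the true semantic structures, and that this code emerges mainly in middle layers and improves with LLM performance. Polar Probes generalize to new entities and relation types but degrade as the semantic structure grows. The quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. These findings suggest that LLMs build complex semantic structures by binding representations with a simple geometric principle.

English abstract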

How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each given natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. First, results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrade with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.

2605.13322 2026-05-19 cs.CV cs.LG

KamonBench: A Grammar-Based Dataset for Evaluating Compositional Factor Recovery in Vision-Language Models

KamonBench:一种基于语法规则的数据集,用于评估视觉-语言模型中的组合因子恢复

Richard Sproat, Stefano Peluchetti

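AI summary KamonBench provides 20,000 synthetic composite crests plus auxiliary component examples as a controlled testbed for evaluating sparse compositional recognition and factor recovery in vision-language models, supporting program-code factor metrics and controlled factor-pair recombination.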

Comments Preprint

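Details
AI abstract (translated from Chinese)

KamonBench provides 20,000 synthetic composite crests plus auxiliary component examples as a controlled testbed for evaluating sparse compositional recognition and factor recovery in vision-language models, supporting program-code factor metrics and controlled factor-pair recombination.

English abstract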

Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in a formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.

2605.12991 2026-05-19 cs.LG cs.AI

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

不只是RLHF:为何仅对齐不足以解决多智能体趋同

Adarsh Kumarappan, Ananya Mujoo

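AI summary This paper studies how often multi-agent systems flip to wrong answers under simulated peer disagreement, and finds that pretrained base models show the same substitution pattern as instruct models, at higher rates. Activation patching localizes the corruption to the middle layers, and patching recovers most of the accuracy gap. The study also shows that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit.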

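Details
AI abstract (translated from Chinese)

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at a rate we call yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test four model families and find this attribution largely wrong: pretrained base models show the same substitution pattern as their Instruct variants, with higher average yield. Using activation patching, we find the corruption concentrated in a narrow mid-layer window where attention carries the causal weight and the MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism through structured dissent at the pipeline level, rather than rely on prompt-level defenses.

English abstract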

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt-level defenses.
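
As a rough illustration of the yield metric named in this abstract, the sketch below counts correct-to-incorrect flips. The abstract does not state the denominator, so normalizing by the number of items answered correctly without pressure is an assumption of this example, and `yield_rate` is a hypothetical helper name.

```python
def yield_rate(clean_correct, pressured_correct):
    """Fraction of initially correct answers that flip to incorrect.

    clean_correct[i]     -- True if item i was answered correctly with no pressure
    pressured_correct[i] -- True if item i was answered correctly under
                            simulated peer disagreement
    Assumption: only items that were correct in the clean run can "yield".
    """
    flips = sum(1 for c, p in zip(clean_correct, pressured_correct) if c and not p)
    base = sum(1 for c in clean_correct if c)
    return flips / base if base else 0.0

clean = [True, True, True, False]
pressured = [True, False, False, False]
rate = yield_rate(clean, pressured)  # two of the three clean-correct items flipped
```

A yield gap in percentage points, as reported in the abstract, would then be the difference between two such rates multiplied by 100.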

2605.12987 2026-05-19 cs.CL

Leveraging Multimodal Self-Consistency Reasoning in Coding Motivational Interviewing for Alcohol Use Reduction

利用多模态自一致性推理进行编码动机访谈以减少酒精使用

Guangzeng Han, James G. Murphy, Benjamin O. Ladd, Xiaolei Huang, Brian Borsari

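AI summary This paper proposes an automatic Motivational Interviewing coding method based on audio-language models that uses multimodal self-consistency reasoning to make coding more robust; experiments show it outperforms baselines on accuracy, precision, and recall.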

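Details
AI abstract (translated from Chinese)

BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behavior and predicting outcomes, but it demands substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an ALM-based automatic MI coding approach that analyzes raw audio input and integrates predictions from multiple reasoning trajectories, using self-consistency to improve coding robustness. METHODS: We experimented with five de-identified MI audio recordings. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn per prompt, yielding 12 independent reasoning trajectories per utterance. Final predictions were determined by majority vote over all trajectories. RESULTS: Performance was evaluated with accuracy, precision, recall, and macro-F1. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and 46.40% macro-F1, exceeding baseline methods. Systematic ablations that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting for MI coding. These findings suggest that combining what clients say with how they say it can support more reliable automatic MI coding.

English abstract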

BACKGROUND: Coding Motivational Interviewing (MI) sessions is essential for understanding client behaviors and predicting outcomes, but it requires substantial time and labor from trained MI professionals. Recent advances in audio-language models (ALMs) offer new opportunities to automate MI coding by capturing multimodal behavioral signals. OBJECTIVE: This study aims to develop an automatic MI coding approach based on ALMs that analyzes raw audio input and integrates predictions from multiple reasoning trajectories using self-consistency to improve coding robustness. METHODS: We experimented with five recorded sessions from de-identified MI audio tapes. We deployed ALMs with four complementary analytic prompts to support utterance-level reasoning: analytic prompting for verbal cues, prosody-aware prompting for acoustic cues, evidence-scoring prompting for quantitative hypothesis testing, and comparative prompting for contrastive reasoning. Three stochastic samples were drawn for each prompt, generating 12 independent reasoning trajectories per utterance. Final predictions were determined by majority voting across all trajectories. RESULTS: Performance was evaluated using accuracy, precision, recall, and macro-F1 scores. The proposed multimodal self-consistency approach achieved 52.56% accuracy, 54.03% precision, 47.45% recall, and a macro-F1 score of 46.40%, exceeding baseline methods. Systematic ablation experiments that removed individual modules consistently degraded performance on the primary metrics. CONCLUSIONS: Multimodal self-consistency outperforms single-pass baseline prompting approaches for MI coding. These findings suggest that incorporating both what clients say and how they say it can support more reliable automatic MI coding.
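
The aggregation step in the METHODS section (four prompts, three samples each, majority vote over 12 trajectories) can be sketched as follows. The prompt identifiers, the `predict` callback, and the tie-breaking rule are assumptions of this example; the abstract does not describe how ties are resolved or how the audio-language model is called.

```python
from collections import Counter

# Hypothetical identifiers for the four prompt styles named in the abstract.
PROMPTS = ["analytic", "prosody_aware", "evidence_scoring", "comparative"]
SAMPLES_PER_PROMPT = 3

def majority_vote(labels):
    """Return the most frequent label.

    Ties go to the label seen first, which is how Counter.most_common
    orders equal counts; this tie rule is an assumption of the sketch.
    """
    return Counter(labels).most_common(1)[0][0]

def code_utterance(utterance, predict):
    """Collect 4 x 3 = 12 trajectories for one utterance and vote.

    `predict(utterance, prompt, sample_index)` stands in for one stochastic
    model call and must return a label string.
    """
    trajectories = [
        predict(utterance, prompt, s)
        for prompt in PROMPTS
        for s in range(SAMPLES_PER_PROMPT)
    ]
    return majority_vote(trajectories), trajectories
```

Any callable with that signature can be plugged in, so the voting logic can be checked with a deterministic stub before a real model is attached.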

2605.12825 2026-05-19 cs.LG cs.AI

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus:通过双视角扩散实现内存高效的并行令牌生成

Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen

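AI summary Orthrus combines the high-fidelity generation of autoregressive large language models with the fast parallel generation of diffusion models, using a dual-view mechanism for efficient inference that delivers up to a 7.8x speedup with very low memory overhead.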

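Details
AI abstract (translated from Chinese)

We introduce Orthrus, a simple and efficient dual-architecture framework that combines the exact generation fidelity of autoregressive large language models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding is a fundamental bottleneck for high-throughput inference. Diffusion language models try to break this bottleneck through parallel generation, but suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to integrate seamlessly into existing Transformers, the framework adds a lightweight trainable module to a frozen LLM, creating a parallel diffusion view alongside the standard autoregressive view. In the unified system, both views attend to the same high-fidelity key-value (KV) cache; the autoregressive head performs context prefilling to build accurate KV representations, while the diffusion head performs parallel generation. With an exact consensus mechanism between the two views, Orthrus guarantees lossless inference and delivers up to a 7.8x speedup with only O(1) memory cache overhead and minimal added parameters.

English abstract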

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.

2605.12070 2026-05-19 cs.LG cs.AI

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

异步代理强化学习中缺失旧日志:语义不匹配及用于离线策略修正的修复方法

Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Likang Wu, Xiong Jun Wu, Hongke Zhao

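AI summary This paper studies the semantic mismatch caused by missing old logits in asynchronous agentic reinforcement learning, proposes three strategies for obtaining exact old logits along with an approximate correction, and revises the PPO-EWMA method to improve training speed and optimization performance.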

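Details
AI abstract (translated from Chinese)

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode in PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally decompose into two semantically distinct factors: a training-inference discrepancy term that aligns the inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We find that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intent of decoupled correction, and makes clipping and masking thresholds interact badly. To address this, we study exact and approximate correction routes. We propose three exact old-logit acquisition strategies (snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption) and compare their system trade-offs. On the approximate side, we focus on preserving the benefits of decoupled correction through a more suitable approximate policy when exact old logits cannot be recovered cheaply, without adding system overhead. We then adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance.

English abstract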

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance.
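
The two-factor decomposition described in this abstract can be written out explicitly. The notation below is an assumption made for illustration and is not taken from the paper: $\mu_{\text{old}}$ denotes the inference-side (rollout) distribution at the behavior-policy version, $\pi_{\text{old}}$ the training-side distribution at that same version, and $\pi_\theta$ the current policy.

```latex
\[
\underbrace{\frac{\pi_\theta(a_t \mid s_t)}{\mu_{\text{old}}(a_t \mid s_t)}}_{\text{total importance ratio}}
=
\underbrace{\frac{\pi_{\text{old}}(a_t \mid s_t)}{\mu_{\text{old}}(a_t \mid s_t)}}_{\text{training-inference discrepancy}}
\cdot
\underbrace{\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)}}_{\text{policy staleness}}
\]
```

Under this reading, both factors need $\pi_{\text{old}}$, which is computed from the old training-side logits. When those logits are unavailable, only the product on the left can be evaluated, so separate thresholds for the two factors can no longer be applied independently.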

2605.11970 2026-05-19 cs.LG

NOFE - Neural Operator Function Embedding

NOFE - 神经操作函数嵌入

Lars Uebbing, Harald L. Joakimsen, Siyan Chen, Georgios Leontidis, Kristoffer K. Wickstrøm, Michael C. Kampffmeyer, Sébastien Lefèvre, Arnt-Børre Salberg, Robert Jenssen

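AI summary NOFE is a domain-aware framework for continuous dimensionality reduction that learns function-to-function mappings through a Graph Kernel Operator, enabling mesh-free evaluation and outperforming conventional methods in local structure preservation and robustness to sampling.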

Comments 21 pages, 11 figures, 12 tables

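Details
AI abstract (translated from Chinese)

Most dimensionality reduction methods treat data as discrete point clouds, ignoring the continuous domain structure inherent in many real-world processes. To bridge this gap, we introduce Neural Operator Function Embedding (NOFE), a domain-aware framework for continuous dimensionality reduction. NOFE learns function-to-function mappings through a Graph Kernel Operator, enabling mesh-free evaluation at arbitrary query locations independent of the input discretization. We establish NOFE as an approximation of sheaf-to-sheaf mappings, generalizing Sheaf Neural Networks to continuous domains. We evaluate NOFE on different datasets against PCA, t-SNE, and UMAP. The results show that NOFE significantly outperforms the baselines in local structure preservation: on the ERA5 climate reanalysis dataset it attains a local Stress of 0.111, versus 0.398 for PCA, 0.773 for t-SNE, and 0.791 for UMAP. NOFE also shows robust sampling independence, reducing the Patch Stitching Error by up to 20.0 times relative to UMAP (59.0 vs. 267.6 under regional normalization) and ensuring consistency across disjoint domain patches. While keeping competitive global structure preservation (Stress-1: 0.379 vs. 0.268 for PCA), NOFE resolves fine-grained structure and produces smooth, consistent embeddings that generalize across sample densities, addressing key limitations of discrete reduction methods.

English abstract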

Most dimensionality reduction methods treat data as discrete point clouds, ignoring the continuous domain structure inherent to many real-world processes. To bridge this gap, we introduce Neural Operator Function Embedding (NOFE), a domain-aware framework for continuous dimensionality reduction. NOFE learns function-to-function mappings via a Graph Kernel Operator, enabling mesh-free evaluation at arbitrary query locations independent of input discretization. We establish NOFE as an approximation of sheaf-to-sheaf mappings, generalizing Sheaf Neural Networks to continuous domains. We evaluate NOFE across different datasets, comparing it against PCA, t-SNE, and UMAP. Our results demonstrate that NOFE significantly outperforms baselines in local structure preservation, achieving a local Stress of 0.111 compared to 0.398 for PCA, 0.773 for t-SNE, and 0.791 for UMAP on the ERA5 climate reanalysis dataset. NOFE also exhibits robust sampling independence, reducing the Patch Stitching Error by up to $20.0\times$ relative to UMAP (59.0 vs. 267.6 under regional normalization) and ensuring consistency across disjoint domain patches. While maintaining competitive global structure preservation (Stress-1: 0.379 vs. PCA's 0.268), NOFE resolves fine-grained structures and produces smooth, consistent embeddings that generalize across varying sample densities, addressing key limitations of discrete reduction methods.

2605.11871 2026-05-19 cs.CV

$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

$h$-control: 无需训练的相机控制 via 块条件吉布斯细化

Yuzhu Wang, Xi Ye, Duo Su, Yangyang Xu, Jun Zhu

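AI summary This paper proposes $h$-control, which changes the structure of the sampler to address the inverse problem of camera control in training-free video generation, improving the balance between trajectory adherence and visual quality and achieving the best results on multiple datasets.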

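Details
AI abstract (translated from Chinese)

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic tuning of the guidance strength lacks robustness. We propose $h$-control, which introduces a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with guaranteed convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality and partition the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD among all seven training-free and training-based competitors and outperforms every training-free baseline.

English abstract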

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality, partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors, outperforming every training-free baseline on every reported metric.

2605.11599 2026-05-19 cs.LG

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol

面向LLM推理的定向测试:一种受审计约束的协议

Hongmin Li

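AI summary This paper proposes an audit-constrained protocol for evaluating LLM reasoning; by comparing component-adaptive prompt sampling against uniform sampling, it shows that targeted prompt variation can be studied in a controlled setting.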

Comments 17 pages, 1 figure

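Details
AI abstract (translated from Chinese)

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Prompt-variation studies can reveal such failures, but without auditing they may conflate genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. This paper proposes an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but the matched comparisons do not show CAPS outperforming uniform sampling on audited yield or unique prompt-key discovery. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than by raw mismatch counts or selected examples alone.

English abstract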

Fixed reasoning benchmarks evaluate canonical prompts, but semantically valid changes in presentation can still change model behavior. Studies of prompt variation can reveal such failures, but without audit they can mix genuine model errors with invalid perturbations, extraction artifacts, and unmatched search procedures. We propose an audit-constrained protocol for targeted reasoning evaluation. Prompt variants are generated from a finite component grammar, rendered deterministically, evaluated under a fixed query budget, and counted as model errors only after semantic and extraction audit. Within this protocol we instantiate Component-Adaptive Prompt Sampling (CAPS), a score-based sampler over prompt components, and compare it with equal-budget uniform component sampling under the same task bank, renderer, model interface, decoding settings, and audit procedure. Across three audited slices, the protocol identifies confirmed model-error prompt keys while excluding formatting and extraction artifacts, but matched comparisons do not show that CAPS improves audited yield or unique prompt-key discovery over uniform sampling. The contribution is methodological: targeted prompt variation can be studied under a reconstructable, reviewable, budget-matched protocol, and proxy-guided policies should be judged by audited yield rather than raw mismatch counts or selected examples alone.
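
The protocol's building blocks (a finite component grammar, deterministic rendering, and equal-budget uniform component sampling) can be illustrated with a toy example. Everything concrete here is hypothetical: the three components in `GRAMMAR`, the rendering template, and the function names are invented for the sketch and do not come from the paper.

```python
import itertools
import random

# Hypothetical finite component grammar: each component has a few options.
GRAMMAR = {
    "instruction": ["Solve the problem.", "Answer the question below."],
    "answer_format": ["Give only the final number.", "End with 'Answer: <number>'."],
    "order": ["task_first", "instruction_first"],
}

def all_prompt_keys(grammar):
    """Enumerate every prompt key as a tuple of (component, option) pairs."""
    names = sorted(grammar)
    return [
        tuple(zip(names, combo))
        for combo in itertools.product(*(grammar[n] for n in names))
    ]

def render(key, task):
    """Deterministically render a prompt key and a task into prompt text."""
    parts = dict(key)
    head = parts["instruction"] + " " + parts["answer_format"]
    if parts["order"] == "task_first":
        return task + "\n" + head
    return head + "\n" + task

def sample_uniform(grammar, budget, seed=0):
    """Draw `budget` prompt keys, choosing each component uniformly.

    A fixed seed keeps the draw reconstructable for later audit.
    """
    rng = random.Random(seed)
    names = sorted(grammar)
    return [tuple((n, rng.choice(grammar[n])) for n in names) for _ in range(budget)]

keys = all_prompt_keys(GRAMMAR)
draws = sample_uniform(GRAMMAR, budget=5, seed=1)
prompts = [render(k, "What is 17 + 25?") for k in draws]
```

An adaptive sampler such as CAPS would replace `rng.choice` with a score-weighted choice while keeping the same budget, renderer, and key format, which is what makes the comparison budget-matched.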

2605.11518 2026-05-19 cs.AI cs.CL cs.LG

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch: 训练研究代理以自动化LLM实验配置 - 从低成本学习,优化高成本

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

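AI summary This paper proposes AutoLLMResearch, a framework that learns LLM configuration principles in a multi-fidelity experimental environment to automate the configuration of high-cost experiments, and demonstrates its effectiveness and generality on large-scale LLM experiments.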

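Details
AI abstract (translated from Chinese)

Effectively configuring scalable large language model (LLM) experiments, covering architecture design, hyperparameter tuning, and more, is crucial for advancing LLM research, because poor configuration choices waste substantial compute and keep models from reaching their potential. Prior automated methods suit low-cost settings, but scalable LLM experiments are too expensive for extensive iteration. To address this, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn general principles from low-fidelity experiments and efficiently identify configurations for high-cost LLM settings. The core challenge is how to let an agent learn the structure of the LLM configuration landscape through interaction with a multi-fidelity experimental environment. To this end, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment covering four key LLM experiment tasks and supported by over one million GPU hours of verifiable experiment outcomes; 2) a structured training pipeline that models configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines demonstrates the effectiveness, generality, and interpretability of our framework, supporting its potential as a practical and general solution for automating large-scale real-world LLM experiments.

English abstract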

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

2605.11480 2026-05-19 cs.LG

Efficient Adjoint Matching for Fine-tuning Diffusion Models

高效对抗匹配用于扩散模型微调

Jeongwoo Shin, Dongsoo Shin, Yuchen Zhu, Wei Guo, Yongxin Chen, Joonseok Lee, Jaewoong Choi, Jaemoo Choi

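AI summary This paper proposes Efficient Adjoint Matching (EAM), which removes the computational bottlenecks of Adjoint Matching in diffusion-model fine-tuning by switching to a linear base drift and a modified terminal cost, converging up to 4x faster while performing strongly across multiple metrics.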

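Details
AI abstract (translated from Chinese)

Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences. Among reward-gradient-based methods, Adjoint Matching (AM) offers a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, which leads to a large number of function evaluations, and (ii) backward ODE simulation along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the non-trivial base drift inherited from the pretrained model. Motivated by this, we propose Efficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across metrics including PickScore, ImageReward, HPSv2.1, CLIPScore, and Aesthetics.

English abstract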

Reward fine-tuning has become a common approach for aligning pretrained diffusion and flow models with human preferences in text-to-image generation. Among reward-gradient-based methods, Adjoint Matching (AM) provides a principled formulation by casting reward fine-tuning as a stochastic optimal control (SOC) problem. However, AM incurs a substantial computational cost: it requires (i) stochastic simulation of full generative trajectories under memoryless dynamics, resulting in a large number of function evaluations, and (ii) backward ODE simulation of the adjoint state along each sampled trajectory. In this work, we observe that both bottlenecks are closely tied to the non-trivial base drift inherited from the pretrained model. Motivated by this observation, we propose Efficient Adjoint Matching (EAM), which substantially improves training efficiency by reformulating the SOC problem with a linear base drift and a correspondingly modified terminal cost. This reformulation removes both sources of inefficiency: it enables training-time sampling with a few-step deterministic ODE solver and yields a closed-form adjoint solution that eliminates backward adjoint simulation. On standard text-to-image reward fine-tuning benchmarks, EAM converges up to 4x faster than AM and matches or surpasses it across various metrics including PickScore, ImageReward, HPSv2.1, CLIPScore and Aesthetics.

2605.11208 2026-05-19 cs.CV

Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation

Hi-GaTA:用于外科视频报告生成的分层门控时间聚合适配器

Kedi Sun, Chaohui Dang, Yue Feng, James Glasbey, Theodoros N. Arvanitis, Le Zhang

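AI summary This paper proposes the Hi-GaTA framework, which uses temporal aggregation to compress long video sequences into LLM-compatible visual prefix tokens and combines a pretrained surgical video encoder with LoRA fine-tuning to generate high-quality surgical reports.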

Comments 11 pages, 2 figures

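Details
AI abstract (translated from Chinese)

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, but they are hindered by the difficulty of aligning spatio-temporal video representations with language-based reasoning and by the scarcity of high-quality, privacy-preserving data. To this end, we establish a benchmark of 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgery-specific ViViT-style video encoder, on 40,000 minutes of public surgical video to capture fine-grained spatio-temporal procedural priors. Hi-GaTA uses a temporal pyramid with text-conditioned dual cross-attention and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone with LoRA to produce coherent, stylistically consistent surgical reports under limited supervision. Experiments show that our method achieves the best overall performance, with consistent gains over strong multimodal large language model (MLLM) baselines. Ablation studies further validate each proposed component.

English abstract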

Automated, clinician-grade assessment reports for surgical procedures could reduce documentation burden and provide objective feedback, yet remain challenging due to the difficulty of aligning dense spatio-temporal video representations with language-based reasoning and the scarcity of high-quality, privacy-preserving datasets. To address this gap, we establish a benchmark comprising 214 high-quality simulated surgical videos paired with surgeon-authored evaluation reports. Building on this resource, we propose a Perception-Alignment-Reasoning framework for surgical video report generation, featuring Hi-GaTA, a novel lightweight temporal adapter that efficiently compresses long video sequences into compact, LLM-compatible visual prefix tokens through short-to-long-range temporal aggregation. For robust visual perception, we pretrain Sur40k, a surgical-specific ViViT-style video encoder on 40,000 minutes of public surgical videos to capture fine-grained spatio-temporal procedural priors. Hi-GaTA employs a temporal pyramid with text-conditioned dual cross-attention, and improves multi-scale consistency through cross-level gated fusion and an increasing-depth strategy. Finally, we fine-tune the LLM backbone using LoRA to enable coherent and stylistically consistent surgical report generation under limited supervision. Experiments show our approach achieves the best overall performance, with consistent gains over strong Multimodal Large Language Model (MLLM) baselines. Ablation studies further validate the effectiveness of each proposed component.