arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.22269 2026-05-22 cs.CV cs.AI cs.MM

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao

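AI summary: This paper proposes MuKV, a multi-grained KV cache compression method that uses a semi-hierarchical retrieval approach to improve the efficiency and accuracy of long streaming video question answering. Experiments show that it outperforms baselines in answer accuracy, memory usage, and online QA efficiency.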

Comments
To appear at CVPR'26. Code is available at https://github.com/IMBALDY/MuKV

Abstract

Long streaming video QA remains challenging due to the growing number of visual tokens and the limited reasoning length of large language models (LLMs). KV caching stores the Key-Value (KV) pairs of historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frames or temporal context across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at the patch, frame, and segment levels. The multiple levels of granularity preserve both local cues and global temporal context, while a dual-signal token compression mechanism guided by self-attention and frequency maintains efficiency. For online QA, MuKV uses a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy without sacrificing memory or online QA efficiency. Moreover, our compression mechanism alone brings consistent gains over baselines in answer accuracy, memory, and QA efficiency.

2605.22266 2026-05-22 cs.LG cs.AI

Detecting Atypical Clients in Federated Learning via Representation-Level Divergence

Cristian Pérez-Corral, Jose I. Mestre, Alberto Fernández-Hernández, Manuel F. Dolz, Enrique S. Quintana-Ortí

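AI summary: This paper proposes a lightweight geometric signal that quantifies the functional deviation between a client and the global model in order to detect atypical clients in federated learning. By evaluating changes in the activation-induced partition of the input space, it distinguishes stable yet heterogeneous clients from those that diverge significantly from the global regime.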

Abstract

Federated learning enables collaborative training across distributed clients with heterogeneous data, but such heterogeneity often leads to unstable updates and degraded global performance. Moreover, in practical deployments, client updates may deviate from the expected behavior not only due to benign non-i.i.d. data distributions, but also due to distributional shifts or anomalous inputs, raising concerns about the reliability of the aggregation process. In this work, we propose a lightweight geometric signal to quantify the functional deviation of a client with respect to the global model. Instead of comparing model parameters or gradients, our approach measures how the local training of each client alters the activation-induced partition of the input space, evaluated on a shared probe set. This yields a permutation-invariant, interpretable metric of client-global divergence that captures differences in how data is processed by the model. We show that this signal effectively identifies clients that induce atypical functional changes, distinguishing stable yet heterogeneous clients from those whose updates significantly diverge from the global regime. As a result, the proposed metric provides a simple tool for monitoring client behavior and enabling risk-aware aggregation strategies in federated learning systems.
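
The idea of comparing activation-induced partitions on a shared probe set can be illustrated with a minimal sketch. It is not the authors' metric: the single ReLU layer, the pairwise co-membership score, and the names `relu_pattern` and `partition_divergence` are all assumptions made for illustration. Two probes count as lying in the same region when their activation patterns match, and the score is the fraction of probe pairs that the two models group differently, which does not change if hidden units are reordered.

```python
from itertools import combinations

def relu_pattern(weights, biases, x):
    """Binary on/off pattern of one ReLU layer for input x (hypothetical model)."""
    pattern = []
    for w_row, b in zip(weights, biases):
        pre_activation = sum(w * xi for w, xi in zip(w_row, x)) + b
        pattern.append(pre_activation > 0)
    return tuple(pattern)

def partition_divergence(model_a, model_b, probes):
    """Fraction of probe pairs that the two models group differently.

    Each model is a (weights, biases) tuple. Comparing co-membership
    instead of raw patterns keeps the score invariant to unit order.
    """
    pats_a = [relu_pattern(*model_a, x) for x in probes]
    pats_b = [relu_pattern(*model_b, x) for x in probes]
    pairs = list(combinations(range(len(probes)), 2))
    if not pairs:
        return 0.0
    disagreements = sum(
        (pats_a[i] == pats_a[j]) != (pats_b[i] == pats_b[j])
        for i, j in pairs
    )
    return disagreements / len(pairs)
```

In this toy setting a client whose hidden units are merely permuted scores 0, while a client whose biases shift the region boundaries scores above 0; a server could threshold such a score before aggregation.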

2605.22263 2026-05-22 cs.LG cs.AI

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

Hongbin Zhang, Chaozheng Wang, Kehai Chen, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

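AI summary: This paper proposes Direction-Adaptive Self-Distillation (DASD), which improves LLM reasoning through entropy-routed directional supervision. Its analysis finds that uniform teacher supervision suppresses exploration, and DASD achieves the best results across six mathematical reasoning benchmarks.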

Comments
Under Review

Abstract

On-policy self-distillation (OPSD) is an emerging LLM post-training paradigm in which the model serves as its own teacher: conditioned on privileged information such as a reference trace or hint, the same policy provides dense token-level supervision on its own rollouts. However, recent studies show that OPSD degrades complex reasoning by suppressing predictive uncertainty, which supports exploration and hypothesis revision. Our token-level analysis shows that this failure arises from applying a uniform direction of teacher supervision across tokens with different uncertainty levels: conformity to the privileged self-teacher suppresses exploration at high entropy, while deviation from the teacher degrades step accuracy at low entropy. Accordingly, we propose Direction-Adaptive Self-Distillation (DASD), which reframes privileged self-distillation from uniform teacher imitation into entropy-routed directional supervision: high-entropy tokens are pushed away from the privileged teacher to preserve exploration, while low-entropy tokens are pulled toward the teacher to stabilize step-level execution. Across six mathematical reasoning benchmarks, DASD achieves the best macro Avg@16 over strong RLVR and self-distillation baselines. Pass@k, reasoning-health, and generalization analyses show that these average gains come from preserving exploration without sacrificing step-level execution.
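
The entropy-routed rule can be sketched as a signed per-token divergence. This is only an illustration of the routing idea, not the DASD objective: the fixed entropy threshold, the KL direction, the unweighted sum, and the function names are assumptions, and the distributions are assumed to be strictly positive.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl(p, q):
    """KL(p || q); q is assumed strictly positive wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def directional_loss(student_dists, teacher_dists, threshold):
    """Signed sum of per-token teacher-student divergences.

    Low-entropy tokens add +KL, so minimising pulls them toward the teacher.
    High-entropy tokens add -KL, so minimising pushes them away.
    The threshold is a hypothetical hyperparameter.
    """
    total = 0.0
    for student, teacher in zip(student_dists, teacher_dists):
        sign = -1.0 if entropy(student) > threshold else 1.0
        total += sign * kl(teacher, student)
    return total
```

A real implementation would need to bound the push-away term, since an unclipped negative KL can be driven arbitrarily low.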

2605.22262 2026-05-22 cs.SD cs.LG eess.AS

Automatic Contextual Audio Denoising

Diep Luong, Konstantinos Drossos, Mikko Heikkinen, Tuomas Virtanen

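AI summary: This paper proposes an automatic contextual audio denoising method that infers the acoustic scene class to separate useful sound components from irrelevant ones, thereby improving denoising.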

Abstract

Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but is noise during a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components in another. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical for that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms other approaches across standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.

2605.22259 2026-05-22 cs.LG cs.CV cs.RO

An Evidence Hierarchy for Bayesian Object Classification via OSINT-Aided Heterogeneous Sensor Fusion

Jan Nausner, Michael Hubner

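AI summary: This paper proposes an OSINT-aided heterogeneous sensor fusion method that builds a new evidence hierarchy and combines contextual information with domain knowledge to improve the classification of CBRNE threats. Experiments show robustness to clutter and prior mismatch, with classification accuracy of up to 95%.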

Comments
6 pages, 1 figure; © 2026 The Authors. Submitted to the 2026 IEEE International Conference on Multisensor Fusion and Integration (MFI 2026). Under review

Abstract

Heterogeneous sensor fusion is vital for detecting, localizing, and classifying CBRNE threats. However, individual sensors are often only capable of detecting a subset of relevant threats with varying reliability or can even provide only indirect threat indications, making threat classification challenging. Furthermore, high clutter rates on the sensor side present a great challenge for fusion systems. Additionally, the limited availability of high-quality datasets hinders the advancement of learning-based detection and classification models in smart sensors. To mitigate these sensor-related shortcomings, a context-aware and domain knowledge-enhanced fusion process is proposed. First, a novel evidence hierarchy is established that enables modeling of direct, indicative, and contextual information. Second, contextual information about the environment is introduced into the fusion process by collecting, processing, and exploiting OSINT inputs. Third, all levels of the evidence hierarchy are used to craft a Bayesian threat type classification mechanism with domain knowledge-informed priors. The proposed methodology is evaluated in simulated scenarios, and the results demonstrate the benefit of the proposed fusion approach in terms of robustness to clutter and prior mismatch, with an overall classification accuracy of up to 95%.
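
The third step, Bayesian classification from several levels of evidence, can be sketched generically. This is a plain Bayes update and not the paper's mechanism: it assumes the evidence items are conditionally independent given the threat class, and the class names and likelihood values in the usage note are made up for illustration.

```python
def posterior(prior, likelihoods):
    """Posterior over threat classes after fusing several evidence items.

    prior:       dict mapping class -> P(class), e.g. domain-informed
    likelihoods: list of dicts mapping class -> P(observation | class),
                 one per evidence item (direct, indicative, contextual),
                 assumed conditionally independent given the class
    """
    unnormalised = dict(prior)
    for likelihood in likelihoods:
        for cls in unnormalised:
            unnormalised[cls] *= likelihood[cls]
    evidence = sum(unnormalised.values())
    if evidence == 0:
        raise ValueError("observations have zero probability under every class")
    return {cls: value / evidence for cls, value in unnormalised.items()}
```

For example, with a uniform prior over the hypothetical classes "chemical" and "radiological", a direct detection with likelihoods 0.8 and 0.2, and a contextual cue with likelihoods 0.6 and 0.4, the posterior for "chemical" is 0.24 / 0.28, roughly 0.86.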

2605.22258 2026-05-22 cs.CL

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin

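AI summary: This study proposes CITA, a framework for Chinese toxicity attacks that generates attack samples through implicit enhancement and obfuscation rewriting. It exposes the weaknesses of existing detectors on implicitly toxic content and shows that training a defense model on the generated data improves robustness.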

Comments
16 pages, 5 figures

Abstract

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.

2605.22257 2026-05-22 cs.LG cs.AI cs.LO

What are the Right Symmetries for Formal Theorem Proving?

Krzysztof Olejniczak, Radoslav Dimitrov, Xingyue Huang, Bernardo Cuenca Grau, Jinwoo Kim, İsmail İlkan Ceylan

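AI summary: This paper asks which symmetries formal theorem proving should respect. It proposes rewriting categories, a category-theoretic framework for formalizing proof equivariance and success invariance, and uses test-time methods to improve the robustness and performance of LLM-based theorem provers.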

Abstract

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.
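
A toy sketch of test-time aggregation over equivalent rewritings follows. It is not the paper's method: `prover`, `rewrites`, and uniform sampling are hypothetical stand-ins, and the sketch assumes every member of an equivalence class is expanded to the same set of rewrites. Under that assumption each attempt depends only on the set and not on which formulation was supplied, so the success probability is the same for every member.

```python
import random

def solve_with_rewrites(prover, rewrites, budget, rng):
    """Spend `budget` proof attempts across equivalent formulations.

    prover:   callable (statement, rng) -> bool, a hypothetical stand-in
              for one sampled proof attempt by an LLM-based prover
    rewrites: list of semantically equivalent statements
    budget:   total number of attempts allowed (fixed inference budget)
    """
    for _ in range(budget):
        statement = rng.choice(rewrites)  # uniform over the rewrite set
        if prover(statement, rng):
            return True
    return False
```

A prover that only ever succeeds on one formulation still solves the problem whenever that formulation appears among the rewrites, whichever formulation the user originally supplied.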

2605.22249 2026-05-22 cs.CV

D3Seg: Dependency-Aware Diffusion for Brain Tumor Segmentation with Missing Modalities

Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan

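AI summary: This paper proposes D3Seg, which addresses brain tumor segmentation with missing MRI modalities through multi-hop modality graph fusion, a lightweight diffusion-based imputation mechanism, and probability-space decision refinement, improving segmentation performance while maintaining computational efficiency.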

Abstract

Accurate brain tumor segmentation using multiparametric MRI is critical for effective treatment planning. However, in clinical settings, complete acquisition of all MRI sequences is not always possible. The absence of certain MRI modalities results in substantial performance degradation in existing segmentation methods, which typically rely on naive feature concatenation or direct fusion strategies. To address this limitation, we propose a novel segmentation model, D3Seg, which is designed to maintain stable performance under missing-modality settings. D3Seg introduces Multi-hop Modality Graph Fusion (MMGF) to model higher-order inter-modality dependencies, a lightweight diffusion-based imputation mechanism to compensate for missing T1ce representations in latent space, and probability-space decision refinement to mitigate dominant-class overconfidence and improve delineation of underrepresented tumor subregions. Extensive evaluation on the BraTS 2023 dataset demonstrates that our D3Seg model consistently improves segmentation performance under missing-modality configurations. The proposed model achieves approximately 1.5-2.0% Dice improvement on enhancing tumor (ET) and around 1.0% on tumor core (TC) across multiple missing-modality configurations compared to the current state-of-the-art model, while maintaining computational efficiency.

2605.22248 2026-05-22 cs.LG

No Epoch Like the Present: Robust Climate Emulation Requires Out-of-Distribution Generalisation

Bradley Stanley-Clamp, Anson Lei, Hannah M. Christensen, Ingmar Posner

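AI summary: This paper studies the importance of out-of-distribution generalisation in climate emulation. It proposes a new evaluation framework that uses seasonal variation to test emulator robustness and shows that physically motivated decompositions improve out-of-distribution performance without substantially sacrificing in-distribution performance.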

Comments
36 pages, 12 figures

Abstract

Climate emulation is an out-of-distribution (OOD) projection task. This is precisely the challenge where modern Machine Learning (ML) methods are most prone to failure. Consequently, while current ML emulators trained on present climate achieve high in-distribution performance, their future reliability under the inevitable distribution shifts of a changing climate remains a critical, poorly understood blind spot. Addressing this challenge requires a fundamental shift in how we understand, evaluate, and design climate emulators. In this work, we first confirm that climate change drives a statistically significant and progressively growing shift in atmospheric state distributions, rendering standard evaluation protocols insufficient. We empirically establish that seasonal variation serves as an effective proxy for these long-term climate shifts, providing access to real-world distribution shifts without recourse to heuristics like synthetic perturbations. Motivated by this link, we introduce a novel evaluation framework that leverages seasonal shifts as a rigorous, zero-overhead testbed for emulator robustness. Our systematic characterisation confirms that current state-of-the-art hybrid-ML emulators degrade significantly under these realistic shifts. Finally, we chart a path forward by identifying compositional generalisation, the ability to form novel combinations from observed elementary components, as a principled route towards robust climate emulation. We demonstrate that physically motivated decompositions substantially improve OOD performance with only modest trade-offs against in-distribution performance, providing an avenue towards ML-driven climate emulators robust to an unknown future.

2605.22247 2026-05-22 cs.CL

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei, Kfir Bar

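AI summary: This paper presents IdioLink, a retrieval benchmark that tests whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased form, revealing the shortcomings of current models in idiom-aware semantic retrieval.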

Abstract

Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.

2605.22243 2026-05-22 cs.LG cs.AI stat.AP

Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

Junyu Yan, Damian Machlanski, Kurt Butler, Panagiotis Dimitrakopoulos, Ewen M Harrison, Bruce Guthrie, Sotirios A Tsaftaris

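AI summary: This paper proposes an explainable AI recommender that uses a data-driven approach to improve the predictive performance of existing interpretable statistical models. Its main contribution is to use explainable AI techniques to produce three types of recommendation that improve both predictive ability and transparency.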

Comments
41 pages, 7 figures

Abstract

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH incorporating recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, including non-linear terms for two features, and including 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and so did calibration (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve the process of developing, and the performance of high-dimensional transparent predictive models.
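
The C-index reported above measures how well predicted risks order observed event times. A minimal sketch of Harrell's concordance index is given below for orientation. It is the textbook pairwise definition rather than the authors' evaluation code: the function name is illustrative, and tied event times are simply treated as not comparable.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed time for each subject (event or censoring)
    events: 1 if the event was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected event)

    A pair (i, j) is comparable when subject i had the event strictly
    before subject j's observed time. It is concordant when i also has
    the higher predicted risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported change from 0.805 to 0.815 is a gain of one percentage point in correctly ordered comparable pairs.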

2605.22238 2026-05-22 cs.AI

Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

H. C. Ekne

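AI summary: This paper studies how large language models perform in a live strategy environment. It finds that performance depends on objective tracking, execution conversion, cost, and runtime reliability, and it supports evaluating LLMs as components in bounded workflows rather than as isolated benchmark subjects.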

Comments
13 pages, 7 figures. Code and tracked notes: https://github.com/hcekne/risk-game . Public runtime artifact index: https://github.com/hcekne/risk-game/blob/main/docs/article-plans/public_experiment_artifacts.md

Abstract

Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p ≈ 1.5 × 10^-5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality (p ≈ 0.821), which indicates that much of the earlier provider spread came from end-to-end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.
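
To give a feel for how unlikely 20 wins in 32 four-player games would be under an equal-strength null, the sketch below computes a one-sided binomial tail with a per-game win probability of 1/4. This is an illustrative simplification: the abstract reports a test on the pooled winner distribution across all four providers, whose exact form is not stated here, so the number produced below is not expected to match the reported p-value.

```python
from math import comb

def binomial_upper_tail(wins, games, p):
    """P(X >= wins) for X ~ Binomial(games, p)."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

# Assumed null: each of four providers wins any given game with probability 1/4.
p_value = binomial_upper_tail(20, 32, 0.25)
print(f"one-sided binomial tail: {p_value:.2e}")
```

The tail is small because the null expects about 8 wins out of 32 with a standard deviation near 2.4, which puts 20 wins close to five standard deviations above the mean.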

2605.22235 2026-05-22 cs.LG math.DS

Holomorphic Neural ODEs with Kolmogorov-Arnold Networks for Interpretable Discovery of Complex Dynamics

Bhaskar Ranjan Karn, Dinesh Kumar

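AI summary: This paper proposes a holomorphic Neural ODE framework based on Kolmogorov-Arnold Networks for discovering interpretable governing equations in complex dynamical systems. It preserves holomorphic structure through a differentiable regularization and validates the approach on several complex dynamical systems.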

Comments
16 pages. Comments are welcome

Abstract

Complex dynamical systems governed by holomorphic maps such as z^2 + c exhibit fractal boundaries with extreme sensitivity to initial conditions. Accurately modelling these structures from data requires methods that respect the underlying complex-analytic geometry, yet Multi-Layer Perceptrons (MLPs) within Neural Ordinary Differential Equations (Neural ODEs) lack complex-analytic priors, violate the Cauchy-Riemann conditions, and function as opaque approximators incapable of yielding governing equations. We introduce Holomorphic KAN-ODE, a framework that replaces the MLP with a Kolmogorov-Arnold Network (KAN) whose learnable B-spline activations reside on network edges, and incorporates Cauchy-Riemann equations as a differentiable regularization to preserve holomorphic structure. We evaluate on six families of complex dynamical systems spanning polynomial and transcendental classes. With only 280 parameters (16× fewer than the MLP baseline), the network achieves velocity-field R^2 > 0.95 on all six systems, correctly identifies all six governing symbolic families through automatic spline-to-formula fitting, and reconstructs Julia set fractal boundaries with up to 98.0% agreement. Crucially, the model exhibits only 4% MSE degradation under 10% observation noise versus 15.2× for MLPs, and achieves 90.4% improvement in transfer learning from quadratic to cubic dynamics. While the MLP attains lower pointwise reconstruction error due to its larger capacity, the KAN uniquely provides interpretable symbolic equations, enforced holomorphic structure, and superior noise resilience, capabilities that are entirely absent in black-box architectures. These results establish KANs as a parameter-efficient, interpretable alternative to MLPs for physics-informed discovery of holomorphic dynamics.
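
The Cauchy-Riemann regularization can be illustrated with a small numerical sketch. Writing f = u + iv, holomorphy requires u_x = v_y and u_y = -v_x, and the penalty is the mean squared violation of those two equations over sample points. The sketch uses central finite differences as a stand-in for the automatic differentiation a trainable model would use, and the function names and step size are assumptions.

```python
def cr_residual(f, z, h=1e-5):
    """Squared Cauchy-Riemann residual of f at the complex point z."""
    # Central differences along the real and imaginary axes.
    f_x = (f(z + h) - f(z - h)) / (2 * h)
    f_y = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    u_x, v_x = f_x.real, f_x.imag
    u_y, v_y = f_y.real, f_y.imag
    return (u_x - v_y) ** 2 + (u_y + v_x) ** 2

def cr_penalty(f, points):
    """Mean residual over sample points; zero for a holomorphic f."""
    return sum(cr_residual(f, z) for z in points) / len(points)

points = [0.3 + 0.2j, -0.5 + 0.1j, 0.0 - 0.7j]
holomorphic = lambda z: z * z + (0.3 + 0.1j)   # of the form z^2 + c
non_holomorphic = lambda z: z.conjugate()       # violates both equations
```

Here `cr_penalty(holomorphic, points)` is zero up to rounding, while `cr_penalty(non_holomorphic, points)` is 4, since the conjugate map has u_x = 1 and v_y = -1.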

2605.22231 2026-05-22 cs.CV

REACH: Hand Pose Estimation from Room Corners

Shu Nakamura, Ryo Kawahara, Genki Kinoshita, Ryosuke Hirai, Yasutomo Kawanishi, Shohei Nobuhara, Ko Nishino

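AI summary: This paper proposes a new 3D hand pose estimator that accurately recovers the shape and pose of people's hands from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. The core approach is to fully exploit hand-body coordination, its temporal progression, and multiview observations through a new Transformer-based model that models hand and body configurations via correlations between per-view tokens and exploits temporal coordination autoregressively. It also introduces REACH, a new dataset for training and testing the method. REACH is the first large-scale hand pose dataset to capture accurate hand movements of 50 participants across a variety of daily activities. Extensive experiments, including comparisons with existing methods, show that the model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially toward "in-the-wild" continuous human behavior analysis.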

Abstract

We introduce a novel 3D hand pose estimator that can accurately recover the shape and pose of people's hands in a room from afar, typically from fixed cameras at room corners, in extremely low-resolution and frequently occluded views. Our key idea is to fully leverage hand-body coordination, its temporal progression, and multiview observations. We achieve this with a novel Transformer-based model, in which hand and body configurations are modeled through correlations between their visual features expressed as per-view tokens, and their temporal coordination is exploited in an autoregressive manner. We introduce a novel dataset, which we refer to as REACH, Room-Environment dataset Annotated with Chest cameras for Hand pose estimation, to train and test our method. REACH is a first-of-its-kind large-scale hand pose dataset that captures accurate hand movements of 50 participants across a wide variety of daily activities. In order to avoid interfering with natural movements while annotating the hands with accurate shape and pose, we leverage concealed chest cameras. Through extensive experiments, including comparative studies with existing methods, we show that our model, REACH-Net, achieves highly accurate 3D hand pose estimation from afar. These results broaden the horizon of 3D hand pose estimation, especially towards "in-the-wild" continuous human behavior analysis.

2605.22228 2026-05-22 cs.CL

GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang, Yukun Ma

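AI summary: This paper proposes GHI, a framework that builds an incidence-based structural reasoning layer on a bipartite topology to handle the different structural signals in aspect-based sentiment analysis through a unified interface. Experiments show that GHI outperforms existing methods on several standard benchmarks and performs well with comparatively few parameters.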

Comments
15 pages, 8 figures, 7 tables
AI Chinese abstract

基于方面的情感分析(ABSA)要求模型将情感证据绑定到正确的方面,使其成为细粒度结构推理的自然测试场。我们介绍了GHI,一种基于条件超图关联(incidence)的Graphormer框架,其设计为建立在二分拓扑之上的、基于关联的结构推理层。GHI将多样化的语言和语义证据表示为token-超边关联关系,允许通过统一接口整合不同的结构信号。在六个标准ABSA基准上的广泛实验表明,GHI在SemEval领域优于所有基线,多种子评估显示其相对于强DeBERTa基线有稳定的提升。进一步实验显示,仅使用247M参数,GHI在ISE基准上接近基于11B Flan-T5的方法的性能。此外,GHI在具有挑战性的ARTS数据集上表现出强鲁棒性,在传统模型性能退化的情况下仍保持高度竞争力。这些结果表明,对于细粒度任务,紧凑的结构推理仍然是规模驱动方法之外有价值的替代方案。

English abstract

Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural reasoning layer built on a bipartite topology. GHI represents diverse linguistic and semantic evidence as token--hyperedge incidence relations, allowing different structural signals to be incorporated through a unified interface. Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over strong DeBERTa. Further experiments show that with only 247M parameters, GHI approaches the performance of 11B Flan-T5 based methods on the ISE benchmark. Moreover, it demonstrates strong robustness on the challenging ARTS datasets, maintaining highly competitive performance where traditional models degrade. These results demonstrate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.

2605.22223 2026-05-22 cs.LG

How Many Different Outputs Can a Transformer Generate?

变换器能生成多少种不同的输出?

Maxime Meyer, Mario Michelessa, Caroline Chaux, Vincent Y. F. Tan

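AI summary: This paper studies how a handful of characteristics of a transformer's architecture can closely predict, both qualitatively and quantitatively, the number of different sequences it can output. It gives an upper bound that depends on prompt length and is shown empirically to be tight up to a factor of less than 10 across architectures and model sizes. The analysis also explains previously observed empirical failures of transformers on simple sequence tasks such as copying and cramming.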

Comments
ICML 2026 Spotlight
AI Chinese abstract

我们研究如何仅利用变换器架构中的少量特性来紧密预测其能生成的不同序列数量,包括定性和定量分析。我们提供一个依赖于提示长度的上限,实验证明在不同架构和模型大小下,该上限紧致于10倍以内。我们的分析还为之前在简单序列任务(如复制和填塞)中观察到的变换器经验性失败提供了理论解释。形式上,我们证明了(i)可访问序列的最大长度(即变换器能为某些提示生成的序列)与提示长度成线性增长,(ii)超过临界阈值后,可访问序列的比例随序列长度呈指数衰减,(iii)提示长度与可访问序列长度之间的线性系数具有理论上限。值得注意的是,这些结果即使在无界上下文和计算时间下也成立。

English abstract

We study how we can leverage only a handful of characteristics of a transformer's architecture to closely predict the number of different sequences it can output, both qualitatively and quantitatively. We provide an upper bound depending on the length of the prompt, which we show empirically to be tight up to a factor less than 10, across architectures and model sizes. Our analysis also provides a theoretical explanation for previously observed empirical failures of transformers on simple sequence tasks, such as copying and cramming. Formally, we prove that (i) the maximal length of accessible sequences (those that the transformer can output for some prompt) grows linearly with the prompt length, (ii) beyond a critical threshold, the proportion of accessible sequences decays exponentially with sequence length, and (iii) the linear coefficient relating prompt length to accessible sequence length admits a theoretical upper bound. Notably, these results hold even with unbounded context and computation time.

2605.22221 2026-05-22 cs.LG cs.AI cs.LO

Can Transformers Learn to Verify During Backtracking Search?

Transformer能否在回溯搜索中学习验证?

Yin Jun Phua, Tony Ribeiro, Tuan Nguyen, Katsumi Inoue

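AI summary: This paper studies whether Transformers can verify during backtracking search. It identifies two failure modes of conventional training on cumulative traces, scattered retrieval and history entanglement, and addresses them with localization and Selective State Attention (SSA). Experiments on 3-SAT, graph coloring, Blocks World, and backtracking parsing validate SSA.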

AI Chinese abstract

回溯搜索是经典约束求解器、规划器和定理证明器的基础。最近的基于Transformer的推理系统探索其自身中间步骤的搜索树。一种常见的训练方法是在离线求解器轨迹上拟合自回归的下一个令牌损失。模型的输入在每一步都是所有先前决策的累积轨迹。最优的继续或回溯预测器仅依赖于当前搜索状态,因为到达相同状态的两条轨迹允许相同的可行延续。我们证明,在累积轨迹上训练的仅解码器Transformer在两种方式上未能满足这一要求:轨迹可能将状态特征分散到许多位置(分散检索),并且预测器可能基于轨迹而非状态做出判断(历史纠缠)。我们通过局部化解决分散检索问题,这是一种轨迹级的修复方法,将每个决策块重写以在局部暴露状态特征。我们通过选择性状态注意力(SSA)解决历史纠缠问题,这是一种固定注意力掩码,可以在不修改训练数据、目标或参数的情况下从结构上强制基于状态的决策。我们专注于在传播暴露出矛盾之后进行的反应式验证。我们在3-SAT、图着色、Blocks World和回溯解析中测试SSA。在仅先前历史不同的相同状态对上,SSA给出相同的决策,而在累积轨迹上训练的因果基线则不会。我们的贡献是针对序列化轨迹数据的Transformer行为诊断,配以结构化修复。在自身推理步骤上进行搜索的预训练语言模型可能面临相同的失败。我们的分析表明,推理时的上下文清除是一种无需重新训练即可实现同样隔离的候选方法。

English abstract

Backtracking search underlies classical constraint solvers, planners, and theorem provers. Recent transformer-based reasoning systems explore search trees over their own intermediate steps. A common training recipe fits an autoregressive next-token loss on offline solver traces. The model's input at each step is a cumulative trace of all prior decisions. The optimal continue-or-backtrack predictor depends only on the current search state, since two trajectories reaching the same state admit the same viable continuations. We show that decoder-only transformers trained on cumulative traces fail this requirement in two ways: the trace can scatter state features across many positions (scattered retrieval), and the predictor can condition on the trajectory rather than the state (history entanglement). We address scattered retrieval with localization, a trace-level fix that rewrites each decision block to expose state features locally. We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally without modifying training data, objective, or parameters. We focus on reactive verification, after propagation has exposed a contradiction. We test SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not. Our contribution is a diagnostic of transformer behavior on serialized trajectory data, paired with a structural fix. Pretrained language models that search over their own reasoning steps may face the same failure. Our analysis opens up inference-time context clearing as a candidate way to apply the same isolation without retraining.
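
The abstract describes SSA as a fixed attention mask but does not give its layout. The following is a minimal sketch of one possible mask, assuming each token carries the id of its decision block (the hypothetical `block_ids` list) and that, after localization, a block already exposes the state features it needs. It is an illustration of the idea, not the authors' implementation.

```python
def state_only_mask(block_ids):
    """Build a boolean attention mask for a serialized search trace.

    mask[i][j] is True when position i may attend to position j.
    Assumption (not from the paper): a token may only attend to
    earlier-or-equal positions that belong to its own decision block,
    so prior history outside the block cannot influence the decision.
    """
    n = len(block_ids)
    return [
        [j <= i and block_ids[j] == block_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Two decision blocks: tokens 0-1 belong to block 0, tokens 2-4 to block 1.
mask = state_only_mask([0, 0, 1, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because the mask is fixed and depends only on block boundaries, two traces that end in the same state block present the same visible context to the decision token, which is the property the abstract attributes to SSA.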

2605.22219 2026-05-22 cs.AI

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

SGR-Bench: 对状态门控检索的搜索代理基准测试

Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie, Yun Ma

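AI summary: This paper introduces SGR-Bench, a benchmark for evaluating state-gated retrieval that contains 100 expert-curated tasks. By comparing explicit and implicit guidance, it reveals the main challenges search agents face on state-gated retrieval tasks.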

Comments
Work in Progress. 23 pages, 7 figures, preprint
AI Chinese abstract

近年来,大语言模型和工具使用代理的进步扩大了可基准测试的网络任务范围。然而,一个重要类别的专门检索任务仍缺乏充分描述。在许多专门的数据检索网站上,包含答案的证据只有在通过过滤器、视图、层次结构或范围等设置正确的网站特定检索状态后才能被访问。我们称这种能力为状态门控检索(SGR)。我们引入了SGR-Bench,一个针对此设置的基准数据集,包含100个专家curated的任务,涵盖六个来源家族和12个公开数据生态系统。每个任务都需要发现正确的网站并配置其网站特定的检索状态以生成结构化答案。SGR-Bench将约束引导和目标导向的同一底层问题的两种形式配对,使显式和隐式指导在状态门控检索中的比较得以控制。我们评估了八个基于CLI的代理LLM系统和三个商业搜索代理产品。在SGR-Bench上,最强的系统仅达到66.18%的项目级F1,而行级F1仍显著较低。对156条可分析的失败CLI轨迹的手动审核显示了原因:代理通常到达相关网页源,但建立了错误的网站特定检索状态。检索范围漂移(37.2%)和标准不匹配(27.6%)占主导地位,而最终答案组成仅占10.3%。数据集和单案例评估说明文件可在https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH获取。

English abstract

Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit guidance for state-gated retrieval. We evaluate eight CLI-based agentic LLM systems and three commercial search-agent products. On SGR-Bench, the strongest system reaches only 66.18% item-level F1, while row-level F1 remains much lower. A manual audit of 156 analyzable failed CLI trajectories shows why: agents often reach a relevant web source, but establish the wrong site-specific retrieval state. Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate, whereas final answer composition accounts for only 10.3%. The dataset and single-case evaluation instructions are available at https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH.
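
The abstract reports item-level and row-level F1 without defining them. As a rough illustration of how a set-based F1 over structured answers can be computed, the sketch below treats a predicted answer and a gold answer as sets of items. The function name `set_f1` and the exact matching rule are assumptions, not the benchmark's official scorer.

```python
def set_f1(predicted, gold):
    """F1 between a predicted and a gold collection of answer items.

    Assumption: items match only on exact equality. The real benchmark
    may normalize values or match rows field by field.
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # both empty: treat as a perfect match
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


print(set_f1(["a", "b", "c"], ["b", "c", "d"]))
```

A row-level variant could apply the same function to whole rows represented as tuples, which is stricter because every field in a row must be right for the row to count.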

2605.22217 2026-05-22 cs.LG cs.CL

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

生存或崩溃:自我博弈强化学习中数据门控与奖励基础的不对称作用

Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang

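AI summary: This paper studies the asymmetric roles of data gating and reward grounding in self-play reinforcement learning. It finds that data gating is the key factor for maintaining stability, that no reward signal alone can guarantee stability once the gate is removed, and it identifies the 'Grounded Proposer Paradox'.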

AI Chinese abstract

自我博弈强化学习通过语言模型自行生成任务进行训练,实现提出者与求解者的共同进化,无需人工标注。最近的系统报告了显著的推理提升,但崩溃和不稳定性普遍存在且理解不足。主流观点将其视为奖励设计问题,但我们认为自我博弈的稳定性由两个不同的调节机制决定:数据层面的门控,决定哪些由提出者生成的任务进入训练池,以及奖励信号,更新已准入任务的策略。通过在Python输出预测任务和确定性DSL双胞胎任务上的受控实验,我们发现这两个机制是不对称的。严格的数据门控在我们测试的每种奖励变体下都能保证稳定性,包括没有地面真实信息访问的自一致性奖励;而一旦移除门控,没有任何奖励变体足以保证稳定性。这种不对称性揭示了我们称之为'基础提出者悖论'的反直觉耦合:具有地面真实信息访问的提出者在与自一致性求解器配对时,会比无地面真实信息的提出者更快崩溃,因为训练集中在形成最快路径到虚假自一致性吸引子的干净任务上。将二进制门控替换为连续严格性参数ε进一步揭示了两阶段相变:训练侧指标在低ε时解耦,而验证准确率在ε远高于时才保持。数据层面的门控,而非奖励校准,是自我博弈稳定性的绑定约束。

English abstract

Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth; while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.

2605.22213 2026-05-22 cs.AI

Towards a compositional semantics for quantitative confidence assessment in assurance arguments

迈向保证论证中定量置信度评估的组合语义

Benjamin Herd, Jessica Kelly, Jan Sabsch, Lydia Gauerhof

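AI summary: This paper proposes a compositional semantics for quantitative confidence assessment in assurance arguments. It represents argument elements as Subjective Logic opinions and maps the relations between elements to Subjective Logic operators, which enables confidence propagation.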

Journal ref
Proceedings of the 21st European Dependable Computing Conference (EDCC 2026)
Comments
Accepted to the 21st European Dependable Computing Conference (EDCC 2026), Canterbury, UK
AI Chinese abstract

保证论证提供了一种清晰且结构化的方式来解释为什么利益相关者应相信系统满足某些属性,然而广泛使用的记法,例如目标结构记法(GSN),通常缺乏推导保证信心的操作语义。现有方法解决结构和正确性,但主要在真值上推理,而不是在主张证明中的信心上。主观逻辑(SL)提供了一种信念、不信和不确定的计算,具有结合意见的运算符,使在不完整、冲突或主观证据下信心传播成为可能。然而,现有的基于SL的方法并未提供一种统一的、组合的语义,该语义涵盖所有论证元素和关系,以实现总体的信心评估。本文提出了一种信心语义,将论证元素表示为SL意见,并将元素间的关系映射到SL运算符,从而有效地将论证转化为可分析的信心网络。该方法提供了显式的担保,有原则的上下文处理,保留了来源,并与GSN兼容,并通过一个示例保证信心评估提供实用指导。

English abstract

Assurance arguments provide a clear and structured way to explain why stakeholders should trust that a system satisfies certain properties, yet widely used notations, e.g., Goal Structuring Notation (GSN), typically lack an operational semantics for deriving assurance confidence. Existing approaches address structure and soundness but largely reason over truth values, not over confidence in the justification of claims. Subjective Logic (SL) offers a calculus of belief, disbelief, and uncertainty with operators for combining opinions, enabling confidence propagation under incomplete, conflicting, or subjective evidence. However, existing SL-based approaches do not provide a uniform, compositional semantics that covers all argument elements and relations to enable overall confidence assessment. We propose a confidence semantics that represents argument elements as SL opinions and maps relations between elements to SL operators modelling how confidence flows, effectively turning the argument into an analyzable confidence network. The approach provides explicit warrants, principled handling of context, preserved provenance, and compatibility with GSN, along with practical guidance using an exemplary assurance confidence assessment.
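
To make the notion of an SL opinion concrete, here is a minimal sketch of a binomial opinion (belief, disbelief, uncertainty, base rate) together with cumulative fusion, one of the standard Subjective Logic operators for combining independent evidence. The abstract does not say which operator is mapped to which argument relation, so the choice of cumulative fusion and the class and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial Subjective Logic opinion; belief + disbelief + uncertainty = 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self) -> float:
        # Belief plus the share of uncertainty assigned by the base rate.
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fusion(x: Opinion, y: Opinion) -> Opinion:
    """Fuse two opinions built from independent evidence.

    Simplifying assumptions: both opinions share one base rate, and at
    least one of them is uncertain (the dogmatic limit case is not handled).
    """
    if x.base_rate != y.base_rate:
        raise ValueError("this sketch assumes equal base rates")
    denom = x.uncertainty + y.uncertainty - x.uncertainty * y.uncertainty
    if denom == 0:
        raise ValueError("both opinions are dogmatic (uncertainty 0)")
    return Opinion(
        belief=(x.belief * y.uncertainty + y.belief * x.uncertainty) / denom,
        disbelief=(x.disbelief * y.uncertainty + y.disbelief * x.uncertainty) / denom,
        uncertainty=(x.uncertainty * y.uncertainty) / denom,
        base_rate=x.base_rate,
    )


evidence_a = Opinion(0.6, 0.1, 0.3)
evidence_b = Opinion(0.4, 0.2, 0.4)
print(cumulative_fusion(evidence_a, evidence_b))
```

Fusing two uncertain opinions yields less uncertainty than either input, which matches the intuition that two independent pieces of evidence for a claim should raise confidence in it.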

2605.22211 2026-05-22 cs.AI

CLORE: Content-Level Optimization for Reasoning Efficiency

CLORE:面向推理效率的内容级优化

Yuyang Wu, Qiyao Xue, Guanxing Lu, Weichen Liu, Zihan Wang, Manling Li, Olexandr Isayev

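AI summary: This paper proposes CLORE, a framework that improves the reasoning efficiency of large language models by editing correct on-policy rollouts. An external augmentation model deletes redundant, illegible, or irrelevant content while preserving the final answer, and the resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. Experiments on five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and is compatible with GRPO, DAPO, Training Efficient, and ThinkPrune.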

Comments
9 pages, 9 figures
AI Chinese abstract

强化学习后训练已提高了大语言模型的推理能力,但往往产生不必要的长、重复或语义模糊的推理轨迹。现有高效推理方法主要通过显式预算或长度感知奖励调节响应长度,导致中间推理内容弱监督。我们提出CLORE,一种内容级优化框架,通过编辑正确在线轨迹来提高推理效率。CLORE使用外部增强模型删除重复段落、不可读或无关内容以及解决方案确定后的冗余推理,同时保留最终答案。所得到的增强-原始对通过辅助参考-free DPO目标与标准策略梯度训练优化。通过限制增强到正确轨迹并执行局部删除,CLORE使编辑轨迹接近策略分布并减轻离策略不匹配。在DeepSeek-R1-Distill-Qwen-7B和Qwen2.5-Math-7B五个数学推理基准上的实验表明,CLORE提高了准确性和效率的平衡,并与GRPO、DAPO、Training Efficient和ThinkPrune兼容。内容级分析进一步表明,CLORE减少了重复推理、不可读内容和答案后探索,支持内容级监督作为长度级控制的互补方向。

English abstract

Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented--original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy--efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.
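
The abstract names an auxiliary reference-free DPO objective over augmented-original pairs but gives no formula. One common reference-free form is a logistic loss on the scaled difference of the policy's sequence log-probabilities, sketched below. The function name, the `beta` scale, and the absence of length normalization are all assumptions and may differ from what CLORE actually uses.

```python
import math


def reference_free_preference_loss(logp_preferred, logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (logp_preferred - logp_rejected))).

    logp_preferred: policy log-probability of the edited (augmented) rollout.
    logp_rejected:  policy log-probability of the original rollout.
    beta is a hypothetical temperature. No reference model appears in the loss.
    """
    margin = beta * (logp_preferred - logp_rejected)
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss shrinks as the policy assigns more probability to the edited rollout.
print(reference_free_preference_loss(-40.0, -55.0))
print(reference_free_preference_loss(-55.0, -40.0))
```

In a setup like the one described, this term would be added to the policy-gradient loss with some weight, so that the usual reward still drives correctness while the preference term nudges the policy toward the shorter, cleaner trace.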

2605.22209 2026-05-22 cs.CV

GALAR-TemporalNet v2: Anatomy-Guided Dual-Branch Temporal Classification with Bidirectional Mamba and Dual-Graph GCN for Video Capsule Endoscopy -- after competition results

GALAR-TemporalNet v2: 基于解剖引导的双分支时间分类方法,结合双向Mamba和双图GCN用于视频胶囊内镜

Jiye Won, Seangmin Lee, Soon Ki Jung

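AI summary: This work addresses multi-label temporal classification in video capsule endoscopy, which requires simultaneously localizing 8 anatomical regions and detecting 9 pathological findings. It proposes GALAR-TemporalNet v2, which combines windowed self-attention, a Dual-Graph GCN, and Bidirectional Mamba to tackle class imbalance, long-range temporal dependencies, and pathology-anatomy entanglement, and achieves higher mAP on the RARE-VISION test set.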

Comments
7 pages, 2 figures. Post-competition preprint for the ICPR 2026 RARE-VISION Challenge
AI Chinese abstract

视频胶囊内镜(VCE)提出了具有挑战性的多标签时间分类问题,要求在数万帧中同时定位8个解剖区域并检测9种病理发现。我们提出了GALAR-TemporalNet v2,一种分层时间模型,旨在解决三个核心挑战:极端类别不平衡、长程时间依赖性和病理-解剖纠缠。我们的架构结合了窗口自注意力进行局部建模,双图GCN用于全局帧关系,以及双向Mamba用于选择性边界上下文编码。新颖的解剖原型残差路径将病理偏差信号与正常器官外观分离,帧级GCN跳跃连接稳定了视觉上易混淆的稀有类别的训练。竞赛版本的GALAR-TemporalNet在RARE-VISION测试集上实现了整体mAP@0.5为0.2644和mAP@0.95为0.2353。在竞赛后,重新设计的GALAR-TemporalNet v2,结合了重构的病理分支、优化的损失函数和扩展的后处理,将这些结果提升到mAP@0.5为0.3409和mAP@0.95为0.3333。

English abstract

Video Capsule Endoscopy (VCE) poses a challenging multi-label temporal classification problem, requiring simultaneous localization of 8 anatomical regions and detection of 9 pathological findings across tens of thousands of frames. We present GALAR-TemporalNet v2, a hierarchical temporal model that addresses three core challenges: extreme class imbalance, long-range temporal dependencies, and pathology-anatomy entanglement. Our architecture combines windowed self-attention for local modeling, a Dual-Graph GCN for global frame relationships, and Bidirectional Mamba for selective boundary context encoding. A novel anatomy prototype residual pathway decouples pathological deviation signals from normal organ appearance, and a frame-level GCN skip connection stabilizes training of visually confusable rare classes. The competition version, GALAR-TemporalNet, achieved an overall mAP@0.5 of 0.2644 and mAP@0.95 of 0.2353 on the RARE-VISION test set. Following the competition, the redesigned GALAR-TemporalNet v2, which incorporates a restructured pathology branch, refined loss functions, and extended post-processing, improved these results to mAP@0.5 of 0.3409 and mAP@0.95 of 0.3333.

2605.22205 2026-05-22 cs.AI cs.LG

Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

技能编织:通过模块化技能包实现高效的LLM改进

Zhuo Li, Guodong Du, Zesheng Shi, Weiyang Guo, Weijun Yao, Yuan Zhou, Jiabo Zhang, Jing Li

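AI summary: This work proposes SkillWeave, a framework that uses modular skillpacks to let LLMs specialize by domain under a fixed memory budget, with SkillZip compression for efficient deployment. Experiments show strong results on multi-task and agentic benchmarks, with up to 4x speedup.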

Comments
Accepted by ACL2026
AI Chinese abstract

大型语言模型日益需要在多样化领域中进行专门化,但现有方法难以在多领域能力与严格的内存和推理约束之间取得平衡。本文介绍了SkillWeave,一种模块化改进框架,使LLM能够在固定内存预算下实现专业化。SkillWeave将通用模型的全部能力划分为技能包——轻量、领域特定的delta模块——以重新组织和细化模型的内部知识。为了高效部署,SkillWeave集成了SkillZip将技能包压缩为紧凑且推理友好的格式,从而在低延迟执行下实现强大的多领域性能。在多任务和代理基准上,一个9B的SkillWeave模型优于多个基线,并甚至超越了32B的单体LLM,同时实现了高达4倍的速度提升。

English abstract

Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWeave, a modular improvement framework that enables LLMs to specialize under fixed memory budgets. SkillWeave partitions the full capabilities of a general-purpose model into skillpacks (lightweight, domain-specific delta modules) that reorganize and refine the model's internal knowledge. For efficient deployment, SkillWeave integrates SkillZip to compress skillpacks into a compact, inference-ready format, enabling strong multi-domain performance with low-latency execution. On multi-task and agentic benchmarks, a 9B SkillWeave model outperforms several baselines and even surpasses a 32B monolithic LLM, while achieving up to 4x speedup.

2605.22204 2026-05-22 cs.CL

Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus

阿拉伯女性社会赋权与福祉的受众参与:一个十年语料库

Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim

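AI summary: This paper presents the Arabic Women and Society Corpus, which contains 252,487 public Arabic Facebook posts from 2013 to 2024 on women's empowerment and social wellbeing. Processed with an automated pipeline, it provides data for large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects.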

AI Chinese abstract

本文介绍了阿拉伯女性与社会语料库,该语料库包含2013年至2024年间收集的252,487条阿拉伯语Facebook公开帖子,涉及女性赋权和社会福祉。该语料库从77个国家的51,660个页面中收集,产生超过2.67亿次用户互动。每条帖子均包含分享、评论和情感反应等参与指标,为受众情感和社会关注度提供了独特视角。数据通过自动化流程处理,包括语言识别、标准化和元数据清洗,以确保可靠性和可重复性。该语料库支持对阿拉伯各方言中的性别话语、社会改革和情感参与进行大规模分析,并支持阿拉伯语自然语言处理、计算社会科学和数字传播研究。该数据集和相关文档将应请求提供,用于研究用途。

English abstract

This paper presents the Arabic Women and Society Corpus, a ten-year collection of 252,487 public Arabic Facebook posts related to women's empowerment and social wellbeing. The corpus was collected from 51,660 pages across 77 countries between 2013 and 2024, resulting in more than 267 million user interactions. Each post includes engagement metrics such as shares, comments, and emotional reactions, providing a unique view of audience sentiment and social attention. The data were processed using an automated pipeline with language identification, normalization, and metadata cleaning to ensure reliability and reproducibility. The corpus enables large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects. It supports research in Arabic natural language processing, computational social science, and digital communication studies. The dataset and accompanying documentation will be released upon request for research use.

2605.22203 2026-05-22 cs.CL

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

对低资源语言农业文档中有效文本嵌入的分块策略评估

Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn

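AI summary: This study compares four text chunking methods on Khmer agricultural documents within a retrieval-augmented generation (RAG) framework, evaluating how the chunking strategy affects dense retrieval. It finds that character-based Recursive chunking performs best for this low-resource language.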

Comments
11 pages, 1 figure
AI Chinese abstract

在本研究中,我们比较了四种文本分块方法:递归、Khmer-aware、基于句子和基于大语言模型(LLM)在检索增强生成(RAG)框架中应用于Khmer农业文档的性能。文档分块使用BGE-M3多语言嵌入模型进行编码,并使用FAISS库进行检索。性能通过四个指标评估:平均检索分数(L2距离)、答案相关性、Khmer覆盖率和Khmer交并比(IoU),均基于真实问题-答案对进行测量。在评估中,我们对18个问题-答案对进行了5折交叉验证。我们观察到基于字符的递归分块方法在分块大小为300字符时表现最佳,实现了最低的L2距离(0.4295 ± 0.0461)、最高的答案相关性(0.8663 ± 0.0199)和最高的Khmer IoU(0.6441 ± 0.0347)。配对t检验显示在L2距离上,与基于句子的分块方法相比有统计学显著改进(p = 0.0121)。这些结果突显了分块粒度和结构保持在优化形态复杂、低资源语言如Khmer的密集检索中的重要性。

English abstract

In this study, we compare the performance of four text chunking approaches (Recursive, Khmer-Aware, Sentence-Based, and LLM-Based) within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union (IoU), all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
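
For readers unfamiliar with character-based recursive chunking, the following is a minimal sketch of the general technique: split on the coarsest separator first, fall back to finer separators for pieces that are still too long, and hard-cut only as a last resort. The separator list and the function name are assumptions. The abstract only states the 300-character chunk size, and Khmer text, which does not put spaces between words, may call for different separators.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Pieces are merged greedily while they fit. A piece that is too long
    is split again with the next, finer separator. With no separators
    left, the text is cut at fixed character offsets.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "a" * 250 + " " + "b" * 250 + "\n\n" + "c" * 700
print([len(chunk) for chunk in recursive_chunks(sample)])
```

Each resulting chunk would then be embedded and indexed, so the chunk size directly controls how much context a single retrieved vector represents.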

2605.22202 2026-05-22 cs.CL

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

嵌入空间中的结构保留作为基准性能预测因子

Amanda Myntti, Jenna Kanerva, Veronika Laippala, Filip Ginter

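AI summary: This paper studies how high-performing embedding models organize their embedding spaces in a consistent way. Evaluating 25 contemporary embedding models on five MTEB tasks, it finds that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances correlate strongly with task performance, and that embedding tasks differ in their degree of linearity and their reliance on retention of local information.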

AI Chinese abstract

在本文中,我们展示了高性能的嵌入模型在其嵌入空间中以一致的方式组织。我们评估了25种现代嵌入模型在五个MTEB任务上的表现,这些任务涵盖四个多样化的任务类别(检索、双语挖掘、对分类和摘要)在英语和多语言设置中。我们发现成对文本实例之间的最近邻重叠和独立成分分析(ICA)中的幅度差异与给定任务的性能高度相关(甚至达到0.97)。最终,我们展示了嵌入任务在不同程度上表现出线性和对局部信息保留的依赖性。我们的结果进一步加深了对嵌入的理解,揭示了嵌入与模型性能的关系,并为可能的未来训练目标和优化条件嵌入提供了启示。

English abstract

In this paper, we show that high-performing embedding models organize their embedding spaces in a consistent way. We evaluate 25 contemporary embedding models on five MTEB tasks spanning four diverse task categories (retrieval, bitext mining, pair classification, and summarization) in both English and multilingual settings, and reveal that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances strongly correlate (even up to 0.97) with performance on the given task. Ultimately, we show that embedding tasks display varying degrees of linearity and reliance on retention of local information. Our results further the understanding of embeddings, their relation to model performance, and shed light on possible future training objectives and optimizing conditional embeddings.
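
The abstract does not define nearest-neighbor overlap precisely. One plausible reading, sketched below, is this: for each pair of texts (for example a sentence and its translation), take the k nearest neighbors of each side within its own set and measure how many neighbor indices the two sides share. The function names and the use of cosine similarity are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_indices(vectors, i, k):
    """Indices of the k vectors most similar to vectors[i], excluding i itself."""
    others = [j for j in range(len(vectors)) if j != i]
    others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
    return set(others[:k])


def mean_neighbor_overlap(side_a, side_b, k):
    """Average fraction of shared k-nearest neighbors over index-aligned pairs."""
    total = 0.0
    for i in range(len(side_a)):
        shared = knn_indices(side_a, i, k) & knn_indices(side_b, i, k)
        total += len(shared) / k
    return total / len(side_a)


source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(mean_neighbor_overlap(source, source, k=1))
```

A score near 1 means both sides of each pair sit in the same local neighborhood structure, which is the kind of structure retention the abstract relates to benchmark performance.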

2605.22201 2026-05-22 cs.CV

Zero-Shot Temporal Action Localization Through Textual Guidance

通过文本指导实现零样本时间动作定位

Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero, Paolo Rota, Yiming Wang, Elisa Ricci

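AI summary: This paper proposes TEGU, which uses rich textual information derived from large language models and structured text to address the difficulty of fine-grained action classification in zero-shot temporal action localization when training supervision is unavailable. Experiments show that it outperforms existing methods on the THUMOS14 and ActivityNet-v1.3 datasets.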

Comments
Accepted to FG 2026
AI Chinese abstract

零样本时间动作定位(ZS-TAL)涉及在未修剪视频中对动作进行分类和定位,其中动作类别在训练时是未见过的。现有工作利用视觉语言模型(VLMs),借助其强大的零样本迁移能力。然而,这些模型在细粒度动作分类上面临明显挑战,难以直接用于区分动作存在与否。大多数当前ZS-TAL方法通过在大规模视频数据集上训练模型来解决这些问题,这需要标注数据且通常导致泛化性能有限。最近,不使用标注数据的方法出现了作为替代方案。沿着这一方向,我们提出了一种新的方法,即“视频中动作更精细定位的文本指导”(TEGU),通过利用大规模语言模型和从描述中提取的结构化文本所衍生的丰富文本信息,弥补训练数据缺乏监督的不足。这种额外的语境信息可以通过提供更丰富的视频内细粒度动作差异的线索,提高细粒度辨别能力。我们通过在THUMOS14和ActivityNet-v1.3数据集上进行实验验证所提出方法的有效性。我们的结果表明,通过利用丰富的文本信息来改进动作定位,TEGU在不涉及训练的最先进ZS-TAL方法上表现更优。

English abstract

Zero-shot temporal action localization (ZS-TAL) consists of classifying and localizing actions in untrimmed videos, where action classes are unseen at training time. Existing work uses Vision and Language Models (VLMs), taking advantage of their strong zero-shot transfer capabilities. Yet, these models face evident challenges with fine-grained action classification, making it difficult to directly use them to distinguish between the presence and absence of an action. Most current methods for ZS-TAL address these challenges by training models on large-scale video datasets, which require annotated data and often result in limited generalization performance. Recently, approaches discarding the use of labeled data have emerged as an alternative. Following this direction, we propose a novel approach, "Textual Guidance for finer localization of actions in videos" (TEGU), that compensates for the lack of supervision from training data by exploiting rich textual information derived from large language models and structured text extracted from captions. This additional linguistic context can improve fine-grained discrimination by providing richer cues about fine-grained action differences within videos. We validate the effectiveness of the proposed method by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results show that, by exploiting rich textual information for improved action localization, TEGU outperforms state-of-the-art ZS-TAL approaches that do not involve training.

2605.22200 2026-05-22 cs.CV cs.AI cs.LG

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

OSS: 2024-2025 开放缝合技能基于视觉的评估挑战

Hanna Hoffmann, Setareh Bady, Claas de Boer, Max Kirchner, Jan Egger, Rainer Röhrig, Frank Hölzle, Lennart Johannes Gruber, Kunpeng Xie, Marlon Neuhaus, Victor Alves, Guilherme Barbosa, Leonardo Barroso, João Carvalho, Hao Chen, Gabriella d'Albenzio, André Ferreira, Nuno Gomes, Yuichiro Hayashi, Kousuke Hirasawa, Rebecca Hisey, Seungjae Hong, Seoi Jeong, Tiago Jesus, Daehong Kang, Satoshi Kasai, Shunsuke Kikuchi, Takayuki Kitasaka, Satoshi Kondo, Hyoun-Joong Kong, Youngbin Kong, Atsushi Kouno, Shlomi Laufer, Kyu Eun Lee, Bining Long, Nooshin Maghsoodi, Hiroki Matsuzaki, Evangelos Mazomenos, Ori Meiraz, Kensaku Mori, Marina Music, Masahiro Oda, Roi Papo, Jieun Park, Rafael Piexoto, Saeid Rezaei, Mariana Ribeiro, Soyeon Shin, Yang Shu, Idan Smoller, Danail Stoyanov, Yihui Wang, Xinkai Zhao, Sebastian Bodenstedt, Isabel Funke, Stefanie Speidel, Behrus Hinrichs-Puladi

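AI summary: This paper presents the OSS Challenge, which aims to improve open surgical skills training through vision-based assessment. Using the challenge dataset and multi-task evaluation, it compares how different methods perform at assessing open surgical skill and highlights the promise and limitations of video-based evaluation.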

Comments
Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA
AI Chinese abstract

通过有效的训练实现高水平的外科技能对于最佳的患者结果至关重要。自动化、数据驱动的技能评估有潜力改善外科训练。尽管基于机器学习的方法在微创手术技能评估中越来越受欢迎,但其在开放手术中的应用仍然有限。我们提出了一个专门的MICCAI挑战,旨在基准测试和推进开放手术中的基于视觉的技能评估。挑战数据集包含在干实验室环境中用静态GoPro相机记录的开放缝合训练任务视频,除了主要视频模态外,还包含仪器轨迹数据。OSS挑战连续两年举办,分别包含两个和三个独立任务:(1) 将技能水平分类为四个类别,(2) 预测涵盖八个类别的完整客观结构化评估技术技能分数,(3) 跟踪手部和手术工具。参与者提交了多种解决方案,包括基于深度学习的视频模型、跟踪驱动的方法和混合方法。通用的空间时间视频模型始终实现了最强的性能,尽管概念上多样的方法在执行良好的情况下也能达到竞争水平。预测细粒度的OSATS分数仍然具有挑战性,但受益于增加的训练数据。关键点跟踪由于频繁的遮挡和出帧实例而变得困难,限制了当前基于运动的技能分析的应用。这项工作评估了创新和多样的解决方案,突显了基于视频的评估在开放手术中的潜力和当前限制,并识别了推进自动化技能评估向临床影响发展的关键方向。

English abstract

Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

2605.22195 2026-05-22 cs.LG

Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

强化思维图:由强化学习驱动的LLM自适应提示方法

Manuel Noah Riesen, Peter Alfred von Niederhäusern

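AI summary: This paper proposes Reinforced Graph of Thoughts (RGoT), which uses reinforcement learning to automatically generate graph-of-operations structures adapted to task complexity, improving prompting for large language models.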

Comments
26 pages (including appendix), 16 figures
AI Chinese abstract

Graph of Thoughts (GoT),作为一种针对大型语言模型(LLMs)的通用提示范式,已被证明在复杂问题解决中具有用处。通过执行一系列操作的图,LLM的思维被结构化为任意图,形成实际的思维图。最初,操作图是手动定义的,需要深入了解问题的解决方案。这种静态的操作图缺乏适应性。我们提出Reinforced Graph of Thoughts (RGoT),一种利用强化学习(RL)自动从人类定义的集合中生成操作图的自动化方法。结果表明,在某些约束下,可以以自动化的方式构建适应任务复杂度的操作图。

English abstract

Graph of Thoughts (GoT), a generalized form of recent prompting paradigms for large language models (LLMs), has been shown to be useful for elaborate problem solving. By executing a graph of operations, thoughts of the LLM are structured as an arbitrary graph, forming the actual graph of thoughts. Originally, the graph of operations is defined manually, which requires in-depth knowledge about the solution of the problem to solve. Such a static graph of operations is rigid and therefore lacks adaptability. We propose Reinforced Graph of Thoughts (RGoT), an automated approach to the GoT prompting paradigm that leverages reinforcement learning (RL) to adaptively generate a graph of operations from a human-defined set. Results indicate that, under certain constraints, it is possible to construct graphs of operations adaptively to the task's complexity in an automated way.

2605.22192 2026-05-22 cs.CV

Ultra-High-Definition Image Quality Assessment via Graph Representation Learning

通过图表示学习实现超高清图像质量评估

Shaode Yu, Enqi Chen, Ming Huang, Xuemin Ren, Songnan Zhao, Zhicheng Zhang, Qiurui Sun

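AI summary: This paper proposes UHD-GCN-BIQA, a graph representation learning framework that improves blind quality assessment of ultra-high-definition images by explicitly modeling the structural dependencies among sampled image regions, enabling efficient quality prediction for high-resolution images.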

AI Chinese abstract

盲图像质量评估(BIQA)对于超高清(UHD)图像仍具挑战性,因为原分辨率推理计算成本高,而强制缩放或孤立裁剪可能抑制尺度敏感的失真并削弱局部瑕疵与全局场景上下文之间的关系。本文旨在通过显式建模采样图像区域之间的结构依赖关系来改进UHD-BIQA,而不是将它们视为独立视图。所提出的图表示学习框架UHD-GCN-BIQA从每个UHD图像中采样长宽比对齐的块,将它们编码为图节点,并利用空间接近性和特征相似性构建混合k-最近邻图。残差图卷积用于在区域间传播上下文信息,门控注意力池化将块级证据聚合为图像级质量预测。采用指数移动平均归一化的多目标损失函数以稳定回归、相关性和排序目标的联合优化。在UHD-IQA基准测试中,UHD-GCN-BIQA实现了PLCC=0.7784,SRCC=0.8019,RMSE=0.0519,取得了与比较方法相竞争的相关性性能和最低的RMSE。这些结果表明,基于图的区域关系建模对UHD图像质量评估是有效的,特别是在高分辨率视觉内容下提高绝对质量评分估计。

English abstract

Blind image quality assessment (BIQA) for ultra-high-definition (UHD) images remains challenging because native-resolution inference is computationally expensive, whereas aggressive resizing or isolated cropping may suppress scale-sensitive distortions and weaken the relationship between local artifacts and global scene context. This paper aims to improve UHD-BIQA by explicitly modeling the structural dependencies among sampled image regions rather than treating them as independent views. To this end, it proposes UHD-GCN-BIQA, a graph representation learning framework. The framework samples aspect-ratio-aligned patches from each UHD image, encodes them as graph nodes, and constructs a hybrid k-nearest-neighbor graph using spatial proximity and feature similarity. Residual graph convolution is used to propagate contextual information across regions, and gated attention pooling aggregates patch-level evidence into an image-level quality prediction. An exponential-moving-average-normalized multi-objective loss function is adopted to stabilize the joint optimization of regression, correlation, and ranking objectives. Experiments on the UHD-IQA benchmark show that UHD-GCN-BIQA achieves PLCC = 0.7784, SRCC = 0.8019, and RMSE = 0.0519, obtaining competitive correlation performance and the lowest RMSE among the compared methods. These results indicate that graph-based region relation modeling is effective for UHD image quality assessment, particularly for improving absolute quality score estimation under high-resolution visual content.
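
The abstract says the hybrid k-nearest-neighbor graph uses spatial proximity and feature similarity, but not how the two are combined. The sketch below assumes the simplest option: each patch is linked to its k spatially closest patches and to its k most feature-similar patches, and the union forms an undirected edge set. Patch centers, feature vectors, and function names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def hybrid_knn_edges(centers, features, k):
    """Undirected edges joining each patch to its k spatial and k feature neighbors.

    centers:  (x, y) patch centers in pixels.
    features: one feature vector per patch, from any patch encoder.
    Assumption: the two neighbor sets are merged by union.
    """
    n = len(centers)
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        by_space = sorted(others, key=lambda j: math.dist(centers[i], centers[j]))[:k]
        by_feature = sorted(others, key=lambda j: -cosine(features[i], features[j]))[:k]
        for j in set(by_space) | set(by_feature):
            edges.add((min(i, j), max(i, j)))
    return edges


centers = [(0, 0), (1, 0), (10, 0), (11, 0)]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
print(sorted(hybrid_knn_edges(centers, features, k=1)))
```

The spatial edges keep adjacent regions connected, while the feature edges let distant patches with similar content exchange information during graph convolution.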