arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29256 2026-05-29 cs.CL cs.AI

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

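AI summary Proposes DynSess, a unified session-level framework that improves the long-horizon consistency and interaction quality of role-playing agents through session-level evaluation (DynSess-Eval) and training-trajectory optimization based on multi-step lookahead search (DSPO/GSRPO).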

English abstract

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.

2605.29254 2026-05-29 cs.RO cs.AI

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

Jiaxun Liu, Boxi Xia, Boyuan Chen

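AI summary Introduces the concept of dynamic symmetry, quantified by a dynamic isotropy measure; across more than 1000 simulated morphologies, higher dynamic symmetry improves trajectory tracking, task success, robustness, and other performance measures. The Argus family of spherical robots is developed to show that near-extreme dynamic isotropy yields omnidirectional locomotion, terrain adaptation, rapid self-stabilization, and fault tolerance.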

Comments Published in Science Robotics (2026). Our project website is at: https://generalroboticslab.com/Argus

Journal ref
Science Robotics 11, eaec1725 (2026)
English abstract

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.

2605.29253 2026-05-29 cs.AI

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

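AI summary Introduces the OpenClawBench dataset, which uses the FullTax annotation framework to quantify process-side anomalies in agent executions and exposes the shortcomings of outcome-only evaluation.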

Comments 37 pages, 1 figure, 43 tables

English abstract

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
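
The Outcome-Process Gap reported in the abstract can be read as a simple rate. The sketch below uses only the two counts quoted above; the function name is hypothetical and is not part of OpenClawBench or FullTax.

```python
def outcome_process_gap(anomalous_passing: int, total_passing: int) -> float:
    """Fraction of oracle-passing executions still labeled process-anomalous."""
    if total_passing <= 0:
        raise ValueError("total_passing must be positive")
    return anomalous_passing / total_passing


# Counts quoted in the abstract: 2,904 anomalous out of 31,135 passing runs.
gap = outcome_process_gap(2_904, 31_135)
print(f"{gap:.2%}")  # roughly 9.33%
```

In other words, about one in eleven executions that pass the task oracle would still be flagged under the process-side labels.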

2605.29251 2026-05-29 cs.AI cs.CR

Provably Secure Agent Guardrail

Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu

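AI summary To address the inability of existing semantic guardrails to provide deterministic security lower bounds, proposes a new security paradigm based on the fundamental limitations of logical reasoning and introduces the executable Proof-Constrained Action (ePCA) framework, whose neural symbolic isolation architecture achieves zero attack success rate and zero false positive rate.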

English abstract

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.

2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

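AI summary Proposes the OmniRetrieval framework, which identifies the appropriate knowledge sources for a natural-language query and dispatches it to each source's native execution engine; it outperforms single-source baselines across 13 datasets and 309 knowledge bases, enabling unified retrieval over heterogeneous knowledge sources.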

English abstract

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.

2605.29247 2026-05-29 cs.AI cs.CL cs.LG

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

Yang Ouyang, Shuhang Lin, Jung-Eun Kim

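AI summary Proposes DenseSteer, a training-free inference-time steering framework that modulates internal representations toward dense reasoning patterns to improve the accuracy of small models on multi-step math reasoning.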

Comments ICML 2026

English abstract

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

2605.29243 2026-05-29 cs.CL cs.AI cs.CY

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

Laerdon Kim, Vivian Nguyen, Cristian Danescu-Niculescu-Mizil

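AI summary Proposes a deferral-based decision mechanism built on forward-looking simulation: when forecasting conversational derailment, it assesses whether a tense moment can recover, lowering the false positive rate while maintaining forecasting accuracy.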

Comments To appear in the Proceedings of ACL 2026

English abstract

Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into personal attacks. Since forecasting models operate in an online fashion, they must decide whether to "trigger" an alert after each utterance--for example, to notify participants or a moderator that the conversation is at risk of derailing. Existing approaches make this decision solely based on the estimated likelihood of derailment given the preceding utterances, implicitly assuming that the conversation's future trajectory is fixed. As a result, they ignore the possibility of future recovery and incur an unnecessarily high rate of false positives. In this work we propose a method for decoupling the decision to trigger from derailment likelihood estimation. Our approach is inspired by the first human baseline on this task, which shows that humans achieve dramatically lower false positive rates by selectively deferring their decision to trigger when they anticipate that tension is likely to subside. We operationalize this insight with a deferral mechanism that uses forward-looking simulations to assess whether a tense moment admits plausible paths to recovery. Incorporating this mechanism into a state-of-the-art forecasting model substantially reduces false positives without sacrificing forecasting accuracy. More broadly, this work highlights the value of treating decision-making as a first-class component of forecasting systems.
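
A minimal sketch of what decoupling the trigger decision from the likelihood estimate could look like. Everything here is an assumption made for illustration: the function name, both thresholds, and the rule of deferring when at least half of the simulated continuations recover. The abstract does not specify how the simulations are produced or aggregated.

```python
from typing import Sequence


def should_trigger(
    derail_prob: float,
    simulated_recoveries: Sequence[bool],
    prob_threshold: float = 0.5,      # assumed alert threshold
    recovery_threshold: float = 0.5,  # assumed deferral threshold
) -> bool:
    """Decide whether to raise an alert after the current utterance.

    derail_prob: the forecaster's estimated likelihood of derailment.
    simulated_recoveries: one flag per simulated continuation, True when
        that continuation recovers and False when it derails.
    """
    if derail_prob < prob_threshold:
        return False  # the moment is not tense enough to consider an alert
    if not simulated_recoveries:
        return True   # no simulations: fall back to the likelihood alone
    recovery_rate = sum(simulated_recoveries) / len(simulated_recoveries)
    # Defer (do not trigger) when recovery looks plausible.
    return recovery_rate < recovery_threshold


print(should_trigger(0.9, [True, True, False]))   # False: alert is deferred
print(should_trigger(0.9, [False, False, True]))  # True: alert is raised
```

The point of the design is that a high derailment estimate alone no longer forces an alert; the simulated futures get a veto.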

2605.29240 2026-05-29 cs.AI cs.CL cs.HC cs.IR

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

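AI summary Proposes a grade-free, interpretable decision layer that prioritizes course topics by integrating three signals (prevalence of student difficulty, disagreement between self-reported and observed difficulty, and unresolved teacher concerns) to help teachers make timely instructional decisions.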

Comments Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning

English abstract

AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $ρ=0.80$) and student-reported topic difficulty ($ρ=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
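
A minimal sketch of how three such signals might be combined into a ranked list with per-topic decision records, plus the top-k overlap used to compare two rankings. The equal weights, the [0, 1] scaling, the field names, and the example topics are all assumptions made for illustration; the abstract does not give the actual scoring rule.

```python
from dataclasses import dataclass


@dataclass
class TopicSignals:
    topic: str
    difficulty_prevalence: float  # share of students struggling, in [0, 1]
    self_report_gap: float        # self-report vs. observed disagreement, in [0, 1]
    unresolved_concern: float     # strength of open teacher concerns, in [0, 1]


def rank_topics(signals, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Return (topic, score, record) tuples sorted by descending priority."""
    ranked = []
    for s in signals:
        # The record keeps each signal's contribution so a ranking can be explained.
        record = {
            "difficulty_prevalence": weights[0] * s.difficulty_prevalence,
            "self_report_gap": weights[1] * s.self_report_gap,
            "unresolved_concern": weights[2] * s.unresolved_concern,
        }
        ranked.append((s.topic, sum(record.values()), record))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


def top_k_overlap(ranking_a, ranking_b, k=5):
    """Number of items shared by the top-k of two ranked lists."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))


# Hypothetical topics and signal values.
signals = [
    TopicSignals("recursion", 0.2, 0.1, 0.0),
    TopicSignals("planning", 0.8, 0.5, 1.0),
    TopicSignals("search", 0.5, 0.5, 0.0),
]
for topic, score, record in rank_topics(signals):
    print(topic, round(score, 3), record)
```

No grades or outcome labels enter the score, which is the property the abstract emphasizes.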

2605.29236 2026-05-29 cs.LG

SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction

Arunkumar Ramachandran

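AI summary Proposes the SigmaMedStat system, which splits each 60-second recording into six 10-second chunks, extracts Continuous Wavelet Transform scalograms, and performs temporal modeling with an EfficientNet-B0 encoder and a two-layer LSTM network, achieving an AUC of 0.822 on the PhysioNet/CinC Challenge 2015 dataset and effectively reducing ICU false alarms.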

Comments Code available at github.com/Arun-K-Ram/sigmamedstat

English abstract

Alarm fatigue in intensive care units (ICUs) is a well-documented patient safety crisis. Clinical monitors generate 350 or more alarms per patient per day, of which 72-99% are clinically irrelevant. Staff desensitization to non-actionable alarms increases the risk of missed true emergencies. This paper presents SigmaMedStat, a machine learning system that evaluates the trustworthiness of physiological alarm signals before clinical action is taken. Four approaches were evaluated on the PhysioNet/Computing in Cardiology Challenge 2015 dataset of 498 four-channel ICU alarm recordings. The primary contribution is a temporal modeling framework that splits each 60-second recording into six consecutive 10-second chunks, generates Continuous Wavelet Transform (CWT) scalograms for each chunk, encodes each chunk with a shared EfficientNet-B0 encoder, and passes the resulting feature sequence to a two-layer Long Short-Term Memory (LSTM) network. Five-fold stratified cross-validation yields a mean AUC of 0.822 +/- 0.016 (95% CI: [0.790,0.853]), compared to 0.641 for a static EfficientNet baseline trained on the full 60-second window. Ablation studies confirm that temporal chunking and multi-channel signal fusion both contribute independently to classification performance. Per-alarm-type analysis reveals that Ventricular Flutter is the most accurately classified alarm type (AUC 0.820) while Asystole remains the hardest (AUC 0.722). Error analysis identifies 65 false negatives and 85 high-confidence misclassifications as the primary failure modes. All code and results are publicly available at https://github.com/Arun-K-Ram/sigmamedstat.
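
The chunking step described above can be sketched in a few lines. This covers only the split of one channel into six consecutive 10-second chunks; the CWT, the encoder, and the LSTM are omitted. The function name and the 250 Hz sampling rate are assumptions for illustration and are not taken from the abstract or the linked repository.

```python
def split_into_chunks(signal, sample_rate_hz, chunk_seconds=10, n_chunks=6):
    """Split a 1-D recording into consecutive, non-overlapping chunks."""
    chunk_len = sample_rate_hz * chunk_seconds
    expected = chunk_len * n_chunks
    if len(signal) != expected:
        raise ValueError(f"expected {expected} samples, got {len(signal)}")
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


SAMPLE_RATE_HZ = 250  # assumed sampling rate, for illustration only
recording = list(range(60 * SAMPLE_RATE_HZ))  # stand-in for one 60-second channel
chunks = split_into_chunks(recording, SAMPLE_RATE_HZ)
print(len(chunks), len(chunks[0]))  # 6 2500
```

Each chunk would then be turned into a scalogram and encoded, so the LSTM sees a sequence of six feature vectors per recording.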

2605.29234 2026-05-29 cs.AI cs.IR

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

Gaurav Sahu, Laurent Charlin, Christopher Pal

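AI summary Improves the retrieval pipeline and tests the reliability of human citation lists as an evaluation target; finds that a Deep Research pipeline substantially raises recall, while only 51% of human citations are judged moderately relevant or higher, and recommends multi-dimensional evaluation.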

English abstract

We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86-88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
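
Breadth-first expansion along bibliographies, and recall against a reference list, can be sketched as follows. The graph structure, the paper identifiers, and the depth limit are hypothetical; this is not the authors' pipeline, only an illustration of the two ideas named in the abstract.

```python
from collections import deque


def expand_breadth_first(seed_ids, bibliography, max_depth=1):
    """Expand seed papers breadth-first along their reference lists.

    bibliography maps a paper id to the list of paper ids it cites.
    """
    seen = set(seed_ids)
    queue = deque((pid, 0) for pid in seed_ids)
    while queue:
        pid, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for cited in bibliography.get(pid, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


def recall(retrieved, relevant):
    """Share of the relevant papers that appear in the retrieved set."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical citation graph.
bibliography = {"p1": ["p2", "p3"], "p2": ["p4"]}
retrieved = expand_breadth_first(["p1"], bibliography, max_depth=1)
print(sorted(retrieved))                # ['p1', 'p2', 'p3']
print(recall(retrieved, ["p2", "p4"]))  # 0.5
```

Raising `max_depth` to 2 would also reach `p4`, which shows how following reference lists can lift recall over a search that returns only direct hits.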

2605.29230 2026-05-29 cs.CV cs.AI

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila

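AI summary Proposes a generalized zero-shot benchmark that excludes children's data during training and evaluates generalization to unseen age groups; all methods show severe performance degradation and seen-class bias.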

Comments 12 pages; 3 figures; 5 tables

English abstract

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.
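
The age-group separation described above amounts to a simple routing rule. The sketch below encodes only the three ranges stated in the abstract; the function and partition names are hypothetical, and the further division of the 18-59 pool into train, validation, and test (including subject-exclusive splitting) is not shown.

```python
def assign_partition(age: int) -> str:
    """Route a sample to a benchmark partition based on its age label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "zero_shot_eval"      # under 18: evaluation only, never trained on
    if age <= 59:
        return "seen"                # 18-59: train / validation / test pool
    return "unseen_validation"       # 60+: model selection under distribution shift


for age in (17, 18, 59, 60):
    print(age, assign_partition(age))
```

Because no sample under 18 can reach the training pool, any performance on that partition has to come from generalization.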

2605.29229 2026-05-29 cs.AI

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

Jiahao Huang, Fei Cheng, Junfeng Jiang, Akiko Aizawa

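AI summary Proposes the Data-Model Compatibility (DMC) metric, which assesses a dataset's suitability for reasoning distillation by jointly considering data quality, relative difficulty, and student capability, and dynamically selects data based on DMC to improve distillation performance.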

English abstract

Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on how well the training data align with the student model. This paper introduces the Data-Model Compatibility (DMC) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability. We validated the effectiveness of DMC from two perspectives: (1) DMC exhibits a strong correlation with reasoning distillation performance; and (2) using DMC as the criterion for data selection leads to improved reasoning distillation performance. Both findings are consistently demonstrated across multiple student models and tasks. Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance.

2605.29225 2026-05-29 cs.AI

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

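AI summary Proposes the BenchTrace benchmark, which systematically evaluates the self-evolution ability of LLM agents through two tasks, Reflection Evaluation and Evolution Evaluation, together with the failure avoidance rate (FAR) metric; experiments reveal significant bottlenecks in the reflection diagnosis and generalization of current models.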

English abstract

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
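
As defined in the abstract, FAR is the fraction of test cases in which the agent avoids the target failure. A minimal sketch follows; the function name and the boolean-per-test-case input format are assumptions, and how BenchTrace decides that a failure was avoided is not shown.

```python
def failure_avoidance_rate(avoided_flags):
    """Fraction of test cases where the agent avoided the target failure.

    avoided_flags: one boolean per test case, True when the failure was avoided.
    """
    flags = list(avoided_flags)
    if not flags:
        raise ValueError("at least one test case is required")
    return sum(1 for avoided in flags if avoided) / len(flags)


print(failure_avoidance_rate([True, False, True, True]))  # 0.75
```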

2605.29224 2026-05-29 cs.CL cs.AI cs.CR

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

Aditya Nawal, Manit Baser, Mohan Gurusamy

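AI summary Proposes the AgentREVEAL framework, which analyzes how the way retrieval is integrated and the properties of retrieved content lead to safety degradation in LLM agents, finds that relevance is the shared activation condition, and introduces the HarmURLBench benchmark.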

English abstract

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.

2605.29221 2026-05-29 cs.CV

An Approach for Thyroid Nodule Analysis Using Thermographic Images

J. R. González, É. O. Rodrigues, C. P. Damião, C. A. P. Fontes, A. C. Silva, A. C. Paiva, H. Li, C. Du, A. Conci

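AI summary Reviews the use of thermography in thyroid analysis, proposes an image acquisition protocol and an autonomous registration method, and distinguishes healthy from unhealthy patients through feature extraction, image processing, and classification.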

Journal ref
Application of Infrared to Biomedical Sciences 2017
English abstract

Thyroid cancer is projected to become the second most common type of cancer in females and the third in males by 2030. In general, detecting cancer in its early stages improves the individual's chance of survival. Thermography is a diagnostic tool that has been increasingly used to detect cancer and abnormalities, including those of the thyroid. Various methods have been proposed to segment and detect hot regions in thermograms and, consequently, to detect suspicious tissues present in these images. It is well known that medical diagnosis yields a great deal of information. Thus, physicians have to comprehensively analyse and evaluate this information in a short period of time, which is infeasible in most cases. In this work, we perform a general review of thermography, focusing on thyroid analysis. We propose protocols for image acquisition and an autonomous registration method for thyroid images. We also perform analyses of the image data, which include feature extraction, image processing, and a possible approach for classifying patients as healthy or unhealthy. In summary, this work presents a pilot project for the detection of tumors in our university hospital, which is part of an effort to support preventive medical actions in our endocrinology department. After some future adjustments, this project will be submitted for approval to the ethics and research committee of Hospital Universitário Antonio Pedro at Universidade Federal Fluminense (HUAP-UFF) and to the Brazilian Ministry of Health ethics committee under the name: Evaluation of the importance of thermography to aid diagnosis of thyroid nodules of patients in HUAP-UFF (in Portuguese: Avaliação da importância da termografia no auxílio à investigação diagnóstica de nódulos tireoidianos em pacientes acompanhados no HUAP-UFF).

2605.29220 2026-05-29 cs.CV

Motion-guided sparse correction enables expert-quality point tracking across diverse microscopy regimes

运动引导的稀疏校正实现跨不同显微镜体制的专家级点跟踪

Leonidas Zimianitis, Pasindu Thenahandi, Kai Buckhalter, Dineth Jayakody, Julian O. Kimura, Xinyue Liang, Karen Cunningham, Azeem Ahmad, Balpreet S. Ahluwalia, Sampath Jayarathna, Nikos Chrisochoides, Brandon Weissbourd, Dushan N. Wadduwage

AI总结 提出RIPPLE方法,通过运动引导的稀疏校正,在多种显微镜视频中实现专家级点跟踪,将手动标注工作量减少3至25倍。

详情
AI中文摘要

在显微镜视频中跟踪非规范生物系统的动力学仍然是一个持续的挑战。经典和基于学习的跟踪器都需要专家审查的数据来进行评估和适应,然而详尽的手动标注很少能扩展到最需要这些工具的视频中。我们开发了RIPPLE(点位置估计的细化插值平台),它将标注重新定义为稀疏校正:用户点击一个起始点,RIPPLE提出完整的轨迹,用户仅在轨迹偏离时进行干预。我们在来自实验室的五个具有挑战性的显微镜数据集上测试了RIPPLE,其中四个来自透明水螅体Clytia hemisphaerica,一个跟踪快速移动精子的地标。在这些数据集中,RIPPLE匹配了详尽手动标注的质量,同时将数据集的手动点击次数减少了3至25倍。因此,RIPPLE填补了手动标注和全自动跟踪之间的缺失层,使得能够立即量化生物动力学、进行方法基准测试,并生成适应未来自动显微镜跟踪器所需的金标准数据。

英文摘要

Tracking the dynamics of non-canonical biological systems in microscopy videos remains a persistent challenge. Both classical and learning-based trackers depend on expert-reviewed data to be evaluated and adapted, yet exhaustive manual annotation rarely scales to the videos where these tools are needed most. We developed RIPPLE (Refinement Interpolation Platform for Point Location Estimation), which recasts annotation as sparse correction: a user clicks a starting point, RIPPLE proposes a full trajectory, and the user intervenes only where the trajectory drifts. We tested RIPPLE on five challenging microscopy datasets from our laboratories, four from the transparent jellyfish Clytia hemisphaerica and one tracking landmarks on rapidly moving sperm. Across these datasets, RIPPLE matched the quality of exhaustive manual annotation while reducing manual clicks 3- to 25-fold. RIPPLE thereby fills a missing layer between manual annotation and fully automated tracking, enabling immediate quantification of biological dynamics, method benchmarking, and the production of the gold-standard data needed to adapt future automated microscopy trackers.

2605.29218 2026-05-29 cs.AI cs.CL

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA:大规模生成面向Web智能体的长程任务

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

AI总结 提出GTA框架,通过集成爬取、检索式种子生成、上下文内生成和自动质量控制,为Web智能体生成带可执行轨迹的真实长程任务,解决现有基准缺乏过程监督和可扩展性问题。

Comments Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

Web智能体将语言模型与浏览和工具使用能力相结合,有望成为开放的Web助手。然而,进展日益受到缺乏可扩展的过程级监督的限制。现有基准大多为手动构建,仅提供粗略的起始-目标注释,缺乏中间轨迹,而最近的自动生成方法仍然昂贵、有偏且浅显。这些限制阻碍了对必须泛化到现实、多跳、跨页面任务的智能体进行可靠训练和评估。我们引入了一个可扩展的框架GTA,它集成了爬取、基于检索的种子生成、上下文内生成和自动质量控制,以生成与可执行轨迹配对的真实任务。该设计将爬取与生成解耦以提高效率,将任务基于站点图以强制组合性,并通过确定性重放和系统验证确保密集监督。我们在超过50个涵盖电子商务、政府、论坛和新闻的网站上实例化了该流程,并具有多语言和多跳覆盖。由此产生的基准揭示了显著的人机性能差距,并实现了详细的诊断。我们的贡献有三方面:(i)形式化多跳Web智能体任务生成,(ii)提出一个高效且经过验证的自动数据创建流程,以及(iii)发布一个具有可重复评估的动态基准。

英文摘要

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.

2605.29217 2026-05-29 cs.CV

Towards the automated segmentation of epicardial and mediastinal fats: A multi-manufacturer approach using intersubject registration and random forest

朝向心外膜和纵隔脂肪的自动分割:一种使用跨受试者配准和随机森林的多厂商方法

É. O. Rodrigues, A. Conci, F. F. C. Morais, M. G. Pérez

AI总结 提出一种基于跨受试者配准和随机森林的全自动方法,用于分割CT图像中的心外膜和纵隔脂肪,平均准确率达98.4%,Dice相似指数为96.8%。

详情
Journal ref
2015 IEEE International Conference on Industrial Technology (ICIT)
AI中文摘要

心脏周围的脂肪量与多种健康风险因素相关,如颈动脉僵硬度、冠状动脉钙化、心房颤动、动脉粥样硬化、癌症发病率等。此外,心脏脂肪的变化与受试者的总体脂肪无关,因此加强了对这些脂肪组织进行定量分析的必要性。临床决策支持系统是能够评估信息并提供相应诊断或数据以补充医生分析的计算机程序。本工作的目的是提出一种方法,能够在通过用于冠状动脉钙化评分的标准采集协议获得的CT图像上,全自动分割两种由心包隔开的心脏脂肪组织。我们致力于减少用户干预并提高可重复性。本文提出的方法包括配准(将输入图像粗略调整到标准)、提取与像素及其周围区域相关的特征,以及基于数据挖掘分类算法的分割步骤,该算法判断输入像素是否属于某一类型。实验表明,心外膜和纵隔脂肪的平均准确率达到98.4%,平均真阳性率为96.2%。平均Dice相似指数为96.8%。

英文摘要

The amount of fat surrounding the heart is correlated with several health risk factors such as carotid stiffness, coronary artery calcification, atrial fibrillation, atherosclerosis, cancer incidence and others. Furthermore, cardiac fat varies independently of the subject's overall fat, which makes quantitative analysis of these adipose tissues essential. Clinical decision support systems are computer programs capable of evaluating information and providing a corresponding diagnosis or data to complement the physicians' analyses. The aim of this work is to propose a method capable of fully automatically segmenting two types of cardiac adipose tissue that are separated by the pericardium, on CT images obtained by the standard acquisition protocol used for coronary calcium scoring. Much effort was devoted to minimizing user intervention and easing reproducibility. The proposed methodology consists of a registration step, which roughly adjusts input images to a standard, an extraction of features related to pixels and their surrounding area, and a segmentation step based on data mining classification algorithms that decide whether an incoming pixel is of a certain type. Experiments showed that the mean accuracy achieved for the epicardial and mediastinal fats was 98.4%, with a mean true positive rate of 96.2%. On average, the Dice similarity index was 96.8%.
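The three figures quoted above (accuracy, true positive rate, Dice similarity index) can all be derived from the same pixel-level confusion counts. The following is a minimal sketch of that arithmetic, assuming flattened binary masks where 1 marks a fat pixel; the function names and the toy masks are illustrative and do not come from the paper.

```python
def confusion_counts(pred, truth):
    """Return (tp, fp, fn, tn) for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    return tp, fp, fn, tn


def segmentation_metrics(pred, truth):
    """Compute accuracy, true positive rate, and Dice index for one class."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    dice_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        # Convention chosen here: TPR is 0.0 when the truth mask is empty.
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # Convention chosen here: two empty masks count as a perfect match.
        "dice": 2 * tp / dice_denominator if dice_denominator else 1.0,
    }


if __name__ == "__main__":
    # Toy 8-pixel masks (hypothetical values, not data from the paper).
    predicted = [1, 1, 0, 0, 1, 0, 0, 0]
    reference = [1, 1, 1, 0, 0, 0, 0, 0]
    print(segmentation_metrics(predicted, reference))
```

Because accuracy counts true negatives and Dice does not, accuracy can stay high on images dominated by background, which is one reason segmentation papers usually report both.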

2605.29212 2026-05-29 cs.CV cs.HC

MetaRanker: Human-in-the-loop Active Ranking for Metalens Image Quality

MetaRanker:用于超透镜图像质量的人机协同主动排序

Yujin Park, Haejun Chung, Ikbeom Jang

AI总结 提出MetaRanker框架,通过人机协同主动排序,以语义可解释性为指标评估超透镜图像质量,减少80%人工标注量,并实现与人类评估高度一致的排序。

Comments 12 pages, 6 figures

详情
AI中文摘要

现代成像系统中的图像质量源于传感器、光学元件和计算重建的耦合效应。超薄超透镜为实现光学模块的显著小型化提供了途径,但实际设计通常表现出明显的色差和视场相关像差,需要计算重建来补偿。在当前的超透镜流程中,重建模型通常使用基于失真的保真度目标(如PSNR)进行训练和选择,但这些代理指标与人类偏好和下游实用性的相关性较弱,反映了众所周知的感知-失真权衡。我们引入了MetaRanker,一种人机协同主动排序框架,以语义可解释性(定义为人类在存在光学伪影时可靠识别物体和结构的程度)来形式化超透镜图像质量。MetaRanker结合了概率偏好模型与不确定性感知的查询选择,并利用视觉-语言模型提供轻量级语义先验。重要的是,这些先验仅用于指导信息性比较的采样;人类判断始终是主要的监督信号。在具有不同退化特征的现实和合成超透镜数据集上,MetaRanker生成的排序与人类评估最为一致,同时相对于穷举成对评估,所需的成对标注数量减少了约80%。最后,我们表明标准图像质量评估指标在超透镜领域与人类可解释性的对齐有限,这使MetaRanker成为迈向基于感知的超透镜评估和协同设计的实际一步。

英文摘要

Image quality in modern imaging systems emerges from the coupled effects of the sensor, optics, and computational reconstruction. Ultra-thin metalenses offer a path toward substantial miniaturization of optical modules, but practical designs often exhibit pronounced chromatic and field-dependent aberrations that necessitate computational reconstruction. In current metalens pipelines, reconstruction models are commonly trained and selected using distortion-based fidelity objectives, such as PSNR, yet these proxies can be weakly correlated with human preference and downstream utility, reflecting the well-known perception-distortion trade-off. We introduce MetaRanker, a human-in-the-loop active ranking framework that formalizes metalens image quality in terms of semantic interpretability, defined as the degree to which humans can reliably recognize objects and structures in the presence of optical artifacts. MetaRanker combines a probabilistic preference model with uncertainty-aware query selection, and leverages vision-language models to provide lightweight semantic priors. Importantly, these priors are used only to guide the sampling of informative comparisons; human judgments remain the primary supervision signal throughout. Across real-world and synthetic metalens datasets with distinct degradation profiles, MetaRanker produces rankings that align most closely with human assessments, while reducing the number of pairwise annotations required by approximately 80% relative to exhaustive pairwise evaluation. Finally, we show that standard image quality assessment metrics exhibit limited alignment with human interpretability in the metalens domain, positioning MetaRanker as a practical step toward perceptually grounded metalens evaluation and co-design.

2605.29202 2026-05-29 cs.LG

Auditing Training Data in Generative Music Models via Black-Box Membership Inference

通过黑盒成员推断审计生成音乐模型中的训练数据

Yi Chen Liu, Jiawei Yu, Kexin Cao, Syed Irfan Ali Meerza, Trishika Movva, Jian Liu

AI总结 本文提出一种黑盒成员推断方法,通过比较候选音频与模型基于其描述生成输出的语义对齐程度,并训练音乐审计器分类成员身份,实现对生成音乐模型训练数据的高精度审计。

Comments The paper has been accepted for presentation at the workshop ArtSec 2026: Workshop on Artwork Security and Provenance in the Age of AI

详情
AI中文摘要

近期文本到音乐生成的进展实现了结构化音乐音频的高保真合成,引发了对数据来源、同意和训练透明度的日益关注。这些模型通常在很少披露的大规模语料库上训练,没有实际机制来验证特定音频样本是否包含在训练中。在本文中,我们研究了生成音乐模型的黑盒成员推断,旨在仅通过查询部署系统来确定候选音乐样本是否在训练中使用。我们的关键见解是,训练成员身份会导致候选样本与模型基于其描述生成的结果之间系统性地更强的语义和结构对齐。我们使用相关描述查询目标模型,并在学习特征空间中测量候选音频与生成输出之间的关系。为了捕捉区分成员和非成员的特征,我们构建了由每个曲目及其基于描述生成的影子模型组成的配对示例,并训练音乐审计器分类成员身份。该审计器捕捉训练成员身份特有的对齐模式,并在完全黑盒设置下泛化到未见过的目标模型,无需访问模型参数或训练元数据。在多个最先进的音乐生成器上,我们的方法达到了高达98.6%的准确率,假阳性和假阴性率低至1.9%和1.0%,表明在现实部署场景中可靠的训练数据审计是可行的。

英文摘要

Recent advances in text-to-music generation enable high-fidelity synthesis of structured musical audio, raising growing concerns about data provenance, consent, and training transparency. These models are typically trained on large-scale corpora with little disclosure, leaving no practical mechanism to verify whether a particular audio sample was included in training. In this paper, we investigate black-box membership inference for generative music models, aiming to determine whether a candidate music sample was used during training, given only query access to the deployed system. Our key insight is that training membership induces systematically stronger semantic and structural alignment between a candidate sample and the model's generation conditioned on its caption. We query the target model with the associated caption and measure the relationship between the candidate audio and the generated output in a learned feature space. To capture features that separate members from non-members, we construct paired examples consisting of each track and its caption-conditioned generation from shadow models, and train a music auditor to classify membership. The auditor captures alignment patterns characteristic of training membership and generalizes to unseen target models in a fully black-box setting without access to model parameters or training metadata. Across multiple state-of-the-art music generators, our method achieves up to 98.6% accuracy, with false-positive and false-negative rates as low as 1.9% and 1.0%, demonstrating that reliable training-data auditing is feasible in realistic deployment scenarios.

2605.29194 2026-05-29 cs.LG cs.AI cs.NA math.NA

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

随机提升:生成随机物理系统轨迹

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

AI总结 提出随机提升方法,通过为每个状态转换附加独立高维随机标签并学习从当前状态和标签到下一状态的映射,以生成多样化的随机物理系统轨迹。

详情
AI中文摘要

许多随机物理系统随时间平滑演化,即状态分布随时间步长规则变化。从当前状态到下一状态的转移通常可以建模为平滑映射和显式随机源的组合。随机提升利用这一结构,通过为训练数据中的每个状态转换附加一个独立的高维随机标签,并使用标准回归损失拟合从当前状态和标签到下一状态的转移映射。这些标签作为辅助坐标,使模型能够从相似的当前状态表示多个可能的下一状态,避免在有限样本量下崩溃为均值预测。在推理时,每个时间步采样新的标签,并将学习到的映射自回归地向前滚动,每个时间步仅需一次网络评估即可生成多样化的轨迹。

英文摘要

Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly across time steps. The transition from the current state to the next state can often be modeled as the combination of a smooth map and an explicit source of randomness. Stochastic Lifting exploits this structure by attaching an independent, high-dimensional random label to each state transition in the training data and fitting a transition map from the current state and label to the next state using a standard regression loss. The labels act as auxiliary coordinates that let the model represent multiple plausible next states from similar current states, avoiding collapse to a mean prediction in the finite-sample regime. At inference, fresh labels are sampled at each time step and the learned map is rolled forward autoregressively, generating diverse trajectories with a single network evaluation per time step.
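The inference procedure described above (sample a fresh label, apply the map once, repeat) can be sketched as a short loop. This is only an illustration under stated assumptions: the labels are drawn from a standard Gaussian, the state is a scalar, and `toy_transition_map` is a hand-written stand-in for the trained network, none of which is specified in the abstract.

```python
import random


def sample_label(rng, label_dim):
    """Draw a fresh random label (assumed here to be i.i.d. standard Gaussian)."""
    return [rng.gauss(0.0, 1.0) for _ in range(label_dim)]


def rollout(transition_map, initial_state, num_steps, label_dim, rng):
    """Roll the map forward autoregressively, one map evaluation per time step."""
    trajectory = [initial_state]
    for _ in range(num_steps):
        label = sample_label(rng, label_dim)
        trajectory.append(transition_map(trajectory[-1], label))
    return trajectory


def toy_transition_map(state, label):
    """Hypothetical stand-in for the learned map (state, label) -> next state."""
    return 0.9 * state + 0.1 * sum(label) / len(label)


if __name__ == "__main__":
    # Different seeds give different label sequences, hence different trajectories.
    for seed in (0, 1, 2):
        path = rollout(toy_transition_map, 1.0, 5, 8, random.Random(seed))
        print([round(s, 3) for s in path])
```

The point of the sketch is the control flow: all randomness enters through the sampled labels, so trajectory diversity comes from resampling them rather than from running the map several times per step.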

2605.29192 2026-05-29 cs.AI cs.CL

ReasonOps: Operator Segmentation for LLM Reasoning Traces

ReasonOps: 大语言模型推理轨迹的算子分割

Daniel Lee, Owen Queen, James Zou

AI总结 提出无监督方法ReasonOps,从思维链轨迹中提取7种通用推理算子,揭示模型推理结构并用于模型识别与正确性预测。

详情
AI中文摘要

大型推理模型的思维链轨迹可长达数万token,但我们缺乏描述其内部结构的词汇。以往用于分析思维链轨迹的方法要么过于僵化,要么表达能力不足,无法捕捉跨领域和跨模型的特征。为解决此问题,我们开发了ReasonOps,一种无监督、表达力强的方法,用于注释思维链轨迹,提供简洁的通用算子。利用ReasonOps,我们分析了来自12个思考型LLM(涵盖6个家族、8个推理基准)的44,662条轨迹,发现它们共享一个共同的组合结构:7个反复出现的推理算子——语篇层面的动作,如回溯、推理和假设——这些算子从句子开头的3-token枢轴的无监督聚类中涌现。这些算子出现在每个模型家族和基准领域,由三个独立的LLM评判员对留出样本进行分类,准确率达70-76%。我们分析了算子在简单与困难问题上的结构,发现反思性算子在困难问题上更有帮助,而在简单问题上则损害性能。算子序列具有高度的模型识别性:仅基于算子分布训练的分类器能以宏AUC恢复源模型,揭示每个模型家族具有独特的推理指纹。结构化的算子特征在问题内答案正确性预测上远高于基线。基于这些算子构建的分类器在WP-AUC上达到,特别是在AIME上。ReasonOps还能够在轨迹完成前进行早期质量估计:我们仅用50%的轨迹就能在WP-AUC上进行预测。ReasonOps流程是无监督且无需标注的,能够深入洞察LLM推理轨迹,并在模型识别和正确性预测方面取得强大的下游结果。

英文摘要

Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.

2605.29190 2026-05-29 cs.LG cs.CL

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

当RL抑制自身词汇:在谜题到数学迁移中恢复推理多样性

Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid

AI总结 本文提出一种基于可验证奖励的强化学习框架,通过引入新颖性奖励机制恢复被抑制的探索性推理原语,实现从约束满足谜题到数学问题的跨领域迁移,在无需数学数据的情况下将OlymMATH-Hard的pass@32从16%提升至36%。

Comments Preprint

详情
AI中文摘要

使用可验证奖励的强化学习(RLVR)改进了大语言模型的推理能力,但其跨领域迁移的条件及原因仍未被充分探索。我们研究了一个7B模型在仅使用约束满足谜题进行SFT和RL后训练(无数学问题)时的跨领域迁移。为了分析迁移如何产生,我们引入了一个推理原语级框架,该框架结合了9类跨度分类器和基序提取,使我们能够将思维链轨迹分割为原语基序,并追踪其在训练阶段和领域间的演变。我们发现,谜题SFT诱导了一个推理原语词汇,在OlymMATH-Hard上带来了+7pp的pass@32提升。随后,普通GSPO将这些原语组合成更长的计算-验证链,进一步增加了+6pp。然而,这个RL阶段也抑制了探索性原语,如“假设”和“回溯”。为了解决这个问题,我们引入了一个新颖性奖励,奖励多样化的正确轨迹,使用参考模型下的困惑度作为信号。这恢复了RL期间的恢复原语,并相对于普通GSPO额外增加了+7pp的pass@32。最终,端到端配方将硬数学能力上限从OLMo3-7B-Instruct-SFT基线的16.0%提升至36.0%,且在SFT或RL阶段未添加任何数学问题。

英文摘要

Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a +7 pp pass@32 gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further +6 pp. However, this RL stage also suppresses exploratory primitives such as hypothesize and backtrack. To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further +7 pp pass@32 relative to vanilla GSPO. Finally, the end-to-end recipe raises the hard-math capability ceiling from 16.0% at the OLMo3-7B-Instruct-SFT base to 36.0%, without adding any mathematics problems during the SFT or RL stages.
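The abstract reports pass@32 but does not say how it is estimated. A common choice, assumed here and not confirmed by the source, is the unbiased estimator used in code-generation evaluation: with n rollouts per problem, c of them correct, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch with hypothetical rollout counts:

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k rollouts, drawn without replacement
    from n rollouts of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 whenever fewer than k incorrect rollouts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem_counts, k):
    """Average pass@k over problems; each entry is an (n, c) pair."""
    scores = [pass_at_k(n, c, k) for n, c in per_problem_counts]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical counts: 64 rollouts per problem, varying numbers correct.
    counts = [(64, 0), (64, 1), (64, 5), (64, 0)]
    print(benchmark_pass_at_k(counts, k=32))
```

A problem with zero correct rollouts contributes exactly 0, so pass@k at large k mostly measures how many problems the model can solve at all. That fits the abstract's framing of pass@32 as a capability ceiling and its emphasis on rewarding diverse correct rollouts.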

2605.29188 2026-05-29 cs.CL

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

口号还是立场?一种用于中国国企演讲中创业话语测量的轻标签诊断方法

Ting Gong, Shangquan Sun

AI总结 本文提出一种轻标签诊断方法,利用同一企业不同演讲者的自然实验,评估词典方法、主题模型和嵌入相似度评分器在测量中国国企演讲中“创业精神”时的有效性,发现零样本大语言模型(Qwen3.5:9b)在区分演讲者身份方面表现最佳。

Comments 15 pages, 2 figures, 7 tables

详情
AI中文摘要

词典方法、主题模型和嵌入相似度评分器广泛应用于CSS和管理研究中,用于测量企业演讲中的“创业精神”等构念。我们贡献了一种轻标签的测量诊断方法,而非新的提取模型。在80篇中央管理中国国有企业领导人演讲的语料库中,我们利用24对同一企业不同演讲者和5对同一企业同一演讲者的自然实验,测试方法每文档指标是否在控制企业不变的情况下随领导人身份变化。LDA失败(Cohen d=0.20,95% CI [-0.72, 1.20]);词典评分器达到d=0.81,中文句子编码器在文档向量距离为10^-3量级时达到d=0.65。零样本9B开源大语言模型(Qwen3.5:9b)将配对对比d提升至1.09(精确置换p1=0.034)。我们相应地降低三个主张:黄金F1衡量的是与LLM自身提示规则的一致性,而非外部构念恢复;文档级风格残差化将LLM的d降至0.43(p1=0.22),因此约一半效应与演讲者个人习语一致;置信加权校准以方差换取Delta,自动挖掘的口号词典在消融中几乎无效。我们发布了包含2,190个片段的评分语料库、170段试点语料、口号词典、两族LLM评分以及评估框架。

英文摘要

Dictionary methods, topic models, and embedding-similarity scorers are widely used in CSS and management research to measure constructs such as "entrepreneurial spirit" in corporate speeches. We contribute a label-light measurement diagnostic for such instruments rather than a new extraction model. On a corpus of 80 speeches by leaders of centrally administered Chinese state-owned enterprises, we exploit a natural experiment of 24 same-company different-speaker pairs and 5 same-company same-speaker pairs to test whether a method's per-document indices vary with leader identity holding firm constant. LDA fails (Cohen d=0.20, 95% CI [-0.72, 1.20]); a dictionary scorer reaches d=0.81 and a Chinese sentence encoder d=0.65 on doc-vector distances of order 10^-3. A zero-shot 9B open-weight LLM (Qwen3.5:9b) raises paired-contrast d to 1.09 (exact permutation p1=0.034). We downgrade three claims accordingly: gold F1 measures consistency with the LLM's own prompt rule rather than external construct recovery; doc-level style residualisation cuts the LLM's d to 0.43 (p1=0.22), so roughly half of the effect is consistent with leader idiolect; and a confidence-weighted calibration trades Delta for variance with an auto-mined slogan lexicon near-inert in ablation. We release the 2,190-segment scored corpus, the 170-paragraph pilot, the slogan lexicon, two-family LLM scores, and the evaluation harness.
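The effect sizes and permutation p-values quoted above can be illustrated with a small two-group computation. The abstract does not spell out its exact definitions, so the sketch below assumes one standard reading: Cohen's d with a pooled standard deviation between different-speaker and same-speaker pair distances, and a one-sided exact permutation test that enumerates every relabeling of the pooled values. All numbers are made up for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def exact_permutation_p(group_a, group_b):
    """One-sided exact p-value for mean(a) - mean(b) under label exchange."""
    pooled = list(group_a) + list(group_b)
    observed = mean(group_a) - mean(group_b)
    total, n, na = sum(pooled), len(pooled), len(group_a)
    hits = count = 0
    for indices in combinations(range(n), na):
        sum_a = sum(pooled[i] for i in indices)
        diff = sum_a / na - (total - sum_a) / (n - na)
        # Small tolerance so the observed labeling always counts as a hit.
        hits += diff >= observed - 1e-12
        count += 1
    return hits / count


if __name__ == "__main__":
    # Hypothetical index gaps for different-speaker vs same-speaker pairs.
    different_speaker = [0.42, 0.35, 0.51, 0.28, 0.47, 0.39]
    same_speaker = [0.21, 0.30, 0.18]
    print(cohens_d(different_speaker, same_speaker))
    print(exact_permutation_p(different_speaker, same_speaker))
```

With 24 and 5 pairs, as in the abstract, full enumeration covers C(29, 5) = 118,755 relabelings, which is still cheap. It also means the smallest attainable p-value is about 8.4e-6, so exactness comes at no practical cost at this sample size.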

2605.29184 2026-05-29 cs.LG cs.AI

Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback

影响引导的符号回归:基于大语言模型与细粒度反馈的方程搜索科学发现

Evgeny S. Saveliev, Samuel Holt, Nabeel Seedat, David L. Bentley, Jim Weatherall, Mihaela van der Schaar

AI总结 提出影响引导符号回归(IGSR)方法,利用大语言模型生成候选函数并通过细粒度影响分数进行剪枝,结合蒙特卡洛树搜索高效探索组合空间,在多个基准和真实生物数据中发现新关系。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)为科学发现提供了有前景的途径,但它们在符号回归中的应用常受限于低效的搜索策略和粗糙的反馈信号。当前方法通常使用标量指标(如全局均方误差)指导LLM,这无法识别所提出方程中哪些成分驱动性能或导致误差。我们引入影响引导符号回归(IGSR),该方法将方程发现表述为一个迭代的两步过程,结合多样化的项生成与严格选择:LLM为线性模型生成候选基函数$ψ_j(\mathbf{x})$,然后使用细粒度影响分数$Δ_j$进行评估。这些分数量化每个项对泛化准确性的边际贡献,从而实现影响引导的剪枝过程,系统地精炼模型结构。将此机制集成到蒙特卡洛树搜索(MCTS)中,能够在导航组合搜索空间的同时平衡对新函数形式的探索与对高影响成分的利用。我们在多个基准测试上展示了IGSR的有效性,包括LLM-SRBench、药理学PKPD模型、流行病学模拟和真实基因组数据。值得注意的是,我们通过一个高维生物数据集的案例研究验证了该框架的真正发现能力,其中IGSR识别出DNA甲基化与RNA聚合酶II暂停之间的新关系;该假设随后通过湿实验得到了支持。

英文摘要

Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. We introduce Influence-Guided Symbolic Regression (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $ψ_j(\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $Δ_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing, a hypothesis that was subsequently supported via wet-lab experimentation.

2605.29174 2026-05-29 cs.AI cs.CR

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Paper Agents, Paper Gains: DeFi投资代理的实证分析

Jay Yu, Amy Zhao, Danning Sui

AI总结 通过分析1900多个AI加密项目、10个代表性代理和11个Solana代理金库,发现当前DeFi投资代理仍处于早期阶段,存在自主执行证据不足、代币持有者集体亏损、估值与基本面脱节等问题,并提出成熟度框架。

详情
AI中文摘要

DeFi投资代理,即使用AI进行自主链上交易的系统,自2024年底以来已获得超过30亿美元的代币总估值。我们调查了1900多个标记为AI的加密项目,筛选出专注于投资的代理,并策划了10个涵盖策略和可观测性维度的代表性项目。然后,我们对两个突出的代理框架ElizaOS和Virtuals Protocol进行了深入的架构分析,并对11个基于Solana的代理金库(具有公开可归因的交易活动)进行了定量链上表现分析,覆盖925,323个代币持有者。我们发现当前部署仍处于早期且异构:(1)在我们的样本中,许多项目尚未提供清晰的自主交易执行证据,开发者访谈表明许多可见部署仍为基本API集成;(2)代理金库保留了超过3000万美元的账面收益,而代币持有者集体损失了1.917亿美元,前1%的钱包捕获了所有收益的81.4%(18.1亿美元);(3)代币估值与金库基本面关联微弱,市值与AUM比率超过10,000倍,而成熟的DeFi协议低于1倍;(4)用户总收益在达到24亿美元的峰值后下降至净亏损,每个平台的中位数回报均为负,代币从历史高点平均下跌93%。我们将这些结果解释为无许可的第一代市场的特征,其中开放基础设施支持快速实验,但也允许幼稚或投机性代理在自主性、性能和利益相关者对齐的稳健标准出现之前启动。因此,我们提出了一个沿三个维度(自主执行、风险调整后盈利能力和利益相关者对齐)的成熟度框架,以表征当前部署与未来投资级代理系统之间的差距。

英文摘要

DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.

2605.29172 2026-05-29 cs.LG physics.ao-ph

Probabilistic bias adjustment of seasonal forecasts using generative machine learning: A case study of Arctic sea ice predictions

基于生成式机器学习的季节预报概率偏差校正:以北极海冰预测为例

Parsa Gooya, Reinel Sospedra-Alfonso

AI总结 本研究提出基于条件变分自编码器的概率后处理框架,通过生成器替代高斯参数化解码器并采用连续排序概率评分优化,有效校正季节预报的系统偏差并提升分辨率与谱能量。

详情
AI中文摘要

季节气候预测通过提供未来几个月最可能发生的气候条件及其相关不确定性的早期信息,支持规划和风险管理。集合预报通过模拟许多可能的结果来实现这一点,使得预测能够以可用的概率形式表达。大集合和高分辨率预报通过更好地采样不确定性和捕捉更精细尺度的过程来加强这种指导,但会带来显著的计算成本。此外,预报集合存在漂移,并表现出系统偏差和随提前时间增长的时空误差,需要仔细的后处理和校准。加拿大气候建模与分析中心开发了一种基于条件变分自编码器(cVAE)的概率后处理框架,用于生成北极海冰的偏差校正季节预测的大集合。生成模型旨在学习以有偏模型预测为条件的观测分布。这使得能够生成任意大的、经过良好校准的、偏差校正的预测集合,且具有更高的技能。在此,我们扩展该框架以解决标准cVAE已知的局限性——预测中细尺度能量的损失和特征性的模糊。具体而言,我们在cVAE中使用生成器替代高斯参数化解码器,并在目标函数中使用连续排序概率评分代替均方误差。我们进一步使用比原始预报更高分辨率的目标数据集。我们表明,与基准预测相比,调整后的预测校准更好,与观测分布更一致,误差更小,同时相对于标准cVAE提高了原始预报的分辨率、锐度和谱功率。

英文摘要

Seasonal climate predictions support planning and risk management by offering early information on the most likely climate conditions in the coming months and their associated uncertainties. Ensemble forecasts enable this by simulating many plausible outcomes, allowing predictions to be expressed as usable probabilities. Large ensembles and high-resolution forecasts strengthen this guidance by better sampling uncertainty and capturing finer-scale processes, but come with significant computational cost. Moreover, forecast ensembles drift and exhibit systematic biases and spatio-temporal errors that grow with lead time, requiring careful post-processing and calibration. A probabilistic post-processing framework based on conditional Variational Autoencoders (cVAEs) was developed at the Canadian Centre for Climate Modelling and Analysis to generate large ensembles of bias-adjusted seasonal predictions of Arctic sea ice. The generative model was designed to learn the observational distribution conditioned on the biased model prediction. This enables generation of arbitrarily large ensembles of well-calibrated, bias-corrected forecasts with improved skill. Here, we extend this framework to address the loss of fine-scale energy and the characteristic blurriness in predictions, a known limitation of standard cVAEs. Specifically, we employ a generator in place of the Gaussian-parametrized decoder in the cVAE and use the Continuous Ranked Probability Score in the objective function instead of the Mean Squared Error. We further use a target dataset with higher resolution than the raw forecast. We show that the adjusted forecasts are better calibrated, more consistent with the observational distribution, and exhibit smaller errors than benchmark predictions, while also enhancing the resolution of the raw forecasts and improving sharpness and spectral power relative to the standard cVAE.
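To make the role of the Continuous Ranked Probability Score concrete, the sketch below evaluates the standard empirical ensemble form, CRPS = mean|x_i - y| - (1/2) mean|x_i - x_j|, for one scalar quantity. How the paper actually estimates CRPS inside its training objective is not stated in the abstract, so this is an assumed textbook form, and the sea-ice concentration values are hypothetical.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble forecast for one scalar observation.

    The first term rewards members close to the observation; the second
    subtracts half the mean pairwise spread, so a collapsed ensemble is
    scored like a single deterministic forecast.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must contain at least one member")
    accuracy_term = sum(abs(x - observation) for x in members) / m
    spread_term = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return accuracy_term - spread_term


if __name__ == "__main__":
    observed_concentration = 0.50  # hypothetical sea-ice concentration in [0, 1]
    sharp_but_biased = [0.80, 0.81, 0.79, 0.80]
    calibrated = [0.35, 0.45, 0.55, 0.65]
    print(ensemble_crps(sharp_but_biased, observed_concentration))
    print(ensemble_crps(calibrated, observed_concentration))
```

For a single-member ensemble the score reduces to the absolute error. Unlike a mean-squared-error objective on the ensemble mean, it scores the whole predictive distribution, which is consistent with the abstract's motivation of avoiding blurred predictions.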

2605.29170 2026-05-29 cs.CL cs.AI

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

UA-Legal-Bench:评估大语言模型在乌克兰法律推理上的基准

Volodymyr Ovcharov

AI总结 针对法律NLP基准以英语为中心的问题,构建了基于乌克兰法院判决的五个任务基准,评估11个LLM,发现少样本提示效果因任务而异,且在不平衡任务中准确率具有误导性。

Comments 13 pages, 5 figures, 4 tables. Data: https://huggingface.co/datasets/overthelex/ua-legal-bench

详情
AI中文摘要

法律NLP基准绝大多数以英语为中心,导致在形态丰富、非拉丁字母语言中的失败模式未被检测。我们引入了UA-Legal-Bench,一个包含五个任务的基准,用于评估大语言模型在乌克兰法律推理上的表现,该基准基于统一国家法院判决登记册(EDRSR)——世界上最大的开放司法语料库之一(9950万份判决)。该基准包括:(1)案件类型分类(4类,n=2,000),(2)判决形式分类(4类,n=2,000),(3)案件结果预测(6类,n=800),(4)法律规范提取(n=1,794),以及(5)原因类别预测(22类,n=1,871)。我们评估了来自五个系列的11个LLM(3B-675B),在零样本和3样本提示下通过AWS Bedrock进行了158K次API调用。我们的结果揭示了高度依赖任务的少样本效应:少样本提示将判决形式分类提高了最多+38.6个百分点,但对结果预测的影响不一。我们表明,在不平衡的法律任务中,准确率具有误导性:COP准确率最高的模型(62%)是多数类预测器(macro-F1:23%),而真正最好的模型macro-F1仅为44%。系列内规模分析显示,8B模型在表面级任务上可以匹配前沿性能,但不同系列的规模阈值差异很大。我们发布了所有数据、提示和模型预测。

英文摘要

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
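The gap between accuracy and macro-F1 reported above is easy to reproduce on toy data. The sketch below assumes a hypothetical six-class label distribution (60% in one class) and a predictor that always outputs the majority class; the labels and counts are invented and are not taken from the benchmark.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given label set."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # Convention chosen here: a class never predicted and never present scores 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    labels = ["A", "B", "C", "D", "E", "F"]
    # Hypothetical imbalance: 60 examples of class A, 8 each of the others.
    y_true = ["A"] * 60 + [c for c in labels[1:] for _ in range(8)]
    y_pred = ["A"] * len(y_true)  # majority-class predictor
    print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred, labels))
```

In this toy case the majority-class predictor reaches 60% accuracy while its macro-F1 is 12.5%, because five of the six classes contribute an F1 of zero. This is the mechanism behind the abstract's warning about accuracy on imbalanced tasks.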

2605.29168 2026-05-29 cs.AI cs.LG

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

晚做总比早做好:基于本体后提取校正的神经符号知识图谱构建

Lorenzo Loconte, Timothy Hospedales, Cristina Cornelio

AI总结 提出一种神经符号框架,通过后提取校正解决LLM提取知识图谱时的本体不一致问题,减少token使用并提升图谱一致性。

详情
AI中文摘要

问答是AI中的核心挑战,特别是对于需要跨文档多跳推理或聚合、穷举等符号操作的复杂查询。检索增强生成已成为问答的主要方法,最近的基于图的变体通过组织知识以更好地支持组合性问题,部分解决了这些问题。然而,大多数基于文本图的RAG方法仍缺乏可靠回答复杂问题所需的符号操作结构。这推动了基于符号图的方法,该方法提取知识图谱,其关系是逻辑谓词,支持类似SQL的查询。然而,这些流程通常使用LLM进行KG提取,这可能导致一致性问题,即提取的事实可能违反常识本体约束。我们提出了一种用于本体基础KG构建的神经符号框架,结合了开放域提取、基于嵌入的类型和谓词规范化,以及针对本体违规的LLM校正。通过将校正推迟到后提取阶段,我们的方法避免了重复的LLM调用,显著减少了token使用,同时提高了KG一致性并保持了下游问答质量。最后,通过测量SPARQL图模式的出现,我们展示了提取的KG非常适合符号查询。

英文摘要

Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symbolic operations like aggregation or exhaustive listing. Retrieval-augmented generation has become the dominant approach to QA, with recent graph-based variants addressing part of these issues by organizing knowledge to better support compositional questions. However, most textual graph-based RAG methods still lack the structure needed for symbolic operations useful to answer complex questions reliably. This motivates symbolic graph-based approaches, which extract knowledge graphs (KGs) whose relations are logic predicates that enable SQL-like querying. Yet these pipelines typically use LLMs for KG extraction, which can introduce consistency issues, where extracted facts may violate commonsense ontology constraints. We propose a neuro-symbolic framework for ontology-grounded KG construction combining open-domain extraction, embedding-based canonicalization of types and predicates, and targeted LLM-based correction of ontology violations. By deferring corrections to a post-extraction stage, our method avoids repeated LLM calls, substantially reducing token usage while improving KG consistency and preserving downstream QA quality. Finally, we show that the extracted KGs are well suited for symbolic querying by measuring the occurrence of SPARQL graph patterns.

2605.29161 2026-05-29 cs.LG cs.AI

Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach

James Sargant, Seyedeh Ava Razi Razavi, Renata Dividino, Sheridan Houghten

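AI summary: Proposes a hybrid WGAN-GA method that refines GAN-generated graph structures with a genetic algorithm, reducing deviations in degree and spectral distributions so that synthetic graphs more closely match real ones.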

Comments 6 pages, 4 Figures, 4 Tables, IEEE World Congress on Computational Intelligence

Abstract

Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural patterns. Recent graph generation methods based on Generative Adversarial Networks (GANs) improve edge modelling by learning connectivity and matching class-specific density distributions. However, these models still deviate noticeably from real graphs, for example in degree and spectral distributions, indicating that important structural properties are not fully preserved. This work aims to reduce these deviations by refining the graphs produced by an existing GAN-based graph generator framework with a Genetic Algorithm (GA). In the GAN framework, the generator produces both node features and connectivity patterns, while a GNN-based critic evaluates graph realism and class consistency to ensure global structural and class alignment. Building on this foundation, we apply a GA to refine the edges of generated graphs. The refinement process guides synthetic graphs toward closer agreement with real data while preserving diversity and novelty. Experimental results show that the GA refinement consistently lowers the combined Maximum Mean Discrepancy (MMD) compared to the base model, leading to graphs that more closely match real structural patterns. This demonstrates that evolutionary refinement is an effective and flexible way to correct residual structural deviations in GAN-based graph generators, improving their suitability for realistic graph synthesis and data augmentation.
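
As a rough illustration of the metric named in the abstract, the sketch below computes a squared MMD between two sets of graphs using their degree histograms as descriptors. The Gaussian kernel, the bandwidth `sigma`, the biased estimator, and the helper names `degree_histogram` and `mmd_squared` are all assumptions for this example; the abstract does not say how the "combined" MMD is defined or which descriptors and kernels are used.

```python
import math
from collections import Counter


def degree_histogram(edges, num_nodes, max_degree):
    """Normalized degree histogram of an undirected simple graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    hist = [0.0] * (max_degree + 1)
    for node in range(num_nodes):
        # Degrees above max_degree are clipped into the last bin.
        hist[min(degree[node], max_degree)] += 1.0 / num_nodes
    return hist


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_k(a, b):
        return sum(gaussian_kernel(p, q, sigma) for p in a for q in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)


path = [(0, 1), (1, 2), (2, 3)]  # 4-node path graph
star = [(0, 1), (0, 2), (0, 3)]  # 4-node star graph

real = [degree_histogram(path, 4, 3)]
same = [degree_histogram(path, 4, 3)]
other = [degree_histogram(star, 4, 3)]

print(mmd_squared(real, same))   # identical samples give zero discrepancy
print(mmd_squared(real, other))  # differing degree profiles give a positive value
```

A refinement loop of the kind described above could use such a score as part of its fitness function, accepting edge changes that lower the discrepancy to real graphs; that coupling is an assumption here, not a detail given in the abstract.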