arXivDaily: arXiv daily academic digest, updated Monday to Friday
2505.23823 2026-06-12 cs.CL version update

RAGPPI: RAG Benchmark for Protein-Protein Interactions in Drug Discovery

Youngseung Jeon, Ziwen Li, Thomas Li, JiaSyuan Chang, Morteza Ziyadi, Xiang 'Anthony' Chen

Affiliations: University of California Los Angeles; Palo Alto High School; Amazon AGI

AI summary: Proposes the RAGPPI benchmark of 4,420 question-answer pairs for evaluating how well retrieval-augmented generation identifies the biological impacts of protein-protein interactions in drug discovery.

Details
Journal ref
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026)
Comments
17 pages, 4 figures, 8 tables
English abstract

Retrieving the biological impacts of protein-protein interactions (PPIs) is essential for target identification (Target ID) in drug development. Given the vast number of proteins involved, this process remains time-consuming and challenging. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) frameworks have supported Target ID; however, no benchmark currently exists for identifying the biological impacts of PPIs. To bridge this gap, we introduce the RAG Benchmark for PPIs (RAGPPI), a factual question-answer benchmark of 4,420 question-answer pairs that focus on the potential biological impacts of PPIs. Through interviews with experts, we identified criteria for a benchmark dataset, such as a type of QA and source. We built a gold-standard dataset (500 QA pairs) through expert-driven data annotation. We developed an ensemble auto-evaluation LLM that incorporates expert labeling characteristics, average fact-abstract similarity (F1), and low-similarity fact counts (F2), enabling the construction of a silver-standard dataset (3,720 QA pairs). We are committed to maintaining RAGPPI as a resource to support the research community in advancing RAG systems for drug discovery QA solutions.
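As an illustrative sketch of the two similarity features named in the abstract, the Python snippet below derives an average fact-abstract similarity (feature F1, not the F1 score) and a low-similarity fact count (feature F2) from a list of extracted facts and a set of source abstracts. The abstract does not say how similarity is measured, so the token-overlap (Jaccard) measure, the best-match aggregation, the 0.2 threshold, and every function name here are assumptions made purely for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; an assumed stand-in for the real measure."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_features(facts, abstracts, low_threshold=0.2):
    """Return (F1, F2) for one answer.

    F1: mean over facts of each fact's best similarity to any abstract.
    F2: number of facts whose best similarity falls below low_threshold.
    Both the aggregation and the threshold are hypothetical choices.
    """
    if not facts or not abstracts:
        raise ValueError("need at least one fact and one abstract")
    best = [max(jaccard(fact, abstract) for abstract in abstracts) for fact in facts]
    f1 = sum(best) / len(best)
    f2 = sum(1 for score in best if score < low_threshold)
    return f1, f2
```

Under these assumptions, an answer with a high F1 and an F2 of zero is one whose facts are all supported by the retrieved abstracts, which is the kind of signal an auto-evaluator could combine with expert labeling characteristics.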

2505.22695 2026-06-12 cs.LG version update

LLM-ODDR: A Large Language Model Framework for Joint Order Dispatching and Driver Repositioning

Tengfei Lyu, Siyuan Feng, Hao Liu, Hai Yang

Affiliations: Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou); Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University; Research Center for Low Altitude Economy, The Hong Kong Polytechnic University; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology

AI summary: Proposes the LLM-ODDR framework, which uses large language models to jointly optimize ride-hailing order dispatching and driver repositioning, improving effectiveness, adaptability, and interpretability through multi-objective value refinement, fairness-aware dispatching, and spatiotemporal demand-aware repositioning.

Details
Comments
Published in IEEE Transactions on Intelligent Transportation Systems (TITS)
English abstract

Ride-hailing platforms face significant challenges in optimizing order dispatching and driver repositioning operations in dynamic urban environments. Traditional approaches based on combinatorial optimization, rule-based heuristics, and reinforcement learning often overlook driver income fairness, interpretability, and adaptability to real-world dynamics. To address these gaps, we propose LLM-ODDR, a novel framework leveraging Large Language Models (LLMs) for joint Order Dispatching and Driver Repositioning (ODDR) in ride-hailing services. The LLM-ODDR framework comprises three key components: (1) Multi-objective-guided Order Value Refinement, which evaluates orders by considering multiple objectives to determine their overall value; (2) Fairness-aware Order Dispatching, which balances platform revenue with driver income fairness; and (3) Spatiotemporal Demand-Aware Driver Repositioning, which optimizes idle vehicle placement based on historical patterns and projected supply. We also develop JointDR-GPT, a fine-tuned model optimized for ODDR tasks with domain knowledge. Extensive experiments on real-world datasets from Manhattan taxi operations demonstrate that our framework significantly outperforms traditional methods in terms of effectiveness, adaptability to anomalous conditions, and decision interpretability. To our knowledge, this is the first exploration of LLMs as decision-making agents in ride-hailing ODDR tasks, establishing foundational insights for integrating advanced language models within intelligent transportation systems. While the current framework incurs higher computational costs than traditional methods, we show that parallel decomposition and model distillation can reduce latency to levels viable for production deployment.

2505.01869 2026-06-12 cs.CV version update

Visual enhancement and 3D representation for underwater scenes: a review

Guoxi Huang, Haoran Wang, Brett Seymour, Evan Kovacs, John Ellerbroc, Dave Blackham, Nantheera Anantrasirichai

Affiliations: Visual Information Laboratory, University of Bristol; Submerged Resources Center, National Park Service; Marine Imaging Technologies, LLC; Gates Underwater Products, Inc; Esprit film and television Ltd

AI summary: Reviews underwater visual enhancement and 3D reconstruction methods, from physical models through non-learning and data-driven techniques (such as NeRF and 3D Gaussian Splatting), evaluates a range of algorithms on benchmark datasets, and identifies future research directions.

Details
English abstract

Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.

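2408.17221 2026-06-12 cs.LG math.AG version update

Geometry of Lightning Self-Attention: Identifiability and Dimension

Nathan W. Henry, Giovanni Luca Marchetti, Kathlén Kohn

Affiliations: University of Toronto; Royal Institute of Technology (KTH)

AI summary: Uses tools from algebraic geometry to analyze the geometry of the function spaces of self-attention networks without normalization, describing the identifiability of deep attention and computing the dimension of the function space, characterizing the singular and boundary points of single-layer models, and conjecturing an extension to the normalized case.

Details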
Comments
Accepted at ICLR 2025
English abstract

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.

2501.04823 2026-06-12 cs.RO math.OC stat.AP version update

Learning Robot Safety from Sparse Human Feedback using Conformal Prediction

Aaron O. Feldman, Joseph A. Vincent, Maximilian Adang, JunEn Low, Mac Schwager

Affiliations: Department of Aeronautics and Astronautics, Stanford University

AI summary: Uses binary human feedback on policy trajectories and conformal prediction to identify a region of states containing future policy errors, yielding a warning system with a guaranteed miss rate that is also used to improve the safety of a model predictive controller.

Details
English abstract

Ensuring robot safety can be challenging; user-defined constraints can miss edge cases, policies can become unsafe even when trained from safe data, and safety can be subjective. Thus, we learn about robot safety by showing policy trajectories to a human who flags unsafe behavior. From this binary feedback, we use the statistical method of conformal prediction to identify a region of states, potentially in learned latent space, guaranteed to contain a user-specified fraction of future policy errors. Our method is sample-efficient, as it builds on nearest neighbor classification and avoids withholding data as is common with conformal prediction. By alerting if the robot reaches the suspected unsafe region, we obtain a warning system that mimics the human's safety preferences with guaranteed miss rate. From video labeling, our system can detect when a quadcopter visuomotor policy will fail to steer through a designated gate. We present an approach for policy improvement by avoiding the suspected unsafe region. With it we improve a model predictive controller's safety, as shown in experimental testing with 30 quadcopter flights across 6 navigation tasks. Code and videos are provided.
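To make the idea of a warning region with a bounded miss rate concrete, here is a minimal sketch of generic split conformal calibration in Python. It is not the authors' construction: the abstract says their method builds on nearest neighbor classification and avoids withholding data, whereas this sketch assumes a separate calibration set of unsafe states. The Euclidean distance score and all function names are likewise assumptions.

```python
import math


def conformal_threshold(scores, miss_rate):
    """Return the ceil((n + 1) * (1 - miss_rate))-th smallest calibration score.

    `scores` are assumed to be distances from held-out unsafe states to their
    nearest flagged state. Under exchangeability, a new unsafe state scores at
    or below this threshold with probability at least 1 - miss_rate.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - miss_rate))
    if k > n:
        return math.inf  # too few calibration points to certify this miss rate
    return sorted(scores)[k - 1]


def nearest_distance(state, flagged_states):
    """Euclidean distance from `state` to the closest human-flagged unsafe state."""
    return min(math.dist(state, flagged) for flagged in flagged_states)


def should_warn(state, flagged_states, threshold):
    """Alert when the state falls inside the suspected unsafe region."""
    return nearest_distance(state, flagged_states) <= threshold
```

With only a handful of calibration points the threshold becomes infinite for small miss rates, meaning every state triggers a warning; that trade-off between label count and region tightness is one reason sample efficiency matters in this setting.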

2301.12538 2026-06-12 cs.LG cs.AI math.DS version update

On Approximating the Dynamic Response of Synchronous Generators via Operator Learning: A Step Towards Building Deep Operator-based Power Grid Simulators

Christian Moya, Amirhossein Mollaali, Guang Lin, Meng Yue

Affiliations: Purdue University

AI summary: Proposes an operator learning framework that uses DeepONet to approximate the dynamic response of synchronous generators, with a recursive simulation scheme, a residual DeepONet scheme, and a data aggregation strategy for simulations that interact with the power grid.

Details
English abstract

This paper develops an Operator Learning framework for approximating the dynamic response of synchronous generators. The framework can be used to (i) build a neural network-based generator model that interacts with a power grid simulator or (ii) shadow the true generator's transient response. First, we develop a data-driven Deep Operator Network (DeepONet) to approximate the infinite-dimensional solution operator of the generators. Then, we design a numerical scheme based on DeepONet that simulates the generator's response over a given time horizon. The proposed scheme recursively employs the trained DeepONet to simulate the response for a given multi-dimensional input that describes the interaction between the generator and the power grid. In addition, we design a residual DeepONet numerical scheme that can incorporate information from existing mathematical models. We accompany this residual DeepONet scheme with an estimate for the prediction's cumulative error. Finally, we build a data aggregation (DAgger) strategy that allows fine-tuning of DeepONets using aggregated training data that the DeepONets will likely encounter during interactive simulations with other grid components. As a proof of concept, we demonstrate that the proposed frameworks can effectively approximate the transient model of a synchronous generator.
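The recursive and residual schemes described in the abstract can be sketched in a few lines of Python. The snippet treats the trained network as an opaque one-step predictor; `step_operator`, `physics_step`, and `learned_residual` are hypothetical callables standing in for a trained DeepONet and an existing mathematical model, and the scalar state is a simplification chosen for illustration.

```python
def rollout(step_operator, x0, inputs):
    """Simulate over a horizon by recursive application of a one-step predictor.

    x[k + 1] = step_operator(x[k], u[k]), where u[k] is the input describing the
    interaction with the grid on interval k.
    """
    trajectory = [x0]
    for u in inputs:
        trajectory.append(step_operator(trajectory[-1], u))
    return trajectory


def make_residual_step(physics_step, learned_residual):
    """Build a residual predictor: the model's step plus a learned correction."""
    def step(x, u):
        return physics_step(x, u) + learned_residual(x, u)
    return step
```

Because each prediction feeds the next step, one-step errors can accumulate over the horizon, which is presumably why the abstract pairs the residual scheme with a cumulative error estimate and a data aggregation strategy.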

2606.04009 2026-06-12 stat.ML cs.AI cs.LG version update

Counterfactual Explanations for Deep Two-Sample Testing

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert

Affiliations: Hasso-Plattner-Institute, University of Potsdam; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai

AI summary: For deep two-sample testing, proposes a counterfactual explanation framework based on a diffusion autoencoder and MMD optimization that generates sample-level edits revealing the features that drive rejection of the null hypothesis.

Details
Comments
17 pages
English abstract

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
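For readers unfamiliar with the quantities involved, the following Python sketch computes a (biased) squared maximum mean discrepancy with a Gaussian kernel and a permutation p-value. It operates directly on small feature vectors; in the paper the MMD is evaluated in a deep test model's representation space, which is not reproduced here. The kernel choice, bandwidth, permutation count, and function names are all assumptions.

```python
import math
import random


def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    squared = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared / (2 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


def permutation_p_value(xs, ys, n_perm=200, seed=0, bandwidth=1.0):
    """Two-sample p-value: share of label shuffles with MMD at least as large."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, bandwidth)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], bandwidth) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In this framing, a counterfactual edit is doing its job if replacing the source samples with their edited versions lowers `mmd2` against the target group and raises the resulting p-value.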

2606.02778 2026-06-12 astro-ph.EP astro-ph.IM cs.LG version update

One Transit Is All You Need: Detecting Exoplanets Through Learned Stellar Behaviour with EXOVEIL

Pratik Priyanshu

Affiliations: SRH Hochschule

AI summary: Proposes EXOVEIL, a system that uses a Transformer world model and self-supervised learning to detect single-transit events in raw light curves, achieving high recall on Kepler data and zero-shot transfer to the TESS and PLATO missions.

Details
Comments
v3: appendix gallery of confirmed-planet recoveries added; Section 6 candidate catalogue reframed as transit-like anomalies for follow-up; TLS comparison table expanded
English abstract

I present EXOVEIL, a transit detection system that learns what a star's brightness should look like and flags when reality disagrees. Unlike existing systems that require phase-folded input, EXOVEIL operates on raw flux time series and can detect planets that transit only once. A Transformer world model, trained on 16,499 Kepler light curves with transit-masked self-supervised learning, predicts expected stellar flux. A matched-filter detector with variance weighting extracts transit signals from the prediction residuals. A learned classifier (XGBoost) separates planets from false positives, achieving AUC 0.938 on Kepler DR25. Applied to single-transit injection-recovery, EXOVEIL recovers 32% of transits at 1000 ppm depth, a task where all classification-based systems score 0% by construction. A blind search of 3,737 Kepler stars yields 179 new transit-like signals not present in the DR25 TCE catalogue, including 46 monotransit candidates. Applied without retraining to 47 confirmed TESS planets in the PLATO LOPS2 field, EXOVEIL achieves 100% recovery, demonstrating zero-shot cross-mission transfer. At PLATO's 25-second cadence, detection reaches 100 ppm -- approaching the Earth-analog regime. I provide the first application of conformal prediction to transit detection (95.9% empirical coverage) and release the system as pip install exoveil with pretrained weights and a candidate catalogue.
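The matched-filter step can be illustrated with a short, self-contained Python sketch. It slides a box-shaped dip template over prediction residuals (observed minus predicted flux) and weights each sample by its inverse variance. This is a textbook inverse-variance matched filter written under assumptions, not the EXOVEIL implementation: the box template, the per-sample variance input, and the function name are all hypothetical, and the snippet does not use the `exoveil` package.

```python
import math


def matched_filter_snr(residuals, variances, width):
    """Signal-to-noise ratio of a box-shaped dip at every start index.

    For a unit-depth box template, the inverse-variance weighted statistic is
    sum(-r_i / var_i) / sqrt(sum(1 / var_i)) over the `width` samples in the box.
    Noisy samples (large variance) therefore contribute less to the score.
    """
    snrs = []
    for start in range(len(residuals) - width + 1):
        window = range(start, start + width)
        numerator = sum(-residuals[i] / variances[i] for i in window)
        denominator = math.sqrt(sum(1.0 / variances[i] for i in window))
        snrs.append(numerator / denominator)
    return snrs
```

A single-transit candidate would then be the start index with the highest score above some detection threshold; because no phase folding is involved, one dip is enough to produce a peak.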

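2606.01538 2026-06-12 cs.GR cs.CV cs.LG version update

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

Žiga Kovačič, Kevin Ellis

Affiliations: Cornell University

AI summary: Builds a dataset of 2D Material Point Method (MPM) simulations to study the ability to infer physical dynamics from video and extrapolate them forward in time, comparing the strengths and weaknesses of code generation and video diffusion approaches.

Details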
Comments
16 pages, 13 figures. Project page: https://zzigak.github.io/mpmworlds/
English abstract

To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.

2605.26358 2026-06-12 physics.flu-dyn cs.LG version update

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows

Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano

Affiliations: Mathematical Institute, University of Oxford; Aerospace and Mechanical Engineering, University of Notre Dame

AI summary: Proposes DARSM, a physics-derived deep learning closure model in which a neural network maps flow invariants to the empirical parameters of an implicit algebraic Reynolds stress equation, optimized end to end with adjoint equations; on square-duct and periodic-hill benchmarks it reduces average velocity error by a factor of 2 to 4.

Details
English abstract

Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and accurately generalise across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by $2$-$4\times$ across Reynolds number, geometries, and flow regimes, with peak case-level reductions of $12\times$. The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.

2605.26144 2026-06-12 cs.SE cs.AI cs.CV version update

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

JunJia Guo, Yuhang Yao, Jiawei Zhou, Jingdi Chen

Affiliations: University of Arizona; Zoom; Stony Brook University

AI summary: Proposes the VISTA benchmark, which uses multi-dimensional input conditions and evaluation metrics to measure how well LLM-based agents generate functional, visually coherent web apps from visual specifications.

Details
Comments
Project page: https://kaboider.github.io/VIS_APP/; Code: https://github.com/kaboider/VIS_APP_Code; Dataset: https://huggingface.co/datasets/JunJiaGuo/VIS-APP-Bench
English abstract

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

2512.15133 2026-06-12 cs.CE cs.AI version update

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan

Affiliations: The Hong Kong Polytechnic University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes HD-Prot, a hybrid diffusion protein language model that extends a sequence pLM to multiple modalities through continuous structure tokens for joint sequence-structure modeling, achieving competitive performance across a range of tasks.

Details
Comments
This is the long version of the corresponding paper to appear at KDD 2026
English abstract

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

2605.14568 2026-06-12 cs.SE cs.CL cs.LG version update

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

Affiliations: Independent Researcher, Applied MBA (Data Analytics), Texas Wesleyan University; Independent Researcher, B.E. Computer Engineering, National University of Sciences and Technology (NUST); Independent Researcher, M.Sc. Management, Technical University of Munich

AI summary: Uses ML classifiers and LLM baselines to identify extractable subscenarios in behaviour-driven development test suites and quantifies their prevalence across the public BDD ecosystem.

Details
Comments
31 pages, 10 figures, 6 tables, 56 references. v2: retitled; reference list fully corrected and verified; decision-threshold sensitivity analysis and imbalance-robust baseline metrics added; figures restyled. Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359
English abstract

Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
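The window-mining step in the Method can be sketched as follows. The snippet enumerates every contiguous L-step window (L from 2 to 18, as in the abstract) across a set of scenarios and counts how often each one recurs. The `normalise` function is only a crude, hypothetical stand-in for the paraphrase-robust cluster identifiers, which the paper obtains from SBERT / UMAP / HDBSCAN clustering; the function names and the `min_support` parameter are also assumptions.

```python
from collections import Counter

KEYWORDS = {"given", "when", "then", "and", "but"}


def normalise(step: str) -> str:
    """Lower-case a Gherkin step and drop its leading keyword (assumed key)."""
    words = step.lower().split()
    if words and words[0] in KEYWORDS:
        words = words[1:]
    return " ".join(words)


def mine_slices(scenarios, min_len=2, max_len=18):
    """Count every contiguous window of min_len..max_len steps across scenarios."""
    counts = Counter()
    for steps in scenarios:
        keys = [normalise(step) for step in steps]
        for length in range(min_len, min(max_len, len(keys)) + 1):
            for start in range(len(keys) - length + 1):
                counts[tuple(keys[start:start + length])] += 1
    return counts


def recurring(counts, min_support=2):
    """Keep only the slices that occur at least min_support times."""
    return {key: n for key, n in counts.items() if n >= min_support}
```

Counting the same keys separately per file, per repository, and across owners would give the three scopes the abstract mentions, each corresponding to one of the three refactoring patterns.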

2605.12542 2026-06-12 astro-ph.IM astro-ph.EP cs.LG 版本更新

Earth Science Foundation Models: From Perception to Reasoning and Discovery

地球科学基础模型:从感知到推理与发现

Xiangyu Zhao, Bo Liu, Yuehan Zhang, Zelin Song, Wanghan Xu, Feng Liu, Fengxiang Wang, Ben Fei, Fenghua Ling, Wangxu Wei, Wenlong Zhang, Xiao-Ming Wu

发表机构 * Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University(数据科学与人工智能系,香港理工大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

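AI summary This survey reviews Earth science foundation models, tracing how their capabilities have evolved from perception to multimodal reasoning and scientific discovery, and summarizes their applications across the atmosphere, hydrosphere, lithosphere, and other domains.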

详情
AI中文摘要

大规模基础模型(FMs)正在通过整合异构多模态数据,如多平台影像、格网再分析数据、多样的地球物理和地球化学观测以及领域特定文本,来推动地球科学的发展。本文通过两个互补维度对地球科学基础模型(地球FMs)进行统一综述:深度,即追踪模型能力从感知到多模态推理和代理科学工作流的演变;广度,即总结其在大气、水圈、岩石圈、生物圈、人类圈和冰圈以及耦合地球系统过程中的扩展应用。利用这一框架,我们回顾了代表性多模态地球基础模型,并编译了超过200个数据集和基准,涵盖多样化的地球科学任务和模态。我们进一步讨论了多模态数据异构性、科学可靠性和持续更新、可扩展性和可持续性以及从基础模型到代理和具身地球智能的转变,并展望了更集成、可信和可操作的AI地球科学家的未来方向。总体而言,本文为理解地球基础模型的发展提供了结构化的路线图,从能力和应用广度两个方面进行综述。

英文摘要

Large foundation models (FMs) are transforming Earth science by integrating heterogeneous multimodal data, such as multi-platform imagery, gridded reanalysis data, diverse geophysical and geochemical observations, and domain-specific text, to support tasks ranging from basic perception to advanced scientific discovery. This paper provides a unified review of Earth science foundation models (Earth FMs) through two complementary dimensions: depth, which traces the evolution of model capabilities from perception to multimodal reasoning and agentic scientific workflows, and breadth, which summarizes their expanding applications across the atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, as well as coupled Earth system processes. Using this framework, we review representative multimodal Earth foundation models and compile more than 200 datasets and benchmarks spanning diverse Earth science tasks and modalities. We further discuss key challenges in multimodal data heterogeneity, scientific reliability and continual updating, scalability and sustainability, and the transition from foundation models to agentic and embodied Earth intelligence, and outline future directions toward more integrated, trustworthy, and actionable AI Earth scientists. Overall, this paper offers a structured roadmap for understanding the development of Earth foundation models from both capability depth and application breadth.

2604.24806 2026-06-12 cs.IR cs.AI cs.DB 版本更新

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

版本化延迟物化:面向大规模推荐系统的超长序列训练

Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng, Yanzun Huang

发表机构 * Meta Platforms, Inc.(Meta平台)

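AI summary Proposes a versioned late materialization paradigm that eliminates data redundancy through normalized storage and just-in-time sequence reconstruction, supporting training on ultra-long user interaction histories while reducing storage and I/O overhead and improving model quality.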

详情
AI中文摘要

现代深度学习推荐模型(DLRM)遵循序列长度的缩放定律,推动前沿走向超长用户交互历史(UIH)。然而,行业标准的“Fat Row”范式将序列预物化到每个训练样本中,造成存储和I/O瓶颈,数据基础设施使用超过GPU训练容量,数据冗余在多租户环境中被放大,其中不同序列长度需求的模型共享联合数据集。我们提出了一种\emph{版本化延迟物化}范式,通过将UIH归一化存储在一个不可变层中,并在训练期间通过轻量级版本指针即时重建序列,从而消除冗余。系统通过一个分叉协议确保在线到离线(O2O)一致性,防止未来泄漏跨流式和批式训练,同时一个读优化的不可变存储层为异构模型租户提供多维投影下推。解耦的数据预处理与流水线I/O预取和数据亲和性优化掩盖了训练时序列重建的延迟,使训练吞吐量保持GPU计算受限。部署在生产DLRM上,系统减少了训练数据基础设施资源使用,同时实现了激进的序列长度缩放,带来显著的模型质量提升,作为现代推荐模型架构(包括HSTU和ULTRA-HSTU)的基础数据基础设施。

英文摘要

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat Row" paradigm, which pre-materializes these sequences into every training example, creates a storage and I/O wall: data infrastructure usage exceeds GPU training capacity because of data redundancy, which is amplified in multi-tenant environments where models with vastly different sequence length requirements share a union dataset. We present a versioned late materialization paradigm that eliminates this redundancy by storing UIH once in a normalized, immutable tier and reconstructing sequences just-in-time during training via lightweight versioned pointers. The system ensures Online-to-Offline (O2O) consistency through a bifurcated protocol that prevents future leakage across both streaming and batch training, while a read-optimized immutable storage layer provides multi-dimensional projection pushdown for heterogeneous model tenants. Disaggregated data preprocessing with pipelined I/O prefetching and data-affinity optimizations masks the latency of training-time sequence reconstruction, keeping training throughput compute-bound by GPUs. Deployed on production DLRMs, the system reduces training data infrastructure resource usage while enabling aggressive sequence length scaling that delivers significant model quality gains, serving as the foundational data infrastructure for modern recommendation model architectures, including HSTU and ULTRA-HSTU.

2604.16548 2026-06-12 cs.CR cs.AI cs.CL 版本更新

A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle

LLM智能体中长期记忆安全综述:跨记忆生命周期的攻击、防御与治理

Zehao Lin, Xixuan Hao, Renyu Fu, Shaobo Cui, Kai Chen, Chunyu Li, Zhiyu Li, Feiyu Xiong

发表机构 * MemTensor Shanghai Jiao Tong University(上海交通大学)

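AI summary Proposes a Memory Lifecycle Framework that systematically analyzes the new threats facing long-term memory in LLM agents, and introduces the architectural primitives of Verifiable Memory Governance (VMG), stressing that storage-time provenance and versioning are key to security.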

详情
AI中文摘要

LLM智能体中可写、跨会话持久记忆的出现,引入了与传统的以输入为中心的安全问题性质不同的威胁格局,其特点包括三个属性:持久性、状态性和传播性。为系统描述这一格局,我们提出记忆生命周期框架,该框架沿两个轴组织攻击、防御及其跨阶段依赖关系:六个生命周期阶段(写入、存储、检索、执行、共享与传播、遗忘与回滚)和四个安全目标(完整性、机密性、可用性、治理)。该分析进而揭示了在系统层面需要形式化安全保证,从而推动了可验证记忆治理(VMG)——一个由五个架构原语组成的框架,它规定了长期记忆系统必须提供哪些可验证机制,以维持对其记忆状态的可审计、可恢复控制。我们的分析表明,健壮的长期记忆(LTM)安全无法仅在检索或执行时进行事后补救,而必须从一开始就锚定于存储时的溯源、版本控制和策略感知的保留。

英文摘要

The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from conventional input-centric security concerns, characterized by three properties: persistence, statefulness, and propagation. To systematically characterize this landscape, we propose a Memory Lifecycle Framework that organizes attacks, defenses, and their cross-phase dependencies along two axes: six lifecycle phases (Write, Store, Retrieve, Execute, Share & Propagate, Forget & Rollback) and four security objectives (Integrity, Confidentiality, Availability, Governance). This analysis in turn exposes the need for formal security guarantees at the system level, motivating Verifiable Memory Governance (VMG), a framework of five architectural primitives that specifies what verifiable mechanisms a long-term-memory system must provide to maintain auditable, recoverable control over its memory state. Our analysis indicates that robust Long-Term Memory (LTM) security cannot be retrofitted at retrieval or execution time alone, but must be anchored in storage-time provenance, versioning, and policy-aware retention from the outset.

2604.07590 2026-06-12 cs.IR cs.AI 版本更新

DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation

DCD:面向领域的受控检索增强生成设计

Valerii Kovalskii, Nikita Belov, Nikita Miteyko, Igor Reshetnikov, Maksim Maksimov

发表机构 * red_mad_robot

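AI summary Proposes DCD (Domain-Collection-Document), a hierarchical design that uses structured knowledge representation and multi-stage routing to control the scope of retrieval and generation without modifying the language model, improving the robustness and accuracy of RAG on heterogeneous corpora and multi-step queries.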

详情
Comments
14 pages, 4 figures, 2 links, link to HF https://huggingface.co/datasets/redmadrobot-rnd/dcd, link to GIT https://github.com/redmadrobot-rnd/dcd
AI中文摘要

检索增强生成(RAG)被广泛用于将大型语言模型锚定在外部知识源中。然而,当应用于异构语料库和多步查询时,朴素RAG管道由于扁平的知识表示和缺乏显式工作流而常常质量下降。在这项工作中,我们引入了DCD(领域-集合-文档),一种面向领域的设计,用于结构化知识并控制RAG系统中的查询处理,而无需修改底层语言模型。所提出的方法依赖于信息空间的层次分解和基于结构化模型输出的多阶段路由,使得检索和生成范围能够逐步受限。该架构辅以智能分块、混合检索以及集成验证和生成护栏机制。我们描述了DCD架构和工作流程,并讨论了在合成评估数据集上的评估结果,突出了它们在应用RAG场景中对鲁棒性、事实准确性和答案相关性的影响。

英文摘要

Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to heterogeneous corpora and multi-step queries, naive RAG pipelines often degrade in quality due to flat knowledge representations and the absence of explicit workflows. In this work, we introduce DCD (Domain-Collection-Document), a domain-oriented design to structure knowledge and control query processing in RAG systems without modifying the underlying language model. The proposed approach relies on a hierarchical decomposition of the information space and multi-stage routing based on structured model outputs, enabling progressive restriction of both retrieval and generation scopes. The architecture is complemented by smart chunking, hybrid retrieval, and integrated validation and generation guardrail mechanisms. We describe the DCD architecture and workflow and discuss evaluation results on a synthetic evaluation dataset, highlighting their impact on robustness, factual accuracy, and answer relevance in applied RAG scenarios.

2503.02178 2026-06-12 stat.ML cs.LG 版本更新

Central Limit Theorems for Stochastic Gradient Descent Quantile Estimators

随机梯度下降分位数估计量的中心极限定理

Ziyang Wei, Jiaqi Li, Likai Chen, Wei Biao Wu

发表机构 * Department of Statistics, University of Chicago(芝加哥大学统计系) Department of Statistics and Data Science, Washington University in St. Louis(圣路易斯华盛顿大学统计与数据科学系)

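AI summary For quantile estimation via constant-learning-rate SGD, this paper uses Markov chain theory to prove that the stationary distribution converges to a Gaussian as the learning rate tends to zero, gives the first CLT-type theoretical guarantee, and proposes a recursive algorithm for confidence intervals.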

详情
AI中文摘要

本文发展了通过恒定学习率的随机梯度下降(SGD)进行分位数估计的渐近理论。分位数损失函数既不光滑也不强凸。超越传统视角和技术,我们将分位数SGD迭代视为一个不可约、周期且正常返的马尔可夫链,该链循环收敛到其唯一的平稳分布,无论初始值如何任意固定。为了推导平稳分布的精确形式,我们通过利用平稳方程分析其特征函数的结构。我们还推导了其矩生成函数(MGF)和尾部概率的紧界。综合上述方法,我们证明了当学习率$\eta\rightarrow0$时,中心化和标准化的平稳分布收敛到高斯分布。这一发现为恒定学习率的分位数SGD估计量提供了首个中心极限定理(CLT)类型的理论保证。我们进一步提出了一种递归算法来构建具有统计保证的估计量的置信区间。数值研究展示了在线估计器和推断过程的有效有限样本性能。本研究所发展的理论工具对于研究一般形式化为马尔可夫链的SGD算法具有独立意义,特别是在非强凸和非光滑设置中。

英文摘要

This paper develops asymptotic theory for quantile estimation via stochastic gradient descent (SGD) with a constant learning rate. The quantile loss function is neither smooth nor strongly convex. Beyond conventional perspectives and techniques, we view quantile SGD iteration as an irreducible, periodic, and positive recurrent Markov chain, which cyclically converges to its unique stationary distribution regardless of the arbitrarily fixed initialization. To derive the exact form of the stationary distribution, we analyze the structure of its characteristic function by exploiting the stationary equation. We also derive tight bounds for its moment generating function (MGF) and tail probabilities. Synthesizing the aforementioned approaches, we prove that the centered and standardized stationary distribution converges to a Gaussian distribution as the learning rate $\eta \rightarrow 0$. This finding provides the first central limit theorem (CLT)-type theoretical guarantees for the quantile SGD estimator with constant learning rates. We further propose a recursive algorithm to construct confidence intervals of the estimators with statistical guarantees. Numerical studies demonstrate the effective finite-sample performance of the online estimator and inference procedure. The theoretical tools developed in this study are of independent interest for investigating general SGD algorithms formulated as Markov chains, particularly in non-strongly convex and non-smooth settings.
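For readers unfamiliar with the estimator, the following is a minimal sketch of constant-learning-rate SGD on the quantile (pinball) loss, using the standard subgradient update q <- q - eta * (1{x <= q} - tau). The function name `sgd_quantile`, the Gaussian test stream, and the parameter values are illustrative assumptions; the paper's confidence-interval algorithm is not reproduced here.

```python
import random

def sgd_quantile(stream, tau, eta, q0=0.0):
    """Constant-step SGD on the pinball loss; returns the last iterate."""
    q = q0
    for x in stream:
        # Subgradient of the pinball loss with respect to q at sample x
        grad = (1.0 if x <= q else 0.0) - tau
        q -= eta * grad
    return q

# Illustrative stream: standard normal samples, whose true median is 0.
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
q_hat = sgd_quantile(samples, tau=0.5, eta=0.01)
```

Because the step size never decays, the iterate does not settle on a single value but keeps fluctuating around the true quantile; that fluctuation is the stationary distribution whose Gaussian limit the abstract describes.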

2305.08175 2026-06-12 cs.DB cs.CR cs.LG 版本更新

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

ResidualPlanner+:一种用于边际查询及更广泛查询的可扩展矩阵机制

Guanlin He, Yingtai Xiao, Levent Toksoz, Zeyu Ding, Danfeng Zhang, Daniel Kifer

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) Binghamton University(宾厄姆顿大学) Duke University(杜克大学) TikTok Inc.(抖音公司)

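AI summary Proposes two scalable matrix mechanisms, ResidualPlanner and ResidualPlanner+, which respectively optimize the accuracy of marginal queries and support more complex workloads (such as range queries), substantially outperforming existing methods in speed and memory use.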

详情
AI中文摘要

带噪声的边际查询是保护机密性的常见数据发布形式,对于列联表分析、贝叶斯网络构建甚至合成数据生成等下游任务非常有用。为线性查询(如边际查询)提供无偏噪声答案的隐私机制称为矩阵机制。我们提出了ResidualPlanner和ResidualPlanner+,两种高度可扩展的矩阵机制。ResidualPlanner在使用高斯噪声回答边际查询时既最优又可扩展,而ResidualPlanner+支持更通用的工作负载,例如边际查询与范围查询或前缀和查询的组合。ResidualPlanner可以优化许多损失函数,这些损失函数可以写成边际方差的凸函数(先前的工作仅限于一个预定义的目标函数)。ResidualPlanner可以在几秒钟内优化大规模设置中边际查询的精度,即使之前的最先进方法(HDMM)内存耗尽。它甚至可以在几分钟内处理具有100个属性的数据集。此外,ResidualPlanner可以高效计算每个边际的方差/协方差值(先前的方法即使对于相对较小的数据集也会很快耗尽内存)。ResidualPlanner+支持更复杂的工作负载,这些工作负载结合了边际查询和范围/前缀和查询(例如,关于种族的边际查询、关于年龄的范围查询以及回答每个种族的年龄范围查询的组合种族/年龄表格)。它甚至支持用户在不同属性上自定义工作负载。凭借这种增加的灵活性,ResidualPlanner+不一定是最优的,但它仍然极具可扩展性,并且在精度和速度上均优于先前的最先进方法(HDMM)处理前缀和查询。

英文摘要

Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large-scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal, but it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.
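To make the term "noisy marginal" concrete, here is a minimal sketch that tabulates one marginal and adds independent Gaussian noise to each cell. It is not ResidualPlanner: there is no workload optimization, and the noise scale `sigma` is left as a free parameter rather than calibrated to any privacy budget. The function name `noisy_marginal` and the toy records are hypothetical.

```python
import random
from collections import Counter

def noisy_marginal(records, attrs, domain, sigma, rng):
    """Tabulate the marginal over `attrs` and add N(0, sigma^2) noise per cell."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    # Iterate over the full domain so that empty cells also receive noise
    return {cell: counts.get(cell, 0) + rng.gauss(0.0, sigma) for cell in domain}

# Toy data with a single attribute (assumed for illustration only).
records = [{"color": "red"}, {"color": "red"}, {"color": "blue"}]
domain = [("red",), ("blue",), ("green",)]
exact = noisy_marginal(records, ["color"], domain, sigma=0.0, rng=random.Random(0))
noisy = noisy_marginal(records, ["color"], domain, sigma=1.0, rng=random.Random(0))
```

The noise has zero mean, so each noisy cell is an unbiased estimate of the true count, which is the property the abstract attributes to matrix mechanisms.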

2601.19072 2026-06-12 cs.SE cs.AI 版本更新

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

HalluJudge: 代码审查自动化中上下文错位的无参考幻觉检测

Kla Tantithamthavorn, Hong Yi Lin, Patanamon Thongtanunam, Wachiraphan Charoenwet, Minwoo Jeong, Ming Wu

发表机构 * Monash University Australia(莫纳什大学,澳大利亚) The University of Melbourne Australia(墨尔本大学,澳大利亚) Atlassian USA(Atlassian,美国)

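AI summary Proposes HalluJudge, a reference-free hallucination detection method that assesses the grounding of generated review comments through context alignment, using strategies that include multi-branch reasoning; it reaches F1 = 0.85 at an average cost of $0.009, and 67% of its assessments agree with developer preference.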

详情
Comments
Accepted at FSE'26: Industry Track, Full-Length, Peer-Reviewed
AI中文摘要

大型语言模型(LLM)在代码审查自动化(如审查评论生成)中表现出强大能力,但它们存在幻觉——生成的审查评论与实际代码无根基——这对LLM在代码审查工作流程中的应用构成重大挑战。为解决此问题,我们探索了无需参考的、有效且可扩展的方法来检测LLM生成的代码审查评论中的幻觉。在这项工作中,我们设计了HalluJudge,旨在基于上下文对齐评估生成评论的根基性。HalluJudge包括四种关键策略,从直接评估到结构化多分支推理(例如,思维树)。我们在Atlassian的企业级软件项目中对这些评估策略进行了全面评估,以检验HalluJudge的有效性和成本效率。此外,我们分析了HalluJudge的判断与实际生产环境中LLM生成的代码审查评论的开发人员偏好之间的一致性。我们的结果表明,HalluJudge中的幻觉评估具有成本效益,F1得分为0.85,平均成本为0.009美元。平均而言,67%的HalluJudge评估与在线生产中实际LLM生成的审查评论的开发人员偏好一致。我们的结果表明,HalluJudge可以作为实用的保障措施,减少开发人员接触幻觉评论,从而促进对AI辅助代码审查的信任。

英文摘要

Large language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer from hallucinations, where the generated review comments are ungrounded in the actual code, which poses a significant challenge to the adoption of LLMs in code review workflows. To address this, we explore effective and scalable reference-free methods for detecting hallucinations in LLM-generated code review comments. In this work, we design HalluJudge, which aims to assess the grounding of generated review comments based on context alignment. HalluJudge includes four key strategies ranging from direct assessment to structured multi-branch reasoning (e.g., Tree-of-Thoughts). We conduct a comprehensive evaluation of these assessment strategies across Atlassian's enterprise-scale software projects to examine the effectiveness and cost-efficiency of HalluJudge. Furthermore, we analyze the alignment between HalluJudge's judgment and developer preference for the actual LLM-generated code review comments in real-world production. Our results show that the hallucination assessment in HalluJudge is cost-effective, with an F1 score of 0.85 and an average cost of $0.009. On average, 67% of the HalluJudge assessments are aligned with developer preference for the actual LLM-generated review comments in online production. Our results suggest that HalluJudge can serve as a practical safeguard to reduce developers' exposure to hallucinated comments, fostering trust in AI-assisted code reviews.

2603.17527 2026-06-12 stat.ML cs.LG math.OC 版本更新

Mirror Descent on Riemannian Manifolds

黎曼流形上的镜像下降

Jiaxin Jiang, Lei Shi, Jiyuan Tan

发表机构 * School of Mathematical Sciences, Fudan University, Shanghai 200433, China(复旦大学数学学院,上海200433,中国) Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China(上海当代应用数学重点实验室,复旦大学,上海200433,中国)

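AI summary Generalizes mirror descent to Riemannian manifolds, proposing Riemannian Mirror Descent (RMD) and its stochastic variant via reparameterization and establishing non-asymptotic convergence guarantees; on the Stiefel manifold, RMD reduces to Curvilinear Gradient Descent (CGD).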

详情
AI中文摘要

镜像下降(MD)是一种可扩展的一阶方法,广泛应用于大规模优化,包括图像处理、策略优化和神经网络训练。本文将MD推广到黎曼流形上的优化。具体地,我们通过重参数化开发了一个黎曼镜像下降(RMD)框架,并进一步提出了RMD的随机变体。我们还为RMD和随机RMD建立了非渐近收敛保证。作为在Stiefel流形上的应用,我们的RMD框架退化为[26]中提出的曲线梯度下降(CGD)方法。此外,当将随机RMD框架特化到Stiefel设置时,我们得到了CGD的随机扩展,这有效地解决了大规模流形优化问题。

英文摘要

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.
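As background for the method being generalized, the sketch below shows classical (Euclidean-setting) mirror descent on the probability simplex with the negative-entropy mirror map, also known as exponentiated gradient. It is not the Riemannian variant proposed in the paper; the function name `entropic_mirror_descent`, the linear objective, and the step size are assumptions chosen for illustration.

```python
import math

def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the simplex with the negative-entropy mirror map."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Multiplicative update in the dual space, then renormalise onto the simplex
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x

# Minimise the linear objective <c, x> over the simplex; the optimum puts
# all mass on the coordinate with the smallest cost (index 1 here).
c = [3.0, 1.0, 2.0]
x = entropic_mirror_descent(lambda _x: c, [1 / 3, 1 / 3, 1 / 3], step=0.5, iters=200)
```

The multiplicative update keeps every iterate strictly inside the simplex without an explicit projection, which is the practical appeal of choosing a mirror map adapted to the constraint set.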

2603.11242 2026-06-12 stat.ML cs.LG 版本更新

A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation

统一潜在空间解缠的VAE框架及鲁棒的解缠效果评估

Xiaoan Lang, Md Mostafizer Rahman, Fang Liu

发表机构 * Department of Applied and Computational Mathematics and Statistics(应用与计算数学与统计系) Lucy Family Institute for Data & Society(数据与社会学院)

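AI summary Proposes bfVAE, a unified framework that integrates several disentangled VAE methods, develops FVH-LT and DBSR-LS to evaluate disentanglement effectiveness, and introduces the LSSI index to quantify latent structural separation without ground-truth generative factors.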

详情
AI中文摘要

评估和解释潜在表示(如变分自编码器VAE)对于多样数据类型仍然是一个重大挑战,尤其是当真实生成因子未知时。为此,我们将几种最先进的用于潜在空间解缠的VAE方法统一到一个框架——bfVAE中。为了评估解缠VAE模型的有效性并增强潜在空间可解释性,我们提出了通过潜在遍历的特征方差异质性(FVH-LT)和潜在空间中的脏块稀疏回归(DBSR-LS)。为了确保学习到的潜在空间的鲁棒可解释性,我们开发了一种贪婪对齐策略(GAS),该策略减轻了标签切换问题,并对齐不同运行中的潜在维度,为结果聚合奠定基础。我们还引入了一个方便的标量潜在空间分离指数(LSSI),该指数基于FVH-LT和DBSR-LS的GAS对齐输出,在不知道真实生成因子的情况下总结整体潜在结构分离。我们将bfVAE与五个VAE模型进行比较,并在七个表格和图像数据集上验证了FVH-LT、DBSR-LS和LSSI的有效性。在我们检查的实验设置下,bfVAE提供了一个更灵活的解缠框架,在解缠和重构之间实现了比基准VAE模型更有利的整体权衡;FVH-LT和DBSR-LS可靠地揭示了语义上有意义且与领域相关的潜在结构,并且通常产生一致的结果;LSSI对潜在结构分离做出了有效的定量总结。

英文摘要

Evaluating and interpreting latent representations, such as those learned by variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we unify several state-of-the-art disentangled VAE approaches for latent space disentanglement into one framework, bfVAE. To assess the effectiveness of a disentangled VAE model and enhance latent space interpretability, we propose Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS). To ensure robust interpretability of the learned latent space, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs, laying the foundation for result aggregation. We also introduce a convenient scalar latent space separation index (LSSI) based on the GAS-aligned outputs of FVH-LT and DBSR-LS to summarize the overall latent structural separation without knowledge of the ground-truth generative factors. We compare bfVAE to five VAE models and validate the effectiveness of FVH-LT, DBSR-LS, and LSSI on seven tabular and image datasets. Under our examined experimental settings, bfVAE provides a more flexible disentanglement framework and achieves a more favorable overall trade-off between disentanglement and reconstruction than the benchmark VAE models; FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures and generally yield consistent results; and LSSI provides an effective quantitative summary of latent structural separation.

2512.24787 2026-06-12 cs.IR cs.AI 版本更新

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

HiGR:腾讯工业级层次化生成式推荐框架

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

发表机构 * Platform and Content Group, Tencent(腾讯平台与内容组) Sun Yat-sen University(中山大学)

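AI summary Proposes the HiGR framework, which uses structured semantic IDs and a hierarchical decoder to address planning efficiency and slate-quality alignment for generative recommendation at industrial scale, improving offline quality by over 10% with a 5x inference speedup.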

详情
AI中文摘要

Slate推荐(在单个展示中向用户呈现排序项目列表)在主流在线平台中无处不在。虽然最近的生成式推荐方法在利用语义ID建模项目序列方面显示出强大潜力,但直接将其应用于工业规模的slate推荐面临根本性脱节:纠缠的SID空间混淆了高级列表规划,长序列上的细粒度自回归解码限制了语义规划效率,而令牌级目标与整体slate质量不一致。在本文中,我们提出HiGR,一个工业规模的层次化生成式slate推荐框架,通过协同设计的流水线弥合这一脱节。首先,HiGR通过前缀对比残差量化VAE(PCRQ-VAE)学习结构化SID。通过强制高级前缀捕获共享语义,PCRQ-VAE创建了一个可控的离散空间,作为高效规划的前提。利用这一结构化空间,我们的层次化Slate解码器(HSD)将自回归建模从纠缠的令牌级解码转变为粗粒度偏好嵌入。该设计显著降低了推理延迟,同时允许显式的全局slate结构规划。最后,这一稳定的规划空间使得基于ORPO的列表级对齐机制能够优化三重目标隐式反馈——排序保真度、真实用户兴趣和多样性。广泛的离线实验表明,HiGR在离线推荐质量上优于最先进的基线超过10%,同时实现了5倍的推理加速。腾讯平台上的在线A/B测试进一步将观看时间提高了1.22%,视频播放量提高了1.73%。HiGR已在多个腾讯平台表面部署,服务数亿用户,证明了其工业规模的适用性。

英文摘要

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs (SIDs), directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a $5\times$ inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE 版本更新

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

多轮交互中的安全隐患:工具使用智能体的多轮安全风险基准与防御

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

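AI summary Proposes MT-AgentRisk, a multi-turn tool-use safety benchmark, finds that the attack success rate rises by 16% on average in multi-turn settings, and designs ToolShield, a training-free, tool-agnostic self-exploration defense that lowers the attack success rate by 30% on average.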

详情
AI中文摘要

基于LLM的智能体能力日益增强,但其安全性滞后。这造成了智能体能够做什么和应该做什么之间的差距。随着智能体进行多轮交互并使用多样化的工具,这一差距扩大,引入了现有基准忽视的新风险。为了系统地将安全测试扩展到多轮、工具真实的设置,我们提出一个原则性的分类法,将单轮有害任务转化为多轮攻击序列。利用该分类法,我们构建了MT-AgentRisk(多轮智能体风险基准),这是首个评估多轮工具使用智能体安全性的基准。我们的实验揭示了显著的安全退化:在开放和封闭模型的多轮设置中,攻击成功率(ASR)平均增加16%。为了缩小这一差距,我们提出了ToolShield,一种无需训练、与工具无关的自我探索防御方法:当遇到新工具时,智能体自主生成测试用例,执行它们以观察下游效果,并提炼安全经验用于部署。实验表明,ToolShield在多轮交互中平均有效降低ASR 30%。我们的代码可在该网址获取。

英文摘要

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

2602.07294 2026-06-12 cs.CE cs.AI 版本更新

Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings

Fin-RATE:面向SEC文件的金融分析与追踪评估基准

Yidong Jiang, Junrong Chen, Eftychia Makri, Jialin Chen, Peiwen Li, Ali Maatouk, Leandros Tassiulas, Eliot Brenner, Bing Xiang, Rex Ying

发表机构 * Tongji University(同济大学) University of California, San Diego(加州大学圣地亚哥分校) Yale University(耶鲁大学) Goldman Sachs(高盛集团)

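AI summary To meet the need for LLMs to analyze complex regulatory filings in finance, proposes Fin-RATE, a benchmark built on SEC filings that evaluates models along three task pathways and finds that performance drops substantially in cross-document and cross-period analysis.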

详情
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

随着大型语言模型(LLM)在金融领域的部署日益增多,LLM越来越需要解析复杂的监管披露文件。然而,现有基准通常关注孤立细节,未能反映专业分析的复杂性——这种分析需要综合多个文档、报告期和公司实体的信息。此外,这些基准无法区分错误源于检索失败、生成不准确、领域特定推理错误还是对查询或上下文的误解,从而难以精确诊断性能瓶颈。为弥补这些不足,我们引入Fin-RATE,这是一个基于美国证券交易委员会(SEC)文件构建的基准,通过三条路径模拟金融分析师的工作流程:单个披露文件内的细节导向推理、共享主题下的跨实体比较,以及同一公司在多个报告期内的纵向跟踪。我们在真实上下文和检索增强设置下,对17个领先的LLM(包括开源、闭源和金融专用模型)进行了基准测试。结果显示,当任务从单文档推理转向纵向和跨实体分析时,性能显著下降,准确率分别下降18.60%和14.35%。这种下降与比较幻觉、时间和实体不匹配的增加有关,并进一步反映在推理质量和事实一致性的下降上——这些局限性是现有基准尚未正式分类或量化的。

英文摘要

With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis that requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings and mirroring financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35%, respectively, as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations and temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency, limitations that existing benchmarks have yet to formally categorize or quantify.

2602.10132 2026-06-12 physics.plasm-ph cs.AI 版本更新

TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models

TokaMark:MAST托卡马克等离子体模型的综合基准

Cécile Rousseau, Samuel Jackson, Rodrigo H. Ordonez-Hurtado, Nicola C. Amorisco, Tobia Boschi, George K. Holt, Andrea Loreti, Eszter Székely, Alexander Whittle, Adriano Agnello, Stanislas Pamela, Alessandra Pascale, Robert Akers, Juan Bernabe Moreno, Sue Thorne, Mykhaylo Zayats

发表机构 * IBM Research Europe(IBM欧洲研究院) UK Atomic Energy Authority(英国原子能局) STFC Hartree Centre(STFC哈特里中心)

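AI summary To address fusion data that is scarce, fragmented, and inconsistently annotated, proposes the TokaMark benchmark, which comprises 14 tasks, unifies access to multi-modal fusion data, standardizes evaluation protocols, and provides a baseline model to accelerate data-driven AI plasma modeling.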

详情
AI中文摘要

开发运行如托卡马克等商业可行的聚变能源反应堆,需要从稀疏、有噪声且不完整的传感器读数中准确预测等离子体动力学。底层物理的复杂性和实验数据的异质性给传统数值方法带来了巨大挑战,并凸显了现代数据驱动方法的潜力。然而,实现这一潜力的主要障碍是缺乏经过整理、公开可用的数据集和标准化基准。现有的聚变数据集稀缺、分散在不同机构、特定于设施且标注不一致,这限制了可重复性,并阻碍了AI方法的公平和可扩展比较。在本文中,我们介绍了TokaMark,一个结构化基准,用于评估在Mega Ampere Spherical Tokamak(MAST)收集的真实实验数据上的AI模型。TokaMark提供了一套全面的工具,旨在统一多模态聚变数据的访问并标准化评估协议。该基准包括14项精心策划的任务,涵盖一系列物理机制,利用多种诊断手段并覆盖多个操作用例。提供了一个基线模型,以便在统一框架内进行透明的比较和验证。通过建立统一的基准,TokaMark旨在加速数据驱动的AI等离子体建模的进展,为最终实现可持续和稳定的聚变能源做出贡献。数据集、基准、文档和工具已在 https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline 开源。

英文摘要

Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics from sparse, noisy, and incomplete sensor readings. The complexity of the underlying physics and the heterogeneity of experimental data pose formidable challenges for conventional numerical methods, and highlight the promise of modern data-native approaches. A major obstacle in realizing this potential is, however, the lack of curated, openly available datasets and standardized benchmarks. Existing fusion datasets are scarce, fragmented across institutions, facility-specific, and inconsistently annotated, which limits reproducibility and prevents a fair and scalable comparison of AI approaches. In this paper, we introduce TokaMark, a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark provides a comprehensive suite of tools designed to unify access to multi-modal fusion data and standardize evaluation protocols. The benchmark includes a curated list of 14 tasks spanning a range of physical mechanisms, exploiting a variety of diagnostics and covering multiple operational use cases. A baseline model is provided to facilitate transparent comparison and validation within a unified framework. By establishing a unified benchmark, TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy. The dataset, benchmark, documentation, and tooling are open-sourced at https://github.com/UKAEA-IBM-STFC-Fusion-FMs/tokamark_baseline.

2602.09379 2026-06-12 cs.MA cs.CL 版本更新

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

LingxiDiagBench: 用于基准测试大语言模型在中文精神科咨询与诊断中的多智能体框架

Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

发表机构 * Tianqiao and Chrissy Chen Institute(天桥和克里斯西·陈研究所) EverMind AI Inc.(EverMind AI公司) Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine(上海精神卫生中心,上海交通大学医学院)

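AI summary Proposes LingxiDiagBench, a multi-agent framework that includes a dataset of 16K EMR-aligned synthetic consultation dialogues and evaluates LLMs on static diagnosis and dynamic consultation; it finds low accuracy on depression-anxiety comorbidity recognition and 12-way differential diagnosis, and that dynamic consultation often underperforms static evaluation.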

详情
AI中文摘要

精神障碍在全球范围内高度流行,但精神科医生的短缺以及基于访谈诊断固有的主观性,对及时、一致的心理健康评估造成了重大障碍。AI辅助精神科诊断的进展受到缺乏基准测试的限制,这些基准测试需同时提供逼真的患者模拟、临床医生验证的诊断标签,并支持动态多轮咨询。我们提出LingxiDiagBench,一个大规模多智能体基准测试,评估LLM在中文静态诊断推理和动态多轮精神科咨询中的表现。其核心是LingxiDiag-16K,一个包含16,000个电子病历对齐的合成咨询对话数据集,旨在再现12个ICD-10精神科类别中真实的临床人口统计和诊断分布。通过对最先进LLM的大量实验,我们建立了关键发现:(1)尽管LLM在二元抑郁-焦虑分类上达到高准确率(高达92.3%),但在抑郁-焦虑共病识别(43.0%)和12类鉴别诊断(28.5%)上性能显著下降;(2)动态咨询通常不如静态评估,表明无效的信息收集策略显著损害下游诊断推理;(3)由LLM作为评判者评估的咨询质量与诊断准确性仅呈中等相关性,表明结构良好的提问本身并不能确保正确的诊断决策。我们发布LingxiDiag-16K和完整的评估框架,以支持可重复的研究,网址为:https://github.com/Lingxi-mental-health/LingxiDiagBench。

英文摘要

Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression--anxiety classification (up to 92.3%), performance deteriorates substantially for depression--anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

2602.07698 2026-06-12 cs.SE cs.CL Updated version

On Sequence-to-Sequence Models for Automated Log Parsing

关于自动化日志解析的序列到序列模型

Adam Sorrenti, Andriy Miranskyy

Institutions * Toronto Metropolitan University

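AI summary: This study systematically evaluates four sequence modelling architectures (Transformer, Mamba, monodirectional LSTM, and bidirectional LSTM) for automated log parsing. It finds that the Transformer performs best and that Mamba is competitive at lower computational cost, and it analyses the effects of representation choice, sequence length, and data efficiency.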

Details
Comments
Added a comparison with large language models
AI Chinese abstract

上下文:日志解析是软件系统中的关键标准操作流程,支持监控、异常检测和故障诊断。然而,由于日志格式异构、训练与部署数据之间的分布偏移以及基于规则的方法的脆弱性,自动化日志解析仍然具有挑战性。目标:本研究旨在系统评估序列建模架构、表示选择、序列长度和训练数据可用性如何影响自动化日志解析性能和计算成本。方法:我们进行了一项受控实证研究,比较了四种序列建模架构:Transformer、Mamba状态空间、单向LSTM和双向LSTM模型。总共在多个数据集配置下训练了396个模型,并使用相对Levenshtein编辑距离进行评估,同时进行统计显著性检验。结果:Transformer实现了最低的平均相对编辑距离(0.111),其次是Mamba(0.145)、单向LSTM(0.186)和双向LSTM(0.265),数值越低越好。Mamba在计算成本大幅降低的情况下提供了有竞争力的准确性。字符级分词通常能提升性能,序列长度对Transformer准确性的实际影响可忽略不计,Mamba和Transformer均表现出比循环模型更强的样本效率。结论:总体而言,Transformer将解析错误降低了23.4%,而Mamba在数据或计算受限的情况下是一个强有力的替代方案。这些结果还阐明了表示选择、序列长度和样本效率的作用,为研究人员和从业者提供了实用指导。

English abstract

Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribution shifts between training and deployment data, and the brittleness of rule-based approaches. Objectives: This study aims to systematically evaluate how sequence modelling architecture, representation choice, sequence length, and training data availability influence automated log parsing performance and computational cost. Methods: We conduct a controlled empirical study comparing four sequence modelling architectures: Transformer, Mamba state-space, monodirectional LSTM, and bidirectional LSTM models. In total, 396 models are trained across multiple dataset configurations and evaluated using relative Levenshtein edit distance with statistical significance testing. Results: Transformer achieves the lowest mean relative edit distance (0.111), followed by Mamba (0.145), mono-LSTM (0.186), and bi-LSTM (0.265), where lower values are better. Mamba provides competitive accuracy with substantially lower computational cost. Character-level tokenization generally improves performance, sequence length has negligible practical impact on Transformer accuracy, and both Mamba and Transformer demonstrate stronger sample efficiency than recurrent models. Conclusion: Overall, Transformers reduce parsing error by 23.4%, while Mamba is a strong alternative under data or compute constraints. These results also clarify the roles of representation choice, sequence length, and sample efficiency, providing practical guidance for researchers and practitioners.
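As an illustration of the metric named in the abstract, the following Python sketch computes a relative Levenshtein edit distance between a predicted and a reference log template. The abstract does not state how the distance is normalised, so dividing by the length of the longer string is an assumption, and the function names are hypothetical. The last helper checks that the reported 23.4% figure is consistent with comparing the Transformer's mean distance (0.111) against Mamba's (0.145); that pairing is an inference from the reported numbers, not something the abstract states.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ch_a == ch_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def relative_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1].

    Assumption: normalise by the longer string's length; the paper may
    use a different denominator.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(predicted, reference) / longest


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(relative_edit_distance("kitten", "sitting"))
    # Transformer (0.111) versus Mamba (0.145), as reported in the abstract.
    print(round(100 * relative_reduction(0.145, 0.111), 1))
```

Lower values mean the predicted template is closer to the reference, which matches the abstract's note that lower is better.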

2602.05121 2026-06-12 eess.SY cs.RO cs.SY Updated version

Trojan Attacks on Neural Network Controllers for Robotic Systems

针对机器人系统神经网络控制器的木马攻击

Farbod Younesi, Walter Lucia, Amr Youssef

Institutions * Concordia University; Concordia Institute for Information Systems Engineering; Fonds de recherche du Québec – Nature et Technologies; National Cybersecurity Consortium

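AI summary: For neural network controllers in robotic systems, the paper designs a lightweight parallel Trojan network that tampers with control commands under a specific trigger condition, and validates the attack's effectiveness in simulation.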

Details
Comments
Paper submitted to the 2026 IEEE Conference on Control Technology and Applications (CCTA)
AI Chinese abstract

神经网络控制器越来越多地应用于机器人系统中,用于轨迹跟踪和姿态稳定等任务。然而,它们对可能不可信的训练流程或供应链的依赖引入了显著的安全漏洞。本文以差速驱动移动机器人平台为案例,研究针对神经控制器的后门(木马)攻击。具体来说,假设机器人的跟踪控制器实现为神经网络,我们设计了一个轻量级的并行木马网络,可以嵌入到控制器中。该恶意模块在正常操作期间保持休眠,但在检测到由机器人姿态和目标参数定义的高度特定触发条件时,会破坏主控制器的轮速命令,导致不良且可能不安全的机器人行为。我们提供了所提出的木马网络的概念验证实现,并通过两种不同攻击场景下的仿真进行了验证。结果证实了所提出攻击的有效性,并表明基于神经网络的机器人控制系统面临潜在的关键安全威胁。

English abstract

Neural network controllers are increasingly deployed in robotic systems for tasks such as trajectory tracking and pose stabilization. However, their reliance on potentially untrusted training pipelines or supply chains introduces significant security vulnerabilities. This paper investigates backdoor (Trojan) attacks against neural controllers, using a differential-drive mobile robot platform as a case study. In particular, assuming that the robot's tracking controller is implemented as a neural network, we design a lightweight, parallel Trojan network that can be embedded within the controller. This malicious module remains dormant during normal operation but, upon detecting a highly specific trigger condition defined by the robot's pose and goal parameters, compromises the primary controller's wheel velocity commands, resulting in undesired and potentially unsafe robot behaviours. We provide a proof-of-concept implementation of the proposed Trojan network, which is validated through simulation under two different attack scenarios. The results confirm the effectiveness of the proposed attack and demonstrate that neural network-based robotic control systems are subject to potentially critical security threats.

2602.00343 2026-06-12 cs.DC cs.AI cs.PF Updated version

Standardized Methods and Recommendations for Green Federated Learning

绿色联邦学习的标准化方法与建议

Austin Tapp, Holger R. Roth, Ziyue Xu, Abhijeet Parida, Hareem Nisar, Marius George Linguraru

Institutions * Children’s National Hospital; NVIDIA; George Washington University

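AI summary: Proposes a carbon-accounting methodology for federated learning based on NVFlare and CodeCarbon, shows experimentally that system-level slowdowns and coordination effects can substantially increase carbon emissions, and stresses that standardized carbon accounting is necessary for reproducible green FL evaluation.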

Details
AI Chinese abstract

联邦学习(FL)能够在隐私敏感的分布式数据上进行协作模型训练,但由于不一致的测量边界和异构的报告方式,其环境影响难以跨研究进行比较。我们提出了一种实用的碳核算方法,用于FL的CO2e跟踪,使用NVIDIA NVFlare和CodeCarbon进行显式的、阶段感知的任务(初始化、每轮训练、评估和空闲/协调)。为了捕捉非计算效应,我们还根据网络可配置能量模型,从传输的模型更新大小估计通信排放。我们在两个代表性工作负载上验证了所提出的方法:CIFAR-10图像分类和视网膜视盘分割。在CIFAR-10中,受控的客户端效率场景表明,在原本固定的FL协议下,系统级慢速和协调效应可能对碳足迹产生显著影响,相对于高效率基线,总CO2e增加了8.34倍(中等)和21.73倍(低)。在视网膜分割中,交换GPU层级(H100 vs. V100)产生了1.7倍的运行时间差距(290 vs. 503分钟),同时在不同站点间总能量和CO2e的变化不均匀,强调了按站点和按轮报告的必要性。总体而言,我们的结果支持一种标准化的碳核算方法,作为可复现的“绿色”FL评估的前提。我们的代码可在以下网址获取:https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint。

English abstract

Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is difficult to compare across studies due to inconsistent measurement boundaries and heterogeneous reporting. We present a practical carbon-accounting methodology for FL CO2e tracking using NVIDIA NVFlare and CodeCarbon for explicit, phase-aware tasks (initialization, per-round training, evaluation, and idle/coordination). To capture non-compute effects, we additionally estimate communication emissions from transmitted model-update sizes under a network-configurable energy model. We validate the proposed approach on two representative workloads: CIFAR-10 image classification and retinal optic disk segmentation. In CIFAR-10, controlled client-efficiency scenarios show that system-level slowdowns and coordination effects can contribute meaningfully to carbon footprint under an otherwise fixed FL protocol, increasing total CO2e by 8.34x (medium) and 21.73x (low) relative to the high-efficiency baseline. In retinal segmentation, swapping GPU tiers (H100 vs. V100) yields a consistent 1.7x runtime gap (290 vs. 503 minutes) while producing non-uniform changes in total energy and CO2e across sites, underscoring the need for per-site and per-round reporting. Overall, our results support a standardized carbon accounting method that acts as a prerequisite for reproducible 'green' FL evaluation. Our code is available at https://github.com/Pediatric-Accelerated-Intelligence-Lab/carbon_footprint.
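The communication term described in the abstract can be sketched as a product of transferred volume, network energy intensity, and grid carbon intensity. This is a minimal sketch under stated assumptions, not the authors' implementation: it does not call NVFlare or CodeCarbon, and the class name, the function name, the factor of two for download plus upload, and all numeric values are hypothetical placeholders. The final lines check that the reported runtimes (290 vs. 503 minutes) are consistent with the stated 1.7x gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkEnergyModel:
    """Hypothetical, configurable network energy model."""

    kwh_per_gb: float            # energy used to move one gigabyte (assumed value)
    grid_kg_co2e_per_kwh: float  # carbon intensity of the grid (assumed value)


def communication_co2e_kg(
    update_bytes: int,
    num_clients: int,
    num_rounds: int,
    model: NetworkEnergyModel,
) -> float:
    """Estimate communication emissions in kg CO2e.

    Assumption: in every round each client downloads the global model and
    uploads an update of the same size, hence the factor of two.
    """
    total_bytes = 2 * update_bytes * num_clients * num_rounds
    total_gb = total_bytes / 1e9
    energy_kwh = total_gb * model.kwh_per_gb
    return energy_kwh * model.grid_kg_co2e_per_kwh


if __name__ == "__main__":
    # Placeholder values chosen only to make the arithmetic easy to follow.
    model = NetworkEnergyModel(kwh_per_gb=0.05, grid_kg_co2e_per_kwh=0.4)
    estimate = communication_co2e_kg(
        update_bytes=50_000_000, num_clients=4, num_rounds=10, model=model
    )
    print(f"communication emissions: {estimate:.3f} kg CO2e")

    # Runtime gap reported in the abstract: 503 minutes vs. 290 minutes.
    print(f"runtime ratio: {503 / 290:.2f}x")
```

Keeping the energy model as a separate object mirrors the abstract's "network-configurable" wording: the same transfer volume can be re-scored under different network or grid assumptions without touching the accounting logic.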