arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.24946 2026-05-11 cs.SE cs.LG

MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development

Moshood A. Fakorede, Krishna Upadhyay, A. B. Siddique, Umar Farooq

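AI summary: This paper introduces MobileDev-Bench, a benchmark for evaluating how well large language models resolve issues in mobile application development. It covers 407 issue-resolution tasks from 19 real-world mobile applications across Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs a developer-reported issue with executable test patches, so model-generated fixes can be validated automatically in mobile build environments. Experiments show that end-to-end resolution rates of current mainstream LLMs on this benchmark are far below those reported on existing benchmarks, underscoring the complexity and difficulty of issue resolution in mobile development.

Comments: 30 pages, 14 figures, 12 tables

Abstract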

Large language models (LLMs) have shown strong performance on automated software engineering tasks, yet existing benchmarks focus primarily on library-style repositories, leaving mobile application development largely unexplored despite its framework-specific build systems, heterogeneous artifact types, and coordinated multi-file fix requirements. We introduce MobileDev-Bench, a benchmark comprising 407 real-world issue-resolution tasks collected from 19 production mobile applications spanning Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Each task pairs a verified developer-reported issue with executable test patches, enabling fully automated validation of model-generated fixes within mobile build environments. The benchmark exhibits substantially greater patch complexity than prior benchmarks: fixes modify 12.9 files and 334.6 lines on average, and 41% of instances require coordinated changes across multiple artifact types, such as source, build configuration, and resource files. Evaluation of four frontier LLMs (Claude Sonnet 4.5, Qwen3-Coder, GPT-5.2, and Gemini 2.5 Flash) yields end-to-end resolution rates of only 3.23% - 4.23% under automated retrieval and at most 5.69% under oracle retrieval, well below resolution rates reported on existing benchmarks. We release MobileDev-Bench with task instances, an evaluation harness, and containerized environments to support reproducible research on AI-assisted mobile application development.

2603.24755 2026-05-11 cs.SE cs.AI cs.CL

SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Nicholas Roberts, Frederic Sala, Aws Albarghouthi

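AI summary: SlopCodeBench is a benchmark for evaluating how coding agents degrade over long-horizon iterative tasks. It contains 36 problems and 196 checkpoints and requires agents to repeatedly extend their own solutions. Unlike earlier iterative benchmarks, it demands architectural decisions from the agent while leaving internal structure free, which reflects changes in code quality more faithfully. The study finds that no tested agent fully solves any problem, and that code shows growing structural erosion and redundancy across iterations, indicating that current agents still suffer marked code-quality decline in long-running development tasks.

Comments: Code and leaderboards are located at https://www.scbench.ai

Abstract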

Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossible to faithfully measure how their decisions shape future extensions. We introduce SlopCodeBench, a benchmark of 36 problems and 196 checkpoints where agents repeatedly extend their own solutions. Unlike prior iterative benchmarks, our evolving specifications demand architectural decisions but leave internal structure to the agent. We measure two forms of degradation: structural erosion (concentrated complexity) and verbosity (redundant code). Evaluating 15 coding agents across open and closed models, we find that no agent fully solves any problem end-to-end, and the best agent passes 14.8% of checkpoints. Quality degrades across checkpoints, with structural erosion rising in 77% of trajectories and verbosity in 75.5%. Compared to 473 open-source Python repositories, agent code is 2.3x more verbose and 2.0x more eroded, and the human repositories degrade less often and by smaller margins across their git histories. Explicit quality guidance reduces initial verbosity and erosion by up to a third, without affecting degradation rates. SlopCodeBench provides the first measurement of code degradation under iterative extension, revealing that agents pass checkpoints while producing code that erodes and bloats with each turn.

2603.16025 2026-05-11 cond-mat.mes-hall cs.CV quant-ph

3D tomography of exchange phase in a Si/SiGe quantum dot device

Dylan Albrecht, Sarah Thompson, N. Tobias Jacobson, Ryan Jock

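AI summary: This paper addresses three-dimensional imaging of the exchange interaction in a Si/SiGe quantum dot device, with the aim of accurately extracting from experimental data how the exchange coupling coefficient $J(\mathbf{V})$ depends on gate voltages. To handle the difficulties posed by phase inversion and the integral inverse problem, the authors combine phase-shifting digital holography techniques with max-flow/min-cut phase unwrapping to reconstruct the accumulated phase volume in 3D voltage space. The method improves measurement robustness and provides a basis for systematic optimization of qubit control and for analyzing device performance.

Comments: 11 pages, 6 figures; updated acknowledgements

Abstract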

The exchange interaction is a foundational building block for the operation of spin-based quantum processors. Extracting the exchange interaction coefficient $J(\mathbf{V})$, as a function of gate electrode voltages, is important for understanding disorder, faithfully simulating device performance, and operating spin qubits with high fidelity. Typical coherent measurements of exchange in spin qubit devices yield a modulated cosine of an accumulated phase, which in turn is the time integral of exchange. As such, extracting $J(\mathbf{V})$ from experimental data is difficult due to the ambiguity of inverting a cosine, the sensitivity to noise when unwrapping phase, as well as the problem of inverting the integral. As a step toward obtaining $J(\mathbf{V})$, we tackle the first two challenges to reveal the accumulated phase, $ϕ(\mathbf{V})$. We incorporate techniques from a wide range of fields to robustly extract and model a 3D phase volume for spin qubit devices from a sequence of 2D measurements. In particular, we present a measurement technique to obtain the wrapped phase, as done in phase-shifting digital holography, and utilize the max-flow/min-cut phase unwrapping method (PUMA) to unwrap the phase in 3D voltage space. We show this method is robust to the minimal observed drift in the device, which we confirm by increasing scan resolution. Upon building a model of the extracted phase, we optimize over the model to locate a minimal-gradient $π$ exchange pulse point in voltage space. Our measurement protocol may provide detailed information useful for understanding the origins of device variability governing device yield, enable calibrating device models to specific devices during operation for more sophisticated error attribution, and enable a systematic optimization of qubit control. We anticipate that the methods presented here may be applicable to other qubit platforms.
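
The wrapped-phase step can be illustrated with the textbook four-step phase-shifting formula followed by a simple one-dimensional unwrap. This is a minimal sketch under stated assumptions: the signal model `offset + contrast * cos(phi + k*pi/2)`, the four shifts, and all function names are hypothetical, and the 1D unwrap is only a stand-in for the max-flow/min-cut method (PUMA) that the abstract names for 3D voltage space.

```python
import math

def wrapped_phase(i0, i1, i2, i3):
    """Four-step phase-shifting estimate of the wrapped phase.

    Assumes samples i_k = a + b*cos(phi + k*pi/2) for k = 0..3, so that
    i3 - i1 = 2*b*sin(phi) and i0 - i2 = 2*b*cos(phi).
    """
    return math.atan2(i3 - i1, i0 - i2)

def unwrap_1d(phases):
    """Remove 2*pi jumps along a 1D scan (simple stand-in, not PUMA)."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        # Shift delta into [-pi, pi) so neighbouring samples stay continuous.
        delta -= 2 * math.pi * math.floor((delta + math.pi) / (2 * math.pi))
        out.append(out[-1] + delta)
    return out

def simulate(phi, offset=0.5, contrast=0.4):
    """Synthetic measurements at the four phase shifts (hypothetical model)."""
    return [offset + contrast * math.cos(phi + k * math.pi / 2) for k in range(4)]

true_phase = [0.3 * n for n in range(40)]  # grows well past 2*pi
wrapped = [wrapped_phase(*simulate(p)) for p in true_phase]
recovered = unwrap_1d(wrapped)
```

The unwrap succeeds here because neighbouring samples differ by less than pi; noisy or undersampled scans are where graph-cut methods become relevant.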

2603.03096 2026-05-11 eess.AS cs.CL

Interpreting Speaker Characteristics in the Dimensions of Self-Supervised Speech Features

Kyle Janse van Rensburg, Benjamin van Niekerk, Herman Kamper

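AI summary: This study examines how speech models trained with self-supervised learning encode speaker characteristics in their feature representations. Applying principal component analysis (PCA) to utterance-averaged representations, it finds that the principal components carry speaker characteristics related to pitch and gender, while other components relate to intensity, noise level, formants, and similar properties. The study further shows that these characteristics are relatively independent across feature dimensions and that speech characteristics can be changed by adjusting the corresponding dimensions.

Comments: 5 pages, 7 figures, submitted to IEEE Signal Processing Letters

Abstract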

How do speech models trained through self-supervised learning structure their representations? Previous studies have looked at how information is encoded in feature vectors across different layers. But few studies have considered whether speech characteristics are captured within individual dimensions of SSL features. In this paper we specifically look at speaker information using PCA on utterance-averaged representations. For a range of SSL models, we find that the principal dimension that explains most variance encodes pitch and associated characteristics like gender. Other individual principal dimensions correlate with intensity, noise levels, the second formant, and higher frequency characteristics. We then use synthesis analyses to show that the dimensions for most characteristics are isolated from each other's influence. We further show that characteristics can be changed by manipulating the corresponding dimensions.

2602.16928 2026-05-11 cs.GT cs.AI cs.MA

Discovering Multiagent Learning Algorithms with Large Language Models

Zun Li, John Schultz, Daniel Hennes, Marc Lanctot

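AI summary: This study explores how large language models (LLMs) can automatically discover algorithms for multi-agent reinforcement learning (MARL), particularly in imperfect-information games. Using the AlphaEvolve framework, it searches the algorithm design space in two paradigms, counterfactual regret minimization (CFR) and policy-space response oracles (PSRO), and arrives at two strong algorithms, VAD-CFR and SHOR-PSRO. Further distillation yields the simplified variants WOP-CFR and PM-PSRO, which have simpler structure and generalize better, providing a clear methodology for algorithm discovery with LLMs.

Comments: More experiments and analysis on algorithmic distillation

Abstract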

Much of the advancement in Multi-Agent Reinforcement Learning (MARL) for imperfect-information games has historically depended on the manual, iterative refinement of algorithmic baselines. Recently, evolutionary coding agents powered by Large Language Models (LLMs) have emerged as powerful tools to automate this discovery process. In this work, we deploy one such agentic framework, AlphaEvolve, to navigate the design spaces of two distinct game-theoretic paradigms: counterfactual regret minimization (CFR) and policy-space response oracles (PSRO). This automated search yielded two algorithms: Volatility-Adaptive Discounted (VAD-) CFR and Smoothed Hybrid Optimistic Regret (SHOR-) PSRO, which are consistently competitive with state-of-the-art human-designed baselines across an 18-game evaluation suite spanning Poker, Goofspiel, Liar's Dice, Blotto, and Battleship variants. However, because the LLM optimizes for fitness on a specific training set, it often constructs highly synergistic, complex mechanisms tailored to those environments. Through systematic ablation studies, we demonstrate that while these mechanisms are tightly coupled, the true driver of generalization lies in a minimal algorithmic core. By distilling the LLM's discoveries down to their most fundamental principles, we produce two minimal solvers: Warm-started Optimistic Predictive (WOP-)CFR and Projection Matching (PM-)PSRO. These distilled versions achieve superior performance on generalization with greatly reduced structural complexity, providing a clear methodology for using LLMs in algorithmic discovery.

2602.15189 2026-05-11 cs.IR cs.AI cs.CL

ScrapeGraphAI-100k: Dataset for Schema-Constrained LLM Generation

William Brach, Francesco Zuppichini, Marco Vinciguerra, Lorenzo Padoan

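AI summary: This study presents ScrapeGraphAI-100k, a dataset supporting structured generation by large language models under a specified JSON Schema. It contains 93,695 deduplicated and balanced structured extraction instances covering more than 18,000 distinct schemas and 15 languages; each instance includes page content, a prompt, a schema, and a model response. The study also analyzes how schema complexity affects generation quality and uses fine-tuning experiments to show the dataset's usefulness for training and evaluating structured generation models.

Abstract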

Producing output that conforms to a specified JSON schema underlies tool use, structured extraction, and knowledge base construction in modern large language models. Despite this centrality, public datasets for the task remain small, synthetic, or text-only, and rarely pair real page content with the prompts and schemas used in practice. We introduce ScrapeGraphAI-100k, 93,695 schema-constrained extraction events collected via opt-in ScrapeGraphAI telemetry in Q2-Q3 2025, deduplicated and balanced by schema from 9M raw events. The corpus spans 18,000+ unique schemas across 15 named languages plus a long-tail Other category, with English and Traditional Chinese covering 88% of detected content. Each instance pairs Markdown-converted page content with a prompt, schema, LLM response, and per-example jsonschema-rs structural conformance labels (semantic correctness is out of scope, and raw HTML is deferred beyond v1.0). We characterize structural diversity across the corpus and identify sharp failure thresholds as schema complexity grows. As a case study, a 1.7B student fine-tuned on this data closely tracks the output distribution of its GPT-5-nano teacher, though it still trails a 30B-A3B reference (3.3B active parameters) on schema compliance. We offer this distillation result as preliminary evidence that grounding schema-constrained generation in real practitioner workloads at scale enables training and benchmarking that prior synthetic or text-only corpora could not support.

2602.10024 2026-05-11 cs.IR cs.CL

Overview of the TREC 2025 RAGTIME Track

Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Eugene Yang, Andrew Yates

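AI summary: The TREC 2025 RAGTIME track studies report generation from multilingual source documents and provides a document collection of Arabic, Chinese, English, and Russian news. The track comprises three tasks (multilingual report generation, English report generation, and multilingual information retrieval) and drew 125 submitted runs from 13 teams. This paper gives an overview of the three tasks and presents the results.

Comments: 14 pages, 3 figures, final version of the RAGTIME 2025 overview paper

Abstract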

The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) for three tasks. This overview describes these three tasks and presents the available results.

2602.09457 2026-05-11 stat.ML cs.DS cs.LG

From Average Sensitivity to Small-Loss Regret Bounds under Random-Order Model

Shinsaku Sakaue, Yuichi Yoshida

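AI summary: This paper studies online learning in the random-order model, where the set of loss functions is chosen by an adversary but presented in random order. Extending an existing batch-to-online transformation, the authors propose a new analysis framework that converts an offline algorithm's approximation guarantee, average sensitivity, and stability into small-loss regret bounds in the online setting. The approach applies to a range of problems, including online clustering and low-rank approximation, and yields concrete results for submodular function minimization and $\ell_1$ regression, showing that sparsification techniques can achieve small-loss regret bounds without structural assumptions on the loss functions.

Abstract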

We study online learning in the random-order model, where the multiset of loss functions is chosen adversarially but revealed in a uniformly random order. By extending the batch-to-online transformation of Dong and Yoshida (2023), we show that if an offline algorithm enjoys a $(1+\varepsilon)$-approximation guarantee, an average sensitivity bound controlled by a function $φ(\varepsilon)$, and stability with respect to $\varepsilon$, then we can obtain a small-loss regret bound typically of order $\tilde O(φ^{\star}(\mathrm{OPT}_T))$, where $φ^{\star}$ is the concave conjugate of $φ$, $\mathrm{OPT}_T$ is the offline optimum over $T$ rounds, and $\tilde O$ hides polylogarithmic factors in $T$. Our result refines their original $(1+\varepsilon)$-approximate regret guarantee and applies to a broad class of problems, including online $k$-means clustering and online low-rank approximation. We further apply our approach to online submodular function minimization using $(1\pm\varepsilon)$-cut sparsifiers of submodular hypergraphs, obtaining a small-loss regret bound of $\tilde O(n^3 + n^{3/4}\mathrm{OPT}_T^{3/4})$, where $n$ is the ground-set size; we also demonstrate its applicability to online $\ell_1$ regression. Our work sheds light on the power of sparsification and related algorithmic techniques in achieving small-loss regret bounds in the random-order model, without requiring structural assumptions on loss functions, such as linearity or smoothness.

2602.09034 2026-05-11 q-bio.NC cs.AI

Latent-Space Causal Discovery from Indirect Neuroimaging Observations

Sangyoon Bae, Miruna Oprescu, David Keetae Park, Shinjae Yoo, Jiook Cha

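AI summary: This study aims to discover causal relationships in latent space from indirect neuroimaging observations, where hemodynamics and volume conduction distort the signals. It sets out a conditional framework based on modality physics and nonstationary latent dynamics and derives an upper bound on inversion-error propagation. On this basis, the authors design INCAMA, which combines physics-aware inversion with a delay-aware Mamba encoder and uses mechanism shifts to improve estimation of causal graph structure. In simulations the method recovers directed structure markedly better than existing baselines, and on real motor-task fMRI data its estimates concentrate in canonical visuo-motor pathways.

Comments: 9 pages, 2 figures

Abstract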

Neuroimaging does not observe causal variables directly: hemodynamics and volume conduction distort signals so that statistical dependence need not reflect latent neural influence. Before estimating graphs, one must specify under what assumptions delayed directed structure can be studied from such indirect observations. We formalize a conditional setting - recoverable inversion under modality physics together with nonstationary latent dynamics - and derive an inversion-error propagation bound under explicit assumptions. Building on this framing, we propose INCAMA (INdirect CAusal MAmba): physics-aware inversion coupled with a delay-aware Mamba encoder that uses mechanism shifts as informative variation for directed graph scoring. We use controlled simulations for quantitative validation and HCP motor-task fMRI as a zero-shot external transfer check based on anatomical and task-network consistency. Across TVB simulations, INCAMA improves directed-structure recovery by 2-3x in F1 over observation-space and two-stage baselines, and on HCP motor-task fMRI it produces sparse directed estimates concentrated in canonical visuo-motor pathways.

2602.01621 2026-05-11 cs.CR cs.LG

CGF-Softmax: A Cumulant-Based Softmax Reformulation for Efficient Inference under Homomorphic Encryption

Hanjun Park, Byeongseo Min, Jiheon Woo, Min-Wook Jeong, Jongho Shin, Yongwoo Lee, Young-Sik Kim, Yongjune Kim

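AI summary: Homomorphic encryption (HE) provides an important framework for privacy-preserving machine learning, but efficiently evaluating softmax, a key component of transformer-based models, remains challenging under HE. This paper proposes CGF-Softmax, which reformulates the softmax denominator through the cumulant generating function (CGF), eliminating homomorphic division and explicit maximum subtraction and thereby greatly reducing multiplicative depth while preserving the core properties of softmax. Experiments show that on Vision Transformers and large language models the method reaches inference accuracy close to that of high-depth exact methods at substantially lower computational cost.

Abstract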

Homomorphic encryption (HE) is a prominent framework for privacy-preserving machine learning, enabling inference directly on encrypted data. However, evaluating softmax, a core component of transformer architectures, remains particularly challenging in HE due to its multivariate structure, the large dynamic range induced by exponential functions, and the costly division operation. In this paper, we propose CGF-softmax, which reformulates the softmax denominator through the cumulant generating function (CGF). By eliminating both homomorphic division and explicit maximum subtraction, this reformulation substantially reduces multiplicative depth while preserving key properties of softmax. Extensive experiments on Vision Transformers and large language models show that CGF-softmax provides an efficient and accurate approximation of softmax in encrypted inference. In particular, it achieves inference accuracy close to that of high-depth exact methods, while requiring substantially lower computational cost through reduced multiplicative depth.
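
One way to see how a cumulant generating function can stand in for the softmax denominator is the identity log Σ_j exp(x_j) = log n + K(1), where K(t) = log E[exp(tX)] is the CGF of a variable X drawn uniformly from the n logits. The plaintext sketch below checks that identity and then tries a second-order truncation of the cumulant series. The function names and the truncation are assumptions made for illustration; the abstract does not say how the paper evaluates the CGF under encryption.

```python
import math

def softmax_exact(x):
    """Reference softmax with the usual max subtraction and division."""
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def cgf_at_one(x):
    """K(1) = log E[exp(X)] for X uniform on the logits (exact, plaintext)."""
    return math.log(sum(math.exp(v) for v in x) / len(x))

def softmax_via_cgf(x, k1):
    """softmax_i = exp(x_i - log n - K(1)): one exponential per entry."""
    shift = math.log(len(x)) + k1
    return [math.exp(v - shift) for v in x]

def cgf_second_order(x):
    """Truncated cumulant series K(1) ~ mean + variance / 2 (illustrative)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return mean + var / 2

logits = [0.2, -0.1, 0.4, 0.0]
exact = softmax_exact(logits)
via_cgf = softmax_via_cgf(logits, cgf_at_one(logits))
approx = softmax_via_cgf(logits, cgf_second_order(logits))
```

With the exact K(1) the two forms agree to rounding error. The truncated version is close only when the logits are tightly spread, so it should be read as a toy approximation rather than a usable scheme.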

2602.00716 2026-05-11 stat.ML cond-mat.dis-nn cs.LG

Emergence of Distortions in High-Dimensional Guided Diffusion Models

Enrico Ventura, Beatrice Achilli, Luca Ambrogioni, Carlo Lucibello

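AI summary: This paper studies how classifier-free guidance (CFG) distorts generated samples in high-dimensional guided diffusion models. Using tools from statistical physics, the authors analyze the mismatch between the CFG sampling distribution and the true conditional distribution and, in analytically tractable settings, characterize how data dimensionality and the number of classes affect the degree of distortion. They find that in high-dimensional Gaussian mixtures distortions arise when the number of classes grows exponentially with the data dimension, whereas under sub-exponential growth they vanish because of a dynamical phase transition. The authors also propose a new guidance schedule that improves class separability and sample diversity.

Comments: 41 pages, 21 figures

Abstract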

Classifier-free guidance (CFG) is the de facto standard for conditional sampling in diffusion models, yet it often reduces sample diversity. Using tools from statistical physics, we analyze the emergence of generative distortions induced by CFG, namely the mismatch between the CFG sampling distribution and the true conditional distribution. We study this phenomenon in analytically tractable settings with exact score functions, characterizing its dependence on data dimensionality and the number of classes. For high-dimensional Gaussian mixtures, we use dynamic mean-field theory to show that distortions arise when the number of classes scales exponentially with the data dimension, whereas they vanish in the sub-exponential regime due to a dynamical phase transition. We further prove that, in the infinite-class limit, distortions remain unavoidable regardless of dimensionality because of the increasing density of classes. Finally, we show that standard CFG schedules cannot prevent variance shrinkage, and we propose a theoretically grounded guidance schedule incorporating a negative-guidance window that improves both class separability and sample diversity in real-world latent diffusion models.
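
The mismatch can be seen in a toy setting with exact scores. The sketch below uses the common CFG combination (unconditional score plus `w` times the difference between conditional and unconditional scores) on a one-dimensional, two-class Gaussian mixture and finds where the guided score vanishes. Everything here is an illustrative assumption: the means, the guidance scale, and the static single-noise-level view are not the paper's high-dimensional dynamical analysis.

```python
import math

MU = 1.5  # hypothetical class means at +MU and -MU, unit variance, equal priors

def score_conditional(x, mean):
    """Score of a unit-variance Gaussian centred at `mean`."""
    return -(x - mean)

def score_unconditional(x):
    """Score of the mixture: posterior-weighted average of the class scores."""
    return -x + MU * math.tanh(MU * x)

def score_cfg(x, mean, w):
    """Guided score: unconditional + w * (conditional - unconditional)."""
    s_u = score_unconditional(x)
    return s_u + w * (score_conditional(x, mean) - s_u)

def stationary_point(mean, w, lo=-10.0, hi=10.0):
    """Bisection for the zero of the guided score (decreasing in x for w >= 1)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if score_cfg(mid, mean, w) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

peak_plain = stationary_point(MU, w=1.0)   # coincides with the class mean
peak_guided = stationary_point(MU, w=3.0)  # pushed away from the other class
```

At `w = 1` the guided score is the true conditional score, so the zero sits at the class mean. For `w > 1` the zero moves outward, a small-scale picture of the guided distribution departing from the true conditional one.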

2602.00474 2026-05-11 stat.ML cs.LG cs.NA math.NA

Persistent-Transient Policy Evaluation for Markov Chains via Minimal Peripheral Quotients

Yang Xu, Vaneet Aggarwal

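AI summary: This paper studies fixed-policy evaluation for finite Markov chains, in particular chains that may be reducible and periodic. Classical gain and bias decompositions cannot cleanly separate persistent behavior from transient effects. By identifying the real peripheral invariant subspace of the transition matrix, the paper proposes a minimal peripheral quotient decomposition that removes non-decaying modes so that the remaining dynamics are strictly stable. The method decomposes the reward uniquely into a persistent-mode part and a transient part, reconstructs finite-horizon returns exactly, and yields a stable estimator under a generative model.

Abstract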

We study fixed-policy evaluation for finite Markov chains that may be reducible and periodic. Classical evaluation methods with gain and bias decomposition are not always diagnostic: the gain records only invariant Cesàro averages, while persistent phase-dependent behavior is absorbed into the bias together with genuinely transient effects. We identify the real peripheral invariant subspace $\mathcal{K}(P)$ of the transition matrix $P$ as the source of this ambiguity. Quotienting by $\mathcal{K}(P)$ is the minimal exact quotient that removes all non-decaying modes and makes the remaining dynamics strictly stable. After choosing a gauge projection $Π$ with kernel $\mathcal{K}(P)$, the reward admits a unique decomposition $r = g_Π^\star + (I-P)v_Π^\star$, where $g_Π^\star$ is a persistent regime profile and $v_Π^\star$ is a gauge-fixed transient component. An exact comparison with classical normalized gain and bias shows that the new pair reallocates the same information so that all persistent modes are represented in $g_Π^\star$ and $v_Π^\star$ is transient. This decomposition reconstructs finite-horizon returns, recovers statewise average reward, admits a transient-cost interpretation, and yields a stable estimator under a generative model.
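
A two-state periodic chain shows the ambiguity the abstract describes. In the sketch below the long-run average reward is 0.5 from either state, yet the gap between the finite-horizon return and `t * gain` never settles: it keeps alternating with the phase. The chain, rewards, and function name are hypothetical, and the code computes only classical finite-horizon returns, not the paper's quotient decomposition.

```python
def finite_horizon_returns(P, r, horizon):
    """Return [V_1, ..., V_T], where V_t[s] is the expected t-step reward sum from s."""
    n = len(r)
    v = [0.0] * n
    history = []
    for _ in range(horizon):
        v = [r[s] + sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]
        history.append(v)
    return history

P = [[0.0, 1.0], [1.0, 0.0]]  # deterministic period-2 chain
r = [1.0, 0.0]                # reward collected in state 0 only
gain = 0.5                    # long-run average reward, same for both states

returns = finite_horizon_returns(P, r, horizon=6)
# Residual after removing the gain contribution, starting from state 0.
residual_state0 = [v[0] - (t + 1) * gain for t, v in enumerate(returns)]
```

The residual alternates between 0.5 and 0.0 forever, so it is persistent rather than transient, which is the kind of non-decaying mode the abstract says the classical bias absorbs.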

2601.22246 2026-05-11 cs.CR cs.AI

MirrorMark: Generalizable Mirrored Sampling for Multi-bit LLM Watermarking

Ya Jiang, Massieh Kordi Boroujeny, Surender Suresh Kumar, Kai Zeng

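AI summary: As large language models play a growing role in applications such as question answering and content generation, reliable content attribution has become essential. This paper proposes MirrorMark, a multi-bit LLM watermarking method whose core idea is to separate the symbol mapping rule from the base watermarking sampler and to embed multiple bits by mapping each symbol to a mod-1 mirroring transformation of a pseudorandom object that the detector can reproduce. The method preserves the quality of generated text while achieving strong watermark detectability and bit accuracy, and it introduces a Context-Anchored Balanced Scheduler to support practical payload embedding.

Abstract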

As large language models (LLMs) become integral to applications such as question answering and content creation, reliable content attribution has become increasingly important. Watermarking is a promising approach, but most existing methods either provide only binary signals or achieve multi-bit embedding by distorting the generation distribution. We propose MirrorMark, a generalizable mapping-centric approach for multi-bit LLM watermarking. MirrorMark separates the symbol mapping rule from the base watermarking sampler and maps each symbol to a mod-1 mirroring transformation of a detector-reproducible pseudorandom object, such as sampling values or permutation ranks. A binary-tokenizer analysis shows that complementary mappings yield larger matched-mismatched score gaps than independent-key or shift-based mappings. When composed with a distortion-free base sampler, MirrorMark preserves the token probability distribution by design and maintains text quality in practice. To support practical payload embedding, we introduce a Context-Anchored Balanced Scheduler (CABS), which balances token assignments across message positions while localizing edit effects. We further provide theoretical EER analyses for two representative sampler instantiations. Experiments show that MirrorMark achieves strong detectability and bit accuracy while maintaining text quality comparable to non-watermarked generation.

2601.21839 2026-05-11 cs.CY cs.AI cs.GT cs.LG

Test-Time Compute Games

Ander Artola Velasco, Dimitrios Rontogiannis, Stratis Tsirtsis, Manuel Gomez-Rodriguez

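AI summary: This paper studies the market-efficiency problem created by test-time compute in large language models (LLMs), pointing out that cloud providers may overuse compute to raise revenue even though this does little to improve output quality. To address this, the authors propose a reverse second-price auction mechanism in which providers bid their price and expected quality and users pay according to the winning bidder's marginal value, improving market efficiency. Experiments on math and science benchmark datasets illustrate the mechanism in practice.

Abstract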

Test-time compute has emerged as a promising strategy to enhance the reasoning abilities of large language models (LLMs). However, this strategy has in turn increased how much users pay cloud-based providers offering LLM-as-a-service, since providers charge users for the amount of test-time compute they use to generate an output. In our work, we show that the market of LLM-as-a-service is socially inefficient: providers have a financial incentive to increase the amount of test-time compute, even if this increase contributes little to the quality of the outputs. To address this inefficiency, we introduce a reverse second-price auction mechanism where providers bid their offered price and (expected) quality for the opportunity to serve a user, and users pay proportionally to the marginal value generated by the winning provider relative to the second-highest bidder. To illustrate and complement our theoretical results, we conduct experiments with multiple instruct models from the $\texttt{Llama}$ and $\texttt{Qwen}$ families, as well as reasoning models distilled from $\texttt{DeepSeek-R1}$, on math and science benchmark datasets.

2601.16130 2026-05-11 cs.HC cs.AI

Replicating Human Motivated Reasoning Studies with LLMs

Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel C. Molden, Gourab Ghoshal, Ehsan Hoque

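AI summary: This study investigates whether large language models (LLMs) exhibit motivated reasoning similar to that of humans. Replicating four earlier experiments on political motivated reasoning, it finds that base LLM behavior differs from expected human behavior, and that different models behave similarly in some respects, such as abstaining from answering questions and incorporating provided arguments into opinions. The results suggest that base LLMs may not emulate human motivated reasoning, which matters for research that relies on LLMs for opinion replication and argument assessment.

Abstract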

Motivated reasoning - the idea that individuals processing information may be motivated to either arrive at accurate beliefs or arrive at desired conclusions - has been well-explored as a human phenomenon. However, it remains unclear whether base LLMs are affected by motivational manipulations. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as when selecting to abstain from question answering and incorporating provided arguments into opinions. The results suggest that base LLMs may not emulate human motivated reasoning processes. We emphasize the importance of these findings for researchers using LLMs for certain tasks such as opinion replication and argument assessment.

2601.15356 2026-05-11 eess.IV cs.AI

Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

Xiang Li, Xueheng Li, Yu Wang, Xuanhua He, Zhangchi Hu, Weiwei Yu, Chengjun Xie

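AI summary: This paper proposes Q-Probe, an agentic image quality assessment framework aimed at the difficulty of capturing local degradation details when assessing high-resolution images. With a context-aware probing mechanism and a staged training strategy, Q-Probe avoids the "cropping-implies-degradation" bias of existing methods and assesses quality more accurately in high-resolution settings. The study also builds Vista-Bench, the first benchmark dedicated to fine-grained degradation analysis at high resolution, and reports markedly improved performance across resolutions.

Abstract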

Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging "Thinking with Images" paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious "cropping-implies-degradation" biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.

2601.02602 2026-05-11 cs.CR cs.LG

SWaRL: Safeguard Code Watermarking via Reinforcement Learning

Neusha Javidnia, Ruisi Zhang, Ashish Kundu, Farinaz Koushanfar

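AI summary: This paper proposes SWaRL, a code watermarking framework that protects the intellectual property of code LLMs by embedding unique, verifiable signatures in generated programs. It uses a reinforcement-learning-based co-training framework that draws on compiler feedback to ensure functional correctness and uses a jointly trained confidential verifier as the reward signal to keep the watermark detectable. Experiments show that, while keeping code functionality intact, SWaRL achieves higher watermark detection accuracy than existing methods and is robust to refactoring and adversarial attacks.

Comments: Preprint

Abstract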

We present SWaRL, a robust and fidelity-preserving watermarking framework designed to protect the intellectual property of code LLMs by embedding unique and verifiable signatures in the generated program. Existing watermarking approaches either rely on handcrafted code transformations or manipulate token generation probabilities at inference time, making them vulnerable to removal attacks or prone to breaking functional correctness. To address these challenges, SWaRL employs a reinforcement learning-based co-training framework that uses compiler feedback for functional correctness and a jointly trained confidential verifier as a reward signal to maintain watermark detectability. Furthermore, SWaRL employs low-rank adaptation (LoRA) during fine-tuning, enabling efficient integration of watermarking behavior and transferability across model updates. Extensive experiments show that SWaRL achieves strong watermark detection accuracy compared to prior methods while fully maintaining watermarked code functionality. Moreover, SWaRL exhibits strong resilience against refactoring and adversarial transformation attacks, maintaining reliable attribution without substantial computational overhead.

2512.23927 2026-05-11 stat.ML cs.LG

Stationary Reweighting Yields Local Convergence of Soft Fitted Q-Iteration

Lars van der Laan, Nathan Kallus

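AI summary: This paper studies the stability mechanism of soft fitted Q-iteration (soft FQI) without Bellman completeness and proposes a stability analysis based on local stationary norm alignment. Analyzing the behavior of the soft Bellman operator near the soft-optimal fixed point, the authors find that it contracts in the stationary state-action norm and accordingly design stationary-reweighted soft FQI, which achieves finite-sample local linear convergence. The study also shows that ordinary soft FQI is locally stable under on-policy stationary sampling and explains temperature annealing as a continuation strategy for reaching the contraction region.

Abstract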

Fitted $Q$-iteration (FQI) and soft FQI are widely used value-based methods for offline reinforcement learning, but their standard stability guarantees often depend on Bellman completeness, a strong closure condition that can fail under function approximation. We analyze soft FQI without Bellman completeness and identify the stability mechanism that replaces it: local stationary norm alignment. Near the soft-optimal fixed point, the soft Bellman operator has the same first-order behavior as the policy-evaluation operator for the soft-optimal policy. This operator contracts in the policy's stationary state-action norm, whereas standard fitted regression projects Bellman targets in the behavior norm. This mismatch explains instability under distribution shift. We use this insight to develop stationary-reweighted soft FQI, which reweights each regression step toward the stationary distribution of the current softmax policy. Under approximate realizability and controlled weighting error, we prove finite-sample local linear convergence to the projected fixed point, separating statistical error from geometrically damped weight-estimation error. Our results also show that ordinary soft FQI is locally stable under on-policy stationary sampling, even without Bellman completeness, and explain temperature annealing as a continuation strategy for reaching a contraction region.
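
For readers unfamiliar with the soft Bellman operator, the sketch below runs exact tabular soft Q-iteration on a tiny deterministic MDP, using the standard soft value `tau * log sum_a exp(Q(s, a) / tau)` and the matching softmax policy. The MDP, the temperature, and the function names are hypothetical. In this exact tabular form the update is a sup-norm contraction; the abstract concerns the fitted, function-approximation case, where that guarantee is not available and reweighting comes in.

```python
import math

def soft_value(q_row, tau):
    """tau * log sum_a exp(q / tau), computed with max subtraction for stability."""
    m = max(q_row)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_row))

def softmax_policy(q_row, tau):
    """Action probabilities proportional to exp(q / tau)."""
    m = max(q_row)
    w = [math.exp((q - m) / tau) for q in q_row]
    z = sum(w)
    return [x / z for x in w]

def soft_backup(q, rewards, next_state, gamma, tau):
    """One exact soft Bellman update on a deterministic tabular MDP."""
    return [[rewards[s][a] + gamma * soft_value(q[next_state[s][a]], tau)
             for a in range(len(q[s]))] for s in range(len(q))]

rewards = [[0.0, 1.0], [0.5, 0.0]]   # rewards[s][a]
next_state = [[0, 1], [0, 1]]        # next_state[s][a]
q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(500):
    q = soft_backup(q, rewards, next_state, gamma=0.9, tau=0.1)
policy = [softmax_policy(row, 0.1) for row in q]
```

Lowering `tau` moves the soft value toward the hard maximum, which is the knob that temperature annealing turns.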

2512.23805 2026-05-11 stat.ML cs.LG

Fitted $Q$ Evaluation Without Bellman Completeness via Stationary Weighting

Lars van der Laan, Nathan Kallus

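AI summary: This paper studies fitted Q-evaluation (FQE) that does not rely on Bellman completeness. By weighting the regression step with the target policy's stationary state-action distribution, it replaces the behavior-distribution-norm projection of standard FQE. While keeping the modular supervised-learning form, the method aligns the fitted projection with the norm in which the operator contracts, the $L^2$ norm induced by the target policy. It achieves finite-sample linear convergence to the stationary projected Bellman fixed point and separates iteration, statistical, approximation, and weight-estimation errors. Experiments show that the method can stabilize FQE and reduce value-estimation error.

Abstract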

Fitted $Q$-evaluation (FQE) is a standard regression-based tool for off-policy evaluation, but existing stability guarantees often rely on Bellman completeness, a strong closure condition that can fail under function approximation. We study an alternative route: changing the norm used in the regression step. The policy-evaluation Bellman operator is contractive in the $L^2$ norm induced by the target policy's stationary state-action distribution, whereas standard off-policy FQE projects Bellman targets in the behavior-distribution norm. We propose stationary-weighted FQE, which reweights each Bellman regression by the stationary target-to-behavior density ratio. The method preserves FQE's modular supervised-learning form while aligning the fitted projection with that contractive norm. We prove finite-sample linear convergence to the stationary projected Bellman fixed point under misspecification, without requiring Bellman completeness. The bound separates finite-iteration, statistical, approximation, and weight-estimation errors, and shows that ratio-estimation error is attenuated when the inherent Bellman error is small. Controlled experiments show that stationary weighting can stabilize FQE and reduce value error when behavior-norm regression overemphasizes regions rarely visited by the target policy.
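
The effect of changing the regression norm can be seen in a single weighted least-squares step. The sketch below fits a one-parameter model to two targets that it cannot represent exactly, once with uniform weights and once with weights standing in for a stationary target-to-behavior density ratio. The data, the ratios, and the function name are invented for illustration, and ratio estimation, which the abstract treats as a separate error source, is not modelled.

```python
def weighted_slope(xs, ys, weights):
    """Minimise sum_i w_i * (theta * x_i - y_i)^2 over the scalar theta."""
    num = sum(w * x * y for x, y, w in zip(xs, ys, weights))
    den = sum(w * x * x for x, w in zip(xs, weights))
    return num / den

# Hypothetical one-feature data: the targets are not linear in x,
# so the function class theta * x is misspecified.
features = [1.0, 2.0]
bellman_targets = [1.0, 4.0]

# Uniform weights correspond to regression in the behavior-distribution norm.
behavior_fit = weighted_slope(features, bellman_targets, [1.0, 1.0])
# Assumed density ratios: the target policy visits the first point far more often.
stationary_fit = weighted_slope(features, bellman_targets, [4.0, 0.25])
```

Under misspecification the two projections differ (1.8 versus 1.2 here), and the reweighted one fits best where the target policy actually spends its time.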

2512.23694 2026-05-11 stat.ML cs.LG econ.EM

Bellman Calibration for $V$-Learning in Offline Reinforcement Learning

Lars van der Laan, Nathan Kallus

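AI summary Reliable long-horizon value prediction is challenging in offline reinforcement learning because fitted value methods involve bootstrapping, function approximation, and distribution shift, while standard guarantees usually require Bellman completeness or realizability. The paper proposes Bellman calibration, a weaker reliability criterion requiring that states with similar predicted values have consistent average Bellman targets, and builds on it Iterated Bellman Calibration, which post-hoc calibrates a value predictor by fitting a one-dimensional map of its original prediction. Without Bellman completeness or value-function realizability, the method guarantees in finite samples that calibration error is controlled at one-dimensional nonparametric rates, and it decomposes value error into statistical estimation, finite-iteration, and approximation errors, clarifying when calibration improves prediction.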

Abstract

Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Bellman completeness or realizability. We introduce Bellman calibration, a weak reliability criterion requiring that states assigned similar predicted values have average Bellman targets that agree with those predictions. This criterion yields a scalar calibration error for diagnosing systematic numerical miscalibration, which we estimate from off-policy data using doubly robust Bellman target estimates. We then propose Iterated Bellman Calibration, a model-agnostic post-hoc procedure that recalibrates any learned value predictor by fitting a one-dimensional map of its original prediction, with histogram and isotonic variants. We prove finite-sample guarantees showing that Bellman calibration error is controlled at one-dimensional nonparametric rates without Bellman completeness or value-function realizability. Our value-error bounds separate statistical estimation, finite-iteration, and approximation errors, clarifying when calibration improves value prediction and when its gains are limited by the information in the original predictor or insufficient coverage.
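
One plausible reading of the histogram variant is sketched below. It is an assumption-laden illustration rather than the paper's procedure: it uses plain sampled targets `r + gamma * v(s')` in place of the doubly robust Bellman target estimates mentioned in the abstract, it assumes on-policy samples, and the binning scheme and all names are hypothetical.

```python
def histogram_bellman_calibration(v_pred, rewards, next_idx, gamma, n_bins=5, n_iters=20):
    """Recalibrate value predictions by iterating a histogram fit on Bellman targets.

    v_pred[i]: original predicted value of sampled state i.
    rewards[i], next_idx[i]: observed reward and index of the sampled next state.
    Returns calibrated values, constant within each bin of the original prediction.
    """
    lo, hi = min(v_pred), max(v_pred)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all predictions coincide
    bins = [min(int((v - lo) / width), n_bins - 1) for v in v_pred]
    v_cal = list(v_pred)
    for _ in range(n_iters):
        sums = [0.0] * n_bins
        counts = [0] * n_bins
        for i, b in enumerate(bins):
            sums[b] += rewards[i] + gamma * v_cal[next_idx[i]]
            counts[b] += 1
        # The calibrated value of a bin is the average Bellman target inside it.
        means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
        v_cal = [means[b] for b in bins]
    return v_cal
```

The output is a one-dimensional map of the original prediction: states that fall in the same bin of `v_pred` always receive the same calibrated value.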

2512.09682 2026-05-11 eess.SY cs.AI cs.GT cs.MA cs.SY

Dynamic one-time delivery of critical data by small and sparse UAV swarms: a model problem for MARL scaling studies

Mika Persson, Jonas Lidman, Jacob Ljungberg, Samuel Sandelius, Adam Andersson

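AI summary This paper studies multi-agent reinforcement learning (MARL) for UAV swarms that cooperatively relay a critical data package, addressing one-time data delivery by small, sparse swarms in dynamic environments. It introduces a family of deterministic games as a model problem for MARL scaling studies and proposes a robust baseline policy that restricts UAV motion and applies Dijkstra's shortest-path algorithm. Experiments show that two off-the-shelf MARL algorithms perform close to the baseline in small scenarios but face scalability challenges as the number of agents grows.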

Comments Accepted to the 2026 IFAC World Congress

Abstract

This work studies the application of Multi-Agent Reinforcement Learning (MARL) to decentralized control of unmanned aerial vehicles to relay a critical data package to a known position. For this purpose, a family of deterministic games is introduced, designed for MARL scaling studies. A robust baseline policy is proposed which restricts agent motion and applies Dijkstra's shortest path algorithm. Computational experiment results show that two off-the-shelf MARL algorithms perform competitively with the baseline for a small number of agents, but face scalability issues as the number of agents increases. Source code and animations are available online at https://github.com/mikapersson/Information-Relaying.
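
For readers unfamiliar with the shortest-path component of the baseline, here is a generic sketch of Dijkstra's algorithm. The adjacency-list graph encoding is an assumption made for illustration; how the paper builds its graph from restricted agent motion is not stated in the abstract.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from source in a graph {node: [(neighbor, cost), ...]}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry, a shorter path to u was already found
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Edge costs must be non-negative for the result to be correct.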

2511.22893 2026-05-11 eess.SY cs.AI cs.SY

Switching-time bioprocess control with pulse-width-modulated optogenetics

Sebastián Espinel-Ríos

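AI summary This study explores how pulse-width-modulated optogenetics can provide dynamic control of bioprocesses to improve biomanufacturing efficiency. Because intensity-based control offers limited tunability under a steep dose-response relationship, the study alternates the light between ON and OFF to smooth the gene-expression response, and models the task as a switching-time optimal control problem with a binary input. To avoid the rapid growth in decision variables that discrete optimization incurs on fine control grids, the author introduces a reinforcement learning approach that parametrizes the duty cycle, giving continuous control over switching times while preserving the binary nature of the light intensity and improving process controllability.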

Comments Accepted conference paper: IFAC World Congress 2026

Abstract

Biotechnology can benefit from dynamic control to improve production efficiency. In this context, optogenetics enables modulation of gene expression using light as an external input, allowing fine-tuning of protein levels to unlock dynamic metabolic control and regulation of cell growth. Optogenetic systems can be actuated by light intensity. However, relying solely on intensity-driven control (i.e., signal amplitude) may fail to properly tune optogenetic bioprocesses when the dose-response relationship (i.e., light intensity versus gene-expression strength) is steep. In these cases, tunability is effectively constrained to either fully active or fully repressed gene expression, with little intermediate regulation. Pulse-width modulation can alleviate this issue by alternating between fully ON and OFF light intensity within forcing periods, thereby smoothing the average response and enhancing process controllability. Optimizing pulse-width-modulated optogenetics entails a switching-time optimal control problem with a binary input over multiple forcing periods. While this can be formulated as a mixed-integer optimization problem on a refined control grid with monotonic input constraints, the number of decision variables can grow rapidly with increasing control-grid resolution within forcing periods and with the total number of forcing periods, complicating the task. Here, we propose an alternative solution based on reinforcement learning. We parametrize control actions via the duty cycle, a continuous proxy variable that encodes the ON-to-OFF switching time within each forcing period, thereby respecting the intrinsic binary nature of the light intensity while avoiding fine-grid binary decision variables.
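
The duty-cycle parametrization can be illustrated with a small sketch that expands one continuous value per forcing period into a binary ON/OFF input. The sampling step `dt`, the rounding to a grid, and the function name are assumptions for illustration only.

```python
def pwm_signal(duty_cycles, period, dt):
    """Binary light input sampled every dt, ON for the first duty * period of each period."""
    steps = round(period / dt)
    signal = []
    for duty in duty_cycles:
        # Clip the duty cycle to [0, 1], then convert it to a number of ON samples.
        on_steps = round(min(max(duty, 0.0), 1.0) * steps)
        signal.extend([1] * on_steps + [0] * (steps - on_steps))
    return signal
```

Each forcing period contributes a single continuous decision variable, and the input stays binary at every sample; the fraction of ON samples in a period approximates its duty cycle.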

2511.09016 2026-05-11 eess.SY cs.LG cs.SY

Assumed Density Filtering and Smoothing with Neural Network Surrogate Models

Simon Kuang, Xinfan Lin

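AI summary This paper studies how to achieve accurate state estimation and smoothing in nonlinear dynamic systems using neural network surrogate models. The authors propagate uncertainty with a recent analytic formula for the mean and covariance of a deep neural network under Gaussian input, and argue for cross entropy rather than root-mean-square error as the metric for filtering and smoothing accuracy. Experiments show superior estimation performance on a stochastic Lorenz system and a Wiener system, and improved linear quadratic regulation based on the state estimate.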

Comments To appear at Learning for Decision and Control 2026

Abstract

The Kalman filter and Rauch-Tung-Striebel (RTS) smoother are optimal for state estimation in linear dynamic systems. With nonlinear systems, the challenge consists in how to propagate uncertainty through the state transitions and output function. For the case of a neural network model, we enable accurate uncertainty propagation using a recent state-of-the-art analytic formula for computing the mean and covariance of a deep neural network with Gaussian input. We argue that cross entropy is a more appropriate performance metric than RMSE for evaluating the accuracy of filters and smoothers. We demonstrate the superiority of our method for state estimation on a stochastic Lorenz system and a Wiener system, and find that our method enables more optimal linear quadratic regulation when the state estimate is used for feedback. Code available at https://github.com/simontheflutist/analytic-moments.
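
To give a feel for analytic moment propagation, the sketch below computes the exact mean and variance of a single scalar ReLU unit under Gaussian input, using the textbook rectified-Gaussian formulas. It is deliberately much simpler than the deep-network formula the abstract refers to, and it is not taken from the linked repository.

```python
import math

def relu_gaussian_moments(mu, sigma):
    """Exact mean and variance of max(0, X) for X ~ N(mu, sigma^2), sigma > 0."""
    z = mu / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density at z
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF at z
    mean = mu * cdf + sigma * pdf
    second = (mu * mu + sigma * sigma) * cdf + mu * sigma * pdf
    return mean, second - mean * mean
```

An assumed-density filter would replace the true output distribution with a Gaussian matching moments like these before the next propagation step.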

2511.03182 2026-05-11 cs.SE cs.LG

Understanding Robustness of Model Editing in Code LLMs

Vinaik Chhetri, Moghis Fereidouni, A. B. Siddique, Umar Farooq

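AI summary This paper studies the robustness of model editing in code LLMs, in particular whether edited models migrate correctly to updated APIs while retaining their original performance. The authors build a controlled benchmark from HumanEval, MBPP, and APPS, with 2,040 problems and 140 synthetic API modifications, evaluated in an execution sandbox. Experiments show that existing editing methods suffer marked degradation in generalization and specificity under both single-edit and successive-edit regimes, indicating that model editing still faces many challenges in practice.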

Comments 26 pages, 14 figures, 20 tables

Abstract

Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining for incorporating API updates, yet it remains unclear whether existing editing methods can induce correct API migration, generalize that behavior to unseen tasks, and preserve performance on tasks involving unmodified APIs. We present a controlled benchmark for evaluating model editing under API updates in code LLMs, built from HumanEval, MBPP, and APPS, with 2,040 problems spanning 140 unique synthetic API modifications, together with an execution sandbox that enforces edited APIs under standard Python semantics. We evaluate several state-of-the-art editing methods on three code LLMs under both single-edit and successive-edit regimes using execution-based metrics that distinguish successful API adoption from workaround-based task completion. Under single edits, edited models generalize poorly to unseen uses of the modified API, and many apparent successes are workaround-based rather than true API migrations. Performance on tasks involving unmodified APIs also degrades, although memory-based methods and fine-tuning preserve specificity better than locate-then-edit methods. Under successive edits, most method-model combinations collapse to near-zero Pass@k on both generalization and specificity, revealing substantial interference beyond the target edits. A two-factor Shapley decomposition further shows that single-edit failures on generalization include a substantial compilation component, whereas specificity failures are more often post-compilation. Under successive edits, failures become predominantly compilation-driven.
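
The abstract reports Pass@k without stating how it is estimated. A commonly used choice is the unbiased estimator computed from `n` sampled completions of which `c` pass; the sketch below implements that estimator as an assumption about, not a description of, the paper's evaluation.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c pass (requires k <= n)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The estimate is the probability that a uniformly random subset of `k` of the `n` samples contains at least one passing completion.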

2510.24736 2026-05-11 q-bio.QM cs.LG q-bio.BM

RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics

Danqi Liao, Chen Liu, Xingzhi Sun, Dié Tang, Haochen Wang, Scott Youlten, Srikar Krishna Gopinath, Haejeong Lee, Ethan C. Strayer, Antonio J. Giraldez, Smita Krishnaswamy

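AI summary RNAGenScape is an mRNA sequence generation framework based on manifold Langevin dynamics that aims to generate optimized mRNA sequences with specified biological properties. It learns the latent manifold of real data and performs constrained optimization on that manifold to keep generated sequences biologically viable and functional. The framework combines an autoencoder, a property predictor, and a property-guided optimization procedure, substantially improving the performance metrics of generated sequences while maintaining high generation efficiency.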

Comments ICML 2025 Generative AI and Biology (GenBio) Workshop, Oral presentation

Abstract

Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
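
The interplay of a guided Langevin step and a projection back onto the manifold can be sketched on a toy problem. Everything here is a stand-in: the manifold is the unit circle in two dimensions rather than a learned latent manifold, the projection is a normalization rather than a denoising autoencoder, and the step size and noise scale are arbitrary illustrative values.

```python
import math
import random

def project_to_circle(z):
    """Stand-in for the denoising projection: map a 2-D point back to the unit circle."""
    norm = math.hypot(z[0], z[1]) or 1.0
    return (z[0] / norm, z[1] / norm)

def guided_langevin(z, grad_property, step=0.05, noise_scale=0.1, n_steps=200, seed=0):
    """Property-guided Langevin updates, each followed by projection onto the manifold."""
    rng = random.Random(seed)
    scale = noise_scale * math.sqrt(2.0 * step)
    for _ in range(n_steps):
        g = grad_property(z)  # gradient of the property to be increased
        noise = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        z = project_to_circle((z[0] + step * g[0] + scale * noise[0],
                               z[1] + step * g[1] + scale * noise[1]))
    return z
```

With a property equal to the first coordinate, `grad_property` is `lambda z: (1.0, 0.0)`, and iterates drift toward the point of the circle with the largest first coordinate without ever leaving the circle.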

2510.18516 2026-05-11 q-bio.NC cs.LG

Decoding Dynamic Visual Experience from Calcium Imaging via Cell-Pattern-Aware Pretraining

Sangyoon Bae, Mehdi Azabou, Blake Richards, Jiook Cha

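AI summary To address heterogeneity in neural recordings caused by cell-type differences, circuit dynamics, and stochastic stimulus responses, this study proposes POYO-CAP, a biologically grounded pretraining method. It identifies statistically regular neurons, pretrains on them with masked reconstruction and auxiliary supervision, and then fine-tunes on more stochastic neuron populations. On the Allen Brain Observatory dataset the method improves on from-scratch training by 12-13% and scales stably with model size, turning neural heterogeneity into a scalable learning advantage.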

Abstract

Neural recordings exhibit a distinctive form of heterogeneity rooted in differences in cell types, intrinsic circuit dynamics, and stochastic stimulus-response variability that goes beyond ordinary dataset variability, mixing statistically regular neurons with highly stochastic, stimulus-contingent ones within the same dataset. This heterogeneity poses a challenge for self-supervised learning (SSL), which depends on learnable statistical regularity, thereby destabilizing representation learning and limiting reliable scaling. We introduce POYO-CAP (Cell-pattern Aware Pretraining), a biologically grounded hybrid pretraining strategy that first trains with masked reconstruction plus lightweight auxiliary supervision on statistically regular neurons, identified via skewness and kurtosis, and then fine-tunes on more stochastic populations. On the Allen Brain Observatory dataset, this curriculum yields 12-13% relative improvements over from-scratch training and enables smooth, monotonic scaling with model size, whereas baselines trained on mixed populations plateau or destabilize. By making statistical predictability an explicit data-selection criterion, POYO-CAP turns neural heterogeneity into a scalable learning advantage for robust neural decoding.
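
Selecting neurons by skewness and kurtosis could look roughly like the sketch below. The moment formulas are standard, but the thresholds and the rule that low values mark a trace as "regular" are hypothetical; the abstract does not specify the selection criterion.

```python
def skewness_kurtosis(xs):
    """Sample skewness and excess kurtosis from population (biased) central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

def select_regular(traces, max_abs_skew=1.0, max_excess_kurt=3.0):
    """Indices of traces whose moments fall under hypothetical regularity thresholds."""
    keep = []
    for i, xs in enumerate(traces):
        skew, kurt = skewness_kurtosis(xs)
        if abs(skew) <= max_abs_skew and kurt <= max_excess_kurt:
            keep.append(i)
    return keep
```

Traces with zero variance would cause a division by zero and should be filtered out beforehand.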

2510.02371 2026-05-11 cs.CR cs.AI cs.DC

Federated Spatiotemporal Graph Learning for Passive Attack Detection in Smart Grids

Bochra Al Agha, Razane Tajeddine

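AI summary This paper studies the detection of passive attacks in smart grids, in which attackers eavesdrop on communication links to obtain sensitive information such as grid topology and consumption patterns. Because the signals seen at a single node are faint, short-lived, and easily missed, it proposes a federated spatiotemporal graph learning method that fuses physical-layer and behavioral features, builds star subgraphs on local devices, and extracts spatiotemporal features for attack detection. The method is trained in a non-IID federated learning setup, offers robustness and privacy protection, and achieves high detection accuracy and a low false-positive rate on a synthetic dataset.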

Abstract

Smart grids are exposed to passive eavesdropping, where attackers listen silently to communication links. Although no data is actively altered, such reconnaissance can reveal grid topology, consumption patterns, and operational behavior, creating a gateway to more severe targeted attacks. Detecting this threat is difficult because the signals it produces are faint, short-lived, and often disappear when traffic is examined by a single node or along a single timeline. This paper introduces a graph-centric, multimodal detector that fuses physical-layer and behavioral indicators over ego-centric star subgraphs and short temporal windows to detect passive attacks. To capture stealthy perturbations, a two-stage encoder is introduced: graph convolution aggregates spatial context across ego-centric star subgraphs, while a bidirectional GRU models short-term temporal dependencies. The encoder transforms heterogeneous features into a unified spatio-temporal representation suitable for classification. Training occurs in a federated learning setup under FedProx, improving robustness to heterogeneous local raw data and contributing to the trustworthiness of decentralized training; raw measurements remain on client devices. A synthetic, standards-informed dataset is generated to emulate heterogeneous HAN/NAN/WAN communications with wireless-only passive perturbations, event co-occurrence, and leak-safe splits. The model achieves a testing accuracy of 98.32% per-timestep ($F1_{\text{attack}} = 0.972$) and 93.35% per-sequence at 0.15% FPR using a simple decision rule with run-length $m = 2$ and threshold $\tau = 0.55$. The results demonstrate that combining spatial and temporal context enables reliable detection of stealthy reconnaissance while maintaining low false-positive rates, making the approach suitable for non-IID federated smart-grid deployments.
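
The sequence-level decision rule is described only by its run-length and threshold. One natural interpretation, offered here as an assumption, is to raise an alarm once the per-timestep attack probability stays at or above the threshold for `m` consecutive steps; the function name and the use of `>=` rather than `>` are illustrative choices.

```python
def sequence_alarm(probs, tau=0.55, m=2):
    """Flag a sequence when m consecutive per-timestep attack probabilities reach tau."""
    run = 0
    for p in probs:
        run = run + 1 if p >= tau else 0  # length of the current above-threshold run
        if run >= m:
            return True
    return False
```

Requiring a run of length two suppresses isolated single-step spikes, which is one way such a rule can keep the false-positive rate low.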

2509.00398 2026-05-11 cs.CY cs.AI

A Study on the Framework for Evaluating the Ethics and Trustworthiness of Generative AI

Cheonsu Jeong, Seunghyun Lee, Seonhee Jeong, Sungsu Kim

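AI summary This paper examines the ethical and trustworthiness challenges raised by the rapid development of generative AI and proposes a systematic evaluation framework. Because current evaluation methods focus on performance and accuracy, the study builds detailed indicators and assessment methods across dimensions such as fairness, transparency, and safety, and compares policies across several countries. The framework spans the AI lifecycle and integrates technical assessment with multidisciplinary perspectives, providing a theoretical basis and practical guidance for the responsible advancement of generative AI.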

Comments 22 pages, 3 figures, 6 tables

Journal ref
Artificial Intelligence and Applications, 2026
Abstract

This study provides an in-depth analysis of the ethical and trustworthiness challenges emerging alongside the rapid advancement of generative artificial intelligence (AI) technologies and proposes a comprehensive framework for their systematic evaluation. While generative AI, such as ChatGPT, demonstrates remarkable innovative potential, it simultaneously raises ethical and social concerns, including bias, harmfulness, copyright infringement, privacy violations, and hallucination. Current AI evaluation methodologies, which mainly focus on performance and accuracy, are insufficient to address these multifaceted issues. Thus, this study emphasizes the need for new human-centered criteria that also reflect social impact. To this end, it identifies key dimensions for evaluating the ethics and trustworthiness of generative AI (fairness, transparency, accountability, safety, privacy, accuracy, consistency, robustness, explainability, copyright and intellectual property protection, and source traceability) and develops detailed indicators and assessment methodologies for each. Moreover, it provides a comparative analysis of AI ethics policies and guidelines in South Korea, the United States, the European Union, and China, deriving key approaches and implications from each. The proposed framework applies across the AI lifecycle and integrates technical assessments with multidisciplinary perspectives, thereby offering practical means to identify and manage ethical risks in real-world contexts. Ultimately, the study establishes an academic foundation for the responsible advancement of generative AI and delivers actionable insights for policymakers, developers, users, and other stakeholders, supporting the positive societal contributions of AI technologies.

2504.02382 2026-05-11 eess.IV cs.AI cs.CV

Benchmark of Segmentation Techniques for Pelvic Fracture in CT and X-ray: Summary of the PENGWIN 2024 Challenge

Yudi Sang, Yanzhen Liu, Sutuke Yibulayimu, Yunning Wang, Benjamin D. Killeen, Mingxu Liu, Ping-Cheng Ku, Ole Johannsen, Karol Gotkowski, Maximilian Zenk, Klaus Maier-Hein, Fabian Isensee, Peiyan Yue, Yi Wang, Haidong Yu, Zhaohong Pan, Yutong He, Xiaokun Liang, Daiqi Liu, Fuxin Fan, Artur Jurgas, Andrzej Skalski, Yuxi Ma, Jing Yang, Szymon Płotka, Rafał Litka, Gang Zhu, Yingchun Song, Mathias Unberath, Mehran Armand, Dan Ruan, S. Kevin Zhou, Qiyong Cao, Chunpeng Zhao, Xinbao Wu, Yu Wang

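AI summary This paper summarizes the PENGWIN 2024 challenge's evaluation of pelvic fracture segmentation techniques in CT and X-ray images. Targeting the difficulty of segmenting bone fragments in medical images, the study evaluates algorithms from 16 international teams under multiple metrics on a dataset of 150 CT scans and a large set of simulated X-ray images. CT segmentation performed well, with an average IoU of 0.930 for the top algorithm, whereas X-ray segmentation was weaker, with the best algorithm reaching an IoU of 0.774, indicating that fragment overlap in projection imaging remains the main challenge. The study also reveals diversity in algorithm design and suggests that interactive segmentation may be important for clinical applicability.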

Comments PENGWIN 2024 Challenge Report

Abstract

The segmentation of pelvic fracture fragments in CT and X-ray images is crucial for trauma diagnosis, surgical planning, and intraoperative guidance. However, accurately and efficiently delineating the bone fragments remains a significant challenge due to complex anatomy and imaging limitations. The PENGWIN challenge, organized as a MICCAI 2024 satellite event, aimed to advance automated fracture segmentation by benchmarking state-of-the-art algorithms on these complex tasks. A diverse dataset of 150 CT scans was collected from multiple clinical centers, and a large set of simulated X-ray images was generated using the DeepDRR method. Final submissions from 16 teams worldwide were evaluated under a rigorous multi-metric testing scheme. The top-performing CT algorithm achieved an average fragment-wise intersection over union (IoU) of 0.930, demonstrating satisfactory accuracy. However, in the X-ray task, the best algorithm achieved an IoU of 0.774, which is promising but not yet sufficient for intra-operative decision-making, reflecting the inherent challenges of fragment overlap in projection imaging. Beyond the quantitative evaluation, the challenge revealed methodological diversity in algorithm design. Variations in instance representation, such as primary-secondary classification versus boundary-core separation, led to differing segmentation strategies. Despite promising results, the challenge also exposed inherent uncertainties in fragment definition, particularly in cases of incomplete fractures. These findings suggest that interactive segmentation approaches, integrating human decision-making with task-relevant information, may be essential for improving model reliability and clinical applicability.
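
A fragment-wise IoU of the kind reported above can be sketched as follows. The sketch assumes flattened label maps in which 0 is background and fragment ids are already matched between prediction and reference; how the challenge matches fragments and aggregates across cases is not described in the abstract.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two binary masks given as equal-length sequences."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0

def mean_fragment_iou(pred_labels, true_labels):
    """Average IoU over fragment ids in the reference, assuming ids are already matched."""
    ids = sorted(set(true_labels) - {0})  # label 0 is treated as background
    scores = [iou([p == i for p in pred_labels], [t == i for t in true_labels])
              for i in ids]
    return sum(scores) / len(scores)
```

The reference must contain at least one non-background fragment, otherwise the average is undefined.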

2503.17656 2026-05-11 q-bio.QM cs.AI cs.LG

Pretraining a Foundation Model for Small-Molecule Natural Products

Yuheng Ding, Bo Qiang, Shaoning Li, Yiran Zhou, Jie Yu, Qi Li, Cheng Shi, Liangren Zhang, Yusong Wang, Nanning Zheng, Zhenming Liu

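AI summary Motivated by the importance of natural products in drug discovery, this study proposes a pretrained foundation model designed specifically for small-molecule natural products. By combining contrastive learning and masked graph learning, the model captures evolutionary information from molecular scaffolds together with side-chain features, overcoming the limited generality and task adaptability of existing methods. Experiments show state-of-the-art performance on natural product classification, gene- and microbe-level analysis, and virtual screening, providing a useful tool for natural product research and drug development.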

Comments Accepted by Nature Machine Intelligence (2026)

Journal ref
Nature Machine Intelligence (2026)
Abstract

Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, this one-model-per-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well suited to the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy tailored to natural products. By incorporating contrastive learning and masked graph learning objectives, we emphasize evolutionary information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification against baselines focused on synthesized molecules to demonstrate that current models are inadequate for understanding natural synthesis. Furthermore, in a fine-grained analysis at both the gene and microbial levels, our model, NaFM, demonstrates the ability to capture evolutionary information. Finally, we evaluate our method on virtual screening, showing that it yields informative natural product representations that can lead to more effective identification of potential drug candidates.