arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10598 2026-05-12 cs.AI

Budget-Efficient Automatic Algorithm Design via Code Graph

Maxime Bouscary, Manxi Wu, Saurabh Amin

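Affiliations: Operations Research Center; Department of Civil and Environmental Engineering; University of California, Berkeley; Laboratory for Information & Decision Systems

AI summary: This study proposes an efficient automatic algorithm design method based on code graphs, aiming to address the inefficient use of computational resources in existing methods. It represents algorithms as directed acyclic graphs and uses a large language model to generate local code corrections rather than complete algorithms, which lets it explore the algorithm space more efficiently. Experiments show that the method outperforms full-algorithm search under the same computational budget and reveal the conditions under which context richness affects model performance.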

Abstract

Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.
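
The correction idea in the abstract can be illustrated with a toy sketch. It is not the paper's graph structure: the `Correction` class, the block names, and the mean-fitness credit rule are hypothetical choices made here only to show how add/replace/remove operators compose into several candidate algorithms and how credit can be attributed per correction.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Correction:
    """A compact edit to one named code block (hypothetical representation)."""
    name: str
    op: str                      # "add", "replace", or "remove"
    block: str                   # name of the code block the edit targets
    code: Optional[str] = None   # new source for "add" / "replace"


def apply_corrections(base: Dict[str, str],
                      corrections: Tuple[Correction, ...]) -> Dict[str, str]:
    """Return a new algorithm (block name -> source) with corrections applied in order."""
    blocks = dict(base)  # copy, so the base algorithm is left untouched
    for c in corrections:
        if c.op == "remove":
            blocks.pop(c.block, None)
        elif c.op in ("add", "replace"):
            blocks[c.block] = c.code or ""
        else:
            raise ValueError(f"unknown op: {c.op}")
    return blocks


def candidate_sets(corrections: List[Correction]) -> Iterator[Tuple[Correction, ...]]:
    """Enumerate every subset of corrections; each subset defines one candidate."""
    for r in range(len(corrections) + 1):
        yield from combinations(corrections, r)


def credit(scored: List[Tuple[Tuple[Correction, ...], float]]) -> Dict[str, float]:
    """Assumed credit rule: mean fitness of the candidates containing each correction."""
    totals: Dict[str, List[float]] = {}
    for subset, fitness in scored:
        for c in subset:
            totals.setdefault(c.name, []).append(fitness)
    return {name: sum(v) / len(v) for name, v in totals.items()}


base = {"init": "x = 0", "step": "x += 1"}
c1 = Correction("c1", "replace", "step", "x += 2")
c2 = Correction("c2", "add", "stop", "done = x > 10")
print(len(list(candidate_sets([c1, c2]))))
print(apply_corrections(base, (c1, c2)))
```

In this toy setting two corrections already define four candidate algorithms, which hints at why querying for corrections may use a token budget more efficiently than querying for full algorithms.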

2605.10593 2026-05-12 cs.AI cs.CL cs.HC cs.SE

LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

Philipp Steigerwald, Mara Stieler, Jennifer Burghardt, Eric Rudolph, Jens Albrecht

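Affiliations: Technische Hochschule Nürnberg Georg Simon Ohm; Faculty of Computer Science, Centre for Artificial Intelligence (KIZ); Faculty of Social Sciences, Institute for E-Counselling

AI summary: LLARS is an open-source platform intended to support collaboration between domain experts and developers building LLM-based systems. It integrates three tightly coupled modules (collaborative prompt engineering, batch generation, and hybrid evaluation), supporting real-time collaboration, cost-controlled output generation, and multi-dimensional evaluation that combines human and LLM evaluators. The study reports that LLARS improves interdisciplinary collaboration, streamlines the workflow, and helps identify the best model-prompt combination.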

Comments Accepted at IJCAI-ECAI 2026 Demonstrations Track. Demo video: https://youtu.be/3QaKouwr4gU

Abstract

We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts $\times$ models $\times$ data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.

2605.10588 2026-05-12 cs.CV

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang, Haoze Sun, Yijun Yang, Shenghe Zheng, Lin Song, Haoyang Huang, Nan Duan, Wenbo Li

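Affiliations: Joy Future Academy

AI summary: Current large multimodal models (LMMs) perform poorly on spatial reasoning tasks that require viewpoint-dependent understanding, mainly because they are limited to a single static observation. The study proposes a new paradigm, Thinking with Novel Views (TwNV), which introduces synthesized novel-view images into the reasoning process to improve the model's understanding of spatial relations. Experiments show that TwNV improves accuracy across multiple spatial subtasks and LMM architectures (by +1.3 to +3.9 pp), supporting novel-view generation as a way to strengthen spatial intelligence.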

Comments Submitted to NeurIPS 2026

Abstract

Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.

2605.10586 2026-05-12 cs.CV

CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations

Nengbo Lu, Minghua Pan

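Affiliations: Guilin University of Electronic Technology

AI summary: This paper proposes CausalGS, a framework that learns the physical causality of complex dynamic 3D scenes from multi-view videos alone, without relying on explicit priors. At its core is an inverse physics inference module that decomposes the dynamics into two jointly inferred factors, the scene's initial velocity field and its intrinsic material properties, and uses a differentiable physics simulator for physics-regularized learning. Experiments show that CausalGS outperforms existing methods on long-term future frame extrapolation and performs strongly on novel view interpolation, indicating that it can learn the interactions among physical properties and the causal relationships of a scene from visual observations alone.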

Comments ICMR2026 Accepted

Abstract

Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene's kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene's dynamic evolution, solely from visual observations.

2605.10585 2026-05-12 cs.LG

Controllability in preference-conditioned multi-objective reinforcement learning

Pau de las Heras Molins, Beyazit Yalcinkaya, Lasse Peters, David Fridovich-Keil, Georgios Bakirtzis

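Affiliations: LTCI, Télécom Paris, Institut Polytechnique de Paris; University of California, Berkeley; TU Delft; The University of Texas at Austin

AI summary: This paper studies controllability in preference-conditioned multi-objective reinforcement learning, that is, whether changes in user preference reliably steer changes in agent behavior. The authors point out that existing evaluation metrics cannot measure this property, so agents may be insensitive to the preference input while still scoring well. They therefore propose a complementary metric designed to measure the controllability of preference-conditioned agents, in support of further progress on preference adaptation in multi-objective reinforcement learning.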

Abstract

Multi-objective reinforcement learning (MORL) allows a user to express preference over outcomes in terms of the relative importance of the objectives, but standard metrics cannot capture whether changes in preference reliably change the agent's behavior in the intended way, a property termed controllability. As a result, preference-conditioned agents can score well on standard MORL metrics while being insensitive to the preference input. If the ability to control agents cannot be reliably assessed, the symbolic interface that MORL provides between user intent and agent behavior is broken. Mainstream MORL metrics alone fail to measure the controllability of preference-conditioned agents, motivating a complementary metric specifically designed to that end. We hope the results spur discussion in the community on existing evaluation protocols to consolidate advances in preference adaptation in MORL to larger and more complex problems.

2605.10579 2026-05-12 cs.CL

VISTA: A Generative Egocentric Video Framework for Daily Assistance

Yu-Hsiang Liu, Yu-Chien Tang, An-Zi Yen

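Affiliations: Department of Computer Science, National Yang Ming Chiao Tung University

AI summary: This paper proposes VISTA, a generative egocentric video framework that provides high-quality training and evaluation data for AI agents in daily assistance tasks. Using a five-step script generation pipeline with causal reverse reasoning, it produces diverse, logically grounded intervention scenarios covering two levels of agent autonomy, reactive and proactive. VISTA lets users customize and refine scenarios, yielding scalable and controllable video benchmarks for daily tasks as an alternative to real-world data collection for training and evaluating AI agents.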

Comments pre-print

Abstract

Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.

2605.10576 2026-05-12 cs.CV cs.AI

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

Chen Zhong, Xiao An, Jiaxing Sun, Zihan Gui, Guangyi Yang, Wei He

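Affiliations: Wuhan University; Shanghai Artificial Intelligent Laboratory

AI summary: This paper proposes SenseBench, the first benchmark dedicated to evaluating the low-level visual perception and description abilities of large vision-language models in remote sensing. Because current image quality assessment methods cannot accurately describe remote sensing degradations, the authors build more than 10,000 carefully annotated samples across 6 major and 22 fine-grained degradation categories and design two evaluation protocols, perception and description. The evaluation reveals key problems in existing models, including domain bias and confusion under multiple degradations.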

Abstract

Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.

2605.10572 2026-05-12 cs.LG

Online Sharp-Calibrated Bayesian Optimization

Marshal Arijona Sinaga, Julien Martinelli, Teemu Turpeinen, Samuel Kaski

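Affiliations: ELLIS Institute Finland; Aalto University; University of Manchester

AI summary: This paper studies how to obtain uncertainty estimates that are both sharp and calibrated in online Bayesian optimization. The authors propose Online Sharp-Calibrated Bayesian Optimization (OSCBO), which casts kernel hyperparameter selection as a constrained online-learning problem in order to adaptively balance the sharpness and calibration of the Gaussian process model. The method preserves sublinear regret bounds and performs competitively on synthetic and real-world benchmarks.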

Abstract

Bayesian optimization (BO) is a widely used framework for optimizing expensive black-box functions, commonly based on Gaussian process (GP) surrogate models. Its effectiveness relies on uncertainty quantification that is both sharp (informative) and well-calibrated along the BO trajectory. In practice, GP kernel hyperparameters are unknown and are refit online from sequentially collected (non-i.i.d.) data, which can yield miscalibrated or overly conservative uncertainty and lies outside the fixed-kernel assumptions of standard BO regret theory. We propose Online Sharp-Calibrated Bayesian Optimization (OSCBO), a BO algorithm that adaptively balances GP sharpness and calibration by casting hyperparameter selection as a constrained online-learning problem. We also show that OSCBO preserves sublinear regret bounds by leveraging the theoretical guarantees of the underlying online learning algorithm. Empirically, OSCBO performs competitively across synthetic and real-world benchmarks, ranking among the strongest methods in final simple regret while maintaining robust cumulative-regret behavior.

2605.10569 2026-05-12 cs.AI

Deep Arguing

Adam Gould, Francesca Toni

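Affiliations: Department of Computing; Imperial College London

AI summary: This paper proposes Deep Arguing, a neurosymbolic approach intended to make deep learning classifiers more interpretable across different data modalities. It combines deep neural networks with argumentation construction and reasoning, so that the model builds an argumentation structure supporting its predictions and is trained with differentiable argumentation semantics to learn feature representations and argumentative interactions jointly. Experiments show that the approach keeps predictive performance competitive while providing case-based explanations for its predictions.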

Abstract

Deep learning has become the dominant approach for creating high capacity, scalable models across diverse data modalities. However, because these models rely on a large number of learned parameters, tightly couple feature extraction with task objectives, and often lack explicit reasoning mechanisms, it is difficult for humans to understand how they arrive at their predictions. Understanding what representations emerge and why they arise from the training data remains an open challenge. We introduce Deep Arguing, a novel neurosymbolic approach that integrates deep learning with argumentation construction and reasoning for interpretable classification with different data modalities. In our approach deep neural networks construct an argumentation structure wherein data points support their assigned label and attack different ones. Using differentiable argumentation semantics for reasoning, the model is trained end-to-end to jointly learn feature representation and argumentative interactions. This results in argumentation structures providing faithful case-based explanations for predictions. Structure constraints over the argumentation graph guide learning, improving both interpretability and predictive performance. Experiments with tabular and imaging datasets show that Deep Arguing achieves performance competitive with standard baselines whilst offering interpretable argumentative reasoning.

2605.10567 2026-05-12 cs.CV

VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos

Nengbo Lu, Bin Zhao

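Affiliations: Guangxi Key Laboratory of Robot Intelligent Perception and Control; School of Artificial Intelligence, Guilin University of Electronic Technology

AI summary: This paper proposes VeloGauss, a method that jointly models the geometry, appearance, and physical information of 3D scenes from dynamic multi-view videos alone, without relying on any physical priors. It learns a velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and adds Global Physical Constraints to keep the scene physically consistent. Experiments show that VeloGauss outperforms existing methods on both novel view interpolation and future frame extrapolation.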

Comments ICME2026 Accepted

Abstract

In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.

2605.10564 2026-05-12 cs.CV cs.RO

DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu, Lei Yang, Hang Zhang, Mu Xu, Hong Wang

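Affiliations: Tsinghua University; Amap, Alibaba Group; Nanyang Technological University

AI summary: This paper proposes DeepSight, a world model for end-to-end autonomous driving that predicts the latent semantic features of consecutive future frames in parallel in bird's-eye-view (BEV) space, enabling long-horizon modeling of future world states. It also introduces an efficient, adaptive text reasoning mechanism that draws on additional social knowledge and reasoning capabilities to improve driving in challenging long-tail scenarios. Experiments show that the method achieves state-of-the-art results on the closed-loop Bench2drive benchmark.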

Comments ICML 2026

Abstract

End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Codes are available at: https://github.com/hotdogcheesewhite/DeepSight.

2605.10563 2026-05-12 cs.CL cs.AI

ThreatCore: A Benchmark for Explicit and Implicit Threat Detection

Davide Bruni, Carlo Bardazzi, Maurizio Tesconi

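Affiliations: Computer Science Department, University of Pisa, Italy; Institute of Informatics and Telematics, National Research Council, Italy

AI summary: ThreatCore is a public benchmark dataset for fine-grained threat detection that distinguishes explicit threats, implicit threats, and non-threats, addressing the lack of consistent definitions and standardized benchmarks for threat detection in natural language processing. The dataset aggregates multiple public resources and systematically re-annotates them under a unified definition of threat, revealing substantial inconsistencies in existing labels, and adds manually validated synthetic examples to improve coverage of implicit threats. Experiments show that implicit threats are harder to detect than explicit ones, and that using Semantic Role Labeling as an intermediate representation can improve model performance.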

Abstract

Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a publicly available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, revealing substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples, which are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.

2605.10560 2026-05-12 cs.CL

ICT-NLP at SemEval-2026 Task 3: Less Is More -- Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression

Liyuan Huang, Jiawei He, Wutao Shen, Lin Li, Jin Zhang

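Affiliations: State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: This paper describes the authors' system for SemEval-2026 Task 3 (Dimensional Aspect Sentiment Regression), a lightweight, resource-efficient multilingual solution built entirely on pre-trained encoders, without large language models or external corpora. It uses joint multilingual and multi-domain training to improve cross-lingual transfer and alleviate data sparsity, a bounded regression transformation to stabilize training and constrain predictions to the valid range, and an adaptive ensemble via subset search to reduce prediction variance. The system ranks 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, and places in the top half on all remaining datasets.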

Abstract

This paper describes our system for SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.
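
The abstract does not say which bounded regression transformation is used. One common choice, shown here purely as an assumption, is a scaled sigmoid that maps an unbounded encoder output into a fixed label range; the bounds 1.0 and 9.0 in the usage lines are hypothetical and not taken from the task definition.

```python
import math


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bounded(z: float, lo: float, hi: float) -> float:
    """Map an unbounded score z into [lo, hi] with a scaled sigmoid (assumed transform)."""
    return lo + (hi - lo) * _sigmoid(z)


def unbounded(y: float, lo: float, hi: float) -> float:
    """Inverse map (logit) for a label y strictly inside (lo, hi)."""
    p = (y - lo) / (hi - lo)
    return math.log(p / (1.0 - p))


# Hypothetical label range; any raw score stays inside it after the transform.
LO, HI = 1.0, 9.0
print(bounded(0.0, LO, HI))
print(bounded(-1000.0, LO, HI), bounded(1000.0, LO, HI))
```

Under this assumption the model can never predict outside the valid range, which is one plausible reading of how the transformation stabilizes training.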

2605.10555 2026-05-12 cs.AI

Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

Kai Pan

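Affiliations: A2A Lab

AI summary: As AI agents move from research prototypes to enterprise production systems, the tool interfaces they use are still based on a human-centric CRUD paradigm. This paper proposes a semantic interface paradigm called Agent-First Tool API, which uses a six-verb semantic protocol, a normalized tool contract, and a dual-layer governance pipeline to address five architectural mismatches between conventional APIs and the needs of autonomous agents. The approach is validated on a production multi-tenant SaaS platform, where it raises the end-to-end task success rate from 64% to 88% and reduces human interventions by 72.7%.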

Abstract

As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (+37.5%), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
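
The +37.5% figure appears to be a relative gain rather than a difference in percentage points. A quick check using only the two success rates quoted in the abstract is consistent with that reading:

```latex
\[
\frac{88\% - 64\%}{64\%} = \frac{24}{64} = 0.375 = 37.5\%
\]
```

In absolute terms the difference between the two success rates is 24 percentage points.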

2605.10551 2026-05-12 cs.LG

It's All Connected: Topology-Aware Structural Graph Encoding Improves Performance on Polymer Prediction

H. Ibrahim Erdogan, Punith Raviswamy, Nikita Agrawal, Yannik Köster, Stefan Zechel, Ulrich S. Schubert, Ruben Mayer, Christopher Kuenneth

Affiliations: Faculty of Engineering Science, University of Bayreuth, Germany; Faculty of Mathematics, Physics & Computer Science, University of Bayreuth, Germany; Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Germany; Jena Center for Soft Matter (JCSM), Friedrich Schiller University Jena, Germany; Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Germany; Helmholtz Zentrum Berlin für Materialien und Energie GmbH (HZB), Germany

AI summary: This study addresses the data scarcity and structural complexity that graph neural networks (GNNs) face in polymer property prediction. It proposes a topology-aware graph construction based on the molecular mass distribution that directly encodes chain-scale structure, combined with rich chemical feature descriptors and self-supervised pretraining. On a dataset of only 381 polymers, the pretrained method reduces the root-mean-square error by 5.1% relative to the pretrained repeat-unit graph baseline. The experiments indicate that graph construction and pretraining are jointly needed for the gain, and that the results hold for more than one GNN architecture.

Comments 9 pages, 4 figures

Abstract

Graph Neural Networks (GNNs) have achieved strong results in molecular property prediction, but polymers present distinct challenges: labeled datasets are scarce and small (typically in the order of hundreds of polymers) due to the need for expensive experimentation, and complex polymer chain distributions influence polymer properties. Established practice in polymer prediction represents polymers solely by graphs of their repeat units, discarding the chain-scale morphology that governs key properties such as the glass transition temperature ($T_g$). In this work, we propose a principled graph construction that addresses this gap. Given a polymer's molecular mass distribution (MMD), we sample representative chains from the Schulz-Zimm distribution and construct representative sets of large graphs encoding chain-scale topology directly, with atoms and bonds featurized using rich chemical descriptors. We further pretrain GNN encoders via masked graph modeling on 100,000 unlabeled PSMILES strings before fine-tuning on labeled data. On a dataset of 381 polymers (180 homopolymers and 201 copolymers), we show that graph construction and self-supervised pretraining are jointly necessary: without pretraining, the large graph method matches the repeat-unit baseline (28.40 K vs. 28.36 K RMSE); with pretraining, it achieves 24.76 K +/- 3.30 K, a 5.1% reduction in mean error over the pretrained repeat-unit baseline (26.08 K +/- 4.20 K, p < 0.001, 30 runs). An ablation removing chemical features degrades performance to 36.65 K, confirming both components are essential. Results are architecture-agnostic, holding for both GINE and GATv2 encoders.
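
The chain-sampling step can be sketched under a standard textbook parameterization, which is an assumption here since the abstract does not give the paper's exact setup: the Schulz-Zimm number distribution of chain lengths is a gamma distribution with shape k = 1 / (Đ - 1) and mean equal to the number-average chain length, where Đ = Mw/Mn is the dispersity. The function names and the example values below are hypothetical, and the later step of turning sampled chains into graphs is not shown.

```python
import random
from typing import List


def sample_chain_lengths(mean_units: float, dispersity: float,
                         n_chains: int, seed: int = 0) -> List[int]:
    """Sample chain lengths (in repeat units) from a Schulz-Zimm distribution.

    Assumed parameterization: gamma with shape k = 1 / (dispersity - 1)
    and scale mean_units / k, so the mean is mean_units.
    """
    if dispersity <= 1.0:
        raise ValueError("dispersity must be greater than 1")
    k = 1.0 / (dispersity - 1.0)
    rng = random.Random(seed)
    # Round to whole repeat units and keep at least one unit per chain.
    return [max(1, round(rng.gammavariate(k, mean_units / k)))
            for _ in range(n_chains)]


def sample_dispersity(lengths: List[int]) -> float:
    """Weight-average over number-average chain length of a sample."""
    number_avg = sum(lengths) / len(lengths)
    weight_avg = sum(n * n for n in lengths) / sum(lengths)
    return weight_avg / number_avg


# Hypothetical polymer: 100 repeat units on average, dispersity 1.5.
chains = sample_chain_lengths(100.0, 1.5, 20000, seed=1)
print(sum(chains) / len(chains), sample_dispersity(chains))
```

With enough sampled chains, the sample mean and sample dispersity should land close to the requested values, which is a simple sanity check on the parameterization.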

2605.10547 2026-05-12 cs.LG

PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay

Zetao Yang

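Affiliations: School of Mathematics and Statistics

AI summary: This paper proposes PhysEDA, an efficient learning framework for electronic design automation (EDA) built on physical prior knowledge. It targets two problems that attention mechanisms and reinforcement learning face on EDA tasks: high computational complexity and overfitting caused by data scarcity. The method uses Manhattan-distance decay as an inductive bias, designing a physics-structured linear attention module with linear complexity and combining it with potential-based reward shaping, which improves cross-scale transfer and performance under sparse rewards. Experiments report improved performance and computational efficiency across three EDA tasks.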

Comments 9 pages, 4 figures, plus appendix. Code and data to be released upon publication

Abstract

Electronic design automation (EDA) addresses placement, routing, timing analysis, and power-integrity verification for integrated circuits. Learning methods -- attention (Transformer) and reinforcement learning (RL) -- have recently emerged on EDA tasks, yet face two common bottlenecks: vanilla attention's quadratic complexity limits scaling, and data-scarce models overfit statistical noise and amplify weak long-range correlations against the underlying physics. We observe that EDA tasks share a physical prior -- pairwise electrical and routing interactions decay exponentially along Manhattan distance -- and integrate it as a unified inductive bias into both architecture and training. We propose PhysEDA, comprising two components. Physics-Structured Linear Attention (PSLA) folds the separable Manhattan decay into the linear-attention kernel as a multiplicative bias, reducing complexity from quadratic to linear. Potential-Based Reward Shaping (PBRS) constructs a physical potential from the same kernel, providing dense reward signal under sparse RL while preserving the optimal policy via the policy-invariance theorem. Across three EDA scenarios -- decoupling-capacitor placement, macro placement, and IR-drop prediction -- PhysEDA improves zero-shot cross-scale transfer by 56.8% and achieves 14x inference speedup with 98.5% memory savings on 100x100 grids; PBRS adds another 10.8% in sparse-reward DPP.
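
Two properties the abstract relies on can be checked numerically in a few lines. The sketch below is not the PSLA module itself: it only shows that an exponential decay in Manhattan distance factorizes into per-axis terms, and that the standard potential-based shaping term gamma * phi(s') - phi(s) telescopes along a trajectory. The decay rate, the potential values, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[int, int]


def manhattan_decay(p: Point, q: Point, lam: float) -> float:
    """exp(-lam * Manhattan distance) between two grid cells."""
    return math.exp(-lam * (abs(p[0] - q[0]) + abs(p[1] - q[1])))


def axis_decay(a: int, b: int, lam: float) -> float:
    """One-dimensional factor of the separable decay."""
    return math.exp(-lam * abs(a - b))


def shaped_reward(reward: float, phi_s: float, phi_next: float, gamma: float) -> float:
    """Potential-based shaping: add gamma * phi(s') - phi(s) to the raw reward."""
    return reward + gamma * phi_next - phi_s


# Separability: the 2D decay equals the product of the per-axis decays.
p, q, lam = (1, 2), (4, 6), 0.3
print(manhattan_decay(p, q, lam), axis_decay(1, 4, lam) * axis_decay(2, 6, lam))

# Telescoping with gamma = 1: shaping changes the return only by phi(last) - phi(first).
phis: List[float] = [0.2, 0.5, 0.9, 0.4]   # hypothetical potentials along a trajectory
rewards: List[float] = [0.0, 0.0, 1.0]     # sparse raw reward
shaped = [shaped_reward(r, phis[i], phis[i + 1], 1.0) for i, r in enumerate(rewards)]
print(sum(shaped), sum(rewards) + phis[-1] - phis[0])
```

The telescoping behaviour is what lets a shaping potential add dense signal without favouring one policy over another on returns alone.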

2605.10546 2026-05-12 cs.LG

Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

Raphael Trumpp, Ömer Veysel Çağatan, Barış Akgün, Marco Caccamo

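Affiliations: TUM School of Engineering and Design; Technical University of Munich; KUIS AI Center; Koç University; Department of Computer Engineering

AI summary: This paper studies how the resolution of visual input affects policy learning in deep reinforcement learning. It argues that common practice downsamples images too aggressively, and that higher-resolution inputs can substantially improve performance and generalization when the network architecture can process them. The widely used Impala encoder suffers quadratic parameter growth as resolution increases, which limits its gains, whereas the Impoola architecture, which uses global average pooling, decouples parameter count from resolution and improves across resolutions and network widths, by up to 28%. The experiments indicate that higher resolution helps the policy perceive small or distant objects more precisely.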

Abstract

Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths; at their respective best conditions, visual scaling unlocks a 28% performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: https://github.com/raphajaner/procgen-hd.
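
The parameter-growth argument can be made concrete by counting the weights of the first dense layer after the convolutional stack. The numbers used below (a total downsampling factor of 8, 32 output channels, 256 hidden units) are illustrative assumptions, not values quoted from the paper.

```python
def feature_side(resolution: int, downsample: int = 8) -> int:
    """Side length of the final feature map for a square input (assumed stride)."""
    return resolution // downsample


def dense_params_flatten(side: int, channels: int, hidden: int) -> int:
    """Weights plus biases when the (side x side x channels) map is flattened."""
    return side * side * channels * hidden + hidden


def dense_params_gap(channels: int, hidden: int) -> int:
    """Weights plus biases after global average pooling to a channels-long vector."""
    return channels * hidden + hidden


for resolution in (64, 128, 256):
    side = feature_side(resolution)
    print(resolution,
          dense_params_flatten(side, channels=32, hidden=256),
          dense_params_gap(channels=32, hidden=256))
```

Doubling the input side length roughly quadruples the flattened layer's parameter count, while the pooled variant stays constant, which matches the decoupling the abstract describes.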

2605.10544 2026-05-12 cs.CL

Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

Jinchang Zhu, Jindong Li, Chengyu Zou, Rong Fu, Chao Wang, Haowei He, Menglin Yang

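Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Institute of Artificial Intelligence (TeleAI), China Telecom

AI summary This paper studies how supervision is allocated in long-context adaptation and argues that current methods fail to strengthen long-context supervision of target tokens during training. The authors propose EXACT, which uses inverse-frequency weighting to strengthen supervision of targets with long effective context. Experiments show that EXACT substantially improves long-context reasoning performance across multiple model configurations while preserving performance on standard tasks, supporting the key role of supervision allocation in long-context adaptation.

Details
English abstract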

Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token's effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA/reasoning are preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
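The supervision mismatch described here can be illustrated with a toy packed window. The sketch below first computes each target's effective context under document masking, then applies a hypothetical inverse-frequency weighting inside the long tail. The function names, the bucketing, and the `1 + alpha * ...` form are all assumptions made for illustration. The abstract does not give the EXACT objective itself.

```python
from collections import Counter


def effective_contexts(doc_lengths):
    """Number of visible preceding tokens for each target in a packed window,
    assuming document masking blocks attention across document boundaries."""
    return [pos for n in doc_lengths for pos in range(n)]


def tail_weights(contexts, tail_start, bucket, alpha=1.0):
    """Hypothetical weighting: 1.0 for short-context targets; in the long tail,
    1 + alpha * (inverse bucket frequency, scaled so the rarest bucket gets 1)."""
    tail_buckets = [c // bucket for c in contexts if c >= tail_start]
    if not tail_buckets:
        return [1.0] * len(contexts)
    counts = Counter(tail_buckets)
    rarest = min(counts.values())
    return [
        1.0 + alpha * rarest / counts[c // bucket] if c >= tail_start else 1.0
        for c in contexts
    ]


contexts = effective_contexts([4, 6, 8])  # three documents packed into 18 tokens
weights = tail_weights(contexts, tail_start=4, bucket=2)
print(max(contexts))  # longest effective context is 7, although the window holds 18 tokens
```

In this toy window the rarest long-context targets receive the largest weight, which is the qualitative behaviour the abstract describes.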

2605.10541 2026-05-12 cs.AI cs.LG

Bridging Sequence and Graph Structure for Epigenetic Age Prediction

Yao Li, Xikun Zhang, Xiaotao Shen, Sonika Tyagi, Xin Zheng, Jiaxing Huang, Feng Xia

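Affiliations * School of Computing and Information Systems; The University of Melbourne; School of Computing Technologies; RMIT University; Lee Kong Chian School of Medicine; Nanyang Technological University; Department of Data Science and Artificial Intelligence; Hong Kong Polytechnic University

AI summary This paper studies how to combine the sequence information of DNA methylation sites with graph structure to predict epigenetic age more accurately. The authors propose a unified sequence-graph integration framework that combines eight-dimensional DNA sequence statistical features with graph convolution through a lightweight gated modulation mechanism, modeling methylation signals more effectively. Tested on 3,707 blood methylation samples, the method outperforms the strongest existing graph-based model, indicating that biologically informed statistical features are more effective for this task than CNN-based sequence encoding.

Details
English abstract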

Epigenetic clocks based on DNA methylation have emerged as powerful tools for estimating biological age, with broad applications in aging research, age-related disease studies, and longevity science. Despite advances across machine learning approaches to epigenetic age prediction, spanning penalised linear regression, deep feedforward networks, residual architectures, and graph neural networks, no existing method jointly models co-methylation graph structure and site-specific DNA sequence context within a unified framework. We propose a unified sequence-graph integration framework for epigenetic age prediction that addresses this gap, integrating eight-dimensional DNA sequence statistical features through a lightweight gated modulation mechanism that adaptively scales each site's methylation signal according to its sequence-determined biological relevance prior to graph convolution. Evaluated on 3,707 blood methylation samples against a comprehensive set of baselines, our method achieves a test MAE of 3.149 years, a 12.8% improvement over the strongest graph-based baseline. Biologically informed statistical features outperform CNN-based sequence encoding, demonstrating that handcrafted sequence features are more effective than end-to-end learned representations in this data regime. Post-hoc interpretability analysis identifies CpG density and local adenine frequency as features with age-dependent importance shifts, consistent with known mechanisms of age-related hypermethylation at CpG-dense promoter regions. Our code is at https://github.com/yaoli2022/graphage-seq.
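The idea of scaling each site's methylation signal by a sequence-derived gate before graph convolution can be sketched in a few lines. This is a minimal illustration, assuming a sigmoid gate over a linear map of the sequence features followed by one mean-aggregation step. The gate form, the aggregation rule, and all names are assumptions, not the architecture in the paper.

```python
import math


def gate(seq_features, weights, bias):
    """Sigmoid gate in (0, 1) computed from one site's sequence features."""
    z = sum(w * f for w, f in zip(weights, seq_features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def modulate(methylation, seq_feature_rows, weights, bias):
    """Scale each site's methylation value by its sequence-derived gate."""
    return [m * gate(s, weights, bias) for m, s in zip(methylation, seq_feature_rows)]


def mean_aggregate(values, neighbours):
    """One mean-aggregation step over each site and its graph neighbours."""
    return [
        (values[i] + sum(values[j] for j in nbrs)) / (1 + len(nbrs))
        for i, nbrs in enumerate(neighbours)
    ]


# Two sites, eight sequence features each (all zeros here, so every gate is 0.5).
features = [[0.0] * 8, [0.0] * 8]
scaled = modulate([0.8, 0.2], features, weights=[0.0] * 8, bias=0.0)
smoothed = mean_aggregate(scaled, neighbours=[[1], [0]])
print(scaled, smoothed)
```

The point of the ordering is that the gate acts on each site before any neighbourhood mixing, so a site judged less relevant contributes less to its neighbours as well.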

2605.10537 2026-05-12 cs.CL

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

Lungchuan Chen

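Affiliations * MusubiAI

AI summary This paper proposes Mela, a test-time memory consolidation method grounded in memory consolidation theory. Its core is the Hierarchical Memory Module (HMM), which contains two sub-modules with different update frequencies that produce abstract high-level representations and fine-grained episodic representations, respectively. The two are dynamically combined at inference time to form the final memory output. Integrating HMM into a Transformer decoder yields memory-augmented language models that perform online memory consolidation at test time. Mela outperforms Transformer baselines on language modeling at every model size and adapts better to longer contexts when the pretraining context length is held fixed.

Details
English abstract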

Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Experiments on language modeling demonstrate that Mela outperforms Transformer baselines across all model sizes. Moreover, with the pretrained context length fixed at 4K, Mela maintains performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.
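A scalar toy can make the "two update frequencies" idea concrete. The class below is a loose illustration only: it keeps a fast state updated on every input and a slow state updated every `slow_period` inputs, then reads out a mix of the two. The class name, the exponential-moving-average updates, and the fixed `mix` argument are assumptions and do not reproduce HMM, whose combination is learned and context dependent.

```python
class TwoRateMemory:
    """Toy memory with a fast state (updated every step) and a slow state
    (updated every slow_period steps from the fast state)."""

    def __init__(self, slow_period, fast_rate=0.5, slow_rate=0.5):
        self.slow_period = slow_period
        self.fast_rate = fast_rate
        self.slow_rate = slow_rate
        self.fast = 0.0
        self.slow = 0.0
        self.step = 0

    def update(self, x):
        self.step += 1
        # High-frequency path: track the most recent input closely.
        self.fast += self.fast_rate * (x - self.fast)
        # Low-frequency path: consolidate from the fast state only occasionally.
        if self.step % self.slow_period == 0:
            self.slow += self.slow_rate * (self.fast - self.slow)

    def read(self, mix):
        """Blend the two states; mix=1.0 reads only the slow (gist) state."""
        return mix * self.slow + (1.0 - mix) * self.fast


memory = TwoRateMemory(slow_period=2)
memory.update(1.0)
memory.update(1.0)
print(memory.fast, memory.slow, memory.read(0.5))
```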

2605.10536 2026-05-12 cs.LG cs.AI

HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

Honghan Wu, Tianyan Wang, Jiacong Mi, Zhoyang Jiang, Yunsoo Kim

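Affiliations * University of Glasgow; University of International Relations; University College London

AI summary This paper proposes HH-SAE, a hybrid hierarchical autoencoder, to address "feature density conflict": semantic innovations in high-dimensional, mission-critical domains being masked by dense background information. The method factorizes the manifold into contextual, atomic, and compository tiers to discover and steer complex structured knowledge. Experiments show strong results on tasks such as cross-domain zero-shot detection and a marked improvement on knowledge-steered synthesis, supporting its effectiveness in high-precision, high-stakes settings.

Details
English abstract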

Rare semantic innovations in high-dimensional, mission-critical domains are often obscured by dense background contexts, a challenge we define as feature density conflict. We introduce the Hybrid Hierarchical SAE (HH-SAE) to resolve this by factorizing manifolds into a nested hierarchy of Contextual ($L_0$), Atomic ($f_1$), and Compository ($f_2$) tiers. Evaluated across disparate manifolds, HH-SAE demonstrates superior resolution by "fracturing" administrative clinical labels into physiological modes and achieving a peak cross-domain zero-shot AUC of 0.9156 in fraud detection. Path ablation confirms the architecture's structural necessity, revealing a 13.46% utility collapse when contextual subtraction is removed. Finally, knowledge-steered synthesis achieves a +9.9% AUPRC lift over state-of-the-art generators, proving that HH-SAE effectively prioritizes high-order mechanistic innovation over environmental proxies to enable high-precision discovery in high-stakes environments.

2605.10533 2026-05-12 cs.LG

ConfoundingSHAP: Quantifying confounding strength in causal inference

Marie Brockschmidt, Santo M. A. R. Thies, Maresa Schröder, Dennis Frauen, Valentyn Melnychuk, Maximilian Muschalik, Eyke Hüllermeier, Stefan Feuerriegel

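Affiliations * LMU Munich; Munich Center for Machine Learning (MCML); German Research Center for Artificial Intelligence (DFKI)

AI summary In causal inference, confounders influence both treatment assignment and outcomes, but in observational studies the treatment assignment mechanism is unknown, which makes it hard to tell which covariates are confounders. This paper proposes ConfoundingSHAP, a Shapley-value-based method for quantifying the confounding strength of each covariate. It designs a dedicated Shapley game, unlike conventional uses of SHAP for explaining treatment effect heterogeneity, and pairs it with a scalable TabPFN-based estimator that avoids repeated refitting over many adjustment sets, making confounder identification in causal inference more practical and efficient.

Details
English abstract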

In causal inference, confounders are variables that influence both treatment decisions and outcomes. However, unlike in randomized clinical trials, the treatment assignment mechanism in observational studies is not known, and it is thus unclear which covariates act as confounders. Here, we aim to generate insight for causal inference and answer: which of the observed covariates act as confounders? We introduce ConfoundingSHAP, a Shapley-based method for attributing confounding strength to individual covariates. Our contributions are twofold. First, we propose a Shapley game targeted to infer the confounding strength of the covariates. Our resulting Shapley values differ from the standard applications of SHAP explanations on causal targets, such as understanding treatment effect heterogeneity, which are ill-suited for our task. Second, as our task requires evaluating the value function over many adjustment sets, we provide a scalable TabPFN-based estimation that avoids exhaustive refitting. We demonstrate the practical value across various datasets, where ConfoundingSHAP provides informative explanations of which observed covariates drive confounding and thereby helps to provide more insight for causal inference in practice.
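To show why the value function must be evaluated over many adjustment sets, the sketch below computes exact Shapley values by enumerating orderings. The value table is a made-up stand-in for "confounding bias removed by adjusting for a set of covariates". It is not the game defined in the paper, and the TabPFN-based estimator is not reproduced here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


# Hypothetical value table: bias removed when adjusting for each covariate set.
TABLE = {
    frozenset(): 0.0,
    frozenset({"age"}): 0.6,
    frozenset({"income"}): 0.3,
    frozenset({"age", "income"}): 0.7,
}


def toy_value(adjustment_set):
    # "noise" is a covariate that removes no bias in this toy example.
    return TABLE[frozenset(adjustment_set) - {"noise"}]


phi = shapley_values(["age", "income", "noise"], toy_value)
print(phi)
```

Exact enumeration visits every ordering, so its cost grows factorially with the number of covariates, which is why a scalable estimator matters for this task.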

2605.10531 2026-05-12 cs.AI

A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives

Jayalakshmi Baskar, Vera C. Kaelin, Kaan Kilic, Helena Lindgren

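Affiliations * Umeå University, Department of Computing Science

AI summary This study examines whether knowledge-driven, LLM-based storytelling can support purposeful narrative interaction between older adults and a digital companion. To address LLM limitations in hallucination and transparency, it proposes a reflective storytelling agent that combines knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. Results show that users recognised cultural relatability and personal relevance in the generated narratives, and that argument-based narrative purpose and hallucination-risk indicators were associated with narrative quality and user acceptance.

Comments Submitted to ACM Transactions on Intelligent Systems and Technology (TIST)

Details
English abstract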

This work investigates whether knowledge-driven large language model (LLM)-based storytelling can support purposeful narrative interaction with a digital companion for older adults. To address known limitations of LLMs, including hallucinations and limited transparency, we present a reflective storytelling agent integrating knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. The study consisted of two phases. Phase I employed participatory design involving 11 domain experts in a formative evaluation that informed iterative refinement. The resulting system generates narratives grounded in structured user models representing health-promoting activities and motivations. Phase II involved 55 older adults evaluating persona-based narratives across four prompts and two creativity levels. Participants assessed perceived purpose, usefulness, cultural relatability, and inconsistencies. The system additionally computed hallucination-risk indicators to evaluate generated narratives. Participants recognised personally relevant purposes in roughly two thirds of narratives, while argument-based purposes were identified in around half of these cases. Cultural recognisability strongly influenced willingness to use the functionality, whereas minor inconsistencies were often tolerated when narratives remained understandable and personally relevant. Narratives with higher hallucination-risk indicators were more often perceived as inconsistent, while higher argument-quality indicators tended to co-occur with higher clarity and meaningfulness ratings. Overall, the study positions argument mining as a reflective inspection mechanism for comparing formal grounding signals with human evaluations in health-oriented LLM storytelling for older adults.

2605.10529 2026-05-12 cs.AI cs.LG

PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs

Yousef A. Radwan, Yao Li, Qing Qing, Ziqi Xu, Xingtong Yu, Jiaxing Huang, Renqiang Luo, Xikun Zhang

Affiliations * Technology, Innovation, Entrepreneurship Department; King Abdullah University of Science and Technology; School of Computing and Information Systems; The University of Melbourne; College of Computer Science and Technology; School of Computing Technologies; Jilin University; RMIT University; Department of Systems Engineering and Engineering Management; The Chinese University of Hong Kong; Department of Data Science and Artificial Intelligence; Hong Kong Polytechnic University

AI summary This study introduces PrimeKG-CL, a continual graph learning benchmark for evaluating learning methods on evolving biomedical knowledge graphs. Built from nine authoritative biomedical databases, it includes genuine temporal snapshots and multimodal node features, along with several tasks and test stratifications designed to reflect realistic conditions. Experiments show a strong interaction between decoder choice and continual learning strategy, a clear benefit from multimodal features, and that some existing methods fail to run at this scale.

Details
English abstract

Biomedical knowledge graphs underwrite drug repurposing and clinical decision support, yet the upstream ontologies they depend on update on independent cycles that add millions of edges and deprecate hundreds of thousands more between releases. Existing continual graph learning, however, has been studied almost exclusively on synthetic random splits of static, generic KGs, a regime that cannot reproduce the asynchronous, structured evolution real biomedical KGs undergo. To this end, we introduce PrimeKG-CL, a CGL benchmark built from nine authoritative biomedical databases (129K+ nodes, 8.1M+ edges, 10 node types, 30 relation types) with two genuine temporal snapshots (June 2021, July 2023; 5.83M edges added, 889K removed, 7.21M persistent), 10 entity-type-grouped tasks, multimodal node features, and a per-task persistent/added/removed test stratification. On three tasks (biomedical relationship prediction, entity classification, KGQA), we evaluate six CL strategies across four KGE decoders, plus LKGE, an LLM-RAG agent, and CMKL. We find that decoder choice and continual learning strategy interact strongly: no single strategy performs best across all decoders, and mismatched combinations can significantly degrade performance. Moreover, only DistMult exhibits a clear separation between persistent and deprecated knowledge, indicating that standard metrics conflate retention of still-valid facts with failure to forget outdated ones; this effect is absent under RotatE. In addition, multimodal features improve entity-level tasks by up to 60%, and a recent CKGE framework (IncDE) failed to scale to our 5.67M-triple base task across five attempts up to 350GB RAM. Data, pipeline, baselines, and the stratified split are released openly. Dataset: huggingface.co/datasets/yradwan147/PrimeKGCL | Code: github.com/yradwan147/primekg-cl-neurips2026
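The persistent/added/removed stratification between two snapshots reduces to set operations on triples. The sketch below uses made-up triples and a hypothetical `stratify` helper. It is not the benchmark's released pipeline.

```python
def stratify(old_edges, new_edges):
    """Split edges of two snapshots into persistent, added, and removed strata."""
    old, new = set(old_edges), set(new_edges)
    return {
        "persistent": old & new,  # present in both snapshots
        "added": new - old,       # only in the later snapshot
        "removed": old - new,     # only in the earlier snapshot
    }


# Made-up (head, relation, tail) triples for illustration.
snapshot_2021 = [
    ("drugA", "treats", "diseaseX"),
    ("geneB", "associated_with", "diseaseX"),
    ("drugC", "treats", "diseaseY"),
]
snapshot_2023 = [
    ("drugA", "treats", "diseaseX"),
    ("geneB", "associated_with", "diseaseX"),
    ("drugC", "treats", "diseaseZ"),
]

strata = stratify(snapshot_2021, snapshot_2023)
print({name: len(edges) for name, edges in strata.items()})
```

If the three strata partition the snapshots in this way, the abstract's rounded counts imply roughly 7.21M + 0.889M ≈ 8.10M edges in the earlier snapshot and 7.21M + 5.83M ≈ 13.04M in the later one.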

2605.10523 2026-05-12 cs.CV

Improving Human Image Animation via Semantic Representation Alignment

Chang Liu, Mengting Chen, Yixuan Huang, Haoning Wu, Chen Ju, Shuai Xiao, Jinsong Lan, Yanfeng Wang

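Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University, China; Alibaba Group, China

AI summary This paper studies how semantic representation alignment can improve human image animation, targeting the limb twisting and facial distortion that appear when generating long videos or complex motions. It proposes SemanticREPA, which uses a structure alignment module to align structure information in video latents with depth features, and an ID alignment module to align the identity features of generated videos with face recognition features, improving structural stability and identity consistency. The method performs well on complex motion generation and character consistency, offering a higher-quality and more flexible approach to human animation.

Comments Accepted by CVPR 2026 workshop

Details
English abstract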

The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations could decrease the generation flexibility. Moreover, their reliance on RGB pixel supervision also lacks emphasis on learning necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.

2605.10521 2026-05-12 cs.CV cs.AI

DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

Yiqi Tian, Sangjoon Park, Bo Zeng, Pengfei Jin, Yujin Oh, Quanzheng Li

Affiliations * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School; Department of Industrial Engineering, University of Pittsburgh; Department of Radiation Oncology, College of Medicine, Yonsei University; Institute for Innovation in Digital Healthcare, Yonsei University; Department of Biomedical Systems Informatics, College of Medicine, Yonsei University

AI summary Medical image segmentation models can perform unevenly across subgroups, and most existing fairness methods focus on improving average subgroup performance while overlooking hidden failures within a subgroup. This paper proposes the DuetFair mechanism, which jointly considers inter-subgroup adaptation and intra-subgroup robustness, and introduces FairDRO, which combines a distribution-aware mixture-of-experts model with subgroup-conditioned distributionally robust optimization. Experiments show that FairDRO improves both fairness and segmentation performance on several medical image segmentation benchmarks.

Comments 16 pages, 2 figures

Details
English abstract

Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem intra-group hidden failure. To solve this, we propose the DuetFair mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce FairDRO, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points (+6.0%) under the tumor-stage grouping and by 4.1 points (+7.4%) under the institution grouping over the strongest baseline.
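The "hidden failure" problem is easy to reproduce numerically: two subgroups can share the same mean loss while one of them contains a badly failing case. The sketch below contrasts the subgroup mean with a CVaR-style aggregation (the mean of the worst fraction of losses), which is one common DRO-style choice. It is used here as an assumption for illustration and is not necessarily the aggregation used in FairDRO. The loss values are made up.

```python
import math


def mean(losses):
    return sum(losses) / len(losses)


def cvar(losses, alpha):
    """Mean of the worst ceil(alpha * n) losses (conditional value at risk)."""
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


# Made-up per-sample losses for two subgroups with equal mean loss.
groups = {
    "A": [0.2, 0.2, 0.2, 0.2],
    "B": [0.05, 0.05, 0.05, 0.65],
}

for name, losses in groups.items():
    print(name, round(mean(losses), 3), round(cvar(losses, alpha=0.25), 3))
```

Averaging makes the two subgroups look identical, while the tail-focused aggregation exposes the single high-loss sample in subgroup B.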

2605.10518 2026-05-12 cs.CL cs.AI

Infinite Mask Diffusion for Few-Step Distillation

Jaehoon Yoo, Wonjung Kim, Chanhyuk Lee, Seunghoon Hong

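Affiliations * Korea Advanced Institute of Science and Technology (KAIST)

AI summary This paper proposes the Infinite Mask Diffusion Model (IMDM), a diffusion model aimed at few-step generation in language model distillation. Conventional masked diffusion models (MDMs) use a deterministic single-state mask and are therefore limited by factorization error, which makes efficient few-step generation difficult. IMDM introduces a stochastic infinite-state mask that lowers the theoretical error bound, improving generation efficiency while retaining the advantages of MDMs. Experiments show that IMDM outperforms existing distillation methods at small step counts on LM1B and OpenWebText.

Details
English abstract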

Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.
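The factorization error from simultaneous token updates can be seen in a two-token toy. If the data contains only the sequences AA and BB, sampling both positions independently from their marginals in a single step also produces AB and BA. The sketch below measures that gap with total variation distance. The toy distribution and helper names are assumptions for illustration, and the paper's lower bound is not reproduced here.

```python
from itertools import product


def position_marginals(joint, length):
    """Per-position token marginals of a joint distribution over sequences."""
    marginals = [dict() for _ in range(length)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            marginals[i][tok] = marginals[i].get(tok, 0.0) + p
    return marginals


def factorized(marginals):
    """Distribution obtained by sampling every position independently."""
    out = {}
    for combo in product(*(m.items() for m in marginals)):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[seq] = prob
    return out


def total_variation(p, q):
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


joint = {("A", "A"): 0.5, ("B", "B"): 0.5}  # perfectly correlated tokens
one_step = factorized(position_marginals(joint, 2))
gap = total_variation(joint, one_step)
print(one_step, gap)
```

Half of the probability mass lands on sequences that never occur in the data, which is why a purely factorized one-step sampler cannot match this target however well the marginals are learned.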

2605.10516 2026-05-12 cs.AI

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Harsh Raj, Niranjan Orkat, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn, Subhabrata Majumdar

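Affiliations * Northeastern University; University of Pécs; University of Michigan; AT&T Chief Data Office; Indian Institute of Management Bangalore

AI summary This paper proposes a rigorous set of measures for AI agent reliability, quantifying reliability as consistency under semantics-preserving perturbations. It introduces output-level reliability assessment based on $U$-statistics and kernel-based trajectory-level stability analysis, and highlights the distinction between an agent's core capability and its execution robustness. Experiments show that trajectory-level consistency metrics are more diagnostically sensitive than traditional measures, helping to identify and fix architectural issues that hinder deploying agents in high-stakes, real-world environments.

Comments 33 pages, 5 figures, 2 tables

Details
English abstract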

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
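An output-level consistency score of the kind described here can be written as an order-2 U-statistic: the average of a symmetric agreement kernel over all unordered pairs of outputs obtained under perturbed versions of the same task. The exact-match kernel and the function name below are assumptions for illustration. The paper's kernels and trajectory-level metrics are not reproduced.

```python
from itertools import combinations


def pairwise_consistency(outputs, agree=lambda a, b: float(a == b)):
    """Order-2 U-statistic: mean of a symmetric kernel over all unordered pairs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(agree(a, b) for a, b in pairs) / len(pairs)


# Final answers from four semantically equivalent rephrasings of one task.
answers = ["42", "42", "42", "41"]
score = pairwise_consistency(answers)
print(score)
```

Here three of the four runs agree, yet only three of the six pairs match, so the score is 0.5. A pass@1-style rate on the same runs would be 0.75 if "42" is the correct answer, which shows how the two views can differ.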

2605.10510 2026-05-12 cs.LG cs.AI

CMKL: Modality-Aware Continual Learning for Evolving Biomedical Knowledge Graphs

Yousef A. Radwan, Yao Li, Qing Qing, Ziqi Xu, Qixin Zhang, Yongcheng Jing, Renqiang Luo, Xikun Zhang

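Affiliations * Technology, Innovation, Entrepreneurship Department; King Abdullah University of Science and Technology; School of Computing and Information Systems; The University of Melbourne; College of Computer Science and Technology; Jilin University; School of Computing Technologies; RMIT University; College of Computing and Data Science; Nanyang Technological University

AI summary This paper proposes CMKL, a continual learning framework for evolving biomedical knowledge graphs that uses structural, textual, and molecular modalities together. It fuses the modalities through a Mixture-of-Experts routing mechanism and combines EWC regularization with a diverse multimodal replay buffer to protect previously learned knowledge and reduce forgetting. Experiments show that CMKL clearly outperforms the strongest structural baseline on continual entity classification, driven by multimodal features, and on continual relationship prediction matches the strongest baselines while outperforming Joint Training and LKGE.

Details
English abstract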

Biomedical knowledge graphs are increasingly large, dynamic, and multimodal, driven by rapid advances in biotechnology such as high-throughput sequencing. Machine learning models can infer previously unobserved biomedical relationships and characterize biomedical entities in these graphs, but existing knowledge graph embedding methods and their continual learning extensions either assume static graph structure or fail to exploit multimodal information under evolving data distributions. They also apply uniform regularization across all model parameters, ignoring that different modalities may exhibit distinct forgetting dynamics as the graph evolves. We propose the Continual Multimodal Knowledge Graph Learner (CMKL), a CL framework for biomedical KGs that natively encodes structure, text, and molecules, fuses them through a Mixture-of-Experts (MoE) router, and protects previously learned knowledge with standard EWC regularization and a K-means-diverse multimodal replay buffer. We evaluate CMKL on a 129K-entity biomedical continual benchmark with 10 tasks. On continual biomedical entity classification, CMKL reaches AP 0.591 versus 0.370 for the strongest structural baseline, a 60% gain that is driven by access to multimodal features and preserved across the sequence with near-zero forgetting (AF 0.008). On continual relationship prediction, CMKL reaches AP 0.062, matching Naive Sequential and EWC (0.058) within seed noise and outperforming Joint Training (0.047, p=0.045) and LKGE (0.039). A frozen-text ablation reaches AP 0.136, more than double any jointly trained model, yet that signal is unreachable by margin-ranking gradients: the greedy-modality asymmetry lives at the representation level, not the fusion level, and MoE routing manages it by suppressing the unreachable modality without forcing it through a learned bottleneck. Code: github.com/yradwan147/cmkl-neurips2026
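Two pieces of this abstract lend themselves to a quick check: the standard EWC penalty and the quoted 60% gain. The sketch below implements the usual quadratic EWC term, (λ/2) Σ F_i (θ_i − θ*_i)², on plain Python lists and recomputes the relative gain from the two AP values. The parameter values, Fisher estimates, and λ are made up for illustration, and nothing here is taken from the CMKL code.

```python
def ewc_penalty(params, anchor, fisher, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher))


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`."""
    return (new - baseline) / baseline


# Made-up values: the first parameter is important (high Fisher) and has drifted.
penalty = ewc_penalty(params=[1.0, 2.0], anchor=[0.0, 2.0], fisher=[4.0, 1.0], lam=0.5)
gain = relative_gain(0.591, 0.370)  # AP values quoted in the abstract
print(penalty, round(gain * 100, 1))
```

The recomputed gain is about 59.7%, consistent with the rounded 60% figure in the abstract.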

2605.10504 2026-05-12 cs.CL

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

Jinchang Zhu, Jindong Li, Yuwen Hao, Chengyu Zou, Rong Fu, Menglin Yang

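Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper studies how upper-layer attention that specializes too early can hurt language model pretraining. The authors find that in GPT-style models, upper layers form sharp attention patterns before lower-layer features have stabilized, which degrades performance. Temporarily slowing the learning of upper-layer Q/K projections early in training improves final perplexity and downstream accuracy without changing other parameters. The study also identifies multiplicative gated feed-forward networks as the key component that suppresses lower-layer residual writes, and shows that the timing of upper-layer Q/K learning is an important interaction point between decoder architecture and optimization.

Details
English abstract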

A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs reduce a residual-energy factor on the same growth pathway. Our results identify upper-layer Q/K timing as a concrete interaction point between decoder architecture and optimization.
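The intervention described here, slowing only upper-layer Q/K projections early in training, amounts to a per-parameter learning-rate multiplier that depends on layer index and training step. The sketch below is one hypothetical way to express that rule. The cutoff for "upper" layers, the slow-down factor, and the number of slowed steps are all assumed values, since the abstract does not specify them.

```python
def qk_lr_scale(layer, num_layers, step, slow_steps, upper_fraction=0.5, slow_factor=0.1):
    """Learning-rate multiplier for the Q/K projections of one layer.

    Returns slow_factor for layers in the top `upper_fraction` of the stack
    while step < slow_steps, and 1.0 otherwise. All defaults are assumptions.
    """
    is_upper = layer >= num_layers * (1.0 - upper_fraction)
    if is_upper and step < slow_steps:
        return slow_factor
    return 1.0


# 12-layer model (layers indexed 0..11), slow-down active for the first 1000 steps.
for step in (0, 999, 1000):
    print(step, [qk_lr_scale(layer, 12, step, slow_steps=1000) for layer in (0, 5, 6, 11)])
```

In a real training loop this multiplier would apply only to the Q/K projection weights of the selected layers, for example through a separate optimizer parameter group, while every other parameter keeps the base learning rate.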