arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10614 2026-05-12 cs.AI

PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

Riya Tapwal, Abhishek Kumar, Carsten Maple

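AI summary In multi-agent LLM systems, sensitive information accessed by one agent can propagate through shared context into downstream outputs, creating a risk of secret leakage. The paper proposes PRISM, a defence that detects and mitigates secret leakage in real time during generation by combining multiple feature signals into a risk score and intervening as tokens are produced. PRISM relies on shifts in generation dynamics, such as entropy collapse and logit concentration, together with text-structural cues to give an early warning before a leak occurs; experiments report strong detection performance and a 0.0% task-level leak rate across multiple attack scenarios.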

English abstract

Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
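The entropy-collapse signal and the green, yellow, and red zones described in the abstract can be illustrated with a minimal sketch. The function names, the single-signal risk formula, and the thresholds 0.3 and 0.7 are assumptions made for illustration; they are not PRISM's calibrated 16-signal score.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_risk(probs):
    """Hypothetical single-signal risk in [0, 1]: one minus normalized entropy.

    A peaked (low-entropy) distribution yields a risk close to 1.
    """
    max_entropy = math.log(len(probs))
    if max_entropy == 0.0:
        return 1.0  # a one-token vocabulary is fully concentrated
    return 1.0 - token_entropy(probs) / max_entropy

def risk_zone(score, yellow=0.3, red=0.7):
    """Map a risk score to a zone; both thresholds are illustrative assumptions."""
    if score >= red:
        return "red"
    if score >= yellow:
        return "yellow"
    return "green"

uniform = [0.25, 0.25, 0.25, 0.25]  # the model is undecided
peaked = [0.97, 0.01, 0.01, 0.01]   # the model is near-certain of the next token
zones = [risk_zone(entropy_risk(p)) for p in (uniform, peaked)]
```

In this toy setting the uniform distribution lands in the green zone and the peaked one in the red zone, which mirrors the idea that a sharp drop in entropy can act as an early warning.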

2605.10606 2026-05-12 cs.CL cs.AI

Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings

Benjamin Icard, Lila Sainero, Alice Breton, Evangelia Zve, Jean-Gabriel Ganascia

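AI summary This study examines how sensitive language model embeddings are to authorial style in French, using a controlled literary corpus to quantify the effect of stylistic variation on embedding dispersion. It finds that embeddings reliably capture authorial stylistic features, that these features persist after LLM rewriting, and that the rewritten texts also exhibit LLM-specific patterns. The results point to new analytical directions for detecting authorship imitation in the era of language models.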

Comments To appear in the Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities (NLP4DH 2026)

English abstract

Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
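One way to make "embedding dispersion" concrete is the mean distance of a set of embeddings from their centroid. The sketch below assumes that definition and uses made-up two-dimensional vectors; the abstract does not specify which dispersion measure the authors use.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dispersion(vectors):
    """Mean Euclidean distance to the centroid (an assumed dispersion measure)."""
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)

# Toy embeddings: texts in one style cluster tightly, mixed styles spread out.
same_style = [[0.0, 0.0], [0.0, 2.0]]
mixed_styles = [[0.0, 0.0], [0.0, 6.0]]

tight = dispersion(same_style)
spread = dispersion(mixed_styles)
```

Under this measure, a larger value for the mixed set than for the single-style set would indicate that the embedding space separates the styles.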

2605.10605 2026-05-12 cs.CL

Where do aspectual variants of light verb constructions belong?

Aggeliki Fotopoulou, Eric Laporte, Takuya Nakamura

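AI summary This paper studies expressions built on aspectual variants of light verbs, such as "take on debt" versus "have debt", whose category membership is hard to determine. The authors analyse the properties of such expressions and propose a set of features that separates verbal idioms, light verb constructions, and compositional phrases more clearly. The work provides more discriminating criteria for classifying these expressions in natural language processing.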

Journal ref Proceedings of the 17th Workshop on Multiword Expressions (MWE), August 2021, France, pp.2-12
English abstract

Expressions with an aspectual variant of a light verb, e.g. 'take on debt' vs. 'have debt', are frequent in texts but often difficult to classify between verbal idioms, light verb constructions or compositional phrases. We investigate the properties of such expressions with a disputed membership and propose a selection of features that determine more satisfactory boundaries between the three categories in this zone, assigning the expressions to one of them.

2605.10604 2026-05-12 cs.LG cs.AI cs.CY

Fairness vs Performance: Characterizing the Pareto Frontier of Algorithmic Decision Systems

Mieke Wilms, Christoph Heitz

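AI summary This paper studies the trade-off between fairness and performance in algorithmic decision systems, modelling it as a multi-objective optimization problem that considers both decision-maker utility and group fairness. It finds that Pareto-optimal decision rules consist of deterministic, group-specific threshold rules, and that the location of the Pareto frontier depends only on population characteristics, utility functions, and the fairness metric, not on the technical design of the algorithm. The results extend existing optimality theorems for fairness-constrained classification to a broader range of fairness metrics and to partial fairness regimes, providing a theoretical basis for evaluating and comparing algorithmic decision systems.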

Comments 23 pages, The 2026 ACM conference on Fairness, Accountability, and Transparency (FAccT'26)

English abstract

Designing fair algorithmic decision systems requires balancing model performance with fairness toward affected individuals: more fairness might require sacrificing some performance and vice versa, yet the space of possible trade-offs is still poorly understood. We investigate fairness in binary prediction-based decision problems by conceptualizing decision making as a multi-objective optimization problem that simultaneously considers decision-maker utility and group fairness. We investigate the set of Pareto-optimal decision rules for arbitrary decision-maker utility functions, arbitrary population distributions, and a wide range of group fairness metrics. We find that the Pareto frontier consists of deterministic, group-specific threshold rules applied to individuals' success probability. This complements existing optimality theorems from the literature which, for specific fairness constraints, posit lower-bound threshold rules only. However, we also show that, depending on the fairness metric used, the Pareto frontier may include upper-bound threshold rules, thus preferring individuals with lower success probabilities. We show that the location of the Pareto frontier depends only on population characteristics, utility functions, and fairness score, but not on the technical design of the algorithm: our findings hold for pre-, in-, and post-processing approaches alike. Our results generalize existing optimality theorems for fairness-constrained classification and extend them to generalized fairness metrics and fairness principles, and to partial fairness regimes. This paper connects formal fairness research with legal and ethical requirements to search for less discriminatory alternatives, offering a principled foundation for evaluating and comparing algorithmic decision systems.
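A group-specific threshold rule of the lower-bound kind can be sketched in a few lines. The population, the threshold values, and the use of the acceptance-rate gap (statistical parity) as the fairness score are assumptions chosen for illustration; the paper covers a wider range of metrics and also upper-bound rules, which this sketch does not model.

```python
from collections import defaultdict

def decide(success_prob, group, thresholds):
    """Deterministic lower-bound rule: accept when the probability meets the group's threshold."""
    return success_prob >= thresholds[group]

def acceptance_rates(population, thresholds):
    """Fraction of accepted individuals per group."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for group, prob in population:
        total[group] += 1
        accepted[group] += decide(prob, group, thresholds)
    return {g: accepted[g] / total[g] for g in total}

# Hypothetical population of (group, success probability) pairs.
population = [
    ("A", 0.9), ("A", 0.7), ("A", 0.4), ("A", 0.2),
    ("B", 0.8), ("B", 0.5), ("B", 0.3), ("B", 0.1),
]

single = acceptance_rates(population, {"A": 0.6, "B": 0.6})
group_specific = acceptance_rates(population, {"A": 0.6, "B": 0.5})
parity_gap_single = abs(single["A"] - single["B"])
parity_gap_specific = abs(group_specific["A"] - group_specific["B"])
```

With a shared threshold the two groups are accepted at different rates; lowering the threshold for group B closes the gap at the cost of accepting an individual with a lower success probability, which is the kind of trade-off the Pareto frontier describes.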

2605.10601 2026-05-12 cs.AI

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

Phongsakon Mark Konrad, Tim Lukas Adam, Ane Cathrine Holst Merrild, Riccardo Terrenzi, Rebecca De Rosa, Toygar Tanyel, Serkan Ayvaz

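AI summary This paper examines the problem of relying too heavily on model interpretability to ensure safety when deploying AI in sensitive domains such as health care, credit, employment, and criminal justice. The authors argue that a "calibrated verification" regime should replace current practice, with authorization that is domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. They propose "Verification Coverage" as a standard to be reported alongside capability scores in model cards, leaderboards, and regulatory disclosures, for a fuller assessment of the deployment safety of AI systems.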

English abstract

AI deployment in sensitive domains such as health care, credit, employment, and criminal justice is often treated as unsafe to authorize until model internals can be explained. This often leads to an excessive reliance on mechanistic interpretability to address a deployment challenge beyond its intended scope. We argue that the gate should instead be calibrated verification: authorization should be domain-scoped, independently checkable, monitored after release, accountable, contestable, and revocable. The reason is twofold. First, model capability is uneven across nearby tasks, so authorization must attach to a specific use rather than to a model in general. Second, societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation. Recent evidence reinforces this distinction between mechanistic understanding and deployment authority: a 53-percentage-point gap between internal representations and output correction shows that understanding may not translate into action, while one scoping review found that only 9.0% of FDA-approved AI/ML device documents contained a prospective post-market surveillance study. We propose Verification Coverage, a six-component reportable standard with a minimum-composition rule, as the metric that should sit beside capability scores in model cards, leaderboards, and regulatory disclosures.

2605.10598 2026-05-12 cs.AI

Budget-Efficient Automatic Algorithm Design via Code Graph

Maxime Bouscary, Manxi Wu, Saurabh Amin

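AI summary This study proposes an efficient automatic algorithm design method based on a code graph, addressing the inefficient use of compute in existing approaches. By representing algorithms as a directed acyclic graph and using an LLM to generate local code corrections rather than full algorithms, it explores the algorithm space more efficiently. Experiments show that the method outperforms full-algorithm search at equal token budget and identify the conditions under which richer context helps or hinders model performance.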

English abstract

Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.
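The idea of corrections as compact operators that add, replace, or remove code blocks can be sketched as follows. This is a simplified illustration over a flat mapping of named blocks; the `Correction` class, the block names, and the helper function calls inside the block strings are hypothetical, and the paper's directed acyclic graph and credit assignment are not modelled here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class Correction:
    """A compact edit to one named code block (hypothetical structure)."""
    op: str                     # "add", "replace", or "remove"
    block: str                  # name of the targeted code block
    code: Optional[str] = None  # new source for "add" and "replace"

def apply_correction(blocks: Dict[str, str], c: Correction) -> Dict[str, str]:
    """Return a new block mapping with the correction applied; the input is not mutated."""
    updated = dict(blocks)
    if c.op == "add":
        if c.block in updated:
            raise ValueError(f"block {c.block!r} already exists")
        updated[c.block] = c.code or ""
    elif c.op == "replace":
        if c.block not in updated:
            raise KeyError(c.block)
        updated[c.block] = c.code or ""
    elif c.op == "remove":
        if c.block not in updated:
            raise KeyError(c.block)
        del updated[c.block]
    else:
        raise ValueError(f"unknown op {c.op!r}")
    return updated

base = {"init": "x = start()", "step": "x = greedy_move(x)"}
c1 = Correction("replace", "step", "x = two_opt_move(x)")
c2 = Correction("add", "restart", "x = perturb(x)")

# Corrections compose: each application yields a new candidate algorithm.
variant_a = apply_correction(base, c1)
variant_b = apply_correction(variant_a, c2)
```

Because every candidate is expressed as a set of corrections on top of earlier ones, a fitness score can in principle be attributed to individual corrections rather than to whole algorithms.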

2605.10593 2026-05-12 cs.AI cs.CL cs.HC cs.SE

LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

Philipp Steigerwald, Mara Stieler, Jennifer Burghardt, Eric Rudolph, Jens Albrecht

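AI summary LLARS is an open-source platform designed to support collaboration between domain experts and developers building LLM-based systems. It integrates three tightly connected modules (collaborative prompt engineering, batch generation, and hybrid evaluation), supporting real-time collaboration, cost-controlled output generation, and multi-method evaluation that combines human and LLM evaluators. According to the study, LLARS makes interdisciplinary collaboration more efficient, streamlines the workflow, and helps identify the best model-prompt combination.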

Comments Accepted at IJCAI-ECAI 2026 Demonstrations Track. Demo video: https://youtu.be/3QaKouwr4gU

English abstract

We demonstrate LLARS (LLM Assisted Research System), an open-source platform that bridges the gap between domain experts and developers for building LLM-based systems. It integrates three tightly connected modules into an end-to-end pipeline: Collaborative Prompt Engineering for real-time co-authoring with version control and instant LLM testing, Batch Generation for configurable output production across user-selected prompts $\times$ models $\times$ data with cost control, and Hybrid Evaluation where human and LLM evaluators jointly assess outputs through diverse assessment methods, with live agreement metrics and provenance analysis to identify the best model-prompt combination for a given use case. New prompts and models are automatically available for batch generation and completed batches can be turned into evaluation scenarios with a single click. Interviews with six domain experts and three developers in online counselling confirmed that LLARS feels intuitive, saves considerable time by keeping everything in one place and makes interdisciplinary collaboration seamless.

2605.10588 2026-05-12 cs.CV

Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

Yanbing Zhang, Bo Wang, Jianhui Liu, Nan Jiang, Jiaxiu Jiang, Haoze Sun, Yijun Yang, Shenghe Zheng, Lin Song, Haoyang Huang, Nan Duan, Wenbo Li

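AI summary Current Large Multimodal Models (LMMs) perform poorly on spatial reasoning tasks that require viewpoint-dependent understanding, mainly because they are limited to a single, static observation. The study proposes a new paradigm called Thinking with Novel Views (TwNV), which improves a model's understanding of spatial relations by introducing synthesized novel-view images into the reasoning process. Experiments show that TwNV improves performance across several spatial subtasks and LMM architectures, supporting novel-view generation as a way to strengthen spatial intelligence.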

Comments Submitted to NeurIPS 2026

English abstract

Current Large Multimodal Models (LMMs) struggle with spatial reasoning tasks requiring viewpoint-dependent understanding, largely because they are confined to a single, static observation. We propose Thinking with Novel Views (TwNV), a paradigm that integrates generative novel-view synthesis into the reasoning loop: a Reasoner LMM identifies spatial ambiguity, instructs a Painter to synthesize an alternative viewpoint, and re-examines the scene with the additional evidence. Through systematic experiments we address three research questions. (1) Instruction format: numerical camera-pose specifications yield more reliable view control than free-form language. (2) Generation fidelity: synthesized view quality is tightly coupled with downstream spatial accuracy. (3) Inference-time visual scaling: iterative multi-turn view refinement further improves performance, echoing recent scaling trends in language reasoning. Across four spatial subtask categories and four LMM architectures (both closed- and open-source), TwNV consistently improves accuracy by +1.3 to +3.9 pp, with the largest gains on viewpoint-sensitive subtasks. These results establish novel-view generation as a practical lever for advancing spatial intelligence of LMMs.

2605.10586 2026-05-12 cs.CV

CausalGS: Learning Physical Causality of 3D Dynamic Scenes with Gaussian Representations

Nengbo Lu, Minghua Pan

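AI summary This paper proposes CausalGS, a framework that learns the physical causality of complex dynamic 3D scenes from multi-view videos alone, without relying on explicit priors. Its core is an inverse physics inference module that decomposes the dynamics into two jointly inferred factors, the scene's initial velocity field and its intrinsic material properties, and uses a differentiable physics simulator for physics-regularized learning. Experiments show that CausalGS outperforms existing methods on long-term future frame extrapolation and performs strongly on novel view interpolation, demonstrating that it can learn interactions between physical properties and causal relationships from visual observations alone.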

Comments ICMR2026 Accepted

English abstract

Learning a physical model from video data that can comprehend physical laws and predict the future trajectories of objects is a formidable challenge in artificial intelligence. Prior approaches either leverage various Partial Differential Equations (PDEs) as soft constraints in the form of PINN losses, or integrate physics simulators into neural networks; however, they often rely on strong priors or high-quality geometry reconstruction. In this paper, we propose CausalGS, a framework that learns the causal dynamics of complex dynamic 3D scenes solely from multi-view videos, while dispensing with the reliance on explicit priors. At its core is an inverse physics inference module that decouples the complex dynamics problem from the video into the joint inference of two factors: the initial velocity field representing the scene's kinematics, and the intrinsic material properties governing its dynamics. This inferred physical information is then utilized within a differentiable physics simulator to guide the learning process in a physics-regularized manner. Extensive experiments demonstrate that CausalGS surpasses the state-of-the-art on the highly challenging task of long-term future frame extrapolation, while also exhibiting advanced performance in novel view interpolation. Crucially, our work shows that, without any human annotation, the model is able to learn the complex interactions between multiple physical properties and understand the causal relationships driving the scene's dynamic evolution, solely from visual observations.

2605.10585 2026-05-12 cs.LG

Controllability in preference-conditioned multi-objective reinforcement learning

Pau de las Heras Molins, Beyazit Yalcinkaya, Lasse Peters, David Fridovich-Keil, Georgios Bakirtzis

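AI summary This paper studies controllability in preference-conditioned multi-objective reinforcement learning, that is, whether changes in user preference reliably change the agent's behavior. The authors point out that existing evaluation metrics cannot measure this property, so agents may be insensitive to the preference input while still scoring well. They therefore propose a new metric that measures the controllability of preference-conditioned agents more accurately, with the aim of advancing preference adaptation in multi-objective reinforcement learning.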

English abstract

Multi-objective reinforcement learning (MORL) allows a user to express preference over outcomes in terms of the relative importance of the objectives, but standard metrics cannot capture whether changes in preference reliably change the agent's behavior in the intended way, a property termed controllability. As a result, preference-conditioned agents can score well on standard MORL metrics while being insensitive to the preference input. If the ability to control agents cannot be reliably assessed, the symbolic interface that MORL provides between user intent and agent behavior is broken. Mainstream MORL metrics alone fail to measure the controllability of preference-conditioned agents, motivating a complementary metric specifically designed to that end. We hope the results spur discussion in the community on existing evaluation protocols to consolidate advances in preference adaptation in MORL to larger and more complex problems.

2605.10579 2026-05-12 cs.CL

VISTA: A Generative Egocentric Video Framework for Daily Assistance

Yu-Hsiang Liu, Yu-Chien Tang, An-Zi Yen

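AI summary This paper proposes VISTA, a generative egocentric video framework that provides high-quality training and evaluation data for AI agents in daily assistance tasks. The framework uses a five-step script generation pipeline with causal reverse reasoning to produce diverse, logically grounded intervention scenarios covering two levels of agent autonomy, reactive and proactive. VISTA lets users customize and refine scenarios, yielding scalable and controllable video benchmarks for daily tasks as an alternative to real-world data collection for training and evaluating AI agents.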

Comments pre-print

English abstract

Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.

2605.10576 2026-05-12 cs.CV cs.AI

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

Chen Zhong, Xiao An, Jiaxing Sun, Zihan Gui, Guangyi Yang, Wei He

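AI summary This paper proposes SenseBench, the first benchmark dedicated to evaluating the low-level visual perception and description abilities of large vision-language models on remote sensing imagery. Addressing the inability of current image quality assessment methods to describe remote sensing degradations accurately, the authors build more than 10,000 carefully annotated samples covering 6 major and 22 fine-grained degradation categories and design two evaluation protocols, perception and description. The evaluation reveals key problems in existing models, including domain bias and confusion under multiple degradations, and supports further development of models for low-level visual perception in remote sensing.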

English abstract

Low-level visual perception underpins reliable remote sensing (RS) image analysis, yet current image quality assessment (IQA) methods output uninterpretable scalar scores rather than characterizing physics-driven RS degradations, deviating markedly from the diagnostic needs of RS experts. While Vision-Language Models (VLMs) present a compelling alternative by delivering language-grounded IQA, their visual priors are heavily biased toward ground-level natural images. Consequently, whether VLMs can overcome this domain gap to perceive and articulate RS artifacts remains insufficiently studied. To bridge this gap, we propose SenseBench, the first dedicated diagnostic benchmark for RS low-level visual perception and description. Driven by a physics-based hierarchical taxonomy that unifies both non-reference and reference-based paradigms, SenseBench features over 10K meticulously curated instances across 6 major and 22 fine-grained RS degradation categories. Specifically, two complementary protocols are designed for evaluation: objective low-level visual perception and subjective diagnostic description. Comprehensive evaluation of 29 state-of-the-art VLMs reveals not only skewed domain priors and multi-distortion collapse, but also fluency illusion and a perception-description inversion effect. We hope SenseBench provides a robust evaluation testbed and high-quality diagnostic data to advance the development of VLMs in RS low-level perception. Code and datasets are available at https://github.com/Zhong-Chenchen/SenseBench.

2605.10572 2026-05-12 cs.LG

Online Sharp-Calibrated Bayesian Optimization

Marshal Arijona Sinaga, Julien Martinelli, Teemu Turpeinen, Samuel Kaski

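AI summary This paper studies how to obtain uncertainty estimates that are both sharp and well-calibrated in online Bayesian optimization. The authors propose Online Sharp-Calibrated Bayesian Optimization (OSCBO), an algorithm that adaptively tunes the uncertainty of the Gaussian process model by casting kernel hyperparameter selection as a constrained online-learning problem. The method preserves sublinear regret bounds and performs competitively on several synthetic and real-world benchmarks.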

English abstract

Bayesian optimization (BO) is a widely used framework for optimizing expensive black-box functions, commonly based on Gaussian process (GP) surrogate models. Its effectiveness relies on uncertainty quantification that is both sharp (informative) and well-calibrated along the BO trajectory. In practice, GP kernel hyperparameters are unknown and are refit online from sequentially collected (non-i.i.d.) data, which can yield miscalibrated or overly conservative uncertainty and lies outside the fixed-kernel assumptions of standard BO regret theory. We propose Online Sharp-Calibrated Bayesian Optimization (OSCBO), a BO algorithm that adaptively balances GP sharpness and calibration by casting hyperparameter selection as a constrained online-learning problem. We also show that OSCBO preserves sublinear regret bounds by leveraging the theoretical guarantees of the underlying online learning algorithm. Empirically, OSCBO performs competitively across synthetic and real-world benchmarks, ranking among the strongest methods in final simple regret while maintaining robust cumulative-regret behavior.

2605.10569 2026-05-12 cs.AI

Deep Arguing

Adam Gould, Francesca Toni

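AI summary This paper proposes Deep Arguing, a novel neurosymbolic approach aimed at making deep learning models more interpretable for classification across different data modalities. The method combines deep neural networks with argumentation construction and reasoning, so that the model builds an argumentation structure supporting its predictions and is trained with differentiable argumentation semantics to learn feature representations and argumentative interactions jointly. Experiments show that the method remains competitive in predictive performance while providing faithful case-based explanations, improving interpretability and reasoning.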

English abstract

Deep learning has become the dominant approach for creating high capacity, scalable models across diverse data modalities. However, because these models rely on a large number of learned parameters, tightly couple feature extraction with task objectives, and often lack explicit reasoning mechanisms, it is difficult for humans to understand how they arrive at their predictions. Understanding what representations emerge and why they arise from the training data remains an open challenge. We introduce Deep Arguing, a novel neurosymbolic approach that integrates deep learning with argumentation construction and reasoning for interpretable classification with different data modalities. In our approach deep neural networks construct an argumentation structure wherein data points support their assigned label and attack different ones. Using differentiable argumentation semantics for reasoning, the model is trained end-to-end to jointly learn feature representation and argumentative interactions. This results in argumentation structures providing faithful case-based explanations for predictions. Structure constraints over the argumentation graph guide learning, improving both interpretability and predictive performance. Experiments with tabular and imaging datasets show that Deep Arguing achieves performance competitive with standard baselines whilst offering interpretable argumentative reasoning.

2605.10567 2026-05-12 cs.CV

VeloGauss: Learning Physically Consistent Gaussian Velocity Fields from Videos

Nengbo Lu, Bin Zhao

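AI summary This paper proposes VeloGauss, a method that jointly models the geometry, appearance, and physical information of 3D scenes from dynamic multi-view videos alone, without relying on any physical priors. It learns a motion field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and adds Global Physical Constraints to keep the scene physically consistent. Experiments show that VeloGauss outperforms existing methods on both novel view interpolation and future frame extrapolation.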

Comments ICME2026 Accepted

English abstract

In this paper, we aim to jointly model the geometry, appearance, and physical information of 3D scenes solely from dynamic multi-view videos, without relying on any physical priors. Existing works typically employ physical losses merely as soft constraints or integrate physical simulations into neural networks; however, these approaches often fail to effectively learn complex motion physics. Although modeling velocity fields holds the potential to capture authentic physical information, due to the lack of appropriate physical constraints, current methods are unable to correctly learn the interaction mechanisms between rigid and non-rigid particles. To address this, we propose VeloGauss, designed to learn the physical properties of complex dynamic 3D scenes without physical priors. Our method learns the velocity field for each Gaussian particle by introducing a Physics Code and a Particle Dynamics System, and ultimately incorporates Global Physical Constraints to ensure the physical consistency of the scene. Extensive experiments on four public datasets demonstrate that our method achieves state-of-the-art performance in both Novel View Interpolation and Future Frame Extrapolation tasks.

2605.10564 2026-05-12 cs.CV cs.RO

DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

Lingjun Zhang, Changjie Wu, Linzhe Shi, Jiangyang Li, Jiaxin Liu, Lei Yang, Hang Zhang, Mu Xu, Hong Wang

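AI summary This paper proposes DeepSight, a world model for end-to-end autonomous driving that models long-horizon future world states by predicting, in parallel, the latent semantic features of consecutive future frames in bird's-eye-view (BEV) space. It also introduces an efficient, adaptive text reasoning mechanism that draws on additional social knowledge and reasoning capabilities to improve driving performance in challenging long-tail scenarios. Experiments show that the method achieves state-of-the-art results on the closed-loop Bench2drive benchmark.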

Comments ICML 2026

English abstract

End-to-end autonomous driving systems are increasingly integrating Vision-Language Model (VLM) architectures, incorporating text reasoning or visual reasoning to enhance the robustness and accuracy of driving decisions. However, the reasoning mechanisms employed in most methods are direct adaptations from general domains, lacking in-depth exploration tailored to autonomous driving scenarios, particularly within visual reasoning modules. In this paper, we propose a driving world model that performs parallel prediction of latent semantic features for consecutive future frames in the bird's-eye-view (BEV) space, thereby enabling long-horizon modeling of future world states. We also introduce an efficient and adaptive text reasoning mechanism that utilizes additional social knowledge and reasoning capabilities to further improve driving performance in challenging long-tail scenarios. We present a novel, efficient, and effective approach that achieves state-of-the-art (SOTA) results on the closed-loop Bench2drive benchmark. Code is available at: https://github.com/hotdogcheesewhite/DeepSight.

2605.10563 2026-05-12 cs.CL cs.AI

ThreatCore: A Benchmark for Explicit and Implicit Threat Detection

Davide Bruni, Carlo Bardazzi, Maurizio Tesconi

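AI summary ThreatCore is a publicly available benchmark dataset for fine-grained threat detection that distinguishes explicit threats, implicit threats, and non-threats, addressing the lack of consistent definitions and standardized benchmarks for threat detection in natural language processing. The dataset aggregates several public resources and systematically re-annotates them under a unified definition of threat, revealing substantial inconsistencies in existing labels, and adds manually validated synthetic examples to improve coverage of implicit threats. Experiments show that implicit threats are harder to detect than explicit ones and that Semantic Role Labeling as an intermediate representation can improve model performance, underlining the value of ThreatCore for research on fine-grained threat detection.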

English abstract

Threat detection in Natural Language Processing lacks consistent definitions and standardized benchmarks, and is often conflated with broader phenomena such as toxicity, hate speech, or offensive language. In this work, we introduce ThreatCore, a publicly available benchmark dataset for fine-grained threat detection that distinguishes between explicit threats, implicit threats, and non-threats. The dataset is constructed by aggregating multiple publicly available resources and systematically re-annotating them under a unified operational definition of threat, revealing substantial inconsistencies across existing labels. To improve the coverage of underrepresented cases, particularly implicit threats, we further augment the dataset with synthetic examples, which are manually validated using the same annotation protocol adopted for the re-annotation of the public datasets, ensuring consistency across all data sources. We evaluate Perspective API, zero-shot classifiers, and recent language models on ThreatCore, showing that implicit threats remain substantially harder to detect than explicit ones. Our results also indicate that incorporating Semantic Role Labeling as an intermediate representation can improve performance by making the structure of harmful intent more explicit. Overall, ThreatCore provides a more consistent benchmark for studying fine-grained threat detection and highlights the challenges that current models still face in identifying indirect expressions of harmful intent.

2605.10560 2026-05-12 cs.CL

ICT-NLP at SemEval-2026 Task 3: Less Is More -- Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression

Liyuan Huang, Jiawei He, Wutao Shen, Lin Li, Jin Zhang

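AI summary This paper presents the authors' system for SemEval-2026 Task 3 (Dimensional Aspect Sentiment Regression), a lightweight and resource-efficient multilingual solution built entirely on pre-trained encoders, without relying on large language models or external corpora. The system uses joint multilingual and multi-domain training to improve cross-lingual transfer and alleviate data sparsity, a bounded regression transformation to stabilize training and constrain the prediction range, and an adaptive ensemble found by subset search to reduce prediction variance. Experimental results show strong performance across datasets in several languages, with multiple top placements.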

English abstract

This paper describes our system for SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.
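A common way to keep regression outputs inside a valid range is to pass the raw model output through a scaled sigmoid. The sketch below assumes that form and a score range of 1 to 9; both are assumptions for illustration, since the abstract does not give the exact transformation or range used by the system.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bounded_prediction(raw, low=1.0, high=9.0):
    """Map an unbounded raw output into [low, high] (assumed range)."""
    return low + (high - low) * sigmoid(raw)

midpoint = bounded_prediction(0.0)       # raw output of zero maps to the middle of the range
very_low = bounded_prediction(-1000.0)   # extreme outputs saturate at the bounds
very_high = bounded_prediction(1000.0)
```

Because the mapping saturates smoothly, extreme raw outputs cannot produce out-of-range scores, which is one plausible reason such a transformation could also stabilize training.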

2605.10555 2026-05-12 cs.AI

Agent-First Tool API: A Semantic Interface Paradigm for Enterprise AI Agent Systems

Kai Pan

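AI summary As AI agents move from research prototypes to enterprise production systems, the tool interfaces they use are still based on human-oriented CRUD paradigms. This paper proposes a semantic interface paradigm called the Agent-First Tool API, which addresses five architectural mismatches between conventional APIs and the needs of autonomous agents through a six-verb semantic protocol, a normalized tool contract, and a dual-layer governance pipeline. The approach is validated on a production multi-tenant SaaS platform, where it raises the task success rate and reduces human interventions, supporting its effectiveness for enterprise AI agent systems.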

English abstract

As AI agents transition from research prototypes to enterprise production systems, the tool interfaces they consume remain rooted in human-oriented CRUD paradigms. This paper identifies five fundamental architectural mismatches between conventional APIs and autonomous agent requirements: exact-identifier dependence, rendering-oriented responses, single-shot interaction assumptions, user-equivalent authorization, and opaque error semantics. We propose the Agent-First Tool API paradigm, comprising three integrated mechanisms: (1) a Six-Verb Semantic Protocol that decomposes tool interactions into search, resolve, preview, execute, verify, and recover phases; (2) a Normalized Tool Contract (NTC) providing structured decision-support metadata including confidence scores, evidence chains, and suggested next actions; and (3) a dual-layer governance pipeline combining static capability policies with dynamic risk escalation. The paradigm is implemented and validated in a production multi-tenant SaaS platform serving 85 registered tools across 6 business domains. Comparative experiments on 50 real operational tasks demonstrate that Agent-First APIs achieve 88% end-to-end task success rate versus 64% for optimized CRUD baselines (a 37.5% relative improvement), while reducing required human interventions by 72.7% and improving autonomous error recovery by 5.8x. We establish that the paradigm is orthogonal and complementary to transport-layer standards such as MCP, operating as the semantic application layer above existing tool discovery and invocation protocols.
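The success-rate comparison reads more clearly when the absolute and relative gains are separated. The short check below uses only the two percentages reported in the abstract; the function name is illustrative.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline

agent_first = 88   # end-to-end task success rate, in percent
crud = 64          # optimized CRUD baseline, in percent

absolute_gain_pp = agent_first - crud                     # percentage points
relative_gain = relative_improvement(agent_first, crud)   # fraction of baseline
```

The gap is 24 percentage points in absolute terms, and 24 / 64 = 0.375 gives the 37.5% relative figure quoted in the abstract.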

2605.10551 2026-05-12 cs.LG

It's All Connected: Topology-Aware Structural Graph Encoding Improves Performance on Polymer Prediction

H. Ibrahim Erdogan, Punith Raviswamy, Nikita Agrawal, Yannik Köster, Stefan Zechel, Ulrich S. Schubert, Ruben Mayer, Christopher Kuenneth

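AI summary This study addresses the data scarcity and structural complexity that graph neural networks (GNNs) face in polymer property prediction, proposing a topology-aware graph construction based on the molecular mass distribution that directly encodes chain-scale structural information. Combined with rich chemical feature descriptors and a self-supervised pretraining strategy, the method significantly improves prediction on a dataset of only 381 polymers, reducing RMSE by 5.1% relative to the conventional repeat-unit graph approach. Experiments show that the combination of graph construction and pretraining is the key to the gain, and that the method applies to multiple GNN architectures.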

Comments 9 pages, 4 figures

Details
English abstract

Graph Neural Networks (GNNs) have achieved strong results in molecular property prediction, but polymers present distinct challenges: labeled datasets are scarce and small (typically on the order of hundreds of polymers) due to the need for expensive experimentation, and complex polymer chain distributions influence polymer properties. Established practice in polymer prediction represents polymers solely by graphs of their repeat units, discarding the chain-scale morphology that governs key properties such as the glass transition temperature ($T_g$). In this work, we propose a principled graph construction that addresses this gap. Given a polymer's molecular mass distribution (MMD), we sample representative chains from the Schulz-Zimm distribution and construct representative sets of large graphs encoding chain-scale topology directly, with atoms and bonds featurized using rich chemical descriptors. We further pretrain GNN encoders via masked graph modeling on 100,000 unlabeled PSMILES strings before fine-tuning on labeled data. On a dataset of 381 polymers (180 homopolymers and 201 copolymers), we show that graph construction and self-supervised pretraining are jointly necessary: without pretraining, the large graph method matches the repeat-unit baseline (28.40 K vs. 28.36 K RMSE); with pretraining, it achieves 24.76 K +/- 3.30 K, a 5.1% reduction in mean error over the pretrained repeat-unit baseline (26.08 K +/- 4.20 K, p < 0.001, 30 runs). An ablation removing chemical features degrades performance to 36.65 K, confirming both components are essential. Results are architecture-agnostic, holding for both GINE and GATv2 encoders.
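The chain-sampling step can be sketched as follows. The abstract does not say how the Schulz-Zimm parameters are obtained, so this sketch assumes they come from a number-average chain length and a dispersity (Mw/Mn); the function name and arguments are hypothetical. It relies on the fact that a Schulz-Zimm number distribution is a gamma distribution with shape k, for which Mw/Mn = (k + 1) / k.

```python
import random


def sample_chain_lengths(mn_units, dispersity, n_chains, seed=0):
    """Draw chain lengths (in repeat units) from a Schulz-Zimm (gamma) number distribution.

    With shape k, Mw/Mn = (k + 1) / k, so k = 1 / (dispersity - 1);
    the scale mn_units / k then gives a number-average length of mn_units.
    """
    if dispersity <= 1.0:
        raise ValueError("dispersity must exceed 1 for a Schulz-Zimm distribution")
    k = 1.0 / (dispersity - 1.0)
    scale = mn_units / k
    rng = random.Random(seed)
    # Round to whole repeat units and keep at least one unit per chain.
    return [max(1, round(rng.gammavariate(k, scale))) for _ in range(n_chains)]


def dispersity_of(lengths):
    """Empirical Mw/Mn of a sample, treating length as proportional to mass."""
    mean = sum(lengths) / len(lengths)
    mean_sq = sum(n * n for n in lengths) / len(lengths)
    return mean_sq / (mean * mean)
```

Each sampled length would then determine how many repeat units are concatenated into one large chain graph; that graph-building step depends on details the abstract does not give and is left out here.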

2605.10547 2026-05-12 cs.LG

PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay

Zetao Yang

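AI summary This paper proposes PhysEDA, an efficient learning framework for electronic design automation (EDA) built on physical prior knowledge, aiming to address the high computational complexity and data-scarcity-induced overfitting that conventional attention mechanisms and reinforcement learning methods face on EDA tasks. By introducing Manhattan distance decay as an inductive bias, the method designs a physics-structured linear attention module with linear complexity and combines it with a potential-based reward shaping strategy, improving performance in cross-scale transfer and sparse-reward settings. Experiments show that PhysEDA achieves significant performance gains and computational efficiency improvements across several EDA tasks.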

Comments 9 pages, 4 figures, plus appendix. Code and data to be released upon publication

Details
English abstract

Electronic design automation (EDA) addresses placement, routing, timing analysis, and power-integrity verification for integrated circuits. Learning methods -- attention (Transformer) and reinforcement learning (RL) -- have recently been applied to EDA tasks, yet face two common bottlenecks: vanilla attention's quadratic complexity limits scaling, and data-scarce models overfit statistical noise and amplify weak long-range correlations against the underlying physics. We observe that EDA tasks share a physical prior -- pairwise electrical and routing interactions decay exponentially along Manhattan distance -- and integrate it as a unified inductive bias into both architecture and training. We propose PhysEDA, comprising two components: Physics-Structured Linear Attention (PSLA), which folds the separable Manhattan decay into the linear-attention kernel as a multiplicative bias, reducing complexity from quadratic to linear; and Potential-Based Reward Shaping (PBRS), which constructs a physical potential from the same kernel, providing a dense reward signal under sparse RL while preserving the optimal policy via the policy-invariance theorem. Across three EDA scenarios -- decoupling-capacitor placement, macro placement, and IR-drop prediction -- PhysEDA improves zero-shot cross-scale transfer by 56.8% and achieves 14x inference speedup with 98.5% memory savings on 100x100 grids; PBRS adds another 10.8% in sparse-reward DPP.
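A minimal sketch of potential-based reward shaping with a Manhattan-decay potential follows. The shaping term gamma * Phi(s') - Phi(s) is the standard form; the particular potential used here (summed decay from the agent's cell to a set of target cells), the decay rate `lam`, and all function names are assumptions for illustration, since the abstract does not define the paper's potential.

```python
import math


def manhattan_decay(p, q, lam=0.5):
    """Interaction strength exp(-lam * Manhattan distance) between two grid cells."""
    return math.exp(-lam * (abs(p[0] - q[0]) + abs(p[1] - q[1])))


def potential(state, targets, lam=0.5):
    """Hypothetical potential: total decayed influence of the target cells at `state`."""
    return sum(manhattan_decay(state, t, lam) for t in targets)


def shaped_reward(r, s, s_next, targets, gamma=0.99, lam=0.5):
    """Environment reward plus the shaping term gamma * Phi(s') - Phi(s)."""
    return r + gamma * potential(s_next, targets, lam) - potential(s, targets, lam)
```

Along any trajectory the discounted shaping terms telescope to gamma^T * Phi(s_T) - Phi(s_0), so the added signal depends only on the endpoints and not on the route taken. That is the intuition behind the policy-invariance result the abstract cites.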

2605.10546 2026-05-12 cs.LG

Higher Resolution, Better Generalization: Unlocking Visual Scaling in Deep Reinforcement Learning

Raphael Trumpp, Ömer Veysel Çağatan, Barış Akgün, Marco Caccamo

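AI summary This paper studies how visual input resolution affects policy learning in deep reinforcement learning, noting that common practice downsamples images too aggressively, whereas high-resolution inputs can substantially improve performance and generalization when supported by a suitable network architecture. The study finds that the parameter count of the conventional Impala encoder grows quadratically as resolution increases, which limits gains, while the Impoola architecture, which uses global average pooling instead, decouples parameter count from resolution and improves performance across resolutions and network widths, by up to 28%. Experiments show that higher resolution helps policies perceive small or distant objects more precisely, offering a new direction for scalable visual reinforcement learning.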

Details
English abstract

Pixel-based deep reinforcement learning agents are typically trained on heavily downsampled visual observations, a convention inherited from early benchmarks rather than grounded in principled design. In this work, we show that observation resolution is a critical yet overlooked variable for policy learning: higher-resolution inputs can substantially improve both performance and generalization, provided the network architecture can process them effectively. We find that the widely used Impala encoder, which flattens spatial features into a vector, suffers from quadratic parameter growth as resolution increases and fails to leverage the additional visual detail. Replacing this operation with global average pooling, as in the Impoola architecture, decouples parameter count from resolution and yields consistent improvements across resolutions and network widths; at their respective best conditions, visual scaling unlocks a 28% performance gain for Impoola over Impala. These gains are strongest in environments that require precise perception of small or distant objects, and gradient saliency analysis confirms that the underlying mechanism is a more spatially localized visual attention of the policy at higher resolutions. Our results challenge the prevailing practice of aggressive input downsampling and position resolution-independent architectures as a simple, effective path toward scalable visual deep RL. To facilitate future research on resolution scaling in deep RL, we publicly release the open-source code for the Procgen-HD benchmark: https://github.com/raphajaner/procgen-hd.
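The parameter-growth argument can be checked with a little arithmetic. The sketch below counts only the first dense layer after the convolutional stack. The defaults (32 final channels, 8x spatial downsampling, 256 hidden units) are assumptions chosen to resemble a common Impala-style configuration, not values taken from the abstract.

```python
def head_params(side, channels=32, hidden=256, downsample=8, pooling="flatten"):
    """Weights plus biases of the first dense layer after the conv stack.

    Assumes a square input of `side` pixels (divisible by `downsample`) and a
    final feature map of shape (channels, side / downsample, side / downsample).
    """
    fmap = side // downsample
    if pooling == "flatten":
        in_features = channels * fmap * fmap   # grows with side ** 2
    else:
        in_features = channels                 # global average pooling: one value per channel
    return in_features * hidden + hidden


for side in (64, 128, 256):
    print(side, head_params(side), head_params(side, pooling="gap"))
```

Under these assumptions, doubling the input side roughly quadruples the flattened head, while the pooled head stays the same size at every resolution.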

2605.10544 2026-05-12 cs.CL

Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

Jinchang Zhu, Jindong Li, Chengyu Zou, Rong Fu, Chao Wang, Haowei He, Menglin Yang

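AI summary This paper studies supervision allocation in long-context adaptation, noting that current methods fail to effectively strengthen long-context supervision of target tokens during training. The authors propose EXACT, which assigns weights by inverse frequency to strengthen supervision of targets with long effective context. Experiments show that EXACT significantly improves long-context reasoning performance across multiple model configurations while preserving performance on standard tasks, supporting the key role of supervision allocation in long-context adaptation.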

Details
English abstract

Long-context adaptation is often viewed as window scaling, but this misses a token-level supervision mismatch: in packed training with document masking, each target token's effective context remains short. We introduce EXACT, a supervision-allocation objective that assigns extra weight to long effective-context targets by inverse frequency within the long tail. Across seven Qwen/LLaMA CPT configurations, EXACT improves all 28 trained/extrapolated NoLiMa and RULER comparisons. On Qwen2.5-0.5B, NoLiMa improves by +10.09 (trained) and +5.34 (extrapolated); RULER by +10.69 and +5.55. On LLaMA-3.2-3B, RULER improves by +17.91 and +16.11. Standard QA and reasoning performance is preserved (+0.24 macro change across six benchmarks). A distance-resolved probe shows gains arise when evidence is thousands of tokens away, while short cases remain unchanged. Results support a supervision-centric thesis: long-context adaptation depends on how strongly training supervises long-context predictions.
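The mismatch described above can be made concrete with a toy sketch. Under document masking, a target token can attend only to earlier tokens of its own document, so its effective context is its offset within that document. The weighting rule below (bins, threshold, and a bonus inversely proportional to bin frequency) is a hypothetical reading of "inverse frequency within the long tail", not the paper's objective.

```python
from collections import Counter


def effective_context(doc_lengths):
    """Number of visible preceding tokens for each target in a packed sequence."""
    return [pos for n in doc_lengths for pos in range(n)]


def exposure_weights(ctx, threshold, bin_size, boost=1.0):
    """Hypothetical weighting: base weight 1, plus an inverse-frequency bonus in the long tail."""
    bins = [c // bin_size for c in ctx]
    tail_counts = Counter(b for b, c in zip(bins, ctx) if c >= threshold)
    if not tail_counts:
        return [1.0] * len(ctx)
    rarest = min(tail_counts.values())
    return [
        1.0 + boost * rarest / tail_counts[b] if c >= threshold else 1.0
        for b, c in zip(bins, ctx)
    ]


ctx = effective_context([8, 6])          # two documents packed into one 14-token sequence
weights = exposure_weights(ctx, threshold=4, bin_size=2)
```

Even though the packed sequence has 14 tokens, no target here sees more than 7 preceding tokens, and the longest-context targets are the rarest. In this sketch they receive the largest per-token loss weight.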

2605.10541 2026-05-12 cs.AI cs.LG

Bridging Sequence and Graph Structure for Epigenetic Age Prediction

Yao Li, Xikun Zhang, Xiaotao Shen, Sonika Tyagi, Xin Zheng, Jiaxing Huang, Feng Xia

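AI summary This paper studies how to combine sequence information at DNA methylation sites with graph structure to predict epigenetic age more accurately. The authors propose a unified sequence-graph integration framework that combines eight-dimensional DNA sequence statistical features with graph convolution through a lightweight gated modulation mechanism, modeling methylation signals more effectively. Tested on 3,707 blood methylation samples, the method outperforms the best existing graph model, indicating that biologically informed statistical features are more advantageous for this task than CNN-based sequence encoding.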

Details
English abstract

Epigenetic clocks based on DNA methylation have emerged as powerful tools for estimating biological age, with broad applications in aging research, age-related disease studies, and longevity science. Despite advances across machine learning approaches to epigenetic age prediction, spanning penalised linear regression, deep feedforward networks, residual architectures, and graph neural networks, no existing method jointly models co-methylation graph structure and site-specific DNA sequence context within a unified framework. We propose a unified sequence-graph integration framework for epigenetic age prediction that addresses this gap, integrating eight-dimensional DNA sequence statistical features through a lightweight gated modulation mechanism that adaptively scales each site's methylation signal according to its sequence-determined biological relevance prior to graph convolution. Evaluated on 3,707 blood methylation samples against a comprehensive set of baselines, our method achieves a test MAE of 3.149 years, a 12.8% improvement over the strongest graph-based baseline. Biologically informed statistical features outperform CNN-based sequence encoding, demonstrating that handcrafted sequence features are more effective than end-to-end learned representations in this data regime. Post-hoc interpretability analysis identifies CpG density and local adenine frequency as features with age-dependent importance shifts, consistent with known mechanisms of age-related hypermethylation at CpG-dense promoter regions. Our code is at https://github.com/yaoli2022/graphage-seq.
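The gated modulation step can be sketched as a per-site sigmoid gate applied to the methylation value before any graph convolution. The single linear layer, the parameter names `weights` and `bias`, and the use of a plain sigmoid are assumptions; the abstract says only that the gate is lightweight and driven by the eight sequence features.

```python
import math


def gate(seq_features, weights, bias):
    """Sigmoid gate in (0, 1) computed from one site's sequence feature vector."""
    z = sum(w * f for w, f in zip(weights, seq_features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def modulate(methylation, seq_features, weights, bias):
    """Scale each site's methylation value by its sequence-derived gate."""
    return [m * gate(f, weights, bias) for m, f in zip(methylation, seq_features)]


# Two CpG sites, each with an eight-dimensional (made-up) sequence feature vector.
features = [[0.2] * 8, [0.9] * 8]
scaled = modulate([0.5, 0.8], features, weights=[0.5] * 8, bias=-1.0)
```

In a full model the scaled values would be the node inputs to the graph convolution, with the gate parameters learned jointly with the rest of the network.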

2605.10537 2026-05-12 cs.CL

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

Lungchuan Chen

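AI summary This paper proposes Mela, a test-time memory consolidation method grounded in memory consolidation theory. Its core is the Hierarchical Memory Module (HMM), which contains two sub-modules with different update frequencies that produce abstract high-level representations and fine-grained episodic representations, respectively, and that are dynamically combined at inference time to form the final memory output. By integrating HMM into a Transformer decoder, Mela yields memory-augmented language models that perform online memory consolidation at test time, outperforming standard Transformer baselines on language modeling at every model size and adapting better to longer contexts under a fixed pretraining context length.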

Details
English abstract

Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Experiments on language modeling demonstrate that Mela outperforms Transformer baselines across all model sizes. Moreover, with the pretrained context length fixed at 4K, Mela maintains performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.

2605.10536 2026-05-12 cs.LG cs.AI

HH-SAE: Discovering and Steering Hierarchical Knowledge of Complex Manifolds

Honghan Wu, Tianyan Wang, Jiacong Mi, Zhoyang Jiang, Yunsoo Kim

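AI summary This paper proposes HH-SAE, a hybrid hierarchical autoencoder, to address the "feature density conflict" in which semantic innovations in high-dimensional, mission-critical domains are obscured by dense background information. By factorizing the manifold into contextual, atomic, and compository tiers, the method discovers and steers complex structured knowledge. Experiments show that HH-SAE performs well on tasks such as cross-domain zero-shot detection and significantly improves performance on knowledge-steered synthesis, supporting its effectiveness in high-precision, high-stakes settings.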

Details
English abstract

Rare semantic innovations in high-dimensional, mission-critical domains are often obscured by dense background contexts, a challenge we define as feature density conflict. We introduce the Hybrid Hierarchical SAE (HH-SAE) to resolve this by factorizing manifolds into a nested hierarchy of Contextual ($L_0$), Atomic ($f_1$), and Compository ($f_2$) tiers. Evaluating across disparate manifolds, HH-SAE demonstrates superior resolution by "fracturing" administrative clinical labels into physiological modes and achieving a peak cross-domain zero-shot AUC of 0.9156 in fraud detection. Path ablation confirms the architecture's structural necessity, revealing a 13.46% utility collapse when contextual subtraction is removed. Finally, knowledge-steered synthesis achieves a +9.9% AUPRC lift over state-of-the-art generators, proving that HH-SAE effectively prioritizes high-order mechanistic innovation over environmental proxies to enable high-precision discovery in high-stakes environments.

2605.10533 2026-05-12 cs.LG

ConfoundingSHAP: Quantifying confounding strength in causal inference

Marie Brockschmidt, Santo M. A. R. Thies, Maresa Schröder, Dennis Frauen, Valentyn Melnychuk, Maximilian Muschalik, Eyke Hüllermeier, Stefan Feuerriegel

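AI summary In causal inference, confounders affect both treatment assignment and outcomes, but in observational studies the treatment assignment mechanism is unknown, making it hard to determine which covariates are confounders. This paper proposes ConfoundingSHAP, a Shapley-value-based method for quantifying the confounding strength of each covariate. The method designs a dedicated Shapley game, distinct from the conventional use of SHAP to explain treatment effect heterogeneity, and pairs it with a scalable TabPFN-based estimator that avoids repeated refitting over many adjustment sets, improving the practicality and efficiency of confounder identification in causal inference.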

Details
English abstract

In causal inference, confounders are variables that influence both treatment decisions and outcomes. However, unlike in randomized clinical trials, the treatment assignment mechanism in observational studies is not known, and it is thus unclear which covariates act as confounders. Here, we aim to generate insight for causal inference and answer: which of the observed covariates act as confounders? We introduce ConfoundingSHAP, a Shapley-based method for attributing confounding strength to individual covariates. Our contributions are twofold. First, we propose a Shapley game targeted to infer the confounding strength of the covariates. Our resulting Shapley values differ from the standard applications of SHAP explanations on causal targets, such as understanding treatment effect heterogeneity, which are ill-suited for our task. Second, as our task requires evaluating the value function over many adjustment sets, we provide a scalable TabPFN-based estimation that avoids exhaustive refitting. We demonstrate the practical value across various datasets, where ConfoundingSHAP provides informative explanations of which observed covariates drive confounding and thereby helps to provide more insight for causal inference in practice.
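To show what attributing a quantity to covariates via a Shapley game looks like, here is an exact Shapley computation over adjustment sets. The abstract does not define the paper's value function, so `toy_bias_removed` is a made-up stand-in (the share of confounding bias removed when adjusting for a given covariate set), and the exhaustive enumeration is exactly the cost the paper says its TabPFN-based estimator avoids.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` (cost is exponential in len(players))."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of p when added to adjustment set s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_bias_removed(adjusted):
    """Hypothetical value function: fraction of bias removed by adjusting for `adjusted`."""
    v = 0.0
    if "age" in adjusted:
        v += 0.6
    if "income" in adjusted:
        v += 0.3
    if {"age", "income"} <= adjusted:
        v += 0.1   # extra bias removed only when both are adjusted for
    return v


phi = shapley_values(["age", "income", "noise"], toy_bias_removed)
```

In this toy game the interaction term is split evenly between the two covariates that create it, the irrelevant covariate receives zero, and the attributions sum to the value of adjusting for everything.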

2605.10531 2026-05-12 cs.AI

A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives

Jayalakshmi Baskar, Vera C. Kaelin, Kaan Kilic, Helena Lindgren

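AI summary This study explores whether knowledge-driven large language model (LLM) storytelling can support purposeful narrative interaction between older adults and a digital companion. To address LLM limitations in hallucination and transparency, it proposes a reflective storytelling agent that combines knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. Results show that users recognised the cultural relatability and personal relevance of the generated narratives, while argument-based narrative purposes and hallucination-risk indicators had a notable influence on narrative quality and user acceptance.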

Comments Submitted to ACM Transactions on Intelligent Systems and Technology (TIST)

Details
English abstract

This work investigates whether knowledge-driven large language model (LLM)-based storytelling can support purposeful narrative interaction with a digital companion for older adults. To address known limitations of LLMs, including hallucinations and limited transparency, we present a reflective storytelling agent integrating knowledge graphs, user modelling, argumentation theory, and argument mining to guide and inspect narrative generation. The study consisted of two phases. Phase I employed participatory design involving 11 domain experts in a formative evaluation that informed iterative refinement. The resulting system generates narratives grounded in structured user models representing health-promoting activities and motivations. Phase II involved 55 older adults evaluating persona-based narratives across four prompts and two creativity levels. Participants assessed perceived purpose, usefulness, cultural relatability, and inconsistencies. The system additionally computed hallucination-risk indicators to evaluate generated narratives. Participants recognised personally relevant purposes in roughly two thirds of narratives, while argument-based purposes were identified in around half of these cases. Cultural recognisability strongly influenced willingness to use the functionality, whereas minor inconsistencies were often tolerated when narratives remained understandable and personally relevant. Narratives with higher hallucination-risk indicators were more often perceived as inconsistent, while higher argument-quality indicators tended to co-occur with higher clarity and meaningfulness ratings. Overall, the study positions argument mining as a reflective inspection mechanism for comparing formal grounding signals with human evaluations in health-oriented LLM storytelling for older adults.

2605.10529 2026-05-12 cs.AI cs.LG

PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs

Yousef A. Radwan, Yao Li, Qing Qing, Ziqi Xu, Xingtong Yu, Jiaxing Huang, Renqiang Luo, Xikun Zhang

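AI summary This study presents PrimeKG-CL, a continual graph learning benchmark designed for evaluating learning methods on dynamically evolving biomedical knowledge graphs. Built from nine authoritative biomedical databases, the benchmark includes genuine temporal snapshots and multimodal node features, and defines multiple tasks and test stratifications to better reflect real-world conditions. Experiments show a significant interaction between decoder choice and continual learning strategy, a clear benefit from multimodal features, and that some existing methods fail to run effectively at large scale.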

Details
English abstract

Biomedical knowledge graphs underwrite drug repurposing and clinical decision support, yet the upstream ontologies they depend on update on independent cycles that add millions of edges and deprecate hundreds of thousands more between releases. However, existing continual graph learning has been studied almost exclusively on synthetic random splits of static, generic KGs, a regime that cannot reproduce the asynchronous, structured evolution real biomedical KGs undergo. To this end, we introduce PrimeKG-CL, a CGL benchmark built from nine authoritative biomedical databases (129K+ nodes, 8.1M+ edges, 10 node types, 30 relation types) with two genuine temporal snapshots (June 2021, July 2023; 5.83M edges added, 889K removed, 7.21M persistent), 10 entity-type-grouped tasks, multimodal node features, and a per-task persistent/added/removed test stratification. On three tasks (biomedical relationship prediction, entity classification, KGQA), we evaluate six CL strategies across four KGE decoders, plus LKGE, an LLM-RAG agent, and CMKL. We find that decoder choice and continual learning strategy interact strongly: no single strategy performs best across all decoders, and mismatched combinations can significantly degrade performance. Moreover, only DistMult exhibits a clear separation between persistent and deprecated knowledge, indicating that standard metrics conflate retention of still-valid facts with failure to forget outdated ones; this effect is absent under RotatE. In addition, multimodal features improve entity-level tasks by up to 60%, and a recent CKGE framework (IncDE) failed to scale to our 5.67M-triple base task across five attempts up to 350GB RAM. Data, pipeline, baselines, and the stratified split are released openly. Dataset: huggingface.co/datasets/yradwan147/PrimeKGCL | Code: github.com/yradwan147/primekg-cl-neurips2026
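The persistent/added/removed stratification amounts to set operations over two snapshots of (head, relation, tail) triples. The sketch below uses made-up toy triples and a hypothetical `stratify` helper; it is not the benchmark's released pipeline.

```python
def stratify(old_edges, new_edges):
    """Split triples by what happened to them between two snapshots."""
    old, new = set(old_edges), set(new_edges)
    return {
        "persistent": old & new,   # present in both snapshots
        "added": new - old,        # only in the later snapshot
        "removed": old - new,      # deprecated between snapshots
    }


snapshot_2021 = [
    ("drugA", "treats", "diseaseX"),
    ("geneB", "associated_with", "diseaseX"),
    ("drugC", "interacts_with", "drugA"),
]
snapshot_2023 = [
    ("drugA", "treats", "diseaseX"),
    ("drugC", "interacts_with", "drugA"),
    ("drugD", "treats", "diseaseY"),
]
strata = stratify(snapshot_2021, snapshot_2023)
```

Scoring a model separately on each stratum is what lets the benchmark distinguish retaining still-valid facts from failing to forget deprecated ones, the distinction the abstract draws for DistMult and RotatE.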

2605.10523 2026-05-12 cs.CV

Improving Human Image Animation via Semantic Representation Alignment

Chang Liu, Mengting Chen, Yixuan Huang, Haoning Wu, Chen Ju, Shuai Xiao, Jinsong Lan, Yanfeng Wang

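AI summary This paper studies how semantic representation alignment can improve the quality of human image animation, addressing the limb twisting and facial distortion that appear when generating long videos or complex motions. It proposes a new method, SemanticREPA, whose structure alignment module aligns structural information in video latent representations with depth features, and whose ID alignment module aligns identity features of generated videos with face recognition features, improving structural stability and identity consistency. The method performs well on complex motion generation and character consistency, offering a higher-quality and more flexible solution for human animation generation.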

Comments Accepted by CVPR 2026 workshop

Details
English abstract

The field of image-to-video generation has made remarkable progress. However, challenges such as human limb twisting and facial distortion persist, especially when generating long videos or modeling intensive motions. Existing human image animation works address these issues by incorporating human-specific semantic representations, e.g., dense poses or ID embeddings, as additional conditions. However, conditioning on these representations can reduce generation flexibility. Moreover, their reliance on RGB pixel supervision places little emphasis on learning the necessary 3D geometric relationships and temporal coherence. In contrast, we introduce a novel approach named SemanticREPA that leverages these semantic representations as supervision signals through representation alignment. Specifically, we begin by training a structure alignment module that aligns the structure representations obtained from video latents with video depth estimation features. We then fix the pretrained module, and utilize it to provide additional supervision on the structure representations of the diffusion models, achieving structure rectification to generate coherent and stable human structures. Simultaneously, we develop an ID alignment module to align the ID representations of the generated videos to face recognition features. We further propose to use the predicted structure representations to refine identity restoration in relevant regions. With structure and ID alignment, our method demonstrates superior quality on extended character motions and enhanced character consistency.
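Using a representation as a supervision signal, rather than as a conditioning input, can be illustrated with a generic alignment loss. The abstract does not state the loss used by SemanticREPA; negative mean cosine similarity is assumed here as one common choice, and the feature vectors are made-up stand-ins for projected model features and frozen depth or face-recognition features.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(pred_feats, target_feats):
    """Negative mean cosine similarity over paired feature vectors (assumed objective)."""
    sims = [cosine(p, t) for p, t in zip(pred_feats, target_feats)]
    return -sum(sims) / len(sims)


# Per-patch features predicted from the generator versus frozen target features.
predicted = [[1.0, 0.0], [0.5, 0.5]]
targets = [[2.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(predicted, targets)
```

Because such a term would only enter the training loss, nothing extra has to be supplied at inference time, which fits the abstract's concern that conditioning inputs reduce generation flexibility.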