arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10404 2026-05-12 cs.CV

Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

Tianyuan Zou, Liang Yue, Yang Liu, Ya-Qin Zhang, Sijie Cheng

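Affiliations: Institute for AI Industry Research, Tsinghua University, Beijing, China; RayNeo.AI, Shenzhen, China; Department of Computer Science and Technology, Tsinghua University, Beijing, China

AI summary: As always-on hardware such as smart glasses and body-worn cameras becomes more common, life-logging video streams are becoming a core component of persistent AI systems. These streams can substantially improve system utility, but they also carry serious privacy risks by exposing sensitive information such as behavioral patterns, emotional states, and social interactions. Existing privacy protections either target specific attacks or incur substantial utility loss, and they do not consider the full data-processing pipeline, so the privacy-utility trade-off in life-logging video streams is a foundational challenge that next-generation AI systems must address.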

Comments 19 pages, 7 figures

Abstract

With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve the utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.

2605.10401 2026-05-12 cs.AI math.OC

LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs

Zhinan Hou, Xingchen Li, Yankai Zhang, Tianxun Li, Keyou You

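Affiliations: Department of Automation, BNRist, Tsinghua University, Beijing, China

AI summary: This paper proposes LLM4Branch, a new LLM-based framework for automatically discovering efficient branching policies for integer programs. The LLM generates an executable policy skeleton, and a zeroth-order method tunes its parameters using end-to-end performance feedback from a small number of instances, improving solving efficiency. Experiments show that LLM4Branch achieves state-of-the-art results among CPU-based methods on standard MILP benchmarks and is competitive with advanced GPU-based methods.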

Comments ICML2026 preprint, camera ready in progress

Abstract

Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on hand-crafted heuristics, and now machine learning has emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and the gap between training objectives and the solver's end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program with a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances with their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state-of-the-art among CPU-based methods and achieves performance competitive with advanced GPU-based models. Codes are available at https://github.com/hzn18/LLM4Branch.
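
The zeroth-order step mentioned in the abstract can be pictured with a small sketch. It is not the authors' optimizer: it is a generic two-point random-direction estimator, and `toy_cost` is a hypothetical stand-in for the solver's end-to-end cost (for example, solve time averaged over a few instances). The function names, step sizes, and seed are all assumptions made for illustration.

```python
import random


def zeroth_order_minimize(cost, theta, steps=200, sigma=0.1, lr=0.05, seed=0):
    """Minimize `cost` using only function evaluations (no gradients)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        # Sample a random search direction.
        u = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = cost([t + sigma * d for t, d in zip(theta, u)])
        minus = cost([t - sigma * d for t, d in zip(theta, u)])
        # Two-point estimate of the directional derivative along u.
        g = (plus - minus) / (2.0 * sigma)
        theta = [t - lr * g * d for t, d in zip(theta, u)]
    return theta


def toy_cost(theta):
    # Hypothetical black-box cost; a real setup would run the solver instead.
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2


if __name__ == "__main__":
    print(zeroth_order_minimize(toy_cost, [0.0, 0.0]))
```

Only cost evaluations are needed, which is why this family of methods suits feedback such as solver run time that has no usable gradient.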

2605.10397 2026-05-12 cs.CV cs.AI

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

Xi Jiang, Yinjie Zhao, Zesheng Yang, Feng Zheng

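Affiliations: Department of Computer Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China; School of EEE, Nanyang Technological University (NTU), Singapore; CFAR, Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: Visual anomaly detection matters in fields such as industrial inspection and medical imaging, but differences in data modalities and annotation standards across domains make it hard to apply a model trained in one domain to another. The paper proposes AnomalyClaw, a training-free visual anomaly detection agent that improves the reliability of its judgments through multi-round refutation, using 13 tools for visual verification and reference parsing. Experiments show that AnomalyClaw clearly outperforms single-step inference across multiple cross-domain datasets, and a self-evolution mechanism improves detection performance further.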

Comments We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw

Abstract

Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
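
For readers unfamiliar with the metric, the sketch below shows one common way to compute macro-AUROC: take the AUROC on each dataset separately, then average with equal weight per dataset. The dataset names and scores are invented, and the benchmark's own evaluation code may differ in details such as tie handling.

```python
def auroc(labels, scores):
    """AUROC as the probability that an anomalous sample outscores a normal one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Ties between a positive and a negative score count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auroc(per_dataset):
    """Unweighted mean of per-dataset AUROC values."""
    values = [auroc(labels, scores) for labels, scores in per_dataset.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical scores for two made-up datasets.
    results = {
        "dataset_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
        "dataset_b": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
    }
    print(macro_auroc(results))
```

A gain reported in percentage points (pp) is then the difference between two such macro-AUROC values multiplied by 100.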

2605.10396 2026-05-12 cs.LG cs.NE

Causal Explanations from the Geometric Properties of ReLU Neural Networks

Hector Woods, Philippa Ryan, Rob Alexander

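Affiliations: Department of Computer Science, University of York

AI summary: This paper studies how to generate causal explanations from the geometric properties of ReLU neural networks, in order to make the decision-making of deep neural networks more interpretable. The authors note that a ReLU network can be seen as partitioning the input space into regions defined by convex polytopes, each associated with a linear function. Building on this geometry, the paper proposes extracting causal explanations directly from the network's structure, which reflects the network's behaviour more accurately and thereby supports safety assurance for autonomous systems.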

Comments 7 pages, 0 figures, Accepted for presentation at the Yorkshire Innovation in Science and Engineering Conference

Abstract

Neural networks have proved an effective means of learning control policies for autonomous systems, but these learned policies are difficult to understand due to the black-box nature of neural networks. This lack of interpretability makes safety assurance for such autonomous systems challenging. The fields of eXplainable Artificial Intelligence (XAI) and eXplainable Reinforcement Learning (XRL) aim to interpret the decision making processes of neural networks and autonomous agents, respectively. In particular, work on causal explanations aims to provide "why" and "why not" explanations for why a model made a given decision. However, most of the work on explainability to date utilises a distilled version of the original model. While this distilled policy is interpretable, it necessarily degrades in performance significantly when compared to the original model, and is not guaranteed to be an accurate reflection of the decision making processes in the original model and as such cannot be used to guarantee its safety. Recent work on understanding the geometry of ReLU neural networks shows that a ReLU network corresponds to a piecewise linear function divided into regions defined by an n-dimensional convex polytope. Through this lens, a neural network can be understood as dividing the input space into distinct regions which apply a single linear function for each output neuron. We show that this geometric representation can be used to generate causal explanations for the network's behaviour similar to previous work, but which extracts rules directly from the geometry of Neural Networks with the ReLU activation function, and is therefore an accurate reflection of the network's behaviour.
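
The piecewise-linear view can be made concrete with a tiny hand-built network. The sketch below is not the authors' extraction procedure: the weights are arbitrary, and `local_affine` is a hypothetical helper. It shows that once the on/off pattern of the hidden ReLUs is fixed, the network output is an exact affine function of the input inside that region.

```python
# A toy 2-input, 2-hidden-unit, 1-output ReLU network with arbitrary weights.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.0, -1.0]
W2 = [1.0, -2.0]
b2 = 0.5


def pre_activations(x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]


def forward(x):
    hidden = [max(0.0, p) for p in pre_activations(x)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2


def local_affine(x):
    """Return (activation pattern, coefficients, bias) of the region containing x."""
    pattern = [1.0 if p > 0 else 0.0 for p in pre_activations(x)]
    units = range(len(W1))
    # Inactive units contribute nothing, so the output collapses to one affine map.
    coeffs = [sum(W2[j] * pattern[j] * W1[j][i] for j in units) for i in range(len(x))]
    bias = sum(W2[j] * pattern[j] * b1[j] for j in units) + b2
    return pattern, coeffs, bias


if __name__ == "__main__":
    x = [1.0, 2.0]
    pattern, coeffs, bias = local_affine(x)
    print(pattern, coeffs, bias, forward(x))
```

Within a region, each coefficient states how the output moves with the corresponding input, which is the kind of rule a "why" explanation can be read from; a "why not" question then concerns which region boundary would have to be crossed.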

2605.10394 2026-05-12 cs.CV

Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

Andreas Goulas, Damianos Galanopoulos, Evlampios Apostolidis, Vasileios Mezaris

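Affiliations: IDT-ITI

AI summary: This paper introduces a new task, sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features intended to grab attention and trigger strong emotional responses. To support it, the authors build a benchmark dataset, Sens-VisualNews, containing 9,576 news images annotated for the presence or absence of various sensational concepts and events in their visual content. Using this dataset, they study the prompt sensitivity, performance, and robustness of a range of state-of-the-art multimodal LLMs in zero-shot and fine-tuned settings.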

Comments Authors' Accepted Version; Accepted at IEEE ICIP 2026

Abstract

The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper, we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the presence or absence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance, and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.

2605.10393 2026-05-12 cs.LG cs.LO

The Polynomial Counting Capabilities of Message Passing Neural Networks

Marco Sälzer, Pascal Bergsträßer, Anthony W. Lin

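Affiliations: RPTU University Kaiserslautern-Landau; Max Planck Institute for Software Systems (MPI-SWS)

AI summary: This paper studies the polynomial counting capabilities of message passing neural networks (MPNNs) beyond linear arithmetic constraints, focusing on the conditions needed to express extensions of graded modal logic with polynomial counting constraints. The authors show that, under mild assumptions, global polynomial counting constraints can be checked by mean-aggregation MPNNs, whereas checking local constraints requires additional conditions, such as permitting sum or max aggregation or restricting to regular graphs. The paper also shows how formulas with nested modalities can be captured by mean MPNNs over graphs with tree-like structure under similar assumptions.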

Abstract

The counting power of Message Passing Neural Networks (MPNN) has been the subject of many recent papers, showing that they can express logic that involves counting up to a threshold or more generally satisfy a linear arithmetic constraint. In this paper, we study the counting capabilities of MPNN beyond linear arithmetic, primarily utilising local and global mean aggregations. In particular, our goal is to tease out conditions required to express extensions of graded modal logic with polynomial counting constraints. We show that global polynomial counting constraints in node-labelled graphs can be checked using mean MPNN under mild assumptions. Checking local constraints is also possible, if we consider formulas with no nested modalities and additionally either (i) permit sum/max aggregations, or (ii) only restrict to regular graphs. We also show how formulas with nested modalities can be captured by mean MPNN over graphs with tree-like structures and similar assumptions.

2605.10391 2026-05-12 cs.CL cs.AI cs.CV

Phoenix-VL 1.5 Medium Technical Report

Team Phoenix: Arka Ray, Askar Ali Mohamed Jawad, Biondi Lee, Elijah Seah, Eva Lim, Fiona Teo, Grace Toh, Guang Xiang Teo, Jun En Tan, Jia Hui Bong, Jiale Wang, Jonathan Ng, Justin Tan, Kai Zhe Yew, Matthew Ong, Shun Yi Yeo, Wen Jett Lam, Wen Xiu Tan, Ze Yu Zhang, Gee Wah Ng, Chee Wee Ang; Mistral AI: Adrien Sadé, Guillaume Kunsch, Jia Sin Loh, Nicolas Schuhl, Rupert Menneer, Umar Jamil, Vincent Maladière, Yimu Pan

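Affiliations: Mistral AI

AI summary: This paper introduces Phoenix-VL 1.5 Medium, a 123-billion-parameter natively multimodal, multilingual foundation model adapted to the Singapore context and regional languages. The model undergoes continued pretraining on a large localized multimodal corpus and is then fine-tuned on data covering Singapore culture, law, and related domains, which markedly improves performance on Singapore-related tasks while maintaining strong performance on general multimodal, multilingual, and STEM tasks. The work also proposes a safety framework covering localized knowledge evaluation and institutionally aligned behavior, offering a new approach to developing regional AI models.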

Comments Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1

Abstract

We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and highlight benchmark and inference performance.

2605.10388 2026-05-12 cs.CV cs.RO

Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

Yumao Liu, Tao Liu, Xiangyu Li, Jiaxiang Li, Ke Ma

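Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper studies how temporal sampling frequency affects model performance in end-to-end autonomous driving trajectory prediction, challenging the common assumption that higher-frequency sampling always improves performance. The authors build training sets at different frequencies, train and evaluate the same models under a fixed protocol, and analyze the relationship between sampling frequency and prediction performance. They find that the frequency response depends on the model and the dataset: smaller models often perform best at intermediate or lower frequencies, whereas the larger AutoVLA model performs best at the highest frequency, indicating that temporal sampling frequency should be tuned as a parameter rather than fixed to the highest value.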

Abstract

End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
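
Two ingredients of this setup are easy to sketch: temporal subsampling by a fixed stride, and the ADE/FDE displacement errors. The code below is an illustration under assumptions, not the paper's pipeline: the stride-based `subsample` helper, the 10 Hz source rate, and the toy trajectories are all hypothetical.

```python
import math


def subsample(frames, stride):
    """Keep every `stride`-th frame; 10 Hz data with stride 5 becomes 2 Hz."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return frames[::stride]


def ade_fde(pred, truth):
    """Average and final displacement error between two 2D trajectories."""
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(dists) / len(dists), dists[-1]


if __name__ == "__main__":
    frames = list(range(30))  # 3 seconds of hypothetical 10 Hz frame indices
    print(subsample(frames, 5))
    print(ade_fde([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

A frequency sweep then amounts to repeating training with several strides while holding the model and the evaluation protocol fixed.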

2605.10386 2026-05-12 cs.AI

GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

Tianyuan Zhang, Peng Yue, Zihao Peng, Jiangfan Liu, Zonghao Ying, Jiakai Wang, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

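Affiliations: Beihang University; Zhongguancun Laboratory; Peking University

AI summary: As multimodal large language models (MLLMs) are increasingly used in autonomous driving systems, their safety in complex and hazardous scenarios has become a growing concern. To address the limited robustness of existing safeguards in dynamic traffic environments, the paper proposes GuardAD, a model-agnostic safeguard framework that uses a Markovian logic formalization to reason dynamically about, and continuously induce, the safety states of heterogeneous traffic participants. GuardAD identifies potential multi-step hazards and also refines model behavior through logic-driven action revision; experiments show reduced accident rates together with improved task performance.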

Abstract

Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse safety threats, particularly in accident-prone scenarios. Recent safeguard mechanisms have shown promise by incorporating logical constraints, yet most rely on static formulations that lack temporally grounded safety reasoning over evolving traffic interactions, resulting in limited robustness in dynamic driving environments. To address these limitations, we propose GuardAD, a model-agnostic safeguard that formulates AD safety as an evolving Markovian logical state. GuardAD introduces Neuro-Symbolic Logic Formalization, which represents safety predicates over heterogeneous traffic participants and continuously induces them via n-th order Markovian Logic Induction. This design enables the inference of emerging and latent hazards beyond single-step observations. Rather than simply vetoing unsafe actions, GuardAD performs Logic-Driven Action Revision, where inferred safety states actively guide action refinement without modifying the underlying MLLM. Extensive experiments on multiple benchmarks and AD-MLLMs demonstrate that GuardAD substantially reduces accident rates (-32.07%) while slightly improving task performance (+6.85%). Moreover, closed-loop simulation evaluations, together with physical-world vehicle studies, further validate the effectiveness and potential of GuardAD.

2605.10384 2026-05-12 cs.AI cs.DC cs.NI

Agentic Performance at the Edge: Insights from Benchmarking

Shiqiang Wang, Herbert Woisetschläger

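Affiliations: University of Exeter; Technical University of Munich

AI summary: This paper studies how agentic AI task performance changes in edge computing environments when model parameter count is constrained. Using a domain-conditioned evaluation methodology and an analysis of model-tool interactions, the study finds that edge-agent quality does not depend on parameter count alone but is closely tied to the joint design of model choice and tool workflow. The work offers practical guidance and a failure-mode analysis for optimizing edge agent systems under resource constraints.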

Comments Accepted to AutoEdge workshop, co-located with MobiSys 2026

Abstract

Agentic artificial intelligence (AI) is a natural fit for Internet of Things (IoT) and edge systems, but edge deployments are often constrained to models around 8 billion parameters or smaller. An important question is: How much agentic-task quality is lost when model size is constrained by memory, power, and latency budgets? To address this question, in this paper, we provide an initial empirical study considering edge-focused model scaling, general-purpose versus coder-oriented model effects, and tool-enabled execution under a fixed protocol. We introduce a domain-conditioned evaluation methodology, an implementation-grounded analysis of model-tool interactions, practical guidance for model selection under constraints, and an analysis of failure modes that reveals distinct semantic versus execution failure patterns across model families. Our core finding is that edge-agent quality is not a simple function of parameter count. Robust deployment depends on the joint design of model choice and tool workflow. Domain-conditioned analysis reveals Pareto fronts in the accuracy-latency space that can guide strategy selection based on operational priorities.
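
A Pareto front in accuracy-latency space can be computed with a few lines of code. The sketch below uses invented configuration names and numbers, not results from the paper: a configuration is kept only if no other one is at least as accurate and at least as fast while being strictly better on one of the two.

```python
def pareto_front(configs):
    """Return names of configs not dominated in (higher accuracy, lower latency)."""
    front = []
    for name, acc, lat in configs:
        dominated = any(
            other_acc >= acc and other_lat <= lat
            and (other_acc > acc or other_lat < lat)
            for _, other_acc, other_lat in configs
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Hypothetical (name, accuracy, latency in seconds) measurements.
    measurements = [
        ("small", 0.60, 1.0),
        ("medium", 0.70, 2.0),
        ("slow", 0.65, 3.0),
        ("large", 0.80, 4.0),
    ]
    print(pareto_front(measurements))
```

Picking a point on the front is then an operational decision: a latency-bound deployment would favour the fast end, an accuracy-bound one the other.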

2605.10380 2026-05-12 cs.AI

Agent-X: Full Pipeline Acceleration of On-device AI Agents

Jinha Chung, Byeongjun Shin, Jiin Kim, Minsoo Rhu

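Affiliations: KAIST

AI summary: This paper presents Agent-X, a software framework for accelerating end-to-end inference of LLM-based agents on edge devices. By optimizing prompt construction and introducing an LLM-free speculative decoding mechanism, the framework speeds up both the prefill and decode stages, achieving a 1.61x speedup with no loss in accuracy. The authors describe the work as the first to systematically characterize and eliminate latency bottlenecks in on-device agents.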

Comments Accepted for publication at MobiSys-2026

Abstract

LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first work to systematically characterize and eliminate latency bottlenecks in on-device agents.
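
Why prompt layout matters for prefix caching can be seen from a toy example. This is not Agent-X's rewriting algorithm; the token names and the two layouts are hypothetical. A prefix cache can only reuse computation for the leading tokens that two consecutive prompts share, so placing stable content first lengthens the reusable prefix.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical prompt pieces for two consecutive agent steps.
    static = ["<system>", "<tool-specs>", "<task>"]
    step1, step2 = ["<observation-1>"], ["<observation-2>"]

    # Changing content first: the prompts diverge at the very first token.
    print(shared_prefix_len(step1 + static, step2 + static))
    # Stable content first: the whole static part can be served from cache.
    print(shared_prefix_len(static + step1, static + step2))
```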

2605.10379 2026-05-12 cs.CL

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Ivo Petrov, Jasper Dekoninck, Dimitar I. Dimitrov, Martin Vechev

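Affiliations: INSAIT; Sofia University "St. Kliment Ohridski"; ETH Zurich

AI summary: The study argues that although large language models can produce correct proofs for mathematical problems, correctness alone is not enough to measure proof quality; clarity, conciseness, insight, and transferability also matter. It introduces the ProofRank benchmark, which evaluates proof quality through five scalable metrics: conciseness, computational ease, cognitive simplicity, diversity, and adaptivity. Experiments find substantial differences in proof quality across models and trade-offs between proof quality and correctness, suggesting that future evaluations should pay more attention to how useful generated proofs are.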

Comments 9 main text pages, 36 total pages, In proceedings to 2026 NeurIPS Evaluations and Datasets Track

Abstract

Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.

2605.10377 2026-05-12 cs.LG cs.MA

PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

Ahmet Onur Akman, Rafał Kucharski

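Affiliations: Doctoral School of Exact and Natural Sciences, Jagiellonian University; Faculty of Mathematics and Computer Science, Jagiellonian University

AI summary: This paper studies how to achieve zero-shot cooperation in multi-agent reinforcement learning when the number of team members varies. It proposes PC3D, which uses personalized context distillation so that each agent can recover and use a personalized coordination context from its local interaction history and thereby adapt to teams of different sizes. Experiments on several cooperative multi-agent benchmarks show that the method outperforms the evaluated baselines for both seen and unseen team sizes.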

Abstract

Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.

2605.10374 2026-05-12 cs.CV

Halo Separation-guided Underwater Multi-scale Image Restoration

Jiaxin Yang, Honglin Liu, Yongli Wang, Shuyi Cao, Chengcheng Jiang, Jiale Wang

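Affiliations: College of Information Science and Technology; Dalian Maritime University; College of Marine Electrical Engineering

AI summary: This paper addresses halos caused by artificial light sources in images captured by autonomous underwater vehicles and proposes a single-halo-image correction method based on an iterative structure. Two sub-networks perform halo layer separation and multi-scale image recovery, respectively, improving the clarity and quality of underwater images. The network is trained on synthetic datasets and tested on real halo images, and a radial gradient constraint is introduced to further improve halo removal, yielding a more robust approach to underwater image enhancement.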

Abstract

Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the camera's foreground and seriously degrade image quality. Existing underwater image enhancement methods do not fully account for this problem, and they are not robust on scenes lit by artificial light. In practice, underwater image enhancement is already a challenging task, and artificial light sources further degrade image quality and affect subsequent vision tasks. To address this problem, this paper designs a single-halo-image correction method based on an iterative structure. The network consists of two sub-networks: a halo layer separation sub-network, which separates the halo by gradient minimization, and a multi-scale recovery sub-network, which recovers the image information masked by the halo. The UIEB and EUVP synthetic datasets are used for training so that the network can learn the characteristics of underwater halo images. A large number of halo images taken underwater with real artificial light are then collected for testing. In addition, the brightness distribution of underwater halo images is analyzed, and a radial gradient constraint is introduced for halo elimination to improve underwater image restoration.

2605.10370 2026-05-12 cs.AI cs.DB cs.DC

Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge

Zeyd Boukhers, Oya Beyan, Cong Yang, Christoph Lange

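Affiliations: Fraunhofer Institute for Applied Information Technology FIT; University Hospital Cologne UKK; University of Cologne, Faculty of Medicine and University Hospital Cologne; School of Future Science and Engineering, Soochow University; RWTH Aachen University

AI summary: Scientific knowledge on the Web is currently published as passive assertions that cannot autonomously validate evidence, reconcile contradictions, or update confidence as new findings arrive. The paper proposes Autonomous FAIR Digital Objects (aFDOs), which add a policy layer, an announcement layer, and an agreement layer so that digital objects can process information autonomously, enabling decentralized and sustainable knowledge management. The work builds the aFDO framework on Semantic Web standards and evaluates it on FDOs grounded in rare-disease ontologies, reporting its performance in resolving data conflicts and withstanding adversarial attacks.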

Abstract

Scientific knowledge on the Web is published as passive assertions and cannot decide when to validate evidence, reconcile contradictions, or update confidence as findings accumulate. Curation depends on centralised middleware and institutional continuity, but when registries close, active stewardship stops even when data remain online. We advance the concept of Autonomous FAIR Digital Objects (aFDOs) from an abstract idea to an operational model, to offer a route from passive scientific publication toward accountable, standards-aligned automation that can outlive its publishing institutions. aFDO augments FDOs with three capabilities anchored in Semantic Web standards, namely 1) a policy layer over RDF-star aligned with PROV-O, SHACL, and ODRL for portable condition-action rules, 2) an announcement layer over ActivityStreams 2.0 that bounds per-announcement evaluation cost, and 3) an agreement layer that resolves multi-source contradictions through reputation and confidence weighted agreement under a bounded adversarial model. We provide a formal definition that distinguishes policy specifications, event handlers, and communication interfaces. We evaluate an open reference implementation on 4,305 FDOs grounded in rare-disease ontologies, namely ClinVar, HPO, and Orphanet, combined with controlled synthetic observations. The consensus mechanism resolves 56.3% of 3,914 naturally occurring ClinVar conflicts where multiple submitters disagree and an expert panel has subsequently adjudicated. Under Sybil, collusion, and poisoning attacks, the mechanism degrades gracefully within its design Byzantine-tolerance bound (f < n/5), and fails as predicted beyond that bound.
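
The idea of reputation- and confidence-weighted agreement can be sketched as a weighted vote. The abstract does not give the actual formula, so the product weighting below is an assumption, as are the function names and the example claims; only the bound f < n/5 is taken from the text.

```python
from collections import defaultdict


def weighted_agreement(claims):
    """Pick the value with the largest total reputation * confidence weight.

    `claims` is a list of (value, reputation, confidence) tuples.
    The product weighting is an illustrative assumption.
    """
    totals = defaultdict(float)
    for value, reputation, confidence in claims:
        totals[value] += reputation * confidence
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


def within_bound(f, n):
    """True if f adversarial sources out of n satisfy f < n/5."""
    return 5 * f < n


if __name__ == "__main__":
    # Hypothetical conflicting assertions about one variant.
    claims = [
        ("pathogenic", 0.9, 0.8),
        ("pathogenic", 0.7, 0.9),
        ("benign", 0.4, 0.5),
    ]
    print(weighted_agreement(claims))
    print(within_bound(1, 6), within_bound(2, 10))
```

The second return value is the winner's share of the total weight, which a caller could compare against a threshold before accepting the result.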

2605.10366 2026-05-12 cs.AI

EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents

Zike Yuan, Yukun Cao, Han Zhang, Jianzhi Yan, Le Liu, Cai ke, Yue Yu, Hui Wang, Ming Liu, Bing Qin

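Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; School of Computer Science and Technology

AI summary: This paper proposes EGL-SCA, a framework for graph reasoning agents that, given natural-language input, must construct a structured graph instance, select computational tools, and satisfy structured verification. The method uses a verifier-centric dual-space framework to jointly optimize reasoning strategies and executable tools, with a structural credit assignment mechanism that attributes each failure to either prompt optimization or tool synthesis, so that instructions and tools co-evolve. Experiments show that EGL-SCA achieves a 92.0% average success rate on four graph reasoning benchmarks, significantly outperforming pure-prompting and fixed-toolbox approaches.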

Abstract

Graph reasoning agents operating from natural-language inputs must solve a coupled problem: they must reconstruct a structured graph instance from text, decide whether existing computational assets are sufficient, interact with tools under a strict execution protocol, and satisfy an external verifier that checks structured correctness rather than textual plausibility. Existing approaches usually improve either the instruction side or the tool side in isolation, which leaves unclear what should be updated after failure. We propose EGL-SCA, a verifier-centric dual-space framework that models a graph reasoning agent using two collaborative components: an instruction-side policy space for reasoning strategies, and a tool-side program space for executable algorithmic tools. Our central mechanism is structural credit assignment, which maps trajectory evidence to conditional updates, precisely routing failures to either prompt optimization or tool synthesis and repair. To provide sufficient learning signals for dual-space adaptation, we introduce a training distribution stratified by task family, coupled with a Pareto-style retention strategy to balance success, generality, and parsimony. Experiments on four graph reasoning benchmarks show that EGL-SCA achieves a state-of-the-art 92.0% average success rate. By effectively co-evolving instructions and tools, our framework significantly outperforms both pure-prompting and fixed-toolbox baselines.

2605.10365 2026-05-12 cs.AI

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Haonan Dong, Qiguan Feng, Kehan Jiang, Haoran Ye, Xin Zhang, Guojie Song

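Affiliations: State Key Laboratory of General Artificial Intelligence; School of Intelligence Science and Technology, Peking University; Peking University School of Software and Microelectronics; Peking University School of Psychological and Cognitive Sciences; Key Laboratory of Machine Perception (Ministry of Education), Peking University

AI summary: This paper presents Agent-ValueBench, described as the first comprehensive benchmark dedicated to evaluating agent values, filling a gap left by existing value benchmarks that are limited to LLMs. It comprises 394 executable environments across 16 domains and 4,335 value-conflict tasks covering 28 value systems and 332 dimensions; each task is curated by professional psychologists and ships with two pole-aligned golden trajectories for evaluation. Testing 14 mainstream models across four harnesses, the study characterizes how agent values vary across models and harnesses and argues that agent alignment is shifting from classical model alignment toward harness alignment and skill steering.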

Abstract

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.

2605.10362 2026-05-12 cs.CV

CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers

Alexey Pchelnikov, Aleksei Pchelnikov

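Affiliations HistAI

AI summary CellDX AI Autopilot is a platform for training and deploying pathology image classifiers through an AI agent, aiming to reduce computational pathology's dependence on specialist expertise and compute resources. It provides structured agent skills that guide users through dataset curation, hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 H&E-stained whole-slide images. Its core contributions are an agent skill architecture designed for pathology tasks and a Multiple Instance Learning training framework, which substantially improve the efficiency and accessibility of model training.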

English abstract

Training AI models for computational pathology currently requires access to expensive whole-slide-image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users -- from pathologists with no ML background to ML practitioners running many parallel experiments -- train, evaluate, and deploy whole-slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 H&E-stained whole-slide images with pre-extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology-specialized agent skills and a pathology-specialized training platform to general-purpose AI agents (e.g. any LLM-based agent runtime), delivering end-to-end automated model training without requiring the agent itself to be domain-specific. The platform addresses both the ML-expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost-effectively.

2605.10351 2026-05-12 cs.LG eess.SP

Foundations of Reliable Inference: Reliability-Efficiency Co-Design

Jiayi Huang

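Affiliations The Department of Engineering, King’s College London

AI summary This thesis studies how to improve inference efficiency while keeping an AI model's uncertainty estimates trustworthy. The author develops a unified framework, approached from two perspectives, for the co-design of reliability and computational efficiency. The work provides theoretical foundations and methods for building efficient and trustworthy AI inference systems.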

Comments PhD Thesis

English abstract

Reliable inference requires that artificial intelligence (AI) models provide trustworthy uncertainty estimates, not merely accurate predictions. Recent advances in Bayesian learning have made significant progress toward this goal, and growing concerns about computational overhead have jointly shifted the design criterion from reliability alone to the co-design of reliability and efficiency, i.e., reducing computational overhead while preserving trustworthy uncertainty quantification. This thesis develops a unified framework from two perspectives to address the central question: can we efficiently perform reliable inference?

2605.10349 2026-05-12 cs.CV cs.AI cs.LG

Portable Active Learning for Object Detection

Rashi Sharma, Justin Timothy C. Bersamin, Karthikk Subramanian

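Affiliations Panasonic R&D Center Singapore; Nanyang Technological University

AI summary This paper proposes PAL, a portable active learning framework that improves annotation efficiency for object detection. The method requires no changes to detector internals or training pipelines and selects data using only the model's inference outputs, combining class-wise instance uncertainty with image-level diversity so that selected batches are both informative and diverse. Experiments show that PAL outperforms existing active learning methods on multiple datasets, improving label efficiency and detection accuracy and offering a practical route to cost-effective object detection deployment.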

Comments CVPR 2026 (highlight)

English abstract

Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
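
The entropy-based proposal scoring described above can be illustrated with a small sketch. It assumes each proposal already has an estimated probability of being a true positive (in PAL this would come from the class-specific logistic classifiers). The function names and the mean aggregation per image are hypothetical choices for illustration and are not taken from the paper; the diversity and similarity refinement steps are omitted.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) estimate; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def image_uncertainty(tp_probs: list[float]) -> float:
    """Hypothetical aggregation: mean proposal entropy for one image."""
    if not tp_probs:
        return 0.0
    return sum(binary_entropy(p) for p in tp_probs) / len(tp_probs)


def rank_images(pool: dict[str, list[float]], budget: int) -> list[str]:
    """Return the `budget` image names with the highest mean uncertainty."""
    ranked = sorted(pool, key=lambda name: image_uncertainty(pool[name]), reverse=True)
    return ranked[:budget]
```

Proposals whose true-positive probability is near 0.5 contribute the most entropy, so images full of ambiguous detections rise to the top of the ranking.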

2605.10345 2026-05-12 cs.CV

BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Wei Wang, Dou Quan, Ning Huyan, Shuang Wang, Yi Li, Pei He, Licheng Jiao

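Affiliations Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, Xidian University; School of Telecommunications, Xidian University; Department of Automation, Tsinghua University

AI summary This paper proposes BGG, a parameter-efficient adaptation framework built on a vision foundation model (VFM) that addresses the geometric differences between cross-view images (such as drone and satellite views) to improve Cross-View Geo-Localization (CVGL). Through a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module, BGG improves the scale adaptability and viewpoint robustness of features and strengthens local structural features, yielding more accurate geo-localization at low training cost. Experiments show that BGG achieves state-of-the-art performance on multiple datasets.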

English abstract

Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.

2605.10343 2026-05-12 cs.CV cs.AI

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang

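Affiliations EPIC Lab, Shanghai Jiao Tong University; Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); Fudan University

AI summary This paper proposes EvoStreaming, a self-evolved framework for adapting offline video-language models (VideoLLMs) into streaming video assistants. The study finds that existing VideoLLMs have good visual understanding but lack an interaction policy for deciding when to respond in streaming settings. In EvoStreaming the model itself generates data, annotates relevance, and acts as the roll-out policy, synthesizing streaming interaction trajectories without external supervision. With very few samples it markedly improves streaming evaluation scores while largely preserving offline performance, offering a data-efficient path for adapting models to streaming assistants.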

Comments 33 pages, 9 figures

English abstract

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only 1,000 self-generated samples (139× fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.

2605.10341 2026-05-12 cs.AI cs.SE

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

Bihui Yu, Xinglong Xu, Junjie Jiang, Jiabei Cheng, Caijun Jia, Siyuan Li, Conghui He, Jingxuan Wei, Cheng Tan

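Affiliations University of Chinese Academy of Sciences; Shanghai Artificial Intelligence Laboratory; School of Automation and Intelligent Sensing, Shanghai Jiao Tong University

AI summary The paper proposes a typesetting optimization method driven by visual feedback to address the visual defects that commonly appear when scientific documents are compiled from LaTeX source to the final PDF. Through a closed loop of iterative rendering, defect diagnosis, and source revision, the method automatically repairs problems such as page layout, equation placement, and table scaling. The study introduces the Visual Typesetting Optimization (VTO) task and builds PaperFit-Bench, a benchmark covering multiple defect types; experiments show that the method outperforms existing baselines by a large margin, confirming the key role of the visual closed loop in improving typesetting quality.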

Comments 47 pages, 17 figures, 17 tables

English abstract

A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.

2605.10339 2026-05-12 cs.CL

An Annotation Scheme and Classifier for Personal Facts in Dialogue

Konstantin Zaitsev

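Affiliations HSE University

AI summary This paper proposes an extended annotation scheme and a classifier for personal facts in dialogue, addressing the limitations of existing approaches in structured storage and in identifying facts suitable for dialogue continuation. The scheme adds new categories such as Demographics and Possessions and attributes such as Duration and Validity, enabling more structured fact management and quality filtering. Using 2,779 manually annotated facts, the study trains a multi-head classifier that, combined with the Gemma-300M encoder, reaches 81.6% macro F1, outperforming few-shot LLM baselines while using fewer computational resources.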

English abstract

The advancement of Large Language Models (LLMs) has enabled their application in personalized dialogue systems. We present an extended annotation scheme for personal fact classification that addresses limitations in existing approaches, particularly PeaCoK. Our scheme introduces new categories (Demographics, Possessions) and attributes (Duration, Validity, Followup) that enable structured storage, quality filtering, and identification of facts suitable for dialogue continuation. We manually annotated 2,779 facts from Multi-Session Chat and trained a multi-head classifier based on transformer encoders. Combined with the Gemma-300M encoder, the classifier achieves 81.6 ± 2.6% macro F1, outperforming all few-shot LLM baselines (best: GPT-5.4-mini, 72.92%) by nearly 9 percentage points while requiring substantially fewer computational resources. Error analysis reveals persistent challenges in semantic boundary disambiguation, temporal aspect interpretation, and pragmatic reasoning for followup assessment. The dataset and classifier are publicly available.

2605.10337 2026-05-12 cs.AI eess.SP

CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings

Liuyin Yang, Qiang Sun, Bob Van Dyck, Eva Calvo Merino, Marc M. Van Hulle

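Affiliations Laboratory for Neuro- & Psychophysiology, Department of Neurosciences, KU Leuven

AI summary This study proposes CORTEG, a framework for transferring foundation models pretrained on scalp EEG to intracranial ECoG signals in order to improve brain-computer interface decoding. CORTEG combines an electrode-aware spatial adapter, a dual-stream tokenizer, and a leave-one-subject-out fine-tuning strategy, enabling cross-patient learning and fast per-patient calibration. Experiments show that CORTEG matches or exceeds task-specific baselines on multiple tasks, with the clearest gains when per-patient data is limited, pointing toward data-efficient, scalable intracranial BCIs.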

English abstract

Intracranial electrocorticography (ECoG) offers high-signal-to-noise access to cortical activity for brain-computer interfaces, yet limited per-patient data has led most prior work to rely on small, subject-specific decoders that neglect information shared across patients. We investigate whether large pretrained scalp-EEG foundation models (EEG FMs) can be adapted to ECoG, enabling cross-patient learning and competitive decoding performance while calibrating to a held-out patient in 10-30 minutes on a single GPU. We introduce CORTEG, a cross-modality transfer framework that combines a pretrained EEG FM backbone, an electrode-aware KNNSoftFourier spatial adapter, a dual-stream tokenizer for low-frequency and high-gamma activity, and a leave-one-subject-out fine-tuning strategy. We evaluate CORTEG on two challenging regression tasks: public finger trajectory regression (n=9) and private audio envelope regression (n=16). CORTEG matches or exceeds the strongest task-specific baselines on both tasks: it reaches the highest mean correlation among compared methods on the public finger benchmark (gain not statistically significant on n=9 subjects), with larger and statistically significant gains on the audio task and in low-data per-patient calibration. Feature analyses align with neurophysiology, and latent manifolds capture low-dimensional finger-movement structure. CORTEG provides systematic evidence that scalp-EEG pretraining can be repurposed for ECoG decoding, enabling data-efficient intracranial BCIs that can adapt to new patients.
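
As a rough illustration of the leave-one-subject-out protocol mentioned above, the sketch below enumerates the folds: each patient is held out once while the remaining patients form the cross-patient training pool. The function name and the subject identifiers are hypothetical, and the abstract does not specify how CORTEG schedules fine-tuning and calibration within each fold.

```python
from typing import Iterable, Iterator


def leave_one_subject_out(subject_ids: Iterable[str]) -> Iterator[tuple[str, list[str]]]:
    """Yield (held_out_subject, training_subjects) pairs, one per unique subject."""
    unique = sorted(set(subject_ids))
    for held_out in unique:
        # Every other subject contributes to the shared training pool.
        train = [s for s in unique if s != held_out]
        yield held_out, train


# Example: three hypothetical patients produce three folds.
for held_out, train in leave_one_subject_out(["p1", "p2", "p3"]):
    print(f"calibrate on {held_out}, pretrain on {train}")
```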

2605.10335 2026-05-12 cs.LG cs.AI cs.CL cs.NA math.NA math.OC

PowerStep: Memory-Efficient Adaptive Optimization via $\ell_p$-Norm Steepest Descent

Yao Lu, Dengdong Fan, Shixun Zhang, Yonghong Tian

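Affiliations Pengcheng Laboratory; Peking University

AI summary This paper proposes PowerStep, a memory-efficient adaptive optimizer that targets the memory overhead incurred by conventional adaptive optimizers such as Adam when training large-scale neural networks. It obtains coordinate-wise adaptivity by applying a nonlinear transform directly to the momentum buffer, without storing second-moment statistics. Experiments show that PowerStep matches Adam's convergence speed while substantially reducing optimizer memory, and that combining it with quantization improves memory efficiency further.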

English abstract

Adaptive optimizers, most notably Adam, have become the default standard for training large-scale neural networks such as Transformers. These methods maintain running estimates of gradient first and second moments, incurring substantial memory overhead. We introduce PowerStep, a memory-efficient optimizer that achieves coordinate-wise adaptivity without storing second-moment statistics. Motivated by steepest descent under an $\ell_p$-norm geometry, we show that applying a nonlinear transform directly to a momentum buffer yields coordinate-wise adaptivity. We prove that PowerStep converges at the optimal $O(1/\sqrt{T})$ rate for non-convex stochastic optimization. Extensive experiments on Transformer models ranging from 124M to 235B parameters demonstrate that PowerStep matches Adam's convergence speed while halving optimizer memory. Furthermore, when combined with aggressive int8 quantization, PowerStep remains numerically stable and reduces optimizer memory by approximately 8× compared to full-precision Adam. PowerStep thus provides a principled, scalable and resource-efficient alternative for large-scale training. Code is available at https://github.com/yaolubrain/PowerStep.
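
To make the $\ell_p$-norm motivation concrete, the sketch below computes the classical steepest-descent direction under an $\ell_p$ constraint and applies it to a momentum buffer. By Hölder's inequality, the unit-$\ell_p$-norm direction that minimizes the inner product with a vector $m$ has entries proportional to $-\mathrm{sign}(m_i)|m_i|^{q-1}$, where $1/p + 1/q = 1$. This is the textbook construction only; the exact transform, normalization, and hyperparameters used by PowerStep are not given in the abstract, so the function names and default values here are assumptions.

```python
import math


def lp_steepest_direction(m: list[float], p: float) -> list[float]:
    """Unit-l_p-norm direction minimizing <m, d>: a signed power of each coordinate."""
    if p <= 1.0:
        raise ValueError("this sketch assumes p > 1")
    q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1
    norm_q = sum(abs(x) ** q for x in m) ** (1.0 / q)
    if norm_q == 0.0:
        return [0.0 for _ in m]
    scale = norm_q ** (q - 1.0)
    return [-math.copysign(abs(x) ** (q - 1.0), x) / scale for x in m]


def momentum_step(params, grads, momentum, lr=0.1, beta=0.9, p=4.0):
    """One update that keeps a single momentum buffer and no second-moment state."""
    new_momentum = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(momentum, grads)]
    direction = lp_steepest_direction(new_momentum, p)
    new_params = [w + lr * d for w, d in zip(params, direction)]
    return new_params, new_momentum
```

With p = 2 the direction reduces to the normalized negative momentum, and as p grows the exponent q - 1 shrinks toward 0, so the update approaches a sign-like step in which every coordinate moves by a similar amount.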

2605.10334 2026-05-12 cs.CV

The Alpha Blending Hypothesis: Compositing Shortcut in Deepfake Detection

Andrii Yermakov, Jan Cech, Mario Fritz, Jiri Matas

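Affiliations Czech Technical University in Prague; CISPA Helmholtz Center for Information Security

AI summary Recent deepfake detection methods generalize better across datasets, but the mechanism behind this remains unclear. This paper proposes the Alpha Blending Hypothesis: state-of-the-art frame-based detectors are effectively searching for alpha blending artifacts rather than learning semantic anomalies or the fingerprints of generative models. The study validates the hypothesis experimentally and proposes BlenD, a detection method trained on a dataset of real facial images augmented with self-blended images, which achieves the best average cross-dataset generalization on multiple compositional deepfake datasets without using explicitly generated deepfakes during training.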

English abstract

Recent deepfake detection methods demonstrate improved cross-dataset generalization, yet the underlying mechanisms remain underexplored. We introduce the Alpha Blending Hypothesis, positing that state-of-the-art frame-based detectors primarily function as alpha blending searchers; rather than learning semantic anomalies or specific generative neural fingerprints, they localize low-level compositing artifacts introduced during the integration of manipulated faces into target frames. We experimentally validate the hypothesis, demonstrating that deepfake detectors exhibit high sensitivity to the so-called self-blended images (SBI) and non-generative manipulations. We propose the method BlenD that leverages a large-scale, diverse dataset of real-only facial images augmented with SBI. This approach achieves the best average cross-dataset generalization on 15 compositional deepfake datasets released between 2019 and 2025 without utilizing explicitly generated deepfakes during training. Furthermore, we show that predictions from explicit blending searchers and models resilient to blending shortcuts are highly complementary, yielding a state-of-the-art AUROC of 94.0% in an ensemble configuration. The code with experiments and the trained model will be publicly released.
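
For readers unfamiliar with the compositing step at the center of the hypothesis, the sketch below shows standard per-pixel alpha blending on small grayscale grids, followed by a heavily simplified self-blend in which a brightened copy of an image is blended back onto the original. The `self_blend` function and its `gain` parameter are hypothetical stand-ins; real self-blended image pipelines apply richer transforms and masks, and nothing here reproduces the BlenD training setup.

```python
Grid = list[list[float]]  # grayscale image with values in [0, 1]


def alpha_blend(foreground: Grid, background: Grid, alpha: Grid) -> Grid:
    """Per-pixel composite: alpha * foreground + (1 - alpha) * background."""
    return [
        [a * f + (1.0 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


def self_blend(image: Grid, mask: Grid, gain: float = 1.2) -> Grid:
    """Blend a brightened copy of `image` onto itself using `mask` as alpha."""
    # The source is derived from the same image, so no generative model is involved.
    source = [[min(1.0, px * gain) for px in row] for row in image]
    return alpha_blend(source, image, mask)
```

Where the mask is 0 the output equals the original image, and where it is between 0 and 1 the output mixes the two copies, which is the kind of low-level boundary statistic the abstract argues detectors pick up on.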

2605.10332 2026-05-12 cs.AI

EmbodiSkill: Skill-Aware Reflection for Self-Evolving Embodied Agents

Ruofei Ju, Xinrui Wang, Xin Ding, Yifan Yang, Hao Wu, Shiqi Jiang, Qianxi Zhang, Hao Wen, Xiangyu Li, Weijun Wang, Kun Li, Yunxin Liu, Haipeng Dai, Wei Wang, Ting Cao

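Affiliations Nanjing University; Huazhong University of Science and Technology; University of Science and Technology of China; Microsoft Research; Institute for AI Industry Research (AIR), Tsinghua University

AI summary EmbodiSkill is a training-free framework for skill self-evolution in embodied agents, addressing the fact that task failures in embodied environments may stem from incorrect skill content or from execution lapses. Through skill-aware reflection, it distinguishes skill errors from execution lapses in failed tasks and applies targeted revision or reinforcement to each. Experiments show that EmbodiSkill improves embodied task success, reaching 93.28% task success on ALFWorld and clearly outperforming a large language model used directly without skills.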

English abstract

Embodied agents can benefit from skills that guide object search, action execution, and state changes across diverse environments. Since embodied environments vary across layouts, object states, and other execution factors, these skills must self-evolve from trajectories generated during task execution. However, existing skill self-evolution methods are mainly developed in digital environments and often convert trajectories into coarse skill updates. Directly applying this paradigm to embodied settings is problematic, because a failed task execution may reflect not only incorrect skill content, but also an execution lapse in which the agent fails to follow valid guidance. We propose EmbodiSkill, a training-free framework for embodied skill self-evolution through skill-aware reflection and targeted revision. EmbodiSkill interprets each trajectory with respect to the current skill, uses skill-changing evidence to update the skill body, and uses execution-lapse evidence to preserve and emphasize valid guidance. Experiments on ALFWorld and EmbodiedBench show that EmbodiSkill consistently improves embodied task success. On ALFWorld, EmbodiSkill enables a frozen Qwen3.5-27B executor to reach 93.28% task success, outperforming GPT-5.2 used as a direct agent without skills by 31.58%. These results show that skill-aware self-evolution helps embodied agents accumulate reusable procedural knowledge from their own trajectories.

2605.10319 2026-05-12 cs.CV

LimeCross: Context-Conditioned Layered Image Editing with Structural Consistency

Ryugo Morita, Stanislav Frolov, Brian Bernhard Moser, Ko Watanabe, Riku Takahashi, Issey Sukeda, Andreas Dengel

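Affiliations RPTU Kaiserslautern-Landau & DFKI GmbH, Kaiserslautern, Germany; Faculty of Science Engineering, Hosei University, Tokyo, Japan; EQUES, Tokyo, Japan

AI summary This paper proposes LimeCross, a training-free, context-conditioned layered image editing framework that edits user-selected RGBA layers according to text instructions while keeping the unselected layers unchanged. It uses a bi-stream attention mechanism to draw on contextual cues from other layers, preserving cross-layer consistency and preventing contamination of the edited layers. The study also introduces the LayerEditBench benchmark and evaluation protocols; experiments show that LimeCross improves layer purity and composite realism over existing methods, offering a new layered-editing paradigm for controllable generative creation.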

English abstract

Layered image assets are widely used in real-world creative workflows, enabling non-destructive iteration and flexible re-composition. Recent advances in layered image generation and decomposition synthesize or recover layered representations, yet controllable editing of layered images remains challenging. Manual editing requires careful coordination across layers to maintain consistent illumination and contact, while AI-based pipelines collapse layers into a flattened image for editing, then decompose them again, introducing background-to-foreground leakage and unstable transparency. To address these limitations, we propose LimeCross, a training-free context-conditioned layered image editing framework that edits user-selected RGBA layers according to text while keeping the remaining layers unchanged. It leverages contextual cues from other layers using a bi-stream attention mechanism to preserve cross-layer consistency, while explicitly maintaining layer integrity to prevent the contamination of edited layers. To evaluate our approach, we introduce LayerEditBench, a benchmark of 1500 layered scenes with paired source/target prompts, along with evaluation protocols that assess both edit fidelity and alpha channel stability. Extensive experiments demonstrate that LimeCross improves layer purity and composite realism over strong editing baselines, establishing context-conditioned layered editing as a principled framework for controllable generative creation.

2605.10318 2026-05-12 cs.CL

Extending Confidence-Based Text2Cypher with Grammar and Schema Aware Filtering

Makbule Gulcin Ozsoy

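Affiliations Neo4j

AI summary This study examines how structural constraints can improve the reliability of generated queries in the Text2Cypher task. The author proposes a filtering framework that combines confidence scoring, grammar validation, and schema constraints, applying multi-stage validation after generation to improve query correctness. Experiments show that grammar-aware and schema-aware filtering improve syntactic validity and execution quality respectively, but also increase the number of empty predictions and reduce coverage. The study offers a clearer view of how different constraints affect generation.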

English abstract

Large language models (LLMs) allow users to query databases using natural language by translating questions into executable queries. Despite strong progress on tasks such as Text2SQL, Text2SPARQL, and Text2Cypher, most existing methods focus on better prompting, fine-tuning, or iterative refinement. However, they often do not explicitly enforce structural constraints, such as syntactic validity and schema consistency. This can reduce reliability, since generated queries must satisfy both syntax rules and database schema constraints to be executable. In this work, we study how structured constraints can be used in test-time inference for Text2Cypher. We focus on post-generation validation to improve query correctness. We extend a confidence-based inference framework with a sequential filtering process that combines confidence scoring, grammar validation, and schema constraints before final aggregation. This lets us analyze how different constraint types affect generated queries. Our experiments with two instruction-tuned models show that grammar-based filtering improves syntactic validity. Schema-aware filtering further improves execution quality by enforcing consistency with the database structure. However, stronger filtering also increases the number of empty predictions and reduces execution coverage. Overall, we show that adding simple structural checks at test time improves the reliability of Text2Cypher generation, and we provide a clearer view of how syntax and schema constraints contribute differently.
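
The sequential filtering idea (confidence, then grammar, then schema, then aggregation) can be sketched as follows. Everything here is an illustrative assumption: the two check functions are toy stand-ins rather than a real Cypher parser or schema validator, the label set is invented, and majority voting is just one plausible aggregation rule; the abstract does not specify any of these details.

```python
import re
from collections import Counter
from typing import Callable, Optional

KNOWN_LABELS = {"Person", "Movie"}  # hypothetical schema


def toy_grammar_check(query: str) -> bool:
    """Toy stand-in for grammar validation; not a real Cypher parser."""
    return query.startswith("MATCH") and "RETURN" in query


def toy_schema_check(query: str) -> bool:
    """Toy stand-in for schema validation: every node label must be known."""
    labels = re.findall(r"\(\w*:(\w+)", query)
    return all(label in KNOWN_LABELS for label in labels)


def filter_and_aggregate(
    candidates: list[tuple[str, float]],
    min_confidence: float,
    is_grammatical: Callable[[str], bool],
    matches_schema: Callable[[str], bool],
) -> Optional[str]:
    """Apply the three filters in order, then majority-vote over the survivors."""
    survivors = [q for q, conf in candidates if conf >= min_confidence]
    survivors = [q for q in survivors if is_grammatical(q)]
    survivors = [q for q in survivors if matches_schema(q)]
    if not survivors:
        return None  # an empty prediction: every candidate was filtered out
    return Counter(survivors).most_common(1)[0][0]
```

Returning `None` when no candidate survives mirrors the trade-off noted in the abstract: stricter filters raise the quality of the queries that remain but also produce more empty predictions.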