arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10414 2026-05-12 cs.LG

Remember to Forget: Gated Adaptive Positional Encoding

Riccardo Ali, Alessio Borgi, Christopher Irwin, Mario Severino, Pietro Liò

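AI summary: This work addresses the diffuse, misaligned attention and degraded retrieval that rotary positional encoding (RoPE) exhibits in modern large language models when sequences exceed the training range, and proposes a new positional-encoding method, Gated Adaptive Positional Encoding (GAPE). GAPE introduces a content-aware attention bias while preserving the rotary geometry, using a query gate to suppress irrelevant context and a key gate to retain important distant information, which improves attention focus and robustness at long context. Experiments show that GAPE outperforms rotary baselines on both synthetic retrieval and long-context benchmarks.

Abstract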

Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious long-range alignments, diffuse attention, and degraded retrieval. Existing remedies only partially address these failures, as they often trade local positional resolution for long-context stability. We propose GAPE (Gated Adaptive Positional Encoding), a drop-in augmentation for positional encodings that introduces a content-aware bias directly into the attention logits while preserving the rotary geometry. GAPE decouples distance-based suppression from token importance through a query-dependent gate that contracts irrelevant context and a key-dependent gate that preserves salient distant tokens. We prove that protected tokens remain accessible, while the attention mass assigned to unprotected distant tokens decays as a function of the query gate. We further show that GAPE can be implemented within standard scaled dot-product attention. We validate these properties empirically, finding that GAPE consistently yields sharper attention and improved long-context robustness over rotary baselines across both synthetic retrieval and long-context benchmarks.
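
The abstract does not give the form of the bias, so the following is only a rough sketch of how a query gate and a key gate could combine into an additive logit bias. The function `gated_attention` and the bias `-query_gate * (1 - key_gate) * distance` are hypothetical choices made for illustration and are not taken from the paper; they merely reproduce the two qualitative properties stated above (protected keys stay reachable, unprotected distant keys decay with the query gate).

```python
import math

def gated_attention(content_logits, query_pos, key_gates, query_gate):
    """Softmax attention weights for one query under a hypothetical gated distance bias."""
    biased = []
    for key_pos, (logit, key_gate) in enumerate(zip(content_logits, key_gates)):
        distance = abs(query_pos - key_pos)
        # Hypothetical bias (an assumption, not the paper's formula): the query gate
        # scales a distance penalty, and a key gate of 1.0 switches it off for that key.
        biased.append(logit - query_gate * (1.0 - key_gate) * distance)
    peak = max(biased)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

weights = gated_attention(
    content_logits=[0.0] * 5,             # equal content scores, to isolate the bias
    query_pos=4,
    key_gates=[1.0, 0.0, 0.0, 0.0, 0.0],  # only the most distant key is protected
    query_gate=1.0,
)
```

In this toy setting the protected key at position 0 keeps the same weight as the query's own position, while the unprotected keys lose weight as their distance from the query grows.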

2605.10410 2026-05-12 cs.LG

Equilibrium Residuals Expose Three Regimes of Matrix-Game Strategic Reasoning in Language Models

Wenhua Nie, Binhan Luo, Zijie Meng, Jyh-Shing Roger Jang, Ching-Wen Ma

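AI summary: This study examines the strategic reasoning of large language models in matrix games and finds that performance drops sharply once semantic cues are removed. Using procedurally generated zero-sum matrix games, it identifies three reasoning regimes across game sizes and shows that training on payoff residuals can transfer even when output formatting is unstable. With supervised fine-tuning and residual-reward training, success on unseen larger games rises substantially, revealing both the format dependence of strategic reasoning and the room for improvement.

Abstract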

Large language models can score well on named game-theory benchmarks while failing on the same strategic computation once semantic cues are removed. We show this gap with procedurally generated zero-sum matrix games: a model that recognizes familiar games drops to 34%, 18%, and 2% success on anonymous $2{\times}2$, $3{\times}3$, and $5{\times}5$ payoff matrices. The benchmark separates semantic recall, learned approximate Nash computation, and an output-interface bottleneck that limits scale. Training only on $2{\times}2$ and $3{\times}3$ games, supervised fine-tuning raises unseen $5{\times}5$--$7{\times}7$ success from 2% to 61%, while exploitability-reward training averages 37% with high seed variance. We prove that the exploitability residual is $2$-Lipschitz in payoff perturbations, unlike discontinuous vertex-returning LP equilibrium selectors, explaining why residual training can transfer under payoff shifts even when formatting instability limits mean performance. A dominated-action padding experiment provides causal evidence: trained models solve $3{\times}3$ games embedded in much larger matrices, while random-padded controls fail and dense $12{\times}12$ games remain near failure. Procedural evaluation is therefore necessary for measuring strategic reasoning, and residual rewards expose a real but format-limited route to approximate equilibrium computation.
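
For readers unfamiliar with the term, the sketch below computes the textbook exploitability of a strategy pair in a zero-sum matrix game, assuming the row player maximizes $x^\top A y$. The abstract does not spell out its exact residual, so treating it as this quantity is an assumption, and the function name `exploitability` is illustrative.

```python
def exploitability(A, x, y):
    """Exploitability of (x, y) in a zero-sum game where the row player maximizes x^T A y."""
    m, n = len(A), len(A[0])
    # Payoff of each pure row against the column player's mix y.
    row_payoffs = [sum(A[i][j] * y[j] for j in range(n)) for i in range(m)]
    # Payoff of the row player's mix x against each pure column.
    col_payoffs = [sum(x[i] * A[i][j] for i in range(m)) for j in range(n)]
    # Sum of both players' best-response gains; zero exactly at a Nash equilibrium.
    return max(row_payoffs) - min(col_payoffs)

matching_pennies = [[1, -1], [-1, 1]]
uniform = [0.5, 0.5]
gap_at_equilibrium = exploitability(matching_pennies, uniform, uniform)        # 0.0
gap_off_equilibrium = exploitability(matching_pennies, [1.0, 0.0], uniform)    # 1.0
```

If every payoff entry moves by at most $\varepsilon$, the `max` term and the `min` term each move by at most $\varepsilon$, which is consistent with the 2-Lipschitz statement under an entrywise max norm (the choice of norm is an assumption here).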

2605.10409 2026-05-12 cs.CV

Progressive Photorealistic Simplification

Adi Rosenthal, Dana Berman, Yedid Hoshen, Ariel Shamir

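AI summary: This paper proposes a progressive photorealistic simplification method that reduces visual complexity while preserving image realism. It combines semantic understanding with generative editing, using vision-language models to identify and prioritize elements for removal and a learned verifier to maintain realism and coherence throughout the simplification. The process is further distilled into an image-to-video generation model that produces coherent simplification sequences directly from a single image, supporting tasks such as content-aware decluttering and semantic layer decomposition.

Abstract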

Existing image simplification techniques often rely on Non-Photorealistic Rendering (NPR), transforming photographs into stylized sketches, cartoons, or paintings. While effective at reducing visual complexity, such approaches typically sacrifice photographic realism. In this work, we explore a complementary direction: simplifying images while preserving their photorealistic appearance. We introduce progressive semantic image simplification, a framework that iteratively reduces scene complexity by removing and inpainting elements in a controlled manner. At each step, the resulting image remains a plausible natural photograph. Our method combines semantic understanding with generative editing, leveraging Vision-Language Models (VLMs) to identify and prioritize elements for removal, and a learned verifier to ensure photorealism and coherence throughout the process. This is implemented via an iterative Select-Remove-Verify pipeline that produces high-quality simplification trajectories. To improve efficiency, we further distill this process into an image-to-video generation model that directly predicts coherent simplification sequences from a single input image. Beyond generating cleaner and more focused compositions, our approach enables applications such as content-aware decluttering, semantic layer decomposition, and interactive editing. More broadly, our work suggests that simplification through structured content removal can serve as a practical mechanism for guiding visual interpretation within the photorealistic domain, complementing traditional abstraction methods.

2605.10407 2026-05-12 cs.LG

Identified-Set Geometry of Distributional Model Extraction under Top-$K$ Censored API Access

Wenhua Nie, ZiCheng Zhu, Jianan Wu, Binhan Luo, Haoran Zheng, Jyh-Shing Roger Jang

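AI summary: This paper studies the limits of recovering a language model's distribution under API access that exposes only the top-$K$ logit scores. By analyzing the censoring threshold $\tau$, the authors characterize the identified set of compatible teacher distributions and give an exact expression for its total-variation diameter. Experiments show that top-$K$ censoring limits per-position distribution recovery but does not prevent extraction of model capability, revealing a separation between distribution recovery and capability transfer.

Abstract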

Modern LLM APIs often reveal only top-$K$ logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold $\tau$, the compatible teacher distributions form an identified set whose total-variation diameter is exactly $U_K=(V-K)\exp(\tau)/(Z_A+(V-K)\exp(\tau))$, where $Z_A$ is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-$K$ distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-$K$ censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.
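
The diameter formula can be evaluated directly. The sketch below assumes that $Z_A$ is the sum of the exponentiated observed top-$K$ logits and that the threshold bounds every censored logit from above; both readings are assumptions about the abstract's notation, and the function name `tv_diameter` is illustrative.

```python
import math

def tv_diameter(top_k_logits, vocab_size, tau):
    """U_K = (V - K) exp(tau) / (Z_A + (V - K) exp(tau)) for one position."""
    k = len(top_k_logits)
    # Assumed meaning of Z_A: partition function restricted to the observed logits.
    z_a = sum(math.exp(s) for s in top_k_logits)
    # Largest unnormalized mass the V - K censored tokens could carry at threshold tau.
    hidden_mass = (vocab_size - k) * math.exp(tau)
    return hidden_mass / (z_a + hidden_mass)

# Two observed logits of 0.0, a vocabulary of 4, and threshold 0.0: half the mass is ambiguous.
u_k = tv_diameter([0.0, 0.0], vocab_size=4, tau=0.0)  # 0.5
```

Lowering the threshold or revealing more logits shrinks the value toward zero, which matches the intuition that the identified set tightens as less mass can hide in the censored tail.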

2605.10405 2026-05-12 cs.LG

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization

Elad Tolochinsky, Yaniv Tenzer, Yaniv Romano

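AI summary: This paper studies how to identify the best-performing large language model (LLM) efficiently under limited resources, proposing a framework that combines multi-armed bandit (MAB) algorithms with low-rank factorization predictions. Predicting model scores with low-rank factorization reduces the evaluations spent on underperforming models, and doubly robust estimators preserve statistical validity, so valid confidence intervals can be constructed under adaptive model selection and sampling without replacement. Experiments on real-world benchmarks show that the method reduces the number of evaluations, lowering compute and cost, while still identifying the best model accurately.

Abstract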

Selecting the best large language model (LLM) for a fixed benchmark is often expensive, since exhaustive evaluation requires running every model on every example. Multi-armed bandit (MAB) algorithms can reduce the number of LLM calls by sequentially selecting the next model-example pair to evaluate, thereby avoiding wasted evaluations on clearly underperforming models. Further savings can be achieved by predicting model scores from the partially observed model-example score matrix using low-rank factorization. However, such predictions are not ground truth: they can be biased and may therefore lead to incorrect identification of the best model. In this work, we propose a principled framework that combines MAB with cheap predicted scores without compromising statistical validity. Specifically, we derive doubly robust estimators of each model's performance that use the low-rank predictions to reduce variance. This enables the construction of valid finite-sample confidence intervals in our setting, where models are selected adaptively and examples are sampled without replacement. Empirical results on real-world benchmarks show that our approach reduces the number of required evaluations, yielding meaningful savings in compute and cost while accurately identifying the best-performing model.
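
To make the idea of combining cheap predictions with a validity-preserving correction concrete, here is a minimal sketch of a generic prediction-corrected mean under uniform sampling. It is not the paper's estimator, which also handles adaptive selection and sampling without replacement; the function name `corrected_mean` and the toy numbers are assumptions.

```python
def corrected_mean(predicted, observed):
    """Prediction-corrected estimate of a model's mean score.

    predicted: predicted scores for every benchmark example.
    observed:  {example_index: true score} for the examples actually evaluated,
               assumed here to be a uniform random subset.
    """
    # Cheap part: average the predictions over the whole benchmark.
    baseline = sum(predicted) / len(predicted)
    # Correction: average prediction error on the evaluated examples, which
    # cancels the bias of the predictions in expectation.
    residuals = [observed[i] - predicted[i] for i in observed]
    return baseline + sum(residuals) / len(residuals)

estimate = corrected_mean(
    predicted=[0.9, 0.8, 0.7, 0.6],
    observed={0: 1.0, 2: 0.5},
)  # 0.75 + (0.1 - 0.2) / 2 = 0.70
```

Good predictions make the residuals small and so reduce variance, while poor predictions are still corrected on average, which is the intuition behind using such estimators here.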

2605.10404 2026-05-12 cs.CV

Position: Life-Logging Video Streams Make the Privacy-Utility Trade-off Inevitable

Tianyuan Zou, Liang Yue, Yang Liu, Ya-Qin Zhang, Sijie Cheng

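AI summary: As always-on hardware such as smart glasses and body-worn cameras becomes widespread, life-logging video streams are becoming a core component of persistent AI systems. These streams can substantially improve system utility, but they also pose serious privacy risks by exposing sensitive information such as behavioral patterns, emotional states, and social interactions. Existing privacy protections either target specific attacks or incur substantial utility loss, and they fail to consider the full data-processing pipeline, so the privacy-utility trade-off in life-logging video streams is a foundational challenge that next-generation AI systems must address.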

Comments 19 pages, 7 figures

Abstract

With the growing prevalence of always-on hardware such as smart glasses, body cameras, and home security systems, life-logging visual sensing is becoming inevitable, forming the backbone of persistent, always-on AI systems. Meanwhile, recent advances in proactive agents and world models signal a fundamental shift from episodic, prompt-driven tools to next-generation AI systems that continuously perceive and react to the physical world. Although life-logging video streams can substantially improve the utility of these promising systems, they also introduce significant privacy risks by revealing sensitive information, such as behavioral patterns, emotional states, and social interactions, beyond what isolated images expose. If unresolved, these risks may undermine public trust and hinder the sustainable development of always-on AI technologies. Existing privacy protections are either attack-specific or incur substantial utility loss, and fail to consider the entire data exploitation pipeline. We therefore posit that the privacy-utility trade-off in life-logging video streams is a foundational challenge for next-generation AI systems that demands further investigation. We call for novel pipeline-aware privacy-preserving designs that jointly optimize utility and privacy for long-horizon life-logging visual data. In parallel, formal privacy leakage metrics and standardized benchmarks remain important open directions for future research.

2605.10401 2026-05-12 cs.AI math.OC

LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs

Zhinan Hou, Xingchen Li, Yankai Zhang, Tianxun Li, Keyou You

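AI summary: This paper proposes LLM4Branch, a new framework based on large language models (LLMs) for automatically discovering efficient branching policies for integer programs. The LLM generates an executable policy skeleton, and a zeroth-order optimization method tunes its parameters using end-to-end performance feedback from a small number of instances, improving solver efficiency. Experiments show that LLM4Branch reaches the state of the art among CPU-based methods on standard MILP benchmarks and is competitive with advanced GPU-based methods.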

Comments ICML2026 preprint, camera ready in progress

Abstract

Efficient branching policies are essential for accelerating Mixed Integer Linear Programming (MILP) solvers. Their design has long relied on hand-crafted heuristics, and now machine learning has emerged as a promising paradigm to automate this process. However, existing learning-based methods are often hindered by their dependence on expensive expert demonstrations and the gap between training objectives and the solver's end-to-end performance. In this work, we propose LLM4Branch, a novel framework that leverages Large Language Models (LLMs) to automate the discovery of efficient branching policies. Specifically, the discovered policy is an executable program with a program skeleton generated by the LLM and a parameter vector, which is optimized via a zeroth-order method over a few instances with their end-to-end performance feedback. Extensive experiments on standard MILP benchmarks demonstrate that LLM4Branch establishes a new state-of-the-art among CPU-based methods and achieves performance competitive with advanced GPU-based models. Code is available at https://github.com/hzn18/LLM4Branch.
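
The abstract says only that the parameter vector is tuned by "a zeroth-order method". As one illustrative instance, the sketch below uses central finite differences, which need nothing but cost evaluations. The function `toy_cost` is a hypothetical stand-in for measured end-to-end solver cost, and none of the names come from the LLM4Branch code.

```python
def zeroth_order_step(cost, theta, step=0.1, delta=1e-3):
    """One descent step using only evaluations of cost (no analytic gradients)."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + delta] + theta[k + 1:]
        down = theta[:k] + [theta[k] - delta] + theta[k + 1:]
        # Central finite difference approximates the k-th partial derivative.
        grad.append((cost(up) - cost(down)) / (2 * delta))
    return [t - step * g for t, g in zip(theta, grad)]

def toy_cost(theta):
    # Hypothetical stand-in for solver cost measured on a few instances;
    # a real cost would come from running the solver with the policy.
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

theta = [0.0, 0.0]
for _ in range(200):
    theta = zeroth_order_step(toy_cost, theta)
# theta is now close to the minimizer [1.0, -2.0]
```

With a real solver in the loop each cost evaluation is expensive and noisy, so a practical method would likely use far fewer evaluations per step than this coordinate-wise scheme.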

2605.10397 2026-05-12 cs.CV cs.AI

AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

Xi Jiang, Yinjie Zhao, Zesheng Yang, Feng Zheng

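AI summary: Visual anomaly detection matters in fields such as industrial inspection and medical imaging, but differences in data modalities and annotation standards across domains make models trained on a single domain hard to transfer. This paper proposes AnomalyClaw, a training-free visual anomaly detection agent that improves judgment reliability through multi-round refutation, drawing on 13 tools for visual verification and reference parsing. Experiments show that AnomalyClaw consistently outperforms single-step inference across multiple cross-domain datasets, and a self-evolution mechanism improves detection further.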

Comments We release the agent, the benchmark, and the analysis artifacts at https://github.com/jam-cc/AnomalyClaw

Abstract

Visual anomaly detection (VAD) is crucial in many real-world fields, such as industrial inspection, medical imaging, infrastructure monitoring, and remote sensing. However, the specific anomaly definitions, data modalities, and annotation standards across different domains make it difficult to transfer single-domain trained VAD models. Vision-language models (VLMs), pre-trained on large-scale cross-domain data, can perform visual perception under task instructions, offering a promising solution for cross-domain VAD. However, single-inference VLM judgments are unreliable, since they rely more on prior knowledge than on normal-sample references or fine-grained feature evidence. We therefore present AnomalyClaw, a training-free VAD agent that turns anomaly judgment into a multi-round refutation process. In each round, the agent proposes candidate anomalies and refutes each against normal-sample references, drawing on a 13-tool library for visual verification, reference parsing, and frozen expert probing. On the CrossDomainVAD-12 benchmark (12 datasets), AnomalyClaw achieves consistent macro-AUROC improvements over single-step direct inference with +6.23 pp on GPT-5.5, +7.93 pp on Seed2.0-lite, and +3.52 pp on Qwen3.5-VL-27B. We further introduce an optional verbalized self-evolution extension. It builds an online rulebook from internal-branch disagreement without oracle labels. On Qwen3.5-VL-27B, it delivers a +2.09 pp mean gain, comparable to a K = 10 oracle-label supervised baseline (+1.99 pp). These results show that agentic refutation improves the anomaly understanding and reasoning of VLMs, rather than merely aggregating tool outputs.
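
Since the headline numbers are macro-AUROC gains in percentage points, a small sketch of the metric may help. It assumes macro-AUROC means the unweighted mean of per-dataset AUROC values, which is the usual reading but is not stated in the abstract; the scores and labels below are made up for illustration.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random anomaly outscores a random normal sample."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Count pairwise wins, with ties worth half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(per_dataset):
    """Unweighted mean of per-dataset AUROC, so small datasets count as much as large ones."""
    return sum(auroc(s, l) for s, l in per_dataset) / len(per_dataset)

toy_benchmark = [
    ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),  # AUROC 0.75
    ([0.7, 0.2], [1, 0]),                  # AUROC 1.0
]
score = macro_auroc(toy_benchmark)  # 0.875
```

A gain of +6.23 pp then corresponds to this averaged value rising by 0.0623 on the 0-to-1 scale.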

2605.10396 2026-05-12 cs.LG cs.NE

Causal Explanations from the Geometric Properties of ReLU Neural Networks

Hector Woods, Philippa Ryan, Rob Alexander

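AI summary: This paper studies how to generate causal explanations from the geometric properties of ReLU neural networks in order to make the decisions of deep neural networks more interpretable. The authors note that a ReLU network can be viewed as partitioning the input space into regions defined by convex polytopes, each associated with a linear function. Building on this geometry, the paper proposes extracting causal explanations directly from the network structure, which reflects the network's behaviour more accurately and supports safety assurance for autonomous systems.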

Comments 7 pages, 0 figures, Accepted for presentation at the Yorkshire Innovation in Science and Engineering Conference

Abstract

Neural networks have proved an effective means of learning control policies for autonomous systems, but these learned policies are difficult to understand due to the black-box nature of neural networks. This lack of interpretability makes safety assurance for such autonomous systems challenging. The fields of eXplainable Artificial Intelligence (XAI) and eXplainable Reinforcement Learning (XRL) aim to interpret the decision-making processes of neural networks and autonomous agents, respectively. In particular, work on causal explanations aims to provide "why" and "why not" explanations of a model's decisions. However, most of the work on explainability to date utilises a distilled version of the original model. While this distilled policy is interpretable, it necessarily performs significantly worse than the original model and is not guaranteed to reflect the original model's decision-making processes accurately, so it cannot be used to guarantee the original model's safety. Recent work on understanding the geometry of ReLU neural networks shows that a ReLU network corresponds to a piecewise linear function divided into regions, each defined by an n-dimensional convex polytope. Through this lens, a neural network can be understood as dividing the input space into distinct regions, within each of which a single linear function is applied for each output neuron. We show that this geometric representation can be used to generate causal explanations of the network's behaviour similar to those of previous work, but with rules extracted directly from the geometry of neural networks with the ReLU activation function, so that the explanations accurately reflect the network's behaviour.
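
The piecewise-linear view can be checked on a tiny network. The sketch below is a generic illustration of the geometric fact, not the paper's explanation method: it reads off the activation pattern at an input and builds the single affine map the network applies throughout that region. The weights are arbitrary toy values.

```python
def relu_net(W1, b1, W2, b2, x):
    """One-hidden-layer ReLU network evaluated the ordinary way."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def local_affine_map(W1, b1, W2, b2, x):
    """Activation pattern at x and the affine map (A, c) valid on x's linear region."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    pattern = [1.0 if p > 0 else 0.0 for p in pre]  # which hidden units are active
    hidden_units = range(len(W1))
    # A = W2 diag(pattern) W1 and c = W2 diag(pattern) b1 + b2.
    A = [[sum(W2[o][h] * pattern[h] * W1[h][i] for h in hidden_units) for i in range(len(x))]
         for o in range(len(W2))]
    c = [sum(W2[o][h] * pattern[h] * b1[h] for h in hidden_units) + b2[o]
         for o in range(len(W2))]
    return pattern, A, c

W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.5]
x = [2.0, 1.0]
pattern, A, c = local_affine_map(W1, b1, W2, b2, x)
affine_out = [sum(a * xi for a, xi in zip(row, x)) + ci for row, ci in zip(A, c)]
net_out = relu_net(W1, b1, W2, b2, x)  # equals affine_out: [-2.5]
```

The inequalities that fix the activation pattern (each pre-activation staying on its side of zero) are the faces of the convex polytope containing `x`, which is what makes rule-style statements about the region possible.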

2605.10394 2026-05-12 cs.CV

Sens-VisualNews: A Benchmark Dataset for Sensational Image Detection

Andreas Goulas, Damianos Galanopoulos, Evlampios Apostolidis, Vasileios Mezaris

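AI summary: This paper introduces a new task, sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features intended to grab attention and trigger strong emotional responses. To support it, the authors build a benchmark dataset called Sens-VisualNews, containing 9,576 news images annotated for the presence or absence of various sensational concepts and events in their visual content. Using this dataset, the study examines the prompt sensitivity, performance, and robustness of a range of state-of-the-art multimodal large language models in zero-shot and fine-tuned settings.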

Comments Authors' Accepted Version; Accepted at IEEE ICIP 2026

Abstract

The detection of sensational content in media items can be a critical filtering mechanism for identifying check-worthy content and flagging potential disinformation, since such content triggers physiological arousal that often bypasses critical evaluation and accelerates viral sharing. In this paper we introduce the task of sensational image detection, which aims to determine whether an image contains shocking, provocative, or emotionally charged features to grab attention and trigger strong emotional responses. To support research on this task, we create a new benchmark dataset (called Sens-VisualNews) that contains 9,576 images from news items, annotated based on the (in-)existence of various sensational concepts and events in their visual content. Finally, using Sens-VisualNews, we study the prompt sensitivity, performance and robustness of a wide range of open SotA Multimodal LLMs, across both zero-shot and fine-tuned settings.

2605.10393 2026-05-12 cs.LG cs.LO

The Polynomial Counting Capabilities of Message Passing Neural Networks

Marco Sälzer, Pascal Bergsträßer, Anthony W. Lin

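AI summary: This paper studies the polynomial counting capabilities of message passing neural networks (MPNNs) beyond linear arithmetic constraints, focusing on the conditions under which they can express extensions of graded modal logic with polynomial counting constraints. The authors show that, under mild assumptions, global polynomial counting constraints can be checked by MPNNs with mean aggregation, whereas checking local constraints requires additional conditions, such as permitting sum or max aggregation or restricting to regular graphs. The paper also shows how formulas with nested modalities can be captured by mean MPNNs over graphs with tree-like structure under similar assumptions.

Abstract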

The counting power of Message Passing Neural Networks (MPNN) has been the subject of many recent papers, showing that they can express logic that involves counting up to a threshold or, more generally, checking a linear arithmetic constraint. In this paper, we study the counting capabilities of MPNN beyond linear arithmetic, primarily utilising local and global mean aggregations. In particular, our goal is to tease out conditions required to express extensions of graded modal logic with polynomial counting constraints. We show that global polynomial counting constraints in node-labelled graphs can be checked using mean MPNN under mild assumptions. Checking local constraints is also possible if we consider formulas with no nested modalities and additionally either (i) permit sum/max aggregations, or (ii) restrict to regular graphs. We also show how formulas with nested modalities can be captured by mean MPNN over graphs with tree-like structures and similar assumptions.
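
A small example shows why local counting with mean aggregation needs extra conditions. This is a generic illustration rather than a construction from the paper: mean aggregation sees only the fraction of labelled neighbours, so two nodes with different neighbour counts can look identical, while sum aggregation separates them. The graph and labels are invented.

```python
def aggregate(neighbors, labels, node, mode):
    """Aggregate the 0/1 labels of a node's neighbours with mean or sum."""
    values = [labels[n] for n in neighbors[node]]
    total = sum(values)
    return total / len(values) if mode == "mean" else total

# Node "a" has 1 labelled neighbour out of 2; node "u" has 2 out of 4.
neighbors = {"a": ["b", "c"], "u": ["v", "w", "x", "y"]}
labels = {"b": 1, "c": 0, "v": 1, "w": 1, "x": 0, "y": 0}

mean_a = aggregate(neighbors, labels, "a", "mean")  # 0.5
mean_u = aggregate(neighbors, labels, "u", "mean")  # 0.5, indistinguishable from "a"
sum_a = aggregate(neighbors, labels, "a", "sum")    # 1
sum_u = aggregate(neighbors, labels, "u", "sum")    # 2, distinguishable
```

On a regular graph every node has the same degree, so the mean determines the count, which gives one intuition for why the regular-graph restriction appears alongside sum/max aggregation.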

2605.10391 2026-05-12 cs.CL cs.AI cs.CV

Phoenix-VL 1.5 Medium Technical Report

Team Phoenix: Arka Ray, Askar Ali Mohamed Jawad, Biondi Lee, Elijah Seah, Eva Lim, Fiona Teo, Grace Toh, Guang Xiang Teo, Jun En Tan, Jia Hui Bong, Jiale Wang, Jonathan Ng, Justin Tan, Kai Zhe Yew, Matthew Ong, Shun Yi Yeo, Wen Jett Lam, Wen Xiu Tan, Ze Yu Zhang, Gee Wah Ng, Chee Wee Ang; Mistral AI: Adrien Sadé, Guillaume Kunsch, Jia Sin Loh, Nicolas Schuhl, Rupert Menneer, Umar Jamil, Vincent Maladière, Yimu Pan

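AI summary: This paper introduces Phoenix-VL 1.5 Medium, a 123-billion-parameter natively multimodal, multilingual foundation model adapted to the Singapore context and regional languages. The model undergoes continued pretraining on a large localized multimodal corpus and post-training on data covering Singapore culture, law, and related domains, which improves performance on Singapore-related tasks while maintaining strong results on general multimodal, multilingual, and STEM tasks. The work also presents an evaluation suite that includes localized knowledge benchmarks and an institutionally aligned behavior and safety framework, offering an approach for regional AI model development.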

Comments Release page: https://medium.com/htx-ai/introducing-phoenix-vl-1-5-medium-multimodal-intelligence-uniquely-singaporean-ef8214c8cfa1

Abstract

We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model, adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Continued pretraining was performed on Mistral Medium 3.1 using a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. An additional 5 billion tokens of model alignment was performed through Online Direct Preference Optimization. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and highlight benchmark and inference performance.

2605.10388 2026-05-12 cs.CV cs.RO

Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction

Yumao Liu, Tao Liu, Xiangyu Li, Jiaxiang Li, Ke Ma

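AI summary: This paper studies how temporal sampling frequency affects model performance in end-to-end autonomous driving trajectory prediction, challenging the common assumption that denser sampling always improves performance. It constructs training sets at different frequencies, trains and evaluates the same model under a fixed protocol, and analyzes how prediction performance varies with sampling frequency. The frequency response depends on the model and dataset: smaller models often perform best at intermediate or lower frequencies, whereas the larger AutoVLA model performs best at the highest frequency, indicating that temporal sampling frequency should be tuned rather than fixed at the highest available value.

Abstract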

End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
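
The two mechanical ingredients of such a study, temporal subsampling and the ADE/FDE metrics, can be sketched briefly. The stride-based `subsample` helper is an assumption about how a frequency sweep might be built (it presumes the target frequency divides the source frequency), while `ade_fde` follows the standard definitions of average and final displacement error.

```python
import math

def subsample(frames, source_hz, target_hz):
    """Keep every k-th frame so a source_hz sequence approximates target_hz."""
    stride = round(source_hz / target_hz)  # assumes target_hz divides source_hz
    return frames[::stride]

def ade_fde(predicted, ground_truth):
    """Average and final displacement error between two (x, y) waypoint lists."""
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors), errors[-1]

frames_2hz = subsample(list(range(10)), source_hz=10, target_hz=2)  # [0, 5]
ade, fde = ade_fde(
    predicted=[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    ground_truth=[(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)],
)  # ade = 1.0, fde = 2.0
```

Sweeping `target_hz` while holding the model and evaluation protocol fixed yields the kind of frequency response curve the abstract describes.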

2605.10386 2026-05-12 cs.AI

GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

Tianyuan Zhang, Peng Yue, Zihao Peng, Jiangfan Liu, Zonghao Ying, Jiakai Wang, Tianlin Li, Jian Yang, Yaodong Yang, Aishan Liu, Xianglong Liu

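AI summary: As multimodal large language models (MLLMs) are increasingly integrated into autonomous driving systems, their safety in complex and hazardous scenarios has become a pressing concern. To address the limited robustness of existing safeguards in dynamic traffic environments, this paper proposes GuardAD, a model-agnostic safeguard framework that uses a Markovian logic formalization to reason dynamically about, and continuously induce, the safety states of heterogeneous traffic participants. GuardAD identifies latent multi-step hazards and refines model behavior through logic-driven action revision; experiments show that it substantially reduces accident rates while slightly improving task performance.

Abstract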

Multimodal large language models (MLLMs) are increasingly integrated into autonomous driving (AD) systems; however, they remain vulnerable to diverse safety threats, particularly in accident-prone scenarios. Recent safeguard mechanisms have shown promise by incorporating logical constraints, yet most rely on static formulations that lack temporally grounded safety reasoning over evolving traffic interactions, resulting in limited robustness in dynamic driving environments. To address these limitations, we propose GuardAD, a model-agnostic safeguard that formulates AD safety as an evolving Markovian logical state. GuardAD introduces Neuro-Symbolic Logic Formalization, which represents safety predicates over heterogeneous traffic participants and continuously induces them via n-th order Markovian Logic Induction. This design enables the inference of emerging and latent hazards beyond single-step observations. Rather than simply vetoing unsafe actions, GuardAD performs Logic-Driven Action Revision, where inferred safety states actively guide action refinement without modifying the underlying MLLM. Extensive experiments on multiple benchmarks and AD-MLLMs demonstrate that GuardAD substantially reduces accident rates (-32.07%) while slightly improving task performance (+6.85%). Moreover, closed-loop simulation evaluations, together with physical-world vehicle studies, further validate the effectiveness and potential of GuardAD.

2605.10384 2026-05-12 cs.AI cs.DC cs.NI

Agentic Performance at the Edge: Insights from Benchmarking

Shiqiang Wang, Herbert Woisetschläger

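AI summary: This paper studies how agentic AI task performance changes when model size is constrained in edge computing environments. Using a domain-conditioned evaluation methodology and an analysis of model-tool interactions, it finds that edge-agent quality does not depend on parameter count alone but on the joint design of model choice and tool workflow. The work offers practical guidance and a failure-mode analysis for deploying edge agents under resource constraints.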

Comments Accepted to AutoEdge workshop, co-located with MobiSys 2026

Abstract

Agentic artificial intelligence (AI) is a natural fit for Internet of Things (IoT) and edge systems, but edge deployments are often constrained to models around 8 billion parameters or smaller. An important question is: How much agentic-task quality is lost when model size is constrained by memory, power, and latency budgets? To address this question, in this paper, we provide an initial empirical study considering edge-focused model scaling, general-purpose versus coder-oriented model effects, and tool-enabled execution under a fixed protocol. We introduce a domain-conditioned evaluation methodology, an implementation-grounded analysis of model-tool interactions, practical guidance for model selection under constraints, and an analysis of failure modes that reveals distinct semantic versus execution failure patterns across model families. Our core finding is that edge-agent quality is not a simple function of parameter count. Robust deployment depends on the joint design of model choice and tool workflow. Domain-conditioned analysis reveals Pareto fronts in the accuracy-latency space that can guide strategy selection based on operational priorities.

2605.10380 2026-05-12 cs.AI

Agent-X: Full Pipeline Acceleration of On-device AI Agents

Jinha Chung, Byeongjun Shin, Jiin Kim, Minsoo Rhu

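AI summary: This paper presents Agent-X, a software framework that accelerates end-to-end inference for agents based on large language models (LLMs) on edge devices. By rewriting prompts to exploit prefix caching and introducing LLM-free speculative decoding, it speeds up both the prefill and decode stages, achieving a 1.61x end-to-end speedup with no loss of accuracy. The authors describe the work as the first to systematically characterize and eliminate latency bottlenecks in on-device agents.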

Comments Accepted for publication at MobiSys-2026

Abstract

LLM-based agents deliver state-of-the-art performance across tasks but incur high end-to-end latency on edge devices. We introduce Agent-X, a software-only, accuracy-preserving framework that accelerates both the prefill and decode stages of on-device agent workloads. Agent-X's two key components rewrite prompts to leverage prefix caching tailored to agent-specific input-token patterns and enable LLM-free speculative decoding for fast token generation with minimal overhead. On representative agentic workloads, Agent-X achieves a 1.61x end-to-end speedup in real systems with no accuracy loss and can be seamlessly integrated into existing on-device AI agents. To the best of our knowledge, ours is the first to systematically characterize and eliminate latency bottlenecks in on-device agents.

2605.10379 2026-05-12 cs.CL

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Ivo Petrov, Jasper Dekoninck, Dimitar I. Dimitrov, Martin Vechev

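AI summary: This study argues that although large language models can produce correct proofs for mathematical problems, correctness alone does not capture proof quality; clarity, conciseness, insight, and transferability also matter. It introduces the ProofRank benchmark, which evaluates proof quality through five scalable proxies: conciseness, computational ease, cognitive simplicity, diversity, and adaptivity. Experiments reveal substantial differences in proof quality across models and trade-offs between proof quality and correctness, suggesting that future evaluations should measure how useful generated proofs are.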

Comments 9 main text pages, 36 total pages, In proceedings to 2026 NeurIPS Evaluations and Datasets Track

Abstract

Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.

2605.10377 2026-05-12 cs.LG cs.MA

PC3D: Zero-Shot Cooperation Across Variable Rosters via Personalized Context Distillation

Ahmet Onur Akman, Rafał Kucharski

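AI summary: This paper studies zero-shot cooperation in multi-agent reinforcement learning when the number of team members varies. It proposes PC3D, which uses personalized context distillation so that each agent can recover and use a personalized coordination context from its local interaction history and thereby adapt to teams of different sizes. Experiments on several cooperative multi-agent benchmarks show that the method outperforms the evaluated baselines on both seen and unseen team sizes.

Abstract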

Cooperative multi-agent reinforcement learning often assumes a fixed execution team, yet many decentralized systems must operate with varying numbers of active agents during deployment. We study this setting under episodic roster variation: each episode is executed by a set of homogeneous agents, with the team size varying across episodes. Agents act only from local histories, without execution-time communication, privileged coordinators, or online retraining. Therefore, effective cooperation requires each agent to recover relevant context about the active team and adapt its behavior accordingly. To this end, we propose PC3D (Personalized Central Coordination Context Distillation), a method for training decentralized policies to recover and use personalized coordination context from local interaction histories. During training, a set-structured centralized teacher compresses the active team into coordination tokens and personalizes them into agent-specific contexts, which are distilled into decentralized policies. At execution, each agent predicts its own context from local history and adaptively uses it to condition decision-making. Across three cooperative MARL benchmarks, PC3D achieves higher returns than the evaluated baselines with both seen and unseen roster sizes, and ablations attribute these gains to both context distillation and adaptive context use.

2605.10374 2026-05-12 cs.CV

Halo Separation-guided Underwater Multi-scale Image Restoration

Jiaxin Yang, Honglin Liu, Yongli Wang, Shuyi Cao, Chengcheng Jiang, Jiale Wang

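AI summary: This paper addresses halos caused by artificial light sources in images captured by autonomous underwater vehicles, proposing a correction method for single halo images based on an iterative structure. Two sub-networks perform halo-layer separation and multi-scale image restoration, respectively, improving the clarity and quality of underwater images. The network is trained on synthetic datasets and tested on real halo images, and a radial gradient constraint is introduced to improve halo removal further, offering a more robust approach to underwater image enhancement.

Abstract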

Underwater images captured by Autonomous Underwater Vehicles (AUVs) are inevitably affected by artificial light sources, which often produce halos in the foreground of the camera view and seriously degrade image quality. Existing underwater image enhancement methods do not fully account for this problem and are not robust on scenes lit by artificial light. Because underwater image enhancement is already a very challenging task, artificial light sources can seriously degrade the enhanced images and affect subsequent vision tasks. To deal with this problem effectively, this paper designs a correction method for single halo images based on an iterative structure. The network consists of two sub-networks: a halo-layer separation sub-network, which separates the halo by gradient minimization, and a multi-scale recovery sub-network, which recovers the image information masked by the halo. The UIEB and EUVP synthetic datasets are used for training so that the network can learn the characteristics of underwater halo images. A large number of halo images taken underwater under real artificial light are then collected for testing. In addition, the brightness distribution of underwater halo images is analyzed, and a radial gradient constraint is introduced to eliminate halos and improve underwater image restoration.

2605.10370 2026-05-12 cs.AI cs.DB cs.DC

Autonomous FAIR Digital Objects: From Passive Assertions to Active Knowledge

Zeyd Boukhers, Oya Beyan, Cong Yang, Christoph Lange

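AI Summary Scientific knowledge on the Web is currently published as passive assertions that cannot autonomously validate evidence, reconcile contradictions, or update confidence as new findings arrive. This paper proposes Autonomous FAIR Digital Objects (aFDOs), which add a policy layer, an announcement layer, and an agreement layer so that digital objects can process information autonomously, enabling decentralized and sustainable knowledge management. The work builds the aFDO framework on Semantic Web standards and validates it on a dataset grounded in rare-disease ontologies, reporting its performance in resolving data conflicts and withstanding malicious attacks.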

Details
English abstract

Scientific knowledge on the Web is published as passive assertions and cannot decide when to validate evidence, reconcile contradictions, or update confidence as findings accumulate. Curation depends on centralised middleware and institutional continuity, but when registries close, active stewardship stops even when data remain online. We advance the concept of Autonomous FAIR Digital Objects (aFDOs) from an abstract idea to an operational model, to offer a route from passive scientific publication toward accountable, standards-aligned automation that can outlive its publishing institutions. aFDO augments FDOs with three capabilities anchored in Semantic Web standards, namely 1) a policy layer over RDF-star aligned with PROV-O, SHACL, and ODRL for portable condition-action rules, 2) an announcement layer over ActivityStreams 2.0 that bounds per-announcement evaluation cost, and 3) an agreement layer that resolves multi-source contradictions through reputation and confidence weighted agreement under a bounded adversarial model. We provide a formal definition that distinguishes policy specifications, event handlers, and communication interfaces. We evaluate an open reference implementation on 4,305 FDOs grounded in rare-disease ontologies, namely ClinVar, HPO, and Orphanet, combined with controlled synthetic observations. The consensus mechanism resolves 56.3% of 3,914 naturally occurring ClinVar conflicts where multiple submitters disagree and an expert panel has subsequently adjudicated. Under Sybil, collusion, and poisoning attacks, the mechanism degrades gracefully within its design Byzantine-tolerance bound (f < n/5), and fails as predicted beyond that bound.
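
As a rough illustration of the reputation- and confidence-weighted agreement described above, the following Python sketch sums reputation × confidence per claimed value and picks the heaviest one. The function names, the multiplicative weighting, and the example data are assumptions made for illustration, not the paper's reference implementation; only the tolerance bound f < n/5 comes from the abstract.

```python
from collections import defaultdict


def weighted_agreement(assertions):
    """Pick the value with the largest total reputation * confidence weight.

    `assertions` is a list of (value, reputation, confidence) tuples.
    The multiplicative weighting is an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for value, reputation, confidence in assertions:
        totals[value] += reputation * confidence
    total_weight = sum(totals.values())
    if total_weight <= 0:
        raise ValueError("need at least one assertion with positive weight")
    winner = max(totals, key=totals.get)
    # Also return the winner's share of the total weight as a support score.
    return winner, totals[winner] / total_weight


def within_tolerance(n_sources, n_faulty):
    """Check the design bound quoted in the abstract: f < n/5."""
    return 5 * n_faulty < n_sources


# Hypothetical submitters disagreeing about one variant's classification.
observations = [
    ("pathogenic", 0.9, 0.8),
    ("pathogenic", 0.7, 0.9),
    ("benign", 0.4, 0.6),
]
label, support = weighted_agreement(observations)
```

With ten sources, `within_tolerance(10, 1)` holds while `within_tolerance(10, 2)` does not, which mirrors the strict inequality in the stated bound.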

2605.10366 2026-05-12 cs.AI

EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents

Zike Yuan, Yukun Cao, Han Zhang, Jianzhi Yan, Le Liu, Cai ke, Yue Yu, Hui Wang, Ming Liu, Bing Qin

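AI Summary This paper proposes EGL-SCA, a framework for graph reasoning agents that, given natural-language input, must simultaneously construct a structured graph instance, select computational tools, and satisfy structured verification. The method uses a verifier-centric dual-space framework to jointly optimize reasoning strategies and executable tools, with a structural credit assignment mechanism that attributes each failure to either prompt optimization or tool synthesis, so that instructions and tools co-evolve. Experiments show that EGL-SCA reaches a 92.0% average success rate on four graph reasoning benchmarks, significantly outperforming pure-prompting and fixed-toolbox methods.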

Details
English abstract

Graph reasoning agents operating from natural-language inputs must solve a coupled problem: they must reconstruct a structured graph instance from text, decide whether existing computational assets are sufficient, interact with tools under a strict execution protocol, and satisfy an external verifier that checks structured correctness rather than textual plausibility. Existing approaches usually improve either the instruction side or the tool side in isolation, which leaves unclear what should be updated after failure. We propose EGL-SCA, a verifier-centric dual-space framework that models a graph reasoning agent using two collaborative components: an instruction-side policy space for reasoning strategies, and a tool-side program space for executable algorithmic tools. Our central mechanism is structural credit assignment, which maps trajectory evidence to conditional updates, precisely routing failures to either prompt optimization or tool synthesis and repair. To provide sufficient learning signals for dual-space adaptation, we introduce a training distribution stratified by task family, coupled with a Pareto-style retention strategy to balance success, generality, and parsimony. Experiments on four graph reasoning benchmarks show that EGL-SCA achieves a state-of-the-art 92.0% average success rate. By effectively co-evolving instructions and tools, our framework significantly outperforms both pure-prompting and fixed-toolbox baselines.

2605.10365 2026-05-12 cs.AI

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Haonan Dong, Qiguan Feng, Kehan Jiang, Haoran Ye, Xin Zhang, Guojie Song

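AI Summary This paper presents Agent-ValueBench, the first comprehensive benchmark dedicated to evaluating agent values, filling the gap left by existing benchmarks that are confined to large language models. The benchmark comprises 394 executable environments across 16 domains and 4,335 value-conflict tasks covering 28 value systems and 332 dimensions; each task is curated by professional psychologists and ships with two pole-aligned golden trajectories for evaluation. By testing 14 mainstream models across four harnesses, the study characterizes how agent values vary across models and harnesses and indicates that agent alignment is shifting from classical model alignment toward harness alignment and skill steering.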

Details
English abstract

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent behavior. Existing value benchmarks, however, remain confined to LLMs, leaving agent values largely uncharted. From intuitive, empirical, and theoretical vantage points, we show that an agent's values diverge from those of its underlying LLM, and the agentic modality further introduces dataset-, evaluation-, and system-level challenges absent from text-only protocols. We close this gap with Agent-ValueBench, the first benchmark dedicated to agent values. It features 394 executable environments across 16 domains, offering 4,335 value-conflict tasks that cover 28 value systems and 332 dimensions. Every instance is co-synthesized through our purpose-built end-to-end pipeline and curated per-instance by professional psychologists. Each task ships with two pole-aligned golden trajectories whose checkpoints anchor a trajectory-level rubric-based judge. Benchmarking 14 frontier proprietary and open-weights models across 4 mainstream harnesses, we uncover three concerted findings. Agent values first manifest as a Value Tide of cross-model homogeneity beneath interpretable counter-currents. This tide bends non-additively under harness pull, and yet more decisively under deliberate steering via embedded skills. Together these results signal that the agent-alignment lever is shifting from classical model alignment and prompt steering toward harness alignment and skill steering.

2605.10362 2026-05-12 cs.CV

CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers

Alexey Pchelnikov, Aleksei Pchelnikov

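AI Summary CellDX AI Autopilot is a platform for training and deploying pathology image classifiers through an AI agent, intended to reduce the dependence of computational pathology on specialist skills and compute resources. It provides structured agent skills that guide users through dataset curation, hyperparameter optimization, multi-strategy model comparison, and human-in-the-loop deployment, with training based on a pre-built dataset of more than 32,000 cases and 66,000 H&E-stained whole-slide images. Its core contribution is an agent skill architecture designed for pathology tasks together with a Multiple Instance Learning framework, which substantially improves training efficiency and ease of use.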

Details
English abstract

Training AI models for computational pathology currently requires access to expensive whole-slide-image datasets, GPU infrastructure, deep expertise in machine learning, and substantial engineering effort. We present CellDX AI Autopilot, a platform that lets users -- from pathologists with no ML background to ML practitioners running many parallel experiments -- train, evaluate, and deploy whole-slide image classifiers through natural language interaction with an AI agent. The platform provides a structured set of agent skills that guide the user through dataset curation, automated hyperparameter tuning, multi-strategy model comparison, and human-in-the-loop deployment, all on a pre-built dataset of over 32,000 cases and 66,000 H&E-stained whole-slide images with pre-extracted features. We describe the agent skill architecture, the underlying Multiple Instance Learning (MIL) training framework supporting four classification strategies, and an iterative pairwise hyperparameter search (grid or seeded random) that reduces tuning cost by over 30x compared to exhaustive search. CellDX AI Autopilot is, to our knowledge, the first system to expose pathology-specialized agent skills and a pathology-specialized training platform to general-purpose AI agents (e.g. any LLM-based agent runtime), delivering end-to-end automated model training without requiring the agent itself to be domain-specific. The platform addresses both the ML-expertise bottleneck that limits adoption in diagnostic pathology and the engineering bottleneck that limits how many experiments a researcher can run cost-effectively.

2605.10351 2026-05-12 cs.LG eess.SP

Foundations of Reliable Inference: Reliability-Efficiency Co-Design

Jiayi Huang

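AI Summary This work studies how to improve inference efficiency while keeping the uncertainty estimates of AI models trustworthy. The author proposes a unified framework, developed from two perspectives, for the co-design of reliability and computational efficiency. The work provides theoretical foundations and methods for building efficient and trustworthy AI inference systems.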

Comments PhD Thesis

Details
English abstract

Reliable inference requires that artificial intelligence (AI) models provide trustworthy uncertainty estimates, not merely accurate predictions. Recent advances in Bayesian learning have brought significant progress toward this goal, and, together with growing concerns about computational overhead, have shifted the design criterion from reliability alone to the co-design of reliability and efficiency, i.e., reducing computational overhead while preserving trustworthy uncertainty quantification. This thesis develops a unified framework from two perspectives to address the central question: can we efficiently perform reliable inference?

2605.10349 2026-05-12 cs.CV cs.AI cs.LG

Portable Active Learning for Object Detection

Rashi Sharma, Justin Timothy C. Bersamin, Karthikk Subramanian

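AI Summary This paper proposes PAL, a portable active learning framework for improving annotation efficiency in object detection. The method requires no changes to detector internals or training pipelines and selects data from the model's inference outputs alone, combining class-wise instance uncertainty with image-level diversity to make the selected samples both informative and diverse. Experiments show that PAL outperforms existing active learning methods on several datasets, improving label efficiency and detection accuracy and offering a practical option for deploying object detection efficiently in real-world applications.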

Comments CVPR 2026 (highlight)

Details
English abstract

Annotating bounding boxes is costly and limits the scalability of object detection. This challenge is compounded by the need to preserve high accuracy while minimizing manual effort in real-world applications. Prior active learning methods often depend on model features or modify detector internals and training schedules, increasing integration overhead. Moreover, they rarely jointly exploit the benefits of image-level signals, class-imbalance cues, and instance-level uncertainty for comprehensive selection. We present Portable Active Learning (PAL), a detector-agnostic, easily portable framework that operates solely on inference outputs. PAL combines class-wise instance uncertainty with image-level diversity to guide data selection. At each round, PAL trains lightweight class-specific logistic classifiers to distinguish true from false positives, producing entropy-based uncertainty scores for proposals. Candidate images are then refined using global image entropy, class diversity, and image similarity, yielding batches that are both informative and diverse. PAL requires no changes to model internals or training pipelines, ensuring broad compatibility across detectors. Extensive experiments on COCO, PASCAL VOC, and BDD100K demonstrate that PAL consistently improves label efficiency and detection accuracy compared to existing active learning baselines, making it a practical solution for scalable and cost-effective deployment of object detection in real-world settings.
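
The entropy-based scoring step can be pictured with a small Python sketch: given the probability that a proposal is a true positive (as a class-specific logistic classifier would output), the binary entropy is highest near 0.5 and lowest near 0 or 1. Averaging proposal entropies per image, and all names and numbers below, are assumptions made for illustration; the abstract does not specify how PAL aggregates scores.

```python
import math


def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def image_uncertainty(true_positive_probs):
    """Mean proposal entropy for one image (this aggregation is an assumption)."""
    if not true_positive_probs:
        return 0.0
    total = sum(binary_entropy(p) for p in true_positive_probs)
    return total / len(true_positive_probs)


# Hypothetical classifier outputs for the proposals of two unlabeled images.
candidates = {
    "confident": [0.98, 0.02, 0.95],
    "ambiguous": [0.55, 0.45, 0.60],
}

# Rank images so the most uncertain one is proposed for annotation first.
ranked = sorted(
    candidates.items(),
    key=lambda item: image_uncertainty(item[1]),
    reverse=True,
)
```

Here the image whose proposals sit near 0.5 ranks first, which is the behavior an uncertainty-driven selection round would want before any diversity-based refinement.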

2605.10345 2026-05-12 cs.CV

BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Wei Wang, Dou Quan, Ning Huyan, Shuang Wang, Yi Li, Pei He, Licheng Jiao

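AI Summary This paper proposes BGG, a parameter-efficient adaptation framework based on a vision foundation model (VFM), to address geometric differences between cross-view images (such as drone and satellite images) and improve Cross-View Geo-Localization (CVGL). Through a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module, BGG improves the scale adaptability and viewpoint robustness of features and strengthens local structural features, achieving more accurate geo-localization at low training cost. Experiments show that BGG achieves state-of-the-art performance on multiple datasets, ahead of existing methods.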

Details
English abstract

Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.

2605.10343 2026-05-12 cs.CV cs.AI

EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant

Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang

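AI Summary This paper proposes EvoStreaming, a self-evolving framework for adapting offline video-language models (VideoLLMs) into streaming video assistants. The study finds that existing VideoLLMs have good visual understanding but lack an interaction policy for deciding when to respond in streaming settings. In EvoStreaming the model itself generates data, annotates relevance, and acts as the response policy, synthesizing streaming interaction trajectories without external supervision; with very few samples it substantially improves streaming evaluation scores while largely preserving offline performance, offering a new path for efficiently adapting streaming video assistants.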

Comments 33 pages, 9 figures

Details
English abstract

Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only $1{,}000$ self-generated samples ($139\times$ less than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to $10.8$ points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.

2605.10341 2026-05-12 cs.AI cs.SE

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

Bihui Yu, Xinglong Xu, Junjie Jiang, Jiabei Cheng, Caijun Jia, Siyuan Li, Conghui He, Jingxuan Wei, Cheng Tan

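AI Summary The paper "PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents" proposes a typesetting optimization method based on visual feedback to address the visual defects that commonly appear when scientific documents are compiled from LaTeX source into the final PDF. Through a closed loop of iterative rendering, defect detection, and source revision, the method automatically repairs problems such as page layout, equation placement, and table scaling. The study introduces the Visual Typesetting Optimization (VTO) task and builds PaperFit-Bench, a benchmark dataset covering multiple defect types; experiments show that the method clearly outperforms existing baselines on several metrics, confirming the key role of the visual closed loop in improving typesetting quality.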

Comments 47 pages, 17 figures, 17 tables

Details
English abstract

A LaTeX manuscript that compiles without error is not necessarily publication-ready. The resulting PDFs frequently suffer from misplaced floats, overflowing equations, inconsistent table scaling, widow and orphan lines, and poor page balance, forcing authors into repetitive compile-inspect-edit cycles. Rule-based tools are blind to rendered visuals, operating only on source code and log files. Text-only LLMs perform open-loop text editing, unable to predict or verify the two-dimensional layout consequences of their changes. Reliable typesetting optimization therefore requires a visual closed loop with verification after every edit. We formalize this problem as Visual Typesetting Optimization (VTO), the task of transforming a compilable LaTeX paper into a visually polished, page-budget-compliant PDF through iterative visual verification and source-level revision, and introduce a five-category taxonomy of typesetting defects to guide diagnosis. We present PaperFit, a vision-in-the-loop agent that iteratively renders pages, diagnoses defects, and applies constrained repairs. To benchmark VTO, we construct PaperFit-Bench with 200 papers across 10 venue templates and 13 defect types at different difficulty levels. Extensive experiments show that PaperFit outperforms all baselines by a large margin, establishing that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization and that VTO constitutes a critical missing stage in the document automation pipeline.

2605.10339 2026-05-12 cs.CL

An Annotation Scheme and Classifier for Personal Facts in Dialogue

Konstantin Zaitsev

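AI Summary This paper proposes an extended annotation scheme and a classifier for personal facts in dialogue, addressing the shortcomings of existing methods in structured storage and in identifying facts suitable for dialogue continuation. The scheme introduces new categories such as Demographics and Possessions and attributes such as Duration and Validity, making fact management more structured and improving classification quality. Using 2,779 manually annotated facts, the study builds a multi-head classifier that, combined with the Gemma-300M encoder, reaches 81.6% macro F1, clearly outperforming few-shot LLM baselines while using fewer computational resources.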

Details
English abstract

The advancement of Large Language Models (LLMs) has enabled their application in personalized dialogue systems. We present an extended annotation scheme for personal fact classification that addresses limitations in existing approaches, particularly PeaCoK. Our scheme introduces new categories (Demographics, Possessions) and attributes (Duration, Validity, Followup) that enable structured storage, quality filtering, and identification of facts suitable for dialogue continuation. We manually annotated 2,779 facts from Multi-Session Chat and trained a multi-head classifier based on transformer encoders. Combined with the Gemma-300M encoder, the classifier achieves $81.6 \pm 2.6$% macro F1, outperforming all few-shot LLM baselines (best: GPT-5.4-mini, 72.92%) by nearly 9 percentage points while requiring substantially fewer computational resources. Error analysis reveals persistent challenges in semantic boundary disambiguation, temporal aspect interpretation, and pragmatic reasoning for followup assessment. The dataset and classifier are publicly available.
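
Because the headline number is a macro F1 score, a short Python sketch of that metric may help: F1 is computed per class and then averaged with equal weight, so rare classes count as much as frequent ones. The example labels below are hypothetical (only the category names appear in the abstract), and scoring a class with no true positives, false positives, or false negatives as 0 is a convention assumed by this sketch.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    if not classes:
        raise ValueError("need at least one label")
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted categories for four facts.
gold = ["Demographics", "Demographics", "Possessions", "Possessions"]
pred = ["Demographics", "Possessions", "Possessions", "Possessions"]
score = macro_f1(gold, pred)
```

In this toy case the per-class F1 values are 2/3 and 4/5, so the macro average sits between them rather than being dominated by the class that is predicted more often.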

2605.10337 2026-05-12 cs.AI eess.SP

CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings

Liuyin Yang, Qiang Sun, Bob Van Dyck, Eva Calvo Merino, Marc M. Van Hulle

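AI Summary This study proposes the CORTEG framework, which transfers foundation models pretrained on scalp EEG to intracranial ECoG signals in order to improve decoding performance for brain-computer interfaces. CORTEG combines an electrode-aware spatial adapter, a dual-stream tokenizer, and a leave-one-subject-out fine-tuning strategy, enabling cross-subject learning and fast personalized calibration. Experiments show that CORTEG matches or exceeds specialized methods on several tasks, with particularly strong results when data are limited, pointing to a new approach for efficient and scalable intracranial brain-computer interfaces.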

Details
English abstract

Intracranial electrocorticography (ECoG) offers high-signal-to-noise access to cortical activity for brain-computer interfaces, yet limited per-patient data has led most prior work to rely on small, subject-specific decoders that neglect information shared across patients. We investigate whether large pretrained scalp-EEG foundation models (EEG FMs) can be adapted to ECoG, enabling cross-patient learning and competitive decoding performance while calibrating to a held-out patient in 10-30 minutes on a single GPU. We introduce CORTEG, a cross-modality transfer framework that combines a pretrained EEG FM backbone, an electrode-aware KNNSoftFourier spatial adapter, a dual-stream tokenizer for low-frequency and high-gamma activity, and a leave-one-subject-out fine-tuning strategy. We evaluate CORTEG on two challenging regression tasks: public finger trajectory regression (n=9) and private audio envelope regression (n=16). CORTEG matches or exceeds the strongest task-specific baselines on both tasks: it reaches the highest mean correlation among compared methods on the public finger benchmark (gain not statistically significant on n=9 subjects), with larger and statistically significant gains on the audio task and in low-data per-patient calibration. Feature analyses align with neurophysiology, and latent manifolds capture low-dimensional finger-movement structure. CORTEG provides systematic evidence that scalp-EEG pretraining can be repurposed for ECoG decoding, enabling data-efficient intracranial BCIs that can adapt to new patients.