arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2080
2605.06869 2026-05-14 cs.AI

Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

AI总结 本文提出 Agentick,一个用于评估通用序列决策智能体的统一基准,旨在公平比较从头学习的强化学习智能体、基于预训练知识的语言模型智能体以及混合智能体等不同方法。Agentick 提供了 37 个程序生成的任务,涵盖六类能力、四个难度等级和五种观测模态,并通过统一的 Gymnasium 接口实现,同时配套了编码接口、参考策略、训练数据集和实时排行榜。实验表明,不同方法在不同任务上各有优劣,突显了当前智能体研究仍有较大提升空间,Agentick 为推动通用自主智能体的发展提供了重要的实验平台。

详情
英文摘要

AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.

2605.06387 2026-05-14 cs.LG cs.AI

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun

AI总结 本文研究了如何改进基于策略的蒸馏方法,以在令牌级别更好地结合探索与模仿学习。针对传统方法在优势权重策略梯度中的高方差更新、零优势区域梯度消失和探索瓶颈等问题,提出了一种不对称的在策略蒸馏方法(AOPD),通过在非正优势区域采用局部散度最小化替代无效的负强化,同时保留正强化学习。实验表明,AOPD在数学推理基准中表现优于标准方法,且在训练过程中保持更高的策略熵和更好的工具使用适应能力。

详情
英文摘要

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.

2605.06309 2026-05-14 cs.CL

MultiLinguahah : A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

Sofia Callejas, Nahuel Gomez, Catherine Pelachaud, Brian Ravenet, Valentin Barriere

AI总结 本文提出了一种新的无监督多语言笑声分割方法MultiLinguahah,旨在解决跨语言环境下音频中笑声检测和分割的难题。该方法将笑声分割任务转化为基于能量的音频序列异常检测问题,并利用BYOL-A编码器学习音频表示,再通过孤立森林进行分割。实验结果表明,该方法在非英语语境下优于现有的先进算法,展示了其在多语言场景中的优越性和泛化能力。

详情
英文摘要

Laughter is a social non-vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and communication signaling. However, detecting laughter in audio is a challenging task, and segmenting is even more difficult. Currently, Machine Learning methods generally rely on costly manual annotation, and their datasets are mostly based on English contexts. Thus, we propose an unsupervised multilingual method that sets up the laughter segmentation task as an anomaly detection of energy-based segmented audio sequences. Our method applies an Isolation Forest on audio representations learned from BYOL-A encoder. We compare our method with several state-of-the-art laughter detection algorithms on four datasets, including stand-up comedy, sitcoms, and general short audio from AudioSet. Our results show that state-of-the-art methods are not optimized for multilingual contexts, while our method outperforms them in non-English settings.

2605.05875 2026-05-14 cs.RO physics.flu-dyn

Cycle-resolved Cephalopod-Inspired Pulsed-Jet Robot With High-Volume Expulsion and Drag-Reduced Gliding

Yiyuan Zhang, Anye Zhong, Junkai Chen, Wenci Xin

AI总结 本文提出了一种受章鱼启发的脉冲喷射机器人,其采用刚柔结合的折纸式外套结构,实现了大体积主动喷射和减阻滑翔。该机器人通过协调喷射、滑翔和外套充盈的完整周期运动,提升了整体推进效率。实验表明,该机器人在首次喷射周期内即可达到0.5 m/s的峰值速度,并验证了高喷射体积比、减阻滑翔和被动进水阀对推进性能的关键作用。

Comments Updated author list; no changes to the scientific content

详情
英文摘要

Cephalopod pulsed-jet locomotion is not a single isolated expulsion event, but a coordinated cycle involving jet expulsion, passive gliding, and mantle refilling. Inspired by this cycle-resolved biological strategy, this paper presents a cephalopod-inspired pulsed-jet robot with a rigid-soft hybrid origami mantle that enables large, actively driven, and geometry-guided body deformation. The proposed mantle integrates rigid folding panels with a compliant silicone framework, allowing a 75% effective cavity-volume reduction during expulsion and reducing the projected cross-sectional drag area by approximately 75.7% in the contracted gliding configuration. Using this platform, we formulate a cycle-resolved framework to separately investigate how expelled volume, glide duration, and refill pathway influence whole-cycle locomotion performance. Experiments show that the robot reaches a peak speed of approximately 0.5 m/s (3.8 BL/s) and an average speed exceeding 0.2 m/s (1.5 BL/s) within the first jetting cycle. The results further demonstrate the roles of high expelled-volume-ratio contraction in speed generation, reduced-drag-area gliding under different glide durations, and mantle-aperture-inspired passive inlet valves in assisting refill. This work provides both a robotic implementation of actively deformable cephalopod-like jet propulsion and a unified experimental platform for studying expulsion-gliding-refilling dynamics in pulsed-jet locomotion.

2605.04759 2026-05-14 cs.CL cs.AI cs.ET cs.LG

Gyan: An Explainable Neuro-Symbolic Language Model

Venkat Srinivasan, Vishaal Jatav, Anushka Chandrababu, Geetika Sharma

AI总结 本文提出了一种可解释的神经符号语言模型Gyan,其基于一种新颖的非Transformer架构,克服了传统大语言模型在可解释性、可维护性和计算资源消耗等方面的不足。Gyan通过结合修辞结构理论、语义角色理论和基于知识的计算语言学,实现了对完整组合语境的捕捉,并构建了一个类人“世界模型”以增强理解能力。实验表明,Gyan在多个数据集上取得了优越的性能,展示了其在关键任务中构建可信、可靠语言模型的潜力。

Comments also submitted to NeurIPS 2026

详情
英文摘要

Transformer based pre-trained large language models have become ubiquitous. There is increasing evidence to suggest that even with large scale pre-training, these models do not capture complete compositional context and certainly not, the full human analogous context. Besides, by the very nature of the architecture, these models hallucinate, are difficult to maintain, are not easily interpretable and require enormous compute resources for training and inference. Here, we describe Gyan, an explainable language model based on a novel non-transformer architecture, without any of these limitations. Gyan achieves SOTA performance on 3 widely cited data sets and superior performance on two proprietary data sets. The novel architecture decouples the language model from knowledge acquisition and representation. The model draws on rhetorical structure theory, semantic role theory and knowledge-based computational linguistics. Gyan's meaning representation structure captures the complete compositional context and attempts to mimic humans by expanding the context to a 'world model'. AI model adoption critically depends on trust and transparency especially in mission critical use cases. Collectively, our results demonstrate that it is possible to create models which are trustable and reliable for mission critical tasks. We believe our work has tremendous potential for guiding the development of transparent and trusted architectures for language models.

2605.04506 2026-05-14 cs.CV cs.AI

Ilov3Splat: Instance-Level Open-Vocabulary 3D Scene Understanding in Gaussian Splatting

Binh Long Nguyen, Kien Nguyen, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

AI总结 Ilov3Splat 是一种基于高斯点扩散(3D-GS)的新型框架,用于实现实例级别的开放词汇三维场景理解。该方法通过在高斯点中引入视图一致的特征场,联合优化场景几何与语义表示,从而提升跨视角一致性与实例级推理能力。通过结合多分辨率哈希嵌入与对比损失训练实例特征场,Ilov3Splat 能够在无需类别监督的情况下,基于自然语言描述准确识别和分割三维场景中的任意物体,显著优于现有开放词汇三维理解方法。

Comments The International Conference on Pattern Recognition (ICPR) 2026

详情
英文摘要

We introduce Ilov3Splat, a novel framework for instance-level open-vocabulary 3D scene understanding built on 3D Gaussian Splatting (3D-GS). Most prior work depends on 2D rendering-based matching or point-level semantic association, which undermines cross-view consistency, lacks coherent instance-level reasoning, and limits precision in downstream 3D tasks. To address these limitations, our method jointly optimizes scene geometry and semantic representations by augmenting Gaussian splats with view-consistent feature fields. Specifically, we leverage multi-resolution hash embedding to efficiently encode language-aligned CLIP features, enabling dense and coherent language grounding in 3D space. We further train an instance feature field using contrastive loss over SAM masks, supporting fine-grained object distinction across views. At inference time, CLIP-encoded queries are matched against the learned features, followed by two-stage 3D clustering to retrieve relevant Gaussian groups. This enables our framework to identify arbitrary objects in 3D scenes based on natural language descriptions, without requiring category supervision or manual annotations. Experiments on standard benchmarks demonstrate that Ilov3Splat outperforms prior open-vocabulary 3D-GS methods in both object selection and instance segmentation, offering a flexible and accurate solution for language-driven 3D scene understanding. Project page: https://csiro-robotics.github.io/Ilov3Splat.

2605.03410 2026-05-14 cs.AI

Geometry over Density: Few-Shot Cross-Domain OOD Detection

Shawn Li, You Qin, Jiate Li, Charith Peris, Lisa Bauer, Roger Zimmermann, Yue Zhao

AI总结 本文研究了在仅有少量样本的情况下,如何利用预训练模型进行跨领域异常检测的问题。提出了一种名为UFCOD的统一框架,通过分析扩散过程中的信息几何特性,提取路径能量和动力学能量两个特征,实现无需额外训练即可在任意新领域进行OOD检测。该方法在12个跨领域基准测试中取得了93.7%的平均AUROC,展示了其在样本效率上的显著优势。

详情
英文摘要

Out-of-distribution (OOD) detection identifies test samples that fall outside a model's training distribution, a capability critical for safe deployment in high-stakes applications. Standard OOD detectors are trained on a specific in-distribution (ID) dataset and detect deviations from that single domain. In contrast, we study few-shot cross-domain OOD detection: given a \emph{single} pre-trained model, can we perform OOD detection on \emph{arbitrary} new ID-OOD task pairs using only a handful of ID samples at inference time, with no additional training? We propose \textbf{UFCOD}, a unified framework that achieves this goal through information-geometric analysis of diffusion trajectories. Our key insight is that diffusion noise predictions are score functions (gradients of log-density), and we extract two energy features: \emph{Path Energy} (integrated score magnitude) and \emph{Dynamics Energy} (score smoothness), that form a discrete Sobolev norm capturing how samples interact with the learned diffusion process. The central contribution is a \textbf{train-once, deploy-anywhere} paradigm: a diffusion model trained on a single dataset (e.g., CelebA) serves as a universal feature extractor for OOD detection across semantically unrelated domains (e.g., CIFAR-10, SVHN, Textures). At deployment, each new task requires only $\sim$100 unlabeled ID samples for inference: no retraining, no fine-tuning, no task-specific adaptation. Using 100 ID samples per task, UFCOD achieves 93.7\% average AUROC across 12 cross-domain benchmarks, competitive with methods trained on 50k--163k samples, demonstrating $\sim$500$\times$ improvement in sample efficiency. See our code in https://github.com/lili0415/UFCOD.

2605.01457 2026-05-14 cs.AI

CoFlow: Coordinated Few-Step Flow for Offline Multi-Agent Decision Making

Guowei Zou, Haitao Wang, Beiwen Zhang, Boning Zhang, Hejun Wu

AI总结 本文提出了一种名为CoFlow的协调少步流方法,用于离线多智能体决策问题。该方法通过引入协调速度注意力机制和自适应协调门控,实现了在单次生成过程中保持智能体间协调性的目标,从而克服了现有少步生成方法在协调性上的不足。实验表明,CoFlow在多种任务中表现出色,能够在仅需1到3步去噪的情况下达到最先进的协调质量,且其性能提升主要归因于智能体间的协调能力增强。

Comments 34 pages, 15 figures, 10 tables. Project page: https://guowei-zou.github.io/coflow/

详情
英文摘要

Generative models have emerged as a promising paradigm for offline multi-agent reinforcement learning (MARL), but existing approaches require many iterative sampling steps. Recent few-step acceleration methods either distill a joint teacher into independent students or apply averaged velocity fields independently to each agent. Unfortunately, these few-step approaches hurt inter-agent coordination. We show that the efficiency-coordination trade-off is not inherent: single-pass multi-agent generation can preserve coordination when the velocity field is natively joint-coupled. We propose Coordinated few-step Flow (CoFlow), an architecture that combines Coordinated Velocity Attention (CVA) with Adaptive Coordination Gating. A finite-difference consistency surrogate further replaces memory-prohibitive Jacobian-vector product backpropagation through the averaged velocity field with two stop-gradient forward passes. Across 60 configurations spanning MPE, MA-MuJoCo, and SMAC, CoFlow matches or surpasses Gaussian policies, value-based methods, transformer policies, diffusion models, and prior flow baselines on episodic return. Three independent coordination probes confirm that CoFlow's improvements arise from inter-agent coordination rather than per-agent capacity. A denoising-step sweep shows that single-pass inference suffices on every configuration. CoFlow reaches state-of-the-art coordination quality in 1-3 denoising steps under both centralized and decentralized execution. Project Page: https://guowei-zou.github.io/coflow/

2605.00238 2026-05-14 cs.CL

Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne

AI总结 该研究提出了一种基于项目反应理论(IRT)的评估框架,用于分析基于大语言模型(LLM)的自动短答案评分系统的评分能力和响应难度。该方法能够揭示模型在不同难度回答上的评分表现差异,发现整体性能相似的模型在面对难度增加时其评分准确性下降程度存在显著差异。研究还发现,困难回答的错误主要集中于“部分正确但不完整”标签,且这类回答在语义对齐度、矛盾信号和嵌入空间孤立性等方面表现出更高的难度特征。

Comments accepted at BEA 2026, the 21st Workshop on Innovative Use of NLP for Building Educational Applications

详情
英文摘要

Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the \texttt{partially\_correct\_incomplete} label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.

2605.00200 2026-05-14 cs.CL

Confidence Estimation in Automatic Short Answer Grading with LLMs

Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne

AI总结 本文研究了基于大语言模型的自动短答案评分中的置信度估计问题,旨在提升人机协作教育评估的安全性与可靠性。作者提出了一种结合模型置信度和数据集不确定性的混合置信度框架,通过对比多种模型置信度估计方法,发现单一模型置信度不足以准确反映评分不确定性。该框架引入了基于学生回答语义聚类的噪声估计,有效提升了置信度估计的可靠性与选择性评分性能,为可信的AI辅助教育评估系统提供了支持。

Comments accepted to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

详情
英文摘要

Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.

2604.27996 2026-05-14 cs.AI cs.GR cs.HC

Exploring Interaction Paradigms for LLM Agents in Scientific Visualization

Jackson Vonderhorst, Kuangshi Ai, Haichao Miao, Shusen Liu, Chaoli Wang

AI总结 本文研究了不同类型的大型语言模型(LLM)代理在科学可视化任务中的表现,用户通过自然语言指令生成可视化流程。通过比较三种主要交互范式,包括使用结构化工具的领域特定代理、计算机使用代理和通用编程代理,在15个基准任务中评估了八种代表性代理的可视化质量、效率、鲁棒性和计算成本。研究还分析了不同交互方式及持久记忆对性能的影响,结果表明各类方法在灵活性、效率和稳定性方面存在明显权衡,未来科学可视化系统应结合结构化工具使用、交互能力和自适应记忆机制以实现性能与灵活性的平衡。

详情
英文摘要

This paper examines how different types of large language model (LLM) agents perform on scientific visualization (SciVis) tasks, where users generate visualization workflows from natural-language instructions. We compare three primary interaction paradigms, including domain-specific agents with structured tool use, computer-use agents, and general-purpose coding agents, by evaluating eight representative agents across 15 benchmark tasks and measuring visualization quality, efficiency, robustness, and computational cost. We further analyze interaction modalities, including code scripts and model context protocol (MCP) or API calls for structured tool use, as well as command-line interfaces (CLI) and graphical user interfaces (GUI) for more general interaction, while additionally studying the effect of persistent memory in selected agents. The results reveal clear tradeoffs across paradigms and modalities. General-purpose coding agents achieve the highest task success rates but are computationally expensive, while domain-specific agents are more efficient and stable but less flexible. Computer-use agents perform well on individual steps but struggle with longer multi-step workflows, indicating that long-horizon planning is their primary limitation. Across both CLI- and GUI-based settings, persistent memory improves performance over repeated trials, although its benefits depend on the underlying interaction mode and the quality of feedback. These findings suggest that no single approach is sufficient, and future SciVis systems should combine structured tool use, interactive capabilities, and adaptive memory mechanisms to balance performance, robustness, and flexibility.

2604.27389 2026-05-14 cs.CV cs.AI

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen

AI总结 本文提出COHERENCE基准,旨在评估多模态大语言模型在交织图文上下文中进行细粒度图文对齐的能力。现有基准多关注单一或多个图像的理解,而现实场景中信息常以图文交织形式呈现,要求模型不仅识别图像内容,还需建立图文间的细粒度关联并进行推理。COHERENCE涵盖四个代表性领域的交织图文内容,包含6,161个高质量问题,并通过六类错误分析,揭示当前模型在该任务中的不足。

详情
英文摘要

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodel contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts based on contextual evidence. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover fine-grained image-text correspondences in interleaved multimodal contexts. COHERENCE covers interleaved image-text content from four representative domains and contains 6,161 high-quality questions. Moreover, we perform a six-type error analysis, enabling fine-grained attribution of failures in interleaved image-text understanding to the specific capabilities missing in current MLLMs.

2604.21345 2026-05-14 cs.AI cs.CL

Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Philip Zhong, Don Wang, Jason Zhang

AI总结 本文提出了一种可复用的跨领域评估系统,用于评估AI会议摘要的质量,系统整合了结构化真实标签构建、固定候选生成、基于主张的评分、持久化报告以及隐私保护的在线监控与提名接口。通过在114场会议数据上进行测试,研究发现不同模型在准确性方面差异不显著,但在保留率方面,gpt-5.1模型表现出更高的完整性和覆盖率。该工作为AI会议摘要的评估提供了一套标准化且可扩展的评估框架。

Comments AI Application Feature Quality Evaluation (28 pages total)

详情
英文摘要

Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected p-values 0.053-0.448), although gpt-4.1-mini has the highest mean accuracy (0.583); the significant separation is on retention, where gpt-5.1 leads on completeness (0.886) and coverage (0.942). Typed slices isolate whitehouse_press_briefings as an accuracy-hard regime, and a later focused rerun over gpt-4.1, gpt-5-mini, and gpt-5.4 reuses the same stack under the same judges and metrics. This extended preprint keeps those core results aligned with the formal submission while adding a more detailed repository-level account of cross-domain reuse from the companion AI-search paper and an additional typed DeepEval contrastive analysis. Model naming note. Running text uses canonical model names on first introduction. Tables, filenames, and artifact IDs retain compact report labels for consistency with the packaged benchmark outputs. Table A maps the two conventions and is repeated in Section 4.3 where candidate-generation settings are defined.

2604.17895 2026-05-14 cs.RO

Locomotion of an Elastic Snake Robot via Natural Dynamics

Tristan Ehlert, Arne Sachtler, Annika Schmidt, Davide Calzolari, Alin Albu-Schäffer

AI总结 本文研究了如何利用弹性蛇形机器人的自然动力学特性设计高效运动模式。通过引入特征流形理论,作者分析了系统的非线性动力学行为,并提出了两种基于自然动力学的步态。实验表明,在无摩擦的理想情况下,基于非制动周期轨迹的步态具有完美的能量效率,而在更现实的有摩擦场景中,该步态相比传统刚性系统步态也表现出更高的效率,为基于自然动力学的步态设计提供了有价值的参考。

详情
英文摘要

Nature suggests that exploiting the elasticities and natural dynamics of robotic systems could increase their locomotion efficiency. Prior work on elastic snake robots supports this hypothesis, but has not fully exploited the nonlinear dynamic behavior of the systems. Recent advances in eigenmanifold theory enable a better characterization of the natural dynamics in complex nonlinear systems. This letter investigates if and how the nonlinear natural dynamics of a kinematic elastic snake robot can be used to design efficient gaits. Two types of gaits based on natural dynamics are presented and compared to a state-of-the-art approach using dynamics simulations. The results reveal that a gait generated by switching between two nonlinear normal modes does not improve the locomotion efficiency of the robot. In contrast, gaits based on non-brake periodic trajectories (non-brake orbits) are perfectly efficient in the energy-conservative case. Further simulations with friction reveal that, in a more realistic scenario, non-brake orbit gaits achieve higher efficiency compared to the baseline gait on the rigid system. Overall, the investigation offers promising insights into the design of gaits based on natural dynamics, fostering further research.

2604.09025 2026-05-14 cs.CV cs.AI

Skill-Conditioned Visual Geolocation for Vision-Language Models

Chenjie Yang, Yutian Jiang, Yutong Deng, Chenyu Wu

AI总结 该研究针对视觉语言模型在地理定位任务中缺乏结构化地理推理和自主进化能力的问题,提出了一种无需训练的GeoSkill框架。该方法基于一个可演进的技能图(Skill-Graph),通过提炼人类专家轨迹生成自然语言技能,并利用推理模型进行引导式推理。同时,通过自主进化机制,从大规模网络数据中不断生成和优化技能,提升地理定位的准确性和推理可信度,显著增强了模型对真实地理知识的理解与泛化能力。

详情
英文摘要

Vision-language models (VLMs) have shown a promising ability in image geolocation, but they still lack structured geographic reasoning and the capacity for autonomous self-evolution. Existing methods predominantly rely on implicit parametric memory, which often exploits outdated knowledge and generates hallucinated reasoning. Furthermore, current inference is a "one-off" process, lacking the feedback loops necessary for self-evolution based on reasoning outcomes. To address these issues, we propose GeoSkill, a training-free framework based on an evolving Skill-Graph. We first initialize the graph by refining human expert trajectories into atomic, natural-language skills. For execution, GeoSkill employs an inference model to perform direct reasoning guided by the current Skill-Graph. For continuous growth, an Autonomous Evolution mechanism leverages a larger model to conduct multiple reasoning rollouts on image-coordinate pairs sourced from web-scale data and verified real-world reasoning. By analyzing both successful and failed trajectories from these rollouts, the mechanism iteratively synthesizes and prunes skills, effectively expanding the Skill-Graph and correcting geographic biases without any parameter updates. Experiments demonstrate that GeoSkill achieves promising performance in both geolocation accuracy and reasoning faithfulness on GeoRC, while maintaining superior generalization across diverse external datasets. Furthermore, our autonomous evolution fosters the emergence of novel, verifiable skills, significantly enhancing the system's cognition of real-world geographic knowledge beyond isolated case studies.

2604.08944 2026-05-14 cs.LG cs.MA

Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication

Benjamin Amoh, Geoffrey Parker, Wesley Marrero

AI总结 该研究提出了一种名为 SeqComm-DFL 的多智能体决策聚焦学习方法,旨在提升部分可观测环境下智能体之间的协作效率。该方法通过价值感知的序列通信机制,使智能体在优先级顺序下生成有助于提升决策质量的消息,并结合Stackelberg条件进行信息传递。研究还扩展了最优模型设计框架,结合QMIX分解实现高效端到端训练,并在多个基准任务中显著提升了累积奖励和胜率。

Comments 9 pages, 2 figues, 1 table, neurips 2026

详情
英文摘要

Multi-agent coordination under partial observability requires agents to share complementary private information. While recent methods optimize messages for intermediate objectives (e.g., reconstruction accuracy or mutual information), rather than decision quality, we introduce \textbf{SeqComm-DFL}, unifying the sequential communication with decision-focused learning for task performance. Our approach features \emph{value-aware message generation with sequential Stackelberg conditioning}: messages maximize receiver decision quality and are generated in priority order, with agents conditioning on their predecessors. The \emph{guidance potential} determined by their prosocial ordering. We extend Optimal Model Design to communication-augmented world models with QMIX factorization, enabling efficient end-to-end training via implicit differentiation. We prove information-theoretic bounds showing that communication value scales with coordination gaps and establish $\mathcal{O}(1/\sqrt{T})$ convergence for the bilevel optimization, where $T$ denotes the number of training iterations. On collaborative healthcare and StarCraft Multi-Agent Challenge (SMAC) benchmarks, SeqComm-DFL achieves four to six times higher cumulative rewards and over 13\% win rate improvements, enabling coordination strategies inaccessible under information asymmetry.

2604.08039 2026-05-14 cs.CV cs.AI cs.LG

LINE: LLM-based Iterative Neuron Explanations for Vision Models

Vladimir Zaigrajew, Michał Piechota, Gaspar Sekula, Paweł Gelar, Przemysław Biecek

AI总结 本文提出了一种基于大语言模型的迭代神经元解释方法LINE,用于对视觉模型中的神经元进行开放词汇的概念标注。LINE在黑盒设置下,通过语言模型和图像生成器迭代生成并优化概念描述,无需模型训练,能够发现传统预定义词汇表中遗漏的概念,并在多个数据集上取得了优于现有方法的性能。该方法不仅能够识别每个神经元的主要概念,还能提供完整的生成历史,支持多义性评估和生成可视化解释。

详情
英文摘要

Interpreting individual neurons in deep neural networks is a crucial step towards understanding their complex decision-making processes and ensuring AI safety. Despite recent progress in neuron labeling, existing methods often limit the search space to predefined concept vocabularies or produce overly specific descriptions that fail to capture higher-order, global concepts. We introduce LINE, a novel, training-free iterative approach tailored for open-vocabulary concept labeling in vision models. Operating in a strictly black-box setting, LINE leverages a large language model and a text-to-image generator to iteratively propose and refine concepts in a closed loop, guided by activation history. LINE achieves state-of-the-art performance across multiple model architectures, yielding AUC improvements of up to 0.11 on ImageNet and 0.05 on Places365, while discovering, on average, 27% of new concepts missed by predefined vocabularies. Beyond identifying the top concept, LINE provides a complete generation history, enabling polysemanticity evaluation and producing visual explanations that rival gradient-dependent activation maximization methods. The source code will be made available soon.

2604.04692 2026-05-14 cs.CL cs.AI cs.CV

Is a Picture Worth a Thousand Words? Adaptive Multimodal Fact-Checking with Visual Evidence Necessity

Jaeyoon Jung, Yejun Yoon, Kunwoo Park

AI总结 本文研究了在多模态事实核查任务中是否应普遍使用视觉证据的问题,挑战了现有研究中“视觉证据总是有助于提升性能”的假设。为此,作者提出了AMuFC框架,通过两个协作的视觉-语言模型,分别用于判断是否需要视觉证据以及基于证据进行事实验证,从而实现对视觉证据的自适应使用。实验表明,该方法在三个数据集上显著提升了事实核查的准确性。

Comments preprint, 18 pages

详情
英文摘要

Automated fact-checking is a crucial task that supports a responsible information ecosystem. While recent research has progressed from text-only to multimodal fact-checking, a prevailing assumption is that incorporating visual evidence universally improves performance. In this work, we challenge this assumption and show that the indiscriminate use of multimodal evidence can reduce accuracy. To address this challenge, we propose AMuFC, a multimodal fact-checking framework that employs two collaborative vision-language models with distinct roles for the adaptive use of visual evidence: an Analyzer determines whether visual evidence is necessary for claim verification, and a Verifier predicts claim veracity conditioned on both the retrieved evidence and the Analyzer's assessment. Experimental results on three datasets show that incorporating the Analyzer's assessment of visual evidence necessity into the Verifier's prediction yields substantial improvements in verification performance. We will release all code and datasets at https://github.com/ssu-humane/AMuFC.

2604.04667 2026-05-14 cs.CV cs.LG cs.RO

ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

Selim Ahmet Iz, Francesco Nex, Norman Kerle, Henry Meissner, Ralf Berger

AI总结 本文提出了一种名为ZeD-MAP的框架,用于实现实时无人机航拍图像的高精度深度重建。该方法结合零样本扩散模型与增量聚类式光束法平差(BA),在无需任务特定再训练的情况下,提升了深度估计的度量一致性和时间连续性。实验表明,该方法在高分辨率航拍图像上实现了亚米级精度,且单帧处理时间在1.47到4.91秒之间,适用于实时三维地图生成。

详情
英文摘要

Real-time depth reconstruction from ultra-high-resolution UAV imagery is essential for time-critical geospatial tasks such as disaster response, yet remains challenging due to wide-baseline parallax, large image sizes, low-texture or specular surfaces, occlusions, and strict computational constraints. Recent zero-shot diffusion models offer fast per-image dense predictions without task-specific retraining, and require fewer labelled datasets than transformer-based predictors while avoiding the rigid capture geometry requirement of classical multi-view stereo. However, their probabilistic inference prevents reliable metric accuracy and temporal consistency across sequential frames and overlapping tiles. We present ZeD-MAP, a cluster-level framework that converts a test-time diffusion depth model into a metrically consistent, SLAM-like mapping pipeline by integrating incremental cluster-based bundle adjustment (BA). Streamed UAV frames are grouped into overlapping clusters; periodic BA produces metrically consistent poses and sparse 3D tie-points, which are reprojected into selected frames and used as metric guidance for diffusion-based depth estimation. Validation on ground-marker flights captured at approximately 50 m altitude (GSD is approximately 0.85 cm/px, corresponding to 2,650 square meters ground coverage per frame) with the DLR Modular Aerial Camera System (MACS) shows that our method achieves sub-meter accuracy, with approximately 0.87 m error in the horizontal (XY) plane and 0.12 m in the vertical (Z) direction, while maintaining per-image runtimes between 1.47 and 4.91 seconds. Results are subject to minor noise from manual point-cloud annotation. These findings show that BA-based metric guidance provides consistency comparable to classical photogrammetric methods while significantly accelerating processing, enabling real-time 3D map generation.

2604.02022 2026-05-14 cs.AI

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

AI总结 ATBench 是一个用于评估和诊断基于大语言模型的智能体安全性的多样化且真实的轨迹基准。该基准通过风险来源、失败模式和现实危害三个维度系统地组织风险,并采用异构工具池和长上下文延迟触发机制,构建出具有多阶段真实风险演进的轨迹数据。ATBench 包含 1000 条轨迹,涵盖丰富的交互场景和工具调用,数据经过规则和大模型过滤并由人工全面审核,能够有效评估先进模型在长期交互中的安全表现,并支持分层分析和跨基准比较。

详情
英文摘要

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

2604.01690 2026-05-14 cs.AI

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Tian Yang, Yang Song, Yongdong Zhang, Fuli Feng

AI总结 本研究探讨了人工智能生成内容(AIGC)对在线内容生态的影响,通过分析中国主流视频平台上的海量用户数据,揭示了AIGC与人类生成内容(HGC)在创作与消费行为上的显著差异。研究发现,尽管用户更偏好HGC,但AIGC创作者通过高产量策略仍能获得与HGC相当的总体互动量,算法推荐机制在其中起到了调节作用。研究建议引入对AIGC敏感的推荐算法和精准治理框架,以保障在线平台内容生态的长期健康发展。

Comments update authors in v2

详情
英文摘要

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidated the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identified a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovered the ability of the algorithmic content distribution mechanism in moderating these competing interests regarding AIGC. These findings advocated for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of the online content platforms.

2604.00001 2026-05-14 cs.LG cs.AI cs.CL

Filter-then-Weight: Online Data Selection and Reweighting for LLM Fine-Tuning

Fangxin Wang, Peyman Baghershahi, Langzhou He, Henry Peng Zou, Sourav Medya, Philip S. Yu

AI总结 本文研究了大语言模型在线微调中的数据选择与重加权问题,提出了一种基于优化器状态的在线数据选择框架。核心方法是将数据选择视为根据当前优化器状态塑造下一步更新方向的问题,并设计了两阶段的Filter-then-Weight算法,先筛选几何上有用的样本,再优化其权重系数。该方法通过引入因子化梯度表示和优化矩阵计算,有效提升了在线微调的收敛效率和下游任务性能。

Comments 24 pages, 2 figures, 9 tables

详情
英文摘要

Gradient-based data selection offers a principled framework for estimating sample utility in large language model (LLM) fine-tuning, but existing methods are mostly designed for offline settings. They are therefore less suited to online fine-tuning, where data arrives sequentially, sample utility is step-dependent, and the effective update geometry is shaped by adaptive optimizers. We propose an optimizer-aware framework for gradient-based online data selection and reweighting in LLM fine-tuning. Our key idea is to view online selection not as static sample ranking, but as shaping the next target-oriented update under the current optimizer state. We formulate this as an optimizer-aware update-matching problem, establish its connection to second-order target utility, and show why subset-level construction must account for interactions and redundancy among selected samples. Based on this view, we develop a two-stage Filter-then-Weight algorithm that first filters geometrically useful candidates and then optimizes their coefficients. To make the framework practical for LLMs, we introduce a factorized outer-product gradient representation and optimized matrix computations for long-context data. Experiments show that our method consistently improves convergence and downstream performance over existing online data selection baselines under the same data budget.

2603.29917 2026-05-14 cs.CV

Diffusion-Based Feature Denoising with NNMF for Robust handwritten digit multi-class classification

Hiba Adil Al-kharsan, Róbert Rajkó

AI总结 本文提出了一种结合扩散驱动特征去噪与混合特征表示的鲁棒手写数字多分类框架。通过非负矩阵分解(NNMF)将输入图像转换为可解释的特征表示,同时利用卷积神经网络提取深层特征,并将两者融合为统一的混合特征表示。在特征空间中引入逐步扩散噪声并训练去噪网络以恢复干净特征,从而提升模型对噪声和对抗攻击的鲁棒性。实验结果表明,该方法在基准和对抗环境下均表现出优越的分类性能。

详情
英文摘要

This work presents a robust multi-class classification framework for handwritten digits that combines diffusion-driven feature denoising with a hybrid feature representation. Inspired by our previous work on brain tumor classification, the proposed approach operates in a feature space to improve the robustness to noise and adversarial attacks. This manuscript is submitted as an extended abstract rather than a full-length press-ready paper. First, the input images are converted into tight, interpretable exemplification using Non-negative Matrix Factorization (NNMF). In parallel, special deep features are extracted using a computational neural network (CNN). These integral features are combined into a united hybrid representation. The main objective of this work is to extend our previously validated two-class framework to a multi-class handwritten digit classification scenario. To improve robustness, a step diffusion operation is used in the feature space by gradually adding Gaussian noise. A feature denoiser network is trained to reverse this operation and rebuild clean representations from tilted inputs. The courteous features are then applied for multi-class classification. The suggested method is evaluated in both baseline and adversarial settings using AutoAttack. The experimental outcome present that the diffusion-based hybrid model is both effective and robust, the CNN baseline models outperforming while maintain powerful classification performance. These results explain the activity of feature-level diffusion defense for reliable multi-class handwritten digit classification.

2603.27134 2026-05-14 cs.LG

Factorization Regret mediates compositional generalization in latent space

John Schwarcz

AI总结 本文研究了在已知所有相关变量的情况下,泛化仍可能面临的障碍,提出了一种将组合泛化视为潜在变量间参数化相互作用的变分推断问题的框架。通过构建认知网格世界环境,作者引入了“分解遗憾”这一信息论指标,用于衡量潜在变量相互作用对任务表现的影响,并发现RNN中显式提供交互信息可解释不同网络结构间的性能差异。进一步提出了一种新的架构——表示分类链(RCCs),能够分离变量推断与参数估计,从而在无需显式交互信息的情况下实现组合泛化与新动作空间的离线学习,为研究通用目标导向智能体提供了理论基础。

详情
英文摘要

Are there still barriers to generalization once all of the relevant variables are known? We address this question via a framework that casts compositional generalization as a variational inference problem over latent variables with parametric interactions. To explore this framework, we develop the Cognitive Gridworld, a stationary Partially Observable Markov Decision Process (POMDP) in which observations are generated jointly by multiple latent variables, yet feedback is provided only for a single goal variable. This setting allows us to describe Factorization Regret: an information-theoretic quantity that measures the contribution of latent variable interactions to task performance. Using this metric, we first analyze Recurrent Neural Networks (RNNs) that are explicitly provided with the interactions and find that Factorization Regret explains the accuracy gap between Echo State and Fully Trained networks. Additionally, our analysis uncovers a theoretically predicted failure mode, where confidence becomes decoupled from accuracy. These results suggest that utilizing the interactions between relevant variables is a non-trivial capability. We then address a harder regime where the interactions themselves must be learned by an embedding model. Learning how variables interact while learning how to infer their values is a variational inference problem. We approach this dilemma via Representation Classification Chains (RCCs), a novel architecture which disentangles variable inference and parameter estimation. We demonstrate that, by learning how variables interact, RCCs facilitate compositional generalization to novel combinations of relevant variables and offline learning in novel action spaces. Together, these results establish a theoretically grounded setting for researching, developing and evaluating goal-directed generalist agents.

2603.26839 2026-05-14 cs.LG cs.CV

From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Alberto G. Rodriguez Salgado

AI总结 该研究探讨了多模态模型在解决视觉空间任务时是依赖真正的规划能力,还是通过在文本空间中进行暴力搜索。为此,研究者提出了一个名为 MazeBench 的基准测试,包含 110 个程序生成的迷宫图像,并评估了来自 OpenAI、Anthropic、Google 和阿里巴巴的 16 种模型配置。实验发现,尽管某些模型在视觉迷宫任务中表现出高准确率,但其解题方式主要是将图像转换为文本网格,再逐步枚举路径,而非真正的空间规划,揭示了高准确率并不意味着具备人类水平的空间理解能力。

Comments 15 pages, 10 figures. Code and mazes available at https://github.com/alrod97/LLMs_mazes

详情
英文摘要

How do multimodal models solve visual spatial tasks -- through genuine planning, or through brute-force search in token space? We introduce \textsc{MazeBench}, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91\% and Gemini 3.1 Pro 79\%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consuming 1,710--22,818 tokens per solve for a task humans do quickly. Without added reasoning budgets, all configurations score only 2--12\%; on 20$\times$20 ultra-hard mazes, they hit token limits and fail. Qualitative traces reveal a common two-stage strategy: image-to-grid translation followed by token-level search, effectively BFS in prose. A text-grid ablation shows Claude Sonnet 4.6 rising from 6\% on images to 80\% when given the correct grid, isolating weak visual extraction from downstream search. When explicitly instructed not to construct a grid or perform graph search, models still revert to the same enumeration strategy. \textsc{MazeBench} therefore shows that high accuracy on visual planning tasks does not imply human-like spatial understanding.

2603.25340 2026-05-14 cs.CL

Large Language Model as Token Compressor and Decompressor

Wenbing Li, Yiran Wang, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin, Wei Yang

AI总结 本文研究了如何将现成的大语言模型(LLM)适配为用于长文本处理的离散可变长度编码器和解码器。作者设计了一种自表达的自编码框架,通过轻量的LoRA适配器对预训练LLM进行微调,将长文本映射为紧凑的潜在编码序列(Z-tokens),并能将其解码回自然语言或任务输出。该方法在保持重建质量和下游任务性能的同时,有效减少了上下文长度、生成阶段的内存使用和端到端延迟,为高效长文本推理提供了实用的接口。

详情
英文摘要

In this paper, we study whether an off-the-shelf LLM can be adapted into a discrete, variable-length token compressor and decompressor for long-context processing. To this end, we design a self-expressive autoencoding framework that fine-tunes a pretrained LLM with lightweight LoRA adapters to map long texts into compact sequences of learned latent codes, termed Z-tokens, and to decode them back into natural language or task outputs. The resulting representation is content-adaptive: less predictable or information-dense segments can receive more Z-tokens, while redundant regions can be represented more compactly through a budget-aware length regularizer. Our method is evaluated on long-context datasets such as Wikipedia, CNN/DailyMail, HotpotQA, and QuALITY, showing that it preserves reconstruction quality and downstream performance while reducing effective context length, generation-stage memory usage, and end-to-end latency. This simple design supports both direct decoding from compressed contexts and autoregressive generation in the Z-token space, providing a practical interface for efficient long-context inference.

2603.24125 2026-05-14 cs.CL

Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki

AI总结 本研究探讨了大型语言模型在训练过程中学习到的社会规范如何导致性别偏见,并指出现有去偏方法主要关注生成输出中的偏见,而未涉及模型内部表示。为此,作者提出一个统一框架,通过相同中性提示同时分析模型内在和外在的性别偏见,发现对齐方法虽能减少输出中的偏见,但模型内部仍可能存在可被激活的性别关联。研究进一步表明,基于结构化基准的去偏效果在实际应用场景中可能并不稳定。

详情
英文摘要

During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

2603.22910 2026-05-14 cs.CL

EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

Shiyu Ji, Yixuan Wang, Yijun Liu, Qingfu Zhu, Wanxiang Che

AI总结 随着大语言模型在长上下文应用中对Key-Value(KV)缓存的内存需求不断增长,如何高效压缩KV缓存成为关键问题。本文提出了一种灵活的KV缓存压缩框架EchoKV,通过利用注意力头内部和跨层的相似性,采用轻量网络从部分子集重构被丢弃的KV组件,从而支持按需切换全缓存与压缩缓存模式。实验表明,EchoKV在多种压缩比和模型架构下均优于现有方法,同时在短上下文场景中保持了全缓存推理的吞吐量。

详情
英文摘要

The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank KV compression methods reduce this footprint by modifying model projections, limiting the flexibility to switch back to standard full-cache inference when sufficient memory is available. In this paper, we propose EchoKV, a flexible KV cache compression framework that supports on-demand transitions from full KV caching to compressed caching. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the discarded KV components from a partial subset, exploiting intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a lightweight two-stage fine-tuning strategy, requiring only a few minutes on a single A100 GPU for a 7B model. Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across multiple compression ratios and backbone models while preserving the throughput of full-cache inference in short-context scenarios.

2603.22665 2026-05-14 cs.CL cs.LG

Improving LLM Final Representations with Inter-Layer Geometry

Tom Ulanovski, Eyal Blyachman, Maya Bechler-Speicher

AI总结 本文研究了如何改进基于大语言模型(LLM)的预测性能,通过更有效地利用模型各层的表示信息。传统方法仅使用最终层的表示,而作者提出使用图神经网络(GNN)在LLM各层之间建立连接,以更高效地聚合跨层信息。进一步地,他们引入了基于SL(2, Zn)的Cayley图结构的Cayley-Encoder,显著提升了预测性能与效率,并在多个任务和模型上验证了其有效性,同时保持参数增长极小。

Comments 17 pages, 4 figures. Equal contribution by first two authors

详情
英文摘要

The standard in LLM-based prediction is to use the final-layer representation as the input to a downstream predictor. However, intermediate layers may encode complementary task-relevant signals. Existing approaches therefore either search for the best layer for each task or apply expensive attention-based mechanisms to learn inter-layer aggregation. In this work, we first show that such complexity is unnecessary: a lightweight Graph Neural Network over a fully connected graph of LLM layers is more efficient and achieves significantly stronger predictive performance than existing approaches. We then introduce the Cayley-Encoder, which further improves both efficiency and predictive performance by replacing the fully connected graph with a Cayley graph over SL(2, Zn). These Cayley graphs provide a mathematically grounded topology that is sparse, regular by construction, and has low diameter. This enables effective communication across layers while constraining the aggregation structure and reducing the risk of GNN overfitting. In an evaluation of Cayley-Encoder across 13 tasks and 9 LLMs, Cayley-Encoder consistently outperforms baselines, achieving improvements of up to 40 percentage points in accuracy, while introducing at most 0.1% additional parameters relative to the LLM size. We further show that Cayley-Encoder is effective in few-shot regimes. Finally, we show that Cayley-Encoder outperforms LoRA fine-tuning while operating on the frozen LLM. We conclude with an explainability analysis showing that multiple layers contribute meaningfully to the final prediction, supporting our hypothesis.

2603.22364 2026-05-14 cs.LG cs.AI cs.CV

MCLR: Improving Conditional Modeling via Inter-Class Likelihood-Ratio Maximization and Unifying Classifier-Free Guidance with Alignment Objectives

Xiang Li, Yixuan Jia, Xiao Li, Jeffrey A. Fessler, Rongrong Wang, Qing Qu

AI总结 本文提出了一种名为MCLR的新训练目标,旨在通过最大化类间似然比来提升扩散模型的条件生成能力。该方法解决了标准去噪分数匹配(DSM)在类间分离不足的问题,并在训练过程中引入对齐目标,使模型在无需推理时引导(CFG)的情况下也能获得更优的条件生成效果。理论分析表明,CFG引导的分数实际上是针对样本自适应加权MCLR目标的最优解,从而揭示了CFG与对齐目标之间的内在联系。

详情
英文摘要

Diffusion models achieve strong performance in generative modeling, but their success often relies heavily on classifier-free guidance (CFG), an inference-time heuristic that modifies the sampling trajectory. In theory, diffusion models trained with standard denoising score matching (DSM) should recover the target data distribution, raising two fundamental questions: (i) why is inference-time guidance necessary in practice, and (ii) can its underlying effect be internalized into a principled training objective? In this work, we argue that a key limitation of standard DSM is insufficient inter-class separation. To address this issue, we propose MCLR, an alignment objective that explicitly maximizes inter-class likelihood-ratios during training. Fine-tuning diffusion models with MCLR induces CFG-like improvements under standard sampling, substantially improving guidance-free conditional generation and narrowing the gap to inference-time CFG. Beyond these empirical benefits, we show theoretically that the CFG-guided score is exactly the optimal solution to a sample-adaptive weighted MCLR objective. This result connects CFG to alignment-based objectives, providing a mechanistic interpretation of CFG as an implicit inference-time contrastive alignment procedure.