arXivDaily arXiv daily academic digest, updated Monday through Friday
All subject categories 1981
2605.14488 2026-05-15 cs.AI

Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

Assaf Gerner, Netta Madvil, Nadav Barak, Alex Zaikman, Jonatan Liberman, Liron Hamra, Rotem Brazilay, Shay Tsadok, Yaron Friedman, Neal Harow, Noam Bresler, Shir Chorev, Philip Tannor, Lior Rokach

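AI summary This paper introduces Deepchecks, a comprehensive framework for evaluating retrieval-augmented generation (RAG) systems. The framework addresses the complexity of RAG evaluation through a multi-faceted approach, root cause analysis, and production monitoring, and aims to keep evaluation aligned with application-specific requirements so that systems improve in reliability, relevance, and user satisfaction.

Abstract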

Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. The Deepchecks framework addresses RAG evaluation through a multi-faceted approach, root cause analysis, and production monitoring. By ensuring alignment with application-specific requirements, it provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.

2605.14487 2026-05-15 cs.CV cs.AI

Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

Jiahao Tian, Yiwei Wang, Gang Yu, Chi Zhang

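AI summary This paper studies error accumulation and context loss in long-horizon autoregressive video generation and proposes Head Forcing, a training-free framework. The method identifies functionally distinct attention heads in the diffusion transformer and assigns tailored key-value cache strategies to the heads responsible for local detail refinement, structural stabilization, and long-range context aggregation, improving generation quality and efficiency. Experiments show that, at no additional training cost, it substantially extends video generation length, supports multi-prompt interactive synthesis, and outperforms existing baselines.

Abstract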

Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: https://jiahaotian-sjtu.github.io/headforcing.github.io/.

2605.14486 2026-05-15 cs.CV

Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

Yiheng Li, Yang Yang, Zichang Tan, Gao Li, Zhen Lei, Wenhao Wang

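AI summary As misuse of AI-generated images grows, broadly applicable image detection techniques are urgently needed. This paper proposes a GAN-based upsampling method that generates fake images aligned with reconstruction-based ones but with more diverse artifact patterns, compensating for the limited diversity of existing methods. To handle the domain shift between generation methods, the study introduces the Separate Expert Fusion (SEF) framework, which fuses complementary features through domain-specific expert models and a gating network, markedly improving detection performance and generalization across generation methods.

Comments preprint

Abstract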

As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as those of GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Code is released at: https://github.com/liyih/SEF_AIGC_detection.

2605.14483 2026-05-15 cs.AI

LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

Xudong Chen, Yixin Liu, Hua Wei, Kaize Ding

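AI summary LEMON is an LLM-based multi-agent orchestrator that generates executable multi-agent orchestration specifications via counterfactual reinforcement learning. By integrating task-specific roles, duty assignment, capacity levels, and dependency structure, it improves overall execution efficiency and solution quality. LEMON performs strongly on six reasoning and coding benchmarks, achieving the best performance among the evaluated multi-agent orchestration methods.

Comments Submitted to NeurIPS 2026

Abstract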

Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (Learning Executable Multi-agent OrchestratioN via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at https://anonymous.4open.science/r/LEMON-B23C.
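The credit-assignment idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the group-standardized advantage is the usual GRPO-style quantity, while the way the counterfactual contrast is added to the edited span (the `token_credits` helper, its `weight` parameter, and the half-open span convention) is an assumption made purely for illustration.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def token_credits(num_tokens, advantage, edited_span, reward, cf_reward, weight=1.0):
    """Hypothetical per-token credit: the sequence-level advantage everywhere,
    plus the reward contrast (original minus counterfactual) on the edited
    span only. The additive form and `weight` are assumptions."""
    start, end = edited_span  # half-open token range [start, end)
    contrast = weight * (reward - cf_reward)
    return [
        advantage + (contrast if start <= t < end else 0.0)
        for t in range(num_tokens)
    ]

# Four sampled orchestration specifications with made-up rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# Editing tokens 2..3 of the first sample (say, a role field) drops its
# reward from 1.0 to 0.0, so only those tokens receive the extra credit.
credits = token_credits(6, advantages[0], (2, 4), reward=1.0, cf_reward=0.0)
```

Tokens outside the edited span keep the plain group-relative advantage, which is what distinguishes a localized signal from one spread over the whole specification.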

2605.14477 2026-05-15 cs.LG

Test-Time Learning with an Evolving Library

Weijia Xu, Alessandro Sordoni, Chandan Singh, Zelalem Gero, Michel Galley, Xingdi Yuan, Jianfeng Gao

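AI summary This paper proposes EvoLib, a test-time learning framework that lets large language models accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. The method maintains a shared knowledge library, automatically extracts modular skills and reflective insights from the model's own inference trajectories, and introduces a mechanism that balances immediate utility against long-term value, enabling continual refinement and generalization of knowledge. Experiments show that EvoLib substantially outperforms existing test-time learning methods on mathematical reasoning, code generation, and multi-turn agentic environments.

Abstract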

We introduce EvoLib, a test-time learning framework that enables large language models to accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. Instead of adapting model parameters, our approach maintains a shared library of knowledge abstractions, including modular skills and reflective insights, automatically extracted from the model's own inference trajectories. To support continual improvement, we introduce a principled weighting and consolidation mechanism that jointly optimizes for immediate utility and long-term value. This allows simple, instance-specific abstractions to evolve into more general and reusable ones over time. Across challenging benchmarks in mathematical reasoning, code generation, and multi-turn agentic environments, EvoLib improves substantially over the top test-time scaling and learning methods without ground-truth feedback.

2605.14475 2026-05-15 cs.CV

GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

Jiashun Zhu, Ronghao Fu, Jiasen Hu, Nachuan Xing, Xu Na, Xiao Yang, Zhiwen Lin, Weipeng Zhang, Lang Sun, Zhiheng Xue, Haoran Liu, Weijie Zhang, Bo Yang

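AI summary GeoVista is a visually grounded active perception framework for understanding ultra-high-resolution remote sensing images. It targets the tendency of existing methods to lose global context, revisit regions, or miss key regions when exploring large scenes. The method builds a global exploration plan, verifies candidate regions across multiple branches, and maintains an explicit evidence state to aggregate and de-duplicate information across regions. GeoVista introduces the APEX-GRO trajectory corpus and an Observe-Plan-Track mechanism, improving semantic understanding and question answering on remote sensing images and achieving state-of-the-art results on several benchmarks.

Abstract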

Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at https://github.com/ryan6073/GeoVista

2605.14467 2026-05-15 cs.LG

Focused PU learning from imbalanced data

Elias Zavitsanos, Georgios Paliouras

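AI summary This paper proposes a new positive-unlabeled (PU) learning method for highly imbalanced datasets, targeting classification problems with limited labeled data such as disease gene identification and fraud detection. The method introduces a focused empirical risk estimator that trains binary classifiers from positive and unlabeled examples, improving classification performance on imbalanced data. Experiments show strong performance across several imbalanced datasets and practical value in real-world applications such as financial misstatement detection.

Abstract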

We propose a new method of learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods, due to limited labeled data. Often, training data comprises positive and unlabeled instances, the latter typically dominated by negatives but also including several positives. While PU learning is well-studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator, incorporating both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms: selected completely at random (SCAR) and selected at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.

2605.14465 2026-05-15 cs.AI

From Table to Cell: Attention for Better Reasoning with TABALIGN

Tung Sum Thomas Kwok, Zeyong Zhang, Xinyu Wang, Chunhe Wang, Xiaofeng Lin, Hanwei Wu, Lei Ding, Guang Cheng, Zhijiang Guo

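AI summary This study addresses multi-step reasoning over structured tables and proposes TABALIGN, a framework that tackles the lack of an explicit cell-grounding mechanism between planning and execution. Its core method uses a diffusion language model (DLM) with bidirectional denoising as the planner, which emits reasoning steps as binary cell masks, and adds TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards that scores each step. Experiments show that TABALIGN substantially improves reasoning accuracy on multiple benchmarks and speeds up downstream reasoning execution.

Abstract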

Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.
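One simple way to read "attention overlap with the plan-designated mask" is as the share of attention mass that lands on the masked cells. The sketch below follows that reading; it is an assumption, not the TABATTN verifier itself (which is a trained model), and the function name and toy values are hypothetical.

```python
def attention_overlap(attention, cell_mask):
    """Fraction of attention mass that falls on plan-designated cells.

    attention: 2D list of non-negative weights, one per table cell.
    cell_mask: 2D list of 0/1 flags with the same shape.
    """
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    on_mask = sum(
        weight * flag
        for att_row, mask_row in zip(attention, cell_mask)
        for weight, flag in zip(att_row, mask_row)
    )
    return on_mask / total

# Toy 3x3 table: the plan step designates the first two cells of column 1.
attention = [
    [0.05, 0.40, 0.05],
    [0.05, 0.30, 0.05],
    [0.05, 0.03, 0.02],
]
plan_mask = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
score = attention_overlap(attention, plan_mask)  # roughly 0.7
```

Because the score sums over cells, reordering the rows of both the attention map and the mask together leaves it unchanged, which matches the permutation-invariance concern the abstract raises.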

2605.14462 2026-05-15 cs.CV

Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

Yubo Zhao, Yujin Chai, Yunao Dong, Chengfeng Zhao, Zijiao Zeng, Yuan Liu, Chi-Keung Tang

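AI summary This paper studies how to reconstruct physically plausible 4D human-object interaction (HOI) animation from monocular video, in support of applications such as 3D content creation and simulation-based learning. To address the shortcomings of existing methods in interaction consistency, contact stability, and physical plausibility, the authors propose the HA-HOI framework, which follows a "human-first, object-follow" strategy: human motion serves as the interaction anchor, the object's trajectory is reconstructed and refined relative to it, and the result is projected into a physics simulation for validation. The method markedly improves human-object alignment, contact consistency, and simulation readiness on benchmarks and in-the-wild videos, advancing interaction animation from visually plausible to physically plausible.

Abstract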

Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain visual artifacts while failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce HA-HOI, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a human-first, object-follow formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, HA-HOI improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: https://knoxzhao.github.io/real2sim_in_HOI/

2605.14461 2026-05-15 cs.CV

ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

Ledun Zhang, Yatu Ji, Xufei Zhuang, Xinying Yao

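AI summary ClickRemoval is an open-source interactive tool built on pretrained Stable Diffusion models that addresses object removal in diffusion models. Users only click to localize the target object and restore the background; no hand-drawn masks or text descriptions are needed. Through self-attention modulation during denoising, ClickRemoval achieves efficient, natural-looking removal in complex scenes, and experiments show competitive results on quantitative metrics and in user studies.

Comments 5 pages, 4 figures. Open-source software paper

Abstract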

Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at https://github.com/zld-make/ClickRemoval under the Apache-2.0 license.

2605.14458 2026-05-15 cs.AI

OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

Yeo Jeong Park, Hyemi Jang, Minseo Choi, Jongsun Lee, Jooyoung Choi, Yongkweon Jeon

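AI summary OmniDrop is a layer-wise token pruning method for omni-modal large language models that addresses the token explosion caused by high-resolution audio and video inputs. It prunes tokens progressively across the decoder layers rather than at the input embedding layer, which better preserves multimodal information fusion, and uses the text query to guide pruning for better task adaptivity. Experiments show that OmniDrop performs well on multiple benchmarks while significantly reducing prefill latency and memory usage.

Abstract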

Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input-level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.
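To make the layer-wise schedule concrete, here is a minimal sketch of progressive, query-guided pruning. It is not OmniDrop's scoring rule: cosine similarity to a single query vector stands in for whatever relevance score the method actually uses, the temporal diversity score is omitted, and the per-layer `keep_ratios` are made-up values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def prune_tokens(tokens, query, keep_ratio):
    """Keep the most query-similar tokens, preserving their original order."""
    keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: cosine(tokens[i], query),
        reverse=True,
    )
    return [tokens[i] for i in sorted(ranked[:keep])]

def layerwise_prune(tokens, query, keep_ratios):
    """Apply one pruning step per layer; early layers keep everything."""
    counts = []
    for ratio in keep_ratios:
        # A real decoder layer would update the token states here.
        tokens = prune_tokens(tokens, query, ratio)
        counts.append(len(tokens))
    return tokens, counts

# Eight toy 2-D "audiovisual tokens" and a text-query direction.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9],
          [0.7, 0.3], [-1.0, 0.0], [0.5, 0.5], [0.2, 0.8]]
query = [1.0, 0.0]
survivors, counts = layerwise_prune(tokens, query, keep_ratios=[1.0, 0.75, 0.5])
```

With these ratios the token count goes 8, 6, 3: nothing is removed in the first layer, and the more aggressive cut happens only after earlier layers have had a chance to mix information across modalities.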

2605.14455 2026-05-15 cs.AI cs.LG

Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact

Chandan Rajah, Neha Sengupta, Federico Castanedo, Robin Mills, Amit Bahree, Ramesh Krishnan Muthukrishnan, Larry Murray

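AI summary This paper proposes the Intelligence Impact Quotient (IIQ), a composite metric that quantifies how deeply AI systems are integrated into organizational workflows and what impact they have. IIQ combines factors such as a novelty-weighted token stock, usage frequency, recency, organizational leverage, task complexity, and autonomy to produce a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparing different users and units. The framework is intended as a trackable measurement tool for AI deployment in workflows, not as a direct measure of model capability or a substitute for causal productivity evaluation.

Abstract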

The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.

2605.14454 2026-05-15 cs.LG cs.CL cs.CR

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister, Long T. Le

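AI summary As AI agents expand from chat interfaces to systems that handle private data, call tools, and execute multi-step workflows, guardrails become the last line of defense against harm in real deployments. Conventional guardrails struggle with complex, shifting real-world scenarios; LiSA (Lifelong Safety Adaptation) is a conservative policy induction framework that improves the adaptability of a fixed base guardrail through structured memory. LiSA turns occasional failures into reusable policy abstractions and combines conflict-aware local rules with evidence-aware confidence gating, improving safety and generalization under sparse feedback and noise.

Comments 27 pages, 3 figures

Abstract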

As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency--performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
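The phrase "confidence gating via a posterior lower bound" can be illustrated with a small sketch. Everything specific here is an assumption rather than LiSA's actual formulation: a uniform Beta(1, 1) prior over a rule's reliability, a normal approximation to the Beta posterior, and the hypothetical `threshold` and `z` values.

```python
import math

def posterior_lower_bound(successes, failures, z=1.645):
    """Approximate lower bound on a memory rule's reliability.

    Assumes a Beta(1 + successes, 1 + failures) posterior (uniform prior)
    and a normal approximation: mean minus z standard deviations.
    """
    a = 1 + successes
    b = 1 + failures
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return max(0.0, mean - z * math.sqrt(variance))

def should_reuse(successes, failures, threshold=0.8):
    """Gate memory reuse on the lower bound, not on raw accuracy."""
    return posterior_lower_bound(successes, failures) >= threshold

# Both rules have 90% empirical accuracy, but different amounts of evidence.
few = posterior_lower_bound(9, 1)      # about 0.66
many = posterior_lower_bound(90, 10)   # about 0.84
```

The two rules tie on empirical accuracy, yet only the one backed by 100 observations clears the gate, which is the sense in which reuse scales with accumulated evidence rather than accuracy alone.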

2605.14449 2026-05-15 cs.LG cs.AI cs.CL

When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

Siyang Yao, Erhu Feng, Yubin Xia

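AI summary This paper studies hallucination detection in large language models and proposes QAOD, a single-pass framework that removes the question-aligned part of the answer representation and keeps the question-orthogonal component, suppressing domain-dependent variation. Combining diversity-penalized Fisher scoring with discriminative neuron selection, the method designs two complementary probing strategies, one for in-domain detection and one for cross-domain generalization. It performs strongly on multiple benchmarks and clearly outperforms existing methods in cross-domain settings.

Abstract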

Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD's joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
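The projection step described here is ordinary vector projection, sketched below under stated assumptions: the question and answer are each summarized by one vector of the same dimension, and the toy values are made up. How QAOD actually pools hidden states, selects layers, or picks neurons is not shown.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def question_orthogonal_component(answer, question):
    """Remove from `answer` its projection onto the `question` direction."""
    q_norm_sq = dot(question, question)
    if q_norm_sq == 0.0:
        return list(answer)  # no direction to project away
    scale = dot(answer, question) / q_norm_sq
    return [a - scale * q for a, q in zip(answer, question)]

# Toy 3-D representations (hypothetical values).
question_vec = [1.0, 0.0, 1.0]
answer_vec = [2.0, 3.0, 0.0]

residual = question_orthogonal_component(answer_vec, question_vec)
# residual is [1.0, 3.0, -1.0]; its inner product with question_vec is 0.
```

A probe trained on `residual` alone would correspond to the orthogonal-only setting in the abstract, while concatenating it with `question_vec` would correspond to the joint probe.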

2605.14448 2026-05-15 cs.CV cs.CL cs.IR

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

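AI summary This study proposes Think When Needed (TWN), a unified multimodal embedding framework that uses adaptive reasoning to improve embedding quality and efficiency. TWN adopts a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, reducing parameter overhead and avoiding gradient conflicts. A self-supervised routing gate lets the model decide per input whether to generate chain-of-thought (CoT) reasoning, avoiding the performance loss caused by redundant reasoning and substantially lowering inference cost. Experiments show that TWN achieves state-of-the-art embedding quality on the 78 tasks of MMEB-V2 while being more parameter- and inference-efficient than existing generative methods.

Comments 30 pages, preprint

Abstract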

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

2605.14445 2026-05-15 cs.LG

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

Runyuan He, Qiuyang Mang, Shang Zhou, Kaiyuan Liu, Hanchen Li, Huanzhi Mao, Qizheng Zhang, Zerui Li, Bo Peng, Lufeng Cheng, Tianfu Fu, Yichuan Wang, Wenhao Chai, Jingbo Shang, Alex Dimakis, Joseph E. Gonzalez, Alvin Cheung

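AI summary This paper presents FrontierSmith, a system for synthesizing open-ended coding problems at scale to improve large language models on open-ended coding tasks. The system iteratively evolves open-ended variants from existing closed-ended coding tasks (such as competitive programming problems) and uses a quantitative metric to select problems that elicit diverse solution approaches. Experiments show that training on the synthesized data substantially improves model performance on two open-ended coding benchmarks.

Abstract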

Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems' goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.

2605.14443 2026-05-15 cs.AI cs.LG cs.MA

Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

Krishna Sayana, Ketan Todi, Ambarish Jash

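AI summary This study addresses prompt engineering for frozen, "black-box" large language models (LLMs) and proposes a reinforcement learning framework that trains learned prompting policies through iterative distillation of experience. Using a contrastive experience buffer that couples scalar rewards with dense textual critiques, a lightweight prompter model is optimized to maximize task reward, amortizing iterative prompt refinement into single-shot policy weights. Experiments show substantial gains on multi-step reasoning and tool-use tasks, with higher sample efficiency than existing evolutionary baselines.

Comments 10 pages plus references and appendix

Abstract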

The shift toward interacting with frozen, "black-box" Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.

2605.14440 2026-05-15 cs.AI cs.FL cs.LO

Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

Debraj Chakraborty, Anirban Majumdar, Prince Mathew, Sayan Mukherjee, Jean-François Raskin

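AI summary This paper studies how to synthesize policies with formal guarantees for partially observable Markov decision processes (POMDPs). Because sampling-based methods lack formal correctness guarantees and formal synthesis methods scale poorly, it proposes a framework that combines sampling, automata learning, and model checking. Inspired by Angluin's $L^*$ algorithm, the method uses sampling for membership queries and model checking for equivalence queries; it can synthesize finite-state controllers when the sampling-induced policy is regular, and the framework is proven relatively complete. Experiments show that the method performs well on threshold-safety problems that existing tools struggle with.

Comments Paper accepted at 38th International Conference on Computer Aided Verification (CAV 2026), Lisbon, Portugal, July 2026

Abstract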

Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin's $L^*$ algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

2605.14438 2026-05-15 cs.AI

BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

Juntong Wu, Jialiang Cheng, Qishen Yin, Yue Dai, Yuliang Yan, Fuyu Lv, Ou Dan, Li Yuan

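AI summary BEAM (Binary Expert Activation Masking) is a new dynamic routing method that improves the inference efficiency of Mixture-of-Experts (MoE) architectures in large language models. It uses trainable binary masks to select experts dynamically per token and, with a straight-through estimator and an auxiliary regularization loss, induces expert sparsity through end-to-end training while preserving model performance. Experiments show that BEAM retains over 98% of the original model's performance while substantially reducing MoE-layer computation and improving decoding speed and throughput, making it an efficient, easily integrated practical solution.

Comments 22 pages, 12 figures

Abstract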

Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5$\times$ faster decoding and 1.4$\times$ higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
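
The abstract names two ingredients, a hard binary mask over experts and a straight-through estimator (STE) for training it. A minimal sketch of how these typically fit together is given below. It is not BEAM's implementation: the sigmoid surrogate gradient, the 0.5 threshold, the scalar expert outputs, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_mask_forward(logits, threshold=0.5):
    """Forward pass: one hard 0/1 decision per expert.

    An expert is kept when sigmoid(logit) exceeds the threshold
    (the threshold value is an assumption, not taken from the paper).
    """
    probs = [sigmoid(z) for z in logits]
    mask = [1.0 if p > threshold else 0.0 for p in probs]
    return mask, probs


def binary_mask_backward(grad_mask, probs):
    """Backward pass with a straight-through estimator.

    The hard threshold has zero gradient almost everywhere, so the STE
    pretends the mask was the soft probability and uses the sigmoid
    derivative p * (1 - p) as a surrogate gradient for each logit.
    """
    return [g * p * (1.0 - p) for g, p in zip(grad_mask, probs)]


def masked_moe_output(expert_outputs, gate_weights, mask):
    """Gate-weighted sum over experts; masked experts contribute nothing."""
    return sum(y * w * m for y, w, m in zip(expert_outputs, gate_weights, mask))


# Toy token with three experts (all numbers are made up for illustration).
mask_logits = [2.0, -1.0, 0.3]
expert_outputs = [1.0, 2.0, 3.0]
gate_weights = [0.5, 0.3, 0.2]

mask, probs = binary_mask_forward(mask_logits)
output = masked_moe_output(expert_outputs, gate_weights, mask)

# d(output)/d(mask_i) = expert_output_i * gate_weight_i
grad_mask = [y * w for y, w in zip(expert_outputs, gate_weights)]
grad_logits = binary_mask_backward(grad_mask, probs)
```

Note that the second expert is masked out in the forward pass yet still receives a nonzero gradient, which is what lets a mask logit recover if the expert turns out to be useful.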

2605.14427 2026-05-15 cs.CL cs.SD

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

Sunil Kumar Kopparapu

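AI summary This paper proposes a calculus-based framework for determining the vocabulary size in end-to-end automatic speech recognition (ASR) systems. The method fits a curve to the training data and applies the first- and second-derivative tests to formally estimate the vocabulary size, a key hyper-parameter. Experiments on the standard LibriSpeech corpus show that the method is effective and that a well-chosen vocabulary size improves ASR performance. The main contribution is a systematic method for determining the vocabulary size of end-to-end ASR systems.

Comments 8 pages, is an extension of the paper S. K. Kopparapu and A. Panda, A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system, in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024

Details
English abstract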

In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPnet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the first and second derivative tests from calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the usefulness of our approach by applying it to the standard LibriSpeech corpus and show that the optimal choice of the vocabulary size hyper-parameter improves ASR performance. The main contribution of this paper is a formal approach to identifying the vocabulary size best suited for training an end-to-end ASR system.
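
The curve-fitting and derivative-test idea can be illustrated with a small sketch. It assumes a quadratic cost curve and uses made-up (vocabulary size, cost) pairs; the paper's actual cost function, fitted family, and data are not given in the abstract, so everything named below is hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum of y*x^k
    m = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]

    def det3(q):
        return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
                - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
                + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))

    d = det3(m)
    coeffs = []
    for col in range(3):  # Cramer's rule, one unknown per column
        mc = [row[:] for row in m]
        for i in range(3):
            mc[i][col] = rhs[i]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)  # (a, b, c)


def stationary_point(a, b):
    """First-derivative test: solve 2*a*x + b = 0."""
    return -b / (2.0 * a)


def is_minimum(a):
    """Second-derivative test: the second derivative 2*a must be positive."""
    return 2.0 * a > 0.0


# Hypothetical measurements: vocabulary size in thousands vs. a cost value.
vocab_k = [1.0, 2.0, 3.0, 4.0, 5.0]
cost = [12.0, 10.5, 10.0, 10.5, 12.0]

a, b, c = fit_quadratic(vocab_k, cost)
best_vocab_k = stationary_point(a, b)
```

Working in thousands of tokens keeps the sums of fourth powers small, which avoids poor conditioning in the normal equations.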

2605.14423 2026-05-15 cs.LG cs.AI

Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic

Leo Muxing Wang, Pengkun Yang, Lili Su

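AI summary This paper studies collaborative yet personalized policy training in heterogeneous environments and proposes a single-timescale federated actor-critic framework. Agents share a common linear subspace representation while keeping personalized policy components, balancing collaborative optimization with personalization. The theoretical analysis establishes finite-time convergence and a linear speedup in the number of agents, and experiments validate the method on federated reinforcement learning tasks.

Details
English abstract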

Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $\gamma\in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated Hopper-v5 action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.
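
To see why a $1/\sqrt{TK}$ rate is called linear speedup, the critic rate can be inverted into a round count. This is only a sketch: the logarithmic factors hidden in $\tilde{\mathcal{O}}$ are absorbed into a constant $C$, which is introduced here for illustration and is not a quantity from the paper.

$$
\frac{C}{(1-\gamma)^4\sqrt{TK}} \le \varepsilon
\quad\Longleftrightarrow\quad
T \ge \frac{C^2}{(1-\gamma)^8\,K\,\varepsilon^2}
$$

For a fixed target accuracy $\varepsilon$, the required number of rounds therefore scales as $1/K$: doubling the number of agents halves the rounds needed.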

2605.14422 2026-05-15 cs.LG

What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

Shuqi Gu, Yongxiang Zhao, Baoyu Jing, Kan Ren

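AI summary This paper studies counterfactual time series forecasting under textual conditions, aiming to account for the influence of future events on forecasts and to make forecasting models more adaptable to complex, stochastic conditions. Because traditional methods overlook counterfactual scenarios and handle only simple structured conditions, the authors propose a comprehensive evaluation framework covering both factual and counterfactual settings, together with a text-attribution mechanism that distinguishes mutable from immutable factors to improve forecast accuracy. The framework can evaluate models even when no ground-truth time series is available.

Details
English abstract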

Time series forecasting has become increasingly critical in real-world scenarios, where future sequences are influenced not only by historical patterns but also by forthcoming events. In this context, forecasting must dynamically adapt to complex and stochastic future conditions, which introduces fundamental challenges in both forecasting and evaluation. Traditional methods typically rely on historical data or factual future conditions, while overlooking counterfactual scenarios. Furthermore, many existing approaches are restricted to simple structured conditions, limiting their ability to generalize to real-world complexities. To address these gaps, we introduce the task of counterfactual time series forecasting with textual conditions, enabling more flexible and condition-aware forecasting. We propose a comprehensive evaluation framework that encompasses both factual and counterfactual settings, even in the absence of ground truth time series. Additionally, we present a novel text-attribution mechanism that distinguishes mutable from immutable factors, thereby improving forecast accuracy under sophisticated and stochastic textual conditions. The project page is at https://seqml.github.io/TADiff/

2605.14420 2026-05-15 cs.AI

DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping

Pengyun Zhu, Yuqi Ren, Zhen Wang, Lei Yang, Deyi Xiong

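AI summary Current large language models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment, but such macro-level supervision often obscures value heterogeneity within a country and yields loose alignment. The paper proposes DVMap, a framework that uses multi-dimensional demographic constraints to identify groups with predictable, high-consensus value preferences and thereby achieve fine-grained pluralistic value alignment. The method introduces a demographic archetype extraction strategy and a Structured Chain-of-Thought mechanism, combined with Group Relative Policy Optimization (GRPO), and shows strong generalization and robustness in cross-demographic, cross-country, and cross-value settings.

Comments Accepted to the Main Conference of ACL 2026

Details
English abstract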

Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures intra-country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi-dimensional demographic constraints, which can identify groups with predictable, high-consensus value preference. To this end, we propose DVMap (High-Consensus Demographic-Value Mapping), a framework for fine-grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high-quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain-of-Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic-value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple-generalization benchmark (spanning cross-demographic, cross-country, and cross-value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross-demographic tests, Qwen3-8B-DVMap achieves 48.6% accuracy, surpassing the advanced open-source LLM DeepSeek-v3.2 (45.1%). The source code and dataset are available at https://github.com/EnlightenedAI/DVMap.
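
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that normalization only. The binary reward (1 when a response matches the group's value preference) is a hypothetical choice, and the use of the population standard deviation is an assumption; the abstract does not specify either.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are judged against their siblings and no learned value baseline is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers for one demographic profile; 1.0 marks an answer that
# matches the surveyed value preference (hypothetical reward).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every answer earns the same reward there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```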

2605.14416 2026-05-15 cs.AI

A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

Wen Wang, Xiangchen Wu, Liang Wang, Hao Hu, Xianping Tao

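AI summary This paper proposes a unified knowledge-embedded reinforcement learning framework for generalized Capacitated Vehicle Routing Problems (CVRPs). The framework builds on the Route-First Cluster-Second heuristic, uses dynamic programming to solve the clustering subproblem, and adds a history-enhanced context processing module to handle the partial observability introduced by decomposition. Experiments show that it achieves better solution quality than existing learning-based methods across a range of CVRP variants, with a smaller gap to classical heuristics, and generalizes well.

Details
English abstract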

The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
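
The cluster-second step of a Route-First Cluster-Second heuristic is classically solved with a dynamic program that cuts a single "giant tour" into capacity-feasible routes. The sketch below implements that textbook split procedure for plain CVRP with Euclidean distances. It is not the paper's implementation, and the function name, data layout, and toy instance are assumptions.

```python
import math


def split_tour(tour, demand, coords, depot, capacity):
    """Optimally cut a fixed customer order into capacity-feasible routes.

    best[j] is the cheapest cost of serving the first j customers of the tour;
    pred[j] records where the last route of that solution starts.
    """
    n = len(tour)
    best = [math.inf] * (n + 1)
    pred = [0] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        load = 0.0
        length = 0.0
        for j in range(i, n):
            customer = tour[j]
            load += demand[customer]
            if load > capacity:
                break  # a route tour[i..j] would exceed the vehicle capacity
            if j == i:
                length = math.dist(depot, coords[customer])
            else:
                length += math.dist(coords[tour[j - 1]], coords[customer])
            total = best[i] + length + math.dist(coords[customer], depot)
            if total < best[j + 1]:
                best[j + 1] = total
                pred[j + 1] = i
    # Walk the predecessor links backwards to recover the routes.
    routes = []
    j = n
    while j > 0:
        i = pred[j]
        routes.append(tour[i:j])
        j = i
    routes.reverse()
    return best[n], routes


# Toy instance: four unit-demand customers on a line, vehicle capacity 2.
coords = {1: (1.0, 0.0), 2: (2.0, 0.0), 3: (-1.0, 0.0), 4: (-2.0, 0.0)}
demand = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
cost, routes = split_tour([1, 2, 3, 4], demand, coords, (0.0, 0.0), 2.0)
```

Because the split is optimal for any given order, a learned solver only has to propose a good customer ordering. If a single customer's demand exceeds the capacity, the returned cost stays infinite.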

2605.14413 2026-05-15 cs.LG cs.AI

MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse

Donghwan Kim, Hyunsoo Yoon

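AI summary This paper proposes MahaVar, an out-of-distribution (OOD) detection method based on the variance of class-wise Mahalanobis distances. The authors observe that, for in-distribution samples, the class-wise Mahalanobis distances show a pronounced sharp-minimum structure (a small distance to the nearest class and large distances to all others), which yields high variance across classes, whereas OOD samples show a weaker structure and lower variance. Motivated by this observation and by Neural Collapse theory, MahaVar augments the conventional Mahalanobis distance with a class-wise distance variance term, improving OOD detection and achieving state-of-the-art results on CIFAR-100 and ImageNet.

Comments 29 pages, 8 figures

Details
English abstract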

Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID) samples, class-wise Mahalanobis distances exhibit a pronounced sharp minimum structure, where the distance to the nearest class is small while distances to all other classes remain large, resulting in high variance across classes. In contrast, OOD samples tend to exhibit a less pronounced sharp minimum structure, producing comparatively lower variance across classes. We further provide a theoretical analysis grounding this observation in Neural Collapse geometry: under relaxed Neural Collapse assumptions on within-class compactness and inter-class separation, ID samples are shown to structurally exhibit high class-wise distance variance, offering a theoretical basis for its use as an OOD score. Motivated by this observation and its theoretical backing, we propose MahaVar, a simple and effective post-hoc OOD detector that augments the Mahalanobis distance with a class-wise distance variance term. Following the OpenOOD v1.5 benchmark protocol, MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet, with consistent improvements in both AUROC and FPR@95 over existing Mahalanobis-based methods across all benchmarks.
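
The observation can be reproduced on a toy example. The sketch below computes class-wise Mahalanobis distances and their variance across classes, then combines the two into a score. The combination rule (negative nearest-class distance plus a weighted variance), the diagonal shared covariance, and the toy class means are all assumptions for illustration; the abstract does not give MahaVar's exact formula.

```python
import math
from statistics import pvariance


def class_distances(x, class_means, shared_var):
    """Mahalanobis distance from x to each class mean.

    A diagonal covariance shared by all classes is assumed, so the distance
    reduces to a per-dimension scaled Euclidean distance.
    """
    return [
        math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, shared_var)))
        for mu in class_means
    ]


def variance_augmented_score(x, class_means, shared_var, weight=1.0):
    """Higher means more in-distribution (hypothetical combination rule)."""
    d = class_distances(x, class_means, shared_var)
    return -min(d) + weight * pvariance(d)


# Three well-separated toy classes in a 2-D feature space.
means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
var = (1.0, 1.0)

id_sample = (0.5, 0.0)    # close to the first class mean
ood_sample = (3.0, 3.0)   # between the classes, close to none of them

id_var = pvariance(class_distances(id_sample, means, var))
ood_var = pvariance(class_distances(ood_sample, means, var))
id_score = variance_augmented_score(id_sample, means, var)
ood_score = variance_augmented_score(ood_sample, means, var)
```

On this toy data the in-distribution point has one small distance and two large ones, so its variance across classes is higher than that of the in-between point, matching the sharp-minimum structure described in the abstract.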

2605.14411 2026-05-15 cs.RO cs.AI

Energy-Efficient Quadruped Locomotion with Compliant Feet

Pramod Pal, Shishir Kolathaya, Ashitava Ghosal

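AI summary This study investigates whether compliant feet can improve the locomotion efficiency of quadruped robots while preserving stability. By integrating foot compliance into a reinforcement learning controller, the authors find that an intermediate foot stiffness reduces the mechanical energy consumed per meter traveled: in experiments, it lowers energy consumption by about 17% compared with very stiff or very flexible feet. The results indicate that well-chosen foot compliance improves the energy efficiency of quadruped robots.

Comments 29 pages, 7 figures, supplemental videos link is mentioned in the paper

Details
English abstract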

Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In hardware experiments on a quadruped that we developed, the energy consumption with the intermediate-stiffness spring is approximately 17% lower than with a very stiff or a very flexible spring in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.

2605.14407 2026-05-15 cs.AI

Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

Xiang Li

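AI summary This paper examines Metis AI, an often-overlooked "middle zone" of digital tasks: tasks that can be performed entirely on a computer but that resist reliable algorithmic automation because of their institutional, social, and normative entanglement. It proposes five structural characteristics of Metis AI and argues that the right response is "centaur architectures" in which humans lead and AI supports, rather than simply more automation.

Details
English abstract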

The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.

2605.14406 2026-05-15 cs.LG cs.CV

GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

Yuhao Liu, Sadeer Al-Kindi, Ashok Veeraraghavan, Guha Balakrishnan

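AI summary GeoViSTA is a multimodal model that combines remote sensing imagery with tabular data to learn unified geospatial representations. It uses bilateral cross-attention to exchange spatial and semantic information between the image and tabular modalities, with a geography-aware attention mechanism that aligns image patches with irregular census tracts. Trained with a self-supervised joint masked-reconstruction objective, GeoViSTA improves prediction performance on tasks such as disease-specific mortality and fire hazard frequency.

Details
English abstract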

Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA's unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.

2605.14405 2026-05-15 cs.LG math.DS

Watch your neighbors: Training statistically accurate chaotic systems with local phase space information

Joon-Hyuk Ko, Andrus Giraldo, Deok-Sun Lee

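AI summary This paper studies how to train statistically accurate surrogate models of chaotic systems using local phase-space information. The authors propose a framework that targets both accurate Jacobians and accurate long-term statistics: it constructs a local covering of the chaotic attractor in phase space and trains the surrogate by minimizing the distributional discrepancy between the surrogate and ground-truth dynamics on these coverings. Experiments show that the method improves Jacobian accuracy while remaining competitive with state-of-the-art statistically accurate dynamics learning methods.

Details
English abstract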

Chaotic systems pose fundamental challenges for data-driven dynamics discovery, as small modeling errors lead to exponentially growing trajectory discrepancies. Since exact long-term prediction is unattainable, it is natural to ask what a good surrogate model for chaotic dynamics is. Prior work has largely focused either on reproducing the Jacobian of the underlying dynamics, which governs local expansion and contraction rates, or on training surrogate models that reproduce the ground-truth dynamics' long-term statistical behavior. In this work, we propose a new framework that aims to bridge these two paradigms by training surrogate dynamics models with accurate Jacobians and long-term statistical properties. Our method constructs a local covering of a chaotic attractor in phase space and analyzes the expansion and contraction of these coverings under the dynamics. The surrogate model is trained by minimizing the maximum mean discrepancy between the pushforward distributions of the coverings under the surrogate and ground-truth dynamics. Experiments show that our method significantly improves Jacobian accuracy while remaining competitive with state-of-the-art statistically accurate dynamics learning methods. Our code is fully available at https://anonymous.4open.science/r/neighborwatch.
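
The training signal described above, a maximum mean discrepancy (MMD) between pushforward distributions of a local patch, can be sketched in a few lines. The example uses a one-dimensional logistic map as a stand-in for the ground-truth dynamics, a slightly mis-parameterized copy as the surrogate, a Gaussian kernel, and the biased MMD estimator. All of these are assumptions for illustration and are not taken from the paper.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=0.2):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, bandwidth) for p in ps for q in qs)
        return total / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def pushforward(step, points):
    """Advance every point of a patch by one step of the dynamics."""
    return [step(p) for p in points]


def true_step(p):
    """Stand-in ground truth: the logistic map with r = 4."""
    (x,) = p
    return (4.0 * x * (1.0 - x),)


def surrogate_step(p, r=3.9):
    """Stand-in surrogate: the same map with a slightly wrong parameter."""
    (x,) = p
    return (r * x * (1.0 - x),)


# A small neighborhood of phase-space points (one element of a local covering).
patch = [(0.30 + 0.002 * i,) for i in range(20)]

loss = mmd_squared(pushforward(surrogate_step, patch), pushforward(true_step, patch))
```

In a real training loop, `loss` would be summed over many patches and minimized with respect to the surrogate's parameters. Here it simply measures how differently the two maps move and stretch the same neighborhood.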

2605.14404 2026-05-15 cs.CL

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

Kyomin Hwang, Hyeonjin Kim, Sangyeon Cho, Nojun Kwak

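AI summary As large language models are increasingly deployed in commercial services, the privacy leakage they can cause is a growing concern. To address the inadequate evaluation of Multilingual Machine Unlearning (MMU), this paper proposes two new metrics, the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS), which measure the quality and consistency of information removal across languages. The authors evaluate a range of unlearning methods with these metrics, reveal phenomena specific to multilingual unlearning, and offer a new perspective on MMU evaluation.

Details
English abstract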

While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.