arXivDaily: arXiv daily academic digest, updated Monday through Friday
2511.18739 2026-05-15 cs.AI cs.LG stat.ML

A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

Kaixiang Yang, Jiarong Liu, Yupeng Song, Shuanghua Yang, Yujue Zhou

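AI summary Time series anomaly detection is widely used in the Internet of Things and cyber-physical systems, but its evaluation is difficult because application scenarios are diverse and metrics rest on different assumptions. This paper proposes a problem-oriented taxonomy of evaluation metrics that reinterprets existing metrics by the specific evaluation problem each one addresses, grouping them into six dimensions: accuracy, timeliness, tolerance to labeling imprecision, penalties for human-audit cost, robustness to randomness, and cross-dataset comparability. Experiments analyze metric behavior across scenarios and quantify each metric's ability to separate genuine detections from random noise. They show that most event-level metrics are strongly discriminative, whereas some commonly used metrics are sensitive to random-score inflation, so evaluation metrics should be chosen according to the needs of the task.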

Abstract

Time series anomaly detection is widely used in IoT and cyber-physical systems, yet its evaluation remains challenging due to diverse application objectives and heterogeneous metric assumptions. This study introduces a problem-oriented framework that reinterprets existing metrics based on the specific evaluation challenges they are designed to address, rather than their mathematical forms or output structures. We categorize over twenty commonly used metrics into six dimensions: 1) basic accuracy-driven evaluation; 2) timeliness-aware reward mechanisms; 3) tolerance to labeling imprecision; 4) penalties reflecting human-audit cost; 5) robustness against random or inflated scores; and 6) parameter-free comparability for cross-dataset benchmarking. Comprehensive experiments are conducted to examine metric behavior under genuine, random, and oracle detection scenarios. By comparing their resulting score distributions, we quantify each metric's discriminative ability -- its capability to distinguish meaningful detections from random noise. The results show that while most event-level metrics exhibit strong separability, several widely used metrics (e.g., NAB, Point-Adjust) demonstrate limited resistance to random-score inflation. These findings reveal that metric suitability must be inherently task-dependent and aligned with the operational objectives of IoT applications. The proposed framework offers a unified analytical perspective for understanding existing metrics and provides practical guidance for selecting or developing more context-aware, robust, and fair evaluation methodologies for time series anomaly detection.
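The random-score inflation reported for Point-Adjust can be reproduced on toy data. The sketch below assumes the commonly used point-adjust protocol (a ground-truth anomaly segment counts as fully detected if any point inside it is flagged); the segment positions, the 2% flag rate, and the function names are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np


def point_adjust(pred, label):
    """Mark a whole ground-truth segment as detected if any point in it is flagged."""
    pred = pred.copy()
    n = len(label)
    i = 0
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:
                j += 1                      # j is one past the end of the segment
            if pred[i:j].any():
                pred[i:j] = 1               # credit the entire segment
            i = j
        else:
            i += 1
    return pred


def f1(pred, label):
    """Point-wise F1 score for binary arrays."""
    tp = int(np.sum((pred == 1) & (label == 1)))
    fp = int(np.sum((pred == 1) & (label == 0)))
    fn = int(np.sum((pred == 0) & (label == 1)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy series with two long anomaly segments (hypothetical positions).
rng = np.random.default_rng(0)
label = np.zeros(10_000, dtype=int)
label[2_000:2_500] = 1
label[6_000:6_800] = 1

# A "detector" that flags 2% of points uniformly at random.
random_pred = (rng.random(label.size) < 0.02).astype(int)

f1_raw = f1(random_pred, label)
f1_pa = f1(point_adjust(random_pred, label), label)
print(f"raw F1 = {f1_raw:.3f}, point-adjusted F1 = {f1_pa:.3f}")
```

Because long segments are almost certain to contain at least one random flag, the adjusted score of a random detector ends up far above its raw point-wise score, which is the kind of separability failure the abstract describes.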

2511.17367 2026-05-15 cs.LG

R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao

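AI summary This paper studies how to design real-time pursuit strategies with worst-case robustness for pursuit-evasion games (PEGs) under partial observability. To address the shortcomings of existing methods under imperfect information and asynchronous moves, the authors propose R2PS, which combines dynamic programming with a belief preservation mechanism to extend traditional strategies to the partially observable setting, and embeds the mechanism in a state-of-the-art reinforcement learning framework. The resulting policy generalizes robustly to unseen graph structures without additional training and outperforms existing methods in experiments.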

Abstract

Computing worst-case robust strategies in pursuit-evasion games (PEGs) is time-consuming, especially when real-world factors like partial observability are considered. While important for general security purposes, real-time applicable pursuit strategies for graph-based PEGs are currently missing when the pursuers only have imperfect information about the evader's position. Although state-of-the-art reinforcement learning (RL) methods like Equilibrium Policy Generalization (EPG) and Grasper provide guidelines for learning graph neural network (GNN) policies robust to different game dynamics, they are restricted to the scenario of perfect information and do not take into account the possible case where the evader can predict the pursuers' actions. This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability. We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader. Then, we propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting. Finally, we embed the belief preservation into the state-of-the-art EPG framework to finish our R2PS learning scheme, which leads to a real-time pursuer policy through cross-graph reinforcement learning against the asynchronous-move DP evasion strategies. After reinforcement learning, our policy achieves robust zero-shot generalization to unseen real-world graph structures and consistently outperforms the policy directly trained on the test graphs by the existing game RL approach.

2511.15408 2026-05-15 cs.CL cs.AI cs.IR cs.MA cs.NE

Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

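AI summary This study addresses Chinese short-form creative content generation and proposes an explanation-oriented multi-objective optimization method for the difficulty of verifying outputs under personalized constraints. It models the task as a heterogeneous multi-objective optimization problem that jointly optimizes the generated content and the reliability of its explanations, and designs MAGIC-HMO, a training-free multi-agent framework that optimizes through iterative generation and verification. Experiments show that the method significantly outperforms existing models on tasks such as Chinese baby naming.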

Comments 19 pages, 10 figures. Submitted to ACM for possible publication

Abstract

Chinese demonstrates high semantic compactness and rich metaphorical expressiveness, enabling limited text to convey dense meanings while increasing the difficulty of generation and verification, particularly in short-form creative natural language generation (CNLG). In the real world, users often require personalized, fine-grained creative constraints, making reliable verification critical to guiding optimization. According to Brunswik's Lens Model from psychology, whether constraints are satisfied can be inferred from sufficient observable cues. Existing studies are mainly outcome-oriented, implicitly assuming that the outcome itself provides adequate cues for verification. However, this assumption breaks down in Chinese short-form CNLG (e.g., naming or advertising) with diverse personalized constraints, where extremely brief outcomes inherently offer limited information. Explanations can naturally serve as extra cues. Nevertheless, under complex constraints, LLMs' explanations may suffer from hallucination, incompleteness, or ambiguity. To address these issues, we formalize the Chinese short-form CNLG task as a heterogeneous multi-objective optimization (HMO) problem that jointly optimizes multiple personalized constraints and explanation reliability. We further propose MAGIC-HMO, a training-free multi-agent framework that optimizes these objectives through iterative generation and verification under an explanation-oriented multi-objective strategy. Experiments on Chinese Baby Naming, a challenging benchmark, demonstrate that MAGIC-HMO significantly outperforms six strong baselines across various LLM backbones. Relevant data and code are available at https://github.com/foolfun/MAGIC_HMO.

2511.14823 2026-05-15 cs.LG cs.CV

Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

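AI summary Current machine learning models perform well on static tasks but, because of rigid architectures, struggle with continual adaptation and lifelong learning in non-stationary environments. This paper proposes dynamic nested hierarchies, which let a model autonomously adjust the number of optimization levels, their nesting structure, and their update frequencies during training or inference, enabling self-evolution without predefined constraints. Supported by mathematical derivations and experiments, the approach shows superior performance on language modeling, continual learning, and long-context reasoning, and lays groundwork for adaptive general-purpose artificial intelligence.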

Comments 12 pages, 1 figure

Journal ref
Frontiers in Artificial Intelligence, 2026
Abstract

Contemporary machine learning models, including large language models, exhibit remarkable capabilities in static tasks yet falter in non-stationary environments due to rigid architectures that hinder continual adaptation and lifelong learning. Building upon the nested learning paradigm, which decomposes models into multi-level optimization problems with fixed update frequencies, this work proposes dynamic nested hierarchies as the next evolutionary step in advancing artificial intelligence and machine learning. Dynamic nested hierarchies empower models to autonomously adjust the number of optimization levels, their nesting structures, and update frequencies during training or inference, inspired by neuroplasticity to enable self-evolution without predefined constraints. This innovation addresses the anterograde amnesia in existing models, facilitating true lifelong learning by dynamically compressing context flows and adapting to distribution shifts. Through rigorous mathematical formulations, theoretical proofs of convergence, expressivity bounds, and sublinear regret in varying regimes, alongside empirical demonstrations of superior performance in language modeling, continual learning, and long-context reasoning, dynamic nested hierarchies establish a foundational advancement toward adaptive, general-purpose intelligence.

2511.13397 2026-05-15 cs.CV cs.AI

Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

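AI summary This paper presents Distance-Annotated Traffic Perception Question Answering (DTPQA), a visual question answering benchmark for evaluating the perception capabilities of vision-language models in traffic scenes. The benchmark contains a synthetic dataset and a real-world dataset, and annotates each question with the distance between the target object and the camera, which makes it possible to analyze perception performance at different distances. The work provides a new, targeted tool for evaluating model perception in automated driving.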

Journal ref
IEEE Data Descriptions, 2026
Abstract

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
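The four-field sample layout described above lends itself to a per-distance accuracy breakdown. The following is a minimal sketch: the `Sample` field names, the bin edges, and the exact-match scoring rule are assumptions made for illustration and are not taken from the released DTPQA scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One benchmark item; field names are hypothetical."""
    image_path: str
    question: str
    answer: str
    distance_m: float  # distance from the camera to the object in question


def accuracy_by_distance(samples, predictions, bin_edges=(0, 10, 20, 30, 40, 50)):
    """Return {(low, high): accuracy} for each distance bin that has samples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, pred in zip(samples, predictions):
        for low, high in zip(bin_edges[:-1], bin_edges[1:]):
            if low <= sample.distance_m < high:
                totals[(low, high)] += 1
                # Case-insensitive exact match (an assumed scoring rule).
                hits[(low, high)] += int(
                    pred.strip().lower() == sample.answer.strip().lower()
                )
                break
    return {b: hits[b] / totals[b] for b in sorted(totals)}


samples = [
    Sample("a.png", "Is there a pedestrian ahead?", "yes", 5.0),
    Sample("b.png", "Is there a pedestrian ahead?", "no", 35.0),
    Sample("c.png", "Is there a pedestrian ahead?", "yes", 36.0),
]
predictions = ["Yes", "no", "no"]
result = accuracy_by_distance(samples, predictions)
print(result)
```

Grouping by distance bin rather than reporting one aggregate number is what makes degradation with range visible.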

2511.13026 2026-05-15 cs.CV

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

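AI summary This paper proposes REVISOR, a new framework for improving the reasoning of multimodal large language models on long-form video understanding. To address the shortcomings of purely text-based reflection on long videos, REVISOR introduces a multimodal reflection mechanism that draws on visual information during reflection, and designs a Dual Attribution Decoupled Reward mechanism to strengthen the model's ability to identify and use key video segments. The method requires no additional supervised fine-tuning or external models and significantly improves performance on several long-form video understanding benchmarks.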

Abstract

Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1) long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances the long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks: VideoMME, LongVideoBench, MLVU, and LVBench.

2511.08565 2026-05-15 cs.CL cs.AI cs.CY

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Davi Bastos Costa, Felippe Alves, Renato Vicente

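AI summary This study examines the moral responses of large language models under persona role-play. It builds a benchmark on the Moral Foundations Questionnaire (MFQ) to quantify moral susceptibility and moral robustness, and uses two complementary procedures to analyze how moral judgments change across personas. Moral robustness differs markedly between model families, with the Claude family the most robust, whereas moral susceptibility varies little, shows no clear dependence on model family, and appears to be determined mainly by pre-training. The study shows how persona conditioning affects model moral behavior and provides moral foundation profiles for individual models and for personas averaged across models.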

Comments Added experiments with a logit-based method and now reporting unbounded metrics

Abstract

Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas. We estimate these quantities with two complementary procedures, repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about 152%, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about 13%, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
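The spread statistics quoted above (a coefficient of variation and a max-to-min ratio) are straightforward to compute. This sketch assumes the sample standard deviation is used, which the abstract does not state, and the per-model scores are invented for illustration only.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    return statistics.stdev(values) / statistics.fmean(values) * 100.0


def max_min_ratio(values):
    """How many times larger the largest value is than the smallest."""
    return max(values) / min(values)


# Hypothetical per-model scores, not data from the paper.
narrow_spread = [1.0, 1.1, 1.2, 1.3, 1.6]
wide_spread = [0.5, 0.6, 0.8, 4.0, 15.0]

print(coefficient_of_variation(narrow_spread), max_min_ratio(narrow_spread))
print(coefficient_of_variation(wide_spread), max_min_ratio(wide_spread))
```

A coefficient of variation above 100% means the standard deviation exceeds the mean, which typically happens when a few values sit far above the rest, as in the second list.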

2511.02776 2026-05-15 cs.RO

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, Meng Li, Qingjie Liu, Shanghang Zhang, Min Wan, Jian Tang

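AI summary This paper presents XR-1, a general-purpose vision-language-action (VLA) model for multiple robots, tasks, and environments, aimed at two challenges for existing models: generating precise low-level actions and aligning heterogeneous data sources. XR-1 introduces Unified Vision-Motion Codes (UVMC), a joint discrete representation of visual dynamics and robot motion learned with a dual-branch VQ-VAE, which yields clear gains in action generation and cross-modal alignment. Experiments show that XR-1 performs strongly and generalizes well across a range of real robots and tasks.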

Comments Accepted to ICML2026 as spotlight

Abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $π_{0.5}$, $π_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.

2511.02271 2026-05-15 cs.CV

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao

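AI summary This paper proposes HTSC-CIF, a cross-modal causal intervention framework based on a hierarchical task structure, to address three core challenges in medical report generation: insufficient understanding of domain knowledge, poor alignment between text and visual entity embeddings, and spurious correlations caused by cross-modal bias. The method decomposes the task into low, mid, and high levels, handled respectively by spatial feature alignment, bidirectional language and image modeling, and a causal intervention module, with the aim of improving the accuracy and interpretability of generated reports. The authors report that HTSC-CIF outperforms state-of-the-art methods on several benchmark datasets.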

Comments Due to issues with the training epochs and training strategy in our paper, there are numerical errors in the result comparison table presented in the preprint. Therefore, we have decided to withdraw the manuscript for further revision

Abstract

Medical Report Generation (MRG) is a key part of modern medical diagnostics, as it automatically generates reports from radiological images to reduce radiologists' burden. However, reliable MRG models for lesion description face three main challenges: insufficient domain knowledge understanding, poor text-visual entity embedding alignment, and spurious correlations from cross-modal biases. Previous work only addresses single challenges, while this paper tackles all three via a novel hierarchical task decomposition approach, proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into low-, mid-, and high-level tasks: 1) Low-level: align medical entity features with spatial locations to enhance domain knowledge for visual encoders; 2) Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling (images) to boost cross-modal alignment via mutual guidance; 3) High-level: a cross-modal causal intervention module (via front-door intervention) to reduce confounders and improve interpretability. Extensive experiments confirm HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA) MRG methods. Code will be made public upon paper acceptance.

2510.23868 2026-05-15 cs.LG cs.CL

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

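AI summary This paper investigates whether reward matching can serve as an alternative to reward maximization for on-policy reinforcement learning of large language models. It proposes GIFT, which combines GRPO's group sampling, DPO's implicit reward, and UNA's mean squared error between explicit and implicit advantages; z-score standardization cancels the intractable term in DPO and removes the KL coefficient β from the RLHF and RLVR objectives. Experiments show that GIFT converges faster and overfits less on several tasks, and outperforms existing methods on length-controlled evaluation.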

Abstract

This paper investigates whether reward matching is a viable alternative to reward maximization methods for on-policy RL of LLMs. Group-relative Implicit Fine-Tuning (GIFT) is proposed, combining GRPO-style group sampling, DPO-style implicit reward, and UNA-style MSE between implicit and explicit advantages. By applying z-score standardization, the intractable partition function $Z(x)$ in the DPO implicit reward is canceled, and the KL coefficient $\beta$ is eliminated from the RLHF and RLVR objectives. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi^{*}_{\beta}(y|x) \propto \pi_{\text{ref}}(y|x)\, e^{\frac{1}{\beta} r_{\phi}(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x)=\frac{\sigma_{\phi}(x)}{\hat{\sigma}_{\theta}(x)}$. GIFT therefore solves the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $\beta$ with a prompt-adaptive $\beta(x)$ optimized endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO, and GSPO, overfits less on RLVR (GSM8K, MATH, AIME), and produces higher length-controlled win rates on RLHF (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
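The cancellation claimed above can be checked numerically. The sketch below assumes the standard DPO form of the implicit reward, β·log(π_θ/π_ref) + β·log Z(x), and z-scores it within a group of responses to one prompt; the log-probabilities, explicit rewards, and the `matching_loss` helper are hypothetical, and the loss is only a guess at the general shape of the objective, not the paper's exact definition.

```python
import numpy as np


def zscore(values):
    """Standardize within a group: subtract the group mean, divide by the group std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


def implicit_reward(logp_policy, logp_ref, beta, log_z):
    """DPO-style implicit reward; log_z stands in for the intractable log Z(x)."""
    return beta * (logp_policy - logp_ref) + beta * log_z


def matching_loss(logp_policy, logp_ref, explicit_rewards):
    """MSE between standardized implicit and explicit rewards (assumed form)."""
    implicit_adv = zscore(logp_policy - logp_ref)
    explicit_adv = zscore(explicit_rewards)
    return float(np.mean((implicit_adv - explicit_adv) ** 2))


# Hypothetical sequence log-probabilities for four sampled responses to one prompt.
logp_policy = np.array([-12.0, -9.5, -15.2, -11.1])
logp_ref = np.array([-11.0, -10.0, -14.0, -12.5])

# Two very different (beta, log Z) settings give identical standardized rewards.
adv_a = zscore(implicit_reward(logp_policy, logp_ref, beta=0.1, log_z=3.7))
adv_b = zscore(implicit_reward(logp_policy, logp_ref, beta=5.0, log_z=-42.0))
adv_plain = zscore(logp_policy - logp_ref)

print(np.allclose(adv_a, adv_b), np.allclose(adv_a, adv_plain))
print(matching_loss(logp_policy, logp_ref, explicit_rewards=np.array([0.0, 1.0, 0.0, 1.0])))
```

Subtracting the group mean removes the constant β·log Z(x) term, and dividing by the group standard deviation removes the positive scale β, so only the plain log-ratio needs to be computed.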

2510.20206 2026-05-15 cs.CV

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu

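AI summary RAPO++ is a cross-stage prompt optimization framework for text-to-video generation that targets the mismatch between user prompts and training data. It proceeds through Retrieval-Augmented Prompt Optimization (RAPO) and Sample-Specific Prompt Optimization (SSPO), using multi-source feedback such as semantic alignment, spatial fidelity, and temporal coherence to improve generated video quality step by step, and then fine-tunes a language model for efficient prompt generation. Experiments show that RAPO++ significantly improves semantic alignment, compositional plausibility, and spatio-temporal stability across several state-of-the-art models and benchmarks, making it a model-agnostic, efficient, and scalable solution.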

Comments arXiv admin note: text overlap with arXiv:2504.11739

Abstract

Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.

2510.17434 2026-05-15 cs.CV

Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

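AI summary This study uses the motion vectors in AV1 video coding to produce dense sub-pixel feature correspondences, and filters short tracks by cosine consistency. On short videos the method runs efficiently, uses little CPU, and yields denser matches with good geometric consistency. Experiments show good results in a small structure-from-motion demo, suggesting that compressed-domain feature matching is a viable option for large-scale applications.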

Comments Accepted ICIR 2025, camera-ready version

Abstract

We repurpose AV1 motion vectors to produce dense sub-pixel correspondences and short tracks filtered by cosine consistency. On short videos, this compressed-domain front end runs comparably to sequential SIFT while using far less CPU, and yields denser matches with competitive pairwise geometry. As a small structure-from-motion (SfM) demo on a 117-frame clip, motion-vector (MV) matches register all images and reconstruct 0.46-0.62M points at 0.51-0.53 px reprojection error; bundle adjustment (BA) time grows with match density. These results show compressed-domain correspondences are a practical, resource-efficient front end with clear paths to scaling in full pipelines.

2510.15982 2026-05-15 cs.LG cs.AI

AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution

Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon

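AI summary This paper proposes AMiD, a knowledge distillation method for reducing the computational and memory cost of large language models. It introduces an α-mixture assistant distribution whose new design variable α extends the range of conventional assistant distributions, and builds a unified knowledge distillation framework on it. Experiments show that AMiD outperforms existing methods in performance and training stability, with broader theoretical support and practical value.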

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Autoregressive large language models (LLMs) have achieved remarkable improvement across many tasks but incur high computational and memory costs. Knowledge distillation (KD) mitigates this issue by transferring knowledge from a large teacher to a smaller student through distributional alignment. Previous studies have proposed various discrepancy metrics, but the capacity gap and training instability caused by near-zero probabilities, stemming from the high-dimensional output of LLMs, remain fundamental limitations. To overcome these challenges, several approaches implicitly or explicitly incorporating assistant distribution have recently been proposed. However, the past proposals of assistant distributions have been a fragmented approach without a systematic investigation of the interpolation path and the divergence. This paper proposes $α$-mixture assistant distribution, a novel generalized family of assistant distributions, and $α$-mixture distillation, coined AMiD, a unified framework for KD using the assistant distribution. The $α$-mixture assistant distribution provides a continuous extension of the assistant distribution by introducing a new distribution design variable $α$, which has been fixed in all previous approaches. Furthermore, AMiD generalizes the family of divergences used with the assistant distributions based on optimality, which has also been restricted in previous works. Through extensive experiments, we demonstrate that AMiD offers superior performance and training stability by leveraging a broader and theoretically grounded assistant distribution space. We release the code at https://github.com/aailab-kaist/AMiD.

2510.15849 2026-05-15 cs.CV

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin

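AI summary This paper proposes Memory-SAM, a tongue segmentation method that needs neither human prompts nor training: it retrieves features from prior cases and generates effective prompts to guide the SAM2 model. Using dense DINOv3 features and FAISS retrieval, it automatically extracts foreground and background prompts from a small set of prior cases to achieve high-accuracy segmentation. On a dataset of 600 expert-annotated images, Memory-SAM outperforms existing methods, especially under real-world conditions.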

Abstract

Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at https://github.com/jw-chae/memory-sam.

2510.13016 2026-05-15 cs.CV

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

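AI summary This paper presents SVAG-Bench, a large-scale benchmark for evaluating multi-instance spatio-temporal video action grounding. The task requires a model to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query, giving a unified treatment of multiple actions in complex scenes. SVAG-Bench contains 688 videos with dense fine-grained annotations, supports detailed evaluation of multi-actor disambiguation, temporal overlap, and action compositionality, and comes with a standardized evaluation toolkit and a modular baseline model, SVAGFormer.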

Abstract

A truly capable AI system must do more than detect objects or recognize activities in isolation. It must form unified, grounded representations of who is acting, what they are doing, and when and where these actions unfold. These representations provide the perceptual bedrock for high-level reasoning, planning, and embodied interaction in the real world. Building such agents is central to long-horizon goals in embodied AI and robotics. Current video benchmarks evaluate fragments of these capabilities in isolation. They focus on either spatial grounding, object tracking, or temporal localization. As a result, they cannot rigorously measure progress on their joint, multi-instance integration. We introduce Spatio-temporal Video Action Grounding (SVAG), a task and benchmark that explicitly targets this unified competence by requiring models to simultaneously detect, track, and temporally localize all objects that satisfy a natural language query in complex, multi-actor scenes. To support this task, we construct SVAG-Bench. It comprises 688 videos, 19,590 verified annotations, and 903 unique action verbs drawn from crowded urban environments, wildlife, and traffic surveillance. Each video has on average 28.5 action-centric queries. This yields the densest annotation among comparable video grounding benchmarks and enables fine-grained evaluation of multi-actor disambiguation, temporal overlap, and action compositionality. Annotations are produced by a pipeline that combines expert manual labeling, GPT-3.5 paraphrase augmentation, and human verification to ensure both linguistic diversity and correctness. We further release SVAGEval, a standardized multi-referent evaluation toolkit. We also introduce SVAGFormer, a strong modular baseline architecture for SVAG.

2510.11282 2026-05-15 cs.LG

Vision-LLMs for Spatiotemporal Traffic Forecasting

Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry

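AI summary This paper studies how to use vision large language models (Vision-LLMs) for spatiotemporal traffic forecasting. Because conventional large language models handle grid-based traffic data inefficiently and struggle to model complex spatial dependencies, it proposes a new framework, ST-Vision-LLM. The method treats traffic forecasting as a vision-language fusion problem: a visual encoder processes historical traffic matrices, and an efficient numerical encoding scheme together with a two-stage fine-tuning strategy markedly improves performance in long-term prediction and cross-domain few-shot scenarios. Experiments show that the model achieves better prediction accuracy than existing methods on several real-world traffic datasets.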

Abstract

Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While large language models have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending large language models to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model's context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of large language models in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with supervised fine-tuning and then further optimized for predictive accuracy using group relative policy optimization, a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the best baseline by around 30% on average in cross-domain few-shot scenarios. Our extensive experiments validate the model's strong generalization capabilities across various data-scarce environments.

2510.07086 2026-05-15 cs.LG

Non-Stationary Online Structured Prediction with Surrogate Losses

Shinsaku Sakaue, Han Bao, Yuzhou Cao

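AI summary This paper studies online structured prediction in non-stationary environments, aiming to bound the target loss via surrogate loss functions. The authors give a new upper bound that depends on the cumulative surrogate loss and path length of a comparator sequence rather than on the time horizon $T$, providing stronger theoretical guarantees under non-stationarity. The core approach combines the dynamic regret analysis of online gradient descent with the exploit-the-surrogate-gap technique and introduces a Polyak-style learning rate, which strengthens both the theoretical analysis and practical performance. The method is further extended to broader settings via the convolutional Fenchel-Young loss.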

English abstract

Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. In this setting, the surrogate regret -- the cumulative excess of the actual target loss (e.g., the 0-1 loss) over the surrogate loss (e.g., the logistic loss) incurred by the best fixed estimator -- has gained attention because it admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur surrogate loss that grows linearly with $T$. To address this limitation, we obtain an upper bound of $F_T + O(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence and $P_T$ is its path length. This bound depends on $T$ only through $F_T$ and $P_T$, thus offering stronger guarantees under non-stationarity. Our core idea is to combine the dynamic regret analysis of online gradient descent (OGD) with the exploit-the-surrogate-gap technique. This viewpoint sheds light on the usefulness of a Polyak-style learning rate for OGD, which systematically yields target-loss bounds and performs well empirically. We then extend our approach to broader settings beyond prior work via the convolutional Fenchel--Young loss. Finally, a lower bound shows that the dependence on $F_T$ and $P_T$ is tight.
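
For readers unfamiliar with dynamic-regret notation, the bound can be written out as follows. This is only a sketch: the symbols $L_t$ (target loss at round $t$), $S_t$ (surrogate loss), and $U_t$ (the comparator at round $t$), as well as the choice of norm, are assumptions for illustration and are not taken from the paper; the path length is given in the form commonly used in the dynamic-regret literature.

```latex
\[
  \sum_{t=1}^{T} L_t \;\le\; F_T + O(1 + P_T),
  \qquad
  F_T = \sum_{t=1}^{T} S_t(U_t),
  \qquad
  P_T = \sum_{t=2}^{T} \lVert U_t - U_{t-1} \rVert .
\]
```

Under this reading, a fixed comparator ($U_1 = \dots = U_T$) has $P_T = 0$, and the bound reduces to the stationary form $F_T + O(1)$, which matches the horizon-independent surrogate regret described above.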

2510.04682 2026-05-15 cs.CL cs.AI

TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung, Jaehyung Kim

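AI summary This paper proposes TiTok, a new framework that addresses the inability of LoRA fine-tuned parameters to transfer across different base models. The method performs contrastive knowledge extraction at the token level, capturing task-relevant information from a source model with and without LoRA, to achieve efficient LoRA transplantation. Experiments show that TiTok performs well on multiple benchmarks, with average gains of 4% to 10% over baseline methods.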

Comments ICLR 2026

English abstract

Large Language Models (LLMs) are widely applied in real-world scenarios, yet fine-tuning them comes with significant computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA mitigate these costs; however, the adapted parameters are dependent on the base model and cannot be transferred across different backbones. One way to address this issue is through knowledge distillation, but its effectiveness inherently depends on training data. Recent work such as TransLoRA avoids this by generating synthetic data; nevertheless, this adds complexity since it requires training an additional discriminator model. In this paper, we propose TiTok, a new framework that enables effective LoRA Transplantation through Token-level knowledge transfer. Specifically, TiTok captures task-relevant information through a token-wise contrastive excess between a source model with and without LoRA. This excess highlights informative tokens and enables selective filtering of synthetic data, all without additional models or overhead. Through experiments on three benchmarks across multiple transfer settings, we demonstrate that TiTok is consistently effective, achieving average performance gains of 4% to 10% over baselines.

2510.02952 2026-05-15 cs.LG

ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data

Santanu Subhash Rathod, Francesco Ceccarelli, Sean B. Holden, Pietro Liò, Xiao Zhang, Jovan Tanevski

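AI summary This paper proposes ContextFlow, a context-aware flow matching framework for inferring trajectories of structural tissue dynamics from spatial omics data. The method integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that guides the optimal transport objective, producing trajectories that are both statistically consistent and biologically meaningful. Experiments show that ContextFlow outperforms existing methods on multiple quantitative and qualitative metrics and generalizes well.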

Comments 42 pages, 21 figures, 30 tables

English abstract

Inferring trajectories from longitudinal spatially-resolved omics data is fundamental to understanding the dynamics of structural and functional tissue changes in development, regeneration and repair, disease progression, and response to treatment. We propose ContextFlow, a novel context-aware flow matching framework that incorporates prior knowledge to guide the inference of structural tissue dynamics from spatially resolved omics data. Specifically, ContextFlow integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that regularizes the optimal transport objective. By embedding these contextual constraints, ContextFlow generates trajectories that are not only statistically consistent but also biologically meaningful, making it a generalizable framework for modeling spatiotemporal dynamics from longitudinal, spatially resolved omics data. Evaluated on three datasets, ContextFlow consistently outperforms state-of-the-art flow matching methods across multiple quantitative and qualitative metrics of inference accuracy and biological coherence. Our code is available at https://github.com/santanurathod/ContextFlow.

2510.01172 2026-05-15 cs.CL

Energy-Regularized Sequential Model Editing on Hyperspheres

Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng

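AI summary Large language models need continual updates to stay consistent with real-world knowledge, but sequential editing often destabilizes model representations and causes catastrophic forgetting. This paper proposes SPHERE, an editing method based on hyperspherical energy (HE) regularization that keeps neuron weights uniformly distributed on a hypersphere, effectively mitigating performance degradation during editing. Experiments show that SPHERE substantially improves editing performance on several mainstream models while largely preserving the models' original capabilities.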

Comments Accepted by ICLR 2026. The code is available at https://github.com/PlusLabNLP/SPHERE. Project page: https://www.qingyuanliu.net/sphere_projectpage/

English abstract

Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
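
To make the notion of hyperspherical energy concrete, here is a minimal sketch of one common form: the Riesz-style energy of row vectors projected onto the unit sphere. The abstract does not state the exact definition SPHERE uses, so the function name `hyperspherical_energy`, the exponent `s`, and the convention of summing over unordered pairs are all assumptions made for illustration.

```python
import math
from itertools import combinations


def hyperspherical_energy(weights, s=1.0):
    """Sum of ||w_i - w_j||^(-s) over unordered pairs of unit-normalized rows.

    Assumed form for illustration only. Lower energy means the directions are
    spread more uniformly over the sphere. Zero vectors or duplicate
    directions raise ZeroDivisionError.
    """
    unit = []
    for w in weights:
        norm = math.sqrt(sum(x * x for x in w))
        unit.append([x / norm for x in w])
    energy = 0.0
    for a, b in combinations(unit, 2):
        energy += math.dist(a, b) ** (-s)
    return energy


# Hypothetical 2-D "neurons": two opposite directions vs. two nearly equal ones.
spread = hyperspherical_energy([[1.0, 0.0], [-1.0, 0.0]])
clustered = hyperspherical_energy([[1.0, 0.0], [1.0, 0.1]])
```

In this toy case the clustered pair has much higher energy than the antipodal pair, which is the sense in which large swings in HE would indicate that an edit has disturbed the uniformity of the weight directions.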

2510.00977 2026-05-15 cs.LG cs.CL

It Takes Two: Your GRPO Is Secretly DPO

Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie

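AI summary This paper examines why GRPO is effective for fine-tuning large language models and offers a new perspective: GRPO's performance advantage stems from its implicit contrastive objective, which makes it structurally close to preference learning methods such as DPO. Building on this finding, the authors propose 2-GRPO, which constructs contrastive signals from only two rollouts and substantially reduces compute requirements. Theoretical analysis and experiments show that 2-GRPO retains 97.6% of the performance of 16-GRPO while needing only 12.5% of its rollouts and 21% of its training time.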

English abstract

GRPO has emerged as a prominent reinforcement learning algorithm for post-training LLMs. Unlike critic-based methods, GRPO computes advantages by estimating the value baselines from group-level statistics, eliminating the need for a critic network. Consequently, the prevailing view emphasizes the necessity of large group sizes, which are assumed to yield more accurate statistical estimates. In this paper, we propose a different view that the efficacy of GRPO stems from its implicit contrastive objective in the optimization, which helps reduce variance via the control variate method. This makes GRPO structurally related to preference learning methods such as DPO. This perspective motivates 2-GRPO, a minimal group-size variant that constructs contrastive signals with only two rollouts. We provide a rigorous theoretical analysis of 2-GRPO and empirically validate its effectiveness: 2-GRPO retains 97.6% of the performance of 16-GRPO, while requiring only 12.5% of the rollouts and 21% of the training time.
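
The contrastive reading of a two-rollout group can be seen in a small sketch of group-relative advantages. This is an illustration under stated assumptions, not the authors' implementation: the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are all choices made here for the example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts.

    Assumption: mean and population standard deviation of the group serve as
    the baseline and scale; real implementations may differ in both.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# With two rollouts and different rewards, the advantages are equal and
# opposite: one response is pushed up, the other down, like a preference pair.
pair = group_relative_advantages([1.0, 0.0])

# With identical rewards there is no contrast, so both advantages are zero.
tie = group_relative_advantages([1.0, 1.0])

# Rollout budget of a group of 2 relative to a group of 16.
rollout_fraction = 2 / 16
```

The last line reproduces the 12.5% rollout figure quoted in the abstract; the 21% training-time figure is an empirical result and cannot be derived from the group size alone.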

2510.00757 2026-05-15 cs.LG

LEAP: Local ECT-Based Learnable Positional Encodings for Graphs

Juan Amboage, Ernst Röell, Patrick Schnider, Bastian Rieck

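AI summary This paper proposes LEAP, a learnable graph positional encoding based on the local Euler Characteristic Transform ($\ell$-ECT), to improve positional encoding in graph neural networks. The method combines the differentiable approximation of the ECT with its local variant, captures local structural features of graphs, and is optimized end to end. Experiments show that LEAP performs well on multiple real-world and synthetic datasets, demonstrating its effectiveness and potential for graph representation learning.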

Comments Accepted at the International Conference on Learning Representations (ICLR) 2026. Our code is available at https://www.github.com/aidos-lab/LEAP

English abstract

Graph neural networks (GNNs) largely rely on the message-passing paradigm, where nodes iteratively aggregate information from their neighbors. Yet, standard message passing neural networks (MPNNs) face well-documented theoretical and practical limitations. Graph positional encoding (PE) has emerged as a promising direction to address these limitations. The Euler Characteristic Transform (ECT) is an efficiently computable geometric-topological invariant that characterizes shapes and graphs. In this work, we combine the differentiable approximation of the ECT (DECT) and its local variant ($\ell$-ECT) to propose LEAP, a new end-to-end trainable local structural PE for graphs. We evaluate our approach on multiple real-world datasets as well as on a synthetic task designed to test its ability to extract topological features. Our results underline the potential of LEAP-based encodings as a powerful component for graph representation learning pipelines.

2509.26100 2026-05-15 cs.AI

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Xibang Yang, Yan Teng, Xingjun Ma, Yingchun Wang

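AI summary As large language models are widely deployed in high-stakes domains, existing static evaluation methods struggle to keep pace with the dynamic nature of AI risks and continually evolving regulations. This paper proposes AgenticEval, a new agent-driven safety evaluation paradigm that uses a multi-agent framework to autonomously parse policy documents, continually generate and evolve a comprehensive safety benchmark, and refine test cases through a self-evolving evaluation loop. Experiments show that the method exposes deep safety vulnerabilities that traditional evaluations miss, underscoring the importance of dynamic evaluation for the safe deployment of AI.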

Comments Findings of ACL 2026

English abstract

The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework AgenticEval, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. AgenticEval leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of AgenticEval, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.

2509.25914 2026-05-15 cs.LG

ReNF: Rethinking the Design of Neural Long-Term Time Series Forecasters

Yihang Lu, Xianwei Meng, Enhong Chen

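AI summary This paper revisits the design principles of neural forecasters for long-term time series forecasting and proposes ReNF, a new framework grounded in a Variance Reduction Hypothesis. Combining the strengths of auto-regressive and direct-output structures, it introduces the streamlined Boosted Direct Output paradigm and applies parameter smoothing to improve generalization. Experiments show that these principled improvements let a simple temporal multilayer perceptron outperform recent, complex state-of-the-art models on multiple benchmarks, confirming the importance of design principles.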

English abstract

Neural Forecasters (NFs) have become a cornerstone of Long-term Time Series Forecasting (LTSF). However, recent progress has been hampered by an overemphasis on architectural complexity at the expense of fundamental forecasting structures. In this work, we revisit principled designs of LTSF. We begin by formulating a Variance Reduction Hypothesis (VRH), positing that generating and combining multiple forecasts is essential to reducing the inherent uncertainty of NFs. Guided by this, we propose Boosted Direct Output (BDO), a streamlined paradigm that synergistically hybridizes the causal structure of Auto-Regressive (AR) with the stability of Direct Output (DO), while implicitly realizing the principle of forecast combination within a single network. Furthermore, we mitigate a critical validation-test generalization gap by employing parameter smoothing to stabilize optimization. Extensive experiments demonstrate that these trivial yet principled improvements enable a direct temporal MLP to outperform recent, complex state-of-the-art models in nearly all benchmarks, without relying on intricate inductive biases. Finally, we empirically verify our hypothesis, establishing a dynamic performance bound that highlights promising directions for future research. The code is publicly available at: https://github.com/Luoauoa/ReNF.

2509.25826 2026-05-15 cs.LG

Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kun Feng, Shaocheng Lan, Yuchen Fang, Wenchao He, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

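AI summary Time series foundation models (TSFMs) face challenges in zero-shot generalization, mainly because of inherent temporal heterogeneity such as varying sampling densities and periodic structures. To address this, the paper proposes Kairos, a flexible and parameter-efficient TSFM that decouples temporal heterogeneity from model capacity through a dynamic patching tokenizer and a mixture-of-size encoding, enabling fine-grained temporal abstraction without increasing model width or depth. Kairos also introduces a multi-granularity positional embedding based on dynamic rotary encodings that conditions on instance-level spectral features and temporal structure, and it achieves superior zero-shot performance with fewer parameters on two mainstream benchmarks.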

English abstract

Inherent temporal heterogeneity, such as varying sampling densities and periodic structures, has posed substantial challenges in zero-shot generalization for Time Series Foundation Models (TSFMs). Existing TSFMs predominantly rely on massive parameterization to absorb such heterogeneity, as their static tokenization and positional encoding schemes entangle diverse temporal patterns into a fixed representation space, encouraging memorization rather than adaptation. To address this limitation, we propose Kairos, a flexible and parameter-efficient TSFM dedicated to forecasting tasks, which decouples temporal heterogeneity from model capacity through a novel tokenization perspective. Kairos introduces a dynamic patching tokenizer and a mixture-of-size encoding that adapt observational granularity to local information density, enabling fine-grained temporal abstraction without increasing model width or depth. In addition, we design a multi-granularity positional embedding based on dynamic rotary encodings, which conditions on instance-level spectral features and temporal structure induced by dynamic patching tokenization, allowing robust modeling of diverse temporal dependencies. Trained on a novel Predictability-Stratified Time-Series (PreSTS) corpus, Kairos achieves superior zero-shot performance with substantially fewer parameters on two mainstream benchmarks, GIFT-Eval and Time-Series-Library. The project page is at https://foundation-model-research.github.io/Kairos .

2509.23023 2026-05-15 cs.AI

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Davi Bastos Costa, Renato Vicente

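AI summary This paper introduces Mini-Mafia, a simplified social deduction game for evaluating large language models in multi-agent interaction. By analyzing the interactions among the mafioso, the detective, and the villager, the study derives an analytical formula that predicts the mafia's win probability and builds on it the Mini-Mafia Benchmark, which quantitatively assesses models' deception, detection, and disclosure capabilities. Experiments show that the approach predicts outcomes well across models and reveals several counterintuitive findings about the capabilities of current mainstream large models.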

Comments Adds a validation section for the theoretical model and restructures the presentation

English abstract

Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce Mini-Mafia, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate $p$ is predicted by the analytical formula $\text{logit}(p) = v \times (m - d)$, where $m$, $d$, and $v$ represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the Mini-Mafia Benchmark, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters $m$, $d$, and $v$. For $I$ models, only $3I$ parameters suffice to predict the outcomes of all $I^3$ tournament combinations; and in 5-fold cross-validation the formula achieves a 76.6% Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.
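
The win-rate formula can be inverted directly to obtain a probability, as the short sketch below shows. The function names and the numeric parameter values are hypothetical and chosen only for illustration; the abstract supplies the formula and the $3I$ versus $I^3$ counting, nothing more.

```python
import math


def mafia_win_probability(m, d, v):
    """Invert logit(p) = v * (m - d) to get the mafia win probability p."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))


def parameter_counts(num_models):
    """Return (parameters to estimate, tournament combinations) for I models."""
    return 3 * num_models, num_models ** 3


# Hypothetical values: when deception equals disclosure, the game is a coin flip.
balanced = mafia_win_probability(m=1.0, d=1.0, v=2.0)

# When deception exceeds disclosure, the mafia is favoured, and a larger v
# (a more discerning villager) amplifies the gap in either direction.
favoured = mafia_win_probability(m=1.5, d=1.0, v=2.0)

# With 5 models, 15 parameters describe all 125 role assignments.
counts = parameter_counts(5)
```

The appeal of the formula is visible in `parameter_counts`: the number of parameters grows linearly with the number of models while the number of tournament configurations grows cubically.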

2509.22746 2026-05-15 cs.AI cs.CV

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

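AI summary Current visual reasoning methods mainly explore specific reasoning modes; they can improve performance in particular domains but struggle to develop general reasoning capabilities. This paper therefore proposes Mixture-of-Visual-Thoughts (MoVT), a new adaptive reasoning paradigm that unifies different reasoning modes within a single model and selects the appropriate mode based on context. It introduces AdaVaR, a two-stage adaptive visual reasoning framework that uses supervised learning for initial training and then reinforcement learning with a carefully designed algorithm to induce context-adaptive mode selection. Experiments show that the method improves visual reasoning performance across a variety of scenarios.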

Comments 27 pages, 11 figures, 5 tables, accepted by ICLR 2026

English abstract

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce AdaVaR, a two-stage Adaptive Visual Reasoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

2509.21261 2026-05-15 cs.CV

Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang

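AI summary This paper addresses inter-person variability in fine-grained micro-action recognition and proposes a framework based on distributionally robust optimization to improve generalization across individuals. The framework contains two plug-and-play modules that operate at the feature level and the loss level: at the feature level, a temporal-frequency alignment module removes differences in person-specific motion characteristics, and at the loss level, a group-invariant regularized loss strengthens robustness to rare and difficult samples. Experiments show that the method clearly outperforms existing methods on a large-scale dataset, with higher accuracy and more stable generalization.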

Comments Withdrawn by the authors due to accidental submissions of non-final manuscript versions. Both v1 and v2 contain an outdated framework figure, in which several module names are inconsistent with the finalized terminology used in the manuscript. This inconsistency may confuse readers about the structure and naming of the proposed method

English abstract

Micro-action Recognition is vital for psychological assessment and human-computer interaction. However, existing methods often fail in real-world scenarios because inter-person variability causes the same action to manifest differently, hindering robust generalization. To address this, we propose the Person Independence Universal Micro-action Recognition Framework, which integrates Distributionally Robust Optimization principles to learn person-agnostic representations. Our framework contains two plug-and-play components operating at the feature and loss levels. At the feature level, the Temporal-Frequency Alignment Module normalizes person-specific motion characteristics with a dual-branch design: the temporal branch applies Wasserstein-regularized alignment to stabilize dynamic trajectories, while the frequency branch introduces variance-guided perturbations to enhance robustness against person-specific spectral differences. A consistency-driven fusion mechanism integrates both branches. At the loss level, the Group-Invariant Regularized Loss partitions samples into pseudo-groups to simulate unseen person-specific distributions. By up-weighting boundary cases and regularizing subgroup variance, it forces the model to generalize beyond easy or frequent samples, thus enhancing robustness to difficult variations. Experiments on the large-scale MA-52 dataset demonstrate that our framework outperforms existing methods in both accuracy and robustness, achieving stable generalization under fine-grained conditions.

2509.20846 2026-05-15 cs.LG

Causal Time Series Generation via Diffusion Models

Yutong Xia, Chang Xu, Yuxuan Liang, Li Zhao, Qingsong Wen, Roger Zimmermann, Jiang Bian

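AI summary This paper takes a causal perspective on conditional time series generation, extending the task to interventional and counterfactual settings and forming a new task family, causal time series generation (Causal TSG). To this end, the authors design CaTSG, a unified diffusion-based framework that uses backdoor adjustment and the abduction-action-prediction procedure to steer generation toward causal interventions and counterfactuals. Experiments show that CaTSG preserves observational fidelity while effectively generating interventional and counterfactual sequences, outperforming existing baselines.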

English abstract

Time series generation (TSG) synthesizes realistic sequences and has achieved remarkable success. Among TSG, conditional models generate sequences given observed covariates; however, such models learn observational correlations without considering unobserved confounding. In this work, we propose a causal perspective on conditional TSG and introduce causal time series generation as a new TSG task family, formalized within Pearl's causal ladder, extending beyond observational generation to include interventional and counterfactual settings. To instantiate these tasks, we develop CaTSG, a unified diffusion-based framework with backdoor-adjusted guidance that causally steers sampling toward desired interventions and individual counterfactuals while preserving observational fidelity. Specifically, our method derives causal score functions via backdoor adjustment and the abduction-action-prediction procedure, thus enabling principled support for all three levels of TSG. Extensive experiments on both synthetic and real-world datasets show that CaTSG achieves superior fidelity while also supporting interventional and counterfactual generation that existing baselines cannot handle. Overall, we propose the causal TSG family and instantiate it with CaTSG, providing an initial proof-of-concept and opening a promising direction toward more reliable simulation under interventions and counterfactual generation.

2509.14232 2026-05-15 cs.CV

GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

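AI summary GenExam is the first exam-style benchmark for multidisciplinary text-to-image generation, designed to evaluate models' integrated abilities in understanding, reasoning, and image generation. The benchmark contains 1,000 problems across 10 subjects, each with a ground-truth image and fine-grained scoring points for precise evaluation of semantic correctness and visual plausibility. Experiments show that GenExam is highly challenging for existing models and that open-source models lag well behind closed-source ones, highlighting the shortcomings of current generative models on complex tasks.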

Comments Accepted by ICML 2026

English abstract

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path toward intelligent generative models. Our benchmark and evaluation code are released at https://github.com/OpenGVLab/GenExam.