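arXivDaily is a daily academic digest that syncs the full arXiv data and provides AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.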

2511.18739 2026-05-15 cs.AI cs.LG stat.ML

A Problem-Oriented Taxonomy of Evaluation Metrics for Time Series Anomaly Detection

Kaixiang Yang, Jiarong Liu, Yupeng Song, Shuanghua Yang, Yujue Zhou

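AI summary: Time series anomaly detection is widely used in the Internet of Things and cyber-physical systems, but evaluating it is difficult because application scenarios are diverse and metrics rest on different assumptions. This paper proposes a problem-oriented taxonomy of evaluation metrics that reinterprets existing metrics by the specific evaluation problem each one addresses and groups them into six dimensions, covering accuracy, timeliness, tolerance to labeling imprecision, penalties for human-audit cost, robustness to randomness, and cross-dataset comparability. Experiments analyze how the metrics behave in different scenarios and quantify their ability to separate genuine detections from random noise. Most event-level metrics discriminate well, whereas some commonly used metrics are sensitive to random-score inflation, so metrics should be chosen according to the needs of the specific task.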

2511.17367 2026-05-15 cs.LG

R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability

Runyu Lu, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao

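AI summary: This paper studies how to design real-time pursuit strategies with worst-case robustness for pursuit-evasion games (PEGs) under partial observability. To address the shortcomings of existing methods under imperfect information and asynchronous moves, the authors propose R2PS, which combines dynamic programming with a belief-preservation mechanism, extends traditional strategies to the partially observable setting, and embeds them in a state-of-the-art reinforcement learning framework. The method generalizes robustly to unseen graph structures without additional training and outperforms existing methods in experiments.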

2511.15408 2026-05-15 cs.CL cs.AI cs.IR cs.MA cs.NE

Chinese Short-Form Creative Content Generation via Explanation-Oriented Multi-Objective Optimization

Shanlin Zhou, Xinpeng Wang, Jianxun Lian, Zhenghao Liu, Laks V. S. Lakshmanan, Xiaoyuan Yi, Yongtao Hao

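AI summary: This work targets Chinese short-form creative content generation and proposes an explanation-oriented multi-objective optimization method to address the difficulty of verifying generated results under personalized constraints. It formulates the task as a heterogeneous multi-objective optimization problem that jointly optimizes the generated content and the reliability of its explanation, and it designs MAGIC-HMO, a training-free multi-agent framework that optimizes through iterative generation and verification. Experiments show that the method significantly outperforms existing models on tasks such as Chinese baby naming.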

Comments 19 pages, 10 figures. Submitted to ACM for possible publication

2511.14823 2026-05-15 cs.LG cs.CV

Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari

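AI summary: Current machine learning models perform well on static tasks but struggle with continual adaptation and lifelong learning in non-stationary environments because their architectures are rigid. This paper proposes dynamic nested hierarchies, which let a model autonomously adjust the number of optimization levels, the nesting structure, and the update frequencies during training or inference, enabling self-evolution without predefined constraints. Supported by mathematical derivations and experiments, the approach reports superior performance on language modeling, continual learning, and long-context reasoning, and the authors present it as a foundation for adaptive general-purpose AI.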

Comments 12 pages, 1 figure

2511.13397 2026-05-15 cs.CV cs.AI

Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

Nikos Theodoridis, Tim Brophy, Reenu Mohandas, Ganesh Sistu, Fiachra Collins, Anthony Scanlan, Ciaran Eising

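AI summary: This paper presents Distance-Annotated Traffic Perception Question Answering (DTPQA), a visual question answering benchmark for evaluating the perception abilities of vision-language models in traffic scenes. The benchmark contains a synthetic dataset and a real-world dataset, and each question is annotated with the distance between the target object and the camera, which makes it possible to analyze perception performance at different distances. The work gives the automated driving field a new, targeted tool for evaluating model perception.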

Details

DOI: 10.1109/IEEEDATA.2026.3689031
Journal ref: IEEE Data Descriptions, 2026

Abstract

The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
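
The distance annotation described above lends itself to a simple per-distance breakdown of accuracy. The following is a minimal sketch of that kind of analysis, not the dataset's actual schema or the authors' scripts: the `Sample` field names, the 10-meter bin width, the exact-match scoring, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One benchmark item. Field names are hypothetical, not the dataset's schema."""
    image_path: str
    question: str
    answer: str
    distance_m: float


def accuracy_by_distance(samples, predictions, bin_width_m=10.0):
    """Group samples into distance bins and return {bin_start_m: accuracy}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for sample, predicted in zip(samples, predictions):
        bin_start = int(sample.distance_m // bin_width_m) * bin_width_m
        totals[bin_start] += 1
        # Case-insensitive exact match is an assumed scoring rule.
        if predicted.strip().lower() == sample.answer.strip().lower():
            hits[bin_start] += 1
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# Made-up toy data, purely for illustration.
samples = [
    Sample("a.png", "Is there a pedestrian?", "yes", 5.0),
    Sample("b.png", "Is there a pedestrian?", "yes", 8.0),
    Sample("c.png", "Is there a pedestrian?", "no", 35.0),
    Sample("d.png", "Is there a pedestrian?", "yes", 38.0),
]
predictions = ["Yes", "yes", "no", "no"]
result = accuracy_by_distance(samples, predictions)
print(result)
```

Binning by distance keeps the analysis independent of any particular model: the same function can be applied to the predictions of several VLMs to compare how quickly each one degrades with range.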

2511.13026 2026-05-15 cs.CV

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Jiaze Li, Hao Yin, Wenhui Tan, Jingyang Chen, Boshen Xu, Yuxun Qu, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Jian Luan

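AI summary: This paper proposes REVISOR, a new framework for improving the reasoning ability of large language models on long-form video understanding. To address the shortcomings of purely textual reflection on long videos, REVISOR introduces a multimodal reflection mechanism that draws on visual information for deeper reflection, and it designs a dual-attribute decoupled reward mechanism to strengthen the model's ability to identify and use key video segments. The method requires no additional supervised fine-tuning or external models and significantly improves performance on several long-video understanding benchmarks.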

2511.08565 2026-05-15 cs.CL cs.AI cs.CY

Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

Davi Bastos Costa, Felippe Alves, Renato Vicente

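AI summary: This study examines the moral responses of large language models under persona role-play. It builds a benchmark on the Moral Foundations Questionnaire (MFQ) to quantify moral susceptibility and moral robustness, and it uses two complementary methods to analyze how moral judgments change across personas. Moral robustness differs substantially across model families, with the Claude family the most robust; moral susceptibility varies much less, shows no clear dependence on model family, and appears to be determined mainly by pre-training. The study shows how persona conditioning affects moral behavior and provides moral foundation profiles for individual models and for personas averaged across models.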

Comments Added experiments with a logit-based method and now reporting unbounded metrics

Abstract

Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting an LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas. We estimate these quantities with two complementary procedures, repeated sampling and a logit-based method that directly estimates the rating distributions and enables temperature analysis. We evaluate 15 models across six families: Claude, DeepSeek, Gemini, GPT, Grok, and Llama. The two metrics show qualitatively different patterns. Moral robustness varies by more than an order of magnitude, with a coefficient of variation of about $152\%$, and is explained almost entirely by model family. The Claude family is, by a significant margin, the most robust, about 30 times more so than the lower-performing families (DeepSeek, Grok, and Llama), while Gemini and GPT occupy an intermediate tier. This strong family dependence suggests that robustness is primarily shaped by post-training. Moral susceptibility, by contrast, spans a much narrower range, with a coefficient of variation of about $13\%$, and the most susceptible model is only 1.6 times more susceptible than the least. Unlike robustness, susceptibility shows no clear family dependence, suggesting that it is primarily determined by pre-training. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in LLMs and a window into the internal machinery they use to instantiate personas.
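
The abstract compares the two metrics through their coefficient of variation and a max-to-min ratio without giving formulas. As a small sketch, assuming the usual definition (standard deviation divided by the mean) and, as a further assumption, the population standard deviation, the two summary statistics could be computed as follows. The scores are invented for illustration and are not values from the paper.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed definition)."""
    return statistics.pstdev(values) / statistics.fmean(values)


def max_min_ratio(values):
    """How many times larger the largest value is than the smallest."""
    return max(values) / min(values)


# Made-up per-model scores, purely for illustration (not from the paper).
toy_scores = [1.0, 2.0, 3.0]
cv = coefficient_of_variation(toy_scores)
ratio = max_min_ratio(toy_scores)
print(f"CV = {cv:.1%}, max/min = {ratio:.1f}x")
```

A coefficient of variation above 100%, as reported for robustness, means the spread across models exceeds the mean, which is consistent with values spanning more than an order of magnitude.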

2511.02776 2026-05-15 cs.RO

XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Shichao Fan, Kun Wu, Zhengping Che, Xinhua Wang, Di Wu, Fei Liao, Ning Liu, Yixue Zhang, Zhen Zhao, Zhiyuan Xu, Meng Li, Qingjie Liu, Shanghang Zhang, Min Wan, Jian Tang

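AI summary: This paper presents XR-1, a general-purpose vision-language-action (VLA) model for multiple robots, tasks, and environments, aimed at two challenges facing existing models: generating precise low-level actions and aligning heterogeneous data sources. XR-1 introduces Unified Vision-Motion Codes (UVMC), a joint discrete representation of visual dynamics and robot motion learned with a dual-branch VQ-VAE, which yields significant gains in action generation and cross-modal alignment. Experiments show superior performance and good generalization across a variety of real robots and tasks.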

Comments Accepted to ICML2026 as spotlight

Abstract

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the Unified Vision-Motion Codes (UVMC), a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $π_{0.5}$, $π_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.

2511.02271 2026-05-15 cs.CV

Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework

Yucheng Song, Yifan Ge, Junhao Li, Zhining Liao, Zhifang Liao

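AI summary: This paper proposes HTSC-CIF, a cross-modal causal intervention framework based on a hierarchical task structure, to address three core challenges in medical report generation: insufficient understanding of domain knowledge, poor alignment between textual and visual entity embeddings, and spurious correlations caused by cross-modal bias. The method decomposes the task into low, mid, and high levels, which are optimized through spatial feature alignment, bidirectional language and image modeling, and a causal intervention module, respectively, and it reports significant improvements in the accuracy and interpretability of generated reports. In the reported experiments, HTSC-CIF outperforms state-of-the-art methods on several benchmark datasets.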

Comments Due to issues with the training epochs and training strategy in our paper, there are numerical errors in the result comparison table presented in the preprint. Therefore, we have decided to withdraw the manuscript for further revision

2510.23868 2026-05-15 cs.LG cs.CL

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA

Zhichao Wang

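AI summary: This paper investigates whether reward matching can serve as an alternative to reward maximization for policy-gradient reinforcement learning of large language models. It proposes GIFT, which combines the group sampling of GRPO, the implicit reward of DPO, and the mean squared error between explicit and implicit advantages from UNA; z-score normalization eliminates the intractable term in DPO and removes the KL coefficient β from the RLHF and RLVR objectives. Experiments show that GIFT converges faster and overfits less across multiple tasks, and outperforms existing methods in length control and evaluation performance.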
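
The normalization argument can be illustrated with a few lines of code. This is a rough sketch and not the paper's objective: it assumes the DPO-style implicit reward has the form β·(log π − log π_ref) + β·log Z(x), where log Z(x) is the same for all responses to one prompt, and it uses the population standard deviation. Under those assumptions, z-scoring within a group cancels both the additive log Z(x) term and the scale β, and a reward-matching loss can be written as the mean squared error between the two sets of normalized advantages. All function names and numbers here are hypothetical.

```python
import statistics


def zscore(values, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def reward_matching_loss(rewards, logp_policy, logp_ref):
    """MSE between normalized explicit rewards and normalized log-ratios."""
    explicit_adv = zscore(rewards)
    log_ratios = [p - r for p, r in zip(logp_policy, logp_ref)]
    implicit_adv = zscore(log_ratios)
    squared = [(e - i) ** 2 for e, i in zip(explicit_adv, implicit_adv)]
    return statistics.fmean(squared)


# Invariance check: scaling by beta and shifting by beta * log_z leaves
# the z-scores unchanged (up to the small eps in the denominator).
log_ratios = [0.5, -0.2, 0.1, -0.4]
beta, log_z = 0.1, 3.7  # arbitrary illustrative values
shifted = [beta * lr + beta * log_z for lr in log_ratios]
invariant = all(
    abs(a - b) < 1e-5 for a, b in zip(zscore(log_ratios), zscore(shifted))
)

# When the log-ratios are already ordered and spaced like the rewards,
# the loss vanishes.
loss = reward_matching_loss(
    rewards=[1.0, 0.0, 1.0, 0.0],
    logp_policy=[-1.0, -2.0, -1.0, -2.0],
    logp_ref=[-2.0, -2.0, -2.0, -2.0],
)
print(invariant, loss)
```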

2510.20206 2026-05-15 cs.CV

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu

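AI summary: RAPO++ is a cross-stage prompt optimization framework for text-to-video generation that addresses the mismatch between user prompts and training data. It proceeds through two stages, retrieval-augmented prompt optimization (RAPO) and sample-specific prompt optimization (SSPO), using multi-source feedback such as semantic alignment, spatial fidelity, and temporal coherence to improve the generated video step by step, and then fine-tunes a language model for efficient prompt generation. Experiments show that RAPO++ significantly improves semantic consistency, compositional plausibility, and spatiotemporal stability across several state-of-the-art models and benchmarks, making it a model-agnostic, efficient, and scalable solution.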

Comments arXiv admin note: text overlap with arXiv:2504.11739

2510.17434 2026-05-15 cs.CV

Leveraging AV1 motion vectors for Fast and Dense Feature Matching

Julien Zouein, Hossein Javidnia, François Pitié, Anil Kokaram

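AI summary: This work uses the motion vectors in AV1-encoded video to produce dense sub-pixel feature matches and filters short tracks with a cosine consistency check. On short videos the method runs efficiently, uses little CPU, and yields denser matches with good geometric consistency. Experiments show good performance on scene reconstruction from few samples, suggesting that compressed-domain feature matching is a viable option for large-scale applications.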

Comments Accepted ICIR 2025, camera-ready version

2510.15982 2026-05-15 cs.LG cs.AI

AMiD: Knowledge Distillation for LLMs with $α$-mixture Assistant Distribution

Donghyeok Shin, Yeongmin Kim, Suhyeon Jo, Byeonghu Na, Il-Chul Moon

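AI summary: This paper proposes AMiD, a knowledge distillation method for reducing the compute and memory cost of large language models. It introduces an α-mixture assistant distribution whose new distribution parameter α extends the range of conventional assistant distributions, and it builds a unified knowledge distillation framework around it. Experiments show that AMiD outperforms existing methods in performance and training stability, with broader theoretical support and practical value.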

Comments The Fourteenth International Conference on Learning Representations (ICLR 2026)

2510.15849 2026-05-15 cs.CV

Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt

Joongwon Chae, Lihui Luo, Xi Yuan, Dongmei Yu, Zhenglin Chen, Lian Zhang, Peiwu Qin

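AI summary: This paper proposes Memory-SAM, a tongue segmentation method that needs neither human prompts nor training; it retrieves features from past cases and generates effective prompts to guide the SAM2 model. Using dense DINOv3 features and FAISS retrieval, it automatically derives foreground and background prompts from a small set of prior cases to achieve high-accuracy segmentation. On a dataset of 600 expert-annotated images, Memory-SAM segments better than existing methods, especially in real-world conditions.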

2510.13016 2026-05-15 cs.CV

SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl

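AI summary: This paper introduces SVAG-Bench, a large-scale benchmark for evaluating multi-instance spatio-temporal video action grounding. The task requires a model to simultaneously detect, track, and localize all objects that satisfy a natural-language query, giving a unified understanding of multiple actions in complex scenes. SVAG-Bench contains 688 videos with extensive fine-grained annotations, supports detailed evaluation of multi-action ambiguity, temporal overlap, and action compositionality, and comes with standardized evaluation tools and a modular baseline, SVAGFormer.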

2510.11282 2026-05-15 cs.LG

Vision-LLMs for Spatiotemporal Traffic Forecasting

Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry

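AI summary: This paper studies how to use vision large language models (Vision-LLMs) for spatiotemporal traffic forecasting. Conventional LLMs are inefficient on gridded traffic data and struggle to model complex spatial dependencies, so the authors propose a new framework, ST-Vision-LLM. It treats traffic forecasting as a vision-language fusion problem: a vision encoder processes historical traffic matrices, and an efficient numerical encoding scheme plus a two-stage fine-tuning strategy significantly improve performance on long-horizon forecasting and cross-domain few-shot settings. Experiments show better forecasting accuracy than existing methods on several real-world traffic datasets.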

2510.07086 2026-05-15 cs.LG

Non-Stationary Online Structured Prediction with Surrogate Losses

Shinsaku Sakaue, Han Bao, Yuzhou Cao

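AI summary: This paper studies online structured prediction in non-stationary environments, aiming to bound the target loss via surrogate losses. The authors give a new bound that depends on the cumulative surrogate loss and path length of the comparator sequence rather than on the time horizon $T$, which provides stronger guarantees in non-stationary settings. The core approach combines a dynamic regret analysis of online gradient descent with a technique for exploiting the surrogate gap, and introduces a Polyak-style learning rate that improves both the analysis and practical performance. The method also extends to a broader class of problems through convolutional Fenchel-Young losses.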

2510.04682 2026-05-15 cs.CL cs.AI

TiTok: Transfer Token-level Knowledge via Contrastive Excess to Transplant LoRA

Chanjoo Jung, Jaehyung Kim

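AI summary: This paper proposes TiTok, a new framework for the problem that LoRA fine-tuned parameters cannot be transferred across different base models. It extracts knowledge contrastively at the token level, capturing task-relevant information from the source model with and without LoRA, to enable efficient LoRA transplantation. Experiments show strong results on several benchmarks, with average gains of 4% to 10% over baselines.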

Comments ICLR 2026

2510.02952 2026-05-15 cs.LG

ContextFlow: Context-Aware Flow Matching For Trajectory Inference From Spatial Omics Data

Santanu Subhash Rathod, Francesco Ceccarelli, Sean B. Holden, Pietro Liò, Xiao Zhang, Jovan Tanevski

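AI summary: This paper proposes ContextFlow, a context-aware flow matching framework for inferring trajectories of tissue dynamics from spatial omics data. It integrates local tissue organization and ligand-receptor communication patterns into a transition plausibility matrix that guides the optimal transport objective, producing trajectories that are statistically consistent and biologically meaningful. Experiments show that ContextFlow outperforms existing methods on multiple quantitative and qualitative metrics and generalizes well.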

Comments 42 pages, 21 figures, 30 tables

2510.01172 2026-05-15 cs.CL

Energy-Regularized Sequential Model Editing on Hyperspheres

Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng

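AI summary: Large language models need continual updates to stay consistent with real-world knowledge, but sequential editing often destabilizes model representations and causes catastrophic forgetting. This paper proposes SPHERE, an editing method regularized by hyperspherical energy (HE), which mitigates performance degradation during editing by keeping neuron weights uniformly distributed on the hypersphere. Experiments show that SPHERE significantly improves editing results on several mainstream models while largely preserving their original performance.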

Comments Accepted by ICLR 2026. The code is available at https://github.com/PlusLabNLP/SPHERE. Project page: https://www.qingyuanliu.net/sphere_projectpage/

Abstract

Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
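
The abstract does not spell out how Hyperspherical Energy is computed. One common definition from the hyperspherical-energy literature sums inverse pairwise distances between unit-normalized neuron weight vectors, so lower energy means more uniformly spread directions. The sketch below assumes that form with exponent s = 1 and a sum over unordered pairs; it is an illustration of the quantity, not the SPHERE implementation, and the toy weight vectors are made up.

```python
import math
from itertools import combinations


def normalize(vector):
    """Project a weight vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def hyperspherical_energy(weights, s=1.0):
    """Sum of ||w_i - w_j||^(-s) over unordered pairs of unit-normalized rows.

    Assumes no two rows point in exactly the same direction; identical
    directions have zero distance and the energy diverges.
    """
    unit = [normalize(w) for w in weights]
    energy = 0.0
    for a, b in combinations(unit, 2):
        energy += math.dist(a, b) ** (-s)
    return energy


# Toy 2-D "neurons": evenly spread directions versus a tight cluster.
uniform = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.2]]
he_uniform = hyperspherical_energy(uniform)
he_clustered = hyperspherical_energy(clustered)
print(he_uniform, he_clustered)
```

Tracking this scalar before and after each edit is one way to observe the HE fluctuations that the abstract associates with editing failures.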

2510.00977 2026-05-15 cs.LG cs.CL

It Takes Two: Your GRPO Is Secretly DPO

Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie

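AI summary: This paper studies why GRPO is effective for fine-tuning large language models and offers a new perspective: GRPO's advantage comes from an implicit contrastive objective, which makes it structurally close to preference-learning methods such as DPO. Building on this, the authors propose 2-GRPO, which builds the contrastive signal from only two rollouts and sharply reduces compute. Theoretical analysis and experiments indicate that 2-GRPO retains 97.6% of the performance while needing only 12.5% of the rollouts and 21% of the training time of 16-GRPO.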
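
A short worked example shows why a group of two already carries a contrastive signal. This is a sketch under stated assumptions rather than the paper's code: advantages are taken to be rewards standardized within the group, using the population standard deviation and a small eps for stability. With two rollouts whose rewards differ, the advantages collapse to roughly +1 and -1, which resembles a preferred/rejected pair; with equal rewards both advantages are 0.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Two rollouts with different rewards: one is pushed up, the other down.
pair = group_relative_advantages([1.0, 0.0])

# Two rollouts with the same reward: no learning signal.
tie = group_relative_advantages([1.0, 1.0])

# Rollout budget of a 2-rollout group relative to a 16-rollout group.
rollout_fraction = 2 / 16

print(pair, tie, rollout_fraction)
```

The ratio 2/16 = 0.125 matches the 12.5% rollout figure quoted in the summary.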

2510.00757 2026-05-15 cs.LG

LEAP: Local ECT-Based Learnable Positional Encodings for Graphs

Juan Amboage, Ernst Röell, Patrick Schnider, Bastian Rieck

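AI summary: This paper proposes LEAP, a learnable positional encoding for graphs based on the local Euler characteristic transform ($\ell$-ECT), to improve positional encodings in graph neural networks. It combines a differentiable approximation of the ECT with its local variant to capture local structural features of a graph, and it is trained end to end. Experiments show strong results on several real-world and synthetic datasets, demonstrating its effectiveness and potential for graph representation learning.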

Comments Accepted at the International Conference on Learning Representations (ICLR) 2026. Our code is available at https://www.github.com/aidos-lab/LEAP

2509.26100 2026-05-15 cs.AI

AgenticEval: Toward Agentic and Self-Evolving Safety Evaluation of Large Language Models

Yixu Wang, Xin Wang, Yang Yao, Xinyuan Li, Xibang Yang, Yan Teng, Xingjun Ma, Yingchun Wang

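AI summary: As large language models are deployed widely in high-stakes domains, static evaluation methods can no longer keep up with shifting AI risks and evolving regulations. This paper proposes AgenticEval, an agent-driven safety evaluation paradigm in which a multi-agent framework autonomously parses policy documents, continually generates and evolves comprehensive safety benchmarks, and refines test cases through a self-evolving evaluation loop. Experiments show that the approach uncovers deep safety vulnerabilities that traditional evaluation misses, underscoring the importance of dynamic evaluation for safe AI deployment.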

Comments Findings of ACL 2026

2509.25914 2026-05-15 cs.LG

ReNF: Rethinking the Design of Neural Long-Term Time Series Forecasters

Yihang Lu, Xianwei Meng, Enhong Chen

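AI summary: This paper revisits the design principles of neural forecasters for long-term time series forecasting and proposes ReNF, a framework built on a variance-reduction hypothesis. It combines the strengths of autoregressive and direct-output structures into a simple, efficient Boosted Direct Output paradigm and adds parameter smoothing to improve generalization. Experiments show that these principled changes let a simple time-series MLP outperform recent, more complex state-of-the-art models on several benchmarks, underscoring the importance of design principles.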

2509.25826 2026-05-15 cs.LG

Kairos: Toward Adaptive and Parameter-Efficient Time Series Foundation Models

Kun Feng, Shaocheng Lan, Yuchen Fang, Wenchao He, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

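AI summary: Time series foundation models (TSFMs) struggle with zero-shot generalization, mainly because of intrinsic temporal heterogeneity such as varying sampling density and periodic structure. To address this, the paper proposes Kairos, a parameter-efficient and flexible time series foundation model that decouples temporal heterogeneity from model capacity through dynamic patch tokenization and mixture-of-size encoding, achieving fine-grained temporal abstraction without increasing model width or depth. Kairos also introduces multi-granularity positional embeddings based on dynamic rotary encoding, which condition on each instance's spectral characteristics and temporal structure, and it achieves superior zero-shot performance on two mainstream benchmarks with fewer parameters.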

2509.23023 2026-05-15 cs.AI

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Davi Bastos Costa, Renato Vicente

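AI summary: This paper introduces Mini-Mafia, a simplified social deduction game for evaluating large language models in multi-agent interaction. By analyzing the interactions among the deceiver, the detective, and the villagers, the study derives an analytical formula that predicts the deceiving side's win probability and uses it to build the Mini-Mafia Benchmark, which quantitatively measures a model's ability to deceive, detect, and disclose. Experiments show that the approach predicts well across models and reveal some counterintuitive findings about the capabilities of current mainstream models.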

Comments Adds a validation section for the theoretical model and restructures the presentation

2509.22746 2026-05-15 cs.AI cs.CV

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li, Yingxiu Zhao, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei

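AI summary: Current visual reasoning methods mostly focus on a specific reasoning mode; they improve results in particular domains but do not yield general reasoning ability. This paper proposes a new adaptive reasoning paradigm, Mixture-of-Visual-Thoughts (MoVT), which unifies different reasoning modes within one model and selects the appropriate mode based on context. It introduces AdaVaR, a two-stage adaptive visual reasoning framework that uses supervised learning for initial training and then reinforcement learning with a carefully designed algorithm to guide context-adaptive mode selection. Experiments show that the method improves visual reasoning performance across a variety of scenarios.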

Comments 27 pages, 11 figures, 5 tables, accepted by ICLR 2026

2509.21261 2026-05-15 cs.CV

Every Subtlety Counts: Fine-grained Person Independence Micro-Action Recognition via Distributionally Robust Optimization

Feng-Qi Cui, Jinyang Huang, Anyang Tong, Ziyu Jia, Jie Zhang, Zhi Liu, Dan Guo, Jianwei Lu, Meng Wang

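AI summary: This paper studies cross-person variation in fine-grained micro-action recognition and proposes a framework based on distributionally robust optimization to improve generalization across individuals. The framework has two plug-and-play modules, one at the feature level and one at the loss level: a temporal-frequency alignment module removes differences in individual motion characteristics, and a group-invariant regularized loss makes the model more robust to rare and hard samples. Experiments show that the method significantly outperforms existing methods on a large-scale dataset, with higher accuracy and more stable generalization.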

Comments Withdrawn by the authors due to accidental submissions of non-final manuscript versions. Both v1 and v2 contain an outdated framework figure, in which several module names are inconsistent with the finalized terminology used in the manuscript. This inconsistency may confuse readers about the structure and naming of the proposed method

2509.20846 2026-05-15 cs.LG

Causal Time Series Generation via Diffusion Models

Yutong Xia, Chang Xu, Yuxuan Liang, Li Zhao, Qingsong Wen, Roger Zimmermann, Jiang Bian

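AI summary: This paper proposes a causal view of conditional time series generation, extending the task to interventional and counterfactual settings and forming a new family of causal time series generation (Causal TSG) tasks. The authors design CaTSG, a unified diffusion-based framework that uses backdoor adjustment and an abduction-action-prediction procedure to control interventional and counterfactual generation precisely. Experiments show that CaTSG preserves observational fidelity while generating interventional and counterfactual series effectively, outperforming existing baselines.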

2509.14232 2026-05-15 cs.CV

GenExam: A Multidisciplinary Text-to-Image Exam

Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo

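AI summary: GenExam is the first exam-style benchmark for multidisciplinary text-to-image generation, designed to evaluate a model's combined ability in understanding, reasoning, and image generation. It contains 1,000 questions across 10 subjects, each with a ground-truth image and fine-grained scoring points for precise evaluation of semantic correctness and visual plausibility. Experiments show that GenExam is highly challenging for existing models and that open-source models lag well behind closed-source ones, highlighting the limits of current generative models on complex tasks.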

Comments Accepted by ICML 2026