arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 3868
2510.26253 2026-06-09 cs.CL Updated version

Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs

Takuma Sato, Seiya Kawano, Koichiro Yoshino

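Affiliations Nara Institute of Science and Technology; RIKEN; Guardian Robot Project; Kyoto Institute of Technology; Institute of Science Tokyo

AI summary This study proposes presenting pragmatic theories (such as Gricean pragmatics and Relevance Theory) as prompts that guide a language model through step-by-step reasoning, improving scores on implied-meaning understanding tasks by up to 9.6% over 0-shot Chain-of-Thought.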

Comments Correction of minor typographical errors in the references

Abstract

The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results show that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enable language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a modest performance improvement (around 1-3%) in larger models compared to the baseline.

2509.09501 2026-06-09 cs.CV Updated version

Region-Wise Correspondence Prediction between Manga Line Art Images

Yingxuan Li, Jiafeng Mao, Qianru Qiu, Yusuke Matsui

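Affiliations The University of Tokyo; CyberAgent, Inc.

AI summary Proposes a Transformer-based framework, trained on large-scale automatically generated region correspondences, that predicts region-wise correspondence between manga line art images without annotations and reaches 78.4-84.4% region-level accuracy.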

Comments Accepted to CVPR2026

Abstract

Understanding region-wise correspondences between manga line art images is fundamental for high-level manga processing, supporting downstream tasks such as line art colorization and in-between frame generation. Unlike natural images that contain rich visual cues, manga line art consists only of sparse black-and-white strokes, making it challenging to determine which regions correspond across images. In this work, we introduce a new task: predicting region-wise correspondence between raw manga line art images without any annotations. To address this problem, we propose a Transformer-based framework trained on large-scale, automatically generated region correspondences. The model learns to suppress noisy matches and strengthen consistent structural relationships, resulting in robust patch-level feature alignment within and across images. During inference, our method segments each line art and establishes coherent region-level correspondences through edge-aware clustering and region matching. We construct manually annotated benchmarks for evaluation, and experiments across multiple datasets demonstrate both high patch-level accuracy and strong region-level correspondence performance, achieving 78.4-84.4% region-level accuracy. These results highlight the potential of our method for real-world manga and animation applications.

2511.14143 2026-06-09 cs.CV cs.AI Updated version

SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM

An Yu, Weiheng Lu, Jian Li, Zhenfei Zhang, Yunhang Shen, Felix X. -F. Ye, Ming-Ching Chang

Affiliations Department of Computer Science, University at Albany - SUNY; School of Software & Microelectronics, Peking University; Nanjing University; Xiamen University; Department of Mathematics and Statistics, University at Albany - SUNY

AI summary Proposes SMART, a framework built on a multimodal large language model that fuses audio and visual features and uses shot-aware token compression for video moment retrieval, with significant gains on Charades-STA and QVHighlights.

Abstract

Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a natural language query. Despite recent progress in moment retrieval from videos using both traditional techniques and Multimodal Large Language Models (MLLMs), most existing methods still rely on coarse temporal understanding and a single visual modality, limiting performance on complex videos. To address this, we introduce Shot-aware Multimodal Audio-enhanced Retrieval of Temporal Segments (SMART), an MLLM-based framework that integrates audio cues and leverages shot-level temporal structure. SMART enriches multimodal representations by combining audio and visual features while applying Shot-aware Token Compression, which selectively retains high-information tokens within each shot to reduce redundancy and preserve fine-grained temporal details. We also refine prompt design to better utilize audio-visual cues. Evaluations on Charades-STA and QVHighlights show that SMART achieves significant improvements over state-of-the-art methods, including a 1.61% increase in R1@0.5 and a 2.59% gain in R1@0.7 on Charades-STA.
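
The abstract describes shot-aware token compression only at a high level, so the following is a minimal sketch of the general idea (keep the most informative tokens inside each shot), not the authors' implementation. The function name, the per-token `scores`, and the fixed `keep_per_shot` budget are all assumptions made for illustration.

```python
def compress_tokens_by_shot(scores, shot_boundaries, keep_per_shot):
    """Keep the highest-scoring token indices inside each shot.

    scores: hypothetical per-token importance scores (any saliency measure).
    shot_boundaries: sorted start index of each shot, beginning with 0.
    keep_per_shot: number of tokens retained in every shot (an assumption).
    Returns the kept token indices in temporal order.
    """
    kept = []
    ends = list(shot_boundaries[1:]) + [len(scores)]
    for start, end in zip(shot_boundaries, ends):
        # Rank the tokens of this shot by score and keep the top ones.
        top = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(sorted(top[:keep_per_shot]))
    return kept


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6]
kept = compress_tokens_by_shot(scores, shot_boundaries=[0, 3], keep_per_shot=2)
print(kept)  # [1, 2, 3, 5]
```

Selecting within each shot, rather than globally, means that no shot is dropped entirely, which is one plausible way to preserve fine-grained temporal detail.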

2511.07046 2026-06-09 cs.LG cs.AI Updated version

Learning Quantized Continuous Controllers for Integer Hardware

Fabian Kresse, Christoph H. Lampert

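Affiliations Institute of Science and Technology Austria (ISTA)

AI summary Proposes a quantization-aware training strategy that automatically selects low-bit policies and synthesizes them to an FPGA; on MuJoCo tasks, policies with 3-bit or 2-bit weights and activations are competitive with full precision and more robust to input noise.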

Comments 18 pages, 6 figures

Abstract

Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, favorably comparing to a quantized reference. Last, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.
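
To make "2 or 3 bits per weight" concrete, here is a minimal sketch of symmetric uniform quantization to signed integers. It is an assumed, generic quantizer rather than the scheme used in the paper; the function name and the `max_abs` clipping range are illustrative choices.

```python
def quantize_symmetric(values, bits, max_abs):
    """Quantize floats to signed `bits`-bit integers and map them back.

    max_abs is the clipping range (an assumption; in practice it is
    learned or calibrated). Returns (integer codes, dequantized floats).
    """
    qmax = 2 ** (bits - 1) - 1      # 3 levels each side for 3 bits, 1 for 2 bits
    scale = max_abs / qmax
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, [q * scale for q in codes]


codes, approx = quantize_symmetric([0.4, -1.2, 0.1], bits=3, max_abs=1.0)
print(codes)  # [1, -3, 0]
```

With 3 bits, each weight takes one of only seven values, which is why integer-only inference on a small FPGA becomes feasible. Quantization-aware training typically runs this rounding in the forward pass so that the network learns to tolerate it.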

2511.11041 2026-06-09 cs.CL cs.AI cs.LG Updated version

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Xingyu Ren, Youran Sun, Haoyu Liang

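Affiliations GitHub

AI summary Finds a consistent mean bias in sentence embeddings and proposes a training-free correction, R2 (projecting out the mean direction), which yields consistent classification gains across 38 models on MMTEB; also analyzes how it differs from PCA whitening.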

Abstract

We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections -- subtracting $\mu$ directly (R1), or projecting each embedding off the mean direction (R2) -- and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t>2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
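
The two corrections named in the abstract can be sketched in a few lines. This is an illustrative reading of R1 (subtract the estimated mean) and R2 (remove each embedding's component along the mean direction), with hypothetical function names and toy two-dimensional data; it is not the authors' code.

```python
import math


def mean_vector(embeddings):
    """Estimate the mean embedding, coordinate by coordinate."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def r1_subtract_mean(e, mu):
    """R1: subtract the estimated mean directly."""
    return [x - m for x, m in zip(e, mu)]


def r2_project_off_mean(e, mu):
    """R2: remove the component of e along the unit mean direction."""
    norm = math.sqrt(sum(m * m for m in mu))
    u = [m / norm for m in mu]
    coeff = sum(x * ui for x, ui in zip(e, u))
    return [x - coeff * ui for x, ui in zip(e, u)]


embeddings = [[1.0, 1.0], [3.0, -1.0], [2.0, 0.0]]  # toy data
mu = mean_vector(embeddings)                 # [2.0, 0.0]
print(r1_subtract_mean(embeddings[1], mu))   # [1.0, -1.0]
print(r2_project_off_mean(embeddings[1], mu))  # [0.0, -1.0]
```

R2 depends only on the direction of the estimated mean, not on its length, which is consistent with the abstract's statement that it cancels the parallel component of the mean-estimation error.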

2506.07853 2026-06-09 cs.AI cs.IR Updated version

Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level, Event-Centric Approach to Legal Knowledge Graphs

Hudson de Martim

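Affiliations Federal Senate of Brazil

AI summary Proposes an event-centric pattern grounded in the LRMoo ontology that precisely models the diachronic evolution of legal norms through chains of versioned Works and a distinction between temporal and language versions, and validates deterministic reconstruction of legal text at any date using the Brazilian Constitution as a case study.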

Comments Revised version. Refined ontological modeling of legislative events (adopted F27/E64 joint typing over E11). Introduced technical distinctions for bitemporal modeling in legal knowledge graphs and enriched the critical analysis of related standards in Section 2

Abstract

Representing the temporal evolution of legal norms is a critical challenge for automated processing. While foundational frameworks exist, they lack a formal pattern for granular, component-level versioning, hindering the deterministic point-in-time reconstruction of legal texts required by reliable AI applications. This paper proposes a structured, temporal modeling pattern grounded in the LRMoo ontology. Our approach models a norm's evolution as a diachronic chain of versioned F1 Works, distinguishing between language-agnostic Temporal Versions (TV), each being a distinct Work, and their monolingual Language Versions (LV), modeled as F2 Expressions. The legislative amendment process is formalized through event-centric modeling, allowing changes to be traced precisely. Using the Brazilian Constitution as a case study, we demonstrate that our architecture enables the exact reconstruction of any part of a legal text as it existed on a specific date. This provides a verifiable semantic backbone for legal knowledge graphs, offering a deterministic foundation for trustworthy legal AI.
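
Point-in-time reconstruction from component-level versions can be illustrated with a plain dictionary. The sketch below does not use LRMoo or a knowledge graph, and the component identifiers, texts, and amendment dates are invented for illustration. It only shows why per-component version chains make the reconstruction deterministic.

```python
from datetime import date

# Hypothetical data: each component lists (valid_from, text) in date order;
# text=None marks a repeal. Identifiers and amendment dates are illustrative.
versions = {
    "art1": [(date(1988, 10, 5), "Original text of article 1."),
             (date(2001, 9, 11), "Amended text of article 1.")],
    "art2": [(date(1988, 10, 5), "Original text of article 2."),
             (date(1995, 8, 15), None)],
}


def text_at(component_versions, on):
    """Return the text in force on date `on`, or None if absent or repealed."""
    current = None
    for valid_from, text in component_versions:
        if valid_from <= on:
            current = text
    return current


def reconstruct(all_versions, on):
    """Assemble every component that is in force on date `on`."""
    snapshot = {}
    for component_id, chain in all_versions.items():
        text = text_at(chain, on)
        if text is not None:
            snapshot[component_id] = text
    return snapshot


print(reconstruct(versions, date(1990, 1, 1)))
print(reconstruct(versions, date(2005, 1, 1)))
```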

2511.07938 2026-06-09 cs.LG cs.SY eess.SY Updated version

Decision-Focused Continual Learning for Seaport Power-Logistics Scheduling: Generalization across Varying Tasks

Chuanqing Pu, Feilong Fan, Nengling Tai, Yan Xu, Wentao Huang, Honglin Wen

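Affiliations College of Smart Energy, Shanghai Jiao Tong University; School of Electrical and Electronic Engineering, Nanyang Technological University; School of Electrical Engineering, Shanghai Jiao Tong University; Dyson School of Design Engineering, Imperial College London

AI summary Addresses the poor generalization of predict-then-optimize pipelines under changing tasks with a decision-focused continual learning framework based on Fisher-information regularization; a differentiable convex surrogate stabilizes gradients, enabling decision-aligned online learning across tasks, and experiments calibrated to Jurong Port improve decision performance while reducing computational cost.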

Comments Preprint to IEEE Transactions on Smart Grid

Abstract

Power-logistics scheduling in modern seaports typically follows a predict-then-optimize pipeline. To enhance the decision quality of predictions, decision-focused learning has been proposed, which aligns the training of forecasting models with downstream decision outcomes. However, this end-to-end design inherently restricts the value of forecasting models to a specific task structure and therefore generalizes poorly to evolving tasks induced by varying vessel arrivals. We address this gap with a decision-focused continual learning framework that adapts online to a stream of scheduling tasks. Specifically, we introduce Fisher-information-based regularization to enhance cross-task generalization by preserving parameters critical to prior tasks. A differentiable convex surrogate is also developed to stabilize gradient backpropagation. The proposed approach enables learning a decision-aligned forecasting model across a varying task stream with sustainable long-term computational and memory requirements. Experiments calibrated to Jurong Port show improved decision performance and cross-task generalization over existing methods, together with reduced computational cost and a bounded memory footprint.
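
One common form of Fisher-information-based regularization is the quadratic penalty used in elastic weight consolidation. The sketch below assumes that form with a diagonal Fisher estimate; the abstract does not specify the exact regularizer, so the function names and the `strength` parameter are illustrative.

```python
def fisher_penalty(params, anchor_params, fisher_diag, strength):
    """Quadratic penalty: 0.5 * strength * sum_i F_i * (p_i - a_i)^2.

    anchor_params are the parameters after earlier tasks; fisher_diag holds
    a diagonal Fisher estimate (larger means more important to past tasks).
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def fisher_penalty_grad(params, anchor_params, fisher_diag, strength):
    """Gradient of the penalty with respect to each parameter."""
    return [strength * f * (p - a)
            for p, a, f in zip(params, anchor_params, fisher_diag)]


params, anchor, fisher = [1.0, 2.0], [0.0, 2.0], [4.0, 1.0]
print(fisher_penalty(params, anchor, fisher, strength=0.5))       # 1.0
print(fisher_penalty_grad(params, anchor, fisher, strength=0.5))  # [2.0, 0.0]
```

Because only the anchor parameters and one Fisher value per parameter are stored, the memory cost stays fixed as new tasks arrive, in line with the bounded memory footprint the abstract mentions.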

2511.07317 2026-06-09 cs.CL cs.LG Updated version

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi

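Affiliations University of Washington

AI summary Proposes RLVE, which scales up reinforcement learning for language models using verifiable environments that procedurally generate problems, provide verifiable rewards, and adapt their difficulty; joint training across 400 environments yields a 3.37% average improvement on six reasoning benchmarks.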

Comments ICML 2026

Abstract

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
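
As a rough illustration of an adaptive verifiable environment, the toy class below generates addition problems, checks answers exactly, and raises its difficulty once recent accuracy is high. The class, its methods, and the promotion threshold are hypothetical and not taken from RLVE-Gym.

```python
import random


class AdditionEnvironment:
    """Toy verifiable environment: add two numbers of up to `difficulty` digits."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.difficulty = 1

    def generate(self):
        """Procedurally generate a problem and its reference answer."""
        hi = 10 ** self.difficulty - 1
        a, b = self.rng.randint(0, hi), self.rng.randint(0, hi)
        return f"{a} + {b} = ?", a + b

    def reward(self, answer, reference):
        """Algorithmically verifiable reward: exact match only."""
        return 1.0 if answer == reference else 0.0

    def update_difficulty(self, recent_rewards, promote_at=0.9):
        """Raise difficulty when the policy solves most recent problems."""
        if recent_rewards and sum(recent_rewards) / len(recent_rewards) >= promote_at:
            self.difficulty += 1


env = AdditionEnvironment()
question, reference = env.generate()
env.update_difficulty([1.0] * 10)  # policy solved everything: harder problems
print(env.difficulty)  # 2
```

Keeping problems near the edge of the policy's ability is what prevents the vanishing learning signal that the abstract attributes to static data distributions.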

2511.07002 2026-06-09 cs.CL Updated version

Automated Attribution Graph Interpretation via Probe Prompting

Giuseppe Birardi, Gonçalo Paulo

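Affiliations University of Cambridge

AI summary Proposes probe prompting, which uses Cross-Prompt Activation Signatures to group attribution-graph features into concept-aligned supernodes and validates the labels with causal interventions, achieving the predicted steering behavior in 100% of cases on Gemma-2-2B.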

Comments 35 pages, 24 figures, 18 tables. Code and interactive demo available

Abstract

Even though we know the precise computations that lead from a large language model (LLM) input to its output, this computation remains very hard to interpret. One way to make this process easier to understand is to create a sparse computational graph that captures most of the model's behavior with the smallest number of computational nodes. Cross-layer transcoders (CLTs) decompose the dense computations of the MLP, but the resulting circuits still contain thousands of nodes even for short prompts. Existing automated interpretation methods label individual features from corpus activations, but these labels are often not validated by causal intervention. We introduce probe prompting, a transparent rule-based pipeline that groups the features of an attribution graph into concept-aligned supernodes from their responses on a small set of concept-targeted probe prompts, summarized as Cross-Prompt Activation Signatures (CPAS). Across four factual domains, on Gemma-2-2B with a public CLT dictionary and 45,596 entity-swap interventions, we find that the labeled supernodes have the predicted steering behavior in every one of them. Code, datasets, and an interactive demo are released anonymously as a reusable harness for calibrating supernode labels against causal interventions.

2511.03877 2026-06-09 cs.LG Updated version

Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

Kimia Kazemian, Zhenzhen Liu, Yangfanyu Yang, Katie Luo, Shuhan Gu, Audrey Du, Xinyu Yang, Jack Jansons, Kilian Q. Weinberger, John Thickstun, Yian Yin, Sarah Dean

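Affiliations Cornell University; Stanford University; Boston University

AI summary Introduces the lead-lag forecasting (LLF) problem and releases two large-scale benchmark datasets, arXiv and GitHub, validating lead-lag dynamics with statistical tests and providing a standardized testbed for time-series forecasting on social platforms.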

Comments 11 pages, 8 figures, includes supplementary material (6 pages, 5 figures). Accepted at ACM SIGKDD 2026 (KDD '26). Code and data: https://lead-lag-forecasting.github.io

Abstract

Social and collaborative platforms emit multivariate time-series traces in which early interactions -- such as views, likes, or downloads -- are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardised datasets. To anchor research in LLF, we present two high-volume benchmark datasets: arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories). Our datasets provide ideal testbeds for lead-lag forecasting by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We document all technical details of data curation and cleaning, verify the presence of lead-lag dynamics through statistical and classification tests, and benchmark parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data.
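
A simple way to see what "correlated but temporally shifted" means is to scan candidate shifts and pick the one with the highest Pearson correlation. This is a generic diagnostic on synthetic data, offered as an assumption-laden sketch; it is not one of the statistical tests or baselines from the paper.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def best_lag(lead, lag, max_shift):
    """Return the shift s >= 0 that maximizes corr(lead[t], lag[t + s])."""
    scores = {}
    for s in range(max_shift + 1):
        x = lead[: len(lead) - s]
        y = lag[s:]
        scores[s] = pearson(x, y)
    return max(scores, key=scores.get)


# Synthetic example: the outcome channel echoes the lead two steps later.
lead = [1, 3, 2, 5, 4, 6, 3, 1]
lag = [0, 0] + [2 * v for v in lead[:-2]]
print(best_lag(lead, lag, max_shift=3))  # 2
```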

2511.02003 2026-06-09 cs.LG cond-mat.dis-nn hep-ph Updated version

Bulk-boundary decomposition of neural networks

Donghee Lee, Hye-Sung Lee, Jaeok Yi

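Affiliations Department of Physics, Korea Advanced Institute of Science and Technology

AI summary Proposes the bulk-boundary decomposition framework, which splits neural network training dynamics into a data-independent bulk term and a data-dependent boundary term, exposing the local and homogeneous structure of deep networks and yielding an energy continuity equation.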

Comments 13 pages, 3 figures

Abstract

We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a physical consequence of locality and homogeneity, we derive the energy continuity equation within a deep neural network.

2404.02039 2026-06-09 cs.AI Updated version

A Survey on Large Language Model-Based Game Agents

Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu

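Affiliations Georgia Institute of Technology, USA; Cisco Research, USA

AI summary Surveys LLM-based game agents through a unified reference architecture, summarizing research at the single-agent level (memory, reasoning, perception-action interfaces) and the multi-agent level (communication protocols, organizational models), and introduces a challenge-centered taxonomy linking six game genres to agent requirements.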

Comments ACM Computing Surveys, 2026

Abstract

Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) has provided new opportunities to endow these agents with generalizable reasoning, memory, and adaptability in complex game environments. This survey offers an up-to-date review of LLM-based game agents (LLMGAs) through a unified reference architecture. At the single-agent level, we synthesize existing studies around three core components: memory, reasoning, and perception-action interfaces, which jointly characterize how language enables agents to perceive, think, and act. At the multi-agent level, we outline how communication protocols and organizational models support coordination, role differentiation, and large-scale social behaviors. To contextualize these designs, we introduce a challenge-centered taxonomy linking six major game genres to their dominant agent requirements, from low-latency control in action games to open-ended goal formation in sandbox worlds. A curated list of related papers is available at https://github.com/git-disl/awesome-LLM-game-agent-papers

2510.27544 2026-06-09 cs.AI cs.FL Updated version

TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models

Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito

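Affiliations Columbia University; Columbia University, Barnard College

AI summary Introduces the TempoBench benchmark, which uses synthesized Mealy machines to generate verifiable causal labels for evaluating temporal causal reasoning in LLMs; models score below 25% accuracy on minimal causal attribution, and the main error is overspecification.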

Abstract

Temporal reasoning involves understanding how systems evolve over time through input-driven state transitions. A key aspect is temporal causal reasoning: reasoning about which prior inputs were necessary to cause an observed outcome. While large language models (LLMs) perform well at forward simulation, predicting outputs from inputs, they struggle to identify the minimal causal inputs of outcomes. To study this distinction, we define two tasks: trace simulation (SIM), which requires models to simulate system execution, and minimal causal attribution (MIN), which identifies the minimal set of inputs necessary for a given outcome. We introduce TempoBench, the first formally verified benchmark for temporal causal reasoning, built from synthesized Mealy machines with controllable complexity and provably correct causal labels. Across frontier models, we observe that despite achieving up to 96% accuracy on the SIM task, performance on the causal attribution MIN task drops below 25%; models fail to reason about causal necessity. Over 94% of causal errors involve overspecification, where models perform retrieval and list all possible inputs rather than reasoning about the minimal causal subset. Fine-tuning on the TempoBench training corpus improves causal reasoning and generalizes better than math, code, or instruction training, with gains across standard reasoning benchmarks.
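
For readers unfamiliar with Mealy machines, the sketch below simulates one: each output depends on both the current state and the current input. The two-state machine and its symbols are invented for illustration and are not drawn from TempoBench.

```python
# transitions[(state, input)] = (next_state, output); a hypothetical machine
transitions = {
    ("idle", "req"): ("busy", "grant"),
    ("idle", "tick"): ("idle", "none"),
    ("busy", "req"): ("busy", "deny"),
    ("busy", "tick"): ("idle", "done"),
}


def simulate(machine, start, inputs):
    """Run the machine and return the (state, output) pair after each input."""
    state, trace = start, []
    for symbol in inputs:
        state, output = machine[(state, symbol)]
        trace.append((state, output))
    return trace


trace = simulate(transitions, "idle", ["tick", "req", "req", "tick"])
print([output for _, output in trace])  # ['none', 'grant', 'deny', 'done']
```

Producing this forward trace corresponds to the SIM task. The MIN task asks the reverse question, namely which earlier inputs were necessary for a given output, and this sketch does not attempt it.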

2510.22450 2026-06-09 cs.LG cs.AI Updated version

SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks

Amin Omidvar

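Affiliations Independent Researcher, Toronto, Ontario, Canada

AI summary Proposes SmartMixed, a two-phase training strategy in which neurons adaptively select activation functions through a differentiable hard mixture mechanism and the selection is fixed in the second phase to keep inference efficient; experiments on MNIST show that neurons in different layers prefer different activation functions.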

Abstract

The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a novel two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky_ReLU, ELU, SELU) using a differentiable hard mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of different architectures. Our analysis reveals that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures. We also demonstrate that SmartMixed trains networks effectively by allowing neurons to select their preferred activation functions, performing competitively with models that use a single fixed state-of-the-art activation function.
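
The hand-off between the two phases can be sketched as follows: given per-neuron selection scores from phase one, freeze each neuron's best candidate and apply it directly. The scores, the reduced four-function pool, and the leaky slope of 0.01 are assumptions for illustration. The differentiable hard mixture used during phase one is not reproduced here.

```python
import math

# A reduced candidate pool (the paper lists six functions).
ACTIVATIONS = {
    "relu": lambda x: max(0.0, x),
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
    "leaky_relu": lambda x: x if x > 0 else 0.01 * x,  # slope is an assumption
}
NAMES = list(ACTIVATIONS)


def freeze_selection(scores_per_neuron):
    """Phase two: fix each neuron to its highest-scoring candidate."""
    return [NAMES[max(range(len(row)), key=row.__getitem__)]
            for row in scores_per_neuron]


def apply_layer(pre_activations, chosen):
    """Apply each neuron's own fixed activation function."""
    return [ACTIVATIONS[name](x) for x, name in zip(pre_activations, chosen)]


scores = [[2.0, 0.1, 0.3, 0.0], [0.0, 0.2, 1.5, 0.1]]  # hypothetical learned scores
chosen = freeze_selection(scores)
print(chosen)                             # ['relu', 'tanh']
print(apply_layer([-1.0, 0.0], chosen))   # [0.0, 0.0]
```

Once the choice is frozen, neurons sharing an activation can be grouped and evaluated together, which is presumably what allows the vectorized operations mentioned in the abstract.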

2510.20182 2026-06-09 cs.CV Updated version

PEDRA: Evaluating the Realism of Pedestrian Dynamics in Video Generation

Aaron Appelle, Jerome P. Lynch

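Affiliations Duke University

AI summary Proposes the PEDRA evaluation protocol, which uses reconstructed bird's-eye-view trajectories, among other methods, to test how realistically text-to-video and image-to-video models generate scenes with multiple interacting pedestrians; existing models show useful priors but also physical inconsistencies such as merging and disappearing pedestrians.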

Comments Accepted to CVPR 2026

Abstract

Pedestrian simulation traditionally relies on expert-tuned, hand-crafted models that limit scalability and generalization. Meanwhile, large-scale video generation models have achieved high visual realism across diverse settings, motivating exploration of their potential as general-purpose world simulators. Existing benchmarks primarily assess single-subject realism rather than scenes with multiple interacting people, leaving the plausibility of multi-agent dynamics in generated videos untested. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable direct comparison with ground truth videos, while for T2V we design a prompt suite covering varied crowd densities and interaction types. A key component is a method to reconstruct 2D bird's-eye view trajectories from pixel-space without known camera parameters. Our analysis shows that leading models exhibit effective priors for plausible multi-agent behavior, though issues such as merging and disappearing pedestrians reveal limits to their physical consistency.

2510.19186 2026-06-09 cs.CL Updated version

When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue

Tanya Shourya, Yingfan Wang, Zhaoyi Joey Hou, Shamik Roy, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

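Affiliations AWS AI Labs; University of Pittsburgh

AI summary Targets cases in tool-augmented dialogue systems where the user is satisfied but the agent is wrong: introduces the TRACE benchmark of systematically synthesized, diverse error cases and finds that existing evaluation frameworks remain far from ideal performance.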

Comments The Fifth Generation, Evaluation & Metrics Workshop (GEM) at ACL 2026

Abstract

Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.

2502.19049 2026-06-09 cs.LG Updated version

In-Context Learning of Stochastic Differential Equations with Foundation Inference Models

Patrick Seifner, Kostadin Cvejoski, David Berghaus, Cesar Ojeda, Ramses J. Sanchez

Affiliations * Lamarr Institute; University of Bonn; Fraunhofer IAIS; University of Potsdam

AI summary Proposes FIM-SDE, a pretrained recognition model that uses in-context learning to estimate, zero-shot, the drift and diffusion functions of low-dimensional SDEs from noisy time series; it also supports rapid finetuning and performs robustly on synthetic and real-world data.

Comments Accepted at NeurIPS 2025. The previous version appeared under the title "Foundation Inference Models for Stochastic Differential Equations: A Transformer-based Approach for Zero-shot Function Estimation."

Details
Journal ref
39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract

Stochastic differential equations (SDEs) describe dynamical systems where deterministic flows, governed by a drift function, are superimposed with random fluctuations, dictated by a diffusion function. The accurate estimation (or discovery) of these functions from data is a central problem in machine learning, with wide application across the natural and social sciences. Yet current solutions either rely heavily on prior knowledge of the dynamics or involve intricate training procedures. We introduce FIM-SDE (Foundation Inference Model for SDEs), a pretrained recognition model that delivers accurate in-context (or zero-shot) estimation of the drift and diffusion functions of low-dimensional SDEs, from noisy time series data, and allows rapid finetuning to target datasets. Leveraging concepts from amortized inference and neural operators, we (pre)train FIM-SDE in a supervised fashion to map a large set of noisy, discretely observed SDE paths onto the space of drift and diffusion functions. We demonstrate that FIM-SDE achieves robust in-context function estimation across a wide range of synthetic and real-world processes -- from canonical SDE systems (e.g., double-well dynamics or weakly perturbed Lorenz attractors) to stock price recordings and oil-price and wind-speed fluctuations -- while matching the performance of symbolic, Gaussian process and Neural SDE baselines trained on the target datasets. When finetuned to the target processes, we show that FIM-SDE consistently outperforms all these baselines.
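
As a rough illustration of the drift-plus-diffusion structure described in this abstract, the sketch below simulates a one-dimensional double-well SDE with the standard Euler-Maruyama scheme. It is not FIM-SDE and does not estimate anything: the drift f(x) = x - x^3, the constant diffusion sigma = 0.5, the step size, and all function names are assumptions chosen only for the example.

```python
import math
import random

def drift(x):
    # Double-well drift f(x) = x - x**3 (illustrative choice, not from the paper)
    return x - x**3

def diffusion(x, sigma=0.5):
    # Constant diffusion g(x) = sigma (assumed for simplicity)
    return sigma

def euler_maruyama(x0, dt, n_steps, seed=0):
    """Simulate dX = f(X) dt + g(X) dW with the Euler-Maruyama scheme."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(n_steps):
        x = path[-1]
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        path.append(x + drift(x) * dt + diffusion(x) * dw)
    return path

# One noisy, discretely observed path of the kind an estimator would take as input
path = euler_maruyama(x0=0.0, dt=0.01, n_steps=1000)
```

A path like this is the sort of input from which a drift and diffusion estimator would have to recover `drift` and `diffusion`; the sketch only shows the forward (data-generating) direction.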

2510.13554 2026-06-09 cs.CL cs.LG Updated version

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan

Affiliations * Shanghai Jiao Tong University; Shanghai Innovation Institute; Alibaba Group

AI summary Uses attention to reveal a preplan-and-anchor rhythm in LLM reasoning and, based on it, proposes three fine-grained reinforcement learning strategies that yield consistent gains across a range of reasoning tasks.

Comments 31 pages, 9 figures, 20 tables. Accepted at ICML 2026

Details
Abstract

The reasoning pattern of Large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
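
The two metrics named in this abstract can be sketched from their one-line descriptions alone. The code below is one possible reading, not the authors' definitions: it assumes a single causal attention matrix `attn` where row `t` holds the weights that query position `t` places on positions `0..t`, clips distances at a hypothetical `window` parameter, and uses function names invented for this example.

```python
def windowed_avg_attention_distance(attn, window):
    """Per query position t: attention-weighted backward distance (t - s), clipped at `window`."""
    out = []
    for t, row in enumerate(attn):
        out.append(sum(row[s] * min(t - s, window) for s in range(t + 1)))
    return out

def future_attention_influence(attn):
    """Per position s: average attention it receives from all later positions t > s."""
    n = len(attn)
    out = []
    for s in range(n):
        later = [attn[t][s] for t in range(s + 1, n)]
        out.append(sum(later) / len(later) if later else 0.0)
    return out

# Toy 3-token causal attention matrix (each row sums to 1)
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
]
waad = windowed_avg_attention_distance(attn, window=1)  # [0.0, 0.5, 0.5]
fai = future_attention_influence(attn)                  # [0.5, 0.0, 0.0]
```

In this toy case token 0 receives the most attention from later tokens, so under this reading it would be the candidate anchor; the paper's actual head selection, averaging, and thresholds are not specified in the abstract.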

2510.12171 2026-06-09 cs.AI Updated version

MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang

Affiliations * University of California, Los Angeles Computer Science Department; University of Pennsylvania Department of Materials Science and Engineering; Virginia Tech Department of Computer Science

AI summary Proposes the MatSciBench benchmark of 1340 college-level materials science problems spanning 6 primary fields and 31 subfields to evaluate LLM reasoning, and finds that current models are limited in domain knowledge, calculation, and figure understanding.

Details
Abstract

Large Language Models have shown strong scientific reasoning ability, but their performance on materials science problems remains less studied. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 subfields, together with a three-tier difficulty classification based on the reasoning length needed to solve each problem. MatSciBench includes detailed reference solutions for 946 questions, supports process-level error analysis, and contains 315 questions with images for evaluating multimodal reasoning. We evaluate leading thinking and non-thinking LLMs on MatSciBench, and further test three reasoning methods for non-thinking models: basic chain-of-thought prompting, tool augmentation, and self-correction. The results show that current models still face clear limits in college-level materials science reasoning. DeepSeek-R1 achieves the highest score on text-only questions at 75.22% accuracy, and GPT-5 performs the best on questions with images at 53.02%. Our analysis shows that tool augmentation improves many non-thinking models in a token-efficient way, while self-correction often fails to provide reliable gains and can revise correct answers into incorrect ones. We further analyze performance across difficulty levels, reasoning efficiency, multimodal reasoning, and failure patterns, and find that current models are mainly limited by domain knowledge gaps, calculation errors, problem comprehension failures, and difficulty in extracting precise information from scientific figures. Overall, MatSciBench provides a clear testbed for measuring current LLM limitations and guiding future work on scientific reasoning in materials science.

2510.10448 2026-06-09 cs.CL Updated version

RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation

Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du, Yunpu Ma, Yijun Tian

Affiliations * University of Utah; University of Washington; George Washington University; University of Virginia; Shandong University; Ludwig Maximilian University of Munich; University of Notre Dame

AI summary Proposes the RECON framework, which inserts an observation compressor into the multi-turn reasoning loop of RL-trained search agents; the compressor is trained with a two-stage curriculum to condense context, improving training and inference efficiency as well as question-answering performance.

Comments Technical report

Details
Abstract

Search agents trained with reinforcement learning (RL) interleave reasoning with tool calls in a multi-turn, tool-integrated reasoning (TIR) loop, where each tool invocation returns an environment observation that is appended to the agent's context. As the rollout proceeds, these raw observations accumulate, inflating token cost and diluting the signal available for downstream reasoning. Unlike single-pass retrieve-then-read pipelines, where context compression is a one-time postprocessing step, the multi-turn RL setting requires compression that runs at every observation step while remaining decoupled from policy optimization. We introduce RECON (REasoning with CONdensation), a framework that addresses this challenge by inserting a dedicated observation compressor into the reasoning loop. The compressor is trained via a two-stage curriculum: relevance pretraining on QA datasets followed by multi-aspect distillation from proprietary LLMs, and remains frozen during RL training to preserve policy stability. Integrated into the Search-R1 search-agent pipeline, RECON reduces total context length by 35%, improves training speed by 5.4%, and reduces inference latency by 30.9%, while boosting average exact-match by 14.5% on the 3B agent and 3.0% on the 7B agent, with particular strength in multi-hop QA. These results establish learned observation compression as a key component for building practical, scalable RL-trained search agents.
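
To make the "compress at every observation step" idea concrete, here is a toy sketch of a multi-turn loop in which each observation passes through a compressor before it is appended to the context. Everything in it is hypothetical: `compress` is a word-overlap heuristic standing in for the learned compressor, and `run_agent` omits the policy, the tool calls, and the RL training entirely.

```python
def compress(observation, query, max_sentences=2):
    """Toy stand-in for a learned compressor: keep the sentences sharing the most words with the query."""
    q = set(query.lower().split())
    sentences = [s.strip() for s in observation.split(".") if s.strip()]
    # sorted() is stable, so ties keep their original order
    ranked = sorted(sentences, key=lambda s: -len(q & set(s.lower().split())))
    return ". ".join(ranked[:max_sentences]) + "."

def run_agent(query, observations, use_compressor):
    """Append each (optionally compressed) observation to the growing context."""
    context = [query]
    for obs in observations:
        context.append(compress(obs, query) if use_compressor else obs)
    return context

query = "capital of France"
observations = [
    "Paris is the capital of France. France is in Europe. Cheese is popular.",
]
raw_len = sum(len(c) for c in run_agent(query, observations, use_compressor=False))
compressed_len = sum(len(c) for c in run_agent(query, observations, use_compressor=True))
```

Because the compressor sits between the tool output and the context, it can be swapped or frozen independently of the policy, which is the decoupling the abstract emphasizes.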

2510.10028 2026-06-09 cs.LG cs.AI cs.DC Updated version

Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization

Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Abbas Jamalipour, Xianbin Wang, Dong In Kim

Affiliations * College of Computing and Data Science, Nanyang Technological University, Singapore; The University of Sydney, Sydney, Australia; Department of Electrical and Computer Engineering, Western University, London, Canada; Department of Electrical and Computer Engineering, Sungkyunkwan University, South Korea

AI summary To address the accuracy and communication-efficiency challenges of onboard vision-language model inference in UAV-enabled low-altitude economy networks, proposes a hierarchical optimization framework combining an alternating resolution and power optimization algorithm with an LLM-augmented reinforcement learning method for trajectory optimization, improving inference performance and communication efficiency.

Details
Abstract

The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions.

2510.09783 2026-06-09 cs.LG cs.AI stat.ML Updated version

Large Language Models for Imbalanced Classification: Diversity makes the difference

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Taylor Braund, Alexis Whitton, Svetha Venkatesh

Affiliations * Applied Artificial Intelligence Initiative (A2I2); Deakin University; Black Dog Institute; University of New South Wales

AI summary Proposes an LLM-based oversampling method that enhances diversity through conditional sampling, permutation-based fine-tuning, and interpolated samples, outperforming eight baselines on 10 tabular datasets.

Details
Abstract

Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.
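
For readers unfamiliar with interpolation-based oversampling, the sketch below shows the basic SMOTE-style step on purely numeric features: a synthetic minority sample is placed on the line segment between two existing minority samples. This is a simplified illustration, not the paper's method. Real SMOTE interpolates toward nearest neighbours, whereas this version picks random pairs, and how the paper builds its interpolated samples is not stated in the abstract. All names here are assumptions.

```python
import random

def interpolate(a, b, lam):
    """Return the point a + lam * (b - a) on the segment between samples a and b."""
    return [x + lam * (y - x) for x, y in zip(a, b)]

def oversample(minority, n_new, seed=0):
    """Generate n_new synthetic samples from random pairs of minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct minority samples
        synthetic.append(interpolate(a, b, rng.random()))
    return synthetic

minority = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synthetic = oversample(minority, n_new=5)
```

The need for numeric vectors in `interpolate` is exactly the limitation the abstract raises for categorical variables, which is what motivates generating samples with an LLM instead.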

2505.14752 2026-06-09 cs.LG Updated version

LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models

Yihong Tang, Menglin Kong, Junlin He, Tong Nie, Wei Ma, Lijun Sun

Affiliations * McGill University; The Hong Kong Polytechnic University

AI summary Proposes LLMSynthor, which uses a large language model as a nonparametric copula to iteratively generate micro-records consistent with target macro-statistics, addressing the difficulty of collecting fine-grained data at scale.

Details
Abstract

Macro-aligned micro-records are crucial for credible simulations in social science and urban studies. For example, epidemic models are only reliable when individual-level mobility and contacts mirror real behavior, while aggregates match real-world statistics like case counts or travel flows. However, collecting such fine-grained data at scale is impractical, leaving researchers with only macro-level data. LLMSynthor addresses this by turning a pretrained LLM into a macro-aware simulator that generates realistic micro-records consistent with target macro-statistics. It iteratively builds synthetic datasets: in each step, the LLM generates batches of records to minimize discrepancies between synthetic and target aggregates. Treating the LLM as a nonparametric copula allows the model to capture realistic joint dependencies among variables. To improve efficiency, LLM Proposal Sampling guides the LLM to propose targeted record batches, specifying variable ranges and counts, to efficiently correct discrepancies while preserving realism grounded in the model's priors. Evaluations across domains (mobility, e-commerce, population) show that LLMSynthor achieves strong realism, statistical fidelity, and practical utility, making it broadly applicable to economics, social science, and urban studies.

2503.14229 2026-06-09 cs.AI cs.CV cs.RO Updated version

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

Yifei Dong, Fengyi Wu, Qi He, Lingdong Kong, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann, Zhi-Qi Cheng

Affiliations * University of Science and Technology of China

AI summary Proposes the unified HA-VLN 2.0 benchmark; through standardized tasks, the HAPS 2.0 dataset and simulators, benchmarks on 16,844 socially grounded instructions, and real-world robot experiments, it shows that explicit social modeling improves navigation robustness and reduces collisions.

Comments 35 pages, 20 figures, website: https://f1y1113.github.io/HA-VLN-webpage/

Details
Abstract

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops of leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, human-aware navigation research.

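2510.06492 2026-06-09 cs.RO Updated version

How Well Do Latent World Models Understand Partially Observable Safety Constraints?

Matthew Kim, Kensuke Nakamura, Andrea Bajcsy

Affiliations * UC San Diego; Carnegie Mellon University

AI summary Studies failure modes of latent world models under partially observable safety constraints, proposes a mutual-information-based measure and a rollout-based measure to diagnose estimation gaps and prediction gaps, and mitigates them with multimodal supervision and conformal risk calibration, improving the safety of robot manipulation.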

Comments 10 tables, 5 figures

Details
Abstract

Latent world models are a promising approach for learning state representations and dynamics directly from high-dimensional observations, enabling robot control in hard-to-model settings. However, control performance ultimately depends on the latent representation encoding the required information for the task. In this work, we study latent-space safe control problems and show how partial observability can induce control failures when safety-relevant information is not preserved in the latent state. Specifically, we identify two world model failure modes: estimation gaps, where current observations do not reveal safety-critical quantities (e.g., temperature in a cooking task), and prediction gaps, where failures are observable once they occur but cannot be reliably anticipated from available observations. We introduce two diagnostics for these gaps: a mutual-information-based measure of safety observability and a rollout-based measure of future safety predictability. Finally, we present mitigation strategies for each failure mode: privileged multimodal supervision for estimation gaps and conformal risk calibration for prediction gaps. Across two hardware case studies -- using unimodal RGB world models and multimodal RGB+Tactile and RGB+Thermal variants -- we show that these mitigation strategies improve the safety of a Franka Research 3 manipulator on challenging cooking tasks under partial observability, albeit with increased conservativeness. More broadly, our work raises the question of when world model state representations are sufficient for reliable robot control.
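
The calibration idea mentioned in this abstract can be illustrated with the standard split-conformal quantile, which picks a threshold from held-out scores so that a new score exceeds it with probability at most alpha (under exchangeability). This is a generic textbook sketch, not the paper's conformal risk calibration procedure. The scores are made-up numbers standing in for, say, safety-prediction errors on calibration rollouts, and the function name is an assumption.

```python
import math

def conformal_threshold(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only a trivial threshold is valid
        return float("inf")
    return sorted(scores)[k - 1]

# Hypothetical calibration scores (illustrative values only)
calibration_scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.5, 0.7, 0.6, 0.8]
threshold = conformal_threshold(calibration_scores, alpha=0.5)
```

A smaller alpha pushes the threshold up, which makes a safety filter built on it more cautious; this matches the trade-off the abstract reports as increased conservativeness.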

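2510.06052 2026-06-09 cs.AI cs.CL Updated version

MixReasoning: Switching Modes to Think

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

Affiliations * arXiv

AI summary Proposes MixReasoning, a framework that dynamically adjusts reasoning depth, reasoning in detail on difficult steps and concisely on simple ones; on GSM8K, MATH-500, and AIME it shortens reasoning length and improves efficiency without sacrificing accuracy.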

Details
Abstract

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

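2510.05356 2026-06-09 cs.CV cs.LG Updated version

Mitigating Diffusion Model Hallucinations with Dynamic Guidance

Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras

Affiliations * Stony Brook University; University of Wisconsin-Madison

AI summary To address hallucinations in diffusion models caused by excessive smoothing of the score function, proposes Dynamic Guidance, which selectively sharpens the score function along pre-determined directions while preserving valid semantic variations, substantially reducing hallucinations.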

Comments Project page: https://cvlab-stonybrook.github.io/DynamicGuidance/

Details
Abstract

Hallucinations in diffusion models are samples with structural inconsistencies that can emerge due to the excessive smoothing of the learned score function, which in turn leads to interpolations between modes of the data distribution. Since semantic interpolations are often desirable and contribute to sample diversity, we believe that a nuanced and targeted solution is required to address diffusion model hallucinations. In this work, we introduce Dynamic Guidance, which mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. This sharpening can be performed using either pre-determined classes or semantically coherent clusters that form pseudo-classes over the data distribution. The latter allows for a principled extension of Dynamic Guidance to text-to-image generation, where we select modes to correspond to fine-grained contextual differences in textual descriptions. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.

2508.07011 2026-06-09 cs.CV cs.GR Updated version

HiMat: DiT-based Ultra-High Resolution SVBRDF Generation

Zixiong Wang, Jian Yang, Yiwei Hu, Milos Hasan, Beibei Wang

Affiliations * College of Computer Science, Nankai University; Adobe Research; NVIDIA; Nanjing University

AI summary Proposes HiMat, a framework that uses a diffusion transformer with linear attention to generate 4K SVBRDFs in a high-compression latent space, with a CrossStitch module to maintain cross-map consistency, enabling efficient and diverse ultra-high-resolution material generation.

Details
Abstract

Creating ultra-high-resolution spatially varying bidirectional reflectance distribution functions (SVBRDFs) is critical for photorealistic 3D content creation, to faithfully represent fine-scale surface details required for close-up rendering. However, achieving 4K generation faces two key challenges: (1) the need to synthesize multiple reflectance maps at full resolution, which multiplies the pixel budget and imposes prohibitive memory and computational cost, and (2) the requirement to maintain strong pixel-level alignment across maps at 4K, which is particularly difficult when adapting pretrained models designed for the RGB image domain. We introduce HiMat, a diffusion-based framework tailored for efficient and diverse 4K SVBRDF generation. To address the first challenge, HiMat performs generation in a high-compression latent space via DC-AE, and employs a pretrained diffusion transformer with linear attention to improve per-map efficiency. To address the second challenge, we propose CrossStitch, a lightweight convolutional module that enforces cross-map consistency without incurring the cost of global attention. Our experiments show that HiMat achieves high-fidelity 4K SVBRDF generation with superior efficiency, structural consistency, and diversity compared to prior methods. Beyond materials, our framework also generalizes to related applications such as intrinsic decomposition.

2510.03244 2026-06-09 cs.LG cs.AI cs.CV Updated version

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

Affiliations * Tsinghua Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory; Ant Group; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); University of Pennsylvania

AI summary Proposes VFEM, which uses a pretrained large vision model and fuses visual and temporal features via cross-modal attention; training only 7.45% of the parameters, it captures cross-variable dependencies and improves multivariate time series forecasting.

Details
Abstract

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

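2509.13930 2026-06-09 cs.CL Updated version

Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, Kevin Duh

Affiliations * University of Washington

AI summary Studies the preference of models in multilingual RAG systems for English source documents and finds that models sacrifice document relevance to favor a preferred language, with the effect more pronounced for lower-resource languages.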

Comments ICML 2026 Spotlight

Details
Abstract

Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. Despite their growing use, an open question is whether the mixture of different document languages impacts generation and citation behavior in unintended ways. To investigate this, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. More crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.