arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19418 2026-05-20 cs.AI

Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

Longgang He, Longzhu He, Daojing He, Chaozhuo Li

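AI summary: This paper proposes SIGMA, a framework that uses signed graph modeling to explicitly capture trust, conflict, and neutral relations among agents, improving the reasoning ability and conflict resilience of multi-agent systems.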

Abstract

LLM-based multi-agent systems (MAS) have demonstrated strong reasoning and decision-making capabilities that consistently surpass those of single LLM agents. However, their performance often suffers from naive aggregation mechanisms that assume uniformly cooperative interactions. Upon close inspection, we observe that existing graph-based MAS frameworks (1) propagate errors uncontrollably when conflicting signals arise, and (2) lack explicit modeling of conflicting inter-agent relations as well as structural awareness, failing to identify reliable interaction patterns. To bridge this gap, we introduce SIGMA, a novel SIgned Graph-informed Multi-Agent reasoning framework that explicitly captures trust, conflict, and neutral relations among agents via a signed relational graph. Specifically, given a query, SIGMA first selects a set of relevant and diverse agents, then constructs a structured signed interaction graph with confidence-weighted edges. Reasoning proceeds through conflict-aware signed message passing, which reinforces information from trustworthy agents while suppressing conflicting signals, and terminates with a structure- and conflict-aware weighted aggregation to yield globally consistent and conflict-resilient predictions. Extensive experiments on six benchmark datasets, across multiple LLM backbones and diverse multi-agent configurations, demonstrate that SIGMA consistently outperforms state-of-the-art baselines, achieving notable gains in both accuracy and conflict-resilient performance.
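The abstract does not give SIGMA's update rule, so the following is only a minimal sketch of what signed message passing followed by conflict-aware aggregation could look like. The function names, the edge encoding, the mixing factor `alpha`, and the weighting scheme are all assumptions made for illustration and are not taken from the paper.

```python
def signed_message_passing(beliefs, edges, rounds=2, alpha=0.5):
    """Hypothetical sketch, not SIGMA's published update rule.

    beliefs: dict agent -> score in [-1, 1] for a yes/no answer
    edges:   dict (src, dst) -> (sign, confidence); sign is +1 if dst
             trusts src, -1 if dst conflicts with src, 0 if neutral
    """
    for _ in range(rounds):
        updated = {}
        for agent, own in beliefs.items():
            num = den = 0.0
            for (src, dst), (sign, conf) in edges.items():
                if dst == agent and sign > 0:
                    # Only trusted neighbours contribute messages;
                    # conflicting and neutral edges are suppressed.
                    num += conf * beliefs[src]
                    den += conf
            message = num / den if den > 0 else own
            updated[agent] = (1 - alpha) * own + alpha * message
        beliefs = updated
    return beliefs


def conflict_aware_aggregate(beliefs, edges):
    """Weight each agent by how its peers regard it (assumed scheme)."""
    weights = {}
    for agent in beliefs:
        weight = 1.0
        for (src, _dst), (sign, conf) in edges.items():
            if src == agent:
                weight += sign * conf  # trust raises, conflict lowers
        weights[agent] = max(weight, 0.0)
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[a] * beliefs[a] for a in beliefs) / total


# Toy example: A and B trust each other, both conflict with C.
beliefs = {"A": 1.0, "B": 0.8, "C": -1.0}
edges = {
    ("A", "B"): (+1, 0.9),
    ("B", "A"): (+1, 0.9),
    ("C", "A"): (-1, 0.8),
    ("C", "B"): (-1, 0.8),
    ("A", "C"): (0, 0.0),
}
final = signed_message_passing(beliefs, edges)
answer = conflict_aware_aggregate(final, edges)
```

In this toy setup the two mutually trusting agents pull each other's scores together, while the agent that both of them conflict with ends up with zero aggregation weight.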

2605.19410 2026-05-20 cs.CV

Vision Harnessing Agent for Open Ad-hoc Segmentation

Zilin Wang, Stella X. Yu

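AI summary: This paper proposes VASA, a vision harnessing agent for ad-hoc segmentation that combines a vision-language model, a segmentation foundation model, and a visually grounded workflow to perform ad-hoc segmentation without training; it performs strongly on benchmarks including PARS and RefCOCOm.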

Comments 23 pages, 11 figures

Abstract

Segmentation has become easy when the concept is known, requiring retrieval of a learned visual grounding from text. It remains hard for open ad-hoc concepts, where the grounding may not exist as one learned mask and must often be constructed from image evidence through parts, relations, exclusions, and collections. We propose a Vision-guided Ad-hoc Segmentation Agent (VASA), the first vision harnessing agent for open ad-hoc segmentation. VASA is training-free and couples a VLM agent, a segmentation foundation model, and a visually grounded workflow. Rather than revising text prompts alone, VASA uses a persistent working mask to reason, construct, and validate a solution. It plans visual operations, invokes segmentation tools, inspects results, edits the mask, and recovers from errors. We construct PARS, a new benchmark that turns part-level labels in PartImageNet into open ad-hoc concepts through long-form definition queries. On PARS, VASA outperforms open-vocabulary, reasoning-based, and agentic baselines, surpassing SAM3 Agent by 14-25%. On RefCOCOm, a standard multi-granularity referring segmentation benchmark, VASA improves over SAM3 Agent by 5-9% and over other agentic baselines by up to 20%. These results validate agentic visual construction for open ad-hoc segmentation. Our work points to a path for AI agents beyond wrapping foundation models as tools: programming them with task knowledge, VLM behavior, visual routines, working memory, and failure-aware workflows.

2605.19407 2026-05-20 cs.LG cs.AI

A Bitter Lesson for Data Filtering

Christopher Mohri, John Duchi, Tatsunori Hashimoto

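AI summary: This paper studies data filtering in large-model pretraining and finds that, with enough compute, filtering is not the best choice, because sufficiently trained large models can tolerate low-quality data and even benefit from it.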

Abstract

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.

2605.19403 2026-05-20 cs.LG

TIDE: Asymmetric Neural Circuits for Stabilized Temporal Inhibitory-Excitatory Dynamics

Alexander Kyuroson, Denis Kleyko, Marcus Liwicki

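AI summary: This paper proposes the TIDE architecture, which stabilizes temporal dynamics with asymmetric excitatory-inhibitory networks and combines Wilson-Cowan dynamics with lateral inhibition to improve biological realism and learning performance; experiments show it outperforms CTM in both training time and accuracy.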

Abstract

The recent Continuous Thought Machine (CTM) architecture decouples internal computation from external inputs via neural dynamics, but relies on multi-layer perceptrons without stability guarantees. We propose to model neural dynamics using asymmetric Excitatory-Inhibitory (E-I) networks, which can be stabilized via principles from network theory and can be expressed as energy-based systems optimized through a game-theoretic loss. Building on this perspective, we introduce the Temporal Inhibitory-Excitatory Dynamic Engine (TIDE), a neuro-inspired architecture that computes internal representations through neural dynamics stabilized by incorporating Wilson-Cowan dynamics and lateral inhibition. TIDE balances biological realism with an end-to-end trainable architecture by, for instance, using Hierarchical Receptive Fields and enforcing Dale's principle to ensure a realistic 80:20 E-I balance ratio. The aim of this paper is to introduce a new architecture that brings neuro-inspired learning to the forefront. We present proofs of convergence, stability, and complexity bounds, along with empirical ablation studies. Overall, TIDE surpasses CTM with under 50% of the training time and improves top-1 accuracy by an average of +1.65% on ImageNet under various perturbations.
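For readers unfamiliar with the Wilson-Cowan dynamics mentioned above, here is a minimal sketch of the textbook two-population form (one excitatory and one inhibitory rate, without refractory terms), integrated with forward Euler. It is not TIDE's architecture; the parameter values and the function name `wilson_cowan` are assumptions chosen only for illustration. Dale's principle appears here in the simplest possible way: the excitatory population only ever adds to its targets' input and the inhibitory population only ever subtracts.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def wilson_cowan(steps=200, dt=0.1, tau_e=1.0, tau_i=2.0,
                 w_ee=10.0, w_ei=12.0, w_ie=9.0, w_ii=3.0,
                 drive_e=1.0, drive_i=0.0, e0=0.1, i0=0.1):
    """Euler integration of a two-population Wilson-Cowan model.

    All weights are non-negative magnitudes; the signs in the update
    make the E population excitatory (+) and the I population
    inhibitory (-). Parameter values are illustrative assumptions.
    """
    e, i = e0, i0
    trajectory = [(e, i)]
    for _ in range(steps):
        de = (-e + sigmoid(w_ee * e - w_ei * i + drive_e)) / tau_e
        di = (-i + sigmoid(w_ie * e - w_ii * i + drive_i)) / tau_i
        e, i = e + dt * de, i + dt * di
        trajectory.append((e, i))
    return trajectory


trajectory = wilson_cowan()
```

Because each Euler step is a convex combination of the current rate and a sigmoid output whenever `dt / tau` is at most 1, both rates stay within [0, 1] for these settings.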

2605.19394 2026-05-20 cs.CL cs.AI

EmbGen: Teaching with Reassembled Corpora

Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva

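AI summary: This paper proposes EmbGen, a pipeline that generates synthetic data by reassembling a corpus, aiming to improve instruction-tuned models on datasets of varying semantic heterogeneity. It decomposes the corpus into entity-description pairs, reassembles them by embedding similarity, and generates question-answer pairs through cluster-based sampling, improving Binary Accuracy under fixed token budgets.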

Comments 8 pages, 4 images (32 pages with appendix)

Abstract

Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which are expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab, and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). For evaluation, we use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composite metric combining Factual Accuracy and Completeness. Relative to the strongest baseline, EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at the 5M-token budget and 88.9% at the 20M-token budget, while remaining competitive on the other, less heterogeneous datasets.

2605.19393 2026-05-20 cs.CV cs.LG

Neuron Incidence Redistribution for Fairness in Medical Image Classification

Abin Shoby, Lyle John Palmer, Nikhil Cherian Kurian

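AI summary: This paper proposes Neuron Incidence Redistribution (NIR), a lightweight regularization method that improves fairness in medical image classification by reducing the variance of predicted-probability-weighted mean activations; experiments show markedly lower TPR and FPR disparities across age and gender groups.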

Comments 4 Pages, 1 Figure

Abstract

Deep learning models for medical image classification are susceptible to subgroup performance disparities across demographic attributes such as age, gender, and race. We identify a latent representational mechanism underlying these disparities: in transfer-learned models, the dominant penultimate-layer activation channel under positive predictions is co-activated by both disease-positive samples and privileged demographic groups (male, older patients), producing over-diagnosis; conversely, the dominant channel under negative predictions is co-activated by disadvantaged groups (female, younger patients), producing systematic under-diagnosis. To address this, we propose Neuron Incidence Redistribution (NIR), a lightweight regularization method that penalizes the variance of predicted-probability-weighted mean activations across penultimate-layer neurons, requiring no demographic labels at training time. On HAM10000, TPR disparity drops from 10.81% to 0.93% across age groups and from 12.04% to 0.74% across gender, with a marginal AUC improvement of 0.51 points. On Harvard OCT-RNFL, NIR reduces FPR disparity for race (from 15.68% to 10.66%) and age (from 12.69% to 1.80%), demonstrating that distributing latent disease evidence across the full penultimate layer is a principled and effective strategy for improving demographic fairness in medical AI.
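One plausible reading of the NIR penalty described above is sketched below: compute, for each penultimate-layer neuron, its mean activation over a batch weighted by the predicted probability, then penalize the variance of those per-neuron means. The exact normalization and how the term is combined with the classification loss are not stated in the abstract, so the function `nir_penalty` and its details are assumptions.

```python
def nir_penalty(activations, probs):
    """Variance across neurons of probability-weighted mean activations.

    activations: list of per-sample activation lists (batch x neurons)
    probs:       predicted positive-class probability for each sample
    This is an assumed reading of the abstract, not the authors' code.
    """
    total_p = sum(probs)
    if total_p == 0:
        raise ValueError("predicted probabilities sum to zero")
    n_neurons = len(activations[0])
    # Probability-weighted mean activation of each neuron over the batch.
    means = [
        sum(p * row[j] for p, row in zip(probs, activations)) / total_p
        for j in range(n_neurons)
    ]
    centre = sum(means) / n_neurons
    # Population variance of the per-neuron means.
    return sum((m - centre) ** 2 for m in means) / n_neurons


# Evidence concentrated in one neuron versus spread over all four.
concentrated = nir_penalty([[4, 0, 0, 0], [4, 0, 0, 0]], [0.9, 0.8])
spread = nir_penalty([[1, 1, 1, 1], [1, 1, 1, 1]], [0.9, 0.8])
```

Under this reading the penalty is large when a single channel dominates and vanishes when activation is distributed evenly, which matches the abstract's stated goal of spreading latent disease evidence across the full penultimate layer.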

2605.19392 2026-05-20 cs.LG

Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach

Yi Feng, Weiming Ou, Xiao Wang

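AI summary: This paper studies the dynamics of Adam-DA in zero-sum games through an ODE approach, shows that the momentum parameters play the opposite role in zero-sum games to the one they play in minimization problems, and validates this finding with GAN experiments.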

Abstract

The remarkable success of Adam in training neural networks has naturally led to the widespread use of its descent-ascent counterpart, Adam-DA, for solving zero-sum games. Despite its popularity in practice, a rigorous theoretical understanding of Adam-DA still lags behind. In this paper, we derive ordinary differential equations (ODEs) that serve as continuous-time limits of Adam-DA. These ODEs closely approximate the discrete-time dynamics of Adam-DA, providing a tractable analytical framework for understanding its behavior in zero-sum games. Using this ODE approach, we investigate two fundamental aspects of Adam-DA: local convergence and implicit gradient regularization. Our analysis reveals that the roles of the first- and second-order momentum parameters in zero-sum games are exactly the opposite of their well-documented effects in minimization problems. We validate these predictions through GAN experiments across multiple architectures and datasets, demonstrating the practical implications of this reversed momentum effect.
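To make the object of study concrete, here is a small sketch of descent-ascent with Adam on the bilinear game f(x, y) = x * y, where x is the minimizing player and y the maximizing player. The abstract does not spell out the exact Adam-DA variant, so simultaneous updates with one independent textbook Adam state per player are an assumption, as are the names `Adam` and `adam_da`.

```python
import math


class Adam:
    """Textbook Adam state for a single scalar parameter."""

    def __init__(self, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = self.v = 0.0
        self.t = 0

    def step(self, grad):
        """Return the signed update to add to the parameter."""
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return -self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


def adam_da(x, y, steps, **adam_kwargs):
    """Simultaneous Adam descent on x and ascent on y for f = x * y."""
    opt_x, opt_y = Adam(**adam_kwargs), Adam(**adam_kwargs)
    path = [(x, y)]
    for _ in range(steps):
        grad_x, grad_y = y, x        # df/dx = y, df/dy = x
        dx = opt_x.step(grad_x)      # descent on f for the min player
        dy = opt_y.step(-grad_y)     # descent on -f, i.e. ascent on f
        x, y = x + dx, y + dy
        path.append((x, y))
    return path


path = adam_da(1.0, 1.0, steps=1, lr=0.01)
```

Running `adam_da` for more steps with different `beta1` and `beta2` values and tracking the distance of `(x, y)` from the equilibrium at the origin is one simple way to explore the momentum effects the paper analyzes; no particular outcome is claimed here.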

2605.19390 2026-05-20 cs.CV

LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue

Chaoyue Li, Yongxue Xu, Jie Feng, Jiayu Ding

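AI summary: This paper introduces trajectory-grounded multi-turn spatiotemporal dialogue and proposes LMM-Track4D, which combines RTGE, the TRK state token, and the OSK-RA decoder to improve 4D dynamic reasoning in LMMs; experiments suggest that explicit dynamic state modeling is an effective design principle.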

Abstract

Recent large multimodal models (LMMs) have become increasingly capable on image and video understanding, yet still struggle to sustain 4D continuous spatiotemporal dynamic reasoning. To study this capability gap, we formulate trajectory-grounded multi-turn spatiotemporal dialogue, a new task in which a model must answer spatiotemporal queries while returning structured 3D target trajectories over an entire short clip or a specified segment of a longer clip, and introduce Track4D-Bench, a benchmark with 526 clip-level dialogue samples spanning 23.5k frames and 7.5k object annotations, for training and evaluation. Building on this task, we propose LMM-Track4D, which combines RTGE (Ray-Time Geometry Encoding), a dedicated streaming state token TRK for long-horizon dynamic propagation, and an Object-Slot Kinematic, Residual-Anchor (OSK-RA) decoder for stable 4-step 3D state estimation under occlusion and viewpoint variation. Experiments on Track4D-Bench show consistent improvements over strong baselines, suggesting that explicit dynamic state modeling is a useful design principle for eliciting 4D dynamic reasoning in LMMs. Our code and dataset will be publicly available at https://github.com/mikubaka88/LMM-Track4D.

2605.19386 2026-05-20 cs.CV

MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos

Yang Yang, Yiyan Wang, Zheming Liu, Naoya Iwamoto

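AI summary: This paper proposes MatPhys, which predicts spring-mass parameters from single-view video and addresses the limitations of existing methods in material assumptions and cross-scene consistency, improving the accuracy and generalization of deformable object simulation.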

Comments Submitted to SIGGRAPH Asia 2026

Abstract

Reconstructing simulation-ready deformable objects is important for vision, graphics, and robotics. Existing physics-driven methods can recover physical digital twins from videos, but they suffer from two fundamental limitations: they typically assume a homogeneous material across the whole object, and their scene-specific inverse optimization, combined with the inherent ambiguity of monocular observation, yields inconsistent parameters for the same material across different scenes or interactions. We propose MatPhys, a material-aware feed-forward framework that predicts spring-mass parameters from a single-view video, addressing these two issues with two coupled designs. To relax the homogeneous material assumption, we use DINO features to decompose the object into semantically meaningful parts and to query a part-level material prior, assigning each part its own physical behavior. To enforce cross-scene consistency, we introduce a learned material codebook of shared material embeddings as the bridge between appearance and physics, and further use the part-level prior as a reference distribution that constrains the decoder so that the same material yields consistent parameters across scenes and interactions. Together, these designs turn an under-constrained monocular problem into feed-forward inference grounded on shared, reusable material concepts. Experiments show that our method matches per-scene optimization baselines in reconstruction and future prediction, while achieving stronger generalization to unseen interactions and objects with more consistent physical parameters.

2605.19382 2026-05-20 cs.AI

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

Qiran Zhang, Yuheng Wang, Runde Yang, Lin Wu, Jingru Fan, Shu Yao, Jie Zhang, Tianle Zhou, Huatao Li, Ruijie Shi, Yihan Li, Chen Qian

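AI summary: This paper proposes the PRISM benchmark, which uses 10,372 human-calibrated instruction-code pairs (20 times larger than prior benchmarks) to evaluate whether language models can generate spatially correct animated outputs, and reveals a significant gap between execution success rate and spatial pass rate.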

Abstract

Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.

2605.19378 2026-05-20 cs.CV

Sparse Mixture-of-Experts Routing in Visual Diffusion Transformers: Diagnosis, Boundary Calibration and Evolutionary Roadmap from Routing Collapse to Selective Deadlock

Haiying Sha

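AI summary: This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. From routing decision time series covering more than 65 million tokens, it proposes the Functional Redundancy Hypothesis and outlines a three-step evolutionary roadmap from visual unification to a world model.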

Abstract

This paper systematically diagnoses the training failure modes of Token-Choice sparse Mixture-of-Experts (MoE) on video Diffusion Transformers. Starting from a pretrained dense model of about 5 billion parameters, we convert it into an MoE architecture following three laws: routed experts exactly clone the original FFN weights, shared experts are initialized to zero for verification and then to extremely small non-zero noise for actual training, while only the gating networks start from random initialization. Experiments reveal a hierarchy of five failure modes: (1) linear routers suffer global soft saturation with complete expert homogenization; (2) MLP routers introduce selective deadlock, where roughly one-third of layers degenerate into a single-expert mode that cannot be prevented by increasing the auxiliary loss; (3) cross-attention routers exhibit preliminary self-recovery, yet about nine layers remain stubbornly deadlocked; (4) deadlocked layers display a U-shaped distribution, concentrated in shallow visual processing layers and deep semantic integration layers; (5) bfloat16 mixed precision causes tiny weight updates to be truncated to zero by hardware. Based on routing decision time series over 65 million tokens across 5,000 training steps, we propose the Functional Redundancy Hypothesis: deadlock is a rational waiting strategy before the shared expert matures within the gate-shared expert-routed expert triadic system. This hypothesis is supported by the theory of functional redundancy in systems biology. On the engineering side, we summarize the Three Laws of dense-to-MoE conversion and provide a complete solution for the bfloat16 precision trap. We calibrate the current capability boundary of the Token-Choice paradigm and outline a three-step evolutionary roadmap from visual unification to a world model.
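Failure mode (5) can be illustrated without special hardware by emulating bfloat16 rounding in software. The sketch below rounds a float32 value to the nearest bfloat16 (round-to-nearest-even on the upper 16 bits) and shows a small update vanishing when the weight itself is stored in bfloat16. The helper `to_bfloat16` is a hypothetical name, it ignores NaN and infinity handling, and the float32 master copy shown for contrast is a commonly used mitigation rather than necessarily the solution the paper provides.

```python
import struct


def to_bfloat16(x):
    """Round x to the nearest bfloat16 value (NaN/inf not handled)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


update = 1e-4

# Weight stored directly in bfloat16: every update is rounded away,
# because the spacing between bfloat16 values near 1.0 is 2**-7.
weight_bf16 = 1.0
for _ in range(100):
    weight_bf16 = to_bfloat16(weight_bf16 + update)

# Higher-precision master copy: updates accumulate and only the value
# used for the forward pass is cast down.
master = 1.0
for _ in range(100):
    master += update
weight_from_master = to_bfloat16(master)
```

After 100 steps the bfloat16-only weight has not moved from 1.0, while the cast-down master weight has advanced to the next representable bfloat16 value above 1.0.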

2605.19377 2026-05-20 cs.LG cs.AI

The Evaluation Game: Beyond Static LLM Benchmarking

Paul Wang, Jade Garcia-Bourrée, Anne-Marie Kermarrec, Vincent Corruble

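AI summary: This paper proposes a game-theoretic framework for evaluating the safety of large language models. It uses group actions to represent data augmentation when analyzing the interaction between evaluator and trainer, and distinguishes locally generalizing fixes from memorized patches under adversarial testing.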

Comments 36 pages

Abstract

As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds, whereas other settings can yield very different behaviors. We further provide empirical evidence supporting locality-dependence of the model: for the three model families we tested (Llama, Qwen and Mistral), we have significant evidence that fine-tuning on adversarial prompts induces only local generalization, with refusal rates on test examples highly correlated with the distance to the fine-tuning prompts. Our framework recasts the central object of adversarial evaluation: a benchmark is not a static set of prompts but an orbit under the evaluator's group action, and audit protocols that ignore trainer-side adaptation cannot distinguish a genuine fix from a memorized patch.

2605.19374 2026-05-20 cs.CV cs.AI cs.LG

Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings

Chenyu Lian, Hong-Yu Zhou, Chun-Ka Wong, Jing Qin

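AI summary: This paper proposes CoNNS, a concept-guided noisy-negative suppression framework that builds a hierarchical concept ontology to address noisy negatives caused by similar findings across different patients, improving performance on zero-shot understanding tasks.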

Comments Early accepted by MICCAI 2026

Abstract

Vision-language alignment using chest X-rays and radiology reports has emerged as an advanced paradigm for zero-shot classification and grounding of chest X-ray findings. However, standard contrastive learning typically treats radiographs and reports from different patients simply as negative pairs. This assumption introduces noisy negatives, as different patients frequently exhibit similar findings. Such noisy negatives cause semantic ambiguity and degrade performance in zero-shot understanding tasks. To address this challenge, we propose CoNNS, a concept-guided noisy-negative suppression framework. To support the negative suppression mechanism, unlike previous methods that use raw reports or templatized texts, we construct a hierarchical concept ontology using large language models. The ontology structures 41 key clinical concepts by explicitly modeling presence, attributes (location and characteristics), and texts (evidential segment and presence statement). Leveraging this ontology, we implement a cross-patient pair relabeling strategy comprising three steps: (1) Fine-Grained Breakdown to categorize pairs based on finding presence; (2) Noisy Negative Filtering to resolve semantic conflicts by removing false negatives; and (3) Hard Negative Mining to identify subtle attribute discrepancies using a lightweight language model. Finally, we propose a Concept-Aware NCE loss to align visual features with text while suppressing the identified noisy negatives. Extensive experiments across multi-granularity zero-shot grounding tasks and five zero-shot classification datasets validate that CoNNS outperforms existing state-of-the-art models. The code is available at https://github.com/DopamineLcy/conns.
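The abstract does not define the Concept-Aware NCE loss, but the general idea of suppressing noisy negatives can be sketched with a standard image-to-text InfoNCE loss in which flagged cross-patient pairs are simply left out of the denominator. The function `masked_info_nce`, the mask layout, and the temperature value are assumptions for illustration only.

```python
import math


def masked_info_nce(sim, noisy_mask, temperature=0.07):
    """Image-to-text InfoNCE that skips flagged noisy negatives.

    sim[i][j]:        similarity between image i and report j
    noisy_mask[i][j]: True if the off-diagonal pair (i, j) was flagged
                      as a noisy negative (similar findings)
    A generic masked InfoNCE, not the paper's Concept-Aware NCE loss.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Keep the positive (j == i) and every unflagged negative.
        kept = [l for j, l in enumerate(logits)
                if j == i or not noisy_mask[i][j]]
        peak = max(kept)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(l - peak) for l in kept))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


sim = [[0.9, 0.8],
       [0.1, 0.9]]
no_mask = [[False, False], [False, False]]
# Patient 0's image and patient 1's report share a finding.
mask = [[False, True], [False, False]]

plain = masked_info_nce(sim, no_mask)
suppressed = masked_info_nce(sim, mask)
```

Dropping the flagged pair removes the penalty the model would otherwise pay for scoring a semantically matching cross-patient report highly, so the masked loss is lower than the plain one on this toy batch.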

2605.19371 2026-05-20 cs.CV cs.AI

Multi-Scale Generative Modeling with Heat Dissipation Flow Matching

Jun Ma, Hanquan Zhang, Yanjun Qin, Haoyuan Guan, Ke Zhang

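AI summary: This paper proposes Heat Dissipation Flow Matching (HDFM), which injects multi-scale priors by introducing a continuous blur (heat-dissipation) process, bringing blur-based models, previously limited to SDE frameworks, into ODE frameworks such as Flow Matching to better preserve multi-scale detail and color budgets.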

Abstract

Diffusion models are widely used in image generation, with most relying on noise-based corruption and denoising. A distinct branch instead uses blur as the main corruption, better preserving color budgets and multi-scale detail by providing multi-scale priors. However, blur-based models remain in SDE-based frameworks and are not integrated into ODE-based frameworks, such as Flow Matching (FM). Meanwhile, in the blur-based formulation, the classical inverse heat-dissipation (IHD) process is ill-posed. Moreover, under the data-manifold assumption, regressing blurred images from high-dimensional noise (or velocity) space is also difficult. We propose Heat Dissipation Flow Matching (HDFM), which introduces a continuous blurred (heat-dissipation) process into FM to inject multi-scale priors. HDFM aligns an interpolated heat-dissipation path to address ill-posedness and adopts x-prediction to mitigate high-dimensional regression difficulty. Toy experiments and ablation studies show that HDFM consistently benefits from both blur and x-prediction. HDFM outperforms most baseline methods on all datasets.

2605.19366 2026-05-20 cs.LG

Accurate, Efficient, and Explainable Deep Learning Approaches for Environmental Science Problems

Jimeng Shi

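AI summary: This dissertation proposes deep learning methods for three complex environmental science problems: the WaLeF model for flood prediction in coastal river systems, the CoDiCast model for global weather prediction, and Hypercube-RAG for scientific question answering in environmental science, with the aim of improving the accuracy, efficiency, and explainability of environmental intelligence.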

Comments 161 pages

Abstract

Environmental science plays a pivotal role in safeguarding ecosystems, a domain driven by large-scale, heterogeneous data. In the big data era, artificial intelligence (AI) has emerged as a transformative tool for learning patterns and supporting decision-making. This dissertation develops AI-based approaches tailored to complex environmental science problems to achieve Environmental Intelligence, studying three specific challenges. First, we focus on flood prediction and management in coastal river systems. Conventional physics-based models are computationally intensive, limiting real-time application. To overcome this, we propose a deep learning (DL)-based model, WaLeF, for water level forecasting, and a forecast-informed DL model, FIDLAr, to manage water levels. Evaluated in a flood-prone coastal system in South Florida characterized by extreme rainfall and sea level fluctuations, FIDLAr outperforms baselines in accuracy and efficiency while providing interpretable outputs. Second, we target global weather prediction, which is challenged by massive data scale. Traditional physics methods are deterministic and computationally heavy. We propose CoDiCast, a conditional diffusion model tailored for probabilistic weather forecasting. Adapted from generative AI for predictive tasks, experiments show CoDiCast achieves accurate, efficient forecasts with explicit uncertainty quantification. Lastly, we address scientific question-answering in environmental science. When answering in-domain questions, large language models (LLMs) often suffer from hallucinations due to out-of-date or limited knowledge. While retrieval-augmented generation (RAG) retrieves domain-specific knowledge, existing methods trade off accuracy, efficiency, or explainability. We propose Hypercube-RAG, built on a structured text cube framework, which successfully exhibits all three properties simultaneously.

2605.19360 2026-05-20 cs.CV cs.LG cs.NE physics.app-ph physics.optics

Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection

可扩展的、节能的光学-神经架构用于多路复用的深度伪造视频检测

Parnian Ghapandar Kashani, Shiqi Chen, Aydogan Ozcan

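AI summary This paper proposes a hybrid deepfake video detection framework that pairs a lightweight digital front-end with a spatially multiplexed optical decoding back-end, using a programmable spatial light modulator for massively parallel analog inference, which lowers computational cost while delivering high-throughput, accurate video authenticity prediction.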

Comments 30 Pages, 8 Figures

详情
AI中文摘要

AI生成视觉媒体的快速普及催生了对高效、可信的深度伪造检测系统的需求。然而,现有基于深度学习的检测方法依赖于计算密集且能耗高的推理算法,限制了其可扩展性。本文提出了一种混合的数字-模拟深度伪造视频检测框架,结合轻量级数字前端和空间复用光学解码后端,通过可编程空间光调制器实现大规模并行模拟推理。通过在单次光学传播过程中同时处理15个或更多的视频流,该系统在降低计算成本的同时实现了高吞吐量和准确的视频级真实性预测。我们使用不同数据集验证了该混合深度伪造视频处理器,包括经典面部交换、现实世界深度伪造记录和完全AI生成的视频。使用在可见光谱范围内操作的空间复用实验装置,我们在Celeb-DF视频数据集上实现了97.79%的深度伪造检测准确率、99.86%的灵敏度和95.72%的特异性,分别在15个视频并行处理的单次光学传播中测试。多路复用的光学解码器还展示了对各种视频退化、噪声、压缩、实验偏移和黑盒对抗攻击的鲁棒性。我们的结果表明,将光学计算整合到AI推理中可以同时提高吞吐量、能效和对抗鲁棒性——这三个属性在纯数字系统中难以同时实现。

英文摘要

The rapid proliferation of AI-generated visual media has created an urgent need for efficient, trustworthy deepfake detection systems. However, existing deep learning-based detection methods rely on computationally intensive and energy-demanding inference algorithms, limiting their scalability. Here, we present a hybrid digital-analog deepfake video detection framework that combines a lightweight digital front-end with a spatially multiplexed optical decoding back-end for massively parallel analog inference through a programmable spatial light modulator. By simultaneously processing 15 or more video streams within a single optical propagation pass, the system enables high-throughput and accurate video-level authenticity prediction at reduced computational cost compared with purely digital methods. We validated this hybrid deepfake video processor using different datasets spanning classical face-swapping, real-world deepfake recordings, and fully AI-generated videos. Using a spatially multiplexed experimental set-up operating in the visible spectrum, we achieved average deepfake detection accuracy, sensitivity and specificity of 97.79%, 99.86% and 95.72%, respectively, on the Celeb-DF video dataset with 15 videos tested in parallel in a single optical pass per inference. The multiplexed optical decoder also demonstrates resilience against various types of video degradation, noise, compression, experimental misalignments and black-box adversarial attacks. Our results show that integrating optical computation into AI inference enables simultaneous gains in throughput, energy efficiency, and adversarial robustness - three properties that are difficult to achieve together in purely digital systems.
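
The three reported figures are standard confusion-matrix metrics. A minimal sketch of how they relate, treating "fake" as the positive class; the counts are hypothetical and not taken from the paper. The reported accuracy (97.79%) equals the mean of the reported sensitivity (99.86%) and specificity (95.72%), which is what a class-balanced test set would give, although the abstract does not state the class balance.

```python
def detection_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of fake videos flagged as fake
    specificity = tn / (tn + fp)  # share of real videos passed as real
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {"accuracy": accuracy, "sensitivity": sensitivity, "specificity": specificity}


# Hypothetical, class-balanced counts (100 fake, 100 real videos)
m = detection_metrics(tp=90, fn=10, tn=80, fp=20)

# With balanced classes, accuracy is the mean of sensitivity and specificity
balanced_mean = (m["sensitivity"] + m["specificity"]) / 2
```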

2605.19359 2026-05-20 cs.CV cs.LG

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

MAM-CLIP:基于乳腺X线图集的视觉-语言预训练用于BI-RADS分类

Halil Ibrahim Gulluk, Olivier Gevaert

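AI summary This paper proposes MAM-CLIP, which pretrains a vision encoder with contrastive learning on mammography image-text pairs, using a pretrained PubMedBERT as the language component, to improve BI-RADS classification of mammograms; experiments show notable F1 gains when labeled samples are scarce.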

详情
AI中文摘要

深度学习方法在预测乳腺X线图像的BI-RADS评分方面已显示出有前景的结果。然而,这些图像的解释可能因人而异,即使在放射科医生之间也可能存在差异。鉴于乳腺X线的固有复杂性,仅依靠图像标签训练分类模型通常效果有限。为了解决这一挑战,我们收集了来自两个乳腺图集的2313张乳腺X线图像及其对应的描述。我们提出的方法采用了一个多模态模型,使用预训练的PubMedBERT作为语言组件。通过在图像-文本对上进行对比学习训练,使视觉编码器能够吸收描述中丰富的信息,从而提高其对乳腺X线发现的理解。然后,我们对两个数据集进行微调以进行BI-RADS预测,其性能优于没有此预训练的模型,尤其是在标注样本稀缺时。在3类平均F1分数上,改进范围从+1%到+14%:在40K训练样本时增加+1%,在1K样本时增加+14%。此外,我们的实验表明,来自乳腺图集的2K图像-文本对比2K标注样本更具信息量,当训练样本超过10K时,平均提升幅度为+1.1%。总体而言,我们的工作提供了一个用于乳腺X线的视觉-语言模型,并突显了乳腺图集文本信息的价值。此外,我们公开发布了TEKNOFEST数据集的预处理乳腺X线图像。训练代码、预训练模型权重、数据提取脚本和发布的数据集均可在:https://github.com/igulluk/MAM-CLIP上公开获取。

英文摘要

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
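
The abstract says the model is trained on image-text pairs with contrastive learning but does not give the loss. A minimal sketch, assuming a CLIP-style symmetric InfoNCE objective over L2-normalised embeddings; the function name, the temperature value, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired, L2-normalised embeddings."""
    n = len(image_emb)
    # logits[i][j] = similarity of image i and caption j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_emb]
        for img in image_emb
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_z - row[i]
        return total / n

    columns = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


# Toy batch: matched pairs should score a lower loss than swapped captions
images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Minimising this loss pulls each image embedding toward its own caption and away from the other captions in the batch, which is how caption information can reach the vision encoder.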

2605.19358 2026-05-20 cs.CL

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

驯服思考者:面向自适应大语言模型推理的条件熵塑造

Shuyu Wei, Jian Sun, Delai Qiu, Yining Wang, Shengping Liu, Jiaen Liang, Ying Fu, Wei Huang, Jitao Sang

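AI summary This paper proposes Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy so that large language models give concise solutions to simple problems and explore more deeply on hard ones, balancing response length against accuracy.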

详情
AI中文摘要

基于熵的深度推理已成为提升大语言模型(LLMs)推理能力的有前景方向,但现有方法往往要么无差别地增加响应长度,要么以牺牲准确性为代价缩短响应。为更好地平衡这一权衡,我们引入了条件熵塑造(CES),这是一种能够动态控制token级响应熵的框架,使LLMs在简单问题上产生简洁的解决方案,同时在困难问题上鼓励更深入的探索。基于DAPO,CES使用token级熵作为不确定性信号,并应用一个条件双向策略:它在正确的推理路径上惩罚高熵的“分叉点”token以提高简洁性,并在错误路径上奖励它们以促进探索和错误纠正。我们在DeepSeek-R1-Distill-7B上实现了CES,并在12个数学基准上进行评估。相对于DAPO,CES在持续提升平均准确率的同时缩短了响应长度,补充实验显示在较小的1.5B模型和领域外基准上也呈现出相似的趋势。

英文摘要

Entropy-based deep reasoning has emerged as a promising direction for improving the reasoning capabilities of Large Language Models (LLMs), but existing methods often either increase response length indiscriminately or shorten responses at the cost of accuracy. To better balance this trade-off, we introduce Conditional Entropy Shaping (CES), a framework that dynamically controls token-level response entropy, enabling LLMs to produce concise solutions on simple problems while encouraging deeper exploration on hard ones. Built on DAPO, CES uses token-level entropy as an uncertainty signal and applies a conditional bidirectional policy: it penalizes high-entropy "forking point" tokens on correct reasoning paths to improve conciseness, and rewards them on incorrect paths to encourage exploration and error correction. We implement CES on DeepSeek-R1-Distill-7B and evaluate it on 12 mathematical benchmarks. CES consistently improves average accuracy while reducing response length relative to DAPO, and supplementary experiments show similar trends on a smaller 1.5B backbone and on out-of-domain benchmarks.
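
The conditional bidirectional policy can be illustrated with a small sketch. The abstract gives no formula, so everything below is an assumption made for illustration: the entropy threshold that marks a "forking point", the coefficient `beta`, and the choice to add the shaping term to per-token advantages.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def shaped_advantages(advantages, entropies, is_correct, threshold=1.0, beta=0.1):
    """Adjust per-token advantages at high-entropy tokens only.

    Correct path: high-entropy tokens are penalised (favours concise answers).
    Incorrect path: high-entropy tokens are rewarded (favours exploration).
    `threshold` and `beta` are hypothetical hyperparameters.
    """
    sign = -1.0 if is_correct else 1.0
    return [
        a + sign * beta * h if h > threshold else a
        for a, h in zip(advantages, entropies)
    ]


entropies = [0.2, 2.0]  # second token is a hypothetical forking point
on_correct = shaped_advantages([1.0, 1.0], entropies, is_correct=True)
on_incorrect = shaped_advantages([-1.0, -1.0], entropies, is_correct=False)
```

Low-entropy tokens pass through unchanged in both cases; only the forking-point token is shifted, and the direction of the shift depends on whether the sampled response was correct.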

2605.19357 2026-05-20 cs.CL

SciCustom: A Framework for Custom Evaluation of Scientific Capabilities in Large Language Models

SciCustom: 一个用于大型语言模型科学能力定制评估的框架

Yiyang Gu, Junwei Yang, Junyu Luo, Ye Yuan, Bin Feng, Yingce Xia, Shufang Xie, Kaili Liu, Bohan Wu, Qi Shi, Haoran Li, Beier Xiao, Zhiping Xiao, Xiao Luo, Weizhi Zhang, Philip S. Yu, Zequn Liu, Ming Zhang

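AI summary This paper proposes SciCustom, a framework that builds custom benchmarks from large-scale scientific data to evaluate LLM capabilities on specific scientific tasks without expert annotation or synthetic question generation, and that reveals fine-grained differences in scientific capability.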

Comments Accepted to ACL 2026 Main Conference

详情

英文摘要

Large language models (LLMs) are increasingly applied to scientific research, yet existing evaluations often fail to reflect the fine-grained capabilities required in practice. Most benchmarks are manually curated or domain-generic, limiting scalability and alignment with real scientific use cases. In this paper, we propose a new framework named SciCustom to address the problem. It enables the custom construction of benchmarks from large-scale scientific data to evaluate application-specific scientific capabilities in LLMs. SciCustom first organizes scientific knowledge into ontology-grounded knowledge units with controlled granularity and trains a tagger to map large-scale data instances into this knowledge space. Given a custom requirement, relevant knowledge units are identified via voting-based multi-model consensus. These units enable relevance-aware benchmark retrieval via binary search, followed by proxy subset selection and data-grounded benchmark generation for efficient evaluation. Experiments in chemistry and healthcare demonstrate that SciCustom reveals fine-grained differences in LLM scientific capabilities that standard benchmarks overlook, while requiring neither expert annotation nor synthetic question generation. This work provides a scalable and application-aware foundation for benchmarking scientific capabilities in LLMs. The source code is available at https://github.com/yjwtheonly/SciCustom.

2605.19346 2026-05-20 cs.CL cs.AI cs.LG

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

IMLJD:印度婚姻诉讼分析计算数据集

Joy Bose

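AI summary This paper presents IMLJD, a dataset of 3,613 court judgments for analyzing Indian matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482; structured labels, metadata-derived indicators, and a knowledge graph reveal a gap in quashing-petition success rates between the Supreme Court and the Karnataka High Court.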

Comments 8 pages, 2 figures, 5 tables. Dataset available at huggingface.co/datasets/joyboseroy/imljd and Code at github.com/joyboseroy/imljd

详情
AI中文摘要

我们介绍了IMLJD,一个包含3,613份印度法院判决的开放数据集,涵盖受IPC第498A条、《家庭暴力保护法》和CrPC第482条规制的婚姻纠纷案件。该数据集涵盖最高法院(2000-2024年,1,474份案件)和卡纳塔克高等法院(2018-2024年,2,139份案件),包含结构化结果标签、元数据衍生指标和知识图谱。我们发现,最高法院级别的撤销请求成功率为57.6%,而卡纳塔克高等法院为39.7%。在匹配的2018至2024年期间,最高法院的撤销率是59.3%,扩大了差距至19.6个百分点,证实该发现对时间调整具有鲁棒性。该数据集、代码和知识图谱已公开发布在https://github.com/joyboseroy/imljd和https://huggingface.co/datasets/joyboseroy/imljd。

英文摘要

We present IMLJD, an open dataset of 3,613 Indian court judgments covering matrimonial disputes under IPC Section 498A, the Protection of Women from Domestic Violence Act, and CrPC Section 482. The dataset covers the Supreme Court of India from 2000 to 2024 (1,474 cases) and the Karnataka High Court from 2018 to 2024 (2,139 cases), with structured outcome labels, metadata-derived indicators, and a knowledge graph. We find that 57.6% of quashing petitions succeed at the Supreme Court level compared to 39.7% at the Karnataka High Court level. On a matched 2018 to 2024 period, the SC quash rate is 59.3%, widening the differential to 19.6 percentage points and confirming the finding is robust to temporal adjustment. The dataset, code, and knowledge graph are released openly at https://github.com/joyboseroy/imljd and https://huggingface.co/datasets/joyboseroy/imljd.
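
The figures in the abstract can be checked with a few lines of arithmetic. The variable names are illustrative; every number comes from the abstract itself.

```python
# Case counts per court, as stated in the abstract
sc_cases, khc_cases = 1474, 2139
total_cases = sc_cases + khc_cases

# Quash rates in percent
sc_rate_full = 57.6     # Supreme Court, 2000 to 2024
sc_rate_matched = 59.3  # Supreme Court, matched 2018 to 2024 window
khc_rate = 39.7         # Karnataka High Court, 2018 to 2024

# Differences between percentages are percentage points, not percent
gap_full = round(sc_rate_full - khc_rate, 1)
gap_matched = round(sc_rate_matched - khc_rate, 1)
```

The unmatched comparison gives a 17.9-point gap and the matched window gives the 19.6 points quoted, so restricting both courts to the same years widens the difference by 1.7 points.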

2605.19343 2026-05-20 cs.LG

What Makes a Representation Good for Single-Cell Perturbation Prediction?

什么使一个表示对单细胞扰动预测有效?

Wenkang Jiang, Yuhang Liu, Yichao Cai, Erdun Gao, Jiayi Dong, Ehsan Abbasnejad, Lina Yao, Javen Qinfeng Shi

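AI summary This paper proposes PerturbedVAE, a framework that separates perturbation-specific information from the dominant invariant structure and recovers causal representations that use this information for prediction; an identifiability analysis clarifies how the framework should be specified under the stated conditions.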

Comments Accepted to ICML 2026

详情
AI中文摘要

单细胞扰动建模对于理解和预测细胞对遗传扰动的反应至关重要。然而,现有方法,从因果表示学习到基础模型,往往面临一个被忽视的挑战:基因表达主要由扰动不变信息主导,而扰动特定信号本质上是稀疏的。因此,学习的表示要么将不变和扰动特定信息混合,导致虚假且不可推广的预测器,要么完全抑制扰动特定信号,使它们对预测无效。为了解决这一问题,我们提出了PerturbedVAE,一个通用框架,旨在解决这种信号不平衡。该框架明确将扰动特定信息与主导不变结构分开,并恢复因果表示,以有效利用此类信息进行预测。我们进一步提供了可识别性分析,该分析刻画了稀疏扰动效应可以可靠恢复的条件,从而明确在这些条件下如何具体指定框架。实证上,PerturbedVAE在广泛使用的基准上实现了最先进的性能,在多个评估设置中取得显著进展,在离分布组合预测中获得显著提升,并揭示了可解释的扰动响应程序。

英文摘要

Single-cell perturbation modeling is fundamental for understanding and predicting cellular responses to genetic perturbations. However, existing approaches, from causal representation learning to foundation models, often struggle with an overlooked challenge: gene expression is dominated by perturbation-invariant information, while perturbation-specific signals are intrinsically sparse. As a result, learned representations either entangle invariant and perturbation-specific information, leading to spurious and non-generalizable predictors, or suppress perturbation-specific signals altogether, rendering them ineffective for prediction. To address this, we propose PerturbedVAE, a general framework designed to resolve this signal imbalance. The framework explicitly separates perturbation-specific information from dominant invariant structure and recovers causal representations to effectively utilize such information for prediction. We further provide an identifiability analysis that characterizes the conditions under which sparse perturbation effects can be reliably recovered, thereby clarifying how the framework can be concretely specified under such conditions. Empirically, PerturbedVAE achieves state-of-the-art performance on a widely used benchmark across multiple evaluation settings, yielding significant gains on out-of-distribution combinatorial predictions and uncovering interpretable perturbation-response programs.

2605.19341 2026-05-20 cs.CL cs.AI cs.LG stat.ML

HalluWorld: A Controlled Benchmark for Hallucination via Reference World Models

HalluWorld: 基于参考世界模型的受控幻觉基准

Emmy Liu, Varun Gangal, Michael Yu, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

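AI summary This paper proposes the HalluWorld benchmark, which studies language model hallucination against explicit reference world models and finds that hallucination behavior is uneven across tasks, suggesting that it stems from several distinct failure modes rather than a single capability.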

Comments HalluWorld benchmark (code and data) at github.com/DegenAI-Labs/HalluWorld

详情
AI中文摘要

幻觉仍然是大语言模型的核心失败模式,但现有基准在摘要、问答、检索增强生成和代理交互中操作不一致。这种碎片化使得不清楚一种缓解措施在不同情境中是否有效。当前基准要么需要人工标注和固定参考,要么依赖难以复现的观察。为研究根本原因,我们引入HalluWorld,一个基于显式参考世界模型的可扩展基准:当模型生成一个与该世界不一致的可观察声明时,即产生幻觉。基于这一观点,我们构建了合成和半合成环境,在其中参考世界完全指定,模型观点受控,幻觉标签自动产生。HalluWorld涵盖网格世界、国际象棋和现实终端任务,使世界复杂性、可观察性、时间变化和源冲突政策可控,并将幻觉细分为细粒度错误类别。我们评估了前沿和开放权重语言模型在这些设置中的表现,发现一致模式:前沿模型在直接观察信息上的感知幻觉接近解决,而多步状态跟踪和因果正向模拟仍然困难且未被扩展思考普遍解决。在终端设置中,模型在何时应放弃时也遇到困难。不同探测类型和领域中的失败分布不均,表明幻觉源于不同的失败模式而非单一能力。我们的结果表明,受控参考世界为测量和减少现代语言模型中的幻觉提供了可扩展且可重复的路径。

英文摘要

Hallucination remains a central failure mode of large language models, but existing benchmarks operationalize it inconsistently across summarization, question answering, retrieval-augmented generation, and agentic interaction. This fragmentation makes it unclear whether a mitigation that works in one setting reduces hallucinations across contexts. Current benchmarks either require human annotation and fixed references that may be memorized, or rely on observations in settings that are difficult to reproduce. To study root causes, we introduce HalluWorld, an extensible benchmark grounded in an explicit reference-world formulation: a model hallucinates when it produces an observable claim that is false with respect to this world. Building on this view, we construct synthetic and semi-synthetic environments in which the reference world is fully specified, the model's view is controlled, and hallucination labels are generated automatically. HalluWorld spans gridworlds, chess, and realistic terminal tasks, enabling controlled variation of world complexity, observability, temporal change, and source-conflict policy, and disentangling hallucinations into fine-grained error categories. We evaluate frontier and open-weight language models across these settings and find consistent patterns: perceptual hallucination on directly observed information is near-solved for frontier models, while multi-step state tracking and causal forward simulation remain difficult and are not generally solved by extended thinking. In the terminal setting, models also struggle with when to abstain. The uneven profile of failures across probe types and domains suggests that hallucinations arise from distinct failure modes rather than a single capability. Our results suggest that controlled reference worlds offer a scalable and reproducible path toward measuring and reducing hallucinations in modern language models.

2605.19340 2026-05-20 cs.CV

Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation

选择性、正则化和校准:利用视觉基础模型进行跨域少样本语义分割

Junyuan Ma, Xunzhi Xiang, Wenbin Li, Qi Fan, Yang Gao

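AI summary This paper proposes HERA, a select-regularize-calibrate framework that adapts vision foundation models to cross-domain few-shot semantic segmentation, improving adaptation to new domains and achieving higher mIoU on multiple benchmarks.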

Comments 20 pages, 11 figures, 13 tables. Accepted to CVPR 2026

详情
AI中文摘要

视觉基础模型(VFMs)在各种视觉任务中已取得优异性能。然而,将VFMs应用于跨域少样本分割(CD-FSS)仍然具有挑战性,因为CD-FSS需要在仅少量标记示例的情况下对新类别的对象进行分割,并且在域转移下进行。挑战主要由两个因素驱动:(1)每个新类别的标记示例有限,相对于VFM预训练的规模,这使模型在重新训练时容易过拟合;(2)目标域在预训练期间未被充分代表,导致跨域不一致性和层间敏感性。为了解决这些问题,我们提出了层次示例表示适应(HERA),一种基于VFMs的三阶段选择-正则化-校准分割框架,能够有效利用有限的标签并在不重新训练源数据的情况下适应新领域。我们首先设计了层次层选择(HLS)以自适应地识别最信息丰富的VFM层,使用数据依赖的示例转移风险(ETR)计算每个候选层。然后,先验引导正则化(PGR)对选定的表示进行正则化,产生后续阶段的结构化局部信号。此外,像素级自适应校准(PAC)将选定的表示与细化的交互图结合,校准像素级预测,产生一致的掩码。这些阶段共同形成一个层次选择-正则化-校准的管道,指导冻结的VFM特征在新领域中工作,同时在测试时仅微调不到2.7%的参数。广泛的实验表明,HERA在多个CD-FSS基准上超越了现有最佳方法,mIoU提高了超过4.1个百分点。

英文摘要

Vision foundation models (VFMs) have achieved strong performance across various vision tasks. However, it still remains challenging to apply VFMs for cross-domain few-shot segmentation (CD-FSS), which segments objects of novel classes under domain shifts using only a few labeled exemplars. The challenge is mainly driven by two factors: (1) limited labeled exemplars per novel class relative to the scale of VFM pre-training, making the model prone to overfitting during retraining, and (2) target-domain shifts underrepresented during pre-training, inducing cross-domain inconsistency and layer-wise sensitivity. To address these issues, we propose Hierarchical Exemplar Representation Adaptation (HERA), a three-stage select-regularize-calibrate VFM-based segmentation framework that learns effectively from limited labels and adapts to novel domains without source-data retraining. We first design Hierarchical Layer Selection (HLS) to adaptively identify the most informative VFM layer using a data-dependent Exemplar Transfer Risk (ETR) computed for each candidate layer. Then, Prior-Guided Regularization (PGR) regularizes interactions on the selected representation, yielding well-structured local signals for the subsequent stage. Furthermore, Pixelwise Adaptive Calibration (PAC) combines the selected representation with the refined interaction maps to calibrate pixel-wise predictions, producing consistent masks. Together, these stages form a hierarchical select-regularize-calibrate pipeline that guides frozen VFM features in new domains while fine-tuning less than 2.7% of parameters at test time. Extensive experiments show that HERA surpasses the state of the art by more than 4.1 mIoU across multiple CD-FSS benchmarks.

2605.19337 2026-05-20 cs.AI

Agentic Trading: When LLM Agents Meet Financial Markets

代理交易:当大语言模型代理与金融市场相遇

Yihan Xia, Panpan You, Taotao Wang, Fang Liu, Han Qi, Xiaoxiao Wu, Shengli Zhang

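AI summary This paper surveys how large language models (LLMs) can act as agents in trading systems that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt to market feedback. From an analysis of 77 studies it identifies protocol incomparability as the main problem and offers an evidence ledger, a reproducibility audit, and a reporting checklist as its main contributions.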

Comments 59 pages, 15 figures, 27 tables

详情
AI中文摘要

越来越多的研究探讨如何将大语言模型(LLMs)嵌入到交易系统中作为代理,这些代理能够感知市场信息、检索上下文、对决策进行推理、发出可交易动作并在市场反馈下进行适应。本文将基于LLM的交易代理重新界定为专家系统决策流程,并呈现了一个面向审计的证据图谱,其中包含77项纳入研究,来自一个按协议编码、筛选截至2026年3月9日的快照。主要经验子集(n=19)满足最低边界条件,即动作输出加闭环评估;其余58项研究作为背景和设计语境保留。核心经验发现是协议不可比性:在主要子集中,只有2/19项研究报告可提取的时间一致拆分协议,1/19项报告明确的交易成本模型,1/19项记录了标的池(universe)或幸存者偏差处理,11/19项报告了执行时机或语义,15/19项被编码为R0,没有任何研究达到R3的可重复性。因此,我们使用架构-能力-适应作为工作性的分析视角,而不是经过验证的分类体系,并将证据日志、可重复性审计和报告检查表作为主要贡献。最终的调查表明,架构实验正在迅速扩展,而可比的评估协议、执行语义和可复现的成果仍然是该领域当前的瓶颈。

英文摘要

A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.

2605.19330 2026-05-20 cs.AI cs.LG cs.SE

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

MOCHA:多目标切比雪夫退火用于智能体技能优化

Md Mehrab Tanjim, Jayakumar Subramanian, Xiang Chen, Branislav Kveton, Subhojyoti Mukherjee, Anlan Zhang, Sungchul Kim, Somdeb Sarkhel, Sunav Choudhury

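AI summary This study proposes MOCHA, which uses Chebyshev scalarization with exponential annealing to handle the multi-objective nature of agent skill optimization, discovering more of the Pareto front and improving performance.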

Comments Preprint. 25 pages, 14 figures, 5 tables

详情
AI中文摘要

LLM智能体通过技能组织行为——这些技能是结构化的自然语言规范,指导智能体推理、检索和响应。与单体提示不同,技能是多字段的artifact,受严格平台限制:描述字段因路由被截断,指令正文通过渐进披露压缩,且共存技能竞争有限的上下文窗口。这些限制使技能优化本质上是多目标的:一个技能必须同时最大化任务性能并满足平台限制。然而,现有提示优化器要么忽略这些权衡,要么将其折叠成加权和,忽略了非凸目标区域中的帕累托最优变体。我们引入MOCHA(多目标切比雪夫退火),用切比雪夫标量化替代单目标选择——覆盖完整的帕累托前沿,包括非凸区域——结合指数退火,从探索转向利用。在六个多样化的智能体技能实验中,所有方法共享相同的多目标变异操作符,基线接收相同的单目标文本反馈。现有优化器在六个任务中的四个任务上无法改进种子技能:1000次运行无进展。MOCHA在所有任务中突破,平均正确率比最强基线提高7.5%(在FEVER上达14.9%,在TheoremQA上达10.4%),同时发现两倍多的帕累托最优技能变体。

英文摘要

LLM agents organize behavior through skills - structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, missing Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization - covering the full Pareto front, including non-convex regions - combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills - where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback - existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering twice as many Pareto-optimal skill variants.
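
The claim that a weighted sum misses Pareto-optimal points in non-convex regions, while Chebyshev scalarization can reach them, can be seen on a toy example. This is a sketch of the general technique, not MOCHA itself: the three candidates, the weight grid, the ideal point, and the annealing constants are all hypothetical.

```python
def weighted_sum(scores, weights):
    """Linear scalarization (higher is better)."""
    return sum(w * s for w, s in zip(weights, scores))


def chebyshev(scores, weights, ideal):
    """Weighted Chebyshev distance to the ideal point (lower is better)."""
    return max(w * (z - s) for w, s, z in zip(weights, scores, ideal))


def annealed_temperature(step, t0=1.0, decay=0.9):
    """Exponential schedule: high early (explore), low late (exploit)."""
    return t0 * decay ** step


# Two objectives, both maximised. C is Pareto-optimal but sits in a
# non-convex dent of the front (below the line joining A and B).
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)
weight_grid = [(i / 10, 1 - i / 10) for i in range(11)]

picked_by_sum = {
    max(candidates, key=lambda name: weighted_sum(candidates[name], w))
    for w in weight_grid
}
picked_by_cheb = {
    min(candidates, key=lambda name: chebyshev(candidates[name], w, ideal))
    for w in weight_grid
}
```

No weight vector on the grid makes the weighted sum select C, whereas balanced weights make the Chebyshev criterion select it, because that criterion minimises the worst weighted shortfall from the ideal point rather than a linear combination.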

2605.19325 2026-05-20 cs.LG

An Exterior Method for Nonnegative Matrix Factorization

非负矩阵分解的外方法

Qiujing Lu, Tonmoy Monsoor, Ehsan Ebrahimzadeh, Kartik Sharma, Vwani Roychowdhury

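AI summary This paper proposes an exterior method for nonnegative matrix factorization (eNMF) that separates low-rank approximation from nonnegativity enforcement, addressing the slow convergence and suboptimal solutions of conventional interior methods in nonconvex optimization, and validates its performance on multiple datasets.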

Comments Accepted to ICML 2026

详情
AI中文摘要

非负矩阵分解(NMF)旨在寻找低秩近似$X \approx UV^T$,其中因子非负,并通常使用内部方法在整个优化过程中强制可行性。我们证明,这种约束驱动的方法可能会在非凸景观中阻碍进展,导致收敛缓慢或收敛到次优的驻点。我们提出了一种非负矩阵分解的外部框架(eNMF),将低秩近似与非负性约束分开。我们的方法从最优无约束因子分解初始化,并引入一种旋转过程,将无约束因子映射到距非负象限最近的外部点。这种视角产生了一种算法框架,其中简单的迭代更新收敛到正象限边界上满足KKT条件的驻点。外部形式还使NMF解具有几何解释,澄清了在排列和正交变换下因子分解的等价类。一项引人注目的数值结果,涉及400个NMF实验,涵盖真实和合成数据集,显示在99%的情况下,不同算法倾向于收敛到等价的因子矩阵。我们将eNMF与9种最先进的NMF算法进行基准比较,涵盖9种初始化方案、3个真实世界数据集和2个合成数据集。eNMF始终优于全部81个对比方案,在相同时间设置下重建误差最多降低30%,在相同误差设置下加速最多达150%。下游实验进一步证明了在音频处理和推荐任务中的显著性能提升,证实了所提出外部优化框架的实用价值。代码可在https://github.com/roychowdhuryresearch/eNMF获取。

英文摘要

Nonnegative matrix factorization (NMF) seeks a low-rank approximation $X \approx UV^T$ with nonnegative factors and is commonly solved using interior methods that enforce feasibility throughout optimization. We show that such constraint-driven approaches can impede progress in the nonconvex landscape, leading to slow convergence or convergence to suboptimal stationary points. We propose an exterior framework for NMF (eNMF) that separates low-rank approximation from nonnegativity enforcement. Our method initializes from the optimal unconstrained factorization and introduces a rotation procedure that maps unconstrained factors to an exterior point closest to the nonnegative orthant. This viewpoint yields an algorithmic framework in which simple iterative updates converge to KKT-satisfying stationary points on the boundary of the positive orthant. The exterior formulation also enables a geometric interpretation of NMF solutions, clarifying equivalence classes of factorizations under permutation and orthogonal transformations. An intriguing numerical result, involving 400 NMF experiments across both real and synthetic datasets, shows that in 99% of the cases, different algorithms tend to converge towards equivalent factor matrices. We benchmark eNMF against 9 state-of-the-art NMF algorithms with 9 initialization schemes across 3 real-world and 2 synthetic datasets. eNMF consistently outperforms all 81 competitors, achieving up to 30% lower reconstruction error under equal-time settings and up to 150% speedup under equal-error settings. The downstream experiments further demonstrate substantial performance gains in audio processing and recommendation tasks, corroborating the practical benefits of the proposed exterior optimization framework. Code is available at https://github.com/roychowdhuryresearch/eNMF
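
A rotation can move factors toward the nonnegative orthant without changing the approximation because, for any orthogonal $Q$, $(UQ)(VQ)^T = UQQ^TV^T = UV^T$. A minimal sketch of that invariance on a hand-picked toy example; the matrices and the rotation angle are hypothetical, and this is not the paper's rotation procedure.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def rotation(theta):
    """2x2 rotation matrix, which is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def negative_mass(a):
    """Sum of magnitudes of negative entries; zero means the factor is nonnegative."""
    return sum(-x for row in a for x in row if x < 0)


# Unconstrained factors with negative entries whose product is nonnegative
U = [[1.0, -1.0], [1.0, 1.0], [2.0, 0.0]]
V = [[1.0, -1.0], [1.0, 1.0]]
X = matmul(U, transpose(V))

# Rotate both factors by the same orthogonal matrix
Q = rotation(-math.pi / 4)
U_rot, V_rot = matmul(U, Q), matmul(V, Q)
X_rot = matmul(U_rot, transpose(V_rot))
```

Here the product is unchanged up to floating-point error, while the negative mass of both factors drops from 1 to approximately 0. In general no rotation is guaranteed to make the factors exactly nonnegative, which is why the abstract speaks of the exterior point closest to the orthant.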

2605.19324 2026-05-20 cs.LG

BrainDyn: A Sheaf Neural ODE for Generative Brain Dynamics

BrainDyn: 一种用于生成脑动态的sheaf神经ODE

Siddharth Viswanath, Panayiotis Ketonis, Chen Liu, Michael Perlmutter, Dhananjay Bhaskar, Smita Krishnaswamy

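AI summary This paper proposes BrainDyn, a sheaf neural ODE model for generating brain dynamics that encodes each region's activity history with an LSTM and uses a sheaf Laplacian for message passing, achieving strong forecasting across modalities.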

详情
AI中文摘要

高效的神经网络模型能够生成类似大脑动态的活动,可以用于生成合成数据、分析在测试扰动活动等条件下大脑瞬态的差异以及推断底层生成动态。然而,大型语言模型(LLMs)或标准循环神经网络(RNNs)忽略了解剖组织,因此不产生与脑区对齐的组件。另一方面,基于图的网络通常有非常简单的消息传递规则,这些规则不足以表达类似大脑的动态。为此,我们引入了BrainDyn,一种用于在结构化脑图上连续时间动态的sheaf神经ODE模型。BrainDyn使用长短期记忆(LSTM)模型在滑动时间窗口上编码每个脑区的最近活动历史,以生成隐藏状态(即茎,stalks),这些状态通过可学习的限制映射投影到边特定的共享空间中。这些共享空间中相邻节点之间的差异由sheaf拉普拉斯算子表征,可以促进神经元单元之间的消息传递。这些消息的输出然后被馈送到神经ODE中,该神经ODE控制神经元活动的连续时间演变。我们在静息态fMRI(PNC数据集)、局灶性癫痫患者的头皮EEG(TUSZ数据集)以及由NEST脉冲网络模拟器模拟的活动上对BrainDyn进行了评估。BrainDyn在各模态上均实现了强大的预测能力,所得到的表示支持下游任务,包括计算机模拟(in silico)扰动预测。

英文摘要

Efficient neural network models that generate brain-like dynamic activity can be a valuable resource for generating synthetic data, analyzing differences in brain transients under conditions such as testing perturbation activity or inferring the underlying generative dynamics. However, large language models (LLMs) or standard recurrent neural networks (RNNs) ignore the anatomical organization and therefore do not produce components that align with brain regions. On the other hand, graph-based networks often have very simple message passing rules that are not sufficiently expressive for brain-like dynamics. To address this, we introduce BrainDyn, a sheaf neural ordinary differential equation (neural ODE) model for continuous-time dynamics on structured brain graphs. BrainDyn encodes the recent activity history of each brain region using a long short-term memory (LSTM) model over a sliding temporal window to produce hidden states, or stalks, that are projected through learnable restriction maps into edge-specific shared spaces. Discrepancies between neighboring nodes in these shared spaces are characterized by a sheaf Laplacian that can facilitate message passing between neuronal units. The output of these messages is then fed to a neural ODE that governs the continuous-time evolution of neuronal activity. We evaluated BrainDyn on resting-state fMRI (PNC dataset), scalp EEG with focal epilepsy (TUSZ dataset), and simulated activity from the NEST spiking network simulator. BrainDyn achieves strong forecasting ability across modalities, and the resulting representations support downstream tasks including in silico perturbation prediction.

2605.19322 2026-05-20 cs.CV

DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs

DynaTok: 时序自适应和位置偏见感知的视频大语言模型token压缩

Minyoung Park, Taehun Kong, Sangjun Ahn

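AI summary This paper proposes DynaTok, a training-free, temporally adaptive, positional-bias-aware token compression framework that allocates token budgets across temporal and spatial dimensions, reducing redundant spatio-temporal coverage and improving the efficiency and robustness of Video-LLMs.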

详情
AI中文摘要

近年来,视频大语言模型(Video-LLMs)的进步显著扩展了多模态推理能力。然而,从长视频序列中提取的大量视觉token带来了高昂的计算成本,限制了其在现实场景中的应用。现有的无训练token压缩方法基于注意力大小作为语义重要性的代理进行token选择,但往往忽视位置偏见并仅依赖短期时间局部性,导致冗余的时空覆盖和低效的token使用。我们提出了DynaTok,一种无需训练、时序自适应且偏见感知的token压缩框架,能够在时序和空间维度上分配token预算。通过轻量级的指数移动平均(EMA)内存,时序预算分配(TBA)模块动态地将较少的token分配给冗余帧,将更多的token分配给新颖的帧,捕捉长期时间变化。空间预算分配(SBA)模块通过基于激活的注意力图选择空间多样性和语义重要的特征,同时利用空间内存减少已选区域的冗余并缓解位置偏见。DynaTok无缝集成到现有的Video-LLMs中,如LLaVA-OneVision和LLaVA-Video,无需重新训练,并在高强度压缩下有效保留语义覆盖。在四个代表性VideoQA基准测试-MVBench、LongVideoBench、MLVU和VideoMME上的实验表明,即使在90%的token减少下,DynaTok仍能保留超过95%的基线准确性,优于最近的无训练方法。这些结果表明,DynaTok为高效和稳健的视频推理提供了系统的基础,为未来Video-LLMs实现实时流媒体视频理解铺平了道路。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have greatly expanded multimodal reasoning capabilities. However, the massive number of visual tokens extracted from long video sequences incurs prohibitive computational costs, limiting their deployment in real-world scenarios. Existing training-free token compression methods select tokens based on attention magnitude as a proxy for semantic importance, but often overlook positional bias and rely only on short-term temporal locality, leading to redundant spatio-temporal coverage and inefficient token usage. We present DynaTok, a training-free, temporally adaptive and bias-aware token compression framework that allocates token budgets across both temporal and spatial dimensions. Through a lightweight exponential moving average (EMA) memory, the Temporal Budget Allocation (TBA) module dynamically assigns fewer tokens to redundant frames and more to novel frames, capturing long-term temporal variation. The Spatial Budget Allocation (SBA) module complements this by selecting spatially diverse and semantically important features using activation-based attention maps, while leveraging a spatial memory to reduce redundancy from previously selected regions and mitigate positional bias. DynaTok integrates seamlessly with existing Video-LLMs such as LLaVA-OneVision and LLaVA-Video without retraining, and effectively preserves semantic coverage under aggressive compression. Experiments on four representative VideoQA benchmarks (MVBench, LongVideoBench, MLVU, and VideoMME) show that DynaTok retains over 95% of baseline accuracy even with a 90% token reduction, surpassing recent training-free approaches. These results demonstrate that DynaTok provides a principled foundation for efficient and robust video reasoning, paving the way toward real-time streaming video understanding with future Video-LLMs.
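
The idea behind temporal budget allocation with an EMA memory can be sketched in a few lines. The abstract does not specify the novelty measure or the allocation rule, so the L1 distance, the zero-initialised memory, the smoothing factor `alpha`, and the proportional split below are all assumptions made for illustration.

```python
def allocate_frame_budgets(frame_features, total_budget, alpha=0.5):
    """Split a token budget across frames in proportion to their novelty.

    Novelty is the L1 distance between a frame's feature vector and an
    exponential moving average (EMA) of the frames seen so far.
    """
    memory = [0.0] * len(frame_features[0])
    novelty = []
    for feat in frame_features:
        # A small epsilon keeps fully redundant frames from dividing by zero
        novelty.append(sum(abs(f - m) for f, m in zip(feat, memory)) + 1e-6)
        # EMA update: the memory drifts toward what has been seen repeatedly
        memory = [alpha * m + (1 - alpha) * f for m, f in zip(memory, feat)]
    total = sum(novelty)
    # Every frame keeps at least one token; rounding means the budgets may
    # not sum exactly to total_budget
    return [max(1, round(total_budget * n / total)) for n in novelty]


# Three near-identical frames followed by a scene change (toy 2-D features)
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
budgets = allocate_frame_budgets(frames, total_budget=100)
```

Repeated frames receive progressively fewer tokens as the memory absorbs them, and the scene change receives the largest share because it is far from everything in memory.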

2605.19319 2026-05-20 cs.CV

SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution

Yiren Song, Yihan Wang, Xiyao Deng, Zhuoran Yan, Mike Zheng Shou

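AI summary This paper studies whether image editing models can serve as sparse visual world models for robot manipulation, predicting task-level future states instead of generating dense video. It proposes SWEET, a sparse visual planning framework that generates keyframes from language instructions and spatial guidance and converts them into executable actions with a diffusion action predictor. Experiments show improved keyframe prediction across seen and unseen scenes.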

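Details
AI Chinese abstract

Visual prediction has become a promising paradigm for embodied control, in which future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. This paper studies whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states rather than generating dense video. We first compare the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting and find that image editing produces more reliable task-level keyframes, with better visual fidelity and substantially lower inference cost. Motivated by this, we propose SWEET, a one-shot sparse visual planning framework that generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy that uses filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction in both seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, indicating that image editing is a promising yet underexplored direction for embodied visual prediction.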

English abstract

Visual prediction has emerged as a promising paradigm for embodied control, where future observations are generated and then translated into actions. However, dense video generation is computationally expensive and often unnecessary for many manipulation tasks, whose progress can be summarized by a small number of task-relevant visual states. In this work, we study whether image editing models can serve as sparse visual world models for robot manipulation by predicting task-level future states without dense video rollout. We first conduct a controlled comparison between the video generation model Wan2.2 and the image editing model FLUX-Kontext under the same robotic data setting, and find that image editing produces more reliable task-level keyframes with better visual fidelity and substantially lower inference cost. Motivated by this observation, we propose SWEET, a one-shot sparse visual planning framework that progressively generates a sequence of task-relevant manipulation keyframes through successive image editing, conditioned on language instructions and optional arrow-based spatial guidance. A goal-conditioned diffusion action predictor then converts adjacent imagined keyframes into executable action chunks. To reduce the mismatch between real and edited visual subgoals, we further introduce a mixed-training strategy with filtered edited targets. Experiments on DROID and RoboMimic show that SWEET improves keyframe prediction across seen and unseen scenes and enables a full pipeline from sequential keyframe planning to executable robot actions, suggesting that image editing is a promising and underexplored direction for embodied visual prediction.

2605.19317 2026-05-20 cs.LG cs.AI

Inference-Time Scaling in Diffusion Models through Iterative Partial Refinement

Taegu Kang, Jaesik Yoon, Sungjin Ahn

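AI summary This paper proposes Iterative Partial Refinement, an inference-time scaling method for diffusion models that needs no external verifier. It iteratively refines part of a sample under mixed-noise conditioning to produce more consistent samples, raising the valid solution rate on the MNIST Sudoku task.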

Comments Accepted at the ICLR 2026 Workshop on AI with Recursive Self-Improvement

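Details
AI Chinese abstract

Inference-time scaling has become a major approach for improving reasoning capabilities and is increasingly applied to diffusion models. However, existing inference-time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, which limits their scalability to settings where such evaluators are available and reliable. In addition, although recent diffusion models perform sequential inference with region-wise, mixed-noise conditioning, inference-time scaling tailored to this setting remains relatively unexplored. We propose Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion models that requires no external verifier. Starting from an already-generated sample, IPR re-noises a subset of regions and regenerates them conditioned on the remaining regions, allowing the model to revise earlier decisions with richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks that require global constraint satisfaction, IPR consistently improves performance: on the MNIST Sudoku task, the valid solution rate rises from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference-time scaling strategy for diffusion models in sequential, mixed-noise settings. Code is available at: https://github.com/ahn-ml/IPR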

English abstract

Inference-time scaling has emerged as a major approach for improving reasoning capabilities, and has been increasingly applied to diffusion models. However, existing inference-time scaling methods for diffusion models typically rely on external verifiers or reward models to rank and select samples, limiting their scalability to settings where such evaluators are available and reliable. Moreover, while recent diffusion models perform sequential inference with region-wise, mixed-noise conditioning, inference-time scaling tailored to this setting remains relatively underexplored. We propose Iterative Partial Refinement (IPR), an inference-time scaling method for sequential diffusion that requires no external verifier. Starting from an already-generated sample, IPR re-noises a subset of regions and regenerates them conditioned on the remaining regions, enabling the model to revise earlier decisions under a richer context than was available during the initial generation. This iterative partial refinement produces more globally consistent samples without external verification. On reasoning tasks requiring global constraint satisfaction, IPR consistently improves performance: on MNIST Sudoku, the valid solution rate increases from 55.8% to 75.0%. These results show that iterative partial refinement alone can serve as an effective inference-time scaling strategy for diffusion models in sequential, mixed-noise settings. Code is available at: https://github.com/ahn-ml/IPR
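The re-noise-and-regenerate loop described above can be sketched as a toy in pure Python. This is not the code from the linked repository. The names `toy_regenerate` and `iterative_partial_refinement`, the random choice of regions, the `frac` and `rounds` parameters, and the stand-in "model" (which simply fills masked cells with digits unused by the kept cells) are all assumptions for illustration. A real implementation would call a diffusion model with region-wise noise levels instead.

```python
import random


def toy_regenerate(sample, masked, rng):
    """Stand-in for the diffusion model (hypothetical).

    Fills the masked cells conditioned on the kept cells by drawing
    distinct digits in 1..n that the kept cells do not already use.
    """
    n = len(sample)
    kept = {sample[i] for i in range(n) if i not in masked}
    free = [d for d in range(1, n + 1) if d not in kept]
    rng.shuffle(free)
    out = list(sample)
    for i in sorted(masked):
        # free always holds at least len(masked) digits, so pop() is safe
        out[i] = free.pop()
    return out


def iterative_partial_refinement(sample, regenerate, rounds, frac, rng):
    """Repeatedly re-noise a subset of regions and regenerate them.

    No verifier or reward model is consulted: the regions to redo are
    chosen at random, and each round conditions on the remaining regions.
    """
    n = len(sample)
    k = min(n, max(1, int(frac * n)))
    for _ in range(rounds):
        masked = set(rng.sample(range(n), k))  # regions to re-noise
        sample = regenerate(sample, masked, rng)
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    initial = [1, 1, 1, 2, 2, 3]  # a "row" with repeated digits
    refined = iterative_partial_refinement(initial, toy_regenerate, 5, 0.5, rng)
    print(initial, "->", refined)
```

With this stand-in model the number of distinct digits in the row never decreases from one round to the next, which loosely mirrors the abstract's point that regenerating part of a sample against the rest can improve global consistency. How the actual method picks regions and noise levels is not stated in the abstract.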