arXivDaily: arXiv daily academic digest, updated Monday through Friday
2604.27289 2026-05-27 cs.AI

Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence

Alan L. McCann

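AI summary Using Coq-mechanized proofs and paper proofs, this paper establishes theoretical foundations for structural governance in cognitive workflow systems, covering a coinductive safety predicate, a governance invariance theorem, a sufficiency theorem, an alternating normal form, and a necessity theorem, and uses property-based testing to check that the BEAM runtime agrees with the specification.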

Comments 27 pages, 4 figures, 1 table. Code and proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers. Updated license

Abstract

We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the Interaction Trees library with parameterized coinduction; two are proved on paper with explicit reductions. The Coinductive Safety Predicate (gov_safe) is a coinductive property that captures governance safety for infinite program behaviors, indexed by a boolean permission flag that is provably false for ungoverned I/O and true for governed interpretations (mechanized). The Governance Invariance Theorem establishes that governance is uniform across the meta-recursive tower: governance at level n+1 reduces to governance at level n by definitional equality of the type (mechanized). The Sufficiency Theorem proves that four atomic primitives (code, reason, memory, call) are expressively complete for any discrete intelligent system, formalized as compositional closure of a Kleisli category (mechanized). The Alternating Normal Form provides a canonical decomposition of any machine into alternating code and effect layers, with a confluent rewriting system (paper proof). The Necessity Theorem proves via explicit reduction to Rice's theorem that an architecturally opaque component (the reason primitive) is mathematically necessary for problems requiring semantic judgment (paper proof). A sixth contribution connects the abstract model to the deployed runtime: the Verified Interpreter Specification formalizes the BEAM runtime's trust, capability, and hash chain logic in Coq, then tests the running system against this specification using property-based testing with over 70,000 randomly generated directive sequences and zero disagreements. The mechanization comprises approximately 12,000 lines across 36 modules with 454 theorems and zero admitted lemmas.

2604.24764 2026-05-27 cs.CV

World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang

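AI summary Proposes World-R1, a framework that uses reinforcement learning (Flow-GRPO) with feedback from 3D foundation models and vision-language models to improve the 3D consistency of video generation without modifying the architecture, and adopts a periodic decoupled training strategy to balance rigid geometry with dynamic scenes.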

Comments ICML 2026, Project Page: https://aka.ms/world-r1, Code: https://github.com/microsoft/World-R1

Abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.

2604.19499 2026-05-27 cs.CL

Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics

Dmitry Pronin, Evgeny Kazartsev

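AI summary This paper introduces two new authorship-attribution measures, Rank-Turbulence Delta and Jensen-Shannon Delta, which generalise Burrows's classical Delta by applying distance functions to probability distributions, proposes a token-level decomposition for numerical interpretability, and validates the methods on four corpora.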

Comments Published in Digital Scholarship in the Humanities. The version of record is available at https://academic.oup.com/dsh/advance-article-abstract/doi/10.1093/llc/fqag072/8692587 Code available at: https://github.com/DDPronin/Rank-Turbulence-Delta

Journal ref
Digital Scholarship in the Humanities, 2026

Abstract

This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows's classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 639 works by 89 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows's Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus, providing a realistic estimate of their robustness under pronounced temporal and stylistic variation.
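
As an illustration of the two ingredients named in the abstract, the following Python sketch computes classical Burrows's Delta from z-scored relative frequencies and the standard Jensen-Shannon divergence between two word distributions. Both are returned as per-word terms, so each distance can be read as a sum of word contributions. This is a sketch under assumptions: the function names are hypothetical, plain relative frequencies stand in for the paper's re-cast uncentred vectors, and the paper's exact definitions of Rank-Turbulence Delta and Jensen-Shannon Delta are not reproduced here.

```python
import math


def burrows_delta_terms(freqs_a, freqs_b, corpus_freqs):
    """Per-word |z_a - z_b| terms of classical Burrows's Delta.

    freqs_a, freqs_b: dict mapping word -> relative frequency in one text.
    corpus_freqs: dict mapping word -> list of relative frequencies of that
    word across the reference corpus (used for the mean and std).
    """
    terms = {}
    for word, values in corpus_freqs.items():
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        if std == 0.0:
            continue  # a word with constant frequency cannot be z-scored
        z_a = (freqs_a.get(word, 0.0) - mean) / std
        z_b = (freqs_b.get(word, 0.0) - mean) / std
        terms[word] = abs(z_a - z_b)
    return terms


def burrows_delta(freqs_a, freqs_b, corpus_freqs):
    """Mean of the per-word terms (0.0 if no word could be scored)."""
    terms = burrows_delta_terms(freqs_a, freqs_b, corpus_freqs)
    return sum(terms.values()) / len(terms) if terms else 0.0


def js_divergence_terms(p, q):
    """Per-word terms of the base-2 Jensen-Shannon divergence of p and q."""
    terms = {}
    for word in set(p) | set(q):
        p_w, q_w = p.get(word, 0.0), q.get(word, 0.0)
        m_w = 0.5 * (p_w + q_w)  # mixture distribution at this word
        term = 0.0
        if p_w > 0.0:
            term += 0.5 * p_w * math.log2(p_w / m_w)
        if q_w > 0.0:
            term += 0.5 * q_w * math.log2(q_w / m_w)
        terms[word] = term
    return terms
```

Summing the values of `js_divergence_terms(p, q)` gives the divergence itself, which lies between 0 and 1 with base-2 logarithms. Sorting the terms shows which words contribute most to the distance.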

2401.07669 2026-05-27 cs.CV

SRL-CLIP: Efficient CLIP Video Adaptation via Structured Semantic Role Labels

Darshan Singh, Zeeshan Khan, Makarand Tapaswi

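AI summary Proposes SRL-CLIP, which generates rule-based captions from structured Semantic Role Labels (SRLs) and contrastively finetunes on only 23k video-caption pairs to adapt CLIP efficiently for general video understanding; on zero-shot text-to-video retrieval it matches or exceeds models with 4-8x more parameters and up to 6000x more post-pretraining data.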

Comments Accepted to the CV4Smalls Workshop at CVPR 2026

Abstract

Adapting CLIP for videos has gained popularity due to its semantically rich representation. While CLIP is a good starting point, it typically undergoes post-pretraining (contrastive finetuning) on large video narration or caption datasets (e.g. HowTo100M, WebVid2.5M). However, such narrations or captions often lack comprehensive information needed to represent a video holistically. As the learning signal from text is sparse, the visual learning is inefficient and adaptation requires millions of samples to post-pretrain. In this work, we ask: is it possible to efficiently adapt CLIP for general and holistic video understanding? We use videos labeled with structured and dense Semantic Role Labels (SRLs) that capture actions, people or objects, their attributes, adverbs (manner), and location in a structured format representing the entire video in a holistic way. We generate rule-based captions from SRLs and demonstrate that simple contrastive finetuning on a mere 23k video-caption pairs is adequate to learn powerful, transferable representations applicable across a diverse range of video understanding tasks that require varying levels of perceptual granularity. Our adapted CLIP model, SRL-CLIP, exhibits comparable or superior performance on zero-shot text-to-video retrieval compared to state-of-the-art models that possess 4-8x more parameters and are post-pretrained on up to 6000x more data. SRL-CLIP surpasses CLIP on multiple video benchmarks, underscoring the efficient learning and improved representations.

2603.13381 2026-05-27 cs.LG cs.AI

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Marko Karbevski

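AI summary Proposes replacing the query projection W_Q in attention with a nonlinear residual implemented as a bottleneck MLP, and reports performance gains on GPT-3 small style models.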

Comments Accepted at the ICLR 2026 GRaM workshop: https://openreview.net/forum?id=pwdnneFiNZ#discussion

Abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_θ(X)$, where $f_θ$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
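
The residual form $Q(X) = X + f_θ(X)$ can be sketched for a single token vector as below. Several details are assumptions, since the abstract does not give them: the hidden width is taken to be $d/2$ (which yields $2 \cdot d \cdot d/2 = d^2$ weights plus $1.5d$ biases, consistent with the stated $d^2 + O(d)$ count), the activation is tanh, and the output layer is zero-initialized so that the query starts as the identity. All function names are hypothetical.

```python
import math
import random


def init_bottleneck(d, seed=0):
    """Parameters of f(x) = W2 @ tanh(W1 @ x + b1) + b2, hidden width d // 2."""
    rng = random.Random(seed)
    r = d // 2  # assumed bottleneck width
    scale = 1.0 / math.sqrt(d)
    return {
        "W1": [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(r)],
        "b1": [0.0] * r,
        # Assumed zero init: f(x) == 0 at the start, so Q(x) == x.
        "W2": [[0.0] * r for _ in range(d)],
        "b2": [0.0] * d,
    }


def param_count(params):
    """Total number of scalar parameters (matrices and bias vectors)."""
    count = 0
    for value in params.values():
        if value and isinstance(value[0], list):
            count += sum(len(row) for row in value)
        else:
            count += len(value)
    return count


def nonlinear_query(x, params):
    """Q(x) = x + f(x): identity term plus the bottleneck MLP correction."""
    hidden = [
        math.tanh(sum(w * x_i for w, x_i in zip(row, x)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    correction = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["W2"], params["b2"])
    ]
    return [x_i + c for x_i, c in zip(x, correction)]
```

With `d = 8` the sketch has 76 parameters, that is $d^2 + 1.5d$, compared with 64 for a plain $d \times d$ projection.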

2512.05794 2026-05-27 cs.LG cs.AI q-bio.QM

Mechanistic Interpretability of Antibody Language Models Using SAEs

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

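AI summary This study applies TopK and Ordered sparse autoencoders (SAEs) to the mechanistic interpretability of antibody language models, finding that TopK SAEs reveal biologically meaningful latent features but do not guarantee control over generation, whereas Ordered SAEs reliably identify steerable features through a hierarchical structure at the cost of more complex activation patterns.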

Comments v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section

Abstract

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models, and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.

2604.21454 2026-05-27 cs.CL cs.AI

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

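AI summary Compares transformer and hybrid architectures on state-based recall across five controlled task families, finding that reasoning augmentation is the dominant factor while hybrid advantages are narrower and task-dependent.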

Abstract

Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five controlled task families centered on state-based recall, and compare matched transformer and hybrid architectures with and without reasoning augmentation. Across the suite, reasoning-augmented variants substantially outperform instruction-only variants, often by large margins. This pattern is consistent with the State over Tokens view: externalized reasoning traces help because they carry the intermediate state forward in token space. By contrast, hybrid inductive bias does not yield a uniform advantage in accuracy once reasoning tokens are available. When architectural differences do appear, they follow task structure: the hybrid Think model is more robust on strictly sequential chained updates, whereas the transformer Think model is more robust on flat multi-hop retrieval. We therefore cast the main contribution of this study as a descriptive account of what drives performance on state-based recall tasks: reasoning-token augmentation appears to be the dominant factor, while hybrid advantages are narrower, task-dependent, and potentially more about inference efficiency than overall capability. We also release the codebase and data required to reproduce these results.

2604.19673 2026-05-27 cs.CV

InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement

Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll

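AI summary Proposes InHabit, which uses a render-generate-lift pipeline to transfer knowledge from 2D foundation models and automatically generate geometrically consistent human interaction data in 3D scenes, builds the large-scale InHabitants dataset, and improves 3D human-scene reconstruction and contact estimation.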

Abstract

Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics, ignoring rich scene context. In contrast, 2D foundation models trained at internet scale have acquired commonsense knowledge of human-environment interactions. To transfer this knowledge to 3D, we introduce InHabit, an automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces InHabitants, the first large-scale photorealistic 3D human-scene interaction dataset, with 78K samples across about 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and images. Augmenting standard training data with InHabitants improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over prior art.

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

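AI summary Proposes Chat2Workflow, a benchmark for evaluating how well large language models generate executable visual workflows from natural language, together with an agentic baseline that improves performance.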

Comments Work in progress

Abstract

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

2604.18751 2026-05-27 cs.LG cs.AI stat.ME stat.ML

Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

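AI summary To address causal scores in nonlinear time-series models being misread as regression coefficients, proposes a forecast-necessity testing framework based on edge ablation and forecast comparison that assesses whether a causal relationship is actually needed for prediction.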

Abstract

Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
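
The idea of testing an edge by ablating it and comparing forecasts can be sketched as follows. This is a minimal illustration, not the paper's procedure: the ablation used here (replacing the candidate cause by its sample mean) is an assumption, the forecaster `toy_model` is a hypothetical stand-in for a fitted model such as Neural Additive Vector Autoregression, and all names are invented for the example.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between forecasts and observed values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def forecast_necessity(predict, lagged_rows, targets, source):
    """Increase in forecast error after ablating one candidate cause.

    predict: callable taking a dict of lagged inputs, returning a forecast.
    lagged_rows: list of dicts mapping variable name -> lagged value.
    source: name of the variable whose edge into the target is ablated.
    """
    baseline = mean_squared_error([predict(r) for r in lagged_rows], targets)
    # Assumed ablation: replace the source by its sample mean, which removes
    # its time variation while keeping the input on a plausible scale.
    source_mean = sum(r[source] for r in lagged_rows) / len(lagged_rows)
    ablated_rows = [{**r, source: source_mean} for r in lagged_rows]
    ablated = mean_squared_error([predict(r) for r in ablated_rows], targets)
    return ablated - baseline


def toy_model(row):
    """Hypothetical fitted forecaster that only uses the lag of x."""
    return 2.0 * row["x"]


rows = [{"x": 1.0, "y": 5.0}, {"x": 2.0, "y": 1.0}, {"x": 3.0, "y": 3.0}]
observed = [2.0, 4.0, 6.0]
necessity_x = forecast_necessity(toy_model, rows, observed, "x")
necessity_y = forecast_necessity(toy_model, rows, observed, "y")
```

Here ablating `x` raises the error while ablating `y` leaves it unchanged, so only the edge from `x` is needed for the forecast. A fuller version would likely refit the model without the edge, so that redundant predictors can compensate for one another.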

2504.01733 2026-05-27 cs.AI cs.CC cs.LO

Epistemic Skills: Reasoning about Knowledge and Oblivion

Xiaolong Liang, Yì N. Wáng

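AI summary This paper proposes a class of epistemic logics that introduces an "epistemic skills" metric through a system of weighted models, modeling knowledge acquisition as upskilling and oblivion as downskilling; it studies knowability and forgettability and the distinction between de re and de dicto expressions, and analyzes the computational complexity of model checking and satisfiability.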

Journal ref
Logical Methods in Computer Science, Volume 22, Issue 2 (May 25, 2026) lmcs:15460

Abstract

This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an "epistemic skills" metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of "knowability" and "forgettability," defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

2604.18103 2026-05-27 cs.AI

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang

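AI summary To reduce the high cost of prefilling in long-context settings, proposes Delta Attention Selective Halting (DASH), a training-free policy that monitors the update dynamics of self-attention layers and halts processing of stabilized tokens, speeding up prefill while preserving model accuracy and hardware efficiency.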

Comments Accepted to ACL 2026 main conference

Abstract

Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.
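
A halting rule of the kind described can be sketched as below: after each layer, a token whose self-attention update is small relative to its hidden state is treated as stabilized and dropped from further processing. The criterion (relative L2 norm against a fixed threshold), the threshold value, and the function names are all assumptions made for illustration; the abstract does not specify DASH's actual rule.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vector))


def select_active_tokens(hidden, attn_update, active, threshold=0.05):
    """Indices of tokens that should still be processed by the next layer.

    hidden: per-token hidden states entering the layer.
    attn_update: per-token self-attention outputs added to the residual.
    active: indices of tokens that have not been halted yet.
    threshold: assumed cutoff on the relative update size.
    """
    still_active = []
    for i in active:
        relative_update = l2_norm(attn_update[i]) / (l2_norm(hidden[i]) + 1e-12)
        if relative_update >= threshold:
            still_active.append(i)  # the token is still changing, keep it
    return still_active


hidden_states = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
updates = [[0.5, 0.0], [0.0, 0.01], [0.0, 0.0]]
remaining = select_active_tokens(hidden_states, updates, [0, 1, 2])
```

In this toy input only token 0 changes by more than 5% of its norm, so tokens 1 and 2 are halted. How halted tokens continue to serve as keys and values for the remaining tokens is left open here, as the abstract does not describe it.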

2510.06133 2026-05-27 cs.CL cs.AI

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

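AI summary To address redundant iterations caused by repeated remasking of already-correct tokens in parallel decoding of diffusion large language models, proposes CreditDecoding, a training-free parallel decoding method based on Trace Credit that fuses historical evidence with current logits to raise the confidence of correct but underconfident tokens, achieving up to 5.48x speedup with improved accuracy.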

Comments 19 pages, 13 figures, 9 tables, Accepted to ACL 2026 main conference

Abstract

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising traces, we uncover a key inefficiency: models often predict the correct target token several steps before its confidence becomes high enough to be decoded. This gap between early prediction and late decoding forces repeated remasking of already-correct tokens, causing redundant iterations and limiting acceleration. To exploit this temporal redundancy, we introduce Trace Credit to quantify a token's decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. On eight benchmarks, CreditDecoding achieves up to 5.48 times speedup with +0.48 accuracy on LLaDA-8B and consistently improves performance across diverse dLLM architectures and parameter scales. It further scales to long contexts and remains orthogonal to mainstream inference optimizations, making it a practical and widely applicable solution.
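
The mechanism of accumulating evidence across denoising steps and fusing it with the current logits can be sketched for a single masked position. The update rule (exponential decay plus the probability of the current top-1 token) and the additive fusion with weight `alpha` are assumptions chosen for illustration; the abstract does not give the actual Trace Credit formula, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_credit(credit, logits, decay=0.9):
    """Decay past credit and reward the token currently predicted as top-1."""
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    new_credit = [decay * c for c in credit]
    new_credit[top] += probs[top]
    return new_credit


def fused_confidence(logits, credit, alpha=1.0):
    """Confidence of the best token after adding credit to the logits."""
    fused = [x + alpha * c for x, c in zip(logits, credit)]
    return max(softmax(fused))


# A position whose top-1 prediction is stable but not yet confident.
step_logits = [1.0, 0.8, 0.0]
credit = [0.0, 0.0, 0.0]
for _ in range(3):
    credit = update_credit(credit, step_logits)
plain = max(softmax(step_logits))
boosted = fused_confidence(step_logits, credit)
```

After three steps with the same top-1 token, `boosted` exceeds `plain`, so a fixed confidence threshold would be reached earlier and the position would be remasked fewer times.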

2604.14684 2026-05-27 cs.CV

DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

Bo Qian, Dahu Shi, Xing Wei

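AI summary Proposes the DETR-ViP framework, which learns discriminative visual prompts through global prompt integration and visual-textual prompt relation distillation and uses a selective fusion strategy for robust detection, substantially improving visual-prompted detection on multiple datasets.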

Comments Published as a conference paper at ICLR 2026

Abstract

Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

2604.14640 2026-05-27 cs.CL cs.AI

Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Cuong Hoang, Le-Minh Nguyen

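AI summary This paper proposes a large language model framework that combines zero-shot and few-shot prompting with parameter-efficient LoRA fine-tuning for financial misinformation detection without external evidence, reaching 95.4% accuracy on the public test set and 96.3% on the private test set and ranking first in the competition.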

Journal ref
Proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD 2026), 20th International AAAI Conference on Web and Social Media

Abstract

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

2604.13491 2026-05-27 cs.CV

FiRe: Fine-grained Multimodal Reasoning for Enhanced Image Generation

FiRe:用于增强图像生成的细粒度多模态推理

Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim

AI总结 提出FiRe方法,通过细粒度多步推理和强化学习FiRe-GRPO,解决文本到图像生成中缺乏细粒度控制的问题。

详情
AI中文摘要

随着多模态大语言模型(MLLM)的快速发展,联合进行图像理解和生成的统一MLLM取得了显著进展。然而,尽管统一MLLM具有自我反思和自我改进的内在推理能力,它们在文本到图像生成中的应用仍未被充分探索。同时,现有的基于多模态推理的图像生成方法大多依赖于提示增强或整体图像-文本对齐判断,缺乏对详细提示属性的细粒度反思和改进,导致细粒度控制有限。为了解决这一局限性,我们提出了FiRe,一种通过MLLM增强图像生成的细粒度多模态推理方法。具体来说,FiRe执行细粒度多步推理,首先将提示分解为关键视觉需求,然后自我判断它们在生成图像中的满足程度,接着根据自我生成的精确反馈进行局部改进。此外,为了进一步增强MLLM的多模态推理能力,我们引入了FiRe-GRPO,一种针对FiRe量身定制的强化学习方法。由于标准的组相对策略优化(GRPO)在多步推理中面临稀疏的、基于结果的奖励问题,我们将推理过程形式化为一个步骤级别的决策问题,设计步骤特定的奖励,并计算步骤级别的优势以在GRPO内进行细粒度的信用分配。大量实验表明,FiRe持续优于竞争性的文本到图像基线,包括现有的基于推理的方法,在组合文本到图像基准上尤其取得了显著提升。

英文摘要

With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on prompt augmentation or holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. To address this limitation, we propose FiRe, a Fine-grained Multimodal Reasoning method for enhanced image generation by MLLMs. Specifically, FiRe performs fine-grained multi-step reasoning by first decomposing the prompt into key visual requirements and then self-judging their satisfaction in the generated image, followed by localized refinement according to self-generated precise feedback. In addition, to further strengthen the MLLM's multimodal reasoning ability, we introduce FiRe-GRPO, a reinforcement learning method tailored to FiRe. Since standard Group Relative Policy Optimization (GRPO) suffers from sparse, outcome-based rewards in multi-step reasoning, we formulate our reasoning process as a step-level decision-making problem, design step-specific rewards, and compute step-level advantages for granular credit assignment within GRPO. Extensive experiments demonstrate that FiRe consistently outperforms competitive text-to-image baselines, including existing reasoning-based methods, with particularly substantial gains on compositional text-to-image benchmarks.
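The step-level credit assignment described in this abstract can be illustrated with a small sketch. It assumes, hypothetically, that each sampled trajectory in a group receives one scalar reward per reasoning step and that advantages are normalized per step across the group; the function name, the reward values, and the normalization are illustrative assumptions, not the paper's actual FiRe-GRPO formulation.

```python
from statistics import mean, pstdev


def step_level_advantages(step_rewards, eps=1e-8):
    """Group-relative advantages computed separately for each step.

    step_rewards[i][t] is the (assumed) reward of sample i at step t.
    Returns advantages[i][t] = (r - group mean at t) / (group std at t + eps).
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for t in range(num_steps):
        column = [sample[t] for sample in step_rewards]
        mu = mean(column)
        sigma = pstdev(column)  # population std over the group
        for i, r in enumerate(column):
            advantages[i][t] = (r - mu) / (sigma + eps)
    return advantages


# Hypothetical group of 3 trajectories with 3 steps each
# (decompose, judge, refine); the reward values are made up.
group = [
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
adv = step_level_advantages(group)
```

With an outcome-only reward, every step of a trajectory would share one advantage value; here each step is compared only against the same step in the other samples.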

2604.13502 2026-05-27 cs.CL

Using reasoning LLMs to extract SDOH events from clinical notes

使用推理型大语言模型从临床笔记中提取社会健康决定因素事件

Ertan Dogan, Kunyu Yu, Yifan Peng

AI总结 本研究提出一种基于推理型大语言模型的提示工程方法,通过四个模块(简洁提示、少样本学习、自一致性机制和后处理)从临床笔记中提取结构化SDOH事件,取得0.866的微平均F1分数,展示了简单实现与强性能的平衡。

详情
AI中文摘要

社会健康决定因素(SDOH)指影响个人生活、工作和衰老的环境、行为和社会条件。SDOH对个人健康结果有显著影响,其系统识别和管理可大幅改善患者护理。然而,SDOH信息主要记录在电子健康记录的非结构化临床笔记中,限制了其作为机器可读实体的直接使用。为解决此问题,研究人员采用基于预训练BERT模型的自然语言处理(NLP)技术,展示了有前景的性能,但需要复杂的实现和大量计算资源。在本研究中,我们探索了利用具有高级推理能力的大语言模型(LLM)提取结构化SDOH事件的提示工程策略。我们的方法包含四个模块:1)开发结合既定指南的简洁描述性提示,2)应用精心策划示例的少样本学习,3)使用自一致性机制确保稳健输出,4)后处理进行质量控制。我们的方法达到了0.866的微平均F1分数,展示了与领先模型相比具有竞争力的性能。结果表明,具有推理能力的LLM是SDOH事件提取的有效解决方案,兼具实现简单性和强性能。

英文摘要

Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.
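The self-consistency module in the four-step method above can be sketched as a vote over several sampled extractions. This is a minimal illustration that assumes each LLM run returns a list of (event type, status) pairs and that an event is kept when more than half of the runs agree; the data format, the threshold, and the function name are assumptions, not details given in the abstract.

```python
from collections import Counter


def self_consistent_events(sampled_outputs, min_agreement=0.5):
    """Keep events that appear in more than `min_agreement` of the samples."""
    counts = Counter()
    for events in sampled_outputs:
        counts.update(set(events))  # count each event at most once per sample
    threshold = min_agreement * len(sampled_outputs)
    return {event for event, n in counts.items() if n > threshold}


# Hypothetical (event type, status) pairs from three runs on the same note.
samples = [
    [("Tobacco", "current"), ("Alcohol", "none")],
    [("Tobacco", "current"), ("Alcohol", "past")],
    [("Tobacco", "current"), ("Alcohol", "none"), ("Employment", "retired")],
]
kept = self_consistent_events(samples)
```

Events produced by only one run are dropped, which is how voting filters out unstable extractions before post-processing.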

2603.12564 2026-05-27 cs.CL cs.AI

Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

卖给我这支股票:LLM智能体中的不安全推荐漂移

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

AI总结 研究LLM智能体在多轮金融推荐中因工具输出被操纵而产生风险不匹配推荐的问题,通过实验揭示评估盲区并分析机制。

详情
AI中文摘要

人们越来越多地使用LLM智能体进行多轮金融推荐,智能体通过工具获取市场数据并跨轮次跟踪用户偏好。当工具输出被操纵时,推荐不再匹配用户声明的风险偏好,但由于NDCG等标准指标仅衡量一般相关性,风险股票和安全股票的得分相同,因此指标显示一切正常。我们将这种差距称为评估盲区。我们在八个语言模型上回放23轮金融咨询对话,每段对话分别使用干净和被操纵的工具数据运行两次。质量得分与干净会话几乎相同,而智能体在65-99%的轮次中产生风险不匹配的推荐,所有八个模型一致。该机制在逐轮中可见:在1,840轮中,80%的风险评分引用逐字复现了被操纵的值,没有一轮提出质疑,高风险股票的安全语言框架比例从14%(Qwen2.5-7B)到69%(Claude Sonnet 4.6)不等。使前沿模型成为优秀智能体的特性——忠实地将其推理基于工具输出——也使其跟随被操纵的输出。损害并非由记忆驱动:仅污染当前轮次仍会产生95%的违规。模型内部能区分操纵(稀疏自编码器特征将对抗性扰动与随机扰动分开),但这并未转化为更安全的输出。激活层干预仅恢复不到6%的安全差距,提示级自我验证失败,因为自我检查读取了相同的被操纵数据,而参数化交叉检查在前沿模型上每轮以99-100%的比率标记污染,但整体适宜性仍未改变:智能体识别出篡改,但仍然推荐它。

英文摘要

People increasingly use LLM agents for multi-turn financial recommendations, where the agent pulls market data through tools and tracks user preferences across turns. When tool outputs are manipulated, the recommendations stop matching the user's stated risk profile, but because standard metrics like NDCG only score general relevance, risky and safe stocks score alike, so the metric says nothing went wrong. We call this gap evaluation blindness. We replay 23-turn financial advisory conversations across eight language models, running each dialogue twice with clean and manipulated tool data. Quality scores stay nearly identical to clean sessions while the agents produce risk-mismatched recommendations in 65-99% of turns, unanimous across all eight models. The mechanism is visible turn-by-turn: 80% of risk-score citations across 1,840 turns reproduce the manipulated value verbatim, not a single turn pushes back, and safe-language framing of high-risk stocks ranges from 14% (Qwen2.5-7B) to 69% (Claude Sonnet 4.6). The property that makes frontier models good agents, faithfully grounding their reasoning in tool outputs, also makes them follow manipulated ones. The damage is not memory-driven: contaminating only the current turn still produces 95% of the violations. The model internally distinguishes the manipulation (sparse autoencoder features separate adversarial from random perturbations), but this does not translate into safer output. Activation-level interventions recover under 6% of the safety gap, prompt-level self-verification fails because the self-check reads the same manipulated data, and a parametric cross-check that flags contamination at 99-100% per turn on a frontier model still leaves aggregate suitability unchanged: the agent identifies the tampering and recommends it anyway.
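The evaluation blindness described above can be reproduced in a few lines: NDCG depends only on relevance grades, so two ranked lists with identical relevance but very different risk levels receive the same score. The relevance grades, the 1-5 risk scale, and the `risk_mismatch_rate` helper are hypothetical values and names chosen for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def risk_mismatch_rate(risk_levels, user_max_risk):
    """Fraction of recommended items riskier than the user's stated limit."""
    return sum(r > user_max_risk for r in risk_levels) / len(risk_levels)


# Hypothetical ranked lists of (relevance grade, risk level on a 1-5 scale).
clean = [(3, 1), (2, 2), (1, 2)]
manipulated = [(3, 5), (2, 4), (1, 2)]

ndcg_clean = ndcg([rel for rel, _ in clean])
ndcg_manipulated = ndcg([rel for rel, _ in manipulated])
mismatch_clean = risk_mismatch_rate([risk for _, risk in clean], user_max_risk=2)
mismatch_manipulated = risk_mismatch_rate([risk for _, risk in manipulated], user_max_risk=2)
```

Both lists get the same NDCG, while the suitability check flags two of the three manipulated recommendations.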

2604.13018 2026-05-27 cs.CL

Toward Autonomous Long-Horizon Engineering for ML Research

面向机器学习研究的自主长周期工程

Guoxin Chen, Jie Chen, Lei Chen, Jiale Zhao, Fanzhe Meng, Wayne Xin Zhao, Ruihua Song, Cheng Chen, Ji-Rong Wen, Kai Jia

AI总结 提出AiScientist多智能体系统,通过轻量级层级研究团队和File-as-Bus工作空间解决长周期ML研究工程中的累积进度维持问题,在PaperBench和MLE-Bench Lite上取得显著提升。

Comments Repo: https://github.com/AweAI-Team/AiScientist

详情
AI中文摘要

智能体系统日益自动化AI研究的各个环节。然而,将未明确的研究目标转化为可运行、经实验验证的ML系统仍是一个核心瓶颈。我们将这一操作环境研究为“长周期ML研究工程”:通过反复实现、实验和改进,将研究规范转化为可运行的ML系统。核心挑战是在延迟、混杂反馈下,跨异构阶段维持累积的项目进展。我们引入了AiScientist,一个围绕“薄控制厚状态”构建的多智能体系统:轻量级层级研究团队通过File-as-Bus工作空间进行协调,该工作空间跨角色和调用保留决策相关工件。在PaperBench上,AiScientist使用Gemini-3-Flash和GLM-5分别比最强匹配基线提高9.92和11.15分。在MLE-Bench Lite上,它在两个骨干网络下均达到81.82 Any Medal%,比最强匹配基线提高4.55和16.67分,并超过Codex/GPT-5.5 xhigh前沿参考基准13.64 Any Medal分。消融和过程分析表明,持久的项目状态对后期轮次改进至关重要:移除File-as-Bus使PaperBench分数降低6.41分,MLE-Bench Lite Any Medal%降低31.82分。这些结果表明,长周期AI研究不仅是一个更强的局部推理问题,更是一个维持累积、可检查项目进展的系统问题。

英文摘要

Agentic systems increasingly automate pieces of AI research. Yet turning underspecified research objectives into runnable, experimentally validated ML systems remains a central bottleneck. We study this operational setting as long-horizon ML research engineering: converting a research specification into a runnable ML system through repeated implementation, experimentation, and refinement. The central challenge is to sustain cumulative project progress across heterogeneous stages under delayed, confounded feedback. We introduce AiScientist, a multi-agent system built around thin control over thick state: a lightweight hierarchical research team coordinates through a File-as-Bus workspace that preserves decision-relevant artifacts across roles and invocations. On PaperBench, AiScientist improves over the strongest matched baselines by 9.92 and 11.15 points with Gemini-3-Flash and GLM-5, respectively. On MLE-Bench Lite, it reaches 81.82 Any Medal% under both backbones, improving over the strongest matched baselines by 4.55 and 16.67 points, and exceeding a Codex/GPT-5.5 xhigh frontier harness reference by 13.64 Any Medal points. Ablations and process analyses show that durable project state is central to later-round refinement: removing File-as-Bus lowers PaperBench score by 6.41 points and MLE-Bench Lite Any Medal% by 31.82 points. These results suggest that long-horizon AI research is not only a problem of stronger local reasoning, but a systems problem of maintaining cumulative, inspectable project progress.

2604.12918 2026-05-27 cs.CV

Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation

雷达-相机BEV多任务学习:用于联合3D检测与分割的跨任务注意力桥

Ahmet İnanç, Özgür Erkent

AI总结 提出CTAB(跨任务注意力桥)模块,通过共享BEV空间中的多尺度可变形注意力在检测和分割分支间交换特征,实现联合3D检测与分割的多任务学习,在nuScenes上提升分割性能且检测几乎不受影响。

Comments 8 pages, 5 figures, 3 Tables, Accepted at Radar in Robotics: New Frontiers workshop, at IEEE International Conference on Robotics & Automation (ICRA), 2026

详情
AI中文摘要

鸟瞰图(BEV)表示是自动驾驶中3D感知的主流范式,它提供了一个统一的空间画布,检测和分割特征在几何上注册到同一物理坐标系。然而,现有的雷达-相机融合方法孤立地处理这些任务,错过了跨任务特征共享的机会:来自检测的物体级几何线索可以锐化分割,而来自分割的密集道路布局上下文可以锚定检测。我们提出了CTAB(跨任务注意力桥),这是一个双向模块,通过共享BEV空间中的多尺度可变形注意力在检测和分割分支之间交换特征。CTAB集成到一个多任务框架中,该框架包含基于实例归一化的分割解码器和可学习的BEV上采样,以提供更详细的BEV表示。在nuScenes上,CTAB在联合多任务基线的基础上,在7个类别上提升了分割性能,同时检测几乎不受影响。在一个4类子集(可行驶区域、人行横道、人行道、车辆)上,我们的联合多任务模型实现了51.0 mIoU-4,同时提供了有竞争力的3D检测。

英文摘要

Bird's-eye-view (BEV) representations are the dominant paradigm for 3D perception in autonomous driving, providing a unified spatial canvas where detection and segmentation features are geometrically registered to the same physical coordinate system. However, existing radar-camera fusion methods treat these tasks in isolation, missing the opportunity for cross-task feature sharing: object-level geometric cues from detection can sharpen segmentation, while dense road-layout context from segmentation can anchor detection. We propose CTAB (Cross-Task Attention Bridge), a bidirectional module that exchanges features between detection and segmentation branches via multi-scale deformable attention in shared BEV space. CTAB is integrated into a multi-task framework with an Instance Normalization-based segmentation decoder and learnable BEV upsampling to provide a more detailed BEV representation. On nuScenes, CTAB improves segmentation on 7 classes over the joint multi-task baseline at essentially neutral detection. On a 4-class subset (drivable area, pedestrian crossing, walkway, vehicle), our joint multi-task model achieves 51.0 mIoU-4 while simultaneously providing competitive 3D detection.

2604.11467 2026-05-27 cs.AI cs.HC cs.LG

From Attribution to Action: A Human-Centered Application of Activation Steering

从归因到行动:激活导向的人本应用

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

AI总结 提出结合SAE归因与激活导向的交互式工作流,通过专家访谈验证其能促进从检查到干预的转变,并揭示组件抑制等调试策略及潜在风险。

详情
AI中文摘要

可解释人工智能(XAI)方法揭示了哪些特征影响模型预测,但为实践者基于这些解释采取行动提供了有限的手段。通过XAI识别出的组件的激活导向为可操作的解释提供了一条路径,但其实际效用仍未得到充分研究。我们引入了一个交互式工作流,将基于SAE的归因与激活导向相结合,用于视觉模型中概念使用的实例级分析,并实现为一个基于网页的工具。基于此工作流,我们进行了半结构化专家访谈(N=8),在CLIP上执行调试任务,以调查实践者如何推理、信任和应用激活导向。我们发现,导向使得从检查转向基于干预的假设检验(8/8参与者),大多数参与者将信任建立在观察到的模型响应上,而非仅仅解释的合理性(6/8)。参与者采用了系统性的调试策略,其中组件抑制占主导(7/8),并指出了包括涟漪效应和实例级修正的有限泛化在内的风险。总体而言,激活导向使可解释性更具可操作性,同时为安全有效使用提出了重要考虑。

英文摘要

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

2505.23606 2026-05-27 cs.LG cs.CV

Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model

Muddit: 通过统一离散扩散模型解放超越文本到图像的生成

Qingyu Shi, Jinbin Bai, Zhuoran Zhao, Wenhao Chai, Kaidong Yu, Jianzong Wu, Yunhai Tong, Xiangtai Li, Xuelong Li, Shuicheng Yan

AI总结 提出Muddit,一种统一离散扩散Transformer,结合预训练文本到图像骨干的强视觉先验与轻量文本解码器,实现跨文本和图像模态的快速并行生成,在质量和效率上优于大型自回归模型。

Comments Accepted to ICLR 2026. Codes and Supplementary Material: https://github.com/M-E-AGI-Lab/Muddit

详情
AI中文摘要

统一生成模型旨在单一架构和解码范式下处理跨模态的多种任务——如文本生成、图像生成和视觉-语言推理。自回归统一模型因顺序解码导致推理缓慢,而非自回归统一模型因预训练骨干有限导致泛化能力弱。我们引入第二代Meissonic:Muddit,一种统一离散扩散Transformer,能够在文本和图像模态上实现快速并行生成。与先前从头训练的统一扩散模型不同,Muddit将来自预训练文本到图像骨干的强视觉先验与轻量文本解码器集成,从而在统一架构下实现灵活且高质量的多模态生成。实验结果表明,Muddit在质量和效率上均达到或优于显著更大的自回归模型。该工作凸显了纯离散扩散在配备强视觉先验时,作为统一生成的可扩展且有效骨干的潜力。

英文摘要

Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior performance compared to significantly larger autoregressive models in both quality and efficiency. The work highlights the potential of purely discrete diffusion, when equipped with strong visual priors, as a scalable and effective backbone for unified generation.

2604.11056 2026-05-27 cs.LG cs.AI

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

事后信用可驻留之处:RLVR中令牌更新的有符号容量视角

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

AI总结 本文通过条件互信息分析RLVR中令牌级信用的容量上限,提出四象限分解区分更新方向,并设计HAPO算法进行容量引导的优势重分配,提升数学推理性能。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)提升了大语言模型(LLMs)的推理能力,但稀疏的结果奖励使得令牌级信用分配变得困难。我们将令牌级信用视为从行为策略到事后后验的奖励条件偏移。在自回归RLVR中,这种偏移可以通过条件互信息(CMI)表示,这表明令牌熵限制了可能的事后信用上限。然而,熵指示的是容量而非更新方向,因此我们引入了四象限分解,根据奖励极性和令牌熵来分离更新。受控干预表明,这两个因素共同塑造了令牌更新。持续的推理增益集中在有符号的高熵象限,而低熵更新则迅速饱和。基于此分析,我们提出了事后感知策略优化(HAPO),这是对GRPO的一种符号保持修改,执行容量引导的优势重分配。在两个模型设置的数学推理基准上的实验表明,HAPO在熵感知基线中取得了有竞争力的性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.
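The Four Quadrant Decomposition separates token updates by reward polarity and token entropy. A minimal sketch of that bucketing follows, assuming a per-token advantage whose sign stands in for reward polarity and a fixed entropy threshold; the threshold value, the label strings, and the function names are hypothetical, and the capacity-guided advantage reallocation that HAPO performs is not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def quadrant(advantage, entropy, entropy_threshold):
    """Assign a token update to one of four (polarity, entropy) quadrants."""
    polarity = "positive" if advantage >= 0 else "negative"
    capacity = "high-entropy" if entropy >= entropy_threshold else "low-entropy"
    return f"{polarity}/{capacity}"


# Hypothetical next-token distributions over a 4-token vocabulary.
uncertain = [0.25, 0.25, 0.25, 0.25]  # entropy = ln(4)
confident = [0.97, 0.01, 0.01, 0.01]  # entropy well below the threshold

THRESHOLD = 0.5  # assumed cut-off, not taken from the paper
q_uncertain = quadrant(+1.0, token_entropy(uncertain), THRESHOLD)
q_confident = quadrant(-1.0, token_entropy(confident), THRESHOLD)
```

Because entropy indicates capacity but not direction, the sign of the advantage is needed as the second axis.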

2604.10102 2026-05-27 cs.CV cs.AI

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

退化一致性配对训练用于鲁棒的AI生成图像检测

Zongyou Yang, Yinghan Hou, Xiaokun Yang

AI总结 提出退化一致性配对训练(DCPT),通过特征一致性和预测一致性约束显式增强模型对JPEG压缩、高斯模糊等真实世界图像退化的鲁棒性,在Synthbuster基准上平均准确率提升9.1个百分点。

Comments 6 pages, 5 figures, 2 tables

详情
AI中文摘要

AI生成图像检测器在真实世界图像退化(如JPEG压缩、高斯模糊和分辨率降采样)下性能显著下降。我们观察到,包括B-Free在内的最先进方法将退化鲁棒性视为数据增强的副产品,而非明确的训练目标。在这项工作中,我们提出退化一致性配对训练(DCPT),这是一种简单而有效的训练策略,通过配对一致性约束显式增强鲁棒性。对于每张训练图像,我们构建一个干净视图和一个退化视图,然后施加两个约束:特征一致性损失,最小化干净表示和退化表示之间的余弦距离;以及基于对称KL散度的预测一致性损失,对齐两个视图的输出分布。DCPT不增加额外参数和推理开销。在Synthbuster基准(9个生成器,8种退化条件)上的实验表明,与没有配对训练的相同基线相比,DCPT将退化条件下的平均准确率提高了9.1个百分点,同时仅牺牲了0.9%的干净准确率。在JPEG压缩下改进最为显著(+15.7%至+17.9%)。消融实验进一步揭示,添加架构组件会导致在有限训练数据上过拟合,证实了对于退化鲁棒性,训练目标改进比架构增强更有效。

英文摘要

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.
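The two paired constraints named in this abstract can be written down directly. The sketch below assumes plain Python lists for the feature vectors and class probabilities, averages the two KL directions with a factor of 0.5, and sums the two terms with weights `w_feat` and `w_pred`; the averaging factor, the weights, and all function names are assumptions rather than the paper's exact formulation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, smoothed by eps."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def symmetric_kl(p, q):
    """Average of KL(p || q) and KL(q || p) (the 0.5 factor is assumed)."""
    return 0.5 * (kl(p, q) + kl(q, p))


def consistency_loss(feat_clean, feat_degraded, prob_clean, prob_degraded,
                     w_feat=1.0, w_pred=1.0):
    """Weighted sum of feature and prediction consistency terms."""
    return (w_feat * cosine_distance(feat_clean, feat_degraded)
            + w_pred * symmetric_kl(prob_clean, prob_degraded))


# Hypothetical features and (real, fake) probabilities for one image pair.
loss_same = consistency_loss([1.0, 2.0], [1.0, 2.0], [0.9, 0.1], [0.9, 0.1])
loss_diff = consistency_loss([1.0, 2.0], [2.0, 1.0], [0.9, 0.1], [0.6, 0.4])
```

Both terms are computed only during training, which is consistent with the abstract's statement that DCPT adds no parameters and no inference overhead.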

2604.10095 2026-05-27 cs.CV

Mining Attribute Subspaces for Efficient Fine-tuning of 3D Foundation Models

挖掘属性子空间以实现3D基础模型的高效微调

Yu Jiang, Hanwen Jiang, Ahmed Abdelkader, Wen-Sheng Chu, Brandon Y. Feng, Zhangyang Wang, Qixing Huang

AI总结 本文通过生成合成数据并提取与纹理、几何、相机运动和光照变化相关的LoRA子空间,发现这些子空间近似解耦,集成后形成降维子空间,从而提高下游任务微调的效率和预测精度。

Comments 10 pages, 8 figures. Code here: https://github.com/jpppppppppppppppppppppppp/Subspaces-Mining-for-VGGT

详情
AI中文摘要

随着3D基础模型的出现,人们越来越关注将其微调用于下游任务,其中LoRA是主要的微调范式。由于3D数据集在纹理、几何、相机运动和光照方面表现出明显的差异,因此存在有趣的基本问题:1) 是否存在与每种变化类型相关的LoRA子空间?2) 这些子空间是否解耦(即彼此正交)?3) 如何有效地计算它们?本文为所有这些问题提供了答案。我们引入了一种鲁棒的方法,生成具有受控变化的合成数据集,在每个数据集上微调LoRA适配器,并提取与每种变化类型相关的LoRA子空间。我们表明这些子空间近似解耦。将它们集成可以得到一个降维的LoRA子空间,从而能够实现高效的LoRA微调,并提高下游任务的预测精度。特别是,我们表明这样的降维LoRA子空间尽管完全来自合成数据,但可以泛化到真实数据集。消融研究验证了我们方法中各种选择的有效性。

英文摘要

With the emergence of 3D foundation models, there is growing interest in fine-tuning them for downstream tasks, where LoRA is the dominant fine-tuning paradigm. As 3D datasets exhibit distinct variations in texture, geometry, camera motion, and lighting, there are interesting fundamental questions: 1) Are there LoRA subspaces associated with each type of variation? 2) Are these subspaces disentangled (i.e., orthogonal to each other)? 3) How do we compute them effectively? This paper provides answers to all these questions. We introduce a robust approach that generates synthetic datasets with controlled variations, fine-tunes a LoRA adapter on each dataset, and extracts a LoRA subspace associated with each type of variation. We show that these subspaces are approximately disentangled. Integrating them leads to a reduced LoRA subspace that enables efficient LoRA fine-tuning with improved prediction accuracy for downstream tasks. In particular, we show that such a reduced LoRA subspace, despite being derived entirely from synthetic data, generalizes to real datasets. An ablation study validates the effectiveness of the choices in our approach.

2509.21882 2026-05-27 cs.LG cs.AI

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

立场:具有可验证奖励的强化学习的隐藏成本与测量缺口

Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Yinxi Li, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi

AI总结 本文指出,具有可验证奖励的强化学习(RLVR)在提升大语言模型性能时,常因预算不匹配、尝试膨胀和基准数据污染等混淆因素导致收益被高估,并提出了预算匹配饱和曲线、校准跟踪、评判者鲁棒性测试和污染筛查等最低标准。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)是一种实用、可扩展的方法,用于在数学、代码和其他结构化任务上改进大语言模型。然而,我们认为许多头条RLVR收益尚未得到充分验证,因为报告常常将策略改进与三个混淆因素混为一谈:(i) RLVR与基线评估之间的预算不匹配,(ii) 尝试膨胀和校准漂移,将弃权转化为自信答案,以及(iii) 基准数据污染。通过预算匹配的复现和部分提示污染探测,我们发现一旦预算、提示和数据集版本匹配,并且将受污染集视为记忆探测而非推理证据,几个被广泛引用的差距会大幅缩小或消失。这并不意味着RLVR无效,而是表明当前的测量常常夸大能力收益并掩盖可靠性成本。因此,我们为RLVR训练和评估提出了一个紧凑的、考虑成本的最低标准:带有方差、校准和弃权跟踪的预算匹配饱和曲线,当使用LLM评判者时的评判者鲁棒性压力测试,以及明确的污染筛查。有了这些控制,RLVR在可验证领域仍然有效且可部署,但如果没有这些控制,推理收益应被视为暂定的。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet well validated because reports often conflate policy improvement with three confounds: (i) budget mismatch between RLVR and baseline evaluations, (ii) attempt inflation and calibration drift that convert abstentions into confident answers, and (iii) benchmark data contamination. Using budget-matched reproductions and partial-prompt contamination probes, we find that several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective, but it implies that current measurements often overstate capability gains and obscure reliability costs. We therefore propose a compact, tax-aware minimum standard for RLVR training and evaluation: budget-matched saturation curves with variance, calibration, and abstention tracking, a judge-robustness stress test when LLM judges are used, and an explicit contamination screen. With these controls, RLVR remains effective and deployable in verifiable domains, but reasoning gains should be treated as provisional without them.

2604.08999 2026-05-27 cs.CL cs.AI cs.LG

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA: 面向复杂表格问答的自适应语义树推理架构

Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang

AI总结 提出ASTRA架构,通过AdaSTR将表格重构为逻辑语义树,并利用DuTR双模式推理框架结合树搜索文本导航与符号代码执行,在复杂表格问答中达到最优性能。

Comments ACL 2026 Main

详情
AI中文摘要

表格序列化仍然是大型语言模型(LLMs)在复杂表格问答中的关键瓶颈,受到结构忽视、表示差距和推理不透明等挑战的阻碍。现有的序列化方法无法捕获显式层次结构且缺乏模式灵活性,而当前的基于树的方法则存在语义适应性有限的问题。为了解决这些限制,我们提出了ASTRA(自适应语义树推理架构),包括两个主要模块:AdaSTR和DuTR。首先,我们引入AdaSTR,它利用LLMs的全局语义意识将表格重构为逻辑语义树。这种序列化显式建模了层次依赖关系,并采用自适应机制根据表格规模优化构建策略。其次,基于此结构,我们提出了DuTR,一种双模式推理框架,集成了基于树搜索的文本导航以实现语言对齐,以及符号代码执行以实现精确验证。在复杂表格基准上的实验表明,我们的方法达到了最先进的性能。

英文摘要

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

2604.08819 2026-05-27 cs.CV cs.AI cs.LG cs.MM

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

SenBen: 用于可解释内容审核的敏感场景图

Fatih Cagatay Akyon, Alptekin Temizel

AI总结 提出SenBen基准和紧凑学生模型,通过多任务训练和词汇平衡策略实现敏感内容的空间定位与可解释性,在场景图生成上超越多数VLM。

Comments Accepted at CVPRW 2026

详情
AI中文摘要

内容审核系统将图像分类为安全或不安全,但缺乏空间定位和可解释性:它们无法解释检测到了什么敏感行为、涉及谁或发生在哪里。我们引入了敏感基准(SenBen),这是第一个用于敏感内容的大规模场景图基准,包含来自157部电影的13,999帧,标注了Visual Genome风格的场景图(25个对象类别、28个属性,包括情感状态如疼痛、恐惧、攻击和痛苦,14个谓词)以及跨5个类别的16个敏感标签。我们通过多任务配方将前沿VLM蒸馏成一个紧凑的241M学生模型,该配方通过基于后缀的对象身份、词汇感知召回(VAR)损失和解耦的Query2Label标签头(带非对称损失)解决自回归场景图生成中的词汇不平衡问题,在SenBen召回率上比标准交叉熵训练提高了+6.4个百分点。在基于场景图的指标上,我们的学生模型优于除Gemini模型外的所有评估VLM和所有商业安全API,同时在所有模型中实现了最高的对象检测和字幕生成分数,推理速度提升7.6倍,GPU内存减少16倍。

英文摘要

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6× faster inference and 16× less GPU memory.

2603.11394 2026-05-27 cs.CL cs.AI cs.LG

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

别听我的!多轮对话如何降低LLM的可靠性

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

AI总结 提出“坚持或切换”(SoS)框架,通过将问答空间分割为多个顺序呈现来评估LLM在多轮对话中的可靠性,发现对话税导致准确性和拒绝错误建议的能力平均下降30%,并观察到盲目切换现象。

详情
AI中文摘要

大型语言模型(LLM)在静态基准测试中表现出色,但它们在更能反映实际使用的多轮对话中的性能仍未得到充分研究。解决这一差距在医疗保健等高风险环境中至关重要,因为患者和临床医生正在转向LLM聊天机器人来处理他们的医疗咨询。在这里,我们引入了“坚持或切换”(SoS)框架,该框架将问答空间划分为多个顺序呈现,以模拟两种以安全为中心的行为:坚持(即坚持正确的答案选择或拒绝错误的建议)和灵活性(即在引入正确建议时切换到该建议)。在三个临床基准测试中评估了17个LLM,我们观察到普遍存在的对话税,其中将答案空间分割为顺序呈现使端到端准确性和对错误建议的拒绝率平均下降高达30%,在某些模型中达到65%。我们还观察到盲目切换,即模型从初始拒绝转向错误和正确建议的比率几乎相同,达到50%。最后,我们表明,增加模型规模可以缓解其中一些对话效率低下的问题,但会加剧其他问题,例如从初始拒绝中采纳错误建议的倾向更高。我们的研究结果共同表明,静态基准测试所捕获的一般能力并不能推广到多轮对话中。

英文摘要

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together, our findings demonstrate that the general proficiency captured by static benchmarks does not translate to multi-turn dialogues.

2512.21602 2026-05-27 cs.LG cs.CV

An Empirical Study of Machine Learning Robustness and Scalability for Imbalanced Tabular Clinical Data in Emergency and Critical Care

机器学习在急诊和重症监护中不平衡表格临床数据的鲁棒性与可扩展性实证研究

Yusuf Brima, Marcellin Atemkeng

AI总结 本研究在MIMIC-IV-ED和eICU数据集上评估六类模型在不平衡临床表格数据上的性能,发现树模型在可扩展性上最优,而表格基础模型在性能与效率间提供新的权衡。

详情
AI中文摘要

每年,数百万患者通过急诊科和重症监护室,临床医生必须在时间压力和不确定性下做出高风险决策。机器学习可以支持恶化预测、分诊和罕见关键结局的预测,但临床数据通常严重不平衡,使模型偏向多数类并降低预测性能。因此,为不平衡的临床表格数据开发鲁棒且高效的模型仍然是一个重要挑战。 我们在MIMIC-IV-ED和eICU数据库的不平衡表格数据上评估了六类模型:决策树、随机森林、XGBoost、TabNet、TabICL和TabPFN v2.6。可训练模型通过贝叶斯超参数调优进行优化,而基础模型在其预训练推理模式下进行评估,无需任务特定的重新加权。模型使用Macro F1分数、对递增不平衡的鲁棒性以及跨七个临床预测任务的计算可扩展性进行评估。 结果在不同数据集上有所不同。在MIMIC-IV-ED上,TabPFN v2.6和TabICL获得了最强的平均Macro F1排名,XGBoost保持竞争力。在eICU上,XGBoost始终表现最佳,其次是其他基于树的方法,而基础模型达到中等性能。在两个数据集中,TabNet在递增不平衡下显示出最大的性能下降和最高的计算成本。训练时间分析表明,基于树的方法随数据集大小扩展最有利,而基础模型提供了较低的每任务适应成本。 这些发现表明,没有单一模型族在所有临床环境中占主导地位。然而,表格基础模型正在缩小与强经典基线的性能差距,同时提供独特的效率-性能权衡,这可能有利于资源受限的临床环境。

英文摘要

Every year, millions of patients pass through emergency departments and intensive care units, where clinicians must make high-stakes decisions under time pressure and uncertainty. Machine learning could support prediction of deterioration, triage, and rare critical outcomes, but clinical data are often severely imbalanced, biasing models toward majority classes and reducing predictive performance. Developing robust and efficient models for imbalanced clinical tabular data therefore remains an important challenge. We evaluated six model families on imbalanced tabular data from the MIMIC-IV-ED and eICU databases: Decision Tree, Random Forest, XGBoost, TabNet, TabICL, and TabPFN v2.6. Trainable models were optimized using Bayesian hyperparameter tuning, while foundation models were evaluated in their pretrained inference regime without task-specific reweighting. Models were assessed using Macro F1-score, robustness to increasing imbalance, and computational scalability across seven clinical prediction tasks. Results differed across datasets. On MIMIC-IV-ED, TabPFN v2.6 and TabICL achieved the strongest average Macro F1 ranks, with XGBoost remaining competitive. On eICU, XGBoost consistently performed best, followed by other tree-based methods, while foundation models achieved intermediate performance. Across both datasets, TabNet showed the largest degradation under increasing imbalance and the highest computational cost. Training-time analysis showed that tree-based methods scaled most favorably with dataset size, while foundation models offered low per-task adaptation cost. These findings suggest that no single model family dominates across all clinical settings. However, tabular foundation models are narrowing the performance gap with strong classical baselines while offering a distinct efficiency-performance trade-off that may benefit resource-constrained clinical environments.
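The choice of Macro F1 for imbalanced clinical data can be illustrated with a toy example: a classifier that always predicts the majority class looks strong on accuracy but weak on Macro F1. The 95/5 class split below is a made-up illustration, not data from MIMIC-IV-ED or eICU, and an undefined per-class F1 is set to 0 by assumption.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class; defined as 0.0 when the class has no TP, FP, or FN."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced task: 95 negatives, 5 positives (rare outcome).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

The rare class contributes an F1 of zero, which pulls the macro average to roughly half of the accuracy figure and exposes the failure on the clinically important outcome.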