arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21661 2026-05-22 cs.LG cs.AI cs.CV

Hierarchical Variational Policies for Reward-Guided Diffusion

Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

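AI summary: This paper proposes a hierarchical variational framework that amortizes control into a lightweight yet expressive stochastic policy, generating high-quality reward-aligned samples at reduced inference cost; on 4x super-resolution it achieves better perceptual quality with more than 5x faster inference than the best-performing baseline.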

English abstract

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

2605.21654 2026-05-22 cs.LG cs.AI cs.CL

Value-Gradient Hypothesis of RL for LLMs

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

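AI summary: This paper offers a value-gradient perspective on why critic-free reinforcement learning methods are effective for LLM post-training. By analyzing the actor update and autodifferentiation through attention, it motivates a decomposition of RL impact into value-gradient signal and reachable reward headroom.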

English abstract

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

2605.21653 2026-05-22 cs.LG cs.AI cs.CL

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

Alexander Smirnov

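AI summary: This study argues that fine-tuned AI text detectors amplify a pretrained direction rather than learning an AI-vs-human boundary. It finds that fine-tuning can reduce discrimination in some cases, that the same axis inverts on non-native ESL writing, and that a closed-form Jacobian predictor is effective across architectures.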

English abstract

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.
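The raw-encoder result above rests on a simple recipe: take the difference of two class centroids as a direction, project held-out embeddings onto it, and score the separation with AUROC. A minimal sketch of that recipe follows, using tiny hand-made 2-D vectors in place of real encoder embeddings; the helper names (`centroid`, `project`, `auroc`) and the toy data are illustrative assumptions, not the author's code.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def project(vectors, direction):
    """Dot product of each vector with the direction."""
    return [sum(x * d for x, d in zip(v, direction)) for v in vectors]


def auroc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy stand-ins for embeddings (assumption: real inputs would be encoder outputs).
ai_ref = [[1.0, 0.0], [2.0, 0.0], [1.5, 1.0]]
human_ref = [[-1.0, 0.0], [-2.0, 1.0], [-1.5, -1.0]]

# Direction = centroid(AI) - centroid(human), fit on the reference sets only.
direction = [a - h for a, h in zip(centroid(ai_ref), centroid(human_ref))]

# Held-out texts from each class, scored by projection onto the direction.
ai_test = [[1.2, 0.3], [0.8, -0.5]]
human_test = [[-0.9, 0.2], [-1.4, 0.0]]
score = auroc(project(ai_test, direction), project(human_test, direction))
```

An AUROC below 0.5 on some population, as the abstract reports for ESL writing, would mean the same projection ranks that population in the opposite order.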

2605.21649 2026-05-22 cs.LG cs.CL

EntmaxKV: Support-Aware Decoding for Entmax Attention

Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

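AI summary: This paper proposes EntmaxKV, a support-aware decoding framework that exploits the sparsity of entmax attention before KV pages are loaded. It combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention to reduce dropped probability mass and improve the efficiency of long-context language models.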

English abstract

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $α$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $δ$, showing that output error is controlled by $δ$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
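The support-recovery claim can be checked on a toy example. The sketch below uses sparsemax, the α = 2 case of α-entmax, because it has a simple closed-form threshold; the scores and the helper names (`sparsemax`, `sparse_decode_weights`) are illustrative assumptions and do not reflect the EntmaxKV kernels or page selector.

```python
def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): projection onto the simplex."""
    ordered = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(ordered, start=1):
        cumsum += zk
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k  # threshold for a support of size k
        else:
            break
    # Entries at or below the threshold get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]


def sparse_decode_weights(scores, candidates):
    """Attention weights when only the candidate indices are loaded."""
    sub = sparsemax([scores[i] for i in candidates])
    weights = [0.0] * len(scores)
    for i, p in zip(candidates, sub):
        weights[i] = p
    return weights


scores = [2.0, 1.5, 0.1, -1.0, 1.8]
full = sparsemax(scores)
support = [i for i, p in enumerate(full) if p > 0.0]

# Candidates that contain the support reproduce the full result.
exact = sparse_decode_weights(scores, [0, 1, 2, 4])
# Dropping a support index (here index 1) changes the weights; its mass is delta.
lossy = sparse_decode_weights(scores, [0, 2, 4])
delta = full[1]
```

With softmax in place of sparsemax, every index would carry nonzero weight, so no strict subset of candidates could reproduce the full result exactly.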

2605.21646 2026-05-22 cs.LG

Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

Jacek Karolczak, Jerzy Stefanowski

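AI summary: This paper proposes a feature-informed approach to local and global prototype explanations that integrates feature importance to make explanations more granular; experiments show it promotes feature diversity while maintaining the surrogate model's prediction fidelity.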

Comments Accepted for publication in International Journal of Applied Mathematics and Computer Science (IJAMCS)

English abstract

Prototype-based explanations offer an intuitive, example-based approach to support the interpretability of machine learning black box classifiers but often lack feature-level granularity. We introduce a framework that integrates feature importance at two levels to address this gap. First, for local explanations, we propose "alike parts": a method that uses feature importance scores to highlight the most relevant, shared feature subsets between a classified instance and its nearest prototype, guiding user attention. Second, we augment the global prototype selection objective function with a feature importance term to actively promote diversity in the feature attributions of the selected prototypes. Experiments on six benchmark datasets show that this augmented selection process maintains or, in some cases, increases the prediction fidelity of the surrogate model, suggesting that feature diversity does not compromise model fidelity.

2605.21645 2026-05-22 cs.AI cs.DB

AOP-Wiki EMOD 3.0: Data Model Expansions and Content Evaluation Framework for Using Agentic AI to Improve Integration between AOPs and New Approach Methodologies (NAMs)

Virginia K. Hench, J. Harry Caufield, Sierra A. T. Moxon, Jason M. O'Brien, Stephen W. Edwards

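AI summary: This paper presents AOP-Wiki EMOD 3.0, which uses data model expansions and a content evaluation framework to improve integration between AOPs and new approach methodologies (NAMs) with agentic AI, in support of regulatory science and biomedical applications.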

Comments 7 Figures and 3 Supplemental Figures

English abstract

Adverse Outcome Pathways (AOPs) are logic models that causally link biological mechanisms that can be measured in a lab to adverse outcomes relevant to chemical regulatory endpoints. AOPs contextualize new approach methodologies (NAMs), in vitro and in silico methods used as alternatives to animal testing, and the sequential events in an AOP serve as multi-scale models spanning biological scales. The AOP-Wiki serves as the global repository for AOPs. While the AOP-Wiki has played a central role in AOP expansion over the past decade, constraints within the current data model and application infrastructure limit the AOP-Wiki from supporting continued AOP growth and evolution. Yet, the transformative power of agentic AI has re-invigorated AOP-Wiki data modernization efforts at a time when core AOP principles can be harnessed to inform use of AI for aggregating and structuring AOP-relevant information. Seizing upon this momentum, we present AOP-Wiki EMOD 3.0, the third in a series of evidence model prototypes, which concretely demonstrates data model expansions and our vision for how the AOP-Wiki might be transformed to better serve regulatory science and emergent use of AOPs in biomedical and One Health contexts. We aim to lay a foundation to support computationally-generated AOPs and quantitative AOPs (qAOPs) by focussing on solutions for AOP-Wiki internal quality improvement, evidence structuring to enhance AOP FAIRness and AI-readiness, and improved integration between the AOP framework and NAMs to better serve next generation risk assessment.

2605.21642 2026-05-22 cs.CV

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?

Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna

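AI summary: This paper proposes a diagnostic principle, Ablate-to-Validate, and instantiates it as the Token Replacement Test (TRT) to check whether vision-language models genuinely use the content of continuous tokens; it finds that performance gains may stem from the mere presence of the tokens rather than their content.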

English abstract

Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Despite improved task accuracy, this alone does not show that models actually use these tokens for reasoning -- gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth reasoning with LLaVA-13B and Qwen2.5-VL-3B, training models to predict and consume continuous or discrete depth spans across multiple frozen encoders (SigLIP2, CLIP, DINOv2) and token budgets. We additionally apply TRT to three off-the-shelf visual-thinking systems (Mirage, Mull-Tokens, CoVT) on BLINK, VSP, and CV-Bench. Across all settings, accuracy gains are a misleading proxy for latent-token reasoning: VLMs retain most improvement even when token content is corrupted or replaced, revealing a persistent gap between having a latent channel and using it as an information bottleneck. We recommend TRT as a standard diagnostic alongside accuracy for any method introducing continuous thought tokens.
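The replacement conditions in TRT can be sketched as a function that swaps the intermediate latent tokens while keeping their count and dimensionality fixed. This is a hypothetical illustration: the function name `replace_tokens`, the Gaussian choice for the random condition, and the reading of first-repeat as "copy the first token into every slot" are assumptions, not details taken from the paper.

```python
import random


def replace_tokens(tokens, mode, oracle=None, seed=0):
    """Return a same-shape replacement for a list of latent token vectors."""
    n, d = len(tokens), len(tokens[0])
    if mode == "zero":
        return [[0.0] * d for _ in range(n)]
    if mode == "random":
        rng = random.Random(seed)  # seeded so the ablation is repeatable
        return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    if mode == "first-repeat":
        # Assumed meaning: every slot holds a copy of the first token.
        return [list(tokens[0]) for _ in range(n)]
    if mode == "oracle":
        if oracle is None or len(oracle) != n:
            raise ValueError("oracle tokens must match the token budget")
        return [list(v) for v in oracle]
    raise ValueError(f"unknown mode: {mode}")


# Three placeholder latent tokens of dimension 3.
latents = [[0.2, -1.0, 0.5], [1.1, 0.3, -0.7], [0.0, 0.9, 0.4]]
variants = {m: replace_tokens(latents, m) for m in ("zero", "random", "first-repeat")}
```

Each variant would then be fed to the model with the prompt, image, and decoding settings unchanged; if accuracy barely moves across variants, the gain likely comes from token presence rather than token content.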

2605.21630 2026-05-22 cs.AI

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma

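AI summary: This paper proposes MindLoom, a framework that synthesizes frontier-level reasoning data through compositional thought mode engineering, addressing the limited difficulty control and diversity of existing methods; experiments show favorable results on multiple benchmarks.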

Comments Work in Progress. 27 pages, 4 figures, preprint

English abstract

Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier-level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem's construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution-aligned sampling to encourage diverse reasoning coverage. Finally, a rollout-based judging stage labels generated questions by difficulty and supplies judged-correct responses for supervised fine-tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine-tuned on MindLoom-generated data achieve favorable performance over base models, distillation, and external-data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open-sourced our implementation at https://github.com/EachSheep/MindLoom.

2605.21625 2026-05-22 cs.CV cs.AI cs.CL

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

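AI summary: This paper introduces Flat-Pack Bench, a benchmark for evaluating the spatio-temporal understanding of large vision-language models in complex video scenarios, and finds that current models fall well short on fine-grained spatio-temporal reasoning.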

Comments CVPR 2026

English abstract

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. But many applications, such as furniture assembly and cooking, require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, in tracking, and in understanding spatial interactions like physical contact.

2605.21623 2026-05-22 cs.AI

The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

Itamar Trainin, Renana Keydar, Amit Pinchevski

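AI summary: Through a large-scale computational analysis of more than 1,600 oral history testimonies, this paper examines the distinction between two styles of oral testimony in Holocaust studies and proposes a scalable framework for comparative corpus analysis.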

English abstract

Researchers in Holocaust studies have often distinguished between two styles of oral survivor testimony: the USC Shoah Foundation's interviews tend to follow a structured, interviewer-guided format, whereas the Yale Fortunoff Video Archive generally favors a more free-form, open-ended style. This distinction has influenced both scholarly research and the development of later archives. In this study, we critically examine that claim by conducting a large-scale computational analysis of more than 1,600 testimonies from both collections. Leveraging discourse segmentation, topic modeling, and large language model (LLM) based analysis, we quantify the "structuredness" level of testimonies through topic coherence, interviewer-survivor dynamics, and the distribution of question types. Our results generally corroborate the structural differences identified in earlier research, while also revealing significant overlaps between the collections, both within individual interviews and across common narrative patterns. This complicates the simple "structured vs. free-form" dichotomy often applied to these oral histories. Beyond revisiting a foundational claim in Holocaust studies, our work provides a scalable, replicable framework for comparative corpus analysis. As a proof of concept, it suggests broader applications for digital oral history, narrative analysis, and the design of citizen-science annotation platforms.

2605.21622 2026-05-22 cs.AI

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

Isabella A. Stewart, Hongrui Chen, Faez Ahmed

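AI summary: This paper proposes TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization, sparing designers from manually translating preferences into solver settings that are not directly tied to them; it is evaluated on two long-horizon design tasks.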

Comments Accepted for publication in the Proceedings of the ASME 2026 International Design Engineering Technical Conferences (IDETC2026)

English abstract

Topology optimization can generate efficient structures, but designers often must manually translate qualitative intent, such as desired visual style, product experience, or manufacturability, into solver settings that are not directly tied to those preferences. We present TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization. The framework converts a human-provided problem description into validated solver inputs, runs a topology optimization solver, renders the resulting 3D topology, and uses multi-view vision-language reasoning with an independent judge agent to critique each result and revise solver parameters. We evaluate the framework on two long-horizon design tasks: a cantilever beam benchmark and a phone-stand product design. In both tasks, the designer specifies an aesthetic preference for hierarchically branched structures inspired by natural tree morphologies, and the system performs four revision cycles across ten independent replicates. TO-Agents produces at least one preference-aligned design in 60% of trials for each case study, corresponding to up to 6x more successful trials than an ablated pipeline without visual or historical feedback. Judge scores and human evaluations show that the pipeline can identify effective parameter levers, recover from poor revisions, and expand design exploration. A manufacturing agent further post-processes top-ranked designs for additive manufacturing, enabling end-to-end intent-to-prototype design. We also identify failure modes, including overshooting, selective memory, misplaced tools, and incorrect parameter reasoning. These results suggest that agentic topology optimization can shift designers from low-level parameter tuning toward higher-level specification of form and function, while highlighting safeguards needed for reliable autonomous engineering design.

2605.21611 2026-05-22 cs.CV cs.LG

UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation

Jiayun Wang, Yu Wang, Weijie Gan, Zhenting Wang, Wei Wei

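AI summary: This paper proposes a unified vision-language embedding that binds semantics to spatial locations directly from a single visual input, reducing computation while improving image generation quality.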

English abstract

We introduce spatially grounded contextual image generation, a controllable image generation task that reframes the conditioning paradigm. Instead of supplying a reference image and a global text prompt through two separate encoders, one for vision and one for language, UniVL is trained to bind semantics to spatial locations directly from a single unified visual input, where the textual instruction is rendered onto the spatial mask. This removes the need for a standalone text encoder at inference time. The resulting model supports contextual image generation by following user-specified instructions about what should appear where, while substantially reducing computation. To address this task, we propose a framework in which the UniVL encoder, adapted from an optical-character-recognition-pretrained backbone, reads the unified condition optically and produces a UniVL embedding, fVIL, that fuses visual and semantic intent with spatial locations in a single token sequence. A two-stage pipeline first aligns UniVL with the VAE embedding space and then conditions a pretrained diffusion backbone entirely on UniVL embeddings, eliminating the standalone text encoder, such as T5. Although this reframing uses a deliberately minimal text interface, it yields strong empirical gains. On UniVL-ImgGen, a benchmark of 477K mask-annotated images that we construct for training and evaluation, UniVL improves image quality over text-prompted baselines, reducing FID from 14 to 11 and increasing PSNR from 16 to 20. It also eliminates the text encoder entirely, reducing inference TFLOPs by up to 52% and runtime by up to 44%. Additional ablation studies validate the contributions of the proposed components, paving the way for efficient, spatially grounded image generation with a unified conditioning paradigm.

2605.21610 2026-05-22 cs.LG

AgForce Enables Antigen-conditioned Generative Antibody Design

Mansoor Ahmed, Murray Patterson

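AI summary: This paper proposes AgForce, which combines a graph neural network encoder with specialized decoders to address the tendency of existing antibody design methods to ignore the antigen input, improving both binding quality and sequence recovery.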

English abstract

Antibody design methods condition on antigen structure to generate complementarity-determining regions (CDRs), yet a systematic evaluation of baseline methods reveals that they largely ignore the antigen input. We identify three failure modes that explain this behavior. Antigen blindness arises because models derive predictions from antibody framework context rather than antigen information, producing nearly identical CDRs regardless of the target. Vocabulary collapse reduces predicted amino acids to three to five per position, far below the ground truth distribution in native sequences. Moreover, any model trained with standard per-position cross-entropy converges to the positional marginal distribution, making it provably unable to produce antigen-specific sequence predictions. We propose a novel encoder-decoder architecture called AgForce that uses a graph neural network (GNN) as the encoder and specialized decoders for sequence-structure co-design. Specifically, we apply framework dropout, gated bottlenecks, and hyperbolic cross attention that prevent the antibody shortcut path. In the decoder, a Mixture Density Network (MDN) sequence head with Potts-like pairwise coupling and annealed Multiple Choice Learning (aMCL) replaces the cross-entropy objective with a multi-component distribution whose optimal solution differs from the positional marginal. An antigen cycle consistency head routes gradients through the sequence decoder, forcing predicted distributions to encode antigen identity. AgForce achieves the best binding quality and sequence recovery simultaneously on the CHIMERA-Bench dataset, improving amino acid recovery by 8% over the strongest sequence baseline while surpassing the baselines across all interface metrics, and nearly doubling the effective vocabulary of GNN methods. The source code is available at: https://github.com/mansoor181/ag-force.git

2605.21609 2026-05-22 cs.CL cs.AI cs.CY

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

Heajun An, Qi Zhang, Vedanth Achanta, Jin-Hee Cho

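AI summary: This paper proposes the CR4T framework, which replaces refusal-oriented safety mechanisms with selective response reconstruction to improve LLM safety in a way better suited to adolescents' developmental needs.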

English abstract

Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.

2605.21606 2026-05-22 cs.LG cs.AI

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

何时教师标记可靠?用于推理的基于位置加权的在线自我蒸馏

Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

AI总结 本文提出了一种基于位置加权的在线自我蒸馏方法,用于改进推理任务中教师标记的可靠性,通过引入分支可行性诊断来识别教师标记的可靠性,并在不同模型上验证了其有效性。

Comments Pre-print. Code is available at https://github.com/SaFo-Lab/PW-OPSD

详情
AI中文摘要

在线自我蒸馏(OPSD)利用特权教师在学生自身的采样结果(rollout)上训练学生,但其标准目标对所有生成的标记赋予相同权重,隐含地假定特权教师目标在学生访问的每个前缀处同样可靠。现有的基于熵的OPD方法利用教师熵调节标记级监督,从而放松了这种均匀性,但在推理任务中,高教师熵对可靠性的含义并不明确:它既可能反映不可行的不确定性,也可能反映良性的解法多样性。为识别这一现象,我们引入了分支可行性诊断。具体来说,我们记录带特权答案的教师提示下的下一个标记候选,将每个候选强制接在学生提示及其在线主干前缀之后,并检验由此得到的学生模板续写能否恢复正确答案。在Qwen3-4B上,我们发现有向的序列内位置分数是所测试指标中对教师标记可靠性最强的预测因子,ROC曲线下面积(AUROC)达到0.83;而局部不确定性分数最高仅为0.57。受这种轨迹级结构的启发,我们提出了位置加权的在线自我蒸馏(PW-OPSD),它在保持与OPSD相同的学生采样、特权教师前向传递和截断前向KL目标的同时,施加递增的位置权重。在使用不同随机种子的全面评估中,由诊断导出的PW-OPSD将AIME 2024和AIME 2025的Avg@12分别提高了+1.0和+1.1分;在来自不同模型家族的两个更大规模模型DeepSeek-R1-Distill-Llama-8B和Olmo-3-7B-Think上的泛化评估也显示出一致的总体Avg@12提升。这些结果表明,推理蒸馏中的教师标记可靠性具有轨迹结构,并且可以在不增加教师计算的情况下加以利用。

英文摘要

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
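
The weighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear weight schedule, the per-token clipping value, and the function names `position_weights` and `pw_opsd_loss` are all assumptions made for illustration. The abstract only states that an increasing position weight is applied to a clipped forward-KL target.

```python
import math

def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over the vocabulary at one token position."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(teacher_probs, student_probs) if p > 0.0)

def position_weights(seq_len, w_start=0.5, w_end=1.5):
    """Linearly increasing weights (assumed schedule, not from the paper)."""
    if seq_len == 1:
        return [1.0]
    step = (w_end - w_start) / (seq_len - 1)
    return [w_start + step * t for t in range(seq_len)]

def pw_opsd_loss(teacher_dists, student_dists, clip=10.0):
    """Position-weighted mean of per-token forward KL, each clipped at `clip`."""
    weights = position_weights(len(teacher_dists))
    per_token = [min(forward_kl(p, q), clip)
                 for p, q in zip(teacher_dists, student_dists)]
    return sum(w * k for w, k in zip(weights, per_token)) / sum(weights)

# The same mismatch costs more late in the sequence than early.
teacher = [[0.9, 0.1]] * 3
early = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
late = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
print(pw_opsd_loss(teacher, early) < pw_opsd_loss(teacher, late))
```

With uniform weights the two mismatches above would incur identical loss; the increasing schedule is what makes later teacher tokens count more.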

2605.21600 2026-05-22 cs.LG

ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning

ConTact: 通过显式界面推理进行接触优先的抗体CDR设计

Mansoor Ahmed, Spencer VonBank, Nadeem Taj, Sujin Lee, Naila Jan, Murray Patterson

AI总结 本文提出ConTact,一种通过显式界面推理进行抗体CDR设计的方法,通过显式分解CDR设计为三个阶段:学习表面互补性指纹、预测CDR-抗原接触以及注入接触门控抗原特征,从而提高结构质量和表位意识。

详情
AI中文摘要

计算抗体CDR设计方法以抗原结构为条件生成结合环,但现有架构将两个根本不同的子问题混为一谈:确定哪些CDR位置会接触抗原,以及在这些位置选择氨基酸。这种混淆迫使模型通过均匀的消息传递隐式学习接触推理,使抗原信号在所有位置上被均等稀释。我们提出ConTact,一种“先接触、后作用”的架构,将CDR设计显式分解为三个级联阶段:学习表面互补性指纹、预测CDR-抗原接触,以及将接触门控的抗原特征注入序列头。距离偏置的交叉注意力模块编码偏向空间邻居的几何先验,而接触加权的交叉熵损失将梯度信号集中于结合关键位置。在CHIMERA-Bench数据集上,与多个CDR-H3设计基线相比,ConTact取得了最佳的结构质量(RMSD比次优基线改善7%)和最佳的表位感知能力(F1分数比GNN基线高10%),序列恢复率也具有竞争力(AAR 0.38)。

英文摘要

Computational antibody CDR design methods condition on antigen structure to generate binding loops, yet existing architectures conflate two fundamentally distinct sub-problems: identifying which CDR positions will contact the antigen, and selecting amino acids at those positions. This conflation forces models to learn contact reasoning implicitly through uniform message passing, diluting antigen signal across all positions equally. We introduce ConTact, a contact-then-act architecture that explicitly decomposes CDR design into three cascaded stages: learning surface complementarity fingerprints, predicting CDR-antigen contacts, and injecting contact-gated antigen features into the sequence head. A distance-biased cross-attention module encodes geometric priors favoring spatial neighbors, while a contact-weighted cross-entropy loss concentrates gradient signal on binding-critical positions. On the CHIMERA-Bench dataset, ConTact achieves the best structural quality (7% RMSD improvement over the next-best baseline), the best epitope awareness (10% F1 score improvement over GNN baselines), and competitive sequence recovery (AAR 0.38) among several CDR-H3 design baselines.
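
The two loss and attention ideas named in this abstract can be sketched in a few lines. This is an illustrative reading only: the linear distance penalty `lam * d`, the weight form `1 + alpha * contact_prob`, and both function names are assumptions, since the abstract does not give the exact formulas.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def distance_biased_attention(scores, distances, lam=0.5):
    """Attention of one CDR residue over antigen residues.
    Subtracting lam * distance (assumed form) favors spatial neighbors."""
    return softmax([s - lam * d for s, d in zip(scores, distances)])

def contact_weighted_ce(logits, targets, contact_probs, alpha=4.0):
    """Cross-entropy over CDR positions, up-weighted where contact is likely.
    The weight 1 + alpha * contact_prob is a hypothetical choice."""
    total, norm = 0.0, 0.0
    for z, y, c in zip(logits, targets, contact_probs):
        w = 1.0 + alpha * c
        total += -w * math.log(softmax(z)[y])
        norm += w
    return total / norm

# Equal raw scores: the closer antigen residue receives more attention.
print(distance_biased_attention([1.0, 1.0], [4.0, 12.0]))

# The same prediction error costs more at a predicted contact position.
logits = [[2.0, 0.0], [0.0, 2.0]]   # position 1 is wrong for target class 0
targets = [0, 0]
print(contact_weighted_ce(logits, targets, [0.0, 1.0]) >
      contact_weighted_ce(logits, targets, [1.0, 0.0]))
```

Normalizing by the sum of weights keeps the loss scale comparable to an unweighted cross-entropy, which is one reasonable design choice among several.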

2605.21573 2026-05-22 cs.CV

Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

Lens:重新思考基础文本到图像模型的训练效率

Dong Chen, Fangyun Wei, Ziyu Wan, Dongdong Chen, Jiawei Zhang, Jinjing Zhao, Sirui Zhang, Yang Yue, Zhiyang Liang, Baining Guo, Chong Luo, Jianmin Bao, Ji Li, Lei Shi, Qinhong Yang, Xiuyu Wu, Xuelu Feng, Yan Lu, Yanchen Dong, Yitong Wang, Yunuo Chen

AI总结 本文提出Lens,一个具有38亿参数的文本到图像模型,在多种基准测试中表现与超过60亿参数的最新模型相当甚至更优,同时训练计算需求显著降低。通过最大化训练批次的数据信息密度和改进收敛速度的架构选择,实现了高效的训练和优化。

Comments Project Page: https://github.com/microsoft/Lens

详情
AI中文摘要

我们介绍了Lens,一个具有38亿参数的文本到图像(T2I)模型,其在多种基准测试中表现与超过60亿参数的最新模型相当甚至更优,同时训练计算需求显著降低。例如,Lens仅需约Z-Image的19.3%的训练计算。Lens的训练效率源于两个关键策略,除了其紧凑的模型大小外。首先,我们通过(i)在Lens-800M数据集上训练,该数据集包含8亿个密集标注的图像-文本对,其标注由GPT-4.1生成,平均每个标注约109个词,提供比传统短标注更丰富的语义监督,以及(ii)从具有多种分辨率和多样长宽比的图像中构建每个批次,从而扩大每个优化步骤的有效视觉覆盖范围。其次,我们通过精心的架构选择提高了收敛速度,包括采用提供更好潜在表示的语义变分自编码器(VAE)以及采用加速优化并实现从英语训练数据中多语言泛化的强语言编码器。预训练后,我们应用基于分类学驱动提示的强化学习(Lens-RL-8K)和结构化奖励标准来抑制伪影并提高视觉质量,一个具有训练免费系统提示搜索的推理模块以更好地对齐用户请求与模型,以及基于知识蒸馏的加速4步推理。通过高效的训练和系统的优化,Lens能够泛化到任意的长宽比从1:2到2:1以及分辨率高达1440^2,并支持几种常用语言的提示。得益于其紧凑的尺寸,Lens在单个NVIDIA H100 GPU上可以在3.15秒内生成1024^2的图像,而其蒸馏后的turbo版本可以在0.84秒内完成4步生成。

英文摘要

We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.

2605.21572 2026-05-22 cs.CV cs.RO

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

PhysX-Omni: 为刚体、变形体和关节物体统一的模拟准备物理3D生成

Ziang Cao, Yinghao Liu, Haitian Li, Runmao Yao, Fangzhou Hong, Zhaoxi Chen, Liang Pan, Ziwei Liu

AI总结 本文提出PhysX-Omni,一种统一的模拟准备物理3D生成框架,通过开发针对视觉-语言模型的高效几何表示和首个通用模拟准备3D数据集PhysXVerse,以及评估生成和理解能力的PhysX-Bench,显著提升了生成和理解性能,推动下游应用如具身AI和物理模拟的发展。

Comments Project page: https://physx-omni.github.io/

详情
英文摘要

Simulation-ready physical 3D assets have emerged as a promising direction owing to their broad applicability in downstream tasks. However, most existing 3D generation methods either neglect physical properties or are limited to a single asset category, e.g., rigid, deformable, or articulated objects. To address these limitations, we introduce PhysX-Omni, a unified framework for simulation-ready physical 3D generation across diverse asset types. Specifically, we develop a novel and efficient geometry representation tailored for Vision-Language Models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance. In addition, we construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. Furthermore, to comprehensively and flexibly evaluate both generative and understanding capabilities in the wild, we propose PhysX-Bench, which encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description. Extensive experiments with conventional metrics and PhysX-Bench show that PhysX-Omni performs strongly in both generation and understanding. Moreover, additional studies further validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. We believe PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

2605.21568 2026-05-22 cs.LG

Equilibrium Propagation and Hamiltonian Inference in the Diffusive Fitzhugh-Nagumo Model

扩散 Fitzhugh-Nagumo 模型中的平衡传播与哈密顿推断

Jack Kendall

AI总结 本文扩展了平衡传播框架以应用于偏斜梯度系统,并展示了深度能量模型与哈密顿神经网络之间的等价性。研究重点是扩散耦合的 Fitzhugh-Nagumo 神经网络作为典型示例,证明了由于 Fitzhugh-Nagumo 模型的稳态解由自共轭算子描述,因此可以应用平衡传播方法进行信用分配。此外,对于具有深度残差网络拓扑的 Fitzhugh-Nagumo 网络,稳态解具有(空间)哈密顿量,因此可以应用哈密顿回传方法。最后,推导出一个显式的层间哈密顿递推关系,用于指导深度 Fitzhugh-Nagumo 网络和深度能量模型的稳态解推断。

详情
AI中文摘要

在本文中,我们将平衡传播框架扩展到偏斜梯度系统,并展示了深度能量模型与哈密顿神经网络之间的等价性。我们重点研究了扩散耦合的 Fitzhugh-Nagumo 神经网络作为典型示例。我们证明了由于 Fitzhugh-Nagumo 模型的稳态解由自共轭算子描述,因此可以应用平衡传播方法进行信用分配。此外,对于具有深度残差网络拓扑的 Fitzhugh-Nagumo 网络,我们证明稳态解具有(空间)哈密顿量,因此可以应用哈密顿回传方法。最后,我们推导出一个显式的层间哈密顿递推关系,用于指导深度 Fitzhugh-Nagumo 网络和深度能量模型的稳态解推断。

英文摘要

In this work, we extend the Equilibrium Propagation framework to skew-gradient systems and show an equivalence between deep Energy-Based Models and Hamiltonian neural networks. We focus on networks of diffusively coupled Fitzhugh-Nagumo neurons as a prototypical example. We show that since stationary solutions of the Fitzhugh-Nagumo model are described by self-adjoint operators, the methods of equilibrium propagation for performing credit assignment can be applied. Furthermore, for Fitzhugh-Nagumo networks with the topology of a deep residual network, we show that the steady state solutions admit a (spatial) Hamiltonian, and thus the methods of Hamiltonian Echo Backpropagation can be applied. We end by deriving an explicit layer-wise Hamiltonian recurrence relation governing inference for stationary solutions of both deep Fitzhugh-Nagumo networks and deep Energy-Based Models.

2605.21566 2026-05-22 cs.LG

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

CKD风险预测中的校准、不确定性沟通与部署准备性:一个框架评估研究

Michael O. Eniolade

AI总结 本文评估了在慢性肾病风险预测中,校准、不确定性量化和部署准备性的重要性,通过五个分类器在UCI CKD数据集上的表现,发现内部性能优异但外部转移性差,强调了校准稳定性和外部数据验证的必要性。

Comments 27 pages, 6 figures, 4 tables. Supplementary materials (S1-S4) included as ancillary file

详情
AI中文摘要

用于慢性肾病(CKD)风险预测的机器学习模型在内部测试集上通常表现出很强的判别能力。然而,校准和不确定性量化受到的关注要少得多,导致临床医生无法可靠地判断概率输出是否准确。我们在UCI CKD数据集(400名患者,CKD患病率62.5%)上训练了五个分类器:逻辑回归、随机森林、XGBoost、带Platt缩放的SVM以及高斯朴素贝叶斯。我们从校准质量、共形预测覆盖率以及一个包含八项标准的部署准备度框架三个方面评估了每个模型。分布压力测试将每个模型校准最佳的变体应用于开放获取的MIMIC-IV演示队列(97名患者,CKD患病率23.7%),以评估其在患病率偏移和特征缺失情况下的行为。我们使用期望校准误差(ECE)和Brier分数测量Platt缩放和保序回归前后的校准情况,并通过以90%边际覆盖率为目标的分割共形预测来量化不确定性。所有五个模型在UCI测试集上的AUROC均达到1.00。保序重校准将内部ECE降至0.000-0.022。在MIMIC-IV上,AUROC降至0.48-0.58,ECE升至0.68-0.76,共形覆盖率从0.80-0.98降至0.21-0.25,而目标为90%。没有任何模型在部署准备度检查表上的得分超过4分(满分16分)。近乎完美的内部性能并未迁移到外部数据。在任何临床预测模型走向部署之前,都应在外部数据上评估校准稳定性和共形覆盖率。

英文摘要

Machine learning models for chronic kidney disease (CKD) risk prediction often post strong discrimination scores on internal test sets. Calibration and uncertainty quantification get far less attention, leaving clinicians without reliable information about whether the probability outputs are accurate. We trained five classifiers on the UCI CKD dataset (400 patients, 62.5% CKD prevalence): logistic regression, random forest, XGBoost, SVM with Platt scaling, and Gaussian naive Bayes. We evaluated each across calibration quality, conformal prediction coverage, and an eight-criterion deployment readiness framework. A distributional stress-test applied the best-calibrated variant of each model to the open-access MIMIC-IV demo cohort (97 patients, 23.7% CKD) to assess behaviour under prevalence shift and feature missingness. We measured calibration before and after Platt scaling and isotonic regression using Expected Calibration Error and Brier Score, and quantified uncertainty through split conformal prediction targeting 90% marginal coverage. All five models reached AUROC 1.00 on the UCI test set. Isotonic recalibration reduced internal ECE to 0.000-0.022. On MIMIC-IV, AUROC fell to 0.48-0.58, ECE rose to 0.68-0.76, and conformal coverage dropped from 0.80-0.98 to 0.21-0.25 against a 90% target. No model scored above 4 out of 16 on the deployment readiness checklist. Near-perfect internal performance did not transfer. Calibration stability and conformal coverage should be evaluated on external data before any clinical prediction model moves toward deployment.
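
The two quantities this abstract tracks, Expected Calibration Error and split conformal coverage, can be computed as in the sketch below. It is a generic textbook-style version for a binary classifier, not the study's code: the equal-width binning on the positive-class probability, the nonconformity score `1 - p(true class)`, and all function names are assumptions.

```python
import math

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin by predicted P(y=1), compare mean prediction to positive rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            ece += len(b) / len(probs) * abs(mean_p - pos_rate)
    return ece

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal quantile of the scores 1 - p(true class) on a calibration set."""
    scores = sorted(1.0 - (p if y == 1 else 1.0 - p)
                    for p, y in zip(cal_probs, cal_labels))
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    return 1.0 if k > len(scores) else scores[k - 1]

def prediction_set(p, qhat):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, pc in ((0, 1.0 - p), (1, p)) if 1.0 - pc <= qhat]

def empirical_coverage(probs, labels, qhat):
    """Fraction of cases whose true label falls inside the prediction set."""
    hits = sum(1 for p, y in zip(probs, labels) if y in prediction_set(p, qhat))
    return hits / len(labels)

# An overconfident model: always predicts 1.0, but only half the cases are positive.
print(expected_calibration_error([1.0] * 4, [1, 1, 0, 0]))
```

The coverage guarantee of split conformal prediction relies on calibration and test data being exchangeable, which is why evaluating `empirical_coverage` on an external cohort, as the study does, is informative.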

2605.21565 2026-05-22 cs.LG

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

通过自适应课程学习提升多模态对话情感识别的模态平衡

Phuong-Anh Nguyen, The-Son Le, Duc-Trong Le, Cam-Van Thi Nguyen

AI总结 本文提出基于自适应课程学习的框架,通过双层难度评估器解决多模态对话情感识别中的模态不均衡问题,实验表明该方法在IEMOCAP和MELD数据集上显著提升了模型性能。

Comments Accepted at Neural Computing and Applications (Springer), 2026

详情
AI中文摘要

多模态情感识别在对话中是一项关键任务,其中融合语言、面部表情和语音语调的多模态方法已取得显著进展。然而,模态不匹配和学习不平衡仍然是主要挑战,限制了多模态信息的有效利用。为了解决这个问题,我们提出了一种基于自适应课程学习(SPCL)的插件式框架用于MERC。我们引入了双层难度评估器,捕捉语句级和对话级的挑战。语句级分数模型细粒度地捕捉模态特定的难度,而对话级分数捕捉更广泛的对话结构,包括情感依赖性和模态一致性。基于这些分数,学习调度器动态地指导从简单到困难的实例训练。通过将SPCL整合到现有的MERC架构中,我们的方法缓解了模态不平衡并提高了模型鲁棒性。在IEMOCAP和MELD数据集上的大量实验显示,不同架构和模态设置下均取得一致的改进。在IEMOCAP上,SPCL在基线模型上将加权F1分数提高约+1.2%至+6.6%,而在MELD上,提升达到+10.4%。这些结果突显了SPCL作为轻量级插件模块在多模态情感识别中的有效性与通用性。

英文摘要

Multimodal Emotion Recognition in Conversations (MERC) is a crucial task for understanding human interactions, where multimodal approaches integrating language, facial expressions, and vocal tone have achieved significant progress. However, modality misalignment and imbalanced learning remain major challenges, limiting the effective utilization of multimodal information. To address this issue, we propose a plug-and-play framework based on Self-Paced Curriculum Learning (SPCL) for MERC. We introduce a dual-level Difficulty Measurer that captures both utterance-level and conversation-level challenges. The utterance-level score models fine-grained modality-specific difficulty, while the conversation-level score captures broader dialogue structures, including emotional dependencies and modality coherence. Based on these scores, the Learning Scheduler dynamically guides training from easier to more difficult instances. By integrating SPCL into existing MERC architectures, our method alleviates modality imbalance and improves model robustness. Extensive experiments on the IEMOCAP and MELD datasets demonstrate consistent improvements across different architectures and modality settings. On IEMOCAP, SPCL improves weighted F1-score by approximately +1.2% to +6.6% over baseline models, while on MELD, gains reach up to +10.4%. These results highlight the effectiveness and generalizability of SPCL as a lightweight plug-and-play module for multimodal emotion recognition.

2605.21563 2026-05-22 cs.LG

Embedding-Based Federated Learning with Runtime Governance for Iron Deficiency Prediction

基于运行时治理的嵌入式联邦学习用于缺铁预测

Fan Zhang, Simon Deltadahl, Majid Lotfian Delouee, Daniel Kreuter, Joseph Taylor, Allerdien Visser, BloodCounts Consortium, James H. F. Rudd, Nicholas S. Gleadall, Suthesh Sivapalaratnam, Folkert Asselbergs, Martijn C. Schut, Michael Roberts

AI总结 本文提出了一种基于嵌入的联邦学习框架,用于从常规全血计数数据中预测缺铁,并在两个临床环境中部署,展示了个性化聚合方法在处理不同样本量和任务相关性时的优越性。

详情
AI中文摘要

近期的综述发现,已发表的医疗联邦学习(FL)研究绝大多数从未进入实际部署。我们开发了一种基于嵌入的FL流水线,用于从常规全血细胞计数(FBC)数据中预测缺铁,并将其部署在阿姆斯特丹大学医学中心(AUMC)和NHS Blood and Transplant(NHSBT)的真实机构环境中;这两个临床环境在缺铁患病率、铁蛋白分布和受试人群方面差异显著。一个冻结的血液学领域基础模型DeepCBC在各站点本地提取表示,使联邦训练仅限于一个紧凑的下游分类器,与对完整编码器进行联邦训练相比大幅减少了反复通信的开销。这两个临床数据集在结构上是非独立同分布(non-IID)的,其异质性源于人群本身的差异而非采样假象。运行时治理由FLA$^3$强制执行,这是一个面向医疗的FL平台,提供限定于研究范围的执行、基于策略的授权和带签名的审计日志。标准的按样本量加权聚合(FedAvg)使两个站点的受试者工作特征曲线下面积(ROC-AUC)相对于仅本地训练均有所下降,因为全局更新偏向于规模更大的AUMC分布。FedMAP是一种个性化聚合方法,相对于仅本地训练,它将AUMC的ROC-AUC从0.9470提升到0.9594,将NHSBT的ROC-AUC从0.8558提升到0.8671,取得了最高的宏平均ROC-AUC(0.9133)和总体最佳的宏平均平衡准确率。这些结果支持在客户端样本量与任务相关性差异显著的临床联邦中使用个性化聚合。

英文摘要

Recent reviews find that the vast majority of published healthcare federated learning (FL) studies never reach real-world deployment. We developed an embedding-based FL pipeline for iron deficiency prediction from routine full blood count (FBC) data and deployed it across real institutional environments at Amsterdam University Medical Centre (AUMC) and NHS Blood and Transplant (NHSBT), two clinical environments that differ markedly in iron deficiency prevalence, ferritin distribution, and subject populations. A frozen domain-specific haematology foundation model, DeepCBC, performs site-local representation extraction, restricting federated training to a compact downstream classifier and substantially reducing recurrent communication relative to full-encoder federation. The two clinical datasets are structurally not independent and identically distributed (non-IID), with heterogeneity arising from distinct population differences rather than sampling artefacts. Runtime governance is enforced by FLA$^3$, a healthcare-oriented FL platform providing study-scoped execution, policy-based authorisation, and signed audit logging. Standard sample-size-weighted aggregation (FedAvg) reduced the area under the receiver operating characteristic curve (ROC-AUC) at both sites relative to local-only training, as the global update was biased towards the larger AUMC distribution. FedMAP, a personalised aggregation method, raised ROC-AUC from 0.9470 to 0.9594 at AUMC and from 0.8558 to 0.8671 at NHSBT relative to local-only training, achieving the highest macro ROC-AUC of 0.9133 and the best macro balanced accuracy overall. These results support personalised aggregation in clinical federations where client sample size and task relevance diverge substantially.
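
The bias the abstract attributes to FedAvg follows directly from its sample-size weighting, as this minimal sketch shows. The client sizes and parameter vectors are made-up illustrative values, and `fedavg` is a hypothetical helper name; only the two FedMAP ROC-AUC figures come from the abstract.

```python
def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(n * params[i] for params, n in zip(client_params, client_sizes)) / total
            for i in range(dim)]

# Hypothetical two-site federation: the larger site dominates the global update.
large_site = [0.0, 2.0]   # illustrative parameters, 900 samples
small_site = [1.0, 0.0]   # illustrative parameters, 100 samples
print(fedavg([large_site, small_site], [900, 100]))   # close to large_site

# Macro ROC-AUC as the unweighted mean of the per-site FedMAP values.
macro_auc = (0.9594 + 0.8671) / 2
print(macro_auc)
```

The unweighted mean of the two reported FedMAP values is 0.91325, which is consistent with the reported macro ROC-AUC of 0.9133.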

2605.21561 2026-05-22 cs.LG

Objective-Induced Bias and Search Dynamics in Multiobjective Unsupervised Feature Selection

目标诱导偏差与多目标无监督特征选择中的搜索动态

Mathieu Cherpitel, Thomas Bäck, Martijn R. Tannemaat, Anna V. Kononova

AI总结 本研究探讨了多目标无监督特征选择中评价目标对搜索动态和Pareto前沿质量的影响,发现基于轮廓系数的评价目标倾向于产生低基数的平凡解,而提出的PCA损失目标能生成测试准确度与监督优化相似的紧凑子集。

详情
AI中文摘要

无监督特征选择通常被建模为一个多目标优化问题,联合优化子集质量和子集大小。然而,这种形式的行为严重依赖于评估目标的选择、子集大小正则化的方向以及初始化策略。我们使用一个包含已知的信息性、冗余和无关特征类型的合成数据集,在受控环境下研究这些因素。通过将三种评估目标(准确率、轮廓系数和PCA重建损失)与子集大小最小化或最大化相结合,我们比较了六种形式。结果表明,形式对搜索动态和最终Pareto前沿的质量都有显著影响。基于轮廓系数的形式强烈偏向平凡的低基数解,并且始终是预测性能的弱代理。相比之下,所提出的PCA损失目标产生紧凑的子集,其测试准确率与直接优化监督准确率所得的子集相当。总体而言,该研究表明,目标设计是有效的多目标无监督特征选择的核心。

英文摘要

Unsupervised feature selection is commonly formulated as a multiobjective optimisation problem that jointly optimises subset quality and subset size. Yet the behaviour of this formulation depends critically on the choice of evaluation objective, the direction of subset-size regularisation, and the initialisation strategy. We study these factors in a controlled setting using a synthetic dataset with known informative, redundant, and irrelevant feature types. Six formulations are compared by combining three evaluation objectives: accuracy, silhouette score, and PCA reconstruction loss with subset-size minimisation or maximisation. The results show that formulation strongly affects both search dynamics and the quality of the resulting Pareto front. Silhouette-based formulations exhibit a strong bias toward trivial low-cardinality solutions and remain weak proxies for predictive performance. In contrast, the proposed PCA loss objective produces compact subsets with test accuracy comparable to subsets obtained by directly optimising supervised accuracy. Overall, the study shows that objective design is central to effective multiobjective unsupervised feature selection.

2605.21560 2026-05-22 cs.LG

AutoMCU: Feasibility-First MCU Neural Network Customization via LLM-based Multi-Agent Systems

AutoMCU: 通过基于LLM的多智能体系统实现面向MCU的神经网络定制化

Penglin Dai, Zijie Zhou, Xincao Xu, Junhua Wang, Xiao Wu, Lixin Duan

AI总结 本文提出AutoMCU,一种基于LLM的多智能体系统,用于在MCU约束下实现神经网络的自动化定制化。通过自然语言任务需求和硬件规格,AutoMCU迭代生成结构化架构候选方案,通过供应商工具链反馈过滤不可行设计,在训练前进行筛选,评估可行模型并在受控协议下验证部署可行性。

详情
AI中文摘要

在微控制器单元(MCU)上部署神经网络对于边缘智能至关重要,但受限于内存、存储和计算约束仍具挑战性。现有方法,如模型压缩和硬件感知神经架构搜索(HW-NAS),通常依赖代理指标,导致搜索成本高,且未能充分弥合架构设计与验证部署之间的差距。本文提出AutoMCU,一种以可行性为导向的基于大型语言模型(LLM)的多智能体系统,用于在MCU约束下实现神经网络的自动化定制化。给定自然语言任务要求和硬件规格,AutoMCU迭代生成结构化架构候选方案,在训练前通过供应商工具链反馈过滤不可行设计,评估可行模型在受控协议下的性能,并通过后端基础部署分析验证部署可行性。AutoMCU包含两个关键机制:1)基于硬件的架构生成,用于在RAM和Flash约束下提前淘汰不可部署的候选方案;2)状态隔离的多智能体调度,用于稳定协调提案、训练、评估和部署阶段。在严格MCU约束下对CIFAR-10和CIFAR-100的实验表明,AutoMCU在减少定制时间至约1-2小时的同时实现了具有竞争力的精度,相比代表性的MCU导向HW-NAS基线方法所需的数百小时GPU时间。与ColabNAS和基于LLM的NAS方法GENIUS在NAS-Bench-201上的比较进一步证明了AutoMCU的有效性和稳定性。在多个STM32微控制器上的实际设备部署验证了其在MCU规模边缘智能中的实际适用性。

英文摘要

Deploying neural networks on microcontroller units (MCUs) is critical for edge intelligence but remains challenging due to tight memory, storage, and computation constraints. Existing approaches, such as model compression and hardware-aware neural architecture search (HW-NAS), often depend on proxy metrics, incur high search cost, and do not fully bridge the gap between architecture design and verified deployment. This paper presents AutoMCU, a feasibility-first large language model (LLM)-based multi-agent system for automated neural network customization under MCU constraints. Given natural-language task requirements and hardware specifications, AutoMCU iteratively generates structured architecture candidates, filters infeasible designs through vendor toolchain feedback before training, evaluates feasible models under a controlled protocol, and verifies deployability through backend-grounded deployment analysis. AutoMCU includes two key mechanisms: 1) hardware-in-the-loop architecture generation for early elimination of undeployable candidates under RAM and Flash constraints, and 2) state-isolated multi-agent scheduling for stable coordination of proposal, training, evaluation, and deployment stages. Experiments on CIFAR-10 and CIFAR-100 under strict MCU constraints show that AutoMCU achieves competitive accuracy while reducing customization time to about 1--2 hours, compared with hundreds of GPU hours for representative MCU-oriented HW-NAS baselines. Comparisons with ColabNAS and the LLM-based NAS method GENIUS on NAS-Bench-201 further demonstrate the effectiveness and stability of AutoMCU. Real-device deployments on multiple STM32 microcontrollers validate its practical applicability to MCU-scale edge intelligence.

2605.21558 2026-05-22 cs.LG cs.CL

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

从参数到数据:一种任务参数引导的微调流水线用于高效的LLM对齐

Hao Chen, Qi Zhang, Liyao Li, Zhanming Shen, Wentao Ye, Lirong Gao, Ningtao Wang, Xing Fu, Xiaoyu Shen, Junbo Zhao

AI总结 本研究提出了一种任务参数引导的微调流水线,通过任务敏感的注意力头作为双指南,实现样本挖掘和结构剪枝,从而提高LLM对齐的效率。

Comments Accepted@ICML26, 28 pages, 11 figures, 26 tables

详情
AI中文摘要

适应大型语言模型(LLM)到专业领域通常会带来高数据和计算开销。尽管先前的效率努力大多将数据选择和参数高效微调视为孤立过程,我们的实证分析表明它们可能本质上是耦合的。我们提出了强映射假说:稀疏的注意力头子集在任务特定适应中起主导作用,作为解锁特定数据模式的钥匙。基于这一观察,我们提出了从参数到数据(P2D)统一框架,利用这些任务敏感的注意力头作为双指南,用于样本挖掘和结构剪枝。为了严格量化整个流程的成本,我们引入了对齐效率比率(AER)指标,用于衡量选择延迟和训练时间。机理上,P2D通过轻量级代理识别关键头,并利用它们作为功能性过滤器来精选高亲和力数据,建立协同流程。经验上,通过更新仅10%的注意力头在10%的数据上,P2D在强基线基础上实现了8.3个百分点的性能提升,并提供了7.0倍的端到端时间加速。这些结果验证了精确的参数-数据同步消除了冗余,提供了一种新的高效对齐范式。

英文摘要

Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

2605.21556 2026-05-22 cs.LG

Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

超越单一广告位:多广告位保障型显示广告的联合优化

Zhaoqi Zhang, Jiaming Deng, Miao Xie, Linyou Cai, Qianlong Xie, Xingxing Wang, Siqiang Luo, Gao Cong

AI总结 本文提出了一种多广告位保障型显示广告的联合优化框架,解决了广告位冗余、合同不平衡和曝光集中等关键问题,通过离线二分图匹配建模和合同轮盘机制,提升了商户 ROI、平台收入效率和合同履行的鲁棒性。

Comments Accepted at SIGIR Industry Track 2026

详情
AI中文摘要

保障型显示(GD)广告对于平台变现至关重要,但现有方法通常基于单一广告位假设,限制了其在多广告位页面浏览中优化分配的能力。本文提出了一种新颖的多广告位GD分配联合优化框架,解决了广告位级冗余、合同不平衡和曝光集中等关键挑战。我们的方法将分配建模为一个离线二分图匹配问题,采用合同轮盘机制保证广告位独占性,并通过页面浏览(Page View)约束控制曝光量,同时结合可扩展的分配优化算法以实现高效的大规模部署。在美团广告平台上的大量在线测试表明,我们的方法显著提高了商户 ROI、平台收入效率和合同履行的鲁棒性。具体而言,在线 A/B 测试显示,在 70% 的流量下每用户平均收入(ARPU)提升了 28.99%,DID 分析进一步表明合同稳定性得到改善,证明了我们的框架在真实广告部署中的适用性和有效性。

英文摘要

Guaranteed display advertising is crucial for platform monetization, yet existing methods often operate under a single-slot assumption, limiting their ability to optimize allocation across multi-slot page views. In this paper, we propose a novel joint optimization framework for multi-slot GD allocation, addressing key challenges such as slot-level redundancy, contract imbalance, and exposure concentration. Our approach formulates the allocation as an offline bipartite matching problem with a contract roulette mechanism for slot exclusivity and Page View constraints for impression control, and incorporates a scalable allocation optimization algorithm for efficient large-scale deployment. Extensive online tests on the Meituan advertising platform demonstrate that our method significantly improves merchant ROI, platform revenue efficiency, and contract fulfillment robustness. Specifically, online A/B tests show a 28.99% increase in Average Revenue Per User under 70% traffic, and DID analysis further indicates improved contract stability, demonstrating the strong applicability and effectiveness of our framework in real-world advertising deployments.

2605.21553 2026-05-22 cs.LG cs.IT eess.IV math.IT

TONIC: Token-Centric Semantic Communication for Task-Oriented Wireless Systems

TONIC:面向任务的无线系统中的基于标记的语义通信

Sige Liu, Kezhi Wang

AI总结 本文提出TONIC框架,通过在发送端进行语义感知保护和接收端置信度感知门控,实现任务导向无线系统中基于标记的语义通信,优于传统方法。

Comments 15 pages, 10 figures

详情
AI中文摘要

标记正成为基础模型表示和处理信息的基本单元,用于理解和推理。然而,传统的位级忠实无线通信在可靠传输的内容与下游模型实际消耗的内容之间存在不匹配。这种不匹配要求一种通信设计,能够直接考虑标记层面的任务相关性和下游模型需求,而不是将所有传输位视为同等重要。在本文中,我们提出了TONIC,一种面向任务的无线系统中的基于标记的语义通信框架。发送端将每个源样本转换为标记序列,估计标记层面的任务相关性,并在固定信道使用预算下通过效用感知的非均等错误保护分配保护。在接收端,使用标记层面的置信度来门控不可靠的决策,将有害的替代转换为可恢复的擦除,在基于Transformer的完成模型恢复被遮蔽的标记以进行最终任务推理之前。我们的框架在模块化且可解释的架构中结合了发送端的语义感知保护和接收端的置信度感知门控,而不是仅依赖于完全黑盒端到端学习。我们进一步建立了接收端门控规则的效用感知贝叶斯风险解释,并研究其与非均等保护和完成的相互作用。在图像分类实验中,结果表明TONIC在匹配的通信预算下,无论是在AWGN、瑞利和莱斯信道上,都优于分离式方案、像素域DeepJSCC基线和标记域基线。

英文摘要

Tokens are becoming the basic units through which foundation models represent and process information for understanding and inference. However, traditional wireless communication, centered on bit-level fidelity, faces a mismatch between what is transmitted reliably and what downstream models actually consume. This mismatch calls for a communication design that directly accounts for token-level task relevance and downstream model requirements, rather than treating all transmitted bits as equally important. In this paper, we propose TONIC, a token-centric semantic communication framework for task-oriented wireless systems. The transmitter converts each source sample into a sequence of tokens, estimates token-level task relevance, and allocates protection through utility-aware unequal error protection under a fixed channel-use budget. At the receiver, token-level confidence is used to gate unreliable decisions, turning harmful substitutions into recoverable erasures before a Transformer-based completion model restores the masked tokens for final task inference. Our framework combines transmitter-side semantic-aware protection with receiver-side confidence-aware gating in a modular and interpretable architecture, rather than relying solely on fully black-box end-to-end learning. We further establish a utility-aware Bayes-risk interpretation for the receiver-side gating rule and study its interaction with unequal protection and completion. Experimental results on image classification show that TONIC consistently outperforms separation-based schemes, the pixel-domain DeepJSCC baseline, and token-domain baselines under matched communication budgets over AWGN, Rayleigh, and Rician channels.
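
The receiver-side gating step described in this abstract can be sketched as a simple thresholding rule. The fixed threshold, the `MASK` sentinel value, and the function name `gate_tokens` are assumptions for illustration; the paper derives its rule from a utility-aware Bayes-risk argument that is not reproduced here.

```python
MASK = -1  # hypothetical sentinel marking an erased token

def gate_tokens(decoded_tokens, confidences, threshold=0.6):
    """Keep a decoded token only if its confidence clears the threshold.
    Low-confidence tokens become erasures for a completion model to fill in."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(decoded_tokens, confidences)]

received = [5, 9, 2]
confidence = [0.95, 0.30, 0.70]
print(gate_tokens(received, confidence))   # the second token is erased
```

An erasure tells the downstream completion model which position is unknown, whereas a wrong token that is passed through looks like valid input.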

2605.21552 2026-05-22 cs.LG stat.ML

Expectation Consistency Loss: Rethink Confidence Calibration under Covariate Shift

期望一致性损失:在协变量偏移下重新思考置信度校准

Jinzong Dong, Zhaohui Jiang, Bo Yang

AI总结 本文针对协变量偏移下的置信度校准问题,提出了一种无监督域适应损失(ECL),该方法在理论和实践中均表现出色,能够有效校准目标域的置信度。

Comments Accepted by ICML 2026

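Details
AI Chinese abstract (translated to English)

Confidence calibration is essential for deploying classification models in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume that training and test data are independent and identically distributed, which limits their effectiveness under covariate shift. Previous calibration methods for covariate shift struggle with class-wise or canonical calibration and often rely on importance weighting, which is unstable when density ratios are large or unbounded. Given these limitations, this paper rethinks confidence calibration under covariate shift. First, we derive a necessary and sufficient condition for confidence calibration under covariate shift, called the expectation consistency condition. It reveals that covariate shift does not necessarily lead to uncalibrated confidence, and it provides a weaker condition for confidence calibration than global covariate distribution alignment. Second, using the expectation consistency condition, we propose an unsupervised domain adaptation loss for calibrating target-domain confidence, called the Expectation Consistency Loss (ECL), which is compatible with canonical, class-wise, and top-label calibration. Third, we prove that computing ECL has the same sample complexity as the Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme. Finally, we validate the method on simulated and real-world covariate shift datasets.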

English abstract

Confidence calibration for classification models is vital in safety-critical decision-making scenarios and has received extensive attention. General confidence calibration methods assume that training and test data are independent and identically distributed, which limits their effectiveness under covariate shift. Previous calibration methods under covariate shift struggle with class-wise or canonical calibration and often rely on importance weighting, which is unstable when density ratios are large or unbounded. Given these limitations, this paper rethinks confidence calibration under covariate shift. First, we derive a necessary and sufficient condition for confidence calibration under covariate shift, named the expectation consistency condition, which reveals that covariate shift does not necessarily lead to uncalibrated confidence and provides a weaker condition for confidence calibration than global covariate distribution alignment. Second, using the expectation consistency condition, we propose an unsupervised domain adaptation loss to calibrate confidence on the target domain, named the Expectation Consistency Loss (ECL), which is compatible with canonical calibration, class-wise calibration, and top-label calibration. Third, we prove that computing ECL has the same sample complexity as the Expected Calibration Error (ECE) and provide a theoretically grounded mini-batch trainable scheme for ECL. Finally, we validate the effectiveness of our method on both simulated and real-world covariate shift datasets.
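
For readers unfamiliar with the Expected Calibration Error mentioned above, the following is a minimal sketch of the standard binned top-label ECE estimator. It is not the paper's ECL; the function name, the equal-width binning, and the default of 10 bins are assumptions chosen for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned top-label ECE: sum over bins of (bin weight) * |accuracy - confidence|.

    confidences: predicted top-label probabilities in [0, 1].
    correct: 1 if the top-label prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0..n_bins-1; a confidence of 1.0 lands in the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in the bin
    return ece


ece_value = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Here all four predictions share one bin with mean confidence 0.9 and accuracy 0.75, so the estimate is the gap between the two. Under covariate shift, labels for the target domain are typically unavailable, which is why a label-free surrogate such as the loss described in the abstract is of interest.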

2605.21550 2026-05-22 cs.LG

PeakFocus: Bridging Peak Localization and Intensity Regression via a Unified Multi-Scale Framework for Electricity Load Forecasting

PeakFocus: Bridging Peak Localization and Intensity Regression through a Unified Multi-Scale Framework for Electricity Load Forecasting

Wangzhi Yu, Peng Zhu, Qing Zhao, Yiwen Jiang, Dawei Cheng

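AI summary: This paper proposes PeakFocus, a unified multi-scale framework for electricity load peak forecasting that handles peak localization and intensity regression jointly. It targets multi-scale representation conflict and intensity smoothing in order to improve forecasting accuracy.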

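Details
AI Chinese abstract (translated to English)

Electricity load peak forecasting (ELPF), which predicts peak timing and intensity simultaneously, is a prerequisite for effective grid scheduling and risk management. Existing methods, however, face three limitations. First, they adopt a two-stage predict-then-locate paradigm, which severs the link between temporal localization and intensity regression. Second, they still struggle with multi-scale representation conflict, which leads to peak misjudgment and timing misalignment. Third, intensity regression lacks explicit peak timing context, which causes intensity smoothing because predictions are dominated by global smoothing trends. To address these limitations, we propose PeakFocus, a unified framework for ELPF. (i) A Unified Peak-Aware Pipeline (UPAP) uses a triple hybrid loss to jointly supervise temporal localization and intensity regression, together with a tolerance-based evaluation protocol. (ii) A Multi-Scale Mixing Peak Locator (MSM-PL) uses coarse-grained features to mitigate peak misjudgment caused by local fluctuations and injects them into fine-grained features through a cascade mechanism to resolve timing misalignment. (iii) A Location-Aware Decoder (LAD) injects peak timing context into the intensity regression process, providing explicit guidance that counteracts intensity smoothing and improves peak intensity estimation. Extensive experiments on the public Electricity (ELC) dataset and our industrial-scale World Large-scale Electricity Load (WLEL) dataset show that PeakFocus outperforms baselines in both timing and intensity accuracy.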

English abstract

Electricity load peak forecasting (ELPF), simultaneously predicting peak timing and intensity, is a prerequisite for effective grid scheduling and risk management. However, existing methods face three limitations. First, they adopt a two-stage predict-then-locate paradigm, which severs the link between temporal localization and intensity regression. Second, they still struggle with the multi-scale representation conflict, leading to peak misjudgment and timing misalignment. Third, the lack of explicit peak timing context during intensity regression causes intensity smoothing because predictions are dominated by global smoothing trends. To address these limitations, we propose PeakFocus, a unified framework for ELPF. (i) A Unified Peak-Aware Pipeline (UPAP) utilizes a triple hybrid loss to jointly supervise temporal localization and intensity regression, alongside a tolerance-based evaluation protocol. (ii) A Multi-Scale Mixing Peak Locator (MSM-PL) exploits coarse-grained features to mitigate peak misjudgment caused by local fluctuations, and injects them into fine-grained features via a cascade mechanism to resolve timing misalignment. (iii) A Location-Aware Decoder (LAD) injects peak timing context into the intensity regression process, providing explicit guidance to counteract intensity smoothing and improve peak intensity estimation. Extensive experiments on the public Electricity (ELC) dataset and our industrial-scale World Large-scale Electricity Load (WLEL) dataset show that PeakFocus outperforms baselines in both timing precision and intensity estimation.
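
The abstract mentions a tolerance-based evaluation protocol without defining it. One plausible reading, sketched below purely as an assumption, counts a predicted peak time as a hit when it falls within a fixed number of time steps of the true peak. The function name `peak_hit_rate` and the tolerance value are hypothetical and may differ from the paper's actual protocol.

```python
import numpy as np


def peak_hit_rate(pred_peak_idx, true_peak_idx, tolerance=1):
    """Fraction of forecasts whose predicted peak index lies within
    `tolerance` time steps of the true peak index (assumed metric)."""
    pred = np.asarray(pred_peak_idx)
    true = np.asarray(true_peak_idx)
    return float(np.mean(np.abs(pred - true) <= tolerance))


# Three forecast windows: off by 1 step, off by 4 steps, exact.
rate = peak_hit_rate([10, 14, 20], [11, 18, 20], tolerance=1)
```

With a tolerance of one step, two of the three windows count as hits. A metric of this kind rewards near misses in timing, which a strict exact-match criterion would not.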

2605.21544 2026-05-22 cs.LG eess.SP

Tabular foundation models for robust calibration of near-infrared chemical sensing data

Tabular Foundation Models for Robust Calibration of Near-Infrared Chemical Sensing Data

Robin Reiter, Denis Cornet, Fabien Michel, Lauriane Rouan, Gregory Beurier

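AI summary: This paper studies tabular foundation models for calibrating near-infrared chemical sensing data. Comparing models on regression and classification tasks, it finds that preprocessing-optimized TabPFN ranks best for regression and that TabPFN applied directly to raw spectra ranks best for classification, while classical chemometric models remain competitive on spectral outliers and extrapolated samples.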

Comments 56 pages, 17 figures, including supplementary material

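Details
AI Chinese abstract (translated to English)

Near-infrared (NIR) spectroscopy is increasingly used as a rapid, non-destructive chemical sensing technology for analyzing food, pharmaceutical, biological, and environmental samples. However, the practical deployment of NIR sensors still depends on calibration models that can handle high-dimensional, collinear spectra, limited sample sizes, preprocessing dependence, spectral outliers, and extrapolation beyond the calibration domain. This paper evaluates whether tabular foundation models can provide a new calibration strategy for NIR chemical sensing. We benchmark TabPFN on 66 NIR datasets covering 54 regression and 12 classification tasks, and compare direct inference on raw spectra and preprocessing-optimized inference against PLS/PLS-DA, Ridge, CatBoost, and one-dimensional convolutional neural networks. The study uses a unified validation framework in which preprocessing and model selection are performed only on calibration data, followed by external test evaluation. In regression, preprocessing-optimized TabPFN achieves the best overall average rank and significantly outperforms PLS, CatBoost, TabPFN on raw spectra, and CNN-1D, while remaining statistically comparable to Ridge. In classification, TabPFN applied directly to raw spectra provides the best average rank, with performance close to the optimized variant. Robustness analyses show that TabPFN provides strong average predictive performance, but its advantage shrinks on spectral outliers and extrapolated samples, where classical chemometric models remain competitive. These results suggest that tabular foundation models can complement established chemometric workflows for NIR chemical sensing, especially in small- to medium-sized calibration settings, while highlighting the need for spectroscopy-specific priors and uncertainty-aware deployment strategies.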

English abstract

Near-infrared spectroscopy is increasingly used as a rapid, non-destructive chemical sensing technology for the analysis of food, pharmaceutical, biological, and environmental samples. However, the practical deployment of NIR sensors still depends on calibration models able to handle high-dimensional, collinear spectra, limited sample sizes, preprocessing dependence, spectral outliers, and extrapolation beyond the calibration domain. Here, we evaluate whether tabular foundation models can provide a new calibration strategy for NIR chemical sensing. We benchmark TabPFN on 66 NIR datasets covering 54 regression and 12 classification tasks, and compare direct inference on raw spectra with preprocessing-optimized inference against PLS/PLS-DA, Ridge, CatBoost, and one-dimensional convolutional neural networks. The study uses a unified validation framework in which preprocessing and model selection are performed exclusively on calibration data before external test evaluation. In regression, preprocessing-optimized TabPFN achieves the best overall average rank and significantly outperforms PLS, CatBoost, TabPFN on raw spectra, and CNN-1D, while remaining statistically comparable to Ridge. In classification, TabPFN applied directly to raw spectra provides the best average rank, with performance close to the optimized variant. Robustness analyses show that TabPFN provides strong average predictive performance but that its advantage decreases on spectral outliers and extrapolated samples, where classical chemometric models remain competitive. These results suggest that tabular foundation models can complement established chemometric workflows for NIR chemical sensing, especially in small- to medium-sized calibration settings, while highlighting the need for spectroscopy-specific priors and uncertainty-aware deployment strategies.
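
The "average rank" used above to compare models across datasets can be computed as in the sketch below. It assumes an error matrix with one row per dataset and one column per model, where lower is better; the function name and the toy numbers are illustrative, and ties are broken by column order rather than averaged, which may differ from the study's procedure.

```python
import numpy as np


def average_ranks(errors):
    """Rank models within each dataset (1 = lowest error), then average per model.

    errors: array of shape (num_datasets, num_models).
    Ties are broken by column order, a simplification.
    """
    errors = np.asarray(errors, dtype=float)
    # argsort of argsort converts values into 0-based ranks along each row.
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)


# Two datasets, three models (made-up error values).
avg = average_ranks([[0.1, 0.2, 0.3],
                     [0.3, 0.1, 0.2]])
```

In this toy case the second model has the lowest average rank even though it wins on only one dataset, which shows why average rank is often reported alongside significance tests in multi-dataset benchmarks.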