arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12650 2026-05-14 cs.CV

CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis

Yunsung Chung, Alex El Darzi, Carlo El Khoury, Han Feng, Nassir Marrouche, Jihun Hamm

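Affiliations: Department of Computer Science, Tulane University; School of Medicine, Tulane University

AI summary: This work addresses the difficulty of adapting foundation diffusion models to medical image synthesis and proposes CRAFT, a clinically aligned finetuning method. It introduces the Clinical Alignment Score (CAS) as a new evaluation metric and transfers medical knowledge from multimodal large language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization, improving the clinical relevance of generated images. Experiments across several medical imaging modalities show that CRAFT raises CAS, reduces clinically implausible generations, and outperforms strong adaptation baselines.

Abstract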

Foundation diffusion models can generate photorealistic natural images, but adapting them to medical imaging remains challenging. In medical adaptation, limited labeled data can exacerbate hallucination-like and clinically implausible synthesis, while existing metrics such as FID or Inception Score do not quantify per-image alignment with pathology-relevant criteria. We introduce the Clinical Alignment Score (CAS), a foundation-model-based proxy for clinical alignment that evaluates generated images along four complementary dimensions beyond visual fidelity. Building on CAS, we propose Clinical Reward-Aligned Finetuning (CRAFT), a reward-based adaptation framework that transfers medical knowledge from multimodal large language models and vision-language models through label-conditioned prompt enrichment, clinical checklists, and differentiable reward optimization. Across four diverse modalities, CRAFT improves CAS and downstream classification performance over strong adaptation baselines. Beyond average CAS gains, CRAFT reduces the empirical low-alignment tail below a real-image reference threshold by 5.5-34.7 percentage points relative to the strongest baseline, corresponding to a 20.4% average relative reduction across datasets. These results indicate fewer hallucination-like generations under CAS, and are corroborated by out-of-family evaluator evaluation, structured checklist auditing, memorization analysis, and a blinded physician preference study on CheXpert.
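The "low-alignment tail" statistic in the abstract above can be illustrated with a minimal sketch. It assumes the reference threshold is a low empirical quantile of real-image CAS scores; the abstract does not say how the threshold is chosen, and the function names and toy scores below are hypothetical.

```python
def tail_fraction(scores, threshold):
    """Fraction of scores strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)


def reference_threshold(real_scores, quantile=0.05):
    """Assumed threshold: a low empirical quantile of real-image scores."""
    ordered = sorted(real_scores)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]


# Toy, made-up scores for illustration only (not results from the paper).
real = [0.62, 0.70, 0.75, 0.80, 0.81, 0.85, 0.88, 0.90, 0.93, 0.95]
baseline = [0.40, 0.55, 0.60, 0.72, 0.78, 0.83, 0.86, 0.90]
finetuned = [0.58, 0.66, 0.71, 0.77, 0.82, 0.85, 0.89, 0.92]

tau = reference_threshold(real)
base_tail = tail_fraction(baseline, tau)
new_tail = tail_fraction(finetuned, tau)

# Absolute reduction in percentage points versus relative reduction in percent.
point_reduction = 100 * (base_tail - new_tail)
relative_reduction = 100 * (base_tail - new_tail) / base_tail
```

The two final quantities differ in kind: the first is a difference of tail fractions (percentage points), the second divides that difference by the baseline tail, which is why the abstract reports both a point range and an average relative reduction.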

2605.12648 2026-05-14 cs.LG stat.ML

Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise

Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, Marius Kloft

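Affiliations: RPTU Kaiserslautern-Landau; Machine Learning Research, Morgan Stanley; Institute of Science and Technology, Klosterneuburg; Catholic University of Eichstätt-Ingolstadt

AI summary: This paper establishes the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as differentially private SGD (DP-SGD) with Gaussian perturbations that interpolate between independent and temporally correlated noise. The setting is closer to practical training because it uses mini-batch SGD and correlated-noise mechanisms, which improve the privacy-utility tradeoff. An auxiliary unprojected dynamics, a shifted iterate, and a high-probability bootstrap analysis resolve the difficulties of analyzing correlated-noise DP training in the non-convex regime, yielding the first optimization and generalization analysis of a correlated-noise mechanism in non-convex learning.

Abstract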

We establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as differentially private SGD (DP-SGD) with Gaussian perturbations that interpolate between independent and temporally correlated noise. This setting is substantially closer to practice than prior KAN theory along two axes: training is by mini-batch SGD, the standard recipe for modern networks, rather than full-batch gradient descent (GD); and correlated-noise mechanisms have empirically shown a more favorable privacy-utility tradeoff than independent-noise mechanisms. Our results cover the corresponding full-batch GD and independent-noise DP-GD results for KANs by Wang et al. (2026), while yielding sharper fixed-second-layer specializations. The technical core is a new analysis route for correlated-noise DP training in the non-convex regime. Temporal dependence breaks the conditional-centering structure underlying standard one-step SGD arguments, and the projection step obstructs the exact cancellation structure of correlated perturbations. We address these difficulties through an auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap certifying projection inactivity. Combining this optimization analysis with a stability-based generalization argument yields the stated population risk bounds. To the best of our knowledge, this is the first optimization and population risk analysis of a correlated-noise mechanism for DP training beyond convex learning, in particular for neural networks.
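As a rough illustration of the training loop the abstract describes, here is a minimal sketch of clipped mini-batch SGD with Gaussian noise that interpolates between independent and temporally correlated draws. The noise form `z_t - lam * z_{t-1}` is an assumed, simple choice, not the mechanism analyzed in the paper; the sketch also omits the projection step and any privacy calibration, and all names are hypothetical.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_clipped_sgd(per_example_grads_fn, w, steps, lr, max_norm, sigma, lam, rng):
    """Clipped mini-batch SGD with injected noise z_t - lam * z_{t-1}.

    lam = 0 gives independent Gaussian noise; lam > 0 subtracts part of the
    previous draw (an assumed form of temporal correlation).
    """
    prev = [0.0] * len(w)
    for _ in range(steps):
        grads = [clip(g, max_norm) for g in per_example_grads_fn(w)]
        mean = [sum(col) / len(grads) for col in zip(*grads)]
        z = [rng.gauss(0.0, sigma) for _ in w]
        noise = [zi - lam * pi for zi, pi in zip(z, prev)]
        w = [wi - lr * (mi + ni) for wi, mi, ni in zip(w, mean, noise)]
        prev = z
    return w


def toy_grads(w):
    """Per-example gradients of 0.5 * (w - x)^2 for data points x = 1 and x = 3."""
    return [[w[0] - 1.0], [w[0] - 3.0]]


# Noise-free run on the toy problem; the minimizer is the data mean, 2.0.
w_final = noisy_clipped_sgd(toy_grads, [0.0], steps=50, lr=0.5,
                            max_norm=10.0, sigma=0.0, lam=0.0, rng=random.Random(0))
```

In this toy form, setting `lam = 1` with zero gradients makes the injected noise telescope, so only the latest draw remains in the iterate. This appears to be the kind of exact cancellation that, per the abstract, a projection step obstructs.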

2605.12645 2026-05-14 cs.CL cs.AI

Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

Maryam Amirizaniani, Benjamin Charles Germain Lee, Jevin West, Nicholas Weber

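Affiliations: University of Washington

AI summary: This work aims to improve personalized question answering in single-turn settings by grounding answers in the user's implicit intent. The authors propose IAP, a reinforcement learning framework that infers user intent directly from a single-turn question and incorporates it into the reasoning process through a tag-based schema, producing more targeted answers. Experiments show that IAP consistently outperforms existing methods across multiple models, supporting the value of modeling implicit user intent during training.

Abstract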

Effective personalized question answering (PQA) in language models requires grounding responses in the user's underlying intent, where intent refers to the implicit "why" behind a query beyond its explicit wording. However, existing approaches to intent-aware personalization rely on multi-turn conversational context or rich user profiles, and do not explicitly model user intent during the reasoning process. This limits their effectiveness in single-turn settings, where the user's latent goal must be inferred from minimal input and integrated into the thinking and reasoning process. To bridge this gap, we propose IAP (Intent-Aware Personalization), a reinforcement learning framework that trains models to infer implicit user intent directly from a single-turn question and incorporate it into thinking steps through a tag-based schema for generating personalized, intent-grounded answers. By optimizing intent-aware answer trajectories under a personalized reward function, IAP reinforces generation paths that make implicit user intent explicit and produce responses that better align with the user's underlying goal. Through experiments on the LaMP-QA benchmark across six models, IAP consistently outperforms all baselines, achieving an average macro-score gain of around 7.5% over the strongest competitor, demonstrating that modeling implicit user intent within the training objective is a promising direction for PQA.

2605.12639 2026-05-14 cs.LG

OceanCBM: A Concept Bottleneck Model for Mechanistic Interpretability in Ocean Forecasting

Sanah Suri, Kieran Ringel, Maike Sonnewald

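Affiliations: University of California, Davis; University of Washington; NOAA Geophysical Fluid Dynamics Laboratory

AI summary: This paper presents OceanCBM, a mechanistically interpretable concept bottleneck model for ocean forecasting, addressing the lack of physical interpretability in conventional machine learning models that predict extreme ocean phenomena. The model uses mixed supervision to predict mixed layer heat content, routing information through prescribed concepts derived from geophysical fluid dynamics plus a free concept, which preserves physical consistency while maintaining predictive performance. Experiments show that OceanCBM provides explicit mechanistic explanations without sacrificing predictive skill and characterizes the interpretability-performance trade-off.

Comments 17 pages, 9 figures, 4 tables

Abstract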

Extreme ocean phenomena are challenging not only to predict but to diagnose, as accurate forecasts alone do not reveal the underlying physical drivers. While recent machine learning approaches achieve strong predictive skill, they remain largely opaque and provide limited guarantees of fidelity to ground-truth physics. We introduce OceanCBM, the first concept bottleneck model (CBM) for spatiotemporal prediction and mechanistic interrogation of ocean dynamics. OceanCBM uses mixed supervision to predict mixed layer heat content, a key precursor of marine heatwaves, while routing information through an intermediate layer of prescribed concepts derived from geophysical fluid dynamics and a 'free' concept. This design imposes soft physical structure without over-constraining the model, and the free concept both regularizes concept predictions and captures residual physical processes. Across ensemble initializations, we show that mixed supervision yields consistent mechanistic representations, whereas prediction-only and prescription-only baselines learn highly variable latent structures despite similar predictive performance. OceanCBM achieves interpretable, physically grounded representations without sacrificing skill, explicitly characterizing the interpretability-performance trade-off.

2605.12325 2026-05-14 cs.CV

VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, Siyue Yu, Feng Dai

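Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Liverpool

AI summary: This work tackles the efficiency and generalization problems that CLIP's spatial bias causes in training-free open-vocabulary semantic segmentation. Building on the spatially aware dino.txt framework, the authors propose Visual-guided Prompt evolution (VIP), which combines a visual-guided distillation mechanism with alias expansion to improve the semantic expressiveness of text queries, enabling more efficient and accurate dense prediction. Experiments show that VIP outperforms existing methods on multiple benchmarks, generalizes well across domains, and adds little inference overhead.

Comments Accepted by ICML2026. Code is available at https://github.com/MiSsU-HH/VIP

Abstract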

Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce Visual-guided Prompt evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. Towards this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP: 1. surpasses the top-leading methods by 1.4%-8.4% average mIoU, 2. generalizes well to diverse challenging domains, and 3. requires marginal inference time and memory overhead.

2605.12163 2026-05-14 cs.CV

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

Chenfeng Wang, Wei He, Xuhan Zhu, Chunpeng Zhou, Qizhen Li, Song Yan, Yufei Zheng, Chengjun Yu, Fan Lu, Wei Zhai, Yang Cao, Pengfei Yu, Zheng-Jun Zha

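Affiliations: University of Science and Technology of China; Li Auto Inc.

AI summary: This paper studies long latent sequence reasoning in vision-language models and finds that existing methods degrade as the latent sequence grows, owing to Information Gain Collapse and to heavily pooled image embeddings that carry no useful signal. The authors propose SCOLAR, a self-consistent latent reasoning method that uses a lightweight detransformer to generate auxiliary visual tokens, each independently anchored to the original visual space. Combined with multi-stage finetuning and reinforcement learning, it substantially extends the usable latent reasoning length and achieves state-of-the-art results among open-source models on several real-world benchmarks.

Comments 17 pages, 6 figures

Abstract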

In language reasoning, longer chains of thought consistently yield better performance, which naturally suggests that visual latent reasoning may likewise benefit from longer latent sequences. However, we discover a counterintuitive phenomenon: the performance of existing latent visual reasoning methods systematically degrades as the latent sequence grows longer. We reveal the root cause: Information Gain Collapse -- autoregressive generation makes each step highly dependent on prior outputs, so subsequent tokens can barely introduce new information. We further identify that heavily pooled ($\geq 128\times$) image embeddings used as supervision targets provide no more signal than meaningless placeholders. Motivated by these insights, we propose SCOLAR (Self-COnsistent LAtent Reasoning), which introduces a lightweight detransformer that leverages the LLM's full-sequence hidden states to generate auxiliary visual tokens in a single shot, with each token independently anchored to the original visual space. Combined with three-stage SFT and ALPO reinforcement learning, SCOLAR extends acceptable latent CoT length by over $30\times$, achieves state-of-the-art among open-source models on real-world reasoning benchmarks (+14.12% over backbone), and demonstrates strong out-of-distribution generalization.

2605.12145 2026-05-14 cs.CV

Cross-Modal-Domain Generalization Through Semantically Aligned Discrete Representations

Souptik Sen, Raneen Younis, Zahra Ahmadi

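Affiliations: Peter L. Reichertz Institute for Medical Informatics; Lower Saxony Center for AI and Causal Methods in Medicine (CAIMed)

AI summary: This work addresses the balance between cross-modal generalization and modality-specific structure in multimodal learning. It proposes CoDAAR, a framework that uses semantically aligned discrete representations to preserve each modality's unique structure while achieving cross-modal generalization within a unified discrete space. The method combines Discrete Temporal Alignment and Cascading Semantic Alignment, is trained with self-supervised reconstruction objectives, and achieves state-of-the-art performance on multiple cross-modal and cross-domain benchmarks.

Comments Added missing affiliation for co-author R. Younis and Z. Ahmadi

Abstract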

Multimodal learning seeks to integrate information across diverse sensory sources, yet current approaches struggle to balance cross-modal generalizability with modality-specific structure. Continuous (implicit) methods preserve fine-grained priors but render generalization challenging, while discrete (explicit) approaches enforce shared prototypes at the expense of modality specificity. We introduce CoDAAR (Cross-modal Discrete Alignment And Reconstruction), a novel framework that resolves this long-standing trade-off by establishing semantic consensus across modality-specific codebooks through index-level alignment. This design uniquely allows CoDAAR to preserve modality-unique structures while achieving generalizable cross-modal representations within a unified discrete space. CoDAAR combines two complementary mechanisms: Discrete Temporal Alignment (DTA), which enables fine-grained temporal quantization, and Cascading Semantic Alignment (CSA), which promotes progressive cross-modal semantic agreement. Together, they establish a competition-free unified representation space. Trained with self-supervised reconstruction objectives on paired multimodal sequences, CoDAAR demonstrates robust cross-modal and cross-domain generalization. Across Cross-Modal Generalization benchmarks, including event classification, localization, video segmentation, and cross-dataset transfer, CoDAAR achieves state-of-the-art performance, establishing a new paradigm for discrete and generalizable multimodal representation learning.

2605.12119 2026-05-14 cs.CV cs.GR

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

Haofeng Liu, Yang Zhou, Ziheng Wang, Zhengbo Xu, Zhan Peng, Jie Ma, Jun Liang, Shengfeng He, Jing Li

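Affiliations: Orange-3DV-Team

AI summary: This paper proposes MoCam, a unified novel view synthesis method that addresses the conflict between geometric and appearance priors in generative novel view synthesis. Through structured denoising dynamics, it progresses from geometry to appearance within the diffusion process: geometric priors first anchor coarse structure, then appearance priors correct geometric errors and refine details. Experiments show that MoCam performs especially well when point clouds contain severe holes or distortions, achieving effective geometry-appearance disentanglement and unified synthesis.

Comments Project page: https://orange-3dv-team.github.io/MoCam

Abstract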

Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.

2605.11989 2026-05-14 cs.CV cs.AI

A Transfer Learning Evaluation of Deep Neural Networks for Image Classification

Nermeen Abou Baker, Nico Zengeler, Uwe Handmann

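Affiliations: Computer Science Institute, Ruhr West University of Applied Sciences, 46236 Bottrop

AI summary: This paper studies how to select the pre-trained model that best meets target-domain requirements for image classification and examines how well transfer learning works in deep neural networks. The authors adjusted the output layers and general network parameters of eleven models pre-trained on ImageNet and applied them to five target datasets. They compare the models over single-episode and ten-episode training using accuracy, accuracy density, training time, and model size, providing guidance for model selection in transfer learning.

Comments Published by Machine Learning and Knowledge Extraction Journal

Journal ref Machine Learning and Knowledge Extraction 4, no. 1: 22-41 (2022)

Abstract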

Transfer learning is a machine learning technique that uses previously acquired knowledge from a source domain to enhance learning in a target domain by reusing learned weights. This technique is ubiquitous because of its great advantages in achieving high performance while saving training time, memory, and effort in network design. In this paper, we investigate how to select the best pre-trained model that meets the target domain requirements for image classification tasks. In our study, we refined the output layers and general network parameters to apply the knowledge of eleven image processing models, pre-trained on ImageNet, to five different target domain datasets. We measured the accuracy, accuracy density, training time, and model size to evaluate the pre-trained models both in training sessions in one episode and with ten episodes.

2605.11679 2026-05-14 cs.AI

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

ShiYing Huang, Liang Lin, Yuer Li, Kaiwen Luo, Zhenhong Zhou, An Zhang, Junhao Dong, Kun Wang, Zhigang Zeng

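Affiliations: Huazhong University of Science and Technology; Nanyang Technological University; Tsinghua University; Chongqing University

AI summary: In multi-objective alignment of large language models, balancing different human preferences often appears as a zero-sum conflict. This paper argues that the conflict arises because the prompt itself restricts the achievable multi-dimensional rewards, and proposes MORA (Multi-Objective Reward Assimilation), which expands reward dimensions to improve helpfulness, harmlessness, and other objectives together. Experiments report single-preference improvements of 5% to 12.4% in sequential alignment and a 4.6% average overall reward improvement in simultaneous alignment.

Abstract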

In the realm of multi-objective alignment for large language models, balancing disparate human preferences often manifests as a zero-sum conflict. Specifically, the intrinsic tension between competing goals dictates that aggressively optimizing for one metric (e.g., helpfulness) frequently incurs a substantial penalty on another (e.g., harmlessness). While prior work mainly focuses on data selection, parameter merging, or algorithmic balancing during training, these approaches merely force compromises between divergent preferences along a fixed Pareto frontier, failing to fundamentally resolve the inherent trade-off. In this work, we approach this problem from a novel perspective of multi-dimensional rewards. By scaling up the model's rollouts and analyzing the outputs across different reward dimensions, we arrive at a critical conclusion: the conflict among multiple objectives stems from the fact that the prompt itself inherently restricts the achievable multi-dimensional rewards. Based on this core observation, we propose MORA: Multi-Objective Reward Assimilation. Specifically, MORA isolates single-reward prompts through pre-sampling and expands their reward diversity by rewriting the original questions to incorporate multi-dimensional intents. Extensive experiments demonstrate that: (1) in sequential alignment, MORA achieves single-preference improvements ranging from 5% to 12.4%, with exceptional gains in harmlessness, after multiple-preference alignment across helpful, harmless, and truthful dimensions. (2) In simultaneous alignment, MORA achieves an average overall reward improvement of 4.6%. Our codes are available at https://github.com/Shiying-Huang/MORA-MPA.

2605.11572 2026-05-14 cs.CV

TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

Seongah Kim, Dinh Phu Tran, Hyeontaek Hwang, Saad Wazir, Duc Do Minh, Daeyoung Kim

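Affiliations: AI2 Lab, KAIST

AI summary: This work proposes TB-AVA, a parameter-efficient finetuning framework that addresses semantic correspondence in audio-visual alignment. Using text as a semantic bridge, TB-AVA builds on frozen audio and visual encoders and applies a text-guided semantic modulation module to enable cross-modal feature interaction and alignment. Experiments show state-of-the-art performance on multiple benchmarks, supporting text as an effective semantic anchor in audio-visual learning.

Comments 12 pages, 6 figures

Abstract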

Audio-visual understanding requires effective alignment between heterogeneous modalities, yet cross-modal correspondence remains challenging when temporally aligned audio and visual signals lack clear semantic correspondence. We propose to use text as a semantic anchor for audio-visual representation learning. To this end, we introduce a parameter-efficient adaptation framework built on frozen audio and visual encoders, centered on Text-Bridged Audio-Visual Adapter (TB-AVA), which enables text-mediated interaction between audio and visual streams. At the core of TB-AVA, Gated Semantic Modulation (GSM) selectively modulates feature channels based on text-inferred semantic relevance. We evaluate the proposed approach on multiple benchmarks, including AVE, AVS, and AVVP, where the proposed framework achieves state-of-the-art performance, demonstrating text as an effective semantic anchor for parameter-efficient fine-tuning (PEFT) in audio-visual learning.

2605.11533 2026-05-14 cs.CL cs.CV

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

Sike Xiang, Shuang Chen, Kevin Qinghong Lin, Jialin Yu, Yijia Sun, Philip Torr, Amir Atapour-Abarghouei

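Affiliations: Durham University; University of Oxford

AI summary: This work presents Checkup2Action, a multimodal clinical check-up report dataset for generating patient-oriented action cards. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, and imaging evidence. Each action card has structured fields: the clinical issue, its priority, the recommended department, the follow-up time window, a patient-facing explanation, and questions for clinicians. The authors formulate checkup-to-action generation as a constrained structured generation task and introduce a multi-dimensional evaluation protocol covering issue coverage, priority consistency, and department and time recommendation accuracy, providing a new benchmark for patient-oriented reasoning over clinical reports.

Abstract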

Clinical check-up reports are multimodal documents that combine page layouts, tables, numerical biomarkers, abnormality flags, imaging findings, and domain-specific terminology. Such heterogeneous evidence is difficult for laypersons to interpret and translate into concrete follow-up actions. Although large language models show promise in medical summarisation and triage support, their ability to generate safe, prioritised, and patient-oriented actions from multimodal check-up reports remains under-benchmarked. We present Checkup2Action, a multimodal clinical check-up report dataset and benchmark for structured Action Card generation. Each card describes one clinically relevant issue and specifies its priority, recommended department, follow-up time window, patient-facing explanation, and questions for clinicians, while avoiding diagnostic or treatment-prescriptive claims. The dataset contains 2,000 de-identified real-world check-up reports covering demographic information, physical examinations, laboratory tests, cardiovascular assessments, and imaging-related evidence. We formulate checkup-to-action generation as a constrained structured generation task and introduce an evaluation protocol covering issue coverage and precision, priority consistency, department and time recommendation accuracy, action complexity, usefulness, readability, and safety compliance. Experiments with general-purpose and medical large language models reveal clear trade-offs between issue coverage, action correctness, conciseness, and safety alignment. Checkup2Action provides a new multimodal benchmark for evaluating patient-oriented reasoning over clinical check-up reports.

2605.11505 2026-05-14 cs.AI

Selective Off-Policy Reference Tuning with Plan Guidance

Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, Trung Le

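Affiliations: Hanoi University of Science and Technology; University of Oregon; Monash University

AI summary: This paper studies reinforcement learning with verifiable rewards for reasoning and proposes SORT to address the poor performance of GRPO-style methods on hard prompts. Without changing rollout generation, SORT derives a plan from the reference solution and uses it to reweight the policy update, improving the model's ability to learn structural information. Experiments show that SORT outperforms existing methods on multiple reasoning benchmarks, with the largest gains on weaker models.

Abstract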

Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.
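The token-weighting idea in the abstract above can be sketched as follows. This is only an illustration under assumptions: the abstract says tokens that become more predictable under plan conditioning get higher weight, but it does not give the formula, so the positive-part-and-normalize rule and both function names here are hypothetical.

```python
def plan_gain_weights(logp_with_plan, logp_without_plan):
    """Per-token weights from the log-probability gain under plan conditioning.

    Assumed rule: the weight is the positive part of the gain, normalized to
    sum to 1; tokens that do not become more predictable get weight 0.
    """
    gains = [max(0.0, a - b) for a, b in zip(logp_with_plan, logp_without_plan)]
    total = sum(gains)
    if total == 0.0:
        return [1.0 / len(gains)] * len(gains)  # fall back to uniform weights
    return [g / total for g in gains]


def weighted_nll(logp_policy, weights):
    """Weighted negative log-likelihood of the reference tokens."""
    return -sum(w * lp for w, lp in zip(weights, logp_policy))


# Toy log-probabilities for three reference-solution tokens.
with_plan = [-1.0, -0.5, -2.0]
without_plan = [-1.0, -2.5, -3.0]
weights = plan_gain_weights(with_plan, without_plan)
loss = weighted_nll(without_plan, weights)
```

In the toy call, the first token is equally likely with and without the plan and so receives zero weight, which is the contrast with uniform imitation that the abstract draws.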

2605.11492 2026-05-14 cs.CV

A Mimetic Detector for Adversarial Image Perturbations

Johnny Corbino

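Affiliations: Lawrence Berkeley National Laboratory

AI summary: This work proposes a single-shot detector for adversarial image perturbations that needs no training and no access to the target network. Built on high-order Corbino-Castillo mimetic operators, it captures the high-frequency, near-random gradient energy that adversarial examples produce at the pixel level. Experiments on a standard test image show a clear separation between clean and adversarial images that grows with operator order.

Comments v2: extended Table 1 with results for order $k=8$; minor revisions for clarity

Abstract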

Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard $\ell^\infty$-bounded attacks (FGSM, PGD, and the $\ell^\infty$ variant of Carlini-Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in $\ell^2$, but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino-Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in $O(HW)$ time. We validate the detector on the standard peppers test image at the canonical $\ell^\infty$ budget $\varepsilon = 16/255$ and observe a clean-vs-adversarial separation that grows monotonically from $3.55\times$ at order $k=2$ to $4.62\times$ at $k=8$.
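To make the gradient-energy idea above concrete, here is a minimal sketch that swaps in plain first-order forward differences for the high-order mimetic operators, and a random sign pattern for a real FGSM or PGD attack. Both substitutions are assumptions for illustration, and the function names and toy image are hypothetical; MOLE is not used.

```python
import random


def gradient_energy(img):
    """Mean squared forward-difference gradient over an H x W image."""
    h, w = len(img), len(img[0])
    total = 0.0
    for i in range(h - 1):
        for j in range(w - 1):
            dx = img[i][j + 1] - img[i][j]
            dy = img[i + 1][j] - img[i][j]
            total += dx * dx + dy * dy
    return total / ((h - 1) * (w - 1))


def add_sign_noise(img, eps, rng):
    """Add a random +/- eps pattern, a stand-in for an l-infinity-bounded attack."""
    return [[p + eps * rng.choice((-1.0, 1.0)) for p in row] for row in img]


rng = random.Random(0)
size = 32
# Smooth toy image: a diagonal intensity ramp in [0, 1].
clean = [[(i + j) / (2 * (size - 1)) for j in range(size)] for i in range(size)]
noisy = add_sign_noise(clean, 16 / 255, rng)

# Ratio of gradient energies; values well above 1 flag the perturbed input.
ratio = gradient_energy(noisy) / gradient_energy(clean)
```

The score needs one pass over the pixels, consistent with the $O(HW)$ cost stated in the abstract. A smooth ramp exaggerates the separation compared with a natural image, so the ratio here should not be compared with the reported figures.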

2605.11444 2026-05-14 cs.CV

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

Eunho Lee, Rei Kawakami, Youngbae Hwang

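Affiliations: Chungbuk National University; Institute of Science Tokyo

AI summary: This work proposes a unified image restoration framework guided by a multimodal large language model (MLLM), aiming to recover clean images from inputs affected by diverse, unknown degradations. Existing methods treat degradations as discrete categories and so cannot model the continuous relationships in composite degradations. To address this, the authors use multimodal embeddings to guide restoration and design an MLLM-guided fusion block and a mixture-of-frequency-experts module that enhance degradation-aware representations and adaptively combine frequency experts. Experiments show strong performance on multiple benchmarks and a new state of the art on the CDD11 dataset.

Abstract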

All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.

2605.11405 2026-05-14 cs.LG

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

DatologyAI: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Haakon Mongstad, Alvin Deng, Aldo Carranza, Alex Fang, Amro Abbas, Anshuman Suri, Brett Larsen, Daniel Zayas, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Ties Robroek, Tony Jiang, Vidhi Jain, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

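Affiliations: DatologyAI

AI summary: This work asks whether data curation alone can improve vision-language model (VLM) performance. Holding architecture, training recipe, and compute fixed, the authors curate the MAmmoTH-VL single-image subset and substantially improve performance across multiple public benchmarks and capability axes. The curated 2B-parameter model surpasses InternVL3.5-2B and comes within 1.8pp of Qwen3-VL-2B, with clear advantages in reliability, generalization, behavior, and inference efficiency, demonstrating data curation as a high-leverage tool for building efficient VLMs.

Comments 33 pages, 15 figures. DatologyAI website for more details: https://www.datologyai.com/

Abstract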

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.

2605.11347 2026-05-14 cs.LG cs.AI cs.CV

Gradient-Free Noise Optimization for Reward Alignment in Generative Models

Jeongsol Kim, Hongeun Kim, Jian Wang, Jong Chul Ye

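Affiliations: KAIST AI; Snap Inc.

AI summary: This paper proposes ZeNO, a gradient-free noise optimization method for reward alignment in generative models. It formulates noise optimization as a path-integral control problem that relies only on zeroth-order reward evaluations, avoiding the backpropagation that existing approaches require. ZeNO performs well across a range of generators and reward functions and is particularly suited to settings where backpropagation is infeasible, such as protein structure generation.

Abstract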

Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein--Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
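The general flavor of gradient-free noise optimization can be sketched with a generic reward-weighted update in the style of path-integral control. This is not ZeNO's actual update rule, which the abstract does not spell out: the exponential weighting, the hyperparameters, and the toy reward are all assumptions, and the Ornstein-Uhlenbeck reference process is omitted.

```python
import math
import random


def zeroth_order_step(noise, reward_fn, rng, num_samples=64, step=0.3, temperature=0.1):
    """One gradient-free update: reward-weighted average of random perturbations."""
    candidates = [[x + step * rng.gauss(0.0, 1.0) for x in noise]
                  for _ in range(num_samples)]
    rewards = [reward_fn(c) for c in candidates]  # black-box evaluations only
    top = max(rewards)  # subtract the max before exponentiating, for stability
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [sum(w * c[d] for w, c in zip(weights, candidates)) / total
            for d in range(len(noise))]


def toy_reward(x):
    """Hypothetical black-box reward that peaks at (1, -1)."""
    return -((x[0] - 1.0) ** 2 + (x[1] + 1.0) ** 2)


rng = random.Random(0)
z = [0.0, 0.0]
start = toy_reward(z)
for _ in range(30):
    z = zeroth_order_step(z, toy_reward, rng)
end = toy_reward(z)
```

The update only ever calls `reward_fn` on candidate noise vectors, so in a real pipeline the generator and reward model could be non-differentiable; the cost is many reward evaluations per step.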

2605.11299 2026-05-14 cs.LG cs.CL cs.SE

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

Yizhu Jiao, Ruixiang Zhang, Richard Bai, Jiawei Han, Ronan Collobert, Yizhe Zhang

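Affiliations: Apple; University of Illinois Urbana-Champaign

AI summary: This paper proposes DuST, a self-training framework that improves code generation models by learning in a dual judgment space. Conventional training uses feedback on a single generated result, whereas DuST exploits the relative-correctness information revealed when multiple candidates are sampled and judged at test time, building a richer judgment space for training. Experiments show that DuST improves test-time generation quality and judgment ability across several large models without directly rewarding correct generations, transferring what is learned in the judgment space back to generation.

Abstract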

Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
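Two steps of the pipeline described above lend themselves to a small sketch: keeping only candidate groups with mixed outcomes, and scoring a predicted ranking against execution labels. Using NDCG as the ranking score is an assumption here (the abstract reports NDCG as an evaluation metric but does not state the training reward), and the function names are hypothetical.

```python
import math


def mixed_groups(groups):
    """Keep only candidate groups that contain both passing and failing programs."""
    return [g for g in groups if any(g) and not all(g)]


def dcg(labels):
    """Discounted cumulative gain for binary relevance labels in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(labels))


def ndcg(ranking, passed):
    """NDCG of a predicted ranking (candidate indices, best first).

    Assumes the group is mixed, so the ideal DCG is nonzero.
    """
    ranked = [1.0 if passed[i] else 0.0 for i in ranking]
    ideal = sorted(ranked, reverse=True)
    return dcg(ranked) / dcg(ideal)


# Each group lists sandbox pass/fail outcomes for sampled candidate programs.
groups = [[True, True], [False, True, False, True], [False, False]]
kept = mixed_groups(groups)

passed = kept[0]
perfect = ndcg([1, 3, 0, 2], passed)   # both passing programs ranked first
unsorted = ndcg([0, 1, 2, 3], passed)  # ranking ignores correctness
```

All-pass and all-fail groups are dropped because any ranking of them scores the same, so they carry no comparative signal.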

2605.11206 2026-05-14 cs.CL

Instructions Shape Production of Language, not Processing

Andreas Waldis, Leshem Choshen, Yufang Hou, Yotam Perlitz

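Affiliations: Department of Linguistics, University of Tübingen; IBM Research, MIT, and MIT-IBM Watson AI Lab; Interdisciplinary Transformation University of Austria; IBM Research

AI summary: This study examines how instructions affect a language model's production of language rather than its processing. Through layer-wise analysis on five binary judgment tasks, it finds that instructions mainly shape the information in output tokens, while the information in input (sample) tokens stays relatively stable. Intervention experiments show that blocking instructions has a large effect at the output stage and little effect at the input stage, revealing an asymmetry between production and processing. The findings stress that assessing model capabilities requires looking at both internals and behavior, and distinguishing the roles of the input and output stages.

Details
English abstract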

Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.
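
As a rough illustration of the attention-based intervention described in the abstract above, the sketch below builds a causal attention mask in which selected query positions cannot attend to instruction tokens. The role labels, the token layout, and the function name `blocked_causal_mask` are hypothetical; the abstract does not specify how the authors implement the blocking.

```python
def blocked_causal_mask(roles, blocked_query_roles):
    """Return mask[q][k] = True if query position q may attend to key position k.

    Starts from a standard causal mask, then removes attention from any query
    whose role is in `blocked_query_roles` to every earlier instruction token.
    """
    n = len(roles)
    mask = [[k <= q for k in range(n)] for q in range(n)]
    for q in range(n):
        if roles[q] in blocked_query_roles:
            for k in range(q + 1):
                if roles[k] == "instruction":
                    mask[q][k] = False
    return mask

# Hypothetical prompt layout: two instruction tokens, two sample tokens, one output token.
roles = ["instruction", "instruction", "sample", "sample", "output"]

# Variant 1: block instruction flow to all subsequent tokens.
mask_all = blocked_causal_mask(roles, {"sample", "output"})
# Variant 2: block instruction flow only to sample tokens.
mask_sample_only = blocked_causal_mask(roles, {"sample"})
```

In the second variant the output position can still attend to the instruction directly, which is the condition the abstract reports as having minimal effect on behavior.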

2605.10983 2026-05-14 cs.LG cs.AI cs.CV

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Jiaming Li, Chenyu Zhu, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou, Zhiyuan Ma

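Affiliations: Huazhong University of Science and Technology; Kuaishou Technology; Nanyang Technological University; Wuhan University; Tsinghua University

AI summary: To address reward hacking when aligning diffusion models to downstream tasks, this work proposes Trajectory Matching Policy Optimization (TMPO), which replaces conventional scalar reward maximization with trajectory-level reward distribution matching. TMPO introduces a Softmax Trajectory Balance objective that aligns policy probabilities with a reward-induced Boltzmann distribution, and the authors prove that the objective preserves coverage over multiple trajectory modes. TMPO also uses Dynamic Stochastic Tree Sampling to speed up training of large-scale flow-matching models. Experiments show that it improves generative diversity over existing methods while remaining competitive on task performance.

Details
English abstract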

Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most existing RL-based methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
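
The idea of matching K trajectory probabilities to a reward-induced Boltzmann distribution can be illustrated with a small sketch. This is not the paper's Softmax-TB objective, whose exact form the abstract does not give. It assumes one plausible reading: normalize the policy's trajectory log-probabilities over the K samples, normalize `reward / temperature` the same way, and penalize the forward KL divergence from the reward-induced target to the policy. The function names and the `temperature` parameter are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def trajectory_matching_loss(log_probs, rewards, temperature=1.0):
    """Forward KL(target || policy) over K sampled trajectories.

    target_i is proportional to exp(reward_i / temperature) (Boltzmann weights);
    policy_i is proportional to exp(log_prob_i), renormalized over the K samples.
    """
    target = softmax([r / temperature for r in rewards])
    policy = softmax(log_probs)
    return sum(q * (math.log(q) - math.log(p))
               for q, p in zip(target, policy) if q > 0)

# Three equally rewarded trajectories: a policy that spreads its mass evenly
# matches the target, while a policy collapsed onto one trajectory is penalized.
spread = trajectory_matching_loss([-1.0, -1.0, -1.0], [1.0, 1.0, 1.0])
collapsed = trajectory_matching_loss([0.0, -10.0, -10.0], [1.0, 1.0, 1.0])
```

Under this reading, a scalar reward maximizer would be indifferent between the two policies in the example, whereas the distribution-matching loss prefers the one that keeps all three acceptable trajectories alive.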

2605.10906 2026-05-14 cs.LG cs.AI

DataMaster: Data-Centric Autonomous AI Research

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, Siheng Chen

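Affiliations: Shanghai Jiao Tong University; Carnegie Mellon University; Zhejiang University; Beijing University of Aeronautics and Astronautics

AI summary: As models, training methods, and compute become standardized in machine learning systems, further performance gains depend more and more on data. The study proposes DataMaster, an autonomous data-engineering agent framework that improves downstream task performance by optimizing data selection, composition, and processing without changing the learning algorithm. It has three core components, a DataTree, a shared Data Pool, and a Global Memory, which let the agent explore the data space, reuse discovered data, and accumulate experience. Experiments on two benchmarks, MLE-Bench Lite and PostTrainBench, show improvements over the baselines.

Details
English abstract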

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).

2605.10896 2026-05-14 cs.LG

V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction

Marcin Kostrzewa, Sebastian Tomczak, Roman Furman, Anna Poberezhna, Michał Furgała, Julia Farganus, Oleksii Furman, Maciej Zięba

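Affiliations: Department of Artificial Intelligence, Wrocław University of Science and Technology; Tooploox Opera

AI summary: This study presents V4FinBench, a benchmark dataset of more than one million company-year records for evaluating tabular models, large language models, and standard methods on corporate bankruptcy prediction. The dataset covers companies in the Visegrád Group countries from 2006 to 2021, includes 131 financial and non-financial features, and supports prediction over multiple horizons. Comparing models under class imbalance, the study finds that TabPFN with imbalance-aware finetuning matches or exceeds gradient boosting at longer horizons, while Llama-3-8B is weaker overall. The public release of V4FinBench is intended to support research on prediction methods for realistic financial data.

Details
English abstract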

Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegrád Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.

2605.10819 2026-05-14 cs.RO cs.AI cs.CV

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, Jiachen Luo, De Ma, Zhiheng Ma, Gang Pan

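Affiliations: Zhejiang University; Amap, Alibaba Group; Nanjing University; Shenzhen University of Advanced Technology; Beijing University of Chemical Technology; Embodied Intelligence General Platform Laboratory, Chery Auto; Tsinghua University; Queen Mary University of London

AI summary: Vision-language-action (VLA) models are limited by the scarcity of action-labeled robot data, while action-free videos carry rich information about how the physical world changes. This paper proposes ALAM (Algebraically Consistent Latent Action Model), which learns structured latent action transitions from action-free video to give policy generation a consistent transition structure. ALAM learns latent transitions from frame triplets that satisfy reconstruction, composition, and reversal consistency, and couples them with policy generation through a joint flow-matching objective, yielding substantial gains on several VLA benchmarks.

Details
English abstract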

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
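
The composition and reversal consistency terms mentioned in the abstract above can be read as additivity constraints on latent transitions between three frames a, b, c. The sketch below is an assumption-laden illustration: it treats latent transitions as plain vectors, measures violations with an L2 norm, and uses hypothetical function names. The paper's actual regularizers may differ.

```python
import math

def l2(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def composition_error(z_ab, z_bc, z_ac):
    """How far the direct transition a->c is from the sum of a->b and b->c."""
    return l2([ac - (ab + bc) for ab, bc, ac in zip(z_ab, z_bc, z_ac)])

def reversal_error(z_ab, z_ba):
    """How far the backward transition b->a is from the negated forward one."""
    return l2([ab + ba for ab, ba in zip(z_ab, z_ba)])

# A perfectly additive toy example: the errors are zero.
z_ab, z_bc = [1.0, 0.0], [0.0, 2.0]
consistent = composition_error(z_ab, z_bc, [1.0, 2.0])
reversed_ok = reversal_error(z_ab, [-1.0, 0.0])
# A direct transition that drifts by one unit in the second coordinate.
drifted = composition_error(z_ab, z_bc, [1.0, 3.0])
```

Driving both errors toward zero is one way to encourage the "locally additive transition space" that the abstract describes.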

2605.10685 2026-05-14 cs.AI

GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing

Yanjie Li, Liping Zhang, Min Wu, Weijun Li, Lina Yu, Jingyi Liu, Yusong Deng, Mingzhu Wan, Xin Ning

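Affiliations: AnnLab; Institute of Semiconductors, Chinese Academy of Sciences; Zhongguancun Academy

AI summary: This paper proposes GESR, a symbolic regression method based on gene editing that aims to improve the efficiency and performance of traditional genetic programming (GP) on symbolic regression tasks. The method introduces two BERT models as "hands of God": one guides gene mutation and the other predicts the crossover point for gene recombination, enabling more targeted gene editing. Experiments show that GESR improves on traditional GP in both computational efficiency and task performance.

Comments: 70 pages

Details
English abstract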

Mathematical formulas serve as a language through which humans communicate with nature. Discovering mathematical laws from scientific data to describe natural phenomena has been a long-standing pursuit of humanity for centuries. In the field of artificial intelligence, this challenge is known as the symbolic regression problem. Among existing symbolic regression approaches, Genetic Programming (GP) based on evolutionary algorithms remains one of the most classical and widely adopted methods. GP simulates the evolutionary process across generations through genetic mutation and crossover. However, mutations and crossovers in GP are entirely random. While this randomness effectively mimics natural evolution, it inevitably produces both beneficial and detrimental variations. If there existed a metaphorical "God" capable of foreseeing which genetic mutations or crossovers would yield superior outcomes and performing targeted gene editing accordingly, the efficiency of evolution could be substantially improved. Motivated by this idea, we propose in this paper a symbolic regression approach based on gene editing, termed GESR. In GESR, we train two "hands of God" (two BERT models). The first leverages BERT's masked language modeling capability to guide the mutation of genes (expression symbols). The second guides the crossover of individual genes by predicting the crossover point. Experimental results demonstrate that GESR significantly improves computational efficiency compared with traditional GP algorithms and achieves strong overall performance across multiple symbolic regression tasks.

2605.10426 2026-05-14 cs.CV cs.AI

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, Gong Che

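Affiliations: Afari Intelligent Drive; University of Electronic Science and Technology of China; Shanghai Jiao Tong University; Beijing University of Posts and Telecommunications; Tianjin University

AI summary: This paper proposes CoWorld-VLA, a multi-expert world-model framework for autonomous driving that addresses the weakness of existing vision-language-action (VLA) models in planning-oriented intermediate representations. The method extracts complementary world information through multi-source supervision and encodes it into expert tokens that act as explicit conditions for the planner, guiding action generation more effectively. Experiments show that CoWorld-VLA performs well on future scene generation and planning, with particular strengths in collision avoidance and trajectory accuracy.

Details
English abstract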

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.

2605.10267 2026-05-14 cs.AI

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding

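Affiliations: Multimodal and Industrial AI Team

AI summary: This paper presents IndustryBench, a Chinese-language question-answering benchmark for industrial procurement built from Chinese national standards and industrial product records, for evaluating large language models at the boundary of their industrial knowledge. The benchmark contains 2,049 items covering seven capability dimensions and ten industry categories, and an external-verification stage rejects 70.3% of LLM-generated candidates during construction. The evaluation reveals marked weaknesses of current models in industrial safety and standards compliance. Even the best model reaches only 2.083 on the 0-3 rubric, and safety violations noticeably change model rankings, indicating that LLM evaluation in industrial settings needs to put more weight on safety and standards compliance.

Details
English abstract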

In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $κ_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0-3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard: GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.

2605.10187 2026-05-14 cs.CV

SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

Longteng Guo, Xuanxu Lin, Dongze Hao, Tongtian Yue, Pengkang Huo, Jiatong Ma, Yuchen Liu, Jing Liu

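Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; OPPO AI Center

AI summary: SciVQR is a multimodal scientific reasoning benchmark spanning mathematics, physics, chemistry, and other disciplines, designed to evaluate how well multimodal large language models handle complex scientific problems. It includes domain-specific visuals such as charts and equations and requires models to combine visual understanding with multi-step reasoning. Tasks range from basic factual recall to complex inference, and expert-authored solutions are provided for reference. The study finds that current leading multimodal models still fall clearly short on interdisciplinary, multi-step scientific reasoning, underscoring the need for stronger reasoning and better integration of subject knowledge.

Details
English abstract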

Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.

2605.09968 2026-05-14 cs.LG math.OC stat.ML

Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

Debashis Guha

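Affiliations: S P Jain School of Global Management

AI summary: This paper proposes Consolidation-Expansion Operator Mechanics (OpMech), a unified framework describing how adaptive learning systems alternate between consolidating existing knowledge and expanding into new knowledge. The core concept is the order-gap, which measures how far the consolidation and expansion operators are from commuting at a given knowledge state and can serve as a real-time control signal for the learning process. The framework applies to several domains, including reinforcement learning, continual learning, and recursive language models, and provides an order-gap-based stopping rule with theoretical guarantees.

Comments: 38 pages; corrected author affiliation on title page in v2; no scientific changes

Details
English abstract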

Every adaptive learning system must alternate between two operations: consolidating what it already knows and expanding into new evidence. We propose Consolidation-Expansion Operator Mechanics (OpMech), a framework that makes this structure precise. The central object is the order-gap $\Ogap(θ; e)$, the degree to which a consolidation operator $Q$ and an expansion operator $P_e$ fail to commute at a given knowledge state. Because the order-gap is computable from the system's own trajectory, it serves as a real-time control signal: large values indicate that the system is still sensitive to the ordering of consolidation and expansion; once the order-gap falls and stays small, further processing is unlikely to change the outcome. Three results give the signal precise meaning: the order-gap decays along convergent trajectories; a persistently large order-gap implies the system is far from its settled state; and an order-gap-based stopping rule terminates with provable guarantees in both noiseless and bounded-noise settings. The framework applies across five domains: bandits, reinforcement learning, stochastic optimization, continual learning, and recursive language models. We give conditions under which the order-gap reliably tracks convergence in three representative cases. We develop the recursive language model application in detail, showing how OpMech replaces heuristic stopping rules and fixed recursion budgets with principled, evidence-driven alternatives.
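
To make the order-gap concrete, the toy sketch below measures how far two operators are from commuting on a scalar state and stops once that gap stays small. Everything specific here is an assumption made for illustration: the absolute-difference form of the gap, the two toy operators (chosen only because they fail to commute away from a shared fixed point), and the `tol` and `patience` parameters of the stopping rule. The paper's definitions and guarantees are not reproduced.

```python
def order_gap(theta, consolidate, expand):
    """Absolute difference between applying the two operators in either order."""
    return abs(consolidate(expand(theta)) - expand(consolidate(theta)))

def run_until_settled(theta, consolidate, expand,
                      tol=1e-6, patience=3, max_steps=100):
    """Iterate expand-then-consolidate; stop once the gap stays <= tol."""
    calm = 0
    for step in range(1, max_steps + 1):
        gap = order_gap(theta, consolidate, expand)
        calm = calm + 1 if gap <= tol else 0
        if calm >= patience:
            return theta, step
        theta = consolidate(expand(theta))
    return theta, max_steps

EVIDENCE = 0.5  # hypothetical value that the evidence points to

def expand(theta):
    # Toy expansion: move halfway toward the evidence.
    return EVIDENCE + 0.5 * (theta - EVIDENCE)

def consolidate(theta):
    # Toy consolidation: nonlinear squashing of the deviation from the evidence.
    d = theta - EVIDENCE
    return EVIDENCE + d / (1.0 + abs(d))

far_gap = order_gap(4.5, consolidate, expand)
near_gap = order_gap(0.6, consolidate, expand)
theta_final, steps = run_until_settled(4.5, consolidate, expand)
```

In this toy setting the gap shrinks as the state approaches the shared fixed point, so the stopping rule fires only after the iterate has nearly settled, which mirrors the behavior the abstract attributes to the order-gap.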

2605.09725 2026-05-14 cs.CV

On-Policy Distillation with Best-of-N Teacher Rollout Selection

Ke Zhang, Yunjie Tian, Dongdi Zhao, Yijiang Li, Yuanye Liu, Vishal M Patel, Di Fu

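Affiliations: Johns Hopkins University; TikTok; University of California, San Diego; Fudan University

AI summary: This paper proposes BRTS, a framework that improves on-policy distillation (OPD) to raise model performance on complex reasoning tasks. BRTS selects the best auxiliary trajectory from several teacher rollouts, reducing noise and variance in the supervision signal and improving what the student model learns. Experiments show that BRTS outperforms standard OPD on several mathematical reasoning benchmarks, with the largest gains on harder datasets.

Comments: 10 pages, 5 figures

Details
English abstract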

On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
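
The priority rule described in the abstract above (correctness first, student alignment second, with a recovery step when no sampled rollout is correct) can be sketched as a small selection function. The `Rollout` fields, the `alignment` score, and the `recover` callback are hypothetical placeholders; the abstract does not say how alignment is measured or how the ground-truth-conditioned recovery is prompted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Rollout:
    text: str
    correct: bool      # whether the final answer matches the ground truth
    alignment: float   # hypothetical score: higher = closer to the student's behavior

def select_teacher_rollout(
    pool: List[Rollout],
    recover: Optional[Callable[[], Rollout]] = None,
) -> Optional[Rollout]:
    """Pick the correct rollout best aligned with the student, else try recovery."""
    correct = [r for r in pool if r.correct]
    if correct:
        return max(correct, key=lambda r: r.alignment)
    # No correct sample: fall back to a recovery step if one is provided.
    return recover() if recover is not None else None

pool = [
    Rollout("a", correct=False, alignment=0.9),
    Rollout("b", correct=True, alignment=0.2),
    Rollout("c", correct=True, alignment=0.7),
]
chosen = select_teacher_rollout(pool)
```

Note that rollout "a" is never chosen despite having the highest alignment score, because correctness takes priority over alignment.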

2605.09423 2026-05-14 cs.AI

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin

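Affiliations: UC San Diego; New York University

AI summary: This paper presents SimWorld Studio, an open-source platform built on Unreal Engine 5 that automatically generates interactive 3D learning environments for embodied agents. At its core is SimCoder, a tool- and skill-augmented coding agent that writes and executes engine-level code from language or image instructions to build physically grounded 3D worlds, and that self-evolves using verifier feedback. The platform enables co-evolution of environment generation and embodied learning, improving both agent performance and the reliability of environment generation.

Details
English abstract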

LLM/VLM-based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open-source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self-evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym-style environments for embodied agent learning. SimWorld Studio further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self-evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co-evolution yields an 18-point success-rate gain over fixed-environment learning and a 40-point gain over an untrained agent.