arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11492 2026-05-14 cs.CV

A Mimetic Detector for Adversarial Image Perturbations

Johnny Corbino

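AI summary: This study proposes a training-free, single-shot method for detecting adversarial perturbations in images that requires no access to the target network. The method builds on high-order Corbino–Castillo mimetic operators, which capture the high-frequency, near-random gradient energy that adversarial examples produce at the pixel level. Experiments show that the detector clearly separates clean from adversarial images on a standard test image, and that the separation improves as the operator order increases.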

Comments v2: extended Table 1 with results for order $k=8$; minor revisions for clarity

English abstract

Adversarial attacks fool deep image classifiers by adding tiny, almost invisible noise patterns to a clean image. The standard $\ell^\infty$-bounded attacks (FGSM, PGD, and the $\ell^\infty$ variant of Carlini--Wagner) produce high-frequency, near-random sign patterns at the pixel level: nearly invisible in $\ell^2$, but carrying disproportionate gradient energy. We exploit this with a single-shot, training-free detector using the high-order Corbino--Castillo mimetic operators from the open-source MOLE library. No retraining, no surrogate classifier, no access to the network under attack: the verdict is a property of the input alone, computed in $O(HW)$ time. We validate the detector on the standard \texttt{peppers} test image at the canonical $\ell^\infty$ budget $\varepsilon = 16/255$ and observe a clean-vs-adversarial separation that grows monotonically from $3.55\times$ at order $k=2$ to $4.62\times$ at $k=8$.
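The gradient-energy idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's detector: it assumes NumPy's built-in central differences (`np.gradient`) in place of the Corbino–Castillo mimetic operators from MOLE, a synthetic smooth image in place of `peppers`, and a random-sign perturbation in place of a real FGSM or PGD attack. The function name `gradient_energy` is hypothetical.

```python
import numpy as np


def gradient_energy(img: np.ndarray) -> float:
    """Mean squared gradient magnitude of a 2-D grayscale image."""
    # np.gradient uses second-order central differences in the interior;
    # this is a stand-in for the high-order mimetic operators, not MOLE.
    gy, gx = np.gradient(img.astype(np.float64))
    return float(np.mean(gx ** 2 + gy ** 2))


rng = np.random.default_rng(0)

# Hypothetical smooth "clean" image with values in [0.25, 0.75].
yy, xx = np.mgrid[0:64, 0:64] / 64.0
clean = 0.5 + 0.25 * np.sin(2 * np.pi * xx) * np.cos(2 * np.pi * yy)

# Random-sign perturbation at the l-infinity budget from the abstract.
eps = 16 / 255
adv = clean + eps * rng.choice([-1.0, 1.0], size=clean.shape)

ratio = gradient_energy(adv) / gradient_energy(clean)
print(f"adversarial / clean gradient energy: {ratio:.2f}")
```

The statistic needs only the input image and one pass over its pixels, which is consistent with the $O(HW)$ cost stated in the abstract. The ratio printed here depends on the synthetic image and should not be compared with the paper's reported separations.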

2605.11444 2026-05-14 cs.CV

Leveraging Multimodal Large Language Models for All-in-One Image Restoration via a Mixture of Frequency Experts

Eunho Lee, Rei Kawakami, Youngbae Hwang

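AI summary: This study proposes a unified image restoration framework based on multimodal large language models (MLLMs) that aims to recover clean images from inputs affected by diverse, unknown degradations. Because existing methods treat degradations as discrete categories and cannot model the continuous relationships in composite degradations, the authors use multimodal embeddings to guide restoration and design an MLLM-guided fusion block and a mixture-of-frequency-experts module that strengthen degradation-aware representations and adaptively combine frequency experts. Experiments show strong results on multiple benchmarks and a new state of the art on the CDD11 dataset.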

English abstract

All-in-one image restoration seeks to recover clean images from inputs affected by diverse and unknown degradations using a unified framework. Recent methods have shown strong performance by identifying degradation characteristics to guide the restoration process. However, many of them treat degradations as discrete categories, which limits their ability to model the continuous relational structure that arises in composite degradations. To address this issue, we propose a multimodal large language model (MLLM)-guided image restoration framework that exploits multimodal embeddings as guidance for low-level restoration. Specifically, MLLM-derived features are injected into an encoder-decoder architecture through an MLLM-guided fusion block (MGFB) to enhance degradation-aware representations. In addition, we incorporate a mixture-of-frequency-experts (MoFE) module that adaptively combines frequency experts using MLLM-guided contextual cues. To further improve expert routing, we design an MLLM-guided router with a relational alignment loss that encourages routing patterns consistent with the embedding-space relationships of degraded inputs. Extensive experiments on multiple benchmarks show that the proposed method achieves strong performance across diverse restoration settings and establishes a new state of the art on the challenging CDD11 dataset, outperforming previous methods by up to 1.35 dB.

2605.11405 2026-05-14 cs.LG

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

DatologyAI: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Haakon Mongstad, Alvin Deng, Aldo Carranza, Alex Fang, Amro Abbas, Anshuman Suri, Brett Larsen, Daniel Zayas, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Ties Robroek, Tony Jiang, Vidhi Jain, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

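AI summary: This study asks whether data curation alone can improve vision-language model (VLM) performance. Holding model architecture, training recipe, and compute fixed, the authors curate the MAmmoTH-VL dataset and substantially improve results across many public benchmarks and capability axes. The curated 2B-parameter model surpasses existing models on several metrics and shows clear advantages in reliability, generalization, behavior, and inference efficiency, demonstrating the potential of data curation as a high-leverage tool for building efficient VLMs.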

Comments 33 pages, 15 figures. DatologyAI website for more details: https://www.datologyai.com/

English abstract

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM performance, holding architecture, training recipe, and compute fixed and varying only the training data. Our pipeline, applied to the MAmmoTH-VL single-image subset, lifts performance by +11.7pp on average across 20 public VLM benchmarks (spanning grounding, VQA, OCR/documents, captioning, spatial/3D, counting, charts, math, brand-ID, and multi-image reasoning) and by +11.3pp on average across all nine capability axes of DatBench, our high-fidelity VLM eval suite. At 2B, our curated model surpasses InternVL3.5-2B by 9.9pp at ~17x less training compute and closes the gap to Qwen3-VL-2B to within 1.8pp at ~87x less compute, from pretraining alone. Beyond accuracy, curation delivers four further properties: (1) Reliability: per-capability std across training seeds drops by ~67% and the lift survives a 4k-to-16k context-length sweep; (2) OOD generalization: the 9-eval OOD average rises by +7.2pp, and multi-image BLINK rises by +3.09pp despite single-image-only training, with Visual Correspondence gaining +11.8pp; (3) Behavioral gains beyond benchmarks: across ~1,100 open-ended queries the curated 2B is more honest and more specific than the matched-compute baseline, and more concise and less refusal-prone than a frontier 2B reference; (4) Pareto-dominance on inference cost: at every scale (1B, 2B, 4B) the curated model raises accuracy while lowering response FLOPs vs. the matched-compute baseline, and the curated 4B matches near-frontier accuracy at 3.3x lower response FLOPs than Qwen3-VL-4B. Data curation is a high-leverage tool for building better VLMs, reaching near-frontier accuracy at up to ~150x less training compute.

2605.11347 2026-05-14 cs.LG cs.AI cs.CV

Gradient-Free Noise Optimization for Reward Alignment in Generative Models

Jeongsol Kim, Hongeun Kim, Jian Wang, Jong Chul Ye

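AI summary: This paper proposes ZeNO, a gradient-free noise optimization method for reward alignment in generative models. It formulates noise optimization as a path-integral control problem that relies only on zeroth-order reward evaluations, avoiding the backpropagation that earlier methods require. ZeNO performs well across a variety of generators and reward functions and is especially suited to settings where backpropagation is infeasible, such as protein structure generation.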

English abstract

Existing reward alignment methods for diffusion and flow models rely on multi-step stochastic trajectories, making them difficult to extend to deterministic generators. A natural alternative is noise-space optimization, but existing approaches require backpropagation through the generator and reward pipeline, limiting applicability to differentiable settings. To address this, here we present ZeNO (Zeroth-order Noise Optimization), a gradient-free framework that formulates noise optimization as a path-integral control problem, estimable from zeroth-order reward evaluations alone. When instantiated with an Ornstein--Uhlenbeck reference process, the update connects to Langevin dynamics implicitly targeting a reward-tilted distribution. ZeNO enables effective inference-time scaling and demonstrates strong performance across diverse generators and reward functions, including a protein structure generation task where backpropagation is infeasible.
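To make "estimable from zeroth-order reward evaluations alone" concrete, here is a minimal sketch of a generic reward-weighted noise update in the style of path-integral control. It is not ZeNO's actual update rule: the `generator`, the `reward`, the sampling scale `sigma`, and the `temperature` are all hypothetical stand-ins, and the abstract's Ornstein-Uhlenbeck reference process is not modeled.

```python
import numpy as np


def generator(z):
    """Hypothetical deterministic generator (stand-in for a real model)."""
    return np.tanh(z)


def reward(x):
    """Hypothetical black-box reward; only its values are used, never its gradient."""
    return -float(np.sum((x - 0.5) ** 2))


def zeroth_order_step(z, rng, n_samples=64, sigma=0.3, temperature=0.05):
    """One update: perturb the noise, score each candidate, average by softmax weight."""
    candidates = z + sigma * rng.standard_normal((n_samples, z.size))
    rewards = np.array([reward(generator(c)) for c in candidates])
    weights = np.exp((rewards - rewards.max()) / temperature)
    weights /= weights.sum()
    return weights @ candidates


rng = np.random.default_rng(0)
z = rng.standard_normal(8)
r0 = reward(generator(z))
for _ in range(50):
    z = zeroth_order_step(z, rng)
r1 = reward(generator(z))
print(f"reward before: {r0:.3f}, after: {r1:.3f}")
```

Because the update only calls `reward` on generated samples, the same loop would apply to a non-differentiable pipeline; the cost is `n_samples` generator and reward evaluations per step.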

2605.11299 2026-05-14 cs.LG cs.CL cs.SE

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

Yizhu Jiao, Ruixiang Zhang, Richard Bai, Jiawei Han, Ronan Collobert, Yizhe Zhang

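AI summary: This paper proposes DuST, a self-training framework that improves code generation models by learning in a dual judgment space. Conventional training uses feedback on a single generated result, whereas DuST uses the relative-correctness information revealed when multiple candidates are generated and judged at test time, building a richer judgment space for training. Experiments show that DuST markedly improves test-time generation quality and judgment ability across several large models without directly rewarding correct generations, transferring what is learned in the judgment space back to generation.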

English abstract

Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups containing both successes and failures, and trains the model to rank candidates by execution correctness using GRPO. The objective is purely discriminative: the model is never directly rewarded for generating correct programs. Dual self-training improves both judgment and generation. Across five models spanning two families and three scales (4B to 30B), DuST consistently improves Best-of-4 test-time scaling on LiveCodeBench. For Qwen3-30B-Thinking on LiveCodeBench v6, judgment quality improves by +6.2 NDCG, single-sample pass@1 improves by +3.1, and Best-of-4 accuracy improves by +4.1. The trained model's single rollout matches the base model's Best-of-4 performance. SFT on the same ranking data improves judgment without improving generation, confirming that on-policy RL is the mechanism that transfers dual-space learning back into primal generation.
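Two steps from the abstract above, retaining groups that contain both successes and failures and scoring a ranking against execution labels, can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, labels are 1 for a passing program and 0 for a failing one, and using NDCG as the ranking score is inferred from the reported metric rather than confirmed as the training reward.

```python
import math


def keep_mixed_groups(groups):
    """Retain groups that contain at least one pass (1) and one fail (0)."""
    return [g for g in groups if 0 < sum(g) < len(g)]


def ndcg(ranking, labels):
    """NDCG of a ranking (candidate indices, best first) against binary labels."""
    dcg = sum(labels[i] / math.log2(pos + 2) for pos, i in enumerate(ranking))
    ideal = sorted(labels, reverse=True)
    idcg = sum(lab / math.log2(pos + 2) for pos, lab in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Sandbox results for three sampled groups (hypothetical data).
groups = [[1, 1], [0, 0], [1, 0, 0]]
mixed = keep_mixed_groups(groups)

# A model ranking that puts the passing program first scores 1.0;
# one that puts it last is penalized by the logarithmic discount.
good = ndcg([0, 1, 2], mixed[0])
bad = ndcg([2, 1, 0], mixed[0])
print(mixed, good, bad)
```

Groups where every candidate passes or every candidate fails carry no comparative signal, which is why they are dropped before ranking.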

2605.11206 2026-05-14 cs.CL

Instructions Shape Production of Language, not Processing

Andreas Waldis, Leshem Choshen, Yufang Hou, Yotam Perlitz

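AI summary: This study examines how instructions affect a language model's production of language rather than its processing. Through layer-wise analysis of five binary judgment tasks, it finds that instructions mainly shape the information in output tokens, while information in input tokens stays relatively stable. Interventions on instruction flow strongly affect outputs but barely affect inputs, revealing an asymmetry between production and processing. The finding underscores that evaluating model capabilities requires attention to both internals and behavior, and a distinction between the input and output stages.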

English abstract

Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.

2605.10983 2026-05-14 cs.LG cs.AI cs.CV

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Jiaming Li, Chenyu Zhu, Nanxi Yi, Youjun Bao, Li Sun, Quanying Lv, Xiang Fang, Daizong Liu, Jianjun Li, Kun He, Bowen Zhou, Zhiyuan Ma

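AI summary: To address reward hacking when aligning diffusion models to downstream tasks, this study proposes Trajectory Matching Policy Optimization (TMPO), which replaces conventional scalar reward maximization with trajectory-level reward distribution matching and improves generative diversity and quality. TMPO introduces a Softmax Trajectory Balance objective that aligns policy probabilities with a reward-induced Boltzmann distribution and is proven to be mode-covering over trajectories. TMPO also uses Dynamic Stochastic Tree Sampling to speed up training of large-scale flow-matching models; experiments show it outperforms existing methods in generative diversity and task performance.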

English abstract

Reinforcement learning (RL) has shown extraordinary potential in aligning diffusion models to downstream tasks, yet most RL-based alignment methods still suffer from significant reward hacking, which degrades generative diversity and quality by inducing visual mode collapse and amplifying unreliable rewards. We identify the root cause as the mode-seeking nature of these methods, which maximize expected reward without effectively constraining the probability distribution over acceptable trajectories, causing concentration on a few high-reward paths. In contrast, we propose Trajectory Matching Policy Optimization (TMPO), which replaces scalar reward maximization with trajectory-level reward distribution matching. Specifically, TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective to match the policy probabilities of K trajectories to a reward-induced Boltzmann distribution. We prove that this objective inherits the mode-covering property of forward KL divergence, preserving coverage over all acceptable trajectories while optimizing reward. To further reduce multi-trajectory training time on large-scale flow-matching models, TMPO incorporates Dynamic Stochastic Tree Sampling, where trajectories share denoising prefixes and branch at dynamically scheduled steps, reducing redundant computation while improving training effectiveness. Extensive results across diverse alignment tasks such as human preference, compositional generation and text rendering show that TMPO improves generative diversity over state-of-the-art methods by 9.1%, and achieves competitive performance in all downstream and efficiency metrics, attaining the optimal trade-off between reward and diversity.
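The idea of matching the policy probabilities of K trajectories to a reward-induced Boltzmann distribution can be sketched as a forward KL between two softmax distributions. The exact form of the Softmax-TB objective is not given in the abstract, so the loss below is an assumed illustration: the function names, the temperature `beta`, and the choice of normalizing trajectory log-probabilities with a softmax over the K samples are all hypothetical.

```python
import numpy as np


def log_softmax(values):
    """Numerically stable log-softmax over a 1-D array."""
    v = np.asarray(values, dtype=np.float64)
    v = v - v.max()
    return v - np.log(np.exp(v).sum())


def softmax_matching_loss(traj_logps, rewards, beta=1.0):
    """Forward KL from the Boltzmann target softmax(r / beta) to the
    policy's softmax-normalized trajectory probabilities."""
    log_target = log_softmax(np.asarray(rewards, dtype=np.float64) / beta)
    log_policy = log_softmax(traj_logps)
    target = np.exp(log_target)
    return float(np.sum(target * (log_target - log_policy)))


# The loss vanishes when log-probabilities equal rewards up to a constant ...
matched = softmax_matching_loss([1.0, 2.0, 3.0], [11.0, 12.0, 13.0])
# ... and is positive when the policy ignores a high-reward trajectory.
mismatched = softmax_matching_loss([0.0, 0.0, 0.0], [0.0, 0.0, 5.0])
print(matched, mismatched)
```

Because the target distribution weights every trajectory with nonzero reward mass, a policy that drops one of them pays a penalty, which is the mode-covering behavior the abstract attributes to forward KL.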

2605.10906 2026-05-14 cs.LG cs.AI

DataMaster: Data-Centric Autonomous AI Research

Yaxin Du, Xiyuan Yang, Zhifan Zhou, Wanxu Liu, Zixing Lei, Zimeng Chen, Fenyi Liu, Haotian Wu, Yuzhu Cai, Zexi Liu, Xinyu Zhu, WenHao Wang, Linfeng Zhang, Chen Qian, Siheng Chen

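AI summary: As models, training methods, and compute become standardized in machine learning systems, further performance gains depend increasingly on data. The study proposes DataMaster, an autonomous data engineering framework that improves downstream task performance by optimizing data selection, composition, and processing while leaving the learning algorithm unchanged. The framework has three core components, a data tree, a shared data pool, and a global memory, which let it explore the data space, reuse discovered data, and accumulate experience; experiments show it clearly outperforms baselines on multiple benchmarks.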

English abstract

As model families, training recipes, and compute budgets become increasingly standardized, further gains in machine learning systems depend increasingly on data. Yet data engineering remains largely manual and ad hoc: practitioners repeatedly search for external datasets, adapt them to existing pipelines, validate candidate data through downstream training, and carry forward lessons from prior attempts. We study task-conditioned autonomous data engineering, where an autonomous agent improves a fixed learning algorithm by optimizing only the data side, including external data discovery, data selection and composition, cleaning and transformation. The goal is to obtain a stronger downstream solution while leaving the learning algorithm unchanged. To address the open-ended search space, branch-dependent refinement, and delayed validation inherent in autonomous data engineering, we propose DataMaster, a data-agent framework that integrates tree-structured search, shared candidate data, and cumulative memory. DataMaster consists of three key components: a DataTree that organizes alternative data-engineering branches, a shared Data Pool that stores discovered external data sources for reuse, and a Global Memory that records node outcomes, artifacts, and reusable findings. Together, these components allow the agent to discover candidate data, construct executable training inputs, evaluate them through downstream feedback, and carry useful evidence across branches. We evaluate DataMaster on two types of benchmarks, MLE-Bench Lite and PostTrainBench. On MLE-Bench Lite, it improves medal rate by 32.27% over the initial score; on PostTrainBench, it surpasses the instruct model on GPQA (31.02% vs 30.35%).

2605.10896 2026-05-14 cs.LG

V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction

Marcin Kostrzewa, Sebastian Tomczak, Roman Furman, Anna Poberezhna, Michał Furgała, Julia Farganus, Oleksii Furman, Maciej Zięba

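AI summary: This study presents V4FinBench, a benchmark dataset of more than one million company-year records for evaluating tabular models, large language models, and traditional methods on corporate bankruptcy prediction. The dataset covers companies in the Visegrád Group countries from 2006 to 2021, includes 131 financial and non-financial features, and supports prediction at multiple horizons. Comparing models under class imbalance, the study finds that imbalance-adapted TabPFN performs better than gradient boosting at longer horizons, while Llama-3-8B is weaker overall. The public release of V4FinBench supports research on and improvement of prediction methods on real financial data.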

English abstract

Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegrád Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.

2605.10819 2026-05-14 cs.RO cs.AI cs.CV

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo, Yandan Yang, Bin Liu, Zhejia Cai, Feng Xiong, Mu Xu, Jiachen Luo, De Ma, Zhiheng Ma, Gang Pan

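AI summary: Vision-language-action (VLA) models are limited by the scarcity of action-labeled robot data, while action-free videos carry rich information about how the physical world changes. This paper proposes ALAM (Algebraically Consistent Latent Action Model), which learns structured latent action transitions from action-free video to give policy generation a consistent transition structure. ALAM uses frame triplets to learn latent transitions that satisfy reconstruction, composition, and reversal consistency, couples them with policy generation through a joint flow-matching objective, and substantially improves VLA performance on several benchmarks.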

English abstract

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM, an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
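The composition and reversal constraints on a frame triplet can be written as two residuals. The sketch below is an assumption-laden illustration, not ALAM's loss: it takes "locally additive" to mean that the transition from frame a to frame c should equal the sum of the a-to-b and b-to-c transitions, and that the b-to-a transition should be the negative of a-to-b. The linear `toy_transition` encoder and all function names are hypothetical.

```python
import numpy as np


def composition_error(z_ab, z_bc, z_ac):
    """Squared residual of z(a->b) + z(b->c) = z(a->c)."""
    return float(np.sum((z_ab + z_bc - z_ac) ** 2))


def reversal_error(z_ab, z_ba):
    """Squared residual of z(b->a) = -z(a->b)."""
    return float(np.sum((z_ab + z_ba) ** 2))


rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))  # hypothetical linear feature map


def toy_transition(frame_from, frame_to):
    """Toy latent transition: linear in the frame difference, hence additive."""
    return W @ (frame_to - frame_from)


a, b, c = rng.standard_normal((3, 16))  # a frame triplet (flattened toy frames)
comp = composition_error(toy_transition(a, b), toy_transition(b, c), toy_transition(a, c))
rev = reversal_error(toy_transition(a, b), toy_transition(b, a))
print(comp, rev)
```

A learned encoder would generally not satisfy these identities exactly; adding the two residuals to a reconstruction loss is one way such regularization could be applied.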

2605.10685 2026-05-14 cs.AI

GESR: A Genetic Programming-Based Symbolic Regression Method with Gene Editing

Yanjie Li, Liping Zhang, Min Wu, Weijun Li, Lina Yu, Jingyi Liu, Yusong Deng, Mingzhu Wan, Xin Ning

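AI summary: This paper proposes GESR, a gene-editing-based symbolic regression method that aims to improve the efficiency and performance of traditional genetic programming (GP) on symbolic regression. The method introduces two BERT models as "hands of God": one guides gene mutation and the other predicts crossover points, enabling more targeted gene editing. Experiments show that GESR clearly improves on traditional GP in both computational efficiency and task performance.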

Comments 70 pages

English abstract

Mathematical formulas serve as a language through which humans communicate with nature. Discovering mathematical laws from scientific data to describe natural phenomena has been a long-standing pursuit of humanity for centuries. In the field of artificial intelligence, this challenge is known as the symbolic regression problem. Among existing symbolic regression approaches, Genetic Programming (GP) based on evolutionary algorithms remains one of the most classical and widely adopted methods. GP simulates the evolutionary process across generations through genetic mutation and crossover. However, mutations and crossovers in GP are entirely random. While this randomness effectively mimics natural evolution, it inevitably produces both beneficial and detrimental variations. If there existed a metaphorical "God" capable of foreseeing which genetic mutations or crossovers would yield superior outcomes and performing targeted gene editing accordingly, the efficiency of evolution could be substantially improved. Motivated by this idea, we propose in this paper a symbolic regression approach based on gene editing, termed GESR. In GESR, we trained two "hands of God" (two BERT models). The first leverages BERT's masked language modeling capability to guide the mutation of genes (expression symbols). The second guides the crossover of individual genes by predicting the crossover point. Experimental results demonstrate that GESR significantly improves computational efficiency compared with traditional GP algorithms and achieves strong overall performance across multiple symbolic regression tasks.

2605.10426 2026-05-14 cs.CV cs.AI

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Minqing Huang, Yujiao Xiang, Zihan Liang, Jiajie Huang, Jingqi Wang, Zhi Xu, Feiyang Tan, Hangning Zhou, Mu Yang, Gong Che

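AI summary: This paper proposes CoWorld-VLA, a multi-expert world model framework for autonomous driving that addresses the shortcomings of existing vision-language-action (VLA) models in planning-oriented intermediate representations. The method extracts complementary world information through multi-source supervision and encodes it as expert tokens that serve as explicit conditions for the planner, guiding action generation more effectively. Experiments show that CoWorld-VLA performs well on future scene generation and trajectory planning, with particular strengths in collision avoidance and trajectory accuracy.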

English abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.

2605.10267 2026-05-14 cs.AI

IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding

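AI summary: This paper presents IndustryBench, a Chinese-language industrial procurement QA benchmark built from Chinese national standards and industrial product records, for evaluating large language models at the boundaries of their industrial knowledge. The benchmark contains 2,049 items spanning seven capability dimensions and ten industry categories; an external-verification stage rejects 70.3% of LLM-generated candidate items, and the evaluation exposes clear weaknesses of current models in industrial safety and standards compliance. Even the best model scores low after safety adjustment, and safety violations substantially change model rankings, indicating that LLM evaluation in industrial settings must give more weight to safety and standards compliance.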

English abstract

In industrial procurement, an LLM answer is useful only if it survives a standards check: the recommended material must match the operating condition, every parameter must respect a regulated threshold, and no procedure may contradict a safety clause. Partial correctness can mask safety-critical contradictions that aggregate LLM benchmarks rarely capture. We introduce IndustryBench, a 2,049-item benchmark for industrial procurement QA in Chinese, grounded in Chinese national standards (GB/T) and structured industrial product records, organized by seven capability dimensions, ten industry categories, and panel-derived difficulty tiers, with item-aligned English, Russian, and Vietnamese renderings. Our construction pipeline rejects 70.3% of LLM-generated candidates at a search-based external-verification stage, calibrating how unreliable industrial QA remains after LLM-only filtering. Our evaluation decouples raw correctness, scored by a Qwen3-Max judge validated at $\kappa_w = 0.798$ against a domain expert, from a separate safety-violation (SV) check against source texts. Across 17 models in Chinese and an 8-model intersection over four languages, we find: (i) the best system reaches only 2.083 on the 0--3 rubric, leaving substantial headroom; (ii) Standards & Terminology is the most persistent capability weakness and survives item-aligned translation; (iii) extended reasoning lowers safety-adjusted scores for 12 of 13 models, primarily by introducing unsupported safety-critical details into longer final answers; and (iv) safety-violation rates reshuffle the leaderboard -- GPT-5.4 climbs from rank 6 to rank 3 after SV adjustment, while Kimi-k2.5-1T-A32B drops seven positions. Industrial LLM evaluation therefore requires source-grounded, safety-aware diagnosis rather than aggregate accuracy. We release IndustryBench with all prompts, scoring scripts, and dataset documentation.

2605.10187 2026-05-14 cs.CV

SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

Longteng Guo, Xuanxu Lin, Dongze Hao, Tongtian Yue, Pengkang Huo, Jiatong Ma, Yuchen Liu, Jing Liu

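AI summary: SciVQR is a multimodal scientific reasoning benchmark spanning mathematics, physics, chemistry, and other disciplines, designed to evaluate how well multimodal large language models handle complex scientific problems. It includes domain-specific visuals such as charts and equations, requires models to combine visual understanding with multi-step reasoning, ranges in difficulty from basic factual recall to complex inference, and provides expert solutions for reference. The study finds that current mainstream multimodal models still fall clearly short on interdisciplinary, multi-step scientific reasoning, highlighting the need to improve reasoning ability and the integration of domain knowledge.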

English abstract

Scientific reasoning is a key aspect of human intelligence, requiring the integration of multimodal inputs, domain expertise, and multi-step inference across various subjects. Existing benchmarks for multimodal large language models (MLLMs) often fail to capture the complexity and traceability of reasoning processes necessary for rigorous evaluation. To fill this gap, we introduce SciVQR, a multimodal benchmark covering 54 subfields in mathematics, physics, chemistry, geography, astronomy, and biology. SciVQR includes domain-specific visuals, such as equations, charts, and diagrams, and challenges models to combine visual comprehension with reasoning. The tasks range from basic factual recall to complex, multi-step inferences, with 46% including expert-authored solutions. SciVQR not only evaluates final answers but also examines the reasoning process, providing insights into how models reach their conclusions. Our evaluation of leading MLLMs, including both proprietary and open-source models, reveals significant limitations in handling complex multimodal reasoning tasks, underscoring the need for improved multi-step reasoning and better integration of interdisciplinary knowledge in advancing MLLMs toward true scientific intelligence. The dataset and evaluation code are publicly available at https://github.com/CASIA-IVA-Lab/SciVQR.

2605.10127 2026-05-14 cs.CV

Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition

Yu He, Ting Zhu, Yichun Liu, Lichen Ma, Xinyuan Shan, Jingling Fu, Yu Shi, Junshi Huang, Yan Li

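AI summary: This paper presents Fashion130K, a new e-commerce fashion dataset covering a variety of occasions, models, and garment types, intended to advance research on outfit generation. To achieve visually consistent garment generation, the authors design a Unified Multi-modal Condition (UMC) framework that fuses the embeddings of text and image prompts, introduces a Fusion Transformer to align multimodal features, and steers the generation model toward the key correlations between prompts and the noise image. The dataset and framework offer a broad and detailed exploration of multimodal prompts in generation models and show better visual consistency than existing methods in several real-world applications and benchmarks.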

Comments Accepted to CVPR 2026 Findings

English abstract

Recent research on fashion outfit generation focuses on improving the visual consistency of garments by leveraging key information from a reference image and a text prompt. However, the potential of outfit generation remains underexplored, as it requires a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130K, with various occasions, models, and garment types. For consistent garment generation, we design a framework with Unified Multi-modal Condition (UMC) to align and integrate the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract the unified embeddings of multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by adjusting the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between prompts and the noise image, so that the noise image can select the pivotal tokens of the prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving better results than SoTA methods.

2605.10040 2026-05-14 cs.CV

Only Train Once: Uncertainty-Aware One-Class Learning for Face Authenticity Detection

Qingchao Jiang, Zhenxuan Hou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang, Zaiwang Gu

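AI summary: With the rapid development of generative models, highly realistic generated images bring risks of identity fraud and disinformation. Most existing methods treat face forgery detection as a fully supervised binary classification problem and struggle with new generation methods. This paper proposes FADNet, which recasts face authenticity detection as a one-class classification task trained only on real faces; by incorporating Evidential Deep Learning and a pseudo-forgery image generator, it improves generalization and detection accuracy and outperforms existing methods on multiple benchmarks.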

Comments The sole reason for our withdrawal application is that we have identified critical areas in our manuscript that require substantial revision and improvement to meet rigorous scientific standards. Our only intention is to retract the current draft to revise and enhance it, with no plans to replace it with a different version or redirect readers to other sources at this time

English abstract

The rapid evolution of generative paradigms has enabled the creation of highly realistic imagery, escalating the risks of identity fraud and the dissemination of disinformation. Most existing approaches frame face forgery detection as a fully supervised binary classification problem. Consequently, these models typically exhibit significant performance decay when tasked with detecting forgeries from previously unseen generative paradigms. Furthermore, these methods focus exclusively on either DeepFakes or fully synthesized faces, thereby failing to provide a generalized framework for universal face forgery detection. In this paper, we address this challenge by introducing FADNet (Face Authenticity Detector Net), which reformulates face forgery detection as a one-class classification (OCC) task. By training exclusively on authentic facial data to capture their intrinsic representations, FADNet flags any image whose feature embedding deviates significantly from the learned distribution of real faces as a forgery. The framework incorporates Evidential Deep Learning (EDL) to quantify predictive uncertainty and utilizes a plug-and-play pseudo-forgery image generator (PFIG) to tighten decision boundaries around authentic data. Extensive experimental evaluations on the DF40 and ASFD benchmarks demonstrate that FADNet achieves superior performance and generalization capabilities. Specifically, FADNet substantially outperforms existing state-of-the-art (SOTA) methods, yielding an average accuracy of 96.63% and an average precision of 98.83%.

2605.09968 2026-05-14 cs.LG math.OC stat.ML

Consolidation-Expansion Operator Mechanics: A Unified Framework for Adaptive Learning

Debashis Guha

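AI summary This paper proposes Consolidation-Expansion Operator Mechanics (OpMech), a unified framework for describing how adaptive learning systems alternate between consolidating existing knowledge and expanding into new knowledge. The central concept is the order-gap, which measures the degree to which the consolidation and expansion operators fail to commute at a given knowledge state and can serve as a real-time control signal that guides learning. The framework applies to domains such as reinforcement learning, continual learning, and recursive language models, and it provides an order-gap-based stopping rule with theoretical guarantees and practical effectiveness.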

Comments 38 pages; Corrected author affiliation on title page in v2; no scientific changes

English abstract

Every adaptive learning system must alternate between two operations: consolidating what it already knows and expanding into new evidence. We propose \emph{Consolidation-Expansion Operator Mechanics} (OpMech), a framework that makes this structure precise. The central object is the \emph{order-gap} $\Ogap(θ; e)$, the degree to which a consolidation operator~$Q$ and an expansion operator~$P_e$ fail to commute at a given knowledge state. Because the order-gap is computable from the system's own trajectory, it serves as a real-time control signal: large values indicate that the system is still sensitive to the ordering of consolidation and expansion; once the order-gap falls and stays small, further processing is unlikely to change the outcome. Three results give the signal precise meaning: the order-gap decays along convergent trajectories; a persistently large order-gap implies the system is far from its settled state; and an order-gap-based stopping rule terminates with provable guarantees in both noiseless and bounded-noise settings. The framework applies across five domains: bandits, reinforcement learning, stochastic optimization, continual learning, and recursive language models. We give conditions under which the order-gap reliably tracks convergence in three representative cases. We develop the recursive language model application in detail, showing how OpMech replaces heuristic stopping rules and fixed recursion budgets with principled, evidence-driven alternatives.
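
The abstract does not give the formula for the order-gap, so the following is only a minimal sketch under an assumed form: the norm of the difference between applying the two operators in either order. The toy operators (`consolidate` as a saturating `tanh`, `expand` as a damped pull toward evidence fixed at 0) and the `tol`/`patience` stopping rule are hypothetical illustrations, not the paper's definitions.

```python
import numpy as np

def order_gap(consolidate, expand, theta):
    """Assumed form: distance between the two orders of applying the operators."""
    return float(np.linalg.norm(consolidate(expand(theta)) - expand(consolidate(theta))))

def run_until_settled(consolidate, expand, theta, tol=1e-3, patience=3, max_steps=1000):
    """Apply expand-then-consolidate until the gap stays below tol for `patience` checks."""
    calm = 0
    gap = float("inf")
    for step in range(max_steps):
        gap = order_gap(consolidate, expand, theta)
        calm = calm + 1 if gap < tol else 0  # reset the counter whenever the gap spikes
        if calm >= patience:
            return theta, step, gap
        theta = consolidate(expand(theta))
    return theta, max_steps, gap

# Hypothetical toy operators that share the fixed point theta = 0.
def consolidate(theta):
    return np.tanh(theta)

def expand(theta, evidence=0.0, rate=0.5):
    return theta + rate * (evidence - theta)

start = np.array([2.0, -1.5])
theta, steps, gap = run_until_settled(consolidate, expand, start)
print(f"stopped after {steps} steps with order-gap {gap:.2e}")
```

In this toy setting the gap shrinks as the state approaches the shared fixed point, which mirrors the qualitative behavior the abstract describes for convergent trajectories.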

2605.09935 2026-05-14 cs.CV cs.CR

Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning

Qingchao Jiang, Zhenxuan Hou, Zhiying Zhu, Zhenxing Qian, Xinpeng Zhang, Zaiwang Gu

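AI summary With the rapid development of deep generative models, forged face images are widely used for illegal activities. Existing synthetic face detection methods have made progress, but their reliance on the Softmax activation function makes them overconfident, so their predictions are unreliable on images from unknown distributions. To address this, the paper proposes EMSFD, which models class evidence with a Dirichlet distribution and explicitly incorporates model uncertainty to improve detection reliability and generalization. It also uses uncertainty to guide active learning and reduce annotation cost. Experiments show that the method improves detection accuracy by 15% over existing state-of-the-art methods.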

Comments The sole reason for our withdrawal application is that we have identified critical areas in our manuscript that require substantial revision and improvement to meet rigorous scientific standards. Our only intention is to retract the current draft to revise and enhance it, with no plans to replace it with a different version or redirect readers to other sources at this time

English abstract

With the rapid development of deep generative models, forged facial images are massively exploited for illegal activities. Although existing synthetic face detection methods have achieved significant progress, they suffer from the inherent limitation of overconfidence due to their reliance on the Softmax activation function. Thus, these methods often lead to unreliable predictions when encountering unknown Out-of-Distribution (OOD) images, and cannot ascertain the model's uncertainty in its prediction. Meanwhile, most existing methods require massive high-quality annotated data, which greatly limits their practicability across diverse scenarios. To address these limitations, we propose EMSFD (Evidence-based decision Modeling for Synthetic Face Detection with uncertainty-driven active learning), an approach designed to enhance detection reliability and generalizability. Specifically, EMSFD models class evidence using the Dirichlet distribution and explicitly incorporates model uncertainty into the prediction process. Furthermore, during training, the estimated uncertainty is exploited to prioritize more informative samples from the unlabeled pool for annotation, thereby reducing labeling cost and improving model generalization. Extensive experimental evaluations demonstrate that our method enhances the interpretability of synthetic face detection. Meanwhile, our method yields a 15\% increase in accuracy compared to existing state-of-the-art (SOTA) baselines, which demonstrates the superior detection performance and generalizability of our approach. Our code is available at: https://github.com/hzx111621/EMSFD.
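
As a rough illustration of Dirichlet-based evidence modeling, the sketch below uses the common evidential deep learning recipe: non-negative evidence per class, Dirichlet parameters `alpha = evidence + 1`, and uncertainty `K / sum(alpha)`. The softplus evidence function and the top-`budget` selection rule are assumptions; the abstract does not specify EMSFD's exact choices.

```python
import numpy as np

def dirichlet_uncertainty(logits):
    """Map raw network outputs to expected class probabilities and an uncertainty mass."""
    evidence = np.logaddexp(0.0, logits)             # softplus keeps evidence non-negative (assumed)
    alpha = evidence + 1.0                           # Dirichlet concentration parameters
    strength = alpha.sum(axis=-1, keepdims=True)     # total Dirichlet strength S
    probs = alpha / strength                         # expected probability per class
    uncertainty = logits.shape[-1] / strength[..., 0]  # u = K / S, close to 1 when evidence is absent
    return probs, uncertainty

def select_for_annotation(logits, budget):
    """Pick the `budget` most uncertain unlabeled samples (hypothetical selection rule)."""
    _, uncertainty = dirichlet_uncertainty(logits)
    return np.argsort(-uncertainty)[:budget]

# Three unlabeled samples: confident, ambiguous, and evidence-free.
pool_logits = np.array([[10.0, -10.0], [0.0, 0.0], [-10.0, -10.0]])
probs, uncertainty = dirichlet_uncertainty(pool_logits)
picked = select_for_annotation(pool_logits, budget=2)
```

Unlike a Softmax output, the evidence-free sample here receives near-maximal uncertainty instead of a confident-looking probability, which is what makes it a natural candidate for annotation.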

2605.09923 2026-05-14 cs.AI

EXPO: Exploration-Prioritized Policy Optimization via Adaptive KL Regulation and Gaussian Curriculum Sampling

Mingxiong Lin, Zhangquan Gong, Maowen Tang, Qian Li, Chuangchuang Wang, Jian Ma, Sutian Huang, Kai Tang, Haonan Lu

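AI summary This paper targets the insufficient exploration efficiency of Group Relative Policy Optimization (GRPO), the mainstream algorithm in reinforcement learning with verifiable rewards (RLVR), and proposes Exploration-Prioritized Policy Optimization (EXPO). By introducing a dynamically adjusted KL regularization module and a Gaussian-distribution-based curriculum sampling strategy, EXPO improves the model's exploration ability and training efficiency on mathematical reasoning tasks. Experiments show that EXPO clearly outperforms vanilla GRPO on multiple benchmarks, with larger gains on harder problems.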

Comments Duplicate submission of arXiv:2605.11403

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard paradigm for LLM mathematical reasoning, where Group Relative Policy Optimization (GRPO) serves as the mainstream algorithm. We point out two understudied inefficiencies existing in GRPO. First, the fixed KL penalty coefficient overly restricts policy exploration at stages where the model requires significant deviation from the reference policy. Second, uniform sampling of training questions ignores that moderately difficult problems provide the most informative gradient signals for optimization. We propose Exploration-Prioritized Policy Optimization (EXPO) with two lightweight plug-in modules. The Accuracy-Conditioned KL Scaling (AKL) dynamically adjusts KL regularization strength through a smooth nonlinear function of batch average accuracy, relaxing the penalty when the model underperforms and strengthening it when the model achieves good results. The Gaussian Curriculum Sampling (GCS) assigns sampling weights to questions following a Gaussian distribution centered at moderate accuracy around 0.5, focusing training on the model's learning frontier. We conduct extensive experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-8B-Base over six mathematical reasoning benchmarks. The results show EXPO steadily surpasses vanilla GRPO. It obtains an absolute gain of 13.34 on AIME 2025 pass@32, rising from 63.33 percent to 76.67 percent, and achieves an average pass@32 improvement of 2.66 on the 8B model. The much larger performance gains on pass@32 compared with pass@1 demonstrate that EXPO effectively enlarges the model's exploration boundary under a fixed inference cost budget.
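
The two modules can be sketched as follows. This is an illustrative guess at their shape: the abstract only states that AKL is a smooth nonlinear function of batch accuracy and that GCS weights follow a Gaussian centered near 0.5. The sigmoid gate, `sigma`, `beta_min`, `beta_max`, and `sharpness` are hypothetical choices.

```python
import math
import random

def gaussian_curriculum_weights(accuracies, center=0.5, sigma=0.2):
    """Sampling weights that peak for questions the model solves about half the time."""
    raw = [math.exp(-((a - center) ** 2) / (2 * sigma ** 2)) for a in accuracies]
    total = sum(raw)
    return [w / total for w in raw]

def accuracy_conditioned_kl(batch_accuracy, beta_min=0.001, beta_max=0.04, sharpness=10.0):
    """KL coefficient that relaxes at low batch accuracy and tightens at high accuracy."""
    gate = 1.0 / (1.0 + math.exp(-sharpness * (batch_accuracy - 0.5)))  # assumed sigmoid gate
    return beta_min + (beta_max - beta_min) * gate

# Per-question rollout accuracy: always wrong, half right, always right.
weights = gaussian_curriculum_weights([0.0, 0.5, 1.0])
batch = random.choices(range(3), weights=weights, k=8)  # indices of questions to train on
beta_low = accuracy_conditioned_kl(0.1)
beta_high = accuracy_conditioned_kl(0.9)
```

Questions that are always solved or never solved yield little group-relative signal, so down-weighting both tails concentrates training on the middle of the difficulty range.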

2605.09725 2026-05-14 cs.CV

On-Policy Distillation with Best-of-N Teacher Rollout Selection

Ke Zhang, Yunjie Tian, Dongdi Zhao, Yijiang Li, Yuanye Liu, Vishal M Patel, Di Fu

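AI summary This paper proposes BRTS, a framework that improves on-policy distillation (OPD) to raise model performance on complex reasoning tasks. BRTS selects the best auxiliary trajectory from multiple teacher trajectories, reducing the noise and variance of the supervision signal and thereby improving what the student model learns. Experiments show that BRTS clearly outperforms standard OPD on several mathematical reasoning benchmarks, especially on harder datasets.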

Comments 10 pages, 5 figures

English abstract

On-policy distillation (OPD), which supervises a student on its own sampled trajectories, has emerged as a data-efficient post-training method for improving reasoning while avoiding the reward dependence of reinforcement learning and the catastrophic forgetting often observed in standard supervised fine-tuning. However, standard OPD typically computes teacher supervision under noisy student-generated contexts and often relies on a single stochastic teacher rollout per prompt. As a result, the supervision signal can be high-variance: the sampled teacher trajectory can be incorrect, uninformative, or poorly matched to the student's current reasoning behavior. To address this limitation, we propose BRTS, a Best-of-N Rollout Teacher Selection framework for on-policy distillation. BRTS augments standard student-context OPD with a teacher-context supervision branch constructed from the curated teacher trajectory. Rather than distilling from the first sampled teacher rollout, BRTS samples a small pool of teacher trajectories and selects the auxiliary trajectory using a simple priority rule: correctness first, student alignment second. When multiple correct teacher trajectories are available, BRTS chooses the one most aligned with the student's current behavior; when unconditioned teacher samples fail on harder prompts, it invokes a ground-truth-conditioned recovery step to elicit a natural derivation. The selected trajectory is then used to provide reliable teacher-context supervision inside the OPD loop, augmented with an auxiliary loss on the teacher trajectory. Experiments on AIME 2024, AIME 2025, and AMC 2023 show that BRTS improves over standard OPD on challenging reasoning benchmarks, with the largest gains on harder datasets. Our code is available at https://github.com/BWGZK-keke/BRTS.
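
The priority rule (correctness first, student alignment second, recovery as a fallback) can be sketched as below. The `Rollout` type, the use of student log-probability as the alignment score, and the `recover` callback are hypothetical stand-ins; the repository linked in the abstract is the authority on how BRTS actually scores and recovers trajectories.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

@dataclass
class Rollout:
    text: str
    is_correct: bool        # verdict from an answer checker
    student_logprob: float  # assumed alignment score: student log-likelihood of the rollout

def select_teacher_rollout(
    pool: Sequence[Rollout],
    recover: Optional[Callable[[], Rollout]] = None,
) -> Optional[Rollout]:
    """Correctness first, student alignment second, recovery step as a last resort."""
    correct = [r for r in pool if r.is_correct]
    if correct:
        # Among correct rollouts, prefer the one the student finds most likely.
        return max(correct, key=lambda r: r.student_logprob)
    if recover is not None:
        # No unconditioned sample was correct: fall back to a ground-truth-conditioned derivation.
        return recover()
    return None

pool = [
    Rollout("fluent but wrong", False, -1.0),
    Rollout("correct, unfamiliar style", True, -5.0),
    Rollout("correct, student-like", True, -2.0),
]
chosen = select_teacher_rollout(pool)
fallback = select_teacher_rollout(pool[:1], recover=lambda: Rollout("conditioned derivation", True, -3.0))
```

Note that the wrong rollout has the highest alignment score in this example and is still rejected, which is the point of ranking correctness above alignment.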

2605.09423 2026-05-14 cs.AI

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

Haoqiang Kang, Xiaokang Ye, Yuhan Liu, Siddhant Hitesh Mantri, Lingjun Mao, James Fleming, Drishti Regmi, Lianhui Qin

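AI summary This paper presents SimWorld Studio, an open-source platform built on Unreal Engine 5 that automatically generates interactive 3D learning environments for embodied agent learning. Its core is SimCoder, a tool- and skill-augmented coding agent that writes and executes engine-level code from language or image instructions to build physically realistic 3D worlds, and that self-evolves using verifier feedback. The platform enables co-evolution of environment generation and embodied learning, substantially improving agent performance and the reliability of environment generation.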

English abstract

LLM/VLM-based digital agents have advanced rapidly thanks to scalable sandboxes for coding, web navigation, and computer use, which provide rich interactive training grounds. In contrast, embodied agents still lack abundant, diverse, and automatically generated 3D environments for interactive learning. Existing embodied simulators rely on manually crafted scenes or procedural templates, while recent LLM-based 3D generation systems mainly produce static scenes rather than deployable environments with verifiable tasks and standard learning interfaces. We introduce SimWorld Studio, an open-source platform built on Unreal Engine 5 for generating evolving embodied learning environments. At its core is SimCoder, a tool/skill-augmented coding agent that writes and executes engine-level code to construct physically grounded 3D worlds from language/image instructions. SimCoder self-evolves by using verifier feedback (e.g., compilation errors, physics checks, VLM critiques) to revise environments and autonomously add reusable tools and skills to its library. Generated worlds are exported as Gym-style environments for embodied agent learning. SimWorld Studio further enables co-evolution between environment generation and embodied learning: agent performance feedback guides SimCoder to generate adaptive curricula near the learner's capability frontier, so that environments become increasingly challenging as the embodied agent improves. Three case studies on embodied navigation show that self-evolution improves generation reliability, generated environments substantially improve embodied agent performance that generalizes to unseen benchmarks, and co-evolution yields an 18-point success-rate gain over fixed-environment learning and a 40-point gain over an untrained agent.

2605.08863 2026-05-14 cs.CL cs.LG

Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection

Shota Fujikawa, Issei Sato

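AI summary This paper revisits max-pooling networks for hallucination detection and analyzes the role of semantic probability in multiple instance learning. It shows that scaling internal states with semantic consistency improves performance by enlarging the decision margin. Building on this finding, the authors propose an efficient classification method that aggregates token-level features with max pooling and estimates sentence scores directly with a lightweight MLP, with no costly semantic similarity computation, which substantially improves computational efficiency while remaining competitive.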

English abstract

Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state-of-the-art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token-level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state-of-the-art baselines through adaptive aggregation of internal feature representations. Code is available at https://github.com/FUJI1229/Hallucination_Detection.
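
The aggregation step described above (max pooling over token-level features, then a lightweight MLP) can be sketched as a forward pass. The layer sizes, ReLU activation, sigmoid output, and random untrained weights are assumptions for illustration; see the linked repository for the authors' actual architecture.

```python
import numpy as np

def sentence_score(token_features, w1, b1, w2, b2):
    """Max-pool token features over the sequence, then score the sentence with a small MLP."""
    pooled = token_features.max(axis=0)          # (d,): strongest response per feature dimension
    hidden = np.maximum(0.0, pooled @ w1 + b1)   # (h,): one ReLU hidden layer
    logit = hidden @ w2 + b2                     # scalar sentence logit
    return 1.0 / (1.0 + np.exp(-logit))          # squash to a score in (0, 1)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))                 # 6 tokens with 8-dim internal-state features
w1, b1 = rng.normal(size=(8, 4)), np.zeros(4)    # untrained weights, illustration only
w2, b2 = rng.normal(size=4), 0.0

score = sentence_score(tokens, w1, b1, w2, b2)
shuffled_score = sentence_score(tokens[::-1], w1, b1, w2, b2)
```

Because the maximum is taken over the token axis, the score does not depend on token order and needs only a single forward pass, with no repeated sampling.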

2605.08759 2026-05-14 cs.LG

MDL-GBG: A Non-parametric and Interpretable Granular-Ball Generation Method for Clustering

Zeqiang Xian, Caihui Liu, Yong Zhang, Wenjing Qiu, Duoqian Miao, Witold Pedrycz

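AI summary Existing granular-ball generation methods rely mainly on handcrafted quality measures and heuristic splitting or stopping criteria, which may weaken the transparency of local generation decisions in clustering. This paper proposes MDL-GBG, a non-parametric and interpretable granular-ball generation method based on the Minimum Description Length principle. It recasts granular-ball generation as a local model selection problem: a single-ball model, a two-ball model, and a core-ball-plus-residual model are compared, and the one with the shortest description length decides whether a ball is retained, split, or has its residual peeled off. Experiments show that the stable granular-balls generated by MDL-GBG improve clustering performance and outperform existing methods on several evaluation metrics.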

Comments 13 pages, 5 figures, 4 tables. Revised version with updated computational complexity analysis, experiments, and discussion. The implementation was optimized to reduce redundant computation and improve efficiency; experimental results and related descriptions were updated accordingly

English abstract

Existing granular-ball generation methods are still mainly driven by handcrafted quality measures and heuristic splitting or stopping criteria, which may weaken the transparency of local generation decisions in clustering. To address this issue, this paper proposes Minimum Description Length based Granular-Ball Generation (MDL-GBG), a non-parametric and interpretable granular-ball generation method for clustering. MDL-GBG reformulates granular-ball generation as a local model selection problem under the Minimum Description Length principle. For each granular ball, three candidate explanations are compared, namely a single-ball model, a two-ball model, and a core-ball-plus-residual model, and the model with the shortest description length is selected. In this way, ball retention, splitting, and residual peeling are unified within a common coding-theoretic framework. A residual reassignment mechanism is further introduced to re-evaluate peeled-off boundary samples after stable granular-balls are formed. Experiments on 20 UCI datasets show that the stable granular-balls generated by MDL-GBG provide an effective upstream representation for clustering. In particular, MDL-GBG+AC achieves the best average ranks in ARI, ACC, and NMI among the compared methods. These results indicate that MDL-GBG offers a principled and interpretable alternative to heuristic granular-ball generation strategies.

2605.08541 2026-05-14 cs.LG

Tokens-per-Parameter Coverage Is Critical for Robust LLM Scaling Law Extrapolation

Joshua Shay Kricheli, Alexander Lawrence Reid, Soumajyoti Sarkar, Venkata Gandikota, Paulo Shakarian

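AI summary This study shows that fitting language model scaling laws under a fixed tokens-per-parameter (TPP) ratio makes parameter estimation ill-conditioned and degrades extrapolation. When the exponents for parameter count $N$ and token count $D$ are close, the condition number of the least-squares problem grows sharply and the scale coefficients become hard to estimate accurately. The authors derive a TPP-diversity threshold that ensures well-conditioned estimation and show experimentally that non-collinear designs outperform the conventional approach across multiple datasets and precision modes.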

English abstract

Neural scaling laws approximate a language model's loss as a power-law function of parameter count $N$ and token count $D$. Following Chinchilla-style compute-optimal training, many studies fit scaling laws from runs performed under a fixed tokens-per-parameter (TPP) ratio $k$ and set $D = kN$. We show that this collinear design, combined with the empirically common near-equality of the exponents governing $N$ and $D$, induces an inherent ill-conditioning in the Gauss-Newton least-squares problem: the condition number of the design grows as the inverse square of the gap between the $N$- and $D$-exponents. The scale coefficients become practically unidentifiable, with confidence intervals inflating by an order of magnitude or more, yielding a ``sloppy'' model whose extrapolations degrade sharply off the training ray. We prove this for four scaling-law formalisms and derive a closed-form TPP-diversity threshold that is necessary and sufficient for well-conditioned estimation. Empirically, non-collinear designs outperform collinear ones on held-out splits with a 97.3\% win rate across four laws, five corpora, and multiple floating-point precision modes. We further show the degeneracy is rooted in Jacobian geometry and is not an artifact of the loss function: any smooth estimation objective whose curvature involves the Jacobian inherits the same ill-conditioning.
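
The collinearity effect can be checked numerically with a small sketch. It assumes a Chinchilla-style form $L = E + A N^{-\alpha} + B D^{-\beta}$ and looks only at the Jacobian with respect to $(E, A, B)$ at fixed exponents; the exponent values, the model-size grid, the TPP values, and the column normalization are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def design_condition_number(n_params, n_tokens, alpha, beta):
    """Condition number of the Jacobian of E + A*N^-alpha + B*D^-beta w.r.t. (E, A, B)."""
    jac = np.column_stack([
        np.ones_like(n_params),   # dL/dE
        n_params ** -alpha,       # dL/dA
        n_tokens ** -beta,        # dL/dB
    ])
    jac = jac / np.linalg.norm(jac, axis=0)  # unit-norm columns so scale does not dominate
    return np.linalg.cond(jac)

n_params = np.logspace(7, 10, 12)                  # 12 model sizes from 1e7 to 1e10 (illustrative)
collinear_tokens = 20.0 * n_params                 # fixed TPP ratio: D = kN with k = 20
tpp = np.tile([5.0, 20.0, 80.0, 320.0], 3)         # varied TPP ratios across runs
diverse_tokens = tpp * n_params

cond_collinear = design_condition_number(n_params, collinear_tokens, 0.34, 0.28)
cond_diverse = design_condition_number(n_params, diverse_tokens, 0.34, 0.28)
cond_small_gap = design_condition_number(n_params, collinear_tokens, 0.29, 0.28)
print(cond_collinear, cond_diverse, cond_small_gap)
```

With $D = kN$ the third column is proportional to $N^{-\beta}$, so it becomes nearly parallel to the $N^{-\alpha}$ column as the exponents approach each other; varying the TPP ratio breaks that dependence.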

2605.08504 2026-05-14 cs.CL

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

Zeru Shi, Zhenting Wang, Fan Yang, Qifan Wang, Ruixiang Tang

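AI summary This paper studies the origin of massive activations in large language models and identifies a consistently observed Massive Emergence Layer (ME Layer), the key location where massive activations first appear and from which they propagate to deeper layers through residual connections. The RMSNorm and FFN parameters in this layer jointly produce the massive activations; once formed, they change little in later layers, which reduces the diversity of the hidden representations passed to the attention module. The authors propose a simple and effective method to reduce this rigidity, which improves model performance on multiple tasks and helps mitigate attention sinks.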

English abstract

We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, named the \textbf{Massive Emergence Layer (ME Layer)}, that is consistently observed across model families and where massive activations first emerge before propagating to deeper layers through residual connections. We show that, within the ME Layer, the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.

2605.08293 2026-05-14 cs.CV

Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation

Yijing Wang, Ruonan Li, Qilin Wang, Rongqiang Zhao, Jie Liu

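AI summary This paper proposes DDS, a lightweight framework for annotation-free 3D scene understanding. The method combines multi-granularity knowledge distillation with graph-diffusion-based segmentation, preserving the superpoint-based structural organization while incorporating visual semantic information to achieve region-consistent, semanticized 3D scene understanding. Experiments show that DDS outperforms existing methods on several real-world datasets across multiple metrics, providing a scalable and interpretable solution for annotation-free 3D scene understanding.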

English abstract

3D semantic scene understanding is essential for digital twins, autonomous driving, smart agriculture, and embodied perception, yet dense point-wise annotation for point clouds remains expensive and difficult to scale. Existing annotation-free methods often face a trade-off between semantic recognition and structural efficiency: open-vocabulary and foundation-model-driven methods provide strong semantic priors, but often come with substantial computational costs, while structure-oriented methods based on superpoints, clustering, and graph reasoning are lightweight but often produce category-agnostic regions. We propose DDS, a resource-efficient structure-oriented framework for region-consistent and semanticized annotation-free 3D scene understanding. DDS preserves the lightweight superpoint-based organization paradigm while incorporating visual semantic cues from projected features and segmentation-derived masks. It first performs multi-granularity distillation to guide the 3D backbone at the point, mask-prototype, and inter-prototype levels, then applies graph diffusion over superpoints to propagate semantic information directly in 3D, producing coherent region representations without costly spectral decomposition or dense open-vocabulary 3D feature fields. Finally, DDS uses segmentation-cluster association to assign interpretable semantic names to category-agnostic 3D clusters. Experiments on real-world datasets show that DDS achieves the best performance among representative structure-oriented annotation-free baselines, improving oAcc, mAcc, and mIoU by up to 5.9%, 8.1%, and 2.4%, respectively. These results demonstrate that DDS improves region consistency and lightweight semantic recognition, providing a scalable and interpretable solution for annotation-free 3D scene understanding.

2605.08078 2026-05-14 cs.CV cs.LG

Normalizing Trajectory Models

Jiatao Gu, Tianrong Chen, Ying Shen, David Berthelot, Shuangfei Zhai, Josh Susskind

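AI summary This paper proposes Normalizing Trajectory Models (NTM), a new generative model that addresses the performance drop of diffusion models under few sampling steps. NTM models each reverse step as a conditional normalizing flow trained with exact likelihood, retaining the full likelihood framework while improving generation efficiency. The model combines shallow invertible blocks with a deep parallel predictor, can be trained from scratch or initialized from pretrained flow-matching models, and uses self-distillation to generate high-quality images in only four steps, performing strongly on text-to-image tasks.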

Comments 25 pages, 10 figures; corrected typos and citations

English abstract

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.

2605.07483 2026-05-14 cs.LG cs.AI

Does Your Neural Network Extrapolate? Feature Engineering as Identifiability Bias for OOD Generalization

Leonel Aguilar, Jan Nagler, Christoph Hoelscher, Nino Antulov-Fantulin

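AI summary This paper studies why deep neural networks fail to generalize out of distribution (OOD), arguing that the root cause is that features learned from the training data do not reflect the true data-generating process (DGP). The authors propose that a structural commitment to a feature map, label map, and model class (φ, ψ, M) makes the assumed DGP explicit and thereby improves OOD generalization. Experiments show that the right feature representation and model choice substantially reduce OOD error, and the approach is validated on several natural-science and machine learning tasks.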

English abstract

Successful deep neural networks discover salient features of data. We show when and why they fail to learn out-of-distribution (OOD)-relevant representations from an in-distribution (ID) training window. This requires decoupling feature learning from data-generating-process (DGP) identifiability. From a single training window, OOD extrapolation is non-identifiable: infinitely many DGPs are $\varepsilon$-observationally equivalent on the training data but diverge arbitrarily outside it, and no in-distribution criterion alone reliably breaks the tie. A structural commitment, the feature map, label map, and model class $(φ, ψ, \mathcal{M})$, dictates the assumed DGP and governs OOD generalization while leaving ID performance essentially unchanged. When architecture, pretraining, augmentation, input formats, or domain knowledge implicitly inject the missing commitment, the model succeeds. When it cannot infer OOD-relevant structure from ID evidence, it fails. Changing only the representation can make the same architecture, at the same in-distribution loss, differ by ${\sim}520\times$ out of distribution. When the commitment is correct and identifiable, OOD error vanishes. For example, Fourier coordinates turn periodic extrapolation into interpolation on $\mathbb{S}^1$. The same mechanism predicts outcomes in three natural-science settings (mass-action chemistry; Kepler's-third-law exoplanet prediction, $n=2{,}362$; and cross-species coding-DNA detection) and in a 264-run positional-encoding study across Transformer, Mamba, and S4D. Finally, a controlled study shows: correct features are necessary but not sufficient. The model class must express the target, and the transformed training data must cover the relevant representation space.
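
The Fourier-coordinate example can be made concrete with a toy regression. The sketch assumes the period is known (that knowledge is the structural commitment), uses a phase-shifted sine as a hypothetical target, and compares a cubic polynomial on the raw coordinate with a linear model on $(\sin x, \cos x)$; none of these specifics come from the paper's experiments.

```python
import numpy as np

def fit_predict(features, x_train, y_train, x_test):
    """Ordinary least squares on the given feature map, evaluated on x_test."""
    weights, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)
    return features(x_test) @ weights

def raw_features(x):
    # Cubic polynomial in the raw coordinate: no periodicity assumption.
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

def fourier_features(x):
    # Coordinates on the circle S^1, assuming a known period of 2*pi.
    return np.column_stack([np.ones_like(x), np.sin(x), np.cos(x)])

def target(x):
    return np.sin(x + 0.3)  # hypothetical periodic data-generating process

x_train = np.linspace(0.0, 2 * np.pi, 200)        # in-distribution window
x_test = np.linspace(4 * np.pi, 6 * np.pi, 200)   # out-of-distribution window

err_raw = np.max(np.abs(fit_predict(raw_features, x_train, target(x_train), x_test) - target(x_test)))
err_fourier = np.max(np.abs(fit_predict(fourier_features, x_train, target(x_train), x_test) - target(x_test)))
print(f"raw OOD error: {err_raw:.3g}, Fourier OOD error: {err_fourier:.3g}")
```

In Fourier coordinates the test points land on the same circle the training points already cover, so the model interpolates; in the raw coordinate the same test points lie far outside the training range and the polynomial diverges.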

2605.07188 2026-05-14 cs.CV

PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset

Fuxin Duan, Hui Wang

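AI summary This paper proposes PicoEyes, a unified gaze estimation framework that predicts the key attributes of gaze directly from monocular or binocular inputs, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, and that handles calibration, gaze forecasting, and device posture variation within one end-to-end pipeline. The work also introduces a large-scale multi-view near-eye dataset with comprehensive 2D and 3D annotations under diverse conditions. Experiments show that PicoEyes outperforms existing academic and industrial gaze tracking methods in the no-calibration, calibration, rewear-after-calibration, and forecasting settings, offering a practical paradigm for robust and generalizable gaze estimation in mixed reality applications.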

Comments 15 pages, 10 figures, conference

English abstract

We present PicoEyes, a unified gaze estimation framework that directly predicts all key attributes of gaze, including 3D eye parameters, eye-region segmentation, optical axis, visual axis, and depth maps, from either monocular or binocular inputs. The framework simultaneously addresses calibration, gaze forecasting, and varying device postures, while also supporting 3D eye reconstruction via joint estimation of eye parameters and depth maps in an end-to-end manner. In addition, we introduce a large-scale multi-view near-eye dataset containing comprehensive 2D and 3D annotations under diverse conditions, including train, test, rewear-test, and calibration sessions. Extensive experiments demonstrate that PicoEyes achieves state-of-the-art performance, consistently outperforming both academic and industrial gaze tracking methods across no-calibration, calibration, rewear-after-calibration, and forecasting settings. This work establishes a practical, end-to-end paradigm for robust and generalizable gaze estimation in mixed reality (MR) applications.

2605.07161 2026-05-14 cs.AI

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

Jackson Clark, Yiming Su, Saad Mohammad Rafid Pial, Yifang Tian, Lily Gniedziejko, Hans-Arno Jacobsen, Yinfang Chen, Tianyin Xu

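AI summary This paper presents SREGym, a high-fidelity benchmark platform for evaluating AI Site Reliability Engineering (SRE) agents. Built on real-world cloud-native system stacks, SREGym simulates faults at multiple layers, ambient noise, and diverse failure modes, and provides 90 realistic, challenging SRE problems. The platform is modular and extensible, with support for fault injection and noise control. Results show that current frontier agents vary significantly across failure types, with differences of up to 40% in end-to-end results.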

English abstract

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to overly simplistic SRE tasks and are hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities vary significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.