arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10521 2026-05-12 cs.CV cs.AI

DuetFair: Coupling Inter- and Intra-Subgroup Robustness for Fair Medical Image Segmentation

Yiqi Tian, Sangjoon Park, Bo Zeng, Pengfei Jin, Yujin Oh, Quanzheng Li

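AI summary: Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance and overlook hidden failures inside a subgroup. This paper proposes the DuetFair mechanism, which jointly considers inter-subgroup adaptation and intra-subgroup robustness, and introduces FairDRO, which combines a distribution-aware mixture-of-experts with subgroup-conditioned distributionally robust optimization to improve fairness and segmentation performance across subgroups. Experiments show that FairDRO improves fairness and performance on several medical image segmentation benchmarks.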

Comments 16 pages, 2 figures

Abstract

Medical image segmentation models can perform unevenly across subgroups. Most existing fairness methods focus on improving average subgroup performance, implicitly treating each subgroup as internally homogeneous. However, this can hide difficult cases within a subgroup, where high-loss samples are obscured by the subgroup mean. We call this problem intra-group hidden failure. To solve it, we propose the DuetFair mechanism, a dual-axis fairness framework that jointly considers inter-subgroup adaptation and intra-subgroup robustness. Based on DuetFair, we introduce FairDRO, which combines distribution-aware mixture-of-experts (dMoE) with subgroup-conditioned distributionally robust optimization (DRO) loss aggregation. This design allows the model to adapt across subgroups while also reducing hidden failures within each subgroup. We evaluate FairDRO on three medical image segmentation benchmarks with varying degrees of within-group heterogeneity. FairDRO achieves the best equity-scaled performance on Harvard-FairSeg and improves worst-case subgroup performance on HAM10000 under both age- and race-based grouping schemes. On the 3D radiotherapy target cohort, FairDRO further improves worst-group Dice by 3.5 points ($\uparrow 6.0\%$) under the tumor-stage grouping and by 4.1 points ($\uparrow 7.4\%$) under the institution grouping over the strongest baseline.
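To make the "hidden failure" idea concrete, here is a minimal sketch of a subgroup-conditioned robust loss aggregation. The abstract does not say which DRO form FairDRO uses, so the CVaR-style tail mean below, the `alpha` parameter, and the function names are all illustrative assumptions rather than the authors' method. The toy data gives two subgroups with identical mean loss, one of which contains a single hard case.

```python
import math
from collections import defaultdict


def cvar(losses, alpha):
    """Average of the worst ceil(alpha * n) losses (a CVaR-style tail mean)."""
    k = max(1, math.ceil(alpha * len(losses)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def subgroup_dro_loss(losses, groups, alpha=0.5):
    """Apply the tail mean inside each subgroup, then average over subgroups.

    Assumption: this stands in for the paper's loss aggregation, which the
    abstract does not specify.
    """
    by_group = defaultdict(list)
    for loss, group in zip(losses, groups):
        by_group[group].append(loss)
    per_group = {g: cvar(v, alpha) for g, v in by_group.items()}
    return sum(per_group.values()) / len(per_group), per_group


# Hypothetical per-sample losses: both subgroups have mean loss 0.3,
# but subgroup A hides one high-loss sample behind three easy ones.
losses = [0.1, 0.1, 0.1, 0.9, 0.3, 0.3, 0.3, 0.3]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
total, per_group = subgroup_dro_loss(losses, groups, alpha=0.25)
```

With a plain subgroup mean, A and B look equally good; the tail mean separates them, which is the kind of within-group signal the abstract argues a mean-based fairness objective misses.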

2605.10518 2026-05-12 cs.CL cs.AI

Infinite Mask Diffusion for Few-Step Distillation

Jaehoon Yoo, Wonjung Kim, Chanhyuk Lee, Seunghoon Hong

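AI summary: This paper proposes the Infinite Mask Diffusion Model (IMDM), a new diffusion model aimed at few-step generation in language-model distillation. Conventional masked diffusion models (MDMs) use a deterministic single-state mask, so they are limited by factorization error and struggle with efficient few-step generation. IMDM introduces a stochastic infinite-state mask that lowers the theoretical error bound, improving generation efficiency while keeping the advantages of MDMs. Experiments show that IMDM outperforms existing distillation methods at small step counts on LM1B and OpenWebText.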

Abstract

Masked Diffusion Models (MDMs) have emerged as a promising alternative to autoregressive models in language modeling, offering the advantages of parallel decoding and bidirectional context processing within a simple yet effective framework. Specifically, their explicit distinction between masked tokens and data underlies their simple framework and effective conditional generation. However, MDMs typically require many sampling iterations due to factorization errors stemming from simultaneous token updates. We observe that a theoretical lower bound of the factorization error exists, which standard MDMs cannot reduce due to their use of a deterministic single-state mask. In this paper, we propose the Infinite Mask Diffusion Model (IMDM), which introduces a stochastic infinite-state mask to mitigate the theoretical bound while directly inheriting the benefits of MDMs, including the compatibility with pre-trained weights. We empirically demonstrate that MDM fails to perform few-step generation even in a simple synthetic task due to the factorization error bound, whereas IMDM can find an efficient solution for the same task. Finally, when equipped with appropriate distillation methods, IMDM surpasses existing few-step distillation methods at small step counts on LM1B and OpenWebText. Code is available at https://Ugness.github.io/official_imdm.

2605.10516 2026-05-12 cs.AI

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Harsh Raj, Niranjan Orkat, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn, Subhabrata Majumdar

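AI summary: This paper proposes a rigorous way to measure AI agent reliability, quantifying it as consistency under semantics-preserving perturbations. It introduces $U$-statistic-based output-level reliability evaluation and kernel-based trajectory-level stability analysis, and highlights the distinction between an agent's core capability and its execution robustness. Experiments show that trajectory-level consistency metrics have higher diagnostic sensitivity than traditional pass@1 rates, helping to identify and fix architectural issues that hinder deployment of agents in high-stakes real-world environments.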

Comments 33 pages, 5 figures, 2 tables

Abstract

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging $U$-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
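As a rough illustration of output-level consistency measured with a $U$-statistic, the sketch below averages a symmetric kernel over all unordered pairs of outputs. The exact-match kernel, the function names, and the sample answers are assumptions made for illustration; the abstract does not specify the kernel or the estimator the authors use.

```python
from itertools import combinations


def u_statistic(outputs, kernel):
    """Order-2 U-statistic: mean of a symmetric kernel over all unordered pairs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def exact_match(a, b):
    """Illustrative agreement kernel: 1.0 if two outputs are identical."""
    return 1.0 if a == b else 0.0


# Hypothetical final answers from one agent on four paraphrases of one task.
answers = ["42", "42", "41", "42"]
consistency = u_statistic(answers, exact_match)
```

Here three of the six pairs agree, so the estimate is 0.5 even though three of four individual answers match the majority; a pairwise statistic penalizes a single deviating run more visibly than a per-run pass rate would.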

2605.10510 2026-05-12 cs.LG cs.AI

CMKL: Modality-Aware Continual Learning for Evolving Biomedical Knowledge Graphs

Yousef A. Radwan, Yao Li, Qing Qing, Ziqi Xu, Qixin Zhang, Yongcheng Jing, Renqiang Luo, Xikun Zhang

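AI summary: This paper proposes CMKL, a continual learning framework for dynamically evolving biomedical knowledge graphs that uses structural, textual, and molecular information together. The method fuses modalities through a mixture-of-experts router and combines EWC regularization with a diverse multimodal replay buffer to protect learned knowledge and reduce forgetting. Experiments show a large gain on continual entity classification driven by multimodal features, while on continual relationship prediction CMKL matches the Naive Sequential and EWC baselines within seed noise and outperforms Joint Training and LKGE.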

Abstract

Biomedical knowledge graphs are increasingly large, dynamic, and multimodal, driven by rapid advances in biotechnology such as high-throughput sequencing. Machine learning models can infer previously unobserved biomedical relationships and characterize biomedical entities in these graphs, but existing knowledge graph embedding methods and their continual learning extensions either assume static graph structure or fail to exploit multimodal information under evolving data distributions. They also apply uniform regularization across all model parameters, ignoring that different modalities may exhibit distinct forgetting dynamics as the graph evolves. We propose the Continual Multimodal Knowledge Graph Learner (CMKL), a CL framework for biomedical KGs that natively encodes structure, text, and molecules, fuses them through a Mixture-of-Experts (MoE) router, and protects previously learned knowledge with standard EWC regularization and a K-means-diverse multimodal replay buffer. We evaluate CMKL on a 129K-entity biomedical continual benchmark with 10 tasks. On continual biomedical entity classification, CMKL reaches AP 0.591 versus 0.370 for the strongest structural baseline, a 60% gain that is driven by access to multimodal features and preserved across the sequence with near-zero forgetting (AF 0.008). On continual relationship prediction, CMKL reaches AP $0.062$, matching Naive Sequential and EWC (0.058) within seed noise and outperforming Joint Training (0.047, p=0.045) and LKGE (0.039). A frozen-text ablation reaches AP 0.136, more than double any jointly trained model, yet that signal is unreachable by margin-ranking gradients: the greedy-modality asymmetry lives at the representation level, not the fusion level, and MoE routing manages it by suppressing the unreachable modality without forcing it through a learned bottleneck. Code: github.com/yradwan147/cmkl-neurips2026
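For readers unfamiliar with the "standard EWC regularization" mentioned above, a minimal sketch of the usual elastic weight consolidation penalty follows: a quadratic pull toward the previous task's parameters, weighted per parameter by a diagonal Fisher estimate. The flat parameter lists, the `lam` value, and the numbers are illustrative assumptions and are not taken from the CMKL code.

```python
def ewc_penalty(params, anchor_params, fisher, lam):
    """Standard EWC term: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


# Hypothetical values: theta_star are the parameters after the previous task,
# fisher is a diagonal Fisher information estimate for those parameters.
theta_star = [1.0, -2.0, 0.5]
fisher = [4.0, 0.0, 1.0]
theta = [1.5, 3.0, 0.5]
penalty = ewc_penalty(theta, theta_star, fisher, lam=2.0)
```

The second parameter moves a long way but has zero Fisher weight, so it contributes nothing; only the first parameter is penalized. The penalty is added to the new task's loss during training.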

2605.10504 2026-05-12 cs.CL

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

Jinchang Zhu, Jindong Li, Yuwen Hao, Chengyu Zou, Rong Fu, Menglin Yang

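AI summary: This paper studies how premature specialization of upper-layer attention can hurt language model pretraining. The authors find that in GPT-style models, upper layers form sharp attention patterns before lower-layer features have stabilized, which degrades performance. Temporarily slowing the learning of upper-layer Q/K projections early in training improves final perplexity and downstream accuracy without altering other parameters. The study also identifies multiplicative gated feed-forward networks as the component that suppresses the upstream residual writes driving the failure, and identifies upper-layer Q/K timing as an important interaction point between decoder architecture and optimization.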

Abstract

A causal-decoder block is hierarchical: lower layers build the residual basis that upper layers attend over. We identify a failure mode in GPT pretraining: upper layers commit to sharp attention patterns before lower-layer features stabilize. We call this premature upper-layer attention specialization. Temporarily slowing only upper-layer Q/K projections during early training improves final perplexity and downstream accuracy without altering other parameters; it prevents upper attention from collapsing onto an immature residual basis. In LLaMA-style blocks, the same intervention is nearly unnecessary. Through ablations, we isolate multiplicative gated FFNs (not RMSNorm or bias removal) as the component that suppresses the upstream residual writes driving the failure. A pathwise analysis unifies both findings: the learning-rate intervention reduces a step-size factor, while gated FFNs reduce a residual-energy factor on the same growth pathway. Our results identify upper-layer Q/K timing as a concrete interaction point between decoder architecture and optimization.
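One way to picture the intervention is as a per-layer learning-rate multiplier applied only to Q/K projections. The sketch below is a guess at its simplest form: a step schedule that slows the upper half of the layers for a fixed number of steps. The cutoff, the slowdown factor, the step count, and the function name are all assumptions; the abstract gives none of these values and the authors may well use a different schedule.

```python
def qk_lr_multiplier(layer, step, n_layers,
                     upper_frac=0.5, slow_steps=1000, slow_factor=0.1):
    """Learning-rate multiplier for one layer's Q/K projections at a given step.

    Assumed schedule: layers in the top `upper_frac` of the stack use
    `slow_factor` for the first `slow_steps` steps, then return to 1.0.
    All other parameters would keep a multiplier of 1.0 throughout.
    """
    first_upper = int(n_layers * (1.0 - upper_frac))
    is_upper = layer >= first_upper
    if is_upper and step < slow_steps:
        return slow_factor
    return 1.0


# For a hypothetical 12-layer model, layers 6..11 count as "upper".
early_upper = qk_lr_multiplier(layer=10, step=0, n_layers=12)
early_lower = qk_lr_multiplier(layer=2, step=0, n_layers=12)
late_upper = qk_lr_multiplier(layer=10, step=1000, n_layers=12)
```

In a real training loop the multiplier would scale the base learning rate of a dedicated optimizer parameter group holding only the upper-layer Q/K weights.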

2605.10500 2026-05-12 cs.AI

SkillEvolver: Skill Learning as a Meta-Skill

Genrui Zhang, Erle Zhu, Jinfeng Zhou, Caiyan Jia, Hongning Wang

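AI summary: Most agent skills today are generated statically and cannot be improved from real use once created. This paper proposes SkillEvolver, a lightweight online skill learning method in which a meta-skill iteratively authors, deploys, and refines domain-specific skills so that they keep evolving. The method learns a skill's prose and code rather than model parameters, so the resulting skills can be used by any agent without retraining. Experiments show that SkillEvolver clearly outperforms both human-written skills and a no-skill baseline on multiple tasks.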

Abstract

Agent skills today are static artifacts: authored once, by human curation or one-shot generation from parametric knowledge, and then consumed unchanged, with no mechanism to improve from real use. We propose SkillEvolver, a lightweight, plug-and-play solution for online skill learning, in which a single meta-skill iteratively authors, deploys, and refines domain-specific skills. The learning target of SkillEvolver is the skill's prose and code, not model weights, so that the resulting artifact drops into any agent without retraining; and the meta-skill itself is just another skill, loaded through the same interface by any protocol-compliant CLI-agent. Unlike trace-distillation, the meta-skill refines only after deploying the learnt skill, such that the learning signal comes from failures another agent encounters while using it, not from exploratory traces alone. Refinement iterations are governed by a fresh-agent overfit audit that catches possible leakage as well as deployed-skill-specific failures, including the silent-bypass mode in which a skill appears valid in content but is never invoked at runtime. On $83$ SkillsBench tasks spanning $15^{+}$ domains, SkillEvolver reaches $56.8\%$ accuracy versus $43.6\%$ for curated human skills and $29.9\%$ for the no-skill baseline; on three GPU kernel optimization tasks from KernelBench, it also raises mean speedup from $1.16$ to $1.51$.

2605.10498 2026-05-12 cs.CV cs.AI stat.ML

Simultaneous Long-tailed Recognition and Multi-modal Fusion for Highly Imbalanced Multi-modal Data

Heegeon Yoon, Heeyoung Kim

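AI summary: This study targets highly imbalanced multi-modal data and proposes a new framework that handles long-tailed recognition and multi-modal fusion together. The method uses a multi-expert architecture, estimates the informativeness of each modality with modality-specific networks, and uses confidence-guided weights to adjust the fusion process dynamically, integrating multi-source data more effectively. Experiments show that it outperforms existing methods on benchmark and real-world datasets, demonstrating robustness and generalization in long-tailed classification.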

Abstract

Long-tailed distributions in class-imbalanced data present a fundamental challenge for deep learning models, which tend to be biased toward majority classes. While recent methods for long-tailed recognition have mitigated this issue, they are largely restricted to single-modal inputs and cannot fully exploit complementary information from diverse data sources. In this work, we introduce a new framework for long-tailed recognition that explicitly handles multi-modal inputs. Our approach extends multi-expert architectures to the multi-modal setting by fusing heterogeneous data into a unified representation while leveraging modality-specific networks to estimate the informativeness of each modality. These confidence-guided weights dynamically modulate the fusion process, ensuring that more informative modalities contribute more strongly to the final decision. To further enhance performance, we design specialized training and test procedures that accommodate diverse modality combinations, including images and tabular data. Extensive experiments on benchmark and real-world datasets demonstrate that the proposed approach not only effectively integrates multi-modal information but also outperforms existing methods in handling long-tailed, class-imbalanced scenarios, highlighting its robustness and generalization capability.

2605.10494 2026-05-12 cs.SD cs.AI

Multi-layer attentive probing improves transfer of audio representations for bioacoustics

Marius Miron, David Robinson, Masato Hagiwara, Titouan Parcollet, Jules Cauzinille, Gagan Narula, Milad Alizadeh, Ellen Gilsenan-McMahon, Sara Keen, Emmanuel Chemla, Benjamin Hoffman, Maddie Cusimano, Diane Kim, Felix Effenberger, Jane K. Lawton, Aza Raskin, Olivier Pietquin, Matthieu Geist

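AI summary: This paper studies how different probing strategies affect the transfer performance of audio representations on bioacoustic tasks, and shows that multi-layer attentive probes make better use of temporal information and improve downstream performance. It compares linear and attention probes on two bioacoustic benchmarks and finds that multi-layer probing outperforms conventional last-layer probing, and that attention probes outperform linear probes for transformer models. The work offers new methods and insights for evaluating and improving the transferability of audio representations.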

Abstract

Probing heads map the representations learned from audio by a machine learning model to downstream task labels and are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study different probing strategies across two bioacoustic benchmarks, BEANs and BirdSet. We evaluate last- and multi-layer probing, across linear and attention probes. We show that larger probe heads that leverage time information have superior performance. Our results suggest that current benchmarks may misrepresent encoder quality when relying on a last-layer probing setup. Multi-layer probing improves downstream task performance across all tested models, while attention probing has superior performance to linear probing for transformer models.

2605.10488 2026-05-12 cs.CL cs.AI

DeepRefine: Agent-Compiled Knowledge Refinement via Reinforcement Learning

Haoyu Huang, Jiaxin Bai, Shujie Liu, Yang Wei, Hong Ting Tsang, Yisen Gao, Zhongwei Xie, Yufei Li, Yangqiu Song

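AI summary: DeepRefine is an LLM-based reasoning model that aims to improve the quality of agent-compiled knowledge bases so that they better support downstream tasks in open-ended settings. It interacts with the knowledge base over multiple turns, performs abductive diagnosis, localizes likely defects, and executes targeted refinement actions, improving the knowledge base incrementally. To optimize the refinement policy without gold references, DeepRefine introduces a Gain-Beyond-Draft reward and is trained end-to-end with reinforcement learning; experiments show consistent gains over strong baselines.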

Abstract

Agent-compiled knowledge bases provide persistent external knowledge for large language model (LLM) agents in open-ended, knowledge-intensive downstream tasks. Yet their quality is systematically limited by incompleteness, incorrectness, and redundancy, manifested as missing evidence or cross-document links, low-confidence or imprecise claims, and ambiguity or coreference-resolution issues. Such defects compound under iterative use, degrading retrieval fidelity and downstream task performance. We present DeepRefine, a general LLM-based reasoning model for agent-compiled knowledge refinement that uses user queries to improve the quality of any pre-constructed knowledge base, making it more suitable for downstream tasks. DeepRefine performs multi-turn interactions with the knowledge base, conducts abductive diagnosis over the interaction history, localizes likely defects, and executes targeted refinement actions for incremental knowledge base updates. To optimize the refinement policies of DeepRefine without gold references, we introduce a Gain-Beyond-Draft (GBD) reward and train the reasoning process end-to-end via reinforcement learning. Extensive experiments demonstrate consistent downstream gains over strong baselines.

2605.10485 2026-05-12 cs.RO

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

Hao Wang, Xiaobao Wei, Jingyang He, Chengyu Bai, Chun-Kai Fan, Jiajun Cao, Jintao Chen, Ying Li, Shanyu Rong, Ming Lu, Xiaozhu Ju, Jian Tang, Shanghang Zhang

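AI summary: VEGA is a framework for improving the spatial awareness of vision-language-action (VLA) models, addressing the weak spatial understanding that results from the lack of 3D geometric supervision. It aligns the output of the VLA model's visual encoder with features from DINOv2-FiT3D, a model trained with multi-view consistent 3D Gaussian Splatting supervision, giving a more accurate and interpretable spatial alignment. VEGA aligns at the visual encoder output, which avoids interference from linguistic semantics, and the alignment module is removed at inference, adding no computational overhead; experiments show that it outperforms existing implicit spatial grounding baselines on both simulated and real-world tasks.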

Abstract

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmark and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state-of-the-art among implicit spatial grounding methods for VLA models.
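The cosine similarity alignment term can be sketched in a few lines. The form below (mean of one minus cosine similarity over paired feature vectors) is a common choice and is assumed here; the abstract does not give the exact loss, its weighting against the action objective, or the projector architecture, and the feature values are made up.

```python
import math


def cosine_alignment_loss(projected, target):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    `projected` stands for projector outputs computed from the VLA visual
    encoder, `target` for features from the 3D-aware teacher. Both are lists
    of equal-length, non-zero vectors (zero vectors are not handled).
    """
    total = 0.0
    for p, t in zip(projected, target):
        dot = sum(a * b for a, b in zip(p, t))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / norm
    return total / len(projected)


# Hypothetical 2-D features: parallel pairs give zero loss regardless of scale,
# an orthogonal pair gives a loss of one.
aligned = cosine_alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 1.0]])
orthogonal = cosine_alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Because the loss depends only on direction, the projector is free to match the teacher's feature geometry without matching its scale, and since the projector only feeds this loss it can be dropped at inference as the abstract describes.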

2605.10484 2026-05-12 cs.CV cs.RO

OpenSGA: Efficient 3D Scene Graph Alignment in the Open World

Gang Chen, Sebastián Barbas Laina, Stefan Leutenegger, Javier Alonso-Mora

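AI summary: This paper proposes OpenSGA, an efficient 3D scene graph alignment framework that addresses object-level relocalization and map fusion when robots revisit scenes in open environments. By fusing vision-language, textual, and geometric features with spatial context, the method aligns scene graphs accurately even under large coordinate discrepancies. The authors also build ScanNet-SG, a large-scale dataset with over 700k samples and a wide range of object categories, which strengthens training and evaluation for scene graph alignment. Experiments show that the method achieves the best overall performance on both frame-to-scan (F2S) and subscan-to-subscan (S2S) tasks.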

Comments 13 figures

Abstract

Scene graph alignment establishes object correspondences between two 3D scene graphs constructed from partially overlapping observations. This enables efficient scene understanding and object-level relocalization when a robot revisits a place, as well as global map fusion across multiple agents. Such capabilities are essential for robots that require long-term memory for long-horizon tasks involving interactions with the environment. Existing approaches mainly focus on subscan-to-subscan (S2S) alignment and depend heavily on geometric point-cloud features, leaving frame-to-scan (F2S) alignment and open-set vision-language features underexplored. In addition, existing datasets for scene graph alignment remain small-scale with limited object diversity, constraining systematic training and evaluation. We present a unified and efficient scene graph alignment framework that predicts object correspondences by fusing vision-language, textual, and geometric features with spatial context. The framework comprises modules such as a distance-gated spatial attention encoder, a minimum-cost-flow-based allocator, and a global scene embedding generator to achieve accurate alignment even under large coordinate discrepancies. We further introduce ScanNet-SG, a large-scale dataset generated via an automated annotation pipeline with over 700k samples, covering 509 object categories from ScanNet labels and over 3k categories from GPT-4o-based tagging. Experiments show that our method achieves the best overall performance on both F2S and S2S tasks, substantially outperforming existing scene graph alignment methods. Our code and dataset are released at: https://autonomousrobots.nl/paper_websites/opensga.

2605.10480 2026-05-12 cs.AI

ASIA: an Autonomous System Identification Agent

Dario Piga, Marco Forgione

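AI summary: This paper presents ASIA, an autonomous system identification agent framework that automates tedious steps of system identification such as model selection, choice of training algorithm, and hyperparameter tuning. The approach uses a large language model as an autonomous coding agent: given a natural-language description of the problem, it closes the loop from hypothesis to model evaluation without human intervention. The study evaluates ASIA on two system identification benchmarks, analyses its search behaviour and the model structures it discovers, and discusses the potential of the approach along with current limitations, including test leakage, reduced transparency, and reproducibility concerns.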

Abstract

Over the years, research in system identification has provided a rich set of methods for learning dynamical models, together with well-established theoretical guarantees. In practice, however, the choice of model class, training algorithm, and hyperparameter tuning is still largely left to empirical trial-and-error, requiring substantial expert time and domain experience. Motivated by recent advances in agentic artificial intelligence, we present ASIA, a framework that delegates this iterative search to a large language model acting as an autonomous coding agent. Building on existing agentic platforms, ASIA closes the loop between hypothesis, implementation, and evaluation without human intervention, requiring only a plain-English description of the identification problem. We conduct an empirical study of ASIA on two system identification benchmarks and analyse the agent's search behaviour, the architectures and training strategies it discovers, and the quality of the resulting models. We also discuss the potential of the approach and its current limitations, including implicit test leakage, reduced methodological transparency, and reproducibility concerns.

2605.10474 2026-05-12 cs.LG cs.AI

Formally Verifying Analog Neural Networks Under Process Variations Using Polynomial Zonotopes

Yasmine Abu-Haeyeh, Tobias Ladner, Matthias Althoff, Lars Hedrich

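AI summary: This paper studies verifying the behavior of analog neural networks under manufacturing process variations. It proposes a polynomial-based model of how the neuron circuit's performance varies and performs reachability analysis with polynomial zonotopes, enabling formal verification of the circuit-level model. The approach avoids conventional, time-consuming Monte Carlo simulation; experiments show verification in seconds while enclosing 99% of the process-variation samples.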

Abstract

Analog neural networks are gaining attention due to their efficiency in terms of power consumption and processing speed. However, since analog neural networks are implemented as physical circuits, they are highly sensitive to manufacturing process variations, which can cause large deviations from the nominal model. We present a polynomial-based model that approximates the performance of the neuron circuit under process variations. We then formally verify the behavior of the circuit-level model using reachability analysis with polynomial zonotopes, thus avoiding conventional, time-consuming Monte Carlo simulations. We evaluate our proposed verification approach on three different datasets, verifying both fully-connected and convolutional analog neural networks. Our experimental results confirm the effectiveness of our verification approach: it reduces the verification time from days to seconds while enclosing 99% of the variation samples.

2605.10470 2026-05-12 cs.CV

Adaptive Context Matters: Towards Provable Multi-Modality Guidance for Super-Resolution

Jinyi Luo, Minghao Liu, Yifan Li, Zejia Fan, Jiaying Liu

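AI summary: Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity. This paper gives the first theoretical modeling of multi-modal SR, shows that existing methods use modalities sub-optimally, and proposes the Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR), based on dynamic modality fusion. A spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism provide more precise risk control and optimize modality contributions. Experiments show clear gains in generalization and semantic consistency.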

Abstract

Super-resolution (SR) is a severely ill-posed problem with inherent ambiguity, as widely recognized in both empirical and theoretical studies. Although recent semantic-guided and multi-modal SR methods exploit large models or external priors to enhance semantic alignment, the fusion of heterogeneous modalities remains insufficiently understood in practice and theory. In this work, we provide the first theoretical modeling of multi-modal SR, revealing that prior methods are bottlenecked by sub-optimal modality utilization. Our analysis shows that the generalization risk bound can be improved by strengthening the alignment between modality weights and their effective contributions, while reducing representation complexity. This theoretical insight inspires us to propose the novel Multi-Modal Mixture-of-Experts Super-Resolution framework (M$^3$ESR) that employs generalization-oriented dynamic modality fusion for accurate risk control and modality contribution optimization. In detail, we propose a novel spatially dynamic modality weighting module and a temporally adaptive modality temperature scheduling mechanism, enabling flexible and adaptive spatial-temporal modality weighting for effective risk control. Extensive experiments demonstrate that M$^3$ESR significantly improves generalization and semantic consistency, confirming the superiority of our approach.

2605.10468 2026-05-12 cs.LG

Can Muon Fine-tune Adam-Pretrained Models?

Xingyu Qu, Peigeng Huang, Samuel Horvath

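AI summary: This paper studies the performance degradation that occurs when Muon replaces Adam for fine-tuning Adam-pretrained models. Through controlled experiments, the authors relate the degradation to a mismatch between the optimizers' implicit biases and show that constraining updates (for example with LoRA) mitigates it. The results shed light on how optimizer mismatch affects fine-tuning and how adjusting the update strategy can reduce its impact.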

Abstract

Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.

2605.10466 2026-05-12 cs.LG

Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

Haoren Xu, Guanhua Fang

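AI summary: This paper examines in-context learning (ICL) and repetitive generation in large language models and identifies a unified mechanism behind them. It shows that when the input satisfies certain statistical conditions, the self-attention output is approximately a linear readout of the input covariance matrix, which explains how models extract statistics from long contexts while forgetting token-level detail. The mechanism can implement a single step of population gradient descent and also gives a structural account of repetition, unifying two seemingly unrelated phenomena under a covariance-readout principle.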

Abstract

Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V \Sigma \Theta_K^{\top} \Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
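The link between an attention-style readout and one gradient descent step can be checked numerically in the simplest setting. The sketch below uses unnormalised linear attention on a finite context, which is a deliberate simplification and not the paper's softmax, long-context construction; the data, learning rate, and function names are illustrative assumptions. Starting from $w_0 = 0$ on the loss $\frac{1}{2n}\sum_i (w \cdot x_i - y_i)^2$, one step gives $w_1 = \frac{\eta}{n}\sum_i y_i x_i$, so the prediction at a query is a weighted sum of values $y_i$ with weights $x_i \cdot x_q$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Predict with w_1 = w_0 - lr * grad L(w_0), where w_0 = 0 and
    L(w) = 1/(2n) * sum_i (w . x_i - y_i) ** 2."""
    n = len(xs)
    dim = len(x_query)
    w = [lr * sum(y * x[j] for x, y in zip(xs, ys)) / n for j in range(dim)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """The same quantity written as an unnormalised attention readout:
    keys x_i, values y_i, query x_query."""
    n = len(xs)
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys)) / n


# Hypothetical noiseless in-context regression task with y_i = w_true . x_i.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_true = [2.0, -1.0]
ys = [dot(w_true, x) for x in xs]
x_q = [1.0, 2.0]
gd_pred = gd_one_step_prediction(xs, ys, x_q, lr=0.5)
attn_pred = linear_attention_prediction(xs, ys, x_q, lr=0.5)
```

Since $y_i = w^{*} \cdot x_i$ here, $w_1$ equals $\eta$ times the empirical second-moment matrix of the inputs applied to $w^{*}$, which is the finite-sample analogue of reading out the input covariance.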

2605.10464 2026-05-12 cs.CV

Automated Detection of Abnormalities in Zebrafish Development

Sarath Sivaprasad, Hui-Po Wang, Anna-Lisa Jäckel, Jonas Baumann, Carole Baumann, Jennifer Herrmann, Mario Fritz

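AI summary: This paper proposes a method for automated detection of developmental abnormalities in zebrafish embryos. To address the inefficiency of current manual assessment, the authors build a large-scale dataset of high-resolution microscopic image sequences covering normal development and compound exposure, with fine-grained temporal annotations. They also introduce a transformer-based model that integrates spatiotemporal features to predict developmental abnormalities early, reaching 98% accuracy on fertility classification (egg viability) and 92% on toxicity assessment, providing an effective tool for automated zebrafish toxicity analysis.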

Abstract

Zebrafish embryos are a valuable model for drug discovery due to their optical transparency and genetic similarity to humans. However, current evaluations rely on manual inspection, which is costly and labor-intensive. While machine learning offers automation potential, progress is limited by the lack of comprehensive datasets. To address this, we introduce a large-scale dataset of high-resolution microscopic image sequences capturing zebrafish embryonic development under both control conditions and exposure to compounds (3,4-dichloroaniline). This dataset, with expert annotations at fine-grained temporal levels, supports two benchmarking tasks: (1) fertility classification, assessing zebrafish egg viability (130,368 images), and (2) toxicity assessment, detecting malformations induced by toxic exposure over time (55,296 images). Alongside the dataset, we present the first transformer-based baseline model that integrates spatiotemporal features to predict developmental abnormalities at early stages. Experimental results demonstrate the model's effectiveness: it achieves 98% accuracy in fertility classification and 92% in toxicity assessment. These findings underscore the potential of automated approaches to enhance zebrafish-based toxicity analysis.

2605.10462 2026-05-12 cs.CL cs.LO

Coherency through formalisations of Structured Natural Language, A case study on FRETish

Joost J. Joosten, Marina López Chamosa, Sofía Santiago Fernández

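AI summary: This paper proposes a new formalisation guideline, "Coherency through Formalisations", which holds that the different levels of description used when turning natural-language requirements into a formal language should share roughly the same logical structure. Taking NASA's FRET tool and its controlled natural language FRETish as a case study, the authors propose a new automated translation into the formal language MTL and prove its equivalence to the original translation by model checking. Statistics appear to favor the new translation, and the process revealed inconsistencies in the formalisation, offering new ideas for improving formal methods.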

Abstract

Formalisation is the process of writing system requirements in a formal language. These requirements mostly originate in Natural Language. In the field of Formal Methods, formalisation is often identified as one of the most delicate and complicated steps in the verification process. Not infrequently, formalisation tools and environments use various levels of requirement description: Natural Language, Technical Language, Diagram Representations and Formal Language, to mention a few. In the literature, there are various maxims and principles of good practice to guide the process of requirement formalisation. In this paper we propose a new guideline: Coherency through Formalisations. The guideline states that the different levels of formalisation mentioned above should roughly follow the same logical structure. The principle seems particularly relevant in the setting where LLMs are prompted to perform reasoning tasks that can be checked by formal tools, using Structured Natural Language as an intermediate layer bridging both paradigms. In the light of coherency, we analyze NASA's Formal Requirement Elicitation Tool FRET and propose an alternative automated translation of the Controlled Natural Language FRETish to the formal language of MTL. We compare our translation to the original translation and prove equivalence using model checking. We report some statistics, which seem to favor the new translation. As expected, the translation process yielded interesting reflections and revealed inconsistencies, which we present and discuss.

2605.10458 2026-05-12 cs.LG cond-mat.mtrl-sci physics.chem-ph

QT-Net: Rethinking Evaluation of AI Models in Atomic Chemical Space

Pablo Martínez Crespo, Stefano Ribes, Martin Rahm, Richard Beckmann, Robert S. Jordan, Marisa Gliege, Santiago Miret, Vijay Kris Narasimhan, Rocío Mercado

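AI summary: This study addresses the evaluation of AI models at the atomic scale and proposes a held-out evaluation protocol based on SOAP descriptors to assess more accurately how well machine learning models generalize when predicting chemical properties such as atomic charges and multipoles. Using rigorous cross-validation and statistical testing, the authors compare E(3)-equivariant and non-equivariant models and, building on the results, propose QT-Net, a rotationally augmented, non-equivariant graph neural network. The model can infer atomic properties for QM9 molecules outside its training set, and these properties can improve downstream molecular property prediction, providing new inductive biases for atomic-scale molecular machine learning.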

Abstract

Atomic properties such as partial charges or multipoles encode chemically meaningful information that can inform downstream molecular property prediction, but their evaluation as machine learning targets has been complicated by the absence of a principled out-of-distribution evaluation protocol at the atomic level. In this work, we propose a held-out evaluation protocol that clusters atomic environments by SOAP descriptors and computes metrics accounting only for cluster labels unseen during training. Following this procedure, we use 5$\times$5 cross-validation and Tukey's HSD to run a statistically rigorous comparison of E(3)-equivariant against non-equivariant, rotationally augmented models for predicting electron populations and multipoles of H, C, N, and O atoms. Building on our results, we introduce the Quantum Topological Neural Network (QT-Net), a rotationally augmented, non-equivariant graph neural network. We show that QT-Net can be used to infer properties of atoms in molecules from QM9 outside our training set, and that these inferred properties can yield improvement when used as input features for downstream molecular property prediction. To further validate the framework, molecular dipole moments computed from QT-Net's per-atom outputs recover the ground-truth values reported in QM9. We release all code and data, including a JAX implementation of QT-Net, to support the broader use of learned QTA properties as inductive biases for atomic-scale molecular machine learning.
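The held-out protocol described above comes down to scoring only atoms whose environment cluster never appeared in training. A minimal sketch follows, assuming cluster labels have already been computed (the SOAP descriptors and the clustering step are not shown) and using mean absolute error as the metric; the function name, the metric choice, and the numbers are illustrative assumptions.

```python
def heldout_mae(y_true, y_pred, cluster_ids, train_clusters):
    """Mean absolute error restricted to atoms whose cluster label was
    never seen during training."""
    seen = set(train_clusters)
    errors = [
        abs(t - p)
        for t, p, c in zip(y_true, y_pred, cluster_ids)
        if c not in seen
    ]
    if not errors:
        raise ValueError("no atoms from unseen clusters in the test set")
    return sum(errors) / len(errors)


# Hypothetical per-atom targets and predictions; clusters 0 and 1 were seen in
# training, cluster 2 was not, so only the last two atoms are scored.
y_true = [0.10, -0.20, 0.30, 0.00]
y_pred = [0.10, -0.10, 0.50, 0.30]
test_clusters = [0, 1, 2, 2]
mae = heldout_mae(y_true, y_pred, test_clusters, train_clusters=[0, 1])
```

Averaging over all four atoms would give 0.15, while the held-out score is 0.25, showing how in-distribution atoms can flatter a model when the metric is not restricted to unseen environments.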

2605.10456 2026-05-12 cs.RO

Learning Point Cloud Geometry as a Statistical Manifold: Theory and Practice

Jinwoo Lee, Jiwoo Kim, Woojae Shin, Giseop Kim, Hyondong Oh

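AI summary This paper studies how to learn geometric structure from sparse, irregular LiDAR point clouds and proposes a mathematical formulation based on statistical manifolds. The core idea is to model the local geometry around each point as a Gaussian distribution, which yields a statistical-manifold representation. On this basis, the authors design Point-to-Ellipsoid (POLI), which uses self-supervised learning to predict per-point Gaussian geometry parameters from point clouds. It achieves robust geometry estimation without labeled data and consistently improves performance across several robotic perception tasks.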

Abstract

Point clouds are a fundamental representation for robotic perception tasks such as localization, mapping, and object pose estimation. However, LiDAR-acquired point clouds are inherently sparse and non-uniform, providing incomplete observations of the underlying scene geometry. This makes reliable geometric reasoning challenging and degrades downstream perception performance. Existing approaches attempt to compensate for these limitations by estimating local geometry, but often rely on hand-crafted statistics or end-to-end supervised learning, which can suffer from limited scalability or require large amounts of accurately labeled data. To address these challenges, we explicitly model point cloud geometry under a principled mathematical formulation. We represent local geometry as a statistical manifold induced by a family of Gaussian distributions, where each point is associated with a Gaussian capturing its local geometric structure. Based on this formulation, we introduce Point-to-Ellipsoid (POLI), a deep neural estimator that predicts per-point Gaussian geometry. POLI learns a mapping from point cloud observations to their underlying geometry in a self-supervised manner, removing the need for labeled data while preserving strong geometric inductive biases. The resulting representation integrates seamlessly into existing robotic perception pipelines without architectural modifications. Extensive experiments show that POLI enables accurate and robust geometry estimation and consistently improves performance across diverse robotic perception tasks.

2605.10455 2026-05-12 cs.LG

AxiomOcean: Forecasting the Three-Dimensional Structure of the Upper Ocean

Sensen Wu, Yifan Chen, Guantao Pu, Xiaoyao Sun, Yijun Chen, Jin Qi, Ming Kong, Keyi Yang, Lichen Xu, Wenguan Wang, Xiaofeng Li, Zhenhong Du

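AI summary AxiomOcean is a global AI ocean forecasting model designed to improve forecasts of the three-dimensional structure of the upper ocean. It uses a fully three-dimensional encoder-backbone-decoder architecture that explicitly represents vertical hierarchy and cross-layer dependence within the water column and, combined with surface atmospheric forcing, jointly predicts temperature, salinity, and three-dimensional currents. In 10-day forecasts, AxiomOcean outperforms an advanced AI comparison model, reducing day-1 RMSE by approximately 20 to 35% while maintaining higher anomaly correlation. It also better preserves eddy kinetic energy and temperature and salinity variance, improving both the physical consistency and the accuracy of the forecasts.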

Abstract

Short-term ocean forecast skill depends strongly on the three-dimensional structure of the upper ocean, which governs stratification, subsurface heat storage, and the response of the ocean to atmospheric forcing. However, AI ocean forecasting models often fail to preserve this vertical structure, resulting in over-smoothed subsurface features and weak physical consistency under strong forcing. Here, we present AxiomOcean, a global AI ocean forecasting model that explicitly represents vertical hierarchy and cross-layer dependence within the water column. By combining a fully three-dimensional encoder-backbone-decoder architecture with surface atmospheric forcing, AxiomOcean jointly predicts upper-ocean temperature, salinity, and three-dimensional currents at global 1/12° resolution down to 643 m depth. In 10-day forecasts, AxiomOcean outperforms an advanced AI comparison model across variables and lead times, reducing day-1 RMSE by approximately 20 to 35% while maintaining higher anomaly correlation. The gain is not achieved through excessive smoothing: AxiomOcean better preserves eddy kinetic energy and temperature and salinity variance. Its advantage also extends through the water column and remains evident across the equatorial Pacific, Kuroshio Extension, and Southern Ocean, yielding a more realistic reconstruction of upper-ocean heat content. These results show that explicitly preserving upper-ocean three-dimensional structure can improve both forecast accuracy and physical fidelity in AI ocean prediction.
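
The two skill metrics named in this abstract, RMSE and anomaly correlation, can be sketched as follows. This is an illustration only: the abstract does not state which variant of the anomaly correlation coefficient is used, so the uncentered form below is an assumption, and the function names and toy values are hypothetical.

```python
from math import sqrt


def rmse(forecast, truth):
    """Root-mean-square error between a forecast and the reference field."""
    return sqrt(sum((f - t) ** 2 for f, t in zip(forecast, truth)) / len(truth))


def anomaly_correlation(forecast, truth, climatology):
    """Uncentered anomaly correlation: both fields are compared after
    subtracting the same climatological mean (an assumed definition)."""
    fa = [f - c for f, c in zip(forecast, climatology)]
    ta = [t - c for t, c in zip(truth, climatology)]
    num = sum(a * b for a, b in zip(fa, ta))
    den = sqrt(sum(a * a for a in fa) * sum(b * b for b in ta))
    return num / den


def relative_reduction(baseline_error, model_error):
    """Fractional error reduction of a model relative to a baseline."""
    return (baseline_error - model_error) / baseline_error


# Toy temperatures at four grid points (degrees Celsius).
truth = [10.0, 12.0, 14.0, 16.0]
climatology = [11.0, 11.0, 15.0, 15.0]
forecast = [10.5, 12.5, 14.5, 16.5]

error = rmse(forecast, truth)
acc = anomaly_correlation(forecast, truth, climatology)
```

A forecast that collapses toward climatology can lower RMSE while its anomalies lose amplitude, which is why the abstract reports variance and eddy kinetic energy alongside these two scores.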

2605.10453 2026-05-12 cs.LG cs.CL

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin

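AI summary This paper proposes SlimSpec, a low-rank parameterization of the language-model head (LM-head) for accelerating speculative decoding. By compressing the draft model's inner representation rather than the output vocabulary, the method reduces the computational bottleneck while preserving full vocabulary support. Experiments across several target models and benchmarks show a 4 to 5 times acceleration over the standard LM-head architecture and an end-to-end speedup that surpasses existing methods by up to 8 to 9%. The method requires minimal changes to training and inference pipelines, making it a strong alternative across a wide variety of draft LM-head architectures.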

Abstract

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although the drafter network is small in modern architectures, its LM-head still performs projection to a large vocabulary, becoming one of the major computational bottlenecks. In prior work this issue has been predominantly addressed via static or dynamic vocabulary truncation. While they mitigate the bottleneck, these methods introduce extra complexity, such as special vocabulary curation, sophisticated inference-time logic, or modifications of the training setup. In this paper, we propose SlimSpec, a low-rank parameterization of the drafter's LM-head that compresses the inner representation rather than the output, preserving full vocabulary support. We evaluate our method with the EAGLE-3 drafter across three target models and diverse benchmarks in both latency- and throughput-bound inference regimes. SlimSpec achieves $4\text{-}5\times$ acceleration over the standard LM-head architecture while maintaining a competitive acceptance length, surpassing existing methods by up to $8\text{-}9\%$ in end-to-end speedup. Our method requires minimal adjustments of training and inference pipelines. Combined with the aforementioned speedup improvements, this makes SlimSpec a strong alternative across a wide variety of draft LM-head architectures.
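
To make the low-rank idea in this abstract concrete, the sketch below factors a dense head of shape hidden × vocab into two smaller matrices, hidden × rank and rank × vocab, and counts multiply-accumulate operations per token. The sizes, the rank, and all names are hypothetical, and the resulting ratio is not meant to reproduce the speedups reported in the abstract. The point is only that the output still covers the full vocabulary.

```python
def dense_head_macs(hidden, vocab):
    """Multiply-accumulates for one token through a dense LM-head."""
    return hidden * vocab


def low_rank_head_macs(hidden, vocab, rank):
    """Multiply-accumulates when the head is factored as (hidden x rank)(rank x vocab)."""
    return hidden * rank + rank * vocab


def vec_mat(v, m):
    """Row vector times matrix, written out in plain Python."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def low_rank_logits(h, a, b):
    """Project down to the rank dimension first, then up to the full vocabulary."""
    return vec_mat(vec_mat(h, a), b)


# Hypothetical sizes, chosen only for illustration.
hidden, vocab, rank = 4096, 128_000, 512
dense = dense_head_macs(hidden, vocab)
factored = low_rank_head_macs(hidden, vocab, rank)

# Tiny worked case: hidden=2, rank=1, vocab=3.
h = [1.0, 2.0]
a = [[0.5], [0.25]]
b = [[1.0, -1.0, 2.0]]
logits = low_rank_logits(h, a, b)
```

Unlike vocabulary truncation, the factored head returns one logit per vocabulary entry, so no token is excluded from drafting.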

2605.10451 2026-05-12 cs.LG cs.NA math.FA math.NA

Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs

Xuxiang Zhao, Angelica I. Aviles-Rivero

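AI summary This study targets a limitation of spectral neural operators for learning partial differential equations (PDEs): their reliance on fixed basis functions makes it hard to capture spatial heterogeneity and multiscale dynamics. It proposes ABLE, an adaptive basis learning framework that learns data-dependent spectral representations and constructs a spatially adaptive Parseval frame, so that the operator acts in a lifted spectral space while preserving invertibility and $O(N\log N)$ complexity. Experiments show that ABLE improves accuracy on multiple benchmark tasks, especially in regimes with sharp gradients and multiscale behavior, and that it can serve as a modular component that strengthens existing neural operator architectures.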

Comments 26 pages, 4 figures

Abstract

Spectral neural operators achieve strong performance for PDE learning, but rely on fixed global bases that limit their ability to represent spatially heterogeneous and multiscale dynamics. We propose Adaptive Basis Learning (ABLE), a framework that learns data-dependent spectral representations instead of relying on predefined bases. ABLE constructs a spatially adaptive Parseval frame via a learned ancillary density, enabling the operator to act in a lifted spectral space while preserving invertibility and maintaining $O(N\log N)$ complexity through FFT-based implementation. This shifts the source of expressivity from spectral coefficients to the representation itself, allowing the model to capture localized structures and non-translation-invariant interactions more efficiently. ABLE integrates seamlessly into existing neural operator architectures as a drop-in replacement for spectral layers. Across a range of benchmarks ABLE improves accuracy over strong baselines, with the largest gains in regimes characterized by sharp gradients and multiscale behavior. Moreover, augmenting existing models (e.g., U-FNO, HPM) with ABLE further enhances their performance, demonstrating its role as a general and complementary spectral refinement. Our results highlight that the data-driven choice of representation, rather than operator complexity alone, is a key bottleneck in neural operator design. By learning the basis itself, ABLE provides a principled and efficient framework for improving spectral methods in PDE learning.

2605.10449 2026-05-12 cs.CV

Automated high-frequency quantification of fish communities and biomass using computer vision

Kota Ishikawa, Takuma Masui, Keita Koeda, Rickdane Gomez, Lucas Yutaka Kimura, Michio Kondoh

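AI summary This study proposes an automated computer vision method for high-frequency quantification of underwater fish community structure and biomass. The method combines deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species, abundance, and biomass from video acquired with a stereo camera system. The authors monitored a reef fish community over 20 days, showing that the method captures dynamic changes in species richness, abundance, and biomass and is effective for non-invasive, continuous monitoring.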

Comments 21 pages, 3 figures, supplementary information under Ancillary files

Abstract

Quantifying fish community structure is essential for understanding biodiversity and ecosystem responses in a changing environment, yet existing survey methods provide limited high-frequency, quantitative observations. Conventional approaches, including catch-based methods, underwater visual censuses, and environmental DNA metabarcoding, either require intensive labor or lack reliable estimates of abundance and biomass. Here, we develop an automated framework for quantifying fish communities from underwater video using computer vision. Using videos acquired with a custom-made stereo camera system, the framework integrates deep learning-based fish identification, multi-object tracking, and 3D reconstruction to estimate species-level abundance and biomass. We applied the approach to a reef fish community over a 20-day period with hourly daytime observations, revealing dynamic fluctuations in species richness, abundance, and biomass associated with changes in species composition. By comparing fish communities estimated from visual census and environmental DNA surveys, we demonstrate that our method provides complementary strengths for continuous, non-invasive, and quantitative monitoring of consistently observed species. This approach provides a scalable foundation for long-term monitoring and advances the capacity to resolve fine-scale temporal dynamics in fish communities.

2605.10448 2026-05-12 cs.AI

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

Shanshan Gao, Liyi Zhou

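AI summary This paper examines the reliability of scores in interactive-agent benchmarks, noting that current benchmarks often rely on surface-level signals rather than the agent's actual action path, which can make scores inaccurate. The authors propose an outcome evidence reporting layer that requires no changes to tasks, agents, or evaluators. The layer specifies the evidence needed for verification, labels the evidence status of each run, and reports evidence-supported score bounds that reflect uncertainty. Experiments on several public benchmarks show that the method separates distinct failure modes and improves the transparency and credibility of evaluation.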

Abstract

Interactive agent benchmarks map an agent run to a binary outcome through outcome checks. When these checks rely on surface-level signals or fail to capture the agent's actual action path, they cannot reliably determine whether the run succeeded. For example, a benchmark task may ask whether Alice's shipping address was changed, while the outcome check only verifies that the agent clicked "Save." This does not guarantee that the intended state change occurred, since the agent may have modified the wrong record. Treating such a run as successful therefore makes the reported score misleading. Benchmark quality thus depends not only on task design, but also on the reliability of outcome detection. We address this problem by introducing an outcome evidence reporting layer for existing benchmarks, without modifying their tasks, agents, or evaluators. The layer performs three functions. First, before scoring, it specifies which stored artifacts are required to verify the claimed outcome for each case. Second, it applies a locked checklist to each completed run and assigns one of three evidence labels: Evidence Pass, Evidence Fail, or Unknown. Third, it reports evidence-supported score bounds that quantify uncertainty arising from Unknown cases. Rather than silently counting, discarding, or hiding uncertain cases inside a single aggregate success rate, the framework keeps them explicitly visible. We evaluate the outcome evidence layer on five public benchmarks: ANDROIDWORLD, AGENTDOJO, APPWORLD, tau3 bench retail, and MINIWOB. The resulting reports separate several empirically distinct failure modes.
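
One plausible reading of the score bounds described in this abstract is sketched below. The abstract does not give the formula, so the convention here is an assumption: the lower bound counts every Unknown run as a failure and the upper bound counts every Unknown run as a success. The label strings and the function name `evidence_bounds` are hypothetical.

```python
from collections import Counter

ALLOWED = {"pass", "fail", "unknown"}


def evidence_bounds(labels):
    """Return (lower, upper) success-rate bounds from per-run evidence labels.

    Assumed convention: Unknown runs count as failures in the lower bound
    and as successes in the upper bound.
    """
    if not labels:
        raise ValueError("no runs to score")
    unexpected = set(labels) - ALLOWED
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    counts = Counter(labels)
    total = len(labels)
    lower = counts["pass"] / total
    upper = (counts["pass"] + counts["unknown"]) / total
    return lower, upper


# Ten runs: six verified successes, two verified failures, two unverifiable.
labels = ["pass"] * 6 + ["fail"] * 2 + ["unknown"] * 2
lower, upper = evidence_bounds(labels)
```

The width of the interval, `upper - lower`, is the share of runs whose outcome the stored artifacts cannot confirm either way, which a single aggregate success rate would hide.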

2605.10445 2026-05-12 cs.CV

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

Zijun Shen, Sihan Yang, Ruichuan An, Ziyu Guo, Hao Liang, Ming Lu, Renrui Zhang, Wentao Zhang

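AI summary This paper proposes Sync-R1, an end-to-end reinforcement learning framework that bridges personalized understanding and generation through joint optimization. It introduces Sync-GRPO and Dynamic Group Scaling (DGS) to strengthen cross-task synergy and improve training efficiency, and it builds UnifyBench++, a dataset that better reflects real-world scenarios. Experiments show that Sync-R1 performs strongly in cross-task reasoning and personalized generation without requiring complex cold-start procedures.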

Abstract

Unified Multimodal Models (UMMs) excel in general tasks but struggle to bridge the gap between personalized understanding and generation. Prior works largely rely on implicit token-level alignment via supervised fine-tuning, which fails to fully capture the potential synergy between comprehension and creation. In this work, we propose Sync-R1, an end-to-end reinforcement learning framework that jointly optimizes personalized understanding and generation within a single, explicit reasoning loop. Through this unified feedback process, Sync-R1 enables personalized comprehension to guide content creation, while the resulting generation quality reciprocally refines understanding within an integrated reward landscape. To efficiently orchestrate this dual-task synergy, we introduce Sync-GRPO, a reinforcement learning method utilizing an ensemble reward system. Furthermore, we propose Dynamic Group Scaling (DGS), which adaptively filters low-potential trajectories to reduce gradient variance and accelerate convergence. To better reflect real-world complexity, we introduce UnifyBench++, featuring denser textual descriptions and richer user contexts. Experimental results demonstrate that Sync-R1 achieves state-of-the-art performance, showcasing superior cross-task reasoning and robust personalization without requiring complex cold-start procedures. The code and the UnifyBench++ dataset will be released at: https://github.com/arctanxarc/UniCTokens.

2605.10439 2026-05-12 cs.CV

Filtering Memorization from Parameter-Space in Diffusion Models

Yu Zhe, Yang Jiayan, Wei Junhao, Yu-Lin Tsai, Wang Chen

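AI summary This paper studies the problem that Low-Rank Adaptation (LoRA) modules in diffusion models can memorize training images, causing generated content to leak copyrighted or sensitive information. The authors propose Base-Anchored Filtering (BAF), a training-free and data-free post-hoc method that decomposes LoRA updates into spectral channels, measures their alignment with the principal subspace of the pretrained backbone, and suppresses weakly aligned channels that may carry memorized content. Experiments on multiple datasets and diffusion backbones show that BAF reduces memorization while preserving or improving generation quality.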

Abstract

Low-Rank Adaptation (LoRA) has become a widely used mechanism for customizing diffusion models, enabling users to inject new visual concepts or styles through lightweight parameter updates. However, LoRAs can memorize training images, causing generated outputs to reproduce copyrighted or sensitive content. This risk is particularly concerning in LoRA-sharing ecosystems, where users distribute trained LoRAs without releasing the underlying training data. Existing approaches for mitigating memorization rely on access to the training pipeline, training data, or control over the inference process, making them difficult to apply when only the released LoRA weights are available. We propose \textbf{Base-Anchored Filtering (BAF)}, a training-free and data-free framework for post-hoc memorization mitigation in diffusion LoRAs. BAF decomposes LoRA updates into spectral channels and measures their alignment with the principal subspace of the pretrained backbone. Channels strongly aligned with this subspace are retained as generalizable adaptations, while weakly aligned channels are suppressed as potential carriers of memorized content. Experiments on multiple datasets and diffusion backbones demonstrate that BAF consistently reduces memorization while preserving or even improving generation quality. Our code is available in the supplementary material.

2605.10438 2026-05-12 cs.LG cs.CV

Beyond Spatial Compression: Interface-Centric Generative States for Open-World 3D Structure

Xiang Chen, Alexander Binder

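AI summary Current 3D tokenizers mostly treat representation as spatial compression: they can reconstruct surface geometry but leave component ownership and attachment validity implicit. This paper proposes interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code, so that local geometry, component ownership, and attachment validity can be queried, constrained, and repaired during decoding. By introducing Component-Conditioned Canonical Local Tokens (C2LT-3D), the method improves structural robustness in open-world multi-component settings and shows that its latent states remain useful for assembly-level structural reasoning.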

Abstract

Current 3D tokenizers largely treat representation as spatial compression: compact codes reconstruct surface geometry, but leave component ownership and attachment validity implicit. In open-world assets with intersecting components, noisy topology, and weak canonical structure, this creates a representation mismatch: local shape, component identity, and assembly relations become entangled in a latent stream and are not natively addressable during decoding. We formulate an alternative view, interface-centric generative states, in which tokenization constructs an operational state rather than a passive compressed code. The state exposes local geometry, component ownership, and attachment validity as variables that can be queried, constrained, and repaired during decoding. We instantiate this formulation with Component-Conditioned Canonical Local Tokens (C2LT-3D), factorizing representation into canonical local geometry, partition-conditioned context, and relational seam variables. Each factor targets a distinct failure mode of compression-centric tokens: pose leakage, cross-component interference, or invalid local attachment. This exposed state supports attachment validation, latent structural repair, targeted intervention, and constrained serialization without a separate post-hoc structure recovery module. Trained on single-object CAD models and evaluated zero-shot on open-world multi-component assets, C2LT-3D improves structural robustness and shows that its latent variables remain actionable under adversarial attachment settings. These results suggest that open-world 3D generative representations should be evaluated not only by reconstruction fidelity, but by whether their discrete states remain operational for assembly-level structural reasoning.

2605.10434 2026-05-12 cs.CV

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

Keming Wu, Yijing Cui, Wenhan Xue, Qijie Wang, Xuan Luo, Zhiyuan Feng, Zuhao Yang, Sudong Wang, Sicong Jiang, Haowei Zhu, Zihan Wang, Ping Nie, Wenhu Chen, Bin Wang

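AI summary This paper introduces WorldReasonBench, which evaluates video generation models as predictors of future world states, focusing on their reasoning about physical, social, logical, and informational consistency. The benchmark contains 436 structured test cases and uses a human-aligned two-part evaluation that separately verifies the reasoning process and assesses video quality. The study reveals a significant gap between visual plausibility and world reasoning in current video generators, and it also provides WorldRewardBench for reward-model evaluation, supporting research on genuinely world-aware video generation.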

Comments Project Page: https://unix-ai-lab.github.io/WorldReasonBench/

Abstract

Commercial video generation systems such as Seedance2.0 and Veo3.1 have rapidly improved, strengthening the view that video generators may be evolving into "world simulators." Yet the community still lacks a benchmark that directly tests whether a model can reason about how an observed world should evolve over time. We introduce WorldReasonBench, which reframes video generation evaluation as world-state prediction: given an initial state and an action, can a model generate a future video whose state evolution remains physically, socially, logically, and informationally consistent? WorldReasonBench contains 436 curated test cases with structured ground-truth QA annotations spanning four reasoning dimensions and 22 subcategories. We evaluate generated videos with a human-aligned two-part methodology: Process-aware Reasoning Verification uses structured QA and reasoning-phase diagnostics to detect temporal and causal failures, while Multi-dimensional Quality Assessment scores reasoning quality, temporal consistency, and visual aesthetics for ranking and reward modeling. We further introduce WorldRewardBench, a preference benchmark with approximately 6K expert-annotated pairs over 1.4K videos, supporting pair-wise and point-wise reward-model evaluation. Across modern video generators, our results expose a persistent gap between visual plausibility and world reasoning: videos can look convincing while failing dynamics, causality, or information preservation. We will release our benchmarks and evaluation toolkit to support community research on genuinely world-aware video generation at https://github.com/UniX-AI-Lab/WorldReasonBench/.

2605.10419 2026-05-12 cs.CL cs.AI

Can Language Models Analyze Data? Evaluating Large Language Models for Question Answering over Datasets

Andreas Xenofontos, Pavlos Fafalios

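AI summary This paper studies how effective large language models are at question answering over datasets, examining two scenarios: answering questions about a dataset directly, and generating SQL queries from a database schema. It also evaluates how different prompting strategies affect model performance, with experiments on two datasets containing questions of varying difficulty. The results show that large language models perform strongly, while smaller, more resource-efficient models have clear limitations. These findings contribute to a better understanding of the potential and limits of LLMs in data analytics tasks.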

Comments Accepted for publication in CARMA 2026 proceedings

Abstract

This paper investigates the effectiveness of large language models (LLMs) in answering questions over datasets. We examine their performance in two scenarios: (a) directly answering questions given a dataset file as input, and (b) generating SQL queries to answer questions given the schema of a relational database. We also evaluate the impact of different prompting strategies on model performance. The study includes both state-of-the-art LLMs and smaller language models that require fewer resources and operate at lower computational and financial cost. Experiments are conducted on two datasets containing questions of varying difficulty. The results demonstrate the strong performance of the larger LLMs, while highlighting the limitations of smaller, more cost-efficient models. These findings contribute to a better understanding of how LLMs can be utilized in data analytics tasks and their associated limitations.
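
Scenario (b) in this abstract can be illustrated with a small sketch: a model receives a schema, returns a SQL string, and the query result is compared with a reference answer. Everything below is hypothetical, including the `sales` table, the hard-coded `generated_sql` string standing in for model output, and the use of exact result matching as the correctness criterion. The abstract does not describe the paper's actual datasets or scoring.

```python
import sqlite3


def run_query(conn, sql):
    """Execute a SQL string and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Hypothetical schema that would be placed in the prompt.
schema = "CREATE TABLE sales (region TEXT, amount REAL);"

conn = sqlite3.connect(":memory:")
conn.executescript(
    schema
    + """
    INSERT INTO sales VALUES ('north', 120.0), ('south', 80.0), ('north', 30.0);
    """
)

# Stand-in for the SQL a language model might return for the question
# "What is the total sales amount per region?"
generated_sql = (
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region;"
)

gold_answer = [("north", 150.0), ("south", 80.0)]
result = run_query(conn, generated_sql)
is_correct = result == gold_answer  # assumed criterion: exact result match
conn.close()
```

Comparing query results rather than query text lets differently worded but equivalent SQL count as correct, at the cost of needing a populated database for every question.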