arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2332
2605.11189 2026-05-13 cs.LG q-bio.BM

Deep Learning for Protein Complex Prediction and Design

Ziwei Xie

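AI summary: This thesis studies how to use deep learning to accurately model and design protein complex structures, a central problem in computational structural biology with major implications for understanding cellular function and developing therapeutics. It proposes deep learning architectures tailored to the hierarchical nature of protein structures, together with efficient search algorithms that locate interacting homologs in the vast sequence space, improving both complex structure prediction and protein sequence design.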

Comments PhD thesis

Abstract

Accurately modeling and designing protein complex structures is a central problem in computational structural biology, with broad implications for understanding cellular function and developing therapeutics. This thesis investigates two fundamental aspects of this problem using deep learning: domain-specific architectures that capture the hierarchical nature of protein structures, and search algorithms that efficiently navigate the vast sequence spaces of protein complexes to identify interacting homologs for improving complex structure prediction and to design protein sequences.

2605.11186 2026-05-13 cs.LG cs.AI

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

Yuning Han, Yangchenchen Jin, Dylan Zhao, Jingwei Sun

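AI summary: When large language models run on memory-limited devices, autoregressive decoding is bound by memory bandwidth, and existing speculative decoding methods usually assume that device memory can hold both the target model and an auxiliary draft model, which does not hold on edge devices. This paper proposes CATS, a cascaded adaptive tree speculation framework that performs cascaded verification and correction based on the memory budget and parameter offloading patterns, raising inference speed without increasing peak memory usage. Experiments on real edge devices show a speedup of up to 5.08x with no loss in generation quality, outperforming the state-of-the-art method by up to 1.45x.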

Abstract

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously -- an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
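
The draft-then-verify idea behind speculative decoding can be illustrated with a minimal greedy sketch. This is the generic technique only, not the cascaded scheme that CATS proposes; the `draft` and `target` callables are hypothetical stand-ins that map a token prefix to a greedy next token.

```python
from typing import Callable, List

# Hypothetical stand-in type: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target model agrees with.

    Returns the tokens appended in this step (always at least one).
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: shown as a loop for clarity. A real system would score
    # all k positions in a single batched pass of the target model.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)

    # Every draft token was accepted, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted
```

Because every accepted token equals what the target would have produced greedily, the output matches plain greedy decoding with the target model; the gain comes from amortizing target-model calls over several tokens.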

2605.11181 2026-05-13 cs.LG cs.AI cs.NA math.NA math.OC stat.ML

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

Zakhar Shumaylov, Nathaël Da Costa, Peter Zaika, Bálint Mucsányi, Alex Massucco, Yoav Gelberg, Carola-Bibiane Schönlieb, Yarin Gal, Philipp Hennig

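AI summary: This paper challenges the prevailing view that the Muon optimizer's success in non-Euclidean optimization rests on geometric structure, arguing that precise geometry is not the key factor in optimization performance. It introduces Freon, a family of optimizers based on Schatten (quasi-)norms, and finds that the best-performing parameters for GPT-2 lie in the quasi-norm regime, which classical LMO theory cannot explain. It then proposes Kaon, which replaces singular values with random noise yet still matches Muon, showing that strict geometric structure is unnecessary. Performance is instead governed mainly by two local quantities, alignment and descent potential, rather than by global geometry.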

Comments 45 pages

Abstract

The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods, and linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of exponents, we introduce Kaon, an absurd optimizer that replaces singular values with random noise. Despite lacking any coherent geometric structure, Kaon matches Muon's performance and retains classical convergence guarantees, proving that strict adherence to a precise geometry is practically irrelevant. Third, having shown that geometry is not the primary driver of performance, we demonstrate it is instead controlled by two local quantities: alignment and descent potential. Ultimately, each optimizer must tune its step size around these two quantities. While their dynamics are difficult to predict a-priori, evaluating them within a stochastic random feature model yields a precise insight: Muon succeeds not by tracking an ideal global geometry, but by guaranteeing step-size optimality.
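
The idea of reshaping a gradient's singular values can be sketched with an explicit SVD. This is an illustration under stated assumptions, not the paper's method: the power map `s ** p`, the function name `spectral_update`, and the noise range are all hypothetical, and the abstract says Freon uses a QDWH-based iterative approximation whose exact parameterization is not given there.

```python
import numpy as np


def spectral_update(grad, p=0.0, rng=None):
    """Return U diag(f(s)) V^T for a gradient matrix with thin SVD U diag(s) V^T.

    p = 1 leaves the singular values unchanged (plain gradient direction).
    p = 0 sets every nonzero singular value to 1 (orthogonalized update,
    which is how Muon is commonly described).
    If rng is given, the singular values are replaced by random noise,
    in the spirit of the Kaon experiment.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if rng is not None:
        # Illustrative noise range; an assumption, not taken from the paper.
        new_s = rng.uniform(0.5, 1.5, size=s.shape)
    else:
        # Hypothetical interpolation between the two endpoints above.
        new_s = np.where(s > 1e-12, s ** p, 0.0)
    return (u * new_s) @ vt
```

An explicit SVD is used here only because it makes the spectrum visible; practical optimizers avoid it in favor of iterative matrix routines.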

2605.11178 2026-05-13 cs.LG cs.AI math.RT

Oversmoothing as Representation Degeneracy in Neural Sheaf Diffusion

Arif Dönmez, Axel Mosig, Ellen Fritsche, Katharina Koch

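AI summary: This paper studies oversmoothing in Neural Sheaf Diffusion (NSD) and interprets it as a degeneration of representation geometry. By identifying cellular sheaves on graphs with representations of the associated incidence quiver, the authors expose the algebraic structure of the harmonic space that NSD reaches in the diffusion limit, and show that learned sheaf geometries can collapse toward low-complexity representations that lose discriminative information. They introduce moment-map-inspired regularizers that bias restriction maps toward more balanced geometries, analyze a stability obstruction in equal-stalk architectures, and argue for non-uniform stalk dimensions. Experiments suggest that breaking stalk-dimension symmetry can reduce variance or improve validation behavior.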

Comments 15 pages, Comments welcome

Abstract

Neural Sheaf Diffusion (NSD) generalizes diffusion-based Graph Neural Networks by replacing scalar graph Laplacians with sheaf Laplacians whose learned restriction maps define a task-adapted geometry. While the diffusion limit of NSD is known to be the space of global sections, the representation-theoretic structure of this harmonic space remains largely implicit. We develop a quiver-theoretic interpretation of NSD by identifying cellular sheaves on graphs with representations of the associated incidence quiver. Under this correspondence, learned sheaf geometries become points in a finite-dimensional representation space. We show that direct-sum decompositions of the underlying incidence-quiver representation induce decompositions of the harmonic space reached in the diffusion limit. This gives an algebraic interpretation of oversmoothing as representation degeneration: learned sheaves may collapse toward low-complexity summands whose global sections fail to preserve discriminative information. Building on this viewpoint, we connect sheaf diffusion to stability and moment-map principles from Geometric Invariant Theory. We introduce moment-map-inspired regularizers that bias restriction maps toward balanced representation geometries, and identify a structural obstruction in equal-stalk architectures: when $d_v = d_e$, admissibility for learnable stability parameters forces the trivial all-object summand onto a stability wall. Non-uniform stalk dimensions remove this obstruction, making adaptive stability meaningful. Experiments on heterophilic benchmarks are consistent with this mechanism: breaking stalk symmetry can reduce variance or improve validation behavior, and adaptive stability becomes more effective in selected rectangular settings. Overall, our framework reframes oversmoothing as a degeneration phenomenon in the representation geometry underlying learned sheaf diffusion.
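
A sheaf Laplacian and its space of global sections can be made concrete with a small NumPy sketch. It assumes every vertex and edge stalk has the same dimension `d` (the equal-stalk case), and the function name and argument layout are hypothetical. The Laplacian is built as $L = \delta^\top \delta$ from the coboundary map $\delta$, and the global sections are the kernel of $L$.

```python
import numpy as np


def sheaf_laplacian(n_nodes, edges, maps, d):
    """Build L = delta^T delta for a cellular sheaf on a graph.

    edges: list of (u, v) node index pairs.
    maps:  list of (F_u, F_v) pairs of d x d restriction maps, one pair per edge.
    The coboundary acts as (delta x)_e = F_v x_v - F_u x_u.
    """
    delta = np.zeros((len(edges) * d, n_nodes * d))
    for k, (u, v) in enumerate(edges):
        f_u, f_v = maps[k]
        delta[k * d:(k + 1) * d, u * d:(u + 1) * d] = -f_u
        delta[k * d:(k + 1) * d, v * d:(v + 1) * d] = f_v
    return delta.T @ delta


def global_section_dim(laplacian, tol=1e-9):
    """Dimension of ker(L), i.e. of the space of global sections."""
    return int(np.sum(np.linalg.eigvalsh(laplacian) < tol))
```

With identity restriction maps on a connected graph, the construction reduces to the ordinary graph Laplacian tensored with the identity, so the kernel has dimension `d`: only constant signals survive diffusion, which is the classical oversmoothing picture.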

2605.11172 2026-05-13 cs.LG

Optimistic Dual Averaging Unifies Modern Optimizers

Thomas Pethick, Wanyun Xie, Roman Machacek, Volkan Cevher

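AI summary: This paper proposes SODA, a generalization of Optimistic Dual Averaging that unifies state-of-the-art optimizers such as Muon, Lion, AdEMAMix, and NAdam. Building on this framework, the authors propose a practical SODA wrapper that removes the need to tune weight decay by using a theoretically grounded $1/k$ decay schedule. Experiments show that SODA improves performance across scales and training horizons without additional hyperparameter tuning.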

Abstract

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.

2605.11169 2026-05-13 cs.AI

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

Sheldon Yu, Junda Wu, Xintong Li, Nikki Lijing Kuang, Sizhe Zhou, Tong Yu, Jiawei Han, Jingbo Shang, Julian McAuley

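AI summary: This paper proposes OLIVIA, an online action adaptation framework for ReAct-style LLM agents that improves their decision making at deployment time. OLIVIA models the agent's action-selection layer as a contextual linear bandit with upper-confidence-bound (UCB) exploration, using frozen hidden states as decision contexts, so that action selection can be adjusted directly and uncertainty estimated while the original reasoning process is preserved. Experiments on four benchmarks show consistent improvements over static ReAct and prompt-based adaptation methods, supporting efficient, fine-grained, uncertainty-aware online adaptation during deployment.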

Abstract

Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.
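
A contextual linear bandit with upper-confidence-bound exploration can be sketched with the textbook LinUCB algorithm. This is a generic illustration, not OLIVIA's implementation: the per-action parameterization, the class name, and the hyperparameters `alpha` and `ridge` are assumptions, and the context vector `x` merely stands in for a frozen hidden state.

```python
import numpy as np


class LinUCB:
    """Textbook disjoint LinUCB: one ridge-regression model per candidate action."""

    def __init__(self, n_actions, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # width of the exploration bonus
        self.A = [ridge * np.eye(dim) for _ in range(n_actions)]  # per-action Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_actions)]        # per-action reward-weighted sum

    def scores(self, x):
        """Estimated reward plus an uncertainty bonus for each action."""
        out = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                                  # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)        # shrinks as data accumulates
            out.append(theta @ x + bonus)
        return np.array(out)

    def select(self, x):
        return int(np.argmax(self.scores(x)))

    def update(self, action, x, reward):
        """Lightweight online update from action-level feedback."""
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x
```

A deployment loop would call `select` on each decision context, execute the chosen action, and pass the observed reward to `update`; no gradient step through the language model is involved.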

2605.11167 2026-05-13 cs.CL cs.AI cs.LG

The Bicameral Model: Bidirectional Hidden-State Coupling Between Parallel Language Models

Cedric Flamant, Udaya Ghai, Kanna Shimizu

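AI summary: This paper proposes the Bicameral Model, which couples the intermediate hidden states of two pretrained language models bidirectionally through a trainable neural interface, letting them coordinate over a continuous, concurrent channel instead of by generating text. Both models run in lockstep at every generation step: the primary model drives the task while the auxiliary model handles tool calls, constraint solving, or code execution, and each conditions on the other through a translation network and a learned suppression gate. Experiments on arithmetic, logic grid puzzles, and mathematical reasoning show improved performance, demonstrating the approach's effectiveness for multi-model collaboration.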

Comments 9 pages main text, 5 figures, 24 pages appendix

Abstract

Existing multi-model and tool-augmented systems communicate by generating text, serializing every exchange through the output vocabulary. Can two pretrained language models instead coordinate through a continuous, concurrent channel? The Bicameral Model couples two frozen language models through a trainable neural interface on their intermediate hidden states. At every generation step, both models run in lockstep: a primary model drives the task while an auxiliary model operates tools, solves constraints, or executes code, with both conditioning on each other's activations through a translation network and a learned suppression gate ($\sim$1% of combined parameters). The gate learns a selective communication protocol from task loss alone, without a prescribed format. We demonstrate the mechanism across three tool backends. On arithmetic, coupling two 0.5B models with a calculator raises accuracy from 36% to 96%. On logic grid puzzles, coupling two 0.6B models with a Z3 solver achieves $1.7\times$ the unaugmented baseline on ZebraLogic. On mathematical reasoning, coupling with a Python sandbox enables the auxiliary to generate problem-specific code from hidden-state signals alone, without ever seeing the problem text.

2605.11166 2026-05-13 cs.CV

Unpacking the Eye of the Beholder: Social Location, Identity, and the Moving Target of Political Perspectives

Elena Sirotkina

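AI summary: This paper studies how political and social identities shape people's evaluations of political information, and notes that conventional computational tools tend to ignore these differences. The author proposes the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from a large set of evaluations by U.S. adults to predict how groups defined by political and social identities will evaluate the same image. The method preserves systematic disagreement between groups and shows that the meaning of a political image is a moving target: understanding what an image conveys requires accounting for the audience's identity.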

Abstract

Political and social identities structure how people evaluate political information, a finding decades deep in political science and routinely discarded by computational tools that often produce single scores that treat a piece of text, an image, or a video as if it means the same thing to everyone. This paper shows that it does not, and that the difference is consequential. To address this problem, I develop the Perspectivist Visual Political Sentiment (PVPS) classifier, which learns from approximately 82,000 evaluations by 5,575 U.S. adults to predict how audiences defined by political and social identities will evaluate the same image. Unlike standard tools that average systematic disagreement away, PVPS preserves it, returning an evaluative profile that records who agrees, who diverges, and along which identity lines. Applied to several influential studies of visual sentiment, PVPS shows that perceived violence in protest imagery and the emotional mechanisms behind protest image engagement both change substantively once audience identity is taken into account. It follows that what a political image conveys is a moving target, and measuring it requires knowing whom it is moving.

2605.11161 2026-05-13 cs.LG cs.AI

Interpretability Can Be Actionable

Hadas Orgad, Fazl Barez, Tal Haklay, Isabelle Lee, Marius Mosbach, Anja Reusch, Naomi Saphra, Byron Wallace, Sarah Wiegreffe, Eric Wong, Ian Tenney, Mor Geva

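AI summary: This position paper examines the practical value of interpretability research on deep neural networks and argues that the field lacks evaluation criteria that tie interpretability to real decisions and interventions. The authors propose actionability as the core criterion, define actionable interpretability along two dimensions (concreteness and validation), and analyze the barriers to practical use. They further identify five domains where interpretability offers unique leverage and present an evaluation framework aligned with practical outcomes, aiming to establish actionability as a core objective of interpretability research.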

Comments Accepted to ICML 2026

Abstract

Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability--the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions--concreteness and validation--and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.

2605.11159 2026-05-13 cs.LG

CORE: Cyclic Orthotope Relation Embedding for Knowledge Graph Completion

Yingqi Zeng, Luying Wang, Huiling Zhu

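AI summary: Knowledge graph completion aims to infer missing facts in multi-relational data automatically by mapping entities and relations into continuous representation spaces. Existing region-based embedding models either suffer from absolute boundary constraints during optimization or let regions expand without limit. To address this, the paper proposes CORE, which embeds entities and relations on a boundary-less torus manifold and represents relations as cyclic orthotopes, so that regions wrap seamlessly around spatial boundaries and gradients propagate smoothly; an adaptive width regularization prevents unconditional region expansion. Theoretical analysis shows that CORE can capture complex relation patterns including subsumption and intersection, and experiments on four benchmark datasets show highly competitive results, with notably better link prediction accuracy in dense semantic environments.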

Abstract

Knowledge graph completion (KGC) aims to automatically infer missing facts in multi-relational data by mapping entities and relations into continuous representation spaces. Recent region-based embedding models have shown great promise in capturing complex logical patterns by representing relations as geometric regions. However, these models inevitably suffer from absolute boundary constraints during optimization. Conversely, without such constraints, relation regions expand indefinitely. To address the limitation, we propose CORE (Cyclic Orthotope Relation Embedding), a novel KGC model that embeds entities and relations onto a boundary-less torus manifold. CORE represents relations as cyclic orthotopes on the torus manifold, allowing regions to seamlessly wrap around spatial boundaries to ensure smooth gradient conduction. Furthermore, an adaptive width regularization is introduced to prevent unconditional region expansion. Theoretical analysis proves that CORE can capture various complex relation patterns such as subsumption and intersection. Extensive experiments on four benchmark datasets demonstrate that CORE achieves highly competitive performance, significantly improving link prediction accuracy in dense semantic environments.
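
How a box-shaped region can wrap around the boundary of a torus can be shown with a small membership test. The center and half-width parameterization and both function names are assumptions made for illustration; the abstract does not specify how CORE parameterizes or scores its regions.

```python
import math
from typing import Sequence

TWO_PI = 2.0 * math.pi


def circular_distance(a: float, b: float) -> float:
    """Shortest angular distance between two angles on a circle."""
    d = (a - b) % TWO_PI
    return min(d, TWO_PI - d)


def in_cyclic_orthotope(point: Sequence[float],
                        center: Sequence[float],
                        half_width: Sequence[float]) -> bool:
    """True if the point lies inside the wrapped box in every coordinate.

    Each coordinate is an angle, so an interval centered near 0 also
    covers angles just below 2*pi without any special boundary case.
    """
    return all(circular_distance(p, c) <= w
               for p, c, w in zip(point, center, half_width))
```

Since distance is measured along the circle, an interval centered at 0.1 with half-width 0.3 contains the angle `2*pi - 0.1`, which a box in flat Euclidean coordinates would place on the opposite side of the space.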

2605.11153 2026-05-13 cs.CL cs.LG cs.NE

Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary

Ramchand Kumaresan

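AI summary: This paper decomposes the performance of an evolutionary mixture-of-LoRA architecture on a specific base model into three factors: a router rewrite, a per-domain evaluation scope, and a lifecycle policy. The experiments attribute the performance gain to the router rewrite, find no detectable effect from the per-domain evaluation scope, and find that the lifecycle is a net drag. The study also shows that evolutionary search on the routing channel is effective only when adapters are pre-aligned to the task.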

Abstract

We decompose an evolutionary mixture-of-LoRA system on a from-scratch ~150M-parameter widened-D substrate (D=1536, V=32000; D/V approx 0.048; the "widened-1536" substrate) into three factors -- a router rewrite (parallel sigmoid gate with learnable per-adapter floor and bounded temperature anneal, fed post-stack hidden states rather than token-embedding means), a per-domain leave-one-out evaluation scope, and a lifecycle of death plus alpha-blend inheritance plus SVD mutation plus slot reallocation -- and report a 5-of-8 partial 2^3 factorial run at n=3 seeds and 25000 adaptation steps per cell. The attribution chain is sharp on this substrate: the router rewrite carries the entire +0.0426 nat balanced log-PPL improvement (Delta = log PPL_ref - log PPL_test, positive = improvement; t=12.86, p=0.006) attributed to "the full evolutionary system vs the static B3 baseline"; the headline full-system-vs-B3 balanced contrast itself is +0.015 nats, t=1.94, p=0.19 at n=3 and does not clear alpha=0.05. The per-domain evaluation scope is null at seed-resolution, and the lifecycle is a net drag of approx -0.028 nats (t=-4.46,p=0.047 in the primary chain). An auxiliary alpha=0 inheritance counterfactual at n=3 seeds is sign-inconsistent at the headline metric and underpowered for either an equivalence or load-bearing conclusion (corrected from an earlier arithmetic-mean aggregator that erroneously cleared inheritance; see Appendix B.11). A base-perturbation probe directionally refutes a "genomic-context" reframe of the lifecycle role. A controllable synthetic sandbox locates a substrate-conditional regime boundary: evolutionary search on the routing channel is load-bearing only when adapters are pre-aligned to the task; in every other regime tested it underperforms, ties, or actively degrades the gradient solution.

2605.11144 2026-05-13 cs.RO

Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation

Kaixin Jia, Jiacheng Xu

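AI summary: This paper proposes Forecast-aware Gaussian Splatting (Forecast-GS), a predictive 3D representation framework for language-guided robotic pick-and-place manipulation. By explicitly modeling the task-completed state, the method improves the robot's ability to assess action feasibility under partial observations. In real-world trials on three tasks it achieves higher success rates than the ReKep baseline, indicating that it offers an interpretable bridge between language understanding, 3D perception, and robot planning.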

Abstract

We introduce Forecast-aware Gaussian Splatting (Forecast-GS), a predictive 3D representation framework for language-conditioned robotic manipulation. While recent manipulation systems have made progress by grounding language instructions into robot affordances, value maps, or relational keypoint constraints, they usually reason over the current scene and do not explicitly model the task-completed state. This limitation is critical when success depends on satisfying spatial and semantic goals under partial observations, where the robot must evaluate whether a candidate action leads to a feasible task-consistent outcome. We validate Forecast-GS on real-world pick-and-place manipulation tasks, including Cutter-to-Box, Apple-to-Bowl, and Sponge-to-Tray. For each task, we conduct 25 real-world trials under varied initial object configurations using the same robot platform and sensing setup. Forecast-GS with automatic candidate selection achieves success rates of 21/25, 23/25, and 16/25 on the three tasks, respectively, outperforming the ReKep baseline, which achieves 15/25, 19/25, and 10/25. A diagnostic human-assisted setting further improves success rates to 23/25, 24/25, and 19/25, suggesting that candidate generation is effective while automatic ranking remains imperfect. These results suggest that explicitly forecasting task-completed 3D states enables more reliable action evaluation, while the gap between automatic and human-assisted selection indicates that robust final-state ranking remains an important challenge for fully autonomous manipulation. Overall, Forecast-GS provides an interpretable bridge between language understanding, 3D perception, and robotic manipulation planning.

2605.11143 2026-05-13 cs.CL cs.AI cs.IR

ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

Alex Stinard

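AI summary: This paper presents ClinicalBench, a benchmark for evaluating assertion-aware retrieval in cross-admission clinical question answering, focused on answer errors caused by negation, temporality, and patient-versus-family attribution when retrieving from real electronic health records. The approach builds a patient knowledge graph (EpiKG) in which every fact carries an assertion label and a temporality tag, and combines it with intent-aware retrieval-augmented generation (KG-RAG) to improve accuracy. Experiments report gains across multiple large language models, expose the limitations of auto-generated reference answers, and underline the importance of physician adjudication in clinical QA evaluation.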

Comments 46 pages including appendices (two-column preprint format). Under review at JAMIA. Code, frozen evaluator, and benchmark released at https://huggingface.co/datasets/alexstinard/epikg-clinicalbench. ClinicalBench v2 is a 400-question MIMIC-IV stress test for assertion-aware retrieval

Abstract

Reasoning benchmarks measure clinical performance on clean inputs. We evaluate the step before reasoning: retrieval over real EHR notes, where negation, temporality, and family-versus-patient attribution can flip a correct answer to a wrong one. EpiKG carries an assertion label and a temporality tag with every fact in a patient knowledge graph, then routes retrieval by question intent. ClinicalBench is a 400-question test over 43 MIMIC-IV patients across 9 assertion-sensitive categories. A 7-condition ablation tests each piece of EpiKG across six LLMs (Claude Opus 4.6, GPT-OSS 20B, MedGemma 27B, Gemma 4 31B, MedGemma 1.5 4B, Qwen 3.5 35B). Three physicians blindly adjudicated 100 paired items. The author-blind primary endpoint, leave-author-out paired exact McNemar on 50 unanimous-strict items rated by two external physicians, yields +22.0 percentage points (95 percent Newcombe CI [+5.1, +31.5], p=0.0192). The architectural novelty, intent-aware KG-RAG over a Contriever dense-RAG baseline (C2b to C4g_kw on the change-excluded n=362 endpoint), is +8.84 percentage points (paired McNemar p=1.79e-3); +12.43 percentage points under oracle intent. Sensitivities agree directionally: three-rater physician majority +24.0 percentage points (subject to single-author circularity); deterministic keyword reproducibility proxy +39.5 percentage points. Across the six models, the gain shrinks as the LLM-alone baseline rises (beta=-1.123, r=-0.921, p=0.009). With n=6 this looks more like regression to the mean than encoding substituting for model size. Physician adjudication identified 56 percent of auto-generated reference answers as defective, a methodological finding indicating that NLP-pipeline clinical-QA benchmarks require physician adjudication to be usable. ClinicalBench, the frozen evaluator, three-rater adjudication data, and the EpiKG output stack are publicly released.
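
The paired exact McNemar test named above compares two conditions on the same items using only the discordant pairs. The sketch below is the standard two-sided exact (binomial) form; the function name and the counts `b` and `c` are illustrative, and nothing here reproduces the paper's data or its leave-author-out procedure.

```python
from math import comb


def exact_mcnemar(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: items where only condition A is correct.
    c: items where only condition B is correct.
    Under the null hypothesis, each discordant item falls on either
    side with probability 1/2, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)  # double the smaller tail, capped at 1
```

Items on which both conditions agree do not enter the statistic, which is why the test suits before-and-after comparisons on a fixed question set.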

2605.11142 2026-05-13 cs.LG

Rank Is Not Capacity: Spectral Occupancy for Latent Graph Models

Nikolaos Nakis, Panagiotis Promponas, Konstantinos Tsirkas, Katerina Mamali, Eftychia Makri, Leandros Tassiulas, Nicholas A. Christakis

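AI summary: This paper examines the latent dimension in graph representation learning, a conventional hyperparameter, and notes that it does not coincide with the quantity that actually governs model behavior. The authors propose Spectra, a spectral method that replaces rank as the unit of analysis with the spectrum of a learned positive semidefinite kernel and uses the normalized eigenvalues to build a controllable training-time coordinate, so that model capacity can be steered during training. On several network datasets the method makes the trade-off between predictive performance and model capacity visible, offering a principled handle on capacity in the overparameterized regime.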

Comments Preprint

Abstract

Graph representation learning has become a standard approach for analyzing networked data, with latent embeddings widely used for link prediction, community detection, and related tasks. Yet a basic design choice, the latent dimension, is still treated as a brittle hyperparameter, fixed before training and tuned by held-out performance. Learned factors are also identifiable only up to rotation and rescaling, so the nominal rank rarely coincides with the quantity that governs model behavior. We propose Spectral Prefix Extraction and Capacity-Targeted Representation Analysis (Spectra), which replaces rank as the unit of analysis with the spectrum of a learned positive semidefinite kernel, trace-normalized so that spectra are comparable across fits. The normalized eigenvalues form a distribution on the simplex, and their Shannon effective rank acts both as a summary of learned capacity and as a controllable training-time coordinate: a single scalar shapes this realized dimension during training, and bisection targets any desired value within the rank cap. To theoretically support that, we show local regularity and monotonicity of the realized-dimension profile. Across collaboration, social, biological, and infrastructure networks, Spectra traces performance--capacity frontiers that make the trade-off between predictive accuracy and realized dimension visible. It performs competitively with strong link-prediction baselines, yields aligned lower-capacity views of the same fitted model through spectral prefixes, and provides a principled handle on capacity in the overparameterized regime. Capacity thus becomes a property of the fitted model rather than a hyperparameter of the training.
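
The Shannon effective rank of a trace-normalized positive semidefinite kernel can be computed in a few lines. The sketch assumes the common definition, the exponential of the Shannon entropy of the normalized eigenvalues; the abstract does not spell out its formula, and the function name is hypothetical.

```python
import numpy as np


def shannon_effective_rank(kernel):
    """exp(entropy) of the trace-normalized eigenvalues of a PSD matrix."""
    eig = np.linalg.eigvalsh(kernel)   # real eigenvalues of a symmetric matrix
    eig = np.clip(eig, 0.0, None)      # remove tiny negative round-off
    p = eig / eig.sum()                # trace normalization: a point on the simplex
    p = p[p > 0]                       # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))
```

A kernel with four equal eigenvalues has effective rank 4 and a rank-one kernel has effective rank 1, while uneven spectra land in between. The quantity is also unchanged by rotating or uniformly rescaling the factors, which is what makes spectra comparable across fits.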

2605.11136 2026-05-13 cs.AI

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

Yaolun Zhang, Tianyi Xu, Shengyu Dai, Zhenwen Shao, Qingyun Wu, Huazheng Wang

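AI summary: This paper proposes EVOCHAMBER, a training-free framework for test-time co-evolution of multi-agent systems at the individual, team, and population levels. Its core component, CODREAM, is triggered after team failure or disagreement: agents reflect collaboratively and route the distilled insights asymmetrically across agents, preserving specialization while filling knowledge gaps. Experiments show gains on math, code, and multi-domain reasoning tasks, and several stable specialist agents emerge spontaneously, a structural signature of multi-agent evolution.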

Abstract

We argue that multi-agent test-time evolution is not single-agent evolution replicated N times. A single-agent learner can only evolve its own context and memory. A multi-agent system additionally evolves who collaborates, how they collaborate, and how knowledge flows across the population. These components have no single-agent counterpart and can produce phenomena such as emergent specialization. Yet prior test-time methods either confine experiences to individual agents, forfeiting cross-agent learning, or broadcast symmetrically to all agents, erasing the specialization that makes collaboration valuable. We present EVOCHAMBER, a training-free framework that instantiates test-time evolution at three levels over a coevolving agent pool. At its core is CODREAM (Collaborative Dreaming), a post-task protocol triggered on team failure or disagreement, in which agents collaboratively reflect, distill insights, and route them asymmetrically from strong to weak agents on the failed niche, preserving specialization while filling knowledge gaps. Team-level operators assemble niche-conditioned teams and select collaboration structures online. Population-level lifecycle operators fork, merge, prune, and seed agents under performance pressure. On three heterogeneous task streams with Qwen3-8B, EVOCHAMBER reaches 63.9% on competition math, 75.7% on code, and 87.1% on multi-domain reasoning, outperforming the best baseline by 32% relative on math and confirming asymmetric cross-agent transfer as the primary driver in ablation. Starting from several identically initialized agents, four to five stable niche specialists spontaneously emerge, a structural signature of multi-agent evolution that no single-agent learner can express. See our code at: https://github.com/Mercury7353/EvoChamber

2605.11133 2026-05-13 cs.LG math.DG

Steerable Neural ODEs on Homogeneous Spaces

Emma Andersdotter, Daniel Persson, Fredrik Ohlsson

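AI summary: This paper proposes steerable neural ordinary differential equations (steerable neural ODEs) on homogeneous spaces $M=G/H$, building into the model how feature vectors transform under the local symmetry group $H$. Features are interpreted as sections of vector bundles over the homogeneous space and their evolution as parallel transport, which yields a coupled system of differential equations: a flow equation on the space and a steering equation on the features. The model is $G$-equivariant when specific invariance conditions hold, providing a geometric foundation for learning continuous-time equivariant dynamics of general vector-valued features on homogeneous spaces.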

Comments 39 pages, 3 figures

Abstract

We introduce steerable neural ordinary differential equations on homogeneous spaces $M=G/H$. These models constitute a novel geometric extension of manifold neural ordinary differential equations (NODEs) that transport associated feature vectors transforming under the local symmetry group $H$. We interpret features as sections of associated vector bundles over $M$, and describe their evolution as parallel transport. This results in a coupled system of ODEs consisting of a flow equation on $M$ and a steering equation acting on features. We show that steerable NODEs are $G$-equivariant whenever the vector field generating the flow and the connection governing parallel transport are both $G$-invariant. Furthermore, we demonstrate how steerable NODEs incorporate existing NODE models and continuous normalizing flows on Lie groups. Our framework provides the geometric foundation for learning continuous-time equivariant dynamics of general vector-valued features on homogeneous spaces.

2605.11131 2026-05-13 cs.CV

USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation

Elisha Dayag, Nhat Thanh Tran, Jack Xin

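AI summary: This paper proposes USEMA, a scalable and efficient Mamba-like attention architecture for medical image segmentation, motivated by the quadratic computational cost of standard vision transformers. It combines local window attention with theoretically consistent arithmetic averaging to capture both local features and global information, and merges this attention with convolutional neural networks in a hybrid UNet architecture. Experiments across modalities and image sizes show better computational efficiency than transformer models with full self-attention and better segmentation performance than purely convolutional and Mamba-based models.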

Abstract

Accurate medical image segmentation is an integral part of the medical image analysis pipeline that requires the ability to merge local and global information. While vision transformers are able to capture global interactions using vanilla self-attention, their quadratic computational complexity in the input size remains a struggle for medical image segmentation tasks. Motivated by the dispersion property of vanilla self-attention and recent development of Mamba form of attention, Scalable and Efficient Mamba like Attention (SEMA) utilizes token localization via local window attention to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture global aspect of attention. In this work, we present USEMA, a hybrid UNet architecture that merges the local feature extraction ability of convolutional neural networks (CNNs) with SEMA attention. We conduct experiments with USEMA across a variety of modalities and image sizes, demonstrating improved computational efficiency compared to transformer based models using full self-attention, and superior segmentation performance relative to purely convolution and Mamba-based models.

2605.11128 2026-05-13 cs.CL

Sampling More, Getting Less: Calibration is the Diversity Bottleneck in LLMs

Amin Banayeeanzade, Qingchuan Yang, Dhruv Tarsadiya, Fatemeh Bahrani, Leonardo Blas, Alfy Samuel, Robin Jia, Meisam Razaviyayn, Sai Praneeth Karimireddy

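AI summary: This study examines why the diversity of large language model (LLM) outputs collapses during generation and traces the cause to miscalibrated probability distributions at inference time. It proposes a validity-diversity framework that attributes diversity collapse to how the model allocates probability mass between valid and invalid continuations during decoding, and decomposes it into two forms of miscalibration: order calibration and shape calibration. Experiments show that this miscalibration is widespread across language models of different families and scales and is not merely an artifact of the sampling strategy.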

Abstract

Diversity is essential for language-model applications ranging from creative generation to scientific discovery, yet modern LLMs often collapse into a narrow subset of plausible outputs. While prior work has developed benchmarks for measuring this lack of diversity, less is known about how the step-by-step probability distributions at inference time cause the problem. We introduce a validity--diversity framework that attributes diversity collapse to how an LLM allocates probability mass across valid and invalid continuations during decoding. This framework decomposes the bottleneck into two complementary forms of miscalibration. First, order calibration: valid tokens are not reliably ranked above invalid tokens, so rank-based cutoff rules must trade off between recovering valid continuations and admitting invalid ones. Second, shape calibration: probability mass is overly concentrated only on few valid continuations while having a heavy-tail of mixed valid and invalid tokens, so maintaining high validity limits diversity. We formalize both mechanisms and show that local failures compound across decoding steps, producing strong sequence-level losses in diversity. Empirically, we develop controlled diagnostics for probing these bottlenecks, including tasks with exactly known valid sets and oracle cutoff baselines. Across 14 language models spanning multiple families and scales, we find that diversity collapse is not merely a limitation of particular sampling heuristics, but a consequence of order and shape miscalibration in the LLM distribution.
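
The tension that a rank-based cutoff faces when valid tokens are not ranked above invalid ones can be seen in a toy example. The function name, the two reported quantities, and the toy distribution are all hypothetical; the paper's formal definitions of validity and diversity may differ.

```python
from typing import Dict, Set, Tuple


def topk_tradeoff(probs: Dict[str, float], valid: Set[str], k: int) -> Tuple[float, float]:
    """Evaluate a top-k cutoff on one next-token distribution.

    Returns (validity, coverage):
      validity = share of the kept probability mass that lies on valid tokens,
      coverage = fraction of the valid tokens that survive the cutoff.
    """
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    kept_mass = sum(probs[t] for t in kept)
    valid_mass = sum(probs[t] for t in kept if t in valid)
    covered = sum(1 for t in kept if t in valid)
    return valid_mass / kept_mass, covered / len(valid)


# Toy distribution: the invalid token "x" outranks the valid tokens "b" and "c".
toy_probs = {"a": 0.50, "x": 0.25, "b": 0.15, "c": 0.10}
toy_valid = {"a", "b", "c"}
```

On this toy distribution, `k=1` keeps only valid mass but covers a third of the valid tokens, while covering all of them requires `k=4`, which also admits the invalid token. No single cutoff achieves both, which is the order-miscalibration trade-off described above.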

2605.11119 2026-05-13 cs.RO

ASIP-Planner: Adaptive Planning for UAV Surface Inspection in Partially Known Indoor Environments

Hanyu Jin, Zhefan Xu, Haoyu Shen, Xinming Han, Kanlong Ye, Kenji Shimada

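AI summary: This paper proposes ASIP-Planner, a planning framework for UAV surface inspection in partially known indoor environments, aimed at the occluded views and degraded inspection quality caused by temporary obstacles. The method combines a segment-based global coverage planner with an inspection-oriented local view-angle adaptation module, generating collision-free trajectories and adjusting the view angle in real time to reduce the effect of occlusion while preserving the planned trajectory structure. Experiments show that the framework achieves high inspection coverage and trajectory efficiency in both simulation and real flight tests, improving UAV inspection performance and adaptability in partially known structured indoor environments.

Details
English abstract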

Inspection of indoor infrastructure, such as tunnels and industrial facilities, requires systematic surface coverage to ensure that all inspection targets are properly observed. Unmanned Aerial Vehicles (UAVs) offer an alternative to manual inspection by conducting map-guided surface inspection using prior structural models. However, in practice, indoor inspection often relies on floorplan-derived reference maps that may not reflect unforeseen obstacles, such as temporary structures or equipment, leading to occluded viewpoints and degraded inspection quality. Existing coverage planning methods typically assume a fully known inspection environment and perform deterministic global viewpoint optimization based on accurate prior maps, making them vulnerable to environmental discrepancies during execution. This work presents an adaptive UAV inspection framework for partially known structured indoor environments. The proposed method integrates a segment-based global coverage planner with an inspection-oriented local view-angle adaptation module. The global planner organizes planar inspection targets into surface-aligned clusters to generate compact viewpoint sequences with improved orientation consistency. The local planner generates collision-free trajectories and adjusts the viewing direction online to mitigate occlusion-induced coverage loss while preserving the planned trajectory structure. Simulation results across randomized scene configurations demonstrate that the proposed global planner achieves near-complete coverage while reducing trajectory length compared to representative baselines. Real-world flight experiments further validate that the framework produces usable inspection data for downstream analysis. These results indicate that the proposed framework improves inspection efficiency and adaptability in partially known structured indoor environments.

2605.11117 2026-05-13 cs.LG cs.MA math.PR

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

Juan Diego Toscano, Zhaojie Chai, George Em Karniadakis

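AI summary: GRAFT-ATHENA is a self-improving agentic team framework for autonomous scientific discovery and evolutionary numerical algorithms. By mapping combinatorial decision spaces onto factored probabilistic trees, it sharply reduces the parameter footprint and can accumulate and share methodological experience across domains. The study shows strong results for GRAFT-ATHENA on several physics-informed machine learning benchmarks and real engineering problems, including autonomously proposing regularization constraints and discovering new numerical methods, laying a foundation for autonomous laboratories.

Details
English abstract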

Scientific discovery can be modeled as a sequence of probabilistic decisions that map physical problems to numerical solutions. Recent agentic AI systems automate individual scientific tasks by orchestrating LLM-driven planners, solvers, and evaluators. Each method is a combination of methodological actions, with many viable combinations for any given problem and structural dependencies between choices. However, existing frameworks treat each problem in isolation, with no shared substrate to accumulate methodological experience across domains. Here we show that GRAFT-ATHENA, a self-improving agentic framework, learns from past problems and autonomously expands its own action space across diverse domains. GRAFT (Graph Reduction to Adaptive Factored Trees) projects combinatorial decision spaces into factored probabilistic trees in which each method is a single path, taking the parameter footprint from exponential to linear. In the lineage of classical Bayesian networks, the factorization is an $I$-map of the policy, and the resulting paths embed as unique fingerprints in a metric space whose closeness lets each new problem learn from similar past ones. On canonical physics-informed machine learning (PIML) benchmarks, GRAFT-ATHENA improves over human and prior agentic baselines, and on production solvers, it tackles complex engineering problems such as reconstructing Mach-10 flow over the Apollo Command Module from a 1968 report and recovering shear-thinning blood-cell rheology. Notably, the system grows its own knowledge substrate, autonomously proposing regularization constraints for ill-posed inverse problems and discovering new numerical methods such as a spectral PINN with exponential convergence. These results provide a foundation for autonomous laboratories that grow more capable with every problem they solve.
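
The "exponential to linear" claim in this abstract can be made concrete with a small counting sketch. It assumes n sequential decisions with k options each and a chain-shaped factorization in which every decision depends only on the previous one; this is a textbook Bayesian-network illustration, not GRAFT's actual construction, and both function names are hypothetical.

```python
def joint_table_params(n_decisions, n_options):
    """Free parameters of a full joint distribution over all combinations."""
    return n_options ** n_decisions - 1


def chain_factored_params(n_decisions, n_options):
    """Free parameters when each decision depends only on its predecessor.

    Assumed chain structure: one root distribution plus one conditional
    table per remaining decision.
    """
    root = n_options - 1
    per_child = n_options * (n_options - 1)
    return root + (n_decisions - 1) * per_child


for n in (3, 5, 10):
    print(n, joint_table_params(n, 4), chain_factored_params(n, 4))
```

Under these assumptions the joint table grows as k^n while the factored form grows linearly in n, which is the scaling behavior the abstract describes.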

2605.11115 2026-05-13 cs.CV cs.GR cs.LG

LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

Pedram Fekri, WenChen Li, William Chen, Peter Altamirano

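AI summary: This paper proposes LatentHDR, a new framework for generating high-quality high dynamic range (HDR) images. The method decouples scene generation from exposure modeling in latent space: a pretrained diffusion model produces a coherent scene representation, and a lightweight conditional latent-to-latent mapping module deterministically maps it to exposure-specific representations, yielding a structurally consistent multi-exposure stack in a single generation pass. The method substantially lowers computational cost, improves generation efficiency, and achieves leading dynamic range and perceptual quality on several benchmarks.

Details
English abstract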

High Dynamic Range (HDR) generation remains challenging for generative models, which are largely limited to low dynamic range outputs. Recent diffusion-based approaches approximate HDR by generating multiple exposure-conditioned samples, incurring high computational cost and structural inconsistencies across exposures. We propose LatentHDR, a framework that decouples scene generation from exposure modeling in latent space. A pretrained diffusion backbone produces a single coherent scene representation, while a lightweight conditional latent-to-latent head deterministically maps it to exposure-specific representations. This enables the generation of a dense, structurally consistent exposure stack in a single pass. This design eliminates multi-pass diffusion, ensures cross-exposure alignment, and enables scalable HDR synthesis. LatentHDR supports both text- and image-conditioned HDR generation for perspective and panoramic scenes. Experiments on synthetic data and the SI-HDR benchmark show that LatentHDR achieves state-of-the-art dynamic range with competitive perceptual quality, while reducing computation by an order of magnitude. Our results demonstrate that high-quality HDR generation can be achieved through structured latent modeling, challenging the need for stochastic multi-exposure generation.

2605.11114 2026-05-13 cs.RO cs.AI

SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection

Tianchonghui Fang, Yuan Zhuang, Fei Miao

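AI summary: This study proposes SEVO, a semantic-enhanced virtual observation method that improves the robustness of vision-language-action (VLA) manipulation on low-cost robots across environments. SEVO uses body-fixed cameras that cover the manipulation workspace, active red-spectrum illumination that normalizes object appearance, and real-time semantic segmentation that provides a background-invariant cue; combined with a diversified data collection strategy, it markedly improves generalization. Experiments show that, with the policy architecture unchanged, SEVO greatly raises grasp success rates in both the training environment and novel environments, confirming the importance of observation design and data diversity for reliable manipulation on low-cost robots.

Details
English abstract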

Vision-Language-Action (VLA) and imitation-learning policies trained via community toolchains on low-cost hardware frequently fail when deployed outside the training environment. Existing evaluations, including the original ACT and SmolVLA benchmarks, demonstrate high success rates under controlled, fixed backgrounds, yet community practitioners report near-zero transfer to new environments. We present SEVO (Semantic-Enhanced Virtual Observation), a data-centric approach that improves cross-environment manipulation robustness without modifying the policy architecture. SEVO transforms the raw RGB camera stream through three mechanisms: (1) body-fixed cameras whose combined fields of view cover the full manipulation workspace, (2) active red-spectrum illumination that physically normalizes object appearance, and (3) real-time YOLO segmentation overlay that provides a background-invariant semantic cue. Critically, we show that a diversified data collection protocol (systematically varying lighting, backgrounds, and distractors during teleoperation) is the single most important factor for generalization. We target transparent water bottles, objects that visually blend with their surroundings, and select a simple pick-and-place task to enable hundreds of controlled real-robot trials across two mobile platforms. The full pipeline achieves 95% grasp success with ACT and 83% with SmolVLA in the training environment, transferring to novel environments at 85% and 75%. Without SEVO, the same policies achieve only 75%/70% in training and collapse to 30-35% in novel environments. Our results demonstrate that principled observation design and environmental diversity during data collection, not model scaling, enable low-cost robots to operate reliably in everyday household environments.

2605.11107 2026-05-13 cs.CV cs.AI

Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

Youssef Zaazou, Mark Thomas

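AI summary: This study addresses the susceptibility of vision-language models (VLMs) to background interference in image classification. It proposes a method based on the linear additivity of the embedding space that decomposes scene representations into foreground and background components to build background-invariant representations. Pre-trained on synthetic data, the method achieves the first worst-group accuracy above 90% on the Waterbirds dataset under perfect spurious correlation, requires no real-world debiased data, and shows strong sim-to-real transfer, making it suitable for practical deployment.

Comments 36 pages, 7 figures

Details
English abstract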

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.

2605.11102 2026-05-13 cs.LG cs.AI cs.SY eess.SY

Newton's Lantern: A Reinforcement Learning Framework for Finetuning AC Power Flow Warm Start Models

Shourya Bose, Helgi Hilmarsson, Dhruv Suri

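AI summary: This study proposes Newton's Lantern, a reinforcement learning framework for finetuning warm-start models for the AC power flow problem. By analyzing a lower bound on the Newton-Raphson iteration count, it explains why existing supervised methods generalize poorly in heavily loaded scenarios near voltage collapse, and on that basis designs a finetuning method that combines group relative policy optimization with a learned reward model, using the iteration count as the supervisory signal. Experiments show that the method converges reliably on several standard test cases and attains the smallest mean iteration count.

Details
English abstract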

Neural warm starts can sharply reduce the number of Newton-Raphson iterations required to solve the AC power flow problem, but existing supervised approaches generalize poorly on heavily loaded instances near voltage collapse. We prove a lower bound on the Newton-Raphson iteration count that depends on the direction of the warm start error rather than on its magnitude, and show as a corollary that the bound becomes vacuous as the smallest singular value of the power-flow Jacobian shrinks, identifying the failure mode of supervised regression near the saddle-node bifurcation. Motivated by this analysis, we introduce Newton's Lantern, a finetuning pipeline that combines group relative policy optimization with a learned reward model trained on perturbations of the base model's predictions, using the iteration count itself as the supervisory signal. Across IEEE 118-bus, GOC 500-bus, and GOC 2000-bus benchmarks, Newton's Lantern is the only method that converges on every test snapshot while attaining the smallest mean iteration count.
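
The idea of using the iteration count itself as a training signal can be sketched on a toy problem. The example below assumes a scalar equation x^2 - 2 = 0 in place of the AC power flow equations, uses the negative iteration count as the reward, and normalizes rewards within a group by mean and standard deviation, as is commonly done in group relative policy optimization. The function names are hypothetical, the learned reward model is omitted, and a one-dimensional toy cannot reproduce the paper's direction-dependent bound.

```python
import statistics


def newton_iterations(f, df, x0, tol=1e-10, max_iter=100):
    """Run Newton-Raphson from x0 and return (iterations_used, solution)."""
    x = x0
    for i in range(max_iter):
        if abs(f(x)) < tol:
            return i, x
        x = x - f(x) / df(x)
    return max_iter, x


def group_relative_advantages(rewards):
    """Center and scale rewards within one group of candidates."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Toy residual and derivative (assumption: stands in for the power-flow mismatch).
f = lambda x: x * x - 2.0
df = lambda x: 2.0 * x

# Hypothetical group of candidate warm starts for the same problem instance.
warm_starts = [1.5, 3.0, 100.0]
counts = [newton_iterations(f, df, x0)[0] for x0 in warm_starts]
rewards = [-c for c in counts]  # fewer iterations means higher reward
advantages = group_relative_advantages(rewards)

for x0, c, a in zip(warm_starts, counts, advantages):
    print(f"x0={x0}: iterations={c}, advantage={a:+.2f}")
```

Candidates that converge in fewer iterations receive positive advantages relative to their group, so a policy update would shift the warm-start model toward them.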

2605.11098 2026-05-13 cs.SD

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

Jiacheng Shi, Hongfei Du, Xinyuan Song, Y. Alicia Hong, Yanfu Zhang, Ye Gao

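AI summary: AffectCodec is an emotion-aware neural speech codec for expressive speech modeling, designed to preserve the emotional information in speech during quantization. By combining emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment, it retains emotionally salient cues under compression while maintaining semantic fidelity and prosodic naturalness. Experiments show improved emotion consistency and perceptual quality across speech reconstruction, emotion recognition, and downstream text-to-speech generation.

Comments Accepted to ACL Findings 2026

Details
English abstract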

Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level. We propose an emotion-guided neural speech codec that explicitly preserves emotional information while maintaining semantic fidelity and prosodic naturalness. Our framework combines emotion-semantic guided latent modulation, relation-preserving emotional-semantic distillation, and emotion-weighted semantic alignment to retain emotionally salient cues under compression. Extensive evaluations across speech reconstruction, emotion recognition, and downstream text-to-speech generation demonstrate improved emotion consistency and perceptual quality without sacrificing content accuracy.

2605.11093 2026-05-13 cs.LG cs.AI cs.PF cs.SE cs.SY eess.SY

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

Nengneng Yu, Sixian Xiong, Yibo Zhao, Wei Wang, Zaoxing Liu

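AI summary: LLM inference workloads increasingly depend on real-time access to a model's internal states. This paper presents DMI-Lib, a high-performance deep model inspector that treats internal observability as a first-class systems primitive and decouples it from the inference hot path through an asynchronous observability subsystem, a Ring²-based GPU-CPU memory abstraction, and a policy-controlled host backend. Experiments show that DMI-Lib substantially lowers observation overhead while preserving serving optimizations and respecting tight GPU memory limits, reducing latency overhead by 2x to 15x compared with existing methods.

Details
English abstract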

Today's inference-time workloads increasingly depend on timely access to a model's internal states. We present DMI-Lib, a high-speed deep model inspector that treats internal observability as a first-class systems primitive, decoupling it from the inference hot path via an asynchronous observability substrate built from Ring^2, a GPU-CPU memory abstraction for capturing and staging tensors, and a policy-controlled host backend that exports them. DMI-Lib enables the placement of observation points across a rich space of internal signals and diverse inference backends while preserving serving optimizations and adhering to tight GPU memory budgets. Our experiments demonstrate that DMI-Lib incurs only 0.4%--6.8% overhead in offline batch inference and an average of 6% in moderate online serving, reducing latency overhead by 2x-15x compared to existing baselines with similar observability features. DMI-Lib is open-sourced at https://github.com/ProjectDMX/DMI.

2605.11091 2026-05-13 cs.LG cs.AI

ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

Shubhankit Singh, Hassan Shaikh, Kuldeep Raghuwanshi, Keshav Bulia

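AI summary: This paper presents ASD-Bench, a four-axis comprehensive benchmark for autism spectrum disorder (ASD) that evaluates AI models across age cohorts. The benchmark covers predictive performance, calibration, interpretability, and adversarial robustness, and tests a range of classical machine learning and deep learning models on 4,068 AQ-10 questionnaire records. The study finds that feature importance differs markedly across age cohorts and that a single performance metric is insufficient for assessing the reliability of clinical AI systems.

Comments 20 pages, 12 figures, 8 tables

Details
English abstract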

Automated ASD screening tools remain limited by single-architecture evaluations, axis-restricted assessment, and near-exclusive focus on adult cohorts, obscuring age-specific diagnostic patterns critical for early intervention. We introduce ASD-Bench, a systematic tabular benchmark evaluating ML, deep learning, and foundation model configurations across three age cohorts (children 1-11 yr, adolescents 12-16 yr, adults 17-64 yr) on four axes: predictive performance, calibration, interpretability, and adversarial robustness. Applied to a curated v3 dataset of 4,068 AQ-10 records, our benchmark spans classical models (XGBoost, AdaBoost, Random Forest, Logistic Regression), neural networks (MLP), deep tabular transformers (TabNet, TabTransformer, FT-Transformer), and TabPFN v2. We introduce the Heuristic Aggregate Penalty (HAP): a cost-sensitive metric penalising false negatives more heavily and incorporating cross-validation variance for deployment stability. Adult classification yields high performance (10/17 models achieve perfect F1 and AUC), while adolescents present a harder task (F1 ceiling 0.837 vs. 0.915 for children). Feature hierarchies shift across cohorts: A9 (social motivation) dominates for children, A5 (pattern recognition) leads for adolescents, and adults exhibit a flatter importance profile consistent with developmental social masking. Accuracy and calibration are dissociated: AdaBoost achieves F1=1.000 on adults with ECE=0.302, confirming single-metric evaluation is insufficient for clinical AI. Cohort-specific deployment recommendations are provided. All findings should be interpreted as proof-of-concept evidence on questionnaire-derived labels rather than clinically validated diagnostic performance.
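
The dissociation between accuracy and calibration noted in this abstract can be illustrated with a minimal sketch of expected calibration error (ECE) for a binary classifier. It assumes the common equal-width binning definition with ten bins and uses made-up predictions; the paper's exact ECE settings are not stated in the abstract, and the function name is hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins for binary predictions.

    probs: predicted probability of the positive class for each sample
    labels: true labels (0 or 1)
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical underconfident classifier: every prediction is correct,
# but the stated confidence is only 0.72.
probs = [0.72, 0.72, 0.28, 0.28]
labels = [1, 1, 0, 0]
print(f"ECE = {expected_calibration_error(probs, labels):.3f}")
```

Every prediction in this toy set is correct, so accuracy and F1 are perfect, yet the ECE is far from zero because the reported confidence understates how often the model is right.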

2605.11061 2026-05-13 cs.CV cs.MM

HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer

Qi Cai, Jingwen Chen, Chengmin Gao, Zijian Gong, Yehao Li, Yingwei Pan, Yi Peng, Zhaofan Qiu, Kai Yu, Yiheng Zhang, Hao Ai, Siying Bai, Yang Chen, Zhihui Chen, Fengbin Gao, Ying Guo, Dong Li, Zhen Shen, Leilei Shi, Jing Wang, Siyu Wang, Yimeng Wang, Rui Zheng, Ting Yao, Tao Mei

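AI summary: This paper presents HiDream-O1-Image, a natively unified image generative foundation model whose pixel-level diffusion transformer architecture marks a shift from modular designs to an end-to-end visual generation engine. The model maps raw image pixels, text tokens, and task conditions into a single shared token space without relying on a separate VAE or pretrained text encoder, achieving structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. Experiments show that HiDream-O1-Image performs well across a range of generation tasks and that, with only 8 billion parameters, it is competitive with models that have far more parameters; its 200-billion-parameter version delivers a further marked gain in generative capability and sets new performance benchmarks.

Comments Source codes and models are available at GitHub: https://github.com/HiDream-ai/HiDream-O1-Image and Hugging Face: https://huggingface.co/HiDream-ai/HiDream-O1-Image

Details
English abstract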

The evolution of visual generative models has long been constrained by fragmented architectures relying on disjoint text encoders and external VAEs. In this report, we present HiDream-O1-Image, a natively unified generative foundation model built on a pixel-space Diffusion Transformer, that pioneers a paradigm shift from modular architectures to an end-to-end in-context visual generation engine. By mapping raw image pixels, text tokens, and task-specific conditions into a single shared token space, HiDream-O1-Image achieves a structural unification of multimodal inputs within a Unified Transformer (UiT) architecture. This native encoding paradigm eliminates the need for separate VAEs or disjoint pre-trained text encoders, allowing the model to treat diverse generation and editing tasks as a consistent in-context reasoning process. Extensive experiments show that HiDream-O1-Image excels across various generation tasks, including text-to-image generation, instruction-based editing, and subject-driven personalization. Notably, with only 8B parameters, HiDream-O1-Image (8B) achieves performance parity with or even surpasses established state-of-the-art models with significantly larger parameters (e.g., 27B Qwen-Image). Crucially, to validate the immense scalability of this paradigm, we successfully scale the architecture up to over 200B parameters. Experimental results demonstrate that this massive-scale version, HiDream-O1-Image-Pro (200B+), unlocks unprecedented generative capabilities and superior performance, establishing new state-of-the-art benchmarks. Ultimately, HiDream-O1-Image highlights the immense potential of natively unified architectures and charts a highly scalable path toward next-generation multimodal AI.

2605.11055 2026-05-13 cs.CV cs.LG

The first global agricultural field boundary map at 10m resolution

Caleb Robinson, Gedeon Muhawenayo, Subash Khanal, Zhanpei Fang, Isaac Corley, Ana M. Tárano, Lyndon Estes, Jennifer Marcus, Nathan Jacobs, Hannah Kerner, Inbal Becker-Reshef, Juan M. Lavista Ferres

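AI summary: This paper presents the first global agricultural field boundary map at 10 m resolution, covering 241 countries and territories for 2024 and 2025 and containing 3.17 billion remote-sensing field polygons. The map is produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 imagery, and its accuracy is validated against ground-truth data from multiple countries. The dataset is released openly in three forms and provides the first consistent field-level unit of analysis for global crop monitoring, food security, and related agricultural research.

Details
English abstract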

The agricultural field is the natural unit at which crops are planted, managed, regulated, and reported, yet most global remote-sensing products for agriculture are only available at the pixel level. While some high-quality field-level data products exist, they come from parcel registries covering only parts of Europe or from ML-derived products for individual countries. No openly available, globally consistent map of agricultural field boundaries exists to date. Here we present the first global field boundary dataset at 10 m resolution for the years 2024 and 2025, comprising 3.17 billion remote-sensing field polygons (1.62 B in 2024 and 1.55 B in 2025) across 241 countries and territories, produced by applying a U-Net segmentation model trained on the Fields of The World dataset to cloud-free Sentinel-2 mosaics. Validated against ground-truth field boundaries in 24 countries, the map achieved a mean pixel-level recall of 0.85 with 14 countries exceeding 0.90. Evaluation against full-country ground-truth datasets in Austria, Latvia, and Finland yielded F1 scores of 0.89, 0.88, and 0.74, respectively. Because reference data for global validation is inherently incomplete, we accompanied the map with a 500 m confidence layer that identifies regions where predictions are reliable. We release the dataset openly as three global maps: the confidence-thresholded default field boundary dataset, the full unfiltered dataset, and the continuous-valued confidence raster. These maps provide the first globally consistent field-level unit of analysis for crop monitoring, food security, and downstream agricultural science.
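
The pixel-level recall and F1 figures quoted in this abstract follow the usual confusion-count definitions, which can be sketched as below. The masks are tiny made-up grids and the function name is hypothetical; the paper's exact evaluation protocol (for example, how field pixels are rasterized from reference polygons) is not given in the abstract.

```python
def pixel_metrics(truth, pred):
    """Return (precision, recall, f1) for two equally sized binary masks."""
    tp = fp = fn = 0
    for truth_row, pred_row in zip(truth, pred):
        for t, p in zip(truth_row, pred_row):
            if t and p:
                tp += 1  # field pixel correctly predicted
            elif p:
                fp += 1  # predicted field where the reference has none
            elif t:
                fn += 1  # reference field pixel that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical 3x4 masks: 1 marks a field pixel, 0 marks background.
truth = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
pred = [
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
precision, recall, f1 = pixel_metrics(truth, pred)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Recall alone can be computed wherever reference fields exist, while F1 also needs trustworthy negatives, which may be why the abstract reports recall for 24 countries but F1 only for the three countries with full-country reference data.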

2605.11048 2026-05-13 cs.RO cs.AI

ForceFlow: Learning to Feel and Act via Contact-Driven Flow Matching

Shuoheng Zhang, Yifu Yuan, Hongyao Tang, Yan Zheng, Qiaojun Yu, Pengyi Li, Guowei Huang, Helong Huang, Xingyue Quan, Jianye Hao

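AI summary: This paper proposes ForceFlow, a force-aware reactive framework for robotic manipulation in complex contact scenarios. Built on flow matching, the method fuses force signals with multimodal perception to couple contact force and motion tightly, and splits each task into a vision-dominant stage followed by a touch-dominant stage, improving robustness and generalization. Experiments on six real-world contact-rich tasks show higher success rates at lower cost, along with strong contact force self-regulation and out-of-distribution generalization.

Details
English abstract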

Existing imitation learning methods enable robots to interact autonomously with the physical environment. However, contact-rich manipulation tasks remain a significant challenge due to complex contact dynamics that demand high-precision force feedback and control. Although recent efforts have attempted to integrate force/torque sensing into policies, how to build a simple yet effective framework that achieves robust generalization under multimodal observations remains an open question. In this paper, we propose ForceFlow, a force-aware reactive framework built upon flow matching. For contact-stage policy design, we investigate force signal fusion mechanisms and adopt an asymmetric multimodal fusion architecture that treats force as a global regulatory signal, combined with a joint prediction paradigm that enhances the policy's understanding of instantaneous force and historical information, thereby achieving deep coupling between force and motion. For task-level hierarchical decomposition, we divide manipulation into a vision-dominant approach stage (VLM-based pointing for target localization) and a touch-dominant interaction stage (force-driven contact execution), with a Vision-to-Force (V2F) handover mechanism that explicitly decouples spatial generalization from contact regulation. Experimental results across six real-world contact-rich tasks demonstrate that ForceFlow achieves a 37% success rate improvement over the strong baseline ForceVLA while maintaining significantly lower cost. Moreover, ForceFlow exhibits accurate force signal prediction and demonstrates superior performance in contact force self-regulation and zero-shot out-of-distribution (OOD) generalization.