arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30880 2026-06-18 cs.CL cs.AI (updated version)

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

AI summary: Proposes PatchWorld, a framework that turns offline trajectories into executable Python world models through counterexample-guided code repair, yielding symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success across AgentGym environments.

Comments 40 pages

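AI Chinese abstract (translated)

Text-agent environments are usually modeled as partially observable Markov decision processes (POMDPs), under the assumption that the simulator's latent state and transition dynamics are hidden from the agent. Few works, however, have studied whether executable code can be induced to act as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that converts offline trajectories into executable Python world models via counterexample-guided code repair. Rather than predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple obtains the highest code-based planning score among the evaluated methods, reaching 76.4% macro success in live one-step lookahead without calling any LLM inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models: improving observation fidelity can come at the cost of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.

English abstract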

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.
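
Illustrative sketch (not taken from the paper or its repository): the counterexample-guided part of the loop can be pictured as replaying logged transitions through a candidate world model and collecting the ones it mispredicts, which would then be handed to a repair step. The state layout, the `candidate_model` function, and the observation strings below are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A logged transition: (belief state, action, observation actually seen).
# The state keys and observation strings are invented for illustration.
State = Dict[str, str]
Transition = Tuple[State, str, str]
Model = Callable[[State, str], Tuple[State, str]]


def candidate_model(state: State, action: str) -> Tuple[State, str]:
    """Hypothetical executable world model: returns (next_state, predicted_obs)."""
    new_state = dict(state)
    if action == "open door":
        new_state["door"] = "open"
        return new_state, "The door is open."
    if action == "go north":
        # Deliberate bug: the update ignores whether the door is open.
        new_state["room"] = "hall"
        return new_state, "You are in the hall."
    return new_state, "Nothing happens."


def find_counterexamples(model: Model, transitions: List[Transition]) -> List[Transition]:
    """Replay logged transitions and keep those the model mispredicts."""
    failures = []
    for state, action, observed in transitions:
        _, predicted = model(state, action)
        if predicted != observed:
            failures.append((state, action, observed))
    return failures


logged = [
    ({"room": "cell", "door": "closed"}, "open door", "The door is open."),
    ({"room": "cell", "door": "closed"}, "go north", "The door is closed."),
    ({"room": "cell", "door": "open"}, "go north", "You are in the hall."),
]

counterexamples = find_counterexamples(candidate_model, logged)
print(counterexamples)
```

Here only the "go north" transition with a closed door is flagged, so a repair would need to touch just that action update, which is the kind of local patching the abstract describes.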

2605.29676 2026-06-18 cs.AI cs.CL (updated version)

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Lorenz Kutschka, Bernhard Geiger

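Affiliations: Know Center Research GmbH; Graz University of Technology; Graz Center for Machine Learning

AI summary: Evaluates two token-optimized formats, TOON and TRON, on four agentic benchmarks. TRON cuts tokens by up to 27% with accuracy within 14 percentage points of the JSON baseline, whereas TOON cuts up to 18% but suffers from multi-turn parsing failures and collapsed parallel tool-call output.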

Comments 16 pages, 6 figures, 4 tables

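AI Chinese abstract (translated)

Large language models in agentic AI systems consume tool schemas and execution results and emit tool calls as structured data. JSON, the default language for this exchange, was designed for application-to-application interchange rather than token efficiency, so its structural elements carry substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have only been evaluated on isolated comprehension or generation tasks. Whether their token reductions hold in end-to-end agentic loops remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27%, with accuracy within 14 percentage points of the JSON baseline. TOON achieves a reduction of up to 18% at a similar accuracy cost of 9 percentage points, but additionally cascades on multi-turn parsing failures and causes parallel tool-call output to collapse for most models.

English abstract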

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
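
Illustrative sketch (not the TOON or TRON syntax): to see where JSON's structural overhead comes from, the snippet below compares a JSON array of uniform records with a hypothetical header-plus-rows layout that states each key once. Character count is used as a rough stand-in for token count, and the layout assumes values contain no commas or newlines.

```python
import json


def encode_table(records):
    """Encode uniform records as one header line plus one comma-separated row each."""
    keys = list(records[0].keys())
    lines = [",".join(keys)]
    for rec in records:
        lines.append(",".join(str(rec[k]) for k in keys))
    return "\n".join(lines)


def decode_table(text):
    """Inverse of encode_table; every value comes back as a string."""
    lines = text.split("\n")
    keys = lines[0].split(",")
    return [dict(zip(keys, line.split(","))) for line in lines[1:]]


# Made-up tool results, as a tool might return them to an agent.
tool_results = [
    {"city": "Graz", "temp_c": 21, "sky": "clear"},
    {"city": "Vienna", "temp_c": 24, "sky": "cloudy"},
    {"city": "Linz", "temp_c": 19, "sky": "rain"},
]

as_json = json.dumps(tool_results)
as_table = encode_table(tool_results)
saving = 1 - len(as_table) / len(as_json)
print(f"JSON: {len(as_json)} chars, table: {len(as_table)} chars, saving: {saving:.0%}")
```

The saving comes from dropping repeated keys, quotes, and braces. The abstract's point is that such savings only matter if the model can still read and write the format reliably inside a multi-turn loop.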

2605.16385 2026-06-18 cs.CV cs.AI cs.CL (updated version)

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

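Affiliations: Xi'an Jiaotong-Liverpool University; Ricoh Software Research Center Beijing Co., Ltd.

AI summary: Proposes the Hilbert-Geo framework and the Parse2Reason method, which use a conditional description language and a theorem bank for rigorous reasoning on solid geometry problems, achieving SOTA performance on SolidFGeo2k and MathVerse-Solid.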

Comments Computer Vision and Pattern Recognition (CVPR), 2026

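AI Chinese abstract (translated)

Geometric problem solving, a typical multimodal reasoning problem, has attracted wide attention and made great progress in recent years. However, most works focus on plane geometry and usually fail on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, which includes an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which consists of two steps: first parsing, then reasoning. In the parsing step, we use a conditional description language (CDL), a formal language composed of predicates designed specifically for constructing geometric conditions, to represent the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we use the formal CDL and the theorem bank to perform relational inference and algebraic computation, producing strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which come with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading multimodal large language models such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets will be made publicly available.

English abstract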

Geometric problem solving, a typical multimodal reasoning problem, has attracted much attention and made great progress recently. However, most works focus on plane geometry and usually fail on solid geometry because of 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose the Parse2Reason method, which consists of two steps: first parsing, then reasoning. In the parsing step, we use a conditional description language (CDL), a formal language composed of predicates specifically designed to construct geometric conditions, to represent both the problem description (natural text) and the solid diagram (visual image). In the reasoning step, we use the formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated datasets, SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions, and answers. Extensive experiments show that our method achieves state-of-the-art (SOTA) performance of 77.3% on SolidFGeo2k and 84.1% on MathVerse-Solid (a small subset of MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves SOTA accuracy of 80.2% on PlaneFGeo3k, demonstrating the generality of Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2602.08355 2026-06-18 cs.CV (updated version)

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

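Affiliations: Alimama Tech, Taobao & Tmall Group of Alibaba; Huazhong University of Science; Vin University

AI summary: Proposes E-VAds, a benchmark for e-commerce short video understanding. A multimodal information density assessment framework quantifies the domain's complexity, a multi-agent system generates the Q&A dataset, and the RL-based reasoning model E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning.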

Comments Accepted by ICML2026

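AI Chinese abstract (translated)

E-commerce short videos are a high-revenue segment of the online video industry, characterized by a goal-driven format and dense multimodal signals. Current models often struggle with these videos because existing benchmarks focus mainly on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multimodal information density assessment framework to quantify the complexity of this domain. Our evaluation shows that, compared with mainstream datasets, e-commerce content has substantially higher density across the visual, audio, and textual modalities, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), the first benchmark designed specifically for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. The questions are organized into two main dimensions, Perception and Cognition & Reasoning, comprising five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model with a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experiments show that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

English abstract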

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark, which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples. Data is available at https://github.com/TaobaoTmall-AlgorithmProducts/E-VAds_Benchmark.

2605.03460 2026-06-18 cs.AI cs.LG (updated version)

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

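Affiliations: LG AI Research

AI summary: To address the failure of time series reasoning models in the financial domain, proposes FinSTaR, built on a 2x2 capability taxonomy, which uses Compute-in-CoT and Scenario-Aware CoT strategies to reach 78.9% average accuracy on the FinTSR-Bench benchmark.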

Comments KDD Workshop on SciSoc Agents & LLMs 2026 (Oral Presentation)

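AI Chinese abstract (translated)

Time series reasoning models (TSRMs) perform well in general domains but consistently fail in the financial domain, which has unique characteristics. We propose a general 2x2 capability taxonomy that partitions TSRM capabilities by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is especially critical, as ten financial reasoning tasks, and build the FinTSR-Bench benchmark from S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with a different chain-of-thought (CoT) strategy for each category. For assessment (deterministic, that is, computable from observable data), we use Compute-in-CoT, a programmatic CoT that lets the model derive answers directly from raw prices. For prediction (inherently stochastic, that is, subject to unobservable factors), we use Scenario-Aware CoT, which generates multiple scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method reaches 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. We also show that the four capability categories are complementary and mutually reinforcing under joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is publicly available at https://github.com/seunghan96/FinSTaR.

English abstract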

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at https://github.com/seunghan96/FinSTaR.
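
Illustrative sketch (not from the paper): an "assessment" question is deterministic in the sense that its answer can be computed step by step from observed prices. The example below uses maximum drawdown as a hypothetical assessment task and records each intermediate step as text, loosely in the spirit of a programmatic chain of thought. The task choice, function name, and prices are assumptions.

```python
def assess_max_drawdown(prices):
    """Return (reasoning_steps, max_drawdown) computed from raw closing prices."""
    steps = []
    peak = prices[0]
    max_dd = 0.0
    for day, price in enumerate(prices):
        peak = max(peak, price)            # highest close seen so far
        drawdown = (peak - price) / peak   # relative drop from that peak
        steps.append(f"day {day}: price={price}, peak={peak}, drawdown={drawdown:.3f}")
        max_dd = max(max_dd, drawdown)
    return steps, max_dd


# Made-up closing prices for a single stock.
closing_prices = [100.0, 110.0, 99.0, 104.5, 120.0]
steps, max_dd = assess_max_drawdown(closing_prices)
for line in steps:
    print(line)
print(f"max drawdown = {max_dd:.1%}")
```

A prediction question (for example, whether the next close will be higher) has no such closed-form answer, which is why the abstract pairs it with a different, scenario-based strategy.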

2605.21528 2026-06-18 cs.LG cs.AI (updated version)

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

Rui Huang, Lican Huang

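Affiliations: School of Basic Medicine, Hangzhou Normal University; Research Department, Hangzhou Domain Zones Technology Co., Ltd.

AI summary: Proposes a reproducible, log-driven AutoML framework for interpretable pipeline optimization in healthcare risk prediction. Analyzing component attribution, interactions, and redundancy improves model performance and stability.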

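AI Chinese abstract (translated)

Accurate and reproducible disease risk prediction remains challenging because of heterogeneous features, limited samples, and severe class imbalance. This study introduces yvsoucom-iterkit, a deterministic, log-driven automated machine learning framework that models pipeline optimization as a fully reproducible, configuration-level system. Each pipeline is encoded as a traceable log entity, enabling analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets over more than 18,000 pipeline configurations reveal a structured, partially redundant search space in which performance is determined by a small set of interacting components. Random forest importance analysis shows that augmentation (0.454), model selection (0.198), and imbalance handling (0.101) are the key drivers on the Pima dataset, whereas imbalance handling dominates on Stroke (0.406). Component similarity analysis shows strong redundancy: the feature-selection variants (biMax-biMean) have a low RMS distance (0.0252), mixing is close to no augmentation (0.0279), and TomekLinks aligns with no imbalance handling (0.0325), while Gaussian noise differs more from no augmentation (0.10). With ensemble models, the framework achieves strong and stable performance (Weighted-F1 of 0.89 and Macro-F1 of 0.88 on Pima; Weighted-F1 of 0.94 on Stroke), while Macro-F1 on Stroke is lower (0.67) because of class imbalance. Cross-seed analysis reveals a performance-robustness tradeoff, with ensembles showing lower variability than SVMs. These results indicate that effective AutoML optimization can focus on a small set of high-impact components.

English abstract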

Accurate disease risk prediction is challenged by heterogeneous features, limited data, and class imbalance. This study presents yvsoucom-iterkit, a deterministic AutoML framework that models pipeline optimization as a configuration-level system with full reproducibility and traceable execution logs, enabling systematic analysis of component attribution, interactions, similarity, and cross-seed robustness. Experiments on the Pima Indians Diabetes and Stroke datasets across more than 18,000 pipeline configurations reveal a structured yet partially redundant search space, where performance is dominated by a small subset of interacting components. Ensemble models achieve stable performance, reaching a Weighted-F1 of 0.89 on Pima and 0.94 on Stroke. Macro-F1 reaches approximately 0.88 on Pima but drops to 0.6560 on Stroke due to severe imbalance. Cross-seed experiments show that ensembles reduce variance compared to single models. Friedman testing ($p < 0.05$) confirms significant ranking differences across configurations. Based on analysis of component attribution, interaction, and similarity, optimal configuration design reveals dataset-dependent behavior. For the Pima dataset, computational efficiency benefits from simplified search spaces where redundant components can be removed, with split ratio playing a key role. In contrast, the Stroke dataset requires enhanced imbalance-aware strategies, where RandomOverSampler improves Macro-F1 from 0.6560 to 0.6766. These findings demonstrate that effective AutoML optimization is achieved through optimal configuration design, where carefully constraining the search space to high-impact components can improve performance, stability, and interpretability while reducing unnecessary search complexity.
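
Illustrative sketch (synthetic labels, not the paper's data or code): the gap between Weighted-F1 and Macro-F1 reported for the Stroke dataset is what one would expect under heavy class imbalance. Weighted-F1 averages per-class F1 by class frequency, so the majority class dominates, while Macro-F1 gives every class equal weight. The label counts below are made up.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_and_weighted_f1(y_true, y_pred):
    labels = sorted(set(y_true))
    scores = {c: f1_per_class(y_true, y_pred, c) for c in labels}
    support = {c: y_true.count(c) for c in labels}
    macro = sum(scores.values()) / len(labels)  # every class counts equally
    weighted = sum(scores[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# 90 negatives and 10 positives; the classifier recovers only 2 of the positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 2 + [0] * 8
macro, weighted = macro_and_weighted_f1(y_true, y_pred)
print(f"macro F1 = {macro:.3f}, weighted F1 = {weighted:.3f}")
```

On this toy split the weighted score stays near 0.90 while the macro score falls to about 0.65, because the minority class F1 of 0.33 is diluted in one average and not in the other.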

2605.21431 2026-06-18 cs.CV (updated version)

iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang

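Affiliations: Shenzhen Campus of Sun Yat-sen University; Taobao & Tmall Group of Alibaba

AI summary: Proposes the iTryOn framework, which uses spatial-semantic guidance to address semantic ambiguity and complex garment deformation in interactive video virtual try-on, enabling a more dynamic and controllable try-on experience.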

Comments Project Page: https://zhengjun-ai.github.io/itryon-page. Accepted by ICML 2026

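AI Chinese abstract (translated)

Video Virtual Try-On (VVT) aims to seamlessly replace the garment worn by a person in a video. Although existing methods have made significant progress in maintaining temporal consistency, they are mostly limited to non-interactive scenarios in which models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new, challenging task, Interactive Video Virtual Try-On (Interactive VVT), in which the subject in the video actively interacts with the garment. The task brings unique challenges beyond simple texture preservation, including (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from videos in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built on a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior that provides fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn uses global captions for overall context and time-stamped action captions for localized interactions, synchronized through our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments show that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a clear lead in the new interactive setting, marking an important step toward more dynamic and controllable virtual try-on experiences.

English abstract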

Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.

2605.21028 2026-06-18 cs.CV cs.AI (updated version)

DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation

Bo Ye, Xinyu Cui, Jian Zhao, Tong Wei, Min-Ling Zhang

Affiliations: School of Computer Science and Engineering, Southeast University; Key Lab. of Computer Network and Information Integration, Southeast University; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Institute of Automation, CAS

AI summary: Proposes DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks, improving the dynamic degree and temporal quality of autoregressive long video generation.

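AI Chinese abstract (translated)

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has diverged substantially from them, while discarding intermediate history that may be more relevant. As a result, the retained long-range context can become less adaptive and biased toward outdated cues. In severe cases, RoPE-induced phase re-alignment homogenizes inter-head attention and causes sink collapse, in which content regresses toward the sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over the retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently outperforms strong baselines in dynamic degree while also achieving higher temporal quality. Code and model weights will be released at https://github.com/yebo0216best/DySink.

English abstract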

Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context may become less adaptive and bias generation toward outdated cues; in severe cases, RoPE-induced phase re-alignment can homogenize inter-head attention and cause sink collapse, where content regresses toward sink frames. We propose DySink, a retrieval-based framework that maintains a compact memory bank and selects visually relevant historical frames as dynamic frame sinks. DySink couples adaptive retrieval with a sink anomaly gate, which detects excessive inter-head consensus over retrieved context and suppresses collapse-prone context. Experiments on minute-long videos show that DySink consistently improves temporal quality over strong baselines while also achieving higher dynamic degree, enabling coherent and more natural long-horizon visual evolution. The code and model weights are released at https://github.com/yebo0216best/DySink.
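
Illustrative sketch (not the DySink implementation): the retrieval idea can be pictured as picking, from a bank of stored frame features, the frames most similar to the current one, in place of always anchoring on the first frame. Cosine similarity, the 3-dimensional features, and the function names are assumptions, and the sink anomaly gate is not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_dynamic_sinks(memory_bank, current_feature, k):
    """Indices of the k stored frames most similar to the current frame, in temporal order."""
    ranked = sorted(
        range(len(memory_bank)),
        key=lambda i: cosine(memory_bank[i], current_feature),
        reverse=True,
    )
    return sorted(ranked[:k])


# Toy per-frame features; frames 0-1 show an early scene, frames 2-3 a later one.
memory_bank = [
    [1.0, 0.0, 0.0],  # frame 0
    [0.9, 0.1, 0.0],  # frame 1
    [0.0, 1.0, 0.0],  # frame 2
    [0.0, 0.8, 0.6],  # frame 3
]
current = [0.0, 0.9, 0.4]

sinks = select_dynamic_sinks(memory_bank, current, k=2)
print(sinks)
```

A static early-frame policy would keep frame 0 here even though the current frame no longer resembles it, whereas similarity-based selection returns frames 2 and 3.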

2603.10718 2026-06-18 cs.LG (updated version)

Riemannian MeanFlow for One-Step Generation on Manifolds

Zichen Zhong, Haoliang Sun, Yukun Zhao, Yongshun Gong, Yilong Yin

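Affiliations: School of Software, Shandong University, Jinan, China

AI summary: Proposes Riemannian MeanFlow (RMF), which defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity linking average and instantaneous velocities. This enables one-step generation on manifolds whose velocities lie in location-dependent tangent spaces, with improved quality-efficiency trade-offs and reduced sampling cost.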

Comments ICML 2026

English abstract

Flow Matching enables simulation-free training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, SO(3), and SE(3) demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost.
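
Illustrative sketch (standard sphere geometry, not the paper's code): "location-dependent tangent spaces" and a "log-map tangent representation" can be made concrete on the unit sphere, where the logarithm map sends a target point to a tangent vector at the base point and the exponential map sends it back. The helper names are assumptions, and the log map below is undefined for antipodal points.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def log_map(x, y):
    """Tangent vector at x pointing toward y, with length equal to the geodesic distance."""
    c = max(-1.0, min(1.0, dot(x, y)))
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0] * len(x)
    # Project y onto the tangent plane at x, then rescale to length theta.
    # (Fails for antipodal points, where the projection vanishes.)
    proj = [yi - c * xi for xi, yi in zip(x, y)]
    scale = theta / norm(proj)
    return [scale * p for p in proj]


def exp_map(x, v):
    """Follow the geodesic from x with initial velocity v for unit time."""
    theta = norm(v)
    if theta < 1e-12:
        return list(x)
    return [math.cos(theta) * xi + math.sin(theta) * vi / theta for xi, vi in zip(x, v)]


x = [1.0, 0.0, 0.0]
y = [0.0, 1.0, 0.0]
v = log_map(x, y)       # lives in the tangent plane at x
y_back = exp_map(x, v)  # one exponential-map step returns to y
print(v, y_back)
```

The tangent vector `v` is orthogonal to `x`, so it only makes sense at that base point. This is why velocities at different points cannot be averaged directly and why the abstract brings in parallel transport.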

2605.17232 2026-06-18 cs.LG math.ST stat.ML stat.TH (updated version)

Dimension-Free Convergence of Discrete Diffusion Models: Adjoint Equations Induce the Right Space

Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

Affiliations: Department of Mathematics; Oden Institute; School of Data Science and Society; UCLA; University of Texas at Austin; UNC Chapel Hill; Computational and Applied Sciences Group; Department of Mathematics and Statistics; SRI International; University of Massachusetts Amherst

AI summary: Proposes a unified adjoint-equation-based framework that gives dimension-free convergence guarantees in any integral probability metric (IPM), overcoming the limitations of traditional KL- and TV-based approaches on large state spaces.

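AI Chinese abstract (translated)

Discrete diffusion has become a leading framework for generative modeling, with wide application in language, vision, and biology. Existing convergence theory, however, has fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while total variation (TV) bounds depend on the state space size S and become vacuous for modern language tasks, whose vocabularies contain tens of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely independent of S and applicable to both masked and uniform priors. Importantly, our theory relies on only one standard rate-matrix regularity assumption and is compatible with time-inhomogeneous schedules. Four novel techniques drive our improvements: working in the space of observables via adjoint equations instead of directly with probability measures, a regularity analysis that yields bounds in any IPM, a coupling argument that removes the S-dependence under uniform transitions, and a score-marginal cancellation technique that removes the S-dependence under masked transitions. Our framework therefore departs sharply from prior analyses and avoids the shortcomings of path-space KL and existing TV-based approaches. Beyond convergence bounds, the framework provides a flexible toolkit for further theoretical study of discrete diffusion models.

English abstract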

Discrete diffusion has become a leading framework for generative modeling in various applications including language, vision, and biology. Existing convergence theory, however, exhibits fundamental limitations. KL-based analyses diverge under singular priors such as the masked distribution, while bounds in total variation (TV) depend on the state space size $S$ and become vacuous for modern language tasks, where vocabularies contain hundreds of thousands of tokens. We develop a unified adjoint-equation-based framework that establishes dimension-free convergence guarantees in any integral probability metric (IPM). To the best of our knowledge, our bounds are the first to be entirely free of $S$ and applicable to both masked and uniform priors. Importantly, our theory relies only on a single standard rate-matrix regularity assumption and applies to general priors. Five novel techniques drive our improvements: working in the space of observables via adjoint equations rather than directly with probability measures, a regularity analysis that yields bounds on any IPM, a coupling argument that removes $S$-dependence under uniform transitions, and score-marginal cancellation and exit-routing techniques that remove $S$-dependence under masked transitions. Our framework thus sharply departs from prior analyses and avoids the shortcomings of pathspace-KL and existing TV-based approaches. Beyond convergence bounds, our framework provides a versatile toolkit for further theoretical study of discrete diffusion models, including principled choices of loss functions and dimension-free step complexity.
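
For reference, the standard textbook definition of an integral probability metric (not necessarily the paper's notation) over a class of test functions $\mathcal{F}$, together with the way total variation on a finite state space arises as a special case:

```latex
\[
  d_{\mathcal{F}}(p, q)
  \;=\; \sup_{f \in \mathcal{F}}
  \left| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \right| ,
\]
\[
  \mathcal{F} = \{ f : \|f\|_{\infty} \le 1 \}
  \quad\Longrightarrow\quad
  d_{\mathcal{F}}(p, q) \;=\; \sum_{x} |p(x) - q(x)| \;=\; 2\,\mathrm{TV}(p, q),
  \qquad
  \mathrm{TV}(p, q) = \sup_{A} |p(A) - q(A)| .
\]
```

Read this way, bounding an IPM amounts to bounding how far expectations of test functions (observables) can differ, which fits the abstract's description of working in the space of observables.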

2605.17131 2026-06-18 cs.CV cs.AI cs.LG (updated version)

A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation

Minhas Kamal, Hiranya Garbha Kumar, Balakrishnan Prabhakaran

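Affiliations: State University of New York at Albany

AI summary: Systematically surveys deep learning architectures for point cloud classification and segmentation. It analyzes the structural characteristics of point cloud data, categorizes works by architecture, evaluates their performance on mainstream benchmarks, and identifies open challenges and future directions.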

Comments We reviewed a decade of advancements in point cloud processing: trace the evolution of the field from its foundational roots to the modern SOTA, analyze how diverse architectures overcome the inherent geometric challenges of 3D data, and map out critical research gaps alongside promising future directions. GitHub: https://github.com/MinhasKamal/DeepLearningForPointCloud

Journal ref ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2026

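AI Chinese abstract (translated)

Point clouds are the most widely adopted format for representing 3D shapes and scenes because of their simplicity and geometric fidelity. However, their inherently unordered and irregular nature, compounded by sensor noise and occlusion, poses unique challenges for machine-learning-based methods. To address these issues, a variety of strategies have been developed, including conversion to ordered formats, extraction of local geometric features, and permutation-invariant or self-attention-based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We first formally define point cloud data and then discuss its structural characteristics in depth. Next, we categorize notable works by their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions in 3D point cloud understanding.

English abstract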

The point cloud is the most widely adopted format for representing 3D shapes and scenes due to its simplicity and geometric fidelity. However, its inherently unordered and irregular nature, exacerbated by sensor noise and occlusions, introduces unique challenges for machine-learning-based methodologies. To combat these issues, diverse strategies have been developed, including conversion to an ordered format, extraction of local geometry, and permutation-invariant or self-attention-based processing. In this paper, we focus on deep learning models for three fundamental tasks in 3D vision: point cloud classification, part segmentation, and semantic segmentation. We begin by formally defining point cloud data, followed by an in-depth discussion of its structural characteristics. Then, we categorize notable works based on their backbone structure and evaluate their performance on popular benchmarks. Beyond empirical comparison, we offer insights into architectural innovations and limitations. We also outline open challenges and promising future directions for 3D point cloud understanding.
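
Illustrative sketch (a toy, not any specific model from the survey): permutation-invariant processing can be shown by applying the same function to every point and then pooling with a symmetric operation such as a channel-wise maximum, the pattern popularized by PointNet. The hand-written `pointwise_feature` stands in for a learned per-point network and is an assumption.

```python
import random


def pointwise_feature(point):
    """Stand-in for a shared per-point network: a fixed map to 3 feature channels."""
    x, y, z = point
    return [x + y + z, x * x + y * y + z * z, max(x, y, z)]


def global_feature(points):
    """Max-pool each feature channel over all points (a symmetric function)."""
    feats = [pointwise_feature(p) for p in points]
    return [max(channel) for channel in zip(*feats)]


cloud = [(0.0, 0.0, 1.0), (1.0, 2.0, 0.5), (-1.0, 0.5, 0.25), (0.3, 0.3, 0.3)]

# Reorder the same points; a fixed seed keeps the example repeatable.
shuffled = cloud[:]
random.Random(0).shuffle(shuffled)

print(global_feature(cloud))
print(global_feature(shuffled))
```

Both calls print the same vector because the maximum does not depend on the order of its inputs, so the descriptor is unchanged under any reordering of the points.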

2605.07022 2026-06-18 cs.LG (updated version)

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner

Affiliations: Department of Computer and Information Science, University of Pennsylvania; Department of Genetics, University of Pennsylvania; Departments of Bioengineering and Chemical and Biomolecular Engineering, University of Pennsylvania

AI summary: Proposes automatically generating structured datasets from PubMed to obtain larger, more nuanced, and more accurate biomedical knowledge, and shows that the Starling system produces large-scale datasets across multiple tasks with improved accuracy.

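AI Chinese abstract (translated)

Manually curated biomedical repositories, spanning bioactivity, genomics, and chemistry, are expensive, lag behind the primary literature, and discard experimental context, obscuring the nuances needed to assess data correctness and coverage. We show that PubMed itself can be turned automatically and cost-effectively into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5 billion entities across 19 categories in a PubMed corpus of 22.5 million papers and 250 billion tokens; (2) hybrid sparse-dense retrieval that supports entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and outputs structured records with detail-rich fields and supporting passages. Across six tasks (blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions), Starling produces about 6.3 million records (91K to 3M per task), and some of these are the largest public datasets currently available. Frontier-model rejection rates for our extractions range from 0.6% to 7.7%, far below the error rates we measure on widely used curated datasets (for example, 16.5% on BBB_Martins and 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuances that tabular databases discard; for example, oral bioavailability may depend on fed versus fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.

English abstract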

Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.

2605.15824 2026-06-18 cs.CV (updated version)

FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Liujuan Cao

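Affiliations: Xiamen University; Alibaba Group

AI summary: Proposes the FashionChameleon framework, which uses single-garment video data to achieve interactive multi-garment video customization while preserving motion coherence, generating in real time at 23.8 FPS, 30-180 times faster than existing methods.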

Comments Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

AI中文摘要

以人为中心的视频定制,特别是在服装层面,已显示出显著的商业价值。然而,现有方法无法支持低延迟和交互式服装控制,这对电子商务和内容创作应用至关重要。本文研究如何在仅使用单件服装视频数据的情况下,实现交互式多服装视频定制并保持动作一致性。我们提出了FashionChameleon,一个用于自回归视频生成中的人体服装定制的实时交互框架,用户可以在生成过程中交互式切换服装。FashionChameleon包含三个关键技术:(i) 代替在多服装视频数据上训练,我们使用上下文学习在单个参考服装对上训练教师模型。通过保留图像到视频的训练范式,同时强制参考和服装图像之间不匹配,模型被鼓励在单件服装切换时隐式保持一致性。(ii) 为了在生成过程中实现一致性和效率,我们引入了带有上下文学习的流式蒸馏,通过上下文教师强制微调模型,并通过梯度加权分布匹配蒸馏提高外推一致性。(iii) 为了将模型扩展到交互式多服装视频定制,我们提出了无训练KV缓存调度,包括服装KV刷新、历史KV撤回和参考KV解耦,以在保持动作一致性的同时实现服装切换。我们的FashionChameleon独特地支持交互式定制和一致的长视频外推,同时在单个GPU上实现实时生成23.8 FPS,比现有基线快30-180倍。

英文摘要

Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180$\times$ faster than existing baselines.

2602.06470 2026-06-18 cs.CL cs.AI 版本更新

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 本文提出UNO框架,通过用户日志提炼规则和偏好对,利用查询反馈驱动聚类处理数据异质性,量化模型知识与日志数据间的认知差距,提升LLM系统性能。

AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型(LLMs)的发展,但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此,近期研究更加关注从真实世界部署中持续学习,其中用户交互日志提供了丰富的真实人类反馈和过程知识。然而,从用户日志学习具有挑战性,因为它们是非结构化且含噪声的。普通的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为,且用户日志收集与模型优化之间的差异(例如,离策略(off-policy)优化问题)进一步加剧了这一问题。为此,我们提出UNO(用户日志驱动的优化),一个统一的框架,用于通过用户日志改进LLM系统(LLMsys)。UNO首先将日志提炼为半结构化的规则和偏好对,然后利用查询和反馈驱动的聚类来管理数据异质性,最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块,以处理从用户日志中提取的初级和反思性经验,从而提升未来的响应。广泛的实验表明,UNO在效果和效率上均达到最先进的水平,显著优于检索增强生成(RAG)和基于记忆的基线方法。我们已开源代码至 https://github.com/bebr2/UNO 。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work increasingly focuses on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further exacerbates the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO.

2605.14877 2026-06-18 cs.CV 版本更新

HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

HeatKV:针对视觉自回归建模的头部调制KV缓存压缩

Jonathan Cederlund, Axel Berg, William Isaksson, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson

发表机构 * Dept. of Automatic Control, Lund University(隆德大学自动控制系) Arm(Arm公司)

AI总结 本文提出HeatKV方法,通过根据每个头部对先前生成尺度的注意力进行调整,实现更高效的KV缓存压缩,提升内存利用率并保持图像生成质量。

Comments 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

AI中文摘要

视觉自回归(VAR)模型最近在保持低延迟的同时展示了出色的图像生成质量。然而,它们受到严重的KV缓存内存限制,通常需要每个生成图像数吉字节的内存。我们引入了HeatKV,一种新的压缩方法,该方法根据每个头部对先前生成尺度的注意力来调整缓存分配。使用一个小的离线校准集,注意力头部根据其在先前尺度上的注意力分数进行排序。基于此排序,我们构建了一个针对给定内存预算定制的静态剪枝计划。应用于Infinity-2B模型时,HeatKV在KV缓存内存分配的压缩比上比现有方法高2倍,同时保持相似或更好的图像保真度、提示对齐度和人类感知分数。我们的方法在VAR模型的KV缓存压缩中达到了新的最先进的水平,展示了细粒度、特定头部的缓存分配的有效性。

英文摘要

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves $2 \times$ higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation. Code and calibration script available at https://github.com/arm-research/heatkv.

2504.01527 2026-06-18 cs.CV eess.IV 版本更新

Beyond Nearest Neighbor Interpolation in Data Augmentation

超越数据增强中的最近邻插值

Olivier Rukundo

发表机构 * Department of Electronic and Computer Engineering, University of Limerick(电子与计算机工程系,利默里克大学)

AI总结 本文提出改进的几何变换函数和均值分类过滤机制,以避免最近邻插值带来的标注误差和低通滤波影响,通过离线数据增强管道提升医学图像分割性能。

Comments 10 pages, 11 figures, 14 tables

AI中文摘要

使用最近邻插值来规避未定义类别标签的风险,却忽视了在增强训练数据中加剧像素级标注误差的风险。此外,插值算法固有的低通滤波效应会加剧标注感兴趣区域内高频结构细节退化的风险。为避免这些风险,作者修改了卷积神经网络的数据变换函数:引入改进的几何变换函数,去除对最近邻插值的依赖,并整合基于均值的类别过滤机制,以便在使用其他插值算法时处理未定义的类别标签。作者还实现了离线数据增强管道,生成特定于插值的增强训练数据,从而能够定量评估特定插值对增强训练数据的低通滤波效应。在三个医学图像分割数据集和XBAT+数据集上的实验评估显示,在多个定量指标上均实现了性能提升。

英文摘要

Using nearest-neighbor interpolation to avoid the risk of undefined categorical labels overlooks the risk of exacerbating pixel-level annotation errors in augmented training data. Additionally, the inherent low-pass filtering effects of interpolation algorithms exacerbate the risk of degrading high-frequency structural details within annotated regions of interest. To avoid these risks, the author modified the data transformation functions of convolutional neural networks by incorporating a modified geometric transformation function, removing reliance on nearest-neighbor interpolation, and integrating a mean-based class filtering mechanism to handle undefined categorical labels with alternative interpolation algorithms. The author also implemented an offline data augmentation pipeline to generate interpolation-specific augmented training data, enabling quantitative assessment of interpolation-specific low-pass filtering effects on augmented training data. Experimental evaluation on three medical image segmentation datasets and the XBAT+ datasets demonstrated performance gains across multiple quantitative metrics.

2605.13566 2026-06-18 cs.LG 版本更新

Spatiotemporal downscaling and nowcasting of urban land surface temperatures with deep neural networks

基于深度神经网络的城市地表温度时空降尺度与临近预报

Solomiia Kurchaba, Angela Meyer

发表机构 * Department of Geoscience and Remote Sensing(地球科学与遥感系) Delft University of Technology(代尔夫特理工大学) School of Engineering and Computer Science(工程与计算机科学学院) Bern University of Applied Sciences(伯尔尼应用科学大学)

AI总结 本文提出利用深度神经网络结合静止和极轨卫星数据,实现高时空分辨率的城市地表温度场估计与临近预报,提升城市气候与生态研究的精度与时效性。

Comments Paper after publication in IEEE Access

Journal ref IEEE Access, vol. 14, pp. 85134-85151, 2026

AI中文摘要

地表温度(LST)是城市气候和生态研究等多种应用的关键变量。然而,现有卫星衍生的LST产品只能提供高空间分辨率或高时间分辨率之一,二者之间存在根本性权衡。为解决这一权衡,我们结合静止卫星和极轨卫星的观测数据,提供高空间和高时间分辨率(1公里,15分钟间隔)的LST场,并展示了其在日内LST预报中的应用。为了估计高时空分辨率的LST场,我们训练了一个U-Net模型,将SEVIRI/MSG(3公里,15分钟分辨率)的LST场映射到在空间和时间上与之匹配的Terra/Aqua MODIS(1公里,每天4次过境)LST场。该模型在人口超过100万的欧洲大城市的LST数据上训练,在留出测试集上达到RMSE=1.92°C和接近零的偏差MBE=0.01°C。作为第二步,我们提出基于ConvLSTM架构的LST临近预报模型,该模型在降尺度后的LST场上训练,预报时效为15至75分钟。该临近预报模型优于持续性基准和气候滚动中位数基准,在所考虑的预报时效上RMSE为0.57至1.15°C,偏差范围为-0.1至0.14°C。此外,基于独立MODIS过境数据的额外验证确认了其稳健性能。我们的高时空分辨率LST预报模型可直接应用于基于卫星的业务化LST监测。

英文摘要

Land Surface Temperature (LST) is a key variable for various applications, such as urban climate and ecology studies. Yet, existing satellite-derived LST products provide either high spatial or high temporal resolution, resulting in a fundamental trade-off between the two. To address this trade-off, we combine observations from a geostationary and a polar orbiting satellite and provide LST fields at high spatial and high temporal resolution (1 km at 15-min intervals). We demonstrate their application for intraday forecasting of LSTs. To estimate LST fields at high spatiotemporal resolution, a U-Net model is trained to map LST fields from SEVIRI/MSG (3 km and 15 min resolution) to LST fields from Terra/Aqua MODIS (1 km, 4 overpasses per day) that are collocated in space and time. The presented model has been trained on LSTs across large European cities with a population exceeding 1 million inhabitants, and achieves an RMSE of $1.92$°C and a near-zero bias (MBE) of $0.01$°C on the hold-out test set. As a second step, we present an LST nowcasting model based on a ConvLSTM architecture, trained across downscaled LST fields with forecast lead times of 15 to 75 minutes. The nowcasting model outperforms persistence and Climatological Rolling Median benchmarks, with RMSEs of $0.57$ to $1.15$°C for the considered lead times and biases ranging from $-0.1$ to $0.14$°C. An additional validation conducted against independent MODIS overpasses confirms robust performance. Our LST forecast model at high spatiotemporal resolution is directly applicable to operational satellite-based LST monitoring.
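
The abstract above reports two error metrics, RMSE and MBE. As a minimal sketch of how the two differ, the following computes both on a handful of made-up temperature pairs. The sign convention (prediction minus observation) and all sample values are assumptions for illustration and are not taken from the paper.

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between paired predictions and observations."""
    if len(pred) != len(obs) or not pred:
        raise ValueError("pred and obs must be non-empty and of equal length")
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))


def mbe(pred, obs):
    """Mean bias error, assuming the convention prediction minus observation."""
    if len(pred) != len(obs) or not pred:
        raise ValueError("pred and obs must be non-empty and of equal length")
    return sum(p - o for p, o in zip(pred, obs)) / len(pred)


# Hypothetical LST values in °C (illustrative only, not data from the paper).
predicted = [21.0, 25.5, 30.0, 18.5]
observed = [20.0, 26.5, 29.0, 19.5]

print("RMSE:", rmse(predicted, observed))
print("MBE:", mbe(predicted, observed))
```

In this toy case the errors alternate in sign, so the bias cancels to zero while the RMSE does not. This is why a near-zero MBE alone says little about accuracy and the two metrics are usually reported together.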

2605.12567 2026-06-18 cs.CV cs.AI 版本更新

Pyramid Self-Contrastive Learning for Single-shot Test-time Ultrasound Image Denoising

用于单次测试时超声图像去噪的金字塔自对比学习

Jiajing Zhang, Bingze Dai, Xi Zhang, Yue Xu, Wei-Ning Lee

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong(香港大学电子与计算机工程系) Department of Biomedical Engineering, Duke University(杜克大学生物医学工程系)

AI总结 本文提出一种纯测试时训练框架,用于单次超声图像去噪,应用于合成孔径超声,通过自对比学习分离解剖相似性和噪声随机性,提升去噪效果和结构细节。

AI中文摘要

固有的电子噪声和斑点噪声使超声图像的临床解读变得复杂。传统去噪方法依赖显式的噪声假设,而这些假设在复合噪声条件下的有效性会减弱。基于学习的方法通常使用带标注的数据集在有限的图像域内进行预训练,这意味着在复杂的体内环境中不可避免地会出现领域偏移。本研究提出了金字塔自对比学习(PSCL)框架,用于无需预训练的测试时超声图像去噪。给定仅来自单次成像的多个含噪样本,PSCL将解剖相似性和噪声随机性分离到不同的金字塔潜在空间中。随后从解剖空间解码出干净图像,同时丢弃噪声空间。我们首先将PSCL应用于合成孔径超声(SAU),其中孔径到孔径(Aperture-to-Aperture)循环作为自监督代理任务,以确保去噪保真度。模拟实验涵盖0至30 dB的噪声水平以及从简单到复杂的包含物几何形状,结果显示SNR和CNR分别提高了69.3%和34.4%。体内结果表明,仅使用心脏(六个超声心动图切面)、肝脏和肾脏的两个孔径数据,SNR和CNR分别提高了84.8%和25.7%。PSCL在多种成像目标和配置下均能生成清晰的图像,为在没有领域偏移和预训练成本的情况下实现更可靠的解剖可视化铺平了道路。

英文摘要

The inherent electronic and speckle noise complicates clinical interpretation of ultrasound images. Conventional denoising methods rely on explicit noise assumptions whose validity diminishes under composite noise conditions. Learning-based methods are usually pretrained in a limited image domain using a labeled dataset, which implies inevitable domain shift in complex in vivo environments. This study proposes a Pyramid Self-Contrastive Learning (PSCL) framework for test-time ultrasound image denoising without pretraining. Given multiple noisy samples from only one-shot imaging, PSCL disentangles anatomical similarity and noise randomness into separate pyramid latent spaces. The clean image is then decoded from the anatomy space while discarding the noise space. We first apply PSCL to synthetic aperture ultrasound (SAU), where an Aperture-to-Aperture loop serves as a self-supervised proxy task to ensure denoising fidelity. Simulation experiments, including noise levels from 0 to 30 dB and inclusion geometries from simple to complex, demonstrated improvements of 69.3% in SNR and 34.4% in CNR. The in vivo results showed 84.8% SNR and 25.7% CNR gains using only two aperture data of the heart in six echocardiographic views, liver, and kidney. PSCL delivers clear images across diverse imaging targets and configurations, paving the way for more reliable anatomical visualization without domain shift and pretraining costs.

2605.11287 2026-06-18 cs.LG cs.AI 版本更新

Beyond Similarity: Temporal Operator Attention for Time Series Analysis

超越相似性:时间序列分析中的时序操作注意力

Jevon Twitty, Vinh Pham, Nitiwith Rotchanarak, Viresh Pati, Yubin Kim, Shihao Yang, Jiecheng Lu

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出时序操作注意力(TOA),通过引入可学习的操作符增强注意力机制,以更有效地处理时间序列数据中的符号和振荡变换,提升时间序列预测、异常检测和分类任务的性能。

AI中文摘要

时间序列预测中存在一个长期的悖论:结构简单的MLP和线性模型往往优于高容量的Transformer。我们认为,这种差距源于序列建模基本原语的不匹配:尽管许多时间序列动态由全局时间操作符(如滤波和谐波结构)主导,标准注意力却将每个输出构造为输入的凸组合。这限制了其表示带符号和振荡变换的能力,而这些变换是时间信号处理的基础。我们将这一限制形式化为softmax注意力中的单纯形约束混合瓶颈,它对由操作符驱动的时间序列任务限制尤为严重。为了解决这一问题,我们提出时序操作注意力(TOA),一种通过显式、可学习的序列空间操作符增强注意力的框架,使跨时间的直接带符号混合成为可能,同时保持依赖输入的自适应性。为了使密集的N×N操作符实用化,我们引入了随机操作符正则化,一种高方差的dropout机制,它能稳定训练并防止平凡的记忆化。在预测、异常检测和分类基准上,TOA在集成到PatchTST和iTransformer等标准骨干网络时始终能提升性能,在重建密集型任务中增益尤为显著。这些结果表明,显式操作符学习是有效时间序列建模的关键要素。

英文摘要

A persistent paradox in time-series forecasting is that structurally simple MLP and linear models often outperform high-capacity Transformers. We argue that this gap arises from a mismatch in the sequence-modeling primitive: while many time-series dynamics are governed by global temporal operators (e.g., filtering and harmonic structure), standard attention forms each output as a convex combination of inputs. This restricts its ability to represent signed and oscillatory transformations that are fundamental to temporal signal processing. We formalize this limitation as a simplex-constrained mixing bottleneck in softmax attention, which becomes especially restrictive for operator-driven time-series tasks. To address this, we propose $\textbf{Temporal Operator Attention (TOA)}$, a framework that augments attention with explicit, learnable sequence-space operators, enabling direct signed mixing across time while preserving input-dependent adaptivity. To make dense $N \times N$ operators practical, we introduce Stochastic Operator Regularization, a high-variance dropout mechanism that stabilizes training and prevents trivial memorization. Across forecasting, anomaly detection, and classification benchmarks, TOA consistently improves performance when integrated into standard backbones such as PatchTST and iTransformer, with particularly strong gains in reconstruction-heavy tasks. These results suggest that explicit operator learning is a key ingredient for effective time-series modeling.
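
The convex-combination limitation described above can be checked numerically. The sketch below is only an illustration of that one claim, with scalar values and made-up scores; it is not the TOA method. Any softmax weighting of a sequence stays inside the range of its inputs, whereas a signed operator row such as a first difference can leave that range.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix(weights, values):
    """Weighted sum of scalar values (one output position of a mixing operator)."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical scalar sequence and attention scores (illustrative only).
values = [5.0, 6.0, 7.0]
attn_weights = softmax([0.3, -1.2, 2.0])
attn_out = mix(attn_weights, values)

# A signed operator row implementing the first difference x[1] - x[0].
diff_row = [-1.0, 1.0, 0.0]
diff_out = mix(diff_row, values)

print("softmax weights:", attn_weights)
print("attention output:", attn_out)   # always within [min(values), max(values)]
print("difference output:", diff_out)  # 1.0, outside the input range [5, 7]
```

Because softmax weights are non-negative and sum to one, no choice of scores can make the attention output equal to the difference of two inputs here. That gap is what the abstract attributes to simplex-constrained mixing.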

2605.10840 2026-06-18 cs.LG cs.AI q-bio.QM 版本更新

Clin-JEPA: A Multi-Phase Co-Training Framework for Joint-Embedding Predictive Pretraining on EHR Patient Trajectories

Clin-JEPA:一种多阶段协同训练框架,用于EHR患者轨迹的联合嵌入预测预训练

Yixuan Yang, Mehak Arora, Ryan Zhang, Baraa Abed, Junseob Kim, Tilendra Choudhary, Md Hassanuzzaman, Kevin Zhu, Ayman Ali, Chengkun Yang, Alasdair Edward Gent, Victor Moas, Rishikesan Kamaleswaran

发表机构 * Duke University(杜克大学)

AI总结 本文提出Clin-JEPA框架,通过多阶段预训练稳定协同训练编码器和预测器,解决EHR数据中联合嵌入预测的挑战,实现多任务下游任务的高性能表现。

Comments 16 pages, 4 figures, 8 tables. Code: https://github.com/YeungYathin/Clin-JEPA

AI中文摘要

我们介绍了Clin-JEPA,一种用于EHR患者轨迹的联合嵌入预测(JEPA)预训练的多阶段协同训练框架。JEPA架构已在机器人领域实现了潜在空间规划,并在视觉领域实现了高质量的表示学习,但将其扩展到EHR数据,以获得一个既能预测患者轨迹、又能在无需针对每个任务微调的情况下服务于多种下游风险预测任务的单一主干,仍是一个开放性挑战。现有的JEPA框架要么在预训练后丢弃预测器(I-JEPA,V-JEPA),要么在冻结的预训练编码器上训练预测器(V-JEPA 2-AC),导致编码器无法感知被保留的预测器在推理时必须使用的滚动(rollout)信号;在共享的JEPA预测目标下协同训练编码器和预测器可以提供这种基础,但朴素的协同训练并不稳定,表征崩溃和在线/目标漂移会导致自回归滚动发散。Clin-JEPA的五阶段预训练课程(预测器预热、联合细化、EMA目标对齐、硬同步和预测器最终化)逐阶段解决每种失败模式,稳定地协同训练基于Qwen3-8B的编码器和一个具有9200万参数的潜在轨迹预测器。在MIMIC-IV ICU数据上,三项独立评估支持该框架:(1)在48小时范围内,只有Clin-JEPA的潜在ℓ1滚动漂移收敛(-15.7%),而基线和消融实验发散(+3%至+4951%);(2)编码器学习到了具有临床区分性的潜在几何结构(病情恶化患者群体在潜在空间中的位移是稳定患者的4.83倍,而基线编码器仅为≤2.62倍);(3)单一主干在多任务下游评估中优于强大的表格和序列基线。Clin-JEPA在ICareFM EEP上达到平均AUROC 0.851,在8个二元风险任务上达到0.883(分别比基线平均值高0.038和0.041)。

英文摘要

We present Clin-JEPA, a multi-phase co-training framework for joint-embedding predictive (JEPA) pretraining on EHR patient trajectories. JEPA architectures have enabled latent-space planning in robotics and high-quality representation learning in vision, but extending the paradigm to EHR data -- to obtain a single backbone that simultaneously forecasts patient trajectories and serves diverse downstream risk-prediction tasks without per-task fine-tuning -- remains an open challenge. Existing JEPA frameworks either discard the predictor after pretraining (I-JEPA, V-JEPA) or train it on a frozen pretrained encoder (V-JEPA 2-AC), leaving the encoder unaware of the rollout signal that the retained predictor must use at inference; co-training the encoder and predictor under a shared JEPA prediction objective would supply this grounding, but naïve co-training is unstable, with representation collapse and online/target drift causing autoregressive rollout to diverge. Clin-JEPA's five-phase pretraining curriculum -- predictor warmup, joint refinement, EMA target alignment, hard sync, and predictor finalization -- addresses each failure mode by phase, stably co-training a Qwen3-8B-based encoder and a 92M-parameter latent trajectory predictor. On MIMIC-IV ICU data, three independent evaluations support the framework: (1) latent $\ell_1$ rollout drift uniquely converges ($-$15.7%) over 48-hour horizons while baselines and ablations diverge (+3% to +4951%); (2) the encoder learns a clinically discriminative latent geometry (deteriorating-patient cohorts displace 4.83$\times$ further than stable patients in latent space, vs $\leq$2.62$\times$ for baseline encoders); (3) a single backbone outperforms strong tabular and sequence baselines on multi-task downstream evaluation. Clin-JEPA achieves mean AUROC 0.851 on ICareFM EEP and 0.883 on 8 binary risk tasks (+0.038 and +0.041 vs baseline average).

2605.10083 2026-06-18 cs.LG 版本更新

Unlocking air traffic flow prediction through microscopic aircraft-state modeling

通过微观飞机状态建模解锁空中交通流量预测

Bin Wang, Anqi Liu, Jiangtao Zhao, Hina Birahmani, Yanyong Huang, Peilan He, Guiyuan Jiang, Feng Hong, Yanwei Yu, Yuanyuan Hou, Tianrui Li

发表机构 * Faculty of Information Science and Engineering(信息科学与工程学院) Ocean University of China(中国海洋大学) Sanya Oceanographic Institution(三亚海洋研究所) Joint Laboratory of Data Science and Business Intelligence(数据科学与商务智能联合实验室) Southwestern University of Finance and Economics(西南财经大学) The Affiliated Hospital of Qingdao University(青岛大学附属医院) School of Computing and Artificial Intelligence(计算机与人工智能学院)

AI总结 本文提出AeroSense模型,通过微观飞机状态直接预测未来区域交通流量,提升高密度交通下的预测精度,替代传统时间序列方法。

AI中文摘要

终端空域短期空中交通流量预测对主动空中交通管理至关重要。现有方法主要将交通流量建模为聚合时间序列,尽管交通动态由飞机状态和连续空域中的相互作用决定。此类聚合掩盖了包括飞机运动学、边界相互作用和控制意图在内的细粒度信息。本文提出AeroSense,一种从即时空域情况中的动态飞机状态集直接预测未来交通流量的状态到流量建模框架。通过建立从微观飞机状态到未来区域交通流量的端到端映射,AeroSense在保持飞机级动态的同时,自然适应变化的交通密度,而无需依赖历史回溯窗口。在大规模真实数据集上的实验表明,AeroSense在高密度交通期间比基于聚合的预测方法具有持续的预测精度提升。这些发现表明,即时空域情况为传统基于时间序列的交通预测范式提供了有效的替代方案。

英文摘要

Short-term air traffic flow prediction in terminal airspace is essential for proactive air traffic management. Existing approaches predominantly model traffic flow as aggregated time series. However, traffic dynamics are governed by aircraft states and their interactions in continuous airspace. Such aggregation obscures fine-grained information, including aircraft kinematics, boundary interactions, and control intent. Here we present AeroSense, a state-to-flow modeling paradigm that predicts future traffic flow directly from instantaneous airspace situations represented as dynamic sets of aircraft states derived from ADS-B trajectories. By establishing an end-to-end mapping from microscopic aircraft states to future regional traffic flow, AeroSense preserves aircraft-level dynamics while naturally accommodating varying traffic density without relying on historical look-back windows. Experiments on a large-scale real-world dataset show that AeroSense achieves higher predictive accuracy and robustness than aggregation-based forecasting approaches, particularly during high-density traffic periods. These findings suggest that aircraft-state situation modeling provides a promising alternative to conventional time-series forecasting in air traffic flow management.

2605.08934 2026-06-18 cs.LG 版本更新

From Mechanistic to Compositional Interpretability

从机制到组合可解释性

Ward Gauderis, Thomas Dooms, Steven T. Homer, Kola Ayonrinde, Geraint A. Wiggins

发表机构 * UK AI Security Institute(英国人工智能安全研究所)

AI总结 本文提出组合可解释性框架,通过范畴论原理解决机制可解释性无法客观验证的问题,将解释质量分解为忠实度和复杂度,引入压缩细化方法实现模型简化,理论证明简洁性准则保障人类对齐的解释。

AI中文摘要

机制可解释性旨在通过将学习到的计算结构逆向工程为人类可理解的组件来解释神经模型的行为。然而,由于缺乏正式框架,机制解释无法被客观地验证、比较或组合。本文引入组合可解释性,这是一个基于组合性和最小描述长度原则的范畴论框架。组合解释是一对语法映射和语义映射,二者必须可交换,以保证模型分解与其观测行为之间的一致性。我们将解释质量分解为忠实度和复杂度两种度量,从而将可解释性表述为约束优化问题,并引入压缩细化方法,在不改变模型功能的前提下系统地将其重构为更简单的部分。最后,我们推导出一个简洁性准则,在该准则下,语法压缩在理论上能保证更简洁、与人类对齐的解释。该框架将主流的机制方法归为细化的子类,并阐明了为何它们的可压缩性启发式往往与人类可解释性一致。本文为自动化发现和评估机制解释提供了可测量、可优化的蓝图。

英文摘要

Mechanistic interpretability aims to explain neural model behaviour by reverse-engineering learned computational structure into human-understandable components. Without a formal framework, however, mechanistic explanations cannot be objectively verified, compared, or composed. We introduce compositional interpretability, a category-theoretic framework grounded in the principles of compositionality and minimum description length. Compositional interpretations are pairs of syntactic and semantic mappings that must commute to enforce consistency between a model's decomposition and its observed behaviour. We deconstruct explanation quality into measures of faithfulness and complexity to cast interpretability as a constrained optimisation problem, and introduce compressive refinement to systematically restructure models into simpler parts without altering their function. Finally, we derive a parsimony criterion under which syntactic compression theoretically guarantees more concise, human-aligned explanations. Our framework situates prominent mechanistic methods as subclasses of refinement, and clarifies why their compressibility heuristics tend to align with human interpretability. Our work provides a measurable, optimisable blueprint for automating the discovery and evaluation of mechanistic explanations.

2605.04267 2026-06-18 cs.LG cs.NE math.OC 版本更新

QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization

QUIVER:代理辅助进化多目标优化中的成本感知自适应偏好查询

Florian A. D. Burnat

发表机构 * University of Warwick(沃里克大学) Warwick Business School(沃里克商学院)

AI总结 提出QUIVER方法,通过自适应选择目标评估与异质偏好查询(成对偏好陈述与无差异调整),在代理辅助多目标优化中最小化决策遗憾,实验显示在WFG难题上效用遗憾降低25%。

Comments Accepted at Genetic and Evolutionary Computation Conference (GECCO '26)

AI中文摘要

交互式多目标优化系统面临预算分配困境:资源可用于昂贵的目标评估,或用于引出决策者偏好以识别帕累托集的相关区域。此外,偏好引出本身跨越具有不同信息内容和认知负担的模态,从廉价、嘈杂的成对偏好陈述(PS)到更丰富但成本更高的无差异调整(IA)。我们研究了未知标量化下的成本感知优化,并引入了QUIVER(查询信息价值估计遗憾),这是一种代理辅助的进化多目标优化器,可自适应地在目标评估和异质偏好查询之间进行选择。在每一步,QUIVER通过最大化每单位总成本的预期决策质量改进来选择下一个动作。在合成决策者模型下的DTLZ和WFG基准测试中,QUIVER在具有挑战性的WFG问题上实现了最低的最终效用遗憾(WFG4上效用遗憾为2.14,WFG9上为2.82:比基线提高25%),优于所有单模态基线。我们分析了PS和IA的最优混合如何适应问题难度:在简单问题(DTLZ2)上,QUIVER选择80%的PS查询;在困难问题(WFG9)上,它转向35%的IA查询。这种自适应模态选择展示了成本感知偏好学习的实际应用。

英文摘要

Interactive multi-objective optimization systems face a budget allocation dilemma: one can spend resources on expensive objective evaluations or on eliciting decision-maker preferences that identify the relevant region of the Pareto set. Moreover, preference elicitation itself spans modalities with different information content and cognitive burden, ranging from cheap, noisy pairwise preference statements (PS) to richer but costlier indifference adjustments (IA). We study cost-aware optimization under an unknown scalarization and introduce QUIVER (Query-Informed Value Estimation for Regret), a surrogate-assisted evolutionary multi-objective optimizer that adaptively chooses between objective evaluations and heterogeneous preference queries. At each step, QUIVER selects the next action by maximizing the expected decision-quality improvement per unit total cost. Across DTLZ and WFG benchmarks under synthetic decision-maker models, QUIVER achieves the lowest final utility regret on challenging WFG problems (utility regret of 2.14 on WFG4, 2.82 on WFG9: a 25% improvement over baselines), outperforming all single-modality baselines. We analyze how the optimal mix of PS and IA adapts to problem difficulty: on easy problems (DTLZ2), QUIVER selects 80% PS queries; on hard problems (WFG9), it shifts to 35% IA queries. This adaptive modality selection demonstrates cost-aware preference learning in action.
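
The selection rule stated in the abstract, maximizing expected improvement per unit cost, can be sketched in a few lines. How QUIVER actually estimates the expected improvement is not described here, so the gains, costs, and action names below are hypothetical placeholders rather than the paper's values.

```python
def select_action(candidates):
    """Return the name of the action with the highest expected gain per unit cost.

    candidates: list of (name, expected_gain, cost) tuples with cost > 0.
    """
    if not candidates:
        raise ValueError("at least one candidate action is required")
    for name, _, cost in candidates:
        if cost <= 0:
            raise ValueError(f"cost must be positive for action {name!r}")
    best = max(candidates, key=lambda c: c[1] / c[2])
    return best[0]


# Hypothetical expected gains and costs (illustrative only).
candidates = [
    ("objective_evaluation", 0.50, 10.0),    # ratio 0.05
    ("pairwise_statement", 0.08, 1.0),       # ratio 0.08
    ("indifference_adjustment", 0.30, 4.0),  # ratio 0.075
]

print(select_action(candidates))
```

With these placeholder numbers the cheap pairwise statement wins despite having the smallest absolute gain, which is the kind of trade-off a cost-normalized criterion is meant to capture.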

2605.05925 2026-06-18 cs.RO 版本更新

DexSynRefine: Synthesizing and Refining Human-Object Interaction Motion for Physically Feasible Dexterous Robot Actions

DexSynRefine:合成与精炼人-物交互运动以实现物理可行的灵巧机器人动作

Hyesung Lee, Hyunwoo Jung, Si-Hwan Heo, Sungwook Yang

发表机构 * Korea Institute of Science and Technology(韩国科学技术研究院) KAIST(韩国科学技术院) Hanyang University(汉阳大学)

AI总结 提出DexSynRefine框架,通过HOI-MMFP运动先验合成手-物轨迹,结合任务空间残差强化学习和接触动力学适应,将人-物交互数据转化为物理可行的灵巧操作,在五个任务上成功率提升50-70个百分点。

Comments Project page: https://dexsynrefine.github.io/

AI中文摘要

从人-物交互(HOI)数据中学习灵巧操作为机器人遥操作提供了一种可扩展的替代方案,但HOI演示通常稀疏且纯运动学,在实体不匹配和接触丰富的动力学下直接重定向不可靠。我们提出DexSynRefine,一个耦合框架,将HOI数据视为结构化运动先验而非可执行的机器人动作。DexSynRefine首先使用HOI运动流形流基元(HOI-MMFP)——一种耦合手-物运动的运动先验,根据任务和初始物体状态合成手-物轨迹。然后通过任务空间残差强化学习对其进行物理接地,并通过从本体感受历史推断缺失的接触动力学上下文来适应执行。在五个灵巧操作任务中,每个阶段解决一个互补的瓶颈:HOI-MMFP提高了轨迹一致性和平滑性,任务空间残差在测试的替代方案中提供了最强的接地表示,接触动力学适应实现了鲁棒的真实世界执行。综合来看,DexSynRefine在真实世界中的成功率比运动学重定向提高了50-70个百分点。

英文摘要

Learning dexterous manipulation from human-object interaction (HOI) data offers a scalable alternative to robot teleoperation, but HOI demonstrations are typically sparse and purely kinematic, making direct retargeting unreliable under embodiment mismatch and contact-rich dynamics. We present DexSynRefine, a coupled framework that treats HOI data as structured motion priors rather than executable robot actions. DexSynRefine first synthesizes hand-object trajectories conditioned on the task and initial object state using HOI Motion Manifold Flow Primitives (HOI-MMFP), a motion prior for coupled hand-object motion. It then physically grounds them with task-space residual reinforcement learning and adapts execution by inferring missing contact-dynamics context from proprioceptive history. Across five dexterous manipulation tasks, each stage addresses a complementary bottleneck: HOI-MMFP improves trajectory consistency and smoothness, task-space residuals provide the strongest grounding representation among the tested alternatives, and contact-dynamics adaptation enables robust real-world execution. Together, DexSynRefine improves real-world success rates over kinematic retargeting by 50-70 percentage points.

2605.05547 2026-06-18 cs.CV 版本更新

Characterizing Brazilian Atlantic Forest Restoration Outcomes with Geospatial AlphaEarth Embeddings

利用地理空间AlphaEarth嵌入表征巴西大西洋森林恢复结果

Alice Heiman

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本研究利用AlphaEarth基础模型的卫星嵌入,通过余弦相似度定义参考轨迹嵌入,评估巴西圣保罗1729个恢复点的早期恢复成效,发现不同土地利用类型在嵌入空间中形成聚类,但信号存在噪声。

Comments Presented as a workshop paper at ICLR 2026 Machine Learning for Remote Sensing (ML4RS)

AI中文摘要

巴西的大西洋森林是一个关键生物多样性热点,但其原始覆盖面积仅存不足12-15%。尽管大规模监测森林恢复至关重要,但传统方法受限于实地报告在大尺度上的不可行性以及遥感指数(如NDVI)的饱和效应。此外,与森林砍伐导致的快速光谱变化不同,再造林是一个渐进过程。在本研究中,我们利用AlphaEarth Foundation模型的卫星嵌入,检查了圣保罗的1,729个恢复点,以评估其在表征早期恢复成功方面的有效性。我们引入了“参考轨迹嵌入”的概念,基于与成熟次生林参考点的余弦相似度定义恢复成功的度量。我们观察到不同土地利用和土地覆盖(LULC)类型在嵌入空间中形成不同的聚类,并且能够识别出具有明显变化向量的地点。然而,信号可能存在噪声,嵌入可能需要进一步微调以捕获和预测超出LULC的地点元数据。

英文摘要

The Atlantic Forest in Brazil is a critical biodiversity hotspot, yet less than 12-15% of its original cover remains. Although monitoring forest restoration on a large scale is essential, traditional methods are limited by the impracticality of on-the-ground reporting on such a scale and by the saturation of remote-sensing indices such as NDVI. Furthermore, reforestation is a gradual process as opposed to the rapid spectral changes caused by deforestation. In this study, we examine 1,729 restoration sites in São Paulo, using satellite embeddings from the AlphaEarth Foundation's model to evaluate their effectiveness in characterising early restoration success. We introduce the concept of a 'Reference Trajectory Embedding', defining a metric of restoration success based on cosine similarity to reference sites of mature secondary forest. We observe distinct clusters in embedding space according to different land use and land cover (LULC) types, and we can identify sites with clear change vectors. However, the signal can be noisy, and embeddings may require further fine-tuning to capture and predict site metadata beyond LULC.
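
A metric based on cosine similarity to reference sites, as described above, can be sketched as follows. Averaging the reference embeddings into a single vector is an assumption made here for illustration, and the three-dimensional vectors are toy values; real satellite embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical embeddings (illustrative only, not real AlphaEarth outputs).
mature_forest_sites = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
reference = mean_vector(mature_forest_sites)  # assumed aggregation step

site_early = [0.0, 1.0, 0.0]  # e.g. shortly after planting
site_late = [0.9, 0.3, 0.0]   # e.g. several years into restoration

score_early = cosine_similarity(site_early, reference)
score_late = cosine_similarity(site_late, reference)
print(score_early, score_late)
```

A site whose score rises toward 1 over successive years would be moving toward the mature-forest reference in embedding space; as the abstract cautions, such a signal can be noisy in practice.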

2508.04086 2026-06-18 cs.CL 版本更新

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

ToolGrad:利用文本“梯度”高效生成工具使用数据集

Zhongyi Zhou, Kohei Uehara, Haoyu Zhang, Jingtao Zhou, Lin Gu, Ruofei Du, Zheng Xu, Tatsuya Harada

发表机构 * Google(谷歌) The University of Tokyo(东京大学) RIKEN AIP(理化学研究所AIP中心) Tohoku University(东北大学)

AI总结 提出ToolGrad框架,通过文本“梯度”引导的迭代过程先构建有效工具使用链再合成用户查询,实现低成本、高成功率的数据生成,训练模型性能超越基线。

Comments ACL 2026 Findings. Source code: https://github.com/zhongyi-zhou/toolgrad

AI中文摘要

先前的工作通过首先生成用户查询,然后进行复杂的工具使用注释(如深度优先搜索)来合成工具使用LLM数据集。这导致不可避免的注释失败和数据生成效率低下。我们引入了ToolGrad,一个反转这种范式的智能体框架。ToolGrad首先通过由文本“梯度”引导的迭代过程构建有效的工具使用链,然后合成相应的用户查询。这种“答案优先”的方法产生了ToolGrad-500,一个以更复杂的工具使用、更低的成本和几乎100%的通过率生成的数据集。实验表明,ToolGrad模型优于在昂贵的基线数据集上训练的模型和专有LLM。ToolGrad源代码、数据集和模型可在 https://github.com/zhongyi-zhou/toolgrad 获取。

英文摘要

Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-500, a dataset generated with more complex tool use, lower cost, and almost 100% pass rate. Experiments show that ToolGrad models outperform those trained on expensive baseline datasets and proprietary LLMs. The ToolGrad source code, dataset, and models are available at https://github.com/zhongyi-zhou/toolgrad.

2604.28076 2026-06-18 cs.CL cs.AI cs.LG 版本更新

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

TopBench:表格问答中隐式预测推理的基准

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University, China(人工智能学院,南京大学,中国) National Key Laboratory for Novel Software Technology, Nanjing University, China(新型软件技术国家重点实验室,南京大学,中国)

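AI summary Proposes the TopBench benchmark, with 779 samples across four sub-tasks, to evaluate how well large language models recognize implicit predictive intent in tabular question answering and reason reliably about it; finds that current models struggle with intent recognition.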

Details
AI Chinese abstract

大型语言模型(LLM)推动了表格问答的发展,其中大多数查询可以通过提取信息或简单聚合来回答。然而,一类常见的现实世界查询是隐式预测性的,需要从历史模式中推断未观察到的答案,而不仅仅是检索。这些查询带来了两个挑战:识别潜在意图和对大规模表格进行可靠的预测推理。为了评估LLM在带有隐式预测任务的表格问答中的表现,我们引入了TopBench,一个包含779个样本的基准,涵盖四个子任务,从单点预测到决策制定、处理效应分析和复杂过滤,要求模型生成涵盖推理文本和结构化表格的输出。我们在基于文本和代理工作流下评估了多种模型。实验表明,当前模型通常在意图识别上存在困难,默认进行查找。更深入的分析发现,准确的意图消歧是引导这些预测行为的前提。此外,提升预测精度的上限需要整合更复杂的建模或推理能力。

English abstract

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

2604.25848 2026-06-18 cs.AI Updated version

A Distributionally Robust Reinforcement Learning Framework for Constrained Urban EV Dispatch

面向约束城市电动汽车调度的分布鲁棒强化学习框架

An Nguyen, Hoang Nguyen, Phuong Le, Hung Pham, Cuong Do, Laurent El Ghaoui

Affiliations * College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam; Center for Environmental Intelligence, VinUniversity, Hanoi, Vietnam

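AI summary Addresses urban EV dispatch under charging-station and feeder capacity constraints and uncertain demand, using a semi-Markov decision process and a distributionally robust Soft Actor-Critic algorithm with a graph convolutional encoder; a rolling mixed-integer linear program enforces feasibility, and simulations built from NYC taxi data show the highest net profit with zero feeder-limit violations.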

Details
AI Chinese abstract

我们研究城市规模的电动汽车(EV)网约车车队控制,其中调度、重新定位和充电决策必须在不确定且空间相关的出行需求和旅行时间下,遵守充电器和馈线限制。我们将问题建模为六边形网格半马尔可夫决策过程(semi-MDP),具有混合动作——用于服务、重新定位和充电的离散动作,以及连续充电功率——和可变动作持续时间。为了保证训练和部署期间的物理可行性,策略在由掩码温度退火actor产生的高层意图上学习。这些意图在每个决策步骤通过一个时间受限的滚动混合整数线性规划(MILP)进行投影,该规划严格强制执行荷电状态、充电端口和馈线约束。为了缓解分布偏移,我们针对一个Wasserstein-1模糊集优化软演员-评论家(SAC)智能体,该模糊集使用图对齐的马氏基础度量来捕捉空间相关性。鲁棒备份使用Kantorovich-Rubinstein对偶、投影次梯度内环和原始-对偶风险预算更新。我们的架构结合了两层图卷积网络(GCN)编码器、双评论家和一个驱动对手的价值网络。基于纽约出租车数据构建的大规模电动汽车车队模拟器上的实验表明,PD-RSAC实现了最高的净利润,达到122万美元,而强启发式、单智能体RL和多智能体RL基线(包括Greedy、SAC、MAPPO和MADDPG)的净利润为58万至70万美元,同时保持零馈线限制违规。

English abstract

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor-Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich-Rubinstein dual, a projected subgradient inner loop, and a primal-dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD-RSAC achieves the highest net profit, reaching $1.22M, compared with $0.58M-$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.
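
For readers unfamiliar with the robust backup mentioned in the abstract, the block below writes out the textbook form of a Wasserstein-1 worst-case expectation and its dual. It is a generic sketch, not the paper's formulation: the symbols $\hat{P}$ (nominal next-state distribution), $\varepsilon$ (ambiguity radius), $M$ (positive definite matrix defining the Mahalanobis ground metric) and $V$ (value function) are illustrative assumptions, and the dual equality holds under standard regularity conditions.

```latex
% Generic Wasserstein-1 robust backup; notation is illustrative, not taken from the paper.
\begin{align}
d_M(x, y) &= \sqrt{(x - y)^\top M (x - y)}, \\
W_1(P, Q) &= \sup_{\|f\|_{\mathrm{Lip}} \le 1} \; \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[f], \\
\inf_{Q:\, W_1(Q, \hat{P}) \le \varepsilon} \mathbb{E}_{Q}[V]
  &= \sup_{\lambda \ge 0} \Big\{ -\lambda \varepsilon
     + \mathbb{E}_{x \sim \hat{P}} \Big[ \inf_{y} \big( V(y) + \lambda\, d_M(x, y) \big) \Big] \Big\}.
\end{align}
```

Here the Lipschitz constant is measured with respect to $d_M$. If $V$ is $L$-Lipschitz under $d_M$, choosing $\lambda = L$ gives the simple lower bound $\mathbb{E}_{\hat{P}}[V] - L\varepsilon$ on the worst case, which is why the dual variable can be read as a price on distributional shift.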

2604.23716 2026-06-18 cs.AI cs.IT cs.LG cs.MA math.IT Updated version

Information-Theoretic Measures in AI: A Practical Decision Guide

人工智能中的信息论度量:实用决策指南

Nikolaos Al. Papadopoulos, Konstantinos E. Psannis

Affiliations * Department of Applied Informatics, University of Macedonia

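AI summary Provides a practical decision framework for seven information-theoretic measures, organized around three key questions per measure: what question it answers and in which AI context, which estimator is appropriate, and what the most dangerous misuse is; accompanied by a flowchart and a decision table.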

Comments 25 pages, 2 tables, 1 figure. Submitted to Entropy (MDPI)

Details
AI Chinese abstract

信息论(IT)度量在人工智能中无处不在:熵驱动决策树分裂和不确定性量化,交叉熵是默认的分类损失,互信息支撑表示学习和特征选择,转移熵揭示动态系统中的有向影响。第二类较不成熟的度量——整合信息(Phi)、有效信息(EI)和自主性——已出现用于表征智能体复杂性。尽管被广泛采用,度量选择常常与估计器假设、失败模式和安全的推断主张脱节。本文为所有七种度量提供了一个实用决策框架,围绕每个度量的三个指导性问题组织:(i)该度量回答什么问题,在何种AI背景下;(ii)哪种估计器适合数据类型和维度;(iii)最危险的误用是什么。该框架通过两个互补的人工制品实现:度量选择流程图和主决策表。我们涵盖每个度量的AI/ML和决策智能体应用领域,并使用标准化桥接框将IT量与认知构造联系起来。三个工作示例展示了该框架在具体从业者场景中的应用,涵盖表示学习、时间影响分析和进化智能体复杂性。

English abstract

Information-theoretic (IT) measures are ubiquitous in artificial intelligence: entropy drives decision-tree splits and uncertainty quantification, cross-entropy is the default classification loss, mutual information underpins representation learning and feature selection, and transfer entropy reveals directed influence in dynamical systems. A second, less consolidated family of measures, integrated information (Phi), effective information (EI), and autonomy, has emerged for characterizing agent complexity. Despite wide adoption, measure selection is often decoupled from estimator assumptions, failure modes, and safe inferential claims. This paper provides a practical decision framework for all seven measures, organized around three prescriptive questions for each: (i) what question does the measure answer and in which AI context; (ii) which estimator is appropriate for the data type and dimensionality; and (iii) what is the most dangerous misuse. The framework is operationalized in two complementary artifacts: a measure-selection flowchart and a master decision table. We cover both AI/ML and decision-making agent application domains per measure, with standardized Bridge Boxes linking IT quantities to cognitive constructs. Three worked examples illustrate the framework on concrete practitioner scenarios spanning representation learning, temporal influence analysis, and evolved agent complexity.
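
As a concrete reference for three of the measures named in the abstract, the sketch below computes Shannon entropy, cross-entropy, and mutual information for small discrete distributions. It is not the paper's framework: it assumes the true probabilities are known, so it says nothing about estimator choice on finite samples, which is the paper's actual concern. All function and variable names are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; q must be positive wherever p is."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


fair = [0.5, 0.5]
biased = [0.9, 0.1]
independent = [[0.25, 0.25], [0.25, 0.25]]  # X and Y unrelated
copied = [[0.5, 0.0], [0.0, 0.5]]           # Y is an exact copy of X

print(entropy(fair))                    # one bit for a fair coin
print(cross_entropy(fair, biased))      # exceeds H(fair): the model is mismatched
print(mutual_information(independent))  # zero bits shared
print(mutual_information(copied))       # one bit shared
```

The gap between `cross_entropy(fair, biased)` and `entropy(fair)` is the KL divergence, which is the quantity a classification loss implicitly minimizes.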

2604.23130 2026-06-18 cs.CL cs.AI Updated version

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

从概念对齐的Token到脆弱特征:越狱的机制定位

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

Affiliations * UMBC (University of Maryland, Baltimore County); Apple

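AI summary Proposes a token-driven mechanistic pipeline that localizes jailbreak vulnerabilities through sparse autoencoder feature subgroups; finds that a single harmful token is enough to localize vulnerable features, which concentrate in mid-to-late layers.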

Details
AI Chinese abstract

越狱攻击揭示了安全对齐的大语言模型中一种持续的失败模式:模型可以被推向有害行为,但促成这种转变的内部表示仍未被很好地定位。最近的机制安全性研究通常通过广泛的表示对象来解释这种行为,包括全局拒绝方向、激活引导向量和与拒绝相关的SAE特征。我们转而询问越狱脆弱性是否可以追溯到更细粒度的、基于提示的SAE特征子组。我们引入了一个基于Token的机制流水线,将Gemma-2-2B的残差流分解为稀疏自编码器(SAE)特征,并识别与不安全行为相关的特征子组。使用BeaverTails中的单类别不安全示例以减少跨类别干扰,我们从对抗性响应中提取有害概念,并通过子空间相似性将其与概念相关的提示Token对齐。然后,我们应用三种特征分组策略:基于聚类的、层次链接的和单Token驱动的,以识别所有26层中的SAE特征子组。最后,我们放大每个子组中的顶级特征,并使用标准的有害性评判器评估生成的输出。单Token驱动的分组实现了与完整基于聚类的分组相当的有害性,表明单个有害提示Token足以定位与脆弱性相关的SAE特征子组,而无需依赖更广泛的聚类级聚合。这些子组出现在早期和中后期层,且更集中在中后期层,其中目标引导暴露了特定的模型脆弱性。总体而言,我们的结果表明越狱敏感性可以追溯到稀疏的、基于Token定位的SAE特征子组,补充了先前基于广泛对抗、拒绝或引导方向的解释。

English abstract

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.