arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13540 2026-05-14 cs.LG cs.AI

Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Jianxin Li, Philip S. Yu

Affiliations: School of Computer Science and Engineering, Beihang University; Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University; Department of Computer Science, University of Illinois at Chicago

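AI summary: Dynamic graphs are widespread in real-world systems, and building generalizable dynamic graph foundation models is an important frontier in graph learning. To address the unified-modeling challenge posed by inconsistent semantics and temporal patterns across multi-domain dynamic graphs, this paper proposes DyGFM, a multi-domain dynamic graph foundation model based on decoupled and divergence-conditioned prompting. The model uses a dual-branch pre-training strategy with semantic-temporal decoupling to separate transferable semantics from domain-specific dynamics, and introduces a divergence-aware cross-domain routing mechanism and a prompt generator, which mitigate negative transfer and improve the efficiency of downstream fine-tuning. Experiments show that DyGFM significantly outperforms 12 state-of-the-art baselines on multiple dynamic graph benchmarks.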

Abstract

Dynamic graphs are ubiquitous in real-world systems, and building generalizable dynamic Graph Foundation Models has become a frontier in graph learning. However, dynamic graphs from different domains pose fundamental challenges to unified modeling, as their semantic and temporal patterns are inherently inconsistent, making the multi-domain pre-training difficult. Consequently, the widely used "pretrain-then-finetune" paradigm often suffers from severe negative knowledge transfer. To the best of our knowledge, there exists no multi-domain dynamic GFM. In this work, we propose DyGFM, a Dynamic Graph Foundation Model over multiple domains based on decoupled and divergence-conditioned prompting. To disentangle transferable semantics from the domain-specific dynamics, we introduce a dual-branch pre-training strategy with semantic-temporal decoupling. To alleviate negative transfer during domain adaptation, we further develop a cross-domain routing mechanism with divergence-aware expert selection. To enable efficient downstream fine-tuning, we design a divergence-conditioned prompt generator that injects lightweight, learnable graph prompts tailored to semantic and temporal traits. Extensive experiments on continuous dynamic graph benchmarks demonstrate that DyGFM consistently outperforms 12 state-of-the-art baselines on both node classification and link prediction tasks, achieving superior effectiveness and efficiency.

2605.13539 2026-05-14 cs.RO cs.SE

Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles

Christian Geller, Daniel Becker, Jobst Beckmann, Lutz Eckstein

Affiliations: Institute for Automotive Engineering, RWTH Aachen University

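AI summary: This paper studies how to integrate an agent model into an open simulation architecture to support scenario-based testing of automated vehicles. To address the poor interoperability of agent models across simulation environments, the authors propose a modular integration architecture based on standardized interfaces, using the OSI and FMI standards for tool-independent model exchange. The architecture's generality and consistency are validated by applying it in three widely used simulation platforms, providing a reusable reference implementation for safety testing of automated driving systems.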

Journal ref: at - Automatisierungstechnik, 2026, vol. 74, no. 5, Special Issue: AI for automated driving

Abstract

Simulative and scenario-based testing are crucial methods in the safety assurance for automated driving systems. To ensure that simulation results are reliable, the real world must be modeled with sufficient fidelity, including not only the static environment but also the surrounding traffic of a vehicle under test. Thus, the availability of traffic agent models is of common interest to model naturalistic and parameterizable behavior, similar to human drivers. The interchangeability of agent models across different simulation environments represents a major challenge and necessitates harmonization and standardization. To address this challenge, we present a standardized and modular simulation integration architecture that enables the tool-independent integration of traffic agent models. The architecture builds upon the Open Simulation Interface (OSI) as a structured message format and the Functional Mock-up Interface (FMI) for dynamic model exchange. Rather than introducing yet another model or simulation tool, we provide a reusable reference implementation that translates these standards into a practical integration blueprint, including clear interfaces, data mappings, and execution semantics. The generic nature of the architecture is demonstrated by integrating an exemplary agent model into three widely used simulation environments: OpenPASS, CARLA, and CarMaker. As part of the evaluation, we show that the model yields consistent behavior in all simulation platforms, thereby validating the interoperability, modularity, and standard compliance of the proposed architecture. The reference implementation lowers integration barriers, serves as a foundation for future research, and is made publicly available at github.com/ika-rwth-aachen/agent-model-integration

2605.13537 2026-05-14 cs.LG cs.AI cs.CL

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Ye Wang, Jing Liu, Toshiaki Koike-Akino

Affiliations: Mitsubishi Electric Research Laboratories (MERL)

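AI summary: This paper studies how inference-time alignment can mitigate reward hacking. The authors introduce reference-model temperature adjustment, which generalizes inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). The approach improves robustness while preserving alignment performance, and it supports continual adaptation as reward targets change.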

Abstract

Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
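
The abstract does not state the SLOP formula, so the following is only a rough sketch of the general idea of a logarithmic opinion pool: a weighted geometric mean of distributions, p(y) ∝ ∏ p_i(y)^{w_i}. Two things here are assumptions for illustration and not the paper's definition: that the reference model enters with exponent 1/temperature, and that "sharpened" means the exponents need not sum to one. All names and numbers are hypothetical.

```python
import math

def log_opinion_pool(dists, weights):
    """Weighted geometric mean of discrete distributions, renormalized.

    dists:   list of probability lists over the same candidates
             (assumed strictly positive, since logs are taken)
    weights: one non-negative exponent per distribution
    """
    n = len(dists[0])
    # Work in log space: log p(y) = sum_i w_i * log p_i(y) - log Z
    log_scores = [
        sum(w * math.log(d[y]) for d, w in zip(dists, weights))
        for y in range(n)
    ]
    m = max(log_scores)  # subtract the max for numerical stability
    unnorm = [math.exp(s - m) for s in log_scores]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical next-token distributions over three candidates
reference = [0.5, 0.3, 0.2]
reward_a = [0.2, 0.6, 0.2]
reward_b = [0.3, 0.5, 0.2]

temperature = 0.5  # assumption: reference exponent is 1 / temperature
pooled = log_opinion_pool(
    [reference, reward_a, reward_b],
    [1.0 / temperature, 1.0, 1.0],
)
```

With a single distribution and weight 1 the pool returns that distribution unchanged; an exponent above 1 concentrates mass on its most likely candidate.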

2605.13536 2026-05-14 cs.LG cs.AI

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

Qingyun Zou, Feng Yu, Hongshi Tan, Yao Chen, Bingsheng He, WengFai Wong

Affiliations: National University of Singapore

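AI summary: HLS-Seek is a QoR-aware code generation framework based on reinforcement learning with a proxy comparative reward, aimed at improving the Quality of Results (QoR) of High-Level Synthesis (HLS) code, namely latency and resource utilization. It trains on relative comparisons rather than absolute synthesis results, which substantially lowers training cost, and it uses an uncertainty-aware Monte Carlo dropout mechanism to guard against reward hacking, yielding a self-improving reward system. Experiments show that HLS-Seek outperforms existing models in syntax and functional correctness, trains more efficiently, and achieves strong QoR on several benchmarks.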

Abstract

High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR), namely latency and resource utilization, critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches train for functional correctness but ignore QoR entirely. We observe that reinforcement learning (RL) for HLS does not require absolute synthesis results, only relative comparisons between candidates. Based on this insight, we propose HLS-Seek, a QoR-aware NL-to-HLS framework that replaces expensive synthesis-in-the-loop RL with a comparative proxy reward model achieving 99.53% Pareto-dominance accuracy. To prevent reward hacking, we introduce uncertainty-aware Monte Carlo (MC) dropout switching that selectively invokes real Vitis HLS synthesis for low-confidence candidates and updates the proxy online, creating a self-improving reward system. HLS-Seek achieves 81.5% syntax correctness pass@1 and 81.4% Func@5 on HLS-eval with only 7B parameters, surpassing GPT-5.1 and other frontier models while achieving 8.5× faster training than real-reward RL. On QoR evaluation, HLS-Seek achieves the lowest latency on 16/30 kernels and Pareto-dominates HLS-specific baselines on 9 kernels.
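
To make the notion of Pareto dominance between candidates concrete, here is a minimal sketch. It assumes each candidate is summarized by a tuple of costs where lower is better, such as latency and a single resource figure. The three-way label is a hypothetical illustration of what a comparative reward could predict; the abstract does not describe the paper's actual labeling scheme.

```python
def pareto_dominates(a, b):
    """Return True if design a Pareto-dominates design b.

    Each design is a tuple of costs (lower is better), e.g.
    (latency_cycles, resource_units). a dominates b when it is no worse
    on every objective and strictly better on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def comparative_label(a, b):
    """Hypothetical three-way comparison target for a pair of designs."""
    if pareto_dominates(a, b):
        return 1    # a preferred
    if pareto_dominates(b, a):
        return -1   # b preferred
    return 0        # equal or incomparable (a trade-off)

# Hypothetical (latency, resource) measurements for two candidates
candidate_a = (1200, 35)
candidate_b = (1500, 35)
label = comparative_label(candidate_a, candidate_b)
```

Only the ordering between two candidates matters here, which is why relative comparisons can stand in for absolute synthesis numbers.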

2605.13534 2026-05-14 cs.AI

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

Jiabei Liu, Wenyu Mao, Junfei Tan, Chunxu Shen, Lingling Yi, Jiancan Wu, Xiang Wang

Affiliations: University of Science and Technology of China; WeChat Technical Architecture Department, Tencent Inc.

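AI summary: This paper proposes MultiSearch, a reinforcement-learning-based framework for improving retrieval-augmented reasoning. At each reasoning step it generates queries from multiple perspectives and retrieves in parallel, broadening information coverage, and it explicitly consolidates and refines the retrieved results during merging, which raises the signal-to-noise ratio (SNR) and reasoning accuracy. Experiments show that MultiSearch outperforms existing methods on multiple benchmarks and markedly improves reasoning performance on question-answering tasks.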

Abstract

Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these limitations through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines the retrieved information during the merging process, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi-process reward design to optimize agents for both multi-query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question-answering tasks.

2605.13532 2026-05-14 cs.AI cs.CL cs.CY cs.HC

AI-Generated Slides: Are They Good? Can Students Tell?

Juho Leinonen, Lisa Zhang, Arto Hellas

Affiliations: Aalto University; University of Toronto Mississauga

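AI summary: This paper studies how well generative AI (GenAI) produces lecture slides for teaching, focusing on instructor and student perceptions of AI-generated slides. Comparing slides generated by several AI tools with human-made slides, the study finds that AI-generated slides perform well in accuracy and pedagogical quality, that students struggle to tell AI-generated slides from human-made ones, and that students are more inclined to attribute slides they rate highly to a human author. The results suggest that GenAI has considerable potential in instructional design, while responsible and effective use needs further study.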

Comments: 7 pages, 2 tables. Accepted to Western Canada Conference on Computing Education (WCCCE) 2026

Abstract

As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI-generated slides are "good" via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI-generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. We also find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.

2605.13530 2026-05-14 cs.CV cs.AI

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

Affiliations: Southern University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Northwestern University; University of Alberta; Yale University; Nanfang Hospital; Shenzhen University of Advanced Technology

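AI summary: This paper proposes SurgMLLM, a unified surgical scene understanding framework that combines high-level semantic reasoning with low-level visual grounding, addressing the semantic inconsistency that arises when existing methods handle these components in isolation. By fine-tuning a multimodal large language model, it jointly models surgical phases, instrument-verb-target triplets, and the corresponding segmentation regions, and it achieves precise pixel-level grounding through temporal aggregation and a segmentation network. Experiments show that SurgMLLM brings significant gains on both triplet recognition and segmentation, supporting the value of unified reasoning and grounding for surgical assistance.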

Abstract

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

2605.13500 2026-05-14 cs.RO cs.CR

Uncertainty-Aware 3D Position Refinement for Multi-UAV Systems

Hosam Alamleh, Damir Pulatov

Affiliations: University of North Carolina Wilmington

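AI summary: This paper studies robust 3D position refinement in multi-UAV systems. To counter the loss of accuracy in local position estimates under GNSS interference, non-line-of-sight reception, and similar conditions, it proposes a decentralized, lightweight position-refinement method. The method fuses each UAV's local estimate with state information shared by neighbors, together with inter-UAV range constraints, performing uncertainty-aware neighborhood fusion that improves localization robustness. Experiments show that the method reduces localization error both during cold start and in the presence of malicious nodes, and that it shows good prospects for practical use.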

Abstract

Reliable real-time 3D localization is essential for multi-UAV navigation, collision avoidance, and coordinated flight, yet onboard estimates can degrade under GNSS multipath, non-line-of-sight reception, vertical drift, and intentional interference. This paper presents a decentralized, lightweight 3D position-refinement layer that improves robustness by fusing each Unmanned Aerial Vehicle (UAV)'s local estimate with neighbor-shared state summaries and inter-UAV range or proximity constraints. The method performs uncertainty-aware neighborhood fusion by weighting each UAV's prior according to its reported covariance and weighting neighbor constraints according to link quality, ranging uncertainty, and a learned trust score. To support practical deployment, the framework explicitly handles cold start and temporary localization loss by inflating or substituting weak priors, allowing trusted neighborhood constraints to bootstrap and stabilize estimates until absolute sensing recovers. To mitigate the impact of faulty or malicious participants, each UAV applies a local range-consistency check, smoothed over time, to down-weight or exclude neighbors whose reported positions are incompatible with observed inter-UAV distances. Simulation experiments with 10 UAVs in a 3D volume show that the proposed refinement substantially reduces mean localization error during cold start, remains competitive after local estimators stabilize, and maintains lower error as the fraction of malicious nodes increases compared with fusion without trust. These results suggest that the approach can serve as a practical resilience layer for swarm operation in challenging environments.
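
The local range-consistency check described above can be illustrated with a short sketch. It is written under assumptions rather than taken from the paper: the abstract mentions a learned trust score, whereas this example uses a hand-set rule with exponential smoothing, and the tolerance, smoothing factor, and exclusion floor are hypothetical values.

```python
import math

def range_residual(own_pos, reported_pos, measured_range):
    """Absolute gap between the distance implied by the reported positions
    and the directly measured inter-UAV range (same length unit)."""
    implied = math.dist(own_pos, reported_pos)
    return abs(implied - measured_range)

def update_trust(prev_trust, residual, tolerance=2.0, alpha=0.2):
    """Exponentially smoothed trust score in [0, 1].

    A residual within `tolerance` counts as consistent (1.0), otherwise
    as inconsistent (0.0); `alpha` sets how quickly trust reacts.
    """
    consistent = 1.0 if residual <= tolerance else 0.0
    return (1.0 - alpha) * prev_trust + alpha * consistent

def neighbor_weight(trust, exclude_below=0.3):
    """Down-weight a neighbor by its trust; exclude it below a floor."""
    return 0.0 if trust < exclude_below else trust

# Hypothetical case: the neighbor reports a position 50 m away,
# but the measured range keeps coming back as 80 m.
own = (0.0, 0.0, 10.0)
reported = (30.0, 40.0, 10.0)
trust = 1.0
for _ in range(6):
    trust = update_trust(trust, range_residual(own, reported, 80.0))
weight = neighbor_weight(trust)
```

Smoothing over time means a single noisy range measurement lowers a neighbor's weight only slightly, while persistent inconsistency eventually excludes it.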

2605.13493 2026-05-14 cs.CV

PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors

Jiaxin Yang, Yu Hou, Muxin Liu, Weixuan Liu, Ze Yuan, Zeming Chen, Zhongrui Wang, Xiaojuan Qi

Affiliations: Southern University of Science and Technology; The University of Hong Kong; East China Normal University

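AI summary: PhysEditBench is a protocol-conditioned benchmark for evaluating image editors on dense physical-map prediction, covering five targets: depth, normal, albedo, roughness, and metallic maps. It builds target-dependent datasets and defines fixed input-output protocols to keep evaluation standardized and reliable. Experiments show that although image editors can rival specialized models on some metrics, they still fall clearly short in terms of structural errors and sensitivity to lighting.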

Comments: 48 pages, 12 figures, including references, appendix, and supplementary benchmark details

Abstract

Can general-purpose image editors predict physical maps from a single RGB image? General-purpose image editors differ from standard task-specific dense-prediction models: they do not directly take an image and output a physical map. Instead, they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark to evaluate and standardize image editors in dense physical-map prediction that covers five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate. We use OpenRooms-FF for depth, surface normal, albedo, and roughness, InteriorVerse as an additional source for depth, normal, albedo, and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score, therefore, reflects the performance of a model under a specified protocol, rather than its best possible performance under all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, and stronger image editors can produce more reasonable map-like outputs. For roughness and metallic, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.

2605.13487 2026-05-14 cs.LG

Path-independent Flow Matching for Multi-parameter Generative Dynamics

Francisco Téllez, AmirHossein Zamani, Philippe Martin, Shuang Ni, Guy Wolf, Eugene Belilovsky, Sina Sanjari, Yanlei Zhang

Affiliations: Université de Montréal; Mila; Concordia University; Royal Military College of Canada; Queen's University

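AI summary: This paper proposes Path-independent Flow Matching (PiFM), a method for learning vector fields that yield path-independent transport between distributions over multi-parameter domains. By imposing structural constraints, the method ensures the consistency of composed transformations and, under suitable assumptions, approximates the Wasserstein barycenter, enabling distributional interpolation. Experiments show that PiFM outperforms existing methods on both synthetic and real-world data, in interpolating path-independent trajectories and in generating out-of-distribution samples.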

Comments: 12 pages including references for main part of the document, 26 pages in total when including the appendix. 15 figures in total

Abstract

Flow Matching is a powerful framework for learning transport maps between probability distributions. Yet its standard single-parameter formulation is not designed to capture multi-parameter variations where the resulting transport should be path-independent. Path independence is crucial because it ensures that transformations depend only on the initial and target distributions, not on the specific path. In this work, we introduce Path-independent Flow Matching (PiFM), a method for learning vector fields whose induced flows yield path-independent transport between distributions. We show that PiFM generalizes Flow Matching to higher-dimensional parameter domains while enforcing structural conditions that ensure consistency of composed transformations. In addition, we show that, under suitable assumptions, PiFM approximates the Wasserstein barycenter, linking the framework to a notion of distributional interpolation. To enable practical training, we propose a tractable, simulation-free objective that regresses onto multi-parameter conditional probability paths. We show empirically that PiFM outperforms other approaches on both synthetic and real-world data in interpolating path-independent trajectories and generating desired out-of-distribution samples.

2605.13486 2026-05-14 cs.CL

R^2-Mem: Reflective Experience for Memory Search

Xinyuan Wang, Wenyu Mao, Junkang Wu, Xiang Wang, Xiangnan He

Affiliations: University of Science and Technology of China

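AI summary: R²-Mem is a reflective experience framework for memory search systems that targets the tendency of existing deep search agents to repeat past mistakes. In an offline stage, an evaluator and a self-reflection learner extract experience from high- and low-quality search trajectories; during online inference, this experience guides future search actions, improving both search effectiveness and efficiency. Experiments show that R²-Mem outperforms existing methods on several metrics, markedly improving search performance and reducing resource consumption.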

Abstract

Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-management. However, existing deep search agents for memory systems repeat past erroneous behaviors because they fail to learn from prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During online inference, the retrieved experience guides future search actions to avoid repeated mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides an RL-free and low-cost solution for self-improving LLM agents.

2605.13485 2026-05-14 cs.LG cs.CL cs.IT math.IT

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

Affiliations: Department of Communications and Electronics; Institut Polytechnique de Paris

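AI summary: This paper studies how different representations (bytes, characters, and subwords) affect finite-context prediction in Transformer models. Through an analysis of Markov sources, the authors find that breaking symbols into smaller units (fragmentation) can degrade prediction even when the context window is enlarged, and that this effect is intrinsic to the representation. Conversely, greedy tokenization methods such as BPE can make a shorter token window equivalent to a longer source-context window, and the authors give a corresponding theoretical guarantee. The work provides an information-theoretic framework for understanding how representation choices affect Transformer performance.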

Comments: 30 pages, 9 figures. Preprint

Abstract

Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.

2605.13484 2026-05-14 cs.LG cs.AI stat.ME

Discovery of Hidden Miscalibration Regimes

Katarzyna Kobalczyk, Mihaela van der Schaar

Affiliations: University of Cambridge

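AI summary: This paper studies how a model's miscalibration varies across inputs, noting that conventional methods assess calibration from confidence alone and can therefore mask local calibration failures. The authors propose a method for discovering hidden miscalibration regimes without predefined data slices: it learns a calibration-aware representation of the input space and estimates local miscalibration by kernel smoothing. Experiments show that the method reveals calibration heterogeneity across inputs in large language models and markedly improves calibration in systematically miscalibrated regions.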

Abstract

Calibration is commonly evaluated by comparing model confidence with its empirical correctness, implicitly treating reliability as a function of the confidence score alone. However, this view can hide substantial structure: models may be systematically overconfident on some kinds of inputs and underconfident on others, causing global reliability diagnostics to obscure localised calibration failures. To address this, we formulate the problem of discovering hidden miscalibration regimes without assuming access to predefined data slices. We define the corresponding miscalibration field and propose a diagnostic framework for estimating it. Our approach learns a calibration-aware representation of the input space and estimates signed local miscalibration by kernel smoothing in the learned geometry. Across four real-world LLM benchmarks and twelve LLMs, we find that input-dependent calibration heterogeneity is prevalent. We further show that the discovered fields are actionable: they support local confidence correction and reduce calibration error in systematically miscalibrated regions where confidence-based methods such as isotonic regression and temperature scaling are less effective.
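
As a rough illustration of estimating signed local miscalibration by kernel smoothing, the sketch below averages the gap between confidence and correctness over nearby points. It makes several assumptions that are not from the paper: the embeddings are given rather than learned, the kernel is Gaussian with a hand-picked bandwidth, and the function name and data are hypothetical.

```python
import math

def local_miscalibration(query, points, confidences, correct, bandwidth=1.0):
    """Kernel-smoothed signed miscalibration at `query`.

    points:      list of embedding vectors (lists of floats)
    confidences: model confidence in [0, 1] for each point
    correct:     1 if the prediction was correct, else 0
    Returns a weighted mean of (confidence - correct): positive values
    indicate local overconfidence, negative values underconfidence.
    """
    weights = [
        math.exp(-math.dist(query, p) ** 2 / (2.0 * bandwidth ** 2))
        for p in points
    ]
    total = sum(weights)
    gaps = [c - y for c, y in zip(confidences, correct)]
    return sum(w * g for w, g in zip(weights, gaps)) / total

# Hypothetical 1-D embeddings forming two regions of the input space
points = [[0.0], [0.1], [5.0], [5.1]]
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [0, 1, 1, 1]

near_zero = local_miscalibration([0.0], points, confidences, correct, bandwidth=0.5)
near_five = local_miscalibration([5.0], points, confidences, correct, bandwidth=0.5)
```

In this toy data the region near 0 comes out overconfident and the region near 5 underconfident, a pattern that a single global reliability diagram would average together.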

2605.13481 2026-05-14 cs.CL

PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

Mikhail Menschikov, Matvey Iskornev, Alexander Kharitonov, Alina Bogdanova, Mikhail Belkin, Ekaterina Lisitsyna, Artyom Sosedka, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Evgeny Burnaev

Affiliations: Huawei, Moscow, Russia

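AI summary: This paper proposes PersonalAI 2.0 (PAI-2), a new framework that enhances large language model (LLM) based systems by integrating external knowledge graphs. It introduces a dynamic, multistage query processing pipeline that performs adaptive, iterative information search guided by extracted entities, matched graph vertices, and generated clue queries, improving the factual accuracy of generated answers. Experiments show that PAI-2 achieves higher precision and lower hallucination rates than existing methods on several benchmark datasets, with notable gains on specific tasks.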

Abstract

We introduce PersonalAI 2.0 (PAI-2), a novel framework designed to enhance large language model (LLM) based systems through the integration of external knowledge graphs (KG). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. The central feature of the PAI-2 design is its ability to perform adaptive, iterative information search, guided by extracted entities, matched graph vertices, and generated clue-queries. An evaluation over six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and DiaASQ) demonstrates improved factual correctness of generated answers compared to analogous methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves a 4% average gain by LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that graph traversal algorithms (e.g., BeamSearch, WaterCircles) yield superior results compared to a standard flat retriever, by 6% on average, while enabling the search plan enhancement mechanism gives an 18% boost over disabling it, as measured by LLM-as-a-Judge across six datasets. In addition, an ablation study reveals that PAI-2 achieves the SOTA result on the MINE-1 benchmark, with an 89% information-retention score, using LLMs from the 7-14B tiers. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications that require scalable, context-aware knowledge representation and reasoning capabilities.

2605.13476 2026-05-14 cs.CV

Neural Video Compression with Domain Transfer

Tiange Zhang, Rongqun Lin, Xiandong Meng, Haofeng Wang, Xing Tian, Qi Zhang, Siwei Ma

Affiliations: Shenzhen Graduate School, Peking University, Shenzhen, China; Pengcheng Laboratory, Shenzhen, China; School of Computer Science, Peking University, Beijing, China

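AI summary: This paper studies domain transfer in neural video coding, aiming to address the performance degradation caused by distribution differences between training and test data. It proposes DCVC-DT, an enhanced framework whose lightweight online domain transfer mechanism dynamically adjusts the encoded latent representation during inference, narrowing the domain gap without modifying encoder or decoder parameters. It also introduces a frame-level dynamic rate-distortion adjustment scheme to improve compression efficiency and reconstruction quality. Experiments show that, while maintaining video quality, the method achieves greater bitrate savings than the baseline and generalizes better to unseen test data.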

Comments: Accepted to ISCAS 2026 as an oral paper

Abstract

Content-adaptive compression has always been a key direction in neural video coding (NVC), aiming to mitigate the domain gap between training and testing data. Such gaps often arise from distributional discrepancies between training and inference data, which may cause noticeable performance degradation when the testing content differs from the training distribution. To tackle this challenge, we propose DCVC-DT, a domain transfer enhanced neural video compression framework. Specifically, we design a lightweight online domain transfer (DT) mechanism that dynamically adapts the encoded latent representation during inference, effectively bridging the domain gap without modifying the encoder or decoder parameters. In addition, we develop a frame-level dynamic RD (Rate and Distortion) adjustment scheme that actively regulates the ratio of R and D in the loss function based on quality fluctuation, thereby improving rate-distortion performance. Extensive experiments demonstrate that DCVC-DT achieves up to 6.21% bitrate savings over the baseline DCVC-DC, while significantly enhancing generalization to unseen testing data and alleviating error propagation. Our code is available at https://github.com/SunnyMass/DCVC-DT.

2605.13473 2026-05-14 cs.LG cs.CL

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Chenyu Zhou, Hongpei Li, Yuerou Liu, Jianghao Lin, Dongdong Ge, Yinyu Ye

Affiliations: Shanghai Jiao Tong University; Northwestern University; Huazhong University of Science and Technology; Stanford University

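AI summary: This paper proposes OSDN, a method that improves the Delta Rule in linear attention to strengthen in-context associative recall. OSDN introduces an online preconditioner that adapts the Delta Rule's step size per feature, better reflecting the curvature of the objective. While retaining DeltaNet's hardware-friendly parallel computation, the method achieves super-geometric convergence in theory and, in large-scale experiments, markedly improves recall performance and generalization.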

Abstract

Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
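
The stated equivalence between right-preconditioning and per-feature scaling of the write-side key can be checked numerically with a small sketch. The layout is an assumption (the state S maps a key k to a predicted value S k, with inner loss 0.5 * ||S k - v||^2), and the hypergradient update of the preconditioner and the chunkwise parallel form are not shown. All names and numbers are hypothetical.

```python
def matvec(S, k):
    """Matrix-vector product S k for a list-of-rows matrix."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]

def delta_rule_step(S, k, v, beta, write_key=None):
    """One delta-rule write: S <- S - beta * (S k - v) w^T.

    With w = k this is one gradient step on 0.5 * ||S k - v||^2.
    """
    w = k if write_key is None else write_key
    err = [p - t for p, t in zip(matvec(S, k), v)]  # S k - v
    return [
        [s_ij - beta * e_i * w_j for s_ij, w_j in zip(row, w)]
        for row, e_i in zip(S, err)
    ]

def preconditioned_step(S, k, v, beta, d):
    """Gradient step right-multiplied by diag(d):
    S <- S - beta * [(S k - v) k^T] diag(d)."""
    err = [p - t for p, t in zip(matvec(S, k), v)]
    return [
        [s_ij - beta * e_i * k_j * d_j for s_ij, k_j, d_j in zip(row, k, d)]
        for row, e_i in zip(S, err)
    ]

S0 = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 2.0]
v = [3.0, -1.0]
beta = 0.1
d = [0.5, 2.0]  # hypothetical diagonal preconditioner

a = preconditioned_step(S0, k, v, beta, d)
scaled_key = [d_j * k_j for d_j, k_j in zip(d, k)]
b = delta_rule_step(S0, k, v, beta, write_key=scaled_key)
```

Both updates give the same state because (S k - v) k^T diag(d) equals (S k - v) (diag(d) k)^T for a diagonal preconditioner, so only the write-side key changes and the read-side error term is untouched.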

2605.13470 2026-05-14 cs.LG

Twincher: Bijective Representation Learning for Robust Inversion of Continuous Systems

Arkady Gonoskov

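Affiliations * Department of Physics, University of Gothenburg

AI summary This paper proposes Twincher, a new class of architectures that enables stable inversion of continuous forward systems by learning robust representations in one-to-one correspondence with the input variables. The method builds on stacks of structured diffeomorphic transformations and tailored adversarial training strategies, keeping the representations stable under noise or model mismatch. Experiments show that Twincher efficiently learns bijective representations of synthetic systems and is more data-efficient and robust than a baseline inverse-modeling approach, providing initial evidence of its potential in robotics, vision, and physical AI.

Details
English abstract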

Recent advances in AI have been primarily driven by large-scale neural architectures that excel at function approximation, rather than by tailored inductive biases and inference or learning strategies that could be important for resource-efficient real-world perception and planning through the solution of inverse problems. In this work, we consider the possibility of enabling robust inversion of continuous forward processes $p \mapsto y$ by learning representations of $y$ that are bijectively aligned with $p$ while remaining insensitive to perturbations in $y$ caused by noise or model mismatch. We propose Twincher, a class of architectures based on stacks of structured diffeomorphic transformations and tailored adversarial training strategies that enable learning such bijective representations. We provide a public API for training and inference and empirically demonstrate the ability of the proposed architecture to efficiently learn bijective representations of synthetic systems, thereby enabling robust and efficient iterative inverse inference. Compared to a baseline inverse-modeling approach, the method exhibits improved data efficiency and robustness, providing initial evidence for the potential of bijective representation learning in robotics, vision, and physical AI.

2605.13467 2026-05-14 cs.CL

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

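Affiliations * Korea Advanced Institute of Science and Technology (KAIST); University of Illinois at Urbana-Champaign (UIUC); Microsoft Research Asia (MSRA)

AI summary This study addresses the weakness of conventional reinforcement-learning reward signals in vision-language reasoning by proposing the Perception-Decomposed Confidence Reward (PDCR) framework. PDCR introduces a Visual Dependence Score and applies a clustering algorithm to separate perception steps from reasoning steps, then normalizes rewards within each cluster, which avoids the signal distortion caused by a global reward. Experiments show that PDCR outperforms the global-reward and sparse-reward baselines on several vision-language reasoning benchmarks.

Comments CVPR 2026

Details
English abstract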

Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
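
The contrast between global and intra-cluster normalization can be illustrated with a toy calculation. This is a hedged sketch, not the PDCR implementation: the confidence gains and the cluster labels below are made up, the labels are assumed to be given (the paper derives them from its Visual Dependence Score and a clustering step), and z-score normalization is an assumed choice of normalizer.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Z-score a list of values (population std, eps guards against zero spread)."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / (sigma + eps) for x in values]

def decomposed_advantage(gains, clusters):
    """Normalize each step's confidence gain within its own skill cluster."""
    out = [0.0] * len(gains)
    for label in set(clusters):
        idx = [i for i, c in enumerate(clusters) if c == label]
        normalized = normalize([gains[j] for j in idx])
        for i, a in zip(idx, normalized):
            out[i] = a
    return out

# Hypothetical per-step confidence gains: two small perception steps,
# four larger reasoning steps.
gains = [0.02, 0.04, 0.50, 0.70, 0.90, 0.60]
clusters = ["perception", "perception",
            "reasoning", "reasoning", "reasoning", "reasoning"]

global_adv = normalize(gains)                      # both perception steps end up negative
local_adv = decomposed_advantage(gains, clusters)  # the better perception step is positive
```

Under the global normalizer both perception steps fall below the pooled mean, so the better one is still penalized; within its own cluster it receives a positive signal.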

2605.13465 2026-05-14 cs.CV

Z-Order Transformer for Feed-Forward Gaussian Splatting

Can Wang, Lei Liu, Wei Jiang, Dong Xu

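Affiliations * The University of Hong Kong; Futurewei Technologies Inc

AI summary This paper proposes a transformer-based feed-forward Gaussian Splatting method that addresses the slow, iterative optimization of traditional 3D Gaussian Splatting. A Z-order strategy organizes the unordered Gaussians into a spatially coherent sequence, and a sparse attention mechanism captures the spatial and semantic relationships among them, so that a single forward pass can model context, compress the Gaussian primitives, and predict their attributes. Experiments show fast, high-quality novel view synthesis with fewer Gaussian primitives.

Comments Accepted by CVPR 2026, Oral

Details
English abstract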

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
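
A Z-order (Morton) ordering of 3D points can be sketched as follows. This is an illustration of the general technique under assumptions of my own (10-bit quantization per axis, bounding-box normalization, the function names `morton3d` and `z_order`); the abstract does not specify how the paper quantizes or orders the Gaussian centers.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer coordinates x, y, z."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def z_order(points, bits=10):
    """Return the indices of `points` sorted along the Z-order curve."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    top = (1 << bits) - 1

    def quantize(p):
        # Map each coordinate into [0, top]; a zero-width axis maps to 0.
        return [int((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1.0) * top)
                for d in range(3)]

    return sorted(range(len(points)),
                  key=lambda i: morton3d(*quantize(points[i]), bits=bits))

# Hypothetical Gaussian centers.
centers = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
order = z_order(centers)
```

Points that are close in space tend to land close together in the resulting sequence, which is what makes windowed or sparse attention over the sequence plausible.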

2605.13464 2026-05-14 cs.LG

A Unified Three-Stage Machine Learning Framework for Diabetes Detection, Subtype Discrimination, and Cognitive-Metabolic Hypothesis Testing

Vishal Pandey, Ruzina Haque Laskar, Rishav Tewari

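Affiliations * Independent Researcher; B Center for Development of Telematics; Asansol Engineering College

AI summary This study proposes a unified three-stage machine learning framework for diabetes detection, subtype discrimination, and metabolic-cognitive association analysis. Stage 1 benchmarks several classifiers for diabetes prediction and identifies the key biomarkers; Stage 2 clusters confirmed diabetic cases into subtype-oriented partitions; Stage 3 uses a cognitive dataset to reveal a significant association between glycaemic control and cognitive function. The framework offers a statistically grounded, interpretable approach to reproducible diabetes analytics and subtype-aware analysis.

Comments 10 pages

Details
English abstract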

Diabetes mellitus affects over 537 million adults worldwide and remains a major challenge in preventive healthcare. Existing machine-learning studies primarily formulate diabetes prediction as a binary classification problem, while subtype-oriented analysis and glycaemic-cognitive associations remain comparatively underexplored. We present a reproducible three-stage machine learning framework for diabetes detection, subtype-oriented clustering, and metabolic-cognitive association analysis. In Stage 1, five supervised classifiers together with a stacking ensemble are benchmarked on the NCSU Diabetes Dataset using stratified five-fold cross-validation and evaluation metrics including ROC-AUC, balanced accuracy, recall, and F1-score. SVM-RBF and Logistic Regression achieve the highest ROC-AUC ($0.825 \pm 0.026$), while Random Forest achieves the highest accuracy ($0.762 \pm 0.030$). SHAP explainability identifies Glucose, BMI, and Age as the dominant predictive biomarkers. In Stage 2, silhouette-validated K-Means clustering ($k=2$, silhouette $\approx 0.116$) is applied to confirmed diabetic cases using Glucose, Insulin, and Age, recovering clinically plausible subtype-oriented partitions without requiring ground-truth subtype labels. In Stage 3, statistical analysis of the Ohio Longitudinal Cognitive Dataset ($n=373$) reveals a significant positive association between glycaemic control and cognitive function ($\rho_s = 0.208$, $p = 5.29 \times 10^{-5}$), which survives Holm correction. The findings support the utility of statistically grounded and interpretable ML pipelines for reproducible diabetes analytics and subtype-aware exploratory analysis.
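
For readers unfamiliar with the Holm correction mentioned in Stage 3, a minimal sketch of the step-down procedure follows. Only the first p-value comes from the abstract; the other two are hypothetical placeholders, since the abstract does not say how many tests were corrected together.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return a reject flag for each hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[i] > alpha / (m - rank):
            break  # once one test fails, all larger p-values are retained as well
        reject[i] = True
    return reject

# 5.29e-5 is the reported p-value; 0.2 and 0.03 are made-up companions.
flags = holm_reject([5.29e-5, 0.2, 0.03])
```

With these inputs the reported association is still rejected at the 0.05 level, while the hypothetical p = 0.03 is not, because it is compared against 0.05 / 2 rather than 0.05.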

2605.13462 2026-05-14 cs.LG

Efficient Sensor Fusion for Gesture Recognition on Resource-Constrained Devices

Pietro Bartoli, Christian Veronesi, Tommaso Bondini, Andrea Giudici, Franco Zappa

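Affiliations * EssilorLuxottica Smart Eyewear Lab; Politecnico di Milano

AI summary This paper proposes a lightweight, privacy-preserving gesture recognition system for resource-constrained wearables such as smart eyewear. It fuses data from a low-resolution Time-of-Flight (ToF) sensor and an infrared (IR) thermal sensor using a compact grouped-convolution CNN designed to run on a microcontroller. On a custom dataset the method reaches 92.3% accuracy and a macro F1-score of 0.93, with power consumption and inference latency suited to edge deployment.

Comments Accepted at the IEEE Sensors Applications Symposium (IEEE SAS) 2026

Details
English abstract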

Gesture recognition is a cornerstone of Human-Computer Interaction (HCI) for smart eyewear, enabling natural and device-free control in augmented reality environments. Traditional vision-based approaches face significant challenges regarding power consumption, computational latency, and user privacy. This paper proposes a lightweight, privacy-preserving gesture recognition system based on the fusion of low-resolution Time-of-Flight (ToF) and Infrared (IR) thermal sensors. We used an $8 \times 8$ multizone ToF sensor (VL53L8CH) and an $8 \times 8$ IR array (AMG8833) to capture complementary depth and thermal cues. A compact Convolutional Neural Network (CNN) with a specialized grouped-convolution architecture is designed to fuse these modalities efficiently on a microcontroller (MCU). Experimental results on a custom dataset of 7 static gestures, validated via k-fold cross-validation, demonstrate that the proposed fusion strategy significantly outperforms single-sensor baselines with an accuracy of 92.3% and a macro F1-score of 0.93. Finally, on-device benchmarks on STM32F4 and STM32H7 MCUs confirm the system's suitability for resource-constrained wearables, requiring only 6,343 parameters and achieving millisecond-level inference latency with a total system power of 50 mW.

2605.13457 2026-05-14 cs.CV

OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

Chengyan Deng, Pengbin Yu, Zhentao Chen, Wei Shen, Kai Zhang, Meng Li, Lunxi Yuan, Xue Zhou, Li Yu

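Affiliations * School of Automation Engineering, University of Electronic Science and Technology of China; OPPO AI Center, OPPO Inc.; School of Intelligence Science and Technology, Nanjing University

AI summary This paper proposes OP4KSR, a one-step, patch-free 4K super-resolution method that addresses the memory limits that diffusion-based real-world image super-resolution faces when generating 4K images directly. Built on the Flux backbone with the extreme-compression F16 VAE, it makes inference efficient under limited GPU resources while preserving global spatial-semantic coherence. To suppress the periodic artifacts this design introduces, the authors combine RoPE base frequency rescaling with an autocorrelation-based periodicity loss, and they curate a dedicated training dataset and three benchmarks to advance 4K super-resolution research.

Details
English abstract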

Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.

2605.13452 2026-05-14 cs.RO cs.AI

CUBic: Coordinated Unified Bimanual Perception and Control Framework

Xingyu Wang, Pengxiang Ding, Jingkai Xu, Donglin Wang, Zhaoxin Fan

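Affiliations * Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, School of Artificial Intelligence, Beihang University; Westlake University; Zhejiang University; Peking University

AI summary This paper proposes CUBic, a coordinated and unified framework for bimanual perception and control that addresses the tension between independent perception and inter-arm coordination when moving from single-arm to bimanual manipulation. Through unified perceptual modeling, it learns a shared tokenized representation in which independence and coordination emerge from structure rather than from hand-crafted coupling. Experiments on the RoboTwin benchmark show marked improvements in coordination accuracy and task success rate over existing methods.

Details
English abstract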

Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

2605.13451 2026-05-14 cs.CL

LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

Adam Remaki, Xavier Tannier, Christel Gérardin

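Affiliations * Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Limics; Service de médecine interne, Hôpital Tenon, Assistance Publique - Hôpitaux de Paris

AI summary This paper proposes LongBEL, a document-level generative framework for biomedical entity linking that addresses the inconsistent predictions existing systems can make when the same concept recurs within a document. LongBEL combines full-document context with a memory of previous predictions to improve consistency and accuracy. Experiments show that it outperforms sentence-level generative baselines on several biomedical datasets, with the largest gains where concepts recur.

Comments 9 pages, 2 figures

Details
English abstract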

Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.

2605.13450 2026-05-14 cs.AI cs.CL cs.HC

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Samuel Schapiro, Alexi Gladstone, Jonah Black, Heng Ji

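Affiliations * University of Illinois Urbana-Champaign

AI summary This paper studies how to assess the creativity of large language models, systematically evaluating existing creativity tests across three constructs: creative writing, divergent thinking, and scientific ideation. Existing tests prove to be limited predictors of model creativity, and none reliably predicts scientific ideation. The authors therefore introduce the Divergent Remote Association Test (DRAT), which assesses convergent and divergent thinking in a single test, significantly predicts scientific ideation ability, and is robust across major design choices.

Comments 36 pages. Extended version of work under review

Details
English abstract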

Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.

2605.13442 2026-05-14 cs.RO

Asymptotically Optimal Ergodic Coverage on Generalized Motion Fields

Christian Hughes, Yilang Liu, Yanis Lahrach, Julia Engdahl, Houston Warren, Darrick Lee, Fabio Ramos, Travis Miles, Ian Abraham

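Affiliations * Yale University; Rutgers University; University of Edinburgh; University of Sydney

AI summary This paper studies asymptotically optimal ergodic coverage in environments with dynamic flow fields, where methods that assume static environments cannot guarantee coverage quality. It proposes a flow-adaptive ergodic coverage method that uses maximum mean discrepancy (MMD) as the ergodic metric and incorporates the environment dynamics, enabling robust exploration planning in underactuated and open-loop settings. Experiments validate the method on diverse spatiotemporal processes, including ocean exploration and tracking human and cattle movement, and physical experiments on aerial and legged robots demonstrate coverage in non-convex, flow-restricted environments.

Comments 13 pages, 9 figures, 6 tables, Robotics: Science and Systems 2026

Details
English abstract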

Autonomous robotic exploration in remote and extreme environments allows scientists to model complex transport phenomena and collective behaviors described by continuously deforming flow fields. Although these environments are naturally modeled as time-varying domains, most adaptive exploration methods assume static environments and fail to provide adequate coverage or satisfy any formal guarantees. This is especially the case in oceanography where autonomous underwater systems (UxS) have highly restrictive compute and payload requirements that necessitate path planning methods that yield robust data collection strategies in open-loop and underactuated settings. In this work, to address the aforementioned issues, we propose to formulate adaptive search as an ergodic coverage problem and investigate certifying coverage in the ergodic sense over evolving domains with flow-induced dynamics. We expand upon recent work demonstrating maximum mean discrepancy (MMD) as a functional ergodic metric, and derive a flow-adaptive formulation that explicitly accounts for domain evolution within the coverage objective. We show that this approach preserves ergodic coverage guarantees in ambient flows and enables effective exploration in under-actuated, and even open-loop planning settings by integrating environment dynamics. Experiments validate that our method generalizes to diverse spatiotemporal processes including ocean exploration, and tracking human and cattle movement. Physical experiments on aerial and legged robotic platforms validate our ability to obtain ergodic coverage in non-convex, flow-restricted environments while respecting robot dynamics.
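
To make the role of MMD as a coverage metric concrete, the sketch below computes a biased estimate of squared MMD between trajectory samples and samples of a target distribution. It is a generic illustration under assumptions of my own (a Gaussian kernel with bandwidth 0.2, static 2D sample sets, made-up points); the paper's flow-adaptive formulation, which accounts for domain evolution, is not reproduced here.

```python
import math

def rbf(a, b, bandwidth=0.2):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=0.2):
    """Biased estimate of MMD^2 between sample sets xs and ys."""
    def mean_k(A, B):
        return sum(rbf(a, b, bandwidth) for a in A for b in B) / (len(A) * len(B))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Hypothetical target: a 3x3 grid of samples covering the unit square.
target = [(x / 2, y / 2) for x in range(3) for y in range(3)]

# Two hypothetical trajectories: one spread out, one stuck in a corner.
spread = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9), (0.1, 0.9), (0.9, 0.1)]
stuck = [(0.05, 0.05)] * 5

mmd_spread = mmd_squared(spread, target)
mmd_stuck = mmd_squared(stuck, target)
```

A trajectory whose visited points match the target distribution drives this quantity toward zero, so the spread-out trajectory scores lower than the one that stays in a corner.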

2605.13436 2026-05-14 cs.CL cs.LG

Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

Ruan Visser, Trienko Grobler, Marcel Dunaiski

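Affiliations * Department of Computer Science, Stellenbosch University

AI summary This paper investigates whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. Monolingual and bilingual BERT models are trained on downsampled subsets of several languages and evaluated on multiple benchmarks; the best results are typically obtained when stochastic tokenization is used during both pretraining and fine-tuning, and pretraining-time BPE dropout helps most when data is scarce. The experiments also suggest that stochastic tokenization during pretraining exposes the model more consistently to morphologically aligned segmentations, which may contribute to the downstream gains.

Comments 12 pages, 8 figures, 5 tables

Details
English abstract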

Subword regularization methods such as BPE dropout are typically applied only during fine-tuning, while pretraining is usually done with deterministic tokenization. This creates a potential segmentation mismatch between pretraining and fine-tuning. We investigate whether applying BPE dropout during pretraining improves downstream performance in low-resource NLP. We train monolingual and bilingual BERT models on downsampled subsets of English, German, French, Spanish, Kiswahili, and isiXhosa, and evaluate them on XNLI, PAWS-X, PAN-X, and MasakhaNER 2.0. Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings. This disadvantage diminishes as fine-tuning data increases, while the benefits of pretraining-time BPE dropout are largest when either pretraining or fine-tuning data is scarce. The benefits of BPE dropout are often attributed to better compositional representations, especially for rare words. To examine this, we measure morphological boundary alignment under BPE dropout and find only modest improvements in expected alignment, while better-aligned segmentations remain rare. This suggests that fine-tuning alone may provide limited exposure to such segmentations, whereas stochastic tokenization during pretraining exposes the model to them more consistently. We further show that selectively introducing morphologically aligned segmentations during fine-tuning improves performance mainly for models pretrained without BPE dropout. Overall, these findings suggest that exposure to better-aligned segmentations may contribute to the downstream benefits of applying BPE dropout during pretraining.
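
As a reminder of what BPE dropout does at the tokenizer level, here is a minimal sketch: at every merge step, each applicable merge is skipped with probability `dropout`, so the same word can be segmented differently on different passes. The merge table and the function name `bpe_encode` are hypothetical; this is not the tokenizer used in the study.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE merges, dropping each candidate merge with prob. `dropout`."""
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    tokens = list(word)
    while len(tokens) > 1:
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing applicable, or every applicable merge was dropped
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Hypothetical merge table, highest priority first.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

deterministic = bpe_encode("lower", merges, dropout=0.0)
characters = bpe_encode("lower", merges, dropout=1.0)
sampled = bpe_encode("lower", merges, dropout=0.3, rng=random.Random(0))
```

With `dropout=0.0` this reduces to ordinary deterministic BPE, and with `dropout=1.0` every merge is dropped and the word stays split into characters; intermediate values sample from the segmentations in between.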

2605.13435 2026-05-14 cs.LG cs.AI

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

JaeHyeok Doo, Byeongguk Jeon, Seonghyeon Ye, Kimin Lee, Minjoon Seo

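Affiliations * KAIST AI

AI summary This paper proposes Q-Flow, a reinforcement learning framework that aims to exploit the high expressive capacity of flow-based policies while addressing their optimization instability during value maximization. It leverages the deterministic nature of flow dynamics to propagate terminal trajectory value to intermediate latent states, enabling stable policy optimization without unrolling the numerical solver. In offline learning on OGBench, Q-Flow outperforms state-of-the-art baselines by an average of 10.6 percentage points, and it supports stable online adaptation within the same framework.

Comments 27 pages

Details
English abstract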

There is growing interest in utilizing flow-based models as decision-making policies in reinforcement learning due to their high expressive capacity. However, effectively leveraging this expressivity for value maximization remains challenging, as naive gradient-based optimization requires backpropagating through numerical solvers and often leads to instability. Existing approaches typically address this issue by restricting the expressive capacity of flow-based policies, resulting in a trade-off between optimization stability and representational flexibility. To resolve this, we introduce Q-Flow, a framework that leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow. This formulation enables stable policy optimization using intermediate value gradients without unrolling the numerical solver, effectively bridging the gap between stability and expressivity. We evaluate Q-Flow in the offline learning setting on the challenging OGBench suite, where it consistently outperforms state-of-the-art baselines by an average of 10.6 percentage points, while also enabling stable online adaptation within the same framework.

2605.13431 2026-05-14 cs.SD

Text2Score: Generating Sheet Music From Textual Prompts

Keshav Bhandari, Sungkyun Chang, Abhinaba Roy, Francesca Ronchini, Emmanouil Benetos, Dorien Herremans, Simon Colton

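Affiliations * Queen Mary University of London; Singapore University of Technology and Design; Politecnico di Milano; EmotionWave

AI summary This paper proposes Text2Score, a two-stage framework for generating sheet music from natural language prompts, motivated by the scarcity of aligned text-music data and the unreliability of automated captioning. By deriving supervision directly from symbolic XML data, it bypasses noisy or scarce text-music pairs. In the planning stage an LLM produces a structured measure-wise plan; in the execution stage a generative model produces ABC notation conditioned on that plan. Text2Score outperforms the compared baselines on dimensions such as playability and readability, and the dataset, code, and evaluation set are open-sourced.

Comments 8 pages including references, 1 figure

Details
English abstract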

Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).

2605.13429 2026-05-14 cs.CL

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

Chong Li, Yingzhuo Deng, Wen Yang, Jiajun Zhang, Chengqing Zong

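Affiliations * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI summary This paper proposes TokAlign++, a method that improves vocabulary adaptation for large language models by learning a better token alignment lexicon. It treats the source and target vocabularies as two different languages, learns a bilingual token alignment lexicon from monolingual token representations, rearranges the model parameters accordingly for the new vocabulary, and then fine-tunes progressively. Experiments on 15 languages show improved multilingual text compression rates, recovery of the vanilla model's performance in as few as 1k steps, and effective support for token-level distillation.

Comments Paper under review

Details
English abstract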

Tokenization is a foundational step in the text processing of Large Language Models (LLMs). Texts must first be tokenized into token IDs, which are then input to LLMs. Inefficient tokenization results in long token-ID sequences and will slow down the training and inference of LLMs. The fine-grained knowledge transfer between LLMs, like token-level distillation, is also impeded by the mismatch in vocabulary. To bridge this gap, we introduce a method named TokAlign++ to improve vocabulary adaptation performance by learning a better token alignment lexicon. The source and target vocabularies are taken as two different languages, and the bilingual token alignment lexicon is learned from monolingual token representations. Model parameters are rearranged following this bilingual lexicon for new vocabulary, and progressively fine-tuned for adaptation. Experimental results on 15 languages show that our method boosts the multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It costs as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.