arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2083
2605.13554 2026-05-14 cs.LG cs.AI

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

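AI summary: This paper proposes Contrastive Proximal Policy Optimisation (CPPO), a contrastive-learning-based policy optimisation algorithm for self-supervised reinforcement learning that needs no hand-crafted reward function. The method learns Q-values by contrasting state-action and goal representations and optimises these contrastive Q-values directly on-policy, giving end-to-end self-supervised training. Experiments show that, across single-agent and cooperative multi-agent tasks with continuous and discrete action spaces, CPPO significantly outperforms existing contrastive RL methods and, on most tasks, matches the performance of PPO trained with hand-crafted dense rewards.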

Abstract

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.

2605.13551 2026-05-14 cs.LG

Mixed neural posterior estimation for simulators with discrete and continuous parameters

Jan Boelts, Cornelius Schröder, Jonas Beck, Jakob H. Macke, Michael Deistler, Daniel Gedon

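AI summary: This paper studies neural posterior estimation in mixed parameter spaces containing both discrete and continuous parameters. The authors propose an inference network that handles discrete and continuous parameters jointly: it factorizes the joint posterior into discrete and continuous parts and jointly trains an autoregressive classifier with a generative model, extending conventional NPE. Experiments show that the method yields accurate, well-calibrated posteriors on several tractable examples and real-world scientific simulators, and it has been integrated into the sbi Python toolkit.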

Abstract

Neural Posterior Estimation (NPE) enables rapid parameter inference for complex simulators with intractable likelihoods. NPE trains an inference network to estimate a probability density over parameters given data, typically assumed to be continuous. However, many scientific models involve parameter spaces that are mixed, that is, they contain both discrete and continuous dimensions. We address this limitation by extending NPE to mixed parameter spaces through an inference network that jointly handles discrete and continuous parameters. The inference network factorizes the joint posterior into discrete and continuous components, combining an autoregressive classifier for the discrete parameters with a generative model for the continuous parameters, trained jointly under a single simulation-based objective. In addition, we propose a diagnostic tool to assess the calibration of the mixed posterior approximation. Across tractable toy examples and real-world scientific simulators, our joint inference approach yields accurate and calibrated posteriors. The inference framework is available in the sbi Python package.
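
One plausible way to write down the factorization this abstract describes is sketched below. The symbols (θ_d for the K discrete parameters, θ_c for the continuous ones, x for the data, q_φ for the classifier, q_ψ for the generative model) and the ordering, with discrete parameters first and continuous ones conditioned on them, are assumptions made for illustration and are not taken from the paper.

```latex
% Assumed factorization: autoregressive classifier over discrete parameters,
% then a conditional density over the continuous parameters.
\[
q(\theta_d, \theta_c \mid x)
  = \left[\prod_{k=1}^{K} q_\phi\!\left(\theta_{d,k} \mid \theta_{d,<k},\, x\right)\right]
    q_\psi\!\left(\theta_c \mid \theta_d,\, x\right)
\]
% A single simulation-based objective: negative log-likelihood of simulated pairs.
\[
\mathcal{L}(\phi, \psi)
  = -\,\mathbb{E}_{\theta \sim p(\theta),\; x \sim p(x \mid \theta)}
    \left[\sum_{k=1}^{K} \log q_\phi\!\left(\theta_{d,k} \mid \theta_{d,<k},\, x\right)
          + \log q_\psi\!\left(\theta_c \mid \theta_d,\, x\right)\right]
\]
```

Under this reading, both networks are trained from the same simulated (θ, x) pairs, and sampling draws θ_d first and then θ_c given θ_d.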

2605.13544 2026-05-14 cs.CV

CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

Hanwen Zhang, Yao Liu, Die Dai, Jiaye Yang, Qiao Liu, Yutong Xie, Peng Wang

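AI summary: This paper proposes CA-GCL, a cross-anatomy global-local contrastive learning framework that aims to make 3D medical image understanding more robust. It introduces a global contrastive objective that strengthens the separation of anatomical categories in the latent space, combined with a clinical-aware text augmentation strategy to cope with incomplete descriptions. Experiments show that CA-GCL outperforms existing methods on zero-shot abnormality detection, generalizes well across datasets, and markedly improves the model's stability under prompt variation.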

Abstract

Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

2605.13542 2026-05-14 cs.AI cs.CL cs.LG cs.MA

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

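AI summary: This paper presents RealICU, a new benchmark built from real intensive care unit (ICU) clinical data for evaluating large language models on complex, long-horizon medical decision-making tasks. Senior physicians annotate complete patient trajectories in hindsight, and the benchmark defines four tasks tied to clinical decision-making. It reveals a recall-safety trade-off in the medical recommendations of existing LLMs and an over-reliance on early patient information. RealICU offers a reliable testbed for studying and improving AI decision-support systems in high-stakes medical settings.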

Abstract

Intensive care units (ICUs) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/

2605.13540 2026-05-14 cs.LG cs.AI

Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

Haonan Yuan, Qingyun Sun, Junhua Shi, Xingcheng Fu, Jianxin Li, Philip S. Yu

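AI summary: Dynamic graphs are widespread in real-world systems, and building generalizable dynamic graph foundation models is an important frontier in graph learning. To address the unified-modeling challenge posed by inconsistent semantic and temporal patterns across multi-domain dynamic graphs, this paper proposes DyGFM, a multi-domain dynamic graph foundation model based on decoupled and divergence-conditioned prompting. The model uses a dual-branch pre-training strategy with semantic-temporal decoupling to separate transferable semantics from domain-specific dynamics, and introduces a divergence-aware cross-domain routing mechanism and a prompt generator, which mitigate negative transfer and make downstream fine-tuning more efficient. Experiments show that DyGFM significantly outperforms 12 state-of-the-art baselines on multiple dynamic graph benchmarks.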

Abstract

Dynamic graphs are ubiquitous in real-world systems, and building generalizable dynamic Graph Foundation Models has become a frontier in graph learning. However, dynamic graphs from different domains pose fundamental challenges to unified modeling, as their semantic and temporal patterns are inherently inconsistent, making the multi-domain pre-training difficult. Consequently, the widely used "pretrain-then-finetune" paradigm often suffers from severe negative knowledge transfer. To the best of our knowledge, there exists no multi-domain dynamic GFM. In this work, we propose DyGFM, a Dynamic Graph Foundation Model over multiple domains based on decoupled and divergence-conditioned prompting. To disentangle transferable semantics from the domain-specific dynamics, we introduce a dual-branch pre-training strategy with semantic-temporal decoupling. To alleviate negative transfer during domain adaptation, we further develop a cross-domain routing mechanism with divergence-aware expert selection. To enable efficient downstream fine-tuning, we design a divergence-conditioned prompt generator that injects lightweight, learnable graph prompts tailored to semantic and temporal traits. Extensive experiments on continuous dynamic graph benchmarks demonstrate that DyGFM consistently outperforms 12 state-of-the-art baselines on both node classification and link prediction tasks, achieving superior effectiveness and efficiency.

2605.13539 2026-05-14 cs.RO cs.SE

Integration of an Agent Model into an Open Simulation Architecture for Scenario-Based Testing of Automated Vehicles

Christian Geller, Daniel Becker, Jobst Beckmann, Lutz Eckstein

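AI summary: This paper studies how to integrate an agent model into an open simulation architecture to support scenario-based testing of automated vehicles. To address the poor interoperability of agent models across simulation environments, the authors propose a modular integration architecture based on standardized interfaces, using the OSI and FMI standards for tool-independent model exchange. Applying the architecture in three mainstream simulation platforms demonstrates its generality and consistency, and it provides a reusable reference implementation for safety testing of automated driving systems.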

Journal ref
at - Automatisierungstechnik - 2026 - Band 74, Heft 5 - Special Issue: AI for automated driving
Abstract

Simulative and scenario-based testing are crucial methods in the safety assurance for automated driving systems. To ensure that simulation results are reliable, the real world must be modeled with sufficient fidelity, including not only the static environment but also the surrounding traffic of a vehicle under test. Thus, the availability of traffic agent models is of common interest to model naturalistic and parameterizable behavior, similar to human drivers. The interchangeability of agent models across different simulation environments represents a major challenge and necessitates harmonization and standardization. To address this challenge, we present a standardized and modular simulation integration architecture that enables the tool-independent integration of traffic agent models. The architecture builds upon the Open Simulation Interface (OSI) as a structured message format and the Functional Mock-up Interface (FMI) for dynamic model exchange. Rather than introducing yet another model or simulation tool, we provide a reusable reference implementation that translates these standards into a practical integration blueprint, including clear interfaces, data mappings, and execution semantics. The generic nature of the architecture is demonstrated by integrating an exemplary agent model into three widely used simulation environments: OpenPASS, CARLA, and CarMaker. As part of the evaluation, we show that the model yields consistent behavior in all simulation platforms, thereby validating the interoperability, modularity, and standard compliance of the proposed architecture. The reference implementation lowers integration barriers, serves as a foundation for future research, and is made publicly available at github.com/ika-rwth-aachen/agent-model-integration

2605.13538 2026-05-14 cs.CL cs.AI

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

Anuj Sadani, Deepak Kumar

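AI summary: This paper studies how to keep small language models from copying demonstration examples verbatim when substituting personally identifiable information (PII) on-device. The authors propose a locale-conditioned few-shot prompting method that combines a classifier with a generative model to produce context-appropriate, type-consistent fake values. Experiments show that the method effectively reduces regurgitation of demonstration content, but on a downstream named entity recognition task the generated surrogate text lacks diversity, which hurts model performance.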

Comments 15 pages

Abstract

Personally Identifiable Information (PII) redaction usually replaces detected entities with placeholder tokens such as [PERSON], destroying the downstream utility of the redacted text for retrieval and Named Entity Recognition (NER) training. We propose a fully on-device pipeline that substitutes PII with consistent, type-preserving fake values: a 1.5 B mixture-of-experts token classifier (openai/privacy-filter) detects spans, a 1-bit Bonsai-1.7B Small Language Model (SLM) proposes contextual surrogates for names, addresses, and dates, and a rule-based generator (faker) handles patterned fields. We report a prompting finding more important than the quantization choice: with naive fixed three-shot demonstrations, the 1-bit SLM regurgitates demonstration outputs verbatim regardless of input; 1.58-bit Ternary-Bonsai-1.7B reproduces byte-identical failures, ruling out quantization as the cause. We fix this with locale-conditioned rotating few-shot demonstrations: a character-range heuristic picks a locale-pure pool and a per-input MD5 hash samples three demonstrations. With the fix, 482/482 unique Bonsai-1.7B calls succeed (no echoes) and produce locale-correct surrogates, although the SLM still copies from a small same-locale demonstration pool - a residual narrowness we quantify. On a 2000-document multilingual corpus, hybrid perplexity (PPL) beats faker in all six locales under a multilingual evaluator (XGLM-564M); length preservation is best-of-three in 4 of 6 locales. On downstream NER (400 train / 100 test, English), redact yields F1=0.000, faker 0.656, original 0.960; on a matched 160/40 subset including hybrid, faker (0.506) outperforms hybrid (0.346) at p < 0.001. We report this as an honest negative finding: SLM surrogates produce more natural text but a less varied training distribution, and downstream NER benefits more from variety than from naturalness.
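
The locale-conditioned rotating selection described in this abstract could look roughly like the following sketch. The pool contents, the two locale names, the single CJK character range, and the helper names `detect_locale` and `pick_demonstrations` are all hypothetical; the abstract only states that a character-range heuristic picks a locale-pure pool and that a per-input MD5 hash samples three demonstrations.

```python
import hashlib

# Hypothetical demonstration pools (original value, surrogate value), keyed by locale.
DEMO_POOLS = {
    "latin": [("John Smith", "Alan Reed"), ("Mary Jones", "Ruth Cole"),
              ("Paul Adams", "Neil Ford"), ("Anna Bell", "Jane Hart")],
    "cjk": [("王伟", "李强"), ("张敏", "刘洋"), ("陈静", "杨磊"), ("赵丽", "周涛")],
}


def detect_locale(text: str) -> str:
    """Character-range heuristic: any CJK Unified Ideograph selects the 'cjk' pool."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "cjk"
    return "latin"


def pick_demonstrations(text: str, k: int = 3):
    """Deterministically pick k distinct demonstrations from the locale-pure pool."""
    pool = DEMO_POOLS[detect_locale(text)]
    # The MD5 digest of the input acts as a per-input seed (not used for security).
    seed = int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")
    remaining = list(pool)
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Consume the seed digit by digit in a mixed radix to index the shrinking pool.
        seed, idx = divmod(seed, len(remaining))
        chosen.append(remaining.pop(idx))
    return chosen
```

Because the selection depends only on the input text, the same input always receives the same three demonstrations, while different inputs rotate through the pool.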

2605.13537 2026-05-14 cs.LG cs.AI cs.CL

Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

Ye Wang, Jing Liu, Toshiaki Koike-Akino

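AI summary: This paper studies how inference-time alignment can mitigate reward hacking in reinforcement learning. The authors propose a new alignment technique that adjusts the temperature of the reference model, generalizing inference-time alignment to combinations of multiple generative reward models in what they call a sharpened logarithmic opinion pool (SLOP). The method improves robustness while preserving alignment performance, offering an effective way to adapt continually to evolving reward targets.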

Abstract

Inference-time alignment techniques offer a lightweight alternative or complement to costly reinforcement learning, while enabling continual adaptation as alignment objectives and reward targets evolve. Existing theoretical analyses justify these methods as approximations to sampling from distributions optimally tilted toward a given reward model. We extend these techniques by introducing reference-model temperature adjustment, which leads to further generalization of inference-time alignment to ensembles of generative reward models combined as a sharpened logarithmic opinion pool (SLOP). To mitigate reward hacking, we propose an algorithm for calibrating SLOP weight parameters and experimentally demonstrate that it improves robustness while preserving alignment performance.
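
As a rough illustration of what a logarithmic opinion pool with a tempered reference model might compute, the sketch below combines log-probabilities over a small, finite candidate set. The function name `slop`, the exact form (reference log-probabilities divided by a temperature, plus a weighted sum of reward-model log-probabilities, then renormalized), and the restriction to a finite candidate set are assumptions for illustration; the paper's formulation and its weight-calibration algorithm may differ.

```python
import math


def slop(ref_logprobs, reward_logprobs, weights, ref_temperature=1.0):
    """Pool log-probabilities over a finite candidate set and renormalize.

    ref_logprobs: log-probabilities of each candidate under the reference model.
    reward_logprobs: one list of candidate log-probabilities per reward model.
    weights: one non-negative weight per reward model (assumed pool exponents).
    """
    scores = []
    for j in range(len(ref_logprobs)):
        s = ref_logprobs[j] / ref_temperature  # tempered reference term
        for w, lp in zip(weights, reward_logprobs):
            s += w * lp[j]  # weighted log-opinion of each reward model
        scores.append(s)
    # Log-sum-exp normalization for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - log_z) for s in scores]
```

In this sketch, raising a weight or lowering the reference temperature sharpens the pooled distribution toward the candidates that the corresponding model prefers.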

2605.13536 2026-05-14 cs.LG cs.AI

HLS-Seek: QoR-Aware Code Generation for High-Level Synthesis via Proxy Comparative Reward Reinforcement Learning

Qingyun Zou, Feng Yu, Hongshi Tan, Yao Chen, Bingsheng He, WengFai Wong

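AI summary: HLS-Seek is a QoR-aware code generation framework based on proxy comparative reward reinforcement learning that aims to improve the Quality of Results (QoR) of code in high-level synthesis (HLS), including latency and resource utilization. It performs reinforcement learning on relative comparisons rather than absolute synthesis results, which significantly lowers training cost, and introduces an uncertainty-aware Monte Carlo dropout mechanism to prevent reward hacking, yielding a self-improving reward system. Experiments show that HLS-Seek outperforms existing models in syntax and functional correctness, trains more efficiently, and achieves superior QoR on multiple benchmarks.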

Abstract

High-Level Synthesis (HLS) compiles algorithmic C/C++ descriptions into hardware, with Quality of Results (QoR), i.e. latency and resource utilization, critically governed by pragma configurations and code structure. Existing LLM-based HLS approaches train for functional correctness but ignore QoR entirely. We observe that reinforcement learning (RL) for HLS does not require absolute synthesis results, only relative comparisons between candidates. Based on this insight, we propose HLS-Seek, a QoR-aware NL-to-HLS framework that replaces expensive synthesis-in-the-loop RL with a comparative proxy reward model achieving 99.53% Pareto-dominance accuracy. To prevent reward hacking, we introduce uncertainty-aware Monte Carlo (MC) dropout switching that selectively invokes real Vitis HLS synthesis for low-confidence candidates and updates the proxy online, creating a self-improving reward system. HLS-Seek achieves 81.5% syntax correctness pass@1 and 81.4% Func@5 on HLS-eval with only 7B parameters, surpassing GPT-5.1 and other frontier models while achieving 8.5× faster training than real-reward RL. On QoR evaluation, HLS-Seek achieves the lowest latency on 16/30 kernels and Pareto-dominates HLS-specific baselines on 9 kernels.
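
Two ideas in this abstract, Pareto dominance between candidates and switching to real synthesis when the proxy is uncertain, can be sketched as follows. The function names, the use of the standard deviation across stochastic proxy predictions as the uncertainty measure, and the threshold value are assumptions for illustration only; the abstract does not specify these details.

```python
from statistics import pstdev


def pareto_dominates(a, b):
    """True if candidate a is no worse than b on every metric and strictly
    better on at least one. Metrics are tuples where lower is better,
    e.g. (latency_cycles, resource_units)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def choose_reward_source(proxy_samples, std_threshold=0.1):
    """Decide whether to trust the proxy or fall back to real synthesis.

    proxy_samples: proxy predictions for the same candidate pair obtained from
    repeated forward passes with dropout left active (MC dropout).
    """
    return "proxy" if pstdev(proxy_samples) <= std_threshold else "synthesis"
```

For example, `pareto_dominates((10, 200), (12, 200))` holds because the first candidate has lower latency at equal resource use, whereas neither of `(10, 300)` and `(12, 200)` dominates the other.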

2605.13534 2026-05-14 cs.AI

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

Jiabei Liu, Wenyu Mao, Junfei Tan, Chunxu Shen, Lingling Yi, Jiancan Wu, Xiang Wang

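AI summary: This paper proposes MultiSearch, a reinforcement-learning-based framework for improving retrieval-augmented reasoning. At each reasoning step the method generates queries from multiple perspectives and retrieves information in parallel, widening information coverage, and it explicitly consolidates and refines the retrieved results during merging, improving the signal-to-noise ratio (SNR) and reasoning accuracy. Experiments show that MultiSearch outperforms existing methods on multiple benchmarks and significantly improves reasoning performance on question-answering tasks.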

Abstract

Deep search agents have proven effective in enhancing LLMs by retrieving external knowledge during multi-step reasoning. However, existing methods often generate a single query for retrieval at each reasoning step, limiting information coverage and introducing high noise. This may result in low signal-to-noise ratios (SNR) during search, degrading reasoning accuracy and leading to unnecessary reasoning steps. In this paper, we introduce MultiSearch, an RL-based framework that addresses these limitations through multi-query retrieval and explicit merging of retrieved information. At each reasoning step, MultiSearch generates queries from multiple perspectives and retrieves external information in parallel, expanding the scope of relevant information and mitigating the reliance on any single retrieval result. Then, the agent consolidates and refines retrieved information at the merging process, improving the SNR and ensuring more accurate reasoning. Additionally, we propose a reinforcement learning framework with a multi-process reward design to optimize agents for both multi-query retrieval and information consolidation. Extensive experiments on seven benchmarks demonstrate that MultiSearch outperforms baseline methods, enhancing the SNR of retrieval and improving reasoning performance in question-answering tasks.

2605.13532 2026-05-14 cs.AI cs.CL cs.CY cs.HC

AI-Generated Slides: Are They Good? Can Students Tell?

Juho Leinonen, Lisa Zhang, Arto Hellas

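AI summary: This paper studies how well generative AI (GenAI) works for producing teaching slides, focusing on how instructors and students perceive AI-generated slides. Comparing slides generated by several AI tools with human-made slides, the study finds that AI-generated slides perform well in accuracy and pedagogical quality, that students struggle to tell AI-generated slides from human-made ones, and that students tend to attribute slides they rate highly to a human author. The results indicate considerable potential for GenAI in instructional design, while calling for further work on responsible and effective use.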

Comments 7 pages, 2 tables. Accepted to Western Canada Conference on Computing Education (WCCCE) 2026

Abstract

As generative AI (GenAI) tools become easily accessible, there is promise in using such tools to support instructors. To that end, this paper examines using GenAI to help generate slides from instructor-authored course notes, emphasizing instructor and student perceptions. We examine an end-to-end education tool (NotebookLM), two general-purpose LLMs (Claude, M365 Copilot), and two coding assistants (Cursor, Claude Code). We first analyze whether GenAI-generated slides are "good" via narrative assessment by educators. We choose the best slides to use (with some modification) in a real course setting, and compare the student perception of human vs. AI-generated slides. We find that coding assistant tools produce slides that were most accurate, complete, and pedagogically sound. Additionally, students rate GenAI slides to be of similar quality as instructor-created slides, and cannot reliably identify which slides are AI-generated. We also find a negative correlation between a high quality rating and a high "AI-generated" rating, suggesting students associate poor quality with the source of the slides being AI. These findings highlight promising opportunities for integrating GenAI into instructional design workflows and call for further research on how educators can best harness such tools responsibly and effectively.

2605.13530 2026-05-14 cs.CV cs.AI

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Jincai Huang, Shihao Zou, Yuchen Guo, Jingjing Li, Wei Ji, Kai Wang, Shanshan Wang, Weixin Si

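AI summary: This paper proposes SurgMLLM, a unified surgical scene understanding framework that combines high-level semantic reasoning with low-level visual grounding, addressing the semantic inconsistency that arises when existing methods treat these components in isolation. It fine-tunes a multimodal large language model to jointly model surgical phases, instrument-verb-target triplets, and the corresponding segmentation regions, and achieves precise pixel-level grounding through temporal aggregation and a segmentation network. Experiments show that SurgMLLM delivers significant gains on both triplet recognition and segmentation, validating unified reasoning and grounding for surgical assistance.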

Abstract

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, extending CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.

2605.13500 2026-05-14 cs.RO cs.CR

Uncertainty-Aware 3D Position Refinement for Multi-UAV Systems

Hosam Alamleh, Damir Pulatov

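AI summary: This paper studies robust 3D position refinement in multi-UAV systems. To counter the degradation of local position estimates under GNSS interference, non-line-of-sight reception, and similar conditions, it proposes a decentralized, lightweight position-refinement method. The method fuses each UAV's own local estimate with state information shared by neighbors and with inter-UAV range constraints, performing uncertainty-aware neighbor fusion that improves localization robustness. Experiments show that the method reduces localization error both during cold start and in the presence of malicious nodes, suggesting good prospects for practical use.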

Abstract

Reliable real-time 3D localization is essential for multi-UAV navigation, collision avoidance, and coordinated flight, yet onboard estimates can degrade under GNSS multipath, non-line-of-sight reception, vertical drift, and intentional interference. This paper presents a decentralized, lightweight 3D position-refinement layer that improves robustness by fusing each Unmanned Aerial Vehicle (UAV)'s local estimate with neighbor-shared state summaries and inter-UAV range or proximity constraints. The method performs uncertainty-aware neighborhood fusion by weighting each UAV's prior according to its reported covariance and weighting neighbor constraints according to link quality, ranging uncertainty, and a learned trust score. To support practical deployment, the framework explicitly handles cold start and temporary localization loss by inflating or substituting weak priors, allowing trusted neighborhood constraints to bootstrap and stabilize estimates until absolute sensing recovers. To mitigate the impact of faulty or malicious participants, each UAV applies a local range-consistency check, smoothed over time, to down-weight or exclude neighbors whose reported positions are incompatible with observed inter-UAV distances. Simulation experiments with 10 UAVs in a 3D volume show that the proposed refinement substantially reduces mean localization error during cold start, remains competitive after local estimators stabilize, and maintains lower error as the fraction of malicious nodes increases compared with fusion without trust. These results suggest that the approach can serve as a practical resilience layer for swarm operation in challenging environments.
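
The weighting scheme described in this abstract can be illustrated with a single fusion step. This is a simplified sketch under stated assumptions, not the paper's algorithm: the function name `refine_position`, the scalar (isotropic) variances in place of full covariances, the projection of each range measurement along the current bearing to the neighbor, and the multiplicative use of the trust score are all choices made here for illustration.

```python
import math


def refine_position(prior, prior_var, neighbors):
    """One uncertainty-weighted fusion step for a single UAV.

    prior: (x, y, z) local position estimate.
    prior_var: scalar variance of the prior; inflate it during cold start.
    neighbors: list of (reported_pos, measured_range, range_var, trust),
    with trust in [0, 1]; trust 0 excludes the neighbor.
    """
    weight_sum = 1.0 / prior_var
    acc = [c * weight_sum for c in prior]
    for pos, rng, rng_var, trust in neighbors:
        diff = [p - q for p, q in zip(prior, pos)]
        dist = math.sqrt(sum(d * d for d in diff))
        if dist == 0.0 or trust <= 0.0:
            continue  # bearing undefined, or neighbor excluded by trust
        # Point at the measured range from the neighbor, along the current bearing.
        implied = [q + d / dist * rng for q, d in zip(pos, diff)]
        w = trust / rng_var  # inverse-variance weight scaled by trust
        weight_sum += w
        acc = [a + w * c for a, c in zip(acc, implied)]
    return tuple(a / weight_sum for a in acc)
```

With a confident prior the result stays near the local estimate; with an inflated prior variance the trusted range constraints dominate, which mirrors the cold-start behavior the abstract describes.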

2605.13493 2026-05-14 cs.CV

PhysEditBench: A Protocol-Conditioned Benchmark for Dense Physical-Map Prediction with Image Editors

Jiaxin Yang, Yu Hou, Muxin Liu, Weixuan Liu, Ze Yuan, Zeming Chen, Zhongrui Wang, Xiaojuan Qi

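AI summary: PhysEditBench is a protocol-conditioned benchmark for evaluating how well image editors predict dense physical maps, covering five targets: depth, normal, albedo, roughness, and metallic. It builds target-dependent datasets and defines fixed input-output protocols to keep evaluation standardized and reliable. Experiments show that although image editors can rival specialized models on some metrics, they still show clear weaknesses in structural errors and sensitivity to lighting.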

Comments 48 pages, 12 figures, including references, appendix, and supplementary benchmark details

Abstract

Can general-purpose image editors predict physical maps from a single RGB image? General-purpose image editors differ from standard task-specific dense-prediction models: they do not directly take an image and output a physical map. Instead, they must be guided by prompts, examples, or image-based textual cues. To this end, we introduce PhysEditBench, a novel protocol-conditioned benchmark to evaluate and standardize image editors in dense physical-map prediction that covers five targets: depth, normal, albedo, roughness, and metallic maps. For evaluation data, we build a target-dependent benchmark substrate. We use OpenRooms-FF for depth, surface normal, albedo, and roughness, InteriorVerse as an additional source for depth, normal, albedo, and a new procedurally generated source for metallic maps. We curate the data with quality checks, valid-region masks, scene-level sampling, and lighting-based stress subsets to ensure reliable and diverse evaluation. For each target, PhysEditBench defines a fixed protocol that specifies the allowed input, expected output format, and scoring procedure. Each score, therefore, reflects the performance of a model under a specified protocol, rather than its best possible performance under all prompts or interaction modes. Experimental results show that specialized models remain much stronger on depth, normal, and albedo, and stronger image editors can produce more reasonable map-like outputs. For roughness and metallic, image editors can match or outperform specialized baselines on some scalar metrics, but they still suffer from structural errors, sparsity effects, and sensitivity to lighting.

2605.13487 2026-05-14 cs.LG

Path-independent Flow Matching for Multi-parameter Generative Dynamics

Francisco Téllez, AmirHossein Zamani, Philippe Martin, Shuang Ni, Guy Wolf, Eugene Belilovsky, Sina Sanjari, Yanlei Zhang

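AI summary: This paper proposes Path-independent Flow Matching (PiFM), a method for learning vector fields that produce path-independent transport between distributions over multi-parameter domains. By imposing structural constraints, the method ensures that composed transformations are consistent and, under suitable assumptions, approximates the Wasserstein barycenter, enabling distributional interpolation. Experiments show that PiFM outperforms existing methods on synthetic and real data, both in generating path-independent trajectories and in producing out-of-distribution samples.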

Comments 12 pages including references for main part of the document, 26 pages in total when including the appendix. 15 figures in total

Abstract

Flow Matching is a powerful framework for learning transport maps between probability distributions. Yet its standard single-parameter formulation is not designed to capture multi-parameter variations where the resulting transport should be path-independent. Path independence is crucial because it ensures that transformations depend only on the initial and target distributions, not on the specific path. In this work, we introduce Path-independent Flow Matching (PiFM), a method for learning vector fields whose induced flows yield path-independent transport between distributions. We show that PiFM generalizes Flow Matching to higher-dimensional parameter domains while enforcing structural conditions that ensure consistency of composed transformations. In addition, we show that, under suitable assumptions, PiFM approximates the Wasserstein barycenter, linking the framework to a notion of distributional interpolation. To enable practical training, we propose a tractable, simulation-free objective that regresses onto multi-parameter conditional probability paths. We showcase empirically that PiFM outperforms other approaches on both synthetic and real world data in interpolating path-independent trajectories and generating desired out of distribution samples.

2605.13486 2026-05-14 cs.CL

R^2-Mem: Reflective Experience for Memory Search

Xinyuan Wang, Wenyu Mao, Junkang Wu, Xiang Wang, Xiangnan He

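AI summary: R²-Mem is a reflective experience framework for memory search systems that addresses the tendency of existing deep search agents to repeat past mistakes. In an offline stage, an evaluator and a self-reflection learner distill experience from high- and low-quality search trajectories; during online inference, this experience guides future search actions, improving search effectiveness and efficiency. Experiments show that R²-Mem outperforms existing methods on multiple metrics, significantly improving search performance while reducing resource consumption.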

Abstract

Deep search has recently emerged as a promising paradigm for enabling agents to retrieve fine-grained historical information without heavy memory pre-management. However, existing deep search agents for memory systems repeat past error behaviors because they fail to learn from prior high- and low-quality search trajectories. To address this limitation, we propose R^2-Mem, a reflective experience framework for memory search systems. In the offline stage, a Rubric-guided Evaluator scores low- and high-quality steps in historical trajectories, and a self-Reflection Learner distills the corresponding abstract experience. During online inference, the retrieved experience guides future search actions to avoid repeated mistakes and maintain high-quality behaviors. Extensive experiments demonstrate that R^2-Mem consistently improves both effectiveness and efficiency over strong baselines, improving F1 scores by up to 22.6%, while reducing token consumption by 12.9% and search iterations by 20.2%. These results verify that R^2-Mem provides an RL-free and low-cost solution for self-improving LLM agents.

2605.13485 2026-05-14 cs.LG cs.CL cs.IT math.IT

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

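AI summary: This paper studies how different representations in Transformer models (such as bytes, characters, and subwords) affect finite-context prediction performance. Analyzing Markov sources, the authors find that breaking symbols into smaller units (fragmentation) can degrade prediction even when the context window is enlarged, and that this effect is intrinsic to the representation. Conversely, greedy tokenization methods such as BPE can make a shorter token window equivalent to a longer source-context window, and the authors give a corresponding theoretical guarantee. The study provides an information-theoretic framework for understanding how representation choices affect Transformer performance.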

Comments 30 pages, 9 figures. Preprint

Abstract

Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same information to the model. This raises a basic question: how does the choice of representation change what a finite-context predictor can achieve? We study this question on Markov sources and uncover two complementary phenomena. First, we observe that moving to smaller representation units can hurt prediction even when the context window is enlarged to cover the relevant source history. To explain this, we introduce fragmentation: a lossless recoding that replaces each source symbol by several smaller units. We prove that fragmentation can strictly increase the optimal finite-context log-loss, showing that the gap is not merely an optimization or capacity issue, but can be intrinsic to the representation. This gives a theoretical account of the finite-context gap observed in byte- and character-level models such as ByT5 and CANINE relative to subword-tokenized models. Second, we study the opposite direction: greedy tokenization -- BPE, WordPiece, and related methods -- which groups source symbols into larger units. We show that tokenization can make a short token window behave like a longer source-context window, and we give a loss guarantee describing when this is achievable. The guarantee depends on how reliably token windows span the needed source history, together with the compression rate of the tokenizer. This also yields a simple diagnostic for real tokenizers: measuring how much source context a fixed token window reliably contains. Together, the two directions establish a finite-context information-theoretic framework for reasoning about representation choices in Transformers.
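
The diagnostic mentioned near the end of this abstract, measuring how much source context a fixed token window reliably contains, might be computed along the following lines. The function names, the use of per-token source lengths as input, and the lower-quantile definition of "reliably" are assumptions for illustration; the paper's exact diagnostic may be defined differently.

```python
def source_span_per_window(token_lengths, window):
    """Number of source symbols covered by each run of `window` consecutive tokens.

    token_lengths: length of each token in source symbols (e.g. characters).
    """
    if len(token_lengths) < window:
        return []
    running = sum(token_lengths[:window])
    spans = [running]
    for i in range(window, len(token_lengths)):
        # Slide the window: add the new token, drop the oldest one.
        running += token_lengths[i] - token_lengths[i - window]
        spans.append(running)
    return spans


def reliable_context(token_lengths, window, quantile=0.05):
    """Source span that roughly (1 - quantile) of the windows reach or exceed."""
    spans = sorted(source_span_per_window(token_lengths, window))
    if not spans:
        return 0
    return spans[int(quantile * len(spans))]
```

For a byte- or character-level representation every token has length 1, so a window of k tokens always covers exactly k source symbols, whereas a subword tokenizer with longer tokens covers more source history in the same window.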

2605.13484 2026-05-14 cs.LG cs.AI stat.ME

Discovery of Hidden Miscalibration Regimes

Katarzyna Kobalczyk, Mihaela van der Schaar

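AI Summary This paper studies calibration error that varies across inputs, noting that conventional methods assess calibration from confidence alone and can therefore mask localized calibration failures. The authors propose a method for discovering hidden miscalibration that requires no predefined data slices: it learns a calibration-aware representation of the input space and estimates local miscalibration by kernel smoothing. Experiments show that the method reveals calibration heterogeneity across inputs in large language models and substantially improves calibration in systematically miscalibrated regions.

Details
English abstract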

Calibration is commonly evaluated by comparing model confidence with its empirical correctness, implicitly treating reliability as a function of the confidence score alone. However, this view can hide substantial structure: models may be systematically overconfident on some kinds of inputs and underconfident on others, causing global reliability diagnostics to obscure localised calibration failures. To address this, we formulate the problem of discovering hidden miscalibration regimes without assuming access to predefined data slices. We define the corresponding miscalibration field and propose a diagnostic framework for estimating it. Our approach learns a calibration-aware representation of the input space and estimates signed local miscalibration by kernel smoothing in the learned geometry. Across four real-world LLM benchmarks and twelve LLMs, we find that input-dependent calibration heterogeneity is prevalent. We further show that the discovered fields are actionable: they support local confidence correction and reduce calibration error in systematically miscalibrated regions where confidence-based methods such as isotonic regression and temperature scaling are less effective.
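
As a rough illustration of estimating signed local miscalibration by kernel smoothing, the sketch below computes a Nadaraya-Watson average of (confidence minus correctness) around a query point. It assumes the inputs have already been embedded; the learned calibration-aware representation is not reproduced, and the Gaussian kernel, the bandwidth, and the sign convention are assumptions.

```python
import math


def local_miscalibration(query, points, confidences, correct, bandwidth=1.0):
    """Kernel-weighted mean of (confidence - correctness) near `query`.

    Positive values suggest local overconfidence, negative values underconfidence.
    """
    num = 0.0
    den = 0.0
    for z, c, y in zip(points, confidences, correct):
        sq_dist = sum((a - b) ** 2 for a, b in zip(query, z))
        w = math.exp(-sq_dist / (2.0 * bandwidth ** 2))  # Gaussian kernel (assumed)
        num += w * (c - y)
        den += w
    if den == 0.0:
        raise ValueError("no sample carries weight near the query; increase bandwidth")
    return num / den


# Toy data: one region where confidence 0.9 is right half the time,
# another where confidence 0.6 is always right.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
confidences = [0.9, 0.9, 0.6, 0.6]
correct = [0, 1, 1, 1]

over = local_miscalibration((0.0, 0.0), points, confidences, correct, bandwidth=0.5)
under = local_miscalibration((5.0, 5.0), points, confidences, correct, bandwidth=0.5)
print(over, under)
```

The two regions get opposite signs even though a single global reliability curve would mix them together.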

2605.13481 2026-05-14 cs.CL

PersonalAI 2.0: Enhancing knowledge graph traversal/retrieval with planning mechanism for Personalized LLM Agents

Mikhail Menschikov, Matvey Iskornev, Alexander Kharitonov, Alina Bogdanova, Mikhail Belkin, Ekaterina Lisitsyna, Artyom Sosedka, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Evgeny Burnaev

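AI Summary This paper proposes PersonalAI 2.0 (PAI-2), a new framework that enhances large language model (LLM) based systems by integrating external knowledge graphs. The method introduces a dynamic, multistage query processing pipeline that performs adaptive, iterative information search guided by extracted entities, matched graph vertices, and generated clue-queries, improving the factual accuracy of generated answers. Experiments show that PAI-2 achieves higher precision and lower hallucination rates than existing methods on several benchmark datasets, with notable gains on specific tasks.

Details
English abstract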

We introduce PersonalAI 2.0 (PAI-2), a novel framework designed to enhance large language model (LLM) based systems through the integration of external knowledge graphs (KGs). The proposed approach addresses key limitations of existing Graph Retrieval-Augmented Generation (GraphRAG) methods by incorporating a dynamic, multistage query processing pipeline. Central to the PAI-2 design is its ability to perform adaptive, iterative information search guided by extracted entities, matched graph vertices, and generated clue-queries. Evaluation on six benchmarks (Natural Questions, TriviaQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and DiaASQ) demonstrates improved factual correctness of generated answers compared with analogous methods (LightRAG, RAPTOR, and HippoRAG 2). PAI-2 achieves a 4% average gain by LLM-as-a-Judge across four benchmarks, reflecting its effectiveness in reducing hallucination rates and increasing precision. We show that graph traversal algorithms (e.g. BeamSearch, WaterCircles) outperform a standard flat retriever by 6% on average, while enabling the search-plan enhancement mechanism yields an 18% boost over disabling it, as measured by LLM-as-a-Judge across six datasets. In addition, an ablation study reveals that PAI-2 achieves the SOTA result on the MINE-1 benchmark, with an 89% information-retention score, using LLMs from the 7-14B tiers. Collectively, these findings underscore the potential of PAI-2 to serve as a foundational model for next-generation personalized AI applications that require scalable, context-aware knowledge representation and reasoning capabilities.

2605.13476 2026-05-14 cs.CV

Neural Video Compression with Domain Transfer

Tiange Zhang, Rongqun Lin, Xiandong Meng, Haofeng Wang, Xing Tian, Qi Zhang, Siwei Ma

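AI Summary This paper studies domain transfer in neural video coding, aiming to address the performance degradation caused by distribution differences between training and test data. It proposes DCVC-DT, an enhanced framework whose lightweight online domain transfer mechanism dynamically adapts the encoded latent representation during inference, narrowing the domain gap without modifying encoder or decoder parameters. It also introduces a frame-level dynamic rate-distortion adjustment scheme that improves compression efficiency and reconstruction quality. Experiments show that the method achieves greater bitrate savings than the baseline while preserving video quality, and generalizes better to unseen test data.

Comments Accepted to ISCAS 2026 as an oral paper

Details
English abstract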

Content-adaptive compression has always been a key direction in neural video coding (NVC), aiming to mitigate the domain gap between training and testing data. Such gaps often arise from distributional discrepancies between training and inference data, which may cause noticeable performance degradation when the testing content differs from the training distribution. To tackle this challenge, we propose DCVC-DT, a domain transfer enhanced neural video compression framework. Specifically, we design a lightweight online domain transfer (DT) mechanism that dynamically adapts the encoded latent representation during inference, effectively bridging the domain gap without modifying the encoder or decoder parameters. In addition, we develop a frame-level dynamic RD (Rate and Distortion) adjustment scheme that actively regulates the ratio of R and D in the loss function based on quality fluctuation, thereby improving rate-distortion performance. Extensive experiments demonstrate that DCVC-DT achieves up to 6.21% bitrate savings over the baseline DCVC-DC, while significantly enhancing generalization to unseen testing data and alleviating error propagation. Our code is available at https://github.com/SunnyMass/DCVC-DT.

2605.13473 2026-05-14 cs.LG cs.CL

OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Chenyu Zhou, Hongpei Li, Yuerou Liu, Jianghao Lin, Dongdong Ge, Yinyu Ye

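AI Summary This paper proposes OSDN, a method that improves the Delta Rule in linear attention to boost performance on in-context associative recall tasks. OSDN introduces an online preconditioner that adapts the Delta Rule's step size per feature, so that it better reflects the curvature of the objective. While retaining DeltaNet's hardware-friendly parallel computation, the method achieves super-geometric convergence in theory and, in large-scale experiments, substantially improves recall while maintaining parity on general downstream tasks.

Details
English abstract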

Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish super-geometric convergence against a right-Newton comparator and prove an algorithm-aligned token-local residual contraction bound. To handle non-stationary contexts, we further introduce Adaptive Preconditioner Forgetting (APF) to dynamically refresh stale calibration. Empirically, OSDN demonstrates strong performance across scales. At the 340M-parameter scale, OSDN improves JRT-style in-context recall by 32% over DeltaNet. Scaling to 1.3B parameters, it achieves a 39% reduction in the recall residual ratio while maintaining parity on general downstream tasks (e.g., perplexity and LongBench) -- demonstrating that our online-preconditioning mechanism effectively transfers and amplifies at the billion-parameter scale.
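
The equivalence the abstract relies on can be checked numerically. The sketch below writes the delta rule as one gradient step on the inner regression loss, S <- S - beta (S k - v) k^T, and shows that right-multiplying that gradient by a diagonal matrix diag(p) gives the same state as using the scaled key p * k on the write side. The scale vector `p` is hypothetical, and the hypergradient update of the preconditioner and Adaptive Preconditioner Forgetting are not modelled.

```python
def matvec(S, k):
    """Return S @ k for a matrix stored as a list of rows."""
    return [sum(s * x for s, x in zip(row, k)) for row in S]


def delta_write(S, k, v, beta, write_key=None):
    """One delta-rule write: S <- S - beta * (S k - v) w^T, where w defaults to k."""
    w = k if write_key is None else write_key
    err = [pred - target for pred, target in zip(matvec(S, k), v)]
    return [[s - beta * e * w_j for s, w_j in zip(row, w)]
            for row, e in zip(S, err)]


def right_preconditioned_write(S, k, v, beta, p):
    """Same step with the gradient (S k - v) k^T right-multiplied by diag(p)."""
    err = [pred - target for pred, target in zip(matvec(S, k), v)]
    return [[s - beta * (e * k_j) * p_j for s, k_j, p_j in zip(row, k, p)]
            for row, e in zip(S, err)]


S = [[0.5, -1.0], [2.0, 0.0]]
k = [1.0, 2.0]
v = [1.0, -1.0]
p = [1.0, 0.25]  # hypothetical per-feature scales
scaled_key = [p_j * k_j for p_j, k_j in zip(p, k)]

a = right_preconditioned_write(S, k, v, beta=0.5, p=p)
b = delta_write(S, k, v, beta=0.5, write_key=scaled_key)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
print(max_diff)
```

Because the preconditioner only rescales the write-side key, the state keeps its original shape, which is consistent with the abstract's remark that no high-dimensional state overhead is incurred.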

2605.13470 2026-05-14 cs.LG

Twincher: Bijective Representation Learning for Robust Inversion of Continuous Systems

Arkady Gonoskov

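AI Summary This paper proposes Twincher, a new architecture that enables stable inversion of continuous forward systems by learning robust representations in one-to-one correspondence with the input variables. Built on stacks of structured diffeomorphic transformations and tailored adversarial training strategies, the method keeps the representation stable in the presence of noise or model mismatch. Experiments show that Twincher efficiently learns bijective representations of synthetic systems and offers better data efficiency and robustness than a conventional inverse-modeling approach, indicating its potential for robotics, vision, and physical AI.

Details
English abstract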

Recent advances in AI have been primarily driven by large-scale neural architectures that excel at function approximation, rather than by tailored inductive biases and inference or learning strategies that could be important for resource-efficient real-world perception and planning through the solution of inverse problems. In this work, we consider the possibility of enabling robust inversion of continuous forward processes $p \mapsto y$ by learning representations of $y$ that are bijectively aligned with $p$ while remaining insensitive to perturbations in $y$ caused by noise or model mismatch. We propose Twincher, a class of architectures based on stacks of structured diffeomorphic transformations and tailored adversarial training strategies that enable learning such bijective representations. We provide a public API for training and inference and empirically demonstrate the ability of the proposed architecture to efficiently learn bijective representations of synthetic systems, thereby enabling robust and efficient iterative inverse inference. Compared to a baseline inverse-modeling approach, the method exhibits improved data efficiency and robustness, providing initial evidence for the potential of bijective representation learning in robotics, vision, and physical AI.

2605.13467 2026-05-14 cs.CL

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

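AI Summary This study addresses the inadequacy of conventional reinforcement learning reward signals for vision-language reasoning and proposes the Perception-Decomposed Confidence Reward (PDCR) framework. PDCR decouples visual and textual steps by introducing a Visual Dependence Score and applying a clustering algorithm to separate perception from reasoning, then normalizes rewards within each skill cluster, which resolves the signal distortion caused by a global reward. Experiments show that PDCR outperforms conventional global-reward and sparse-reward methods on several vision-language reasoning benchmarks.

Comments CVPR 2026

Details
English abstract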

Reinforcement Learning with Verifiable Rewards (RLVR) traditionally relies on a sparse, outcome-based signal. Recent work shows that providing a fine-grained, model-intrinsic signal (rewarding the confidence growth in the ground-truth answer) effectively improves language reasoning training by providing step-level guidance without costly external models. While effective for unimodal text, we find that naively applying this global reward to vision-language (V-L) reasoning is a suboptimal strategy, as the task is a heterogeneous mix of sparse visual perception and dense textual reasoning. This global normalization creates mixture-induced signal degradation, where the training signal for visual steps is statistically distorted by the predominant textual steps. We propose Perception-Decomposed Confidence Reward (PDCR), a framework that solves this by aligning the reward structure with the task's heterogeneous nature. PDCR first performs an unsupervised skill decomposition, introducing a model-internal Visual Dependence Score to quantify visual reliance and applying a clustering algorithm to separate perception and reasoning steps. Based on this, PDCR computes a decomposed advantage by normalizing confidence gains within each skill cluster. This intra-cluster normalization provides a stable, correctly-scaled signal for both perception and reasoning. We demonstrate that PDCR outperforms the naive, global-reward formulation and sparse-reward baselines on key V-L reasoning benchmarks.
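
The intra-cluster normalization step can be sketched as a per-group z-score of the step-level confidence gains. The cluster labels are taken as given here; the Visual Dependence Score and the clustering that would produce them are not reproduced, and the `eps` term and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def decomposed_advantages(gains, clusters, eps=1e-8):
    """Normalize each confidence gain by the mean and std of its own skill cluster."""
    groups = {}
    for g, c in zip(gains, clusters):
        groups.setdefault(c, []).append(g)
    stats = {c: (mean(vals), pstdev(vals)) for c, vals in groups.items()}
    return [(g - stats[c][0]) / (stats[c][1] + eps)
            for g, c in zip(gains, clusters)]


# Hypothetical gains: perception steps move confidence far less than reasoning steps.
gains = [0.01, 0.03, 0.4, 0.6, 0.5]
clusters = ["perception", "perception", "reasoning", "reasoning", "reasoning"]
adv = decomposed_advantages(gains, clusters)
print(adv)
```

Under a single global normalization the two perception steps would both sit far below the mean; normalized within their own cluster they receive advantages of roughly -1 and +1.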

2605.13465 2026-05-14 cs.CV

Z-Order Transformer for Feed-Forward Gaussian Splatting

Can Wang, Lei Liu, Wei Jiang, Dong Xu

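AI Summary This paper proposes a Transformer-based feed-forward Gaussian Splatting method that addresses the limited real-time capability of conventional 3D Gaussian Splatting. A Z-order strategy organizes the unordered Gaussians into a spatially coherent sequence and, combined with a sparse attention mechanism, captures the spatial and semantic relationships among them, so that context modeling, compression of the Gaussian primitives, and attribute prediction happen in a single forward pass. Experiments show that the method substantially speeds up novel view synthesis while preserving rendering quality.

Comments Accepted by CVPR 2026, Oral

Details
English abstract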

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in photorealistic novel view synthesis. However, traditional 3DGS relies on a slow, iterative optimization process, which limits its use in scenarios demanding real-time results. To overcome this bottleneck, recent feed-forward methods aim to predict Gaussian attributes directly from images, but they often struggle with the redundancy of Gaussian primitives and rendering quality. In this work, we introduce a transformer-based architecture specifically designed for feed-forward Gaussian Splatting. Our key insight is that spatial and semantic relationships among Gaussians can be effectively captured through a sparse attention mechanism, enabled by a Z-order strategy that organizes the unstructured Gaussian set into a spatially coherent sequence. Furthermore, we incorporate this Z-order strategy to adaptively suppress redundancy while preserving critical structural details. This allows the transformer to efficiently model context, compress Gaussian primitives, and predict Gaussian attributes in a single forward pass. Comprehensive experiments demonstrate that our method achieves fast and high-quality novel view synthesis with fewer Gaussian primitives.
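
A Z-order (Morton) ordering of 3D points can be sketched as follows: quantize each coordinate to a grid, interleave the bits of the three cell indices, and sort by the resulting code. The 10-bit grid and the bounding-box quantization are assumptions; the abstract does not say how the paper quantizes Gaussian centers or builds its sparse attention pattern.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three integers into one Morton code."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order_sort(points, bits=10):
    """Sort 3D points along a Z-order curve over their bounding box."""
    lo = [min(p[d] for p in points) for d in range(3)]
    hi = [max(p[d] for p in points) for d in range(3)]
    scale = (1 << bits) - 1

    def key(p):
        cells = []
        for d in range(3):
            span = hi[d] - lo[d]
            t = 0.0 if span == 0 else (p[d] - lo[d]) / span
            cells.append(int(t * scale))
        return morton3d(cells[0], cells[1], cells[2], bits=bits)

    return sorted(points, key=key)


centers = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
ordered = z_order_sort(centers)
print(ordered)
```

Points that are close in space tend to land close together in the sorted sequence, which is the property that lets a windowed or sparse attention pattern over the sequence approximate spatial neighborhoods.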

2605.13464 2026-05-14 cs.LG

A Unified Three-Stage Machine Learning Framework for Diabetes Detection, Subtype Discrimination, and Cognitive-Metabolic Hypothesis Testing

Vishal Pandey, Ruzina Haque Laskar, Rishav Tewari

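AI Summary This study proposes a unified three-stage machine learning framework for diabetes detection, subtype discrimination, and metabolic-cognitive association analysis. Stage 1 predicts diabetes with several classifiers and identifies key biomarkers; Stage 2 partitions confirmed patients into subtypes by clustering; Stage 3 uses cognitive data to reveal a significant association between glycaemic control and cognitive function. The framework offers a statistically rigorous and interpretable approach to reproducible diabetes analysis and subtype research.

Comments 10 pages

Details
English abstract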

Diabetes mellitus affects over 537 million adults worldwide and remains a major challenge in preventive healthcare. Existing machine-learning studies primarily formulate diabetes prediction as a binary classification problem, while subtype-oriented analysis and glycaemic-cognitive associations remain comparatively underexplored. We present a reproducible three-stage machine learning framework for diabetes detection, subtype-oriented clustering, and metabolic-cognitive association analysis. In Stage 1, five supervised classifiers together with a stacking ensemble are benchmarked on the NCSU Diabetes Dataset using stratified five-fold cross-validation and evaluation metrics including ROC-AUC, balanced accuracy, recall, and F1-score. SVM-RBF and Logistic Regression achieve the highest ROC-AUC ($0.825 \pm 0.026$), while Random Forest achieves the highest accuracy ($0.762 \pm 0.030$). SHAP explainability identifies Glucose, BMI, and Age as the dominant predictive biomarkers. In Stage 2, silhouette-validated K-Means clustering ($k=2$, silhouette $\approx 0.116$) is applied to confirmed diabetic cases using Glucose, Insulin, and Age, recovering clinically plausible subtype-oriented partitions without requiring ground-truth subtype labels. In Stage 3, statistical analysis of the Ohio Longitudinal Cognitive Dataset ($n=373$) reveals a significant positive association between glycaemic control and cognitive function ($\rho_s = 0.208$, $p = 5.29 \times 10^{-5}$), which survives Holm correction. The findings support the utility of statistically grounded and interpretable ML pipelines for reproducible diabetes analytics and subtype-aware exploratory analysis.
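
For readers unfamiliar with the Holm correction mentioned in Stage 3, a minimal sketch of the step-down procedure follows. Only the first p-value comes from the abstract; the other two and the family size of three are hypothetical, since the abstract does not state how many hypotheses were tested.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: return a reject flag for each hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# 5.29e-5 is the reported p-value; 0.03 and 0.2 are made-up companions.
p_values = [5.29e-5, 0.03, 0.2]
decisions = holm_reject(p_values)
print(decisions)
```

With these numbers the reported association is rejected at the 0.05/3 threshold, while the hypothetical p = 0.03 fails the 0.05/2 threshold and stops the procedure.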

2605.13462 2026-05-14 cs.LG

Efficient Sensor Fusion for Gesture Recognition on Resource-Constrained Devices

Pietro Bartoli, Christian Veronesi, Tommaso Bondini, Andrea Giudici, Franco Zappa

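AI Summary This paper proposes a lightweight, privacy-preserving gesture recognition system for resource-constrained wearables such as smart eyewear. The system fuses data from a low-resolution Time-of-Flight (ToF) sensor and an Infrared (IR) thermal sensor using a grouped-convolution convolutional neural network (CNN) designed for microcontrollers, achieving efficient multimodal fusion. Experiments on a custom dataset show an accuracy of 92.3% and a macro F1-score of 0.93, together with low power consumption and inference latency, making the method suitable for edge computing.

Comments Accepted at the IEEE Sensors Applications Symposium (IEEE SAS) 2026

Details
English abstract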

Gesture recognition is a cornerstone of Human-Computer Interaction (HCI) for smart eyewear, enabling natural and device-free control in augmented reality environments. Traditional vision-based approaches face significant challenges regarding power consumption, computational latency, and user privacy. This paper proposes a lightweight, privacy-preserving gesture recognition system based on the fusion of low-resolution Time-of-Flight (ToF) and Infrared (IR) thermal sensors. We used an 8 × 8 multizone ToF sensor (VL53L8CH) and an 8 × 8 IR array (AMG8833) to capture complementary depth and thermal cues. A compact Convolutional Neural Network (CNN) with a specialized grouped-convolution architecture is designed to fuse these modalities efficiently on a microcontroller (MCU). Experimental results on a custom dataset of 7 static gestures, validated via k-fold cross-validation, demonstrate that the proposed fusion strategy significantly outperforms single-sensor baselines with an accuracy of 92.3% and a macro F1-score of 0.93. Finally, on-device benchmarks on STM32F4 and STM32H7 MCUs confirm the system's suitability for resource-constrained wearables, requiring only 6,343 parameters and achieving millisecond-level inference latency with a total system power of 50 mW.

2605.13457 2026-05-14 cs.CV

OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

Chengyan Deng, Pengbin Yu, Zhentao Chen, Wei Shen, Kai Zhang, Meng Li, Lunxi Yuan, Xue Zhou, Li Yu

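AI Summary This paper proposes OP4KSR, a one-step patch-free 4K super-resolution method that addresses the GPU memory limits diffusion-based real-world image super-resolution faces when generating 4K images directly. Built on the Flux backbone and the extreme-compression F16 VAE, it enables efficient inference under limited GPU resources while preserving global spatial-semantic coherence. To counter the periodic artifacts this design introduces, the authors propose a suppression strategy that combines RoPE frequency rescaling with an autocorrelation-based periodicity loss, and they build a dedicated training dataset and three benchmarks to advance 4K super-resolution research.

Details
English abstract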

Diffusion-based real-world image super-resolution (Real-ISR) has achieved remarkable perceptual quality; however, directly super-resolving images to 4K remains limited by extreme memory consumption. Consequently, prior methods adopt patch-based inference, sacrificing global context and introducing semantic confusion, spatial inconsistency, and severe latency. We propose OP4KSR, a one-step patch-free 4K SR approach built upon the powerful Flux backbone. By leveraging the extreme-compression F16 VAE, OP4KSR makes 4K SR inference tractable under practical GPU budgets, preserving global spatial-semantic coherence while enabling highly efficient inference. However, adapting this one-step architecture intrinsically triggers severe periodic artifacts. We trace this to a RoPE base frequency allocation mismatch and intra-token spatial ambiguity, both exacerbated by the lack of iterative refinement. To suppress these artifacts, we couple RoPE base frequency rescaling (RFR) with an autocorrelation-based periodicity loss ($\mathcal{L}_\text{AP}$). Furthermore, we curate a dedicated training dataset alongside three benchmarks (one synthetic and two real-world) to advance 4K SR research. Extensive experiments demonstrate that OP4KSR achieves competitive perceptual quality with efficient inference, generating a $4096\times4096$ output in only 5.75 seconds on a single NVIDIA H20 GPU.
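
To make the notion of a RoPE base frequency concrete, the sketch below computes the standard rotary inverse frequencies, base ** (-2i / dim), and the resulting wavelengths for two base values. It only illustrates how the base controls the frequency allocation; the actual rescaling rule used by RFR is not given in the abstract, and the dimension and base values here are arbitrary.

```python
import math


def rope_inv_freq(dim, base=10000.0):
    """Standard RoPE inverse frequencies: base ** (-2i / dim) for i in range(dim // 2)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_wavelengths(dim, base):
    """Number of positions per full rotation for each frequency pair."""
    return [2.0 * math.pi / f for f in rope_inv_freq(dim, base)]


short = rope_wavelengths(8, 10000.0)
long_ = rope_wavelengths(8, 40000.0)  # hypothetical rescaled base
print(short[-1], long_[-1])
```

Raising the base stretches every wavelength except the first, so positions repeat their rotary phase over a longer span; lowering it does the opposite.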

2605.13452 2026-05-14 cs.RO cs.AI

CUBic: Coordinated Unified Bimanual Perception and Control Framework

Xingyu Wang, Pengxiang Ding, Jingkai Xu, Donglin Wang, Zhaoxin Fan

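AI Summary This paper proposes CUBic, a coordinated and unified bimanual perception and control framework that addresses the tension between independent perception and arm coordination when extending from single-arm to bimanual manipulation. Through unified perceptual modeling, the method learns a shared tokenized representation in which independent operation and coordinated interaction emerge from structure rather than from hand-crafted coupling mechanisms. Experiments show that CUBic clearly outperforms existing methods on the RoboTwin benchmark, with marked gains in coordination accuracy and task success rate.

Details
English abstract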

Recent advances in visuomotor policy learning have enabled robots to perform control directly from visual inputs. Yet, extending such end-to-end learning from single-arm to bimanual manipulation remains challenging due to the need for both independent perception and coordinated interaction between arms. Existing methods typically favor one side -- either decoupling the two arms to avoid interference or enforcing strong cross-arm coupling for coordination -- thus lacking a unified treatment. We propose CUBic, a Coordinated and Unified framework for Bimanual perception and control that reformulates bimanual coordination as a unified perceptual modeling problem. CUBic learns a shared tokenized representation bridging perception and control, where independence and coordination emerge intrinsically from structure rather than from hand-crafted coupling. Our approach integrates three components: unidirectional perception aggregation, bidirectional perception coordination through two codebooks with shared mapping, and a unified perception-to-control diffusion policy. Extensive experiments on the RoboTwin benchmark show that CUBic consistently surpasses standard baselines, achieving marked improvements in coordination accuracy and task success rates over state-of-the-art visuomotor baselines.

2605.13451 2026-05-14 cs.CL

LongBEL: Long-Context and Document-Consistent Biomedical Entity Linking

Adam Remaki, Xavier Tannier, Christel Gérardin

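AI Summary This paper proposes LongBEL, a document-level generative framework for biomedical entity linking that addresses the inconsistent predictions existing systems can produce when the same concept recurs within a document. LongBEL combines full-document context with a memory of previous predictions to improve the consistency and accuracy of entity linking. Experiments show that LongBEL outperforms sentence-level generative methods on several biomedical datasets, particularly where concepts recur.

Comments 9 pages, 2 figures

Details
English abstract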

Biomedical entity linking maps textual mentions to concepts in structured knowledge bases such as UMLS or SNOMED CT. Most existing systems link each mention independently, using only the mention or its surrounding sentence. This ignores dependencies between mentions in the same document and can lead to inconsistent predictions, especially when the same concept appears under different surface forms. We introduce LongBEL, a document-level generative framework that combines full-document context with a memory of previous predictions. To make this memory robust, LongBEL is trained with cross-validated predictions rather than gold labels, reducing the mismatch between training and inference and limiting cascading errors. Experiments on five biomedical benchmarks across English, French, and Spanish show that LongBEL improves over sentence-level generative baselines, with the largest gains on datasets where concepts frequently recur within documents. An ensemble of local, global, and memory-based variants achieves the best results across all benchmarks. Further analysis shows that the largest gains occur on recurring concepts, suggesting that LongBEL mainly improves document-level consistency rather than isolated mention disambiguation.

2605.13450 2026-05-14 cs.AI cs.CL cs.HC

Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

Samuel Schapiro, Alexi Gladstone, Jonah Black, Heng Ji

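AI Summary This paper studies how to evaluate the creativity of large language models effectively, systematically assessing the validity of existing creativity tests across three areas: creative writing, divergent thinking, and scientific ideation. It finds that existing tests have significant limitations as predictors of model creativity, and in particular predict scientific ideation ability poorly. The authors therefore propose a new test, the Divergent Remote Association Test (DRAT), which is the first to assess convergent and divergent thinking in a single test and which predicts scientific ideation ability robustly.

Comments 36 pages. Extended version of work under review

Details
English abstract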

Measuring the creativity of large language models (LLMs) is essential for designing methods that can improve creativity and for enhancing our scientific understanding of this ability. To accomplish this, it has become common in recent years to administer tests of human creativity to LLMs. Although these tests provide a convenient and fully automated way to score "creativity," their validity as measures of machine creativity has not been established, and these tests already have limited validity as predictors of human creativity. To address this problem, we conduct the first large-scale, systematic study assessing the effectiveness of human creativity tests for predicting the creative achievement of LLMs across three target constructs: creative writing, divergent thinking, and scientific ideation. We find that the Divergent Association Task (DAT) and the Conditional DAT are the best predictors of creative writing and divergent thinking, respectively, but that test effectiveness varies significantly by construct, and no single test predicts all constructs well. Moreover, contrary to popular belief, no existing test reliably predicts scientific ideation ability. Motivated by this problem, we introduce the Divergent Remote Association Test (DRAT), a vocabulary-space test that assesses both convergent and divergent thinking in a single instrument. The DRAT is the first and only creativity test for LLMs that is a significant predictor of scientific ideation ability, demonstrating robustness across major design choices. Furthermore, the performance gain of the DRAT is not recoverable from any linear combination of the Divergent Association Task and the Remote Associates Test, indicating that assessing divergent and convergent thinking in the same test is essential to reliably predicting scientific ideation ability.