arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.12169 2026-05-13 cs.CV

UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

Sihan Chen, Xiang Zhang, Yang Zhang, Tunc Aydin, Christopher Schroers

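AI summary: With the rapid progress of generative models, diffusion-based methods have become mainstream for view synthesis, but image quality often degrades because of pixel-to-latent compression and diffusion hallucination. The paper analyzes diffusion degradation along three dimensions (spatial, temporal, and backbone-related) and proposes UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts with a coarse-to-fine strategy. It combines a reference pre-alignment module, a global structure anchoring mechanism, and a local detail injection module to restore geometric structure and texture detail, achieving zero-shot fixing across degradation types and strong results on novel view synthesis and stereo conversion.

Abstract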

With the recent surge of generative models, diffusion-based approaches have become mainstream for view synthesis tasks, either in an explicit depth-warp-inpaint or in an implicit end-to-end manner. Despite their success, both paradigms often suffer from noticeable quality degradation, e.g., blurred details and distorted structures, caused by pixel-to-latent compression and diffusion hallucination. In this paper, we investigate diffusion degradation from three key dimensions (i.e., spatial, temporal, and backbone-related) and propose UniFixer, a universal reference-guided framework that fixes diverse degradation artifacts via a coarse-to-fine strategy. Specifically, a reference pre-alignment module is first designed to perform coarse alignment between the reference view and the degraded novel view. A global structure anchoring mechanism then rectifies geometric distortions to ensure structural fidelity, followed by a local detail injection module that recovers fine-grained texture details for high-quality view synthesis. Our UniFixer serves as a plug-and-play refiner that achieves zero-shot fixing across different types of diffusion degradation, and extensive experiments verify our state-of-the-art performance on novel view synthesis and stereo conversion.

2605.12168 2026-05-13 cs.LG

On What We Can Learn from Low-Resolution Data

Theresa Dahl Frehr, Niels Henrik Pontoppidan, Hiba Nassar, Tommy Sonne Alstrøm

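AI summary: This paper studies how low-resolution data can improve model performance when high-resolution data is scarce. Using a theoretical analysis based on the Kullback-Leibler divergence, the authors characterize how a datapoint's influence changes with resolution and derive bounds relating the relative contribution of high- and low-resolution samples to the information lost under downsampling. Experiments with a vision transformer and a convolutional neural network show that adding low-resolution data improves performance on high-resolution inputs.

Abstract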

Artificial intelligence systems typically rely on large, centrally collected datasets, a premise that does not hold in many real-world domains such as healthcare and public institutions. In these settings, data sharing is often constrained by storage, privacy, or resource limitations. For example, small wearable devices may lack the bandwidth or energy capacity needed to store and transmit high-resolution data, leading to aggregation during data collection and thus a loss of information. As a result, datasets collected from different sources may consist of a mixture of high- and low-resolution samples. Despite the prevalence of this setting, it remains unclear how informative low-resolution data is when models are ultimately evaluated on high-resolution inputs. We provide a theoretical analysis based on the Kullback-Leibler divergence that characterises how the influence of a datapoint changes with resolution, and derive bounds that relate the relative contribution of high- and low-resolution observations to the information lost under downsampling. To support this analysis, we empirically demonstrate, using both a vision transformer and a convolutional neural network, that adding low-resolution data to the training set consistently improves performance when high-resolution data is scarce.
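
As a toy illustration of the information loss the abstract refers to, the sketch below computes the KL divergence between two made-up discrete distributions before and after aggregating adjacent bins. It is not the paper's analysis: modeling downsampling as bin aggregation, the function names, and the numbers are all assumptions chosen for illustration. It relies only on the data processing inequality, which guarantees that aggregation cannot increase the KL divergence.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def downsample(dist, factor=2):
    """Sum adjacent bins: a toy stand-in (assumption) for lowering resolution."""
    return [sum(dist[i:i + factor]) for i in range(0, len(dist), factor)]

# Two hypothetical high-resolution distributions over four bins.
p_high = [0.4, 0.1, 0.1, 0.4]
q_high = [0.1, 0.4, 0.4, 0.1]

# After aggregation both collapse to [0.5, 0.5], so they become indistinguishable.
p_low, q_low = downsample(p_high), downsample(q_high)

kl_high = kl_divergence(p_high, q_high)
kl_low = kl_divergence(p_low, q_low)
```

In this deliberately extreme case the two distributions differ only in fine detail, so aggregation removes all of the divergence; real data would typically retain part of it.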

2605.12167 2026-05-13 cs.RO cs.CV

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, Jiankang Deng, Xiatian Zhu, Li Zhang

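AI summary: This paper studies how to turn the future scenes predicted by video generation models into actions a robot can execute, addressing the mismatch between visual realism and control relevance in existing methods. The authors propose MoLA (Mixture of Latent Actions), which uses a mixture of pretrained inverse dynamics models to infer a mixture of latent actions from generated videos, giving more stable and controllable policy execution. Experiments on simulated and real-world robot tasks show gains in task success and generalization.

Comments ICML 2026

Abstract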

Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.

2605.12162 2026-05-13 cs.RO

X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

Kai Xiong, Hongjie Fang, Lixin Yang, Cewu Lu

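AI summary: The interplay between spatial perception and action generation remains a key difficulty in robotic manipulation. This paper proposes X-Imitator, a dual-path framework that models spatial perception and action execution as a tightly coupled loop through bidirectional action-pose interaction, enabling continuous mutual refinement between spatial reasoning and action generation. The method mimics human internal forward models, its modular design integrates into a range of visuomotor policies, and experiments on simulated and real-world tasks show that it significantly outperforms existing methods.

Abstract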

Effectively handling the interplay between spatial perception and action generation remains a critical bottleneck in robotic manipulation. Existing methods typically treat spatial perception and action execution as decoupled or strictly unidirectional processes, fundamentally restricting a robot's ability to master complex manipulation tasks. To address this, we propose X-Imitator, a versatile dual-path framework that models spatial perception and action execution as a tightly coupled bidirectional loop. By reciprocally conditioning current pose predictions on past actions and vice versa, this framework enables continuous mutual refinement between spatial reasoning and action generation. This joint modeling exactly mimics human internal forward models. Designed as a modular architecture, the system can be seamlessly integrated into various visuomotor policies. Extensive experiments across 24 simulated and 3 real-world tasks demonstrate that our framework significantly outperforms both vanilla policies and prior methods utilizing explicit pose guidance. The code will be open sourced.

2605.12161 2026-05-13 cs.LG cs.CY math.MG

Fused Gromov-Wasserstein Distance with Feature Selection

Harlin Lee, Ying Yu, Mingxin Li, Ranthony Clark

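AI summary: This paper proposes Fused Gromov-Wasserstein (FGW) distances with feature selection, which adaptively suppress irrelevant or noisy features when comparing structure and node features, improving interpretability and robustness. Two approaches are introduced: a regularized FGW with Lasso and Ridge penalties, and an FGW with simplex-constrained weights, extended to groupwise feature selection. The theoretical analysis establishes the metric behavior of the resulting distances, an efficient alternating minimization algorithm is developed, and experiments on tasks such as computational redistricting show that the method reveals task-relevant structure.

Abstract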

Fused Gromov-Wasserstein (FGW) distances provide a principled framework for comparing objects by jointly aligning structure and node features. However, existing FGW formulations treat all features uniformly, which limits interpretability and robustness in high-dimensional settings where many features may be irrelevant or noisy. We introduce FGW distances with feature selection, which incorporate adaptive feature suppression weights into the FGW objective to selectively downweight or suppress differentiating features during alignment. We propose two approaches: (1) regularized FGW with Lasso and Ridge penalties, and (2) FGW with simplex-constrained weights, including groupwise extensions. We analyze the resulting models and establish their key theoretical properties, including bounds relative to classical FGW and Gromov-Wasserstein distances, and metric behavior. An efficient alternating minimization algorithm is developed. Experiments illustrate how feature suppression enhances interpretability and reveals task-relevant structure, with a special application to computational redistricting.
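
For orientation, the block below writes out the classical FGW objective (with exponent 2) and then one possible reading of the regularized variant described above. Only the first formula is standard; the suppression weights $s_f$, the form of the weighted feature cost, and the penalty coefficients $\lambda_1, \lambda_2$ are assumptions made for illustration and may differ from the paper's actual formulation.

```latex
% Classical FGW between two graphs with node features a_i, b_j, structure
% matrices C_1, C_2, node weights h, g, and trade-off \alpha \in [0, 1]:
\[
\mathrm{FGW}_{\alpha}
  = \min_{\pi \in \Pi(h, g)} \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, d(a_i, b_j)^2
        + \alpha\, \big| C_1(i,k) - C_2(j,l) \big|^2 \Big]\, \pi_{ij}\, \pi_{kl}
\]
% Assumed feature-suppressed cost: s_f \in [0, 1] suppresses feature f.
\[
d_s(a_i, b_j)^2 = \sum_{f=1}^{F} (1 - s_f)\, (a_{i,f} - b_{j,f})^2
\]
% Assumed regularized objective, minimized jointly over \pi and s:
\[
\min_{\pi \in \Pi(h, g),\; s \in [0,1]^F} \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, d_s(a_i, b_j)^2
        + \alpha\, \big| C_1(i,k) - C_2(j,l) \big|^2 \Big]\, \pi_{ij}\, \pi_{kl}
  + \lambda_1 \lVert s \rVert_1 + \lambda_2 \lVert s \rVert_2^2
\]
```

Under this reading, $s = 0$ is feasible and incurs no penalty, so the minimum can never exceed the classical FGW value, which is consistent with the kind of bound the abstract mentions. An alternating scheme would fix $s$ and update $\pi$, then fix $\pi$ and update $s$.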

2605.12160 2026-05-13 cs.RO cs.AI

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

Joonha Park, Jiseung Jeong, Taesik Gong

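AI summary: This work proposes Premover, a lightweight module that makes Vision-Language-Action (VLA) policies more responsive in real deployment. Premover precomputes before the user's instruction is complete, using the time the robot would otherwise spend idle and shortening overall execution. It keeps the VLA backbone frozen, adds two projection heads that map intermediate-layer features into a shared space, and is supervised with simulator-rendered target-object segmentation masks; the result is a lower mean task execution time at a comparable success rate.

Abstract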

Vision-Language-Action (VLA) policies are typically evaluated as if the user had finished typing or speaking before the robot begins acting. In real deployment, however, users take several seconds to enter a request, leaving the policy idle for a substantial fraction of the interaction. We introduce Premover, a lightweight module that converts this idle window into useful precomputation. Premover keeps the VLA backbone frozen and attaches two small projection heads, one for image patches, one for language tokens, that map an intermediate layer of the backbone into a shared space. The resulting focus map is supervised by simulator-rendered target-object segmentation masks and applied as a per-patch reweighting of the next step's image tokens. A single scalar readiness threshold, trained jointly from streaming prefixes, decides when the policy should begin acting. On the LIBERO benchmark suite, Premover reduces mean wall-clock time from 34.0 to 29.4 seconds, a 13.6% reduction, while matching the full-prompt baseline's success rate (95.1% vs. 95.0%); naive premoving, by contrast, collapses to 66.4%.
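
The per-patch reweighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Premover's implementation: the projection matrices, the mean-pooled language query, the sigmoid scoring, and the use of the peak focus weight as a readiness score are all assumptions introduced here.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def focus_map(patch_feats, token_feats, w_img, w_txt):
    """Score each image patch against the mean projected language token."""
    proj_tokens = [matvec(w_txt, t) for t in token_feats]
    dim = len(proj_tokens[0])
    query = [sum(t[d] for t in proj_tokens) / len(proj_tokens) for d in range(dim)]
    weights = []
    for patch in patch_feats:
        proj_patch = matvec(w_img, patch)
        dot = sum(p * q for p, q in zip(proj_patch, query))
        weights.append(1.0 / (1.0 + math.exp(-dot)))  # sigmoid: weight in (0, 1)
    return weights

def reweight(patch_tokens, weights):
    """Scale every image token by its focus weight."""
    return [[w * x for x in tok] for tok, w in zip(patch_tokens, weights)]

def ready_to_act(weights, threshold):
    """Hypothetical readiness rule: act once the peak focus clears the threshold."""
    return max(weights) >= threshold

identity = [[1.0, 0.0], [0.0, 1.0]]          # stand-in projection heads
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weak_prefix = focus_map(patches, [[0.5, 0.0]], identity, identity)
strong_prefix = focus_map(patches, [[2.0, 0.0]], identity, identity)
reweighted = reweight(patches, strong_prefix)
```

With a threshold of 0.8, `ready_to_act` is false for the weak prefix and true for the strong one, mirroring the idea that the policy starts acting once the partial instruction pins down the target.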

2605.12159 2026-05-13 cs.AI cs.GR

ALGOGEN: Tool-Generated Verifiable Traces for Reliable Algorithm Visualization

Kunpeng Liao, Yuexiao Ma, Yisheng Lin, Hualin Zeng, Xiawu Zheng, Rongrong Ji

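AI summary: This paper proposes ALGOGEN, a method that generates verifiable algorithm visualization traces to make algorithm visualization more reliable. Its core idea is to decouple algorithm execution from rendering: Visualization Trace Algebra (VTA) governs algorithm state and a Rendering Style Language (RSL) governs visual presentation, which avoids the large language model hallucinations seen in end-to-end approaches. On a LeetCode benchmark, ALGOGEN substantially raises the generation success rate.

Abstract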

Algorithm Visualization (AV) helps students build mental models by animating algorithm execution states. Recent LLM-based systems such as CODE2VIDEO generate AV videos in an end-to-end manner. However, this paradigm requires the system to simultaneously simulate algorithm flow and satisfy video rendering constraints, such as element layout and color schemes. This complex task induces LLM hallucinations, resulting in reduced execution success rates, element overlap, and inter-frame inconsistencies. To address these challenges, we propose ALGOGEN, a novel paradigm that decouples algorithm execution from rendering. We first introduce Visualization Trace Algebra (VTA), a monoid over algorithm visual states and operations. The LLM then generates a Python tracker that simulates algorithm flow and outputs VTA-JSON traces, a JSON encoding of VTA. For rendering, we define a Rendering Style Language (RSL) to templatize algorithm layouts. A deterministic renderer then compiles algorithm traces with RSL into Manim, LaTeX/TikZ, or Three.js outputs. Evaluated on a LeetCode AV benchmark of 200 tasks, ALGOGEN improves the average success rate by 17.3 percentage points over end-to-end methods (99.8% versus 82.5%). These results demonstrate that our decoupling paradigm effectively mitigates LLM hallucinations in complex AV tasks, providing a more reliable solution for automated generation of high-quality algorithm visualizations. Demo videos and code are available in the project repository.
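
To make the tracker idea concrete, here is a minimal sketch of a Python tracker that runs bubble sort and emits a JSON trace. The operation names (`init`, `compare`, `swap`, `done`) and the trace schema are hypothetical and are not the actual VTA-JSON format; the monoid is modeled simply as list concatenation with the empty list as identity.

```python
import json

def compose(trace_a, trace_b):
    """Monoid operation on traces: concatenation, with [] as the identity."""
    return trace_a + trace_b

def bubble_sort_tracker(values):
    """Run bubble sort and record each visual operation as a dict."""
    arr = list(values)
    trace = [{"op": "init", "array": list(arr)}]
    for end in range(len(arr) - 1, 0, -1):
        for i in range(end):
            trace = compose(trace, [{"op": "compare", "i": i, "j": i + 1}])
            if arr[i] > arr[i + 1]:
                arr[i], arr[i + 1] = arr[i + 1], arr[i]
                trace = compose(trace, [{"op": "swap", "i": i, "j": i + 1}])
    trace = compose(trace, [{"op": "done", "array": list(arr)}])
    return arr, trace

def replay(trace):
    """Re-apply the recorded swaps to the initial array to verify the trace."""
    arr = list(trace[0]["array"])
    for step in trace:
        if step["op"] == "swap":
            i, j = step["i"], step["j"]
            arr[i], arr[j] = arr[j], arr[i]
    return arr

sorted_values, trace = bubble_sort_tracker([3, 1, 2])
trace_json = json.dumps(trace)  # what a renderer would consume
```

Because the trace is plain data, a renderer never has to simulate the algorithm, and `replay` gives a cheap consistency check: replaying the swaps must reproduce the final array recorded in the trace.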

2605.12156 2026-05-13 cs.CL cs.SI

Latent Causal Void: Explicit Missing-Context Reconstruction for Misinformation Detection

Hui Li, Zhongquan Jian, Jinsong Su, Junfeng Yao

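AI summary: This paper studies a hard-to-spot class of misinformation in which an article is locally coherent and only appears misleading when compared with contemporaneous background reports. It proposes Latent Causal Void (LCV), which retrieves temporally aligned context articles and uses a frozen large language model to explicitly reconstruct the context missing from each target sentence, modeling it as a cross-source relation in graph reasoning. Experiments on a bilingual benchmark show that the method outperforms existing approaches, supporting explicit reconstruction of missing facts as a useful signal for misinformation detection.

Abstract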

Automatic misinformation detection performs well when deception is visible in what an article explicitly states. However, some misinformation articles remain locally coherent and only become misleading once compared with contemporaneous reports that supply background facts the article omits. We study this omission-relevant setting and observe that current omission-aware approaches typically either attach retrieved context as auxiliary evidence or infer a categorical omission signal, leaving the specific missing fact implicit. We propose Latent Causal Void (LCV), a retrieval-guided detector that explicitly reconstructs the missing fact for each target sentence and uses it as a textual cross-source relation in graph reasoning. Concretely, LCV retrieves temporally aligned context articles, asks a frozen instruction-tuned large language model to generate a short missing-context description for each sentence-article pair, and feeds the resulting relation text into a heterograph over target sentences and context articles. On the bilingual benchmark of Sheng et al., LCV improves over the strongest omission-aware baseline by 2.56 and 2.84 macro-F1 points on the English and Chinese splits, respectively. The results indicate that modeling the missing cross-source fact itself, rather than only attaching retrieved evidence or predicting an omission signal, is a useful representation for omission-aware misinformation detection.

2605.12154 2026-05-13 cs.AI

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

Zhong Li, Qi Huang, Yuxuan Zhu, Mohammad Mohammadi Amiri, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen, Zaiwen Wen, Lincen Yang

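AI summary: MM-OptBench is a solver-grounded benchmark for multimodal optimization modeling that evaluates whether models can build mathematical optimization models and executable solver code from text and visual information. It covers 6 optimization families, 26 subcategories, and 3 difficulty levels, for a total of 780 solver-verified instances. Experiments show that current multimodal large language models perform poorly on the task, with a marked drop on harder instances, which underlines how challenging multimodal optimization modeling is.

Comments Paper under review

Abstract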

Optimization modeling translates real decision-making problems into mathematical optimization models and solver-executable implementations. Although language models are increasingly used to generate optimization formulations and solver code, existing benchmarks are almost entirely text-only. This omits many optimization-modeling tasks that arise in operational practice, where requirements are described in text but instance information is conveyed through visual artifacts such as tables, graphs, maps, schedules, and dashboards. We introduce multimodal optimization modeling, a benchmark setting in which models must construct both a mathematical formulation and executable solver code from a text-and-visual problem specification. To evaluate this setting, we develop a solver-grounded framework that generates structured optimization instances, verifies each with an exact solver, and builds both the model-facing inputs and hidden reference files from the same verified source. We instantiate the framework as MM-OptBench, a benchmark of 780 solver-verified instances spanning 6 optimization families, 26 subcategories, and 3 structural difficulty levels. We evaluate 9 multimodal large language models (MLLMs), including 6 frontier general-purpose models and 3 math-specialized models, with aggregate, family-level, difficulty-level, and failure-mode analyses. The results show that the task remains far from solved: the best two models reach 52.1% and 51.3% pass@1, while on average across the six general-purpose MLLMs, pass@1 is 43.4% on easy instances and 15.9% on hard instances. All three math-specialized MLLMs solve 0/780 instances. Failure attribution shows that errors arise both when extracting instance data from text and visuals and when turning extracted data into solver-correct formulations and code. MM-OptBench provides a testbed for solver-grounded, decision-oriented multimodal intelligence.

2605.12144 2026-05-13 cs.CV

PoseCompass: Intelligent Synthetic Pose Selection for Visual Localization

Yanan Zhou, Zhaoyan Qian, Yanli Li, Nan Yang, Zhongliang Guo, Dong Yuan

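AI summary: In visual localization, Absolute Pose Regression (APR) infers the 6-DoF camera pose from a single image in real time, but its performance depends heavily on the quality and coverage of the training data. Existing data augmentation based on 3D Gaussian Splatting (3DGS) view synthesis samples poses at random, which produces redundant views and noisy samples. This paper proposes PoseCompass, an intelligent pose selection method that ranks synthetic poses by localization difficulty, coverage novelty, and rendering observability, then generates trajectory-constrained candidate views and synthesizes them, improving both the training efficiency and the localization accuracy of the pose regressor. On 7-Scenes, PoseCompass cuts adaptation time by about a factor of three and substantially reduces pose error.

Abstract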

In visual localization, Absolute Pose Regression (APR) enables real-time 6-DoF camera pose inference from single images, yet critically depends on fine-tuning data quality and coverage. While recent methods leverage 3D Gaussian Splatting (3DGS) for novel view synthesis-based data augmentation, random sampling generates redundant views and noisy samples from poorly reconstructed regions. To close this gap, we propose PoseCompass, an intelligent pose selection pipeline for 3DGS-based APR. PoseCompass formulates synthetic pose selection and derives a value-based pose ranking mechanism to identify informative poses. The ranking integrates three dimensions: Localization Difficulty, favoring challenging regions; Coverage Novelty, exploring under-sampled areas; and Rendering Observability, filtering artifacts and noise. PoseCompass then generates trajectory-constrained candidates, selects the top-K ranked poses, and synthesizes views using 3DGS with lightweight diffusion-based alignment. Finally, the pose regressor is fine-tuned on mixed real and synthetic data. We evaluate PoseCompass on 7-Scenes, where it reduces adaptation time from 15.2 to 5.1 minutes, a 3x speedup, while cutting median pose errors by 53.8% and significantly outperforming random baselines.
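
The top-K selection step can be illustrated with a small ranking sketch. How PoseCompass actually combines the three dimensions is not stated in the abstract; the version below assumes observability acts as a hard filter and that difficulty and novelty are combined by a weighted sum. The field names, weights, threshold, and candidate scores are all hypothetical.

```python
def rank_poses(candidates, k, w_difficulty=1.0, w_novelty=1.0, min_observability=0.5):
    """Return the k highest-value poses among those that render reliably."""
    # Assumption: poorly observable poses are dropped before ranking.
    kept = [c for c in candidates if c["observability"] >= min_observability]

    def value(c):
        # Assumption: value is a weighted sum of difficulty and novelty.
        return w_difficulty * c["difficulty"] + w_novelty * c["novelty"]

    return sorted(kept, key=value, reverse=True)[:k]

# Hypothetical candidate poses with scores in [0, 1].
candidates = [
    {"id": "a", "difficulty": 0.9, "novelty": 0.2, "observability": 0.90},
    {"id": "b", "difficulty": 0.8, "novelty": 0.9, "observability": 0.30},
    {"id": "c", "difficulty": 0.4, "novelty": 0.9, "observability": 0.80},
    {"id": "d", "difficulty": 0.1, "novelty": 0.1, "observability": 0.95},
]

selected_ids = [c["id"] for c in rank_poses(candidates, k=2)]
```

Pose `b` has the highest combined difficulty and novelty but is excluded because it falls in a poorly observed region, which is the failure mode of random sampling that the abstract describes.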

2605.12140 2026-05-13 cs.CV

EchoTracker2: Enhancing Myocardial Point Tracking by Modeling Local Motion

Md Abulkalam Azad, Vegard Holmstrøm, John Nyberg, Lasse Lovstakken, Håvard Dalen, Bjørnar Grenne, Andreas Østvik

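AI summary: This paper proposes EchoTracker2, a myocardial point tracking method that aims to improve the accuracy of myocardial motion estimation in echocardiography. By modeling local motion, it drops the coarse initialization step of conventional two-stage architectures in favor of a fine-stage-only network that combines local spatiotemporal context with long-range temporal reasoning for more robust tracking. Experiments on several datasets show that it outperforms the best existing models, with higher position accuracy and lower trajectory error, and that it agrees better with clinically relevant measures such as global longitudinal strain.

Comments Early accepted (top 9%) to MICCAI 2026

Abstract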

Myocardial point tracking (MPT) has recently emerged as a promising direction for motion estimation in echocardiography, driven by advances in general-purpose point tracking methods. However, myocardial motion fundamentally differs from motion encountered in natural videos, as it arises from physiologically constrained deformation that is spatially and temporally continuous throughout the cardiac cycle. Consequently, motion trajectories typically remain locally confined despite substantial tissue deformation. Motivated by these properties, we revisit the architectural design for MPT and find that coarse initialization in commonly used two-stage coarse-to-fine architectures may be unnecessary in this domain. In this work, we propose a fine-stage-only architecture, EchoTracker2, which enriches pixel-precise features with local spatiotemporal context and integrates them with long-range joint temporal reasoning for robust tracking. Experimental results across in-distribution, out-of-distribution (OOD), and public synthetic datasets show that our model improves position accuracy by 6.5% and reduces median trajectory error by 12.2% relative to a domain-specific state-of-the-art (SOTA) model. Compared to the best general-purpose point tracking method, the improvements are 2.0% and 5.3%, respectively. Moreover, EchoTracker2 shows better agreement with expert-derived global longitudinal strain (GLS) and enhances test-retest reproducibility. Source code will be available at: https://github.com/riponazad/ptecho.

2605.12139 2026-05-13 cs.AI

BoolXLLM: LLM-Assisted Explainability for Boolean Models

Du Cheng, Serdar Kadioglu, Xin Wang

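AI summary: BoolXLLM is a learning framework that combines large language models (LLMs) with Boolean logic rules to make Boolean models more explainable. It brings LLMs into three key stages: feature selection, recommendation of discretization strategies for numerical features, and compression and interpretation of Boolean rules, producing explanations that fit the domain semantics and are easier to understand. The study shows that this hybrid approach helps non-technical users understand the model's decisions while maintaining predictive performance.

Abstract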

Interpretable machine learning aims to provide transparent models whose decision-making processes can be readily understood by humans. Recent advances in rule-based approaches, such as expressive Boolean formulas (BoolXAI), offer faithful and compact representations of model behavior. However, for non-technical stakeholders, two main challenges remain in practice: (i) selecting semantically meaningful features and (ii) translating formal logical rules into accessible explanations. In this work, we propose BoolXLLM, a hybrid framework that integrates Large Language Models (LLMs) into the end-to-end pipeline of Boolean rule learning. We augment BoolXAI, an expressive Boolean rule-based classifier, with LLMs at three critical stages: (1) feature selection, where LLMs guide the identification of domain-relevant variables; (2) threshold recommendation, where LLMs propose semantically meaningful discretization strategies for numerical features; and (3) rule compression and interpretation, where Boolean rules are translated into natural language explanations at both global and local levels. This integration bridges formal, faithful explanations with human-understandable narratives, which allows us to build an explainable AI system that is both theoretically grounded and accessible to non-experts. Early empirical results demonstrate that LLM-assisted pipelines improve interpretability while maintaining competitive predictive performance. Our work highlights the promise of combining symbolic reasoning with language-based models for human-centered explainability.

2605.12138 2026-05-13 cs.CV cs.CL cs.IR

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

Yexing Xu, Wei Feng, Shen Zhang, Haohan Wang, Yuxin Qin, Yaoyu Li, Ao Ma, Yuhao Luo, Lu Wang, Xudong Ren, Haoran Wang, Run Ling, Zheng Zhang, Jingjing Lv, Junjie Shen, Ching Law, Longguang Wang, Yulan Guo

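AI summary: Generating realistic advertising content that matches user preferences is an important challenge in e-commerce. This paper proposes Uni-AdGen, a unified autoregressive model that generates both personalized advertising images and text. A foreground perception module and instruction tuning improve the realism of the generated content, and a coarse-to-fine preference understanding module captures user interests from multimodal historical behaviors for more precise personalization. The authors also build PAd1M, the first large-scale personalized advertising image-text dataset, and introduce the Product Background Similarity (PBS) metric; experiments show that the method outperforms existing approaches on both general and personalized advertisement generation.

Comments 22 pages, 19 figures, CVPR 2026

Abstract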

Generating realistic and user-preferred advertisements is a key challenge in e-commerce. Existing approaches utilize multiple independent models driven by click-through-rate (CTR) to controllably create attractive image or text advertisements. However, their pipelines lack cross-modal perception and rely on CTR that only reflects average preferences. Therefore, we explore jointly generating personalized image-text advertisements from historical click behaviors. We first design a Unified Advertisement Generative model (Uni-AdGen) that employs a single autoregressive framework to produce both advertising images and texts. By incorporating a foreground perception module and instruction tuning, Uni-AdGen enhances the realism of the generated content. To further personalize advertisements, we equip Uni-AdGen with a coarse-to-fine preference understanding module that effectively captures user interests from noisy multimodal historical behaviors to drive personalized generation. Additionally, we construct the first large-scale Personalized Advertising image-text dataset (PAd1M) and introduce a Product Background Similarity (PBS) metric to facilitate training and evaluation. Extensive experiments show that our method outperforms baselines in general and personalized advertisement generation. Our project is available at https://github.com/JD-GenX/Uni-AdGen.

2605.12135 2026-05-13 cs.SD cs.LG eess.AS

STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts

Joshua Opria

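AI summary: This paper presents STRUM, an end-to-end system that converts raw audio into playable rhythm-game charts (for Clone Hero and YARG) without any human-annotated metadata, covering drums, guitar, bass, vocals, and keys. STRUM is a multi-stage hybrid: a convolutional recurrent neural network (CRNN) for drum onset detection, neural networks with monophonic pitch tracking for guitar and bass, word-aligned speech recognition for vocals, and spectral analysis for keyboard notes. It is evaluated on a 30-song dataset screened by audio quality, where it reports F1 scores between 0.539 (vocals) and 0.838 (drum onsets), and the paper includes a complete ablation of the drum-pipeline components.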

Comments 9 pages, 4 figures, 3 tables. Code and models: https://github.com/<your-github-username>/autocharter

Abstract

We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
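
The evaluation protocol (onset F1 at a fixed tolerance, after a per-song global offset search) can be sketched as follows. This is an illustrative reading, not STRUM's evaluation code: the greedy one-to-one matching and the interpretation of "global offset search" as trying a small set of constant time shifts per song are assumptions, and the onset times are made up.

```python
def onset_f1(reference, predicted, tolerance=0.1):
    """F1 with greedy one-to-one matching within +/- tolerance seconds."""
    ref, pred = sorted(reference), sorted(predicted)
    i = j = matched = 0
    while i < len(ref) and j < len(pred):
        diff = pred[j] - ref[i]
        if abs(diff) <= tolerance:
            matched += 1       # pair these two onsets and move past both
            i += 1
            j += 1
        elif diff < 0:
            j += 1             # prediction is too early to match anything later
        else:
            i += 1             # reference onset was missed
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

def best_global_offset(reference, predicted, offsets, tolerance=0.1):
    """Pick the single time shift (assumed search space) that maximizes F1."""
    return max(
        offsets,
        key=lambda off: onset_f1(reference, [p + off for p in predicted], tolerance),
    )

# Hypothetical chart where every predicted onset is 0.25 s late.
reference = [0.5, 1.0, 1.5, 2.0]
predicted = [0.75, 1.25, 1.75, 2.25]

raw_f1 = onset_f1(reference, predicted)
offset = best_global_offset(reference, predicted, offsets=[-0.25, 0.0, 0.25])
aligned_f1 = onset_f1(reference, [p + offset for p in predicted])
```

The example shows why the offset search matters: a constant timing shift between chart and audio would otherwise be scored as four misses and four false alarms, even though the transcription itself is correct.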

2605.12134 2026-05-13 cs.CV cs.LG

MULTI: Disentangling Camera Lens, Sensor, View, and Domain for Novel Image Generation

Sonali Godavarthy, Matthias Neuwirth-Trapp, Tim-Felix Faasch, Maarten Bieshaar, Michael Moeller, Danda Pani Paudel

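AI summary: This paper proposes MULTI, a method that addresses the difficulty of precise control in text-to-image generation caused by text ambiguity by disentangling imaging factors such as camera lens, sensor type, viewpoint, and scene domain. The method has two stages: it first learns general imaging factors and then extracts dataset-specific ones, which supports extending existing datasets and composing new factor combinations, reduces distribution gaps, and allows modification of specific factors and image-to-image generation via ControlNets. Experiments on the newly built DF-RICO benchmark show that MULTI is effective and highlight imaging factor disentanglement as a new research direction for image generation.

Comments Accepted at ICPR 2026

Abstract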

Recent text-to-image models produce high-quality images, yet text ambiguity hinders precise control when specific styles or objects are required. A number of recent works deal with learning and composing multiple objects and patterns. However, current work focuses almost entirely on image content, overlooking imaging factors such as camera lens, sensor types, imaging viewpoints, and scenes' domain characteristics. We introduce this new challenge as Imaging Factor Disentanglement and show limitations of current approaches in this regime. We therefore propose a new method, Multi-factor disentanglement through Textual Inversion (MULTI). It consists of two stages: in the first stage, we learn general factors, and in the second stage, we extract dataset-specific ones. This setup enables the extension of existing datasets and novel factor combinations, thereby reducing distribution gaps. It further supports modifications of specific factors and image-to-image generation via ControlNets. The evaluation on our new DF-RICO benchmark demonstrates the effectiveness of MULTI and highlights the importance of Factor Disentanglement as a new direction of research.

2605.12131 2026-05-13 cs.AI

Rollout Cards: A Reproducibility Standard for Agent Research

Charlie Masters, Ziyuan Liu, Stefano V. Albrecht

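AI summary: This paper addresses the growing reproducibility problem in agent research and proposes a new standard, rollout cards. It observes that many papers report only system scores without releasing the full rollout records behind them, so the same behaviour can receive different scores under different evaluation choices. Rollout cards make the rollout record, rather than the reported score, the basic unit of reproducibility; the authors validate them on real cases and show that changing only the reporting rule can noticeably alter model rankings.

Abstract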

Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.
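
A small sketch shows how a reporting rule alone can move a headline score for fixed evidence. The record fields, status values, and the two rules below are hypothetical and are not the rollout-card schema or the Ergon API; they only illustrate why failed, errored, and skipped runs need to be declared next to the score.

```python
from collections import Counter

# Five hypothetical rollout records for one system on one benchmark.
rollouts = [
    {"task": "t1", "status": "ok", "success": True},
    {"task": "t2", "status": "ok", "success": False},
    {"task": "t3", "status": "error", "success": False},
    {"task": "t4", "status": "skipped", "success": False},
    {"task": "t5", "status": "ok", "success": True},
]

def success_rate(records, count_non_ok_as_failure):
    """Task-success rate under one of two reporting rules."""
    if count_non_ok_as_failure:
        scored = records                                     # strict: every run counts
    else:
        scored = [r for r in records if r["status"] == "ok"]  # lenient: drop non-ok runs
    return sum(r["success"] for r in scored) / len(scored)

def drops_manifest(records):
    """Count the runs a lenient rule would silently drop, by status."""
    return dict(Counter(r["status"] for r in records if r["status"] != "ok"))

strict = success_rate(rollouts, count_non_ok_as_failure=True)
lenient = success_rate(rollouts, count_non_ok_as_failure=False)
manifest = drops_manifest(rollouts)
```

Here the same five rollouts yield 40% under the strict rule and about 67% under the lenient one. Publishing the records together with the rule and the drops manifest lets a reader recompute either number.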

2605.12128 2026-05-13 cs.CL cs.CY

Metaphor Is Not All Attention Needs

Olga Sorokoletova, Francesco Giarrusso, Giacomo De Luca, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Marcello Galisai, Vincenzo Suriani, Daniele Nardi

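AI summary: This paper studies how literary prompts bypass the safety mechanisms of large language models and why they succeed. By analyzing attention patterns, the authors find that models accurately distinguish poetic from prose formats but that jailbreak success within each format is hard to predict from those patterns. The results indicate that literary jailbreaks succeed not because the model fails to recognize the format, but because stylistic irregularities change how the prompt is processed and avoid the lexical triggers targeted during post-training. The finding matters for building more robust safety mechanisms.

Abstract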

Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual poetic devices and their combinations, construct an interpretable vector representation of attention maps, cluster these representations, and train linear probes to predict safety outcomes and literary format. Our results show that models distinguish poetic from prose formats with high accuracy, yet struggle to predict jailbreak success within each format. Clustering further reveals clear separation by literary format, but not by safety label. These findings indicate that jailbreak success is not caused by a failure to recognize poetic formatting; rather, poetic prompts induce distinct processing patterns that remain largely independent of harmful-content detection. Overall, literary jailbreaks appear to misalign large language models not through any single poetic device, but through accumulated stylistic irregularities that alter prompt processing and avoid lexical triggers considered during post-training. This suggests that robustness requires safety mechanisms that account for style-induced shifts in model behavior. We use Qwen3-14B as a representative open-weight case study.

2605.12122 2026-05-13 cs.LG cs.AI cs.CV

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

Hyeonjin Kim, Hangyeol Jung, Heechan Yun, Sungjun Yun, Dong-Jun Han

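AI summary: This paper studies how to remove specific concepts from text-to-image diffusion models and proposes a method called SAEParate. It introduces a concept-aware contrastive objective that organizes latent representations into concept-specific clusters, enabling more precise concept suppression with less interference during unlearning. The authors also strengthen the encoder to increase its expressive capacity under the separation objective; experiments show state-of-the-art unlearning performance, with especially strong gains in joint style-object unlearning.

Comments 40 pages, 23 figures

Abstract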

Unlearning specific concepts in text-to-image diffusion models has become increasingly important for preventing undesirable content generation. Among prior approaches, sparse autoencoder (SAE)-based methods have attracted attention due to their ability to suppress target concepts through lightweight manipulation of latent features, without modifying model parameters. However, SAEs trained with sparse reconstruction objectives do not explicitly enforce concept-wise separation, resulting in shared latent features across concepts. To address this, we propose SAEParate, which organizes latent representations into concept-specific clusters via a concept-aware contrastive objective, enabling more precise concept suppression while reducing unintended interference during unlearning. In addition, we enhance the encoder with a GeLU-based nonlinear transformation to increase its expressive capacity under this separation objective, enabling a more discriminative and disentangled latent space. Experiments on UnlearnCanvas demonstrate state-of-the-art performance, with particularly strong gains in joint style-object unlearning, a challenging setting where existing methods suffer from severe interference between target and non-target concepts.

2605.12120 2026-05-13 cs.AI

To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

Fangyi Yu, Nabeel Seedat, Jonathan Richard Schwarz, Andrew M. Bean

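AI summary This study examines whom language models align with when they face conflicting demands from users, institutional authorities, and professional norms in high-stakes professional settings. Testing ten frontier models on 7,136 scenarios across the legal and medical domains, it finds that models often disregard professional standards during task execution, and that their prioritization of users, authorities, and standards is unstable across domains and models. The study identifies knowledge omission as the main mechanism behind violations of professional standards: even when a model's internal reasoning has identified the relevant knowledge, it may selectively leave it out of the external output, producing harmful results.

Details
English abstract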

Language models deployed in high-stakes professional settings face conflicting demands from users, institutional authorities, and professional norms. How models act when these demands conflict reveals a principal hierarchy -- an implicit ordering over competing stakeholders that determines, for instance, whether a medical AI receiving a cost-reduction directive from a hospital administrator complies at the expense of evidence-based care, or refuses because professional standards require it. Across 7,136 scenarios in legal and medical domains, we test ten frontier models and find that models frequently fail to adhere to professional standards during task execution, such as drafting, when user instructions conflict with those standards -- despite adequately upholding them when users seek advisory guidance. We further find that the hierarchies between user, authority, and professional standards exhibited by these models are unstable across medical and legal contexts and inconsistent across model families. When failing to follow professional standards, the primary failure mechanism is knowledge omission: models that demonstrably possess relevant knowledge produce harmful outputs without surfacing conflicting knowledge. In a particularly troubling instance, we find that a reasoning model recognizes the relevant knowledge in its reasoning trace -- e.g., that a drug has been withdrawn -- yet suppresses this in the user-facing answer and proceeds to recommend the drug under authority pressure anyway. Inconsistent alignment across task framing, domain, and model families suggests that current alignment methods, including published alignment hierarchies, are unlikely to be robust when models are deployed in high-stakes professional settings.

2605.12112 2026-05-13 cs.CV

When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy

Xiaofeng Tan, Jun Liu, Bin-Bin Gao, Yuanting Fan, Xi Jiang, Chengjie Wang, Hongsong Wang, Feng Zheng

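AI summary In reinforcement-learning-based alignment of text-to-image generation models, a policy entropy constraint is commonly used to preserve diversity, but this approach fails in flow models, leading to a severe drop in the diversity of generated results. Theoretical and empirical analysis shows that in flow models policy entropy stays constant while perceptual diversity collapses, owing to the fixed noise schedule and the mode-seeking nature of policy gradients. The study therefore introduces perceptual entropy to capture diversity in a perceptual space and designs two entropy-regularization strategies that improve generation quality and diversity; experiments show that they outperform existing methods on multiple benchmarks.

Details
English abstract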

RLHF is widely used to align flow-matching text-to-image models with human preferences, but often leads to severe diversity collapse after fine-tuning. In RL, diversity is often assumed to correlate with policy entropy, motivating entropy regularization. However, we show this intuition breaks in flow models: policy entropy remains constant, even while perceptual diversity collapses. We explain this mismatch both theoretically and empirically: the constant entropy arises from the fixed, pre-defined noise schedule, while the diversity collapse is driven by the mode-seeking nature of policy gradients. As a result, policy entropy fails to prevent the model from converging to a narrow high-reward region in the perceptual space. To this end, we introduce perceptual entropy that captures diversity in a perceptual space and maintains the property of standard entropy. Building upon this insight, we propose two entropy-regularized strategies, Perceptual Entropy Constraint and Perceptual Constraints on Generation Space, to preserve perceptual diversity and improve the quality. Experiments across two base models, neural and rule-based rewards, and three perceptual spaces demonstrate consistent gains in the quality-diversity trade-off; PEC achieves the best overall score of 0.734 (vs. baseline's 0.366); a complementary setting of PEC further reaches a diversity average of 0.989 (vs. baseline's 0.047). Our project page (https://xiaofeng-tan.github.io/projects/PEC) is publicly available.

2605.12111 2026-05-13 cs.AI cs.DS

Adaptive Multi-Round Allocation with Stochastic Arrivals

Yuqi Pan, Davin Choo, Haichuan Wang, Milind Tambe, Alastair van Heerden, Cheryl Johnson

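AI summary This paper studies a multi-round resource allocation problem motivated by adaptive network recruitment, in which a limited budget of homogeneous resources is allocated over multiple rounds to individuals with stochastic referral capacity; successful referrals create future decision opportunities, while giving additional resources to the same individual yields diminishing marginal returns. To handle the intractable dynamic program in the multi-round setting, the authors introduce a population-level surrogate value function that depends only on the remaining budget and the frontier size, yielding an exact dynamic programming algorithm whose complexity is polynomial in the total budget. The authors also analyze robustness under model misspecification and give a multi-round error bound that decomposes into a single-round frontier error and a population-level transition error.

Comments Accepted into ICML 2026

Details
English abstract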

We study a sequential resource allocation problem motivated by adaptive network recruitment, in which a limited budget of identical resources must be allocated over multiple rounds to individuals with stochastic referral capacity. Successful referrals endogenously generate future decision opportunities while allocating additional resources to an individual exhibits diminishing returns. We first show that the single-round allocation problem admits an exact greedy solution based on marginal survival probabilities. In the multi-round setting, the resulting Bellman recursion is intractable due to the stochastic, high-dimensional evolution of the frontier. To address this, we introduce a population-level surrogate value function that depends only on the remaining budget and frontier size. This surrogate enables an exact dynamic program via truncated probability generating functions, yielding a planning algorithm with polynomial complexity in the total budget. We further analyze robustness under model misspecification, proving a multi-round error bound that decomposes into a tight single-round frontier error and a population-level transition error. Finally, we evaluate our method on real-world inspired recruitment scenarios.
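
To make the single-round greedy idea concrete, here is a minimal sketch under a simplified model that is an assumption of this example, not the paper's formulation: individual `i` succeeds independently with probability `p_i` per unit, so the chance of at least one success with `k` units is `1 - (1 - p_i)**k` and the marginal gain of the next unit is `p_i * (1 - p_i)**k`. Because these gains shrink with `k`, handing out units one at a time to the largest marginal gain maximizes the expected number of individuals with at least one success in this toy model.

```python
import heapq

def greedy_allocate(success_probs, budget):
    """Assign `budget` identical units one at a time to the largest marginal gain.

    Hypothetical helper: the per-unit success model is an illustrative assumption.
    """
    allocation = [0] * len(success_probs)
    # Max-heap via negated gains; the first unit given to individual i gains p_i.
    heap = [(-p, i) for i, p in enumerate(success_probs)]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        neg_gain, i = heapq.heappop(heap)
        if -neg_gain <= 0.0:
            break  # no remaining unit adds any value
        allocation[i] += 1
        # Marginal gain of the next unit for i: p_i * (1 - p_i) ** k_i.
        p = success_probs[i]
        heapq.heappush(heap, (-(p * (1.0 - p) ** allocation[i]), i))
    return allocation
```

For example, `greedy_allocate([0.5, 0.2], 3)` gives two units to the first individual (gains 0.5 and 0.25) and one to the second (gain 0.2).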

2605.12106 2026-05-13 cs.AI

Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization

Peipei Xu, SiYuan Ma, Yaohua Liu, Yu Wu, Guanliang Liu, Yang Zhang, Yong Liu

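AI summary This study explores how to use large language models to generate Pareto fronts for constrained bi-objective convex optimization problems. It proposes DIPS, an end-to-end framework that fine-tunes a large language model to generate continuous decision vectors approximating the Pareto front directly from a textual description. DIPS combines techniques such as numerical token initialization and staged curriculum optimization, and achieves results close to the reference fronts across several problem families, demonstrating the potential of large language models for continuous Pareto-front approximation.

Comments 31 pages

Details
English abstract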

Generating feasible Pareto fronts for constrained bi-objective continuous optimization is central to multi-criteria decision-making. Existing methods usually rely on iterative scalarization, evolutionary search, or problem-specific solvers, requiring repeated optimization for each instance. We introduce DIPS, an end-to-end framework that fine-tunes large language models as amortized Pareto-front generators for constrained bi-objective convex optimization. Given a textual problem description, DIPS directly outputs an ordered set of feasible continuous decision vectors approximating the Pareto front. To make continuous optimization compatible with autoregressive language modeling, DIPS combines a compact discretization scheme, Numerically Grounded Token Initialization for new numerical tokens, and Three-Phase Curriculum Optimization, which progressively aligns structural validity, feasibility, and Pareto-front quality. Across five families of constrained bi-objective convex problems, a fine-tuned 7B-parameter model achieves normalized hypervolume ratios of 95.29% to 98.18% relative to reference fronts. With vLLM-accelerated inference, DIPS solves one instance in as little as 0.16 seconds and outperforms general-purpose and reasoning LLM baselines under the evaluated setting. These results suggest that LLMs can serve as effective amortized generators for continuous Pareto-front approximation.
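
The hypervolume ratio reported above can be illustrated with a small sketch for two minimized objectives. The function names, the choice of reference point, and the plain ratio used for normalization are assumptions of this example; the paper's exact normalization may differ.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` (both objectives minimized) and bounded by `ref`."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = sorted(tuple(p) for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:  # ascending in the first objective
        if f2 < best_f2:  # point is non-dominated among those seen so far
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area

def hypervolume_ratio(approx_front, reference_front, ref):
    """Hypervolume of an approximate front relative to a reference front.

    Assumes the reference front has non-zero hypervolume with respect to `ref`.
    """
    return hypervolume_2d(approx_front, ref) / hypervolume_2d(reference_front, ref)
```

A ratio close to 1 means the generated front covers almost as much of the objective space as the reference front, which is how figures such as 95.29% to 98.18% are typically read.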

2605.12105 2026-05-13 cs.AI

Autonomy and Agency in Agentic AI: Architectural Tactics for Regulated Contexts

Damir Safin, Dian Balta

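AI summary Deploying agentic AI systems in regulated environments requires systematic consideration of two design dimensions: agency and autonomy. This paper proposes a two-dimensional design space that organizes both dimensions into five operational levels, makes their coupling explicit, and proposes six architectural tactics for adjusting a system's position in that space. It also analyzes five parameters that affect deployment outcomes, providing a theoretical framework and practical guidance for compliance-oriented agentic AI design.

Details
English abstract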

Deploying agentic AI in regulated contexts requires principled reasoning about two design dimensions: agency (what the system can do) and autonomy (how much it acts without human involvement). Though often treated independently, they are coupled: at higher autonomy, human error correction is less available, so reliable operation requires constraining agency accordingly; compliance requirements reinforce this by mandating human involvement as action consequences grow. Yet no established approach addresses them jointly, leaving practitioners without a principled basis for reasoning about oversight, action consequences, and error correction. This work introduces a two-dimensional design space in which both dimensions are organised into five operational levels, making the coupling explicit and navigable. Autonomy ranges from human-commanded operation (L1) to fully autonomous monitoring (L5); agency ranges from reasoning over supplied context (L1) to committed writes to authoritative records (L5). Building on this space, we propose six architectural tactics--checkpoints, escalation, multi-agent delegation, tool provisioning, tool fencing, and write staging--for adjusting a deployment's position within it. The tactics are grounded in two worked examples from public-sector contexts, illustrating how they apply under realistic compliance constraints. We further examine five deployment parameters--model capability, agent architecture, tool fidelity, workflow bottlenecks, and evaluation--that shape what is achievable at any configuration independently of agency and autonomy. Together, the design space, tactics, and deployment parameters provide a shared vocabulary for principled, compliance-aware agentic AI design in which responsibility, auditability, and reversibility are explicit design considerations rather than properties that must be retrofitted after deployment.

2605.12096 2026-05-13 cs.CL

Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward

Nigar Alishzade, Gulchin Abdullayeva

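AI summary This paper discusses the challenges and future directions of recognition and translation for low-resource sign languages such as Azerbaijan Sign Language. Analyzing related initiatives worldwide, the study distills eight actionable lessons, proposes three paradigm shifts (toward data-centric AI, signer-adaptive systems, and task-specific evaluation), and lays out a technical roadmap for Azerbaijan Sign Language based on lightweight MediaPipe architectures and community-validated annotations. It stresses centering Deaf communities and promoting interdisciplinary collaboration to ensure cultural fit and practical value.

Details
English abstract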

Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review synthesizes literature on sign language recognition and translation for under-resourced languages, using Azerbaijan Sign Language (AzSL) as a case study. Analysis of global initiatives extracts eight actionable lessons, including community co-design, dialectal diversity capture, and privacy-preserving pose-based representations. Turkic sign languages (Kazakh, Turkish, Azerbaijani) receive special attention, as linguistic proximity enables effective transfer learning. We propose three paradigm shifts: from architecture-centric to data-centric AI, from signer-independent to signer-adaptive systems, and from reference-based to task-specific evaluation metrics. A technical roadmap for AzSL leverages lightweight MediaPipe-based architectures, community-validated annotations, and offline-first deployment. Progress requires sustained interdisciplinary collaboration centered on Deaf communities to ensure cultural authenticity, ethical governance, and practical communication benefit.

2605.12090 2026-05-13 cs.RO cs.CL cs.CV

World Action Models: The Next Frontier in Embodied AI

Siyin Wang, Junhao Shi, Zhaoyang Fu, Xinzhe He, Feihong Liu, Chenchen Yang, Yikang Zhou, Zhaoye Fei, Jingjing Gong, Jinlan Fu, Mike Zheng Shou, Xuanjing Huang, Xipeng Qiu, Yu-Gang Jiang

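AI summary Vision-Language-Action (VLA) models show strong semantic generalization in embodied policy learning, but they mainly learn reactive observation-to-action mappings and do not explicitly model how the physical world evolves under intervention. To address this, research has integrated predictive models of environment dynamics into the action generation pipeline, forming a new paradigm, World Action Models (WAMs), which targets the joint distribution over future states and actions. This survey systematically reviews the state of WAM research, defines its core concepts, distinguishes WAMs from related models, and categorizes methods by architecture, learning objective, and application scenario; it also analyzes the data ecosystem and evaluation methods, providing a clear framework and future directions for the field.

Details
English abstract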

Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.

2605.12087 2026-05-13 cs.AI cs.MA

Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems

Josh Rosen, Seth Rosen

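AI summary Many AI systems operate around a loop of model reasoning, tool calls, and observation of results, but the artifacts produced along the way often exist only as transient state and are hard to trace and reuse. This paper proposes treating intermediate artifacts as a core part of the system, emphasizing that they should be structured, traceable, and revisable so that humans or agents can later review and improve them. The contribution is a systems-level data model that clearly distinguishes intermediate artifacts from chat transcripts, chain-of-thought, and the like, and provides a conceptual basis for artifact updates, versioning, and quality evaluation, thereby improving the maintainability and traceability of AI-generated work.

Comments 18 pages, 1 figure, 3 tables

Details
English abstract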

Many AI systems are organized around loops in which models reason, call tools, observe results, and continue until a task is complete. These systems often produce final artifacts such as memos, plans, recommendations, and analyses, while the intermediate work that shaped those outputs remains ephemeral. For multi-step, revisable AI work, final artifacts are often lossy projections over upstream state. We argue that such systems should preserve durable, inspectable intermediate artifacts: typed, structured, addressable, versioned, dependency-aware, authoritative, and consumable by downstream computation. These artifacts are not the model's private chain-of-thought. They are maintained work products such as evidence maps, claim structures, criteria, assumptions, plans, transformation rules, synthesis procedures, unresolved tensions, and partial products that later humans and agents can inspect, revise, supersede, and improve. The contribution is a systems-level data model. We distinguish intermediate artifacts from chat transcripts, memory, hidden chain-of-thought, narration, thinking, and final answers; formalize additive and superseding update semantics with explicit current-state resolution; describe how artifact lineage supports durable intermediate state across revisions; and argue that evaluation must target maintained-state quality, not only final-output quality. The claim is not that artifacts make models smarter. It is that durable intermediate artifacts make AI-generated work more inspectable, revisable, and maintainable over time.

2605.12084 2026-05-13 cs.RO cs.AI cs.IT cs.LG cs.SY eess.SY math.IT

Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

Youwei Yu, Jionghao Wang, Zhengming Yu, Wenping Wang, Lantao Liu

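AI summary This paper studies how to design learnable information-theoretic objectives for robot exploration so as to reduce uncertainty in model parameters more effectively. The authors propose Quasi-Optimal Experimental Design (QOED), an adaptive information objective grounded in optimal experimental design, which analyzes the eigenspace of the Fisher information matrix to identify observable parameter directions and suppress interference from irrelevant parameters, thereby improving the exploration strategy. Experiments show that the method significantly improves exploration efficiency and policy performance on navigation and manipulation tasks.

Details
English abstract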

Designing learnable information-theoretic objectives for robot exploration remains challenging. Such objectives aim to guide exploration toward data that reduces uncertainty in model parameters, yet it is often unclear what information the collected data can actually reveal. Although reinforcement learning (RL) can optimize a given objective, constructing objectives that reflect parametric learnability is difficult in high-dimensional robotic systems. Many parameter directions are weakly observable or unidentifiable, and even when identifiable directions are selected, omitted directions can still influence exploration and distort information measures. To address this challenge, we propose Quasi-Optimal Experimental Design (QOED), an adaptive information objective grounded in optimal experimental design. QOED (i) performs eigenspace analysis of the Fisher information matrix to identify an observable subspace and select identifiable parameter directions, and (ii) modifies the exploration objective to emphasize these directions while suppressing nuisance effects from non-critical parameters. Under bounded nuisance influence and limited coupling between critical and nuisance directions, QOED provides a constant-factor approximation to the ideal information objective that explores all parameters. We evaluate QOED on simulated and real-world navigation and manipulation tasks, where identifiable-direction selection and nuisance suppression yield performance improvements of 35.23% and 21.98%, respectively. When integrated as an exploration objective in model-based policy optimization, QOED further improves policy performance over established RL baselines.
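
The eigenspace step can be written out in generic notation. This is only a sketch of one standard way to restrict an information objective to well-observed directions: the threshold $\tau > 0$, the index set $\mathcal{S}_\tau$, and the log-determinant (D-optimality) criterion are assumptions of this example, not definitions taken from the paper.

```latex
% Illustrative notation only; \tau and the log-det criterion are assumptions.
F(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log p(y \mid \theta)\,
            \nabla_\theta \log p(y \mid \theta)^{\top} \right]
          = \sum_{i=1}^{d} \lambda_i\, v_i v_i^{\top},
\qquad \lambda_1 \ge \dots \ge \lambda_d \ge 0,
\\
\mathcal{S}_\tau = \{\, i : \lambda_i \ge \tau \,\},
\qquad V_{\mathcal{S}} = [\, v_i \,]_{i \in \mathcal{S}_\tau},
\\
\log\det\!\left( V_{\mathcal{S}}^{\top} F(\theta)\, V_{\mathcal{S}} \right)
  = \sum_{i \in \mathcal{S}_\tau} \log \lambda_i .
```

Directions with eigenvalues below $\tau$ are treated as weakly observable and excluded, so the restricted objective stays finite even when $F(\theta)$ is singular.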

2605.12079 2026-05-13 cs.LG

Elicitation-Augmented Bayesian Optimization

Alvar Haltia, Ville Hyvönen, Samuel Kaski

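AI summary This paper studies how to make better use of tacit domain knowledge in Bayesian optimization with a human expert in the loop. Whereas conventional methods rely on the expert explicitly quantifying their knowledge, this paper elicits the expert's tacit judgements through pairwise comparisons and treats them as noisy evidence about objective function values. It proposes a cost-aware value-of-information acquisition function that combines direct observations with pairwise queries and adaptively balances the two information sources under different query costs, improving optimization efficiency.

Details
English abstract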

Human-in-the-loop Bayesian optimization (HITL BO) methods utilize human expertise to improve the sample-efficiency of BO. Most HITL BO methods assume that a domain expert can quantify their knowledge, for instance by pinpointing query locations or specifying their prior beliefs about the location of the maximum as a probability distribution. However, since human expertise is often tacit and cannot be explicitly quantified, we consider a setting where domain knowledge of an expert is elicited via pairwise comparisons of designs. We interpret the expert's pairwise judgements as noisy evidence about the values of the observable objective function and develop a principled method for combining the information obtained via direct observations and pairwise queries. Specifically, we derive a cost-aware value-of-information acquisition function that balances direct observations against pairwise queries. The proposed method approaches the convex hull of the trajectories of the individual information sources: when pairwise queries are cheap it substantially improves sample-efficiency over observation-only BO, and when pairwise queries are costly or noisy, it recovers the performance of standard BO by relying on direct observations alone.
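
One common way to treat pairwise judgements as noisy evidence about objective values is a probit likelihood, sketched below. The abstract does not state which likelihood the authors use, so the model, the function names, and the `noise_std` parameter are all assumptions of this example: each design's perceived value is its objective value plus independent Gaussian noise, and the expert prefers the design with the larger perceived value.

```python
import math

def preference_probability(f_a, f_b, noise_std=1.0):
    """P(expert prefers a over b) under an assumed probit model."""
    # The difference of two independent N(0, noise_std^2) errors has std sqrt(2) * noise_std.
    z = (f_a - f_b) / (math.sqrt(2.0) * noise_std)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z

def comparison_log_likelihood(f_values, comparisons, noise_std=1.0):
    """Log-likelihood of (winner, loser) index pairs given objective values."""
    return sum(
        math.log(preference_probability(f_values[winner], f_values[loser], noise_std))
        for winner, loser in comparisons
    )
```

Under this model, comparisons that agree with the ordering of the objective values receive a higher likelihood, and a larger `noise_std` pulls every preference probability toward 0.5, which corresponds to a less informative expert.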

2605.12077 2026-05-13 cs.CV cs.AI

The Missing GAP: From Solving Square Jigsaw Puzzles to Handling Real World Archaeological Fragments

Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar

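AI summary This paper studies the move from solving standard jigsaw puzzles to the more challenging task of handling real archaeological fragments. To address the reassembly of irregularly shaped, heavily eroded archaeological fragments, the authors introduce the GAP datasets and design PuzzleFlow, a new framework based on ViT and flow matching. The method performs well on reassembling pieces of complex shapes and clearly outperforms existing methods.

Details
English abstract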

Jigsaw puzzle solving has become an increasingly popular task in the computer vision research community. Recent works have utilized cutting-edge architectures and computational approaches to reassemble groups of pieces into a coherent image, achieving increasingly good results on well-established datasets. However, most of these approaches share a common, restrictive setting: they operate solely on strictly square puzzle pieces. In this work, we introduce GAP, a set of novel jigsaw puzzle datasets containing synthetic, heavily eroded pieces of unrestricted shapes, generated from a learned distribution of real-world archaeological fragments. We also introduce PuzzleFlow, a novel ViT and Flow-Matching based framework for jigsaw puzzle solving, capable of handling complex puzzle pieces and demonstrating superior performance on GAP when compared to both classic and recent prominent works in this domain.

2605.12074 2026-05-13 cs.CV

BARISTA: A Multi-Task Egocentric Benchmark for Compositional Visual Understanding

Patrick Knab, Orgest Xhelili, Inis Buzi, Drago Andres Guggiana Nilo, Mohd Saquib Khan, Lorenz Kolb, Manuel Scherzer, Kerem Yildirir, Christian Bartelt, Philipp Johannes Schubert

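AI summary BARISTA is a multi-task egocentric benchmark dataset for compositional visual understanding, comprising 185 real-world coffee-preparation videos that cover fully automatic, portafilter-based, and capsule-based workflows. The dataset provides detailed per-frame scene graphs covering object identities, attributes, relations, hand-object interactions, and process steps, from which several zero-shot language tasks are derived, such as phrase grounding, activity recognition, and temporal question answering. BARISTA offers a challenging benchmark for diagnosing model shortcomings in procedural video understanding.

Details
English abstract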

Scene understanding is central to general physical intelligence, and video is a primary modality for capturing both state and temporal dynamics of a scene. Yet understanding physical processes remains difficult, as models must combine object localization, hand-object interactions, relational parsing, temporal reasoning, and step-level procedural inference. Existing benchmarks usually evaluate these capabilities separately, limiting diagnosis of why models fail on procedural tasks. We introduce BARISTA, a densely annotated egocentric dataset and benchmark of 185 real-world coffee-preparation videos covering fully automatic, portafilter-based, and capsule-based workflows. BARISTA provides verified per-frame scene graphs linking persistent object identities to masks, tracks, boxes, attributes, typed relations, hand-object interactions, activities, and process steps. From these graphs, we derive zero-shot language-based tasks spanning phrase grounding, hand-object interaction recognition, referring, activity recognition, relation extraction, and temporal visual question answering. Experiments reveal strong variation across task families and no consistently dominant model family, positioning BARISTA as a challenging diagnostic benchmark for procedural video understanding. Code and dataset available at https://huggingface.co/datasets/ramblr/BARISTA.