arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20101 2026-05-20 cs.RO

Topology-Optimized Pneumatic Soft Actuator: Design and Experimental Validation

Sumit Mehta, Konstantinos Poulios

AI summary: This paper designs soft elastomeric pneumatic actuators through nonlinear topology optimization and validates their performance experimentally.

Comments
20 pages, 13 figures

Abstract

This paper demonstrates the computational design of soft elastomeric pneumatic actuators using nonlinear topology optimization. An existing density- and porohyperelasticity-based topology optimization framework was extended from 2D to 3D and used to generate two manufacturable actuator designs, which were then studied numerically and experimentally. For both designs, the objective was to maximize the bending response for a prescribed actuation pressure under two different allowable strain limits. A key advantage of the employed topology optimization framework is that it can consistently, during the optimization, account for the very large deformations induced upon pressurization. The two optimized 3D designs were fabricated using stereolithography and experimentally tested to validate their performance.

2605.20090 2026-05-20 cs.CV

MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling

Zhiping Yu, Chenyang Liu, Jinqi Cao, Qinzhe Yang, Siwei Yu, Zhengxia Zou, Zhenwei Shi

AI summary: This paper proposes MetaEarth-MM, a unified framework for multi-modal remote sensing image generation that supports joint generation of multiple modalities and translation between arbitrary modalities, demonstrating strong generative capability and broad applicability to multi-modal remote sensing observations.

Abstract

Multi-modal remote sensing images are vital for Earth observation, yet complete paired observations are often scarce in practice. Existing generative methods commonly address this problem through isolated pairwise modality translation, but their versatility and scalability remain limited as the number of modalities and generation tasks increases. Here, we develop a generative foundation model MetaEarth-MM for multi-modal remote sensing imagery, enabling paired joint generation and any-to-any translation across five modalities within a unified model. Recognizing the intrinsic scene consistency underlying multi-modal observations, we introduce a scene-centered joint modeling paradigm in MetaEarth-MM. Unlike previous methods that rely on direct appearance-level cross-modal mapping, our model organizes the generation around the underlying scene content. Specifically, MetaEarth-MM adopts a decoupled architecture that first infers a latent scene representation from available observations, and then generates target modalities conditioned on this intermediate state. To support training, we further construct EarthMM, a large-scale dataset comprising 2.8 million multi-resolution global images with 2.2 million aligned pairs. Extensive experiments demonstrate that MetaEarth-MM not only exhibits strong generative capability and robust generalization across diverse generation tasks, but also supports downstream tasks at both data and representation levels, highlighting its potential as a general foundation model for cross-modal Earth observation. The code and dataset will be available at https://github.com/YZPioneer/MetaEarth-MM.

2605.20088 2026-05-20 cs.LG cs.AI

INSHAPE: Instance-Level Shapelets for Interpretable Time-Series Classification

Seongjun Lee, Seokhyun Lee, Changhee Lee

AI summary: This paper proposes INSHAPE, a framework that discovers variable-length discriminative temporal patterns specific to each time series. It addresses two weaknesses of earlier methods, the mismatch between population-level patterns and instance-specific features and the neglect of temporal dependencies, and improves both the interpretability and the predictive performance of time-series classification.

Comments
Accepted to IJCAI 2026. 25 pages

Abstract

Discovering shapelets -- i.e., discriminative temporal patterns within time series -- has been widely studied to address the inherent complexity of time-series classification (TSC) and to make model decision-making processes more transparent. However, existing methods primarily focus on population-level shapelets optimized across the entire dataset, which leads to two fundamental limitations: (i) population-level patterns often misalign with instance-specific features, resulting in suboptimal performance and potentially misleading interpretations, and (ii) most methods treat shapelets as independent entities, overlooking important temporal dependencies and interactions among multiple patterns. To address these limitations, we propose INSHAPE, an interpretable TSC framework that discovers variable-length, discriminative temporal patterns specific to each time series. INSHAPE identifies these patterns as non-overlapping segments and models their temporal dependencies, thereby providing clear instance-level interpretations while achieving strong predictive performance. Furthermore, INSHAPE bridges local and global interpretability through a bottom-up approach, aggregating instance-level shapelets into prototypical (population-level) shapelets. Extensive experiments on 128 UCR and 30 UEA benchmark datasets show that INSHAPE consistently outperforms state-of-the-art shapelet-based methods while providing more intuitive and interpretable insights.
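
As a point of reference for what "shapelet" means operationally, the following is a minimal sketch of the classic shapelet-to-series distance: the smallest Euclidean distance between a candidate pattern and any same-length window of a series. It is not the INSHAPE method, which the abstract does not specify at this level; the function name and return format are assumptions made for illustration.

```python
import math


def shapelet_distance(series, shapelet):
    """Return (distance, start) for the window of `series` closest to `shapelet`.

    Hypothetical helper: uses plain Euclidean distance with no normalization.
    """
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best, best_start = math.inf, -1
    # Slide the shapelet over every window of the same length.
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + i] - shapelet[i]) ** 2 for i in range(m)))
        if d < best:
            best, best_start = d, start
    return best, best_start
```

A small distance indicates that the pattern occurs in the series, and the returned start index shows where, which is what makes shapelet-based classifiers readable.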

2605.20086 2026-05-20 cs.NE cs.AI cs.LG

What Do Evolutionary Coding Agents Evolve?

Nico Pelleriti, Sree Harsha Nelaturu, Zhanke Zhou, Zongze Li, Max Zimmer, Bo Han, Sebastian Pokutta

AI summary: This paper studies how evolutionary coding agents generate, modify, and select code from task-specific feedback in mathematical discovery and algorithm design. Using the EvoTrace dataset and the EvoReplay methodology, it analyzes the mechanisms at work during evolution, finds that most score gains come from a few edit types, and identifies a deterministic cycling pattern.

Comments
28 pages, 12 figures, 12 tables

Abstract

Recent work pairs LLMs with evolutionary search to iteratively generate, modify, and select code using task-specific feedback. These systems have produced strong results in mathematical discovery and algorithm design, yet a fundamental question remains: what do they actually evolve? Progress is typically summarized by the best score a run reaches under a task-specific evaluator, but that score can reflect several different mechanisms: new algorithmic structure, re-tuning an existing strategy, recombining ideas already in the model's internal knowledge, or overfitting to the evaluator. Distinguishing these mechanisms requires inspecting the search process itself, not only its final outcome. We introduce EvoTrace, a dataset of evolutionary coding traces spanning four evolutionary frameworks, reasoning and non-reasoning models, and 16 tasks across mathematics and algorithm design. To analyze these traces, we develop EvoReplay, a replay-based methodology that reconstructs the local search states behind high-scoring solutions and tests controlled interventions, including adjusting constants, removing program components and substituting models or prompting contexts. We annotate every code edit in EvoTrace with one of nine recurring edit types using an LLM-as-judge pipeline validated against blind human re-annotation. Across EvoTrace, most score gains come from a small subset of these edit types. We further find a deterministic cycling pattern: about 30% of code lines added during search are byte-identical re-introductions of previously-deleted lines, present throughout nearly every run. These results show that benchmark gains in evolutionary coding agents can arise from qualitatively different mechanisms, only some of which correspond to new algorithmic structure. EvoTrace enables more diagnostic evaluation of evolutionary coding agents beyond final benchmark scores.
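
The cycling statistic in the abstract (added lines that are byte-identical to previously deleted lines) can be illustrated with a small sketch. It assumes a single linear chain of program versions and a set-based line diff, both simplifications chosen here for brevity; the paper's actual measurement procedure is not described in the abstract, and `reintroduction_rate` is a hypothetical name.

```python
def reintroduction_rate(versions):
    """Fraction of added lines that exactly match a line deleted earlier.

    `versions` is a list of program versions in search order, each a list of
    source lines. Assumption: lines are compared as sets, so duplicates and
    line positions are ignored.
    """
    deleted_before = set()
    added_total = 0
    reintroduced = 0
    for prev, curr in zip(versions, versions[1:]):
        prev_set, curr_set = set(prev), set(curr)
        added = curr_set - prev_set
        removed = prev_set - curr_set
        added_total += len(added)
        # Count additions that exactly match something deleted in an earlier step.
        reintroduced += sum(1 for line in added if line in deleted_before)
        deleted_before |= removed
    return reintroduced / added_total if added_total else 0.0
```

A value near 0.3 on real traces would correspond to the roughly 30% figure reported above, under these simplifying assumptions.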

2605.20085 2026-05-20 cs.CV

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

Yifan Li, Xinyu Zhou, Yunhao Ge, Yu Kong

AI summary: This paper formalizes a new visual trajectory prediction setting, SP-VTP, in which spatial prompts define the task objective, and proposes SPOT, a model combining a task encoder, an observation encoder, and a trajectory generator that improves cross-scene trajectory prediction for egocentric manipulation.

Abstract

Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by spatially indicating what to move and where to place it. Addressing the vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). This novel setting utilizes initial spatial prompts (like bounding boxes or points) to define task objectives, tasking the model with forecasting future end-effector trajectories from egocentric streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static, while the scene configuration evolves over time. To solve this problem, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for current visual and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted or single-source prompted baselines. Together, EgoSPT and SPOT establish SP-VTP as a new spatial prompting problem that offers a simple and scalable task condition for egocentric manipulation.

2605.20084 2026-05-20 cs.CL cs.AI

BalanceRAG: Joint Risk Calibration for Cascaded Retrieval-Augmented Generation

Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang

AI summary: This paper proposes BalanceRAG, a joint risk calibration method for cascaded retrieval-augmented generation. It identifies safe operating points on a two-dimensional lattice, enabling risk-adaptive threshold calibration that controls the system-level error rate while retaining more examples, and it extends to multi-risk calibration.

Abstract

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.
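
The cascade described above can be sketched as a two-threshold routing rule together with the empirical quantities a threshold pair is judged on. This is only an illustration of the decision structure: it does not implement the sequential graphical testing that BalanceRAG uses to certify a pair, and the function names, the convention that lower scores mean higher confidence, and the example format are assumptions.

```python
def route(u_llm, u_rag, t_llm, t_rag):
    """Pick the branch that answers a query (lower uncertainty = more confident)."""
    if u_llm <= t_llm:
        return "llm"      # the LLM-only answer is trusted
    if u_rag <= t_rag:
        return "rag"      # escalate to the RAG fallback
    return "abstain"      # neither branch is trusted


def evaluate(examples, t_llm, t_rag):
    """Empirical coverage, selective risk, and retrieval rate for one threshold pair.

    `examples` holds tuples (u_llm, u_rag, llm_correct, rag_correct).
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    accepted = errors = retrievals = 0
    for u_llm, u_rag, llm_ok, rag_ok in examples:
        branch = route(u_llm, u_rag, t_llm, t_rag)
        if branch != "llm":
            retrievals += 1  # retrieval runs whenever the LLM-only answer is rejected
        if branch == "abstain":
            continue
        accepted += 1
        correct = llm_ok if branch == "llm" else rag_ok
        errors += 0 if correct else 1
    n = len(examples)
    return {
        "coverage": accepted / n,
        "risk": errors / accepted if accepted else 0.0,
        "retrieval_rate": retrievals / n,
    }
```

Sweeping `(t_llm, t_rag)` over a grid gives the two-dimensional lattice of operating points mentioned in the abstract; the certification step then decides which of those points are safe at the target risk level.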

2605.20082 2026-05-20 cs.CV cs.AI

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

AI summary: This paper proposes VL-DPO, a framework based on vision-language models that generates preference pairs via zero-shot reasoning and uses them to finetune autonomous driving models for better alignment with human driving preferences. Experiments show improvements over the baseline model on both RFS and ADE.

Comments
Published in International Conference on Robotics and Automation (ICRA), 2026. 8 pages, 6 figures, 4 tables

Abstract

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.
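
For readers unfamiliar with the two quantities named above, here is a minimal sketch of the standard DPO loss for a single preference pair and of average displacement error for a trajectory. It works on plain floats and 2D points rather than model outputs; the `beta` default and both function names are illustrative assumptions, not values or code from the paper.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities.

    Computes -log(sigmoid(margin)), where margin is beta times the difference in
    policy-vs-reference log-ratios between the chosen and rejected rollouts.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def average_displacement_error(predicted, reference):
    """Mean Euclidean distance between matching waypoints of two trajectories."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, r) for p, r in zip(predicted, reference)) / len(predicted)
```

In a setup like the one described, the chosen and rejected rollouts would be the trajectories the VLM prefers and disprefers; the loss falls below log 2 once the policy favors the chosen rollout more strongly than the reference model does.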

2605.20079 2026-05-20 cs.CV cs.AI cs.LG eess.IV

Probability-Conserving Flow Guidance

Parsa Esmati, Junha Hyung, Amirhossein Dadashzadeh, Jaegul Choo, Majid Mirmehdi

AI summary: This paper proposes AdaMaG, a probability-conserving flow guidance method. Analyzing guidance through the continuity equation, it decomposes the guidance effect into a divergence term and a score-parallel term and controls both with a time-dependent schedule and score-parallel attenuation, improving generation quality and reducing hallucinations at no additional inference cost.

Abstract

Diffusion and flow-based generative models dominate visual synthesis, with guidance aligning samples to user input and improving perceptual quality. However, Classifier-Free Guidance (CFG) and extrapolation-based methods are heuristic linear combinations of velocities/scores that ignore the generative manifold geometry, breaking probability conservation and driving samples off the learned manifold under strong guidance. We analyse guidance through the continuity equation and show its effect decomposes into a divergence term and a score-parallel term defined invariantly across parameterisations. We prove the divergence term blows up structurally as sampling approaches the data manifold, motivating a time-dependent schedule alongside score-parallel attenuation. The resulting plug-and-play rule, Adaptive Manifold Guidance (AdaMaG), bounds both terms at no additional inference cost. Finally, we show that most empirical heuristics for reducing saturation or improving generation quality correspond directly to the two terms in our decomposition. Across image generation benchmarks, AdaMaG improves realism, reduces hallucinations, and induces controlled desaturation in high-guidance regimes.
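
The "heuristic linear combination" that the abstract criticizes is the usual classifier-free guidance rule, sketched below on plain lists of floats. This shows only the baseline being analyzed; the AdaMaG schedule and attenuation are not given in the abstract and are not reproduced here. The function name is an assumption.

```python
def cfg_velocity(v_uncond, v_cond, scale):
    """Classifier-free guidance: v_uncond + scale * (v_cond - v_uncond), elementwise.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional one,
    and scale > 1 extrapolates beyond the conditional prediction.
    """
    if len(v_uncond) != len(v_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(v_uncond, v_cond)]
```

With a large `scale`, the output moves far outside the segment between the two predictions, which is the strong-guidance regime where the abstract says samples leave the learned manifold.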

2605.20075 2026-05-20 cs.CL cs.AI

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

Dachuan Shi, Hanlin Zhu, Xiangchi Yuan, Wanjia Zhao, Kejing Xia, Wen Xiao, Wenke Lee

AI summary: This paper proposes CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering: it first produces a draft answer and then performs on-policy thinking conditioned on that answer for reflection and correction. CopT uses continuous embeddings as inference-time contrastive verifiers, contrasting the model's support for the same generated tokens under discrete-token and continuous-embedding inputs to obtain a sequence-level reverse KL estimator of answer reliability. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy.

Comments
Code: https://github.com/sdc17/CopT, Website: https://copt-web.github.io/

Abstract

Chain-of-thought (CoT) is a standard approach for eliciting reasoning capabilities from large language models (LLMs). However, the common CoT paradigm treats thinking as a prerequisite for answering, which can delay access to plausible answers and incur unnecessary token costs even when the model is able to identify an answer before extended thinking, a behavior known as performative reasoning. In this paper, we introduce CopT, a reformulated reasoning pipeline that reverses the usual order of thinking and answering. Instead of thinking before answering, CopT first elicits a draft answer and then invokes subsequent on-policy thinking conditioned on its own draft answer for reflection and correction. To assess whether the draft answer should be trusted, CopT recasts continuous embeddings as inference-time contrastive verifiers. Specifically, it contrasts the model's support for the same generated tokens under discrete-token inputs and continuous-embedding inputs, yielding a sequence-level reverse KL estimator for answer reliability. Our analysis shows that under certain assumptions, the expected estimate equals the mutual information between the unresolved latent state and the emitted answer token, explaining why it captures answer-relevant uncertainty rather than arbitrary uncertainty in the latent state. When the answer is deemed insufficiently reliable, CopT performs further on-policy thinking, where a second KL estimator dynamically controls draft-answer visibility, preserving useful partial information while reducing the risk of being misled by unreliable content. Across mathematics, coding, and agentic reasoning tasks, CopT improves peak accuracy by up to 23% and reduces token usage by up to 57% at comparable or higher accuracy, without any additional training. The code is available at https://github.com/sdc17/CopT.

2605.20074 2026-05-20 cs.LG

Towards Distillation Guarantees under Algorithmic Alignment for Combinatorial Optimization

Thien Le, Melanie Weber

AI summary: This paper studies distillation, the transfer of knowledge from a large model to a more efficient model for deployment, within the algorithmic alignment framework. It analyzes the conditions under which distillation succeeds when the target model is a graph neural network whose architecture is aligned with a dynamic programming algorithm.

Comments
22 pages

Abstract

Distillation transfers knowledge from a large model trained on broad data to a smaller, more efficient model suitable for deployment. In structured prediction settings, prior knowledge about the task can guide the choice of a target architecture that is algorithmically aligned with the underlying problem. Building on recent learning-theoretic analyses of decision-tree (DT) distillation (Boix-Adsera, 2024), we study when distillation succeeds for combinatorial optimization tasks. We focus on the case where the target model is a graph neural network whose architecture is aligned with a dynamic programming (DP) algorithm for the task. Assuming that the source model is sufficiently rich, formalized through the linear representation hypothesis (LRH) (Elhage et al., 2022; Park et al., 2024), we show that the distillation problem can be solved efficiently in the complexity parameters of the DP transition function, represented as a DT. Our results provide a rigorous sufficient condition for successful distillation in the flavour of algorithmic alignment.

2605.20073 2026-05-20 cs.CV

X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing

E O Rodrigues, L O Rodrigues, J J Lima, D Casanova, F Favarim, E R Dosciatti, V Pegorini, L S N Oliveira, F F C Morais

AI summary: This paper proposes a pixel-classification method for vessel segmentation in X-ray angiograms that uses textural features and region growing with a Random Forests classifier, reaching 95.48% accuracy.

Journal ref
Biomedical Physics & Engineering Express 2021

Abstract

This work proposes a pixel-classification approach for vessel segmentation in X-ray angiograms. The proposal uses textural features such as anisotropic diffusion, features based on the Hessian matrix, mathematical morphology and statistics. These features are extracted from the neighborhood of each pixel. The approach also uses the ELEMENT methodology, which consists of creating a pixel-classification controlled by region-growing where the result of the classification affects further classifications of pixels. The Random Forests classifier is used to predict whether the pixel belongs to the vessel structure. The approach achieved the best accuracy in the literature (95.48%), outperforming unsupervised state-of-the-art approaches.
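
The idea of classification controlled by region growing can be sketched as a breadth-first traversal in which only the neighbours of already accepted pixels are ever classified. This is a generic illustration, not the ELEMENT methodology itself: the predicate `is_vessel` stands in for the trained Random Forests classifier and its texture features, and 4-connectivity is an assumption.

```python
from collections import deque


def region_grow(image, seeds, is_vessel):
    """Return the set of (row, col) pixels reached by classifier-gated growth.

    `image` is a list of rows; `is_vessel(image, r, c)` is a hypothetical
    stand-in for a per-pixel classifier.
    """
    height, width = len(image), len(image[0])
    accepted = set()
    visited = set(seeds)
    queue = deque(seeds)
    while queue:
        r, c = queue.popleft()
        if not is_vessel(image, r, c):
            continue  # rejected pixels do not spread the region further
        accepted.add((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append((nr, nc))
    return accepted
```

Because growth stops at rejected pixels, an isolated bright pixel that is not connected to a seed is never classified at all, which is one way earlier decisions affect later ones.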

2605.20072 2026-05-20 cs.AI cs.RO

Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

Oussama Zenkri, Oliver Brock

AI summary: This paper studies how embodied LLM agents behave under different observation inputs and finds that higher-fidelity observations can reduce problem-solving performance. The core method varies the information available to the agent and measures the resulting behavioral changes; the main contribution is showing how perceptual errors and reasoning failures interact.

Comments
Submitted to From Animals to Animats: The 18th International Conference on the Simulation of Adaptive Behavior (SAB)

Abstract

Large Language Models are increasingly proposed as cognitive components for robotic systems, yet their opaque decision processes make it difficult to explain success or failure in closed-loop embodied tasks. Following an empirical AI methodology, we study embodied LLM agents behaviorally by varying the information available to the agent and measuring the resulting changes in behavior. Using the Lockbox, a sequential mechanical puzzle with hidden interdependencies, we evaluate LLMs across RGB, RGB-D, and ground-truth symbolic observations in a physical robotic setup and use controlled simulation to probe the resulting behavior. Counterintuitively, agents perform best under raw RGB input and worst under perfect ground-truth observations. In simulation, we probe this effect by randomly flipping perceived action outcomes and find that moderate noise improves performance, peaking at a 40% flip probability with a 2.85-fold success rate increase over the noise-free baseline. Further analysis links this gain to a reduction in repetitive action loops. These findings suggest that success rates alone are insufficient for evaluating LLMs, as measured performance may reflect the interaction between perceptual errors and reasoning failures rather than robust problem solving.
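
The simulated perturbation described above, randomly flipping the perceived outcome of an action, can be sketched in a few lines. The function name and the boolean encoding of outcomes are assumptions; the Lockbox simulator and the agent loop are not part of this sketch.

```python
import random


def perceive(true_outcome, flip_prob, rng):
    """Return the action outcome as seen by the agent.

    With probability `flip_prob` the boolean outcome is inverted; `rng` is a
    `random.Random` instance so that runs can be reproduced.
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must lie in [0, 1]")
    return (not true_outcome) if rng.random() < flip_prob else true_outcome
```

Setting `flip_prob` to 0.4 reproduces the noise level at which the abstract reports peak performance, while 0.0 gives the noise-free baseline.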

2605.20068 2026-05-20 stat.ML cs.LG

Tail Annealing for Heavy-Tailed Flow Matching

Jean Pachebat

AI summary: This paper proposes a simple way to handle heavy-tailed data: apply a soft-log transform to the data before training and exponentiate samples after generation. A Hill diagnostic decides per coordinate whether to transform, leaving light-tailed margins untouched. This compresses heavy tails into a range that standard flow matching can handle, without heavy-tailed base distributions or architectural modifications.

Comments
18 pages

Abstract

Standard generative models struggle with heavy-tailed data: Lipschitz architectures cannot produce power-law tails from Gaussian noise, and interpolating between heavy-tailed data and Gaussians is ill-posed. We propose a simple fix: apply the soft-log transform $\phi(x) = \mathrm{sign}(x) \cdot \log(1 + |x|)$ coordinate-wise to data before training, then exponentiate samples after generation. A Hill diagnostic decides per-coordinate whether to transform, leaving light-tailed margins untouched at no added complexity. This compresses heavy tails into a range where standard flow matching succeeds, without heavy-tailed base distributions or architectural modifications. We provide theoretical intuition for why this works: the log-transform maps Pareto tails to exponentials, and the induced dynamics implement a form of tail annealing via power transformations. On a 144-configuration multivariate benchmark (3 copulas, $d$ up to 100, 4 tail indices), Log-FM dominates specialized baselines on $W_1$, CVaR$_{99}$, and extreme-quantile metrics, and is the only method with zero severe divergences across 2,880 runs.
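
The transform in the abstract and a Hill-type tail estimate can be sketched with the standard library. The forward map follows the stated formula; the inverse shown is its exact algebraic inverse, which is an assumption about what "exponentiate" means in practice, and the standard Hill estimator stands in for the paper's diagnostic, whose threshold and choice of `k` are not given in the abstract.

```python
import math


def soft_log(x):
    """sign(x) * log(1 + |x|), applied to one coordinate."""
    return math.copysign(math.log1p(abs(x)), x)


def soft_log_inverse(y):
    """Exact inverse of soft_log: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def hill_estimator(samples, k):
    """Hill estimate of the tail index from the k largest absolute values.

    Smaller values indicate heavier tails; `k` is a user-chosen tuning parameter.
    """
    xs = sorted((abs(v) for v in samples), reverse=True)
    if not 0 < k < len(xs) or xs[k] <= 0:
        raise ValueError("need 0 < k < len(samples) and a positive threshold value")
    # Mean log-excess of the top k values over the (k+1)-th largest value.
    mean_log_excess = sum(math.log(xs[i] / xs[k]) for i in range(k)) / k
    if mean_log_excess == 0:
        raise ValueError("top order statistics are all equal; estimate undefined")
    return 1.0 / mean_log_excess
```

One possible use is to transform a coordinate only when its estimated tail index falls below some cutoff, so that light-tailed coordinates pass through unchanged; the cutoff itself would have to come from the paper.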

2605.20066 2026-05-20 cs.CL

Text-to-SPARQL Generation with Reinforcement Learning: A GRPO-based Approach on DBLP

Jann Pfeifer, Debayan Banerjee, Ricardo Usbeck

AI summary: This paper studies reinforcement-learning-based zero-shot Text-to-SPARQL generation in the scholarly domain, training a small instruction-tuned language model with GRPO on DBLP-QuAD and comparing it with a supervised DoRA-finetuned baseline.

Comments
Accepted by NeSy 2026

Abstract

Knowledge graph question answering seeks to translate natural language questions into executable queries over knowledge graphs, but existing approaches often rely on large models or full supervision in the form of gold query annotations. This study examines whether reinforcement learning with outcome-based rewards can train a small instruction-tuned language model to perform zero-shot Text-to-SPARQL generation in the scholarly domain. Group-Relative Policy Optimization (GRPO) is applied to the Qwen3-1.7B model on DBLP-QuAD, using prompts that combine natural language questions with symbolic hints about entities and relations. Training relies on execution feedback, structural constraints, and answer-level rewards, with an additional variant that incorporates gold-query-based shaping. The resulting models are compared to the unmodified zero-shot baseline and to a supervised DoRA-finetuned baseline across answer-level accuracy, execution accuracy, category-wise scores, and generalization to held-out templates. GRPO substantially improves over the zero-shot baseline and exhibits competitive generalization, while supervised DoRA finetuning achieves higher overall accuracy on the same model scale. Ablation analyses indicate that execution-based rewards account for most gains, with additional shaping yielding limited additional benefit, suggesting that outcome-based reinforcement learning is a viable training strategy when gold queries are unavailable for token-level supervision.
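
The core of GRPO is that each sampled output is scored relative to the other outputs drawn for the same prompt. A minimal sketch of that normalization follows; the reward values would come from the execution, structure, and answer-level signals described above, whose exact weighting is not stated in the abstract. The function name and the `eps` stabilizer are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

When all queries sampled for a question receive the same reward, every advantage is zero and that group contributes no learning signal, which is one reason shaped or partial rewards can matter early in training.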

2605.20064 2026-05-20 cs.CV

Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network

Guilherme Santos da Silva, Dalcimar Casanova, Jefferson Tales Oliva, Erick Oliveira Rodrigues

AI summary: This study proposes a deep learning method that uses the pix2pix network to automatically segment and quantify cardiac fat, achieving high-accuracy segmentation of epicardial and mediastinal fat and outperforming existing methods in F1-score and run time.

Journal ref
Medical Engineering & Physics 2024

Abstract

In recent years, research has highlighted the association between increased adipose tissue surrounding the human heart and elevated susceptibility to cardiovascular diseases such as atrial fibrillation and coronary heart disease. However, the manual segmentation of these fat deposits has not been widely implemented in clinical practice due to the substantial workload it entails for medical professionals and the associated costs. Consequently, the demand for more precise and time-efficient quantitative analysis has driven the emergence of novel computational methods for fat segmentation. This study presents a novel deep learning-based methodology that offers autonomous segmentation and quantification of two distinct types of cardiac fat deposits. The proposed approach leverages the pix2pix network, a conditional generative adversarial network primarily designed for image-to-image translation tasks. By applying this network architecture, we aim to investigate its efficacy in tackling the specific challenge of cardiac fat segmentation, despite not being originally tailored for this purpose. The two types of fat deposits of interest in this study are epicardial and mediastinal fat, which are spatially separated by the pericardium. The experimental results demonstrated an average accuracy of 99.08% and an F1-score of 98.73 for the segmentation of the epicardial fat, and an accuracy of 97.90% and an F1-score of 98.40 for the mediastinal fat. These findings reflect the high precision and overlap agreement achieved by the proposed methodology. In comparison to existing studies, our approach exhibited superior performance in terms of F1-score and run time, enabling the images to be segmented in real time.
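
The two metrics reported above can be computed from a predicted and a reference binary mask as in the sketch below. The flattened-list mask format and the function name are assumptions; how the paper aggregates the metrics over images is not stated in the abstract.

```python
def accuracy_and_f1(predicted, reference):
    """Pixel accuracy and F1-score for two flat binary masks of equal length."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("masks must be non-empty and of equal length")
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif r:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(predicted)
    denom = 2 * tp + fp + fn
    # Convention: two empty masks agree perfectly.
    f1 = 2 * tp / denom if denom else 1.0
    return accuracy, f1
```

Accuracy counts background pixels as correct, so it can stay high when the fat region is small; F1 ignores true negatives and is therefore the more telling overlap measure for segmentation.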

2605.20061 2026-05-20 cs.CL

Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents

奖励信念,而非行动:一致性引导的长期智能体信用分配

Wenjie Tang, Minne Li, Sijie Huang, Liquan Xiao, Yuan Zhou

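AI summary: This paper proposes ReBel, an algorithm that models structured belief states to guide policy learning, addressing the credit assignment problem that partial observability causes in long-horizon tasks. Experiments on benchmarks such as ALFWorld and WebShop show higher task success rates and better sample efficiency.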

详情
Comments
10 pages, 4 figures, 3 tables, plus appendix
AI中文摘要

可验证奖励的强化学习(RLVR)是一种有前景的范式,用于提高大语言模型(LLM)智能体在长期交互任务中的表现。然而,在部分可观测环境中,不完整的观察导致智能体信念随时间漂移,而延迟奖励会模糊中间决策的因果影响,加剧时间信用分配的挑战。为此,我们提出ReBel(奖励信念),一种过程级强化学习算法,通过显式建模结构化信念状态来总结交互历史并指导后续策略学习。ReBel引入信念一致性监督,将预测信念与观察反馈之间的差异转换为密集的自监督信号,无需外部步骤注释或验证者。它还采用信念感知分组,比较相似信念状态下的轨迹,产生更稳健且方差更低的优势估计。我们在具有挑战性的长期基准测试上评估了ReBel,包括ALFWorld和WebShop。ReBel在episode级基线GRPO上将任务成功率提高高达20.4个百分点,并将样本效率提高2.1倍。这些结果表明,信念感知的自监督是一种在部分可观测性下可靠长期决策的有前景方向。代码可在:https://github.com/Fateyetian/Rebel.git获取。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address this, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, converting discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and increases sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
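
To illustrate the idea of belief-aware grouping, the sketch below normalizes rewards within groups, in the style of GRPO's group-relative advantage. It assumes that trajectories have already been assigned a hypothetical belief key; how ReBel actually forms groups and measures belief similarity is not specified in the abstract, and the use of the population standard deviation and of `eps` are likewise assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-8):
    """Group-relative advantages.

    samples: list of (group_key, reward) pairs; group_key is a hypothetical
    label for "similar belief state". Returns advantages in input order.
    """
    groups = defaultdict(list)
    for key, reward in samples:
        groups[key].append(reward)
    # Per-group mean and population standard deviation of the rewards.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    return [(r - stats[k][0]) / (stats[k][1] + eps) for k, r in samples]


samples = [("kitchen", 1.0), ("kitchen", 0.0), ("hall", 1.0), ("hall", 1.0)]
advantages = grouped_advantages(samples)
```

In this toy input the two "hall" trajectories have identical rewards, so their advantages are zero; only the "kitchen" group, where outcomes differ, produces a learning signal.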

2605.20055 2026-05-20 cs.SE cs.AI cs.RO

Towards LLM-Assisted Architecture Recovery for Real-World ROS 2 Systems: An Agent-Based Multi-Level Approach to Hierarchical Structural Architecture Reconstruction

面向现实世界ROS 2系统的LLM辅助架构恢复:一种基于智能体的多级方法用于分层结构架构重建

Dominique Briechle, Raj Chanchad, Tobias Geger, Ruidi He, Dhruv Jajadiya, Dhruv Kapadiya, Andreas Rausch, Meng Zhang

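AI summary: This paper proposes an agent-based multi-level approach for recovering the hierarchical structural architecture of complex ROS 2 systems; refined prompting and multi-level intermediate architectural representations improve the consistency and scalability of architecture recovery.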

详情
AI中文摘要

显式软件架构模型是沟通、分析和演变复杂软件密集型系统的关键 artifacts。然而,在基于ROS~2的机器人系统中,结构(解构)和集成语义通常仅在分布式 artifacts(如源代码和启动文件)中隐式编码,使得恢复分层架构尤其困难。现有方法主要关注节点级实体和通信布线,而对多抽象层次上的分层结构(解构)恢复支持有限。本文扩展了我们之前提出的蓝图引导的LLM辅助架构恢复流程,通过两个主要改进:(1)改进的提示以提高架构合成的一致性和可控性;(2)基于多级中间架构表示的分阶段恢复策略,该策略结合了原子ROS节点列表和启动文件依赖关系,从而在多个抽象层次上实现结构受限的重建。该方法在基于协作机械臂和异构ROS~2 artifacts的现实世界自动化产品拆卸系统上进行了评估。与我们之前的工作相比,所选案例研究显示出显著更高的集成复杂性和更丰富的功能。结果表明,架构恢复在结构一致性、可扩展性和鲁棒性方面有所提高,同时揭示了与大规模ROS~2系统中动态集成语义相关的剩余挑战。

英文摘要

Explicit software architecture models are essential artifacts for communicating, analyzing, and evolving complex software-intensive systems. In ROS 2-based robotic systems, however, structural (de-)composition and integration semantics are often only implicitly encoded across distributed artifacts such as source code and launch files, making recovery of hierarchical architecture particularly difficult. Existing approaches mainly focus on node-level entities and communication wiring, while providing limited support for recovering hierarchical structural (de-)composition across multiple abstraction levels. In this paper, we extend our previously proposed blueprint-guided LLM-assisted architecture recovery pipeline for ROS 2 systems through two major enhancements: (1) refined prompting to improve the consistency and controllability of architecture synthesis, and (2) a staged recovery strategy based on multi-level intermediate architectural representations that incorporate the atomic ROS node list and launch file dependencies, thereby enabling structurally constrained reconstruction across multiple abstraction levels. The approach is evaluated on a real-world automated product disassembly system based on cooperative robotic arms and heterogeneous ROS 2 artifacts. Compared to our previous work, the considered case study exhibits substantially higher integration complexity and richer functionality. The results demonstrate improved structural consistency, scalability, and robustness of architecture recovery, while also revealing remaining challenges related to dynamic integration semantics in large-scale ROS 2 systems.

2605.20050 2026-05-20 cs.CL

Language Mutations Sustain the Persistences of Conspiracy Theories on Social Media

语言变异维持社交媒体上阴谋论的持续性

Calvin Yixiang Cheng, Dorian Quelle, Scott A. Hale

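AI summary: This study examines how language mutations affect the persistent spread of conspiracy theories on social media. Analyzing three years of conspiracy-related posts from X, it finds that conspiracy claims with greater semantic mutation have longer lifespans, and that mutations in psycholinguistic properties are associated with extended lifespans.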

详情
AI中文摘要

本研究探讨了语言变异如何影响社交媒体上阴谋论的持续传播。通过分析X平台三年的阴谋相关帖子数据,结合计算语言学分析和生存建模,我们发现语义变异更大的阴谋论具有更长的生命周期。心理语言学属性的变异,包括代词、社会参照词、认知过程术语、风险和健康相关的词汇,与延长生命周期有关。演员、行动和目标(AAT)类别的变异也与更长的生命周期有关。定性分析识别出两种主要的变异模式:简化和同化,分别在语言和AAT结构层面。总体而言,这些结果加深了我们对语言变异如何促进在线阴谋论持续性的理解,并为长期内容管理策略提供了新的视角。我们主张内容管理应考虑阴谋论声明的可变性,并专注于核心声明以应对其潜在变化。

英文摘要

This study investigates how language mutations affect the persistent diffusion of conspiracy theories on social media. Drawing on a three-year dataset of conspiracy-related posts from X, and applying computational linguistic analysis alongside survival modelling, we find that conspiracy claims with greater semantic mutations have substantially longer lifespans. Mutations in psycholinguistic properties, including pronouns, social reference words, cognitive process terms, and risk- and health-related vocabulary, are associated with extended lifespans. Mutations in actor, action and target (AAT) categories are associated with longer lifespans as well. Qualitative analysis identifies two predominant mutation patterns, simplification and assimilation, at both the linguistic and the AAT structural level. Taken together, the results advance our understanding of how language mutations contribute to conspiracy persistence online and shed light on longitudinal content moderation strategies. We argue that content moderation should consider the mutability of conspiracy claims and focus on the core claims in order to address their potential variations.

2605.20049 2026-05-20 cs.SE cs.AI

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

代码整洁性影响编码代理吗?一项受控的最小对研究

Priyansh Trivedi, Olivier Schmitt

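AI summary: This study examines how code cleanliness affects coding-agent performance by constructing repository pairs that match in architecture, dependencies, and external behaviour but differ in cleanliness. Cleanliness does not change the pass rate, but agents working on cleaner code use 7 to 8% fewer tokens and revisit files 34% less often.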

详情
AI中文摘要

随着自主编码代理的快速普及,其评估主要集中在固定目标代码库的任务完成率上。这留下了一个关键问题未被回答:底层代码的结构和风格质量,即“整洁性”,是否会影响代理导航和修改代码的能力?为了隔离代码整洁性对代理能力的影响,我们引入了一种基于最小对的评估协议:构建结构、依赖和外部行为相同但静态分析规则违反和认知复杂度不同的代码库对。这些对通过代理流水线在两个方向上构建:一个降级干净代码库或清理混乱代码库。我们为六个这样的对编写了33项任务,并通过应用的公共表面进行隐藏测试。在660次使用Claude Code的试验中,代码整洁性没有改变代理的通过率。然而,它显著改变了代理的操作足迹:在整洁代码上工作的代理使用7至8%更少的标记,并减少34%的文件重复访问。我们的发现表明,传统可维护性原则在AI驱动开发时代仍然高度相关,影响编码代理的计算成本和导航效率。代码整洁性与模型选择、工具和提示并列,成为影响代理行为的重要因素。

英文摘要

As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates while holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or "cleanliness", of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

2605.20044 2026-05-20 cs.CV

OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives

OP2GS: 带双不透明度的物体感知3D高斯散射

Guiyu Liu, Niklas Vaara, Janne Mustaniemi, Juho Kannala, Janne Heikkilä

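AI summary: OP2GS introduces a dual-opacity mechanism that augments each primitive with an explicit instance identity and a dedicated instance opacity σ*, addressing the lack of object-level identity in 3D Gaussian Splatting and thereby supporting open-vocabulary scene understanding.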

详情
Comments
Under review
AI中文摘要

3D高斯散射(3DGS)提供了一种显式且高效的场景表示,但其原始体素缺乏固有的物体层面身份,阻碍了下游任务如开放词汇场景理解。现有方法通常通过将高维特征嵌入提炼为高斯或通过启发式细化将2D掩码标签提升为3D来解决这一问题。然而,基于特征的方法会带来沉重的存储和解码开销,而基于提升的方法则容易受到标签污染:用于外观重建的高斯体往往在2D到3D投影时会获得错误的物体标签。我们提出了OP2GS,一种带物体感知的高斯表示,通过为每个原始体素添加显式实例身份和专用实例不透明度σ*用于物体掩码渲染。原始不透明度σ仍负责视觉重建,而σ*则模型该高斯是否应贡献于特定的物体掩码。这种双不透明度公式将视觉存在与实例占用解耦:错误标记的高斯体仍可用于图像渲染,但在物体掩码分支中会变得透明。为了学习这种表示,我们引入了随机物体损失,通过3DGS标准的透射率基可见性优化1D实例占用场。然后通过多视角聚合将语义描述符附加在物体层面,消除了每个高斯体的特征存储需求。与基于特征训练的方法相比,OP2GS在开放词汇性能方面具有竞争力,同时显著减少了计算开销。与无训练管道相比,它利用物理一致的占用学习来解决可见性歧义。

英文摘要

3D Gaussian Splatting (3DGS) provides an explicit and efficient scene representation, but its primitives lack inherent object-level identity, hindering downstream tasks such as open-vocabulary scene understanding. Existing methods typically address this by either distilling high-dimensional feature embeddings into Gaussians or by lifting 2D mask labels into 3D via heuristic refinement. However, feature-based approaches incur heavy storage and decoding overhead, while lifting-based pipelines remain vulnerable to label contamination: Gaussians necessary for appearance reconstruction often receive incorrect object labels during 2D-to-3D projection. We propose OP2GS, an object-aware Gaussian representation that augments each primitive with an explicit instance identity and a dedicated instance opacity $σ^{*}$ for object-mask rendering. The original opacity $σ$ remains responsible for visual reconstruction, while $σ^{*}$ models whether a Gaussian should contribute to a particular object mask. This dual-opacity formulation decouples visual existence from instance occupancy: mislabeled Gaussians can remain available for image rendering while becoming transparent in the object-mask branch. To learn this representation, we introduce a random object loss that optimizes the 1D instance occupancy field using the standard transmittance-based visibility of 3DGS. Semantic descriptors are then attached at the object level through multi-view aggregation, eliminating per-Gaussian feature storage. Compared with feature-training approaches, OP2GS achieves competitive open-vocabulary performance while significantly reducing computational overhead. Compared with training-free pipelines, it leverages physically consistent occupancy learning to resolve visibility ambiguities.
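
The decoupling of visual existence from instance occupancy can be illustrated on a single ray with standard front-to-back alpha compositing. This is a toy sketch under stated assumptions, not the OP2GS renderer: the two-Gaussian scene, the scalar "colors", and the choice to composite the mask branch with the instance opacity alone are all hypothetical.

```python
def composite(values, alphas):
    """Front-to-back alpha compositing: sum_i v_i * a_i * prod_{j<i} (1 - a_j)."""
    out, transmittance = 0.0, 1.0
    for v, a in zip(values, alphas):
        out += v * a * transmittance
        transmittance *= 1.0 - a
    return out


# Two Gaussians on one ray, front to back. The pixel belongs to object 1,
# but the front Gaussian was (wrongly) labeled as object 2 during lifting.
colors = [0.2, 0.8]        # scalar stand-in for RGB
sigma = [0.5, 1.0]         # visual opacity, used for image rendering
instance_ids = [2, 1]
sigma_star = [0.0, 1.0]    # instance opacity: the mislabeled Gaussian is switched off

is_object_1 = [1.0 if i == 1 else 0.0 for i in instance_ids]

image = composite(colors, sigma)                 # both Gaussians still shape the image
mask_single = composite(is_object_1, sigma)      # one shared opacity: the mask is attenuated
mask_dual = composite(is_object_1, sigma_star)   # separate opacity: the mask is clean
```

With a single shared opacity, the mislabeled Gaussian blocks half of object 1's mask at this pixel; with a separate instance opacity it stays in the image but is transparent in the mask branch.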

2605.20040 2026-05-20 cs.LG

Active Context Selection Improves Simple Regret in Contextual Bandits

主动上下文选择提升上下文老虎机中的简单遗憾

Mohammad Shahverdikondori, Jalal Etesami, Negar Kiyavash

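AI summary: This paper studies contextual multi-armed bandits with a finite context space, where the learner actively chooses which context to sample in order to reduce simple regret, and gives algorithms for both known and unknown context distributions.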

详情
AI中文摘要

我们研究了具有有限上下文空间(即亚群体)的上下文多臂老虎机问题,其中学习者为每个上下文推荐最佳动作,并通过上下文加权简单遗憾进行评估。我们的保证是在奖励分布的最坏情况下,同时保持对上下文分布向量p的实例依赖性。类似于实验设计问题,其中感兴趣的总体是固定的但可选的亚群体可以被控制,我们允许学习者主动选择从何处采样上下文。对于已知的p,我们刻画了紧致的遗憾率:被动采样(上下文随机揭示)的遗憾为顺序√(n/T ||p||_{1/2}),而主动采样(分配q_j ∝ p_j^{2/3})则达到紧致的速率√(n/T) ||p||_{2/3}。所获得的改进可以达到Θ(k^{1/4}),其中k是上下文的数量。我们进一步将分析扩展到预算化的主动采样,刻画相应的紧致速率,并确定何时有限的主动预算足以恢复完全主动的速率。当p未知时,我们提出探索-探索-然后-提交(EETC)算法,该算法在大时间范围内能够匹配已知p的主动速率,仅相差常数因子。在合成和现实数据上的实验支持了我们的理论发现。

英文摘要

We study the contextual multi-armed bandit problem with a finite context space (a.k.a. subpopulations), where the learner recommends a best action for each context and is evaluated by context-weighted simple regret. Our guarantees are worst-case over the reward distributions, while remaining instance-dependent with respect to the context distribution vector $p$. Akin to experimental design problems where the population of interest is fixed but the sampled subpopulation can be controlled, we allow the learner to actively choose which context to sample from. For a known $p$, we characterize tight regret rates: passive sampling where contexts are randomly revealed achieves regret of order $\sqrt{n/T \, \lVert p \rVert_{1/2}}$, whereas active sampling with allocation $q_j \propto p_j^{2/3}$ achieves the tight rate $\sqrt{n/T} \, \lVert p \rVert_{2/3}$. The resulting improvement can be as large as $\Theta(k^{1/4})$, where $k$ is the number of contexts. We further extend the analysis to budgeted active sampling, characterize the corresponding tight rate, and identify when a limited active budget suffices to recover the fully active rate. When $p$ is unknown, we propose the Explore-Explore-Then-Commit (EETC) algorithm, which optimally balances estimating the context distribution and the time to switch to active allocation, such that for large horizons, it matches the known-$p$ active rate up to constants. Experiments on synthetic and real-world data support our theoretical findings.
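
The two rates can be checked numerically with a simple proxy. If context $j$ receives a share $q_j$ of the samples, the context-weighted error scales like $\sum_j p_j / \sqrt{q_j}$ (the common factor $\sqrt{n/T}$ is left out). The sketch below is an illustration under that assumption, with hypothetical function names and a toy distribution whose entries are all positive; it is not the paper's analysis.

```python
def quasi_norm(p, r):
    """(sum_j p_j^r)^(1/r); a quasi-norm for 0 < r < 1."""
    return sum(x ** r for x in p) ** (1.0 / r)


def active_allocation(p):
    """Sampling shares q_j proportional to p_j^(2/3), normalized to sum to 1."""
    weights = [x ** (2.0 / 3.0) for x in p]
    total = sum(weights)
    return [w / total for w in weights]


def regret_proxy(p, q):
    """sum_j p_j / sqrt(q_j): weighted error when context j gets sample share q_j."""
    return sum(pj / qj ** 0.5 for pj, qj in zip(p, q))


p = [0.97, 0.01, 0.01, 0.01]                     # one dominant context, three rare ones
passive = regret_proxy(p, p)                     # contexts arrive with probability p_j
active = regret_proxy(p, active_allocation(p))   # learner oversamples rare contexts
ratio = passive / active
```

Setting $q = p$ gives $\sum_j \sqrt{p_j} = \sqrt{\lVert p \rVert_{1/2}}$, and the allocation $q_j \propto p_j^{2/3}$ gives $(\sum_j p_j^{2/3})^{3/2} = \lVert p \rVert_{2/3}$, matching the two expressions in the abstract. For a uniform $p$ the two coincide, so any gain comes from skew in the context distribution.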

2605.20037 2026-05-20 cs.LG cs.AI

When Critics Disagree: Adaptive Reward Poisoning Attacks in RIS-Aided Wireless Control System

当批评者意见不一致时:RIS辅助无线控制系统中的自适应奖励中毒攻击

Deemah H. Tashman, Soumaya Cherkaoui

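AI summary: This paper proposes Disagreement-Guided Reward Poisoning (DGRP), an attack on Soft Actor-Critic (SAC) agents, to assess the robustness of deep reinforcement learning (DRL) in RIS-assisted networks.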

详情
AI中文摘要

奖励中毒攻击对基于学习的无线控制系统构成了重大风险。为此,我们提出了一种在受Reconfigurable Intelligent Surfaces(RIS)辅助的Cognitive Radio Network(CRN)环境中,针对Soft Actor-Critic(SAC)智能体的Disagreement-Guided Reward Poisoning(DGRP)自适应攻击。SAC智能体的任务是通过同时优化二次用户(SUs)的发射功率和RIS相移,以最大化长期二次用户的速率。DGRP在SAC双批评者表现出显著分歧时(尤其在高杠杆、高不确定性状态下)污染奖励,导致价值估计扭曲并引导策略朝向次优动作。我们的研究发现,DGRP显著降低了RIS通常提供的性能提升,并降低了传输质量。我们进一步研究了关键攻击参数及其对学习的影响。与周期性定时和探索触发基线相比,DGRP始终造成更大的损害,突显了在评估RIS辅助网络中DRL鲁棒性时考虑分歧意识威胁的必要性。

英文摘要

Reward-poisoning attacks present a significant risk to learning-based wireless control systems. Given this, we propose a Disagreement-Guided Reward Poisoning (DGRP) adaptive attack on a Soft Actor-Critic (SAC) agent. In a Cognitive Radio Network (CRN) environment assisted by Reconfigurable Intelligent Surfaces (RIS), the SAC agent is tasked with maximizing the long-term secondary users' (SUs) rate by simultaneously optimizing the transmission power of the SU transmitter and the RIS phase shifts. DGRP corrupts rewards when the SAC dual critics exhibit substantial disagreement, especially in high-leverage, high-uncertainty states, which distorts value estimates and steers the policy towards suboptimal actions. Our findings demonstrate that DGRP substantially diminishes the performance improvements typically provided by RIS and degrades transmission quality. We further investigate key attack parameters and determine their impact on learning. In comparison to periodic-timing and exploration-triggered baselines, DGRP consistently causes greater damage, highlighting the necessity of considering disagreement-aware threats when evaluating the robustness of Deep Reinforcement Learning (DRL) in RIS-assisted networks.

2605.20035 2026-05-20 cs.CV

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

面向高效多模态大语言模型的阶段自适应令牌选择

Zijie Xin, Jie Yang, Ruixiang Zhao, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li

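AI summary: This paper proposes SEATS, a stage-adaptive token selection method that improves the inference efficiency of omni-modal large language models, achieving a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.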

详情
Comments
Code Link: https://github.com/xxayt/SEATS
AI中文摘要

多模态大语言模型(om-LLMs)通过将视频和音频编码为时间对齐的令牌序列,在窗口级别交错处理以实现统一的音频-视觉理解。然而,处理这些密集的非文本令牌会带来显著的计算开销。尽管训练无关的令牌选择可以减少这种成本,但现有方法要么专注于视觉输入,要么在LLM之前以固定的每模态比例修剪om-LLM令牌,无法捕捉跨模态令牌重要性在层间的变化。为了解决这一限制,我们首先分析om-LLMs的层间令牌依赖性。我们发现视觉和音频依赖性遵循块状模式,并随着深度逐渐减弱,表明许多后期层的非文本令牌在跨模态融合后变得冗余。受此启发,我们提出SEATS,一种训练无关的、阶段自适应的令牌选择方法,用于高效的om-LLM推理。在LLM之前,SEATS通过注意力加权多样性选择去除时空冗余。在LLM内部,它逐步在块间修剪令牌,并利用查询相关性分数动态分配从时间窗口到模态的保留预算。在后期层中,一旦完成跨模态融合,它会移除所有剩余的非文本令牌。在Qwen2.5-Omni和Qwen3-Omni上的实验表明,SEATS有效提高了推理效率。仅保留10%的视觉和音频令牌,实现了9.3倍的FLOPs减少和4.8倍的prefill加速,同时保持96.3%的原始性能。

英文摘要

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

2605.20033 2026-05-20 cs.CV cs.GT

A Nash Equilibrium Framework For Training-Free Multimodal Step Verification

为无训练多模态步骤验证构建纳什均衡框架

Rohit Sinha, Kunal Tilaganji, Tanuja Ganu, Nagarajan Natarajan, Amit Sharma, Vineeth N. Balasubramanian

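AI summary: This paper proposes a training-free multimodal step verification method that treats step verification as a coordination problem among specialized judges and formalizes their interaction as a Nash equilibrium game. Equilibrium scores computed in closed form enable disagreement-aware filtering and stability-conscious ranking; experiments indicate that cross-modal agreement, rather than average confidence alone, provides a robust verification signal.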

详情
Comments
ICLR 2026 Workshop VerifAI-2
AI中文摘要

多模态大语言模型经常生成包含细微错误的推理链,导致错误答案。当前的验证方法有显著局限。学习批评者需要大量标注数据且在不同任务上表现不一致。同时,现有无训练方法仅简单平均不同来源的分数,忽略了关键见解:当这些分数不一致时,这种不一致本身包含了关于推理步骤是否真正有效的重要信息。我们提出了一种无训练验证方法,将分步验证视为专门法官之间的协调问题。我们形式化这些法官的交互为纳什均衡游戏,其中一致信号表示有效步骤,不一致揭示不稳定性。我们的方法通过闭式解计算均衡分数,实现了对分歧的敏感过滤和稳定性意识的排名。在六个基准测试中,我们的方法在基准模型上实现了2.4%至5.2%的一致性提升,并在与学习批评者相比时表现出竞争力,证明了跨模态一致性(而非平均置信度)在无任务特定适应的情况下提供了稳健的验证信号。

英文摘要

Multimodal large language models often generate reasoning chains containing subtle errors that lead to incorrect answers. Current verification approaches have notable limitations. Learned critics need extensive labeled data and show inconsistent performance across different tasks. Meanwhile, existing training-free methods simply average scores from different sources, missing a key insight: when these scores disagree, that disagreement itself carries important information about whether a reasoning step is truly valid or not. We propose a training-free verification approach that treats step-wise verification as a coordination problem among specialized judges. We formalize these judges' interaction as a Nash equilibrium game where agreement signals valid steps while disagreement reveals instability. Our method computes equilibrium scores through a closed-form solution, enabling both disagreement-aware filtering and stability-conscious ranking of reasoning steps. Evaluated across six benchmarks, our approach achieves consistent improvements of 2.4% to 5.2% over baseline models and shows competitive performance against learned critics, demonstrating that cross-modal agreement (not just average confidence) provides robust verification signals without task-specific adaptation.

2605.20032 2026-05-20 cs.LG cs.MM

CAMERA: Adapting to Semantic Camouflage in Unsupervised Text-Attributed Graph Fraud Detection

CAMERA: 适应语义伪装的无监督文本属性图欺诈检测

Junjun Pan, Yixin Liu, Yu Zheng, Lianhua Chi, Alan Wee-Chung Liew, Shirui Pan

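AI summary: This paper proposes CAMERA, a case-adaptive multi-cue expert framework that counters semantic camouflage by combining graph-structure and text-attribute information for unsupervised fraud detection, improving the identification of camouflaged fraudsters.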

详情
Comments
Accepted by IJCAI 2026
AI中文摘要

文本属性图欺诈检测(TAGFD)在防止在线社交和电子商务平台上欺诈活动方面起着关键作用。然而,为了逃避检测,欺诈者不断演变其伪装策略,通过刻意模仿良性用户的文本响应来隐藏其恶意目的。这种现象称为语义伪装,从根本上破坏了对结构和属性线索如何被用来识别欺诈者的常见假设,并使在无监督TAGFD中发现欺诈者变得困难。为了解决这一问题,我们提出了一个案例自适应多 cue 专家框架(CAMERA)用于无监督TAGFD。CAMERA采用了一个ego解耦的混合专家架构,其中每个专家专门建模一种不同的欺诈指示线索。引入了一个上下文感知的门控模型,以联合考虑ego节点表示及其局部邻域上下文,以适应不同专家学习的线索的集成。此外,CAMERA利用欺诈者的固有稀有性,支持无监督的一类学习,通过专家级目标鼓励建模主导的良性模式,从而实现对伪装欺诈者的可靠无监督检测。在四个具有挑战性的数据集上的实验表明,CAMERA在对抗语义伪装欺诈者方面优于竞争对手,证明了其有效性。代码可在https://github.com/CampanulaBells/CAMERA获取。

英文摘要

Text-attributed graph fraud detection (TAGFD) plays a critical role in preventing fraudulent activities on online social and e-commerce platforms. However, to evade detection, fraudsters continuously evolve their camouflaging strategies by deliberately mimicking the textual responses of benign users, thereby concealing their malicious purposes. This phenomenon, referred to as semantic camouflage, fundamentally undermines commonly relied-upon assumptions about how structural and attribute cues can be exploited to identify fraudsters, and makes it difficult to spot fraudsters with unsupervised TAGFD. To bridge this gap, we propose a Case-Adaptive Multi-cue Expert fRAmework (CAMERA) for unsupervised TAGFD. CAMERA employs an ego-decoupled mixture-of-experts architecture, where each expert specializes in modeling a distinct type of fraud-indicative cue. A context-informed gating model is introduced to jointly consider the ego node representation and its local neighborhood context for adaptive integration of the cues learned by different experts. Furthermore, CAMERA leverages the inherent rarity of fraudsters to support unsupervised one-class learning with expert-level objectives that encourage modeling dominant benign patterns, thereby enabling reliable unsupervised detection of camouflaged fraudsters. Experiments on four challenging datasets show that CAMERA consistently outperforms competitors, demonstrating its effectiveness against semantically camouflaged fraudsters. Code is available at https://github.com/CampanulaBells/CAMERA.

2605.20028 2026-05-20 cs.LG physics.ao-ph

Training-Free Bayesian Filtering with Generative Emulators

无需训练的贝叶斯过滤与生成模拟器

Thomas Savary, François Rozet, Gilles Louppe

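AI summary: This paper shows that diffusion-based emulators of dynamical systems can implement an optimal variant of the particle filter without additional training, addressing the poor scalability of particle filters in high dimensions; experiments on nonlinear chaotic systems support its effectiveness.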

详情
Comments
Accepted as a spotlight paper at the International Conference on Machine Learning 2026
AI中文摘要

贝叶斯过滤是一个旨在从观测中估计动态系统合理状态的知名问题。在现有解决方案中,粒子滤波在非线性动态和观测中理论上是精确的,但在高维情况下扩展性差。本文展示,基于扩散的动力学模拟器可以无需额外训练地实现一种最优的粒子滤波变种,这种变种由于经典数值求解器的实现挑战而长期未被探索。非线性混沌系统(包括大气动力学)的实验表明,所提出的方法成功将粒子滤波扩展到高维设置。

英文摘要

Bayesian filtering is a well-known problem that aims to estimate plausible states of a dynamical system from observations. Among existing approaches to solve this problem, particle filters are theoretically exact for non-linear dynamics and observations, but suffer from poor scalability in high dimensions. In this work, we show that diffusion-based emulators of dynamical systems can be used to implement, without additional training, an optimal variant of particle filters that has remained largely unexplored due to implementation challenges with classical numerical solvers. Experiments on nonlinear chaotic systems, including atmospheric dynamics, demonstrate that the proposed approach successfully scales particle filtering to high-dimensional settings.
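
For readers unfamiliar with particle filters, the sketch below shows one step of the standard bootstrap variant: propagate particles through the dynamics, weight them by the observation likelihood, then resample. It is a baseline illustration only. The variant the abstract refers to would instead propose particles conditioned on the new observation, which this sketch does not attempt; the one-dimensional random-walk model and all names are assumptions.

```python
import math
import random


def bootstrap_filter_step(particles, observation, transition, likelihood, rng):
    """One predict / weight / resample step of a bootstrap particle filter."""
    proposed = [transition(x, rng) for x in particles]        # predict with the dynamics
    weights = [likelihood(observation, x) for x in proposed]  # weight by p(y | x)
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all particles have zero likelihood")
    weights = [w / total for w in weights]
    # Multinomial resampling: draw particles in proportion to their weights.
    resampled = rng.choices(proposed, weights=weights, k=len(proposed))
    return resampled, weights


def gaussian_likelihood(observation, state, noise_std=0.5):
    """Unnormalized Gaussian observation likelihood (toy model)."""
    z = (observation - state) / noise_std
    return math.exp(-0.5 * z * z)


rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(500)]
for y in [0.2, 0.4, 0.7]:
    particles, _ = bootstrap_filter_step(
        particles, y, lambda x, r: x + r.gauss(0.0, 0.1), gaussian_likelihood, rng
    )
estimate = sum(particles) / len(particles)  # posterior-mean estimate of the state
```

In high dimensions most bootstrap weights collapse towards zero because the blind proposals rarely land near the observation, which is the scalability problem the abstract mentions.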

2605.20022 2026-05-20 cs.CL

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

FlexDraft: 通过注意力调节和奖励引导校准实现灵活的推测解码

Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang

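AI summary: This paper proposes FlexDraft, a framework that uses attention tuning and bonus-guided calibration to adapt flexibly to different batch sizes, addressing the throughput collapse of existing parallel speculative decoding at large batch sizes.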

详情
AI中文摘要

推测解码通过使用快速草稿生成多个候选标记并由目标模型并行验证,从而加速内存密集型LLM推理且不降低质量。然而,传统顺序推测解码面临草稿与验证之间的相互等待以及中间状态的反复交换,进一步增加内存访问开销。并行推测解码通过在单个目标前向传递中执行草稿和验证,允许在当前候选被验证的同时准备未来的草稿。尽管在小批次中有效,现有并行推测解码方法要么需要昂贵的持续预训练导致质量下降,要么验证接受率低。更重要的是,这种范式固有地面临奖励标记和接受长度的不确定性,导致草稿验证不匹配,从而在大批次时吞吐量收益崩溃。为了解决这些限制,我们引入了FlexDraft,一种无损的推测解码框架,通过三个关键设计灵活适应不同的批次大小。(1) 注意力调节通过仅调节最终几层的注意力投影器来实现块扩散草稿生成,同时保持自回归路径冻结以保留目标分布并生成高质量的草稿,同时使用最少的可训练参数。(2) 奖励引导校准使用一个轻量级的MLP,条件于已解决的奖励标记来校准草稿logits,缓解由奖励标记不确定性导致的草稿验证不匹配。(3) 灵活解码在小批次时动态切换于并行草稿和验证,而在大批次时切换为顺序草稿然后验证,并根据草稿信心调整验证长度以消除冗余计算。

英文摘要

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft and verify at small batch sizes and sequential draft then verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
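
The terms "accepted length" and "bonus token" come from the verification step of speculative decoding. A minimal sketch of the greedy form of that step follows; it illustrates the general technique, not FlexDraft itself, and assumes the target model's greedy choices at every draft position (plus one extra position) are already available as plain token ids.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a drafted block.

    draft_tokens:  tokens proposed by the drafter.
    target_tokens: the target model's greedy token at each draft position,
                   plus one more for the position after the block, so
                   len(target_tokens) == len(draft_tokens) + 1.
    Returns the tokens actually emitted this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            emitted.append(wanted)  # first mismatch: take the target's token and stop
            return emitted
        emitted.append(drafted)     # match: the drafted token is accepted
    emitted.append(target_tokens[-1])  # all accepted: the extra target token is the bonus
    return emitted


partial = verify_draft([5, 7, 9], [5, 7, 2, 4])  # two accepted, then a correction
full = verify_draft([5, 7], [5, 7, 3])           # both accepted, plus the bonus token
```

Because the number of emitted tokens and the final token are only known after verification, a drafter that runs in parallel with verification has to guess both, which is the source of the mismatch the abstract describes.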

2605.20016 2026-05-20 eess.IV cs.CV

FGSVQA: Frequency-Guided Short-form Video Quality Assessment

FGSVQA:基于频率的短视频质量评估

Xinyi Wang, Angeliki Katsenou, Junxiao Shen, David Bull

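AI summary: This paper proposes an end-to-end video quality assessment framework that uses a CLIP-based dense visual encoder and compression priors from the frequency domain to generate artifact- and structure-aware weight maps for efficient video quality prediction.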

详情
Comments
4 pages, 1 figure
AI中文摘要

短视频给用户生成内容(UGC)的质量评估带来了新挑战,由于其复杂的生成流程、快速的内容变化和混合的失真。为了解决这一挑战,我们提出了一种端到端的视频质量评估(VQA)框架,该框架采用基于CLIP的密集视觉编码器,并结合从频率域导出的压缩先验,生成具有伪影和结构感知的权重图用于特征聚合。通过显式分解伪影、结构和原始视觉特征分支,并通过学习的门控模块在时间上自适应融合,所提出的方法实现了准确且高效的质量预测。实验结果表明,我们的方法在短视频数据集上在平均排名和线性相关性(SRCC: 0.736,PLCC: 0.787)方面表现出色,同时保持了高效的推理运行时间。代码和额外结果可在:https://github.com/xinyiW915/FGSVQA 获取。

英文摘要

Short-form video poses new challenges to the quality assessment of user-generated content (UGC) due to its complex generation pipeline, rapid content variation, and mixed distortions. To address this challenge, we propose an end-to-end video quality assessment (VQA) framework that employs a dense visual encoder based on CLIP, and incorporates compression priors derived from the frequency domain to generate artifact- and structure-aware weight maps for feature aggregation. By explicitly decomposing artifact, structure, and original visual feature branches and adaptively fusing them over time through a learned gating module, the proposed method achieves accurate and efficient quality prediction. Experimental results show that our method achieves strong performance on short-form video datasets in terms of average rank and linear correlation (SRCC: 0.736, PLCC: 0.787), while maintaining efficient inference runtime. The code and additional results are available at: https://github.com/xinyiW915/FGSVQA.
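
SRCC and PLCC are the Spearman rank-order and Pearson linear correlation coefficients between predicted quality scores and subjective scores. As a dependency-free sketch of how the two differ (the helper names and toy scores are illustrative assumptions):

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]  # monotone but nonlinear in the prediction
srcc = spearman(predicted, subjective)
plcc = pearson(predicted, subjective)
```

In the toy data the ordering is perfectly preserved, so SRCC is 1, while PLCC is lower because the relationship is not linear. This is why quality-assessment papers usually report both.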

2605.20014 2026-05-20 cs.SD

Precise and Simple Audio-to-Score Alignment

精确且简单的音频到乐谱对齐

Silvan Peter, Patricia Hu, Gerhard Widmer

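AI summary: This paper proposes an algorithm, derived from symbolic alignment methods, that directly bridges audio-like and symbol-level features, yielding precise and flexible audio-to-score alignment that adapts to different timbral characteristics.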

详情
Comments
published at the Music Encoding Conference (MEC) 2026
AI中文摘要

音频到乐谱对齐是音乐信息检索中的长期挑战,也是音乐研究中最广泛适用的对齐任务。对齐算法匹配音乐作品的两个版本,这些版本需要处于可比格式中。音频到音频对齐匹配音频特征;当将音频文件与乐谱匹配时,必须要么合成乐谱,要么通过钢琴卷或其他类似特征序列推导出音频样特征。相比之下,符号对齐匹配符号编码的音符;在音频到乐谱场景中,这些通过音频文件的转录获得。在本文中,我们提出了一种算法,直接连接音频样特征和符号级特征。通过基于符号对齐方法的定制动态规划匹配算法,顺序音频特征编码的起始点和频谱激活被匹配到乐谱位置。所得到的方法既精确——超越了基于合成乐谱的广泛使用的音频到音频方法——又保持了其数字信号处理组件的灵活性,即该方法可以适应不同的音色特性,而无需单独的转录模型。此外,它继承了一些符号对齐的运行时优势,其算法复杂度在最坏情况下与符号乐谱(通常较短)和音频特征序列(通常较长)的长度成线性关系。在接下来的章节中,我们提供详细的算法描述,并在大规模独奏钢琴录音数据集上评估其对齐质量。

英文摘要

Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparable formats. Audio-to-audio alignment matches audio features; when matching audio files to scores, such algorithms must either synthesize the score or derive audio-like features by means of piano rolls or similar feature sequences. Symbolic alignment, by contrast, matches symbolically encoded notes; in an audio-to-score scenario these would be obtained by transcribing the audio file. In this article, we present an algorithm that bridges audio-like and symbol-level features directly. Sequential audio features encoding onset and spectral activation are matched to score positions by a bespoke dynamic programming-based matching algorithm derived from symbolic alignment methods. The resulting method is both precise, surpassing widely used audio-to-audio approaches based on synthesized scores, and flexible in its digital signal processing components: it is adaptable to diverse timbral characteristics without requiring a separate transcription model. Furthermore, it inherits some of the runtime advantages of symbolic alignment, with an algorithmic complexity that is at worst linear in the length of the (typically short) symbolic score and the (typically long) audio feature sequence. In the following sections, we provide a detailed algorithm description and evaluate its alignment quality on a large-scale dataset of solo piano recordings.
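
As background for the audio-to-audio baseline mentioned above, the sketch below implements classic dynamic time warping (DTW) on two one-dimensional feature sequences. It is the generic textbook technique, not the bespoke matching algorithm of the paper, and it fills a full cost table, so its cost grows with the product of the two sequence lengths. The scalar features and the absolute-difference distance are assumptions.

```python
def dtw(a, b):
    """Classic DTW between two numeric sequences.

    Returns (total_cost, path), where path is a list of (index_in_a, index_in_b)
    pairs from the start to the end of both sequences.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between the two frames
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


score = [1, 2, 3]         # e.g. pitch per score event (toy values)
audio = [1, 2, 2, 3]      # the second event is held for two audio frames
total, path = dtw(score, audio)
```

The returned path maps the held note in the audio to a single score event, which is the kind of many-to-one correspondence any audio-to-score aligner has to produce.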

2605.20009 2026-05-20 cs.LG cs.AI cs.NE

Training Neural Networks with Optimal Double-Bayesian Learning

用最优双贝叶斯学习训练神经网络

Vy Bui, Hang Yu, Karthik Kantipudi, Ziv Yaniv, Stefan Jaeger

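AI summary: This paper proposes a new probabilistic framework for the learning rate, a key parameter of stochastic gradient descent; a double-Bayesian decision mechanism yields a theoretically optimal learning rate, which is validated on a variety of tasks.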

详情
Comments
13 pages, 4 figures; see also arXiv:2410.12984 [cs.LG]
AI中文摘要

反向传播与梯度下降是大多数机器学习神经网络架构中常用的优化策略。然而,找到指导训练的最优超参数已证明具有挑战性。尽管普遍认可选择合适参数对于避免过拟合和获得无偏结果至关重要,但这一选择仍主要基于经验实验和经验。本文提出了一种新的概率框架,用于学习率这一随机梯度下降中的关键参数。该框架将经典贝叶斯统计发展为一种涉及两个对抗性贝叶斯过程的双贝叶斯决策机制。从这两个过程可以推导出理论上最优的学习率,并用于随机梯度下降。在各种分类、分割和检测任务中的实验验证了理论上推导出的学习率的实践意义。本文还讨论了所提出的双贝叶斯框架对网络训练和模型性能的影响。

英文摘要

Backpropagation with gradient descent is a common optimization strategy employed by most neural network architectures in machine learning. However, finding optimal hyperparameters to guide training has proven challenging. While it is widely acknowledged that selecting appropriate parameters is crucial for avoiding overfitting and achieving unbiased outcomes, this choice remains largely based on empirical experiments and experience. This paper presents a new probabilistic framework for the learning rate, a key parameter in stochastic gradient descent. The framework develops classic Bayesian statistics into a double-Bayesian decision mechanism involving two antagonistic Bayesian processes. A theoretically optimal learning rate can be derived from these two processes and used for stochastic gradient descent. Experiments across various classification, segmentation, and detection tasks corroborate the practical significance of the theoretically derived learning rate. The paper also discusses the ramifications of the proposed double-Bayesian framework for network training and model performance.