arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17610 2026-05-19 cs.CV cs.CL

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

Shahriar Kabir Nahin, Hadi Askari, Muhao Chen, Anshuman Chhabra

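AI summary: This work proposes SafeLens, a video guardrail framework whose fast-and-slow inference architecture enables efficient video content moderation. It also builds a high-quality dataset and uses structured Chain-of-Thought traces to address the limits of training-time scaling, achieving state-of-the-art performance on real-world and AI-generated video benchmarks while significantly reducing inference cost.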

English abstract

The rapid growth of online video platforms and AI-generated content has made reliable video guardrails a key challenge for safety and real-world deployment. While most videos can be screened through fast pattern recognition, a small subset requires deeper reasoning over temporally complex content and nuanced policy constraints. Existing approaches typically rely on large vision-language models applied uniformly across all inputs, resulting in high inference costs and inefficient allocation of computation. We propose SafeLens, a video guardrail framework that introduces a fast-and-slow inference architecture for efficient and accurate content moderation with variable computational cost across inputs. Additionally, we construct a high-quality dataset by applying influence-guided filtering to the SafeWatch dataset, retaining only 2.4% of the original data. To further address limitations of training-time scaling, we enable test-time reasoning by augmenting the filtered data with structured Chain-of-Thought traces. Across real-world and AI-generated video benchmarks, SafeLens achieves state-of-the-art performance, outperforming strong open-source video guardrails (e.g., SafeWatch-8B, OmniGuard-7B) and closed-source models (e.g., GPT-5.4, Gemini-3.1-pro) while significantly reducing inference cost, demonstrating that efficient design can be more effective than scaling data or model size alone.
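
The fast-and-slow idea can be sketched as a two-stage cascade. This is a minimal illustration, not the SafeLens implementation: the abstract does not say how inputs are routed, so the confidence threshold, the function names, and the toy models below are all assumptions.

```python
from typing import Callable, List, Tuple


def fast_slow_screen(
    video: str,
    fast_model: Callable[[str], Tuple[str, float]],
    slow_model: Callable[[str], str],
    confidence_threshold: float = 0.9,  # assumed routing rule, not from the paper
) -> Tuple[str, str]:
    """Return (label, path); the slow model runs only when the fast one is unsure."""
    label, confidence = fast_model(video)
    if confidence >= confidence_threshold:
        return label, "fast"
    return slow_model(video), "slow"


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_fast(video: str) -> Tuple[str, float]:
    return ("safe", 0.98) if "cat" in video else ("unsafe", 0.55)


def toy_slow(video: str) -> str:
    return "unsafe" if "fight" in video else "safe"


videos: List[str] = ["cat_clip", "street_fight", "news_report"]
results = [fast_slow_screen(v, toy_fast, toy_slow) for v in videos]
```

Only inputs the fast path is unsure about pay for the slow path, which is what makes the compute cost vary per input.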

2605.17605 2026-05-19 cs.LG

Venom: A PyTorch Generative Modeling Toolkit

Liang Yan

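AI summary: This paper presents Venom, a PyTorch-based generative modeling toolkit that implements multiple generative modeling families behind a unified interface, offering readable, reproducible entry points and consistent training and sampling APIs for teaching, prototyping, and lightweight benchmarking.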

Comments Preprints

English abstract

Modern generative modeling has grown into a broad collection of related but often separately implemented paradigms, including denoising diffusion models, score-based stochastic differential equations, flow matching, variational autoencoders, normalizing flows, adversarial models, and energy-based models. For newcomers, this fragmentation makes it difficult to compare training objectives, inference procedures, sampling algorithms, and conditioning mechanisms within a single coherent codebase. We introduce Venom, an educational PyTorch toolkit that implements representative generative modeling families under a unified, MNIST-first interface. Venom emphasizes breadth, readability, reproducible entry points, and consistent training and sampling APIs rather than large-scale performance engineering. The package currently includes diffusion and score-based models, flow matching and one-step generators, variational autoencoders, normalizing flows, generative adversarial networks, and energy-based models. It provides separate training and sampling scripts, classifier and classifier-free guidance examples, bilingual tutorial notebooks, and a model-family organization that supports teaching, prototyping, and lightweight benchmarking.

2605.17601 2026-05-19 cs.RO

From a Single Demonstration to a General Policy for Contact-Rich Manipulation

Xing Li, Oliver Brock

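AI summary: This paper proposes a Learning from Demonstration (LfD) framework that uses environmental constraints as an inductive bias to achieve one-shot generalization in multi-stage, contact-rich tasks. The method represents a demonstration as a sequence of behaviors that exploit environmental constraints, separating task-general structure (constraint types and their transitions) from instance-specific details (exact demonstration trajectories, poses, and local geometries). A four-stage pipeline builds a complete policy on this representation: the robot first abstracts a single demonstration into environmental-constraint primitives, then disambiguates them through self-guided exploration, next assimilates targeted human corrections to handle out-of-distribution variations, and finally recovers the abstracted-away details online through compliant interaction. Because the resulting policy follows constraints rather than mimicking trajectories, it generalizes across object poses, local geometries, and unmodeled contact dynamics. The approach is validated on seven real-world multi-stage contact-rich manipulation tasks with over 90% success, establishing environmental constraints as fundamental building blocks for efficient generalization in learning from demonstration.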

Comments 21 pages, 22 figures, 7 tables

English abstract

We present a Learning from Demonstration (LfD) framework that achieves one-shot generalization in multi-stage, contact-rich manipulation tasks. Central to our approach is the utilization of environmental constraints as the inductive bias. By representing a demonstration as a sequence of behaviors that exploit environmental constraints, the robot separates task-general structure -- the constraint types and their transitions -- from instance-specific details such as exact demonstration trajectories, poses, and local geometries. Our four-stage pipeline builds a complete policy on this representation: the robot first abstracts a single demonstration into environmental-constraint primitives, then disambiguates them through self-guided exploration, next assimilates targeted human corrections that handle out-of-distribution variations, and finally recovers the abstracted-away details online through compliant interaction. Because the resulting policy follows constraints rather than mimics trajectories, it generalizes across object poses, local geometries, and unmodeled contact dynamics. We validate our approach on seven real-world multi-stage contact-rich manipulation tasks and achieve over 90% success. These extensive experimental results establish environmental constraints as fundamental building blocks for efficient generalization in learning from demonstration.

2605.17598 2026-05-19 cs.CL

Mixture of Experts for Low-Resource LLMs

Ori Bar Joseph, Smadar Arvatz, Noam Kayzer, Dan Revital, Sarel Weinberger

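AI summary: This paper studies the expert routing behavior of Mixture-of-Experts architectures on low-resource languages, finding that continual pre-training effectively mitigates routing imbalance and that routing improvements correlate with downstream task gains.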

English abstract

Mixture-of-Experts (MoE) architectures enable efficient model scaling, yet expert routing behavior across underrepresented languages remains poorly understood. We analyze routing dynamics in two architecturally distinct MoE models -- a pure Transformer (Qwen3-30B-A3B) and a hybrid Mamba-Transformer (Nemotron-3-Nano-30B-A3B) -- using Hebrew as a morphologically rich, low-resource testbed. Both pre-trained models exhibit deep-layer routing collapse: usage entropy drops sharply in final layers and tokens concentrate on a narrow expert subset, a pattern largely absent for English. Continual pre-training (CPT) on balanced bilingual data substantially corrects this imbalance, increasing entropy and shifting routing toward shared, language-agnostic experts; supervised fine-tuning (SFT) alone achieves less complete correction. Extending the analysis to Japanese reveals quantitatively consistent collapse signatures, providing cross-linguistic evidence that the phenomenon is a systematic consequence of pre-training underrepresentation rather than any language-intrinsic property. Routing improvements correlate with consistent downstream benchmark gains, positioning routing entropy and expert specialization as principled diagnostics for multilingual capacity in MoE systems.
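
A usage-entropy diagnostic of the kind described can be sketched as follows. This is an assumed formulation (Shannon entropy of per-layer expert usage, normalized by the log of the expert count); the paper's exact definition may differ, and the token-to-expert assignments are toy data.

```python
import math
from collections import Counter
from typing import Sequence


def usage_entropy(expert_ids: Sequence[int], num_experts: int) -> float:
    """Normalized Shannon entropy of expert usage in one layer, in [0, 1]."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_experts)


# Toy assignments for 32 tokens over 8 experts (illustrative only).
balanced = [0, 1, 2, 3, 4, 5, 6, 7] * 4   # every expert used equally
collapsed = [0] * 28 + [1] * 4            # tokens pile onto two experts

balanced_entropy = usage_entropy(balanced, num_experts=8)
collapsed_entropy = usage_entropy(collapsed, num_experts=8)
```

A value near 1 means tokens spread evenly over experts; a sharp drop in late layers is the collapse signature the abstract describes.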

2605.17593 2026-05-19 cs.RO

Motion-Uncertainty-Aware Next-Best-View Planning for Moving Object Reconstruction

Karen Li, Mattia Mantovani, Robert J. Wood, Lorenzo Sabattini, Stephanie Gil

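AI summary: This paper proposes a motion-uncertainty-aware next-best-view planning framework for reconstructing an unknown rigid object. Using noisy planar position measurements and depth observations from a mobile robot, it estimates and predicts the object state with a fixed-lag Gaussian Process smoother, then generates candidate viewpoints and improves reconstruction completeness.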

Comments This paper is accepted for publication for Robotics: Science and Systems (RSS) 2026

English abstract

Active 3D reconstruction of moving objects requires selecting informative viewpoints while accounting for object motion uncertainty during the decision-to-execution delay. Existing methods address only parts of this problem: next-best-view (NBV) planners for object reconstruction typically optimize surface coverage but assume static objects, while motion-aware active perception for moving targets accounts for target motion but prioritizes tracking or visibility over reconstruction coverage. This work presents a motion-uncertainty-aware NBV framework for reconstructing an unknown rigid object undergoing planar motion, using noisy planar position measurements of the object and depth observations from a mobile robot. The key idea is to evaluate each candidate viewpoint by its expected observation quality over plausible future object states induced by motion and measurement uncertainty, rather than at a single predicted object pose. To obtain this predictive belief, a fixed-lag Gaussian Process smoother estimates and predicts the object state from noisy position measurements. The resulting belief is used to generate candidate viewpoints around the predicted object location, filter them by reachability, and estimate their expected coverage-driven scores. Simulation and real-world experiments demonstrate improved reconstruction completeness over non-predictive NBV and prediction-only tracking methods, bridging coverage-driven active reconstruction and prediction-driven tracking.
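
The key idea, scoring a viewpoint by its expected quality over plausible future object states rather than at one predicted pose, can be sketched as a sample average. The quality function below (a preference for a fixed viewing distance) and all names are hypothetical stand-ins; the paper's coverage-driven score is not specified in the abstract.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def expected_view_score(
    view: Point, object_samples: Sequence[Point], preferred_range: float = 1.5
) -> float:
    """Average a toy observation-quality score over sampled future object positions."""
    def score(obj: Point) -> float:
        distance = math.dist(view, obj)
        return math.exp(-((distance - preferred_range) ** 2))

    return sum(score(o) for o in object_samples) / len(object_samples)


def best_view(candidates: Sequence[Point], object_samples: Sequence[Point]) -> Point:
    """Pick the candidate viewpoint with the highest expected score."""
    return max(candidates, key=lambda v: expected_view_score(v, object_samples))


# Samples from a predictive belief over the object's planar position (toy values).
belief_samples: List[Point] = [(0.0, 0.0), (0.2, 0.0), (-0.2, 0.0)]
candidate_views: List[Point] = [(1.5, 0.0), (4.0, 0.0)]
chosen = best_view(candidate_views, belief_samples)
```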

2605.17591 2026-05-19 cs.CV

Error-Decomposed Class-Conditional Fusion for Statistically Guaranteed Hard-Category Robust Perception

Guowei Luo, Ziqi Shi, Zhao Xie

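AI summary: This paper proposes Error-Decomposed Class-Conditional Fusion (ED-CCF), which addresses the hard-category reliability problem by improving performance on critical categories while preserving overall stability, aiming at statistically guaranteed robust perception.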

Comments 14 pages, 8 figures. Preprint

English abstract

Aggregate object detection metrics inherently mask catastrophic and repeatable failures in operationally critical, long-tail minority classes. This paper formally defines this pervasive vulnerability as the Hard-Category Reliability Problem (HCRP): the fundamental architectural challenge of strictly rectifying vulnerable categories without compromising the performance boundaries of stable classes under stringent protocols. To systematically dismantle this limitation, we propose Error-Decomposed Class-Conditional Fusion (ED-CCF), an elegant decision-layer inference framework. Diverging from heuristic global post-processing, ED-CCF projects predictions into a sophisticated quad-state error taxonomy, dynamically activating calibration pathways exclusively upon rigorous empirical justification. On a highly constrained 600-image validation benchmark, isolating cz as the critical vulnerability (HCEC=0.86, BSR=0.14), our framework achieves a targeted breakthrough: it elevates cz mAP50 from 0.089343 to 0.109353 (a massive +22.4% relative surge) while flawlessly preserving the Pareto optimality of global stability (raising all mAP50 from 0.581925 to 0.584864). Backed by exhaustive validation across 50 paired subset trials demonstrating an overwhelming 96% win rate and strict Bonferroni-corrected Wilcoxon significance (p<0.05), this work fundamentally redefines output-level fusion as an auditable, statistically guaranteed paradigm for safety-critical visual perception.
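
The reported gains can be checked with a few lines of arithmetic. The sketch below uses only the four mAP50 values quoted in the abstract; the helper name is illustrative.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction."""
    return (after - before) / before


# mAP50 values quoted in the abstract.
cz_absolute_gain = 0.109353 - 0.089343
cz_relative_gain = relative_gain(0.089343, 0.109353)
overall_relative_gain = relative_gain(0.581925, 0.584864)
```

The relative gain on cz matches the quoted +22.4%, which corresponds to an absolute gain of about 0.020 mAP50; the overall mAP50 rises by roughly 0.5% in relative terms.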

2605.17590 2026-05-19 cs.LG math.OC

Form and Function: Machine Unlearning as a Problem of Misaligned States

Kennon Stewart

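AI summary: This paper formulates machine unlearning for online L-BFGS as a counterfactual state-alignment problem. By introducing state-aware metrics and a counterfactual oracle model, it shows that unlearning is not merely a parameter-correction problem but also requires alignment with a realizable counterfactual optimizer state.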

English abstract

We formulate machine unlearning for online L-BFGS as a counterfactual state-alignment problem. Given an actual event stream and a deletion-edited counterfactual stream, the target of unlearning is the optimizer state that would have arisen had the deleted samples never been processed. We introduce state-aware metrics that separately measure parameter error, memory-operator error, combined state error, and update-direction error. The memory metric compares the inverse-Hessian actions induced by the o-L-BFGS memory, rather than treating curvature pairs as of finite influence. Under convexity assumptions, we derive a recursive bound on counterfactual state deviation. We then evaluate a state-aware benchmark of deletion interventions, including memory-only and parameter-only corrections, against a counterfactual oracle model. These results show that unlearning for online L-BFGS is not merely a parameter-correction problem: it requires alignment with a realizable counterfactual optimizer state.
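
The inverse-Hessian action induced by an L-BFGS memory is computed with the standard two-loop recursion, sketched below in plain Python. The `memory_operator_error` helper is a hypothetical illustration of comparing two memories by their actions on probe vectors; it is not the metric defined in the paper.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]  # curvature pair (s, y)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def inverse_hessian_action(memory: Sequence[Pair], g: Vector) -> Vector:
    """Apply the L-BFGS inverse-Hessian approximation to g (memory is oldest first)."""
    q = list(g)
    alphas = []
    for s, y in reversed(memory):          # first loop: newest to oldest
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    gamma = dot(memory[-1][0], memory[-1][1]) / dot(memory[-1][1], memory[-1][1]) if memory else 1.0
    r = [gamma * qi for qi in q]           # initial scaling H0 = gamma * I
    for (s, y), alpha in zip(memory, reversed(alphas)):  # second loop: oldest to newest
        beta = dot(y, r) / dot(y, s)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r


def memory_operator_error(mem_a: Sequence[Pair], mem_b: Sequence[Pair], probes: Sequence[Vector]) -> float:
    """Hypothetical metric: largest Euclidean gap between the two actions on probes."""
    gaps = []
    for g in probes:
        ra, rb = inverse_hessian_action(mem_a, g), inverse_hessian_action(mem_b, g)
        gaps.append(sum((a - b) ** 2 for a, b in zip(ra, rb)) ** 0.5)
    return max(gaps)


# Pairs from a quadratic with Hessian diag(2, 4): y = H s.
full_memory: List[Pair] = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 4.0])]
action = inverse_hessian_action(full_memory, [2.0, 4.0])
gap = memory_operator_error(full_memory, full_memory[:1], [[2.0, 4.0]])
```

Dropping one curvature pair changes the operator even when the parameters are untouched, which is the sense in which the memory is part of the optimizer state.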

2605.17588 2026-05-19 cs.CV

MSIQ: Moment-based Scale-Invariant Quality Measure for Single Image Super-Resolution

Leonid Bedratyuk

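AI summary: This paper proposes a Moment-based Scale-Invariant Quality measure (MSIQ) for assessing single image super-resolution (SISR) results. By comparing the normalized central geometric moments of two images, it can directly compare images of different spatial resolutions, is mathematically deterministic with an analytical form, and addresses two weaknesses of traditional metrics: they do not assess preservation of geometric structure, and forced resizing introduces error.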

Comments 23 pages

English abstract

Assessing the quality of single image super-resolution (SISR) results remains an open methodological problem. Common full-reference metrics (PSNR, SSIM, LPIPS) do not explicitly evaluate the preservation of the geometric structure of images, which is critical for the correctness of scale-based reconstruction. In addition, they require the forced alignment of images to the same size (forced resizing), which introduces an external interpolation error into the evaluation process. This paper proposes a diagnostic scale-invariant quality measure, MSIQ (Moment-based Scale-Invariant Quality), based on the comparison of normalized central geometric moments of two images. MSIQ enables direct comparison of images with different spatial resolutions without resizing, is mathematically deterministic (model-free), and has an analytical form. To provide a theoretical basis for the approach, we introduce a conceptual distinction between the ability of metrics to monotonically track degradation (tracking ability) and their geometric selectivity (geometric specificity). The experimental validation confirmed the stability of MSIQ under uniform scaling and, at the same time, revealed the high sensitivity of traditional metrics to the choice of interpolation method. The results show that MSIQ has pronounced geometric selectivity: the proposed measure effectively separates geometric deformations from non-geometric artifacts, in particular JPEG compression, unlike pixel-based and perceptual metrics. It is also shown that the response of MSIQ to structural perturbations remains stable across different classes of SR algorithms, including DNN models with different architectures. The proposed measure is a complementary diagnostic tool for domains where geometric fidelity has priority, in particular medical imaging and remote sensing.
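
The scale invariance that MSIQ relies on can be illustrated with the textbook normalized central moment, eta_pq = mu_pq / mu_00^(1 + (p+q)/2). The sketch below is not the MSIQ formula itself, which the abstract does not give; the test image and helper names are assumptions.

```python
from typing import List

Image = List[List[float]]


def normalized_central_moment(img: Image, p: int, q: int) -> float:
    """Textbook normalized central moment eta_pq of a grayscale image."""
    m00 = sum(sum(row) for row in img)
    cx = sum(x * v for row in img for x, v in enumerate(row)) / m00
    cy = sum(y * v for y, row in enumerate(img) for v in row) / m00
    mu = sum(
        ((x - cx) ** p) * ((y - cy) ** q) * v
        for y, row in enumerate(img)
        for x, v in enumerate(row)
    )
    return mu / m00 ** (1 + (p + q) / 2)


def upscale_nearest(img: Image, k: int) -> Image:
    """Enlarge an image k times by pixel replication."""
    return [[v for v in row for _ in range(k)] for row in img for _ in range(k)]


# A 16x12 image containing a bright 12x6 rectangle (toy data).
small: Image = [[1.0 if 2 <= x < 14 and 3 <= y < 9 else 0.0 for x in range(16)] for y in range(12)]
large = upscale_nearest(small, 3)

eta_small = normalized_central_moment(small, 2, 0)
eta_large = normalized_central_moment(large, 2, 0)
```

The two values agree to within about one percent even though the images differ in size by a factor of three; the small residual comes from pixel discretization, not from the normalization.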

2605.17584 2026-05-19 cs.CV

VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

Zhijing Lu, Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal

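AI summary: This work proposes the VVitCutLER framework, which uses temporal consistency to improve pseudo-label quality in videos, improving object detection and instance segmentation performance and reducing temporal instability.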

Comments 11 figures, cvpr workshop

English abstract

Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporally stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency. Meanwhile, VitCut uses a distillation decoder to achieve effective instance mask prediction. Then, based on VitCut, VVitCutLER further integrates cross-frame feature aggregation to enhance video-level robustness. Extensive experiments on standard video benchmarks demonstrate that VVitCutLER significantly improves detection and segmentation performance while reducing temporal instability. These results highlight the importance of temporally consistent supervision for robust pixel-level video understanding.

2605.17583 2026-05-19 cs.CV

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

Bin Kang, Shaoguo Wen, Yang Fan, Shunlong Wu, Junjie Wang, Yulin Li, Junzhi Zhao, Junle Wang, Zhuotao Tian

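AI summary: This paper proposes AgentSteerTTS, a multi-agent closed-loop framework that introduces an adversarial disentanglement agent, a Dual-Stream Anchoring Controller, and a Fast-Slow Feedback Agent for intent-faithful expressive control of composite instructions. Experiments show significant improvements on a composite-instruction benchmark and public test sets.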

Comments Project page: https://kane2kang.github.io/AgentSteerTTS/

English abstract

While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback Agent refines output intensity through latent gradient correction and resolves semantic-acoustic mismatches using high-level perceptual critique. Experiments on a composite-instruction benchmark and public test sets show that AgentSteerTTS yields consistent and significant improvements over the baselines, demonstrating the effectiveness of the proposed method.

2605.17582 2026-05-19 cs.LG cs.CE

Scale-Equivariant Generative Forecasting: Weight-Tied Dilated Convolutions, Wavelet Scattering Inputs, and Spectral-Consistency Training for Self-Similar Time Series

Andrea Morandi

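AI summary: This paper proposes a scale-equivariant generative forecasting method for self-similar time series, built on weight-tied dilated convolutions, wavelet scattering inputs, and spectral-consistency training, and demonstrates superior performance on S&P 500 daily returns.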

English abstract

Many natural and engineered time series -- equity returns, climate anomalies, turbulent velocities, neural recordings, packet-level network traffic -- are approximately self-similar: their horizon-$T$ distribution is tied to the horizon-$1$ distribution by one scaling exponent $H$. Standard deep generative sequence models (transformers, dilated TCNs, the WaveNet family) ignore this. Their receptive fields are wide, but kernel parameters live independently at every dilation level, yielding a multi-scale architecture, not a scale-equivariant one. We make three contributions. First, we give a precise definition of discrete scale equivariance for 1D causal networks and prove that dyadic dilation commutes (up to boundary effects) with any dilated-convolution stack whose kernel weights are shared across levels. Tying the kernel shrinks the convolutional parameter budget by an $L$-fold factor (where $L$ is depth) and hard-wires self-similarity in as an inductive bias. Second, we wrap this Scale-Equivariant WaveNet (SE-WaveNet) backbone in three components that carry the same prior: a one-level Daubechies-4 wavelet input, a Hurst-FiLM block exposing the local scaling exponent, and a spectral-consistency training term targeting the $|f|^{-(2H+1)}$ power-law spectrum. The head is a conditional normalising flow, chosen to preserve equivariance. Third, on 30 years of S&P 500 daily log-returns, SE-WaveNet samples reproduce the empirical scaling-collapse diagnostic on the Allan-Variance top-25 universe (median $\mathcal{C}^\star = 0.020$), while a vanilla WaveNet at matched capacity does not ($\geq 0.06$). NLL, KS-calibration, and tail energy distance tie or beat the baseline, with $L\times$ fewer convolutional parameters.
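
The commutation property behind weight tying can be checked numerically for a single linear layer. In this sketch, dyadic dilation is modeled as zero insertion (stretching time by two), which is an assumption about the paper's operator; with the same kernel, convolving the stretched signal at dilation 2 equals stretching the dilation-1 output.

```python
from typing import List, Sequence


def causal_dilated_conv(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """y[n] = sum_k kernel[k] * x[n - k*dilation], treating samples before the start as zero."""
    return [
        sum(w * x[n - k * dilation] for k, w in enumerate(kernel) if n - k * dilation >= 0)
        for n in range(len(x))
    ]


def dyadic_dilate(x: Sequence[float]) -> List[float]:
    """Stretch time by 2 by inserting a zero after every sample (assumed operator)."""
    out: List[float] = []
    for v in x:
        out.extend([v, 0.0])
    return out


kernel = [0.5, -0.25, 0.125]              # one kernel shared across dilation levels
x = [1.0, 2.0, -1.0, 3.0, 0.5, -2.0]

lhs = causal_dilated_conv(dyadic_dilate(x), kernel, dilation=2)
rhs = dyadic_dilate(causal_dilated_conv(x, kernel, dilation=1))
```

If each dilation level had its own kernel, `lhs` and `rhs` would generally differ, which is why untied stacks are multi-scale but not scale-equivariant.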

2605.17580 2026-05-19 cs.AI

ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

Zhikang Chen, Yue Wang, Sen Cui, Yu Zhang, Changshui Zhang, Tianling Ren, Tingting Zhu

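AI summary: This paper proposes an ECG world model for action-conditioned prediction of cardiac electrophysiology. It integrates physiological ordinary differential equation priors to improve the physiological plausibility of post-intervention ECG trajectories and introduces an uncertainty-aware evaluation strategy for more reliable assessment of candidate interventions.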

English abstract

Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.
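
The uncertainty-aware evaluation can be pictured as summarizing the risk over several stochastic samples per intervention. The mean-plus-spread ranking rule, the `risk_aversion` weight, and the sample values below are illustrative assumptions, not the paper's scoring scheme.

```python
import statistics
from typing import Dict, List, Sequence, Tuple


def summarize_risk(sampled_risks: Sequence[float]) -> Tuple[float, float]:
    """Expected risk and its variability across stochastic simulation samples."""
    return statistics.fmean(sampled_risks), statistics.pstdev(sampled_risks)


def rank_interventions(samples: Dict[str, Sequence[float]], risk_aversion: float = 1.0) -> List[str]:
    """Order interventions by mean risk plus a penalty on variability (assumed rule)."""
    def score(name: str) -> float:
        mean, spread = summarize_risk(samples[name])
        return mean + risk_aversion * spread

    return sorted(samples, key=score)


# Risk scores from repeated simulated rollouts per intervention (toy values).
candidates = {
    "drug_A": [0.2, 0.2, 0.2],   # stable outcome
    "drug_B": [0.0, 0.1, 0.5],   # similar mean, much higher variability
}
ranking = rank_interventions(candidates)
```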

2605.17577 2026-05-19 cs.CV

TAME: Test-Time Adversarial Prompt Tuning via Mixture-of-Experts for Vision-Language Models

Xin Wang, Yixu Wang, Jiaming Zhang, Ruofan Wang, Jiaqi Yu, Kai Chen, Jingjing Chen, Xingjun Ma, Yu-Gang Jiang

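AI summary: This paper proposes TAME, a test-time defense based on a Mixture-of-Experts architecture that aims to improve the robustness of vision-language models under adversarial perturbations while preserving generalization on clean samples.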

English abstract

Large-scale pre-trained Vision-Language models (VLMs), such as CLIP, exhibit strong zero-shot generalization, yet remain highly vulnerable to imperceptible adversarial perturbations, raising serious safety concerns for open-world deployment. To enhance robustness without requiring downstream task-specific retraining, we propose TAME, a novel test-time defense. Building upon our prior Test-Time Adversarial Prompt Tuning (TAPT), TAME introduces an architectural reformulation by replacing TAPT's single adaptive prompt with an input-conditioned Mixture-of-Experts (MoE) framework, enabling more expressive and adaptive defense. Specifically, TAME maintains a bank of learnable expert prompts and employs an input-dependent routing mechanism to aggregate a customized prompt mixture for each unlabeled test sample at inference time. This test-time defense mechanism is driven by three unsupervised objectives: (1) multi-view prediction entropy minimization, (2) layer-wise alignment of visual token statistics to precomputed clean and adversarial reference distributions, and (3) MoE regularization for balanced expert utilization and prompt diversity. We evaluated TAME on 11 benchmark datasets, including ImageNet and 10 additional zero-shot datasets. The results show that TAME improves the zero-shot adversarial robustness of the original CLIP by at least 49.1% under AutoAttack while largely preserving generalization on clean samples. TAME also consistently outperforms existing adversarial prompt tuning methods across multiple prompt designs, yielding an average robustness gain of at least 30.2%.
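
Two of the ingredients named above, input-dependent mixing of expert prompts and multi-view prediction entropy, can be sketched in plain Python. The softmax gating, vector shapes, and function names are assumptions made for illustration; they are not taken from the TAME code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_prompts(router_logits: Sequence[float], expert_prompts: Sequence[Sequence[float]]) -> List[float]:
    """Blend expert prompt vectors with softmax gates from a per-input router."""
    gates = softmax(router_logits)
    dim = len(expert_prompts[0])
    return [sum(g * p[i] for g, p in zip(gates, expert_prompts)) for i in range(dim)]


def multiview_entropy(view_probs: Sequence[Sequence[float]]) -> float:
    """Entropy of the class distribution averaged over augmented views."""
    n = len(view_probs)
    mean = [sum(p[c] for p in view_probs) / n for c in range(len(view_probs[0]))]
    return -sum(p * math.log(p) for p in mean if p > 0)


mixed = mix_prompts([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
agreeing = multiview_entropy([[0.9, 0.1], [0.9, 0.1]])
disagreeing = multiview_entropy([[0.9, 0.1], [0.1, 0.9]])
```

Minimizing the multi-view entropy pushes the prompt mixture toward predictions that are both confident and consistent across views, without needing labels.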

2605.17575 2026-05-19 cs.LG cs.AI

UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts

Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Yong Cui

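AI summary: This paper proposes UniAlign, a model-agnostic framework that improves the robustness of deep learning-based network traffic classification models under distribution shifts through domain alignment fine-tuning and stable model ensembling. Experiments show it outperforms existing baselines in both accuracy and F1 score.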

English abstract

Network traffic classification (NTC) models often suffer severe performance degradation when deployed in real-world environments due to distribution shifts caused by changing network conditions. Existing robustness-enhancing approaches are commonly coupled to specific model architectures or data settings, fail to generalize to state-of-the-art raw-byte-based NTC models, or incur significant training overhead. In this paper, we propose UniAlign, a novel model-agnostic framework that improves the robustness of deep learning-based NTC models under distribution shifts. UniAlign combines domain alignment fine-tuning, which encourages the learning of domain-invariant traffic representations across heterogeneous network conditions, with stable model ensembling, which enhances inference robustness by aggregating checkpoints within a flat loss region. The framework can be seamlessly integrated into existing supervised NTC models without requiring specific feature modalities or introducing non-constant additional training costs. We evaluate UniAlign on three public datasets covering diverse distribution shifts, including encryption schemes, data collection devices, and attack behaviors. Experimental results on two representative NTC models demonstrate that, compared with standard training, UniAlign improves average classification accuracy by 2.51% and average F1 score by 2.71%, outperforming the strongest baseline by 1.45% in accuracy and 1.69% in F1 score, while requiring only 12.4%--53.9% of the training time of all NTC-specific baselines.
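
One common way to aggregate checkpoints is uniform weight averaging, sketched below. Whether UniAlign averages weights or predictions, and how it selects checkpoints in the flat loss region, is not stated in the abstract, so this is an assumed variant with toy parameter values.

```python
from typing import Dict, List, Sequence

Checkpoint = Dict[str, List[float]]  # parameter name -> flattened values


def average_checkpoints(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Uniformly average each parameter across checkpoints (assumed aggregation rule)."""
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


# Two toy checkpoints taken late in fine-tuning (illustrative values).
ckpt_a: Checkpoint = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b: Checkpoint = {"w": [3.0, 4.0], "b": [1.0]}
ensemble = average_checkpoints([ckpt_a, ckpt_b])
```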

2605.17573 2026-05-19 cs.CV cs.CR

Deepfake Detection in Social Media: A Temporal Artifact Analysis Using 3D Convolutional Neural Networks

Mohammadreza Rashidi, Raja Hashim Ali, Sami Ur Rahman

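AI summary: This paper proposes a 3D convolutional neural network detector based on R3D-18 that combines a binary cross-entropy loss with a temporal-consistency regularization loss to improve deepfake detection accuracy in high-quality and cross-dataset settings, showing that temporal features are more robust than spatial features under social-media re-encoding.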

Comments 13 pages, 6 figures

English abstract

Synthetic facial videos have proliferated across social media faster than platform moderation can respond, raising the cost of disinformation and identity-based attacks. Frame-level deepfake detectors degrade sharply as generator quality increases; high-quality 128x128 GAN output cuts spatial-only accuracy by five percentage points while leaving temporal inconsistencies largely intact. We address this gap with a 3D Convolutional Neural Network detector based on R3D-18, trained with a composite loss that combines binary cross-entropy with a temporal-consistency regularizer. The model processes 16-frame clips from the DeepfakeTIMIT dataset and is initialized from Kinetics-400 action-recognition weights. We report 92.8% accuracy on intra-dataset evaluation at 128x128 resolution; cross-dataset transfer to FaceForensics++ without fine-tuning reaches 76.4%, rising after minimal fine-tuning. Ablation studies show that transfer learning contributes 7.2 percentage points and face tracking adds 3.5 points, while temporal consistency regularization provides additional gains on high-quality fakes. The results establish that temporal artifacts generalize more broadly than spatial ones, providing a detection signal that survives social-media re-encoding.
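
A composite loss of the kind described can be sketched as binary cross-entropy plus a weighted temporal term. The abstract does not define the regularizer, so the form used here (mean squared change between consecutive frame features) and the weight `lam` are assumptions.

```python
import math
from typing import Sequence


def binary_cross_entropy(prob_fake: float, label: int, eps: float = 1e-7) -> float:
    """BCE for one clip; `label` is 1 for fake and 0 for real."""
    p = min(max(prob_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def temporal_consistency(frame_features: Sequence[Sequence[float]]) -> float:
    """Mean squared change between consecutive frame features (assumed regularizer)."""
    diffs = [
        sum((a - b) ** 2 for a, b in zip(prev, nxt))
        for prev, nxt in zip(frame_features, frame_features[1:])
    ]
    return sum(diffs) / len(diffs)


def composite_loss(prob_fake: float, label: int, frame_features: Sequence[Sequence[float]], lam: float = 0.1) -> float:
    """BCE plus lam times the temporal term; lam is a hypothetical weight."""
    return binary_cross_entropy(prob_fake, label) + lam * temporal_consistency(frame_features)


smooth = [[0.0], [0.1], [0.2]]     # features drift slowly across frames
jittery = [[0.0], [1.0], [0.0]]    # features flicker between frames
loss_smooth = composite_loss(0.8, 1, smooth)
loss_jittery = composite_loss(0.8, 1, jittery)
```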

2605.17571 2026-05-19 cs.CV cs.LG

Stable Routing for Mixture-of-Experts in Class-Incremental Learning

混合专家在类增量学习中的稳定路由

Zirui Guo, Quan Cheng, Da-Wei Zhou, Lijun Zhang

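AI summary This paper studies stable routing for mixture-of-experts models in class-incremental learning and proposes StaR-MoE, a stable routing framework that uses sensitivity-aware routing alignment and asymmetric capacity regularization to improve adaptation to new classes and knowledge retention for old classes.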

Details
AI Chinese abstract

类增量学习(CIL)要求模型在学习新类别时保持先前知识。最近,结合预训练模型与混合专家(MoE)的方法在CIL中受到越来越多关注:它们通常在学习过程中扩展专家,并使用路由器分配权重。然而,现有MoE方法往往忽视了专家扩展引起的路由漂移。一旦引入新的专家,路由器可能会将样本从早期类别重新分配给新加入的专家,从而扰动已建立的专家组合,即使旧专家保持冻结。我们主张,可扩展的MoE在CIL中需要两个互补的性质:稳定的旧类路由用于知识保留和足够的容量利用用于新类适应。为此,我们提出了Stable Routing for MoE(StaR-MoE),一种用于可扩展MoE的路由级别框架。通过结合敏感性感知的路由对齐,StaR-MoE通过敏感性引导的约束将当前旧类路由行为与历史路由分布对齐。同时,StaR-MoE引入了不对称容量正则化,以鼓励有效利用扩展的专家池,而不影响类特定的路由专业化。在四个标准CIL基准上的广泛实验表明,StaR-MoE在平均准确率和最后准确率上均优于现有最先进方法,突显了稳定路由的重要性。

English abstract

Class-incremental learning (CIL) requires models to learn new classes sequentially while preserving prior knowledge. Recently, approaches that combine pre-trained models with mixture-of-experts (MoE) have received increasing attention in CIL: they typically expand experts during learning and employ a router to assign weights across experts. However, existing MoE methods often overlook routing drift induced by expert expansion. Once new experts are introduced, the router may reassign samples from earlier classes to newly added experts, thereby perturbing previously established expert compositions and causing interference even when old experts remain frozen. We argue that expandable MoE in CIL requires two complementary properties: stable old-class routing for knowledge preservation and sufficient capacity utilization for new-class adaptation. To this end, we propose Stable Routing for MoE (StaR-MoE), a routing-level framework for expandable MoE in CIL. By incorporating sensitivity-aware routing alignment, StaR-MoE aligns current old-class routing behavior with historical routing distributions through sensitivity-guided constraints. Complementarily, StaR-MoE introduces asymmetric capacity regularization to encourage effective utilization of the expanded expert pool without compromising class-specific routing specialization. Extensive experiments across four standard CIL benchmarks demonstrate that StaR-MoE consistently improves both average and last accuracy over state-of-the-art methods, highlighting the importance of stable routing.
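
Routing drift after expert expansion can be made concrete with a small sketch. It measures the KL divergence between a stored historical routing distribution and the current router output once new experts exist. The choice of KL and the zero-padding of the historical distribution are assumptions for illustration; the abstract does not give StaR-MoE's sensitivity-guided constraint.

```python
import math

def softmax(logits):
    """Numerically stable softmax over router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routing_drift(hist_probs, cur_logits):
    """KL(hist || cur), with hist zero-padded over newly added experts.

    hist_probs: routing distribution over the old experts (stored earlier).
    cur_logits: router logits over old + new experts for the same sample.
    Assumed drift measure, not the paper's alignment loss.
    """
    cur = softmax(cur_logits)
    padded = list(hist_probs) + [0.0] * (len(cur) - len(hist_probs))
    return sum(h * math.log(h / c) for h, c in zip(padded, cur) if h > 0.0)
```

The drift is zero when the router still sends an old-class sample to the same expert mix and grows as a newly added expert absorbs routing mass, which is the effect the abstract describes even when old experts stay frozen.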

2605.17570 2026-05-19 cs.LG cs.CL

How Off-Policy Can GRPO Be? Mu-GRPO for Efficient LLM Reinforcement Learning

GRPO在离线策略下的可能性:Mu-GRPO用于高效的大语言模型强化学习

Minghao Tian, Yunfei Xie, Chen Wei

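AI summary This paper examines how far off-policy GRPO can be pushed and proposes Mu-GRPO, which reduces rollout-optimization switching overhead to enable efficient LLM reinforcement learning while performing well on multiple benchmarks.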

Details
AI Chinese abstract

组相对策略优化(GRPO)已成为近期大语言模型强化学习中可验证奖励(RLVR)进展的关键推动因素,但通常在低延迟、近策略的 regime 中训练,导致系统开销显著。我们提出一个简单的问题:GRPO可以多离线策略吗?我们证明GRPO类算法可以容忍比之前假设更大的rollout延迟,并提出Mu-GRPO,一种将训练分为少量(例如四个)大序列生成-优化阶段的RL训练框架。这种设计在诱导高rollout延迟的同时大幅减少了rollout-optimization切换开销。为了在延迟数据下稳定学习,Mu-GRPO结合了放松的剪裁(保留有用的延迟rollout梯度)与负优势 veto(移除不稳定后触发后缀更新)。在五个语言模型和多个数学推理基准测试中,Mu-GRPO在性能上与标准GRPO匹配或超过,同时在墙钟训练时间上实现了约2倍的加速,为LLM强化学习建立了显著改进的性能-效率权衡。

English abstract

Group Relative Policy Optimization (GRPO) has been a key driver of recent progress in reinforcement learning with verifiable rewards (RLVR) for large language models, but it is typically trained in a low-staleness, near-on-policy regime that incurs substantial system overhead. We ask a simple question: How off-policy can GRPO be? We show that GRPO-style algorithms can tolerate substantially larger rollout staleness than previously assumed, and propose Mu-GRPO, an RL training framework that organizes training into a small number (e.g., four) of large sequential generation-optimization stages. This design induces high rollout staleness while greatly reducing rollout-optimization switching overhead. To stabilize learning under stale data, Mu-GRPO combines relaxed clipping, which preserves useful stale-rollout gradients, with negative-advantage veto, which removes destabilizing post-trigger suffix updates in negative-advantage responses. Across five language models and multiple math reasoning benchmarks, Mu-GRPO matches or exceeds the performance of standard GRPO while achieving around 2x speedup in wall-clock training time, establishing a substantially improved performance-efficiency trade-off for LLM reinforcement learning.
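
For orientation, the two standard GRPO ingredients the abstract builds on can be sketched as below: group-relative advantages and a clipped importance-ratio surrogate. Relaxed clipping is modeled here simply as a wider clip range, which is an assumption; the negative-advantage veto is omitted because the abstract does not define its trigger.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same prompt)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token.

    ratio is pi_new / pi_old for the token. Stale rollouts push the ratio
    further from 1, so a larger eps_high keeps more of their gradient signal
    (assumed reading of "relaxed clipping").
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

With the default range, a token whose ratio has drifted to 1.5 is capped at 1.2 for a positive advantage, while widening `eps_high` lets the full ratio through.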

2605.17566 2026-05-19 cs.CV

Rethinking Point Clouds as Sequences: A Causal Next-Token Predictive Learning Framework

重新思考点云作为序列:一种因果性下一标记预测学习框架

Yumeng Yao, Jingzhi Dong, Haowen Gu, Tao Chen, Zonghan Wu, Xiaoshui Huang, Yazhou Yao

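AI summary This paper proposes PointNTP, which recasts point-cloud pre-training as a fully causal, decoder-free latent next-token prediction problem. Using local patch partitioning and structured 3D token sequences, it models structural dependencies in point clouds directly, without a reconstruction decoder or explicit geometric recovery, and experiments show strong results on multiple downstream tasks.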

Comments 10 pages, 2 figures. Code will be released upon acceptance

Details
AI Chinese abstract

随着多模态基础模型和预测预训练的快速发展,一个重要的开放问题是如何为3D点云配备一种更符合下一标记和下一嵌入学习的预训练范式。现有点云自监督方法大多基于掩码重建或显式几何生成,因此仍局限于输入恢复而非预测依赖建模。本文引入PointNTP,将点云预训练重新定义为全因果、无解码器的潜在下一标记预测问题。具体而言,每个点云首先被分割成局部补丁,并根据补丁中心几何结构化为3D标记序列。生成的序列随后通过因果Transformer进行建模,采用仅前缀条件的训练方式,并通过停止梯度目标稳定移位预测目标。该设计使模型能够在潜在空间中直接学习结构依赖,而无需重建解码器或显式几何恢复。大量实验表明,所提出的PointNTP在多个下游任务中表现优异:在ScanObjectNN的OBJ_BG、OBJ_ONLY和PB_T50_RS上分别达到93.8%(+0.5%)、92.6%(+0.3%)和89.3%(+1.1%);在ShapeNetPart上获得85.0%(+0.1%)的Cls.mIoU;在S3DIS Area 5上达到71.1%的mAcc。总体而言,无解码器的因果潜在预测提供了一种简单、可扩展且可能模态无关的点云自监督学习范式,为3D数据的基础式预测学习提供了新的视角。

English abstract

With the rapid progress of multimodal foundation models and predictive pre-training, an important open question is how to equip 3D point clouds with a pre-training paradigm that is better aligned with next-token and next-embedding learning. Existing point-cloud self-supervised methods are largely built on masked reconstruction or explicit geometric generation, and thus remain tied to input recovery rather than predictive dependency modeling. In this paper, we introduce PointNTP, which reformulates point cloud pre-training as a fully causal, decoder-free latent Next-Token Prediction problem. Specifically, each point cloud is first partitioned into local patches and serialized into a structured 3D token sequence according to patch-center geometry. The resulting sequence is then modeled by a causal Transformer under prefix-only conditioning, and trained with a shift-based prediction objective stabilized by stop-gradient targets. This design enables the model to learn structural dependencies directly in latent space, without reconstruction decoders or explicit geometric recovery. Extensive experiments demonstrate that the proposed PointNTP is highly competitive across multiple downstream tasks: it achieves 93.8% (+0.5%), 92.6% (+0.3%), and 89.3% (+1.1%) on OBJ_BG, OBJ_ONLY, and PB_T50_RS of ScanObjectNN, respectively; obtains 85.0% (+0.1%) in Cls.mIoU on ShapeNetPart; and reaches 71.1% mAcc on S3DIS Area 5. Overall, decoder-free causal latent prediction provides a simple, scalable, and potentially modality-agnostic paradigm for point-cloud self-supervised learning, offering a new perspective on foundation-style predictive learning for 3D data.
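
The data flow of shift-based next-token prediction can be sketched in a few lines. The lexicographic ordering of patch centers is an assumption (the abstract only says serialization follows patch-center geometry), and the helper names are hypothetical.

```python
def serialize_patches(centers):
    """Return patch indices ordered by center coordinates.

    Lexicographic (x, y, z) order is an assumed stand-in for the paper's
    geometry-based serialization.
    """
    return sorted(range(len(centers)), key=lambda i: centers[i])

def shifted_pairs(tokens):
    """Pair each token with its successor: position t predicts token t + 1.

    In training, the target side would be detached (stop-gradient).
    """
    return list(zip(tokens[:-1], tokens[1:]))

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (prefix only)."""
    return [[j <= i for j in range(n)] for i in range(n)]
```

Because each position sees only its prefix and is scored against the next latent token, no reconstruction decoder is needed in this setup.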

2605.17565 2026-05-19 cs.AI cs.CL

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

泛化还是记忆?国际象棋训练语言模型的脆弱性测试

Ethan Tang

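AI summary This paper asks whether chess-trained language models generalize or memorize. Its tests indicate that their strong performance stems mainly from pattern matching, and it shows that the LLM-Modulo framework improves chess puzzle-solving performance, suggesting that a general-purpose LLM paired with an external verifier is more flexible than training directly on synthetic data.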

Comments 14 pages, 2 figures, 4 tables, 3 equations

Details
AI Chinese abstract

最近的研究对语言模型进行了棋类数据微调,并报告了高基准分数,作为证据表明由此产生的模型可以理解国际象棋规则、以专业水平下完整棋局,或生成基于专家知识的人可读解释。我们训练了KinGPT,一个仅在(位置,最佳移动)对上训练的2500万参数字符级语言模型,其在600个mate-in-N谜题套件上超过了300亿参数的ChessGPT,在20个主题谜题基准上超过了4000亿参数的C1-4B。我们检查了现有文献中关于国际象棋训练语言模型的几个主张,并断言其令人印象深刻的基准性能主要由模式匹配解释。我们还展示了LLM-Modulo,一个验证器在环框架,如何将RedPajama 3B的最佳移动准确率从1.2%提升到21.2%,移动生成有效性从19.3%提升到95.3%,在mate-in-N国际象棋谜题上,与ChessGPT在棋类特定网络语料库上微调所获得的提升相当,但成本仅为后者的一小部分。我们的结果展示了将通用LLM与外部验证器结合,为明确领域提供了一个更灵活的替代方案,而不是直接训练合成数据。我们开源了所有训练/评估代码、数据集、谜题样本和KinGPT模型检查点,以确保可重复性。

English abstract

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, which exceeds the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in the existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to the gains achieved by ChessGPT's fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open-source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
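
A verifier-in-the-loop setup of the kind described above can be sketched as a generate-check-retry loop. Everything below is a toy: `toy_propose` stands in for an LLM call, `toy_verify` for a chess rules engine, and the move list is invented for the example.

```python
def llm_modulo(propose, verify, max_rounds=5):
    """Query `propose` until `verify` accepts a candidate or the budget runs out."""
    feedback = []  # (candidate, critique) pairs fed back to the generator
    for _ in range(max_rounds):
        candidate = propose(feedback)
        ok, critique = verify(candidate)
        if ok:
            return candidate
        feedback.append((candidate, critique))
    return None

LEGAL_MOVES = {"Qh7#", "Nf3", "e4"}  # hypothetical legal moves for one position

def toy_verify(move):
    """Toy external verifier: accept only moves in the legal set."""
    if move in LEGAL_MOVES:
        return True, ""
    return False, f"{move} is not a legal move"

def toy_propose(feedback):
    """Toy generator: return the first guess not already rejected."""
    guesses = ["Qh9", "Ke9", "Qh7#"]  # stand-in for sampled LLM outputs
    tried = {cand for cand, _ in feedback}
    return next(g for g in guesses if g not in tried)

best = llm_modulo(toy_propose, toy_verify)
```

The generator never needs chess-specific training in this arrangement; validity comes from the verifier, which is why move validity can rise sharply without fine-tuning.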

2605.17564 2026-05-19 cs.CV

A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation

具有预处理和后处理的条件U-Net管道用于航空RGB到热图像转换

Tseten Sherpa, Sikandar Ali, Shubham Parab, Haoyun Feng, Matthew Dennis, Keenan Gibbons, Verrah Otiende, Geoffrey H. Siwo

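AI summary This paper proposes a simple conditional U-Net architecture that incorporates weather data together with targeted pre- and post-processing to improve aerial RGB-to-thermal image translation; experiments show it outperforms existing methods on PSNR, SSIM, and LPIPS.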

Comments 8 pages, 7 figures, NeurIPS 2026

Details
AI Chinese abstract

配对的RGB-热图像数据在图像融合、目标跟踪和异常检测等应用中显示出显著的实用性;然而,其广泛应用受到对齐的RGB-热图像对有限的限制。RGB到热图像(及反之)转换已成为解决这一挑战的实用解决方案。先前的方法包括条件生成对抗网络(cGANs)如ThermalGAN和基于可扩展插值转换器(SiT)的架构如ThermalGen,已显示出在航空到热图像转换中的强大潜力。在本工作中,我们探索了替代架构,这些架构在保持性能的同时优先考虑简洁性。具体而言,我们提出了一种在瓶颈层中结合天气数据的条件U-Net,辅以在Pix2Pix GAN架构中应用的针对性预处理和后处理技术。我们利用612对RGB和热图像的训练集,并在五折交叉验证后,最终在保留的测试集上进行评估。我们的条件U-Net模型表现最佳,峰值信噪比(PSNR)为14.5485,结构相似性指数测量(SSIM)为0.8095,学习感知图像块相似性(LPIPS)为0.1666。这些结果优于基础ThermalGen模型,后者分别达到了PSNR、SSIM和LPIPS分数为7.56、0.2444和0.6317。我们发现,虽然饱和度增强和对比度增强的预处理以及高斯模糊的后处理提供了可观察的改进,但结合条件数据的效果最为显著。我们的发现巩固了将辅助元数据整合到热图像生成中的潜力,表明此类信息可以作为准确热重建至关重要的环境条件的代理。

English abstract

Paired RGB-thermal data has shown significant utility across a range of applications, including image fusion, object tracking, and anomaly detection; however, its broader adoption is constrained by the limited availability of aligned RGB-thermal image pairs. RGB-to-thermal (and vice versa) image translation has emerged as a practical solution to this challenge. Prior approaches including conditional generative adversarial networks (cGANs) such as ThermalGAN and Scalable Interpolant Transformer (SiT)-based architectures such as ThermalGen have demonstrated strong potential for aerial-to-thermal image translation. In this work, we explore alternative architectures that prioritize simplicity while maintaining performance. Specifically, we propose a conditional U-Net that incorporates weather data at the bottleneck layer, complemented by targeted preprocessing and post-processing techniques applied within the Pix2Pix GAN architecture. We utilize a training set of 612 paired RGB and thermal images, and evaluate over 5-fold cross-validation, ultimately testing on a held-out test set. Our conditional U-Net model performed best, with a peak signal-to-noise ratio (PSNR) of 14.5485, structural similarity index measure (SSIM) of 0.8095, and learned perceptual image patch similarity (LPIPS) of 0.1666. These results outperformed the base ThermalGen model, which attained PSNR, SSIM, and LPIPS scores of 7.56, 0.2444, and 0.6317 respectively. We find that while saturation boost and contrast enhancement for preprocessing and Gaussian blur for post-processing provide observable improvements, the incorporation of conditioning data was most effective. Our findings cement the potential of integrating auxiliary metadata into thermal image generation, suggesting that such information can serve as a proxy for environmental conditions critical to accurate thermal reconstruction.
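
As a reference for the PSNR figures quoted above, the standard definition is 10·log10(MAX² / MSE). The sketch below assumes flattened pixel lists and an 8-bit peak value of 255; the abstract does not state the peak value or unit used in the paper.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    max_value=255 assumes 8-bit images (an assumption for this example).
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means lower mean squared error against the ground-truth thermal image; SSIM and LPIPS complement it by scoring structure and perceptual similarity.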

2605.17562 2026-05-19 cs.LG cs.AI cs.HC

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

超越准确率:EEG基础模型的鲁棒性、可解释性和表达性

Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

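AI summary This paper studies the robustness, interpretability, and expressiveness of EEG foundation models by benchmarking six EEG-FMs and a baseline deep learning model on eight datasets, revealing how the models behave under different perturbations and characterizing their interpretability and expressiveness.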

Details
AI Chinese abstract

EEG基础模型(EEG-FMs)主要在干净且分布内的准确性上进行了评估,其鲁棒性、可解释性和表征质量尚未得到充分考察。本研究通过在八个数据集上对六个EEG-FMs和一个基线深度学习模型进行基准测试,填补了这些空白。除了干净准确性外,我们进行了三层分析:(i)鲁棒性:我们应用了测试时扰动,包括加性噪声、随机和区域基于的通道丢弃以及区域特定的噪声注入。我们的分析表明,没有单一模型在所有失败模式中占主导地位。最抗噪的模型在通道丢弃下最为脆弱,当通道被移除而不是零填充时,许多丢弃脆弱性消失。(ii)可解释性:我们首次将注意力感知的层间相关传播(AttnLRP)应用于EEG-FMs,并展示了模型广泛集中在与任务相关的脑区,这与已知的神经生理学一致。然而,属性图在扰动下保持空间稳定,而预测性能下降,表明模型关注正确的脑区,但解码了被破坏的内容。(iii)表达性:通过块状探测,我们显示在微调过程中后期块被重新利用,而早期块已经包含任务相关的信息。此外,我们证明了之前归因于低质量预训练表示的头部-only性能较差,很大程度上是由于池化所致,且当EEG-FMs的token级嵌入被保留时,它们具有足够的表征能力。这些发现为EEG-FMs的鲁棒性、可解释性和表达性提供了首次系统的评估,并突显了其开发中的关键考虑因素。

English abstract

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.

2605.17556 2026-05-19 cs.RO cs.AI

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

视觉雕刻:用于长周期机器人泥塑的视觉对齐规划表示

Peter Schaldenbrand, Jean Oh

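AI summary This paper proposes a visually aligned planning representation for long-horizon robot clay sculpting that captures lighting and texture features, improving the modeling of deformable-material dynamics, and demonstrates its performance across different deformable materials and end-effectors.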

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Details
AI Chinese abstract

泥塑是一种复杂的艺术任务,需要通过长周期规划实现高阶目标。作为机器人问题,我们将泥塑视为形状到形状的匹配挑战。先前的可变形物体 manipulation 工作要么需要为每个目标重新训练策略,要么依赖于动态模型,这些模型将状态表示为稀疏点云,无法良好捕捉泥塑的重要特征,如纹理。我们提出了一种方法,用于建模可变形材料的动力学,并在视觉对齐的表示中为机器人雕刻规划。通过三种不同的可变形材料和各种末端执行器,我们证明我们的动力学模型在性能上与最先进的方法相当,并且具有兼容视觉规划的优势。我们的动作被表示为单个末端执行器向泥塑施加的参数化推力,这已被证明适用于长周期(>100次动作)的泥塑浮雕。最后,我们展示了在视觉对齐表示中规划的好处,同时提供了分析,证明了与3D表示相比,这种表示在规划上更具挑战性。

English abstract

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. Treating it as a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior deformable object manipulation work either requires retraining a policy per goal or relies on dynamics models that represent state as sparse point clouds, which do not capture important clay features such as texture well. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually-aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state-of-the-art with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved to be suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation and provide an analysis of why this representation is more challenging to plan in than 3D representations.

2605.17555 2026-05-19 cs.LG cs.CV

PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation

PFlow-T:基于持续性的拓扑控制生成过程

Snigdha Chandan Khilar

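AI summary This paper proposes PFlow-T, a generative model whose forward process is driven by persistence, using persistent homology to control topological structure; it improves generation of requested Betti numbers and handling of out-of-distribution tasks.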

Details
AI Chinese abstract

当前拓扑感知的扩散模型由于使用高斯噪声进行破坏而存在架构不匹配的问题,通过条件侧通道恢复结构特征。为解决此问题,我们引入PFlow-T,一种生成模型,其前向过程完全基于持续同调。在PFlow-T中,时间度量的是H1拓扑特征如孔的破坏,而非高斯噪声注入。此前向过程根据特征的持续性来消除特征。反向网络则直接反转这种结构破坏以在一步内预测干净状态。在MNIST数字零、一和八上的测试显示,PFlow-T在生成请求的Betti数和处理非分布任务方面显著优于基线模型。PFlow-T是首个使用持续同调作为前向过程的生成架构,尽管我们注意到它目前仅限于低分辨率像素空间代理。

English abstract

Current topology-aware diffusion models face an architectural mismatch: they use Gaussian noise for corruption while recovering structural features through conditional side channels. To fix this, we introduce PFlow-T, a generative model that bases its forward process entirely on persistent homology. In PFlow-T, time measures the destruction of H1 topological features, such as holes, rather than Gaussian noise injection. This forward process eliminates features based on their persistence. The reverse network then directly inverts this structured corruption to predict the clean state in one step. Tests on MNIST digits zero, one, and eight show that PFlow-T significantly outperforms a baseline model in generating requested Betti numbers and handling out-of-distribution tasks. PFlow-T is the first generative architecture to use persistent homology for the forward process, although we note it is currently limited to low-resolution pixel-space proxies.

2605.17552 2026-05-19 cs.LG

Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning

Q-LocalAdam: 一种内存高效的边缘联邦学习客户端自适应优化方法

Vedant Waykole, Haroon R. Lone

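AI summary This paper proposes Q-LocalAdam, an adaptive optimization method for edge federated learning under non-IID data and memory constraints. It achieves memory-efficient optimization through distribution-aware 8-bit quantization with block-wise linear encoding and log-space encoding, significantly improving model performance and the capacity for concurrent workloads.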

Details
AI Chinese abstract

边缘设备上的联邦学习必须应对非独立同分布的客户端数据和严格的内存预算。像Adam这样的自适应优化器在数据异质性下稳定训练,但需要存储全精度动量和方差状态,通常使客户端内存开销增加三倍。这限制了在资源受限设备上可部署的模型大小和同时进行的联邦任务数量。我们实证发现,联邦Adam中的动量和方差在统计特性上存在根本差异:动量值对称且有界,而方差跨越八个数量级并具有对数正态结构。受这种不对称性启发,我们提出了Q-LocalAdam,它对动量应用分布感知的8位量化块线性编码,对方差应用对数空间编码,同时保持模型参数在全精度下。在CIFAR-10和CIFAR-100上,针对不同数据异质性(α∈{0.1, 0.5, 1.0, IID}),Q-LocalAdam在中等异质性下实现3.37倍的优化器内存减少,无精度损失,在极端异质性下(如CIFAR-100,α=0.1)实现显著提升(+5.74pp)。多种子验证确认统计显著性(p<0.01)。相比之下,朴素的均匀量化退化到随机性能,证明了分布感知设计的重要性。Q-LocalAdam在内存受限的边缘设备上无需修改联邦协议即可实现更大的模型和更多的并发工作负载。

English abstract

Federated learning on edge devices must cope with non-IID client data and tight memory budgets. Adaptive optimizers like Adam stabilize training under data heterogeneity but require storing full-precision momentum and variance states, often tripling client memory overhead. This limits deployable model sizes and concurrent federated jobs on resource-constrained devices. We empirically observe that momentum and variance in federated Adam exhibit fundamentally different statistical properties: momentum values are symmetric and bounded, while variance spans eight orders of magnitude with log-normal structure. Motivated by this asymmetry, we propose Q-LocalAdam, which applies distribution-aware 8-bit quantization, using block-wise linear encoding for momentum and log-space encoding for variance, while keeping model parameters in full precision. Across CIFAR-10 and CIFAR-100 under varying data heterogeneity ($α \in \{0.1, 0.5, 1.0, \text{IID}\}$), Q-LocalAdam achieves $3.37\times$ optimizer memory reduction with no accuracy loss under moderate heterogeneity and significant improvements under extreme heterogeneity (e.g., +5.74pp on CIFAR-100, $α=0.1$). Multi-seed validation confirms statistical significance ($p<0.01$). In contrast, naive uniform quantization degrades to random performance, demonstrating that distribution-aware design is essential. Q-LocalAdam enables larger models and more concurrent workloads on memory-constrained edge devices without modifying the federated protocol.
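
The asymmetry between the two optimizer states suggests two different 8-bit encodings, sketched below. This is a minimal illustration under assumptions: the per-block scale for momentum, the per-block log range for variance, and the `floor` guard are choices made for the example, not details taken from the paper.

```python
import math

def quantize_momentum(block):
    """Symmetric linear int8 codes plus one float scale per block."""
    scale = max(abs(x) for x in block) / 127.0 or 1.0  # 1.0 guards an all-zero block
    return [round(x / scale) for x in block], scale

def dequantize_momentum(codes, scale):
    return [c * scale for c in codes]

def quantize_variance(block, floor=1e-30):
    """uint8 codes that are linear in log(v), so relative error stays bounded.

    floor is a hypothetical guard against log(0).
    """
    logs = [math.log(max(v, floor)) for v in block]
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / 255.0 or 1.0  # 1.0 guards a constant block
    return [round((l - lo) / step) for l in logs], lo, step

def dequantize_variance(codes, lo, step):
    return [math.exp(lo + c * step) for c in codes]
```

A linear code applied to values spanning eight orders of magnitude would map almost every small variance to zero, whereas spacing the codes evenly in log space keeps the relative error roughly uniform across the range.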

2605.17528 2026-05-19 cs.LG cs.AI cs.CL

CausalSynth: Generating Structurally Sound Synthetic Data

CausalSynth: 生成结构上合理的合成数据

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

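AI summary This paper proposes the CausalSynth framework, which decouples causal structure generation from semantic realization to produce synthetic data that is both consistent with causal mechanisms and semantically rich, addressing the inability of LLMs to guarantee causal correctness when generating synthetic data.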

Comments 15 pages

Details
AI Chinese abstract

大型语言模型(LLMs)能够生成逼真的合成数据,但无法保证其输出符合目标领域的因果机制。我们引入CausalSynth框架,该框架将因果结构生成与语义实现解耦,生成既符合因果机制又语义丰富的合成数据。该框架分为三个阶段:首先,一个结构因果模型(SCM)——一个定义在有向无环图(DAG)上的结构方程组,通过祖先采样生成因果骨架,即满足支配图全局马尔可夫性质的变量赋值;其次,一个LLM作为受约束的实现者,一个条件翻译器,将每个骨架映射到高维观测,如临床笔记或交易日志;第三,一个迭代一致性验证模块通过确定性提取检测结构违规,并将针对性的修正反馈给LLM,形成闭环优化过程。我们识别出语义后门问题,即LLM系统性地用预训练先验覆盖施加的因果事实——并证明我们的迭代机制相对于标准拒绝采样减少了由此产生的选择偏差。在三个因果基准(ASIA、ALARM和MIMIC-Struct)上,CausalSynth在假阳性率接近名义α=0.05水平的情况下保持条件独立性,并在70B参数LLM基础上实现了超过96%的可实现率。该框架还通过保留噪声和图 mutilation 支持原理化的干预和反事实生成。

English abstract

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem (the systematic tendency of LLMs to override imposed causal facts with pre-training priors) and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $α=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
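
The first phase, ancestral sampling from an SCM, can be sketched as follows. The three-variable graph and its probabilities are invented for illustration (loosely in the spirit of the ASIA benchmark, not taken from it), and `do` is a hypothetical helper showing how graph mutilation supports interventions.

```python
import random
from graphlib import TopologicalSorter

def ancestral_sample(parents, equations, rng):
    """Sample every variable in topological order so parents are assigned first."""
    values = {}
    for var in TopologicalSorter(parents).static_order():
        values[var] = equations[var](values, rng)
    return values

def do(equations, var, value):
    """Graph mutilation: replace var's equation with a constant, ignoring its parents."""
    mutilated = dict(equations)
    mutilated[var] = lambda v, rng: value
    return mutilated

# Toy SCM: each node maps to its list of parents; probabilities are illustrative only.
PARENTS = {"smoking": [], "cancer": ["smoking"], "xray": ["cancer"]}
EQUATIONS = {
    "smoking": lambda v, rng: rng.random() < 0.5,
    "cancer": lambda v, rng: rng.random() < (0.3 if v["smoking"] else 0.05),
    "xray": lambda v, rng: rng.random() < (0.9 if v["cancer"] else 0.1),
}

skeleton = ancestral_sample(PARENTS, EQUATIONS, random.Random(0))
```

Each `skeleton` is a plain variable assignment; in the framework described above it would then be handed to the LLM realizer, with the verifier checking that the generated text still encodes the same values.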

2605.17527 2026-05-19 cs.CV

Designing streetscapes from street-view imagery using diffusion models

利用扩散模型从街景图像中设计街道景观

Yuzhou Chen, Yuebing Liang, Lingqian Hu, Kailai Sun, Qingqi Song, Chang Zhao, Shenhao Wang

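AI summary This paper proposes a generative multimodal AI framework that generates alternative streetscapes conditioned on target visual metrics, strengthening visual exploration in urban planning and design.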

Details
AI Chinese abstract

街景图像(SVI)被广泛用于量化城市环境的关键指标,如绿化率、天空和道路视图指数。然而,现有研究大多集中在测量当前的街道景观,很少支持生成替代或不存在的城市场景,这是地理学学科如城市规划和设计中的核心任务。为解决这一差距,我们提出了一种生成多模态AI框架,该框架能够根据目标视觉指标合成替代的街道景观,从而直接探索城市场景。我们首先构建了一个多模态数据集,将SVI与文本描述、分割图、道路掩码以及芝加哥和奥兰多的视觉元素定量指标对齐。使用这个数据集,我们证明扩散模型能够生成逼真且语义一致的街道景观图像,同时响应文本和图像控制。我们的定量评估显示,结合视觉控制可以提高语义一致性,使LPIPS指数降低约6%,同时保持整体视觉真实性。此外,整体语义一致性在奥兰多提高了23.7%,在芝加哥提高了46.4%,通过mIoU指数测量,类别层面的提升甚至超过了100%的改进,特别是在建筑视图指数方面。通过视觉和文本提示,可以精细控制街道景观的生成,当文本和视觉控制冲突时,图像控制始终占主导地位,表明了清晰的控制层次以及进一步发展视觉控制对于城市场景生成的重要性。总体而言,本文为使用SVI和扩散模型进行街道景观生成建立了重要的基准,并展示了生成式AI如何成为一种实用、可扩展且可控的城市场景探索方法。

English abstract

Street-view imagery (SVI) is widely used to quantify key indicators of the urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding 100% for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.

2605.17522 2026-05-19 cs.RO

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

RoboFlow4D: 一种轻量级的流世界模型,面向实时的流引导机器人操作

Sixu Lin, Junliang Chen, Huaiyuan Xu, Zhuohao Li, Guangming Wang, Yixiong Jing, Sheng Xu, Runyi Zhao, Brian Sheil, Lap-Pui Chau, Guiliang Liu

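AI summary This paper proposes RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space, enabling efficient real-time flow-guided robotic manipulation and improving manipulation success rates and computational efficiency.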

Details
AI Chinese abstract

在三维环境中进行规划和行动是现实世界中机器人操作的基本能力。尽管先前工作已经探索了预测流规划器来指导三维操作,但现有方法往往依赖于模块化管道堆叠多个子模型,导致计算开销高且实时性能有限。为了解决这些挑战,我们引入了RoboFlow4D,一种轻量级的流世界模型,通过估计物理3D空间中的时间运动来统一感知和规划。作为一种端到端框架,RoboFlow4D直接从视觉观察和文本指令中预测多帧3D流,提供显式的基于流的规划以指导动作生成。这种设计允许无缝集成到通用动作策略中,形成高效的观察-规划-执行闭环。通过流预测与动作控制之间的慢-快协作,RoboFlow4D实现了实时且资源高效的操纵。在模拟和现实世界设置中的大量实验表明,RoboFlow4D在操纵成功率和计算效率方面持续改进,推动了流引导规划在具身智能中的发展。

English abstract

Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

2605.17517 2026-05-19 cs.RO

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

AffordVLA: 通过隐式特征对齐将 affordance 表示注入到视觉-语言-动作模型中

Weijie Kong, Zhian Su, Wei Yu, Huixu Dong

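AI summary This paper proposes the AffordVLA framework, which injects manipulation-centric affordance representations into vision-language-action models through implicit feature alignment to improve action accuracy; experiments show it outperforms existing methods in simulation and in the real world.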

Comments 13 pages, 10 figures

Details
AI Chinese abstract

最近在视觉-语言-动作(VLA)模型方面的进展显示出在通用机器人操作中的强大潜力。然而,大多数VLA模型的视觉表示往往由全局物体外观主导,难以聚焦于与任务相关的功能交互区域,这限制了它们在非结构化环境中的鲁棒性。现有的基于 affordance 的方法通常依赖于显式的掩码注入或外部感知模块,需要额外的注释,同时引入级联感知误差和推理开销。为了解决这些限制,我们提出 AffordVLA,一个增强的 VLA 框架,通过隐式表示对齐将以操作为中心的 affordance 感知内部化到 VLA 视觉表示中。具体来说,我们构建了一个零样本 affordance 教师,从 RGB 观察和语言指令中提取任务条件的 affordance 视觉表示。AffordVLA 对齐 VLA 的中间视觉表示与由教师提取的 affordance 视觉表示,从而隐式地将以操作为中心的 affordance 感知注入到 VLA 视觉表示中,提高动作准确性。广泛的仿真和现实世界实验表明,AffordVLA 及其 affordance 教师实现了最先进的性能,并优于强大的基线。消融分析显示,AffordVLA 有效重塑 VLA 视觉表示,同时保持推理效率,从而提高操作成功率和训练效率。

English abstract

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual representations from RGB observations and language instructions. AffordVLA aligns the intermediate visual representations of the VLA with the affordance visual representations extracted by the teacher, thereby implicitly injecting manipulation-centric affordance perception into VLA visual representations and improving action accuracy. Extensive simulation and real-world experiments demonstrate that AffordVLA and its affordance teacher achieve state-of-the-art performance and outperform strong baselines. Ablation analyses show that AffordVLA effectively reshapes VLA visual representations while preserving inference efficiency, leading to improved manipulation success rates and training efficiency.

2605.17508 2026-05-19 cs.LG cs.AI

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

BESplit: 偏差补偿分割联邦学习与证据聚合

Yuhan Xie, Chen Lyu, Jingrong Huang

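AI summary This paper proposes the BESplit framework, which uses evidential aggregation and bias-compensated collaboration to address biased optimization and unstable convergence in split federated learning under non-IID data, improving model accuracy and efficiency.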

Details
AI Chinese abstract

分割联邦学习(SFL)通过将模型分割到客户端和服务器之间实现隐私保护的协同训练。然而,在非独立同分布数据分布下,SFL常面临偏差优化和收敛不稳定的问题,而现有解决方案大多借鉴传统联邦学习的技术。在本工作中,我们发现SFL的分割架构本质上改变了客户端信息的表示和协调方式,为超越参数级聚合的偏差补偿提供了机会。基于这一见解,我们提出了BESplit,一个架构感知的框架,利用SFL内在结构来缓解非IID效应。首先,为防止偏见本地数据主导全局更新,我们引入证据聚合(EA)以基于证据不确定性对客户端贡献进行细粒度重新加权。其次,为进一步减少分布偏斜,我们开发了偏差补偿协作(BCC)以通过配对互补客户端对齐分割层表示。最后,双教师蒸馏(DTD)被纳入以同步解耦客户端和服务器模型之间的知识,使本地推理能够独立进行。在五个基准数据集上的广泛实验表明,BESplit在多样化的非IID设置下,准确率、收敛稳定性以及计算效率均优于现有最先进方法。

English abstract

Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

2605.17506 2026-05-19 cs.CV

Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

Xinghua Huang, Zhixiong Yang, Chen Wu, Shengxi Li, Shuaifeng Zhi, Yue Zhang, Qibin Hou, Xin Deng, Jingyuan Xia

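AI summary: This paper proposes the Degradation Frequency Curve (DFC), a frequency-domain representation that explicitly quantifies degradation by measuring band-wise residual-to-degraded energy ratios, providing an effective representation basis for all-in-one image restoration and improving performance and generalization under complex degradation conditions.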

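AI Chinese abstract (translated)

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in the degraded-to-clean mapping rather than as an explicit object that can be measured and manipulated. This limitation is more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled, hard-to-describe degradation effects into a measurable degradation coordinate space. In addition, DFC can be adaptively decomposed into band-wise spectral tokens, so that local degradation responses can be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, yielding state-of-the-art performance and improved generalization under complex degradation profiles.

English abstract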

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.
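
The abstract does not give the exact definition of the curve, so the following is only a hedged sketch of what a band-wise residual-to-degraded energy ratio could look like, not the authors' implementation. It makes several simplifying assumptions: a 1D signal stands in for an image, the residual is taken as degraded minus a reference signal, the bands are equal-width splits of the one-sided spectrum, and a naive DFT is used to keep the example dependency-free. The names `dft_energy` and `band_energy_ratios` are hypothetical.

```python
import cmath
import math
from typing import List


def dft_energy(signal: List[float]) -> List[float]:
    """Squared magnitude of the one-sided DFT (bins 0 .. n//2), computed naively."""
    n = len(signal)
    energies = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        energies.append(abs(coeff) ** 2)
    return energies


def band_energy_ratios(degraded: List[float], reference: List[float],
                       num_bands: int, eps: float = 1e-12) -> List[float]:
    """Per-band ratio of residual energy to degraded-signal energy (assumed form)."""
    residual = [d - r for d, r in zip(degraded, reference)]
    res_energy = dft_energy(residual)
    deg_energy = dft_energy(degraded)
    n_bins = len(deg_energy)
    curve = []
    for b in range(num_bands):
        lo = b * n_bins // num_bands
        hi = (b + 1) * n_bins // num_bands
        # eps avoids division by zero in bands where the degraded signal is empty.
        curve.append(sum(res_energy[lo:hi]) / (sum(deg_energy[lo:hi]) + eps))
    return curve


# Toy example: a low-frequency reference plus a high-frequency perturbation.
n = 16
reference = [math.cos(2 * math.pi * t / n) for t in range(n)]
degraded = [r + 0.5 * math.cos(2 * math.pi * 6 * t / n)
            for t, r in enumerate(reference)]
curve = band_energy_ratios(degraded, reference, num_bands=3)
```

Here the perturbation sits entirely in the highest of the three bands, so the ratio is close to 1 there and close to 0 elsewhere. For images, the same idea would use a 2D FFT with radial frequency bands, and the abstract indicates the curve is estimated from intermediate restorations rather than from a known clean reference.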