arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20878 2026-05-21 cs.LG

CIG: Exploration via Conditional Information Gain

Tim Joseph, Marcus Fechner, Philipp Stegmaier, Karam Daaboul, J. Marius Zöllner

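AI summary: This work proposes the Conditional Information Gain (CIG) reward for exploration in reinforcement learning. A tractable log-determinant objective over an ensemble disagreement kernel yields causal per-step rewards, enabling effective exploration in high-dimensional state spaces.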

Comments 28 pages, 10 figures, 3 tables

Abstract

Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.
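The link between a log-determinant objective and causal per-step rewards can be illustrated with a small sketch. It assumes a generic positive-definite matrix A = I + K built from an RBF kernel over scalar states; the RBF kernel, the `lengthscale` parameter, and the function names are illustrative assumptions, not the paper's ensemble disagreement kernel. Since log det(A) = 2 Σ_t log L_tt for A = L Lᵀ, and L_tt depends only on the leading t×t block of A, each term depends only on the steps seen so far.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(a[i][i] - s)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def per_step_rewards(states, lengthscale=1.0):
    """Per-step terms 2*log(L_tt) of log det(I + K), with K an illustrative RBF kernel."""
    n = len(states)
    a = [[math.exp(-((states[i] - states[j]) ** 2) / (2 * lengthscale ** 2))
          + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    l = cholesky(a)
    return [2.0 * math.log(l[t][t]) for t in range(n)]

rollout = [0.0, 0.1, 3.0, 3.0]   # the last state repeats the third one
rewards = per_step_rewards(rollout)
```

In this toy rollout the repeated final state receives a smaller term than the novel third state, and truncating the rollout leaves the earlier terms unchanged.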

2605.20876 2026-05-21 cs.CL cs.AI

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Zihao Cheng, Hongru Wang, Zeming Liu, Xinyi Wang, Xiangrong Zhu, Yuhang Guo, Wei Lin, Jeff Z. Pan, Yunhong Wang

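AI summary: This paper proposes Terminal-World, an automated pipeline that uses agent skills as the central synthesis primitive, jointly encoding what to accomplish, when to apply, and how to execute, to generate task instructions, environments, and teacher trajectories. From 5,723 constructed training environments the authors train Terminal-World-8B/14B/32B, which outperform terminal-agent baselines on six benchmarks; Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 using only 1.2% of the training data.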

Comments Work in Progress

Abstract

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

2605.20874 2026-05-21 cs.AI cs.SE

Governance by Construction for Generalist Agents

Segev Shlomov, Iftach Shoham, Alon Oved, Ido Levy, Sami Marreed, Harold Ship, Offer Akrabi, Sergey Zeltyn, Avi Yaeli, Nir Mashkif

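AI summary: This paper presents a modular policy-as-code layer that composes with a generalist LLM agent, without model fine-tuning, to deliver predictable, auditable, and compliance-aware behavior in compound workflows, with no need to rebuild the agent for each domain.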

Abstract

Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.
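The five checkpoints named in the abstract can be pictured as gates around an agent loop. The sketch below is a hypothetical illustration only: `governed_run`, the blocked phrase, the playbook text, and the tool names are all invented for this example and are not CUGA's API.

```python
def governed_run(request, plan_tool_calls, tools, approve, high_risk=("delete_record",)):
    """Run one request through five illustrative policy checkpoints."""
    # 1. Intent guard: reject before any planning happens.
    if "drop all" in request.lower():
        return {"status": "blocked", "stage": "intent_guard"}
    # 2. Playbook: policy text injected into the system prompt to steer planning.
    system_prompt = "Playbook: look up the patient before changing any record."
    results = []
    for name, args in plan_tool_calls(system_prompt, request):
        # 3. Tool guide: enforce proper usage at the tool-call boundary.
        if name not in tools:
            return {"status": "blocked", "stage": "tool_guide"}
        # 4. Tool approval: human-in-the-loop gate for high-risk actions.
        if name in high_risk and not approve(name, args):
            return {"status": "blocked", "stage": "tool_approval"}
        results.append(tools[name](**args))
    # 5. Output formatter: filter and structure the final response.
    return {"status": "ok", "stage": "output_formatter", "results": results}
```

Here `plan_tool_calls` stands in for the agent's planner and `approve` for a human reviewer; both are passed in as plain callables so the gates can be exercised without a model.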

2605.20872 2026-05-21 cs.LG cs.AI cs.GR

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

SeungJeh Chung, Geonho Park, Misong Kim, HyeongYeop Kang

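AI summary: This paper proposes CAdam, which recasts densification as a statistical signal verification problem to resolve the densification bottleneck in generative distillation, substantially reducing the number of Gaussians while preserving visual quality.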

Comments Accepted to SIGGRAPH 2026 Conference Papers. 12 pages, 8 figures

Abstract

Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving overall comparable perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.
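The contrast between magnitude-based accumulation and a signed first moment can be seen on scalar toy gradients. This is a minimal sketch under stated assumptions: the Adam-style exponential moving averages, the bias correction, and the `|m| / std` SNR proxy are illustrative choices, and `densification_signals` is a hypothetical name rather than the paper's implementation.

```python
import math

def densification_signals(grads, beta=0.9, eps=1e-12):
    """Return (magnitude sum, bias-corrected first moment, SNR proxy) for scalar gradients."""
    m = v = 0.0
    magnitude_sum = 0.0
    for g in grads:
        magnitude_sum += abs(g)                # magnitude-based accumulation
        m = beta * m + (1 - beta) * g          # signed first moment: opposite signs cancel
        v = beta * v + (1 - beta) * g * g      # second moment
    correction = 1 - beta ** len(grads)        # assumes at least one gradient
    m_hat, v_hat = m / correction, v / correction
    variance = max(v_hat - m_hat * m_hat, 0.0)
    snr = abs(m_hat) / math.sqrt(variance + eps)
    return magnitude_sum, m_hat, snr

noise = [1.0, -1.0] * 50   # zero-mean fluctuation
drift = [0.5] * 100        # small but consistent direction
mag_noise, m_noise, snr_noise = densification_signals(noise)
mag_drift, m_drift, snr_drift = densification_signals(drift)
```

On these inputs the magnitude sum ranks the pure noise above the consistent drift, while the first moment and the SNR proxy rank them the other way around.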

2605.20868 2026-05-21 cs.LG cs.AI cs.SY eess.SY

Runtime-Certified Bounded-Error Quantized Attention

Dean Calver

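AI summary: This paper proposes a tiered KV cache architecture that stores INT8 keys and INT4 values in GPU memory while retaining FP16 originals in system RAM, enabling runtime-certified attention. An error decomposition yields per-head, per-step error bounds that drive adaptive precision selection and a multi-stage fallback ladder, guaranteeing recovery of the exact dense attention output when required.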

Comments 32 pages, 1 figure

Abstract

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.
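The idea of computing a bound online and falling back when it is too loose can be sketched for a single attention logit. This is an illustration under simplifying assumptions, not the paper's two-term decomposition: it uses symmetric round-to-nearest INT8 quantization of one key vector and the elementary bound |q·k − q·k̂| ≤ (scale/2)·‖q‖₁. The names `quantize_int8` and `certified_logit` and the tolerance `tol` are hypothetical.

```python
def quantize_int8(vec):
    """Symmetric round-to-nearest INT8 quantization; returns (integer codes, scale)."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0   # avoid a zero scale for all-zero input
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale

def certified_logit(q, k, tol):
    """Return (logit, error bound, path); use the full-precision key if the bound exceeds tol."""
    codes, scale = quantize_int8(k)
    # Each element is off by at most scale/2, so the dot product is off by at most this:
    bound = 0.5 * scale * sum(abs(x) for x in q)
    if bound <= tol:
        approx = scale * sum(a * c for a, c in zip(q, codes))
        return approx, bound, "int8"
    exact = sum(a * b for a, b in zip(q, k))   # stands in for the retained FP16 original
    return exact, 0.0, "fallback"

q = [0.5, -1.0, 2.0]
k = [0.3, -0.7, 1.27]
loose = certified_logit(q, k, tol=0.05)
strict = certified_logit(q, k, tol=0.001)
```

With the loose tolerance the quantized path is accepted and its error stays within the reported bound; with the strict tolerance the call returns the exact value through the fallback path.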

2605.20866 2026-05-21 cs.LG cs.DC math.OC stat.ML

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

Yassine Maziane, Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

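AI summary: This paper studies Local SGD that combines communication compression, local training, and communication-computation overlap in a heterogeneous-compute setting. The proposed LOSCAR-SGD communicates only a sparse subset of model coordinates and keeps optimizing while communication is in flight, and the paper gives what it describes as the first theoretical guarantees for this combination.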

Abstract

Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.
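The difference between naive overwriting and a delay-corrected merge can be sketched as follows. The abstract does not state the merge rule, so the form used here (averaged value plus the local progress made since the snapshot was sent) is an assumption chosen for illustration, and the paper's rule may differ. All function and variable names are hypothetical.

```python
def naive_overwrite(x_now, averaged):
    """Replace the synchronized coordinates, discarding progress made during the overlap."""
    out = list(x_now)
    for i, value in averaged.items():
        out[i] = value
    return out

def delay_corrected_merge(x_now, x_snapshot, averaged):
    """Assumed rule: averaged value plus the local change accumulated since the send."""
    out = list(x_now)
    for i, value in averaged.items():
        out[i] = value + (x_now[i] - x_snapshot[i])
    return out

x_snapshot = [1.0, 2.0, 3.0]       # local model when the sparse message was sent
x_now = [0.5, 1.5, 2.0]            # local model after steps taken while it was in flight
averaged = {0: 0.8, 2: 2.5}        # averaged values for the sparse coordinate subset
overwritten = naive_overwrite(x_now, averaged)
merged = delay_corrected_merge(x_now, x_snapshot, averaged)
```

Coordinate 1 was not communicated, so both rules leave it at its current local value; they differ only on the synchronized coordinates.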

2605.20865 2026-05-21 cs.LG cs.AI

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh

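AI summary: This paper proposes N-Step Forward-Trace Policy Optimization (NFPO), which augments the PPO surrogate objective with an N-step forward trace to achieve more accurate policy improvement in reinforcement learning with verifiable rewards.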

Abstract

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.
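The "cumulative likelihood ratio of the next N-1 tokens" can be computed from per-token log-ratios. This is a minimal sketch under stated assumptions: truncating the window at the end of the sequence and the function name `forward_trace_weights` are illustrative choices, and how NFPO combines these weights with clipping or masking is not shown.

```python
import math

def forward_trace_weights(log_ratios, n):
    """For each token t, the product of likelihood ratios of tokens t+1 .. t+n-1.

    log_ratios[t] is log(pi_new(token_t) / pi_old(token_t)); the window is
    truncated at the end of the sequence.
    """
    length = len(log_ratios)
    return [math.exp(sum(log_ratios[t + 1:min(t + n, length)])) for t in range(length)]

log_ratios = [0.1, -0.2, 0.3]
w1 = forward_trace_weights(log_ratios, 1)   # empty windows
w2 = forward_trace_weights(log_ratios, 2)   # one token of look-ahead
w3 = forward_trace_weights(log_ratios, 3)   # covers all remaining tokens here
```

With N = 1 every window is empty and every weight is 1, so the per-token PPO ratio is left unchanged; larger N multiplies in more of the following tokens' ratios.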

2605.20856 2026-05-21 cs.RO cs.AI cs.LG

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang

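AI summary: DISC decouples instruction from state-conditioned control via policy generation, addressing the observation leakage caused by task-state entanglement. It performs strongly on several benchmarks, showing that language-generated policy parameters drive behavior.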

Abstract

Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $π_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.
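The structural separation described above can be sketched with a toy hypernetwork. Everything here is a simplifying assumption: a single linear map stands in for the two-stage hypernetwork, the policy is a plain linear layer, and `generate_policy` and `act` are hypothetical names. The sketch only shows that the policy function receives generated weights and an observation, never the instruction.

```python
def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def generate_policy(instruction_embedding, hyper_weights, obs_dim):
    """Hypernetwork (here one linear map) that emits the policy's weight matrix."""
    flat = matvec(hyper_weights, instruction_embedding)
    return [flat[i:i + obs_dim] for i in range(0, len(flat), obs_dim)]

def act(policy_weights, observation):
    """The generated policy sees only the observation, never the instruction."""
    return matvec(policy_weights, observation)

hyper_weights = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
observation = [2.0, 3.0]            # identical scene for both tasks
policy_a = generate_policy([1.0, 0.0], hyper_weights, obs_dim=2)
policy_b = generate_policy([0.0, 1.0], hyper_weights, obs_dim=2)
action_a = act(policy_a, observation)
action_b = act(policy_b, observation)
```

Because the observation is identical, any difference between `action_a` and `action_b` can only come from the instruction-generated weights.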

2605.20853 2026-05-21 cs.SD eess.AS

SEABAD: A Tropical Bird Activity Detection Dataset for Passive Acoustic Monitoring

Muhammad Mun'im Ahmad Zabidi, Mohd Yamani Idna Idris, Norisma Idris

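AI summary: This paper introduces the SEABAD dataset to address the challenges that species richness and acoustic complexity pose for bird activity detection in tropical regions. Balanced bird-present and bird-absent samples and a standardized audio format support efficient acoustic monitoring and low-power inference.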

Comments 14 pages, 4 figures

Abstract

Passive acoustic monitoring (PAM) enables large-scale biodiversity assessment, but continuous recording generates large amounts of non-informative audio, creating challenges for storage, power consumption, and long-term edge deployment. Bird audio detection (BAD), which identifies bird vocalizations, can reduce this burden by filtering irrelevant recordings before downstream analysis. However, most BAD systems are trained on temperate datasets despite tropical soundscapes being denser, more species-rich, and acoustically unpredictable. To address this gap, we introduce SEABAD (Southeast Asian Bird Activity Detection), a dataset of 50,000 curated three-second clips from Southeast Asian soundscapes, evenly balanced between bird-present and bird-absent samples. The dataset spans 1,677 bird species and is standardized to 16 kHz mono audio for embedded and low-power inference. We developed a dual-branch curation pipeline: a six-stage positive-label workflow applied to Xeno-Canto recordings, alongside six source-specific negative-label extractions from environmental datasets. These procedures reduced class imbalance by 13.7% (Gini coefficient: 0.601 to 0.519). A manual audit of 1,000 positive clips confirmed 97.8% +/- 0.9% labeling accuracy. Baseline experiments using MobileNetV3-Small achieved 99.57% +/- 0.25% accuracy and 0.9985 +/- 0.0002 AUC across three random seeds. SEABAD and the full curation pipeline are publicly released to support tropical BAD research and energy-efficient acoustic monitoring.
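The abstract measures class imbalance with a Gini coefficient. As a rough illustration, the sketch below computes the standard mean-absolute-difference form over per-class clip counts. The abstract does not say which variant or which counts the authors used, so both the formula choice and the example counts are assumptions.

```python
def gini(counts):
    """Gini coefficient via the mean absolute difference; 0 means perfectly balanced."""
    n, total = len(counts), sum(counts)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in counts for b in counts)
    return diff_sum / (2 * n * total)

balanced = gini([25, 25, 25, 25])   # every class equally represented
skewed = gini([0, 0, 0, 100])       # one class holds every clip
```

A lower value means clips are spread more evenly across classes, which is the direction of the reported change from 0.601 to 0.519.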

2605.20850 2026-05-21 cs.RO

SmoCap: Unified Scale-Pose Canonicalization with Proxy-Mapped Trust-Region QP

Shihao Li, Naohiko Sugita

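AI summary: SmoCap jointly estimates morphology and posture in a sparse control subspace, addressing the morphology-posture compensation induced by stage-wise workflows. The result is a unified scale-pose canonicalization framework that makes motion canonicalization more practical.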

Comments 11 pages, 6 figures, 4 tables

Abstract

Objective: Stage-wise workflows that separate model scaling and inverse kinematics can induce morphology-posture compensation, resulting in anatomically inconsistent yet numerically acceptable solutions, especially in weakly observed directions. We present SmoCap, a leakage-resistant canonicalization framework that estimates morphology and posture jointly in each local trust-region quadratic program (QP) within a sparse control subspace. Methods: SmoCap solves a constrained trust-region QP with analytical proxy-mapped pose and scale Jacobians. The low-dimensional proxy map stabilizes weakly observed directions and drives coordinated structures. An optional pre-solve provides warm starts in difficult configurations. The framework is evaluated using cohort fluoroscopy knee motion, anthropometric ground truth, and extreme yoga sequences. Results: SmoCap achieved a knee flexion RMSE of 2.9 degrees against fluoroscopy and a pooled anthropometric endpoint error of around 3%. In the leakage audit against segment-wise scaling, SmoCap also reduced marker RMSE, FE error, and anthropometric endpoint error. Proxy coupling preserved expressive and coordinated spine motion with a marginal increase in fitting error (+0.14 mm, +0.6%) against baseline models in the yoga ablation. Median marker RMSE was around 20 mm, and median runtime was 0.204-0.332 ms/frame, consistently achieved in 2-3 iterations. Conclusion: SmoCap provides an externally validated unified coupling-aware scale-pose framework, making externally consistent motion canonicalization practical at dataset scale.

2605.20839 2026-05-21 cs.CV cs.LG

Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models

Jeffrey Wang, Jonathan Gregory, Grigorios G. Chrysos

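AI summary: This paper proposes activation-free polynomial alternatives for image recognition within MetaFormer-style vision models and reports strong performance for the polynomial modules across several datasets.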

Comments Accepted to ICML 2026

Abstract

Modern vision backbones treat pointwise activations (e.g., ReLU, GELU) and exponential softmax as essential sources of nonlinearity, but we demonstrate they are not required within MetaFormer-style vision backbones. We design activation-free polynomial alternatives for three core primitives (MLPs, convolutions, and attention), where Hadamard products replace standard nonlinearities to yield polynomial functions of the input. These modules integrate seamlessly into existing architectures: instantiated within MetaFormer, a modular framework for vision backbones, our PolyNeXt models match or exceed activation-based counterparts across model scales on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness. We also substantially outperform prior polynomial networks at reduced computational cost, showing that polynomial variants of standard modules beat complex custom architectures.
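How a Hadamard product can stand in for a pointwise activation is easy to see in a toy block. The sketch below is a generic second-degree construction assumed for illustration, not the PolyNeXt module: two linear projections are multiplied elementwise and then mapped linearly, so the output is a degree-2 polynomial of the input. `poly_mlp` and the weights are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def poly_mlp(x, w1, w2, w3):
    """Activation-free block: Hadamard product of two projections, then a linear map."""
    a, b = matvec(w1, x), matvec(w2, x)
    return matvec(w3, [ai * bi for ai, bi in zip(a, b)])

w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 1.0], [1.0, -1.0]]
w3 = [[1.0, 1.0]]
x = [1.0, 2.0]
y = poly_mlp(x, w1, w2, w3)
y_scaled = poly_mlp([3.0 * v for v in x], w1, w2, w3)
```

Without bias terms the block is homogeneous of degree 2: scaling the input by 3 scales the output by 9, which a purely linear layer could not do.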

2605.20838 2026-05-21 cs.CV cs.AI

USV: Towards Understanding the User-generated Short-form Videos

Haoyue Cheng, Su Xu, Liwei Jin, Wayne Wu, Chen Qian, Limin Wang

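AI summary: This paper presents the USV dataset for high-level semantic video understanding, defines topic recognition and video-text retrieval tasks on user-generated short-form videos, and proposes two effective baselines, MMF-Net and VTCL.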

Abstract

Several large-scale video datasets have been published in recent years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the User-generated Short-form Video dataset for high-level semantic video understanding. The dataset contains around 224K videos collected from UGC platforms by label queries without extra manual verification and trimming. Although video understanding has made notable progress in recent years, most works focus on instance-level recognition, which is not sufficient for learning the representation of the high-level semantic information of videos. Therefore, we further establish two tasks on USV: topic recognition and video-text retrieval. We propose two unified and effective baseline methods, Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL), to tackle the topic recognition and video-text retrieval tasks respectively, and carry out comprehensive benchmarks to facilitate future research. Our project page is https://usvdataset.github.io.

2605.20837 2026-05-21 cs.CV cs.AI

ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models

Qirui Shen, Wenda Wang, Jiachen Lu, Zilong Huang, Jin Bai, Lei He, Hongxuan Chen, Weixin Huang

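AI summary: This paper presents ArchSIBench, a benchmark for architectural spatial intelligence grounded in architecture, cognitive science, and psychology. Using 17 fine-grained subtasks and 3,000 question-answer pairs, it evaluates a range of VLMs on architectural perception, reasoning, navigation, transformation, and configuration, and finds that most models still trail human evaluators with architectural training in spatial transformation and configuration reasoning.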

Comments 51 pages

Abstract

Architectural spatial intelligence, the ability to recognize and infer architectural space, is fundamental to tasks such as robot navigation, embodied interaction, and 3D scene understanding and generation. Although extensive research has evaluated the basic spatial skills of Vision-Language Models (VLMs) such as relative orientation, distance comparison, and object counting, these tasks cover only the most elementary levels of spatial cognition and largely overlook higher-level cognition of architectural space, including layout understanding, circulation patterns, and functional zoning. In this work, we present ArchSIBench, a Benchmark for Architectural Spatial Intelligence based on the perspectives from architecture, cognitive science, and psychology. ArchSIBench covers five core dimensions: perception, reasoning, navigation, transformation, and configuration, comprising 17 fine-grained subtasks. Through careful manual annotation by experts with architectural backgrounds, we construct 3,000 question-answer pairs to enable comprehensive evaluation of architectural spatial intelligence. Based on ArchSIBench, we evaluate various VLMs and find that the architectural spatial intelligence of most models shows significant differences from human baselines; additionally, models exhibit substantial variability across capability dimensions. Some state-of-the-art models can approach the level of human evaluators without architectural training. However, a clear gap remains compared to human evaluators with architectural training, particularly in spatial transformation and configuration reasoning. We believe that ArchSIBench will provide important insights and systematic resources for measuring and advancing the architectural spatial intelligence of VLMs. The dataset and code are available at https://huggingface.co/datasets/ArchSIBench/ArchSIBench.

2605.20834 2026-05-21 cs.AI cs.LG

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

Zhiqin Yang, Yonggang Zhang, Wei Xue, Dong Fang, Bo Han, Yike Guo

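AI summary: This paper examines the equivalence of DPO and RLHF and shows that it depends on an implicit assumption. When the assumption fails, DPO optimizes relative advantage rather than absolute alignment, which leads to pathological convergence. The authors propose CPO, which adds constraints to obtain provable alignment, and give a geometric interpretation that reveals DPO's margin-ranking mechanism.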

Comments 49 pages

Abstract

Direct Preference Optimization (DPO) has emerged as a popular alternative to Reinforcement Learning from Human Feedback (RLHF), offering theoretical equivalence with simpler implementation. We prove this equivalence is conditional rather than universal, depending on an implicit assumption frequently violated in practice: the RLHF-optimal policy must prefer human-preferred responses. When this assumption fails, DPO optimizes relative advantage over the reference policy rather than absolute alignment with human preferences, leading to pathological convergence where policies decrease DPO loss while preferring dispreferred responses. We characterize when this assumption is violated, show the existence of an undesirable solution space, and prove that DPO and RLHF optimize fundamentally different objectives in such cases. To address this, we introduce Constrained Preference Optimization (CPO), augmenting RLHF with constraints for provable alignment. We further provide a geometric interpretation through soft margin ranking, revealing that DPO implements margin ranking with potentially negative targets. Our theoretical analysis establishes when DPO's guarantees hold and provides solutions preserving simplicity with provable alignment. Comprehensive experiments on standard benchmarks demonstrate that CPO achieves state-of-the-art performance. Code is available at: https://github.com/visitworld123/CPO.
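A two-response toy example shows how the DPO loss can fall while the policy still prefers the dispreferred response. The sketch uses the standard DPO loss, -log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). The probabilities are invented for illustration, and the example is only meant to be consistent with the failure mode described in the abstract, not a reproduction of the paper's analysis.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=1.0):
    """Standard DPO loss for one preference pair (w = preferred, l = dispreferred)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log1p(math.exp(-beta * margin))   # equals -log(sigmoid(beta * margin))

# Hypothetical reference policy that strongly favors the dispreferred response.
ref_w, ref_l = 0.1, 0.9
# Hypothetical trained policy: better than the reference, but y_l is still more likely.
pol_w, pol_l = 0.3, 0.7

loss_at_reference = dpo_loss(math.log(ref_w), math.log(ref_l), math.log(ref_w), math.log(ref_l))
loss_at_policy = dpo_loss(math.log(pol_w), math.log(pol_l), math.log(ref_w), math.log(ref_l))
```

The loss rewards improvement relative to the reference: here it drops below its value at the reference policy even though the trained policy assigns 0.7 to the dispreferred response and only 0.3 to the preferred one.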

2605.20833 2026-05-21 cs.CL

MemGym: a Long-Horizon Memory Environment for LLM Agents

MemGym: 一种长时间跨度的记忆环境用于LLM智能体

Wujiang Xu, Yu Wang, Kai Mei, Kaiqu Liang, Zhenting Wang, Mingyu Jin, Han Zhang, Shi-Xiong Zhang, Wenyue Hua, Sambit Sahu, Dimitris N. Metaxas

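AI Summary This paper presents MemGym, a benchmark environment for evaluating the memory of LLM agents. It unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks across four agentic regimes and reports memory-isolated scores that separate memory performance from reasoning, retrieval, and tool-use ability.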

详情
AI中文摘要

记忆是LLM智能体在长时间任务中运营的核心能力。现有的记忆基准测试主要评估多轮聊天场景中个性化信息的保留能力,忽略了在长时间智能体执行过程中发生的动态记忆形成。因此,它们所生成的记忆系统在现实的智能体环境中(如编程和网络导航)转移效果差。我们提出了MemGym,一个用于智能体记忆的基准测试,它将现有的智能体 gym 和内部记忆基础管道统一到一个记忆推理接口下。MemGym涵盖五个评估赛道,分为四个智能体领域:工具使用对话(tau2-bench)、多轮深度研究搜索(MEMGYM-DR)、编程(SWE-Gym和MEMGYM-CODEQA)、计算机使用(WebArena-Infinity)。MemGym报告出的记忆隔离分数将记忆性能与推理、检索和工具使用能力分离,因此可以独立对记忆策略进行排名。我们的合成管道为MEMGYM-CODEQA和MEMGYM-DR是长度可控的,在每个阶段都经过消融验证,并紧密对齐下游场景。为了使在编程环境中的评估在学术上具有可操作性,我们训练了MemRM,一个轻量级的奖励模型(使用Qwen3-1.7B微调QLoRA),它以快速标量读取的方式评分压缩质量,而不是完整的Docker回放。

英文摘要

Memory is a central capability for LLM agents operating across long-horizon tasks. Existing memory benchmarks predominantly evaluate retention of personalized information in multi-turn chat scenarios, overlooking the dynamic memory formation that occurs during extended agent execution. Consequently, the memory systems they produce transfer poorly to realistic agentic environments, such as coding and web navigation. We present MemGym, a benchmark for agentic memory that unifies existing agent gyms and in-house memory-grounded pipelines behind one memory-reasoning interface. MemGym spans five evaluation tracks grouped into four agentic regimes: tool-use dialogue (tau2-bench), multi-turn deep-research search (MEMGYM-DR), coding (SWE-Gym and MEMGYM-CODEQA), and computer use (WebArena-Infinity). MemGym reports memory-isolated scores that decouple memory performance from reasoning, retrieval, and tool-use ability, so memory strategies can be ranked without those confounders. Our synthetic pipelines for MEMGYM-CODEQA and MEMGYM-DR are length-controllable, ablation-verified at every stage, and tightly aligned with downstream scenarios. To make evaluation on coding environments academically tractable, we train MemRM, a lightweight reward model (Qwen3-1.7B fine-tuned with QLoRA) that scores compression quality as a fast scalar read in place of full Docker rollouts.

2605.20827 2026-05-21 cs.CV

HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction

HyDAR-Pano3D: 一种用于全景到3D重建的混合解耦解剖恢复框架

Yaoyao Yue, Jérôme Schmid, Xiaoshuang Li, Eduardo Delamare, Jinman Kim

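AI Summary This paper proposes HyDAR-Pano3D, which reformulates panoramic-radiograph-to-CBCT reconstruction as a disentangled anatomical recovery problem to reduce the ambiguity of the 2D-to-3D mapping. Experiments show that it outperforms baseline methods on PSNR, SSIM, and Dice, and that it recovers clinically relevant anatomical structures.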

Comments 10 pages

详情
AI中文摘要

全景放射影像(PR)在常规牙科护理中被广泛使用,但其本质上只能提供复杂的三维颅面解剖的二维投影。大多数现有的基于学习的方法试图通过直接回归原生锥束CT(CBCT)体积来计算恢复这种三维信息。然而,这种直接映射要求模型同时学习常见的解剖结构和患者特定的形态变化。这种纠缠的公式使二维到三维的逆问题变得高度模糊,通常会产生过度平滑的重建和模糊的解剖边界。为了解决这个问题,我们提出了HyDAR-Pano3D,一个两阶段框架,将PR到CBCT重建重新公式化为解耦的解剖恢复问题。在第一阶段,一个双编码器网络整合了放射影像特征与SAM衍生的语义先验,以重建一个归一化的标准体积。在第二阶段,一个解剖恢复网络预测一个先验约束的结构变形场,将这个标准体积映射回原空间,恢复个体形态变化。在三个大规模数据集上的实验表明,HyDAR-Pano3D显著优于基线方法(p < 0.05),实现了25.76 dB PSNR,85.70% SSIM,以及83.83%的整体解剖Dice评分。合成的体积成功支持下游的完整牙齿(82.4% Dice)和下颌骨管(72.2% Dice)分割,证明了我们的解耦方法能够保留临床相关的结构,当CBCT数据不可用时,能够实现稳健的解剖感知评估。

英文摘要

Panoramic radiograph (PR) is fundamentally used in routine dental care, but it inherently provides only a two-dimensional (2D) projection of complex three-dimensional (3D) craniofacial anatomy. Most existing learning-based methods attempt to computationally recover this 3D information by directly regressing native cone-beam computed tomography (CBCT) volumes from PR. However, this direct mapping requires the model to simultaneously learn common anatomical structures and patient-specific morphological variations. This entangled formulation makes the ill-posed 2D-to-3D inverse problem highly ambiguous, often producing over-smoothed reconstructions with blurred anatomical boundaries. To address this, we propose HyDAR-Pano3D, a two-stage framework that reformulates PR-to-CBCT reconstruction as a disentangled anatomical recovery problem. In Stage 1, a dual-encoder network integrates radiographic features with SAM-derived semantic priors to reconstruct an arch-normalized canonical volume. In Stage 2, an Anatomical Restoration Network predicts a prior-constrained structured deformation field to map this canonical volume back to the native space, restoring individual morphological variations. Experiments on three large-scale datasets show that HyDAR-Pano3D significantly outperforms baseline methods (p < 0.05), achieving a 25.76 dB PSNR, 85.70% SSIM, and an 83.83% overall anatomical Dice score. The synthesized volumes successfully support downstream segmentation of whole teeth (82.4% Dice) and the inferior alveolar canal (72.2% Dice), demonstrating that our disentangled approach preserves clinically relevant structures to enable robust anatomy-aware assessment when CBCT data is unavailable.

2605.20824 2026-05-21 cs.LG

Markovian Circuit Tracing for Transformer State Dynamic

马尔可夫电路追踪用于Transformer状态动态

Abdullah X

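AI Summary This study proposes Markovian Circuit Tracing (MCT), a method for testing whether transformer activations contain coarse state-transition structure. On synthetic Hidden Markov Model tasks it finds that residual activations carry partial Bayesian belief information, and it reports how well state abstractions recover coarse transition signal across regimes.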

详情
AI中文摘要

许多序列计算更容易通过内部状态的运动来研究,而不是孤立的局部电路。我们引入了马尔可夫电路追踪(MCT),一种用于测试Transformer激活是否包含粗粒度状态转移结构的诊断流程。该基准使用合成的隐马尔可夫模型(HMM)任务,其中潜在状态、转移矩阵、贝叶斯信念向量、贝叶斯最优预测以及强制状态反事实目标都是已知的。在六个HMM家族和每个家族三个种子的情况下,tiny因果Transformer学习接近贝叶斯的下一个token预测器,其平均超额损失为0.0138。残差激活在受控的合成基准中包含部分贝叶斯信念信息。从这些激活中提取的状态抽象在持久和低状态领域恢复粗粒度转移信号最强,在模糊发射和六状态领域则较弱。最清晰的结果来自状态强制。修复恢复的状态质心将KL值从未修复模型中的0.1957降低到0.0532,平均上优于错误状态、均值激活、随机激活和洗牌标签控制。本研究的贡献是一个受控的基准和评估框架,用于Transformer状态动态可解释性,MCT作为简单的参考流程。

英文摘要

Many sequence computations are easier to study as movement through internal states than as isolated local circuits. We introduce Markovian Circuit Tracing (MCT), a diagnostic pipeline for testing whether transformer activations contain coarse state-transition structure. The benchmark uses synthetic Hidden Markov Model (HMM) tasks where latent states, transition matrices, Bayesian belief vectors, Bayes-optimal predictions, and forced-state counterfactual targets are known exactly. Across six HMM families and three seeds per family, tiny causal transformers learn near-Bayes next-token predictors, with mean excess loss over Bayes of 0.0138. Residual activations contain partial Bayesian belief information in this controlled synthetic benchmark. State abstractions extracted from these activations recover coarse transition signal, strongest in persistent and lower-state regimes, and weaker in ambiguous-emission and six-state regimes. The clearest result comes from state forcing. Patching a recovered-state centroid reduces KL to the exact HMM counterfactual target from 0.1957 in the unpatched model to 0.0532 on average, beating wrong-state, mean-activation, random-activation, and shuffled-label controls. The contribution is a controlled benchmark and evaluation framework for transformer state-dynamics interpretability, with MCT as a simple reference pipeline.
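
The Bayesian belief vectors and Bayes-optimal predictions mentioned above come from exact HMM filtering. The sketch below shows that textbook computation for a discrete HMM; the two-state transition and emission matrices are hypothetical examples, not one of the six HMM families used in the benchmark.

```python
def predict_state(belief, transition):
    """Push the current belief through the transition matrix."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]

def belief_update(belief, transition, emission, obs):
    """One step of exact Bayesian filtering in a discrete HMM.

    belief[i]        : P(state_t = i | observations so far)
    transition[i][j] : P(state_{t+1} = j | state_t = i)
    emission[j][o]   : P(observation = o | state = j)
    """
    predicted = predict_state(belief, transition)
    # Weight each state by the likelihood of the new observation, then normalize.
    unnorm = [p * emission[j][obs] for j, p in enumerate(predicted)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def next_token_distribution(belief, transition, emission):
    """Bayes-optimal distribution over the next observation given the belief."""
    predicted = predict_state(belief, transition)
    n_obs = len(emission[0])
    return [sum(p * emission[j][o] for j, p in enumerate(predicted)) for o in range(n_obs)]

# Hypothetical two-state "persistent" HMM: states tend to stay where they are.
T = [[0.9, 0.1], [0.1, 0.9]]
E = [[0.8, 0.2], [0.2, 0.8]]
b = [0.5, 0.5]
for o in [0, 0, 0]:
    b = belief_update(b, T, E, o)
pred = next_token_distribution(b, T, E)
```

A probe that reads `b` out of residual activations, or a comparison of the model's next-token distribution against `pred`, is the kind of check such a benchmark makes possible because both quantities are known exactly.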

2605.20822 2026-05-21 cs.CV

TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection

TERDNet: 用于场景变化检测的Transformer编码器-递归解码器网络

Jiae Yoon, Ue-Hwan Kim

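AI Summary This paper proposes TERDNet, a Transformer Encoder-Recurrent Decoder Network for scene change detection. It combines multi-level feature extraction, a feature fusion module, a recurrent decoder, and an upsampling module to improve the accuracy and robustness of scene change detection.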

Comments 8 pages, 4 figures. Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2026

详情
AI中文摘要

在本文中,我们针对场景变化检测(SCD)这一挑战,其目标是在不同时间拍摄的同一地点的两幅图像之间识别变化。现有的SCD模型通常忽略了不同层之间特征重要性的变化,使用单步解码器限制了细化过程,并且对编码器预训练策略提供了有限的见解。我们提出了TERDNet,一种Transformer编码器-递归解码器网络,旨在克服这些限制。TERDNet由基于Transformer的编码器提取多级表示,一个融合相关体积与这些特征的特征融合模块,一个执行迭代细化的递归3门GRU解码器,以及一个结合卷积和插值的上采样器组成。在四个公开基准上的大量实验表明,TERDNet在性能上始终优于先前的方法,并产生了更准确和详细的变更掩码。消融研究证实了基于分割的预训练的优势以及我们融合设计的有效性。此外,在视角偏移下的鲁棒性测试确认了TERDNet在现实世界机器人系统中的部署潜力,其中可靠的感知至关重要。我们的代码可在https://github.com/AutoCompSysLab/TERDNet上获得。

英文摘要

In this work, we address the challenge of Scene Change Detection (SCD), where the goal is to identify variations between two images of the same location captured at different times. Existing SCD models often overlook the varying importance of features across layers, employ single-step decoders that confine refinement, and provide limited insight into encoder pretraining strategies. We propose TERDNet, a Transformer Encoder-Recurrent Decoder Network designed to overcome these limitations. TERDNet consists of a transformer-based encoder that extracts multi-level representations, a feature fusion module that integrates correlation volumes with these features, a recurrent 3-gate-GRU decoder that performs iterative refinement, and a combined convolution-interpolation upsampler that restores fine-grained resolution. Extensive experiments on four public benchmarks show that TERDNet consistently outperforms prior approaches and produces more accurate and detailed change masks. Ablation studies confirm the benefit of segmentation-based pretraining and the effectiveness of our fusion design. In addition, robustness tests under viewpoint misalignment confirm TERDNet's potential for deployment in real-world robotic systems, where reliable perception is critical. Our code is available at https://github.com/AutoCompSysLab/TERDNet.

2605.20821 2026-05-21 cs.CV cs.RO

VSCD: Video-based Scene Change Detection in Unaligned Scenes

VSCD: 基于视频的非对齐场景变化检测

Jiae Yoon, Ue-Hwan Kim

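AI Summary This study proposes VSCD, a video-based change detection method for unaligned scenes that predicts a pixel-wise change mask for each query frame. A multi-reference model aligns reference features to the query through local patch correspondence and fuses per-candidate change features to decode a high-resolution mask, outperforming existing image- and video-based baselines.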

Comments 18 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

检测环境中变化对于长期自主性至关重要,但大多数变化检测设置假设固定视角、轻微错位或仅少数变化对象。我们引入视频基场景变化检测(VSCD),该方法在给定参考和查询RGB视频的情况下,为每个查询帧预测像素级变化掩码。这两个视频记录于不同时间,且相机运动不受约束,视频之间没有时间同步,许多对象实例可能出现或消失。为研究此设置,我们构建了一个包含超过110万帧的大型基准,这些帧标注了像素级变化掩码,并附有现实世界测试集以评估迁移至现实的性能。我们提出了一种以查询为中心的多参考模型,该模型从变化掩码监督中隐式学习时间匹配,通过局部补丁对应对齐候选参考特征,并在解码高分辨率掩码前使用帧级和补丁级置信度融合每个候选的变化特征。我们的方法在强大的图像和视频基基线中实现了最先进的性能,并通过在移动机器人上部署验证其现实影响,用于两个下游应用——视觉监控和对象增量学习。

英文摘要

Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection (VSCD), which predicts a pixel-wise change mask for each query frame, given a reference and a query RGB video of the same indoor space recorded at different times under unconstrained camera motion. The two videos are not temporally synchronized, and many object instances may appear or disappear. To study this setting, we build a large-scale benchmark with over 1.1 million frames annotated with pixel-accurate change masks, together with a real-world test set for evaluating transfer beyond simulation. We propose a query-centric multi-reference model that learns temporal matching implicitly from change-mask supervision, aligns candidate reference features to the query via local patch correspondence, and fuses per-candidate change features using frame-level and patch-level confidence before decoding a high-resolution mask once per frame. Our approach achieves state-of-the-art performance against strong image- and video-based baselines, and we validate its real-world impact by deploying it on a mobile robot for two downstream applications -- visual surveillance and object incremental learning.

2605.20820 2026-05-21 cs.CV

AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting

AIR: 一种用于自监督前馈2D高斯点散射的 amortized 图像重建框架

Zhaojie Zeng, Yuesong Wang, Yawei Luo, Tao Guan

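AI Summary This paper proposes AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, removing the need for per-image test-time optimization. It uses a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, with an explicit stage control mechanism that activates new primitives only in under-reconstructed regions. A Predict-Optimize-Distill training strategy stabilizes multi-stage prediction, yielding more efficient image reconstruction.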

Comments preprint version

详情
AI中文摘要

2D高斯点散射提供了一种高效的显式图像重建表示,但现有方法仍然需要昂贵的逐图像迭代优化或依赖手工设计的先验知识来分配原始体。我们提出了AIR,一种自监督前馈框架,将迭代高斯拟合 amortized 到单次网络传递中,消除了每张图像测试时的优化需求。AIR采用分阶段残差架构,逐步从重建残差中预测额外的高斯原始体,并结合显式的阶段控制机制,仅在欠重建区域激活新的原始体。一种预测-优化-蒸馏训练策略通过将短周期优化的高斯增量蒸馏回预测器,稳定了多阶段预测。稳定后的预测器随后在各阶段联合微调,并配备图像自适应量化器以实现紧凑的高斯存储。在Kodak和DIV2K上的实验表明,AIR在重建质量上优于代表性的基于高斯的基线方法,同时将编码时间减少到160-300毫秒。代码:https://github.com/whoiszzj/AIR.git

英文摘要

2D Gaussian splatting provides an efficient explicit representation for image reconstruction, but existing methods still require costly per-image iterative optimization or rely on handcrafted priors for primitive allocation. We present AIR, a self-supervised feed-forward framework that amortizes iterative Gaussian fitting into a single network pass, eliminating per-image test-time optimization. AIR adopts a stage-wise residual architecture that progressively predicts additional Gaussian primitives from reconstruction residuals, together with an explicit Stage Control mechanism that activates new primitives only in under-reconstructed regions. A Predict-Optimize-Distill training strategy stabilizes multi-stage prediction by distilling short-horizon optimized Gaussian increments back into the predictor. The stabilized predictor is then jointly finetuned across stages and equipped with an image-adaptive quantizer for compact Gaussian storage. Experiments on Kodak and DIV2K show that AIR achieves better reconstruction quality than representative Gaussian-based baselines while reducing encoding time to 160-300 ms. Code: https://github.com/whoiszzj/AIR.git

2605.20818 2026-05-21 cs.CV

OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026

Yisen Feng, Leigang Qu, Haoyu Zhang, Qiaohui Chu, Meng Liu, Xuemeng Song, Weili Guan, Liqiang Nie

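AI Summary This paper proposes a reranking framework based on a multimodal large language model (MLLM) for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge 2026. It combines candidate segments from the existing localization model OSGNet with the video-language reasoning ability of an MLLM to improve temporal segment localization.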

Comments Champion solution for the Natural Language Queries and GoalStep tracks of the Ego4D Challenge at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

在本报告中,我们展示了在CVPR 2026上Ego4D事件记忆挑战的自然语言查询和目标步 tracks中的冠军解决方案。这两个 tracks 都需要从长且未剪辑的egocentric视频中准确地定位时间片段。为解决这些任务,我们提出了一种基于重排序的框架,该框架有效地利用了多模态大语言模型(MLLM)强大的视频-语言推理能力,同时保持了传统定位流程的效率和候选召回率。具体来说,我们首先从现有的定位模型OSGNet中获得一组候选片段,然后利用MLLM来选择最符合给定查询的片段,从而优化最终的预测。最终,我们的方法在自然语言查询和目标步 tracks中均取得了第一名。我们的代码可在https://github.com/iLearn-Lab/CVPR25-OSGNet上找到。

英文摘要

In this report, we present our champion solutions for the Natural Language Queries and GoalStep tracks of the Ego4D Episodic Memory Challenge at CVPR 2026. Both tracks require accurately localizing temporal segments from long untrimmed egocentric videos. To address these tasks, we propose a reranking-based framework that effectively leverages the strong video-language reasoning capability of multimodal large language models (MLLMs) while preserving the efficiency and candidate recall of conventional localization pipelines. Specifically, we first obtain a set of candidate segments from the existing localization model OSGNet, and then employ an MLLM to select the segment that best matches the given query, thereby refining the final prediction. Ultimately, our method achieved first place in both the Natural Language Queries and GoalStep tracks. Our code can be found at https://github.com/iLearn-Lab/CVPR25-OSGNet.

2605.20815 2026-05-21 cs.CL cs.AI cs.IR cs.LG

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

在消费级硬件上实现GraphRAG:对本地LLMs在医疗EHR模式检索中的基准测试

Peter Fernandes, Ria Kanjilal

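AI Summary This paper studies GraphRAG for healthcare EHR schema retrieval using local LLMs on consumer hardware. It evaluates four models on indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination, and finds that model parameter count and retrieval mode strongly affect the results.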

Comments 9 pages, 1 figure, 5 tables

详情
AI中文摘要

基于图的检索增强生成(GraphRAG)扩展了检索增强生成,以支持对复杂语料库的结构化推理,但其在资源受限、隐私敏感的部署中的可靠性仍不清楚。在医疗领域,电子健康记录(EHR)数据复杂且严格监管,依赖云基于大语言模型(LLMs)会带来成本、延迟和合规性的挑战。本文系统评估了GraphRAG在EHR模式检索中的应用,使用本地部署的开源LLMs。我们实现了Microsoft GraphRAG管道在真实的EHR模式文档上,并基准测试了四种模型,包括Llama 3.1(8B)、Mistral(7B)、Qwen 2.5(7B)和Phi-4-mini(3.8B),这些模型通过Ollama在单个消费级GPU(8 GB VRAM)上部署。我们评估了索引效率、知识图构建、查询延迟、回答质量和幻觉在全局和局部检索模式下的表现。我们的结果揭示了显著差异:Llama 3.1生成最丰富的知识图(1,172个实体),Qwen 2.5达到最佳回答质量(3.3/5),Phi-4-mini因结构化输出错误无法完成流程,而Mistral表现出退化重复行为。我们进一步表明,GraphRAG具有实际容量阈值,其中模型参数低于约7B的模型无法可靠地生成有效的结构化输出并无法完成流程。此外,索引和回答质量在不同模型之间是脱耦的,局部检索在延迟和事实基础方面均优于全局总结,且幻觉减少。这些发现表明,GraphRAG可以在消费级硬件上实现,同时强调了模型选择和检索设计在受监管环境中的重要性。

英文摘要

Graph-based Retrieval Augmented Generation (GraphRAG) extends retrieval-augmented generation to support structured reasoning over complex corpora, but its reliability under resource-constrained, privacy-sensitive deployments remains unclear. In healthcare, where Electronic Health Record (EHR) data is complex and strictly regulated, reliance on cloud-based large language models (LLMs) introduces challenges in cost, latency, and compliance. In this work, we present a systematic evaluation of GraphRAG for EHR schema retrieval using locally deployed open-source LLMs. We implement the Microsoft GraphRAG pipeline on real-world EHR schema documentation and benchmark four models, including Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B), each deployed via Ollama on a single consumer GPU (8 GB VRAM). We evaluate indexing efficiency, knowledge graph construction, query latency, answer quality, and hallucination under both global and local retrieval modes. Our results reveal substantial differences: Llama 3.1 produces the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the best answer quality (3.3/5), Phi-4-mini fails to complete the pipeline due to structured-output errors, and Mistral exhibits degenerate repetition behavior. We further show that GraphRAG exhibits a practical capacity threshold, where models below approximately 7B parameters fail to reliably produce valid structured outputs and cannot complete the pipeline. In addition, indexing and answer quality are decoupled across models, and local retrieval consistently outperforms global summarization in both latency and factual grounding, with reduced hallucination. These findings demonstrate that GraphRAG is feasible on consumer hardware while highlighting the importance of model selection and retrieval design for robust deployment in regulated settings.

2605.20813 2026-05-21 cs.CL

PulseCol: Periodically Refreshed Column-Sparse Attention for Accelerating Diffusion Language Models

PulseCol: 周期性刷新的列稀疏注意力用于加速扩散语言模型

Yanyi Lyu, Letian Chen, Futing Sun, Miao Zhang, Weili Guan, Liqiang Nie

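AI Summary This paper proposes PulseCol, a periodically refreshed column-sparse attention method. Its finer-grained sparsification strategy improves the computational efficiency and speedup of diffusion language models while maintaining model quality.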

详情
AI中文摘要

在扩散大语言模型(dLLMs)的推理过程中,计算成本很高,因为每次去噪步骤都需要重复执行完整的自注意力机制,而没有KV缓存。最近的稀疏注意力方法通过块稀疏计算来缓解这一成本,但只在后期迭代中应用,当模型性能对粗粒度稀疏近似不敏感时,但这种方法在计算效率和加速方面提升有限。这促使我们提出一种更细粒度的稀疏化策略,可以在早期迭代中应用,并利用可重用的稀疏模式,从而实现进一步的效率提升。在本文中,我们介绍了PulseCol,一种用于加速扩散语言模型的周期性刷新列稀疏注意力方法。PulseCol将粗粒度的块稀疏性替换为更细粒度的列稀疏结构,使重要的注意力交互更加精确地保留,同时暴露更大的稀疏性。基于这种列级公式,PulseCol进一步在去噪的早期步骤中识别稀疏模式,并在后续迭代中重用这些模式,在少量中间步骤中刷新它们,以跟踪去噪过程中稀疏注意力模式的变化。实验表明,PulseCol在稀疏性和实际加速方面优于先前的稀疏注意力方法,同时保持模型质量。通过优化的GPU内核实现列稀疏注意力,PulseCol在多个上下文长度上实现了比FlashAttention高达1.95倍的端到端加速。

英文摘要

Inference in diffusion large language models (dLLMs) is computationally expensive, as full self-attention must be repeatedly executed at each step of the denoising process without KV cache. Recent sparse attention methods for dLLMs mitigate this cost via block-sparse computation, which is applied only in later iterations when model performance is less sensitive to coarse-grained sparse approximation, but yields limited improvements in computational efficiency and acceleration. This motivates a finer-grained sparsification strategy that can be applied from earlier iterations and leverages reusable sparsity patterns, enabling further efficiency gains. In this work, we introduce PulseCol, a periodically refreshed column-sparse attention method for accelerating diffusion language models. PulseCol replaces coarse block-level sparsity with a finer-grained column-sparse structure, allowing important attention interactions to be retained more precisely while exposing greater sparsity. Built on this column-level formulation, PulseCol further identifies sparse patterns at the early denoising step and reuses them across subsequent iterations, refreshing them only at a small number of intermediate steps to track the evolution of sparse attention patterns during denoising. Experiments show that PulseCol achieves higher sparsity and greater practical speedup than prior sparse attention methods for dLLMs, while maintaining model quality. Enabled by optimized GPU kernels for column-sparse attention, PulseCol delivers up to 1.95× end-to-end speedup over FlashAttention across several context lengths.
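
The idea of selecting key columns once and reusing them across denoising steps, with occasional refreshes, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the column-scoring rule (total attention mass per key column), the fixed refresh period, and the function names are hypothetical and are not taken from the PulseCol paper or its GPU kernels.

```python
def select_columns(attn, keep):
    """Return the indices of the `keep` key columns with the largest total attention mass."""
    n_cols = len(attn[0])
    col_mass = [sum(row[c] for row in attn) for c in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda c: col_mass[c], reverse=True)
    return sorted(ranked[:keep])

def is_refresh_step(step, period):
    """True when the cached column set should be recomputed (hypothetical fixed period)."""
    return step % period == 0

def columns_per_step(attn_per_step, keep, period):
    """Record which columns each step would attend to, reusing them between refreshes."""
    cached, used = None, []
    for step, attn in enumerate(attn_per_step):
        if cached is None or is_refresh_step(step, period):
            cached = select_columns(attn, keep)  # dense scores are needed only here
        used.append(cached)
    return used

# Toy attention maps (2 queries x 4 keys); the pattern shifts after the first step.
A = [[0.1, 0.6, 0.1, 0.2], [0.05, 0.7, 0.05, 0.2]]
B = [[0.5, 0.1, 0.3, 0.1], [0.4, 0.1, 0.4, 0.1]]
used = columns_per_step([A, B, B, B], keep=2, period=2)
```

In this toy run step 1 still uses the columns chosen at step 0 even though the pattern has moved, and the refresh at step 2 picks up the change; the refresh interval trades staleness against the cost of recomputing dense scores.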

2605.20811 2026-05-21 cs.RO

Demo-JEPA: Joint-Embedding Predictive Architecture for One-shot Cross-Embodiment Imitation

Demo-JEPA: 一种用于单次跨体态模仿的联合嵌入预测架构

Jingyang He, Guangrun Li, Jieyu Zhang, Chengkai Hou, Zhengping Che, Shanghang Zhang

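AI Summary This paper proposes Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. It translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space, and the target agent reaches these subgoals through planning, enabling flexible imitation across heterogeneous embodiments.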

详情
AI中文摘要

机器人模仿学习通常被视为复制演示动作,但动作本质上是体态特定的。当演示来自具有不同形态、运动学或动作空间的人类或机器人时,这种以动作为中心的观点需要共享动作空间、启发式重定向或大规模多体态联合训练。我们相反地将演示视为未来目标的隐含规范:目标代理应推断演示者试图实现的状态,而非演示者如何执行它。我们提出Demo-JEPA,一种跨体态模仿框架,通过基于JEPA的世界模型构建,将源视觉示范转换为目标兼容的未来潜在轨迹,这些轨迹在共享的预测表示空间中。目标代理随后利用这些潜在轨迹作为子目标,并通过其自身学习的向前动力学进行规划以实现它们。由于Demo-JEPA避免了动作层面的对应关系,仅需视觉示范和目标代理自身的交互经验,它支持在异构体态间灵活的模仿。在RLBench和真实世界操作任务中的实验表明,Demo-JEPA在专门的领域规划器中表现优异,并能泛化到未见的任务和体态配置,而此前的方法在此类情况下失效。

英文摘要

Robotic imitation learning is often treated as reproducing demonstrated actions, but actions are inherently embodiment-specific. When demonstrations come from humans or robots with different morphology, kinematics, or action spaces, this action-centric view requires shared action spaces, heuristic retargeting, or large-scale multi-embodiment co-training. We instead view demonstrations as implicit specifications of future goals: the target agent should infer what state the demonstrator is trying to realize, rather than how the demonstrator executes it. We propose Demo-JEPA, a cross-embodiment imitation framework that decouples demonstration intent from embodiment-specific execution. Built on a JEPA-based world model, Demo-JEPA translates source visual demonstrations into target-compatible future latent trajectories in a shared predictive representation space. The target agent then uses these latent trajectories as subgoals and realizes them through planning under its own learned forward dynamics. Because Demo-JEPA avoids action-level correspondence and requires only visual demonstrations plus the target agent's own interaction experience, it supports flexible imitation across heterogeneous embodiments. Experiments on RLBench and real-world manipulation tasks show that Demo-JEPA matches specialized in-domain planners and generalizes to unseen tasks and embodiment configurations where prior methods fail.

2605.20809 2026-05-21 cs.CL

Refining and Reusing Annotation Guidelines for LLM Annotation

对LLM注释的注释指南进行细化和重用

Kon Woo Kim, Jin-Dong Kim, Akiko Aizawa

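AI Summary This paper proposes the systematic reuse and refinement of annotation guidelines, using an iterative moderation framework to improve LLM annotation. On biomedical NER tasks it confirms the efficacy of guideline integration, the advantage of reasoning-optimized models, and the viability of moderation under minimal supervision.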

Comments 14 pages, 7 figures. Accepted to the ACL 2026 Main Conference

详情
AI中文摘要

尽管大型语言模型(LLMs)在零样本注释任务上表现出色,但它们在黄金标准基准的专门惯例上往往表现不佳。我们提出了一种系统性的注释指南重用和细化作为对齐机制,引入了一个迭代审核框架,模拟注释项目的早期阶段。我们评估了三个假设:(1)指南整合的有效性,(2)推理优化模型的优势,以及(3)在最小监督下的审核可行性。在生物医学NER任务(NCBI Disease,BC5CDR,BioRED)上,使用三种LLM家族(GPT,Gemini,DeepSeek)进行测试,我们的结果实验证实了所有三个假设。虽然迭代审核框架在有效细化指南方面显示出良好的潜力,但我们的分析也揭示了大量改进的空间。

英文摘要

While Large Language Models (LLMs) demonstrate remarkable performance on zero-shot annotation tasks, they often struggle with the specialized conventions of gold-standard benchmarks. We propose the systematic reuse and refinement of annotation guidelines as an alignment mechanism, introducing an iterative moderation framework that simulates the early phases of annotation projects. We evaluate three hypotheses: (1) the efficacy of guideline integration, (2) the advantage of reasoning optimized models, and (3) the viability of moderation under minimal supervision. Testing across biomedical NER tasks (NCBI Disease, BC5CDR, BioRED) with three LLM families (GPT, Gemini, DeepSeek), our results empirically confirm all three hypotheses. While the iterative moderation framework shows good potential in effectively refining guidelines, our analysis also reveals substantial room for improvement.

2605.20808 2026-05-21 cs.CV

Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis

基于超高清图像合成的空间图对齐

Jinjin Zhang, Xiefan Guo, Di Huang

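AI Summary This paper proposes Spatial Gram Alignment (SGA), which uses the representation priors of vision foundation models while preserving the generative capacity of LDMs. It addresses the learnability-fidelity conflict that arises in ultra-high-resolution image synthesis and achieves high-quality text-to-image synthesis.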

Comments Technical Report

详情
AI中文摘要

现代超高清图像合成严重依赖大规模预训练潜在扩散模型(LDMs)的强大生成能力。尽管最近的表示对齐方法通过从基础模型(如SAM或DINO)中蒸馏视觉先验到生成潜在特征而有效,但将这些方法扩展到预训练LDMs在极端分辨率下暴露了学习性与保真度之间的关键冲突。具体而言,强制直接的块级特征蒸馏会扰动预训练的潜在流形,最终导致生成退化。为了解决这个瓶颈,我们提出了空间图对齐(SGA),一种新的框架,它明确利用视觉基础模型的表示先验,同时保留LDMs的本原生成能力。超越限制性的直接对齐,SGA通过将生成特征的内部自相似性与基础先验的自相似性对齐,施加一种非侵入性的空间约束。这种空间约束有效地建立了宏观结构的连贯性,而本原的生成目标保留了原始LDMs的微观像素级保真度。值得注意的是,这种通用策略可以无缝整合到预训练LDMs的中间扩散特征和VAE潜在空间中。广泛的实验表明,SGA在超高清文本到图像合成中实现了最先进的性能,有效协调了全局结构完整性和细粒度视觉细节。代码可在https://github.com/zhang0jhon/SGA获取。

英文摘要

Modern ultra-high-resolution image synthesis relies heavily on the robust generative capacity of large-scale pre-trained Latent Diffusion Models (LDMs). While recent representation alignment methods have proven effective by distilling visual priors from foundation models (e.g., SAM or DINO) into generative latent features, scaling these approaches to pre-trained LDMs at extreme resolutions exposes a critical learnability-fidelity conflict. Specifically, forcing direct patch-wise feature distillation inherently perturbs the pre-trained latent manifold, ultimately leading to generation degradation. To address this bottleneck, we propose Spatial Gram Alignment (SGA), a novel framework that explicitly leverages the representation priors of vision foundation models while preserving the native generative capacity of LDMs. Moving beyond restrictive direct alignment, SGA imposes a non-invasive spatial constraint by aligning the internal self-similarities of the generative features with those of the foundation priors. This spatial constraint effectively establishes macroscopic structural coherence, while the native generative objectives retain the microscopic pixel-level fidelity inherent to the original LDMs. Notably, this versatile strategy integrates seamlessly across both intermediate diffusion features and VAE latents within pre-trained LDMs. Extensive experiments demonstrate that SGA achieves state-of-the-art performance for ultra-high-resolution text-to-image synthesis, yielding an effective reconciliation between global structural integrity and fine-grained visual details. Code is available at https://github.com/zhang0jhon/SGA.
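
Aligning self-similarities rather than the features themselves can be sketched in a few lines. The loss below (mean squared difference between cosine self-similarity matrices) is an illustrative assumption, not the exact SGA objective, and the function names are hypothetical. It shows why such a constraint is less invasive than patch-wise distillation: the two feature sets may live in different spaces, and only their patch-to-patch relations are compared.

```python
import math

def self_similarity(features):
    """Cosine self-similarity (Gram) matrix of a list of per-patch feature vectors."""
    normed = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0  # guard against zero vectors
        normed.append([x / norm for x in f])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]

def gram_alignment_loss(gen_features, prior_features):
    """Mean squared difference between the two N x N self-similarity matrices."""
    g = self_similarity(gen_features)
    p = self_similarity(prior_features)
    n = len(g)
    return sum((g[i][j] - p[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Three patches each; generative features are 2-D, foundation-model features are 3-D.
prior = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 0.0, 0.0]]
gen_same_structure = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]       # patches 0 and 2 alike
gen_different_structure = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # patches 0 and 1 alike

loss_same = gram_alignment_loss(gen_same_structure, prior)
loss_diff = gram_alignment_loss(gen_different_structure, prior)
```

The first generative feature set incurs zero loss even though its values and dimensionality differ from the prior, because the pattern of which patches resemble each other is identical.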

2605.20807 2026-05-21 cs.CV

Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction

通过中间结构预测分解主体驱动的图像生成

Hanzhong Guo, Yizhou Yu

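AI Summary This study proposes a two-stage framework that first predicts a Canny map and then generates the final image conditioned on the source appearance and the predicted structure, in order to preserve high-frequency identity details such as logos, patterns, and text in subject-driven text-to-image generation. An automatic pipeline builds a text-aware dataset of 100k pairs, and experiments indicate that intermediate structural prediction improves high-fidelity subject-driven generation.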

详情
AI中文摘要

主体驱动的文本到图像生成仍然难以保留诸如logo、图案和文本等高频率身份细节。现有方法通常直接在RGB空间中操作,这在大规模编辑下常导致细节退化。我们提出了一种两阶段框架,通过首先预测Canny图,然后基于源外观和预测的结构生成最终图像。为提高文本处理能力,我们进一步引入了一个全自动流程,构建了一个包含10万对文本感知数据集,并确保跨视角文本一致性。实验包括基于GPT-4.1的评估和知识蒸馏研究,结果表明在选定基线之上有明显提升,并表明中间结构预测是实现高保真主体驱动生成的有效途径。我们的数据集和代码将向公众开放。

英文摘要

Subject-driven text-to-image generation still struggles to preserve high-frequency identity details such as logos, patterns, and text. Existing methods typically operate directly in RGB space, which often leads to detail degradation under substantial edits. We propose a two-stage framework that decouples structure from appearance by first predicting a Canny map and then rendering the final image conditioned on both the source appearance and the predicted structure. To improve text handling, we further introduce a fully automatic pipeline that constructs a 100k-pair text-aware dataset with cross-view textual consistency. Experiments, including GPT-4.1-based evaluation and a knowledge distillation study, show clear gains over selected baselines and suggest that intermediate structural prediction is an effective route for high-fidelity subject-driven generation. Our dataset and code will be made publicly available.

2605.20804 2026-05-21 cs.CV cs.LG

OlmoEarth v1.1: A more efficient family of OlmoEarth models

OlmoEarth v1.1: 一个更高效的OlmoEarth模型家族

Gabriel Tseng, Yawen Zhang, Favyen Bastani, Henry Herzog, Joseph Redmon, Hadrien Sablon, Piper Wolters, Patrick Alan Johnson, Christopher Wilhelm, Patrick Beukema

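AI Summary This paper presents improvements to the OlmoEarth model family that reduce compute cost during training and inference while maintaining the models' overall performance.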

详情
AI中文摘要

我们介绍了OlmoEarth家族的一系列改进。这些改进使我们在训练过程中减少了计算成本(训练Base模型所需的GPU小时减少了1.7倍),并在Sentinel-2任务中推理时减少了MACs(2.9倍),同时保持了模型的整体性能。所有训练代码均在github.com/allenai/olmoearth_pretrain上提供。

英文摘要

We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training (1.7× reduction in GPU hours required to train our Base models) and inference (2.9× reductions in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at github.com/allenai/olmoearth_pretrain.

2605.20803 2026-05-21 cs.LG cs.AI

Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning

可调MAGMAX:面向持续学习的偏好感知模型融合

Kei Hiroshima, Kento Uchida, Shinichi Shirakawa

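AI Summary This paper proposes Tunable MAGMAX, a model merging framework that introduces a preference vector to control task-specific performance, so that merged models can be adapted to different deployment environments and user preferences in continual learning.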

Comments 17 pages, 4 figures. Accepted at ICPR 2026

详情
AI中文摘要

持续学习(CL)旨在顺序训练多个任务的同时,减轻对之前学习知识的灾难性遗忘。最近在大预训练模型(LPMs)和模型融合技术,如MAGMAX方面的进展,通过结合任务特定参数展示了有效的CL性能。然而,现有方法主要关注所有任务的平均性能,并未充分解决如何构建能够适应不同部署环境或变化用户偏好的模型的问题。本文提出了一种模型融合框架,称为可调MAGMAX,它使持续学习中的任务特定性能能够受到偏好控制。我们的方法引入了一个偏好向量,该向量在模型融合过程中控制从每个任务向量中选择的元素数量,使我们能够根据部署需求调整融合模型的性能。我们进一步提出了一种方法,通过利用少量目标环境数据和模型训练任务的数据集,自动构建合适的偏好向量,从而消除了手动指定的需要。在CL基准任务上的实验结果表明,可调MAGMAX有效地控制了任务层面的性能,并成功地将融合模型适应于各种目标环境。所提出的可调MAGMAX在性能上优于或与基线方法相当,使其成为部署到各种环境中的实用解决方案,其中每个任务的偏好不同。

英文摘要

Continual learning (CL) aims to train models sequentially on multiple tasks while mitigating catastrophic forgetting of previously learned knowledge. Recent advances in large pre-trained models (LPMs) and model merging techniques, such as MAGMAX, have demonstrated effective CL performance by combining task-specific parameters. However, existing methods primarily focus on average performance across all tasks and do not adequately address how to construct models accommodating different deployment environments or varying user preferences. This paper proposes a model merging framework, termed Tunable MAGMAX, which enables preference-aware control of task-specific performance in CL. Our method introduces a preference vector that controls the number of elements selected from each task vector during model merging, allowing us to adjust the merged model performance according to deployment needs. We further propose a method for automatically constructing appropriate preference vectors by leveraging small amounts of target environment data and datasets from model training tasks, thereby eliminating the need for manual specification. Experimental results on CL benchmark tasks demonstrate that Tunable MAGMAX effectively controls task-wise performance and successfully adapts merged models to various target environments. The proposed Tunable MAGMAX achieves superior or comparable performance to baseline methods, making it a practical solution for deploying CL models to various environments where preferences over per-task performance differ.
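
A rough sketch can show how a preference vector might shift the number of elements taken from each task vector. Plain max-magnitude selection (keep, per parameter, the task-vector entry with the largest absolute value) is the usual description of MAGMAX-style merging. The preference weighting below, which scales magnitudes before the comparison, is only a hypothetical stand-in for the paper's mechanism, which the abstract does not spell out.

```python
def magnitude_merge(task_vectors, preference=None):
    """Merge task vectors element-wise by picking the entry with the largest weighted magnitude.

    task_vectors : equal-length lists, each = fine-tuned weights minus pre-trained weights
    preference   : optional positive per-task weights (hypothetical); None = plain max-magnitude
    """
    n_tasks = len(task_vectors)
    if preference is None:
        preference = [1.0] * n_tasks
    merged, chosen = [], []
    for values in zip(*task_vectors):
        # Index of the task whose weighted magnitude wins at this parameter position.
        winner = max(range(n_tasks), key=lambda t: preference[t] * abs(values[t]))
        merged.append(values[winner])
        chosen.append(winner)
    return merged, chosen

# Toy task vectors for two tasks over four parameters.
tv_a = [0.5, -0.1, 0.3, 0.0]
tv_b = [-0.2, 0.4, -0.25, 0.1]

plain, plain_src = magnitude_merge([tv_a, tv_b])
tilted, tilted_src = magnitude_merge([tv_a, tv_b], preference=[1.0, 3.0])
```

With equal weights each task contributes two of the four elements; tilting the preference toward the second task makes it supply all four, which is the kind of per-task control the abstract describes.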

2605.20801 2026-05-21 cs.RO quant-ph

Q-SpiRL: Quantum Spiking Reinforcement Learning for Adaptive Robot Navigation

Q-SpiRL:量子脉冲强化学习用于自适应机器人导航

Mohamed Khair Altrabulsi, Nouhaila Innan, Alberto Marchisio, Muhammad Kashif, Muhammad Shafique

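AI Summary This paper proposes Q-SpiRL, a framework built around a quantum-enhanced spiking neural network for robot navigation in dynamic environments. Experiments show that it achieves the best overall trade-off between task completion, trajectory efficiency, and motion smoothness.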

Comments 11 pages, 6 figures

详情
AI中文摘要

在动态环境中实现自适应机器人导航需要能够可靠到达目标并产生高效稳定轨迹的策略。本文提出了Q-SpiRL,一种用于障碍感知机器人导航的量子脉冲强化学习框架。该框架开发并评估了五个智能体家族:表格Q学习、经典MLP、经典SNN、量子增强MLP(QMLP)和量子增强脉冲神经网络(QSNN)。尽管所有模型均在统一的训练和评估管道下实现,但QSNN是重点研究的中央架构,因为它结合了基于脉冲的时间处理与变分量子特征变换。实验在三个逐渐增大尺寸的网格世界环境中进行,即20x20、30x30和40x40,包含静态和动态障碍。性能评估使用成功率、成功率加权路径长度、路径长度和转弯率,在确定性推理下进行。结果表明,QSNN在最具有挑战性的设置中实现了最强的整体权衡,达到99%的成功率,同时保持高路径效率。在IBM量子硬件上的执行进一步证明了所提出混合策略在真实设备条件下的可行性。

英文摘要

Adaptive robot navigation in dynamic environments requires policies that can reach the target reliably while producing efficient and stable trajectories. This paper presents Q-SpiRL, a quantum spiking reinforcement learning framework for obstacle-aware robot navigation. The framework develops and evaluates five agent families: tabular Q-learning, classical MLP, classical SNN, quantum-enhanced MLP (QMLP), and quantum-enhanced spiking neural network (QSNN). While all models are implemented under a unified training and evaluation pipeline, the QSNN is the central architecture of interest, as it combines spike-based temporal processing with variational quantum feature transformation. Experiments are conducted across three grid-world environments of increasing size, namely 20x20, 30x30, and 40x40, with both static and dynamic obstacles. Performance is assessed using success rate, success-weighted path length, path length, and turn rate under deterministic inference. Results show that QSNN achieves the strongest overall trade-off between task completion, trajectory efficiency, and motion smoothness, reaching up to 99% success rate while maintaining high path efficiency in the most challenging setting. Execution on IBM quantum hardware further demonstrates the feasibility of deploying the proposed hybrid policy under real-device conditions.
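
Two of the reported metrics, success rate and success-weighted path length (SPL), have a widely used definition in navigation benchmarks, sketched below. The abstract does not state the exact formula the authors use, so this follows the common form (mean of S_i * l_i / max(p_i, l_i)) as an assumption, and the episode data are invented.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent reached the target."""
    return sum(1 for success, _, _ in episodes if success) / len(episodes)

def success_weighted_path_length(episodes):
    """SPL: mean over episodes of S_i * l_i / max(p_i, l_i).

    Each episode is (success, shortest_path_length, agent_path_length).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode scores 1.0 only if the agent took a shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)

# Hypothetical grid-world episodes: optimal success, detour success, failure.
episodes = [(True, 10, 10), (True, 10, 20), (False, 10, 15)]
sr = success_rate(episodes)
spl = success_weighted_path_length(episodes)
```

SPL penalizes detours that success rate alone ignores: here two of three episodes succeed, but the detour halves that episode's contribution, giving an SPL of 0.5.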