arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2085
2605.07278 2026-05-11 cs.LG cs.AI cs.CV

Predictive but Not Plannable: RC-aux for Latent World Models

Wenyuan Li, Guang Li, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

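AI summary: This study examines why latent world models can predict accurately yet remain hard to use for long-horizon planning, and identifies spatiotemporal mismatch as the core challenge. The authors propose RC-aux, a lightweight auxiliary objective that aligns the latent space with planning through multi-horizon prediction along the time axis and budget-conditioned reachability supervision along the space axis. Experiments show that RC-aux improves long-horizon planning with latent world models without changing the model backbone.

Abstract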

A latent world model may achieve accurate short-horizon prediction while still inducing a latent space that is poorly aligned with planning. A key issue is spatiotemporal mismatch: these models are often trained with local predictive supervision, but deployed for long-horizon goal-directed search in latent spaces where Euclidean distance may not reflect what is reachable within a finite action budget. We present the Reachability-Correction auxiliary objective (RC-aux), a lightweight correction for this mismatch in reconstruction-free latent world models. RC-aux keeps the world-model backbone unchanged and adds planning-aligned supervision along two axes. Along the time axis, multi-horizon open-loop prediction trains the model beyond one-step consistency. Along the space axis, budget-conditioned reachability supervision, together with temporal hard negatives, encourages the latent space to distinguish states that are eventually reachable from those reachable within the current planning horizon. At test time, the learned reachability signal can also be used by a reachability-aware planner to favor trajectories that are both goal-directed and attainable under the available budget. We instantiate RC-aux on LeWorldModel and evaluate it under both continuation-training and matched-from-scratch settings. Across goal-conditioned pixel-control tasks and a LIBERO-Goal extension, RC-aux improves LeWM-style planning with modest additional cost. These results suggest that planning with latent world models depends not only on predictive accuracy, but also on whether the learned representation encodes the temporal and geometric structure required by downstream search. The code is available at https://github.com/Guang000/RC-aux.

2605.07277 2026-05-11 cs.LG cs.AI

Bifurcation Models: Learning Set-Valued Solution Maps with Weight-Tied Dynamics

Caleb Jore, Jialin Liu

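AI summary: This paper studies how to learn solution-set maps for scientific and combinatorial problems that admit multiple correct solutions. It proposes bifurcation models, weight-tied dynamical systems in which different initial conditions converge to different stable equilibria, so the model represents an attractor landscape rather than a single solution branch. The theory shows that such models can represent a broad class of set-valued maps with locally Lipschitz branches and that the induced selectors are regular almost everywhere, whereas manual selectors can be arbitrarily irregular. Experiments on frustrated Ising models and the Allen-Cahn equation show that the method can discover multiple valid solutions, with a tradeoff between accuracy and diversity.

Abstract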

Many scientific and combinatorial problems admit multiple correct solutions, not a single label. Standard supervised learning resolves this ambiguity by choosing one solution as the target, but this hidden selector can be arbitrary, discontinuous, and harder to learn than the underlying solution set. We study bifurcation models, a weight-tied dynamical view in which different initializations can converge to different stable equilibria, so the model represents an attractor landscape rather than one chosen branch. We prove that broad set-valued maps with locally Lipschitz branches can be represented by regular equilibrium dynamics and that the induced selectors are almost everywhere regular, while manual selectors can be arbitrarily irregular. Experiments on frustrated Ising models show that such dynamics can discover multiple valid equilibria without branch labels and outperform single-branch supervision. Allen-Cahn experiments further show that diversity is not automatic: it can be encouraged explicitly, but with an accuracy-diversity tradeoff.
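
The idea of one weight-tied update reaching different stable equilibria from different initializations can be illustrated with a toy scalar system. This is only a sketch under assumptions: the double-well dynamics, the step size, and the function names `step` and `equilibrium` are illustrative choices, not the paper's model.

```python
def step(x, dt=0.1):
    """One weight-tied update: the same map is applied at every iteration."""
    # Gradient flow of the double-well potential V(x) = (x**2 - 1)**2 / 4
    # (an assumed toy system with stable equilibria at -1 and +1).
    return x + dt * (x - x**3)


def equilibrium(x0, iters=500):
    """Iterate the shared update from the initial state x0 until it settles."""
    x = x0
    for _ in range(iters):
        x = step(x)
    return x


# Different initializations settle on different branches of the solution set.
branches = sorted({round(equilibrium(x0), 6) for x0 in (-2.0, -0.3, 0.4, 1.5)})
print(branches)
```

No branch label is supplied anywhere: the initialization alone decides which equilibrium is returned, which is the sense in which the dynamics represent a set of solutions rather than one selected target.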

2605.07276 2026-05-11 cs.AI

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Jia Li, Yuxin Su, Ting Peng, Hailiang Huang, Yuetang Deng, Michael R. Lyu

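AI summary: This paper studies how signal reshaping can improve GRPO (Group Relative Policy Optimization) for agentic code repair under weak feedback. The authors argue that three kinds of feedback (reward signals, process signals, and execution signals) must be reshaped to improve semantic accuracy and the efficiency of policy updates. Experiments show that the method raises strict compile-and-semantic accuracy and reduces the number of evaluation steps, supporting the value of signal reshaping in long tool-use trajectories.

Abstract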

Code-agent RL often receives weak feedback: rollout-time signals are reliable and executable, but capture only necessary or surface conditions for task success rather than the target semantic predicate. Using agentic compile-fix as the setting, we study signal reshaping for standard GRPO under such feedback. Our central claim is that GRPO's within-group comparison is meaningful only after three kinds of signals are reshaped: outcome rewards recover semantic ranking, process signals localize intra-trajectory credit, and rollouts from the same prompt remain execution-comparable. We operationalize these conditions with a minimal signal-reshaping construction that leaves GRPO's group-normalized advantage construction unchanged: compile-and-semantic layered rewards reshape trajectory ranking, step-level process scores outside group reward normalization reshape within-trajectory update strength, and failure-cause-aware rollout governance reshapes within-group comparability. Experiments show a clear end-to-end gain: full signal-reshaped GRPO improves strict compile-and-semantic accuracy from the base model's zero-shot $0.385$ to $0.535$. Controlled comparisons further explain the source of this gain: binary rewards remove the compile-only middle tier and degrade trajectory control; on top of layered rewards, process-score weighting further improves accuracy from $0.48$ to $0.53$ and reduces average evaluation steps from $23.50$ to $17.02$. As a boundary comparison, privileged-prompt token-level distillation mainly optimizes local distributional alignment; in long tool-use trajectories, this signal is diluted by non-critical tokens and cannot replace outcome semantics, process credit, or within-group comparability.
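
A minimal sketch of how layered rewards and process-score weighting might sit around an unchanged group-normalized advantage. The tier values (1.0 / 0.5 / 0.0), the multiplicative use of process scores, and all function names are assumptions made for illustration; the abstract does not give the exact construction.

```python
from statistics import mean, pstdev


def layered_reward(compiles, semantic_ok):
    """Three-tier outcome reward (hypothetical tier values)."""
    if compiles and semantic_ok:
        return 1.0   # compiles and satisfies the semantic check
    if compiles:
        return 0.5   # compile-only middle tier
    return 0.0       # does not compile


def group_advantages(rewards, eps=1e-6):
    """Standard group normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_step_updates(advantage, process_scores):
    """Scale one trajectory's advantage per step, outside the normalization."""
    return [advantage * s for s in process_scores]


# Four rollouts from the same prompt.
rewards = [layered_reward(True, True), layered_reward(True, False),
           layered_reward(False, False), layered_reward(False, False)]
print(group_advantages(rewards))
```

With a binary reward the compile-only rollout would tie with the failures, so the middle tier is what lets the within-group comparison rank it above them.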

2605.07275 2026-05-11 cs.RO

Palm-sized Omnidirectional Vision-Based UAV Exploration with Sparse Topological Map Guidance

Zirui Wang, Xinjia Luo, Haotian Sun, Jun Ma, Jian Guo, Boyu Zhou

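AI summary: This paper presents a lightweight autonomous exploration system guided by a sparse topological map, designed for palm-sized UAVs. The system uses multiple fisheye cameras for an omnidirectional field of view and perceives the environment through depth estimation, avoiding the dependence on high-resolution point clouds or occupancy maps and thereby cutting compute and memory overhead. By representing unexplored regions as topological nodes, it identifies frontier regions without maintaining a global point cloud and plans global paths directly on the sparse graph. Experiments on a real small UAV show efficient exploration at low computational cost.

Abstract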

Classic exploration methods often rely on dense occupancy maps or high-resolution point clouds for frontier detection and path planning, resulting in substantial memory consumption and computational overhead. Moreover, it is impractical to equip micro UAVs under size, weight, and power (SWaP) constraints with sensors such as LiDAR to obtain accurate geometric measurements of the environment. This paper presents a lightweight autonomous exploration system that leverages omnidirectional vision and sparse topological map guidance. Specifically, we utilize a multi-fisheye camera setup to achieve an omnidirectional field of view (FoV) and perform depth estimation. To address the limited depth estimation accuracy, frontiers are represented as potential unexplored regions characterized by topological nodes instead of explicit boundaries, enabling efficient identification of frontier regions without maintaining occupancy grids or global point clouds. Unlike classic dense representations, our approach abstracts the environment using a sparse topological map composed of key nodes and their descriptors, reducing memory consumption and computational demands. Global path planning is performed directly on the sparse graph. The proposed method is validated both in simulation and in real-world experiments on a palm-sized vision-based UAV with an 11 cm wheelbase and a weight of 400 g, demonstrating that our method can achieve efficient exploration with extremely low computational cost.

2605.07274 2026-05-11 cs.AI cs.LG

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Bingqing Jiang, Difan Zou

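AI summary: This paper studies how role-aware policy optimization can make multimodal reasoning models more reliable. Existing methods assign rewards at the sequence level and ignore the functional roles of different tokens. The proposed Structured Role-aware Policy Optimization (SRPO) decomposes responses into perception tokens and reasoning tokens and assigns credit at the token level. Using self-distilled on-policy contrasts, SRPO emphasizes perception tokens according to their dependence on the visual input and reasoning tokens according to their consistency with the generated perception, improving evidence-grounded reasoning without an external reward model. Experiments show gains across multiple multimodal benchmarks.

Comments 32 pages

Abstract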

Reinforcement learning from verifiable rewards (RLVR), especially with Group Relative Policy Optimization (GRPO), has shown strong potential for improving the reasoning capabilities of large vision-language models (LVLMs). However, in multimodal reasoning, final-answer rewards are typically assigned at the sequence level and do not distinguish the functional roles of different tokens, making it difficult to determine whether a correct answer is supported by task-relevant visual evidence. In this paper, we revisit multimodal RLVR from the perspective of role-aware token-level credit assignment, where structured responses are decomposed into perception tokens for extracting visual evidence and reasoning tokens for deriving answers from that evidence. Based on this perspective, we propose Structured Role-aware Policy Optimization (SRPO), which refines the sequence-level GRPO advantage into role-aware token-level advantages without changing the reward function. Specifically, SRPO assigns role-specific credit by using self-distilled on-policy contrasts: perception tokens are emphasized according to their visual dependency under original versus corrupted visual inputs, while reasoning tokens are emphasized according to their consistency with the generated perception. These role-specific signals are further unified through a shared trajectory-level baseline, yielding positive token weights that adjust relative update magnitudes while preserving the original GRPO reward and optimization direction, without requiring external reward models or separate teachers. Experiments across diverse multimodal reasoning benchmarks show that SRPO improves evidence-grounded reasoning, highlighting the importance of moving beyond uniform sequence-level credit toward role-aware optimization for reliable multimodal reasoning.

2605.07273 2026-05-11 cs.CV cs.AI

From Clouds to Hallucinations: Atmospheric Retrieval Hijacking in Remote Sensing Vision-Language RAG

Jiaju Han, Chao Li, Chengyin Hu, Qike Zhang, Xuemeng Sun, Xin Wang, Fengyu Zhang, Xiang Chen, Yiwei Wei, Jiahuan Long, Jiujiang Guo

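AI summary: This paper studies the security of the evidence retrieval stage in remote sensing multimodal RAG systems and proposes CloudWeb, an attack that overlays parameterized cloud- and haze-like patterns on the input image to steer the retriever toward target weather-related evidence. The attack leaves the retriever, generator, and knowledge base unmodified and optimizes only the input image so that its embedding raises the rank of weather-related evidence in the retrieved results. Experiments across multiple remote sensing datasets and retrievers show that CloudWeb is effective, revealing that atmospheric changes can compromise evidence retrieval before generation begins.

Abstract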

Multimodal RAG systems increasingly rely on vision-language retrievers to ground visual queries in external textual evidence. Existing adversarial studies on RAG mainly manipulate the retrieval corpus or memory, while attacks on vision-language and remote sensing models typically target end-task predictions. Input-space threats to the evidence retrieval stage of remote sensing multimodal RAG remain underexplored. To address this gap, we introduce CloudWeb, an atmospheric retrieval hijacking attack that modifies only the input image while keeping the retriever, generator, and knowledge base fixed at deployment. CloudWeb overlays parameterized cloud- and haze-like patterns on remote sensing images and optimizes them with a retrieval-oriented objective that pulls adversarial image embeddings toward target atmospheric evidence, suppresses source-scene evidence, enforces rank separation, and regularizes naturalness and coverage. To the best of our knowledge, this is the first study of retrieval-stage atmospheric evidence hijacking in remote sensing multimodal RAG. We evaluate CloudWeb on a seven-dataset remote sensing RAG benchmark with five CLIP-style retrievers, including GeoRSCLIP, RemoteCLIP, OpenAI CLIP, and OpenCLIP, together with downstream vision-language generators. Across retrievers, CloudWeb consistently outperforms clean retrieval, handcrafted atmospheric baselines, random cloud perturbations, and fixed variants in injecting weather-related evidence into top-ranked results. On GeoRSCLIP ViT-B/32, Weather@5 increases from 0.71% to 43.29%. Downstream generation further shows measurable weather hallucination and semantic shift, indicating that retrieval-stage hijacking can propagate to the final RAG response. These findings reveal a practical failure mode: natural-looking atmospheric changes can compromise evidence retrieval before generation begins.

2605.07271 2026-05-11 cs.CL cs.AI

Understanding Performance Collapse in Layer-Pruned Large Language Models via Decision Representation Transitions

Boyu Shi, Chang Liu, ChuanBao Gao, Xu Yang, Xin Geng

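AI summary: This paper studies the sudden performance collapse that layer pruning causes in large language models and analyzes the mechanism through decision representations. The authors introduce two metrics, Decision Margin and Option Frequency, together with an Iterative Pruning method, and find a sharp decision transition from a Silent Phase, in which the model cannot yet predict the correct answer, to a Decisive Phase, in which the correct prediction emerges. Experiments show that disrupting the Silent Phase is the main cause of the collapse, whereas pruning the Decisive Phase has little effect, offering a new perspective for understanding and improving pruning.

Abstract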

Layer pruning efficiently reduces Large Language Model (LLM) computational costs but often triggers sudden performance collapse. Existing representation-based analyses struggle to explain this mechanism. We propose studying pruning through decision representation. Focusing on multiple-choice tasks, we introduce two metrics, Decision Margin and Option Frequency, and an Iterative Pruning method to analyze layer-wise decision dynamics. Our findings reveal a sharp decision transition that partitions the network into two stages: a Silent Phase, where the model cannot yet predict the correct answer, and a Decisive Phase, where the correct prediction emerges. We also find that pruning the Decisive Phase has minimal impact, whereas pruning the Silent Phase triggers immediate performance collapse, highlighting its extreme sensitivity to structural changes. Therefore, we conclude that pruning-induced collapse stems from disrupting the Silent Phase, which prevents the critical decision transition from occurring.

2605.07270 2026-05-11 cs.LG

bispectrum: Selective $G$-Bispectra Made Practical

Johan Mathe, Adele Myers, Simon Mataigne, Nina Miolane

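AI summary: This paper presents bispectrum, an open-source PyTorch library that efficiently implements selective $G$-bispectra for machine learning tasks that are invariant under the action of a transformation group. Selective computation substantially reduces the computational cost, and the library provides dedicated bispectra for planar and spherical rotations that can serve as pooling layers in deep architectures. Experiments show that $G$-bispectral pooling consistently outperforms the alternatives in the low-data, moderate-capacity regime.

Abstract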

Many machine learning tasks are invariant under the action of a group $G$ of transformations: signal classification can be invariant under translations, image classification under 2D rotations, and spherical-image classification under 3D rotations. The $G$-bispectrum is a principled complete invariant of a signal (retaining all of the signal's information up to the group action) with proven benefits in machine learning and as a pooling layer in deep networks. However, its deployment has been hampered by high computational cost and a patchwork of group-specific implementations. We present bispectrum, an open-source, fully unit-tested PyTorch library that implements selective $G$-bispectra for seven different group actions, as differentiable modules that can be directly incorporated into machine learning pipelines and deep learning architectures. For finite groups $G$, selectivity reduces the computational cost from $O(|G|^2)$ to $O(|G|)$. For planar rotations, we leverage the disk bispectrum. For spherical 3D rotations, we introduce an augmented selective bispectrum at band-limit $L$ which reduces the cost from $O(L^3)$ to $\Theta(L^2)$ coefficients. We profile the entire library (for which we implemented various compute optimizations), showing that it delivers near-exact $G$-invariance with its selective $G$-bispectra computed in sub-millisecond time on GPU (up to commonly used band-limits). We evaluate the benefits of incorporating $G$-bispectra as pooling layers into deep learning architectures on three classical benchmark datasets, comparing against norm pooling, gated pooling, Fourier-ELU pooling, max pooling, and (non-equivariant) data-augmented convolutional baselines. Results show that $G$-bispectra consistently outperform alternatives in the low-data, moderate-capacity regime.
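
For the simplest case, cyclic translations of a 1D signal of length $n$, the reduction from $O(|G|^2)$ to $O(|G|)$ coefficients can be sketched in plain Python. This does not use the bispectrum library's API; the function names are hypothetical, and fixing the first frequency index to 1 is one illustrative choice of an $O(n)$ subset, not necessarily the selection the authors use.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def selective_bispectrum(x):
    """Coefficients B(1, k) = F[1] * F[k] * conj(F[(1 + k) mod n]), k = 0..n-1.

    The full bispectrum has n**2 entries B(k1, k2); only the k1 = 1 row is
    kept here (an assumed selection), which leaves n coefficients.
    """
    f = dft(x)
    n = len(f)
    return [f[1] * f[k] * f[(1 + k) % n].conjugate() for k in range(n)]


signal = [0.3, 1.2, -0.7, 2.0, 0.5]
shifted = signal[2:] + signal[:2]          # cyclic translation by two samples
gap = max(abs(a - b) for a, b in zip(selective_bispectrum(signal),
                                     selective_bispectrum(shifted)))
print(gap)  # difference between the two sets of coefficients
```

A cyclic shift multiplies each Fourier coefficient by a phase, and the three phases in every product cancel, which is why the coefficients agree up to floating-point error.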

2605.07269 2026-05-11 cs.CL cs.LG

MIPIAD: Multilingual Indirect Prompt Injection Attack Defense with Qwen -- TF-IDF Hybrid and Meta-Ensemble Learning

Al Muhit Muhtadi, Mostafa Rifat Tazwar

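AI summary: This paper proposes MIPIAD, a defense framework against multilingual indirect prompt injection attacks. It combines a classifier fine-tuned from Qwen2.5-1.5B with LoRA, TF-IDF lexical features, and meta-ensemble learning through late fusion, stacking, and gradient boosting. Evaluated on a synthetic benchmark of more than 1.43 million samples in English and Bangla, the ensemble methods achieve the best detection performance and narrow the English-Bangla cross-lingual gap relative to standalone neural models. The framework is designed to be extensible to additional languages.

Abstract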

Indirect prompt injection remains a persistent weakness in retrieval-augmented and tool-using LLM systems, and the problem becomes harder to characterise in multilingual settings. We present MIPIAD, a defense framework evaluated on English and Bangla that combines a sequence classifier fine-tuned from Qwen2.5-1.5B via LoRA (XLPID), TF-IDF lexical features, and validation-tuned ensembling through late fusion, stacking, and gradient boosting. The framework is evaluated on a synthetic benchmark built from BIPIA (Yi et al., 2023) templates spanning five task families (email, table, QA, abstract, and code), comprising over 1.43 million generated samples, with train and test splits using mutually exclusive attack categories. Across the experiments, lexical signals prove strong (TF-IDF+SVM F1=0.77), and the hybrid XLPID+TF-IDF ensemble achieves the best overall F1 (0.9205) while the Boosting Ensemble achieves the best AUROC (0.9378). Ensemble methods consistently reduce the English-Bangla cross-lingual gap relative to standalone neural models. The pipeline is designed for extensibility: NLLB-200 supports over 200 languages and XLPID's multilingual backbone can be retargeted to additional languages without architectural changes; empirical validation is currently limited to English and Bangla.

2605.07268 2026-05-11 cs.CL

From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

Hanmeng Liu, Shichao Weng, Xiulai Liu, Zhicai Zhang, Anli Yan, Xiaozhang Liu

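AI summary: To address the problems facing multiple-choice reasoning benchmarks, this study proposes LogiHard, a framework that transforms 0-order selection into 2-order logical judgment, which substantially increases reasoning complexity and the number of reasoning steps and so evaluates frontier LLMs more effectively. The framework uses Item Response Theory for adaptive testing, and the authors build LogiHard-2k, a dataset of high-difficulty logic questions. On the hardened questions, accuracy drops by 31% to 56% across state-of-the-art models, exposing failures in compositional reasoning and multi-select tasks. The authors attribute the drop to a training-induced gap in combinatorial reasoning rather than to missing knowledge.

Abstract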

Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short of challenging advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces, followed by combinatorial transformation of high-difficulty items. Evaluation across twelve state-of-the-art models reveals an accuracy degradation ranging from 31% to 56% on combinatorially hardened questions. LLMs suffer from multi-select failure and early-exit bias, which are not shared by human test takers. Zero-shot transfer to MMLU demonstrates 47% accuracy degradation (89.84% to 42.86%), confirming applicability across domains with provable validity preservation. The consistent aggregate degeneration is domain-agnostic and stems not from knowledge deficits but from a combinatorial reasoning gap, reflecting a training-induced completeness-verification deficit.

2605.07267 2026-05-11 cs.LG

PerCaM-Health: Personalized Dynamic Causal Graphs for Healthcare Reasoning

Elahe Khatibi, Ziyu Wang, Saba A. Farahani, Di Huang, Hung Cao, Ramesh Jain, Amir M. Rahmani

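AI summary: PerCaM-Health is a framework for learning personalized dynamic causal graphs for healthcare reasoning, addressing the weakness of existing methods in handling causal relationships that change over time for an individual patient. It combines population-level causal knowledge with individual time-series data, uses conservative adaptation and rolling-window updates to produce interpretable sequences of dynamic causal graphs, and supports patient-level counterfactual queries. Experiments on a semi-synthetic benchmark show that PerCaM-Health outperforms the baselines in graph recovery, dynamic edge tracking, and intervention direction accuracy.

Abstract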

Personalized healthcare decisions require reasoning about how physiological and behavioral variables influence an individual patient over time. Existing temporal causal discovery methods are poorly matched to this setting: cohort-level models provide stable but non-personalized structures, while per-patient discovery is unreliable because individual trajectories are short, noisy, irregular, and non-stationary. This creates a fundamental gap between population-level causal modeling and the patient-specific, time-varying mechanisms needed for intervention reasoning. We introduce PerCaM-Health, a framework for learning personalized dynamic causal graphs from longitudinal health data. The framework learns a knowledge-guided population temporal graph, then conservatively adapts and evolves it using patient-specific temporal evidence and rolling-window updates, producing interpretable and auditable graph sequences. By coupling these graphs with temporal structural equations, the framework enables patient-level counterfactual queries, such as estimating short-horizon outcome changes under hypothetical behavioral interventions. Experiments on a semi-synthetic dynamic health benchmark show that PerCaM-Health improves graph recovery, dynamic edge tracking, and intervention direction accuracy compared to cohort-level, per-patient, and non-personalized temporal baselines. These results demonstrate that jointly modeling personalization and temporal evolution yields more reliable causal structure and intervention reasoning.

2605.07264 2026-05-11 cs.CV

Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

Qiaoyi Yang, Chaoyi Zhou, Xi Liu, Run Wang, Minghui Xu, Mert D. Pesé, Feng Luo, Yuhao Xu, Zhi-Qi Cheng, Qiushi Chen, Hairong Qi, Siyu Huang

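AI summary: This paper proposes Sat3R, a feed-forward framework for efficient Digital Surface Model (DSM) reconstruction from satellite imagery. It uses the geometry of the RPC model to fine-tune Depth Anything V2 for metric depth with the Scale-Invariant Logarithmic (SiLog) loss, adapting a monocular depth foundation model to the satellite domain without per-scene optimization. On the DFC2019 benchmark, Sat3R substantially improves reconstruction accuracy and runs more than 300 times faster than optimization-based methods, making large-scale satellite DSM reconstruction practical.

Abstract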

Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.
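
For reference, the scale-invariant logarithmic loss is commonly written in the monocular-depth literature as $\sqrt{\frac{1}{n}\sum_i d_i^2 - \lambda\,(\frac{1}{n}\sum_i d_i)^2}$ with $d_i = \log \hat{y}_i - \log y_i$. The sketch below implements that common form; the default `lam=0.5`, the function name, and the absence of any extra scaling factor are assumptions, since the abstract does not state the exact variant Sat3R uses.

```python
import math


def silog_loss(pred, target, lam=0.5):
    """Scale-invariant log loss over positive depth values.

    lam = 0 reduces to the RMS log error; lam = 1 ignores any global
    scale factor applied to the prediction.
    """
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    mean_sq = sum(v * v for v in d) / len(d)
    mean = sum(d) / len(d)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


target = [1.5, 2.5, 3.0]
pred = [1.0, 2.0, 4.0]
doubled = [2.0 * p for p in pred]
print(silog_loss(pred, target, lam=1.0), silog_loss(doubled, target, lam=1.0))
```

Intermediate values of `lam` penalize a global scale error only partially, which is one reason this family of losses is used when the scale of the pseudo depth supervision differs from the model's pretraining distribution.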

2605.07260 2026-05-11 cs.LG cs.CL

When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok

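AI summary: This paper studies how well expert routing works in Mixture-of-Experts (MoE) language models. It finds that standard top-$k$ routing can select suboptimal experts on the fragile tokens that drive hard reasoning. By comparing the standard route with equal-compute alternative routes, the authors show that routing quality is strongly token-conditional. A simple update restricted to the final-layer router is enough to shift pass@K on several reasoning benchmarks, indicating that routing matters for model performance.

Abstract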

Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.
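
The comparison between the standard route and equal-compute alternatives can be sketched for a single token as follows. Everything here is a toy under assumptions: `route_loss` is a hypothetical callable standing in for the negative log-probability of the realized token under a given route, and the sketch enumerates all size-$k$ expert subsets, whereas the paper samples alternatives.

```python
from itertools import combinations


def top_k_route(router_logits, k):
    """Standard routing: indices of the k largest router logits."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return tuple(sorted(ranked[:k]))


def best_equal_compute_route(route_loss, num_experts, k):
    """Lowest-loss route among all subsets that use exactly k experts."""
    return min(combinations(range(num_experts), k), key=route_loss)


# Toy token: the router prefers experts 0 and 2, but experts 1 and 3
# would assign the realized token a higher probability.
utility = [0.1, 0.9, 0.4, 0.8]               # hypothetical per-expert benefit
loss = lambda route: -sum(utility[i] for i in route)
logits = [2.0, 0.1, 1.5, 0.3]

standard = top_k_route(logits, 2)
alternative = best_equal_compute_route(loss, 4, 2)
print(standard, alternative, loss(standard) - loss(alternative))
```

A positive gap on a token means a lower-loss route of the same cost exists inside the frozen model but was not selected, which is the kind of misallocation the abstract describes.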

2605.07257 2026-05-11 cs.CV

Adaptive Subspace Projection for Generative Personalization

Van-Anh Nguyen, Anh Tuan Bui, Tamas Abraham, Junae Kim, Amardeep Kaur, Rollin Omari, Thuy-Trang Vu, Dinh Phung

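AI summary: Generative personalization models often suffer from the semantic collapsing problem (SCP), in which a learned personalized concept overpowers the rest of the text prompt and the model ignores important contextual details. The analysis shows that the semantic drift behind SCP is not random but concentrated in a specific low-dimensional subspace. The paper proposes AdaptSP, a training-free adaptive subspace projection method that adjusts embeddings at test time by projecting the drift onto that subspace for a precise correction, mitigating SCP while preserving subject identity. Experiments show clear improvements in prompt fidelity and contextual alignment.

Abstract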

Generative personalization often suffers from the semantic collapsing problem (SCP), where a learned personalized concept overpowers the rest of the text prompt, causing the model to ignore important contextual details. To address this, we first analyze the underlying cause, revealing that the semantic drift responsible for SCP is not random but is concentrated within a specific low-dimensional subspace. We also discover that the personalization process perturbs the embedding of the original base concept, making it an unstable reference point. Based on these insights, we introduce Test-time Embedding Adjustment with Adaptive Subspace Projection (AdaptSP), a training-free method that uses the stable, pre-trained embedding as an anchor. AdaptSP isolates the semantic drift and projects it onto the identified subspace, performing a precise adjustment that mitigates SCP while maintaining the subject identity. Our experiments show that this targeted approach significantly improves prompt fidelity and contextual alignment.

2605.07256 2026-05-11 cs.CV

TAS-LoRA: Transformer Architecture Search with Mixture-of-LoRA Experts

Jeimin Jeon, Hyunju Lee, Bumsub Ham

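AI summary: This paper proposes TAS-LoRA, a method that addresses the feature collapse problem in architecture search for vision transformers (ViTs). It introduces low-rank adaptation (LoRA) so that each subnet can learn subnet-specific features while remaining computationally efficient, and adopts a Mixture-of-LoRA-Experts (MoLE) strategy in which a lightweight router dynamically assigns LoRA experts to encourage diverse feature learning across experts. Experiments on several benchmark datasets show that TAS-LoRA improves performance over state-of-the-art architecture search methods.

Comments Accepted to CVPR 2026

Abstract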

Transformer architecture search (TAS) discovers optimal vision transformer (ViT) architectures automatically, reducing the human effort of manually designing ViTs. However, existing TAS methods suffer from the feature collapse problem, where subnets within a supernet fail to learn subnet-specific features, mainly due to the shared weights in a supernet, limiting the performance of individual subnets. To address this, we propose TAS-LoRA, a novel method that introduces parameter-efficient low-rank adaptation (LoRA) to enable subnet-specific feature learning while maintaining computational efficiency. TAS-LoRA incorporates a Mixture-of-LoRA-Experts (MoLE) strategy, where a lightweight router dynamically assigns LoRA experts based on subnet architectures, and introduces a group-wise router initialization technique to encourage diverse feature learning across experts early in training. Extensive experiments on ImageNet and several transfer learning benchmarks, including CIFAR-10/100, Flowers, CARS, and INAT-19, demonstrate that TAS-LoRA mitigates feature collapse effectively, improving performance over state-of-the-art TAS methods significantly.

2605.07254 2026-05-11 cs.CV cs.GR

High-Fidelity Surface Splatting-Based 3D Reconstruction from Multi-View Images

Nandhana Sunil, Abhirami R Iyer, Avirup Mandal

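AI summary: This paper studies high-fidelity 3D reconstruction from multi-view images and proposes an improved surface-splatting method that addresses the weakness of existing approaches in recovering geometric detail. It replaces the conventional exponential kernel with a compact polynomial kernel with local support, which gives better control over frequency content, and adds stochastic regularization with Laplacian filtering to enhance fine detail. The method keeps optimization stable while improving geometric fidelity and rendering quality, reaching state-of-the-art performance in both surface reconstruction and rendering.

Comments 19 pages, 9 figures

Abstract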

Multi-view mesh reconstruction remains a core challenge in computer graphics and vision, especially for recovering high-frequency geometry from sparse observations. Recent methods such as 3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) rely on post-processing for mesh extraction, thereby limiting joint optimization of geometry and appearance. Implicit Moving Least Squares (IMLS) instead enables direct conversion of point clouds into signed distance and texture fields, supporting end-to-end reconstruction and rendering. However, existing IMLS formulations use exponential kernels that struggle with high-frequency detail. We introduce a compact polynomial kernel with local support and greater flexibility, allowing better control over frequency content and improved geometric fidelity. To further enhance fine details, we incorporate stochastic regularization with Laplacian filtering. Together, these improve the preservation of high-frequency structure while maintaining stable optimization. Experiments show state-of-the-art performance in both surface reconstruction and rendering, yielding more accurate geometry and sharper visuals from multi-view data.

2605.07253 2026-05-11 cs.CV

LENS: Low-Frequency Eigen Noise Shaping for Efficient Diffusion Sampling

Haewon Jeon, Si-Hyeon Lee

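AI summary: LENS (Low-frequency Eigen Noise Shaping) is an efficient diffusion sampling method that addresses the loss of image quality when distilled diffusion models reduce the number of denoising steps. It selectively modulates the low-frequency components of the noise in a low-dimensional subspace, since these components largely determine the global structure and visual fidelity of the image. LENS uses a lightweight network for noise modulation, substantially reducing FLOPs and parameter count, and experiments show competitive image quality with much lower inference overhead.

Comments 27 pages, 7 figures

Abstract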

Distilled diffusion models accelerate image generation by reducing the number of denoising steps, but often suffer from degraded image quality. To mitigate this trade-off, test-time optimization methods improve quality, yet their iterative nature incurs substantial computational overhead and leads to slow inference, limiting practical usability. Recent hypernetwork-based approaches amortize this process during training, but still require costly noise modulation in high-dimensional latent spaces. In this work, we propose LENS (Low-frequency Eigen Noise Shaping), an efficient noise modulation framework that operates in a low-dimensional subspace. Our approach is motivated by the observation that low-frequency components of the noise largely determine the global structure and visual fidelity of generated images. Based on this observation, we provide a theoretical justification for restricting modulation to the low-frequency subspace and derive a principled training objective. Building on this, LENS employs a lightweight, standalone network to selectively modulate these components, enabling efficient and targeted noise modulation. Extensive experiments demonstrate that LENS achieves competitive image quality while reducing FLOPs by 400-700$\times$, model parameters by 25-75$\times$, and inference-time overhead by 10-20$\times$ compared to prior methods.

2605.07251 2026-05-11 cs.AI

Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

Yuyang Wu, Yue Huang, Shuaike Shen, Xujian Wang, Shuhao Zhang, Qiyao Xue, Weichen Liu, Runtian Gao, Jian Ma, Xiangliang Zhang, Olexandr Isayev

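AI summary: This paper evaluates large language models (LLMs) on chemical reaction cost estimation, a task in which an agent must identify chemicals from a reaction description, retrieve supplier quotes, select purchasable pack sizes, and compute the total cost. The authors build the ChemCost benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot, which supports stage-level diagnosis of model errors. Even the strongest agents reach only 50.6% accuracy (within 25% relative error) on clean inputs and degrade substantially under realistic noise, revealing clear weaknesses in parsing, evidence integration, and tool use.

Comments 9 pages, 5 figures

Abstract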

Large Language Models (LLMs) have become increasingly capable as tool-using agents, with benchmarks spanning diverse general agentic tasks. Yet rigorous evaluation of scientific tool use remains limited. In chemistry, recent agents can plan syntheses and invoke domain-specific tools, but evaluations often rely on curated demonstrations, expert assessment, or LLM-as-judge scoring rather than exact, judge-free ground truth. We address this gap with chemical procurement cost estimation, a practical task in which an agent must ground chemical identities, retrieve supplier quotes, select valid purchasable packs, normalize quantities, and compute cost from a reaction description. We introduce ChemCost, a benchmark of 1,427 evaluable reactions grounded to a frozen pricing snapshot covering 2,261 chemicals and 230,775 supplier quotes, supporting scalar scoring and stage-level diagnosis of grounding, retrieval, procurement, and arithmetic failures. To evaluate robustness, we further construct controlled noise-injected views that perturb chemical aliases, quantity expressions, missing fields, and input formatting. Experiments with frontier, open-weight, and chemistry-specialized LLM agents show that tool access is necessary but insufficient for solving the task. The strongest agents reach only 50.6% accuracy within 25% relative error on clean inputs and degrade substantially with realistic noise. Stage-level analysis further shows that failures arise from brittle parsing, ineffective evidence integration, invalid pack selection, and non-convergent tool use.
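
The headline metric, accuracy within 25% relative error, admits a direct reading that can be sketched as below. The function names and the handling of the boundary case (a prediction exactly at the tolerance counts as correct) are assumptions; the benchmark's own scoring code may differ.

```python
def within_relative_error(predicted, truth, tol=0.25):
    """True if the predicted cost is within tol * |truth| of the true cost."""
    return abs(predicted - truth) <= tol * abs(truth)


def accuracy(preds, truths, tol=0.25):
    """Fraction of reactions whose predicted cost passes the tolerance check."""
    hits = sum(within_relative_error(p, t, tol) for p, t in zip(preds, truths))
    return hits / len(truths)


# Four hypothetical reactions, each with a true cost of 100 currency units.
predicted_costs = [100, 130, 80, 10]
true_costs = [100, 100, 100, 100]
print(accuracy(predicted_costs, true_costs))
```

Because the check is a plain numeric comparison against a frozen price snapshot, it needs no judge model, which matches the abstract's emphasis on exact, judge-free ground truth.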

2605.07250 2026-05-11 cs.CV cs.AI

Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment

Zhixue Song, Boyan Han, Yiwei Wang, Chi Zhang

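AI summary This study reveals a safety vulnerability in multimodal large language models (MLLMs) that process visually compressed content: as image resolution drops, the models' safety defenses degrade sharply, even when the text remains legible. The authors attribute this to a "cognitive overload" effect, hypothesizing that the model spends so much attention deciphering degraded input that less remains for safety auditing. To address it, they propose a "Structured Cognitive Offloading" strategy that separates visual transcription from safety assessment, which mitigates the risk and offers insights for the secure design of future MLLMs.

Comments Accepted to Findings of ACL 2026

Details
English abstract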

Recent advancements in visual context compression enable MLLMs to process ultra-long contexts efficiently by rendering text into images. However, we identify a critical vulnerability inherent to this paradigm: lowering image resolution inadvertently catalyzes jailbreaking. Our experiments reveal that the safety defenses of SOTA models deteriorate sharply as resolution degrades and, surprisingly, that this deterioration persists even when the text remains legible. We attribute this to "Cognitive Overload", hypothesizing that the effort required to decipher degraded inputs diverts attentional resources from safety auditing. This phenomenon is consistent across various visual perturbations, including noise and geometric distortion. To address this, we propose a simple "Structured Cognitive Offloading" strategy that mitigates these risks by enforcing a serialized pipeline to decouple visual transcription from safety assessment. Our work exposes a significant risk in vision-based compression and provides critical insights for the secure design of future MLLMs.

2605.07248 2026-05-11 cs.CL cs.LG

PaT: Planning-after-Trial for Efficient Test-Time Code Generation

Youngsik Yoon, Sungjae Lee, Seockbean Song, Siwei Wang, Wei Chen, Jungseul Ok

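AI summary This paper proposes PaT, an adaptive policy for improving the efficiency of LLM code generation at test time. Unlike the conventional Planning-before-Trial (PbT) approach, PaT invokes the planner only when verification fails, which avoids unnecessary compute. It pairs a low-cost model for code generation with a stronger model for targeted planning interventions. Experiments on multiple benchmarks show an improved cost-performance trade-off: the heterogeneous configuration performs comparably to a large homogeneous model while reducing inference cost by about 69%.

Comments Accepted to ACL 2026 main conference

Details
English abstract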

Beyond training-time optimization, scaling test-time computation has emerged as a key paradigm to extend the reasoning capabilities of Large Language Models (LLMs). However, most existing methods adopt a rigid Planning-before-Trial (PbT) policy, which inefficiently allocates test-time compute by incurring planning overhead even on directly solvable problems. We propose Planning-after-Trial (PaT), an adaptive policy for code generation that invokes a planner only upon verification failure. This adaptive policy naturally enables a heterogeneous model configuration: a cost-efficient model handles generation attempts, while a powerful model is reserved for targeted planning interventions. Empirically, across multiple benchmarks and model families, our approach significantly advances the cost-performance Pareto frontier. Notably, our heterogeneous configuration achieves performance comparable to a large homogeneous model while reducing inference cost by approximately 69%.
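The control flow of a planning-after-trial policy can be sketched as below. This is only an illustration of the idea in the abstract: the callables `generate`, `verify`, and `plan`, their signatures, and the attempt limit are hypothetical stand-ins, not the paper's interfaces.

```python
def planning_after_trial(problem, generate, verify, plan, max_attempts=3):
    """Try direct generation first; call the planner only after a failed check.

    generate(problem, hint) -> candidate code (hint is None on the first try)
    verify(code)            -> True if the candidate passes verification
    plan(problem, code)     -> a hint derived from the failed attempt
    Returns (accepted_code_or_None, number_of_planner_calls).
    """
    planner_calls = 0
    hint = None
    for attempt in range(max_attempts):
        code = generate(problem, hint)      # cheap model in a heterogeneous setup
        if verify(code):
            return code, planner_calls      # solved without further planning
        if attempt + 1 < max_attempts:      # plan only if another trial remains
            hint = plan(problem, code)      # stronger model, called on failure
            planner_calls += 1
    return None, planner_calls


# Toy stand-ins: the "hard" problem is only solved once a hint is available.
def toy_generate(problem, hint):
    return "good" if (problem == "easy" or hint is not None) else "bad"

def toy_verify(code):
    return code == "good"

def toy_plan(problem, failed_code):
    return "decompose the task first"

easy_result = planning_after_trial("easy", toy_generate, toy_verify, toy_plan)
hard_result = planning_after_trial("hard", toy_generate, toy_verify, toy_plan)
```

In this toy run the easy problem incurs no planner call at all, which is the saving a planning-before-trial policy would forgo.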

2605.07247 2026-05-11 cs.AI

EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation

Yi Liu, TingFeng Hui, Wei Zhang, Li Sun, Ningxin Su, Jian Wang, Sen Su

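AI summary EnvSimBench is a benchmark for evaluating and improving LLM-based environment simulation. The study notes that LLMs simulating environment feedback suffer from hallucinations, logical inconsistencies, and state drift, which undermine the reliability of agent training. It gives a quantifiable definition of Environment Simulation Ability (EnvSim Ability) and builds a rigorous benchmark of 400 samples across 167 diverse environments, revealing that current language models broadly fail when multiple states must be updated simultaneously. It also designs a constraint-driven simulation pipeline that markedly improves simulation quality and lowers cost.

Details
English abstract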

Scalable AI agent training relies on interactive environments that faithfully simulate the consequences of agent actions. Manually crafted environments are expensive to build, brittle to extend, and fundamentally limited in diversity. A promising direction is to replace manually crafted environments with LLM-simulated counterparts. However, this paradigm hinges on an unexamined core assumption: LLMs can accurately simulate environmental feedback. In practice, LLM-simulated environments suffer from hallucinations, logical inconsistencies, and silent state drift; these failures corrupt agent reward signals and compound the construction costs that the paradigm was designed to eliminate. To address this gap, we propose EnvSimBench with four contributions: 1) We provide the first formal definition and operationalization of Environment Simulation Ability (EnvSim Ability) as a quantifiable research objective. 2) We construct EnvSimBench, a rigorous benchmark covering 400 samples across 167 diverse environments, equipped with verifiable labels and fine-grained difficulty stratification along three axes. 3) Systematic evaluations reveal that all state-of-the-art language models suffer from a universal state change cliff: they achieve near-perfect accuracy on tasks when the environment state remains invariant, yet fail catastrophically when multiple states need simultaneous updates. This finding exposes EnvSim Ability as a critical yet largely unaddressed capability gap. 4) We design a constraint-driven simulation pipeline that substantially reduces hallucination, boosts environment synthesis yield by 6.8%, and cuts costs by over 90%. Overall, EnvSimBench serves as both a diagnostic framework and a practical optimization path for reliable LLM-based environment simulation, establishing a foundation for scalable agent training. Code and data are available at https://github.com/cookieApril/EnvSimBench

2605.07244 2026-05-11 cs.LG cs.AI cs.CL

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

Xiaoze Liu, Dhananjay Ram, Yuting Zhang, Zhaoyang Zhang, Wei Xia, Stefano Soatto

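AI summary This paper proposes Mutual Reinforcement Learning, a framework for concurrent RL post-training of heterogeneous large language models, in which different models collaborate by exchanging typed experience while keeping their own parameters, objectives, and tokenizers. The framework combines shared experience exchange, multi-worker resource allocation, and tokenizer-heterogeneity handling to enable experience sharing across model families. Three controlled probes, together with an analysis of their structural positions on a stability-support trade-off, indicate that outcome-level experience sharing is the most favorable option in the evaluated regime.

Comments 50 pages, 10 figures, 14 tables

Details
English abstract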

We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Worker Resource Allocation (MWRA), and a Tokenizer Heterogeneity Layer (THL) that retokenizes text and aligns token-level traces across incompatible vocabularies. This substrate makes the experience-sharing design question operational across model families. We instantiate three controlled probes on top of GRPO: data-level rollout sharing via Peer Rollout Pooling (PRP), value-level advantage sharing via Cross-Policy GRPO Advantage Sharing (XGRPO), and outcome-level success transfer via Success-Gated Transfer (SGT). A contextual-bandit analysis characterizes their structural positions on a stability-support trade-off: PRP pays density-ratio variance and THL residual costs, XGRPO preserves on-policy actor support while changing scalar baselines, and SGT supplies a rescue-set score direction toward verified peer successes. In the evaluated regime, outcome-level sharing occupies the favorable point of this trade-off.

2605.07242 2026-05-11 cs.AI cs.CL

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

Yang Zhao, Chengxiao Dai, Mengying Kou, Yue Xiu

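AI summary In agentic memory systems, when a source memory is deleted or updated, memories derived from it may persist and keep influencing later behavior, leaving stale information in play. This paper proposes MEMOREPAIR, a barrier-first method that performs cascade repair of derived content in agentic memory and thereby removes the influence of stale memories. Experiments show that MEMOREPAIR sharply reduces exposure to invalidated memory and recovers most valid successor memories without incurring excessive repair cost.

Details
English abstract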

Agentic memory evolves across tasks into durable derived artifacts: summaries, cached outputs, embeddings, learned skills, and executable tool procedures. When a source artifact is deleted, corrected, or invalidated by tool or API migration, descendants derived from that source can remain visible and steer future actions with stale support. We formalize this failure mode as the cascade update problem, where repair targets the visible derived state of the memory store. We present MemoRepair, a barrier-first cascade-repair contract for agentic memory. A repair event induces a controlled transition from invalidated descendant state to validated successor state: affected descendants are withdrawn before repair, successors are constructed from retained support and staged repaired predecessors under the current interface, and republication is restricted to validated predecessor-closed successors. This contract induces a scalarized repair-selection problem for a fixed repair-cost tradeoff. We show that the induced publication problem reduces to maximum-weight predecessor closure and can be solved exactly by a single s-t min-cut. Experiments on ToolBench and MemoryArena show that, with complete influence provenance, MemoRepair reduces invalidated-memory exposure from 69.8-94.3% under systems without cascade repair to 0%. Compared with exhaustive Repair all, it recovers 91.1-94.3% of validated successors while reducing normalized repair-operator cost from 1.00 to 0.57-0.76.
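The reduction mentioned above, from a predecessor-closed selection to a single s-t min-cut, is the textbook maximum-weight closure construction. The sketch below illustrates that construction only: the node names, the toy weights (read here as an assumed net value of republishing an item, benefit minus scaled repair cost), and the Edmonds-Karp max-flow routine are illustrative choices, not the paper's implementation.

```python
from collections import deque

def max_weight_closure(weights, predecessors):
    """Pick a predecessor-closed set of maximum total weight via min-cut.

    weights:      {node: net value, possibly negative}
    predecessors: {node: [nodes that must be selected if node is selected]}
    Returns (best_total, selected_nodes).
    """
    source, sink = "_source", "_sink"   # assumed not to clash with node names
    inf = float("inf")
    cap = {}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)         # residual (reverse) edge

    for node, w in weights.items():
        if w > 0:
            add_edge(source, node, w)   # gain collected if node is selected
        elif w < 0:
            add_edge(node, sink, -w)    # cost paid if node is selected
    for node, preds in predecessors.items():
        for p in preds:
            add_edge(node, p, inf)      # selecting node forces selecting p

    def reachable_from_source():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            break                       # no augmenting path: flow is maximal
        bottleneck, v = inf, sink
        while parent[v] is not None:    # find the smallest residual capacity
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:    # push flow along the path
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

    selected = set(parent) - {source}   # source side of the minimum cut
    total_positive = sum(w for w in weights.values() if w > 0)
    return total_positive - flow, selected


# Toy instance: "a" needs "b" (net +2 together), "c" needs "d" (net -2 together).
toy_weights = {"a": 5, "b": -3, "c": 4, "d": -6}
toy_predecessors = {"a": ["b"], "c": ["d"]}
best_total, selected = max_weight_closure(toy_weights, toy_predecessors)
```

The infinite-capacity edges are what make the selection predecessor-closed: a cut that keeps a node on the source side while separating one of its predecessors would have infinite cost, so no minimum cut does that.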

2605.07239 2026-05-11 cs.LG math.OC

Sample Complexity of Stochastic Optimization with Integer Variables

Hongyu Cheng, Yinghao Zheng, Marco Molinaro, Amitabh Basu

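AI summary This paper studies the sample complexity of stochastic optimization with integer variables, aiming to understand how it differs from that of the corresponding continuous problem. The authors analyze the number of samples required under different objective and constraint structures and find that integer optimization can need either more or fewer samples than continuous optimization. The work also establishes tight sample complexity results for nonconvex continuous stochastic optimization and shows that, for strongly convex smooth objectives, integer optimization has markedly higher statistical complexity than the continuous case.

Details
English abstract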

We establish sample complexity results for stochastic optimization over the integers, especially with a view to understanding how this complexity compares with that of the corresponding continuous optimization problem. We show that integer optimization can sometimes require strictly more samples and sometimes strictly fewer samples, depending on the structure of the objective and constraints. 1. For Lipschitz objectives over subsets of the $\ell_\infty$ ball, the statistical complexity of general stochastic mixed-integer, nonlinear, nonconvex optimization is exactly the same as that of stochastic linear optimization with just bound constraints. 2. For Lipschitz objectives over subsets of the $\ell_2$ ball, we show that integer optimization can require a strictly *smaller* sample size than the continuous setting in a certain regime. To get to this result, we also establish tight sample complexity results for nonconvex continuous stochastic optimization which, to the best of our knowledge, do not appear in prior work. 3. For strongly convex, smooth objectives, integer optimization has higher statistical complexity than the continuous setting. In particular, we show that integer optimization requires $Ω(1/ε^2)$ samples to report an $ε$-approximate solution, compared to the well-known $O(1/ε)$ sample complexity from the continuous optimization literature.

2605.07234 2026-05-11 cs.CL cs.AI

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

Tho Mai, Joo-Young Kim

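AI summary In long-context LLM inference, rapid growth of the key-value (KV) cache causes substantial memory and runtime overhead. This paper reformulates KV cache eviction as an output-aware, layer-wise matrix multiplication approximation problem and proposes a new eviction strategy, LaProx, which explicitly models the multiplicative interaction between attention maps and projected value states to quantify each token's contribution while accounting for inter-head dependencies. Experiments show that the method maintains model performance with only 5% of the KV cache and outperforms existing methods on multiple benchmark datasets.

Details
English abstract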

Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling model-wide selection instead of local, head-wise decisions. Experimental results across 19 datasets on the long-context benchmarks LongBench and Needle-In-A-Haystack demonstrate that our approach maintains model performance with only 5% of the KV cache and consistently outperforms prior work across all configurations. Notably, under extreme compression our method reduces accuracy loss by up to 2× compared to existing state-of-the-art baselines, with minimal overhead.
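One way to read "output-aware" scoring is to rank each cached token by the size of what it contributes to the layer output after attention weights multiply the output-projected values and the heads are summed. The sketch below illustrates that reading only. The scoring formula, function names, and toy numbers are assumptions, and the actual LaProx metric may differ.

```python
import math

def output_aware_scores(attn, projected_values):
    """Score each cached token by its summed contribution to the layer output.

    attn[h][j]             attention weight head h places on token j (one query)
    projected_values[h][j] token j's value in head h after that head's slice
                           of the output projection (a vector in model space)
    The score is the L2 norm of the sum over heads of attn * projected value,
    so contributions from different heads may reinforce or cancel each other.
    """
    n_heads = len(attn)
    n_tokens = len(attn[0])
    dim = len(projected_values[0][0])
    scores = []
    for j in range(n_tokens):
        contribution = [0.0] * dim
        for h in range(n_heads):
            for k in range(dim):
                contribution[k] += attn[h][j] * projected_values[h][j][k]
        scores.append(math.sqrt(sum(x * x for x in contribution)))
    return scores

def keep_top_tokens(scores, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


# Toy example: 2 heads, 3 cached tokens, model dimension 2 (all hypothetical).
toy_attn = [[0.5, 0.3, 0.2],
            [0.1, 0.1, 0.8]]
toy_values = [[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
              [[-1.0, 0.0], [0.0, 2.0], [0.0, 0.1]]]
toy_scores = output_aware_scores(toy_attn, toy_values)
kept = keep_top_tokens(toy_scores, budget=2)
```

In the toy data, token 2 receives the largest attention weight in head 1 yet is the one evicted, because its projected value is tiny. A purely attention-weight-based rule would have kept it, which is the gap the abstract points at.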

2605.07232 2026-05-11 cs.CV

Towards multi-modal forgery representation learning for AI-generated video detection and localization

Dat Le, Khoa Nguyen, Xin Wang, Shu Hu

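AI summary As generative AI advances, creating AI-generated video has become widespread, which brings risks of semantic distortion and misuse. To address the limitations of existing detectors in multi-modal modeling and fine-grained temporal forgery localization, this paper proposes a jointly learned architecture that fuses a language modality, a spatio-temporal visual modality, and a multi-scale partial-spoof audio modality, enabling detection and precise localization of partially manipulated AI-generated video. Experiments show that the method outperforms existing state-of-the-art approaches.

Details
English abstract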

Recent advances in generative AI have democratized video creation at scale. AI-generated videos, including clips that are partially manipulated in the visual or audio channel, pose escalating risks of semantic distortion and misuse, which motivates the need for reliable detection tools. Most existing AI-generated video detectors remain limited by single- or partial-modality modeling and by the lack of fine-grained temporal forgery localization. To address these challenges, we introduce an architecture that jointly integrates an LMM semantic branch with a spatio-temporal (ST) visual branch and a multi-scale partial-spoof (PS) audio branch. This multi-modal approach enables simultaneous detection and fine-grained temporal localization of partially manipulated AI-generated video forgeries. Extensive experiments show that this approach outperforms existing state-of-the-art methods.

2605.07230 2026-05-11 cs.CV cs.AI

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

Selin Yildirim, Subhajit Dutta Chowdhury, Mohammad Mahdi Kamani, Vikram Appia, Deming Chen

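AI summary CASCADE is a context-aware relaxation method for image generation that aims to make speculative decoding more efficient. By analyzing redundancy in the target model's hidden states during tree-based speculative decoding, it formalizes two properties, semantic interchangeability and convergence, and uses them to relax acceptance in a principled way without additional training. Experiments show that CASCADE speeds up decoding by up to 3.6x while maintaining image quality and text-prompt fidelity.

Details
English abstract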

Autoregressive generation is a powerful approach for high-fidelity image synthesis, but it remains computationally demanding and slow even on the most advanced accelerators. While speculative decoding has been explored to mitigate this bottleneck, existing approaches fail to achieve efficiency gains comparable to those observed in text generation. A key limitation is the target model's high uncertainty during image generation, which leads to high draft token rejection rates. In this work, we identify previously overlooked patterns in the target model's behavior that emerge naturally in tree-based speculative decoding. Specifically, we formalize two properties, semantic interchangeability and convergence, arising from the redundancies in the target model's hidden state representations. By capturing these redundancies across the depth and breadth of the predicted token tree, our method identifies principled opportunities for acceptance relaxation without requiring additional training. Additionally, we enhance standalone drafter performance by injecting the redundancy signals from the target model into drafter training with minimal modification. We evaluate our approach across multiple text-to-image models and drafter architectures. Results show that CASCADE achieves state-of-the-art speedups for drafter-based speculative decoding, with up to 3.6x acceleration, while maintaining image quality and text-prompt fidelity.

2605.07222 2026-05-11 cs.LG

Don't Learn the Shape: Forecasting Periodic Time Series by Rank-1 Decomposition

Takato Honda

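AI summary This paper asks how few parameters are needed to forecast a periodic time series. The author proposes FLAIR, a method based on rank-1 decomposition that freezes the daily shape and learns only the daily level, yielding efficient and accurate forecasts. Experiments show strong results across benchmark configurations, with few parameters, fast computation, and no per-task tuning.

Comments 9 pages main text + appendix. Code: https://github.com/TakatoHonda/FLAIR

Details
English abstract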

How few parameters do we really need to forecast a periodic time series? An hourly electricity series, reshaped as a 24-row matrix with one column per day, is approximately rank-1: a daily shape modulated by a daily level (median centered rank-1 energy 0.82 on GIFT-Eval). Should we learn the shape? Smoothing, shrinkage, and low-rank fits all seem like obvious upgrades over the simple average of the last K=2 cycles. On all 97 GIFT-Eval configurations, we tested 8 such alternatives (e.g., Fourier, EWMA, James-Stein, rank-r SVD): none significantly beats the frozen baseline under Holm correction; two are significantly worse. The resulting method, FLAIR, is (a) Effective: matches PatchTST on aggregate GIFT-Eval (relMASE 0.838 vs 0.849); (b) Compact: 28 scalars for hourly, 57 for weekly; (c) Fast: 22 minutes on one CPU core of a MacBook Pro; (d) Closed-form & Hands-Off: one SVD per period candidate, GCV-averaged Ridge, no GPU, no pre-training, no per-task tuning. In the high-rank-1, many-cycle regime, extra flexibility is estimation noise.
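The rank-1 view described above, a fixed within-cycle shape multiplied by a per-cycle level, can be sketched in a few lines. This is an illustration under stated assumptions and not the FLAIR implementation: the shape is taken as the average of the last K level-normalized cycles, and the next level is forecast by naive persistence, whereas the abstract mentions a GCV-averaged Ridge. The function name is hypothetical.

```python
def rank1_forecast(series, period=24, k=2):
    """Forecast one full cycle as (frozen shape) x (next level).

    series: list of positive numbers sampled at a fixed rate
    period: number of samples per cycle (24 for hourly data with daily cycles)
    k:      number of trailing cycles averaged to form the shape
    """
    n_cycles = len(series) // period
    if n_cycles < k:
        raise ValueError("need at least k complete cycles")
    start = len(series) - n_cycles * period          # drop a leading partial cycle
    cycles = [series[start + i * period: start + (i + 1) * period]
              for i in range(n_cycles)]
    levels = [sum(cycle) / period for cycle in cycles]   # one scalar per cycle

    # Shape: average of the last k cycles, each divided by its own level
    # (assumes levels are nonzero).
    recent = list(zip(cycles[-k:], levels[-k:]))
    shape = [sum(cycle[j] / level for cycle, level in recent) / k
             for j in range(period)]

    next_level = levels[-1]   # assumption: persistence instead of a fitted model
    return [next_level * s for s in shape]


# Toy series with period 4: the same shape at levels 2, 4, and 6.
toy_series = [1, 2, 3, 2,  2, 4, 6, 4,  3, 6, 9, 6]
toy_forecast = rank1_forecast(toy_series, period=4, k=2)
```

Because the toy series is exactly rank-1, the forecast reproduces the last cycle; on real data the quality depends on how close the reshaped matrix is to rank-1, which is what the rank-1 energy statistic in the abstract measures.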

2605.07221 2026-05-11 cs.CV

DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation

Wei Jiang, Feng Liu, Nan Ye, Hongfu Sun

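AI summary This paper proposes DINO-MVR, a multi-view readout framework for improving annotation-efficient medical image segmentation. The method uses a frozen DINOv3 feature extractor and trains only lightweight MLP probes, which avoids fine-tuning the backbone. By fusing multiple resolutions and test-time augmentations, DINO-MVR achieves strong segmentation results on several medical imaging benchmarks and stays accurate even when very few annotations are available.

Details
English abstract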

Adapting foundation models to medical segmentation typically requires either backbone fine-tuning or high-capacity task-specific decoders, both of which are difficult to fit reliably when annotations are scarce. We show that frozen DINOv3 features already contain useful structural and boundary cues for medical segmentation, and that the main bottleneck lies in how these features are read out. We propose DINO-MVR, a Multi-View Readout framework for annotation-efficient medical segmentation. DINO-MVR trains only lightweight MLP probes on features from the final three transformer blocks of a frozen DINOv3 backbone, without updating the backbone itself. At inference, each input is interpreted through complementary resolutions and test-time augmentations, whose probability maps are combined by entropy-weighted fusion and refined with simple spatial regularization. For volumetric inputs, Gaussian z-axis smoothing further improves inter-slice consistency. Under fixed evaluation protocols on endoscopy, dermoscopy, and MRI benchmarks, DINO-MVR achieves strong readout-only performance, including 0.895 Dice on Kvasir-SEG, 0.897 Dice on ISIC 2018, and 0.908 Dice on BraTS FLAIR whole-tumor segmentation. With only five annotated BraTS patients, it recovers 98.4% of the performance obtained by the 40-patient BraTS reference run. These results suggest that frozen self-supervised vision backbones can support accurate medical segmentation when paired with an effective multi-view readout.
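The fusion step described above combines per-view probability maps with entropy-based weights. A minimal sketch of one plausible weighting follows; the specific weight (one minus normalized binary entropy, plus a small epsilon) and the function names are assumptions, since the abstract does not give the exact formula.

```python
import math

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def entropy_weighted_fusion(prob_maps, eps=1e-6):
    """Fuse foreground-probability maps from several views, pixel by pixel.

    prob_maps: list of equally sized 2-D lists, one per view, already
               resampled to a common grid and with augmentations undone.
    A view's weight at a pixel is 1 - H(p) / log(2) + eps, so confident
    (low-entropy) predictions count for more than uncertain ones.
    """
    rows, cols = len(prob_maps[0]), len(prob_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            probs = [view[r][c] for view in prob_maps]
            weights = [1.0 - binary_entropy(p) / math.log(2.0) + eps
                       for p in probs]
            fused[r][c] = sum(w * p for w, p in zip(weights, probs)) / sum(weights)
    return fused


# Toy 1x2 maps from two views: one view is confident, the other undecided.
view_a = [[0.9, 0.8]]
view_b = [[0.5, 0.8]]
toy_fused = entropy_weighted_fusion([view_a, view_b])
```

At the first pixel the undecided view (probability 0.5) gets almost no weight, so the fused value stays close to 0.9; where the views agree, the fused value equals their shared prediction.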

2605.07218 2026-05-11 cs.LG stat.ML

Improved Model-based Reinforcement Learning with Smooth Kernels

Kun Long, Yuqiang Li, Xianyi Wu

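AI summary This paper studies model-based reinforcement learning in continuous state-action spaces and proposes an improved approach based on smooth kernels, which exploits the smoothness of the MDP through non-parametric kernel smoothing estimates. By introducing a Bernstein-style exploration bonus, the method achieves a better regret bound than existing methods in the finite-horizon setting; its analysis also yields a new Bernstein-type concentration inequality for martingales that may be of independent interest.

Comments 38 pages, 5 figures

Details
English abstract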

For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-based approaches offer a promising alternative paradigm that instead leverages the smoothness of the MDP and employs non-parametric kernel smoothing estimates of transition dynamics. This paper proposes a new kernel-smoothing model-based approach for online reinforcement learning in finite-horizon settings under Lipschitz continuity assumptions on the MDP. By incorporating a Bernstein-style exploration bonus into the kernel smoothing framework, our method achieves a regret bound which improves upon the state-of-the-art regret bound in its dependence on the horizon. The theoretical advancement relies on a delicate analysis of the synergy between Bernstein-style bonuses and kernel smoothing, where a new tight Bernstein-type concentration inequality for martingales may be of independent interest.