arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08864 2026-05-12 cs.LG math.ST stat.TH

Higher-Order Equilibrium Tracking for EM-Compressible Online Estimation

ZhiMing Li, Yue Song

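AI summary This paper studies online estimation in latent-variable models, recasting it as tracking a moving empirical equilibrium. The authors propose a new analytical framework that decomposes the online estimate into the frozen batch equilibrium at the current running statistic and a tracking-lag error, and prove that, under certain conditions, the online estimator inherits the batch central limit theorem and the sharp first-order risk constant. The work also introduces EM-compressibility and related notions to provide theoretical support for online tracking, and validates the approach on latent linear Gaussian covariance estimation.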

Comments 41 pages, 6 figures

Abstract

We study online estimation in latent-variable models by recasting the problem as tracking a moving empirical equilibrium. Standard online EM and stochastic approximation analyses primarily study convergence toward the population parameter and typically do not isolate the empirical batch optimum from the online tracking error at finite horizon. Our framework decomposes the online estimate into the frozen batch equilibrium at the current running statistic and a tracking lag that captures the algorithm's delay behind this moving target. We prove a batch-to-online transfer theorem: provided $\lVert e_T \rVert_{L^{2}} = o(T^{-1/2})$, the online estimator inherits the batch central limit theorem and the sharp first-order risk constant. Our key observation is that the empirical optimum evolves on a smooth equilibrium manifold indexed by the running statistic. An $m$-th order equilibrium-jet predictor combined with an order-$\nu$ frozen corrector yields localized tracking rates $O(T^{-\nu(m+1)})$. We formalize EM-compressibility and EM-jet$^R$-compressibility as the structural conditions that make the equilibrium response and the Newton corrector evaluable from a retained streaming statistic. The theory is instantiated in latent linear Gaussian covariance estimation, where the first-order scheme operates on a compressed $d \times d$ statistic with explicit finite-sample risk envelopes and a certified restart rule.
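
The decomposition and the transfer condition in this abstract can be written out as a short worked equation. Only $e_T$ appears in the abstract; the symbols $\hat\theta_T$ (online estimate), $s_T$ (running statistic), $\theta^\star(\cdot)$ (frozen batch equilibrium map), and $\theta_0$ (population parameter) are notation assumed here for illustration and may differ from the paper's.

```latex
\hat\theta_T \;=\; \theta^\star(s_T) \;+\; e_T,
\qquad
\sqrt{T}\,\bigl(\hat\theta_T - \theta_0\bigr)
\;=\;
\underbrace{\sqrt{T}\,\bigl(\theta^\star(s_T) - \theta_0\bigr)}_{\text{batch term}}
\;+\;
\underbrace{\sqrt{T}\, e_T}_{\text{tracking lag}} .
```

If $\lVert e_T \rVert_{L^{2}} = o(T^{-1/2})$, then $\sqrt{T}\, e_T \to 0$ in $L^{2}$ and hence in probability, so under this reading the online estimator has the same $\sqrt{T}$-scaled limit as the batch term.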

2605.08862 2026-05-12 cs.LG cs.AI

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

Yuhang Xu, Kaibin Tian, Yang Tian, Zhice Yang, Yifeng Yu, Yan Li, Shengzhong Liu, Fan Wu, Guihai Chen

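AI summary Reinforcement learning (RL) plays an important role in improving large language model (LLM) performance, but its rollout phase is inefficient in data-parallel settings because of long-tail latency. This paper proposes BubbleSpec, a new framework that accelerates RL rollouts while preserving mathematical exactness: it uses the idle time of faster devices to pre-generate rollout results for subsequent steps, which then serve as drafts for speculative decoding, thereby improving training throughput. Experiments show that BubbleSpec reduces decoding steps by 50% and raises rollout throughput to as much as 1.8x the original, and that it is compatible with a variety of RL frameworks.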

Abstract

Reinforcement Learning (RL) has become a cornerstone for improving the performance of Large Language Models (LLMs). However, its rollout phase is a significant efficiency bottleneck, mainly because of long-tail bubbles across data-parallel ranks, particularly in long-context scenarios where faster GPUs sit idle while waiting for stragglers. Existing solutions, such as partial rollout or asynchronous RL, mitigate these bubbles by compromising the algorithm's strict synchronous nature. We instead propose BubbleSpec, a novel framework that accelerates RL rollouts while strictly preserving mathematical exactness. Rather than attempting to eliminate bubbles, BubbleSpec exploits them: it uses the idle time windows of faster ranks to pre-generate rollout results for subsequent steps, which serve as drafts for speculative decoding. Unlike prior speculative methods that rely on historical epoch similarity and warm-ups, BubbleSpec is agnostic to dataset size and provides immediate acceleration from the onset of training. Extensive evaluations demonstrate that BubbleSpec reduces decoding steps by 50% and increases rollout throughput by up to 1.8x. Critically, BubbleSpec is compatible with various RL frameworks and strategies because it preserves the strict synchronous property of RL algorithms.
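
To illustrate why verifying a pre-generated draft can leave the output unchanged, here is a minimal sketch of draft verification under greedy decoding. It is not BubbleSpec's implementation: the abstract does not give the acceptance rule, and the names `verify_draft` and `target_next` are hypothetical. A real system would score all draft positions in one batched forward pass; the loop below checks them one at a time for clarity.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens to append after `prefix`.

    `target_next` is a hypothetical stand-in for the target model's greedy
    next-token choice. Draft tokens are kept while they match that choice;
    at the first mismatch the target's own token is emitted and we stop.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # Mismatch: fall back to the target's token, discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    return accepted
```

Every emitted token equals what the target model would have chosen greedily, so the draft only changes how many tokens are confirmed per step, not which tokens are produced.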

2605.08857 2026-05-12 cs.LG

RareCP: Regime-Aware Retrieval for Efficient Conformal Prediction

Manuel Heurich, Maximilian Granz, Tim Landgraf

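AI summary This paper proposes RareCP, a new method for improving the efficiency of prediction intervals in time series forecasting. By introducing expert networks that capture distinct error regimes together with an adaptive kernel, it addresses the nonstationarity caused by temporal dependence, drift, and heterogeneous errors. RareCP uses a retrieval mechanism to select the most relevant calibration samples from historical data and generate asymmetric prediction intervals, improving interval efficiency while maintaining empirical coverage, and it performs strongly in benchmark evaluations.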

Abstract

Recent advances in uncertainty quantification for time series forecasting show that conformal prediction can provide reliable prediction intervals, yet standard conformal methods are often inefficient under temporal dependence, drift, and heterogeneous error behavior. Existing methods typically either update miscoverage rates over time or learn unconstrained calibration weights, without explicitly separating two central sources of nonstationarity: smoothly drifting error distributions and co-existing distinct error regimes. We introduce RareCP, a regime-aware retrieval method for adaptive conformal time series prediction. RareCP learns local calibration representations through a mixture of cosine-attention experts that each capture distinct error regimes, while a compact hypernetwork adapts the kernel parameters to track temporal drift. Given a new forecasting context, RareCP retrieves the top-k most relevant calibration examples, assigns similarity weights, and forms a weighted conformal quantile over their signed residuals, yielding asymmetric prediction intervals. The adaptive kernel is trained using a smooth interval score objective, with a parameter-space anchor to a lightweight teacher kernel to preserve stable local representations. On the GIFT-Eval benchmark, RareCP improves interval efficiency over recent conformal baselines and foundation model uncertainty estimates while maintaining empirical coverage. Ablations confirm that regime-specific experts, drift-adaptive kernels, sparse retrieval, and teacher anchoring each contribute to the final performance.
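
The retrieval step described above (top-k calibration examples, similarity weights, weighted quantile over signed residuals) can be sketched as follows. This is a simplified illustration, not RareCP itself: cosine similarity on fixed embeddings and exponential weights are assumptions, the learned experts and hypernetwork are omitted, and no finite-sample conformal correction is applied. The function names are hypothetical.

```python
import numpy as np


def weighted_quantile(values, weights, q):
    """Smallest value whose normalized cumulative weight reaches q."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)
    idx = int(np.searchsorted(cdf, q, side="left"))
    return v[min(idx, len(v) - 1)]


def retrieval_interval(query, cal_embeddings, cal_residuals,
                       point_forecast, k=3, alpha=0.2):
    """Asymmetric interval from the k most similar calibration examples."""
    cal = np.asarray(cal_embeddings, dtype=float)
    q = np.asarray(query, dtype=float)
    # Cosine similarity between the query context and each calibration row.
    sims = cal @ q / (np.linalg.norm(cal, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(-sims)[:k]
    weights = np.exp(sims[top])  # assumed weighting; more similar counts more
    residuals = np.asarray(cal_residuals, dtype=float)[top]
    # Signed residuals give separate lower and upper offsets.
    lo = weighted_quantile(residuals, weights, alpha / 2)
    hi = weighted_quantile(residuals, weights, 1 - alpha / 2)
    return point_forecast + lo, point_forecast + hi
```

Because the quantiles are taken over signed residuals rather than absolute ones, the lower and upper offsets need not be symmetric around the point forecast.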

2605.08854 2026-05-12 cs.CV

Restoration-Aligned Generative Flow Models for Blind Motion Deblurring

Insoo Kim, Jinwoo Shin

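AI summary This paper proposes DeblurFlow, a generative flow model framework for blind motion deblurring. By replacing the noise endpoint of the flow trajectory with the blurred observation, the method aligns the model's training objective with the deblurring task, avoiding the fidelity degradation that conventional generative flow models suffer on restoration tasks. The work also introduces r-space, a latent space designed specifically for residual decoding, which substantially reduces computational cost, and demonstrates DeblurFlow's strong restoration fidelity and perceptual realism on multiple datasets.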

Abstract

Generative flow models offer powerful priors learned from large-scale natural images, but directly adapting them to restoration tasks such as motion deblurring causes severe fidelity degradation, as their training objective is inherently misaligned with restoration. We present DeblurFlow, a framework that resolves this misalignment by reformulating the flow trajectory itself: we replace the noise endpoint with the blur observation, which makes the underlying vector field coincide with the residual error between blur and clean images. Under this formulation, the standard flow matching loss naturally takes the form of a residual loss, allowing pretrained flow models to be optimized under restoration-aligned objectives via LoRA adaptation. This formulation further enables a dual-expert sampling strategy: a fidelity expert provides a high-fidelity initialization, e.g., PSNR 33.69 dB, and DeblurFlow enhances perceptual quality with only a marginal fidelity reduction to 33.05 dB, whereas directly applying a generative model on top of a fidelity expert decreases PSNR to 27.60 dB. To make this practical, we further introduce r-space, a latent space tailored for residual decoding rather than image reconstruction, which reduces encoder-decoder cost by up to 9$\times$ over standard VAE latents. Extensive experiments on GoPro, HIDE, RealBlur, and RWBI demonstrate that DeblurFlow achieves strong restoration fidelity and perceptual realism, while remaining computationally practical.
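
One way to see why the vector field coincides with the blur-to-clean residual is to write the trajectory as a straight-line interpolation. The linear path, the time direction, and the symbols $x_{\mathrm{blur}}$, $x_{\mathrm{clean}}$, and $v_\theta$ are assumptions made here for illustration; the abstract does not specify the paper's exact parameterization.

```latex
x_t = (1 - t)\, x_{\mathrm{blur}} + t\, x_{\mathrm{clean}}, \quad t \in [0, 1]
\qquad\Longrightarrow\qquad
\frac{\mathrm{d} x_t}{\mathrm{d} t} = x_{\mathrm{clean}} - x_{\mathrm{blur}},
\\[6pt]
\mathcal{L}(\theta)
= \mathbb{E}_{t,\,(x_{\mathrm{blur}},\, x_{\mathrm{clean}})}
\bigl\lVert v_\theta(x_t, t) - (x_{\mathrm{clean}} - x_{\mathrm{blur}}) \bigr\rVert^{2} .
```

Under these assumptions the flow matching target is the residual itself, so the flow matching loss has the form of a residual regression loss.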

2605.08853 2026-05-12 cs.CL

Architecture, Not Scale: Circuit Localization in Large Language Models

Sohan Venkatesh

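AI summary This study challenges the common assumption that mechanistic interpretability becomes harder as models grow, arguing that architectural design matters more for circuit behavior than parameter count. Comparing three circuit types across Pythia and Qwen2.5 models, it finds that grouped query attention yields more concentrated and more stable circuits at comparable scales. It also finds that, within a particular architecture, factual recall circuits undergo a discrete phase transition above a critical scale rather than degrading gradually, suggesting that sound architectural choices can make large models more tractable for interpretability research.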

Abstract

Mechanistic interpretability assumes that circuit analysis becomes harder as models scale. We challenge this assumption by showing that the attention architecture matters more than parameter count. Studying three circuit types across Pythia and Qwen2.5, we find that grouped query attention produces circuits that are far more concentrated and mechanistically stable than standard multi-head attention at comparable scales. The same concentration pattern holds across indirect object identification, induction heads, and factual recall. Within a single architecture family (Qwen2.5), factual recall circuits undergo a discrete phase transition above a critical scale, collapsing to a single bottleneck rather than degrading gradually. These findings suggest that some architectural choices make large models more tractable to study and that interpretability difficulty is not a fixed consequence of model size.

2605.08847 2026-05-12 cs.CL

EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding

Pengze Guo, Jingxi Liang, Zhiwen Xie, Qifeng Wang, Derek F. Wong

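AI summary Against the backdrop of a high-pressure, aging society, building large-scale emotional models capable of empathetic support has become especially important. To this end, this paper introduces EmoS, a high-fidelity bilingual multimodal benchmark dataset that addresses the shortcomings of existing datasets in ecological validity and noise by combining strictly filtered static slices with a dynamic streaming-monologue subset. EmoS uses a dual-layer human annotation pipeline to ensure continuity of emotional evolution and annotation reliability. Experiments show that multimodal large language models fine-tuned on EmoS significantly outperform zero-shot baselines on emotion understanding tasks, laying a solid foundation for training and evaluating future emotion recognition and empathy models.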

Comments Accepted to the ACL 2026 main conference

Abstract

In the context of today's high-pressure, aging society, the demand for large-scale emotional models capable of providing empathetic support is more critical than ever. However, existing benchmarks fail to simultaneously achieve ecological validity, signal clarity, and reliable fine-grained labeling. We introduce EmoS, a high-fidelity bilingual benchmark designed to resolve the limitations of ecological validity and noise in existing datasets by combining strictly filtered static slices with a dynamic Streaming Monologue subset. Supported by a rigorous dual-layer human annotation pipeline, EmoS provides trusted ground truth that captures continuous emotional evolution. Empirical results show that fine-tuning MLLMs (multimodal large language models) on EmoS yields significant gains over zero-shot baselines, laying the foundation for the training and evaluation of future emotion recognition models and empathy models. The dataset and code are publicly available at https://github.com/NLP2CT/EmoS.

2605.08843 2026-05-12 cs.AI cs.LG

M$^3$: Reframing Training Measures for Discretized Physical Simulations

Yuan Mei, Xingyu Song, Xiaowen Song, Naoya Takeishi

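AI summary In physical simulation, neural surrogate models are usually trained on discretized samples, but the resulting empirical measure leads to uneven supervision, which biases optimization and causes spatial inconsistencies in physical fidelity. To address this, the work proposes M$^3$ (Multi-scale Morton Measure), a framework that balances the training measure by partitioning space according to physical variation and allocating supervision across multiple scales, thereby mitigating measure-induced bias. Experiments show that M$^3$ significantly improves predictions in the continuous physical domain on several industrial-scale datasets, with up to 4.7$\times$ lower error in large-scale volumetric cases, and that it retains its advantage under data subsampling, demonstrating scalability and data efficiency for physically consistent modeling.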

Abstract

Neural surrogate models for physical simulations are trained on discretized samples of continuous domains, where the induced empirical measure leads to uneven supervision, biasing optimization and causing spatial inconsistencies in physical fidelity. To mitigate this measure-induced bias, we propose M$^3$ (Multi-scale Morton Measure), a scalable framework that balances training measures by partitioning space according to physical variation and allocating supervision across multiple scales. Applied to three industrial-scale datasets with diverse discretizations, M$^3$ consistently improves predictions in the continuous physical domain, achieving up to 4.7$\times$ lower error in large-scale volumetric cases. These gains persist under aggressive subsampling (160M $\rightarrow$ 16M $\rightarrow$ 1.6M points), where M$^3$-trained models outperform those trained on higher-resolution data, reducing physics-weighted relative $L_2$ error by 3--4$\times$ and the corresponding MSE by up to 13$\times$. These results highlight data distribution as a key factor in operator learning and position M$^3$ as a scalable, data-efficient approach for physically consistent modeling.

2605.08842 2026-05-12 cs.CL

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

Chang Liu, Boyu Shi, Xu Yang, Xin Geng

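AI summary This study proposes XPERT, a framework that extracts and reuses expert knowledge from Mixture-of-Experts (MoE) language models to improve the training of language models at different scales. By analyzing expert activation patterns, XPERT identifies cross-domain, broadly generalizable expert modules, refines their representations with techniques such as tensor decomposition, and transfers their knowledge to downstream models. Experiments show that models reusing expert knowledge achieve stronger performance and faster convergence on language understanding and dialogue generation tasks, highlighting the value of MoE models as structured knowledge sources.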

Abstract

Mixture-of-Experts (MoE) language models organize knowledge into explicitly routed expert modules, making expert-level representations traceable and analyzable. By analyzing expert activation patterns in MoE large language models (LLMs), we find that a subset of experts is consistently activated across diverse knowledge domains. These common experts encode cross-domain, generalizable knowledge that is closely related to model generalization, naturally raising the question of how such identifiable expert knowledge can be practically reused. Motivated by this observation, we propose XPERT, a framework that extracts, consolidates, and reuses expert knowledge from pre-trained MoE LLMs to support more effective training of language models across different model scales. XPERT identifies cross-domain experts via inference-only analysis, refines their representations through tensor decomposition, and adapts the extracted knowledge to reuse in downstream models. Experiments on language understanding and dialogue generation benchmarks show that models benefiting from reused expert knowledge achieve consistently stronger performance and faster convergence compared to strong baselines. These results highlight MoE LLMs as structured and reusable knowledge sources, and demonstrate the value of expert-level knowledge reuse for improving model training.

2605.08841 2026-05-12 cs.CV

Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models

Junli Zha, Jiahui Wang, Xinkai Lu, Jinbo Wang

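AI summary This study addresses the tendency of vision-language models (VLMs) to rely on memorization rather than actual visual perception in classic visual illusion understanding tasks, and proposes a training-free framework that requires no fine-tuning. The method combines three complementary strategies (illusion-aware image preprocessing, anti-illusion prompt engineering, and a multi-vote ensemble) to improve the models' ability to recognize visual illusions. Experiments show that it reaches 90.48% accuracy on the official test set and 98.41% on a human-verified subset, and that it took second place in the challenge.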

Comments Accepted at CVPR 2026 Workshop on 5th DataCV Challenge

Abstract

Vision-Language Models (VLMs) exhibit systematic bias toward visual illusions, recalling memorized facts rather than perceiving actual visual differences. This paper presents a training-free framework for the 5th DataCV Challenge Task 1 at CVPR 2026, addressing this perception-versus-memory conflict through three complementary strategies: (1) illusion-aware image preprocessing that weakens illusion-inducing context via type-specific transformations (edge extraction, color isolation, morphological processing, and reference-line overlay), (2) anti-illusion prompt engineering that guides VLMs toward qualitative visual comparison, and (3) a multi-vote ensemble that further improves robustness. Our method achieves 90.48% accuracy on the official 630-image test set using Claude (claude-opus-4-6) with a 5-vote majority ensemble, and 98.41% on a human-verified subset. The approach requires no finetuning, relying solely on visual manipulation and prompt design. Our solution secured 2nd place in the challenge, only 0.47% behind the 1st-place solution. Code is available at https://github.com/jasminezz/sf-illusion-aware-vlm.git.
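
The multi-vote ensemble in strategy (3) amounts to querying the model several times and keeping the most frequent answer. The sketch below shows that idea only; `ask_model` is a hypothetical zero-argument callable standing in for one VLM query, and the authors' actual voting code may differ.

```python
from collections import Counter
from typing import Callable


def majority_vote(ask_model: Callable[[], str], n_votes: int = 5) -> str:
    """Query the (hypothetical) model n_votes times; return the modal answer."""
    answers = [ask_model() for _ in range(n_votes)]
    # most_common sorts by count; ties keep first-seen order.
    return Counter(answers).most_common(1)[0][0]
```

With an odd number of votes and a binary answer space a tie cannot occur, which is one reason a 5-vote setting is convenient.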

2605.08840 2026-05-12 cs.CL

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

Yongqi An, Chang Lu, Kuan Zhu, Tao Yu, Chaoyang Zhao, Hong Wu, Ming Tang, Jinqiao Wang

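AI summary As the memory demands of Key-Value (KV) caches grow for large language models generating long sequences, efficient KV cache eviction has become a key problem. This paper proposes ReST-KV, which combines layer-wise output reconstruction with spatial-temporal smoothing to account more fully for attention redistribution effects and spatial-temporal dynamics during KV cache eviction. The method formulates KV cache eviction as an optimization problem that reduces output discrepancy, significantly improves performance on several long-context benchmarks, and substantially lowers inference latency.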

Comments Accepted at ICLR 2026. Project Page: https://github.com/an-yongqi/rest-kv

Abstract

Large language models (LLMs) face growing challenges in efficient generative inference due to the increasing memory demands of Key-Value (KV) caches, especially for long sequences. Existing eviction methods typically retain KV pairs with high attention weights but overlook the impact of attention redistribution caused by token removal, as well as the spatial-temporal dynamics in KV selection. In this paper, we propose ReST-KV, a robust KV eviction method that combines layer-wise output Reconstruction and Spatial-Temporal smoothing to provide a more comprehensive perspective for the KV cache eviction task. Specifically, ReST-KV formulates KV cache eviction as an optimization problem that minimizes output discrepancies through efficient layer-wise reconstruction. By directly modeling how each token's removal affects the model output, our method naturally captures attention redistribution effects, going beyond simplistic reliance on raw attention weights. To further enhance robustness, we design exponential moving average smoothing to handle temporal variations and an adaptive window-based mechanism to capture spatial patterns. Our method, ReST-KV, significantly advances performance on long-context benchmarks. It surpasses state-of-the-art baselines by 2.58% on LongBench and 15.2% on RULER. Additionally, ReST-KV consistently outperforms existing methods on Needle-in-a-Haystack and InfiniteBench, all while achieving a remarkable 10.61$\times$ reduction in decoding latency at 128k context length. The code is publicly available at https://github.com/an-yongqi/rest-kv to facilitate reproducibility and further research.
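
The smoothing ideas mentioned above can be sketched on a toy score matrix. This is an illustration under stated assumptions, not the ReST-KV algorithm: the per-token importance scores are taken as given (the paper derives them from layer-wise output reconstruction, which is not reproduced here), the spatial step uses a fixed averaging window where the paper describes an adaptive one, and `smooth_and_select`, `beta`, and `window` are hypothetical names.

```python
import numpy as np


def smooth_and_select(step_scores, budget, beta=0.9, window=3):
    """Pick which cached token positions to keep.

    step_scores: array of shape (steps, tokens) with one importance score
    per cached token at each decoding step (assumed to be given).
    """
    scores = np.asarray(step_scores, dtype=float)
    # Temporal smoothing: exponential moving average across decoding steps.
    ema = scores[0].copy()
    for s in scores[1:]:
        ema = beta * ema + (1.0 - beta) * s
    # Spatial smoothing: average each position with its neighbours.
    kernel = np.ones(window) / window
    spatial = np.convolve(ema, kernel, mode="same")
    # Keep the `budget` highest-scoring positions, in positional order.
    return np.sort(np.argsort(-spatial)[:budget])
```

The spatial step means a token next to a high-scoring token is more likely to be retained than an isolated token with the same raw score.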

2605.08839 2026-05-12 cs.CV

Cross-Sample Relational Fusion: Unifying Domain Generalization and Class-Incremental Learning

Zhen-Hao Xie, Yan Wang, Hao Sun, Han-Jia Ye, De-Chuan Zhan, Da-Wei Zhou

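AI summary This paper proposes CORF, a unified framework that handles domain shift and catastrophic forgetting together, to address challenges in incremental learning. The method selectively refines training samples using spatial contribution maps and adaptively reweights samples according to predictive confidence to improve generalization. CORF also introduces a cascaded knowledge distillation mechanism that captures cross-sample relational dependencies for multi-grained knowledge transfer, effectively mitigating forgetting. It can be integrated seamlessly into existing incremental learning algorithms and achieves good experimental results.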

Comments Accepted by IEEE Transactions on Multimedia (TMM 2026). Code is available at https://github.com/LAMDA-CL/TMM26-CORF

Abstract

Class-Incremental Learning (CIL) requires a learning system to learn new classes while retaining previously learned knowledge. However, in real-world scenarios such as autonomous driving, a system trained on urban roads in sunny weather may later need to operate in rural or highway environments with different traffic patterns and weather conditions. This requires the model not only to overcome catastrophic forgetting, but also to effectively handle domain shifts. In this paper, we propose CrOss-sample Relational Fusion (CORF), a unified framework to address domain shift and catastrophic forgetting simultaneously. To enhance generalizability, we perform selective refinement of training samples by leveraging spatial contribution maps to highlight semantically informative regions. Furthermore, we incorporate predictive confidence to adaptively weigh samples, thereby facilitating the learning of domain-agnostic representations. To alleviate forgetting, we propose a cascaded distillation framework that captures cross-sample relational dependencies across multiple feature hierarchies, enabling multi-grained knowledge transfer from previous tasks. CORF can be seamlessly integrated into existing CIL algorithms to enhance their generalizability, achieving competitive performance across various benchmark datasets. Code is available at https://github.com/LAMDA-CL/TMM26-CORF.

2605.08838 2026-05-12 cs.CL cs.AI

Generating Leakage-Free Benchmarks for Robust RAG Evaluation

Jiayi Liu, Jiaxing Zhang, Bowen Jin, Jennifer Neville

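AI summary This paper studies how knowledge leakage and benchmark aging make the evaluation of retrieval-augmented generation (RAG) systems unreliable, and proposes SeedRG, a semi-synthetic benchmark generation method. SeedRG extracts reasoning graphs from a seed dataset and uses type-constrained entity replacement to generate structurally similar but novel question instances, thereby reducing knowledge leakage. It also adds two verification steps, a reasoning-graph consistency check and a knowledge-leakage filter, so that the generated benchmark preserves task difficulty while excluding questions that the model's parametric memory already covers.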

Abstract

Retrieval-augmented generation (RAG) is widely used to augment large language models (LLMs) with external knowledge. However, many benchmark datasets, designed to test RAG performance, comprise many questions that can already be answered from an LLM's parametric memory. This leads to unreliable evaluation. We refer to this phenomenon as knowledge leakage: cases where RAG tasks are solvable without retrieval. This issue worsens over time due to benchmark aging. As benchmarks are reused for training, their contents are increasingly absorbed into model parameters, making them less effective for evaluating retrieval. We introduce SeedRG, a semi-synthetic benchmark generation pipeline that mitigates knowledge leakage and addresses the issue of benchmark aging. Starting from a seed benchmark dataset, SeedRG extracts a reasoning graph from question-context pairs to capture their underlying reasoning structure, and then generates new examples via type-constrained entity replacement. This process produces structurally similar but novel instances that are unlikely to exist in the model's parametric knowledge, while preserving the original reasoning patterns. To ensure quality, we incorporate two verification steps: (1) a reasoning-graph consistency check to maintain task difficulty, and (2) a knowledge-leakage filter to exclude instances answerable without retrieval.
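
Type-constrained entity replacement can be illustrated with a small sketch: each entity is swapped for a different entity of the same type, and the same mapping is applied everywhere so the reasoning structure is unchanged. This is not the SeedRG pipeline; reasoning-graph extraction and both verification steps are omitted, and the function name, the entity-type labels, and the candidate pool are hypothetical.

```python
import random
import re
from typing import Dict, List, Tuple


def replace_entities(
    text: str,
    entity_types: Dict[str, str],
    pool: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[str, Dict[str, str]]:
    """Swap each entity for a same-type candidate; return text and mapping."""
    mapping: Dict[str, str] = {}
    for surface, etype in entity_types.items():
        # Candidates must share the type and must not reuse any name already
        # present in the original or already chosen as a replacement.
        candidates = [
            c for c in pool[etype]
            if c not in entity_types and c not in mapping.values()
        ]
        if not candidates:
            raise ValueError(f"no replacement available for {surface!r}")
        mapping[surface] = rng.choice(candidates)
    if not mapping:
        return text, mapping
    # Single-pass substitution so a replacement is never itself replaced.
    pattern = re.compile(
        "|".join(re.escape(s) for s in sorted(mapping, key=len, reverse=True))
    )
    return pattern.sub(lambda m: mapping[m.group(0)], text), mapping
```

Applying the returned mapping to the question and to its supporting context keeps the two consistent, which matters because the answer must remain recoverable from the rewritten context.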

2605.08837 2026-05-12 cs.CL cs.AI

The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans

Odysseas S. Chlapanis, Orfeas Menis Mastromichalakis, Christos H. Papadimitriou

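AI summary This study examines how large language models (LLMs) differ from humans in understanding abstract concepts. It finds that, when generating properties of abstract concepts, models rely too heavily on word associations and draw less on dimensions that humans depend on more, such as emotion and internal states, producing a pronounced "grounding gap". By replicating cognitive science experiments and analyzing the models' internal features, the study shows that although models can reflect certain grounding dimensions when explicitly queried, they still lack human-like situated understanding when generating words freely.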

Abstract

Abstract concepts - justice, theory, availability - have no single perceivable referent; in the human brain, their meaning emerges from a web of experiences, affect, and social context. Do large language models (LLMs) ground abstract concepts in a similar way? We study this by replicating property-generation experiments from cognitive science on 21 frontier and open-weight LLMs. Across models and experiments, we find a consistent pattern: when compared to humans, models rely too heavily on word associations, and underproduce properties tied to emotion and internal states. This yields a large and consistent grounding gap: no model exceeds a Pearson correlation r=0.37 with human responses, compared to a human-to-human ceiling above r=0.9. To better interpret this gap, we also replicate a rating experiment on grounding categories and find that here LLMs align more closely with human judgment, and alignment improves as models get larger. We then use sparse autoencoders (SAEs) to inspect whether this information is also reflected in the models' internal features, and we do identify features connected to grounding dimensions such as "sensorimotor" and "social". These findings suggest that current LLMs can recover grounding dimensions when explicitly queried, but do not recruit them in a human-like way when words are generated freely.

2605.08835 2026-05-12 cs.AI

SynerDiff: Synergetic Continuous Batching for Fast and Parallel Diffusion Model Inference

Ziqi Zhou, Peng Yang, Yuxin Liang, Mingliu Liu, Jia Lu

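AI summary As AI-generated content services expand, diffusion model inference must deliver both high throughput and low end-to-end latency. To address the severe resource contention, and the resulting latency spikes, that existing continuous batching methods suffer under UNet-VAE concurrency, this paper proposes SynerDiff, a system that optimizes resource allocation and task scheduling through intra- and inter-level synergy. At the intra-concurrency level it relieves resource bottlenecks with VAE Chunking and Adaptive Skip-CFG; at the inter-concurrency level it introduces a threshold-aware scheduler that accounts for scheduling granularity and dynamically adjusts the threshold to balance UNet throughput against VAE latency. Experiments show that, while preserving image quality, it improves throughput by 1.6$\times$ and reduces both average end-to-end latency and P99 tail latency by up to 78.7%.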

Comments Accepted by IEEE ICME 2026

Abstract

The expansion of AI-generated content services requires diffusion model serving to achieve high throughput and low task end-to-end (E2E) latency simultaneously. However, existing continuous batching methods suffer from severe resource contention during UNet-VAE concurrency, leading to latency spikes. Furthermore, concurrent multi-task scheduling entails a trade-off between UNet throughput and VAE latency across varying scheduling strategies. To address these issues, we propose SynerDiff, an efficient continuous batching system built on intra-inter level synergy. At the intra-concurrency level, SynerDiff alleviates resource contention by pruning component-specific resource bottlenecks via VAE Chunking and Adaptive Skip-CFG. At the inter-concurrency level, leveraging the components' differential sensitivity to scheduling granularities, a threshold-aware scheduler plans concurrent sequences and tunes intra-concurrency decisions to minimize VAE latency while keeping the UNet within a high-throughput threshold. Additionally, a feedback controller dynamically adjusts this threshold based on queue loads to raise the system's capacity ceiling. Experimental results show that SynerDiff improves throughput by 1.6$\times$ and decreases both average E2E and P99 tail latencies by up to 78.7% compared to benchmarks, while guaranteeing high image fidelity.

2605.08833 2026-05-12 cs.AI

FRACTAL: SSM with Fractional Recurrent Architecture for Computational Temporal Analysis of Long Sequences

Mengqi Li, Wensheng Lin, Jinshuai Yang, Lixin Li

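AI summary This paper proposes FRACTAL, a new sequence modeling architecture that addresses the trade-off existing state space models (SSMs) face between long-term memory and detection of short-term dynamics on long sequences. The method introduces a fractional recurrent structure and uses fractional measure theory to design projection operators with a tunable singularity index, increasing sensitivity to recent signal changes while preserving scale invariance. Experiments show that FRACTAL performs strongly on the Long Range Arena benchmark, outperforming existing models such as S5.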

Comments 19 pages (10 pages main text, 9 pages appendix), 3 figures. Accepted by ICML 2026

Abstract

Effective sequence modeling fundamentally requires balancing the retention of unbounded history with the high-resolution detection of abrupt short-term variations common in real-world phenomena. However, existing state space models (SSMs) relying on high-order polynomial projection operators (HiPPO) face a critical trade-off where uniform measures dilute recent information to maintain timescale invariance, while exponential measures sacrifice global context to capture local dynamics. This paper proposes a Fractional Recurrent Architecture for Computational Temporal Analysis of Long sequences (FRACTAL), a novel architecture integrating fractional measure theory into recursive memory updates to address this limitation. By deriving projection operators with analytically characterized spectral properties and a tunable singularity index, the proposed method amplifies sensitivity to recent signal perturbations while preserving the spectral structure that encodes scale-invariant memory dynamics. This theoretical innovation is instantiated within a simplified diagonalized state space framework by modulating input projection initialization to enable simultaneous capture of multi-scale temporal features. FRACTAL achieves an average score of 87.11% on the Long Range Arena benchmark, including 61.85% on the ListOps task, outperforming the S5 model.

2605.08831 2026-05-12 cs.RO

AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System

Chenhao Zhang, Chaoran Zhang, Zhaobo Xu, Yongbo Yang, Pingfa Feng, Long Zeng

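AI summary This paper proposes AssemPlanner, a multi-agent task planning framework for flexible assembly systems, aimed at the problem that existing methods depend on time-consuming manual configuration by experts when setting up a production line for a new product. The framework converts tasks described in natural language into executable sequences of production operations and handles complex industrial constraints autonomously through the cooperation of several components: a scheduling agent, a knowledge agent, a line-balancing agent, and a scene graph. Its core contribution is a ReAct-based scheduling agent that dynamically adjusts its planning strategy using multi-agent feedback, improving the flexibility and automation of task planning.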

Abstract

In flexible assembly systems, existing task planning methods require a time-consuming configuration process by multiple experts to establish a production line for a new product. To address this challenge, we propose a multi-agent based task planning framework for flexible assembly systems, denoted as AssemPlanner. It takes tasks described in natural language as input, which are then converted into actionable sequential production operations. It comprises several specialized agents, including SchedAgent, KnowledgeAgent, LineBalanceAgent, and a scene graph. Within the proposed framework, SchedAgent serves as the central reasoning engine. Departing from traditional static pipelines, AssemPlanner utilizes a ReAct-based SchedAgent to adaptively adjust actions via multi-agent feedback. By observing the feedback from KnowledgeAgent, LineBalanceAgent, and the scene graph, it autonomously resolves complex industrial process constraints. To facilitate reproducibility, all code and datasets are released at https://github.com/chz332/Assemplanner.

2605.08820 2026-05-12 cs.CV cs.AI cs.CR

FraudBench: A Multimodal Benchmark for Detecting AI-Generated Fraudulent Refund Evidence

Xinyu Yan, Boyang Chen, Jiaming Zhang, Tiantong Wu, Hong Xi Tae, Yichen He, Tiantong Wang, Yachun Mi, Yurong Hao, Yilei Zhao, Lei Xiao, Longtao Huang, Pengjun Xie, Wei Liu, Wei Yang Bryan Lim

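AI summary As AI-generated images become increasingly realistic, detecting AI-generated fraudulent refund evidence has become a new challenge. To address it, the researchers introduce FraudBench, a multimodal benchmark dedicated to detecting AI-generated fake refund evidence. The benchmark is built from real-world scenarios such as e-commerce, food delivery, and travel services; it contains images, reviews, and product metadata, separates genuine damaged from undamaged evidence through model-assisted filtering and human annotation, and synthesizes fake-damaged images with state-of-the-art image generation models. Experiments show that existing models still fall well short at detecting AI-generated fake-damage evidence, revealing a clear gap between generic image detection and verification of fraudulent evidence in real-world scenarios.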

Abstract

Artificial Intelligence (AI)-generated images have become increasingly realistic and readily adaptable to concrete real-world claims, creating new challenges for verifying visual evidence. A concrete emerging risk is AI-generated refund fraud, in which manipulated or synthetic images are used to support claims about damaged products, poor delivery conditions, or service-related defects. Existing AI-generated image detection benchmarks mainly evaluate standalone authenticity classification, cross-generator transfer, or forensic localization, leaving claim-conditioned fraudulent evidence detection underexplored. To bridge this gap, we introduce FraudBench, a multimodal benchmark for detecting AI-generated fraudulent refund evidence. FraudBench is constructed from real-world user-review evidence across e-commerce, food delivery, and travel-service scenarios. We curate real evidence images together with their associated review and product metadata, identify genuine damaged and undamaged evidence through MLLM-assisted filtering and human annotation, and synthesize fake-damaged evidence from genuine undamaged reference images using six state-of-the-art image editing and generation models. Using FraudBench, we evaluate MLLMs, specialized AI-generated image detectors, and human participants under the same settings. Experiments show that current MLLMs often recognize real-damaged evidence but fail on many fake-damaged subsets, with fake-damage detection rates (TPR) far below the 50% baseline on most generator subsets. Specialized detectors generally perform better but remain inconsistent across generators and can produce false positives on real-damaged samples, revealing a clear gap between generic AI image detection and reliable claim-conditioned refund-evidence verification.

2605.08819 2026-05-12 cs.CV cs.LG

From pre-training to downstream performance: Does domain-specific pre-training make sense?

Felix Krones

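AI summary This study investigates whether domain-specific pre-training effectively improves downstream task performance in medical imaging. By systematically comparing convolutional neural networks and transformer models and analyzing the effects of several pre-training approaches (supervised and self-supervised) and data modalities, it finds that performance improves significantly only when the pre-training data closely matches the target modality. The study stresses the importance of pre-training strategy for the reliability of deep learning models in medical imaging and offers guidance for developing more accurate and dependable diagnostic tools.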

Abstract

Deep learning techniques have revolutionised medical imaging, improving diagnostic accuracy and enabling both more accurate and earlier disease detection. However, the relationship between pre-training strategies and downstream performance in medical imaging models requires further exploration. Here, we systematically compare convolutional neural networks and transformers, examining various pre-training approaches, including supervised and self-supervised learning, as well as different initialisations and data modalities. Models are evaluated on natural images, chest X-rays, chest CT and retina OCT images, considering the effects of matching pre-training data with target modalities. Our findings indicate that only pre-training on data closely matching the target modality significantly improves downstream performance. While self-supervised learning can outperform supervised methods, its effectiveness varies with context. The study underscores the importance of pre-training strategies to enhance the reliability and effectiveness of deep learning models in medical imaging. By addressing these key factors, our research aims to contribute to the development of more accurate and dependable diagnostic tools, ultimately improving patient outcomes in clinical settings.

2605.08817 2026-05-12 cs.AI

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

Yifan Xu, Junren Chen, Yifan Chen

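AI summary In reinforcement learning with verifiable rewards (RLVR), sparse rewards and long reasoning horizons make effective exploration difficult, which often shows up as entropy collapse: the model improves single-rollout accuracy but struggles to expand coverage of successful reasoning trajectories. To address this, the paper proposes Information-Maximizing Augmented eXploration (IMAX), a framework that trains a set of soft prefixes to reshape the base model's prior over reasoning trajectories and thereby induce diverse reasoning behaviors. Rather than relying on RL itself to incentivize exploration, the method combines an information-maximization reward with the verifiable reward, improving reasoning performance across multiple model scales.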

详情
英文摘要

Reinforcement learning with verifiable rewards (RLVR) has recently thrived in large language model (LLM) reasoning tasks. However, reward sparsity and the long reasoning horizon make effective exploration challenging. In practice, this challenge manifests as the entropy collapse phenomenon, where RLVR improves single-rollout accuracy but fails to expand coverage of successful reasoning trajectories. Passive exploration techniques like entropy regularization tend to neglect generation quality, resulting in noisy rollouts. In response to this issue, we propose an Information-Maximizing Augmented eXploration (IMAX) framework to train a pool of soft prefixes that reshape the base model's prior over reasoning trajectories. Rather than relying on RL to incentivize exploration on top of the base model, each prefix acts as a trainable control knob that induces a distinct rollout distribution from the same backbone model. To encourage discovery of diverse and task-relevant reasoning behaviors, we derive an Information Maximization (InfoMax) reward to complement the verifiable rewards for RL training. IMAX is in general algorithm-agnostic and can be seamlessly integrated into existing RLVR pipelines. Experimental results show that across three backbone scales, IMAX consistently improves reasoning performance over standard RLVR, with gains up to 11.60% in Pass@4 and 10.57% in Avg@4.
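
The abstract reports Pass@4 and Avg@4 without defining them. A minimal sketch of the usual reading follows, assuming Pass@k counts a problem as solved if any of its k rollouts is verified correct and Avg@k is the mean per-rollout accuracy. The function names and the `rollouts` data are hypothetical, and the paper may use a different estimator, for example an unbiased Pass@k computed from more than k samples.

```python
from typing import List


def pass_at_k(results: List[List[bool]]) -> float:
    """Fraction of problems solved by at least one of their k rollouts."""
    return sum(any(r) for r in results) / len(results)


def avg_at_k(results: List[List[bool]]) -> float:
    """Mean per-rollout accuracy, averaged over problems."""
    return sum(sum(r) / len(r) for r in results) / len(results)


# Hypothetical verifier outcomes: 3 problems, k = 4 rollouts each.
rollouts = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, False],
]

print(pass_at_k(rollouts), avg_at_k(rollouts))
```

Under this reading, entropy collapse corresponds to Avg@k rising while Pass@k stalls: individual rollouts become more accurate, but the set of problems reachable by any rollout does not grow.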

2605.08816 2026-05-12 cs.AI cs.CY

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

Filippo Ziliotto, Ciro Beneduce, Bruno Lepri, Luciano Serafini, Massimiliano Luca, Tommaso Campari

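AI summary This paper studies whether embodied vision-language models (VLMs) have a cognitive ability analogous to mirror self-recognition in animals, that is, whether they can recognize themselves in a mirror. The authors build a controlled 3D environment in which the model must infer a hidden attribute of itself from its reflection and select the matching target, while avoiding attributing others' features to itself. Experiments show that only stronger VLMs achieve effective mirror-based self-identification, whereas weaker models often fail to extract self-relevant information or misattribute it, indicating that mirror self-recognition depends on a tight coupling of perception and action rather than on language prompts or prior knowledge alone.

Details
English abstract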

In the animal kingdom, mirror self-recognition is a canonical probe of higher-order cognition, emerging only in some species. We ask whether an analogous functional capability emerges in embodied vision-language model (VLM) agents: can they recognize themselves in a mirror? We introduce a controlled 3D benchmark where a first-person VLM agent must infer a hidden body attribute from its reflection and select the matching target, while avoiding self-other misattribution. To separate mirror-grounded self-identification from shortcuts, we test mirror removal, misleading cues, and occluded reflections. We also evaluate the decision process through mirror seeking, temporal ordering, self-attribution, and reasoning-action consistency. Our experiments show that mirror-based self-identification emerges mainly in stronger VLMs. These models can use reflected evidence for action, whereas weaker models often inspect the mirror but fail to extract self-relevant information or misattribute their reflection. Language-vision conflict further shows that self-referential language alone is not evidence of grounded self-identification. Overall, mirror-based evaluation provides a diagnostic for whether embodied self-grounding is causally rooted in perception and action rather than priors, prompt compliance, or confabulation.

2605.08815 2026-05-12 cs.LG q-bio.BM q-bio.GN q-bio.QM

MicroFuse: Protein-to-Genome Expert Fusion for Microbial Operon Reasoning

Seungik Cho

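AI summary MicroFuse is a protein-to-genome expert fusion framework for microbial operon reasoning that integrates protein-scale molecular identity with genome-context organization. It combines structure-aware protein representations and genome-context representations through a Mixture-of-Experts module with four experts (protein, genome-context, agreement, and conflict) and a learned soft routing policy. Experiments show that MicroFuse clearly outperforms protein-only and genome-only baselines on the newly constructed OG-Operon100K dataset, with the largest gains in biologically ambiguous cases.

Details
English abstract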

Predicting microbial operon co-membership requires integrating two complementary biological signals: protein-scale molecular identity and genome-context organization. While recent biological foundation models provide powerful representations of each view independently, naive concatenation of these modalities ignores a key biological property -- protein identity and genomic context may agree when adjacent genes form a coherent functional module, or conflict when sequence similarity is misleading but genomic layout indicates independent regulation. We present MicroFuse, a protein-to-genome expert fusion framework that integrates structure-aware protein representations from ProstT5 with genome-context representations from Bacformer through a four-expert Mixture-of-Experts module (protein, genome-context, agreement, and conflict experts) with a learned soft router. Training combines binary cross-entropy with symmetric cross-modal InfoNCE alignment and disagreement-weighted supervised contrastive shaping. We further construct OG-Operon100K, a 100,000-pair scaffold-level benchmark from the OMG metagenomic corpus with biologically grounded positive and negative criteria. On OG-Operon100K, MicroFuse achieves the strongest AUROC, AUPRC, mAP, and mAR among ProstT5-only, Bacformer-only, and Concat MLP baselines. Ablations identify cross-modal contrastive alignment as the dominant component, and a hard sequence-conflict subset reveals MicroFuse's largest gains precisely in biologically ambiguous cases where protein identity alone is misleading.
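
The four-expert module with a learned soft router can be illustrated with a small sketch. It assumes the router produces one logit per expert and the fused representation is the softmax-weighted sum of the expert outputs. The abstract does not give the exact fusion rule, so `soft_route`, the toy vectors, and the fixed logits below are hypothetical stand-ins for learned components.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_route(expert_outputs: Sequence[Sequence[float]],
               router_logits: Sequence[float]) -> List[float]:
    """Convex combination of expert output vectors, weighted by softmax gates."""
    gates = softmax(router_logits)
    dim = len(expert_outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, expert_outputs))
            for i in range(dim)]


# Hypothetical outputs of the four experts, in the order
# protein, genome-context, agreement, conflict.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

# Equal logits give equal gates of 0.25 each. In a trained model the
# logits would come from a learned function of the gene-pair features.
fused = soft_route(experts, [0.0, 0.0, 0.0, 0.0])
print(fused)
```

Because the gates sum to one, every expert contributes to every prediction; a hard (top-1) router would instead select a single expert per input.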

2605.08814 2026-05-12 cs.CV

Zero-Shot Chinese Character Recognition via Global-Local Dual-Branch Alignment and Hierarchical Inference

Wei Cao, Hao Xu, Xiaolei Diao

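AI summary This paper addresses the challenging problem of recognizing unseen Chinese characters in open-world scenarios and proposes a zero-shot Chinese character recognition method based on global-local dual-branch alignment and hierarchical inference. The method jointly learns global and local representations of character images and character structure descriptions within a unified cross-modal alignment framework, uses a structure filtering mask to suppress noisy operators in local similarity, and adopts a coarse-to-fine hierarchical inference strategy, improving both recognition performance and inference efficiency. Experiments show strong results across multiple zero-shot splits, with a particular advantage under low-resource conditions.

Comments 9 pages

Details
English abstract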

Chinese character categories are extremely large, and unseen characters frequently arise in open-world scenarios, making zero-shot Chinese character recognition an important yet challenging problem. Existing IDS-based retrieval methods usually encode a character image and its ideographic description sequence into a single global vector for matching. Although efficient, such holistic alignment often under-models local component differences. Moreover, directly introducing patch-token level fine-grained interaction suffers from both the noise of structural operators in IDS and the high cost of full-candidate retrieval. To address these issues, we propose a Global-Local Hierarchical Perception Network (GL-HPN), which jointly learns global and local representations of character images and IDS sequences within a unified cross-modal alignment framework. The global branch supports efficient coarse recall, while the local branch improves component-level discrimination through patch-token interaction. We further introduce a structure filtering mask to suppress structurally meaningful but visually non-entity IDS operators in local similarity aggregation. On top of this, we design a coarse-to-fine hierarchical inference strategy that performs global retrieval over the full candidate set and local reranking only on Top-$K$ candidates, followed by parameter-free multiplicative fusion of normalized posterior scores. Experimental results show that GL-HPN achieves competitive performance across multiple zero-shot splits, performs especially well under low-resource settings, and substantially reduces the inference cost of large-scale candidate retrieval.
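
The coarse-to-fine inference strategy described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: global similarity is a dot product, the "normalized posterior scores" are softmax distributions over the Top-K candidates, and `local_score` is a hypothetical callable standing in for the patch-token branch. None of these details are confirmed by the abstract.

```python
import math
from typing import Callable, Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hierarchical_predict(image_global: Sequence[float],
                         candidate_globals: Dict[str, Sequence[float]],
                         local_score: Callable[[str], float],
                         k: int) -> str:
    # Stage 1: cheap global retrieval over the full candidate set.
    global_scores = {name: dot(image_global, vec)
                     for name, vec in candidate_globals.items()}
    top_k = sorted(global_scores, key=global_scores.get, reverse=True)[:k]
    # Stage 2: expensive local scoring only on the Top-K survivors.
    p_global = softmax([global_scores[name] for name in top_k])
    p_local = softmax([local_score(name) for name in top_k])
    # Parameter-free multiplicative fusion of the two normalized scores.
    fused = [g * l for g, l in zip(p_global, p_local)]
    best = max(range(len(top_k)), key=fused.__getitem__)
    return top_k[best]


# Hypothetical candidates and local scores for one query image.
candidates = {"char_a": [1.0, 0.0], "char_b": [0.9, 0.1], "char_c": [0.0, 1.0]}
local = {"char_a": 0.0, "char_b": 2.0, "char_c": 5.0}
query = [1.0, 0.0]

print(hierarchical_predict(query, candidates, local.__getitem__, k=2))
```

With `k=2` the local branch overturns the global ranking in favor of `char_b`, while `char_c` is never scored locally because it was pruned in stage 1. This is where the inference saving comes from: local scoring runs K times instead of once per candidate.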

2605.08813 2026-05-12 cs.LG

AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems

Yulang Chen, Haoxuan Peng, Jinyan Liu, Zichen Wen, Dongrui Liu, Linfeng Zhang

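AI summary This paper proposes AgentSlimming, an efficient compression framework for LLM-based multi-agent systems whose communication structures are redundant and resource-hungry. The method scores the importance of each agent with a hybrid mechanism and, drawing on pruning and quantization, removes or replaces redundant agents, substantially lowering computational cost while preserving performance. Experiments show that AgentSlimming reduces average token consumption by up to 78.9% and in some cases improves task accuracy, achieving a Pareto-optimal balance between cost and quality.

Details
English abstract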

Large Language Model-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex tasks. However, manually designing optimal communication topologies is labor-intensive, while automated expansion methods often result in bloated structures with redundant agents, leading to excessive token consumption. To address this problem, we introduce AgentSlimming, a plug-and-play compression framework for graph-structured multi-agent workflows. Motivated by pruning and quantization in neural networks, AgentSlimming compresses workflows by first estimating the importance score of each agent with a hybrid mechanism and then removing redundant agents or replacing them with low-cost ones, where each operation is validated using a baseline-anchored acceptance rule to prevent performance collapse. Experiments show that AgentSlimming reduces average token cost by up to 78.9% with negligible performance degradation, and sometimes even improves accuracy, achieving a strong Pareto-optimal trade-off between cost and quality. Our code is publicly available at https://github.com/CitrusYL/AgentSlimming.
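
To make the baseline-anchored acceptance rule concrete, here is a minimal sketch of a greedy pruning loop. It assumes "baseline-anchored" means every candidate edit is compared against the score of the original, uncompressed workflow rather than against the previously accepted state. The names `slim_workflow`, `evaluate`, and `tolerance` are hypothetical and are not taken from the released code, and the replace-with-a-cheaper-agent operation is omitted for brevity.

```python
from typing import Callable, Dict, List


def slim_workflow(agents: List[str],
                  importance: Dict[str, float],
                  evaluate: Callable[[List[str]], float],
                  tolerance: float) -> List[str]:
    """Greedily drop low-importance agents while accuracy stays near the baseline."""
    baseline = evaluate(agents)  # fixed anchor, measured once on the full workflow
    kept = list(agents)
    for agent in sorted(agents, key=importance.get):  # least important first
        trial = [a for a in kept if a != agent]
        if not trial:
            break
        # Accept the removal only if performance stays within `tolerance`
        # of the ORIGINAL baseline, so small losses cannot accumulate.
        if evaluate(trial) >= baseline - tolerance:
            kept = trial
    return kept


# Hypothetical workflow, importance scores, and validation accuracy.
agents = ["planner", "solver", "critic", "checker"]
importance = {"planner": 0.1, "solver": 0.9, "critic": 0.2, "checker": 0.6}


def evaluate(active: List[str]) -> float:
    return (0.4 + 0.3 * ("solver" in active)
            + 0.2 * ("checker" in active)
            + 0.01 * ("critic" in active))


print(slim_workflow(agents, importance, evaluate, tolerance=0.02))
```

Anchoring to a fixed baseline is the design choice that guards against collapse: if each step were compared only with the previous step, a long chain of individually tolerable losses could drift far below the original accuracy.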

2605.08810 2026-05-12 cs.LG cs.AI

Compressed Video Aggregator: Content-driven Module for Efficient Micro-Video Recommendation

Yang Xiao, Huiyuan Chen, Kaiyuan Deng, Chao Jiang, Zinan Ling, Ruimeng Ye, Xiaolong Ma, Bo Hui

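AI summary This paper proposes the Compressed Video Aggregator (CVA), a lightweight micro-video recommendation module that decouples video information from preference learning for more efficient recommendation. CVA uses frozen video feature embeddings and a latent reasoning mechanism without cross-attention projection to produce compact video embeddings. Experiments show large reductions in training time and GPU memory consumption; re-selecting key frames from titles with CLIP further improves recommendation performance, and the paper analyzes the impact of scenarios such as erroneous titles.

Comments 18 pages

Details
English abstract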

We propose Compressed Video Aggregator (CVA), a lightweight micro-video recommendation module that decouples video information from preference learning. It aggregates frozen VFM embeddings and uses latent reasoning without cross-attention projection, producing compact video embeddings for recommenders. Because the frames in the original benchmark are redundant and too coarsely sampled, we use CLIP to re-select key frames based on titles. Experiments on MicroLens and Short-Video show consistent gains with orders-of-magnitude reductions in training time and GPU memory, and re-selected frames further enhance the performance of all methods, including CVA. We also discuss the impact of several scenarios involving erroneous titles on our method. Code will be released soon.

2605.08809 2026-05-12 cs.CL cs.AI

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Yan Sun, Guoxia Wang, Jinle Zeng, JiaBin Yang, Shuai Li, Li Shen, Dacheng Tao, DianHai Yu, Haifeng Wang

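AI summary In pretraining large language models, the context dependence of token embeddings leads to high variance among same-class token embeddings and high similarity among different-class token embeddings, which hurts the efficiency of representation learning. This paper proposes SimReg, an embedding-similarity regularization loss that makes embeddings of tokens with the same label within a sequence more similar and uses a contrastive loss to push apart embeddings of tokens with different labels, improving classification performance. Experiments show that SimReg markedly accelerates training convergence across several architectures and improves zero-shot downstream performance.

Details
English abstract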

Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, thus hindering the efficiency of representation learning. While similarity-based regularization has demonstrated benefit in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more efficient classification. Extensive experiments across dense and Mixture-of-Experts (MoE) architectures demonstrate that SimReg consistently accelerates training convergence by over 30% and improves average zero-shot downstream performance by over 1% across standard benchmarks. Further ablation studies and analyses offer practical insights into hyperparameter tuning and loss effectiveness.
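
The abstract does not give the exact form of the SimReg loss. As a rough illustration of the idea only, the sketch below uses a supervised-contrastive-style objective as a stand-in: within one sequence, each token's hidden state is pulled toward other tokens that share its ground-truth next-token label and pushed away from the rest. The cosine similarity, the temperature value, and the function name `similarity_reg` are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_reg(embeddings: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   temperature: float = 0.1) -> float:
    """Supervised-contrastive-style loss over the tokens of one sequence."""
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # a label that occurs once has nothing to align with
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a same-label token.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


# Hypothetical hidden states for a 4-token sequence.
states = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = similarity_reg(states, labels=[7, 7, 3, 3])  # labels match the clusters
mixed = similarity_reg(states, labels=[7, 3, 7, 3])    # labels cut across clusters
print(aligned, mixed)
```

In a training loop such a term would presumably be added to the next-token cross-entropy with a weighting coefficient; the weight and how positives are sampled at scale are the kind of hyperparameters the ablations in the abstract refer to.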

2605.08808 2026-05-12 cs.CV cs.AI cs.LG

Curvature-Aware Captioning: Leveraging Geodesic Attention for 3D Scene Understanding

Ziyao He, Yingjie Liu, ZhangYangRui, Mingsong Chen, Xuan Tang, Xian Wei

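AI summary This paper proposes Curvature-Aware Captioning, a new framework for accurately describing sparse point-cloud data in 3D scene understanding. It introduces non-Euclidean geodesic attention: self-attention in Oblique space and bidirectional geodesic cross-attention in Lorentz space jointly model local geometric detail and global semantic hierarchy. Theoretical analysis indicates that the method mitigates the conflict between Euclidean and hyperbolic spaces, and experiments on the ScanRefer and Nr3D datasets show superior localization accuracy and descriptive richness.

Comments CVPR2026 Highlight!

Details
English abstract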

Accurate 3D scene description is fundamental to robotic navigation and augmented reality, yet current dense captioning methods face significant limitations in processing sparse point cloud data. Existing approaches that apply Euclidean embedding spaces struggle to simultaneously preserve fine-grained local geometric details and model exponentially growing global semantic hierarchies, leading to either inaccurate localization or disjointed, shallow scene descriptions. In this work, we propose a novel Curvature-Aware Captioning framework, integrating novel non-Euclidean geodesic attention mechanisms, to resolve the localization-contextualization conflict. Specifically, self-attention within Oblique space enforces dimensional homogeneity while establishing long-range dependencies. Bidirectional geodesic cross-attention within Lorentz space models hierarchical semantic relationships across scene instances, enabling simultaneous precision in object localization and coherence in scene descriptions. Theoretical analysis confirms that the curvature complementarity between the Oblique manifold and Lorentz hyperboloid resolves the Euclidean-hyperbolic conflict, ensuring feature stability via isotropic optimization while preserving inherent hierarchical relationships. Extensive experiments on ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, with significant gains in both localization accuracy and descriptive richness.

2605.08805 2026-05-12 cs.CV

LightAVSeg: Lightweight Audio-Visual Segmentation

Qing Zhong, Guodong Ding, Lingqiao Liu, Zaiwen Feng, Lin Yuanbo Wu, Angela Yao

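AI summary LightAVSeg is a lightweight audio-visual segmentation framework that targets the high computational complexity and difficult deployment of existing models. It replaces conventional dense cross-modal attention with a decoupled design, so that interaction cost grows linearly with spatial resolution, and adds an auxiliary alignment loss to improve semantic consistency. Experiments show that LightAVSeg, with only about 1/7 of the parameters of AVSegFormer, reaches 50.4 mIoU on the MS3 dataset and enables efficient mobile inference.

Comments 15 pages, 8 figures, 6 tables, Accepted to ICML 2026

Details
English abstract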

Audio-Visual Segmentation (AVS) targets pixel-level localization of sound-emitting objects in videos. However, existing models rely on dense cross-modal attention with quadratic computational cost, limiting their suitability for resource-efficient deployment. Most efficiency-oriented methods focus on backbone reduction and overlook the interaction module as the primary bottleneck. This paper proposes LightAVSeg, a lightweight framework that replaces heavy attention with a decoupled design for semantic filtering and spatial grounding, resulting in interaction costs that scale linearly with spatial resolution. Furthermore, we introduce an auxiliary alignment loss to enforce semantic consistency during training with zero inference overhead. Extensive experiments demonstrate that LightAVSeg achieves a new state-of-the-art among lightweight methods: with 20.5M parameters (~1/7 of AVSegFormer), it reaches 50.4 mIoU on the MS3 benchmark and enables efficient inference on a mobile processor.

2605.08801 2026-05-12 cs.LG

Data-driven transport modelling without overfit

Peter Vanya, Katarína Šimková, Rastislav Farkaš

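AI summary This paper presents a data-driven transport modelling method that avoids overfitting, for predicting traffic flows after public policy interventions such as a new road or a temporary road closure. The method is based on readily available traffic count data and uses interpretable model weights and a controlled path for increasing complexity, avoiding traditional models' heavy dependence on socio-economic data and their lack of interpretability. The study validates the method on several examples and discusses its extension to multimodal transport systems.

Comments 6 pages, 6 figures

Details
English abstract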

Macroscopic transport modelling aims to predict traffic flows after proposed public policy interventions, such as a new road or railway section or a temporary road closure. As such, it is a vital step in infrastructure planning and development. Traditionally, building a transport model has relied on a complex understanding of the socio-economic characteristics of the population, requiring expensive data collection via surveys, which are prone to biases. Previous numerical frameworks that optimize transport models to fit observed traffic flows are not easily interpretable and can lead to overfitting. We present here an alternative: a data-driven modelling protocol with an objective function based on traffic counts, which can nowadays be obtained cheaply and reliably; explainable model weights; and a controlled path to increase model complexity and accuracy. We demonstrate our approach on several toy and realistic examples, and suggest ways to generalize to multimodal systems including public transport.

2605.08800 2026-05-12 cs.CV cs.AI

PPU-Bench: Real World Benchmark for Personalized Partial Unlearning in Vision Language Models

Jiahui Guang, Zexun Zhan, Zhenlin Xu, Cuiyun Gao, Haiyan Wang, Jing Li, Zhaoquan Gu, Yanchun Zhang

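AI summary This paper proposes PPU-Bench, a real-world benchmark for personalized partial unlearning in vision-language models, addressing existing benchmarks' reliance on synthetic data or complete deletion. The benchmark contains 24,000 samples across three progressively harder settings and evaluates whether a model can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. The paper also proposes Boundary-Aware Optimization (BAO), which strengthens the model's control over intra-subject factual boundaries.

Details
English abstract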

Multimodal Large Language Models (MLLMs) may memorize sensitive cross-modal information during pretraining. However, existing MLLM unlearning benchmarks rely on synthetic knowledge injection or complete subject-level deletion, which fail to capture realistic, personalized deletion requests that require fine-grained factual control. In this paper, we introduce PPU-Bench, a real-world and fine-tuning-free benchmark for personalized partial unlearning in MLLMs. PPU-Bench contains 24K multimodal and unimodal samples derived from pre-existing knowledge of 500 public figures under three progressively challenging settings: Complete, Selective, and Personalized unlearning. The benchmark evaluates whether methods can remove target knowledge while preserving non-target facts, model utility, and cross-modal consistency. Extensive experiments show that Complete Unlearning often suppresses visual identity rather than factual knowledge, while Selective and Personalized Unlearning expose significant forget--retain trade-offs and challenges in intra-subject factual boundaries. Robustness analysis under cross-image and prompt-based attacks reveals distinct vulnerabilities across different unlearning settings. Motivated by these findings, we propose Boundary-Aware Optimization (BAO), which explicitly models intra-subject forget-retain boundaries. Experimental results on two representative methods demonstrate that BAO can effectively enforce intra-subject factual boundaries.

2605.08799 2026-05-12 cs.RO

ElasticFlow: One-Step Physics-Consistent Policy with Elastic Time Horizons for Language-Guided Manipulation

Kewei Chen, Yayu Long, Shuai Li, Mingsheng Shang

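AI summary ElasticFlow is a distillation-free, physics-consistent one-step policy framework for language-guided robotic manipulation. It reconstructs Mean Field Theory by directly modeling the average velocity field, giving a single-step mapping from noise to action, and introduces an Elastic Time Horizons mechanism that overcomes spectral bias and improves the alignment between semantic instructions and physical execution. Experiments show efficient 1-NFE inference (about 71 Hz) on multiple benchmarks and better results than existing state-of-the-art methods on long-horizon tasks, demonstrating potential for efficient, robust, and semantically aligned control.

Comments Accepted to Findings of ACL 2026

Details
English abstract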

Diffusion policies have demonstrated exceptional performance in embodied AI. However, their iterative denoising process results in high latency, and existing acceleration methods often sacrifice physical consistency. To address this, we propose ElasticFlow, a distillation-free, physics-consistent one-step policy framework. We reconstruct the Mean Field Theory by directly modeling the average velocity field, enabling a direct single-step mapping from noise to action. Addressing the Temporal Heterogeneity of robotic tasks, we introduce the Elastic Time Horizons mechanism. This mechanism effectively overcomes Spectral Bias by explicitly encoding control granularity, achieving efficient alignment between semantic instructions and physical execution horizons. Experiments on benchmarks such as LIBERO, CALVIN, and RoboTwin demonstrate that ElasticFlow achieves efficient 1-NFE inference (approximately 71Hz). Furthermore, it outperforms state-of-the-art methods, including OpenVLA and $π_0$, on long-horizon tasks, highlighting its potential for efficient, robust, and semantically aligned control.