arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.18059 2026-05-19 cs.RO

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Zhiyuan Zhang, Zhenghao Jin, Yanlun Peng, Xianda Guo, Haoran Liu, Shaofeng Zhang, Xingjun Ma, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-Gang Jiang

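AI summary: This paper presents Bench2Drive-Robust, described as the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. It evaluates how three major sources of deployment-related perturbation affect driving systems and reveals robustness challenges that conventional image-level corruption evaluations do not fully capture.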

Abstract

Robustness is a critical requirement for deploying autonomous driving systems in the real world. Existing robustness benchmarks for autonomous driving have made important progress in studying the effects of image-level corruptions, such as adverse weather or camera degradation, on perception modules and open-loop planning outputs. However, deployment can also involve system-level imperfections, such as inference latency and ego-state estimation errors, which remain less studied in closed-loop end-to-end autonomous driving (E2E-AD) evaluation. These imperfections can accumulate through the feedback loop and destabilize control. In this work, we present Bench2Drive-Robust, to our knowledge the first device-centric robustness benchmark for closed-loop E2E-AD under realistic deployment perturbations. We systematically evaluate deployment-oriented perturbations arising from three major sources: camera-stream failures (frame drop, partial observation), ego-state estimation errors (GPS noise, and speed or odometry errors), and compute-induced control delay (model inference delay). We evaluate representative end-to-end driving methods and analyze their robustness under different perturbation severities. Our results show that these deployment-related perturbations can substantially degrade closed-loop driving performance, revealing robustness challenges that are not fully captured by conventional image-level corruption evaluations. By establishing a closed-loop evaluation protocol and demonstrating the substantial impact of these deployment-oriented perturbations, Bench2Drive-Robust defines practical robustness problems for end-to-end autonomous driving and encourages further research on deployment-aware robust driving systems.
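
The three perturbation sources named in this abstract can be pictured with a small sketch. This is not the benchmark's code: the helper names, the choice to repeat the last frame on a drop, and the Gaussian noise model are all assumptions made for illustration.

```python
import random
from collections import deque

# Hypothetical helpers for illustration; none of these names come from the benchmark.

def drop_frames(frames, drop_prob, rng):
    """Camera-stream failure: a dropped frame is replaced by the last delivered one."""
    out, last = [], None
    for frame in frames:
        if last is not None and rng.random() < drop_prob:
            out.append(last)          # stale frame stands in for the dropped one
        else:
            out.append(frame)
            last = frame
    return out

def noisy_gps(position, sigma_m, rng):
    """Ego-state error: add zero-mean Gaussian noise (metres) to an (x, y) fix."""
    x, y = position
    return (x + rng.gauss(0.0, sigma_m), y + rng.gauss(0.0, sigma_m))

class DelayedController:
    """Control delay: the command applied now was computed `delay_steps` ticks ago."""
    def __init__(self, delay_steps, idle_command):
        self.queue = deque([idle_command] * delay_steps)

    def step(self, fresh_command):
        self.queue.append(fresh_command)
        return self.queue.popleft()
```

In a closed loop, the delayed command changes the next observation, which is why such errors can compound rather than stay local to one frame.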

2605.18058 2026-05-19 cs.CV

Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

Mohsine EL Khayati, Abdelillah Semma, Abdelaziz Courr, Rachid Elouahbi

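AI summary: This study examines the vulnerability of Arabic handwriting recognition systems to black-box adversarial attacks. Its experiments show that high-accuracy models are susceptible to such attacks, underscoring the need to strengthen model security and reliability.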

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

Abstract

Arabic handwriting recognition (AHR) has made significant progress with deep learning models. AHR research has largely focused on performance, with security receiving little attention. This study provides what appears to be a new line of inquiry by demonstrating the vulnerability of high-performing models to adversarial black-box attacks. The focus on black-box attacks reflects real-world scenarios where the attacker has no prior knowledge of the model architecture. Extensive experiments were conducted on two benchmark AHR datasets containing Arabic handwritten characters. Results demonstrated the effectiveness of the attacks, with the Pixle attack achieving an attack success rate of 99-100% on most models. Other, less aggressive attacks achieved success rates of 50-96% across most experiments. Despite the high attack success rates, the attacks maintain the structural integrity of the characters, rendering the perturbations almost imperceptible to the human eye. The findings indicate that the studied models are highly vulnerable to adversarial manipulation. This underscores the need to strengthen efforts to secure these models and ensure their reliability in real-world AHR applications.

2605.18055 2026-05-19 cs.LG cs.AI

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

Qi Si, Penglei Wang, Yushuai Wu, Yifeng Jiao, Xuyang Liu, Xin Guo, Yuan Qi, Yuan Cheng

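AI summary: This paper proposes FLAG, a framework that uses graph-based latent diffusion alignment to preserve gene coordination and spatial distribution relationships in spatial gene expression prediction. It introduces the notion of the Gene Dimension Curse and combines a spatial graph encoder with gene foundation model alignment to improve structural consistency and gene-gene fidelity.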

Comments 9 pages for main text, 3 pages for references, 19 pages for appendix. accepted by ICML 2026

Abstract

Predicting spatial gene expression from routine H&E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce FLAG, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical Gene Dimension Curse, where jointly modeling gene expression and its spatial interactions fails in high-dimensional spaces. FLAG solves this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at https://github.com/darkflash03/FLAG.

2605.18053 2026-05-19 cs.LG cs.CL cs.CR cs.PF

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

Gabriel Garcia

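AI summary: This paper studies KV cache eviction under a shared, globally capped decode-time harness. It finds that structural protection is the key factor in preserving quality: reserving cache at the boundaries recovers most of the reference-ceiling quality, and the protection mechanism is effective across different models.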

Comments 38 pages, 6 figures, 25 tables (includes one longtable). Code and figure regeneration scripts: https://github.com/gpgabriel25/KVCacheBoundaryProtection

Abstract

We study KV cache eviction under a shared globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1 $\leq$ 0.064). Reserving 10% of cache at each boundary recovers 69-90% of the $C{=}2{,}048$ reference-ceiling quality on seven LongBench models at $C{=}256$ (13% retention); a ten-model panel spans 68-98%. An attention-mass pilot (Qwen2.5-3B, $N{=}30$) suggests why: the position-0 sink holds ${\sim}75\%$ of prefix mass, while other boundary tokens sit near ${\sim}0.41{\times}$ uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K{=}32$ ($Δ{=}0.02$); at $K{=}8$, attention policies pairwise converge yet beat LRU by 0.011-0.021 F1 across $C{=}256$ and $C{=}512$. Faithful Ada-KV/QUEST add ${\sim}0.03$-$0.04$ F1 on Mistral-7B and Phi-3.5 beyond simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs. prefill, $C{\in}\{512,2048\}$) shows near-identical protection lifts (ratio 0.99-1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches full-cache ceiling on Gemma-3-4B at 6.3% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
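
A minimal sketch of boundary protection under a global cap may help make the idea concrete. It is not taken from the paper's repository: the function name `select_kept` is hypothetical, and reading "each boundary" as the first and last positions of the prompt is an assumption. Any scoring policy can supply `scores`; with recency as the score it behaves like LRU.

```python
def select_kept(scores, prompt_len, capacity, protect_frac=0.10):
    """Return the sorted cache positions to keep under a global cap.

    scores: one value per cached position, higher means more worth keeping.
    A fraction of the cap is reserved at each prompt boundary (assumed here to
    be the start and the end of the prompt); the rest is filled by score.
    """
    n = len(scores)
    if n <= capacity:
        return list(range(n))
    k = round(capacity * protect_frac)                  # slots reserved per boundary
    head = set(range(min(k, prompt_len)))
    tail = set(range(max(prompt_len - k, 0), prompt_len))
    protected = head | tail
    budget = max(capacity - len(protected), 0)
    candidates = [i for i in range(n) if i not in protected]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return sorted(protected | set(candidates[:budget]))
```

Setting `protect_frac=0.0` gives the unprotected baseline, in which a recency-style score evicts every prompt-boundary position once decoding runs long enough.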

2605.18052 2026-05-19 cs.CV

Efficient 3D Content Reconstruction and Generation

Jiahao Li

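AI summary: This thesis presents efficient methods for 3D content generation and reconstruction. It combines multi-view diffusion with sparse-view 3D reconstruction to generate high-quality 3D assets, and develops FastMap to speed up 3D reconstruction while maintaining comparable pose accuracy.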

Abstract

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single-view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to a 10x speedup over the prior state of the art by making extensive use of first-order optimization with fused GPU kernels, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

2605.18048 2026-05-19 cs.AI

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang, Haifeng Wang

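AI summary: This paper proposes the DocOS benchmark to address the limited ability of GUI agents to handle long-tailed tasks in dynamic, open-web environments. The core approach is proactive document-guided action, in which agents consult guiding documentation to solve long-tailed tasks; the main contribution is a benchmark that evaluates document-guided problem solving.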

Abstract

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

2605.18045 2026-05-19 cs.RO cs.AI

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle

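AI summary: This paper studies the role of uncertainty in robot autonomy decisions. It finds that once the base model is competent, simple uncertainty proxies suffice for selective gating but not for semantic novelty detection.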

Comments ICRA 2026 workshop paper

Abstract

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine-grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.
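
Two of the quantities mentioned in this abstract, act/defer agreement and Spearman rank correlation, are simple enough to sketch. The code below is an illustration under assumptions, not the authors' evaluation code: the function names are hypothetical, and the gate is assumed to act when uncertainty is at or below a threshold.

```python
def gate(uncertainties, threshold):
    """Threshold gate: True means act autonomously, False means defer."""
    return [u <= threshold for u in uncertainties]

def agreement(unc_a, thr_a, unc_b, thr_b):
    """Fraction of samples on which two uncertainty estimators make the same act/defer call."""
    a, b = gate(unc_a, thr_a), gate(unc_b, thr_b)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

Passing per-sample uncertainty as `x` and a 0/1 error indicator as `y` to `spearman` measures how well uncertainty ranks likely errors, which is the property the gate depends on.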

2605.18041 2026-05-19 cs.CV

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang, Juntao Li, Yihang Lou, Siwei Feng, Peifeng Li

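AI summary: This paper proposes OmniSelect, a training-free, modality-adaptive token pruning framework that improves the efficiency of omni-modal large language models by dynamically selecting compression strategies. A lightweight AudioCLIP model estimates cross-modal relevance, and the relevance scores drive fine-grained token pruning within each temporal group, giving efficient multimodal token compression with no additional training cost.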

Abstract

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.

2605.18039 2026-05-19 cs.CV

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

Soyeon Yoon, Chang Wook Seo, Hyunjung Shim

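AI summary: This paper proposes SGSoft, which learns fused semantic-geometric features from template-guided soft signals for 3D shape correspondence. It addresses structural variability, non-isometric deformation, and inconsistent topology, and reports state-of-the-art inter-category generalization with the best accuracy-efficiency trade-off.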

Abstract

Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors and supervised by this geodesic correspondence field, and (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off relative to prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
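
Step (iii), retrieval by nearest-neighbor search in descriptor space, can be sketched in a few lines. This is a brute-force illustration with a hypothetical function name and squared Euclidean distance assumed as the metric; the paper's descriptors and search structure are not specified in the abstract.

```python
def match_by_descriptor(source_desc, target_desc):
    """For each source vertex, return the index of the target vertex whose
    descriptor is closest in squared Euclidean distance (brute force)."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [
        min(range(len(target_desc)), key=lambda j: sqdist(d, target_desc[j]))
        for d in source_desc
    ]
```

Because each source vertex is matched independently, no pairwise optimization between the two shapes is needed at inference time, which is consistent with the single feed-forward pass described above.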

2605.18038 2026-05-19 cs.CV

Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

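AI summary: This paper proposes a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. It predicts the lateral line to extract texture-anchored patches and patch slices, and builds a cross-camera test set from a multi-camera setup. Experiments show that the method outperforms the full-image baseline in both same-trajectory validation and cross-camera testing, indicating better generalization and robustness.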

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP)

Abstract

Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon's lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: https://github.com/espenbh/salmon-reid-patch-ensemble.
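
The abstract does not say how patch-level predictions are fused, so the following is only one plausible reading: average the per-patch score vectors and take the highest-scoring identity. The function name and the mean-fusion rule are assumptions made for illustration.

```python
def fuse_patches(patch_scores):
    """Average per-patch identity scores and pick the best identity.

    patch_scores: list of score vectors, one per patch, each with one entry
    per candidate identity. Mean fusion is an assumption, not the paper's rule.
    """
    n_ids = len(patch_scores[0])
    fused = [sum(p[i] for p in patch_scores) / len(patch_scores) for i in range(n_ids)]
    best = max(range(n_ids), key=lambda i: fused[i])
    return best, fused
```

With this rule a single misleading patch is outvoted by the others, which is the usual motivation for an ensemble over local regions.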

2605.18035 2026-05-19 cs.AI cs.LG

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

Xinzhe Yuan, William de Vazelhes, Bin Gu, Huan Xiong

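AI summary: This paper proposes a generalized variance-reduced zeroth-order hard-thresholding algorithm. By accounting for the role of variance, it mitigates the conflict between zeroth-order gradients and the hard-thresholding operator, which removes the restriction on the number of random directions and improves convergence rates and applicability.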

Comments Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong

Journal ref International Conference on Learning Representations (ICLR), 2024
Abstract

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$ constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, in which case it can normally be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance-reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate that the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
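
The two ingredients named here, a zeroth-order gradient estimate and the hard-thresholding operator, can be sketched as follows. This is a generic, plain (not variance-reduced) step written for illustration; it is neither SZOHT nor the paper's algorithm, and the function names, the forward-difference form, and the unit-sphere sampling are assumptions.

```python
import math
import random

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def zo_gradient(f, x, q, mu, rng):
    """Forward-difference gradient estimate from q random unit directions."""
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]                      # uniform on the unit sphere
        slope = (f([xi + mu * ui for xi, ui in zip(x, u)]) - fx) / mu
        for i in range(d):
            grad[i] += (d / q) * slope * u[i]
    return grad

def zo_ht_step(f, x, k, eta, q, mu, rng):
    """One plain (not variance-reduced) step: x <- H_k(x - eta * g)."""
    g = zo_gradient(f, x, q, mu, rng)
    return hard_threshold([xi - eta * gi for xi, gi in zip(x, g)], k)
```

The estimate only needs function values, and its noise shrinks as `q` grows; the abstract's point is that the interaction between this estimation error and the thresholding step is what constrains `q` in SZOHT.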

2605.18032 2026-05-19 cs.CL cs.AI cs.HC cs.SE

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

Kazuki Kawamura, Satoshi Waki, Kei Tateno

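AI summary: This paper presents PROTEA, an interface for offline evaluation and iterative refinement of multi-agent LLM workflows. With configurable rubrics and per-node states visualized on the workflow graph, it helps developers localize bottlenecks and improve workflow performance.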

Comments 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations

Abstract

Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.

2605.18029 2026-05-19 cs.CV

What Matters for Grocery Product Retrieval with Open Source Vision Language Models

Emmanuel G. Maminta, Rowel O. Atienza

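AI summary: This paper studies how open-source vision-language models perform on grocery product retrieval. It finds that data quality matters more than scale, that efficient models can win, and that a gap between Recall@5 and Recall@1 persists.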

Comments Accepted in the 28th International Conference on Pattern Recognition (ICPR 2026)

Abstract

Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density ($ϕ$), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
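
The gap between Recall@5 and Recall@1 is easier to read with the metric written out. The sketch below assumes one correct SKU per query and uses a hypothetical function name; it is not the repository's evaluation script.

```python
def recall_at_k(ranked_ids, true_ids, k):
    """Fraction of queries whose correct item appears in the top-k retrieved list.

    ranked_ids: one ranked list of item ids per query, best match first.
    true_ids:   the single correct item id for each query (an assumption here).
    """
    hits = sum(true in ranked[:k] for ranked, true in zip(ranked_ids, true_ids))
    return hits / len(true_ids)
```

A model can place the right SKU among its top five candidates for almost every query and still rank a look-alike first, which shows up as high Recall@5 with a noticeably lower Recall@1.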

2605.18028 2026-05-19 cs.LG cs.AI

FedSDR: Federated Self-Distillation with Rectification

Ziheng Ren, Zhanming Shen, Hao Wang, Ning Liu, You Song

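AI summary: This paper proposes FedSDR, an improved federated self-distillation method that adds a dual-stream mechanism to address data distribution mismatch and hallucination in federated learning, improving model accuracy and consistency.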

Comments Accepted by ICML 2026

Abstract

Federated fine-tuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed "model-understanding space," FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox: unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.

2605.18026 2026-05-19 cs.RO

Scenario Generation in Roundabouts with Adjustable Interaction Intensity

Li Li, Till Temmen, Tobias Brinkmann, Björn Krautwig, Markus Eisenbarth, Jakob Andert

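AI summary: This paper proposes a roundabout scenario generator with adjustable interaction intensity. It decouples geometric routes from temporal progress profiles, maps them to latent codes with pretrained autoencoders, and generates scenarios with a Wasserstein generative adversarial network. This improves timing-latent fidelity and the plausibility of interaction responses, and makes safety testing more controllable and scalable.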

Abstract

Roundabouts, characterized by frequent merging and yielding interactions, remain a safety-critical corner case for the development and testing of intelligent driving functions. However, extracting sufficient near-critical scenarios from naturalistic data is inefficient. Most existing scenario generation methods provide limited controllability over interaction intensity and criticality, making systematic safety testing and detailed analysis difficult. This paper presents an interaction-aware roundabout scenario generator with continuously adjustable interaction intensity. Geometric routes and temporal progress profiles are first decoupled and mapped to latent codes using pretrained autoencoders. Conditional latent generation is then performed with Wasserstein Generative Adversarial Networks (WGAN) to generate scenarios. Yielding is modeled as a controllable timing intervention via a compact yield code during the approach-to-entry segment, where interaction intensity is modulated by scaling the code with a factor $λ$. Results demonstrate enhanced timing-latent fidelity and plausible interaction responses compared to a baseline model. Under criticality-calibrated scaling, increasing $λ$ expands the safety margin, providing a scalable and controlled testing mechanism.

2605.18025 2026-05-19 cs.AI

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

TeleCom-Bench: 大型语言模型在工业电信应用中还有多远?

Jieting Xiao, Yun Lin, Huizhen Qiu, Rui Ma, Chen Zhong, Dongyang Xu, Xiao Long, Chaoyu Zhang, Qiaobo Hao, Ding Zou, Zhiguo Yang, Yanqin Gao, Fang Tan

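AI summary This paper presents TeleCom-Bench, a comprehensive benchmark of 12 evaluation sets and 22,678 curated samples for evaluating the overall capabilities of large language models in the telecommunications domain, revealing a gap in their ability to execute industrial procedures.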

Comments Accepted by KDD 2026

详情
AI中文摘要

尽管大型语言模型(LLM)在各种垂直场景中实现了显著整合,但其在电信领域的部署仍处于探索阶段,由于缺乏标准化的评估框架。当前的电信基准主要关注静态基础知识和孤立的原子技能,忽略了设备特定的文档和端到端的工业工作流,这些对于实际生产系统至关重要。为此,我们提出了TeleCom-Bench,一个包含12个评估集和22,678个精选样本的全面基准,评估LLM在协同层次上的能力:(1)多维知识理解,整合电信基础、3GPP协议、5G网络架构和专有产品知识,通过知识图谱驱动的合成整合有线、核心和无线网络的知识;(2)端到端知识应用,正式化六个核心任务在真实网络代理工作流中的真实轨迹,包括意图识别、实体提取、事件验证、工具调用、根本原因分析和解决方案生成,涵盖网络优化和故障维护场景。对八种最先进的LLM的评估揭示了一个普遍的执行墙:虽然模型在意图识别和实体提取等语言接口任务中达到90%的准确率,但在解决方案生成等过程执行任务中的性能降至约30%。这种能力差距表明,当前LLM在诊断方面表现良好,但在现场工程师方面却失败。TeleCom-Bench提供标准化的诊断,精确指出这一缺陷,为特定领域的对齐提供可操作的指导,以实现生产就绪的电信代理。数据集和评估代码已发布在https://github.com/ZTE-AICloud/TeleCom-Bench。

英文摘要

While Large Language Models (LLMs) have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2) End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows (intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation) across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at https://github.com/ZTE-AICloud/TeleCom-Bench.

2605.18022 2026-05-19 cs.LG cs.AI stat.ML

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

揭示记忆与泛化共存:在带有标签噪声的算术任务中的案例研究

Linyu Liu, Pinyan Lu

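AI summary This paper studies how highly over-parameterized models can memorize noisy labels and generalize at the same time. Experiments on modular arithmetic tasks show that, with appropriate optimization and model configurations, larger models generalize better while noisy labels are memorized faster than clean data. Over-parameterized models form an internal generalization structure whose expression in the output is suppressed by the need to fit noisy labels. Extracting this structure with frequency-based methods yields high accuracy. A proposed task-agnostic method partitions the network into generalization and memorization components; the resulting subnetwork improves generalization but remains limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and that new tools are needed to retrieve generalizable knowledge from over-parameterized networks.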

Comments 27 pages, 32 figures

详情
AI中文摘要

高度过参数化的模型可以同时记忆噪声标签并良好泛化,但如何这些行为共存仍不明确。本文通过模运算任务在重噪声标签下研究其内在机制。通过在两层神经网络上的广泛实验发现,适当优化和模型配置下大模型泛化能力更强,而噪声标签被更快记忆。过参数化模型内部形成泛化结构,但其在输出中的表达被拟合噪声标签的需求所抑制。值得注意的是,即使在80%的标签噪声下,通过频率方法提取内部结构也可实现接近完美的测试准确率。我们进一步提出一种任务无关的方法将网络分为泛化和记忆组件。尽管该子网络提升泛化能力,但相比频率提取方法仍有局限,表明泛化结构分布于神经元中,需要新工具来检索过参数化网络中的可泛化知识。

英文摘要

Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
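
The setup described above (a modular arithmetic task with a large fraction of corrupted labels) can be sketched as follows. This is a minimal illustration, not the authors' code: the choice of addition as the operation, the modulus `p`, and the noise model (each corrupted example gets a uniformly random wrong label) are assumptions, and `make_noisy_modular_dataset` is a hypothetical helper name.

```python
import random


def make_noisy_modular_dataset(p=7, noise_rate=0.8, seed=0):
    """Build all pairs (a, b) with clean label (a + b) mod p, then corrupt
    a fraction `noise_rate` of the labels (assumed noise model)."""
    rng = random.Random(seed)
    inputs = [(a, b) for a in range(p) for b in range(p)]
    clean = [(a + b) % p for a, b in inputs]

    # Pick exactly round(noise_rate * N) examples to corrupt.
    n_noisy = round(noise_rate * len(inputs))
    noisy_idx = set(rng.sample(range(len(inputs)), n_noisy))

    noisy = []
    for i, y in enumerate(clean):
        if i in noisy_idx:
            # Draw from the p - 1 labels that differ from the clean label y.
            wrong = rng.randrange(p - 1)
            noisy.append(wrong if wrong < y else wrong + 1)
        else:
            noisy.append(y)
    return inputs, clean, noisy


inputs, clean, noisy = make_noisy_modular_dataset(p=7, noise_rate=0.8)
corrupted = sum(c != n for c, n in zip(clean, noisy))
print(f"{corrupted} of {len(inputs)} labels corrupted")
```

A model would be trained on `noisy` and evaluated against `clean`, which is what makes the coexistence of memorization and generalization measurable.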

2605.18020 2026-05-19 cs.LG

Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation

通过效用约束随机聚合改进理性参与的联邦学习

M Yashwanth, Arunabh Singh, Ashok Nayak, Sai Kiran Bulusu, Anirban Chakraborty

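AI summary This paper proposes FedUCA, a framework that formalizes the server's role as an optimizer that maximizes global model performance by sustaining client participation, improving both client retention and global model performance.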

Comments Federated Learning, Rational Clients, Endogenous Participation, and Aggregation

详情
AI中文摘要

联邦学习(FL)算法隐含假设客户端在服务器请求下被动地分享本地模型更新以配合服务器端的协调。然而,这忽略了现实世界跨机构环境中一个重要的方面:客户端通常是理性的代理,可能会优先考虑本地模型性能等效用而非全局模型的性能。在统计异质性显著的设置中,理性客户端可能会退出联邦如果感知到的合作利益未能满足其本地效用阈值。此类退出会降低全局模型性能并可能导致联邦训练过程的崩溃。在本文中,我们引入FedUCA(通过效用约束随机聚合改进理性参与的联邦学习),一个框架,形式化了服务器作为优化器的角色,旨在通过维持客户端参与来最大化全局模型性能。我们通过在标准数据集上的广泛实验验证了我们的框架,证明通过优先考虑参与可行性,FedUCA实现了显著更高的客户端保留率,从而实现了更优的全局模型性能。

英文摘要

Federated Learning (FL) algorithms implicitly assume that clients passively comply with server-side orchestration by sharing local model updates upon server request. However, this overlooks an important aspect of real-world cross-silo environments: clients are often rational agents who may prioritize their own utilities, such as local model performance, over the performance of the global model. In settings with significant statistical heterogeneity, rational clients may opt out of the federation if the perceived benefits of collaboration fail to meet their local utility thresholds. Such attrition degrades global model performance and can lead to the collapse of the federated training process. In this work, we introduce FedUCA (Federated Learning by Utility-Constrained Stochastic Aggregation for Improving Rational Participation), a framework that formalizes the server's role as an optimizer seeking to maximize global model performance by sustaining client participation. We substantiate our framework through extensive experiments on standard datasets, demonstrating that by prioritizing participation feasibility, FedUCA achieves significantly higher client retention and, consequently, superior global model performance.

2605.18018 2026-05-19 cs.CV cs.AI cs.HC

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

See What I Mean: 对齐视觉与语言表示以实现视频细粒度物体理解

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

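AI summary This paper proposes SWIM, a method that aligns vision and language representations to enable fine-grained object understanding from textual prompts alone, removing the need for the explicit visual prompts required by conventional methods. It improves text-visual alignment by constructing the NL-Refer dataset and supervising multi-layer cross-attention maps.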

详情
Journal ref
CVPR 2026
AI中文摘要

我们提出了SWIM(See What I Mean),一种新颖的训练策略,通过对齐视觉和语言表示,仅从文本提示中实现细粒度物体理解。与需要显式视觉提示(如掩码或点)的传统方法不同,SWIM仅在训练期间利用掩码监督来指导跨模态注意力,使模型在推理时能够自动关注用户指定的物体。我们对预训练多模态大语言模型(MLLMs)的交叉注意力分析揭示了一种系统性差异:属性词在视觉模态中产生尖锐、局部化的激活,而物体名词由于语义参考偏差和分布式高层表示产生扩散和分散的模式。为了解决这种不对齐问题,我们构建了NL-Refer数据集,其中每个物体掩码都配以精确的自然语言指引用。SWIM从物体名词中提取多层交叉注意力图,并强制与真实掩码保持空间一致性。实验结果表明,SWIM显著提高了文本-视觉对齐性能,并在细粒度物体理解基准上优于基于视觉提示的方法。代码和数据可在https://github.com/HumanMLLM/SWIM获取。

英文摘要

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large language models (MLLMs) reveals a systematic discrepancy: attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at https://github.com/HumanMLLM/SWIM.

2605.18015 2026-05-19 cs.LG cs.DB cs.SE

LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

LogRouter: 一种自适应的两级LLM路由用于大数据系统中的日志问题解答

Mert Coskuner, Merve Zeybel, Melik Mert Dolan

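AI summary This paper proposes LogRouter, an adaptive two-level LLM routing system for log question answering in big data systems. It combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector to process log queries efficiently.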

详情
AI中文摘要

在自托管、资源受限的环境中,生产日志分析需要自然语言访问大规模日志流,而无需将每个查询路由通过大型语言模型的费用。我们提出了LogRouter,一个部署在TUBITAK BILGEM国家大数据平台上的端到端日志问题解答系统,结合了基于PySpark的Drain3数据摄入管道、GPU加速的嵌入以及Apache Druid和PostgreSQL with pgvector的双索引存储。一个两级成本感知路由器将每个查询沿着四个执行路径之一进行路由:直接响应、Druid关键词搜索、使用SQL生成的模板查找和pgvector语义检索,同时二级路由器选择14B或32B类生成器用于语义路径。一个专用的编码器LLM处理文本到SQL生成。我们在四个LogHub数据集(Linux、Apache、Windows和Mac;共70个问题)上评估了该系统,分别在在线完整管道配置和隔离生成器的离线配置下进行测试。路由器在各数据集上的平均准确率为88.4%,在Linux上为94.7%。完整管道的平均ROUGE-1为0.373,BERTScore为0.879,RAGAS Faithfulness为0.779,端到端延迟为18.6秒。在公平的离线比较中,路由系统将平均延迟减少了55%(与Fixed-32B基线46.3秒 vs. 102.1秒相比),同时保持答案正确性在5.8分以内,并在所有数据集上超过Fixed-14B基线的RAGAS Faithfulness。因此,成本感知的路由是生产日志QA的实用机制:路由恢复了始终使用32B配置的大部分质量,延迟不到一半,且L1关键词词汇表使路由决策具有高精度,而无需使用学习分类器。

英文摘要

Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of routing every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM's national big data platform that combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router dispatches each query along one of four execution paths: direct response, Druid keyword search, template lookup with SQL generation, and pgvector semantic retrieval, while a Level-2 router selects either a 14B-class or 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, BERTScore of 0.879, RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In an apples-to-apples offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while preserving Answer Correctness within 5.8 points and exceeding a Fixed-14B baseline on RAGAS Faithfulness across every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the L1 keyword vocabulary makes that routing decision with high precision without a learned classifier.
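
The two-level dispatch described above can be sketched as a keyword-based Level-1 router plus a Level-2 generator choice. This is only an illustration under stated assumptions: the keyword sets, the path labels, and the query-length rule used at Level 2 are hypothetical, since the abstract does not specify the vocabulary or the Level-2 criterion.

```python
import re

# Hypothetical keyword vocabularies; the real L1 vocabulary is not given in the abstract.
DIRECT_WORDS = {"hi", "hello", "thanks", "help"}
SQL_WORDS = {"count", "many", "average", "top", "per"}
KEYWORD_WORDS = {"find", "search", "containing", "grep"}


def route_level1(query):
    """Pick one of the four execution paths from keyword matches alone."""
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    if tokens and tokens <= DIRECT_WORDS:
        return "direct_response"
    if tokens & SQL_WORDS:
        return "template_sql"
    if tokens & KEYWORD_WORDS:
        return "druid_keyword_search"
    return "semantic_retrieval"


def route_level2(query, long_query_tokens=20):
    """Choose a generator size for the semantic path (assumed rule: query length)."""
    return "32B" if len(query.split()) >= long_query_tokens else "14B"


def route(query):
    """Return (path, generator); the generator is only chosen on the semantic path."""
    path = route_level1(query)
    generator = route_level2(query) if path == "semantic_retrieval" else None
    return path, generator


for q in ["hello", "How many errors occurred per hour?",
          "find lines containing sshd", "why did the kernel panic after the update"]:
    print(q, "->", route(q))
```

Because Level 1 uses no learned classifier, its cost is negligible compared with a model call, which is the property the abstract highlights.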

2605.18013 2026-05-19 cs.CV cs.AI

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

TinySAM 2: 极端内存压缩用于高效的跟踪任何模型

Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen

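AI summary This paper proposes TinySAM 2, a lightweight video segmentation model. A memory quality management mechanism and joint spatial-temporal token compression reduce memory storage and computational cost, and the model reaches 90% of the performance of SAM 2.1 on challenging datasets such as DAVIS and SA-V while using only 7% of the memory tokens and 3% of the training data.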

Comments 12 pages, 6 figures

详情
AI中文摘要

Segment Anything Model 2 (SAM 2) 作为视频分割领域的核心基础模型,在半监督视频对象分割和跟踪任何任务中表现出色。然而,SAM 2的多阶段图像编码器和内存模块复杂的计算特性提高了模型在实际应用中的部署难度。为了解决这个问题,我们提出了TinySAM 2,一种在性能和效率之间取得平衡的轻量级视频分割模型。首先,引入了一个内存质量管理机制,用于选择并保留高信息量的历史帧作为内存。此外,提出了一种联合空间-时间令牌压缩方法,通过空间域上的平均池化压缩冗余令牌,在时间域上基于令牌级相似性测量选择信息令牌。此外,采用RepViT作为轻量级图像编码器,进一步减少模型参数。在DAVIS和SA-V等挑战性数据集上的大量实验表明,TinySAM 2在性能上达到了SAM 2.1的90%,仅使用7%的内存令牌和3%的训练数据。本研究有效缓解了SAM 2在参数数量、计算负载和部署成本方面的瓶颈,为视频分割模型在设备上的广泛应用提供了资源高效的解决方案。

英文摘要

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the computational complexity of SAM 2's multi-stage image encoder and memory module raises the barrier to deploying the model in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain highly informative historical frames as the memory. In addition, a joint spatial-temporal token compression method is proposed that reduces memory storage and computational cost. Specifically, average pooling is first employed to compress redundant tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on a token-level similarity measurement. Besides, we take RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1 with only 7% of the memory tokens and 3% of the training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.
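
The two compression steps named above (spatial average pooling, then similarity-based token selection across frames) can be sketched in plain Python. This is a toy illustration, not the TinySAM 2 implementation: the greedy "drop a token if it is too similar to one already kept" rule, the cosine measure, and the `max_sim` threshold are assumptions, as the abstract does not give the selection criterion.

```python
import math


def avg_pool_tokens(grid, k=2):
    """Spatial step: average each k x k block of an H x W grid of feature vectors."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h - h % k, k):
        row = []
        for j in range(0, w - w % k, k):
            block = [grid[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append([sum(v[c] for v in block) / len(block) for c in range(d)])
        pooled.append(row)
    return pooled


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_informative(tokens, max_sim=0.95):
    """Temporal step (assumed rule): keep a token only if it is not nearly
    identical to a token that has already been kept."""
    kept = []
    for t in tokens:
        if all(cosine(t, k) < max_sim for k in kept):
            kept.append(t)
    return kept


frame = [[[1.0, 0.0], [3.0, 0.0]],
         [[1.0, 2.0], [3.0, 2.0]]]
print(avg_pool_tokens(frame))                        # one pooled token per 2 x 2 block
print(select_informative([[1, 0], [1, 0], [0, 1]]))  # the duplicate token is dropped
```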

2605.18012 2026-05-19 cs.CV cs.AI cs.LG

SAS: Semantic-aware Sampling for Generative Dataset Distillation

SAS: 语义感知的生成数据集蒸馏

Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

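AI summary This paper proposes a semantic-aware dataset distillation method that uses CLIP as a semantic prior and designs three semantic scoring functions to quantify class relevance, inter-class separability, and intra-set diversity, producing distilled datasets that are compact and semantically discriminative.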

Comments Published as a journal paper in IEEE OJSP

详情
AI中文摘要

深度神经网络在广泛的任务中取得了显著的性能,但这种成功往往伴随着由于大规模训练数据带来的巨大计算和存储成本。数据集蒸馏通过构建紧凑且信息丰富的数据集,以实现高效的模型训练同时保持下游性能。然而,大多数现有方法主要强调匹配数据分布或下游训练统计,对蒸馏数据中高阶语义信息的保留有限。在本文中,我们引入了语义感知的视角进行数据集蒸馏,通过利用对比语言-图像预训练(CLIP)作为语义先验进行后采样。我们的目标是获得不仅紧凑而且语义上类别区分度高且多样化的蒸馏数据集。为此,我们设计了三个语义评分函数,以量化预训练语义空间中的类别相关性、类别间分离性和集合内多样性。基于现有蒸馏方法生成的图像池,我们进一步开发了一种两阶段策略进行有效的采样:第一阶段过滤语义区分度高的样本以形成可靠的候选集,第二阶段进行动态多样性感知选择以减少冗余并保持语义覆盖。在多个数据集、图像池和下游模型上的广泛实验显示了一致的性能提升,突显了在数据集蒸馏中整合语义信息的有效性。

英文摘要

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.

2605.18010 2026-05-19 cs.CV cs.GR

Functionalization via Structure Completion and Motion Rectification

通过结构补全和运动校正实现功能化

Mingrui Zhao, Sai Raj Kishore Perla, Kai Wang, Sauradip Nag, Duc Anh Nguyen, Jiayi Peng, Ruiqi Wang, Angel X. Chang, Manolis Savva, Ali Mahdavi-Amiri, Hao Zhang

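AI summary This paper introduces object functionalization, a new task that transforms visually plausible but non-functional 3D models into functional, physically operable ones. Functionalization is formulated as graph completion over a new functional graph representation, and a neural Graph Functionalizer (GraFu) completes the incomplete graph, which then drives the generation of 3D geometry and rectifies erroneous human-annotated and predicted motions.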

详情
AI中文摘要

获取和创建3D资产长期以来主要基于视角或外观驱动。因此,现有的数字3D模型往往缺乏必要的结构组件,以实现其预期功能,例如关节、支撑结构、内部结构或交互元素。同时,即使人工标注的运动也经常存在误差,导致物理上不合理的行为。我们引入了对象功能化,这是一种新的任务,旨在将视觉上合理但不功能的3D模型转换为功能性和物理上可操作的模型。我们将功能化建模为一个新的功能图上的图补全问题,其中标记的节点代表对象部分,标记的边编码功能和接触关系,而可移动的节点携带运动属性,使得结构功能缺陷表现为缺失的节点或错误的边。我们开发了神经图功能化器(GraFu)来补全表示非功能3D对象的不完整图。补全后的图随后驱动一个几何实现阶段,将预测的连接器和结构元素实例化为3D,具有令人印象深刻的效果,即校正错误的人工标注和预测运动。为了支持训练和评估,专注于家具作为丰富且具有挑战性的目标类别,我们引入了FurFun-233,一个包含233对非功能化和功能化家具模型的数据集。在PartNet-Mobility(

英文摘要

Acquisition and creation of 3D assets have been largely view- or appearance-driven. As a result, existing digital 3D models often lack the requisite structural components to function as intended, such as joints, supports, interiors, or interaction elements. At the same time, even human-annotated motions are frequently error-prone, leading to physically implausible behavior. We introduce object functionalization, a novel task aimed at transforming visually plausible but non-functional 3D models into functional and physically operable ones. We formulate functionalization as a graph completion problem over a new functional graph representation, where labeled nodes represent object parts, labeled edges encode functional and contact relations, and movable nodes carry motion attributes, so that structural functional deficiencies manifest as missing nodes or incorrect edges. We develop a neural Graph Functionalizer (GraFu) to complete an incomplete graph representing a non-functional 3D object. The completed graph then drives a geometry realization stage that instantiates predicted connectors and structural elements in 3D, with the compelling side effect of rectifying erroneous human-annotated and predicted motions. To support training and evaluation, focusing on furniture as a rich and challenging target category, we introduce FurFun-233, a dataset of 233 paired non-functional and functionalized furniture models. On PartNet-Mobility ("zero-shot") and HSSD test sets, our method matches state-of-the-art methods in motion prediction accuracy while substantially improving functionality in terms of collision and connectivity.

2605.18008 2026-05-19 cs.LG stat.ML

Uncertainty Reliability Under Domain Shift: An Investigation for Data-Driven Blood Pressure Estimation in Photoplethysmography

域移情况下不确定性可靠性研究:面向光体积脉搏波测记中数据驱动血压估计的探讨

Mohammad Moulaeifard, Ciaran Bench, Philip J. Aston, Nils Strodthoff

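AI summary This paper studies the reliability of uncertainty estimates under domain shift for deep learning-based blood pressure estimation from photoplethysmography signals, compares deep ensembles with Monte Carlo dropout, and examines the importance of uncertainty recalibration.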

Comments 23 pages, 2 figures

详情
AI中文摘要

不确定性量化(UQ)对于安全关键领域如医疗至关重要,但很少在现实的分布外(OOD)条件下进行评估。本文评估了基于深度学习的血压(BP)估计在光体积脉搏波测记(PPG)信号中的预测性能和不确定性可靠性,分别在分布内(ID)和分布外(OOD)设置下进行。使用在PulseDB上训练的XResNet1D-50模型在四个外部数据集上进行测试,比较了深度集成(DE)和蒙特卡洛滴答(MCD)方法,并使用高斯负对数似然(GNLL)和均方误差(MSE)损失函数,可选地通过符合预测(CP)、温度缩放(TS)和等比回归(IR)进行后处理校准。我们的关键发现如下:(1)在域移情况下,DE比MCD提供更强的预测鲁棒性,这种优势主要在外部域移情况下显现。(2)经过校准的GNLL方法在不确定性校准方面表现最佳(例如,GNLL+DE+CP用于收缩压(SBP),GNLL+DE+TS用于舒张压(DBP)),而基于MSE的不确定性需要校准才能实用。(3)在各种设置中,CP和TS提供了最一致的增益,IR在某些情况下仍然具有竞争力。总体而言,我们的结果表明,基于DE的方法在域移下的预测性能最为稳健,GNLL在原生UQ中最强,而校准对于使MSE基于的不确定性实用化至关重要。这些发现突显了在外部数据上联合评估预测准确性和校准的重要性,以实现无袖带血压估计的可信度。

英文摘要

Uncertainty quantification (UQ) is critical for safety-critical domains like healthcare, yet it is rarely evaluated under realistic out-of-distribution (OOD) conditions. Here, we assessed predictive performance and uncertainty reliability for deep learning-based blood pressure (BP) estimation from photoplethysmography (PPG) signals under both in-distribution (ID) and OOD settings. Using an XResNet1D-50 trained on PulseDB and tested on four external datasets, we compared deep ensembles (DE) and Monte Carlo dropout (MCD) with Gaussian negative log-likelihood (GNLL) and mean squared error (MSE) losses, optionally followed by post-hoc recalibration via conformal prediction (CP), temperature scaling (TS), or isotonic regression (IR). The key findings of our study are as follows: (1) DE provides stronger predictive robustness under domain shift than MCD, an advantage that becomes clear primarily under external shift. (2) Recalibrated GNLL-based methods yield the best uncertainty calibration (e.g., GNLL+DE+CP for systolic blood pressure (SBP), GNLL+DE+TS for diastolic blood pressure (DBP)), while MSE-based uncertainty requires recalibration to become practically useful. (3) Across settings, CP and TS offer the most consistent gains, with IR remaining competitive in several cases. Overall, our results identify DE-based methods as most robust for predictive performance under domain shift, GNLL as strongest for native UQ, and recalibration as essential for making MSE-based uncertainty practical. These findings highlight the need to jointly assess predictive accuracy and calibration on external data for trustworthy cuffless BP estimation.
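
Post-hoc recalibration with conformal prediction, one of the options compared above, can be illustrated with a standard split conformal procedure for regression. This is a generic sketch, not the paper's pipeline: it assumes the model outputs a mean and a standard deviation per sample and uses the variance-normalized residual as the nonconformity score, which is one common choice and may differ from what the authors used.

```python
import math


def conformal_scale(cal_true, cal_mean, cal_std, alpha=0.1):
    """Split conformal calibration: return the factor q such that
    mean +/- q * std targets (1 - alpha) coverage on exchangeable data."""
    scores = sorted(abs(y - m) / s for y, m, s in zip(cal_true, cal_mean, cal_std))
    n = len(scores)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[rank - 1]


def predict_interval(mean, std, q):
    """Recalibrated prediction interval for one test sample."""
    return mean - q * std, mean + q * std


# Toy calibration set: true values, predicted means, predicted standard deviations.
q = conformal_scale([1.0, 2.0, 4.0], [0.0, 0.0, 0.0], [1.0, 1.0, 2.0], alpha=0.5)
print(q, predict_interval(120.0, 5.0, q))
```

The coverage guarantee of this procedure relies on calibration and test data being exchangeable, which is precisely the assumption that domain shift can break and a reason to evaluate on external datasets.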

2605.18007 2026-05-19 cs.CL

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

推理时对修辞角色标注中困难示例的语义重排序

Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour

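AI summary This paper proposes RISE, a framework that uses label semantics at inference time to rerank predictions on hard examples in rhetorical role labeling, improving the accuracy and robustness of model predictions.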

Comments Accepted at ACL 2026 (Main Conference)

详情
AI中文摘要

修辞角色标注(RRL)为文档中的每个句子分配一个功能角色,广泛应用于法律、医疗和科学领域。尽管语言模型(LMs)在平均性能上表现良好,但它们在困难示例上仍然不可靠,其中预测置信度较低。现有方法通常隐式处理不确定性,将标签视为离散标识符,忽略了标签名称中编码的语义信息。我们引入RISE,一种推理时的语义重排序框架,利用标签语义来优化困难实例的预测。RISE自动识别低置信度预测,并使用对比学习的标签表示对模型输出进行重排序,无需重新训练或修改基础模型。在八个领域特定的RRL数据集上,使用七种LM(包括基于编码器和因果架构)的实验表明,在困难示例上平均获得+9.15个宏F1分数的提升。为了可解释性,我们进一步提出手动难度注释,从模型和人类视角研究难度,揭示与Cohen's kappa=0.40的中等一致程度。

英文摘要

Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing a moderate agreement with Cohen's kappa = 0.40.
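
The inference-time procedure described above (detect a low-confidence prediction, then rerank using label representations) might look like the following sketch. The confidence threshold, the restriction to the top-k candidates, and the cosine scoring are all assumptions made for illustration; the label and sentence vectors are taken as given rather than contrastively learned.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank_if_uncertain(logits, sentence_vec, label_vecs, threshold=0.6, top_k=2):
    """Keep the model's prediction when it is confident; otherwise choose,
    among the top-k labels, the one whose embedding best matches the sentence."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if probs[ranked[0]] >= threshold:
        return ranked[0]
    return max(ranked[:top_k], key=lambda i: cosine(sentence_vec, label_vecs[i]))


label_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rerank_if_uncertain([5.0, 0.0, 0.0], [0.0, 1.0], label_vecs))   # confident: unchanged
print(rerank_if_uncertain([1.0, 0.9, -3.0], [0.0, 1.0], label_vecs))  # uncertain: reranked
```

Note that the underlying model is never retrained or modified; only its output is post-processed, matching the description in the abstract.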

2605.18005 2026-05-19 cs.LG stat.ML

Scalable Decision-Focused Learning through Cost-Sensitive Regression

通过成本敏感回归实现可扩展的决策聚焦学习

Noah Schutte, Senne Berden, Tias Guns, Krzysztof Postek, Neil Yorke-Smith

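AI summary This paper proposes an approach based on cost-sensitive multi-output regression for combinatorial optimization problems with multiple uncertain parameters. By introducing cost-sensitive loss function components, it improves the efficiency and scalability of decision-focused learning.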

Comments 12 pages, 7 figures

详情
AI中文摘要

许多现实世界中的组合问题涉及不确定参数,这些参数可以根据上下文特征和历史数据进行预测。这些'预测后优化'或'上下文优化'问题已获得显著关注:端到端训练方法现在可以最小化下游任务成本而不是预测误差。然而,尽管这些决策聚焦学习(DFL)方法有效,但它们通常在训练过程中依赖于重复解决底层组合优化问题,这使得它们计算成本高且难以扩展。我们重新将学习问题视为一个成本敏感的多输出回归问题:多输出是因为组合问题有多个不确定参数,而成本敏感是因为下游任务成本是真正的目标。我们的技术贡献是正式化了多个损失函数组件,这些组件来自于这种重新框架:成本不敏感的归一化、决策意识的不对称惩罚过预测和欠预测,以及实例化的成本,这些成本在本地模仿真正的下游任务损失。这些组件需要每个训练数据实例零或一次求解,而训练过程中不需要进一步求解。实验表明,损失组件的组合在下游任务质量上与最先进的方法相当,同时显著更高效,使能够扩展到以前无法用DFL解决的问题规模。

英文摘要

Many real-world combinatorial problems involve uncertain parameters, which can be predicted given contextual features and historical data. These 'predict-then-optimize' or 'contextual optimization' problems have gained significant attention: end-to-end training methods can now minimize the downstream task cost rather than the predictive error. However, despite their effectiveness, these decision-focused learning (DFL) approaches often rely on repeated solving of the underlying combinatorial optimization problem during training, making them computationally expensive and difficult to scale. We reframe the learning problem as a cost-sensitive multi-output regression problem: multi-output due to the combinatorial problem having multiple uncertain parameters, and cost-sensitive due to the downstream task cost being the real target. Our technical contribution is the formalization of multiple loss function components that follow from this reframing: cost-insensitive normalization, decision-aware asymmetric penalization of over- and underpredictions, and instance-based costs that mimic the true downstream task-based loss locally. These components require zero or one solve per training data instance, while requiring no further solves during training. Experiments show that the combination of loss components achieves comparable downstream task quality to the state of the art, while being significantly more efficient, enabling scaling to problem sizes that have not been tackled before with DFL.
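
The idea of asymmetric penalization of over- and underpredictions can be illustrated with a weighted squared error in which each output parameter has separate weights for the two error directions. This is a generic sketch, not the paper's loss: how the weights are derived from the decision problem is not stated in the abstract, so `over_weight` and `under_weight` are treated here as given, hypothetical inputs.

```python
def asymmetric_squared_loss(pred, target, over_weight, under_weight):
    """Mean over outputs of w * (pred - target)^2, where w is the
    overprediction weight if pred > target and the underprediction weight otherwise."""
    total = 0.0
    for p, t, w_over, w_under in zip(pred, target, over_weight, under_weight):
        err = p - t
        total += (w_over if err > 0 else w_under) * err * err
    return total / len(pred)


# Overpredicting the first parameter costs four times as much as
# underpredicting the second one by the same amount.
loss = asymmetric_squared_loss(
    pred=[3.0, 1.0], target=[2.0, 2.0],
    over_weight=[2.0, 2.0], under_weight=[0.5, 0.5],
)
print(loss)
```

A loss of this form needs no call to the combinatorial solver during training, which is the scalability argument made in the abstract.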

2605.18004 2026-05-19 cs.LG

RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search

RL4RLA: 通过课程设计和基于图的搜索教机器学习发现随机线性代数算法

Jinglong Xiong, Xiaotian Liu, Ruoxin Wang, Zihang Liu, Yefan Zhou, Yujun Yan, Yaoqing Yang

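AI summary This paper proposes RL4RLA, a framework that automates the discovery of interpretable, symbolic randomized linear algebra algorithms through curriculum design and graph-based search, and shows that it rediscovers state-of-the-art methods and can optimize algorithm performance.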

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 9 pages main text; 21 pages total

详情
AI中文摘要

随机线性代数(RLA)算法是一类现代数值线性代数技术,在科学计算和机器学习中扮演重要角色,已被广泛采用。然而,其发现仍主要依赖手动过程,需要深厚的专家知识和灵感。尽管强化学习(RL)提供了自动化路径,但标准方法在高绩效RLA算法的稀疏奖励景观和广阔搜索空间中遇到困难。本文提出RL4RLA,一个通用的RL框架,自动化发现可解释、符号化的RLA算法。与黑盒方法不同,我们的方法从基本线性代数原语构建显式算法,确保可验证和可实现的表示。为了实现高效发现,我们引入:(1)数值课程,逐步增加问题难度以编码RLA领域的归纳偏差;(2)蒙特卡洛图搜索,通过识别和合并等价的partial算法优化探索。我们证明RL4RLA重发现状态-of-the-art方法,包括sketch-and-precondition求解器、Randomized Kaczmarz和Newton Sketch,并可针对特定的准确率、速度和稳定性之间的权衡生成算法。代码可在https://github.com/Tim-Xiong/RL4RLA获取。

英文摘要

Randomized linear algebra (RLA) algorithms are a modern class of numerical linear algebra techniques that play an essential role in scientific computing and machine learning, with broad and growing adoption. However, their discovery remains mostly a manual process that requires deep expert knowledge and inspiration. While Reinforcement Learning (RL) offers a pathway to automation, standard approaches struggle with sparse reward landscapes and vast search spaces inherent to high-performing RLA algorithms. In this paper, we present RL4RLA, a general RL framework that automates the discovery of interpretable, symbolic RLA algorithms. Unlike black-box approaches, our method builds explicit algorithms from basic linear algebra primitives, ensuring verifiable and implementable representations. To enable efficient discovery, we introduce: (1) a numerical curriculum that progressively increments problem difficulty to encode inductive bias specific to the RLA domain; (2) Monte Carlo Graph Search, which optimizes exploration by identifying and merging equivalent partial algorithms. We demonstrate that RL4RLA rediscovers state-of-the-art methods, including sketch-and-precondition solvers, Randomized Kaczmarz, and Newton Sketch, and can be targeted to produce algorithms optimized for specific trade-offs between accuracy, speed, and stability. Code is available at https://github.com/Tim-Xiong/RL4RLA.
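
As a concrete example of the kind of algorithm the framework is reported to rediscover, here is a short reference implementation of the classical Randomized Kaczmarz method for a consistent linear system Ax = b, with rows sampled in proportion to their squared norms. It is a textbook version written for illustration, not output from RL4RLA; the function name and defaults are our own.

```python
import random


def randomized_kaczmarz(A, b, iters=2000, seed=0):
    """Solve a consistent system Ax = b by repeatedly projecting the
    iterate onto the hyperplane of one randomly chosen equation."""
    rng = random.Random(seed)
    row_norms = [sum(a * a for a in row) for row in A]  # squared row norms
    x = [0.0] * len(A[0])
    rows = range(len(A))
    for _ in range(iters):
        # Sample row i with probability proportional to its squared norm.
        i = rng.choices(rows, weights=row_norms)[0]
        row = A[i]
        residual = b[i] - sum(a * xj for a, xj in zip(row, x))
        step = residual / row_norms[i]
        # Orthogonal projection onto {x : <row, x> = b[i]}.
        x = [xj + step * a for xj, a in zip(x, row)]
    return x


A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
b = [4.0, 7.0, -1.0]  # consistent with the solution x = (1, 2)
print(randomized_kaczmarz(A, b))
```

Each iteration touches a single row of A, which is why the method suits very large or streaming systems.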

2605.18001 2026-05-19 cs.CL

Bridging the Gap: Converting Read Text to Conversational Dialogue

弥合差距:将阅读文本转换为对话式语音

Parshav Singla, Agnik Banerjee, Aaditya Arora, Shruti Aggarwal, Anil Kumar Verma, Vikram C M, Raj Prakash Gohil, Gopal Kumar Agarwal

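AI summary This paper proposes PACC, a new method that uses deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm in order to convert read speech into more natural conversational speech, improving the naturalness and accuracy of speech conversion for virtual assistants, customer service, and language learning tools.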

Comments 11 pages, 4 figures. Published in ICICC 2025, Springer Lecture Notes in Networks and Systems

详情
Journal ref
Innovative Computing and Communications (ICICC 2025), Lecture Notes in Networks and Systems, Springer Nature, 2025, pp. 543-556
AI中文摘要

在最近的语音处理进展中,将阅读语音转换为对话语音引起了广泛关注。该领域的主要挑战是在实时应用中保持自然性和可懂性的同时,最小化计算开销。传统的阅读语音缺乏对话互动中至关重要的细微语调变化,这对虚拟助手、客户服务和语言学习工具等应用构成了挑战。本文介绍了一种新的方法,即带有对话上下文的语调调整(PACC),旨在将阅读语音转换为各种现代应用中使用的自然对话语音。PACC利用先进的深度神经网络来分析和修改语调特征,如语调、重音和节奏。与传统方法不同,我们的方法使用高保真生成对抗网络(HiFi-GAN)进行语音合成。我们的实验结果表明,语音转换在自然度和模型准确性方面有显著提高,通过在语音数据集上额外训练。这项研究为语音转换任务和Mean Opinion Score(MOS)评估建立了新的基准,并证明我们的方法可以成功扩展到其他语音转换应用。

英文摘要

In recent advancements within speech processing, converting read speech to conversational speech has gained significant attention. The primary challenge in this domain is maintaining naturalness and intelligibility while minimizing computational overhead for real-time applications. Traditional read speech often lacks the nuanced prosodic variation essential for natural conversational interactions, posing challenges for applications in virtual assistants, customer service, and language learning tools. This paper introduces a novel approach, Prosodic Adjustment with Conversational Context (PACC), aimed at converting read speech into natural conversational speech used in various modern applications. PACC utilizes advanced deep neural networks to analyze and modify prosodic features such as intonation, stress, and rhythm. Unlike conventional methods, our approach uses High-Fidelity Generative Adversarial Networks (HiFi-GAN) for speech synthesis. Our experimental results demonstrate significant improvements in speech conversion, enhancing naturalness and achieving better model accuracy with additional training on speech datasets. This research establishes new benchmarks in speech conversion tasks and Mean Opinion Score (MOS) evaluation for testing model accuracy, and we show that our approach can be successfully extended to other speech conversion applications.

2605.17999 2026-05-19 cs.AI

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

共享骨干PPO用于多UAV通信覆盖与连接保持

Z. Jiang

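AI summary This paper proposes a Shared Backbone PPO algorithm that shares the base module between the Actor and Critic networks, achieving efficient training and improved performance. The algorithm is implemented in a connectivity-preserving multi-UAV swarm communication coverage task and compared with standard PPO, and the experiments show superior performance. A graph information aggregation module is also integrated to accommodate the communication conditions among agents; with this module the algorithm remains effective and the trained agent swarm exhibits a higher level of cooperation.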

详情
AI中文摘要

本文提出了一种共享骨干近端策略优化(Shared Backbone PPO)算法。通过在Actor和Critic网络之间共享基础模块,该算法实现了高效的训练和改进的性能。该算法在保持连接的多UAV群体通信覆盖任务中得到实现,并与标准PPO算法进行比较。实验结果表明,所提出的方法实现了优越的性能。此外,将图信息聚合模块纳入模型架构中,以适应代理之间的通信条件。整合该模块后,算法仍保持有效,训练后的代理群体表现出更高的合作水平。

英文摘要

This paper proposes a Shared Backbone Proximal Policy Optimization (Shared Backbone PPO) algorithm. By sharing the base module between the Actor and Critic networks, the algorithm achieves efficient training and improved performance. The algorithm is implemented in a connectivity-preserving multi-UAV swarm communication coverage task and compared with the standard PPO algorithm. Experimental results demonstrate that the proposed method achieves superior performance. Furthermore, a graph information aggregation module is incorporated into the model architecture to accommodate the communication conditions among agents. With the integration of this module, the algorithm remains effective, and the trained agent swarm exhibits a higher level of cooperation.

2605.17997 2026-05-19 cs.LG cs.AI cs.CV

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

MARR: 模块自适应残差重建用于低比特后训练量化

Le Su, Xing Luo, Zhi Jin

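AI summary: This paper proposes MARR, a module-adaptive residual reconstruction method that assigns each module its own scaling coefficient to balance residual-related HA bias against accumulated-error correction, improving performance under low-bit quantization.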

AI中文摘要

近年来,基于残差重建的模型量化方法在低比特后训练量化(PTQ)中取得了有希望的性能,通过引入跨层残差来减少来自先前层的误差积累。然而,这些残差也可能引入额外的偏差,源于重建基于PTQ的Hessian近似(HA)假设,导致量化性能不理想。在本文中,我们分析发现,通过将残差项乘以一个缩放系数,可以提供一种直接的方法来缓解与残差强度相关的HA偏差,同时保持累积误差校正。更重要的是,我们观察到这种权衡是模块依赖性的,使单一全局残差强度不足以在不同模块之间平衡有效的校正和残差相关的偏差。基于这些观察,我们提出了模块自适应残差重建(MARR),为每个模块分配模块特定的缩放系数,以自适应地平衡累积误差校正和残差相关的HA偏差。为了避免昂贵的每模块系数搜索并获得稳定的系数估计,我们设计了一种基于比例-积分-微分(PID)的自适应更新策略,利用重建误差作为反馈,逐步细化此系数。在多个典型的大语言模型(LLMs)和视觉变换器(ViTs)上的实验表明,MARR在低比特量化(小于等于4位)中表现出色,实现了LLMs高达20.2%的性能提升,以及ViTs相对于残差重建最先进的方法高达4.6%的相对提升。代码将在接受后公开发布。

英文摘要

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, we analyze that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods. Code will be made publicly available upon acceptance.
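
As a rough illustration of a PID-style coefficient update, the sketch below refines one module's scaling coefficient from a feedback signal. The abstract does not give the update rule, so everything here is an assumption: the class `PIDCoefficient`, the gains, the clamp to [0, 1], the toy error curve, and the use of the error curve's finite-difference slope as the signed feedback.

```python
from dataclasses import dataclass


@dataclass
class PIDCoefficient:
    """Hypothetical PID controller for a single module's scaling coefficient."""
    alpha: float = 1.0          # assumed starting point: full residual strength
    kp: float = 0.25            # proportional gain (illustrative value)
    ki: float = 0.05            # integral gain (illustrative value)
    kd: float = 0.025           # derivative gain (illustrative value)
    integral: float = 0.0
    prev_feedback: float = 0.0

    def update(self, feedback: float) -> float:
        """Apply one PID step to alpha and clamp the result to [0, 1]."""
        self.integral += feedback
        derivative = feedback - self.prev_feedback
        self.prev_feedback = feedback
        step = (self.kp * feedback
                + self.ki * self.integral
                + self.kd * derivative)
        self.alpha = min(1.0, max(0.0, self.alpha - step))
        return self.alpha


def toy_reconstruction_error(alpha: float) -> float:
    """Stand-in for a module's reconstruction error; minimised at alpha = 0.6."""
    return (alpha - 0.6) ** 2 + 0.1


def tune_coefficient(error_fn, steps: int = 200, h: float = 1e-3) -> float:
    """Refine alpha, using the slope of the error curve as signed feedback."""
    pid = PIDCoefficient()
    for _ in range(steps):
        # Central difference: positive slope means alpha is too large.
        slope = (error_fn(pid.alpha + h) - error_fn(pid.alpha - h)) / (2 * h)
        pid.update(slope)
    return pid.alpha
```

On the toy curve, `tune_coefficient(toy_reconstruction_error)` settles near 0.6, the minimiser of the assumed error. A real reconstruction error is noisier and more expensive to evaluate, so the feedback signal and gains would need to be chosen per setting.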