2605.29494 2026-05-29 cs.LG

Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training

Hua Li

AI summary: This paper proposes Learning to Perturb Gradients (LPG), which adaptively perturbs class-level gradients for category-aware training, and builds a unified framework showing that SAM, gradient clipping, and related methods are forms of gradient perturbation. Experiments show LPG outperforming existing methods on balanced classification, long-tail classification, and noisy-label learning.

Abstract

Deep neural network training involves both forward propagation (from features through logits to loss) and backward propagation (from loss through gradients to parameter updates). While perturbations along the forward chain, including feature perturbation, logit perturbation, and label perturbation, have been extensively studied, the backward chain's gradient perturbation has received little systematic investigation. In this paper, we establish a unified framework for gradient perturbation, revealing that existing methods such as Sharpness-Aware Minimization (SAM), gradient clipping, and gradient noise injection can all be interpreted as imposing specific forms of gradient perturbation. Analogous to the recently proposed Logit Perturbation Learning (LPL), we conjecture that amplifying the gradient norm for a class acts as positive augmentation (enhancing learning), while dampening it acts as negative augmentation (suppressing overfitting). Based on these observations, we propose Learning to Perturb Gradients (LPG), which adaptively perturbs logit-level gradients at the class level to achieve category-aware training. We also establish theoretical connections between gradient perturbation bounds and generalization guarantees via PAC-Bayesian analysis. Experiments on balanced classification, long-tail classification, and noisy label learning demonstrate that LPG consistently outperforms existing methods and can be combined with them as a plug-in module.
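As a rough illustration of what class-level perturbation of logit gradients could look like, the sketch below scales the softmax cross-entropy gradient of one sample by a factor chosen for that sample's class. The function names and the per-class scale table are hypothetical; the abstract does not specify how LPG learns or applies its perturbation, so this only shows the amplify/dampen idea.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gradient(logits, label):
    """Gradient of softmax cross-entropy w.r.t. the logits: p - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def perturb_logit_gradient(logits, label, class_scale):
    """Scale a sample's logit gradient by the factor assigned to its class.

    class_scale is a hypothetical per-class table: a value above 1 amplifies
    the gradient for that class, a value below 1 dampens it.
    """
    s = class_scale[label]
    return [s * g for g in logit_gradient(logits, label)]
```

In this sketch a scale of 2.0 doubles the gradient norm for samples of that class and 0.5 halves it; how such factors would be chosen adaptively is left open here.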

2605.29491 2026-05-29 cs.AI

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring on Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

AI summary: Proposes the VitalAgent framework, which uses tool-augmented reasoning and a longitudinal physiological memory to support reactive question answering and proactive monitoring over ECG/PPG signals, improving on baselines by over 30% on the VitalBench benchmark.

Abstract

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2605.29476 2026-05-29 cs.CL

Comparative Evaluation of Machine Translation Systems on Images with Text

Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

AI summary: This study compares three machine translation paradigms (modular pipelines, multimodal large language models, and the end-to-end model Translatotron-V) on translating images that contain text, and finds that multimodal large language models perform best.

Abstract

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.

2605.29471 2026-05-29 cs.CV

V2XCrafter: Learning to Generate Driving Scene Across Agents

Yihang Tao, Yu Guo, Senkang Hu, Yanan Ma, Zihan Fang, Sam Kwong, Yuguang Fang

AI summary: Proposes the V2XCrafter framework, which uses a progressive multi-agent diffusion model and a cross-agent attention module to generate consistent, controllable collaborative driving scenes across agents' camera views, augmenting data and improving downstream collaborative 3D object detection.

Abstract

Collaborative driving systems leverage vehicle-to-everything (V2X) communication for multi-agent collaborative perception to enhance driving safety, yet they remain constrained by scarce annotated real-world V2X driving datasets and limited generalization across diverse driving conditions. While image generation technology offers a feasible solution for data augmentation, existing methods tailored for single-vehicle multi-view scenarios face two fundamental challenges in multi-agent driving settings: (1) the expansion of the learning objective degrades generation quality, and (2) the highly dynamic variations across agents hinder the modeling of consistency for physical attributes (e.g., color, category) in jointly observed objects. To bridge this gap, we propose V2XCrafter, the first framework for generating controllable and realistic collaborative driving scene across agents' camera views. For effective learning, we develop a progressive multi-agent diffusion model based on a single-agent backbone, using neighboring agents' latent states as reference signals to progressively guide the single-to-multi diffusion. To address cross-vehicle inconsistency, we propose a cross-agent attention module that leverages a collaboration view graph and learnable jointly observed object representation to model the dynamic cross-agent camera view relationships. Experiments have shown that V2XCrafter can generate high-fidelity and controllable street views with consistency across agents, thereby effectively enhancing the downstream collaborative 3D object detection tasks.

2605.29467 2026-05-29 cs.LG cs.AI

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

AI summary: Proposes five factor-graph primitives, proves that any composition of them supports closed-form variational message passing, shows that stacked routing layers give universal function approximation, and applies the framework to time-series forecasting.

Abstract

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.
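The two ingredients named for the exponential link are standard identities, written out below as a reference sketch. The symbols (Gaussian mean \mu and variance \sigma^2, Gamma shape a and rate b, digamma function \psi) are notation assumed here for illustration; the abstract does not say exactly how these identities enter the message updates.

```latex
% Gaussian moment-generating function evaluated at t = 1
\mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}\left[ e^{x} \right]
  = \exp\left( \mu + \tfrac{1}{2}\sigma^2 \right)

% Expected sufficient statistics of \tau \sim \mathrm{Gamma}(a, b), shape a, rate b
\mathbb{E}[\tau] = \frac{a}{b}, \qquad
\mathbb{E}[\ln \tau] = \psi(a) - \ln b
```

Both expectations are available in closed form, which is consistent with the claim that the non-conjugate interface stays tractable.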

2605.29462 2026-05-29 cs.CV cs.AI

A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning

Ding Chen, Xinwen Cheng, Xuyang Zhong, Xinping Chen, Xiaolin Huang, Chen Liu

AI summary: Proposes a full-pipeline evaluation framework covering data, architectures, algorithms, and post-training modules that systematically analyzes how different contexts affect membership inference attack efficacy, and offers practical guidelines based on standardized threat models and complementary metrics.

Abstract

While Membership Inference Attacks (MIAs) are the prevailing method for identifying training data, their application has expanded into privacy auditing and machine unlearning. Nevertheless, the field lacks a systematic framework for evaluating how different contexts affect MIA efficacy. Without such a characterization, practitioners risk deploying algorithms that perform well on benchmarks but become statistically irrelevant when faced with the nuances of specific, real-world datasets. To bridge this gap and provide actionable insights, we introduce a comprehensive evaluation framework that systematically characterizes privacy risks across the entire machine learning pipeline, spanning data, architectures, algorithms, and post-training modules. Designed to inherently capture diverse operational contexts, our framework rigorously evaluates state-of-the-art MIAs across a broad spectrum of training configurations. To account for varying misclassification costs in real-world deployments, we employ three complementary metrics: Balanced Accuracy for symmetric costs, alongside TPR at low FPR (or TNR at low FNR) for asymmetric scenarios where false alarms or missed detections are strictly penalized. Furthermore, recognizing that existing MIAs assume divergent adversary capabilities, we formalize two standardized threat models and adapt these attacks into corresponding variants to ensure an equitable benchmark. Extensive empirical evaluations demonstrate that the efficacy of specific MIA methodologies is highly sensitive to the assumed threat models and chosen evaluation metrics. Ultimately, we distill these findings into actionable guidelines and provide a ready-to-use auditing toolkit, empowering practitioners to conduct better privacy assessments.
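The metrics named above can be computed from attack scores in a few lines. The sketch below assumes each sample has a real-valued attack score (higher means "more likely a member") and a ground-truth membership flag, and that both members and non-members are present; the function names are illustrative and are not taken from the authors' toolkit.

```python
def balanced_accuracy(scores, is_member, threshold):
    """Mean of TPR and TNR when predicting 'member' for score >= threshold."""
    pos = [s for s, m in zip(scores, is_member) if m]
    neg = [s for s, m in zip(scores, is_member) if not m]
    tpr = sum(1 for s in pos if s >= threshold) / len(pos)
    tnr = sum(1 for s in neg if s < threshold) / len(neg)
    return 0.5 * (tpr + tnr)


def tpr_at_fpr(scores, is_member, max_fpr):
    """Highest TPR over thresholds whose FPR does not exceed max_fpr."""
    pos = [s for s, m in zip(scores, is_member) if m]
    neg = [s for s, m in zip(scores, is_member) if not m]
    best = 0.0  # a threshold above every score gives TPR = FPR = 0
    for t in sorted(set(scores)):
        fpr = sum(1 for s in neg if s >= t) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(1 for s in pos if s >= t) / len(pos)
            best = max(best, tpr)
    return best
```

TNR at low FNR follows by swapping the roles of members and non-members. An attack can score well on balanced accuracy while having near-zero TPR at a strict FPR budget, which is why the two views are reported together.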

2605.29453 2026-05-29 cs.LG cs.AI

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang

AI summary: Proposes ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models by dynamically scheduling computation across the vision encoder, LLM, and action head, achieving speedups of up to 2.55x on GR00T and 3.77x on CogACT.

Abstract

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.

2605.29430 2026-05-29 cs.AI cs.CL

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate (S^2ER) as an evaluation metric.

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
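To see why a token-level metric can understate a meaning-critical error, the sketch below computes plain word error rate (word-level edit distance divided by reference length). It is a generic WER implementation written for illustration, not the paper's evaluation code, and it assumes whitespace tokenization and a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; it starts as the cost of inserting j words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting i reference words to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

For the reference "please do not cancel my flight" and the hypothesis "please do cancel my flight", only one of six words is wrong, yet the meaning is inverted; a sentence-level semantic judgment would count the whole sentence as an error.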

2605.29429 2026-05-29 cs.CV

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

Sanghyun Jo, Seo Jin Lee, Seohyung Hong, Yoorim Gang, Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim

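AI summary: Proposes the Group Prompting paradigm, in which one click per cell type segments all instances of that type. Building on the feature-clustering property of SAM's frozen encoder, the training-free Chain-of-Prompts framework recursively expands the click and maintains high performance on multiple benchmarks.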

Comments Accepted to MICCAI 2026 (Early Accept)

Abstract

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/
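Step (2) resembles farthest-point selection. The sketch below assumes that "most spatially distant" means the reliable point whose nearest existing prompt is farthest away, and that reliable points arrive as a list of (x, y) tuples; both are assumptions for illustration, and the feature-gating step (1) is not modeled.

```python
import math


def next_prompt(prompts, reliable_points):
    """Pick the reliable point farthest from its nearest existing prompt."""
    def distance_to_prompts(point):
        return min(math.dist(point, q) for q in prompts)
    return max(reliable_points, key=distance_to_prompts)


def expand_clicks(first_click, reliable_points, n_extra):
    """Grow a prompt list from one user click by repeated farthest-point picks."""
    prompts = [first_click]
    remaining = list(reliable_points)
    for _ in range(n_extra):
        if not remaining:
            break  # no reliable candidates left to add
        chosen = next_prompt(prompts, remaining)
        prompts.append(chosen)
        remaining.remove(chosen)
    return prompts
```

Choosing the farthest candidate each round spreads prompts over the image, which matches the stated goal of maximizing coverage.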

2605.29427 2026-05-29 cs.CL

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

AI summary: To detect regulatory non-compliance in financial LLM interactions, proposes an automated pipeline driven by regulatory documents, builds FinGuard-Bench, the first benchmark for financial compliance detection, and trains the FinGuard model, which substantially outperforms existing methods on the benchmark.

Abstract

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

2605.29425 2026-05-29 cs.AI

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen, Zhiwei Yang, Man-On Pun

AI summary: Proposes the ReasonLight framework, which enhances reinforcement learning with a multimodal foundation model and uses roadside sensor and camera data to adapt zero-shot to rare traffic events, substantially reducing emergency vehicle waiting time.

Abstract

Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
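The reliability rule described above (reject invalid refined actions and fall back to the RL action) amounts to a simple guard. The sketch below is a minimal illustration with hypothetical names; phases are represented as plain integers, and the refinement module itself is not modeled.

```python
def select_phase(rl_phase, refined_phase, available_phases):
    """Return the refined phase if it is valid, otherwise the RL proposal.

    rl_phase: phase proposed by the pre-trained RL controller.
    refined_phase: phase suggested by the refinement module (may be invalid
        or None if no usable suggestion was produced).
    available_phases: collection of phases the intersection supports.
    """
    if refined_phase in available_phases:
        return refined_phase
    return rl_phase
```

With this guard the controller never emits a phase outside the available set because of the refinement step, and in the worst case it behaves exactly like the RL-only backbone.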
