arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.20322 2026-06-19 cs.RO New submission

Towards 3D karst underwater scene reconstruction from rotating sonar data

Georgios Evangelos Margaritis, Lionel Lapierre, Simon Rohou, Zhi Yan, Andreas Nüchter, François Goulette

Affiliations U2IS, ENSTA, Institut Polytechnique de Paris; Lab-STICC, ENSTA, Institut Polytechnique de Paris; Informatics XVII – Robotics, Julius-Maximilians-Universität Würzburg

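AI summary To address the difficulty of 3D reconstruction caused by sparse, noisy sonar data and navigation drift, the paper proposes a pipeline that combines continuous-time SLAM for trajectory correction with a two-stage deep learning surface reconstruction, producing an immersive, navigable 3D mesh.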

Comments 1st Workshop on Long-term Deployments in the Wild (LoWi)

Abstract

Karst aquifers provide critical freshwater resources but pose significant hazards due to their complex and poorly understood subsurface geometry. Mapping these environments is challenging because sonar data from underwater exploration is sparse and noisy, while navigation estimates suffer from drift, which limits standard 3D reconstruction methods. We present a pipeline for reconstructing underwater karst conduits from a sonar profiler. We combine a continuous-time SLAM approach that corrects trajectory drift with a novel two-stage deep learning method for surface reconstruction, producing an immersive and navigable 3D mesh for hydrogeological analysis.

2606.20318 2026-06-19 cs.DB New submission

AgenticDB: Agentic Performance Reconfiguration for Database Workloads

Xinyue Yang, Chaozheng Wang, Chen Zheng, Heng Zhang, Yanjun Wu

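AI summary Proposes AgenticDB, a framework that performs DBMS-level and OS-level reconfiguration through runtime interaction, diagnosing bottlenecks and accumulating experience; on MySQL and PostgreSQL it improves over the strongest baseline by 118.1% on average.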

Abstract

Database configuration tuning is critical for workload performance, but practical tuning on real deployments remains difficult. Existing automatic tuners mostly formulate tuning as iterative search over DBMS knob values. This formulation leads to high execution cost, prematurely narrows the configuration space, and leaves practical requirements insufficiently addressed: diagnosing runtime bottlenecks from system feedback, exploring OS-level reconfiguration opportunities, executing changes robustly, and learning from previous trials and tasks. We propose AgenticDB, an agentic framework for database workload reconfiguration. AgenticDB implements a context-grounded harness that interacts with the target database environment by proposing DBMS- and OS-level changes, applying them under safety constraints, observing workload performance and runtime states, and using execution feedback to guide subsequent decisions. This runtime interaction enables AgenticDB to diagnose bottlenecks, explore a broader DBMS- and OS-level reconfiguration space, avoid unsafe or unsupported actions, and accumulate experience within and across reconfiguration tasks. As a result, AgenticDB turns database tuning into a self-refining reconfiguration process in which runtime feedback iteratively improves later decisions. We conduct extensive experiments on MySQL and PostgreSQL using YCSB, Sysbench, and TPC-H workloads. The results show that AgenticDB achieves the best final performance on all evaluated workloads, improving over the strongest baseline by 118.1% on average and reducing aggregate time-to-best by 22.6%. The results also demonstrate that its OS-level action space, robust execution lifecycle, and memory-enhanced planning contribute to more effective and practical database reconfiguration.

2606.20312 2026-06-19 cs.CV New submission

Reliability-Aware Prototype Calibration for Frozen Pose-Flow Video Anomaly Detection

Ning Dong, Yingna Su, Xin Dong, Ziyun Jiao, Xinnian Guo, Zhuangzhuang Pan

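AI summary Proposes RPC, a post-hoc score calibration method that corrects the rankings of frozen pose-flow detectors with a standardized nearest-prototype deviation in latent space, improving AUROC by 2.03 percentage points on average across 8 backbone-dataset pairs.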

Comments 15 pages, 5 figures, 7 tables. Code available at https://github.com/iNing10/RPC

Abstract

Pose-flow video anomaly detectors are attractive for one-class surveillance because they provide likelihood-based rankings for tracked skeleton windows. However, a single likelihood score may hide multimodal normal behavior and be sensitive to pose-observation noise. We study a frozen-detector setting in which the pose-flow backbone, cached skeleton tracks, and evaluation pipeline are fixed. Reliability-Aware Prototype Calibration (RPC) is a post-hoc score calibration method for this setting. It adds a standardized nearest-prototype deviation in the frozen latent space to the standardized flow score, and uses keypoint confidence only to gate this added geometric evidence. Thus, RPC preserves the original density signal while correcting the ranking with empirical normal-mode structure under pose reliability. Across two frozen pose-flow backbones and four datasets, RPC improves frame-level AUROC in all eight backbone-dataset pairs, with gains ranging from 0.34 to 4.49 percentage points and averaging 2.03 points. Ablation and reliability analyses show that prototype deviation is the main corrective signal, while reliability gating is most useful when pose observations are less trustworthy. These results suggest that lightweight post-hoc calibration can strengthen cached pose-flow systems when retraining or reproducing the full pose pipeline is impractical.
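
The score combination described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Euclidean distance, the linear confidence gate, and the `weight` parameter are all hypothetical, and the abstract does not say how prototypes are obtained or how the gate is shaped.

```python
import numpy as np

def standardize(x, mean, std, eps=1e-8):
    """Z-score using statistics estimated on normal (training) data."""
    return (x - mean) / (std + eps)

def rpc_score(flow_score, latent, prototypes, confidence, stats, weight=1.0):
    """Hypothetical calibrated score: standardized flow score plus a
    confidence-gated, standardized nearest-prototype deviation.

    flow_score: (n,) anomaly score from the frozen detector (higher = more anomalous)
    latent:     (n, d) frozen latent codes
    prototypes: (k, d) prototypes of normal behavior (assumed given)
    confidence: (n,) keypoint confidence, clipped to [0, 1] (assumed gate)
    stats:      dict with flow_mean, flow_std, dev_mean, dev_std
    """
    # Distance from every latent code to every prototype, then the nearest one.
    dists = np.linalg.norm(latent[:, None, :] - prototypes[None, :, :], axis=-1)
    deviation = dists.min(axis=1)
    z_flow = standardize(flow_score, stats["flow_mean"], stats["flow_std"])
    z_dev = standardize(deviation, stats["dev_mean"], stats["dev_std"])
    # The gate scales only the added geometric term; the flow score is untouched.
    gate = np.clip(confidence, 0.0, 1.0)
    return z_flow + weight * gate * z_dev
```

With a gate of zero the function returns the standardized flow score unchanged, which mirrors the abstract's claim that the original density signal is preserved.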

2606.20310 2026-06-19 cs.CV New submission

Through the PRISM: Preference Representation in Intermediate States of Video Diffusion Models

Haoxuan Wu, Lai Man Po, Mengyang Liu, Kun Li, Hongzheng Yang, Wei Liu

Affiliations City University of Hong Kong; Video Rebirth; The Chinese University of Hong Kong

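AI summary Proposes PRISM, which uses a frozen video diffusion backbone and a lightweight query-based aggregation head to decode preference signals from noisy latents, achieving accurate preference prediction and noise robustness and enabling early-stage Best-of-N sampling that lowers compute cost and improves video quality.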

Abstract

Evaluating video generation with clean, pixel-based reward models disconnects evaluation from the noisy diffusion process and incurs massive VAE decoding costs. In this paper, we challenge this paradigm by asking a fundamental question: Can a powerful video generator inherently discriminate preferences directly from noisy latents? To answer this, we introduce PRISM (Preference Representation in Intermediate States of Diffusion Models). PRISM employs a lightweight Query-based Aggregation head with a frozen video diffusion backbone to decode preference signals from noisy latents. Surprisingly, PRISM not only achieves SOTA preference accuracy but also unlocks strong noise-robustness, which enables early-stage Best-of-$N$ sampling. This allows for filtering suboptimal candidates at the very beginning of denoising, drastically reducing computation while boosting video quality. We also reveal a strong positive correlation between a backbone's generative performance and its inherent evaluative power, enabling self-improving video backbones.
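
The early-stage Best-of-N idea in this abstract can be sketched as a pruning loop. This is a toy sketch under stated assumptions: `denoise_step` and `score_fn` are hypothetical callables standing in for the diffusion sampler and the preference head, and `prune_step` and `keep` are illustrative parameters that the abstract does not specify.

```python
import numpy as np

def early_best_of_n(init_latents, denoise_step, score_fn, num_steps,
                    prune_step=1, keep=1):
    """Denoise N candidates, score the noisy latents after `prune_step`
    steps, and keep denoising only the `keep` highest-scoring ones.

    denoise_step(z, t) -> next latent      (hypothetical sampler step)
    score_fn(z, t)     -> preference score (hypothetical, higher is better)
    """
    latents = list(init_latents)
    for t in range(num_steps):
        latents = [denoise_step(z, t) for z in latents]
        # Prune once, early, so later steps run on far fewer candidates.
        if t + 1 == prune_step and len(latents) > keep:
            scores = np.array([score_fn(z, t) for z in latents])
            top = np.argsort(scores)[::-1][:keep]
            latents = [latents[i] for i in top]
    return latents
```

With N candidates, T steps, and pruning to one candidate after the first step, the sampler runs N + (T - 1) times instead of N * T, and no candidate has to be decoded to pixels before it is discarded.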

2606.20303 2026-06-19 cs.CV New submission

GEN-Guard: Correcting Generalization Failures for Deployable Federated Surgical AI

Julia Alekseenko, Pietro Mascagni, AI4SafeChole Consortium, Nicolas Padoy

Affiliations University of Strasbourg, CNRS, INSERM, ICube, UMR7357; Bioimage Analysis Center, Fondazione Policlinico Universitario Agostino Gemelli IRCCS; Fondazione IRCCS Ca' Granda Ospedale Maggiore Policlinico di Milano, University of Milan; Monaldi Hospital, AORN dei Colli

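AI summary Proposes GEN-Guard, a framework that detects performance leakage through Client-Blocked Evaluation and applies feature-level correction through Disagreement-Aware Distillation, improving the cross-institutional generalization of federated surgical AI.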

Journal ref Int J Comput Assist Radiol Surg. 2026 Jun 14

Abstract

Federated Learning (FL) in surgical video AI enables collaborative model training without sharing sensitive data. However, standard evaluation practices - selecting the "best" global model based only on validation data from participating hospitals - can lead to suboptimal deployment choices. We identify this critical failure mode as performance leakage, where the selected model overfits internal federation data and fails to generalize to unseen institutions. We propose GEN-Guard, a practical post-hoc framework to detect and correct generalization failures in federated surgical AI. It integrates Generalization Detection via Client-Blocked Evaluation (CBE), which validates performance on isolated client distributions to prevent performance leakage, and Generalization Correction through Disagreement-Aware Distillation (DAD), which learns adaptive feature-level corrections for cross-institutional robustness. Both components operate after standard FL convergence while providing robust support for zero-shot adaptation to unseen environments. We first quantify the severity of performance leakage, observing Model Selection Failures (MSFs) exceeding 80% under standard evaluation. GEN-Guard is evaluated on two multi-center clinical challenges: surgical phase recognition in laparoscopic cholecystectomy and polyp segmentation in colonoscopy. Across both datasets, GEN-Guard consistently corrects these failures, improving in-federation F1 scores by up to 2 points, unseen-institution performance by up to 3 points, and worst-case institutional performance by 3-9 points. Performance leakage represents a systematic and previously under-recognized risk in federated surgical AI. GEN-Guard provides a practical solution for detecting and correcting such failures. By improving cross-institutional robustness and zero-shot generalization, it strengthens the reliability of FL for real-world surgical deployment.
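
The selection problem this abstract calls performance leakage can be illustrated with a small check. This is one assumed reading of a "model selection failure", not the paper's definition: the function name, the `tol` parameter, and the use of a single score per checkpoint are hypothetical.

```python
import numpy as np

def model_selection_failure(internal_val, external_test, tol=0.0):
    """Flag a failure when the checkpoint picked on in-federation validation
    is not (within `tol`) the best checkpoint on an unseen institution.

    internal_val:  per-checkpoint score on participating hospitals
    external_test: per-checkpoint score on a held-out institution
    Returns (failed, chosen_index, gap_to_best_external).
    """
    internal_val = np.asarray(internal_val, dtype=float)
    external_test = np.asarray(external_test, dtype=float)
    chosen = int(np.argmax(internal_val))  # what standard evaluation deploys
    gap = float(external_test.max()) - float(external_test[chosen])
    return gap > tol, chosen, gap
```

For example, validation scores of 0.80, 0.85, 0.90 select the third checkpoint; if the unseen-site scores are 0.70, 0.78, 0.72, the deployed model trails the best available checkpoint by about 0.06 and the check reports a failure.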

2606.20302 2026-06-19 cs.CV New submission

CUPID: Reconstructing UV Texture Maps for Interpretable Person-of-Interest Deepfake Detection

Giovanni Affatato, Sara Mandelli, Edoardo Daniele Cannas, Paolo Bestagini, Stefano Tubaro

Affiliations Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano

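AI summary Proposes CUPID, which uses UV texture maps from 3D face reconstruction together with a masked autoencoder to detect person-of-interest deepfakes without training on any deepfake videos, while offering interpretability and robustness.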

Abstract

Deepfakes targeting a high-profile individual, known as a Person-of-Interest (POI), are a threat to modern democracies and societies. Current POI deepfake detection methods still struggle to combine robustness to post-processing, efficiency, and interpretability, which are focal aspects of modern deepfake detectors. In this paper we propose CUPID, a POI video deepfake detector that combines UV texture maps, a facial appearance representation derived from 3D face reconstructions, with the representation learning capabilities of the Masked Autoencoder (MAE). Our method does not require any deepfake videos in its training phase. Moreover, it does not even require the specific POI to be included in the training set: the combination of UV texture maps extracted from real video frames and the MAE context-guided reconstruction yields a latent space that captures rich and discriminative facial features even for identities unseen during training. In the testing phase, the embeddings extracted from a query video depicting the POI can be matched against pristine reference videos to assess the video's authenticity. Furthermore, operating in the UV space naturally provides an additional layer of interpretability. Specifically, we can extract decoded residual maps that highlight which facial regions of a test video deviate most from the identity representation of the corresponding POI. Experiments on four deepfake datasets show that CUPID outperforms the current state of the art on most datasets and achieves the best overall robustness against strong downscaling and compression, while also providing substantially faster inference. Our experimental code will be released at https://github.com/polimi-ispl/CUPID.

2606.20301 2026-06-19 eess.SY cs.SY New submission

Data-Driven Control from Poisoned Data: Fundamental Limitations and Secure DeePC

Takumi Shinohara, Henrik Sandberg, Karl Henrik Johansson

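AI summary For arbitrary data poisoning attacks, proposes the Secure DeePC algorithm, which uses output truncation and online reconstruction to reach MPC-equivalent performance in finite time.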

Abstract

We study a data-driven control problem in the presence of arbitrary data poisoning attacks. We assume that a subset of offline output data is stored in unprotected locations and may be poisoned by an adversary. We first establish fundamental limitations for data-driven control arising from such poisoned data: poisoning attacks are not detected/identified from the dataset alone; unprotected data are non-informative for controller design with worst-case guarantees; and hard constraints on unprotected outputs are not certifiable. Motivated by these limitations and the data-enabled predictive control (DeePC) technique, we propose Secure DeePC, a data-driven control algorithm that is resilient against poisoning attacks. It first runs output-truncated DeePC using only the protected dataset until the online input becomes persistently exciting. It then uses online measurements to reconstruct the partial offline dataset, and finally returns to full-output DeePC. Secure DeePC achieves MPC-equivalent performance in finite time almost surely under certain conditions. Simulation results illustrate the efficacy of the proposed framework against poisoning attacks.
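
The switching condition in this abstract, waiting until the online input is persistently exciting, is commonly tested with a rank check on a block Hankel matrix. The sketch below shows that standard check only; the function names are illustrative, and the excitation order Secure DeePC requires, as well as its exact switching rule, are not given in the abstract.

```python
import numpy as np

def block_hankel(signal, depth):
    """Block Hankel matrix with `depth` block rows from a (T, m) signal.
    A 1-D signal is treated as a single-channel (T, 1) signal."""
    w = np.asarray(signal, dtype=float)
    if w.ndim == 1:
        w = w[:, None]
    T, m = w.shape
    cols = T - depth + 1
    if cols < 1:
        raise ValueError("signal too short for the requested depth")
    # Block row i holds samples i, i+1, ..., i+cols-1 as columns.
    return np.vstack([w[i:i + cols].T for i in range(depth)])

def is_persistently_exciting(u, order, tol=1e-9):
    """True if the depth-`order` Hankel matrix of u has full row rank."""
    H = block_hankel(u, order)
    return np.linalg.matrix_rank(H, tol=tol) == H.shape[0]
```

A constant input fails the check for any order above one, since all Hankel rows coincide, while a generic random input of sufficient length passes.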

2606.20300 2026-06-19 cs.CV New submission

CMDS-AD: Cross-Modal Dual-Stream Decoupling for Few-Shot Anomaly Detection

Junhao Cai, Deyu Zeng, Junhao Pang, Junyu Chen, Qiwei Liang, Xiaopin Zhong, Zongze Wu

Affiliations Shenzhen University; Guangzhou Maritime University; Hong Kong University of Science and Technology (Guangzhou)

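AI summary Proposes CMDS-AD, a cross-modal dual-stream anomaly detection framework that generates diverse samples with a diffusion model and uses low-frequency normal estimates to help decouple high-frequency defects, improving I-AUROC on MVTec 3D-AD by 5.7% in the 1-shot setting.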

Comments Accepted to ECCV 2026!

Abstract

Few-shot anomaly detection remains challenging due to limited training data. Multi-modal anomaly detection (MAD) offers a viable solution, leveraging 3D geometric cues to enrich 2D RGB representations and compensate for this scarcity. However, existing MAD methods apply spatially uniform feature processing, conflating stable macroscopic structures with high-frequency localized defect signals, exacerbating cross-modal misalignment and inflating false-positive rates. To overcome this, we present CMDS-AD, a Cross-Modal Dual-Stream Anomaly Detection framework. A LoRA-guided diffusion model generates diverse RGB samples to mitigate extreme data scarcity. For 3D normal augmentation, we employ a pre-trained diffusion model as a normal estimator. Crucially, this estimator inherently acts as a non-linear low-pass filter, directly extracting low-frequency normal representations from RGB inputs. This establishes an auxiliary estimated stream of purely low-frequency information, anchoring robust structural templates and assisting the uncompressed real stream, containing coupled high- and low-frequency components, to precisely isolate micro-defects. A Coordinate-Aware Hierarchical Feature Mapper adaptively aligns cross-modal semantics, while a multiplicative scoring mechanism filters modality-specific noise. Under the extreme 1-shot setting, CMDS-AD achieves absolute performance gains of 5.7% (I-AUROC) and 2.0% (AUPRO) on MVTec 3D-AD, alongside 7.7% and 5.6% improvements on EyeCandies, establishing a new state-of-the-art.

2606.20295 2026-06-19 cs.SE cs.CL New submission

Token-Operations-Oriented Inference Optimization Techniques for Large Models

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

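AI summary Proposes a four-layer technical architecture (Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion) and systematically surveys the key technologies and industry status of each layer, aiming to reduce token cost, improve service efficiency, ensure supply stability, and move large model services from callable to operable.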

Comments 62 pages, 36 figures

Abstract

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2606.20292 2026-06-19 cs.LG cs.LO New submission

Shifting-based Optimizable Linear Relaxations for General Activation Functions

Philipp Kern, László Antal, Erika Ábráham, Carsten Sinz

Affiliations Karlsruhe Institute of Technology; Karlsruhe University of Applied Sciences

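AI summary Proposes SLiR, which builds linear relaxations of general activation functions through slope parameterization and a shifting procedure, enabling efficient optimization while preserving soundness and verifying up to 7.8x more properties than existing methods.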

Comments 21 pages, under review

Abstract

The use of neural networks (NNs) is rapidly increasing, including in safety- and security-critical domains. To provide formal guarantees about NN behavior, many verification methods rely on optimizable linear relaxations of activation functions. However, existing techniques depend on hand-crafted relaxations for each activation function. Extension to state-of-the-art activation functions therefore requires substantial manual effort. In contrast, our approach SLiR (Shifting-based Linear Relaxations) is broadly applicable, requiring only a Lipschitz constant or a set of critical points. SLiR parameterizes relaxations by their slope and computes the corresponding offset via a shifting procedure that ensures sound upper and lower bounds over the input domain, enabling efficient optimization while maintaining correctness. Our experiments show that SLiR produces tight relaxations across a wide range of practical activation functions and enables verification of up to 7.8x more properties compared to state-of-the-art methods.
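
The idea of fixing a slope and shifting the line until it bounds the activation can be sketched with a Lipschitz-based grid argument. This is an illustration of the general principle under stated assumptions, not the SLiR procedure itself: the function name, the uniform grid, and the slack term are choices made here, and the paper may compute offsets differently, for instance from critical points.

```python
import numpy as np

def shifted_offsets(f, slope, lo, hi, lipschitz, n=1001):
    """Offsets (b_lo, b_up) such that, for every x in [lo, hi],
        slope * x + b_lo <= f(x) <= slope * x + b_up.

    g(x) = f(x) - slope * x is (lipschitz + |slope|)-Lipschitz, and every x
    lies within half a grid spacing of a sample, so widening the sampled
    min/max of g by `slack` gives bounds that hold on the whole interval
    (up to floating-point rounding).
    """
    xs = np.linspace(lo, hi, n)
    g = f(xs) - slope * xs
    slack = (lipschitz + abs(slope)) * (hi - lo) / (2 * (n - 1))
    return g.min() - slack, g.max() + slack
```

For instance, `shifted_offsets(np.tanh, 0.5, -1.0, 2.0, lipschitz=1.0)` yields one lower and one upper line of slope 0.5 around tanh on [-1, 2]. Because only the slope remains free, a verifier can optimize it while the offsets are recomputed to keep the bounds valid.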

2606.20291 2026-06-19 cs.LG cs.CV New submission

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang, Nathan E. Rutenbeck, Katharyn A. Duffy, Kiarie Ndegwa, Andreas Gros, Scott Conway, Guy Bayes

Affiliations Vibrant Planet Public Benefit Corporation

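AI summary Proposes the VibrantForests framework, which combines satellite imagery, lidar samples, and computer vision to map forest attributes such as canopy cover, canopy height, and biomass across the contiguous United States at 10-meter resolution, reducing saturation and regression-to-the-mean problems.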

Abstract

Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.

2606.20287 2026-06-19 cs.CL New submission

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

Affiliations Department of Educational Psychology, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; School of Computer Science and Technology, East China Normal University

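AI summary Proposes PsyScore, a framework that unifies diagnostic assessment and instructional scaffolding through a shared latent ability representation; it comprises a trait-adaptive neural IRT scorer, a ZPD-scaffolded feedback generator, and a multi-perspective feedback evaluation strategy, achieving competitive scoring performance on the ASAP++ dataset while providing more pedagogically aligned feedback.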

Abstract

Effective Automated Essay Scoring (AES) systems are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners' proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer, which incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling precise estimation of student ability while maintaining psychometric interpretability; a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels; and a Multi-Perspective Feedback Evaluation Strategy, which assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.
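
The psychometric core named in this abstract can be sketched as a category-probability function. The sketch assumes that GPCM refers to the generalized partial credit model in its standard textbook form; the function name and argument layout are illustrative, and how the neural scorer produces the ability, discrimination, and threshold values is not described in the abstract.

```python
import numpy as np

def gpcm_probs(theta, discrimination, thresholds):
    """Category probabilities P(X = k | theta), k = 0..K, for one item under
    the generalized partial credit model.

    theta:          latent ability (scalar)
    discrimination: item slope a > 0
    thresholds:     step difficulties b_1..b_K
    """
    steps = discrimination * (theta - np.asarray(thresholds, dtype=float))
    # Logit of category k is the sum of the first k steps; category 0 gets 0.
    logits = np.concatenate(([0.0], np.cumsum(steps)))
    logits -= logits.max()  # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()
```

Because the expected score rises monotonically with `theta` when the slope is positive, a single ability estimate can drive the score prediction and also serve as the proficiency signal a feedback generator conditions on.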

2606.20285 2026-06-19 cs.RO New submission

Co-VLA: Coordination-Aware Structured Action Modeling for Dual-Arm Vision-Language-Action Systems

Yandong Wang, Jiaqian Yu, Xiongfeng Peng, Lu Xu, Yamin Mao, Weiming Li, Jaewook Yoo, Dongwook Lee, Daehyun Ji, Mingbo Zhao, Chao Zhang

Affiliations Donghua University; Samsung R&D Institute China-Beijing (SRCB); Samsung AI Center, DS Division

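AI summary To address the inadequacy of implicit coordination in tightly coupled dual-arm tasks, proposes Co-VLA, a framework that explicitly introduces coordination priors through a Structured Action Expert and a Latent-Aware Controller, significantly improving success rate and efficiency in simulation and real-world settings.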

Abstract

Vision-language-action (VLA) models show strong capabilities in single- and dual-arm robotic manipulation. Prior work shows that coordinated bimanual behaviors can emerge from end-to-end learning that leverages large vision-language backbones with continuous action prediction. However, as bimanual tasks become tightly coupled and execution constraints become critical, implicit coordination alone is insufficient to ensure reliable, interpretable, and stable behavior. In this work, we propose Co-VLA, a coordination-aware bimanual manipulation framework that introduces explicit structural priors into VLA models. We instantiate our method on a state-of-the-art vision-language backbone by replacing its monolithic action head with a Structured Action Expert (SAE) designed for bimanual coordination. Specifically, we introduce explicit structure at the action generation level with a modular coordination-aware loss that shapes shared and residual latents according to task-specific structures. The shared latent encodes task-level coordination intent, while residual latents capture execution adjustments for each arm. At deployment, a Latent-Aware Controller (LAC) interprets the learned representations to modulate synchronization strength, execution asymmetry, smoothness, and safety constraints in real time. LAC operates at the joint-command level and remains compatible with standard control pipelines without requiring force or impedance control. Experiments across simulation and real-world benchmarks show that Co-VLA significantly outperforms monolithic baselines, achieving a 27% success rate gain in tight-coordination tasks, more than doubling performance in OOD real-world scenarios (from 13% to 27%), and reducing task completion time by up to 25%.

2606.20283 2026-06-19 cs.LG cs.AI New submission

Boundary Embedding Shaping with Adaptive Contrastive Learning for Graph Structural Disentanglement

Jiaqing Chen, Zidu Yin, Yichao Cai, Yuhang Liu, Zhen Zhang, Dong Gong, Javen Qinfeng Shi

Affiliations Yunnan Normal University; Adelaide University; The University of New South Wales

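AI summary To counter the classification performance drop caused by graph structural entanglement, proposes a Boundary Embedding Shaping module that uses adaptive contrastive learning to selectively suppress spurious structural noise at decision boundaries, improving node classification and link prediction accuracy.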

Comments Accepted at ICML 2026

Abstract

Graph neural networks (GNNs) excel at aggregating neighbor information for classification, yet their performance is hindered by graph structural entanglement, where spurious correlations from semantically irrelevant neighbors contaminate node embeddings. This challenge is most acute for nodes near class boundaries in the embedding space, where amplified structural noise blurs decision boundaries and destabilizes predictions. Existing robust GNN methods largely treat all nodes uniformly, ignoring boundary vulnerabilities. In this paper, to improve classification performance, we tackle graph structural disentanglement by identifying boundary-region entanglement as the primary bottleneck and propose Boundary Embedding Shaping (BES), an adaptive contrastive learning GNN plug-in module that selectively suppresses spurious structural noise at decision boundaries with minimal model parameter perturbation. Extensive experiments demonstrate that BES consistently improves boundary discrimination and outperforms existing leading methods. Notably, BES boosts GCN performance by an average of 3.3% in node classification (up to 5.0% on WikiCS) and achieves superior accuracy in link prediction.

2606.20282 2026-06-19 cs.CV New submission

U$^2$Mamba: A Two-level Nested U-structure Mamba for Salient Object Detection

Junhui Li, Jialu Li, Youshan Zhang

发表机构 * University of Science and Technology Liaoning(辽宁科技大学) Chuzhou University(滁州学院) Yeshiva University(叶史瓦大学)

AI总结 提出U$^2$Mamba,一种两级嵌套U结构网络,通过多尺度Mamba U块增强深度和上下文信息,并采用分层训练监督,在显著目标检测上达到先进性能。

Comments 6 pages, 2 figures

详情
AI中文摘要

基于Mamba的模型已成为显著目标检测(SOD)的有前途的替代方案,在长序列建模方面具有显著优势。然而,现有模型往往未能充分利用上下文信息和整个架构的深度。本文介绍了U$^2$Mamba,一种用于显著目标检测的强大且创新的U结构网络。我们提出了多尺度Mamba U块(MMUBs),增强了模型深度以改进局部特征提取能力。我们新开发的嵌套U结构结合了MMUBs,使网络能够整合来自浅层和深层的不同感受野,从而收集更丰富的上下文信息和更长距离的数据,而不受分辨率限制。我们提出了一种分层训练监督方法,在训练过程中在每个层级计算损失,而不是使用传统的深度监督方案和顶层监督训练。大量实验表明,U$^2$Mamba在显著目标检测上取得了与最先进方法高度竞争的性能。源代码可在 https://github.com/JL021/U2Mamba 获取。

英文摘要

Mamba-based models have emerged as a promising alternative for salient object detection (SOD), offering significant advantages in modeling long sequences. However, existing models often fail to explore contextual information and the depth of the entire architecture. This paper introduces U$^2$Mamba, a powerful and innovative U-structured network for salient object detection. We propose multiscale Mamba U-blocks (MMUBs) that enhance the model depth to improve local feature extraction capabilities. Our newly developed nested U-structure, incorporating MMUBs, enables the network to integrate various receptive fields from shallow and deep layers, thereby collecting richer contextual information and longer-range data without being constrained by resolution. Instead of using the traditional deep supervision scheme and top-level supervised training, we propose a hierarchical training supervision method where the loss is computed at each level during the training process. Extensive experiments demonstrate that U$^2$Mamba achieves highly competitive performance against state-of-the-art methods. The source code is available at https://github.com/JL021/U2Mamba.

2606.20280 2026-06-19 cs.IR cs.AI 新提交

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

ELVA:探索排序驱动的通用多模态检索

Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang, Jingwen Fu, Zhen Liu, Bin Qin, Zhenbo Luo, Jian Luan, Jingmin Xin

AI总结 提出ELVA框架,通过基于规则的强化学习缓解对比学习中的粒度盲视问题,在通用多模态检索中实现排序优化,并在新基准MRBench上提升13.1%。

Comments Accepted by ECCV 2026

详情
AI中文摘要

利用多模态大语言模型(MLLMs)进行对比学习已成为提升通用多模态检索(UMR)性能的主流范式。然而,先前的工作在将对比范式适应到检索任务时忽略了粒度盲视问题。粒度盲视是指模型倾向于忽略查询中包含的粒度级信息,而这些信息对于有效处理复杂查询至关重要。这源于对比学习将样本视为二元分类(正/负),而忽略了每个负样本携带的不同信息。为了解决这个问题,我们认为应该根据负样本与正样本的相似度区别对待它们,使模型能够从每个负样本中学习不同的粒度信息。在本文中,我们引入了一个简单但有效的框架,称为ELVA,一种新颖的基于规则的强化学习框架,通过排序驱动的MLLMs缓解粒度盲视。1)不依赖奖励模型,我们将可验证奖励的强化学习(RLVR)扩展到检索任务,使模型能够探索新的排序行为而无需显式的排序标签。2)通过利用基于规则的奖励,我们的方法联合优化负样本的排序,同时扩大正负样本之间的相似度差距。为了更精确地衡量粒度盲视,我们进一步引入了MRBench,一个专门为多粒度查询场景设计的新基准。ELVA在标准检索基准上取得了最先进的结果,在MRBench上显著提升13.1%,进一步证明了其在缓解粒度盲视方面的有效性。

英文摘要

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored grain blindness when adapting the contrastive paradigm to retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce ELVA, a simple but effective rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative samples. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.

2606.20274 2026-06-19 cs.AI 新提交

Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving

Lagrange: 一种面向通用端到端驾驶的开放词汇、基于能量的稀疏框架

Shihao Ji, HongXi Li, Zihui Song, Mingyu Li

AI总结 提出Lagrange框架,利用掩码潜在场和视觉语言模型实现开放词汇、稀疏计算,通过拉格朗日动作最小化确保运动学约束,在nuScenes和CODA基准上验证了鲁棒性和可解释性。

详情
AI中文摘要

将端到端自动驾驶扩展到复杂的开放世界环境,需要能够泛化到异常场景的感知模型和能够产生运动学有效轨迹的规划器。现有范式在表示效率和泛化能力之间存在明显分歧。密集模型(如占用网络)虽然几何鲁棒,但存在关键计算瓶颈,且难以进行高层语义推理。相反,稀疏的基于查询的规划器效率高,但依赖于封闭集定义,使其容易受到分布外事件的影响。尽管最近的视觉-语言-动作模型提供了开放词汇推理,但其自回归离散令牌生成从根本上与车辆动力学的连续高频控制需求相冲突。为解决这一问题,我们提出了Lagrange,一种基于掩码潜在场的开放词汇、计算稀疏的驾驶框架。Lagrange不依赖密集体积重建或封闭集查询机制,而是利用视觉语言模型将类别无关的目标提议编码为连续语义视觉令牌。我们引入了一种意图驱动的掩码交叉注意力模块,该模块在时间上过滤不相关实体,并将注意力令牌解码为定义在空间坐标上的隐式连续能量场。通过将决策制定为跨越该能量场的拉格朗日动作最小化问题,我们在执行碰撞避免的同时强制遵守车辆运动学。在标准(nuScenes)和长尾(CODA)基准上的大量离线评估表明,Lagrange为鲁棒、可解释且运动学可行的开放世界自主性建立了一个有前景的框架。

英文摘要

Scaling end-to-end autonomous driving to complex, open-world environments requires perceptual models that generalize to anomalous scenarios and planners that produce kinematically valid trajectories. Existing paradigms face a distinct dichotomy between representational efficiency and generalization capacity. Dense models (e.g., occupancy networks), while geometrically robust, incur critical computational bottlenecks and struggle with high-level semantic reasoning. Conversely, sparse, query-based planners are efficient but reliant on closed-set definitions, rendering them vulnerable to out-of-distribution (OOD) events. Although recent Vision-Language-Action (VLA) models offer open-vocabulary reasoning, their autoregressive, discrete token generation fundamentally conflicts with the continuous, high-frequency control requirements of vehicle dynamics. To address this, we propose Lagrange, an open-vocabulary, computationally sparse driving framework based on Masked Latent Fields (MLF). Rather than relying on dense volumetric reconstructions or closed-set query mechanisms, Lagrange exploits Vision-Language Models (VLMs) to encode class-agnostic object proposals into continuous semantic visual tokens. We introduce an intent-driven masked cross-attention module that temporally filters irrelevant entities, decoding the attended tokens into an implicit continuous energy field defined over spatial coordinates. By framing decision-making as a Lagrangian action minimization problem spanning this energy field, we enforce strict compliance with vehicle kinematics while executing collision avoidance. Extensive offline evaluations on both standard (nuScenes) and long-tail (CODA) benchmarks demonstrate that Lagrange establishes a promising framework for robust, interpretable, and kinematically feasible open-world autonomy.

2606.20272 2026-06-19 cs.RO cs.CV 新提交

Efficiently Linking Real Scenes with Synthetic Data Generation for AI-based Cognitive Robotics and Computer Vision Applications

高效连接真实场景与合成数据生成以支持基于AI的认知机器人和计算机视觉应用

Paul Koch, Vivek Chavan, André Sers, Adem Karakurt, Paul Hofmann, Mohamad Zaher Ziadeh, Jörg Krüger

发表机构 * Fraunhofer IPK(弗劳恩霍夫生产设备和设计技术研究所) TU Berlin(柏林工业大学)

AI总结 本文讨论当前AI视觉模型在认知机器人应用中的局限,并提出通过连接仿真与真实世界训练数据生成来弥合领域差距的方法。

Comments Accepted and best paper award at MHI-Kolloquium 2024

详情
AI中文摘要

AI视觉模型是认知机器人在工业和家庭应用中潜在用例场景的驱动因素。基于最新的AI成就,已经提出了从语义环境分析到6D和抓取姿态估计的大量方法。然而,这些进展需要更强大和高效的方法,特别是在训练数据和AI架构方面,这些方法能够协同应对当前挑战、精度限制以及超越领域差距的可扩展性。在本文中,我们讨论了这些当前限制,以及相关最先进技术中正在突破这些限制的趋势。此外,我们讨论了当前在弥合仿真与真实世界应用之间的领域差距方面的工作进展,通过在训练数据生成中连接两者来实现。

英文摘要

AI vision models are a driving factor for the potential use cases of cognitive robotics in industrial and household applications. A large array of methods, from semantic environment analysis to 6D and grasp pose estimation, has been proposed based on the latest AI achievements. However, such advancements require stronger and more efficient methods with respect to training data and AI architectures, which can work in synergy to tackle current challenges, precision limits, and scalability beyond domain gaps. In this paper, we discuss these current limits and the trends in the related state of the art that are challenging them. Further, we discuss our current work in progress on bridging the domain gap between simulation and real-world applications by linking the two during training data generation.

2606.20264 2026-06-19 cs.AI 新提交

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

学生绘制的科学模型的置信度感知自动评估

Luyang Fang, Yingchuan Zhang, Jongchan Park, Zhaoji Wang, Ping Ma, Xiaoming Zhai

发表机构 * AI4STEM Education Center, Athens, GA, USA(AI4STEM教育中心,雅典,佐治亚州,美国) Department of Statistics, University of Georgia, Athens, GA, USA(佐治亚大学统计系,雅典,佐治亚州,美国)

AI总结 提出一种基于视觉Transformer的置信度感知评分框架,通过选择性自动化高置信度响应并延迟不确定案例至人工审核,在六个NGSS评估项上提高了评分可靠性并平衡了自动化覆盖率与评分风险。

详情
AI中文摘要

学生生成的绘图广泛应用于科学教育中,用于评估学习者在基于建模任务中的概念理解,这些任务与下一代科学标准(NGSS)保持一致。然而,对这些绘图进行评分需要专家人工判断来解释复杂的视觉表示,使得大规模评估在课堂环境中实施和维持成本高昂。在这项工作中,我们研究了使用基于视觉模型的自动评分学生生成的科学绘图。我们评估了具有参数高效适应的视觉Transformer(ViT),并提出了一个置信度感知评分框架,该框架从测试时预测分布中推导出响应级别的置信度。这种置信度信号通过自动评分高置信度响应,同时将不确定案例延迟至人工审核,实现了选择性自动化。在六个与NGSS对齐的中学评估项上的实验表明,所提出的方法提高了评分可靠性,同时支持自动化覆盖率和评分风险之间的实际权衡,突出了置信度感知方法在可信教育评估中的价值。

英文摘要

Student-generated drawings are widely used in science education to assess learners' conceptual understanding in modeling-based tasks aligned with the Next Generation Science Standards (NGSS). However, scoring such drawings requires expert human judgment to interpret complex visual representations, making large-scale assessment costly to implement and sustain in classroom settings. In this work, we study automated scoring of student-generated scientific drawings using a vision-based model. We evaluate a Vision Transformer (ViT) with parameter-efficient adaptation and propose a confidence-aware scoring framework that derives response-level confidence from test-time predictive distributions. This confidence signal enables selective automation by scoring high-confidence responses automatically while deferring uncertain cases for human review. Experiments on six NGSS-aligned middle school assessment items show that the proposed approach improves scoring reliability while supporting a practical trade-off between automated coverage and scoring risk, highlighting the value of confidence-aware methods for trustworthy educational assessment.
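
The selective-automation idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how confidence is derived from the test-time predictive distributions, so the choices here are assumptions for illustration only: averaging several probability vectors per response, taking the top averaged probability as the confidence, and the hypothetical names `response_confidence`, `route`, and the `threshold` value.

```python
from statistics import mean

def response_confidence(prob_samples):
    """Average several test-time class-probability vectors for one response.

    prob_samples: list of probability vectors (one per test-time pass).
    Returns (predicted_class, confidence), where confidence is the averaged
    probability of the predicted class. This aggregation is an assumption,
    not the paper's stated method.
    """
    n_classes = len(prob_samples[0])
    avg = [mean(p[c] for p in prob_samples) for c in range(n_classes)]
    pred = max(range(n_classes), key=lambda c: avg[c])
    return pred, avg[pred]

def route(prob_samples, threshold=0.8):
    """Score automatically when confident, otherwise defer to a human."""
    pred, conf = response_confidence(prob_samples)
    if conf >= threshold:
        return ("auto", pred, conf)
    return ("human_review", None, conf)

# A confident response is scored automatically; an uncertain one is deferred.
print(route([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]))
print(route([[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]]))
```

Raising `threshold` sends more responses to human review and lowers scoring risk, which is the coverage-versus-risk trade-off the abstract describes.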

2606.20258 2026-06-19 cs.HC cs.AI 新提交

Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination

编辑对齐:一种参与式方法,将编辑专业知识引入LLM介导的知识传播

Simon Aagaard Enni, Malthe Stavning Erslev, Karl-Emil Kjær Bilstrup, Kristoffer Laigaard Nielbo

AI总结 本文提出“编辑对齐”作为参与式AI设计实践,通过设计工作坊让编辑参与重新对齐LLM接口至编辑标准,以维护公共知识机构的编辑职能。

Comments 14 pages

详情
AI中文摘要

LLM驱动的信息服务的出现正在重塑公共知识机构的运作条件,威胁着吸收这些机构赖以存在的编辑功能。虽然LLM为知识传播提供了强大的新可能性,但预训练的LLM已经与其商业开发者的价值观和传播策略对齐,从而挑战了编辑权威。本文通过一个案例研究,调查编辑通过设计工作坊参与将LLM接口重新对齐到编辑标准的过程,在该案例中,我们与一家北欧公共知识机构设计并实现了一个LLM增强的百科全书界面。我们将编辑对齐作为参与式AI中的一种设计实践引入,将AI对齐视为一个设计过程,并将编辑标准定位为一种设计工件,将编辑实践和价值观转化为技术实现的对齐目标。最后,我们讨论了编辑对齐如何为持续参与创造空间,并赋予编辑在LLM介导的知识传播中的自主权。

英文摘要

The emergence of LLM-driven information services is reshaping the conditions under which public knowledge institutions operate, threatening to absorb the editorial function these institutions exist to exercise. While LLMs offer powerful new affordances for knowledge dissemination, editorial authority is challenged by pretrained LLMs that arrive already aligned with the values and dissemination strategies of their commercial developers. This paper investigates editor participation in re-aligning LLM interfaces to editorial standards through design workshops, in a case study where we design and implement an LLM-enabled encyclopedia interface with a Nordic public knowledge institution. We introduce editorial alignment as a design practice within Participatory AI, framing AI alignment as a design process and positioning the editorial standard as a design artefact that translates editorial practice and values into alignment objectives for technical implementation. Last, we discuss how editorial alignment can create space for ongoing participation and give editors agency in LLM-mediated knowledge dissemination.

2606.20255 2026-06-19 cs.CL cs.AI 新提交

The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse

语域差距:尼日利亚公共话语的意义智能框架

Celestine Achi

AI总结 提出九维意义智能框架(MIF),通过语域、真实意图等维度区分表面情感与真实交际意图,在尼日利亚公共话语数据集上使语域分类准确率提升40个百分点,复合意义智能评分提升5.4分。

Comments Preprint. 12 pages, 2 tables. Supplementary materials: MIF Master Specification v2.0, Annotation Guidelines v1.0, and 30-item public calibration set with gold labels available from the author

详情
AI中文摘要

我们提出了意义智能框架(MIF),这是一个用于尼日利亚公共话语的九维标注和评估方案,将表面情感与真实交际意图区分开来。现有的尼日利亚语言基准(包括NaijaSenti和AfriSenti)将情感分类视为三向极性任务(正面、负面、中性)。我们认为,AI系统在尼日利亚话语上的主要失败模式不是翻译失败,而是语境失败:同一话语根据说话者、听众和情境可能具有相反的语用效力。MIF通过九个评分维度将这一见解操作化:语域、表面情感、真实意图、反讽、编码潜台词、风险等级、标注者置信度、说话者情绪和推荐沟通行动。我们构建了一个包含30个项目的校准数据集,涵盖标准英语、尼日利亚英语、尼日利亚皮钦语和混合语域,并在零样本和模式引导提示条件下评估了一个前沿语言模型(Gemini 2.5 Flash)。主要发现是语域差距:零样本语域分类准确率为33.3%,当模型在上下文中接收到MIF模式时,准确率上升至73.3%(+40个百分点)。在模式引导提示下,复合意义智能评分增加了5.4分(从73.2到78.6),最大的实际收益体现在语域识别、编码潜台词检测(+10分)和战略行动推荐(+10.3分)上。我们发布了框架规范、标注指南和包含30个项目的公开校准集以支持可重复性,同时保留了一个私有留存语料库用于防污染评估。

英文摘要

We introduce the Meaning Intelligence Framework (MIF), a nine-dimension annotation and evaluation schema for Nigerian public discourse that separates surface sentiment from true communicative intent. Existing benchmarks for Nigerian languages, including NaijaSenti and AfriSenti, treat sentiment classification as a three-way polarity task (positive, negative, neutral). We argue that the dominant failure mode of AI systems on Nigerian discourse is not translation failure but context failure: the same utterance carries opposite pragmatic force depending on speaker, audience, and situation. The MIF operationalises this insight across nine scored dimensions: register, surface sentiment, true intent, irony, coded subtext, risk tier, annotator confidence, speaker emotion, and recommended communications action. We construct a 30-item calibration dataset spanning Standard English, Nigerian English, Nigerian Pidgin, and code-mixed registers, and evaluate a frontier language model (Gemini 2.5 Flash) under zero-shot and schema-informed prompting conditions. The headline finding is the Register Gap: zero-shot register classification accuracy is 33.3%, rising to 73.3% (+40 points) when the model receives the MIF schema in-context. The composite Meaning Intelligence Score increases by 5.4 points (73.2 to 78.6) under schema-informed prompting, with the largest practical gains in register identification, coded-subtext detection (+10 points), and strategic action recommendation (+10.3 points). We release the framework specification, annotation guidelines, and the 30-item public calibration set to support reproducibility, while retaining a private holdout corpus for contamination-protected evaluation.
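
As a quick arithmetic check on the headline numbers, the reported accuracies are consistent with 10 and 22 correct items out of 30. These counts are inferred from the percentages under the assumption that register accuracy was computed over all 30 calibration items; the abstract does not state them.

```python
def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)

TOTAL_ITEMS = 30            # size of the public calibration set (from the abstract)
ZERO_SHOT_CORRECT = 10      # inferred from 33.3%, not reported directly
SCHEMA_CORRECT = 22         # inferred from 73.3%, not reported directly

zero_shot = accuracy_pct(ZERO_SHOT_CORRECT, TOTAL_ITEMS)   # 33.3
schema = accuracy_pct(SCHEMA_CORRECT, TOTAL_ITEMS)         # 73.3
gain_points = round(schema - zero_shot, 1)                 # 40.0
extra_items = SCHEMA_CORRECT - ZERO_SHOT_CORRECT           # 12

print(zero_shot, schema, gain_points, extra_items)
```

Under that assumption, the 40-point Register Gap corresponds to 12 additional items classified correctly, which is worth keeping in mind given the small calibration set.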

2606.20254 2026-06-19 cs.CR 新提交

Quantization as a Malicious Task: Removing Quantization-Conditioned Backdoors via Task Arithmetic

量化作为恶意任务:通过任务算术移除量化条件后门

Kaihsun Yang, Min-Yan Tsai, Chia-Mu Yu

AI总结 提出QVec方法,通过将量化引起的权重变化视为恶意任务向量,在部署前进行参数校正,无需重训练或触发样本即可防御量化条件后门。

详情
AI中文摘要

模型量化被广泛采用,以在资源受限设备上部署深度神经网络时减少内存使用和推理成本。然而,最近的研究揭示了一种新的安全威胁,称为量化条件后门(QCBs),其中模型在全精度下行为正常,但仅在量化后激活恶意行为。现有的防御通常修改量化过程或校正激活统计,往往引入额外的计算开销或依赖特定的量化设置。在这里,我们提出QVec,一种从参数空间角度防御QCBs的方法。我们观察到,全精度模型与其量化版本之间的权重差异编码了一种结构化的行为偏移,可以解释为恶意任务向量,而非随机量化噪声。基于这一见解,QVec通过在部署前进行受控的参数校正来抵消这一恶意方向。QVec无需重新训练,无需触发样本,仅需一次量化传递来估计参数偏移,以及轻量级的超参数搜索。在图像分类基准和多个大型语言模型(LLM)攻击场景中的大量实验表明,QVec在保持干净性能的同时,持续抑制后门激活。

英文摘要

Model quantization is widely adopted to reduce memory usage and inference cost when deploying deep neural networks on resource-constrained devices. However, recent studies have revealed a new security threat known as Quantization-Conditioned Backdoors (QCBs), where a model behaves normally in full precision but activates malicious behavior only after quantization. Existing defenses typically modify quantization procedures or correct activation statistics, often introducing additional computational overhead or relying on specific quantization settings. Here, we present QVec, a parameter-space perspective for defending against QCBs. We observe that the weight difference between a full-precision model and its quantized counterpart encodes a structured behavioral shift, which can be interpreted as a malicious task vector rather than random quantization noise. Based on this insight, QVec counteracts this malicious direction through controlled parameter correction prior to deployment. QVec requires no retraining, no trigger samples, and only a single quantization pass to estimate the parameter shift, together with a lightweight hyperparameter search. Extensive experiments across image classification benchmarks and multiple Large Language Model (LLM) attack scenarios demonstrate that QVec consistently suppresses backdoor activation while preserving clean performance.
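
The parameter-space view can be sketched with plain task-vector negation, as used in task arithmetic. This is an illustration under stated assumptions, not QVec's actual procedure: the toy round-to-nearest quantizer, the correction rule `w - lam * tau`, and the names `quantize`, `task_vector`, `correct`, and `lam` are all hypothetical.

```python
def quantize(weights, step=0.25):
    """Toy uniform round-to-nearest quantizer (stand-in for a real scheme)."""
    return [round(w / step) * step for w in weights]

def task_vector(weights, step=0.25):
    """Quantization-induced shift tau = Q(w) - w, from one quantization pass."""
    return [q - w for q, w in zip(quantize(weights, step), weights)]

def correct(weights, lam, step=0.25):
    """Move full-precision weights against tau, scaled by lam.

    lam plays the role of the hyperparameter that would be searched;
    lam = 0 leaves the weights unchanged.
    """
    tau = task_vector(weights, step)
    return [w - lam * t for w, t in zip(weights, tau)]

w = [0.1, 0.6]
print(task_vector(w))        # shift caused by quantization
print(correct(w, lam=1.0))   # weights pushed in the opposite direction
```

The sketch only shows the mechanics: a single quantization pass yields the shift, and a scalar controls how strongly it is counteracted before deployment.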

2606.20251 2026-06-19 cs.CR 新提交

TrustMix: How to Mix Messages in a Mobile Ad-hoc Network

TrustMix:如何在移动自组织网络中混合消息

Yu Shen, Aiswarya Walter, Stefanie Roos

AI总结 提出TrustMix协议,通过分组转发和洗牌实现无中心信任的匿名通信,利用可链接环签名限制速率,在随机预言机模型下证明安全性,仿真和Android实现验证了匿名性和吞吐量。

Comments Accepted at ICDCS 2026, 11 pages

详情
AI中文摘要

混合网络是实现匿名性、防御各种流量分析攻击的高效方法。然而,混合网络通常为基础设施网络设计,无法直接应用于移动自组织网络(MANET)。现有的少数MANET解决方案需要预先了解拓扑结构或依赖可信中心方。本文提出TrustMix,一种无需任何中心可信方的MANET混合协议。在TrustMix中,各方加入组,消息通过多个组转发以提供匿名性。用户只需在附近找到一个他们认为可信的方,然后将消息转发到该方的组,该方在转发到其他组之前对消息进行洗牌,从而无法将原始消息与转发消息关联。此外,即使所选方是敌对的,只有当其组内所有方都是敌对的时,他们才能破坏匿名性,因为所有方都参与洗牌。除了匿名性,TrustMix还通过可链接环签名对消息数量实施速率限制,从而能够在不泄露身份的情况下检测到各方发送超过允许数量的消息。我们在随机预言机模型下证明了协议的安全性。我们使用现有的混合网络模拟器评估其匿名性,表明TrustMix显著提高了消息匿名性。最后,我们展示了基于Android的概念验证实现,并表明TrustMix在5个移动设备上实现了可接受的吞吐量。

英文摘要

Mix networks are a highly effective way to achieve anonymity, defending against a wide range of traffic-analysis attacks. However, mix networks are usually designed for infrastructure networks and cannot be directly applied in the context of mobile ad hoc networks (MANETs). The few existing solutions for MANETs require advance knowledge of the topology or a trusted central party. In this paper, we present TrustMix, a mix protocol for MANETs that operates without any central trusted party. In TrustMix, parties join groups and then messages are forwarded via multiple groups to provide anonymity. With TrustMix, users only need to find a party nearby that they consider trusted. They then forward the message to this party's group, and the party shuffles messages before forwarding to other groups, meaning that the original message and the forwarded message cannot be linked. Furthermore, even if the chosen party is adversarial, they can only break the anonymity if all parties in their group are adversarial, as all of them contribute to the shuffling. In addition to anonymity, TrustMix also enforces rate limits on the number of messages through the use of linkable ring signatures, which allows detecting that parties send more messages than allowed without revealing identities. We prove the security of our protocol in the random oracle model. We evaluate its anonymity using an existing mix-network simulator and show that TrustMix significantly improves message anonymity. Finally, we present a proof-of-concept Android implementation and show that TrustMix achieves acceptable throughput with 5 mobile devices.

2606.20250 2026-06-19 cs.CV 新提交

Single-Stage Hierarchical Rectification for Weakly Supervised Histopathology Segmentation

单阶段层次化校正用于弱监督组织病理学分割

Duc T. Nguyen, Hoang-Long Nguyen, Thanh-Ha DO, Huy-Hieu Pham

发表机构 * VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam(越南河内VinUniversity VinUni-Illinois智慧健康中心) The Computer Vision and Medical AI Lab, VinUniversity, Hanoi, Vietnam(越南河内VinUniversity计算机视觉与医学人工智能实验室) Posts and Telecommunications Institute of Technology, Hanoi, Vietnam(越南河内邮电技术学院)

AI总结 提出单阶段层次化校正框架,通过层次化特征校正模块在单次训练中直接生成高保真激活图,解决多阶段弱监督分割中的误差传播和计算开销问题。

Comments Accepted to MICCAI 2026. This is the pre-review submitted version, not the camera-ready version. The final authenticated version will be available in the MICCAI 2026 proceedings

详情
AI中文摘要

现有的计算病理学中的弱监督语义分割方法依赖于多阶段范式:类激活图生成、离线伪掩码细化和全监督再训练。虽然这种解耦方法已被广泛采用,但它存在根本性缺陷。多阶段过程不仅导致高计算训练成本,还遭受误差传播:浅层CNN中的局部纹理偏差产生假阳性伪影,后续细化步骤往往无法纠正。为了通过简单而高效的方法解决这些持续存在的挑战,我们提出了单阶段层次化校正(SSHR)框架。我们的方法不是事后被动地细化CAM,而是在前向传播过程中主动净化中间特征表示。我们引入了一个层次化特征校正模块(HFRM),利用深层全局语义上下文过滤浅层中的局部异常。该机制在单个训练循环内直接生成高保真激活图。在LUAD-HistoSeg和BCSS数据集上的实验表明,SSHR优于最先进的多阶段方法。此外,SSHR将训练时间减少了2到5倍。这种效率降低了计算开销,并加速了大规模组织病理学工作流的临床转化。代码可在以下网址获取:https://github.com/trongduc-nguyen/SSHR

英文摘要

Existing weakly supervised semantic segmentation (WSSS) methods in computational pathology rely on a multi-stage paradigm: class activation map (CAM) generation, offline pseudo-mask refinement, and fully supervised retraining. While established, this decoupled approach presents fundamental limitations. The multi-stage process not only incurs high computational training costs but also suffers from error propagation: local texture biases in shallow CNN layers generate false-positive artifacts that subsequent refinement steps often fail to correct. To address these persistent challenges through a simple yet highly effective approach, we propose the Single-Stage Hierarchical Rectification (SSHR) framework. Rather than passively refining CAMs post-hoc, our method proactively purifies intermediate feature representations during the forward pass. We introduce a Hierarchical Feature Rectification Module (HFRM) that utilizes deep global semantic context to filter out local anomalies in shallow layers. This mechanism generates high-fidelity activation maps directly within a single training loop. Experiments on the LUAD-HistoSeg and BCSS datasets demonstrate that SSHR outperforms state-of-the-art multi-stage methods. Furthermore, SSHR reduces training duration by 2 to 5 times. This efficiency minimizes computational overhead and accelerates clinical translation for large-scale histopathology workflows. The code is available at: https://github.com/trongduc-nguyen/SSHR

2606.20246 2026-06-19 cs.RO cs.AI 新提交

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

微调视觉-语言-动作模型所需的层数比你想象的少

Gia-Binh Nguyen, Trong-Bao Ho, Thien-Loc Ha, Khoa Vo, Philip Lund Møller, Quang T. Nguyen, Long Dinh, Tuan Dam, Vu Duong, Tung M. Luu, Trung Le, Tran Nguyen Le, Minh Vu, An Thai Le, Ngan Le, Daniel Sonntag, James Zou, Jan Peters, Duy M. H. Nguyen, Ngo Anh Vien

发表机构 * Center for AI Research, VinUniversity(VinUniversity人工智能研究中心) VinRobotics University of Arkansas(阿肯色大学) Technical University of Denmark(丹麦技术大学) Hanoi University of Science and Technology(河内科技大学) KAIST(韩国科学技术院) Monash University(莫纳什大学) Oldenburg University(奥尔登堡大学) DFKI(德国人工智能研究中心) University of Stuttgart(斯图加特大学) IMPRS-IS(国际马克斯·普朗克智能系统研究学院) Stanford University(斯坦福大学) Technische Universität Darmstadt(达姆施塔特工业大学)

AI总结 本文发现VLA模型存在层间表示冗余,提出无需训练的压缩方法,通过去除冗余层将模型深度减少50%,实现40-50%训练加速和30%推理加速,性能不变。

详情
AI中文摘要

在大规模视频-机器人数据集上预训练的视觉-语言-动作(VLA)模型彻底改变了机器人操作,但其数十亿参数架构在下游微调和实时推理过程中带来了巨大的计算负担。在这项工作中,我们揭示了这些连续控制基础策略(例如pi_0、GR00T-N1.5)的一个高度非平凡的结构特性:尽管在多样化的物理轨迹上训练,它们表现出严重的逐层表示冗余。为了利用这一点,我们引入了一个完全无需训练的结构压缩流程,避免了现有方法需要加载全尺寸模型来学习优化的令牌缩减或动态层选择器的需求。相反,仅通过使用中心核对齐的单次前向传递来识别冗余层特征,我们移除孪生层以永久压缩模型深度高达50%,涵盖VLM主干和连续控制策略头。这种精简架构的下游微调带来了双重加速效益:训练时间减少40-50%,实时推理速度提升高达30%,同时匹配或超越全尺寸基模型性能。我们在三个模拟基准(LIBERO、RoboCasa、SimplerEnv)和10个跨4种不同机器人实体的多样化真实世界操作任务上全面验证了我们的方法。这些结果证明,先进的VLA所需的层数远少于先前假设,为可扩展的机器人学习提供了一种高度计算高效的范式。

英文摘要

Vision-Language-Action (VLA) models pre-trained on massive video-robot datasets have revolutionized robotic manipulation, yet their multi-billion parameter architectures impose prohibitive computational burdens during downstream fine-tuning and real-time inference. In this work, we reveal a highly non-trivial architectural characteristic of these continuous control foundation policies (e.g., pi_0, GR00T-N1.5): despite being trained on diverse physical trajectories, they exhibit severe layer-wise representational redundancy. To exploit this, we introduce a structural compression pipeline that is entirely training-free, bypassing the need of existing methods to load full-scale models to learn optimized token reductions or dynamic layer selectors. Instead, using only a single forward pass via Centered Kernel Alignment to identify redundant layer features, we remove twin layers to permanently compress the model depth by up to 50% across both the VLM backbone and the continuous control policy head. Downstream fine-tuning of this streamlined architecture yields a dual acceleration benefit: a 40-50% reduction in training time and up to 30% faster real-time inference, while matching or exceeding full-scale base model performance. We comprehensively validate our method across three simulation benchmarks (LIBERO, RoboCasa, SimplerEnv) and 10 diverse real-world manipulation tasks across 4 unique robotic embodiments. These results prove that advanced VLAs require significantly fewer layers than previously assumed, offering a highly compute-efficient paradigm for scalable robot learning.
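
Centered Kernel Alignment, the similarity measure named in this abstract, can be sketched in its standard linear form. The abstract does not give the rule for deciding which layers count as twins, so the adjacent-layer comparison and the `threshold` value in `redundant_pairs` are assumptions for illustration; the function names are hypothetical.

```python
def center(X):
    """Subtract each column's mean; X is a list of rows (samples x features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y, computed column pair by column pair."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    Xc, Yc = center(X), center(Y)
    num = cross_gram_sq_norm(Xc, Yc)
    den = (cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc)) ** 0.5
    return num / den

def redundant_pairs(layer_acts, threshold=0.95):
    """Indices i where layers i and i+1 look near-identical under CKA."""
    return [i for i in range(len(layer_acts) - 1)
            if linear_cka(layer_acts[i], layer_acts[i + 1]) >= threshold]

a = [[1.0], [2.0], [3.0], [4.0]]            # activations of a toy layer
a_scaled = [[2.0 * row[0]] for row in a]    # same representation, rescaled
b = [[1.0], [-1.0], [-1.0], [1.0]]          # unrelated representation
print(redundant_pairs([a, a_scaled, b]))
```

Activations for every layer can be collected on the same batch in one forward pass, which matches the single-pass, training-free setting described above.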

2606.20245 2026-06-19 cs.AI 新提交

Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

导航不可靠的参数化与上下文知识:面向LLM推理的显式知识冲突解决

Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao

发表机构 * National Key Laboratory of Big Data and Decision, National University of Defense Technology(国防科技大学大数据与决策国家重点实验室)

AI总结 提出MACR框架,通过自适应知识评估与多智能体推理,显式解决大语言模型内部参数知识与外部上下文之间的冲突,超越传统二元选择范式。

Comments 12 pages, 3 figures

详情
AI中文摘要

大型语言模型(LLM)通过利用广泛的参数化知识和上下文学习能力,在多种基于语言的任务中取得了强劲性能,使其能够整合输入提示中提供的外部信息。然而,外部知识的整合可能引入冲突,不仅存在于模型内部参数知识与外部信息之间,也存在于多个外部上下文之间。现有方法通常假设模型或提供的上下文是可靠的,忽视了两种来源都可能包含错误的情况,并通过优先考虑某一来源而非另一来源来避免冲突,而非主动解决不一致性。为解决这些局限,我们提出了一种新颖的LLM知识冲突解决框架MACR,该框架超越了传统的二元选择范式,并基于多智能体推理方法引入了显式的冲突解决机制。具体而言,我们首先提出一种自适应知识评估与检索方法,采用改进的语义熵度量来量化LLM对给定查询答案的置信度。基于此置信度估计,MACR要么将模型的内部知识外化为文本表示,要么在内部知识不足时检索相关外部知识,为后续推理生成基本上下文。然后,我们引入一个归纳式多智能体推理框架,包含三个专门智能体,分别用于归纳显式规则、分析潜在冲突以及解决所有可用上下文中的不一致性。实验结果表明,MACR在多个基准测试中显著优于最先进的基线方法,同时提供了可解释的显式冲突解决方案。

英文摘要

Large language models (LLMs) have achieved strong performance across a wide range of language-based tasks by leveraging both extensive parametric knowledge and in-context learning ability, enabling them to incorporate external information provided in the input prompt. However, the integration of external knowledge can introduce conflicts, not only between the model's internal parametric knowledge and the external information, but also among multiple pieces of external context. Existing approaches typically assume that either the model or the provided context is reliable, overlooking the possibility that both sources may contain errors, and avoid conflicts by privileging one source over the other, rather than actively resolving inconsistencies. To address these limitations, we propose MACR, a novel framework for LLM knowledge conflict resolution that moves beyond the conventional binary choice paradigm and incorporates an explicit conflict-resolution mechanism based on a multi-agent reasoning approach. Specifically, we first propose an adaptive knowledge assessment and retrieval approach that employs a modified semantic entropy measure to quantify an LLM's confidence in its answer to a given query. Based on this confidence estimation, MACR either externalizes the model's internal knowledge as textual representations or retrieves relevant external knowledge when internal knowledge is insufficient, generating basic contexts for subsequent reasoning. Then we introduce an inductive multi-agent reasoning framework with three specialized agents that, respectively, induce explicit rules, analyze potential conflicts, and resolve inconsistencies across all available contexts. Empirical results demonstrate that MACR significantly outperforms state-of-the-art baselines across benchmarks, while also providing interpretable resolutions of explicit conflicts.
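
The confidence-gated choice between internal and external knowledge can be sketched with plain semantic entropy over sampled answers. The abstract uses a modified measure whose details are not given, so this is the unmodified idea under simplifying assumptions: answers are grouped by a normalized string instead of by semantic equivalence, and `choose_source` with its `threshold` is a hypothetical gate.

```python
import math
from collections import Counter

def semantic_entropy(answers, canonical=lambda s: s.strip().lower()):
    """Entropy (in nats) over meaning clusters of sampled answers.

    `canonical` maps an answer to a cluster key. String normalization is a
    crude stand-in for a real semantic-equivalence check.
    """
    clusters = Counter(canonical(a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

def choose_source(answers, threshold=0.5):
    """Low entropy: trust and externalize internal knowledge. High: retrieve."""
    if semantic_entropy(answers) <= threshold:
        return "externalize_internal"
    return "retrieve_external"

print(choose_source(["Paris", "paris ", " PARIS"]))   # answers agree
print(choose_source(["Paris", "Lyon", "Marseille"]))  # answers disagree
```

When sampled answers collapse into one cluster the entropy is zero, and it grows as the samples spread over more distinct meanings.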

2606.20244 2026-06-19 cs.CV cs.AI 新提交

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E:基于视觉聚光灯的冻结VLM测试时熵整形

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

发表机构 * National University of Singapore(新加坡国立大学) Fudan University(复旦大学) Technical University of Munich(慕尼黑工业大学) Sagenic Tech Zhejiang University(浙江大学) vivo Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出SPOT-E方法,通过测试时熵整形和视觉聚光灯,解决VLM在证据密集型任务中因忽视局部关键证据而表现不佳的问题,无需重新训练即可提升定位与鲁棒性。

详情
AI中文摘要

视觉语言模型(VLM)在证据密集型任务中通常表现不佳,因为决定性视觉证据往往微小、局部且容易被忽略,导致即使高层推理完好,证据读取也会失败。先前的推理时视觉干预可以在不重新训练的情况下改善定位,但大多是开环的,缺乏验证高亮证据是否实际使用的机制。我们研究答案跨度预测熵作为模型内部反馈信号,并表明朴素熵最小化具有歧义性,因为低熵可能源于证据支持的置信度或捷径坍塌。为解决这一歧义,我们引入低熵锚点和熵整形目标,在减少答案不确定性的同时保留基线高置信度标记。我们将这一原理实例化为SPOT-E,一种即插即用的测试时方法,生成问题条件聚光灯,并通过基于组相对策略优化(GRPO)的轻量级调优对每个实例进行优化。在所有基准测试和不同VLM家族中,SPOT-E在视觉损坏下均取得一致增益和改进的鲁棒性。代码公开于:https://github.com/YinBo0927/SPOT-E

英文摘要

Vision-language models (VLMs) often underperform on evidence-intensive tasks because decisive visual evidence is small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via lightweight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: https://github.com/YinBo0927/SPOT-E
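
The feedback signal studied here, prediction entropy over the answer span, can be written down in a few lines. This sketch assumes the span entropy is the mean of per-token Shannon entropies; the abstract does not specify the aggregation, and the function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_span_entropy(token_dists):
    """Mean per-token entropy over the answer span (assumed aggregation)."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)

# One fully confident token and one maximally uncertain binary token.
span = [[1.0, 0.0], [0.5, 0.5]]
print(answer_span_entropy(span))
```

A low value alone does not show that the answer is grounded in the image, which is the ambiguity the abstract attributes to naive entropy minimization.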

2606.20243 2026-06-19 cs.SE cs.MA 新提交

Phoenix: Safe GitHub Issue Resolution via Multi-Agent LLMs

Phoenix: 通过多智能体LLM实现安全的GitHub问题解决

Kipngeno Koech, Muhammad Adam, Baimam Boukar Jean Jacques, Joao Barros

AI总结 提出多智能体LLM系统Phoenix,通过六个专业智能体和七层安全控制,在SWE-bench Lite子集上达到75%的解决率,并在真实问题中保持100%正确性。

详情
AI中文摘要

我们提出Phoenix,一个多智能体LLM系统,能够从分类到拉取请求创建解决GitHub问题,结合了七层安全控制与基线感知测试评估策略。Phoenix将工作分解给六个专业智能体:规划器、复现器、编码器、测试器、故障分析器和拉取请求(PR)智能体,所有智能体由基于标签的GitHub webhook状态机协调。在打开拉取请求之前,每次更改都会与基线测试运行进行对比。在SWE-bench Lite的24个实例子集上,在生产webhook路径上运行,Phoenix oracle解决了75%的实例,且成功运行中没有出现通过到通过的回归;这个精心挑选的子集不能直接与完整分割排行榜结果比较,我们讨论了比较的局限性。在14个仓库的42个真实问题上的补充试点实现了100%的正确性保持(CP;硬级别平均122秒)。人工检查显示,大约一半的拉取请求是定位良好的修复。另一半将代码放置在错误路径上,这是规划器定位的局限性,我们正在通过检索来解决。我们还报告了部署失败模式(WAF过滤、令牌过期、权限边界、不稳定的CI),这些模式促使了每种安全机制的引入。

英文摘要

We present Phoenix, a multi-agent LLM system that resolves GitHub issues from triage through pull-request creation, combining seven layered safety controls with a baseline-aware test evaluation strategy. Phoenix decomposes the work across six specialized agents: planner, reproducer, coder, tester, failure analyst, and pull request (PR) agent, all coordinated by a label-based GitHub webhook state machine. Every change is checked against a baseline test run before a pull request is opened. On a 24-instance slice of SWE-bench Lite, run on the production webhook path, Phoenix oracle-resolves 75% of instances with no pass-to-pass regressions on successful runs; this curated slice is not directly comparable to full-split leaderboard results, and we discuss the limits of the comparison. A complementary pilot on 42 real issues across 14 repositories yields 100% correctness preservation (CP; mean 122s on the hard tier). Manual inspection shows that about half of the resulting pull requests are well-targeted fixes. The other half place code at incorrect paths, a planner localization limitation we are addressing with retrieval. We also report the deployment failure modes (WAF filtering, token expiry, permission boundaries, flaky CI) that motivated each safety mechanism.
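
The baseline-aware check described above can be sketched as a comparison of two test runs. The data shape (test name mapped to pass/fail) and the names `compare_to_baseline` and `safe_to_open_pr` are assumptions for illustration, not Phoenix's actual interface.

```python
def compare_to_baseline(baseline, candidate):
    """Compare test outcomes before and after a change.

    baseline, candidate: dicts mapping test name -> True (pass) / False (fail).
    A test missing from the candidate run is treated as failing.
    """
    regressions = sorted(
        t for t, ok in baseline.items() if ok and not candidate.get(t, False)
    )
    fixed = sorted(
        t for t, ok in candidate.items() if ok and not baseline.get(t, False)
    )
    preexisting = sorted(
        t for t, ok in baseline.items() if not ok and not candidate.get(t, False)
    )
    return {
        "regressions": regressions,          # pass -> fail: blocks the change
        "fixed": fixed,                      # fail -> pass: the intended effect
        "preexisting_failures": preexisting, # failing before and after
    }

def safe_to_open_pr(baseline, candidate):
    """Allow a pull request only if no previously passing test now fails."""
    return not compare_to_baseline(baseline, candidate)["regressions"]

baseline = {"test_a": True, "test_b": False, "test_c": False}
candidate = {"test_a": True, "test_b": True, "test_c": False}
print(compare_to_baseline(baseline, candidate))
print(safe_to_open_pr(baseline, candidate))
```

Separating pre-existing failures from regressions means a repository with an already-red test suite does not block every change, while any pass-to-pass regression still does.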

2606.20241 2026-06-19 cs.CV New submission

BAFIS: Dataset + Framework to assess occupational Bias and Human Preference in modern Text-to-image Models

BAFIS:评估现代文本到图像模型中的职业偏见与人类偏好的数据集与框架

Thomas Klassert, Adrian Ulges, Biying Fu

Affiliations * RheinMain University of Applied Sciences

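AI summary Presents the BAFIS platform and a dataset of 21,140 images generated from multilingual prompts, evaluates gender and ethnicity bias in occupation-related generation across five text-to-image models, combines this with human preference feedback, and finds systematic biases, underscoring the need to incorporate human preferences.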

Comments Accepted at the IEEE Winter Conference on Applications of Computer Vision, WACV 2026

Details
AI Chinese abstract

生成式人工智能有潜力提高生产力并改变创意内容的制作。然而,现有研究表明图像生成模型受到偏见的显著影响。本文研究了文本到图像模型在职业相关图像生成中存在的固有偏见和语言诱导偏见,并通过人类偏好反馈补充了现有指标。我们对五种当前文本到图像模型进行了全面评估:Midjourney v6.1、Stable Diffusion 3 Medium、DALL-E 3、Playground v2.5和FLUX.1-dev,重点关注性别和种族偏见、图像质量以及提示对齐。为促进这一评估,我们开发了“公平图像合成竞技场”(BAFIS),一个旨在收集生成图像中偏见的人类反馈的平台。此外,我们创建了一个包含21,140张使用多语言提示生成的合成图像的数据集,作为我们分析的基础。我们进一步将结果置于更广泛的社会背景中,与德国联邦就业局的官方统计数据进行比较。我们的发现揭示了文本到图像模型中的系统性偏见,且现有评估指标与主观用户评分存在部分相关性。因此,我们的研究强调了纳入人类偏好以开发更公平、更包容的文本到图像模型的必要性。

English abstract

Generative artificial intelligence has the potential to improve productivity and transform the production of creative content. However, existing research indicates that image generation models are significantly influenced by biases. This work investigates the inherent biases and language-induced biases present in text-to-image models within the context of occupation-related image generation, complementing established metrics with human preference feedback. We present a comprehensive evaluation of five current text-to-image models: Midjourney v6.1, Stable Diffusion 3 Medium, DALL-E 3, Playground v2.5, and FLUX.1-dev, focusing on gender and ethnicity bias, image quality, and prompt alignment. To facilitate this evaluation, we developed the "Battle-Arena for Fair Image Synthesis" (BAFIS), a platform designed to collect human feedback on bias in generated images. Furthermore, we created a dataset comprising 21,140 synthetic images generated using multilingual prompts, which serves as a basis for our analysis. We further place our results within a broader social context by comparing them to official statistics from the German Federal Employment Agency. Our findings reveal systematic biases in text-to-image models, with established evaluation metrics partially correlated with subjective user ratings. Thus, our research emphasizes the need for including human preferences to develop fairer and more inclusive text-to-image models.

2606.20236 2026-06-19 cs.AI cs.LG cs.MA New submission

A Multi-Agent system for Multi-Objective constrained optimization

多目标约束优化的多智能体系统

Federica Filippini

Affiliations * University of Milano-Bicocca

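AI summary Proposes MAMO, which uses multi-agent reinforcement learning to decouple task execution from objective design, automatically learning reward weights that balance primary-objective optimization against constraint violations, improving the autonomy and robustness of RL in dynamic environments.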

Comments Presented at the 17th Workshop on Optimization and Learning in Multiagent Systems (OptLearnMAS, https://optlearnmas.github.io), co-located with the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Details
AI Chinese abstract

计算和网络系统中的许多决策问题可以自然地表述为在性能约束下的成本最小化问题。在动态环境中,强化学习(RL)通常通过在运行时将成本和约束违反通过加权惩罚项嵌入到单个标量奖励中(遵循拉格朗日启发式公式)来解决此类问题。然而,在这种背景下,学习策略的行为关键取决于这些权重的选择,而权重通常是手动选择的。这使得难以在优化主要目标和有效避免约束违反之间找到适当的权衡,特别是在非平稳环境中,它们的相对重要性可能发生变化。本文提出了MAMO(多目标约束优化的多智能体系统),一种通过多智能体RL解决这种平衡问题的方法。MAMO通过将奖励权重的选择表述为一个学习问题,将任务执行与目标设计解耦,为动态环境中约束优化问题的更自主和鲁棒的基于RL的解决方案迈出了第一步。

English abstract

Many decision-making problems in computing and networking systems can be naturally formulated as cost-minimization problems under performance constraints. In dynamic environments, reinforcement learning (RL) is often used to solve such problems at runtime by embedding both costs and constraint violations into a single scalar reward through weighted penalty terms, following a Lagrangian-inspired formulation. However, in this context the behavior of the learned policy critically depends on the choice of these weights, which are typically selected manually. This makes it difficult to identify an appropriate trade-off between optimizing the primary objective and effectively avoiding constraint violations, particularly in non-stationary environments where their relative importance may change. This paper presents MAMO (Multi-Agent system for Multi-Objective constrained optimization), an approach to tackle this balancing problem through multi-agent RL. MAMO decouples task execution from objective design by formulating the selection of reward weights as a learning problem, providing a first step towards more autonomous and robust RL-based solutions for constrained optimization problems in dynamic environments.
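
Editor's illustration: the abstract describes folding costs and constraint violations into a single scalar reward through weighted penalty terms. The sketch below shows a generic form of that idea with hand-picked weights, which is the manual setting the paper aims to replace. It is not MAMO itself; the function name `scalar_reward`, the sign convention (a violation is the positive part of `g_i`, where `g_i <= 0` means the constraint is satisfied), and the example numbers are assumptions.

```python
from typing import Sequence


def scalar_reward(
    cost: float, constraint_values: Sequence[float], weights: Sequence[float]
) -> float:
    """Lagrangian-inspired reward: negative cost minus weighted violations.

    Assumed convention: constraint i is satisfied when constraint_values[i] <= 0,
    so only the positive part max(0, g_i) is penalized.
    """
    if len(constraint_values) != len(weights):
        raise ValueError("need exactly one weight per constraint")
    penalty = sum(w * max(0.0, g) for w, g in zip(weights, constraint_values))
    return -cost - penalty


# Same state, two manual weight choices: the first constraint is violated
# by 0.5, the second is satisfied with slack (-1.0) and adds no penalty.
strict = scalar_reward(2.0, [0.5, -1.0], [4.0, 10.0])
lenient = scalar_reward(2.0, [0.5, -1.0], [1.0, 10.0])
print(strict, lenient)  # -4.0 -2.5
```

The two calls differ only in the weight on the violated constraint, yet they rank the same outcome differently, which illustrates why the learned policy is sensitive to how these weights are chosen.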