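arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.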

2602.01322 2026-05-26 cs.LG cs.CL

PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

PolySAE: 通过多项式解码建模稀疏自编码器中的特征交互

Panagiotis Koromilas, Andreas D. Demou, James Oldfield, Yannis Panagakis, Mihalis Nicolaou

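AI summary: Proposes PolySAE, which adds higher-order terms to the sparse autoencoder decoder to model feature interactions. Low-rank tensor factorization on a shared projection subspace captures pairwise and triple feature interactions, improving probing F1 by about 8% while preserving interpretability and yielding compositional structure that is largely independent of co-occurrence frequency.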

Comments 43rd International Conference on Machine Learning (ICML 2026); Code: https://github.com/pakoromilas/PolySAE

AI中文摘要

稀疏自编码器（SAE）通过将激活分解为字典原子的稀疏组合来解释神经网络表示。然而，SAE假设特征通过线性重建相加组合，这种假设无法捕捉组合结构：线性模型无法区分“Starbucks”是由“star”和“coffee”特征的组合还是仅由它们的共现产生。这迫使SAE为复合概念分配整体特征，而不是将其分解为可解释的组成部分。我们引入了PolySAE，它通过高阶项扩展SAE解码器以建模特征交互，同时保留对可解释性至关重要的线性编码器。通过在共享投影子空间上进行低秩张量分解，PolySAE以较小的参数开销（GPT2上为3%）捕获成对和三元特征交互。在四个语言模型和三个SAE变体上，PolySAE在保持可比重建误差的同时，探测F1平均提升约8%，并产生类别条件特征分布之间2-10倍更大的Wasserstein距离。关键的是，学习到的交互权重与共现频率的相关性可忽略不计（r = 0.06，而SAE特征协方差为r = 0.82），表明多项式项捕获了很大程度上独立于表面统计的组合结构。最后，学习到的交互方向因果性地将模型输出引导向相应的组合语义。

英文摘要

Sparse autoencoders (SAEs) interpret neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether "Starbucks" arises from the composition of "star" and "coffee" features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of ~8% in probing F1 while maintaining comparable reconstruction error, and produces 2-10× larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency (r = 0.06 vs r = 0.82 for SAE feature covariance), suggesting that polynomial terms capture compositional structure largely independent of surface statistics. Finally, the learned interaction directions causally steer model outputs toward the corresponding compositional semantics.
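To make the idea of a polynomial decoder concrete, the following is a minimal NumPy sketch of one way a linear decoder could be extended with pairwise and triple terms through a shared low-rank projection. It is an illustration under stated assumptions, not the parameterization used in the PolySAE code: the names `U`, `C2`, and `C3`, and the choice of element-wise powers of the projected code, are hypothetical.

```python
import numpy as np

def polysae_decode(z, D, U, C2, C3, b):
    """Decode a sparse code with linear, pairwise, and triple terms.

    All parameter names are illustrative assumptions:
    z  : (n_features,) sparse code produced by an ordinary linear encoder
    D  : (d_model, n_features) linear dictionary
    U  : (n_features, rank) shared projection onto a low-rank subspace
    C2 : (d_model, rank) output map for second-order interactions
    C3 : (d_model, rank) output map for third-order interactions
    b  : (d_model,) decoder bias
    """
    u = z @ U                    # project the code onto the shared subspace
    linear = D @ z               # standard SAE reconstruction term
    pair = C2 @ (u * u)          # sum_r C2[:, r] * (sum_i U[i, r] z_i)^2
    triple = C3 @ (u * u * u)    # same factorization, third order
    return linear + pair + triple + b
```

In this sketch the pairwise term equals a full third-order weight tensor `T[k, i, j] = sum_r C2[k, r] U[i, r] U[j, r]` contracted with `z` twice, so every pair of features can interact without storing `d_model × n_features²` weights.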

2602.01183 2026-05-26 cs.CV cs.LG

Refining Context-Entangled Content Segmentation via Curriculum Selection and Anti-Curriculum Promotion

通过课程选择与反课程促进优化上下文纠缠内容分割

Chunming He, Rihan Zhang, Fengyang Xiao, Dingming Zhang, Zhiwen Cao, Sina Farsiu

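AI summary: Proposes CurriSeg, a dual-phase learning framework that combines curriculum and anti-curriculum learning principles, using dynamic data selection and Spectral-Blindness Fine-Tuning to improve the robustness and generalization of context-entangled content segmentation.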

Comments ICML 2026, 8 figures, 11 tables

AI中文摘要

生物学习从简单到困难的任务逐步进行，逐渐增强感知和鲁棒性。受此原理启发，我们解决上下文纠缠内容分割（CECS）这一具有挑战性的场景，其中对象与周围环境共享内在视觉模式，如伪装目标检测。传统分割网络主要依赖架构增强，但往往忽略了在纠缠数据分布下控制鲁棒性的学习动态。我们引入CurriSeg，一个双阶段学习框架，统一了课程和反课程原则以提高表示可靠性。在课程选择阶段，CurriSeg基于样本损失的时间统计动态选择训练数据，区分困难但有信息的样本与噪声或模糊样本，从而实现稳定的能力增强。在反课程促进阶段，我们设计了频谱盲性微调，抑制高频成分以强制依赖低频结构和上下文线索，从而增强泛化能力。大量实验表明，CurriSeg在多种CECS基准上取得了一致的改进，无需增加参数或增加总训练时间，为进展与挑战如何相互作用以促进鲁棒且上下文感知的分割提供了原则性视角。代码将发布。

英文摘要

Biological learning proceeds from easy to difficult tasks, gradually reinforcing perception and robustness. Inspired by this principle, we address Context-Entangled Content Segmentation (CECS), a challenging setting where objects share intrinsic visual patterns with their surroundings, as in camouflaged object detection. Conventional segmentation networks predominantly rely on architectural enhancements but often ignore the learning dynamics that govern robustness under entangled data distributions. We introduce CurriSeg, a dual-phase learning framework that unifies curriculum and anti-curriculum principles to improve representation reliability. In the Curriculum Selection phase, CurriSeg dynamically selects training data based on the temporal statistics of sample losses, distinguishing hard-but-informative samples from noisy or ambiguous ones, thus enabling stable capability enhancement. In the Anti-Curriculum Promotion phase, we design Spectral-Blindness Fine-Tuning, which suppresses high-frequency components to enforce dependence on low-frequency structural and contextual cues and thus strengthens generalization. Extensive experiments demonstrate that CurriSeg achieves consistent improvements across diverse CECS benchmarks without adding parameters or increasing total training time, offering a principled view of how progression and challenge interplay to foster robust and context-aware segmentation. Code will be released.
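The abstract describes Spectral-Blindness Fine-Tuning as suppressing high-frequency components of the input. A minimal sketch of that kind of suppression is an FFT low-pass filter, shown below for a single-channel image. The square mask and the `keep_ratio` parameter are assumptions made for illustration; the paper's actual filter and schedule are not given in the abstract.

```python
import numpy as np

def low_pass(image, keep_ratio=0.25):
    """Zero every frequency outside a centered square of the 2-D spectrum.

    image      : (H, W) real-valued array
    keep_ratio : hypothetical knob, fraction of each axis kept around DC
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))   # move DC to the center
    rh = max(1, int(h * keep_ratio / 2))
    rw = max(1, int(w * keep_ratio / 2))
    ch, cw = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[ch - rh:ch + rh + 1, cw - rw:cw + rw + 1] = True
    filtered = np.where(mask, spectrum, 0)           # drop high frequencies
    return np.fft.ifft2(np.fft.ifftshift(filtered)).real
```

A constant image passes through unchanged, while a pixel-level checkerboard (the highest representable frequency) is removed entirely, which is the behavior one would want if the model is to rely on coarse structure rather than fine texture.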

2601.22466 2026-05-26 cs.LG

EvoEGF-Mol: Evolving Exponential Geodesic Flow for Structure-based Drug Design

EvoEGF-Mol：用于基于结构的药物设计的演化指数测地流

Yaowei Jin, Junjie Wang, Cheng Cao, Penglei Wang, Duo An, Qian Shi

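AI summary: To address the mismatch between Euclidean and probabilistic spaces in structure-based drug design, proposes EvoEGF-Mol, which represents molecules in a unified way through composite exponential-family distributions and an evolving exponential geodesic flow, achieving high geometric precision and interaction fidelity.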

Comments Accepted to ICML 2026

AI中文摘要

基于结构的药物设计（SBDD）旨在发现生物活性配体。传统方法在欧几里得空间和概率空间中分别构建连续原子坐标和离散化学类别的概率路径，导致与底层统计流形不匹配。我们通过使用复合指数族分布来表示分子来解决这个问题，其中坐标和类别在统一的自然参数空间中表示，并在Fisher-Rao度量下沿指数测地线同步演化。为了避免直接针对狄拉克分布的测地线导致的瞬时轨迹崩溃，我们提出了用于SBDD的演化指数测地流（EvoEGF-Mol），该方法用动态集中的分布替代静态狄拉克目标，并通过渐进参数细化架构进行训练。我们的模型在CrossDock上达到了参考级别的PoseBusters通过率（93.4%），展示了卓越的几何精度和相互作用保真度，同时在真实世界的MolGenBench任务中，在生物活性骨架恢复方面取得了优于基线方法的性能。代码可在https://github.com/BLEACH366/EvoEGF-Mol获取。

英文摘要

Structure-Based Drug Design (SBDD) aims to discover bioactive ligands. Conventional approaches construct probability paths separately in Euclidean and probabilistic spaces for continuous atomic coordinates and discrete chemical categories, leading to a mismatch with the underlying statistical manifolds. We address this issue by representing molecules using composite exponential-family distributions, where coordinates and categories are represented within a unified natural parameter space to evolve synchronously along exponential geodesics under the Fisher-Rao metric. To avoid the instantaneous trajectory collapse induced by geodesics directly targeting Dirac distributions, we propose Evolving Exponential Geodesic Flow for SBDD (EvoEGF-Mol), which replaces static Dirac targets with dynamically concentrating distributions and is trained with a progressive-parameter-refinement architecture. Our model approaches a reference-level PoseBusters passing rate (93.4%) on CrossDock, demonstrating remarkable geometric precision and interaction fidelity, while achieving superior performance over baseline methods on real-world MolGenBench tasks for bioactive scaffold recovery. Code is available at https://github.com/BLEACH366/EvoEGF-Mol.

2601.21406 2026-05-26 cs.CV cs.LG

Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

通过多表示生成增强统一多模态模型的理解能力

Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu

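AI summary: Proposes UniMRG, which strengthens the understanding capabilities of unified multimodal models through auxiliary generation of multiple representations (pixel, depth, and segmentation), reducing hallucinations and improving spatial understanding.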

Comments Code: https://github.com/Sugewud/UniMRG

AI中文摘要

统一多模态模型（UMMs）在单一框架内整合了视觉理解和生成。其最终目标是创建一个理解和生成相互促进的循环。虽然最近的后训练方法成功利用理解来增强生成，但利用生成来改善理解的逆向方向仍基本未被探索。在这项工作中，我们提出了UniMRG（统一多表示生成），一种简单而有效的架构无关的后训练方法。UniMRG通过引入辅助生成任务来增强UMMs的理解能力。具体来说，我们训练UMMs生成输入图像的多种内在表示，即像素（重建）、深度（几何）和分割（结构），同时进行标准的视觉理解目标。通过综合这些多样化的表示，UMMs捕获关于外观、空间关系和结构布局的互补信息。因此，UMMs对视觉输入形成了更深入和全面的理解。跨多种UMM架构的大量实验表明，我们的方法显著增强了细粒度感知，减少了幻觉，并改善了空间理解，同时提升了生成能力。

英文摘要

Unified Multimodal Models (UMMs) integrate both visual understanding and generation within a single framework. Their ultimate aspiration is to create a cycle where understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction of utilizing generation to improve understanding remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks. Specifically, we train UMMs to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure), alongside standard visual understanding objectives. By synthesizing these diverse representations, UMMs capture complementary information regarding appearance, spatial relations, and structural layout. Consequently, UMMs develop a deeper and more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.

2601.21094 2026-05-26 cs.LG cs.AI cs.SY eess.SY

Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed

安全强化学习中的分布偏移下的安全泛化：一个糖尿病测试平台

Minjae Kwon, Josephine Lamp, Lu Feng

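AI summary: Studies whether the training-time safety guarantees of safe reinforcement learning algorithms transfer to deployment under distribution shift, using diabetes management as a testbed; identifies a safety generalization gap and shows that test-time shielding effectively restores safety.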

Comments Accepted at ICML 2026. Camera-ready version

AI中文摘要

安全强化学习算法通常在固定的训练条件下进行评估。我们使用糖尿病管理作为安全关键测试平台，研究训练时的安全保证是否能在分布偏移下迁移到部署中。我们在统一的临床模拟器上对安全强化学习算法进行基准测试，并揭示了一个安全泛化差距：在训练期间满足约束的策略经常在未见过的患者身上违反安全要求。我们证明，测试时屏蔽（使用学习到的动力学模型过滤不安全动作）能有效恢复跨算法和患者群体的安全性。在八种安全强化学习算法、三种糖尿病类型和三个年龄组中，屏蔽使得PPO-Lag和CPO等强基线的血糖达标时间范围提高了13-14%，同时降低了临床风险指数和血糖变异性。我们的模拟器和基准测试为研究安全关键控制领域中分布偏移下的安全性提供了一个平台。代码可在https://github.com/safe-autonomy-lab/GlucoSim 和 https://github.com/safe-autonomy-lab/GlucoAlg 获取。

英文摘要

Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13-14% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at https://github.com/safe-autonomy-lab/GlucoSim and https://github.com/safe-autonomy-lab/GlucoAlg.
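Test-time shielding, as described above, filters a policy's proposed action through a dynamics model before it is executed. The sketch below shows the general pattern with a toy one-step glucose model. The function names, the fallback list, the linear dose response, and the 70-180 safe band are all hypothetical choices for illustration and are not taken from GlucoSim or GlucoAlg.

```python
def shield(state, proposed, fallback_actions, predict_next, is_safe):
    """Return the proposed action if its predicted outcome is safe.

    Otherwise return the first fallback whose predicted outcome is safe.
    If no candidate is predicted safe, return the last fallback, which the
    caller is assumed to have ordered so that it is the most conservative.
    """
    if is_safe(predict_next(state, proposed)):
        return proposed
    for action in fallback_actions:
        if is_safe(predict_next(state, action)):
            return action
    return fallback_actions[-1]


# Toy stand-ins for a learned dynamics model and a safety check (assumptions).
def toy_glucose_model(glucose, insulin_dose):
    return glucose - 30.0 * insulin_dose


def in_target_range(glucose):
    return 70.0 <= glucose <= 180.0


safe_dose = shield(100.0, 2.0, [1.0, 0.0], toy_glucose_model, in_target_range)
```

Here a dose of 2.0 would push the predicted glucose to 40, so the shield falls back to 1.0, whose predicted outcome of 70 sits on the edge of the toy safe band.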

2601.18597 2026-05-26 cs.CV

EFSI-DETR: Efficient Frequency-Semantic Integration for Real-Time Small Object Detection in UAV Imagery

EFSI-DETR：面向无人机图像实时小目标检测的高效频率-语义集成

Yu Xia, Chang Liu, Tianqi Xiang, Zhigang Tu

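AI summary: Proposes the EFSI-DETR framework, which uses a Dynamic Frequency-Spatial Unified Synergy Network and an Efficient Semantic Feature Concentrator to achieve state-of-the-art real-time small object detection in UAV imagery.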

AI中文摘要

由于有限的特征表示和无效的多尺度融合，无人机图像中的实时小目标检测仍然具有挑战性。现有方法未充分利用频率信息并依赖静态卷积操作，限制了获取丰富特征表示的能力，并阻碍了深层语义特征的有效利用。为解决这些问题，我们提出EFSI-DETR，一种新颖的检测框架，将高效语义特征增强与动态频率-空间引导相结合。EFSI-DETR包含两个主要组件：(1) 动态频率-空间统一协同网络（DyFusNet），联合利用频率和空间线索进行鲁棒的多尺度特征融合；(2) 高效语义特征集中器（ESFC），以最小计算成本实现深层语义提取。此外，采用细粒度特征保留（FFR）策略，在融合过程中纳入空间丰富的浅层特征，以保留对无人机图像中小目标检测至关重要的细粒度细节。在VisDrone和CODrone基准上的大量实验表明，我们的EFSI-DETR以实时效率实现了最先进的性能，在VisDrone上AP和AP_s分别提升了1.6%和5.8%，同时在单个RTX 4090 GPU上获得188 FPS的推理速度。

英文摘要

Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which constrain the capacity to obtain rich feature representations and hinder the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy is adopted to incorporate spatially rich shallow features during fusion, preserving the fine-grained details that are crucial for small object detection in UAV imagery. Extensive experiments on the VisDrone and CODrone benchmarks demonstrate that EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of 1.6% and 5.8% in AP and AP_s on VisDrone while reaching an inference speed of 188 FPS on a single RTX 4090 GPU.

2601.18135 2026-05-26 cs.CV

Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection

基于门控上下文聚合的前向一致性学习用于视频异常检测

Jiahao Lyu, Minghua Zhao, Xuewen Huang, Yifei Chen, Shuangli Du, Jing Hu, Cheng Shi, Zhiyong Lv

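AI summary: Proposes FoGA, a lightweight model that uses forward consistency learning and gated context aggregation for efficient video anomaly detection on resource-limited devices, outperforming existing methods while running at up to 155 FPS.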

Comments It has been submitted to the KBS journal

DOI: 10.1016/j.knosys.2026.116118
Journal ref: Knowledge-Based Systems 2026

AI中文摘要

作为公共安全的关键要素，视频异常检测（VAD）旨在实时监控系统中衡量各种事件与正常模式的偏差。然而，现有大多数VAD方法依赖大规模模型追求极端精度，限制了其在资源受限边缘设备上的可行性。此外，主流基于预测的VAD仅利用单帧未来预测误差检测异常，忽略了更长时域前向信息的更丰富约束。本文提出FoGA，一种轻量级VAD模型，执行基于门控上下文聚合的前向一致性学习，包含约2M参数，专为潜在边缘设备设计。具体而言，我们提出一种基于Unet的方法，对连续帧进行特征提取以生成即时预测和前向预测。然后，我们在跳跃连接中引入门控上下文聚合模块，动态融合相同空间尺度下的编码器和解码器特征。最后，模型通过新颖的前向一致性损失联合优化，并采用混合异常测量策略整合即时帧和前向帧的误差以实现更准确检测。大量实验证明了所提方法的有效性，其显著优于最先进的竞争方法，运行速度高达155 FPS。因此，我们的FoGA在性能与效率指标之间实现了出色的权衡。

英文摘要

As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a Unet-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods while running at up to 155 FPS. Hence, FoGA achieves an excellent trade-off between detection performance and efficiency.

2601.14249 2026-05-26 cs.CL

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

哪种推理轨迹能更好地教会学生推理？一个信息对齐的简单度量

Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang

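AI summary: Proposes the Rank-Surprisal Ratio (RSR), a metric that combines alignment and informativeness to assess how suitable a reasoning trajectory is for a student model; it significantly outperforms existing methods in trajectory selection and teacher selection.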

Comments Accepted to ACL 2026 (Main Conference). 31 pages. Project page: https://github.com/UmeanNever/RankSurprisalRatio

AI中文摘要

长链思维（CoT）轨迹为从教师到学生大语言模型的推理蒸馏提供了丰富的监督信号。然而，先前工作和我们的实验均表明，来自更强教师的轨迹并不一定能产生更好的学生，凸显了蒸馏中数据-学生适配性的重要性。现有方法主要通过学生似然评估适配性，倾向于选择与学生模型当前行为高度一致的轨迹，但忽略了更具信息性的轨迹。针对这一问题，我们提出Rank-Surprisal Ratio (RSR)，一个简单的度量，同时捕捉对齐性和信息性以评估推理轨迹的适用性。RSR的动机源于有效轨迹通常通过结合低绝对概率和相对高排名的token（在学生模型下）来平衡学习信号强度和行为对齐。具体而言，RSR定义为轨迹的平均token级排名与其平均负对数似然之比，计算和解释直观。在五个学生模型和来自11个不同教师的推理轨迹上，RSR与训练后推理性能强相关（平均Spearman 0.86），持续优于现有度量。我们进一步展示了其在轨迹选择和教师选择中的实际效用。

英文摘要

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.
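The abstract defines RSR as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. A minimal NumPy sketch of that definition follows. It assumes 1-based ranks, strict comparison for ties, and no clipping or normalization of ranks; the released RankSurprisalRatio code may differ in those details.

```python
import numpy as np

def rank_surprisal_ratio(logits, token_ids):
    """Average token rank divided by average negative log-likelihood.

    logits    : (T, V) student logits, one row per trajectory position
    token_ids : (T,) ids of the tokens that actually appear in the trajectory
    """
    logits = np.asarray(logits, dtype=float)
    token_ids = np.asarray(token_ids)
    # Numerically stable log-softmax over the vocabulary.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    target_lp = log_probs[np.arange(len(token_ids)), token_ids]
    # Rank 1 means no other token has a strictly higher probability.
    ranks = 1 + (log_probs > target_lp[:, None]).sum(axis=1)
    return ranks.mean() / (-target_lp.mean())
```

Read against the abstract's motivation, a trajectory whose tokens have low absolute probability yet sit near the top of the student's ranking gets a small ratio; which direction counts as better for selection should be confirmed against the paper rather than inferred from this sketch.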

2601.05613 2026-05-26 cs.LG cs.AI

PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data across Nodes

PiXTime: 一种跨节点异构数据联邦时间序列预测模型

Yiming Zhou, Jiahao Wang, Mingyue Cheng, Hao Wang, Defu Lian, Enhong Chen

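AI summary: Proposes PiXTime, a Transformer-based framework whose parameter-decoupling architecture (local personalized modules plus a globally shared backbone) handles heterogeneous time series, enabling forecasting over heterogeneous data in federated learning and reaching state-of-the-art performance on multiple benchmarks.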

AI中文摘要

虽然对分布式时间序列进行协同预测非常理想，但由于数据共享限制，直接合并局部数据集通常不可行。联邦学习提供了一种有前景的替代方案，但传统的联邦学习算法要求同构模型架构，这与去中心化节点中常见的结构差异（如时间分辨率不对齐、变量通道不匹配）不兼容。为弥合这一差距，我们引入了PiXTime，一种新颖的基于Transformer的框架，旨在原生适应并利用结构异构的时间数据。其核心采用参数解耦架构，将模型策略性地划分为局部个性化模块和全局聚合共享骨干。具体而言，节点特定的局部模块作为维度适配器，将不同长度的原始序列投影到统一表示空间。同时，全局同步的VE表将一致的类别标识注入特征空间，使共享骨干能够跨不一致的变量分布协同学习并泛化表示。在多个基准上的全面评估表明，PiXTime在异构联邦环境中实现了最先进的性能，同时在标准同构和集中式预测设置中保持强大的优势。

英文摘要

While collaborative forecasting on distributed time series is highly desirable, directly pooling localized datasets is often impractical due to data sharing constraints. Federated learning offers a promising alternative, yet conventional federated learning algorithms require homogeneous model architectures, which are incompatible with the structural discrepancies, such as unaligned temporal resolutions and mismatched variable channels, commonly observed across decentralized nodes. To bridge this gap, we introduce PiXTime, a novel Transformer-based framework designed to natively accommodate and leverage structurally heterogeneous temporal data. At its core, PiXTime adopts a parameter-decoupling architecture, strategically partitioning the model into localized personalized modules and a globally aggregated shared backbone. Specifically, node-specific local modules act as dimensional adapters, projecting raw sequences of diverse lengths into a unified representation space. Concurrently, a globally synchronized VE Table injects consistent categorical identities into the feature space, allowing the shared backbone to collaboratively learn and generalize representations across inconsistent variable distributions. Comprehensive evaluations on multiple benchmarks demonstrate that PiXTime achieves state-of-the-art performance in heterogeneous federated environments, while maintaining robust superiority in standard homogeneous and centralized forecasting settings.

2601.05483 2026-05-26 cs.AI

MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

MMUEChange：面向智能多模态城市环境变化分析的通用LLM智能体框架

Zixuan Xiao, Jun Ma, Siwei Zhang

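AI summary: Proposes MMUEChange, a multi-modal agent framework that uses a modular toolkit and a Modality Controller to flexibly integrate heterogeneous urban data and align it across modalities; across three urban case studies it improves task success rate by 46.7% and effectively mitigates hallucination.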

DOI: 10.1016/j.asoc.2026.114576
Journal ref: Applied Soft Computing 190 (2026) 114576

AI中文摘要

理解城市环境变化对于可持续发展至关重要。然而，当前方法，特别是遥感变化检测，通常依赖于刚性的单模态分析。为克服这些限制，我们提出MMUEChange，一个多模态智能体框架，通过模块化工具包和核心模块——模态控制器实现跨模态和模态内对齐，灵活集成异构城市数据，从而支持对复杂城市变化场景的稳健分析。案例研究包括：纽约向小型社区公园的转变，反映了当地的绿地建设努力；香港各区集中水污染的扩散，指向协调的水管理；深圳露天垃圾场的显著减少，以及夜间经济活动与垃圾类型之间的对比关联，表明生活垃圾和建筑垃圾背后不同的城市压力。与性能最佳的基线相比，MMUEChange智能体在任务成功率上提升了46.7%，并有效缓解了幻觉，展示了其支持具有实际政策影响的复杂城市变化分析任务的能力。

英文摘要

Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single-modal analysis. To overcome these limitations, we propose MMUEChange, a multi-modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, Modality Controller for cross- and intra-modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community-focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real-world policy implications.

2601.03790 2026-05-26 cs.CL cs.AI

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

NeoAMT: 基于强化学习的新词感知智能机器翻译

Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka

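AI summary: Proposes the NeoAMT framework, which trains a translation agent with a Wiktionary-based search tool and reinforcement learning to improve the translation quality of source sentences containing neologisms.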

Comments ACL 2026 Main. Fixed minor typos

AI中文摘要

新词感知机器翻译旨在将包含新词的源句翻译成目标语言。与通用机器翻译相比，该领域仍未被充分探索。本文提出一个智能体框架NeoAMT，用于新词感知机器翻译，配备基于Wiktionary的搜索工具。具体而言，我们首先构建了一个专门用于新词感知机器翻译的数据集，并建立了一个基于Wiktionary的搜索工具。该数据集涵盖16种语言和75个翻译方向，源自约1000万条英文Wiktionary转储记录。搜索工具的检索语料库也来自同一转储中约300万条清洗后的记录。然后，我们利用该数据集和工具，通过强化学习训练翻译智能体，并评估新词感知机器翻译的准确性。此外，我们提出了一个强化学习训练框架，具有新颖的奖励设计和自适应展开生成策略，利用翻译难度进一步提高使用我们搜索工具的翻译智能体的翻译质量。

英文摘要

Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation equipped with a Wiktionary-based search toolkit. Specifically, we first construct a dedicated dataset for neologism-aware machine translation and build a search toolkit grounded in Wiktionary. The dataset covers 16 languages and 75 translation directions in total, derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search toolkit is also constructed from around 3 million cleaned records of the same dump. We then leverage the dataset and toolkit to train a translation agent via reinforcement learning (RL) and to evaluate the accuracy of neologism-aware machine translation. Furthermore, we propose an RL training framework featuring a novel reward design and an adaptive rollout generation strategy that exploits translation difficulty to further improve the translation quality of translation agents using our search toolkit.

2601.03624 2026-05-26 cs.AI

Architecting Agentic Communities using Design Patterns

使用设计模式构建智能体社区

Zoran Milosevic, Fethi Rabhi

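AI summary: Proposes a three-tier classification (LLM Agents, Agentic AI, Agentic Communities) built on design patterns from enterprise distributed systems, and validates its formal framework with a clinical trial matching case study, offering practical guidance and formal verification capabilities for engineering multi-agent ecosystems.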

Comments Supplementary material accompanying this paper is also attached; its title is "Complete Agentic AI Design Patterns Catalogue". Fixed encoding artefacts (garbled em dashes) throughout

AI中文摘要

大型语言模型（LLM）及后续智能体AI技术的快速发展需要系统化的架构指导，以构建复杂的生产级系统。本文提出了一种使用源自企业分布式系统标准、形式化方法和行业实践的设计模式来架构此类系统的方法。我们将这些模式分为三层：LLM智能体（任务特定自动化）、智能体AI（自适应目标寻求者）和智能体社区（AI智能体与人类参与者通过正式角色、协议和治理结构进行协调的组织框架）。我们重点关注智能体社区——涵盖LLM智能体、智能体AI实体和人类的协调框架——这最适用于企业和工业应用。借鉴分布式系统中成熟的协调原则，我们将这些模式置于一个形式化框架中，该框架规定了协作协议，其中AI智能体和人类在受治理的生态系统中扮演角色。这种方法既提供了实践指导，也提供了形式化验证能力，通过问责机制表达组织、法律和伦理规则，确保智能体间通信、协商和意图建模的可操作且可验证的治理。我们通过一个临床试验匹配案例研究验证了该框架。我们的目标是为从业者提供可操作的指导，同时保持动态多智能体生态系统中企业部署所必需的形式化严谨性。

英文摘要

The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.

2601.03327 2026-05-26 cs.LG cs.AI

Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme

极端值森林火灾预测：序数方案中损失函数的研究

Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes

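AI summary: Proposes the first ordinal classification framework for predicting wildfire severity levels and studies how loss-function design affects the prediction of extreme events, finding that the Weighted Kappa Loss improves IoU on the extreme classes by more than 0.1.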

Comments Following external reviews, we identified major methodological issues in the manuscript, including insufficient justification of the ordinal clustering strategy, limited statistical validation, ambiguities in dataset splitting, and missing comparisons with standard ordinal approaches. We therefore request withdrawal in order to prepare a substantially revised version

AI中文摘要

野火在空间和严重程度上是高度不平衡的自然灾害，使得极端事件的预测特别具有挑战性。在这项工作中，我们引入了第一个序数分类框架，用于预测与法国操作决策直接对齐的野火严重等级。我们的研究调查了损失函数设计对神经模型预测罕见但关键的高严重火灾发生能力的影响。我们将标准交叉熵与几种序数感知目标进行比较，包括提出的基于截断离散指数广义帕累托分布的概率TDeGPD损失。通过对多种架构和真实操作数据的广泛基准测试，我们表明序数监督显著提高了模型相对于传统方法的性能。特别是，加权卡帕损失（WKLoss）取得了最佳整体结果，在最极端严重类别上IoU（交并比）增益超过0.1，同时保持了有竞争力的校准质量。然而，由于数据集中极端事件极低的代表性，对于最罕见事件的性能仍然有限。这些发现强调了将严重性排序、数据不平衡考虑和季节性风险整合到野火预测系统中的重要性。未来的工作将集中于将季节动态和不确定性信息纳入训练，以进一步提高极端事件预测的可靠性。

英文摘要

Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challenging. In this work, we introduce the first ordinal classification framework for forecasting wildfire severity levels directly aligned with operational decision-making in France. Our study investigates the influence of loss-function design on the ability of neural models to predict rare yet critical high-severity fire occurrences. We compare standard cross-entropy with several ordinal-aware objectives, including the proposed probabilistic TDeGPD loss derived from a truncated discrete exponentiated Generalized Pareto Distribution. Through extensive benchmarking over multiple architectures and real operational data, we show that ordinal supervision substantially improves model performance over conventional approaches. In particular, the Weighted Kappa Loss (WKLoss) achieves the best overall results, with a gain of more than 0.1 IoU (Intersection over Union) on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset. These findings highlight the importance of integrating severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information into training to further improve the reliability of extreme-event prediction.

2601.02144 2026-05-26 cs.CL cs.AI

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

类比路由：用于混合专家模型的kNN增强专家分配

Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

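AI summary: Proposes the kNN-MoE framework, which augments MoE routing by retrieving the locally optimal expert assignments of similar past cases and uses the average similarity of the retrieved neighbors as a confidence-based mixing coefficient, improving robustness under distribution shift.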
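The kNN-MoE summary describes mixing the router's expert distribution with assignments retrieved from similar past cases, weighted by the neighbors' average similarity. The sketch below shows one plausible reading of that mechanism. The cosine similarity, the clipping of the coefficient to [0, 1], the unweighted neighbor average, and every name in the code are assumptions; only the summary was available here, not the paper's exact formulation.

```python
import numpy as np

def knn_augmented_routing(query, router_probs, memory_keys, memory_assignments, k=4):
    """Blend router probabilities with retrieved expert assignments.

    query              : (d,) hidden state being routed
    router_probs       : (n_experts,) distribution from the parametric router
    memory_keys        : (m, d) stored hidden states of past cases
    memory_assignments : (m, n_experts) expert distributions stored per case
    """
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q                              # cosine similarity to each case
    top = np.argsort(-sims)[:k]                  # indices of the k nearest cases
    lam = float(np.clip(sims[top].mean(), 0.0, 1.0))  # confidence from similarity
    retrieved = memory_assignments[top].mean(axis=0)
    return (1.0 - lam) * router_probs + lam * retrieved
```

With this choice the retrieved assignment dominates when the memory holds near-duplicates of the query, and the output falls back to the ordinary router when nothing similar has been seen, which matches the robustness motivation in the summary.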

2601.00553 2026-05-26 cs.CV cs.AI

A Comprehensive Dataset for Human vs. AI Generated Image Detection

人类与AI生成图像检测的综合数据集

Rajarshi Roy, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Gaytri Jena, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

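AI summary: To support AI-generated image detection, builds MS COCOAI, a dataset of 96,000 real and synthetic datapoints, and proposes two tasks: classifying images as real or generated, and identifying the generating model.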

AI中文摘要

像Stable Diffusion、DALL-E和MidJourney这样的多模态生成式AI系统从根本上改变了合成图像的创建方式。这些工具推动了创新，但也促进了误导性内容、虚假信息和被操纵媒体的传播。随着生成的图像越来越难以与照片区分，检测它们已成为当务之急。为了应对这一挑战，我们发布了MS COCOAI，这是一个用于AI生成图像检测的新数据集，包含96000个真实和合成数据点，基于MS COCO数据集构建。为了生成合成图像，我们使用了五个生成器：Stable Diffusion 3、Stable Diffusion 2.1、SDXL、DALL-E 3和MidJourney v6。基于该数据集，我们提出了两个任务：（1）将图像分类为真实或生成；（2）识别哪个模型生成了给定的合成图像。该数据集可在https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset获取。

英文摘要

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

2512.24331 2026-05-26 cs.CV

Spatial-aware Vision Language Model for Autonomous Driving

面向自动驾驶的空间感知视觉语言模型

Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

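AI summary: Proposes the LVLDrive framework, which fuses LiDAR point clouds with vision-language models and uses a Gradual Fusion Q-Former and a spatial-aware question-answering dataset to address the bottleneck in 3D metric spatial reasoning, improving scene understanding and decision reliability for autonomous driving.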

Comments Accepted to CVPR AutoPilot Workshop 2026

AI中文摘要

尽管视觉语言模型（VLM）通过利用语言模型中的常识在端到端自动驾驶中展现出显著前景，但它们依赖2D图像线索进行复杂场景理解和决策，这成为安全性和可靠性的关键瓶颈。当前基于图像的方法难以进行精确的度量空间推理和几何推断，导致不可靠的驾驶策略。为弥补这一差距，我们提出LVLDrive（LiDAR-视觉-语言），一种新颖框架，通过引入LiDAR点云作为额外输入模态，专门设计用于增强现有VLM的鲁棒3D度量空间理解能力。一个关键挑战在于如何减轻不同3D数据对预训练VLM带来的灾难性干扰。为此，我们引入渐进融合Q-Former，逐步注入LiDAR特征，确保VLM现有知识库的稳定性和保留。此外，我们开发了空间感知问答（SA-QA）数据集，明确教导模型高级3D感知和推理能力。在驾驶基准上的大量实验表明，与仅视觉的对应模型相比，LVLDrive在场景理解、度量空间感知和可靠驾驶决策方面均实现了优越性能。我们的工作强调了显式3D度量数据对于构建可信赖的基于VLM的自主系统的重要性。

英文摘要

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.


2512.24075 2026-05-26 cs.LG

Evolutionary Physics-Informed Temporal Fusion for Lane-Change Intention Prediction

进化物理信息时间融合用于换道意图预测

Jiazhao Shi, Qiyang Xie, Ziyu Wang, Dongxu Zhang, Yichen Lin, Di Zhu, Chen Xie, Ziwei Wang, Haoyun Zhang, Enliang Li, Zetong Guan

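AI summary: Proposes an evolutionary physics-informed temporal fusion framework that fuses temporal descriptors derived from conventional traffic signals with temporal embeddings learned from raw trajectory sequences for three-class lane-change intention prediction, with Macro F1-scores ranging from 0.8531 to 0.9514 across the highD and exiD datasets and the 1 s, 2 s, and 3 s horizons.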


AI中文摘要

早期换道意图预测对于自动驾驶和ADAS至关重要，但由于换道行为依赖于不断变化的交通风险、周围车辆交互和目标车道可行性，而非仅瞬时车辆状态，因此仍具挑战性。本研究提出一种进化物理信息时间融合框架，用于三分类换道意图预测，包括左换道、右换道和不换道。该方法并非仅使用静态物理信息变量，而是从传统交通信号中导出时间描述符，包括风险演化、间隙持续性、反事实车道效用、交互压力梯度、机动可行性和意图一致性。这些描述符与通过序列编码器从原始轨迹序列学习的时间嵌入融合，融合表示用于最终分类。在highD和exiD数据集上，分别在1秒、2秒和3秒预测时域下进行实验。所提模型在highD上达到0.9514、0.9256和0.8872的宏F1分数，在exiD上达到0.9386、0.9070和0.8531。在exiD匝道邻近场景中改进尤为显著，表明时间物理演化在交互丰富的环境中特别有用。这些结果表明，将进化物理信息描述符与学习的时间表示相结合，为早期换道意图预测提供了更动态且可解释的解决方案。

英文摘要

Early lane-change intention prediction is essential for autonomous driving and ADAS, but it remains challenging because lane-changing behavior depends on evolving traffic risk, surrounding-vehicle interactions, and target-lane feasibility rather than only instantaneous vehicle states. This study proposes an evolutionary physics-informed temporal fusion framework for three-class lane-change intention prediction, including left lane change, right lane change, and no lane change. Instead of using static physics-informed variables alone, the proposed method derives temporal descriptors from conventional traffic signals, including risk evolution, gap persistence, counterfactual lane utility, interaction pressure gradient, maneuver feasibility, and intent consistency. These descriptors are fused with temporal embeddings learned from raw trajectory sequences through a sequence encoder, and the fused representation is used for final classification. Experiments are conducted on the highD and exiD datasets under 1\,s, 2\,s, and 3\,s prediction horizons. The proposed model achieves Macro F1-scores of 0.9514, 0.9256, and 0.8872 on highD, and 0.9386, 0.9070, and 0.8531 on exiD, respectively. The improvement is especially pronounced in exiD ramp-adjacent scenarios, indicating that temporal physical evolution is particularly useful in interaction-rich environments. These results demonstrate that combining evolutionary physics-informed descriptors with learned temporal representations provides a more dynamic and interpretable solution for early lane-change intention prediction.
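
The Macro F1-scores reported above average the per-class F1 over the three intention classes with equal weight, so a rare class such as a left lane change counts as much as the dominant no-lane-change class. A minimal Python sketch of that metric follows; the label names and the toy predictions are illustrative assumptions, not data from the paper.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


CLASSES = ["left", "right", "keep"]  # hypothetical label names
y_true = ["keep", "keep", "keep", "left", "right", "right"]
y_pred = ["keep", "keep", "left", "left", "right", "keep"]
score = macro_f1(y_true, y_pred, CLASSES)
```

Because every class contributes equally, a model that only predicts the majority class scores poorly on this metric even when its plain accuracy looks high.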


2512.23076 2026-05-26 cs.LG cs.AI cs.HC

Multimodal Functional Maximum Correlation for Emotion Recognition

多模态功能最大相关用于情感识别

Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu

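AI summary: Proposes the Multimodal Functional Maximum Correlation (MFMC) framework, which maximizes higher-order multimodal dependence through a Dual Total Correlation objective and achieves state-of-the-art or competitive performance on emotion recognition benchmarks.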

Comments manuscript accepted by IEEE Transactions on Affective Computing. Code is available at https://github.com/DY9910/MFMC


DOI: 10.1109/TAFFC.2026.3695876

AI中文摘要

情绪状态表现为中枢和自主系统之间协调但异质的生理反应，这对情感计算中的多模态表示学习构成了基本挑战。学习这种联合动态因情感标注的稀缺性和主观性而进一步复杂化，这推动了自监督学习（SSL）的使用。然而，大多数现有的SSL方法依赖于成对对齐目标，这些目标不足以表征两个以上模态之间的依赖关系，也无法捕捉由协调的脑和自主反应产生的高阶交互。为了解决这一限制，我们提出了多模态功能最大相关（MFMC），一个原则性的SSL框架，通过双重总相关（DTC）目标最大化高阶多模态依赖。通过推导一个紧致的夹逼界并使用基于功能最大相关分析（FMCA）的迹替代进行优化，MFMC直接捕捉联合多模态交互，而不依赖于成对对比损失。在三个公开的情感计算基准上的实验表明，MFMC在受试者依赖和受试者独立评估协议下均一致地达到最先进或具有竞争力的性能，突显了其对受试者间变异性的鲁棒性。特别是，MFMC将CEAP-360VR上的受试者依赖准确率从78.9%提高到86.8%，仅使用EDA信号就将受试者独立准确率从27.5%提高到33.1%。此外，在MAHNOB-HCI最具挑战性的EEG受试者独立划分中，MFMC与最佳方法的差距在0.8个百分点以内。我们的代码可在https://github.com/DY9910/MFMC获取。

英文摘要

Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
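
For reference, the Dual Total Correlation named in the abstract has a standard information-theoretic definition, shown here next to ordinary total correlation for comparison. The symbols $X_1,\dots,X_n$ for the modalities are our own notation, and these are the textbook definitions only; the sandwich bound and the FMCA-based trace surrogate that MFMC optimizes are not reproduced here.

```latex
\begin{align}
\mathrm{TC}(X_1,\dots,X_n)  &= \sum_{i=1}^{n} H(X_i) - H(X_1,\dots,X_n), \\
\mathrm{DTC}(X_1,\dots,X_n) &= H(X_1,\dots,X_n) - \sum_{i=1}^{n} H\bigl(X_i \mid X_{\setminus i}\bigr),
\end{align}
```

Here $H$ is Shannon entropy and $X_{\setminus i}$ denotes all modalities except $X_i$. For $n = 2$ both quantities reduce to the mutual information $I(X_1; X_2)$, so the two measures differ only when three or more modalities are involved.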


2512.18735 2026-05-26 cs.CV cs.AI

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

$M^3-Verse$: 大型多模态模型的“找不同”挑战

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

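AI summary: Proposes the $M^3-Verse$ benchmark, which uses paired multi-perspective videos to evaluate how well LMMs understand dynamic object changes within a consistent space, and exposes the limitations of existing models.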


AI中文摘要

现代大型多模态模型（LMMs）在静态图像和单状态时空理解方面表现出非凡的能力。然而，它们在两个不同视频观测中理解共享空间上下文内物体动态变化的能力仍未被充分探索。这种在一致环境中推理变换的能力对于空间智能领域的进步尤为关键。在本文中，我们引入了 $M^3-Verse$，一个多模态、多状态、多维度的基准，以正式评估这一能力。它基于成对视频，这些视频提供了室内场景在状态变化前后的多视角观察。该基准包含总共 270 个场景和 2,932 个问题，分为 50 多个子任务，探究 4 种核心能力。我们评估了 16 个最先进的 LMMs，并观察到它们在跟踪状态转换方面的局限性。为了解决这些挑战，我们进一步提出了一个简单而有效的基线，在多状态感知中实现了显著的性能提升。因此，$M^3-Verse$ 提供了一个具有挑战性的新测试平台，以促进对动态视觉世界有更全面理解的下一代模型的发展。您可以从 https://github.com/Wal-K-aWay/M3-Verse_pipeline 获取构建流程，并从 https://www.modelscope.cn/datasets/WalKaWay/M3-Verse 获取完整的基准数据。

英文摘要

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.


2512.15605 2026-05-26 cs.LG stat.ML

Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

自回归语言模型实际上是能量模型：对下一个词元预测的预见能力的洞察

Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet

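AI summary: By establishing a bijection between autoregressive models and energy-based models, this paper shows that autoregressive models have lookahead capabilities under the next-token prediction paradigm, and provides theoretical error bounds.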

2512.13597 2026-05-26 cs.CV

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

运动中的光照：时空高动态范围光照估计

Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde

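AI summary: Proposes LiMo, a diffusion-based spatiotemporal lighting estimation method that generates mirrored and diffuse spheres at different exposures, conditioned on depth and geometry, to achieve realistic high-frequency detail prediction and accurate illuminance estimation.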


AI中文摘要

我们提出LiMo（运动中的光照），一种基于扩散的时空光照估计方法。LiMo旨在同时实现逼真的高频细节预测和准确的照度估计。为此，我们提出根据输入中3D位置生成一组不同曝光下的镜面与漫反射球体。利用扩散先验，我们在大规模定制的室内外场景数据集上微调强大的现有扩散模型，并配以时空光照探针。为了实现准确的空间条件，我们证明仅靠深度是不够的，并引入一种新的几何条件来提供场景相对于目标3D位置的相对位置。最后，我们利用可微渲染将不同曝光下的漫反射和镜面预测合并为单个HDRI图。我们彻底评估了我们的方法和设计选择，使LiMo在空间控制和预测精度方面均达到最先进水平。

英文摘要

We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.


2512.12425 2026-05-26 cs.CV

Boosting Monocular Metric Depth Estimation via Bokeh Rendering

通过散景渲染提升单目度量深度估计

Hangwei Zhang, Armando Fortes, Tianyi Wei, Xingang Pan

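AI summary: Proposes BokehDepth, a two-stage framework in which a physically grounded generative model produces calibrated bokeh stacks as a supervision-free geometric signal, and a defocus-aware aggregation module uses them to improve the metric accuracy of monocular depth estimation.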

Comments Project Page: https://fogradio.github.io/BokehDepth_Project/


Journal ref: ICML 2026

AI中文摘要

散景渲染和深度估计共享基本的光学联系，但现有方法未能充分利用这种互惠性。传统的散景管线严重依赖有噪声的深度图，不可避免地引入视觉伪影。相反，现有的单目深度模型通常遵循两种有缺陷的范式。基于生成扩散的框架往往缺乏一致的度量尺度。同时，前馈度量深度模型在纹理缺失或远处区域经常失败，而散焦模糊可以提供几何信息。我们提出BokehDepth，一个两阶段框架，将合成散焦视为无监督的几何信号。在第一阶段，一个物理基础的生成模型从单个清晰输入产生校准的散景堆栈，无需先验深度输入。随后，一个轻量级的散景感知聚合模块将这些堆栈集成到深度估计框架的编码器中。这种机制允许模型从散焦维度提取一致的几何特征，同时保持解码器架构不变。实验表明，与依赖深度的渲染基线相比，BokehDepth实现了优越的视觉散景保真度，并持续提升了最先进单目深度模型的度量精度。

英文摘要

Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps that inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow two flawed paradigms. Generative diffusion-based frameworks often lack consistent metric scale. Meanwhile, feed-forward metric depth models frequently fail in textureless or distant regions where defocus blur can provide geometric information. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp input without requiring prior depth input. Subsequently, a lightweight defocus-aware aggregation module integrates these stacks into the encoder of a depth estimation framework. This mechanism allows the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently enhances the metric accuracy of state-of-the-art monocular depth models.
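
The geometric information that defocus carries can be seen in the standard thin-lens model, where the blur-circle diameter depends on how far an object sits from the focal plane. The sketch below computes that diameter in Python. It is the textbook formula for an ideal thin lens, not BokehDepth's generative renderer, and the function name and example lens values are assumptions made for illustration.

```python
def circle_of_confusion(depth_m, focus_m, focal_length_m, f_number):
    """Blur-circle diameter on the sensor (meters) under an ideal thin lens."""
    if focus_m <= focal_length_m:
        raise ValueError("focus distance must exceed the focal length")
    aperture_m = focal_length_m / f_number  # aperture diameter
    magnification = focal_length_m / (focus_m - focal_length_m)
    return aperture_m * magnification * abs(depth_m - focus_m) / depth_m


# Hypothetical 50 mm f/1.8 lens focused at 2 m.
coc_in_focus = circle_of_confusion(2.0, 2.0, 0.050, 1.8)
coc_near = circle_of_confusion(1.0, 2.0, 0.050, 1.8)
coc_far = circle_of_confusion(10.0, 2.0, 0.050, 1.8)
```

A single blur value cannot tell whether an object is in front of or behind the focal plane, which is one reason a stack rendered at several focus settings is more informative than one defocused image.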


2512.05865 2026-05-26 cs.LG cs.AI

Intrinsically Interpretable Attention via Sparse Post-Training

通过稀疏后训练实现内在可解释的注意力机制

Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf

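AI summary: Proposes a post-training method that applies flexible sparsity regularisation under a constrained-loss objective to reduce transformer attention connectivity to about 0.4% of its edges without sacrificing performance, simplifying global circuits and improving interpretability.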


AI中文摘要

我们引入一种简单的后训练方法，使Transformer注意力变得稀疏而不牺牲性能。在约束损失目标下应用灵活的稀疏正则化，我们在高达7B参数的模型上证明，可以将注意力连接减少到其边的约0.4%，同时保留原始预训练损失。与为计算效率设计的稀疏注意力方法不同，我们的方法利用稀疏性作为结构先验：它保留了能力，同时暴露出更有组织和可解释的连接模式。我们发现这种局部稀疏性级联成全局电路简化：特定任务的电路涉及更少的组件（注意力头和MLP），连接它们的边减少了多达100倍。此外，使用跨层转码器（cross-layer transcoders），我们表明稀疏注意力显著简化了注意力归因，实现了基于特征和基于电路视角的统一视图。这些结果表明，Transformer注意力可以变得稀疏几个数量级，表明其大部分计算是冗余的，并且稀疏性可以作为更结构化和可解释模型的指导原则。

英文摘要

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.


2511.20236 2026-05-26 cs.AI cs.LG

Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints

结合领域知识和合理性约束的可操作且多样化的反事实解释

Szymon Bobek, Łukasz Bałec, Grzegorz J. Nalepa

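AI summary: Proposes DANCE, which models feature dependencies and domain constraints to generate actionable, diverse counterfactual explanations, and validates it on OpenML datasets and in an industrial email marketing setting.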


AI中文摘要

反事实解释通过识别实现期望结果所需的最小变化来提高机器学习模型的可操作可解释性。然而，现有方法常常忽略特征之间的依赖关系，这可能导致不现实或不切实际的修改。这一限制降低了反事实解释在现实决策支持系统中的实用性。受网络安全中电子邮件营销应用的启发，我们提出了DANCE（多样化、可操作且知识约束的解释），一种生成反事实的方法，该方法结合了特征依赖和领域约束。DANCE使用线性和概率结构对特征之间的关系进行建模，这些结构可以从数据中学习或由专家指定。在搜索过程中强制执行这些依赖关系以提高合理性和可行性。该方法在一个统一的目标中联合优化合理性、多样性、邻近性和稀疏性。我们在OpenML的140个数据集上评估了DANCE，并证明它在多个评估标准上相比现有方法具有竞争性或更优的性能。此外，我们与一个电子邮件营销平台合作，在真实工业环境中验证了该方法，表明它能够产生符合领域且可操作的建议。

英文摘要

Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to achieve a desired outcome. However, existing methods often neglect dependencies among features, which can lead to unrealistic or impractical modifications. This limitation reduces the usefulness of counterfactual explanations in real-world decision-support systems. Motivated by applications in cybersecurity for email marketing, we propose DANCE (Diverse, Actionable, and Knowledge-Constrained Explanations), a method for generating counterfactuals that incorporate feature dependencies and domain constraints. DANCE models relationships between features using linear and probabilistic structures that can be learned from data or specified by experts. These dependencies are enforced during the search process to improve plausibility and feasibility. The method jointly optimizes plausibility, diversity, proximity, and sparsity within a unified objective. We evaluate DANCE on 140 datasets from OpenML and demonstrate that it achieves competitive or superior performance compared to existing approaches across multiple evaluation criteria. Additionally, we validate the method in a real-world industrial setting in collaboration with an email marketing platform, showing that it produces domain-consistent and actionable recommendations.


2511.19065 2026-05-26 cs.CV cs.AI cs.LG

Understanding, Accelerating, and Improving MeanFlow Training

理解、加速和改进MeanFlow训练

Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

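AI summary: By analyzing the interaction between instantaneous and average velocity, proposes a training scheme that accelerates the formation of instantaneous velocity and then gradually shifts the training emphasis, yielding faster convergence and better few-step generation.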


AI中文摘要

MeanFlow通过联合学习瞬时速度场和平均速度场，有望在少步内实现高质量生成建模。然而，其底层训练动态仍不清楚。我们分析两种速度之间的相互作用，发现：(i) 建立良好的瞬时速度是学习平均速度的前提；(ii) 当时间间隔较小时，瞬时速度的学习受益于平均速度，但随着间隔增大而退化；(iii) 任务亲和性分析表明，对于一步生成至关重要的大间隔平均速度的平滑学习，依赖于先形成准确的瞬时速度和小间隔平均速度。在这些观察的指导下，我们设计了一种有效的训练方案，加速瞬时速度的形成，然后将重点从短间隔平均速度转移到长间隔平均速度。我们改进的MeanFlow训练实现了更快的收敛和显著更好的少步生成：使用相同的DiT-XL骨干网络，我们的方法在1-NFE ImageNet 256x256上达到了令人印象深刻的FID 2.87，而传统的MeanFlow基线为3.43。或者，我们的方法以2.5倍更短的训练时间或使用更小的DiT-L骨干网络，匹配MeanFlow基线的性能。

英文摘要

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
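
For readers unfamiliar with the two quantities, the original MeanFlow formulation defines the average velocity over an interval $[r, t]$ as the time average of the instantaneous velocity $v$ along the flow path. The notation below follows that formulation as we recall it and is not taken from this abstract, so treat it as a reference sketch.

```latex
\begin{align}
u(z_t, r, t) &= \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau, \\
u(z_t, r, t) &= v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t).
\end{align}
```

The second line follows from the first by differentiating $(t - r)\,u(z_t, r, t)$ with respect to $t$. As $r \to t$ the average velocity reduces to the instantaneous one, which makes small-gap targets close to ordinary flow matching, while the large-gap case is the one needed for one-step generation.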


2511.03548 2026-05-26 cs.LG

Flat Minima and Generalization: Insights from Stochastic Convex Optimization

平坦极小值与泛化：来自随机凸优化的见解

Matan Schliserman, Shira Vansover-Hager, Tomer Koren

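AI summary: Studies the relationship between flat minima and generalization in stochastic convex optimization, finding that flat empirical minima may incur $Ω(1)$ population risk while sharp minima generalize optimally, and proving that two sharpness-aware algorithms (SA-GD and SAM) may also generalize poorly.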


AI中文摘要

理解学习算法的泛化行为是学习理论的核心目标。最近一种新兴的解释是，学习算法在实践中成功是因为它们收敛到平坦极小值，而平坦极小值一直与改进的泛化性能相关联。在这项工作中，我们在非负、β-光滑目标的随机凸优化的经典设置中研究平坦极小值与泛化之间的联系。我们的第一个发现是，即使在这个基础且被充分研究的设置中，平坦的经验极小值可能产生平凡的Ω(1)总体风险，而尖锐极小值则能最优地泛化。然后，我们表明这种糟糕的泛化行为延伸到两种自然的“锐度感知”算法，这些算法最初由Foret等人（2021）提出，旨在将优化偏向平坦解：锐度感知梯度下降（SA-GD）和锐度感知最小化（SAM）。对于SA-GD，它在预定义邻域内对最大损失执行梯度步骤，我们证明虽然它成功以快速率收敛到平坦极小值，但解的总体风险仍然可能高达Ω(1)，表明即使使用锐度感知梯度方法算法性地找到的平坦极小值也可能泛化不佳。对于SAM，一种基于归一化上升步骤的SA-GD计算高效近似，我们表明尽管它最小化经验损失，但可能收敛到尖锐极小值，并且也产生Ω(1)的总体风险。最后，我们使用算法稳定性技术为SA-GD和SAM建立了总体风险上界。

英文摘要

Understanding the generalization behavior of learning algorithms is a central goal of learning theory. A recently emerging explanation is that learning algorithms are successful in practice because they converge to flat minima, which have been consistently associated with improved generalization performance. In this work, we study the link between flat minima and generalization in the canonical setting of stochastic convex optimization with a non-negative, $β$-smooth objective. Our first finding is that, even in this fundamental and well-studied setting, flat empirical minima may incur trivial $Ω(1)$ population risk while sharp minima generalize optimally. Then, we show that this poor generalization behavior extends to two natural ''sharpness-aware'' algorithms originally proposed by Foret et al. (2021), designed to bias optimization toward flat solutions: Sharpness-Aware Gradient Descent (SA-GD) and Sharpness-Aware Minimization (SAM). For SA-GD, which performs gradient steps on the maximal loss in a predefined neighborhood, we prove that while it successfully converges to a flat minimum at a fast rate, the population risk of the solution can still be as large as $Ω(1)$, indicating that even flat minima found algorithmically using a sharpness-aware gradient method might generalize poorly. For SAM, a computationally efficient approximation of SA-GD based on normalized ascent steps, we show that although it minimizes the empirical loss, it may converge to a sharp minimum and also incur population risk $Ω(1)$. Finally, we establish population risk upper bounds for both SA-GD and SAM using algorithmic stability techniques.
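
As a concrete reference for the second algorithm, a single SAM update takes a normalized ascent step of fixed radius and then descends using the gradient evaluated at the perturbed point. The Python sketch below applies that update to a toy convex quadratic. The objective, step size `lr`, and radius `rho` are illustrative assumptions, and the code reproduces neither the paper's lower-bound constructions nor SA-GD's exact inner maximization.

```python
import math


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to w + rho * g / ||g||, then descend with the gradient there."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: no ascent direction is defined
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Toy objective f(w) = 0.5 * (w0^2 + 4 * w1^2), chosen only for illustration.
def loss(w):
    return 0.5 * (w[0] ** 2 + 4.0 * w[1] ** 2)


def grad(w):
    return [w[0], 4.0 * w[1]]


w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(50):
    w = sam_step(w, grad)
final_loss = loss(w)
```

On this toy problem the iterates tend to settle into a small neighborhood of the minimizer rather than landing on it exactly, because the ascent step keeps length `rho` even when the gradient is tiny.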


2511.03529 2026-05-26 cs.LG

Byzantine-Robust Federated Learning with Learnable Aggregation Weights

具有可学习聚合权重的拜占庭鲁棒联邦学习

Javad Parsa, Amir Hossein Daghestani, André M. H. Teixeira, Mikael Johansson

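AI summary: Formulates a Byzantine-robust federated learning optimization problem in which aggregation weights are learnable parameters optimized jointly with the global model, and develops an alternating minimization algorithm that outperforms existing methods under heterogeneous data and malicious clients.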

Comments ICLR 2026


AI中文摘要

联邦学习（FL）使客户端能够在不共享私有数据的情况下协作训练全局模型。然而，恶意（拜占庭）客户端的存在对FL的鲁棒性构成了重大挑战，尤其是在客户端数据分布异构的情况下。在本文中，我们提出了一种新颖的拜占庭鲁棒FL优化问题，该问题将自适应加权引入聚合过程。与传统方法不同，我们的公式将聚合权重视为可学习参数，与全局模型参数联合优化。为了解决这个优化问题，我们开发了一种交替最小化算法，在对抗攻击下具有强收敛保证。我们分析了所提目标的拜占庭弹性。我们在各种数据集和攻击场景下，将我们的算法与最先进的拜占庭鲁棒FL方法进行了性能评估。实验结果表明，我们的方法始终优于现有方法，特别是在数据高度异构且恶意客户端比例较大的情况下。

英文摘要

Federated Learning (FL) enables clients to collaboratively train a global model without sharing their private data. However, the presence of malicious (Byzantine) clients poses significant challenges to the robustness of FL, particularly when data distributions across clients are heterogeneous. In this paper, we propose a novel Byzantine-robust FL optimization problem that incorporates adaptive weighting into the aggregation process. Unlike conventional approaches, our formulation treats aggregation weights as learnable parameters, jointly optimizing them alongside the global model parameters. To solve this optimization problem, we develop an alternating minimization algorithm with strong convergence guarantees under adversarial attack. We analyze the Byzantine resilience of the proposed objective. We evaluate the performance of our algorithm against state-of-the-art Byzantine-robust FL approaches across various datasets and attack scenarios. Experimental results demonstrate that our method consistently outperforms existing approaches, particularly in settings with highly heterogeneous data and a large proportion of malicious clients.


2510.23008 2026-05-26 cs.AI

From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports -- with Preliminary Extension to Lung Cancer

从提示优化到多维可信度评估：增强中文LLM生成的肝脏MRI报告的可信度——初步扩展至肺癌

Qiuli Wang, Xinhuang Sun, Yonglin Chen, Jie Cheng, Yongxu Liu, Xingpeng Zhang, Xiaoming Li, Wei Chen

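AI summary: Proposes a multi-dimensional credibility evaluation framework (MDCA) and uses it to guide institution-specific prompt optimization, in order to enhance the trustworthiness of LLM-generated liver MRI reports, with a preliminary extension to lung cancer.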

Comments 10 pages, 6 figures, 4 tables

2510.22874 2026-05-26 cs.CL

A Comprehensive Dataset for Human vs. AI Generated Text Detection

人类与AI生成文本检测的综合数据集

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

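AI summary: Presents a comprehensive dataset of 73,193 text samples combining authentic New York Times articles with synthetic text generated by multiple state-of-the-art LLMs, for distinguishing human-written from AI-generated text and for model attribution, with baseline accuracies of 58.35% and 8.92% respectively.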

Comments Defactify4 @AAAI 2025


AI中文摘要

大型语言模型（LLM）的快速发展使得AI生成的文本越来越像人类，引发了对内容真实性、错误信息和可信度的担忧。要可靠地检测AI生成文本并将其归因于特定模型，需要大规模、多样化且标注良好的数据集。在这项工作中，我们提出了一个包含73,193个文本样本的综合数据集，该数据集结合了真实的纽约时报文章与多个最先进LLM（包括Gemma-2-9b、Mistral-7B、Qwen-2-72B、LLaMA-8B、Yi-Large和GPT-4-o）生成的合成版本。数据集提供原始文章摘要作为提示，以及完整的人类作者叙述。我们为两个关键任务建立了基线结果：区分人类撰写与AI生成的文本，准确率达到58.35%；以及将AI文本归因于其生成模型，准确率为8.92%。通过将现实世界的新闻内容与现代生成模型相结合，该数据集旨在促进鲁棒的检测和归因方法的发展，在生成式AI时代培养信任和透明度。我们的数据集可在以下网址获取：https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

英文摘要

The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising 73,193 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides original article abstracts as prompts, along with full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35\%, and attributing AI texts to their generating models with an accuracy of 8.92\%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset


2510.22827 2026-05-26 cs.CV cs.LG

FairJudge: Abstention-Aware Multimodal Judges for Fairness and Alignment Evaluation in Text-to-Image Models

FairJudge: 文本到图像模型中公平性与对齐评估的弃权感知多模态裁判

Zahraa Al Sahili, Maimuna Nowaz, Maryam Fetanat, Ioannis Patras, Matthew Purver

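AI summary: Proposes the FairJudge protocol, which uses multimodal large language models as structured judges with closed label sets, an abstention mechanism, and evidence reporting to evaluate social-attribute prediction, profession grounding, and prompt-image alignment in text-to-image models.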


AI中文摘要

评估文本到图像（T2I）系统不仅需要判断图像是否匹配提示，还需要判断社会显著属性是否被忠实表示且没有无根据的推断。现有的自动评估器通常依赖于以面部为中心的识别器或对比图像-文本相似度，这些方法提供的诊断反馈有限，并且通常在视觉证据模糊或缺失时强制进行预测。对于宗教和残疾等公平敏感属性，其中线索可能是上下文相关的、间接的或故意未指定的，这些评估器可能会遗漏细心的人类评审员会注意到的失败模式。我们引入了\textsc{FairJudge}，一种弃权感知的评估协议，该协议使用遵循指令的多模态LLM作为社会属性预测、职业定位和提示-图像对齐的结构化裁判。该协议将输出限制为封闭标签集，要求可见证据的理由，在线索不足时支持明确的\textsc{unspecified}决策，并将基于量规的对齐判断映射到$[-1,1]$。这些约束将MLLM裁判从开放式评估转变为可解析、可审计的评估程序。在四个属性预测基准和三个职业/对齐基准上，\textsc{FairJudge}优于或补充了CLIP、DeepFace、VIEScore和VQAScore。消融实验表明，封闭标签、弃权和证据报告对可靠性至关重要。我们进一步引入了\textsc{DIVERSIFY}和\textsc{DIVERSIFY-Professions}，这是两个上下文丰富的资源，用于评估超越面部可见或标志性线索的社会表示和职业定位。我们发布了代码、提示、数据集、解析器日志和每张图像的裁判输出，以支持可重复的审计。

英文摘要

Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automated evaluators typically rely on face-centric recognizers or contrastive image--text similarity, which provide limited diagnostic feedback and often force predictions even when visual evidence is ambiguous or absent. For fairness-sensitive attributes such as religion and disability, where cues may be contextual, indirect, or intentionally unspecified, these evaluators can therefore miss failure modes that careful human reviewers would notice. We introduce \textsc{FairJudge}, an abstention-aware evaluation protocol that uses instruction-following multimodal LLMs as structured judges for social-attribute prediction, profession grounding, and prompt--image alignment. The protocol constrains outputs to closed label sets, requires visible-evidence rationales, supports an explicit \textsc{unspecified} decision when cues are insufficient, and maps rubric-based alignment judgments to $[-1,1]$. These constraints turn MLLM judging from open-ended assessment into a parseable, auditable evaluation procedure. Across four attribute-prediction benchmarks and three profession/alignment benchmarks, \textsc{FairJudge} outperforms or complements CLIP, DeepFace, VIEScore, and VQAScore. Ablations show that closed labels, abstention, and evidence reporting are central to reliability. We further introduce \textsc{DIVERSIFY} and \textsc{DIVERSIFY-Professions}, two context-rich resources for evaluating social representation and profession grounding beyond face-visible or iconic cues. We release code, prompts, datasets, parser logs, and per-image judge outputs to support reproducible auditing.
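
The protocol's constraints (closed label sets, an explicit unspecified option, a required evidence rationale, and alignment scores mapped to $[-1,1]$) lend themselves to a strict parser. The Python sketch below shows one way such a parser could look. The JSON field names, the example label set, and the 1-to-5 rubric scale are hypothetical choices made for illustration and are not taken from the FairJudge release.

```python
import json

UNSPECIFIED = "unspecified"
# Hypothetical closed label set for one attribute.
RELIGION_LABELS = {"buddhist", "christian", "hindu", "jewish", "muslim", "sikh", UNSPECIFIED}


def parse_judgment(raw, allowed_labels):
    """Validate one judge response; reject anything outside the closed schema."""
    data = json.loads(raw)
    label = str(data.get("label", "")).strip().lower()
    evidence = str(data.get("evidence", "")).strip()
    if label not in allowed_labels:
        raise ValueError(f"label {label!r} is outside the closed set")
    if label != UNSPECIFIED and not evidence:
        raise ValueError("a non-abstaining label requires a visible-evidence rationale")
    return {"label": label, "evidence": evidence, "abstained": label == UNSPECIFIED}


def rubric_to_alignment(score, lo=1, hi=5):
    """Linearly map a rubric score in [lo, hi] (assumed scale) to [-1, 1]."""
    if not lo <= score <= hi:
        raise ValueError("rubric score out of range")
    return 2.0 * (score - lo) / (hi - lo) - 1.0


ok = parse_judgment('{"label": "Unspecified", "evidence": ""}', RELIGION_LABELS)
mid = rubric_to_alignment(3)
```

Rejecting out-of-set labels instead of coercing them keeps parse failures visible in the logs, which fits the auditable procedure the abstract describes.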
