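arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.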

2605.21075 2026-05-21 cs.CV cs.LG

SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

Nassim Ait Ali Braham, Aaron Banze, Conrad M. Albrecht, Julien Mairal, Jocelyn Chanussot, Xiao Xiang Zhu

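AI summary: This paper proposes SpectralEarth-FM, a hierarchical transformer for multisensor Earth observation input that jointly processes hyperspectral imagery and lower-channel observations. The model is pretrained with a JEPA-style objective on the newly curated SpectralEarth-MM dataset and achieves state-of-the-art results on hyperspectral downstream tasks and standard EO benchmarks.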

Abstract

Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

2605.21072 2026-05-21 cs.CV

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang

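AI summary: This paper addresses quantization of autoregressive video diffusion models (ARVDs). It proposes Q-ARVD, a framework that tackles unbalanced frame-wise quantization sensitivity and heterogeneous outlier patterns in weights, enabling accurate quantization for more efficient inference.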

Comments: Code: https://github.com/tsa18/Q-ARVD

Abstract

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
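
The dual-scale idea in (S2) can be illustrated with a minimal sketch. It is not the authors' implementation: the detection rule (a channel peak above `ratio` times the median peak), the function names, and the toy weights are all assumptions made for illustration. The sketch gives outlier channels their own quantization scale so that normal channels keep a fine grid.

```python
from statistics import median

def fake_quantize(values, scale, bits=4):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def scale_for(peak, bits=4):
    """Scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1
    return peak / qmax if peak > 0 else 1.0

def dual_scale_quantize(weight, bits=4, ratio=4.0):
    """Quantize channels with one scale for normal channels and one for outliers.

    A channel counts as an outlier when its peak |w| exceeds `ratio` times the
    median channel peak (a hypothetical detection rule, not taken from the paper).
    """
    peaks = [max(abs(v) for v in channel) for channel in weight]
    threshold = ratio * median(peaks)
    outliers = [i for i, p in enumerate(peaks) if p > threshold]
    normal_peak = max((p for i, p in enumerate(peaks) if i not in outliers), default=0.0)
    outlier_peak = max((peaks[i] for i in outliers), default=0.0)
    scales = [scale_for(outlier_peak if i in outliers else normal_peak, bits)
              for i in range(len(weight))]
    quantized = [fake_quantize(ch, s, bits) for ch, s in zip(weight, scales)]
    return quantized, outliers

def mse(a, b):
    """Mean squared error between two nested lists of equal shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Toy weight matrix: two normal channels and one outlier channel.
weight = [[0.1, -0.2, 0.05], [0.15, 0.1, -0.1], [8.0, -6.0, 7.0]]
dual, outliers = dual_scale_quantize(weight)

# Baseline: a single scale shared by every channel.
single_scale = scale_for(max(abs(v) for ch in weight for v in ch))
single = [fake_quantize(ch, single_scale) for ch in weight]
```

With a single shared scale, the outlier channel stretches the grid and the small weights all round to zero. Isolating the outlier channel avoids that, so `mse(weight, dual)` is lower than `mse(weight, single)` on this toy input.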

2605.21070 2026-05-21 cs.LG

Towards Understanding Self-Pretraining for Sequence Classification

Omar Coser, Loredana Zollo, Paolo Soda, Antonio Orvieto

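AI summary: This paper replicates and systematically ablates the self-pretraining (SPT) results of Amos et al. (2024) for sequence classification. It finds that the central bottleneck is the ability of label supervision to learn useful query-key attention patterns from random initialization, identifies learned proximity interactions as a key source of SPT's gains, and shows in a simplified theoretical setup that label supervision can be locally blind to attention-score directions that masked reconstruction can detect.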

Comments: v1: Preliminary, extension of the version accepted at ICML 2025 Workshop MOSS

Abstract

Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random initialization. With a minimal setup, we identify learning proximity interactions - turning absolute positional encodings into proximity-biased Attention scores - as a key source of the improvements brought by SPT. Finally, in a simplified theoretical setup, we show that label supervision can be locally blind to certain Attention-score directions that are instead detectable through masked reconstruction.
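
To make "proximity-biased Attention scores" concrete, here is a small sketch under strong assumptions: it uses fixed sinusoidal positional encodings and identity query and key projections, neither of which is claimed by the abstract. In that special case the score between positions i and j reduces to a sum of cos(ω(i − j)) terms, so it depends only on the offset and peaks at i = j.

```python
import math

def sinusoidal_encoding(pos, dim=16, base=10000.0):
    """Standard sinusoidal absolute positional encoding for one position."""
    enc = []
    for k in range(dim // 2):
        freq = base ** (-2 * k / dim)
        enc.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return enc

def attention_score(i, j, dim=16):
    """Unnormalized query-key score with identity projections (an assumption)."""
    q = sinusoidal_encoding(i, dim)
    k = sinusoidal_encoding(j, dim)
    return sum(a * b for a, b in zip(q, k))

same = attention_score(5, 5)      # offset 0
neighbor = attention_score(5, 6)  # offset 1
shifted = attention_score(10, 11) # offset 1 at a different absolute position
```

Here `neighbor` and `shifted` agree up to floating-point error, and `same` is the largest of the three. A trained model would have to reach a comparable pattern through its learned projections, which is the step the abstract attributes to SPT.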

2605.21066 2026-05-21 cs.LG

Robust Personalized Recommendation under Hidden Confounding in MNAR

Zongyu Li, Wanting Su, Tianyu Xia

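AI summary: This paper proposes a framework that estimates user-item level sensitivity bounds, relaxing the homogeneity assumption inherent in global sensitivity bounds and enabling more robust and accurate personalized recommendation under hidden confounding.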

Abstract

Recommender systems often rely on observational user-item interaction data, which is prone to selection bias due to users' selective interactions with items. Inverse propensity weighting and doubly robust estimators effectively mitigate selection bias under observed confounding, but are unreliable in the presence of hidden confounders. Existing approaches relying on randomized controlled trials (RCTs) or global sensitivity bounds are constrained in practice: RCTs demand costly experimental data, while global sensitivity bounds presume a uniformly bounded effect of unmeasured confounders on propensities through sensitivity analysis, thereby neglecting heterogeneity across user-item interactions. To overcome this limitation, we propose a novel framework named Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID), which estimates user-item level sensitivity bounds, thereby substantially relaxing the homogeneity assumption inherent in global sensitivity bounds. To ensure both robustness and predictive accuracy, we further develop an adversarial optimization strategy and propose a benchmark-guided variant (BPUID) that incorporates pre-trained models as stabilizing references. Extensive experiments on three real-world datasets demonstrate that our approach significantly outperforms global methods under hidden confounding, without requiring RCT data.
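
The contrast between a global and a per-interaction sensitivity bound can be sketched as follows. This is an illustration, not PUID itself. It assumes a marginal sensitivity model in which the odds ratio between the nominal and the true propensity lies in [1/Γ, Γ], so the true inverse propensity lies in [1 + (1/p − 1)/Γ, 1 + (1/p − 1)·Γ]. The function names and the toy numbers are hypothetical.

```python
def weight_bounds(propensity, gamma):
    """Bounds on the true inverse propensity when the odds ratio between the
    nominal and true propensity lies in [1/gamma, gamma]."""
    inverse_odds = 1.0 / propensity - 1.0
    return 1.0 + inverse_odds / gamma, 1.0 + inverse_odds * gamma

def worst_case_ips(errors, propensities, gammas, n_pairs):
    """Upper bound on the IPS risk for non-negative per-interaction errors.

    With non-negative errors, the worst case simply takes the upper weight
    bound for every observed interaction.
    """
    total = 0.0
    for err, p, g in zip(errors, propensities, gammas):
        _, upper = weight_bounds(p, g)
        total += err * upper
    return total / n_pairs

errors = [1.0, 0.5]          # observed prediction errors (toy values)
propensities = [0.5, 0.25]   # nominal propensities (toy values)

# One Gamma shared by every interaction versus a per-interaction Gamma.
global_bound = worst_case_ips(errors, propensities, [2.0, 2.0], n_pairs=4)
personal_bound = worst_case_ips(errors, propensities, [1.0, 2.0], n_pairs=4)
```

When some user-item pairs are known to be less exposed to hidden confounding (Γ close to 1), the per-interaction bound is tighter than the global one. That matches the motivation the abstract gives for relaxing the homogeneity assumption.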

2605.21063 2026-05-21 cs.CL

APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

Philipp Spohn, Leander Girrbach, Zeynep Akata

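AI summary: This study introduces the APM benchmark, which evaluates style personalization in LLMs through a hidden, randomized preference mapping. Routing proves the most reliable method, RAG improves only with the stronger base model, and soft prompt optimization fails to improve significantly over a non-personalized baseline.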

Abstract

Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.
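
A hidden, randomized mapping of this kind might look like the sketch below. It assumes the mapping is a random one-to-one assignment resampled per run via a seed, whereas the benchmark's actual form of $\mathbf{C}$ may differ. Only "enthusiastic" and "persuasive" come from the abstract; the other attribute and principle names are made up for illustration.

```python
import random

# Only "enthusiastic" and "persuasive" appear in the abstract; the rest are
# hypothetical placeholders.
ATTRIBUTES = ["enthusiastic", "cautious", "curious", "pragmatic"]
PRINCIPLES = ["persuasive", "concise", "formal", "detailed"]

def sample_mapping(seed):
    """Draw a random attribute -> principle assignment for one evaluation run."""
    rng = random.Random(seed)
    shuffled = PRINCIPLES[:]
    rng.shuffle(shuffled)
    return dict(zip(ATTRIBUTES, shuffled))

def preferred_principles(user_attributes, mapping):
    """Response principles a user with these attributes prefers in this run."""
    return sorted({mapping[a] for a in user_attributes})

run_a = sample_mapping(seed=0)
run_b = sample_mapping(seed=1)
prefs = preferred_principles(["enthusiastic", "curious"], run_a)
```

Because the assignment is arbitrary and changes with the seed, a model cannot rely on a stereotype such as "enthusiastic users like persuasive answers". It has to recover the current assignment from the conversation history.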

2605.21061 2026-05-21 cs.CV cs.AI cs.RO

Grounding Driving VLA via Inverse Kinematics

Junsung Park, Hyunjung Shim

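AI summary: This paper redesigns Driving VLA in the style of an inverse kinematics solver to address the neglect of visual tokens in trajectory prediction. Adding a next visual state prediction objective and a separate inverse kinematics network restores visual grounding and improves trajectory planning performance.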

Abstract

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens, a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B-8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

2605.21060 2026-05-21 cs.LG cs.AI stat.ML

Divide et Calibra: Multiclass Local Calibration via Vector Quantization

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana

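AI summary: This paper proposes a compositional approach to multiclass calibration. Vector quantization induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations shares parameters across regions, yielding heterogeneous calibration maps that generalize to sparse regions. Local calibration improves while global calibration and predictive performance remain competitive.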

Abstract

Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.
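
The general shape of region-specific calibration over a vector-quantized latent space can be sketched as below. This is a deliberate simplification: it swaps in one temperature per codeword in place of the paper's indexed Dirichlet parameterization, and the codebook, temperatures, and inputs are toy values chosen for illustration.

```python
import math

def nearest_codeword(z, codebook):
    """Index of the codeword closest to latent vector z (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(z, codebook[k])))

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate(z, logits, codebook, temperatures):
    """Apply the temperature of the region that z falls into.

    Per-region temperature scaling stands in for the paper's Dirichlet-based
    calibration maps (an assumption of this sketch).
    """
    region = nearest_codeword(z, codebook)
    return softmax([l / temperatures[region] for l in logits])

codebook = [[0.0, 0.0], [5.0, 5.0]]   # two codewords, so two regions
temperatures = [1.0, 2.0]             # region 1 is treated as overconfident
logits = [2.0, 0.0, -1.0]

near_origin = calibrate([0.2, -0.1], logits, codebook, temperatures)
far_away = calibrate([4.5, 5.2], logits, codebook, temperatures)
```

The same logits yield a softer distribution in the second region, which is the kind of heterogeneous, region-dependent correction that a single global calibrator cannot express.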

2605.21059 2026-05-21 cs.CV cs.LG

Multimodal LLMs under Pairwise Modalities

Yan Li, Yunlong Deng, Yuewen Sun, Gongxu Luo, Kun Zhang, Guangyi Chen

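AI summary: This paper proposes training multimodal large language models using only pairwise modality data. Guided by an identifiability analysis, a two-stage representation learning framework performs latent representation alignment and cross-modal recomposition, yielding strong cross-modal performance.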

Abstract

Despite the impressive results achieved by multimodal large language models (MLLMs), their training typically relies on jointly curated multimodal data, requiring substantial human effort to construct multi-way aligned datasets and thereby limiting scalability across domains. In this work, we explore training MLLMs by only leveraging multiple paired modalities as a surrogate for the full joint multimodal distribution. Specifically, we first provide a theoretical analysis of the conditions under which the representations are identifiable when only pairwise modalities are observed. Building on this analysis, we propose a representation learning framework for aligning latent representations across modalities using only pairwise data. The framework consists of two stages: latent representation alignment and cross-modal recomposition. In the first stage, we learn the shared latent space across modalities by both self-modal reconstruction and pairwise contrastive learning. We also incorporate an inductive bias in the contrastive learning process through partial alignment and minimal latent specification. In the second stage, we integrate the encoder of newly introduced modalities with the decoders of the pre-trained modalities to facilitate cross-modal transfer and generation. We evaluate our method by newly adding 3D point clouds and tactile modalities into pre-trained MLLMs with three modality pairs and show that, by learning an aligned latent representation space, our model achieves strong cross-modal performance.

2605.21058 2026-05-21 cs.LG

A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation

Yan Li, Yuewen Sun, Shaoan Xie, Gongxu Luo, Yunlong Deng, Kun Zhang, Guangyi Chen

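AI summary: This paper argues for a dialogue between causal representation learning and traditional representation learning and proposes a unified formulation with a task component and a constraint component through which the two fields can benefit each other. Experiments show that the effectiveness of causal constraints depends on the tasks they are paired with.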

Abstract

Causal representation learning (CRL) and traditional representation learning have largely developed along different trajectories. Traditional representation learning has been driven mainly by applications and empirical objectives, whereas CRL has focused more on theoretical questions, particularly identifiability. This difference in emphasis has created a gap between the two fields in terminology, problem formulation, and evaluation, limiting communication and sometimes leading to disconnected or redundant efforts. In this paper, we argue that these two fields should be brought into dialogue rather than treated as separate paradigms. To this end, we introduce a unified formulation in which representation learning is characterized by two components: a task component, which specifies what information the learned representation is required to preserve, and a constraint component, which specifies what structure is imposed on the latent space. Under this formulation, the benefits run in both directions. CRL provides theoretical tools for understanding when structured latent constraints are useful or necessary, while traditional representation learning offers practical insights on task design and objective choice that can improve the development of CRL methods. To illustrate this interaction, we experimentally study how different task components affect the behavior of CRL methods under different structured constraints. Results on CausalVerse show that the effectiveness of causal constraints depends strongly on the tasks with which they are paired.

2605.21053 2026-05-21 cs.RO

Perception of Social Robots as Communication Partners in Healthcare for Older Adults

Hana Yamamoto, Carlotta Julia Mayer, Charlotte Raithel, Theresa Buchner, Christian Werner, Yasuhisa Hirata, Monika Eckstein, Katja Mombaur

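AI summary: This study examines whether social robots can serve as effective communication partners in healthcare for older adults and whether positive prompts enhance the interaction. Overall stress levels did not differ significantly between human and robot interactions, and the robot was accepted as a valid interaction partner, suggesting it can help reduce caregiver burden.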

Comments: 31 pages, 10 figures, Under review at International Journal of Social Robotics

Abstract

Addressing the global caregiver shortage through socially assistive robots necessitates a deep understanding of their psychological and physiological impacts on older adults during human-robot interaction (HRI). This study addresses whether social robots can serve as effective interaction partners compared to humans, and if "positive prompts" can similarly enhance these interactions. We conducted a comparative study with 35 participants (aged 70+). Our multi-modal analysis, integrating facial expression data, heart rate variability, and subjective questionnaires, revealed no significant differences in overall stress levels between human and robot interactions. Facial expression analysis confirmed that the robot was accepted as a valid interaction partner, while physiological data showed slightly lower heart rates during robot interactions, suggesting a more relaxed state compared to human-led sessions. These findings indicate that social robots can engage older adults without inducing psychological strain and are capable of alleviating caregiver burden by performing structured tasks, such as health-sensing surveys. Future work should address the identified "appearance-content mismatch" in robot design to facilitate even more natural and effective interactions.

2605.21049 2026-05-21 cs.CL

Cross-lingual robustness of LLM-brain alignment and its computational roots

Ni Yang, Rui He, Philipp Homan, Iris Sommer, Davide Staub, Wolfram Hinzen

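AI summary: This study examines the cross-lingual robustness of LLM-brain alignment. Using a multilingual whole-brain encoding framework, it analyzes brain-LLM alignment in Mandarin, English, and French during naturalistic story listening and finds substantial cross-linguistic spatial overlap that is not explained by predictive uncertainty or representational geometry.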

Abstract

Large language models (LLMs) reliably predict neural activity during language comprehension, and transformer depth has been interpreted as mirroring hierarchical cortical organization. However, it remains unclear whether such alignment extends to subcortical regions, whether it overlaps spatially across languages, and what its computational roots are. Here, we used a multilingual, whole-brain encoding framework to examine brain-LLM alignment across three typologically distinct languages (Mandarin, English, and French) during naturalistic story listening. Our results show that across languages, transformer-based models predicted activity in a distributed landscape spanning widely distributed cortical functional networks, such as the limbic, ventral attention, and default mode networks, as well as subcortical structures. Spatial alignment patterns showed substantial cross-linguistic overlap and remained largely stable across model layers, with limited layer progression consistent with functional cortical hierarchies. Contrary to previous evidence, contextual embeddings did not outperform static embeddings. To test candidate computational explanations, we examined whether layer-wise brain scores reflect surprisal and intrinsic dimensionality, and thereby predictive processing and information compression. Neither of these two computational metrics mirrored neural alignment profiles. Our findings suggest that brain-LLM alignment is spatially robust and cross-linguistically stable but not explainable from predictive uncertainty or representational geometry. Rather than directly reflecting shared hierarchical computation, neural predictivity may primarily arise from distributed lexical-semantic correspondences that generalize across languages.

2605.21042 2026-05-21 cs.CV

Dynamic Video Generation: Shaping Video Generation Across Time and Space

Shikang Zheng, Jingkai Huang, Jiacheng Liu, Guantao Chen, Lixuan, Yuqi Lin, Peiliang Cai, Linfeng Zhang

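AI summary: This paper proposes DVG, a framework that jointly allocates computation across time and space and automatically selects content-aware acceleration strategies, achieving near-lossless acceleration for video generation.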

Abstract

Diffusion models have achieved impressive performance in video generation, but their iterative denoising process remains computationally expensive due to the large number of tokens processed at each timestep. Recently, progressive resolution sampling has emerged as a promising acceleration approach by reducing latent resolution in early stages. However, scaling this idea to video generation remains challenging, as the additional temporal dimension introduces diverse spatio-temporal demands across different videos, and compressing only a single dimension often leads to limited acceleration or degraded quality. Therefore, we propose DVG, a Dynamic Video Generation framework that jointly allocates computation across time and space, automatically selecting content-aware acceleration strategies without manual tuning or retraining. DVG achieves near-lossless acceleration across models and tasks, reaching up to a 7x speedup on HunyuanVideo and HunyuanVideo-1.5, and 18x when combined with distillation, demonstrating its potential as a key component in today's large-scale efficient video generation systems. Our code is in the supplementary material and will be released on GitHub.

2605.21033 2026-05-21 cs.LG cs.DS

Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification

Guangyi Zhang, Lutz Oettershagen, Lixu Wang, Aristides Gionis

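AI summary: This paper presents efficient algorithms for computing Banzhaf values for $k$-nearest neighbor classifiers, addressing the computational complexity of data valuation with a dynamic programming framework that yields substantial speedups.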

Comments: To appear at VLDB 2026

Abstract

Data valuation, the task of quantifying the contribution of individual data points to model performance, has emerged as a fundamental challenge in machine learning. Game-theoretic approaches, such as the Banzhaf value, offer principled frameworks for fair data valuation; however, they suffer from exponential computational complexity. We address this challenge by developing efficient algorithms specifically tailored for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers. We first establish the theoretical hardness of the problem by proving that it is #P-hard. Despite this intractability, we exploit the locality properties of $k$NN classifiers to develop practical exact algorithms. Our main contribution is a dynamic programming framework that achieves significant computational improvements: we present a pseudo-polynomial algorithm with $O(Wkn^2)$ time complexity for weighted $k$NN classifiers, where $W$ is the maximum sum of top-$k$ weights, and a specialized algorithm for unweighted $k$NN that achieves $O(nk^2)$ time complexity, that is, linear in the number of data points. We also offer efficient Monte Carlo estimation methods. Extensive experiments on real-world datasets demonstrate the practical efficiency of our approach and its effectiveness in data valuation applications.
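
For reference, the quantity being computed can be written out by brute force. The sketch below is the exponential-time definition, not the paper's dynamic programming algorithm. It assumes an unweighted $k$NN utility for a single test point (the fraction of the up-to-$k$ nearest training points that share the test label), one-dimensional features, and toy data. The Banzhaf value of a point is its marginal contribution averaged over all subsets of the other points.

```python
from itertools import combinations

def knn_utility(subset, x_test, y_test, k):
    """Share of the k nearest points in `subset` whose label matches y_test.

    Points are (feature, label) pairs with a scalar feature (an assumption
    made to keep the sketch short). The empty subset has utility 0.
    """
    nearest = sorted(subset, key=lambda p: abs(p[0] - x_test))[:k]
    return sum(1 for _, y in nearest if y == y_test) / k

def banzhaf_values(train, x_test, y_test, k):
    """Exact Banzhaf value of every training point by subset enumeration."""
    n = len(train)
    values = []
    for i in range(n):
        others = train[:i] + train[i + 1:]
        total = 0.0
        for size in range(n):  # subset sizes 0 .. n-1
            for combo in combinations(others, size):
                subset = list(combo)
                with_i = knn_utility(subset + [train[i]], x_test, y_test, k)
                without_i = knn_utility(subset, x_test, y_test, k)
                total += with_i - without_i
        values.append(total / 2 ** (n - 1))
    return values

train = [(0.1, 1), (0.2, 0), (0.9, 1)]  # toy (feature, label) pairs
values = banzhaf_values(train, x_test=0.0, y_test=1, k=1)
# values == [0.75, -0.25, 0.25]
```

The closest correctly labeled point gets the largest value and the mislabeled neighbor gets a negative one. The loop visits all 2^(n−1) subsets per point, which is exactly the cost that the polynomial-time algorithms in the abstract avoid.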

2605.21032 2026-05-21 cs.CV

Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation

Bowyn Tan, Yutong Xie, Bai Huang, Fan Luo, Xiao Li, Naizheng Wang, Yang Guan, Shengbo Eben Li

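AI summary: This paper establishes an information-geometric diagnostic framework that traces the failure of 3DGS methods to jointly model spatial and temporal parameters to a credit assignment dilemma. It introduces Orthogonal Projected Gradient (OPG) and a temporal regularization strategy to improve the physical consistency of 4D scene reconstruction.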

Comments: 20 pages, 4 figures

Abstract

High-fidelity street scene reconstruction is pivotal for end-to-end autonomous driving simulation, where novel-view synthesis (NVS) and time-varying information modeling are two fundamental capabilities to facilitate closed-loop training. However, existing 3DGS methods and their 4D extensions fail to simultaneously achieve both. To bridge this gap, we establish an information-geometric diagnostic framework, revealing that this limitation stems from a credit assignment dilemma between spatial and temporal parameters. Specifically, the deterministic coupling between viewpoint and time in single-source observation creates a low-rank structure that induces massive null-space ambiguity between static view-dependent and dynamic time-varying components. Temporal information overshadows spatial cues, causing the estimation variance of spatial parameters to diverge. To address this issue, we propose Orthogonal Projected Gradient (OPG), a hierarchical training method designed to restore spatial identifiability. OPG prioritizes the integrity of spatial representations by securing them in an initial stage, then restricts temporal updates to the spatial null space, enabling proactive credit assignment. While OPG isolates temporal updates algebraically, a Temporal Regularization Strategy is proposed to further refine the temporal solution space by imposing a smoothness constraint based on the physical prior of consistent appearance evolution, ensuring that the reconstructed scene remains physically consistent in closed-loop simulation. Extensive experiments demonstrate that our method not only maintains stable NVS capabilities but also achieves superior performance in traditional observation-reproducing metrics, which indirectly reflect the capability of modeling temporal dynamics.
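
Restricting an update to a null space is commonly done by projecting the gradient onto the orthogonal complement of a set of protected directions. The sketch below shows only that generic projection step. How OPG actually defines the spatial null space is not stated in the abstract, so the protected directions, function names, and numbers here are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt orthonormalization; near-dependent vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_out(grad, protected):
    """Remove from `grad` every component lying in span(protected)."""
    g = list(grad)
    for b in orthonormal_basis(protected):
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g

# Hypothetical directions that the spatial stage has already fixed.
spatial_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
temporal_grad = [3.0, 4.0, 5.0]

safe_update = project_out(temporal_grad, spatial_directions)
# safe_update is approximately [0.0, 0.0, 5.0]
```

The projected update has zero inner product with each protected direction, so a step along it leaves those directions unchanged to first order.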

2605.21029 2026-05-21 cs.CL

Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings

Stephen Meisenbacher, Peter Norlander

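AI summary: Using job posting data, this paper investigates how to build a clearer taxonomy of AI skills and tasks. It proposes TaxonomyBuilder as a blueprint for systematic study and shows that filtering the input data yields better domain-specific coverage.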

Comments: 14 pages, 2 figures, 8 tables. Accepted to CustomNLP4U 2026

英文摘要

Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing AI skills in the workplace, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose TaxonomyBuilder as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that less data can provide more clarity: filtering inputs to TaxonomyBuilder provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.

2605.21027 2026-05-21 cs.CL cs.AI

Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs

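Gundeep Singh, Parsa Kavehzadeh, Jing Xia, Xue-Yong Fu, Julien Bouvier Tremblay, Md Tahmid Rahman Laskar, Vincent Lum, Shashi Bhushan TN

AI summary: This paper presents Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs, addressing the reliability and compliance problems that conventional Text-to-SQL systems face in enterprise settings.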

Comments: The first four authors contributed equally to this work

Abstract

Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.

2605.21026 2026-05-21 cs.RO

Component Influence-Driven Fastener Reduction for Robotic Disassemblability-Aware Design Simplification

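Takuya Kiyokawa, Tomoki Ishikura, Shingo Hamada, Genichiro Matsuda, Kensuke Harada

AI summary: This paper proposes an analytical framework that guides robotic disassemblability-aware design simplification through fastener reduction. Using a CAD model and its automatically generated Contact-Connection-Constraint (CCC) graph, the framework translates robotic disassembly sequence planning outcomes into component influence scores that guide the simplification.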

Comments: 7 pages, 8 figures

Abstract

To accelerate automated remanufacturing, robotic disassembly must be considered during the product design phase. However, designers currently lack quantitative feedback to identify which structural elements hinder robotic operations. To address this, this study proposes an analytical framework that provides actionable redesign guidance focused on fastener reduction, as fasteners are numerous and ubiquitous components found in almost all manufactured products. Using a Computer-Aided Design (CAD) model and its automatically generated Contact-Connection-Constraint (CCC) graph, the framework translates robotic disassembly sequence planning outcomes into component influence scores. These scores reflect how often a component causes structural constraint violations or evaluation objective deteriorations in the robotic disassembly sequence. To visually highlight structural hindrances, the framework projects these scores onto the CAD geometry as 3D heatmaps. The system then analytically simulates the removal of highly influential fasteners. It reports the expected reductions in structural constraints, tool changes, and robot travel distances, while preventing structurally unsafe modifications by evaluating geometric stability metrics. Experiments on seven household appliances demonstrate that the framework successfully targets redundant fasteners. Removing the recommended fasteners simplified the structural dependencies by eliminating between 8 and 132 structural constraints on the graph depending on each product's structural configuration. Furthermore, it improved robotic operational efficiency by eliminating unnecessary tool change operations and shortening travel distances by 165 to 1675 millimeters wherever structurally permissible.

2605.21001 2026-05-21 cs.CV

DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars

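Daniel Eskandar, Berna Kabadayi, Garvita Tiwari, Gerard Pons-Moll

AI summary: This paper presents DAMA, which uses a dedicated representation and reconstruction method to produce physically plausible clothed avatars with a controllable multi-layer structure, clean garment separation, and explicit stacking control.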

Abstract

Existing 3D clothed avatar reconstruction methods achieve high visual fidelity but ignore geometric structure and physical plausibility. They either model clothed humans as a single deformable surface or attempt garment disentanglement without enforcing geometric constraints, resulting in ambiguous garment boundaries and no control over stacking or layer ordering. To address these limitations, we introduce DAMA (Disentangled body-Anchored Gaussians for Controllable Multi-layered Avatars), a 3D avatar reconstruction method that produces physically plausible clothed avatars through a dedicated representation and reconstruction method. At the representation level, we bind Gaussians to SMPL-X faces using barycentric in-plane coordinates and a positive normal offset. Based on this parameterization, the reconstruction method lifts 2D segmentations to body-anchored Gaussians, refines layers using topology-guided correction, and jointly optimizes geometry and appearance. DAMA is the first Gaussian avatar reconstruction method from multi-view images to achieve physically plausible layering, clean garment separation, and explicit stacking control. On the full 4D-DRESS dataset (82 scans), it achieves state-of-the-art performance in geometry reconstruction, garment separation, penetration rate, and penetration depth. The representation further supports user-defined garment reordering and fast conversion of body-conforming garments to simulation-ready meshes. Project Page: https://danieleskandar.github.io/dama/
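
The binding described in the abstract (barycentric in-plane coordinates plus a positive normal offset on a mesh face) can be sketched as follows. This is only an illustration under stated assumptions: the function name `anchored_position`, the softplus used to keep the offset positive, and the toy triangle are hypothetical and are not taken from the paper or from any SMPL-X library.

```python
import math


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return [x - y for x, y in zip(a, b)]


def cross(a, b):
    """Cross product of two 3D vectors."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def anchored_position(tri, bary, raw_offset):
    """Place a point on triangle `tri` and lift it along the face normal.

    tri        : three 3D vertices (a, b, c) of one mesh face
    bary       : barycentric weights (u, v, w), assumed to sum to 1
    raw_offset : unconstrained scalar; softplus maps it to a positive
                 offset (an assumption, chosen to keep the point outside)
    """
    a, b, c = tri
    u, v, w = bary
    n = cross(sub(b, a), sub(c, a))
    length = math.sqrt(sum(x * x for x in n))
    n = [x / length for x in n]
    offset = math.log1p(math.exp(raw_offset))  # softplus, always > 0
    return [u * a[i] + v * b[i] + w * c[i] + offset * n[i] for i in range(3)]


# Toy face in the z = 0 plane; its normal points along +z.
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pos = anchored_position(tri, (1 / 3, 1 / 3, 1 / 3), raw_offset=0.0)
```

Because the offset is constrained to be positive, the anchored point always sits on the outward side of the face in this sketch. That property is what makes an explicit layer ordering possible when several garments are anchored to the same body surface.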

2605.20998 2026-05-21 cs.CL cs.AI

Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis

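Yan Xia, Zhuangzhuang Pan, Amirrudin Kamsin, Chee Seng Chan

AI summary: This paper proposes DABS, a framework that encodes each sentence once to build a reusable, depth-ordered substrate, letting multi-aspect sentiment analysis keep competitive performance while cutting end-to-end computation by up to 60%.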

Comments: Accepted at ACL2026 (main). Our solution (DABS) reads the sentence once, then lets each aspect selectively query the right tokens and Transformer depths, cutting redundant computation while preserving ATSA accuracy

Abstract

Aspect-Term Sentiment Analysis (ATSA) in multi-aspect sentences faces a fundamental tradeoff between efficiency and expressiveness. Existing models either re-encode the sentence for each aspect or rely on static use of deep representations, leading to redundant computation and limited adaptivity. We argue that Transformer depth is a costly, queryable resource, and propose DABS, a single-pass inference framework that encodes each sentence once to construct a reusable, depth-ordered substrate. Each aspect then queries this shared representation to selectively read relevant tokens and abstraction levels, without re-encoding. This decouples shared sentence encoding from lightweight, aspect-conditioned readout. Experiments on four ATSA benchmarks show that DABS achieves competitive performance while reducing end-to-end computation by up to 60% in multi-aspect settings (M >= 2). Further analyses indicate that adaptive depth querying is most beneficial for linguistically complex cases such as negation and contrast. Code is publicly available at https://github.com/panzhzh/acl-dabs

2605.20997 2026-05-21 cs.CV cs.AI cs.LG physics.comp-ph

Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data

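Islam Mansour, Ronny Haensch, Irena Hajnsek, Konstantinos Papathanassiou

AI summary: This paper presents a hybrid approach combining machine learning with a physical model, using TanDEM-X interferometric coherence measurements and Landsat optical data to improve forest height estimation. Extending the feature space reduces height/structure and baseline/terrain-slope ambiguities; experiments show RMSE and MAE reductions of 13.5% and 16.6%, respectively.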

DOI: 10.1109/LGRS.2026.3693644

Abstract

Integrating machine learning (ML) with physical models (PM) has emerged as a promising way of retrieving geophysical parameters from remote sensing data. In this context, an ML model for estimating forest height from TanDEM-X interferometric coherence measurements has recently been proposed that constrains the learning process through a PM. While the features used for training and inversion were selected to ensure the physical consistency of the solutions, they could not resolve all height/structure and baseline/terrain slope ambiguities in the data. To improve this, an extension of the feature space with optical Landsat data is proposed, which can provide complementary information on forest type or structure. The extended model is applied and validated on several TanDEM-X acquisitions over the Lopé National Park site in Gabon and assessed against airborne LiDAR measurements. Results show a 13.5% reduction in RMSE and a 16.6% reduction in MAE compared to the original hybrid model, confirming the added value of multispectral inputs.
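
For readers unfamiliar with the reported metrics, the following minimal sketch shows how RMSE, MAE, and a relative reduction between two models are computed. The height values are made-up toy numbers for illustration only; they are not data from the paper, and the function names are hypothetical.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between predictions and reference heights."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mae(pred, ref):
    """Mean absolute error between predictions and reference heights."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Toy forest heights in metres (hypothetical, not from the paper).
lidar_ref = [20.0, 30.0, 40.0]
baseline_pred = [24.0, 27.0, 45.0]  # stand-in for the original model
extended_pred = [22.0, 29.0, 43.0]  # stand-in for the extended model

mae_gain = relative_reduction(mae(baseline_pred, lidar_ref),
                              mae(extended_pred, lidar_ref))
rmse_gain = relative_reduction(rmse(baseline_pred, lidar_ref),
                               rmse(extended_pred, lidar_ref))
```

RMSE penalises large errors more heavily than MAE, which is why the two reductions reported in an evaluation like this one generally differ.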

2605.20996 2026-05-21 cs.LG math.OC

Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting

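Hojin Ko, Jeonggyu Huh

AI summary: This paper proposes a Pontryagin-guided direct policy optimization framework (PG-DPO) for problems with non-exponential discounting. It abandons the recursive approach and combines Pontryagin's maximum principle with Monte Carlo rollouts to improve the accuracy and stability of dynamic programming.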

2605.20994 2026-05-21 cs.CL cs.AI

Towards Context-Invariant Safety Alignment for Large Language Models

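Yixu Wang, Yang Yao, Xin Wang, Yifeng Gao, Yan Teng, Xingjun Ma, Yingchun Wang

AI summary: This paper proposes a context-invariant approach to safety alignment, introducing Anchor Invariance Regularization (AIR) to improve model robustness across contexts and make safety constraints more resistant to adversarial framings.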

Comments: ICML 2026

Abstract

Preference-based post-training aligns LLMs with human intent, yet safety behavior often remains brittle. A model may refuse a harmful request in a standard prompt but comply when the same intent is wrapped in adversarial wording. We suggest that robust safety requires context-invariant alignment, where behavior depends on the underlying intent rather than surface form. Enforcing invariance is difficult in alignment because not all training signals are equally trustworthy; for some prompt variants we can obtain verifiable feedback (e.g., multiple-choice), while for open-ended variants we typically rely on noisy, gameable reward proxies (e.g., learned judges). As a result, standard symmetric invariance regularizers can reduce cross-context discrepancies by lowering performance on reliable variants instead of improving open-ended robustness. To address this, we introduce Anchor Invariance Regularization (AIR), which treats verifiable prompts as anchors and uses a stop-gradient target to regularize only the open-ended variants toward the anchor performance. AIR is implemented as a plug-in auxiliary loss and combined with group-based preference optimization (e.g., GRPO) via heterogeneous prompt grouping. Across Safety, Moral Reasoning, and Math, AIR improves context invariance, boosting in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, making safety constraints robust to adversarial framings.
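
The contrast the abstract draws between a symmetric invariance regularizer and a stop-gradient anchor can be illustrated with hand-written gradients. This is a toy sketch, not the AIR loss itself: the squared-difference penalty, the scalar "performance" values, and the function names are assumptions for illustration. In an autodiff framework, the constant-target behaviour would typically come from an operation such as `jax.lax.stop_gradient` or a detached tensor in PyTorch.

```python
def symmetric_penalty(anchor, open_ended):
    """Squared gap with gradients flowing to BOTH variants."""
    diff = anchor - open_ended
    loss = diff ** 2
    grads = {"anchor": 2.0 * diff, "open_ended": -2.0 * diff}
    return loss, grads


def anchored_penalty(anchor, open_ended):
    """Squared gap where the anchor is a constant (stop-gradient) target."""
    diff = anchor - open_ended
    loss = diff ** 2
    # No gradient reaches the anchor, so only the open-ended variant moves.
    grads = {"anchor": 0.0, "open_ended": -2.0 * diff}
    return loss, grads


# Toy scores (hypothetical): the verifiable variant does better.
anchor_score, open_score = 0.9, 0.6
_, sym = symmetric_penalty(anchor_score, open_score)
_, anc = anchored_penalty(anchor_score, open_score)

# One gradient-descent step on each penalty with step size 0.1.
step = 0.1
sym_anchor_after = anchor_score - step * sym["anchor"]
anc_anchor_after = anchor_score - step * anc["anchor"]
anc_open_after = open_score - step * anc["open_ended"]
```

In this toy setting the symmetric penalty closes the gap partly by pulling the anchor score down, which is the failure mode the abstract describes. The anchored version leaves the anchor where it is and only raises the open-ended score.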

2605.20989 2026-05-21 cs.LG q-bio.GN

Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport

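Mehmet Yigit Balik, Harri Lähdesmäki

AI summary: This paper proposes a generative framework that models population trends with a latent heteroscedastic Gaussian process and aligns generated and observed population distributions through optimal transport. By capturing biological heterogeneity, it achieves state-of-the-art performance on complex interpolation and extrapolation benchmarks.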

Abstract

Single-cell RNA sequencing provides insights into gene expression at single-cell resolution, yet inferring temporal processes from these static snapshot measurements remains a fundamental challenge. Current approaches utilizing neural differential equations and flows are sensitive to overfitting and lack careful consideration of biological variability. In this work, we propose a generative framework that models population trends using a latent heteroscedastic Gaussian process (GP) approximated by Hilbert space methods. To address the absence of genuine cell trajectories, we leverage an optimal transport (OT) objective that aligns generated and observed population distributions. Our method explicitly captures biological heterogeneity by incorporating cell-specific latent time and cell type conditioning to disentangle temporal asynchrony and trajectories to different cell types. We demonstrate state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduce a novel gradient-based strategy for inferring perturbation trajectories.

2605.20978 2026-05-21 cs.LG

Point Cloud Sequence Encoding for Material-conditioned Graph Network Simulators

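Philipp Dahlinger, Balázs Gyenes, Niklas Freymuth, Luca Geminiani, Tobias Würth, Johannes Mitsch, Nadja Klein, Luise Kärger, Gerhard Neumann

AI summary: This paper presents PEACH, a framework that uses point cloud sequence encoding to adapt to unseen physical properties, improving the accuracy of zero-shot sim-to-real transfer while being more practical for real-world deployment.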

Comments: 9 pages + appendix, 7 figures. Submitted to the 40th Conference on Neural Information Processing Systems (NeurIPS 2026)

Abstract

Graph Network Simulators (GNSs) have emerged as powerful surrogates for complex physics-based simulation, offering inherent differentiability and orders-of-magnitude speedups over traditional solvers. However, GNSs typically assume access to the underlying material parameters, such as stiffness or viscosity, severely limiting their utility in realistic experimental settings. While recent meta-learning approaches address the parameter dependency by inferring properties from mesh trajectories, reconstructing a mesh from an observed scene is challenging. In this work, we introduce Point Cloud Encoding for Accurate Context Handling (PEACH), a novel framework that applies in-context learning on point clouds to adapt a learned simulator to unseen physical properties during inference. Our approach relies on a novel spatio-temporal point cloud sequence encoder, as well as two forms of auxiliary supervision to help improve simulation fidelity. We demonstrate that PEACH is capable of accurate zero-shot sim-to-real transfer on a challenging, dynamic scene. Experiments on simulation scenes show that PEACH even outperforms mesh-based baselines on prediction accuracy, while being much more practical for real-world deployment.

2605.20973 2026-05-21 cs.CV

Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines

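Dibyayan Patra, Simit Raval, Pasindu Ranasinghe, Bikram Banerjee, Ismet Canbulat

AI summary: This paper presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. A unified workflow of structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation produces an integrated 3D visualisation of discontinuity planes and bolt vectors for assessing their spatial intersections and geometric relationships, while a complementary stereographic analysis evaluates overall bolting geometric effectiveness.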

Abstract

The effectiveness of rock support in underground mines depends on the interaction between installed rock bolts and the structural fabric of the surrounding rock mass. However, discontinuity characterisation and rock bolt identification are commonly treated as separate tasks, limiting their value for integrated support assessment. This study presents an automated framework for integrated rock support visualisation using 3D point clouds of underground mine excavations. The framework integrates structure mapping, rock bolt identification, discontinuity plane fitting, and bolt orientation estimation into a unified workflow optimised for accuracy and computational efficiency. The outputs are used to generate an integrated 3D visualisation of fitted discontinuity planes and bolt vectors, enabling direct assessment of their spatial intersections and geometric relationships. A complementary stereographic analysis of discontinuity poles and bolt orientations is also performed to evaluate overall bolting geometric effectiveness relative to the mapped structural fabric. Additionally, bolt-level quality metrics, including exposed protrusion length and deviation from the local roof normal, are visualised to support assessment of installation quality. The proposed framework is demonstrated on real underground metal mine scans, producing accurate structure mapping and rock bolt identification results in medium-scale point clouds. Overall, the study provides a practical step towards automated, integrated geotechnical assessment of rock support effectiveness without requiring manual measurements or additional in-situ data acquisition.

2605.20971 2026-05-21 cs.CV cs.AI cs.CR

Comparative Evaluation of Deep Learning Models for Fake Image Detection

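Akhitha Pakala, Mohammed Mahir Rahman, Shahzad Memon, Tauseef Ahmed

AI summary: This study compares four pretrained CNN architectures for fake image detection under a unified preprocessing and training pipeline. VGG16 achieves the highest accuracy, while EfficientNetB0 is more sensitive to fake images but less reliable on real ones. The study points to the need for balanced datasets, advanced augmentation, and fairness-aware training to build reliable fake image detection systems.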

Journal ref: 6th International Conference on Computational Intelligence & Internet of Things (ICCIIoT), 2026
Comments: Accepted at ICCIIoT26 and waiting to be indexed

Abstract

The growing sophistication of GAN-based image manipulation presents significant challenges for digital forensics. This study compares the performance of four pretrained CNN architectures including VGG16, ResNet50, EfficientNetB0, and XceptionNet for fake image detection using a unified preprocessing and training pipeline. A dataset of real and manipulated images was processed through resizing, normalization, and augmentation to address class imbalance and improve generalization. Models were evaluated using Accuracy, Precision, Recall, F1-score, and ROC-AUC. VGG16 achieved the highest accuracy at 91%, with XceptionNet, ResNet50, and EfficientNetB0 each reaching 90%. EfficientNetB0 showed stronger sensitivity to fake images but reduced reliability on real samples, reflecting imbalance-driven bias. Limitations include dataset imbalance, overfitting, and limited interpretability, which affect cross-domain robustness. The study provides a reproducible baseline and underscores the need for balanced datasets, advanced augmentation, and fairness-aware training to develop reliable fake image detection systems.

2605.20967 2026-05-21 cs.CL

ArPoMeme: An Annotated Arabic Multimodal Dataset for Political Ideology and Polarization

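Wajdi Zaghouani, Kais Attia, Md. Rafiul Biswas, Fadhl Eryani

AI summary: This paper presents ArPoMeme, a dataset for analysing the multimodal and ideological dimensions of Arabic political memes. Custom tooling enables large-scale annotation, and the analysis reveals patterns of ideological polarization.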

Comments: Accepted at LREC 2026 Main Conference

Abstract

Memes have become a prominent medium of political communication in the Arab world, reflecting how humor, imagery, and text interact to express ideological and cultural positions. Despite the centrality of memes to online political discourse, there is a lack of systematically curated resources for analyzing their multimodal and ideological dimensions in Arabic. This paper presents ArPoMeme, a large-scale dataset of approximately 7,300 Arabic political memes categorized by ideological orientation, including Leftist, Islamist, Pan-Arabist, and Satirical perspectives. The dataset captures the diversity of Arabic meme ecosystems by grounding classification in the self-identification of public Facebook pages and groups that produce and disseminate these memes. To ensure both scale and accuracy, we designed a semi-automated data collection pipeline combining Playwright-based Facebook scraping with Google Drive synchronization, followed by text extraction using the Qwen2.5-VL-7B vision language model. The extracted text was manually verified and annotated for three polarization dimensions: Us vs. Them framing, Hostility toward out-groups, and Calls to action. Annotation was conducted through a custom Streamlit-based interface supporting distributed labeling, real-time tracking, and version control. The resulting dataset links visual content, textual messages, and ideological orientation, enabling fine-grained analysis of political antagonism, mobilization, and humor. Quantitative analysis of the annotated corpus reveals strong asymmetries in antagonistic framing across ideological groups, with Islamist and satirical memes exhibiting the highest levels of hostility and mobilization cues. The dataset and the annotation tool offer a reproducible and publicly available resource for studying Arabic political discourse, multimodal ideology detection, and polarization dynamics.

2605.20965 2026-05-21 cs.CV cs.AI

Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy

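Yutong Xie, Zhenglin Hua, Ran Wang, Wing W. Y. Ng, Xizhao Wang, Yuheng Jia

AI summary: This paper proposes a hallucination mitigation method based on inter-layer visual attention discrepancy, which strengthens attention to visual evidence to reduce visual forgetting, so that the model finds the correct visual evidence without forgetting it.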

Comments: Accepted by ICML 2026

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable performance on a wide range of vision-language tasks. Despite this progress, they are still prone to hallucination, generating responses that are inconsistent with visual content. In this work, we find that LVLMs tend to hallucinate when they pay insufficient attention to the correct visual evidence and gradually forget it during the generation process. We empirically find that although LVLMs overall attend insufficiently to visual evidence, they exhibit sensitivity to the correct visual evidence in specific layers, with notable inter-layer discrepancy. Motivated by this observation, we propose a novel hallucination mitigation method that enhances visual evidence based on Inter-Layer Visual Attention Discrepancy (ILVAD). Specifically, we obtain the attention weights from early generated tokens to visual tokens across layers and identify the tokens that are repeatedly activated as visual evidence, forming a saliency map. We then enhance attention to visual evidence during generation through the saliency map to reduce visual forgetting. In addition, we leverage the saliency map to obtain attention scores of generated text to visual evidence, in order to select and emphasize text tokens that are strongly grounded in visual evidence. Our method is training-free and plug-and-play. Multiple benchmark evaluations conducted on five recently released models show that our method can consistently mitigate hallucinations in different LVLMs over various architectures. Code is available at https://github.com/ytx-ML/ILVAD.

2605.20963 2026-05-21 cs.CV

Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method

Yihang Luo, Jun Chen, Chao Xiao, Yingqian Wang, Zhaoxu Li, Qiang Ling, Xu He, Nuo Chen, Gaowei Guo, Hongge Li, Miao Li, Longguang Wang, Yulan Guo, Li Liu, Wei An, Zhijie Chen

AI summary: This paper introduces UAVNet-MS, a new multispectral dataset, and MFDNet, a new method for fine-grained small-UAV detection, addressing the degraded performance of conventional RGB systems at small scales.

Comments: submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Abstract

The proliferation of unmanned aerial vehicles (UAVs) has created an urgent demand for precise UAV monitoring. Existing RGB-based systems rely on spatial cues that degrade at small scales, particularly with high inter-type similarity, target-clutter ambiguity, and low contrast. Multispectral imaging (MSI) encodes material-aware spectral signatures, yet MSI-based fine-grained small-UAV detection remains underexplored due to a lack of dedicated datasets. We introduce UAVNet-MS, the first multispectral dataset for fine-grained small-UAV detection, comprising 15,618 temporally synchronized RGB-MSI data cubes (1440x1080) with bounding box annotations. The dataset features challenging small objects (93.7% <= 32^2 pixels, average 18^2 pixels, ~0.02% image area) under low contrast. We propose MFDNet, a dual-stream baseline addressing array-induced parallax and spatial-spectral fusion. Extensive evaluation under RGB-only, MSI-only, and RGB+MSI protocols against 20 detectors shows that MFDNet achieves a +6.2% AP50 improvement over the best RGB-only methods, demonstrating that spectral cues provide complementary material evidence beyond spatial cues. This work provides a foundational dataset, a strong baseline, and a benchmark for multispectral UAV monitoring research.
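
The object-size figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (an average object of 18^2 pixels, an upper bound of 32^2 pixels, and a 1440x1080 frame); the function name is hypothetical.

```python
def area_fraction_percent(side_px, width, height):
    """Share of the image area covered by a side_px x side_px object, in percent."""
    return 100.0 * (side_px ** 2) / (width * height)


# Values taken from the abstract: 1440x1080 frames, 18^2 px average object.
avg_fraction = area_fraction_percent(18, 1440, 1080)

# Upper bound for the 93.7% of objects that are at most 32^2 px.
max_small_fraction = area_fraction_percent(32, 1440, 1080)
```

The average object covers roughly 0.02% of the frame, consistent with the figure in the abstract, and even a 32^2-pixel object stays below 0.1% of the image area.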

2605.20961 2026-05-21 cs.CV

Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

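Zhangchi Hu, Wenzhang Sun, Xiangchen Yin, Jiahui Yuan, Chunfeng Wang, Hao Li, Kun Zhan, Xiaoyan Sun

AI summary: This paper proposes PREX, a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles, addressing region preservation, disocclusion, and extrapolation in 4D video editing and improving the accuracy and stability of edits.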

Comments: 23 pages, 13 figures

Abstract

Existing 4D-driven video diffusion models primarily target plausible generation, but faithful 4D editing requires preserving source-observed regions while synthesizing disoccluded or out-of-view content. We identify Evidence-Role Mismatch: reliable source-backed evidence, unreliable rendered cues, and unsupported regions are entangled in a single conditioning signal, causing preservation drift, ghosting, and unstable extrapolation. We propose PREX (Preserve, Reveal, Expand), a region-aware framework that decomposes the target spatiotemporal volume into Preserve, Reveal, and Expand roles according to observation support and scene extent. PREX builds observation-backed appearance cues with calibrated confidence and injects them into a frozen video diffusion backbone through a region-aware adapter, trained with proxy tasks without requiring paired edited videos. We further introduce PREBench, a diagnostic benchmark with curated edits, region-role masks, and human-aligned metrics that complement global video-quality and 4D-control evaluations. Experiments show that PREX reduces region-structured failures while maintaining strong visual quality and 4D edit control capability. Project Page: https://ricepastem.github.io/PREX-Open
