arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.25955 2026-05-26 cs.CL cs.AI cs.LG

QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability

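Bo Zou, Chao Xu

AI summary: Proposes the QUIET benchmark, which evaluates the creative generation capability of large language models objectively through multi-blank cascaded story cloze and an information-theoretic automated scoring protocol.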

English abstract

Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative generation capability; rubric-based scoring and LLM-as-Judge methods rely on subjective dimension assessment or natural language model outputs, and cannot provide objective, automated scoring mechanisms. This paper proposes QUIET (Quality Understanding via Interlocked Evaluation Testing), a diagnostic benchmark for LLM creative capability based on multi-blank cascaded story cloze. QUIET sets N blanks (10-20) in a story with complete structure, with each blank accompanied by an explicit content constraint, and cascade dependency relationships between blanks -- the content filled into earlier blanks constrains the feasible solution space for later blanks. The evaluated model (or human participants) fills all blanks in open-ended generation mode; the results are scored by an information-theoretic automated scoring protocol without human grading. The scoring protocol directly operationalizes the "calibrated surprise" theoretical framework (Zou & Xu, 2026a). For each blank k, a composite score is computed: score = satisfy * (1 + lambda * surprise), where lambda = 1.0. Here, "satisfy" measures how well the blank filling satisfies the content constraint (objective logical reasoning judgment, not subjective aesthetic scoring), and "surprise" measures the degree of surprise given that the constraint is satisfied. Creative answers that do not satisfy the constraint score zero; answers that satisfy the constraint but are mediocre score low; answers that satisfy the constraint and are surprising score high.
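The composite score in the QUIET abstract can be written out directly. The sketch below is a minimal illustration, not the authors' implementation: the abstract gives only the formula `score = satisfy * (1 + lambda * surprise)` with `lambda = 1.0`, so the `[0, 1]` range for both inputs and the averaging over blanks in `story_score` are assumptions, and both function names are hypothetical.

```python
def blank_score(satisfy: float, surprise: float, lam: float = 1.0) -> float:
    """Composite score for one blank: satisfy * (1 + lam * surprise)."""
    # Assumption: both inputs are normalised to [0, 1]; the abstract does not say.
    if not (0.0 <= satisfy <= 1.0 and 0.0 <= surprise <= 1.0):
        raise ValueError("satisfy and surprise are assumed to lie in [0, 1]")
    return satisfy * (1.0 + lam * surprise)


def story_score(blanks: list[tuple[float, float]], lam: float = 1.0) -> float:
    """Mean per-blank score. Averaging is an assumption, not taken from the abstract."""
    return sum(blank_score(s, u, lam) for s, u in blanks) / len(blanks)


# The three cases named in the abstract:
violates = blank_score(satisfy=0.0, surprise=0.9)    # surprising but breaks the constraint
mediocre = blank_score(satisfy=1.0, surprise=0.0)    # satisfies the constraint, no surprise
surprising = blank_score(satisfy=1.0, surprise=1.0)  # satisfies the constraint and surprises
```

Because `satisfy` multiplies the whole expression, surprise can only add to the score of an answer that already meets its constraint, which matches the ordering the abstract describes.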

2605.25954 2026-05-26 cs.LG cs.AI

Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization

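Mengfan Liu, Da Zheng, Junwei Su, Chuan Wu

AI summary: To address the lack of verifiable step-level supervision for LLMs in tensor program optimization, proposes the Step-TP dataset, which enables reliable multi-step optimization through structured chain-of-thought reasoning and atomic step-level supervision.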

English abstract

Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decision process, but existing datasets provide only end-to-end optimized program pairs using token-inefficient representations, lacking verifiable step-level supervision and interpretability. As a result, LLMs struggle to make reliable single-step decisions in large combinatorial optimization spaces. We introduce Step-TP, a post-training dataset for tensor program optimization that provides grounded, atomic, step-level supervision with structured chain-of-thought (CoT) reasoning. Step-TP forms a closed reasoning loop over intermediate program states, enabling reliable multi-step optimization rather than outcome imitation. Its design is guided by four principles: (i) a token-efficient, verifiable intermediate representation (IR) that deterministically lowers to TVM TIR; (ii) atomic and composable optimization strategies that decompose complex trajectories into interpretable single-step decisions; (iii) structured CoT supervision coupled with explicit IR-to-IR state transitions; and (iv) strategy filtering to balance coverage while preventing shortcut exploitation. The dataset and implementation are available at a GitHub link, https://github.com/LIUMENGFAN-gif/StepTP.

2605.25952 2026-05-26 cs.CV cs.AI

VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

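Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang

AI summary: Proposes the VEN-VL framework, which follows an enrich-then-compact strategy and uses a visual ensemble MoE with adaptive routing to increase the information capacity and density of visual tokens, balancing performance and efficiency on complex visual tasks with few compressed tokens.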

English abstract

Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual cue and their reliance on heuristic pruning strategies with coarse attention alignment create a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows an enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact it with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, facilitating the preservation of crucial information. Experimental results demonstrate the superiority of our approach on complex visual tasks with few information-condensed tokens, effectively bridging the gap between performance and efficiency.

2605.25951 2026-05-26 cs.SD

Score-Agnostic Structure Analysis in Large-Scale Performance Datasets

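Patricia Hu, Silvan Peter, Gerhard Widmer

AI summary: To address structural inconsistency in large-scale datasets of automatically transcribed piano performances, proposes a score-agnostic grouping method based on sequence alignment and hierarchical clustering, replacing ground-truth accuracy with musical coherence as the evaluation criterion.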

Comments
published at the Music Encoding Conference (MEC) 2026
English abstract

In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in quality. In the case of classical music, performances often differ not only in expressive aspects such as tempo, but also in their structural interpretation of the score (including repeat patterns and edition-specific variants). To meaningfully use large-scale transcribed datasets for performance research, transcriptions of the same piece must be grouped according to their underlying structural realisation to support valid comparison. We address this by applying sequence-to-sequence alignment followed by hierarchical clustering: we create pairwise alignments for all pairs of transcriptions of a given piece, and use the alignment cost and (dis)similarity of performed sequence lengths to resolve structural mismatches as features for grouping. We propose this approach as a first step towards automatically evaluating large-scale transcribed datasets that lack ground-truth score and/or audio, shifting the evaluation criterion from truth-based accuracy to musical coherence and plausibility. We demonstrate our score-agnostic approach on around 1,500 transcriptions of 88 compositions from a recently published large-scale transcribed piano performance dataset.
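The grouping idea in this abstract (pairwise alignment, then clustering on alignment cost and length dissimilarity) can be sketched in a few lines. This is a toy stand-in under stated assumptions, not the authors' pipeline: transcriptions are reduced to non-empty lists of MIDI pitches, plain edit distance replaces their alignment method, the 50/50 weighting `w_len` and the `threshold` are arbitrary, and the clustering is the threshold cut of a single-linkage dendrogram, implemented with union-find.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two note sequences (alignment cost stand-in)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def dissimilarity(a, b, w_len=0.5):
    """Blend of normalised alignment cost and relative length difference."""
    longest = max(len(a), len(b))
    align_cost = edit_distance(a, b) / longest
    length_gap = abs(len(a) - len(b)) / longest
    return (1 - w_len) * align_cost + w_len * length_gap


def group_by_structure(seqs, threshold=0.2):
    """Single-linkage grouping: merge every pair closer than the threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if dissimilarity(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two hypothetical structural realisations of one piece: with and without a repeat.
A, B = [60, 62, 64, 65], [67, 69, 71, 72]
with_repeat, no_repeat = A + A + B, A + B
noisy_repeat = with_repeat[:5] + [61] + with_repeat[6:]  # one transcription error
noisy_plain = no_repeat[:2] + [63] + no_repeat[3:]       # one transcription error
groups = group_by_structure([with_repeat, no_repeat, noisy_repeat, noisy_plain])
```

In this toy case the two performances that take the repeat end up in one group and the two that skip it in another, even though each group contains a transcription error.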

2605.25949 2026-05-26 cs.LG cs.AI physics.comp-ph

Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers

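Shyam Sankaran, Hanwen Wang, Paris Perdikaris

AI summary: Proposes the WaveLiT architecture, in which small models (1-10M parameters) use a wavelet-multiscale prior for parameter efficiency, matching foundation models 100-1000 times larger on several PDE benchmarks, and shows that the failure modes of a prior provide a useful signal.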

English abstract

Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pattern of where they succeed and fail is itself informative about what they capture. We instantiate this argument in WaveLiT, an architecture combining a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000× their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern: strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows. The entire pipeline trains on a single GPU. The results suggest that small-model PDE performance is shaped by architectural inductive bias rather than scale, and that the structure of a prior's failures is a useful empirical signal about its content.
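To make "lossless multi-resolution" concrete, here is a minimal one-dimensional Haar transform. It is only an illustration of the general property that a discrete wavelet transform splits a signal into scales and can be inverted exactly; the abstract does not say which wavelet family or dimensionality WaveLiT uses, so the Haar choice, the 1-D setting, and all function names are assumptions. The input length must be divisible by `2 ** levels`.

```python
import math

S = 1 / math.sqrt(2)  # orthonormal Haar scaling factor


def haar_step(signal):
    """One level: pairwise averages (approximation) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return the coarsest approximation plus detail bands, finest band first."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose by undoing the levels from coarsest to finest."""
    for d in reversed(details):
        merged = []
        for a, b in zip(approx, d):
            merged += [(a + b) * S, (a - b) * S]
        approx = merged
    return approx


x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
coarse, bands = haar_decompose(x, levels=3)
restored = haar_reconstruct(coarse, bands)
```

The coefficient count stays at eight (1 + 4 + 2 + 1), so each band can serve as a set of tokens at its own resolution without discarding information, and `restored` equals `x` up to floating-point rounding.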

2605.25947 2026-05-26 cs.CV

A Pedestrian-Vehicle Interaction Benchmark and Annotation Framework for Unstructured Scenes via Uncalibrated Cameras

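Haoyang Peng, Qian Hu, Songan Zhang, Ming Yang

AI summary: To address the scarcity of pedestrian-vehicle interaction data in unstructured scenes, proposes an annotation framework based on uncalibrated surveillance video together with the PINNS dataset, which contains dense interaction trajectories and scene information from multiple countries and scenarios, to advance trajectory prediction research in complex mixed traffic.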

Comments
10 pages, 8 figures; project page available at https://github.com/Songan-Lab
English abstract

Predicting the interactions between pedestrians and vehicles is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction and shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at https://github.com/Songan-Lab.

2605.25944 2026-05-26 cs.CV cs.AI

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

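Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu

AI summary: Proposes EchoPilot, a training-free ultrasound video segmentation framework that needs only a single point click and a category name; it resolves initialization ambiguity with scale-space semantic prompting and reduces propagation drift with a reliability-gated memory, achieving state-of-the-art performance on multiple datasets.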

Comments
Early accepted to MICCAI 2026. Project page: https://keeplearning-again.github.io/EchoPilot/
English abstract

Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
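The reliability-gated memory update can be pictured as a guard in front of the segmentor's memory bank. The sketch below is hypothetical throughout: the abstract does not define the reliability measure, so `mask_confidence` (mean distance of per-pixel foreground probabilities from 0.5) is an assumed proxy, and the `GatedMemory` class, its `capacity`, and its `threshold` are illustrative names and values rather than EchoPilot's API.

```python
from collections import deque


def mask_confidence(probs):
    """Assumed proxy: how far foreground probabilities sit from 0.5, scaled to [0, 1]."""
    return sum(abs(p - 0.5) * 2 for p in probs) / len(probs)


class GatedMemory:
    """Fixed-size memory bank that skips updates from uncertain predictions."""

    def __init__(self, capacity=4, threshold=0.8):
        self.frames = deque(maxlen=capacity)  # oldest entry is evicted when full
        self.threshold = threshold

    def update(self, frame_id, probs):
        """Store the frame only if its mask is confident; return whether it was stored."""
        if mask_confidence(probs) >= self.threshold:
            self.frames.append(frame_id)
            return True
        return False  # memory stays frozen for this frame


memory = GatedMemory(capacity=2, threshold=0.8)
stored = [
    memory.update(0, [0.99, 0.01, 0.98]),  # confident mask
    memory.update(1, [0.60, 0.40, 0.55]),  # uncertain mask, memory frozen
    memory.update(2, [1.00, 0.00, 1.00]),  # confident mask
    memory.update(3, [0.95, 0.05, 0.90]),  # confident mask, evicts frame 0
]
```

A greedy update would have written frame 1 into memory; the gate keeps its uncertain mask out, so later frames are conditioned only on predictions that passed the check.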

2605.25943 2026-05-26 cs.LG

STaT: Resolving Shape Distortion in Non-Stationary Time Series via Tri-Modal Synergy

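Hui Cheng, Jinsheng Guo, Zhenhao Weng, Yan Qiao, Meng Li

AI summary: Proposes STaT, a multimodal architecture that aligns symbolic, temporal, and textual modalities to lower average error while reducing shape distortion, improving magnitude metrics by up to 8.9% and reducing shape distortion by up to 8.5% on eight benchmarks.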

English abstract

Recent research in time series forecasting frequently investigates the integration of textual and visual modalities with numerical models to better navigate non-stationary environments. Despite delivering solid numerical results, existing multi-modal approaches usually encounter a dilemma: prioritizing the minimization of average errors can result in excessively smooth forecasts that overlook essential fluctuations. To resolve this limitation, we introduce STaT, an innovative multimodal architecture for Symbolic-Temporal-Textual Alignment, which seamlessly unites three synergistic modalities. Specifically, the symbolic modality converts continuous time series into discrete tokens, facilitating the accurate identification of structural patterns and turning points; the temporal modality extracts inherent sequential dependencies; and the textual modality leverages domain semantics to steer the macroscopic forecasting trends. Comprehensive evaluations on eight real-world benchmarks indicate that STaT delivers exceptional performance, enhancing conventional magnitude indicators by up to 8.9% while simultaneously decreasing shape distortion by up to 8.5%.
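The symbolic modality turns a continuous series into discrete tokens. The abstract does not name the discretisation scheme, so the sketch below swaps in a SAX-style procedure (z-normalise, average over fixed-length segments, bin with standard-normal quartile breakpoints) purely as an illustration; the function name, the segment length, and the four-letter alphabet are assumptions.

```python
import statistics

# Quartile breakpoints of the standard normal, giving four equiprobable bins.
CUTS = (-0.6745, 0.0, 0.6745)
ALPHABET = "abcd"


def symbolize(series, segment=2):
    """SAX-style tokens: z-normalise, average each segment, then bin the averages."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series) or 1.0  # guard against constant series
    z = [(x - mean) / std for x in series]
    # Piecewise aggregate approximation: one mean value per segment.
    paa = [statistics.fmean(z[i:i + segment]) for i in range(0, len(z), segment)]
    # The token index is the number of breakpoints lying below the segment mean.
    return "".join(ALPHABET[sum(v > c for c in CUTS)] for v in paa)


tokens = symbolize([1, 1, 4, 4, 9, 9, 6, 6])
```

For this input the tokens are `abdc`: the series climbs through `a`, `b`, `d` and then drops back to `c`, so the turning point shows up as a change in token order rather than as a small numeric difference that an averaged loss could smooth away.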

2605.25942 2026-05-26 cs.CV cs.RO

LRDDv3: High-Resolution Long-Range Drone Detection Dataset with Range Information and Thermal Data

Knut Peterson, Zaid Mayers, Azmain Yousuf, Priontu Chowdhury, Asher Zaczepinski, Solmaz Arezoomandan, Reihaneh Maarefdoust, David Han

AI summary: Proposes the LRDDv3 dataset, containing 102,532 high-resolution long-range RGB images and 29,630 paired IR images with range information, to support long-range drone detection.

Comments
8 pages, 5 figures. Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA)
English abstract

Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, with applications ranging from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it is critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for more high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight on 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset provides comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement in enabling the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3/

2605.25941 2026-05-26 cs.CV

Where Concept Erasure Should Occur: Concept-Layer Alignment in Text-to-Video Diffusion Models

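Yiwei Xie, Ping Liu, Zheng Zhang

AI summary: Identifies a concept-layer topological alignment bottleneck and proposes CLEAR, a separability-driven optimization framework that achieves precise concept erasure in text-to-video diffusion models while preserving generation quality.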

Comments
Accepted by ICML 2026
English abstract

Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept-layer alignment leads to more precise concept suppression while preserving overall generative quality.
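Choosing a layer by concept versus non-target separability can be illustrated with a simple score. The abstract does not give CLEAR's separability-aware objective, so the sketch below substitutes a Fisher-style ratio (squared distance between class means over summed within-class variance) and a plain argmax over layers; the function names and the toy features are hypothetical.

```python
def fisher_separability(concept, other):
    """Squared distance between class means divided by total within-class variance."""
    dim = len(concept[0])

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def scatter(rows, mu):
        return sum(sum((r[d] - mu[d]) ** 2 for d in range(dim)) for r in rows) / len(rows)

    mu_c, mu_o = mean(concept), mean(other)
    between = sum((a - b) ** 2 for a, b in zip(mu_c, mu_o))
    within = scatter(concept, mu_c) + scatter(other, mu_o)
    return between / (within + 1e-12)  # small constant avoids division by zero


def select_layer(features_by_layer):
    """Return the layer with the highest separability, plus all per-layer scores."""
    scores = {
        layer: fisher_separability(concept, other)
        for layer, (concept, other) in features_by_layer.items()
    }
    return max(scores, key=scores.get), scores


# Toy features: (concept activations, non-target activations) per layer.
features = {
    0: ([[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [1.5, 1.5]]),  # entangled
    1: ([[0.0, 0.0], [0.2, 0.0]], [[3.0, 0.0], [3.2, 0.0]]),  # well separated
}
best_layer, layer_scores = select_layer(features)
```

Here layer 1 wins because its two groups sit far apart relative to their spread, which is the kind of depth the abstract argues an erasure edit should target.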

2605.25940 2026-05-26 eess.IV cs.CV

How Accurate are Video Quality Models for Diffusion-Based Video Super-Resolution?

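Benjamin Herb, Steve Göring, Alexander Raake, Rakesh Rao Ramachandra Rao

AI summary: Compares six upscaling methods in a subjective test to evaluate how accurately existing video quality models, both full-reference and no-reference, assess diffusion-based VSR; CNN-based full-reference models correlate best, but none is accurate enough to replace subjective testing.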

Comments
Accepted for the 18th International Conference on Quality of Multimedia Experience (QoMEX 2026)
English abstract

Recent video super-resolution (VSR) approaches use deep neural networks to enhance low-quality input videos and recover visual detail, with diffusion-based methods in particular showing promising results. In this paper, we investigate whether existing video quality models can be used to assess the performance of these diffusion-based VSR methods, by comparing model predictions with results from a subjective test. The study compares six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) applied to both compressed (AV1 and DCVC-RT) and uncompressed low-resolution videos, considering play-out on a UHD-1/4K screen. A range of full- and no-reference quality models are used to assess their applicability to this new type of quality degradation, focusing on within-sequence performance. The results highlight that CNN-based full-reference models, such as LPIPS, DISTS, and CVQA-FR, show significantly higher correlation coefficients than both conventional full-reference models and the tested no-reference models. Most models overestimate the overly sharp results of SCST, with VMAF mainly failing due to spatial inconsistencies introduced by Starlight Mini. None of the tested video quality models reaches sufficient accuracy to replace complementary subjective testing. The reference, degraded, and upscaled videos, as well as the user ratings and model scores, are made available with the paper at https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-VSR as open data.

2605.25939 2026-05-26 cs.LG cs.AI

From Latent Space to Training Data: Explainable Specialization in Minimal MLPs

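Enrique Alba, Ezequiel Lopez-Rubio

AI summary: Studies whether training biases make hidden neurons specialize in minimal one-hidden-layer MLPs and whether such specialization improves prototype-based reconstruction of the training data; coverage regularization raises the specialization ratio and lowers reconstruction error, whereas overlap penalties push prototype centers outside the convex hull.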

English abstract

We study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization improves prototype-based reconstruction of the training dataset from the learned weights. We consider Gaussian-activation MLPs of width equal to dataset size and compare three structural losses that respectively encourage coverage of the training samples, separation between neuron-induced prototypes, and low overlap of hidden responses, against the standard fitting baseline. Experiments on uniformly sampled one-dimensional datasets show a stable pattern from N = 3 to N = 100 across 480 controlled runs. Coverage regularization gives the lowest mean reconstruction error at every tested size and raises the prototype-usage specialization ratio relative to the standard baseline, while separation has mixed effects and overlap penalties are systematically harmful. We show that the harm is not an optimization failure: overlap-active approaches fit the data as well as overlap-free ones but route the optimizer to a degenerate equilibrium in which prototype centers are pushed outside the convex hull of the training inputs. Coverage cannot reward this expulsion and acts as an attractor: separation admits it only at large temperature and overlap admits it at the nominal hyperparameter choice. A direct τ-sweep on the separation-only mask and a prototype-position visualization at N = 100 confirm the mechanism. The findings yield a simple design principle for prototype-recoverability-aware training: every repulsive structural loss must be compensated by a compatible attractor, or it will collapse the latent geometry it was meant to refine.

2605.25937 2026-05-26 cs.CR cs.LG

Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation

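David Košťál, Martin Jureček

AI summary: Using the RawMal-TF collection of real-world malware, builds family- and type-labelled adversarial PE files with adversarial generators and evaluates evasion rates and the impact of poisoning attacks.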

English abstract

We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we construct two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, achieving evasion rates of 98.35% and 92.20% against the EMBER classifier, respectively. Each adversarial binary is accompanied by detailed metadata, including EMBER scores and VirusTotal classifications. We further demonstrate the susceptibility of malware classification pipelines to data poisoning attacks through a series of training experiments. Injecting fully mislabelled adversarial samples representing only 0.5% of the training data in the family-labelled dataset increases the evasion rate against the re-trained classifier from 26.1% to 92.8%. The dataset is publicly released to facilitate future research on adversarial malware, poisoning attacks, and the robustness of machine-learning-based malware detection systems.

2605.25933 2026-05-26 cs.LG cs.AI

Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data

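Nicolas Ricka, Gauthier Pellegrin, Denis A. Fompeyrine, Thomas Rohaly, Leah Enders, Heather Roy

AI summary: Proposes a machine learning method based on multivariate kernel density estimation that uses heart rate and skin conductance signals, with transfer learning from specific phobia data, to assess PTSD severity objectively, reaching 86% classification accuracy and a mean absolute error of 5.6.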

Comments
Submitted to a peer-reviewed journal, comments welcome
English abstract

Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts. Current clinical assessments of PTSD often rely on subjective evaluations, which can be time-consuming, costly, and prone to human bias. This study proposes a machine learning (ML) approach based on the multivariate kernel density estimation (MKDE) technique for the objective evaluation of PTSD severity. We collected heart rate (HR) and galvanic skin response (GSR) signals as well as PTSD Checklist - Military Version (PCL-M) labels from 21 participants during an immersive simulation. A fear-response model was trained on a public arachnophobia dataset, and predictive features of PTSD were extracted from the fear-response curves estimated on the military dataset. The model achieved an accuracy of 86% in classifying PTSD status, effectively distinguishing participants with and without PTSD (PCL-M threshold of 36). The average mean absolute error (MAE) of the models is 5.6, and it estimated a clinical PTSD severity scale with a mean absolute percentage error of 17%. Our algorithm demonstrates promising potential for enhancing estimation of PTSD severity and follow-up by offering an objective and low-effort evaluation approach using physiology. These findings suggest clinical utility in both screening and follow-up settings.
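The three figures reported in this abstract (classification accuracy at a PCL-M threshold of 36, mean absolute error, and mean absolute percentage error) follow standard definitions, sketched below. The scores in the example are invented for illustration and are not study data, and treating a score equal to the threshold as positive is an assumption the abstract does not settle.

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and predicted severity scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true scores must be non-zero)."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_at_threshold(y_true, y_pred, threshold=36):
    """Share of participants whose predicted status matches their true status."""
    # Assumption: a score at or above the threshold counts as PTSD-positive.
    hits = sum((t >= threshold) == (p >= threshold) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Invented PCL-M scores, for illustration only.
true_scores = [20, 40, 50]
predicted = [25, 36, 45]
error_points = mae(true_scores, predicted)
error_percent = mape(true_scores, predicted)
status_accuracy = accuracy_at_threshold(true_scores, predicted)
```

MAE is expressed in PCL-M points while MAPE is relative to each participant's true score, which is why the abstract can report 5.6 for one and 17% for the other.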

2605.25931 2026-05-26 cs.AI

Explore Before You Solve: The Speed-Depth Trade-off in Epistemic Agents for ARC-AGI-3

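Liew Keong Han

AI summary: A systematic analysis of all 25 public ARC-AGI-3 games finds that each is reachable by non-intelligent strategies; the paper proposes AERA, a three-phase epistemic agent, and formalises its performance within a speed-depth trade-off framework.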

Comments
22 pages, 3 figures. Code: https://github.com/farmountain/aera-arc3-paper (CC0)
英文摘要

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

2605.25929 2026-05-26 cs.MA cs.LG

Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?

多智能体系统是专家混合:谁成为影响者?

Franka Bause, Jonas Niederle, Martin Pawelczyk, Rebekka Burkholz

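AI summary: The paper analyzes multi-agent LLM deliberation through the Friedkin-Johnsen (FJ) opinion dynamics model, shows that input-dependent FJ parameters turn the system into a mixture of experts, and examines how influencers emerge from self-assessed confidence, perceived confidence, and initial alignment of views.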

详情
AI中文摘要

多智能体LLM协商的有效性不仅取决于智能体的个体预测,还取决于它们如何沟通和协作。我们通过Friedkin-Johnsen (FJ)意见动力学的视角研究这一机制,这是一个可处理的模型,用于分析多智能体系统中的固执、影响力和意见变化,并捕捉经验观察到的协商模式。我们表明FJ参数是输入依赖的,将多智能体协商转变为专家混合。这一视角意味着,当路由反映智能体能力时,多智能体系统可以胜过单个智能体和静态集成。由于能力在实践中是潜在的,我们分析了影响力如何通过可观察的代理建立:智能体的自我评估自信度、感知自信度以及与其他智能体观点的初始对齐。

英文摘要

The effectiveness of multi-agent LLM deliberation depends not only on the agents' individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and opinion change in multi-agent systems that captures empirically observed deliberation patterns. We show that the FJ parameters are input-dependent, turning multi-agent deliberation into a mixture of experts. This perspective implies that multi-agent systems can outperform single agents and static ensembles when routing reflects agent competence. Since competence is latent in practice, we analyze how influence is established through observable proxies: agents' self-assessed confidence, their perceived confidence, and initial alignment with other agents' views.
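
For readers unfamiliar with Friedkin-Johnsen dynamics, the sketch below shows the textbook update, in which each agent mixes a weighted average of its neighbours' current opinions with its own initial opinion. The parameters are fixed constants here; how the paper makes them input-dependent is not described in the abstract, and the function names are hypothetical.

```python
def fj_step(x, x0, W, susceptibility):
    """One Friedkin-Johnsen update.

    x: current opinions, x0: initial opinions,
    W: row-stochastic influence weights (W[i][j] = weight agent i gives agent j),
    susceptibility[i] in [0, 1]; 1 - susceptibility[i] is agent i's stubbornness.
    """
    n = len(x)
    updated = []
    for i in range(n):
        social = sum(W[i][j] * x[j] for j in range(n))
        updated.append(susceptibility[i] * social + (1 - susceptibility[i]) * x0[i])
    return updated


def fj_run(x0, W, susceptibility, steps=100):
    """Iterate the update for a fixed number of steps, starting from x0."""
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, W, susceptibility)
    return x
```

In this form the final opinions are a weighted mixture of the agents' initial opinions, which is the sense in which a stubborn, heavily weighted agent acts as an influencer.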

2605.25928 2026-05-26 cs.CL cs.SD eess.AS

Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization

Thaka at KSAA-2026 Task 2: 用于阿拉伯语音节符号化的正则化微调

Meshal Alamr, Hassan Alqaeri, Abdullah Aldahlawi

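AI summary: For low-resource Arabic speech diacritization, the system fine-tunes the multimodal CATT-Whisper model with regularization, combining R-Drop consistency regularization, Optuna-optimized hyperparameters, and Focal Loss, and places first in the KSAA-2026 shared task.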

详情
Comments
4 pages, 1 figure. Published in Proceedings of OSACT7 (LREC 2026). Winning system for KSAA-2026 Task 2 on Arabic Speech Diacritization
AI中文摘要

我们描述了KSAA-2026阿拉伯语音听写自动音节符号化共享任务Task 2的获胜系统。该任务要求从语音音频和无音节符号的转录文本中生成完全带音节符号的阿拉伯语文本,仅提供2,327个训练样本且不允许使用外部数据。我们的系统微调了CATT-Whisper,这是一个字符级多模态模型,结合了预训练的CATT文本编码器和冻结的Whisper语音编码器。我们方法的关键是训练正则化:R-Drop一致性正则化、使用高权重衰减的Optuna优化超参数以及Focal Loss。在推理时,我们在四个模型检查点上使用蒙特卡洛Dropout在softmax概率级别平均200次随机前向传播。该系统在主要排行榜指标(包括词尾变化,含无音节符号位置)上实现了23.26%的词错误率,在所有参与者中排名第一。

英文摘要

We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
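
Averaging "at the softmax probability level" means each stochastic forward pass is converted to probabilities before averaging, rather than averaging raw logits. A minimal sketch, assuming the passes are already available as lists of logits (the model, the dropout sampling, and all names below are placeholders, not the authors' code):

```python
import math


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probabilities(logit_samples):
    """Average class probabilities over stochastic passes (e.g. MC Dropout samples)."""
    probs = [softmax(sample) for sample in logit_samples]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


def predict(logit_samples):
    """Index of the class with the highest averaged probability."""
    avg = average_probabilities(logit_samples)
    return max(range(len(avg)), key=avg.__getitem__)
```

In the setting described above, `logit_samples` would hold the 200 passes collected across the four checkpoints for one character position.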

2605.25924 2026-05-26 cs.CL cs.LG

Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT

在学习者语料库上继续预训练是否能提高英语水平测试的自动作文评分?来自EFCAMDAT的证据

Duy Anh Nguyen

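AI summary: The study examines how domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus affects transformer-based automated essay scoring (AES) for English proficiency tests. Full-corpus DAPT gives mixed results, whereas targeted DAPT on CEFR-aligned subsets improves in-domain scoring more reliably.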

详情
Comments
16 pages, 3 figures, 10 tables, including references and appendices
AI中文摘要

最近的自动作文评分(AES)研究越来越多地使用预训练的Transformer模型,但这些模型通常是在通用领域英语上预训练的,可能无法充分代表第二语言学习者的写作。本研究调查了在EFCAMDAT学习者语料库上进行领域自适应继续预训练(DAPT)是否能提高基于Transformer的AES在英语水平测试中的表现。我们对三个Transformer编码器应用DAPT,并在FCE和IELTS上评估了领域内评分和少样本跨数据集迁移。全语料库DAPT在模型、数据集和指标上产生了混合结果。进一步分析表明,这些混合效应部分由EFCAMDAT与下游数据集在熟练度、体裁和交际目的上的不匹配解释。基于熟练度的消融实验显示,使用CEFR对齐子集进行针对性DAPT比全语料库DAPT更可靠地提高了下游评分,尤其是对于使用B1-B2数据的FCE。然而,这些增益并未一致地改善跨数据集迁移。总体而言,研究结果表明,当预训练数据与下游评估设置充分对齐时,在学习者写作语料库上继续预训练可以有益于英语评估的领域内AES,但它不会自动提高跨不同英语水平测试数据集的迁移性。

英文摘要

Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based AES for English proficiency tests. We apply DAPT to three transformer encoders and evaluate them on FCE and IELTS in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and metrics. Further analyses suggest that these mixed effects are partly explained by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT using CEFR-aligned subsets improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1--B2 data. However, these gains do not consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment settings. However, it does not automatically improve transferability across different English proficiency test datasets.

2605.25922 2026-05-26 cs.CV

Closed-Loop Bidirectional Prompting for Adversarial Robustness of Vision Language Models

闭环双向提示用于视觉语言模型的对抗鲁棒性

Xiao Liu, Jiaxiang Liu, Boci Peng, Boren Hu, Yusong Wang, Xiwen Chen, Prayag Tiwari, Liming Zhang, Mingkun Xu

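AI summary: To address the fragility of cross-modal semantic alignment in vision language models under adversarial perturbations, the paper proposes Closed-Loop Bidirectional Prompting, which recovers cross-modal agreement through a dynamic feedback loop and introduces a Semantic Anchor to constrain cyclic updates, giving instance-adaptive protection and state-of-the-art robustness and generalization across 11 datasets.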

详情
Comments
24 pages, 8 figures
AI中文摘要

视觉语言模型能很好地适应下游任务,但对破坏跨模态语义对齐的对抗扰动高度脆弱。现有的防御方法大多是单向或结构性的,未能利用双向跨模态互补性和实例自适应的保护。为了克服对抗设置中单向和静态防御的局限性,我们提出了闭环双向提示,通过冻结编码器上的动态反馈循环将鲁棒适应视为跨模态一致性恢复。引入语义锚点作为稳定先验以约束循环更新并减轻扰动引起的特征损坏。通过基于锚点的自举,文本语义去噪视觉表示,而精炼的视觉使实例自适应提示更新成为可能,从而产生修正且鲁棒的共识。在11个数据集上的广泛评估验证了最先进的鲁棒性和强的基础到新类泛化能力,同时在计算成本和准确性之间保持了良好的平衡。

英文摘要

Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.

2605.25921 2026-05-26 cs.GR cs.CV

Curve Skeletonization in Continuous domain for Meshes and Point Clouds

网格与点云的连续域曲线骨架化

Jai Bardhan, Ramya Hebbalaguppe, Aravind Udupa

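AI summary: Proposes CSCD, a framework that generalizes Local Separator-based skeletonization to the continuous domain; its two realizations, CSCD-M (meshes) and CSCD-PC (point clouds), improve the robustness and topology preservation of skeleton extraction.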

详情
Comments
31 pages, 26 figures, 7 tables, 4 algorithms. Published at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
AI中文摘要

3D曲线骨架化的进展正在加速广泛的应用。然而,开发能够捕捉复杂物体细节的鲁棒骨架化算法仍然具有挑战性。基于局部分隔符(LS)的骨架化提供了一种高效的基于图的方法,但由于其离散性质,存在表示不准确的问题。为了解决这个问题,我们引入了CSCD,一个新颖的连续域曲线骨架化框架,将LS推广到流形上。具体来说,我们提出了两种实现:用于网格的CSCD-M和用于点云的CSCD-PC。CSCD-M利用网格的内在三角剖分来抵抗噪声并改善拓扑保持,而CSCD-PC采用簇状拉普拉斯算子以增强鲁棒性。据我们所知,CSCD-M是第一个用于曲线骨架化的内在方法。我们的结果表明,CSCD-M在各种网格上匹配LS的性能,并在Thingi10k数据集等基准测试上优于LS(TOG'21)。CSCD-PC在质量上优于CoverageAxis++(Eurographics'24)和EPCS(CAG'23)。最后,我们展示了CSCD在几个下游任务中的有效性:物体分类、形状分割、识别物体中的手柄、隧道和收缩。项目网站:https://cscd-skel.pages.dev

英文摘要

Advancements in 3D curve skeletonization are accelerating progress across a wide range of applications. However, developing robust skeletonization algorithms that capture intricate object details remains challenging. Skeletonization via Local Separators (LS) offers an efficient graph-based approach but suffers from representation inaccuracies due to its discrete nature. To address this, we introduce CSCD, a novel framework for Curve Skeletonization in the Continuous Domain, generalizing LS to manifolds. Specifically, we present two realizations: CSCD-M for meshes and CSCD-PC for point clouds. CSCD-M leverages the intrinsic triangulation of a mesh for resilience to noise and improved topological preservation, while CSCD-PC employs tufted Laplacians for enhanced robustness. To our knowledge, CSCD-M is the first intrinsic method for curve skeletonization. Our results show CSCD-M matches LS performance across diverse meshes and outperforms LS (TOG'21) on benchmarks such as the Thingi10k dataset. CSCD-PC qualitatively outperforms CoverageAxis++ (Eurographics'24) and EPCS (CAG'23). Finally, we demonstrate the efficacy of CSCD in a few downstream tasks: object classification, shape segmentation, and identifying handles, tunnels, and constrictions in objects. Project Website: https://cscd-skel.pages.dev

2605.25920 2026-05-26 cs.CL cs.AI

Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

LLM 能时间旅行吗?通过强化学习增强法律智能搜索中的时间一致性

Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song

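AI summary: Proposes LegalSearch-R1, a framework that combines local statute RAG with online search and is trained with reinforcement learning on data spanning multiple amendment periods, addressing the temporal bias of legal LLMs and the lack of temporal constraints in search agents; it outperforms existing methods on 13 legal tasks.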

详情
Comments
Under Review
AI中文摘要

虽然增强智能搜索能力的大型语言模型在法律推理方面显示出前景,但它们忽略了一个基本约束:适用法律必须与每个案件的时间背景相匹配,因为法条的事后追溯适用违反了核心法律原则并导致错误结论。我们的观察表明,当前的法律 LLM 存在锚定于其训练截止日期的时间偏差,而搜索代理很少将时间约束纳入查询,并且仅靠网络搜索无法提供法律推理所需的精确法条和先例引用。为应对这些挑战,我们提出 LegalSearch-R1,一个端到端的强化学习框架,它将本地 statute RAG 用于精确条文匹配,与在线网络搜索用于更广泛的法律知识相结合,并在涵盖多个修订期的按时间索引的数据上训练以强制执行时间一致性。在我们涵盖13项法律任务的基准上的大量实验表明,我们的7B参数代理在时间一致性上以12.9%至29.8%的优势超越最先进的深度研究框架和专门的法律 LLM,以57.7%至80.3%的优势超越基线,并展现出强大的域外泛化能力。代码和数据可在 https://github.com/AlexFanw/LegalSearch-R1 获取。

英文摘要

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint: the applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, that search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.
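
The temporal constraint itself is easy to state in code: given the amendment history of an article, pick the version that was in force on the date relevant to the case. The sketch below illustrates only that constraint, not the LegalSearch-R1 retrieval pipeline; the data layout and function name are hypothetical.

```python
from bisect import bisect_right
from datetime import date


def version_in_force(versions, case_date):
    """Return the statute text in force on case_date, or None if none applies yet.

    versions: list of (effective_date, text) tuples sorted by effective_date.
    """
    effective_dates = [effective for effective, _ in versions]
    # Index of the last version whose effective date is on or before case_date.
    idx = bisect_right(effective_dates, case_date) - 1
    if idx < 0:
        return None
    return versions[idx][1]


# Hypothetical amendment history for a single article.
ARTICLE_HISTORY = [
    (date(2015, 1, 1), "original wording"),
    (date(2021, 3, 1), "amended wording"),
]
```

A case dated 2019 would be matched with the original wording even though a model trained after 2021 is more likely to have seen the amended text, which is the temporal bias the abstract describes.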

2605.25916 2026-05-26 cs.LG cs.DC cs.NI

Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning

通过约束多目标深度强化学习联合优化联邦边缘学习中的训练与推理

Zhen Li, Jun Cai, Chao Yang, Haoran Gao

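AI summary: Proposes an online optimization framework that uses C-MOPPO, a constrained multi-objective deep reinforcement learning algorithm, to jointly manage federated training and inference on resource-constrained edge devices, maximizing inference accuracy while minimizing latency and energy consumption.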

详情
AI中文摘要

联邦边缘学习(FEEL)最近成为一种有前景的范式,通过支持跨边缘设备的协作模型训练同时保护数据隐私来实现边缘智能(EI)。在本文中,我们提出了一种在线优化框架,用于联合管理资源受限边缘设备上的联邦训练和推理。我们引入了一种基于串联队列的转换机制,将推理请求与训练数据桥接起来,并进一步将数据和模型的新鲜度纳入准确性公式中,以捕捉真实环境中的时间动态。为了在最小化延迟和能耗的同时最大化推理精度,边缘设备的模式选择、通信和计算资源分配被联合优化。我们将此优化表述为一个多目标优化问题,该问题是NP难的,并且由于在线设置而进一步复杂化。为了应对这些挑战,我们将问题转化为多目标马尔可夫决策过程(MOMDP),并开发了一种约束多目标近端策略优化(C-MOPPO)算法。具体来说,C-MOPPO首先学习一组具有不同目标偏好策略,然后利用约束策略优化来丰富帕累托前沿并获得高质量、密集的解。大量实验表明,C-MOPPO在目标之间实现了良好的平衡权衡,并在各种系统配置下显著优于基线。

英文摘要

Federated edge learning (FEEL) has recently emerged as a promising paradigm for achieving edge intelligence (EI) by enabling collaborative model training across edge devices while protecting data privacy. In this paper, we put forth an online optimization framework that jointly manages federated training and inference on resource-constrained edge devices. We introduce a tandem-queue-inspired conversion mechanism that bridges inference requests and training data, and further incorporate both data and model freshness into the accuracy formulation to capture temporal dynamics in real-world environments. To maximize inference accuracy while minimizing latency and energy consumption, the mode selections, communication, and computation resource allocations of edge devices are jointly optimized. We formulate this optimization as a multi-objective optimization problem, which is NP-hard and further complicated by the online setting. To address these challenges, we transform the problem into a multi-objective Markov decision process (MOMDP) and develop a constrained multi-objective proximal policy optimization (C-MOPPO) algorithm. Specifically, C-MOPPO first learns a set of policies with different preferences across three objectives, then leverages constrained policy optimization to enrich the Pareto front and obtain high-quality, dense solutions. Extensive experiments demonstrate that C-MOPPO achieves well-balanced trade-offs among objectives and significantly outperforms baselines under various system configurations.
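
As a reminder of what "Pareto front" means for the three objectives named above, the sketch below filters a set of candidate operating points down to the non-dominated ones. It illustrates the dominance relation only, not C-MOPPO; the tuple layout (accuracy, latency, energy) and the function names are assumptions.

```python
def dominates(a, b):
    """True if point a is at least as good as b on every objective and better on one.

    Points are (accuracy, latency, energy): accuracy is maximized,
    latency and energy are minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Each policy learned under a different preference vector would contribute one such point; a denser front gives an operator more trade-offs to choose from.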

2605.25909 2026-05-26 cs.CV

R5DGS: Semantic-Aware 4D Gaussian Splatting with Rigid Body Constraints for Efficient Dynamic Scene Reconstruction

R5DGS:基于刚体约束的语义感知4D高斯泼溅用于高效动态场景重建

Denis Gridusov, Maxim Popov, Sergey Kolyubin

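AI summary: Proposes R5DGS, which makes a 4D Gaussian representation semantic-aware through compact identity encodings and a CLIP-based object lookup table, and uses a rigid-body inference constraint to predict dynamics only for object centroids, yielding an 11 FPS speedup while preserving trajectory plausibility.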

详情
Comments
Code: https://github.com/be2rlab/r5dgs
AI中文摘要

从多视角视频中重建和预测动态3D场景是机器人、AR/VR和数字孪生的基础任务。最近基于物理信息的高斯泼溅方法在未来的帧外推上取得了令人印象深刻的结果,但缺乏语义感知且计算开销大。我们引入了R5DGS,一个通过紧凑的身份编码向量增强物理驱动的4D高斯表示的框架,实现了精确的高斯到对象关联。通过构建离线的基于CLIP的对象查找表,我们支持开放词汇的文本提示,以检索和渲染任意时间戳和视角下的特定对象高斯。此外,我们提出了一个刚体推理约束,仅对对象质心预测和集成物理动力学,通过相对变换将运动传播到关联的高斯。这一优化在外推过程中实现了11 FPS的加速,而不损害轨迹的合理性。

英文摘要

Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce R5DGS, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
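
One plausible reading of the rigid-body constraint is that each Gaussian keeps a fixed offset from its object's centroid, so only the centroid's pose has to be predicted. The sketch below implements that reading with a rotation about the z-axis; the exact transformation used by R5DGS is not given in the abstract, and every name here is hypothetical.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_vec(R, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def propagate(gaussian_centers, centroid0, centroid_t, R_t):
    """Move Gaussians rigidly with their object.

    Each center keeps its initial offset from the centroid; the offset is
    rotated by R_t and re-attached to the predicted centroid position.
    """
    moved = []
    for p in gaussian_centers:
        offset = [p[k] - centroid0[k] for k in range(3)]
        rotated = mat_vec(R_t, offset)
        moved.append([centroid_t[k] + rotated[k] for k in range(3)])
    return moved
```

Under this assumption the per-frame cost of the dynamics model scales with the number of objects rather than the number of Gaussians, which is consistent with the speedup the abstract reports.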

2605.25903 2026-05-26 cs.CL cs.LG

Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation

通用激活词化器:跨模型激活解释的统一框架

Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du

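AI summary: Proposes the Universal Activation Verbalizer (UAV), which uses a shared decoder and lightweight adapters to turn hidden representations from heterogeneous models into natural-language explanations, supports activation verbalization across model families and scales, and is competitive with strong baselines on classification, fact retrieval, and gist summarization.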

详情
Comments
23 pages, 11 figures, 11 tables
AI中文摘要

激活词化以自然语言解释隐藏表示,但现有方法大多局限于自解释,即每个模型仅解释自身的激活。我们引入通用激活词化器(UAV),一个使用共享解码器解释来自异构捐赠模型激活的框架。UAV学习一个轻量适配器,将捐赠激活转化为解码器嵌入空间中的软标记,并通过重用冻结的解码器侧LoRA同时为另一个捐赠者训练新适配器,进一步支持仅适配器迁移。在分类、事实检索和要点总结任务中,UAV在实现跨模型家族和规模的跨模型词化时,与强自解释基线保持竞争力。消融实验表明,解码器侧调优主要改善任务行为,而适配器提供激活基于的事实和语义信息,用于忠实解释。

英文摘要

Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in the decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.

2605.25901 2026-05-26 cs.CV cs.RO

AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

AgentGrounder:使用多模态语言模型的零样本3D视觉点云定位

Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin

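AI summary: Proposes AgentGrounder, a two-stage framework (an offline Object Lookup Table and an online tool-driven agent) for zero-shot 3D visual grounding, improving accuracy by 2.5% on ScanRefer and 6.3% on Nr3D.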

详情
Comments
Code: https://github.com/be2rlab/AgentGrounder
AI中文摘要

3D视觉定位(3DVG)是具身AI的基本能力,要求智能体根据自然语言描述在3D场景中定位物体。最近的零样本方法利用2D视觉语言模型(LVLMs),但它们通常依赖于现有的多视图图像集,并且难以处理标准3D分割工具提供的有限语义和空间细节。我们提出了AgentGrounder,一个零样本3D视觉定位框架,直接对彩色点云进行操作,无需特定任务的3D训练。我们的方法采用两阶段设计:(1)离线阶段,应用3D模型构建对象查找表(OLT),包含实例ID、语义标签、3D边界框;(2)在线工具驱动代理,分解每个查询,仅从OLT中检索相关候选对象,进行几何评分,并在需要额外视觉证据(如颜色、材质或视角敏感线索)时按需触发图像渲染。与固定的锚点-目标匹配流水线相比,这种设计减少了级联匹配错误,并通过避免提示过载无关对象来提高上下文窗口效率。我们在零样本设置下对ScanRefer和Nr3D进行了评估,观察到在我们的设置中比SeeGround有持续改进,包括ScanRefer上+2.5%的Acc@0.5和Nr3D上+6.3%,在Nr3D视图无关查询上显著提升+6.3%。这些结果表明,结合选择性检索、几何推理和自适应视觉检查为开放词汇3D定位提供了实用且稳健的基础。我们的代码可在https://github.com/be2rlab/AgentGrounder获取。

英文摘要

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@0.5 on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
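
The retrieve-then-score idea can be sketched for a query such as "the chair closest to the table". The table layout follows the fields listed in the abstract (instance ID, semantic label, 3D bounding box); the class, the function names, and the use of center-to-center distance as the geometric score are assumptions, not the released implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class OltEntry:
    """One row of a hypothetical Object Lookup Table."""
    instance_id: int
    label: str
    bbox_min: tuple  # (x, y, z) of the box's minimum corner
    bbox_max: tuple  # (x, y, z) of the box's maximum corner

    def center(self):
        return tuple((lo + hi) / 2 for lo, hi in zip(self.bbox_min, self.bbox_max))


def retrieve(olt, label):
    """Select only the candidates whose semantic label matches the query term."""
    return [entry for entry in olt if entry.label == label]


def nearest_to_anchor(candidates, anchor):
    """Geometric scoring: pick the candidate whose box center is closest to the anchor's."""
    return min(candidates, key=lambda e: math.dist(e.center(), anchor.center()))
```

Only the retrieved candidates would need to be passed to the language model, which is how selective retrieval keeps the prompt from filling up with irrelevant objects.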

2605.25894 2026-05-26 cs.LG q-fin.ST

Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning

使用多模态深度学习预测盈利公告日的股价方向

Manuel Noseda, Nathan Soldati, Marco Paina

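AI summary: The study combines fundamental metrics, technical indicators, and news sentiment to predict stock price direction on earnings announcement days with LSTM and Transformer models, finding that the Transformer is more sensitive to volatile movements and that news sentiment improves performance.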

详情
AI中文摘要

预测盈利公告(EAs)期间的股价走势是一个重大挑战,因为市场噪音和高冲击价格不连续性。在本研究中,我们评估了公告前的新闻情感、公司基本面和近期市场动态是否共同预测了EA日股票的价格方向运动。我们构建了一个多模态特征空间,结合了15个基本面指标、3个基于价格的技术指标以及使用FinBERT处理的金融新闻文章的情感分数。我们将长短期记忆(LSTM)网络和基于Transformer的架构与逻辑回归基线进行比较,并进一步评估所有模型在有和没有情感特征的情况下的增量价值。我们的结果表明,虽然LSTM通过保守的安全策略显示出更高的精确度,但Transformer模型在识别波动性运动方面表现出更高的敏感性,获得了更高的宏观F1分数,消融实验显示加入新闻情感有一致的益处。

英文摘要

Predicting stock price movements during Earnings Announcements (EAs) is a significant challenge due to market noise and high-impact price discontinuities. In this study, we evaluate whether pre-announcement news sentiment, firm fundamentals, and recent market dynamics jointly predict the directional price movement of equities on EA days. We construct a multi-modal feature space combining 15 fundamental metrics, 3 price-based technical indicators and sentiment scores derived from financial news articles processed using FinBERT. We compare a Long Short-Term Memory (LSTM) network and a Transformer-based architecture against a logistic regression baseline, and further assess all models with and without sentiment features to quantify their incremental value. Our results indicate that while the LSTM demonstrates higher precision through a conservative safe-bet strategy, the Transformer model exhibits superior sensitivity in identifying volatile movements, achieving a higher macro F1-score, with ablation experiments showing a consistent benefit from incorporating news sentiment.

2605.25893 2026-05-26 cs.AI

$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

$D^2$-Monitor: 通过犹豫感知路由实现扩散LLM的动态安全监控

Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi

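AI summary: For safety monitoring of diffusion LLMs, proposes $D^2$-Monitor, a bi-level dynamic monitoring framework based on hesitation-aware routing: a lightweight probe estimates hesitation in real time and triggers a higher-capacity probe, reaching the best effectiveness-efficiency trade-off on 3 datasets with at most 0.85M parameters.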

详情
AI中文摘要

尽管扩散大语言模型(D-LLMs)作为自回归大语言模型(AR-LLMs)的替代方案已经出现,但D-LLMs的安全监控在很大程度上仍未得到探索。与AR-LLMs不同,D-LLMs通过多步去噪过程生成文本,暴露了中间隐藏表示,这些表示可能包含标准单步监控设置中无法获得的安全相关信息。受轻量级探针适用于始终在线监控的启发,我们分析了哪些轨迹级信号最能指示此类探针可能遇到困难。我们发现,信息量最大的信号是安全犹豫度:中间隐藏状态反复落在探针决策边界的小范围内。D-LLM轨迹中此类犹豫步的数量能有效预测探针失败,提供了样本难度的代理指标。基于此分析,我们提出了$D^2$-Monitor,一种针对D-LLMs的双层安全监控器。$D^2$-Monitor采用轻量级探针作为始终在线监控器,以联合估计犹豫度并执行基础分类。当犹豫度超过阈值时,激活更具表现力但计算量更大的探针。这种动态路由机制在测试时高效分配监控资源。在4个D-LLM上的3个数据集(WildguardMix、ToxicChat、OpenAI-Moderation)上评估,$D^2$-Monitor以紧凑的参数规模(≤0.85M参数)实现了最先进的性能,并且相对于8个基线方法,在有效性和效率之间取得了最佳权衡。

英文摘要

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory predicts probe failure effectively, providing a proxy for sample difficulty. Building on this analysis, we propose $D^2$-Monitor, a bi-level safety monitor for D-LLMs. $D^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
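
The routing rule described above can be sketched as follows, assuming the lightweight probe emits one unsafe-probability per denoising step. The margin, boundary, and hesitation threshold values are placeholders, using the last step's score as the base classification is an assumption, and `heavy_probe` stands in for the more expressive probe, whose architecture the abstract does not specify.

```python
def count_hesitation_steps(probe_scores, margin=0.1, boundary=0.5):
    """Number of denoising steps whose probe score lies within `margin` of the boundary."""
    return sum(1 for s in probe_scores if abs(s - boundary) < margin)


def route(probe_scores, heavy_probe, margin=0.1, boundary=0.5, max_hesitation=2):
    """Return (is_unsafe, which_probe_decided).

    probe_scores: lightweight-probe outputs, one per denoising step.
    heavy_probe: callable taking the trajectory scores and returning a bool
                 (a placeholder for the heavier probe).
    """
    hesitation = count_hesitation_steps(probe_scores, margin, boundary)
    if hesitation > max_hesitation:
        # Too many near-boundary steps: escalate to the heavier probe.
        return heavy_probe(probe_scores), "heavy"
    # Otherwise trust the lightweight probe's final-step decision (an assumption).
    return probe_scores[-1] >= boundary, "light"
```

Because the heavier probe only runs on trajectories flagged as hesitant, its cost is paid on the hard samples rather than on every input.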

2605.25892 2026-05-26 cs.CV

SP-MoMamba: Superpixel-driven Mixture of State Space Experts for Efficient Image Super-Resolution

SP-MoMamba:基于超像素驱动的状态空间专家混合模型用于高效图像超分辨率

Wenbin Zou, Yawen Cui, Yi Wang, Lap-Pui Chau, Liang Chen, Jinshan Pan, Huiping Zhuang, Guanbin Li

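AI summary: Proposes SP-MoMamba, which uses superpixels to turn rigid scanning into semantic-level interaction and combines a multi-scale superpixel mixture of state space experts with a local spatial modulation expert for efficient, high-fidelity image super-resolution.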

详情
Comments
16 pages, 15 figures
AI中文摘要

状态空间模型(SSM)因其线性复杂度和长程建模能力,已成为高效单图像超分辨率(SR)的强大范式。然而,现有的基于Mamba的方法通常依赖于与数据无关的刚性扫描,将2D图像重塑为固定网格上的1D序列,这不可避免地破坏了空间语义拓扑并引入伪影。受格式塔知觉分组理论的启发,我们提出了SP-MoMamba,一种用于内容感知SR的超像素驱动状态空间专家混合模型。我们的核心思想是通过将超像素视为基本单元,将传统的刚性扫描转化为语义级交互。具体来说,我们引入了超像素驱动状态空间模型(SP-SSM),它将语义同质区域压缩为高阶令牌,以保持全局拓扑一致性。为了解决固定扫描尺度与多样语义粒度之间的冲突,我们开发了多尺度超像素状态空间专家混合(MSS-MoE)。该模块利用动态路由机制自适应地分配尺度特定专家,有效捕捉多尺度纹理,同时减少计算冗余。此外,为了防止全局抽象过程中高频细节的丢失,我们引入了局部空间调制专家(LSME)来补充全局建模,确保锐利边缘和精细结构的精确重建。在标准基准上的大量实验表明,与最先进的高效SR方法相比,SP-MoMamba实现了更优的重建保真度和更有利的效率-性能权衡。

英文摘要

State space models (SSMs) have emerged as a powerful paradigm for efficient single-image super-resolution (SR) due to their linear complexity and long-range modeling capabilities. However, existing Mamba-based methods typically rely on data-agnostic rigid scanning, which reshapes 2D images into 1D sequences over a fixed grid, inevitably disrupting spatial-semantic topology and introducing artifacts. Inspired by the Gestalt perceptual grouping theory, we propose SP-MoMamba, a superpixel-driven mixture of state space experts designed for content-aware SR. Our core idea is to transform the traditional rigid scanning into a semantic-level interaction by treating superpixels as fundamental units. Specifically, we introduce the Superpixel-driven State Space Model (SP-SSM), which compresses semantically homogeneous regions into high-order tokens to preserve global topological consistency. To address the conflict between fixed scanning scales and diverse semantic granularities, we develop the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE). This module utilizes a dynamic routing mechanism to adaptively assign scale-specific experts, effectively capturing multi-scale textures while reducing computational redundancy. Furthermore, to prevent the loss of high-frequency details during global abstraction, we introduce a Local Spatial Modulation Expert (LSME) to complement the global modeling, ensuring a precise reconstruction of sharp edges and fine structures. Extensive experiments on standard benchmarks demonstrate that SP-MoMamba achieves superior reconstruction fidelity and a more favorable efficiency-performance trade-off compared to state-of-the-art efficient SR methods.

2605.25891 2026-05-26 cs.CL cs.AI

Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express

因果舌结:LLMs 能编码因果方向,但其是/否输出无法表达

Ziyi Ding, Xiao-Ping Zhang

AI总结 研究发现大语言模型在因果问题上存在内部编码与输出不匹配的现象,通过线性探针可从隐藏状态恢复证据支持的答案(准确率约0.97),但口头是/否回答却退化为常识答案(准确率约0.5),揭示了约+0.5的差距,称为“因果舌结”。

详情
AI中文摘要

我们发现大语言模型关于因果问题所编码的内容与其回答之间存在不匹配。在反常识的 CLadder 项目上,固定的线性探针从模型隐藏状态中恢复出证据支持的答案(准确率约0.97),而口头的是/否回答则退化为常识答案(准确率约0.5)。我们将这约+0.5的差距称为“因果舌结”:错误的“是/否”回答可分解为两种可分离的失败模式——没有内部信号,或者口头接口无法表达的信号。这一发现对仅基于输出的因果基准测试具有双向影响:基准测试“正确”不一定意味着模型理解了,基准测试“错误”也不一定意味着模型不能理解。基于单一准确率数字得出的关于 LLMs 是否能够进行因果推理的笼统论断,值得重新审视。

英文摘要

We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model's hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to the commonsense one (accuracy approximately 0.5). We call this approximately +0.5 gap Causal Tongue-Tie: a wrong Yes/No decomposes into two separable failure modes: no internal signal versus a signal the verbal interface cannot say. The implication cuts both ways for output-only causal benchmarks: a benchmark "correct" need not mean the model has understood, and a benchmark "wrong" need not mean it cannot. Sweeping claims about whether LLMs can do causal reasoning, drawn from a single accuracy number, deserve a second look.

2605.25890 2026-05-26 cs.LG

Merge-Bench: Resolve Merge Conflicts with Large Language Models

Merge-Bench: 使用大型语言模型解决合并冲突

Benedikt Schesch, Michael D. Ernst

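AI summary: The paper builds Merge-Bench, a dataset of 7938 real-world merge conflicts, and trains LLMergeJ with Group Relative Policy Optimization (GRPO); with 14B parameters it outperforms most commercial LLMs on Java programs, yet the best models still correctly resolve fewer than 60% of conflicts.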

详情
Comments
14 pages, 7 figures
AI中文摘要

本文应用机器学习处理版本控制合并这一困难且重要的任务。(1)我们构建了一个数据集Merge-Bench,包含来自1439个GitHub仓库的7938个真实合并冲突片段。真实标注是开发者提交到仓库的合并解决方案。我们的数据集构建方法可扩展到任意数据量,因为无需手动标注。(2)我们训练了一个模型LLMergeJ,用于解决Java程序中的合并冲突。我们的方法使用组相对策略优化(GRPO),一种在线强化学习方法,来训练大型语言模型(LLM)。(3)我们对LLM在解决合并冲突上的性能进行了两次评估。在Java程序上,具有14B参数的LLMergeJ优于3个商业LLM,仅次于Gemini 2.5 Pro。在11种编程语言中,商业LLM的性能在不同语言间基本稳定。最佳模型正确解决的合并冲突不到60%。

英文摘要

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.
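
For context, a merge conflict hunk is the region Git writes between `<<<<<<<` and `>>>>>>>` markers, with `=======` separating the two sides. The sketch below extracts such hunks from a conflicted file; it assumes Git's default two-way marker style (not the diff3 style with a `|||||||` base section) and is not the Merge-Bench extraction code. The function name is hypothetical.

```python
def parse_conflict_hunks(text):
    """Split a conflicted file into hunks of {'ours': [...], 'theirs': [...]} lines.

    Assumes default two-way conflict markers; with diff3-style markers the
    base section would be wrongly collected into 'ours'.
    """
    hunks = []
    state = None  # None (outside a hunk), "ours", or "theirs"
    ours, theirs = [], []
    for line in text.splitlines():
        if line.startswith("<<<<<<<"):
            state, ours, theirs = "ours", [], []
        elif line.startswith("=======") and state == "ours":
            state = "theirs"
        elif line.startswith(">>>>>>>") and state == "theirs":
            hunks.append({"ours": ours, "theirs": theirs})
            state = None
        elif state == "ours":
            ours.append(line)
        elif state == "theirs":
            theirs.append(line)
    return hunks
```

A model's task is then to produce, for each hunk, the text that the developers actually committed, which is the ground truth the abstract describes.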