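arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and related fields.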

2605.09042 2026-05-12 cs.CL

Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

Ye-eun Cho

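AI summary: This study examines how to evaluate pragmatic reasoning in large language models, arguing that current evaluation methods can themselves produce differences in model behavior and so may not accurately reflect underlying reasoning ability. Using scalar diversity as a diagnostic, it compares several evaluation approaches, including direct probability measurement and metalinguistic prompting, and finds that pragmatic behavior differs markedly across models and task conditions. The results suggest that pragmatic reasoning ability is not determined by any single evaluation method but arises from the interaction between a model's internal probability representations and task-induced behavior, which makes evaluation design central to understanding LLM pragmatic competence.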

2605.09041 2026-05-12 cs.CL cs.CR

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Jialing Gan, Junhao Dong, Songze Li

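AI summary: This paper proposes BiAxisAudit, a framework for evaluating bias in large language models along two dimensions: prompt sensitivity and response-layer divergence. Existing benchmarks usually reduce bias to a single scalar, overlooking variation across prompt formats and inconsistency within a response. BiAxisAudit separately evaluates the distribution of bias across prompt formats and the divergence between the selection and elaboration parts of a response, improving the accuracy and reliability of bias evaluation. Experiments show that the method can distinguish genuine bias reduction from cross-layer bias redistribution and highlight how strongly prompt format affects model bias.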

Comments 24 pages, 10 figures. Preprint

English abstract

Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than 0.7 on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a "cancellation trap"). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman ρ = 0.238, p = 0.570): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce BiAxisAudit, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with 80,200 coded responses each, task format alone explains as much variance as model choice; 63.6% of pooled bias signals (up to 85.2% per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.
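
The abstract compares selection-only and elaboration-only model rankings with a Spearman rank correlation. The following is a minimal sketch of that statistic in plain Python, not the authors' code. The per-model scores in the usage lines are hypothetical placeholders, not values from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical bias scores for four models under the two coding layers.
selection_scores = [0.12, 0.30, 0.25, 0.08]
elaboration_scores = [0.40, 0.22, 0.35, 0.31]
print(spearman_rho(selection_scores, elaboration_scores))
```

A value near zero, as the abstract reports for eight LLMs, means that knowing a model's rank under one coding layer says little about its rank under the other.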

2605.09039 2026-05-12 cs.CV

SeasonScapes: Learning Large-scale Re-lightable 3D Landscapes with Seasonal Variation from Sparse Webcams

Timo Kleger, Qi Ma, Deheng Zhang, Luc Van Gool, Danda Pani Paudel

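AI summary: This paper presents SeasonScapes, a framework and a large-scale dataset of 3D landscapes with seasonal variation. The dataset comprises more than 85,000 webcam images from 32 locations at 13 time points, covering over 50 km × 60 km of the Swiss mountains. Time-point-specific images are projected onto a 3D mesh to build seasonal 3D landscapes that reflect how natural appearance changes over time. To handle occlusion and missing data, the authors use a conditional diffusion model for image-guided completion on the mesh; the resulting meshes can be relit with standard physically based renderers.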

2605.09036 2026-05-12 cs.LG

PACT: Peak-Aware Cross-Attention Graph Transformers for Efficient Storm-Surge Emulation

Zesheng Liu, Doyup Kwon, Ning Lin, Maryam Rahnemoonfar

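AI summary: This paper proposes PACT, a peak-aware cross-attention graph transformer for efficient storm-surge emulation. The method partitions the atmospheric forcing field into a graph structure and combines graph neural networks with cross-attention to predict storm surge at the station level. PACT introduces a peak-aware learning strategy aimed at extreme events, which markedly improves its capture of surge peaks and tail behavior. On data from multiple tide gauges along the US East Coast it outperforms existing spatio-temporal graph neural network models while remaining computationally efficient.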

2605.09032 2026-05-12 cs.CL cs.AI cs.LG

A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting

Pavan Manjunath, Thomas Prufer

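AI summary: This paper proposes a quantum-inspired variational kernel and explainable AI framework for short-term, cross-region solar and wind energy forecasting. The framework has four stages: acquiring data, training classical baseline models, correcting residual errors with the quantum-inspired variational kernel, and producing natural-language explanations with generative AI. Experiments show strong performance on forecasting tasks in three different regions, and the quantum-inspired kernel separates calm from stormy weather regimes significantly better than a conventional radial basis function kernel.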

2605.09031 2026-05-12 cs.LG

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

Thomas Tulinski, Simona Cocco, Rémi Monasson, Jorge Fernandez-De-Cossio-Diaz

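AI summary: This paper studies learning and generation in energy-based models (EBMs), focusing on an analytically solvable high-dimensional model, the spherical Boltzmann machine (SBM). Combining random matrix theory with dynamical mean-field theory, the authors derive exact equations for the SBM training dynamics, compute the Bayesian evidence, and uncover cascades of phase transitions that occur during training and as hyperparameters vary. They also show that these phenomena are closely tied to sampling behavior during generation and confirm them in standard generative models.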

2605.09030 2026-05-12 cs.CV cs.LG

When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

Jörg Frochte

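AI summary: This paper examines the limitations of using Contrastive Style Descriptor (CSD) cosine similarity as an absolute measure of style fidelity in artist-style evaluation, and proposes a diagnostic called the "discrimination gap" to test whether the metric can separate same-style from different-style pairs on a given artist corpus. Raw CSD cosine shows a negative point-estimate gap on several artist corpora, indicating that it cannot serve as an absolute score; a CSLS readout and positional-embedding interpolation substantially improve evaluation accuracy. The authors recommend running the diagnostic before using CSD cosine as a style score and suggest the improved CSD+ method for greater reliability.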

Comments 24 pages, 7 figures, 19 tables

2605.09025 2026-05-12 cs.CV cs.LG

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

Kiran Naseer, Naveed Anwer Butt

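AI summary: This paper presents MedFL-Stress, a systematic stress-testing framework for evaluating the robustness of federated brain tumor segmentation models under cross-hospital shifts in MRI appearance. By introducing MRI appearance shifts of varying severity, the study exposes performance disparities between hospitals under existing federated learning methods and compares FedAvg, FedProx, and FedBN. Experiments show that FedBN does better at raising worst-hospital segmentation performance and narrowing the gap between hospitals, underscoring the importance of robustness evaluation in federated medical imaging.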
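
For readers unfamiliar with the FedAvg baseline, the sketch below shows its standard aggregation step (a sample-size-weighted average of client parameters) together with two simple disparity measures. It is an illustration, not the paper's code: the metric definitions in `disparity` and all numbers are assumptions made here for the example.

```python
def fedavg(client_params, client_sizes):
    """Standard FedAvg aggregation: average each parameter across clients,
    weighting every client by its number of training samples."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[j] * size for params, size in zip(client_params, client_sizes)) / total
        for j in range(n_params)
    ]


def disparity(per_hospital_scores):
    """Hypothetical summary metrics: the worst hospital's score and the
    spread between the best and worst hospitals."""
    worst = min(per_hospital_scores)
    gap = max(per_hospital_scores) - worst
    return worst, gap


# Two hospitals with flattened toy parameter vectors and sample counts.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)

# Hypothetical per-hospital Dice scores after a round of training.
print(disparity([0.81, 0.74, 0.62]))
```

FedBN differs from this baseline in that batch-normalization parameters stay local to each client and are excluded from the average, which is one plausible reason it copes better with appearance shift between scanners.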

2605.09024 2026-05-12 cs.CV cs.GR cs.MM eess.IV

Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

Adrian Azzarelli, Nantheera Anantrasirichai, James Pollock, David R. Bull

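AI summary: This study proposes a 3D reconstruction and relighting framework tailored to virtual production (VP) and based on high-resolution image-based lighting, addressing the coupling of background and lighting and the low resolution of environment maps in conventional approaches. The method uses Gaussian Splatting and conditions relighting on the known background imagery, so it does not depend on environment maps and reduces compositing to a background-image editing task. Using a newly introduced dataset of real VP scenes, it decomposes a scene into fixed-appearance and variable-lighting components, giving efficient, controllable, high-quality 3D reconstruction and relighting with support for several output variables.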

English abstract

Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process. This avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space, implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.

2605.09016 2026-05-12 cs.AI cs.LG cs.NA math.NA

CATO: Charted Attention for Neural PDE Operators

Chun-Wun Cheng, Sifan Wang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

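AI summary: This paper proposes CATO, a neural PDE operator that targets the trade-off between computational efficiency and physical fidelity when modeling partial differential equations (PDEs) on complex geometries. CATO learns a continuous implicit coordinate map that transforms mesh coordinates into a geometry-adapted chart space and applies axial attention in that space, capturing long-range dependencies efficiently. It also introduces a derivative-aware physics loss that improves accuracy on steady-state PDEs. Experiments on several datasets show it outperforming existing methods with far fewer parameters.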

2605.09015 2026-05-12 cs.CL cs.LG

LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language

Luca Ballore

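AI summary: Targeting Sardinian, a vanishing Romance language, this study presents LLiMba, a 3-billion-parameter language model adapted on a single 24 GB consumer GPU through continued pretraining and supervised fine-tuning, which improves generation in Sardinian. Training uses a corpus of 11.5 million Sardinian tokens and 2.4 million tokens of related Romance text, and the model outperforms the base model across multiple evaluation directions. A comparison of adaptation methods finds that rsLoRA r256 gives the best generation quality and reveals key relationships among adapter capacity, regularization strategy, and generation quality.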

English abstract

Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.
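
The abstract compares LoRA and rsLoRA at different ranks. The two differ in how the low-rank update is scaled: standard LoRA multiplies it by alpha / r, while rank-stabilized LoRA (rsLoRA) uses alpha / sqrt(r), so the update does not shrink as quickly when the rank grows. The sketch below only illustrates that scaling rule. The value of `alpha` is an assumption chosen for the example; the abstract does not state the hyperparameters used.

```python
import math


def lora_scale(alpha, rank):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Scaling factor in rank-stabilized LoRA (rsLoRA)."""
    return alpha / math.sqrt(rank)


ALPHA = 16  # assumed value, for illustration only

for rank in (64, 128, 256):
    print(rank, lora_scale(ALPHA, rank), rslora_scale(ALPHA, rank))
```

With a fixed alpha, the standard factor drops fourfold between r64 and r256 while the rsLoRA factor only halves, which is the usual motivation for rsLoRA at high rank.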

2605.09012 2026-05-12 cs.AI

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

Zicheng Lyu, Wenjie Yang, Shengzhong Zhang, Zengfeng Huang

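AI summary: Re²Math is a benchmark for evaluating theorem retrieval by large language models in research-level mathematics, testing whether a model can find and apply the relevant mathematical tool or lemma given a partial proof. The benchmark builds proof instances with candidate citations and introduces layered context and controllable hints, so the task is grounded in the literature without being tied to a specific citation. Experiments show that current models are reasonably good at retrieving valid statements but have substantial room to improve at judging whether a statement applies to the current proof step.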

2605.09011 2026-05-12 cs.LG cs.AI

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

Gianfranco Lombardo, Giuseppe Trimigno, Stefano Cagnoni

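AI summary: This paper studies the distribution of predictive information in large language models from a geometric perspective. It tracks a predictive readout subspace derived from intermediate residual streams and identifies three geometric phases across model depth. As depth increases, predictive information moves from an initial multi-candidate state to a single final candidate, in three phases: Seeding Multiplexing, Hoisting Overriding, and Focal Convergence. The analysis sheds light on how LLMs integrate information and settle on a prediction during generation.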

English abstract

We investigate the geometry of predictive information across the layers of large language models (LLMs). We repurpose representation lenses (learned affine maps trained to predict the next token from intermediate residual streams) as geometric diagnostic tools. Rather than asking what the model predicts at each layer, we ask where predictive information resides and how it evolves across depth. We define at each layer a predictive readout subspace as the dominant k-dimensional singular subspace of such a map on the d-dimensional residual stream (where k is a resolution parameter), and track its trajectory on the Grassmann manifold as a similarity profile across layers. The profile is well described by unimodal distributions exhibiting a rise, near-plateau, and descent; varying k from 1% to 50% of d traces a Pareto frontier between visibility and energy retention, yet the same structure emerges at all scales. Across eight models from two families (Qwen2.5 and OLMo2, 1B-32B), we identify three geometric phases. Updates are approximately orthogonal to the residual stream throughout; what distinguishes the phases is their effect on the effective rank, which expands, stabilizes, and concentrates. In the first, Seeding Multiplexing, feed-forward memories and attention layers seed a candidate set in superposition in family-specific proportions, with the final token rising as leading candidate from 20% to 35% of positions across this phase. In the second, Hoisting Overriding, updates override existing subspaces to concentrate the candidate distribution without expanding the rank. In the third, Focal Convergence, high-energy low-rank updates write the winner into a form aligned with the unembedding direction. Phases 1 and 3 grow slowly with model depth, while Phase 2 expands linearly. The additional capacity of deeper LLMs is largely absorbed by candidate disambiguation.
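
The readout-subspace construction described in the abstract can be sketched with NumPy as follows. This is an illustrative reading, not the authors' implementation, and it rests on several assumptions: the lens is represented only by its weight matrix of shape (vocab, d) with the bias ignored, the subspace is taken from the top-k right singular vectors, and similarity is measured as the mean squared cosine of the principal angles, which is one common overlap measure on the Grassmann manifold and may not be the one used in the paper.

```python
import numpy as np


def readout_subspace(weight, k):
    """Orthonormal basis (d x k) for the dominant k-dimensional right
    singular subspace of a lens weight matrix of shape (vocab, d)."""
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    return vt[:k].T


def subspace_similarity(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two
    k-dimensional subspaces; 1 means identical, 0 means orthogonal."""
    k = basis_a.shape[1]
    overlap = basis_a.T @ basis_b
    return float(np.linalg.norm(overlap, "fro") ** 2 / k)


# Toy lenses for two layers: random matrices stand in for trained maps.
rng = np.random.default_rng(0)
lens_layer_1 = rng.standard_normal((50, 16))
lens_layer_2 = lens_layer_1 + 0.1 * rng.standard_normal((50, 16))

u1 = readout_subspace(lens_layer_1, k=4)
u2 = readout_subspace(lens_layer_2, k=4)
print(subspace_similarity(u1, u2))
```

Computing this similarity between every pair of layers would give a profile of the kind the abstract describes, with k acting as the resolution parameter.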

2605.09009 2026-05-12 cs.LG cs.AI

Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

Minmin Zhang, Sina Aghaei, Soroush Saghafian

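AI summary: This paper studies in-context learning by large language models on sequential decision-making tasks, including Markov decision processes (MDPs), partially observable MDPs (POMDPs), and ambiguous POMDPs (APOMDPs). With supervised fine-tuning (SFT), the authors use a pretrained language model directly for few-shot decision-making from offline labeled trajectories, making policy imitation more flexible. Theoretical analysis shows that the fine-tuned attention layers can implicitly estimate the optimal Q-function, which yields an end-to-end suboptimality bound for the policy. In experiments across a range of synthetic environments, the fine-tuned model outperforms baselines that rely only on context or on a random policy, with the largest gains in long-horizon, partially observable, and model-ambiguous settings.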

2605.09008 2026-05-12 cs.LG cs.CL

Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models

Tianhao Qian

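AI summary: Chain-of-thought (CoT) reasoning improves large language models but brings compute latency and memory bottlenecks; this study proposes a new structured pruning method to address them. It introduces the Relative Kinetic Utility (RKU) framework, which combines alternating gradient flow with Fisher-trace normalization to better identify and preserve the structural pathways critical to logical reasoning. Experiments show that RKU markedly improves reasoning performance at high sparsity, outperforming the best existing baselines on GSM8K in particular.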

Comments 15 pages, 3 figures

2605.09005 2026-05-12 cs.RO cs.AI

Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

Ming Sun, Rui Wang, Xingrui Yu, Lihua Jing, Hangyu Du, Zhenglin Wan, Xu Pan, Ivor Tsang

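AI summary: This paper proposes GuardVLA, a backdoor-based ownership verification framework for vision-language-action (VLA) models, aimed at protecting model copyright when VLA models are shared and adapted. During training, the method injects secret information into embodied visual data to embed a stealthy, harmless backdoor watermark; after release, a "swap-and-detect" mechanism made up of a trigger projector and an external classification head activates and detects the watermark. Experiments show that GuardVLA supports reliable ownership verification while preserving normal task performance, and that the watermark remains detectable after the model is adapted.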

2605.09002 2026-05-12 cs.CV cs.AI

CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification

Lavsen Dahal, Joseph Y. Lo

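AI summary: This paper presents CT-IDP, an interpretable disease classification framework based on segmentation of abdominal CT images, which generates multi-organ segmentations and extracts more than 900 quantitative phenotype features for classification. The method is trained and validated on the MERLIN dataset and evaluated externally on two independent datasets. CT-IDP outperforms a DINOv3-based Vision Transformer baseline on multiple metrics, indicating both effectiveness and an interpretability advantage for disease classification.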

2605.08999 2026-05-12 cs.LG

Non-Parametric Rehearsal Learning via Conditional Mean Embeddings

Wen-Bo Du, Tian-Zuo Wang, Han-Jia Ye, Zhi-Hua Zhou

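AI summary: This paper studies decision-making for avoiding undesired futures (AUF) and proposes a non-parametric rehearsal learning method that assumes no specific form for the data-generating process. Using kernel methods, it converts the AUF objective into a unified representation that separates modeling of the expected value from the distribution shift induced by actions, captures the effect of actions through conditional mean embeddings, and builds a consistent nested estimator with kernel ridge regression. Experiments show that the method is effective and flexible on nonlinear systems and under non-additive noise.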

2605.08996 2026-05-12 cs.LG

Machine Learning-Based Graph Simplification for Symbolic Accelerators

Tiffany Yu, Rye Stahle-Smith, Darssan Eswaramoorthi, Rasha Karakchi

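AI summary: This paper proposes AutoSlim, a machine-learning-based graph simplification framework that optimizes automata graph structures in symbolic computing accelerators to reduce hardware resource usage and improve energy efficiency. It extracts features from historical graph executions and uses a random forest classifier to identify and remove low-impact nodes and edges, cutting redundant structure. Applied to the NAPOLY+ architecture, AutoSlim reduces FPGA resource usage by up to 40% and improves throughput and energy efficiency; it also includes a functional-equivalence verification step, opening new directions for hardware optimization and security research.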

2605.08992 2026-05-12 cs.LG

When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

Kiran Naseer, Umar Shoaib

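AI summary: In federated learning, this study finds that the priors of large foundation models can widen the performance gap for the most disadvantaged clients under extreme data heterogeneity. Comparing TextCNN with DistilBERT+LoRA at different levels of non-IID data, it shows that the worst-client accuracy gap can grow as the foundation model's parameter count increases, a phenomenon the authors call the "foundation model fairness paradox". The study also shows that methods that only adjust aggregation weights do not mitigate the problem, and it stresses the need to specifically protect minority clients in high-stakes federated settings.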

Comments 7 pages, 5 figures. Submitted to FL@FM-IJCAI 2026 Workshop

2605.08991 2026-05-12 cs.AI econ.EM

Sufficient conditions for a Heuristic Rating Estimation Method application

Jacek Szybowski, Konrad Kułakowski, Jiri Mazurek

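AI summary: This paper studies the conditions under which the Heuristic Rating Estimation (HRE) method can be applied and examines its applicability under different pairwise comparison algorithms. The authors analyze the arithmetic and geometric variants on complete and incomplete comparison data and report that the arithmetic variant performs best with respect to inconsistency estimation. The work provides a theoretical basis and practical guidance for applying HRE correctly.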

Comments 18 pages

2605.08988 2026-05-12 cs.LG cs.AI cs.CE

Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials

Amir Masoud Nourollah, Irtaza Khalid, Stefano Leoni, Steven Schockaert

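AI summary: This study evaluates compositional generalisation in machine learning interatomic potentials: whether a model understands how molecular fragments and their combinations affect properties, rather than merely memorizing patterns in the training data. It designs four tasks that require compositional generalisation and tests models on molecules they have not seen, to check their predictions on new molecular structures. Experiments show that current state-of-the-art models degrade substantially on out-of-distribution molecules, leaving considerable room for improvement in capturing chemical compositional structure.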

2605.08985 2026-05-12 cs.CV

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu, Yuan Yao

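AI summary: This study addresses the visual-encoding compute bottleneck caused by high-resolution image inputs in multimodal large language models (MLLMs) and proposes LLaVA-UHD v4, an efficient and controllable visual encoding scheme. Controlled comparisons show that slice-based encoding preserves local detail and outperforms conventional global encoding. The study also introduces early compression in shallow ViT layers, which substantially reduces computation without hurting downstream performance. Across multiple benchmarks the method cuts visual-encoding FLOPs by 55.8% while matching or exceeding the baseline.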

2605.08980 2026-05-12 cs.LG math.OC stat.ML

Muon Does Not Converge on Convex Lipschitz Functions

Tetiana Parshakova, Ahmed Khaled, Michael Crawshaw, Guillaume Garrigos, Robert M. Gower

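AI summary: This paper studies the convergence of the Muon optimizer on convex Lipschitz functions. Although Muon and its variants perform well in deep learning, their convergence analyses usually rely on smoothness assumptions, whereas the class of convex Lipschitz functions underlies many optimization methods. The authors find that Muon fails to converge on convex Lipschitz functions regardless of the choice of learning rate. Error feedback restores convergence but hurts performance on image classification and language modeling, which suggests that Muon's success stems from structure that the convex Lipschitz model lacks, most likely related to smoothness.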

2605.08975 2026-05-12 cs.AI

Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

Yunseong Jeon, Namcheol Lee, Yoonsu Lee, Jangwoon Park, Sol Ahn, Jong-Chan Kim, Seongsoo Hong

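AI summary: This paper studies latency in Alpamayo 1, a reasoning-based end-to-end autonomous driving system, and analyzes the efficiency difference between its multi-inference and single-inference modes of trajectory generation. Restructuring the system into a single-inference architecture and reducing the computational overhead of diffusion-based action generation lowers inference latency while preserving trajectory diversity and prediction quality. Experiments show a 69.23% reduction in inference latency without sacrificing performance.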

Comments Submitted to IEEE RTCSA on March 26, 2026 (KST); accepted on May 4, 2026 (KST)

2605.08974 2026-05-12 cs.CV cs.AI

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo, Anh Tuan Luu, Chunyan Miao, See-Kiong Ng, Shuicheng Yan, Bryan Hooi

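AI summary: Despite progress in video understanding, multimodal large language models still tend to hallucinate in dynamic scenes. This paper attributes the problem to a lack of sustained spatio-temporal monitoring, that is, an inability to track how object identities, states, and relations change over time. The authors propose STEMO-Bench, a benchmark that evaluates intermediate reasoning over object-centric facts, and STEMO-Track, a framework that uses structured trajectory construction and temporal aggregation to substantially improve the accuracy and consistency of spatio-temporal reasoning.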

Comments Code: https://github.com/nguyentthong/video_hallucination

2605.08971 2026-05-12 cs.CV cs.AI

Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud

Said Harb, Mehdi Maboudi, Markus Gerke

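AI summary: This paper studies reconstruction of CAD models from point clouds and proposes an extrusion segmentation strategy that decomposes complex shapes into basic extrusion parts, improving the reconstruction performance of deep learning models. By increasing data diversity, the approach improves generalization and robustness, offering a simple and effective way to generate structured CAD models from unordered point clouds.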

Comments Conference: ISPRS Toronto 2026

2605.08966 2026-05-12 cs.LG

VORT: Adaptive Power-Law Memory for NLP Transformers

Nabil Mlaiki

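AI summary: In standard Transformers the influence of distant tokens decays roughly exponentially, which conflicts with the power-law structure of long-range dependencies in natural language. This paper proposes VORT, a variable-order retention Transformer that assigns each input token a learnable fractional order α_i and models memory with a Grünwald–Letnikov power-law retention kernel. The model decomposes the kernel weights into a sum of exponentials using Gauss–Laguerre quadrature and retrieves efficiently through linear-attention accumulators, capturing long-range dependencies in natural language better while maintaining model performance. Experiments on synthetic tasks confirm its advantage.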
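
To make the contrast between power-law and exponential decay concrete, the sketch below computes the standard Grünwald–Letnikov coefficients w_j = (-1)^j C(α, j) with the usual recurrence and prints their magnitudes next to a geometric decay. It is only an illustration of the textbook coefficients: how VORT turns them into a retention kernel is not described in the summary, and the decay rate 0.5 and order α = 0.5 are arbitrary choices made here.

```python
def gl_weights(alpha, n):
    """First n Grünwald–Letnikov coefficients w_j = (-1)^j * C(alpha, j),
    computed with the recurrence w_j = w_{j-1} * (1 - (alpha + 1) / j)."""
    weights = [1.0]
    for j in range(1, n):
        weights.append(weights[-1] * (1.0 - (alpha + 1.0) / j))
    return weights


def exponential_weights(rate, n):
    """Geometric decay rate**j, for comparison with the power-law tail."""
    return [rate ** j for j in range(n)]


power_law = gl_weights(0.5, 64)
geometric = exponential_weights(0.5, 64)

# For non-integer alpha, |w_j| falls off polynomially (roughly j^-(alpha+1)),
# so distant positions keep far more weight than under geometric decay.
for lag in (1, 8, 32, 63):
    print(lag, abs(power_law[lag]), geometric[lag])
```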

Comments 18 pages, 5 figures

2605.08965 2026-05-12 cs.CV

Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

Naeun Lee, Hyunjong Kim, Sunghwan Choi, Injin Kong, Yohan Jo

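AI summary: Although multimodal large language models (MLLMs) perform well on multimodal tasks, predicting whether an image is persuasive, and why, remains challenging. This paper finds that having MLLMs reason before predicting does not consistently improve performance and can even degrade it, indicating that the generated rationales are unreliable. The authors therefore apply supervised fine-tuning on diverse teacher-generated rationales, which improves visual persuasiveness prediction, and introduce a three-dimensional faithfulness evaluation framework covering consistency between rationale and decision, relevance of the rationale to the image, and sensitivity of the decision to the rationale. The evaluation reveals a gap between predictive performance and rationale faithfulness and points to new directions for training more faithful visual persuasion models.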

2605.08961 2026-05-12 cs.CL eess.AS

Dolphin-CN-Dialect: Where Chinese Dialects Matter

Yangyang Meng, Huihang Zhong, Guodong Lin, Guanbo Wang, Hu Du, Zhiming Shao, Yukai Huang, Ke Li, Wei-Qiang Zhang

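AI summary: This paper introduces Dolphin-CN-Dialect, a streaming speech recognition model focused on Chinese and its dialects. To address severe imbalance in dialect data, the authors propose a temperature-based sampling strategy and revise the tokenization to better fit the languages' characteristics, which substantially improves dialect recognition. Experiments show that the model keeps a small footprint while achieving higher accuracy and a lower character error rate than the previous version, and it reaches state-of-the-art levels among open-source models in several respects.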
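
Temperature-based sampling for imbalanced corpora is commonly implemented by raising each group's size to the power 1/T and renormalizing. The sketch below shows that common form as an assumption; the summary does not give the exact formula used for Dolphin-CN-Dialect, and the dialect names and hour counts are hypothetical.

```python
def temperature_sampling_probs(counts, temperature):
    """Sampling probability per group, proportional to count ** (1 / T).
    T = 1 reproduces the raw data distribution; larger T flattens it
    toward uniform, so low-resource groups are sampled more often."""
    scaled = {name: count ** (1.0 / temperature) for name, count in counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical hours of audio per variety.
hours = {"mandarin": 9000, "cantonese": 800, "wu": 150, "minnan": 50}

for temperature in (1.0, 2.0, 5.0):
    print(temperature, temperature_sampling_probs(hours, temperature))
```

The temperature is a trade-off: values that are too high oversample small dialect sets and risk overfitting them, while T = 1 leaves them rarely seen during training.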