arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09042 2026-05-12 cs.CL

Evaluating Pragmatic Reasoning in Large Language Models: Evidence from Scalar Diversity

Ye-eun Cho

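AI summary: This study examines how to evaluate the pragmatic reasoning abilities of large language models, noting that model behavior can differ across evaluation methods, which makes it hard to gauge underlying reasoning ability accurately. Using scalar diversity as a diagnostic, it compares direct probability measurement with metalinguistic prompting and finds that pragmatic behavior differs markedly across models and task conditions. The results indicate that pragmatic reasoning is not captured by any single evaluation method but arises from the interaction between a model's internal probabilistic representations and task-induced behavior, underscoring the key role of evaluation design in understanding the pragmatic abilities of LLMs.

English abstract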

Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models' internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model-condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.

2605.09041 2026-05-12 cs.CL cs.CR

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Jialing Gan, Junhao Dong, Songze Li

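AI summary: This paper proposes BiAxisAudit, a new framework for evaluating bias in large language models along two dimensions: prompt sensitivity and response-layer divergence. Existing benchmarks typically reduce bias to a single scalar, overlooking key issues such as prompt-format variation and inconsistency within a response. BiAxisAudit evaluates the distribution of bias across prompt formats and the divergence between the selection and elaboration parts of a response, improving the accuracy and reliability of bias evaluation. Experiments show that the method distinguishes genuine bias reduction from cross-layer bias redistribution and reveal how strongly prompt format affects measured model bias.

Comments 24 pages, 10 figures. Preprint

English abstract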

Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than 0.7 on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a "cancellation trap"). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman ρ = 0.238, p = 0.570): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce BiAxisAudit, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with 80,200 coded responses each, task format alone explains as much variance as model choice; 63.6% of pooled bias signals (up to 85.2% per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.
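
The rank comparison in the abstract above relies on Spearman's rank correlation. The following is a minimal sketch of that statistic for tie-free data, not the authors' code: the per-model scores are hypothetical placeholders, and the p-value quoted in the abstract is not computed here.

```python
def ranks(scores):
    """Rank scores from 1 (highest) to n (lowest); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman correlation for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical bias scores for eight models under the two scoring layers
# (illustrative values only, not taken from the paper).
selection_scores = [0.10, 0.22, 0.35, 0.41, 0.18, 0.27, 0.05, 0.30]
elaboration_scores = [0.31, 0.12, 0.20, 0.45, 0.40, 0.08, 0.25, 0.15]

rho = spearman_rho(selection_scores, elaboration_scores)
print(f"Spearman rho between the two rankings: {rho:.3f}")
```

A value near 0 means that a model's position under selection-only scoring says little about its position under elaboration-only scoring, which is the situation the abstract describes.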

2605.09039 2026-05-12 cs.CV

SeasonScapes: Learning Large-scale Re-lightable 3D Landscapes with Seasonal Variation from Sparse Webcams

Timo Kleger, Qi Ma, Deheng Zhang, Luc Van Gool, Danda Pani Paudel

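AI summary: This paper presents the SeasonScapes framework and a large-scale dataset of 3D landscapes with seasonal variation. The dataset comprises more than 85,000 webcam images from 32 locations at 13 timestamps, covering over 50 km × 60 km of the Swiss mountains. Projecting timestamp-specific images onto a 3D mesh yields seasonal 3D landscapes that reflect how natural appearance changes over time. To handle occlusions and missing data, the work uses conditional diffusion models for image-guided inpainting on the mesh; the resulting meshes can be relit with standard physically based renderers.

English abstract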

We introduce the SeasonScapes framework and the SeasonScapes dataset, Swiss Sparse-view Mountain Scenes with Seasonal Changes, which covers over 50 km x 60 km and comprises more than 85,000 webcam images captured from 32 different locations across 13 timestamps throughout a full year. By projecting these timestamp-specific images onto a 3D mesh, we construct seasonal 3D landscapes that reflect natural appearance changes over time. To address occlusions and missing data, we leverage conditional diffusion models for image-guided inpainting directly on the mesh. The resulting completed meshes can be further relit using a standard physically-based renderer.

2605.09036 2026-05-12 cs.LG

PACT: Peak-Aware Cross-Attention Graph Transformers for Efficient Storm-Surge Emulation

Zesheng Liu, Doyup Kwon, Ning Lin, Maryam Rahnemoonfar

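AI summary: This paper proposes PACT, a peak-aware cross-attention graph transformer for efficient storm-surge emulation. The method represents atmospheric forcing patches as graphs and combines a graph neural network with cross-attention to produce station-level storm-surge predictions. PACT introduces a peak-aware learning strategy for extreme events that markedly improves the capture of surge peaks and tails, outperforms an existing spatio-temporal graph neural network on data from multiple tide-gauge stations on the US East Coast, and remains computationally efficient.

English abstract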

Accurate and efficient storm-surge emulation is essential for coastal hazard assessment, yet high-fidelity hydrodynamic models remain too expensive for large scenario ensembles and rapid evaluation under heterogeneous climate forcings. We present PACT, a peak-aware cross-attention graph transformer for efficient station-level storm-surge prediction from atmospheric forcing fields. PACT represents each forcing patch as a graph, encodes spatial structure with GraphSAGE, and uses a learned station query to aggregate node information through cross-attention rather than uniform pooling. A Transformer encoder models temporal dependence across the forcing history, and a horizon-query decoder generates lead-specific forecasts from a shared temporal memory. To better capture extreme events, we introduce a peak-aware learning strategy that couples a lightweight auxiliary peak-aware head with a tailored training objective, including a tail-focused loss on peak-dominated samples and a horizon-wise slope regularizer to encourage coherent multi-step evolution. Across multiple tide-gauge stations along the US Northeast coast, PACT outperforms a strong spatio-temporal graph neural network baseline in both RMSE and MAE. Diagnostics show improved peak fidelity and tail preservation for reanalysis and most CMIP6 datasets. PACT is also computationally efficient, requiring about 3.5 s to generate a full winter-season surge trajectory for one year after training. Under distribution shift across five CMIP6 forcings, PACT transfers well within the CMIP6 family but degrades markedly when transferring from reanalysis to climate-model forcings, highlighting a persistent reanalysis-GCM gap.
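
The contrast between query-based cross-attention and uniform pooling mentioned in the abstract above can be illustrated with a small sketch. This is a simplified single-head version that assumes the node embeddings serve directly as keys and values (a real model would normally apply learned projections first); the names `station_query` and `node_embeddings` and all numbers are hypothetical, not taken from PACT.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_pool(query, keys, values):
    """Single-head scaled dot-product attention with one query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return pooled, weights


def mean_pool(values):
    """Uniform pooling baseline: every node gets the same weight."""
    n = len(values)
    return [sum(v[j] for v in values) / n for j in range(len(values[0]))]


# Hypothetical 2-D node embeddings for a three-node forcing patch.
node_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical station query; in a trained model this vector is learned.
station_query = [2.0, 0.0]

pooled, weights = cross_attention_pool(station_query, node_embeddings, node_embeddings)
print("attention weights:", [round(w, 3) for w in weights])
print("attention-pooled:", [round(p, 3) for p in pooled])
print("mean-pooled:     ", [round(p, 3) for p in mean_pool(node_embeddings)])
```

With an all-zero query the attention weights become uniform and the result coincides with mean pooling, so uniform pooling is a special case of this readout.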

2605.09032 2026-05-12 cs.CL cs.AI cs.LG

A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting

Pavan Manjunath, Thomas Prufer

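AI summary: This paper proposes a quantum-inspired variational kernel and explainable-AI framework for short-horizon solar and wind forecasting across regions. The framework has four stages: acquiring data, training classical baselines, correcting residual errors with a quantum-inspired variational kernel, and using generative AI to produce natural-language explanations. Experiments show that the method performs well on forecasting tasks in three regions and that its quantum-inspired kernel separates calm and stormy weather regimes far better than a conventional radial basis kernel.

English abstract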

Reliable short-horizon forecasting of solar and wind generation is a structural prerequisite of any modern power system, yet most published forecasters are tuned and evaluated on a single climatic regime, and most algorithmic novelty has been concentrated either on classical recurrent networks or on monolithic foundation models that combine forecasting and explanation. We develop a four-stage hybrid framework that separates these concerns. The first stage acquires hourly generation, irradiance, and surface weather records through public application programming interfaces. The second stage trains three classical baselines (autoregressive integrated moving average, gradient boosted regression trees, and a two-layer long short-term memory network) and produces a strong point forecast together with a residual error series. The third stage corrects the residual through a quantum-inspired variational kernel built on a six-qubit hardware-efficient ansatz with three repeated entangling layers. The fourth stage uses generative artificial intelligence strictly as an explainability layer that reads the measured benchmark numbers and produces a structured natural language interpretation. Across three regions drawn from open public archives (Iberian solar, North Sea wind, and a mixed Texas trace), the proposed configuration stays within one percentage point of the strongest classical baseline on the in-domain forecasting task, and the quantum-inspired kernel separates calm and stormy weather regimes with a Fisher discriminant ratio approximately fifteen-fold higher than a tuned radial basis kernel.

2605.09031 2026-05-12 cs.LG

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

Thomas Tulinski, Simona Cocco, Rémi Monasson, Jorge Fernandez-De-Cossio-Diaz

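AI summary: This paper studies learning and generation in energy-based models (EBMs), focusing on an analytically solvable high-dimensional model, the spherical Boltzmann machine (SBM). Combining random matrix theory and dynamical mean-field theory, the authors derive exact equations for the SBM's training dynamics, compute the Bayesian evidence, and uncover cascades of phase transitions that occur during training and as hyperparameters vary. The study also shows that these phenomena are closely tied to sampling behavior during generation and confirms them in standard generative models.

English abstract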

Energy-based models (EBMs) are flexible generative architectures inspired by statistical physics, but their learning and generative properties remain poorly understood. Here, we analyze a solvable EBM in the high-dimensional limit: the spherical Boltzmann machine (SBM). Combining tools from random matrix theory and dynamical mean-field theory, we: solve exact equations describing the training dynamics of the SBM; compute the Bayesian evidence, which acts as a partition function in parameter space and encodes global properties of the trained model; and uncover cascades of phase transitions that occur both during training and as a function of hyperparameters, related to successive alignment and condensation of the top modes of the coupling matrix to the data. We connect these transitions to sampling-time generative phenomena in a teacher-student scenario, including: sampling temperature tuning, double descent as a function of regularization strength, tempered posterior effects, and out-of-equilibrium effects during training that induce biases in the trained model. We provide numerical evidence demonstrating that all these phenomena appear in standard generative architectures, beyond the SBM.

2605.09030 2026-05-12 cs.CV cs.LG

When Style Similarity Scores Fail: Diagnosing Raw CSD Cosine in Artist-Style Evaluation

Jörg Frochte

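AI summary: This paper examines the limitations of using Contrastive Style Descriptor (CSD) cosine similarity as an absolute style-fidelity metric in artist-style evaluation and proposes a diagnostic, the "discrimination gap", that tests whether the metric can reliably separate same-style from different-style works on a given artist corpus. Raw CSD cosine yields negative point-estimate gaps for a number of artists, indicating that it cannot be used as an absolute score; a CSLS readout and positional-embedding interpolation substantially improve evaluation accuracy. The study recommends running the diagnostic before using CSD cosine as a style score and using the improved CSD+ protocol for greater reliability.

Comments 24 pages, 7 figures, 19 tables

English abstract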

Raw cosine in the 768-dimensional output space of the Contrastive Style Descriptor (CSD) is now widely read as an absolute, calibrated style-fidelity score for text-to-image and style-imitation evaluation. We introduce the discrimination gap, a corpus-internal, prototype-free and threshold-free diagnostic that tests whether contrastive style cosines admit an absolute same-versus-different interpretation on a candidate artist corpus. On a 1799-artwork, 91-artist public-domain corpus, raw CSD cosine yields negative point-estimate gaps for 23/91 artists at the pairwise level (2/91 robust under bootstrap) and for 15/91 in the aggregated-pool scoring regime style-fidelity evaluations typically use. CSLS readout on the frozen backbone reduces the aggregated negative-gap count to 4/91; combined with positional-embedding interpolation to 336 pixels it raises unsupervised pair-verification AUC from 0.883 to 0.905 across 25 artist-disjoint splits. We refer to this diagnostic-driven readout protocol on the frozen backbone (CSLS as default, pos-interp 336 as the stronger optional setting) as CSD+, not a new encoder. A cross-backbone check on CLIP-ViT-L/14, SigLIP-large and DINOv2-Large reproduces the same shared-tradition failure pattern, providing evidence that the residual reflects a shared limitation of the four backbones we tested rather than a CSD-specific artefact. Practical implication: before reporting CSD cosine as an absolute style-fidelity score, run the diagnostic on the candidate corpus; CSLS is the minimal correction when it fails.
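
The abstract above does not spell out the formula for the discrimination gap. The sketch below assumes one plausible pairwise reading: for a given artist, the mean cosine between that artist's own works minus the mean cosine between those works and everyone else's. That definition, the function names, and the toy 2-D embeddings are all assumptions for illustration, not the paper's protocol.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def discrimination_gap(embeddings, labels, artist):
    """Assumed definition: mean same-artist cosine minus mean cross-artist cosine."""
    own = [e for e, label in zip(embeddings, labels) if label == artist]
    other = [e for e, label in zip(embeddings, labels) if label != artist]
    same = [cosine(a, b) for a, b in combinations(own, 2)]
    diff = [cosine(a, b) for a in own for b in other]
    return sum(same) / len(same) - sum(diff) / len(diff)


# Toy embeddings: two artists with two works each (illustrative values only).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["A", "A", "B", "B"]

gap = discrimination_gap(embeddings, labels, "A")
print(f"gap for artist A: {gap:.3f}")
```

Under this reading, a negative gap means an artist's works are on average closer to other artists' works than to each other, in which case a raw cosine cannot be read as an absolute same-style score for that artist.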

2605.09025 2026-05-12 cs.CV cs.LG

MedFL-Stress: A Systematic Robustness Evaluation of Federated Brain Tumor Segmentation under Cross-Hospital MRI Appearance Shift

Kiran Naseer, Naveed Anwer Butt

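AI summary: This paper proposes MedFL-Stress, a systematic stress-testing framework for evaluating the robustness of federated brain tumor segmentation models under cross-hospital MRI appearance shift. By introducing graded MRI appearance shifts, the study exposes performance disparities between hospitals under existing federated learning methods and compares FedAvg, FedProx, and FedBN. Experiments show that FedBN does better at raising worst-hospital segmentation performance and narrowing the inter-hospital gap, underscoring the importance of robustness evaluation in federated medical imaging.

English abstract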

Federated learning enables hospitals to collaboratively train segmentation models without sharing patient data. However, current evaluation protocols report only average performance across clients, masking failures at individual sites. In clinical deployment, a model that fails consistently at one hospital is a real safety risk that a good mean score can hide entirely. We introduce MedFL-Stress, a controlled stress-testing framework that exposes exactly this failure mode. Using 2D axial slices from BraTS 2020 distributed across four simulated hospital clients, we apply graded MRI appearance shifts (gamma contrast, scale-shift, and noise-plus-blur) reflecting scanner and acquisition variability in real multi-site deployments. Three federated baselines are evaluated: FedAvg, FedProx, and FedBN. Worst-hospital Dice and inter-hospital disparity are treated as primary metrics, not supplementary observations. FedAvg achieves the highest global mean Dice (0.8159) but conceals a 0.0850 gap between its best and worst-performing hospital. FedBN closes that gap by 41% (0.0850 to 0.0503) while sacrificing less than half a Dice point in mean accuracy (0.8159 to 0.8109), and the weakest hospital gains 3.5 Dice points outright (0.7309 to 0.7656). These findings demonstrate that robustness-oriented evaluation protocols are essential for reliable federated medical imaging deployment.
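
The headline figures in the abstract above can be rechecked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract for the derived quantities; the per-hospital dictionary is a hypothetical example added to show how worst-hospital Dice and inter-hospital disparity would be computed from per-client scores.

```python
def worst_hospital_dice(dice_by_hospital):
    """Lowest Dice score across hospital clients."""
    return min(dice_by_hospital.values())


def inter_hospital_disparity(dice_by_hospital):
    """Gap between the best and worst hospital Dice scores."""
    values = list(dice_by_hospital.values())
    return max(values) - min(values)


# Figures quoted in the abstract.
fedavg_gap, fedbn_gap = 0.0850, 0.0503
fedavg_worst, fedbn_worst = 0.7309, 0.7656
fedavg_mean, fedbn_mean = 0.8159, 0.8109

relative_gap_reduction = (fedavg_gap - fedbn_gap) / fedavg_gap
worst_gain = fedbn_worst - fedavg_worst
mean_drop = fedavg_mean - fedbn_mean

print(f"relative gap reduction: {relative_gap_reduction:.1%}")
print(f"worst-hospital gain: {worst_gain * 100:.2f} Dice points")
print(f"mean Dice drop: {mean_drop * 100:.2f} Dice points")

# Hypothetical per-hospital scores (not from the paper) to exercise the helpers.
example = {"hospital_1": 0.82, "hospital_2": 0.79, "hospital_3": 0.80, "hospital_4": 0.74}
print("worst:", worst_hospital_dice(example))
print("disparity:", round(inter_hospital_disparity(example), 4))
```

Reporting the worst client and the spread alongside the mean is what makes a site-level failure visible, since a mean alone can stay high while one hospital lags.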

2605.09024 2026-05-12 cs.CV cs.GR cs.MM eess.IV

Relightable Gaussian Splatting for Virtual Production Using Image-Based Illumination

Adrian Azzarelli, Nantheera Anantrasirichai, James Pollock, David R. Bull

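AI summary: This study proposes a virtual production (VP)-specific framework for 3D reconstruction and relighting based on high-resolution image-based illumination, addressing problems of conventional approaches such as the coupling of background and lighting and low-resolution environment maps. The method uses Gaussian Splatting and conditions relighting on the known background imagery, avoiding environment maps and reducing compositing to a background-image editing task. Using a newly captured dataset of real VP scenes, it decomposes a scene into fixed appearance and variable lighting components, achieving efficient, controllable, high-quality 3D reconstruction and relighting with support for multiple output variables.

English abstract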

Virtual production (VP) uses LED walls to provide both background imagery and image-based lighting. While this enables on-set compositing, it couples lighting to background and scene appearance, limiting flexibility for downstream editing. In addition, inverse rendering conventionally relies on physically-based rendering to estimate 3D geometry and lighting, using environment maps. However, these maps are typically low-resolution and assume far-field lighting. In VP, with near-field and high-resolution image-based lighting, this can lead to inaccuracies and introduce complexities when editing. Addressing this, we propose a VP-specific framework for 3D reconstruction and relighting using Gaussian Splatting. This uses the known background imagery to condition the relighting process, which avoids relying on environment maps and reduces compositing to a background-image editing task. To realize our framework, we introduce a process (and associated dataset) that captures real VP scenes under varying background content and illumination conditions. This data is used to decompose a 3D scene into fixed appearance and variable lighting components. The variable lighting process simulates light transport by parameterizing each primitive with a UV coordinate, intensity value and resolution modifier. Using mipmaps, these directly sample the background texture in image space, implicitly capturing reflections and refractions without physically-based rendering. Combined with the fixed appearance component, this allows us to render relit scenes using a Gaussian Splatting rasterizer. Compared to baselines, our approach achieves higher-quality 3D reconstruction and controllable relighting. The method is efficient (<3 GB RAM, <5 GB VRAM, <2 hours training, ~35 FPS) and supports rendering useful arbitrary output variables including depth, lighting intensity, lighting color, and unlit renders.

2605.09016 2026-05-12 cs.AI cs.LG cs.NA math.NA

CATO: Charted Attention for Neural PDE Operators

Chun-Wun Cheng, Sifan Wang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

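AI summary: This paper proposes CATO, a neural PDE operator designed to address computational efficiency and physical fidelity when modeling partial differential equations (PDEs) on complex geometries. CATO learns a continuous latent coordinate map that transforms mesh coordinates into a geometry-adapted chart space, where axial attention efficiently captures long-range dependencies. CATO also introduces a derivative-aware physics loss that improves accuracy on steady-state PDEs; experiments show strong results on multiple datasets, with far fewer parameters and better performance than existing methods.

English abstract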

Neural operators have emerged as powerful data-driven solvers for PDEs, offering substantial acceleration over classical numerical methods. However, existing transformer-based operators still face critical challenges when modeling PDEs on complex geometries: directly processing over massive mesh points is computationally expensive, while operating in raw discretization coordinates may obscure the intrinsic geometry where physical interactions are more naturally expressed. To address these limitations, we introduce the Charted Axial Transformer Operator (CATO), a geometry-adaptive and derivative-aware neural operator for PDEs on general geometries. Instead of applying attention directly in the physical coordinate system, CATO learns a continuous latent chart that maps mesh coordinates into a learned chart space, where chart-conditioned axial attention efficiently captures long-range dependencies with reduced computational cost. In addition, CATO introduces a derivative-aware physics loss for steady-state PDEs that jointly supervises solution values, mesh-consistent gradients, and an auxiliary flux-like field, improving physical fidelity and reducing oversmoothing. We further provide a theoretical approximation result showing that, under a favorable chart, charted axial attention can represent low-rank axial solution operators with controlled error, and that small chart perturbations induce bounded approximation degradation. CATO achieves the best performance across all evaluated datasets, yielding an average improvement of approximately 26.76% over the strongest competing baselines while reducing the number of parameters by 81.98%. These results highlight the effectiveness of learning geometry-adaptive charts and derivative-aware physical supervision for accurate and efficient PDE operator learning.

2605.09015 2026-05-12 cs.CL cs.LG

LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language

Luca Ballore

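AI summary: Targeting Sardinian, a vanishing Romance language, this study presents LLiMba, a 3-billion-parameter language model adapted on a single 24 GB consumer GPU through continued pretraining and supervised fine-tuning, which substantially improves generation in Sardinian. Training uses a corpus of 11.5 million Sardinian tokens and 2.4 million tokens of related Romance text, and the model outperforms the base model across multiple evaluation directions. The experiments compare several adaptation methods, find that rsLoRA r256 gives the best generation quality, and reveal key relationships among adapter capacity, regularization strategy, and generation quality.

English abstract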

Sardinian, a Romance language with roughly one million speakers, has minimal presence in modern NLP. Commercial services do not support it, and current language models do not produce it reliably. We present LLiMba, a 3B parameter Sardinian-ready model adapted from Qwen2.5-3B-Instruct through continued pretraining (CPT) and supervised fine-tuning (SFT) on a single 24 GB consumer GPU. The corpus contains 11.5 million tokens of Sardinian spanning LSC, Logudorese, and Campidanese, augmented with 2.4 million tokens of related Romance text as replay against register blurring. After CPT the model reaches a perplexity of 6.76 on held out Sardinian and outperforms the base across all six FLORES-200 directions. We compare five SFT configurations under matched conditions: full fine-tuning, LoRA r64, rsLoRA r128, rsLoRA r256, and DoRA r256. rsLoRA r256 wins on every direction into Sardinian, reaching 28.5 BLEU from English against 17.3 after CPT and 21.0 with full fine-tuning. The rank ablation places r128 between LoRA r64 and rsLoRA r256 on BLEU but reveals failure modes invisible to the metric, including leakage across scripts no other variant produces. LoRA r64 retains less factual content from SFT than configurations at higher rank and produces more confident fabrications, though all methods fabricate on content absent from training. DoRA r256 yields the smallest gap between training and evaluation but the worst factual accuracy. The findings indicate that adapter capacity matters more than the choice among LoRA variants for adapting a Romance pretrained base to a low resource Romance target, that stronger regularization is not uniformly beneficial, and that translation metrics smoothly order configurations whose qualitative behavior differs categorically. Perplexity comparisons across scripts must account for byte fallback tokenization, which deflates the metric for scripts other than Latin.
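
One way to see why rank matters differently for the LoRA and rsLoRA configurations compared in the abstract above is to look at the scaling factor applied to the low-rank update. Standard LoRA scales the update by alpha / r, while rank-stabilized LoRA (rsLoRA) scales it by alpha / sqrt(r). The sketch below tabulates both for the ranks named in the abstract; the value of `alpha` is a hypothetical choice, since the abstract does not report it.

```python
import math


def lora_scale(alpha, rank):
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def rslora_scale(alpha, rank):
    """Rank-stabilized LoRA scaling factor: alpha / sqrt(rank)."""
    return alpha / math.sqrt(rank)


alpha = 16.0  # hypothetical value; not stated in the abstract

for rank in (64, 128, 256):
    print(
        f"r={rank:>3}  LoRA scale={lora_scale(alpha, rank):.4f}  "
        f"rsLoRA scale={rslora_scale(alpha, rank):.4f}"
    )
```

With a fixed alpha, the standard factor shrinks in proportion to the rank, whereas the rank-stabilized factor shrinks only with its square root, so higher-rank adapters keep a larger effective update.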

2605.09012 2026-05-12 cs.AI

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

Zicheng Lyu, Wenjie Yang, Shengzhong Zhang, Zengfeng Huang

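AI summary: Re²Math is a benchmark for evaluating theorem retrieval by large language models in research-level mathematics, testing whether a model can find and apply the relevant mathematical tool or lemma given a partial proof. The benchmark builds proof instances around candidate citations and adds hierarchical context and controlled hints, making the task source-grounded without fixing a specific citation. Experiments show that current models are reasonably good at retrieving valid statements but have considerable room to improve in judging whether those statements apply to the current proof step.

English abstract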

Large language models are increasingly capable at closed-world mathematical reasoning, but research assistance also requires source-grounded use of the literature. When a proof reaches a non-trivial step, a useful assistant should determine whether the needed tool (e.g., a lemma) already exists, identify a suitable scholarly source, and verify that its assumptions align with the current proof context. To rigorously evaluate such capabilities, we introduce Re$^2$Math, a benchmark for tool-grounded retrieval from partial mathematical proofs. Each instance is built from a candidate instrumental citation in the proof of a main theorem, with hierarchical context and an optional leakage-controlled anchor hint. We also make the task source-grounded yet citation-agnostic in that any admissible theorem sufficient for the proof transition is accepted. Evaluation uses a release-frozen retrieval artifact, ensuring reproducibility, while the benchmark itself supports automatic, continual expansion with newly constructed instances. On the current benchmark test set, the best fixed-judge ToolAcc reaches 7.0%, despite substantially higher rates of source grounding, indicating that current systems often retrieve valid statements but fail to establish their applicability to the local proof step. By decoupling citation recall, grounding, and proof-gap sufficiency, Re$^2$Math transforms literature-grounded mathematical tool use into a controlled diagnostic task.

2605.09011 2026-05-12 cs.LG cs.AI

A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

Gianfranco Lombardo, Giuseppe Trimigno, Stefano Cagnoni

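AI summary: This paper studies the distribution of predictive information in large language models from a geometric perspective. It proposes tracking a predictive readout subspace derived from intermediate residual streams and identifies three geometric phases across model depth. As depth increases, predictive information moves from an initial multi-candidate state to a single final candidate in three phases (Seeding Multiplexing, Hoisting Overriding, and Focal Convergence), exposing the deeper structure of how LLMs integrate information and reach decisions during generation.

English abstract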

We investigate the geometry of predictive information across the layers of large language models (LLMs). We repurpose representation lenses (learned affine maps trained to predict the next token from intermediate residual streams) as geometric diagnostic tools. Rather than asking what the model predicts at each layer, we ask where predictive information resides and how it evolves across depth. We define at each layer a predictive readout subspace as the dominant k-dimensional singular subspace of such a map on the d-dimensional residual stream (where k is a resolution parameter), and track its trajectory on the Grassmann manifold as a similarity profile across layers. The profile is well described by unimodal distributions exhibiting a rise, near-plateau, and descent; varying k from 1% to 50% of d traces a Pareto frontier between visibility and energy retention, yet the same structure emerges at all scales. Across eight models from two families (Qwen2.5 and OLMo2, 1B-32B), we identify three geometric phases. Updates are approximately orthogonal to the residual stream throughout; what distinguishes the phases is their effect on the effective rank, which expands, stabilizes, and concentrates. In the first, Seeding Multiplexing, feed-forward memories and attention layers seed a candidate set in superposition in family-specific proportions, with the final token rising as leading candidate from 20% to 35% of positions across this phase. In the second, Hoisting Overriding, updates override existing subspaces to concentrate the candidate distribution without expanding the rank. In the third, Focal Convergence, high-energy low-rank updates write the winner into a form aligned with the unembedding direction. Phases 1 and 3 grow slowly with model depth, while Phase 2 expands linearly. The additional capacity of deeper LLMs is largely absorbed by candidate disambiguation.
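
The readout-subspace construction in the abstract above can be sketched with NumPy. The code takes the top-k right singular vectors of a (vocab × d) lens weight matrix as the subspace on the residual stream and compares consecutive layers by the mean squared cosine of their principal angles. That similarity measure is an assumption (the abstract does not name the one used), and the random matrices stand in for trained lenses.

```python
import numpy as np


def readout_subspace(lens_weight, k):
    """Orthonormal basis (d x k) of the top-k right singular subspace of a (vocab x d) map."""
    _, _, vt = np.linalg.svd(lens_weight, full_matrices=False)
    return vt[:k].T


def subspace_similarity(basis_a, basis_b):
    """Mean squared cosine of principal angles: 1 for identical, 0 for orthogonal subspaces."""
    k = basis_a.shape[1]
    return float(np.linalg.norm(basis_a.T @ basis_b, "fro") ** 2 / k)


# Random stand-ins for per-layer lens weights (hypothetical sizes, not trained lenses).
rng = np.random.default_rng(0)
vocab, d, k = 50, 16, 4
lenses = [rng.standard_normal((vocab, d)) for _ in range(3)]

bases = [readout_subspace(w, k) for w in lenses]
profile = [subspace_similarity(bases[i], bases[i + 1]) for i in range(len(bases) - 1)]
print("layer-to-layer similarity profile:", [round(s, 3) for s in profile])
```

For independent random matrices the similarity sits near k / d; with trained lenses, the interest lies in how the profile rises, plateaus, and falls across depth.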

2605.09009 2026-05-12 cs.LG cs.AI

Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning

Minmin Zhang, Sina Aghaei, Soroush Saghafian

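AI summary: This paper studies the in-context learning ability of large language models in sequential decision-making tasks, including Markov decision processes (MDPs), partially observable MDPs (POMDPs), and ambiguous POMDPs (APOMDPs). Through supervised fine-tuning (SFT), the authors adapt pretrained language models to make few-shot decisions directly from offline labeled trajectories, enabling flexible policy imitation. Theoretical analysis shows that a fine-tuned attention layer can implicitly estimate optimal Q-functions, which leads to an end-to-end suboptimality bound for the policy. Experiments show that fine-tuned language models outperform in-context-only and random baselines across synthetic environments, with the largest advantages in long-horizon, partially observable, and model-ambiguous settings.

English abstract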

Large language models (LLMs) have shown remarkable in-context learning (ICL) capabilities, yet their potential for sequential decision-making remains underexplored. In this paper, we study the ICL capabilities of LLMs in sequential decision-making settings, including Markov Decision Processes (MDPs), Partially Observable MDPs (POMDPs), and Ambiguous POMDPs (APOMDPs). We fine-tune pretrained LLMs to perform few-shot decision-making directly from offline, oracle-labeled trajectories. Our framework enables flexible imitation of policies through supervised fine-tuning (SFT). Theoretically, we focus on linear MDPs and interpret a fine-tuned attention layer as implicitly estimating optimal Q-functions from in-context data. Building on this interpretation, we derive an end-to-end suboptimality bound for the induced policy that separates the in-context estimation error from the training-length bias. Empirically, across synthetic MDP, POMDP, and APOMDP settings, we find that fine-tuned LLMs achieve substantially smaller optimality gaps than in-context-only and random baselines, with especially large gains in longer-horizon, partially observed, and model-ambiguous environments. Together, these results show that supervised fine-tuning provides an effective route to endowing pretrained LLMs with sequential decision-making capabilities from offline data, which is an important advantage in domains such as healthcare where offline data are abundant.

2605.09008 2026-05-12 cs.LG cs.CL

Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models

Tianhao Qian

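AI summary: This study addresses the inference-latency and memory bottlenecks that accompany chain-of-thought (CoT) reasoning in large language models and proposes a new structural pruning method. By introducing the Relative Kinetic Utility (RKU) framework, which combines Alternating Gradient Flow with Fisher trace normalization, the method more effectively identifies and preserves structural pathways that are critical for logical reasoning. Experiments show that RKU markedly improves reasoning performance at high sparsity, outperforming the strongest existing baseline on GSM8K.

Comments 15 pages, 3 figures

English abstract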

Chain-of-Thought (CoT) prompting has markedly improved the reasoning capabilities of Large Language Models (LLMs). However, scaling up test-time computation yields extensive CoT sequences, introducing severe inference latency and key-value (KV) cache memory bottlenecks. While structural pruning offers a fundamental, hardware-aware solution to alleviate static parameter burdens, existing magnitude-based methods may prune the neurons that CoT reasoning relies on. By over-indexing on discrete cross-entropy objectives, these heuristics fall into a "magnitude trap": they prioritize high-frequency, low-information syntactic tokens and trigger a reasoning collapse at high sparsities (e.g., 40%). To overcome this topological phase transition, we propose Relative Kinetic Utility (RKU), a novel theoretical framework that elevates discrete pruning to a continuous kinetic integral over the depth manifold of the model based on Alternating Gradient Flow (AGF). By modifying it with Fisher trace normalization, RKU acts as a lightweight curvature-aware normalization to isolate "kinetic spikes", the fundamental structural pathways responsible for high-curvature logical routing. Extensive experiments on Qwen-2.5-7B and LLaMA-3-8B show improved performance in the high-sparsity regime around 40%. RKU attains 13.34% accuracy on GSM8K at 40% sparsity, outperforming the strongest baseline, and appears to better preserve reasoning-relevant representations under out-of-distribution evaluation.

2605.09005 2026-05-12 cs.RO cs.AI

Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

Ming Sun, Rui Wang, Xingrui Yu, Lihua Jing, Hangyu Du, Zhenglin Wan, Xu Pan, Ivor Tsang

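AI summary: This paper proposes GuardVLA, a backdoor-based ownership verification framework for vision-language-action (VLA) models, aimed at protecting model ownership as VLA models are shared and adapted. During training, the method embeds a stealthy, harmless backdoor watermark by injecting secret messages into embodied visual data; after release, a swap-and-detect mechanism built from a trigger projector and an external classifier head activates and detects the watermark. Experiments show that GuardVLA supports reliable ownership verification while preserving normal task performance and that the watermark remains detectable after model adaptation.

English abstract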

Vision-Language-Action models (VLAs) support generalist robotic control by enabling end-to-end decision policies directly from multi-modal inputs. As trained VLAs are increasingly shared and adapted, protecting model ownership becomes essential for secure deployment and responsible open-source usage. In this paper, we present GuardVLA, the first backdoor-based ownership verification framework specifically designed for VLAs. GuardVLA embeds a stealthy and harmless backdoor watermark into the protected model during training by injecting secret messages into embodied visual data. For post-release verification, we propose a swap-and-detect mechanism, in which the trigger projector and an external classifier head are used to activate and detect the embedded backdoor based on prediction probabilities. Extensive experiments across multiple datasets, model architectures, and adaptation settings demonstrate that GuardVLA enables reliable ownership verification while preserving benign task performance. Further results show that the embedded watermark remains detectable under post-release model adaptation.

2605.09002 2026-05-12 cs.CV cs.AI

CT-IDP: Segmentation-Derived Quantitative Phenotypes for Interpretable Abdominal CT Disease Classification

Lavsen Dahal, Joseph Y. Lo

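AI summary: This paper proposes CT-IDP, an interpretable disease classification framework based on abdominal CT segmentation, which generates multi-organ segmentations and extracts more than 900 quantitative phenotype features for disease classification. The method was trained and validated on the MERLIN dataset and externally evaluated on two independent datasets; CT-IDP outperformed a DINOv3-based vision transformer baseline across multiple metrics, indicating its effectiveness and interpretability advantages for disease classification.

English abstract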

In this retrospective multi-institutional study, a quantitative phenotyping framework, CT-IDP (CT Image-Derived Phenotypes), was developed on the MERLIN abdominal CT benchmark (training, validation, and test sets of 15,175, 5,018, and 5,082 studies, respectively) and externally evaluated on two independent datasets: Duke-Abdomen (2,000) and AMOS (1,107). Multi-organ segmentations were generated with TotalSegmentator and used to derive over 900 organ and compartment-level descriptors spanning morphometry, attenuation, and contextual/burden findings. Sparse disease-specific logistic regression with elastic-net regularization was trained on MERLIN and externally validated under a frozen specification. Performance was compared against a DINOv3-based vision-transformer baseline using AUC and average precision (AP), supported by phenotype-stratified audits and coefficient-level inspection. Macro-AUC for CT-IDP versus the baseline was 0.897 versus 0.880 on MERLIN, 0.877 versus 0.857 on the Duke-Abdomen dataset, and 0.780 versus 0.756 on AMOS.

2605.08999 2026-05-12 cs.LG

Non-Parametric Rehearsal Learning via Conditional Mean Embeddings

Wen-Bo Du, Tian-Zuo Wang, Han-Jia Ye, Zhi-Hua Zhou

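AI summary This paper studies the decision problem of avoiding undesired futures (AUF) and proposes a non-parametric rehearsal learning method that does not assume a specific form for the data-generating process. The method uses kernel machinery to recast the AUF objective into a unified representation that separates desirability modeling from action-induced distributional changes, captures the effect of actions through conditional mean embeddings, and builds a consistent nested estimator with kernel ridge regression. Experiments show that the method is effective and flexible for nonlinear systems and non-additive noise.

Details
English abstract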

In machine learning, a critical class of decision-related problems concerns preventing predicted undesirable outcomes, referred to as the avoiding undesired future (AUF) problem. To address this, the rehearsal learning framework has been proposed to model influence relations for effective decisions. However, existing rehearsal methods rely on restrictive parametric assumptions such as linear systems or additive noise, limiting their practical applicability. In this paper, we propose the first non-parametric rehearsal learning approach for AUF without assuming specific functional forms of data generation processes. Specifically, we use kernel machinery to reformulate the AUF objective into a unified representation that disentangles desirability modeling from action-induced distributional changes. To handle the discontinuity of the desirability indicator, we present a smooth Probit surrogate and provide an approximation error bound. Meanwhile, we capture the action-induced changes via conditional mean embeddings, and develop a kernel ridge regression based nested estimator for the AUF objective with consistency guarantees. Such a formulation naturally accommodates nonlinear systems and non-additive noise, and empirical results on synthetic and real-data-derived semi-synthetic benchmarks demonstrate the effectiveness and flexibility of our approach.
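
To illustrate the idea of replacing a discontinuous desirability indicator with a smooth Probit surrogate, here is a minimal sketch. It assumes a one-dimensional outcome whose desired region is a half-line `y >= threshold` and uses the standard normal CDF with a hypothetical bandwidth `sigma`; the abstract does not state the exact surrogate or error bound, so this is an illustration, not the authors' construction.

```python
import math


def indicator(y: float, threshold: float) -> float:
    """Discontinuous desirability indicator for the assumed region y >= threshold."""
    return 1.0 if y >= threshold else 0.0


def probit_surrogate(y: float, threshold: float, sigma: float) -> float:
    """Standard normal CDF evaluated at (y - threshold) / sigma."""
    z = (y - threshold) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Away from the threshold, the surrogate approaches the indicator as sigma shrinks.
for sigma in (1.0, 0.1, 0.01):
    gap = abs(probit_surrogate(0.3, 0.0, sigma) - indicator(0.3, 0.0))
    print(f"sigma={sigma}: |surrogate - indicator| at y=0.3 is {gap:.3e}")
```

A smaller `sigma` tracks the indicator more closely but makes the surrogate steeper near the threshold, which is the trade-off an approximation error bound would typically quantify.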

2605.08996 2026-05-12 cs.LG

Machine Learning-Based Graph Simplification for Symbolic Accelerators

Tiffany Yu, Rye Stahle-Smith, Darssan Eswaramoorthi, Rasha Karakchi

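AI summary This paper proposes AutoSlim, a machine learning-based graph simplification framework that optimizes automata graph structures in symbolic computing accelerators to reduce hardware resource usage and improve energy efficiency. The method extracts features from prior graph executions and uses a Random Forest classifier to identify and remove low-impact nodes and edges, effectively reducing redundant structure. Experiments show that, when applied to the NAPOLY+ architecture, AutoSlim reduces FPGA resource usage by up to 40% and improves throughput and energy efficiency. It also includes a functional-equivalence verification step and points to new directions for hardware optimization and security research.

Details
English abstract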

Graph-based accelerators have been widely adopted in symbolic data processing applications such as genomics, cybersecurity, and artificial intelligence. However, these systems often suffer from excessive memory usage and inefficiencies stemming from redundant graph structures. We present AutoSlim, a machine learning-based framework that leverages data-driven methods to prune automata graphs for hardware accelerators. Using features extracted from prior graph executions and a Random Forest classifier, AutoSlim identifies and removes low-impact nodes and edges. When applied to a Non-deterministic Finite Automata overlay architecture (NAPOLY+), AutoSlim reduces FPGA resource usage by up to 40%, with corresponding improvements in throughput and power efficiency. The framework includes a verification step to ensure functional equivalence after pruning and suggests promising directions for both hardware optimization and security.

2605.08992 2026-05-12 cs.LG

When More Parameters Hurt: Foundation Model Priors Amplify Worst-Client Disparity Under Extreme Federated Heterogeneity

Kiran Naseer, Umar Shoaib

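AI summary In federated learning, the study finds that the priors of large foundation models can widen the performance gap for the most disadvantaged clients under extreme data heterogeneity. By comparing TextCNN and DistilBERT+LoRA across different Non-IID levels, the study shows that the model with more parameters can have a larger worst-client accuracy gap, a phenomenon it calls the "FM Fairness Paradox". It also shows that adjusting aggregation weights alone does not mitigate the problem, underscoring the need to explicitly protect minority clients in high-stakes federated settings.

Comments 7 pages, 5 figures. Submitted to FL@FM-IJCAI 2026 Workshop

Details
English abstract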

Federated learning (FL) is increasingly used to fine-tune foundation models (FMs) on distributed private data. The community largely assumes that large-scale pretraining serves as a 'rising tide that lifts all boats' in federated settings. However, our experiments reveal that these powerful priors can hinder rather than help the most disadvantaged clients under extreme heterogeneity. Through controlled experiments on federated text classification, we compare worst-client accuracy between TextCNN (2.7M parameters) and DistilBERT with Low-Rank Adaptation (LoRA, 66M parameters) across four Non-IID heterogeneity levels. Under extreme label skew (alpha = 0.1), DistilBERT+LoRA produces a worst-client accuracy gap of 50.1% -- 56% larger than TextCNN's 32.2% gap, despite having 25x more parameters and extensive pretraining. Under moderate heterogeneity (alpha >= 0.5), the pattern reverses: the FM nearly eliminates the gap. We call this the FM Fairness Paradox. We further show that an inverse-weighted LoRA aggregation method (FedAvgW) does not resolve the disparity, suggesting aggregation reweighting alone may be insufficient. Our results highlight the need for mechanisms that explicitly protect minority clients before deploying foundation models in high-stakes federated contexts such as healthcare and education.
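
The "56% larger" comparison and the parameter ratio quoted above follow from simple arithmetic on the figures in the abstract. The sketch below only re-derives those two ratios; the variable names are illustrative.

```python
# Figures quoted in the abstract (alpha = 0.1 setting).
fm_gap = 50.1        # DistilBERT+LoRA worst-client accuracy gap, in percentage points
cnn_gap = 32.2       # TextCNN worst-client accuracy gap, in percentage points
fm_params = 66e6     # DistilBERT with LoRA
cnn_params = 2.7e6   # TextCNN

relative_increase = (fm_gap - cnn_gap) / cnn_gap
param_ratio = fm_params / cnn_params

print(f"FM gap is {relative_increase:.1%} larger than the TextCNN gap")
print(f"parameter ratio is {param_ratio:.1f}x")
```

The relative increase rounds to the 56% stated in the abstract, and the parameter ratio comes out at a little over 24, which the abstract reports as roughly 25x.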

2605.08991 2026-05-12 cs.AI econ.EM

Sufficient conditions for a Heuristic Rating Estimation Method application

Jacek Szybowski, Konrad Kułakowski, Jiri Mazurek

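AI summary This paper studies the conditions under which the Heuristic Rating Estimation (HRE) method can be applied and examines its applicability under different pairwise comparison algorithms. The authors analyze the arithmetic and geometric methods on complete and incomplete comparison data and report that the arithmetic variant performs optimally with respect to inconsistency estimation. The study provides a theoretical basis and practical guidance for applying the HRE method correctly.

Comments 18 pages

Details
English abstract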

A series of papers has introduced the Heuristic Rating Estimation method, which evaluates a set of alternatives based on pairwise comparisons and the weights of reference alternatives. We formulate the conditions under which the HRE method can be applied correctly. The research considers both arithmetic and geometric algorithms for complete and incomplete pairwise comparison methods. The illustrative examples show that the estimations of inconsistency in the arithmetic variant are optimal.

2605.08988 2026-05-12 cs.LG cs.AI cs.CE

Benchmarking Compositional Generalisation for Machine Learning Interatomic Potentials

Amir Masoud Nourollah, Irtaza Khalid, Stefano Leoni, Steven Schockaert

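AI summary This study evaluates the compositional generalisation of machine learning interatomic potentials, that is, whether models understand how molecular fragments and their combinations determine properties rather than merely memorising patterns in the training data. To this end, it designs four tasks that require compositional generalisation and tests models on molecules unseen during training to verify their predictive ability on new molecular structures. Experiments show that state-of-the-art models degrade markedly on out-of-distribution molecules, indicating substantial room for improvement in capturing the compositional structure of chemistry.

Details
English abstract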

Machine Learning Interatomic Potentials play a fundamental role in computational chemistry and materials science, enabling applications from molecular dynamics simulations to drug design and materials discovery. While recent approaches can estimate inter-atomic forces with high precision, it remains unclear to what extent they can generalise to previously unseen molecules. Do they learn the compositional structure of chemistry, capturing how molecular fragments and their combinations determine properties, or do they primarily learn to interpolate patterns that are specific to the training examples? To address this question, we propose a benchmark consisting of four tasks that require some form of compositional generalisation. In each task, models are tested on molecules that were unseen during training, but the training data is chosen such that generalisation to the test examples should be feasible for models that learn the underlying physical principles. Our empirical analysis shows that the considered tasks are highly challenging for state-of-the-art models, with errors on out-of-distribution examples often an order of magnitude higher than on in-distribution examples, even when using foundation models that have been pre-trained on millions of molecules.

2605.08985 2026-05-12 cs.CV

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

Kechen Fang, Yihua Qin, Chongyi Wang, Wenshuo Ma, Tianyu Yu, Yuan Yao

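AI summary This study addresses the visual-encoding computational bottleneck caused by high-resolution image inputs in multimodal large language models (MLLMs) and proposes LLaVA-UHD v4, an efficient and controllable visual encoding scheme. Controlled experiments show that a slice-based encoding strategy, which preserves local detail, outperforms conventional global encoding. The study also introduces early compression in shallow ViT layers, which substantially reduces computation without hurting downstream task performance. Experiments show that the method cuts visual-encoding FLOPs by 55.8% across multiple benchmarks while matching or surpassing the baseline.

Details
English abstract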

Visual encoding constitutes a major computational bottleneck in Multimodal Large Language Models (MLLMs), especially for high-resolution image inputs. The prevailing practice typically adopts global encoding followed by post-ViT compression. Global encoding produces massive token sequences, while post-ViT compression incurs the full quadratic attention cost of the ViT before any token reduction takes place. In this work, we revisit this convention along two dimensions: the encoding strategy and visual token compression. First, controlled experiments show that slice-based encoding outperforms global encoding across benchmarks, suggesting that preserving local details through sliced views can be more beneficial than applying global attention for fine-grained perception. Second, we introduce intra-ViT early compression, which reduces tokens in shallow ViT layers and substantially lowers visual-encoding FLOPs while preserving downstream performance. By integrating intra-ViT compression into the slice-based encoding framework, we present LLaVA-UHD v4, an efficient and compute-controllable visual encoding scheme tailored for high-resolution inputs. Across a diverse set of benchmarks covering document understanding, OCR, and general VQA, LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% while matching or even surpassing baseline performance. These results suggest that visual-encoding efficiency can be substantially improved without sacrificing downstream performance, providing a practical design direction for efficient high-resolution MLLMs. All model weights and code will be publicly released to support further research.
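
The cost argument above (post-ViT compression pays the full quadratic attention cost, while earlier token reduction and slicing do not) can be made concrete with a rough cost model. The sketch below uses a textbook per-layer estimate of multiply-accumulates and a hypothetical configuration; none of the numbers come from the paper, and it does not attempt to reproduce the reported 55.8% reduction.

```python
def layer_macs(num_tokens: int, width: int) -> int:
    """Approximate multiply-accumulates for one Transformer layer."""
    linear_terms = 12 * num_tokens * width * width          # QKV/output projections + 4x MLP
    attention_terms = 2 * num_tokens * num_tokens * width   # QK^T and attention-weighted V
    return linear_terms + attention_terms


def vit_macs(num_tokens, width, num_layers, compress_before_layer=None, ratio=1):
    """Total MACs; optionally divide the token count by `ratio` before a given layer."""
    total = 0
    tokens = num_tokens
    for layer in range(num_layers):
        if compress_before_layer is not None and layer == compress_before_layer:
            tokens //= ratio
        total += layer_macs(tokens, width)
    return total


# Hypothetical configuration, not taken from the paper.
TOKENS, WIDTH, LAYERS = 4096, 1024, 24

post_vit = vit_macs(TOKENS, WIDTH, LAYERS)  # compression happens only after the ViT
early = vit_macs(TOKENS, WIDTH, LAYERS, compress_before_layer=6, ratio=4)
sliced = 4 * vit_macs(TOKENS // 4, WIDTH, LAYERS)  # four independently encoded slices

print(f"early compression saves {1 - early / post_vit:.1%} of encoder MACs")
print(f"slicing into 4 views saves {1 - sliced / post_vit:.1%} of encoder MACs")
```

In this model, slicing only shrinks the quadratic attention term, whereas early compression shrinks both the linear and quadratic terms for every layer after the compression point.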

2605.08980 2026-05-12 cs.LG math.OC stat.ML

Muon Does Not Converge on Convex Lipschitz Functions

Tetiana Parshakova, Ahmed Khaled, Michael Crawshaw, Guillaume Garrigos, Robert M. Gower

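AI summary This paper studies the convergence of the Muon optimizer on convex Lipschitz functions. Although Muon and its variants perform well in deep learning, their convergence analyses usually rely on smoothness assumptions, whereas the class of convex Lipschitz functions underlies many optimization methods. The study finds that Muon does not converge on convex Lipschitz functions regardless of the learning rate schedule. Error feedback restores convergence, but it degrades performance on image classification and language modeling tasks, suggesting that Muon's success stems from structure absent from the convex Lipschitz model, most plausibly related to smoothness.

Details
English abstract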

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.

2605.08975 2026-05-12 cs.AI

Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

Yunseong Jeon, Namcheol Lee, Yoonsu Lee, Jangwoon Park, Sol Ahn, Jong-Chan Kim, Seongsoo Hong

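AI summary This paper studies the latency of Alpamayo 1, a reasoning-based end-to-end autonomous driving system, and analyzes the efficiency difference between multi-reasoning and single-reasoning trajectory generation. By redesigning the system as a single-reasoning architecture and removing computational overhead in diffusion-based action generation, it lowers inference latency while preserving trajectory diversity and prediction quality. Experiments show a 69.23% reduction in inference latency without sacrificing performance.

Comments Submitted to IEEE RTCSA on March 26, 2026 (KST); accepted on May 4, 2026 (KST)

Details
English abstract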

Reasoning-based end-to-end (E2E) autonomous driving has recently emerged as a promising approach to improving the interpretability of driving decisions as it can generate human-readable reasoning together with predicted trajectories. Such approaches commonly generate multiple trajectories to capture diverse future behaviors, and they fall into two categories: (1) multi-reasoning, where one reasoning sequence is generated per trajectory, and (2) single-reasoning, where a single reasoning is shared across all trajectories. The former offers richer diversity at the cost of redundant computation, while the latter is more efficient but is often assumed to sacrifice diversity. Alpamayo 1, a representative system, adopts the multi-reasoning approach and achieves competitive trajectory prediction performance. However, the efficiency of this design remains largely unexplored, making it a well-motivated subject for investigation. In this paper, we systematically analyze and improve Alpamayo 1 in two ways. First, we reduce inference latency while preserving trajectory diversity by redesigning Alpamayo 1 into a single-reasoning system. Through extensive experiments, we find that replacing multi-reasoning with single-reasoning does not meaningfully degrade trajectory diversity. Second, we accelerate diffusion-based action generation by eliminating inter-block overhead arising from unnecessary copy operations and inefficient kernel execution. Through closed-loop and open-loop experiments, we validate both optimizations, demonstrating a 69.23% reduction in inference latency while maintaining trajectory diversity and prediction quality. These results highlight the importance of jointly analyzing system architecture and runtime execution to improve the efficiency of reasoning-based E2E AD systems.

2605.08974 2026-05-12 cs.CV cs.AI

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models

Tri Cao, Khoi Le, Thong Nguyen, Cong-Duy Nguyen, Quynh Vo, Anh Tuan Luu, Chunyan Miao, See-Kiong Ng, Shuicheng Yan, Bryan Hooi

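AI summary Although multimodal large language models have advanced video understanding, they remain prone to hallucinations in dynamic scenes. This paper attributes this to a lack of persistent spatio-temporal monitoring, that is, the inability to track how object identities, states, and relations change over time. The authors propose STEMO-Bench, a benchmark that evaluates intermediate reasoning over object-centric facts, and STEMO-Track, a framework that significantly improves the accuracy and consistency of spatio-temporal reasoning through structured trajectory construction and temporal aggregation.

Comments Code: https://github.com/nguyentthong/video_hallucination

Details
English abstract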

While multimodal large language models (MLLMs) have advanced video understanding, they remain highly prone to hallucinations in dynamic scenes. We argue this stems from a failure in spatio-temporal monitoring, the ability to persistently track object identities, states, and relations over time. Existing benchmarks obscure this deficit by relying on single final-answer evaluations for queries that can often be resolved via local visual cues or statistical priors. To rigorously diagnose this, we introduce STEMO-Bench (Spatio-TEmporal MOnitoring), a benchmark of human-verified object-centric facts that evaluates intermediate reasoning by decomposing queries into sub-questions, distinguishing genuine temporal understanding from coincidental correctness. To address failure modes exposed by STEMO, we propose STEMO-Track, a novel object-centric framework that explicitly constructs and reasons over structured object trajectories via chunk-wise state extraction and temporal aggregation. Extensive experiments demonstrate that our object-centric framework significantly reduces hallucinated answers and improves spatio-temporal reasoning consistency over state-of-the-art MLLMs.

2605.08971 2026-05-12 cs.CV cs.AI

Extrusion Segmentation Strategy to improve CAD Reconstruction from Point Cloud

Said Harb, Mehdi Maboudi, Markus Gerke

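AI summary This paper studies how to reconstruct CAD models from point cloud data and proposes an extrusion-segmentation strategy that decomposes complex shapes into basic extrusion parts, improving the reconstruction performance of deep learning models. By increasing data diversity, the method improves model generalization and robustness and offers a simple yet effective way to generate structured CAD models from unordered point clouds.

Comments Conference: ISPRS Toronto 2026

Details
English abstract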

Computer-Aided Design is ubiquitous in today's world, as almost every manufactured object begins as a digital model across industries. At the same time, advances in 3D sensing have made point clouds a dominant form of raw 3D data. Recovering the CAD model of a physical object from its point cloud scan has two major applications: reverse engineering, where physical or hand-crafted prototypes need to be reconstructed automatically as editable digital models, and quality control, where recovering the CAD description of a manufactured object helps quantify and understand deviations introduced during the production process. Thus, converting unordered point clouds into structured CAD models is increasingly important for modern applications. Deep learning has enabled major progress in computer vision for both 2D and 3D data, and new datasets facilitate data-driven CAD reconstruction. Building on this foundation, we develop an end-to-end model that reconstructs CAD models from point clouds and introduce a segmentation approach that decomposes them into individual extrusions. These partial shapes increase data diversity, improving the generalization and robustness of deep learning models. Our strategy thereby provides a simple yet effective way to increase the reconstruction performance of deep learning models.

2605.08966 2026-05-12 cs.LG

VORT: Adaptive Power-Law Memory for NLP Transformers

Nabil Mlaiki

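AI summary Standard Transformers apply near-exponential decay to the influence of distant tokens, which conflicts with the power-law structure of long-range dependencies in natural language. This paper proposes VORT, a Variable-Order Retention Transformer that assigns each input token a learnable fractional order α_i and models memory with a Grünwald–Letnikov power-law retention kernel. The model applies a sum-of-exponentials decomposition to the kernel weights using Gauss–Laguerre quadrature and performs efficient retrieval through a linear-attention accumulator, so that it better captures long-range dependencies in natural language while maintaining model performance. Experiments on synthetic tasks confirm the model's advantage.

Comments 18 pages, 5 figures

Details
English abstract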

Standard Transformers impose near-exponential decay on the influence of distant tokens, conflicting with the power-law structure of long-range dependencies in natural language. We introduce the Variable-Order Retention Transformer (VORT), a memory architecture in which each ingested token is assigned a learnable fractional order $\alpha_i \in [\delta, 1]$ that governs a Grünwald--Letnikov power-law retention kernel. Because the fractional weighted sum is non-Markovian, we approximate it through a sum-of-exponentials (SOE) decomposition computed by Gauss--Laguerre quadrature on a Laplace-type integral representation of the kernel weights. Each exponential component admits a one-step Markovian recurrence at $O(S d_v)$ per step, where $S = O(\log(T/\varepsilon))$ terms suffice for $\varepsilon$-uniform accuracy on horizon $[1, T]$. Retrieval is keyed and associative via a linear-attention accumulator with an exact $O(K S d_\phi d_v)$-per-step recurrence. Four results are established: (i) an SOE approximation theorem with geometric convergence rate from the analyticity of the integrand after a log-change of variables; (ii) a quantisation bound valid on $[\delta, 1]$ with correct analysis near $\alpha = 0$; (iii) a direct $L^2$ energy argument (Proposition) showing that for $\alpha > 1/2$ any mixture with fixed minimum decay rate $\Lambda > 0$ incurs $L^2([1, T])$ error at least $N_\alpha(T) - C(\Lambda) \to \infty$, with the $\Lambda$-dependence made explicit; and (iv) linear convergence of a gradient plasticity rule under the Polyak--Łojasiewicz condition. Two synthetic experiments confirm the architectural advantage: a Zipf-distributed retrieval benchmark and an entity label-copy task with uniform lag distribution, the latter ruling out prior-matching as an explanation for the power-law kernel's advantage.
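
The sum-of-exponentials idea above can be sketched on a generic power-law kernel. The example relies on the identity $t^{-\beta} = \frac{1}{\Gamma(\beta)} \int_0^\infty s^{\beta-1} e^{-st}\,ds$ for $\beta > 0$ and, after the change of variables $s = e^u$, discretises the integral with a plain trapezoid rule. Both choices are assumptions made for illustration: the kernel $(\text{lag}+1)^{-\beta}$ stands in for the Grünwald--Letnikov weights, and the trapezoid rule stands in for the Gauss--Laguerre quadrature named in the abstract. All function names and parameter values are hypothetical.

```python
import math


def soe_power_law(beta, u_min=-30.0, u_max=5.0, step=0.5):
    """Rates s_j and weights w_j with sum_j w_j * exp(-s_j * t) ~= t ** (-beta)."""
    count = int(round((u_max - u_min) / step)) + 1
    rates, weights = [], []
    for j in range(count):
        u = u_min + j * step
        rates.append(math.exp(u))
        weights.append(step * math.exp(beta * u) / math.gamma(beta))
    return rates, weights


def soe_eval(rates, weights, t):
    """Evaluate the sum-of-exponentials approximation at lag t."""
    return sum(w * math.exp(-s * t) for s, w in zip(rates, weights))


def retained_memory_recurrent(xs, rates, weights):
    """One Markovian state per exponential; cost per step is linear in len(rates)."""
    decay = [math.exp(-s) for s in rates]
    state = [0.0] * len(rates)
    outputs = []
    for x in xs:
        state = [d * (m + x) for d, m in zip(decay, state)]
        outputs.append(sum(w * m for w, m in zip(weights, state)))
    return outputs


def retained_memory_direct(xs, beta):
    """Reference: explicit power-law weighted sum over the whole history."""
    return [
        sum((t - k + 1) ** (-beta) * xs[k] for k in range(t + 1))
        for t in range(len(xs))
    ]


BETA = 0.5
RATES, WEIGHTS = soe_power_law(BETA)
for t in (1, 10, 100, 1000):
    rel = abs(soe_eval(RATES, WEIGHTS, t) - t ** (-BETA)) * t ** BETA
    print(f"t={t}: relative error {rel:.2e} with {len(RATES)} exponentials")
```

The recurrent version never revisits past inputs: each exponential component is updated in place, which is the property that makes a non-Markovian power-law sum tractable at inference time.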

2605.08965 2026-05-12 cs.CV

Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning

Naeun Lee, Hyunjong Kim, Sunghwan Choi, Injin Kong, Yohan Jo

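AI summary Although multimodal large language models (MLLMs) perform well on multimodal tasks, predicting whether and why an image is persuasive remains challenging. This paper finds that having MLLMs reason before predicting does not consistently improve performance and can even reduce it, indicating that the generated rationales are unreliable. The study therefore proposes supervised fine-tuning on diverse teacher-generated rationales, which improves visual persuasiveness prediction, and introduces a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. The framework reveals a gap between prediction performance and rationale faithfulness and points to new directions for training more faithful visual persuasion models.

Details
English abstract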

Despite strong performance of Multimodal Large Language Models (MLLMs) on multimodal tasks, predicting whether and why an image is persuasive remains challenging. We first show that prompting MLLMs to reason before prediction does not consistently help, and can even reduce persuasiveness prediction performance, suggesting that naively generated rationales are unreliable signals for this task. Yet, no established methodology exists for training MLLMs to reason about visual persuasion or evaluating whether their rationales faithfully support their decisions. To address this gap, we show empirically and theoretically that diverse teacher-generated rationales, when used for supervised fine-tuning, improve visual persuasiveness prediction. We further introduce a three-dimensional faithfulness evaluation framework covering rationale-to-decision consistency, rationale-to-image groundedness, and rationale-to-decision sensitivity. Applying this framework shows that prediction performance alone does not guarantee faithful rationales, while rationale-to-decision sensitivity is most aligned with human rationale preferences. These findings motivate faithfulness-aware training objectives and scalable rationale supervision for visual persuasiveness evaluation. Our code and dataset will be made publicly available.

2605.08961 2026-05-12 cs.CL eess.AS

Dolphin-CN-Dialect: Where Chinese Dialects Matter

Yangyang Meng, Huihang Zhong, Guodong Lin, Guanbo Wang, Hu Du, Zhiming Shao, Yukai Huang, Ke Li, Wei-Qiang Zhang

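AI summary This paper introduces Dolphin-CN-Dialect, a streaming speech recognition model focused on Chinese and its dialects. To address highly imbalanced dialect data, the study proposes a temperature-based sampling strategy and redesigns the tokenizer to better fit linguistic characteristics, significantly improving dialect recognition. Experiments show that the model achieves higher recognition accuracy and a lower character error rate than the previous version while keeping a small model size, and is competitive with current state-of-the-art open-source models in several respects.

Details
English abstract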

We present Dolphin-CN-Dialect, a streaming-capable ASR model with a focus on Chinese and dialect-rich scenarios. Compared to the previous version, Dolphin-CN-Dialect introduces substantial improvements in data processing, tokenization, training stability, and data sampling strategies. To address the challenges of highly imbalanced dialect data, we propose a temperature-based sampling strategy that effectively balances standard Mandarin and low-resource dialects, leading to significant gains in dialect recognition performance. In addition, we redesign the tokenizer to better align with linguistic characteristics, adopting character-level modeling for Chinese and subword modeling for English, while introducing extensible dialect tokens. Experimental results show that Dolphin-CN-Dialect improves dialect recognition accuracy and reduces character error rate (CER) relative to Dolphin. Furthermore, Dolphin-CN-Dialect is competitive with recent SOTA open-source ASR models while maintaining a significantly smaller model size. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling a practical balance between latency and accuracy. It also provides flexible customization through hotword support and efficient deployment optimized for specialized hardware. These improvements make Dolphin-CN-Dialect a strong and practical solution for real-world multi-dialect ASR applications.
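
Temperature-based sampling for imbalanced data is commonly implemented by raising each group's size to the power 1/T before normalising. The sketch below assumes that form; the abstract does not give the exact formula, and the dialect names, counts, and temperature are hypothetical.

```python
from typing import Dict


def temperature_sampling_probs(counts: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Sampling probability per group, proportional to count ** (1 / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = {name: n ** (1.0 / temperature) for name, n in counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# Hypothetical amounts of training data per variety (arbitrary units).
counts = {"mandarin": 100000, "cantonese": 10000, "minnan": 1000}

proportional = temperature_sampling_probs(counts, temperature=1.0)
flattened = temperature_sampling_probs(counts, temperature=5.0)

for name in counts:
    print(f"{name}: T=1 -> {proportional[name]:.3f}, T=5 -> {flattened[name]:.3f}")
```

With T = 1 the sampler follows the raw data proportions; larger temperatures move it toward uniform sampling, which raises the share of low-resource dialects at the expense of the dominant variety.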