arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10675 2026-05-12 cs.CV

Neuromorphic Monocular Depth Estimation with Uncertainty Modeling

Viktor Bergkvist, Felix Rydell, Per-Erik Forssén, David Gustafsson, Johan Rideg

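AI summary: This paper studies monocular depth estimation from event cameras and proposes a neuromorphic depth estimation method with uncertainty modeling. Using Gaussian, log-normal, and evidential learning frameworks, the model predicts a per-pixel depth distribution and estimates its uncertainty. The experiments compare six event representations, with U-Net models trained on synthetic data and fine-tuned on real sequences. The results show that uncertainty estimation can be integrated into event-based depth estimation and used to indicate pixels with reliable depth, with the 10-bin log-normal and 5-bin evidential models performing best across metrics.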

Abstract

Event cameras offer distinct advantages over conventional frame-based sensors, including microsecond-level temporal resolution, high dynamic range, and low bandwidth. In this paper, we predict per-pixel depth distributions from monocular event streams using deep neural networks. We estimate uncertainty using Gaussian, log-normal, and evidential learning frameworks. We compare six event representations: spatio-temporal voxel grids with 1, 5, 10, and 20 temporal bins, the Compact Spatio-Temporal Representation (CSTR), and Time-Ordered Recent Event (TORE) volumes. Our U-Net-based models are trained on synthetic data and then fine-tuned on real sequences. We evaluate performance using absolute relative error, root mean squared error, and the area under the sparsification error. Quantitative results show that the representations perform similarly, while 10 bin log-normal and 5 bin evidential learning perform best across metrics. Our experiments demonstrate that uncertainty estimation can be successfully integrated into event-based monocular depth estimation, and be used to indicate pixels with reliable depth.
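As a rough illustration of the per-pixel likelihoods named in the abstract, the sketch below writes out the Gaussian and log-normal negative log-likelihoods for a single pixel. The function names and the parameterization (the network outputs `mu` and `sigma` directly) are assumptions made for illustration; the abstract does not specify the authors' loss or network heads.

```python
import math

def gaussian_nll(depth, mu, sigma):
    """Negative log-likelihood of `depth` under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (depth - mu) ** 2 / (2 * sigma ** 2)

def lognormal_nll(depth, mu, sigma):
    """Negative log-likelihood of `depth` under LogNormal(mu, sigma^2).

    Here mu and sigma describe log-depth, so depth must be positive.
    """
    log_d = math.log(depth)
    return log_d + 0.5 * math.log(2 * math.pi * sigma ** 2) + (log_d - mu) ** 2 / (2 * sigma ** 2)

# A badly wrong prediction is penalized less when the model admits high uncertainty.
confident_but_wrong = gaussian_nll(depth=5.0, mu=2.0, sigma=0.5)
uncertain_and_wrong = gaussian_nll(depth=5.0, mu=2.0, sigma=3.0)
```

Under a loss of this kind, the predicted `sigma` can be read as a per-pixel reliability signal, which is the role uncertainty plays in the abstract.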

2605.10674 2026-05-12 cs.LG cs.AI cs.CL cs.SE

Step Rejection Fine-Tuning: A Practical Distillation Recipe

Igor Slinko, Ilia Zavidnyi, Egor Bogomolov, Yaroslav Zharov

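AI summary: This paper proposes Step Rejection Fine-Tuning (SRFT), a method for improving the training of large language model agents on programming tasks. Unlike standard Rejection Fine-Tuning (RFT), SRFT does not discard trajectories that fail to resolve the task. Instead, a critic model assesses the correctness of each step, and the loss is masked only for erroneous steps while those steps remain in the context, so the model learns to recover from errors without reproducing them. On SWE-bench Verified, SRFT reaches a 32.2% resolution rate, an improvement of 3.7% compared with 2.4% for RFT.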

Abstract

Rejection Fine-Tuning (RFT) is a standard method for training LLM agents, where unsuccessful trajectories are discarded from the training set. In the context of SWE-bench tasks, this corresponds to filtering out runs where the submitted patch does not pass the tests. However, this approach discards unresolved trajectories, even though they make up a large portion of all trajectories for hard tasks and may still be partially correct. In this work, we propose Step Rejection Fine-Tuning (SRFT), a practical way to leverage these unresolved trajectories. To do so, we employ a critic LLM to assess the correctness of each step in a trajectory. During training, we then mask the loss for erroneous steps while retaining them in the context window. This way, we ensure the model learns to recover from errors without reproducing them. Evaluation on SWE-bench Verified shows that while RFT improves the resolution rate by 2.4% by excluding unresolved trajectories, SRFT improves it by 3.7% by filtering them instead of discarding them completely, reaching a total resolution rate of 32.2%.
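The step-level masking described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: `build_loss_mask`, `masked_mean_loss`, and the boolean critic verdicts are hypothetical names, and a real pipeline would also mask non-assistant tokens, which is omitted here.

```python
def build_loss_mask(steps, critic_ok):
    """Flatten a trajectory and build a per-token loss mask.

    steps:     list of token-id lists, one per agent step.
    critic_ok: list of bools from a (hypothetical) critic; True means the step
               was judged correct.
    Tokens of rejected steps stay in the sequence (so later steps can condition
    on them) but receive mask 0 and contribute no loss.
    """
    tokens, mask = [], []
    for step_tokens, ok in zip(steps, critic_ok):
        tokens.extend(step_tokens)
        mask.extend([1 if ok else 0] * len(step_tokens))
    return tokens, mask

def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over unmasked positions only."""
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

# Three steps; the critic rejects the second one.
tokens, mask = build_loss_mask([[1, 2], [3], [4, 5, 6]], [True, False, True])
loss = masked_mean_loss([1.0, 1.0, 9.0, 2.0, 2.0, 2.0], mask)
```

By contrast, plain RFT as described in the abstract would drop the whole trajectory if the final patch failed, so none of these tokens would be trained on.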

2605.10673 2026-05-12 cs.LG

Compander-Aligned Query Geometry for Quantized Zeroth-Order Optimization

Yao Shu, Zilin Zhu

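AI summary: This paper studies low-precision forward evaluation for zeroth-order (ZO) optimization. It argues that a quantized ZO query cannot be treated as a continuous finite difference followed by harmless storage rounding: the query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. The authors introduce the notion of query geometry, model nonuniform companding quantization as $Q = ϕ^{-1} \circ U \circ ϕ$, and propose CAQ-ZO, which builds Rademacher stencils in the companded domain so that the query-time residual is exactly zero. Experiments show that the method improves fine-tuning over the trained NF4 baseline under the same quantizer and evaluation budget.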

Abstract

Low-bit forward evaluation is an attractive route to memory-efficient zeroth-order (ZO) adaptation: the optimizer needs only scalar losses, and the model can be queried near deployment precision. The obstacle is that a quantized ZO query is not a continuous finite difference followed by harmless storage rounding. The query chooses endpoints, the low-precision engine rounds them, and the loss difference is measured along the rounded chord. For nonuniform companding quantizers, this makes the codebook insufficient to predict ZO behavior: a fixed weight-space radius can collapse in dense cells, over-span sparse cells, or assign a rounded chord to an unrounded update direction. We identify the missing object as query geometry and model scalar nonuniform quantization as $Q = ϕ^{-1} \circ U \circ ϕ$. CAQ-ZO (Compander-Aligned Queries for Zeroth-Order Optimization) forms one-grid-step Rademacher stencils $z \pm Δr$ in $z = ϕ(x)$, maps endpoints back through $ϕ^{-1}$, and updates in $z$. Our theory proves the grid-span mismatch, decomposes endpoint-rounding estimator residuals, and gives stationarity bounds in which generic off-grid queries retain a $Δ^2/μ^2$ residual channel while CAQ-ZO makes the query-time residual exactly zero. Synthetic experiments isolate this channel, and matched NF4 Qwen/Llama fine-tuning shows that CAQ-ZO improves the trained NF4 baseline under the same quantizer and evaluation budget.
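A scalar sketch of the companding model $Q = ϕ^{-1} \circ U \circ ϕ$ and of a one-grid-step stencil may help make the query-geometry argument concrete. The mu-law compander, the value of `MU`, and the grid step `DELTA` are assumptions chosen for illustration; they are not the NF4 codebook used in the paper, and the function names are hypothetical.

```python
import math

MU = 255.0     # assumed mu-law strength (illustrative; not the NF4 codebook)
DELTA = 0.125  # uniform grid step in the companded domain z

def phi(x):
    """Compander: weight space -> companded space z in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def phi_inv(z):
    """Expander: companded space -> weight space."""
    return math.copysign(math.expm1(abs(z) * math.log1p(MU)) / MU, z)

def uniform_round(z):
    """U: round to the nearest point of the uniform grid in z."""
    return DELTA * round(z / DELTA)

def quantize(x):
    """Q = phi^{-1} o U o phi."""
    return phi_inv(uniform_round(phi(x)))

def aligned_stencil(z, r):
    """Endpoints z +/- DELTA * r mapped back to weight space, with r in {+1, -1}."""
    return phi_inv(z + DELTA * r), phi_inv(z - DELTA * r)

# Compander-aligned query: both endpoints are codebook points, so rounding is a no-op.
x_plus, x_minus = aligned_stencil(0.5, 1)

# Fixed weight-space radius: both endpoints fall in the same cell and collapse.
x = phi_inv(0.5)
collapsed = quantize(x + 0.01) == quantize(x - 0.01)
```

In this toy setting the aligned endpoints survive quantization unchanged, while the fixed-radius pair rounds to a single point and would yield a zero loss difference, which is one of the failure modes the abstract lists.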

2605.10671 2026-05-12 cs.LG math.OC stat.ML

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

Phalguni Nanda, Zaiwei Chen

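AI summary: This paper formulates natural policy gradient, a core reinforcement learning algorithm, as doubly smoothed policy iteration (DSPI) within a Bellman-operator framework. In this framework, each policy is obtained by applying a regularized greedy step to a weighted average of past Q-functions, which covers policy iteration, dual-averaged policy iteration, and related methods. The authors prove distribution-free global geometric convergence of DSPI without modifying the MDP or using trajectory-dependent stepsizes, and give iteration-complexity bounds for natural policy gradient and policy dual averaging. The framework also extends to discounted MDPs with linear function approximation and to stochastic shortest path problems.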

Abstract

In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-γ)^{-1}\log((1-γ)^{-1}ε^{-1}))$ for computing an $ε$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
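The link between natural policy gradient and averaged $Q$-functions can be checked numerically at a single state. The sketch below assumes the familiar tabular softmax form of natural policy gradient, a multiplicative-weights update started from the uniform policy, and uses made-up $Q$ values. It illustrates only the equal-weight special case, not the general DSPI weighting, which the abstract does not spell out.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def npg_multiplicative(q_history, eta):
    """Iterate pi_{k+1}(a) proportional to pi_k(a) * exp(eta * Q_k(a)) from uniform."""
    n = len(q_history[0])
    pi = [1.0 / n] * n
    for q in q_history:
        weights = [p * math.exp(eta * qa) for p, qa in zip(pi, q)]
        total = sum(weights)
        pi = [w / total for w in weights]
    return pi

def softmax_of_summed_q(q_history, eta):
    """Regularized greedy (softmax) step applied to the sum of past Q vectors."""
    summed = [sum(column) for column in zip(*q_history)]
    return softmax([eta * s for s in summed])

# Placeholder Q values for three actions over three iterations (not from an MDP).
q_history = [[1.0, 0.5, 0.0], [0.2, 0.9, 0.1], [0.3, 0.3, 0.8]]
pi_iterative = npg_multiplicative(q_history, eta=0.7)
pi_closed_form = softmax_of_summed_q(q_history, eta=0.7)
```

The two policies agree up to floating-point error, which is the sense in which the iterative update can be read as one softmax step on accumulated $Q$-functions.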

2605.10668 2026-05-12 cs.LG math.OC math.ST stat.TH

A Spectral Framework for Closed-Form Relative Density Estimation

Francis Bach

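AI summary: This paper proposes a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models, including unnormalized and conditional models. By representing the KL divergence as an integral of weighted chi-squared divergences, the method converts KL estimation into a family of least-squares problems and derives an explicit spectral formula based on first- and second-order feature moments, yielding closed-form estimators of divergences and log-density potentials. The framework applies to a broad class of f-divergences and can be combined with kernel methods or neural-network feature learning. The paper proves convergence of the estimators and compares them on synthetic data with optimization-based variational methods.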

Abstract

We propose a closed-form spectral framework for relative log-density estimation in linearly parameterized probabilistic models, including unnormalized and conditional models. This is achieved by representing the Kullback-Leibler (KL) divergence as an integral of weighted chi-squared divergences, converting KL estimation into a family of least-squares problems. We derive an explicit spectral formula based only on first- and second-order feature moments, yielding closed-form estimators of both divergences and log-density potentials for fixed features. The framework extends to a broad class of f-divergences and can be combined with kernelization or feature learning with neural networks. We prove convergence guarantees for the resulting estimators and empirically compare them on synthetic data with optimization-based variational formulations, including logistic and softmax regression for normalized conditional models.

2605.10663 2026-05-12 cs.AI

Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents

Zhiyuan Fan, Wenwei Jin, Feng Zhang, Bin Li, Yihong Dong, Yao Hu, Jiawei Li

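AI summary: Evolving-RL is an end-to-end optimization framework that aims to improve an agent's ability to self-evolve from experience at deployment time. By jointly optimizing experience extraction and experience utilization, it enables large language models to learn from and reuse past experience more effectively and to adapt better to new tasks. Experiments show that Evolving-RL substantially improves performance on out-of-distribution tasks and that the gains depend on the coordinated co-evolution of extraction and utilization. The method also works as an experience-augmented reinforcement learning algorithm that improves performance even without test-time experience accumulation.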

Comments 17 pages, 5 figures

Abstract

Experience-driven self-evolving agents aim to overcome the static nature of large language models by distilling reusable experience from past interactions, thus enabling adaptation to novel tasks at deployment time. This process places substantial demands on the foundation model's capacities for abstraction, generalization, and in-context learning. However, most existing studies focus primarily on system-level design choices, such as how experience is represented and managed, neglecting the inherent capabilities of the underlying model. While some recent works have started to optimize the experience utilization stage via reinforcement learning, they still fail to treat self-evolution as a unified process to be jointly optimized. To this end, we propose Evolving-RL, an efficient algorithmic framework that jointly improves the experience extraction and utilization capabilities required for self-evolution. Specifically, we center the learning process on experience extraction and evaluation, using the two supervisory signals derived from evaluation to optimize the extractor and solver separately and thus enable their coordinated co-evolution. Experiments on ALFWorld and Mind2Web show that Evolving-RL effectively enhances LLMs' ability to extract and reuse experience, leading to strong performance gains on out-of-distribution tasks (up to 98.7% relative improvement over the GRPO baseline on ALFWorld unseen tasks and 35.8% on Mind2Web), and these gains are fully unlocked only through the coordinated co-evolution of experience extraction and utilization. Furthermore, Evolving-RL inherently functions as an experience-augmented RL algorithm. By internalizing reusable experience patterns directly into model parameters, it achieves remarkable performance gains over standard baselines on both seen and unseen tasks, even in the absence of test-time experience accumulation.

2605.10661 2026-05-12 cs.CV cs.AI

bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition

Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta

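AI summary: This paper investigates whether the stack of independently parameterized layers in a Vision Transformer (ViT) can be replaced by single-block recurrence. It proposes bViT, a model that processes an image by repeatedly applying one transformer block, which preserves the iterative depth structure while greatly reducing the parameter count. Under the same training recipe and computational budget, bViT reaches accuracy comparable to a standard ViT on ImageNet-1K with about an order of magnitude fewer parameters, indicating that recurrence is effective for vision tasks.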

Comments 31 pages, 16 figures

Abstract

Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
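The "order of magnitude fewer parameters" claim can be sanity-checked with a back-of-the-envelope count. The sketch below assumes a conventional pre-norm transformer block with width 768, MLP ratio 4, and biases everywhere, and it ignores patch embedding, position embeddings, and the classifier head. These are standard ViT-B conventions, not details confirmed by the abstract, and the function names are hypothetical.

```python
def block_params(d, mlp_ratio=4):
    """Parameter count of one transformer block under the stated assumptions."""
    attention = 4 * d * d + 4 * d                      # QKV and output projections with biases
    mlp = 2 * mlp_ratio * d * d + mlp_ratio * d + d    # two linear layers with biases
    layer_norms = 4 * d                                # two LayerNorms, scale and shift
    return attention + mlp + layer_norms

def stacked_params(d, depth):
    """Standard ViT: every layer has its own block."""
    return depth * block_params(d)

def run_recurrent(block, x, steps):
    """Single-block recurrence: apply the same block `steps` times."""
    for _ in range(steps):
        x = block(x)
    return x

shared = block_params(768)          # one shared block
stacked = stacked_params(768, 12)   # twelve independent blocks
ratio = stacked / shared
```

Under these assumptions, twelve recurrent steps cost the same compute as twelve stacked blocks while storing only one block's weights, consistent with the roughly tenfold reduction quoted in the abstract.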

2605.10659 2026-05-12 cs.CL cs.AI cs.SI stat.ML

When Can Digital Personas Reliably Approximate Human Survey Findings?

Mumin Jia, Yilin Chen, Divya Sharma, Jairo Diaz-Rodriguez

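AI summary: This paper examines the extent to which digital personas generated by large language models (LLMs) can reliably reproduce human survey responses. Personas are built from LISS panel data and compared against the same respondents' later answers, with performance assessed across tasks and levels of analysis. The results show that digital personas do better in domains tied to stable attributes and values, remain limited for individual prediction and for recovering multivariate structure, and depend more on the structure of human responses than on model choice.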

Abstract

Digital personas powered by Large Language Models (LLMs) are increasingly proposed as substitutes for human survey respondents, yet it remains unclear when they can reliably approximate human survey findings. We answer this question using the LISS panel, constructing personas from respondents' background variables and pre-2023 survey histories, then testing them against the same respondents' held-out post-cutoff answers. Across four persona architectures, three LLMs, and two prediction tasks, we assess performance at the question, respondent, distributional, equity, and clustering levels. Digital personas improve alignment with human response distributions, especially in domains tied to stable attributes and values, but remain limited for individual prediction and fail to recover multivariate respondent structure. Retrieval-augmented architectures provide the clearest gains, but performance depends more on human response structure than on model choice: personas perform best for low-variability questions and common respondent patterns, and worst for subjective, heterogeneous, or rare responses. Our results provide practical guidance on when digital personas could be appropriate for survey research and when human validation remains necessary.

2605.10658 2026-05-12 cs.LG

Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory

Yao Shu, Jian Mu, Zhongxiang Dai

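AI summary: This paper studies why zeroth-order (ZO) adaptation may forget less than first-order (FO) methods in continual learning and proposes a local randomized gradient-shaping theory. A finite-difference analysis shows that the ZO adaptation shape is mean-aligned with the FO gradient, and that under a norm-matched comparison, forgetting depends on the retention curvature exposed by the adaptation shape. For norm-matched ZO, the expected shaped retention curvature satisfies an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component; projecting it onto the incoming gradient gives the FO-ZO quadratic forgetting gap. Building on this theory, the authors propose RISE, which applies the calibrated ZO shape to exact FO gradients to balance stability and plasticity.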

Abstract

Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO--ZO quadratic forgetting gap: ZO improves mean forgetting precisely when the FO direction has above-average retention curvature, by a query-dependent fraction of that curvature excess. A practical finite-query accounting separates the mean mechanism from one-batch sampling and smoothing perturbations. As an algorithmic transfer, RISE applies the calibrated ZO shape to exact FO gradients inside parameter blocks. Its target is a stability--plasticity tradeoff: randomized shaping may reduce the retention exposure paid by FO, exact gradients remove finite-smoothing bias from finite-difference ZO, and blockwise sampling supplies many local shaping directions after one gradient computation. The blockwise analysis separates mean-step damage from centered random exposure, showing how block-diagonal curvature, cross-block coupling, and local shaping diagnostics specify where this exact-gradient transfer is most likely to be visible.
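The statement that finite differences are "mean-aligned with FO" can be illustrated with the textbook two-point estimator and Rademacher directions. This is a generic sketch, not the paper's norm-matched comparator or RISE, whose exact definitions the abstract does not give; the function names are hypothetical.

```python
import itertools

def zo_two_point(f, x, u, mu):
    """Two-point zeroth-order estimate along direction u with smoothing radius mu."""
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2 * mu)
    return [scale * ui for ui in u]

def mean_over_rademacher(f, x, mu):
    """Exact expectation of the estimate over all sign vectors in {-1, +1}^d."""
    d = len(x)
    directions = list(itertools.product((-1.0, 1.0), repeat=d))
    total = [0.0] * d
    for u in directions:
        g = zo_two_point(f, x, u, mu)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(directions) for t in total]

# For a linear loss the true gradient is the coefficient vector (3, -2, 0.5).
linear_loss = lambda v: 3.0 * v[0] - 2.0 * v[1] + 0.5 * v[2]
mean_estimate = mean_over_rademacher(linear_loss, [0.1, 0.2, 0.3], mu=0.5)
```

Each single-direction estimate points along a random sign vector, so an individual step differs from the FO step even though the average matches it. The abstract's analysis concerns how that random shape interacts with retention curvature.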

2605.10655 2026-05-12 cs.LG

BCJR-QAT: A Differentiable Relaxation of Trellis-Coded Weight Quantization

Venugopalan Iyengar

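AI summary: This paper proposes BCJR-QAT, a differentiable relaxation that addresses the non-differentiability of trellis-coded quantization in quantization-aware training (QAT). It replaces the non-differentiable Viterbi argmax with the BCJR forward-backward algorithm, giving a soft quantization over trellis paths that supports end-to-end training. The paper also contributes an efficient kernel implementation, a theoretical analysis, and experiments on a large language model, where BCJR-QAT beats QTIP-PTQ at 2 bits per weight.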

Comments 26 pages, 4 figures, 4 tables. Code at https://github.com/Venugopalan2610/quant-2bit. Model weights and trajectory snapshots at https://huggingface.co/Venugopalan2610/BCJR-QAT-Llama-3.2-1B-2bit

Abstract

Trellis-coded quantization sets the current 2-bit post-training frontier for LLMs (QTIP), but pushing below the PTQ ceiling requires quantization-aware training, and QAT on a trellis is obstructed by the non-differentiable Viterbi argmax. We introduce BCJR-QAT, a relaxation that replaces the argmax with the BCJR forward-backward sum-product algorithm at temperature $T$, producing a soft codeword equal to the Boltzmann expectation over trellis paths, exactly differentiable, recovering the hard QTIP code as $T \to 0$, and mathematically identical to the transfer-matrix computation for a 1D Ising-like spin chain. We contribute (i) a fused Triton kernel making BCJR tractable on a single consumer GPU ($6.57\times$ speedup, fp32 parity); (ii) a quantitative drift-budget theory of when BCJR-QAT can escape the QTIP-PTQ Voronoi basin, verified across four experiments; and (iii) a positive empirical result on Llama-3.2-1B at 2 bpw under end-to-end forward-KL distillation: with the right schedule (skip the high-$T$ phase to avoid an overshoot we diagnose), single-layer BCJR-QAT beats QTIP-PTQ by $\mathbf{-0.084}$ PPL on WikiText-2, and multi-layer compounding is super-additive.
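To make the "Boltzmann expectation over trellis paths" concrete, here is a toy forward-backward computation on a two-state trellis, checked against brute-force path enumeration. The squared-error path cost, the tiny codebook, and the transition rule are assumptions for illustration only. This is not the QTIP trellis or the paper's fused Triton kernel, and all function names are hypothetical.

```python
import itertools
import math

def soft_codeword(weights, codebook, allowed, temp):
    """Posterior-mean code value per position via forward-backward (sum-product)."""
    n, s_count = len(weights), len(codebook)
    # emission[t][s] = exp(-(w_t - c_s)^2 / temp)
    emission = [[math.exp(-((w - c) ** 2) / temp) for c in codebook] for w in weights]
    alpha = [[0.0] * s_count for _ in range(n)]
    beta = [[1.0] * s_count for _ in range(n)]
    alpha[0] = emission[0][:]
    for t in range(1, n):
        for s in range(s_count):
            incoming = sum(alpha[t - 1][p] for p in range(s_count) if allowed[p][s])
            alpha[t][s] = emission[t][s] * incoming
    for t in range(n - 2, -1, -1):
        for s in range(s_count):
            beta[t][s] = sum(emission[t + 1][q] * beta[t + 1][q]
                             for q in range(s_count) if allowed[s][q])
    result = []
    for t in range(n):
        posterior = [alpha[t][s] * beta[t][s] for s in range(s_count)]
        z = sum(posterior)
        result.append(sum(p * c for p, c in zip(posterior, codebook)) / z)
    return result

def brute_force(weights, codebook, allowed, temp):
    """Same expectation by enumerating every valid path (only viable for tiny cases)."""
    n, s_count = len(weights), len(codebook)
    numer, z = [0.0] * n, 0.0
    for path in itertools.product(range(s_count), repeat=n):
        if any(not allowed[a][b] for a, b in zip(path, path[1:])):
            continue
        cost = sum((w - codebook[s]) ** 2 for w, s in zip(weights, path))
        p = math.exp(-cost / temp)
        z += p
        for t, s in enumerate(path):
            numer[t] += p * codebook[s]
    return [x / z for x in numer]

W = [0.1, 0.9, 0.4]
C = [0.0, 1.0]
A = [[True, True], [True, False]]   # toy rule: state 1 may not follow state 1
warm = soft_codeword(W, C, A, temp=1.0)
cold = soft_codeword(W, C, A, temp=0.01)
```

At `temp=1.0` the output is a smooth blend of code values; at `temp=0.01` it is numerically indistinguishable from the lowest-cost hard path, mirroring the $T \to 0$ limit described in the abstract.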

2605.10654 2026-05-12 cs.LG cs.AI

Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights

Jixiang Qing, Henry Moss, Matthias Sachs

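AI summary: This paper studies active learning for Gaussian process regression under a Boltzmann distribution induced by the unknown function itself, a setting that arises in applications such as potential energy surface modeling in computational chemistry. To handle a target distribution that is unknown and has an intractable partition function, the authors propose AB-SID-iVAR, a Gaussian process-based acquisition function that approximates the target distribution without estimating the partition function and applies to both discrete and continuous input domains. Experiments show consistent improvements over existing approaches on synthetic benchmarks and real-world tasks.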

Abstract

We consider the active learning problem where the goal is to learn an unknown function with low prediction error under an unknown Boltzmann distribution induced by the function itself. This self-induced weighting arises naturally in problems such as potential energy surface (PES) modeling in computational chemistry, yet poses unique challenges as the target distribution is unknown and its partition function is intractable. We propose AB-SID-iVAR, a Gaussian Process-based acquisition function that approximates the intractable Bayesian target distribution in closed form while avoiding partition function estimation, and is applicable to both discrete and continuous input domains. We also analyze a Thompson sampling alternative (TS-SID-iVAR) as a higher variance Monte Carlo variant. Despite the unknown target, under mild conditions, we establish that the terminal prediction error vanishes with high probability, and provide a tighter average-case guarantee. We demonstrate consistent improvements over existing approaches in this setting on synthetic benchmarks and real-world PES modeling and drug discovery tasks.
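The evaluation target, prediction error weighted by a Boltzmann distribution that the function itself induces, can be written down for a small discrete domain. The sketch below uses the true function values to form the weights, which is possible only in a toy setting; the paper's acquisition function exists precisely because these weights are unknown in practice. The inverse temperature `beta` and all function names are illustrative assumptions.

```python
import math

def boltzmann_weights(energies, beta):
    """Normalized weights proportional to exp(-beta * energy) on a finite domain."""
    lowest = min(energies)  # shift for numerical stability
    raw = [math.exp(-beta * (e - lowest)) for e in energies]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_sq_error(true_values, predictions, beta):
    """Squared prediction error averaged under the self-induced Boltzmann weights."""
    weights = boltzmann_weights(true_values, beta)
    return sum(w * (t - p) ** 2
               for w, t, p in zip(weights, true_values, predictions))

true_energy = [0.0, 5.0]
# Same absolute mistake, placed at the low-energy point versus the high-energy point.
error_at_low = weighted_sq_error(true_energy, [1.0, 5.0], beta=1.0)
error_at_high = weighted_sq_error(true_energy, [0.0, 6.0], beta=1.0)
```

Mistakes in low-energy regions dominate the metric, which is why an active learner in this setting would want to concentrate queries where the unknown function is small.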

2605.10653 2026-05-12 cs.RO

Embodied AI in Action: Insights from SAE World Congress 2026 on Safety, Trust, Robotics, and Real-World Deployment

Jan-Mou Li, Paul Schmitt, Wei Tong, Majed Mohammed, Akshay Chalana, Arpan Kusari, Edward Griffor

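AI summary: Drawing on the "Embodied AI in Action" panel session at SAE World Congress 2026, this paper discusses the safety, trust, and reliability challenges of applying embodied AI to real-world systems such as autonomous vehicles, robots, and industrial machines. It stresses treating embodied AI as a systems engineering problem that combines lifecycle governance, human-centered design, and evolving standards to support responsible deployment. The paper offers practical guidance for executives, policymakers, and technical leaders and notes that long-term success depends not only on advances in AI capability but equally on safe and trustworthy deployment.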

Abstract

Embodied artificial intelligence is rapidly moving from research into real-world systems such as autonomous vehicles, mobile robots, and industrial machines. As these systems become more capable of perceiving, deciding, and acting in dynamic environments, they also introduce new challenges in safety, trust, governance, and operational reliability. This white paper summarizes key insights from the SAE World Congress 2026 panel session "Embodied AI in Action", which brought together experts from automotive, robotics, artificial intelligence, and safety engineering. The discussion highlighted the need to treat embodied AI as a systems challenge requiring engineering rigor, lifecycle governance, human-centered design, and evolving standards. The paper provides practical perspectives for executives, policymakers, and technical leaders seeking to adopt embodied AI responsibly. The panel reached broad agreement that long-term success will depend not only on advances in AI capability, but equally on safe and trustworthy deployment.

2605.10651 2026-05-12 cs.LG cs.AI stat.ML

A Recursive Decomposition Framework for Causal Structure Learning in the Presence of Latent Variables

Zheng Li, Feng Xie, Shenglan Nie, Xichen Guo, Ruxin Wang, Hao Zhang

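AI summary: This paper proposes DiCoLa, a recursive decomposition framework for causal structure learning in the presence of latent variables. The method recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global causal structure. The framework is proven sound and complete, and experiments on synthetic data show that it significantly improves the computational efficiency of a range of causal discovery algorithms, with a real-world dataset illustrating its practical effectiveness.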

Abstract

Constraint-based causal discovery is widely used for learning causal structures, but heavy reliance on conditional independence (CI) testing makes it computationally expensive in high-dimensional settings. To mitigate this limitation, many divide-and-conquer frameworks have been proposed, but most assume causal sufficiency, i.e., no latent variables. In this paper, we show that divide-and-conquer strategies can be theoretically generalized beyond causal sufficiency to settings with latent variables. Specifically, we propose a recursive decomposition framework, termed DiCoLa, that enables divide-and-conquer causal discovery in the presence of latent variables. It recursively decomposes the global learning task into smaller subproblems and integrates their solutions through a principled reconstruction step to recover the global structure. We theoretically establish the soundness and completeness of the proposed framework. Extensive experiments on synthetic data demonstrate that our approach significantly improves computational efficiency across a range of causal discovery algorithms, while experiments on a real-world dataset further illustrate its practical effectiveness.

2605.10650 2026-05-12 cs.LG cond-mat.dis-nn

A Random-Matrix Criterion for Initializing Gated Recurrent Neural Networks

Tommaso Fioratti, Riccardo Marcaccioli, Francesco Casola

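AI summary: This paper studies how weight initialization affects gated recurrent neural networks and proposes an initialization criterion based on random matrix theory. The criterion estimates the critical weight gain at which the model sits at its critical point, and this estimate closely tracks the gain at which a gated-RNN reservoir achieves peak performance on a chaotic forecasting task. The authors argue that the criterion can serve as a design principle for future initialization schemes.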

Comments 10 pages, 5 figures, 2 appendices

Abstract

Proper weight initialization prior to training has historically been one of the key factors that helped kick off the deep learning revolution. Initialization is even more crucial in "reservoir computing", where the weights of a readout layer are learned linearly while the reservoir weights are fixed and largely determine the richness, stability and memory of the resulting dynamics. In the infinite-width limit it has been shown that meaningful initializations are those sitting at an effective critical point of the randomly initialized model. The phase transition is controlled by the weight variance $g^2$ and separates an ordered phase from a chaotic one where information progressively degrades. Here we derive a simple criterion to estimate the critical $g_c$ for a broad class of recurrent architectures and we show that it closely tracks the gain at which a gated-RNN reservoir achieves peak performance on a chaotic forecasting task. Finally, we argue that our criterion can serve as a design principle for future initialization schemes.

2605.10647 2026-05-12 cs.AI cs.CR

diffGHOST: Diffusion based Generative Hedged Oblivious Synthetic Trajectories

Florent Guépin, Cheick Tidiani Cisse, Denis Renaud, François Bidet, Arnaud Legendre

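AI summary: Trajectory data are increasingly valuable across many applications, which makes using them while protecting privacy a key problem. This paper proposes diffGHOST, a conditional diffusion model based on latent space segmentation, designed to generate synthetic trajectories that remain useful while keeping privacy risk under control. The method identifies and mitigates memorization of critical samples to strengthen the privacy of the generated trajectories.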

Abstract

Trajectories are valuable information for a wide range of applications. However, they are also inherently sensitive, as they contain highly personal information about individuals. To address this challenge, synthesizing mobility trajectories has emerged as a promising way to leverage mobility information while preserving privacy. State-of-the-art models often rely on the false assumption that generative models are implicitly private, and they fail to provide privacy guarantees while preserving trajectory utility. Here, we introduce diffGHOST, a conditional diffusion model based on latent space segmentation, designed to address this challenge. This paper proposes a methodology that identifies and mitigates memorization of critical samples using conditioning segments of a learned latent space.

2605.10645 2026-05-12 cs.CV

GenMed: A Pairwise Generative Reformulation of Medical Diagnostic Tasks

Hantao Zhang, Weidong Guo, Yuhe Liu, Jiancheng Yang, Sathvik Bhagavan, Danli Shi, Mingda Xu, Pascal Fua

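AI summary: This paper proposes GenMed, a generative framework for medical diagnosis that models the joint distribution $P(X,Y)$ of inputs and outputs and recasts diagnostic tasks as output optimization at inference time. Built on diffusion models, the method provides flexible gradient-based guidance for diverse input conditions without architectural changes or retraining, supporting medical image segmentation in cross-modality, few-shot, and zero-shot settings. Experiments show strong performance across a range of medical imaging tasks, and the authors release a large-scale text-shape dataset to support the evaluations.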

Abstract

Data-driven medical AI is traditionally formulated as a discriminative mapping from input $X$ to output $Y$ via a learned function $f$, which does not generalize well across heterogeneous data and modalities encountered in real-world clinical settings. In this work, we propose a fundamentally different, generative paradigm. We model the joint distribution $P(X,Y)$ using diffusion models and reframe inference as a test-time output optimization problem. By guiding the generative process to match observed inputs, our framework enables flexible, gradient-based conditioning at inference time without architectural changes or retraining, effectively supporting arbitrary and previously unseen combinations of observations. Extensive experiments demonstrate strong performance across standard and cross-modality medical image segmentation, few-shot segmentation with only 2 or 4 training samples, degraded-input segmentation, shape completion from sparse and partial observations, and zero-shot application to demonstrate generality. To support these evaluations, we curated and released a large-scale text-shape dataset derived from MedShapeNet. Our results highlight the versatility of generative joint modeling as a foundation for reusable, task-agnostic medical AI systems.

2605.10643 2026-05-12 cs.CL cs.LG

A Single-Layer Model Can Do Language Modeling

Zanmin Wang

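AI summary: This paper studies how far a single-layer architecture can go on language modeling and proposes Grounded Prediction Networks (GPN), a recurrent model that processes information through one shared state vector and a single recurrent block. At 130M parameters, a 1-layer GPN comes within 13% to 18% of the perplexity of multi-layer baselines without matching them. Inspecting the state vector reveals a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.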

Comments 9 pages, 5 figures, 1 table. Code: https://github.com/steve-z-wang/grounded-prediction-network

Abstract

Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.
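The quoted gaps can be reproduced from the perplexities in the abstract if "within 13%" is read as the relative excess of the 1-layer perplexity over each baseline. That reading is an assumption, though the rounded numbers are consistent with it; `relative_gap` is a hypothetical helper.

```python
def relative_gap(perplexity, baseline):
    """Relative excess of `perplexity` over `baseline`, as a fraction."""
    return perplexity / baseline - 1.0

gpn_1_layer = 18.06     # 1-layer GPN+M, FineWeb-Edu (from the abstract)
transformer_pp = 16.05  # 12-layer Transformer++ (from the abstract)
gdn = 15.34             # 10-layer GDN (from the abstract)

gap_vs_transformer = 100 * relative_gap(gpn_1_layer, transformer_pp)
gap_vs_gdn = 100 * relative_gap(gpn_1_layer, gdn)
```

Both gaps round to the 13% and 18% stated in the abstract. The 2-layer perplexity itself is not given there, so it is not reconstructed here.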

2605.10642 2026-05-12 cs.LG cond-mat.stat-mech

Composing diffusion priors with explicit physical context via generative Gibbs sampling

Weizhou Wang, Jonathan Weare, Aaron R. Dinner

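AI summary: This paper proposes GG-PA, a training-free framework for combining pretrained diffusion models with explicit physical context in scientific sampling. The method treats the composition of learned partial priors and physical context as inference over a joint target distribution in an augmented state space and derives a Gibbs sampler that is asymptotically exact as the diffusion time approaches zero. Experiments show that GG-PA recovers context-induced distribution shifts and collective behavior in interacting systems using partial priors without retraining, demonstrating a practical way to combine generative models with physical knowledge.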

Comments 31 pages, 11 figures

Abstract

Pretrained diffusion models provide powerful learned priors, but in scientific sampling the target distribution often depends on physical context that is not fully represented by one generative model. We introduce Generative Gibbs for Physics-Aware Sampling (GG-PA), a training-free framework that formulates the composition of learned partial priors and explicit physical context as inference over a joint target distribution in an augmented state space. We derive a Gibbs sampler for this joint target, show that it is asymptotically exact as the diffusion time approaches zero, and prove that in settings with quadratic interactions it remains exact at finite diffusion times. We further introduce replica exchange over diffusion time to accelerate mixing. Experiments on a double-well system, a $ϕ^4$ lattice model, and atomistic peptide systems show that GG-PA recovers context-induced distribution shifts and emergent collective behavior in interacting systems using partial priors without retraining. These results demonstrate GG-PA as a practical approach for combining pretrained generative priors with explicit physical context.

2605.10641 2026-05-12 cs.CV cs.AI

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

Nikolaos Gkalelis, Vasileios Mezaris

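AI summary: This paper proposes LLaVA-CKD, a bottom-up cascaded knowledge distillation framework that targets the heavy compute and memory requirements of deploying vision-language models (VLMs). It introduces Teacher models of intermediate capacity that guide the Student step by step, easing the drop in knowledge transfer that occurs when the capacity gap between Teacher and Student is too large. Experiments on seven standard visual question answering benchmarks show state-of-the-art performance.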

Comments Under review

Abstract

Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, a larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework to models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their state-of-the-art performance.
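The bottom-up ordering can be sketched with the classic temperature-scaled distillation loss. Two things here are assumptions: the abstract does not say which distillation objective LLaVA-CKD uses, so the KL-based loss below is the generic textbook form, and the teacher names and parameter counts are hypothetical placeholders.

```python
import math

def softmax(logits, temp):
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temp=2.0):
    """Temperature-scaled KL(teacher || student), the generic distillation loss."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return temp * temp * sum(pi * math.log(pi / qi)
                             for pi, qi in zip(p, q) if pi > 0)

def cascade_order(teachers):
    """Bottom-up cascade: distill from the smallest teacher first."""
    return sorted(teachers, key=lambda t: t["params"])

# Hypothetical teachers; sizes are placeholders, not the paper's models.
teachers = [{"name": "large", "params": 13e9}, {"name": "intermediate", "params": 7e9}]
schedule = [t["name"] for t in cascade_order(teachers)]
```

In a cascade of this kind the Student would first minimize the loss against the intermediate Teacher and only then against the larger one, so each stage faces a smaller capacity gap.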

2605.10640 2026-05-12 cs.CL cs.AI

Towards Understanding Continual Factual Knowledge Acquisition of Language Models: From Theory to Algorithm

Haoyu Wang, Yifan Shang, Zhongxiang Sun, Weijie Yu, Xiao Zhang, Jun Xu

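AI summary: This paper studies how language models acquire and retain factual knowledge during continual pre-training and presents a theoretical framework, based on a single-layer Transformer, that explains the training dynamics of continual Factual Knowledge Acquisition (cFKA). The analysis finds that regularization-based methods only change the convergence rate of parameters, whereas data replay methods alter the convergence dynamics and stabilize existing knowledge. Building on this, the authors propose STOC, a generative data replay method that selects factual snippets with high attention contribution to guide replay data generation; experiments show that it improves continual knowledge acquisition.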

Comments Accepted by ICML 2026

Abstract

Continual Pre-Training (CPT) is essential for enabling Language Models (LMs) to integrate new knowledge without erasing old knowledge. While classical CPT techniques like data replay have become the standard paradigm, the mechanisms underlying how LMs acquire and retain facts over time, termed continual Factual Knowledge Acquisition (cFKA), remain unclear. In this work, we present a theoretical framework that characterizes the training dynamics of cFKA using a single-layer Transformer, offering a unified explanation for the behavior of representative CPT methods. Our analysis reveals that regularization-based methods merely adjust the convergence rate of parameters without altering the inherent forgetting tendency, whereas data replay methods succeed in shifting convergence dynamics and stabilizing pretrained knowledge. Building on these insights, we propose a novel generative data replay approach, called Selecting Tokens via attentiOn Contribution (STOC), which identifies influential factual snippets to guide replay data generation. Extensive experiments on both synthetic and real-world datasets validate our findings and demonstrate that STOC effectively enhances cFKA by mitigating catastrophic forgetting.

2605.10639 2026-05-12 cs.AI

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

Regina Gugg, Selina Niederländer, Andreas Stöckl, Martin Flechl

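AI Summary As large language models (LLMs) are widely adopted in research and industry, deploying them safely has become a key challenge, yet existing toxicity benchmark evaluations suffer from systematic bias. This paper studies the robustness of commonly used evaluation setups and reveals intrinsic biases related to model choice, evaluation metrics, and task types. Experiments show that when the task shifts from text completion to summarization, benchmarks become markedly more likely to flag content as harmful, and some benchmarks also behave inconsistently when the input data domain changes. The study underlines the need for more comprehensive and robust safety evaluation frameworks.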

Comments 18 pages, 4 figures

Details
English abstract

The rapid adoption of LLMs in both research and industry highlights the challenges of deploying them safely and reveals a gap in the systematic evaluation of toxicity benchmarks. As organizations increasingly rely on these benchmarks to certify models for customer-facing applications and automated moderation, unrecognized evaluation biases could lead to the deployment of vulnerable or unsafe systems. This work investigates the robustness of established benchmarking setups and examines how to measure currently neglected intrinsic biases, such as those related to model choice, metrics, and task types. Our experiments uncover significant discrepancies in benchmark behaviors when evaluation setups are altered. Specifically, shifting the task from text completion to summarization increases the tendency of benchmarks to flag content as harmful. Additionally, certain benchmarks fail to maintain consistent behavior when the input data domain is changed. Furthermore, we observe model-specific instabilities, demonstrating a clear need for more robust and comprehensive safety evaluation frameworks.

2605.10634 2026-05-12 cs.AI

Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies

Minyu Chen, Song Qin, Ling-I Wu, Jianxin Xue, Guoqiang Li

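AI Summary This study proposes a "teacher-aware" evolutionary framework for evolving heuristic programs from learned optimization policies. Unlike earlier methods that rely on final performance metrics, it uses independently trained optimization policies as behavioral teachers: it queries their action preferences on the states visited by candidate heuristic programs and uses them as local feedback to guide evolution. Experiments show that the method outperforms purely performance-driven LLM heuristic evolution on scheduling, routing, and graph optimization tasks, and requires no neural inference at deployment.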

Comments 15 pages

Details
English abstract

LLM-based automatic heuristic design has shown promise for generating executable heuristics for combinatorial optimization, but existing methods mainly rely on delayed endpoint performance. We propose a teacher-aware evolutionary framework that uses independently trained learned optimization policies as behavioral teachers. Instead of deploying or imitating the teacher, our method queries it on states visited by candidate heuristic programs and uses its action preferences as local feedback for evolution. The resulting search discovers static executable heuristics guided by both task performance and teacher-derived behavioral signals. Experiments on scheduling, routing, and graph optimization benchmarks show that our method improves over performance-driven LLM heuristic evolution baselines while requiring no neural inference at deployment. These results suggest that learned optimization policies can be repurposed as behavioral feedback sources for automatic heuristic discovery.

2605.10633 2026-05-12 cs.CL cs.AI

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri

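AI Summary This study examines how the harmful behaviors that can arise when fine-tuning large language models (LLMs), known as emergent misalignment, relate to the models' internal semantic structure of personality. By mapping the latent personality space of the models, including the Big Five and the Dark Triad, it finds that the semantic geometry of personality is highly stable across aligned models and their fine-tuned variants. The study introduces a Semantic Valence Vector, among other concepts, and shows that these personality-related directions can act as intrinsic guardrails that effectively suppress the misalignment induced by fine-tuning, offering a new approach to cross-distribution model regulation.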

Comments 20 pages, 9 figures including appendix

Details
English abstract

Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure mode to less than 3%. Leveraging the structural stability of the personality space, we show that vectors extracted a priori from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.
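The abstract above speaks of ablating and amplifying a direction in activation space. It does not say how the intervention is implemented, so the following is only a minimal sketch of the common projection-based approach; the function names and the `alpha` parameter are hypothetical.

```python
import math

def unit(v):
    """Normalize a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def steer(hidden, direction, alpha):
    """Rescale the component of `hidden` along `direction` by `alpha`.

    alpha = 0 ablates the direction, alpha = 1 leaves `hidden` unchanged,
    and alpha > 1 amplifies the direction.
    """
    u = unit(direction)
    coeff = sum(h * d for h, d in zip(hidden, u))  # projection coefficient
    return [h + (alpha - 1.0) * coeff * d for h, d in zip(hidden, u)]

hidden = [3.0, 4.0]       # toy activation
direction = [2.0, 0.0]    # toy stand-in for a persona vector
ablated = steer(hidden, direction, alpha=0.0)
amplified = steer(hidden, direction, alpha=2.0)
```

With these toy values, ablation removes the first coordinate and leaves the orthogonal component untouched, while amplification doubles the first coordinate.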

2605.10629 2026-05-12 cs.CV

Product-of-Gaussian-Mixture Diffusion Models for Joint Nonlinear MRI Reconstruction

Laurenz Nagler, Martin Zach, Thomas Pock

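AI Summary This paper proposes a joint nonlinear magnetic resonance image reconstruction method based on product-of-Gaussian-mixture diffusion models, aiming to address the large networks, opaque time-conditioning mechanisms, and offline coil sensitivity estimation of existing methods. The method uses the parameter-efficient Gaussian-mixture diffusion model as an image prior and combines it with a classical smoothness prior on the coil sensitivities to reconstruct the image and the coil sensitivities jointly. It maintains reconstruction quality while improving robustness to contrast and anatomical distribution shifts and to different k-space trajectories.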

Details
English abstract

Recently, diffusion models have attracted considerable attention for magnetic resonance image reconstruction due to their high sample quality. However, most existing methods rely on large networks with opaque time-conditioning mechanisms, and require offline coil sensitivity estimation. This results in limited interpretability of the reconstruction process and reduced flexibility in the acquisition setup. To address these limitations, we jointly reconstruct the image and the coil sensitivities by combining the parameter-efficient product-of-Gaussian-mixture diffusion model as an image prior with a classical smoothness prior on the coil sensitivities. The proposed method is fast and robust to both contrast and anatomical distribution shifts as well as changing k-space trajectories. Finally, we propose a more expressive parameterization of the image prior which improves results in denoising and magnetic resonance image reconstruction.

2605.10628 2026-05-12 cs.CV

Hypergraph-Enhanced Training-Free and Language-Free Few-Shot Anomaly Detection

Guohuan Xie, Xin He, Dingying Fan, Siqi Li, Yun Liu

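AI Summary This paper proposes HyperFSAD, a few-shot anomaly detection framework that needs neither training nor language prompts and is robust across domains, addressing the dependence of existing methods on task-specific training, language supervision, and domain adaptation. Built on DINOv3 and a hypergraph inference mechanism, the method uses Sparse Hyper Matching and a Dual-Branch Image Scoring strategy to obtain a compact representation of normal samples and to localize anomalous regions precisely. Experiments show that, under the strict training-free and language-free setting, HyperFSAD achieves state-of-the-art detection performance on six datasets covering industrial and medical scenarios.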

Details
English abstract

Few-shot anomaly detection (FSAD) has made significant strides, yet existing methods still face critical challenges: (i) dependence on task- or dataset-specific training/fine-tuning, (ii) reliance on language supervision or carefully hand-crafted prompts, and (iii) limited robustness across domains. In this paper, we introduce HyperFSAD, a novel FSAD framework that is training-free, language-free, and robust across domains, offering a powerful solution to these challenges. Built upon DINOv3 and a hypergraph-based inference mechanism, our approach performs inference without any task-specific optimization or text prompts, while remaining competitive. Specifically, we replace sensitive nearest-neighbor / top-$n$ matching with Sparse Hyper Matching: sparsemax first selects the most relevant support patches, which are then aggregated into a hyperedge as compact normal evidence to suppress background noise and distractors. We further introduce Dual-Branch Image Scoring, which fuses spatial anomaly evidence from the patch-grid anomaly map with global semantic deviation captured by support-aware CLS matching, yielding a robust image-level anomaly score in a strictly visual manner. Notably, all components of HyperFSAD are purely visual, eliminating the need for labor-intensive hand-crafted text prompts. Under the stringent training-free and language-free setting, HyperFSAD achieves state-of-the-art performance across six datasets spanning four industrial datasets (MVTecAD, VisA, MPDD, BTAD) and two medical datasets (RESC, BraTS).
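The sparsemax-then-aggregate idea in this abstract can be sketched as follows. The `sparsemax` function is the standard simplex projection. Everything else is an assumption made for illustration and is not taken from the paper: the cosine similarity, the `scale` factor, the weighted-mean "hyperedge" prototype, and the `1 - cosine` score.

```python
import math

def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, z_k in enumerate(ordered, start=1):
        cumsum += z_k
        # The k-th largest score is in the support while this inequality holds.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(s - tau, 0.0) for s in scores]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hyperedge_score(query, support, scale=10.0):
    """Hypothetical patch score: distance to a sparsemax-weighted prototype."""
    sims = [cosine(query, s) for s in support]
    weights = sparsemax([scale * s for s in sims])  # many weights are exactly 0
    proto = [sum(w * s[d] for w, s in zip(weights, support))
             for d in range(len(query))]
    return 1.0 - cosine(query, proto), weights

query = [1.0, 0.0]
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy normal patch features
score, weights = hyperedge_score(query, support)
```

Unlike softmax, sparsemax assigns exactly zero weight to the dissimilar third support patch here, so that patch cannot contaminate the aggregated normal evidence.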

2605.10627 2026-05-12 cs.CL cs.AI

Interpretable Coreference Resolution Evaluation Using Explicit Semantics

Bruno Gatti, Giuliano Martinelli, Roberto Navigli

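AI Summary This paper proposes an interpretable evaluation framework for coreference resolution based on explicit semantics, addressing the limited diagnostic value of traditional statistical metrics such as CoNLL-F1. The approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigns semantic labels to nominal mentions, and propagates them to entire clusters, so that typed evaluation scores can be computed per semantic class. Experiments show that the method reveals systematic weaknesses that traditional metrics fail to expose, and that it can be used to design targeted data augmentation strategies that improve out-of-domain performance.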

Comments Accepted at main conference for ACL 2026. 19 pages

Details
English abstract

Coreference resolution is typically evaluated using aggregate statistical metrics such as CoNLL-F1, which measure structural overlap between predicted and gold clusters. While widely used, these metrics offer limited diagnostic insights, penalizing errors without revealing whether a system struggles with specific semantic categories, such as people, locations, or events, and making it difficult to interpret model capabilities or derive actionable improvements. We address this gap by introducing a semantically-enhanced evaluation framework for coreference resolution. Our approach overlays Concept and Named Entity Recognition (CNER) onto coreference outputs, assigning semantic labels to nominal mentions and propagating them to entire coreference clusters. This enables the computation of typed scores aimed at evaluating mention extraction and linking capabilities stratified by semantic class. Across our experiments on OntoNotes, LitBank, and PreCo, we show that our framework uncovers systematic weaknesses that remain obscured by aggregate metrics. Furthermore, we demonstrate that these diagnostics can be used to design targeted, low-cost data augmentation strategies, achieving measurable out-of-domain improvements.

2605.10624 2026-05-12 cs.AI cs.LG

Hierarchical Causal Abduction: A Foundation Framework for Explainable Model Predictive Control

Ramesh Arvind Naagarajan, Zühal Wagner, Stefan Streif

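AI Summary This paper proposes Hierarchical Causal Abduction (HCA), a foundation framework for explainable Model Predictive Control (MPC). The method combines domain knowledge graphs, optimization evidence from KKT multipliers, and temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for the control actions of nonlinear MPC. Experiments show that HCA substantially improves explanation accuracy across several control applications, and that the approach generalizes across domains to other prediction-based decision systems.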

Details
English abstract

Model Predictive Control (MPC) is widely used to operate safety-critical infrastructure by predicting future trajectories and optimizing control actions. However, nonlinear dynamics, hard safety constraints, and numerical optimization often render individual control moves opaque to human operators, undermining trust and hindering deployment. This paper presents Hierarchical Causal Abduction (HCA), which combines (i) physics-informed reasoning via domain knowledge graphs, (ii) optimization evidence from Karush-Kuhn-Tucker (KKT) multipliers, and (iii) temporal causal discovery via the PCMCI algorithm to generate faithful, human-interpretable explanations for control actions computed by nonlinear MPC. Across three diverse control applications (greenhouse climate, building HVAC, chemical process engineering) with expert validation, HCA improves explanation accuracy by 53% over LIME (0.478 vs. 0.311) using a single set of cross-domain parameters without per-domain tuning; domain-specific KKT-threshold calibration over 2-3 days further increases accuracy to 0.88. Ablation studies confirm that each evidence source is essential, with 32-37% accuracy degradation when any component is removed, and HCA's ranking-and-validation methodology generalizes beyond MPC to other prediction-based decision systems, including learning-based control and trajectory planning.

2605.10621 2026-05-12 cs.LG cs.SY eess.SY

Hierarchical End-to-End Taylor Bounds for Complete Neural Network Verification

Taha Entesari, Mahyar Fazlyab

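AI Summary This paper studies reachability analysis of neural networks, which aims to compute or bound the set of outputs a network can produce over a given input domain in order to certify the safety and robustness of learning-enabled physical systems. Existing methods mostly rely on tractable approximations that use at most second-order information. The paper proposes HiTaB, a new verification framework that systematically brings in higher-order smoothness information through the Hessian and its Lipschitz constant, builds a unified hierarchy of zeroth-, first-, and second-order bounds, and introduces an efficient layerwise curvature propagation algorithm for bounding the Hessian Lipschitz constant in deep networks, yielding tighter and more reliable safety certificates.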

Details
English abstract

Reachability analysis of neural networks, which seeks to compute or bound the set of outputs attainable over a given input domain, is central to certifying safety and robustness in learning-enabled physical systems. Since exact reachable set computation is generally intractable, existing methods typically rely on tractable overapproximations. Examining the state of the art for smooth, twice-differentiable networks, we observe that existing approaches exploit at most second-order information and do not systematically leverage higher-order information. In this work, we introduce HiTaB, a novel verification framework that exploits second-order smoothness through both the Hessian, $\nabla^2 f$, and its Lipschitz constant, $L_{\nabla^2 f}$. We further develop a unified hierarchy of zeroth-, first-, and second-order bounds, together with precise conditions under which higher-order approximations yield provable improvements. Our main technical contribution is a compositional procedure for efficiently bounding $L_{\nabla^2 f}$ in deep neural networks via layerwise propagation of curvature bounds. We extend the framework to both $\ell_2$- and $\ell_\infty$-constrained input sets and show how it can be integrated into branch-and-bound verification pipelines. To our knowledge, this is the first practical reachability analysis framework for smooth neural networks that systematically exploits Lipschitz continuity of curvature, leading to tighter and more informative safety certificates.
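For orientation, the textbook Taylor remainder bounds behind a zeroth-, first-, and second-order hierarchy are written out below. The abstract does not give the exact form HiTaB uses, so this is a sketch under stated assumptions: $f$ is scalar-valued and sufficiently smooth on a convex set containing $x$ and $x_0$, and $L_f$ and $L_{\nabla f}$ (symbols not used in the abstract) denote Lipschitz constants of $f$ and its gradient in the $\ell_2$ norm.

```latex
% Assumed setting: scalar f, convex domain, Lipschitz constants in the l2 (operator) norm.
\begin{align*}
\text{zeroth order:}\quad
  & |f(x) - f(x_0)| \le L_f \,\|x - x_0\|_2 \\
\text{first order:}\quad
  & \bigl|f(x) - f(x_0) - \nabla f(x_0)^\top (x - x_0)\bigr|
    \le \tfrac{L_{\nabla f}}{2}\,\|x - x_0\|_2^2 \\
\text{second order:}\quad
  & \bigl|f(x) - f(x_0) - \nabla f(x_0)^\top (x - x_0)
    - \tfrac{1}{2}(x - x_0)^\top \nabla^2 f(x_0)\,(x - x_0)\bigr|
    \le \tfrac{L_{\nabla^2 f}}{6}\,\|x - x_0\|_2^3
\end{align*}
```

The remainder shrinks as $r$, $r^2$, and $r^3$ in the input radius $r = \|x - x_0\|_2$, so the higher-order bound is tighter for small enough $r$. Which bound wins at a given radius depends on the relative sizes of the three constants.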

2605.10616 2026-05-12 cs.LG cs.CL cs.CV

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

Alan Arazi, Eilam Shapira, Shoham Grunblat, Mor Ventura, Elad Hoffer, Gioia Blayer, David Holzmüller, Lennart Purucker, Gaël Varoquaux, Frank Hutter, Roi Reichart

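AI Summary This paper introduces MulTaBench, a multimodal tabular learning benchmark of 40 datasets covering image-tabular and text-tabular tasks, designed to evaluate models that combine structured data with unstructured modalities such as text and images. The study finds that tuning the embeddings to the task significantly improves performance, whereas existing benchmarks often overlook task relevance, which leads to high variance in results. By emphasizing complementary information across modalities, MulTaBench advances target-aware representation learning and points to new research directions for building multimodal tabular foundation models.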

Details
English abstract

Tabular Foundation Models have recently established the state of the art in supervised tabular learning, by leveraging pretraining to learn generalizable representations of numerical and categorical structured data. However, they lack native support for unstructured modalities such as text and image, and rely on frozen, pretrained embeddings to process them. On established Multimodal Tabular Learning benchmarks, we show that tuning the embeddings to the task improves performance. Existing benchmarks, however, often focus on the mere co-occurrence of modalities; this leads to high variance across datasets and masks the benefits of task-specific tuning. To address this gap, we introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal, and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. Our experimental results demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, spanning high-impact domains such as healthcare and e-commerce. It is designed to enable the research of novel architectures which incorporate joint modeling and target-aware representations, paving the way for the development of novel Multimodal Tabular Foundation Models.

2605.10615 2026-05-12 cs.CL

Responsible Benchmarking of Fairness for Automatic Speech Recognition

Felix Herron, Ange Richard, François Portet, Alexandre Allauzen, Solange Rossato

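AI Summary This paper examines the fairness of automatic speech recognition (ASR) systems across speaker groups and points out that current studies assess fairness inconsistently, which can bias their conclusions. Drawing on literature from machine learning fairness, social sciences, and speech science, the authors propose more reliable practices for fairness benchmarking, stressing that the fairness hypothesis under evaluation should be stated explicitly and that metrics should be chosen to match it. The study finds that evaluating on single heterogeneous groups can mask which groups are actually subject to bias, and therefore argues for a fine-grained intersectional analysis of the demographic variables in the data in order to expose spurious correlations.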

Details
Journal ref
SPEAKABLE, colocated with LREC 2026
English abstract

Many studies have shown that automatic speech recognition (ASR) systems perform unequally across speaker groups (SGs). However, the manner in which such studies arrive at this conclusion is inconsistent. To pave the way for more reliable results in future studies, we lay out best practices for benchmarking ASR fairness based on literature from machine learning fairness, social sciences, and speech science. We first describe the importance of precisely stating the fairness hypothesis being interrogated, and of tailoring fairness metrics to apply specifically to said hypothesis. We then examine several benchmarks used to rate ASR systems on fairness and discuss how their results can be misconstrued without assiduous oversight into the intersections between SGs. We find that evaluating fairness based on single heterogeneous SGs, as they are defined in fairness benchmarks, can lead to misidentifying which SGs are actually being mistreated by ASR systems. We advocate for as fine-grained an analysis as possible of the intersectionality of as many demographic variables as are available in the metadata of fairness corpora in order to tease out such spurious correlations.