arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05165 2026-06-04 cs.LG cs.CL

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Rishit Dagli, Abir Harrasse, Luke Zhang, Florent Draye, Amirali Abdullah, Bernhard Schölkopf, Zhijing Jin

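AI summary: Proposes STRIDE, a framework that casts training data attribution as a sparse recovery problem in the spirit of compressive sensing. Lightweight "steering operators" in activation space mimic the effect of training on data subsets, giving efficient and accurate attribution for LLM pre-training.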

Comments
project page: https://stride-tda.github.io/
English abstract

Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude ($13\times$) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.
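
The sparse-recovery idea in the abstract can be illustrated with a toy problem. This is a minimal sketch, not the STRIDE implementation: the abstract does not say which solver is used, so orthogonal matching pursuit (OMP) is swapped in here, and the subset design, the measurements, and the function names `least_squares` and `omp` are all hypothetical. Each row of `A` marks which training examples belong to one subset, and each entry of `y` stands for the measured effect of that subset on a test prediction.

```python
import math

def least_squares(cols, y):
    """Least-squares weights for the chosen columns (normal equations,
    Gauss-Jordan elimination). Assumes the columns are linearly independent."""
    k = len(cols)
    aug = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
           + [sum(a * b for a, b in zip(cols[i], y))] for i in range(k)]
    for p in range(k):
        pivot = max(range(p, k), key=lambda r: abs(aug[r][p]))
        aug[p], aug[pivot] = aug[pivot], aug[p]
        scale = aug[p][p]
        aug[p] = [v / scale for v in aug[p]]
        for r in range(k):
            if r != p:
                factor = aug[r][p]
                aug[r] = [vr - factor * vp for vr, vp in zip(aug[r], aug[p])]
    return [aug[i][k] for i in range(k)]

def omp(A, y, sparsity):
    """Greedy sparse recovery: find x with at most `sparsity` nonzeros and A x ~ y."""
    m, n = len(A), len(A[0])
    columns = [[A[i][j] for i in range(m)] for j in range(n)]
    residual, support, weights = list(y), [], []
    for _ in range(sparsity):
        def score(j):
            norm = math.sqrt(sum(v * v for v in columns[j]))
            dot = sum(c * r for c, r in zip(columns[j], residual))
            return abs(dot) / norm if norm else 0.0
        # Pick the unused example whose membership pattern best explains the residual.
        best = max((j for j in range(n) if j not in support), key=score)
        support.append(best)
        weights = least_squares([columns[j] for j in support], y)
        residual = [y[i] - sum(w * columns[j][i] for w, j in zip(weights, support))
                    for i in range(m)]
    x = [0.0] * n
    for w, j in zip(weights, support):
        x[j] = w
    return x

# Hypothetical design: 4 subsets (rows) over 6 training examples (columns).
A = [[1, 1, 1, 0, 0, 0],
     [1, 0, 0, 1, 1, 0],
     [0, 1, 0, 1, 0, 1],
     [0, 0, 1, 0, 1, 1]]
true_influence = [2.0, 0.0, 0.0, 0.0, 0.0, 1.0]   # only two examples matter
y = [sum(a * x for a, x in zip(row, true_influence)) for row in A]
recovered = omp(A, y, sparsity=2)
```

With four subset measurements and six unknowns the linear system is underdetermined, so the per-example influences are only identifiable because the influence vector is assumed sparse.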

2606.05162 2026-06-04 cs.CV

Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text

Jaeyeong Kim, Ines Kim, Jahyeok Koo, Seungryong Kim

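AI summary: Proposes T2Mo, a feed-forward framework that generates controllable dynamic 3D shapes conditioned on 3D trajectories and text. A shape-grounded trajectory embedding handles arbitrarily configured trajectories, so motions follow the trajectories spatially while staying globally consistent with the text semantics.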

Comments
Project page: https://cvlab-kaist.github.io/T2Mo/
English abstract

We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.

2606.05161 2026-06-04 cs.SD cs.CL

Beyond Text Following: Repairable Arbitration Reversals in Audio-Language Models

Yichen Gao, Yiqun Zhang, Zijing Wang, Yujia Li, Heng Guo, Xi Wu, Xiaocui Yang, Shi Feng, Yifei Zhang, Daling Wang

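AI summary: Using same-audio counterfactual experiments, the paper finds that audio-language models in conflict tasks often let text dominate and ignore audio evidence. It proposes GACL, a training-free decoding rule that interpolates between joint and same-audio scores to repair arbitration reversals and markedly improve faithfulness.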

English abstract

Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
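
A decoding rule that interpolates between joint and same-audio scores could look roughly like the sketch below. The abstract does not define the gate or the interpolation weight, so both are assumptions here: the gate fires only when the two branches prefer different candidates (the sign flip described above), and `alpha` is a hypothetical fixed weight. The function name and the example scores are illustrative, not taken from the paper.

```python
def gated_interpolation(joint_scores, same_audio_scores, alpha=0.5):
    """Blend candidate scores from the joint branch (audio + conflicting text)
    and the same-audio branch (conflicting text removed).

    Assumed gate: correct only when the two branches disagree on the top candidate.
    """
    joint_choice = max(joint_scores, key=joint_scores.get)
    audio_choice = max(same_audio_scores, key=same_audio_scores.get)
    if joint_choice == audio_choice:
        return dict(joint_scores)          # no reversal detected, leave scores as-is
    return {cand: (1 - alpha) * joint_scores[cand] + alpha * same_audio_scores[cand]
            for cand in joint_scores}

# Hypothetical conflict sample: the audio contains a dog bark, the text claims "cat".
joint = {"dog": 1.0, "cat": 2.0}           # joint branch follows the text
same_audio = {"dog": 3.0, "cat": 0.5}      # same-audio branch follows the audio
corrected = gated_interpolation(joint, same_audio, alpha=0.5)
answer = max(corrected, key=corrected.get)
```

In this toy case the corrected scores favor the audio-supported answer, while a sample on which both branches already agree passes through unchanged.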

2606.05160 2026-06-04 cs.RO

GRAIL: Generating Humanoid Loco-Manipulation from 3D Assets and Video Priors

Tianyi Xie, Haotian Zhang, Jinhyung Park, Zi Wang, Bowen Wen, Jiefeng Li, Xueting Li, Qingwei Ben, Haoyang Weng, Yufei Ye, David Minor, Tingwu Wang, Chenfanfu Jiang, Sanja Fidler, Jan Kautz, Linxi Fan, Yuke Zhu, Zhengyi Luo, Umar Iqbal, Ye Yuan

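AI summary: Proposes GRAIL, a fully virtual generation pipeline that uses 3D assets and video foundation model priors to synthesize interaction demonstrations without physical setups or teleoperation, enabling sim-to-real transfer of humanoid loco-manipulation policies.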

Comments
Project page: https://research.nvidia.com/labs/dair/grail/
English abstract

Scaling humanoid loco-manipulation requires robot-compatible demonstrations across diverse objects, whole-body motions, and scene geometries, but teleoperation and motion capture are difficult to scale because each collection depends on physical setups, instrumented actors, and robot operation. We present GRAIL, a digital generation pipeline that remains fully virtual until deployment: it composes 3D assets, simulator-ready scenes, and priors from video foundation models (VFMs) to synthesize interactions without rebuilding physical environments or teleoperating the robot. Rather than reconstructing unconstrained in-the-wild videos, GRAIL starts from fully specified 3D configurations in which object geometry, camera parameters, metric scale, environment depth, and a robot-proportioned character are known before video generation and reused during reconstruction. This privileged setup better conditions 4D recovery, allowing model-based object tracking, human motion estimation, and interaction-aware optimization to reconstruct metric 4D human-object interaction (HOI) trajectories with reduced depth ambiguity and morphology mismatch. We retarget the recovered motions to a humanoid robot and train complementary task-general trackers: an object-aware latent adaptor for manipulation and a scene-aware tracker for terrain traversal. GRAIL produces over 20,000 sequences spanning pick-up, object manipulation, sitting, and terrain traversal. Using only GRAIL-generated data, we train egocentric visual policies through a sim-to-real pipeline and deploy them on a Unitree G1 humanoid, achieving 84% real-world success on diverse object pick-up and 90% success on stair-climbing.

2606.05159 2026-06-04 cs.RO

X4Val: Learning Neural Surrogates for Variance-Reduced Policy Evaluation

Rachel Luo, Michael Watson, Apoorva Sharma, Heng Yang, Han Qi, Edward Schmerling, Sushant Veer, Boris Ivanovic, Marco Pavone

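AI summary: Proposes X4Val, a framework that embeds multi-domain data and learns a transferable predictor, then combines it with a control-variates estimator to reduce variance without paired samples. Variance drops by up to 38.4% on autonomous driving and robot manipulation tasks.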

English abstract

Rigorous evaluation of learning-based robotic systems is an essential prerequisite for deployment. However, real-world test data is expensive to gather; moreover, in a typical iterative development context, data gathered from the latest policy is necessarily limited in scale. This motivates evaluation methodologies that make use of heterogeneous data sources, including simulation, historical policy logs, and data collected from related platforms or environments. While such auxiliary data are abundant and inexpensive, they are generally not directly representative of real-world outcomes -- for example, performance in simulation may differ substantially from performance in the real world -- making their principled use for high-confidence performance estimation challenging. In this paper, we introduce X4Val, a general framework for variance-reduced real-world metric estimation in the presence of non-paired, multi-domain data. X4Val embeds samples from real and auxiliary domains into a shared representation space and learns a transferable predictor of real-world metrics; this learned predictor is then incorporated into a control-variates estimator, enabling variance reduction even when paired samples are unavailable. We provide theoretical analysis and empirical evaluations on autonomous driving and real-world robot manipulation tasks, domains across which X4Val achieves up to 38.4% variance reduction and demonstrates consistent improvements over strong baselines. These results show that non-paired, heterogeneous data can be leveraged to substantially improve the sample efficiency of rigorous robotic system validation.
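
The control-variates step can be sketched generically as follows. This is a textbook estimator under stated assumptions, not the X4Val method itself: it assumes a predictor has already been learned, that its mean over the auxiliary pool matches its mean over the real domain, and it estimates the coefficient from the real samples alone. The function name and all numbers are hypothetical.

```python
from statistics import mean

def control_variate_estimate(real_y, real_pred, aux_pred):
    """Estimate the real-world metric mean, using predictor outputs as a control variate.

    real_y    : metric values observed on the few real-world samples
    real_pred : predictor outputs on those same real-world samples
    aux_pred  : predictor outputs on a large, unpaired auxiliary pool
    """
    y_bar, g_bar = mean(real_y), mean(real_pred)
    var_g = sum((g - g_bar) ** 2 for g in real_pred)
    cov_yg = sum((y - y_bar) * (g - g_bar) for y, g in zip(real_y, real_pred))
    coeff = cov_yg / var_g if var_g > 0 else 0.0   # variance-minimizing slope
    # Shift the naive mean by how far the small-sample predictor mean
    # deviates from the better-estimated mean on the auxiliary pool.
    return y_bar - coeff * (g_bar - mean(aux_pred))

# Hypothetical data: 4 real rollouts (success = 1.0) and 6 auxiliary predictions.
real_y = [1.0, 0.0, 1.0, 0.0]
real_pred = [0.9, 0.1, 0.8, 0.2]
aux_pred = [0.7, 0.5, 0.6, 0.6, 0.8, 0.4]
estimate = control_variate_estimate(real_y, real_pred, aux_pred)
```

The correction term has zero mean under the matching-means assumption, so it leaves the estimate unbiased to first order while cancelling part of the sampling noise whenever the predictor correlates with the true metric.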

2606.05158 2026-06-04 cs.CL cs.AI cs.MA

Streaming Communication in Multi-Agent Reasoning

Zhen Yang, Xiaogang Xu, Wen Wang, Cong Chen, Xander Xu, Ying-Cong Chen

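AI summary: Proposes StreamMA, a streaming multi-agent reasoning system that lowers latency by streaming reasoning steps to downstream agents as they are generated and, unexpectedly, also improves effectiveness. It gives the first closed-form joint analysis of the stream, serial, and single protocols.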

Comments
project page: https://zhenyangcs.github.io/StreamMA-website/
English abstract

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
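
The latency claim can be made concrete with a textbook pipeline model. This is not the paper's closed-form analysis. The sketch assumes a chain topology, that every agent produces the same number of reasoning steps of equal duration, and that a downstream agent can start its k-th step as soon as the upstream agent's k-th step arrives. All function names are illustrative.

```python
def serial_latency(depth, steps, step_time):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return depth * steps * step_time

def stream_latency(depth, steps, step_time):
    """Streaming: agents overlap, so only the pipeline fill time adds to one agent's work."""
    return (depth + steps - 1) * step_time

def speedup(depth, steps):
    """Ratio of serial to streamed latency under the idealized model above."""
    return serial_latency(depth, steps, 1.0) / stream_latency(depth, steps, 1.0)

# Hypothetical chain of 4 agents, each emitting 5 reasoning steps of 1 second.
serial = serial_latency(4, 5, 1.0)    # 20.0 seconds
streamed = stream_latency(4, 5, 1.0)  # 8.0 seconds
ratio = speedup(4, 5)                 # 2.5
```

Under these assumptions the serial latency grows linearly with pipeline depth, whereas the streamed latency grows by only one step per added agent, and the speedup approaches the pipeline depth as the number of steps per agent grows.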

2606.05156 2026-06-04 cs.DM math.CO

Temporal Cliques Admit Linear Spanners

Julia Baligacs

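AI summary: Proves that every temporal clique on n vertices admits a spanner of size 7n and gives a polynomial-time construction, resolving the long-open question of a linear bound.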

English abstract

A temporal graph is a graph in which every edge carries a non-empty set of time labels, and it is temporally connected if for every two vertices $u$ and $v$, there exists a $u$-$v$-path with non-decreasing time labels. A spanner is a subset of its edges preserving temporal connectivity. Unlike static graphs, temporally connected graphs need not admit sparse spanners; nonetheless, minimizing spanner size is a central and widely studied problem. A particularly intriguing question is whether temporal cliques admit spanners of linear size. Despite considerable effort over the past years, the best known upper bound remained $O(n \log n)$. We finally resolve this question, proving that every temporal clique on $n$ vertices admits a spanner of size $7n$. Moreover, such a spanner can be computed in polynomial time.
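
The definitions above translate directly into a small checker. The sketch below only tests temporal connectivity and the spanner property; it does not implement the paper's $7n$ construction. It assumes undirected edges given as `(u, v, t)` triples with one entry per time label, and the function names are illustrative.

```python
import heapq

def reachable_from(source, n, edges):
    """Vertices reachable from `source` along paths with non-decreasing time labels."""
    adjacency = {v: [] for v in range(n)}
    for u, v, t in edges:
        adjacency[u].append((v, t))
        adjacency[v].append((u, t))
    arrival = {source: float("-inf")}      # earliest label of the last edge used
    heap = [(float("-inf"), source)]
    while heap:
        time, u = heapq.heappop(heap)
        if time > arrival[u]:
            continue                        # stale heap entry
        for v, t in adjacency[u]:
            # An edge is usable if its label is not earlier than our arrival time.
            if t >= time and t < arrival.get(v, float("inf")):
                arrival[v] = t
                heapq.heappush(heap, (t, v))
    return set(arrival)

def is_temporally_connected(n, edges):
    return all(len(reachable_from(s, n, edges)) == n for s in range(n))

def is_spanner(n, edges, kept):
    """A spanner is a subset of the edges that preserves temporal connectivity."""
    return set(kept) <= set(edges) and is_temporally_connected(n, kept)

# A temporal clique on 3 vertices with distinct labels.
triangle = [(0, 1, 1), (1, 2, 2), (0, 2, 3)]
full_ok = is_temporally_connected(3, triangle)
pruned_ok = is_spanner(3, triangle, [(0, 1, 1), (1, 2, 2)])
```

Dropping the edge labeled 3 breaks connectivity because vertex 2 can then reach vertex 1 only at time 2, after the edge to vertex 0 at time 1 has already passed. This is why a spanning tree, which suffices in static graphs, is not generally enough in the temporal setting.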

2606.05155 2026-06-04 gr-qc cs.NA math.NA

High-Order Summation-By-Parts Schemes for First-Order Hyperbolic Systems in Curvilinear Coordinates with Singularities

Stamatis Vretinaris, Erik Schnetter

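AI summary: To handle singularities in curvilinear coordinates such as spherical coordinates, the paper proposes high-order accurate, energy-stable finite difference operators with the summation-by-parts (SBP) property that place a grid point at the origin, and demonstrates their advantages by evolving the scalar wave equation.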

Comments
21 pages, 18 figures
English abstract

Formulating stable numerical methods for hyperbolic systems in curvilinear coordinates with singularities, e.g. spherical coordinates, is complicated by the presence of these singularities. We present a method for constructing high-order accurate, energy-stable finite difference operators satisfying the Summation-by-Parts (SBP) property on spherical domains, extending ideas presented by [C. Gundlach, J. M. Martín-García, and D. Garfinkle, CQG 30, 145003 (2013)]. We define discrete gradient and divergence operators that mirror the continuous integration-by-parts principle, even though there is a $1/r^p$ coordinate singularity present at the origin. We explicitly construct such operators up to order six. Our operators place a grid point directly on the origin. We also review how to construct stable SBP operators that straddle the origin. We analyze the accuracy and spectral radii of these operators, and we show example evolutions of the scalar wave equation to demonstrate the advantages of such operators.
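
The SBP property itself is easiest to see in the standard Cartesian case. The sketch below uses the classic second-order SBP first-derivative operator on a uniform grid, with boundary weights of one half in the discrete norm. It is not one of the spherical, origin-aware operators constructed in the paper, and the grid and test functions are arbitrary choices for illustration.

```python
def sbp_derivative(v, h):
    """Second-order SBP first derivative: central in the interior, one-sided at the ends."""
    n = len(v)
    d = [0.0] * n
    d[0] = (v[1] - v[0]) / h
    d[-1] = (v[-1] - v[-2]) / h
    for i in range(1, n - 1):
        d[i] = (v[i + 1] - v[i - 1]) / (2 * h)
    return d

def h_inner(u, v, h):
    """Discrete inner product with norm matrix H = h * diag(1/2, 1, ..., 1, 1/2)."""
    weights = [0.5] + [1.0] * (len(u) - 2) + [0.5]
    return h * sum(w * a * b for w, a, b in zip(weights, u, v))

# Arbitrary grid functions on [0, 1] with 11 points.
h = 0.1
x = [i * h for i in range(11)]
u = [xi * xi for xi in x]
v = [1.0 + 3.0 * xi for xi in x]

# Discrete integration by parts: <u, Dv>_H + <Du, v>_H = u_N v_N - u_0 v_0.
lhs = h_inner(u, sbp_derivative(v, h), h) + h_inner(sbp_derivative(u, h), v, h)
rhs = u[-1] * v[-1] - u[0] * v[0]
```

The identity holds for any pair of grid functions, not just smooth ones, which is what lets continuous energy estimates carry over to the discretization.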

2606.05150 2026-06-04 cs.NE cs.AI

Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization

Ammar Hoori, Yuichi Motai

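AI summary: To address the scalability of RBF neural network training on large datasets, the paper proposes multi-column RBF networks based on particle swarm optimization (PSO) and adaptive PSO (APSO), called MC-PSO and MC-APSO, which train multiple RBFNs in parallel and use subset specialization to improve accuracy and speed.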

Comments
15 pages, under review
English abstract

The radial basis function neural network (RBFN) trained with a gradient descent algorithm provides an effective fully connected structure in both shallow and deep networks. Error correction (ErrCor), a state-of-the-art gradient-based training method, selects optimal hidden units to improve accuracy. Alternatively, as a population-based algorithm, particle swarm optimization (PSO) uses the swarm experience to optimize RBFN parameters, offering global search and robustness to local minima. Adaptive PSO (APSO) has emerged as an improved variant of PSO. The APSO algorithm improves convergence speed by dynamically adjusting swarm parameters during optimization. Both ErrCor and PSO demonstrate improved results and competitive convergence. However, with large datasets, these methods face scalability challenges such as excessive kernel computations and large hidden layer structures. A recent multi-column RBFN approach (MCRN) improves ErrCor performance by deploying small RBFNs in a parallel system. Inspired by MCRN's success, we propose two novel approaches to improve PSO performance: the multi-column RBFN with PSO (MC-PSO) and the multi-column RBFN with APSO (MC-APSO). These methods introduce parallel RBFN structures trained using evolutionary swarm methods. Each RBFN is independently trained on a specific spatial subset of the dataset using either the PSO or the APSO algorithm. The resulting specialist RBFNs are tailored to their respective subsets. During testing, only the selected RBFNs, those in whose subsets the test instance's neighbors are located, contribute to the multi-column output. This specialization improves accuracy, while parallelism enhances speed. We evaluate the proposed methods on various benchmark datasets. MC-PSO and MC-APSO outperform ErrCor, PSO, APSO, and MCRN in terms of accuracy and recall. They also demonstrate faster training and testing times in most experiments.

2606.05149 2026-06-04 cs.CV cs.LG eess.IV

An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Gandhimathi Padmanaban, Fred Feng

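AI summary: Proposes a two-stage pipeline combining an RT-DETR detector with a fine-tuned ViT-Base/16 for six-category body-type classification, adds a confidence-based abstention mechanism, and reaches accuracies of 0.94 and 0.89 on in-distribution and out-of-distribution datasets, respectively.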

Comments
24 pages, 10 figures, venue TBD
English abstract

Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
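
The abstention rule described above can be sketched in a few lines. The six class names and the 0.60 threshold come from the abstract; everything else, including the function names and the example logits, is hypothetical and stands in for the output of the fine-tuned classifier rather than reproducing the released pipeline.

```python
import math

CLASSES = ["passenger car", "SUV", "pickup truck",
           "minivan", "large van", "commercial truck"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_abstention(logits, threshold=0.60):
    """Return the top class, or "unknown" when its softmax probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best] if probs[best] >= threshold else "unknown"

# Hypothetical Stage 2 logits for two detected vehicles.
confident = classify_with_abstention([0.0, 5.0, 0.0, 0.0, 0.0, 0.0])
ambiguous = classify_with_abstention([1.0, 1.0, 1.0, 1.2, 1.0, 1.0])
```

Returning an explicit "unknown" label keeps low-confidence crops out of downstream counts, which matches the abstract's observation that the minivan degradation under domain shift shows up as abstentions rather than as wrong labels.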

2606.05145 2026-06-04 cs.LG cs.AI cs.CL

Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar, Eilif B. Muller

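AI summary: Proposes identifying fixable failures from the distributional signature of failed reasoning traces rather than their text, and designs a training-free routing rule that improves test-time interventions.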

English abstract

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods ($84.3{\pm}4.3\%$ accuracy, $+20\%$ over a majority-class baseline), and support a training-free routing rule that lifts rescue by $+12.2\%$ on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.

2606.05143 2026-06-04 cs.RO

HORIZON: Recoverability-Governed Curriculum for Physical-Domain Scaling

Chenhao Bai, Liqin Lu, Kaijun Wang, Hui Chen, Jin-Chuan Shi, Yuyang Liu, Hao Chen, Chunhua Shen

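AI summary: To address learnability in physical-domain scaling of robot policies, the paper proposes HORIZON, a recoverability-governed frontier curriculum that expands physical domains step by step using rollback and boundary refinement. Experiments reveal three regularities of physical-domain expansion.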

Comments
16 pages, 9 figures
English abstract

Scaling robust robot policies requires more than broader randomization, because physical-domain experience must remain organized and learnable throughout training. We study when a policy can benefit from harder physics and identify recoverability as a central constraint in on-policy physical-domain scaling. In on-policy training, new dynamics are useful only insofar as they remain close enough to the current policy to generate corrective on-policy data, rather than collapsing rollouts into unrecoverable failures. Using quadruped locomotion as a physically demanding benchmark for embodied generalization, we introduce HORIZON, a checkpointed frontier curriculum that expands physical domains only within the current policy's recoverable boundary. HORIZON uses rollback and boundary refinement to govern each expansion step, turning fixed randomization into a continual process of physical-domain growth. Experiments reveal three regularities of physical-domain expansion. First, direct domain widening is uneven across physical axes and often unlearnable without staged ordering. Second, domain composition is non-monotonic, and adding more domains beyond a compact core can dilute recoverable joint samples and reduce overall robustness. Third, offline distillation of isolated experts cannot substitute for the joint interaction generated by on-policy curriculum. Together, these results frame physical-domain generalization as a continual growth problem for embodied control, with recoverability as the organizing principle for on-policy expansion.

2606.05142 2026-06-04 cs.CV cs.AI

GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes

Josef Bengtson, Yaroslava Lochman, Fredrik Kahl

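AI summary: Proposes GeM-NR, a fast, flexible, training-free method that achieves multi-view consistent general nonrigid image editing through depth map alignment, viewpoint projection, and conditional refinement, supporting substantial changes in geometry and appearance.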

Comments
Project page: https://gem-nr.github.io/
English abstract

Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.

2606.05139 2026-06-04 cs.LG

BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning

Luca Thale-Bombien, Jan Ewald, Ralf König, Aaron Klein

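AI summary: For omics data produced by high-throughput sequencing, the paper proposes BBOmix, the first open-source tabular benchmark for hyperparameter optimization of unsupervised representation learning, with 105,000 evaluations across four autoencoder architectures and seven multi-omics modalities.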

English abstract

The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sensitive to architectural choices and hyperparameters, and unsupervised optimization typically relies on reconstruction loss, which may be a poor proxy for downstream utility. Exhaustive hyperparameter optimization (HPO) is computationally expensive, leading researchers to frequently rely on suboptimal default configurations. To democratize access to large-scale unsupervised HPO research, we introduce BBOmix, the first open-source tabular benchmark for unsupervised representation learning on real-world biological data. Our benchmark includes 105,000 evaluations across four AE architectures and seven multi-omics modalities from the TCGA and SCHC datasets. We quantify the correlation between reconstruction loss and downstream task performance and provide an extensive evaluation of state-of-the-art single-fidelity, multi-fidelity, and transfer learning HPO methods, establishing a rigorous baseline for future research in unsupervised biological representation learning.

2606.05138 2026-06-04 cs.LG q-fin.ST

Generating Financial Time Series by Matching Random Convolutional Features

Konrad J. Mueller, Nikita Zozoulenko, Ben Wood, Thomas Cass, Lukas Gonon

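AI summary: Proposes SOCK (SOft Competing Kernels), a differentiable random convolutional feature map, and trains generators by matching random convolutional features of real and generated time series, outperforming signature and diffusion baselines on small-sample financial datasets.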

英文摘要

Generating realistic financial time series is challenging as training data is often limited to a single historical path. With such scarce data, overfitting is hard to avoid, especially under adversarial training where a trained discriminator can memorize the training samples. To mitigate this, recent approaches train generators to minimize the discrepancy between untrained feature representations of real and generated time series. In these works, the feature maps are based on path signatures, which can fail to capture relevant time series properties at tractable truncation depths. In this work, we instead train generators by matching random convolutional features of real and generated time series. Existing random convolutional feature maps, such as Rocket and Hydra, have been shown to provide informative representations of real-world time series, but cannot supervise generative models because they are non-differentiable. We introduce SOCK (SOft Competing Kernels), a fully differentiable random convolutional feature map, suited to train generative time series models. We show that generators trained by matching random SOCK features consistently outperform signature and diffusion baselines across a wide range of small-sample financial datasets. We further demonstrate SOCK's expressiveness on two-sample hypothesis testing and time series classification tasks, where SOCK matches or outperforms existing unsupervised feature maps.

2606.05134 2026-06-04 cs.CL cs.LG

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

基于激活的主动学习用于上下文学习:挑战与见解

Yaseen M. Osman, Geoff V. Merrett, Stuart E. Middleton

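AI summary: Studies MLP-activation-based deep active learning methods applied to in-context learning and finds that activation signals correlate only weakly with example quality or task performance, indicating that such methods are not suitable for in-context learning.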

详情
Comments
9 pages, 3 figures
AI中文摘要

深度主动学习此前已被探索用于大语言模型的上下文样本选择,但未利用对Transformer激活理解的最新进展。在本文中,我们测试了模型激活能否提供细粒度信号以优化上下文示例选择的假设。我们提出了迄今为止最全面的基于MLP激活的深度主动学习方法应用于上下文学习的分析,包括不同注意力掩码策略如何影响跨多样分类和生成数据集的主动学习,使用了Llama-3.2-3B和Qwen2.5-3B基础模型。然而,我们得到了负面结果:通过大规模激活或前四阶矩视角观察的MLP输出,与示例质量或任务性能不相关。具体来说,对于所有测试的任务和模型,绝对Spearman相关系数至多为0.33,表明此类基于激活的采样不应用于上下文学习。我们假设这可能是由于叠加现象,即模型表示的特征数量超过其维度,表明稀疏自编码器等方法可能是未来有前景的方向。

英文摘要

Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.
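The headline statistic in this abstract is a rank correlation. As a minimal sketch of how such a number is computed (this is not the authors' evaluation code; the function names are hypothetical), Spearman's coefficient is the Pearson correlation of the two rank vectors, with tied values given their average rank:

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

In the setting the abstract describes, `x` would hold one activation statistic per candidate example and `y` the corresponding task score; an absolute value of at most 0.33 means the ranking by activations says little about the ranking by performance.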

2606.05131 2026-06-04 cs.LG cs.NA math.DS math.NA math.OC math.SP

Deep Embedded Multiplicative DMD for Algebra-Preserving Koopman Learning

深度嵌入乘法DMD用于保代数Koopman学习

Kelan Gray, Finlay Brown, Nicolas Boullé, Matthew J. Colbrook

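AI summary: Proposes DeepMDMD, which combines deep learning with multiplicative DMD and enforces the Koopman product rule as an algebraic constraint in a latent space, learning compact, dynamically coherent dictionaries that give stable forecasts and reduce spectral pollution.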

详情
Comments
26 pages, 11 figures
AI中文摘要

Koopman理论将非线性动力学转化为线性谱问题。然而,在计算中,一切都取决于一个困难的有限维选择:可观测量必须具有表现力,在动力学下几乎不变,并且理想情况下与复合运算兼容。深度Koopman方法学习灵活的坐标,而保结构方法在固定字典上强制执行算子恒等式。我们通过引入深度嵌入乘法动态模式分解(DeepMDMD)来结合这些思想,该方法学习潜空间及其划分,同时将Koopman乘积规则作为精确代数约束强制执行。训练在精确的乘法算子更新和可微的潜聚类步骤之间交替进行,后者促进Koopman封闭性。结果是在学习的潜细胞上得到一个有限转移映射。其非零谱位于单位圆上,其字典由动力学而非环境几何塑造,预测在潜坐标中进行,然后解码到物理空间。在哈密顿、混沌和流体示例中,DeepMDMD学习的字典比几何MDMD划分产生的字典更紧凑且动态一致。它减少了谱污染,揭示了更丰富的连续谱结构,并在严重噪声下提供稳定预测。在高维流中,包括158,624维圆柱尾流和噪声$Re=20,000$顶盖驱动空腔,它保持了相干结构和长时间谱统计,而状态空间MDMD则失败。这些结果提出了Koopman学习的实用规则:学习坐标,约束代数。

英文摘要

Koopman theory turns nonlinear dynamics into a linear spectral problem. In computation, however, everything depends on a hard finite-dimensional choice: the observables must be expressive, nearly invariant under the dynamics, and, ideally, compatible with composition. Deep Koopman methods learn flexible coordinates, whereas structure-preserving methods enforce operator identities on fixed dictionaries. We combine these ideas by introducing Deep Embedded Multiplicative Dynamic Mode Decomposition (DeepMDMD), a method that learns a latent space and a partition of it, while enforcing the Koopman product rule as an exact algebraic constraint. Training alternates between an exact multiplicative operator update and a differentiable latent-clustering step that promotes Koopman closure. The result is a finite transition map on learned latent cells. Its nonzero spectrum lies on the unit circle, its dictionary is shaped by the dynamics rather than by ambient geometry, and forecasts are made in latent coordinates before being decoded to physical space. Across Hamiltonian, chaotic, and fluid examples, DeepMDMD learns dictionaries that are far more compact and dynamically coherent than those produced by geometric MDMD partitions. It reduces spectral pollution, reveals richer continuous-spectrum structure, and gives stable forecasts under severe noise. In high-dimensional flows, including a 158,624-dimensional cylinder wake and a noisy $Re=20,000$ lid-driven cavity, it preserves coherent structures and long-time spectral statistics where state-space MDMD fails. These results suggest a practical rule for Koopman learning: learn the coordinates, constrain the algebra.

2606.05130 2026-06-04 cs.LG cs.AI

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

面向高效且基于证据的移动预测:基于LLM驱动的智能体

Linyao Chen, Qinlao Zhao, Zechen Li, Mingming Li, Likun Ni, Jinyu Chen, Yuhao Yao, Xuan Song, Noboru Koshizuka, Hiroki Kobayashi

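AI summary: Proposes AgentMob, a training-free LLM-driven agent framework that resolves ambiguous cases in mobility prediction through adaptive evidence gathering and achieves the strongest overall performance among training-free LLM-based methods on three datasets.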

详情
AI中文摘要

个体层面的移动预测是城市模拟、交通规划和政策分析的核心。监督序列模型实现了高精度,但需要任务特定训练且决策透明度有限。最近的基于LLM的方法提高了可解释性,但大多依赖静态提示和单次推理,限制了在移动信号弱或冲突时寻求额外证据的能力。我们提出AgentMob,一种无需训练的LLM驱动智能体框架,将下一位置预测建模为自适应证据控制的决策制定。AgentMob通过基于历史规律性的快速路径处理常规情况,而模糊情况则触发对近期轨迹、历史行为、停留-移动可能性和地理证据的迭代工具使用。在三个移动数据集上,AgentMob在无需训练的基于LLM的方法中实现了最强的整体性能,GPT-5.4在BW上达到71.42%的Acc@1,在YJMob100K上达到33.14%,在上海ISP上达到33.50%。在BW的非快速路径案例中,LLM控制器相比相同工具的统计基线将Acc@1从30.65%提高到48.62%,表明其主要优势在于通过自适应证据收集解决模糊预测。我们的代码可在https://github.com/Unknown-zoo/AgentMob获取。

英文摘要

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.

2606.05129 2026-06-04 cs.CR cs.LG

Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption

在全同态加密下学习因果结构时保护数据隐私

Jian Yang, Yuan Tong, Qinbin Li, Zeyi Wen, Xiaofang Zhou

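AI summary: To address privacy leakage in distributed causal structure learning, proposes a method based on fully homomorphic encryption that uses circuit simplification, approximations of division and logarithm, and SIMD batching to learn causal structure efficiently on encrypted data, and that can be extended to differential privacy.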

详情
AI中文摘要

保护数据隐私是结构数据管理和数据挖掘中的重要课题。然而,分布式因果结构学习中的隐私泄露问题是一个持续的挑战,特别是在需要数据传输和计算的情况下。在本文中,我们提出了一种基于全同态加密(FHE)的方法,该方法在密文上进行计算,保持数据在传输和计算过程中加密。然而,由于FHE计算成本高且对除法和对数运算的支持有限,将FHE应用于因果结构学习具有挑战性。为了应对这一挑战,我们提出了一系列新颖的技术,包括(i)电路简化以提高效率,(ii)通过牛顿-拉夫森倒数和泰勒展开近似除法和对数,以及(iii)使用SIMD加速的批处理技术来增强整个学习过程。此外,我们的方法可以轻松扩展到FHE之外,通过展示其可移植性来支持差分隐私。实验结果表明,我们的方法在测试的数据集上实现了与明文版本高度一致且可比的因果结构。最后,即使在FHE的隐私保护下,我们的方法也能在几十分钟内高效且实际地完成因果结构学习。

英文摘要

Preserving data privacy is an important topic in structural data management and data mining. However, privacy leakage in distributed causal structure learning is a persistent challenge, especially when data transmission and computation are required. In this paper, we propose a method based on fully homomorphic encryption (FHE) that performs calculations on ciphertexts, keeping data encrypted both in transit and during computation. Nevertheless, applying FHE to causal structure learning is challenging because of the high computation cost and the limited support for division and logarithm operations in FHE. To tackle this challenge, we propose a series of novel techniques including (i) circuit simplification for better efficiency, (ii) approximation of division and logarithm through the Newton-Raphson reciprocal iteration and Taylor expansion, and (iii) a batching technique with SIMD acceleration to speed up the whole learning process. Additionally, we demonstrate the method's portability by extending it beyond FHE to support differential privacy. Empirical results show that, on the datasets tested, our method recovers causal structures that are highly consistent with, and comparable to, those of the plaintext version. Finally, our method is efficient and practical: it completes causal structure learning in tens of minutes even under the privacy protection of FHE.
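Division and logarithm are hard under FHE because arithmetic FHE schemes natively evaluate only additions and multiplications. The sketch below shows, on ordinary plaintext floats, how the two approximations named in the abstract can be written with those operations alone. It is an illustration under stated assumptions, not the paper's circuit: the function names, iteration counts, starting guess `x0`, and the choice of the series for ln(1+u) are all hypothetical.

```python
import math

def nr_reciprocal(a, x0, iters=6):
    """Approximate 1/a with Newton-Raphson: x <- x * (2 - a * x).

    Uses only multiplication and subtraction. Converges when 0 < a * x0 < 2;
    the error 1 - a*x is squared at every step.
    """
    x = x0
    for _ in range(iters):
        x = x * (2.0 - a * x)
    return x

def taylor_log(x, terms=20):
    """Approximate ln(x) by the series ln(1+u) = u - u^2/2 + u^3/3 - ...

    Valid for |u| = |x - 1| < 1. The factors 1/k are plaintext constants,
    so no ciphertext division is needed.
    """
    u = x - 1.0
    power = u
    total = 0.0
    for k in range(1, terms + 1):
        coeff = (1.0 if k % 2 == 1 else -1.0) / k  # public constant
        total += coeff * power
        power *= u
    return total

def approx_divide(num, den, x0, iters=6):
    """Division expressed as multiplication by an approximate reciprocal."""
    return num * nr_reciprocal(den, x0, iters)
```

In an encrypted setting the inputs would have to be scaled into the convergence ranges noted in the docstrings beforehand, since the iteration cannot branch on encrypted values; how the paper handles that scaling is not stated in the abstract.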

2606.05126 2026-06-04 cs.CR

A-Live: Passive Liveness Detection via Neuromuscular Micro-Motion Signatures on Commodity Sensors

A-Live:基于商用传感器上神经肌肉微动特征的无源活体检测

Mohammed Gharib, Sam Burns, Martin Zizi

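AI summary: Proposes A-Live, a framework that uses inertial measurement unit (IMU) signals from commodity devices to perform passive liveness detection via neuromuscular micro-motion signatures, achieving over 99.5% accuracy on Android and iOS devices.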

详情
AI中文摘要

活体检测已从生物特征认证中针对呈现攻击和重放攻击的防护措施,演变为现代数字系统中区分人类用户与非人类代理的广泛需求。生成式和代理型AI的出现进一步放大了这一需求,将活体检测定位为基本安全原语。现有方法面临关键限制,包括依赖显式用户交互、专用硬件、易受日益逼真的欺骗攻击以及在实际部署中可扩展性有限。我们提出A-Live,一个仅依赖商用设备中惯性测量单元(IMU)信号的无源活体检测框架。A-Live基于以下观察:人类运动控制固有的神经肌肉微动会在惯性数据中产生微妙但可测量的特征,这些特征在先前工作中常被视为噪声。我们设计了轻量级特征提取流水线和适合实时设备端部署的紧凑分类器,并引入了可控的物理微动平台以评估对工程化非人类运动的鲁棒性。在Android和iOS设备上的广泛评估(包括自动和真实用户设置)表明,A-Live实现了超过99.5%的准确率,且具有低误接受率和误拒绝率。我们的结果表明,神经肌肉微动特征为新兴AI驱动威胁模型下的活体检测提供了可扩展且无源的基础。

英文摘要

Liveness detection has evolved from a safeguard against presentation and replay attacks in biometric authentication to a broader requirement for distinguishing human users from non-human agents in modern digital systems. The emergence of generative and agentic AI further amplifies this need, positioning liveness as a fundamental security primitive. Existing approaches face key limitations, including reliance on explicit user interaction, specialized hardware, vulnerability to increasingly realistic spoofing, and limited scalability in real-world deployments. We present A-Live, a passive liveness detection framework that operates solely on inertial measurement unit (IMU) signals available in commodity devices. A-Live is based on the observation that neuromuscular micro-motions inherent to human motor control produce subtle but measurable signatures in inertial data, which are often treated as noise in prior work. We design a lightweight feature extraction pipeline and a compact classifier suitable for real-time on-device deployment, and introduce a controllable physical micro-motion platform to evaluate robustness against engineered non-human motion. Extensive evaluation across Android and iOS devices, including both automated and real-user settings, shows that A-Live achieves over 99.5% accuracy with low false acceptance and rejection rates. Our results demonstrate that neuromuscular micro-motion signatures provide a scalable and passive foundation for liveness detection under emerging AI-driven threat models.

2606.05124 2026-06-04 cs.GR cs.CV cs.LG

Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting

几何高斯:在高斯泼溅中解耦外观与几何

Hongyu Zhou, Zorah Lähner

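AI summary: To resolve the conflict between geometric representation and appearance rendering in 3D Gaussian Splatting, adds a geometry opacity parameter to each splat together with a transparency-curated optimization pipeline, decoupling geometry from appearance and improving rendering and geometry performance, especially in complex scenes with transparent objects.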

详情
AI中文摘要

在3D高斯泼溅(3DGS)成功用于新视角合成后,许多工作探索了如何将其用于几何表面表示。然而,直接从3DGS中提取准确的几何信息仍然具有挑战性,且往往会降低外观渲染质量。在这项工作中,我们通过使用完整的地面真值纹理和几何信息进行训练,证明了默认形式的3DGS本质上不适合同时表示纹理和几何。我们还提出了一种简单的解决方案,即为每个溅射应用一个额外的几何不透明度参数,并配合可选的透明度策划优化流程。我们的实验,无论是使用地面真值还是视觉基础模型的几何输入,都表明这一改变在多种数据集上提高了渲染和几何性能,尤其是对于包含透明物体的复杂场景,我们的方法带来了显著提升。

英文摘要

After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and often reduces the appearance rendering quality. In this work, by training with complete ground-truth texture and geometry information, we show that 3DGS in its default form is inherently unsuited to representing texture and geometry at the same time. We also propose a simple solution: adding a single geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, with both ground-truth and vision-foundation-model geometric input, show that this change improves rendering and geometry performance on a wide variety of datasets; in particular, complex scenes with transparent objects benefit significantly from our method.

2606.05122 2026-06-04 cs.CL

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

自我评估已然存在:用最少数据激发基础LLM中的潜在评判校准

XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang

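AI summary: Proposes Self-Evaluation Elicitation (SEE), which uses a small amount of data (160 examples) with calibration-coupled reinforcement learning and masked distillation to elicit a base LLM's existing ability to predict an external judge's scores, improving held-out calibration while preserving answer quality.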

详情
AI中文摘要

大型语言模型越来越多地被其他模型评估,这引发了一个自然问题:模型能否预测评判者将如何对其自身输出进行评分?我们发现,这种能力在很大程度上已经存在于任何针对性训练之前:通过少量示例提示,基础模型已经能够预测外部评判者对开放式回答的多属性质量评分,在三个基准测试中显著高于随机水平。我们引入了自我评估激发(SEE)方法,该方法通过一个短周期来表面化这种潜在能力,该周期包括一个校准耦合的强化学习阶段(改进答案并预测评判者),随后是一个掩码蒸馏阶段(增强预测而不改变答案)。通过160个独特示例(比强化学习基线少约31倍),SEE在三个基准测试中改善了保留校准,同时保持了答案质量。激发的自我评估严格定位于模型自身的词元分布内,并且对于从未训练过的评判者保持稳定,这表明了一种可转移的质量概念,而非单一评判者的偏好。这些结果将评判者对齐的自我评估重新定义为激发问题而非获取问题。

英文摘要

Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.

2606.05121 2026-06-04 cs.SD cs.AI cs.CL cs.MM eess.AS

Audio Interaction Model

音频交互模型

Zhifei Xie, Zihang Liu, Ze An, Xiaobin Hu, Yue Liao, Ziyang Ma, Dongchao Yang, Mingbao Lin, Deheng Ye, Shuicheng Yan, Chunyan Miao

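AI summary: Proposes Audio-Interaction, a unified online large audio language model that achieves real-time audio interaction through an always-on perceive-decide-respond loop, and builds the StreamAudio-2M dataset and the Proactive-Sound-Bench benchmark; the model preserves performance on mainstream audio tasks while unlocking real-time ASR, streaming audio instruction following, and proactive help.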

详情
Comments
Next generation of LALMs, work in progress
AI中文摘要

音频本质上是一种交互式模态,然而当今的大型音频语言模型(LALM)是离线的,而流式音频模型每个只处理单一任务,如流式ASR或语音聊天。现在是时候将它们统一为一个在线LALM:一个通过始终在线的感知-决策-响应循环,实时收听声音、环境和指令并即时反应的模型。我们将这种机制形式化为音频交互模型,并通过Audio-Interaction实现,这是一个统一的流式模型,在保留离线任务执行的同时,增加了在线通用音频指令跟随能力,从对话到全语音聊天,根据流语义决定何时响应。为此,我们提出了SoundFlow框架,该框架通过流原生数据构建、理解感知训练和异步低延迟推理,端到端地实例化感知-决策-响应循环,实现稳定的实时交互。我们进一步构建了StreamAudio-2M,一个包含260万项流式语料库,涵盖7种基本能力和28个子任务,以及用于评估主动音频干预的Proactive-Sound-Bench。在8个基准测试中,Audio-Interaction在主流音频任务上保持有竞争力的性能,同时解锁了离线LALM无法实现的能力,包括实时ASR、流式音频指令跟随和主动帮助。

英文摘要

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.

2606.05118 2026-06-04 cs.CY

Does Artificial Intelligence Advance Science?

人工智能是否推动了科学进步?

Liangping Ding, Cornelia Lawson, Philip Shapira

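AI summary: Analyzes over one million scientific publications to study the relationship between AI adoption and scientific creativity (novelty and impact), finding that AI publications are significantly more likely than non-AI publications to reach the top creativity decile and that different AI research modes (tool-oriented versus adaptation-oriented) contribute through distinct creative pathways.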

详情
Comments
47 pages, 3 figures
AI中文摘要

本文考察了人工智能(AI)是否以及如何推动科学创造力。基于科学出版物(研究人员的主要产出),我们分析了来自OpenAlex的超过100万篇出版物,以研究AI采用与科学创造力多个维度之间的关系,包括新颖性(重组新颖性和对象新颖性)和影响力(3年短期引用影响和10年长期引用影响)。我们发现,相对于非AI出版物,AI出版物进入创造力前十百分位的可能性显著更高,高出5.5至10.2个百分点。关键的是,我们发现了不同AI研究模式之间的显著异质性。工具导向的AI研究(将现有AI模型应用于领域任务)与基于重组的创造力提升最大相关,而适应导向的AI研究(为特定领域问题修改AI模型)与相对较高的基于对象的创造力相关。这些发现表明,AI并非通过单一机制推动科学进步,而是通过结构上不同的创造性路径,这些路径取决于AI如何被纳入研究过程。我们的结果有助于当前关于AI在科学中作用的辩论,并对研究评估和科学政策具有直接意义,强调了需要能够区分重组和概念性创造力形式,并认识到不同AI采用模式如何产生根本不同类型的科学贡献的评估框架。

英文摘要

This paper examines whether and how artificial intelligence (AI) advances scientific creativity. Drawing on scientific publications, the primary output of researchers, we analyze over one million publications from OpenAlex to investigate the relationship between AI adoption and multiple dimensions of scientific creativity, including novelty (recombinant novelty and object novelty) and impact (3-year short-run citation impact and 10-year long-run citation impact). We find that AI publications are significantly more likely than non-AI publications to rank in the top creativity decile, by 5.5 to 10.2 percentage points. Critically, we uncover substantial heterogeneity across AI research modes. Tool-oriented AI research, which applies existing AI models to domain tasks, is associated with the largest gains in recombinant-based creativity, while Adaptation-oriented AI research, which modifies AI models for domain-specific problems, is associated with relatively higher object-based creativity. These findings reveal that AI does not advance science through a single mechanism but through structurally distinct creative pathways that depend on how AI is incorporated into the research process. Our results contribute to ongoing debates about AI's role in science and carry direct implications for research evaluation and science policy, highlighting the need for assessment frameworks that can distinguish between recombinant and conceptual forms of creativity and that recognize how different modes of AI adoption produce fundamentally different types of scientific contribution.

2606.05116 2026-06-04 cs.LG

Graph Set Transformer

图集变换器

Jose E. Escrig Molina, Baoquan Chen, Daniel Probst

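AI summary: Proposes the Graph Set Transformer (GST), which interleaves node-level feature propagation with cross-graph context modelling at every layer to fuse local structure and set-wide context in learning tasks over sets of graphs, outperforming baselines on synthetic and real-data benchmarks.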

详情
Comments
10 pages, 1 figure, conference
AI中文摘要

我们介绍了图集变换器(GST),一种用于在图集合上学习的神经网络架构,设计用于每个元素的预测依赖于集合范围的上下文以及局部结构的任务。现有架构,包括DeepSets和SetTransformer,需要来自单独GNN的预编码图嵌入,在特征提取和集合级上下文化之间造成瓶颈。相比之下,GST在每一层交织节点级特征传播和跨图上下文建模,通过门控机制融合两个信息层次。我们在一个旨在隔离集合条件结构推理的受控合成套件以及三个真实数据基准(包括逐原子反应中心识别、反应产率预测和图像分类)上评估了GST。在匹配参数预算下,GST在这些设置中表现优于基线。架构消融强烈表明,局部和集合上下文的交织对这一优势有显著贡献。

英文摘要

We introduce the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for tasks in which per-element predictions depend on set-wide context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded graph embeddings from a separate GNN, creating a bottleneck between feature extraction and set-level contextualisation. In contrast, GST interleaves node-level feature propagation and cross-graph contextual modelling at every layer, fusing the two levels of information through a gating mechanism. We evaluate GST on a controlled synthetic suite designed to isolate set-conditional structural reasoning and on three real-data benchmarks spanning per-atom reaction-centre identification, reaction yield prediction, and image classification. Under matched parameter budgets, GST performs better than the baselines across these settings. An architectural ablation strongly suggests that the interleaving of local and set context contributes substantially to this advantage.

2606.05115 2026-06-04 cs.CV cs.AI cs.CL

Continual Visual and Verbal Learning Through a Child's Egocentric Input

通过儿童自我中心输入进行持续的视觉与语言学习

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

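AI summary: Proposes BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective; it outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark and narrows the gap to the offline-training upper bound.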

详情
Comments
15 pages, 4 figures
AI中文摘要

儿童从连续的、时间结构化的自我中心经验流中学习单词的含义。最近的研究表明,神经网络也可以从儿童的自我中心视频记录中学习单词-指代物映射,但它们会循环处理打乱的数据数百个周期,这与儿童实际接触环境的方式形成对比。我们引入了BabyCL,一个持续多模态学习框架,它以单一时间顺序处理SAYCam数据集,结合了流式视觉表示学习和图像-文本对比目标。BabyCL将流的多阶段时间分割与双回放缓冲区相结合,该缓冲区独立管理视觉和多模态历史,并在共享骨干网络上联合训练三个对比损失。在匹配的优化预算下,BabyCL在SAYCam Labeled-S 4AFC基准上优于流式学习基线,显著缩小了与离线训练上限的差距。消融实验表明,这些增益对在线时间分割窗口的长度和回放缓冲区的驱逐规则具有鲁棒性。总之,这些结果表明,在更接近儿童实际体验的训练条件下,有意义的单词-指代物映射可以出现。

英文摘要

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.

2606.05112 2026-06-04 cs.CL

Evaluating Large Language Models in Dynamic Clinical Decision-Making with Standardized Patient Cases

评估大型语言模型在标准化病人案例中的动态临床决策能力

Cheng Liang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Chaoyi Wu, Weidi Xie

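AI summary: Introduces the MedSP1000 benchmark, which simulates dynamic clinical interactions through standardized patient cases to evaluate LLMs on information gathering, treatment planning, and longitudinal management, and finds that under process-level evaluation current models fall well short of the reliability needed for safe clinical use.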

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被提议作为临床代理,然而静态的单轮基准无法捕捉模型在诊疗过程中如何动态地提供护理:收集信息、规划治疗以及跨连续患者状态调整长期管理。医学教育长期以来通过标准化病人(SPs)解决了类似的挑战:经过培训的演员一致地扮演临床案例,实现逼真的实践和客观的脚本化评估。在此,我们介绍MedSP1000,一个源自SP的交互式基准,用于临床代理评估,包括1,638个SP案例和24,602个轨迹级同行评审评分标准。MedSP1000将同行评审的SP教学案例转化为可执行场景,包含定义的SP案例脚本、临床环境上下文和人工验证的结构化评分标准。在每次模拟评估运行中,临床代理与患者代理和环境控制器闭环交互,其行为根据原始材料中指定的专家标准在整个诊疗过程中进行评分。将MedSP1000应用于一系列通用和医学专用LLMs,我们发现静态基准上的表现并不能可靠地转化为此类教育场景。表现最好的模型GPT-5.5仅完成了60.4%的专家定义评分项目,而最强的医学专用模型达到了40.0%;增加测试时计算量没有产生可测量的增益。这些结果表明,当前的LLMs,包括为医学调整的代理系统,尚未足够可靠以安全地整合到实际临床实践中。更广泛地说,MedSP1000展示了过程级、SP式评估如何揭示单轮基准无法捕捉的临床相关失败模式。

英文摘要

Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, including 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.

2606.05110 2026-06-04 cs.DS

Randomization for Faster Exact Optimization of Discounted Markov Decision Processes

随机化加速折扣马尔可夫决策过程的精确优化

Andrei Graur, Aaron Sidford, Ta-Wei Tu

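AI summary: Gives faster deterministic and randomized algorithms for exactly solving discounted Markov decision processes (DMDPs) by efficiently reducing exact solution to policy evaluation and the computation of approximately optimal values.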

详情
AI中文摘要

我们提供了更快的确定性和随机化算法,用于精确求解折扣马尔可夫决策过程(DMDPs)。通过将DMDPs中计算最优值和策略的问题高效归约到更简单的策略评估和计算近似最优值任务,我们得到了这些结果。我们提出了一个直接的确定性归约和一个更高效的随机化变体,结合近似求解DMDPs的进展,最终得到了我们的结果。

英文摘要

We provide faster deterministic and randomized algorithms for exactly solving discounted Markov Decision Processes (DMDPs). We obtain our results by efficiently reducing computing optimal values and policies in DMDPs to the easier tasks of policy evaluation and computing approximately optimal values in DMDPs. We provide both a straightforward deterministic reduction and a more efficient randomized variant that, together with advances in approximately solving DMDPs, yield our results.
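The abstract names two subroutines that the exact solver is reduced to: policy evaluation and the computation of approximately optimal values. The sketch below illustrates only those two standard building blocks on a toy DMDP, using plain fixed-point iteration. It is not the paper's reduction or its algorithms; the function names, the data layout (`P[s][a]` as a next-state distribution, `r[s][a]` as a reward), and the toy instance are assumptions made for illustration.

```python
def backup(P, r, v, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return r[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))

def policy_evaluation(P, r, policy, gamma, iters=200):
    """Iterate v <- r_pi + gamma * P_pi v, which converges to the policy's value."""
    v = [0.0] * len(P)
    for _ in range(iters):
        v = [backup(P, r, v, s, policy[s], gamma) for s in range(len(P))]
    return v

def value_iteration(P, r, gamma, iters=200):
    """Approximate the optimal values; the error shrinks by gamma per sweep."""
    v = [0.0] * len(P)
    for _ in range(iters):
        v = [max(backup(P, r, v, s, a, gamma) for a in range(len(P[s])))
             for s in range(len(P))]
    return v

def greedy_policy(P, r, v, gamma):
    """Pick, in each state, the action with the best one-step lookahead."""
    return [max(range(len(P[s])), key=lambda a: backup(P, r, v, s, a, gamma))
            for s in range(len(P))]

# Toy DMDP with two states and two actions (deterministic transitions).
P = [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to 1
     [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 moves to 0
r = [[0.0, 1.0],
     [2.0, 0.0]]
gamma = 0.5
v_star = value_iteration(P, r, gamma)
pi_star = greedy_policy(P, r, v_star, gamma)
```

On this instance, staying in state 1 is worth 2 / (1 - 0.5) = 4 and moving from state 0 to state 1 is worth 1 + 0.5 * 4 = 3, so the iteration approaches the values (3, 4) and the greedy policy moves from state 0 and stays in state 1.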

2606.05109 2026-06-04 cs.LG

RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities

RePercENT:将解耦表示学习扩展到两种模态之外

Vasiliki Rizou, Pascal Frossard, Dorina Thanou

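AI summary: Proposes RePercENT, a framework that uses a multimodal plug-and-play architecture and a joint optimization objective to achieve scalable pairwise disentanglement beyond two modalities, without joint pre-training and with reduced computational complexity.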

详情
AI中文摘要

为了充分利用多模态数据的潜力,我们需要超越当前最先进的对齐和融合方法,在不牺牲模态特定信息的情况下利用所有跨模态交互。学习解耦表示是识别隐藏在观测数据中的潜在共享和独特因素的一种原则性方法。然而,尽管多模态解耦是一个引人注目的范式,现有方法由于固有的可扩展性瓶颈,主要局限于两种模态。为了解决这个问题,我们提出了RePercENT,这是一个自监督框架,旨在超越这些限制,并解锁超过两种模态的可扩展成对解耦。通过多模态“即插即用”架构,我们的方法直接操作于预提取的嵌入,消除了对广泛联合预训练的需求,同时不对底层模态或基础模型骨干做出任何假设。此外,我们引入了一个联合优化目标,用于同时推导共享和独特组件,并提供了形式化的理论保证来表征我们解决方案的最优性。在多种模态和任务中,RePercENT成功恢复了解耦组件,同时保持了竞争性能并显著降低了计算复杂度。

英文摘要

To leverage the full potential of multimodal data, we need representations that go beyond state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to identify the underlying shared and unique factors that are hidden in observational data. However, while multimodal disentanglement is a compelling paradigm, existing methods are largely confined to the two-modality regime due to their inherent scalability bottleneck. To address this, we propose RePercENT, a self-supervised framework designed to surpass these limitations and unlock scalable pairwise disentanglement beyond two modalities. Through a multimodal "plug-and-play" architecture, our approach operates directly on pre-extracted embeddings, eliminating the need for extensive joint pre-training while making no assumptions regarding the underlying modalities or foundation model backbones. Moreover, we introduce a joint optimization objective for simultaneously deriving the shared and unique components, and provide formal theoretical guarantees that characterize the optimality of our solution. Across diverse modalities and tasks, RePercENT successfully recovers disentangled components while maintaining competitive performance and significantly reducing computational complexity.

2606.05108 2026-06-04 cs.GT

Gradient Dynamics in First-Price Auctions: Iterative Strategy Elimination via Cubic Potentials

一级价格拍卖中的梯度动力学:通过三次势的迭代策略消除

Mete Şeref Ahunbay, Weiqiang Zheng, Tao Lin

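AI summary: Shows that in discretised first-price auctions with complete information, when buyers learn to bid with online gradient ascent, the time-average outcome is close to the efficient outcome of the second-price auction; the proof uses a new potential-function argument and a class of cubic candidate potential functions to establish a strategy-elimination procedure.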

详情
Comments
43 pages, accepted to EC'26. This version incorporates reviewer feedback from the conference review process
AI中文摘要

我们证明,在具有完全信息的离散一级价格拍卖中,如果买家使用在线梯度上升学习出价,则时间平均结果(几乎)是二级价格拍卖的有效结果。我们的证明依赖于规范形式博弈中在线梯度上升分析的两项新颖创新,这些创新可能在更广泛的应用中有用。首先,我们开发了一种基于势函数的论证方法,用于分析规范形式博弈中的梯度上升,从而推断某些策略不会在时间平均中被采用。我们提供了充分条件,确保该论证可以迭代应用,产生类似于迭代消除占优策略的过程。其次,我们开发了一类新颖的三次“候选势函数”,对概率单纯形上的一族二次策略修改进行分类,针对这些修改,在线梯度上升不会产生遗憾。

英文摘要

We show that in discretised first-price auctions with complete information, if the buyers learn to bid with online gradient ascent, the time-average outcome is (almost) the efficient outcome of the second-price auction. Our proof rests on two innovations in the analysis of online gradient ascent in normal-form games, which may be useful in a wider range of applications. First, we develop a potential-function-based argument for the analysis of gradient ascent in normal-form games, allowing us to deduce that certain strategies will not be played in time-average. We provide sufficient conditions which ensure this argument can be applied iteratively, resulting in a procedure reminiscent of iterative elimination of dominated strategies. Second, we develop a novel class of cubic "candidate potential functions", classifying a family of quadratic strategy modifications on the probability simplex against which online gradient ascent incurs no regret.
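To make the learning dynamics concrete, here is a minimal sketch of projected online gradient ascent in a two-buyer discretised first-price auction. Each buyer holds a mixed strategy over a bid grid, takes a gradient step on expected utility against the opponent's current strategy, and projects back onto the probability simplex. Everything specific is an assumption for illustration and not taken from the paper: the two-buyer setting, the values `(1.0, 0.5)`, the grid size, the step size `eta`, the even tie-breaking rule, simultaneous updates, and all function names.

```python
def project_to_simplex(y):
    """Euclidean projection of y onto the probability simplex (sort-based)."""
    u = sorted(y, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - 1.0) / j
        if uj - t > 0:
            theta = t  # keep the threshold of the last index that qualifies
    return [max(v - theta, 0.0) for v in y]

def payoff_matrix(value, bids):
    """U[k][l]: utility of bidding bids[k] against bids[l]; ties split evenly."""
    return [[(value - bk) * (1.0 if bk > bl else 0.5 if bk == bl else 0.0)
             for bl in bids] for bk in bids]

def run_gradient_ascent(values=(1.0, 0.5), grid=10, eta=0.05, steps=2000):
    """Simultaneous projected gradient ascent for two buyers."""
    bids = [k / grid for k in range(grid + 1)]
    n = len(bids)
    U = [payoff_matrix(v, bids) for v in values]
    x = [[1.0 / n] * n for _ in values]      # start from uniform strategies
    avg = [[0.0] * n for _ in values]        # time-averaged strategies
    for _ in range(steps):
        # Gradient of expected utility w.r.t. own strategy is U_i @ x_opponent.
        grads = [[sum(U[i][k][l] * x[1 - i][l] for l in range(n))
                  for k in range(n)] for i in range(2)]
        x = [project_to_simplex([x[i][k] + eta * grads[i][k] for k in range(n)])
             for i in range(2)]
        for i in range(2):
            for k in range(n):
                avg[i][k] += x[i][k] / steps
    return bids, x, avg
```

Calling `run_gradient_ascent()` returns the bid grid, the final strategies, and the time-averaged strategies; the abstract's claim concerns the time-averaged outcome, so `avg` is the quantity to inspect when experimenting with different values and grid sizes.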