arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2606.17832 2026-06-19 cs.LG New submission

From Drift to Coherence: Stabilizing Beliefs in LLMs

SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee

Affiliations: Department of Statistics, Seoul National University; Korea Advanced Institute of Science & Technology; Department of AI, Kookmin University; University of Hong Kong

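AI summary: Studies belief drift of LLMs in multiple-choice question answering, proposes prompted predictive resampling (PPR), finds that the belief process self-stabilizes and converges, and then proposes a seed-answer prompting strategy and a self-consistency loss to accelerate stabilization and improve predictive coherence.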

AI Chinese abstract

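Large language models (LLMs) are often assumed to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage scenario: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study the belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), in which an LLM generates a sequence of answers to the same question. Experiments show that PPR reveals early-stage belief drift, indicating that the martingale property is violated. However, after sufficiently many resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization and (ii) a self-consistency loss that amortizes early-stage drift into the model through fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.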

English abstract

Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.
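The martingale condition that PPR tests can be made concrete with a toy model. The sketch below is not the authors' PPR procedure and calls no LLM; it assumes a hypothetical Dirichlet-multinomial (Pólya urn) predictive over a discrete answer set, a textbook example of a coherent predictive, and a hypothetical "sharpened" variant that over-weights the majority answer. For each, it computes the exact expected next-step predictive after one resampled answer is appended to the context and reports how far that expectation is from the current predictive.

```python
from fractions import Fraction
from typing import Callable, List, Sequence

# A predictive maps answer counts (how often each option was resampled) to P(next answer).
Predictive = Callable[[Sequence[int]], List[Fraction]]


def polya(alpha: Sequence[int]) -> Predictive:
    """Dirichlet-multinomial (Polya urn) predictive: a coherent, martingale example."""
    def predictive(counts: Sequence[int]) -> List[Fraction]:
        total = sum(alpha) + sum(counts)
        return [Fraction(a + c, total) for a, c in zip(alpha, counts)]
    return predictive


def sharpened(alpha: Sequence[int]) -> Predictive:
    """Hypothetical drifting predictive: squares the urn weights, over-weighting the majority."""
    def predictive(counts: Sequence[int]) -> List[Fraction]:
        weights = [(a + c) ** 2 for a, c in zip(alpha, counts)]
        total = sum(weights)
        return [Fraction(w, total) for w in weights]
    return predictive


def martingale_gap(predictive: Predictive, counts: Sequence[int]) -> Fraction:
    """Max |p_t - E[p_{t+1} | history]| after one resampled answer is appended."""
    p_now = predictive(counts)
    expected = [Fraction(0)] * len(p_now)
    for option, prob in enumerate(p_now):
        new_counts = list(counts)
        new_counts[option] += 1  # the model's own answer is fed back into the context
        p_next = predictive(new_counts)
        expected = [e + prob * q for e, q in zip(expected, p_next)]
    return max(abs(a - b) for a, b in zip(p_now, expected))


if __name__ == "__main__":
    alpha, counts = [1, 1, 1, 1], [3, 0, 1, 0]  # four options A-D
    print("polya gap:", martingale_gap(polya(alpha), counts))
    print("sharpened gap:", float(martingale_gap(sharpened(alpha), counts)))
```

Exact fractions are used so that a coherent predictive yields a gap of exactly zero rather than floating-point noise; a nonzero gap is the toy analogue of the early-stage drift described above.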

2606.17128 2026-06-19 cs.AR New submission

Shift-Left High-Level Synthesis Verification via Knowledge-Augmented LLM Agent

Zhihan Xiao, Hongbing Lang, Zhe Zhao, Luke Ztz Hu, Songping Mai

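AI summary: Proposes a knowledge-augmented, agent-driven shift-left verification framework that uses dual-tier consistency checking, symbolic execution, and an HLS verification knowledge graph to automatically verify functional consistency between C and HLS-C before synthesis, reaching 98.26% average coverage.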

AI Chinese abstract

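High-Level Synthesis (HLS) enables rapid hardware development by transforming C/C++ programs into hardware implementations. In HLS design flows, verifying functional consistency between the golden C specification and the HLS-oriented C implementation is a critical but labor-intensive task. Although large language models (LLMs) have recently shown promise for automated testbench generation, their stochasticity often leads to insufficient coverage, inconsistent verification environments, and unreliable equivalence-checking results. To address these limitations, we propose a knowledge-augmented, agent-driven shift-left verification framework that automatically checks functional consistency between golden C and HLS-C before synthesis. The framework introduces a dual-tier consistency checking mechanism that jointly enforces static structural alignment and dynamic behavioral equivalence between paired testbenches, while integrating symbolic execution and coverage-driven refinement to improve verification completeness. In addition, we construct a heterogeneous HLS verification knowledge graph that provides topology-aware reasoning priors for testbench generation, and design an autonomous verification agent that orchestrates iterative refinement and failure diagnosis across heterogeneous toolchains. Experimental results on 107 HLS benchmark pairs show that the proposed framework achieves 98.26% average coverage and 95.33% dynamic consistency, outperforming representative AST-based, retrieval-augmented, and iterative agent-based baselines. https://github.com/cz-5f/HLS-LeVeri.git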

English abstract

High-Level Synthesis (HLS) relies on transforming original C specifications into synthesizable HLS-oriented C (HLS-C) implementations. Functional consistency verification between original C specifications and HLS-C implementations is a critical yet labor-intensive task in HLS design flows. While Large Language Models (LLMs) have recently shown promise in automated testbench generation, their stochastic nature often leads to insufficient coverage, inconsistent verification environments, and unreliable equivalence checking results. To address these limitations, we propose a knowledge-augmented, agent-driven shift-left verification framework for automated functional consistency checking between original C and HLS-C implementations before synthesis. The framework introduces a Dual-Tier Consistency Checking mechanism that jointly enforces static structural alignment and dynamic behavioral equivalence between paired testbenches, while integrating symbolic execution and coverage-driven refinement to improve verification completeness. Furthermore, we construct a heterogeneous HLS Verification Knowledge Graph to provide topology-aware reasoning priors for testbench generation, and design an autonomous verification agent to orchestrate iterative refinement and failure diagnosis across heterogeneous toolchains. Experimental results on 107 HLS benchmark pairs demonstrate that the proposed framework achieves 0.9826 average coverage and 0.9533 dynamic consistency, outperforming representative AST-based, retrieval-augmented, and iterative agent-based baselines. https://github.com/cz-5f/HLS-LeVeri.git
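The dynamic half of a paired-testbench check can be illustrated by differential testing: drive a golden reference and a candidate implementation with identical stimuli and flag every output mismatch. The sketch below is a hypothetical Python stand-in, not the HLS-LeVeri framework; `golden_sum` plays the role of the original C routine, `hls_sum` models an HLS-style rewrite that accumulates in a fixed-width register, and `paired_testbench` is an illustrative name.

```python
import random
from typing import Callable, List, Sequence


def golden_sum(xs: Sequence[int]) -> int:
    """Golden reference: unbounded integer sum (stands in for the original C spec)."""
    return sum(xs)


def hls_sum(xs: Sequence[int], width: int = 16) -> int:
    """Hypothetical HLS-style rewrite: accumulate in a fixed-width unsigned register."""
    mask = (1 << width) - 1
    acc = 0
    for x in xs:
        acc = (acc + x) & mask  # wraps on overflow, like an unsigned 16-bit accumulator
    return acc


def paired_testbench(ref: Callable, dut: Callable, stimuli: List[List[int]]) -> List[List[int]]:
    """Apply identical stimuli to both models and return the inputs that disagree."""
    return [s for s in stimuli if ref(s) != dut(s)]


if __name__ == "__main__":
    rng = random.Random(0)
    random_stimuli = [[rng.randrange(0, 100) for _ in range(8)] for _ in range(50)]
    directed_stimuli = [[0xFFFF, 1]]  # corner case aimed at the overflow path
    print(len(paired_testbench(golden_sum, hls_sum, random_stimuli)))
    print(paired_testbench(golden_sum, hls_sum, directed_stimuli))
```

In this toy, random small inputs never reach the overflow path, while one directed corner case exposes the divergence; this is the kind of gap that coverage-driven refinement and symbolic execution are meant to close.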

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH New submission

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

Affiliations: Spotify USA, Inc.

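AI summary: Proposes a surrogacy framework, proves that calibrating LLM outputs identifies the average treatment effect under conditions weaker than distributional equivalence, and analyzes the bias and variance introduced by LLM stochasticity.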

AI Chinese abstract

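Organizations and researchers are increasingly interested in using large language models (LLMs) in place of human participants in A/B tests, hoping to experiment faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid, but it is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs. The framework shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. When these conditions do not hold, the effect of interest is only partially identified; we provide diagnostics that can falsify surrogacy on historical experiments and give bounds on the worst-case bias under limited overlap. We further show that the stochasticity inherent to LLMs introduces bias and variance, but that using the average of multiple draws as the surrogate mitigates both. We demonstrate the methods and theory in simulations and in an application to A/B tests of Upworthy headlines. A central conclusion of our work is that the validity of LLM outcomes as surrogates can only be falsified on past treatments and cannot be verified for new treatments, so human experiments remain indispensable for novel interventions. We discuss the role of LLM choice, prompting, and temperature as design variables, and how to size human experiments for validation.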

English abstract

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.
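A minimal sketch of the calibration idea, under stated assumptions: the calibration map is linear and fitted by ordinary least squares on hypothetical historical units that have both a human outcome and an LLM surrogate (the abstract also mentions nonparametric calibration, which this does not implement). Each unit's surrogate is the average of several LLM draws, and the effect of a new LLM-only experiment is the difference in mean calibrated predictions between arms. Function names and numbers are illustrative.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def surrogate(draws: Sequence[float]) -> float:
    """Average several LLM draws for one unit; this damps sampling noise."""
    return fmean(draws)


def fit_linear_calibration(surrogates: Sequence[float], human: Sequence[float]) -> Tuple[float, float]:
    """OLS fit of human_outcome = a + b * surrogate on historical units with both measured."""
    sx, sy = fmean(surrogates), fmean(human)
    cov = sum((x - sx) * (y - sy) for x, y in zip(surrogates, human))
    var = sum((x - sx) ** 2 for x in surrogates)
    b = cov / var
    return sy - b * sx, b


def calibrated_ate(calibration: Tuple[float, float],
                   treated_draws: List[List[float]],
                   control_draws: List[List[float]]) -> float:
    """Difference in mean calibrated predictions between the two arms."""
    a, b = calibration
    treated = [a + b * surrogate(d) for d in treated_draws]
    control = [a + b * surrogate(d) for d in control_draws]
    return fmean(treated) - fmean(control)


if __name__ == "__main__":
    # Toy history in which human outcomes are exactly 1 + 2 * surrogate.
    cal = fit_linear_calibration([0.1, 0.2, 0.3, 0.4], [1.2, 1.4, 1.6, 1.8])
    treated = [[0.3, 0.5], [0.4, 0.4]]
    control = [[0.1, 0.3], [0.2, 0.2]]
    raw = fmean(map(surrogate, treated)) - fmean(map(surrogate, control))
    print(f"raw LLM effect: {raw:.2f}, calibrated effect: {calibrated_ate(cal, treated, control):.2f}")
```

In this toy, the raw LLM effect understates the human-scale effect by half, and calibration rescales it. As the abstract stresses, the step is only valid if surrogacy and comparability hold for the new treatment, which historical data can falsify but never confirm.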

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG New submission

Human Universal Grasping

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

Affiliations: New York University; Tsinghua University; University of Michigan

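AI summary: Proposes HUG, a model that uses human grasping data (the 1M-HUGs dataset) and flow matching to generate diverse grasp poses from a single RGB-D image and retarget them to robot hands for zero-shot grasping, outperforming baselines by 23% to 34% on HUG-Bench.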

Comments 28 pages, 20 figures, 7 tables

AI Chinese abstract

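Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object from a single RGB-D image captured by a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hours) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations and outputs a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. The predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, containing 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms state-of-the-art grasping baselines by 23% and 34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/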

English abstract

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
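HUG is described as a flow-matching model. As a generic illustration of the technique only (not HUG's architecture), the sketch below builds the standard linear-path flow-matching training pair, in which a network would regress the velocity $x_1 - x_0$ at the interpolated point, and shows Euler integration of a velocity field at sampling time. The 3-D vector is a hypothetical stand-in for a wrist translation, and the exact conditional velocity toward a single target replaces the trained network.

```python
from typing import Callable, List, Tuple

Vec = List[float]


def fm_training_pair(x0: Vec, x1: Vec, t: float) -> Tuple[Vec, Vec]:
    """Linear path: x_t = (1 - t) * x0 + t * x1; the regression target is v = x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return xt, v


def euler_sample(velocity: Callable[[Vec, float], Vec], x0: Vec, steps: int = 4) -> Vec:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with explicit Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt  # never reaches 1, so the point-target velocity below stays finite
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def point_target_velocity(x1: Vec) -> Callable[[Vec, float], Vec]:
    """Conditional velocity that carries any x to x1 by t = 1 (stand-in for a trained net)."""
    return lambda x, t: [(b - a) / (1 - t) for a, b in zip(x, x1)]


if __name__ == "__main__":
    noise = [1.0, -0.5, 0.2]   # x0 drawn from the source distribution
    grasp = [0.1, -0.2, 0.3]   # hypothetical target, e.g. a wrist translation
    print(fm_training_pair(noise, grasp, t=0.5))
    print(euler_sample(point_target_velocity(grasp), noise, steps=4))
```

A trained model would learn a velocity field averaged over many human grasps, and conditioning on RGB-D features would enter as extra network inputs; both are omitted here.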

2606.17041 2026-06-19 cs.CL cs.IR New submission

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

Affiliations: Tsinghua University

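AI summary: Proposes MetaSyn, a dataset of 442 expert-curated meta-analyses for evaluating LLM agents across the full retrieval-screening-synthesis pipeline, and finds that current systems face a severe bottleneck at the screening stage.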

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

AI Chinese abstract

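Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date ranges. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: although the retrieval ceiling reaches 90.9% recall at K=200, no system recovers more than 52.7% of the ground-truth included literature. Current LLMs cannot reliably distinguish eligible studies from PI/ECO-ineligible distractors of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.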

English abstract

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
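The gap between the retrieval ceiling (recall at K) and end-to-end inclusion is a stage-attributed metric. A minimal sketch with hypothetical document IDs: recall@K counts ground-truth included studies that survive retrieval, end-to-end recall counts those that also survive screening, and the difference isolates the loss at the screening stage. The function names are illustrative and not taken from the MetaSyn code.

```python
from typing import Dict, Iterable, List


def recall(found: Iterable[str], relevant: Iterable[str]) -> float:
    relevant = set(relevant)
    return len(relevant & set(found)) / len(relevant) if relevant else 0.0


def stage_report(ranked: List[str], screened_in: Iterable[str],
                 relevant: Iterable[str], k: int) -> Dict[str, float]:
    """Split end-to-end recall into what retrieval found and what screening then lost."""
    retrieved = ranked[:k]
    keep = set(screened_in)
    kept = [doc for doc in retrieved if doc in keep]  # screening only sees retrieved docs
    r_retrieval = recall(retrieved, relevant)
    r_end_to_end = recall(kept, relevant)
    return {"recall@k": r_retrieval,
            "end_to_end": r_end_to_end,
            "screening_loss": r_retrieval - r_end_to_end}


if __name__ == "__main__":
    relevant = {"p1", "p2", "p3", "p4"}            # ground-truth included studies
    ranked = ["p1", "n1", "p2", "p3", "n2", "p4"]  # n* are topical hard negatives
    screened_in = ["p1", "n1"]                     # screener keeps a hard negative, drops p2 and p3
    print(stage_report(ranked, screened_in, relevant, k=4))
```

A single end-to-end number (0.25 here) would hide that retrieval already reached 0.75 and that most of the loss happened at screening.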

2606.16946 2026-06-19 cs.CG New submission

Polynomial-Time Riesz-Energy Subset Selection for Ordered Point Sets on Lines and $\ell_1$-Staircases

Michael T. M. Emmerich

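AI summary: Proves a Monge property for the one-dimensional Riesz interaction, obtains polynomial-time solvability via submodular minimization, and gives an explicit minimum-cut algorithm that also applies to subset selection on ℓ1-staircases.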

Comments 17 pages, 6 figures; added an appendix with more examples and explanations and an ℓ1-staircase example; HTML-friendly

AI Chinese abstract

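We study the one-dimensional fixed-cardinality minimum Riesz $s$-energy subset problem with a fixed exponent $s>0$: given ordered real points $x_1 < x_2 < \cdots < x_n$, a positive parameter $s>0$, and a cardinality $k$, choose indices $1 \leq i_1 < \cdots < i_k \leq n$ that minimize $E_s(i_1,\ldots,i_k)=\sum_{1\leq p<q\leq k}(x_{i_q}-x_{i_p})^{-s}$. We prove that the one-dimensional Riesz interaction has the Monge property. By encoding feasible subsets as increasing index vectors, this Monge inequality implies submodularity on a finite distributive lattice and yields polynomial-time solvability via submodular minimization over distributive lattices. The structural construction holds for all real $s>0$; bit-complexity claims require the arithmetic assumptions stated in the complexity section. The same structure also yields an explicit minimum $S$-$T$ cut algorithm with $k(n-k)$ threshold variables and $O(k^2(n-k)^2)$ finite pairwise edges. After an $O(k^2(n-k)^2)$ coefficient-construction step, the resulting graph has $N=k(n-k)$ nodes and $M=O(k^2(n-k)^2)$ arcs; an $O(NM)$ max-flow bound gives an $O(k^3(n-k)^3)$ minimum-cut step, while the conservative $O(N^2M)$ bound gives $O(k^4(n-k)^4)$. By isometry, the results apply directly to subset selection on ℓ1-staircases, for example selecting diverse and representative Pareto-front or skyline approximations in two dimensions. The accompanying reproducibility materials provide an open-source Python implementation of the minimum-cut algorithm.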

English abstract

We study efficient algorithms for one-dimensional fixed-cardinality minimum Riesz $s$-energy subset selection on ordered real-line point sets and propose and test a polynomial-time exact s-t cut-based algorithm for this problem. Given $x_1<\cdots<x_n$, an exponent $s>0$, and a cardinality $k$, the task is to choose $1\leq i_1<\cdots<i_k\leq n$ minimizing $E_s(i_1,\ldots,i_k)=\sum_{1\leq p<q\leq k}(x_{i_q}-x_{i_p})^{-s}$. We prove that the one-dimensional Riesz interaction satisfies a Monge inequality. When feasible subsets are encoded as increasing index vectors, this property implies submodularity on a finite distributive lattice and yields polynomial-time solvability by submodular minimization over such lattices. The structural reduction holds for every real $s>0$. We also derive an explicit minimum $S$--$T$ cut formulation with $k(n-k)$ threshold variables and $O(k^2(n-k)^2)$ finite pairwise edges. The constructed graph has $N=k(n-k)$ nodes and $M=O(k^2(n-k)^2)$ arcs after an $O(k^2(n-k)^2)$ coefficient-construction step; an $O(NM)$ max-flow bound gives an $O(k^3(n-k)^3)$ cut step, while the conservative $O(N^2M)$ bound gives $O(k^4(n-k)^4)$. By an isometry argument, the same algorithm applies to $\ell_1$-staircases, including monotone two-dimensional Pareto-front and skyline approximations. The accompanying Python implementation includes verification examples and an empirical runtime benchmark; on balanced instances $n=2k$, the reference min-cut code overtakes exhaustive enumeration around $n=24$--$26$. The appendix provides examples and detailed explanations of the underlying theory.
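The objective and the exhaustive-enumeration baseline mentioned above are easy to state in code. The sketch below is not the author's min-cut implementation: it evaluates $E_s$, finds the optimum by enumerating all $\binom{n}{k}$ index tuples (the baseline that the min-cut code overtakes on larger instances), and numerically checks one natural form of the Monge inequality for $c(i,j)=(x_j-x_i)^{-s}$, namely $c(i,j)+c(i',j') \le c(i,j')+c(i',j)$ for $i<i'<j<j'$. That form follows from convexity of $d \mapsto d^{-s}$; the paper's exact statement may differ.

```python
from itertools import combinations
from typing import Sequence, Tuple


def riesz_energy(x: Sequence[float], idx: Sequence[int], s: float) -> float:
    """E_s = sum over pairs p < q of (x[i_q] - x[i_p]) ** (-s); x must be strictly increasing."""
    return sum((x[b] - x[a]) ** (-s) for a, b in combinations(idx, 2))


def brute_force(x: Sequence[float], k: int, s: float) -> Tuple[int, ...]:
    """Exhaustive baseline: try every increasing index tuple of length k."""
    return min(combinations(range(len(x)), k), key=lambda idx: riesz_energy(x, idx, s))


def monge_holds(x: Sequence[float], s: float, tol: float = 1e-12) -> bool:
    """Check c(i, j) + c(i', j') <= c(i, j') + c(i', j) for all i < i' < j < j'."""
    def c(i: int, j: int) -> float:
        return (x[j] - x[i]) ** (-s)
    return all(c(i, j) + c(ip, jp) <= c(i, jp) + c(ip, j) + tol
               for i, ip, j, jp in combinations(range(len(x)), 4))


if __name__ == "__main__":
    x = [0.0, 0.4, 1.1, 1.5, 2.9, 3.0]
    best = brute_force(x, k=3, s=1.0)
    print(best, riesz_energy(x, best, 1.0))
    print("Monge inequality holds:", monge_holds(x, s=1.0))
```

Enumeration costs $\binom{n}{k}$ energy evaluations, which is why it is only practical for small $n$; the polynomial min-cut formulation is what scales.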

2606.16780 2026-06-19 cs.RO New submission

DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps

Sausar Karaf, Oleg Sautenkov, Mikhail Martynov, Dzmitry Tsetserukou

Affiliations: Intelligent Space Robotics Laboratory, CDE, Skoltech

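AI summary: Proposes the DIFF-IPPO framework, which combines an open-vocabulary belief map generator with a diffusion planner to generate global trajectories over non-Gaussian belief maps for efficient target search, achieving normalized detection scores of 81.49% to 86.55%.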

AI Chinese abstract

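Exploration and object search require robots to perceive the environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, whereas object-search scenarios typically produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation conditioned directly on such non-Gaussian belief maps remains relatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage on high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario in which the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections within 3.5 minutes.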

English abstract

Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.

2606.16725 2026-06-19 cs.SE New submission

Organizational Cohesion in Microservice Architectures: A Multi-Project Empirical Study

Xiaozhou Li, Andrea Janes

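AI summary: Proposes the concept of organizational cohesion and a metric for it, PTC; an analysis of seven open-source systems, including Spinnaker, finds systematic differences between core and peripheral services and only a weak correlation between PTC and AOC, revealing distinct organizational dynamics behind team cohesion and cross-service development activity.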

AI Chinese abstract

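The widespread adoption of microservice architectures has introduced new challenges in aligning software modularity with the structure of development organizations. Although prior research has extensively examined technical properties such as service coupling and dependency structures, comparatively little attention has been paid to how contributor activity reflects or diverges from service boundaries. In this paper, we introduce the notion of organizational cohesion in microservice ecosystems and propose a quantitative approach to measure it. Building on the Sensitive Class Cohesion Metric (SCOM), we define Pairwise Team Cohesion (PTC), a metric that captures the balance and focus of developer contributions within individual microservices. We analyze the evolution of organizational cohesion through a longitudinal case study of the Spinnaker microservice platform and replicate the analysis across six additional open-source microservice systems. Our results reveal systematic differences between core and peripheral services and show that PTC and Average Organizational Coupling (AOC) exhibit only a weak correlation across projects. This finding suggests that team cohesion and cross-service developer activity reflect distinct and weakly associated organizational dynamics. By extending the "high cohesion, low coupling" principle to the organizational level, our study provides a quantitative perspective for assessing the socio-technical structure of microservice development.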

English abstract

The widespread adoption of microservice architectures has introduced new challenges in aligning software modularity with the structure of development organizations. Although prior research has extensively examined technical properties such as service coupling and dependency structures, comparatively little attention has been paid to how contributor activity reflects or diverges from service boundaries. In this paper, we introduce the notion of organizational cohesion in microservice ecosystems and propose a quantitative approach to measure it. Building on the Sensitive Class Cohesion Metric (SCOM), we define Pairwise Team Cohesion (PTC), a metric that captures the balance and focus of developer contributions within individual microservices. We analyze the evolution of organizational cohesion using a longitudinal case study of the Spinnaker microservice platform and replicate the analysis across six additional open-source microservice systems. Our results reveal systematic differences between core and peripheral services and show that PTC and Average Organizational Coupling (AOC) exhibit only a weak correlation across projects. This finding suggests that team cohesion and cross-service developer activity reflect distinct and weakly associated organizational dynamics. By extending the "high cohesion, low coupling" principle to the organizational level, our study provides a quantitative perspective for assessing the socio-technical structure of microservice development.
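PTC is said to build on SCOM. As a rough illustration only, the sketch adapts the SCOM pairwise formula as it is commonly stated (connection intensity $|A_i\cap A_j|/\min(|A_i|,|A_j|)$, weighted by $|A_i\cup A_j|/|A|$ and averaged over all pairs) by treating developers as the "methods" and the files each one changed within a single microservice as the "attributes". This mapping, the degenerate-case convention, and the function name are assumptions, not the paper's definition of PTC.

```python
from itertools import combinations
from typing import Dict, Set


def scom_style_cohesion(touched: Dict[str, Set[str]]) -> float:
    """touched maps developer -> files they changed within one service (hypothetical input)."""
    devs = [files for files in touched.values() if files]
    if len(devs) < 2:
        return 1.0 if devs else 0.0  # assumed convention for one or zero active developers
    all_files = set().union(*devs)
    total = 0.0
    for a, b in combinations(devs, 2):
        intensity = len(a & b) / min(len(a), len(b))  # how much the pair's work overlaps
        weight = len(a | b) / len(all_files)          # how much of the service the pair covers
        total += intensity * weight
    return total / (len(devs) * (len(devs) - 1) / 2)


if __name__ == "__main__":
    service = {
        "ann": {"api.py", "model.py"},
        "bob": {"api.py", "model.py", "db.py"},
        "cho": {"docs.md"},
    }
    print(round(scom_style_cohesion(service), 3))
```

Under this reading, a service scores high when its contributors work on overlapping, service-wide file sets and low when contributions are fragmented; cross-service coupling such as AOC would be measured separately.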

2606.16682 2026-06-19 cs.LG cs.CL New submission

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

Zewen Liu

Affiliations: School of Software Engineering, Qilu Institute of Technology

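AI summary: Studies how preference collapse is amplified in multimodal self-evaluation, finds that cross-modal contagion distorts strategy selection, and introduces a contagion matrix to quantify the risk.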

Comments 19 pages, 0 figures

AI Chinese abstract

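When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is significantly amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat on text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of the weight, 3.2 times the collapse seen in text-only self-evaluation, while three visual-domain strategies together receive only 9.1% of the weight. We then demonstrate a new phenomenon we call cross-modal contagion: evaluator preferences acquired on one modality transfer to another modality and corrupt its strategy selection. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion, in which the optimal strategy for a modality reverses after cross-modal exposure. Phase 3 statistical validation across four evaluator configurations (53 independent repetitions in total, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_{T->V}=1.176, gamma_{V->T}=1.089, Delta=-0.088, p=0.575, Cohen's d=0.29); a high number of rounds (DashScope, 50 rounds) leads to collapse into single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity, with 97% of runs (N=30, DeepSeek-chat) yielding exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows a statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.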

English abstract

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong contagion (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero contagion (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.
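The abstract reports contagion partly as a Jensen-Shannon divergence (JSD) between strategy-weight distributions. JSD itself is standard; the sketch below computes it in bits, so values lie in [0, 1]. Apart from step_by_step, the strategy names and all weights are hypothetical, and the paper's gamma contagion coefficient is not reproduced because its definition is not given here.

```python
from math import log2
from typing import Dict, List, Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def strategy_jsd(before: Dict[str, float], after: Dict[str, float]) -> float:
    """Compare two strategy-weight tables after normalizing over the union of strategies."""
    names = sorted(set(before) | set(after))

    def norm(w: Dict[str, float]) -> List[float]:
        total = sum(w.get(n, 0.0) for n in names)
        return [w.get(n, 0.0) / total for n in names]

    return jsd(norm(before), norm(after))


if __name__ == "__main__":
    text_only = {"step_by_step": 0.3, "direct": 0.4, "visual_grounding": 0.3}
    after_exposure = {"step_by_step": 0.6, "direct": 0.3, "visual_grounding": 0.1}
    print(f"JSD = {strategy_jsd(text_only, after_exposure):.3f}")
```

JSD is symmetric and finite even when one table gives a strategy zero weight, which makes it convenient for comparing weights before and after cross-modal exposure.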

2606.16615 2026-06-19 cs.CV New submission

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Affiliations: Lab of Digital Image and Intelligent Computation, Shanghai Maritime University; Department of Language Science and Technology, The Hong Kong Polytechnic University; Affiliated Lianyungang Hospital of Xuzhou Medical University

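AI summary: Proposes the SUP-MCRL framework, which uses a semantic-aware visual encoder, a unified EEG enhancer, and a prototype-based progressive augmenter to address semantic consistency and subject selectivity in multimodal contrastive learning, reaching 66.0%/91.9% Top-1/Top-5 accuracy on the THINGS-EEG zero-shot task.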

AI Chinese abstract

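Non-invasive brain-computer interfaces suffer severe fidelity degradation in neural visual decoding when generalizing to natural visual experience. Conventional multimodal contrastive representation learning optimizes only geometric distance alignment, ignoring semantic consistency and subject selectivity, which leads to spurious zero-shot alignment. We propose SUP-MCRL, a unified framework that integrates three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that uses multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject accuracy and 24.0%/52.9% LOSO accuracy, surpassing existing state-of-the-art methods. Code is available at https://github.com/NZWANG/SUP-MCRL.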

English abstract

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.
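Zero-shot Top-1/Top-5 accuracy in this setting can be read as follows: for each EEG trial, rank every candidate image embedding by similarity to the EEG embedding and count a hit if the true image is among the top k. A minimal sketch with cosine similarity; the embeddings are hypothetical lists of floats, not SUP-MCRL outputs, and the exact evaluation protocol of the paper may differ.

```python
from math import sqrt
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def top_k_accuracy(eeg: List[Sequence[float]], images: List[Sequence[float]],
                   labels: List[int], k: int) -> float:
    """eeg[i] should match images[labels[i]]; returns the fraction of trials hit in the top k."""
    hits = 0
    for e, true_idx in zip(eeg, labels):
        ranked = sorted(range(len(images)), key=lambda j: cosine(e, images[j]), reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(eeg)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical image embeddings
    eeg = [[0.9, 0.1], [0.2, 1.0]]                 # hypothetical EEG embeddings
    print(top_k_accuracy(eeg, images, [0, 1], k=1), top_k_accuracy(eeg, images, [2, 2], k=2))
```

Because candidates are unseen classes, a high Top-1 score requires the two embedding spaces to agree semantically, not just geometrically, which is the failure mode the abstract attributes to distance-only alignment.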

2606.16575 2026-06-19 cs.LG math-ph math.MP New submission

RepNN: Tackling spectral bias in deep neural networks via parameter reparameterization

Yong Wang, Tao Zhou, Xuhui Meng

Affiliations: Institute of Interdisciplinary Research for Mathematics and Applied Science, School of Mathematics and Statistics, Huazhong University of Science and Technology; Institute of Computational Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences

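AI summary: Addresses the spectral bias of deep neural networks in capturing oscillatory and multiscale behavior by proposing RepNet, which reparameterizes the weights and biases of the first hidden layer to control the initial slope scale and the distribution of partition points, enabling adaptive frequency scaling and markedly improving accuracy in function approximation, PDE solving, and operator learning.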

AI Chinese abstract

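Deep neural networks (DNNs) have achieved remarkable success in scientific computing, but they often suffer from spectral bias when capturing oscillatory and multiscale behavior. In this study, we investigate this limitation by examining the failure of shallow ReLU neural networks to fit high-frequency functions. This observation identifies two important factors for resolving rapid oscillations: the initial slope scale and the distribution of partition points induced by the network. Motivated by this analysis, we propose RepNet, a reparameterized DNN model for ReLU and tanh networks designed for high-frequency and multiscale problems. The key idea is to reparameterize the weights and biases of the first hidden layer, which enables effective control of the initial slope scale and provides a suitable initial distribution of partition points. Moreover, treating the reparameterized weights and biases as trainable parameters allows the DNN to achieve adaptive frequency scaling during training. We also derive quantitative estimates of the output and slope magnitudes of the reparameterized DNN to guide the initialization of the proposed method. Numerical experiments, including multiscale one- and four-dimensional function approximation, forward and inverse PDE problems combined with physics-informed neural networks (PINNs), and operator learning, show that RepNet improves the prediction accuracy of vanilla DNNs in capturing highly oscillatory features at a slightly increased computational cost. These results indicate that RepNet provides an effective and flexible approach for overcoming spectral bias and applying DNNs to multiscale problems.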

English abstract

Deep neural networks (DNNs) have achieved remarkable success in scientific computing, yet they often suffer from spectral bias in capturing oscillatory and multiscale behaviors. In this study, we investigate this limitation by examining the failure of shallow ReLU neural networks in fitting high-frequency functions. This observation identifies two important factors in resolving rapid oscillations: the initial slope scale and the distribution of partition points induced by the networks. Motivated by this analysis, we propose RepNN, a reparameterized neural network model with ReLU or tanh activation designed for high-frequency and multiscale problems. The key idea is to reparameterize the weights and biases in the first hidden layer, which enables effective control of the initial slope scale and provides an appropriate distribution of the initial partition points. Furthermore, treating the reparameterized weights and biases as trainable parameters allows the DNN to achieve adaptive frequency scaling during training. In addition, we derive quantitative estimates for the output and slope magnitudes of the reparameterized DNN to guide the initialization of the proposed method. Numerical experiments, including multiscale one- and four-dimensional function approximations, forward and inverse PDE problems in combination with physics-informed neural networks (PINNs), and operator learning for an earthquake problem using real data, demonstrate that RepNN improves the prediction accuracy of vanilla DNNs in capturing highly oscillatory features at a slightly increased computational cost. These results indicate that RepNN provides an effective and flexible approach for overcoming spectral bias and applying DNNs to multiscale problems.
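For a one-dimensional shallow ReLU network, hidden unit $i$ computes $\mathrm{ReLU}(w_i x + b_i)$; its kink (partition point) sits at $x=-b_i/w_i$ and its slope magnitude is $|w_i|$. One hypothetical way to expose the two controls named above is to write $w_i=\sigma\,\hat w_i$ with $\hat w_i=\pm 1$ and $b_i=-w_i c_i$, so that $\sigma$ sets the initial slope scale and the $c_i$ place the partition points directly. This is an illustrative assumption, not necessarily RepNN's exact parameterization, and the helper names are made up for the sketch.

```python
from typing import List, Sequence, Tuple


def reparam_first_layer(centers: Sequence[float], directions: Sequence[float],
                        scale: float) -> Tuple[List[float], List[float]]:
    """Map (centers c_i, unit directions w_hat_i, slope scale sigma) to first-layer weights and biases."""
    weights = [scale * d for d in directions]
    biases = [-w * c for w, c in zip(weights, centers)]
    return weights, biases


def partition_points(weights: Sequence[float], biases: Sequence[float]) -> List[float]:
    """Kinks of ReLU(w * x + b) in one dimension: x = -b / w."""
    return [-b / w for w, b in zip(weights, biases)]


def uniform_centers(n: int, lo: float = -1.0, hi: float = 1.0) -> List[float]:
    """Spread n partition points evenly over [lo, hi] (an assumed initialization)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


if __name__ == "__main__":
    centers = uniform_centers(5)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
    w, b = reparam_first_layer(centers, [1, -1, 1, -1, 1], scale=10.0)
    print(w, b)
    print(partition_points(w, b))  # recovers the centers
```

If $\sigma$ and the $c_i$ are trained instead of $w_i$ and $b_i$ directly, gradient steps move slope scale and kink locations independently, which is one way to read the "adaptive frequency scaling" described above.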

2606.16417 2026-06-19 cs.SD eess.AS New submission

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Xintong Wang, Ye Wang

Affiliations: University of Science and Technology of China

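AI summary: Proposes Joycent, a diffusion-based accent TTS method that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction, integrates accent and speaker representations through conditional layer normalization, and introduces the WhisAID accent identification model, improving accent naturalness while preserving speaker identity.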

AI Chinese abstract

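Accent text-to-speech (TTS) aims to synthesize speech with a target accent. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, this approach suffers from error accumulation and requires paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient for modeling acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech, to extract accent representations. Experimental results show that, compared with baseline systems, Joycent improves accent naturalness while preserving speaker identity. We release our code and demos at: https://github.com/oshindow/Joycent-code.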

English abstract

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.
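Conditional layer normalization replaces LayerNorm's fixed gain and bias with values predicted from a conditioning vector. A minimal sketch, assuming the condition is a concatenated accent-and-speaker embedding and that gain and bias come from plain linear maps; dimensions, weights, and function names are illustrative, not Joycent's.

```python
from math import sqrt
from typing import List, Sequence


def linear(W: Sequence[Sequence[float]], b: Sequence[float], x: Sequence[float]) -> List[float]:
    """y = W x + b for a small dense layer given as nested lists."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def conditional_layer_norm(h: Sequence[float], cond: Sequence[float],
                           Wg, bg, Wb, bb, eps: float = 1e-5) -> List[float]:
    """CLN(h | cond) = gamma(cond) * (h - mean) / sqrt(var + eps) + beta(cond)."""
    mu = sum(h) / len(h)
    var = sum((v - mu) ** 2 for v in h) / len(h)
    normed = [(v - mu) / sqrt(var + eps) for v in h]
    gamma = linear(Wg, bg, cond)  # per-channel gain predicted from the accent/speaker condition
    beta = linear(Wb, bb, cond)   # per-channel shift predicted from the same condition
    return [g * n + bt for g, n, bt in zip(gamma, normed, beta)]


if __name__ == "__main__":
    h = [1.0, 2.0, 3.0, 4.0]       # one hidden vector from the text encoder
    accent_speaker = [0.3, -0.7]   # hypothetical condition embedding
    Wg = [[0.1, 0.0]] * 4
    bg = [1.0] * 4
    Wb = [[0.0, 0.2]] * 4
    bb = [0.0] * 4
    print(conditional_layer_norm(h, accent_speaker, Wg, bg, Wb, bb))
```

With zero projection weights, unit gain bias, and zero shift bias, CLN reduces to ordinary LayerNorm, a common initialization choice so that conditioning starts as a small perturbation.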

2606.16326 2026-06-19 cs.GT cs.AI q-fin.RM New submission

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

Hao-Hsuan Chen

Affiliations: Hao-Hsuan Chen (何浩轩)

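AI summary: Extends the time-consistent actuarial runtime framework by making the operator strategic, characterizes a five-attack space for insurance contracts covering autonomous AI agents, and proves when the actuarial runtime is gaming-resistant, achieving incentive compatibility through new contract clauses.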

Comments 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

AI Chinese abstract

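Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterize a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces, post-toll safe-default selection and within-boundary action splitting, are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing the toll below the boundary potential applied to total exposure. Second, interface failures (such as invalid JSON) are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, whereas escalation fees reverse the incentive. We validate this interface-compliance theorem with cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model a weakly dominant strategy. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family satisfies operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.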

English abstract

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.
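Why common-control aggregation matters can be seen with a toy boundary potential. Assume, purely for illustration, a convex toll function $\phi$ with $\phi(0)=0$ (here $\phi(e)=e^2$); such a $\phi$ is superadditive on nonnegative exposures, so spreading one exposure across several boundaries lowers the summed toll. Summing exposures under common control before pricing restores the floor $\phi(\text{total exposure})$. The potential, the function names, and the numbers are hypothetical, not the paper's toll schedule.

```python
from typing import Dict, Iterable


def phi(exposure: float) -> float:
    """Hypothetical convex boundary potential with phi(0) = 0, hence superadditive."""
    return exposure ** 2


def split_toll(exposures: Iterable[float]) -> float:
    """Toll when each boundary is priced on its own share (the re-routing attack)."""
    return sum(phi(e) for e in exposures)


def aggregated_toll(by_controller: Dict[str, Iterable[float]]) -> float:
    """Toll when exposures under common control are summed before pricing."""
    return sum(phi(sum(es)) for es in by_controller.values())


if __name__ == "__main__":
    shares = [4, 3, 3]  # one operator spreads a total exposure of 10 over three boundaries
    print("split:", split_toll(shares))                          # 16 + 9 + 9
    print("aggregated:", aggregated_toll({"operator": shares}))  # phi(10)
```

The gap between the two numbers is the operator's gain from re-routing; aggregation by controller removes it because the toll no longer depends on how exposure is partitioned across boundaries.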

2606.16106 2026-06-19 cs.PF cs.AR cs.DC New submission

Edge-Inference Governors Need Memory-Clock State

Beyond CPU-GPU Frequency: The Impact of Memory Clock and Tail Effects on Edge-Inference Latency Estimation

Jaehoon Kang

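AI summary: Measurements on an NVIDIA Jetson Orin Nano show that the memory clock is a missing dimension, that aggregate miss rates hide burstiness, and that frequency switching incurs delays, all phenomena beyond the scope of conventional frequency-aware latency models.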

Comments 20 pages, 13 figures, 11 tables. Code and data: https://github.com/dankang21/jetson-latency-lab ; traces: https://doi.org/10.5281/zenodo.20745228

AI Chinese abstract

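Frequency-aware latency estimators enable deadline-aware DVFS for edge ML inference by modeling latency over CPU and GPU frequencies. We present a measurement study on the NVIDIA Jetson Orin Nano showing three phenomena outside this modeling scope. (1) The memory clock is a missing dimension: over a realistic upper EMC range (2133->3199 MHz), it shifts median latency by +11% to +48% depending on the workload, and at the highest GPU clock we observe a reproducible non-monotonic case (-9%) for a synthetic L2-resident kernel. A GPU-frequency estimator profiled under one power configuration and deployed under another therefore underestimates latency by up to 32%; tabulating four lockable EMC points fixes most workloads, whereas a parametric 1/f_emc term does not. (2) Aggregate miss rates hide burstiness: at fixed clocks, 100k-cycle runs show knife-edge distributions whose deadline-miss cliff spans about 1 ms, but misses cluster far beyond independence: at a 0.1% aggregate miss rate, the probability that the next cycle also misses is as high as 74% (740 times the independence baseline). Gaussian mu+3sigma bounds exceed the 0.1% miss target by 13x to 29x, whereas out-of-sample generalized Pareto bounds stay within ~2x across all eight configurations. (3) Frequency switching is not free: per-domain transition stalls are below 100 microseconds, but a new operating point takes 1/5/8 ms (CPU/GPU/EMC) to take effect, a large fraction of a typical inference period for a per-inference-cycle governor. We release the full measurement harness and discuss implications for next-generation frequency-aware estimators and governors.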

英文摘要

Frequency-aware latency estimators let deadline-aware DVFS governors schedule edge ML inference by modeling latency over CPU and GPU clocks, but they cannot observe the memory clock (EMC) -- a missing deployment state that decides whether a governor meets its deadlines and at what energy. We show this with a deployed, measured governor on a Jetson Orin NX: an EMC-blind GPU-only fit misses 25-28% of cycles at tight deadlines, whereas an EMC-aware refit holds misses to at most 1.3% under a 2% QoS miss budget by selecting a budget-feasible clock -- the energy-minimal one for periodic vision (calibrated module-rail power). The failure generalizes across three workload classes -- MobileNetV2, a ViT transformer, and Qwen2.5 LLM token decode (where saturated decode makes the aware policy lower-energy than the infeasible blind choice): a CPUxGPU estimator sends the deployed governor to an infeasible operating point, and only an EMC-aware model identifies the feasible side of the energy frontier. The effect is real and outside the CPUxGPU state abstraction: across two Orin SKUs sharing the same lockable EMC points it shifts median latency by up to ~45%, replicates on both, and survives a fused TensorRT fp16 engine. CPUxGPU models do not absorb it: per-lockable-point EMC tables are needed, a scoped inversion shows monotone assumptions can pick the wrong direction, and clustered misses make aggregate QoS rates understate deployment risk. We release the harness; this complements, not rebuts, the state of the art within its CPUxGPU scope.
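The claim that clustered misses make aggregate QoS rates understate deployment risk can be checked on any logged trace. Below is a minimal sketch, assuming the trace is a per-cycle list of booleans (True = deadline miss); the function name `miss_stats` is hypothetical and not part of the released harness. It compares the aggregate miss rate with the conditional probability that the cycle after a miss also misses. Under independence the two are equal, so their ratio (the lift) measures burstiness.

```python
from typing import Sequence


def miss_stats(misses: Sequence[bool]) -> tuple[float, float, float]:
    """Burstiness of deadline misses in a per-cycle trace.

    Returns (aggregate miss rate, P(miss at t+1 | miss at t), lift), where
    lift = conditional rate / aggregate rate. Independent misses give lift ~1.
    """
    n = len(misses)
    if n < 2:
        raise ValueError("need at least two cycles")
    rate = sum(misses) / n
    # Outcome of the cycle that follows each missed cycle (the last cycle has no successor).
    after_miss = [misses[t + 1] for t in range(n - 1) if misses[t]]
    if not after_miss:
        return rate, 0.0, 0.0
    cond = sum(after_miss) / len(after_miss)
    # after_miss is non-empty only if some cycle missed, so rate > 0 here.
    return rate, cond, cond / rate


# Hypothetical trace: 1000 cycles with a single burst of four consecutive misses.
trace = [False] * 1000
for t in range(100, 104):
    trace[t] = True
print(miss_stats(trace))  # rate 0.004, conditional 0.75, lift ~187.5
```

With the figures reported for the Orin Nano (a 0.1% aggregate rate and a 74% conditional rate), the lift is 0.74 / 0.001 = 740.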

2606.16057 2026-06-19 cs.RO cs.SY eess.SP eess.SY New submission

A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation

Chinese title: A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation Method

Eric Levy, Soosan Beheshti

Affiliations: GitHub, arXiv

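AI summary: Using a Smart-Scheduled Hybrid EKF-FGO framework, the paper experimentally treats optimization scheduling as an independent design variable, studies its role in balancing estimation accuracy against computational cost, and shows in planar SLAM simulations that scheduling strongly affects pre-optimization drift, transient error, and runtime.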

Comments This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore

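AI Chinese abstract

In robotics and control, reliable state estimation requires balancing estimation accuracy against computational cost. Filtering-based methods such as the extended Kalman filter (EKF) provide efficient real-time updates, and optimization-based formulations using factor graphs improve global consistency, but the role of optimization scheduling is usually treated implicitly rather than studied as an explicit design variable. This paper presents an experimental study that explicitly isolates optimization scheduling, using a Smart-Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. The framework combines EKF-based state propagation with periodically invoked batch optimization while holding the solver structure and computational effort fixed; the main contribution is the experimental characterization of optimization scheduling as an independent design variable that governs the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly affects pre-optimization drift, transient error behavior, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimization can be retained at a fraction of the computational cost, highlighting optimization scheduling as an under-explored yet critical consideration in hybrid state estimation systems.

English abstract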

Reliable state estimation in robotics and control requires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation-based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart-Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. The framework combines EKF-based state propagation with periodically invoked batch optimisation while holding solver structure and effort fixed; the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre-optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.
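To make the scheduling trade-off concrete, here is a deliberately simple toy model, not the paper's SSH framework. It assumes (hypothetically) that the filter's variance grows by a fixed process-noise increment `q` each step and that every batch optimization resets it to a floor `p0`. Running the optimizer every `period` steps then trades cost (number of calls) against mean and peak pre-optimization uncertainty.

```python
from dataclasses import dataclass


@dataclass
class ScheduleResult:
    period: int       # batch optimization runs every `period` steps
    calls: int        # number of optimizer invocations (cost proxy)
    mean_var: float   # time-averaged filter variance (drift proxy)
    peak_var: float   # largest variance reached just before an optimization


def simulate_schedule(steps: int, period: int, q: float, p0: float) -> ScheduleResult:
    """Toy model: filter variance grows by q per step; each optimization resets it to p0."""
    if period < 1:
        raise ValueError("period must be >= 1")
    var, total, peak, calls = p0, 0.0, p0, 0
    for t in range(1, steps + 1):
        var += q                  # EKF-style propagation inflates uncertainty
        peak = max(peak, var)
        if t % period == 0:       # scheduled batch optimization
            calls += 1
            var = p0              # assumption: optimization restores the variance floor
        total += var
    return ScheduleResult(period, calls, total / steps, peak)


for k in (1, 5, 20, 100):
    r = simulate_schedule(steps=1000, period=k, q=0.01, p0=0.1)
    print(f"period={k:>3}  calls={r.calls:>4}  mean_var={r.mean_var:.3f}  peak_var={r.peak_var:.3f}")
```

When `steps` is divisible by `period`, the mean variance in this toy is `p0 + q * (period - 1) / 2` and the cost is `steps / period` calls. The paper studies the trade-off empirically in planar SLAM rather than through a closed form like this one.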

2606.15966 2026-06-19 cs.CV cs.GR New submission

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

Chinese title: VEPHand: Large-Scale View-Efficient Photometric Hand Performance Capture

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

Affiliations: Google XR

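AI summary: Proposes an end-to-end pipeline for dynamic hand capture and registration with a limited number of views (about 20), using a mask-free neural method and a physics-inspired framework to resolve geometric ambiguity and self-contact deformation, and validates high-fidelity reconstruction and registration on more than 12,000 sequences.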

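AI Chinese abstract

Robust, high-fidelity 3D hand capture is fundamental to digital human creation, but it remains challenging in practical multi-view systems, which must balance rich photometric information against the geometric ambiguities of reconstruction caused by limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, designed specifically for view-efficient setups (about 20 views). We address the key challenges with two main innovations. First, to overcome reconstruction difficulties such as limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images through scene parameterization and scene-specific density regularization. Second, to address registration challenges such as accurately capturing non-linear skin deformation and ensuring plausible results under severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within the model's canonical tetrahedral mesh together with the pose parameters. Supported by robust losses and optimization, the method captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and tolerates input noise well. We demonstrate the scalability and robustness of the automated pipeline on a large-scale dataset of more than 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This demonstrates the method's effectiveness for single hands, complex two-hand interactions, and natural hand-object manipulation. Our method achieves state-of-the-art reconstruction fidelity and highly accurate registration in view-efficient, mask-free settings. Project page: https://zyshen021.github.io/VEPHand/.

English abstract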

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at https://vephand.github.io/.

2606.15908 2026-06-19 cs.CV New submission

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

Chinese title: High-Fidelity 4D Hand-Object Capture Based on Multi-View Spatiotemporal Tracking and Physics-Aware Gaussian Models

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

Affiliations: Google XR; University of Science and Technology of China (USTC); The Hong Kong University of Science and Technology (Guangzhou)

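AI summary: Proposes a template-free, marker-free multi-view system that combines a transformer initialization built from cross-view geometric and temporal cues with physics-aware Gaussian optimization, achieving robust, artifact-free 4D hand-object interaction reconstruction.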

Comments Project page: https://hostpg.github.io/

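AI Chinese abstract

Demand for high-fidelity 4D hand-object interaction (HOI) data is growing in embodied AI and spatial computing, but it is currently limited by reliance on pre-scanned object templates and physical markers. Although recent methods have achieved promising results in reconstructing 4D hand-object interactions from video, they are highly sensitive to the initial estimates of hand and object poses. Estimating these poses from images is challenging, however, especially under the severe occlusion inherent in hand-object interaction scenarios. We propose a novel system for robust and accurate reconstruction of hands and objects from synchronized, calibrated multi-view videos without any templates or markers. Our system contains two main innovative components: (1) a multi-view feed-forward transformer model that aggregates cross-view geometric and temporal cues to provide a reliable, metric-consistent initialization for poses and dense object geometry; and (2) a hand-object physics-aware Gaussian optimization framework that refines the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstructions. Validation on public benchmarks and an extensive internal dataset shows that our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is at https://zyshen021.github.io/HOSTPG/.

English abstract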

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

2606.15862 2026-06-19 cs.AI New submission

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

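Chinese title: RetailBench: Evaluating the Long-Horizon Reasoning and Coherent Decision-Making of LLM Agents in Realistic Retail Environments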

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

Affiliations: Ant Group; City University of Hong Kong

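AI summary: Proposes the RetailBench benchmark, which simulates single-store supermarket operation to evaluate LLM agents on long-horizon decision-making; most models fail to survive the full horizon and fall well short of the optimal policy.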

Comments This is the second version of my paper [see arXiv:2603.16453v2]

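AI Chinese abstract

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, but their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-driven simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation period and compare them with a privileged optimal policy. The results show substantial differences across models: only a small subset survives the full evaluation period, and even the strongest LLM runs lag far behind the optimal policy in final net worth and sales. Behavioral analysis attributes these gaps to incomplete evidence acquisition, superficial decision-making, and the lack of a consistent long-horizon strategy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

English abstract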

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.15843 2026-06-19 math.PR cs.NA math.NA New submission

Long-time Behaviour of DLRA for SDEs

Chinese title: Exponential Convergence of Dynamical Low-Rank Approximation for Stochastic Differential Equations

Jianhai Bao, Haitao Wang, Yue Wu

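AI summary: Studies dynamical orthogonal approximations of stochastic differential equations, proves well-posedness of the strong DO system, and analyzes the existence of invariant probability measures, providing a rigorous foundation for low-rank approximation of long-time statistical properties.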

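AI Chinese abstract

We study dynamical orthogonal (DO) approximations of stochastic differential equations and examine their long-time behavior. The DO formulation represents the solution through a low-rank decomposition, leading to a coupled system consisting of an evolution equation on the Stiefel manifold and a reduced stochastic process. We establish the well-posedness of the strong DO system and derive quantitative error estimates, in the Wasserstein distance, between the original stochastic differential equation and its low-rank approximation. Our main contribution is the analysis of invariant probability measures for the DO dynamics. Under suitable dissipativity, Lipschitz continuity, and non-degeneracy assumptions on the coefficients, we prove that the strong DO system admits an invariant probability measure. The proof combines uniform moment estimates, a Krylov--Bogoliubov argument for an associated frozen system, and the Kakutani-Fan-Glicksberg fixed-point theorem to recover the self-consistent dynamics. We further prove that the induced low-rank process admits an invariant probability measure and discuss the structure of the invariant measures through several illustrative examples. These results provide a rigorous foundation for using dynamical low-rank approximations to approximate the long-time statistical properties of stochastic dynamical systems.

English abstract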

We study dynamical orthogonal (DO) approximations of stochastic differential equations and investigate their long-time behaviour. The DO formulation represents the solution by a low-rank decomposition and leads to a coupled system consisting of an evolution equation on the Stiefel manifold and a reduced stochastic process. We establish the well-posedness of the strong DO system and derive quantitative error estimates between the original stochastic differential equation and its low-rank approximation in the Wasserstein distance. Our main contribution is the analysis of invariant probability measures for the DO dynamics. Under suitable dissipativity, Lipschitz continuity, and non-degeneracy assumptions on the coefficients, we prove the existence of an invariant probability measure for the strong DO system. The proof combines uniform moment estimates, a Krylov--Bogoliubov argument for an associated frozen system, and a Kakutani-Fan-Glicksberg fixed-point theorem to recover the self-consistent dynamics. We further show that the induced low-rank process admits an invariant probability measure and discuss the structure of invariant measures through several illustrative examples. These results provide a rigorous foundation for the use of dynamical low-rank approximations in the approximation of long-time statistical properties of stochastic dynamical systems.
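The long-time results concern invariant probability measures, the objects that govern long-time statistics. As an illustration of that notion only (the sketch does not implement the DO system or its Stiefel-manifold evolution), the code below simulates the scalar Ornstein–Uhlenbeck SDE $dX_t = -\theta X_t\,dt + \sigma\,dW_t$, which is dissipative, Lipschitz, and non-degenerate and has the Gaussian invariant measure $\mathcal{N}(0, \sigma^2/(2\theta))$. It uses Euler–Maruyama with the standard-library RNG and compares the ensemble variance at a large time with $\sigma^2/(2\theta)$. The discrete chain's own stationary variance is $\sigma^2/(\theta(2-\theta\,\Delta t))$, slightly above the continuous value.

```python
import math
import random


def ou_terminal_samples(theta: float, sigma: float, t_end: float, dt: float,
                        n_paths: int, seed: int = 0) -> list[float]:
    """Euler-Maruyama for dX = -theta * X dt + sigma dW from X_0 = 0; returns X at t_end."""
    rng = random.Random(seed)
    n_steps = int(round(t_end / dt))
    noise_scale = sigma * math.sqrt(dt)
    samples = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            x += -theta * x * dt + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def sample_variance(xs: list[float]) -> float:
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


theta, sigma = 1.0, 0.8
xs = ou_terminal_samples(theta, sigma, t_end=5.0, dt=0.01, n_paths=1000)
print(sample_variance(xs), sigma**2 / (2 * theta))  # empirical vs. invariant variance 0.32
```

Starting from $X_0 = 0$, the variance approaches its invariant value exponentially fast (roughly like $1 - e^{-2\theta t}$), so `t_end = 5` is already effectively stationary for $\theta = 1$.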

2606.15832 2026-06-19 cs.LG math.OC New submission

SILAGE: Memory-Efficient, Full-Gradient-Free Nonconvex Optimization for Nested Finite Sums

Chinese title: SILAGE: Memory-Efficient Nonconvex Optimization for Nested Finite Sums Without Full Gradients

Igor Sokolov, Laurent Condat, Peter Richtárik

Affiliations: Center of Excellence for Generative AI, King Abdullah University of Science and Technology (KAUST)

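AI summary: For nonconvex optimization over the nested double finite-sum structure of large-scale data, proposes SILAGE, which exploits the double-sum structure to avoid global full-gradient refreshes, needs only O(n) memory, and comes with a convergence analysis that adapts to across-group and within-group heterogeneity.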

Comments 81 pages, 3 algorithms, 4 theorems, 2 corollaries, 11 lemmas, 2 figures, 12 tables

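AI Chinese abstract

Empirical risk minimization on large-scale datasets naturally exhibits a nested double finite-sum structure, in which $N=nm$ total samples are logically or physically partitioned into $n$ blocks of size $m$ (e.g., in pooled data silos, out-of-core learning, or deliberate stratification). Although variance-reduction methods achieve optimal oracle complexity for nonconvex objectives, they face severe scaling bottlenecks in this centralized setting. Recursive estimators such as PAGE require periodic global full-gradient refreshes over all $nm$ samples, which are computationally expensive. Conversely, single-loop methods such as SILVER avoid such refreshes but need an impractical $\mathcal{O}(nm)$ memory to store a control variate for every sample. In this paper, we propose SILAGE, a variance-reduced algorithm that resolves this trade-off. By actively exploiting the double-sum structure, SILAGE eliminates periodic global full-gradient refreshes over all $nm$ components (evaluating at most one local group gradient per iteration) while requiring only $\mathcal{O}(n)$ memory. In addition, we provide a rigorous convergence analysis that avoids pessimistic worst-case Lipschitz constants. Instead, SILAGE's complexity naturally adapts to the underlying data geometry through nested functional similarities (across-group heterogeneity $δ_1$ and within-group heterogeneity $δ_2$). Our results improve on the existing state-of-the-art bounds in several practically relevant regimes.

English abstract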

Empirical risk minimization on massive datasets naturally exhibits a nested double finite-sum structure, where $N=nm$ total samples are logically or physically partitioned into $n$ blocks of size $m$ (e.g., in pooled data silos, out-of-core learning, or deliberate stratification). While variance-reduced methods achieve optimal oracle complexities for nonconvex objectives, they suffer from severe scaling bottlenecks in this centralized regime. Recursive estimators, such as PAGE, require periodic global full-gradient refreshes over all $nm$ samples, which are computationally expensive. Conversely, single-loop methods, such as SILVER, avoid such refreshes but require an impractical $\mathcal{O}(nm)$ memory footprint to store a control variate for every sample. In this paper, we propose SILAGE, a variance-reduced algorithm that addresses this trade-off. By actively exploiting the double-sum structure, SILAGE eliminates periodic global full-gradient refreshes over all $nm$ components (evaluating at most one local group gradient per iteration) while requiring only $\mathcal{O}(n)$ memory. Furthermore, we provide a tight convergence analysis that avoids pessimistic worst-case Lipschitz constants. Instead, SILAGE's complexity natively adapts to the underlying data geometry via nested functional similarities: across-group ($δ_1$) and within-group ($δ_2$) heterogeneity. Our results improve existing state-of-the-art bounds in several practically relevant regimes.
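To illustrate why one stored control variate per group gives $\mathcal{O}(n)$ memory, here is a sketch of a group-level SAGA-style estimator. This is a well-known technique swapped in for illustration, not SILAGE itself, whose update rule the abstract does not state. It assumes equal-size groups and toy quadratic components $f_{ij}(x) = \tfrac12 (x - a_{ij})^2$, so each group gradient has the closed form $x - \bar a_i$ (the group means are precomputed once only because of that shortcut). Each iteration evaluates exactly one local group gradient and never refreshes the full gradient.

```python
import random


def group_saga(groups: list[list[float]], steps: int, lr: float, seed: int = 0) -> float:
    """Minimize (1/n) sum_i (1/m) sum_j 0.5 * (x - a_ij)^2 with one control variate per group.

    Assumes equal-size groups. Memory for the control variates is n floats, not n*m.
    """
    rng = random.Random(seed)
    n = len(groups)
    centers = [sum(g) / len(g) for g in groups]  # gradient of group i at x is x - centers[i]

    def group_grad(i: int, x: float) -> float:
        return x - centers[i]

    x = 0.0
    table = [0.0] * n      # stored group gradients; any start keeps the estimator unbiased
    table_mean = 0.0       # running mean of `table`
    for _ in range(steps):
        i = rng.randrange(n)
        g_new = group_grad(i, x)                 # the only gradient evaluated this step
        v = g_new - table[i] + table_mean        # SAGA-style variance-reduced estimate
        x -= lr * v
        table_mean += (g_new - table[i]) / n     # keep the mean in sync in O(1)
        table[i] = g_new
    return x
```

The estimate `v` is unbiased because, over the random group choice, `g_new - table[i]` averages to the full gradient minus `table_mean`. The variance vanishes as the stored gradients catch up with the iterate, which is what allows a constant step size.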

2606.15761 2026-06-19 math.CO cs.DM New submission

Sharp bounds between the saturation number and the harmonic index

Chinese title: The Saturation Number Is Not Bounded by the Harmonic Index

Chakshu Gupta

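AI summary: Using a counterexample and a family of generalized windmill graphs, the paper shows that the ratio of the saturation number μ* to the harmonic index H can be arbitrarily large, refuting TxGraffiti's conjecture that μ*(G) ≤ H(G).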

Comments 10 pages, 4 figures. Studies Conjecture 4 of arXiv:2507.17780 (a TxGraffiti conjecture, μ^*(G)<=H(G), first refuted by T. Bıyıkoğlu, MATCH Commun. Math. Comput. Chem. 96 (2026) 1097-1099; this paper gives the order-9 smallest counterexample and sharp two-sided bounds between the saturation number μ^* and the harmonic index H). Code: https://github.com/ChakshuGupta13/lab

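AI Chinese abstract

TxGraffiti conjectured in 2023 that every nontrivial connected graph $G$ satisfies $μ^*(G) \le H(G)$, where $μ^*(G)$ is the saturation number and $H(G)$ is the harmonic index. The conjecture is false: the friendship graph $F_4$ satisfies $μ^*(F_4) = 4 > 18/5 = H(F_4)$, and exhaustive enumeration confirms that nine vertices is the smallest order at which a counterexample occurs. A family of generalized windmill graphs shows that the ratio $μ^*/H$ can be arbitrarily large. The conjecture holds for every graph in which all vertices have the same degree, in which case $H(G) = n/2$.

English abstract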

The saturation number $μ^*(G)$ of a graph $G$ is the minimum cardinality of a maximal matching, and $H(G)$ is its harmonic index. TxGraffiti conjectured in 2023 that $μ^*(G) \le H(G)$ for every nontrivial connected graph $G$, and Bıyıkoğlu refuted this by showing that the ratio $μ^*(G)/H(G)$ can be made arbitrarily large. Restricting to trees bounds the ratio sharply. Every nontrivial tree $T$ satisfies $μ^*(T) < \frac{3}{2} H(T)$, with the constant $3/2$ best possible. A complementary bound $H(G) < 4μ^*(G)$ holds for every graph with an edge, so on a nontrivial tree the saturation number is pinned to $\frac{1}{4} H(T) < μ^*(T) < \frac{3}{2} H(T)$, both constants best possible. The friendship graph $F_4$ is a smallest counterexample to the conjecture, on nine vertices, and the smallest tree counterexample is the subdivided star on eleven vertices. For each positive integer $m$ a family of graphs with $m$ hubs has ratio approaching $m+1$, while the conjecture holds whenever all vertices have equal degree. Both invariants arise in applications, the harmonic index as a molecular descriptor and the saturation number as a measure of adsorption inefficiency, and the bounds estimate the latter, which is NP-hard to compute, by the former, which is computable in linear time.
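The friendship-graph counterexample is small enough to check by brute force. Below is a minimal sketch (exponential time, so only for tiny graphs; the helper names are illustrative and unrelated to the linked repository). It computes $μ^*$ as the minimum size of a maximal matching and computes $H(G) = \sum_{uv \in E} 2/(d(u)+d(v))$ exactly with fractions. It then evaluates $F_4$, four triangles sharing a hub.

```python
from fractions import Fraction
from itertools import combinations

Edge = tuple[int, int]


def harmonic_index(n: int, edges: list[Edge]) -> Fraction:
    """H(G) = sum over edges uv of 2 / (deg(u) + deg(v)), computed exactly."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum((Fraction(2, deg[u] + deg[v]) for u, v in edges), Fraction(0))


def is_maximal_matching(edges: list[Edge], chosen: tuple[Edge, ...]) -> bool:
    covered: set[int] = set()
    for u, v in chosen:
        if u in covered or v in covered:
            return False  # two chosen edges share a vertex: not a matching
        covered.update((u, v))
    # Maximal: every edge of G already touches a covered vertex.
    return all(u in covered or v in covered for u, v in edges)


def saturation_number(edges: list[Edge]) -> int:
    """Minimum size of a maximal matching, by exhaustive search over edge subsets."""
    for k in range(len(edges) + 1):
        if any(is_maximal_matching(edges, c) for c in combinations(edges, k)):
            return k
    raise AssertionError("unreachable: a maximal matching always exists")


def friendship_graph(t: int) -> tuple[int, list[Edge]]:
    """F_t: t triangles sharing hub vertex 0, on 2t + 1 vertices."""
    edges: list[Edge] = []
    for i in range(t):
        a, b = 2 * i + 1, 2 * i + 2
        edges += [(0, a), (0, b), (a, b)]
    return 2 * t + 1, edges


n, edges = friendship_graph(4)
print(saturation_number(edges), harmonic_index(n, edges))  # 4 18/5
```

The value 18/5 splits as eight hub edges contributing 2/(8+2) each and four outer edges contributing 2/(2+2) each, that is 8/5 + 2.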

2606.15648 2026-06-19 cs.CV New submission

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

Chinese title: Underwater Image Enhancement Fusing Transferred Priors and Physics-Based Decomposition

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

Affiliations: The Hong Kong Polytechnic University

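AI summary: Proposes a transfer-learning method that needs no paired labels: it decomposes underwater image enhancement into global color correction, haze removal, and background noise suppression, and supervises each step with cross-domain priors to achieve physically consistent enhancement.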

Journal ref
Information Fusion (2026): 104557
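AI Chinese abstract

Underwater images are captured under diverse water conditions, which causes complex degradation, including color bias, low contrast, and blur. Learning-based methods have recently shown promise for underwater image enhancement (UIE). However, most previous work focuses on training strategies or network designs that align the enhanced results with dataset labels, overlooking that these labels were selected from the outputs of earlier UIE methods and are therefore noisy pseudo-labels. As a result, the performance of these models is unsatisfactory to some extent. Yet collecting true labels for underwater images is challenging. In this work, we propose a transfer-learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first decomposed, following underwater physics, into global color correction, haze removal, and background noise suppression. Multiple priors from other vision tasks are then used as cross-domain supervision for each step. In this way, a novel UIE is achieved through transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments show that our method, based on fusing physics and priors, achieves SOTA performance on the UIE task and effectively improves downstream vision tasks, significantly outperforming the benchmark methods. Project repository: https://github.com/Haru2022/P2-UIE.

English abstract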

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on the training strategy or network design to align the enhanced results with the dataset labels, ignoring that these labels were selected from the enhanced results of earlier UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of these models remains unsatisfactory to some extent. Yet collecting true labels for underwater images is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression following underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE approach is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our method, based on the fusion of physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.
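As an illustration of what a global color-correction step can look like, here is the classic gray-world white balance. This is a swapped-in textbook technique, not the paper's transferred-prior method. It assumes the scene's average color should be neutral gray and rescales each channel so that its mean matches the mean over all three channels. Pixels are plain (r, g, b) floats in [0, 1] to keep the sketch dependency-free.

```python
Pixel = tuple[float, float, float]


def gray_world(pixels: list[Pixel]) -> list[Pixel]:
    """Gray-world white balance: scale each channel so its mean equals the gray mean."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    gray = sum(means) / 3
    # A channel with zero mean carries no color information; leave it unscaled.
    gains = [gray / m if m > 0 else 1.0 for m in means]
    # Apply per-channel gains, clipping to the valid range [0, 1].
    return [tuple(min(1.0, p[c] * gains[c]) for c in range(3)) for p in pixels]


# A hypothetical blue-green cast typical of underwater scenes: red is strongly attenuated.
print(gray_world([(0.1, 0.4, 0.5), (0.2, 0.5, 0.6)]))
```

Gray-world is purely global, so on its own it does not address the depth-dependent haze or background noise that the decomposition treats as separate steps.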

2606.15516 2026-06-19 cs.RO New submission

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

Chinese title: Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

Soofiyan Atar, Yao-Ting Huang, Michael Yip

Affiliations: University of California San Diego

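AI summary: Proposes a cross-embodiment force-position interface that uses calibrated joint torques and fingertip forces to enable contact-aware grasping across heterogeneous dexterous hands, and combines a flow-matching visuomotor policy with a hybrid force-position controller to achieve transferable compliant grasping.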

Comments Website (overview): transferring-contact-not-just-motion.github.io/

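AI Chinese abstract

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires the fingers to maintain appropriate object loading when contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, which limits transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent space, while each hand's force signal is calibrated through system identification into physical joint torques (in N·m). These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent between training and deployment. Experiments on structurally different hands show that calibrated contact feedback enables transferable compliant grasping, and the learned primitives can be reused in long-horizon manipulation pipelines.

English abstract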

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N·m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.
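The torque-to-fingertip-force step can be illustrated with the standard static relation $\tau = J^\top F$. Below is a minimal sketch that assumes a planar two-link finger with known link lengths, a hypothetical stand-in for a real hand model; the paper's calibration and per-finger descriptors are not reproduced. It recovers the planar fingertip force from two calibrated joint torques by solving the 2×2 system $J^\top F = \tau$.

```python
import math


def planar_jacobian(q1: float, q2: float, l1: float, l2: float) -> list[list[float]]:
    """Jacobian of the fingertip position (x, y) of a planar 2-link chain w.r.t. (q1, q2)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ]


def fingertip_force(tau: tuple[float, float], q1: float, q2: float,
                    l1: float, l2: float) -> tuple[float, float]:
    """Solve J^T F = tau for the planar fingertip force F = (Fx, Fy).

    Units: torques in N·m and lengths in m give forces in N.
    """
    J = planar_jacobian(q1, q2, l1, l2)
    # J^T written as [[a, b], [c, d]].
    a, b = J[0][0], J[1][0]
    c, d = J[0][1], J[1][1]
    det = a * d - b * c  # equals l1 * l2 * sin(q2)
    if abs(det) < 1e-9:  # illustrative singularity threshold
        raise ValueError("singular configuration (finger fully extended or folded)")
    fx = (d * tau[0] - b * tau[1]) / det
    fy = (-c * tau[0] + a * tau[1]) / det
    return fx, fy
```

Near $q_2 = 0$ the solve becomes ill-conditioned, which is one reason a real system would typically map torques to forces only where the Jacobian is well conditioned.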

2606.15197 2026-06-19 cs.LG cs.AI New submission

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Chinese title: StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

Affiliations: School of Computer Science and Engineering, Southeast University; Northwest A&F University

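AI summary: Proposes the StarOR framework, which combines Monte Carlo tree search with test-time reinforcement learning: a four-stage decomposition and GRPO updates to a LoRA adapter optimize intermediate decisions using unsupervised fine-grained rewards, reaching state-of-the-art performance on 5 benchmarks with a 4B model.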

Comments 41 pages, v1, preprint

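AI Chinese abstract

Optimization modeling is inherently hierarchical and requires a precise sequence of symbolic commitments. Traditional learning-based approaches to automated optimization modeling improve modeling policies through large-scale annotated or curated training data, but adapting them to new problem distributions is costly. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods usually rely on a fixed policy, so repeated rollouts inherit similar modeling biases and provide limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with test-time reinforcement learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated sibling nodes as local comparison sets, StarOR turns search-time exploration into instance-specific policy refinement. In addition, an unsupervised multi-faceted reward system provides fine-grained feedback on intermediate formulation decisions without ground-truth labels. Experiments on five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and frontier LLMs.

English abstract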

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and frontier LLMs.
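GRPO's core step is a group-relative advantage: each sample's reward is normalized by the mean and standard deviation of its group, so no learned value function is needed. Below is a minimal sketch that assumes the MCTS siblings under one node form the group and that each sibling already has a scalar reward. The population standard deviation and the `eps` guard are common choices assumed here, and the reward values are made up for illustration.

```python
import math


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: (r_i - mean) / (std + eps), computed within one group."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))  # population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sibling formulations expanded from the same MCTS node.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```

Siblings that score above the group mean get positive advantages and are reinforced. When all siblings receive the same reward, every advantage is zero, so that group contributes no learning signal.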

2606.15015 2026-06-19 cs.CV cs.AI New submission

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Chinese title: NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

Affiliations: University of Oxford; University of Cambridge

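AI summary: Proposes NEXUS, a neural energy-field framework that models conservative and non-conservative dynamics through scalar energy and dissipation terms, improving long-horizon trajectory accuracy in contact-rich 3D scenes and guiding video generation.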

Comments 18 pages, 4 figures, 6 tables. Preprint

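AI Chinese abstract

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forces. Existing trajectory-based methods usually model isolated physical effects and struggle to combine conservative and non-conservative dynamics in contact-rich 3D scenes. We propose NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and builds dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are obtained by differentiating the energy and dissipation functions and are evolved with a multi-substep semi-implicit integrator. On controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines across different mechanical properties and combinations of physical effects. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

English abstract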

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
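The energy-plus-dissipation formulation can be sketched on a single particle. Assumptions: a hand-written (not learned) elastic energy $E(x) = \tfrac12 k x^2$ and Rayleigh dissipation $D(v) = \tfrac12 c v^2$, forces obtained by differentiating them ($F = -\partial E/\partial x - \partial D/\partial v$, here by central differences standing in for autodiff), and a semi-implicit Euler step with substeps. This illustrates the structure the abstract describes, not NEXUS's networks or contact graphs.

```python
from typing import Callable

Scalar = Callable[[float], float]


def derivative(f: Scalar, z: float, h: float = 1e-6) -> float:
    """Central finite difference, standing in for automatic differentiation."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


def rollout(energy: Scalar, dissipation: Scalar, x: float, v: float, mass: float,
            dt: float, steps: int, substeps: int = 4) -> tuple[float, float]:
    """Semi-implicit Euler with substeps: velocity is updated first, then position."""
    h = dt / substeps
    for _ in range(steps * substeps):
        # Conservative force from the energy, non-conservative force from dissipation.
        force = -derivative(energy, x) - derivative(dissipation, v)
        v += h * force / mass
        x += h * v  # uses the freshly updated velocity (semi-implicit)
    return x, v


K_SPRING, C_DAMP, MASS = 1.0, 0.5, 1.0


def spring_energy(x: float) -> float:
    return 0.5 * K_SPRING * x * x


def no_dissipation(v: float) -> float:
    return 0.0


def rayleigh(v: float) -> float:
    return 0.5 * C_DAMP * v * v


def mechanical_energy(x: float, v: float) -> float:
    return 0.5 * MASS * v * v + spring_energy(x)


print(mechanical_energy(*rollout(spring_energy, no_dissipation, 1.0, 0.0, MASS, 0.01, 2000)))
print(mechanical_energy(*rollout(spring_energy, rayleigh, 1.0, 0.0, MASS, 0.01, 2000)))
```

Without dissipation, the semi-implicit step keeps the energy close to its initial value of 0.5 over long rollouts. With the Rayleigh term, the energy decays, and adding further conservative effects (for example gravity) only means adding another term to `spring_energy`.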

2606.14957 2026-06-19 cs.CV New submission

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Chinese title: Learning a Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

Affiliations: New York University, Center for Data Science; NYU Grossman School of Medicine, Department of Radiology; State University of New York at Binghamton, School of Computing; NYU Grossman School of Medicine, Department of Neurology; NYU Grossman School of Medicine, Department of Neurosurgery; NYU Grossman School of Medicine, Department of Pathology; School of Medicine, Department of Radiology, Stanford; NYU Grossman School of Medicine, Department of Neuroscience; NYU Grossman School of Medicine, Neuroscience Institute

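AI summary: Proposes Neuro-JEPA, which combines a latent predictive objective with a Mixture-of-Experts architecture to learn unified representations of three MRI sequences (T1w, T2w, and FLAIR), and outperforms existing foundation models and a CNN baseline on 25 clinical tasks and 22 public-dataset tasks.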

Comments Under Review Preprint

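AI Chinese abstract

Brain MRI is usually acquired as multiple complementary sequences with distinct contrast weightings, including T1-weighted (T1w) anatomical contrast and fluid-sensitive T2-weighted (T2w) contrast. However, methods for learning unified representations across multiple MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across the core T1w, T2w, and fluid-suppressed FLAIR sequences. We further conduct a systematic methodological study of the architecture, masking, objective, and sparsity design choices that benefit robust multimodal neuroimaging representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies, after modality-specific preprocessing and data curation across the three core structural brain MRI sequences. We evaluated the learned representations in clinical and research settings, including 25 tasks from three health systems (NYU Langone, NYU Long Island, and Massachusetts General Hospital) and 22 tasks from 12 public datasets, covering unimodal, multimodal, and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance in all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation-model evaluation protocols that include simple baselines, clinically heterogeneous cohorts, and controlled multimodal comparisons.

English abstract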

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems (NYU Langone, NYU Long Island, and Massachusetts General Hospital) and 22 tasks from 12 public datasets, covering unimodal, multimodal, and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts, and controlled multimodal comparisons.
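The sparse Mixture-of-Experts idea can be sketched with generic top-k gating: take a softmax over gate logits, keep the k largest weights, renormalize them, and evaluate only the chosen experts. This is the standard construction, assumed here for illustration; the abstract does not specify Neuro-JEPA's router, and the toy experts below are hypothetical scalar functions rather than MRI encoders.

```python
import math
from typing import Callable


def top_k_gate(logits: list[float], k: int) -> dict[int, float]:
    """Return {expert_index: weight} for the k highest-scoring experts (weights sum to 1)."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    mx = max(logits)
    probs = [math.exp(z - mx) for z in logits]  # numerically stable softmax
    total = sum(probs)
    probs = [p / total for p in probs]
    chosen = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x: float, logits: list[float],
                experts: list[Callable[[float], float]], k: int = 2) -> float:
    """Sparse forward pass: only the selected experts are evaluated."""
    gates = top_k_gate(logits, k)
    return sum(w * experts[i](x) for i, w in gates.items())
```

Renormalizing over the selected experts is equivalent to a softmax restricted to their logits. The compute saving comes from never calling the unselected experts.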

2606.14784 2026-06-19 cs.SD cs.LG eess.AS New submission

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

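Chinese title: LLM-Based Synthetic Ground-Truth Label Generation for Audio-Based Emotion Classification via In-Context Learning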

Qing Huang, Pooja Pol, Jianing Zhang

Affiliations: School of Business, Technical University of Applied Sciences Augsburg; Data Science und Autonome Systeme Technologietransferzentrum (TTZ)

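AI summary: Proposes using large language models (LLMs) and in-context learning (ICL) to automatically generate emotion-related synthetic ground-truth labels from streaming speech data in multi-user VR environments, addressing the difficulty of labeling team collaboration states.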

Comments https://icaiit.org/paper.php?paper=14th_ICAIIT_2/3_9

Abstract

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.
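The retrieval step can be pictured as nearest-neighbour search over acoustic feature vectors followed by prompt assembly. The sketch below is a minimal illustration under stated assumptions: each candidate demonstration carries a precomputed acoustic feature vector, a transcript, and an emotion label; similarity is cosine similarity; and the prompt is plain text. The feature values, labels, and prompt template are hypothetical, and the paper's workflow also supplies audio to the model, which this text-only sketch omits.

```python
from math import sqrt
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_demos(query_feat: Sequence[float], pool: List[Dict], k: int) -> List[Dict]:
    """Return the k pool entries whose acoustic features are most similar to the query."""
    return sorted(pool, key=lambda d: cosine(query_feat, d["features"]), reverse=True)[:k]

def build_prompt(demos: List[Dict], query_transcript: str) -> str:
    """Assemble a few-shot prompt: labelled demonstrations, then the unlabelled query."""
    parts = ["Label the speaker's emotion for each utterance."]
    for d in demos:
        parts.append(f"Transcript: {d['transcript']}\nEmotion: {d['label']}")
    parts.append(f"Transcript: {query_transcript}\nEmotion:")
    return "\n\n".join(parts)

# Hypothetical demonstration pool with 2-D acoustic features (e.g., normalized pitch, energy).
pool = [
    {"features": [1.0, 0.0], "transcript": "We did it!", "label": "joy"},
    {"features": [0.0, 1.0], "transcript": "This is taking forever.", "label": "frustration"},
    {"features": [0.9, 0.1], "transcript": "Great, that worked.", "label": "joy"},
]
demos = select_demos([1.0, 0.0], pool, k=2)
print(build_prompt(demos, "Nice, next room."))
```

Because demonstrations are re-selected per query, the prompt adapts to each utterance without any parameter updates, which matches the abstract's motivation for ICL over fine-tuning.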

2606.14776 2026-06-19 cs.RO cs.LG New submission

Deep Learning-Based Lunar Crater Terrain Relative Navigation

Batu Candan, Simone Servadio

Affiliations: NASA; University of Texas at Austin

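AI summary: Proposes a terrain relative navigation algorithm that combines a deep-learning crater detector with an extended Kalman filter, reducing navigation error to a few hundred meters even when the initial position is off by up to 5 km.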

Abstract

Accurate position estimation is crucial for future lunar landings by autonomous vehicles, especially in hazardous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm that combines our deep-learning crater detector, designed specifically for the NASA Crater Detection Challenge problem, with an Extended Kalman Filter (EKF). The detector extracts crater features from monocular images acquired from orbit; matches between these craters and those in a global crater database are identified with a Hungarian assignment approach, followed by a consensus-based outlier removal method. The resulting measurements are then used to update an EKF that estimates the spacecraft pose in the Lunar-Centered Lunar-Fixed (LCLF) reference frame; augmenting the filter with altitude-aiding information constrains radial drift. Simulation results indicate that even when the spacecraft's estimated position is off from its actual location by up to 5 km, TRN recovers and reduces the navigation error to a few hundred meters. Note that maintaining crater feature correspondences requires matching the image resolution and the scales within the scene to the distribution of the detector's training set.
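The matching stage pairs detected crater centres with catalogue craters at minimum total cost and then discards pairs that disagree with the consensus. The sketch below illustrates that idea under simplifying assumptions and is not the authors' pipeline: craters are 2-D image-plane points, the cost is Euclidean distance, the optimal assignment is found by brute-force enumeration (which reaches the same optimum as the Hungarian method but is only practical for a handful of craters), and the consensus is a median translation with a hypothetical pixel tolerance `tol`.

```python
from itertools import permutations
from math import hypot
from statistics import median
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Match = Tuple[int, int]  # (detection index, catalogue index)

def optimal_assignment(dets: Sequence[Point], catalog: Sequence[Point]) -> List[Match]:
    """Minimum-total-distance matching of each detection to a distinct catalogue crater.
    Brute force over permutations: same optimum as the Hungarian method, exponential cost."""
    assert len(dets) <= len(catalog)
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(catalog)), len(dets)):
        cost = sum(hypot(dets[i][0] - catalog[j][0], dets[i][1] - catalog[j][1])
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(enumerate(best)) if best is not None else []

def consensus_filter(dets: Sequence[Point], catalog: Sequence[Point],
                     matches: List[Match], tol: float = 5.0) -> List[Match]:
    """Keep matches whose displacement agrees with the median displacement
    (a translation-only consensus; real pipelines use richer geometric models)."""
    if not matches:
        return []
    shifts = [(catalog[j][0] - dets[i][0], catalog[j][1] - dets[i][1]) for i, j in matches]
    mx = median(s[0] for s in shifts)
    my = median(s[1] for s in shifts)
    return [m for m, (dx, dy) in zip(matches, shifts) if hypot(dx - mx, dy - my) <= tol]

# Usage: three consistent detections plus one spurious detection.
detections = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (30.0, 5.0)]
catalogue = [(1.0, 1.0), (11.0, 1.0), (1.0, 11.0), (60.0, 60.0)]
pairs = optimal_assignment(detections, catalogue)
inliers = consensus_filter(detections, catalogue, pairs, tol=5.0)
print(inliers)  # [(0, 0), (1, 1), (2, 2)]
```

The assignment alone is forced to pair every detection, including the spurious one; the consensus step is what removes it before the surviving matches would be fed to a filter as measurements.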

2606.14510 2026-06-19 cs.LG q-bio.BM New submission

PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion

Junming Zhang, Siyu Yi, Wei Ju, Zhonghui Gu

Affiliations: College of Computer Science, Sichuan University; School of Mathematics, Sichuan University; School of Artificial Intelligence, Sichuan University; Lingang Laboratory

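AI summary: Proposes PepALD, which combines autoregressive latent diffusion with chemical embeddings for de novo macrocyclic peptide design and uses preference optimization to improve affinity; it outperforms baselines in generation quality and reward optimization.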

Comments: 18 pages, 5 figures, 3 tables

Abstract

Macrocyclic peptides are promising therapeutic candidates for intracellular targets, but their design requires simultaneous control over non-natural monomer chemistry, ring topology, membrane permeability, and target binding. Existing SMILES- or HELM-string generative models either operate in long atom-level sequence spaces or treat monomers as symbolic tokens with limited chemical grounding. We introduce PepALD, an Autoregressive Latent Diffusion (ALD) foundation model for de novo macrocyclic peptide generation. The model represents HELM monomers with structured chemical embeddings, generates each residue through context-conditioned diffusion in a chemically informed latent space, predicts R-group-aware ring closures during autoregressive generation, and aligns the denoiser to affinity rewards using winner-protected, diffusion-adapted preference optimization. In silico experiments show strong generation quality and reward-optimization performance for PepALD relative to representative peptide generation baselines.
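To make "context-conditioned diffusion in a latent space, one residue at a time" concrete, the sketch below runs a deterministic sampler for each residue and decodes the final latent to the nearest monomer embedding. Everything model-specific is a labelled assumption: the 2-D monomer embeddings, the `toy_denoiser` rule standing in for a trained network, and the linear noise-to-data interpolation path. It omits ring-closure prediction and preference optimization entirely and is not PepALD's architecture.

```python
import random
from math import dist
from typing import Dict, List, Sequence, Tuple

# Hypothetical 2-D "chemical" embeddings for four HELM-style monomers (illustration only).
MONOMER_EMB: Dict[str, Tuple[float, float]] = {
    "A": (0.0, 0.0), "G": (1.0, 0.0), "P": (0.0, 1.0), "meA": (1.0, 1.0),
}

def toy_denoiser(z: Sequence[float], context: List[str], t: int) -> Tuple[float, ...]:
    """Hypothetical stand-in for a trained network that predicts the clean latent x0.
    A real denoiser would use z, t and the context; this toy only uses the context length."""
    names = sorted(MONOMER_EMB)
    return MONOMER_EMB[names[len(context) % len(names)]]

def sample_latent(context: List[str], steps: int, rng: random.Random) -> List[float]:
    """Deterministic sampler on the path z_s = (1 - s) * x0 + s * noise, with s = t / steps."""
    z = [rng.gauss(0.0, 1.0) for _ in range(2)]  # start from pure noise (s = 1)
    for t in range(steps, 0, -1):
        x0_hat = toy_denoiser(z, context, t)
        # Exact step from s = t/steps to s = (t-1)/steps given the predicted x0.
        z = [zi + (xi - zi) / t for zi, xi in zip(z, x0_hat)]
    return z

def decode(z: Sequence[float]) -> str:
    """Map a latent back to the monomer with the nearest embedding."""
    return min(MONOMER_EMB, key=lambda name: dist(z, MONOMER_EMB[name]))

def generate(length: int, steps: int = 20, seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    seq: List[str] = []
    for _ in range(length):
        # Autoregressive: each residue's diffusion is conditioned on the residues so far.
        seq.append(decode(sample_latent(seq, steps, rng)))
    return seq

print(generate(3))  # ['A', 'G', 'P'] with this toy denoiser
```

Working in a continuous embedding space and decoding by nearest neighbour is one simple way to let chemically similar monomers sit close together; a trained model would replace `toy_denoiser` and learn that geometry.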

2606.14066 2026-06-19 cs.SE New submission

FastContext: Training Efficient Repository Explorer for Coding Agents

Shaoqiu Zhang, Maoquan Wang, Yuling Shi, Yuhang Wang, Xiaodong Gu, Yongqiang Yao, Tori Gong, Sheng Chen, Rao Fu, Anisha Agarwal, Spandan Garg, Gabriel Ryan, Colin Merkel, Yufan Huang, Shengyu Fu

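AI summary: Proposes FastContext, a dedicated exploration subagent that separates repository exploration from problem solving through parallel tool calls and focused context generation; on SWE-bench and related benchmarks it improves resolution rates by up to 5.5% and reduces coding-agent token consumption by up to 60%.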

Comments: 34 pages, 7 figures

Abstract

Large Language Model (LLM) coding agents have achieved strong results on software engineering tasks, yet repository exploration remains a major bottleneck: locating relevant code consumes a substantial share of the token budget and pollutes the agent's context with irrelevant snippets. In most agents, the same model both explores the repository and solves the task, leaving exploratory reads and searches in the solver's history. We present FastContext, a dedicated exploration subagent that separates repository exploration from solving. Invoked on demand, FastContext issues parallel tool calls and returns concise file paths and line ranges as focused context. FastContext is powered by specialized exploration models spanning 4B to 30B parameters. We bootstrap them from strong reference-model trajectories and refine them with task-grounded rewards for broad first-turn search, multi-turn evidence gathering, and precise citation generation. Across SWE-bench Multilingual, SWE-bench Pro, and SWE-QA, integrating FastContext into Mini-SWE-Agent improves end-to-end resolution rates by up to 5.5% while reducing coding-agent token consumption by up to 60%, with minimal overhead. These results show that repository exploration can be separated from solving and handled effectively by specialized models. Code and data: https://github.com/microsoft/fastcontext
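The subagent's contract (issue several searches in parallel, then return only file paths and line ranges) can be sketched with the Python standard library. In this hypothetical illustration, the "repository" is an in-memory mapping from path to file text, each regex search stands in for one tool call, the calls run concurrently on a thread pool, and nearby hits are merged into ranges using a context `window`. The function names and the `window` parameter are assumptions; FastContext itself uses trained exploration models to decide which searches to issue.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Hit = Tuple[str, int]         # (path, 1-indexed line number)
Range = Tuple[str, int, int]  # (path, first line, last line), inclusive

def search(repo: Dict[str, str], pattern: str) -> List[Hit]:
    """One 'tool call': regex search over every file, returning matching lines."""
    rx = re.compile(pattern)
    return [(path, n)
            for path, text in repo.items()
            for n, line in enumerate(text.splitlines(), start=1)
            if rx.search(line)]

def merge_ranges(repo: Dict[str, str], hits: List[Hit], window: int) -> List[Range]:
    """Pad each hit by `window` lines, clamp to the file, and merge overlapping/adjacent spans."""
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for path, n in hits:
        last = len(repo[path].splitlines())
        spans.setdefault(path, []).append((max(1, n - window), min(last, n + window)))
    merged: List[Range] = []
    for path in sorted(spans):
        items = sorted(spans[path])
        start, end = items[0]
        for s, e in items[1:]:
            if s <= end + 1:
                end = max(end, e)
            else:
                merged.append((path, start, end))
                start, end = s, e
        merged.append((path, start, end))
    return merged

def explore(repo: Dict[str, str], patterns: List[str], window: int = 1) -> List[Range]:
    """Run all searches concurrently, then return compact (path, start, end) ranges."""
    with ThreadPoolExecutor(max_workers=max(1, len(patterns))) as pool:
        results = list(pool.map(lambda p: search(repo, p), patterns))
    hits = [h for r in results for h in r]
    return merge_ranges(repo, hits, window)

repo = {
    "src/auth.py": "import os\n\ndef login(user):\n    return check(user)\n\ndef check(user):\n    return True\n",
    "README.md": "Usage: call login()\n",
}
print(explore(repo, ["def login", "def check"]))  # [('src/auth.py', 2, 7)]
```

Returning ranges rather than file contents keeps the solver's context small: the solver can read exactly those lines afterwards, while the exploratory searches never enter its history.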