arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02632 2026-06-03 stat.ML cs.AI cs.CY cs.LG econ.EM stat.AP

Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery

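Tyler H. McCormick

AI summary: This paper argues that modern machine learning is generically underdetermined in high-dimensional proxy regimes and proposes concrete standards for "mechanistic ML" so that LLM-centered workflows genuinely support science rather than simulate it.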

Comments
Will appear as a position paper in ICML

Abstract

Modern Machine Learning (ML) and Artificial Intelligence (AI) models, especially large language models (LLMs), are increasingly used to generate scientific hypotheses and mechanistic explanations from observational data. This position paper argues that in the high-dimensional proxy regimes where modern ML excels, mechanistic learning is generically underdetermined: many incompatible mechanisms induce essentially the same observational relationships on the support of the data, so predictive success and coherent explanations are insufficient evidence of mechanism discovery. This underdetermination becomes uniquely hazardous with LLMs, which tend to collapse large equivalence classes of explanations into a single fluent narrative. This paper proposes concrete standards for "mechanistic ML" and argues these norms are necessary if LLM-centered workflows are to support science rather than merely simulate it.
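The underdetermination claim can be illustrated with a textbook toy case that is not taken from the paper: two linear-Gaussian mechanisms, one where X causes Y and one where Y causes X, produce the same observational covariance yet disagree about an intervention. The slope and variances below are hypothetical values chosen so the two joint distributions coincide.

```python
from fractions import Fraction

def joint_moments(slope, cause_var, noise_var):
    """Return (var_cause, var_effect, cov) for effect = slope * cause + noise."""
    var_effect = slope ** 2 * cause_var + noise_var
    cov = slope * cause_var
    return cause_var, var_effect, cov

# Hypothetical parameters: unit-variance cause, slope 1/2, noise variance 3/4.
slope, cause_var, noise_var = Fraction(1, 2), Fraction(1), Fraction(3, 4)

# Mechanism A: X -> Y. Moments come back as (Var X, Var Y, Cov).
var_x_a, var_y_a, cov_a = joint_moments(slope, cause_var, noise_var)
# Mechanism B: Y -> X. Moments come back as (Var Y, Var X, Cov).
var_y_b, var_x_b, cov_b = joint_moments(slope, cause_var, noise_var)

observational_a = (var_x_a, var_y_a, cov_a)
observational_b = (var_x_b, var_y_b, cov_b)

# Interventional prediction E[Y | do(X = 1)] under each mechanism.
effect_a = slope * 1      # X drives Y, so Y shifts by the slope
effect_b = Fraction(0)    # X is downstream of Y, so Y does not move

print(observational_a == observational_b, effect_a, effect_b)
```

Both mechanisms are zero-mean Gaussian with identical second moments, so no amount of observational data separates them; only the interventional prediction differs.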

2606.02592 2026-06-03 stat.AP cs.AI

Tracking Urban Atmospheric Pollutants using Sentinel-5P Satellite Data

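Alice Gomez-Cantos, Henry O. Velesaca

AI summary: Proposes a framework based on Sentinel-5P/TROPOMI tropospheric column observations that uses distributional metrics (the median and high percentiles) and K-means clustering to characterize urban NO2 background levels and extremes at the canton scale across Guayas Province, Ecuador, providing an interpretable, scalable air-quality assessment tool for data-scarce regions.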

Abstract

Urban nitrogen dioxide ($NO_2$) is a key indicator of combustion-related air pollution and exhibits strong spatial and temporal variability in cities. This study presents a satellite-based framework for tracking urban $NO_2$ pollution using tropospheric column observations from Sentinel-5P/TROPOMI over Guayas Province, Ecuador. Rather than estimating surface concentrations, the methodology emphasizes robust distributional metrics, including the median and upper-tail percentiles ($P_{90}$, $P_{95}$, and $P_{99}$), to characterize background conditions and localized pollution extremes at the canton scale. Multi-year satellite observations are aggregated annually and analyzed using unsupervised K-means clustering to identify characteristic pollution regimes without predefined thresholds. Results show that highly urbanized cantons consistently exhibit elevated extreme $NO_2$ values and greater variability, while less urbanized areas display lower and more homogeneous patterns. The proposed approach provides an interpretable and scalable tool for urban air-quality assessment in data-scarce regions using satellite observations alone. The implementation is publicly available on GitHub at https://hvelesaca.github.io/sentinel-5P-clustering/.
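A minimal sketch of the pipeline described above, assuming each canton-year is summarized by its median plus P90, P95, and P99 and then clustered with plain K-means. The canton names and column values are made up for illustration, and this is not the authors' published implementation.

```python
import random
import statistics

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def canton_features(columns):
    """Median plus upper-tail percentiles of one canton-year of NO2 columns."""
    return [statistics.median(columns)] + [percentile(columns, q) for q in (90, 95, 99)]

def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical annual NO2 column values (arbitrary units) for four cantons.
base = [1.0, 1.1, 0.9, 1.2, 1.0, 1.3, 0.8, 1.1, 1.0, 2.5]
cantons = {
    "urban_a": [5.0 * v for v in base],
    "urban_b": [4.5 * v for v in base],
    "rural_a": [1.0 * v for v in base],
    "rural_b": [1.2 * v for v in base],
}
points = [canton_features(vals) for vals in cantons.values()]
labels = kmeans(points, k=2)
print(dict(zip(cantons, labels)))
```

Using the upper-tail percentiles as features lets the clustering separate cantons by their extremes as well as their background level, which is the distinction the abstract emphasizes.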

2606.03763 2026-06-03 econ.GN cs.AI q-fin.EC

Merit or networks? What decides where research is published

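Ning Li

AI summary: Using data on economics working papers, the study scores each paper's idea quality with an LLM and combines it with execution quality, connections, author ability, and a language-model text score in a five-input production function, revealing how merit and connections shape publication.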

Abstract

Does scientific publishing reward the quality of ideas or the advantage of connections? The question is universal to prestige-driven science, yet it has resisted decades of study because a paper's quality could not be gauged ahead of its publication fate without using that fate as the yardstick. We break this constraint by measuring a paper's idea quality directly from its text, before publication, using a discipline-trained LLM evaluator that scores the idea without seeing author names or outcomes. Using economics as a case study, we combine this text-legible idea-quality score with an execution-quality rubric, a connection index, an author-ability index, and an off-the-shelf language-model text score to estimate a five-input production function for journal placement across 6,208 economics working papers. The inputs are not rivals but a sequence along the ladder of prestige. Execution sets a meritocratic floor and is the largest input overall. Text-legible idea quality grades the rungs in between. Connections set a favoritism ceiling that bites mainly near the apex, the most selective journals. Connections work through two additive channels: connected authors write papers that score higher, and at equal scores their papers are still more likely to place better. Yet this advantage is bounded. Connections raise the odds of every rung without making the apex the typical outcome for ordinary ideas, and even the highest-scoring papers face real friction reaching the visible journal ladder. The result nests, rather than chooses between, the meritocracy and network accounts of how science is published.

2606.03184 2026-06-03 q-fin.CP cs.LG q-fin.ST

FinStressTS: A Parametric Synthetic Benchmark for Time-Series Forecasting in Finance

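Jiaze Sun, Kelvin J. L. Koa, Ruiyang Ni, Yize Liu, Haonan Chen, Ke-Wei Huang

AI summary: To address weak signals and complex mechanisms in financial forecasting, the paper proposes the FinStressTS synthetic benchmark, which systematically evaluates 15 models on point and probabilistic forecasting across 30 diagnostic environments and shows that model performance depends on the data-generating mechanism.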

Comments
KDD 2026 (Oral)

Abstract

Financial forecasting is difficult due to low signal-to-noise ratios, latent factors, heavy tails, regime shifts, and jumps. Real-world benchmarks offer limited failure attribution: researchers can observe underperformance, but often cannot isolate why because mechanisms are unobservable and entangled. Real financial data reveal only one realized path, making it difficult to assess tail-risk calibration or data efficiency. We introduce FinStressTS, a mechanism-aware synthetic benchmark that links model behavior to controlled structural causes. FinStressTS comprises 30 diagnostic environments around six mechanism families: volatility clustering, multi-scale persistence, heavy-tailed shocks, regime switching, self-exciting jumps, and zero-inflated processes. We evaluate two tasks: point forecasting, using NMAE across five settings, and probabilistic forecasting, using CRPS under known data-generating mechanisms. We benchmark 15 models, from classical methods (HAR, VAR) to Transformer forecasters (PatchTST, iTransformer) and deep probabilistic architectures (DeepAR, TSFlow), and use learning curves to measure sample efficiency. Our evaluation reveals three insights. First, performance is mechanism-dependent: autoregressive and linear models are highly competitive, and often outperform Transformer-based models, in several volatility-, tail-, and jump-driven environments. Second, distributional alignment matters: parametric probabilistic models such as DeepAR calibrate well in stationary settings, while flexible models can help when distributions become multimodal or sparse. Third, neural models often require more data to match simple baselines, with larger gains mainly when learning latent regimes or complex distributions. FinStressTS provides an open framework for diagnosing failure modes and advancing risk-aware forecasting.
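To make the two evaluation metrics concrete, here is a small sketch under stated assumptions. The abstract does not define its NMAE normalization, so dividing MAE by the mean absolute target is an assumption; the CRPS function is the standard sample-based estimator; and the GARCH(1,1) simulator with its parameters is a hypothetical stand-in for a volatility-clustering environment, not FinStressTS code.

```python
import random

def nmae(y_true, y_pred):
    """MAE divided by the mean absolute target (assumed normalization)."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    scale = sum(abs(t) for t in y_true) / len(y_true)
    return mae / scale

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2

def simulate_garch(n, omega=0.05, alpha=0.1, beta=0.85, seed=0):
    """Simulate n returns from a GARCH(1,1) process (hypothetical parameters)."""
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    returns = []
    for _ in range(n):
        r = rng.gauss(0.0, var ** 0.5)
        returns.append(r)
        var = omega + alpha * r * r + beta * var  # volatility clustering
    return returns

returns = simulate_garch(200)
# Score a naive forecast that predicts the next return from equally weighted
# past returns, treated as an empirical predictive distribution.
history, target = returns[:-1], returns[-1]
print(crps_from_samples(history, target))
```

Because the generator is known, a forecaster's CRPS can be compared against the score of the true conditional distribution, which is what a synthetic benchmark makes possible and real market data does not.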

2606.02937 2026-06-03 q-bio.NC cs.CV

BEAST3D: Animal behavioral analysis and neural encoding from multi-view video via Gaussian splatting

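Yanchen Wang, Lenny Aharon, Wangshu Zhu, Kyle Daruwalla, Linghua Zhang, Jiaru Zou, Selmaan Chettih, Helen Hou, Liam Paninski, Matthew R Whiteway

AI summary: Proposes BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled multi-view video through 3D Gaussian splatting reconstruction and animal segmentation, and applies them to novel view synthesis, multi-view pose estimation, and neural encoding.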

Abstract

Multi-view video recordings are increasingly used to capture the 3D movements of animals in experimental settings, yet extracting rich 3D representations from these recordings remains challenging. Supervised pose estimation requires extensive manual annotation, while general-purpose 3D reconstruction models trained on generic scene datasets fail on the specialized imagery and sparse-view setting of laboratory experiments. We address these limitations with BEAST3D, a self-supervised pretraining framework that learns 3D visual representations from unlabeled, calibrated multi-view video. BEAST3D uses a vision transformer to predict 3D Gaussian splats that reconstruct held-out views through differentiable rendering, while simultaneously segmenting the animal from the background. BEAST3D reconstructs 3D structure with as few as four views by conditioning directly on known camera parameters--unlike general-purpose models, which must estimate camera geometry from dense overlapping viewpoints that are seldom available in lab settings. Through comprehensive evaluation across four species, we demonstrate that BEAST3D produces rich, viewpoint-invariant features that transfer effectively to three downstream tasks: novel view synthesis, which validates the quality of the learned 3D representations; multi-view pose estimation, which provides the sparse keypoint trajectories widely used in behavioral analysis; and neural encoding, which relates 3D behavioral features to simultaneously recorded neural activity. BEAST3D thus establishes a versatile framework for behavioral analysis that leverages 3D structure in modern multi-view laboratory recordings.

2606.02629 2026-06-03 q-bio.QM cs.AI cs.LG

Enhancing Protein-Protein Interaction Prediction with Hierarchical Motif-based Multimodal Protein Embedding

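Zaifei Yang, Samuel Ping-Man Choi, James Kwok

AI summary: Proposes MMM-PPI, which integrates sequence, structure, and function information through hierarchical motif-based multimodal encoding at three scales (micro, meso, macro) to improve protein-protein interaction prediction.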

Abstract

Protein-protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from two major limitations: they overlook the hierarchical organization of proteins, particularly meso-scale motifs that critically regulate PPIs, and fail to effectively integrate sequence, structure, and function modalities. To address these limitations, we propose MMM-PPI, a Hierarchical Motif-based Multi-Modal protein Encoder for PPI Prediction that constructs PPI embeddings in a bottom-up multi-modal manner across three scales. At the micro-scale, we encode three modal residue features; at the meso-scale, a novel multimodal motif encoder aggregates residues into spatially-informed motif embeddings; at the macro-scale, a multimodal protein encoder integrates motifs into protein embeddings by jointly modeling motif importance and inter-modal correlations. The pre-trained encoder can be used off-the-shelf for large-scale PPI prediction. Extensive experiments on multiple PPI datasets show that MMM-PPI outperforms state-of-the-art multi-label PPI prediction models, particularly under challenging data partitions and limited data scenarios. Code is available at https://github.com/yzf-code/MMM-PPI.

2606.02625 2026-06-03 q-bio.QM cs.AI cs.LG

DXA-Derived Skeletal Phenotypes and Hip Fracture Risk: A Backdoor-Adjusted Causal Analysis

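Zixin Shi, Chen Zhao, Meiling Zhou, Kevin A. Maupin, Joyce H. Keyak, Nancy E. Lane, Kuan-Jui Su, Hui Shen, Hong-Wen Deng, Kui Zhang, Weihua Zhou

AI summary: The study uses backdoor-adjusted average treatment effects to compare DXA-derived hip skeletal phenotypes in relation to fracture risk, and assesses whether effect-ranked phenotypes improve risk stratification.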

Comments
35 pages; main manuscript includes 4 figures and 3 tables; supplementary material includes 13 figures and 3 tables

Abstract

Purpose: To compare dual-energy X-ray absorptiometry (DXA)-derived hip skeletal phenotypes in relation to hip fracture risk using prespecified confounder adjustment and to assess whether phenotypes ranked by their backdoor-adjusted average treatment effects (ATEs) improve risk stratification. Methods: We analyzed 21,098 UK Biobank participants with linked health records, hip DXA-derived skeletal measures, and prespecified covariates. Sixteen phenotypes spanning bone mineral content (BMC), bone mineral density (BMD), and T-score across hip-related regions were evaluated. Confounder selection was guided by a prespecified directed acyclic graph (DAG). Backdoor-adjusted ATEs were estimated on the absolute risk-difference scale per standard deviation (SD) increase. Effect heterogeneity was evaluated for total femur BMD, and downstream prediction was assessed using clinical variables combined with phenotypes ranked by ATE magnitude. Results: Among 21,098 participants, 115 had hip fractures. All 16 phenotypes showed negative backdoor-adjusted ATEs per SD increase. The largest ATEs were observed for total femur BMC and total femur BMD, each with a risk difference of -0.0047, corresponding to approximately 4.7 fewer hip fractures per 1,000 participants per SD higher phenotype value. Conditional effects of total femur BMD were stronger among older participants and those with lower BMI. In prediction, clinical variables plus the top 11 ATE-ranked phenotypes achieved higher AUC than FRAX with femoral neck BMD (0.842 vs. 0.709), with higher sensitivity (0.748 vs. 0.443) and similar specificity (0.793 vs. 0.777). Conclusion: DXA-derived hip skeletal phenotypes differed in their backdoor-adjusted ATEs. Phenotype-level causal evaluation may help identify informative DXA measures for risk stratification.
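The backdoor adjustment and the risk-difference arithmetic can be sketched in a deliberately simplified setting: a binary exposure and a single discrete confounder, adjusted by standardization over strata. The study itself works with continuous per-SD exposures and a DAG-selected covariate set, so the toy data and function names here are illustrative assumptions only.

```python
from collections import defaultdict

def make_rows(z, a, n, events):
    """Build n records (confounder z, exposure a, outcome y) with `events` outcomes."""
    return [(z, a, 1)] * events + [(z, a, 0)] * (n - events)

def backdoor_ate(rows):
    """Standardized risk difference: sum_z P(z) * (E[Y|a=1,z] - E[Y|a=0,z]).

    Assumes positivity: every stratum contains both exposed and unexposed rows.
    """
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})  # z -> a -> [n, events]
    total = 0
    for z, a, y in rows:
        counts[z][a][0] += 1
        counts[z][a][1] += y
        total += 1
    ate = 0.0
    for by_a in counts.values():
        n_z = by_a[0][0] + by_a[1][0]
        risk_exposed = by_a[1][1] / by_a[1][0]
        risk_unexposed = by_a[0][1] / by_a[0][0]
        ate += (n_z / total) * (risk_exposed - risk_unexposed)
    return ate

def per_1000(risk_difference):
    """Convert an absolute risk difference into events per 1,000 participants."""
    return risk_difference * 1000

# Toy cohort: exposure stands in for a higher bone-density phenotype.
rows = (
    make_rows("older", 1, 10, 1) + make_rows("older", 0, 10, 3)
    + make_rows("younger", 1, 20, 1) + make_rows("younger", 0, 20, 2)
)
print(backdoor_ate(rows))
print(round(per_1000(-0.0047), 1))  # the abstract's reported risk difference
```

The second print reproduces the abstract's conversion: a risk difference of -0.0047 corresponds to about 4.7 fewer hip fractures per 1,000 participants.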

2606.02624 2026-06-03 q-bio.QM cs.AI cs.LG

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

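Jin Gao, Juntu Zhao, Zirui Zeng, Jiaqi Shen, Junhao Shi, Dukun Zhao, Yuming Lu, Dequan Wang

AI summary: TadA-Bench is a million-variant wet-lab replay benchmark built from 31 rounds of TadA directed evolution. It defines a fixed-data replay task to evaluate how well models rank variants from unseen future rounds, introduces Seq2Graph to unify labels, and shows that evolutionary coverage matters more than local data density.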

Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026). Data: https://huggingface.co/datasets/JinGao/TadABench-1M . Code: https://github.com/shiyegao/TadABench-1M

Abstract

AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph-based label-unification pipeline, to reconcile noisy enrichment measurements into consistent cross-round activity labels. Random-split controls show strong interpolation, but future-round ranking and finite-budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA-Bench as a reproducible wet-lab replay substrate for future-round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.
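The fixed-data replay task can be sketched as a chronological split: train on rounds up to a cutoff, then evaluate only on variants that first appear afterwards. The record layout and the budgeted-recall metric below are hypothetical illustrations, not the benchmark's actual data format or scoring code.

```python
def replay_split(records, cutoff_round):
    """Split (variant, round_index, label) records chronologically.

    Training uses rounds up to and including the cutoff. The test set keeps
    only later-round variants that never appeared in the training rounds.
    """
    train = [r for r in records if r[1] <= cutoff_round]
    seen = {r[0] for r in train}
    test = [r for r in records if r[1] > cutoff_round and r[0] not in seen]
    return train, test

def budgeted_recall(predicted, actual, budget):
    """Share of the truly best `budget` variants found in the model's top `budget`."""
    top_pred = set(sorted(predicted, key=predicted.get, reverse=True)[:budget])
    top_true = set(sorted(actual, key=actual.get, reverse=True)[:budget])
    return len(top_pred & top_true) / budget

# Hypothetical campaign: variant "A" recurs in round 3 and is excluded from test.
records = [("A", 1, 0.1), ("B", 2, 0.4), ("A", 3, 0.2), ("C", 3, 0.9), ("D", 4, 0.5)]
train, test = replay_split(records, cutoff_round=2)
print([r[0] for r in test])

predicted = {"C": 0.2, "D": 0.8, "E": 0.5}   # hypothetical model scores
actual = {"C": 0.9, "D": 0.5, "E": 0.1}      # hypothetical measured activity
print(budgeted_recall(predicted, actual, budget=2))
```

A random split would let near-duplicates of test variants leak into training, which is one plausible reason the abstract reports strong interpolation but much weaker future-round ranking.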

2606.03994 2026-06-03 cs.CV cs.RO

SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

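Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim, Hyunsoo Cha, Hanbyul Joo

AI summary: Proposes SimuScene, a compositional 3D reconstruction pipeline that brings physics simulation into shape and layout estimation, using a physics engine to diagnose reconstruction errors and drive corrections, producing stable, simulation-ready scenes.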

Comments
Project Page: https://snuvclab.github.io/SimuScene/

Abstract

Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene's utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.

2606.03992 2026-06-03 cs.CV cs.RO

Exploring Easy Boosts for Lidar Semantic Scene Completion

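Tetiana Martyniuk, Jonathan Seele, Alexandre Boulch, Gilles Puy, Renaud Marlet, Raoul de Charette

AI summary: Studies "free lunch" strategies that need no complex architectural redesign: adding semantic pseudo-labels and visibility information to the input point cloud significantly improves lidar semantic scene completion, letting older models compete with and even surpass state-of-the-art systems.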

Comments
Accepted to ICIP 2026

Abstract

This paper investigates "free lunch" strategies to boost the performance of lidar semantic scene completion (SSC) without requiring complex architectural redesigns. We first demonstrate that endowing input point clouds with semantic pseudo-labels from off-the-shelf segmentors significantly improves the performance of existing architectures. By evaluating these models against an oracle, we establish that high-quality semantic priors are a primary driver of mIoU gains. Furthermore, we equip the input lidar scan with visibility information that distinguishes between empty and unknown spaces, which provides a secondary performance boost across the tested architectures. Using these simple enhancements, we observe that older models remain competitive with state-of-the-art systems, and can even outperform them. Our code is available at https://github.com/astra-vision/SSC-Priors.

2606.03990 2026-06-03 cs.LG cs.CL cs.CV

Neuron Populations Exhibit Divergent Selectivity with Scale

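Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

AI summary: By analyzing the distribution and properties of Rosetta Neurons across models of different scales, the paper finds that their number follows a sublinear power law and their selectivity increases with scale, while non-Rosetta neurons remain less selective; it proposes an analytical model balancing feature utility against neuron capacity to explain this polarization.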

Comments
Project page and code: https://avdravid.github.io/rosetta-neuron-scaling/

Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
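A sublinear power law means the count grows as c * P^alpha with 0 < alpha < 1, and the exponent can be read off as the slope of a log-log fit. The sketch below fits synthetic data generated with a made-up exponent of 0.6; the model sizes, the constant, and the assumed growth of the total neuron count are all hypothetical and are not the paper's measurements.

```python
import math

def fit_power_law(sizes, counts):
    """Least-squares fit of log(count) = log(c) + alpha * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    alpha = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha

# Synthetic example: shared-neuron counts generated with exponent 0.6.
sizes = [1e8, 1e9, 1e10]
counts = [3.0 * s ** 0.6 for s in sizes]
c, alpha = fit_power_law(sizes, counts)

# Assumed total neuron count growing faster (exponent 0.8), so the shared
# fraction scales as size ** (0.6 - 0.8) and shrinks with scale.
totals = [s ** 0.8 for s in sizes]
fractions = [n / t for n, t in zip(counts, totals)]
print(alpha, fractions)
```

The fraction shrinks whenever the total neuron count grows with a larger exponent than the shared population, which matches the abstract's "growing in absolute number but occupying a shrinking fraction".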

2606.03989 2026-06-03 cs.CV

PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation

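Shinjeong Kim, Ignacio Alzugaray, Callum Rhodes, Paul H. J. Kelly, Andrew J. Davison

AI summary: Proposes a pixel-level distributed visual odometry and depth estimation method based on Gaussian Belief Propagation, with a keyframe anchoring mechanism, enabling parallel on-sensor computation.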

Abstract

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: https://www.shinjeongkim.com/pixvod/
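The basic operation behind Gaussian Belief Propagation is that a variable's belief is the product of its incoming Gaussian messages, which in information form is a sum of precisions and precision-weighted means. The one-dimensional sketch below shows only that fusion step with hypothetical depth messages; it is not the paper's factor graph or its on-sensor implementation.

```python
def fuse_gaussian_messages(messages):
    """Product of 1D Gaussian messages given as (mean, variance) pairs.

    In information form the precisions add and so do the precision-weighted
    means. Returns the fused (mean, variance).
    """
    precision = sum(1.0 / var for _, var in messages)
    information = sum(mean / var for mean, var in messages)
    return information / precision, 1.0 / precision

# Hypothetical messages arriving at one pixel's depth variable: a photometric
# observation, a surface-normal prior, and a neighbouring pixel's estimate.
messages = [(2.0, 0.5), (2.4, 1.0), (1.8, 2.0)]
mean, var = fuse_gaussian_messages(messages)
print(mean, var)
```

Each pixel needs only its neighbours' messages to update its belief, which is why the scheme parallelizes across pixels.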

2606.03986 2026-06-03 cs.CV

NewtPhys: Do Foundation Models Understand Newtonian Physics?

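Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

AI summary: Presents NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes and physics simulation, for systematically evaluating foundation models on low-level Newtonian physics reasoning; the evaluation reveals limitations of existing models.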

Abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, as well as 10 VFMs, and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.

2606.03985 2026-06-03 cs.RO cs.AI cs.CV

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

AI summary: Proposes Humanoid-GPT, a GPT-style causal Transformer pre-trained on a billion-scale motion corpus for whole-body control; scaling data and model capacity yields zero-shot generalization to unseen motions and tasks.

Comments
Accepted at CVPR 2026

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

2606.03982 2026-06-03 cs.CL

Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

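Mutsumi Sasaki, Go kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

AI summary: Controlled experiments show that when comparing quantities with units, language models do not perform exact scale conversion but rely on heuristics over numerical differences and unit-scale differences, causing systematic errors near the comparison boundary.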

Abstract

Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
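The contrast between exact shared-scale conversion and a cue-based heuristic can be shown on the abstract's own example, 110 cm versus 1.2 m. The linear heuristic below, with its weights of 1.0 and 0.9, is a hypothetical toy in the spirit of a linear surrogate over the two cues; it is not the surrogate fitted in the paper, and the unit table is a small illustrative subset.

```python
import math
from fractions import Fraction

# Metres per unit; an illustrative table, not the paper's unit systems.
TO_METRES = {
    "mm": Fraction(1, 1000),
    "cm": Fraction(1, 100),
    "m": Fraction(1),
    "km": Fraction(1000),
}

def to_metres(value, unit):
    """Exact conversion; `value` is a decimal string such as "1.2"."""
    return Fraction(value) * TO_METRES[unit]

def first_is_larger_exact(v1, u1, v2, u2):
    """Compare after converting both quantities to a shared scale."""
    return to_metres(v1, u1) > to_metres(v2, u2)

def first_is_larger_heuristic(v1, u1, v2, u2, w_num=1.0, w_unit=0.9):
    """Toy linear score over a numeral cue and a unit-scale cue.

    With w_num == w_unit the score is exact on a log scale; the mismatched
    hypothetical weights make it fail near the comparison boundary.
    """
    num_cue = math.log10(float(v1)) - math.log10(float(v2))
    unit_cue = math.log10(float(TO_METRES[u1])) - math.log10(float(TO_METRES[u2]))
    return w_num * num_cue + w_unit * unit_cue > 0

for v1 in ("10", "110", "500"):
    exact = first_is_larger_exact(v1, "cm", "1.2", "m")
    guess = first_is_larger_heuristic(v1, "cm", "1.2", "m")
    print(v1, "cm vs 1.2 m:", exact, guess)
```

The toy heuristic agrees with the exact comparison when the quantities are far apart (10 cm, 500 cm) and disagrees on 110 cm, which sits close to the 120 cm boundary.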

2606.03980 2026-06-03 cs.LG cs.CL

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

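Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

AI summary: Proposes the Skill-RM framework, which reformulates reward modeling as the execution of a reusable reward-evaluation skill and unifies heterogeneous evaluation criteria by dynamically selecting and aggregating evidence, outperforming traditional methods on reward benchmarks and downstream tasks.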

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.
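The idea of a single interface over heterogeneous evidence can be sketched with two toy evaluators that either return a score or declare themselves inapplicable, followed by best-of-N selection. Everything here (the evaluator functions, the equal-weight averaging, the names) is a hypothetical static stand-in; Skill-RM itself runs an agentic skill, which this sketch does not attempt to reproduce.

```python
def rule_verifier(prompt, response, reference):
    """Exact-match check; applicable only when a ground-truth reference exists."""
    if reference is None:
        return None
    return 1.0 if response.strip() == reference.strip() else 0.0

def checklist(prompt, response, reference):
    """Toy procedural checklist: the answer is non-empty and ends with a period."""
    text = response.strip()
    checks = [len(text) > 0, text.endswith(".")]
    return sum(checks) / len(checks)

def reward(prompt, response, reference=None, evaluators=(rule_verifier, checklist)):
    """Average the scores of whichever evaluators apply to this input."""
    scores = [e(prompt, response, reference) for e in evaluators]
    applicable = [s for s in scores if s is not None]
    return sum(applicable) / len(applicable)

def best_of_n(prompt, candidates, reference=None):
    """Pick the candidate with the highest aggregated reward."""
    return max(candidates, key=lambda r: reward(prompt, r, reference))

candidates = ["Paris", "Paris.", ""]
print(best_of_n("Capital of France?", candidates, reference="Paris."))
print(best_of_n("Capital of France?", candidates))
```

Returning `None` for inapplicable evidence lets the same `reward` call serve inputs with and without references, a simple analogue of selecting evidence per input.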

2606.03979 2026-06-03 cs.LG cs.AI

Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories

语言模型需要睡眠:学习自我修改和巩固记忆

Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

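AI summary: Inspired by the human learning process, proposes a "Sleep" paradigm with two stages, memory consolidation (Knowledge Seeding) and dreaming (self-improvement), so that models can learn continually, turn short-term memories into long-term knowledge, and improve themselves.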

详情
Comments
A version of this work has been publicly available from September 2025 on OpenReview
AI中文摘要

过去几十年见证了机器学习算法设计的重大进步,从早期针对特定任务的浅层模型研究到更通用的深度大语言模型(LLMs)。尽管在需要即时预测或上下文学习的任务中显示出有希望的结果,现有模型缺乏持续学习并有效将其时间上下文知识转移到长期参数的能力。受人类学习过程的启发,我们引入了一种“睡眠”范式,允许模型持续学习,通过重放将其短期脆弱记忆蒸馏为稳定的长期知识,并通过“梦境”过程递归地自我改进。更详细地说,睡眠包括两个阶段:(1)记忆巩固:一个向上的蒸馏过程,称为知识播种,其中较小自我的记忆被蒸馏到更大的网络中,以在保留知识的同时提供更多容量。作为概念验证,我们提出了一种新的广义蒸馏过程用于知识播种(即在线策略蒸馏与基于强化学习的模仿学习的结合);(2)梦境:一个自我改进阶段,其中模型使用强化学习生成合成数据的课程,以排练新知识并在没有人类监督的情况下完善现有能力。我们在长视野、持续学习、知识整合和少样本泛化任务上的实验支持了睡眠阶段的重要性。

英文摘要

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

2606.03976 2026-06-03 cs.CV cs.AI cs.LG q-bio.NC

Formalizing the Binding Problem

形式化绑定问题

Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

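AI summary: Formalizes the binding problem with an information-theoretic approach, proposes a probing method that measures binding information in model representations, and runs experiments on Vision Transformers, showing that binding is a key ingredient of strong visual recognition and reasoning.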

详情
Comments
Accepted to ICML 2026
AI中文摘要

世界表征,可以说,包含关于特征的信息(例如,某物是蓝色的,某物是圆形的),但也包含关于哪些特征属于同一对象的信息(例如,圆形是蓝色的),我们称之为绑定信息。任何具有理解包含多个对象场景能力的系统都必须解决绑定问题:它需要知道哪些特征属于一起。然而,尽管有研究表明视觉Transformer(ViT)知道哪些补丁属于一起,但目前尚不清楚当前的深度学习模型是否学会展示绑定信息,即针对特征的信息。我们可能认为绑定信息并不多,毕竟将特征错误归因于错误对象是基于ViT架构的常见失败,尤其是在对象共享特征的场景中。本文用信息论方法形式化绑定问题,并引入一种探测方法来测量模型表示中的绑定信息。我们在ViT上进行实验,测量来自架构不同组件(如图像摘要标记[CLS]或空间标记)的绑定信息。我们使用具有不同绑定挑战的数据集,例如特征共享、遮挡和自然特征,同时比较多个预训练ViT的性能。总体而言,我们的研究证明了绑定是强视觉识别和推理的关键要素。

英文摘要

Representations of the world, arguably, contain information about features (e.g., something is blue, something is a circle) but also information about which features are part of the same object (e.g., the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information; after all, misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

2606.03971 2026-06-03 cs.CV

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

Video-Mirai: 自回归视频扩散模型需要远见

Yonghao Yu, Lang Huang, Runyi Li, Zerun Wang, Toshihiko Yamasaki

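AI summary: Proposes Video-Mirai, a training method in which a frozen foresight encoder extracts future information from the completed rollout and distills it into causal states, closing the representation-level planning gap without changing inference and improving consistency in long video generation.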

详情
AI中文摘要

因果视频生成器必须从过去预测,但它们不必仅从过去学习。在流式自回归视频扩散中,每个发射的片段成为未来片段必须保留的承诺。然而,标准训练只要求每个因果状态解释当前。这造成了我们称之为表示层面的规划差距:适合当前片段的状态可能丢弃未来一致性所需的身份、布局和运动信息。我们引入Video-Mirai,一种仅训练的方法,在不改变因果推理的情况下弥合这一差距:生成器因果地展开,一个冻结的远见编码器非因果地读取完成的展开,一个轻量级预测器将得到的停止梯度目标蒸馏到因果状态。未来帧监督表示,从不监督生成器输入。在推理时,编码器和预测器被丢弃,原始架构、每步FLOPs和KV缓存行为保持不变。Video-Mirai在5秒VBench上将强因果强制基线从83.8提高到84.6(总分)。在超出训练范围的30秒展开中,主体一致性从84.9提高到88.5,背景一致性从90.2提高到91.9。消融实验确定未来条件目标是关键因素,探针实验显示未来帧从当前特征中更易解码。因果性应约束推理,而非表示监督。我们的研究强调视觉自回归模型需要远见。项目页面:此https URL。

英文摘要

Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: https://y0uroy.github.io/Video-Mirai.

2606.03969 2026-06-03 cs.CL cs.AI

Quantifying Faithful Confidence Expression in Large Reasoning Models

量化大型推理模型中的忠实置信表达

Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

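AI summary: To address the difficulty large reasoning models (LRMs) have in faithfully expressing their intrinsic confidence in long chain-of-thought outputs, proposes a framework based on token probabilities, hidden states, and response consistency that systematically quantifies how well linguistic decisiveness aligns with internal uncertainty.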

详情
Comments
Code: https://github.com/yale-nlp/faithful_lrm
AI中文摘要

可靠的不确定性沟通对于LLMs的可信度至关重要,然而忠实校准(FC)——模型内在置信度与(语言上)表达的置信度之间的对齐——是一个持续存在的失败模式。这一挑战对大型推理模型(LRM)尤为关键,因为其扩展的推理轨迹常被用户解读为深思熟虑、能力和信心的证据。尽管FC重要且LRM广泛使用,但LRM能否忠实表达其置信度仍知之甚少。此外,衡量FC的主流范式难以泛化到LRM生成的长链思维输出,这些输出往往缺乏清晰的步骤边界、步骤结构不一致,并在整个轨迹中编码复杂的条件依赖——使得内在置信度的估计复杂化。为应对这一挑战,我们引入了一个新颖的框架来系统量化LRM的FC。我们的框架基于令牌概率、隐藏状态和采样响应一致性,分析语言决断性与三种内部不确定性来源的关系。我们还设计了一种前缀条件采样方法,以控制轨迹中的条件和结构变化。将我们的框架应用于一系列多样化的领先模型、数据集和提示,我们发现忠实置信表达是LRM的一个重大挑战。推理行为不会自动转化为改进的FC,针对非推理模型的提示干预在推理设置中并不能提高忠实性。不同的置信估计器还对同一轨迹产生不同评估,揭示了先前评估方法的脆弱性。综合来看,我们的工作将FC确立为LRM的一个独特的可靠性和对齐目标,尤其是在这些系统越来越多地部署在高风险场景中的背景下。

英文摘要

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

2606.03968 2026-06-03 cs.CL cs.AI

QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

QUBRIC:为超越可验证奖励的强化学习协同设计查询与评分标准

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

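AI summary: To address the rubric-quality bottleneck caused by a fixed query distribution in rubric-based RL, proposes the QUBRIC framework, which co-designs queries and rubrics using teacher-derived key points, contrastive generation, and learnability filtering; it gains +5.5 points on ArenaHard and generalizes to legal, moral, and narrative reasoning tasks.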

详情
AI中文摘要

基于评分标准的强化学习是将强化学习扩展到可验证奖励之外的一条有前景的途径,但现有方法在优化评分标准时,将查询分布视为固定不变。我们识别出一个结构性瓶颈:评分标准的质量受限于查询结构。开放式查询会导致模糊的评分标准;而简单地将查询收窄则会引入任何模型都无法验证的虚构参考,导致所有回答失败,训练无法获得奖励信号。我们提出QUBRIC,一个协同设计查询和评分标准的框架。教师导出的关键点将开放式查询改写为基于场景、可评估的问题。然后,对比评分标准生成将教师策略的差距转化为查询级别的标准,可学习性过滤仅保留信息量丰富的查询-评分标准对用于GRPO训练。QUBRIC在ArenaHard上相比SFT基线取得了+5.5分的提升。仅使用指令遵循数据训练,它进一步迁移到三个涵盖法律、道德和叙事推理的保留基准(平均提升+6.3分),改进集中在推理相关维度。这些结果证明,协同设计查询和评分标准可以使基于评分标准的强化学习成为严格可验证任务之外RLVR的实用补充。

英文摘要

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
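
The learnability-filtering step and the "no reward signal" failure described above can be illustrated with a minimal sketch. It assumes, hypothetically, that each query-rubric pair carries a list of rubric scores for sampled responses (the `rewards` field) and that a pair counts as informative when those scores are not all identical; the abstract does not state QUBRIC's actual criterion. The helper `group_advantages` uses the standard group-normalized advantage from GRPO to show why an all-fail group contributes nothing.

```python
from statistics import mean, pstdev, pvariance

def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std within one group.
    If every response gets the same reward, every advantage is zero."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def is_learnable(rewards, min_variance=1e-6):
    """Hypothetical criterion: keep a pair only if rewards vary across samples."""
    return pvariance(rewards) > min_variance

def filter_pairs(pairs, min_variance=1e-6):
    """Retain only query-rubric pairs whose sampled rewards are informative."""
    return [p for p in pairs if is_learnable(p["rewards"], min_variance)]

# Illustrative data (invented for this sketch, not taken from the paper).
pairs = [
    {"query": "q_all_fail", "rewards": [0.0, 0.0, 0.0, 0.0]},
    {"query": "q_mixed", "rewards": [1.0, 0.0, 1.0, 0.0]},
    {"query": "q_all_pass", "rewards": [1.0, 1.0, 1.0, 1.0]},
]
kept = filter_pairs(pairs)
```

In this toy data only `q_mixed` survives, since the all-fail and all-pass groups would yield zero advantages for every response.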

2606.03967 2026-06-03 cs.CL cs.AI

AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task

AlignAtt4LLM:面向仅解码器LLM的快速AlignAtt方法在IWSLT 2026同声传译任务中的应用

Quentin Fuxa, Dominik Macháček

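AI summary: Proposes the AlignAtt4LLM system, which applies the AlignAtt policy to a decoder-only LLM for the first time through an explicit source span, offline selection of translation-specific alignment heads, selective qk-fast replay, and runtime query/key capture; it outperforms the baselines on English-to-German and English-to-Italian simultaneous translation.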

详情
Comments
Accepted to IWSLT 2026
AI中文摘要

我们描述了AlignAtt4LLM,一个用于英语到德语、意大利语和中文的IWSLT 2026同声传译系统。该系统是一个同步级联:Qwen3-ASR结合强制对齐生成增量更新的源文本转录,Gemma-4 E4B-it在MT侧的AlignAtt策略下翻译该前缀。据我们所知,这是AlignAtt首次应用于仅解码器LLM,而早期AlignAtt系统使用的编码器-解码器交叉注意力在此类模型中不存在。我们通过提出(1)提示中的显式源文本跨度,(2)离线选择翻译特定的对齐头,(3)草稿到源注意力块的选择性qk快速重放,以及(4)保持模型输出比特一致的运行时查询/键捕获,恢复了一个可用的策略。在IWSLT 2026开发集上,AlignAtt4LLM在约2秒的低延迟和低于4秒CU-LongYAAL的高延迟场景下,均优于欧洲目标语言(英语到德语和英语到意大利语)的提供基线。英语到中文的结果较为复杂,但该方法不依赖于Gemma-4:由于AlignAtt4LLM仅需要确定的提示布局、校准的对齐头和查询/键捕获,相同的策略可以重新应用于针对非欧洲目标语言的更强翻译专用仅解码器MT骨干网络。

英文摘要

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
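
For readers unfamiliar with the underlying policy, the following is a minimal sketch of the basic AlignAtt emission rule as it is commonly described: a drafted target token is emitted only if the source position it attends to most is not among the last few source positions, which may still change as more input arrives. The function names, the `num_guard` parameter, and the toy attention matrix are assumptions made for illustration; the head selection, qk-fast replay, and query/key capture described in the abstract are not reproduced here.

```python
def most_attended(attention_row):
    """Index of the source token receiving the highest attention weight."""
    return max(range(len(attention_row)), key=attention_row.__getitem__)

def count_emittable(draft_attention, num_guard):
    """Count how many drafted target tokens can be emitted.

    draft_attention: one row per drafted target token, each row holding
    attention weights over the current source tokens (assumed to be already
    averaged over the chosen alignment heads).
    num_guard: number of trailing source tokens treated as unstable.
    Emission stops at the first token whose strongest alignment falls
    inside the guarded tail of the source.
    """
    emitted = 0
    for row in draft_attention:
        if most_attended(row) >= len(row) - num_guard:
            break
        emitted += 1
    return emitted

# Toy example: four drafted tokens attending over four source tokens.
draft_attention = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.10, 0.15, 0.70],  # aligned to the newest source token
    [0.80, 0.10, 0.05, 0.05],
]
stable_prefix_length = count_emittable(draft_attention, num_guard=1)
```

With one guarded source token, the first two drafted tokens are emitted and the rest wait for more source text.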

2606.03965 2026-06-03 cs.CL cs.AI

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Agentic Chain-of-Thought Steering:实现高效且可控的LLM推理

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

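AI summary: Proposes Agentic Chain-of-Thought Steering (ACTS), which uses reinforcement learning to train a controller agent that adaptively selects reasoning strategies and steering phrases during reasoning, giving budget-aware strategy control that saves substantial tokens while preserving reasoning quality and supports a controllable accuracy-efficiency trade-off.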

详情
AI中文摘要

大型语言模型通过扩展的思维链推理提高了最终答案的准确性,但通常token使用效率低下且缺乏推理时的控制。现有的高效推理方法通过缩短、提前停止或压缩轨迹来控制思考长度,但隐式地决定了模型的思考方式。在本文中,我们提出了Agentic Chain-of-Thought Steering (ACTS),它将推理引导形式化为一个马尔可夫决策过程,其中控制器智能体在推理过程中自适应地引导冻结的推理器。在每一步,控制器观察推理轨迹和剩余思考预算,然后发出一个包含推理策略和引导短语的引导动作,以启动推理器的下一步。这使得在保持推理器生成连续性的同时,能够进行预算感知的策略控制以实现高效推理。我们从构建的合成引导轨迹中初始化控制器智能体,并进行多预算增强,然后通过带有预算条件奖励塑造的强化学习进一步优化。跨多个基准的实验表明,ACTS在显著节省token的同时达到了与全思考相当的性能,并在不同的推理器和任务上实现了可控的准确率-效率权衡。代码可在该https URL获取。

英文摘要

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.

2606.03962 2026-06-03 cs.LG cs.AI

Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning

利用奖励不确定性在强化学习中诱导多样化行为

Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland

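AI summary: To address the lack of diversity in traditional reinforcement learning, proposes replacing the reward function with a distribution over reward functions, so that a non-linear objective over sets of actions naturally yields controllable, diverse behaviour; derives a gradient estimator, and experiments demonstrate the framework's robustness and theoretical advantages.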

详情
Comments
Core contributors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, André Barreto, Mark Rowland
AI中文摘要

经典强化学习通常寻求最大化标量奖励期望和的确定性策略。然而,现代应用如语言模型微调或科学发现需要多样性。现有的补救措施如熵正则化或多样性奖励通常需要脆弱的权衡,以性能换取随机性,或依赖可能使策略排名错位的启发式指标。我们认为,多样性更自然地理解为对奖励不确定性的理性响应。当奖励函数不完全已知时——例如模糊偏好或不完美的奖励模型——承诺单一行动可能是次优的。基于此,我们提出对强化学习目标进行根本性重新表述,将标量奖励替换为奖励函数上的分布,并对行动集合应用非线性目标。结果是一个框架,其中校准的行为多样性自然出现,通过奖励函数分布保持可控,且无需牺牲期望奖励即可获得。聚焦于上下文赌博机设置,我们为该目标推导出原则性的梯度估计器,并证明我们的公式自然泛化了原始策略梯度以及最近发展的行动集方法。我们的实证结果表明,该框架为传统问题表述无法诱导所需行为广度的复杂强化学习任务提供了鲁棒且理论基础的替代方案。

英文摘要

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
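
A small worked example may help show why a non-linear objective over sets of actions rewards diversity when the reward is uncertain. The sketch assumes a "best action in the set" objective (the expected maximum reward over the set); this is one possible non-linear objective chosen for illustration, not necessarily the one used in the paper, and the two reward functions and their probabilities are invented.

```python
def expected_best_of_set(action_set, reward_functions, probabilities):
    """Expected value, over the reward distribution, of the best action in the set."""
    return sum(
        p * max(reward[a] for a in action_set)
        for reward, p in zip(reward_functions, probabilities)
    )

# Two plausible reward functions that disagree about which action is good.
reward_functions = [
    {"A": 1.0, "B": 0.0},  # believed with probability 0.6
    {"A": 0.0, "B": 1.0},  # believed with probability 0.4
]
probabilities = [0.6, 0.4]

repeat_best = expected_best_of_set(["A", "A"], reward_functions, probabilities)
diverse = expected_best_of_set(["A", "B"], reward_functions, probabilities)
```

Action A has the higher expected reward on its own (0.6 versus 0.4), yet under this objective the set containing both A and B scores 1.0 while committing twice to A scores 0.6, so hedging across actions is the better response to the uncertainty.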

2606.03957 2026-06-03 cs.CL cs.AI cs.SD eess.AS

Efficient ASR Training with Conversations that Never Happened

利用从未发生的对话进行高效的ASR训练

Máté Gedeon, Péter Mihajlik

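AI summary: For lower-resource languages and niche domains, proposes an augmentation pipeline that generates dialogue scenarios with an LLM, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances; experiments show that synthetic conversations improve ASR performance, and on a Hungarian benchmark a model using only 67 hours of real conversations and 636 hours of simulated data outperforms a zero-shot model trained on 2700 hours.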

详情
AI中文摘要

低资源语言和特定领域的对话式ASR受到领域匹配的多说话人训练数据稀缺的限制。我们提出了一种增强流水线,该流水线生成带有参与者元数据的场景级对话,将说话人属性映射到TTS语音配置文件,并将合成的话语组装成感知说话人的模拟对话。我们在相同的FastConformer-Large训练方案下,评估了五种LLM家族,分别采用单生成器、固定预算混合和扩展设置。我们在匈牙利语BEA-Dialogue基准语料库上进行了全面评估,该方法本身适用于任何语言,只要各组件有相应资源。结果表明,合成对话持续改善语音识别性能,但生成器选择和组成数据强烈影响增益。我们最大的训练配置仅使用67小时真实对话和636小时模拟数据,在评估基准上实现了比在2700小时匈牙利语语音上训练的零样本模型更好的性能。这些发现表明,通过TTS合成的LLM生成的对话数据是真实对话语料库在语音模型训练中的实用补充。

英文摘要

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

2606.03954 2026-06-03 cs.CV cs.LG cs.RO

VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring

VLESA: 用于人类活动监测的视觉语言具身安全智能体

Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

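AI summary: Proposes the VLESA framework, which monitors human activity from egocentric video and uses a GRPO-trained goal-conditioned safety Q-filter for real-time safety interventions, achieving higher intervention accuracy on the ASIMOV-2.0 benchmark.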

详情
Comments
18 pages, 5 tables, 5 figures
AI中文摘要

随着AI系统越来越多地协助人类完成物理任务,确保安全变得至关重要——物理动作会带来即时且不可逆转的后果,而数字错误则不会。我们引入了视觉语言具身安全智能体(VLESA),这是一个从自我中心视频监测人类活动,并在预测到危险动作时触发实时安全干预的框架。VLESA处理意图依赖的安全问题,其中相同的动作可能根据上下文而安全或危险。我们引入了一个将自我中心帧与目标条件安全注释配对的数据集,使得能够通过GRPO训练一个目标条件安全Q过滤器,该过滤器在不重新训练的情况下根据推断的意图评估动作。在此基础上,提出了一个意图-动作预测智能体,用于从视频中联合推断目标并预测未来动作。在ASIMOV-2.0基准上,VLESA在精确的地面真值帧处实现了比基线更高的干预准确率,而通过目标条件约束解码,GRPO训练的Q过滤器将动作安全性提高了超过41个百分点。代码可在该网址获取。

英文摘要

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.

2606.03951 2026-06-03 cs.CV

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Demo2Tutorial:从人类经验到多模态软件教程

Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou

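AI summary: Proposes the Demo2Tutorial framework, which parses human experience from screen recordings and interaction logs into structured multimodal tutorials for human learning and GUI-agent training; experiments show its tutorials surpass human-authored ones in quality and improve task efficiency.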

详情
Comments
Accepted by CVPR 2026
AI中文摘要

数字环境中的人类经验提供了大量未被充分探索的真实、未修剪的交互资源,其中包含丰富的程序性知识。我们提出了Demo2Tutorial,一个将屏幕录制和交互日志捕获的人类经验转化为结构化多模态软件教程的框架,用于同时教授人类和智能体。Demo2Tutorial首先通过专用记录器收集人类经验,然后使用多模态动作解析器解析原始经验,以重建感知、动作和意图。接着,步骤规划器将这些步骤抽象为表示目标和步骤的分层任务图。最后,教程合成器将解析后的经验转化为结构化的、可复用的图文指令。我们在一个基于官方软件文档的新基准上评估了教程生成质量。我们进一步证明,这种蒸馏表示有利于(i)人类学习,通过自动生成多模态教程,以及(ii)智能体学习,通过改进下游GUI智能体规划和泛化。实验表明,Demo2Tutorial生成的高质量教程超越了人工编写的教程,并显著优于基线方法,同时实现了更快的人类任务完成和更好的GUI智能体规划,证明从人类经验中蒸馏的结构化教程可以作为有效知识表示,促进人类学习和智能体能力。代码和数据将在https://this https URL提供。

英文摘要

Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.

2606.03949 2026-06-03 cs.RO

Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

偏好校准的人机协同强化学习用于机器人操作

Zeyi Liu, Guangyao Liu, Yinuo Qu, Yuquan Xue, Bofang Jia, Chunhua Yang, Weihua Gui, Keke Huang, Ziwei Wang

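AI summary: Proposes the PACT framework, which uses the implicit preference signals from interventions for credit reassignment and policy alignment, improving the sample efficiency and performance of human-in-the-loop reinforcement learning.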

详情
Comments
Submitted to CoRL2026
AI中文摘要

人机协同强化学习(HIL-RL)通过在线人类干预提高了真实机器人操作中的样本效率。然而,成功的轨迹可能包含偏离期望任务执行路径并迫使人类干预的次优动作。现有的HIL-RL方法通常对所有转换应用一致的信用分配原则,通过次优段均匀传播折扣终端奖励,忽略了每个转换对任务成功的实际贡献。这高估了评论家学习的Q值,并间接误导演员更新朝向次优行为模式。为此,我们提出了PACT,一种偏好校准的演员-评论家训练框架,利用干预引起的隐式偏好信号对识别出的次优段进行信用重分配,同时直接指导策略训练以实现无偏的评论家-演员学习。具体来说,我们首先设计了一个从人类演示中学习并识别次优段进行信用校正的进度模型。然后,从干预状态下的人类动作和重采样策略动作中,我们构建偏好对来定义一个反事实优势,惩罚识别出的次优段的贝尔曼目标,实现方向性信用校准。此外,我们在有界均值空间中直接将策略与人类纠正动作对齐,提供了评论家引导更新之外的额外信号。在五个真实机器人操作任务中,PACT将平均成功率提高了24.5%,并实现了1.3倍的更快收敛,从而提高了强化学习的样本效率和性能。代码可在https://this URL获取。

英文摘要

Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply the consistent credit assignment principle to all transitions, uniformly propagating discounted terminal rewards through suboptimal segments, ignoring the actual contribution of each transition to task success. This overestimates Q-values for critic learning and indirectly misguides actor updates toward suboptimal behavior patterns. To this end, we propose PACT, a Preference-calibrated Actor-Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstration and identifies suboptimal segments for credit correction. Then, from the human action and resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL-A1X-BC05.

2606.03948 2026-06-03 cs.CL

A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

CUNI 提交至 IWSLT 2026 的用于同声传译的袖珍离线模型

Aziz Sharipov Ortega, Dominik Macháček

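AI summary: Adds simultaneous translation capability to the offline direct speech-to-text translation model Canary by combining it with the state-of-the-art AlignAtt policy, and submits systems for Czech to English and English to German and Italian to the IWSLT 2026 Simultaneous Speech Translation shared task, showing high translation quality, low computational requirements, and multilingual support.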

详情
Comments
IWSLT 2026
AI中文摘要

我们使用最先进的策略 AlignAtt,为离线直接语音到文本翻译模型 Canary 实现了同声传译能力,并将其提交至 IWSLT 2026 同声传译共享任务,涵盖捷克语到英语以及英语到德语和意大利语的翻译。我们系统的优势在于:(1) 高翻译质量,在计算无关的模拟中,无论是在低延迟还是高延迟场景下,均优于类似规模的基线系统;(2) 低计算需求,模型仅有 10 亿参数;(3) 多语言能力——支持 25 种源语言和 25 种目标语言。

英文摘要

We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to the IWSLT 2026 Simultaneous Speech Translation Shared Task for Czech to English and English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines both in low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality -- support for 25 source and 25 target languages.

2606.03946 2026-06-03 cs.DB cs.LG cs.LO

MLSkip: Data Skipping for ML Filters via Lightweight Metadata

MLSkip: 通过轻量级元数据实现ML过滤器的数据跳过

Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

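AI summary: Because traditional data skipping techniques cannot be applied to ML filters, proposes using Parquet's default min-max metadata together with an enhanced 2D convex hull metadata structure for efficient predicate pruning, reaching an average pruning effectiveness of 38.31%.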

详情
AI中文摘要

数据库厂商最近发布了可用于过滤器谓词的AI函数。由于这些函数通常依赖于昂贵且黑盒的ML模型,它们带来了新的数据管理挑战。具体而言,针对整数和字符串数据的传统数据跳过技术无法适用于这种新型过滤器。实际上,目前还没有已知的机制用于剪枝不合格的行组,例如从blob存储读取文件时。在这项工作中,我们首次研究了ML过滤器的数据跳过技术。我们论证了Parquet默认的min-max元数据足以实现剪枝。为此,我们联系了两条研究路线:(i) 最近提出的ML模型查询语言和(ii) 神经网络验证。我们在ReLU架构上的初步结果表明,在TPC-H和TPC-DS表上,选择性低于0.1%的过滤器的平均剪枝效果为27.4%。最后,受空间连接研究的启发,我们提出了一种增强的元数据结构:一个有大小限制的二维凸包,验证工具可以更好地利用它,将剪枝效果提高到38.31%,同时每个行组和列对最多占用45字节。我们观察到在DuckDB中相对于PyTorch的端到端加速比为1.07倍。

英文摘要

Database vendors recently released AI functions that can be used in filter predicates. As such functions often rely on costly, black-box ML models, they unveil new data management challenges. Concretely, traditional data skipping techniques for integer and string data fail to be applicable to the new filter type. Indeed, there is no known mechanism for pruning non-qualifying row groups, e.g., when reading files from blob storage. In this work, we initiate the study of data skipping techniques for ML filters. We make the case that Parquet's default min-max metadata is enough to enable pruning. To this end, we draw connections to two lines of research: (i) the recently proposed query language for ML models and (ii) neural network verification. Our preliminary results on ReLU architectures show that on tables from TPC-H and TPC-DS, the average pruning effectiveness for filters of selectivity below 0.1% amounts to 27.4%. Finally, inspired by research on spatial joins, we propose an enhanced metadata structure: a size-bounded 2D convex hull that verification tools can make better use of, increasing the pruning effectiveness to 38.31%, while occupying at most 45 bytes per row group and column pair. We observe an end-to-end speedup of 1.07$\times$ over PyTorch in DuckDB.
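
To make the connection between min-max metadata and neural network verification concrete, here is a minimal sketch using interval bound propagation, one simple verification technique, on a tiny ReLU network. It assumes a filter of the form `model(x) > threshold` and treats a row group's per-column minima and maxima as an input box; the network weights, the threshold, and all function names are invented for illustration, and the verification tools actually used in the paper are not specified in the abstract.

```python
def affine_bounds(lo, hi, weights, bias):
    """Propagate an input box [lo, hi] through y = W x + b.
    A positive weight takes its lower bound from lo, a negative one from hi."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        out_lo.append(b + sum(w * (l if w >= 0 else h) for w, l, h in zip(row, lo, hi)))
        out_hi.append(b + sum(w * (h if w >= 0 else l) for w, l, h in zip(row, lo, hi)))
    return out_lo, out_hi

def output_upper_bound(col_min, col_max, layers):
    """Sound upper bound on a single-output ReLU network over the box."""
    lo, hi = list(col_min), list(col_max)
    for index, (weights, bias) in enumerate(layers):
        lo, hi = affine_bounds(lo, hi, weights, bias)
        if index < len(layers) - 1:  # ReLU on hidden layers only
            lo = [max(0.0, v) for v in lo]
            hi = [max(0.0, v) for v in hi]
    return hi[0]

def can_skip_row_group(col_min, col_max, layers, threshold):
    """Skip when no row in the box can satisfy model(x) > threshold."""
    return output_upper_bound(col_min, col_max, layers) <= threshold

# Hypothetical two-input network with one hidden layer of two ReLU units.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]
skip_small = can_skip_row_group([0.0, 0.0], [1.0, 1.0], layers, threshold=2.0)
skip_large = can_skip_row_group([0.0, 0.0], [4.0, 4.0], layers, threshold=2.0)
```

The bound is conservative: a row group is skipped only when the filter provably cannot match, so pruning never drops qualifying rows, though loose bounds can leave some prunable row groups unread-skipped.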