arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30320 2026-05-29 cs.CV

MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos

MonoPhysics: 从单目视频估计几何、外观和物理参数

Daniel Rho, Jun Myeong Choi, Matthew Thornton, Biswadip Dey, Roni Sengupta

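AI summary: Proposes MonoPhysics, a framework that uses differentiable MPM simulation and 3D Gaussian Splatting to jointly optimize the geometry, appearance, and physical parameters of deformable objects from monocular video, addressing scale ambiguity and inaccurate geometry.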

详情
AI中文摘要

现有的逆物理方法从多视角视频中恢复物理参数,其中跨视角的几何约束解决了尺度和3D结构问题。然而,在单目设置中,这种约束缺失,导致严重的尺度模糊、不准确的几何以及外观优化与物理模拟之间的弱耦合。我们提出MonoPhysics,一个用于可变形物体的单目逆物理估计框架,使用可微分MPM模拟和3D高斯泼溅,从单个相机视角联合优化几何、外观和物理参数。我们通过三个视觉-物理桥梁解决这些挑战:全局尺度对齐、物理感知的几何细化以及可微分位置图,这些共同使得仅从单目观测就能进行准确优化。我们在Vid2Sim和我们新的弹性和塑性物体数据集上评估,结果表明MonoPhysics在单目设置中优于现有基线,并且仅使用单个相机就能达到与多视角基线相当的性能。我们的项目页面可在https://daniel03c1.github.io/MonoPhysics/获取。

英文摘要

Existing inverse physics methods recover physical parameters from multi-view videos, where geometric constraints across views resolve scale and 3D structure. In monocular settings, however, such constraints are absent, leading to severe scale ambiguity, inaccurate geometry, and weak coupling between appearance optimization and physical simulation. We propose MonoPhysics, a framework for monocular inverse physics estimation of deformable objects using differentiable MPM simulation and 3D Gaussian Splatting, which jointly optimizes geometry, appearance, and physical parameters from a single camera view. We address these challenges through three visual-physical bridges: global scale alignment, physics-aware geometry refinement, and a differentiable position map, which together enable accurate optimization from monocular observations alone. We evaluate on Vid2Sim and our new dataset of elastic and plastic objects, showing that MonoPhysics outperforms existing baselines in monocular settings and achieves performance comparable to multi-view baselines using only a single camera. Our project page is available at https://daniel03c1.github.io/MonoPhysics/

2605.30319 2026-05-29 stat.ML cs.AI cs.DS cs.LG math.ST stat.TH

Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion

通过矩阵补全改进异质性处理效应估计的保证

Anay Mehrotra, Phuc Tran, Van H. Vu, Manolis Zampetakis

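AI summary: For heterogeneous treatment-effect estimation with panel data, proposes a simple, efficient matrix-completion estimator that achieves row-wise $\ell_2$ error $\tilde{O}(\sqrt{1/n + n/m^2})$ under low-rank assumptions, and establishes the first row-wise $\ell_2$ perturbation bound for low-rank approximation.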

详情
AI中文摘要

现代因果推断的一个核心目标是估计异质性处理效应,以回答诸如“干预如何影响每个单元”的问题,而不仅仅是平均效应。我们研究面板数据下的该问题,其中我们观察到$n$个单元在$m$个时间点上的数据,且处理分配未知且非均匀。该设置中的数据自然表示为所有单元-时间处理效应的矩阵。估计异质性处理效应可以表示为对该矩阵中每一行平均值的良好估计。这使我们能够将问题表述为矩阵补全,在自然低秩假设下可解。然而,现有的矩阵补全保证不足以得到估计异质性处理效应所需的每行保证的有意义界;粗略地说,它们仅适用于估计平均处理效应界,正如最近一系列工作所示。我们给出一个简单、计算高效的估计器,在不知道倾向性且标准低秩和正则性假设下,实现行向$\ell_2$误差$\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$。在技术上,我们的分析首次建立了低秩逼近的尖锐行向$\ell_2$扰动界,补充了现有的谱、Frobenius和逐元素扰动理论。

英文摘要

A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention affect each unit?" rather than only how it acts on average. We study this problem with panel data, where we observe $n$ units across $m$ times under unknown, non-uniform treatment assignments. The data in this setting is naturally represented as a matrix of all unit--time treatment effects. Estimating heterogeneous treatment effects can then be expressed as obtaining a good estimate of each row's average in this matrix. This allows us to formulate the problem as matrix completion, which can be solved under natural low-rankness assumptions. However, existing matrix-completion guarantees are not powerful enough to yield meaningful bounds of the per-row kind required for estimating the heterogeneous treatment effect; roughly speaking, they are only useful for average treatment effect bounds, as also illustrated in a recent line of work. We give a simple, computationally efficient estimator that, without knowledge of the propensities and under standard low-rankness and regularity assumptions, achieves a row-wise $\ell_2$ error of $\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$. Technically, our analysis establishes the first sharp row-wise $\ell_2$-perturbation bound for low-rank approximation, complementing existing spectral, Frobenius, and entrywise perturbation theory.
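To make the "row averages of a completed matrix" formulation concrete, here is a minimal sketch, not the authors' estimator. It assumes a uniform observation probability (the paper handles unknown, non-uniform propensities) and uses a plain truncated SVD as the completion step. The function name, the `rank` argument, and the rescaling by the empirical observation rate are all illustrative assumptions.

```python
import numpy as np

def row_means_via_low_rank(observed, mask, rank):
    """Estimate each row's average of a partially observed matrix.

    observed: (n, m) array; values where mask is False are ignored.
    mask:     (n, m) boolean array, True where an entry was observed.
    rank:     assumed rank of the underlying matrix (hypothetical input).
    """
    # Assumption: every entry is observed with the same probability.
    p_hat = mask.mean()
    # Zero-fill missing entries and rescale so the filled matrix is
    # an unbiased estimate of the full matrix under uniform sampling.
    filled = np.where(mask, observed, 0.0) / p_hat
    # Keep only the top `rank` singular directions (low-rank approximation).
    u, s, vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # One estimate per unit: the average of that unit's row.
    return low_rank.mean(axis=1)
```

The returned vector holds one number per unit, so the quantity the abstract bounds corresponds to the error of this vector measured in $\ell_2$ rather than to the error of a single overall average.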

2605.30318 2026-05-29 cs.GR cs.AI cs.CV

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

快门之前:3D场景中美学的且可执行的人像摄影规划

Ruixiang Jiang, Chang Wen Chen

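AI summary: Proposes a method that generates human pose, camera, lighting, and exposure plans for portraits in 3D scenes; it builds a Photographic Scene Graph to drive aesthetic-guided planning and yields visually compelling portraits that are geometrically and photometrically feasible.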

详情
AI中文摘要

人像摄影在很大程度上是在快门打开之前决定的:主体的姿态、相机配置和照明设备必须在周围的3D场景中协调。相比之下,大多数现有的计算方法侧重于2D图像空间中的后期制作,例如修饰、重新照明或编辑已经存在的图像;捕获前的摄影规划仍然很大程度上未被探索。我们引入了3D美学人像规划,即生成人体姿态、相机、照明和曝光计划的任务,这些计划在满足3D场景中的几何和光度可行性的同时,产生视觉上引人注目的人像。我们的方法构建了一个摄影场景图,该图表示场景可供性、主体-场景关系以及与人像相关的照明结构。基于这种表示,我们对先前的尝试和当前的取景器观察进行美学引导的比较规划。在多样化的室内和室外场景中的实验表明,我们的方法生成的人像比竞争基线更受人类评分者和MLLM评估者的青睐,同时保持高物理合理性。总之,我们的结果指明了从捕获后校正走向捕获前计算人像规划的道路。项目仓库:https://github.com/songrise/Before-the-Shutter

英文摘要

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

2605.30317 2026-05-29 cs.CV

VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

VPG: 视觉前缀引导的自回归图像与视频生成

Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao

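AI summary: Proposes VPG, a training-free, inference-time guidance method that improves next-step prediction in autoregressive image and video generation by contrasting the model's output under the generated prefix with its output under a corrupted prefix, raising generation quality.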

详情
AI中文摘要

自回归图像和视频生成器在训练时使用教师强制历史,但在推理时必须从自身生成的前缀中采样,因此容易受到曝光偏差和前缀漂移的影响。现有的补救方法要么修改训练,要么应用主要针对外部语义条件(如类别标签或文本提示)的采样时引导,而不是测试下一步预测是否为生成的前缀本身提供强大的后验支持。我们提出视觉前缀引导(VPG),一种用于自回归图像和视频生成的无需训练、推理时引导方法。VPG通过对比模型在生成前缀下的输出与在损坏前缀下的输出,然后将logits外推到加强生成前缀后验支持的候选者,从而改进下一步预测。在基于VAR的类别条件图像生成、基于Infinity的文本到图像生成以及基于InfinityStar的文本到视频生成中,VPG在不重新训练基础模型的情况下提高了生成质量,平均将VAR上的FID降低了0.36,并在图像和视频生成上均提升了基准性能。

英文摘要

Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
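The contrast-and-extrapolate step can be pictured with a short sketch. The abstract does not give the exact update rule, so the form below borrows the familiar classifier-free-guidance-style extrapolation as an assumption; the function name and the `scale` parameter are hypothetical.

```python
import numpy as np

def prefix_guided_logits(logits_generated, logits_corrupted, scale=1.5):
    """Extrapolate next-step logits away from the corrupted-prefix prediction.

    logits_generated: logits computed with the model's own generated prefix.
    logits_corrupted: logits computed with a corrupted copy of that prefix.
    scale:            guidance strength (assumed); scale=1 returns the
                      generated-prefix logits unchanged.
    """
    logits_generated = np.asarray(logits_generated, dtype=float)
    logits_corrupted = np.asarray(logits_corrupted, dtype=float)
    # Candidates whose score depends on the intact prefix get boosted.
    return logits_corrupted + scale * (logits_generated - logits_corrupted)
```

With `scale` above 1, tokens that the intact prefix favors more than the corrupted one gain probability mass, which is the intuition behind strengthening support for the generated prefix.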

2605.30315 2026-05-29 cs.CL cs.LG

Resolution Diagnostics for Paired LLM Evaluation

配对LLM评估的分辨率诊断

Anany Kotawala

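AI summary: Observing that many pairwise rankings on public LLM leaderboards fall short of a conventional paired-test resolution target, frames paired evaluation as hypothesis testing, introduces the resolution ratio q = N/N* as the primary diagnostic, and shows that the common unpaired Cohen-h-plus-(1-rho) shortcut is off by roughly a factor of two in the close-comparison regime.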

详情
Comments
16 pages, 7 figures, 12 tables. Accepted to the ICML 2026 Workshop on Hypothesis Testing, Seoul, South Korea, 2026. Copyright 2026 by the author(s)
AI中文摘要

在两个公开的LLM排行榜中,许多显示的配对排名在实际配对评估设计下未达到常规配对检验的分辨率目标:在Open LLM Leaderboard v1的40个配对比较中,有11个未解决;在MMLU-Pro前10名相邻排名配对中,9个中有4个未解决(在(alpha, 1-beta) = (0.05, 0.8)下)。在真实的主题级聚类下,MMLU-Pro未解决数上升至6/9,并且在99.9%的类别自助重采样中保持9个中的5-6个未解决。我们将配对LLM评估构建为一个假设检验问题,反转水平alpha、功效(1-beta)的检验,并报告每对的分辨率比q = N/N*作为主要诊断指标。一个具有显式二阶常数的尖锐小效应展开表明,广泛使用的非配对Cohen-h-plus-(1-rho)捷径在接近比较区域与正确的N*偏差约两倍,当用户将其每臂输出乘以(1-rho)时,五个现成计算器中的三个(Cohen 1988, G*Power, R pwr)会无声地继承这一缺陷。在多重校正和任意有效序贯检验下,未解决配对模式仍然存在。

英文摘要

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern persists under multiplicity correction and anytime-valid sequential testing.
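The resolution ratio q = N/N* can be illustrated with a small calculation. The abstract does not spell out how N* is derived, so this sketch substitutes a textbook normal-approximation sample size for a two-sided McNemar-type paired test on binary outcomes. The discordant-rate inputs `p10` and `p01` and both function names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def paired_required_n(p10, p01, alpha=0.05, power=0.8):
    """Approximate number of paired items needed to detect p10 != p01.

    p10: fraction of items model A gets right and model B gets wrong.
    p01: fraction of items model B gets right and model A gets wrong.
    Uses a standard normal approximation (assumed, not from the paper).
    """
    discordant = p10 + p01          # only discordant items carry signal
    delta = abs(p10 - p01)          # accuracy gap between the two models
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = z_alpha * sqrt(discordant) + z_beta * sqrt(discordant - delta ** 2)
    return numerator ** 2 / delta ** 2

def resolution_ratio(n_items, p10, p01, alpha=0.05, power=0.8):
    """q = N / N*: values below 1 mean the comparison is unresolved."""
    return n_items / paired_required_n(p10, p01, alpha, power)
```

For example, with `p10 = 0.06` and `p01 = 0.04` (a 2-point accuracy gap) this approximation asks for roughly two thousand paired items, so a 1,000-item benchmark would have q of about 0.5 and the ranking would count as unresolved at (0.05, 0.8).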

2605.30311 2026-05-29 cs.CV cs.AI

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Archon:面向整体数字人生成的统一多模态模型

Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

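AI summary: Proposes Archon, a fully pretrained, human-centric unified multimodal model that uses modality-specific tokenizers, semantic video reparameterization, and a "Thinking in Modality" strategy for holistic digital human generation across seven modalities, including text, audio, motion, and vision.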

详情
Comments
Accepted to CVPR 2026. Project Page: https://zju3dv.github.io/archon/
AI中文摘要

数字人是沉浸式交互的基础,然而创建一个统一模型来处理包括文本、音频、动作和视觉内容在内的整体模态仍然是一个开放的挑战。在本文中,我们提出了Archon,一个完全预训练的、以人为中心的统一多模态模型,用于整体虚拟形象生成。Archon通过模态特定分词器统一了七种模态,并利用一个在同步模态和72个不同任务上预训练的原生自回归统一多模态模型来建模整体联合分布。为了解决高保真说话视频中的标记爆炸挑战,我们引入了一种内存高效的语义视频重参数化方法,在保持细粒度动态的同时实现了4倍的标记减少,并结合了一个语义驱动的视频扩散解码器。我们进一步提出了一种“模态思维”,它将模糊的跨模态任务分解为替代模态链中的逐步思维,逐步增强保真度和可控性。大量实验表明,Archon在各种数字人生成任务中实现了优越或可比的性能,验证了我们统一框架的有效性。项目页面:https://zju3dv.github.io/archon/。

英文摘要

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers and uses a native autoregressive unified multimodal model, pretrained on synchronized modalities and 72 diverse tasks, to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" strategy that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.

2605.30310 2026-05-29 cs.CV cs.AI cs.GR

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

City-Mesh3R:面向仿真就绪的城市级多视图三维网格重建

Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

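AI summary: Proposes City-Mesh3R, a framework that uses a divide-and-conquer strategy to reconstruct watertight surface meshes end to end from large unordered image collections, addressing the incomplete geometry, irregular surfaces, and computational complexity of city-scale reconstruction.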

详情
Comments
Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: https://citymesh3r.github.io/
AI中文摘要

从多视图图像进行城市级三维表面重建以支持下游三维仿真,由于城市场景的规模和复杂性,带来了极具挑战性的问题。现有的基于NeRF、高斯泼溅等方法的城市级三维重建技术,常因几何不完整/缺失以及不规则、噪声表面而无法恢复可用于仿真的三维网格。将现有小规模三维重建方法扩展到任意大规模城市场景因计算复杂而不可行。我们提出City-Mesh3R,一个可扩展的框架,直接从大规模无序图像集合重建水密表面网格。与近期使用全局稀疏SfM点云初始化后分布式稠密重建大规模场景的方法不同,我们的方法采用分治策略,遵循端到端的图像到网格三维重建流程。通过拓扑图像聚类、聚类独立稀疏SfM和地图合并重建稀疏城市地图,无需穷举图像特征匹配。然后对该地图进行空间划分,执行几何感知的相机选择,接着进行稠密表面重建,并使用曲率感知的自适应顶点密度重网格化进行表面细化。这些分区网格随后拼接成城市全局网格。所提出的端到端框架在城市级重建数据集上进行了评估。定性和定量结果表明,我们的方法能生成具有规则几何、捕捉精细表面细节的高保真水密三维网格,并因其分布式端到端处理而适用于任意大规模场景。

英文摘要

City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar representations often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by distributed dense 3D reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. These partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As demonstrated by our qualitative and quantitative results, our method yields high-fidelity watertight 3D meshes with regular geometry that capture fine surface details, and it scales to arbitrarily large scenes owing to end-to-end processing in a distributed setting.

2605.30307 2026-05-29 cs.CV

Grounded 3D-Aware Spatial Vision-Language Modeling

基于三维感知的空间视觉语言建模

An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu Yin, Sifei Liu

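AI summary: Proposes GR3D, a model that combines three complementary grounding capabilities (explicit 2D, implicit 2D, and monocular 3D grounding) in a single framework to support spatial chain-of-thought reasoning, with consistent gains on grounded and non-grounded spatial benchmarks.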

详情
Comments
CVPR 2026 https://www.anjiecheng.me/gr3d
AI中文摘要

我们提出了GR3D,一个空间视觉语言模型,在单一框架内配备了三种互补的定位能力——显式2D定位、隐式2D定位和单目3D定位。GR3D引入了一种隐式定位机制,在生成过程中识别实体提及,并将相应的区域标记插入文本流中,使模型在生成空间链式推理响应时能够即时引用视觉证据。同时,一种区域提示的单目3D定位设计从定位的区域查询中预测相机视图中的3D边界框,并由内在感知归一化和密集几何监督支持。这些定位能力共同使GR3D能够将复杂的空间理解问题分解为定位的2D感知,随后进行3D推理。GR3D在定位和非定位空间基准上均取得了一致的改进,证明了定位作为增强VLM空间理解的有效归纳偏差。这些定位能力共同增强了超越定位任务本身的通用空间理解。

英文摘要

We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.

2605.30295 2026-05-29 cs.CL cs.AI

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

MedCase-Structured:用于在临床真实EHR环境中基准测试诊断推理的文本到FHIR数据集

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

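AI summary: Presents a pipeline that generates clinically realistic HL7 FHIR R4 bundles from unstructured text and uses it to build the MedCase-Structured dataset; LLMs are less accurate at diagnosis on structured FHIR inputs than on plain text, underscoring the importance of deployment-aligned benchmarking.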

详情
Comments
Accepted to ICML 2026 Structured Data for Health Workshop
AI中文摘要

大型语言模型(LLMs)在临床推理和决策支持方面显示出潜力,但在真实、与电子健康记录一致的环境中的评估仍然有限。现有的基准测试通常依赖于静态数据集或不反映临床系统中使用的结构化、可互操作数据格式的非结构化输入。我们引入了一个从非结构化文本生成临床真实HL7 FHIR R4数据包的流水线,从而实现对临床决策支持系统的可控评估。该流水线将分阶段LLM生成与基于术语的验证和修复相结合,以减少幻觉代码并强制结构和语义一致性。将此方法应用于MedCaseReasoning,我们构建了MedCase-Structured,这是一个与临床医生编写的诊断案例对齐的合成数据集,实现了82.5%案例的有效FHIR生成。在MedCase-Structured上的评估显示,LLMs在结构化FHIR输入上的诊断准确性始终低于纯文本,突出了部署对齐基准测试的重要性。

英文摘要

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.
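To show what "terminology-grounded validation" of a generated bundle might look like, here is a toy sketch. It is not the paper's pipeline: the lookup table stands in for a real terminology service, and the helper names and the `Patient/example` reference are hypothetical. The bundle layout follows the general FHIR R4 shape (a `Bundle` whose entries wrap resources such as `Condition`).

```python
# Toy terminology table (assumption): a real pipeline would query a
# terminology server rather than a hard-coded set.
KNOWN_SNOMED_CODES = {"44054006"}  # Diabetes mellitus type 2 (disorder)

SNOMED_SYSTEM = "http://snomed.info/sct"

def make_condition_bundle(snomed_code, display):
    """Build a minimal Bundle holding one Condition (illustrative only)."""
    return {
        "resourceType": "Bundle",
        "type": "collection",
        "entry": [{
            "resource": {
                "resourceType": "Condition",
                "subject": {"reference": "Patient/example"},
                "code": {"coding": [{
                    "system": SNOMED_SYSTEM,
                    "code": snomed_code,
                    "display": display,
                }]},
            }
        }],
    }

def find_unknown_codes(bundle, known_codes):
    """Return SNOMED codes in the bundle that are absent from known_codes."""
    unknown = []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        for coding in resource.get("code", {}).get("coding", []):
            is_snomed = coding.get("system") == SNOMED_SYSTEM
            if is_snomed and coding.get("code") not in known_codes:
                unknown.append(coding.get("code"))
    return unknown
```

A non-empty result flags codes that may have been hallucinated, which a repair stage could then regenerate or map to a valid concept.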

2605.29169 2026-05-29 cs.CR cs.AI

Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices

整格与模格中进化筛法的领域信息表示

Ahmad Tashfeen, Qi Cheng

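AI summary: For the Shortest Vector Problem (SVP) in lattice cryptography, improves the genetic-algorithm treatment of the sieve of Ajtai et al. by introducing a domain-informed representation and crossover, and extends it naturally to module lattices.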

详情
Journal ref
Lecture Notes in Computer Science 16524 (2026) 133-148
Comments
Published (16 pages) in the proceedings of EvoApplications 2026. You may find the proceedings version here at https://link.springer.com/chapter/10.1007/978-3-032-23604-3_9
AI中文摘要

传统密码学基于整数分解或离散对数等问题,不可避免地容易受到全功能量子计算机的攻击。尽管这仍是一个工程前沿,但迫在眉睫的威胁延伸到今天存储的加密数据,这些数据将来可能被具有量子能力的计算机解密。为了防范这种可能性,现代量子安全密码学的支柱是最短向量问题(SVP)。我们通过引入领域信息的SVP表示和交叉操作,增强了Laarhoven对Ajtai等人筛法作为遗传算法(GA)处理SVP的方法,同时自然地将应用扩展到模格。

英文摘要

Traditional cryptography, rooted in problems such as integer factorisation or discrete logarithms, is inevitably vulnerable to a fully operational quantum computer. Although building such a computer remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, modern quantum-safe cryptography relies on the Shortest Vector Problem (SVP) as its backbone. We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating a domain-informed SVP representation and crossover, while naturally extending its application to module lattices.

2605.28746 2026-05-29 math.OC cs.AI cs.NE

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

偏好形状的期望超体积和R2改进:精确计算与单调性

Michael T. M. Emmerich

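AI summary: Studies preference-shaped expected improvement criteria in Bayesian multiobjective optimization, computes the expected improvement of the hypervolume and R2 indicators exactly, and analyzes their monotonicity and geometric properties.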

详情
Comments
17 pages. Changes v1: added strict Pareto compliance proof; removed missing figure references and redundant graphics section; added Liang et al. 2026 citation in outlook; improved figures and language.
AI中文摘要

本文研究了贝叶斯多目标优化中偏好形状的期望改进准则。我们考虑了两个常用于类似算法目的但几何性质不同的指标族。超体积指标基于一个反乌托邦参考点,测量目标空间中的支配体积。R2指标基于一个乌托邦点,通过加权Tchebycheff标量化包络评估近似集。本文的目的是明确哪些偏好变换保留了精确计算、Pareto兼容性和单调性,哪些变换改变了底层几何。在超体积方面,我们通过Deng表示重新审视了经典的EHVI,在期望坐标中制定了乘积密度加权的EHVI,讨论了基于锥的EHVI作为线性锥变换后的普通EHVI,并将这些情况与截断EHVI区分开来,后者可能违反方差单调性。在R2方面,我们证明精确积分R2改进通常不是普通的目标空间加权超体积。障碍是低维的:Lebesgue密度超体积无法看到Tchebycheff标量化仍能检测到的某些边界贡献。然后我们证明精确积分R2改进恰好是一个标量化空间体积,即当前标量化包络与参考包络之间的Tchebycheff阴影的测度。该表示产生了离散R2的有限和ER2I算法、精确积分R2的求积方法,以及一个成就空间高斯代理公式,其中ER2I是标量高斯期望改进的积分。

英文摘要

This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator families which are often used for similar algorithmic purposes, but which are geometrically different. The hypervolume indicator is based on a dystopian reference point and measures dominated volume in objective space. The R2 indicator is based on a utopian point and evaluates approximation sets through weighted Tchebycheff scalarization envelopes. The purpose of the paper is to make precise which preference transformations preserve exact computation, Pareto compatibility, and monotonicity properties, and which transformations change the underlying geometry. On the hypervolume side, we revisit canonical EHVI through the Deng representation, formulate product-density weighted EHVI in desirability coordinates, discuss cone-based EHVI as ordinary EHVI after a linear cone transformation, and separate these cases from truncated EHVI, where variance monotonicity may fail. On the R2 side, we prove that exact integral R2 improvement is not, in general, an ordinary objective-space weighted hypervolume. The obstruction is lower-dimensional: Lebesgue-density hypervolume cannot see certain boundary contributions that Tchebycheff scalarizations still detect. We then show that exact integral R2 improvement is exactly a scalarization-space volume, namely the measure of the Tchebycheff shadow between the incumbent scalarization envelope and the reference envelope. This representation yields finite-sum ER2I algorithms for discrete R2, quadrature methods for exact integral R2, and an achievement-space Gaussian surrogate formulation in which ER2I is an integral of scalar Gaussian expected improvements.
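Two building blocks named in the abstract have simple closed forms: the weighted Tchebycheff scalarization and the scalar Gaussian expected improvement. The sketch below writes them for minimization. The function names are illustrative, and the code does not reproduce the paper's ER2I algorithms.

```python
from math import erf, exp, pi, sqrt

def tchebycheff(objectives, weights, utopian):
    """Weighted Tchebycheff scalarization: max_i w_i * |f_i - u_i|."""
    return max(w * abs(f - u) for f, w, u in zip(objectives, weights, utopian))

def gaussian_expected_improvement(mu, sigma, best):
    """Expected improvement over `best` for a N(mu, sigma^2) prediction.

    Minimization convention: improvement is max(best - Y, 0).
    """
    if sigma <= 0.0:
        # Degenerate prediction: improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = exp(-0.5 * z * z) / sqrt(2.0 * pi)
    cdf = 0.5 * (1.0 + erf(z / sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf
```

Under the achievement-space formulation described above, one would integrate such scalar expected improvements over the weight vectors. That integration step is left out here because the abstract does not specify the weight measure.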

2605.24244 2026-05-29 stat.ML cs.LG

MEDAL: Manifold Embedding Distillation via Autoencoder Learning

MEDAL: 通过自编码器学习的流形嵌入蒸馏

Irene Chang, Tarek M. Zikry, Genevera I. Allen

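AI summary: Proposes MEDAL, a framework that distills a manifold embedding into a reusable encoder-decoder model via a constrained autoencoder, enabling held-out validation, hyperparameter selection, and distribution-shift detection.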

详情
AI中文摘要

低维嵌入被广泛用作高维数据的视觉摘要,并支持下游科学发现。然而,流行的非线性降维方法(如t-SNE和UMAP)通常仅根据视觉吸引力选择,缺乏严格的定量验证。主要原因是流形嵌入通常不提供样本外映射或返回原始特征空间的逆映射;这使得留出验证(监督学习的黄金标准)几乎不可能。为了解决这些挑战,我们开发了一个新颖的框架MEDAL(通过自编码器学习的流形嵌入蒸馏),它将拟合的流形嵌入蒸馏为可复用的编码器-解码器模型。MEDAL训练一个约束自编码器,其瓶颈精确匹配任何教师嵌入,而解码器重建原始输入;这为新样本提供了显式映射、近似逆映射以及流形空间中基于逐点重建的失真度量。这将静态流形嵌入转换为可在留出数据上评估的模型,从而实现定量验证,包括比较不同降维方法以及超参数调优。在多个基准和科学案例研究中,我们展示了MEDAL能够通过留出验证确定最优流形嵌入和超参数,揭示难以在二维嵌入中保留的生物相干区域,并在新样本映射到固定参考流形时检测分布偏移。MEDAL为任何现有降维技术提供了一个通用验证包装器,将提高科学工作流中降维的严谨性和可靠性。

英文摘要

Low-dimensional embeddings are widely used as visual summaries of high-dimensional data and to enable downstream scientific discoveries. Yet popular nonlinear dimension reduction methods, such as t-SNE and UMAP, are often selected based on visual appeal alone, without rigorous quantitative validation. A major reason is that manifold embeddings typically provide neither an out-of-sample map nor an inverse back to the original feature space; this makes held-out validation, the gold standard in supervised learning, all but impossible. To address these challenges, we develop a novel framework, MEDAL (Manifold Embedding Distillation via Autoencoder Learning), which distills a fitted manifold embedding into a reusable encoder--decoder model. MEDAL trains a constrained autoencoder whose bottleneck exactly matches any teacher embedding while the decoder reconstructs the original input; this yields an explicit map for new samples, an approximate inverse, and a pointwise reconstruction-based measure of distortion in the manifold space. This converts static manifold embeddings into models that can be evaluated on held-out data, enabling quantitative validation, including comparison of different dimension reduction methods and hyperparameter tuning. Across multiple benchmark and scientific case studies, we show that MEDAL enables held-out validation to determine optimal manifold embeddings and hyperparameters, reveals biologically coherent regions that are difficult to preserve in two-dimensional embeddings, and detects distribution shift when new samples are mapped into a fixed reference manifold. MEDAL provides a general validation wrapper for any existing dimension reduction technique, one that will improve the rigor and reliability of dimension reduction in scientific workflows.

2605.21235 2026-05-29 cs.CL

LamPO: A Lambda Style Policy Optimization for Reasoning Language Models

LamPO: 一种用于推理语言模型的Lambda风格策略优化

Redacted by arXiv

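AI summary: Proposes LamPO, which uses a pairwise decomposed advantage and confidence weighting to improve credit assignment and training stability when reinforcement learning with verifiable rewards is applied to reasoning language models.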

详情
Comments
arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为改进推理语言模型在数学、编程和科学问答等任务上的有效范式。然而,广泛使用的组相对目标(如GRPO)用标量统计量总结每个采样组,从而丢弃了候选响应之间的细粒度关系信息。这削弱了稀疏结果奖励下的信用分配,尤其是当多个生成的解决方案仅在推理质量上存在细微差异时。我们提出\textbf{LamPO},一种\textbf{Lambda风格策略优化}方法,它用\emph{成对分解优势}替代标量组优势。LamPO聚合每个响应组内的成对奖励差距,并通过从序列对数概率差异计算出的置信度权重调节每个比较,同时保留PPO风格优化的无评论家和裁剪更新结构。当参考解可用时,我们进一步添加一个轻量级的基于ROUGE-L的密集辅助奖励以减少奖励稀疏性。在AIME24、AIME25、MATH-500和GPQA-Diamond上使用Qwen3-1.7B、Qwen3-4B和Phi-4-mini进行的实验表明,LamPO在更稳定的训练动态和更好的样本效率下,持续优于GRPO和最近的RLVR变体。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving reasoning language models on tasks such as mathematics, coding, and scientific question answering. However, widely used group-relative objectives, such as GRPO, summarize each sampled group with scalar statistics and therefore discard fine-grained relational information among candidate responses. This weakens credit assignment under sparse outcome rewards, especially when multiple generated solutions differ only subtly in reasoning quality. We propose \textbf{LamPO}, a \textbf{Lambda-Style Policy Optimization} method that replaces scalar group advantages with a \emph{Pairwise Decomposed Advantage}. LamPO aggregates pairwise reward gaps within each response group and modulates each comparison by a confidence-aware weight computed from sequence log-probability differences, while retaining the critic-free and clipped-update structure of PPO-style optimization. When reference solutions are available, we further add a lightweight ROUGE-L-based dense auxiliary reward to reduce reward sparsity. Experiments on AIME24, AIME25, MATH-500, and GPQA-Diamond with Qwen3-1.7B, Qwen3-4B, and Phi-4-mini show that LamPO consistently improves over GRPO and recent RLVR variants, with more stable training dynamics and better sample efficiency.
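A pairwise decomposed advantage can be sketched in a few lines. The abstract does not state the exact weighting function, so the sigmoid-of-log-probability-gap weight below is an assumption in the spirit of LambdaRank. The function name, `use_confidence`, and `temperature` are hypothetical.

```python
import numpy as np

def pairwise_decomposed_advantages(rewards, seq_logprobs,
                                   use_confidence=True, temperature=1.0):
    """Advantage of each response as a weighted mean of pairwise reward gaps.

    rewards:      one scalar reward per sampled response in the group.
    seq_logprobs: sequence log-probability of each response under the policy.
    """
    r = np.asarray(rewards, dtype=float)
    s = np.asarray(seq_logprobs, dtype=float)
    gaps = r[:, None] - r[None, :]            # gaps[i, j] = r_i - r_j
    if use_confidence:
        # margin > 0: the policy already prefers the higher-reward response,
        # so that comparison is down-weighted (assumed weighting rule).
        margin = np.sign(gaps) * (s[:, None] - s[None, :]) / temperature
        weights = 1.0 / (1.0 + np.exp(margin))
    else:
        weights = np.ones_like(gaps)
    return (weights * gaps).mean(axis=1)
```

With `use_confidence=False` the result collapses to `rewards - rewards.mean()`, the mean-centered scalar baseline, so any extra signal comes from how the weights redistribute credit across pairs.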

2605.19416 2026-05-29 cs.CL

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

LambdaPO: 一种用于推理语言模型的Lambda风格策略优化

Redacted by arXiv

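AI summary: To address GRPO's loss of fine-grained preference information from using the group mean as a baseline, proposes LambdaPO, which decomposes advantage estimation into a pairwise preference structure and adds a semantic density reward, mining finer-grained optimization signals from group rollouts to improve reasoning performance.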

详情
Comments
arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission. Author list and submitter name redacted due to disputed authorship
AI中文摘要

群体相对策略优化(GRPO)已成为现代强化学习对齐的基石,因其通过利用跨采样轨迹群体的奖励归一化来避免显式价值评判器而备受推崇。然而,该方法依赖于单一的统计基线(如群体均值),将轨迹空间的关联拓扑压缩为单个标量,从而抹去了在复杂、对排名敏感的奖励景观中导航所必需的细粒度偏好信息。为解决此问题,我们引入了一个新框架——Lambda策略优化(LambdaPO),它通过将优势估计从标量值重新概念化为分解的成对偏好结构来解决这一信息论瓶颈。具体而言,任何给定轨迹的优势被公式化为与其群体中所有同伴的奖励差分的积分和,其中每个成对比较由策略自身对已建立偏好的概率置信度动态衰减。为进一步缓解二元结果监督的稀疏性,我们通过一个语义密度奖励来增强目标,该奖励源自生成推理轨迹与真实解之间的精确率-召回率对齐。因此,我们的方法可以从一组 rollout 中挖掘更细粒度的优化信号,引导大语言模型达到更优的极值。在具有挑战性的数学推理和问答任务上的实验结果表明,LambdaPO相比基线方法提升了性能。

英文摘要

Group Relative Policy Optimization (GRPO) has become a cornerstone of modern reinforcement learning alignment, prized for its efficacy in forgoing an explicit value critic by leveraging reward normalization across sampled trajectory cohorts. However, the method's reliance on a monolithic statistical baseline, such as the group mean, collapses the relational topology of the trajectory space into a single scalar, thereby erasing the fine-grained preference information essential for navigating complex, rank-sensitive reward landscapes. To address this information-theoretic bottleneck, we introduce Lambda Policy Optimization (LambdaPO), a novel framework that re-conceptualizes advantage estimation from a scalar value to a decomposed, pairwise preference structure. Specifically, the advantage for any given trajectory is formulated as the integrated sum of reward differentials against all peers in its cohort, where each pairwise comparison is dynamically attenuated by the policy's own probabilistic confidence in the established preference. To further mitigate the sparsity of binary outcome supervision, we augment the objective with a semantic density reward, derived from the precision-recall alignment between generated reasoning traces and ground-truth solutions. As a result, our method can mine more fine-grained optimization signals from a group of rollouts, guiding the LLM to a better optimum. Experimental results across challenging math reasoning and question-answering tasks demonstrate that LambdaPO improves performance over the baseline methods.

2604.04956 2026-05-29 physics.soc-ph cs.AI cs.CY physics.pop-ph

The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

人工智能加速的行星成本,第二部分:第十个行星边界与6.5年倒计时

William Yicheng Zhu, Lei Zhu

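AI summary: Argues that the exponential scaling of large language models (LLMs) gives "thinking" itself thermodynamic consequences, projects that a planetary heat threshold will be breached within 6.5 years, and proposes that AI heat emissions constitute a tenth planetary boundary.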

详情
Comments
Minor revisions for clarity
AI中文摘要

近期,自主大型语言模型(LLM)代理的超指数级扩展标志着更广泛、根本性的范式转变:从机器主要替代人类双手(体力劳动和机械加工)转向机器代表人类思维(认知、推理和意图)。超出人类有限但高效的生物能力,“思考”本身不受控制的卸载和扩展对人类的热平衡表产生深远影响,因为思考或智能具有热力学后果。地球已经超过了长期生态稳定性所需的热耗散阈值,基于经验数据的预测揭示了一条令人担忧的轨迹:如果没有激进的结构性干预,即使在最理想的情况下(地球能量不平衡(EEI)保持恒定),人为热积累将在不到6.5年内突破关键的行星生态阈值。在这项工作中,我们确定了人工智能中影响全球热耗散率的六个因素,并描述了它们如何相互作用推动社会走向四种宏观轨迹之一。我们提出,人工智能及其热耗散融入行星系统构成了第十个行星边界(9+1)。该边界的核心经验测量是由人工智能指数增长产生的净新增废热,平衡其对减少经济和社会低效率以及因此减少基线人为废热排放的影响。我们证明,管理人工智能扩展缺乏适度的中间地带:它将要么加速关键行星热力学阈值的突破,要么成为稳定其他九个行星边界的最有效杠杆,从而保障人类文明的生存。

英文摘要

The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from machines primarily replacing human hands (manual labor and mechanical processing) to machines acting as delegates for human minds (cognition, reasoning, and intention). The uncontrolled offloading and scaling of "thinking" itself, beyond humans' limited but efficient biological capacity, has profound consequences for humanity's heat balance sheet, since thinking, or intelligence, carries thermodynamic consequences. The Earth has already surpassed the heat dissipation threshold required for long-term ecological stability, and projections based on empirical data reveal a concerning trajectory: without radical structural intervention, anthropogenic heat accumulation will breach critical planetary ecological thresholds in less than 6.5 years, even under the most ideal scenario where Earth Energy Imbalance (EEI) holds constant. In this work, we identify six factors from artificial intelligence that influence the global heat dissipation rate and delineate how their interplay drives society toward one of four broad macroscopic trajectories. We propose that the integration of artificial intelligence and its heat dissipation into the planetary system constitutes the tenth planetary boundary (9+1). The core empirical measurement of this boundary is the net-new waste heat generated by exponential AI growth, balanced against its impact on reducing economic and societal inefficiencies and thus baseline anthropogenic waste heat emissions. We demonstrate that managing AI scaling lacks a moderate middle ground: it will either accelerate the breach of critical planetary thermodynamic thresholds, or it will serve as the single most effective lever for stabilizing the other nine planetary boundaries and thereby safeguarding human civilization's survival.

2603.17942 2026-05-29 cs.CL

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

通过嵌入空间探测的高效无训练多令牌预测

Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott

AI总结 提出ESP方法,利用嵌入空间中的掩码令牌进行无训练的多令牌预测,通过并行验证和轻量剪枝实现无损解码,提升吞吐量。

详情
Comments
v2: Accepted at ICML 2026. Updated experiments replaced tok/s with speedup ratio over AR baseline; improved exposition in Section 3.1 (mask token initialization) and Section 4 (ablations)
AI中文摘要

大型语言模型(LLM)尽管仅针对下一个令牌生成进行训练,但具有潜在的多令牌预测(MTP)能力。我们引入了ESP(嵌入空间探测),一种简单且无需训练的MTP方法,它使用从嵌入空间中实时抽取的掩码令牌来探测LLM,从而无需修改权重或依赖草稿模型即可实现并行未来令牌预测。ESP通过从掩码令牌logits中采样Top-K候选来构建推测性令牌树,并应用轻量级剪枝规则保留高概率的延续。在生成过程中,预测被并行验证,实现无损解码,同时显著减少模型调用次数并增加令牌吞吐量。ESP始终优于现有的无训练基线,在LLaMA3上比LADE提高了7-11%的接受长度,在Qwen3上提高了7-8%,并且吞吐量比最强基线提高了15-19%。最后,我们提供了理论见解和实证证据,表明解码器层自然地将掩码令牌表示与下一个令牌状态对齐,从而无需重新训练或辅助模型即可实现准确的多步预测。

英文摘要

Large Language Models (LLMs) possess latent multi-token prediction (MTP) abilities despite being trained only for next-token generation. We introduce ESP (Embedding-Space Probing), a simple and training-free MTP method that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel future-token prediction without modifying weights or relying on draft models. ESP constructs a speculative token tree by sampling Top-K candidates from mask-token logits and applies a lightweight pruning rule to retain high-probability continuations. During generation, predictions are verified in parallel, yielding lossless decoding while significantly reducing model calls and increasing token throughput. ESP consistently outperforms existing training-free baselines, improving acceptance length by 7-11% over LADE on LLaMA3 and 7-8% on Qwen3, and increasing throughput by up to 15-19% over the strongest baseline. Finally, we provide theoretical insight and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.
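
The tree construction described above (Top-K candidates per future position, then pruning of low-probability continuations) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_pruned_tree`, the joint-probability threshold `min_path_prob`, and the assumption that per-position probabilities can simply be multiplied along a path are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def build_pruned_tree(
    step_probs: List[Dict[str, float]],
    top_k: int,
    min_path_prob: float,
) -> List[Tuple[Path, float]]:
    """Expand Top-K candidates per future position and prune weak paths.

    step_probs[d] maps token -> probability for future position d
    (assumed here to come from the mask-token logits after a softmax).
    A path is kept only if the product of its token probabilities is at
    least min_path_prob (a hypothetical pruning rule).
    """
    frontier: List[Tuple[Path, float]] = [((), 1.0)]
    kept: List[Tuple[Path, float]] = []
    for probs in step_probs:
        # Top-K candidates for this future position.
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        next_frontier: List[Tuple[Path, float]] = []
        for prefix, prefix_prob in frontier:
            for token, token_prob in top:
                joint = prefix_prob * token_prob
                if joint >= min_path_prob:
                    next_frontier.append((prefix + (token,), joint))
        kept.extend(next_frontier)
        frontier = next_frontier
    return kept


tree = build_pruned_tree(
    [{"a": 0.6, "b": 0.3, "c": 0.1}, {"x": 0.7, "y": 0.2, "z": 0.1}],
    top_k=2,
    min_path_prob=0.2,
)
```

In this toy run, the continuations ending in "y" fall below the threshold and are dropped, so only the surviving paths would be sent to the parallel verification step.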

2602.11389 2026-05-29 cs.AI

Causal-JEPA: Learning World Models through Object-Level Latent Masking

Causal-JEPA:通过对象级潜在掩码学习世界模型

Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero

AI总结 提出C-JEPA,一种通过对象级潜在掩码扩展联合嵌入预测的对象中心世界模型,在视觉问答和智能体控制任务中分别提升反事实推理20%和仅用1%潜在特征实现高效规划。

详情
Comments
Project page: https://hazel-heejeong-nam.github.io/cjepa/ (accepted at ICML 2026)
AI中文摘要

世界模型需要稳健的关系理解以支持预测、推理和控制。虽然对象中心表示提供了有用的抽象,但不足以捕捉依赖交互的动态。因此,我们提出C-JEPA,一种简单灵活的对象中心世界模型,将掩码联合嵌入预测从图像块扩展到对象中心表示。通过掩码对象级潜在变量并要求每个掩码对象状态从周围上下文中推断,C-JEPA在训练期间施加了结构化的部分可观测性,创建了类似反事实的预测查询,阻止捷径解决方案,并在学习目标下使依赖交互的预测成为必要。实验上,C-JEPA在视觉问答中取得了一致的提升,与没有对象级掩码的相同架构相比,反事实推理绝对提高了约20%。在智能体控制任务中,C-JEPA仅使用基于块的世界模型所需总潜在输入特征的1%,即可实现相当的性能,从而实现了更高效的规划。最后,我们提供了形式化分析,证明对象级掩码通过控制可观测性引入了有用的归纳偏置。我们的代码可在https://github.com/galilai-group/cjepa获取。

英文摘要

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By masking object-level latents and requiring each masked object state to be inferred from the surrounding context, C-JEPA imposes structured partial observability during training, creating counterfactual-like prediction queries that discourage shortcut solutions and make interaction-dependent prediction necessary under the learning objective. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces useful inductive bias by controlling observability. Our code is available at https://github.com/galilai-group/cjepa.

2601.22139 2026-05-29 cs.CL cs.AI

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

边推理边提问:将推理型大语言模型从被动求解者转变为主动询问者

Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

AI总结 提出主动交互推理(PIR)范式,通过不确定性感知微调和用户模拟器策略优化,使LLM在推理中主动提问以澄清前提和意图不确定性,在数学推理、代码生成和文档编辑任务上显著提升准确率、通过率和BLEU值,同时减少近半推理计算和不必要交互。

详情
Comments
ACL Main Conference
AI中文摘要

面向推理的大语言模型(LLMs)通过思维链(CoT)提示取得了显著进展,但它们仍然受到一种“盲目自我思考”范式的根本限制:即使在关键信息缺失或模糊的情况下,也会进行大量的内部推理。我们提出了主动交互推理(PIR),一种新的推理范式,将LLMs从被动求解者转变为主动询问者,在推理过程中穿插澄清。与现有的主要通过与外部环境交互来解决知识不确定性的搜索或工具框架不同,PIR通过与用户直接交互来解决前提和意图层面的不确定性。PIR通过两个核心组件实现:(1)一种不确定性感知的监督微调过程,赋予模型交互推理能力;(2)一个基于用户模拟器的策略优化框架,由复合奖励驱动,使模型行为与用户意图对齐。在数学推理、代码生成和文档编辑上的大量实验表明,PIR始终优于强基线,准确率提高高达32.70%,通过率提高22.90%,BLEU提升41.36,同时减少近一半的推理计算和不必要的交互轮次。在事实知识、问答和缺失前提场景上的进一步可靠性评估证实了PIR的强大泛化能力和鲁棒性。模型和代码公开于:https://github.com/SUAT-AIRI/Proactive-Interactive-R1

英文摘要

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and a 41.36 BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

2601.07525 2026-05-29 cs.CL cs.AI

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

先思考再约束:大型语言模型的统一解码框架

Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam

AI总结 提出In-Writing混合方法,通过触发令牌将自由形式推理与结构化解码解耦,在分类和推理任务上准确率提升高达27%。

详情
Comments
v2-EMNLP
AI中文摘要

自然生成允许大型语言模型(LLMs)产生具有丰富推理的自由形式响应,但缺乏结构使得输出难以验证。相反,约束解码确保标准化格式,但可能在生成过程中过早施加约束,从而无意中限制推理能力。我们提出一种混合方法,即In-Writing,它在单次调用中结合了自由形式推理和结构化生成。模型首先执行无约束推理,仅在生成触发令牌后应用结构化解码,明确地将推理与格式化解耦。我们证明,我们的触发令牌策略能够几乎消除过早触发,即约束解码中断正在进行推理的失败模式。在涵盖分类和推理任务的多个数据集上的评估表明,我们的方法优于现有技术,在自然生成基础上准确率提升高达27%。我们的代码可在https://github.com/Nokia-Bell-Labs/InWriting获取。

英文摘要

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning capabilities by imposing constraints too early in the generation process. We propose a hybrid approach, namely In-Writing, that combines free-form reasoning and structured generation in a single call. The model first performs unconstrained reasoning and only applies structured decoding after a trigger token is generated, explicitly decoupling reasoning from formatting. We establish that our trigger-token strategies are able to virtually eradicate premature triggering, a failure mode in which constrained decoding interrupts ongoing reasoning. Evaluations across diverse datasets covering classification and reasoning tasks demonstrate that our approach outperforms the state-of-the-art by achieving accuracy gains of up to 27% over natural generation. Our code is available at: https://github.com/Nokia-Bell-Labs/InWriting.
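
The two-phase decoding idea (free-form generation until a trigger token, constrained generation afterwards) can be illustrated with a small greedy loop. This is a sketch under stated assumptions, not the In-Writing code: `decode_with_trigger`, the `<answer>` trigger, the label set, and the toy scoring function are all hypothetical, and a real system would mask logits with a grammar or schema rather than a fixed token set.

```python
from typing import Callable, Dict, List, Set


def decode_with_trigger(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    trigger: str,
    allowed: Set[str],
    max_steps: int,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy decoding that only applies constraints after `trigger`."""
    output: List[str] = []
    constrained = False
    for _ in range(max_steps):
        scores = next_token_scores(output)
        if constrained:
            # After the trigger, keep only tokens from the allowed set (or EOS).
            scores = {t: s for t, s in scores.items() if t in allowed or t == eos}
            if not scores:
                break
        token = max(scores, key=scores.get)
        if token == eos:
            break
        output.append(token)
        if token == trigger:
            constrained = True
    return output


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a language model; scores depend only on prefix length."""
    table = [
        {"think": 1.0, "positive": 0.5},
        {"<answer>": 1.0, "maybe": 0.2},
        {"maybe": 0.9, "positive": 0.6, "negative": 0.3},
    ]
    return table[len(prefix)] if len(prefix) < len(table) else {"<eos>": 1.0}


result = decode_with_trigger(
    toy_scores, trigger="<answer>", allowed={"positive", "negative"}, max_steps=10
)
```

In the toy run, the unconstrained model would have emitted "maybe" after the trigger; the constraint restricts the choice to the label set, while the reasoning token before the trigger is left untouched.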

2511.14426 2026-05-29 cs.LG cond-mat.mtrl-sci cs.AI physics.comp-ph

MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

MiAD: 幻影原子扩散用于从头晶体生成

Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Mikhail Lazarev, Andrey E Ustyuzhanin, Dmitry Vetrov

AI总结 提出幻影注入技术,使扩散模型能在生成过程中改变原子数量,显著提升晶体生成质量,在MP-20数据集上实现8.2%的S.U.N.率。

详情
AI中文摘要

近年来,基于扩散的模型在搜索同时稳定、独特和新颖(S.U.N.)的晶体材料方面表现出卓越性能。然而,大多数这些模型在生成过程中无法改变晶体中的原子数量,这限制了模型采样轨迹的多样性。在本文中,我们展示了这种限制的严重性,并引入了一种简单而强大的技术——幻影注入,它使扩散模型能够将构成晶体的原子状态从存在变为不存在(幻影),反之亦然。我们表明,与没有这种修改的相同模型相比,该技术将模型质量提高了多达2.5倍。由此产生的模型,幻影原子扩散(MiAD),是一种用于从头晶体生成的等变联合扩散模型,能够在生成过程中改变原子数量。MiAD在MP-20数据集上实现了8.2%的S.U.N.率,大大超过了现有的最先进方法。代码:https://github.com/andrey-okhotin/miad.git

英文摘要

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models cannot change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to 2.5× compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an 8.2% S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. Code: https://github.com/andrey-okhotin/miad.git

2510.08535 2026-05-29 stat.ML cs.LG math.PR

Permutation-Invariant Spectral Learning via Dyson Diffusion

通过戴森扩散的置换不变谱学习

Tassilo Schwarz, Cai Dieball, Constantin Kogler, Renaud Lambiotte, Arnaud Doucet, Aljaž Godec, George Deligiannidis

AI总结 提出戴森扩散模型,利用随机矩阵理论从分析上提取扩散过程的谱特性,将归纳偏置从架构转移到动力学,实现置换不变的谱学习,准确学习图谱并超越现有图扩散模型。

详情
AI中文摘要

扩散模型是生成建模的核心,并已通过扩散邻接矩阵表示适应于图。对于具有$n$个节点的图,存在多达$n!$个这样的表示,这一挑战仅通过使用置换等变学习架构得到部分缓解。尽管计算效率高,现有的图扩散模型难以区分某些图族及其谱,除非图数据被增强以特定的特征。这一缺陷源于在学习架构中强制执行归纳偏置。在这项工作中,我们利用随机矩阵理论从分析上提取扩散过程的谱特性,从而将大部分归纳偏置从架构推入动力学。在此基础上,我们引入了戴森扩散模型,该模型采用戴森布朗运动来捕捉邻接矩阵上Ornstein-Uhlenbeck过程的谱动力学。此外,以谱动力学为条件,我们制定了一个李群扩散,适当地建模剩余的自由度。引人注目的是,由此产生的学习问题在李代数层面上变为置换不变的。我们证明,戴森扩散模型能够准确学习图谱,并优于现有的图扩散模型。

英文摘要

Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to $n!$ such representations for graphs with $n$ nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families and their spectra, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push most of the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson's Brownian motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix. Furthermore, conditioned on the spectral dynamics, we formulate a Lie group diffusion, appropriately modeling the remaining degrees of freedom. Strikingly, the resulting learning problem becomes permutation invariant at the Lie algebra level. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.

2605.30289 2026-05-29 cs.LG stat.AP stat.ML

Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

用于数值表格数据集的相似性、检索和可解释对齐的统计嵌入

M. Ross Kunz, John Merickel, Keith Wilson

AI总结 提出一种通过结构化探索性数据分析描述符、句子变换器嵌入和典型相关分析(CCA)来表征和比较数值表格数据集的方法,实现跨数据集的相似性检索和可解释变量级对齐,并支持差分隐私。

详情
AI中文摘要

数值表格数据集是科学实践中的主要数据格式,但大型语言模型缺乏在异构特征空间中有意义地表示数值数据集的原生机制。现有方法要么针对单个数据集的预测建模(需要共享变量定义),要么缺乏可解释的跨数据集对齐机制。提出的方法通过结构化探索性数据分析描述符来表征数值表格数据集,使用预训练的句子变换器将这些描述符嵌入到共享向量空间,并通过典型相关分析(CCA)量化跨数据集相似性。此外,应用惩罚形式的CCA来恢复数据集之间稀疏、可解释的变量级对应关系,识别哪些统计描述符或变量级数量驱动跨数据集对齐,而无需共享变量名或特征约定。在嵌入之前,可选地对描述符集应用差分隐私,支持在敏感数据环境中部署,而无需在比较时访问原始观测值。该方法在15个数据集上进行了评估,涵盖通用基准、材料信息学和核级石墨表征。结果表明,总P@1得分为0.9,已知最近邻检索和聚类结构在嵌入消融和差分隐私预算下保持稳健。所提出的框架为将异构数值数据集成到检索增强生成流程中提供了一条原则性途径,同时保留统计上下文,直接应用于数据驱动的算法选择和未知数据集的模拟模型初始化。

英文摘要

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). Furthermore, a penalized formulation of CCA is applied to recover sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy is optionally applied to the descriptor set prior to embedding, supporting deployment in sensitive data contexts without requiring access to raw observations at time of comparison. The methodology is evaluated across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results demonstrate a total P@1 score of 0.9, with known nearest-neighbor retrieval and cluster structure remaining robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.

2605.30284 2026-05-29 cs.AI

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

ProjectionBench: 在渐进信息揭示下评估大语言模型的科学假设生成

A. J. Lew, Y. Cao, M. J. Buehler

AI总结 提出ProjectionBench框架,通过渐进式信息揭示评估大语言模型在科学发现中的创新性和推理能力,实验表明GPT-5.4在最小上下文下仍保持0.7 F1分数与真实结论对齐。

详情
Comments
19 pages, 4 figures
AI中文摘要

科学发现本质上是一个创造性和不确定的过程,需要超越已知知识的推理。尽管许多基准测试通过多跳检索评估大语言模型在深度研究任务上的表现,但其对真正科学发现至关重要的创新推理能力仍未得到充分测试。我们引入了一个基准框架,用于评估模型在科学发现和推理中的表现,从原始问题逐步构建到经典零假设检验。在我们的框架中,模型最初仅接收来自近期论文的主题和研究问题,技术细节逐步揭示。在每个信息揭示阶段,模型需要生成针对研究问题的假设,这些假设与原始论文的结论进行比较,并通过组成原子声明的自动语义相似性进行评估。这种对与真实结论语义偏离的渐进评估,使得能够评估模型的创新性(在最小信息下)到基于推理的能力(在完整实验细节下),这两者对于将大语言模型用于科学发现都至关重要。我们的框架为系统评估大语言模型的科学推理和发现能力提供了基础,这对于推动下一代AI科学家/协同科学家系统的发展至关重要。具体来说,我们在涵盖生物活性材料、机械材料和纳米材料的45篇论文上评估了GPT-5、GPT-5.4、Gemini 2.5 pro和Gemini 3.1 pro preview。我们发现GPT-5.4和Gemini 3.1 pro的表现优于其前代版本,特别是GPT-5.4即使在最小上下文下仍保持0.7 F1分数与真实结论对齐。

英文摘要

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, the innovative reasoning abilities of LLMs that are essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question, which are compared with the conclusions from the original paper and evaluated via automated semantic similarity of constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment ranging from a model's innovativeness (under minimal information) to its grounded reasoning capabilities (under full experimental details), both critical for using LLMs for scientific discovery purposes. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, here we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their previous generation counterparts as expected, and GPT-5.4 in particular maintains 0.7 F1 score alignment with ground truth conclusions even under minimal context.
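
One way to turn claim-level semantic similarity into an F1 score is sketched below. The abstract does not specify the matching rule, so everything here is an assumption made for illustration: the function name `claim_f1`, the similarity threshold, and the rule that a claim counts as matched when its best similarity to any claim on the other side reaches the threshold.

```python
from typing import List


def claim_f1(sim: List[List[float]], threshold: float = 0.5) -> float:
    """F1 over atomic claims from a similarity matrix.

    sim[i][j] is the similarity between generated claim i and
    ground-truth claim j (e.g. cosine similarity of embeddings).
    Precision: share of generated claims matching some true claim.
    Recall: share of true claims matched by some generated claim.
    """
    if not sim or not sim[0]:
        return 0.0
    n_gen, n_ref = len(sim), len(sim[0])
    matched_gen = sum(1 for row in sim if max(row) >= threshold)
    matched_ref = sum(
        1 for j in range(n_ref) if max(sim[i][j] for i in range(n_gen)) >= threshold
    )
    precision = matched_gen / n_gen
    recall = matched_ref / n_ref
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two generated claims against three ground-truth claims.
score = claim_f1([[0.9, 0.1, 0.2], [0.3, 0.2, 0.1]], threshold=0.5)
```

Here one of two generated claims is matched (precision 1/2) and one of three true claims is covered (recall 1/3), giving an F1 of 0.4.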

2605.30283 2026-05-29 cs.AI cs.ET

mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol

mcp-proto-okn:通过模型上下文协议实现对开放科学知识图谱的自然语言访问

Peter W. Rose, Benjamin M. Good, Amanda M. Saravia-Butler, Charlotte A. Nelson, James P. Balhoff, Yaphet Kebede, Patricia L. Whetzel, Christopher Bizon, Andrew I. Su, Sergio E. Baranzini

AI总结 提出基于模型上下文协议的服务器mcp-proto-okn,使AI助手能通过自然语言发现、查询和集成科学知识图谱,降低跨领域知识图谱分析门槛。

详情
Comments
9 pages, 1 figure
AI中文摘要

MCP Server Proto-OKN (mcp-proto-okn) 是一个基于Python的模型上下文协议服务器,使AI助手能够通过自然语言发现、检查、查询和集成科学知识图谱。该服务器提供图路由、模式检查、SPARQL执行、本体扩展、多图查询和转录生成功能,降低了生物医学和科学用户进行跨领域知识图谱分析的门槛。mcp-proto-okn使用FastMCP框架在Python中实现,可在https://github.com/sbl-sdsc/mcp-proto-okn获取。GitHub仓库提供了文档、客户端配置说明和示例分析转录。

英文摘要

MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query and integrate scientific knowledge graphs through natural language. The server provides graph routing, schema inspection, SPARQL execution, ontology expansion, multi-graph querying, and transcript generation, lowering the barrier to cross-domain knowledge graph analysis for biomedical and scientific users. mcp-proto-okn is implemented in Python using the FastMCP framework and is available at https://github.com/sbl-sdsc/mcp-proto-okn. Documentation, client configuration instructions, and example analysis transcripts are provided in the GitHub repository.

2605.30282 2026-05-29 cs.RO

Gaze2Act: Gaze-Conditioned Vision-Language-Action Policies for Interactive Robot Manipulation

Gaze2Act: 基于注视条件的视觉-语言-动作策略用于交互式机器人操作

Kuangji Zuo, Gen Li, Bofan Lyu, Yanshuo Lu, Boyu Ma, Shijia Han, Xinyu Zhou, Xichen Yuan, Chuhao Zhou, Jiaqi Bai, Geng Li, Jianfei Yang

AI总结 提出Gaze2Act框架,通过将人类注视作为动态意图信号,结合跨视角语义匹配和策略级条件化,实现机器人对复杂交互任务的精确操作。

详情
Comments
Project page: https://zuo-kuangji.github.io/Gaze2Act/
AI中文摘要

视觉-语言-动作(VLA)模型近期在遵循语言指令的机器人学习方面展现出强大潜力。然而,在实践中,仅靠语言往往难以精确传达人类意图。很难描述在相似候选对象中具体要交互哪个对象、在对象上的何处操作,或目标在执行过程中如何变化。为解决这一局限,我们提出Gaze2Act,一种新颖的VLA框架,利用人类注视作为复杂交互操作中动态且直观的意图信号。Gaze2Act首先通过跨视角语义匹配将第一人称注视映射到机器人视角,弥合自我-外部视角差距,生成对象掩码和注视点,用于从粗到细的目标指定。然后,这些线索通过感知级提示和动作级条件化整合到策略中,使机器人能够关注相关区域并在动态意图下执行精确交互。在对Unitree G1人形机器人进行的七个任务类别和16个真实机器人任务的系统评估中,Gaze2Act在意图准确性和任务成功率方面均达到最先进水平。它在对象消歧、细粒度交互和动态意图引导方面显著优于基线方法。这些结果表明,人类注视为人在环VLA控制提供了一种自然、低负担且高表达性的模态。

英文摘要

Vision-Language-Action (VLA) models have recently shown strong potential for robot learning by following language instructions. However, in practice, language alone is often insufficient to precisely convey human intent. It is difficult to describe which exact object to interact with among similar candidates, where to act on the object, or how the target may change during execution. To address this limitation, we propose Gaze2Act, a novel VLA framework that leverages human gaze as a dynamic and intuitive intent signal for complex interactive manipulation. Gaze2Act first bridges the ego-exo view gap by mapping first-person gaze into the robot's perspective through cross-view semantic matching, producing both an object mask and a gaze point for coarse-to-fine target specification. These cues are then integrated into the policy through perception-level prompting and action-level conditioning, allowing the robot to attend to relevant regions and execute precise interactions under dynamic intent. In a systematic evaluation across seven task categories and 16 real-robot tasks on a Unitree G1 humanoid, Gaze2Act achieves state-of-the-art performance in both intent accuracy and task success rate. It notably outperforms baselines in object disambiguation, fine-grained interaction, and dynamic intent steering. These results demonstrate that human gaze provides a natural, low-burden, and highly expressive modality for human-in-the-loop VLA control.

2605.30277 2026-05-29 cs.LG physics.flu-dyn

Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor

基于神经算子的CFD代理模型:小型模块化反应堆中的螺旋管蒸汽发生器

Minseo Lee, Seongmin Oh, Chaehyeon Song, Bumjin Cho, Shilaj Baral, Sangam Khanal, Minseop Song, Joongoo Jeon

AI总结 针对小型模块化反应堆数字孪生中CFD实时仿真的计算瓶颈,提出结合降阶模型与神经算子(多尺度L-DeepONet和FNO)的代理模型框架,在螺旋管蒸汽发生器上实现了瞬时涡流动力学和时均流场的高效预测。

详情
AI中文摘要

实时热工水力仿真对于支持小型模块化反应堆(SMR)安全高效运行的数字孪生(DT)技术至关重要。计算流体动力学(CFD)提供了高保真流动分析,但其计算成本阻碍了在DT中的直接应用。基于AI的代理建模已被积极研究以解决这一限制,但针对SMR特定几何结构的CFD级瞬态分析的神经算子代理尚未见报道。本研究提出了一个集成框架,结合了降阶模型(ROM)与神经算子,应用于系统集成模块化先进反应堆(SMART)的螺旋管蒸汽发生器(HCSG)。比较了针对每种CFD数据类型的两种ROM策略:用于非结构化网格数据的基于MLP的自编码器(AE)和用于结构化网格数据的卷积自编码器(CAE),并将每种策略与深度算子网络(DeepONet)耦合以构建潜在DeepONet(L-DeepONet)。此外,还采用了傅里叶神经算子(FNO)进行比较。两种框架中都引入了多尺度技术以减轻频谱偏差并改进对HCSG内部发展的卡门涡街的预测。多尺度L-DeepONet捕捉了速度和压力场中的瞬时周期性涡旋动力学,而FNO及其多尺度变体预测了时均平均流并提供了可靠的压降估计。这些互补特性提供了实用的模型选择指南,根据CFD数据类型和所需的流动分辨率水平将每种架构与特定的DT目标联系起来。

英文摘要

Real-time thermal-hydraulic simulation is essential for digital twin (DT) technology that supports the safe and efficient operation of small modular reactors (SMRs). Computational fluid dynamics (CFD) provides high-fidelity flow analysis, but its computational cost prevents direct use in DT applications. AI-based surrogate modeling has been actively investigated to address this limitation, yet neural operator-based surrogates for CFD-level transient analysis of SMR-specific geometries have not been reported. This study presents an integrated framework that combines a reduced-order model (ROM) with neural operators, applied to the helical coil steam generator (HCSG) of the System-integrated Modular Advanced Reactor (SMART). Two ROM strategies tailored to each CFD data type were compared, an MLP-based autoencoder (AE) for unstructured mesh data and a convolutional autoencoder (CAE) for structured mesh data, and each was coupled with the deep operator network (DeepONet) to construct the latent DeepONet (L-DeepONet). The Fourier neural operator (FNO) was additionally adopted for comparison. A multi-scale technique was incorporated into both frameworks to mitigate spectral bias and improve the prediction of Kármán vortex streets developing inside the HCSG. The multi-scale L-DeepONet captured the instantaneous periodic vortex dynamics in both velocity and pressure fields, while the FNO and its multi-scale variant predicted the time-averaged mean flow and provided reliable pressure drop estimates. These complementary characteristics provide a practical model-selection guideline that links each architecture to specific DT objectives based on CFD data type and the required level of flow resolution.

2605.30275 2026-05-29 cs.LG q-bio.QM

Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

利用常规血液检测指标和临床病史对胰腺癌筛查人群进行数字富集

Chris Varghese, Leo Y. Li-Han, Richa Bisht, Ellen Larson, Frank Lee, Ryan M. Carr, Tanios S. Bekaii-Saab, Shounak Majumder, John D. Halamka, Mark Truty, Ajit H. Goenka, Hojjat Salehinejad, Cornelius A. Thiels

AI总结 提出基于Transformer的多头注意力神经网络,利用纵向诊断编码和血液检测序列预测胰腺癌风险,实现提前1-3年风险分层,为人群级数字富集筛查奠定基础。

详情
AI中文摘要

早期检测胰腺癌是扩大治愈性治疗可及性和减少癌症死亡的关键;然而,目前筛查并不可行。病理的潜在指标体现在个体的疾病和血液检测轨迹中,可能预测胰腺癌的发展。利用患者在临床互动过程中积累的纵向诊断编码和血液检测值序列,训练了一个基于Transformer的定制神经网络,采用多头注意力机制,以提前多年预测胰腺癌风险,并对人群进行风险分层以进行靶向筛查。该队列包括6,017名胰腺癌成人患者和177,081名对照(总体中位年龄75岁,45%女性),在胰腺癌诊断前拥有中位12年(四分位距6.9-16.2)的病史。通过留一站点法进行外部验证,在诊断前1年、2年和3年预测胰腺癌,受试者工作特征曲线下面积均值分别为0.837(95%置信区间0.827-0.848)、0.797(95%置信区间0.782-0.813)和0.760(95%置信区间0.745-0.776)。估计的胰腺癌风险校准良好(校准图斜率1.08,截距-0.077;Brier评分0.025),贝叶斯人群胰腺癌患病率更新使得估计的癌症风险输出可跨环境迁移。在测试中,1年内胰腺癌风险>3.3%的筛查阈值提供了18.2的诊断优势比。因此,我们的工作为第一个人群级数字富集工具奠定了基础,以扩大胰腺癌治愈性管理的可及性。

英文摘要

Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual's disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict risk of pancreatic cancer with a multi-year lead time and risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75, 45% female) with median 12 years (interquartile range 6.9-16.2) of medical history prior to pancreatic cancer diagnosis. External validation via leave-one-site-out, out-of-sample testing predicting pancreatic cancer 1, 2, and 3 years prior to diagnosis demonstrated mean areas under the receiver operating characteristic curve of 0.837 (95% confidence interval 0.827-0.848), 0.797 (95% confidence interval 0.782-0.813), and 0.760 (95% confidence interval 0.745-0.776), respectively. Estimated pancreatic cancer risks were well-calibrated (calibration plot slope 1.08, intercept of -0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer within 1 year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.
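
The abstract does not give the form of the Bayesian prevalence update. A common way to transport a predicted risk to a population with a different prevalence is the odds-based prior-shift correction sketched below; it is shown only as an illustration of the idea and may differ from the authors' method. The function name `adjust_risk` and the target prevalence in the usage line are hypothetical; the training prevalence is derived from the cohort sizes stated above.

```python
def adjust_risk(p: float, train_prev: float, target_prev: float) -> float:
    """Re-scale a predicted risk for a population with a different prevalence.

    Applies Bayes' rule under the assumption that the model's likelihood
    ratio is unchanged and only the prior (prevalence) differs:
        p' = p*r1 / (p*r1 + (1-p)*r0),
    with r1 = target_prev/train_prev and r0 = (1-target_prev)/(1-train_prev).
    """
    for name, value in (("train_prev", train_prev), ("target_prev", target_prev)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    pos = p * (target_prev / train_prev)
    neg = (1.0 - p) * ((1.0 - target_prev) / (1.0 - train_prev))
    return pos / (pos + neg)


# Case fraction in the study cohort: 6,017 cases and 177,081 controls.
cohort_prev = 6017 / (6017 + 177081)
# Hypothetical target setting with a much lower prevalence.
adjusted = adjust_risk(0.033, train_prev=cohort_prev, target_prev=0.001)
```

When the two prevalences are equal the risk is returned unchanged, and a lower target prevalence always shrinks the adjusted risk.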

2605.30274 2026-05-29 cs.CL cs.AI

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong: 一种类人长文档翻译代理,具有观察与行动的适应性上下文选择

Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

AI总结 提出Loong代理,通过3E记忆模块和强化学习优化上下文策略,解决长文档翻译中上下文窗口限制和冗余信息问题,在英⇄中、德、法翻译中平均提升13.0分。

详情
AI中文摘要

文档级翻译仍然是大型语言模型最具挑战性的任务之一,它们受到有限上下文窗口的限制,阻碍了全局连贯性,同时遭受冗余上下文信息的影响,降低了翻译质量。为了解决这个问题,我们提出了一种名为Loong的类人长文档翻译代理,它利用3E记忆模块(精华-示例-实体)存储摘要、句子对和实体记录作为历史上下文。Loong不是被动地关注所有历史,而是进行深度推理,自适应地识别翻译指导的最佳上下文。Loong通过强化学习优化其上下文策略,利用从其自身采样的观察与行动推理轨迹中得出的偏好数据。实证评估表明,Loong在英语⇄中文、德语和法语方向上实现了显著的翻译质量提升,在三个评估指标上平均提升高达13.0分。此外,Loong在跨领域和对抗上下文噪声方面表现出强大的泛化能力和鲁棒性,同时在超长文档翻译中保持显著的稳定性。我们的代码发布在https://github.com/YutongWang1216/LoongDocMT。

英文摘要

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English ⇔ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.

2605.30273 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

LLUMI: 利用在线社区反馈改进心理健康支持中的LLM写作辅助

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

AI总结 提出LLUMI框架,通过在线社区反馈(如Reddit投票)构建偏好对,结合监督微调和直接偏好优化训练开源小模型,在隐私保护下实现与GPT相当的心理健康支持性能。

详情
AI中文摘要

大型语言模型在生成心理健康问题的支持性回复方面展现出潜力,但提升其有用性、共情能力和安全性通常需要大量计算、专家输入和标注数据。同时,在心理健康相关交互中部署专有云模型会引发重要的隐私和数据治理问题。为解决这一挑战,我们提出了LLUMI设置,该设置可在受保护环境内部署。LLUMI包含两个互补组件:生成模型(GM)起草对心理健康问题的支持性回复,以及改进模型(IM)修改初始人工编写的回复。我们利用Reddit心理健康社区的反馈信号,使用社区认可模式(如点赞和点踩)构建用于监督微调和直接偏好优化的选择-拒绝回复对。我们还通过五个维度(可读性、共情、连接、可操作性和安全性)的人工评估进一步对齐LLUMI。结果表明,尽管依赖较小的开源模型而非专有云GPT模型,LLUMI在语言分析和人工评估中均实现了相当的性能。这些发现表明,使用社区衍生的偏好信号训练的开源模型可以支持高质量的心理健康支持辅助,同时为敏感的支持场景提供更保护隐私的替代方案。

英文摘要

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivities. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.
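
Constructing chosen-rejected pairs from community votes could look like the sketch below. The abstract does not describe the pairing rule, so the `Reply` record, the `min_gap` score margin, and the output keys (`prompt`, `chosen`, `rejected`, a layout commonly used for preference data) are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List


@dataclass
class Reply:
    post: str    # the original post being answered
    text: str    # the reply text
    score: int   # net community score (upvotes minus downvotes)


def build_preference_pairs(replies: List[Reply], min_gap: int = 5) -> List[Dict[str, str]]:
    """Pair replies to the same post when their scores differ clearly."""
    by_post: Dict[str, List[Reply]] = defaultdict(list)
    for reply in replies:
        by_post[reply.post].append(reply)

    pairs: List[Dict[str, str]] = []
    for post, group in by_post.items():
        ranked = sorted(group, key=lambda r: r.score, reverse=True)
        # combinations() preserves order, so `better` never scores lower.
        for better, worse in combinations(ranked, 2):
            if better.score - worse.score >= min_gap:
                pairs.append(
                    {"prompt": post, "chosen": better.text, "rejected": worse.text}
                )
    return pairs


pairs = build_preference_pairs(
    [
        Reply("post-1", "reply A", 10),
        Reply("post-1", "reply B", 1),
        Reply("post-1", "reply C", 8),
    ],
    min_gap=5,
)
```

Requiring a minimum score gap is one simple way to avoid treating near-ties (here, replies A and C) as a meaningful preference.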

2605.30269 2026-05-29 cs.CV eess.IV

Boosting Image Quality Assessment Performance: Unsupervised Score Fusion by Deep Maximum a Posteriori Estimation

提升图像质量评估性能:基于深度最大后验估计的无监督分数融合

Zhongling Wang, Raymond Zhou, Shahrukh Athar, Wenbo Yang, Zhou Wang

AI总结 提出一种基于深度最大后验估计的无监督图像质量评估分数融合框架,通过细粒度不确定性估计提高融合预测的准确性并降低不确定性。

详情
Comments
2024 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)
AI中文摘要

在过去的几十年中,出现了许多图像质量评估(IQA)模型,旨在预测图像的感知质量。然而,单个模型往往偏向于某些类型的图像内容或失真,具体取决于设计原则和过程。一个直观的想法是通过将多个模型的分数融合成一个更强的模型,来利用每个IQA模型的优势并减轻其弱点。在此,我们首次尝试为这一想法寻求最优解,并提出一个基于深度最大后验(MAP)估计的无监督IQA分数融合通用框架。所提出的模型在分数级别进行细粒度不确定性估计,以提高准确性并降低融合预测中的不确定性。综合实验表明,所提出的模型优于单个IQA模型和其他融合方法。它还在融合过程中展现出拒绝“坏”模型的有趣能力。

英文摘要

Over the past decades, numerous Image Quality Assessment (IQA) models have emerged, aiming to predict the perceptual quality of images. However, individual models are often biased toward certain types of image content or distortions, depending on the design principle and process. An intuitive idea is to harness the strengths and mitigate the weaknesses of each IQA model, by fusing the scores of multiple models into a stronger one. Here we make one of the first attempts to seek an optimal solution for the idea and propose a general framework for unsupervised IQA score fusion using deep Maximum a Posteriori (MAP) estimation. The proposed model conducts fine-grained uncertainty estimation at the score level to increase the accuracy and reduce the uncertainty in fused predictions. Comprehensive experiments demonstrate the superiority of the proposed model over individual IQA models and other fusion methods. It also exhibits an interesting capability of rejecting "bad" models in the fusion process.
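
The intuition behind uncertainty-aware fusion can be seen in the classical Gaussian case, sketched below. This is not the paper's deep MAP model: it assumes each IQA score is the true quality plus independent Gaussian noise with a known variance and a flat prior, in which case the MAP estimate is the precision-weighted mean. The function name `fuse_scores` and the example numbers are hypothetical.

```python
from typing import List, Tuple


def fuse_scores(scores: List[float], variances: List[float]) -> Tuple[float, float]:
    """Precision-weighted fusion of per-model quality scores.

    Returns the fused score and its variance. A model with a large
    variance receives a weight close to zero, which is how an
    unreliable model is effectively rejected in this simplified setting.
    """
    if not scores or len(scores) != len(variances):
        raise ValueError("scores and variances must be non-empty and equal length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * s for w, s in zip(weights, scores)) / total
    return fused, 1.0 / total


# Two consistent models and one highly uncertain outlier.
fused, fused_var = fuse_scores([70.0, 72.0, 20.0], [1.0, 1.0, 1e6])
```

The outlier score of 20 barely moves the result, and the fused variance is lower than that of any single model.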