arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2605.15199 2026-05-15 cs.CV cs.AI

EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

Ruozhen He, Meng Wei, Ziyan Yang, Vicente Ordonez

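AI summary: EntityBench is a benchmark for evaluating entity consistency in multi-shot video generation. It contains 140 episodes (2,491 shots) drawn from real narrative media, spans easy, medium, and hard tiers, and explicitly tracks the continuity of characters, objects, and locations across shots. Its three-pillar evaluation suite separately scores intra-shot quality, prompt alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. The authors also propose EntityMem, a memory-augmented baseline that stores verified visual references for each entity before generation begins; it achieves the highest character fidelity and presence among the methods evaluated.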

Comments
Project page: https://catherine-r-he.github.io/EntityBench/
Abstract

Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated. Code and data are available at https://github.com/Catherine-R-He/EntityBench/.
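
The abstract reports an effect size as Cohen's d. As a reminder of what that statistic measures, here is a minimal sketch of the textbook pooled-standard-deviation form. The score lists are made-up illustration values, not data from the paper, and the paper may compute d over different units or with a different variant.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference of sample a over sample b (pooled SD)."""
    na, nb = len(a), len(b)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Hypothetical per-episode fidelity scores, for illustration only.
with_memory = [0.80, 0.85, 0.90, 0.75, 0.95]
baseline = [0.60, 0.65, 0.70, 0.55, 0.75]
print(round(cohens_d(with_memory, baseline), 3))
```

A value above 2 means the two score distributions barely overlap, which is why d = +2.33 reads as a very large effect.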

2605.15198 2026-05-15 cs.CV cs.AI cs.CL

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng

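AI summary: This work proposes ATLAS, a visual reasoning framework that targets the computational cost and limited task generalization of earlier approaches. A single discrete functional token serves as both an agentic operation and a latent visual reasoning unit; it requires no visual supervision and stays compatible with standard SFT and RL training. The authors also introduce LA-GRPO to stabilize RL training, and report strong benchmark results together with clear interpretability.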

Comments
Project Page: https://atlas-oneword.github.io Code: https://github.com/ZiyuGuo99/ATLAS
Abstract

Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete 'word', termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.

2605.15196 2026-05-15 cs.CV cs.LG

RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna

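AI summary: This paper proposes RefDecoder, a reference-conditioned video VAE decoder that improves detail fidelity and structural consistency in visual generation. It injects a high-fidelity reference image signal into decoding through reference attention: a lightweight image encoder maps the reference frame to high-dimensional tokens, which are processed jointly with the denoised video latent tokens. Experiments report PSNR gains of up to +2.1 dB on reconstruction benchmarks, and RefDecoder can be swapped into existing video generation systems without additional fine-tuning, improving subject consistency, background consistency, and overall quality.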

Abstract

Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
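
To put the reported +2.1 dB PSNR gain in perspective, the sketch below uses the standard PSNR definition and converts a dB gain into the corresponding reduction in mean squared error. The helper names and the flat pixel lists are illustrative assumptions; the paper's evaluation code may differ in value range and averaging.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (gain_db / 10.0)


print(round(mse_ratio_for_gain(2.1), 3))
```

Under this definition, a 2.1 dB gain corresponds to roughly a 1.6x lower mean squared reconstruction error.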

2605.15195 2026-05-15 cs.CV

VGGT-$Ω$

Jianyuan Wang, Minghao Chen, Shangzhan Zhang, Nikita Karaev, Johannes Schönberger, Patrick Labatut, Piotr Bojanowski, David Novotny, Andrea Vedaldi, Christian Rupprecht

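AI summary: This paper presents VGGT-$Ω$, an improved feed-forward reconstruction model that raises reconstruction accuracy and efficiency for both static and dynamic scenes. By simplifying the architecture, introducing registers with register attention, and adopting a self-supervised learning protocol, VGGT-$Ω$ uses only about 30% of its predecessor's GPU memory during training while achieving strong results across multiple benchmarks, for example improving on the previous best camera estimation accuracy on Sintel by 77%. The learned registers are also shown to benefit vision-language-action models.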

Comments
CVPR 2026 (Oral)
Abstract

Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-$Ω$, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT's architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-$Ω$ uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-$Ω$ achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: http://vggt-omega.github.io/

2605.15193 2026-05-15 cs.CV

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Tuna Han Salih Meral, Kaan Oktay, Hidir Yesiltepe, Adil Kaan Akan, Pinar Yanardag

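AI summary: This work improves latent flow matching for image generation by aligning the transport paths with the geometry of the latent space. Standard methods move Gaussian noise to VAE latents along Euclidean straight lines, but both endpoints concentrate in thin spherical shells and a straight chord leaves those shells. Decomposing each latent token into radial and angular components shows that perceptual and semantic content is carried mainly by direction. The authors therefore project data latents onto a fixed-radius sphere and replace linear interpolation with spherical linear interpolation, so paths stay on the sphere at every timestep; under matched training this consistently improves class-conditional ImageNet-256 FID.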

Abstract

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
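
The geometric point in the abstract, that a Euclidean chord between two points on a sphere dips inside the sphere while a geodesic does not, can be checked with a small sketch. It uses plain Python lists as stand-ins for latent tokens; the function names and the toy 3-dimensional vectors are assumptions for illustration, not the authors' implementation.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_to_radius(v, radius):
    """Radially project a vector onto the sphere of the given radius."""
    n = norm(v)
    return [radius * x / n for x in v]


def lerp(x0, x1, t):
    """Linear interpolation (the Euclidean chord)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def slerp(x0, x1, t):
    """Spherical linear interpolation between two vectors of equal norm."""
    cos_omega = sum(a * b for a, b in zip(x0, x1)) / (norm(x0) * norm(x1))
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    s = math.sin(omega)
    if s < 1e-8:  # nearly parallel endpoints: fall back to the chord
        return lerp(x0, x1, t)
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]


radius = 2.0
noise = project_to_radius([1.0, 0.0, 0.0], radius)
data = project_to_radius([0.0, 3.0, 0.0], radius)
print(norm(lerp(noise, data, 0.5)), norm(slerp(noise, data, 0.5)))
```

The midpoint of the chord has a smaller norm than the endpoints, whereas the slerp midpoint keeps the shared radius, which is the property the abstract relies on.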

2605.15190 2026-05-15 cs.CV

RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

Yanzuo Lu, Ronglai Zuo, Jiankang Deng

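AI summary: This paper proposes RAVEN, a real-time autoregressive video extrapolation network that generates future video chunks from previously generated content. To close the gap between the history distributions seen in training and at inference, which degrades long-horizon quality, RAVEN repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states, aligning training attention with inference-time extrapolation. The paper also introduces CM-GRPO, which recasts a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning to it. Experiments show RAVEN outperforming recent causal video distillation baselines on quality, semantic, and dynamic degree evaluations.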

Comments
Project Page: https://yanzuo.lu/raven
Abstract

Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.
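
For readers unfamiliar with the "group relative" part of GRPO, the sketch below shows only the generic advantage step: rewards for several samples of the same prompt are normalized against their own group. It does not implement CM-GRPO's Gaussian transition kernel, and the use of the population standard deviation and the `eps` value are assumptions; implementations vary on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three rollouts sampled from one prompt.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Samples that beat their group average get a positive advantage and the rest get a negative one, so no separate value network is needed.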

2605.15188 2026-05-15 cs.LG cs.AI cs.CL

FutureSim: Replaying World Events to Evaluate Adaptive Agents

Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping

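AI summary: This paper presents FutureSim, a benchmark for evaluating how well adaptive AI agents forecast real-world events. It replays real news articles in chronological order and tests agents' ability to predict events beyond their knowledge cutoff. Over a three-month period from January to March 2026, the best frontier agent reaches only 25% accuracy, and many score a worse Brier skill score than making no prediction at all. FutureSim offers a realistic setting for studying long-horizon test-time adaptation, search, memory, and reasoning about uncertainty.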

Comments
31 pages, 10 main
Abstract

AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent's accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
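
The comparison with "making no prediction at all" is easier to read with the Brier skill score written out. The sketch below assumes binary questions and takes a constant 0.5 forecast as the uninformative reference; both are assumptions for illustration, since the abstract does not specify the question format or the reference forecast used in FutureSim.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """1 is perfect, 0 matches the reference forecast, negative is worse than it."""
    reference = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - brier_score(probs, outcomes) / reference


outcomes = [1, 0, 1, 0]
print(brier_skill_score([0.9, 0.1, 0.8, 0.2], outcomes))  # calibrated and informative
print(brier_skill_score([0.1, 0.9, 0.2, 0.8], outcomes))  # confidently wrong
```

An overconfident agent that is often wrong lands below zero, which is how an agent can score worse than abstaining.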

2605.15187 2026-05-15 cs.CV cs.GR cs.RO

Articraft: An Agentic System for Scalable Articulated 3D Asset Generation

Matt Zhou, Ruining Li, Xiaoyang Lyu, Zhaomou Song, Zhening Huang, Chuanxia Zheng, Christian Rupprecht, Andrea Vedaldi, Shangzhe Wu

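AI summary: This paper proposes Articraft, an agentic system for generating articulated 3D assets at scale. It reduces asset generation to writing a program that builds the asset and has a large language model write that program, addressing the lack of large, diverse datasets. Articraft provides a domain-specific SDK and a harness with validation, so the LLM can write code that defines parts, composes geometry, specifies joints, and tests the result. The resulting assets are of higher quality than those from state-of-the-art articulated-asset generators and general-purpose coding agents, and the authors build Articraft-10K, a curated dataset of over 10K assets across 245 categories, useful for model training and for applications such as robotics simulation and virtual reality.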

Comments
Project page: https://articraft3d.github.io/
Abstract

A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

2605.15185 2026-05-15 cs.CV cs.AI

Quantitative Video World Model Evaluation for Geometric-Consistency

Jiaxin Wu, Yihao Pi, Yinling Zhang, Yuheng Li, Xueyan Zou

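AI summary: This paper proposes PDI-Bench, a quantitative framework for auditing geometric consistency in generated videos. It obtains object-centric observations through segmentation and point tracking, lifts them to 3D with monocular reconstruction, and computes projective-geometry residuals covering three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. The authors also build PDI-Dataset for systematic evaluation, which exposes geometry-specific failures of current video generators that common perceptual metrics miss.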

Comments
12 pages, 5 figures. Project page: https://pdi-bench.github.io/
Abstract

Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world model. Our code and dataset can be found at https://pdi-bench.github.io/.

2605.15184 2026-05-15 cs.CL

Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

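AI summary: This paper studies how retrieval strategy (grep versus vector retrieval) interacts with agent architecture and tool-calling style in agentic search systems. Two experiments compare grep and vector retrieval under different ways of presenting tool results and under increasing amounts of distracting context. Grep generally achieves higher accuracy in Experiment 1, but overall scores still depend strongly on the agent harness and tool-calling style used.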

Abstract

Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.

2605.15183 2026-05-15 cs.LG

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

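AI summary: This paper asks how to decide whether two neural networks implement the same computation. For tensor-based models it proposes tensor similarity, a weight-based metric that is invariant to weight-space symmetries, captures global functional equivalence, and accounts for cross-layer mechanisms. It tracks functional training dynamics such as grokking and backdoor insertion with higher fidelity than existing metrics, turning similarity measurement and faithfulness verification into an algebraic problem.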

Comments
22 pages, 8 figures. Code: https://github.com/tdooms/tensor-similarity
Abstract

Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.

2605.15182 2026-05-15 cs.CV

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

Yifan Wang, Tong He

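AI summary: This paper proposes Warp-as-History, an interface for camera-controlled video generation that works without any training and improves further with fine-tuning on a single camera-annotated video. It turns camera-induced warps into pseudo-history, aligns the positional encoding with the target frames, keeps only visible tokens, and feeds the result through the model's visual-history pathway so that generation follows the prescribed camera path. A frozen model shows non-trivial zero-shot camera following, and lightweight LoRA fine-tuning further improves camera adherence, visual quality, and motion dynamics.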

Comments
Project page: https://yyfz.github.io/warp-as-history/
Abstract

Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model's visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.

2605.15181 2026-05-15 cs.CV

From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

Anirudh Sundara Rajan, Krishna Kumar Singh, Yong Jae Lee

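AI summary: This work tackles abstract, multi-step instructions in open-ended image editing with a framework that tightly couples planning and execution. A planner produces structured atomic decompositions, an orchestrator selects editing tools and regions for each step, and a vision-language judge supplies outcome-based rewards. The orchestrator is trained to maximize these rewards, and successful trajectories are fed back to refine the planner, yielding more coherent and reliable edits.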

Abstract

Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

2605.15179 2026-05-15 cs.LG cs.AI physics.comp-ph

Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

Ellwil Sharma, Arastu Sharma

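AI summary: This paper addresses negative transfer in multi-physics foundation models, that is, the gradient conflict and unstable optimization that arise when disparate partial differential equation (PDE) systems are co-trained. The authors propose Shodh-MoE, a sparsely activated mixture-of-experts (MoE) architecture that operates on compressed physical latents from a physics-informed autoencoder and uses a soft-semantic router to assign localized latent patches from different physical regimes to specialized experts. The model preserves mass conservation and converges on both regimes with low validation error.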

Comments
5 pages, 4 figures
Abstract

Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
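
One common way to make a velocity field divergence-free by construction is to derive it from a potential instead of predicting the velocity directly. The sketch below is a 2D analogue on a periodic grid, with velocity obtained from a stream function by central differences. It is an illustration of the general idea only; the abstract does not spell out the exact form of the "Helmholtz-style" parameterization, and all names and grid choices here are assumptions.

```python
import math


def stream_to_velocity(psi, h):
    """u = d(psi)/dy, v = -d(psi)/dx via periodic central differences (i is x, j is y)."""
    n = len(psi)
    u = [[(psi[i][(j + 1) % n] - psi[i][(j - 1) % n]) / (2 * h) for j in range(n)]
         for i in range(n)]
    v = [[-(psi[(i + 1) % n][j] - psi[(i - 1) % n][j]) / (2 * h) for j in range(n)]
         for i in range(n)]
    return u, v


def divergence(u, v, h):
    """du/dx + dv/dy with the same periodic central differences."""
    n = len(u)
    return [[(u[(i + 1) % n][j] - u[(i - 1) % n][j]) / (2 * h)
             + (v[i][(j + 1) % n] - v[i][(j - 1) % n]) / (2 * h)
             for j in range(n)] for i in range(n)]


n = 16
h = 1.0 / n
# Arbitrary smooth periodic stream function, chosen for illustration.
psi = [[math.sin(2 * math.pi * i * h) * math.cos(2 * math.pi * j * h)
        for j in range(n)] for i in range(n)]
u, v = stream_to_velocity(psi, h)
max_div = max(abs(d) for row in divergence(u, v, h) for d in row)
print(max_div)
```

Because the two difference operators commute, the discrete divergence cancels term by term and only floating-point round-off remains, whatever values the stream function takes.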

2605.15178 2026-05-15 cs.CV

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie

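AI summary: This paper presents SANA-WM, an efficient world model that generates high-fidelity, 720p, minute-long videos with precise camera control. Its core designs are hybrid linear attention, dual-branch camera control, a two-stage generation pipeline, and a robust annotation pipeline, which together preserve visual quality while improving training and inference efficiency. SANA-WM is efficient in data, training compute, and inference hardware, and on the authors' one-minute world-model benchmark it shows stronger action-following accuracy than prior open-source baselines at higher throughput.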

Comments
https://nvlabs.github.io/Sana/WM/
Abstract

We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only $\sim$213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at $36\times$ higher throughput for scalable world modeling.

2605.15174 2026-05-15 quant-ph cond-mat.stat-mech cs.IT math-ph math.IT math.MP

Universal quantum resource distillation via composite generalised quantum Stein's lemma

Ludovico Lami, Bartosz Regula, Ryuji Takagi

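AI summary: This paper studies universal distillation of quantum resources, showing that optimal distillation rates can be achieved without any knowledge of the input state, which certifies the robustness of quantum resource distillation. The key tool is an extension of the generalised quantum Stein's lemma to a composite hypothesis-testing setting in which the null hypothesis consists of i.i.d. copies of an unknown state. The result applies to entanglement purification under non-entangling maps, where the optimal rate is given by the regularised relative entropy of entanglement.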

Comments
8+12 pages
Abstract

The performance of quantum resource manipulation protocols, including key examples such as distillation of quantum entanglement, is measured in terms of the rate at which desired target states can be produced from a given noisy state. However, to achieve optimal rates, known protocols require precise tailoring to the quantum state in question, demanding a perfect knowledge of the input and allowing no errors in its preparation. Here we show that distillation of quantum resources in the framework of resource non-generating operations can be performed universally: optimal rates of distillation can be achieved with no knowledge of the input state whatsoever, certifying the robustness of quantum resource distillation. The findings apply in particular to the purification of quantum entanglement under non-entangling maps, where the optimal rates are governed by the regularised relative entropy of entanglement. Our result relies on an extension of the generalised quantum Stein's lemma in quantum hypothesis testing to a composite setting where the null hypothesis is no longer a fixed quantum state, but is rather composed of i.i.d. copies of an unknown state. The solution of this asymptotic problem is made possible through new developments in one-shot quantum information and a refinement of the blurring technique from [Lami, arXiv:2408.06410].

2605.15173 2026-05-15 cs.DS cs.DB

Hybrid Sketching Methods for Dynamic Connectivity on Sparse Graphs

Quinten De Man, Gilvir Gill, Michael A. Bender, Laxman Dhulipala, David Tench

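AI summary: This paper studies space-efficient dynamic connectivity on sparse graphs and proposes hybrid sketching: sketch only the dense core of the graph and store the sparse periphery losslessly. Key components are BalloonSketch, a new l0-sampler that reduces per-vertex sketch size by up to 8x, and HybridSCALE, a modular system that saves space across graph densities. Compared with the state-of-the-art lossless baseline, it saves up to 97% of space on dense real-world graphs.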

Abstract

Dynamic connectivity is a fundamental dynamic graph problem, and recent algorithmic breakthroughs on dynamic graph sketching have reshaped what is theoretically possible: by encoding the graph as per-vertex linear sketches, these algorithms solve dynamic connectivity in only $Θ(V \log^2 V)$ space, independent of the number of edges, outperforming lossless $Θ(V+E)$-space structures that grow as the graph becomes denser. Prior to this work, no practical dynamic connectivity algorithm has been able to translate these theoretical breakthroughs into space savings on real-world graphs. The main obstacle is that per-vertex sketches cost thousands of bytes per vertex, so sketching only pays off once the graph becomes extremely dense. We observe that sparse real-world graphs are often not uniformly sparse: these graphs can contain dense cores on a small subset of vertices that account for a large fraction of edges. We exploit this structure via hybrid sketching: sketch only the dense core, and store the sparse periphery losslessly. We design new hybrid algorithms for fully-dynamic and semi-streaming connectivity with space $O(\min\{V+E, V \log V \log(2+E/V)\})$ w.h.p., simultaneously matching the lossless bound on sparse graphs, the sketching bound on dense graphs, and improving on both in an intermediate regime. A key component is BalloonSketch, a new l0-sampler reducing per-vertex sketch sizes by up to 8x. We implement HybridSCALE, a modular system treating the lossless and sketch-based components as subroutines. HybridSCALE is the first sketch-based dynamic connectivity system to save space on common real-world graphs. Compared to the state-of-the-art lossless baseline, HybridSCALE saves up to 15% space on sparse graphs (average degree < 100), up to 92% on intermediate density graphs (average degree ~ 100-1000), and up to 97% on dense graphs (average degree > 1000).
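
The three space bounds in the abstract can be compared numerically. The sketch below evaluates them with all hidden constants set to 1 and base-2 logarithms, both of which are assumptions, so the numbers show only how the bounds scale, not real memory footprints.

```python
import math


def lossless_space(v, e):
    """Theta(V + E): store every vertex and edge explicitly."""
    return v + e


def sketch_space(v):
    """Theta(V log^2 V): per-vertex linear sketches, independent of E."""
    return v * math.log2(v) ** 2


def hybrid_space(v, e):
    """O(min{V + E, V log V log(2 + E/V)}): the hybrid bound from the abstract."""
    return min(v + e, v * math.log2(v) * math.log2(2 + e / v))


V = 2 ** 20
sparse_edges = 2 * V        # average degree about 4
dense_edges = V * V // 4    # a quarter of all ordered vertex pairs
print(hybrid_space(V, sparse_edges), lossless_space(V, sparse_edges))
print(hybrid_space(V, dense_edges), sketch_space(V), lossless_space(V, dense_edges))
```

On the sparse input the hybrid bound coincides with the lossless one, and on the dense input it stays at or below the pure sketching bound while the lossless cost grows with the edge count.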

2605.15172 2026-05-15 cs.CR cs.CL

MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma, Ahmed Salem

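AI Summary: This paper proposes MetaBackdoor, a new backdoor attack that uses the positional encoding of large language models as the trigger, so the backdoor can be activated without modifying the input text. The study finds that position-based triggers can effectively activate stealthy backdoor behavior, causing the model to leak sensitive information or perform malicious actions once a specific length condition is met. The method expands the threat model for LLM backdoor attacks, reveals positional encoding as a previously overlooked attack surface, and poses new challenges for the design of defenses.

Details
English Abstract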

Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model's internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures.

2605.15171 2026-05-15 cs.CV cs.AI cs.LG

Evidential Reasoning Advances Interpretable Real-World Disease Screening

Chenyu Lian, Hong-Yu Zhou, Jing Qin

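AI Summary: This paper proposes EviScreen, an interpretable disease-screening framework based on evidential reasoning, to address the limited interpretability and performance of current medical-image screening models. The method retrieves region-level evidence from historical cases and draws on dual knowledge banks to provide retrospective interpretability, improving the model's transparency and diagnostic accuracy. It also uses abnormality maps derived from contrastive retrieval to improve localization interpretability. Experiments show strong performance on real-world disease-screening benchmarks, with notably higher specificity at clinical-level recall.

Details
Comments
ICML 2026
English Abstract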

Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at https://github.com/DopamineLcy/EviScreen.

2605.15168 2026-05-15 cs.CL cs.AI cs.LG stat.ML

Text Knows What, Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss

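AI Summary: This study addresses the complementary temporal information in clinical text and structured electronic health records (EHR), proposing a retrieval-augmented multimodal alignment framework for reconstructing more precise clinical timelines. The method extracts key events from text to build a temporal scaffold and then calibrates it with temporal information from structured data, improving timestamp accuracy. Experiments show that the method improves temporal concordance across multiple models while preserving event match rates, demonstrating the advantage of multimodal alignment for reconstructing clinical trajectories.

Details
Comments
Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)
English Abstract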

Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient's course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.

2605.15167 2026-05-15 cs.CV

Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

Kam Man Wu, Haolin Yang, Qingyu Chen, Yihu Tang, Jingye Chen, Qifeng Chen

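AI Summary: This paper studies whether purely synthetic layered data can improve graphic design decomposition. Building on the state-of-the-art CLD framework, the authors construct a synthetic dataset, SynLayers, and use vision-language models to generate textual supervision and automate inference inputs. They find that purely synthetic data can outperform existing non-scalable datasets, that performance improves consistently as the amount of training data grows, and that synthetic data allows the layer-count distribution to be balanced. The study offers a scalable synthetic-data foundation for layered design editing systems.

Details
Comments
22 pages, 10 figures. Code is available at https://github.com/YangHaolin0526/SynLayers
English Abstract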

Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors, and both strategies face fundamental scalability challenges. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We assume that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. Building on this baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models (VLMs), and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

2605.15164 2026-05-15 cs.LG cs.AI

Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

Pratinav Seth, Vinay Kumar Sankarapu

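AI Summary: This paper argues that current behavioural assurance methods cannot meet the safety verification demands of AI governance frameworks. These frameworks require verifying properties such as the absence of hidden objectives, resistance to loss of control, and bounded catastrophic capability, but existing methods can only observe model outputs and cannot verify latent representations or long-horizon behaviour. The paper introduces the concept of the "audit gap" to highlight the mismatch between verification requirements and technical capability, and proposes a technical pivot that bounds the weight of behavioural evidence in legal text and introduces mechanistic verification methods.

Details
English Abstract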

This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioural proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioural evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.

2605.15163 2026-05-15 cs.LO

Automating Bitvector and Finite Field Equivalence Proofs in Lean

Elizaveta Pertseva, Valentin Robert, Clark Barrett, James Parker

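AI Summary: This work addresses the difficulty of proving the correctness of quantifier-free statements that involve both bitvector and finite field operations, a problem that arises when verifying zero-knowledge proof circuit encodings, and proposes a new Lean tactic, BitModEq. The method uses range lemmas and case analysis to produce verified translations from finite fields to bitvectors. Combined with bit-blasting, it outperforms existing SMT solvers on ZKP arithmetization benchmarks, solving 19% more instances.

Details
English Abstract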

Efforts to verify Zero-Knowledge Proof circuit encodings have highlighted the challenge of proving the correctness of quantifier-free statements that make use of both bitvector and finite field operations. Existing verification workflows are either manual or rely on SMT solvers, which scale poorly on some classes of problems for reasons that include difficulties with conversion operators and challenges reasoning about inequalities. To address these limitations, we present a novel Lean tactic BitModEq that leverages range lemmas and case analysis to produce verified translations from finite fields to bitvectors. Our approach, combined with bit-blasting, outperforms state-of-the-art SMT solvers, solving 19% more ZKP arithmetization benchmarks.

2605.15155 2026-05-15 cs.LG cs.AI cs.CL

Self-Distilled Agentic Reinforcement Learning

Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

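AI Summary: This paper studies how to improve the performance of reinforcement learning (RL) based large language model agents on multi-turn tasks. To address the sparse supervision that conventional RL provides on long-horizon tasks, the authors propose Self-Distilled Agentic Reinforcement Learning (SDAR), which adds dense token-level guidance from a teacher branch as an auxiliary objective alongside the primary RL optimization. SDAR introduces a gating mechanism that strengthens distillation on teacher-endorsed positive tokens while softly attenuating the teacher's negative rejections, substantially improving performance on multiple benchmarks and avoiding the instability of the naive GRPO+OPSD combination.

Details
English Abstract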

Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment, because negative teacher rejections may arise from imperfect skill retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL-OPSD baselines across model scales.
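
The gating idea in the abstract can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the abstract does not define the token-level signal, so the sketch takes the "gap" to be teacher log-probability minus student log-probability, and the names `gated_distill_weights`, `combined_loss`, `temperature`, and `aux_coef` are hypothetical rather than taken from SDAR.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_distill_weights(teacher_logps, student_logps, temperature=1.0):
    """Map per-token gaps to gate values in (0, 1).

    Assumption: gap = teacher log-prob minus student log-prob. In an autograd
    framework the gap would be detached so no gradient flows through the gate.
    """
    gaps = [t - s for t, s in zip(teacher_logps, student_logps)]
    return [sigmoid(g / temperature) for g in gaps]

def combined_loss(rl_loss, distill_losses, gates, aux_coef=0.1):
    """RL stays the primary objective; gated distillation is a weighted auxiliary term."""
    aux = sum(g * d for g, d in zip(gates, distill_losses)) / len(gates)
    return rl_loss + aux_coef * aux

# Token 0: teacher endorses the sampled token (positive gap).
# Token 1: teacher rejects it (negative gap), so its term is softly down-weighted.
gates = gated_distill_weights([-0.5, -3.0], [-2.5, -1.0])
print(gates)
print(combined_loss(1.0, [1.0, 1.0], gates))
```

A positive gap yields a gate above 0.5 and a negative gap a gate below 0.5, so rejected tokens are attenuated rather than dropped, which is one plausible reading of "softly attenuating" in the abstract.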

2605.15154 2026-05-15 stat.ML cs.LG

RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution

Lanxin Xiang, Liang Shi, Youhui Ye, Boyu Jiang, Dawei Zhou, Feng Guo

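AI Summary: This paper proposes RoSHAP, a distributional framework and robust metric for more stable feature attribution. Built on SHAP values, the method models the distribution of feature attribution scores through bootstrap resampling and kernel density estimation, and proves that under mild regularity conditions the aggregated score is asymptotically Gaussian, which reduces the computational cost of distribution estimation. RoSHAP improves the stability of feature rankings, outperforms standard single-run attribution methods in simulations and real-data experiments, and achieves predictive performance comparable to full-feature models while using fewer features.

Details
English Abstract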

Feature attribution analysis is critical for interpreting machine learning models and supporting reliable data-driven decisions. However, feature attribution measures often exhibit stochastic variation: different train-test splits, random seeds, or model-fitting procedures can produce substantially different attribution values and feature rankings. This paper proposes a framework that incorporates the stochastic nature of feature attribution, together with a robust attribution metric, RoSHAP, for stable feature ranking based on the SHAP metric. The proposed framework models the distribution of feature attribution scores and estimates it through bootstrap resampling and kernel density estimation. We show that, under mild regularity conditions, the aggregated feature attribution score is asymptotically Gaussian, which greatly reduces the computational cost of distribution estimation. RoSHAP summarizes the distribution of SHAP values into a robust feature-ranking criterion that simultaneously rewards features that are active, strong, and stable. In simulations and real-data experiments, the proposed framework and RoSHAP outperform standard single-run attribution measures in identifying signal features. In addition, models built using RoSHAP-selected features achieve predictive performance comparable to full-feature models while using substantially fewer predictors. The proposed RoSHAP approach improves the stability and interpretability of machine learning models, enabling reliable and consistent insights for analysis.

2605.15150 2026-05-15 quant-ph cond-mat.str-el cs.CC hep-th

Extensive long-range magic in non-Abelian topological orders

Yuzhen Zhang, Isaac H. Kim, Yimu Bao, Sagar Vijay

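AI Summary: This paper studies the extensive long-range magic present in the low-energy states of non-Abelian topological orders and shows that this magic cannot be eliminated by constant-depth local unitary circuits. It offers a new resource-theoretic characterization of topological orders, supported by a no-go theorem stating that stabilizer states, even up to constant-depth local unitaries, cannot approximate the low-energy states of non-Abelian string-net models. The paper also argues that ground states and low-energy states of higher-dimensional quantum double models, which host excitations with nontrivial fusion spaces, must exhibit this extensive long-range magic.

Details
Comments
51 pages
English Abstract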

We show that the low-energy states of non-Abelian topological orders possess extensive magic which is long-ranged and cannot be eliminated by a constant-depth local unitary circuit. This refines conventional notions of complexity beyond the linear circuit depth which is required to prepare any topological phase, and provides a new resource-theoretic characterization of topological orders. A central technical result is a no-go theorem establishing that stabilizer states, even up to constant-depth local unitaries, cannot approximate low-energy states of non-Abelian string-net models which satisfy the entanglement bootstrap axioms. Moreover, we show that stabilizer-realizable Abelian string-net phases have mutual braiding phases quantized by the on-site qudit dimension, and that any violation of this condition necessarily implies extensive long-range magic. Extending to higher spatial dimensions, we argue that any state obeying an entanglement area law and hosting excitations with nontrivial fusion spaces must exhibit extensive long-range magic. This applies, in particular, to ground states and low-energy states of higher-dimensional quantum double models.

2605.15144 2026-05-15 cs.LO math.HO math.LO

Guises and Perspectives: An Intentional and Hyperintensional Sketch

Juan J. Colomina-Alminana

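AI Summary: Building on the work of Héctor-Neri Castañeda, this paper constructs an intensional logic centered on "guises" (bundles of properties equipped with intention) to study the internal structure of relations. The system combines a Leibnizian intensional semantics, an intentional operator, and a modal layer for possibility and necessity, and it handles hyperintensional phenomena such as substitution failure in intentional contexts and self-referential (de se) expressions. The paper argues that relations are not external causal links but internal perspectival structures of agents and objects encoded in guises.

Details
Comments
21pp
English Abstract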

This paper develops a formal logic for guises based on the work of Héctor-Neri Castañeda, who understood relations from an internalist viewpoint, following Leibniz. We introduce a syntax, model theory, and proof theory for an intensional logic in which guises (taken as bundles of properties equipped with intention) serve as primary semantic objects. The system integrates (i) a Leibnizian containment semantics for singular truths, (ii) an intentional operator that captures internal relations among guises, and (iii) a modal layer for possibility and necessity modeled as maximally consistent closures. We establish core metatheoretic results (i.e., soundness and canonical-model completeness sketches) and analyze hyperintensional phenomena such as substitution failure in intentional contexts, quasi-indexicality, and de se reference. We compare the framework to classical intensional semantics (Montague), property theory (Bealer), hyperintensional logics (Fine), situation semantics (Barwise and Perry), and to the Leibniz program for a calculus of concepts. The result is a self-contained formal framework that demonstrates that relations are not external causal links but intentional internal structures encoded in the guises through which agents and objects are conceived: i.e., they are perspectives.

2605.15143 2026-05-15 cs.LO cs.PL

Complete Local Reasoning About Parameterized Programs Over Topologies

Ruotong Cheng, Azadeh Farzan

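AI Summary: This paper studies algorithmic safety verification of infinite-state parameterized concurrent programs over rich communication topologies, with the goal of automatically producing a universally quantified inductive invariant as the proof of correctness. Under reasonable assumptions, the problem reduces to a compositional scheme in which verification of the parameterized program becomes a set of local proofs. The authors propose a verification algorithm, implement it as a tool, and show on benchmarks over several different topologies that the approach is effective in proving parameterized programs safe.

Details
Comments
Draft version with an appendix
English Abstract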

This paper investigates the algorithmic safety verification problem of infinite-state parameterized concurrent programs over a rich set of communication topologies. The goal is to automatically produce a proof of correctness in the form of a universally quantified inductive invariant, where the quantification is over the nodes in the topology. We illustrate that under reasonable assumptions on the underlying topology, the problem can be reduced to and solved as a compositional scheme, that is, the verification of the parameterized family is reduced to a set of local proofs, in a complete manner. We propose a verification algorithm, which is implemented as a tool, and demonstrate through a set of benchmarks over several different topologies that our approach is effective in proving parameterized programs safe.

2605.15138 2026-05-15 cs.LG cs.CL cs.ET

Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu

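AI Summary: This paper studies the permanence of machine unlearning in quantized language models, noting that standard evaluations measure unlearning at full precision and so fail to reflect that deployed models are quantized first. The study finds that 4-bit quantization can weaken or even reverse unlearning, and traces this to parameter updates that are far smaller than the quantization bin width and therefore do not change the quantized weights. The authors propose MANSU, which combines causal circuit attribution with constrained projection to achieve meaningful forgetting and structural erasure, and introduce the CAD metric for verification. Experiments across multiple models and tasks show strong results.

Details
English Abstract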

Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
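
The root cause named in the abstract, updates far below the quantization bin width, can be illustrated with a toy example. This is a minimal sketch under stated assumptions: it uses a uniform round-to-nearest quantizer rather than NF4 (whose bins are non-uniform), and the helper names and the numeric values of `BIN_WIDTH` and `WEIGHT` are hypothetical, not taken from the paper.

```python
import math

def bin_index(weight: float, bin_width: float) -> int:
    """Uniform round-to-nearest quantizer: index of the bin that holds `weight`."""
    return math.floor(weight / bin_width + 0.5)

def survives_quantization(weight: float, update: float, bin_width: float) -> bool:
    """True if applying `update` moves the weight into a different quantization bin."""
    return bin_index(weight + update, bin_width) != bin_index(weight, bin_width)

BIN_WIDTH = 0.125   # assumed toy bin width
WEIGHT = 0.25       # sits at the center of bin 2

# An update 47x below the bin width (the smallest ratio quoted in the abstract)
# is rounded away: the quantized weight is unchanged.
small_update = BIN_WIDTH / 47
print(survives_quantization(WEIGHT, small_update, BIN_WIDTH))

# An update with magnitude at least one full bin width always changes the bin
# under this uniform quantizer, which is the intuition behind a magnitude floor.
floored_update = BIN_WIDTH
print(survives_quantization(WEIGHT, floored_update, BIN_WIDTH))
```

Whether a small update survives depends on where the weight sits inside its bin, so a weight near a bin boundary can still flip. The sketch only shows why updates spread thinly across many parameters tend to vanish after quantization, while a per-parameter floor of one bin width cannot.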

2605.15135 2026-05-15 eess.SP cs.IT math.IT

Deep Mixture of Experts Network for Resource Optimization in Aerial-Terrestrial CF-mMIMO Systems under URLLC

Donggen Li, Chong Huang, Jingfu Li, Pei Xiao, Wenjiang Feng, Dusit Niyato, Zhu Han

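AI Summary: This paper studies resource allocation in hybrid aerial-terrestrial cell-free massive MIMO (CF-mMIMO) systems under ultra-reliable low-latency communication (URLLC). To counter the channel aging caused by high mobility, the authors propose a Transformer-based channel prediction network (CP-Net) and design a deep mixture-of-experts network (MoE-Net) for uplink power allocation, with a weighted gating network (WT-Net) that adaptively combines the expert models. The method improves communication performance and resource efficiency under URLLC constraints.

Details
Comments
15 pages, accepted for publication in IEEE Transactions on Wireless Communications
English Abstract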

As a critical component of sixth-generation (6G) wireless networks, ultra-reliable and low-latency communication (URLLC) is expected to support real-time and reliable information exchange in low-altitude environments. However, achieving URLLC often incurs significant resource overhead, including increased bandwidth consumption, higher transmit power, and denser access point (AP) deployment, which pose significant challenges to both spectral efficiency (SE) and energy efficiency (EE). Besides, existing iterative optimization algorithms are computationally intensive and struggle to meet the latency requirements of URLLC. To address these challenges, we propose a hybrid aerial-terrestrial cell-free massive MIMO (CF-mMIMO) network to support diverse services, along with a channel prediction network and a deep mixture of experts (MoE) network for uplink optimization. First, we design a channel prediction network (CP-Net) to mitigate channel aging caused by high-mobility user equipment (UE). CP-Net employs three Transformer-based sub-networks for aged channel state information (CSI) prediction, while a channel quality-aware loss function is introduced to improve the prediction accuracy of weak links. Based on the predicted CSI, we develop a deep MoE network (MoE-Net) for power allocation comprising three expert models targeting different objectives. Then, we introduce a weighted gating network (WT-Net) to learn an efficient adaptive combination of expert outputs. The proposed framework better captures heterogeneous UE requirements and improves communication performance under URLLC constraints. Numerical results demonstrate the effectiveness of the proposed method.