arXivDaily arXiv每日学术速递 周一至周五更新
重置
2604.13036 2026-04-15 cs.CV

Lyra 2.0: Explorable Generative 3D Worlds

Tianchang Shen, Sherwin Bahmani, Kai He, Sangeetha Grama Srinivasan, Tianshi Cao, Jiawei Ren, Ruilong Li, Zian Wang, Nicholas Sharp, Zan Gojcic, Sanja Fidler, Jiahui Huang, Huan Ling, Jun Gao, Xuanchi Ren

Comments Project Page: https://research.nvidia.com/labs/sil/projects/lyra2/

详情
英文摘要

Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting where current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, forcing the model to hallucinate structures when revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these enable substantially longer and 3D-consistent video trajectories, which we leverage to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.

2604.13035 2026-04-15 cs.CV cs.CL

SceneCritic: A Symbolic Evaluator for 3D Indoor Scene Synthesis

Kathakoli Sengupta, Kai Ao, Paola Cascante-Bonilla

Comments Project Page: https://lab-spell.github.io/SceneCritic/

详情
英文摘要

Large Language Models (LLMs) and Vision-Language Models (VLMs) increasingly generate indoor scenes through intermediate structures such as layouts and scene graphs, yet evaluation still relies on LLM or VLM judges that score rendered views, making judgments sensitive to viewpoint, prompt phrasing, and hallucination. When the evaluator is unstable, it becomes difficult to determine whether a model has produced a spatially plausible scene or whether the output score reflects the choice of viewpoint, rendering, or prompt. We introduce SceneCritic, a symbolic evaluator for floor-plan-level layouts. SceneCritic's constraints are grounded in SceneOnto, a structured spatial ontology we construct by aggregating indoor scene priors from 3D-FRONT, ScanNet, and Visual Genome. SceneOnto traverses this ontology to jointly verify semantic, orientation, and geometric coherence across object relationships, providing object-level and relationship-level assessments that identify specific violations and successful placements. Furthermore, we pair SceneCritic with an iterative refinement test bed that probes how models build and revise spatial structure under different critic modalities: a rule-based critic using collision constraints as feedback, an LLM critic operating on the layout as text, and a VLM critic operating on rendered observations. Through extensive experiments, we show that (a) SceneCritic aligns substantially better with human judgments than VLM-based evaluators, (b) text-only LLMs can outperform VLMs on semantic layout quality, and (c) image-based VLM refinement is the most effective critic modality for semantic and orientation correction.

2604.13034 2026-04-15 cs.DC cs.DB

DySkew: Dynamic Data Redistribution for Skew-Resilient Snowpark UDF Execution

Chenwei Xie, Urjeet Shrestha, Corbin McElhanney, Lukas Lorimer, Gopal V, Zihao Ye, Yi Pan, Nic Crouch, Elliott Brossard, Florian Funke, Yuxiong He

详情
英文摘要

Snowflake revolutionized data warehousing with an elastic architecture that decouples compute and storage, enabling scalable solutions for diverse data analytics needs. Building on this foundation, Snowflake has advanced its AI Data Cloud vision by introducing Snowpark, a managed turnkey solution that supports data engineering and AI/ML workloads using Python and other programming languages. While Snowpark's User-Defined Function (UDF) execution model offers high throughput, it is highly vulnerable to performance degradation from data skew, where uneven data partitioning causes straggler tasks and unpredictable latency. The non-uniform computational cost of arbitrary user code further exacerbates this classic challenge. This paper presents DySkew, a novel, data-skew-aware execution strategy for Snowpark UDFs. Built upon Snowflake's new generalized skew handling solution, an adaptive data distribution mechanism utilizing per-link state machines. DySkew addresses the unique challenges of user-defined logic with goals of fine-grained per-row mitigation, dynamic runtime adaptation, and low-overhead, cost-aware redistribution. Specifically, for Snowpark, we introduce crucial optimizations, including an eager redistribution strategy and a Row Size Model to dynamically manage overhead for extremely large rows. This dynamic approach replaces the limitations of the previous static round-robin method. We detail the architecture of this framework and showcase its effectiveness through performance evaluations and real-world case studies, demonstrating significant improvements in the execution time and resource utilization for large-scale Snowpark UDF workloads.

2604.13033 2026-04-15 quant-ph cs.IT math-ph math.IT math.MP

Partial majorization and Schur concave functions on the sets of quantum and classical states

M. E. Shirokov

Comments 20 pages, 3 figures, any comments are welcome

详情
英文摘要

We construct for a Schur concave function $f$ on the set of quantum states a tight upper bound on the difference $f(ρ)-f(σ)$ for a quantum state $ρ$ with finite $f(ρ)$ and any quantum state $σ$ $m$-partially majorized by the state $ρ$ in the sense described in [1]. We also obtain a tight upper bound on this difference under the additional condition $\frac{1}{2}\|ρ-σ\|_1\leq\varepsilon$ and find simple sufficient conditions for vanishing this bound with $\,\min\{\varepsilon,1/m\}\to0\,$. The obtained results are applied to the von Neumann entropy. The concept of $\varepsilon$-sufficient majorization rank of a quantum state with finite entropy is introduced and a tight upper bound on this quantity is derived and applied to the Gibbs states of a quantum oscillator. We also show how the obtained results can be reformulated for Schur concave functions on the set of probability distributions with a finite or countable set of outcomes.

2604.13030 2026-04-15 cs.CV

Generative Refinement Networks for Visual Synthesis

Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan

Comments code: https://github.com/MGenAI/GRN

详情
英文摘要

While diffusion models dominate the field of visual generation, they are computationally inefficient, applying a uniform computational effort regardless of different complexity. In contrast, autoregressive (AR) models are inherently complexity-aware, as evidenced by their variable likelihoods, but are often hindered by lossy discrete tokenization and error accumulation. In this work, we introduce Generative Refinement Networks (GRN), a next-generation visual synthesis paradigm to address these issues. At its core, GRN addresses the discrete tokenization bottleneck through a theoretically near-lossless Hierarchical Binary Quantization (HBQ), achieving a reconstruction quality comparable to continuous counterparts. Built upon HBQ's latent space, GRN fundamentally upgrades AR generation with a global refinement mechanism that progressively perfects and corrects artworks -- like a human artist painting. Besides, GRN integrates an entropy-guided sampling strategy, enabling complexity-aware, adaptive-step generation without compromising visual quality. On the ImageNet benchmark, GRN establishes new records in image reconstruction (0.56 rFID) and class-conditional image generation (1.81 gFID). We also scale GRN to more challenging text-to-image and text-to-video generation, delivering superior performance on an equivalent scale. We release all models and code to foster further research on GRN.

2604.13029 2026-04-15 cs.CV cs.AI

Visual Preference Optimization with Rubric Rewards

Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

详情
英文摘要

The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.

2604.13028 2026-04-15 cs.CV

Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns

Baris Sarper Tezcan, Hrishikesh Viswanath, Rubab Saher, Daniel Aliaga

Comments Accepted to the CVPR 2026 EarthVision Workshop

详情
英文摘要

Urban areas are increasingly vulnerable to thermal extremes driven by rapid urbanization and climate change. Traditionally, thermal extremes have been monitored using Earth-observing satellites and numerical modeling frameworks. For example, land surface temperature derived from Landsat or Sentinel imagery is commonly used to characterize surface heating patterns. These approaches operate as forward models, translating radiative observations or modeled boundary conditions into estimates of surface thermal states. While forward models can predict land surface temperature from vegetation and urban form, the inverse problem of determining spatial vegetation configurations that achieve a desired regional temperature shift remains largely unexplored. This task is inherently underdetermined, as multiple spatial vegetation patterns can yield similar aggregated temperature responses. Conventional regression and deterministic neural networks fail to capture this ambiguity and often produce averaged solutions, particularly under data-scarce conditions. We propose a conflated inverse modeling framework that combines a predictive forward model with a diffusion-based generative inverse model to produce diverse, physically plausible image-based vegetation patterns conditioned on specific temperature goals. Our framework maintains control over thermal outcomes while enabling diverse spatial vegetation configurations, even when such combinations are absent from training data. Altogether, this work introduces a controllable inverse modeling approach for urban climate adaptation that accounts for the inherent diversity of the problem. Code is available at the GitHub repository.

2604.13026 2026-04-15 quant-ph cond-mat.stat-mech cs.CC

A complexity phase transition at the EPR Hamiltonian

Kunal Marwaha, James Sud

Comments 47 pages, 8 figures

详情
英文摘要

We study the computational complexity of 2-local Hamiltonian problems generated by a positive-weight symmetric interaction term, encompassing many canonical problems in statistical mechanics and optimization. We show these problems belong to one of three complexity phases: QMA-complete, StoqMA-complete, and reducible to a new problem we call EPR*. The phases are physically interpretable, corresponding to the energy level ordering of the local term. The EPR* problem is a simple generalization of the EPR problem of King. Inspired by empirically efficient algorithms for EPR, we conjecture that EPR* is in BPP. If true, this would complete the complexity classification of these problems, and imply EPR* is the transition point between easy and hard local Hamiltonians. Our proofs rely on perturbative gadgets. One simple gadget, when recursed, induces a renormalization-group-like flow on the space of local interaction terms. This gives the correct complexity picture, but does not run in polynomial time. To overcome this, we design a gadget based on a large spin chain, which we analyze via the Jordan-Wigner transformation.

2604.13025 2026-04-15 cs.DS cs.DM math.CO

Asymptotically faster algorithms for recognizing $(k,\ell)$-sparse graphs

Bence Deák, Péter Madarasi

详情
英文摘要

The family of $(k,\ell)$-sparse graphs, introduced by Lorea, plays a central role in combinatorial optimization and has a wide range of applications, particularly in rigidity theory. A key algorithmic problem is to decide whether a given graph is $(k,\ell)$-sparse and, if not, to produce a vertex set certifying the failure of sparsity. While pebble game algorithms have long yielded $O(n^2)$-time recognition throughout the classical range $0 \leq \ell < 2k$, and $O(n^3)$-time algorithms in the extended range $2k \leq \ell < 3k$, substantially faster bounds were previously known only in a few special cases. We present new recognition algorithms for the parameter ranges $0 \le \ell \le k$, $k < \ell < 2k$, and $2k \leq \ell < 3k$. Our approach combines bounded-indegree orientations, reductions to rooted arc-connectivity, augmenting-path techniques, and a divide-and-conquer method based on centroid decomposition. This yields the first subquadratic, and in fact near-linear-time, recognition algorithms throughout the classical range when instantiated with the fastest currently available subroutines. Under purely combinatorial implementations, the running times become $O(n\sqrt n)$ for $0 \leq \ell \leq k$ and $O(n\sqrt{n\log n})$ for $k< \ell <2k$. For $2k \leq \ell < 3k$, we obtain an $O(n^2)$-time algorithm when $\ell \leq 2k+1$ and an $O(n^2\log n)$-time algorithm otherwise. In each case, the algorithm can also return an explicit violating set certifying that the input graph is not $(k,\ell)$-sparse.

2604.13024 2026-04-15 cs.LG cs.DB

CLAD: Efficient Log Anomaly Detection Directly on Compressed Representations

Benzhao Tang, Shiyu Yang

详情
英文摘要

The explosive growth of system logs makes streaming compression essential, yet existing log anomaly detection (LAD) methods incur severe pre-processing overhead by requiring full decompression and parsing. We introduce CLAD, the first deep learning framework to perform LAD directly on compressed byte streams. CLAD bypasses these bottlenecks by exploiting a key insight: normal logs compress into regular byte patterns, while anomalies systematically disrupt them. To extract these multi-scale deviations from opaque bytes, we propose a purpose-built architecture integrating a dilated convolutional byte encoder, a hybrid Transformer--mLSTM, and four-way aggregation pooling. This is coupled with a two-stage training strategy of masked pre-training and focal-contrastive fine-tuning to effectively handle severe class imbalance. Evaluated across five datasets, CLAD achieves a state-of-the-art average F1-score of 0.9909 and outperforms the best baseline by 2.72 percentage points. It delivers superior accuracy while completely eliminating decompression and parsing overheads, offering a robust solution that generalizes to structured streaming compressors.

2604.13023 2026-04-15 cs.SD cs.MM

SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

Luoyi Sun, Xiao Zhou, Zeqian Li, Ya Zhang, Yanfeng Wang, Weidi Xie

详情
英文摘要

Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision lacking precise timestamps, and benchmarks that fail to simulate real-world scenarios where short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio language model designed for grounding audio events. SpotSound incorporates a novel training objective, specifically designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than ~10\% of each clip, creating a rigorous `needle-in-a-haystack' evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models and benchmark are released on https://loiesun.github.io/spotsound/

2604.13022 2026-04-15 quant-ph cs.LG math.OC stat.ML

Classical and Quantum Speedups for Non-Convex Optimization via Energy Conserving Descent

Yihang Sun, Huaijin Wang, Patrick Hayden, Jose Blanchet

Comments 33 pages, 2 figures

详情
英文摘要

The Energy Conserving Descent (ECD) algorithm was recently proposed (De Luca & Silverstein, 2022) as a global non-convex optimization method. Unlike gradient descent, appropriately configured ECD dynamics escape strict local minima and converge to a global minimum, making it appealing for machine learning optimization. We present the first analytical study of ECD, focusing on the one-dimensional setting for this first installment. We formalize a stochastic ECD dynamics (sECD) with energy-preserving noise, as well as a quantum analog of the ECD Hamiltonian (qECD), providing the foundation for a quantum algorithm through Hamiltonian simulation. For positive double-well objectives, we compute the expected hitting time from a local to the global minimum. We prove that both sECD and qECD yield exponential speedup over respective gradient descent baselines--stochastic gradient descent and its quantization. For objectives with tall barriers, qECD achieves a further speedup over sECD.

2604.13021 2026-04-15 cs.CV cs.AI

Representation geometry shapes task performance in vision-language modeling for CT enterography

Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham

详情
英文摘要

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2\% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4\% vs.\ 71\% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7--14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80--0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.

2604.13017 2026-04-15 cs.AI cs.HC

PAL: Personal Adaptive Learner

Megha Chakraborty, Darssan L. Eswaramoorthi, Madhur Thareja, Het Riteshkumar Shah, Finlay Palmer, Aryaman Bahl, Michelle A Ihetu, Amit Sheth

详情
英文摘要

AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation--predefined quizzes, uniform pacing, or generic feedback--limiting their ability to respond to learners' evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner's interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.

2604.13014 2026-04-15 math.NA cs.NA math.AP

Finite element approximation of an anisotropic porous medium equation with fractional pressure

Stefano Fronzoni

Comments arXiv admin note: text overlap with arXiv:2404.18901

详情
英文摘要

We study a nonlocal diffusion equation of porous medium type featuring a generalised fractional pressure with spatial anisotropy. We construct a finite element method for the numerical solution of the equation on a bounded open Lipschitz polytopal domain $Ω\subset \mathbb{R}^{d}$, where $d = 2$ or $3$. The pressure in the model is defined as the solution of fractional elliptic problem involving the fractional power of a second order differential operator, in terms of its spectral definition. Under suitable assumptions on the fractional order and the coefficients of the operator, we rigorously prove convergence of the numerical scheme. The analysis is carried out in two stages: first passing to the limit in the spatial discretization, and then in the time step, ultimately showing that a subsequence of the sequence of finite element approximations defined by the proposed numerical method converges to a bounded and nonnegative weak solution of the initial-boundary-value problem under consideration. Finally, we present numerical experiments in two dimensions illustrating the computational aspects of the method and highlighting the interplay between nonlocal effects and spatial anisotropy under different configurations. We also show numerically the failure of the comparison principle and exponential decay of the numerical solution to a steady state.

2604.13013 2026-04-15 cs.AI math.OC

Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem

Yinghao Qin, Mosab Bazargani, Edmund K. Burke, Carlos A. Coello Coello, Zhongmin Song, Jun Chen

详情
英文摘要

This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.

2604.12999 2026-04-15 cs.CV

Agentic Discovery with Active Hypothesis Exploration for Visual Recognition

Jaywon Koo, Jefferson Hernandez, Ruozhen He, Hanjie Chen, Chen Wei, Vicente Ordonez

详情
英文摘要

We introduce HypoExplore, an agentic framework that formulates neural architecture discovery for visual recognition as a hypothesis-driven scientific inquiry. Given a human-specified high-level research direction, HypoExplore ideates, implements, evaluates, and improves neural architectures through evolutionary branching. New hypotheses are created using a large language model by selecting a parent hypothesis to build upon, guided by a dual strategy that balances exploiting validated principles with resolving uncertain ones. Our proposed framework maintains a Trajectory Tree that records the lineage of all proposed architectures, and a Hypothesis Memory Bank that actively tracks confidence scores acquired through experimental evidence. After each experiment, multiple feedback agents analyze the results from different perspectives and consolidate their findings into hypothesis confidence updates. Our framework is tested on discovering lightweight vision architectures on CIFAR-10, with the best achieving 94.11% accuracy evolved from a root node baseline that starts at 18.91%, and generalizes to CIFAR-100 and Tiny-ImageNet. We further demonstrate applicability to a specialized domain by conducting independent architecture discovery runs on MedMNIST, which yield a state-of-the-art performance. We show that hypothesis confidence scores grow increasingly predictive as evidence accumulates, and that the learned principles transfer across independent evolutionary lineages, suggesting that HypoExplore not only discovers stronger architectures, but can help build a genuine understanding of the design space.

2604.12998 2026-04-15 cs.SE

Personalizing LLM-Based Conversational Programming Assistants

Jonan Richards

Comments Accepted to the Doctoral and Early Career Symposium of the 19th International Conference on Cooperative and Human Aspects of Software Engineering (CHASE DECS 2026)

详情
英文摘要

Large Language Models (LLMs) have shown much promise in powering a variety of software engineering (SE) tools. Offering natural language as an intuitive interaction mechanism, LLMs have recently been employed as conversational ``programming assistants'' capable of supporting several SE activities simultaneously. As with any SE tool, it is crucial that these assistants effectively meet developers' needs. Recent studies have shown addressing this challenge is complicated by the variety in developers' needs, and the ambiguous and unbounded nature of conversational interaction. This paper discusses our current and future work towards characterizing how diversity in cognition and organizational context impacts developers' needs, and exploring personalization as a means of improving the inclusivity of LLM-based conversational programming assistants.

2604.12995 2026-04-15 cs.CL cs.CY

PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

Han Bao, Penghao Zhang, Yue Huang, Zhengqing Yuan, Yanchi Ru, Rui Su, Yujun Zhou, Xiangqi Wang, Kehan Guo, Nitesh V Chawla, Yanfang Ye, Xiangliang Zhang

Comments Accepted by ACL 2026 findings

详情
英文摘要

Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present \textbf{\textit{PolicyBench}}, the first large-scale cross-system benchmark (US-China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas, capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) \textbf{Memorization}: factual recall of policy knowledge, (2) \textbf{Understanding}: conceptual and contextual reasoning, and (3) \textbf{Application}: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose \textbf{\textit{PolicyMoE}}, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed models demonstrate stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yields the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.

2604.12992 2026-04-15 stat.ML cs.LG econ.EM

Causal Diffusion Models for Counterfactual Outcome Distributions in Longitudinal Data

Farbod Alinezhad, Jianfei Cao, Gary J. Young, Brady Post

详情
英文摘要

Predicting counterfactual outcomes in longitudinal data, where sequential treatment decisions heavily depend on evolving patient states, is critical yet notoriously challenging due to complex time-dependent confounding and inadequate uncertainty quantification in existing methods. We introduce the Causal Diffusion Model (CDM), the first denoising diffusion probabilistic approach explicitly designed to generate full probabilistic distributions of counterfactual outcomes under sequential interventions. CDM employs a novel residual denoising architecture with relational self-attention, capturing intricate temporal dependencies and multimodal outcome trajectories without requiring explicit adjustments (e.g., inverse-probability weighting or adversarial balancing) for confounding. In rigorous evaluation on a pharmacokinetic-pharmacodynamic tumor-growth simulator widely adopted in prior work, CDM consistently outperforms state-of-the-art longitudinal causal inference methods, achieving a 15-30% relative improvement in distributional accuracy (1-Wasserstein distance) while maintaining competitive or superior point-estimate accuracy (RMSE) under high-confounding regimes. By unifying uncertainty quantification and robust counterfactual prediction in complex, sequentially confounded settings, without tailored deconfounding, CDM offers a flexible, high-impact tool for decision support in medicine, policy evaluation, and other longitudinal domains.

2604.12990 2026-04-15 cs.IR

Sparse Contrastive Learning for Content-Based Cold Item Recommendation

Gregor Meehan, Johan Pauwels

Comments Accepted at SIGIR 2026

详情
英文摘要

Item cold-start is a pervasive challenge for collaborative filtering (CF) recommender systems. Existing methods often train cold-start models by mapping auxiliary item content, such as images or text descriptions, into the embedding space of a CF model. However, such approaches can be limited by the fundamental information gap between CF signals and content features. In this work, we propose to avoid this limitation with purely content-based modeling of cold items, i.e. without alignment with CF user or item embeddings. We instead frame cold-start prediction in terms of item-item similarity, training a content encoder to project into a latent space where similarity correlates with user preferences. We define our training objective as a sparse generalization of sampled softmax loss with the $α$-entmax family of activation functions, which allows for sharper estimation of item relevance by zeroing gradients for uninformative negatives. We then describe how this Sampled Entmax for Cold-start (SEMCo) training regime can be extended via knowledge distillation, and show that it outperforms existing cold-start methods and standard sampled softmax in ranking accuracy. We also discuss the advantages of purely content-based modeling, particularly in terms of equity of item outcomes.

2604.12989 2026-04-15 cs.CL

Accelerating Speculative Decoding with Block Diffusion Draft Trees

Liran Ringel, Yaniv Romano

详情
英文摘要

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an entire draft block in a single forward pass and achieve state-of-the-art speculative decoding performance, outperforming strong autoregressive drafters such as EAGLE-3. Vanilla DFlash, however, still verifies only a single drafted trajectory per round, potentially limiting its acceptance length. We introduce DDTree (Diffusion Draft Tree), a method that constructs a draft tree directly from the per-position distributions of a block diffusion drafter. Under a fixed node budget, DDTree uses a simple best-first heap algorithm to select the continuations that are most likely to match the target model according to a surrogate defined by the draft model's output. The resulting tree is verified efficiently in a single target model forward pass using an ancestor-only attention mask. Because DDTree builds on DFlash, a leading draft model for speculative decoding, these gains place DDTree among the leading approaches to speculative decoding.

2604.12988 2026-04-15 cs.DB cs.AI

ROSE: An Intent-Centered Evaluation Metric for NL2SQL

Wenqi Pei, Shizheng Hou, Boyan Li, Han Chen, Zhichao Shi, Yuyu Luo

Comments ACL 2026 Main

详情
英文摘要

Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: SQL Prover assesses the semantic correctness of a predicted SQL against the user's intent independently, while Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a largescale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.

2604.12986 2026-04-15 cs.CR cs.AI

Parallax: Why AI Agents That Think Must Never Act

Joel Fokou

Comments 20 pages, 1 figure, 5 tables. Open-source reference implementation: https://github.com/openparallax/openparallax. Documentation: https://docs.openparallax.dev. Feedback welcome via email or GitHub issues

详情
英文摘要

Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.

2604.12981 2026-04-15 cs.LO

Recursive Completion in Higher K-Models: Front-Seed Semantics, Proof-Relevant Witnesses, and the K-Infinity Model

Daniel O. Martinez-Rivillas, Arthur F. Ramos, Ruy J. G. B. de Queiroz

Comments 26 pages; To be submitted to Journal of Logic and Computation, 2026; fully formalized in Lean 4 at https://github.com/Arthur742Ramos/HigherLambdaModel

详情
英文摘要

Martinez-Rivillas and de Queiroz gave extensional Kan semantics for the untyped lambda-calculus and later constructed the concrete K-infinity homotopy-model. The two main mathematical results of the present paper are these. First, we show that a smaller front-seed coherence package (WL, WR) together with an inner-right-front pentagon contraction already suffices to recover the associator comparison, semantic pentagon, and bridge theorems used in the later semantic arguments. Second, we prove explicit global reify, reflect, and application formulas for K-infinity, with exact coordinatewise identities at every finite stage. We also record two structural clarifications: the recursive all-dimensional continuation of the explicit low-dimensional tower is obtained by a finite packaging phase followed by a uniform equality-generated recursion; and, on a deliberately fixed forward witness language for the classical separation span, the canonical identity-type higher tower on K-infinity forces all higher non-connection once the two witness classes land at distinct points. The paper is fully formalized in Lean 4, and the project sources contain no local uses of sorry, admit, or axiom.

2604.12978 2026-04-15 cs.CL cs.CV

GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts

Amir Hossein Kargaran, Nafiseh Nikeghbal, Jana Diesner, François Yvon, Hinrich Schütze

详情
英文摘要

Optical character recognition (OCR) has advanced rapidly with the rise of vision-language models, yet evaluation has remained concentrated on a small cluster of high- and mid-resource scripts. We introduce GlotOCR Bench, a comprehensive benchmark evaluating OCR generalization across 100+ Unicode scripts. Our benchmark comprises clean and degraded image variants rendered from real multilingual texts. Images are rendered using fonts from the Google Fonts repository, shaped with HarfBuzz and rasterized with FreeType, supporting both LTR and RTL scripts. Samples of rendered images were manually reviewed to verify correct rendering across all scripts. We evaluate a broad suite of open-weight and proprietary vision-language models and find that most perform well on fewer than ten scripts, and even the strongest frontier models fail to generalize beyond thirty scripts. Performance broadly tracks script-level pretraining coverage, suggesting that current OCR systems rely on language model pretraining as much as on visual recognition. Models confronted with unfamiliar scripts either produce random noise or hallucinate characters from similar scripts they already know. We release the benchmark and pipeline for reproducibility. Pipeline Code: https://github.com/cisnlp/glotocr-bench, Benchmark: https://hf.co/datasets/cis-lmu/glotocr-bench.

2604.12970 2026-04-15 eess.IV cs.CV

Probabilistic Feature Imputation and Uncertainty-Aware Multimodal Federated Aggregation

Nafis Fuad Shahid, Maroof Ahmed, Md Akib Haider, Saidur Rahman Sagor, Aashnan Rahman, Md Azam Hossain

Comments Accepted for publication at the Medical Imaging with Deep Learning (MIDL) 2026 conference

详情
英文摘要

Multimodal federated learning enables privacy-preserving collaborative model training across healthcare institutions. However, a fundamental challenge arises from modality heterogeneity: many clinical sites possess only a subset of modalities due to resource constraints or workflow variations. Existing approaches address this through feature imputation networks that synthesize missing modality representations, yet these methods produce point estimates without reliability measures, forcing downstream classifiers to treat all imputed features as equally trustworthy. In safety-critical medical applications, this limitation poses significant risks. We propose the Probabilistic Feature Imputation Network (P-FIN), which outputs calibrated uncertainty estimates alongside imputed features. This uncertainty is leveraged at two levels: (1) locally, through sigmoid gating that attenuates unreliable feature dimensions before classification, and (2) globally, through Fed-UQ-Avg, an aggregation strategy that prioritizes updates from clients with reliable imputation. Experiments on federated chest X-ray classification using CheXpert, NIH Open-I, and PadChest demonstrate consistent improvements over deterministic baselines, with +5.36% AUC gain in the most challenging configuration.

2604.12969 2026-04-15 cs.CV

AbdomenGen: Sequential Volume-Conditioned Diffusion Framework for Abdominal Anatomy Generation

Yubraj Bhandari, Lavsen Dahal, Paul Segars, Joseph Y. Lo

详情
英文摘要

Computational phantoms are widely used in medical imaging research, yet current systems to generate controlled, clinically meaningful anatomical variations remain limited. We present AbdomenGen, a sequential volume-conditioned diffusion framework for controllable abdominal anatomy generation. We introduce the \textbf{Volume Control Scalar (VCS)}, a standardized residual that decouples organ size from body habitus, enabling interpretable volume modulation. Organ masks are synthesized sequentially, conditioning on the body mask and previously generated structures to preserve global anatomical coherence while supporting independent, multi-organ control. Across 11 abdominal organs, the proposed framework achieves strong geometric fidelity (e.g., liver dice $0.83 \pm 0.05$), stable single-organ calibration over $[-3,+3]$ VCS, and disentangled multi-organ modulation. To showcase clinical utility with a hepatomegaly cohort selected from MERLIN, Wasserstein-based VCS selection reduces distributional distance of training data by 73.6\% . These results demonstrate calibrated, distribution-aware anatomical generation suitable for controllable abdominal phantom construction and simulation studies.

2604.12968 2026-04-15 cs.LG cs.CV

Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, Yong Liu, Shuicheng Yan

详情
英文摘要

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam, serve as the cornerstone of modern training pipelines. However, large-scale model training, stringent differential privacy requirements, and distributed learning paradigms expose critical limitations in these conventional approaches regarding privacy protection and memory efficiency. To mitigate these bottlenecks, researchers explore second-order optimization techniques to surpass first-order performance ceilings, while zeroth-order methods reemerge to alleviate memory constraints inherent to large-scale training. Despite this proliferation of methodologies, the field lacks a cohesive framework that unifies underlying principles and delineates application scenarios for these disparate approaches. In this work, we retrospectively analyze the evolutionary trajectory of deep learning optimization algorithms and present a comprehensive empirical evaluation of mainstream optimizers across diverse model architectures and training scenarios. We distill key emerging trends and fundamental design trade-offs, pinpointing promising directions for future research. By synthesizing theoretical insights with extensive empirical evidence, we provide actionable guidance for designing next-generation highly efficient, robust, and trustworthy optimization methods. The code is available at https://github.com/APRIL-AIGC/Awesome-Optimizer.

2604.12967 2026-04-15 cs.AI

Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training

Sohyun An, Shuibenyang Yuan, Hayeon Lee, Cho-Jui Hsieh, Alexander Min

详情
英文摘要

Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.