arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.01664 2026-05-05 cs.IR

A Hybrid Retrieval and Reranking Framework for Evidence-Grounded Retrieval-Augmented Generation

Fariba Afrin Irany, Sampson Akwafuo

Abstract

Retrieval-augmented generation (RAG) improves large language model reliability by grounding generated responses in external evidence. However, RAG performance depends on the relevance of retrieved passages, the quality of evidence ranking, and the ability to verify whether generated claims are supported by source documents. This study presents a hybrid retrieval and reranking framework for citation-aware RAG in biomedical and healthcare-related document question answering. The framework uses Amazon Bedrock Knowledge Bases for document ingestion, parsing, chunking, embedding generation, and evidence retrieval. Source PDF documents are stored in Amazon S3, embedded using Amazon Titan Text Embeddings V2, and indexed with Amazon OpenSearch Serverless. Hybrid retrieval first retrieves candidate evidence chunks, and Cohere reranking then prioritizes the most relevant passages before answer generation. The answer-generation stage uses top-ranked evidence chunks to produce controlled, evidence-grounded responses, while a separate judge model evaluates each generated factual claim against the retrieved evidence. The framework was evaluated using 25 biomedical NLP and healthcare transformer queries as a pilot-scale proof-of-concept study. Across the evaluation set, the system retrieved and reranked 500 evidence chunks and generated answers from top-ranked evidence. Claim-level grounding evaluation extracted 200 factual claims, all of which were judged to be supported by retrieved evidence, resulting in 100.0% grounding accuracy. The results suggest that hybrid retrieval, reranking, conservative prompting, and claim-level evaluation can support reliable evidence-grounded RAG responses when sufficient source evidence is available.

2605.01662 2026-05-05 cs.CV

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

Martin Q. Ma, Willis Guo, Aditya Agrawal, Ankit Gupta, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

Comments ICCV 2025 workshop

Abstract

Large vision-language models (VLMs) have advanced multimodal tasks such as video question answering (QA). However, VLMs face the challenge of selecting frames effectively and efficiently, as standard uniform sampling is expensive and performance may plateau. Inspired by active perception theory, which posits that models gain information by acquiring data that differs from their expectations, we introduce Video Active Perception (VAP), a training-free method to enhance long-form video QA using VLMs. Our approach treats keyframe selection as data acquisition in active perception and leverages a lightweight text-conditioned video generation model to represent prior world knowledge. Empirically, VAP achieves state-of-the-art zero-shot results on long-form or reasoning video QA datasets such as EgoSchema, NExT-QA, ActivityNet-QA, IntentQA, and CLEVRER, improving frame efficiency (measured in frames per question) by up to 5.6x over standard GPT-4o, Gemini 1.5 Pro, and LLaVA-OV. Moreover, VAP shows stronger reasoning abilities than previous methods and effectively selects keyframes relevant to questions. These findings highlight the potential of leveraging active perception to improve the frame effectiveness and efficiency of long-form video QA.

2605.01659 2026-05-05 cs.CV cs.AI

TRIMMER: A New Paradigm for Video Summarization through Self-Supervised Reinforcement Learning

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Abstract

The rapid growth of video content across domains such as surveillance, education, and social media has made efficient content understanding increasingly critical. Video summarization addresses this challenge by generating concise yet semantically meaningful representations, but existing approaches often rely on expensive manual annotations, struggle to generalize across domains, and incur significant computational costs due to complex architectures. Moreover, unsupervised and weakly supervised methods typically underperform compared to supervised counterparts in capturing long-range temporal dependencies and semantic structure. In this work, we propose TRIMMER (Temporal Relative Information Maximization for Multi-objective Efficient Reinforcement), a novel self-supervised reinforcement learning framework for video summarization. TRIMMER operates in two stages: it first learns robust representations via self-supervised learning and then performs spatio-temporal decision making through reinforcement learning guided by information-theoretic reward functions. Unlike prior approaches that rely on similarity-based objectives, our method introduces entropy-based metrics to capture higher-order temporal dynamics and semantic diversity, while computing rewards directly over selected frame indices to improve computational efficiency. Extensive experiments on standard benchmarks demonstrate that TRIMMER achieves state-of-the-art performance among unsupervised and self-supervised methods, while remaining competitive with leading supervised approaches, highlighting its effectiveness for scalable and generalizable video summarization.

2605.01657 2026-05-05 cs.CV

Act2See: Emergent Active Visual Perception for Video Reasoning

Martin Q. Ma, Yuxiao Qu, Aditya Agrawal, Willis Guo, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

Comments CVPR 2026

Abstract

Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This approach cultivates an emergent capability: at inference time, the model actively determines when to search for or synthesize the necessary visual evidence. Act2See establishes new state-of-the-art results on challenging benchmarks, including VideoEspresso and ViTIB, and outperforms comparable or larger models on Video-MME, EgoNormia, and VCR-Bench, demonstrating an advancement in enabling VLMs with active visual perception for video reasoning.

2605.01656 2026-05-05 q-bio.NC cs.AI cs.LG

From Cortical Synchronous Rhythm to Brain Inspired Learning Mechanism: An Oscillatory Spiking Neural Network with Time-Delayed Coordination

Tingting Dan, Guorong Wu

Comments 19 pages, 6 figures

Abstract

Human cognition emerges from coordinated spiking dynamics in distributed neural circuits, where information is encoded via both firing rates and precise spike timing determined by brain rhythms. Inspired by this notion, we propose a brain-inspired learning primitive in which cognition-level neural synchrony emerges through iterative bottom-up and top-down interactions between micro-scale dynamics of spiking neurons and a macro-scale mechanism of oscillatory synchronization. Specifically, we model each parcel (e.g., a cortical region or an image pixel) in the target system as a spiking neuron embedded in a predefined connectivity scaffold. Low-level information is encoded in a spatiotemporal domain, where neurons are selectively grouped and fire spontaneously over time through self-organized dynamics. In the bottom-up route, oscillatory synchronization is formed from past spiking activity accumulated over a finite memory window. Since brain dynamics operate in a regime of partial and transient synchronization rather than global phase locking, we model oscillatory coordination using a time-delayed synchronization formulation, which enables a top-down modulation of heterogeneous neural spiking for a large-scale distributed system. Together, we devise a spiking-by-synchronization neural network (S2-Net) that uses rhythmic timing as a control mechanism for efficient information processing. Promising results have been achieved across a broad range of tasks, including neural activity decoding, energy-efficient signal processing, temporal binding and semantic reasoning.

2605.01655 2026-05-05 math.CA cs.LG

Exact Loop Controllers for ReLU Realization of Homogeneous Curve Refinements

Boldsaikhan Bolorkhuu, Tsogtgerel Gantumur

Comments 39 pages, 6 figures

Abstract

We study homogeneous refinement operators \((Vγ)(t)=\sum_{j\in\mathbb Z}A_jγ(Mt-j)\), acting on compactly supported continuous piecewise linear curves \(γ:\mathbb R\to\mathbb R^p\), where \(M\ge2\) and only finitely many matrices \(A_j\in\mathbb R^{p\times p}\) are nonzero. We prove that the iterates \(V^nγ\) admit exact ReLU realizations of fixed width and depth \(O(n)\). The main new ingredient is an exact loop controller for the residual dynamics. Instead of propagating scalar residual surrogates, the construction transports the residual orbit by a forward-exact state on a polygonal loop. Scalar factors and digit selectors are then recovered from this loop state by complementary CPwL readouts. The loop seam is not removed, but its remaining ambiguity is confined to the final readout/selector stage, where it is harmless because the scalar atom is supported away from the seam. This gives a homogeneous \(M\)-ary vector-valued extension of the scalar binary refinable-function construction with a more geometric controller architecture. We also record crude exponential bounds on the network weights and biases. Affine forcing terms are handled by expanding affine iterates into finite sums of homogeneous iterates, giving exact fixed-width realizations with depth \(O(n^2)\), and anchored open curves reduce to compactly supported defects with affine anchor mismatch. We also describe homogeneous polygonal generators, including dragon-type examples and a self-intersecting Hilbert-type prototype in arbitrary dimension. The extended version includes stage-dependent forcing, finite-state stacking reductions, and further geometric constructions such as Koch-, Gosper-, Morton-, and connector-based Hilbert-type variants.
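
The refinement operator in this abstract can be made concrete with a small scalar case. The sketch below assumes p = 1, M = 2 and a hypothetical mask A_{-1} = 1/2, A_0 = 1, A_1 = 1/2 (none of these values come from the paper), for which the hat function is a fixed point of V. It evaluates the iterates by naive recursion, so it illustrates the operator and the ReLU form of a piecewise linear curve only; it is not the fixed-width, depth O(n) construction the authors prove.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def hat(t: float) -> float:
    """Hat function max(0, 1 - |t|), written as a combination of three ReLUs."""
    return relu(t + 1.0) - 2.0 * relu(t) + relu(t - 1.0)


# Hypothetical scalar mask (p = 1, M = 2), chosen for illustration only.
MASK = {-1: 0.5, 0: 1.0, 1: 0.5}
M = 2


def refine(gamma, mask=MASK, m=M):
    """Return V(gamma), where (V gamma)(t) = sum_j A_j * gamma(m * t - j)."""
    def refined(t: float) -> float:
        return sum(a * gamma(m * t - j) for j, a in mask.items())
    return refined


def iterate(gamma, n: int):
    """Apply the refinement operator n times (naive recursion, cost 3**n)."""
    for _ in range(n):
        gamma = refine(gamma)
    return gamma


if __name__ == "__main__":
    v3 = iterate(hat, 3)
    for t in (-0.75, -0.25, 0.0, 0.4, 1.5):
        print(t, hat(t), v3(t))
```

With this mask the printed columns for `hat(t)` and `v3(t)` should agree up to rounding, since the hat function satisfies the two-scale relation exactly.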

2605.01654 2026-05-05 cs.CR math.FA

Limit Properties at Critical Indices of Linear Canonical Riesz Potentials and Their Applications to Security of Multi-Image Encryption

Zunwei Fu, Dachun Yang, Shuhui Yang

Comments 39 pages

Abstract

In this article we introduce the linear canonical Riesz potential (for short, LCRP) and give its symbol in terms of linear canonical transforms. Driven by image processing, we establish the convergence/divergence of these LCRPs for different kinds of functions. Concretely, for grating functions, we prove that their classical Riesz potentials diverge, whereas their LCRPs converge due to the key role of chirp functions. For the characteristic function ${\mathbf 1}_P$ of a convex polygon $P$, we show that the limit of its Riesz potential at any non-boundary point $\boldsymbol{x}$ equals ${\mathbf 1}_P(\boldsymbol{x})$, but its limit at boundary points differs from ${\mathbf 1}_P$, while it is known that, for any Schwartz function $f$, the limit of its Riesz potential at any point $\boldsymbol{x}$ always equals $f(\boldsymbol{x})$. Based on these and the inverse operator of the LCRP (namely the linear canonical Laplacian operator), we propose an asymmetric cascaded LCRP method for multi-image encryption and create an efficient and secure cryptosystem. Systematic security evaluations, including sensitivity, statistical, noise attack, and occlusion attack analyses, demonstrate its robustness and security. Even for a single image, the proposed method is more efficient than the known encryption approach based on the fractional Riesz potential. The novelty of these results lies in the fact that the convergence and the divergence of LCRPs at the critical indices, respectively, for "good" Schwartz functions and for "bad" discrete image functions essentially affect the security of image encryption and decryption.

2605.01653 2026-05-05 cs.CV

SteeringDiffusion: A Bottlenecked Activation Control Interface for Diffusion Models

Fangzheng Wu, Brian Summa

Abstract

We introduce SteeringDiffusion, a bottlenecked activation-level control interface for diffusion models that exposes a smooth, monotonic, and runtime-adjustable control surface over the content-style trade-off. Our method keeps the U-Net backbone frozen and learns a small, prompt-conditioned latent code projected to FiLM/AdaGN-style modulation parameters. A zero-initialized design guarantees exact equivalence to the base model at zero scale, while timestep-aware gating restricts modulation to later denoising stages. A single scalar at inference continuously traverses the control surface without retraining. Across experiments on Stable Diffusion 1.5 and SDXL covering multiple artistic styles, we show that SteeringDiffusion produces smooth and monotonic content-style trade-offs. Under matched parameter budgets, it outperforms LoRA in controllability and stability, while ControlNet and rank-1 adapters do not expose a comparable control surface. We further introduce an inversion-stability diagnostic based on DDIM inversion, used as a post-hoc trajectory probe, which reveals strong correlations with intervention magnitude. These results position Steering Bottlenecked Explicit Control (S-BEC) as a practical, general-purpose control interface for frozen diffusion backbones.

2605.01650 2026-05-05 cs.LG

Geospatial foundation-model embeddings improve population estimation unevenly across space and scale

Wenbin Zhang, Eimear Cleary, Francisco Rowe, Somnath Chaudhuri, Maksym Bondarenko, Shengjie Lai, Andrew J. Tatem

Abstract

Reliable subnational population estimates are essential for applications, yet remain difficult where censuses are sparse, outdated or spatially coarse. Existing population-mapping workflows rely on hand-built geospatial covariates, such as settlement extent, night-time lights, and environmental conditions, which must be assembled and harmonised across scales and geographies. Geospatial foundation models offer an alternative by learning reusable representations of place from more multifaceted and heterogeneous data sources. Here, we benchmark Population Dynamics Foundation Model (PDFM) embeddings against the harmonised geospatial covariates for subnational population estimation in Brazil, Nigeria and the United States. Under geographically structured validation, PDFM improved predictive fit, reducing unexplained variance by a median of 20.1% (IQR: 10.0-33.2%, across country-model comparisons), and reduced Kullback-Leibler divergence by 23.2% (9.2-26.2%). However, these gains were uneven. PDFM was most advantageous where the geospatial covariates weakly characterised settlement context, such as larger and less-developed subnational areas. Moreover, PDFM performance was scale-coupled, with embeddings providing less flexible transfer across spatial aggregations than geospatial covariates. These findings show that geospatial foundation-model representations of place can improve population estimation in data-poor settings, but their benefits break down predictably under spatial scale mismatch, revealing a fundamental limitation of current geospatial AI.
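
The two headline metrics in this abstract can be illustrated with a short sketch. The abstract does not give exact definitions, so the code assumes one plausible reading: "unexplained variance" is 1 - R^2, and the Kullback-Leibler divergence is taken between observed and predicted population shares across areas. The function names and toy numbers are hypothetical.

```python
import math


def unexplained_variance_reduction(r2_baseline: float, r2_new: float) -> float:
    """Relative reduction in unexplained variance, treating 1 - R^2 as unexplained."""
    return 1.0 - (1.0 - r2_new) / (1.0 - r2_baseline)


def kl_divergence(observed, predicted) -> float:
    """KL(P || Q) between population shares.

    Both inputs are per-area counts; predicted counts are assumed to be
    strictly positive wherever the observed count is positive.
    """
    p_total, q_total = sum(observed), sum(predicted)
    kl = 0.0
    for obs, pred in zip(observed, predicted):
        p, q = obs / p_total, pred / q_total
        if p > 0.0:
            kl += p * math.log(p / q)
    return kl


if __name__ == "__main__":
    # Toy numbers only: R^2 moving from 0.50 to 0.60 removes 20% of the
    # unexplained variance.
    print(unexplained_variance_reduction(0.50, 0.60))
    print(kl_divergence([300, 100], [200, 200]))
```

Under this reading, a relative reduction is comparable across countries with different baseline fits, which may be why medians and interquartile ranges are reported rather than raw differences in fit.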

2605.01647 2026-05-05 cs.CL

Beyond Perplexity: Character Distribution Signatures and the MDTA Benchmark for AI Text Detection

Priyadarshan Narayanasamy, Swastik Agrawal, Klint Faber, Fardina Fathmiul Alam

Comments 11 figures, 10 tables, 24 pages, Under Review at COLM 2026

Abstract

Training-free AI text detection methods primarily rely on model log-probabilities, achieving strong performance through approaches like Binoculars and DNA-DetectLLM. However, these methods face a fundamental ceiling as models are optimized through RLHF to produce human-like probability distributions. We introduce an alternative detection signal based on character distribution signatures. We provide theoretical foundations showing that AI models, trained on massive domain-balanced corpora, approximate global character patterns while humans exhibit domain-specialized distributions, creating a "Wall of Separation" where human-AI divergence significantly exceeds AI-AI divergence. To enable systematic evaluation, we construct the Models-Domains-Temperatures-Adversarials (MDTA) benchmark comprising 642,274 prompt-aligned samples across 4 models, 5 domains, 3 temperature settings, and 3 adversarial strategies, substantially expanding the HC3 dataset with modern model responses, temperature variation, and adversarial augmentation. We introduce the Letter Distribution Score (LD-Score), demonstrating low correlation (r = 0.08-0.13) with perplexity methods. When integrated with DNA-DetectLLM, Binoculars and FastDetectGPT via a non-linear classifier, LD-Score yields consistent improvements in AUROC and F1, with particularly pronounced gains in specialized domains where vocabulary constraints amplify the detection signal. The MDTA dataset can be accessed at: https://huggingface.co/datasets/nsp909/MDTA.

2605.01644 2026-05-05 cs.CR

Toward a Principled Framework for Agent Safety Measurement

Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan

Abstract

LLM agents emit actions, not just text, and once taken, those actions often cannot be undone. Yet today's agent-safety evaluations run greedy or a few sampled rollouts and report a single safe/unsafe rate -- blind to the long-tail trajectories where unsafe behavior may arise from low-probability but non-negligible actions. We argue agent safety should be measured by search, not sampling. We apply BOA, a framework that, given a deployment configuration (model, decoder, prompt, environment, judger, likelihood budget), searches the in-budget trajectory space and reports a safety score: the probability the agent stays safe under the configuration. BOA searches both within a single LLM round and across the agent-environment interaction tree under a given likelihood budget, and makes search practical via batched decoding/judging, prefix caching, and chunked tree expansion. On agent-safety workloads, BOA discovers unsafe trajectories that greedy and sampled evaluations miss. BOA can additionally be used for ranking models, defenses, and attacks, all on the same scale, with manageable GPU costs.
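
The contrast between search and sampling can be illustrated on a toy trajectory tree. The sketch below is not BOA: the `Node` type, the tree, the probabilities, and the pruning rule (drop any branch whose cumulative likelihood falls below the budget) are all assumptions made for illustration. It shows how exhaustive in-budget search attributes probability mass to unsafe trajectories that a single greedy rollout would never visit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy trajectory-tree node; every field is a hypothetical stand-in."""
    unsafe: bool = False
    children: list = field(default_factory=list)  # list of (probability, Node)


def search(node: Node, prob: float, budget: float):
    """Depth-first search over trajectories whose likelihood stays >= budget.

    Returns (safe_mass, unsafe_mass, pruned_mass): probability mass of the
    explored safe leaves, the explored unsafe trajectories, and the branches
    cut off by the likelihood budget.
    """
    if node.unsafe:
        return 0.0, prob, 0.0
    if not node.children:
        return prob, 0.0, 0.0
    safe = unsafe = pruned = 0.0
    for p, child in node.children:
        child_prob = prob * p
        if child_prob < budget:
            pruned += child_prob
            continue
        s, u, c = search(child, child_prob, budget)
        safe, unsafe, pruned = safe + s, unsafe + u, pruned + c
    return safe, unsafe, pruned


def toy_tree() -> Node:
    """A two-step tree in which the greedy (most likely) path is safe."""
    likely = Node(children=[(0.95, Node()), (0.05, Node(unsafe=True))])
    risky = Node(children=[(0.5, Node(unsafe=True)), (0.5, Node())])
    return Node(children=[(0.9, likely), (0.1, risky)])


if __name__ == "__main__":
    safe, unsafe, pruned = search(toy_tree(), 1.0, budget=0.01)
    print(f"safe={safe:.3f} unsafe={unsafe:.3f} pruned={pruned:.3f}")
```

In this toy tree the greedy path is safe, yet the search assigns nonzero mass to unsafe trajectories. Raising the budget prunes those low-probability branches, and the pruned mass indicates how much of the trajectory space was left unexamined.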

2605.01640 2026-05-05 cs.LG cs.CL

Prescriptive Scaling Laws for Data Constrained Training

Justin Lovelace, Christian Belardi, Srivatsa Kundurthy, Shriya Sudhakar, Kilian Q. Weinberger

Abstract

Training compute is increasingly outpacing the availability of high-quality data. This shifts the central challenge from optimal compute allocation to extracting maximum value from limited data. The widely adopted Chinchilla scaling law assumes every training token is unique. This limits its ability to guide pretraining decisions in data-constrained regimes. We model the excess loss under repetition with a simple additive overfitting penalty and find that it accurately describes model behavior. Our scaling law yields qualitatively new compute-optimal allocation advice. Beyond a point, further repetition is counterproductive and compute is better spent on model capacity. We show that following our law's recommended configuration improves performance in data-constrained regimes. Finally, because our one-parameter form isolates overfitting in a single coefficient, it enables direct comparison across training configurations. As a case study, we show that strong weight decay ($λ=1.0$) reduces this coefficient by approximately 70%, providing a scaling-law explanation for recent findings that optimal weight decay in data-constrained regimes is an order of magnitude larger than standard practice.

2605.01638 2026-05-05 cs.CV

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

Tianxiao Li, Zhenglin Huang, Haiquan Wen, Yiwei He, Xinze Li, Bingyu Zhu, Wuhui Duan, Congang Chen, Zeyu Fu, Yi Dong, Baoyuan Wu, Jason Li, Guangliang Cheng

Comments Accepted to CVPR 2026

Abstract

Multimodal deepfakes are proliferating on social media and threaten authenticity, information integrity, and digital forensics. Existing benchmarks are constrained by their single-modality scope, simplified manipulations, or unrealistic distributions, which limit their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: https://tianxiao1201.github.io/omni-fake-project-page/

2605.01637 2026-05-05 cs.LG cs.CC cs.DM math.CO

The Banach-Butterfly Invariant: Influence-Adaptive Walsh Geometry for Ternary Polynomial Threshold Functions

Gorgi Pavlov

Comments 21 pages, 3 figures. Theory paper; LLM-application companion in preparation. Code, certificates, and 616,126 NPN-canonical n=5 representatives in supplementary repository

Abstract

We introduce the Banach-Butterfly Invariant (BBT), an influence-adaptive Banach geometry on the Walsh-Hadamard butterfly factorization. For a Boolean function $f:\{-1,+1\}^n\to\{-1,+1\}$ with coordinate influences $\mathrm{Inf}_\ell(f)$, BBT assigns exponent $p_\ell = 1+\mathrm{Inf}_\ell(f)$ to butterfly layer $\ell$, yielding the contraction invariant $μ(f)=\prod_\ell 2^{-\mathrm{Inf}_\ell/(1+\mathrm{Inf}_\ell)}$. We prove a Jensen lower bound $\log_2μ(f) \ge -I(f)/(1+I(f)/n)$ and that $μ$ is strictly Schur-convex in the influence vector (modulo permutation), giving scaling classes $μ\sim 2^{-n/2}$ (parity), $2^{-Θ(\sqrt{n})}$ (majority), $2^{-1/2}$ (dictators). $\log_2μ$ is rational but not polynomial in the Fourier coefficients while $μ$ is algebraic, and $μ$ separates functions with identical total influence (122 pairs at $n=3$). Using the certified $n \le 4$ ternary Walsh-threshold universe from a companion synthesis manuscript as a finite testbed, we compute exact MILP minimum-support certificates for all 65,536 Boolean functions at $n=4$ (mean 6.42, max 9, all-odd by a parity argument) and on 10,000 of the 616,126 NPN-canonical representatives we enumerate at $n=5$ (matching OEIS A000370). Conditional Spearman $ρ(μ,|\mathrm{supp}|)$ at fixed total influence is $+0.571$ in the largest stratum at $n=4$ but reverses to $-0.38$ at $n=5$ under both function-uniform and NPN-canonical sampling: $μ$ is a valid Schur-convex concentration invariant, not a universal monotone predictor of minimum support across $n$. A companion application paper validates a real-valued WHT activation-energy proxy inspired by this theory on five pretrained LLMs at W2A16, cutting wikitext-2 perplexity by 15-58% versus vanilla auto-round; the transfer from Boolean theory to the real-valued proxy is qualitative, not formal.
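
The invariant defined in this abstract is simple to evaluate by brute force for small n. The sketch below enumerates the hypercube to compute coordinate influences, then evaluates log2 μ(f) = -Σ Inf_ℓ/(1+Inf_ℓ) and the stated Jensen lower bound -I(f)/(1+I(f)/n). Function names are illustrative and the enumeration is exponential in n, so it is a sanity check on the formulas rather than the paper's tooling.

```python
import itertools
import math


def influences(f, n: int):
    """Inf_l(f) = Pr_x[f(x) != f(x with coordinate l flipped)], x uniform on {-1,+1}^n."""
    counts = [0] * n
    for x in itertools.product((-1, 1), repeat=n):
        fx = f(x)
        for l in range(n):
            flipped = x[:l] + (-x[l],) + x[l + 1:]
            if f(flipped) != fx:
                counts[l] += 1
    return [c / 2 ** n for c in counts]


def log2_mu(f, n: int) -> float:
    """log2 of mu(f) = prod_l 2^(-Inf_l / (1 + Inf_l))."""
    return -sum(inf / (1.0 + inf) for inf in influences(f, n))


def jensen_bound(f, n: int) -> float:
    """Lower bound -I(f) / (1 + I(f)/n), with I(f) the total influence."""
    total = sum(influences(f, n))
    return -total / (1.0 + total / n)


def parity(x):
    return math.prod(x)


def dictator(x):
    return x[0]


def majority(x):
    # Intended for odd n, where the sum is never zero.
    return 1 if sum(x) > 0 else -1


if __name__ == "__main__":
    for name, f, n in (("parity", parity, 4), ("dictator", dictator, 4), ("majority", majority, 3)):
        print(name, n, log2_mu(f, n), jensen_bound(f, n))
```

For parity every influence equals 1, so log2 μ = -n/2 and the bound is tight; for a dictator only one coordinate has influence 1, giving log2 μ = -1/2 regardless of n. Both match the scaling classes quoted in the abstract.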

2605.01636 2026-05-05 math.LO cs.LO

Inexpressibility in Exp-Minus-Log

Mark Carney

Comments 5 pages

Abstract

Odrzywołek defined a system Exp-Minus-Log (EML) that reduces all elementary functions over complex numbers down to a constant $1$ and a single two-place function $E(α, β) = \exp(α) - \log(β)$. This paper shows that in this system, equivalent to Chow's EL numbers, every EML-expressible number is computable. We go on to prove that the canonical example of a non-computable real, Chaitin's $Ω_U$, is inexpressible in EML. This gives a formal inexpressibility theorem for this system.
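
To see how a single operator and the constant 1 can generate familiar functions, the sketch below builds e, exp, and log from E alone. The compositions follow by direct substitution and are checked numerically here; they are not quoted from either paper. The helper names are illustrative, and the log identity is only exercised for positive real arguments, where the principal branch of `cmath.log` causes no ambiguity.

```python
import cmath

ONE = 1 + 0j


def E(a: complex, b: complex) -> complex:
    """The single EML operator: E(a, b) = exp(a) - log(b), principal branch."""
    return cmath.exp(a) - cmath.log(b)


def e_const() -> complex:
    """e = E(1, 1), since log(1) = 0."""
    return E(ONE, ONE)


def exp_eml(x: complex) -> complex:
    """exp(x) = E(x, 1)."""
    return E(x, ONE)


def log_eml(x: complex) -> complex:
    """log(x) = E(1, E(E(1, x), 1)).

    E(1, x) = e - log(x); applying E(., 1) gives e^e / x; then
    E(1, e^e / x) = e - (e - log(x)) = log(x).
    """
    return E(ONE, E(E(ONE, x), ONE))


if __name__ == "__main__":
    print(e_const())
    print(exp_eml(0.5), cmath.exp(0.5))
    print(log_eml(2.0), cmath.log(2.0))
```

Every value built this way is a finite composition of computable operations on a computable constant, which is the intuition behind the claim that each EML-expressible number is computable.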

2605.01634 2026-05-05 cs.LG

Chebyshev-Augmented One-Shot Transfer Learning for PINNs on Nonlinear Differential Equations

Yiqi Rao, Pavlos Protopapas

Comments 18 pages, 4 figures, 9 tables, accepted to ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations

Abstract

Physics-Informed Neural Networks (PINNs) offer a flexible paradigm for solving differential equations by embedding governing laws into the training objective. A persistent limitation is instance specificity: standard PINNs typically require retraining for each new forcing term, boundary/initial condition, or parameter setting. One-shot transfer learning (OTL) addresses this bottleneck for linear operators by freezing a pretrained latent representation and computing optimal output weights in closed form, but for nonlinear problems closed-form adaptation is generally unavailable because the loss is nonconvex in the output layer. In this paper we substantially broaden the class of nonlinearities amenable to one-shot PINN transfer by combining OTL with Chebyshev polynomial surrogates. We approximate general smooth weakly nonlinear terms by truncated Chebyshev expansions over a prescribed solution range, yielding a polynomial nonlinearity that can be handled by a perturbative decomposition into linear subproblems. A multi-head PINN learns a reusable latent space associated with the dominant linear operator; at test time, solutions to new instances are obtained via a sequence of closed-form linear solves in the output layer, without retraining the network body. We provide a unified derivation of the framework for ODEs and PDEs and demonstrate accuracy and fast online adaptation on nonlinear benchmarks, including non-polynomial and singular ODE nonlinearities as well as a reaction-diffusion PDE with saturating kinetics, demonstrating the method's utility in many-query regimes.

2605.01632 2026-05-05 cs.LG

Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy

Eleanor Quint

Abstract

Models that are indistinguishable on in-distribution data can behave very differently under distribution shift. We introduce Perturb-and-Correct (P&C), a post-hoc method for constructing epistemically diverse predictors from a single pretrained network. P&C applies random hidden layer perturbations with a least-squares correction in the subsequent affine layer, producing predictors that agree on calibration data while remaining free to disagree away from it. We analyze this mechanism through the post-correction residual and its first-order sensitivity: the residual is controlled near the calibration distribution by a leverage term, while corrected sensitivity grows as inputs deviate from the calibration geometry. Empirically, P&C achieves a strong ID/OOD tradeoff across MuJoCo dynamics prediction and CIFAR-10 OOD detection, matching or outperforming standard post-hoc baselines while requiring only a single pretrained model. Our findings highlight the potential in further exploiting overparameterization as a strength of deep learning models.

2605.01630 2026-05-05 cs.CL cs.AI

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

Roseval Malaquias Junior, Giovana Kerche Bonás, Thales Sales Almeida, Hugo Abonizio, Thiago Laitz, Ramon Pires, Marcos Piau, Celio Larcher, Rodrigo Nogueira

Abstract

Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To support this claim, we introduce Prosa, the first real user multi-turn Brazilian Portuguese chat benchmark: 1,000 WildChat conversations scored by three judges from three model families on 16 models. Under filtered rubric scoring the three judges agree on every one of the 16 ranks, whereas under holistic scoring they agree on only 7 of 16. Additionally, the rubric filtering pipeline increases the average score gap between neighbouring models by 47%, thereby improving Prosa's discriminative power. Evaluating a new model on Prosa costs approximately $2.1 when using Gemini 3 Flash as the judge. We release the benchmark and the filtering code to ensure that future models can be assessed under identical conditions. These artifacts also make our rubric-based scoring method reusable beyond Prosa, supporting other open-ended evaluation settings.

2605.01628 2026-05-05 stat.ML cs.LG math.ST stat.TH

Self-Normalized Martingales and Uniform Regret Bounds for Linear Regression

Fan Chen, Jian Qian, Alexander Rakhlin, Nikita Zhivotovskiy

Abstract

Self-normalized martingale inequalities lie at the heart of confidence ellipsoids for online least squares and, more broadly, many bandit and reinforcement-learning results. Yet existing vector and scalar results typically rely on bounded covariates and an explicit regularization matrix, producing bounds that are not scale-invariant: although the self-normalized quantity is scale-invariant by definition, its standard upper bounds are not. We characterize when scale-invariant upper bounds on self-normalized martingales are possible. Without further assumptions, we prove that nontrivial scale-invariant bounds exist only in dimension $d=1$; moreover, in $d=1$ we obtain $O(\log T)$ scale-invariant self-normalized bounds without any assumptions on the covariates. In contrast, for $d>1$ we show that no nontrivial scale-invariant bound can hold in full generality. We then connect this dichotomy to doubly-uniform regret in online linear regression (i.e., regret bounds that are simultaneously independent of the covariate scale and the comparator norm) and use it to resolve the open question of Gaillard, Gerchinovitz, Huard, and Stoltz, "Uniform regret bounds over $\mathbb{R}^d$ for the sequential linear regression problem with the square loss" (ALT 2019): in $d=1$ we give an explicit algorithm with $O(\log T)$ doubly-uniform regret, whereas for $d>1$ sublinear doubly-uniform regret is impossible. Finally, under a natural smoothness condition (bounded Radon-Nikodym derivatives of the conditional covariate laws with respect to a fixed base measure), we recover sublinear regret for $d>1$ without bounded covariates and derive a self-normalized concentration inequality free of the usual regularization penalties, yielding arguably a first natural scale-invariant bound for adaptive, non-i.i.d. vector martingales.

2605.01617 2026-05-05 math.NA cs.NA math.AP

Discontinuity Analysis and Semi-Analytic Spectral Approximation for the Nonlocal Poisson Equation

Thinh Dang, Bacim Alali, Nathan Albin

Abstract

We study a nonlocal Poisson problem with discontinuous source term and analyze how the regularity of the integral kernel determines the discontinuity structure of the corresponding solution. Under general assumptions on compactly supported integrable kernels, we show that jump discontinuities in the source term are inherited by the solution. We then identify two principal mechanisms governing higher-order regularity: singular behavior of the kernel at the origin and jump discontinuities of the kernel, or of its derivatives, at the horizon endpoints. Singularities at the origin lead to blow-up of certain derivatives of the solution at the source discontinuity, while jumps at the horizon generate cascades of derivative discontinuities at translated locations. These phenomena occur for kernels commonly used in peridynamic-type models. By contrast, compactly supported \(C^\infty\) kernels do not generate derivative blow-up or cascading losses of regularity, and in this case the source term and the solution have equivalent piecewise smooth regularity. Motivated by this analysis, we develop a semi-analytic spectral method for the accurate numerical treatment of discontinuous nonlocal problems. The method uses successive smoothing transformations and explicitly constructed correction functions to convert the original problem into an auxiliary problem with improved regularity. A spectral solver is then applied to the smoothed problem, and the approximation to the original solution is recovered by adding back the analytic corrections. Numerical experiments show substantial gains in accuracy and convergence, demonstrating that the method effectively mitigates the loss of accuracy caused by discontinuities and Gibbs oscillations while retaining the efficiency of spectral methods.

2605.01614 2026-05-05 cs.DC cs.OS

CvxCluster: Solving Large, Complex, Granular Resource Allocation Problems 100-1000x Faster

Obi Nnorom, Stephen Boyd, Philip Levis

Comments 13 pages, 5 figures, 2 tables. Submitted to SOSP 2026

Details
English abstract

Cluster resource allocation is a multidimensional search problem that finds the best allocation of tasks to servers. Because the search space grows exponentially, modern approaches frame it as a mixed integer program (MIP) or a complex set of search heuristics. This paper proposes a different approach: convex optimization, which has extremely fast solution methods. The research challenge is devising how to transform cluster resource allocation into a convex problem that generates good placements. We describe CvxCluster, which allocates cluster resources with a two-stage algorithm. The first stage solves a convex relaxation of the placement problem to yield a principled set of per-machine resource prices. The second stage uses these prices to drive a lightweight greedy procedure to place tasks. Experimental results with Azure traces find that CvxCluster scales to 100,480 servers under proportional workload growth and sustains arrival rates up to 500,000x the baseline trace. CvxCluster runs 100 to 2,500x faster than a state-of-the-art MIP solver while remaining within 3% of the optimal objective. CvxCluster can support complex constraints such as job anti-affinity, machine types, and GPU servers. The key insight behind CvxCluster is that reformulating placement as a continuous rather than discrete problem enables much faster methods that find solutions as good as or better than those of prior heuristics.
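
The second stage described above (prices driving a greedy placement) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not CvxCluster's actual procedure: the function name `greedy_place`, the largest-first task ordering, and the cost rule (dot product of a machine's price vector with the task's demand vector) are all hypothetical, and the per-machine prices are taken as given rather than computed from a convex relaxation.

```python
from typing import Dict, List, Optional, Sequence


def greedy_place(
    tasks: Dict[str, Sequence[float]],
    capacities: Sequence[Sequence[float]],
    prices: Sequence[Sequence[float]],
) -> Dict[str, Optional[int]]:
    """Assign each task to the cheapest machine that still has room.

    tasks:      task name -> demand per resource (e.g. CPU, memory)
    capacities: one capacity vector per machine
    prices:     one price vector per machine (assumed to come from stage 1)
    Returns task name -> machine index, or None if no machine fits.
    """
    remaining: List[List[float]] = [list(c) for c in capacities]
    placement: Dict[str, Optional[int]] = {}

    # Hypothetical ordering choice: place the largest tasks first.
    ordered = sorted(tasks.items(), key=lambda kv: -sum(kv[1]))

    for name, demand in ordered:
        best: Optional[int] = None
        best_cost = float("inf")
        for m, (cap, price) in enumerate(zip(remaining, prices)):
            fits = all(d <= c for d, c in zip(demand, cap))
            if not fits:
                continue
            # Cost of the task on this machine under the given prices.
            cost = sum(p * d for p, d in zip(price, demand))
            if cost < best_cost:
                best, best_cost = m, cost
        placement[name] = best
        if best is not None:
            for i, d in enumerate(demand):
                remaining[best][i] -= d
    return placement


if __name__ == "__main__":
    demo_tasks = {"a": (4, 8), "b": (2, 2), "c": (4, 4)}
    demo_caps = [(4, 8), (8, 8)]
    demo_prices = [(1.0, 0.5), (2.0, 1.0)]
    print(greedy_place(demo_tasks, demo_caps, demo_prices))
```

Because each task only needs one pass over the machines, the greedy step stays cheap; the quality of the result depends on how informative the prices are, which is the part the abstract attributes to the convex relaxation.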

2605.01611 2026-05-05 cs.CY cs.AI cs.LG

The Case for ESM3 as a General-Purpose AI Model with Systemic Risk Under the EU AI Act

Taro Qureshi, Jacob Griffith, Koen Holtman, Marcel Mir Teijeiro, Ze Shen Chin, Rokas Gipiškis

Comments 8 pages, 1 figure, Technical AI Safety Conference

Details
English abstract

Because the wording of the EU AI Act is ambiguous, we examine to what extent frontier biological foundation models such as ESM3 are subject to the Act's obligations for general-purpose AI models with systemic risk. In this paper, we map ESM3 to the biorisk chain, and conclude that it would be desirable if the providers of ESM3 and similar biological models were subject to these obligations, which would require them to assess and mitigate dual-use risks from their models. We then compare the attributes of ESM3 to the classification criteria in the AI Act and the supporting material. We conclude that at this time, ESM3 does not appear to be meaningfully regulated by the Act. We then propose remedies to correct the situation.

2605.01610 2026-05-05 cs.HC cs.AI

Less Interaction But More Explanation: A Communication Perspective on Agentic AI Interfaces

Eunchae Jang, S. Shyam Sundar

Details
Journal ref
Proceedings of the CHI 2026 Workshop on Human-Centered Explainable AI (HCXAI), Barcelona, Spain, 2026
English abstract

AI systems have long been expected to interact with users, answering questions, generating content, and continuing (social) conversations. Agentic AI, however, breaks from this expectation, as its primary objective is workflow execution on behalf of the users. If a system becomes more agentic, do users need less interaction with the system? Our answer is: less routine back-and-forth, but more communication for oversight and explanation, as agentic AI proactively acts, not just responds. Grounded in a communication perspective, we discuss how users perceive the communicative roles of AI systems (whether as the source of actions or merely a channel), and how this can shape trust. Because agentic AI can play multiple communicative roles, it can complicate this source perception and introduce potential risks. To address this, we propose three types of explanations that agentic AI needs to incorporate (action-process, uncertainty, and coordination), and suggest that customization affordances that allow users to decide when and which explanations they see may be key to preserving human agency as AI autonomy increases.

2605.01609 2026-05-05 cs.LG cs.AI

Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations

Pratyush Acharya, Nuraj Rimal, Habish Dhakal

Comments 25 pages, 16 figures, 13 tables

Details
English abstract

We test whether the causal inner product of \citet{park2024linear} -- defined by the unembedding covariance $\Sigma$ -- enables cross-lingual concept transport. Across 17 models and 4 language pairs, a matched-spectrum randomization test finds that Whitened Causal Alignment is indistinguishable from spectral regularization alone ($p = 0.95$). However, this failure reveals a broader phenomenon: anti-concentration is observed in residual-stream difference-of-means vectors across five architecture families ($p < 10^{-33}$) and supported by SAE features (e.g., $p = 4.5 \times 10^{-19}$) and linear probes on Gemma and Llama. We discover a \emph{dual geometry}: activation-space concept directions anti-concentrate in the spectral tail, while static unembedding-row contrasts \emph{concentrate} in high-variance directions ($p < 10^{-4}$). Split-injection causal interventions support the functional basis on Gemma and Llama (Cohen's $d$ up to $1.80$), and POS-tag probing across 8 models shows that syntax is preferentially encoded in the high-variance subspace in 6 of 8 architectures ($p < 0.013$), with the Qwen~2.5 family showing a significant reversal consistent with architecture-specific spectral structure. These results suggest transformers may rotate semantic content into spectrally quiet regions during contextualized processing, encoding concepts where they can be manipulated with reduced grammatical disruption.

2605.01605 2026-05-05 cs.CL cs.AI

Where Do Prompt Perturbations Break Generation? A Segment-Level View of Robustness in LoRA-Tuned Language Models

Zhuoyun Li, Boxuan Wang, Jinwei Hu, Zhenglin Huang, Qisong He, Xinmiao Huang, Guangliang Cheng, Xiaowei Huang, Yi Dong

Comments Under review

Details
English abstract

Large language models are sensitive to minor prompt perturbations, yet existing robustness methods usually enforce consistency at the whole-sequence level. This holistic view can hide an important failure mode: a perturbed response may remain globally similar to the clean one while drifting on a critical entity, relation, or conclusion. We introduce S$^2$R$^2$, a segment-level framework for robust LoRA fine-tuning. S$^2$R$^2$ decomposes clean and perturbed generations into semantic segments, aligns them with an optimal-transport objective, and penalises the segments with the largest meaning drift. To connect this output-side objective with model adaptation, we add an adapter-stability regulariser motivated by segment-level attention reallocation, using LoRA norm control as a tractable proxy for limiting perturbation-amplified evidence shifts. A PAC-Bayesian complexity view further explains why controlling adapter growth may support transfer beyond observed perturbations. Experiments on summarisation benchmarks show that S$^2$R$^2$ improves robustness under typographical noise, deletion, synonym replacement, and paraphrasing, while maintaining competitive clean performance and stronger cross-dataset transfer than consistency-based baselines.

2605.01604 2026-05-05 cs.AI

Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework

Mukund Pandey

Comments 11 pages, 6 tables, 1 figure. Reference implementation: https://github.com/mukund1985/llm-eval-toolkit

Details
English abstract

Existing evaluation frameworks for large language models -- including HELM, MT-Bench, AgentBench, and BIG-bench -- are designed for controlled, single-session, lab-scale settings. They do not address the evaluation challenges that emerge when agentic AI systems operate continuously in production: compounding decision errors, tool failure cascades, non-deterministic output drift, and the absence of ground truth for long-horizon tasks. This paper makes three contributions. First, we present a taxonomy of seven failure modes unique to production agentic systems, each grounded in observations from systems operating at billion-event scale. Second, we demonstrate empirically where standard metrics -- ROUGE, BERTScore, accuracy/AUC, and the agentic benchmarks above -- fail to detect each failure mode. Third, we propose PAEF (Production Agentic Evaluation Framework), a five-dimension evaluation framework with an open-source reference implementation, designed for continuous evaluation on production traffic rather than episodic benchmark runs. Our analysis shows that standard metrics fail to detect four of the seven failure modes entirely and detect three others only after a lag of multiple evaluation cycles.

2605.01600 2026-05-05 cs.SE

A Lightweight Scrum Sprint Simulation to Help Learners Traverse the Empirical Process Control Threshold Concept

Eduardo Miranda, Torgeir Dingsøyr, Pritam Chita

Comments 10 pages

Details
English abstract

Empirical process control, a way of managing work based on the observation of the successes or misfortunes of earlier activities, is a key process in Scrum and other agile development frameworks. In this experience report, we present a lightweight, scalable, free and customizable sprint simulation activity designed to teach students how to empirically control a Scrum project by engaging in the presentation and interpretation of work status information, task selection and resource allocations in a single teaching session. We reflect on our experience using the simulation as an active learning complement to direct instruction in two master level courses at two different universities and in the training of teaching assistants at a third institution, and abductively establish its effectiveness by mapping student comments to the teaching practices in the threshold concepts framework.

2605.01596 2026-05-05 cs.CL

Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection

Jany-Gabriel Ispas, Sergiu Nisioi

Comments Archaeology at SemEval-2026 Task 13

Details
English abstract

This paper describes the system submitted by team \textbf{Archaeology} to SemEval-2026 Task~13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs.\ AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A (6th/81 teams) and 0.422 on Subtask-B (7th/34 teams).
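
The "chunked inference with trimmed-mean aggregation" step mentioned for Subtask-A can be illustrated with a small sketch. Everything below is an assumption made for illustration: the line-based chunking, the names `chunk_lines`, `trimmed_mean`, and `classify`, the trim fraction, and the 0.5 threshold are hypothetical and are not taken from the team's system. The `score_chunk` callable stands in for a fine-tuned model that returns the probability that a chunk is AI-generated.

```python
from typing import Callable, List, Sequence


def chunk_lines(code: str, max_lines: int = 50) -> List[str]:
    """Split source code into consecutive chunks of at most max_lines lines."""
    lines = code.splitlines()
    chunks = [
        "\n".join(lines[i:i + max_lines])
        for i in range(0, len(lines), max_lines)
    ]
    return chunks or [""]  # keep one (empty) chunk for empty input


def trimmed_mean(scores: Sequence[float], trim_fraction: float = 0.1) -> float:
    """Mean of the scores after dropping the lowest and highest trim_fraction."""
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(scores)
    k = int(len(ordered) * trim_fraction)  # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def classify(
    code: str,
    score_chunk: Callable[[str], float],
    threshold: float = 0.5,
    trim_fraction: float = 0.1,
) -> bool:
    """Return True if the aggregated chunk score reaches the threshold."""
    scores = [score_chunk(chunk) for chunk in chunk_lines(code)]
    return trimmed_mean(scores, trim_fraction) >= threshold
```

Trimming discards the most extreme chunk scores before averaging, so a single unusual chunk (for example, a long licence header) has less influence on the file-level decision than it would under a plain mean.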

2605.01592 2026-05-05 cs.CG

Witness Set: A Visibility Problem in $NP\cap XP$

Satyabrata Jana, Debabrata Pal, Bodhayan Roy, Sasanka Roy

Comments 24 pages, 17 figures

Details
English abstract

We study the Witness Set problem, a natural dual to the classical Art Gallery problem. In the Witness Set problem, we are given a polygon $P$ and an integer $k$ as input, and the objective is to determine whether $P$ has a witness set of size at least $k$. A point set $X$ in $P$ is called a witness set if every point in $P$ is visible from at most one point in $X$. For simple polygons, we show that Witness Set lies in both $NP$ and $XP$. This stands in sharp contrast to its dual, the Art Gallery problem, which was recently shown to be $\exists \mathbb{R}$-complete by Abrahamsen et al. and therefore neither lies in $NP$ nor admits a polynomial-size discretization unless $NP=\exists \mathbb{R}$. In contrast, we prove that Witness Set for simple polygons admits a finite discretization of size $n^{f(k)}$ for some function $f$. For comparison, even for simple polygons, Efrat and Har-Peled gave an algorithm for Art Gallery running in time $n^{O(k)}$ using tools from real algebraic geometry, and it appears difficult to obtain such algorithms without this machinery. On the other hand, our approach for Witness Set is purely combinatorial and relies on discretization, leading to an $n^{f(k)}$-time algorithm. Although Amit et al. claimed more than fifteen years ago that Witness Set is $NP$-hard, no proof or reference was provided. We show that the discrete version of the Witness Set problem, in which the witness set must be chosen from a given finite point set $Q$ rather than from anywhere in the polygon (referred to as Discrete Witness Set), is $NP$-complete, even when the input is restricted to rectilinear polygons with holes. However, for simple polygons, Discrete Witness Set admits a polynomial-time algorithm by Das et al. Thus, it remains an open question whether the Witness Set problem is $NP$-hard.

2605.01591 2026-05-05 cs.IR cs.CL

Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

Amin Bigdeli, Amir Khosrojerdi, Radin Hamidi Rad, Morteza Zihayat, Charles L. A. Clarke, Ebrahim Bagheri

Details
English abstract

Neural Ranking Models (NRMs) are central to modern information retrieval but remain highly vulnerable to adversarial manipulation. Existing attacks often rely on heuristics or surrogate models, limiting effectiveness and transferability. We propose CRAFT, a supervised framework for black-box adversarial rank attacks powered by large language models (LLMs). CRAFT operates in three stages: adversarial dataset generation via retrieval-augmented generation and self-refinement, supervised fine-tuning on curated adversarial examples, and preference-guided optimization to align generations with rank-promotion objectives. Extensive experiments on the MS MARCO passage dataset, TREC Deep Learning 2019, and TREC Deep Learning 2020 benchmarks show that CRAFT significantly outperforms state-of-the-art baselines, achieving higher promotion rates and rank boosts while preserving fluency and semantic fidelity. Moreover, CRAFT transfers effectively across diverse ranking architectures, including cross-encoder, embedding-based, and LLM-based rankers, underscoring vulnerabilities in real-world retrieval systems. This work provides a principled framework for studying adversarial threats in NRMs, underscores the risks of generative AI in rank manipulation, and provides a foundation for developing more robust retrieval systems. To support reproducibility, we publicly release our source code, trained models, and prompt templates.