arXivDaily: arXiv daily academic digest (updated Monday to Friday)
All subject categories: 4046
2508.21146 2026-05-12 cs.LG stat.ML

Privacy Auditing Synthetic Data Release through Local Likelihood Attacks

Joshua Ward, Chi-Hua Wang, Guang Cheng

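AI summary: This paper studies privacy leakage in synthetic data release and proposes the Generative Likelihood Ratio Attack (Gen-LRA), a novel model-free membership inference attack based on local likelihood ratios. The method requires no model access or knowledge; it detects whether training data have leaked by evaluating the influence a test sample has on the estimate of a local likelihood ratio over the synthetic data. Theoretical analysis shows that Gen-LRA can effectively distinguish members from non-members under local overfitting, and it outperforms existing methods across multiple datasets and model architectures, highlighting the privacy threat posed by generative model overfitting.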

Details
English abstract

Auditing the privacy leakage of synthetic data is an important but unresolved problem. Existing privacy auditing frameworks for synthetic data rely on heuristics and unrealistic assumptions about model access, offering limited ability to describe or detect the privacy exposure of training data through synthetic data release. In this paper, we study the design of membership inference attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. We propose the Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has on a surrogate model's estimate of a local likelihood ratio over the synthetic data. We develop a theoretical framework for the attack: we show that the Gen-LRA score admits a closed-form characterization as a localized density-ratio statistic, and we prove that under a general model of local overfitting it produces a mean-score gap between members and non-members, yielding testable predictions for when the attack should succeed. We validate these predictions in a controlled simulation study and assess Gen-LRA against a comprehensive benchmark spanning diverse datasets, generative model architectures, and attack parameters. Across metrics, Gen-LRA consistently dominates competing MIAs, with especially strong gains at low false positive rates. These results underscore Gen-LRA's effectiveness as a privacy auditing tool for the release of synthetic data, and highlight the significant privacy risks posed by generative model overfitting in real-world applications.
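A minimal sketch of a local likelihood-ratio membership score in the spirit of Gen-LRA. It assumes a fixed-bandwidth Gaussian KDE as the surrogate density model, a held-out reference sample from the same population, and the `k` synthetic points nearest the test observation as the local region; `kde_logpdf`, `local_lr_score`, `k`, and `bandwidth` are hypothetical names and choices, not the paper's implementation.

```python
import numpy as np


def kde_logpdf(data: np.ndarray, points: np.ndarray, bandwidth: float) -> np.ndarray:
    """Log-density of an isotropic Gaussian KDE fitted on `data`, evaluated at `points`."""
    d = data.shape[1]
    sq_dist = np.sum((points[:, None, :] - data[None, :, :]) ** 2, axis=-1)  # (m, n)
    log_kernel = -0.5 * sq_dist / bandwidth**2 - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Numerically stable log-mean-exp over the kernel centres.
    peak = log_kernel.max(axis=1, keepdims=True)
    return (peak + np.log(np.mean(np.exp(log_kernel - peak), axis=1, keepdims=True))).ravel()


def local_lr_score(x_star, synthetic, reference, k=10, bandwidth=0.5) -> float:
    """Sum of log-likelihood ratios over the k synthetic points nearest to x_star.

    Compares a surrogate density fitted on reference + {x_star} with one fitted on the
    reference alone. A large score means adding x_star explains the local synthetic data
    much better, which is the signature of local overfitting to a training member.
    """
    nearest = np.argsort(np.linalg.norm(synthetic - x_star, axis=1))[:k]
    local = synthetic[nearest]
    with_star = np.vstack([reference, x_star[None, :]])
    log_ratio = kde_logpdf(with_star, local, bandwidth) - kde_logpdf(reference, local, bandwidth)
    return float(np.sum(log_ratio))


rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 2))
member = np.array([2.0, 2.0])
non_member = np.array([-2.0, 2.0])
# Hypothetical generator output that memorised a small neighbourhood of `member`.
synthetic = np.vstack([rng.normal(size=(200, 2)), member + 0.05 * rng.normal(size=(20, 2))])

member_score = local_lr_score(member, synthetic, reference)
non_member_score = local_lr_score(non_member, synthetic, reference)
print(member_score, non_member_score)
```

In this toy setup the member, whose neighbourhood the generator has memorised, receives a clearly higher score than the non-member; thresholding such scores is what turns them into a membership decision.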

2508.12776 2026-05-12 cs.LG cs.AI stat.ML

Randomized PCA Forest for Unsupervised Outlier Detection

Muhammad Rajabinasab, Farhad Pakdaman, Moncef Gabbouj, Peter Schneider-Kamp, Arthur Zimek

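AI summary: This paper proposes an unsupervised outlier detection method based on Randomized Principal Component Analysis (RPCA), computing outlier scores from the intrinsic properties of an RPCA Forest to achieve efficient outlier detection. The method outperforms classical and state-of-the-art methods on multiple datasets while offering good robustness and computational efficiency, making it well suited to outlier detection in unsupervised settings.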

Details
English abstract

We propose a novel unsupervised outlier detection method based on Randomized Principal Component Analysis (PCA). Motivated by the performance of the Randomized PCA (RPCA) Forest in approximate K-Nearest Neighbor (KNN) search, we use the RPCA Forest for unsupervised outlier detection by deriving an outlier score from its intrinsic properties. Experimental results show that the proposed approach outperforms classical and state-of-the-art methods on several datasets and performs competitively on the rest. An extensive analysis demonstrates the method's robustness and computational efficiency, making it a strong choice for unsupervised outlier detection.

2507.20051 2026-05-12 cs.LG cs.CL cs.DC

$K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

Weicong Chen, Vikash Singh, Zahra Rahmani, Debargha Ganguly, Mohsen Hariri, Vipin Chaudhary

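AI summary: This paper proposes $K^4$, an online log anomaly detection framework that addresses the shortcomings of existing methods: dependence on error-prone parsing, slow detection, and unrealistic evaluation protocols. $K^4$ is unsupervised and uses efficient k-nearest-neighbor statistics to transform arbitrary log embeddings into four-dimensional descriptors (precision, recall, density, coverage), enabling fast and accurate anomaly detection without retraining. The method performs strongly under a more realistic online evaluation, achieving state-of-the-art performance with detection far faster than existing methods.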

Details
Journal ref
2025 IEEE 32nd International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 96-107, 2025
English abstract

Existing Log Anomaly Detection (LogAD) methods are often slow, dependent on error-prone parsing, and use unrealistic evaluation protocols. We introduce $K^4$, an unsupervised and parser-independent framework for high-performance online detection. $K^4$ transforms arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage) using efficient k-nearest neighbor (k-NN) statistics. These descriptors enable lightweight detectors to accurately score anomalies without retraining. Using a more realistic online evaluation protocol, $K^4$ sets a new state-of-the-art (AUROC: 0.995-0.999), outperforming baselines by large margins while being orders of magnitude faster, with training under 4 seconds and inference as low as 4 μs.
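The abstract does not define the four descriptors per sample. One plausible reading, adapted from the k-NN precision/recall/density/coverage metrics used to evaluate generative models, is sketched below with NumPy; the per-sample definitions, the function names, and `k = 5` are all assumptions, and $K^4$'s actual formulas may differ.

```python
import numpy as np


def knn_radii(ref: np.ndarray, k: int) -> np.ndarray:
    """Distance from each reference point to its k-th nearest other reference point."""
    dist = np.linalg.norm(ref[:, None, :] - ref[None, :, :], axis=-1)
    # Column 0 of each sorted row is the zero self-distance, so index k is the k-th neighbour.
    return np.sort(dist, axis=1)[:, k]


def k4_style_descriptors(queries: np.ndarray, ref: np.ndarray, k: int = 5) -> np.ndarray:
    """Hypothetical per-sample (precision, recall, density, coverage) descriptors."""
    radii = knn_radii(ref, k)
    dist = np.linalg.norm(queries[:, None, :] - ref[None, :, :], axis=-1)  # (m, n)
    inside = dist <= radii[None, :]                # is the query inside reference ball j?
    precision = inside.any(axis=1).astype(float)   # inside the reference manifold at all
    density = inside.sum(axis=1) / k               # number of covering balls, scaled by 1/k
    nearest = np.argsort(dist, axis=1)[:, :k]      # the query's k nearest reference points
    # Fraction of those neighbours whose own k-NN ball reaches back to the query.
    recall = np.take_along_axis(inside, nearest, axis=1).mean(axis=1)
    # Does the single nearest reference point's ball contain the query?
    coverage = inside[np.arange(len(queries)), nearest[:, 0]].astype(float)
    return np.stack([precision, recall, density, coverage], axis=1)


rng = np.random.default_rng(0)
normal_logs = rng.normal(size=(300, 8))            # stand-in for embeddings of normal logs
queries = np.vstack([np.zeros((1, 8)), np.full((1, 8), 10.0)])
desc = k4_style_descriptors(queries, normal_logs)
print(desc)  # second row (the far-away query) is all zeros
```

A query far from all normal-log embeddings gets all-zero descriptors, which a lightweight downstream detector can then flag.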

2507.11185 2026-05-12 cs.LG cs.AI

Explainable Machine Learning Framework for Cardiovascular Disease Diagnosis and Prognosis

Md. Emon Akter Sourov, Md. Sabbir Hossen, Pabon Shaha, Md. Moradul Siddique, Yadab Sutradhar, Md Sadiq Iqbal

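AI summary: This study proposes an explainable machine learning framework for the diagnosis and prognosis of cardiovascular disease, aiming to improve diagnostic accuracy and reliability. It combines classification methods for detecting heart disease with regression methods for predicting associated risks, uses SMOTE to address data imbalance, and runs experiments on the Heart Disease dataset. Results show that Random Forest performs best on the classification task and linear regression achieves a high goodness of fit on the prediction task, while explainable AI methods make the model results easier to interpret, supporting timely clinical intervention.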

Comments This paper has been published at the IEEE SCSE 2026. The final version is available in the IEEE Xplore Digital Library. 2026 IEEE International Research Conference on Smart Computing and Systems Engineering, 2026

Details
English abstract

Heart disease continues to pose a critical worldwide health issue, particularly in areas with insufficient access to healthcare infrastructure and diagnostic systems. Conventional diagnostic approaches often fall short in accurately detecting and managing heart disease risks, resulting in unfavorable outcomes. Machine learning offers a powerful means to improve the precision and reliability of cardiovascular disease prognosis and diagnosis. In this research, we introduce a unified approach that combines classification techniques for detecting heart disease with regression techniques for forecasting associated risks. The analysis used the Heart Disease dataset, which contains 1,035 instances. To mitigate class imbalance, SMOTE was applied, producing 100,000 additional synthetic samples. Evaluation metrics including F1-score, recall, precision, accuracy, MAE, RMSE, MSE, and R² were used to evaluate model performance. Among the classification algorithms, Random Forest delivered the best results, attaining an accuracy of 0.972 on real data and 0.976 on synthetic data. For prediction modeling, linear regression produced the best R² values, 0.992 on synthetic samples and 0.984 on real samples, along with the lowest error values. Furthermore, Explainable AI methods were used to improve the interpretability of the model outcomes. This paper emphasizes the transformative capabilities of machine learning for diagnosing cardiovascular disease and estimating risk levels, thereby supporting timely interventions and improving clinical care.
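SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its k nearest minority-class neighbours. A minimal NumPy sketch of that interpolation step follows; libraries such as imbalanced-learn provide production implementations, the abstract does not say which implementation was used, and the function name and defaults here are illustrative.

```python
import numpy as np


def smote(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic samples by SMOTE-style linear interpolation."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    # Skip column 0 of each sorted row (each point is its own nearest neighbour).
    neighbours = np.argsort(dist, axis=1)[:, 1 : k + 1]
    base = rng.integers(0, len(minority), size=n_new)                  # seed samples
    partner = neighbours[base, rng.integers(0, neighbours.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))                                       # uniform in [0, 1)
    return minority[base] + gap * (minority[partner] - minority[base])


minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new = smote(minority, n_new=10, k=3)
print(new.shape)  # (10, 2)
```

Because every new point lies on a segment between two real minority samples, the synthetic data stay inside the region spanned by the minority class rather than being arbitrary noise.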

2506.16234 2026-05-12 cs.LG

Sequential Causal Discovery with Noisy Language Model Priors

Prakhar Verma, David Arbour, Sunav Choudhary, Harshita Chopra, Arno Solin, Atanu R. Sinha

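AI summary: This study addresses causal discovery from observational data under realistic challenges: data arriving in batches, sampling bias, and scarce expert knowledge. It proposes a hybrid framework that adaptively integrates sequential batch data with noisy expert knowledge supplied by language models, accounting for biases introduced by both the data and the models, and thereby improves the accuracy of causal structure discovery. The method shifts the representation from directed acyclic graphs to partial ancestral graphs to handle uncertainty and uses a sequential optimization strategy to improve efficiency; experiments show that it outperforms existing methods in structural accuracy and parameter estimation.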

Comments 32 pages, Transactions on Machine Learning Research - TMLR (04/2026)

Details
English abstract

Causal discovery from observational data typically assumes access to complete data and the availability of perfect domain experts. In practice, data often arrive in batches and are subject to sampling bias, and expert knowledge is scarce. Language Models (LMs) offer a surrogate for expert knowledge but suffer from hallucinations, inconsistencies, and bias. We present a hybrid framework that bridges these gaps by adaptively integrating sequential batch data with noisy, LM-derived expert knowledge while accounting for both data-induced and LM-induced biases. We propose a representation shift from Directed Acyclic Graphs (DAGs) to Partial Ancestral Graphs (PAGs), which accommodates ambiguities within a coherent framework and allows global LM knowledge to be grounded in local observational data. To guide LM interactions, we use a sequential optimization scheme that adaptively queries the most informative edges. Across varied datasets and LMs, our method outperforms prior work in structural accuracy and extends to parameter estimation, showing robustness to LM noise.

2502.15075 2026-05-12 cs.LG

Quantize What Counts: More for Keys, Less for Values

Mohsen Hariri, Alan Luo, Weicong Chen, Shaochen Zhong, Tianyi Zhang, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary

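AI summary: Large language model (LLM) inference faces a memory bottleneck dominated by the attention key-value (KV) cache. This paper proposes two theorems that ground mixed-precision KV quantization in the intrinsic geometry of Transformer models: key matrices carry higher information density than value matrices, and, under a fixed memory budget, allocating higher precision to keys reduces quantization error and preserves model accuracy. Experiments show that key-favored quantization strategies (e.g., 4-bit keys and 2-bit values) maintain high accuracy while substantially saving memory.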

Details
Journal ref
The 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026
English abstract

Large Language Models (LLMs) suffer from inference-time memory bottlenecks dominated by the attention Key-Value (KV) cache, which scales with model size and context length. While KV-cache quantization alleviates this cost, bit allocation between keys and values is often tuned heuristically, lacking theoretical grounding and generalizability. This paper proposes two theorems that anchor mixed-precision KV quantization in the intrinsic geometry of Transformer models. First, key projections systematically have larger spectral and Frobenius norms than value matrices, implying higher information density along the key path. Second, for any given memory budget, prioritizing precision for keys over values strictly reduces quantization error and better preserves accuracy. Empirical evaluations across various prominent LLMs and benchmarks show that key-favored allocations (e.g., 4-bit keys, 2-bit values) retain up to 98.3% accuracy compared to uniform allocations (e.g., 4-bit for both), while conserving memory. These results transform bit allocation from ad hoc tuning into a theoretically grounded, geometry-driven design principle for efficient LLM inference. Source code is available at https://github.com/mohsenhariri/spectral-kv.
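The budget argument can be illustrated with a toy experiment. This sketch assumes per-tensor min-max uniform quantization and random key/value matrices in which keys have four times the scale of values; it is not the paper's quantizer or proof, only an illustration of why, at a fixed budget of 6 bits per key-value pair, spending more bits on the larger-norm tensor lowers the total error.

```python
import numpy as np


def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Per-tensor asymmetric uniform quantization followed by dequantization."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / scale) * scale + lo


rng = np.random.default_rng(0)
keys = 4.0 * rng.normal(size=(128, 64))    # assumption: keys have the larger norm
values = 1.0 * rng.normal(size=(128, 64))


def total_error(key_bits: int, value_bits: int) -> float:
    """Squared Frobenius quantization error summed over keys and values."""
    return float(np.sum((keys - quantize(keys, key_bits)) ** 2)
                 + np.sum((values - quantize(values, value_bits)) ** 2))


key_favoured = total_error(4, 2)    # 4-bit keys, 2-bit values
value_favoured = total_error(2, 4)  # 2-bit keys, 4-bit values
print(key_favoured < value_favoured)  # True
```

Uniform quantization error grows with the square of the tensor's range, so coarse quantization hurts the wide-range tensor most; here that tensor is the key cache by construction.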

2502.01237 2026-05-12 cs.LG

The Differences Between Direct Alignment Algorithms are a Blur

Alexey Gorbatovski, Boris Shaposhnikov, Viacheslav Sinii, Alexey Malakhov, Daniil Gavrilov

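AI summary: This paper systematically compares direct alignment algorithms (DAAs) and finds that the key performance factor is the ranking objective (pairwise vs. pointwise) rather than the previously emphasized scalar score (likelihood vs. odds ratio). Using a unified training framework and an added β parameter, the study shows that the ranking objective is the main determinant of alignment quality, while the specific scalar score plays a secondary role. Experiments on instruction-following and mathematical reasoning tasks confirm this conclusion, which the authors relate to how the objectives interact with prompt-specific biases, underscoring the need for careful evaluation in DAA research.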

Details
English abstract

Direct Alignment Algorithms (DAAs) simplify LLM alignment by directly optimizing policies, bypassing reward modeling and RL. While DAAs differ in their use of SFT (one-stage vs. two-stage) and the scalar score they optimize (likelihood vs. odds ratios), the key drivers of their performance remain underexplored. We present a systematic comparison and analyze a previously overlooked axis: the ranking objective (pairwise vs. pointwise). To isolate this factor, we propose a unified training framework across DAAs by (i) converting one-stage methods (ORPO, ASFT) into a two-stage pipeline with an explicit SFT phase and (ii) introducing a $β$ parameter that places all methods in the same hyperparameter space and improves the quality of odds-ratio DAAs (ORPO, ASFT). Under this setup, the ranking objective emerges as the primary determinant of alignment quality, whereas the particular scalar score (policy-reference ratio vs. odds ratio) is secondary. We corroborate this on instruction-following tasks and further confirm it on math-reasoning benchmarks across model scales. Evidence suggests that this stems from how these objectives interact with prompt-specific biases, supported both by strictly controlled experiments and by observations on real data. Our findings underscore the need for nuanced evaluations in DAA research to avoid oversimplified claims of superiority.
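To make the pairwise/pointwise distinction concrete, here is a sketch of two losses over the same scalar score, assuming the score is a policy-reference log-ratio scaled by β. The pairwise form is the DPO-style logistic loss on the score margin; the pointwise form (pushing each response's score up or down independently) is one illustrative choice, not necessarily the exact objective of any particular DAA.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def pairwise_loss(score_chosen: float, score_rejected: float, beta: float) -> float:
    """Ranking objective: only the margin between chosen and rejected matters."""
    return -log_sigmoid(beta * (score_chosen - score_rejected))


def pointwise_loss(score_chosen: float, score_rejected: float, beta: float) -> float:
    """Each response is pushed up or down based on its own absolute score."""
    return -log_sigmoid(beta * score_chosen) - log_sigmoid(-beta * score_rejected)


# Shift both scores by the same amount, as a prompt-specific bias might.
for shift in (0.0, 2.0):
    print(pairwise_loss(1.0 + shift, -1.0 + shift, beta=1.0),
          pointwise_loss(1.0 + shift, -1.0 + shift, beta=1.0))
```

Under a shared shift of both scores the pairwise loss is unchanged while the pointwise loss moves, which is one simple way the two objectives can react differently to prompt-level biases.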

2402.02286 2026-05-12 cs.CV cs.AI cs.LG

Attention-Mamba: A Mamba-Enhanced Multi-Scale Parallel Inference Network for Medical Image Segmentation

Yanhua Zhang, Ke Zhang, Jingyu Wang, Gabriella Balestra, Samanta Rosati, Yulin Wu, Wuwei Wang, Valentina Giannini

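AI summary: This paper proposes Attention-Mamba, a new medical image segmentation network designed to overcome the shortcomings of traditional U-shaped architectures and Transformer models in multi-scale feature processing and computational efficiency. The network builds multi-scale parallel branches and combines them with the Mamba state space model to achieve efficient long-range dependency modeling and multi-scale feature fusion, and it introduces a recursive alignment module to enhance the spatial detail of low-resolution features. Experiments show that the model outperforms existing 2D CNN, Transformer, and Mamba-based networks on multiple medical imaging datasets while remaining computationally efficient.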

Comments 14 pages, 9 figures and 8 tables

Details
English abstract

U-shaped architectures have long dominated the field of medical image segmentation, while Transformers are widely employed for modeling long-range dependencies. The former typically handles scale variations implicitly by aggregating multi-level features, whereas the efficiency of the latter is constrained by its quadratic computational and memory complexity. In this work, we propose an effective alternative to traditional U-shaped architectures by constructing parallel branches at different levels to obtain multi-scale features and corresponding predictions. Furthermore, we enhance our network by integrating Mamba, a state space model that captures long-range dependencies with linear complexity. First, a dual-path architecture with lateral connections aggregates high-level semantic information and low-level spatial details at each branch. Then, we introduce a Recursive Alignment Module (RAM) that restores spatial details in low-resolution features through stepwise alignment, optimizing them for subsequent global feature learning and multi-scale fusion. We further build parallel Mamba branches upon aligned features to establish hierarchical global representations. Finally, we propose a Mamba-based attention mechanism for adaptive multi-scale prediction fusion; this mechanism uses Mamba to enhance information exchange across scales along both the channel and spatial dimensions. Experiments across three imaging modalities (MRI, CT, and dermoscopy) underscore the superior generalization of the proposed network. Compared to state-of-the-art 2D CNN, Transformer, and Mamba-based networks, our model achieves the highest segmentation performance on the Synapse, ACDC, ISIC-2018, and PH2 datasets while maintaining high efficiency, with the second-smallest parameter count (14.05 M) and moderate computational complexity (8.94 GFLOPs).
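Mamba's linear-complexity claim rests on a recurrent state-space scan. The sketch below is a diagonal, time-invariant linear SSM over a 1-D sequence (image features would first be flattened into a token sequence); Mamba additionally makes its parameters input-dependent (the "selective" scan) and uses a hardware-efficient parallel implementation, both omitted here. `ssm_scan` and its argument shapes are assumptions for illustration.

```python
import numpy as np


def ssm_scan(x: np.ndarray, a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Run h_t = a * h_{t-1} + b * x_t, y_t = <c, h_t> over a 1-D sequence.

    x: (T,) input sequence; a, b, c: (N,) diagonal state parameters.
    Cost is O(T * N): linear in sequence length, unlike O(T^2) self-attention.
    """
    h = np.zeros_like(a)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # elementwise (diagonal) state update
        y[t] = c @ h          # linear readout
    return y


# Impulse response of a single decaying state: each step halves the memory.
y = ssm_scan(np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.5]), np.array([1.0]), np.array([1.0]))
print(y.tolist())  # [1.0, 0.5, 0.25, 0.125]
```

The state `h` carries information from all previous positions forward in a single pass, which is how an SSM models long-range dependencies without forming a T-by-T attention matrix.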

2605.10931 2026-05-12 math.AP cs.LG math.DS

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

Albert Alcalde, Leon Bungert, Konstantin Riedl, Tim Roith

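AI summary: This paper studies the evolution of the token distribution in deep encoder-only Transformer models in the low-temperature limit, described by a mean-field continuity equation. Drawing on ideas from the convergence analysis of multi-particle systems, the paper proves that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable over moderate time scales. It also characterizes how the Wasserstein distance depends on the temperature parameter and inference time, verifies the theory with numerical experiments, and shows that at finite temperature and over long times the system enters a different phase dominated by the spectrum of the value matrix.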

Comments 30 pages, 10 figures

Details
English abstract

Transformers with self-attention modules as their core components have become an integral architecture in modern large language and foundation models. In this paper, we study the evolution of tokens in deep encoder-only transformers at inference time, which is described in the large-token limit by a mean-field continuity equation. Leveraging ideas from the convergence analysis of interacting multi-particle systems, with particles corresponding to tokens, we prove that the token distribution rapidly concentrates onto the push-forward of the initial distribution under a projection map induced by the key, query, and value matrices, and remains metastable for moderate times. Specifically, we show that the Wasserstein distance between the two distributions scales like $\sqrt{\log(\beta+1)/\beta}\,\exp(Ct)+\exp(-ct)$ in terms of the temperature parameter $\beta^{-1}\to 0$ and inference time $t\geq 0$. For the proof, we establish Lyapunov-type estimates for the zero-temperature equation, identify its limit as $t\to\infty$, and employ a stability estimate in Wasserstein space together with a quantitative Laplace principle to couple the two equations. Our result implies that for time scales of order $\log\beta$ the token distribution concentrates at the identified limiting distribution. Numerical experiments confirm this and, beyond that, complement our theory by showing that for finite $\beta$ and large $t$ the dynamics enter a different terminal phase, dominated by the spectrum of the value matrix.
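The finite-particle system behind the mean-field equation can be simulated directly. This sketch assumes the commonly studied self-attention dynamics $\dot x_i = \sum_j \mathrm{softmax}_j(\beta \langle Q x_i, K x_j \rangle)\, V x_j$ with forward-Euler time stepping; the paper's exact normalization may differ, and `beta`, `dt`, and the identity matrices are illustrative choices.

```python
import numpy as np


def attention_step(x: np.ndarray, q: np.ndarray, k: np.ndarray, v: np.ndarray,
                   beta: float, dt: float) -> np.ndarray:
    """One forward-Euler step of the interacting-token (self-attention) dynamics."""
    logits = beta * (x @ q.T) @ (x @ k.T).T        # (n, n): beta * <Q x_i, K x_j>
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)  # row i: softmax over j
    return x + dt * weights @ (x @ v.T)            # x_i += dt * sum_j w_ij V x_j


rng = np.random.default_rng(0)
n, d = 64, 2
tokens = rng.normal(size=(n, d))   # initial token distribution (empirical measure)
q = k = v = np.eye(d)
for _ in range(50):
    tokens = attention_step(tokens, q, k, v, beta=10.0, dt=0.01)
print(tokens.shape)  # (64, 2)
```

As `beta` grows the softmax approaches a hard argmax, which is the low-temperature regime the abstract analyses; plotting `tokens` over time is a simple way to watch the concentration described there.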

2605.10910 2026-05-12 quant-ph cs.LG

Equivariant Reinforcement Learning for Clifford Quantum Circuit Synthesis

Richie Yeung, Aleks Kissinger, Rob Cornish

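AI summary: This paper studies the synthesis of Clifford quantum circuits on devices with all-to-all connectivity, framing it as a reinforcement learning task in which an agent learns a sequence of elementary Clifford gates that reduces a given symplectic matrix representation to the identity. It proposes a new neural network architecture that is equivariant to qubit relabelings and applies to quantum systems of different sizes without reparameterizing the network. Experiments show that the method is near-optimal on six-qubit circuits and scales to unseen Clifford tableaus with up to thirty qubits, using fewer two-qubit gates than existing synthesis methods.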

Details
English abstract

We consider the problem of synthesizing Clifford quantum circuits for devices with all-to-all qubit connectivity. We approach this task as a reinforcement learning problem in which an agent learns to discover a sequence of elementary Clifford gates that reduces a given symplectic matrix representation of a Clifford circuit to the identity. This formulation permits a simple learning curriculum based on random walks from the identity. We introduce a novel neural network architecture that is equivariant to qubit relabelings of the symplectic matrix representation, and which is size-agnostic, allowing a single learned policy to be applied across different qubit counts without circuit splicing or network reparameterization. On six-qubit Clifford circuits, the largest regime for which optimal references are available, our agent finds circuits within one two-qubit gate of optimality in milliseconds per instance, and finds optimal circuits in 99.2% of instances within seconds per instance. After continued training on ten-qubit instances, the agent scales to unseen Clifford tableaus with up to thirty qubits, including targets generated from circuits with over a thousand Clifford gates, where it achieves lower average two-qubit gate counts than Qiskit's Aaronson-Gottesman and greedy Clifford synthesizers.
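For readers new to the symplectic picture: ignoring Pauli phases, an $n$-qubit Clifford is a $2n \times 2n$ binary matrix $M$ acting on Pauli vectors $(x \mid z)$ over GF(2) and satisfying $M^\top \Omega M = \Omega$ with $\Omega = \begin{pmatrix} 0 & I \\ I & 0 \end{pmatrix}$. A minimal NumPy sketch of two elementary gates in this representation, under the $(x_1, \dots, x_n \mid z_1, \dots, z_n)$ ordering assumed here; the helper names are hypothetical and the paper's conventions may differ.

```python
import numpy as np


def omega(n: int) -> np.ndarray:
    """Symplectic form [[0, I], [I, 0]] over GF(2)."""
    zero, eye = np.zeros((n, n), dtype=int), np.eye(n, dtype=int)
    return np.block([[zero, eye], [eye, zero]])


def is_symplectic(m: np.ndarray) -> bool:
    n = m.shape[0] // 2
    return np.array_equal((m.T @ omega(n) @ m) % 2, omega(n))


def cnot(n: int, c: int, t: int) -> np.ndarray:
    """CNOT c->t: X_c -> X_c X_t and Z_t -> Z_c Z_t."""
    m = np.eye(2 * n, dtype=int)
    m[t, c] = 1          # x_t += x_c
    m[n + c, n + t] = 1  # z_c += z_t
    return m


def hadamard(n: int, q: int) -> np.ndarray:
    """H on qubit q swaps its X and Z components."""
    m = np.eye(2 * n, dtype=int)
    m[[q, n + q]] = m[[n + q, q]]
    return m


n = 3
target = (cnot(n, 0, 1) @ hadamard(n, 2)) % 2
# Both gates are self-inverse over GF(2), so undoing them in reverse order gives the identity.
reduced = (hadamard(n, 2) @ cnot(n, 0, 1) @ target) % 2
print(is_symplectic(target), np.array_equal(reduced, np.eye(2 * n, dtype=int)))  # True True
```

Reducing a target to the identity by multiplying in gate matrices is the episode structure the agent learns; in this toy case the two gates are simply undone in reverse order, whereas the learned policy must discover such a sequence for arbitrary tableaus.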

2605.10848 2026-05-12 cs.IR cs.AI cs.CL

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Tz-Huan Hsu, Jheng-Hong Yang, Jimmy Lin

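AI summary: This paper examines whether a lexical retriever is sufficient to support deep research by large language models in an agentic loop. Pairing BM25 with frontier LLMs that have stronger reasoning and tool-use abilities, the study introduces Pi-Serini, a search agent equipped with tools for retrieving, browsing, and reading documents. Experiments show that a well-configured lexical retriever paired with a stronger LLM can effectively support deep research, outperforming existing search agents that use dense retrievers.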

Comments 15 pages, 4 figures

Details
English abstract

Does a lexical retriever suffice as large language models (LLMs) become more capable in an agentic loop? This question naturally arises when building deep research systems. We revisit it by pairing BM25 with frontier LLMs that have better reasoning and tool-use abilities. To support researchers asking the same question, we introduce Pi-Serini, a search agent equipped with three tools for retrieving, browsing, and reading documents. Our results show that, on BrowseComp-Plus, a well-configured lexical retriever with sufficient retrieval depth can support effective deep research when paired with more capable LLMs. Specifically, Pi-Serini with gpt-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that use dense retrievers. Controlled ablations further show that BM25 tuning improves answer accuracy by 18.0% and surfaced evidence recall by 11.1% over the default BM25 setting, while increasing retrieval depth further improves surfaced evidence recall by 25.3% over the shallow-retrieval setting. Source code is available at https://github.com/justram/pi-serini.
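BM25 tuning typically means adjusting its two free parameters: the term-frequency saturation `k1` and the length normalization `b` (the tuned values are not given in the abstract). A self-contained sketch of Okapi BM25 scoring with the Lucene-style IDF shows where they enter; the defaults `k1 = 0.9`, `b = 0.4` follow common Anserini/Pyserini defaults, and the toy corpus is illustrative.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 0.9, b: float = 0.4) -> list[float]:
    """Okapi BM25 score of each tokenised document for a tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))   # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps the benefit of repeated terms; b penalises longer-than-average docs.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["lexical", "retrieval", "bm25"], ["dense", "retrieval"], ["agentic", "search", "bm25", "bm25"]]
scores = bm25_scores(["bm25", "search"], docs)
print(max(range(len(docs)), key=scores.__getitem__))  # 2
```

Raising `k1` lets repeated query terms keep adding score for longer, and raising `b` penalises long documents more strongly; sweeping these two values is the usual form of BM25 tuning.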

2605.10808 2026-05-12 cs.CR cs.AI

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

Saba Pourhanifeh, AbdulAziz AbdulGhaffar, Ashraf Matrawy

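AI summary: This paper studies the effectiveness of domain-adapted language models for threat modelling, analyzing how domain adaptation, model scale, decoding strategy, and prompting technique affect STRIDE threat classification. A systematic evaluation of 52 configurations across 8 language models in a 5G security setting finds that domain-adapted models do not always outperform general-purpose models and that decoding strategy significantly affects output validity. The study concludes that current large language models still have fundamental limitations on structured threat modelling tasks, and that improving performance will require more task-specific reasoning and stronger grounding in security concepts.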

Details
English abstract

Large Language Models (LLMs) are increasingly explored for cybersecurity applications such as vulnerability detection. In the domain of threat modelling, prior work has primarily evaluated a number of general-purpose LLMs under limited prompting settings. In this study, we extend research on structured threat modelling by systematically comparing domain-adapted language models of different sizes against their general-purpose counterparts. We use both LLMs and Small Language Models (SLMs) that were domain-adapted to telecommunications and cybersecurity. For structured threat modelling, we selected the widely used STRIDE approach, with 5G security as the application area. We present a comprehensive empirical evaluation using 52 different configurations (across 8 different language models) to analyze the impact of 1) domain adaptation, 2) model scale, 3) decoding strategies (greedy vs. stochastic sampling), and 4) prompting technique on STRIDE threat classification. Our results show that domain-adapted models do not consistently outperform their general-purpose counterparts, and that decoding strategies significantly affect model behavior and output validity. They also show that while larger models generally achieve higher performance, these gains are neither consistent nor sufficient for reliable threat modelling. These findings highlight fundamental limitations of current LLMs for structured threat modelling tasks and suggest that improvements require more than additional training data or model scaling, motivating the need to incorporate more task-specific reasoning and stronger grounding in security concepts. We also present insights on the invalid outputs encountered and offer prompting suggestions tailored specifically to STRIDE threat modelling.

2605.10795 2026-05-12 stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG

Factual recall in linear associative memories: sharp asymptotics and mechanistic insights

Alessio Giorlandino, Sebastian Goldt, Antoine Maillard

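AI summary: This paper studies the fundamental limits of linear associative memories in storing and retrieving input-output associations, characterizing the sharp asymptotic behavior of their storage capacity and the underlying mechanism. The authors introduce a decoupled model, show that it is equivalent to the original model in storage capacity, weight spectrum, and storage mechanism, and use tools from statistical physics to derive how the maximum number of stored associations relates to the input dimension. The study also reveals how the optimal solution improves on the conventional Hebbian learning rule, offering new insight into memory mechanisms in neural networks.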

Details
English abstract

Large language models demonstrate remarkable ability in factual recall, yet the fundamental limits of storing and retrieving input-output associations with neural networks remain unclear. We study these limits in a minimal setting: a linear associative memory that maps $p$ input embeddings in $\mathbb{R}^d$ to their corresponding $d$-dimensional targets via a single layer, requiring each mapped input to be well separated from all other targets. Unlike in supervised classification, this strict separation induces $p$ constraints per association and produces strong correlations between constraints that make a direct characterisation of the storage capacity difficult. We provide a precise characterisation of this capacity as follows. We first introduce a decoupled model in which each input has its own independent set of competing outputs, and provide numerical and analytical evidence that this decoupled model is equivalent to the original model in terms of storage capacity, spectra of the learnt weights, and storage mechanism. Using tools from statistical physics, we show that the decoupled model can store up to $p_c$ associations, where $p_c \log p_c / d^2 = 1/2$, and generalise the computation of $p_c$ to linear two-layer architectures. Our analysis also gives mechanistic insight into how the optimal solution improves over a naïve Hebbian learning rule: rather than boosting input-output alignments with broad fluctuations, the optimal solution raises the correct scores just above the extreme-value threshold set by the competing outputs. These findings give a sharp statistical-physics characterisation of factual storage in linear networks and provide a baseline for understanding the memory capacity of more realistic neural architectures.
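Two small numerical companions to the abstract, under stated assumptions. `capacity` solves $p_c \log p_c = d^2/2$ by bisection (natural logarithm assumed). `hebbian_recall` builds the naïve Hebbian memory $W = \sum_\mu y_\mu x_\mu^\top$ from random Gaussian embeddings and checks how often each input's own target gets the highest score $\langle W x_\mu, y_\nu \rangle$; it is the baseline the optimal solution improves on, not the paper's optimal solver, and both function names are hypothetical.

```python
import math
import numpy as np


def capacity(d: int) -> float:
    """Solve p * log(p) = d**2 / 2 for p by bisection (natural log assumed, d >= 2)."""
    target = d**2 / 2
    lo, hi = 1.0, float(d**2)   # p log p is increasing on [1, d^2] and brackets the target
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid * math.log(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def hebbian_recall(p: int, d: int, seed: int = 0) -> float:
    """Fraction of inputs whose own target gets the top score under W = sum_mu y_mu x_mu^T."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(p, d)) / math.sqrt(d)   # input embeddings
    y = rng.normal(size=(p, d)) / math.sqrt(d)   # target embeddings
    w = y.T @ x                                  # Hebbian outer-product rule
    scores = (x @ w.T) @ y.T                     # scores[mu, nu] = <W x_mu, y_nu>
    return float(np.mean(scores.argmax(axis=1) == np.arange(p)))


d = 100
p_c = capacity(d)
print(p_c, hebbian_recall(50, d), hebbian_recall(2000, d))
```

In this toy setup recall stays near perfect when $p$ is small relative to $d$ and drops sharply once crosstalk from many stored pairs swamps the correct score.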

2605.10794 2026-05-12 cs.CR cs.AI

Can You Keep a Secret? Involuntary Information Leakage in Language Model Writing

Ari Holtzman, Peter West

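AI summary: This paper investigates whether large language models unintentionally leak information they have been asked to keep secret while writing. Each model is given a secret word with instructions not to reveal it and is then asked to write a story, and a second model tries to identify the secret word from the story. Although the secret word never appears directly in any output, all frontier models tested leak it through thematic cues such as topic, imagery, and setting, with identification accuracy well above chance. The leakage is also readable across models and grows with model size, but disappears in short-form writing such as jokes.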

Details
English abstract

Language models are deployed in settings that require compartmentalization: system prompts should not be disclosed, chain-of-thought reasoning is hidden from users, and sensitive data passes through shared contexts. We test whether models can keep prompted information out of their writing. We give each model a secret word with instructions not to reveal it, then ask it to write a story. A second model tries to identify the secret from the story in a binary discrimination test. The secret word never appears literally in any output, but all five frontier models we test leak it thematically (through topic choice, imagery, and setting) at rates significantly different from chance, up to 79%. When told to actively hide the secret, models write away from it, and this avoidance is itself detectable. The leakage is cross-model readable, scales sharply with model size within two model families, and disappears entirely for short-form writing like jokes. Giving the model a decoy concept to "focus on instead" partially redirects the leakage from the real secret to the decoy. Attending to a secret appears to open up an information channel that frontier LLMs cannot close, even when instructed to.
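To gauge what "significantly different from chance" means in a binary discrimination test, an exact two-sided binomial test against $p = 0.5$ can be computed with the standard library. The trial counts below are hypothetical; the abstract reports rates (up to 79%), not sample sizes.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int) -> float:
    """Exact two-sided binomial test against chance (p = 0.5)."""
    k = max(successes, trials - successes)
    # Under p = 0.5 the distribution is symmetric, so double one tail.
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2**trials
    return min(1.0, 2 * tail)


print(binomial_two_sided_p(79, 100))  # far below 0.001
print(binomial_two_sided_p(55, 100))  # well above 0.05, not significant
```

Note that "different from chance" is two-sided: a discriminator that is reliably wrong (as when models write away from the secret) is also leaking information.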

2605.10779 2026-05-12 cs.CR cs.CL

LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments

Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang, Yiran Zhao, Ruyi Chen, Lu Zhou, Xiaogang Xu, Jiafei Wu, Liming Fang, Zhe Liu

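AI summary: As autonomous agents based on large language models (LLMs) are widely deployed in real operating system environments, behavior jailbreaks have emerged as a new safety risk: an attacker induces an agent to perform dangerous OS operations with irreversible consequences. This paper presents LITMUS, a benchmark for evaluating the behavioral safety of LLM agents in real OS environments, which addresses the shortcomings of existing benchmarks through a semantic-physical dual verification mechanism and OS-level state rollback. LITMUS contains 819 high-risk test cases covering three attack paradigms and reveals significant weaknesses in current agents' safety awareness, execution hallucination, and susceptibility to attacks.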

Details
English abstract

The rapid proliferation of LLM-based autonomous agents in real operating system environments introduces a new category of safety risk beyond content safety: behavior jailbreak, where an adversary induces an agent to execute dangerous OS-level operations with irreversible consequences. Existing benchmarks either evaluate safety at the semantic layer alone, missing physical-layer harms, or fail to isolate test cases, letting earlier runs contaminate later ones. We present LITMUS (LLM-agents In-OS Testing for Measuring Unsafe Subversion), a benchmark addressing both gaps via a semantic-physical dual verification mechanism and OS-level state rollback. LITMUS comprises 819 high-risk test cases organized into one harmful seed subset and six attack-extended subsets covering three adversarial paradigms (jailbreak speaking, skill injection, and entity wrapping), plus a fully automated multi-agent evaluation framework judging behavior at both conversational and OS-level physical layers. Evaluation across frontier agents reveals three findings: (1) current agents lack effective safety awareness, with strong models (e.g., Claude Sonnet 4.6) still executing 40.64% of high-risk operations; (2) agents exhibit pervasive Execution Hallucination (EH), verbally refusing a request while the dangerous operation has already completed at the system level, invisible to every prior semantic-only framework; and (3) skill injection and entity wrapping attacks achieve high success rates, exposing pronounced agent vulnerabilities. LITMUS provides the first standardized platform for reproducible, physically grounded behavioral safety evaluation of LLM agents in real OS environments.

2605.10775 2026-05-12 math.OC cs.LG

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

Romain Petit, Clarice Poon, Gabriel Peyré

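AI summary: This paper studies why gradient descent training of wide shallow neural networks converges to global minima, focusing on models with bounded nonlinear activations. By analyzing the instability of non-global minimizers of the training loss, the authors prove that when the initial parameter distribution has full support (e.g., Gaussian) and the number of hidden units or attention heads is large, continuous-time gradient descent can only converge to global minimizers. The results extend existing work to multi-head attention layers and two-layer sigmoid networks with vector outputs, complete the construction of the "escaping active set" for such models, and establish the stability of the training dynamics.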

Details
English abstract

A surprising phenomenon in the training of neural networks is the ability of gradient descent to find global minimizers of the training loss despite its non-convexity. Following earlier works, we investigate this behavior for wide shallow networks. Existing results essentially cover the case of ReLU activations and the case of sigmoid activations with scalar output weights. We study a large class of models that includes multi-head attention layers and two-layer sigmoid networks with vector output weights. Building upon [Chizat and Bach, 2018], we prove that all non-global minimizers of the training loss are unstable under gradient descent dynamics. Thus, when the initial distribution of the parameters has full support (which includes the popular Gaussian case), and in the limit of many hidden neurons or attention heads, continuous-time gradient descent can only converge to global minimizers. Establishing the instability of non-global minimizers amounts to constructing an "escaping active set"; we complete the proof of [Chizat and Bach, 2018] to construct this set for models with bounded nonlinearities and scalar output weights. We also extend this construction to new cases for models with vector output weights. Finally, we show the well-posedness of the mean-field training dynamics for sub-Gaussian initializations, as well as their stability with respect to discretization.
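A minimal sketch of the mean-field setting: a two-layer sigmoid network $f(x) = \frac{1}{m}\sum_j a_j \sigma(w_j^\top x)$ with Gaussian (full-support) initialization and vector output weights $a_j$, trained by small-step gradient descent as a discrete proxy for the continuous-time flow. Sizes, the step size, and the convention of rescaling per-unit gradients by $m$ are assumptions for illustration; the sketch only shows the training loss decreasing and says nothing about the theorem's guarantees.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def train(m: int = 500, steps: int = 300, lr: float = 0.5, seed: int = 0) -> list[float]:
    """Gradient descent on a mean-field two-layer sigmoid network; returns the loss curve."""
    rng = np.random.default_rng(seed)
    n, d_in, d_out = 40, 3, 2
    x = rng.normal(size=(n, d_in))
    y = rng.normal(size=(n, d_out))   # arbitrary vector-valued targets
    w = rng.normal(size=(m, d_in))    # hidden weights, Gaussian (full-support) init
    a = rng.normal(size=(m, d_out))   # vector output weights
    losses = []
    for _ in range(steps):
        h = sigmoid(x @ w.T)          # (n, m) bounded nonlinearity
        pred = h @ a / m              # mean-field 1/m output scaling
        resid = pred - y
        losses.append(0.5 * float(np.mean(np.sum(resid**2, axis=1))))
        # Per-unit gradients multiplied by m (mean-field time scaling), so the 1/m cancels.
        grad_a = h.T @ resid / n                               # (m, d_out)
        grad_w = ((resid @ a.T) * h * (1 - h)).T @ x / n       # (m, d_in)
        a -= lr * grad_a
        w -= lr * grad_w
    return losses


losses = train()
print(losses[0], losses[-1])
```

Each hidden unit behaves like a particle in parameter space, and the $1/m$ output scaling is what allows the many-neuron limit to be described by a distribution over parameters rather than by individual weights.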

2605.10739 2026-05-12 eess.IV cs.AI cs.CV

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

David F. Ramirez, Tim Overman, Kristen Jaskie, Andreas Spanias

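AI summary: This paper introduces SMART-HC-VQA, a multimodal visual question answering dataset built on Sentinel-2 satellite imagery for analyzing the spatiotemporal evolution of human activity. The dataset converts construction annotations, type labels, temporal-phase labels, and related information into natural-language question-answer pairs, forming a temporally extended automatic target recognition and visual question answering challenge. The study also introduces a multi-image large language model training framework that processes multi-temporal remote sensing imagery and performs semantic reasoning, providing a reproducible foundation for language-guided understanding of remote sensing activity.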

Comments Accepted to 2026 SPIE Defense + Security, Automatic Target Recognition XXXVI

Details
English abstract

We introduce SMART-HC-VQA, a Sentinel-2-based visual question answering dataset derived from the IARPA SMART Heavy Construction dataset, designed for spatiotemporal analysis of human activity. The dataset transforms construction-site annotations, construction-type labels, temporal-phase labels, geographic metadata, and observation relationships into natural language question-answer triplets. This approach redefines the existing dataset as a temporally extended automatic target recognition and visual question answering (VQA) challenge, considering a fixed geospatial site as a target whose attributes and activity states evolve across sparse satellite observations. Currently, SMART-HC-VQA comprises 21,837 accessible Sentinel-2 image chips, 65,511 single-image VQA examples, and approximately 2.3 million two-image temporal comparison examples generated via our novel Image-Pairwise Combinatorial Augmentation. We detail the workflow for retrieving and processing Sentinel-2 imagery, segmenting large satellite tiles into site-centered images, maintaining traceability to SMART-HC annotations, and analyzing the distributions of site size, observation count, temporal coverage, construction type, and phase labels. Additionally, we describe an implemented multi-image MLLM training framework based on LLaVA-NeXT Mistral-7B, adapted to accept multiple dated image inputs and train on metadata-derived VQA examples. This work offers a reproducible foundation for understanding language-guided remote sensing activities, aiming not only to detect change but also to reason about the ongoing processes, their progression, and potential future developments.
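The abstract does not spell out Image-Pairwise Combinatorial Augmentation, but pairing every two dated observations of the same site is a natural reading: a site with $n$ chips yields $n(n-1)/2$ unordered pairs, which is how roughly 21.8k chips can expand into millions of two-image examples. A hypothetical sketch; the field names, question template, phase strings, site ID, and earlier-first ordering are all assumptions.

```python
from datetime import date
from itertools import combinations


def pairwise_examples(site_id: str, observations: list[tuple[date, str]]) -> list[dict]:
    """Build temporal-comparison QA examples from every pair of dated observations."""
    ordered = sorted(observations)  # chronological, so each pair is (earlier, later)
    examples = []
    for (d1, phase1), (d2, phase2) in combinations(ordered, 2):
        examples.append({
            "site": site_id,
            "images": (d1.isoformat(), d2.isoformat()),
            "question": "Has the construction phase changed between the two images?",
            "answer": "yes" if phase1 != phase2 else "no",
        })
    return examples


obs = [(date(2018, 5, 1), "site preparation"), (date(2017, 3, 2), "no activity"),
       (date(2019, 8, 9), "active construction"), (date(2020, 1, 4), "active construction")]
examples = pairwise_examples("site_001", obs)
print(len(examples))  # 6
```

Because the pair count grows quadratically with the number of observations per site, sites with long observation histories dominate the augmented set, which is worth keeping in mind when balancing training data.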

2605.10738 2026-05-12 math.OC cs.MA cs.RO cs.SY eess.SY

Decentralized Contingency MPC based on Safe Sets for Nonlinear Multi-agent Collision Avoidance

Max Studt, Georg Schildbach

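AI summary: This paper studies how to achieve decentralized collision avoidance in nonlinear multi-agent systems without sharing trajectory information. It proposes a safe-set-based contingency model predictive control (MPC) framework in which each agent solves a local optimization problem using only state information, coupling a nominal trajectory with a contingency guarantee to ensure a feasible avoidance maneuver under receding-horizon operation. The method introduces a novel geometric safe-set update mechanism that guarantees recursive feasibility and convergence, and it is validated in a range of dense and sparse scenarios.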

Details
English abstract

Decentralized collision avoidance remains challenging, particularly when agents do not communicate any information related to planned trajectories. Most existing approaches either rely on conservative coordination mechanisms or provide limited guarantees on recursive feasibility and convergence. This paper develops a decentralized contingency MPC framework for multi-agent systems with nonlinear dynamics that achieves collision-free motion under a state-only information pattern. Each agent follows the same consensual rule set, enabling safe decentralized planning without communication. Each agent solves a local optimization problem that couples a nominal trajectory with a contingency certificate ensuring a feasible backup maneuver under receding-horizon operation. A novel geometric and decentralized safe-set update mechanism prevents feasibility loss between consecutive time steps. The resulting scheme guarantees recursive feasibility, including collision avoidance, and establishes a Lyapunov-type convergence result to an admissible safe equilibrium. Simulation results demonstrate performance in both sparse and dense multi-agent environments, including cluttered bottleneck scenarios and under plug-and-play operation.

2605.10721 2026-05-12 physics.soc-ph cs.CL cs.MA

Conformity Generates Collective Misalignment in AI Agents Societies

Giordano De Marzo, Alessandro Bellina, Claudio Castellano, Viola Priesemann, David Garcia

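AI Summary: This study examines collective misalignment caused by conformity in populations of AI agents. By simulating opinion dynamics among multiple large language models, it finds that individually aligned AI agents can become locked into stable misaligned states through conformity effects during group interaction. Using tools from statistical physics, the study builds a quantitative theory that identifies when populations become trapped in long-lived misaligned states and where predictable tipping points lie. The results show that individual-level alignment does not guarantee group-level safety, underscoring the importance of evaluating the behavior of AI populations.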

Abstract

Artificial intelligence safety research focuses on aligning individual language models with human values, yet deployed AI systems increasingly operate as interacting populations where social influence may override individual alignment. Here we show that populations of individually aligned AI agents can be driven into stable misaligned states through conformity dynamics. Simulating opinion dynamics across nine large language models and one hundred opinion pairs, we find that each agent's behavior is governed by two competing forces: a tendency to follow the majority and an intrinsic bias toward specific positions. Using tools from statistical physics, we derive a quantitative theory that predicts when populations become trapped in long-lived misaligned configurations, and identifies predictable tipping points where small numbers of adversarial agents can irreversibly shift population-level alignment even after manipulation ceases. These results demonstrate that individual-level alignment provides no guarantee of collective safety, calling for evaluation frameworks that account for emergent behavior in AI populations.
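The interplay of the two forces described above (following the majority versus an intrinsic bias) can be illustrated with a generic mean-field logistic-choice model. This is a textbook statistical-physics toy, not the authors' fitted theory. Here x is the fraction of agents holding opinion A, J is the conformity strength, and h > 0 is an intrinsic bias toward A. All names and parameter values are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stationary_fraction(x0, conformity, bias, steps=2000):
    """Iterate the mean-field map x <- sigmoid(J * (2x - 1) + h) from x0.

    The term J * (2x - 1) pulls agents toward the current majority; h pulls toward A.
    """
    x = x0
    for _ in range(steps):
        x = sigmoid(conformity * (2.0 * x - 1.0) + bias)
    return x
```

With J = 3 and h = 0.3, the map has two stable fixed points. A population that starts mostly on opinion B (x0 = 0.1) stays trapped near x of about 0.12, even though every agent is individually biased toward A. Starting from x0 = 0.5, the same population settles near x of about 0.95. This bistability is a simple analogue of the long-lived misaligned configurations the abstract describes.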

2605.10713 2026-05-12 stat.ML cs.IT cs.LG math.IT math.ST stat.TH

Price of Quality: Sufficient Conditions for Sparse Recovery using Mixed-Quality Data

Youssef Chaabouni, David Gamarnik

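AI Summary: This paper studies sparse recovery from mixed-quality data sources, in which a small number of high-quality, low-noise measurements coexist with a large number of low-quality, high-noise measurements. The authors introduce the notion of a "price of quality" and give sample-size conditions at both the information-theoretic and algorithmic levels, revealing how high- and low-quality samples can substitute for one another. In the agnostic setting, where the decoder does not know the data quality, the value of a high-quality sample is bounded. In the informed setting, where the decoder knows the per-sample noise variances, that value can grow arbitrarily large. The recovery threshold of the LASSO under mixed noise matches that of the homogeneous-noise case, showing strong robustness to data heterogeneity. The work provides the first theoretical conditions for sparse recovery with mixed-quality data and reveals how the information-theoretic and algorithmic recovery thresholds adapt differently to changes in data quality.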

Comments Published as a conference paper at ICLR 2026

Abstract

We study sparse recovery when observations come from mixed-quality sources: a small collection of high-quality measurements with small noise variance and a larger collection of lower-quality measurements with higher variance. For this heterogeneous-noise setting, we establish sample-size conditions for information-theoretic and algorithmic recovery. On the information-theoretic side, we show that it is sufficient for the sample counts $(n_1, n_2)$ of the two sources to satisfy a linear trade-off defining the Price of Quality: the number of low-quality samples needed to replace one high-quality sample. In the agnostic setting, where the decoder is completely unaware of the quality of the data, the price of quality is uniformly bounded; in particular, one high-quality sample is never worth more than two low-quality samples for this sufficient condition to hold. In the informed setting, where the decoder is informed of per-sample variances, the price of quality can grow arbitrarily large. On the algorithmic side, we analyze the LASSO in the agnostic setting and show that the recovery threshold matches the homogeneous-noise case and depends only on the average noise level, revealing a striking robustness of computational recovery to data heterogeneity. Together, these results give the first conditions for sparse recovery with mixed-quality data and expose a fundamental difference between how the information-theoretic and algorithmic thresholds adapt to changes in data quality.
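The following numerical illustration of the agnostic LASSO setting is not the paper's analysis. The decoder sees one pooled dataset with two noise levels and sets its regularization using only the average noise level. The sample sizes, noise levels, and the λ heuristic are all assumptions chosen for the example. The LASSO is solved with plain proximal gradient descent (ISTA).

```python
import numpy as np

def lasso_ista(A, y, lam, iters=3000):
    """Minimize (1/2n)||y - A x||^2 + lam * ||x||_1 by proximal gradient (ISTA)."""
    n, p = A.shape
    step = n / np.linalg.norm(A, 2) ** 2      # 1/L, with L the Lipschitz constant of the gradient
    x = np.zeros(p)
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / n
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-thresholding
    return x

rng = np.random.default_rng(0)
n1, n2, p, k = 40, 360, 50, 3                       # hypothetical high/low-quality counts
sigma = np.r_[np.full(n1, 0.1), np.full(n2, 0.5)]   # per-sample noise std (unknown to decoder)
A = rng.standard_normal((n1 + n2, p))
x_true = np.zeros(p)
x_true[:k] = 2.0
y = A @ x_true + sigma * rng.standard_normal(n1 + n2)

# Agnostic decoder: lambda depends only on the average noise level (a standard heuristic).
sigma_bar = np.sqrt(np.mean(sigma ** 2))
lam = 2 * sigma_bar * np.sqrt(2 * np.log(p) / (n1 + n2))
x_hat = lasso_ista(A, y, lam)
support = np.flatnonzero(np.abs(x_hat) > 1e-6)
```

In this configuration the true support {0, 1, 2} is recovered. The decoder never uses the per-sample variances, which is the sense in which the average noise level is what matters for the LASSO.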

2605.10704 2026-05-12 eess.SP cs.RO

xApp Empowered Resource Management for Non-Terrestrial Users in 5G O-RAN Networks

Mohammed M. H. Qazzaz, Syed Ali Zaidi, Aubida A. Al-Hameed, Abdelaziz Salama, Des Mclernon

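AI Summary: This paper proposes a deep reinforcement learning xApp for managing resources for non-terrestrial user equipment in 5G Open Radio Access Networks (O-RAN). Its goal is to optimize handover decisions for unmanned aerial vehicles (UAVs) flying along predetermined routes. The method uses a Double Deep Q-Network (DDQN) combined with transfer learning for predictive optimization, anticipating network conditions in advance to reduce handover frequency and outage probability. Experiments show that the framework substantially reduces the number of handovers while maintaining connection reliability, validating learning-based methods for managing UAV mobility in next-generation O-RAN architectures.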

Abstract

This paper introduces a proactive Unmanned Aerial Vehicle (UAV) mobility management xApp for Open Radio Access Network (O-RAN) Near Real-Time Radio Intelligent Controller (Near-RT RIC) environments, employing Double Deep Q-Network (DDQN) reinforcement learning (RL) enhanced with transfer learning to optimise handover decisions for UAVs operating along predetermined flight trajectories. Unlike reactive approaches that respond to signal degradation, the proposed framework anticipates network conditions and minimises both outage probability and handover frequency through predictive optimisation. The system leverages centralised weight averaging to consolidate knowledge from multiple flight scenarios into a global model capable of generalising to previously unseen operational environments without extensive retraining. A comprehensive evaluation demonstrates that the proposed framework achieves a favourable trade-off between handover frequency and connectivity reliability, reducing handover events by up to 54.6% compared to greedy approaches while maintaining outage probability at practically negligible levels. The results validate the effectiveness of intelligent learning-based approaches for UAV mobility management in next-generation O-RAN architectures, thereby contributing to seamless integration of aerial user equipment into cellular networks.
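Two building blocks named in the abstract can be sketched in plain Python. The first is the Double DQN target, in which the online network selects the next action and the target network evaluates it. The second is centralised weight averaging across flight scenarios. The action set (for example, stay on the serving cell or hand over to a neighbour) and the parameter layout are hypothetical, and this is not the authors' xApp code.

```python
def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.99):
    """Double DQN targets: argmax from the online net, value from the target net."""
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, q_online_next, q_target_next):
        if done:
            targets.append(r)  # no bootstrapping at the end of a flight episode
            continue
        best = max(range(len(q_on)), key=lambda a: q_on[a])
        targets.append(r + gamma * q_tg[best])
    return targets

def average_weights(models):
    """Element-wise mean of per-scenario parameters.

    Hypothetical layout: each model is a dict mapping a name to a flat list of floats.
    """
    names = models[0].keys()
    return {name: [sum(vals) / len(models) for vals in zip(*(m[name] for m in models))]
            for name in names}
```

Decoupling action selection from evaluation is what distinguishes DDQN from vanilla DQN; it reduces the overestimation of Q-values that the max operator introduces.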

2605.10698 2026-05-12 cs.MA cs.AI

The Bystander Effect in Multi-Agent Reasoning: Quantifying Cognitive Loafing in Collaborative Interactions

Dahlia Shehata, Ming Li

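AI Summary: This study examines a "bystander effect" that can arise during collaborative reasoning in multi-agent systems, in which agents exhibit cognitive loafing under social pressure. By evaluating 22,500 deterministic trajectories across three datasets, the study proposes the concept of an "Interaction Depth Limit." It also reveals that models may reason correctly internally while producing "alignment hallucinations" in their external outputs. These findings indicate that multi-agent collaboration can weaken individual reasoning and expose latent flaws in system architecture.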

Abstract

Multi-agent systems (MAS) assume that collaboration inherently improves Large Language Model (LLM) reasoning. We challenge this by demonstrating that simulated social pressure triggers an algorithmic "Bystander Effect," inducing severe cognitive loafing. By evaluating 22,500 deterministic trajectories across 3 dataset contexts (GAIA, SWE-bench, Multi-Challenge) with 3 state-of-the-art (SOTA) models, we semantically audit internal reasoning traces against external outputs. We formalize the Interaction Depth Limit ($D_L$), the exact plurality threshold at which an agent's logical sovereignty collapses into social compliance. Crucially, we uncover the Sovereignty Gap: models frequently compute the correct derivation internally but suffer "Alignment Hallucinations," actively subjugating empirical evidence to sycophantically appease a simulated swarm. We prove that multi-agent social load is strictly non-commutative; the "brand" identity of the "Lead Anchor" auditor disproportionately dictates the swarm's integrity. These findings expose architectural vulnerabilities, proving that unstructured multi-agent topologies can degrade independent reasoning.
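The abstract gives no formulas for $D_L$ or the Sovereignty Gap. The sketch below operationalizes them under explicit assumptions. $D_L$ is taken as the smallest number of dissenting peers at which an agent abandons its answer in at least half of the trials. The Sovereignty Gap is taken as the fraction of trajectories whose internal trace is correct but whose final output is wrong. Both definitions and the input layouts are hypothetical.

```python
def interaction_depth_limit(outcomes, flip_threshold=0.5):
    """Smallest dissent level whose flip rate reaches flip_threshold (hypothetical definition).

    outcomes: {num_dissenting_peers: [agent_flipped_bool, ...]}.
    Returns None if no level reaches the threshold.
    """
    for depth in sorted(outcomes):
        trials = outcomes[depth]
        if trials and sum(trials) / len(trials) >= flip_threshold:
            return depth
    return None

def sovereignty_gap(records):
    """Share of (internal_correct, external_correct) records that are right inside, wrong outside."""
    gap = sum(1 for internal_ok, external_ok in records if internal_ok and not external_ok)
    return gap / len(records)
```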

2605.10681 2026-05-12 cs.IT cs.LG math.IT

Scalable Mamba-Based Message-Passing Neural Decoder for Error-Correcting Codes

Rostislav Gusev, Nikita Aleksandrov, Artem Solomkin, Dmitry Artemasov

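AI Summary: This paper proposes a scalable Mamba-based message-passing neural decoder (MMPD) for binary linear error-correcting codes. The decoder uses local message passing on the Tanner graph, combined with bidirectional Mamba state-space blocks for efficient long-range information propagation, which avoids the quadratic complexity of conventional attention. On the (1056, 880) LDPC code, MMPD achieves a 0.45 dB gain over the state-of-the-art CrossMPT decoder at the same target bit error rate while reducing memory consumption by a factor of 1.5. The memory advantage grows further for longer codes.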

Comments This work has been submitted to the IEEE for possible publication

Abstract

Forward error correction is essential for reliable communication over noisy channels. Attention-based model-free neural decoders have shown strong performance for short codes, but their scalability to longer codes is limited by the quadratic memory and computational cost of attention. In this paper, we introduce the Mamba message-passing decoder (MMPD), an attention-free syndrome-based neural decoder for binary linear codes. MMPD retains the Tanner-graph structure of a message-passing decoder by performing local pairwise aggregation along variable-check edges. To enable efficient long-range information propagation, these local updates are combined with bidirectional Mamba state-space blocks. By avoiding dense attention matrices, MMPD scales more favorably for long codes in both memory and computation. Experiments on the (1056, 880) LDPC code show that MMPD achieves a 0.45 dB gain over the state-of-the-art CrossMPT decoder at a specified target bit error rate, while reducing memory consumption by a factor of 1.5. This reduction factor increases substantially for longer codes, demonstrating the applicability of MMPD to scalable neural decoding of practical long codes.
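Two ingredients of a syndrome-based decoder operating on the Tanner graph can be made concrete. The first is the syndrome of the channel's hard decisions. The second is the list of variable-check edges along which local pairwise aggregation happens. The sketch uses the (7,4) Hamming code purely as a small example and assumes the common LLR convention log P(0)/P(1). The Mamba blocks and learned updates are not shown.

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code (column j is the binary form of j + 1).
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def hard_decision(llr):
    """With LLR = log P(bit=0) / P(bit=1), a negative LLR means bit 1."""
    return (np.asarray(llr) < 0).astype(int)

def syndrome(H, bits):
    """Zero syndrome means the hard decisions already form a codeword."""
    return (H @ bits) % 2

def tanner_edges(H):
    """(variable, check) pairs: one edge per nonzero entry of H."""
    checks, variables = np.nonzero(H)
    return list(zip(variables.tolist(), checks.tolist()))
```

A single error at bit 4 produces a syndrome equal to column 4 of H, namely [1, 0, 1]. The number of Tanner-graph edges, and hence the number of local messages per iteration, equals the number of ones in H. This count grows linearly with code length for sparse LDPC codes, unlike the quadratic cost of dense attention.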

2605.10622 2026-05-12 cs.MM cs.CV

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

Yangneng Chen, Junlin Li, Weijun Yao, Xilai Ma, Guodong Du, Wenya Wang, Jing Li

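AI Summary: Large vision-language models (LVLMs) perform well on multimodal tasks, but their reliability is often undermined by hallucination, that is, generating text that contradicts the visual input. This paper identifies a "vocabulary hijacking" phenomenon. Certain visual tokens (called inert tokens) attract attention abnormally and consistently decode to unrelated words (hijacking anchors) in the vocabulary space, revealing a semantic collapse. Based on this finding, the study proposes HAVAE, a training-free intervention that strengthens the focus of critical attention heads on visual content. HAVAE effectively mitigates hallucination while preserving overall model performance.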

Comments Accepted by ACL 2026 Main

Abstract

Large Vision-Language Models (LVLMs) have achieved remarkable progress in multimodal tasks, yet their reliability is persistently undermined by hallucinations, i.e., generated text that contradicts the visual input. Recent studies often attribute these errors to inadequate visual attention. In this work, we analyze the attention mechanisms via the logit lens, uncovering a distinct anomaly we term Vocabulary Hijacking. We discover that specific visual tokens, defined as Inert Tokens, disproportionately attract attention. Crucially, when their intermediate hidden states are projected into the vocabulary space, they consistently decode to a fixed set of unrelated words (termed Hijacking Anchors) across layers, revealing a rigid semantic collapse. Leveraging this semantic rigidity, we propose Hijacking Anchor-Based Identification (HABI), a robust strategy to accurately localize these Inert Tokens. To quantify the impact of this phenomenon, we introduce the Non-Hijacked Visual Attention Ratio (NHAR), a novel metric designed to identify attention heads that remain resilient to hijacking and are critical for factual accuracy. Building on these insights, we propose Hijacking-Aware Visual Attention Enhancement (HAVAE), a training-free intervention that selectively strengthens the focus of these identified heads on salient visual content. Extensive experiments across multiple benchmarks demonstrate that HAVAE significantly mitigates hallucinations with no additional computational overhead, while preserving the model's general capabilities. Our code is publicly available at https://github.com/lab-klc/HAVAE.
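The logit lens projects an intermediate hidden state through the model's final normalization and unembedding matrix to see which tokens it "means". The sketch below shows that projection together with one plausible reading of NHAR: the share of a head's visual attention that falls on non-inert tokens. The RMS-style normalization without a learned scale and the exact NHAR formula are assumptions. Real models use their own final norm, and the paper may define NHAR differently. The authors' implementation is at the repository linked above.

```python
import numpy as np

def logit_lens_topk(hidden, unembed, k=3, eps=1e-6):
    """Project a hidden state (d,) into vocabulary space via unembed (d, V); return top-k ids."""
    h = hidden / np.sqrt(np.mean(hidden ** 2) + eps)   # assumed RMS-style final norm
    logits = h @ unembed
    return np.argsort(-logits)[:k]

def nhar(attn_row, visual_idx, inert_idx):
    """Assumed NHAR: visual attention on non-inert tokens / total visual attention."""
    visual = set(visual_idx)
    kept = visual - set(inert_idx)
    total = sum(attn_row[i] for i in visual)
    return sum(attn_row[i] for i in kept) / total if total > 0 else 0.0
```

Under this reading, an inert token would show up as a visual position whose logit-lens top-k stays inside a fixed anchor vocabulary across layers. A head with high NHAR spends its visual attention mostly elsewhere.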

2605.10613 2026-05-12 cond-mat.dis-nn cs.LG

Exact Fixed-Point Constraints in Neural-ODEs with Provable Universality

Feliciano Giuseppe Pacifico, Duccio Fanelli, Lorenzo Buffoni, Lorenzo Chicchi, Diego Febbe, Raffaele Marino

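AI Summary: This paper proposes a method that lets a neural ordinary differential equation (Neural-ODE) approximate arbitrary velocity fields while forcing the velocity field to be exactly zero at prescribed fixed points. The method rigorously constrains gradient-based training without changing the expressive power of the Neural-ODE, and the authors prove the universality of Neural-ODEs under local constraints on the velocity field. The approach is validated on two paradigmatic physical models.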

Comments 15 pages, 3 figures

Abstract

We introduce a technique that enables Neural-ODEs to approximate arbitrary velocity fields with a priori planted fixed points. Specifically, a recipe is given to explicitly accommodate a finite collection of points in the reference multi-dimensional space of the Neural-ODE where the velocity field is exactly equal to zero. In this way, gradient-based training is rigorously constrained inside the prescribed hypothesis class while the expressive power of the Neural-ODE is left unaltered. We rigorously prove the universality of the Neural-ODE under any local constraints in the velocity field and give a computationally convenient way of imposing the fixed points. Our method is then tested on two paradigmatic physical models.
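One simple way to plant exact fixed points is shown below; it is not necessarily the authors' recipe. Take any network g and subtract a kernel interpolant of g's values at the prescribed points. With a positive-definite Gaussian kernel K, solving K C = g(P) gives f(x) = g(x) - k(x, P) C, which vanishes exactly at every planted point for any parameters of g. Training g therefore never leaves the constrained class. The kernel choice, its width, and the toy network are assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 width^2)) for point sets a (n, d) and b (m, d)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def make_constrained_field(g, fixed_points, width=1.0):
    """Wrap a vectorized field g: (n, d) -> (n, d) so that f(p) = 0 at every planted point p."""
    P = np.asarray(fixed_points, dtype=float)
    K = gaussian_kernel(P, P, width)          # (m, m), positive definite for distinct points
    C = np.linalg.solve(K, g(P))              # (m, d) interpolation coefficients

    def f(x):
        x = np.atleast_2d(x)
        return g(x) - gaussian_kernel(x, P, width) @ C

    return f

# Toy "network": one tanh layer with fixed weights (a trainable model would go here).
W = np.array([[0.5, -1.0], [1.5, 0.3]])
b = np.array([0.1, -0.2])
g = lambda x: np.tanh(x @ W.T + b)
```

Because C is recomputed from the current g, the constraint holds after every gradient step, whereas an added penalty term would only enforce it approximately.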

2605.10612 2026-05-12 cs.AR cs.LG

Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science

Marc Neu, Frank Baptist, Thomas Lobmaier, Fabio Papagno, Torben Ferber, Jürgen Becker

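AI Summary: This work addresses the challenge of deploying real-time graph neural networks in hardware trigger systems for large scientific experiments. It proposes an end-to-end solution that uses both FPGA fabric and AI Engines for online event selection in the Belle II electromagnetic calorimeter. A semi-automated design flow enables efficient mapping and optimization of the graph neural network, substantially increasing throughput and reducing resource usage. Experimental results show that, while maintaining low latency, the solution improves throughput by 53% over an FPGA-only solution and greatly lowers resource utilization.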

Comments Accepted to FCCM Reconfigurable Computing Challenge 2026

Abstract

Graph neural networks are increasingly adopted in trigger systems for collider experiments, where strict latency and throughput constraints make deployment on embedded platforms challenging. As detectors move towards higher granularity, the number of inputs per inference increases and FPGA-only solutions face resource bottlenecks. This work presents an end-to-end demonstrator for the real-time deployment of a dynamic graph neural network for the Belle II electromagnetic calorimeter hardware trigger on the AMD Versal VCK190, leveraging both FPGA fabric and AI Engine tiles. We develop a Python-based semi-automated design flow covering operator fusion, partitioning, mapping, spatial parallelization, and kernel-level optimization. Our design achieves a throughput of 2.94 million events per second at an end-to-end latency of 7.15 microseconds. Compared to the FPGA-only baseline, this represents a 53% throughput improvement while reducing DSP utilization from 99% to 19% at 29% AI Engine tile utilization. To validate the deployment, an interactive visualization pipeline enables real-time monitoring of inference results on the physical demonstrator.
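A quick back-of-envelope reading of the reported figures follows. It assumes the pipeline runs in steady state and that the 53% improvement is relative to the FPGA-only baseline's throughput. The derived quantities are arithmetic consequences of those assumptions, not numbers reported by the authors.

```python
throughput = 2.94e6      # events per second (reported)
latency = 7.15e-6        # seconds, end to end (reported)
improvement = 0.53       # throughput gain over the FPGA-only baseline (reported)

interval_ns = 1e9 / throughput              # average spacing between events, in ns
in_flight = throughput * latency            # Little's law: events concurrently in the pipeline
baseline = throughput / (1 + improvement)   # implied FPGA-only throughput, events per second
```

This gives roughly 340 ns between events, about 21 events in flight at once, and an implied baseline of about 1.92 million events per second. The design therefore relies on deep pipelining: each event's latency is about 21 times the interval at which new events arrive.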

2605.10611 2026-05-12 cs.CR cs.AI

Re-Triggering Safeguards within LLMs for Jailbreak Detection

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao

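AI Summary: This paper proposes a method for detecting jailbreak prompts against large language models (LLMs) in order to strengthen model defenses. Although existing LLMs have built-in safety mechanisms, carefully crafted jailbreak prompts can still bypass them. The authors therefore introduce an embedding-disruption method that performs detection by re-activating the model's internal safeguards. Experiments show that the method effectively defends against advanced jailbreak attacks in both white-box and black-box settings and remains robust to adaptive attacks.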

Abstract

This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus introduce an embedding disruption method to re-activate the safeguards within LLMs. Unlike previous defense methods that aim to serve as standalone solutions, our approach instead cooperates with the LLM's internal defense mechanisms by re-triggering them. Moreover, through extensive analysis, we gain a comprehensive understanding of the disruption effects and develop an efficient search algorithm to identify appropriate disruptions for effective jailbreak detection. Extensive experiments demonstrate that our approach effectively defends against state-of-the-art jailbreak attacks in white-box and black-box settings, and remains robust even against adaptive attacks.
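The core intuition is that jailbreak prompts are fragile: a small disruption of the prompt embedding can re-trigger the model's own refusal. This can be illustrated with a toy detector. The paper searches for effective disruptions, whereas this sketch substitutes plain Gaussian noise. `generate` and `is_refusal` are hypothetical callables standing in for the LLM and a refusal classifier.

```python
import random

def detect_jailbreak(embedding, generate, is_refusal, n_samples=50,
                     noise_std=0.1, flag_fraction=0.2, seed=0):
    """Flag a prompt if enough randomly disrupted copies make the model refuse.

    embedding: flat list of floats (toy stand-in for the prompt's token embeddings).
    """
    rng = random.Random(seed)
    refusals = 0
    for _ in range(n_samples):
        disrupted = [v + rng.gauss(0.0, noise_std) for v in embedding]
        if is_refusal(generate(disrupted)):
            refusals += 1
    return refusals / n_samples >= flag_fraction
```

In a toy model that refuses whenever the embedding sum exceeds 1.0, a "jailbreak" embedding sitting just below that boundary is flagged, while a benign embedding far from it is not. Real safeguards are of course not a linear threshold.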

2605.10597 2026-05-12 cs.SE cs.AI

CrackMeBench: Binary Reverse Engineering for Agents

Isaac David, Arthur Gervais

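AI Summary: CrackMeBench is a benchmark for evaluating language-model agents on binary reverse-engineering tasks. It focuses on recovering validation logic from executables and producing inputs or key generators that the program accepts. The benchmark uses educational CrackMe-style tasks that combine public and generated binaries, runs in a no-network Docker environment, and records each model's submissions, elapsed time, tool usage, and other information. This provides a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis.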

Abstract

Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates. Agents run through an identical shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.
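Since each task allows three scored submissions, pass@k can be read as "at least one of the first k scored submissions was accepted by the program's oracle." The sketch below assumes that reading, which may differ from the benchmark's own scorer. The input layout is hypothetical. Rounding the resulting fractions reproduces the percentages quoted above (11/12 to 92%, 7/12 to 58%, 5/12 to 42%).

```python
def pass_at_k(task_results, k):
    """Return (solved, total, rounded percent) for pass@k.

    task_results: hypothetical list with one entry per task, each a list of booleans
    (accepted or not) for that task's scored submissions, in submission order.
    """
    solved = sum(1 for subs in task_results if any(subs[:k]))
    total = len(task_results)
    return solved, total, round(100 * solved / total)
```

For example, 8 tasks solved on the first try, 3 solved on the second try, and 1 never solved gives pass@1 = 8/12 and pass@3 = 11/12.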

2605.10590 2026-05-12 stat.ML cs.LG

Amortizing Causal Sensitivity Analysis via Prior Data-Fitted Networks

Emil Javurek, Dennis Frauen, Marie Brockschmidt, Jonas Schweisthal, Stefan Feuerriegel

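AI Summary: This paper proposes an amortized approach to causal sensitivity analysis for efficiently estimating bounds on causal effects in the presence of unobserved confounding. Using prior-data fitted networks, it turns the traditional per-instance computation into an in-context learning framework, greatly improving computational efficiency. The method builds a general prior-data construction and generates training labels through a Lagrangian-scalarized optimization objective, avoiding model-specific analytical derivations. Under standard convexity and linearity conditions, it recovers the full Pareto frontier of solutions. Experiments show substantial speed-ups across a variety of datasets and sensitivity settings.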

Abstract

Causal sensitivity analysis aims to provide bounds for causal effect estimates in the presence of unobserved confounding. However, existing methods for causal sensitivity analysis are per-instance procedures, meaning that changes to the dataset, causal query, sensitivity level, or treatment require new computation. Here, we instead present an in-context learning approach. Specifically, we propose an amortized approach to causal sensitivity analysis based on prior-data fitted networks. A key challenge is that the sensitivity bounds are not directly available when sampling training data. To address this, we develop a general prior-data construction that is applicable across the class of generalized treatment sensitivity models. Our construction applies a Lagrangian scalarization to the objective, generating training labels for the bounds through a trade-off between minimizing or maximizing the causal effect and violating the sensitivity model; this avoids model-specific analytical derivations. We further show that, under standard convexity and linearity conditions, our objective recovers the full Pareto frontier of solutions. Empirically, we demonstrate our amortized approach across various datasets, causal queries, and sensitivity levels, where it achieves test-time computation that is orders of magnitude faster than per-instance methods. To the best of our knowledge, ours is the first foundation model for in-context causal sensitivity analysis.
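The Lagrangian scalarization can be illustrated on a deliberately tiny toy. This is not the paper's prior-data construction. Each candidate confounding scenario t has a causal effect and a sensitivity-model violation, and for a multiplier λ the upper-bound label maximizes effect minus λ times violation. Sweeping λ traces the trade-off, and the lower bound is symmetric (minimize effect plus λ times violation). The one-dimensional scenario family and both toy functions are assumptions.

```python
import numpy as np

def scalarized_upper_bound(candidates, effect, violation, lam):
    """Return (candidate, effect, violation) maximizing effect - lam * violation."""
    scores = [effect(c) - lam * violation(c) for c in candidates]
    best = candidates[int(np.argmax(scores))]
    return best, effect(best), violation(best)

thetas = np.linspace(0.0, 3.0, 301)   # toy 1-D family of confounding scenarios
effect = lambda t: t                  # toy causal effect under scenario t
violation = lambda t: t ** 2          # toy violation of the sensitivity model
lams = (0.25, 0.5, 1.0, 2.0)
frontier = [scalarized_upper_bound(thetas, effect, violation, lam) for lam in lams]
```

In this toy the maximizer is t* = 1/(2λ), so larger multipliers trade a smaller effect bound for less violation. Each (violation, effect) pair is one point on the Pareto frontier. Because the objective here is concave, sweeping λ recovers every point of that frontier, which mirrors the convexity condition mentioned in the abstract.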

2605.10582 2026-05-12 cs.CR cs.AI

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Haichang Gao

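AI Summary: This paper proposes a jailbreak defense with guarantees for large language models, called Disrupt-and-Rectify Smoothing (DR-Smoothing). Inspired by denoised smoothing in adversarial defense, it integrates a two-stage prompt-processing pipeline into the conventional smoothing defense framework: the input prompt is first disrupted, then rectified into an in-distribution form. The method improves the defense success rate against jailbreak attacks and strikes a better balance between harmlessness and helpfulness. It also provides a theoretical analysis of the generic smoothing framework together with a tight bound on the defense success probability. Experiments show that it outperforms state-of-the-art defenses across a variety of attack scenarios.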

Abstract

This paper proposes a guaranteed defense method for large language models (LLMs) to safeguard against jailbreaking attacks. Drawing inspiration from the denoised-smoothing approach in the adversarial defense domain, we propose a novel smoothing-based defense method, termed Disrupt-and-Rectify Smoothing (DR-Smoothing). Specifically, we integrate a two-stage prompt-processing scheme (first disrupting the input prompt, then rectifying it) into the conventional smoothing defense framework. This disrupt-and-rectify approach improves upon previous disrupt-only approaches by restoring out-of-distribution disrupted prompts to an in-distribution form, thereby reducing the risk of unpredictable LLM behavior. In addition, this two-stage scheme offers a distinct advantage in balancing harmlessness and helpfulness in jailbreaking defense. Notably, we present a theoretical analysis of the generic smoothing framework, offering a tight bound on the defense success probability and on the requirements for disruption strength. Our approach can defend against both token-level and prompt-level jailbreaking attacks, under both established and adaptive attack scenarios. Extensive experiments demonstrate that our approach surpasses current state-of-the-art defense methods in terms of both harmlessness and helpfulness.
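The smoothing pipeline can be sketched as disrupt, then rectify, then query, then take a majority vote over copies. The disruption below is SmoothLLM-style random character replacement. `rectify` is a hypothetical placeholder (identity by default) for the paper's rectification stage, and `llm_jailbroken` is a hypothetical judge that returns True when the model's response is jailbroken. This illustrates generic smoothing, not DR-Smoothing itself.

```python
import random
import string

def disrupt(prompt, q, rng):
    """Replace a random fraction q of the prompt's characters with random printable characters."""
    chars = list(prompt)
    for i in rng.sample(range(len(chars)), int(q * len(chars))):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)

def smoothed_is_jailbroken(prompt, llm_jailbroken, rectify=lambda p: p,
                           n_copies=21, q=0.3, seed=0):
    """Majority vote over disrupted (and optionally rectified) copies of the prompt."""
    rng = random.Random(seed)
    votes = [llm_jailbroken(rectify(disrupt(prompt, q, rng))) for _ in range(n_copies)]
    return sum(votes) > n_copies / 2
```

In a toy where the attack works only if an exact adversarial suffix survives, replacing 30% of the characters almost always breaks the suffix, so the majority vote refuses. A rectifier would aim to restore benign prompts that this disruption mangles, which is the helpfulness side of the trade-off the abstract describes.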