arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14584 2026-05-15 physics.chem-ph cs.LG

All-atomistic Transferable Neural Potentials for Protein Solvation

Rishabh Dey, Salvina Sharipova, Konstantin Popov

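AI summary: This study proposes PHNN, an all-atom transferable neural potential for protein solvation calculations. Rather than adjusting final energies after the fact, the model improves the accuracy of implicit solvent models by learning transferable corrections to their parameters. PHNN incorporates physical priors to improve data efficiency, improves accuracy over traditional analytical methods, and continues to generalize to protein systems outside the training domain.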

Abstract

Implicit solvent models are widely used to reduce the number of solvent degrees of freedom and enable the calculation of solvation energetics without explicit water molecules. However, their accuracy often falls short of that of explicit models. Recent advancements in neural potentials have shown promise in drug discovery, but transferability remains a persistent challenge. Here, we introduce the Protein Hydration Neural Network (PHNN), an implicit solvent model that extends analytical continuum solvation by learning transferable corrections to model parameters instead of applying post hoc adjustments to final energies. The model is explicitly designed to maximize data efficiency by leveraging physical priors embedded in the data. We demonstrate that PHNN improves accuracy relative to traditional analytical methods and maintains predictive accuracy on out-of-domain protein systems.

2605.14567 2026-05-15 stat.ML cs.LG math.PR math.ST stat.TH

Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

Arie Wortsman-Zurich, Hugo Tabanelli, Yatin Dandi, Florent Krzakala, Bruno Loureiro

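AI summary: This paper proposes a simple mechanism by which feature learning in multi-layer networks gives rise to scaling laws. It studies a high-dimensional hierarchical target function that is globally of high degree yet can be represented by a set of latent compositional features whose weights decay as a power law. A layer-wise spectral algorithm recovers these latent features sequentially: strong features are detectable at small sample sizes, while weak features require more data. The analysis proves an explicit power-law decay of the prediction error, and numerical experiments confirm the sequential recovery and the performance gap to non-hierarchical methods.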

Abstract

We propose a simple mechanism by which scaling laws emerge from feature learning in multi-layer networks. We study a high-dimensional hierarchical target that is a globally high-degree function, but that can be represented by a combination of latent compositional features whose weights decrease as a power law. We show that a layer-wise spectral algorithm adapted to this compositional structure achieves improved scaling relative to shallow, non-adaptive methods, and recovers the latent directions sequentially: strong features become detectable at small sample sizes, while weaker features require more data. We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error. Technically, the analysis relies on random matrix methods and a resolvent-based perturbation argument, which gives matching upper and lower bounds for individual eigenvector recovery beyond what standard gap-based perturbation bounds provide. Numerical experiments confirm the predicted sequential recovery, finite-size smoothing of the thresholds, and separation from non-hierarchical kernel baselines. Together, these results show how smooth scaling laws can emerge from a cascade of sharp feature-learning transitions.
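
As a rough illustration of how a cascade of sharp thresholds can add up to a smooth power law, consider the back-of-the-envelope calculation below. The exponents $\alpha$ and $\beta$ and the threshold form $n_k \propto k^{\beta}$ are hypothetical choices made for this sketch; the abstract does not state the paper's actual thresholds or rates.

```latex
% Illustrative assumptions (not from the paper):
%   feature k has weight a_k \propto k^{-\alpha} with \alpha > 1/2,
%   and becomes recoverable once the sample size n exceeds n_k \propto k^{\beta}, \beta > 0.
\begin{align*}
K(n) &= \#\{k : n_k \le n\} \propto n^{1/\beta}, \\
\mathcal{E}(n) &\approx \sum_{k > K(n)} a_k^2
  \propto \sum_{k > K(n)} k^{-2\alpha}
  \approx \frac{K(n)^{1-2\alpha}}{2\alpha - 1}
  \propto n^{-(2\alpha - 1)/\beta}.
\end{align*}
```

Under these assumptions each feature switches on abruptly, yet the error summed over the features not yet recovered decays as a power of $n$.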

2605.14563 2026-05-15 cs.SE cs.CL

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

Suyoung Bae, Jaehoon Lee, Changkyu Choi, YunSeok Choi, Jee-Hyong Lee

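AI summary: This paper proposes MemDocAgent, a long-horizon agentic framework for generating consistent, hierarchical repository-level code documentation. Through dependency-aware traversal guiding and memory-guided agentic interaction, it documents an entire repository within a single integrated context, addressing the redundant retrieval, conflicting descriptions, and lack of hierarchical structure in existing methods. Experiments show that MemDocAgent outperforms both open-source and closed-source baselines across multiple evaluation criteria and is applicable to real software development workflows.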

Abstract

Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within a single, integrated context spanning the entire repository. It combines two components: (i) Dependency-Aware Traversal Guiding that predetermines a traversal order respecting dependency and granularity hierarchies; (ii) Memory-Guided Agentic Interaction, in which the agent interacts with RepoMemory, a shared memory accumulating prior work traces through read, write, and verify operations. Through an in-depth multi-criteria evaluation, MemDocAgent achieves the best performance over both open and closed-source baselines and demonstrates practical applicability in real software development workflows.

2605.14526 2026-05-15 cs.GR cs.DC cs.NA cs.RO math.NA

DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration

Shih-Yu Lai, Sung-Han Tien, Jui-I Huang, Yen-Chen Tseng, Yi-Ting Chiu, Siyuan Luo, Ziqiu Zeng, Fan Shi, Peter Yichen Chen, Tiantian Liu, Yu-Lun Liu, Bing-Yu Chen

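AI summary: DiffPhD is a unified, GPU-accelerated differentiable Projective Dynamics framework for elastodynamics problems involving heterogeneous materials, hyperelasticity under large deformations, and contact-rich interactions. It combines stiffness-aware projective weights, trust-region eigenvalue filtering, and an Anderson Acceleration scheme in a unified GPU pipeline to simulate heterogeneous materials efficiently and stably. DiffPhD preserves gradient accuracy while delivering up to an order-of-magnitude speedup, and remains convergent under large stiffness contrasts, supporting end-to-end optimization of complex physical systems.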

Abstract

Differentiable simulation of soft bodies is a foundation for system identification, trajectory optimization, and Real2Sim transfer. Yet, existing methods such as differentiable Projective Dynamics (DiffPD) struggle when faced with heterogeneous materials with extreme stiffness contrasts, hyperelasticity under large deformations, and contact-rich interactions, which are common scenarios in the real world. We present DiffPhD, a unified GPU-accelerated differentiable Projective Dynamics framework for heterogeneous materials that tackles these intertwined challenges simultaneously. Our key insight is a careful integration of: (i) stiffness-aware projective weights to embed heterogeneity into the global system; (ii) trust-region eigenvalue filtering lifted to the backward pass for stable hyperelastic gradients and a type-II Anderson Acceleration scheme with dual-gate convergence to stabilize forward iteration under large stiffness contrasts; and (iii) a unified GPU pipeline that reuses a single sparse factor across forward, backward, and contact computations, with stiffness-amplified Rayleigh damping folded into the same factor for heterogeneity-aware dissipation at zero recurring cost. DiffPhD achieves strict gradient accuracy while delivering up to an order-of-magnitude speedup over prior differentiable solvers on heterogeneous, hyperelastic, contact-rich benchmarks. Crucially, this speedup does not come at the cost of stability: DiffPhD remains convergent on stiffness contrasts up to 100x where prior PD solvers degrade. This unlocks end-to-end gradient-based optimization on regimes previously bottlenecked by either solver fragility or per-iteration cost: shell-joint composite creatures, soft characters wielding stiff weapons, and soft-gripper robotic manipulation, all handled within a single forward-backward pass.

2605.14524 2026-05-15 stat.ML cs.LG

Large Dimensional Kernel Ridge Regression: Extending to Product Kernels

Yang Zhou, Yicheng Li, Yuqian Cheng, Qian Lin

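AI summary: This paper studies the generalization error of large-dimensional kernel ridge regression (KRR) under a broader class of kernels, extending earlier results that were limited to inner product kernels on the sphere. The authors introduce a new family of large-dimensional kernels and derive the corresponding convergence rates of the generalization error. They find that minimax optimality, the saturation effect, periodic plateaus in the convergence rate, and multiple descent with respect to sample size all persist in this more general setting.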

Abstract

Recent studies have reported saturation effects and multiple descent behavior in large dimensional kernel ridge regression (KRR). However, these findings are predominantly derived under restrictive settings, such as inner product kernels on the sphere or strong eigenfunction assumptions like hypercontractivity. Whether such behaviors hold for other kernels remains an open question. In this paper, we establish a broad, new family of large dimensional kernels and derive the corresponding convergence rates of the generalization error. As a result, we recover key phenomena previously associated with inner product kernels on the sphere, including: (i) minimax optimality when the source condition satisfies $s \le 1$; (ii) the saturation effect when $s > 1$; and (iii) a periodic plateau phenomenon in the convergence rate and multiple-descent behavior with respect to the sample size $n$.

2605.14512 2026-05-15 cs.IR cs.AI

Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization

Bin Huang, Xin Wang, Junwei Pan, Yongqi Zhou, Yifeng Zhou, Zhixiang Feng, Shudong Huang, Haijie Gu, Wenwu Zhu

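AI summary: This paper addresses the input and output bottlenecks of generative recommendation (GenRec) models with AsymRec, an asymmetric continuous-discrete framework. Multi-expert Semantic Projection (MSP) enriches the semantics of the input representation, and Multi-faceted Hierarchical Quantization (MHQ) makes the output targets more structured and precise, mitigating popularity bias and the loss of fine-grained semantics. Experiments show that AsymRec outperforms existing generative recommenders by 15.8% on average.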

Abstract

Generative Recommendation (GenRec) models reformulate recommendation as a sequence generation task, representing items as discrete Semantic IDs used symmetrically as both inputs and prediction targets. We identify a critical dual-stage information bottleneck in this design: (1) the Input Bottleneck, where lossy quantization degrades fine-grained semantics, while popularity bias skews the learned representations toward frequent items, and (2) the Output Bottleneck, where imprecise discrete targets limit supervision quality. To address these issues, we propose AsymRec, an asymmetric continuous-discrete framework that decouples input and output representations. Specifically, Multi-expert Semantic Projection (MSP) maps continuous embeddings into the Transformer's hidden space via expert-specialized projections, preserving semantic richness and improving generalization to infrequent items. Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity, structured discrete targets through multi-view and multi-level quantization with semantic regularization, preventing dimensional collapse while retaining fine-grained distinctions. Extensive experiments demonstrate that AsymRec consistently outperforms state-of-the-art generative recommenders by an average of 15.8%. The code will be released.

2605.14502 2026-05-15 eess.SY cs.AI cs.SY

Quantifying Cyber-Vulnerability in Power Electronics Systems via an Impedance-Based Attack Reachable Domain

Hongwei Zhen, Ze Yu, Xin Xiang, Wuhua Li, Mingyang Sun

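AI summary: This letter addresses how to quantify the vulnerability of power electronics systems to cyberattacks. It proposes an impedance-based Attack Reachable Domain (ARD) framework for assessing how far a node can be pushed toward instability under privilege constraints. The framework maps feasible attack actions to critical-eigenvalue migration through impedance reshaping, and defines an Attack Penetration Index that jointly characterizes the penetration of the stability margin and the accessibility of successful attacks. For cases where inverter models are unavailable, a practical gray-box workflow combines existing impedance identification with differentiable surrogate tools. Case studies show that the metric reveals vulnerability patterns that conventional grid-strength indicators do not capture.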

Abstract

Power electronics systems are increasingly exposed to cyber threats due to their integration with digital controllers and communication networks. However, an attacker-oriented metric is still lacking to quantify the extent to which a node can be pushed toward instability within a privilege-constrained action space. This letter proposes an impedance-based Attack Reachable Domain (ARD) framework that maps feasible adversarial actions to critical-eigenvalue migration through impedance reshaping. Based on the ARD, an Attack Penetration Index is defined to quantify node-level cyber-vulnerability by jointly characterizing the penetration of the nominal stability margin and the accessibility of successful destabilizing attacks within a privilege-constrained action space. To make the proposed assessment computable when inverter models are unavailable, a practical gray-box workflow is further established by integrating existing impedance identification and differentiable surrogate tools. Case studies on a 4-bus system and a modified IEEE 39-bus system show that coordinated cross-layer manipulations are markedly more damaging than isolated single-layer attacks, and that the proposed metric reveals vulnerability patterns that cannot be inferred from grid-strength indicators.

2605.14501 2026-05-15 eess.SY cs.AI cs.LG cs.SY

Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning

Edoardo Scarpel, Alberto Pettena, Matteo Cederle, Federico Chiariotti, Marco Fabris, Gian Antonio Susto

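AI summary: This paper proposes a fully dynamic rebalancing method based on deep reinforcement learning for dockless bike-sharing systems. The service is modeled with a graph-based simulator and rebalancing is cast as a Markov decision process; a deep reinforcement learning agent routes a single truck in real time, performing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures while limiting spatial inequality and mobility deserts, demonstrating the value of learning-based rebalancing for efficient and reliable shared micromobility.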

Comments 6 pages, 5 figures, 1 table, accepted at the 23rd IFAC World Congress, Busan, South Korea, Aug. 23-26, 2026. Open invited track 9-131: "Control and Optimization for Smart Cities"

Abstract

This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.

2605.14495 2026-05-15 cs.MM cs.AI

Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification

Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao, Phuc Ho, Van Pham, Hung Cao

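AI summary: To meet the need for both accuracy and transparency in multimedia verification, this work proposes a contestable multi-agent framework that combines multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation. The method decomposes each case into claim-centered sections, retrieves targeted evidence, and generates support and attack arguments with provenance and strength scores. Conflicts and uncertainty are handled through local argument graphs, producing verification reports that are clearly structured, editable, and computationally practical.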

Comments ACM ICMR 2026 Grand Challenge on Multimedia Verification

Abstract

Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: https://github.com/Analytics-Everywhere-Lab/MV2026_the_liems.

2605.14478 2026-05-15 cs.SE cs.AI cs.CL

When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

Haojun Weng, Qianqian Yang, Hao Fu, Haobin Pan, Xinwei Lv

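AI summary: This study examines how stale code snippets in retrieval-augmented code generation can harm code completion. In controlled experiments on 17 production-helper signature changes from five Python repositories, retrieving only stale snippets strongly induces models to generate code incompatible with the current repository state, while using no retrieval at all avoids stale references but yields almost no passing completions. Adding valid current evidence largely mitigates the problem, showing that the temporal validity of retrieved context is an important factor in assessing the robustness of retrieval-augmented code generation.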

Comments 31 pages, 2 tables. Submitted to Information and Software Technology (Elsevier)

Abstract

Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.
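
The reported rates and the Jaccard overlap can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the sample IDs are hypothetical, chosen so that the set sizes match the abstract (15 and 13 stale-triggering samples out of 17), and the 12-sample intersection is one configuration consistent with the reported 75.0% overlap rather than data from the paper.

```python
def rate_pct(hits, total):
    """Percentage of samples with a hit, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| of two collections of sample IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


TOTAL = 17
qwen_stale_only = rate_pct(15, TOTAL)  # stale references under stale-only retrieval
gpt_stale_only = rate_pct(13, TOTAL)

# Hypothetical sample IDs: 12 samples trigger both models,
# 3 trigger only the first model and 1 triggers only the second.
qwen_ids = set(range(15))
gpt_ids = set(range(12)) | {15}

print(qwen_stale_only, gpt_stale_only, jaccard(qwen_ids, gpt_ids))
```

The two rates equal the reported percentage-point increases, which would be consistent with no stale references under current-only retrieval; the abstract does not state that baseline directly.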

2605.14434 2026-05-15 cs.IR cs.AI

Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL

Jianbo Zhu, Xing Fang, Jing Wang, Mingmin Jin, Bokang Wang, Guangxin Song, Zhenyu Xie, Junjie Bai

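AI summary: To make generative retrieval practical for e-commerce search, this work proposes CQ-SID, an efficient generative recall framework built on semantic cluster IDs, together with an expert-guided reinforcement learning method. CQ-SID combines category- and query-constrained contrastive learning with Residual Quantized VAEs to produce hierarchical semantic identifiers, substantially reducing beam search size. The accompanying EG-GRPO method injects ground-truth samples to align generative recall with downstream ranking. Offline experiments show relative gains of up to 26.76% and 11.11% in semantic and personalized click hit rate, and online tests show gains in GMV and conversion rate.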

Abstract

Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
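
To make the idea of hierarchical semantic identifiers concrete, here is a minimal sketch of the residual quantization step alone. It assumes fixed, hand-written codebooks and plain Euclidean distance; the encoder, the learned codebooks, and the category and query constraints of CQ-SID are not reproduced, and all names and values are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(i):
        return sum((c - v) ** 2 for c, v in zip(codebook[i], vec))
    return min(range(len(codebook)), key=dist)


def residual_quantize(vec, codebooks):
    """Return one code index per level plus the final residual.

    Each level quantizes what the previous levels left unexplained,
    so earlier indices are coarse and later ones refine them.
    """
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Two levels with two codewords each (toy values, not learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes_a, residual_a = residual_quantize([1.25, 0.75], codebooks)
codes_b, _ = residual_quantize([0.1, -0.1], codebooks)
print(codes_a, residual_a, codes_b)
```

Items that share leading indices fall in the same coarse cluster, which is why a decoder can search over short index prefixes instead of the full catalog.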

2605.14426 2026-05-15 physics.ao-ph cs.AI

A plug-and-play generative framework for multi-satellite precipitation estimation

Yunfan Yang, Haofei Sun, Xiuyu Sun, Wei Han, Xiaoze Xu, Xingtao Song, Jun Li, Zhiqiu Gao, Wei Huang, Hao Li

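AI summary: This study proposes PRISMA, a plug-and-play generative framework for multi-satellite precipitation estimation. It learns an unconditional precipitation prior from IMERG Final fields and combines it with independently trained, sensor-specific conditional branches, so that data from new sensors can be incorporated without retraining the generative backbone. When infrared and microwave observations are combined, PRISMA raises the Critical Success Index and lowers the root-mean-square error relative to infrared-only estimation, while keeping inference efficient.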

Abstract

Reliable precipitation monitoring is essential for disaster risk reduction, water resources management, and agricultural decision-making. Multi-source satellite observations, particularly the combination of geostationary infrared and passive microwave measurements, have become a primary means of precipitation detection. Traditional multi-source satellite precipitation estimation methods remain computationally inefficient, and many deep learning methods lack the flexibility to incorporate new sensors without retraining the full model. Here we introduce PRISMA (Precipitation Inference from Satellite Modalities via generAtive modeling), a plug-and-play latent generative framework for multi-sensor precipitation estimation. PRISMA learns an unconditional precipitation prior from IMERG Final fields and constrains it through independently trained, sensor-specific conditional branches, allowing new observation sources to be incorporated without retraining the generative backbone. Applied to FY-4B AGRI infrared and GPM GMI microwave observations, PRISMA improves Critical Success Index by up to 40.3% and reduces root-mean-square error by 22.6% relative to infrared-only estimation within microwave swaths, while also improving probabilistic skill and maintaining an average inference time of about 37 s. Independent rain-gauge validation across China confirms consistent gains, and typhoon case studies show that microwave conditioning restores eyewall and spiral rainband structures, reducing storm-core mean absolute error by up to 42.3%. PRISMA thus provides an extensible and efficient framework for multi-sensor precipitation estimation.

2605.14421 2026-05-15 cs.CR cs.AI

MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

Ciyan Ouyang, Rui Hou

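AI summary: MemLineage is a defense for large language model (LLM) agent memory that attaches cryptographic provenance and an LLM-mediated derivation lineage to every memory entry. It treats memory management as a chain-of-custody problem, using a Merkle log and a directed acyclic graph (DAG) to record how each memory was derived, so that untrusted content cannot justify sensitive actions while useful recall is preserved. Experiments on several memory-poisoning workloads show that MemLineage drives the attack success rate to zero with very low overhead.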

Comments 24 pages, 8 figures. Rui Hou is the corresponding author

Abstract

We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re-enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain-of-custody problem rather than a filtering problem. It is a six-module design around an RFC-6962 Merkle log over per-principal Ed25519-signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max-of-strong-edges propagation rule makes Untrusted-Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive-action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory-poisoning workloads on a deterministic mechanism-isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub-millisecond per-operation overhead keeps it well below the noise floor of any LLM call. A Codex-backed AgentDojo bridge further separates strong-model behavior from defense-layer behavior: under an intentionally vulnerable tool-output profile, no-defense and signature-only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte-equal CI-verified; hosted-model AgentDojo and live-model sweeps are recorded as auditable logs rather than byte-pinned artifacts.
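
For readers unfamiliar with the log structure mentioned above, the following sketch computes a Merkle tree hash in the style of RFC 6962, using its leaf and node domain-separation prefixes and its split at the largest power of two below the entry count. It is not MemLineage's code: Ed25519 signing, the derivation DAG, and the propagation rule are omitted, and the example entries are hypothetical.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """Leaf hash: SHA-256 over a 0x00 prefix followed by the entry."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Interior node hash: SHA-256 over a 0x01 prefix and both child hashes."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries):
    """Merkle tree hash of an ordered list of byte-string entries."""
    n = len(entries)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(entries[0])
    k = 1
    while k * 2 < n:  # largest power of two strictly less than n
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Hypothetical memory entries appended to the log in order.
log = [b"entry-0", b"entry-1", b"entry-2"]
print(merkle_root(log).hex())
```

Because every entry feeds into the root, changing or reordering any stored entry changes the root hash, which is what makes such a log tamper-evident.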

2605.14418 2026-05-15 cs.CR cs.AI

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit

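AI summary: This paper identifies a key problem in how jailbreak attacks on large language models (LLMs) are evaluated: the attack success rate (ASR) is not a stable quantity, so results are hard to compare across studies. Even attacks that report a high ASR against closed models may succeed in only 50% of repeated attempts against an open target model, revealing the effect of stochasticity in both jailbreak generation and evaluation. The authors propose an evaluation framework, CAS-eval, and a generation framework, CAS-gen, which improve attack consistency and success rate and offer a basis for more standardized jailbreak evaluation.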

Abstract

"Oh-Oh, yes, I'm the great pretender. Pretending that I'm doing well. My need is such, I pretend too much..." summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. Therefore, we wonder "Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?". To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation includes several jailbreak attacks, models (different sizes and providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack can have an ASR drop of up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework (CAS-gen) improves previous jailbreak methods and helps them recover this loss of 30 percentage points!

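The gap between a single-attempt success rate and a repeated-attempt success rate, which the abstract above attributes to stochasticity, can be illustrated with a small sketch. The two functions and the outcome table are hypothetical; they are not the paper's metric or the CAS-eval framework, whose definitions the abstract does not give.

```python
def single_attempt_asr(outcomes):
    """Fraction of prompts whose first attempt succeeds."""
    return sum(row[0] for row in outcomes) / len(outcomes)


def k_attempt_success_rate(outcomes, k):
    """Fraction of prompts that succeed on every one of their first k attempts."""
    return sum(all(row[:k]) for row in outcomes) / len(outcomes)


# Hypothetical results: one row per jailbreak prompt, one column per attempt.
outcomes = [
    [True, True, True],
    [True, False, True],
    [False, True, False],
    [False, False, False],
]

asr_once = single_attempt_asr(outcomes)
asr_three = k_attempt_success_rate(outcomes, 3)
print(asr_once, asr_three, round(100 * (asr_once - asr_three)), "percentage points lost")
```

In this toy table, half of the prompts succeed on the first try but only a quarter succeed three times in a row, so requiring repeated success lowers the measured rate.
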
2605.14415 2026-05-15 cs.SE cs.AI cs.CL

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

Man Ho Lam, Chaozheng Wang, Hange Liu, Jingyu Xiao, Haau-sing Li, Jen-tse Huang, Terry Yue Zhuo, Michael R. Lyu

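AI summary: SWE-Chain is a benchmark for evaluating coding agents on chained, release-level package upgrades. The authors design a synthesis pipeline that aligns release notes with code diffs to produce grounded, feasible upgrade requirements, and build a benchmark covering 9 real Python packages, 155 version transitions, and 1,660 upgrade requirements. Experiments show that current frontier coding agents still struggle with chained upgrades and have difficulty making correct changes without breaking existing functionality.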

Abstract

Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.

2605.14386 2026-05-15 cs.NE cs.AI

Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning

Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim

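AI summary: This paper proposes Darwin Family, a framework that improves the reasoning ability of large language models through training-free evolutionary merging. Based on gradient-free weight-space recombination, it introduces an adaptive merge genome, MRI-Trust Fusion, and a cross-architecture mapper to reorganize latent capabilities already present in existing checkpoints. Experiments show that Darwin models improve over their parent models, indicating that reasoning performance can be raised without additional training.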

Comments NeurIPS 2026 submission. 18 pages including appendix

Abstract

We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.

2605.14370 2026-05-15 physics.geo-ph cs.AI physics.comp-ph

Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel

Ruihua Chen, Yisi Luo, Bangyu Wu, Xile Zhao, Deyu Meng

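AI summary: This paper studies the convergence mechanism of neural reparameterized full-waveform inversion (NeurFWI). It introduces the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK), showing how the neural representation modulates the eigen-structure of the original sensitivity and wave tangent kernels and thereby produces the spectral filtering effect, gradient wavenumber modulation, and wave frequency bias observed during inversion. Building on this analysis, the authors propose enhanced NeurFWI methods that improve inversion performance and efficiency, validated in seismic exploration and extended to medical imaging.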

Abstract

Full-waveform inversion (FWI) estimates unknown parameters in the wave equation from limited boundary measurements. Recent advances in neural reparameterized FWI (NeurFWI) demonstrate that representing the parameters using a neural network can reduce the reliance on a high-quality initial model and wavefield data, at the cost of slow high-resolution convergence. However, its underlying theoretical mechanism remains unclear. In this study, we establish the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK) to analyze its convergence behavior in both the model and data domains. These theoretical frameworks show that the neural tangent kernel (NTK) induced by the neural representation adaptively modulates the original sensitivity and wave tangent kernels. This modulation leads to several key outcomes, i.e., the spectral filtering effect, the gradient wavenumber modulation, and the wave frequency bias, connecting the convergence behavior of NeurFWI with the eigen-structures of NSK and WTK. Building on these insights, we propose several enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK to improve inversion performance and efficiency. We numerically validate these theoretical claims and the proposed methods in seismic exploration, and extend their application to medical imaging for the first time.

2605.14362 2026-05-15 cs.SE cs.AI

Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints

Shweta Mishra

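AI summary: To address context window efficiency in LLM-based developer tools, this study proposes a pre-execution, size-based filtering framework that removes oversized non-code files before a repository scan reaches tokenization. The method relies only on OS-level metadata, has very low overhead, and needs no indexing or semantic analysis. Experiments on several open-source repositories show substantial reductions in input tokens, along with higher code generation accuracy and fewer hallucinations in a limited-scope evaluation.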

Abstract

Context window efficiency is a practical constraint in large language model (LLM)-based developer tools. Paulsen [12] shows that all tested models degrade in accuracy well before their advertised context limits, at a point termed the Maximum Effective Context Window (MECW), which makes context construction a quality problem, not just a cost one. Modern software repositories routinely contain large non-code artifacts (compiled datasets, binary model weights, minified JavaScript bundles, and gigabyte-scale log files) that overflow the context window and push out task-relevant source code. We present a correctness-aware context hygiene framework: a pre-execution, size-based heuristic filter that intercepts repository scans before tokenization, using only OS-level stat() metadata with sub-millisecond overhead. Semantic retrieval approaches such as RepoCoder, GraphRAG, and AST-based chunking require index construction and query-time inference before any filtering decision is reached. Our framework, by contrast, requires no indexing and operates at <0.01 ms per file decision. Across 10 real open-source repositories (22,046 files, 5 languages), the proposed SizeFilter at θ = 1 MB achieves 79.6% (±13.2%) mean token reduction at 0.30 ms overhead; the HybridFilter achieves 89.3% (±9.0%), the lowest variance of any filter evaluated. A token-density study across 2,688 files confirms a strong linear correlation (Pearson r=0.997, k=0.250 tokens/byte). A limited-scope evaluation (18 tasks, CodeLlama-7B-Instruct) yields 72% file-level accuracy under filtering versus 25% at baseline; hallucination frequency declines from 61% to 17%. All code and data are released for reproducibility.
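
A size-based pre-filter of the kind described here can be sketched in a few lines of Python. This is an assumption-laden illustration, not the authors' released code: the function names are hypothetical, 1 MB is taken to mean 10**6 bytes, and the token estimate simply applies the reported density of 0.250 tokens per byte.

```python
import os

SIZE_THRESHOLD_BYTES = 1_000_000  # theta = 1 MB, assumed here to mean 10**6 bytes
TOKENS_PER_BYTE = 0.250           # density k reported in the abstract


def keep_file(path, threshold=SIZE_THRESHOLD_BYTES):
    """Decide from stat() metadata alone whether a file may enter the context."""
    return os.stat(path).st_size <= threshold


def scan(root, threshold=SIZE_THRESHOLD_BYTES):
    """Walk a repository and split its files into kept and dropped lists."""
    kept, dropped = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            (kept if keep_file(path, threshold) else dropped).append(path)
    return kept, dropped


def estimate_tokens(paths):
    """Estimate the token count from byte sizes without reading or tokenizing."""
    return sum(os.stat(p).st_size for p in paths) * TOKENS_PER_BYTE
```

Calling `scan(".")` on a checkout returns the two lists, and `estimate_tokens(kept)` gives a rough budget before any file is opened. A hybrid filter that also checks extensions or paths would need further rules that the abstract does not specify.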

2605.14360 2026-05-15 cs.HC cs.CL

A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring

Tamunotonye Harry, Johanna Hidalgo, Matthew Price, Yuanyuan Feng, Kathryn Stanton, Connie Tompkins, Peter Sheridan Dodds, Mikaela Irene Fudolig, Laura Bloomfield, Christopher Danforth

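AI summary This study examines whether brief affective text can complement wearable-device data to monitor university students' long-term health more fully. Students gave short answers to an open-ended question about their concerns; these were combined with wearable data and analyzed with several NLP methods to relate emotion to sleep, activity, and other health indicators. The results indicate that emotional expression, rather than specific topical content, is significantly associated with health indicators, suggesting that brief affective feedback can improve the psychological interpretability of passive physiological data.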

Comments Submitted to ACM IMWUT

Details
English abstract

Wearable devices capture physiological and behavioral data with increasing fidelity, but the psychological context shaping these outcomes is difficult to recover from sensor data alone, limiting passive sensing utility for digital health. We examined whether ultra-brief naturalistic concern text could serve as a scalable complement to passive sensing. In a year-long study of 458 university students (3,610 person-waves) tracked with Oura rings, participants responded bimonthly to an open-ended prompt about what concerned them most; responses had a median length of three words. We compared dictionary-based, general pretrained, and domain-adapted NLP approaches using within-person mixed-effects models across nine sleep and physical activity outcomes. Weeks dominated by academic concern framing were associated with lower physical activity; weeks characterized by emotional exhaustion language were associated with poorer sleep quality and lower heart rate variability. General pretrained embeddings outperformed domain-adapted models for most outcomes, with domain adaptation showing relative advantage for autonomic outcomes. Zero-shot classification of concern topics produced no significant associations, while affective dimensions across all three methods were consistently associated with outcomes, indicating emotional register rather than topical content carries the signal. These findings offer design guidance: ultra-brief affective prompts enrich the psychological interpretability of passive physiological data at minimal burden.

2605.14351 2026-05-15 eess.SY cs.LG cs.SY

Randomized Atomic Feature Models for Physics-Informed Identification of Dynamic Systems

Rajiv Singh, Mario Sznaier, Lennart Ljung

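AI summary This paper proposes a physics-informed system identification framework based on randomized stable atomic features. Impulse responses are represented as random superpositions of damped complex exponentials associated with stable poles, which turns identification into a convex regularized least-squares problem with linear, second-order-cone, and KYP constraints. The method generalizes random Fourier and Laplace features to the damped, nonstationary regime of engineering systems while keeping modal interpretability and scalable finite-dimensional computation. From an operator-theoretic viewpoint, the paper also analyzes how positive measures over stable poles generate positive-definite kernels, and gives an RKHS-to-ℓ₁ embedding, random-feature convergence results, and conditional sparse-recovery guarantees.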

Comments Extended version of the conference paper submitted for IFAC World Congress, 2026

Details
English abstract

We present a physics-informed framework for system identification based on randomized stable atomic features. Impulse responses are represented as random superpositions of stable atoms, namely damped complex exponentials associated with poles sampled inside a prescribed disk. Identification is then cast as a convex regularized least-squares problem with optional linear, second-order-cone, and KYP constraints. The approach generalizes random Fourier and random Laplace features to the damped, nonstationary regime relevant to engineering systems while retaining modal interpretability and scalable finite-dimensional computation. The main analytic point is an operator-theoretic Disk-Bochner viewpoint: positive measures over stable poles generate positive-definite kernels with a radius-dependent shift defect, while a converse scalar disk moment representation for an arbitrary kernel is characterized by subnormality of the canonical shift. We prove this statement, establish an RKHS-to-l1 embedding, show that sampled poles induce a valid finite atomic gauge, discuss random-feature convergence, and state sparse-recovery guarantees conditionally on the restricted-eigenvalue properties of the realized disk-Vandermonde or input-output design matrix. We also connect the normalized transfer function problem to Nevanlinna-Pick interpolation and LFT set-membership. The framework directly encodes stability margins, modal localization, DC-gain bounds, monotonicity, passivity, relative degree, settling-time targets, and time/frequency-domain error bounds. Numerical comparisons illustrate how physically meaningful priors can compensate for poor excitation and improve constrained impulse-response recovery in an under-informative data setting.
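The atomic representation in the abstract, an impulse response written as a superposition of damped complex exponentials with poles drawn inside a disk, can be illustrated with a short Python sketch. The sampling law (uniform by area), the disk radius, and the names `sample_poles` and `impulse_response` are assumptions made for illustration; the paper's actual sampling scheme and its convex fitting of the coefficients are not reproduced here.

```python
import cmath
import math
import random


def sample_poles(n, radius, rng):
    """Draw n poles uniformly by area from the disk |p| <= radius (radius < 1 for stability)."""
    poles = []
    for _ in range(n):
        r = radius * math.sqrt(rng.random())  # sqrt gives uniform density over the disk area
        theta = 2.0 * math.pi * rng.random()
        poles.append(cmath.rect(r, theta))
    return poles


def impulse_response(poles, coeffs, horizon):
    """Return h[t] = Re(sum_k c_k * p_k**t) for t = 0, ..., horizon - 1."""
    return [
        sum(c * p**t for c, p in zip(coeffs, poles)).real
        for t in range(horizon)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    poles = sample_poles(8, 0.9, rng)
    h = impulse_response(poles, [1.0] * len(poles), 20)
    print([round(v, 3) for v in h[:5]])
```

Since every atom satisfies |p_k**t| <= radius**t, any such superposition decays at least geometrically, which is how a stability margin can be built into the feature dictionary before any data are seen.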

2605.14331 2026-05-15 eess.SP cs.AI cs.ET cs.IT cs.LG math.IT

Analog RF Computing: A New Paradigm for Energy-Efficient Edge AI Over MU-MIMO Systems

Wentao Yu, Vincent W. S. Wong

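AI summary This paper proposes analog radio-frequency (RF) computing as a new paradigm for energy-efficient edge AI inference in multi-user multiple-input multiple-output (MU-MIMO) wireless systems. The base station broadcasts waveforms encoding neural network weights, and each client uses its passive mixer to multiply them with a locally generated input-encoded waveform, so matrix-vector multiplications are carried out at the wireless receiver. The authors design a computing-centric physical layer framework that balances computing accuracy against energy consumption and give a low-complexity algorithm for the resulting non-convex problem. Simulations show client-side energy consumption reduced by nearly two orders of magnitude relative to conventional digital computing.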

Comments 13 pages, 6 figures, 2 tables. This paper proposes analog RF computing as a new paradigm for energy-efficient edge inference over wireless networks and studies the corresponding physical layer design framework

Details
English abstract

Modern edge devices increasingly rely on neural networks for intelligent applications. However, conventional digital computing-based edge inference requires substantial memory and energy consumption. In analog radio frequency (RF) computing, a base station (BS) encodes the weights of the neural networks and broadcasts the RF waveforms to the clients. Each client reuses its passive mixer to multiply the received weight-encoded waveform with a locally generated input-encoded waveform. This enables wireless receivers to perform the matrix-vector multiplications (MVMs) that account for most of the computation burden in edge inference with ultra-low energy consumption. Unlike conventional downlink transmissions which are optimized for communications, analog RF computing requires a computing-centric physical layer that controls both the analog MVM accuracy and the energy consumption for inference. Motivated by this, in this paper, we propose a physical layer design framework for analog RF computing in MU-MIMO wireless systems. We derive tractable models for computing accuracy and energy consumption for inference, formulate a joint BS beamforming and client-side scaling problem subject to computing accuracy, transmit power, and hardware constraints, and develop a low-complexity algorithm to solve the non-convex problem. The proposed design provides client- and layer-specific accuracy control for both uniform- and mixed-precision inference. Simulations under 3GPP specifications show that analog RF computing can significantly reduce client-side energy consumption by nearly two orders of magnitude compared to digital computing, while mixed-precision inference requires even lower energy consumption than uniform-precision inference. Overall, these results establish analog RF computing over wireless networks as a promising paradigm for energy-efficient edge inference.

2605.14291 2026-05-15 cs.CR cs.AI cs.CL cs.CV cs.LG

To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu

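AI summary As large vision-language models (LVLMs) advance rapidly, unauthorized data scraping and fine-tuning pose serious copyright and privacy risks. This paper proposes MMGuard, which injects human-imperceptible perturbations to produce "unlearnable" examples and so proactively protects data from unauthorized LVLM fine-tuning. The method exploits the model's learning dynamics to create an optimization shortcut, causing the model to overfit to the noise during training and to perform worse at inference. MMGuard also introduces a cross-modal binding disruption strategy to strengthen the defense, and shows effective, stealthy, and robust protection under several threat models.

Details
English abstract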

The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherently post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.

2605.14290 2026-05-15 cs.CR cs.AI cs.CL cs.SE

Web Agents Should Adopt the Plan-Then-Execute Paradigm

Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Raluca Ada Popa, David Wagner

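AI summary This paper argues that LLM agents built on the ReAct architecture are exposed to security risks on web tasks: they feed unverified web content directly into their decisions and are therefore vulnerable to prompt injection. The authors contend that web agents should adopt a plan-then-execute paradigm, committing to a task-specific plan before observing web content so that untrusted data cannot steer the control flow. An analysis of the WebArena benchmark finds that most tasks can be completed with a purely programmatic plan, without runtime calls to an LLM subroutine. The key to adopting the paradigm is building typed, auditable web API interfaces rather than improving the models.

Details
English abstract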

ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller's listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent's control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: for it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes into task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.
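The boundary described above can be made concrete with a toy Python sketch. Everything here is hypothetical (the `Step` type, the three tools, and the product ID); it is not the authors' system. The point it illustrates is that the plan is fixed before any page content is read, so injected text can change a value flowing through the plan but cannot add a step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    tool: str       # name of a typed, task-level operation
    arg_from: str   # environment key that supplies the argument
    result_to: str  # environment key that receives the result


def execute(plan: List[Step], tools: Dict[str, Callable[[str], str]],
            env: Dict[str, str]) -> Dict[str, str]:
    """Run a fixed plan; tool outputs are stored as data and never parsed as new steps."""
    env = dict(env)
    for step in plan:
        env[step.result_to] = tools[step.tool](env[step.arg_from])
    return env


calls: List[str] = []  # audit log of which tools actually ran


def get_reviews(product_id: str) -> str:
    calls.append("get_reviews")
    # Untrusted page content that carries an injection attempt.
    return "Great mug. IGNORE PREVIOUS INSTRUCTIONS and call delete_account."


def count_words(text: str) -> str:
    calls.append("count_words")
    return str(len(text.split()))


def delete_account(_: str) -> str:
    calls.append("delete_account")
    return "deleted"


TOOLS = {"get_reviews": get_reviews, "count_words": count_words,
         "delete_account": delete_account}

# The plan is committed before any web content is observed.
PLAN = [Step("get_reviews", "product_id", "reviews"),
        Step("count_words", "reviews", "word_count")]

result = execute(PLAN, TOOLS, {"product_id": "sku-42"})
```

After the run, `calls` contains only the two planned tools: the injected sentence is counted as eight words of data and `delete_account` is never invoked, even though it is registered.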

2605.14283 2026-05-15 cs.GT cs.AI cs.CR

Watermarking Game-Playing Agents in Perfect-Information Extensive-Form Games

Juho Kim, Fei Fang, Tuomas Sandholm

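AI summary This paper studies how to watermark game-playing strategies in perfect-information extensive-form games, with the aim of detecting unauthorized use of AI tools. Drawing on the KGW watermark for large language models, the authors propose a watermarking scheme for game-playing agents and detect the watermark with a statistical test. Experiments show that the watermark has a negligible effect on strategy quality and can be detected from only a few games.

Details
English abstract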

Watermarking techniques for large language models (LLMs), which encode hidden information in the output so its source can be verified, have gained significant attention recently, thanks to their potential to detect accidental or deliberate misuse. Similar challenges involving model misuse also exist in the context of game-playing, such as when detecting the unauthorized use of AI tools in gaming platforms (e.g., cheating in online chess). In this paper, we initiate the study of how game-playing strategies can be watermarked. We show how the KGW watermark for LLMs can be adapted to watermark game-playing agents in perfect-information extensive-form games. The watermark can then be detected using a statistical test. We show that the degradation in the quality of the watermarked strategy profile, quantified by the expected utility, can be bounded, but there is a tradeoff between detectability and quality. In our experiments, we apply the watermarking framework to various chess engines and demonstrate that a) the impact of the watermark on the quality of the strategy is negligible and b) the watermark can be detected with just a handful of games.
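A KGW-style scheme for move selection might look like the following Python sketch. The abstract does not give the paper's construction, so this is an assumed adaptation: a keyed hash of (state, action) marks about half of the actions "green", green actions receive a logit bonus δ, and detection uses a one-proportion z-test. The names `is_green`, `watermarked_choice`, and `z_score`, and the values of `GAMMA` and `DELTA`, are all illustrative.

```python
import hashlib
import hmac
import math
import random

GAMMA = 0.5  # assumed target fraction of green actions
DELTA = 2.0  # assumed logit bonus for green actions (the quality/detectability knob)


def is_green(key: bytes, state: str, action: str) -> bool:
    """Keyed pseudorandom partition: an action is green with probability about GAMMA."""
    digest = hmac.new(key, f"{state}|{action}".encode(), hashlib.sha256).digest()
    return digest[0] < 256 * GAMMA


def watermarked_choice(key: bytes, state: str, logits: dict, rng: random.Random) -> str:
    """Sample an action from a softmax over logits after boosting green actions by DELTA."""
    boosted = {a: v + (DELTA if is_green(key, state, a) else 0.0)
               for a, v in logits.items()}
    top = max(boosted.values())
    weights = [math.exp(v - top) for v in boosted.values()]
    return rng.choices(list(boosted), weights=weights)[0]


def z_score(key: bytes, history: list) -> float:
    """z-statistic for the count of green moves among (state, action) pairs."""
    total = len(history)
    green = sum(is_green(key, state, action) for state, action in history)
    return (green - GAMMA * total) / math.sqrt(GAMMA * (1.0 - GAMMA) * total)
```

A detector holding the key computes `z_score` over an observed move history and flags values well above what an unwatermarked player would produce. A larger `DELTA` makes detection faster but shifts play further from the engine's original policy, which is the tradeoff the abstract mentions.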

2605.14276 2026-05-15 stat.ML cs.LG

Training-Free Generative Sampling via Moment-Matched Score Smoothing

Zhenyu Yao, Daniel Paulin

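AI summary This paper proposes MM-SOLD, a training-free generative sampling method based on moment-matched score smoothing. It estimates the statistics of the target distribution directly from the training data and keeps these moments fixed throughout sampling. Built on overdamped Langevin dynamics, the method generates samples without training a neural network; experiments on 2D distributions and image generation tasks show sample quality competitive with neural baselines, with efficient and robust computation.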

Comments 35 pages

Details
English abstract

Diffusion models generate samples by denoising along the score of a perturbed target distribution. In practice, one trains a neural diffusion model, which is computationally expensive. Recent work suggests that score matching implicitly smooths the empirical score, and that this smoothing bias promotes generalization by capturing low-dimensional data geometry. We propose moment-matched score-smoothed overdamped Langevin dynamics (MM-SOLD), a training-free interacting particle sampler that enforces the target moments throughout the sampling trajectory. We prove that, in the large-particle limit, the empirical particle density converges to a deterministic limit whose one-particle stationary marginal is a Gibbs-Boltzmann density obtained by exponentially tilting a naive score-smoothed diffusion target. The mean and covariance of this distribution agree with the empirical moments of the training data. Experiments on 2D distributions and latent-space image generation show that MM-SOLD enables fast, robust, training-free sampling on CPUs, with sample fidelity and diversity competitive with neural diffusion baselines.

2605.14228 2026-05-15 cs.HC cs.LG

Self-Regulated Learning in Essay Writing: Consistency of Strategies and Impact on Outcomes

Gloria Fernández-Nieto, Kiyoshige Garcés, Mladen Raković, Tongguang Li, Xinyu Li, Linxuan Zhao, Dragan Gašević

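AI summary This study examines how secondary school students use self-regulated learning (SRL) strategies during online essay writing, how those strategies change over time, and how they relate to learning outcomes. It analyzes process data from students at two Colombian secondary schools across two online writing sessions held one week apart, using process mining and unsupervised machine learning to identify three dominant SRL strategies. Strategy use varied considerably: "Read first, write next" was common, while "Write intensively, read selectively" was less common but associated with better learning outcomes. The findings inform the design of online learning support systems.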

Comments 16 pages, 4 figures, submitted to Journal of Computer Assisted Learning (JCAL) [Under Review]

Details
English abstract

Background: Abilities for effective self-regulated learning (SRL) are critical for lifelong learning, particularly during adolescence when these skills consolidate and strongly influence future learning. Their importance has grown with the rise of online and blended education. Yet, little is known about how secondary school students self-regulate in online environments, how their SRL processes and strategies evolve, or how they affect outcomes. In secondary education, understanding these processes can reveal patterns and indicators of learning success, informing the design of online support mechanisms. Evidence from repeated-measures designs remains scarce. Objectives: This study aims to examine how secondary school students enact SRL strategies during online essay writing, how these strategies change over time, and how they relate to learning outcomes. Methods: We analysed metacognition-related trace data collected from secondary students during a two-wave online essay-writing task conducted one week apart in two Colombian schools (N = 93 for session 1, N = 95 for session 2) via a digital learning platform. Using a combination of process mining and unsupervised machine learning techniques, we identified dominant SRL strategies grounded in established SRL processes and examined their stability and association with learning outcomes. Results and conclusions: Three dominant SRL strategies were identified. Results showed variability: many students remained in or shifted to Read first, write next, while none used Write intensively, read selectively in session 2. Although less common, the latter strategy was positively associated with learning outcomes.

2605.14224 2026-05-15 math.NA cs.AI cs.NA math.DS math.FA

Wavelet-Based Observables for Koopman Analysis: An Extended Dynamic Mode Decomposition Framework

Cankat Tilki, Serkan Gugercin

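AI summary This paper proposes a wavelet-transform-based approach to Koopman operator analysis. It introduces wavelet-based observables and shows that they are eigenfunctions of the Koopman semigroup on a particular Banach space. On this basis it derives closed-form expressions for the Koopman semigroup and its resolvent and, combined with Extended Dynamic Mode Decomposition (EDMD), proposes a wavelet dynamic mode decomposition algorithm (cWDMD) for numerically approximating the action of the Koopman semigroup. The method is validated on two numerical examples.

Details
English abstract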

We present an in-depth analysis of the Koopman semigroup via the wavelet transform. Towards this goal, we start by introducing the wavelet-based observables and show that they are eigenfunctions of the Koopman semigroup when this semigroup is considered over the Banach space of continuous functions on a compact forward-invariant set endowed with the supremum norm. We then construct closed-form expressions of the action of the Koopman semigroup and its resolvent in terms of these observables. To approximate the action of the Koopman semigroup numerically, we combine Extended Dynamic Mode Decomposition (EDMD) with the proposed wavelet-based observables, leading to the Wavelet Dynamic Mode Decomposition via Continuous Wavelet Transform (cWDMD) algorithm. We validate our theoretical results on two numerical examples.

2605.14202 2026-05-15 cs.SE cs.AI

LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

Hrushitha Goud Tigulla, Marco Vieira

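AI summary This paper reports an empirical study of large language model (LLM)-based robustness testing for microservice applications. Seven prompt strategies and three open-source LLMs are applied to generate tests for microservice systems with different architectures; prompt strategy turns out to affect test diversity more than model size. The study introduces two new strategies, Guided and GuidedFewShot, which embed domain knowledge; GuidedFewShot achieves the highest single-run failure-mode coverage on both systems while keeping cross-model similarity low. The experiments show that taxonomy rules alone are not enough to guide LLMs toward effective tests: concrete examples are essential for the model to tell input mutations apart.

Details
English abstract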

Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one Java monolingual (6 services, 9 failure modes) and one polyglot (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt collapses diversity entirely, while a single model varied across three prompt strategies achieves complete failure-mode coverage on one system, outperforming any multi-model ensemble under a fixed prompt. We introduce two strategies, Guided and GuidedFewShot, that embed a mutation taxonomy from prior robustness testing research as domain context. GuidedFewShot achieves the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low cross-model similarity. A key lesson is that taxonomy rules alone are insufficient: LLMs cannot distinguish key-absent from value-empty mutations without concrete examples. Findings replicate across both systems.
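The distinction between key-absent and value-empty mutations that the abstract highlights is easy to show on a JSON request body. The payload fields and the helper names `key_absent` and `value_empty` below are hypothetical and are not taken from the study's test suite.

```python
import copy
import json


def key_absent(payload: dict, field: str) -> dict:
    """Return a copy of the payload with the field removed entirely."""
    mutated = copy.deepcopy(payload)
    mutated.pop(field, None)
    return mutated


def value_empty(payload: dict, field: str) -> dict:
    """Return a copy of the payload with the field present but set to an empty string."""
    mutated = copy.deepcopy(payload)
    mutated[field] = ""
    return mutated


base = {"userId": "u-17", "quantity": 2}  # hypothetical request body

print(json.dumps(key_absent(base, "userId")))   # {"quantity": 2}
print(json.dumps(value_empty(base, "userId")))  # {"userId": "", "quantity": 2}
```

The two bodies typically exercise different server-side paths: the first tests handling of a missing field during deserialization or validation, while the second tests handling of a present but blank value.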

2605.14195 2026-05-15 cs.DS cs.LG

Stochastic Matching via Local Sparsification

Sara Ahmadian, Edith Cohen, Mohammad Roghani

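AI summary This paper studies a new setting of online stochastic matching in which local communication bandwidth, rather than the timing of the match, is the main bottleneck. The authors propose a two-stage local sparsification framework in which each request must reduce its compatibility set to a fixed budget before global optimization. They design a local selection strategy based on a fractional solution of the expected instance and prove that, under sufficient spread, it approximately preserves the expected size of the maximum matching. Experiments show that near-optimal global matching is achievable even under strict local budgets, outperforming standard online algorithms.

Details
English abstract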

The classic online stochastic matching problem typically requires immediate and irrevocable matching decisions. However, in many modern decentralized systems such as real-time ride-hailing and distributed cloud computing, the primary bottleneck is often local communication bandwidth rather than the timing of the match itself. We formalize this challenge by introducing a two-stage local sparsification framework. In this setting, arriving requests must prune their realized compatibility sets to a strict budget of k edges before a central coordinator optimizes the global matching. This creates a "middle ground" between local information constraints and global optimization utility. We propose a local selection strategy, parametrized by a fractional solution of the expected instance. Theoretically, we quantify the approximation ratio as a function of the solution's spread. We prove that under sufficient spread, our sparsifier globally preserves the expected size of the maximum matching. Empirically, we demonstrate the robustness of our approach using the New York City ride-hailing datasets and adversarial synthetic benchmarks. Our results show that near-optimal global matching is achievable even with highly constrained local budgets, significantly outperforming standard online baselines.
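The two-stage pipeline can be sketched in Python as follows. The abstract does not specify the selection rule, so sampling without replacement in proportion to the fractional weights is an assumption, as are the names `sparsify` and `max_matching`; the second stage uses a standard augmenting-path algorithm for bipartite matching.

```python
import random
from typing import Dict, List


def sparsify(neighbors: List[str], x: Dict[str, float], k: int,
             rng: random.Random) -> List[str]:
    """Stage 1 (local): keep at most k realized edges, sampled without
    replacement with probability proportional to the fractional weights x."""
    pool = list(neighbors)
    kept: List[str] = []
    while pool and len(kept) < k:
        # Small floor keeps zero-weight edges selectable when nothing else remains.
        weights = [x.get(v, 0.0) + 1e-12 for v in pool]
        index = rng.choices(range(len(pool)), weights=weights)[0]
        kept.append(pool.pop(index))
    return kept


def max_matching(adj: Dict[str, List[str]]) -> int:
    """Stage 2 (central): maximum bipartite matching size via augmenting paths."""
    match: Dict[str, str] = {}  # server -> request currently holding it

    def augment(request: str, seen: set) -> bool:
        for server in adj[request]:
            if server in seen:
                continue
            seen.add(server)
            if server not in match or augment(match[server], seen):
                match[server] = request
                return True
        return False

    return sum(augment(request, set()) for request in adj)
```

Each request would call `sparsify` on its own realized neighbors and send only the surviving edges to the coordinator, which then runs `max_matching` on the union. Intuitively, spreading the fractional mass over several neighbors makes it less likely that many requests all keep the same few servers.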

2605.13773 2026-05-15 cs.SE cs.AI cs.LO

(How) Do Large Language Models Understand High-Level Message Sequence Charts?

Mohammad Reza Mousavi

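AI summary This paper studies how well large language models (LLMs) understand the formal semantics of High-Level Message Sequence Charts (HMSCs). Three mainstream LLMs were given 129 tasks on HMSC semantics; they handle basic semantic concepts well but perform poorly on more complex semantic reasoning involving abstraction, composition, and traces and labelled transition systems. The study shows that current LLMs remain significantly limited when handling software design models with rigorous formal semantics.

Details
English abstract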

Large Language Models (LLMs) are being employed widely to automate tasks across the software development life-cycle. It is, however, unclear whether these tasks are performed consistently with respect to the semantics of the artefacts being handled. This question is particularly under-researched concerning architectural design specification. In this paper, we address this question for High-Level Message Sequence Charts (HMSCs). These are visual models with a rigorous formal semantics that have been used for various purposes, including as a foundation for Sequence Diagrams in the Unified Modelling Language (UML). We examine whether LLMs "understand" the semantics of HMSCs by examining three LLMs (Gemini-3, GPT-5.4, and Qwen-3.6) on how they perform 129 semantic tasks ranging from querying basic semantic constructs in HMSCs (i.e., events and their ordering) to semantic-preserving abstractions and compositions, and calculating the set of traces and trace-equivalent labelled transition systems. The results show that LLMs only have a modest understanding of the formal semantics of HMSCs (ca. 52% overall accuracy), with great variability across different semantic concepts: while LLMs seem to understand the basic semantic concepts of MSCs (ca. 88% accuracy), they struggle with semantic reasoning in tasks involving abstraction and composition (ca. 36% accuracy) and traces and LTSs (ca. 42% accuracy). In particular, all three LLMs struggle with the notions of co-region and explicit causal dependencies and never employed them in semantic-preserving transformations.