arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1851
2403.12193 2026-04-29 cs.RO

Continual Domain Randomization

Josip Josifovski, Sayantan Auddy, Mohammadhossein Malmir, Justus Piater, Alois Knoll, Nicolás Navarro-Guerrero

Comments Accepted at IROS 2024. Equal contribution from first two authors

详情
英文摘要

Domain Randomization (DR) is commonly used for sim2real transfer of reinforcement learning (RL) policies in robotics. Most DR approaches require a simulator with a fixed set of tunable parameters from the start of the training, from which the parameters are randomized simultaneously to train a robust model for use in the real world. However, the combined randomization of many parameters increases the task difficulty and might result in sub-optimal policies. To address this problem and to provide a more flexible training process, we propose Continual Domain Randomization (CDR) for RL that combines domain randomization with continual learning to enable sequential training in simulation on a subset of randomization parameters at a time. Starting from a model trained in a non-randomized simulation where the task is easier to solve, the model is trained on a sequence of randomizations, and continual learning is employed to remember the effects of previous randomizations. Our robotic reaching and grasping tasks experiments show that the model trained in this fashion learns effectively in simulation and performs robustly on the real robot while matching or outperforming baselines that employ combined randomization or sequential randomization without continual learning. Our code and videos are available at https://continual-dr.github.io/.

2402.09034 2026-04-29 cs.LG cs.AI

Contrast-Enhanced Gating in GRUs for Robust Low-Data Sequence Learning

Barathi Subramanian, Rathinaraja Jeyaraj, Anand Paul

Comments 43rd The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026)

详情
英文摘要

Activation functions govern how recurrent networks regulate and transmit information across temporal dependencies. Despite advances in sequence modelling, gated recurrent units (GRUs) still depend on the standard sigmoid and tanh nonlinearities, which can produce weak gate separation and unstable learning, particularly when training data are limited. We introduce squared sigmoid-tanh (SST), a parameter-free activation that squares the gate nonlinearity to increase contrast between near-zero- and high-activations, thereby promoting sharper information filtering during GRU updates. We incorporate SST into GRU gating and evaluate it across low-data settings spanning sign language recognition, human activity recognition, and time-series forecasting and classification. Across tasks, SST-GRU consistently surpasses standard sigmoid/tanh GRU, with the largest improvements observed in the smallest-data domains, while adding negligible computational cost. We further examine gate activation statistics and training dynamics, showing that SST improves training stability, which aligns with its performance gains in data-scarce settings. SST is a parameter-free modification that complements more complex architectural advances by improving gating selectivity in low-data sequence learning.

2309.15039 2026-04-29 cs.LG cs.AI stat.AP

Can-SAVE: Deploying Low-Cost and Population-Scale Cancer Screening via Survival Analysis Variables and EHR

Petr Philonenko, Vladimir Kokh, Pavel Blinov

Comments Accepted to the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
Journal ref
Proc. 32nd ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD '26), 2026, 11 pages
英文摘要

Conventional medical cancer screening methods are costly, labor-intensive, and extremely difficult to scale. Although AI can improve cancer detection, most systems rely on complex or specialized medical data, making them impractical for large-scale screening. We introduce Can-SAVE, a lightweight AI system that ranks population-wide cancer risks solely based on medical history events. By integrating survival model outputs into a gradient-boosting framework, our approach detects subtle, long-term patient risk patterns - often well before clinical symptoms manifest. Can-SAVE was rigorously evaluated on a real-world dataset of 2.5 million adults spanning five Russian regions, marking the study as one of the largest and most comprehensive deployments of AI-driven cancer risk assessment. In a retrospective oncologist-supervised study over 1.9M patients, Can-SAVE achieves a 4-10x higher detection rate at identical screening volumes and an Average Precision (AP) of 0.228 vs. 0.193 for the best baseline (LoRA-tuned Qwen3-Embeddings via DeepSeek-R1 summarization). In a year-long prospective pilot (426K patients), our method almost doubled the cancer detection rate (+91%) and increased population coverage by 36% over the national screening protocol. The system demonstrates practical scalability: a city-wide population of 1 million patients can be processed in under three hours using standard hardware, enabling seamless clinical integration. This work proves that Can-SAVE achieves nationally significant cancer detection improvements while adhering to real-world public healthcare constraints, offering immediate clinical utility and a replicable framework for population-wide screening. Code for training and feature engineering is available at https://github.com/sb-ai-lab/Can-SAVE.

2309.06768 2026-04-29 cs.RO

Hierarchical Time-Optimal Planning for Multi-Vehicle Racing

Georg Jank, Matthias Rowold, Boris Lohmann

Comments 6 pages, accepted to be published as part of the 26th IEEE International Conference on Intelligent Transportation Systems (ITSC 2023), Bilbao, Bizkaia, Spain, September 24-28, 2023

详情
英文摘要

This paper presents a hierarchical planning algorithm for racing with multiple opponents. The two-stage approach consists of a high-level behavioral planning step and a low-level optimization step. By combining discrete and continuous planning methods, our algorithm encourages global time optimality without being limited by coarse discretization. In the behavioral planning step, the fastest behavior is determined with a low-resolution spatio-temporal visibility graph. Based on the selected behavior, we calculate maneuver envelopes that are subsequently applied as constraints in a time-optimal control problem. The performance of our method is comparable to a parallel approach that selects the fastest trajectory from multiple optimizations with different behavior classes. However, our algorithm can be executed on a single core. This significantly reduces computational requirements, especially when multiple opponents are involved. Therefore, the proposed method is an efficient and practical solution for real-time multi-vehicle racing scenarios.

2307.00937 2026-04-29 cs.RO

A Biomimetic Fingerprint for Robotic Tactile Sensing

Oscar Alberto Juiña Quilachamín, Nicolás Navarro-Guerrero

Comments 56th International Symposium on Robotics (ISR Europe) | September 26-27, 2023, Stuttgart, Germany

详情
英文摘要

Tactile sensors have been developed since the early '70s and have greatly improved, but there are still no widely adopted solutions. Various technologies, such as capacitive, piezoelectric, piezoresistive, optical, and magnetic, are used in haptic sensing. However, most sensors are not mechanically robust for many applications and cannot cope well with curved or sizeable surfaces. Aiming to address this problem, we present a 3D-printed fingerprint pattern to enhance the body-borne vibration signal for dynamic tactile feedback. The 3D-printed fingerprint patterns were designed and tested for an RH8D Adult size Robot Hand. The patterns significantly increased the signal's power to over 11 times the baseline. A public haptic dataset including 52 objects of several materials was created using the best fingerprint pattern and material.

2306.06213 2026-04-29 cs.LG math.OC

A Robust Twin Parametric Margin Support Vector Machine for Multiclass Classification

Renato De Leone, Francesca Maggioni, Andrea Spinelli

详情
英文摘要

In this paper, we introduce novel Twin Parametric Margin Support Vector Machine (TPMSVM) models designed to address multiclass classification tasks under feature uncertainty. To handle data perturbations, we construct bounded-by-norm uncertainty set around each training observation and derive the robust counterparts of the deterministic models using robust optimization techniques. To capture complex data structure, we explore both linear and kernel-induced classifiers, providing computationally tractable reformulations of the resulting robust models. Additionally, we propose two alternatives for the final decision function, enhancing models' flexibility. Finally, we validate the effectiveness of the proposed robust multiclass TPMSVM methodology on real-world datasets, showing the good performance of the approach in the presence of uncertainty.

2305.06709 2026-04-29 cs.LG cs.MS stat.ML

NUBO: A Transparent Python Package for Bayesian Optimization

Mike Diessner, Kevin J. Wilson, Richard D. Whalley

详情
Journal ref
Journal of Statistical Software, 114(1), 1-28 (2025)
英文摘要

NUBO, short for Newcastle University Bayesian Optimisation, is a Bayesian optimization framework for the optimization of expensive-to-evaluate black-box functions, such as physical experiments and computer simulators. Bayesian optimization is a costefficient optimization strategy that uses surrogate modelling via Gaussian processes to represent an objective function and acquisition functions to guide the selection of candidate points to approximate the global optimum of the objective function. NUBO itself focuses on transparency and user experience to make Bayesian optimization easily accessible to researchers from all disciplines. Clean and understandable code, precise references, and thorough documentation ensure transparency, while user experience is ensured by a modular and flexible design, easy-to-write syntax, and careful selection of Bayesian optimization algorithms. NUBO allows users to tailor Bayesian optimization to their specific problem by writing the optimization loop themselves using the provided building blocks. It supports sequential single-point, parallel multi-point, and asynchronous optimization of bounded, constrained, and/or mixed (discrete and continuous) parameter input spaces. Only algorithms and methods that are extensively tested and validated to perform well are included in NUBO. This ensures that the package remains compact and does not overwhelm the user with an unnecessarily large number of options. The package is written in Python but does not require expert knowledge of Python to optimize your simulators and experiments. NUBO is distributed as open-source software under the BSD 3-Clause license.

2212.14245 2026-04-29 cs.CV

Practical exposure correction via compensation

Long Ma, Nan An, Jinyuan Liu, Xin Fan, Zhongxuan Luo, Deyu Meng, Risheng Liu

Comments Project Page: https://rsliu.tech/PEC

详情
英文摘要

In computer vision, correcting the exposure level is a fundamental task for enhancing the visual quality of observations with inappropriate lightness. However, existing methodologies tend to be impractical because they lack adaptability to unknown scenes due to restricted modeling patterns and struggle to achieve satisfactory efficiency due to complex computational flows. To tackle these challenges, we establish a new practical exposure corrector (PEC) that excels in both quality and efficiency. Specifically, to overcome the limited expressive power of existing modeling patterns, we build a general model with exposure-sensitive compensation to provide an intuitive modeling perspective. We also design a simple but effective exposure adversarial function to catalyze scene-adaptive compensation. Building on the aforementioned key concepts, we develop a stable and robust iterative shrinkage scheme, avoiding the complex inferences encountered in existing studies. Extensive experimental evaluations across eight challenging datasets showcase the strong adaptability of the developed model to unknown environments. The model offers impressive processing speed, requiring only 0.0009 s to handle a 2K image on a device equipped with a GeForce RTX 2080Ti GPU. Experimental analysis of different downstream vision tasks further verifies the flexibility of the model. The code is available at https://rsliu.tech/PEC.

2207.09845 2026-04-29 cs.RO cs.AI cs.HC cs.LG

Quantifying the Effect of Feedback Frequency in Interactive Reinforcement Learning for Robotic Tasks

Daniel Harnack, Julie Pivin-Bachler, Nicolás Navarro-Guerrero

Comments Neural Computing and Applications (2022). Special Issue on Human-aligned Reinforcement Learning for Autonomous Agents and Robots

详情
英文摘要

Reinforcement learning (RL) has become widely adopted in robot control. Despite many successes, one major persisting problem can be very low data efficiency. One solution is interactive feedback, which has been shown to speed up RL considerably. As a result, there is an abundance of different strategies, which are, however, primarily tested on discrete grid-world and small scale optimal control scenarios. In the literature, there is no consensus about which feedback frequency is optimal or at which time the feedback is most beneficial. To resolve these discrepancies we isolate and quantify the effect of feedback frequency in robotic tasks with continuous state and action spaces. The experiments encompass inverse kinematics learning for robotic manipulator arms of different complexity. We show that seemingly contradictory reported phenomena occur at different complexity levels. Furthermore, our results suggest that no single ideal feedback frequency exists. Rather that feedback frequency should be changed as the agent's proficiency in the task increases.

2206.06282 2026-04-29 cs.RO cs.AI

Analysis of Randomization Effects on Sim2Real Transfer in Reinforcement Learning for Robotic Manipulation Tasks

Josip Josifovski, Mohammadhossein Malmir, Noah Klarmann, Bare Luka Žagar, Nicolás Navarro-Guerrero, Alois Knoll

Comments Accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022

详情
英文摘要

Randomization is currently a widely used approach in Sim2Real transfer for data-driven learning algorithms in robotics. Still, most Sim2Real studies report results for a specific randomization technique and often on a highly customized robotic system, making it difficult to evaluate different randomization approaches systematically. To address this problem, we define an easy-to-reproduce experimental setup for a robotic reach-and-balance manipulator task, which can serve as a benchmark for comparison. We compare four randomization strategies with three randomized parameters both in simulation and on a real robot. Our results show that more randomization helps in Sim2Real transfer, yet it can also harm the ability of the algorithm to find a good policy in simulation. Fully randomized simulations and fine-tuning show differentiated results and translate better to the real robot than the other approaches tested.

2203.11544 2026-04-29 cs.RO cs.AI

Visuo-Haptic Object Perception for Robots: An Overview

Nicolás Navarro-Guerrero, Sibel Toprak, Josip Josifovski, Lorenzo Jamone

Comments published in Autonomous Robots

详情
Journal ref
Autonomous Robots, 27 (2023) https://link.springer.com/article/10.1007/s10514-023-10091-y
英文摘要

The object perception capabilities of humans are impressive, and this becomes even more evident when trying to develop solutions with a similar proficiency in autonomous robots. While there have been notable advancements in the technologies for artificial vision and touch, the effective integration of these two sensory modalities in robotic applications still needs to be improved, and several open challenges exist. Taking inspiration from how humans combine visual and haptic perception to perceive object properties and drive the execution of manual tasks, this article summarises the current state of the art of visuo-haptic object perception in robots. Firstly, the biological basis of human multimodal object perception is outlined. Then, the latest advances in sensing technologies and data collection strategies for robots are discussed. Next, an overview of the main computational techniques is presented, highlighting the main challenges of multimodal machine learning and presenting a few representative articles in the areas of robotic object recognition, peripersonal space representation and manipulation. Finally, informed by the latest advancements and open challenges, this article outlines promising new research directions.

2104.05565 2026-04-29 cs.CL cs.AI cs.LG

Survey on reinforcement learning for language processing

Victor Uc-Cetina, Nicolas Navarro-Guerrero, Anabel Martin-Gonzalez, Cornelius Weber, Stefan Wermter

详情
Journal ref
Artificial Intelligence Review 2022
英文摘要

In recent years some researchers have explored the use of reinforcement learning (RL) algorithms as key components in the solution of various natural language processing tasks. For instance, some of these algorithms leveraging deep neural learning have found their way into conversational systems. This paper reviews the state of the art of RL methods for their possible use for different problems of natural language processing, focusing primarily on conversational systems, mainly due to their growing relevance. We provide detailed descriptions of the problems as well as discussions of why RL is well-suited to solve them. Also, we analyze the advantages and limitations of these methods. Finally, we elaborate on promising research directions in natural language processing that might benefit from reinforcement learning.

2011.02121 2026-04-29 cs.CL

PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents

Ryo Fujii, Masato Mita, Kaori Abe, Kazuaki Hanawa, Makoto Morishita, Jun Suzuki, Kentaro Inui

Comments 15 pages, 4 figures, accepted at COLING 2020

详情
Journal ref
Proceedings of the 28th International Conference on Computational Linguistics, pages 5929-5943, 2020
英文摘要

Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.

2604.25903 2026-04-29 cs.SE cs.LG

Carbon-Taxed Transformers: A Green Compression Pipeline for Overgrown Language Models

Ajmain Inqiad Alam, Palash Roy, Chanchal K. Roy, Banani Roy, Kevin A. Schneider

详情
Journal ref
Proceedings of ACM Software Engineering 3, FSE, Article FSE047, 2026
英文摘要

The accelerating adoption of Large Language Models (LLMs) in software engineering (SE) has brought with it a silent crisis: unsustainable computational cost. While these models demonstrate remarkable capabilities in different SE tasks, they are unmanageably large, slow to deploy, memory-intensive, and carbon-heavy. This reality threatens not only the scalability and accessibility of AI-powered SE, but also its long-term environmental sustainability. The research challenge is clear: we must go beyond accuracy and address efficiency and environmental cost as first-class design constraints. To meet this challenge, we introduce Carbon-Taxed Transformers (CTT), a systematic multi-architectural compression principled pipeline ordering inspired by economic carbon taxation principles. Drawing from the economic concept of carbon pricing, CTT operationalizes a computational carbon tax that penalizes architectural inefficiencies and rewards deployment-ready compression. We evaluate CTT across three core SE tasks: code clone detection, code summarization, and code generation, with models spanning encoder-only, encoder-decoder, and decoder-only architecture. Our results show that CTT delivers on inference: (1) up to 49x memory reduction, (2) time reduction up to 8-10x for clone detection, up to 3x for summarization, and 4-7x for generation, (3) up to 81% reduction in CO2 emissions and (4) CTT retains around 98% accuracy on clone detection, around 89% on summarization, and up to 91% (textual metrics) and 68% (pass@1) for generation. Two ablation studies show that pipeline ordering and individual component contributions are both essential, providing empirical justification for CTT's design and effectiveness. This work establishes a viable path toward responsible AI in SE through aggressive yet performance-preserving compression.

2604.25895 2026-04-29 cs.CY cs.AI cs.CL

Three Models of RLHF Annotation: Extension, Evidence, and Authority

Steve Coyne

Comments 17 pages. Accepted to ACM FAccT '26, June 25-28, Montreal

详情
英文摘要

Preference-based alignment methods, most prominently Reinforcement Learning with Human Feedback (RLHF), use the judgments of human annotators to shape large language model behaviour. However, the normative role of these judgments is rarely made explicit. I distinguish three conceptual models of that role. The first is extension: annotators extend the system designers' own judgments about what outputs should be. The second is evidence: annotators provide independent evidence about some facts, whether moral, social or otherwise. The third is authority: annotators have some independent authority (as representatives of the broader population) to determine system outputs. I argue that these models have implications for how RLHF pipelines should solicit, validate and aggregate annotations. I survey landmark papers in the literature on RLHF and related methods to illustrate how they implicitly draw on these models, describe failure modes that come from unintentionally or intentionally conflating them, and offer normative criteria for choosing among them. My central recommendation is that RLHF pipeline designers should decompose annotation into separable dimensions and tailor each pipeline to the model most appropriate for that dimension, rather than seeking a single unified pipeline.

2604.25885 2026-04-29 hep-ph cs.LG hep-ex

Explainable AI for Jet Tagging: A Comparative Study of GNNExplainer, GNNShap, and GradCAM for Jet Tagging in the Lund Jet Plane

Pahal D. Patel, Sanmay Ganguly

Comments 25 pages, 9 figures. Comments are welcome

详情
英文摘要

Graph neural networks such as ParticleNet and transformer based networks on point clouds such as ParticleTransformer achieve state-of-the-art performance on jet tagging benchmarks at the Large Hadron Collider, yet the physical reasoning behind their predictions remains opaque. We present different methods, i.e. perturbation-based (GNNExplainer), Shapley-value-based (GNNShap), and gradient-based (GRADCam); adapted to operate on LundNet's Lund-plane graph representation. Leveraging the fact that each node in the Lund plane corresponds to a physically meaningful parton splitting, we construct Monte Carlo truth explanation masks and introduce a physics-informed evaluation framework that goes beyond standard fidelity metrics. We perform the analysis in three transverse-momentum bins ($\mathrm{p_T} \in [500,700]$, $[800,1000]$, and the inclusive region $[500,1000]$ GeV), revealing how explanation quality and focus shift between non-perturbative and perturbative regimes. We further quantify the correlation between explainer-assigned node importance and classical jet substructure observables -- $N$-subjettiness ratios $τ_{21}$ and $τ_{32}$ and the energy correlation functions -- establishing the degree to which the model has learned known QCD features. We find that overall the weight assigned by explainability methods has a correlation with analytic observables, with expected shift across different phase space regimes, indicating that a trained neural network indeed learns some aspects of jet-substructure moments. Our open-source implementation enables reproducible explainability studies for graph-based jet taggers.

2604.25884 2026-04-29 quant-ph cs.CV

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov, Daniel C. Cole, Alejandro Gómez Frieiro, Elena O. Glen, Hao Hsu, Gang Huang, Raymond Jow, Greshma Shaji, Tom Lubowe, Ligeng Zhu, Luis Mantilla Calderón, Nicola Pancotti, Joel Pendleton, Brandon Severin, Charles Etienne Staub, Sara Sussman, Antti Vepsäläinen, Neel Rajeshbhai Vora, Yilun Xu, Varinia Bernales, Daniel Bowring, Elica Kyoseva, Ivan Rungger, Giulia Semeghini, Sam Stanwyck, Timothy Costa, Alán Aspuru-Guzik, Krysta Svore

详情
英文摘要

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.

2604.25862 2026-04-29 cs.SE cs.AI

RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

Leon Kogler, Stefan Hangler, Maximilian Ehrhart, Benedikt Dornauer, Roland Wuersching, Peter Schrammel

Comments Accepted for EASE 2026

详情
英文摘要

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. . Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.

2604.25847 2026-04-29 math.OC cs.AI cs.LG

From Soliloquy to Agora: Memory-Enhanced LLM Agents with Decentralized Debate for Optimization Modeling

Jianghao Lin, Zi Ling, Chenyu Zhou, Tianyi Xu, Ruoqing Jiang, Zizhuo Wang, Dongdong Ge

Comments Working Paper

详情
英文摘要

Optimization modeling underpins real-world decision-making in logistics, manufacturing, energy, and public services, but reliably solving such problems from natural-language requirements remains challenging for current large language models (LLMs). In this paper, we propose \emph{Agora-Opt}, a modular agentic framework for optimization modeling that combines decentralized debate with a read-write memory bank. Agora-Opt allows multiple agent teams to independently produce end-to-end solutions and reconcile them through an outcome-grounded debate protocol, while memory stores solver-verified artifacts and past disagreement resolutions to support training-free improvement over time. This design is flexible across both backbones and methods: it reduces base-model lock-in, transfers across different LLM families, and can be layered onto existing pipelines with minimal coupling. Across public benchmarks, Agora-Opt achieves the strongest overall performance among all compared methods, outperforming strong zero-shot LLMs, training-centric approaches, and prior agentic baselines. Further analyses show robust gains across backbone choices and component variants, and demonstrate that decentralized debate offers a structural advantage over centralized selection by enabling agents to refine candidate solutions through interaction and even recover correct formulations when all initial candidates are flawed. These results suggest that reliable optimization modeling benefits from combining collaborative cross-checking with reusable experience, and position Agora-Opt as a practical and extensible foundation for trustworthy optimization modeling assistance. Our code and data are available at https://github.com/CHIANGEL/Agora-Opt.

2604.25846 2026-04-29 cs.CR cs.AI

Towards Agentic Investigation of Security Alerts

Even Eilertsen, Vasileios Mavroeidis, Gudmund Grov

Comments 10 pages, 3 figures, 4 tables. Accepted at the 2025 IEEE International Conference on Big Data (BigData)

详情
Journal ref
Proc. 2025 IEEE Int. Conf. on Big Data (BigData), 2025
英文摘要

Security analysts are overwhelmed by the volume of alerts and the low context provided by many detection systems. Early-stage investigations typically require manual correlation across multiple log sources, a task that is usually time-consuming. In this paper, we present an experimental, agentic workflow that leverages large language models (LLMs) augmented with predefined queries and constrained tool access (structured SQL over Suricata logs and grep-based text search) to automate the first stages of alert investigation. The proposed workflow integrates queries to provide an overview of the available data, and LLM components that selects which queries to use based on the overview results, extracts raw evidence from the query results, and delivers a final verdict of the alert. Our results demonstrate that the LLM-powered workflow can investigate log sources, plan an investigation, and produce a final verdict that has a significantly higher accuracy than a verdict produced by the same LLM without the proposed workflow. By recognizing the inherent limitations of directly applying LLMs to high-volume and unstructured data, we propose combining existing investigation practices of real-world analysts with a structured approach to leverage LLMs as virtual security analysts, thereby assisting and reducing the manual workload.

2604.25799 2026-04-29 cs.AR cs.AI

At the Edge of the Heart: ULP FPGA-Based CNN for On-Device Cardiac Feature Extraction in Smart Health Sensors for Astronauts

Kazi Mohammad Abidur Rahman, Davis Rakhshan, Philipp Lütke, Laura Harms, Ulf Kulau

Comments 9 pages, 7 figures, To be published in: The 22nd Annual International Conference on Distributed Computing in Smart Systems and the Internet of Things (DCOSS-IoT 2026)

详情
英文摘要

The convergence of accelerating human spaceflight ambitions and critical terrestrial health monitoring demands is driving unprecedented requirements for reliable, real-time feature extraction on extremely resource-constrained wearable health sensors. We present an ultra-low-power (ULP) Field-Programmable Gate Array (FPGA) based solution for real-time Seismocardiography (SCG) feature classification using Convolutional Neural Networks (CNNs). Our approach combines quantization-aware training with a systolic-array accelerator to enable efficient integer-only inference on the Lattice iCE40UP5K FPGA, which offers an ideal platform for battery-powered deployments -- particularly in space environments -- thanks to its power efficiency and radiation resilience. The implementation achieves a validation accuracy of 98% while consuming only 8.55 mW, completing inference in 95.5 ms with minimal hardware resources (2,861 LUTs and 7 DSP blocks). These results demonstrate that fully on-device SCG-based cardiac feature extraction is feasible on resource-constrained hardware, enabling energy-efficient, autonomous health monitoring for astronauts in long-duration space missions.

2604.25782 2026-04-29 cs.NI cs.RO

EOS-Bench: A Comprehensive Benchmark for Earth Observation Satellite Scheduling

Qian Yin, Jiaxing Li, Jiaqi Cheng, Qizhang Luo, Annalisa Riccardi, Abhijit Chatterjee, Rafael Vazquez, Carlo Novara, Michalis Mavrovouniotis, Ponnuthurai Nagaratnam Suganthan, Shengzhou Bai, Xiaoxuan Hu, Lining Xing, Ming Xu, Shuang Li, Zixuan Zheng, Xin Shen, Xiaoyu Chen, Yi Gu, Yanjie Song, Witold Pedrycz, Evan L. Kramer, Laio Oriel Seman, Cletah Shoko, Guohua Wu, Xinwei Wang

详情
英文摘要

Earth observation satellite imaging scheduling is a challenging NP-hard combinatorial optimisation problem central to space mission operations. While next-generation agile Earth observation satellites (EOS) increase operational flexibility, they also significantly raise scheduling complexity. The lack of a unified, open-source benchmark makes it difficult to compare algorithms across studies. This paper introduces EOS-Bench, a comprehensive framework for systematic and reproducible evaluation of scheduling methods. By integrating high-fidelity orbital dynamics and platform constraints, EOS-Bench generates 1,390 scenarios and 13,900 benchmark instances, spanning from small-scale validation cases to large coordination problems with up to 1,000 satellites and 10,000 requests. We further propose a scenario characterisation scheme to quantify structural difficulty based on factors such as opportunity density, task flexibility, conflict intensity, and satellite congestion. A multidimensional evaluation protocol is introduced, assessing performance across five metrics: task profit, completion rate, workload balance, timeliness, and runtime. The framework is evaluated using mixed-integer programming, heuristics, meta-heuristics, and deep reinforcement learning across both agile and non-agile settings. Results show that EOS-Bench effectively distinguishes solver performance across scales and conditions, revealing trade-offs between solution quality and computational efficiency, and providing deeper insight into scenario complexity. EOS-Bench offers a unified and extensible open testbed for advancing research in Earth observation satellite scheduling. The code and data are available at https://github.com/Ethan19YQ/EOS-Bench.

2604.25778 2026-04-29 cs.SE cs.AI cs.IR

Can Code Evaluation Metrics Detect Code Plagiarism?

Fahad Ebrahim, Mike Joy

Comments 10 pages, 5 figures, accepted at LEARNER 2026 workshop (associated with EASE 2026)

详情
英文摘要

Source Code Plagiarism Detection (SCPD) plays an important role in maintaining fairness and academic integrity in software engineering education. Code Evaluation Metrics (CEMs) are developed for assessing code generation tasks. However, it remains unclear whether such metrics can reliably detect plagiarism across different levels of modification (L1-L6), increasing in complexity. In this paper, we perform a comparative empirical study using two open-source labelled datasets, ConPlag (raw and template-free versions) and IRPlag. We evaluate five CEMs, namely CodeBLEU, CrystalBLEU, RUBY, Tree Structured Edit Distance (TSED), and CodeBERTScore. The performance is evaluated using threshold-free ranking-based measures to assess overall, per dataset, and per-level plagiarism performance. The results are compared against state-of-the-art (SOTA) Source Code Plagiarism Detection Tools (SCPDTs), JPlag and Dolos. Our findings show that without preprocessing, Dolos achieves the highest overall ranking performance, while among the individual metrics, CrystalBLEU, CodeBLEU, and RUBY outperform JPlag. Performance is strongest at L1 and drops from L4 onward, while CrystalBLEU remains competitive on L6. With preprocessing, CrystalBLEU surpasses Dolos overall. Per dataset, Dolos achieved the best ranking on the ConPlag raw dataset, while CrystalBLEU was the best-performing metric on the remaining datasets. At the plagiarism levels, Dolos remains strongest on L4, while Crystal-BLEU leads most of the remaining difficult levels. These results indicate that CEMs are comparable to dedicated tools in terms of ranking metrics.

2604.25757 2026-04-29 cs.CR cs.AI cs.RO cs.SY eess.SY

Threat-Oriented Digital Twinning for Security Evaluation of Autonomous Platforms

Thomas J. Neubert, Laxima Niure Kandel, Berker Peköz

Comments Camera ready accepted for presentation at and publication in the proceedings of 2026 56st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W): Dependable and Secure Autonomous Systems (DSAS)

详情
英文摘要

Open, unclassified research on secure autonomy is constrained by limited access to operational platforms, contested communications infrastructure, and representative adversarial test conditions. This paper presents a threat-oriented digital twinning methodology for cybersecurity evaluation of learning-enabled autonomous platforms. The approach is instantiated as an open-source, modular twin of a representative autonomy stack with separated sensing, autonomy, and supervisory-control functions; confidence-gated multi-modal perception; explicit command and telemetry trust boundaries; and runtime hold-safe behavior. The contribution is methodological: a reproducible design pattern that translates threat analysis into observable, controllable tests for spoofing, replay, malformed-input injection, degraded sensing, and adversarial ML stress. Although the implemented proxy is ground based, the architecture is intentionally framed around stack elements shared with UAV and space systems, including constrained onboard compute, intermittent or high-latency links, probabilistic perception, and mission-critical recovery behavior. The result is an implementable research scaffold for dependable and secure autonomy studies across UAV and space domains.

2604.25737 2026-04-29 cs.SE cs.AI

SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Noam Tarshish, Nofar Selouk, Daniel Hodisan, Bar Ezra Gafniel, Yuval Elovici, Asaf Shabtai, Eliya Nachmani

Comments Accepted to the EQUISA (Evaluation of Qualitative Aspects of Intelligent Software Assistants) workshop at EASE (Evaluation and Assessment in Software Engineering) 2026

详情
英文摘要

Instructed code editing is a significant challenge for large language models (LLMs). On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility-aware edit plan, an Editor Agent applies minimal, literal code modifications, and a Verifier Agent executes real test runs. When tests fail, SAFEdit uses a Failure Abstraction Layer (FAL) to transform raw test logs into structured diagnostic feedback, which is fed back to the Editor to support iterative refinement. We compare SAFEdit against both prior single-model results reported for EditBench and an implemented ReAct single-agent baseline under the same evaluation conditions. We used EditBench to evaluate SAFEdit on 445 code editing instances in five languages (English, Polish, Spanish, Chinese, and Russian) under varying spatial context variants. SAFEdit achieved 68.6 percent TSR, outperforming the single-model baseline by 3.8 percentage points and the ReAct single-agent baseline by 8.6 percentage points. The iterative refinement loop was found to contribute 17.4 percentage points to SAFEdit's overall success rate. SAFEdit's automated error analysis further indicates a reduction in instruction-level hallucinations compared to single-agent approaches, providing an additional framework component for interpreting failures beyond pass or fail outcomes.

2604.25733 2026-04-29 cs.LO cs.AI cs.FL

Verification of Neural Networks (Lecture Notes)

Benedikt Bollig

Comments 72 pages

详情
英文摘要

These lecture notes provide an introduction to the verification of neural networks from a theoretical perspective. We discuss feed-forward neural networks, recurrent neural networks, attention mechanisms, and transformers, together with specification languages and algorithmic verification techniques.

2604.25710 2026-04-29 stat.AP cs.LG stat.ME stat.ML

Adaptive Meta-Learning Stochastic Gradient Hamiltonian Monte Carlo Simulation for Bayesian Updating of Structural Dynamic Models

Xianghao Meng, James L. Beck, Yong Huang, Hui Li

详情
Journal ref
Comput Meth Appl Mech Eng; 437: 117753 (2025)
英文摘要

In the last few decades, Markov chain Monte Carlo (MCMC) methods have been widely applied to Bayesian updating of structural dynamic models in the field of structural health monitoring. Recently, several MCMC algorithms have been developed that incorporate neural networks to enhance their performance for specific Bayesian model updating problems. However, a common challenge with these approaches lies in the fact that the embedded neural networks often necessitate retraining when faced with new tasks, a process that is time-consuming and significantly undermines the competitiveness of these methods. This paper introduces a newly developed adaptive meta-learning stochastic gradient Hamiltonian Monte Carlo (AM-SGHMC) algorithm. The idea behind AM-SGHMC is to optimize the sampling strategy by training adaptive neural networks, and due to the adaptive design of the network inputs and outputs, the trained sampler can be directly applied to various Bayesian updating problems of the same type of structure without further training, thereby achieving meta-learning. Additionally, practical issues for the feasibility of the AM-SGHMC algorithm for structural dynamic model updating are addressed, and two examples involving Bayesian updating of multi-story building models with different model fidelity are used to demonstrate the effectiveness and generalization ability of the proposed method.

2604.25689 2026-04-29 cs.SE cs.AI

Spreadsheet Modeling Experiments Using GPTs on Small Problem Statements and the Wall Task

Thomas A. Grossman, Yuan Chen, Sopiko Datuashvili

详情
Journal ref
Proceedings of the EuSpRIG 2025 Conference: Spreadsheet Productivity & Risks ISBN : 978-1-905404-60-5
英文摘要

This paper investigates how GPT-based tools can assist in building reusable analytical spreadsheet models. After a screening, we evaluate five GPT extensions and select Excel AI by pulsrai.com for detailed testing. Through structured experiments on simple problem statements, we assess Excel AI's performance against the ERFR criteria (each input in a cell; cell formulas; no hardwired numbers; labels; accurate). Results show that while Excel AI can produce well-structured models, it is inconsistent and often non-reproducible. We identify two central challenges - "the problem of confidence" and "the problem of workflow" - which highlight the need for skilled users to verify and adapt GPT-generated spreadsheets. Though GPTs show promise for generating draft models that may reduce development time or lower skill requirements, current tools remain unreliable for professional use. We conclude with recommendations for future research into prompt engineering, reproducibility, and larger-scale modeling tasks.

2604.25685 2026-04-29 eess.IV cs.CV

Robustness Evaluation of a Foundation Segmentation Model Under Simulated Domain Shifts in Abdominal CT: Implications for Health Digital Twin Deployment

Sanghati Basu

Comments 8 Pages, 5 Tables, 2 Figures

详情
英文摘要

Foundation segmentation models such as the Segment Anything Model (SAM) have demonstrated strong generalization across natural images; however, their robustness under clinically realistic medical imaging domain shifts remains insufficiently quantified. We present a systematic slice-level robustness audit of SAM (ViT-B) for spleen segmentation in abdominal CT using 1,051 nonempty slices from 41 volumes in the Medical Segmentation Decathlon. A standardized ground-truth-derived bounding-box protocol was used to isolate encoder robustness from prompt uncertainty. Controlled perturbations simulating inter-scanner variability, including Gaussian noise, blur, contrast scaling, gamma correction, and resolution mismatch, were applied across ten conditions. The clean baseline achieved a mean Dice score of 0.9145 (95% CI: [0.909, 0.919]) with a failure rate of 0.67%. Across all perturbations, the absolute mean ΔDice remained below 0.01. Paired Wilcoxon signed-rank tests with Benjamini-Hochberg false discovery rate correction identified statistically significant but small-magnitude changes under selected conditions, while McNemar analysis showed no significant increase in failure probability. These findings indicate that SAM exhibits stable segmentation behavior under moderate CT domain shifts, supporting its role as a robust foundation baseline for medical image segmentation research. As health digital twins increasingly incorporate foundation segmentation models for anatomical modeling and organ-level monitoring, formal characterization of robustness under real-world imaging variability is a necessary step toward trustworthy deployment.

2604.25664 2026-04-29 stat.ML cs.LG math.OC

Deflation-Free Optimal Scoring

Sharmin Afroz, Brendan Ames

详情
英文摘要

Sparse Optimal Scoring (SOS) reformulates linear discriminant analysis to enable feature selection through elastic net regularization, making it well-suited for high-dimensional settings where the number of features exceeds observations. Most existing SOS methods use deflation-based strategies that compute discriminant vectors sequentially, which can propagate errors and produce suboptimal solutions. We propose a novel approach that estimates all discriminant vectors simultaneously under an explicit global orthogonality constraint, which we call Deflation-Free Sparse Optimal Scoring (DFSOS). DFSOS combines Bregman iteration with orthogonality-constrained optimization, decomposing the problem into tractable subproblems for scoring vectors, discriminant vectors, and orthogonality enforcement. We establish convergence to stationary points of the augmented Lagrangian under mild conditions. Extensive experiments using synthetic data and real-world time series data demonstrate that DFSOS achieves classification accuracy comparable to or better than existing deflation-based methods. These results indicate that deflation-free approaches offer a robust and effective framework for sparse discriminant analysis in high-dimensional problems.