Full State-Space Visualisation of the 8-Puzzle: Feasibility, Design, and Educational Use
Comments This is a preprint of a paper accepted to IEEE ITET 2026
Ian Frank, Kanata Kawanishi
Search algorithms are a foundational topic in artificial intelligence education, yet even simple domains can generate large state spaces that challenge learners' ability to form accurate mental models. This paper presents an interactive learning system that demonstrates the feasibility of visualising the entire reachable state space of the 8-puzzle (181,440 states), while tightly coupling abstract graph structure with concrete puzzle manipulation. Built using Unity and modern GPU-based rendering techniques, the system enables real-time exploration of global structure, step-by-step execution of search algorithms, and direct comparison of how different strategies traverse the same space. We describe the system's design, visualisation layouts, and educational use, reporting findings from an initial classroom deployment and pilot study with students at different levels of university education. Overall, the results indicate that full state-space visualisation is both technically feasible and educationally valuable for supporting conceptual understanding of search behaviour within this canonical problem domain.
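The state count quoted above can be checked independently. The following is a minimal sketch, not the authors' Unity implementation: it enumerates the reachable 8-puzzle states by breadth-first search from one solved configuration. The goal layout, the tuple encoding (0 for the blank), and the function names are assumptions made for illustration.

```python
from collections import deque

# Assumed encoding: a state is a tuple of 9 ints read row by row, 0 = blank.
GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)

def neighbours(state):
    """Yield every state reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < 3 and 0 <= c < 3:
            target = r * 3 + c
            cells = list(state)
            cells[blank], cells[target] = cells[target], cells[blank]
            yield tuple(cells)

def reachable_states(start):
    """Breadth-first enumeration; returns {state: number of moves from start}."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in neighbours(state):
            if nxt not in depth:
                depth[nxt] = depth[state] + 1
                queue.append(nxt)
    return depth

depths = reachable_states(GOAL)
print(len(depths))  # 181440, i.e. 9!/2
```

Only half of the 9! = 362,880 tile permutations are reachable from any given start, which is where the figure of 181,440 comes from. The `depths` dictionary also records each state's distance from the start, the kind of per-state attribute a layered layout could use.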
Peijie Yu, Wei Liu, Yifan Yang, Jinjian Li, Zelong Zhang, Xiao Feng, Feng Zhang
Comments accepted by ICLR 2026
Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behaviour: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns that require contextual inference, and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among LLMs, users, and tools.
Raymond Chung, Keith Ng, CD Shum
Comments Accepted at 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC)
We propose a personalized chatbot designed for elderly individuals. The chatbot initiates discussions based on family photos, encouraging users to interact naturally. During these interactions, it generates W questions (who, where, when, and what) to stimulate cognitive function, followed by an open-ended question to promote positive reminiscence. This approach is structured around a goal-oriented dialogue framework. Additionally, after each conversation about a photo, the chatbot analyzes the discussion to identify topics that the user favors or dislikes. It then offers the user the option to chat about another photo either featuring the same family members or an individual previously mentioned in the conversation. To support this system, we have developed a web portal that allows caregivers to upload photos and review chat conversations. This personalized chatbot not only encourages elderly users to engage with the chatbot regularly and reduces feelings of loneliness but also provides caregivers with a valuable tool to gain insights into users' well-being.
Yichen Gong, Zhuohan Cai, Sunhao Dai, Yuqi Zhou, Zhangxuan Gu, Changhua Meng, Shuheng Shen
Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online benchmark for evaluating general-purpose mobile GUI agents under realistic, user-centric conditions. VenusBench-Mobile builds two core evaluation pillars: defining what to evaluate via user-intent-driven task design that reflects real mobile usage, and how to evaluate through a capability-oriented annotation scheme for fine-grained agent behavior analysis. Extensive evaluation of state-of-the-art mobile GUI agents reveals large performance gaps relative to prior benchmarks, indicating that VenusBench-Mobile poses substantially more challenging and realistic tasks and that current agents remain far from reliable real-world deployment. Diagnostic analysis further shows that failures are dominated by deficiencies in perception and memory, which are largely obscured by coarse-grained evaluations. Moreover, even the strongest agents exhibit near-zero success under environment variations, highlighting their brittleness in realistic settings. Based on these insights, we believe VenusBench-Mobile provides an important stepping stone toward robust real-world deployment of mobile GUI agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile.
Ashmal Vayani, Parth Parag Kulkarni, Joseph Fioresi, Song Wang, Mubarak Shah
Medical diagnosis using Large Multimodal Models (LMMs) has gained increasing attention due to the capability of these models to provide precise diagnoses. These models generally combine medical questions with visual inputs to generate diagnoses or treatments. However, they are often overly general and unsuited to the wide range of medical conditions in real-world healthcare. In clinical practice, diagnosis is performed by multiple specialists, each contributing domain-specific expertise. To emulate this process, a potential solution is to deploy a dynamic multi-agent LMM framework, where each agent functions as a medical specialist. Current approaches in this emerging area, typically relying on static or predefined selection of various specialists, cannot adapt to changing practical scenarios. In this paper, we propose MedRoute, a flexible and dynamic multi-agent framework that comprises a collaborative system of specialist LMM agents. Furthermore, we add a General Practitioner with an RL-trained router for dynamic specialist selection, and a Moderator that produces the final decision. In this way, our framework closely mirrors real clinical workflows. Extensive evaluations on text and image-based medical datasets demonstrate improved diagnostic accuracy, outperforming the state-of-the-art baselines. Our work lays a strong foundation for future research. Code and models are available at https://github.com/UCF-CRCV/MedRoute/.
Yue Luo, Dibakar Roy Sarkar, Rachel Herring Sangree, Somdatta Goswami
Developing effective, domain-specific educational support systems is central to advancing AI in education. Although large language models (LLMs) demonstrate remarkable capabilities, they face significant limitations in specialized educational applications, including hallucinations, limited knowledge updates, and lack of domain expertise. Fine-tuning requires complete model retraining, creating substantial computational overhead, while general-purpose LLMs often provide inaccurate responses in specialized contexts due to reliance on generalized training data. To address this, we propose ARIA (Adaptive Retrieval Intelligence Assistant), a Retrieval-Augmented Generation (RAG) framework for creating intelligent teaching assistants across university-level courses. ARIA leverages a multimodal content extraction pipeline combining Docling for structured document analysis, Nougat for mathematical formula recognition, and GPT-4 Vision API for diagram interpretation, with the e5-large-v2 embedding model for high semantic performance and low latency. This enables accurate processing of complex educational materials while maintaining pedagogical consistency through engineered prompts and response controls. We evaluate ARIA using lecture material from Statics and Mechanics of Materials, a sophomore-level civil engineering course at Johns Hopkins University, benchmarking against ChatGPT-5. Results demonstrate 97.5% accuracy in domain-specific question filtering and superior pedagogical performance. ARIA correctly answered all 20 relevant course questions while rejecting 58 of 60 non-relevant queries, achieving 90.9% precision, 100% recall, and 4.89/5.0 average response quality. These findings demonstrate that ARIA's course-agnostic architecture represents a scalable framework for domain-specific educational AI deployment.
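The reported filtering figures are mutually consistent under one reading of the abstract, sketched below. The mapping to confusion-matrix cells is an assumption, not something the abstract states: the 20 relevant questions answered are treated as true positives, the 58 rejected non-relevant queries as true negatives, and the 2 non-relevant queries that were not rejected as false positives. The function name is hypothetical.

```python
def filtering_metrics(tp, fn, tn, fp):
    """Standard accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Assumed mapping of the abstract's counts (see the paragraph above).
metrics = filtering_metrics(tp=20, fn=0, tn=58, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under this mapping, accuracy is 78/80 = 97.5%, precision is 20/22, which rounds to 90.9%, and recall is 20/20 = 100%, matching the numbers in the abstract.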
Yuelin Hu, Zhengxue Cheng, Ronghua Wu, Qunshan Gu, Hongwei Hu, Wei Liu, Qiao Liang, Li Song
Comments Accepted by ICASSP 2026
Specialized web tasks in finance, biomedicine, and pharmaceuticals remain challenging due to missing domain priors: queries drift, evidence is noisy, and reasoning is brittle. We present WebExpert, a domain-aware web agent that we implement end-to-end, featuring: (i) sentence-level experience retrieval with topic merging and rule distillation, (ii) schema-light facet induction that bootstraps time, region, policy, and industry facets from weak supervision instead of static hand-written lexicons, and (iii) preference-optimized planning that jointly improves query planning and retrieval via pairwise preference learning alongside a coverage-aware objective. At inference, a lightweight experience gate biases decoding toward active facets with fallback under low-retrieval confidence. On GAIA, GPQA, HLE, and WebWalkerQA, WebExpert improves Answer Exact Match (EM) by 1.5-3.6 pp over the strongest browsing baseline and reduces page hops. Analysis shows consistent gains, supported by ablations on retrieval, topic merging, facet induction, and preference-aware training.
Weishu Chen, Zhouhui Hou, Mingjie Zhan, Zhicheng Zhao, Fei Su
We present an empirical study of embedding-based retrieval under realistic conversational settings, where queries are short, dialogue-like, and weakly specified, and retrieval corpora contain structured conversational artifacts. Focusing on Qwen3-embedding models, we identify a deployment-relevant robustness vulnerability: under conversational retrieval without query prompting, structured dialogue-style noise can become disproportionately retrievable and intrude into top-ranked results, despite being semantically uninformative. This failure mode emerges consistently across model scales, remains largely invisible under standard clean-query benchmarks, and is significantly more pronounced in Qwen3 than in earlier Qwen variants and other widely used dense retrieval baselines. We further show that lightweight query prompting qualitatively alters retrieval behavior, effectively suppressing noise intrusion and restoring ranking stability. Our findings highlight an underexplored robustness risk in conversational retrieval and underscore the importance of evaluation protocols that reflect the complexities of deployed systems.
Yingjun Dai, Ahmed El-Roby
Comments 8 pages
Cold-start cross-domain recommender (CDR) systems predict a user's preferences in a target domain using only their source-domain behavior, yet existing CDR models either map opaque embeddings or rely on post-hoc or LLM-generated rationales that are hard to audit. We introduce EviSnap, a lightweight CDR framework whose predictions are explained by construction with evidence-cited, faithful rationales. EviSnap distills noisy reviews into compact facet cards using an LLM offline, pairing each facet with verbatim supporting sentences. It then induces a shared, domain-agnostic concept bank by clustering facet embeddings and computes user-positive, user-negative, and item-presence concept activations via evidence-weighted pooling. A single linear concept-to-concept map transfers users across domains, and a linear scoring head yields per-concept additive contributions, enabling exact score decompositions and counterfactual 'what-if' edits grounded in the cited sentences. Experiments on the Amazon Reviews dataset across six transfers among Books, Movies, and Music show that EviSnap consistently outperforms strong mapping and review-text baselines while passing deletion- and sufficiency-based tests for explanation faithfulness.
Jayr Pereira, Leandro Fernandes, Erick de Brito, Roberto Lotufo, Luiz Bonifacio
Legal information retrieval in Portuguese remains difficult to evaluate systematically because available datasets differ widely in document type, query style, and relevance definition. We present JUÁ, a public benchmark for Brazilian legal retrieval designed to support more reproducible and comparable evaluation across heterogeneous legal collections. More broadly, JUÁ is intended not only as a benchmark, but as a continuous evaluation infrastructure for Brazilian legal IR, combining shared protocols, common ranking metrics, fixed splits when applicable, and a public leaderboard. The benchmark covers jurisprudence retrieval as well as broader legislative, regulatory, and question-driven legal search. We evaluate lexical, dense, and BM25-based reranking pipelines, including a domain-adapted Qwen embedding model fine-tuned on JUÁ-aligned supervision. Results show that the benchmark is sufficiently heterogeneous to distinguish retrieval paradigms and reveal substantial cross-dataset trade-offs. Domain adaptation yields its clearest gains on the supervision-aligned JUÁ-Juris subset, while BM25 remains highly competitive on other collections, especially in settings with strong lexical and institutional phrasing cues. Overall, JUÁ provides a practical evaluation framework for studying legal retrieval across multiple Brazilian legal domains under a common benchmark design.
Uloma Okoro, Tammy Mackenzie, Branislav Radeljic
This study examines the perception of legal professionals on the governance of AI in developing countries, using Nigeria as a case study. The study focused on ethical risks, regulatory gaps, and institutional readiness. The study adopted a qualitative case study design. Data were collected through 27 semi-structured interviews with legal practitioners in Nigeria. A focus group discussion was also held with seven additional legal practitioners across sectors such as finance, insurance, and corporate law. Thematic analysis was employed to identify key patterns in participant responses. Findings showed that there were concerns about data privacy risks and the lack of enforceable legal frameworks. Participants expressed limited confidence in institutional capacity and emphasized the need for locally adapted governance models rather than direct adoption of foreign frameworks. While some expressed optimism about AI's potential, this was conditional on the presence of strong legal oversight and public accountability. The study contributes to the growing discourse on AI governance in developing countries by focusing on the perspectives of legal professionals. It highlights the importance of regulatory approaches that are context-specific, inclusive, and capable of bridging the gap between global ethical principles and local realities. These insights offer practical guidance for policymakers, regulators, and scholars working to shape responsible AI governance in similar environments.
Tinko Sebastian Bartels, Ruixiang Wu, Xinyu Lu, Yikai Lu, Fanzeng Xia, Haoxiang Yang, Yue Chen, Tongxin Li
Addressing the critical need for intelligent, context-aware energy management in renewable systems, we introduce the OpenCEM Simulator and Dataset: the first open-source digital twin explicitly designed to integrate rich, unstructured contextual information with quantitative renewable energy dynamics. Traditional energy management relies heavily on numerical time series, thereby neglecting the significant predictive power embedded in human-generated context (e.g., event schedules, system logs, user intentions). OpenCEM bridges this gap by offering a unique platform comprising both a meticulously aligned, language-rich dataset from a real-world PV-and-battery microgrid installation and a modular simulator capable of natively processing this multi-modal context. The OpenCEM Simulator provides a high-fidelity environment for developing and validating novel control algorithms and prediction models, particularly those leveraging Large Language Models. We detail its component-based architecture, hybrid data-driven and physics-based modelling capabilities, and demonstrate its utility through practical examples, including context-aware load forecasting and the implementation of online optimal battery charging control strategies. By making this platform publicly available, OpenCEM aims to accelerate research into the next generation of intelligent, sustainable, and truly context-aware energy systems.
Xiaoan Liu, DaeHo Lee, Eric J Gonzalez, Mar Gonzalez-Franco, Ryo Suzuki
Comments 17 pages, 11 figures, plus appendix
We present VisionClaw, an always-on wearable AI agent that integrates live egocentric perception with agentic task execution. Running on Meta Ray-Ban smart glasses, VisionClaw continuously perceives real-world context and enables in-situ, speech-driven action initiation and delegation via OpenClaw AI agents. Therefore, users can directly execute tasks through the smart glasses, such as adding real-world objects to an Amazon cart, generating notes from physical documents, receiving meeting briefings on the go, creating events from posters, or controlling IoT devices. We evaluate VisionClaw through a controlled laboratory study (N=12) and a longitudinal deployment study (N=5). Results show that integrating perception and execution enables faster task completion and reduces interaction overhead compared to non-always-on and non-agent baselines. Beyond performance gains, deployment findings reveal a shift in interaction: tasks are initiated opportunistically during ongoing activities, and execution is increasingly delegated rather than manually controlled. These results suggest a new paradigm for wearable AI agents, where perception and action are continuously coupled to support situated, hands-free interaction.
Andrea DeMarco, Ian Fenech Conti, Hayley Camilleri, Ardiana Bushi, Simone Riggi
Comments 19 pages
Next-generation radio astronomy surveys are delivering millions of resolved sources, but robust and scalable morphology analysis remains difficult across heterogeneous telescopes and imaging pipelines. We present STRADAViT, a self-supervised Vision Transformer continued-pretraining framework for learning transferable encoders from radio astronomy imagery. The framework combines mixed-survey data curation, radio astronomy-aware training-view generation, and a ViT-MAE-initialized encoder family with optional register tokens. It supports reconstruction-only, contrastive-only, and two-stage branches. Our pretraining dataset comprises radio astronomy cutouts drawn from four complementary sources. We evaluate transfer with linear probing and fine-tuning on three morphology benchmarks spanning binary and multi-class settings. Relative to the ViT-MAE initialization used for continued pretraining, the best two-stage models improve Macro-F1 in all reported linear-probe settings and in two of three fine-tuning settings, with the largest gain on RGZ DR1. Relative to DINOv2, gains are selective rather than universal: the best two-stage models achieve higher mean Macro-F1 than the strongest DINOv2 baseline on LoTSS DR2 and RGZ DR1 under linear probing, and on MiraBest and RGZ DR1 under fine-tuning. A targeted DINOv2 initialization ablation further indicates that the adaptation recipe is not specific to the ViT-MAE starting point and that, under the same recipe. The ViT-MAE-based STRADAViT checkpoint is retained as the released checkpoint because it combines competitive transfer with substantially lower token count and downstream cost than the DINOv2-based alternative. These results indicate that radio astronomy-aware view generation and staged continued pretraining can provide a stronger domain-adapted starting point than off-the-shelf ViT checkpoints for radio astronomy transfer.
Chien-Ping Lu
Comments Use: 12 pages, 1 table, 5 figures. arXiv version v4
Classical Amdahl's Law conceptualized the limit of speedup for an era of fixed serial-parallel decomposition and homogeneous replication. Modern heterogeneous systems need a different conceptual framework: constrained resources must be allocated across heterogeneous hardware while workloads themselves change, with some stages becoming effectively bounded and others continuing to absorb additional effective compute. This paper reformulates Amdahl's Law around that shift. We replace processor count with an allocation variable, replace the classical parallel fraction with a value-scalable fraction, and model specialization by a relative efficiency ratio between dedicated and programmable compute. The resulting objective yields a finite collapse threshold. For a specialized efficiency ratio R, there is a critical scalable fraction S_c = 1 - 1/R beyond which the optimal allocation to specialization becomes zero. Equivalently, for a given scalable fraction S, the minimum efficiency ratio required to justify specialization is R_c = 1/(1-S). Thus, as value-scalable workload grows, over-customization faces a rising bar. The point is not that one hardware class simply defeats another, but that architecture must preserve a sufficiently programmable substrate against a moving frontier of work whose marginal gains keep scaling. In practice, that frontier is often sustained by software- and model-driven efficiency doublings rather than by fixed-function redesign alone. The model helps explain the migration of value-producing work toward learned late-stage computation and the shared design pressure that is making both GPUs and AI accelerators more programmable.
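The two threshold expressions quoted above are inverses of one another, which a short numerical sketch makes concrete. Only the formulas S_c = 1 - 1/R and R_c = 1/(1-S) come from the abstract. The function names, the input checks, and the sample values of R are assumptions for illustration, and the paper's underlying allocation objective is not reproduced here.

```python
import math

def critical_scalable_fraction(ratio):
    """S_c = 1 - 1/R for a specialized efficiency ratio R (assumed R > 0)."""
    if ratio <= 0:
        raise ValueError("efficiency ratio R must be positive")
    return 1.0 - 1.0 / ratio

def minimum_efficiency_ratio(scalable_fraction):
    """R_c = 1/(1 - S) for a scalable fraction S (assumed 0 <= S < 1)."""
    if not 0.0 <= scalable_fraction < 1.0:
        raise ValueError("scalable fraction S must lie in [0, 1)")
    return 1.0 / (1.0 - scalable_fraction)

# Hypothetical sample ratios: a larger R tolerates a larger scalable fraction.
for ratio in (2, 4, 10, 100):
    s_c = critical_scalable_fraction(ratio)
    # Round trip: plugging S_c back in recovers R up to floating-point error.
    assert math.isclose(minimum_efficiency_ratio(s_c), ratio)
    print(f"R = {ratio:>3}  ->  S_c = {s_c:.2f}")
```

Read the other way, a workload with S = 0.9 would need a specialized efficiency ratio of at least 10 to justify specialization, and the bar rises without bound as S approaches 1.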
Haiyan Du, Hu Yang
Comments Upon re-examination, we found fundamental flaws in the BAEN-SVM model that undermine our conclusions. The design inadequately addresses geometrical rationality on slack variables, questioning generalizability. Thus, we retract this manuscript. We are exploring a different model and will resubmit after thorough validation. We apologize for any confusion
In this paper, we propose a novel bounded asymmetric elastic net ($L_{baen}$) loss function and combine it with the support vector machine (SVM), resulting in the BAEN-SVM. The $L_{baen}$ is bounded and asymmetric and can degrade to the asymmetric elastic net hinge loss, pinball loss, and asymmetric least squares loss. BAEN-SVM not only effectively handles noise-contaminated data but also addresses the geometric irrationalities in the traditional SVM. By proving the violation tolerance upper bound (VTUB) of BAEN-SVM, we show that the model is geometrically well-defined. Furthermore, we derive that the influence function of BAEN-SVM is bounded, providing a theoretical guarantee of its robustness to noise. The Fisher consistency of the model further ensures its generalization capability. Since the $L_{baen}$ loss is non-convex, we designed a clipping dual coordinate descent-based half-quadratic algorithm to solve the non-convex optimization problem efficiently. Experimental results on artificial and benchmark datasets indicate that the proposed method outperforms classical and advanced SVMs, particularly in noisy environments.
Bowei Zhang, Jin Xiao, Guanglei Yue, Qianyu He, Yanghua Xiao, Deqing Yang, Jiaqing Liang
Comments Accepted to ACL 2026 main conference; code available at <https://github.com/Chang-pw/NoQuote>
Quotation recommendation aims to enrich writing by suggesting quotes that complement a given context, yet existing systems mostly optimize surface-level topical relevance and ignore the deeper semantic and aesthetic properties that make quotations memorable. We start from two empirical observations. First, a systematic user study shows that people consistently prefer quotations that are "unexpected yet rational" in context, identifying novelty as a key desideratum. Second, we find that strong existing models struggle to fully understand the deep meanings of quotations. Inspired by defamiliarization theory, we therefore formalize quote recommendation as choosing contextually novel but semantically coherent quotations. We operationalize this objective with NovelQR, a novelty-driven quotation recommendation framework. A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels, enabling label-enhanced retrieval. A token-level novelty estimator then reranks candidates while mitigating auto-regressive continuation bias. Experiments on bilingual datasets spanning diverse real-world domains show that our system recommends quotations that human judges rate as more appropriate, more novel, and more engaging than other baselines, while matching or surpassing existing methods in novelty estimation.
Markus Gross, Hans-Martin Rieser
Comments 45 pages, 15 figures
Quantum reservoir computers (QRCs) have emerged as a promising approach to quantum machine learning, since they utilize the natural dynamics of quantum systems for data processing and are simple to train. Here, we consider $n$-qubit quantum extreme learning machines (QELMs) with initial-state encoding and continuous-time reservoir dynamics. We apply the Pauli transfer matrix (PTM) formalism to theoretically analyze the influence of encoding, reservoir dynamics, and measurement operations (including temporal multiplexing) on the QELM performance. This formalism reveals the complete set of (nonlinear) features generated by the encoding, and shows how the subsequent quantum channels linearly transform these Pauli features before they are probed by the chosen measurement operators. Optimizing such a QELM can therefore be cast as a decoding problem in which one shapes the channel-induced transformations such that task-relevant features become available to the regressor, effectively reversing the information scrambling of a unitary. Operator spreading under unitary evolution determines decodability of Pauli features, which underlies the nonlinear processing capacity of the reservoir. When paired with certain observables, structured Hamiltonians can reduce model expressivity, as reflected in a low readout rank. We trace this effect to Hamiltonian symmetries and derive asymptotic rank estimates for symmetry-resolved observable families. The PTM formalism yields a nonlinear vector (auto-)regression model as an interpretable classical representation of a QELM. As a specific application, we focus on forecasting nonlinear dynamical systems and show that a QELM trained on such trajectories learns a surrogate-approximation to the underlying flow map.
Paul Tschisgale, Peter Wulff
Comments The Supplementary Information can be found in the OSF repository cited in the Data Availability Statement
Large language models (LLMs) are increasingly used in research as both tools and objects of study. Much of this work assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant, meaning that average output quality remains stable over time; otherwise, reliability and reproducibility would be compromised. To test the assumption of time invariance, we conducted a longitudinal study of GPT-4o's average performance under fixed conditions. The LLM was queried to solve the same physics task ten times every three hours over approximately three months. Spectral (Fourier) analysis of the resulting time series revealed substantial periodic variability, accounting for about 20% of total variance. The observed periodic patterns are consistent with interacting daily and weekly rhythms. These findings challenge the assumption of time invariance and carry important implications for research involving LLMs.
Evangelos Georganas, Alexander Heinecke, Pradeep Dubey
General Matrix Multiplication (GEMM) is the cornerstone of HPC workloads and Deep Learning. State-of-the-art vendor libraries tune tensor layouts, parallelization schemes, and cache blocking to minimize data movement across the memory hierarchy and maximize throughput. Optimal settings for these parameters depend on the target platform and matrix shapes, making exhaustive tuning infeasible. We revisit Space Filling Curves (SFC) to alleviate this cumbersome tuning. We partition the Matrix Multiplication using advancements in SFC, and obtain platform-oblivious and shape-oblivious Matrix Multiplication schemes with a high degree of data locality. We extend the SFC-based work partitioning to implement Communication-Avoiding (CA) algorithms that provably minimize data movement. The integration of CA-algorithms is seamless with compact code, achieving state-of-the-art results on multiple CPU platforms, outperforming vendor libraries by up to 5.5x for a range of GEMM-shapes (1.8x Weighted Harmonic Mean speedup). We show the impact of our work on two real-world applications by leveraging our GEMM as compute backend: i) prefill of LLM inference with speedups up to 1.85x over State-Of-The-Art, and ii) distributed-memory Matrix Multiplication with speedups up to 2.2x.
Zheng Zhang, Hao Tang, Yingying Hu, Zhanli Hu, Jing Qin
Comments Accepted for oral presentation at AAAI 2026
Low-count positron emission tomography (PET) reconstruction is a challenging inverse problem due to severe degradations arising from Poisson noise, photon scarcity, and attenuation correction errors. Existing deep learning methods typically address these in the spatial domain with an undifferentiated optimization objective, making it difficult to disentangle overlapping artifacts and limiting correction effectiveness. In this work, we perform a Fourier-domain analysis and reveal that these degradations are spectrally separable: Poisson noise and photon scarcity cause high-frequency phase perturbations, while attenuation errors suppress low-frequency amplitude components. Leveraging this insight, we propose FourierPET, a Fourier-based unrolled reconstruction framework grounded in the Alternating Direction Method of Multipliers. It consists of three tailored modules: a spectral consistency module that enforces global frequency alignment to maintain data fidelity, an amplitude-phase correction module that decouples and compensates for high-frequency phase distortions and low-frequency amplitude suppression, and a dual adjustment module that accelerates convergence during iterative reconstruction. Extensive experiments demonstrate that FourierPET achieves state-of-the-art performance with significantly fewer parameters, while offering enhanced interpretability through frequency-aware correction.
Md Ajoad Hasan, Dipayan Saha, Khan Thamid Hasan, Nashmin Alam, Azim Uddin, Sujan Kumar Saha, Mark Tehranipoor, Farimah Farahmandi
Comments This paper will be presented at Design, Automation and Test in Europe Conference (DATE) 2026
The growing complexity of modern system-on-chip (SoC) and IP designs is making security assurance increasingly difficult. One of the fundamental steps in the pre-silicon security verification of a hardware design is the identification of security assets, as it substantially influences downstream security verification tasks, such as threat modeling, security property generation, and vulnerability detection. Traditionally, assets are determined manually by security experts, requiring significant time and expertise. To address this challenge, we present LAsset, a novel automated framework that leverages large language models (LLMs) to identify security assets from both hardware design specifications and register-transfer level (RTL) descriptions. The framework performs structural and semantic analysis to identify intra-module primary and secondary assets and derives inter-module relationships to systematically characterize security dependencies at the design level. Experimental results show that the proposed framework achieves high classification accuracy, reaching up to 90% recall on SoC designs and 93% recall on IP designs. This automation in asset identification significantly reduces manual overhead and supports a scalable path forward for secure hardware development.
Hong T. M. Chu
Comments 32 pages, 10 figures
Distributionally Robust (DR) optimization aims to certify worst-case risk within a Wasserstein uncertainty set. Current certifications typically rely either on global Lipschitz bounds, which are often conservative, or on local gradient information, which provides only a first-order approximation. This paper introduces a novel geometric framework based on the least concave majorants of the growth rate functions. Our proposed concave certificate establishes a tight bound on DR risk that remains applicable to non-Lipschitz and non-differentiable losses. We extend this framework to complexity analysis, introducing the worst-case generalization bound that complements the standard statistical generalization bound. Furthermore, we utilize this certificate to bound the gap between adversarial and empirical Rademacher complexity, demonstrating that dependencies on input diameter, network width, and depth can be eliminated. For practical application in deep learning, we introduce the adversarial score as a tractable relaxation of the concave certificate that enables efficient and layer-wise analysis of neural networks. We validate our theoretical results in various numerical experiments on classification and regression tasks using real-world data.
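For context, the global Lipschitz certificate that the abstract describes as often conservative can be written out as follows. This is the standard bound for a type-1 Wasserstein ball, stated here as background rather than as the paper's result; the symbols $\hat{P}_n$ (empirical distribution), $\varepsilon$ (ball radius), $\ell$ (loss), and $L$ (its Lipschitz constant) are notation assumed for this sketch.

```latex
\sup_{Q \,:\, W_1(Q, \hat{P}_n) \le \varepsilon} \mathbb{E}_{Q}[\ell(Z)]
\;\le\; \mathbb{E}_{\hat{P}_n}[\ell(Z)] + L\,\varepsilon
```

The bound charges the worst-case slope $L$ everywhere in the ball, which is why it can be loose when the loss grows steeply only in a small region.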
Gyeo-Re Han, Merve Eryilmaz, Artem Goncharov, Yuzhu Li, Shun Ye, Aoi Tomoeda, Emily Ngo, Margherita Scussat, Xiao Wang, Zixiang Ji, Max Zhang, Jeffrey J. Hsu, Omai B. Garner, Dino Di Carlo, Aydogan Ozcan
Comments 32 Pages, 6 Figures, 2 Tables
Rapid and accessible cardiac biomarker testing is essential for the timely diagnosis and risk assessment of myocardial infarction (MI) and heart failure (HF), two interrelated conditions that frequently coexist and drive recurrent hospitalizations with high mortality. However, current laboratory and point-of-care testing systems are limited by long turnaround times, narrow dynamic ranges for the tested biomarkers, and single-analyte formats that fail to capture the complexity of cardiovascular disease. Here, we present a deep learning-enhanced dual-mode multiplexed vertical flow assay (xVFA) with a portable optical reader and a neural network-based quantification pipeline. This optical sensor integrates colorimetric and chemiluminescent detection within a single paper-based cartridge to complementarily cover a large dynamic range (spanning ~6 orders of magnitude) for both low- and high-abundance biomarkers, while maintaining quantitative accuracy. Using 50 µL of serum, the optical sensor simultaneously quantifies cardiac troponin I (cTnI), creatine kinase-MB (CK-MB), and N-terminal pro-B-type natriuretic peptide (NT-proBNP) within 23 min. The xVFA achieves sub-pg/mL sensitivity for cTnI and sub-ng/mL sensitivity for CK-MB and NT-proBNP, spanning the clinically relevant ranges for these biomarkers. Neural network models trained and blindly tested on 92 patient serum samples yielded robust quantification performance (Pearson's r > 0.96 vs. reference assays). By combining high sensitivity, multiplexing, and automation in a compact and cost-effective optical sensor format, the dual-mode xVFA enables rapid and quantitative cardiovascular diagnostics at the point of care.
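The agreement metric quoted in this abstract, Pearson's r between predicted and reference concentrations, can be computed as in the sketch below. The function name and the sample values are hypothetical and purely illustrative; they are not data from the study, and the sketch makes no claim about how the authors preprocessed concentrations before computing the correlation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and centered squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values standing in for reference-assay and model-predicted
# concentrations (arbitrary units); NOT measurements from the paper.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.9, 5.1]
r = pearson_r(reference, predicted)
```

A value of r close to 1 indicates that predictions track the reference assay linearly; it does not by itself rule out a constant bias, so it is usually read alongside an error measure.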
Yuqun Zhang, Yuxuan Zhao, Sijia Chen
This paper proposes PyFi, a novel framework for pyramid-like financial image understanding that enables vision language models (VLMs) to reason through question chains in a progressive, simple-to-complex manner. At the core of PyFi is PyFi-600K, a dataset comprising 600K financial question-answer pairs organized into a reasoning pyramid: questions at the base require only basic perception, while those toward the apex demand increasing levels of capability in financial visual understanding and expertise. This data is scalable because it is synthesized without human annotations, using PyFi-adv, a multi-agent adversarial mechanism under the Monte Carlo Tree Search (MCTS) paradigm, in which, for each image, a challenger agent competes with a solver agent by generating question chains that progressively probe deeper capability levels in financial visual reasoning. Leveraging this dataset, we present fine-grained, hierarchical, and comprehensive evaluations of advanced VLMs in the financial domain. Moreover, fine-tuning Qwen2.5-VL-3B and Qwen2.5-VL-7B on the pyramid-structured question chains enables these models to answer complex financial questions by decomposing them into sub-questions with gradually increasing reasoning demands, yielding average accuracy improvements of 19.52% and 8.06%, respectively, on the dataset. All resources of code, dataset and models are available at: https://github.com/AgenticFinLab/PyFi .
Mojtaba A. Farahani, Md Irfan Khan, Thorsten Wuest
The convergence of agentic AI and multi-agent systems (MAS) enables a new paradigm for intelligent decision making in smart manufacturing systems (SMS). Traditional MAS architectures emphasize distributed coordination and specialized autonomy, while recent advances in agentic AI driven by large language models (LLMs) introduce higher-order reasoning, planning, and tool orchestration capabilities. This paper presents a hybrid agentic AI and multi-agent framework for a prescriptive maintenance (RxM) use case, where LLM-based agents provide strategic orchestration and adaptive reasoning, complemented by rule-based and small language model (SLM) agents performing efficient, domain-specific tasks on the edge. The proposed framework adopts a layered architecture that consists of perception, preprocessing, analytics, and optimization layers, coordinated through an LLM Planner Agent that manages workflow decisions and context retention. Specialized agents autonomously handle schema discovery, intelligent feature analysis, model selection, and prescriptive optimization, while a human-in-the-loop (HITL) interface ensures transparency and auditability of generated maintenance recommendations. This hybrid design supports dynamic model adaptation, cost-efficient maintenance scheduling, and interpretable decision making. An initial proof-of-concept implementation is validated on two industrial manufacturing datasets. The developed framework is modular and extensible, supporting seamless integration of new agents or domain modules as capabilities evolve. The results demonstrate the system's capability to automatically detect schema, adapt preprocessing pipelines, optimize model performance through adaptive intelligence, and generate actionable, prioritized maintenance recommendations. The framework shows promise in achieving improved robustness, scalability, and explainability for RxM in smart manufacturing, bridging the gap between high-level agentic reasoning and low-level autonomous execution.
Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou, Ozgur B. Akan
Comments This paper has been accepted by the IEEE 22nd International Wireless Communications & Mobile Computing Conference (IWCMC 2026, Shanghai, China)
Internet of Agents (IoA) envisions a unified, agent-centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale. Within this paradigm, federated fine-tuning (FFT) serves as a key enabler that allows distributed LLM agents to co-train an intelligent global LLM without centralizing local datasets. However, the FFT-enabled IoA systems remain vulnerable to model poisoning attacks, where adversaries can upload malicious updates to the server to degrade the performance of the aggregated global LLM. This paper proposes a graph representation-based model poisoning (GRMP) attack, which exploits overheard benign updates to construct a feature correlation graph and employs a variational graph autoencoder to capture structural dependencies and generate malicious updates. A novel attack algorithm is developed based on augmented Lagrangian and subgradient descent methods to optimize malicious updates that preserve benign-like statistics while embedding adversarial objectives. Experimental results show that the proposed GRMP attack can substantially decrease accuracy across different LLM models while remaining statistically consistent with benign updates, thereby evading detection by existing defense mechanisms and underscoring a severe threat to the ambitious IoA paradigm.
Toshio Irino, Ayako Yamamoto, Fuki Miyazaki
Comments This manuscript was submitted to Speech Communication on April 8, 2026
Age-related hearing loss (HL) reduces speech intelligibility (SI) in older adults (OAs). However, deficits in central and cognitive processing also substantially impact SI. Understanding these contributions is essential for explaining individual differences and developing effective assistive hearing strategies. This study presents a framework that distinguishes peripheral HL from central and cognitive influences on SI. This framework uses the Wakayama University Hearing Impairment Simulator (WHIS), and the Gammachirp Envelope Similarity Index (GESI), an objective measure of intelligibility. First, speech-in-noise tests were conducted with young, normal-hearing listeners (YNHs) using WHIS to simulate the audiogram of a target OA. The target OA achieved SI scores comparable to or higher than those of YNHs with simulated HL, suggesting contributions beyond peripheral hearing function. Then, GESI was used to predict SI scores for YNHs and OAs across different hearing levels. The prediction accuracy was comparable for both groups. Interestingly, many OAs' subjective SI scores were higher than those predicted using parameters derived from YNHs' experiments. This finding is inconsistent with previous research indicating that speech perception ability declines with age. This issue will be discussed. There was no significant correlation between the average hearing levels and the residual differences between the subjective and predicted SI scores. This suggests that GESI effectively absorbed the effects of peripheral HL. Thus, the proposed framework may facilitate systematic examination and comparison of central and cognitive factors beyond peripheral HL among individual YNHs and OAs with and without HL.
Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen
Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models like GPT-4.1, with further gains from iterative self-debug, reaching 82.4% overall execution pass rate at the 32B scale, particularly in symbolic or compiler-dependent languages.
Yongji Wu, Xueshen Liu, Haizhong Zheng, Juncheng Gu, Beidi Chen, Z. Morley Mao, Arvind Krishnamurthy, Ion Stoica
Reinforcement learning (RL) has become essential for unlocking advanced reasoning capabilities in large language models (LLMs). RL workflows involve interleaving rollout and training stages with fundamentally different resource requirements. Rollout typically dominates overall execution time, yet scales efficiently through multiple independent instances. In contrast, training requires tightly-coupled GPUs with full-mesh communication. Existing RL frameworks fall into two categories: co-located and disaggregated architectures. Co-located frameworks fail to address this resource tension by forcing both stages to share the same GPUs. Disaggregated architectures, without modifications of well-established RL algorithms, suffer from resource under-utilization. Meanwhile, preemptible GPU resources, i.e., spot instances on public clouds and spare capacity in production clusters, present significant cost-saving opportunities for accelerating RL workflows, if efficiently harvested for rollout. In this paper, we present RLBoost, a framework for cost-efficient RL training that harvests preemptible GPU resources. Our key insight is that rollout's stateless and embarrassingly parallel nature aligns perfectly with preemptible and often fragmented resources. To efficiently utilize these resources despite frequent and unpredictable availability changes, RLBoost adopts a hybrid architecture with three key techniques: (1) adaptive rollout offload to dynamically adjust workloads on the reserved (on-demand) cluster, (2) pull-based weight transfer that quickly provisions newly available instances, and (3) token-level response collection and migration for efficient preemption handling and continuous load balancing. Extensive experiments show RLBoost increases training throughput by 1.51x-1.97x while improving cost efficiency by 28%-49% compared to using only on-demand GPU resources.