arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2941
2507.15085 2026-03-24 cs.CV

OCRGenBench: A Comprehensive Benchmark for Evaluating OCR Generative Capabilities

Peirong Zhang, Haowei Xu, Jiaxin Zhang, Xuhan Zheng, Guitao Xu, Yuyi Zhang, Junle Liu, Zhenhua Yang, Wei Zhou, Lianwen Jin

详情
英文摘要

Improving visual text synthesis has long been a challenging and evolving frontier for image generation models. While recent state-of-the-art (SOTA) models have made remarkable strides in text generation capabilities, existing benchmarks inadequately assess their true performance due to narrow scope (scene text and posters only), isolated evaluation (T2I generation or editing separately), and insufficient difficulty (lacking challenging scenarios). To bridge this gap, we pioneer the unification of text-centric T2I generation, text editing, and OCR-related image-to-image translation to evaluate a model's holistic visual text synthesis abilities, i.e., OCR generative capabilities. Accordingly, we propose OCRGenBench, the most comprehensive benchmark to date for evaluating these abilities. OCRGenBench covers five common text categories and 33 OCR generative tasks, encompassing T2I generation, text editing, and other image-to-image OCR tasks (e.g., document dewarping and handwriting removal). The benchmark includes 1,060 human-annotated samples consisting of instruction-image-GT triplets, deliberately featuring high text density, diverse generation scales, varied aspect ratios, and bilingual content to capture real-world complexity. Furthermore, we introduce OCRGenScore, a unified metric integrating text accuracy, aesthetic quality, and instruction following. Extensive experiments on 19 cutting-edge generative models reveal that most score below 60/100. Our analysis exposes critical, previously overlooked limitations, including poor text localization, unintended content modifications, and failures with dense or small-scale text. We hope OCRGenBench establishes a robust standard to evaluate OCR generative capabilities, driving the evolution of reliable visual text synthesis. The benchmark and evaluation code are available at https://github.com/NiceRingNode/Awesome-Generative-Models-for-OCR.

2507.11474 2026-03-24 cs.CV

HUG-VAS: A Hierarchical NURBS-Based Generative Model for Aortic Geometry Synthesis and Controllable Editing

Pan Du, Mingqi Xu, Xiaozhi Zhu, Jian-xun Wang

Comments 64 pages, 9 figures, 6 supplementary figures

详情
英文摘要

Accurate, patient-specific vascular geometry is pivotal for diagnosis, planning, and device design, yet existing statistical shape modeling (SSM) pipelines rely on linear priors and topology-specific preprocessing that limit realism, scalability, and interoperability. We present HUG-VAS, a Hierarchical NURBS Generative framework for Vascular models, that unifies NURBS-based 3D shape encoding with diffusion-based generative modeling to synthesize fine-grained, CFD-ready aortic anatomies. HUG-VAS factorizes shape into (i) vessel centerlines generated by a denoising diffusion model and (ii) cross-sectional radius profiles synthesized by a classifier-free guided diffusion model conditioned on the centerline, thereby decoupling and preserving stochastic variability across these two anatomical layers. Beyond unconditional synthesis, we enable training-free, zero-shot conditional generation via diffusion posterior sampling from image-derived prompts (e.g., sparse 3D points, slice contours, or partial surface patches), supporting interactive semi-automatic segmentation, editing and robust reconstruction under degraded imaging. Trained on 21 patient-specific MRA cases, HUG-VAS generates multi-branch aortas with supra-aortic vessels whose biomarker distributions closely match the source cohort, and whose watertight NURBS outputs directly integrate with downstream CFD solvers. To our knowledge, this is the first SSM framework that bridges image-derived priors and generative shape synthesis through a unified combination of NURBS parameterization, hierarchical diffusion, and DPS, enabling a practical path from limited clinical anatomic information to simulation-ready vascular geometry.

2506.19609 2026-03-24 cs.LG nlin.CD physics.comp-ph

Beyond Static Models: Hypernetworks for Adaptive and Generalizable Forecasting in Complex Parametric Dynamical Systems

Pantelis R. Vlachas, Konstantinos Vlachas, Eleni Chatzi

详情
英文摘要

Dynamical systems play a key role in modeling, forecasting, and decision-making across a wide range of scientific domains. However, variations in system parameters, also referred to as parametric variability, can lead to drastically different model behavior and output, posing challenges for constructing models that generalize across parameter regimes. In this work, we introduce the Parametric Hypernetwork for Learning Interpolated Networks (PHLieNet), a framework that simultaneously learns: (a) a global mapping from the parameter space to a nonlinear embedding and (b) a mapping from the inferred embedding to the weights of a dynamics propagation network. The learned embedding serves as a latent representation that modulates a base network, termed the hypernetwork, enabling it to generate the weights of a target network responsible for forecasting the system's state evolution conditioned on the previous time history. By interpolating in the space of models rather than observations, PHLieNet facilitates smooth transitions across parameterized system behaviors, enabling a unified model that captures the dynamic behavior across a broad range of system parameterizations. The performance of the proposed technique is validated in a series of dynamical systems with respect to its ability to extrapolate in time and interpolate and extrapolate in the parameter space, i.e., generalize to dynamics that were unseen during training. Our approach outperforms state-of-the-art baselines in both short-term forecast accuracy and in capturing long-term dynamical features such as attractor statistics.

2506.11128 2026-03-24 cs.CL cs.AI

Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning

Andrew Keenan Richardson, Ryan Othniel Kearns, Sean Moss, Vincent Wang-Mascianica, Philipp Koralus

详情
英文摘要

We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open-source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR-predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model's incorrect answers are ETR-predicted fallacies $(ρ=0.360, p=0.0265)$, while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects. Methodologically, PyETR provides an open-source pipeline for unbounded, synthetic, contamination-resistant reasoning tests linked to a cognitive theory, enabling analyses that focus on error composition rather than error rate.

2506.10634 2026-03-24 cs.CV cs.AI

Symmetrical Flow Matching: Unified Image Generation, Segmentation, and Classification with Score-Based Generative Models

Francisco Caetano, Christiaan Viviers, Peter H. N. De With, Fons van der Sommen

Comments AAAI 2026

详情
英文摘要

Flow Matching has emerged as a powerful framework for learning continuous transformations between distributions, enabling high-fidelity generative modeling. This work introduces Symmetrical Flow Matching (SymmFlow), a new formulation that unifies semantic segmentation, classification, and image generation within a single model. Using a symmetric learning objective, SymmFlow models forward and reverse transformations jointly, ensuring bi-directional consistency, while preserving sufficient entropy for generative diversity. A new training objective is introduced to explicitly retain semantic information across flows, featuring efficient sampling while preserving semantic structure, allowing for one-step segmentation and classification without iterative refinement. Unlike previous approaches that impose strict one-to-one mapping between masks and images, SymmFlow generalizes to flexible conditioning, supporting both pixel-level and image-level class labels. Experimental results on various benchmarks demonstrate that SymmFlow achieves state-of-the-art performance on semantic image synthesis, obtaining FID scores of 11.9 on CelebAMask-HQ and 7.0 on COCO-Stuff with only 25 inference steps. Additionally, it delivers competitive results on semantic segmentation and shows promising capabilities in classification tasks.

2505.10347 2026-03-24 cs.LG cs.AI

Uniform Loss vs. Specialized Optimization: A Comparative Analysis in Multi-Task Learning

Gabriel S. Gama, Valdir Grassi

详情
英文摘要

Specialized Multi-Task Optimizers (SMTOs) balance task learning in Multi-Task Learning by addressing issues like conflicting gradients and differing gradient norms, which hinder equal-weighted task training. However, recent critiques suggest that equally weighted tasks can achieve competitive results compared to SMTOs, arguing that previous SMTO results were influenced by poor hyperparameter optimization and lack of regularization. In this work, we evaluate these claims through an extensive empirical evaluation of SMTOs, including some of the latest methods, on more complex multi-task problems to clarify this behavior. Our findings indicate that SMTOs perform well compared to uniform loss and that fixed weights can achieve competitive performance compared to SMTOs. Furthermore, we demonstrate why uniform loss perform similarly to SMTOs in some instances. The source code is available at https://github.com/Gabriel-SGama/UnitScal_vs_SMTOs.

2503.19740 2026-03-24 cs.CV

LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings

Chengan Che, Chao Wang, Tom Vercauteren, Sophia Tsoka, Luis C. Garcia-Peraza-Herrera

Comments Accepted at CVPR2026 main conference

详情
英文摘要

Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp in Jaccard on AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp in mAP on CholecT50), surgical tool presence detection (+5.3pp and +10.2pp in mAP on Cholec80 and GraSP), and surgical semantic segmentation (+10.3pp in mDice on CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide. Dataset, code, and models are publicly available at https://github.com/visurg-ai/LEMON.

2503.05721 2026-03-24 cs.CL

What Are They Filtering Out? An Experimental Benchmark of Filtering Strategies for Harm Reduction in Pretraining Datasets

Marco Antonio Stranisci, Christian Hardmeier

详情
英文摘要

Data filtering strategies are a crucial component to develop safe Large Language Models (LLM), since they support the removal of harmful contents from pretraining datasets. There is a lack of research on the actual impact of these strategies on vulnerable groups to discrimination, though, and their effectiveness has not been yet systematically addressed. In this paper we present a benchmark study of data filtering strategies for harm reduction aimed at providing a systematic evaluation on these approaches. We provide an overview $55$ technical reports of English LMs and LLMs to identify the existing filtering strategies in literature and implement an experimental setting to test their impact against vulnerable groups. Our results show that the positive impact that strategies have in reducing harmful contents from documents has the side effect of increasing the underrepresentation of vulnerable groups to discrimination in datasets. WARNING: the paper could contain racist, sexist, violent, and generally offensive contents

2501.16562 2026-03-24 cs.LG stat.ME

C-HDNet: Hyperdimensional Computing for Causal Effect Estimation from Observational Data Under Network Interference

Abhishek Dalvi, Neil Ashtekar, Vasant Honavar

Comments Published at Social Network Analysis and Mining

详情
英文摘要

We address the problem of estimating causal effects from observational data in the presence of network confounding, a setting where both treatment assignment and observed outcomes of individuals may be influenced by their neighbors within a network structure, resulting in network interference. Traditional causal inference methods often fail to account for these dependencies, leading to biased estimates. To tackle this challenge, we introduce a novel matching-based approach that utilizes principles from hyperdimensional computing to effectively encode and incorporate structural network information. This enables more accurate identification of comparable individuals, thereby improving the reliability of causal effect estimates. Through extensive empirical evaluation on multiple benchmark datasets, we demonstrate that our method either outperforms or performs on par with existing state-of-the-art approaches, including several recent deep learning-based models that are significantly more computationally intensive. In addition to its strong empirical performance, our method offers substantial practical advantages, achieving nearly an order-of-magnitude reduction in runtime without compromising accuracy, making it particularly well-suited for large-scale or time-sensitive application

2410.14038 2026-03-24 cs.LG

Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning

Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, Murilo L. da Luz, Telma W. de L. Soares, Luckeciano C. Melo

Comments Accepted at ICML 2025

详情
Journal ref
Proceedings of Machine Learning Research 267:12689-12717, 2025
英文摘要

Effective visual representation learning is crucial for reinforcement learning (RL) agents to extract task-relevant information from raw sensory inputs and generalize across diverse environments. However, existing RL benchmarks lack the ability to systematically evaluate representation learning capabilities in isolation from other learning challenges. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that transforms the classic 8-tile puzzle into a visual RL task with images drawn from arbitrarily large datasets. SPGym's key innovation lies in its ability to precisely control representation learning complexity through adjustable grid sizes and image pools, while maintaining fixed environment dynamics, observation, and action spaces. This design enables researchers to isolate and scale the visual representation challenge independently of other learning components. Through extensive experiments with model-free and model-based RL algorithms, we uncover fundamental limitations in current methods' ability to handle visual diversity. As we increase the pool of possible images, all algorithms exhibit in- and out-of-distribution performance degradation, with sophisticated representation learning techniques often underperforming simpler approaches like data augmentation. These findings highlight critical gaps in visual representation learning for RL and establish SPGym as a valuable tool for driving progress in robust, generalizable decision-making systems.

2405.14108 2026-03-24 cs.LG cs.AI q-bio.BM q-bio.QM

Assessing the potential of deep learning for protein-ligand docking

Alex Morehead, Nabin Giri, Jian Liu, Pawan Neupane, Jianlin Cheng

Comments 55 pages, 2 tables, 37 figures. Results updated (in v1.1.0) after addressing primary-ligand scoring bug (in v1.0.0). Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench

详情
英文摘要

The effects of ligand binding on protein structures and their in vivo functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, to date no prior works have systematically studied the behavior of the latest docking and structure prediction methods within the broadly applicable context of (1) using predicted (apo) protein structures for docking (e.g., for applicability to new proteins); (2) binding multiple (cofactor) ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for generalization to unknown pockets). To enable a deeper understanding of docking methods' real-world utility, we introduce PoseBench, the first comprehensive benchmark for broadly applicable protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL methods for apo-to-holo protein-ligand docking and protein-ligand structure prediction using both primary ligand and multi-ligand benchmark datasets, the latter of which we introduce for the first time to the DL community. Empirically, using PoseBench, we find that (1) DL co-folding methods generally outperform comparable conventional and DL docking baseline algorithms, yet popular methods such as AlphaFold 3 are still challenged by prediction targets with novel binding poses; (2) certain DL co-folding methods are highly sensitive to their input multiple sequence alignments, while others are not; and (3) DL methods struggle to strike a balance between structural accuracy and chemical specificity when predicting novel or multi-ligand protein targets. Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench.

2404.12339 2026-03-24 cs.RO cs.CV

SPOT: Point Cloud Based Stereo Visual Place Recognition for Similar and Opposing Viewpoints

Spencer Carmichael, Rahul Agrawal, Ram Vasudevan, Katherine A. Skinner

Comments Expanded version with added appendix. Published in ICRA 2024. Project page: https://umautobots.github.io/spot

详情
英文摘要

Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited field of view cameras under 180 degree rotations, has proven to be challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state-of-the-art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.

2402.00261 2026-03-24 cs.CV cs.LG

Understanding Neural Network Systems for Image Analysis using Vector Spaces and Inverse Maps

Rebecca Pattichis, Marios S. Pattichis

Comments Accepted to IEEE's Southwest Symposium on Image Analysis and Interpretation 2024

详情
英文摘要

There is strong interest in developing mathematical methods that can be used to understand complex neural networks used in image analysis. In this paper, we introduce techniques from Linear Algebra to model neural network layers as maps between signal spaces. First, we demonstrate how signal spaces can be used to visualize weight spaces and convolutional layer kernels. We also demonstrate how residual vector spaces can be used to further visualize information lost at each layer. Second, we study invertible networks using vector spaces for computing input images that yield specific outputs. We demonstrate our approach on two invertible networks and ResNet18.

2312.04140 2026-03-24 cs.CV eess.IV

Polarimetric Light Transport Analysis for Specular Inter-reflection

Ryota Maeda, Shinsaku Hiura

Comments Accepted to IEEE Transactions on Computational Imaging (TCI)

详情
英文摘要

Polarization is well known for its ability to decompose diffuse and specular reflections. However, the existing decomposition methods only focus on direct reflection and overlook multiple reflections, especially specular inter-reflection. In this paper, we propose a novel decomposition method for handling specular inter-reflection of metal objects by using a unique polarimetric feature: the rotation direction of linear polarization. This rotation direction serves as a discriminative factor between direct and inter-reflection on specular surfaces. To decompose the reflectance components, we actively rotate the linear polarization of incident light and analyze the rotation direction of the reflected light. We evaluate our method using both synthetic and real data, demonstrating its effectiveness in decomposing specular inter-reflections of metal objects. Furthermore, we demonstrate that our method can be combined with other decomposition methods for a detailed analysis of light transport. As a practical application, we show its effectiveness in improving the accuracy of 3D measurement against strong specular inter-reflection.

2303.10371 2026-03-24 cs.LG

Geometric Imbalance in Semi-Supervised Node Classification

Liang Yan, Shengzhong Zhang, Bisheng Li, Menglin Yang, Chen Yang, Min Zhou, Weiyang Ding, Yutong Xie, Zengfeng Huang

Comments Accepted by NeurIPS 2025

详情
Journal ref
Proceedings of the Thirty-ninth Conference on Neural Information Processing Systems (NeurIPS 2025)
英文摘要

Class imbalance in graph data presents a significant challenge for effective node classification, particularly in semi-supervised scenarios. In this work, we formally introduce the concept of geometric imbalance, which captures how message passing on class-imbalanced graphs leads to geometric ambiguity among minority-class nodes in the riemannian manifold embedding space. We provide a rigorous theoretical analysis of geometric imbalance on the riemannian manifold and propose a unified framework that explicitly mitigates it through pseudo-label alignment, node reordering, and ambiguity filtering. Extensive experiments on diverse benchmarks show that our approach consistently outperforms existing methods, especially under severe class imbalance. Our findings offer new theoretical insights and practical tools for robust semi-supervised node classification.

2603.22254 2026-03-24 cond-mat.mtrl-sci cond-mat.mes-hall cs.LG physics.atm-clus physics.chem-ph

Characterizing High-Capacity Janus Aminobenzene-Graphene Anode for Sodium-Ion Batteries with Machine Learning

Claudia Islas-Vargas, L. Ricardo Montoya, Carlos A. Vital-José, Oliver T. Unke, Klaus-Robert Müller, Huziel E. Sauceda

Comments 8 pages, 5 figures, research article

详情
英文摘要

Sodium-ion batteries require anodes that combine high capacity, low operating voltage, fast Na-ion transport, and mechanical stability, which conventional anodes struggle to deliver. Here, we use the SpookyNet machine-learning force field (MLFF) together with all-electron density-functional theory calculations to characterize Na storage in aminobenzene-functionalized Janus graphene (Na$_x$AB) at room-temperature. Simulations across state of charge reveal a three-stage storage mechanism-site-specific adsorption at aminobenzene groups and Na$_n$@AB$_m$ structure formation, followed by interlayer gallery filling-contrasting the multi-stage pore-, graphite-interlayer-, and defect-controlled behavior in hard carbon. This leads to an OCV profile with an extended low-voltage plateau of 0.15 V vs. Na/Na$^{+}$, an estimated gravimetric capacity of $\sim$400 mAh g$^{-1}$, negligible volume change, and Na diffusivities of $\sim10^{-6}$ cm$^{2}$ s$^{-1}$, two to three orders of magnitude higher than in hard carbon. Our results establish Janus aminobenzene-graphene as a promising, structurally defined high-capacity Na-ion anode and illustrate the power of MLFF-based simulations for characterizing electrode materials.

2603.22252 2026-03-24 eess.AS cs.SD

SelfTTS: cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation

Lucas H. Ueda, João G. T. Lima, Pedro R. Corrêa, Flávio O. Simões, Mário U. Neto, Paula D. P. Costa

Comments Submitted to Interspeech 2026

详情
英文摘要

This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglement strategy utilizing Gradient Reversal Layers (GRL) combined with cosine similarity loss to decouple speaker and emotion information. We introduce Multi Positive Contrastive Learning (MPCL) to induce clustered representations of speaker and emotion embeddings based on their respective labels. Furthermore, SelfTTS employs a self-refinement strategy via Self-Augmentation, exploiting the model's voice conversion capabilities to enhance the naturalness of synthesized speech. Experimental results demonstrate that SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.

2603.22231 2026-03-24 cs.IR cs.AI cs.GT cs.LG

One Model, Two Markets: Bid-Aware Generative Recommendation

Yanchen Jiang, Zhe Feng, Christopher P. Mah, Aranyak Mehta, Di Wang

详情
英文摘要

Generative Recommender Systems using semantic ids, such as TIGER (Rajput et al., 2023), have emerged as a widely adopted competitive paradigm in sequential recommendation. However, existing architectures are designed solely for semantic retrieval and do not address concerns such as monetization via ad revenue and incorporation of bids for commercial retrieval. We propose GEM-Rec, a unified framework that integrates commercial relevance and monetization objectives directly into the generative sequence. We introduce control tokens to decouple the decision of whether to show an ad from which item to show. This allows the model to learn valid placement patterns directly from interaction logs, which inherently reflect past successful ad placements. Complementing this, we devise a Bid-Aware Decoding mechanism that handles real-time pricing, injecting bids directly into the inference process to steer the generation toward high-value items. We prove that this approach guarantees allocation monotonicity, ensuring that higher bids weakly increase an ad's likelihood of being shown without requiring model retraining. Experiments demonstrate that GEM-Rec allows platforms to dynamically optimize for semantic relevance and platform revenue.

2603.22227 2026-03-24 cs.HC cs.AI cs.CL

Dyadic: A Scalable Platform for Human-Human and Human-AI Conversation Research

David M. Markowitz

详情
英文摘要

Conversation is ubiquitous in social life, but the empirical study of this interactive process has been thwarted by tools that are insufficiently modular and unadaptive to researcher needs. To relieve many constraints in conversation research, the current tutorial presents an overview and introduction to a new tool, Dyadic (https://www.chatdyadic.com/), a web-based platform for studying human-human and human-AI conversations using text-based or voice-based chats. Dyadic is distinct from other platforms by offering studies with multiple modalities, AI suggestions (e.g., in human-human studies, AI can suggest responses to a participant), live monitoring (e.g., researchers can evaluate, in real time, chats between communicators), and survey deployment (e.g., Likert-type scales, feeling thermometers, and open-ended text boxes can be sent to humans for in situ evaluations of the interaction), among other consequential features. No coding is required to operate Dyadic directly, and integrations with existing survey platforms are offered.

2603.22214 2026-03-24 cs.CR cs.AI cs.LG

Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Tom Biskupski, Stephan Kleber

详情
英文摘要

A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation of the analysis scales up the complex evaluation of the victim models' free-form text outputs by faster and more consistent judgments compared to human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparably new technique, LLMs as judges lack a thorough investigation for their reliability and agreement to human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments, when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.

2603.22195 2026-03-24 hep-th cs.AI cs.LG math.CO math.GR

CayleyPy-4: AI-Holography. Towards analogs of holographic string dualities for AI tasks

A. Chervov, F. Levkovich-Maslyuk, A. Smolensky, F. Khafizov, I. Kiselev, D. Melnikov, I. Koltsov, S. Kudashev, D. Shiltsov, M. Obozov, S. Krymskii, V. Kirova, E. V. Konstantinova, A. Soibelman, S. Galkin, L. Grunwald, A. Kotov, A. Alexandrov, S. Lytkin, D. Fedoriaka, A. Chevychelov, Z. Kogan, A. Natyrova, L. Cheldieva, O. Nikitina, S. Fironov, A. Vakhrushev, A. Lukyanenko, V. Ilin, D. Gorodkov, N. Bogachev, I. Gaiur, M. Zaitsev, F. Petrov, L. Petrov, T. Gaintseva, A. Gavrilova, M. N. Smirnov, N. Kalinin, A. Khan, K. Jung, H. Mousset, H. Isambert, O. Debeaupuis

Comments 20+120 pages

详情
英文摘要

This is the fourth paper in the CayleyPy project, which applies AI methods to the exploration of large graphs. In this work, we suggest the existence of a new discrete version of holographic string dualities for this setup, and discuss their relevance to AI systems and mathematics. Many modern AI tasks -- such as those addressed by GPT-style language models or RL systems -- can be viewed as direct analogues of predicting particle trajectories on graphs. We investigate this problem for a large family of Cayley graphs, for which we show that surprisingly it admits a dual description in terms of discrete strings. We hypothesize that such dualities may extend to a range of AI systems where they can lead to more efficient computational approaches. In particular, string holographic images of states are proposed as natural candidates for data embeddings, motivated by the "complexity = volume" principle in AdS/CFT. For Cayley graphs of the symmetric group S_n, our results indicate that the corresponding dual objects are flat, planar polygons. The diameter of the graph is equal to the number of integer points inside the polygon scaled by n. Vertices of the graph can be mapped holographically to paths inside the polygon, and the usual graph distances correspond to the area under the paths, thus directly realising the "complexity = volume" paradigm. We also find evidence for continuous CFTs and dual strings in the large n limit. We confirm this picture and other aspects of the duality in a large initial set of examples. We also present new datasets (obtained by a combination of ML and conventional tools) which should be instrumental in establishing the duality for more general cases.

2603.22174 2026-03-24 cs.HC cs.RO

Feasibility of Augmented Reality-Guided Robotic Ultrasound with Cone-Beam CT Integration for Spine Procedures

Tianyu Song, Felix Pabst, Feng Li, Yordanka Velikova, Miruna-Alexandra Gafencu, Yuan Bi, Ulrich Eck, Nassir Navab

Comments 8 pages, 7 figures

详情
英文摘要

Accurate needle placement in spine interventions is critical for effective pain management, yet it depends on reliable identification of anatomical landmarks and careful trajectory planning. Conventional imaging guidance often relies both on CT and X-ray fluoroscopy, exposing patients and staff to high dose of radiation while providing limited real-time 3D feedback. We present an optical see-through augmented reality (OST-AR)-guided robotic system for spine procedures that provides in situ visualization of spinal structures to support needle trajectory planning. We integrate a cone-beam CT (CBCT)-derived 3D spine model which is co-registered with live ultrasound, enabling users to combine global anatomical context with local, real-time imaging. We evaluated the system in a phantom user study involving two representative spine procedures: facet joint injection and lumbar puncture. Sixteen participants performed insertions under two visualization conditions: conventional screen vs. AR. Results show that AR significantly reduces execution time and across-task placement error, while also improving usability, trust, and spatial understanding and lowering cognitive workload. These findings demonstrate the feasibility of AR-guided robotic ultrasound for spine interventions, highlighting its potential to enhance accuracy, efficiency, and user experience in image-guided procedures.

2603.22160 2026-03-24 stat.AP cs.LG

Data Curation for Machine Learning Interatomic Potentials by Determinantal Point Processes

Joanna Zou, Youssef Marzouk

Comments Original publication at https://openreview.net/forum?id=PKGP7tg65A

详情
Journal ref
ICLR AI4MAT Workshop (2025)
英文摘要

The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.

2603.22152 2026-03-24 cs.HC cs.AI

More Isn't Always Better: Balancing Decision Accuracy and Conformity Pressures in Multi-AI Advice

Yuta Tsuchiya, Yukino Baba

Comments 21 pages, 12 figures, accepted to CHI 2026

详情
英文摘要

Just as people improve decision-making by consulting diverse human advisors, they can now also consult with multiple AI systems. Prior work on group decision-making shows that advice aggregation creates pressure to conform, leading to overreliance. However, the conditions under which multi-AI consultation improves or undermines human decision-making remain unclear. We conducted experiments with three tasks in which participants received advice from panels of AIs. We varied panel size, within-panel consensus, and the human-likeness of presentation. Accuracy improved for small panels relative to a single AI; larger panels yielded no gains. The level of within-panel consensus affected participants' reliance on AI advice: High consensus fostered overreliance; a single dissent reduced pressure to conform; wide disagreement created confusion and undermined appropriate reliance. Human-like presentations increased perceived usefulness and agency in certain tasks, without raising conformity pressure. These findings yield design implications for presenting multi-AI advice that preserve accuracy while mitigating conformity.

2603.22146 2026-03-24 eess.SY cs.RO cs.SY

From Singleton Obstacles to Clutter: Translation Invariant Compositional Avoid Sets

Prashant Solanki, Jasper Van Beers, Coen De Visser

详情
英文摘要

This paper studies obstacle avoidance under translation invariant dynamics using an avoid-side travel cost Hamilton Jacobi formulation. For running costs that are zero outside an obstacle and strictly negative inside it, we prove that the value function is non-positive everywhere, equals zero exactly outside the avoid set, and is strictly negative exactly on it. Under translation invariance, this yields a reuse principle: the value of any translated obstacle is obtained by translating a single template value function. We show that the pointwise minimum of translated template values exactly characterizes the union of the translated single-obstacle avoid sets and provides a conservative inner certificate of unavoidable collision in clutter. To reduce conservatism, we introduce a blockwise composition framework in which subsets of obstacles are merged and solved jointly. This yields a hierarchy of conservative certificates from singleton reuse to the exact clutter value, together with monotonicity under block merging and an exactness criterion based on the existence of a common clutter avoiding control. The framework is illustrated on a Dubins car example in a repeated clutter field.

2603.22055 2026-03-24 cs.GR cs.RO

MineRobot: A Unified Framework for Kinematics Modeling and Solving of Underground Mining Robots in Virtual Environments

Shengzhe Hou, Xinming Lu, Tianyu Zhang, Changqing Yan, Xingli Zhang

详情
英文摘要

Underground mining robots are increasingly operated in virtual environments (VEs) for training, planning, and digital-twin applications, where reliable kinematics is essential for avoiding hazardous in-situ trials. Unlike typical open-chain industrial manipulators, mining robots are often closed-chain mechanisms driven by linear actuators and involving planar four-bar linkages, which makes both kinematics modeling and real-time solving challenging. We present \emph{MineRobot}, a unified framework for modeling and solving the kinematics of underground mining robots in VEs. First, we introduce the Mining Robot Description Format (MRDF), a domain-specific representation that parameterizes kinematics for mining robots with native semantics for actuators and loop closures. Second, we develop a topology-processing pipeline that contracts four-bar substructures into generalized joints and, for each actuator, extracts an Independent Topologically Equivalent Path (ITEP), which is classified into one of four canonical types. Third, leveraging ITEP independence, we compose per-type solvers into an actuator-centered sequential forward-kinematics (FK) pipeline. Building on the same decomposition, we formulate inverse kinematics (IK) as a bound-constrained optimization problem and solve it with a Gauss--Seidel-style procedure that alternates actuator-length updates. By converting coupled closed-loop kinematics into a sequence of small topology-aware solves, the framework avoids robot-specific hand derivations and supports efficient computation. Experiments demonstrate that MineRobot provides the real-time performance and robustness required by VE applications.

2603.22050 2026-03-24 stat.ML cs.LG

MAGPI: Multifidelity-Augmented Gaussian Process Inputs for Surrogate Modeling from Scarce Data

Atticus Rex, Elizabeth Qian, David Peterson

详情
英文摘要

Supervised machine learning describes the practice of fitting a parameterized model to labeled input-output data. Supervised machine learning methods have demonstrated promise in learning efficient surrogate models that can (partially) replace expensive high-fidelity models, making many-query analyses, such as optimization, uncertainty quantification, and inference, tractable. However, when training data must be obtained through the evaluation of an expensive model or experiment, the amount of training data that can be obtained is often limited, which can make learned surrogate models unreliable. However, in many engineering and scientific settings, cheaper \emph{low-fidelity} models may be available, for example arising from simplified physics modeling or coarse grids. These models may be used to generate additional low-fidelity training data. The goal of \emph{multifidelity} machine learning is to use both high- and low-fidelity training data to learn a surrogate model which is cheaper to evaluate than the high-fidelity model, but more accurate than any available low-fidelity model. This work proposes a new multifidelity training approach for Gaussian process regression which uses low-fidelity data to define additional features that augment the input space of the learned model. The approach unites desirable properties from two separate classes of existing multifidelity GPR approaches, cokriging and autoregressive estimators. Numerical experiments on several test problems demonstrate both increased predictive accuracy and reduced computational cost relative to the state of the art.

2603.22008 2026-03-24 cs.IR cs.CL

On the Challenges and Opportunities of Learned Sparse Retrieval for Code

Simon Lupart, Maxime Louis, Thibault Formal, Hervé Déjean, Stéphane Clinchant

Comments 15 pages, 5 figures, 12 tables

详情
英文摘要

Retrieval over large codebases is a key component of modern LLM-based software engineering systems. Existing approaches predominantly rely on dense embedding models, while learned sparse retrieval (LSR) remains largely unexplored for code. However, applying sparse retrieval to code is challenging due to subword fragmentation, semantic gaps between natural-language queries and code, diversity of programming languages and sub-tasks, and the length of code documents, which can harm sparsity and latency. We introduce SPLADE-Code, the first large-scale family of learned sparse retrieval models specialized for code retrieval (600M-8B parameters). Despite a lightweight one-stage training pipeline, SPLADE-Code achieves state-of-the-art performance among retrievers under 1B parameters (75.4 on MTEB Code) and competitive results at larger scales (79.0 with 8B). We show that learned expansion tokens are critical to bridge lexical and semantic matching, and provide a latency analysis showing that LSR enables sub-millisecond retrieval on a 1M-passage collection with little effectiveness loss.

2603.22006 2026-03-24 astro-ph.CO astro-ph.IM cs.LG stat.ME

A plug-and-play approach with fast uncertainty quantification for weak lensing mass mapping

Hubert Leterme, Andreas Tersenov, Jalal Fadili, Jean-Luc Starck

详情
英文摘要

Upcoming stage-IV surveys such as Euclid and Rubin will deliver vast amounts of high-precision data, opening new opportunities to constrain cosmological models with unprecedented accuracy. A key step in this process is the reconstruction of the dark matter distribution from noisy weak lensing shear measurements. Current deep learning-based mass mapping methods achieve high reconstruction accuracy, but either require retraining a model for each new observed sky region (limiting practicality) or rely on slow MCMC sampling. Efficient exploitation of future survey data therefore calls for a new method that is accurate, flexible, and fast at inference. In addition, uncertainty quantification with coverage guarantees is essential for reliable cosmological parameter estimation. We introduce PnPMass, a plug-and-play approach for weak lensing mass mapping. The algorithm produces point estimates by alternating between a gradient descent step with a carefully chosen data fidelity term, and a denoising step implemented with a single deep learning model trained on simulated data corrupted by Gaussian white noise. We also propose a fast, sampling-free uncertainty quantification scheme based on moment networks, with calibrated error bars obtained through conformal prediction to ensure coverage guarantees. Finally, we benchmark PnPMass against both model-driven and data-driven mass mapping techniques. PnPMass achieves performance close to that of state-of-the-art deep-learning methods while offering fast inference (converging in just a few iterations) and requiring only a single training phase, independently of the noise covariance of the observations. It therefore combines flexibility, efficiency, and reconstruction accuracy, while delivering tighter error bars than existing approaches, making it well suited for upcoming weak lensing surveys.

2603.21975 2026-03-24 cs.CR cs.AI cs.CL cs.LG

SecureBreak -- A dataset towards safe and secure models

Marco Arazzi, Vignesh Kumar Kembu, Antonino Nocera

详情
英文摘要

Large language models are becoming pervasive core components in many real-world applications. As a consequence, security alignment represents a critical requirement for their safe deployment. Although previous related works focused primarily on model architectures and alignment methodologies, these approaches alone cannot ensure the complete elimination of harmful generations. This concern is reinforced by the growing body of scientific literature showing that attacks, such as jailbreaking and prompt injection, can bypass existing security alignment mechanisms. As a consequence, additional security strategies are needed both to provide qualitative feedback on the robustness of the obtained security alignment at the training stage, and to create an ``ultimate'' defense layer to block unsafe outputs possibly produced by deployed models. To provide a contribution in this scenario, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is highly reliable due to careful manual annotation, where labels are assigned conservatively to ensure safety. It performs well in detecting unsafe content across multiple risk categories. Tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.