arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2393
专题追踪
2602.14357 2026-02-17 cs.HC cs.AI

Key Considerations for Domain Expert Involvement in LLM Design and Evaluation: An Ethnographic Study

Annalisa Szymanski, Oghenemaro Anuyah, Toby Jia-Jun Li, Ronald A. Metoyer

Comments 14 pages

详情
英文摘要

Large Language Models (LLMs) are increasingly developed for use in complex professional domains, yet little is known about how teams design and evaluate these systems in practice. This paper examines the challenges and trade-offs in LLM development through a 12-week ethnographic study of a team building a pedagogical chatbot. The researcher observed design and evaluation activities and conducted interviews with both developers and domain experts. Analysis revealed four key practices: creating workarounds for data collection, turning to augmentation when expert input was limited, co-developing evaluation criteria with experts, and adopting hybrid expert-developer-LLM evaluation strategies. These practices show how teams made strategic decisions under constraints and demonstrate the central role of domain expertise in shaping the system. Challenges included expert motivation and trust, difficulties structuring participatory design, and questions around ownership and integration of expert knowledge. We propose design opportunities for future LLM development workflows that emphasize AI literacy, transparent consent, and frameworks recognizing evolving expert roles.

2602.14345 2026-02-17 cs.CR cs.AI

AXE: An Agentic eXploit Engine for Confirming Zero-Day Vulnerability Reports

Amirali Sajadi, Tu Nguyen, Kostadin Damevski, Preetha Chatterjee

详情
英文摘要

Vulnerability detection tools are widely adopted in software projects, yet they often overwhelm maintainers with false positives and non-actionable reports. Automated exploitation systems can help validate these reports; however, existing approaches typically operate in isolation from detection pipelines, failing to leverage readily available metadata such as vulnerability type and source-code location. In this paper, we investigate how reported security vulnerabilities can be assessed in a realistic grey-box exploitation setting that leverages minimal vulnerability metadata, specifically a CWE classification and a vulnerable code location. We introduce Agentic eXploit Engine (AXE), a multi-agent framework for Web application exploitation that maps lightweight detection metadata to concrete exploits through decoupled planning, code exploration, and dynamic execution feedback. Evaluated on the CVE-Bench dataset, AXE achieves a 30% exploitation success rate, a 3x improvement over state-of-the-art black-box baselines. Even in a single-agent configuration, grey-box metadata yields a 1.75x performance gain. Systematic error analysis shows that most failed attempts arise from specific reasoning gaps, including misinterpreted vulnerability semantics and unmet execution preconditions. For successful exploits, AXE produces actionable, reproducible proof-of-concept artifacts, demonstrating its utility in streamlining Web vulnerability triage and remediation. We further evaluate AXE's generalizability through a case study on a recent real-world vulnerability not included in CVE-Bench.

2602.14321 2026-02-17 cs.GT cs.AI cs.LG cs.MA

Offline Learning of Nash Stable Coalition Structures with Possibly Overlapping Coalitions

Saar Cohen

Comments To Appear in the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2026

详情
英文摘要

Coalition formation concerns strategic collaborations of selfish agents that form coalitions based on their preferences. It is often assumed that coalitions are disjoint and preferences are fully known, which may not hold in practice. In this paper, we thus present a new model of coalition formation with possibly overlapping coalitions under partial information, where selfish agents may be part of multiple coalitions simultaneously and their full preferences are initially unknown. Instead, information about past interactions and associated utility feedback is stored in a fixed offline dataset, and we aim to efficiently infer the agents' preferences from this dataset. We analyze the impact of diverse dataset information constraints by studying two types of utility feedback that can be stored in the dataset: agent- and coalition-level utility feedback. For both feedback models, we identify assumptions under which the dataset covers sufficient information for an offline learning algorithm to infer preferences and use them to recover a partition that is (approximately) Nash stable, in which no agent can improve her utility by unilaterally deviating. Our additional goal is devising algorithms with low sample complexity, requiring only a small dataset to obtain a desired approximation to Nash stability. Under agent-level feedback, we provide a sample-efficient algorithm proven to obtain an approximately Nash stable partition under a sufficient and necessary assumption on the information covered by the dataset. However, under coalition-level feedback, we show that only under a stricter assumption is sufficient for sample-efficient learning. Still, in multiple cases, our algorithms' sample complexity bounds have optimality guarantees up to logarithmic factors. Finally, extensive experiments show that our algorithm converges to a low approximation level to Nash stability across diverse settings.

2602.14302 2026-02-17 cs.DC cs.LG

Floe: Federated Specialization for Real-Time LLM-SLM Inference

Chunlin Tian, Kahou Tam, Yebo Wu, Shuaihang Zhong, Li Li, Nicholas D. Lane, Chengzhong Xu

Comments Accepted by IEEE Transactions on Parallel and Distributed Systems

详情
英文摘要

Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing proprietary weights. A heterogeneity-aware LoRA adaptation strategy enables efficient edge deployment across diverse hardware, and a logit-level fusion mechanism enables real-time coordination between edge and cloud models. Extensive experiments demonstrate that Floe enhances user privacy and personalization. Moreover, it significantly improves model performance and reduces inference latency on edge devices under real-time constraints compared with baseline approaches.

2602.14285 2026-02-17 cs.DL cs.AI cs.CL cs.LG

FMMD: A multimodal open peer review dataset based on F1000Research

Zhenzhen Zhuang, Yuqing Fu, Jing Zhu, Zhangping Zhou, Jialiang Lin

Comments Work in progress

详情
英文摘要

Automated scholarly paper review (ASPR) has entered the coexistence phase with traditional peer review, where artificial intelligence (AI) systems are increasingly incorporated into real-world manuscript evaluation. In parallel, research on automated and AI-assisted peer review has proliferated. Despite this momentum, empirical progress remains constrained by several critical limitations in existing datasets. While reviewers routinely evaluate figures, tables, and complex layouts to assess scientific claims, most existing datasets remain overwhelmingly text-centric. This bias is reinforced by a narrow focus on data from computer science venues. Furthermore, these datasets lack precise alignment between reviewer comments and specific manuscript versions, obscuring the iterative relationship between peer review and manuscript evolution. In response, we introduce FMMD, a multimodal and multidisciplinary open peer review dataset curated from F1000Research. The dataset bridges the current gap by integrating manuscript-level visual and structural data with version-specific reviewer reports and editorial decisions. By providing explicit alignment between reviewer comments and the exact article iteration under review, FMMD enables fine-grained analysis of the peer review lifecycle across diverse scientific domains. FMMD supports tasks such as multimodal issue detection and multimodal review comment generation. It provides a comprehensive empirical resource for the development of peer review research.

2602.14283 2026-02-17 cs.NI cs.LG

MILD: Multi-Intent Learning and Disambiguation for Proactive Failure Prediction in Intent-based Networking

Md. Kamrul Hossain, Walid Aljoby

Comments Copyright 2026 IEEE. Accepted for presentation in IEEE/IFIP NOMS 2026

详情
英文摘要

In multi-intent intent-based networks, a single fault can trigger co-drift where multiple intents exhibit symptomatic KPI degradation, creating ambiguity about the true root-cause intent. We present MILD, a proactive framework that reformulates intent assurance from reactive drift detection to fixed-horizon failure prediction with intent-level disambiguation. MILD uses a teacher-augmented Mixture-of-Experts where a gated disambiguation module identifies the root-cause intent while per-intent heads output calibrated risk scores. On a benchmark with non-linear failures and co-drifts, MILD provides 3.8\%--92.5\% longer remediation lead time and improves intent-level root-cause disambiguation accuracy by 9.4\%--45.8\% over baselines. MILD also provides per-alert KPI explanations, enabling actionable diagnosis.

2602.14280 2026-02-17 stat.CO cs.LG

Fast Compute for ML Optimization

Nick Polson, Vadim Sokolov

详情
英文摘要

We study optimization for losses that admit a variance-mean scale-mixture representation. Under this representation, each EM iteration is a weighted least squares update in which latent variables determine observation and parameter weights; these play roles analogous to Adam's second-moment scaling and AdamW's weight decay, but are derived from the model. The resulting Scale Mixture EM (SM-EM) algorithm removes user-specified learning-rate and momentum schedules. On synthetic ill-conditioned logistic regression benchmarks with $p \in \{20, \ldots, 500\}$, SM-EM with Nesterov acceleration attains up to $13\times$ lower final loss than Adam tuned by learning-rate grid search. For a 40-point regularization path, sharing sufficient statistics across penalty values yields a $10\times$ runtime reduction relative to the same tuned-Adam protocol. For the base (non-accelerated) algorithm, EM monotonicity guarantees nonincreasing objective values; adding Nesterov extrapolation trades this guarantee for faster empirical convergence.

2602.14270 2026-02-17 cs.CY cs.AI cs.HC

A Rational Analysis of the Effects of Sycophantic AI

Rafael M. Batista, Thomas L. Griffiths

Comments 7 pages, 1 figure

详情
英文摘要

People increasingly use large language models (LLMs) to explore ideas, gather information, and make sense of the world. In these interactions, they encounter agents that are overly agreeable. We argue that this sycophancy poses a unique epistemic risk to how individuals come to see the world: unlike hallucinations that introduce falsehoods, sycophancy distorts reality by returning responses that are biased to reinforce existing beliefs. We provide a rational analysis of this phenomenon, showing that when a Bayesian agent is provided with data that are sampled based on a current hypothesis the agent becomes increasingly confident about that hypothesis but does not make any progress towards the truth. We test this prediction using a modified Wason 2-4-6 rule discovery task where participants (N=557) interacted with AI agents providing different types of feedback. Unmodified LLM behavior suppressed discovery and inflated confidence comparably to explicitly sycophantic prompting. By contrast, unbiased sampling from the true distribution yielded discovery rates five times higher. These results reveal how sycophantic AI distorts belief, manufacturing certainty where there should be doubt.

2602.14250 2026-02-17 cs.IT cs.LG eess.SP math.IT

Energy-Efficient Over-the-Air Federated Learning via Pinching Antenna Systems

Saba Asaad, Ali Bereyhi

Comments To be presented at IEEE International Conference on Communications (ICC 2026): Symposium on Next Generation Multiple Access. 6 pages, 3 algorithms, 3 figures

详情
英文摘要

Pinching antennas systems (PASSs) have recently been proposed as a novel flexible-antenna technology. These systems are implemented by attaching low-cost pinching elements to dielectric waveguides. As the direct link is bypassed through waveguides, PASSs can effectively compensate large-scale effects of the wireless channel. This work explores the potential gains of employing PASSs for over-the-air federated learning (OTA-FL). For a PASS-assisted server, we develop a low-complexity algorithmic approach, which jointly tunes the PASS parameters and schedules the mobile devices for minimal energy consumption in OTA-FL. We study the efficiency of the proposed design and compare it against the conventional OTA-FL setting with MIMO server. Numerical experiments demonstrate that using a single-waveguide PASS at the server within a moderately sized area, the required energy for model aggregation is drastically reduced as compared to the case with fully-digital MIMO server. This introduces PASS as a potential technology for energy-efficient distributed learning in next generations of wireless systems.

2602.14244 2026-02-17 stat.ML cs.LG

Federated Ensemble Learning with Progressive Model Personalization

Ala Emrani, Amir Najafi, Abolfazl Motahari

Comments 42 pages

详情
英文摘要

Federated Learning provides a privacy-preserving paradigm for distributed learning, but suffers from statistical heterogeneity across clients. Personalized Federated Learning (PFL) mitigates this issue by considering client-specific models. A widely adopted approach in PFL decomposes neural networks into a shared feature extractor and client-specific heads. While effective, this design induces a fundamental tradeoff: deep or expressive shared components hinder personalization, whereas large local heads exacerbate overfitting under limited per-client data. Most existing methods rely on rigid, shallow heads, and therefore fail to navigate this tradeoff in a principled manner. In this work, we propose a boosting-inspired framework that enables a smooth control of this tradeoff. Instead of training a single personalized model, we construct an ensemble of $T$ models for each client. Across boosting iterations, the depth of the personalized component are progressively increased, while its effective complexity is systematically controlled via low-rank factorization or width shrinkage. This design simultaneously limits overfitting and substantially reduces per-client bias by allowing increasingly expressive personalization. We provide theoretical analysis that establishes generalization bounds with favorable dependence on the average local sample size and the total number of clients. Specifically, we prove that the complexity of the shared layers is effectively suppressed, while the dependence on the boosting horizon $T$ is controlled through parameter reduction. Notably, we provide a novel nonlinear generalization guarantee for decoupled PFL models. Extensive experiments on benchmark and real-world datasets (e.g., EMNIST, CIFAR-10/100, and Sent140) demonstrate that the proposed framework consistently outperforms state-of-the-art PFL methods under heterogeneous data distributions.

2602.14216 2026-02-17 cs.CY cs.AI cs.CL

Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports

Dragan Stoll, Brian E. Perron, Zia Qi, Selina Steinmann, Nicole F. Eicher, Andreas Jud

详情
英文摘要

Purpose: Reasoning language models (RLMs) have demonstrated significant advances in solving complex reasoning tasks. We examined their potential to assess parental cooperation during CPS interventions using case reports, a case factor characterized by ambiguous and conflicting information. Methods: A four stage workflow comprising (1) case reports collection, (2) reasoning-based assessment of parental cooperation, (3) automated category extraction, and (4) case labeling was developed. The performance of RLMs with different parameter sizes (255B, 32B, 4B) was compared against human validated data. Two expert human reviewers (EHRs) independently classified a weighted random sample of reports. Results: The largest RLM achieved the highest accuracy (89%), outperforming the initial approach (80%). Classification accuracy was higher for mothers (93%) than for fathers (85%), and EHRs exhibited similar differences. Conclusions: RLMs' reasoning can effectively assess complex case factors such as parental cooperation. Lower accuracy in assessing fathers' cooperation supports the argument of a stronger professional focus on mothers in CPS interventions.

2602.14106 2026-02-17 cs.CR cs.AI

Anticipating Adversary Behavior in DevSecOps Scenarios through Large Language Models

Mario Marín Caballero, Miguel Betancourt Alonso, Daniel Díaz-López, Angel Luis Perales Gómez, Pantaleone Nespoli, Gregorio Martínez Pérez

Comments 8 pages, 3 figures, paper in proceedings of the X National Cybersecurity Research Conference (JNIC) in Zaragoza, Spain, June, 2025

详情
英文摘要

The most valuable asset of any cloud-based organization is data, which is increasingly exposed to sophisticated cyberattacks. Until recently, the implementation of security measures in DevOps environments was often considered optional by many government entities and critical national services operating in the cloud. This includes systems managing sensitive information, such as electoral processes or military operations, which have historically been valuable targets for cybercriminals. Resistance to security implementation is often driven by concerns over losing agility in software development, increasing the risk of accumulated vulnerabilities. Nowadays, patching software is no longer enough; adopting a proactive cyber defense strategy, supported by Artificial Intelligence (AI), is crucial to anticipating and mitigating threats. Thus, this work proposes integrating the Security Chaos Engineering (SCE) methodology with a new LLM-based flow to automate the creation of attack defense trees that represent adversary behavior and facilitate the construction of SCE experiments based on these graphical models, enabling teams to stay one step ahead of attackers and implement previously unconsidered defenses. Further detailed information about the experiment performed, along with the steps to replicate it, can be found in the following repository: https://github.com/mariomc14/devsecops-adversary-llm.git.

2602.14089 2026-02-17 cs.DB cs.AI

TabTracer: Monte Carlo Tree Search for Complex Table Reasoning with Large Language Models

Zhizhao Luo, Zhaojing Luo, Meihui Zhang, Rui Mao

详情
英文摘要

Large language models (LLMs) have emerged as powerful tools for natural language table reasoning, where there are two main categories of methods. Prompt-based approaches rely on language-only inference or one-pass program generation without step-level verification. Agent-based approaches use tools in a closed loop, but verification is often local and backtracking is limited, allowing errors to propagate and increasing cost. Moreover, they rely on chain- or beam-style trajectories that are typically combinatorially redundant, leading to high token costs. In this paper, we propose TabTracer, an agentic framework that coordinates multi-step tool calls over intermediate table states, with explicit state tracking for verification and rollback. First, it enforces step-level verification with typed operations and lightweight numeric and format checks to provide reliable rewards and suppress hallucinations. Second, execution-feedback Monte Carlo Tree Search maintains a search tree of candidate table states and uses backpropagated reflection scores to guide UCB1 selection and rollback via versioned snapshots. Third, it reduces redundancy with budget-aware pruning, deduplication, and state hashing with a monotonicity gate to cut token cost. Comprehensive evaluation on TabFact, WikiTQ, and CRT datasets shows that TabTracer outperforms state-of-the-art baselines by up to 6.7% in accuracy while reducing token consumption by 59--84%.

2602.14043 2026-02-17 cs.SI cs.AI

Beyond Static Snapshots: Dynamic Modeling and Forecasting of Group-Level Value Evolution with Large Language Models

Qiankun Pi, Guixin Su, Jinliang Li, Mayi Xu, Xin Miao, Jiawei Jiang, Ming Zhong, Tieyun Qian

详情
英文摘要

Social simulation is critical for mining complex social dynamics and supporting data-driven decision making. LLM-based methods have emerged as powerful tools for this task by leveraging human-like social questionnaire responses to model group behaviors. Existing LLM-based approaches predominantly focus on group-level values at discrete time points, treating them as static snapshots rather than dynamic processes. However, group-level values are not fixed but shaped by long-term social changes. Modeling their dynamics is thus crucial for accurate social evolution prediction--a key challenge in both data mining and social science. This problem remains underexplored due to limited longitudinal data, group heterogeneity, and intricate historical event impacts. To bridge this gap, we propose a novel framework for group-level dynamic social simulation by integrating historical value trajectories into LLM-based human response modeling. We select China and the U.S. as representative contexts, conducting stratified simulations across four core sociodemographic dimensions (gender, age, education, income). Using the World Values Survey, we construct a multi-wave, group-level longitudinal dataset to capture historical value evolution, and then propose the first event-based prediction method for this task, unifying social events, current value states, and group attributes into a single framework. Evaluations across five LLM families show substantial gains: a maximum 30.88\% improvement on seen questions and 33.97\% on unseen questions over the Vanilla baseline. We further find notable cross-group heterogeneity: U.S. groups are more volatile than Chinese groups, and younger groups in both countries are more sensitive to external changes. These findings advance LLM-based social simulation and provide new insights for social scientists to understand and predict social value changes.

2602.14030 2026-02-17 cs.CR cs.LG

MC$^2$Mark: Distortion-Free Multi-Bit Watermarking for Long Messages

Xuehao Cui, Ruibo Chen, Yihan Wu, Heng Huang

详情
英文摘要

Large language models now produce text indistinguishable from human writing, which increases the need for reliable provenance tracing. Multi-bit watermarking can embed identifiers into generated text, but existing methods struggle to keep both text quality and watermark strength while carrying long messages. We propose MC$^2$Mark, a distortion-free multi-bit watermarking framework designed for reliable embedding and decoding of long messages. Our key technical idea is Multi-Channel Colored Reweighting, which encodes bits through structured token reweighting while keeping the token distribution unbiased, together with Multi-Layer Sequential Reweighting to strengthen the watermark signal and an evidence-accumulation detector for message recovery. Experiments show that MC$^2$Mark improves detectability and robustness over prior multi-bit watermarking methods while preserving generation quality, achieving near-perfect accuracy for short messages and exceeding the second-best method by nearly 30% for long messages.

2602.14029 2026-02-17 stat.ML cs.LG math.ST stat.TH

Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting

Mingqi Wu, Archer Y. Yang, Qiang Sun

Comments 8 pages main, 29 pages in total

详情
英文摘要

Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising-forgetting trade-off.

2602.14020 2026-02-17 stat.ML cs.LG

Computable Bernstein Certificates for Cross-Fitted Clipped Covariance Estimation

Even He, Zaizai Yan

详情
英文摘要

We study operator-norm covariance estimation from heavy-tailed samples that may include a small fraction of arbitrary outliers. A simple and widely used safeguard is \emph{Euclidean norm clipping}, but its accuracy depends critically on an unknown clipping level. We propose a cross-fitted clipped covariance estimator equipped with \emph{fully computable} Bernstein-type deviation certificates, enabling principled data-driven tuning via a selector (\emph{MinUpper}) that balances certified stochastic error and a robust hold-out proxy for clipping bias. The resulting procedure adapts to intrinsic complexity measures such as effective rank under mild tail regularity and retains meaningful guarantees under only finite fourth moments. Experiments on contaminated spiked-covariance benchmarks illustrate stable performance and competitive accuracy across regimes.

2602.13971 2026-02-17 cs.IR cs.AI

DAIAN: Deep Adaptive Intent-Aware Network for CTR Prediction in Trigger-Induced Recommendation

Zhihao Lv, Longtao Zhang, Ailong He, Shuzhi Cao, Shuguang Han, Jufeng Chen

详情
英文摘要

Recommendation systems are essential for personalizing e-commerce shopping experiences. Among these, Trigger-Induced Recommendation (TIR) has emerged as a key scenario, which utilizes a trigger item (explicitly represents a user's instantaneous interest), enabling precise, real-time recommendations. Although several trigger-based techniques have been proposed, most of them struggle to address the intent myopia issue, that is, a recommendation system overemphasizes the role of trigger items and narrowly focuses on suggesting commodities that are highly relevant to trigger items. Meanwhile, existing methods rely on collaborative behavior patterns between trigger and recommended items to identify the user's preferences, yet the sparsity of ID-based interaction restricts their effectiveness. To this end, we propose the Deep Adaptive Intent-Aware Network (DAIAN) that dynamically adapts to users' intent preferences. In general, we first extract the users' personalized intent representations by analyzing the correlation between a user's click and the trigger item, and accordingly retrieve the user's related historical behaviors to mine the user's diverse intent. Besides, sparse collaborative behaviors constrain the performance in capturing items associated with user intent. Hence, we reinforce similarity by leveraging a hybrid enhancer with ID and semantic information, followed by adaptive selection based on varying intents. Experimental results on public datasets and our industrial e-commerce datasets demonstrate the effectiveness of DAIAN.

2602.13942 2026-02-17 stat.ML cs.LG

A Theoretical Framework for LLM Fine-tuning Using Early Stopping for Non-random Initialization

Zexuan Sun, Garvesh Raskutti

详情
英文摘要

In the era of large language models (LLMs), fine-tuning pretrained models has become ubiquitous. Yet the theoretical underpinning remains an open question. A central question is why only a few epochs of fine-tuning are typically sufficient to achieve strong performance on many different tasks. In this work, we approach this question by developing a statistical framework, combining rigorous early stopping theory with the attention-based Neural Tangent Kernel (NTK) for LLMs, offering new theoretical insights on fine-tuning practices. Specifically, we formally extend classical NTK theory [Jacot et al., 2018] to non-random (i.e., pretrained) initializations and provide a convergence guarantee for attention-based fine-tuning. One key insight provided by the theory is that the convergence rate with respect to sample size is closely linked to the eigenvalue decay rate of the empirical kernel matrix induced by the NTK. We also demonstrate how the framework can be used to explain task vectors for multiple tasks in LLMs. Finally, experiments with modern language models on real-world datasets provide empirical evidence supporting our theoretical insights.

2602.13914 2026-02-17 cs.LO cs.AI

Common Knowledge Always, Forever

Martín Diéguez, David Fernández-Duque

Comments 16 pages

Journal ref Festschrift for Andreas Herzig on the Occasion of his 65th Birthday. Essays in Honor of Andi, 2026

详情
英文摘要

There has been an increasing interest in topological semantics for epistemic logic, which has been shown to be useful for, e.g., modelling evidence, degrees of belief, and self-reference. We introduce a polytopological PDL capable of expressing common knowledge and various generalizations and show it has the finite model property over closure spaces but not over Cantor derivative spaces. The latter is shown by embedding a version of linear temporal logic with `past', which does not have the finite model property.

2602.13871 2026-02-17 math.ST cs.IT cs.LG math.IT math.OC stat.AP stat.ML stat.TH

Ensemble-Conditional Gaussian Processes (Ens-CGP): Representation, Geometry, and Inference

Sai Ravela, Jae Deok Kim, Kenneth Gee, Xingjian Yan, Samson Mercier, Lubna Albarghouty, Anamitra Saha

Comments 20 pages. Technical manuscrupt on representational equivalence between conditional Gaussian inference, quadratic optimization, and RKHS geometry in finite dimensions

详情
英文摘要

We formulate Ensemble-Conditional Gaussian Processes (Ens-CGP), a finite-dimensional synthesis that centers ensemble-based inference on the conditional Gaussian law. Conditional Gaussian processes (CGP) arise directly from Gaussian processes under conditioning and, in linear-Gaussian settings, define the full posterior distribution for a Gaussian prior and linear observations. Classical Kalman filtering is a recursive algorithm that computes this same conditional law under dynamical assumptions; the conditional Gaussian law itself is therefore the underlying representational object, while the filter is one computational realization. In this sense, CGP provides the probabilistic foundation for Kalman-type methods as well as equivalent formulations as a strictly convex quadratic program (MAP estimation), RKHS-regularized regression, and classical regularization. Ens-CGP is the ensemble instantiation of this object, obtained by treating empirical ensemble moments as a (possibly low-rank) Gaussian prior and performing exact conditioning. By separating representation (GP -> CGP -> Ens-CGP) from computation (Kalman filters, EnKF variants, and iterative ensemble schemes), the framework links an earlier-established representational foundation for inference to ensemble-derived priors and clarifies the relationships among probabilistic, variational, and ensemble perspectives.

2602.13817 2026-02-17 cs.HC cs.AI

What happens when reviewers receive AI feedback in their reviews?

Shiping Chen, Shu Zhong, Duncan P. Brumby, Anna L. Cox

Comments ACM CHI 2026

详情
英文摘要

AI is reshaping academic research, yet its role in peer review remains polarising and contentious. Advocates see its potential to reduce reviewer burden and improve quality, while critics warn of risks to fairness, accountability, and trust. At ICLR 2025, an official AI feedback tool was deployed to provide reviewers with post-review suggestions. We studied this deployment through surveys and interviews, investigating how reviewers engaged with the tool and perceived its usability and impact. Our findings surface both opportunities and tensions when AI augments in peer review. This work contributes the first empirical evidence of such an AI tool in a live review process, documenting how reviewers respond to AI-generated feedback in a high-stakes review context. We further offer design implications for AI-assisted reviewing that aim to enhance quality while safeguarding human expertise, agency, and responsibility.

2602.13811 2026-02-17 cs.NE cs.LG physics.comp-ph

A Unified Physics-Informed Neural Network for Modeling Coupled Electro- and Elastodynamic Wave Propagation Using Three-Stage Loss Optimization

Suhas Suresh Bharadwaj, Reuben Thomas Thovelil

Comments 6 pages, 2 figures

详情
英文摘要

Physics-Informed Neural Networks present a novel approach in SciML that integrates physical laws in the form of partial differential equations directly into the NN through soft constraints in the loss function. This work studies the application of PINNs to solve a one dimensional coupled electro-elastodynamic system modeling linear piezoelectricity in stress-charge form, governed by elastodynamic and electrodynamic equations. Our simulation employs a feedforward architecture, mapping space-time coordinates to mechanical displacement and electric potential. Our PINN model achieved global relative L2 errors of 2.34 and 4.87 percent for displacement and electric potential respectively. The results validate PINNs as effective mesh free solvers for coupled time-dependent PDE systems, though challenges remain regarding error accumulation and stiffness in coupled eigenvalue systems.

2602.13784 2026-02-17 cs.HC cs.AI

Comparables XAI: Faithful Example-based AI Explanations with Counterfactual Trace Adjustments

Yifan Zhang, Tianle Ren, Fei Wang, Brian Y Lim

Comments Accepted by CHI 2026

详情
英文摘要

Explaining with examples is an intuitive way to justify AI decisions. However, it is challenging to understand how a decision value should change relative to the examples with many features differing by large amounts. We draw from real estate valuation that uses Comparables-examples with known values for comparison. Estimates are made more accurate by hypothetically adjusting the attributes of each Comparable and correspondingly changing the value based on factors. We propose Comparables XAI for relatable example-based explanations of AI with Trace adjustments that trace counterfactual changes from each Comparable to the Subject, one attribute at a time, monotonically along the AI feature space. In modelling and user studies, Trace-adjusted Comparables achieved the highest XAI faithfulness and precision, user accuracy, and narrowest uncertainty bounds compared to linear regression, linearly adjusted Comparables, or unadjusted Comparables. This work contributes a new analytical basis for using example-based explanations to improve user understanding of AI decisions.

2602.12289 2026-02-17 eess.SY cs.LG cs.SY eess.SP

String-Level Ground Fault Localization for TN-Earthed Three-Phase Photovoltaic Systems

Yuanliang Li, Xun Gong, Reza Iravani, Bo Cao, Heng Liu, Ziming Chen

详情
英文摘要

The DC-side ground fault (GF) poses significant risks to three-phase TN-earthed photovoltaic (PV) systems, as the resulting high fault current can directly damage both PV inverters and PV modules. Once a fault occurs, locating the faulty string through manual string-by-string inspection is highly time-consuming and inefficient. This work presents a comprehensive analysis of GF characteristics through fault-current analysis and a simulation-based case study covering multiple fault locations. Building on these insights, we propose an edge-AI-based GF localization approach tailored for three-phase TN-earthed PV systems. A PLECS-based simulation model that incorporates PV hysteresis effects is developed to generate diverse GF scenarios, from which correlation-based features are extracted throughout the inverter's four-stage shutdown sequence. Using the simulated dataset, a lightweight Variational Information Bottleneck (VIB)-based localization model is designed and trained, achieving over 93% localization accuracy at typical sampling rates with low computational cost, demonstrating strong potential for deployment on resource-constrained PV inverters.

2602.10833 2026-02-17 cs.IR cs.AI cs.CL cs.LG

Training-Induced Bias Toward LLM-Generated Content in Dense Retrieval

William Xion, Wolfgang Nejdl

Comments Accepted at ECIR 2026

详情
英文摘要

Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.

2602.09572 2026-02-17 cs.DB cs.AI cs.LG

Predictive Query Language: A Domain-Specific Language for Predictive Modeling on Relational Databases

Vid Kocijan, Jinu Sunil, Jan Eric Lenssen, Viman Deb, Xinwei Xe, Federico Reyes Gomez, Matthias Fey, Jure Leskovec

详情
英文摘要

The purpose of predictive modeling on relational data is to predict future or missing values in a relational database, for example, future purchases of a user, risk of readmission of the patient, or the likelihood that a financial transaction is fraudulent. Typically powered by machine learning methods, predictive models are used in recommendations, financial fraud detection, supply chain optimization, and other systems, providing billions of predictions every day. However, training a machine learning model requires manual work to extract the required training examples - prediction entities and target labels - from the database, which is slow, laborious, and prone to mistakes. Here, we present the Predictive Query Language (PQL), an SQL-inspired declarative language for defining predictive tasks on relational databases. PQL allows specifying a predictive task in a single declarative query, enabling the automatic computation of training labels for a large variety of machine learning tasks, such as regression, classification, time-series forecasting, and recommender systems. PQL is already successfully integrated and used in a collection of use cases as part of a predictive AI platform. The versatility of the language can be demonstrated through its many ongoing use cases, including financial fraud, item recommendations, and workload prediction. We demonstrate its versatile design through two implementations; one for small-scale, low-latency use and one that can handle large-scale databases.

2602.08882 2026-02-17 cs.HC cs.CV

Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals

Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal, Sameep Shrestha, Chia-wei Tang, Michael F. Lighthiser, Michael R. Hieb, Xuesu Xiao, Chris Thomas, Sungsoo Ray Hong

详情
英文摘要

Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools.

2602.07672 2026-02-17 cs.SE cs.AI cs.LG cs.PL cs.SC

Debugging code world models

Babak Rahmani

Comments 8 pages, 4 figures, under review in conference

详情
英文摘要

Code World Models (CWMs) are language models trained to simulate program execution by predicting explicit runtime state after every executed command. This execution-based world modeling enables internal verification within the model, offering an alternative to natural language chain-of-thought reasoning. However, the sources of errors and the nature of CWMs' limitations remain poorly understood. We study CWMs from two complementary perspectives: local semantic execution and long-horizon state tracking. On real-code benchmarks, we identify two dominant failure regimes. First, dense runtime state reveals produce token-intensive execution traces, leading to token-budget exhaustion on programs with long execution histories. Second, failures disproportionately concentrate in string-valued state, which we attribute to limitations of subword tokenization rather than program structure. To study long-horizon behavior, we use a controlled permutation-tracking benchmark that isolates state propagation under action execution. We show that long-horizon degradation is driven primarily by incorrect action generation: when actions are replaced with ground-truth commands, a Transformer-based CWM propagates state accurately over long horizons, despite known limitations of Transformers in long-horizon state tracking. These findings suggest directions for more efficient supervision and state representations in CWMs that are better aligned with program execution and data types.

2602.07132 2026-02-17 stat.ML cs.LG

Discrete Adjoint Matching

Oswin So, Brian Karrer, Chuchu Fan, Ricky T. Q. Chen, Guan-Horng Liu

Comments ICLR 2026

详情
英文摘要

Computation methods for solving entropy-regularized reward optimization -- a class of problems widely used for fine-tuning generative models -- have advanced rapidly. Among those, Adjoint Matching (AM, Domingo-Enrich et al., 2025) has proven highly effective in continuous state spaces with differentiable rewards. Transferring these practical successes to discrete generative modeling, however, remains particularly challenging and largely unexplored, mainly due to the drastic shift in generative model classes to discrete state spaces, which are nowhere differentiable. In this work, we propose Discrete Adjoint Matching (DAM) -- a discrete variant of AM for fine-tuning discrete generative models characterized by Continuous-Time Markov Chains, such as diffusion-based large language models. The core of DAM is the introduction of discrete adjoint-an estimator of the optimal solution to the original problem but formulated on discrete domains-from which standard matching frameworks can be applied. This is derived via a purely statistical standpoint, in contrast to the control-theoretic viewpoint in AM, thereby opening up new algorithmic opportunities for general adjoint-based estimators. We showcase DAM's effectiveness on synthetic and mathematical reasoning tasks.