arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1447
专题追踪
2510.07175 2026-02-02 cs.CL cs.LG

Quantifying Data Contamination in Psychometric Evaluations of LLMs

Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo

Comments EACL 2026 Findings

详情
英文摘要

Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.

2510.03267 2026-02-02 cs.LG cs.AI

PT$^2$-LLM: Post-Training Ternarization for Large Language Models

Xianglong Yan, Chengzhu Bao, Zhiteng Li, Tianao Zhang, Kaicheng Yang, Haotong Qin, Ruobing Xie, Xingwu Sun, Yulun Zhang

Comments Accepted at ICLR 2026

详情
英文摘要

Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. Ternarization has gained attention as a promising compression technique, delivering substantial size reduction and high computational efficiency. However, its potential in the post-training quantization (PTQ) setting remains underexplored, due to the challenge of training-free parameter optimization and the quantization difficulty posed by outliers and dispersed weights. To address these issues, we propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline: (1) Iterative Ternary Fitting (ITF), which alternates between optimal ternary grid construction and flexible rounding to minimize quantization error, and (2) Activation-aware Grid Alignment (AGA), which further refines the ternary grid to better match full-precision outputs. In addition, we propose a plug-and-play Structural Similarity-based Reordering (SSR) strategy that leverages inter-column structural similarity to ease quantization and mitigate outlier effects, further enhancing overall performance. Extensive experiments demonstrate that PT$^2$-LLM delivers competitive performance against state-of-the-art (SOTA) 2-bit PTQ methods with lower memory cost, while also accelerating both prefill and decoding to achieve end-to-end speedup. The code and models will be available at https://github.com/XIANGLONGYAN/PT2-LLM.

2510.02295 2026-02-02 cs.CV cs.AI cs.LG

VideoNSA: Native Sparse Attention Scales Video Understanding

Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

Comments ICLR 2026

详情
英文摘要

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention help induce dynamic attention sinks. Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA

2510.02291 2026-02-02 cs.LG cs.CV stat.ML

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman

Comments Preprint

详情
英文摘要

While continuous diffusion models have achieved remarkable success, discrete diffusion offers a unified framework for jointly modeling text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free guidance, making it well-suited for posterior sampling. Existing approaches to posterior sampling using discrete diffusion face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS), built on two key innovations: quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. APS achieves state-of-the-art performance among discrete diffusion samplers on both linear and nonlinear inverse problems across the standard image benchmarks. We demonstrate the generality of APS through training-free stylization and text-guided editing. We further apply APS to a large-scale diffusion language model, showing consistent improvement in question answering.

2510.00219 2026-02-02 cs.LG cs.AI cs.CL cs.NE

Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space

Houjun Liu, Shikhar Murty, Christopher D. Manning, Róbert Csordás

详情
英文摘要

Current approaches for scaling inference-time compute in transformers train them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they are limited because they cannot be applied during pretraining and rely solely on serially-generated, natural-language verbalization. In this work, we propose Thoughtbubbles, a transformer variant that natively performs parallel adaptive computation in latent space by learning to fork or delete residual streams. Thus, tokens requiring more computation can form a "bubble" of cloned residuals in the middle of the network. Crucially, this behavior is learned during pretraining with only language modeling loss. Using half of the training budget, Thoughtbubbles outperforms the perplexity and zero-shot evals of both standard decoder LMs and those using non-adaptive parallel computation approaches. These results hold across model sizes from 150M to 1.9B. Thoughtbubbles achieves competitive GSM8K results using half of the baseline's token budget. The implicit nature of our method enables models to begin learning adaptive computation at pretraining time, paving the way to unified train-time and test-time scaling behaviors.

2510.00065 2026-02-02 cs.LG

FedLLM-Align: Feature Extraction From Heterogeneous Clients

Abdelrhman Gaber, Muhammad ElMahdy, Youssif Abuzied, Hassan Abd-Eltawab, Tamer ElBatt

详情
英文摘要

Federated learning (FL) enables collaborative model training without sharing raw data, making it attractive for privacy-sensitive domains, e.g., healthcare, finance, and IoT. A major obstacle, however, is the potential heterogeneity of tabular data across clients, in practical settings, where schema mismatches and incompatible feature spaces prevent straightforward aggregation. To address this challenge, this paper proposes FedLLM-Align, a federated learning framework that leverages pretrained transformer based language models for feature extraction. Towards this objective, FedLLM-Align serializes tabular records into text and derives semantically aligned embeddings from a pretrained LLM encoder, e.g, DistilBERT, facilitating lightweight local classifier heads that can be trained in a federated manner using standard aggregation schemes, e.g., FedAvg, while keeping all raw data records local. To quantify the merits and trade-offs of FedLLM-Align, we evaluate the proposed framework on binary classification tasks from two different domains: i) Coronary heart disease prediction on partitioned Framingham Heart Study data, and ii) Customer churn prediction on a financial dataset. FedLLM-Align outperforms state-of-the-art baselines by up to 25% in terms of the F1 score, under simulated schema heterogeneity, and achieves a 65% reduction in the communication overhead. These results establish FedLLM-Align as a privacy-preserving and communication-efficient approach for federated training based on clients with heterogeneous tabular datasets, commonly encountered in practice.

2509.25736 2026-02-02 cs.CL cs.AI cs.IT cs.NI math.IT

Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications

Chenhua Shi, Gregor Macdonald, Bhavika Jalli, Wanlu Lei, John Zou, Mridul Jain, Joji Philip

Comments 6 pages, 6 figures, 5 tables, IEEE ICC 2026

详情
英文摘要

The success of large language models (LLMs) depends heavily on large-scale, high-quality instruction-following and reinforcement datasets. However, generating such data through human annotation is prohibitively time-consuming particularly for domain-specific tasks like telecom network troubleshooting, where accurate responses require deep technical expertise and contextual understanding. In this paper, we present a fully automated, retrieval-augmented pipeline for generating synthetic question-answer (QA) pairs grounded in structured domain knowledge. Our multi-stage framework integrates a retriever, base generator, and refinement model to synthesize and enhance QA pairs using documents retrieved from a domain-specific knowledge graph. To ensure data quality, we employ customized RAGAS-based scoring to filter low-quality samples, producing a high-quality dataset suitable for reinforcement fine-tuning (RFT). We demonstrate our approach in a real-world telecom scenario focused on radio access network (RAN) troubleshooting. The resulting pipeline generates complex, context-rich troubleshooting solution plans without human intervention. This work offers a scalable solution for building instruction and reinforcement datasets in specialized domains, significantly reducing dependence on manual labeling while maintaining high technical fidelity.

2509.25562 2026-02-02 cs.AI cs.CL cs.CV cs.LG

IRIS: Intrinsic Reward Image Synthesis

Yihang Chen, Yuanhao Ban, Yunqi Hong, Cho-Jui Hsieh

详情
英文摘要

Despite the success of Reinforcement Learning from Human Feedback (RLHF) in language reasoning, its application to autoregressive Text-to-Image (T2I) generation is often constrained by the limited availability of human preference data. This paper explores how an autoregressive T2I model can learn from internal signals without relying on external rewards or labeled data. Contrary to recent findings in math and code reasoning, we show that minimizing self-certainty, rather than maximizing it, improves image generation. We observe that autoregressive T2I models with higher certainty are likely to generate simple and uniform images, which are less aligned with human preferences, and models with lower certainty are likely to generate vivid images rich in detail. Based on this observation, we propose IRIS(Intrinsic Reward Image Synthesis), the first framework to improve autoregressive T2I models with reinforcement learning using only an intrinsic reward. Empirical results demonstrate that applying IRIS to autoregressive T2I models achieves performance superior to those trained by individual external rewards, and matching those trained by ensemble external rewards. IRIS also incentivizes the emergence of nuanced CoT reasoning for high-quality image generation.

2509.21981 2026-02-02 cs.AI cs.MA

Collaborative Belief Reasoning with LLMs for Efficient Multi-Agent Collaboration

Zhimin Wang, Duo Wu, Shaokang He, Jinghe Wang, Linjia Kang, Jing Yu, Kai Zhu, Jiawei Li, Zhi Wang

详情
英文摘要

Effective real-world multi-agent collaboration requires not only accurate planning but also the ability to reason about collaborators' intents--a crucial capability for avoiding miscoordination and redundant communication under partial observable environments. Due to their strong planning and reasoning capabilities, large language models (LLMs) have emerged as promising autonomous agents for collaborative task solving. However, existing collaboration frameworks for LLMs overlook their reasoning potential for dynamic intent inference, and thus produce inconsistent plans and redundant communication, reducing collaboration efficiency. To bridge this gap, we propose CoBel-World, a novel framework that equips LLM agents with a Collaborative Belief World--an internal representation jointly modeling the physical environment and collaborators' mental states. CoBel-World enables agents to parse external open-world knowledge into structured beliefs via a symbolic belief representation module, and perform zero-shot Bayesian-style belief updates through LLM reasoning. This allows agents to proactively detect potential miscoordination (e.g., conflicting plans) and communicate adaptively. Evaluated on challenging embodied benchmarks (i.e., TDW-MAT and C-WAH), CoBel-World significantly reduces communication costs by 64-79% and improves task completion efficiency by 4-28% compared to the strongest baseline. Our results show that explicit, intent-aware belief modeling is essential for efficient and human-like collaboration in LLM-based multi-agent systems.

2509.21479 2026-02-02 cs.LG

Filtering with Confidence: When Data Augmentation Meets Conformal Prediction

Zixuan Wu, So Won Jeong, Yating Liu, Yeo Jin Jung, Claire Donnat

详情
英文摘要

With promising empirical performance across a wide range of applications, synthetic data augmentation appears a viable solution to data scarcity and the demands of increasingly data-intensive models. Its effectiveness lies in expanding the training set in a way that reduces estimator variance while introducing only minimal bias. Controlling this bias is therefore critical: effective data augmentation should generate diverse samples from the same underlying distribution as the training set, with minimal shifts. In this paper, we propose conformal data augmentation, a principled data filtering framework that leverages the power of conformal prediction to produce diverse synthetic data while filtering out poor-quality generations with provable risk control. Our method is simple to implement, requires no access to internal model logits, nor large-scale model retraining. We demonstrate the effectiveness of our approach across multiple tasks, including topic prediction, sentiment analysis, image classification, and fraud detection, showing consistent performance improvements of up to 40 percentage points (pp) in $F_1$ score over unaugmented baselines, and 4~pp over other filtered augmentation baselines.

2509.21282 2026-02-02 cs.LG cs.AI

It's Not You, It's Clipping: A Soft Trust-Region via Probability Smoothing for LLM RL

Madeleine Dwyer, Adam Sobey, Adriane Chapman

详情
英文摘要

Training large language models (LLMs) with reinforcement learning (RL) methods such as PPO and GRPO commonly relies on ratio clipping to stabilise updates. While effective at preventing instability, clipping discards information, introduces gradient discontinuities and can prevent exploration of better policies. Inspired by label smoothing, we propose Probability Smoothing Policy Optimisation (PSPO). PSPO smooths current policy probabilities toward the behaviour policy before computing importance ratios, creating a soft trust region that preserves gradients while preventing destabilising updates. Unlike prior soft clipping approaches that use sigmoid-based transformations which can suffer from vanishing gradients and saturation, our method uses a linear interpolation, providing simpler and more robust gradient preservation. Empirically, GR-PSPO outperforms clipping and sigmoid-based alternatives on mathematical reasoning benchmarks when refining models with prior domain knowledge, achieving an accuracy of 79.9% on GSM8K and 59.6% on MATH for Qwen2-Math-1.5B.

2509.19903 2026-02-02 cs.LG

Latent Iterative Refinement Flow: A Geometric Constrained Approach for Few-Shot Generation

Songtao Li, Tianqi Hou, Zhenyu Liao, Ting Gao

详情
英文摘要

Diffusion and flow-matching models trained with limited data often tend to memorize the training data instead of generalization, leading to severely reduced diversity. In this paper, we provide a dynamical perspective and identify this ``collapse-to-memorization'' phenomenon as a consequence of the \emph{velocity field collapse}, where the learned field degenerates into isolated point attractors and trap the sampling trajectories. Inspired by this novel view, we introduce \textbf{{\BLUE L}atent {\BLUE I}terative {\BLUE R}efinement {\BLUE F}low ({\BLUE LIRF})}, a geometry-aware framework for from-scratch training of diffusion models in the limited-data regime. By exploiting the intrinsic geometry of a semantically aligned latent space, LIRF progressively densifies the training data manifold via a \emph{generation--correction--augmentation} closed loop, thereby effectively resolving the velocity field collapse. Theoretical guarantee on the convergence of this manifold densification procedure is also provided. Experiments on FFHQ subsets and Low-Shot datasets demonstrate the advantageous performance of LIRF over existing diffusion models for limited-data generation, achieving significantly higher diversity and recall, with comparably good generative performance.

2509.15612 2026-02-02 cs.SD eess.AS

Thinking in cocktail party: Chain-of-Thought and reinforcement learning for target speaker automatic speech recognition

Yiru Zhang, Hang Su, Lichun Fan, Zhenbo Luo, Jian Luan

详情
英文摘要

Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advancement of Large Audio-Language Models (LALMs) has already brought some new insights to TS-ASR. However, significant room for optimization remains for the TS-ASR task within the LALMs architecture. While Chain of Thoughts (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, TS-ASR, which requires the model to deeply comprehend speech signals, differentiate various speakers, and handle overlapping utterances is particularly well-suited to a reasoning-guided approach. Therefore, we propose a novel framework that incorporates CoT and RL training into TS-ASR for performance improvement. A novel CoT dataset of TS-ASR is constructed, and the TS-ASR model is first trained on regular data and then fine-tuned on CoT data. Finally, the model is further trained with RL using selected data to enhance generalized reasoning capabilities. Experiment results show a significant improvement of TS-ASR performance with CoT and RL training, which demonstrates the effectiveness of the proposed CoT and RL training methods adapted for the TS-ASR task.

2509.15437 2026-02-02 cs.SD cs.AI cs.CR eess.AS

Impact of Phonetics on Speaker Identity in Adversarial Voice Attack

Daniyal Kabir Dar, Qiben Yan, Li Xiao, Arun Ross

Comments Additional figures for extended visualization: https://daniyalkabir.github.io/icassp-2026-results/

详情
英文摘要

Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-end ASR models have been widely studied, the phonetic basis of these perturbations and their effect on speaker identity remain underexplored. In this work, we analyze adversarial audio at the phonetic level and show that perturbations exploit systematic confusions such as vowel centralization and consonant substitutions. These distortions not only mislead transcription but also degrade phonetic cues critical for speaker verification, leading to identity drift. Using DeepSpeech as our ASR target, we generate targeted adversarial examples and evaluate their impact on speaker embeddings across genuine and impostor samples. Results across 16 phonetically diverse target phrases demonstrate that adversarial audio induces both transcription errors and identity drift, highlighting the need for phonetic-aware defenses to ensure the robustness of ASR and speaker recognition systems.

2509.15172 2026-02-02 cs.AI

Self-Improvement of Language Models by Post-Training on Multi-Agent Debate

Ankur Samanta, Akshayaa Magesh, Runzhe Wu, Ayush Jain, Youliang Yu, Daniel Jiang, Boris Vidolov, Paul Sajda, Yonathan Efroni, Kaveh Hassani

详情
英文摘要

Self-improvement, where models improve beyond their current performance without external supervision, remains a challenge. The core difficulty is sourcing a training signal stronger than what the model itself can currently produce. Majority voting has been shown to provide such a signal by aggregating over multiple samples, helping mitigate some of the inconsistencies in LM reasoning. In this work, we show that multi-agent debate--where models collaborate and exchange reasoning over multiple rounds--provides an even richer signal than single-round majority voting. We introduce Multi-Agent Consensus Alignment (MACA), which uses reinforcement learning (RL) to post-train models to effectively utilize multi-agent debate. We find that preference learning over full reasoning traces, learning to differentiate between majority and minority reasoning, is more effective than binary consensus rewards or SFT-based approaches for leveraging these debate signals. This produces three key improvements: models are (1) better at utilizing the multi-agent debate setting (+26.87% on MATH), (2) individually more accurate (+21.51% on MathQA), and (3) more self-consistent (+27.6% on GSM8K). We also see strong generalization to unseen benchmarks (+16.3% on GPQA, +11.6% on CommonsenseQA).

2509.15145 2026-02-02 cs.LG

Optimal Learning from Label Proportions with General Loss Functions

Lorne Applebaum, Travis Dick, Claudio Gentile, Haim Kaplan, Tomer Koren

详情
英文摘要

Motivated by problems in online advertising, we address the task of Learning from Label Proportions (LLP). We introduce a novel and versatile low-variance debiasing methodology to learn from aggregate label information, significantly advancing the state of the art in LLP. Our debiasing approach exhibits remarkable flexibility, seamlessly accommodating a broad spectrum of practically relevant loss functions across both binary and multi-class classification settings. By carefully combining our estimators with standard techniques, we improve sample complexity guarantees for a large class of losses of practical relevance. We also empirically validate the efficacy of our proposed approach across a diverse array of benchmark datasets, demonstrating compelling empirical advantages over standard baselines.

2509.14957 2026-02-02 cs.CV

DF-LLaVA: Unlocking MLLMs for Synthetic Image Detection via Knowledge Injection and Conflict-Driven Self-Reflection

Zhuokang Shen, Kaisen Zhang, Bohan Jia, Heming Jia, Yuan Fang, Zhou Yu, Shaohui Lin

Comments Under review

详情
英文摘要

With the increasing prevalence of synthetic images, evaluating image authenticity and locating forgeries accurately while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insights into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in terms of pure authenticity classification accuracy. To address this, we propose DF-LLaVA, a novel and effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first mines latent knowledge from the MLLM itself and then injects it into the model via fine-tuning. During inference, conflict signals arising from the model's predictions activate a self-reflection process, leading to the final refined responses. This framework allows LLaVA to achieve outstanding detection accuracy exceeding expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.

2509.11914 2026-02-02 cs.AI

EgoMem: Lifelong Memory Agent for Full-duplex Omnimodal Models

Yiqun Yao, Naitong Yu, Xiang Li, Xin Jiang, Xuezhi Fang, Wenjia Ma, Xuying Meng, Jing Li, Aixin Sun, Yequan Wang

详情
英文摘要

We introduce EgoMem, the first lifelong memory agent tailored for full-duplex models that process real-time omnimodal streams. EgoMem enables real-time models to recognize multiple users directly from raw audiovisual streams, to provide personalized response, and to maintain long-term knowledge of users' facts, preferences, and social relationships extracted from audiovisual history. EgoMem operates with three asynchronous processes: (i) a retrieval process that dynamically identifies user via face and voice, and gathers relevant context from a long-term memory; (ii) an omnimodal dialog process that generates personalized audio responses based on the retrieved context; and (iii) a memory management process that automatically detects dialog boundaries from omnimodal streams, and extracts necessary information to update the long-term memory. Unlike existing memory agents for LLMs, EgoMem relies entirely on raw audiovisual streams, making it especially suitable for lifelong, real-time, and embodied scenarios. Experimental results demonstrate that EgoMem's retrieval and memory management modules achieve over 95% accuracy on the test set. When integrated with a fine-tuned RoboEgo omnimodal chatbot, the system achieves fact-consistency scores above 87% in real-time personalized dialogs, establishing a strong baseline for future research.

2509.08312 2026-02-02 cs.AI

Leveraging AI Agents for Autonomous Networks: A Reference Architecture and Empirical Studies

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

Comments 7 pages, 5 figures. This manuscript has been accepted by IEEE Communication Magazine

详情
英文摘要

The evolution toward Level 4 (L4) Autonomous Networks (AN) represents a strategic inflection point in telecommunications, where networks must transcend reactive automation to achieve genuine cognitive capabilities--fulfilling TM Forum's vision of self-configuring, self-healing, and self-optimizing systems that deliver zero-wait, zero-touch, and zero-fault services. This work bridges the gap between architectural theory and operational reality by implementing Joseph Sifakis's AN Agent reference architecture in a functional cognitive system, deploying coordinated proactive-reactive runtimes driven by hybrid knowledge representation. Through an empirical case study of a Radio Access Network (RAN) Link Adaptation (LA) Agent, we validate this framework's transformative potential: demonstrating sub-10 ms real-time control in 5G NR sub-6 GHz while achieving 4% higher downlink throughput than Outer Loop Link Adaptation (OLLA) algorithms and 85% Block Error Rate (BLER) reduction for ultra-reliable services through dynamic Modulation and Coding Scheme (MCS) optimization. These improvements confirm the architecture's viability in overcoming traditional autonomy barriers and advancing critical L4-enabling capabilities toward next-generation objectives.

2509.06307 2026-02-02 cs.AI

Can AI Make Energy Retrofit Decisions? An Evaluation of Large Language Models

Lei Shu, Dong Zhao

Journal ref Buildings 2025, 15(22), 4081

详情
英文摘要

Conventional approaches to building energy retrofit decision making suffer from limited generalizability and low interpretability, hindering adoption in diverse residential contexts. With the growth of Smart and Connected Communities, generative AI, especially large language models (LLMs), may help by processing contextual information and producing practitioner readable recommendations. We evaluate seven LLMs (ChatGPT, DeepSeek, Gemini, Grok, Llama, and Claude) on residential retrofit decisions under two objectives: maximizing CO2 reduction (technical) and minimizing payback period (sociotechnical). Performance is assessed on four dimensions: accuracy, consistency, sensitivity, and reasoning, using a dataset of 400 homes across 49 US states. LLMs generate effective recommendations in many cases, reaching up to 54.5 percent top 1 match and 92.8 percent within top 5 without fine tuning. Performance is stronger for the technical objective, while sociotechnical decisions are limited by economic trade offs and local context. Agreement across models is low, and higher performing models tend to diverge from others. LLMs are sensitive to location and building geometry but less sensitive to technology and occupant behavior. Most models show step by step, engineering style reasoning, but it is often simplified and lacks deeper contextual awareness. Overall, LLMs are promising assistants for energy retrofit decision making, but improvements in accuracy, consistency, and context handling are needed for reliable practice.

2509.05322 2026-02-02 cs.CV cs.LG cs.SI physics.comp-ph

Application of discrete Ricci curvature in pruning randomly wired neural networks: A case study with chest x-ray classification of COVID-19

Pavithra Elumalai, Sudharsan Vijayaraghavan, Madhumita Mondal, Areejit Samal

Comments 21 pages, 4 figures, 9 tables

Journal ref J. Phys. Complex. 6: 045017 (2025)

详情
英文摘要

Randomly Wired Neural Networks (RWNNs) serve as a valuable testbed for investigating the impact of network topology in deep learning by capturing how different connectivity patterns impact both learning efficiency and model performance. At the same time, they provide a natural framework for exploring edge-centric network measures as tools for pruning and optimization. In this study, we investigate three edge-centric network measures: Forman-Ricci curvature (FRC), Ollivier-Ricci curvature (ORC), and edge betweenness centrality (EBC), to compress RWNNs by selectively retaining important synapses (or edges) while pruning the rest. As a baseline, RWNNs are trained for COVID-19 chest x-ray image classification, aiming to reduce network complexity while preserving performance in terms of accuracy, specificity, and sensitivity. We extend prior work on pruning RWNN using ORC by incorporating two additional edge-centric measures, FRC and EBC, across three network generators: Erdös-Rényi (ER) model, Watts-Strogatz (WS) model, and Barabási-Albert (BA) model. We provide a comparative analysis of the pruning performance of the three measures in terms of compression ratio and theoretical speedup. A central focus of our study is to evaluate whether FRC, which is computationally more efficient than ORC, can achieve comparable pruning effectiveness. Along with performance evaluation, we further investigate the structural properties of the pruned networks through modularity and global efficiency, offering insights into the trade-off between modular segregation and network efficiency in compressed RWNNs. Our results provide initial evidence that FRC-based pruning can effectively simplify RWNNs, offering significant computational advantages while maintaining performance comparable to ORC.

2509.02521 2026-02-02 cs.SD cs.AI cs.CL

FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang

详情
英文摘要

Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among different solutions to full-duplexity, a native solution merges multiple channels in each time step, achieving the lowest latency. However, prevailing designs break down the textual monologue sentences for word-level alignment with audio streams, which degrades language modeling abilities. To help address this issue, we introduce "contiguous monologues", which are composed by continuous sentences and "waiting" intervals, mimicking human-like cognitive behavior in dialogs. We find a proper training paradigm to be critical for semantically aligning contiguous monologues with audio. To this end, we develop a "dual" training paradigm that alternates the position of the monologues, either leading or trailing the audio, across different training stages. A combination of our contiguous monologue and dual training strategy is applied in developing FLM-Audio, our 7B spoken dialog chatbot with native full-duplexity. As confirmed by experimental results, FLM-Audio achieves superior response qualities and chatting experiences while requiring significantly less training data.

2509.00559 2026-02-02 cs.AI

Social World Models

Xuhui Zhou, Jiarui Liu, Akhila Yerukola, Hyunwoo Kim, Maarten Sap

详情
英文摘要

Humans intuitively navigate social interactions by simulating unspoken dynamics and reasoning about others' perspectives, even with limited information. In contrast, AI systems struggle to structure and reason about implicit social contexts, as they lack explicit representations for unobserved dynamics such as intentions, beliefs, and evolving social states. In this paper, we introduce the concept of social world models (SWMs) to characterize the complex social dynamics. To operationalize SWMs, we introduce a novel structured social world representation formalism (S3AP), which captures the evolving states, actions, and mental states of agents, addressing the lack of explicit structure in traditional free-text-based inputs. Through comprehensive experiments across five social reasoning benchmarks, we show that S3AP significantly enhances LLM performance-achieving a +51% improvement on FANToM over OpenAI's o1. Our ablations further reveal that these gains are driven by the explicit modeling of hidden mental states, which proves more effective than a wide range of baseline methods. Finally, we introduce an algorithm for social world models using S3AP, which enables AI agents to build models of their interlocutors and predict their next actions and mental states. Empirically, S3AP-enabled social world models yield up to +18% improvement on the SOTOPIA multi-turn social interaction benchmark. Our findings highlight the promise of S3AP as a powerful, general-purpose representation for social world states, enabling the development of more socially-aware systems that better navigate social interactions.

2508.17427 2026-02-02 cs.CV cs.RO

GMOR: A Lightweight Robust Point Cloud Registration Framework via Geometric Maximum Overlapping

Zhao Zheng, Jingfan Fan, Long Shao, Hong Song, Danni Ai, Tianyu Fu, Deqiang Xiao, Yongtian Wang, Jian Yang

详情
英文摘要

Point cloud registration based on correspondences computes the rigid transformation that maximizes the number of inliers constrained within the noise threshold. Current state-of-the-art (SOTA) methods employing spatial compatibility graphs or branch-and-bound (BnB) search mainly focus on registration under high outlier ratios. However, graph-based methods require at least quadratic space and time complexity for graph construction, while multi-stage BnB search methods often suffer from inaccuracy due to local optima between decomposed stages. This paper proposes a geometric maximum overlapping registration framework via rotation-only BnB search. The rigid transformation is decomposed using Chasles' theorem into a translation along rotation axis and a 2D rigid transformation. The optimal rotation axis and angle are searched via BnB, with residual parameters formulated as range maximum query (RMQ) problems. Firstly, the top-k candidate rotation axes are searched within a hemisphere parameterized by cube mapping, and the translation along each axis is estimated through interval stabbing of the correspondences projected onto that axis. Secondly, the 2D registration is relaxed to 1D rotation angle search with 2D RMQ of geometric overlapping for axis-aligned rectangles, which is solved deterministically in polynomial time using sweep line algorithm with segment tree. Experimental results on indoor 3DMatch/3DLoMatch scanning and outdoor KITTI LiDAR datasets demonstrate superior accuracy and efficiency over SOTA methods, while the time complexity is polynomial and the space complexity increases linearly with the number of points, even in the worst case.

2508.09190 2026-02-02 cs.LG cs.AI

Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining

Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng

详情
英文摘要

While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduce harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization capability against unforeseen emerging safety concerns.

2508.06915 2026-02-02 cs.LG

QuiZSF: A Retrieval-Augmented Framework for Zero-Shot Time Series Forecasting

Shichao Ma, Zhengyang Zhou, Qihe Huang, Binwu Wang, Yang Wang

详情
英文摘要

Accurate forecasting of sequential data streams is a cornerstone of modern Web services, supporting applications such as traffic management, user behavior modeling, and online anomaly prevention. However, in many Web environments, new domains emerge rapidly and labeled history data is scarce, which makes zero-shot forecasting particularly challenging. Existing time-series pre-trained models (TSPMs) show promise but they lack the ability to dynamically incorporate external knowledge, while conventional retrieval-augmented generation (RAG) methods are rarely extended beyond text. In this work, we present \textbf{QuiZSF}, a retrieval-augmented forecasting framework that integrates search and forecasting for time series data. The framework performs search by retrieving structurally similar sequences from a large-scale time-series database, and it performs forecasting by integrating the retrieved knowledge into the target sequence. Specifically, QuiZSF introduces a \textbf{ChronoRAG Base}, a hierarchical tree-structured database that enables scalable and domain-aware retrieval, a \textbf{Multi-grained Series Interaction Learner} that captures fine- and coarse-grained dependencies between target and retrieved sequences, and a \textbf{Model Cooperation Coherer} that adapts retrieved knowledge to TSPMs. This design teaches models to actively perform search, align auxiliary information across modalities, and leverage it for more accurate forecasting. Extensive experiments on five public benchmarks demonstrate that QuiZSF consistently outperforms strong baselines, ranking first in up to \textbf{87.5\%} of zero-shot forecasting settings while maintaining high efficiency.

2508.06309 2026-02-02 cs.CL math.PR

Matrix-Driven Identification and Reconstruction of LLM Weight Homology

Ruichong Zhang, Daniel Goldstein

详情
英文摘要

We propose Matrix-Driven Identification and Reconstruction (MDIR), a SOTA large language model homology method that accurately detects weight correspondences between models and provides rigorous $p$-value estimation of the statistical significance of these correspondences. Our method does not require model inference, and allows the detection of unattributed reuse or replication of model weights even on low-resource devices as it compares only a single pair of matrices at a time. We leverage matrix analysis, polar decomposition, and Large Deviation Theory (LDT) to achieve accurate reconstruction of weight relationships between models. Notably, MDIR is the first method to achieve perfect scores on both Area-Under-Curve (AUC) and accuracy metrics across different source models on LeaFBench.

2508.00459 2026-02-02 cs.AI

Thinking Machines: Mathematical Reasoning in the Age of LLMs

Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen

Journal ref Big Data and Cognitive Computing, 2026, 10(1), 38

详情
英文摘要

Large Language Models (LLMs) have demonstrated impressive capabilities in structured reasoning and symbolic tasks, with coding emerging as a particularly successful application. This progress has naturally motivated efforts to extend these models to mathematics, both in its traditional form, expressed through natural-style mathematical language, and in its formalized counterpart, expressed in a symbolic syntax suitable for automatic verification. Yet, despite apparent parallels between programming and proof construction, advances in formalized mathematics have proven significantly more challenging. This gap raises fundamental questions about the nature of reasoning in current LLM architectures, the role of supervision and feedback, and the extent to which such models maintain an internal notion of computational or deductive state. In this article, we review the current state-of-the-art in mathematical reasoning with LLMs, focusing on recent models and benchmarks. We explore three central issues at the intersection of machine learning and mathematical cognition: (i) the trade-offs between traditional and formalized mathematics as training and evaluation domains; (ii) the structural and methodological reasons why proof synthesis remains more brittle than code generation; and (iii) whether LLMs genuinely represent or merely emulate a notion of evolving logical state. Our goal is not to draw rigid distinctions but to clarify the present boundaries of these systems and outline promising directions for their extension.

2507.22911 2026-02-02 cs.CL cs.AI

ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang, Qingke Peng, Haozhou Li, Zeyuan Zeng, Jiangbo Zhang, Kaixuan Yang, Ningyong Wu, Qinfeng Song, Ruimeng Li, Biyi Zhou

详情
英文摘要

As power systems decarbonise and digitalise, high penetrations of distributed energy resources and flexible tariffs make electric power marketing (EPM) a key interface between regulation, system operation and sustainable-energy deployment. Many utilities still rely on human agents and rule- or intent-based chatbots with fragmented knowledge bases that struggle with long, cross-scenario dialogues and fall short of requirements for compliant, verifiable and DR-ready interactions. Meanwhile, frontier large language models (LLMs) show strong conversational ability but are evaluated on generic benchmarks that underweight sector-specific terminology, regulatory reasoning and multi-turn process stability. To address this gap, we present ElectriQ, a large-scale benchmark and evaluation framework for LLMs in EPM. ElectriQ contains over 550k dialogues across six service domains and 24 sub-scenarios and defines a unified protocol that combines human ratings, automatic metrics and two compliance stress tests-Statutory Citation Correctness and Long-Dialogue Consistency. Building on ElectriQ, we propose SEEK-RAG, a retrieval-augmented method that injects policy and domain knowledge during finetuning and inference. Experiments on 13 LLMs show that domain-aligned 7B models with SEEK-RAG match or surpass much larger models while reducing computational cost, providing an auditable, regulation-aware basis for deploying LLM-based EPM assistants that support demand-side management, renewable integration and resilient grid operation.

2507.11352 2026-02-02 cs.AI cs.FL

Foundation Models for Logistics: Toward Certifiable, Conversational Planning Interfaces

Yunhao Yang, Neel P. Bhatt, Christian Ellis, Samuel Li, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

详情
英文摘要

Logistics operators, from battlefield coordinators re-routing airlifts ahead of a storm to warehouse managers juggling late trucks, need to make mission-critical decisions. Prevailing methods for logistics planning such as integer programming yield plans that satisfy user-defined logical constraints, assuming an idealized mathematical model of the environment. On the other hand, foundation models lower the intermediate processing barrier by translating natural-language user utterances into executable plans, yet they remain prone to misinterpretations and hallucinations that jeopardize safety and cost. We introduce a Vision-Language Logistics (VLL) agent, built on a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on user-objective interpretation. The agent interprets user requests and converts them into structured planning specifications, quantifies the uncertainty of the interpretation, and invokes an interactive clarification loop when the uncertainty exceeds an adaptive threshold. Drawing on a lightweight airlift logistics planning use case as an illustrative case study, we highlight a practical path toward certifiable and user-aligned decision-making for complex logistics. Our lightweight model, fine-tuned on just 100 training samples, surpasses the zero-shot performance of 20x larger models in logistic planning tasks while cutting inference latency by nearly 50%.