arXivDaily arXiv每日学术速递 周一至周五更新
2601.18796 2026-01-27 cs.CL cs.AI cs.LG

ctELM: Decoding and Manipulating Embeddings of Clinical Trials with Embedding Language Models

Brian Ondov, Chia-Hsuan Chang, Yujia Zhou, Mauro Giuffrè, Hua Xu

详情
英文摘要

Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we align Large Language Models to embeddings of clinical trials using the recently reported Embedding Language Model (ELM) method. We develop an open-source, domain-agnostic ELM architecture and training framework, design training tasks for clinical trials, and introduce an expert-validated synthetic dataset. We then train a series of ELMs exploring the impact of tasks and training regimes. Our final model, ctELM, can accurately describe and compare unseen clinical trials from embeddings alone and produce plausible clinical trials from novel vectors. We further show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.

2601.18793 2026-01-27 cs.PL

Handling Scope Checks (Extended Version)

Michael Lee, Ningning Xie, Oleg Kiselyov, Jeremy Yallop

Comments Extended version of Handling Scope Checks (POPL'26): includes appendices, fixes minor typos, and tweaks phrasing for readability

Journal ref Proceedings of the ACM on Programming Languages (PACMPL), Volume 10, POPL 2026, Article 39

详情
英文摘要

Metaprogramming and effect handlers interact in unexpected, and sometimes undesirable, ways. One example is scope extrusion: the generation of ill-scoped code. Scope extrusion can either be preemptively prevented, via static type systems, or retroactively detected, via dynamic checks. Static type systems exist in theory, but struggle with a range of implementation and usability problems in practice. In contrast, dynamic checks exist in practice (e.g. in MetaOCaml), but are understudied in theory. Designers of metalanguages are thus given little guidance regarding the design and implementation of checks. We present the first formal study of dynamic scope extrusion checks, introducing a calculus ($λ_{\langle\langle\text{op}\rangle\rangle}$) for describing and evaluating checks. Further, we introduce a novel dynamic check $\unicode{x2014}$ the "Cause-for-Concern" check $\unicode{x2014}$ which we prove correct, characterise without reference to its implementation, and argue combines the advantages of existing dynamic checks. Finally, we extend our framework with refined environment classifiers, which statically prevent scope extrusion, and compare their expressivity with the dynamic checks.

2601.18792 2026-01-27 cs.HC cs.CL cs.LG

MEGnifying Emotion: Sentiment Analysis from Annotated Brain Data

Brian Liu, Oiwi Parker Jones

详情
英文摘要

Decoding emotion from brain activity could unlock a deeper understanding of the human experience. While a number of existing datasets align brain data with speech and with speech transcripts, no datasets have annotated brain data with sentiment. To bridge this gap, we explore the use of pre-trained Text-to-Sentiment models to annotate non invasive brain recordings, acquired using magnetoencephalography (MEG), while participants listened to audiobooks. Having annotated the text, we employ force-alignment of the text and audio to align our sentiment labels with the brain recordings. It is straightforward then to train Brainto-Sentiment models on these data. Experimental results show an improvement in balanced accuracy for Brain-to-Sentiment compared to baseline, supporting the proposed approach as a proof-of-concept for leveraging existing MEG datasets and learning to decode sentiment directly from the brain.

2601.18791 2026-01-27 cs.CL cs.AI cs.LG

Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets

Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov

Comments 15 pages, 4 figues, 4 tables

详情
英文摘要

We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating to phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.

2601.18790 2026-01-27 cs.CL

MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts

Etienne Lanzeray, Stephane Meilliez, Malo Ruelle, Damien Sileo

详情
英文摘要

Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a "tunnel vision" that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.

2601.18788 2026-01-27 cs.CL cs.LG stat.ML

Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings

Mumin Jia, Jairo Diaz-Rodriguez

Comments arXiv admin note: substantial text overlap with arXiv:2510.03437. substantial text overlap with arXiv:2510.03437. substantial text overlap with arXiv:2510.03437. substantial text overlap with arXiv:2510.03437

详情
英文摘要

Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.

2601.18785 2026-01-27 cs.HC cs.AI cs.CL

Design Techniques for LLM-Powered Interactive Storytelling: A Case Study of the Dramamancer System

Tiffany Wang, Yuqian Sun, Yi Wang, Melissa Roemmele, John Joon Young Chung, Max Kreminski

Comments Extended abstract presented at the 2025 Wordplay Workshop at EMNLP

详情
英文摘要

The rise of Large Language Models (LLMs) has enabled a new paradigm for bridging authorial intent and player agency in interactive narrative. We consider this paradigm through the example of Dramamancer, a system that uses an LLM to transform author-created story schemas into player-driven playthroughs. This extended abstract outlines some design techniques and evaluation considerations associated with this system.

2601.18782 2026-01-27 eess.SP cs.NA eess.IV math.GR math.NA math.OC

Low-Bit Quantization of Bandlimited Graph Signals via Iterative Methods

Felix Krahmer, He Lyu, Rayan Saab, Jinna Qian, Anna Veselovska, Rongrong Wang

Comments 17 pages, 5 figures

详情
英文摘要

We study the quantization of real-valued bandlimited signals on graphs, focusing on low-bit representations. We propose iterative noise-shaping algorithms for quantization, including sampling approaches with and without vertex replacement. The methods leverage the spectral properties of the graph Laplacian and exploit graph incoherence to achieve high-fidelity approximations. Theoretical guarantees are provided for the random sampling method, and extensive numerical experiments on synthetic and real-world graphs illustrate the efficiency and robustness of the proposed schemes.

2601.18779 2026-01-27 cs.LG cs.AI cs.CL

POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar

详情
英文摘要

Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.

2601.18771 2026-01-27 cs.CL cs.AI cs.IR

Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory

Yanming Liu, Xinyue Peng, Zixuan Yan, Yanxin Shen, Wenjie Xu, Yuefeng Huang, Xinyi Wang, Jiannan Cao, Jianwei Yin, Xuhong Zhang

Comments Dep-Search 1st version

详情
英文摘要

Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs' ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.

2601.18766 2026-01-27 eess.AS cs.LG

Learning to Discover: A Generalized Framework for Raga Identification without Forgetting

Parampreet Singh, Somya Kumar, Chaitanya Shailendra Nitawe, Vipul Arora

Comments Accepted at NCC 2026 conference

详情
英文摘要

Raga identification in Indian Art Music (IAM) remains challenging due to the presence of numerous rarely performed Ragas that are not represented in available training datasets. Traditional classification models struggle in this setting, as they assume a closed set of known categories and therefore fail to recognise or meaningfully group previously unseen Ragas. Recent works have tried categorizing unseen Ragas, but they run into a problem of catastrophic forgetting, where the knowledge of previously seen Ragas is diminished. To address this problem, we adopt a unified learning framework that leverages both labeled and unlabeled audio, enabling the model to discover coherent categories corresponding to the unseen Ragas, while retaining the knowledge of previously known ones. We test our model on benchmark Raga Identification datasets and demonstrate its performance in categorizing previously seen, unseen, and all Raga classes. The proposed approach surpasses the previous NCD-based pipeline even in discovering the unseen Raga categories, offering new insights into representation learning for IAM tasks.

2601.18763 2026-01-27 cs.IT eess.SP math.IT

Multi-Stage Structured Estimators for Information Freshness

Sahan Liyanaarachchi, Sennur Ulukus, Nail Akar

详情
英文摘要

Most of the contemporary literature on information freshness solely focuses on the analysis of freshness for martingale estimators, which simply use the most recently received update as the current estimate. While martingale estimators are easier to analyze, they are far from optimal, especially in pull-based update systems, where maximum aposteriori probability (MAP) estimators are known to be optimal, but are analytically challenging. In this work, we introduce a new class of estimators called $p$-MAP estimators, which enable us to model the MAP estimator as a piecewise constant function with finitely many stages, bringing us closer to a full characterization of the MAP estimators when modeling information freshness.

2601.18761 2026-01-27 cs.ET

From Access Control to Usage Control with User-Managed Access

Wout Slabbinck, Wouter Termont, Ruben Dedecker, Beatriz Esteves

详情
英文摘要

Recent data protection and data governance regulations have intensified the demand for interoperable, decentralized data ecosystems that can support not only access control but also legally-aligned governance over data use. Existing Web-based data storage platforms increasingly struggle to meet these regulatory and practical requirements, as their authorization mechanisms rely on tightly coupled, document-centric access control models that lack expressiveness for legal constraints and fail to separate data management from authorization concerns. In parallel, widely adopted authorization standards remain poorly aligned with decentralized, semantically rich usage-control scenarios. To bridge this gap, this work introduces an architecture that replaces Solid's native access control mechanisms with a UMA authorization flow, enabling the enforcement of usage control policies expressed with the W3C ODRL standard. This article details the conceptual background motivating this approach, presents the proposed UMA-based architecture, and describes a prototype implementation that integrates an ODRL-enabled Authorization Server with a Solid-compatible Resource Server. The prototype demonstrates that decoupling authorization from storage enables more flexible, interoperable, and legally expressive control over data use, while remaining compatible with existing Solid infrastructure. It also highlights practical design choices required to evaluate ODRL policies in the absence of a fully standardized evaluation semantics. Moreover, this work shows how usage control can be operationalized using existing Web standards, offering a concrete path beyond permission-based access control toward policy-aware, legally informed data governance. Future research will focus on policy management interfaces, richer claim verification mechanisms, and techniques for communicating and enforcing obligations over time.

2601.18760 2026-01-27 cs.LG cs.CL

Beyond Preferences: Learning Alignment Principles Grounded in Human Reasons and Values

Henry Bell, Lara Neubauer da Costa Schertel, Bochu Ding, Brandon Fain

详情
英文摘要

A crucial consideration when developing and deploying Large Language Models (LLMs) is the human values to which these models are aligned. In the constitutional framework of alignment models are aligned to a set of principles (the constitution) specified in natural language. However, it is unclear how to fairly determine this constitution with widespread stakeholder input. In this work we propose Grounded Constitutional AI (GCAI), a unified framework for generating constitutions of principles that are representative of both users' general expectations toward AI (general principles) and their interaction-time preferences (contextual principles). We extend the Inverse Constitutional AI (ICAI) approach to generate contextual principles from human preference annotation data by leveraging human-provided \textit{reasons} for their preferences. We supplement these contextual principles with general principles surfaced from user statements of \textit{values} regarding AI. We show that a constitution generated by GCAI is preferred by humans over one generated through ICAI both personally, and for widespread use in governing AI behavior. Additionally participants consider the GCAI constitution to be more morally grounded, coherent, and pluralistic.

2601.18758 2026-01-27 math.NA cs.NA

Divergence-free and mass-conservative virtual element methods for the Navier-Stokes-Cahn-Hilliard system

Alberth Silgado, Giuseppe Vacca

Comments 33 pages, 5 figures

详情
英文摘要

In this work, we design and analyze semi/fully-discrete virtual element approximations for the time-dependent Navier--Stokes-Cahn--Hilliard equations, modeling the dynamics of two-phase incompressible fluid flows with diffuse interfaces. A new variational formulation is derived involving solely the velocity, pressure, and phase field, together with corresponding a priori energy estimates. The spatial discretization is based on the coupling divergence-free and $C^1$-conforming elements of high-order, while the time discretization employs a classical backward Euler scheme. By introducing a novel skew-symmetric trilinear form to discretize the convective term in the Cahn--Hilliard equation, we propose discrete schemes that satisfy mass conservation and energy bounds. Moreover, optimal error estimates are provided for both formulations. Finally, two numerical experiments are presented to support our theoretical findings and to illustrate the good performance of the proposed schemes for different polynomial degrees and polygonal meshes.

2601.18754 2026-01-27 cs.CR cs.AI

$α^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah

详情
英文摘要

Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents in reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $α^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $α^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers, including sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $α^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $α^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench

2601.18751 2026-01-27 cs.LG cs.AI

Trust, Don't Trust, or Flip: Robust Preference-Based Reinforcement Learning with Multi-Expert Feedback

Seyed Amir Hosseini, Maryam Abdolali, Amirhosein Tavakkoli, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi

Comments Equal contribution: Seyed Amir Hosseini and Maryam Abdolali. Corresponding author: Maryam Abdolali (maryam.abdolali@kntu.ac.ir)

详情
英文摘要

Preference-based reinforcement learning (PBRL) offers a promising alternative to explicit reward engineering by learning from pairwise trajectory comparisons. However, real-world preference data often comes from heterogeneous annotators with varying reliability; some accurate, some noisy, and some systematically adversarial. Existing PBRL methods either treat all feedback equally or attempt to filter out unreliable sources, but both approaches fail when faced with adversarial annotators who systematically provide incorrect preferences. We introduce TriTrust-PBRL (TTP), a unified framework that jointly learns a shared reward model and expert-specific trust parameters from multi-expert preference feedback. The key insight is that trust parameters naturally evolve during gradient-based optimization to be positive (trust), near zero (ignore), or negative (flip), enabling the model to automatically invert adversarial preferences and recover useful signal rather than merely discarding corrupted feedback. We provide theoretical analysis establishing identifiability guarantees and detailed gradient analysis that explains how expert separation emerges naturally during training without explicit supervision. Empirically, we evaluate TTP on four diverse domains spanning manipulation tasks (MetaWorld) and locomotion (DM Control) under various corruption scenarios. TTP achieves state-of-the-art robustness, maintaining near-oracle performance under adversarial corruption while standard PBRL methods fail catastrophically. Notably, TTP outperforms existing baselines by successfully learning from mixed expert pools containing both reliable and adversarial annotators, all while requiring no expert features beyond identification indices and integrating seamlessly with existing PBRL pipelines.

2601.18749 2026-01-27 cs.SE

Let's Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests

Haruhiko Yoshioka, Takahiro Monno, Haruka Tokumasu, Taiki Wakamatsu, Yuki Ota, Nimmi Weeraddana, Kenichi Matsumoto

Comments Accepted for publication in the 23rd International Conference on Mining Software Repositories (MSR '26) : 5 pages, 3 figures, 3 tables

详情
英文摘要

The automatic generation of pull requests (PRs) using AI agents has become increasingly common. Although AI-generated PRs are fast and easy to create, their merge rates have been reported to be lower than those created by humans. In this study, we conduct a large-scale empirical analysis of 40,214 PRs collected from the AIDev dataset. We extract 64 features across six families and fit statistical regression models to compare PR merge outcomes for human and agentic PRs, as well as across three AI agents. Our results show that submitter attributes dominate merge outcomes for both groups, while review-related features exhibit contrasting effects between human and agentic PRs. The findings of this study provide insights into improving PR quality through human-AI collaboration.

2601.18747 2026-01-27 cs.IR cs.AI cs.CC cs.CL cs.DB

Capturing P: On the Expressive Power and Efficient Evaluation of Boolean Retrieval

Amir Aavani

详情
英文摘要

Modern information retrieval is transitioning from simple document filtering to complex, neuro-symbolic reasoning workflows. However, current retrieval architectures face a fundamental efficiency dilemma when handling the rigorous logical and arithmetic constraints required by this new paradigm. Standard iterator-based engines (Document-at-a-Time) do not natively support complex, nested logic graphs; forcing them to execute such queries typically results in intractable runtime performance. Conversely, naive recursive approaches (Term-at-a-Time), while capable of supporting these structures, suffer from prohibitive memory consumption when enforcing broad logical exclusions. In this paper, we propose that a retrieval engine must be capable of ``Capturing $\mathbf{P}$'' -- evaluating any polynomial-time property directly over its index in a computationally efficient manner. We define a formal Retrieval Language ($\mathcal{L}_R$) based on Directed Acyclic Graphs (DAGs) and prove it precisely captures the complexity class $\mathbf{P}$. We introduce \texttt{ComputePN}, a novel evaluation algorithm that makes $\mathcal{L}_R$ tractable. By combining native DAG traversal with a memory-efficient ``Positive-Negative'' response mechanism, \texttt{ComputePN} ensures the efficient evaluation of any query in $\mathcal{L}_R$. This work establishes the theoretical foundation for turning the search index into a general-purpose computational engine.

2601.18745 2026-01-27 cs.LO cs.PL

Symmetric Proofs of Parameterized Programs

Ruotong Cheng, Azadeh Farzan

详情
英文摘要

We investigate the problem of safety verification of infinite-state parameterized programs that are formed based on a rich class of topologies. We introduce a new proof system, called parametric proof spaces, which exploits the underlying symmetry in such programs. This is a local notion of symmetry which enables the proof system to reuse proof arguments for isomorphic neighbourhoods in program topologies. We prove a sophisticated relative completeness result for the proof system with respect to a class of universally quantified invariants. We also investigate the problem of algorithmic construction of these proofs. We present a construction, inspired by classic results in model theory, where an infinitary limit program can be soundly and completely verified in place of the parameterized family, under some conditions. Furthermore, we demonstrate how these proofs can be constructed and checked against these programs without the need for axiomatization of the underlying topology for proofs or the programs. Finally, we present conditions under which our algorithm becomes a decision procedure.

2601.18740 2026-01-27 cs.IT math.IT

A Scanning-Based Indoor Optical Wireless Positioning System with Single VCSEL

Yicheng Dong, Rashid Iqbal, Julien Le Kernec, Hanaa Abumarshoud

详情
英文摘要

This paper presents a novel indoor visible light positioning (VLP) system utilising one vertical-cavity surface-emitting laser installed at the ceiling centre of a space. The system offers three-dimensional localisation by sweeping through space at one-degree resolution in two dimensions (azimuth and elevation), significantly simplifying hardware. Through incorporating the angle of arrival and received signal strength, this system demonstrates excellent precision in indoor positioning. Simulation results verify that the system attains sub-centimetre precision for most test points, outperforming conventional multi-transmitter VLP schemes in cost-efficiency and simplicity.

2601.18736 2026-01-27 cs.LG cs.NI

Benchmarking Machine Learning Models for IoT Malware Detection under Data Scarcity and Drift

Jake Lyon, Ehsan Saeedizade, Shamik Sengupta

详情
英文摘要

The rapid expansion of the Internet of Things (IoT) in domains such as smart cities, transportation, and industrial systems has heightened the urgency of addressing their security vulnerabilities. IoT devices often operate under limited computational resources, lack robust physical safeguards, and are deployed in heterogeneous and dynamic networks, making them prime targets for cyberattacks and malware applications. Machine learning (ML) offers a promising approach to automated malware detection and classification, but practical deployment requires models that are both effective and lightweight. The goal of this study is to investigate the effectiveness of four supervised learning models (Random Forest, LightGBM, Logistic Regression, and a Multi-Layer Perceptron) for malware detection and classification using the IoT-23 dataset. We evaluate model performance in both binary and multiclass classification tasks, assess sensitivity to training data volume, and analyze temporal robustness to simulate deployment in evolving threat landscapes. Our results show that tree-based models achieve high accuracy and generalization, even with limited training data, while performance deteriorates over time as malware diversity increases. These findings underscore the importance of adaptive, resource-efficient ML models for securing IoT systems in real-world environments.

2601.18735 2026-01-27 cs.AI cs.LG

Why Keep Your Doubts to Yourself? Trading Visual Uncertainties in Multi-Agent Bandit Systems

Jusheng Zhang, Yijia Fan, Kaitong Cai, Jing Yang, Jiawei Yao, Jian Wang, Guanlong Qu, Ziliang Chen, Keze Wang

Comments Accepted to ICLR 2026

详情
英文摘要

Vision-Language Models (VLMs) enable powerful multi-agent systems, but scaling them is economically unsustainable: coordinating heterogeneous agents under information asymmetry often spirals costs. Existing paradigms, such as Mixture-of-Agents and knowledge-based routers, rely on heuristic proxies that ignore costs and collapse uncertainty structure, leading to provably suboptimal coordination. We introduce Agora, a framework that reframes coordination as a decentralized market for uncertainty. Agora formalizes epistemic uncertainty into a structured, tradable asset (perceptual, semantic, inferential), and enforces profitability-driven trading among agents based on rational economic rules. A market-aware broker, extending Thompson Sampling, initiates collaboration and guides the system toward cost-efficient equilibria. Experiments on five multimodal benchmarks (MMMU, MMBench, MathVision, InfoVQA, CC-OCR) show that Agora outperforms strong VLMs and heuristic multi-agent strategies, e.g., achieving +8.5% accuracy over the best baseline on MMMU while reducing cost by over 3x. These results establish market-based coordination as a principled and scalable paradigm for building economically viable multi-agent visual intelligence systems.

2601.18733 2026-01-27 cs.RO cs.AI cs.CV

Advances and Innovations in the Multi-Agent Robotic System (MARS) Challenge

Li Kang, Heng Zhou, Xiufeng Song, Rui Li, Bruno N. Y. Chen, Ziye Wang, Ximeng Meng, Stone Tao, Yiran Qin, Xiaohong Liu, Ruimao Zhang, Lei Bai, Yilun Du, Hao Su, Philip Torr, Zhenfei Yin, Ruihao Gong, Yejun Zeng, Fengjun Zhong, Shenghao Jin, Jinyang Guo, Xianglong Liu, Xiaojun Jia, Tianqi Shan, Wenqi Ren, Simeng Qin, Jialing Yang, Xiaoyu Ma, Tianxing Chen, Zixuan Li, Zijian Cai, Yan Qin, Yusen Qin, Qiangyu Chen, Kaixuan Wang, Zhaoming Han, Yao Mu, Ping Luo, Yuanqi Yao, Haoming Song, Jan-Nico Zaech, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool

Comments MARS Challenge @ NeurIPS 2025 Workshop on Space in Vision, Language, and Embodied AI. Challenge page: https://mars-eai.github.io/MARS-Challenge-Webpage/

详情
英文摘要

Recent advancements in multimodal large language models and vision-languageaction models have significantly driven progress in Embodied AI. As the field transitions toward more complex task scenarios, multi-agent system frameworks are becoming essential for achieving scalable, efficient, and collaborative solutions. This shift is fueled by three primary factors: increasing agent capabilities, enhancing system efficiency through task delegation, and enabling advanced human-agent interactions. To address the challenges posed by multi-agent collaboration, we propose the Multi-Agent Robotic System (MARS) Challenge, held at the NeurIPS 2025 Workshop on SpaVLE. The competition focuses on two critical areas: planning and control, where participants explore multi-agent embodied planning using vision-language models (VLMs) to coordinate tasks and policy execution to perform robotic manipulation in dynamic environments. By evaluating solutions submitted by participants, the challenge provides valuable insights into the design and coordination of embodied multi-agent systems, contributing to the future development of advanced collaborative AI systems.

2601.18732 2026-01-27 econ.TH cs.AI

Optimal Use of Preferences in Artificial Intelligence Algorithms

Joshua S. Gans

Comments 54 pages, 2 figures

详情
英文摘要

Machine learning systems embed preferences either in training losses or through post-processing of calibrated predictions. Applying information design methods from Strack and Yang (2024), this paper provides decision problem agnostic conditions under which separation training preference free and applying preferences ex post is optimal. Unlike prior work that requires specifying downstream objectives, the welfare results here apply uniformly across decision problems. The key primitive is a diminishing-value-of-information condition: relative to a fixed (normalised) preference-free loss, preference embedding makes informativeness less valuable at the margin, inducing a mean-preserving contraction of learned posteriors. Because the value of information is convex in beliefs, preference-free training weakly dominates for any expected utility decision problem. This provides theoretical foundations for modular AI pipelines that learn calibrated probabilities and implement asymmetric costs through downstream decision rules. However, separation requires users to implement optimal decision rules. When cognitive constraints bind, as documented in human AI decision-making, preference embedding can dominate by automating threshold computation. These results provide design guidance: preserve optionality through post-processing when objectives may shift; embed preferences when decision-stage frictions dominate.

2601.18730 2026-01-27 cs.CL cs.LG

Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale

Henry Bell, Caroline Zhang, Mohammed Mobasserul Haque, Dhaval Potdar, Samia Zaman, Brandon Fain

详情
英文摘要

The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (such as to avoid using biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose \textsc{reflect}, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. \textsc{reflect} operates entirely in-context, combining a (i) constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. \textsc{reflect}'s technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that \textsc{reflect} significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model's original parameter fine-tuning, without sacrificing factual reasoning. \textsc{reflect} is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. Finally, we show that \textsc{reflect} naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.

2601.18724 2026-01-27 cs.CL cs.AI cs.DL

HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences

Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Comments Work In Progress

详情
英文摘要

Recently, we have often observed hallucinated citations or references that do not correspond to any existing work in papers under review, preprints, or published papers. Such hallucinated citations pose a serious concern to scientific reliability. When they appear in accepted papers, they may also negatively affect the credibility of conferences. In this study, we refer to hallucinated citations as "HalluCitation" and systematically investigate their prevalence and impact. We analyze all papers published at ACL, NAACL, and EMNLP in 2024 and 2025, including main conference, Findings, and workshop papers. Our analysis reveals that nearly 300 papers contain at least one HalluCitation, most of which were published in 2025. Notably, half of these papers were identified at EMNLP 2025, the most recent conference, indicating that this issue is rapidly increasing. Moreover, more than 100 such papers were accepted as main conference and Findings papers at EMNLP 2025, affecting the credibility.

2601.18723 2026-01-27 cs.RO

Trustworthy Evaluation of Robotic Manipulation: A New Benchmark and AutoEval Methods

Mengyuan Liu, Juyi Sheng, Peiming Li, Ziyi Wang, Tianming Xu, Tiantian Xu, Hong Liu

详情
英文摘要

Driven by the rapid evolution of Vision-Action and Vision-Language-Action models, imitation learning has significantly advanced robotic manipulation capabilities. However, evaluation methodologies have lagged behind, hindering the establishment of Trustworthy Evaluation for these behaviors. Current paradigms rely on binary success rates, failing to address the critical dimensions of trust: Source Authenticity (i.e., distinguishing genuine policy behaviors from human teleoperation) and Execution Quality (e.g., smoothness and safety). To bridge these gaps, we propose a solution that combines the Eval-Actions benchmark and the AutoEval architecture. First, we construct the Eval-Actions benchmark to support trustworthiness analysis. Distinct from existing datasets restricted to successful human demonstrations, Eval-Actions integrates VA and VLA policy execution trajectories alongside human teleoperation data, explicitly including failure scenarios. This dataset is structured around three core supervision signals: Expert Grading (EG), Rank-Guided preferences (RG), and Chain-of-Thought (CoT). Building on this, we propose the AutoEval architecture: AutoEval leverages Spatio-Temporal Aggregation for semantic assessment, augmented by an auxiliary Kinematic Calibration Signal to refine motion smoothness; AutoEval Plus (AutoEval-P) incorporates the Group Relative Policy Optimization (GRPO) paradigm to enhance logical reasoning capabilities. Experiments show AutoEval achieves Spearman's Rank Correlation Coefficients (SRCC) of 0.81 and 0.84 under the EG and RG protocols, respectively. Crucially, the framework possesses robust source discrimination capabilities, distinguishing between policy-generated and teleoperated videos with 99.6% accuracy, thereby establishing a rigorous standard for trustworthy robotic evaluation. Our project and code are available at https://term-bench.github.io/.

2601.18722 2026-01-27 cs.CL cs.LG

Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning

Lintang Sutawika, Gokul Swamy, Zhiwei Steven Wu, Graham Neubig

Comments Code available at https://github.com/lintangsutawika/SP3F

详情
英文摘要

When asked a question in a language less seen in its training data, current reasoning large language models (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce \texttt{SP3F} (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without \textit{any} data in the target language(s). First, we supervise fine-tune (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as \textit{privileged information}. Thus, even when none of the model's responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, \texttt{SP3F} greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than of the training data across the single-language, multilingual, and generalization to unseen language settings.

2601.18721 2026-01-27 math.NA cs.NA

A mixed interpolation-regression method for numerical integration on the unit circle using zeros of para-orthogonal polynomials

Ruymán Cruz-Barroso, Lidia Fernández, Francisco Marcellán

详情
英文摘要

A new alternative numerical procedure to the Szegő quadrature formulas for the estimation of integrals with respect to a positive Borel measure $μ$ supported on the unit circle is presented. As in many practical situations, we assume that the values of the integrand $F$ are only known at a finite number of points, which we will assume to be uniformly distributed on the unit circle (although this does not actually constitute a restriction). Our technique consists of obtaining an approximating Laurent polynomial $L$ to $F$ by interpolation in the Hermite sense in a collection of these points that mimic the zeros of a para-orthogonal polynomial with respect to $μ$, and to use the values of $F$ at the remaining nodes to improve the accuracy of the approximation by a process of simultaneous complex regression. Some numerical examples are carried out.