arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1717
2604.00023 2026-04-02 cs.CL

Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon

Mukhlis Amien, Go Frendi Gunawan

Comments 31 pages, 4 figures, 5 tables. Submitted to Oceanic Linguistics

详情
英文摘要

Basic vocabulary in many Sulawesi Austronesian languages includes forms resisting reconstruction to any proto-form with phonological patterns inconsistent with inherited roots, but whether this non-conforming vocabulary represents pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen's kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.

2604.00022 2026-04-02 cs.CL cs.AI

Criterion Validity of LLM-as-Judge for Business Outcomes in Conversational Commerce

Liang Chen, Qi Liu, Wenhuan Lin, Feng Liang

详情
英文摘要

Multi-dimensional rubric-based dialogue evaluation is widely used to assess conversational AI, yet its criterion validity -- whether quality scores are associated with the downstream outcomes they are meant to serve -- remains largely untested. We address this gap through a two-phase study on a major Chinese matchmaking platform, testing a 7-dimension evaluation rubric (implemented via LLM-as-Judge) against verified business conversion. Our findings concern rubric design and weighting, not LLM scoring accuracy: any judge using the same rubric would face the same structural issue. The core finding is dimension-level heterogeneity: in Phase 2 (n=60 human conversations, stratified sample, verified labels), Need Elicitation (D1: rho=0.368, p=0.004) and Pacing Strategy (D3: rho=0.354, p=0.006) are significantly associated with conversion after Bonferroni correction, while Contextual Memory (D5: rho=0.018, n.s.) shows no detectable association. This heterogeneity causes the equal-weighted composite (rho=0.272) to underperform its best dimensions -- a composite dilution effect that conversion-informed reweighting partially corrects (rho=0.351). Logistic regression controlling for conversation length confirms D3's association strengthens (OR=3.18, p=0.006), ruling out a length confound. An initial pilot (n=14) mixing human and AI conversations had produced a misleading "evaluation-outcome paradox," which Phase 2 revealed as an agent-type confound artifact. Behavioral analysis of 130 conversations through a Trust-Funnel framework identifies a candidate mechanism: AI agents execute sales behaviors without building user trust. We operationalize these findings in a three-layer evaluation architecture and advocate criterion validity testing as standard practice in applied dialogue evaluation.

2604.00021 2026-04-02 cs.CL cs.AI cs.CY

How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

Hiroki Fukui

Comments 34 pages, 7 figures, 4 tables. Preprint. OSF pre-registration: osf.io/4n5uf. Companion paper: arXiv:2603.04904

详情
英文摘要

Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level ($r = -0.161$ to $+0.256$, all $p > .22$; $N = 24$; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.

2604.00020 2026-04-02 cs.CL

Detecting Abnormal User Feedback Patterns through Temporal Sentiment Aggregation

Yalun Qi, Sichen Zhao, Zhiming Xue, Xianling Zeng, Zihan Yu

详情
英文摘要

In many real-world applications, such as customer feedback monitoring, brand reputation management, and product health tracking, understanding the temporal dynamics of user sentiment is crucial for early detection of anomalous events such as malicious review campaigns or sudden declines in user satisfaction. Traditional sentiment analysis methods focus on individual text classification, which is insufficient to capture collective behavioral shifts over time due to inherent noise and class imbalance in short user comments. In this work, we propose a temporal sentiment aggregation framework that leverages pretrained transformer-based language models to extract per-comment sentiment signals and aggregates them into time-window-level scores. Significant downward shifts in these aggregated scores are interpreted as potential anomalies in user feedback patterns. We adopt RoBERTa as our core semantic feature extractor and demonstrate, through empirical evaluation on real social media data, that the aggregated sentiment scores reveal meaningful trends and support effective anomaly detection. Experiments on real-world social media data demonstrate that our method successfully identifies statistically significant sentiment drops that correspond to coherent complaint patterns, providing an effective and interpretable solution for feedback anomaly monitoring.

2604.00019 2026-04-02 cs.CL cs.AI

The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation

Pavel Braslavski, Dmitrii Iarosh, Nikita Sushko, Andrey Sakhovskiy, Vasily Konovalov, Elena Tutubalina, Alexander Panchenko

Comments Accepted to LREC 2026

详情
英文摘要

We present a configurable pipeline for generating multilingual sets of entities with specified characteristics, such as domain, geographical location and popularity, using data from Wikipedia and Wikidata. These datasets are intended for evaluating the factuality of LLMs' long-form generation, thereby complementing evaluation based on short-form QA datasets. We present the RiDiC dataset as an example of this approach. RiDiC contains 3,000 entities from three domains -- rivers, natural disasters, and car models -- spanning different popularity tiers. Each entity is accompanied by its geographical location, English and Chinese names (if available) and relevant English and Chinese Wikipedia content, which is used to evaluate LLMs' responses. Generations about RiDiC entities were obtained from three LLMs in English and Chinese. These were then evaluated using a third-party factuality checker, which showed that entities from our dataset caused even frontier models to hallucinate. To facilitate the evaluation of LLMs' long-form factuality in multiple languages, the code, data, and generation/evaluation scripts have been released.

2604.00018 2026-04-02 cs.CL cs.AI

Think Twice Before You Write -- an Entropy-based Decoding Strategy to Enhance LLM Reasoning

Jiashu He, Meizhu Liu, Olaitan P Olaleye, Amit Agarwal, M. Avendi, Yassi Abbasi, Matthew Rowe, Hitesh Laxmichand Patel, Paul Li, Tao Sheng, Sujith Ravi, Dan Roth

详情
英文摘要

Decoding strategies play a central role in shaping the reasoning ability of large language models (LLMs). Traditional methods such as greedy decoding and beam search often suffer from error propagation, while sampling-based approaches introduce randomness without adequate robustness. Self-consistency improves reliability by aggregating multiple rollouts, but incurs significant computational overhead. We propose an entropy-guided decoding framework that introduces token-level adaptivity into generation. At each step, the model computes the entropy of the token distribution, identifies high-uncertainty positions, and selectively branches on these vulnerable points. A dynamic pool of partial rollouts is maintained and expanded until solutions are completed, concentrating computation where uncertainty is greatest and avoiding unnecessary exploration in confident regions. To enable efficient termination, we apply a rollout-level Entropy After </Think> (EAT) stopping criterion by performing entropy evaluation after the full reasoning trace, rather than incrementally at every step. Experiments on GSM8K, AMC2023, and their perturbed variants demonstrate that our method achieves consistently strong accuracy. Notably, on smaller LLMs, performance is comparable to GPT-5 while operating at a fraction of the cost.

2604.00017 2026-04-02 cs.CL

Semantic Shifts of Psychological Concepts in Scientific and Popular Media Discourse: A Distributional Semantics Analysis of Russian-Language Corpora

Orlova Anastasia

详情
英文摘要

This article examines semantic shifts in psychological concepts across scientific and popular media discourse using methods of distributional semantics applied to Russian-language corpora. Two corpora were compiled: a scientific corpus of approximately 300 research articles from the journals Psychology. Journal of the Higher School of Economics and Vestnik of Saint Petersburg University. Psychology (767,543 tokens) and a popular science corpus consisting of texts from the online psychology platforms Yasno and Chistye kogntsii (1,199,150 tokens). After preprocessing (OCR recognition, lemmatization, removal of stop words and non-informative characters), the corpora were analyzed through frequency analysis, clustering, and the identification of semantic associations. The results reveal significant differences in vocabulary and conceptual framing between the two discourse types: scientific texts emphasize methodological and clinical terminology, while popular science materials foreground everyday experience and therapeutic practice. A comparison of semantic associations for key concepts such as burnout and depression shows that scientific discourse links these terms to psychological resources, symptomatology, and diagnostic constructs, whereas popular science discourse frames them through personal narratives, emotions, and everyday situations. These findings demonstrate a clear shift from precise professional terminology toward more generalized and experiential meanings in popular media discourse and confirm the effectiveness of distributional semantics methods for identifying semantic transformations of psychological concepts across different communicative contexts.

2604.00016 2026-04-02 cs.CL cs.AI

Are they human? Detecting large language models by probing human memory constraints

Simon Schug, Brenden M. Lake

Comments Code available at https://github.com/smonsays/llm-humanness

详情
英文摘要

The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General-purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well-established cognitive phenomena to distinguish LLMs from humans.

2604.00015 2026-04-02 cs.CL

ASCAT: An Arabic Scientific Corpus and Benchmark for Advanced Translation Evaluation

Serry Sibaee, Khloud Al Jallad, Zineb Yousfi, Israa Elsayed Elhosiny, Yousra El-Ghawi, Batool Balah, Omer Nacar

详情
英文摘要

We present ASCAT (Arabic Scientific Corpus for Advanced Translation), a high-quality English-Arabic parallel benchmark corpus designed for scientific translation evaluation constructed through a systematic multi-engine translation and human validation pipeline. Unlike existing Arabic-English corpora that rely on short sentences or single-domain text, ASCAT targets full scientific abstracts averaging 141.7 words (English) and 111.78 words (Arabic), drawn from five scientific domains: physics, mathematics, computer science, quantum mechanics, and artificial intelligence. Each abstract was translated using three complementary architectures generative AI (Gemini), transformer-based models (Hugging Face \texttt{quickmt-en-ar}), and commercial MT APIs (Google Translate, DeepL) and subsequently validated by domain experts at the lexical, syntactic, and semantic levels. The resulting corpus contains 67,293 English tokens and 60,026 Arabic tokens, with an Arabic vocabulary of 17,604 unique words reflecting the morphological richness of the language. We benchmark three state-of-the-art LLMs on the corpus GPT-4o-mini (BLEU: 37.07), Gemini-3.0-Flash-Preview (BLEU: 30.44), and Qwen3-235B-A22B (BLEU: 23.68) demonstrating its discriminative power as an evaluation benchmark. ASCAT addresses a critical gap in scientific MT resources for Arabic and is designed to support rigorous evaluation of scientific translation quality and training of domain-specific translation models.

2604.00014 2026-04-02 cs.CL cs.HC

Disentangling Prompt Element Level Risk Factors for Hallucinations and Omissions in Mental Health LLM Responses

Congning Ni, Sarvech Qadir, Bryan Steitz, Mihir Sachin Vaidya, Qingyuan Song, Lantian Xia, Shelagh Mulvaney, Siru Liu, Hyeyoung Ryu, Leah Hecht, Amy Bucher, Christopher Symons, Laurie Novak, Susannah L. Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

Comments Submitted to AMIA 2026 Annual Symposium (under review)

详情
英文摘要

Mental health concerns are often expressed outside clinical settings, including in high-distress help seeking, where safety-critical guidance may be needed. Consumer health informatics systems increasingly incorporate large language models (LLMs) for mental health question answering, yet many evaluations underrepresent narrative, high-distress inquiries. We introduce UTCO (User, Topic, Context, Tone), a prompt construction framework that represents an inquiry as four controllable elements for systematic stress testing. Using 2,075 UTCO-generated prompts, we evaluated Llama 3.3 and annotated hallucinations (fabricated or incorrect clinical content) and omissions (missing clinically necessary or safety-critical guidance). Hallucinations occurred in 6.5% of responses and omissions in 13.2%, with omissions concentrated in crisis and suicidal ideation prompts. Across regression, element-specific matching, and similarity-matched comparisons, failures were most consistently associated with context and tone, while user-background indicators showed no systematic differences after balancing. These findings support evaluating omissions as a primary safety outcome and moving beyond static benchmark question sets.

2604.00012 2026-04-02 cs.CL cs.AI

Finding and Reactivating Post-Trained LLMs' Hidden Safety Mechanisms

Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang

详情
英文摘要

Despite the impressive performance of general-purpose large language models (LLMs), they often require fine-tuning or post-training to excel at specific tasks. For instance, large reasoning models (LRMs), such as the DeepSeek-R1 series, demonstrate strong reasoning capabilities after post-training different general large language models on diverse chain-of-thought (CoT) datasets. However, this additional training frequently comes at the cost of reduced safety, as the fine-tuned or post-trained models tend to exhibit more harmful behaviors compared with the regular LLMs before post-training or fine-tuning, potentially leading to harmful outcomes due to their enhanced capabilities. Taking LRMs as an example, we first investigate the underlying cause of this safety degradation in this paper. Our analysis reveals that post-training can mask the original safety mechanisms of the base LLM, while over-amplifying representations related to their post-training ability. But luckily, we also find that LRMs' safety mechanisms still exist instead of being removed during their post-training. Based on these findings, we propose a lightweight and cost-effective solution called SafeReAct that restores the suppressed safety behaviors by aligning with LoRA adapters on a few layers. Experiments on four state-of-the-art LRMs show that our method significantly improves safety on harmful prompts without compromising reasoning performance. Besides LRMs, additional results on other domain-specific LLMs, like medical models, further confirm the generality and effectiveness of our approach.

2604.00010 2026-04-02 cs.CL cs.AI

Can LLMs Perceive Time? An Empirical Investigation

Aniketh Garikaparthi

Comments ICLR 2026 I Can't Believe It's Not Better Workshop

详情
英文摘要

Large language models cannot estimate how long their own tasks take. We investigate this limitation through four experiments across 68 tasks and four model families. Pre-task estimates overshoot actual duration by 4--7$\times$ ($p < 0.001$), with models predicting human-scale minutes for tasks completing in seconds. Relative ordering fares no better: on task pairs designed to expose heuristic reliance, models score at or below chance (GPT-5: 18\% on counter-intuitive pairs, $p = 0.033$), systematically failing when complexity labels mislead. Post-hoc recall is disconnected from reality -- estimates diverge from actuals by an order of magnitude in either direction. These failures persist in multi-step agentic settings, with errors of 5--10$\times$. The models possess propositional knowledge about duration from training but lack experiential grounding in their own inference time, with practical implications for agent scheduling, planning and time-critical scenarios.

2604.00009 2026-04-02 cs.CL cs.AI

Eyla: Toward an Identity-Anchored LLM Architecture with Integrated Biological Priors -- Vision, Implementation Attempt, and Lessons from AI-Assisted Development

Arif Aditto

Comments 8 pages, 3 tables, 25 references. Preprint under review for workshop submission

详情
英文摘要

We present the design rationale, implementation attempt, and failure analysis of Eyla, a proposed identity-anchored LLM architecture that integrates biologically-inspired subsystems -- including HiPPO-initialized state-space models, zero-initialized adapters, episodic memory retrieval, and calibrated uncertainty training -- into a unified agent operating system running on consumer hardware. Unlike existing approaches that optimize models for generic helpfulness, Eyla targets identity consistency: the ability to maintain a coherent self-model under adversarial pressure, admit uncertainty, and resist manipulation. We propose the Identity Consistency Score (ICS), a novel benchmark for evaluating this property across LLMs. We then present an honest account of attempting to implement this architecture using AI coding assistants (Claude Code, Cursor) as a non-programmer, documenting a $1,000+ failure that produced a 1.27B parameter model with 86 brain subsystems contributing less than 2% to output. Our analysis identifies five systematic failure modes of AI-assisted development for novel architectures and offers concrete recommendations. To our knowledge, this is the first paper to combine an architectural vision with a documented first-person failure analysis of AI-assisted LLM development, providing lessons for both the AI systems and AI-assisted software engineering communities.

2604.00008 2026-04-02 cs.CL cs.AI

How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

Songhee Han, Jueun Shin, Jiyoon Han, Bung-Woo Jun, Hilal Ayan Karabatman

详情
英文摘要

As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive quality or comparison across models. This practice leaves model selection largely unexamined despite its potential influence on interpretive outcomes. To address this gap, this study examines whether LLM-as-judge evaluations meaningfully align with human judgments of interpretive quality and can inform model-level decision making. Using 712 conversational excerpts from semi-structured interviews with K-12 mathematics teachers, we generated one-sentence interpretive responses using five widely adopted inference models: Command R+ (Cohere), Gemini 2.5 Pro (Google), GPT-5.1 (OpenAI), Llama 4 Scout-17B Instruct (Meta), and Qwen 3-32B Dense (Alibaba). Automated evaluations were conducted using AWS Bedrock's LLM-as-judge framework across five metrics, and a stratified subset of responses was independently rated by trained human evaluators on interpretive accuracy, nuance preservation, and interpretive coherence. Results show that LLM-as-judge scores capture broad directional trends in human evaluations at the model level but diverge substantially in score magnitude. Among automated metrics, Coherence showed the strongest alignment with aggregated human ratings, whereas Faithfulness and Correctness revealed systematic misalignment at the excerpt level, particularly for non-literal and nuanced interpretations. Safety-related metrics were largely irrelevant to interpretive quality. These findings suggest that LLM-as-judge methods are better suited for screening or eliminating underperforming models than for replacing human judgment, offering practical guidance for systematic comparison and selection of LLMs in qualitative research workflows.

2604.00007 2026-04-02 cs.CL cs.AI

Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon, Mintaek Lim, Yunseok Han, Dogeun Kim, Hoeun Lee, Hyunggeun Kim, Jaeyoung Do

Comments Project Page: https://dynin.ai/omni/

详情
英文摘要

We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.

2604.00006 2026-04-02 cs.CL cs.CY cs.IR cs.LG

Scalable Identification and Prioritization of Requisition-Specific Personal Competencies Using Large Language Models

Wanxin Li, Denver McNeney, Nivedita Prabhu, Charlene Zhang, Renee Barr, Matthew Kitching, Khanh Dao Duc, Anthony S. Boyce

详情
英文摘要

AI-powered recruitment tools are increasingly adopted in personnel selection, yet they struggle to capture the requisition (req)-specific personal competencies (PCs) that distinguish successful candidates beyond job categories. We propose a large language model (LLM)-based approach to identify and prioritize req-specific PCs from reqs. Our approach integrates dynamic few-shot prompting, reflection-based self-improvement, similarity-based filtering, and multi-stage validation. Applied to a dataset of Program Manager reqs, our approach correctly identifies the highest-priority req-specific PCs with an average accuracy of 0.76, approaching human expert inter-rater reliability, and maintains a low out-of-scope rate of 0.07.

2604.00005 2026-04-02 cs.AI cs.CL

How Emotion Shapes the Behavior of LLMs and Agents: A Mechanistic Study

Moran Sun, Tianlin Li, Yuwei Zheng, Zhenhong Zhou, Aishan Liu, Xianglong Liu, Yang Liu

Comments 15 pages, 11 figures

详情
英文摘要

Emotion plays an important role in human cognition and performance. Motivated by this, we investigate whether analogous emotional signals can shape the behavior of large language models (LLMs) and agents. Existing emotion-aware studies mainly treat emotion as a surface-level style factor or a perception target, overlooking its mechanistic role in task processing. To address this limitation, we propose E-STEER, an interpretable emotion steering framework that enables direct representation-level intervention in LLMs and agents. It embeds emotion as a structured, controllable variable in hidden states, and with it, we examine the impact of emotion on objective reasoning, subjective generation, safety, and multi-step agent behaviors. The results reveal non-monotonic emotion-behavior relations consistent with established psychological theories, and show that specific emotions not only enhance LLM capability but also improve safety, and systematically shape multi-step agent behaviors.

2604.00004 2026-04-02 cs.CL cs.AI

LinearARD: Linear-Memory Attention Distillation for RoPE Restoration

Ning Yang, Hengyu Zhong, Wentao Wang, Baoliang Tian, Haijun Zhang, Jun Wang

详情
英文摘要

The extension of context windows in Large Language Models is typically facilitated by scaling positional encodings followed by lightweight Continual Pre-Training (CPT). While effective for processing long sequences, this paradigm often disrupts original model capabilities, leading to performance degradation on standard short-text benchmarks. We propose LinearARD, a self-distillation method that restores Rotary Position Embeddings (RoPE)-scaled students through attention-structure consistency with a frozen native-RoPE teacher. Rather than matching opaque hidden states, LinearARD aligns the row-wise distributions of dense $Q/Q$, $K/K$, and $V/V$ self-relation matrices to directly supervise attention dynamics. To overcome the quadratic memory bottleneck of $n \times n$ relation maps, we introduce a linear-memory kernel. This kernel leverages per-token log-sum-exp statistics and fuses logit recomputation into the backward pass to compute exact Kullback-Leibler divergence and gradients. On LLaMA2-7B extended from 4K to 32K, LinearARD recovers 98.3\% of the short-text performance of state-of-the-art baselines while surpassing them on long-context benchmarks. Notably, our method achieves these results using only \textbf{4.25M} training tokens compared to the \textbf{256M} tokens required by LongReD and CPT. Our code is available at https://github.com/gracefulning/LinearARD.

2604.00002 2026-04-02 cs.CL cs.AI

Benchmark for Assessing Olfactory Perception of Large Language Models

Eftychia Makri, Nikolaos Nakis, Laura Sisson, Gigi Minsky, Leandros Tassiulas, Vahid Satarifard, Nicholas A. Christakis

详情
英文摘要

Here we introduce the Olfactory Perception (OP) benchmark, designed to assess the capability of large language models (LLMs) to reason about smell. The benchmark contains 1,010 questions across eight task categories spanning odor classification, odor primary descriptor identification, intensity and pleasantness judgments, multi-descriptor prediction, mixture similarity, olfactory receptor activation, and smell identification from real-world odor sources. Each question is presented in two prompt formats, compound names and isomeric SMILES, to evaluate the effect of molecular representations. Evaluating 21 model configurations across major model families, we find that compound-name prompts consistently outperform isomeric SMILES, with gains ranging from +2.4 to +18.9 percentage points (mean approx +7 points), suggesting current LLMs access olfactory knowledge primarily through lexical associations rather than structural molecular reasoning. The best-performing model reaches 64.4\% overall accuracy, which highlights both emerging capabilities and substantial remaining gaps in olfactory reasoning. We further evaluate a subset of the OP across 21 languages and find that aggregating predictions across languages improves olfactory prediction, with AUROC = 0.86 for the best performing language ensemble model. LLMs should be able to handle olfactory and not just visual or aural information.

2603.27048 2026-04-02 cs.CV

MOOZY: A Patient-First Foundation Model for Computational Pathology

Yousef Kotp, Vincent Quoc-Huy Trinh, Christopher Pal, Mahdi S. Hosseini

详情
英文摘要

Computational pathology needs whole-slide image (WSI) foundation models that transfer across diverse clinical tasks, yet current approaches remain largely slide-centric, often depend on private data and expensive paired-report supervision, and do not explicitly model relationships among multiple slides from the same patient. We present MOOZY, a patient-first pathology foundation model in which the patient case, not the individual slide, is the core unit of representation. MOOZY explicitly models dependencies across all slides from the same patient via a case transformer during pretraining, combining multi-stage open self-supervision with scaled low-cost task supervision. In Stage 1, we pretrain a vision-only slide encoder on 77,134 public slide feature grids using masked self-distillation. In Stage 2, we align these representations with clinical semantics using a case transformer and multi-task supervision over 333 tasks from 56 public datasets, including 205 classification and 128 survival tasks across four endpoints. Across eight held-out tasks with five-fold frozen-feature probe evaluation, MOOZY achieves best or tied-best performance on most metrics and improves macro averages over TITAN by +7.37%, +5.50%, and +7.83% and over PRISM by +8.83%, +10.70%, and +9.78% for weighted F1, weighted ROC-AUC, and balanced accuracy, respectively. MOOZY is also parameter efficient with 85.77M parameters, 14x smaller than GigaPath. These results demonstrate that open, reproducible patient-level pretraining yields transferable embeddings, providing a practical path toward scalable patient-first histopathology foundation models.

2603.24639 2026-04-02 cs.LG cs.AI

Experiential Reflective Learning for Self-Improving LLM Agents

Marc-Antoine Allard, Arnaud Teinturier, Victor Xing, Gautier Viaud

Comments Published as a conference paper at the ICLR 2026 MemAgents Workshop

详情
英文摘要

Recent advances in large language models (LLMs) have enabled the development of autonomous agents capable of complex reasoning and multi-step problem solving. However, these agents struggle to adapt to specialized environments and do not leverage past interactions, approaching each new task from scratch regardless of their accumulated experience. We introduce Experiential Reflective Learning (ERL), a simple self-improvement framework that enables rapid environment adaptation through experiential learning. ERL reflects on task trajectories and outcomes to generate heuristics, capturing actionable lessons that transfer across tasks. At test time, relevant heuristics are retrieved based on the current task and injected into the agent's context to guide execution. On the Gaia2 benchmark, ERL improves success rate by 7.8% over a ReAct baseline, with large gains in task completion reliability, and outperforms prior experiential learning methods. Through systematic ablations, we find that selective retrieval is essential and that heuristics provide more transferable abstractions than few-shot trajectory prompting. These results demonstrate that reflecting on single-attempt experiences to extract transferable heuristics enables effective agent self-improvement.

2603.20895 2026-04-02 cs.CL cs.LG

LLM Router: Rethinking Routing with Prefill Activations

Tanay Varshney, Annie Surla, Michelle Xu, Gomathy Venkata Krishnan, Maximilian Jeblick, David Austin, Neal Vaidya, Davide Onofrio

详情
英文摘要

LLMs often achieve similar average benchmark accuracies while exhibiting complementary strengths on different subsets of queries, suggesting that a router with query-specific model selection can outperform any single model. While existing routers rely on semantic query features, they often fail to capture model-specific failures or intrinsic task difficulty. We instead study routing via internal prefill activations. Our key idea, Encoder-Target Decoupling, separates the model that produces the predictive signal (the Encoder) from the model whose correctness is being estimated (the Target), allowing open-weight encoders to predict the performance of closed-source target models. We evaluate layerwise geometric probes, finding that Fisher Separability (J) effectively identifies informative layers, supported by Effective Dimensionality (d_eff) diagnostics. We then utilize a SharedTrunkNet, a joint multi-output MLP that predicts simultaneous correctness probabilities across candidate models using concatenated prefill features. In our experiments, SharedTrunkNet consistently outperforms semantic baselines. At its best, SharedTrunkNet closes 45.58% of the gap between the strongest standalone model and the oracle while achieving 74.31% cost savings relative to the most expensive model. These results demonstrate that prefill activations provide a robust routing signal, establishing mechanistic routing as a high-performance alternative to purely semantic selection.

2603.15929 2026-04-02 cs.AI math.AP math.LO

Semi-Autonomous Formalization of the Vlasov-Maxwell-Landau Equilibrium

Vasily Ilin

Comments 11 figures

详情
英文摘要

We present a complete Lean 4 formalization of the equilibrium characterization in the Vlasov-Maxwell-Landau (VML) system, which describes the motion of charged plasma. The project demonstrates the full AI-assisted mathematical research loop: an AI reasoning model (Gemini DeepThink) generated the proof from a conjecture, an agentic coding tool (Claude Code) translated it into Lean from natural-language prompts, a specialized prover (Aristotle) closed 111 lemmas, and the Lean kernel verified the result. A single mathematician supervised the process over 10 days at a cost of \$200, writing zero lines of code. The entire development process is public: all 229 human prompts, and 213 git commits are archived in the repository. We report detailed lessons on AI failure modes -- hypothesis creep, definition-alignment bugs, agent avoidance behaviors -- and on what worked: the abstract/concrete proof split, adversarial self-review, and the critical role of human review of key definitions and theorem statements. Notably, the formalization was completed before the final draft of the corresponding math paper was finished.

2603.10035 2026-04-02 cs.CL

TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

Dipankar Srirag, Quoc Dung Nguyen, Aditya Joshi, Padmanesan Narasimhan, Salil Kanhere

Comments 6 pages, 3 figures, 2 tables

详情
英文摘要

Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.

2603.05537 2026-04-02 cs.CV eess.IV

Sketch It Out: Exploring Label-Free Structural Cues for Multimodal Gait Recognition

Chao Zhang, Zhuang Zheng, Ruixin Li, Zhanyong Mei

Comments 10 pages, 3 figures

详情
英文摘要

Gait recognition is a non-intrusive biometric technique for security applications, yet existing studies are dominated by silhouette- and parsing-based representations. Silhouettes are sparse and miss internal structural details, limiting discriminability. Parsing enriches silhouettes with part-level structures, but relies heavily on upstream human parsers (e.g., label granularity and boundary precision), leading to unstable performance across datasets and sometimes even inferior results to silhouettes. We revisit gait representations from a structural perspective and describe a design space defined by edge density and supervision form: silhouettes use sparse boundary edges with weak single-label supervision, while parsing uses denser cues with strong semantic priors. In this space, we identify an underexplored paradigm: dense part-level structure without explicit semantic labels, and introduce SKETCH as a new visual modality for gait recognition. Sketch extracts high-frequency structural cues (e.g., limb articulations and self-occlusion contours) directly from RGB images via edge-based detectors in a label-free manner. We further show that label-guided parsing and label-free sketch are semantically decoupled and structurally complementary. Based on this, we propose SKETCHGAIT, a hierarchically disentangled multi-modal framework with two independent streams for modality-specific learning and a lightweight early-stage fusion branch to capture structural complementarity. Extensive experiments on SUSTech1K and CCPG validate the proposed modality and framework: SketchGait achieves 92.9% Rank-1 on SUSTech1K and 93.1% mean Rank-1 on CCPG.

2512.19283 2026-04-02 cs.CV

OmniEgoCap: Camera-Agnostic Sequence-Level Egocentric Motion Reconstruction

Kyungwon Cho, Hanbyul Joo

Comments Project Page: https://kyungwoncho.github.io/OmniEgoCap/

详情
英文摘要

The proliferation of commercial egocentric devices offers a unique lens into human behavior, yet reconstructing full-body 3D motion remains difficult due to frequent self-occlusion and the 'out-of-sight' nature of the wearer's limbs. While head and hand trajectories provide sparse anchor points, current methods often overfit to specific hardware optics or rely on expensive, post-hoc optimizations that compromise motion naturalness. In this paper, we present OmniEgoCap, a unified diffusion framework that scales egocentric reconstruction to diverse capture setups. By shifting from short-term windowed estimation to sequence-level inference, our method captures a global perspective and recovers invariant physical attributes, such as height and body proportions, that provide critical constraints for disambiguating head-only cues. To ensure hardware-agnostic generalization, we introduce a geometry-aware visibility augmentation strategy that treats intermittent hand appearances as principled geometric constraints rather than missing data. Our architecture jointly predicts temporally coherent motion and consistent body shape, establishing a new state-of-the-art on public benchmarks and demonstrating robust performance across diverse, in-the-wild environments.

2512.04485 2026-04-02 cs.CV

Not All Birds Look The Same: Identity-Preserving Generation For Birds

Aaron Sun, Oindrila Saha, Subhransu Maji

详情
英文摘要

Since the advent of controllable image generation, increasingly rich modes of control have enabled greater customization and accessibility for everyday users. Zero-shot, identity-preserving models such as Insert Anything and OminiControl now support applications like virtual try-on without requiring additional fine-tuning. While these models may be fitting for humans and rigid everyday objects, they still have limitations for non-rigid or fine-grained categories. These domains often lack accessible, high-quality data -- especially videos or multi-view observations of the same subject -- making them difficult both to evaluate and to improve upon. Yet, such domains are essential for moving beyond content creation toward applications that demand accuracy and fine detail. Birds are an excellent domain for this task: they exhibit high diversity, require fine-grained cues for identification, and come in a wide variety of poses. We introduce the NABirds Look-Alikes (NABLA) dataset, consisting of 4,759 expert-curated image pairs. Together with 1,073 pairs collected from multi-image observations on iNaturalist and a small set of videos, this forms a benchmark for evaluating identity-preserving generation of birds. We show that state-of-the-art baselines fail to maintain identity on this dataset, and we demonstrate that training on images grouped by species, age, and sex -- used as a proxy for identity -- substantially improves performance on both seen and unseen species.

2511.22690 2026-04-02 cs.CV

Ar2Can: An Architect and an Artist Leveraging a Canvas for Multi-Human Generation

Shubhankar Borse, Phuc Pham, Farzad Farhadzadeh, Seokeon Choi, Phong Ha Nguyen, Anh Tuan Tran, Sungrack Yun, Munawar Hayat, Fatih Porikli

Comments Accepted to CVPR 2026

详情
英文摘要

Despite recent advances in personalized image generation, existing models consistently fail to produce reliable multi-human scenes, often merging or losing facial identity. We present Ar2Can, a novel two-stage framework that disentangles spatial planning from identity rendering for multi-human generation. The Architect predicts structured layouts, specifying where each person should appear. The Artist then synthesizes photorealistic images, guided by a spatially-grounded face matching reward that combines Hungarian spatial alignment with identity similarity. This approach ensures faces are rendered at correct locations and faithfully preserve reference identities. We develop two Architect variants, seamlessly integrated with our diffusion-based Artist model. This is optimized via Group Relative Policy Optimization (GRPO) using compositional rewards for count accuracy, image quality, and identity matching. Evaluated on the MultiHuman-Testbench, Ar2Can achieves substantial improvements in both count accuracy and identity preservation, while maintaining high perceptual quality. Notably, our method achieves these results using primarily synthetic data, without requiring real multi-human images. Project page: https://qualcomm-ai-research.github.io/ar2can/.

2510.01399 2026-04-02 cs.CV

Resolving the Identity Crisis in Text-to-Image Generation

Shubhankar Borse, Farzad Farhadzadeh, Munawar Hayat, Fatih Porikli

Comments Accepted to CVPR 2026

详情
英文摘要

State-of-the-art text-to-image models suffer from a persistent identity crisis when generating scenes with multiple humans: producing duplicate faces, merging identities, and miscounting individuals. We present DisCo (Reinforcement with Diversity Constraints), a reinforcement learning framework that directly optimizes identity diversity both within images and across groups of generated samples. DisCo fine-tunes flow-matching models using Group-Relative Policy Optimization (GRPO), guided by a compositional reward that: (i) penalizes facial similarity within images, (ii) discourages identity repetition across samples, (iii) enforces accurate person counts, and (iv) preserves visual fidelity and prompt alignment via human preference scores. A single-stage curriculum stabilizes training as prompt complexity increases. Importantly, this method does not require any real data. On the DiverseHumans Testset, DisCo achieves 98.6% Unique Face Accuracy and near-perfect Global Identity Spread, outperforming open-source and proprietary models (e.g., Gemini, GPT-Image) while maintaining perceptual quality. Our results establish cross-sample diversity as a critical axis for resolving identity collapse, positioning DisCo as a scalable, annotation-free solution for multi-human image synthesis. Project page: https://qualcomm-ai-research.github.io/disco/

2509.16483 2026-04-02 cs.CV

Octree Diffusion for Semantic Scene Generation and Completion

Xujia Zhang, Brendan Crowe, Christoffer Heckman

Comments Accepted to ICRA 2026. Revised version with updated paragraphs

详情
Journal ref
Proceedings of the 2026 IEEE International Conference on Robotics and Automation (ICRA)
英文摘要

The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches seek to decouple these problems and solve them one-off. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g.\ indoor vs.\ outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxel-level semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting, or outpainting respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or finetuning. We demonstrate high-quality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.