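arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.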

2603.03319 2026-03-05 cs.CL cs.AI

Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

James Wedgwood, Chhavi Yadav, Virginia Smith

Abstract

Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.

2603.03318 2026-03-05 cs.CL cs.AI quant-ph

Quantum-Inspired Self-Attention in a Large Language Model

Nikita Kuznetsov, Niyaz Ismagilov, Ernesto Campos

Comments 8 pages, 7 figures

2603.03317 2026-03-05 cs.CL

Retcon -- a Prompt-Based Technique for Precise Control of LLMs in Conversations

David Kogan, Sam Nguyen, Masanori Suzuki, Feiyang Chen

Comments 5 pages, 2 figures, 3 appendixes with prompts and examples

2603.03316 2026-03-05 cs.CL cs.AI cs.CV

The Influence of Iconicity in Transfer Learning for Sign Language Recognition

Keren Artiaga, Conor Lynch, Haithem Afli, Mohammed Hasanuzzaman

2603.03315 2026-03-05 cs.CL cs.AI cs.LG

M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity

Stefano De Giorgis, Ting-Chih Chen, Filip Ilievski

Abstract

Internet memes are a powerful form of online communication, yet their nature and reliance on commonsense knowledge make toxicity detection challenging. Identifying key features for meme interpretation and understanding is a crucial task. Previous work has focused on individual elements that contribute to meaning: the Textual dimension via OCR, the Visual dimension via object recognition, upper layers of meaning such as the Emotional dimension, and Toxicity detection via proxy variables such as hate speech detection and sentiment analysis. Nevertheless, there is still no overall architecture that can formally identify the elements contributing to the meaning of a meme and be used in the sense-making process. In this work, we present a semantic framework and a corresponding benchmark for automatic knowledge extraction from memes. First, we identify the dimensions necessary to understand and interpret a meme: Textual material, Visual material, Scene, Background Knowledge, Emotion, Semiotic Projection, Analogical Mapping, Overall Intent, Target Community, and Toxicity Assessment. Second, the framework guides a semi-automatic process of generating a benchmark with commonsense question-answer pairs about meme toxicity assessment and its underlying reason. The resulting benchmark, M-QUEST, consists of 609 question-answer pairs for 307 memes. Third, we evaluate eight open-source large language models on their ability to correctly solve M-QUEST. Our results show that current models' commonsense reasoning capabilities for toxic meme interpretation vary depending on the dimension and architecture. Models with instruction tuning and reasoning capabilities significantly outperform the others, though pragmatic inference questions remain challenging. We release code, benchmark, and prompts to support future research at the intersection of multimodal content safety and commonsense reasoning.

2603.03314 2026-03-05 cs.CL cs.AI cs.LG

Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO

Xin Yang, Letian Li, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Wenyuan Jiang

2603.03313 2026-03-05 cs.CL cs.AI

How does fine-tuning improve sensorimotor representations in large language models?

Minghua Wu, Javier Conde, Pedro Reviriego, Marc Brysbaert

2603.03310 2026-03-05 cs.CL cs.LG

Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention

Andrew Kiruluta

2603.03309 2026-03-05 cs.CL cs.IR

Combating data scarcity in recommendation services: Integrating cognitive types of VARK and neural network technologies (LLM)

Nikita Zmanovskii

Comments 18 pages, 2 tables

2603.03307 2026-03-05 cs.CL cs.AI

TopicENA: Enabling Epistemic Network Analysis at Scale through Automated Topic-Based Coding

Owen H. T. Lu, Tiffany T. Y. Hsu

2603.03306 2026-03-05 cs.CL cs.AI

Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation

Ivan Matveev

Comments 9 pages, 2 figures, 2 tables. Benchmark code and data available at https://github.com/vetertann/TOON-generation-benchmark

Abstract

The recently presented Token-Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While TOON shows solid accuracy in LLM comprehension, it has not been tested against JSON for generation. Though never present in training data, TOON syntax is simple enough to suggest that one-shot in-context learning could support accurate generation. The inevitable prompt overhead can be an acceptable trade-off for shorter completions. To test this, we built a benchmark comprising test cases of varying structural complexity and a validation pipeline, and compared three settings: plain JSON generation, JSON generation with structured output (via constrained decoding), and TOON generation with one-shot in-context learning. JSON structured output was included to establish a minimum token budget baseline and to set a starting point for future experiments that enforce TOON through constrained decoding at inference time. Key findings: TOON shows a promising accuracy-to-token-consumption ratio for in-domain generation tasks, though this advantage is often reduced by the "prompt tax" of instructional overhead in shorter contexts. Plain JSON generation shows the best one-shot and final accuracy, even compared with constrained decoding structured output, whose only significant advantage is the lowest token usage, at the cost of slightly decreased accuracy overall and significant degradation for some models. Notably, for simple structures, this "lowest token usage" of constrained decoding outperformed even TOON, hinting that enforcing TOON via frameworks such as xgrammar may not yield the desired results. Furthermore, the results suggest a scaling hypothesis: TOON's true efficiency potential likely follows a non-linear curve, shining only beyond a specific point where cumulative syntax savings amortize the initial prompt overhead.

2603.03305 2026-03-05 cs.CL cs.AI cs.LG

Draft-Conditioned Constrained Decoding for Structured Generation in LLMs

Avinash Reddy, Thayne T. Walker, James S. Ide, Amrit Singh Bedi

2603.03304 2026-03-05 cs.LG cs.AI

Knowledge Graph and Hypergraph Transformers with Repository-Attention and Journey-Based Role Transport

Mahesh Godavarti

Comments 9 pages

2603.03303 2026-03-05 cs.CL cs.AI

HumanLM: Simulating Users with State Alignment Beats Response Imitation

Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, James Zou

Comments 27 pages, 17 figures, 9 tables

2603.03302 2026-03-05 cs.CL cs.AI cs.IR

Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs

Divija Amaram, Lu Gao, Gowtham Reddy Gudla, Tejaswini Sanjay Katale

2603.03301 2026-03-05 cs.CL cs.AI cs.LG

From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings

Dvir David Biton, Roy Friedman

2603.03300 2026-03-05 cs.CL

Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys

Mohamed Afane, Emaan Hariri, Derek Ouyang, Daniel E. Ho

Comments Accepted at the 5th ACM Symposium on Computer Science and Law (CS&Law '26)

2603.03299 2026-03-05 cs.CL

How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

MZ Naser

2603.03298 2026-03-05 cs.CL cs.AI

TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation

Bartosz Dziuba, Kacper Kuchta, Paweł Batorski, Przemysław Spurek, Paul Swoboda

2603.03297 2026-03-05 cs.CL cs.AI cs.LG

TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement

Haoyang He, Zihua Rong, Liangjie Zhao, Yunjia Zhao, Lan Yang, Honggang Zhang

Comments work in progress

2603.03296 2026-03-05 cs.CL cs.AI cs.IR

PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents

Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai

2603.03293 2026-03-05 cs.CL

SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

Jian Li, Yizhang Jin, Dongqi Liu, Hang Ding, Jiafu Wu, Dongsheng Chen, Yunhang Shen, Yulei Qin, Ying Tai, Chengjie Wang, Xiaotong Yuan, Yabiao Wang

2603.03290 2026-03-05 cs.CL cs.AI cs.IR cs.LG

AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents

Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Jingjing Wang, Xuanzhao Dong, Minzhou Huang, Rui Cai, Hejian Sang, Hao Wang, Peijie Qiu, Yueyue Deng, Prayag Tiwari, Brendan Hogan Rappazzo, Yalin Wang

2602.23541 2026-03-05 cs.AI cs.LG

Causal Identification from Counterfactual Data: Completeness and Bounding Results

Arvind Raghavan, Elias Bareinboim

2602.16511 2026-03-05 cs.RO

VIGOR: Visual Goal-In-Context Inference for Unified Humanoid Fall Safety

Osher Azulay, Zhengjie Xu, Andrew Scheffer, Stella X. Yu

2601.17473 2026-03-05 cs.LG

LeanTutor: Towards a Verified AI Mathematical Proof Tutor

Manooshree Patel, Rayna Bhattacharyya, Thomas Lu, Arnav Mehta, Niels Voss, Narges Norouzi, Gireeja Ranade

Comments This work was intended as a replacement of arXiv:2506.08321 and any subsequent updates will appear there

2601.17204 2026-03-05 cs.LG cs.CE

SpecBridge: Bridging Mass Spectrometry and Molecular Representations via Cross-Modal Alignment

Yinkai Wang, Yan Zhou Chen, Xiaohui Chen, Li-Ping Liu, Soha Hassoun

Comments We have found a problem in the preprocessing/evaluation pipeline

2601.07093 2026-03-05 cs.CV cs.AI

3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising

Peiyuan Jing, Yue Yang, Chun-Wun Cheng, Zhenxuan Zhang, Liutao Yang, Thiago V. Lima, Klaus Strobel, Antoine Leimgruber, Angelica Aviles-Rivero, Guang Yang, Javier A. Montoya-Zegarra

Comments 10 pages

2512.17198 2026-03-05 cs.LG

BumpNet: A Sparse MLP Framework for Learning PDE Solutions

Shao-Ting Chiu, Ioannis G. Kevrekidis, Ulisses Braga-Neto

2512.11781 2026-03-05 cs.RO cs.AI cs.MA

Agile Flight Emerges from Multi-Agent Competitive Racing

Vineet Pasumarti, Lorenzo Bianchi, Antonio Loquercio