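arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.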

2603.19261 2026-03-23 cs.CL cs.CV cs.LG

Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging

Azam Nouri

Comments 8 pages, 1 figure

Abstract

Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which favors compression but can conflate true adjacency cohesion with pairs that are frequent due to high marginal counts. This paper introduces Significance-Gain BPE, a drop-in alternative merge criterion that measures cohesion via a z-statistic under an independence null model and combines it with an explicit compression-aware gain term. Significance-Gain BPE is evaluated on WikiText-103 (raw) character slices using a small causal Transformer language model, reporting both token-dependent perplexity and the tokenizer-invariant metric bits per character (BPC). At a representative operating point, Significance-Gain BPE reduces validation and test perplexity by 13% and 12%, respectively, and improves validation and test BPC by about 0.9 to 1.0%. A vocabulary-size sweep further shows lower BPC in most closest-compression comparisons, suggesting that statistically grounded merge selection can improve predictive efficiency per unit of raw text across a range of compression regimes.
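The contrast the abstract draws between raw pair frequency and a z-statistic under an independence null can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the paper's formula, so the z-score below uses a simple binomial approximation (expected count from the product of unigram probabilities), the function names `pair_stats`, `best_by_frequency`, `best_by_z`, and `bits_per_char` are hypothetical, and the compression-aware gain term is left out because its form is not stated. The BPC helper uses the standard conversion from token-level perplexity to bits per character.

```python
import math
from collections import Counter


def pair_stats(tokens):
    """Return {(a, b): (count, z)} for every adjacent pair in `tokens`.

    z compares the observed pair count with the count expected if the two
    symbols occurred independently (an assumed binomial null model, not
    necessarily the paper's exact statistic).
    """
    n = len(tokens)
    n_pairs = n - 1
    unigram = Counter(tokens)
    pairs = Counter(zip(tokens, tokens[1:]))  # overlapping pairs are counted
    stats = {}
    for (a, b), count in pairs.items():
        p = (unigram[a] / n) * (unigram[b] / n)  # P(a) * P(b) under independence
        expected = n_pairs * p
        variance = n_pairs * p * (1.0 - p)
        z = (count - expected) / math.sqrt(variance) if variance > 0 else 0.0
        stats[(a, b)] = (count, z)
    return stats


def best_by_frequency(stats):
    """Standard BPE criterion: the most frequent adjacent pair."""
    return max(stats, key=lambda pair: stats[pair][0])


def best_by_z(stats):
    """Cohesion criterion: the pair with the largest z-score."""
    return max(stats, key=lambda pair: stats[pair][1])


def bits_per_char(perplexity, n_tokens, n_chars):
    """Convert token-level perplexity to bits per character."""
    return math.log2(perplexity) * n_tokens / n_chars


if __name__ == "__main__":
    # 'e' is frequent on its own, so ('e', 'e') is common by chance;
    # 'q' is almost always followed by 'u', so ('q', 'u') is cohesive.
    stats = pair_stats(list("eeeeeeeequququ"))
    print("frequency picks:", best_by_frequency(stats))
    print("z-score picks:  ", best_by_z(stats))
```

On this toy string the two criteria disagree: frequency selects ('e', 'e'), while the z-score selects ('q', 'u'), which is the kind of conflation between high marginal counts and true adjacency cohesion that the abstract describes.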

2603.19260 2026-03-23 cs.CL cs.AI cs.CV cs.CY cs.ET

HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation

Nada Shahin, Leila Ismail

2603.19259 2026-03-23 cs.CL cs.AI

Breeze Taigi: Benchmarks and Models for Taiwanese Hokkien Speech Recognition and Synthesis

Yu-Siang Lan, Chia-Sheng Liu, Yi-Chang Chen, Po-Chun Hsu, Allyson Chiu, Shun-Wen Lin, Da-shan Shiu, Yuan-Fu Liao

2603.19258 2026-03-23 cs.CL cs.AI cs.CR cs.LG

MAPLE: Metadata Augmented Private Language Evolution

Eli Chien, Yuzheng Hu, Ryan McKenna, Shanshan Wu, Zheng Xu, Peter Kairouz

Comments Preliminary work

2603.19257 2026-03-23 cs.CL

Constraint-aware Path Planning from Natural Language Instructions Using Large Language Models

Dylan Shim, Minghan Wei

Comments Accepted by 2026 SPIE Security + Defense Conference

2603.19256 2026-03-23 cs.CL

ShobdoSetu: A Data-Centric Framework for Bengali Long-Form Speech Recognition and Speaker Diarization

Md. Nazmus Sakib, Shafiul Tanvir, Mesbah Uddin Ahamed, H. M. Aktaruzzaman Mukdho

Comments 7 pages, 4 figures

2603.19255 2026-03-23 cs.CL cs.AI

LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models

Wei Zhang, Lintong Du, Yuanhe Zhang, Zhenhong Zhou, Kun Wang, Li Sun, Sen Su

Comments 19 pages, 6 figures

2603.19253 2026-03-23 cs.CL cs.AI

A comprehensive study of LLM-based argument classification: from Llama through DeepSeek to GPT-5.2

Marcin Pietroń, Filip Gampel, Jakub Gomułka, Andrzej Tomski, Rafał Olszowski

2603.19252 2026-03-23 cs.CL cs.AI

GeoChallenge: A Multi-Answer Multiple-Choice Benchmark for Geometric Reasoning with Diagrams

Yushun Zhang, Weiping Fu, Zesheng Yang, Bo Zhao, Lingling Zhang, Jian Zhang, Yumeng Fu, Jiaxing Huang, Jun Liu

Comments 18 pages, 10 figures, 8 tables

2603.19251 2026-03-23 cs.CL

Enhancing Legal LLMs through Metadata-Enriched RAG Pipelines and Direct Preference Optimization

Suyash Maniyar, Deepali Singh, Rohith Reddy

Comments 12 pages including Appendix

2603.19249 2026-03-23 cs.CL

Spelling Correction in Healthcare Query-Answer Systems: Methods, Retrieval Impact, and Empirical Evaluation

Saurabh K Singh

Comments 13 pages, 5 tables. Empirical study using TREC 2017 LiveQA Medical and HealthSearchQA datasets

2603.19248 2026-03-23 cs.CL cs.AI

DuCCAE: A Hybrid Engine for Immersive Conversation via Collaboration, Augmentation, and Evolution

Xin Shen, Zhishu Jiang, Jiaye Yang, Haibo Liu, Yichen Wan, Jiarui Zhang, Tingzhi Dai, Luodong Xu, Shuchen Wu, Guanqiang QI, Chenxi Miao, Jiahui Liang, Yang Li, Weikang Li, Deguo Xia, Jizhou Huang

2603.19247 2026-03-23 cs.CL cs.AI

When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models

Zafir Shamsi, Nikhil Chekuru, Zachary Guzman, Shivank Garg

Comments EACL SRW 2026, Oral

2603.18271 2026-03-23 cs.RO

SG-CoT: An Ambiguity-Aware Robotic Planning Framework using Scene Graph Representations

Akshat Rana, Peeyush Agarwal, K. P. S. Rana, Amarjit Malhotra

Comments This work has been submitted to the IEEE Robotics and Automation Letters for possible publication

2603.18202 2026-03-23 cs.LG cs.AI cs.RO

R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation

Naoki Morihira, Amal Nahar, Kartik Bharadwaj, Yasuhiro Kato, Akinobu Hayashi, Tatsuya Harada

Comments 20 pages, 12 figures, 2 tables

2603.18062 2026-03-23 cs.CV cs.AI

S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yujia Wang

2603.18048 2026-03-23 cs.AI cs.SD eess.AS

DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Yutong Zhang, Ziteng Wang, Ruofan Liao, Weisheng Xu, Sichen Liu

Comments 14 pages, 6 figures

2603.17021 2026-03-23 cs.AI

Generative AI-assisted Participatory Modeling in Socio-Environmental Planning under Deep Uncertainty

Zhihao Pei, Nir Lipovetzky, Angela M. Rojas-Arevalo, Fjalar J. de Haan, Enayat A. Moallemi

2603.16546 2026-03-23 cs.CL cs.AI

DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Lei Wang, Min Huang, Eduard Dragut

2603.14052 2026-03-23 cs.CV cs.MA

A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

Comments Accepted by CVPR 2026

2603.13748 2026-03-23 cs.RO cs.MA

Multi-Robot Coordination for Planning under Context Uncertainty

Pulkit Rustagi, Kyle Hollins Wray, Sandhya Saisubramanian

Comments 8 pages, 6 figures

2603.12680 2026-03-23 cs.CV

G2HFNet: GeoGran-Aware Hierarchical Feature Fusion Network for Salient Object Detection in Optical Remote Sensing Images

Bin Wan, Runmin Cong, Xiaofei Zhou, Hao Fang, Chengtao Lv, Sam Kwong

2603.01176 2026-03-23 cs.RO

Path Integral Particle Filtering for Hybrid Systems via Saltation Matrices

Karthik Shaji, Sreeranj Jayadevan, Bo Yuan, Hongzhe Yu, Yongxin Chen

2602.23148 2026-03-23 cs.AI

On Sample-Efficient Generalized Planning via Learned Transition Models

Nitin Gupta, Vishal Pallagani, John A. Aydin, Biplav Srivastava

Comments 14 pages; Extended version of short paper accepted at ICAPS 2026; updated with results and analysis

2602.21424 2026-03-23 cs.LG cs.AI

On the Structural Non-Preservation of Epistemic Behaviour under Policy Transformation

Alexander Galozy

Comments 15 pages, 3 figures. Under review at RLC 2026. Fixed references due to copy-paste errors

2602.19489 2026-03-23 cs.LG cs.AI

Federated Learning Playground

Bryan Shan, Alysa Ziying Tan, Han Yu

2602.08934 2026-03-23 cs.LG cs.AI cs.CR

StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

Suraj Ranganath, Atharv Ramesh

Comments Expanded version of a workshop submission. Code available

2602.07784 2026-03-23 cs.CV

Uncertainty-Aware Counterfactual Traffic Signal Control with Predictive Safety and Starvation-Avoidance Constraints Using Vision-Based Sensing

Jayawant Bodagala, Balaji Bodagala

Comments This work has been submitted to the IEEE for possible publication

2601.15275 2026-03-23 cs.CV cs.LG

RayRoPE: Projective Ray Positional Encoding for Multi-view Attention

Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani

Comments Project page: https://rayrope.github.io/

2601.09111 2026-03-23 cs.CV

Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning

Yang Li, Aming Wu, Zihao Zhang, Yahong Han

Comments Accepted by CVPR 2026