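arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.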

2601.02993 2026-04-22 cs.CL

Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation

Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng

Comments: Accepted to ACL 2026 Main

Details

Abstract

Retrieval-Augmented Generation (RAG) has become a key paradigm for reducing factual hallucinations in Large Language Models (LLMs), yet little is known about how the order of retrieved documents affects model behavior. We empirically show that under a Top-5 retrieval setting with the gold document included, LLM answers vary substantially across permutations of the retrieved set, even when the gold document is fixed in the first position. This reveals a previously underexplored sensitivity to retrieval permutations. Although existing robust RAG methods focus primarily on enhancing LLM robustness to low-quality retrieval and mitigating positional bias to distribute attention fairly over long contexts, neither approach directly addresses permutation sensitivity. In this paper, we propose Stable-RAG, which exploits permutation sensitivity estimation to mitigate permutation-induced hallucinations. Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these reasoning results to align hallucinated outputs toward the correct answer, encouraging the model to produce consistent and accurate predictions across document permutations. Experiments on three QA datasets show that Stable-RAG improves answer accuracy, reasoning consistency, and generalization across datasets, retrievers, and input lengths compared with strong baselines.
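The permutation probe described in the abstract (gold document fixed in the first position, remaining documents reordered) can be sketched roughly as follows. This is only an illustration of how answer variation across orderings might be measured, not the Stable-RAG method itself: the `generate` callable, the `toy_generate` stand-in, and the document names are all hypothetical, and the hidden-state clustering step is not shown.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence


def permutation_consistency(
    question: str,
    gold_doc: str,
    other_docs: Sequence[str],
    generate: Callable[[str, Sequence[str]], str],
) -> tuple[float, Counter]:
    """Return the share of orderings that agree with the majority answer."""
    answers: Counter = Counter()
    # Keep the gold document first and permute only the remaining documents.
    for order in permutations(other_docs):
        docs = [gold_doc, *order]
        answers[generate(question, docs)] += 1
    total = sum(answers.values())
    majority_count = answers.most_common(1)[0][1]
    return majority_count / total, answers


def toy_generate(question: str, docs: Sequence[str]) -> str:
    # Hypothetical stand-in for an LLM call: it is distracted whenever
    # the document "d4" appears directly after the gold document.
    return "wrong" if docs[1] == "d4" else "right"


score, counts = permutation_consistency(
    "q", "gold", ["d1", "d2", "d3", "d4"], toy_generate
)
print(score, dict(counts))
```

With a real generator, a score well below 1.0 would indicate that the answer depends on document order even though the gold document never moves.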


2512.22673 2026-04-22 cs.AI

Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Lide Tan, Zheng Pan, Xin Li, Yong Liu

Comments: Accepted to the ACL 2026 Main Conference. Camera-ready version

2512.20817 2026-04-22 cs.CL

EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading

Kumar Satvik Chaudhary, Chengshuai Zhao, Fan Zhang, Garima Agrawal, Yuli Deng, Huan Liu

2512.20182 2026-04-22 cs.CL cs.AI

FaithLens: Detecting and Explaining Faithfulness Hallucination

Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

Comments: ACL 2026 (Findings)

2512.19302 2026-04-22 cs.CV

Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing

Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang

Details

Abstract

Large Vision-Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Notably, Think2Seg-RS outperforms leading approaches such as RemoteReasoner and SegEarth-R1 on the EarthReason dataset by reaching a test cIoU of 75.60% and gIoU of 73.36%, yielding absolute improvements of 6.47% and 2.40% over the strongest baseline, respectively. Zero-shot evaluations across three referring segmentation benchmarks reveal a fundamental distinction in task inductive bias, exposing a distinct divide between semantic-level grounding (which aggregates all regions matching a conceptual intent) and instance-level tasks that demand discrete object separation. We further found that compact segmenters outperform larger ones under semantic-level supervision by mitigating textural over-segmentation, and that unconstrained negative prompting is unstable in heterogeneous aerial backgrounds. Together, these findings demonstrate that optimizing LVLMs through direct segmentation feedback offers a scalable framework for complex geospatial reasoning, effectively bridging the gap between abstract language understanding and precise pixel-level execution.
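The abstract reports cIoU and gIoU without defining them. A minimal sketch, assuming the usage common in reasoning-segmentation work (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union), could look like this. The function names, the nested-list mask format, and the empty-mask convention are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence

Mask = Sequence[Sequence[int]]  # binary mask as nested lists of 0/1


def intersection_union(pred: Mask, target: Mask) -> tuple[int, int]:
    """Count intersecting and united foreground pixels of two masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def giou_ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> tuple[float, float]:
    """Return (gIoU, cIoU) under the definitions assumed above."""
    per_image = []
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_union(pred, target)
        # Assumption: an empty prediction on an empty target counts as IoU 1.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, targets)
print(giou, ciou)
```

Under these definitions cIoU weights large objects more heavily, while gIoU treats every image equally, which is one reason the two numbers can move by different amounts.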


2512.15907 2026-04-22 cs.CL

TabReX: Tabular Referenceless eXplainable Evaluation

Tejas Anvekar, Junha Park, Aparna Garimella, Vivek Gupta

Comments: Accepted to ACL 2026 (Main Conference). Long paper

2512.15577 2026-04-22 cs.CV

MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen

Comments: CVPR 2026 Findings

2512.12448 2026-04-22 cs.LG cs.NE physics.data-an stat.ML

Optimized Architectures for Kolmogorov-Arnold Networks

James Bagrow, Josh Bongard

Comments: 23 pages, 4 figures, 9 tables

2512.07761 2026-04-22 cs.AI cs.LG

TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards

Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fengbin Zhu, Qifan Wang, Fuli Feng

Comments: Accepted to ACL 2026 Main Conference

2512.06380 2026-04-22 cs.SD cs.AI

Protecting Bystander Privacy via Selective Hearing in Audio LLMs

Xiao Zhan, Guangzhi Sun, Jose Such, Phil Woodland

Comments: To appear at ACL 2026 main conference; Dataset: https://huggingface.co/datasets/BrianatCambridge/SelectiveHearingBench

2512.05747 2026-04-22 cs.CL

Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning

Jinlong Liu, Mohammed Bahja, Venelin Kovatchev, Mark Lee

Comments: Accepted to CoNLL 2026

2512.02304 2026-04-22 cs.CL

When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers

Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren

Comments: Accepted at ICLR 2026 AI with Recursive Self-Improvement workshop

2512.00993 2026-04-22 cs.CV

PhotoFramer: Multi-modal Image Composition Instruction

Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang

Comments: Accepted by CVPR 2026

2512.00716 2026-04-22 cs.LG cs.AI

Graph Data Augmentation with Contrastive Learning on Covariate Distribution Shift

Fanlong Zeng, Wensheng Gan

Comments: 8 tables, 8 figures

2512.00676 2026-04-22 cs.CV cs.LG

Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges

Kiri L. Wagstaff

Comments: 10 pages, 6 figures

2512.00198 2026-04-22 cs.CV

Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting

Shantanu Ghosh, Vedant Parthesh Joshi, Rayan Syed, Param Budhraja, Aya Kassem, Katelyn C. Morrison, Alex Tang, Ho Cheung Aiden Wong, Abhishek Varshney, Payel Basak, Weicheng Dai, Judy Wawira Gichoya, Hari M. Trivedi, Imon Banerjee, Shyam Visweswaran, Clare B. Poynton, Kayhan Batmanghelich

2511.23281 2026-04-22 cs.CL

MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)

Aaron Steiner, Ralph Peeters, Christian Bizer

Details

DOI: 10.1145/3774904.3792893
Journal ref: Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates. ACM, New York, NY, USA, pp. 8493-8496

Abstract

Large language model agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as the underlying LLM. Our evaluation shows that the RAG, MCP, and NLWeb agents outperform HTML on both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5, achieving an F1 score of 0.87 and a completion rate of 0.79. Also taking cost into consideration, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show that the choice of the interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.
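To put the efficiency figures above in relative terms, the following short calculation uses only the rounded averages quoted in the abstract (241k tokens and 291 seconds for HTML; 47k to 140k tokens and 50 to 62 seconds for the other agents). The helper name is illustrative, and the resulting percentages inherit the rounding of the quoted inputs.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Token usage per task, in thousands: HTML baseline vs. the other agents.
token_savings = [relative_reduction(241, after) for after in (140, 47)]
# Runtime per task, in seconds: HTML baseline vs. the other agents.
runtime_savings = [relative_reduction(291, after) for after in (62, 50)]

token_report = [f"{s:.1%}" for s in token_savings]
runtime_report = [f"{s:.1%}" for s in runtime_savings]
print(token_report)    # ['41.9%', '80.5%']
print(runtime_report)  # ['78.7%', '82.8%']
```

In other words, the non-HTML agents use roughly 42% to 80% fewer tokens and finish roughly 79% to 83% faster per task than the HTML agent, based on these averages.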


2511.17578 2026-04-22 cs.RO

Implicit Neural Field-Based Process Planning for Multi-Axis Manufacturing: Direct Control over Collision Avoidance and Toolpath Geometry

Neelotpal Dutta, Tianyu Zhang, Tao Liu, Yongxue Chen, Charlie C. L. Wang

2511.12606 2026-04-22 cs.CV

Pixels or Positions? Benchmarking Modalities in Group Activity Recognition

Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem

2511.12439 2026-04-22 cs.AI cs.MA

Multi-agent Self-triage System with Medical Flowcharts

Yujia Liu, Sophia Yu, Hongyue Jin, Jessica Wen, Alexander Qian, Terrence Lee, Mattheus Ramsis, Gi Won Choi, Lianhui Qin, Xin Liu, Edward J. Wang

2511.10899 2026-04-22 cs.CL cs.LO cs.SE

From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka

Comments: 19 pages, 5 figures

2511.07882 2026-04-22 cs.RO

An Experimental Characterization of Mechanical Layer Jamming Systems

Jessica Gumowski, Krishna Manaswi Digumarti, David Howard

Comments: 6 pages, 9 figures, RoboSoft 2026

2511.06168 2026-04-22 cs.AI

Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models

Boxuan Wang, Zhuoyun Li, Xinmiao Huang, Xiaowei Huang, Yi Dong

Comments: Accepted to ACL 2026 (Main Conference)

2510.27052 2026-04-22 cs.CL

VISTA: Verification In Sequential Turn-based Assessment

Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White

Comments: Accepted to ACL 2026

2510.26782 2026-04-22 cs.LG cs.AI cs.CV

Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models

Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen

2510.26566 2026-04-22 cs.LG cs.AI

Multiclass Local Calibration with the Jensen-Shannon Distance

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana

Comments: Accepted at AISTATS 2026

2510.23156 2026-04-22 cs.LG cs.AI

Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks

Koki Shibata, Tianheng Ling, Chao Qian, Tomokazu Matsui, Hirohiko Suwa, Keiichi Yasumoto, Gregor Schiele

Comments: 9 pages, 5 figures, 5 tables, accepted by 2025 IEEE Annual Congress on Artificial Intelligence of Things (IEEE AIoT)

Details

DOI: 10.1109/AIoT66900.2025.00061

Abstract

The growing demand for smart home interfaces has increased interest in non-intrusive sensing methods like vibration-based gesture recognition. While prior studies demonstrated feasibility, they often rely on complex preprocessing and large Neural Networks (NNs) requiring costly high-performance hardware, resulting in high energy usage and limited real-world deployability. This study proposes an energy-efficient solution deploying compact NNs on low-power Field-Programmable Gate Arrays (FPGAs) to enable real-time gesture recognition with competitive accuracy. We adopt a series of optimizations: (1) We replace complex spectral preprocessing with raw waveform input, eliminating complex on-board preprocessing while reducing input size by 21x without sacrificing accuracy. (2) We design two lightweight architectures (1D-CNN and 1D-SepCNN) tailored for embedded FPGAs, reducing parameters from 369 million to as few as 216 while maintaining comparable accuracy. (3) With integer-only quantization and automated RTL generation, we achieve seamless FPGA deployment. A ping-pong buffering mechanism in 1D-SepCNN further improves deployability under tight memory constraints. (4) We extend a hardware-aware search framework to support constraint-driven model configuration selection, considering accuracy, deployability, latency, and energy consumption. Evaluated on two swipe-direction datasets with multiple users and ordinary tables, our approach achieves low-latency, energy-efficient inference on the AMD Spartan-7 XC7S25 FPGA. Under the PS data splitting setting, the selected 6-bit 1D-CNN reaches 0.970 average accuracy across users with 9.22 ms latency. The chosen 8-bit 1D-SepCNN further reduces latency to 6.83 ms (over 53x CPU speedup) with slightly lower accuracy (0.949). Both consume under 1.2 mJ per inference, demonstrating suitability for long-term edge operation.
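One reason a separable 1D CNN can be much smaller than a standard one is visible in a simple parameter count. The sketch below assumes that "1D-SepCNN" refers to depthwise-separable convolution (a depthwise filter per input channel followed by a 1x1 pointwise convolution); the abstract does not spell this out, and the layer sizes used here are hypothetical, not the paper's architecture.

```python
def conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a standard 1D convolution layer."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def sep_conv1d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Parameters of a depthwise-separable 1D convolution layer."""
    # Depthwise stage: one length-`kernel` filter per input channel.
    depthwise = c_in * kernel + (c_in if bias else 0)
    # Pointwise stage: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer: 8 input channels, 16 output channels, kernel size 5.
standard = conv1d_params(8, 16, 5)
separable = sep_conv1d_params(8, 16, 5)
print(standard, separable)  # 656 192
```

For this hypothetical layer the separable variant needs 192 parameters instead of 656; the saving grows with kernel size and channel count.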


2510.18358 2026-04-22 cs.LG cs.CV

Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers

Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi

Comments: ICLR 2026

2510.17414 2026-04-22 cs.LG

Conditional Diffusion Modeling with Attention for Probabilistic Battery Capacity Prediction under Real-World Condition

Chunlin Jiang, Hequn Li, Zhongwei Deng, Jie Shao, Zhansheng Ning

2510.10074 2026-04-22 cs.AI

StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis

Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang