arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.07436 2026-04-15 cs.CV

RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation

Weikun Lin, Yunhao Bai, Yan Wang

Comments 8 pages, 3 figures

Details
English abstract

Training-free one-shot segmentation offers a scalable alternative to expert annotation, in which knowledge is transferred from support images and foundation models. However, existing methods often treat all support-image pixels and all query response intensities homogeneously, ignoring regional heterogeneity in the support images and response heterogeneity in the query. To resolve this, we propose RPG-SAM, a framework that systematically tackles these heterogeneity gaps. Specifically, to address regional heterogeneity, we introduce Reliability-Weighted Prototype Mining (RWPM) to prioritize high-fidelity support features while utilizing background anchors as contrastive references for noise suppression. To address response heterogeneity, we develop Geometric Adaptive Selection (GAS) to dynamically recalibrate binarization thresholds by evaluating the morphological consensus of candidates. Finally, an iterative refinement loop is designed to polish anatomical boundaries. By accounting for multi-layered information heterogeneity, RPG-SAM achieves a 5.56% mIoU improvement on the Kvasir dataset. Code will be released.

2603.05295 2026-04-15 cs.AI cs.CV

WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces

Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong

Details
English abstract

We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.

2603.05044 2026-04-15 cs.AI

WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, Dehan Kong

Details
English abstract

Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.

2603.03249 2026-04-15 cs.CL

Using Learning Progressions to Guide AI Feedback for Science Learning

Xin Xia, Nejla Yuruk, Yun Wang, Xiaoming Zhai

Comments 15 pages, 4 figures

Details
English abstract

Generative artificial intelligence (AI) offers scalable support for formative feedback, yet most AI-generated feedback relies on task-specific rubrics authored by domain experts. While effective, rubric authoring is time-consuming and limits scalability across instructional contexts. Learning progressions (LP) provide a theoretically grounded representation of students' developing understanding and may offer an alternative solution. This study examines whether an LP-driven rubric generation pipeline can produce AI-generated feedback comparable in quality to feedback guided by expert-authored task rubrics. We analyzed AI-generated feedback for written scientific explanations produced by 207 middle school students in a chemistry task. Two pipelines were compared: (a) feedback guided by a human expert-designed, task-specific rubric, and (b) feedback guided by a task-specific rubric automatically derived from a learning progression prior to grading and feedback generation. Two human coders evaluated feedback quality using a multi-dimensional rubric assessing Clarity, Accuracy, Relevance, Engagement and Motivation, and Reflectiveness (10 sub-dimensions). Inter-rater reliability was high, with percent agreement ranging from 89% to 100% and Cohen's kappa values for estimable dimensions (kappa = .66 to .88). Paired t-tests revealed no statistically significant differences between the two pipelines for Clarity (t1 = 0.00, p1 = 1.000; t2 = 0.84, p2 = .399), Relevance (t1 = 0.28, p1 = .782; t2 = -0.58, p2 = .565), Engagement and Motivation (t1 = 0.50, p1 = .618; t2 = -0.58, p2 = .565), or Reflectiveness (t = -0.45, p = .656). These findings suggest that the LP-driven rubric pipeline can serve as an alternative solution.
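The two statistics reported in the abstract above can be illustrated with a minimal sketch. It assumes two coders assigning categorical labels to the same feedback items and paired per-item scores for the two pipelines; the function names and the toy data are hypothetical and do not come from the paper. Kappa is returned as NaN when chance agreement equals 1, which is one common reason a dimension is "not estimable".

```python
from math import sqrt


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(label) / n) * (rater_b.count(label) / n)
              for label in labels)
    if p_e == 1.0:
        return float("nan")  # no variation in labels: kappa is not estimable
    return (p_o - p_e) / (1 - p_e)


def paired_t(x, y):
    """Paired-samples t statistic for matched scores x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0:
        # All differences identical: t is 0 if they are all zero, else undefined.
        return 0.0 if mean == 0 else float("nan")
    return mean / sqrt(var / n)


# Toy data (hypothetical): coder labels and pipeline scores.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
t_stat = paired_t([3, 4, 5, 6], [2, 4, 4, 5])
```

In practice a p-value would be obtained from the t distribution with n - 1 degrees of freedom, for example through a statistics library.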

2603.02104 2026-04-15 cs.RO

ACDC: Adaptive Curriculum Planning with Dynamic Contrastive Control for Goal-Conditioned Reinforcement Learning in Robotic Manipulation

Xuerui Wang, Guangyu Ren, Tianhong Dai, Bintao Hu, Shuangyao Huang, Wenzhang Zhang, Hengyan Liu

Comments 13 pages (including references and appendix), 12 figures. Accepted to ICAPS 2026. Code available at https://github.com/Xuerui-Wang-oss/Adaptive-Curriculum-Learning-and-Dynamic-Contrastive-Control

Details
English abstract

Goal-conditioned reinforcement learning has shown considerable potential in robotic manipulation; however, existing approaches remain limited by their reliance on prioritizing collected experience, resulting in suboptimal performance across diverse tasks. Inspired by human learning behaviors, we propose a more comprehensive learning paradigm, ACDC, which integrates multidimensional Adaptive Curriculum (AC) Planning with Dynamic Contrastive (DC) Control to guide the agent along a well-designed learning trajectory. More specifically, at the planning level, the AC component schedules the learning curriculum by dynamically balancing diversity-driven exploration and quality-driven exploitation based on the agent's success rate and training progress. At the control level, the DC component implements the curriculum plan through norm-constrained contrastive learning, enabling magnitude-guided experience selection aligned with the current curriculum focus. Extensive experiments on challenging robotic manipulation tasks demonstrate that ACDC consistently outperforms the state-of-the-art baselines in both sample efficiency and final task success rate.

2603.01591 2026-04-15 cs.LG cs.AI cs.CV

FAST-DIPS: Adjoint-Free Analytic Steps and Hard-Constrained Likelihood Correction for Diffusion-Prior Inverse Problems

Minwoo Kim, Seunghyeok Shin, Hongki Lim

Details
Journal ref
International Conference on Learning Representations 2026
English abstract

Training-free diffusion priors enable inverse-problem solvers without retraining, but for nonlinear forward operators data consistency often relies on repeated derivatives or inner optimization/MCMC loops with conservative step sizes, incurring many iterations and denoiser/score evaluations. We propose a training-free solver that replaces these inner loops with a hard measurement-space feasibility constraint (closed-form projection) and an analytic, model-optimal step size, enabling a small, fixed compute budget per noise level. Anchored at the denoiser prediction, the correction is approximated via an adjoint-free, ADMM-style splitting with projection and a few steepest-descent updates, using one VJP and either one JVP or a forward-difference probe, followed by backtracking and decoupled re-annealing. We prove local model optimality and descent under backtracking for the step-size rule, and derive an explicit KL bound for mode-substitution re-annealing under a local Gaussian conditional surrogate. We also develop a latent variant and a one-parameter pixel-to-latent hybrid schedule. Experiments achieve competitive PSNR/SSIM/LPIPS with up to 19.5× speedup, without hand-coded adjoints or inner MCMC.

2603.00137 2026-04-15 cs.LG cs.AI

MAML-KT: Addressing Cold Start Problem in Knowledge Tracing for New Students via Few-Shot Model-Agnostic Meta Learning

Indronil Bhattacharjee, Christabel Wayllace

Details
English abstract

Knowledge tracing (KT) models are commonly evaluated by training on early interactions from all students and testing on later responses. While effective for measuring average predictive performance, this evaluation design obscures a cold start scenario that arises in deployment, where models must infer the knowledge state of previously unseen students from only a few initial interactions. Prior studies have shown that under this setting, standard empirically risk-minimized KT models such as DKT, DKVMN and SAKT exhibit substantially lower early accuracy than previously reported. We frame new-student performance prediction as a few-shot learning problem and introduce MAML-KT, a model-agnostic meta learning approach that learns an initialization optimized for rapid adaptation to new students using one or two gradient updates. We evaluate MAML-KT on ASSIST2009, ASSIST2015 and ASSIST2017 using a controlled cold start protocol that trains on a subset of students and tests on held-out learners across early interaction windows (questions 3-10 and 11-15), scaling cohort sizes from 10 to 50 students. Across datasets, MAML-KT achieves higher early accuracy than prior KT models in nearly all cold start conditions, with gains persisting as cohort size increases. On ASSIST2017, we observe a transient drop in early performance that coincides with many students encountering previously unseen skills. Further analysis suggests that these drops coincide with skill novelty rather than model instability, consistent with prior work on skill-level cold start. Overall, optimizing KT models for rapid adaptation reduces early prediction error for new students and provides a clearer lens for interpreting early accuracy fluctuations, distinguishing model limitations from genuine learning and knowledge acquisition dynamics.

2602.18109 2026-04-15 cs.LG cs.OS cs.SY eess.SY

TempoNet: Slack-Quantized Transformer-Guided Reinforcement Scheduler for Adaptive Deadline-Centric Real-Time Dispatchs

Rong Fu, Yibo Meng, Guangzhen Yao, Jiaxuan Lu, Zeyu Zhang, Zhaolu Kang, Ziming Guo, Jia Yee Tan, Xiaojing Du, Simon James Fong

Comments 43 pages, 12 figures

Details
English abstract

Real-time schedulers must reason about tight deadlines under strict compute budgets. We present TempoNet, a reinforcement learning scheduler that pairs a permutation-invariant Transformer with a deep Q-approximation. An Urgency Tokenizer discretizes temporal slack into learnable embeddings, stabilizing value learning and capturing deadline proximity. A latency-aware sparse attention stack with blockwise top-k selection and locality-sensitive chunking enables global reasoning over unordered task sets with near-linear scaling and sub-millisecond inference. A multicore mapping layer converts contextualized Q-scores into processor assignments through masked-greedy selection or differentiable matching. Extensive evaluations on industrial mixed-criticality traces and large multiprocessor settings show consistent gains in deadline fulfillment over analytic schedulers and neural baselines, together with improved optimization stability. Diagnostics include sensitivity analyses for slack quantization, attention-driven policy interpretation, hardware-in-the-loop and kernel micro-benchmarks, and robustness under stress with simple runtime mitigations; we also report sample-efficiency benefits from behavioral-cloning pretraining and compatibility with an actor-critic variant without altering the inference pipeline. These results establish a practical framework for Transformer-based decision making in high-throughput real-time scheduling.
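The idea of discretizing temporal slack into tokens, mentioned in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: slack is taken to be deadline minus current time minus remaining execution time, and the bin edges, function names, and fixed embedding table are hypothetical (the abstract describes learnable embeddings but gives no bin layout).

```python
from bisect import bisect_right

# Hypothetical bin edges in milliseconds; the paper's quantization is not given.
BIN_EDGES_MS = [0.0, 1.0, 5.0, 20.0, 100.0]

# Stand-in embedding table with one row per token. In a trained model these
# rows would be learnable parameters rather than fixed numbers.
EMBEDDINGS = [[float(token), float(token) ** 2]
              for token in range(len(BIN_EDGES_MS) + 1)]


def slack_ms(deadline_ms, now_ms, remaining_exec_ms):
    """Time to spare before the deadline if the task ran to completion now."""
    return deadline_ms - now_ms - remaining_exec_ms


def urgency_token(slack, edges=BIN_EDGES_MS):
    """Map a slack value to a bin index; token 0 means the deadline is already at risk."""
    return bisect_right(edges, slack)


def embed_task(deadline_ms, now_ms, remaining_exec_ms):
    """Look up the embedding row for a task's quantized slack."""
    return EMBEDDINGS[urgency_token(slack_ms(deadline_ms, now_ms, remaining_exec_ms))]
```

Quantizing slack means tasks with similar urgency share an embedding row, so the value network sees a bounded vocabulary rather than an unbounded real-valued input.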

2602.17276 2026-04-15 cs.LG math.CO

RLGT: A reinforcement learning framework for extremal graph theory

Ivan Damnjanović, Uroš Milivojević, Irena Đorđević, Dragan Stevanović

Details
English abstract

Reinforcement learning (RL) is a subfield of machine learning that focuses on developing models that can autonomously learn optimal decision-making strategies over time. In a recent pioneering paper, Wagner demonstrated how the Deep Cross-Entropy RL method can be applied to tackle various problems from extremal graph theory by reformulating them as combinatorial optimization problems. Subsequently, many researchers became interested in refining and extending the framework introduced by Wagner, thereby creating various RL environments specialized for graph theory. Moreover, a number of problems from extremal graph theory were solved through the use of RL. In particular, several inequalities concerning the Laplacian spectral radius of graphs were refuted, new lower bounds were obtained for certain Ramsey numbers, and contributions were made to the Turán-type extremal problem in which the forbidden structures are cycles of length three and four. Here, we present Reinforcement Learning for Graph Theory (RLGT), a novel RL framework that systematizes the previous work and provides support for both undirected and directed graphs, with or without loops, and with an arbitrary number of edge colors. The framework efficiently represents graphs and aims to facilitate future RL-based research in extremal graph theory through optimized computational performance and a clean and modular design.

2602.15355 2026-04-15 cs.CV

DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles

Rong Fu, Jiekai Wu, Haiyun Wei, Yee Tan Jia, Yang Li, Xiaowen Ma, Wangyu Wu, Simon Fong

Comments 16 pages, 7 figures

Details
English abstract

The emergence of 3D Gaussian Splatting has fundamentally redefined the capabilities of photorealistic neural rendering by enabling high-throughput synthesis of complex environments. While procedural methods like Wang Tiles have recently been integrated to facilitate the generation of expansive landscapes, these systems typically remain constrained by a reliance on densely sampled exemplar reconstructions. We present DAV-GSWT, a data-efficient framework that leverages diffusion priors and active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations. By integrating a hierarchical uncertainty quantification mechanism with generative diffusion models, our approach autonomously identifies the most informative viewpoints while hallucinating missing structural details to ensure seamless tile transitions. Experimental results indicate that our system significantly reduces the required data volume while maintaining the visual integrity and interactive performance necessary for large-scale virtual environments.

2602.11236 2026-04-15 cs.CV cs.CL cs.RO

ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu

Comments Project website: https://amap-cvlab.github.io/ABot-Manipulation/ . Code: https://github.com/amap-cvlab/ABot-Manipulation . 22 pages, 10 figures, 10 tables

Details
English abstract

Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.

2602.10137 2026-04-15 cs.CV cs.AI

Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation

Leo Thomas Ramos, Angel D. Sappa

Comments This is an extended version of the study presented at IEEE SoutheastCon2025. It presents substantial new content and original contributions beyond the previous version, including an expanded and enhanced background, new architectural refinements, additional experiments conducted on a broader range of datasets and experimental scenarios, and a more comprehensive analysis of results

Details
Journal ref
Neurocomputing, vol. 685, pages 133533, 2026
English abstract

This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
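The step from the 4-channel to the 6-channel input described above can be sketched with the standard normalized-difference indices. The sketch assumes reflectance-like values, a channel order of (R, G, B, NIR), and the green/NIR form of NDWI (the only form computable from these four bands); none of these choices is stated in the abstract, and the function names are hypothetical.

```python
def normalized_difference(a, b, eps=1e-8):
    """(a - b) / (a + b), with a small epsilon to avoid division by zero."""
    return (a - b) / (a + b + eps)


def to_six_channels(pixel):
    """Extend one (R, G, B, NIR) pixel with NDVI and NDWI.

    NDVI = (NIR - R) / (NIR + R) highlights vegetation.
    NDWI = (G - NIR) / (G + NIR) highlights open water (assumed definition).
    """
    r, g, b, nir = pixel
    ndvi = normalized_difference(nir, r)
    ndwi = normalized_difference(g, nir)
    return (r, g, b, nir, ndvi, ndwi)


# Hypothetical vegetated pixel: strong NIR response, weak red response.
six = to_six_channels((0.2, 0.3, 0.1, 0.6))
```

Both indices are bounded in [-1, 1] for non-negative inputs, so they can be stacked with normalized band values without further rescaling.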

2602.08590 2026-04-15 cs.LG cs.DB

SDFed: Bridging Local Global Discrepancy via Subspace Refinement and Divergence Control in Federated Prompt Learning

Yicheng Di, Wei Yuan, Tieke He, Yuan Liu, Hongzhi Yin

Comments The article contains content that requires significant revision, therefore it is being retracted

Details
English abstract

Vision-language pretrained models offer strong transferable representations, yet adapting them in privacy-sensitive multi-party settings is challenging due to the high communication cost of federated optimization and the limited local data on clients. Federated prompt learning mitigates this issue by keeping the VLPM backbone frozen and collaboratively training lightweight prompt parameters. However, existing approaches typically enforce a unified prompt structure and length across clients, which is inadequate under practical client heterogeneity in both data distributions and system resources, and may further introduce conflicts between globally shared and locally optimal knowledge. To address these challenges, we propose SDFed, a heterogeneous federated prompt learning framework that bridges Local-Global Discrepancy via Subspace Refinement and Divergence Control. SDFed maintains a fixed-length global prompt for efficient aggregation while allowing each client to learn a variable-length local prompt to better match its data characteristics and capacity. To mitigate local-global conflicts and facilitate effective knowledge transfer, SDFed introduces a subspace refinement method for local prompts and an information retention and divergence control strategy that preserves key local information while maintaining appropriate separability between global and local representations. Extensive experiments on several datasets demonstrate that SDFed consistently improves performance and robustness in heterogeneous federated settings.

2602.05971 2026-04-15 cs.CL cs.LG q-bio.NC

Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

Felipe D. Toro-Hernández, Jesuino Vieira Filho, Rodrigo M. Cabral-Carvalho

Comments 10 pages, 6 figures (excluding refs/appendix). Accepted to ICLR 2026

Details
Journal ref
International Conference on Learning Representations (ICLR) 2026
English abstract

Semantic representations can be framed as a structured, dynamic knowledge space through which humans navigate to retrieve and manipulate meaning. To investigate how humans traverse this geometry, we introduce a framework that represents concept production as navigation through embedding space. Using different transformer text embedding models, we construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics, including distance to next, distance to centroid, entropy, velocity, and acceleration. These measures capture both scalar and directional aspects of semantic navigation, providing a computationally grounded view of semantic representation search as movement in a geometric space. We evaluate the framework on four datasets across different languages, spanning different property generation tasks: Neurodegenerative, Swear verbal fluency, and Property listing tasks in Italian and in German. Across these contexts, our approach distinguishes between clinical groups and concept types, offering a mathematical framework that requires minimal human intervention compared to typical labor-intensive linguistic pre-processing methods. Comparison with a non-cumulative approach reveals that cumulative embeddings work best for longer trajectories, whereas shorter ones may provide too little context, favoring the non-cumulative alternative. Critically, different embedding models yielded similar results, highlighting similarities between different learned representations despite different training pipelines. By framing semantic navigation as a structured trajectory through embedding space, this work bridges cognitive modeling with learned representations, establishing a pipeline for quantifying semantic representation dynamics with applications in clinical research, cross-linguistic analysis, and the assessment of artificial cognition.
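Several of the trajectory metrics named in the abstract above can be sketched on toy vectors. The sketch rests on loud assumptions: "cumulative embedding" is approximated here by the running mean of the item embeddings produced so far (the paper may instead embed the accumulated text), distances are Euclidean, and velocity and acceleration are first and second differences between consecutive points. Entropy is omitted because the abstract does not define it. All names are hypothetical.

```python
from math import sqrt


def cumulative_mean(vectors):
    """Running mean of the embeddings seen so far (assumed cumulative scheme)."""
    running = [0.0] * len(vectors[0])
    points = []
    for count, vec in enumerate(vectors, start=1):
        running = [s + x for s, x in zip(running, vec)]
        points.append([s / count for s in running])
    return points


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trajectory_metrics(points):
    """Geometric and dynamical summaries of one participant's trajectory."""
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # Velocity: displacement between consecutive points (directional).
    velocity = [[b - a for a, b in zip(p, q)] for p, q in zip(points, points[1:])]
    # Acceleration: change in velocity between consecutive steps.
    acceleration = [[b - a for a, b in zip(v, w)]
                    for v, w in zip(velocity, velocity[1:])]
    return {
        "distance_to_next": [euclidean(p, q) for p, q in zip(points, points[1:])],
        "distance_to_centroid": [euclidean(p, centroid) for p in points],
        "speed": [sqrt(sum(c * c for c in v)) for v in velocity],
        "acceleration": acceleration,
    }


# Toy 2-D "embeddings" for three produced items.
trajectory = cumulative_mean([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
metrics = trajectory_metrics(trajectory)
```

With real data the input vectors would come from a text embedding model, one vector per produced word or property.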

2602.04204 2026-04-15 cs.CV cs.LG

AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting

Chao Li, Rui Zhang, Siyuan Huang, Xian Zhong, Hongbo Jiang

Comments Withdrawn for substantial revision and will be re-uploaded as a new manuscript

Details
English abstract

Human trajectory forecasting requires capturing the multimodal nature of pedestrian behavior. However, existing approaches suffer from prior misalignment. Their learned or fixed priors often fail to capture the full distribution of plausible futures, limiting both prediction accuracy and diversity. We theoretically establish that prediction error is lower-bounded by prior quality, making prior modeling a key performance bottleneck. Guided by this insight, we propose AGMA (Adaptive Gaussian Mixture Anchors), which constructs expressive priors through two stages: extracting diverse behavioral patterns from training data and distilling them into a scene-adaptive global prior for inference. Extensive experiments on ETH-UCY, Stanford Drone, and JRDB datasets demonstrate that AGMA achieves state-of-the-art performance, confirming the critical role of high-quality priors in trajectory forecasting.

2602.03901 2026-04-15 cs.LG cs.NE

NeuroPareto: Calibrated Acquisition for Costly Many-Goal Search in Vast Parameter Spaces

Rong Fu, Chunlei Meng, Youjin Wang, Haoyu Zhao, Jiaxuan Lu, Kun Liu, JiaBao Dou, Simon James Fong

Comments 39 pages, 19 figures

Details
English abstract

The pursuit of optimal trade-offs in high-dimensional search spaces under stringent computational constraints poses a fundamental challenge for contemporary multi-objective optimization. We develop NeuroPareto, a cohesive architecture that integrates rank-centric filtering, uncertainty disentanglement, and history-conditioned acquisition strategies to navigate complex objective landscapes. A calibrated Bayesian classifier estimates epistemic uncertainty across non-domination tiers, enabling rapid generation of high-quality candidates with minimal evaluation cost. Deep Gaussian Process surrogates further separate predictive uncertainty into reducible and irreducible components, providing refined predictive means and risk-aware signals for downstream selection. A lightweight acquisition network, trained online from historical hypervolume improvements, guides expensive evaluations toward regions balancing convergence and diversity. With hierarchical screening and amortized surrogate updates, the method maintains accuracy while keeping computational overhead low. Experiments on DTLZ and ZDT suites and a subsurface energy extraction task show that NeuroPareto consistently outperforms classifier-enhanced and surrogate-assisted baselines in Pareto proximity and hypervolume.
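The "non-domination tiers" mentioned in the abstract above refer to the usual ranking of candidates by Pareto dominance. A minimal sketch, assuming all objectives are minimized and using a simple quadratic-time peeling procedure (not the paper's classifier-based estimate), could look like this; the function names and toy points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_domination_tiers(points):
    """Peel off successive Pareto fronts; returns lists of indices, best tier first."""
    remaining = list(range(len(points)))
    tiers = []
    while remaining:
        # A point is in the current front if no other remaining point dominates it.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        tiers.append(front)
        remaining = [i for i in remaining if i not in front]
    return tiers


# Toy bi-objective candidates (both objectives minimized).
candidates = [(1, 4), (2, 2), (4, 1), (3, 3), (5, 5)]
tiers = non_domination_tiers(candidates)
```

A learned classifier that predicts a candidate's tier would be trained on labels of this kind, so that expensive evaluations can be reserved for candidates likely to land in the first tier.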

2601.19917 2026-04-15 cs.CL

PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models

Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang, Siliang Tang, Jun Xiao

Details
English abstract

Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model's representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

2601.18081 2026-04-15 cs.LG

DRPG (Decompose, Retrieve, Plan, Generate): An Agentic Framework for Academic Rebuttal

Peixuan Han, Yingjie Yu, Jingjun Xu, Jiaxuan You

Details
English abstract

Despite the growing adoption of large language models (LLMs) in scientific research workflows, automated support for academic rebuttal, a crucial step in academic communication and peer review, remains largely underexplored. Existing approaches typically rely on off-the-shelf LLMs or simple pipelines, which struggle with long-context understanding and often fail to produce targeted and persuasive responses. In this paper, we propose DRPG, an agentic framework for automatic academic rebuttal generation that operates through four steps: Decompose reviews into atomic concerns, Retrieve relevant evidence from the paper, Plan rebuttal strategies, and Generate responses accordingly. Notably, the Planner in DRPG reaches over 98% accuracy in identifying the most feasible rebuttal direction. Experiments on data from top-tier conferences demonstrate that DRPG significantly outperforms existing rebuttal pipelines and achieves performance beyond the average human level using only an 8B model. Our analysis further demonstrates the effectiveness of the planner design and its value in providing multi-perspective and explainable suggestions. We also show that DRPG works well in a more complex multi-round setting. These results highlight the effectiveness of DRPG and its potential to provide high-quality rebuttal content and support the scaling of academic discussions. Code for this work is available at https://github.com/ulab-uiuc/DRPG-RebuttalAgent.

2601.14004 2026-04-15 cs.CL

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Feijiang Han, Qibo Xue, Zeping Yu, Chenming Shang, Xiao Liang, Jing Xiong, Hui Shen, Chaofan Tao, Zhengwu Liu, Senjie Jin, Zhiheng Xi, Dongdong Zhang, Sophia Ananiadou, Tao Gui, Ruobing Xie, Hayden Kwok-Hay So, Hinrich Schütze, Xuanjing Huang, Qi Zhang, Ngai Wong

Details
English abstract

Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.

2601.08209 2026-04-15 cs.CL

Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models

Rongji Li, Jian Xu, Yi Chen, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang

Details
English abstract

In domains such as materials science, biomedicine, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have clear drawbacks: fine-tuning is expensive to iterate under continual updates that can induce catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but remains brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval mismatch, and long-context pressure. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an auxiliary modality and injects it into a frozen base model through a compact, constant-budget latent interface. Concretely, GAG distills question-conditioned specialist knowledge from lightweight domain experts into multi-slot latent memories, integrates multi-layer expert signals via per-slot cross-layer fusion, and aligns them to the frozen base model through gated residual projection, while supporting scalable mixed-domain deployment with reliable selective activation. In a unified mixed-domain evaluation spanning two scientific private-domain QA benchmarks (catalytic materials and immunology adjuvant) together with general-domain queries, GAG consistently outperforms strong retrieval-based and parameter-efficient fine-tuning baselines on specialist QA, while preserving general-domain capability, achieving highly reliable routing, and offering a favorable efficiency-effectiveness trade-off. Code and datasets are provided in the supplementary material. Code is publicly available at https://github.com/360CVGroup/GAG.

2601.05524 2026-04-15 cs.CL

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, Cong Wang

Comments Accepted by ACL2026 Main

Details
English abstract

Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stall due to mid-sequence token rejections of early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedup of $5.3\times$ on LLaMA3.3-70B and $2.8\times$ on Qwen3-32B, significantly outperforming the advanced method EAGLE-3 that requires extensive model training.

2512.22799 2026-04-15 cs.CV

VPTracker: Global Vision-Language Tracking via Visual Prompt

Jingchao Wang, Kaiwen Zhou, Zhijian Wu, Kunhua Ji, Dingjiang Huang, Yefeng Zheng

Comments 7 pages

Details
English abstract

Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at https://github.com/jcwang0602/VPTracker.

2512.19728 2026-04-15 cs.LG

Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

Haocheng Lu, Minjun Zhu, Henry Yu

Details
English abstract

Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
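As an illustration of what a per-sample weighted DPO objective can look like, here is a minimal sketch in plain Python. It uses the standard DPO loss, -log σ(β · margin), and multiplies each pair's loss by an importance weight. The function name, the dictionary keys, the normalization by the weight sum, and the way weights are supplied are all assumptions made for this example; the abstract does not specify how the verifier scores are turned into weights.

```python
import math


def weighted_dpo_loss(pairs, beta=0.1):
    """Weighted average of per-pair DPO losses.

    Each pair is a dict holding sequence log-probabilities under the policy
    ("pi_*") and the frozen reference model ("ref_*") for the chosen and
    rejected solutions, plus a non-negative importance "weight"
    (hypothetical: here it is simply given as input).
    """
    total, weight_sum = 0.0, 0.0
    for p in pairs:
        # Implicit reward margin between chosen and rejected solutions.
        margin = beta * (
            (p["pi_chosen"] - p["ref_chosen"])
            - (p["pi_rejected"] - p["ref_rejected"])
        )
        # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
        loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        total += p["weight"] * loss
        weight_sum += p["weight"]
    return total / weight_sum
```

With all weights equal this reduces to the ordinary mean DPO loss; raising the weight of a pair the policy currently ranks wrongly pulls the objective toward that pair.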

2512.14732 2026-04-15 cs.LG cs.AI cs.CV eess.IV

INFORM-CT: INtegrating LLMs and VLMs FOR Incidental Findings Management in Abdominal CT

Idan Tankel, Nir Mazor, Rafi Brada, Christina LeBedis, Guy ben-Yosef

Comments Accepted for Spotlight presentation at MIDL 2026

Details
English abstract

Incidental findings in CT scans, though often benign, can have significant clinical implications and should be reported following established guidelines. Traditional manual inspection by radiologists is time-consuming and variable. This paper proposes a novel framework that leverages large language models (LLMs) and foundational vision-language models (VLMs) in a plan-and-execute agentic approach to improve the efficiency and precision of incidental findings detection, classification, and reporting for abdominal CT scans. Given medical guidelines for abdominal organs, the process of managing incidental findings is automated through a planner-executor framework. The planner, based on an LLM, generates Python scripts using predefined base functions, while the executor runs these scripts to perform the necessary checks and detections via VLMs, segmentation models, and image processing subroutines. We demonstrate the effectiveness of our approach through experiments on a CT abdominal benchmark for three organs, in a fully automatic end-to-end manner. Our results show that the proposed framework outperforms existing pure VLM-based approaches in terms of accuracy and efficiency.

2512.13726 2026-04-15 cs.LG cs.AI

Time-Constrained Recommendations: Reinforcement Learning Strategies for E-Commerce

Sayak Chakrabarty, Souradip Pal

Comments 9 pages, 5 figures

Details
English abstract

Unlike in traditional recommendation tasks, finite user time budgets introduce a critical resource constraint, requiring the recommender system to balance item relevance and evaluation cost. For example, in a mobile shopping interface, users interact with recommendations by scrolling, where each scroll triggers a list of items called a slate. Users incur an evaluation cost: the time spent assessing item features before deciding to click. Highly relevant items with higher evaluation costs may not fit within the user's time budget, affecting engagement. In this position paper, our objective is to evaluate reinforcement learning algorithms that learn patterns in user preferences and time budgets simultaneously, crafting recommendations with higher engagement potential under resource constraints. Our experiments explore the use of reinforcement learning to recommend items to users using Alibaba's Personalized Re-ranking dataset, which supports slate optimization in e-commerce contexts. Our contributions include (i) a unified formulation of time-constrained slate recommendation modeled as Markov Decision Processes (MDPs) with budget-aware utilities; (ii) a simulation framework to study policy behavior on re-ranking data; and (iii) empirical evidence that on-policy and off-policy control can outperform traditional contextual bandit-based methods under tight time budgets.
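The time-budget constraint described above can be made concrete with a toy user model. This is a minimal sketch under stated assumptions, not the paper's simulator or its reinforcement learning policies: the user scans a slate in display order, pays each item's evaluation cost, and stops at the first item that no longer fits in the remaining budget. The function name, the (relevance, cost) tuples, and the stopping rule are all hypothetical.

```python
def simulate_slate(slate, time_budget):
    """Return (engagement, time_used) for a user scanning a slate in order.

    slate: list of (relevance, evaluation_cost) tuples in display order.
    The user stops at the first item whose cost exceeds the remaining budget.
    """
    engagement, time_used = 0.0, 0.0
    for relevance, cost in slate:
        if time_used + cost > time_budget:
            break
        time_used += cost
        engagement += relevance
    return engagement, time_used


# Hypothetical items: one highly relevant but slow to evaluate, two cheaper ones.
items = [(0.9, 5.0), (0.6, 2.0), (0.5, 2.0)]
by_relevance = sorted(items, key=lambda it: it[0], reverse=True)
by_relevance_per_cost = sorted(items, key=lambda it: it[0] / it[1], reverse=True)

tight_budget = 5.0
print(simulate_slate(by_relevance, tight_budget))
print(simulate_slate(by_relevance_per_cost, tight_budget))
```

In this toy setting, ranking purely by relevance spends the whole budget on the single expensive item, while a cost-aware ordering fits both cheaper items and collects more total relevance. That gap is the kind of effect a budget-aware policy is meant to exploit.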

2512.10226 2026-04-15 cs.CV cs.RO

Latent Chain-of-Thought World Modeling for End-to-End Driving

Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic

Comments Accepted to CVPR 2026

Details
English abstract

Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

2512.08217 2026-04-15 cs.LG

Correction of Decoupled Weight Decay

Jason Chuan-Chih Chou

Comments v3 improves derivation and adds the Nesterov momentum case & numerical simulations for Scion-like updates

Details
English abstract

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate $γ$ without question. Some researchers have recently challenged this assumption and argued, based on orthogonality arguments at steady state, that decoupled weight decay should be set $\propto γ^2$ instead. On the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change in the training dynamics. Instead, we derive that decoupled weight decay $\propto γ^2$ results in a stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate, whose optimal value transfers. We also show that decoupled weight decay $\propto γ^2$ leads to stable weight and gradient norms, allows us to better control the training dynamics, and improves model performance.
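The steady-state argument can be checked on a toy model. The sketch below is an illustration under stated assumptions rather than the paper's derivation or experiments: updates are random unit-norm vectors drawn independently of the weights (mirroring the independence assumption in the abstract), and the hypothetical update rule is w ← (1 − η)·w − γ·u, where γ is the learning rate and η is the per-step decay factor. The function name and all constants are invented for this example.

```python
import math
import random


def steady_state_norm(lr, decay, dim=64, steps=4000, seed=0):
    """Average weight norm over the second half of a toy training run.

    Each step applies w <- (1 - decay) * w - lr * u, with u a random
    unit vector that is independent of w (the assumed steady-state regime).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    norms = []
    for step in range(steps):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        u_norm = math.sqrt(sum(x * x for x in u))
        w = [(1.0 - decay) * wi - lr * ui / u_norm for wi, ui in zip(w, u)]
        if step >= steps // 2:  # discard the transient
            norms.append(math.sqrt(sum(x * x for x in w)))
    return sum(norms) / len(norms)


for lr in (0.1, 0.2):
    quadratic = steady_state_norm(lr, decay=lr ** 2)  # decay proportional to lr^2
    linear = steady_state_norm(lr, decay=0.1 * lr)    # decay proportional to lr
    print(f"lr={lr}: norm with decay~lr^2 = {quadratic:.3f}, with decay~lr = {linear:.3f}")
```

In this toy model the expected squared norm obeys E‖w'‖² = (1 − η)²·E‖w‖² + γ², so for small η it settles near γ²/(2η). Choosing η ∝ γ² therefore removes the dependence on γ, whereas η ∝ γ leaves a norm that grows roughly like √γ.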

2512.07178 2026-04-15 cs.AI cs.HC cs.LG

ContextualSHAP: Enhancing SHAP Explanations Through Contextual Language Generation

Latifa Dwiyanti, Sergio Ryan Wibisono, Hidetaka Nambo

Comments This paper was accepted and presented at the 7th World Symposium on Software Engineering (WSSE) 2025 on 25 October 2025 in Okayama, Japan, and is currently awaiting publication

Details
Journal ref
WSSE '25: Proceedings of the 2025 7th World Symposium on Software Engineering
English abstract

Explainable Artificial Intelligence (XAI) has become an increasingly important area of research, particularly as machine learning models are deployed in high-stakes domains. Among various XAI approaches, SHAP (SHapley Additive exPlanations) has gained prominence due to its ability to provide both global and local explanations across different machine learning models. While SHAP effectively visualizes feature importance, it often lacks contextual explanations that are meaningful for end-users, especially those without technical backgrounds. To address this gap, we propose a Python package that extends SHAP by integrating it with a large language model (LLM), specifically OpenAI's GPT, to generate contextualized textual explanations. This integration is guided by user-defined parameters (such as feature aliases, descriptions, and additional background) to tailor the explanation to both the model context and the user perspective. We hypothesize that this enhancement can improve the perceived understandability of SHAP explanations. To evaluate the effectiveness of the proposed package, we applied it in a healthcare-related case study and conducted user evaluations involving real end-users. The results, based on Likert-scale surveys and follow-up interviews, indicate that the generated explanations were perceived as more understandable and contextually appropriate compared to visual-only outputs. While the findings are preliminary, they suggest that combining visualization with contextualized text may support more user-friendly and trustworthy model explanations.

2512.05812 2026-04-15 cs.RO cs.CV

Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller

Comments This is the author's accepted version of a paper to appear in the IEEE International Conference on Robotics & Automation (ICRA 2026)

Details
English abstract

Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.

2511.21211 2026-04-15 cs.LG

Robust gene prioritization for Dietary Restriction via Fast-mRMR Feature Selection techniques

Rubén Fernández-Farelo, Jorge Paz-Ruza, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Alex A. Freitas

Details
English abstract

Gene prioritization (identifying genes potentially associated with a biological process) is increasingly tackled with Artificial Intelligence. However, existing methods struggle with the high dimensionality and incomplete labelling of biomedical data. This work proposes a more robust and efficient pipeline that leverages Fast-mRMR Feature Selection to retain only relevant, non-redundant features for classifiers, building simpler, more interpretable and more efficient models. Experiments in our domain of interest, prioritizing genes related to Dietary Restriction (DR), show significant improvements over existing methods and enable us to integrate heterogeneous biological feature sets for better performance, a strategy that previously degraded performance due to noise accumulation. This work focuses on DR given the availability of curated data and expert knowledge for validation, yet this pipeline would be applicable to other biological processes, proving that feature selection is critical for reliable gene prioritization in high-dimensional omics.
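To show what "relevant, non-redundant" selection means in practice, here is a minimal sketch of plain greedy mRMR (minimum redundancy, maximum relevance) in its difference form, for discrete features. It is not the Fast-mRMR implementation used in the paper, which is an optimized variant; the function names and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def mrmr_select(features, target, k):
    """Greedily pick k feature names maximizing relevance minus mean redundancy.

    features: dict mapping feature name -> list of discrete values.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    remaining = list(features)
    selected = []
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target depends on a and b; a_copy duplicates a exactly.
toy_features = {"a": [0, 0, 1, 1], "a_copy": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
toy_target = [0, 1, 1, 2]
print(mrmr_select(toy_features, toy_target, k=2))
```

On the toy data all three features are equally relevant on their own, but once `a` is chosen the duplicate `a_copy` is penalized for redundancy, so the second pick is `b`.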