arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11376 2026-05-13 cs.AI

LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents

Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

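AI summary: This paper proposes LLM-X, a scalable negotiation-oriented exchange framework that supports direct, structured communication among personal language-model agents. The framework introduces a message bus and routing mechanism that ensure schema validity and policy enforcement, and it provides an architecture of federated gateways, topic-based routing, and policy enforcement, together with a typed message protocol supporting capability negotiation and contract-net-style coordination. Experiments show that LLM-X remains stable across different scales and load conditions, and they reveal that the choice of policy trades off system robustness, fairness, and communication efficiency.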

Comments 8 pages, 7 figures, accepted at AGENT 2026 Workshop, co-located with ICSE 2026

Abstract

We propose a personal-LLM exchange (LLM-X), a scalable negotiation-oriented environment that enables direct, structured communication across populations of personal agents (LLMs), each representing an individual user. Unlike existing tool-centric protocols that focus on agent-API interaction, LLM-X introduces a message bus and routing substrate for LLM-to-LLM coordination with guarantees around schema validity and policy enforcement. We contribute: (1) an architecture for LLM-X comprising federated gateways, topic-based routing, and policy enforcement; (2) a typed message protocol supporting capability negotiation and contract-net-style coordination; and (3) the first empirical evaluation of LLM-based multi-agent negotiation at scale. Experiments span 5, 9, and 12 agents, under distinct negotiation policies (Low, Medium, High), and across both short-run (minutes) and long-run (2h, 12h) load conditions. Results highlight clear policy-performance trade-offs: stricter policies improve robustness and fairness but increase latencies and message volume. Extended runs confirm that LLM-X remains stable under sustained load, with bounded latency drift.

2605.11373 2026-05-13 cs.AI cs.LG stat.ML

Causal Algorithmic Recourse: Foundations and Methods

Drago Plecko, Collin Wang, Elias Bareinboim

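AI summary: This paper studies how to give individuals reliable recommendations for reversing a decision made by an AI decision-making system, the problem known as algorithmic recourse. The authors propose a causal framework that models recourse as a process over pre- and post-intervention outcomes, accounting for resampling and partial stability of latent variables. The paper introduces post-recourse stability conditions and develops a copula-based algorithm for inferring recourse effects from observational data, along with a distribution-free learning method for cases where the data do not fit the copula model, offering a more robust and practical approach to algorithmic recourse.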

Abstract

The trustworthiness of AI decision-making systems is increasingly important. A key feature of such systems is the ability to provide recommendations for how an individual may reverse a negative decision, a problem known as algorithmic recourse. Existing approaches treat recourse outcomes as counterfactuals of a fixed unit, ignoring that real-world recourse involves repeated decisions on the same individual under possibly different latent conditions. We develop a causal framework that models recourse as a process over pre- and post-intervention outcomes, allowing for partial stability and resampling of latent variables. We introduce post-recourse stability conditions that enable reasoning about recourse from observational data alone, and develop a copula-based algorithm for inferring the effects of recourse under these conditions. For settings where paired observations of the same individual before and after intervention are available (called recourse data), we develop methods for inferring copula parameters and performing goodness-of-fit testing. When the copula model is rejected, we provide a distribution-free algorithm for learning recourse effects directly from recourse data. We demonstrate the value of the proposed methods on real and semi-synthetic datasets.

2605.11369 2026-05-13 cs.CV

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

Sanghyeok Nam, Byoungjun Kim, Daehyung Park, Tae-Kyun Kim

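AI summary: This work addresses the challenge of generating dynamic human-object interaction motions. It proposes a framework that combines pretrained motion priors with imitation agents to generate long-term dynamic interaction motions such as running while holding an object. In the planning stage, a pretrained human motion diffusion model augments the datasets and object trajectories are generated, which plans dynamic interaction sequences; in the execution stage, a composer network blends pretrained imitation agents specialized for either dynamic human motions or static interactions, composing their complementary skills in space and time. The method markedly improves task success rates while maintaining interaction quality, and substantially reduces training time.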

Comments CVPR Findings 2026

Abstract

Generating physically plausible dynamic motions of human-object interaction (HOI) remains challenging, mainly because existing HOI datasets are limited to static interactions and pretrained agents are capable of either dynamic full-body motions without objects or static HOI motions. Recent works such as InsActor and CLoSD generate HOI motions in planning and execution stages, yet are limited to either static or short-term contacts, e.g., striking. In this work, we propose a framework that achieves dynamic and long-term interaction motions, such as running while holding a table, by combining pretrained motion priors and imitation agents in planning and execution stages. In the planning stage, we augment HOI datasets with dynamic priors from a pretrained human motion diffusion model, followed by object trajectory generation. This plans dynamic HOI sequences. In the execution stage, a composer network blends actions of pretrained imitation agents specialized either for dynamic human motions or static HOI motions, enabling spatio-temporal composition of their complementary skills. Compared with relevant prior art, our method consistently improves success rates while maintaining interaction for dynamic HOI tasks. Furthermore, blending pretrained experts with our composer achieves competitive performance in significantly reduced training time. Ablation studies validate the effectiveness of our augmentation and composer blending.

2605.11368 2026-05-13 cs.LG cs.AI q-bio.GN

LPDP: Inference-Time Reward Control for Variable-Length DNA Generation with Edit Flows

Jeongchan Kim, Yunkyung Ko, Jong Chul Ye

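AI summary: This paper studies how to use Edit Flows for inference-time reward control in DNA sequence generation. It proposes LPDP, a training-free local re-solving operator that is aware of intermediate states and actions and performs efficient edit operations when generating variable-length DNA sequences. At each inference step, LPDP scores one-step root edits, retains a set of near-best root edits, and solves a discrete optimization problem within a local neighborhood, improving the quality and biological plausibility of the generated sequences. It applies to tasks such as enhancer optimization and splice-boundary repair.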

Comments 22 pages, 5 figures

Abstract

We study the application of recent Edit Flows for inference-time reward control for DNA sequence generation. Unlike most reward-guided DNA generation frameworks, which operate on fixed-length sequence spaces, Edit Flows have the potential to generate variable-length DNA through biologically plausible insertion, deletion, and substitution operations. In particular, we propose Local Perturbation Discrete Programming (LPDP), a training-free, intermediate-state and action-aware local re-solving operator for variable-length DNA edit-action generators at inference time. More specifically, at each guided rollout step, LPDP scores one-step root edits, retains a near-best root band, and re-ranks each retained root by solving a bounded local discrete program around its child sequence. This local program uses the typed geometry of edit actions to focus on coherent substitution, insertion, or deletion subgraphs, and aggregates local continuations with either a hard Max backup or a soft log-sum-exponential (LSE) backup. We instantiate LPDP in two regimes: front-loaded reward tilting for enhancer optimization, where early edits are critical for establishing global regulatory sequence structure, and back-loaded reward tilting for exon-intron-exon inpainting, where late edits fine-tune splice-boundary contexts.
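
The difference between the hard Max backup and the soft LSE backup mentioned in this abstract can be illustrated with a minimal Python sketch. The function names, the `temperature` parameter, and the toy continuation scores are assumptions made for illustration; the abstract does not give LPDP's exact formulas.

```python
import math

def max_backup(values):
    """Hard backup: keep only the best local continuation score."""
    return max(values)

def lse_backup(values, temperature=1.0):
    """Soft backup: temperature * log(sum(exp(v / temperature))).

    Subtracting the maximum first keeps the exponentials from overflowing.
    The temperature is a hypothetical knob, not a parameter from the paper.
    """
    top = max(values)
    total = sum(math.exp((v - top) / temperature) for v in values)
    return top + temperature * math.log(total)

# Hypothetical scores of three local continuations of one retained root edit.
scores = [1.0, 2.0, 3.0]
hard = max_backup(scores)
soft = lse_backup(scores, temperature=1.0)
nearly_hard = lse_backup(scores, temperature=1e-3)
```

In this form the soft backup is never smaller than the hard one, and it approaches the hard backup as the temperature goes to zero, so a single parameter moves between the two aggregation styles.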

2605.11363 2026-05-13 cs.CV cs.CL

PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Wei Wu, Ziyang Xu, Zeyu Zhang, Yang Zhao, Hao Tang

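AI summary: This paper proposes PresentAgent-2, an agentic framework that generates complete presentation videos with multimodal content from user queries. The framework supports three independent presentation modes (single-speaker narration, multi-speaker discussion, and interactive question answering) and uses deep research and multimodal resource integration for content generation, script writing, and dynamic media composition. The work extends presentation generation from document-dependent slide creation to query-driven, research-grounded, interactive video generation.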

Abstract

Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as for asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.

2605.11362 2026-05-13 cs.LG cs.AI stat.AP stat.ML

Causal Fairness for Survival Analysis

Drago Plecko

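AI summary: In the data-driven era, machine learning and AI are widely used in high-stakes domains such as healthcare and employment, raising concerns about the fairness of these systems. Existing fair machine learning research mostly targets static settings, while fairness in temporal settings such as survival analysis remains underexplored. This paper proposes a causal framework for fairness in survival analysis that decomposes survival disparities into contributions from direct, indirect, and spurious pathways, giving an interpretable account of why disparities arise and how they evolve, and applies it to the change in racial disparities over time in intensive care units.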

Abstract

In the data-driven era, large-scale datasets are routinely collected and analyzed using machine learning (ML) and artificial intelligence (AI) to inform decisions in high-stakes domains such as healthcare, employment, and criminal justice, raising concerns about the fairness behavior of these systems. Existing works in fair ML cover tasks such as bias detection, fair prediction, and fair decision-making, but largely focus on static settings. At the same time, fairness in temporal contexts, particularly survival/time-to-event (TTE) analysis, remains relatively underexplored, with current approaches to fair survival analysis adopting statistical fairness definitions, which, even with unlimited data, cannot disentangle the causal mechanisms that generate disparities. To address this gap, we develop a causal framework for fairness in TTE analysis, enabling the decomposition of disparities in survival into contributions from direct, indirect, and spurious pathways. This provides a human-understandable explanation of why disparities arise and how they evolve over time. Our non-parametric approach proceeds in four steps: (1) formalizing the necessary assumptions about censoring and lack of confounding using a graphical model; (2) recovering the conditional survival function given covariates; (3) applying the Causal Reduction Theorem to reframe the problem in a form amenable to causal pathway decomposition; (4) estimating the effects efficiently. Finally, our approach is used to analyze the temporal evolution of racial disparities in outcome after admission to an intensive care unit (ICU).

2605.11361 2026-05-13 cs.LG cs.DS

The tractability landscape of diffusion alignment: regularization, rewards, and computational primitives

Ankur Moitra, Andrej Risteski, Dhruv Rohatgi

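AI summary: This paper studies how reward alignment adjusts the generative distribution of diffusion models at inference time, and how the tractability of alignment differs across distributional distance measures. The authors propose a primitive-based approach, analyze the minimal algorithmic primitives needed for reward alignment under KL divergence and Wasserstein distance, and show that efficient sampling and optimization primitives exist for convex low-dimensional rewards and for concave or low-dimensional Lipschitz rewards, respectively, which delineates the boundary of tractability for reward alignment.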

Abstract

Inference-time reward alignment asks how to turn a pre-trained diffusion model with base law $p$ into a sampler that favors a reward $r$ while remaining close to $p$. Since there is no canonical distributional distance for this closeness constraint, different choices lead to different "reward-aligned" laws and, just as importantly, different algorithmic problems. We develop a primitive-based approach to reward alignment: rather than assuming arbitrary reward-aligned laws can be sampled, we ask which simple algorithmic primitives suffice to implement alignment for non-trivial reward classes. If closeness is measured in KL distance, the target law is $q(x) \propto p(x) \exp(\lambda^{-1} r(x))$. For this setting, we show that linear exponential tilts of the form $q(x) \propto p(x) \exp(\langle \theta, x \rangle)$ -- which according to recent work [MRR26] can be efficiently sampled from -- are a sufficient primitive for aligning to a very broad class of convex low-dimensional rewards. If closeness is measured in Wasserstein distance, the corresponding primitive is a proximal transport oracle: given $x$, solve $\operatorname{argmax}_y \{r(y) - \lambda c(x,y)\}$. This oracle can be efficiently implemented for concave or low-dimensional Lipschitz rewards $r(x)=f(Ax)$. Together, these results illustrate that the choice of distribution distance for alignment affects the computational primitive and the tractable reward class.
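
The KL-tilted target law in this abstract can be checked numerically on a small discrete distribution. The sketch below assumes the standard KL-regularized objective $E_q[r] - \lambda\,\mathrm{KL}(q \,\|\, p)$, whose maximizer is $q \propto p \exp(r/\lambda)$ with optimal value $\lambda \log Z$; the three-point distribution, the reward values, and the function names are hypothetical and not taken from the paper.

```python
import math

def kl_tilt(p, r, lam):
    """Return the tilted law q(x) proportional to p(x) * exp(r(x) / lam), and its normalizer Z."""
    weights = [pi * math.exp(ri / lam) for pi, ri in zip(p, r)]
    z = sum(weights)
    return [w / z for w in weights], z

def regularized_objective(q, p, r, lam):
    """Expected reward under q minus lam times KL(q || p)."""
    expected_reward = sum(qi * ri for qi, ri in zip(q, r))
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return expected_reward - lam * kl

# Hypothetical base law, reward, and regularization strength.
p = [0.5, 0.3, 0.2]
r = [0.0, 1.0, 2.0]
lam = 0.5

q, z = kl_tilt(p, r, lam)
best = regularized_objective(q, p, r, lam)
```

Evaluating `regularized_objective` at any other distribution on the same support, such as `p` itself, gives a value no larger than `best`, which is one way to see why this choice of distance leads to an exponential tilt of the base law.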

2605.11355 2026-05-13 cs.LG cs.CE

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

Reza Barati, Qinmin Vivian Hu

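AI summary: This paper presents gym-invmgmt, an open-source framework for evaluating inventory-management methods, which compares inventory policies under unified experimental conditions. Using a shared core environment and 22 diverse scenarios, it evaluates optimization methods, heuristics, and learned controllers under different inventory-management conditions. The study finds that stochastic programming with scenario hedging performs best when forecast information is available, and that Transformer-based Proximal Policy Optimization offers advantages in inference speed and policy quality, but the performance of each policy depends on information access, demand shift, network topology, and policy representation.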

Comments 16 pages, 4 figures

Abstract

Inventory-policy comparisons are often difficult to interpret because performance depends on the evaluation contract as much as on the policy itself. Differences in topology, demand regime, information access, feasibility constraints, shortage treatment, and Key Performance Indicator (KPI) definitions can change method rankings. We present gym-invmgmt, a Gymnasium-compatible extension of the OR-Gym inventory-management lineage for auditable cross-paradigm evaluation. The benchmark evaluates optimization, heuristic, and learned controllers under a shared CoreEnv transition, reward, action-bound, and KPI contract, while varying stress conditions through a 22-scenario core grid plus four supplemental MARL-mode rows. Within these released scenarios, informed stochastic programming provides the strongest non-oracle reference, reflecting the value of scenario hedging under forecast access, but at substantially higher online computational cost. Among learned controllers, the Proximal Policy Optimization Transformer variant (PPO-Transformer) achieves the strongest learned-policy quality at fast inference, while Residual Reinforcement Learning (Residual RL) provides competitive hybrid performance. The graph neural network variant (PPO-GNN) is highly competitive on the default divergent topology but less robust on the serial topology. Imitation learning performs well in stationary regimes but degrades under demand shift, and the bounded Large Language Model (LLM) policy-parameter baseline is best interpreted as a diagnostic controller rather than an autonomous inventory optimizer. Overall, the benchmark identifies scenario-conditioned leaders while showing that performance depends jointly on information access, demand shift, topology, and policy representation.

2605.11354 2026-05-13 cs.CV

Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

Haoyu Zhang, Zeyu Zhang, Zedong Zhou, Yang Zhao, Hao Tang

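AI summary: This paper proposes Lite3R, a model-agnostic framework for improving the efficiency of Transformer-based 3D reconstruction. It introduces Sparse Linear Attention to reduce the computational cost of dense multi-view attention and combines it with a parameter-efficient FP8-aware quantization-aware training strategy for stable geometric reconstruction at low precision. Experiments show that Lite3R substantially reduces latency and memory consumption on several mainstream models while maintaining high reconstruction quality, offering an effective algorithm-system co-design approach for efficient 3D reconstruction in practice.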

Abstract

Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.

2605.11348 2026-05-13 cs.CL cs.AI cs.IR cs.SI

Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence

Ujun Jeong, Saketh Vishnubhatla, Bohan Jiang, Andre Harrison, Adrienne Raglin, Huan Liu

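AI summary: This paper studies how large language models (LLMs) can extract causal relations from social media in disaster scenarios to improve situational awareness. To validate LLM effectiveness, the authors propose an expert-grounded evaluation framework that assesses accuracy by comparing model-generated causal graphs with reference graphs from disaster reports. The study finds that LLMs show potential for causal relation extraction but also carry the risk of relying on model priors rather than post-event evidence.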

Comments Submitted to EMNLP

Abstract

During disasters, extracting causal relations from social media can strengthen situational awareness by identifying factors linked to casualties, physical damage, infrastructure disruption, and cascading impacts. However, disaster-related posts are often informal, fragmented, and context-dependent, and they may describe personal experiences rather than explicit causal relations. In this work, we examine whether Large Language Models (LLMs) can effectively extract causal relations from disaster-related social media posts. To this end, we (1) propose an expert-grounded evaluation framework that compares LLM-generated causal graphs with reference graphs derived from disaster-specific reports and (2) assess whether the extracted relations are supported by post-event evidence or instead reflect model priors. Our findings highlight both the potential and risks of using LLMs for causal relation extraction in disaster decision-support systems.

2605.11346 2026-05-13 cs.LG cs.AI cs.CE

Physics-Informed Teacher-Student Ensemble Learning for Traffic State Estimation with a Varying Speed Limit Scenario

Archie J. Huang, Dongdong Wang, Shaurya Agarwal, Mohamed Abdel-Aty, Md Mahmudul Islam, Muhammad Shahbaz

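AI summary: This paper studies traffic state estimation under varying speed limits and proposes a new framework that combines physics-informed deep learning with teacher-student ensemble training. The teacher models encode the flow conservation law, and the student model uses a multi-layer perceptron classifier to identify traffic characteristics and select the appropriate teacher model for estimation, which handles the heterogeneity in traffic characteristics caused by changing speed limits. Experimental results show that the method outperforms other mainstream baselines on traffic state estimation.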

Comments The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026

Abstract

Physics-informed deep learning (PIDL) neural networks have shown their capability as a useful instrument for transportation practitioners in utilizing the underlying relationship between the state variables for traffic state estimation (TSE). Another efficient traffic management approach is implementing varying speed limits (VSLs) on transportation corridors to control traffic and mitigate congestion. However, the existing training architecture of PIDL in the literature cannot accommodate the changing traffic characteristics on a freeway with VSL. To tackle this challenge, we propose a novel framework integrating teacher-student ensemble training with PIDL neural networks for TSE under VSL scenarios. The physics of flow conservation law is encoded locally in the teacher models by PIDL, and the student model uses a multi-layer perceptron classifier (MLP) to identify traffic characteristics and selects the ensemble member of PIDL neural networks for TSE. This integrated framework provides a natural solution for capturing the heterogeneity of VSL and accurately addressing the TSE problem. The case study results validate the proposed ensemble approach, demonstrating its superior performance in TSE compared to other popular baseline methods, as indicated by relative L2 error.

2605.11341 2026-05-13 cs.AI

CPEMH: An Agentic Framework for Prompt-Driven Behavior Evaluation and Assurance in Foundation-Model Systems for Mental Health Screening

Giuliano Lorenzoni, Ivens Portugal, Paulo Alencar, Donald Cowan

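AI summary: This paper proposes CPEMH, an agentic framework for evaluating and assuring the behavior of prompt-driven large language models in mental-health screening. By orchestrating the design, evaluation, and selection of prompt strategies, the framework gives systematic control over model behavior across contexts, and its modular structure ensures traceability and stability. A depression-screening case study demonstrates the framework's ability to stabilize and audit model behavior in clinical conversational settings, and highlights the importance of modular orchestration, prioritizing stability, and using F1, bias, and robustness as core evaluation criteria.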

Comments 4 pages, 2 figures. Accepted at the AGENT 2026 Workshop (ICSE 2026)

Abstract

This paper presents CPEMH, an agentic framework designed to evaluate prompt-driven behavior in foundation-model systems operating on transcript-based datasets for mental-health screening. CPEMH serves as an engineering methodology for behavioral assurance in large-scale language systems, introducing an orchestrated architecture that autonomously performs the design, evaluation, and selection of prompt strategies, enabling systematic control of behavioral variability across contexts. Its modular agentic design, combining orchestrator, inference, and evaluation agents, ensures traceability, reproducibility, and robustness throughout the prompting lifecycle. A case study on automated depression screening from interview transcripts demonstrates the framework's capacity to stabilize and audit foundation-model behavior in conversational and clinically sensitive domains. Lessons learned emphasize the role of modular orchestration in behavioral assurance, the prioritization of stability over architectural complexity, and the integration of F1, bias, and robustness as core acceptance criteria.

2605.11334 2026-05-13 cs.LG cs.CL cs.IR

VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference

Jasmine Qi, Danylo Dantsev, Muyang Sun

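AI summary: VERDI is a single-call confidence estimation method for verification-based LLM judges. It decomposes the verification steps in the reasoning process and extracts three structural signals to assess how trustworthy a verdict is. The method needs no additional inference calls and uses a logistic regression model for confidence prediction; it performs well on several public benchmarks and a production system, and adapts well even to models whose answer confidence is poorly calibrated.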

Comments 16 pages, 6 figures

Abstract

LLM-as-Judge systems are widely deployed for automated evaluation, yet practitioners lack reliable methods to know when a judge's verdict should be trusted. Token log-probabilities, the standard post-hoc confidence signal, are unavailable for many commercial LLMs and, even when accessible, saturate above 0.999 with structured JSON output. We introduce VERDI (VERification-Decomposed Inference), a method that extracts confidence from the reasoning trace a structured judge already produces, with no additional inference calls. VERDI decomposes each verification-style evaluation into sub-checks and derives three structural signals: Step-Verdict Alignment, Claim-Level Margin, and Evidence Grounding Score. We combine them with Platt-scaled logistic regression. On three public benchmarks, VERDI achieves AUROC 0.72-0.91 on GPT-4.1-mini and 0.66-0.80 on GPT-5.4-mini. On Qwen3.5-4B/9B/27B, where answer-token logprobs are anti-calibrated (higher confidence on errors, AUROC 0.32-0.49), VERDI achieves 0.56-0.70. We additionally validate on a production system with eight rubrics (AUROC 0.73-0.88 on factual rubrics), demonstrate cross-model transfer (AUROC 0.66-0.69), and show that a 33M-parameter NLI (Natural Language Inference) model provides a scalable alternative to regex extraction.
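
The step of combining three structural signals into one confidence score can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: plain logistic regression fit by gradient descent stands in for the Platt-scaled model, whose exact calibration procedure the abstract does not describe, and the six training rows are synthetic values invented for illustration only. Each row holds three hypothetical signal values in [0, 1], and the label says whether the judge's verdict was correct.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(features, labels, lr=0.5, steps=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def confidence(signals, weights, bias):
    """Map three signal values to a probability that the verdict is correct."""
    return sigmoid(sum(w * s for w, s in zip(weights, signals)) + bias)

# Synthetic toy data: (alignment, margin, grounding) signals and correctness labels.
features = [(0.9, 0.8, 0.9), (0.8, 0.7, 0.6), (0.7, 0.9, 0.8),
            (0.2, 0.1, 0.3), (0.3, 0.2, 0.1), (0.1, 0.3, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(features, labels)
high = confidence((0.9, 0.9, 0.9), weights, bias)
low = confidence((0.1, 0.1, 0.1), weights, bias)
```

Because the signals come from a reasoning trace the judge has already produced, scoring a new verdict in this sketch costs only one dot product and one sigmoid, with no further model call.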

2605.11330 2026-05-13 cs.AI

Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

Wenbo Chen, Veena Padmanabhan, Tootiya Giyahchi, Elaine Wong, Leman Akoglu

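AI summary: This paper rethinks evaluation for LLM hallucination detection. It proposes a list of desired properties for effective hallucination detection benchmarks (HDBs) and points out that existing benchmarks fall clearly short in long-context RAG (retrieval-augmented generation) benchmarks and in support for realistic label noise. The authors build and open-source TRIVIA+, a new RAG-based hallucination detection benchmark that contains the longest-context samples to date and includes several sets of noisy labels to simulate realistic conditions. Experiments show that existing detectors still have considerable room for improvement on RAG tasks and that label noise significantly affects detection performance.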

Comments ACL 2026 main conference

Abstract

Hallucination, broadly referring to unfaithful, fabricated, or inconsistent content generated by LLMs, has wide-ranging implications. Therefore, a large body of effort has been devoted to detecting LLM hallucinations, as well as designing benchmark datasets for evaluating these detectors. In this work, we first establish desiderata: a set of properties that hallucination detection benchmarks (HDBs) should exhibit for effective evaluation. A critical look at existing HDBs through the lens of our desiderata reveals that none of them exhibits all the properties. We identify the two largest gaps: (1) RAG-based grounded benchmarks with long context are severely lacking (partly because length impedes human annotation); and (2) existing benchmarks do not make available realistic label noise for stress-testing detectors, although real-world use cases often grapple with label noise due to human or automated/weak annotation. To close these gaps, we build and open-source a new RAG-based HDB called TRIVIA+ that underwent a rigorous human annotation process. Notably, our benchmark exhibits all desirable properties, including (1) TRIVIA+ contains samples with the longest context in the literature; and (2) we design and share four sets of noisy labels with different noise schemes, both sample-dependent and sample-independent. Finally, we perform experiments on RAG-based HDBs, including our TRIVIA+, using popular SOTA detectors that reveal new insights: (i) ample room remains for current detectors to reach the performance ceiling on RAG-based HDBs, (ii) the basic LLM-as-a-Judge baseline performs competitively, and (iii) label noise hinders detection performance. We expect that our findings, along with our proposed benchmark, will motivate and foster needed research on hallucination detection for RAG-based tasks.

2605.11328 2026-05-13 cs.LG cs.AI

Epistemic Uncertainty for Test-Time Discovery

Kainat Riaz, Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Umer, Ayesha Mohsin, Aqib Riaz, Ali Subhan, John M. Cioffi

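AI summary: This work studies how to use large language models for scientific discovery at test time. It notes that standard reinforcement learning penalizes high-variance mutations and therefore favors familiar patterns, making sustained reward gains difficult. The authors propose an exploration strategy based on an epistemic-uncertainty measure: a small ensemble of adapters over a frozen base model identifies regions that are uncertain because of insufficient training coverage rather than intrinsic problem difficulty, steering the policy toward regions where discovery is possible. Experiments show that the method raises the maximum reward on several scientific discovery tasks and maintains higher solution diversity.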

Abstract

Automated scientific discovery using large language models relies on identifying genuinely novel solutions. Standard reinforcement learning penalizes high-variance mutations, which leads the policy to prioritize familiar patterns. As a result, the maximum reward plateaus even as the average reward increases. Overcoming this limitation requires a signal that distinguishes unexplored regions from intrinsically difficult problems. This necessitates measuring disagreement across independently adapted weight hypotheses rather than relying on a single network's confidence. UG-TTT addresses this challenge by maintaining a small ensemble of low-rank adapters over a frozen base model. The per-token disagreement, quantified as the mutual information between ensemble predictions and weight hypotheses, isolates epistemic uncertainty and identifies positions where insufficient coverage leads to adapter divergence rather than intrinsic problem difficulty. This measure is incorporated as an exploration bonus into the policy gradient, directing the policy toward positions where persistent adapter disagreement signals low training coverage, the same frontier where genuine discovery is possible. A nuclear norm regularizer ensures the adapters remain distinct from one another, thereby preserving the exploration signal throughout training. Across four scientific discovery benchmarks, UG-TTT increases the maximum reward on three tasks and maintains substantially higher solution diversity; an ablation study confirms that the regularizer is essential for sustaining this behavior.
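
The per-token disagreement described in this abstract can be sketched with the usual ensemble decomposition: mutual information equals the entropy of the averaged prediction minus the average entropy of the individual predictions. The sketch assumes uniform weighting over ensemble members, and the function names and two-token toy distributions are hypothetical; the abstract does not specify how UG-TTT computes or scales the bonus.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def epistemic_uncertainty(member_probs):
    """Mutual information between the predicted token and the ensemble member.

    member_probs holds one next-token distribution per adapter, all over the
    same vocabulary. Members are weighted uniformly (an assumption).
    """
    k = len(member_probs)
    vocab = len(member_probs[0])
    mean = [sum(m[j] for m in member_probs) / k for j in range(vocab)]
    return entropy(mean) - sum(entropy(m) for m in member_probs) / k

# Adapters agree but are individually unsure: the difficulty is intrinsic.
agree_unsure = epistemic_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Adapters are individually sure but contradict each other: low coverage.
disagree = epistemic_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

The two toy cases show the separation the abstract relies on: shared uncertainty yields a score of zero, while confident disagreement yields the maximum score for two members, log 2.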

2605.11327 2026-05-13 cs.LG

Neural Statistical Functions

Daniel Xu, Yuxin Xie, Minghao Guo, Haixu Wu, Wojciech Matusik

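AI summary: This paper proposes neural statistical functions, a new class of models that directly estimate statistics over continuous ranges of operating conditions, avoiding the high latency of repeated inference in conventional approaches. Built on pretrained single-sample predictors and scattered data, the method introduces prefix statistics to unify different statistical functions such as integrals, quantiles, and extrema in an interval-conditional framework, and uses a principled identity between prefix statistics and individual-case regression as the learning objective. Experiments show strong performance in estimating statistics of complex physical processes, including accumulated energy in dynamical systems, quantiles of aerodynamic responses, and maximum stress in crash processes, with up to 100 times fewer model evaluations.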

Abstract

Classical deep learning typically operates on individual cases. Despite its success, real-world usage often requires repeated inference to estimate statistical quantities for complex decision-making tasks involving uncertainty or extreme-value analysis, resulting in substantial latency. We introduce neural statistical functions, a new family of models learned from pre-trained single-sample predictors and scattered data samples, which can directly infer statistics over continuous operating condition ranges without explicit sampling. By introducing the notion of prefix statistics, we transform and unify diverse statistical functions (e.g., integrals, quantiles, and maxima) into an interval-conditional framework, in which a principled identity between the prefix statistics and the individual-case regression serves as the learning objective. Neural statistical functions achieve strong performance in estimating essential statistics of complex physical processes, including accumulated energy in dynamical systems, quantiles of aerodynamic responses, and maximum stress in crash processes, while achieving up to a 100$\times$ reduction in model evaluations.

2605.11324 2026-05-13 cs.LG stat.ML

$\varepsilon$-Good Action Identification in Fixed-Budget Monte Carlo Tree Search

Yinan Li, Tuan Nguyen, Kwang-Sung Jun

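AI summary: This paper studies identifying ε-good actions in depth-2 max-min trees under a fixed budget, an important special case of Monte Carlo Tree Search. The authors propose an algorithm that does not take ε as input and achieves instance-dependent error bounds for every meaningful ε, with a misidentification probability that decays exponentially. They also analyze how the hardness structure of this problem differs from that of standard K-armed bandits and provide corresponding lower-bound results. This is the first fixed-budget algorithmic guarantee for max-min action identification.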

Abstract

We study the fixed-budget max-min action identification problem in depth-2 max-min trees, an important special case of Monte Carlo Tree Search. A learner sequentially allocates $T$ samples to leaves and then recommends a subtree whose minimum leaf value is largest. Motivated by approximate planning, we focus on $\varepsilon$-good subtree identification, where any subtree whose min value is within $\varepsilon$ of the optimal maximin value is acceptable. Our main contribution is an $\varepsilon$-agnostic algorithm: it does not require $\varepsilon$ as input, but achieves instance-dependent error bounds for every meaningful $\varepsilon$. We show that the misidentification probability decays as $\exp(-\widetilde{\Theta}(T/H_2(\varepsilon)))$, where $H_2(\varepsilon)$ captures both cross-subtree and within-subtree gaps. When each subtree has a single leaf, the problem reduces to standard fixed-budget best-arm identification, and our analysis recovers, up to accelerating factors, known $\varepsilon$-good guarantees for halving-style methods while giving a new $\varepsilon$-good guarantee for Successive Rejects. On the lower-bound side, we provide complementary positive and negative results showing that max-min identification has a different hardness structure from standard $K$-armed bandits. To our knowledge, this is the first provable fixed-budget algorithmic guarantee for max-min action identification.
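
The acceptance criterion in this abstract can be made concrete with a short Python sketch that takes the true leaf means as given. It only illustrates what counts as an $\varepsilon$-good subtree; it is not the paper's sampling algorithm, and the tree values and function names are hypothetical.

```python
def maximin_value(tree):
    """Largest minimum leaf value over all subtrees of a depth-2 max-min tree."""
    return max(min(leaves) for leaves in tree)

def eps_good_subtrees(tree, eps):
    """Indices of subtrees whose minimum leaf is within eps of the maximin value."""
    best = maximin_value(tree)
    return [i for i, leaves in enumerate(tree) if min(leaves) >= best - eps]

# Hypothetical true leaf means; each inner list is one subtree (one root action).
tree = [[0.9, 0.5], [0.6, 0.7], [0.55, 0.95]]

exact = eps_good_subtrees(tree, eps=0.0)      # only the optimal subtree
relaxed = eps_good_subtrees(tree, eps=0.07)   # the near-optimal subtree also qualifies
```

Subtree 0 has the largest single leaf (0.9) but the smallest minimum (0.5), so it is excluded at both tolerances, which shows why the min step makes this different from picking the best arm among all leaves.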

2605.11317 2026-05-13 cs.CL cs.AI

SOMA: Efficient Multi-turn LLM Serving via Small Language Model

Xueqi Cheng, Qiong Wu, Zhengyi Zhou, Xugui Zhou, Tyler Derr, Yushun Dong

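AI summary: Deploying large language models (LLMs) in multi-turn dialogue incurs high latency, memory, and API costs. This paper proposes SOMA, a framework that uses the early turns of a session to estimate a local response manifold and then uses a small language model as a surrogate for the rest of the conversation, improving serving efficiency while preserving response quality. The method combines soft-prompt learning, anti-degeneration control, and localized LoRA fine-tuning so that the surrogate runs without prompts at inference, and it is supported by theoretical analysis and experiments.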

Abstract

Large Language Models (LLMs) are increasingly deployed in multi-turn dialogue settings where preserving conversational context across turns is essential. A standard serving practice concatenates the full dialogue history at every turn, which reliably maintains coherence but incurs substantial cost in latency, memory, and API expenditure, especially when queries are routed to large proprietary models. Existing approaches often struggle to balance the trade-off between response quality and efficiency. We propose a framework that exploits the early turns of a session to estimate a local response manifold and then adapt a smaller surrogate model to this local region for the remainder of the conversation. Concretely, we learn soft prompts that maximize semantic divergence between the large and surrogate small language models' responses to surface least-aligned local directions, stabilize training with anti-degeneration control, and distill the mined cases into localized LoRA fine-tuning so the surrogate runs without prompts at inference. A simple gate enables a one-time switch with rollback on drift. We further provide a theoretical analysis for key components in SOMA. Extensive experiments show the effectiveness of SOMA. The source code is provided at: https://github.com/LabRAI/SOMA.

2605.11316 2026-05-13 cs.LG math.OC

Error whitening: Why Gauss-Newton outperforms Newton

Maricela Best McKay, Nathan P. Lawrence, Brian Wetton, R. Bhushan Gopaluni

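AI Summary This paper analyzes, from a function space perspective, why Gauss-Newton outperforms Newton's method in practice. It shows that the Gauss-Newton matrix projects the loss gradient onto the model's tangent space, removing the distortion introduced by the parameterization, an effect the authors call "error whitening". As a result, Gauss-Newton optimization follows the structure of the loss itself more closely and performs better across a range of learning tasks.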

Comments NeurIPS preprint

Abstract

The Gauss-Newton matrix is widely viewed as a positive semidefinite approximation of the Hessian, yet mounting empirical evidence shows that Gauss-Newton descent outperforms Newton's method. We adopt a function space perspective to analyze this phenomenon. We show that the generalized Gauss-Newton (GGN) matrix projects the Newton direction in function space onto the model's tangent space, while a Jacobian-only variant obtained by applying the least squares Gauss-Newton matrix to non-least squares losses projects the function space loss gradient onto this same tangent space. Both projections eliminate distortions from the model's parameterization. Specifically, the evolution of the prediction-target mismatch depends on the model's parameterization through the matrix $JJ^\top$ where $J$ is the Jacobian of the model with respect to its parameters. The projections effectively replace $JJ^\top$ with the identity. We call this effect error whitening. Once the parameterization is removed, the prediction-target mismatch evolves according to dynamics dictated by the structure of the loss and the projection produced by the optimizer. Error whitening is a special property of Gauss-Newton descent that rigorously distinguishes it from Newton's method. We empirically demonstrate that Gauss-Newton optimizers follow the theoretically predicted function space dynamics and outperform Newton's method, Adam, and Muon across case studies spanning supervised learning, physics-informed deep learning, and approximate dynamic programming.
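A short worked sketch of the plain least squares case may make the role of $JJ^\top$ concrete. It assumes a squared loss $L = \tfrac{1}{2}\|r\|^2$ with mismatch $r = f(\theta) - y$ and continuous-time (flow) dynamics; these choices, and the symbols $f$, $\theta$, $y$, $r$, and $P$, are illustrative assumptions rather than the paper's notation.

```latex
\begin{align*}
\text{gradient flow:} \quad & \dot\theta = -J^\top r
  && \Rightarrow \quad \dot r = J\dot\theta = -JJ^\top r, \\
\text{Gauss-Newton flow:} \quad & \dot\theta = -(J^\top J)^{+} J^\top r
  && \Rightarrow \quad \dot r = -J (J^\top J)^{+} J^\top r = -P r .
\end{align*}
```

Here $P = JJ^{+}$ is the orthogonal projector onto the range of $J$. Under gradient flow the mismatch is shaped by the parameterization through $JJ^\top$; under the Gauss-Newton flow only the projector remains, and when $J$ has full row rank $P = I$, so $\dot r = -r$ regardless of how the model is parameterized.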

2605.11312 2026-05-13 cs.AI

Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

Danilo Brajovic, David A. Kreplin, Marco F. Huber

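AI Summary This paper studies effective data pruning when data is scarce and proposes Constraint-Data-Value-Maximization (CDVM), a method based on data attribution. CDVM casts pruning as a constrained optimization problem that maximizes total data influence while limiting contributions to any single test sample, so model performance holds up even when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.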

Comments Accepted for publication at IJCAI 2026

Abstract

Attributing model behavior to training data is an evolving research field. A common benchmark is data removal, which involves eliminating data instances with either low or high values, then assessing the performance of a model trained on the modified dataset. Many existing studies leverage Shapley-based data values for this task. In this paper, we demonstrate that these data values are not optimally suited for pruning low-value data when only a limited amount of data remains. To address this limitation, we introduce the Constraint-Data-Value-Maximization (CDVM) approach, which effectively utilizes data attributions for pruning in low-data scenarios. By casting pruning as a constrained optimization that both maximizes total influence and penalizes excessive per-test contributions, CDVM delivers robust performance when only a small fraction of the data is retained. On the OpenDataVal benchmark, CDVM shows strong performance and competitive runtime.

2605.11311 2026-05-13 cs.LG cs.CV stat.CO stat.ML

Couple to Control: Joint Initial Noise Design in Diffusion Models

Jing Jia, Liyue Shen, Guanyang Wang

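AI Summary This paper studies initial-noise design in diffusion models and argues that the conventional assumption of mutually independent initial noises may limit generation. The authors propose designing the dependence structure among the noises while keeping each one marginally standard Gaussian, which improves the diversity and quality of multi-sample generation without changing the model's input distribution. Experiments show that the method improves diversity on several mainstream diffusion models while preserving image quality and prompt alignment, and that it beats existing optimization-based methods on some metrics.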

Comments 26 pages

Abstract

Diffusion models typically generate image batches from independent Gaussian initial noises. We argue that this independence assumption is only one choice within a broader class of valid joint noise designs. Instead, one can specify a coupling of the initial noises: each noise remains marginally standard Gaussian, so the pretrained diffusion model receives the same single-sample input distribution, while the dependence across samples is chosen by design. This reframes initial-noise control from selecting or optimizing individual seeds to designing the dependence structure of a multi-sample gallery. This view gives a general framework for initial-noise design, covering several existing methods as special cases and leading naturally to new coupled-noise constructions. Coupled noise can improve generation on its own without adding sampling cost, and it is flexible enough to serve as a structured initialization for optimization-based pipelines when additional computation is available. Empirically, repulsive Gaussian coupling improves gallery diversity on SD1.5, SDXL, and SD3 while largely preserving prompt alignment and image quality. It matches or outperforms recent test-time noise-optimization baselines on several diversity metrics at the same sampling cost as independent generation. Subspace couplings also support fixed-object background generation, producing diverse, natural backgrounds compared with specialized inpainting baselines, with a tunable trade-off in foreground fidelity.
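The idea of a coupling that keeps each marginal standard Gaussian can be illustrated with a textbook construction. The sketch below draws independent Gaussians, subtracts the batch mean, and rescales by $\sqrt{n/(n-1)}$; each resulting noise is still marginally $N(0, I)$, while every pair of samples has correlation $-1/(n-1)$ per coordinate. This is only one simple negatively correlated coupling, offered as an assumption-laden illustration; it is not claimed to be the repulsive Gaussian coupling used in the paper, and the function name is hypothetical.

```python
import math
import random


def mean_centered_coupling(n_samples, dim, rng):
    """Return n_samples noise vectors that are each marginally N(0, I)
    but negatively correlated across samples."""
    if n_samples < 2:
        raise ValueError("need at least two samples to couple")
    # Independent standard Gaussian draws, one row per sample.
    g = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
    # Per-coordinate mean over the batch.
    col_mean = [sum(row[d] for row in g) / n_samples for d in range(dim)]
    # Var(g_i - mean) = 1 - 1/n, so this rescaling restores unit variance.
    scale = math.sqrt(n_samples / (n_samples - 1))
    return [[scale * (row[d] - col_mean[d]) for d in range(dim)] for row in g]


noises = mean_centered_coupling(n_samples=4, dim=8, rng=random.Random(0))
# By construction the coupled noises cancel out coordinate-wise.
column_sums = [sum(z[d] for z in noises) for d in range(8)]
print(max(abs(s) for s in column_sums))
```

Because only the joint distribution changes, a pretrained model sees the same single-sample input distribution as with independent seeds, which is the property the abstract emphasizes.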

2605.11307 2026-05-13 cs.CV cs.LG

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra

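AI Summary Vision2Code is a benchmark for evaluating multi-domain image-to-code generation, testing whether vision-language models can turn the structure of an image into executable code. It contains 2,169 test examples from 15 datasets spanning charts, geometry, scientific imagery, and other domains, and scores outputs with a vision-language-model rater in a way that separates code execution failures from reconstruction quality. Experiments show that model performance varies markedly across domains and that filtered model outputs can serve as training data to improve generation.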

Comments Project page: https://image2code.github.io/vision2code/

Abstract

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from reconstruction quality. Human validation shows that this evaluation protocol aligns better with human judgments than either a generic visual rubric or embedding-similarity baselines. Across nine open-weight and proprietary models, we find that image-to-code performance is domain-dependent: leading models perform well on regular chart- and graph-like visuals but remain weak on spatial scenes, chemistry, documents, and circuit-style diagrams. Finally, we show that evaluator-filtered model outputs can serve as training data to improve image-to-code capability, with Qwen3.5-9B improving from 1.60 to 1.86 on the benchmark without paired source programs. Vision2Code provides a reproducible testbed for measuring, diagnosing, and improving image-to-code generation. Our code and data are publicly available at https://image2code.github.io/vision2code/.

2605.11304 2026-05-13 cs.CV

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

Eva Prakash, Yunhe Gao, Chong Wang, Justin Xu, Neal Prakash, Arne Michalson, Seena Dehkharghani, Eun Kyoung Hong, Julie Bauml, Roger Boodoo, Jean-Benoit Delbrouck, Sophie Ostmeier, Curtis Langlotz

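AI Summary CheXTemporal is a dataset for temporal reasoning over chest X-rays, built to address the weakness of current models in handling longitudinal change in chest imaging. It contains paired prior-current chest X-rays with fine-grained temporal and spatial annotations and supports a five-class disease-progression taxonomy. The authors also construct a weakly supervised dataset of 280K image pairs for evaluating temporal reasoning and progression classification; results show that existing models remain clearly limited in temporal reasoning and spatial grounding.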

Abstract

Chest radiograph interpretation requires temporal reasoning over prior and current studies, yet most vision-language models are trained on static image-report pairs and lack explicit supervision for modeling longitudinal change. We introduce CheXTemporal, a dataset for temporally grounded reasoning in chest radiography consisting of paired prior-current chest X-rays (CXR) with finding-level temporal and spatial annotations. The dataset includes a five-class progression taxonomy (new, worse, stable, improved, resolved), localized spatial supervision of pathology, explicit spatial-temporal alignment across paired studies, and multi-source coverage for cross-domain evaluation. We additionally construct a 280K-pair silver dataset with automatically derived temporal and anatomical supervision for large-scale evaluation under weaker supervision. Using these resources, we evaluate multiple state-of-the-art vision-language CXR models on grounding and progression-classification tasks in a zero-shot setting. Across both gold and silver evaluations, current models exhibit consistent limitations in spatial grounding, fine-grained temporal reasoning, and robustness under distribution shift. In particular, models perform substantially better on salient progression categories such as worse than on temporally subtle states such as stable and resolved, suggesting limited modeling of longitudinal disease evolution in chest radiography.

2605.11303 2026-05-13 cs.CL

Predicting Psychological Well-Being from Spontaneous Speech using LLMs

Erfan Loweimi, Sofia de la Fuente Garcia, Saturnino Luz

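AI Summary This study examines whether large language models (LLMs) can predict Ryff Psychological Well-Being (PWB) scores from spontaneous speech in a zero-shot setting. Using voice recordings from 111 participants in the PsyVoiD database, it evaluates 12 instruction-tuned LLMs, including Llama-3, Mistral, Gemma, and Phi-4, with a domain-informed prompt designed together with experts in clinical psychology and linguistics. Results show that LLMs can extract semantic information from speech, reaching Spearman correlations of up to 0.8 on 80% of the data, and statistical and keyword word-cloud analyses make the predictions more interpretable.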

Abstract

We investigate the use of Large Language Models (LLMs) for zero-shot prediction of Ryff Psychological Well-Being (PWB) scores from spontaneous speech. Using a few minutes of voice recordings from 111 participants in the PsyVoiD database, we evaluated 12 instruction-tuned LLMs, including Llama-3 (8B, 70B), Ministral, Mistral, Gemma-2-9B, Gemma-3 (1B, 4B, 27B), Phi-4, DeepSeek (Qwen and Llama), and QwQ-Preview. A domain-informed prompt was developed in collaboration with experts in clinical psychology and linguistics. Results show that LLMs can extract semantically meaningful cues from spontaneous speech, achieving Spearman correlations of up to 0.8 on 80% of the data. Additionally, to enhance explainability, we conducted statistical analyses to characterise prediction variability and systematic biases, alongside keyword-based word cloud analyses to highlight the linguistic features driving the models' predictions.

2605.11301 2026-05-13 cs.AI cs.CL cs.CV

LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

Xueqi Cheng, Yushun Dong

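AI Summary This paper proposes LatentRouter, a routing method that selects the most suitable multimodal large language model based on the characteristics of an image-question input. It builds multimodal routing capsules and model capability tokens, uses communication between these latent states to predict how each candidate model would perform, and improves accuracy with a distributional outcome head and a bounded capsule correction. Experiments show that LatentRouter outperforms existing methods on multiple benchmarks, especially on tasks that depend on visual, layout, or reasoning abilities.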

Abstract

Multimodal large language models (MLLMs) have heterogeneous strengths across OCR, chart understanding, spatial reasoning, visual question answering, cost, and latency. Effective MLLM routing therefore requires more than estimating query difficulty: a router must match the multimodal requirements of the current image-question input with the capabilities of each candidate model. We propose LatentRouter, a router that formulates MLLM routing as counterfactual multimodal utility prediction. Given an image-question query, LatentRouter extracts learned multimodal routing capsules, represents each candidate MLLM with a model capability token, and performs latent communication between these states to estimate how each model would perform if selected. A distributional outcome head predicts model-specific counterfactual quality, while a bounded capsule correction refines close decisions without allowing residual signals to dominate the prediction. The resulting utility-based policy supports performance-oriented and performance-cost routing, and handles changing candidate pools through shared per-model scoring with availability masking. Experiments on MMR-Bench and VL-RouterBench show that LatentRouter outperforms fixed-model, feature-level, and learned-router baselines. Additional analyses show that the gains are strongest on multimodal task groups where model choice depends on visual, layout-sensitive, or reasoning-oriented requirements, and that latent communication is the main contributor to the improvement. The code is available at: https://github.com/LabRAI/LatentRouter.
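The final decision step described above, a utility-based policy with availability masking, can be sketched in a few lines. The sketch assumes the per-model quality predictions are already given (the learned predictor itself is not reproduced here) and that utility takes the simple form quality minus a weighted cost; the utility form, the function name, the model names, and all numbers are hypothetical.

```python
def route(pred_quality, cost, available, cost_weight=0.0):
    """Pick the available model with the highest predicted utility.

    pred_quality: model name -> predicted quality for this query.
    cost:         model name -> cost of calling the model.
    available:    model name -> whether the model can currently be used.
    cost_weight:  0.0 gives performance-oriented routing; larger values
                  trade quality for lower cost.
    """
    best_name, best_utility = None, float("-inf")
    for name, quality in pred_quality.items():
        if not available.get(name, False):
            continue  # availability masking: skip models not in the pool
        utility = quality - cost_weight * cost[name]
        if utility > best_utility:
            best_name, best_utility = name, utility
    if best_name is None:
        raise ValueError("no candidate model is available")
    return best_name


# Made-up scores for a single query.
quality = {"large": 0.9, "medium": 0.8, "small": 0.6}
cost = {"large": 1.0, "medium": 0.3, "small": 0.1}
everyone = {"large": True, "medium": True, "small": True}

performance_choice = route(quality, cost, everyone)
cost_aware_choice = route(quality, cost, everyone, cost_weight=0.5)
masked_choice = route(quality, cost, {**everyone, "large": False})
print(performance_choice, cost_aware_choice, masked_choice)
```

Scoring each model independently and masking at decision time is what lets the candidate pool change without retraining the scorer.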

2605.11300 2026-05-13 cs.CV

Can Graphs Help Vision SSMs See Better?

Dhruv Parikh, Anvitha Ramachandran, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna

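AI Summary This paper studies how graph structure can improve vision state space models (Vision SSMs) and proposes GraphScan, a graph-based dynamic scanning operator. For each visual token, GraphScan builds a local graph, learns feature-based affinities, and produces the output token through one step of message passing over its semantic neighborhood, so tokens are locally and semantically aligned before global state-space mixing. Experiments show that GraphScan-Mamba, which integrates GraphScan, achieves state-of-the-art performance among Vision SSMs on multiple vision tasks with small computational overhead, offering a semantics-oriented view of scanning for future Vision SSMs.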

Comments Technical Report

Abstract

Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.

2605.11296 2026-05-13 cs.RO cs.SY eess.SY

Computational Design of a Low-Visibility UAV Using a Human-Aligned Perceptual Metric

Jingxian Wang, Chen Yu, David Matthews, Emma Alexander, Sam Kriegman, Michael Rubenstein

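AI Summary This paper presents Phantom Twist, a single-propeller UAV design that achieves low visibility through high-speed spinning and motion blur. The authors build a two-stage automated design pipeline that optimizes the placement of functional components under flight-stability requirements, using a human-aligned perceptual metric (LPIPS) as the optimization objective. Experiments confirm that the resulting UAVs are stable and controllable and are significantly less visually perceptible than conventional quadcopters.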

Comments Accepted by RSS 2026

Abstract

We introduce Phantom Twist, a type of single-propeller UAV designed to achieve low visibility through high-speed spinning and the exploitation of motion blur. We develop a two-stage automated design pipeline that optimizes the placement of functional components including batteries, control PCB, motor-propeller assembly, and counterweights. The pipeline minimizes visibility as measured by a human-aligned perceptual metric (LPIPS) while strictly satisfying inertial and aerodynamic constraints required for stable flight. We validate this approach through fabrication and flight testing of multiple prototypes. These tests confirm that our pipeline produces stable, controllable designs and that the optimized UAV exhibits significantly reduced visual perceptibility compared to conventional quadcopters.

2605.11291 2026-05-13 cs.LG

Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets

Thuan Nguyen, Shuchin Aeron, D. Richard Brown, Prakash Ishwar

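AI Summary This paper studies the geometry of optimal representations in contrastive learning (CL) on class-imbalanced datasets. The authors prove that, under class imbalance, the optimal representations of all samples from the same class collapse to the class mean and exhibit an angular symmetry structure determined by the class proportions. When the imbalance exceeds a threshold, "minority collapse" occurs: all samples from the minority classes collapse into a single vector. The paper also gives a convex optimization problem that determines the geometry of the optimal representations and validates the theory with numerical experiments.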

Comments 28 pages, 2 figures

Abstract

In this paper, we provide a computable characterization of the geometry of optimal representations in Contrastive Learning (CL) when the classes are imbalanced. When classes are balanced and the representation dimension is greater than the number of classes, it is well-known that the optimal representations exhibit Neural Collapse (NC), i.e., representations from the same class collapse to their class means and the class means form an Equiangular Tight Frame (ETF). For imbalanced classes and a large, generalized family of CL losses, we prove that the optimal representations of all samples from the same class collapse to their class means and their geometry exhibits an angular symmetry structure that is determined by the relative class proportions. In general, we show that the geometry can be determined by solving a convex optimization problem. Exploiting this symmetry structure, we analytically investigate a special case where class imbalance is extreme and prove that CL exhibits a phenomenon called Minority Collapse (MC) where all samples from the minority classes (classes with small probabilities) collapse into a single vector, whenever the class imbalance exceeds a threshold, which in turn depends on the regularity properties of the CL loss used and on the number of negative samples. Numerical results are provided to illustrate these phenomena and corroborate the theoretical results. We conclude by identifying a number of open problems.
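For readers unfamiliar with the balanced-case geometry mentioned above, the sketch below builds the standard simplex Equiangular Tight Frame for $K$ classes, $m_k = \sqrt{K/(K-1)}\,(e_k - \tfrac{1}{K}\mathbf{1})$, and checks its two defining properties: unit norms and a common pairwise cosine of $-1/(K-1)$. It illustrates only the balanced Neural Collapse geometry, not the imbalanced geometry the paper characterizes; the function names are illustrative.

```python
import math


def simplex_etf(num_classes):
    """Rows are the class means of a simplex ETF embedded in R^K."""
    k = num_classes
    if k < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(k / (k - 1))
    # m_i = scale * (e_i - (1/K) * ones)
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


means = simplex_etf(4)
norms = [math.sqrt(dot(m, m)) for m in means]
# The vectors have unit norm, so inner products equal cosines.
cosines = [dot(means[i], means[j]) for i in range(4) for j in range(i + 1, 4)]
print(norms, cosines)
```

With four classes every pair of class means sits at cosine $-1/3$, the most spread-out equiangular arrangement; the abstract's result describes how this symmetry deforms once class proportions differ.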

2605.11290 2026-05-13 cs.CL cs.AI

ReAD: Reinforcement-Guided Capability Distillation for Large Language Models

Xueqi Cheng, Xugui Zhou, Tyler Derr, Yushun Dong

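AI Summary This paper proposes ReAD, a reinforcement-guided capability distillation framework that compresses large language models more effectively under a fixed token budget while preserving the capabilities that matter for downstream tasks. It identifies task-essential capabilities, generates targeted supervision on the fly, and uses an uncertainty-aware contextual bandit to allocate the budget, improving task performance while reducing negative interference between capabilities and wasted resources. Experiments show that ReAD outperforms existing methods under the same budget and is more practical and efficient.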

Abstract

Capability distillation applies knowledge distillation to selected model capabilities, aiming to compress a large language model (LLM) into a smaller one while preserving the abilities needed for a downstream task. However, most existing methods treat capabilities as independent training targets and overlook how improving one capability can reshape the student's broader capability profile, especially when multiple abilities jointly determine task success. We study capability distillation under a fixed token budget and identify two consistent patterns: distillation induces systematic, budget-dependent cross-capability transfer, and additional budget often brings limited task-relevant gains while sometimes degrading other useful abilities. Building on these insights, we propose ReAD, a Reinforcement-guided cApability Distillation framework that explicitly accounts for capability interdependence. ReAD first infers task-essential capabilities, then generates capability-targeted supervision on the fly, and finally uses an uncertainty-aware contextual bandit to adaptively allocate the distillation budget based on expected utility gains. Extensive experiments show that ReAD improves downstream utility under the same token budget while reducing harmful spillover and wasted distillation effort compared to strong baselines. Our code is publicly available at https://github.com/LabRAI/ReAD.

2605.11289 2026-05-13 cs.LG math.OC

Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning

Ege C. Kaya, Aliasghar Pourghani, Vijay Gupta, Abolfazl Hashemi

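AI Summary This paper studies distributional reinforcement learning in the average-reward setting. Because the bias is defined only up to an additive constant, distributional analogues are hard to define directly on the real line; the authors propose a representation based on a quotient space and a categorical parameterization that handles the translation invariance of state-indexed bias laws. They define a projected average-reward distributional operator, prove that it is well defined, non-expansive, and admits fixed points, analyze the convergence of sampled recursions, and introduce an online estimator for the case of unknown gain, with stability and convergence guarantees.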

Comments 29 pages, 4 figures

Abstract

Average-reward reinforcement learning requires estimating the gain and the bias, which is defined only up to an additive constant. This makes direct distributional analogues ill-posed on the real line. We introduce a quotient-space formulation in which state-indexed bias laws are identified up to a common translation, together with a categorical parameterization that respects this symmetry. On this quotient-categorical space, we define a projected average-reward distributional operator and show that it is well-defined, non-expansive in a coordinate Cramér metric, and admits fixed points. We then study sampled recursions whose mean-field maps are asynchronous relaxations of this operator. In an idealized centered-reward setting, a one-state temporal-difference update enjoys almost sure convergence together with finite-iteration residual bounds under both i.i.d. and Markovian sampling. When the gain is unknown, we augment the recursion with an online gain estimator, and prove non-expansiveness and Markovian convergence of the resulting coupled scheme. Finally, we show that synchronous exact updates are gain-independent at the quotient-law level, isolating a structural contrast between ideal quotient distributions and practical fixed-grid categorical representations.
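The translation symmetry at the heart of the quotient construction can be illustrated with the ordinary Cramér distance between categorical distributions on a shared, evenly spaced grid, taken here as the $\ell_2$ distance between their CDFs. The sketch assumes this standard definition (the paper's "coordinate Cramér metric" may differ in detail) and shows that shifting two distributions by the same whole number of grid steps leaves their distance unchanged; the helper names and example distributions are hypothetical.

```python
import math
from itertools import accumulate


def cramer_distance(p, q, spacing=1.0):
    """l2 distance between the CDFs of two categorical distributions
    supported on the same evenly spaced grid."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same grid")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return math.sqrt(spacing * sum((a - b) ** 2 for a, b in zip(cdf_p, cdf_q)))


def shift_right(p, steps):
    """Translate a distribution by `steps` atoms; requires the last
    `steps` atoms to be empty so no mass falls off the grid."""
    if any(p[len(p) - steps:]):
        raise ValueError("shift would push mass off the grid")
    return [0.0] * steps + p[: len(p) - steps]


p = [0.2, 0.5, 0.3, 0.0, 0.0]
q = [0.1, 0.4, 0.5, 0.0, 0.0]

d_before = cramer_distance(p, q)
d_after = cramer_distance(shift_right(p, 2), shift_right(q, 2))
print(d_before, d_after)
```

The invariance holds here only for shifts that are whole multiples of the grid spacing and keep all mass on the grid, which hints at the contrast the abstract draws between ideal quotient distributions and practical fixed-grid categorical representations.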