arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08388 2026-05-12 cs.AI

PLACO: A Multi-Stage Framework for Cost-Effective Performance in Human-AI Teams

Pranavkumar Mallela, Vinay Kumar, Shashi Shekhar Jha, Shweta Jain

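AI summary: This paper proposes PLACO, a multi-stage framework for improving the cost-effective performance of human-AI teams. The approach concerns fusing human and model outputs in classification tasks: using Bayes' rule, and assuming that human and model outputs are conditionally independent given the ground-truth label, a deterministic label (from the human) is combined with probabilistic labels (from the model). The core contribution is described as a more efficient and more practical label-fusion method that improves overall system performance.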

Human-AI teams play a pivotal role in improving overall system performance when neither the human nor the model can achieve such performance on their own. With the advent of powerful and accessible Generative AI models, several mundane tasks have morphed into Human-AI team tasks. From writing essays to developing advanced algorithms, humans have found that using AI assistance has led to an accelerated work pace like never before. In classification tasks, where the final output is a single hard label, it is crucial to address the combination of human and model output. Prior work elegantly solves this problem using Bayes rule, using the assumption that human and model output are conditionally independent given the ground truth. Specifically, it discusses a combination method to combine a single deterministic labeler (the human) and a probabilistic labeler (the classifier model) using the model's instance-level and the human's class-level calibrated probabilities.
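The Bayes-rule combination described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the two-class example, and the confusion-matrix values are assumptions. Under conditional independence given the true class, the posterior is proportional to the human's class-level likelihood times the model's instance-level probability.

```python
def combine_human_and_model(model_probs, human_label, confusion):
    """Posterior over true classes from one human label and model probabilities.

    model_probs[y]  : model's calibrated probability that the true class is y
    human_label     : index of the class the human chose
    confusion[y][h] : estimated P(human says h | true class is y)
    """
    # Conditional independence given the truth:
    #   P(y | human, model) is proportional to P(human | y) * P(y | model)
    unnormalized = [confusion[y][human_label] * p for y, p in enumerate(model_probs)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("human label and model probabilities are incompatible")
    return [u / total for u in unnormalized]


# Hypothetical two-class example (all numbers are made up for illustration).
model_probs = [0.6, 0.4]
confusion = [[0.9, 0.1],   # true class 0: human says 0 with prob. 0.9
             [0.2, 0.8]]   # true class 1: human says 1 with prob. 0.8
posterior = combine_human_and_model(model_probs, human_label=1, confusion=confusion)
final_label = max(range(len(posterior)), key=posterior.__getitem__)
```

In this example the model weakly prefers class 0, but the human's label for class 1 carries more evidence, so the combined hard label is class 1.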

2605.08386 2026-05-12 cs.AI

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

Yongliang Miao, Ziyang Yu, Liang Zhao, Bowen Zhu, Hasibul Haque

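AI summary: SkillLens is an adaptive multi-granularity skill-reuse framework for improving the efficiency of LLM agents. It builds a four-layer skill graph of policies, strategies, procedures, and primitives, and retrieves and adapts skills at mixed granularity, reducing computational cost while preserving relevance. Through semantic-relevance retrieval, graph-traversal expansion, and verifier decisions, SkillLens reuses compatible subskills directly and modifies only locally mismatched parts, improving the efficiency and accuracy of task execution. Experiments show that SkillLens outperforms existing methods on multiple benchmarks, improving both task success rate and localization accuracy.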

Skill libraries have become a practical way for LLM agents to reuse procedural experience across tasks. However, existing systems typically treat skills as flat, single-resolution prompt blocks. This creates a tension between relevance and cost: injecting coarse skills can introduce irrelevant or misleading context, while rewriting entire skills is expensive and often unnecessary. We propose SkillLens, a hierarchical skill-evolution framework that organizes skills into a four-layer graph of policies, strategies, procedures, and primitives, and retrieves them at mixed granularity. Given a task, SkillLens first retrieves semantically relevant skill seeds, expands them through degree-corrected random walk over the skill graph, and then uses a verifier to decide whether each visited unit should be accepted, decomposed, rewritten, or skipped. This enables the agent to reuse compatible subskills directly while adapting only locally mismatched components. To improve the system over time, SkillLens further refines multi-granularity skills and verifier in order to improve its routing decisions. We provide theoretical analysis showing that mixed-granularity adaptation incurs sublinear cost under sparse mismatch assumptions and that the evolutionary update rule monotonically improves the validation objective until a local optimum. Across MuLocbench and ALFWorld, SkillLens consistently improves over strong skill-based baselines, achieving up to a 6.31 percentage-point Acc@1 gain for bug localization and raising agent success rate from 45.00% to 51.31%.

2605.08383 2026-05-12 cs.CL

Change My View? The Dynamics of Persuasion and Polarization in Online Discourse

David Freeborn, Malihe Alikani, Anthony Sicilia

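AI summary: This paper studies the dynamics of persuasion and polarization in online debate. Analyzing discussions from Reddit's r/ChangeMyView forum, it examines which rhetorical strategies are more likely to produce a change of view. Large language models are used to forecast the likelihood of a change of view, and replies are coded for ten rhetorical strategies through a hybrid machine-assisted procedure. Strategies such as concession and empathy substantially increase the likelihood of belief change, whereas frontal refutation and credibility attacks reduce it. The study concludes that effective public argumentation depends on relational framing as well as evidential content.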

Philosophical accounts of persuasion often assume that shared evidence and rational argumentation should lead to a convergence of views between peers, yet everyday discourse often suggests otherwise. In this study, we use large language models to analyze a corpus of debates on Reddit's r/ChangeMyView, where belief revision is publicly signaled. Large language models were asked, halfway through each discussion, to forecast whether such an acknowledgement would arise; their probabilistic estimates serve as a conversational baseline. Each reply was then coded, through a hybrid machine-assisted procedure, for ten familiar rhetorical strategies -- concession, empathy, logical challenge, credibility appeals, and so forth. Adding these strategic features markedly improves predictive power and yields a consistent pattern: moves that express concession or empathetic alignment substantially increase the prospect of belief change, whereas frontal refutation, credibility attacks, and topic deflection diminish it. The findings indicate that effective public reasoning depends as much on relational framing as on evidential content, and they invite a refinement of normative accounts of rational dialogue.

2605.08377 2026-05-12 cs.LG stat.ML

Embedding Dimension Lower Bounds for Universality of Deep Sets and Janossy Pooling

Ali Syed, Aditya Nambiar, Jonathan W. Siegel

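AI summary: This paper studies the universality of deep neural network architectures that build in permutation symmetry on point clouds, focusing on lower bounds on the embedding dimension required by Deep Sets and Janossy pooling. Using a novel technique, the authors prove new lower bounds on the embedding dimension needed to guarantee universality. For Deep Sets, the result gives the correct minimal embedding dimension, up to a constant factor, for all $d > 1$; for $k$-ary Janossy pooling, it is the first non-trivial lower bound for $k > 1$.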

In many practical applications it is important to build symmetries into neural network architectures. Consider the important case of permutation symmetry on point clouds consisting of $n$ points in $d$ dimensions. In this case the network learns a function on a set of $n$ points in $\mathbb{R}^d$, and a natural paradigm for constructing invariant networks is Janossy pooling, which generalizes the popular Deep Sets architecture. We study the universality of this approach, in particular the important question of how large the embedding dimension must be to guarantee universality of this architecture. Specifically, using a novel technique, we prove new lower bounds on the required size of this embedding dimension. For Deep Sets, this gives the correct minimal dimension up to a constant factor for all $d > 1$. For $k$-ary Janossy pooling, we prove the first non-trivial lower bound on the required embedding dimension when $k > 1$.
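For readers unfamiliar with the architectures, the pooling step can be sketched as below. This is an illustrative toy, not the paper's construction: `phi_single` and `phi_pair` are hypothetical hand-written embeddings standing in for learned networks, and their output length plays the role of the embedding dimension whose required size the paper bounds. Summing an embedding over all ordered $k$-tuples of distinct points gives a permutation-invariant feature; $k = 1$ is the Deep Sets case.

```python
import itertools
import math


def janossy_pool(points, phi, k):
    """Sum phi over all ordered k-tuples of distinct points (k=1 is Deep Sets pooling)."""
    pooled = None
    for tup in itertools.permutations(points, k):
        embedding = phi(tup)
        if pooled is None:
            pooled = [0.0] * len(embedding)
        for i, value in enumerate(embedding):
            pooled[i] += value
    return pooled if pooled is not None else []


def phi_single(tup):
    # Hypothetical per-point embedding of a 2-D point into R^3.
    ((x, y),) = tup
    return [x, y, x * y]


def phi_pair(tup):
    # Hypothetical embedding of an ordered pair of 2-D points into R^2.
    (x1, y1), (x2, y2) = tup
    return [x1 * y2, (x1 - x2) ** 2]


points = [(1.0, 2.0), (0.5, -1.0), (3.0, 0.25)]
shuffled = [points[2], points[0], points[1]]

# Reordering the input set leaves the pooled feature unchanged (up to rounding).
same_k1 = all(math.isclose(a, b, abs_tol=1e-9)
              for a, b in zip(janossy_pool(points, phi_single, 1),
                              janossy_pool(shuffled, phi_single, 1)))
same_k2 = all(math.isclose(a, b, abs_tol=1e-9)
              for a, b in zip(janossy_pool(points, phi_pair, 2),
                              janossy_pool(shuffled, phi_pair, 2)))
```

A full network would apply a further map to the pooled vector; that step is omitted here because it does not affect invariance.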

2605.08376 2026-05-12 cs.CV

UIESNN: A Scale-Aware Spiking Network for Underwater Image Enhancement

Shuang Chen, Ruochen Li, Zihan Zhu, Ronald Thenius, Farshad Arvin, Amir Atapour-Abarghouei

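AI summary: This paper proposes UIESNN, a scale-aware spiking neural network for underwater image enhancement, targeting the large-scale, low-frequency degradations in underwater images such as wavelength-dependent colour casts and scattering-induced veiling. Its core component is the Multi-scale Pooling LIF Block (MPLB), which injects multi-scale pooling responses into the membrane-potential dynamics, enlarging the receptive field while preserving detail and inducing heterogeneous, scale-dependent activations. The spiking residual architecture built on MPLB combines frequency decomposition and attention-based refinement in a fully spike-driven pipeline. Experiments show that UIESNN achieves state-of-the-art performance among SNN-based methods on multiple benchmark datasets, with improved colour fidelity and spatial coherence.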

Underwater image enhancement (UIE) is a practically important yet underexplored application of spiking neural networks (SNNs), where the dominant degradations are large-scale and low-frequency, such as wavelength-dependent colour casts and scattering-induced veiling. Existing SNN restoration designs rely on locally bounded spiking perception, which can limit global correction and lead to saturated or inconsistent representations. To address these challenges, we propose a scale-aware SNN framework for UIE named UIESNN. At its core is a Multi-scale Pooling LIF Block (MPLB) that injects hierarchical multi-scale pooling responses into membrane dynamics, thereby enlarging the effective receptive field while preserving fine-grained details and inducing heterogeneous scale-dependent activations. Building on MPLB, we design a spiking residual architecture that integrates frequency decomposition and attention-based refinement in a fully spike-driven pipeline. Extensive experiments on the EUVP and LSUI benchmarks demonstrate that UIESNN achieves state-of-the-art performance among SNN-based methods, delivering improved colour fidelity and spatial coherence with competitive energy cost.

2605.08373 2026-05-12 cs.CV cs.AI

NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution

M. Moein Esfahani, Sepehr Salem Ghahfarokhi, Mohammed Alser, Jingyu Liu, Vince Calhoun

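AI summary: This paper proposes NeuroGAN-3D, a 3D generative super-resolution model that increases the resolution of resting-state fMRI (rs-fMRI) spatial maps so that functional brain networks can be characterized more precisely. Built on a generative adversarial network architecture, the model enhances the spatial detail of functional brain maps and significantly outperforms a conventional baseline. The work offers a finer-grained imaging tool for studying the relationship between brain structure and function and the mechanisms of related diseases.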

Comments Accepted in ICCABS 2026: The 14th International Conference on Computational Advances in Bio and Medical Sciences

Recent advances in neuroimaging have deepened our understanding of the brain's complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI) - particularly resting-state fMRI (rs-fMRI) - has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.

2605.08371 2026-05-12 cs.CV

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

Haotang Li, Zhenyu Qi, Shaohan Henry Wang, Kebin Peng, Zi Wang, Qing Guo, Sen He, Huanrui Yang

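AI summary: This paper proposes PaceVGGT, a pre-Alternating-Attention (AA) token pruning framework that accelerates the Visual Geometry Transformer (VGGT) on long-sequence 3D tasks. A lightweight token scorer is added to a frozen VGGT and prunes DINO patch tokens before the first AA block, shortening the input sequence. Experiments on ScanNet-50 and 7-Scenes show that PaceVGGT reduces inference latency while preserving reconstruction quality.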

Visual Geometry Transformer (VGGT) is a strong feed-forward model for multiple 3D tasks, but its Alternating-Attention (AA) stack scales quadratically in the total token count, making long clips expensive. Existing token-reduction accelerators operate inside AA, leaving the patch grid that enters AA uncompressed. We introduce PaceVGGT, a pre-AA token pruning framework that prunes DINO patch tokens before the first AA block of a frozen VGGT. PaceVGGT trains a lightweight Token Scorer that estimates per-token importance from DINO features. The scorer is first distilled against an AA-internal attention target from the unpruned backbone, then refined under downstream camera, depth, and point-map losses. A per-frame keep budget fixes the backbone-visible sequence length, while an importance-adaptive merge/prune assignment preserves residual content from high-saliency frames under a fixed total merge budget. A Feature-guided Restoration module reconstructs the dense spatial grid required by the prediction heads. On ScanNet-50 and 7-Scenes, PaceVGGT remains on the reconstruction quality--latency frontier while reducing inference latency. On ScanNet-50, it reduces latency by \(5.1\times\) over unmodified VGGT at \(N=300\) and \(1.47\times\) over LiteVGGT at \(N=1000\). These results identify pre-AA pruning as a viable acceleration route for frozen VGGT-style geometry transformers.
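The per-frame keep budget can be illustrated with a minimal sketch. It assumes importance scores are already available and shows only top-k selection per frame; the function names are hypothetical, and the learned Token Scorer, the merge/prune assignment, and the restoration module are not modelled. The cost ratio uses the quadratic scaling in total token count that the abstract attributes to the AA stack.

```python
def prune_tokens_per_frame(scores, keep):
    """Keep the `keep` highest-scoring tokens in every frame.

    scores[f][i] is the importance of token i in frame f.
    Returns, per frame, the kept token indices in their original order.
    """
    kept = []
    for frame_scores in scores:
        ranked = sorted(range(len(frame_scores)),
                        key=lambda i: frame_scores[i], reverse=True)
        kept.append(sorted(ranked[:keep]))
    return kept


def attention_cost_ratio(kept_tokens, total_tokens):
    """Relative cost of attention that scales quadratically in token count."""
    return (kept_tokens / total_tokens) ** 2


# Hypothetical scores for 2 frames of 4 patch tokens each.
scores = [[0.1, 0.9, 0.5, 0.3],
          [0.7, 0.2, 0.8, 0.4]]
kept = prune_tokens_per_frame(scores, keep=2)
ratio = attention_cost_ratio(kept_tokens=sum(len(k) for k in kept),
                             total_tokens=sum(len(s) for s in scores))
```

Because every frame keeps the same number of tokens, the sequence length seen by the backbone is fixed at `frames * keep` regardless of content.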

2605.08368 2026-05-12 cs.AI cond-mat.stat-mech cs.LG

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective

Yuhao Li, Shengchao Liu

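AI summary: This paper uses a free-energy perspective to distinguish capability elicitation from capability creation in LLM post-training. It argues that the distinction between post-training methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) is not the essential one; what matters is whether training reweights behaviors already within the model's reach or expands the space of behaviors the model can produce. The authors introduce the notion of accessible support to separate these two mechanisms, and conclude that the central question is not whether SFT or RL is used but whether training expands the model's behavioral boundary.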

Debates about large language model post-training often treat supervised fine-tuning (SFT) as imitation and reinforcement learning (RL) as discovery. But this distinction is too coarse. What matters is whether a training procedure increases the probability of behaviors the pretrained model could already produce, or whether it changes what the model can practically reach. We argue that post-training research should distinguish between capability elicitation and capability creation. We make this distinction operational by introducing the notion of accessible support: the set of behaviors that a model can practically produce under finite budgets. Post-training that reweights behaviors within this support is capability elicitation; whereas changing the support itself corresponds to capability creation. We develop this argument through a free-energy view of post-training. SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
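The reweighting picture can be made concrete with a small sketch. The Gibbs form used here, new probability proportional to reference probability times `exp(-energy / temperature)`, is an assumption chosen for illustration; the abstract does not state the paper's exact formula, and the behavior names and numbers are made up. The point it shows is that reweighting cannot give mass to a behavior the reference model never produces.

```python
import math


def reweight(reference, energy, temperature=1.0):
    """Reweight a reference distribution by an energy (assumed Gibbs form).

    reference[b] : probability of behavior b under the pretrained model
    energy[b]    : lower energy means the external signal prefers b
    """
    weights = {b: p * math.exp(-energy[b] / temperature)
               for b, p in reference.items()}
    normalizer = sum(weights.values())
    return {b: w / normalizer for b, w in weights.items()}


# Hypothetical behaviors: "c" has the lowest energy but zero reference mass.
reference = {"a": 0.7, "b": 0.3, "c": 0.0}
energy = {"a": 2.0, "b": 0.0, "c": -5.0}
updated = reweight(reference, energy)
```

Here "b" overtakes "a" (elicitation within the support), while "c" stays at zero however strongly the signal favors it; reaching "c" would require changing the support itself.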

2605.08366 2026-05-12 cs.LG cs.SE

SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

Mohit Raghavendra, Soham Dan, Miguel Romero Calvo, Yannis Yiming He, Johannes Baptist Mols, Gautam Anand, Cole McCollum, Edgar Arakelyan, Vijay Bharadwaj, Andrew Park, Jeff Da, MohammadHossein Rezaei, Bing Liu, Brad Kenstler, Yunzhong He

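AI summary: This paper introduces SWE Atlas, a benchmark suite for evaluating coding agents across three professional software engineering workflows: codebase Q&A, test writing, and refactoring. Unlike earlier SWE benchmarks, it targets task categories that matter in practice but have received little attention, and evaluates them in a way closer to real-world use, assessing both functional correctness and software engineering quality. Experiments show that although top models perform strongly on some tasks, they still fall short on complex runtime analysis and adherence to best practices.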

Comments 10 pages

We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.

2605.08360 2026-05-12 cs.AI

Embeddings for Preferences, Not Semantics

Carter Blair, Ariel D. Procaccia, Milind Tambe

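AI summary: This paper studies how to embed free-text opinions in a vector space to support preference modeling in collective decision-making. Conventional text embeddings capture semantic similarity, whereas collective decision-making needs preferential similarity: the distance between a participant and a piece of text should reflect how much the participant agrees with it. The authors show that existing embeddings are biased because semantic and preference signals are confounded, and that training data designed to break this correlation significantly improves preference prediction.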

Comments 28 pages

Modern AI is opening the door to collective decision-making in which participants express their views as free-form text rather than voting on a fixed set of candidates. A natural idea is to embed these opinions in a vector space so that the substantial literature on facility location problems and fair clustering can be brought to bear. But standard text embeddings measure semantic similarity, whereas distances in facility location problems and fair clustering require what we call \textit{preferential similarity}: a participant's agreement with a piece of text should be inversely related to their distance from it. Off-the-shelf embeddings inherit a coarse preference signal through a correlation between semantic and preferential similarity, but fail to capture preferences when the correlation breaks. We formalize this as an invariance problem: text embedding models encode both a preference-relevant signal (stance and values) and semantic nuisance (style and wording), and the two are observationally correlated, so a geometry that relies on nuisance can appear preference-correct even when it is not. We show that synthetic training data designed to break this correlation provably shifts the optimal scorer away from nuisance-dominated cosine and significantly improves preference prediction across 11 online deliberation datasets.

2605.08354 2026-05-12 cs.AI

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Juanxi Tian, Fengyuan Liu, Jiaming Han, Yilei Jiang, Yongliang Wu, Yesheng Liu, Haodong Li, Furong Xu, Wanhua Li

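AI summary: This work addresses the alignment of multimodal generative models with human preferences. It proposes Auto-Rubric as Reward (ARR), a framework that converts implicit preference structure into explicit, interpretable rubrics, making reward signals more reliable and interpretable. The method externalizes a vision-language model's preference knowledge as concrete rubric criteria and introduces Rubric Policy Optimization (RPO) to optimize the generation policy, giving more stable and efficient learning during generative training. Experiments on text-to-image generation and image editing show that the method outperforms conventional reward models, supporting the value of explicit, structured rubrics for multimodal alignment.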

Comments 28 pages, 10 figures, 11 tables

Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR's structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.

2605.08348 2026-05-12 cs.CL

How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits

Michael Li, Nishant Subramani

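AI summary: This study examines the consistency and task specificity of circuits in language models. It finds that circuit components recur heavily within a task and that these components are essential to task performance. However, circuits for different tasks overlap substantially, so the circuits found so far are not clearly task-specific, which raises the question of whether circuits can support targeted understanding of and intervention on models. Experiments across six tasks and seven models show both how widespread the causal importance of circuit components is and where the approach falls short.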

The circuits framework in mechanistic interpretability aims to identify causally important sparse subgraphs of model components, typically evaluated by measuring necessity and sufficiency. We measure circuit reuse, the proportion of components shared across per-example circuits within a task, and investigate two less-studied properties of this: consistency, the recurrence of components within a task, and specificity, their uniqueness to a task. Using edge attribution patching across six tasks and seven models, we find that within-task reuse is high and that shared components are necessary for task performance, with ablations causing up to $\sim$100% relative accuracy drops. However, circuits turn out not to be task-specific: ablating one task's circuit damages another task's performance about as much as that task's own circuit does. We discover that this is due to substantial overlap between circuits across tasks, which are causally important for performance. Some circuits do contain a smaller set of task-specific components, but these account for only a modest portion of circuit performance. Overall, our findings suggest that while circuit discovery at the level of attention heads and MLP layers identifies important components, their lack of task-specificity raises questions about the degree to which circuits can support targeted understanding and intervention on model behavior.

2605.08346 2026-05-12 cs.CL cs.AI

Sanity Checks for Long-Form Hallucination Detection

Geigh Zollicoffer, Minh Vu, Hongli Zhan, Raymond Li, Manish Bhattarai

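AI summary: This paper questions the validity of long-form hallucination detection methods for large language models, noting that existing methods may rely on surface features of the final answer rather than on the reasoning process itself. The authors propose a controlled-invariance methodology with two tests, Force and Remove, that distinguish whether a detector judges from the structure of the reasoning or from answer cues. They further show that, once answer-related artifacts are controlled for, TRACT, a lightweight detector built on lexical trajectory features, stays robust while performing no worse than existing, more complex models. The core challenge in current hallucination detection is therefore isolating the useful reasoning signal from final-answer cues.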

Hallucination detection methods for large language models increasingly operate on chain-of-thought reasoning traces, yet it remains unclear whether they evaluate the reasoning itself or merely exploit surface correlates of the final answer. We introduce a controlled-invariance methodology that exposes this distinction through two oracle tests: \textsc{Force}, which replaces each response's final answer with the ground truth while preserving the reasoning trace, and \textsc{Remove}, which strips answer-announcement steps while leaving the trajectory intact. This reveals if their predictive power derives from answer-level artifacts rather than from the structure or validity of intermediate reasoning. We further show that once these artifacts are controlled for, effective detection does not necessarily require complex learned representations: TRACT, a lightweight scorer built on lexical trajectory features (hedging trends, step-length dynamics, and cross-response vocabulary convergence), achieves strong robustness while remaining competitive with or outperforming existing baselines on unperturbed traces. These findings suggest that the current central challenge in reasoning-aware hallucination detection is not the absence of signal in the trace, but the failure to isolate it from endpoint cues.

2605.08344 2026-05-12 cs.LG

What Time Is It? How Data Geometry Makes Time Conditioning Optional for Flow Matching

Alec Helbling, Sebastian Gutierrez Hernandez, Benjamin Hoover, Duen Horng Chau, Parikshit Ram

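AI summary: This paper studies why flow matching models can be trained without explicit time conditioning, challenging the conventional view that the interpolation time is needed to disambiguate velocity targets. Decomposing the time-blind loss, the authors identify two sources of irreducible error and show that the geometry of high-dimensional data makes time identifiable directly from noisy observations. Experiments show that the choice of coupling affects model performance more than time conditioning does.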

Recent work has shown that flow matching models can be trained without explicit time conditioning, challenging the standard view that the interpolation time is needed to disambiguate velocity targets. But why should a time-blind model work at all? Decomposing the time-blind flow matching loss, we identify two sources of irreducible error: a coupling variance, which arises from ambiguous velocity targets induced by how noise and data points are paired, and the time-blindness gap, which is the additional error caused by ignoring time. This gap shows that time-blind training is strictly harder than conventional training, reinforcing the puzzle that time-blind models work so well in practice. We resolve this tension by showing that the geometry of high-dimensional data makes time identifiable directly from noisy observations. When data concentrates near a $k$-dimensional subspace, time can be recovered from the statistical structure of noisy interpolants in directions orthogonal to the data; under a spiked-covariance model, this yields a closed-form estimator that recovers $t$ from a single observation $z$ at rate $O(1/\sqrt{d-k})$ for ambient dimension $d$. As a consequence, we prove that the time-blindness gap is asymptotically negligible relative to the coupling variance. We empirically demonstrate our identifiability result on real-world data and show that changing the coupling has a much larger effect on loss and sample quality than removing time conditioning across CIFAR-10, CelebA-HQ, and FFHQ. These results explain why time-blind flow matching works and show that the main practical lever is the choice of coupling, not explicit time conditioning.
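The identifiability idea can be illustrated in an idealized setting. This sketch is not the paper's estimator; it rests on several assumptions made here for illustration: the interpolant is `z = (1 - t) * noise + t * data` with unit-variance Gaussian noise, the data lie exactly in the span of the first `k` coordinates, and that subspace is known. In the remaining `d - k` coordinates `z` is pure noise scaled by `1 - t`, so its average energy there reveals `t`.

```python
import math
import random


def sample_interpolant(t, d, k, rng):
    """Draw z = (1 - t) * noise + t * data with data confined to the first k axes."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(d)]
    data = [rng.gauss(0.0, 3.0) for _ in range(k)] + [0.0] * (d - k)
    return [(1.0 - t) * n + t * x for n, x in zip(noise, data)]


def estimate_time(z, k):
    """Estimate t from the energy of z in the d - k directions orthogonal to the data."""
    d = len(z)
    orthogonal_energy = sum(v * v for v in z[k:])
    # E[orthogonal_energy] = (1 - t)^2 * (d - k) under the stated assumptions.
    return 1.0 - math.sqrt(orthogonal_energy / (d - k))


rng = random.Random(0)
z = sample_interpolant(t=0.3, d=5000, k=10, rng=rng)
t_hat = estimate_time(z, k=10)
```

The estimate tightens as `d - k` grows, which matches the qualitative rate quoted in the abstract; with real data the subspace would have to be estimated and the data would only concentrate near it.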

2605.08343 2026-05-12 cs.LG cs.CR cs.DC

Private Vertical Federated Inference for Time-Series

Lucas Fenaux, Larris Xie, Aditya Bang, Alex Zhang, Kevin Wilson, Florian Kerschbaum

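AI summary: Privacy protection is a major challenge when multiple parties collaborate on time-series data. This paper proposes PPHH-VFL, a hybrid vertical federated learning framework that splits the model head into an efficient plaintext public head and a secure, lightweight MPC private head, balancing efficiency and privacy. Experiments show that the method substantially speeds up inference and greatly reduces communication overhead while maintaining high downstream task performance.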

Institutions may benefit from collaborative inference on time-series data. In settings where privacy is necessary, multi-party computation (MPC) is a straightforward approach to providing strong guarantees, yet it remains prohibitively expensive and scales poorly with modern transformer architectures. Vertical Federated Learning (VFL) offers efficiency but suffers from privacy leakage at the embedding level, and securing the entire VFL model head via MPC remains prohibitively slow and communication-heavy for larger models. To enable practical, secure inference at scale, we propose "Public/Private Hybrid Head-VFL" (PPHH-VFL). This hybrid architecture splits the model head into an efficient plaintext public head and a secure, lightweight MPC private head. By applying adversarial training to the public embeddings, we mitigate privacy leakage; concurrently, the small private head securely preserves the flow of sensitive information needed for high downstream utility. Empirical evaluations on models ranging up to 86 million parameters demonstrate that PPHH-VFL accelerates inference by up to six orders of magnitude compared to end-to-end MPC. Compared to a standard VFL+MPC baseline, our approach scales significantly better, achieving a speedup of up to 44.4x in WAN and a 91.2x reduction in communication costs (dropping from 1.7 GB to 19 MB per batch), while simultaneously improving downstream classification accuracy by 2.50% and regression RMSE by 40.7%.

2605.08334 2026-05-12 cs.CL

SalesSim: Benchmarking and Aligning Multimodal Language Models as Retail User Simulators

Yada Pruksachatkun, Elaine Wan, Lyanna Chen, Kai-Wei Chang, Chien-Sheng Wu

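AI summary: This paper presents SalesSim, a framework and testbed for evaluating how well multimodal large language models simulate user behavior in realistic retail scenarios. It builds a multi-turn, multimodal, tool-augmented conversational environment in which shoppers with different backgrounds and preferences interact with a sales agent, and designs a set of metrics for decision alignment and conversational quality. Experiments find that existing models fall clearly short in lexical diversity and behavioral consistency. To address this, the authors propose UserGRPO, a reinforcement learning method that improves decision alignment and conversational quality, giving research on multimodal user simulators a new benchmark and a direction for improvement.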

We present SalesSim, a framework and testbed for evaluating the ability of Multimodal Large Language Models (MLLMs) to simulate realistic, persona-driven customer behavior in multi-turn, multi-modal, tool-augmented online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models retail interaction and decision-making as a grounded, agentic process, where shoppers with diverse backgrounds, preferences, and dealbreakers interact with a sales agent, seek clarifications, and make informed purchasing decisions. For evaluation, we design a suite of metrics centered on decision alignment, measuring the consistency between the simulator's actions and its persona specifications, as well as conversational quality. We find several behavioral gaps after benchmarking 6 open and closed-source state-of-the-art models. First, while models produce fluent conversations, they display significantly lower lexical diversity and overdisclosure of criteria across personas compared to human conversations. Second, models tend to be persuaded by sales agent suggestions and drift from persona specifications. Even the strongest model achieves less than 79% average alignment with its underlying persona specifications. To make progress on these limitations, we propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe to optimize both conversational fluency and decision alignment under persona specifications. Our experiments demonstrate that UserGRPO boosts decision alignment of the baseline model by 13.8% while improving conversational quality. By introducing SalesSim, we provide a new testbed for the community to investigate and improve the adherence of user simulators in goal-oriented settings.

2605.08333 2026-05-12 cs.LG cs.AI cs.CL cs.PF cs.SE

CDS4RAG: Cyclic Dual-Sequential Hyperparameter Optimization for RAG

Pengzhou Chen, Tao Chen

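AI summary: Retrieval-augmented generation (RAG) systems are highly sensitive to the hyperparameters of the retriever and generator, but optimizing them with given queries is difficult because of complex interactions and high evaluation cost. This paper proposes CDS4RAG, a framework that uses a new cyclic dual-sequential formulation to optimize retriever and generator hyperparameters in alternation, improving both the efficiency and the results of optimization. The framework is algorithm-agnostic and can be paired with a variety of general-purpose algorithms; across multiple benchmarks it improves generation quality and outperforms existing state-of-the-art methods.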

Comments Accepted by main track at IJCAI 2026

Retrieval-Augmented Generation (RAG) is sensitive to the vast hyperparameters of the retriever and generator, yet optimizing them using given queries is a challenging task due to the complex interactions and expensive evaluation costs. Existing algorithms are ineffective and slow in convergence, since they often treat RAG as a monolithic black box or only optimize partial hyperparameters. In this paper, we propose CDS4RAG, a framework that optimizes the full RAG hyperparameters using given queries via a new cyclic dual-sequential formulation. CDS4RAG is special in the sense that it distinguishes the hyperparameters of the retriever and generator, cyclically optimizing them in turn. Such a paradigm allows us to design fine-grained within-cycle budget provision and expedite the optimization via cross-cycle seeding when optimizing the generator. CDS4RAG is also an algorithm-agnostic framework that can be paired with diverse general algorithms. Through experiments on four common benchmarks and two backbone LLMs, we reveal that CDS4RAG considerably boosts the vanilla algorithms in 21/24 cases while significantly outperforming state-of-the-art algorithms in all cases with up to 1.54x improvements of generation quality and better speedup.
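The cyclic alternation between retriever and generator hyperparameters can be sketched as follows. This is a loose illustration and not CDS4RAG itself: random search stands in for whatever optimizer is plugged in, the fixed per-stage budget replaces the paper's budget provision, and `top_k`, `temperature`, and `toy_quality` are hypothetical names for a made-up objective. Carrying the best generator setting into the next cycle is a crude stand-in for cross-cycle seeding.

```python
import random


def sample(space, rng):
    """Draw one configuration from a dict of name -> list of candidate values."""
    return {name: rng.choice(values) for name, values in space.items()}


def cyclic_dual_sequential(evaluate, retriever_space, generator_space,
                           cycles=3, budget_per_stage=5, seed=0):
    rng = random.Random(seed)
    best_r = sample(retriever_space, rng)
    best_g = sample(generator_space, rng)
    best_score = evaluate(best_r, best_g)
    for _ in range(cycles):
        # Stage 1: search retriever settings with the generator held fixed.
        for _ in range(budget_per_stage):
            candidate = sample(retriever_space, rng)
            score = evaluate(candidate, best_g)
            if score > best_score:
                best_r, best_score = candidate, score
        # Stage 2: search generator settings with the retriever held fixed.
        for _ in range(budget_per_stage):
            candidate = sample(generator_space, rng)
            score = evaluate(best_r, candidate)
            if score > best_score:
                best_g, best_score = candidate, score
    return best_r, best_g, best_score


def toy_quality(retriever, generator):
    # Made-up objective; a real evaluation would run the RAG pipeline on queries.
    return -abs(retriever["top_k"] - 5) - abs(generator["temperature"] - 0.3)


retriever_space = {"top_k": [1, 3, 5, 10]}
generator_space = {"temperature": [0.0, 0.3, 0.7, 1.0]}
best_r, best_g, best_score = cyclic_dual_sequential(
    toy_quality, retriever_space, generator_space)
```

Each stage only varies one component, so every evaluation is spent on a smaller search space than a joint black-box search would face.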

2605.08330 2026-05-12 cs.RO

Hierarchical Prompting with Dual LLM Modules for Robotic Task and Motion Planning

Karolina Źróbek, Tessa Pulli, Paweł Gajewski, Antonio Galiza Cerdeira Gonzalez, Bipin Indurkhya

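AI summary: This paper proposes a language-driven hierarchical framework for robotic task and motion planning, aimed at more natural and intuitive human-robot interaction in service and assistance scenarios. It uses two LLM modules: a high-level planning agent that processes natural language commands and generates action sequences, and a low-level spatial reasoning module that handles precise spatial operations such as object placement. In experiments over 24 test scenarios, the system achieved an 86% task success rate.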

We present a hierarchical language-driven framework for robotic task and motion planning to improve natural, intuitive human-robot interaction in service and assistance scenarios. The proposed system employs two large language model (LLM) modules: a high-level planning agent and a low-level spatial reasoning sub-module. The primary agent processes natural language commands and generates action sequences using a ReAct-style prompt, interacting with tools for object perception and manipulation (e.g., pick, place, release). For precise spatial placement, such as interpreting "place the mug next to the plate", a separate sub-prompting module handles 3D reasoning based on object geometry and scene layout. The system integrates YOLOX-GDRNet for object detection and pose estimation, along with a motion execution stub. We evaluated the system in 24 test scenarios, ranging from simple spatial commands to high-level instructions and infeasible requests. The system achieved an overall task success rate of 86%.

2605.08329 2026-05-12 cs.CV eess.IV

An Efficient Token Compression Framework for Visual Object Tracking

Weijing Wu, Qihua Liang, Bineng Zhong, Haiying Xia, Zhiyi Mo, Shuxiang Song

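AI Summary This paper proposes ETCTrack, an efficient token compression framework for visual object tracking, to address the heavy computational burden and performance degradation that Transformer-based trackers incur when they use many historical template frames. The method uses an adaptive token compression module to dynamically generate compact yet discriminative template tokens, and a hierarchical interaction encoder for deep interaction with search-region features, reducing computation while maintaining tracking accuracy. Experiments show that the method outperforms existing state-of-the-art methods on seven benchmark datasets, cutting template tokens by 60% and computation by 21.4% with only a 0.4% drop in accuracy.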

Comments Accepted by CVPR2026

Abstract

Refining visual representations by eliminating their internal feature-level redundancy is crucial for simultaneously optimizing the performance and computational cost of models in visual tracking. To enhance their performance, many contemporary Transformer-based trackers leverage a larger number of historical template frames to capture richer spatio-temporal cues. However, this strategy leads to a massive number of input visual tokens. This creates two critical issues: it imposes a quadratic computational burden and can also degrade the tracker's overall performance. To address these issues, we propose a compress-then-interact tracking framework, ETCTrack, that learns to efficiently compress template tokens from historical template frames into a robust target representation, moving beyond handcrafted rules. Our method first employs the Adaptive Token Compressor to dynamically construct compact yet highly discriminative template tokens by filtering out redundant visual tokens. These refined template tokens are then processed by our Hierarchical Interaction Encoder to achieve a deep, adaptive interaction with the search features. The refined search features then enable precise target localization. Experiments on seven benchmarks demonstrate that our method outperforms current state-of-the-art trackers. ETCTrack-B224 reduces the number of template tokens by 60%, leading to a 21.4% reduction in MACs with only a 0.4% drop in accuracy. The source code is available at https://github.com/PJD-WJ/ETCTrack.
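
To make the idea of filtering redundant template tokens concrete, here is a minimal sketch that keeps only the highest-scoring tokens. It assumes each token already has a scalar importance score and uses a fixed top-k rule; ETCTrack's Adaptive Token Compressor is learned, so the function `compress_tokens`, the scores, and the `keep_ratio` parameter are illustrative stand-ins rather than the paper's method.

```python
def compress_tokens(tokens, scores, keep_ratio=0.4):
    """Keep the highest-scoring fraction of tokens, preserving original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, round(len(tokens) * keep_ratio))
    # Indices of the `keep` largest scores, then restored to input order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return [tokens[i] for i in sorted(top)]

tokens = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
kept = compress_tokens(tokens, scores)   # keep_ratio=0.4 drops 60% of the tokens
```

With these hypothetical scores, only `t1` and `t3` survive, mirroring a 60% reduction in template tokens.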

2605.08327 2026-05-12 cs.LG cs.AI

Interactive Critique-Revision Training for Reliable Structured LLM Generation

Fei Xu Yu, Zuyuan Zhang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan

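AI Summary In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, content generated by large language models (LLMs) must be locally correct, globally consistent, and auditable. This paper proposes DPA-GRPO, a paired-action training method that improves the reliability of model outputs through a game between a generator and a verifier combined with structured verifier interventions. Experiments on TaxCalcBench TY24 show that the method improves structured decision accuracy and improves the behavior of both the generator and the verifier.

Abstract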

In structured decision-making workflows such as form filling, compliance checking, and maintenance reporting, LLM outputs must be locally correct, globally consistent, and auditable against task-specific rules. Existing refinement methods often rely on heuristic debate, self-play, or LLM-generated supervision, creating a second-order assurance problem. We propose DPA-GRPO (Dual Paired-Action Group-Relative Policy Optimization), a paired-action training method for a two-player generator--verifier game with structured verifier interventions. The generator proposes outputs and may revise them when challenged; the verifier either remains silent or raises a safety assurance case (SAC) containing a claim, argument, and evidence. These SAC/no-SAC and KEEP/REVISE decisions induce paired counterfactual action groups, which DPA-GRPO uses for role-specific KL-regularized GRPO updates. We analyze the unregularized game and show that positive probability on strictly lower-reward intervention or revision actions creates a profitable unilateral deviation. Under standard stochastic-approximation assumptions, DPA-GRPO tracks the corresponding game ODE, whose isolated asymptotically stable limit points are stationary and candidate local equilibria under role-wise local optimality. Experiments on TaxCalcBench TY24 show that DPA-GRPO improves structured decision accuracy over zero-shot generation and generator-only RL baselines across Qwen3-4B and Qwen3-8B. Training increases correct silent acceptance, reduces missed errors, and improves calibrated revision behavior, indicating gains for both generator and verifier.
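
The abstract does not spell out the update rule, so the following is only a sketch of the group-relative advantage normalization commonly used in GRPO-style methods, applied to a two-action counterfactual group. The reward values are hypothetical, and the KL regularization and role-specific details of DPA-GRPO are omitted.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical paired group: reward if the generator KEEPs vs. REVISEs its output.
paired = {"KEEP": 0.2, "REVISE": 0.8}
adv = dict(zip(paired, group_relative_advantages(list(paired.values()))))
```

In a two-action group the advantages are symmetric around zero, so the better action is pushed up exactly as much as the worse one is pushed down.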

2605.08326 2026-05-12 cs.LG cs.AI

LLM Advertisement based on Neuron Auctions

Peiran Yun, Wenxin Xu, Jiayuan Liu, Yihang Zhang, Liang Zeng, Lingkai Kong, Tonghan Wang

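AI Summary As large language models (LLMs) are increasingly deployed in conversational settings, generative advertising is becoming an important monetization strategy, but balancing advertiser payoffs, platform revenue, and user experience while preserving semantic coherence remains a challenge. This paper proposes Neuron Auctions, an LLM advertising method that identifies brand-specific neurons in the model's internal representation space, builds a continuous, disentangled system of intervention budgets, and designs a menu-based auction mechanism that guarantees strategy-proofness, balancing commercial incentives and user satisfaction while preserving dialogue quality.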

Comments 17 pages, 9 figures, including appendices

Abstract

As Large Language Models (LLMs) transition into conversational agents, generative advertising emerges as a crucial monetization strategy. However, embedding advertisements within unstructured LLM outputs introduces a critical trilemma: balancing advertiser payoffs, platform revenue, and user experience. Existing methods, such as prompt injection or rigid position slots, disrupt semantic coherence and lack a parametric framework for independent control, rendering rigorous mechanism design intractable. To bridge this gap, we introduce Neuron Auctions, a novel paradigm that shifts the auction object from the surface text space to the LLM's internal representations. Leveraging mechanistic interpretability, we identify brand-specific feed-forward network (FFN) neurons and demonstrate that competing brands activate within approximately orthogonal subspaces. This near-perfect independence allows us to define continuous, disentangled intervention budgets (specifically, neuron counts and amplification factors) as auctionable commodities. Building on this computational carrier, we design a continuous menu-based auction mechanism that naturally guarantees strategy-proofness and optimizes revenue for the platform. By explicitly incorporating a user utility penalty into the platform's optimization objective, our framework dynamically prices out overly aggressive interventions. Extensive experiments demonstrate that Neuron Auctions effectively preserve natural discourse quality while achieving an optimal alignment between commercial incentives and user satisfaction.

2605.08323 2026-05-12 cs.LG cs.AI

The Reciprocity Gradient

Yue Lin, Pascal Poupart, Shuhui Zhu, Dan Qiao, Wenhao Li, Yuan Liu, Hongyuan Zha, Baoxiang Wang

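AI Summary Communication is essential for sustaining reciprocity and cooperation in strategic interactions. This paper identifies a new optimization difficulty, the influence attribution problem: when choosing actions, an agent must account for the indirect effects of its behavior on third parties' reputations and adjust its own policy accordingly. To address it, the authors introduce the reciprocity gradient, which trains private estimators of opponents' policies from public observations and explicitly backpropagates reward gradients through the reputation chain, jointly optimizing actions and evaluative signals without intrinsic rewards. Experiments show that the method learns near-optimal context-sensitive policies.

Abstract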

Communication is fundamental to sustaining reciprocity and cooperation in strategic interactions. We identify and formulate the influence attribution problem as the central optimization difficulty inherent in such dynamics for a learning agent: any action or signal the agent emits reshapes the reputations of many third parties along combinatorially branching paths before feeding back into its own future rewards, forcing the agent to account for all of these indirect channels at once when choosing every action. To address this, we introduce the reciprocity gradient, which explicitly backpropagates reward gradients through private estimators of opponents' policies trained from public observations. The gradient flows through the reputation chain itself analytically, rather than being estimated from sampled returns. It jointly optimizes actions and evaluative signals without intrinsic rewards or reward shaping. Empirically, the method recovers near-optimal context-sensitive policies, while sample-based baselines collapse into constant-output policies.

2605.08321 2026-05-12 cs.LG cs.AI cs.CY cs.HC cs.MA

LLM Wardens: Mitigating Adversarial Persuasion with Third-Party Conversational Oversight

Lennart Wachowiak, Scott D. Blain, David Williams-King, Samuele Marro

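AI Summary As large language models (LLMs) grow more capable of persuasion, protecting users from manipulation has become an important problem. This paper proposes a "warden" model that acts as a third party, monitoring the human-AI interaction in real time and issuing non-binding advisories to the user when it detects potential manipulation. Experiments show that this mechanism substantially reduces the success rate of adversarial LLMs, and that a warden weaker than the model it oversees still provides effective protection, suggesting a viable path toward scalable model oversight.

Abstract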

LLMs are increasingly capable of persuasion, which raises the question of how to protect users against manipulation. In a preregistered user study (N=120) across four decision-making scenarios, we find that an adversarial LLM with a hidden goal succeeds in steering users' decisions 65.4% of the time. We then introduce a "warden" model: a secondary LLM that monitors the human-AI interaction trace in real time and issues non-binding, private advisories to the user when it detects manipulation. Adding a warden more than halves the adversary's success rate to 30.4%, with a much smaller (8.6 percentage points) reduction for genuine interactions. To probe the mechanism behind these results, we release COAX-Bench, a simulation benchmark spanning 14 decision-making scenarios, including hiring, voting, and file access. Across 16,212 simulated multi-agent interactions, capable adversarial LLMs achieve their hidden goals in 34.7% of cases, which warden models reduce to 12.3%. Notably, even warden models substantially weaker than the adversary they oversee provide meaningful protection, suggesting a path for scalable oversight of more capable models.
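
The "more than halves" claim can be checked with a short calculation of relative reductions, using only the success rates reported in the abstract. The helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before, after):
    """Fraction by which a rate falls, relative to its starting value."""
    return (before - after) / before

user_study = relative_reduction(65.4, 30.4)   # adversary success without vs. with a warden
simulation = relative_reduction(34.7, 12.3)   # COAX-Bench simulated interactions
print(f"user study: {user_study:.1%}, simulation: {simulation:.1%}")
```

The user-study reduction comes out at roughly 53.5% and the simulated one at roughly 64.6%, so the adversary's success rate is more than halved in both settings.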

2605.08317 2026-05-12 cs.LG cs.AI

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Junkai Zhang, Hang Guo, Luca Benini, Yawei Li

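AI Summary Large language models face memory and bandwidth bottlenecks when processing long input contexts, and existing KV cache compression methods typically consider eviction or quantization in isolation. This paper casts KV cache compression as a rate-distortion problem and proposes RDKV, which jointly optimizes eviction and quantization by computing a compression-distortion weight for each token or channel and allocating bit-widths via reverse water-filling. Experiments show that RDKV significantly improves inference speed and memory efficiency while maintaining performance.

Abstract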

Large language models (LLMs) have shown strong performance across diverse tasks, but their inference with long input contexts is bottlenecked by memory size and bandwidth. The Key-Value (KV) cache size grows linearly with sequence length and needs to be re-read from off-chip high-bandwidth memory (HBM) to on-chip memory at every decoding step, resulting in memory-bound inference. Existing methods reduce the cache by either eviction or quantization, but typically treat the two in isolation. In this paper, we cast KV cache compression as a rate-distortion problem, under which eviction and quantization are two end-points of the same bit allocation scheme. This exposes the need to optimize them jointly, motivating our method, RDKV (Rate-Distortion KV cache compression). RDKV derives the weight of each token or channel from the distortion that compression induces on the attention computation. Based on these weights, it assigns each token or channel a bit-width ranging from full precision down to zero bits guided by reverse water-filling, applied once after the prefilling stage. Experiments on LongBench, RULER, and InfiniteBench show that RDKV outperforms the best evaluated baseline by 9.1% on average. On LongBench it recovers 97.81% of full-cache accuracy with only 2.48% cache retention. Compared with full-cache FlashAttention-2 decoding, it achieves 4.5x decode speedup and 1.9x peak memory reduction with 128K context length, while maintaining comparable performance.
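
For readers unfamiliar with reverse water-filling, the sketch below implements the classical version for independent Gaussian components: each component with variance above a common "water level" theta receives 0.5*log2(variance/theta) bits, and every other component receives zero bits, which corresponds to eviction. The variances and bit budget are hypothetical. RDKV's attention-derived weights and its mapping to discrete bit-widths are not described in the abstract and are not modeled here.

```python
import math

def reverse_water_filling(variances, total_bits, iters=100):
    """Return (theta, bits) where bits_i = max(0, 0.5*log2(var_i/theta)).

    Assumes total_bits is small enough to be reachable with theta >= 1e-12.
    """
    def bits_at(theta):
        return [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]

    lo, hi = 1e-12, max(variances)       # total rate is 0 at theta = max variance
    for _ in range(iters):
        mid = math.sqrt(lo * hi)         # bisect in log space
        if sum(bits_at(mid)) > total_bits:
            lo = mid                     # too many bits: raise the water level
        else:
            hi = mid
    return hi, bits_at(hi)

theta, bits = reverse_water_filling([16.0, 4.0, 1.0, 0.01], total_bits=3.0)
```

In this example the water level settles near 1, so the two highest-variance components receive about 2 and 1 bits, while the remaining components fall at or below the water level and get none.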

2605.08315 2026-05-12 cs.LG

Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

Rahaf Abu Hara, Vaibbhav Murarri, Claudio Zito

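AI Summary Existing policy optimization methods based on large language models (LLMs) rely only on scalar reward signals and lack detailed behavioral analysis of policy execution trajectories. This paper proposes Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework that improves policy search by incorporating trajectory-level behavioral evidence. R2PO uses a search model to generate policy parameters and a critic model to propose targeted revisions, and it mitigates salience bias through aggregate trajectory statistics and median-trajectory selection, converging faster and training more stably across multiple environments.

Abstract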

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.
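
A small sketch shows how aggregate statistics and a median rollout could be presented alongside the best and worst ones, so that a single failure does not dominate the picture. The function `summarize_rollouts` and the example returns are assumptions for illustration, not R2PO's actual prompt construction.

```python
from statistics import mean, pstdev

def summarize_rollouts(returns):
    """Aggregate statistics plus indices of the worst, median, and best rollouts."""
    order = sorted(range(len(returns)), key=lambda i: returns[i])
    return {
        "mean": mean(returns),
        "std": pstdev(returns),
        "worst": order[0],
        "median": order[len(order) // 2],   # upper median for even-length lists
        "best": order[-1],
    }

returns = [1.0] * 19 + [0.0]     # 19 successes, one catastrophic failure
summary = summarize_rollouts(returns)
```

Here the worst rollout is the single failure, but the median rollout is a success and the mean is 0.95, which is the context a critic would need to avoid over-correcting for one outlier.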

2605.08314 2026-05-12 cs.LG cs.AI cs.PF

FlashSVD v1.5: Making Low-Rank Transformers Inference Actually Fast

Wenhao Wu, Zishan Shao, Kangning Cui, Jinhee Kim, Yixiao Wang, Hancheng Ye, Danyang Zhuo, Yiran Chen

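AI Summary This paper studies SVD-based low-rank compression for accelerating real-world LLM inference, noting that although the technique reduces parameters and computation, actual inference speedups are limited. The authors present FlashSVD v1.5, a unified inference runtime that significantly improves the inference efficiency of low-rank-compressed models by optimizing execution paths, using phase-specific kernels, and replaying CUDA graphs. Experiments show substantial decode and end-to-end speedups across several mainstream low-rank compression schemes, indicating that practical acceleration requires co-design of the runtime and the compression algorithm.

Abstract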

SVD-based low-rank compression reduces transformer parameters and nominal FLOPs, but these savings often translate poorly into real LLM serving speedups. We show that this gap is largely a runtime problem: factorized checkpoints fragment execution paths, and the resulting overhead differs substantially between prefill and autoregressive decode. We present FlashSVD v1.5, a unified inference runtime for serving SVD-compressed transformers. FlashSVD v1.5 maps diverse public SVD compression families to a common factorized representation and combines phase-specific kernels with dense-KV decode, packed MLP execution, and per-layer CUDA-graph replay to reorganize the low-rank serving path into a thin runtime. Across representative decoder-serving settings, FlashSVD v1.5 achieves up to 2.55x decode and 2.39x end-to-end speedup, and it attains 1.48x average decode and 1.44x average end-to-end speedup across multiple popular SVD compression families. These results suggest that practical low-rank acceleration requires runtime co-design, not compression algorithms alone. Our code is available at: https://github.com/Zishan-Shao/FlashSVD.
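
The nominal savings from low-rank factorization follow from simple counting: a dense m-by-n weight costs m*n multiply-accumulates per token, while a rank-r factorization into an m-by-r and an r-by-n matrix costs r*(m+n). The layer sizes below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def dense_macs(m, n):
    """Multiply-accumulates per token for a dense m-by-n weight."""
    return m * n

def low_rank_macs(m, n, r):
    """Multiply-accumulates per token for a rank-r factorization (m-by-r times r-by-n)."""
    return r * (m + n)

m = n = 4096
r = 1024
ratio = low_rank_macs(m, n, r) / dense_macs(m, n)   # fraction of the dense cost
break_even_rank = (m * n) / (m + n)                  # rank at which the costs are equal
```

For this layer the factorized form needs half the dense operations, and any rank below 2048 is nominally cheaper. The abstract's point is that such nominal savings do not automatically become wall-clock speedups, because one matrix multiply is replaced by two smaller ones with their own launch and memory overheads.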

2605.08311 2026-05-12 cs.LG cs.CV

Revitalizing the Beginning: Avoiding Storage Dependency for Model Merging in Continual Learning

Xi Wang, Cheng Deng

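AI Summary In continual learning, model merging aims to integrate multiple expert models into a unified multi-task model, but storage limits make it hard to preserve diverse historical knowledge. This paper systematically analyzes the shortcomings of existing merging methods and finds that they over-emphasize global alignment, which lets task-specific errors accumulate, and that vanishing gradients at the start of each new task cause optimization to stall. The authors propose Trajectory Regularized Merging (TRM), which optimizes within an augmented trajectory subspace while jointly pursuing task alignment, prediction consistency, and gradient responsiveness, preserving the model's historical stability and re-activating optimization. Experiments show state-of-the-art results on multiple benchmarks.

Abstract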

Model merging provides a compelling paradigm for integrating specialized expertise into a unified multi-task model, a goal that aligns naturally with the sequential knowledge acquisition in continual learning (CL). However, the requirement to preserve diverse forms of previous knowledge conflicts with the storage limitations inherent to CL. In this paper, we systematically analyze existing model merging methods under the constraints of CL. We find that current methods prioritize global alignment, which often leads to the accumulation and amplification of task-specific errors within the continuous data stream, and that vanishing gradients at the onset of subsequent tasks frequently cause optimization to stagnate. Both effects leave the merged model in a suboptimal state at the beginning of the next training phase. To address these challenges, we propose Trajectory Regularized Merging (TRM), a framework that reformulates the merging phase as an optimization process within an augmented trajectory subspace. Our framework integrates three synergistic objectives (task alignment, prediction consistency, and gradient responsiveness) to concurrently preserve the merged model's historical stability and re-activate optimization dynamics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple benchmarks.

2605.08308 2026-05-12 cs.LG cs.AI eess.SP

Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns

Guolin Yin, Junqing Zhang, Guanxiong Shen, Simon L. Cotton

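AI Summary This paper studies efficient human motion recognition under variable-traffic Wi-Fi transmissions and proposes a sampling rate versatile neural network (SRV-NN), based on the Transformer architecture, to handle irregular sampling rates and signal lengths. With a dynamic sampling rate augmentation strategy, the method performs well and remains stable across a range of sampling conditions; experiments show that its average accuracy is significantly higher than that of baseline models without augmentation.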

Comments 17 Pages

Abstract

Wi-Fi sensing detects human motions and activities by analysing the channel state information (CSI) derived from Wi-Fi transmissions. However, the impact of variable transmission traffic, which dictates the effective sampling rate and interval, is often overlooked. Existing Wi-Fi sensing systems are trained with a fixed input size and sampling rate, and therefore generalise poorly across sampling rates. This paper proposes a novel Wi-Fi sensing approach for motion recognition applications, e.g., gesture and activity recognition, under variable traffic patterns. A sampling rate versatile neural network (SRV-NN) based on the transformer is proposed to efficiently handle sensing signals of variable input size. A dynamic sampling rate augmentation is employed to cover variable sampling rates and intervals. To validate our approach, we carried out an extensive experimental evaluation using two self-collected datasets, namely SRV activity and SRV gesture, as well as two publicly available datasets. Our method demonstrated exceptional performance and stability under variable sampling rates, with substantial improvements in average accuracy compared to baseline models without augmentation. The proposed approach significantly enhances stability by greatly reducing accuracy variance across different sampling rates.
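
One plausible way to simulate variable traffic during training is to randomly drop samples from a fixed-rate recording, which changes both the effective sampling rate and the spacing between samples. The sketch below is an assumption about what such an augmentation might look like; the function `resample_augment`, its parameters, and the integer stand-in for CSI frames are hypothetical and not taken from the paper.

```python
import random

def resample_augment(sequence, min_keep=0.25, max_keep=1.0, rng=None):
    """Randomly keep a fraction of samples, in order, to mimic irregular traffic."""
    rng = rng or random.Random()
    keep = rng.uniform(min_keep, max_keep)          # effective sampling-rate factor
    n = max(1, round(len(sequence) * keep))
    idx = sorted(rng.sample(range(len(sequence)), n))
    return [sequence[i] for i in idx]

csi = list(range(100))                              # stand-in for 100 CSI frames
augmented = resample_augment(csi, rng=random.Random(0))
```

The output length varies from call to call, which is why a model consuming such data needs to accept variable-length inputs.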

2605.08305 2026-05-12 cs.LG cs.AI cs.CL cs.PF cs.SE

LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems

Siyu Wu, Yulong Ye, Zezhen Xiang, Pengzhou Chen, Gangda Xiong, Tao Chen

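AI Summary This paper introduces LLMSYS-HPOBench, the first hyperparameter optimization (HPO) benchmark suite for real-world large language model (LLM) systems. The benchmark covers hyperparameter configurations profiled from real runs together with their performance metrics, fidelity factors, and cost data, comprising hundreds of thousands of configuration records and a variety of evaluation metrics. It is intended as a platform for the AutoML community to revalidate existing HPO algorithms and explore new research directions.

Abstract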

Large Language Model (LLM) systems have been the frontier of AI in many application domains, leading to new challenges and opportunities for hyperparameter optimization (HPO) for the AutoML community. However, this type of system exhibits an unprecedented compound space of hyperparameter configuration from both the AI and non-AI components; rich and nonlinear implications from the fidelity factors; and diverse costs of measuring hyperparameter configurations, none of which have been fully captured in existing benchmarks. This paper presents the first (live) benchmark suite and datasets for HPO of real-world LLM systems, dubbed LLMSYS-HPOBench, covering data related to the inference objective values of hyperparameter configurations profiled from running the LLM systems. Currently, LLMSYS-HPOBench contains 364,450 hyperparameter configurations with a dimensionality of 12-23, 3-5 dimensions of fidelity factor leading to 932 settings, 3-9 inference objective metrics, and 2-10 cost metrics, together with generated logs from measuring the LLM systems. We seek not only to prompt a revalidation of existing HPO algorithms on frontier LLM systems, but also to provide an evolving platform for the AutoML community to explore new directions of research in this regard. The benchmark suite has been made available at: https://github.com/ideas-labo/llmsys-hpobench

2605.08303 2026-05-12 cs.LG cs.AI

GNN for Structural Displacement Prediction

Hung-Fu Chang, Tzu-Kang Lin, Yung-Li Cheng

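AI Summary This paper studies structural displacement prediction based on graph neural networks (GNNs), aiming to address the high computational cost of the traditional finite element method, which makes it unsuitable for real-time monitoring. The method models the structural system as a graph, with nodes representing joints and edges representing structural members, and incorporates geometric and mechanical properties for data-driven prediction. Experiments show that the proposed GNN framework is more accurate than a conventional neural network, demonstrating its potential as an efficient alternative.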

Comments 12 pages

Abstract

Accurate prediction of structural displacements under external loading is fundamental to structural health monitoring and seismic safety assessment. Although the finite element method (FEM) remains the prevailing approach because of its high accuracy, its considerable computational cost restricts its suitability for real-time monitoring applications. To address this limitation, this study proposes a data-driven framework based on Graph Neural Networks (GNNs), in which structural systems are represented as graphs with joints modeled as nodes and structural members as edges. By incorporating both geometric and mechanical properties into the graph representation, the proposed model learns the relationship between applied loads and structural responses directly from simulated data. A synthetic dataset was generated from a two-story frame structure using ANSYS, and both a conventional Neural Network (NN) and a GNN were trained for comparison. The results show that the proposed GNN framework predicts displacements and rotations with high accuracy and outperforms the NN model, demonstrating its potential as a fast and efficient alternative to traditional FEM-based analysis.
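
The graph representation described above can be sketched for a small frame, with joints as nodes and members as edges, followed by one round of neighbor averaging. The geometry, the load value, and the unweighted mean aggregation are assumptions for illustration; a trained GNN would use learned weights and richer node and edge features (for example, member stiffness), and the paper's actual architecture is not specified in the abstract.

```python
# Hypothetical two-story, single-bay frame: 6 joints, 6 members (coordinates in metres).
nodes = {0: (0.0, 0.0), 1: (4.0, 0.0),    # supports
         2: (0.0, 3.0), 3: (4.0, 3.0),    # first floor
         4: (0.0, 6.0), 5: (4.0, 6.0)}    # roof
edges = [(0, 2), (1, 3), (2, 4), (3, 5),  # columns
         (2, 3), (4, 5)]                  # beams

def neighbors(node):
    """Joints connected to `node` by a structural member."""
    return [b if a == node else a for a, b in edges if node in (a, b)]

def message_pass(features):
    """One round of mean aggregation: each joint averages itself with its neighbors."""
    out = {}
    for n, h in features.items():
        group = [h] + [features[m] for m in neighbors(n)]
        out[n] = sum(group) / len(group)
    return out

load = {n: 0.0 for n in nodes}
load[5] = 10.0                    # load applied at one roof joint
h1 = message_pass(load)
```

After one round, only the loaded joint and its direct neighbors carry a nonzero value; stacking more rounds lets information travel further along the frame, which is how a GNN relates a load at one joint to responses elsewhere.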