arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09081 2026-06-09 cs.CV New submission

Edge-Constrained UAV Small-Object Detection with P2 Enhancement and Quantum-Inspired Lightweight Structure Search

Wuming Lei, Yanbin Gao, Mingyan Sun, Xiaobin Li, Xuechen Liang

Institutions: East China Jiaotong University

AI summary: For UAV edge deployment, combines a P2 high-resolution detection branch with a quantum-inspired evolutionary algorithm that searches for lightweight structures, substantially improving small-object detection accuracy on VisDrone.

Abstract

Unmanned aerial vehicle (UAV) object detection requires compact detectors that retain small-object details under onboard computation and memory constraints. Repeated downsampling in lightweight networks weakens shallow spatial information, while manually adding attention or fusion modules may increase cost without stable gains. This study analyzes YOLOX-Nano under edge-deployment constraints by combining a P2 high-resolution detection branch with a quantum-inspired evolutionary algorithm (QIEA) for lightweight structure screening. The search space is defined by lightweight priority and task specificity, and the evaluation jointly considers accuracy, floating-point operations (FLOPs), latency, memory consumption, and recall. On VisDrone, the P2 branch increases AP_small by 31.10% over the YOLOX-Nano baseline. Compared with NanoDet-Plus at a similar model size, YOLOX-Nano+P2 improves AP50:95 by 17.5% and AP_small by 44.9%. The QIEA-selected candidate obtains the highest Recall50, but +P2 remains the strongest AP-oriented variant after full training. Full 100-epoch verification of the Random-best, GA-best, and SA/QUBO-best candidates further shows that proxy rankings do not necessarily transfer to final AP50:95. These results support using P2 as the main small-object enhancement path and QIEA as a lightweight tool for candidate screening and accuracy-cost analysis. The source code, configuration files, diagnostic scripts, and summarized results are available at https://github.com/Ming23233/UAV-QIEA-Edge-Detection
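As a rough illustration of why a higher-resolution P2 branch can help with small objects, the sketch below computes the feature-map size at each pyramid level and how many grid cells a small object spans along one axis. The 640-pixel input, the 12-pixel object, and the function names are illustrative assumptions and do not come from the paper; the only conventions relied on are that P2 denotes stride 4 and that the stock YOLOX heads operate at strides 8, 16, and 32.

```python
# Illustrative only: input size and object size are assumptions, not paper values.
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

def grid_size(input_px: int, stride: int) -> int:
    """Side length of the square feature map at a given stride."""
    return input_px // stride

def cells_spanned(object_px: float, stride: int) -> float:
    """Number of feature-map cells an object covers along one axis."""
    return object_px / stride

for name, stride in STRIDES.items():
    # Print level name, grid side length, and cells covered by a 12-pixel object.
    print(name, grid_size(640, stride), cells_spanned(12, stride))
```

Under these assumptions the P2 grid has four times as many cells as the P3 grid, which is consistent with the abstract's concern about weighing accuracy gains against FLOPs, latency, and memory.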

2606.09080 2026-06-09 cs.LG cs.CL New submission

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

Haozhe Hu, Hao Wu, Anhao Zhao, Longwei Ding, Peiran Yin, Yunpu Ma, Xiaoyu Shen

Institutions: Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo; Department of Computing, The Hong Kong Polytechnic University; Munich Center for Machine Learning, LMU Munich

AI summary: Proposes a taxonomy of pruning methods based on GEMM dimensions and systematically evaluates pruning methods on the acceleration-quality Pareto frontier within a unified benchmarking framework, finding that static depth pruning is best at low quality loss and offering a unified view of pruning-based LLM acceleration.

Comments
22 pages, 14 figures

Abstract

Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0% to 4%), to dynamic depth at moderate loss (5% to 16%), and finally to static width pruning at higher loss levels (17% to 26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim
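To make the M, N, K framing concrete, here is a minimal sketch of the standard FLOP count for a GEMM and a crude arithmetic-intensity estimate. It does not reproduce the paper's taxonomy or benchmark; the function names, the 2-byte element size, and the "each matrix moved once" traffic model are assumptions chosen for illustration.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs of C[m, n] = A[m, k] @ B[k, n]: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k

def arithmetic_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte, assuming A, B, and C each cross the memory bus exactly once."""
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return gemm_flops(m, n, k) / bytes_moved

# Halving any one logical dimension halves the FLOP count ...
base = gemm_flops(1024, 4096, 4096)
print(base / gemm_flops(512, 4096, 4096))    # shrink M
print(base / gemm_flops(1024, 2048, 4096))   # shrink N
print(base / gemm_flops(1024, 4096, 2048))   # shrink K

# ... but at M = 1 the weight matrix dominates memory traffic, so the kernel
# performs roughly one FLOP per byte and is bound by bandwidth, not compute.
print(arithmetic_intensity(1, 4096, 4096))
print(arithmetic_intensity(1024, 4096, 4096))
```

The contrast between the last two lines is one simple reason equal FLOP reductions need not yield equal wall-clock speedups, which is the gap the abstract sets out to measure.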

2606.09078 2026-06-09 cs.LG New submission

The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

Aakriti Agrawal, Souradip Chakraborty, Armin Saghafian, Nihal Sharma, Rizal Fathony, Nam H Nguyen, C. Bayan Bruss, Amrit Singh Bedi, Furong Huang

Institutions: University of Maryland; Amazon; University of Central Florida

AI summary: Addresses the bias toward spuriously high scores that process reward models acquire from imbalanced training data. The proposed PRISM framework learns from contrastive step-level comparisons and hard negatives generated by a lookahead strategy, optimized with a difficulty-aware curriculum, substantially lowering the false-positive rate and improving reasoning accuracy.

Abstract

Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.
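The shift from pointwise label fitting to relative comparison can be sketched with two toy losses. The abstract does not give PRISM's actual objective, so the hinge-style margin loss below is a generic stand-in, and the scores, margin value, and function names are hypothetical.

```python
import math

def pointwise_bce(score: float, label: int) -> float:
    """Binary cross-entropy on a single step score in (0, 1)."""
    return -math.log(score) if label == 1 else -math.log(1.0 - score)

def pairwise_margin_loss(good: float, bad: float, margin: float) -> float:
    """Hinge loss: zero only when the correct step outscores the flawed step by `margin`."""
    return max(0.0, margin - (good - bad))

# A correct step and a plausible-but-wrong step that both receive high scores.
good_score, bad_score = 0.90, 0.85

# The pointwise loss on the correct step alone is already small, so it says
# nothing about the flawed step sitting right next to it.
print(pointwise_bce(good_score, 1))

# The pairwise loss stays positive until the two steps are actually separated.
print(pairwise_margin_loss(good_score, bad_score, margin=0.3))
```

In this toy setting the pairwise term keeps penalizing the model for as long as a flawed step scores close to a correct one, which is the false-positive pattern the abstract targets.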

2606.09077 2026-06-09 cs.LG New submission

Neural Legendre-Fenchel transform with Hessian Preconditioning

Basile Plus-Gourdon, Frank Nielsen

Institutions: École Normale Supérieure Paris-Saclay; Sony Computer Science Laboratories Inc.

AI summary: Proposes a neural Legendre-Fenchel transform with Hessian preconditioning, which uses an affine deformation to ease conjugate computation for ill-conditioned functions, improving convergence speed and numerical accuracy.

Comments
11 pages, 4 figures

Abstract

The Legendre-Fenchel (LF) transform is a fundamental tool in convex analysis and machine learning that maps lower semi-continuous functions to their convex conjugates. In practice, when no closed-form formula is available for the convex conjugate of a given function, one must approximate it using various techniques. One recent and versatile numerical method is the deep Legendre transform, which relies on neural networks, although it remains challenging to apply to ill-conditioned functions. This work builds on the reformulation of the LF transform as a projective polarity. A notable property of this framework is its affine invariance. We leverage this affine invariance to introduce a Hessian-based preconditioning strategy. Specifically, we apply an affine deformation around a minimizer so that the second-order Taylor approximation of the function coincides with the canonical paraboloid, whose conjugation map is the identity. A residual network initialized near the identity can then learn this simplified mapping, while the original conjugation map is recovered through the inverse deformation. The proposed preconditioning incurs only a modest computational overhead, consisting of a single eigendecomposition during initialization and two matrix-vector multiplications per query. Experiments on a diverse set of convex functions, including high-dimensional benchmarks, demonstrate improved convergence rates and enhanced numerical accuracy of the conjugation, with particularly significant gains for ill-conditioned problems. Finally, we discuss the scope of applicability of our proposed method and highlight several of its limitations.
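The preconditioning idea can be checked by hand in the scalar quadratic case. The sketch below is not the authors' method (there is no neural network here); it only verifies, for an assumed f(x) = a·x²/2, that the deformation x = L·z with L = a^(-1/2) turns f into the canonical parabola and that the conjugate is recovered as f*(y) = g*(L·y). The curvature value and function names are illustrative.

```python
import math

def conj_bruteforce(f, y, xs):
    """Discrete Legendre-Fenchel transform: max over grid points of x*y - f(x)."""
    return max(x * y - f(x) for x in xs)

def conj_preconditioned(y, hessian):
    """Conjugate of f(x) = hessian * x**2 / 2 via the deformation x = L * z.

    With L = hessian ** -0.5, f becomes g(z) = z**2 / 2 in z-coordinates.
    Its conjugate is g*(w) = w**2 / 2, and f*(y) = g*(L * y).
    """
    L = 1.0 / math.sqrt(hessian)
    w = L * y
    return 0.5 * w * w

a = 100.0                                   # assumed curvature, far from the canonical value 1
f = lambda x: 0.5 * a * x * x
xs = [i / 1000 for i in range(-1000, 1001)]  # search grid on [-1, 1]

print(conj_bruteforce(f, 5.0, xs))          # numerical conjugate at y = 5
print(conj_preconditioned(5.0, a))          # closed form through the deformation
```

In higher dimensions the scalar L would be replaced by a matrix built from the Hessian's eigendecomposition, and for non-quadratic functions only the second-order part is normalized, leaving the remainder for the residual network described in the abstract.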

2606.09076 2026-06-09 cs.CV New submission

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi

Institutions: Alibaba Group; Nankai University

AI summary: Proposes the Z-Reward framework, which uses a teacher-student setup to internalize reasoning-based rewards into the score distributions of a compact VLM, enabling efficient and accurate text-to-image optimization.

Abstract

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
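A small numeric example shows what a scalar reward discards compared with a distribution over rubric scores. The 1-to-5 rubric, the two example distributions, and the function names are assumptions made for illustration; the abstract does not specify the rubric or how Z-Reward parameterizes its distributions.

```python
def expected_score(probs):
    """Expectation of a distribution over rubric scores 1..len(probs)."""
    return sum(score * p for score, p in enumerate(probs, start=1))

def score_variance(probs):
    """Variance of the same distribution, a simple measure of rater disagreement."""
    mean = expected_score(probs)
    return sum(p * (score - mean) ** 2 for score, p in enumerate(probs, start=1))

consensus = [0.0, 0.0, 1.0, 0.0, 0.0]   # everyone rates the image a 3
polarized = [0.5, 0.0, 0.0, 0.0, 0.5]   # half rate it 1, half rate it 5

for name, dist in [("consensus", consensus), ("polarized", polarized)]:
    # Print the scalar summary next to the spread that the scalar hides.
    print(name, expected_score(dist), score_variance(dist))
```

Both distributions collapse to the same scalar expectation while differing sharply in spread, which is the kind of uncertainty the abstract says scalar reward models over-compress.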

2606.09074 2026-06-09 cs.CV New submission

REFINE: Super-efficient 3D Gaussian Splatting Pruning via Rendering-Free Primitive Importance

Zhang Chen, Shuai Wan, Mengting Yu, Fuzheng Yang, Junhui Hou

Institutions: Northwestern Polytechnical University; Xidian University; City University of Hong Kong

AI summary: Proposes the REFINE framework, which uses a rendering-free primitive importance metric (based on an analytically approximated Hessian field) to prune 3D Gaussian splatting models efficiently, cutting pruning-related computational complexity by a factor of 3,000 while maintaining rendering quality.

Abstract

Existing pruning methods for 3D Gaussian splatting (3DGS) suffer from either severe quality degradation or prohibitive computational overhead. In this paper, we propose REFINE, a highly accelerated 3DGS pruning framework centered on a novel rendering-free primitive importance metric. Our approach leverages an analytically approximated, rendering-aware Hessian field to quantify the expected perceptual error induced by the removal of individual primitives. By modeling the joint modulation of visibility, projection geometry, and a content-adaptive hyperparameter, we entirely bypass costly forward rendering passes and derive an anisotropic perceptual weight field that serves as a high-fidelity proxy for primitive importance. Extensive experiments across multiple benchmark datasets demonstrate that REFINE maintains highly competitive rendering quality while achieving an unprecedented 3,000× reduction in pruning-related computational complexity compared to state-of-the-art pruning methods.

2606.09064 2026-06-09 cs.CV cs.AI New submission

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang

Institutions: Baidu Inc.; Harbin Institute of Technology; Hong Kong University of Science and Technology

AI summary: Proposes the CoVER framework, which dynamically gathers query-expanded visual evidence and verifies draft answers with answer-specific visual feedback, shifting from answer-centric generation to evidence-centric, visually verifiable reasoning; on long-video understanding tasks it outperforms models of the same scale and some closed-source models.

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have enabled progress on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to See More by dynamically gathering query-expanded visual evidence, and to Think Deeper by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

2606.09056 2026-06-09 cs.CV cs.LG New submission

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini, Phillip Isola, Vincent Sitzmann

Institutions: Massachusetts Institute of Technology; Toyota Research Institute

AI summary: Proposes coarse-to-fine rollout in a multi-scale token space: a pre-trained hierarchical autoencoder compresses each frame into multiple levels of tokens, and a video diffusion model is trained to generate them, preserving long-range consistency in geometry and object permanence while reducing compute.

Comments
Ishaan Preetam Chandratreya and David Charatan contributed equally. Project page: https://davidcharatan.com/millivid/

Abstract

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.

2606.09051 2026-06-09 cs.LG New submission

Beyond Convolution: Advancing Hypergraph Neural Networks with Hypergraph U-Nets

Fuli Wang, Wei Qian, Daniel L. Lau, Gonzalo R. Arce

Institutions: Institute for Financial Services Analytics, University of Delaware; Department of Applied Economics and Statistics, University of Delaware; Department of Electrical and Computer Engineering, University of Kentucky; Department of Electrical and Computer Engineering, University of Delaware

AI summary: Proposes parallel hierarchical pooling and unpooling operators and builds the first hypergraph U-Net architecture, outperforming existing methods on classification, reconstruction, and anomaly detection.

Abstract

Convolutions have successfully transitioned from image processing to non-Euclidean higher-order domains, particularly hypergraphs. Despite this success, the popular U-Net architecture remains largely unexplored for hypergraph data because well-defined pooling and unpooling operations are lacking. This work pioneers the study of U-Net architectures for hypergraph data, addressing the critical challenge of designing effective pooling and unpooling operations that retain maximal structural information from the input hypergraph. Motivated by hierarchical clustering, we propose to construct the pooling and unpooling operators all at once by cutting the clustering dendrogram at different granularities; we call these the Parallel Hierarchical Pooling (PHPool) and Unpooling (PHUnpool) operators. Unlike existing pooling methods that risk local structural damage through a sequential learning procedure, our PHPool operators are designed in a global and parallel manner to ensure fidelity to the original hypergraph structure with efficient computation, while the PHUnpool operators are tailored to perform the inverse operations of the PHPool operators for hypergraph reconstruction. We validate our model through hypergraph reconstruction simulation, hypergraph classification, and node-level anomaly detection, where it demonstrates superior performance over existing state-of-the-art graph and hypergraph deep learning methods.

2606.09046 2026-06-09 cs.LG cs.CL cs.IR New submission

Decoy-Calibrated Failure Audits for Language Models

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh

Institutions: Meta Platforms

AI summary: Proposes Janus, a procedure that uses decoy calibration and held-out validation to judge whether an explanation of language model errors is credible, guarding against selection bias.

Comments
14 pages, 5 figures, 4 tables

Abstract

Useful audits reveal not only how often a model fails, but also where its failures concentrate. An auditor may test many candidate explanations: long inputs, indirect questions, distracting evidence, or combinations of these factors. The risk is selection. The largest observed effect may reflect a real failure mode, or it may simply be the best result among many tried. We introduce Janus, a procedure for deciding when a proposed error explanation is credible enough to report. The goal is not to generate new explanations, but to decide which ones hold up. The auditor starts with a fixed model, a labeled evaluation set, and a frozen list of candidate explanations, which we call descriptors. Janus scores each descriptor by its error-rate lift, then compares real descriptors with fake ones that have the same frequencies but are randomly assigned to examples. A descriptor is confirmed only if it beats this decoy floor on the data used for discovery and then repeats on separate held-out data. In a controlled audit of multi-table lookup tasks, Janus identifies the planted failure, confirming long-chain descriptors and their interactions. The LLM often stops partway through the lookup chain instead of reaching the final answer. On two public benchmarks, MuSiQue and LongBench v2, the SliceLine baseline flags plausible high-error pockets, but Janus confirms none of them. Ablations show why both safeguards matter. On LongBench v2, an uncalibrated fixed threshold reports 20 descriptors, the decoy floor leaves one, and the holdout check rejects the last one after its lift shrinks from 0.36 to 0.05. The resulting principle separates proposing explanations from reporting them. Candidates may come from any source, but only those that beat decoys and replicate on fresh data become audit findings.
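The two safeguards described above (a decoy floor on discovery data, then a repeat on held-out data) can be sketched as follows. This is not the Janus implementation: the lift definition (slice error rate minus overall error rate), the 95th-percentile floor, the rule that a descriptor must also clear a decoy floor on the holdout split, and all names are assumptions made for the sketch.

```python
import random

def lift(errors, mask):
    """Error rate inside the slice minus the overall error rate (assumed definition)."""
    inside = [e for e, m in zip(errors, mask) if m]
    if not inside:
        return 0.0
    return sum(inside) / len(inside) - sum(errors) / len(errors)

def decoy_floor(errors, masks, n_decoys, rng, quantile=0.95):
    """Quantile of the best lift achieved by shuffled descriptors with the same frequencies."""
    best = []
    for _ in range(n_decoys):
        round_best = float("-inf")
        for mask in masks:
            fake = mask[:]
            rng.shuffle(fake)              # same frequency, random assignment to examples
            round_best = max(round_best, lift(errors, fake))
        best.append(round_best)
    best.sort()
    return best[min(len(best) - 1, int(quantile * len(best)))]

def confirm(discovery, holdout, n_decoys=200, seed=0):
    """Each split is (errors, {name: mask}); return descriptors that pass on both splits."""
    rng = random.Random(seed)
    d_err, d_masks = discovery
    h_err, h_masks = holdout
    d_floor = decoy_floor(d_err, list(d_masks.values()), n_decoys, rng)
    h_floor = decoy_floor(h_err, list(h_masks.values()), n_decoys, rng)
    return [name for name in d_masks
            if lift(d_err, d_masks[name]) > d_floor
            and lift(h_err, h_masks[name]) > h_floor]

# Toy data: errors concentrate in the first 50 examples ("long"), not in "every_4th".
errors = [1] * 40 + [0] * 160
masks = {"long": [True] * 50 + [False] * 150,
         "every_4th": [i % 4 == 0 for i in range(200)]}
print(confirm((errors, masks), (errors, masks)))
```

Taking the maximum over all shuffled descriptors in each round is what accounts for selection: the floor reflects the best result one would expect from trying the whole frozen list on noise, not from a single test.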

2606.09043 2026-06-09 cs.LG cs.CL New submission

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du

Institutions: The Chinese University of Hong Kong, Shenzhen; New Jersey Institute of Technology; Institute of Computing Technology, CAS

AI summary: Proposes the DynaCF framework, which measures margin shifts and preference flips under counterfactual perturbations online and dynamically downweights shortcut-sensitive samples, mitigating shortcut learning in reward models.

Abstract

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
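The reweighting idea can be sketched on top of the standard Bradley-Terry loss. Only the Bradley-Terry term is standard; how DynaCF combines margin shifts and preference flips into a sensitivity score, and how that score maps to a weight, is not stated in the abstract, so the sum-of-terms score, the exponential weight, `alpha`, and the function names below are hypothetical.

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

def sensitivity(margin: float, cf_margins) -> float:
    """Mean absolute margin shift plus preference-flip rate under perturbations (assumed form)."""
    shift = sum(abs(margin - m) for m in cf_margins) / len(cf_margins)
    flips = sum((m > 0) != (margin > 0) for m in cf_margins) / len(cf_margins)
    return shift + flips

def weighted_bt_loss(r_chosen, r_rejected, cf_margins, alpha=1.0):
    """Downweight a sample whose margin moves under semantics-preserving perturbations."""
    weight = math.exp(-alpha * sensitivity(r_chosen - r_rejected, cf_margins))
    return weight * bt_loss(r_chosen, r_rejected)

# Stable pair: the margin is unchanged after perturbation, so the weight is 1.
print(weighted_bt_loss(1.0, 0.0, cf_margins=[1.0, 1.0]))
# Shortcut-sensitive pair: the preference flips after perturbation, so it counts less.
print(weighted_bt_loss(1.0, 0.0, cf_margins=[-1.0, -1.0]))
```

Because the counterfactual margins are recomputed under the current model, the weights would change as training progresses, which is the "dynamic" aspect the abstract contrasts with static shortcut heuristics.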

2606.09038 2026-06-09 cs.AI New submission

Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

Yanyan Luo, Xue Han, Ruiqiao Bai, Xin Huang, Yitong Wang, Qian Hu, Qing Wang, Chunxu Zhao, Jie Liu, Cong Geng, Lehao Xing, Pengwei Hu, Junlan Feng

Institutions: China Mobile Jiutian Artificial Intelligence Technology (Beijing) Co., Ltd.; Chinese Academy of Sciences

AI summary: Presents the first safety-oriented survey of personalized LLMs, organized along three dimensions (user representation, personalization paradigm, and evaluation); proposes a unified taxonomy of safety risks and analyzes the vulnerabilities of each paradigm together with mitigation strategies.

Abstract

Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landscape in ways not systematically addressed by existing literature. Existing reviews typically focus either on personalization or safety, leaving their intersection largely unexplored. We present the first comprehensive, safety-aware review of personalized LLMs. We organize personalization along three dimensions (user representation, personalization paradigm, and evaluation) and introduce a unified taxonomy of safety risks. At the representation level, we analyze risks arising from diverse user representations. Across mainstream personalization paradigms, we delineate vulnerabilities inherent to prompting, retrieval augmentation, parameter fine-tuning, reinforcement learning, Mixture-of-Experts (MoE), pruning, agent frameworks, and multimodal personalization, and synthesize mitigation strategies across the model lifecycle. Beyond these fine-grained risks, we characterize paradigm-agnostic safety risks arising from personalized adaptation. We further summarize personalized datasets and evaluation methodologies. Through a case study of OpenClaw, we analyze deployment trends in personalized agent ecosystems. Our analysis reveals three structural inadequacies in existing research: safety is evaluated as user-invariant rather than relational, personalization techniques are analyzed in isolation rather than in composition, and evaluation frameworks cannot capture emergent long-term risks. By jointly examining personalized representations, personalization paradigms, safety risks, defenses, and evaluation methods, we provide a unified framework for developing safe personalized LLMs and highlight key directions for future research.

2606.09037 2026-06-09 cs.AI cs.MA New submission

A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach

Jinseong Han, Sunwoong Yang, Namwoo Kang

Institutions: Cho Chun Shik Graduate School of Mobility, KAIST; Department of Mechanical Engineering, Hanyang University; Narnia Labs

AI summary: Proposes an end-to-end automated IPMSM design optimization framework that pairs RAG-based structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline, balancing computational cost against prediction reliability and outperforming FEA-only and AI-only approaches under the same FEA budget.

Comments
26 pages, 21 figures

Abstract

Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.
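The uncertainty-driven switching step can be sketched as a simple routing rule. This is a simplification under stated assumptions: the callables `surrogate` and `fea`, the fixed threshold, and the per-call budget counter are hypothetical, and the sketch omits the abstract's additional rule that Pareto-front or top-K candidates are also corrected by FEA.

```python
def evaluate(candidates, surrogate, fea, threshold, budget):
    """Route each candidate to the surrogate or to high-fidelity FEA.

    surrogate(x) -> (prediction, uncertainty); fea(x) -> high-fidelity value.
    Returns (results, new_training_data, remaining_budget).
    """
    results, retrain = {}, []
    for x in candidates:
        pred, unc = surrogate(x)
        if unc > threshold and budget > 0:
            value = fea(x)                 # expensive but trusted evaluation
            budget -= 1
            retrain.append((x, value))     # kept for the next surrogate retraining round
            results[x] = value
        else:
            results[x] = pred              # cheap surrogate inference
    return results, retrain, budget

# Toy stand-ins: uncertainty grows with distance from the origin.
toy_surrogate = lambda x: (2.0 * x, abs(x))
toy_fea = lambda x: 2.0 * x + 1.0

print(evaluate([0.1, 5.0], toy_surrogate, toy_fea, threshold=1.0, budget=1))
```

Spending the FEA budget only where the surrogate is unsure is what lets the hybrid avoid both failure cases named in the abstract: early budget exhaustion for FEA-only search and low-confidence optima for AI-only search.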

2606.09033 2026-06-09 cs.CV cs.CL New submission

CRANE: Knowledge Editing for Reasoning MLLMs

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang

Institutions: University of Chinese Academy of Sciences; New Laboratory of Pattern Recognition (NLPR), CASIA; Harbin Institute of Technology; Shandong University

AI summary: Targets three failure modes that arise when editing knowledge in reasoning MLLMs: structural collapse, cognitive dissonance, and shallow internalization. Proposes CRANE, a retrieval-augmented framework that needs no per-edit parameter modification and achieves high success rates through a modality-aware dual-library retrieval system and a two-phase training strategy.

Comments
10 pages, 5 figures

Abstract

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

2606.09032 2026-06-09 cs.CL 新提交

Bridging the Agent-World Gap: Text World Models for LLM-based Agents

弥合智能体-世界鸿沟:面向基于LLM的智能体的文本世界模型

Yixia Li, Hongru Wang, Peng Lai, Zhiwen Ruan, He Zhu, Youxin Zhu, Ganlong Zhao, Minda Hu, Yun Chen, Sibei Yang, Peng Li, Jeff Z. Pan, Jia Pan, Guanhua Chen, Yang Liu, Guanbin Li

发表机构 * Southern University of Science and Technology(南方科技大学) University of Edinburgh(爱丁堡大学) Peking University(北京大学) Sun Yat-sen University(中山大学) The Chinese University of Hong Kong(香港中文大学) Shanghai University of Finance and Economics(上海财经大学) Tsinghua University(清华大学) The University of Hong Kong(香港大学)

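AI summary: This paper systematically surveys text world models for LLM-based agents. Organized around a formal framework and the agent lifecycle, it covers foundational definitions, construction paradigms, applications (training-time experience synthesis and inference-time planning, verification, and adaptation), and evaluation, aiming to consolidate the field and clarify its design space and open challenges.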

详情
Comments
Code: https://github.com/sustech-nlp/awesome-text-world-models
AI中文摘要

基于大型语言模型(LLM)的智能体越来越多地用于交互式文本环境,从网页导航、代码编辑到工具使用和长时对话。然而,许多智能体仍然主要是反应式的,将观察映射到动作,而没有对这些环境如何构建和演变的显式模型。这激发了文本世界模型(TWMs):文本状态上的转移模型,给定状态和候选动作,预测结果网页、终端输出、API响应或用户回复,从而支持规划、高效学习和原则性评估。我们系统综述了面向基于LLM的智能体的文本世界模型,围绕形式化框架和智能体生命周期组织:(1)基础,定义文本世界模型并通过状态表示和基础领域对其进行表征;(2)构建,对LLM作为世界模型和代码作为世界模型范式进行分类,并回顾构建方法;(3)应用,考察世界模型如何通过经验合成在训练时以及通过规划、验证和适应在推理时支持智能体;(4)评估,涵盖世界模型本身的评估及其作为智能体评估环境的使用。我们旨在巩固这一快速发展领域,阐明其设计空间,并强调未来研究的开放挑战。

英文摘要

Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting webpage, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, taxonomizing LLM-as-WM and code-as-WM paradigms and reviewing methods for building them; (3) Application, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; and (4) Evaluation, covering both evaluation of the world model itself and its use as an evaluation environment for agents. We aim to consolidate this rapidly developing area, clarify its design space, and highlight open challenges for future research.

2606.09030 2026-06-09 cs.LG cs.AI cs.CL 新提交

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

TRIAGE: 基于辩证推理的不规则采样医学时间序列风险可解释预测方法

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

发表机构 * KAIST(韩国科学技术院) AITRICS University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

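AI summary: Proposes TRIAGE, a framework that uses a large language model to generate dialectical reasoning over competing clinical outcomes, mitigating risk polarization and yielding continuous risk scores with interpretable rationales; on three benchmarks it improves AUPRC by 3.3% and reduces calibration error by 81%.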

详情
Comments
Code is available at https://github.com/HyeongWon-Jang/TRIAGE
AI中文摘要

基于电子健康记录的临床早期预警系统,其中临床观察记录为不规则采样的医学时间序列(ISMTS),必须提供校准的风险评分用于患者分诊,以及临床医生可验证的可解释理由。大语言模型(LLMs)已被探索用于此任务,但它们将分级临床风险崩溃为过度自信的二元预测。这种风险极化损害了校准性和跨患者可比性。为解决此问题,我们提出TRIAGE框架,该框架训练LLM通过引出特定结果的理由,对竞争性临床结果生成辩证推理。这种辩证公式减轻了风险极化,使单个LLM能够产生基于明确临床推理的连续风险评分。在三个ISMTS基准上评估,TRIAGE相比竞争基线实现了平均AUPRC提升3.3%,校准误差降低81%。LLM作为评判者的评估进一步表明,我们的理由在临床推理质量上比基线的后验解释高出20%。源代码可在https://github.com/HyeongWon-Jang/TRIAGE获取。

英文摘要

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared with competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE.
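
To make the link between risk polarization and calibration concrete, the sketch below computes a binned expected calibration error (ECE) for binary risk scores. The abstract does not say which calibration metric the authors use, so ECE, the bin count, and the toy scores and labels are all assumptions chosen for illustration.

```python
def expected_calibration_error(probs, labels, num_bins=10):
    """Binned ECE for binary outcomes: weighted gap between mean score and event rate."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * num_bins), num_bins - 1)  # p == 1.0 falls into the last bin
        bins[index].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(p for p, _ in bucket) / len(bucket)
        event_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_score - event_rate)
    return ece


# Toy labels: half of the first group and a quarter of the second group have the event.
labels = [1, 1, 0, 0, 0, 0, 0, 1]
polarized_scores = [0.99] * 4 + [0.01] * 4  # near-binary, overconfident
graded_scores = [0.50] * 4 + [0.25] * 4     # matches the observed event rates

polarized_ece = expected_calibration_error(polarized_scores, labels)
graded_ece = expected_calibration_error(graded_scores, labels)
```

Both score sets rank the two groups the same way, but only the graded scores agree with the observed event rates, which is why their calibration error is lower in this toy case.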

2606.09029 2026-06-09 cs.CV 新提交

Frequency Decoupled Framework for Screen Content Image Super-Resolution

面向屏幕内容图像超分辨率的频率解耦框架

Xufei Wang, Qicheng Zhang, Qi Wu, Ziyang Gu, Shizhuang Weng

发表机构 * Anhui University(安徽大学)

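AI summary: Proposes a frequency decoupled framework (FDF) that uses amplitude-phase decomposition and bespoke implicit representations to jointly exploit periodic patterns and coherent context for screen content image super-resolution, achieving state-of-the-art performance on multiple datasets.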

详情
Comments
13 pages, 11 figures
AI中文摘要

基于隐式神经表示的方法在屏幕内容图像超分辨率(SCISR)中表现出优越性能。然而,它们忽略了固有的频率特性,导致性能次优。我们提出一种频率解耦框架(FDF),从相量角度重新思考SCISR,通过捕获振幅中的结构化能量和相位中的关系连续性,并利用定制的隐式表示联合利用它们,以忠实恢复屏幕内容图像(SCI)的规则纹理和全局配置。振幅-相位分解网络(APFN)首先将图像分离为振幅和相位流,其中振幅聚类模块(ACM)将稀疏但高能量的振幅响应组织成代表性原型以提取周期模式,而相位一致性自注意力(PCSA)通过连续一致性传播逐步增强配置。振荡-非谐隐式拟合网络(OAIF-Net)集成周期性和连贯隐式表示,以有效利用SCI中嵌入的周期模式和连贯上下文。实验结果表明,FDF在四个公共SCI数据集上的多个尺度上实现了最先进的SCISR性能。消融实验进一步证明了每个组件在提取和利用周期模式与连贯上下文方面的有效性。

英文摘要

Methods based on implicit neural representations have demonstrated superior performance in Screen Content Image Super-Resolution (SCISR). However, they overlook the inherent frequency characteristics, leading to suboptimal performance. We propose a frequency decoupled framework (FDF) that rethinks SCISR from a phasor perspective: it captures structured energy in amplitude and relational continuity in phase, and jointly exploits them with bespoke implicit representations to faithfully recover the regular textures and global configuration of screen content images (SCIs). The Amplitude-Phase Factorization Network (APFN) first separates images into amplitude and phase streams: the Amplitude Clustering Module (ACM) organizes sparse yet high-energy amplitude responses into representative prototypes for periodic pattern extraction, while Phase Consistency Self-Attention (PCSA) progressively reinforces configuration through continuous consistency propagation. The Oscillation-Anharmonic Implicit Fitting Network (OAIF-Net) then integrates periodic and coherent implicit representations to efficiently exploit the periodic patterns and coherent context embedded in SCIs. Experimental results show that FDF achieves state-of-the-art SCISR performance at multiple scales across four public SCI datasets. Ablation experiments further demonstrate the effectiveness of each component in extracting and exploiting periodic patterns and coherent context.
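
The amplitude/phase split that the framework builds on can be illustrated with a one-dimensional discrete Fourier transform. This is only a sketch of the underlying decomposition under simplifying assumptions (a 1-D toy signal and a naive transform); it does not reproduce APFN, ACM, PCSA, or OAIF-Net, and all function names are hypothetical.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse transform; keeps the real part, assuming a real-valued signal."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(spectrum)).real / n
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Factor each complex coefficient into magnitude and angle."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def merge_amplitude_phase(amplitude, phase):
    """Recombine magnitude and angle into complex coefficients."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


# A periodic toy signal: one impulse every four samples.
signal = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
amplitude, phase = split_amplitude_phase(dft(signal))
reconstructed = inverse_dft(merge_amplitude_phase(amplitude, phase))
```

For this periodic input the amplitude is concentrated on a few frequencies (the even indices) and vanishes elsewhere, a small-scale analogue of the "sparse yet high-energy amplitude responses" the abstract refers to, while amplitude and phase together recover the original signal.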

2606.09028 2026-06-09 cs.CV cs.AI cs.RO 新提交

ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models

ATM:用于诊断和改进潜在世界模型的动作一致性转移矩阵

Jiaheng Chen

发表机构 * School of Software, Northeastern University(东北大学软件学院)

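AI summary: Proposes the ATM matrix, which uses lightweight probes to compare action information in real and predicted latent transitions, diagnosing world-model quality without a simulator, and introduces AITS, which uses action identifiability as a training signal to improve downstream planning.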

详情
Comments
13 pages, 3 figures, 6 tables
AI中文摘要

潜在世界模型越来越多地用于控制和目标条件规划,但评估其学习到的表示是否对规划有用通常需要与CEM等规划器耦合的慢速模拟器评估。这种评估是黑盒且依赖于模型复杂度的:在相同协议下,不同世界模型每个检查点可能需要几分钟到几小时。在这项工作中,我们提出了ATM,一个动作一致性转移矩阵,用于诊断潜在转移是否保留了与规划相关的动作语义。ATM通过轻量级事后探针比较真实编码转移和模型预测转移中的动作信息,生成一个可解释的矩阵,揭示表示质量、转移域不一致性和失败模式,而无需模拟器 rollout。它还可以折叠成一个简单的筛选分数,用于跨检查点、变体和世界模型的内部任务排名。当真实成功差距显著时,ATM实现了高度可靠的成对排名,同时将分钟到小时的CEM评估减少到秒级的转移分析,在我们的设置中实现了超过100倍的加速。我们进一步引入了AITS,表明动作可识别性不仅具有诊断作用,而且是一种有用的训练信号,可以在不改变规划器的情况下改进下游规划。

英文摘要

Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations are useful for planning usually requires slow, planner-coupled simulator evaluation with CEM or similar planners. Such evaluation is black-box and model-complexity-dependent: under the same protocol, different world models may require minutes to hours per checkpoint. In this work, we propose ATM, an Action-Consistency Transfer Matrix for diagnosing whether latent transitions preserve action semantics relevant to planning. ATM compares action information in real encoded transitions and model-predicted transitions through lightweight post-hoc probes, producing an interpretable matrix that reveals representation quality, transition-domain inconsistency, and failure modes without simulator rollout. It can also be collapsed into a simple screening score for within-task ranking across checkpoints, variants, and world models. When the true success gap is non-trivial, ATM achieves highly reliable pairwise ranking, while reducing minutes-to-hours CEM evaluation to seconds-level transition analysis, yielding more than 100x speedup in our setup. We further introduce AITS, showing that action-identifiability is not only diagnostic but also a useful training signal for improving downstream planning without changing the planner.

2606.09019 2026-06-09 cs.SD cs.AI 新提交

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

TLDR:压缩音频令牌以实现高效自回归文本到语音

Yejin Lee, Junwon Moon, Hyoeun Kim, Hyunjin Choi, Heeseung Kim, Kyuhong Shim

发表机构 * Sungkyunkwan University(成均馆大学) University of Seoul(首尔市立大学)

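AI summary: Proposes the TLDR framework, which shifts causal modeling from the token level to the patch level using a lightweight compressor and a frozen pretrained backbone adapted with LoRA, achieving a 1.8x inference speedup and up to 75% KV-cache reduction.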

详情
AI中文摘要

基于编解码器的自回归(AR)语音语言模型通过将语音建模为离散音频令牌序列,并使用大型预训练骨干网络,实现了强大的文本到语音(TTS)质量。然而,这种令牌级公式造成了结构效率瓶颈:语音令牌序列比文本序列长得多,要求AR骨干在每个令牌位置执行因果计算,并维护随序列长度增长的KV缓存。我们引入TLDR,一种基于补丁的自回归框架,通过将因果建模从令牌级语音序列转移到补丁级序列,加速基于编解码器的AR-TTS。TLDR使用轻量级压缩器将连续的编解码器令牌分组为紧凑的潜在补丁,使用通过LoRA适配的冻结预训练AR-TTS骨干对生成的较短补丁序列进行建模,并使用说话人条件提取器在每个补丁内重建细粒度语音令牌。在补丁大小为4的情况下,TLDR比基线AR-TTS模型实现了1.8倍的推理加速,并将全局KV缓存内存减少了高达75%。实验结果表明,补丁级全局因果建模可以成为降低预训练基于编解码器的AR-TTS系统推理成本的一种实用方法,而无需替换现有模块。

英文摘要

Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
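
The reported cache saving is consistent with simple position counting: if the global backbone caches one entry per patch instead of one per token, a patch size of 4 leaves roughly a quarter of the positions. The sketch below shows that arithmetic only; it assumes cache size scales linearly with the number of positions and ignores the compressor and extractor, and the function names are hypothetical.

```python
def cached_positions(num_tokens, patch_size):
    """Positions the global backbone caches when tokens are grouped into patches."""
    # Ceiling division: a trailing partial patch still occupies one position.
    return -(-num_tokens // patch_size)


def kv_cache_reduction(num_tokens, patch_size):
    """Fractional reduction in cached positions relative to token-level modeling."""
    baseline = cached_positions(num_tokens, 1)
    patched = cached_positions(num_tokens, patch_size)
    return 1.0 - patched / baseline


reduction_even = kv_cache_reduction(1000, 4)    # sequence length divisible by 4
reduction_ragged = kv_cache_reduction(1001, 4)  # one partial patch at the end
```

When the sequence length is not a multiple of the patch size, the saving falls slightly below 75%, which matches the "up to" wording in the abstract.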

2606.09013 2026-06-09 cs.CL 新提交

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

超越平均值:在分布层面评估LLM对人类调查的复现能力

Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang

发表机构 * Ewha Womans University(梨花女子大学)

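AI summary: Using a non-public Korean instant noodle purchase experiment, this study evaluates at the distributional level how well LLMs replicate human survey responses. It finds that models that match human means can still produce distributions further from the human data, and that structured personas and multimodal inputs improve alignment while reasoning prompts degrade it.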

详情
AI中文摘要

LLM越来越多地被用于模拟人类调查响应,但先前的工作主要使用均值层面或总体一致性来评估复现能力,对LLM是否复现人类行为的变异性提供的见解有限。我们使用一个非公开的2010年韩国方便面购买消费者选择实验,在分布层面评估基于LLM的调查复现,该设置不太可能与模型训练数据重叠。我们评估了三种不同统计类型的响应变量:二元购买发生、分类品牌选择和计数购买数量。对于每种变量,我们在均值层面、模式和分布一致性上比较人类和LLM响应,并参考仅来自人类数据的基线。LLM在复现条件层面模式上表现合理,但未能捕捉分布结构:对于购买数量,没有模型能击败一个简单的条件不敏感基线(该基线仅匹配合并的人类分布)。因为均值匹配人类良好的模型仍可能产生比该基线更远离人类的分布,仅基于均值的评估可能具有误导性。复现能力也随输入配置而变化,结构化角色和多模态输入改善一致性,而显式推理提示则单调地降低一致性。

英文摘要

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
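
The gap between mean-level and distributional agreement is easy to reproduce with toy data. In the sketch below, two sets of purchase quantities share the same mean but have very different shapes. The numbers are invented for illustration (they are not from the study), and total variation distance is an assumed choice of metric, since the abstract does not name the one used.

```python
from collections import Counter
from statistics import mean


def empirical_distribution(samples):
    """Relative frequency of each observed value."""
    counts = Counter(samples)
    return {value: count / len(samples) for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


# Invented purchase quantities: many zeros and a few large values versus a narrow cluster.
human = [0, 0, 0, 0, 0, 1, 2, 3, 4, 4]
simulated = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2]

gap_in_means = abs(mean(human) - mean(simulated))
distance = total_variation(empirical_distribution(human), empirical_distribution(simulated))
```

Here the means coincide while the distributions overlap on only a small fraction of their mass, so an evaluation based on means alone would report perfect agreement.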

2606.09012 2026-06-09 cs.LG cs.AI math.OC stat.ML 新提交

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

理解量化感知训练:量化权重的梯度偏向低损失盆地

Hanyang Li, Jianhao Ma, Ying Cui

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Pennsylvania(宾夕法尼亚大学)

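AI summary: Proposes a unified geometric framework that explains both the failure of post-training quantization and the recovery mechanism of quantization-aware training, showing that quantization-aware training returns quantized points to the low-loss basin because its gradients sense the valley wall.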

详情
Comments
31 pages, 10 figures
AI中文摘要

后训练量化(PTQ)将训练好的全精度模型转换为低比特权重,无需任务级重训练,而量化感知训练(QAT)将量化纳入训练循环。尽管PTQ在中等比特宽度下高效且通常准确,但在激进比特宽度下可能急剧失败;QAT成本更高但通常能恢复丢失的精度。我们提出了一个统一的几何框架,同时解释PTQ失败和QAT恢复。我们将全精度训练建模为在更宽的\emph{山谷}内沿着低损失\emph{河流}:河流的法向邻域形成近乎平坦的\emph{盆地},而离开该盆地会导致损失急剧增加。当量化网格与盆地宽度相当时,局部PTQ目标(包括舍入和基于Hessian的二阶重建)可能选择盆地外的高损失部署量化点,即使附近存在低损失量化点。在这种情况下,基于直通估计器的QAT具有有用的偏差:它在部署的量化权重处评估梯度,同时更新潜在的全精度权重,导致梯度感知谷壁并获得向内分量,从而将后续量化迭代引导回盆地。我们通过局部景观模型形式化这一机制,构造了几何PTQ失败模式,并在局部量化器兼容性假设下证明了有限时间QAT恢复。在多种神经网络量化方案下的视觉和语言模型实验,证实了预测的PTQ跨盆地失败以及相应的QAT恢复机制。

英文摘要

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss river inside a wider valley: a normal neighborhood of the river forms a nearly flat basin, while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.
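
The straight-through-estimator update mentioned above (gradient evaluated at the quantized weight, step applied to the latent weight) can be shown on a single scalar. This toy sketch assumes a quadratic loss whose minimum sits on the quantization grid and a nearest-point quantizer; it illustrates the update rule only and does not model the river/valley geometry analyzed in the paper.

```python
import math


def quantize(w, step=1.0):
    """Round to the nearest grid point (ties round up)."""
    return math.floor(w / step + 0.5) * step


def loss_gradient(w):
    """Gradient of the toy loss L(w) = w**2."""
    return 2.0 * w


def ste_qat(w_latent, learning_rate=0.1, steps=10):
    """Run straight-through-estimator updates on one latent weight."""
    history = []
    for _ in range(steps):
        w_quantized = quantize(w_latent)
        history.append(w_quantized)
        # The gradient is taken at the deployed (quantized) weight,
        # but the update is applied to the latent full-precision weight.
        w_latent -= learning_rate * loss_gradient(w_quantized)
    return w_latent, history


final_latent, quantized_history = ste_qat(2.4)
```

Starting from a latent weight whose quantized value is two grid steps from the minimum, the deployed weight moves down the grid until it reaches the minimizer; once there, the gradient at the quantized point is zero and the latent weight stops moving.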

2606.09004 2026-06-09 cs.AI 新提交

LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)

LATTEArena: 基于LLM的表格特征工程评估框架(扩展版)

Ankai Hao, Ke Chen, Huan Li, Lidan Shou

发表机构 * Zhejiang University(浙江大学)

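AI summary: Proposes LATTEArena, the first standardized evaluation framework for LLM-powered tabular feature engineering. Through a six-dimensional taxonomy that decomposes 15 methods, a modular arena, and component-level ablations, it reports 16 key findings, including that Tree-of-Thought combined with MCTS is the most cost-effective.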

详情
Comments
30 pages, 9 figures
AI中文摘要

特征工程对于表格数据分析仍然至关重要,大型语言模型(LLM)已成为自动化这一过程的有前景的范式,催生了基于LLM的自动化表格特征工程(LATTE)。然而,缺乏标准化平台阻碍了公平、成本感知的比较。此外,复杂的方法设计掩盖了单个组件的具体贡献;例如,尽管LFG集成了思维树、少样本演示、蒙特卡洛树搜索和自然语言生成,但每种技术的竞争优点的孤立影响仍未量化。为解决这些挑战,我们引入了LATTEArena,这是首个竞争性评估框架,具有以下特点:(1)六维分类法,将15种代表性方法分解为可重用组件;(2)标准化模块化竞技场,用于受控比较;(3)涵盖性能、成本和鲁棒性的多维评估;(4)组件级消融,量化每种技术的竞争优点。通过广泛评估,我们揭示了16项关键发现,包括:(1)思维树与蒙特卡洛树搜索实现了最佳成本效益;(2)RPN和代码输出格式分别主导分类和回归任务。我们公开发布了模块化框架和超过4000条执行日志,使研究人员能够将新技术与现有技术无缝对比,推动LATTE发展。

英文摘要

Feature engineering remains essential for tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for automating this process, giving rise to LLM-powered AuTomated Tabular feature Engineering (LATTE). However, the absence of standardized platforms prevents fair, cost-aware comparisons. Furthermore, complex methodological designs obscure the specific contributions of individual components; for example, although LFG integrates Tree-of-Thought, few-shot demonstrations, Monte Carlo Tree Search, and natural language generation, the isolated impact of each technique's competitive edge remains unquantified. To address these challenges, we introduce LATTEArena, the first competitive evaluation framework featuring: (1) a six-dimensional taxonomy decomposing 15 representative methods into reusable components; (2) a standardized modular arena for controlled comparison; (3) multi-dimensional assessments covering performance, cost, and robustness; and (4) component-level ablation quantifying each technique's competitive edge. Through extensive evaluations, we reveal 16 key findings, including: (1) Tree-of-Thought with Monte Carlo Tree Search achieves optimal cost-effectiveness; (2) RPN and Code output formats dominate classification and regression tasks, respectively. We publicly release the modular framework and over 4000 execution logs, enabling researchers to seamlessly pit new techniques against existing ones and advance LATTE.

2606.08998 2026-06-09 cs.AI cs.CY econ.GN q-fin.EC 新提交

The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs

未被选取的令牌:采样、状态与AI智能体输出的变异性

Muhammad Zia Hydari, Raja Iqbal

发表机构 * University of Pittsburgh(匹兹堡大学) Ejento.ai

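AI summary: This paper analyzes the sources of output variability in agentic AI systems, distinguishing the intrinsic randomness of token sampling from extrinsic factors such as environment and data, and discusses when variability is reproducible under matched conditions and why deterministic execution need not produce identical behavior in deployment.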

详情
AI中文摘要

智能体AI系统在不同运行中可能表现出不同的行为:相同的请求可能产生不同的计划、不同的工具调用、不同的代码编辑或不同的最终答案。这种变异性源于多个常被混淆的层面。基础模型是一个大型预训练模型,通常可适应许多下游任务,将输入上下文映射到输出的预测。在当前许多智能体中,该模型嵌入在一个编排循环中,该循环进行规划、调用工具、观察结果并更新状态。此类系统中一个明确的内在变异性来源是令牌生成:模型计算可能的下一个令牌的分数,分数被转换为概率,解码器可能使用伪随机数生成器采样令牌。一个微小的采样令牌差异随后可能向上传播为不同的工具调用、代码路径、搜索查询或智能体状态。其他变异性来源是令牌采样的外在因素,包括变化的环境、实时数据、服务基础设施、批次效应和数值细节。通过分离这些层面,本文阐明了将智能体AI系统称为随机系统的含义、在匹配条件下这种变异性何时可复现,以及为什么确定性执行在部署环境中不一定意味着相同的行为。

英文摘要

Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. A foundation model is a large pretrained model, usually adaptable to many downstream tasks, that maps an input context to predictions over outputs. In many current agents, that model is embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, the manuscript clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.
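
The token-generation step described above (scores, then probabilities, then a pseudo-random draw) can be sketched in a few lines. This is a generic illustration using only the Python standard library, with made-up scores and a fixed score vector at every step; real decoders condition each step on the tokens generated so far.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(scores, rng, temperature=1.0):
    """Draw one token index according to the softmax probabilities."""
    probs = softmax(scores, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(scores):
    """Deterministic alternative: always pick the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])


def generate(scores, seed, steps=20):
    """Sample a short sequence with a seeded pseudo-random number generator."""
    rng = random.Random(seed)
    return [sample_token(scores, rng) for _ in range(steps)]


scores = [2.0, 1.0, 0.1]
run_a = generate(scores, seed=0)
run_b = generate(scores, seed=0)
```

With the same seed and the same inputs the two runs are identical, which is the "reproducible under matched conditions" case; this sketch does not capture the extrinsic sources the abstract lists, such as serving infrastructure or numerical details.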

2606.08994 2026-06-09 cs.CL 新提交

Language-Aware Token Boosting: LLM Language Confusion Reduction Without Tuning

语言感知令牌增强:无需微调的大语言模型语言混淆减少

Trapoom Ukarapol, Pakhapoom Sarapat, Nut Chukamphaeng

发表机构 * SCB DataX Tsinghua University(清华大学) SCBX

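AI summary: Proposes a tuning-free approach to reducing language confusion: Language-Aware Token Boosting (LATB) and its adaptive variant (Adaptive-LATB) apply perturbations to target-language tokens, improving multilingual alignment while maintaining summarization quality.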

详情
Comments
ACL2026 Main Conference
AI中文摘要

大型语言模型(LLMs)在生成非英语文本时有时会出现语言混淆。现有方法通常依赖微调来缓解此问题。相比之下,我们提出了一种无需微调的语言混淆减少范式。在该范式中,我们引入了两种方法:语言感知令牌增强(LATB),它对与目标语言相关的令牌施加有针对性的扰动;以及自适应语言感知令牌增强(Adaptive-LATB),它根据模型对目标语言的置信度动态调整这些扰动。实验表明,我们的方法通过减少语言混淆有效提升了多语言对齐,同时在不需额外微调的情况下保持了摘要质量。我们的代码已公开。https://github.com/scbdatax/genai-datax-language-aware-token-boosting

英文摘要

Large language models (LLMs) sometimes exhibit language confusion when generating non-English text. Existing approaches typically rely on fine-tuning to mitigate this issue. In contrast, we propose a tuning-free paradigm for reducing language confusion. Within this paradigm, we introduce two methods: Language-Aware Token Boosting (LATB), which applies targeted perturbations to tokens associated with the desired language, and Adaptive Language-Aware Token Boosting (Adaptive-LATB), which dynamically adjusts these perturbations based on the model's confidence in the intended language. Experiments demonstrate that our methods effectively improve multilingual alignment by reducing language confusion, while maintaining summarization quality without requiring any additional fine-tuning. Our code is publicly available at https://github.com/scbdatax/genai-datax-language-aware-token-boosting.
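
One plausible reading of "targeted perturbations to tokens associated with the desired language" is an additive bias on the logits of target-language tokens at decoding time. The sketch below follows that reading. It is not the authors' implementation: the additive form, the confidence-scaled adaptive rule, and every name and number here are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert logits into probabilities."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_mass(logits, target_ids):
    """Probability mass the model places on target-language tokens."""
    return sum(p for i, p in enumerate(softmax(logits)) if i in target_ids)


def boost_logits(logits, target_ids, boost):
    """Hypothetical fixed boost: add a constant to target-language logits."""
    return [x + boost if i in target_ids else x for i, x in enumerate(logits)]


def adaptive_boost_logits(logits, target_ids, max_boost):
    """Hypothetical adaptive boost: shrink the boost as confidence in the target language grows."""
    confidence = target_mass(logits, target_ids)
    return boost_logits(logits, target_ids, max_boost * (1.0 - confidence))


# Toy vocabulary of four tokens; indices 1 and 3 belong to the target language.
logits = [2.0, 1.0, 0.5, 0.0]
target_ids = {1, 3}
boosted = boost_logits(logits, target_ids, boost=3.0)
adapted = adaptive_boost_logits(logits, target_ids, max_boost=3.0)
```

Because the bias is applied only at decoding time, no model weights change, which is the sense in which such a scheme would be tuning-free.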

2606.08993 2026-06-09 cs.LG cs.SY eess.SY math.OC 新提交

LEAF: A Learning-Enabled ADMM Framework for Accelerated Convex Optimization

LEAF: 一种用于加速凸优化的学习增强ADMM框架

Binh Nguyen, Trinh Tran, Truong X. Nghiem

发表机构 * University of Central Florida(中佛罗里达大学)

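AI summary: Proposes the LEAF framework, which accelerates convex optimization by learning the Moreau envelope with an input convex neural network, reducing model complexity while preserving convergence; experiments show up to an order-of-magnitude speedup over state-of-the-art solvers.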

详情
AI中文摘要

我们提出LEAF,一种用于加速凸优化的学习增强ADMM框架。关键思想是使用输入凸神经网络(ICNN)逼近目标函数的Moreau包络,从而得到一个保持凸性和光滑性的学习模型。这导致了所提出的Moreau包络学习ADMM(MEL-ADMM)及其分裂变体sMEL-ADMM。与直接学习高维算子的现有方法不同,LEAF学习标量值的Moreau包络,显著降低了模型复杂度并提高了数据效率。该框架适用于包括光滑和非光滑目标在内的广泛凸问题。通过ICNN架构显式嵌入凸性,所提出的方法在保持优化问题关键结构性质的同时保持了高逼近精度。MEL-ADMM和sMEL-ADMM都在学习模型下具有收敛性和可行性的理论保证。严格分析表明,所提出的方法实现了与经典ADMM相当的收敛速度,同时降低了每次迭代的计算成本。数值实验表明,与最先进的求解器相比,速度提升可达一个数量级,同时保持较低的最优性差距。

英文摘要

We propose LEAF, a learning-enabled ADMM framework for accelerated convex optimization. The key idea is to approximate the Moreau envelope of the objective function using an Input Convex Neural Network (ICNN), resulting in a learned model that preserves convexity and smoothness. This leads to the proposed Moreau Envelope Learning ADMM (MEL-ADMM) and its splitting variant sMEL-ADMM. Unlike existing approaches that learn high-dimensional operators directly, LEAF learns a scalar-valued Moreau envelope, significantly reducing model complexity and improving data efficiency. The framework accommodates a broad class of convex problems with smooth and non-smooth objectives. By embedding convexity explicitly through the ICNN architecture, the proposed approach maintains high approximation accuracy while preserving key structural properties of the optimization problem. Both MEL-ADMM and sMEL-ADMM are developed with theoretical guarantees of convergence and feasibility under the learned model. Rigorous analysis shows that the proposed methods achieve convergence rates comparable to classical ADMM while reducing per-iteration computational cost. Numerical experiments demonstrate up to an order-of-magnitude speedup over state-of-the-art solvers while maintaining low optimality gaps.
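
For readers unfamiliar with the Moreau envelope, a standard textbook case shows why it is a convenient scalar-valued learning target: the envelope of the non-smooth function |x| with parameter lam, defined as the minimum over y of |y| + (x - y)^2 / (2 * lam), is the smooth Huber function. The sketch below checks this numerically using the closed-form proximal operator (soft-thresholding). It illustrates the definition only and involves no neural network; the function names are hypothetical.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |.|: the minimizer of |y| + (x - y)**2 / (2 * lam)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_envelope_abs(x, lam):
    """Moreau envelope of |.| evaluated through its proximal point."""
    y = soft_threshold(x, lam)
    return abs(y) + (x - y) ** 2 / (2.0 * lam)


def huber(x, lam):
    """Huber function: quadratic near zero, linear in the tails."""
    if abs(x) <= lam:
        return x * x / (2.0 * lam)
    return abs(x) - lam / 2.0


points = [-3.0, -1.0, -0.2, 0.0, 0.5, 1.0, 2.5]
max_gap = max(abs(moreau_envelope_abs(x, 1.0) - huber(x, 1.0)) for x in points)
```

The envelope stays convex and becomes differentiable at the origin even though |x| is not, which matches the abstract's point that the learned model should preserve convexity and smoothness.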

2606.08992 2026-06-09 cs.RO cs.AI cs.CV 新提交

SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning

SpaceVLN:具有在线空间认知记忆与推理的零样本视觉与语言导航智能体

Yucheng Deng, Pingrui Lai, Xinhai Li, Chenjia Bai, Xiaoheng Deng, Chengnuo Sun, Xuelong Li, Hua Yang

发表机构 * Shanghai Jiao Tong University(上海交通大学) China Telecom(中国电信) Central South University(中南大学) Jiangsu University(江苏大学)

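AI summary: Proposes SpaceVLN, which uses spatial cognitive memory and task-guided spatial reasoning to perform vision-and-language navigation in continuous environments under a zero-shot setting, achieving state-of-the-art zero-shot performance on multiple benchmarks.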

详情
Comments
23 pages, 9 figures, 7 tables
AI中文摘要

连续环境中的视觉与语言导航要求智能体理解未见环境的空间结构以遵循语言指令。尽管基础模型为无需任务特定策略训练的零样本导航开辟了有希望的路径,但许多导航器仍依赖局部视觉线索和基于线性历史的推理,忽视了探索区域、穿越路径、地标及其空间关系的空间本质。本文提出SpaceVLN,一种围绕空间认知记忆和任务引导的空间推理构建的导航智能体。具体而言,SpaceVLN引入了一个高效的分阶段闭环框架,其中规划和执行围绕可验证的空间-地标阶段组织。导航过程中,智能体逐步将探索区域抽象为空间航点,并动态维护子任务基础的地标证据,形成层次化的空间认知记忆以进行进度定位和空间关系理解。基于此记忆,Spatial-CoT将任务进度推理与空间感知、分析和预测相结合,实现任务引导的空间推理以用于具身导航。统一阶段接口使SpaceVLN能够在统一的零样本设置下处理视觉与语言导航和目标导向导航,无需任务特定策略训练。在R2R-CE、RxR-CE、GN-Bench和HM3D-OVON上,SpaceVLN实现了最先进的零样本性能,真实机器人部署进一步验证了其适用性。这些结果突显了空间认知记忆和任务引导的空间推理作为更强具身导航智能体的实用基础。

英文摘要

Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environments in order to follow language instructions. Although foundation models have opened a promising path toward zero-shot navigation without task-specific policy training, many navigators still rely on local visual cues and linear history-based reasoning, overlooking the spatial nature of navigation across explored regions, traversed paths, landmarks, and their spatial relations. In this paper, we propose SpaceVLN, a navigation agent built around Spatial Cognitive Memory and Task-Guided Spatial Reasoning. Specifically, SpaceVLN introduces an efficient stagewise closed-loop framework where planning and execution are organized around verifiable space-landmark stages. During navigation, the agent progressively abstracts explored regions into Spatial Waypoints and dynamically maintains subtask-grounded landmark evidence, forming a hierarchical Spatial Cognitive Memory for progress localization and spatial-relation understanding. Built on this memory, Spatial-CoT integrates task-progress reasoning with spatial perception, analysis, and prediction, enabling Task-Guided Spatial Reasoning for embodied navigation. The unified stage interface enables SpaceVLN to address both Vision-and-Language Navigation and Object-Goal Navigation under a unified zero-shot setting, without task-specific policy training. Across R2R-CE, RxR-CE, GN-Bench, and HM3D-OVON, SpaceVLN achieves state-of-the-art zero-shot performance, and real-robot deployment further validates its applicability. These results highlight Spatial Cognitive Memory and Task-Guided Spatial Reasoning as a practical foundation for stronger embodied navigation agents.

2606.08988 2026-06-09 cs.CL cs.LG 新提交

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

选择题的结构感知建模改进自动难度估计

Gabriel Ortega, Abelino Jiménez, Séverin Lions, Pablo Dartnell

发表机构 * Centro de Investigación Avanzada en Educación (CIAE), Instituto de Estudios Avanzados en Educación (IE), Universidad de Chile(智利大学高级教育研究中心(CIAE),高级教育研究所(IE)) Departamento de Evaluación, Medición y Registro Educacional (DEMRE), Universidad de Chile(智利大学评估、测量与教育注册系(DEMRE)) Centro de Modelamiento Matemático (CMM), Universidad de Chile(智利大学数学建模中心(CMM)) Departamento de Ingeniería Matemática (DIM), Universidad de Chile(智利大学数学工程系(DIM))

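AI summary: Proposes structure-aware models that encode each distractor of a multiple-choice question as a separate input and aggregate them in an order-aware or order-invariant way to improve difficulty prediction, reaching R² = 0.83 and 0.71 on natural science and social science datasets.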

详情
Comments
30 pages, 1 table, 2 figures
AI中文摘要

自动题目难度估计(AQDE)在教育评估中日益重要,因为它有潜力产生与专家判断相竞争的难度估计,同时有助于减少与试点管理相关的时间和财务负担,并扩展到数字测试环境。先前的AQDE研究报告了关于将干扰项作为附加文本添加到题干和正确答案中是否能一致改进难度预测的混合证据。我们假设干扰项信息的有效性取决于其结构表示,并且明确将干扰项建模为独立组件可以改进忽略此信息的基线的难度估计。为此,我们设计了受控架构,将选择题组件建模为不同输入,以隔离干扰项内容和顺序的贡献。具体来说,我们通过将每个干扰项编码为独立的文本输入,并通过顺序感知的拼接(带位置标签)或顺序不变的求和来聚合其表示,从而表示干扰项。我们使用两个智利数据集(自然科学和社会科学,2016-2020年;4114道选择题)评估了这些架构。与仅使用题干和正确答案的简单模型相比,我们最佳的结构感知架构实现了更高的预测性能,自然科学题目的R²=0.83,社会科学题目的R²=0.71。一个顺序不变的变体以大约一半的参数达到了几乎相同的准确率,提供了有利的准确率-效率权衡。这些结果表明,结构信息(尤其是干扰项内容)驱动了预测准确性的提升,支持开发计算上可行的大规模教育应用的高效结构感知模型。

英文摘要

Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the correct key consistently improves difficulty prediction. We hypothesize that the effectiveness of distractor information depends on its structural representation, and that explicitly modeling distractors as separate components improves difficulty estimation over baselines that omit this information. To address this, we designed controlled architectures that model MCQ components as distinct inputs to isolate the contribution of distractor content and order. Specifically, we represented distractors by encoding each distractor as its own text input and aggregating their representations either with order-aware concatenation (with positional tags) or with an order-invariant summation. We evaluated these architectures using two Chilean datasets (Natural and Social Sciences, 2016-2020; 4,114 multiple-choice questions). Compared to a simpler model that only used the question stem and the key, our best distractor-aware architecture achieved higher predictive performance, reaching R^2 = 0.83 for Natural Sciences and R^2 = 0.71 for Social Sciences items. An order-invariant variant achieved nearly the same accuracy with approximately half as many parameters, offering a favorable accuracy-efficiency trade-off. These results show that structural information (especially distractor content) drives gains in predictive accuracy, supporting the development of efficient, structure-aware models that are computationally viable for large-scale educational applications.
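
The difference between the two aggregation schemes can be seen on toy vectors standing in for distractor encodings. The sketch below is an illustration under simplifying assumptions: the integer vectors are invented, the text encoder is omitted, and positional tags are modeled only by the order of concatenation.

```python
from itertools import permutations


def aggregate_concat(distractor_vectors):
    """Order-aware aggregation: place each vector in the slot of its position."""
    return [x for vector in distractor_vectors for x in vector]


def aggregate_sum(distractor_vectors):
    """Order-invariant aggregation: element-wise sum across distractors."""
    return [sum(column) for column in zip(*distractor_vectors)]


# Three invented 3-dimensional distractor encodings.
distractors = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]

concatenated = aggregate_concat(distractors)
summed = aggregate_sum(distractors)
sum_is_invariant = all(
    aggregate_sum(list(order)) == summed for order in permutations(distractors)
)
```

Summation yields the same vector for every ordering and keeps the input width fixed, whereas concatenation changes with the order and grows with the number of distractors. Whether that width difference accounts for the parameter saving reported in the abstract is not stated there.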

2606.08985 2026-06-09 cs.LG 新提交

Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic

超越神经坍缩:任务内在几何决定模算术中的神经表示

Hu Tan, Kuo Gai, Shihua Zhang

发表机构 * Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院) School of Mathematical Sciences, University of Chinese Academy of Sciences(中国科学院大学数学科学学院) Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS)(上海数学与交叉学科研究院) Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(浙江省系统健康科学重点实验室,中国科学院大学杭州高等研究院生命科学学院)

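AI summary: This paper finds that, on modular addition, network representations form a two-dimensional cyclic geometry rather than the simplex equiangular tight frame predicted by neural collapse, and explains the phenomenon through layerwise non-uniform training, phase-alignment dynamics after subspace locking, and a complexity-advantage analysis.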

Details
Abstract

While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.
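
To make the cyclic rank-2 geometry concrete, the following sketch checks that single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ placed on the unit circle suffice to classify modular addition. The modulus `P = 7`, the use of frequency 1, and the way the two token embeddings are combined (complex multiplication) are assumptions made for illustration; the abstract does not describe the network that produces the pair feature.

```python
import cmath
import math

P = 7  # hypothetical modulus, chosen only for the example


def character(k: int) -> complex:
    """Single-frequency character of Z/PZ: the point at angle 2*pi*k/P
    on the unit circle, written as a complex number."""
    return cmath.exp(2j * math.pi * k / P)


def predict(a: int, b: int) -> int:
    """Classify a + b mod P using only two real dimensions.

    Assumption: the pair feature is the product of the two token
    characters, which equals the character of (a + b) mod P."""
    feature = character(a) * character(b)
    # The logit for class c is the inner product of the class weight
    # (also a point on the circle) with the feature: cos(2*pi*(a+b-c)/P).
    logits = [(character(c).conjugate() * feature).real for c in range(P)]
    return max(range(P), key=logits.__getitem__)


all_correct = all(
    predict(a, b) == (a + b) % P for a in range(P) for b in range(P)
)
```

Each logit equals $\cos(2\pi(a+b-c)/P)$, which is maximized exactly when $c \equiv a+b \pmod P$, so `P` classes are separated by weights spanning a plane rather than the $(K-1)$ dimensions a simplex ETF would use.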

2606.08978 2026-06-09 cs.LG New submission

Heterophily-Aware Adaptive Knowledge Distillation for Hypergraph Neural Networks

Joohee Cho, David Yoon Suk Kang, Yunyong Ko

Affiliations: Chung-Ang University; Chungbuk National University

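AI summary: To address the performance drop of hypergraph neural networks on heterophilic nodes, the paper proposes HADES, a heterophily-aware adaptive distillation method that quantifies node heterophily to modulate the transfer of teacher knowledge, so that student models in many cases outperform their teachers while running up to 12.3 times faster at inference.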

Details
Comments
5 pages, 2 figures, 4 tables
Abstract

Hypergraph knowledge distillation aims to retain the predictive performance of a hypergraph neural network (HNN) teacher while reducing inference costs through a lightweight student model. In this work, we observe that HNNs exhibit substantially lower prediction performance on heterophilic nodes connected through semantically diverse hyperedges, indicating that the reliability of teacher knowledge varies across nodes. Motivated by this observation, we propose HADES, a heterophily-aware adaptive distillation method for hypergraph neural networks. HADES quantifies node heterophily and leverages it as an estimate of teacher reliability to modulate the transfer of teacher knowledge during distillation. Experimental results on real-world hypergraphs demonstrate that HADES consistently improves student performance across different HNN teachers and distillation objectives. In many cases, the resulting student models surpass the predictive performance of their teachers while achieving up to 12.3 times faster inference.
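
The idea of using node heterophily to modulate distillation can be sketched as follows. This is not the HADES method, whose details the abstract does not give. The heterophily score (mean fraction of co-hyperedge members with a different label), the weight `1 - heterophily`, and all function names are hypothetical choices made for illustration, and the sketch uses ground-truth labels for simplicity.

```python
import math
from typing import List, Sequence


def node_heterophily(
    hyperedges: Sequence[Sequence[int]], labels: Sequence[int]
) -> List[float]:
    """Assumed score: for each node, average over its hyperedges the
    fraction of other members whose label differs from its own."""
    per_node: List[List[float]] = [[] for _ in labels]
    for edge in hyperedges:
        for v in edge:
            others = [u for u in edge if u != v]
            if others:
                differing = sum(1 for u in others if labels[u] != labels[v])
                per_node[v].append(differing / len(others))
    return [sum(s) / len(s) if s else 0.0 for s in per_node]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions; assumes q has no zeros."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def weighted_kd_loss(teacher, student, heterophily) -> float:
    """Distillation loss in which each node's teacher term is scaled by
    (1 - heterophily), so less reliable teacher outputs count less."""
    terms = [
        (1.0 - h) * kl_divergence(t, s)
        for t, s, h in zip(teacher, student, heterophily)
    ]
    return sum(terms) / len(terms)


# Toy hypergraph: 4 nodes, 3 hyperedges, 2 classes (all values made up).
hyperedges = [[0, 1], [1, 2, 3], [2, 3]]
labels = [0, 0, 1, 1]
heterophily = node_heterophily(hyperedges, labels)

teacher = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
student = [[0.8, 0.2], [0.5, 0.5], [0.3, 0.7], [0.4, 0.6]]
loss = weighted_kd_loss(teacher, student, heterophily)
```

In this toy graph node 1 sits in one hyperedge with a same-label neighbour and one with two different-label neighbours, so it receives the highest score and its teacher signal is down-weighted the most.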

2606.08977 2026-06-09 cs.LG cs.DS New submission

Online Learning with Recency: Algorithms for Sliding-window Streaming Multi-armed Bandits

Vladimir Braverman, Chen Wang, Liudeng Wang, Samson Zhou

Affiliations: Johns Hopkins University; Rensselaer Polytechnic Institute; Texas A&M University

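AI summary: Motivated by the recency effect in online learning, the paper studies single-pass sliding-window streaming multi-armed bandits, proposes algorithms for pure exploration and regret minimization, and establishes memory-regret trade-offs.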

Details
Comments
ICML 2026
Abstract

Motivated by the recency effect in online learning, we study algorithms for single-pass *sliding-window streaming multi-armed bandits (MABs)*. In this setting, we are given $n$ arms with unknown sub-Gaussian reward distributions and a parameter $W$. The arms arrive in a single-pass stream, and only the most recent $W$ arms are considered valid. The algorithm must perform pure exploration and regret minimization with limited memory, defined as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window) that has been extensively studied in recent years. We provide a comprehensive analysis of both the pure exploration and regret minimization problems in this model. For pure exploration, we prove that finding the best arm is hard with sublinear memory, while finding an approximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithm. We complement our theoretical results with experiments that demonstrate the trade-offs among samples, regret, and memory.
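
The setting can be illustrated with a naive baseline that stores every valid arm, that is, memory equal to $W$. This is not one of the paper's algorithms, which target limited memory. The Gaussian rewards, the fixed number of pulls per arriving arm, and all names below are assumptions made for the sketch.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple


def sliding_window_best_arm(
    means: Sequence[float],
    window: int,
    pulls: int,
    noise_sd: float = 0.1,
    seed: int = 0,
) -> List[int]:
    """After each arrival, report the index of the empirically best arm
    among the most recent `window` arms.

    Linear-memory baseline: every valid arm is stored, so the memory
    used is `window` arms."""
    rng = random.Random(seed)
    # deque(maxlen=window) evicts the oldest arm once it leaves the window.
    memory: Deque[Tuple[int, float]] = deque(maxlen=window)
    answers: List[int] = []
    for index, mu in enumerate(means):
        # Pull the arriving arm `pulls` times and keep its empirical mean.
        estimate = sum(rng.gauss(mu, noise_sd) for _ in range(pulls)) / pulls
        memory.append((index, estimate))
        best_index = max(memory, key=lambda item: item[1])[0]
        answers.append(best_index)
    return answers


# Hypothetical stream of 5 arms with well-separated means, window W = 3.
answers = sliding_window_best_arm(
    means=[0.1, 0.9, 0.2, 0.3, 0.5], window=3, pulls=200
)
```

Arm 1 stays the reported answer while it is inside the window and is dropped once three newer arms have arrived, at which point the answer moves to arm 4. Matching this behaviour while storing far fewer than $W$ arms is the regime the abstract's hardness and approximation results address.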