arXivDaily: arXiv daily academic digest, updated Monday through Friday
2504.18424 2026-06-10 cs.CV Version update

LaRI: Layered Ray Intersections for Single-view 3D Geometric Reasoning

Rui Li, Biao Zhang, Zhenyu Li, Federico Tombari, Peter Wonka

Institutions: ETH Zurich; Adobe Research

AI summary: Proposes LaRI, which uses layered point maps to predict where each camera ray intersects multiple surfaces, enabling complete scene reconstruction in a single feed-forward pass for both object-level and scene-level tasks.

Details
Journal ref
ICML 2026
Comments
Project page: https://ruili3.github.io/lari
Abstract

We present Layered Ray Intersections (LaRI), a fully supervised method for occluded geometry reasoning from a single image. Unlike conventional depth estimation, which is limited to visible surfaces, LaRI predicts multiple surfaces intersected by the camera rays using layered point maps. Compared to the existing approaches that leverage neural implicit representations or iterative refinement, LaRI achieves complete scene reconstruction in one feed-forward pass, enabling efficient and view-aligned geometric reasoning to underpin both object-level and scene-level tasks. We further propose to predict the ray stopping index, which identifies valid intersecting pixels and layers from LaRI's output. To better underpin and evaluate this task, we build an annotation pipeline using rendering engines and construct annotations for five public datasets, including synthetic and real-world data covering 3D objects and scenes. As a generic method, LaRI's performance is validated in object-level and scene-level reconstruction tasks.
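
To make the role of the ray stopping index more concrete, the following is a minimal sketch of how a per-pixel index could be used to filter a layered point map. The array layout `(H, W, L, 3)`, the function name `valid_points`, and the convention that the index counts the valid layers along each ray are all assumptions made for illustration; the abstract does not specify LaRI's actual output encoding.

```python
import numpy as np

def valid_points(layered_points: np.ndarray, stop_index: np.ndarray) -> np.ndarray:
    """Keep only the intersections that lie before each ray's stopping index.

    layered_points: (H, W, L, 3) array, one 3D point per pixel and layer (assumed layout).
    stop_index:     (H, W) integer array in [0, L]; assumed to be the number of
                    valid layers along each camera ray (0 means no intersection).
    Returns an (N, 3) array of the valid 3D points.
    """
    num_layers = layered_points.shape[2]
    layer_ids = np.arange(num_layers).reshape(1, 1, num_layers)
    # A layer is valid for a pixel when its id is below that pixel's stopping index.
    mask = layer_ids < stop_index[..., None]  # shape (H, W, L)
    return layered_points[mask]


# Toy example: a 2x2 image with 3 layers per ray.
points = np.zeros((2, 2, 3, 3))
stops = np.array([[0, 1], [2, 3]])
cloud = valid_points(points, stops)
```

Under this hypothetical convention, a single boolean mask turns the dense layered prediction into a point cloud without any per-pixel loop.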

2504.03118 2026-06-10 cs.CV cs.AI Version update

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Hai Jin, Yun Yang

Institutions: National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab; Swinburne University of Technology; Deakin University

AI summary: Targets edge devices that only need to recognize specific classes. NuWa prunes class-detrimental weights through self-knowledge purification and derives compact ViTs efficiently via closed-form optimization, improving class-specific accuracy and accelerating inference without retraining.

Details
Comments
Accepted at CVPR 2026
Abstract

Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69x pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss. Project Page: https://github.com/CGCL-codes/NuWa.

2503.20272 2026-06-10 stat.ML cs.LG Version update

An $(ε,δ)$-accurate level set estimation with a stopping criterion

Hideaki Ishibashi, Kota Matsui, Kentaro Kutsukake, Hideitsu Hino

Institutions: Kyushu Institute of Technology; Nagoya University / RIKEN AIP; The Institute of Statistical Mathematics / RIKEN AIP

AI summary: Proposes an acquisition strategy for level set estimation with a stopping criterion, proves that it attains $\epsilon$-accuracy at confidence level $1-\delta$, reduces unnecessary function evaluations, and validates it experimentally.

Details
Abstract

The level set estimation problem seeks to identify regions within a set of candidate points where the value of an unknown, costly-to-evaluate function exceeds a specified threshold, providing an efficient alternative to exhaustive evaluations of function values. Traditional methods often use sequential optimization strategies to find $ε$-accurate solutions, which permit a margin around the threshold contour but frequently lack effective stopping criteria, leading to excessive exploration and inefficiencies. This paper introduces an acquisition strategy for level set estimation that incorporates a stopping criterion, ensuring the algorithm halts when further exploration is unlikely to yield improvements, thereby reducing unnecessary function evaluations. We theoretically prove that our method satisfies $ε$-accuracy with a confidence level of $1 - δ$, addressing a key gap in existing approaches. Furthermore, we show that this also leads to guarantees on the lower bounds of performance metrics such as F-score. Numerical experiments demonstrate that the proposed acquisition function achieves comparable precision to existing methods while confirming that the stopping criterion effectively terminates the algorithm once adequate exploration is completed.
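
As a rough illustration of what an $ε$-margin around the threshold means in practice, the sketch below applies a generic confidence-interval classification rule of the kind commonly used in level set estimation. It is not the acquisition function or stopping criterion proposed in the paper: the function names, the `beta` interval width, and the rule "stop when no candidate is undecided" are assumptions chosen to keep the example small.

```python
import numpy as np

def classify_points(mu, sigma, threshold, eps, beta):
    """Label candidates from posterior means and standard deviations.

    Returns +1 (above the level), -1 (below the level), or 0 (undecided).
    The interval [mu - beta*sigma, mu + beta*sigma] is an assumed confidence
    interval; eps is the margin tolerated around the threshold.
    """
    lower = mu - beta * sigma
    upper = mu + beta * sigma
    above = lower > threshold - eps
    below = upper < threshold + eps
    # "above" takes precedence when a narrow interval satisfies both tests.
    return np.where(above, 1, np.where(below, -1, 0))

def all_classified(labels) -> bool:
    """Naive stopping rule (assumption): halt once nothing is undecided."""
    return bool(np.all(labels != 0))


mu = np.array([2.0, 0.0, 1.0])
sigma = np.array([0.1, 0.1, 1.0])
labels = classify_points(mu, sigma, threshold=1.0, eps=0.1, beta=2.0)
```

In this toy run the third candidate sits on the threshold with a wide interval, so it stays undecided and the naive rule would keep sampling. Shrinking its uncertainty is what eventually lets such a rule terminate.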

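2502.09928 2026-06-10 cs.CV cs.AI Version update

Deep Tree Tensor Networks

Chang Nie

Institutions: Nanjing University of Science and Technology

AI summary: Proposes the Deep Tree Tensor Network (DTTN), which captures exponential-order feature interactions through multilinear operations and outperforms existing methods on multiple benchmarks.

Details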
Abstract

Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image recognition. When employed, they primarily serve to compress parameters within pre-existing networks, thereby losing their distinctive capability to capture exponential-order feature interactions. This paper introduces a novel architecture named Deep Tree Tensor Network (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a tree-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interaction modules (AIMs), and this design facilitates efficient implementation. Furthermore, our theoretical analysis demonstrates the equivalence between quantum-inspired TN models and polynomial/multilinear networks under specific conditions. We posit that the DTTN could catalyze more interpretable research within this field. The proposed model is evaluated across multiple benchmarks and domains, demonstrating superior performance compared to both peer methods and state-of-the-art architectures. Our code is publicly available at https://github.com/NieCha/deep_tree_tensor_network.
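
The $2^L$-order claim can be checked on a toy model: if each layer multiplies two linear projections of its input elementwise, the polynomial degree in the original features doubles per layer. The sketch below uses a generic bias-free multiplicative layer, which is an assumption made for illustration and is not the paper's antisymmetric interaction module. The function name `multiplicative_stack` and the weight shapes are likewise hypothetical.

```python
import numpy as np

def multiplicative_stack(x, weights):
    """Apply L bias-free multiplicative layers: h <- (A @ h) * (B @ h).

    Each layer squares the polynomial degree of h in the input x,
    so after L layers the output is homogeneous of degree 2**L in x.
    """
    h = x
    for A, B in weights:
        h = (A @ h) * (B @ h)
    return h


rng = np.random.default_rng(0)
dim, depth = 4, 3
weights = [(rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim)))
           for _ in range(depth)]
x = rng.normal(size=dim)

y = multiplicative_stack(x, weights)
y_scaled = multiplicative_stack(2.0 * x, weights)
# Homogeneity check: doubling the input scales the output by 2**(2**depth).
ratio_ok = np.allclose(y_scaled, (2.0 ** (2 ** depth)) * y)
```

The homogeneity check is a cheap way to confirm the interaction order empirically: with three layers, doubling the input multiplies every output coordinate by $2^8 = 256$.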

2407.20242 2026-06-10 cs.CY cs.AI cs.RO Version update

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

Institutions: Huazhong University of Science and Technology; Beihang University; Griffith University

AI summary: Proposes the BadRobot attack paradigm, which exploits three vulnerabilities (manipulation of LLMs within robotic systems, misalignment between linguistic outputs and physical actions, and flaws in world knowledge) to make embodied LLMs perform harmful actions through voice interaction, and validates its effectiveness on a benchmark.

Details
Journal ref
International Conference on Learning Representations (ICLR) 2025
Comments
Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io
Abstract

Embodied AI represents systems where AI is integrated into physical entities. Large language models (LLMs), which exhibit powerful language understanding abilities, have been extensively employed in embodied AI by facilitating sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm aiming to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by flaws in world knowledge. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.

2501.14717 2026-06-10 cs.CL Version update

What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects

Naihao Deng, Sheng Zhang, Henghui Zhu, Shuaichen Chang, Jiani Zhang, Alexander Hanbo Li, Chung-Wei Hang, Hideo Kobayashi, Yiqun Hu, Patrick Ng

Institutions: University of Michigan; AWS AI Labs; Figma; OKX; Google

AI summary: Instruction-tunes 12 models and evaluates them on 16 benchmarks, finding that the choice of base model affects performance more than the training data, and that generalization and reasoning remain challenging.

Details
Comments
EACL 2026 Findings
Abstract

Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diverse base models and training sets in the context of table instruction tuning. We replicate four table LLMs by instruction-tuning three foundation models on four existing datasets, yielding 12 models. We then evaluate these models across 16 table benchmarks. Our study is the first to quantitatively disentangle the effects of training data and base model selection, revealing that base model choice plays a more dominant role than the training data itself. Generalization and reasoning remain challenging, inviting future effort on table modeling. Based on our findings, we share our thoughts on the future directions for table modeling.

2501.01481 2026-06-10 eess.IV cs.CV Version update

Unleashing Correlation and Continuity for Hyperspectral Reconstruction from RGB Images

Fuxiang Feng, Runmin Cong, Shoushui Wei, Yipeng Zhang, Jun Li, Sam Kwong, Wei Zhang

Institutions: School of Control Science and Engineering, Shandong University; Key Laboratory of Machine Intelligence and System Control, Ministry of Education; University of California, Los Angeles; Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences; Lingnan University

AI summary: Proposes the Correlation and Continuity Network (CCNet), which combines local spectral correlation modeling (GrSCM), global spectral continuity modeling (NeSCM), and adaptive fusion (PAF) to achieve state-of-the-art RGB-to-hyperspectral reconstruction.

Details
Abstract

Reconstructing Hyperspectral Images (HSI) from RGB images can yield high spatial resolution HSI at a lower cost, demonstrating significant application potential. This paper reveals that local correlation and global continuity of the spectral characteristics are crucial for HSI reconstruction tasks. Therefore, we fully explore these inter-spectral relationships and propose a Correlation and Continuity Network (CCNet) for HSI reconstruction from RGB images. For the correlation of local spectrum, we introduce the Group-wise Spectral Correlation Modeling (GrSCM) module, which efficiently establishes spectral band similarity within a localized range. For the continuity of global spectrum, we design the Neighborhood-wise Spectral Continuity Modeling (NeSCM) module, which employs memory units to recursively model the progressive variation characteristics at the global level. In order to explore the inherent complementarity of these two modules, we design the Patch-wise Adaptive Fusion (PAF) module to efficiently integrate global continuity features into the spectral features in a patch-wise adaptive manner. These innovations enhance the quality of reconstructed HSI. We perform comprehensive comparison and ablation experiments on the mainstream datasets NTIRE2022 and NTIRE2020 for the spectral reconstruction task. Compared to the current advanced spectral reconstruction algorithms, our designed algorithm achieves State-Of-The-Art (SOTA) performance.

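2412.11449 2026-06-10 cs.SD cs.AI cs.CL cs.LG eess.AS Version update

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

Prateek Verma

AI summary: Proposes Whisper-GPT, a generative large language model that combines continuous audio representations (such as spectrograms) with discrete audio tokens, addressing the excessive context length of discrete-token approaches and lowering perplexity and negative log-likelihood for next-token prediction on speech and music.

Details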
Comments
6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India
Abstract

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account for all the audio contents at various frequencies for the next token prediction. By combining continuous audio representation like the spectrogram and discrete acoustic tokens, we retain the best of both worlds: Have all the information needed from the audio at a specific time instance in a single token, yet allow LLM to predict the future token to allow for sampling and other benefits discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.

2411.02817 2026-06-10 cs.LG cs.AI cs.CV cs.IT math.IT Version update

Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs

Mohammad Jalali, Azim Ospanov, Amin Gohari, Farzan Farnia

Institutions: Department of Computer Science and Engineering, The Chinese University of Hong Kong; Department of Information Engineering, The Chinese University of Hong Kong

AI summary: For generative models guided by text prompts, proposes the Conditional-Vendi and Conditional-RKE scores, which use conditional entropy to isolate model-induced diversity, proves convergence, and recovers ground-truth diversity orderings across multiple tasks.

Details
Abstract

Generative models guided by text prompts are widely evaluated for fidelity and prompt alignment, yet their ability to produce diverse outputs remains underexplored. Existing diversity metrics such as Vendi and RKE, which are based on the von Neumann and Rényi entropies of kernel matrices, were developed for unconditional models and cannot distinguish prompt-induced from model-induced variability. We address this gap by introducing Conditional-Vendi and Conditional-RKE, diversity measures derived from the conditional entropy of positive semidefinite matrices. These scores isolate model-induced diversity in prompt-guided generation, with Conditional-RKE enjoying an $O(1/\sqrt{n})$ convergence rate. For Conditional-Vendi, we introduce a truncated-spectrum approximation that yields scalable and consistent estimates. Experiments on text-to-image, image-captioning, and LLM tasks show that the conditional scores recover ground-truth diversity orderings and can also guide diffusion models toward more diverse samples. The codebase is available at https://github.com/mjalali/conditional-vendi.
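
For readers unfamiliar with the unconditional scores that the abstract builds on, the sketch below computes the Vendi score (exponential of the von Neumann entropy of the normalized kernel matrix) and the RKE score (exponential of the order-2 Rényi entropy, i.e. the inverse squared Frobenius norm of the normalized kernel). It assumes a positive semidefinite kernel matrix with unit diagonal. The conditional variants and the truncated-spectrum approximation introduced in the paper are not reproduced here, and the function names are chosen for illustration only.

```python
import numpy as np

def vendi_score(K: np.ndarray) -> float:
    """Unconditional Vendi score of an n x n PSD kernel matrix with K[i, i] = 1."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop numerically zero (or slightly negative) eigenvalues: 0 * log 0 := 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))

def rke_score(K: np.ndarray) -> float:
    """Unconditional RKE score: 1 / ||K / n||_F^2 (order-2 Renyi entropy)."""
    n = K.shape[0]
    return float(1.0 / np.sum((K / n) ** 2))


# Two extremes: four mutually dissimilar samples vs. four identical samples.
distinct = np.eye(4)
identical = np.ones((4, 4))
scores = {
    "distinct": (vendi_score(distinct), rke_score(distinct)),
    "identical": (vendi_score(identical), rke_score(identical)),
}
```

Both scores behave like an effective number of distinct samples: they equal the sample count when all samples are mutually dissimilar and collapse to one when all samples coincide.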

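2409.04111 2026-06-10 cs.LG Version update

Active-Passive Federated Learning for Vertically Partitioned Multi-view Data

Jiyuan Liu, Siqi Wang, Xinhang Wan, Yi Zhang, Junsong Chen, Xin Lu, Xinwang Liu

Institutions: National University of Defense Technology

AI summary: Proposes an active-passive federated learning framework in which the active client builds the complete model independently and passive clients only assist training, addressing unreliable client collaboration at inference time; instantiates two classification methods using reconstruction and contrastive losses and validates their effectiveness.

Details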
Abstract

Vertical federated learning is a natural and elegant approach to integrating multi-view data vertically partitioned across devices (clients) while preserving their privacy. Apart from model training, existing methods require the collaboration of all clients during model inference. However, model inference is likely to be maintained as a service for a long time, while collaboration, especially when the clients belong to different organizations, is unpredictable in real-world scenarios (e.g., contract cancellation or network unavailability), which causes inference to fail. To address this issue, we make a first attempt at a flexible Active-Passive Federated learning (APFed) framework. Specifically, the active client is the initiator of a learning task and is responsible for building the complete model, while the passive clients only serve as assistants. Once the model is built, the active client can perform inference independently. In addition, we instantiate the APFed framework as two classification methods that employ the reconstruction loss and the contrastive loss on passive clients, respectively. The two methods are tested in a set of experiments and achieve the desired results, validating their effectiveness.

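2206.02178 2026-06-10 cs.AI cs.LG Version update

Belief Acquisition as Stochastic Filtering

Dawei Chen, John Lloyd, Samuel Yang-Zhao, Kee Siong Ng

Institutions: School of Computing, Australian National University

AI summary: Treats belief acquisition as a stochastic filtering problem and introduces factored conditional filters that simultaneously track states and estimate parameters in high-dimensional state spaces, with effectiveness shown in experiments such as epidemic tracking.

Details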
Comments
51 pages
Abstract

This paper studies how belief acquisition can be accomplished using stochastic filtering. First, a theoretical foundation for empirical beliefs is outlined. Then stochastic filtering in this context is studied. The paper introduces factored conditional filters, new filtering algorithms for simultaneously tracking states and estimating parameters in high-dimensional state spaces. The conditional nature of the algorithms is used to estimate parameters and the factored nature is used to decompose the state space into low-dimensional subspaces in such a way that filtering on these subspaces gives distributions whose product is a good approximation to the distribution on the entire state space. The conditions for successful application of the algorithms are that observations be available at the subspace level and that the transition schema can be factored into local transition schemas that are approximately confined to the subspaces; these conditions are widely satisfied in computer science, engineering, and geophysical filtering applications. Experimental results on tracking epidemics and estimating parameters in large contact networks show the effectiveness of the approach.

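2310.05264 2026-06-10 cs.LG cs.CV Version update

The Emergence of Reproducibility and Generalizability in Diffusion Models

Huijie Zhang, Jinfan Zhou, Yifu Lu, Minzhe Guo, Peng Wang, Liyue Shen, Qing Qu

AI summary: Finds that, given the same initial noise and a deterministic sampler, different diffusion models produce highly similar outputs, and that this reproducibility holds in both the memorization and generalization training regimes, with implications for training efficiency and model privacy.

Details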
Comments
NeurIPS Diffusion Model Workshop 2023 (best paper award), the Forty-first International Conference on Machine Learning (ICML 2024)
Abstract

In this work, we investigate an intriguing and prevalent phenomenon of diffusion models which we term as "consistent model reproducibility": given the same starting noise input and a deterministic sampler, different diffusion models often yield remarkably similar outputs. We confirm this phenomenon through comprehensive experiments, implying that different diffusion models consistently reach the same data distribution and scoring function regardless of diffusion model frameworks, model architectures, or training procedures. More strikingly, our further investigation implies that diffusion models are learning distinct distributions affected by the training data size. This is supported by the fact that the model reproducibility manifests in two distinct training regimes: (i) "memorization regime", where the diffusion model overfits to the training data distribution, and (ii) "generalization regime", where the model learns the underlying data distribution. Our study also finds that this valuable property generalizes to many variants of diffusion models, including those for conditional use, solving inverse problems, and model fine-tuning. Finally, our work raises numerous intriguing theoretical questions for future investigation and highlights practical implications regarding training efficiency, model privacy, and the controlled generation of diffusion models.

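2404.11716 2026-06-10 cs.AI Version update

A Survey on Semantic Modeling for Building Energy Management

Miracle Aniakor, Vinicius V. Cogo, Pedro M. Ferreira

Institutions: LASIGE, DI, Faculdade de Ciências, Universidade de Lisboa, Portugal

AI summary: Surveys semantic modeling for the building operational phase, analyzing 60 models and more than 20 use cases; introduces the Ontology Evidence Completeness measure; finds that physical structure is well covered while dynamic concepts are not; and identifies directions for improving interoperability and generalizability.

Details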
Comments
52 pages, 7 figures, 5 tables
Abstract

Building Energy Management (BEM) is central to reducing energy use and CO2 emissions in the building sector. Although IoT technologies now provide extensive operational data, heterogeneous data models, device descriptions, and contextual representations continue to limit semantic interoperability, hindering the development of generalisable, autonomous, context-aware BEM applications. Ontologies address this challenge by providing structured, machine-interpretable representations of building data, systems, and operational context. This survey examines semantic modelling for BEM during the building operational phase. It reviews 60 semantic models and analyses more than 20 ontology-based BEM use cases. It further quantifies Ontology Instantiation Rates (OIR) and missing concepts across those use cases. To support evidence-based assessment of ontology use, we introduce the notion of Ontology Evidence Completeness (OEC), a measure of whether studies explicitly map operational concepts to the ontology classes used to represent them. Findings show that current semantic models more consistently represent physical building structure, technical systems, sensing devices, and observable operational data than abstract and dynamic operational concepts. Concepts such as key performance indicators, assessments, services, control logic, optimisation tasks, and computational workflows remain less consistently covered. Applied BEM studies therefore frequently depend on ontology reuse, integration, specialisation, external inheritance, or application-specific extension to address coverage and interoperability gaps across BEM. By synthesising these patterns, this survey clarifies the capabilities of existing semantic models and identifies directions for more interoperable, generalisable, and context-aware BEM systems.

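2012.15621 2026-06-10 cs.CL Version update

Open Korean Corpora: A Practical Report

Won Ik Cho, Sangwhan Moon, Youngsook Song

Institutions: AI Center, Samsung Electronics; Google LLC; Lablup Inc.

AI summary: Curates and reviews existing open Korean corpora, covering institution-level resources and datasets for various tasks, and offers recommendations for building and releasing open-source datasets for low-resource languages.

Details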
Comments
Published (v1) in NLP-OSS @EMNLP2020; May 2023 (v2) added with new datasets; June 2026 (v3) added analyses
Abstract

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, it is also because the availability of resources is inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.

2203.03018 2026-06-10 cs.RO cs.SY eess.SY Version update

RAPTOR: Rapid Aerial Pickup and Transport of Objects by Robots

Aurel Appius, Erik Bauer, Marc Blöchlinger, Aashi Kalra, Robin Oberson, Arman Raayatsanati, Pascal Strauch, Sarath Suresh, Marco von Salis, Robert K. Katzschmann

Institutions: Soft Robotics Lab, ETH Zurich, Switzerland

AI summary: Proposes RAPTOR, a quadcopter platform combining a soft-material Fin Ray gripper with Fast DDS middleware, enabling flexible grasping of objects with different geometries during high-speed flight, with an 83% average grasp success rate and four times the payload of previous work.

Details
Comments
7 pages, 10 figures, accepted to IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022. Video: https://youtu.be/KHkBlBABsC8 Project page: https://srl-ethz.github.io/RAPTOR
Abstract

Rapid aerial grasping through robots can lead to many applications that utilize fast and dynamic picking and placing of objects. Rigid grippers traditionally used in aerial manipulators require high precision and specific object geometries for successful grasping. We propose RAPTOR, a quadcopter platform combined with a custom Fin Ray gripper to enable more flexible grasping of objects with different geometries, leveraging the properties of soft materials to increase the contact surface between the gripper and the objects. To reduce the communication latency, we present a new lightweight middleware solution based on Fast DDS (Data Distribution Service) as an alternative to ROS (Robot Operating System). We show that RAPTOR achieves an average of 83% grasping efficacy in a real-world setting for four different object geometries while moving at an average velocity of 1 m/s during grasping. In a high-velocity setting, RAPTOR supports up to four times the payload compared to previous works. Our results highlight the potential of aerial drones in automated warehouses and other manipulation applications where speed, swiftness, and robustness are essential while operating in hard-to-reach places.

2606.00097 2026-06-10 cs.RO cs.MA

RocketSmith: An Agentic System for High-Powered Rocket Design and Manufacturing

RocketSmith: 一种用于高功率火箭设计与制造的智能系统

Peter Pak, Jesse Barkley, Rumi Loghmani, Derek Baich, Ananya Pamal, Amir Barati Farimani

Affiliations: Graduate Research Assistant, Mechanical Engineering; AI Fellow, Mechanical Engineering; Undergraduate Student, Mechanical Engineering; Senior Member, Pittsburgh Prefecture One; Russell V. Trader Associate Professor, Mechanical Engineering

AI summary: This paper presents RocketSmith, an agent-based framework for automated design, manufacturing, and optimization that uses subagents and skills to optimize flight parameters in zero-shot and human-in-the-loop workflows; four high-powered rockets were developed with additive manufacturing and flight tested.

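Details
AI-generated abstract (translated from Chinese)

This paper introduces RocketSmith, an agentic system able to carry out the design, manufacturing, and optimization processes in high-powered rocket development. The system intelligently automates software tools so that it can both validate factors such as flight stability and generate parametric designs of rocket components. Through a set of subagents and skills, it iteratively optimizes flight parameters in zero-shot and human-in-the-loop workflows. Using this system together with the unique design capabilities of additive manufacturing, four high-powered rockets with different motor and component configurations were developed. The components were printed on various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. In these tests all rockets launched stably, and two were successfully recovered in reflyable condition. In the collected flight data, measured apogee agreed with the flight-simulation value to 84% accuracy.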

English abstract

This work presents RocketSmith, an agentic system capable of carrying out the design, manufacturing, and optimization processes in high-powered rocket development. The system enables the intelligent automation of software tools so as to not only validate factors such as flight stability but also generate the parametric design components for the rocket assembly. A collection of subagents and skills enables iterative optimization of flight parameters in both zero-shot and human-in-the-loop workflows. With this system, four distinct high-powered rockets with various motor and assembly configurations were developed utilizing the unique design capabilities of additive manufacturing. These assembly components were fabricated using various FDM printers, manually evaluated for flight readiness, and flight tested at a launch event. From these tests, all rockets achieved a stable launch and two of the four rockets were successfully recovered in reflyable condition. Within the collected flight data, an 84% accuracy was achieved when comparing measured apogee to that calculated in flight simulations.

2509.04154 2026-06-10 cs.LG cs.AI

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

Peter Racioppo

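Affiliations: arXiv.org

AI summary: Proposes Robust Filter Attention (RFA), which models self-attention as state estimation under a linear stochastic differential equation and, in language modeling, achieves lower perplexity than RoPE along with stable zero-shot extrapolation.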

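Details
AI-generated abstract (translated from Chinese)

We introduce Robust Filter Attention (RFA), a formulation of self-attention as robust state estimation. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than by static feature similarity. Under isotropic noise and decay assumptions, RFA has computational complexity comparable to standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also gives a dynamical interpretation of standard positional mechanisms, linking rotational embeddings and recency bias to the transport and uncertainty propagation induced by stochastic dynamics.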

English abstract

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and attention weights are determined by consistency under this model rather than static feature similarity. Under isotropic noise and decay assumptions, RFA matches the computational complexity of standard attention. On language modeling benchmarks, RFA achieves lower perplexity than RoPE within the training window while remaining stable under zero-shot extrapolation to longer contexts. The framework also provides a dynamical interpretation of standard positional mechanisms, connecting rotational embeddings and recency biases to transport and uncertainty propagation induced by stochastic dynamics.

2605.14999 2026-06-10 cs.HC cs.AI cs.CY

Towards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted News

Pooja Prajod, Hannes Cools, Thomas Röggla, Pablo Cesar, Abdallah El Ali

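Affiliations: Centrum Wiskunde & Informatica; University of Amsterdam; TU Delft; Utrecht University

AI summary: Examines how AI-use disclosures affect readers' attention and cognitive load; one-line disclosures lead to higher fixation durations and saccade counts, whereas detailed disclosures add no extra burden, motivating gaze-informed adaptive disclosure designs.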

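Details
AI-generated abstract (translated from Chinese)

As generative AI becomes more deeply integrated into journalism, a key challenge is designing effective AI-use disclosures that inform readers without imposing unnecessary burden. While prior research has focused mainly on trust and credibility, the effect of disclosures on readers' attention and cognitive load has been overlooked. To fill this gap, we ran a 3×2×2 mixed factorial study that manipulated the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and the role of AI (editing, partial content generation), measuring load with NASA-TLX and eye tracking. Our results reveal a significant attentional cost: one-line disclosures led to higher fixation durations and saccade counts, especially for AI-edited content. Detailed disclosures added no extra burden. Drawing on Information-Gap Theory, we argue that brief labels may prompt greater visual scrutiny by alerting readers to AI use without giving enough information. NASA-TLX scores and pupil diameter did not differ significantly across conditions, suggesting that AI-use disclosures impose no cognitive burden regardless of the level of detail. Interview insights contextualize these findings and reveal a strong preference for detailed or "detail-on-demand" designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust the level of transparency to readers' attentional patterns and the news context.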

English abstract

As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers' attentional and cognitive load remains underexplored. To address this gap, we conducted a $3\times2\times2$ mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or "detail-on-demand" designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers' attentional patterns and news context.
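
The 3×2×2 design named in the abstract can be made concrete by enumerating its cells. The sketch below is only an illustration: the factor levels are taken from the abstract, while the function name `design_cells` is hypothetical, and nothing here says which factors were varied within or between participants.

```python
from itertools import product

# Factor levels exactly as listed in the abstract.
DISCLOSURE = ("none", "one-line", "detailed")
NEWS_TYPE = ("politics", "lifestyle")
AI_ROLE = ("editing", "partial content generation")


def design_cells():
    """Return every (disclosure, news_type, ai_role) combination."""
    return list(product(DISCLOSURE, NEWS_TYPE, AI_ROLE))


cells = design_cells()
print(len(cells))  # 3 * 2 * 2 = 12 cells
for cell in cells:
    print(cell)
```

Listing the cells this way makes it easy to check that every condition appears in a stimulus set before analysis.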

2602.16898 2026-06-10 cs.RO cs.AI cs.CV cs.LG

MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation

Mehrshad Taji, Arad Mahdinezhad Kashani, Iman Ahmadi, AmirHossein Jadidi, Saina Kashani, Babak Khalaj

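Affiliations: Department of Electrical Engineering, Sharif University of Technology

AI summary: MALLVI uses multi-agent collaboration to achieve closed-loop, feedback-driven robotic manipulation, improving generalization and zero-shot task success rates.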

Details
Comments
Some fundamental changes to the text and codebase
English abstract

Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.

2603.03339 2026-06-10 cs.CY cs.AR cs.CL cs.HC

Offline-First LLM Architecture for Adaptive Learning in Low-Connectivity Environments

Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona

Affiliations: University of Nairobi

Details
Comments
16 pages, 10 figures, 2 tables
English abstract

Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact. Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.

2510.17876 2026-06-10 physics.geo-ph cs.LG

Three-dimensional inversion of gravity data using implicit neural representations and scientific machine learning

Pankaj K Mishra, Sanni Laaksonen, Jochen Kamm, Anand Singh

Affiliations: Geological Survey of Finland; Indian Institute of Technology Bombay

Details
Journal ref
Scientific Reports (2026)
Comments
Codes for reproducing results are at https://zenodo.org/records/19440024
English abstract

Inversion of gravity data is an important method for investigating subsurface density variations relevant to mineral exploration, geothermal assessment, carbon storage, natural hydrogen, groundwater resources, and tectonic evolution. Here we present a scientific machine-learning approach for three-dimensional gravity inversion that represents subsurface density as a continuous field using an implicit neural representation (INR). The method trains a deep neural network directly through a physics-based forward-model loss, mapping spatial coordinates to a continuous density field without predefined meshes or discretisation. Spatial encoding enhances the network's capacity to capture sharp contrasts and short-wavelength features that conventional coordinate-based networks tend to oversmooth due to spectral bias. We demonstrate the approach on synthetic examples, including smooth models representing realistic geological complexity and a dipping block model to assess recovery of structures at different depths. The INR framework reconstructs detailed structure and geologically plausible boundaries without explicit regularisation or depth weighting, while reducing the number of inversion parameters as the problem size grows. These results highlight the potential of implicit representations to enable scalable, flexible, and interpretable large-scale geophysical inversion. This framework could generalise to other geophysical methods and to joint/multiphysics inversion.

2603.08561 2026-06-10 cs.AI

RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao

Affiliations: Shanghai AI Lab; National University of Singapore; Independent Researcher

Details
Comments
updated
English abstract

Standard reinforcement learning (RL) for large language model (LLM) agents primarily optimizes extrinsic task rewards, often favoring isolated task completion over continual adaptation. This paradigm can cause premature convergence to suboptimal policies and leaves useful experience only implicitly encoded in model parameters, limiting its retrieval and reuse for future decisions. We introduce RetroAgent, an online RL framework that trains agents to master interactive environments not merely by solving tasks, but by evolving across episodes. Inspired by human retrospective self-improvement, RetroAgent augments extrinsic rewards with hindsight-generated dual intrinsic feedback: (1) Intrinsic Numerical Feedback, which rewards beneficial exploration by measuring incremental subtask progress relative to prior attempts; and (2) Intrinsic Language Feedback, which distills successes and failures into reusable textual lessons for explicit experience reuse. To leverage these lessons effectively, we propose Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB), a retrieval strategy that balances semantic relevance, historical utility, and exploration. Across four challenging agentic benchmarks, RetroAgent achieves new state-of-the-art performance, outperforming GRPO by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while demonstrating strong test-time adaptation and out-of-distribution generalization.
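
To give a feel for a retrieval rule that balances semantic relevance, historical utility, and exploration, here is a generic UCB-style scoring sketch. It is not the paper's SimUtil-UCB formula, which the abstract does not state. The `Lesson` fields, the weights `alpha` and `c`, and the additive form of the score are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Lesson:
    text: str
    similarity: float          # assumed: relevance to the current task, in [0, 1]
    utility_sum: float = 0.0   # assumed: summed past usefulness of this lesson
    uses: int = 0              # how many times the lesson has been retrieved


def ucb_score(lesson, total_retrievals, alpha=0.5, c=1.0):
    """Hypothetical score: weighted relevance and mean utility plus a UCB bonus."""
    mean_utility = lesson.utility_sum / lesson.uses if lesson.uses else 0.0
    # Classic UCB exploration term: larger for rarely retrieved lessons.
    bonus = c * math.sqrt(math.log(total_retrievals + 1) / (lesson.uses + 1))
    return alpha * lesson.similarity + (1 - alpha) * mean_utility + bonus


def retrieve(lessons, total_retrievals, top_n=1, alpha=0.5, c=1.0):
    """Return the top_n lessons ranked by the hypothetical score."""
    ranked = sorted(
        lessons,
        key=lambda l: ucb_score(l, total_retrievals, alpha, c),
        reverse=True,
    )
    return ranked[:top_n]


# Two lessons with equal relevance and equal mean utility: the less-used one
# receives the larger exploration bonus and is retrieved first.
well_used = Lesson("check inventory before buying", 0.5, utility_sum=5.0, uses=10)
rarely_used = Lesson("open containers before searching", 0.5, utility_sum=0.5, uses=1)
print(retrieve([well_used, rarely_used], total_retrievals=11)[0].text)
```

With `c=0` the rule reduces to a plain weighted ranking, which is a convenient baseline when judging whether the exploration term helps.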

2601.00809 2026-06-10 cs.OH cs.AI cs.MA

A Modular Reference Architecture for MCP-Servers Enabling Agentic BIM Interaction

Tobias Heimig-Elschner, Changyu Du, Anna Scheuvens, André Borrmann, Jakob Beetz

Affiliations: Chair of Design Computation, RWTH Aachen University; Chair of Computing in Civil and Building Engineering, Technical University of Munich; Federal Institute for Research on Building, Urban Affairs and Spatial Development (BBSR); TUM Georg Nemetschek Institute

Details
Comments
Accepted at the GNI Symposium on Artificial Intelligence for the Built World (Technical University of Munich, May 18-20, 2026)
English abstract

Agentic workflows driven by large language models (LLMs) are increasingly applied to Building Information Modelling (BIM), enabling natural-language retrieval, modification and generation of IFC models. Recent work has begun adopting the emerging Model Context Protocol (MCP) as a uniform tool-calling interface for LLMs, simplifying the agent side of BIM interaction. While MCP standardises how LLMs invoke tools, current BIM-side implementations are still authoring-tool-specific and ad hoc, limiting reuse, evaluation, and workflow portability across environments. This paper addresses this gap by introducing a modular reference architecture for MCP servers that enables API-agnostic, isolated and reproducible agentic BIM interactions. From a systematic analysis of recurring capabilities in recent literature, we derive a core set of requirements. These inform a microservice architecture centred on an explicit adapter contract that decouples the MCP interface from specific BIM APIs. A prototype implementation using IfcOpenShell demonstrates feasibility across common modification and generation tasks. Evaluation across representative scenarios shows that the architecture enables reliable workflows, reduces coupling, and provides a reusable foundation for systematic research.

2603.04056 2026-06-10 cs.CV cs.RO

Long-Term Visual Localization in Dynamic Benthic Environments: A Dataset, Footprint-Based Ground Truth, and Visual Place Recognition Benchmark

Martin Kvisvik Larsen, Oscar Pizarro

Affiliations: Department of Marine Technology, Norwegian University of Science and Technology, Trondheim, Norway

Details
Journal ref
Frontiers in Robotics and AI Volume 13 (2026) 1821019
English abstract

Long-term visual localization has the potential to reduce cost and improve mapping quality in optical benthic monitoring with autonomous underwater vehicles (AUVs). Despite this potential, long-term visual localization in benthic environments remains understudied, primarily due to the lack of curated datasets for benchmarking. Moreover, limited georeferencing accuracy and image footprints necessitate precise geometric information for accurate ground-truthing. In this work, we address these gaps by presenting a curated dataset for long-term visual localization in benthic environments and a novel method to ground-truth visual localization results for near-nadir underwater imagery. Our dataset comprises georeferenced AUV imagery from five benthic reference sites, revisited over periods up to six years, and includes raw and color-corrected stereo imagery, camera calibrations, and sub-decimeter registered camera poses. To our knowledge, this is the first curated underwater dataset for long-term visual localization spanning multiple sites and photic-zone habitats. Our ground-truthing method estimates 3D seafloor image footprints and links camera views with overlapping footprints, ensuring that ground-truth links reflect shared visual content. Building on this dataset and ground truth, we benchmark eight state-of-the-art visual place recognition (VPR) methods and find that Recall@K is significantly lower on our dataset than on established terrestrial and underwater benchmarks. Finally, we compare our footprint-based ground truth to a traditional location-based ground truth and show that distance-threshold ground-truthing can overestimate VPR Recall@K at sites with rugged terrain and altitude variations. Together, the curated dataset, ground-truthing method, and VPR benchmark provide a stepping stone for advancing long-term visual localization in dynamic benthic environments.
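
For readers unfamiliar with the Recall@K metric used in the benchmark, the sketch below shows one common definition in visual place recognition: a query counts as a hit if any of its top-K retrieved database images is a true match. The function name, the data layout, and the choice to skip queries with no true match are assumptions. The ground-truth sets could come from either footprint overlap or a distance threshold.

```python
def recall_at_k(ranked, ground_truth, k):
    """Fraction of queries with at least one true match in their top-k results.

    ranked: dict mapping query id -> list of database ids, best match first.
    ground_truth: dict mapping query id -> set of correct database ids.
    Queries with an empty or missing ground-truth set are skipped (assumed
    convention).
    """
    queries = [q for q in ranked if ground_truth.get(q)]
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if set(ranked[q][:k]) & ground_truth[q])
    return hits / len(queries)


ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
truth = {"q1": {"b"}, "q2": {"x"}}
for k in (1, 2, 3):
    print(k, recall_at_k(ranked, truth, k))
```

Evaluating the same rankings against two different `ground_truth` dictionaries is how a footprint-based and a location-based ground truth could be compared.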

2602.23232 2026-06-10 cs.AI

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

Aishik Sanyal

Affiliations: Aishik Sanyal

Details
Journal ref
Proceedings of the AAAI Symposium Series, 8(1):352-360, 2026
Comments
Accepted at AAAI 2026 Spring Symposium - Machine Consciousness: Integrating Theory, Technology, and Philosophy
English abstract

Indicator-based approaches to machine consciousness recommend mechanism-linked evidence triangulated across tasks, supported by architectural inspection and causal intervention. Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience $N^s$ and an optional affect proxy reporting valence/arousal. Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey's qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice. We find a novelty dissociation: non-affect variants are novelty-sensitive ($Δ$scenic-entry = 0.07). Affect coupling is stable ($Δ$scenic-entry = 0.01) even when scenic is less novel (median $Δ$novelty $\approx$ -0.43). In reward-free exploratory play, the affect variant shows structured local investigation (scan events 31.4 vs. 0.9; cycle score 7.6). In a pain-tail probe, only the affect variant sustains prolonged planned caution (tail duration 90 vs. 5). Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants (AUC drop 27.62, 27.9%) while leaving ReCoN unchanged. These dissociations link recurrence $\rightarrow$ persistence and affect-coupled control $\rightarrow$ preference stability, scanning, and lingering caution, illustrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers.

2602.01023 2026-06-10 cs.IR cs.AI cs.LG

Unifying Ranking and Generation in Query Auto-Completion via Retrieval-Augmented Generation and Multi-Objective Alignment

Kai Yuan, Anthony Zheng, Jia Hu, Divyanshu Sheth, Hemanth Velaga, Kylee Kim, Matteo Guarrera, Besim Avci, Jianhua Li, Xuetao Yin, Rajyashree Mukherjee, Sean Suchter

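Affiliations: Apple; UC Berkeley

AI summary: Proposes a unified framework that recasts query auto-completion as end-to-end list generation using Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO), addressing the limited long-tail coverage of traditional pipelines and the hallucination risk of generative methods, and validates it on a large-scale commercial search platform.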

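Details
Journal ref
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09-13, 2026, Jeju Island, Republic of Korea
Comments
11 pages, 4 figures
AI-generated abstract (translated from Chinese)

Query Auto-Completion (QAC) suggests query completions as users type, helping them express intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and need extensive feature engineering, while recent generative methods carry hallucination and safety risks. We propose a unified framework that recasts QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) recasting QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision to produce high-quality synthetic data; (3) a hybrid serving architecture that enables efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform shows substantial improvements: offline metrics improve across all dimensions, human evaluation yields preference scores of +0.40 to +0.69, and a controlled online experiment achieves a 5.44% reduction in keystrokes and a 3.46% increase in suggestion adoption, confirming that unified generation with RAG and multi-objective alignment is an effective solution for production QAC. This work represents a paradigm shift toward end-to-end generation driven by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the wider search and recommendation industry.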

English abstract

Query Auto-Completion (QAC) suggests query completions as users type, helping them articulate intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and require extensive feature engineering, while recent generative methods suffer from hallucination and safety risks. We present a unified framework that reformulates QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) reformulating QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC, and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision for high-quality synthetic data; (3) a hybrid serving architecture enabling efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform demonstrates substantial improvements: offline metrics show gains across all dimensions, human evaluation yields +0.40 to +0.69 preference scores, and a controlled online experiment achieves a 5.44% reduction in keystrokes and a 3.46% increase in suggestion adoption, validating that unified generation with RAG and multi-objective alignment provides an effective solution for production QAC. This work represents a paradigm shift to end-to-end generation powered by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the broader search and recommendation industry.

2509.22148 2026-06-10 eess.AS cs.SD

Speaker Anonymisation for Speech-based Suicide Risk Detection

Ziyun Cui, Sike Jia, Yang Lin, Yinan Duan, Diyang Qu, Runsen Chen, Chao Zhang, Chang Lei, Wen Wu

Affiliations: University of Science and Technology of China

Details
Comments
Accepted by ICASSP 2026
English abstract

Adolescent suicide is a critical global health issue, and speech provides a cost-effective modality for automatic suicide risk detection. Given the vulnerable population, protecting speaker identity is particularly important, as speech itself can reveal personally identifiable information if the data is leaked or maliciously exploited. This work presents the first systematic study of speaker anonymisation for speech-based suicide risk detection. A broad range of anonymisation methods are investigated, including techniques based on traditional signal processing, neural voice conversion, and speech synthesis. A comprehensive evaluation framework is built to assess the trade-off between protecting speaker identity and preserving information essential for suicide risk detection. Results show that combining anonymisation methods that retain complementary information yields detection performance comparable to that of original speech, while achieving protection of speaker identity for vulnerable populations.

2601.12641 2026-06-10 cs.AI

STEP-LLM: Generating CAD STEP Models from Natural Language with Large Language Models

Xiangyu Shi, Junyang Ding, Xu Zhao, Sinong Zhan, Payal Mohapatra, Daniel Quispe, Kojo Welbeck, Jian Cao, Wei Chen, Ping Guo, Qi Zhu

Affiliations: Northwestern University

Details
Journal ref
In Proceedings of the 2026 Design, Automation & Test in Europe Conference (DATE 2026), 2026
Comments
Accepted to the Design, Automation & Test in Europe Conference (DATE) 2026
English abstract

Computer-aided design (CAD) is vital to modern manufacturing, yet model creation remains labor-intensive and expertise-heavy. To enable non-experts to translate intuitive design intent into manufacturable artifacts, recent large language model (LLM)-based text-to-CAD efforts focus on command sequences or script-based formats like CadQuery. However, these formats are kernel-dependent and lack universality for manufacturing. In contrast, the Standard for the Exchange of Product Data (STEP, ISO 10303) file is a widely adopted, neutral boundary representation (B-rep) format directly compatible with manufacturing, but its graph-structured, cross-referenced nature poses unique challenges for auto-regressive LLMs. To address this, we curate a dataset of ~40K STEP-caption pairs and introduce novel preprocessing tailored for the graph-structured format of STEP, including a depth-first search-based reserialization that linearizes cross-references while preserving locality, and chain-of-thought (CoT)-style structural annotations that guide global coherence. We integrate retrieval-augmented generation to ground predictions in relevant examples for supervised fine-tuning, and refine generation quality through reinforcement learning with a specific Chamfer Distance-based geometric reward. Experiments demonstrate consistent gains of our STEP-LLM in geometric fidelity over the Text2CAD baseline, with improvements arising from multiple stages of our framework: the RAG module substantially enhances completeness and renderability, the DFS-based reserialization strengthens overall accuracy, and the RL further reduces geometric discrepancy. Both metrics and visual comparisons confirm that STEP-LLM generates shapes with higher fidelity than Text2CAD. These results demonstrate the feasibility of LLM-driven STEP model generation from natural language and its potential to democratize CAD design for manufacturing.
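
The Chamfer Distance behind the geometric reward can be illustrated with a small pure-Python sketch. It uses one common symmetric form, the sum of the mean nearest-neighbour Euclidean distances in both directions. Other variants use squared distances, and the abstract does not say which form the authors use. The `geometric_reward` mapping and its `scale` parameter are purely hypothetical and only show how a distance could be turned into a bounded reward.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # Average distance from each point in src to its nearest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(a, b) + mean_nearest(b, a)


def geometric_reward(distance, scale=1.0):
    """Hypothetical reward in (0, 1]: equals 1 when the shapes coincide."""
    return math.exp(-distance / scale)


generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0)]
d = chamfer_distance(generated, reference)
print(d, geometric_reward(d))
```

This brute-force version costs O(|a|·|b|) distance evaluations, which is acceptable for small sampled point clouds but would normally be replaced by a spatial index for dense ones.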

2601.09620 2026-06-10 cs.HC cs.AI cs.CY

Full Disclosure, Less Trust? How the Level of Detail about AI Use in News Writing Affects Readers' Trust

Pooja Prajod, Hannes Cools, Thomas Röggla, Karthikeya Puttur Venkatraj, Amber Kusters, Alia ElKattan, Pablo Cesar, Abdallah El Ali

Affiliations: Centrum Wiskunde & Informatica; University of Amsterdam; New York University; TU Delft; Utrecht University

Details
English abstract

As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a "transparency dilemma", where disclosure reduces readers' trust. However, little is known about how the level of detail in AI disclosures influences trust and contributes to this dilemma within the news context. In this $3\times2\times2$ mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers' trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line disclosures indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers' desire for more transparency and their trust in AI-assisted news content.

2510.09498 2026-06-10 q-bio.TO cs.CE cs.LG

Unsupervised full-field Bayesian inference of orthotropic hyperelasticity from a single biaxial test: a myocardial case study

Rogier P. Krijnen, Akshay Joshi, Siddhant Kumar, Mathias Peirlinck

Affiliations: TU Delft

Details
English abstract

Cardiac muscle tissue exhibits highly non-linear hyperelastic and orthotropic material behavior during passive deformation. Traditional constitutive identification protocols therefore combine multiple loading modes and typically require multiple specimens and substantial handling. In soft living tissues, such protocols are challenged by inter- and intra-sample variability and by manipulation-induced alterations of mechanical response, which can bias inverse calibration. In this work we exploit spatially heterogeneous full-field kinematics as an information-rich alternative to multimodal testing. We recast EUCLID, an unsupervised method for the automated discovery of constitutive models, towards Bayesian parameter inference for highly nonlinear, orthotropic constitutive models. Using synthetic myocardial tissue slabs, we demonstrate that a single heterogeneous biaxial experiment, combined with sparse reaction-force measurements, enables robust recovery of Holzapfel-Ogden parameters with quantified uncertainty, across multiple noise levels. The inferred responses agree closely with ground-truth simulations and yield credible intervals that reflect the impact of measurement noise on orthotropic material model inference. Our work supports single-shot, uncertainty-aware characterization of nonlinear orthotropic material models from a single biaxial test, reducing sample demand and experimental manipulation.