arXivDaily: arXiv daily academic digest, updated Monday through Friday
2211.05583 2026-06-09 cs.CL math.OC version update

Toward automatic generation of control structures for process flow diagrams with large language models

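Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann

Affiliations * University of Zurich

AI summary Proposes an end-to-end transformer-based method that casts control structure prediction as a translation task, represents PFD topology with SFILES 2.0 strings, and uses pre-training followed by fine-tuning; it reaches a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs.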

Journal ref
AIChE Journal, Volume 70, Issue 1, January 2024, e18259
Abstract

Developing Piping and Instrumentation Diagrams (P&IDs) is a crucial step during process development. We propose a data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) without control structures are translated to PFDs with control structures. We represent the topology of PFDs as strings using the SFILES 2.0 notation. We pretrain our model using generated PFDs to learn the grammatical structure. Thereafter, the model is fine-tuned leveraging transfer learning on real PFDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real PFDs indicate the need for a larger PFD dataset for industry applications and hybrid artificial intelligence solutions.
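
The top-5 accuracy reported above can be read as the share of test samples whose reference string appears among the five highest-ranked candidates. The sketch below assumes exact string matching, which the abstract does not state; the function name and the toy strings are hypothetical placeholders, not valid SFILES 2.0.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of samples whose target is among the first k candidates."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]  # exact match is an assumption
    )
    return hits / len(targets)


# Toy placeholder strings, ranked from most to least likely.
predictions = [
    ["A", "B", "C"],  # target "B" is ranked second
    ["D", "E", "F"],  # target "G" is not among the candidates
]
targets = ["B", "G"]

print(top_k_accuracy(predictions, targets, k=1))
print(top_k_accuracy(predictions, targets, k=5))
```

Under this reading, top-k accuracy can only stay equal or rise as k grows.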

2208.00859 2026-06-09 cs.LG cs.CL version update

Learning from flowsheets: A generative transformer model for autocompletion of flowsheets

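Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann

Affiliations * University of Freiburg

AI summary Inspired by text autocompletion, proposes a method for autocompleting chemical flowsheets that combines the SFILES 2.0 string representation with a transformer-based language model, pre-trained on synthetic flowsheets and fine-tuned on real ones, to assist interactive flowsheet synthesis.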

Journal ref
Computers and Chemical Engineering, Volume 171, March 2023, 108162
Abstract

We propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios.

2312.02873 2026-06-09 cs.LG cs.AI version update

Toward autocorrection of chemical process flowsheets using large language models

Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann

Affiliations * Process Intelligence Research Group, Department of Chemical Engineering, Delft University of Technology

AI summary Proposes an LLM-based generative AI method that automatically identifies errors in chemical process flowsheets and suggests corrections, reaching 80% top-1 accuracy on a synthetic test dataset.

Journal ref
Computer Aided Chemical Engineering, Volume 53, 2024, Pages 3109-3114
Abstract

The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model are suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.

2310.10196 2026-06-09 cs.LG cs.AI version update

Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook

Ming Jin, Yaxuan Kong, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Vincent S. Tseng, Yu Zheng, Lei Chen, Hui Xiong, Shirui Pan, Qingsong Wen

Affiliations * Griffith University; University of Oxford; Hong Kong University of Science and Technology (Guangzhou); Zhejiang Normal University; Ant Group; Alibaba Group; Deloitte Service LLP; The University of Hong Kong; NEC Laboratories America; A*STAR; National Yang Ming Chiao Tung University; JD Technology; Squirrel Ai Learning

AI summary Surveys large models for time series and spatio-temporal data, organized by data type, model category, model scope, and application area; distinguishes general-purpose from domain-specific models and curates related resources and open problems.

Comments Accepted by ACM Computing Surveys; 35 Pages; Github Repo: https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM

Abstract

Temporal data, including time series and spatio-temporal data, are pervasive in real-world applications. Generated in massive volumes by physical and virtual sensors, they record dynamic system behaviors and enable a wide range of downstream tasks. Effectively analyzing such data is crucial to unlocking their rich information content. Recent advances in large language models and other foundation models have accelerated their use in time series and spatio-temporal data mining. These approaches not only improve pattern recognition and reasoning across diverse domains but also support progress toward artificial general intelligence that can understand and process temporal data. In this survey, we present a comprehensive, up-to-date review of large models tailored or adapted for time series and spatio-temporal data along four dimensions: data types, model categories, model scopes, and application areas/tasks. We organize existing work into two main groups: large models for time series analysis (LM4TS) and for spatio-temporal data mining (LM4STD), and further distinguish general-purpose from domain-specific models. We also curate related resources, including datasets, model implementations, and tools, organized by major application areas. Overall, this survey consolidates recent advances and highlights foundations, applications, resources, and open research opportunities in large model-centric temporal data analysis.

2308.07822 2026-06-09 cs.LG cs.SY eess.SY version update

Deep reinforcement learning for process design: Review and perspective

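Qinghe Gao, Artur M. Schweidtmann

Affiliations * Delft University of Technology

AI summary Reviews deep reinforcement learning for chemical process design, analyzing the state of the art through three elements (information representation, agent architecture, and environment and reward) and discussing challenges and future directions.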

Abstract

The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate this transition. Specifically, deep reinforcement learning, a subclass of machine learning, has shown the potential to solve complex decision-making problems and aid sustainable process design. We survey state-of-the-art research in reinforcement learning for process design through three major elements: (i) information representation, (ii) agent architecture, and (iii) environment and reward. Moreover, we discuss perspectives on underlying challenges and promising future works to unfold the full potential of reinforcement learning for process design in chemical engineering.

2104.12183 2026-06-09 cs.RO version update

An Interval Branch-and-Bound-Based Inverse Kinemetics Algorithm Towards Global Optimal Redundancy Resolution

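Yajue Yang, Zeqing Zhang, Yuanqing Wu, Jia Pan

Affiliations * University of Science and Technology of China

AI summary Proposes an interval branch-and-bound method, augmented with a search heuristic driven by a fast numerical IK solver, that efficiently solves the generalized inverse kinematics problem for manipulators; it generates patches of neighboring solutions that capture the geometry of the self-motion manifold, supporting optimal planning and anytime use.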

Abstract

The general inverse kinematics (IK) problem of a manipulator, namely that of acquiring the self-motion manifold (SMM) of all admissible joint angles for a desired end-effector pose, plays a vital role in robotics modeling, planning and control. To efficiently solve the generalized IK, this paper proposes an interval branch-and-bound-based approach, which is augmented with a fast numerical IK-solver-enabled search heuristics. In comparison to independent solutions generated by sampling based methods, our approach generates patches of neighboring solutions to provide richer information of the inherent geometry of the SMM for optimal planning and other applications. It can also be utilized in an anytime fashion to obtain solutions with sub-optimal resolution for applications within a limited period. The performance of our approach is verified by numerical experiments on both non-redundant and redundant manipulators.

2009.10277 2026-06-09 cs.CL cs.LG cs.SI version update

Measuring a hate speech spectrum with faceted Rasch item response theory and perspective-aware, explainable-by-design deep learning

Chris J. Kennedy, Geoff Bacon, Alexander Sahn, Claudia von Vacano

Affiliations * Center for Precision Psychiatry, Mass General Hospital Department of Psychiatry, Harvard Medical School; D-Lab, University of California, Berkeley

AI summary Combines supervised deep learning with faceted Rasch item response theory: hate speech is decomposed into 10 ordinal labels, which an IRT model converts into an interval measure while adjusting for annotator perspective; a RoBERTa-based model improves accuracy over alternatives and yields a continuous, explainable-by-design score.

Comments 7 pages, 6 figures

Abstract

We propose a system for measuring hate speech on a continuous, interval-valued spectrum ranging from genocidal to supportive speech by combining supervised deep learning with faceted Rasch item response theory (IRT). We decompose the theoretical construct of hate speech into constituent concepts operationalized as 10 ordinal labels. Those labels are reconstituted via IRT probabilistic latent modeling into an interval outcome measure while simultaneously estimating and adjusting for each annotator's labeling perspective. Our scaling procedure integrates naturally with a multitask deep learning architecture for automated prediction, allowing design-based explainability of the continuous score through those components. We apply this method to a new, open source dataset of 50,070 social media comments sourced from YouTube, Twitter, and Reddit, annotated and labeled by 11,143 United States-based Amazon Mechanical Turk workers. Our RoBERTa-based model shows improved accuracy compared to alternative approaches. This system offers a new paradigm for supervised NLP that encourages continuous rather than binary constructs, and design-based incorporation of annotator perspective and model explainability.
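
To illustrate how a faceted Rasch model can adjust for annotator perspective, the sketch below uses the simplest dichotomous many-facet form, in which the log-odds of endorsing a label equal the comment's position on the scale minus the label's difficulty minus the annotator's severity. The abstract describes ordinal labels, so the model it refers to is presumably a polytomous variant; the function and parameter names here are assumptions made for illustration.

```python
import math


def rasch_probability(
    theta: float,
    item_difficulty: float,
    rater_severity: float = 0.0,
) -> float:
    """Probability that a rater endorses an item for a given comment.

    theta           -- position of the comment on the latent scale
    item_difficulty -- how hard the label is to endorse
    rater_severity  -- how reluctant this annotator is to endorse labels
    All three are expressed in logits (simplified dichotomous form).
    """
    logit = theta - item_difficulty - rater_severity
    return 1.0 / (1.0 + math.exp(-logit))


# Same comment and label, judged by a neutral and by a severe annotator.
neutral = rasch_probability(theta=1.0, item_difficulty=1.0)
severe = rasch_probability(theta=1.0, item_difficulty=1.0, rater_severity=1.0)
print(neutral, severe)
```

Because severity enters as its own term, two annotators can give different raw labels to the same comment while the estimated position of that comment on the scale stays the same.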

2411.09816 2026-06-09 cs.LG

Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition

Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou

Affiliations * Department of Computer Science, Technical University of Munich; Schulich School of Engineering, University of Calgary; Google DeepMind, Canada

AI summary Proposes FiPS, a framework that compresses transformer MLPs by jointly optimizing cross-block parameter sharing, low-rank factorization, and sparsity, compressing ViTs and LLMs with little loss in performance.

Comments Accepted as is to Transactions on Machine Learning Research (TMLR), 2026. OpenReview: https://openreview.net/forum?id=vbS7Z8Zswe

Abstract

Large neural networks achieve state-of-the-art performance on many tasks, yet their sheer size hinders deployment on resource-constrained devices. Among existing compression approaches, cross-layer parameter sharing remains relatively unexplored for transformer models. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a unified framework for compressing transformer Multi-Layer Perceptrons (MLPs) that combines cross-block parameter sharing, low-rank factorization, and sparsity in a single optimization. FiPS concatenates MLP weight matrices across a group of transformer blocks and factorizes them into a shared basis and sparse, layer-specific projection matrices. Both factors are initialized via singular value decomposition (SVD) and jointly optimized by block-wise reconstruction error minimization. FiPS compresses Vision Transformers (ViTs) by up to 33% with less than 1% top-1 accuracy loss on ImageNet-1k, and by up to 57% when combined with fine-tuning. It also compresses Large Language Models (LLMs) by up to 20% while outperforming existing SVD-based methods in perplexity and downstream benchmarks at matched compression. Combined with Quantization-Aware Training (QAT), 3-bit FiPS on Gemma-2-2B achieves lower perplexity than 2-bit QAT alone while matching the same 8x compression. These results establish fine-grained parameter sharing as a practical and effective approach for transformer MLP compression.

2605.24384 2026-06-09 cs.CL cs.AI

Side-by-side Comparison Amplifies Dialect Bias in Language Models

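Kritee Kondapally, Claire J. Smerdon, Pooja C. Patel, Ogheneyoma Akoni, Jevon Torres, Jaspreet Ranjit, Matthew Finlayson, Swabha Swayamdipta

Affiliations * University of Southern California

AI summary By comparing tweets in Standard American English and African-American Vernacular English side by side, finds that covert dialect bias in language models is significantly amplified in contrastive settings and that overt dialect bias persists after safety-aligned finetuning.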

Comments In proceedings of the ACM Conference on Fairness, Accountability, and Transparency 2026

Journal ref
In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
Abstract

Language models (LMs) can exhibit biases based on variations in their dialects, even in the absence of a dialect label, a behavior known as covert dialect bias. In this work, we quantify covert dialect bias in online discourse by evaluating how LMs associate stereotypical traits (derived from social psychology research on racial bias) with intent-equivalent tweets in Standard American English (SAE) and African-American Vernacular English (AAVE). While prior work shows that LMs associate more negative stereotypes with AAVE when evaluating tweets in isolation, we are surprised to find that this bias is significantly exacerbated when SAE / AAVE tweet pairs are compared side by side, a setting that more closely reflects high-impact decision making contexts in which models are used to rank candidates. The bias only worsens when dialect labels are explicitly specified. This is striking, given the extensive efforts from commercial developers to mitigate bias in their LMs. Encouragingly, we show that counterfactual fairness finetuning can mitigate covert dialect bias for some stereotypical traits, reducing average disparities when evaluating tweets in isolation, however, these improvements do not consistently hold across traits when evaluating SAE / AAVE tweets side by side. Our findings show that existing evaluation settings for covert dialect bias may underestimate its severity, specifically in contrastive settings. Additionally, overt dialect bias remains pronounced even after safety aligned finetuning, indicating that it remains an unresolved problem, and motivates the need for more robust evaluation and mitigation frameworks.

2603.10453 2026-06-09 cs.LG

Spatio-Temporal Forecasting of Retaining Wall Deformation: Mitigating Error Accumulation via Multi-Resolution ConvLSTM Stacking Ensemble

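Jihoon Kim, Heejung Youn

Affiliations * Department of Civil and Environmental Engineering, Hongik University

AI summary Proposes a multi-resolution ConvLSTM ensemble framework that uses different temporal input resolutions to mitigate error accumulation and improve long-horizon deformation forecasts for retaining structures during staged excavation.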

Comments 27 pages, 17 figures

Journal ref
Geomechanics and Engineering, 45(5), 649-674, 2026
Abstract

This study proposes a multi-resolution Convolutional Long Short-Term Memory (ConvLSTM) ensemble framework that leverages diverse temporal input resolutions to mitigate error accumulation and improve long-horizon forecasting of retaining-structure behavior during staged excavation. An extensive database of lateral wall displacement responses was generated through PLAXIS2D simulations incorporating five-layered soil stratigraphy, two excavation depths (14 and 20 m), and stochastically varied geotechnical and structural parameters, yielding 2,000 time-series deflection profiles. Three ConvLSTM models trained at different input resolutions were integrated using a fully connected neural network meta-learner to construct the ensemble model. Validation using both numerical results and field measurements demonstrated that the ensemble approach consistently outperformed the standalone ConvLSTM models, particularly in long-term multi-step prediction, exhibiting reduced error propagation and improved generalization. These findings underscore the potential of multi-resolution ensemble strategies that jointly exploit diverse temporal input scales to enhance predictive stability and accuracy in AI-driven geotechnical forecasting.

2410.14949 2026-06-09 cs.LG stat.ML

On the Convergence and Straightness of Rectified Flow

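Vansh Bansal, Saptarshi Roy, Alessandro Rinaldo, Purnamrita Sarkar

Affiliations * Department of Statistics and Data Sciences, UT Austin

AI summary Introduces a Piecewise Straightness parameter $\gamma_{2,T}$, establishes the first Wasserstein convergence bound linking the discretization error of flow models to $\gamma_{2,T}$, shows that minimizing curvature is key to high-fidelity one-step sampling, and provides a theoretical framework for analyzing the straightness of Rectified Flow (RF).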

Comments 37 pages

Abstract

Flow Matching has become a cornerstone of modern generative models like Stable Diffusion 3, largely due to the efficiency of its Rectified Flow (RF) variant. The success of RF hinges on iteratively learning straight trajectories, pushing generation towards fewer sampling steps. However, the theoretical link between path geometry and sampling efficiency has been underexplored. This paper fills this gap by introducing a novel Piecewise Straightness parameter, $\gamma_{2,T}$. We establish the first Wasserstein convergence bound that explicitly links the discretization error of any general flow-model to $\gamma_{2,T}$, proving that minimizing curvature is the key to achieving high-fidelity, one-step sampling. Building on this theory, we establish the first theoretical framework to analyze the straightness of RF. We begin by offering intuitive geometric arguments for simple cases before identifying sufficient conditions under which a single rectification step (1-RF) yields a perfectly straight or even a Monge optimal coupling. While whether these sufficient conditions are met depends on the problem geometry, they enable the first concrete proofs in this area. Critically, fulfilling these conditions makes the subsequent flow (2-RF) perfectly straight ($\gamma_{2,T}=0$). This eliminates the discretization error in our bound and makes flawless, single-step sampling possible.
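
The connection between straight trajectories and one-step sampling can be seen in a toy one-dimensional ODE integrated with forward Euler. The sketch below is only an illustration with two hypothetical velocity fields; it does not compute the straightness parameter or reproduce the paper's bound. When the velocity is constant along a trajectory, a single Euler step lands exactly on the endpoint; when the velocity changes along the trajectory, a single step leaves an error that shrinks only as more steps are taken.

```python
import math
from typing import Callable


def euler_sample(
    velocity: Callable[[float, float], float],
    x0: float,
    num_steps: int,
) -> float:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = x0
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x += dt * velocity(x, t)
    return x


def straight_velocity(x: float, t: float) -> float:
    """Constant velocity: the exact endpoint is x0 + 2 (hypothetical field)."""
    return 2.0


def curved_velocity(x: float, t: float) -> float:
    """Velocity grows with x: the exact endpoint is x0 * e (hypothetical field)."""
    return x


x0 = 1.0
# One step on the straight field: no discretization error.
print(abs(euler_sample(straight_velocity, x0, 1) - (x0 + 2.0)))
# One step on the curved field: a clearly visible error.
print(abs(euler_sample(curved_velocity, x0, 1) - x0 * math.e))
# Many steps on the curved field: the error shrinks.
print(abs(euler_sample(curved_velocity, x0, 100) - x0 * math.e))
```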

2604.24583 2026-06-09 cs.CV

Improving Vision-language Models with Perception-centric Process Reward Models

Yingqian Min, Kun Zhou, Yifan Li, Yuhuan Wu, Han Peng, Yifan Du, Wayne Xin Zhao, Min Yang, Ji-Rong Wen

Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China; Bytedance; University of California, San Diego; The Hong Kong University of Science and Technology

AI summary Proposes Perceval, a process reward model that grounds errors at the token level to improve the reasoning of vision-language models; its perception-centric supervision enables fine-grained improvements in both training and inference, with significant gains on benchmarks from multiple domains.

Comments 8 pages

Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 33099-33109
Abstract

Recent advancements in reinforcement learning with verifiable rewards (RLVR) have significantly improved the complex reasoning ability of vision-language models (VLMs). However, its outcome-level supervision is too coarse to diagnose and correct errors within the reasoning chain. To this end, we propose Perceval, a process reward model (PRM) that enables token-level error grounding, which can extract image-related claims from the response and compare them one by one with the visual evidence in the image, ultimately returning claims that contain perceptual errors. Perceval is trained with perception-intensive supervised training data. We then integrate Perceval into the RL training process to train the policy models. Specifically, compared to traditional GRPO, which applies sequence-level advantages, we apply token-level advantages by targeting penalties on hallucinated spans identified by Perceval, thus enabling fine-grained supervision signals. In addition to augmenting the training process, Perceval can also assist VLMs during the inference stage. Using Perceval, we can truncate the erroneous portions of the model's response, and then either have the model regenerate the response directly or induce the model to reflect on its previous output. This process can be repeated multiple times to achieve test-time scaling. Experiments show significant improvements on benchmarks from various domains across multiple reasoning VLMs trained with RL, highlighting the promise of perception-centric supervision as a general-purpose strategy. For test-time scaling, it also demonstrates consistent performance gains over other strategies, such as majority voting. Our code and data will be publicly released at https://github.com/RUCAIBox/Perceval.
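
The contrast between sequence-level and token-level advantages can be sketched as follows. The group normalization mirrors the usual GRPO-style advantage, where each response's reward is standardized within its sampling group. The subtractive penalty on flagged spans is a hypothetical choice made for illustration, since the abstract does not give the exact rule; the function names and span format are likewise assumptions.

```python
import statistics
from typing import List, Sequence, Tuple


def sequence_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantage: one scalar per sampled response."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(
    advantage: float,
    num_tokens: int,
    flagged_spans: Sequence[Tuple[int, int]],
    penalty: float = 1.0,
) -> List[float]:
    """Broadcast a sequence advantage to tokens, lowering it on flagged spans.

    flagged_spans holds half-open (start, end) token index ranges, standing in
    for spans a reward model marked as hallucinated. Subtracting a fixed
    penalty is an assumption, not the rule from the paper.
    """
    per_token = [advantage] * num_tokens
    for start, end in flagged_spans:
        for i in range(max(start, 0), min(end, num_tokens)):
            per_token[i] = advantage - penalty
    return per_token


group_rewards = [1.0, 0.0, 1.0, 0.0]  # toy outcome rewards for four responses
advantages = sequence_advantages(group_rewards)
# First response: six tokens, tokens 2 and 3 flagged.
print(token_advantages(advantages[0], num_tokens=6, flagged_spans=[(2, 4)]))
```

With sequence-level advantages every token of a correct response is reinforced equally, including any flagged span; the per-token version lets the flagged tokens receive a weaker or negative signal while the rest of the response keeps its credit.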

2604.17488 2026-06-09 cs.CV

AutoVQA-G: Self-Improving Agentic Framework for Automated Visual Question Answering and Grounding Annotation

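Rongsheng Hu, Runwei Guan, Yicheng Di, Jiayu Bao, Yuan Liu

Affiliations * School of Artificial Intelligence

AI summary Proposes AutoVQA-G, a framework whose iterative refinement loop improves the accuracy of grounding annotations for visual question answering, outperforming leading multimodal LLMs and offering a way to build high-fidelity data for more robust VLM training.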

Comments Accepted at IEEE ICASSP 2026. 5 pages, 5 figures. Code available at https://github.com/rohnson1999/AutoVQA-G

Journal ref
Proc. 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12312-12316, 2026
Abstract

Manual annotation of high-quality visual question answering with grounding (VQA-G) datasets, which pair visual questions with evidential grounding, is crucial for advancing vision-language models (VLMs), but remains unscalable. Existing automated methods are often hindered by two key issues: (1) inconsistent data fidelity due to model hallucinations; (2) brittle verification mechanisms based on simple heuristics. To address these limitations, we introduce AutoVQA-G, a self-improving agentic framework for automated VQA-G annotation. AutoVQA-G employs an iterative refinement loop where a Consistency Evaluation module uses Chain-of-Thought (CoT) reasoning for fine-grained visual verification. Based on this feedback, a memory-augmented Prompt Optimization agent analyzes critiques from failed samples to progressively refine generation prompts. Our experiments show that AutoVQA-G generates VQA-G datasets with superior visual grounding accuracy compared to leading multimodal LLMs, offering a promising approach for creating high-fidelity data to facilitate more robust VLM training and evaluation. Code: https://github.com/rohnson1999/AutoVQA-G

2507.18967 2026-06-09 cs.CV cs.AI cs.LG

Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

Affiliations * Faculty of Computing, Sri Lanka Institute of Information Technology; Faculty of Information Technology and Communication Sciences, Tampere University; Computing Centre, Faculty of Engineering, University of Peradeniya

AI summary Compares YOLOv7 through YOLOv10 and Faster R-CNN on underwater waste detection and finds that YOLOv8 performs best under conditions such as low visibility and variable depth, with an mAP of 80.9%.

Comments 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)

Journal ref
Vol. 5 No. I (2026): International Journal of Research in Computing (IJRC)
Abstract

Underwater pollution is one of today's most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely YOLO (You Only Look Once) models, including YOLOv7, YOLOv8, YOLOv9, YOLOv10, and Faster Region-Convolutional Neural Network (R-CNN), to identify which model was most effective at recognizing materials in underwater situations. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. From the above-mentioned models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%, indicating a significant performance. This increased performance is attributed to YOLOv8's architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model's potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.

2508.05153 2026-06-09 cs.RO cs.AI

FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction

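Mohammed Daba, Jing Qiu

Affiliations * University of Waterloo

AI summary Proposes FCBV-Net, which conditions bimanual action-value prediction on pre-trained dense geometric features to improve category-level generalization in robotic garment smoothing; on unseen garments its efficiency drops by only 11.5%.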

Comments 9 pages, 7 figures, 1 table

Journal ref
Electronics 2026, 15(11), 2468
Abstract

Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite Category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated PyFlex environments using the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization. Code, videos, and supplementary materials are available at the project website: https://dabaspark.github.io/fcbvnet/.

2511.18203 2026-06-09 cs.RO

SkillWrapper: Generative Predicate Invention for Task-level Robot Planning

SkillWrapper:任务级机器人规划的生成性谓词发明

Ziyi Yang, Benned Hedegaard, Ahmed Jaafar, Yichen Wei, Skye Thompson, Shreyas S. Raman, Haotian Fu, Stefanie Tellex, George Konidaris, David Paulius, Naman Shah

发表机构 * Brown University(布朗大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 本文提出SkillWrapper方法,通过生成性谓词发明学习符号化表示,使机器人能够基于抽象推理完成长周期任务规划。

详情
AI中文摘要

从单个技能执行泛化到长周期任务是构建自主机器人的核心挑战。一个有前景的方向是学习低层机器人技能的高层符号表示,从而实现独立于低层状态空间的抽象推理。基础模型的近期进展使得生成作用于原始感知输入的符号谓词成为可能(我们称这一过程为生成性谓词发明),以促进下游表示学习。然而,先前工作使用启发式或临时性的流程学习这些抽象,忽略了它们应满足哪些形式性质以及如何保证这些性质的问题。针对这些问题,我们提出了面向任务级规划的生成性谓词发明的形式化理论,并提出SkillWrapper,一种学习符号模型以实现可证明可靠且完备的规划的方法。我们的方法利用基础模型主动收集机器人数据,并仅使用RGB图像观测学习人类可解释、可用于规划的表示。我们在仿真和真实机器人上的大量实证评估表明,SkillWrapper学到的抽象表示使机器人能够组合黑箱技能,在真实世界中解决未见过的长周期任务。

英文摘要

Generalizing from individual skill executions to long-horizon tasks is a core challenge in building autonomous robots. A promising direction is learning high-level, symbolic representations of low-level robot skills, enabling abstract reasoning independent of the low-level state space. Recent advances in foundation models have made it possible to generate symbolic predicates that operate on raw sensory inputs, a process we call generative predicate invention, to facilitate downstream representation learning. However, prior work learns these abstractions using heuristic or ad-hoc procedures, ignoring the question of which formal properties they ought to satisfy, and how to guarantee these properties. We address these questions by presenting a formal theory of generative predicate invention for task-level planning, and proposing SkillWrapper, a method that learns symbolic models for provably sound and complete planning. Our approach leverages foundation models to actively collect robot data and learn human-interpretable, plannable representations, using only RGB image observations. Our extensive empirical evaluation in simulation and on real robots shows that SkillWrapper learns abstract representations that enable robots to compose black-box skills to solve unseen, long-horizon tasks in the real world.

2603.24942 2026-06-09 cs.CV

BiFM: Bidirectional Flow Matching for Few-Step Image Editing and Generation

BiFM:双向流匹配用于少步图像编辑与生成

Yasong Dai, Zeeshan Hayder, David Ahmedt-Aristizabal, Hongdong Li

发表机构 * Australian National University(澳大利亚国立大学) Data61-CSIRO

AI总结 BiFM通过双向流匹配框架在单一模型中联合学习生成与反演过程,解决少步采样中正向过程近似差的问题,提升图像编辑质量与通用性。

Comments Accepted in CVPR2026

详情
AI中文摘要

近期的扩散和流匹配模型通过迭代采样逐步去噪,在图像生成与编辑方面展现出强大能力。这虽然使面向语义保持编辑的灵活反演成为可能,但少步采样对正向过程的近似较差,导致编辑质量下降。现有少步反演方法通常依赖预训练生成器和辅助模块,限制了在不同架构间的可扩展性和泛化能力。为了解决这些问题,我们提出了BiFM(双向流匹配),一个在单一模型中联合学习生成与反演的统一框架。BiFM直接估计“图像→噪声”和“噪声→图像”两个方向的平均速度场,并受共享的瞬时速度场约束,该瞬时速度场由预定义的调度或预训练的多步扩散模型导出。此外,BiFM引入了一种使用连续时间区间监督的新训练策略,并通过双向一致性目标和轻量级时间区间嵌入加以稳定。这种双向形式还支持一步反演,并可无缝集成到主流的扩散和流匹配骨干网络中。在多样化的图像编辑和生成任务中,BiFM始终优于现有的少步方法,实现了更优的性能和可编辑性。

英文摘要

Recent diffusion and flow matching models have demonstrated strong capabilities in image generation and editing by progressively removing noise through iterative sampling. While this enables flexible inversion for semantic-preserving edits, few-step sampling regimes suffer from poor forward process approximation, leading to degraded editing quality. Existing few-step inversion methods often rely on pretrained generators and auxiliary modules, limiting scalability and generalization across different architectures. To address these limitations, we propose BiFM (Bidirectional Flow Matching), a unified framework that jointly learns generation and inversion within a single model. BiFM directly estimates average velocity fields in both "image $\to$ noise" and "noise $\to$ image" directions, constrained by a shared instantaneous velocity field derived from either predefined schedules or pretrained multi-step diffusion models. Additionally, BiFM introduces a novel training strategy using continuous time-interval supervision, stabilized by a bidirectional consistency objective and a lightweight time-interval embedding. This bidirectional formulation also enables one-step inversion and can integrate seamlessly into popular diffusion and flow matching backbones. Across diverse image editing and generation tasks, BiFM consistently outperforms existing few-step approaches, achieving superior performance and editability.

2508.03453 2026-06-09 cs.CL cs.LG

Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

裁剪优于dropout作为自监督训练文本嵌入的增强策略

Rita González-Márquez, Philipp Berens, Dmitry Kobak

发表机构 * Hertie Institute for AI in Brain Health(人工智能与脑健康赫尔蒂研究所) University of Tübingen(图宾根大学) University of Tübingen, Germany(德国图宾根大学)

AI总结 本文研究了自监督微调中裁剪和dropout两种增强策略,发现裁剪在文本嵌入质量上表现更优,尤其在领域内数据中能快速生成高质量嵌入。

详情
Journal ref
Transactions on Machine Learning Research (TMLR) 2026
AI中文摘要

文本嵌入,即整个文本的向量表示,在许多NLP应用中起重要作用,如检索增强生成、聚类,或为数据探索而对文本集合进行可视化。目前,表现最佳的嵌入模型是通过监督对比微调从预训练语言模型衍生而来的。这种微调策略依赖外部的相似性概念和标注数据来生成正样本对。本文研究自监督微调,并系统比较了用于微调文本嵌入模型的两种最知名的增强策略。我们在MTEB和额外的领域内评估上评估嵌入质量,结果表明裁剪增强显著优于基于dropout的方法。我们发现,在领域外数据上,所得嵌入的质量远低于监督的最先进模型;但在领域内数据上,自监督微调经过极短的微调即可生成高质量文本嵌入。最后,我们表明表示质量朝着最后几层transformer层逐步提高,而这些层在微调过程中变化最大;并且仅微调这些最后几层就足以达到相近的嵌入质量。

英文摘要

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and annotated data for generation of positive pairs. Here we study self-supervised fine-tuning and systematically compare the two most well-known augmentation strategies used for fine-tuning text embeddings models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is substantially below the supervised state-of-the-art models, but for in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
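The abstract does not spell out how cropping is implemented, so the following is only a minimal sketch of one plausible reading: two random contiguous windows are cut from the same text and treated as a positive pair for contrastive fine-tuning. The function name `crop_pair`, the whitespace tokenization, and the crop-length bounds `min_frac` and `max_frac` are all assumptions made for illustration, not the authors' settings.

```python
import random

def crop_pair(tokens, min_frac=0.3, max_frac=0.7, rng=None):
    """Return two random contiguous crops of one token list.

    Hypothetical helper: the two crops would serve as a positive pair
    in a contrastive loss. Crop lengths are drawn as a fraction of the
    text length (bounds are assumed values, not taken from the paper).
    """
    rng = rng or random.Random()
    n = len(tokens)
    crops = []
    for _ in range(2):
        # Draw a crop length, keeping at least one token.
        length = max(1, int(n * rng.uniform(min_frac, max_frac)))
        # Draw a start index so the window stays inside the text.
        start = rng.randint(0, n - length)
        crops.append(tokens[start:start + length])
    return crops[0], crops[1]

tokens = "text embeddings are vector representations of entire texts".split()
view_a, view_b = crop_pair(tokens, rng=random.Random(0))
```

Dropout-based augmentation, by contrast, would feed the identical text through the encoder twice with different dropout masks, so the two views differ only in the network's internal noise rather than in the input.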

2512.16334 2026-06-09 cs.LG cs.AI

Pretrained battery transformer (PBT): A foundation model for battery life prediction

预训练电池Transformer(PBT):电池寿命预测的基础模型

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

发表机构 * Guangzhou Municipal Key Laboratory of Materials Informatics and Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou)(广州材料信息学与可持续能源与环境方向市重点实验室,香港科技大学(广州)) Department of Computer Science & Engineering, The Hong Kong University of Science and Technology(计算机科学与工程系,香港科技大学) Guangzhou Municipal Key Laboratory of Materials Informatics and Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)(广州材料信息学与数据科学与分析方向市重点实验室,香港科技大学(广州)) Academy of Interdisciplinary Studies, The Hong Kong University of Science and Technology(交叉学科研究院,香港科技大学) Guangzhou HKUST Fok Ying Tung Research Institute(广州市香港科大霍英东研究院) Material Genome Institute, Shanghai University(材料基因组研究所,上海大学)

AI总结 本文提出PBT模型,通过整合异构电池寿命数据,实现电池寿命预测的统一建模,显著提升预测性能。

Comments 5 figures in the main content

详情
AI中文摘要

电池循环寿命的早期预测对于改进电池设计、制造和部署至关重要。然而,尽管机器学习取得了令人鼓舞的进展,电池寿命预测仍受限于数据稀缺,以及电池化学体系、规格、化成工艺和工作条件之间显著的异质性。尽管迁移学习已被广泛用于缓解这些挑战,但由于缺乏能够整合异构电池寿命数据并为目标场景专门化提供广泛可用知识的基础模型,其效果有限。本文提出预训练电池Transformer(PBT),一种用于电池寿命预测的基础模型,它包含编码电池知识的混合专家层,以便从稀缺且异质的寿命数据中学习。PBT首先在13个锂离子电池数据集上预训练,得到编码了全面电池寿命知识的通用PBT,随后通过迁移学习适配为面向目标场景的专用PBT模型。在涵盖锂离子、钠离子和锌离子电池的15个数据集(共977个电池、528组老化条件)上,PBT实现了最先进的性能,平均超越最强竞争方法21.9%,最高提升达86.9%。据我们所知,本研究建立了首个电池寿命预测基础模型,并朝着将电池寿命预测从孤立的、场景特定的建模任务转变为可复用的知识基础迈出了一步;该基础可利用有限数据针对目标场景进行专门化,对可持续能源领域中其他数据稀缺且异质的预测问题也具有启示意义。

英文摘要

Early prediction of battery cycle life is essential for improving battery design, manufacturing and deployment. However, despite encouraging progress with machine learning, battery life prediction remains constrained by scarce data and pronounced heterogeneity across battery chemistries, specifications, formation protocols and operating conditions. Although transfer learning has been widely explored to alleviate these challenges, its effectiveness is limited by the absence of a foundation model that can integrate heterogeneous battery life data and provide broadly useful knowledge for target-scenario specialization. Here we introduce the pretrained battery transformer (PBT), a foundation model for battery life prediction that incorporates battery-knowledge-encoded mixture-of-experts layers to learn from scarce and heterogeneous lifetime data. PBT is first pretrained on 13 lithium-ion battery datasets to yield a general PBT that encodes comprehensive battery lifetime knowledge, and is then adapted through transfer learning into specialized PBT models for target scenarios. Across 15 datasets covering 977 batteries and 528 sets of aging conditions from lithium-ion, sodium-ion and zinc-ion batteries, PBT achieves state-of-the-art performance, surpassing the strongest competing method by 21.9% on average, with gains of up to 86.9%. This study establishes, to our knowledge, the first foundation model for battery life prediction and provides a step towards shifting battery lifetime prediction from isolated, scenario-specific modelling tasks to a reusable knowledge foundation that can be specialized to target scenarios with limited data, with implications for other prediction problems characterized by scarce and heterogeneous data in sustainable energy.

2501.15505 2026-06-09 cs.RO cs.CV cs.HC

Unveiling the Potential of iMarkers: Invisible Fiducial Markers for Advanced Robotics

揭示iMarkers的潜力:用于高级机器人的隐形标志物

Ali Tourani, Deniz Isinsu Avsar, Hriday Bavle, Jose Luis Sanchez-Lopez, Jan Lagerwall, Holger Voos

发表机构 * Automation and Robotics Research Group, Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg(自动化与机器人研究组,安全、可靠性与信任跨学科中心(SnT),卢森堡大学) Faculty of Science, Technology, and Medicine, University of Luxembourg(科学、技术与医学学院,卢森堡大学) Department of Physics & Materials Science, University of Luxembourg(物理与材料科学系,卢森堡大学) Institute for Advanced Studies, University of Luxembourg(先进研究学院,卢森堡大学)

AI总结 本文提出iMarkers,一种隐形标志物,可被机器人和AR设备检测,解决了传统标志物影响视觉美观的问题,展示了其在机器人应用中的灵活性和有效性。

Comments 19 pages, 10 figures, 4 tables

详情
AI中文摘要

标志物在机器人导航、物体识别和场景理解中被广泛应用。尽管为机器人和增强现实(AR)应用提供了显著优势,但它们通常会破坏环境的视觉美观,因为它们对人类可见,因此不适合许多日常使用场景。为了解决这一差距,本文提出了iMarkers,即创新的、不显眼的标志物,仅能被机器人和配备适当传感器和检测算法的AR设备检测。这些标志物在生产中具有高度灵活性,允许根据各种需求定制其可见范围和编码算法。本文还介绍了用于检测iMarkers的硬件设计和开源软件算法,突显了其在检测和识别阶段的适应性和鲁棒性。大量评估已证明iMarkers相对于传统(印刷)和混合标志物的有效性,并确认了其在多样化机器人场景中的适用性。

英文摘要

Fiducial markers are widely used in robotics for navigation, object recognition, and scene understanding. While offering significant advantages for robots and Augmented Reality (AR) applications, they often disrupt the visual aesthetics of environments, as they are visible to humans, making them unsuitable for many everyday use cases. To address this gap, this paper presents iMarkers, innovative, unobtrusive fiducial markers detectable exclusively by robots and AR devices equipped with adequate sensors and detection algorithms. These markers offer high flexibility in production, allowing customization of their visibility range and encoding algorithms to suit various demands. The paper also introduces the hardware designs and open-sourced software algorithms developed for detecting iMarkers, highlighting their adaptability and robustness in the detection and recognition stages. Numerous evaluations have demonstrated the effectiveness of iMarkers relative to conventional (printed) and blended fiducial markers and have confirmed their applicability across diverse robotics scenarios.

2508.15030 2026-06-09 cs.AI

Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

Collab-REC:一种基于LLM的代理框架,用于平衡旅游推荐

Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, Yashar Deldjoo

发表机构 * Technical University of Munich(慕尼黑工业大学) Polytechnic University of Bari(巴里理工大学)

AI总结 提出一种多代理框架Collab-REC,通过三个LLM代理(个性化、流行度、可持续性)生成城市建议,并由非LLM调节器迭代优化,以缓解流行度偏差并提高推荐多样性。

详情
AI中文摘要

我们提出了COLLAB-REC,一个多代理框架,旨在抵消流行度偏差并提高旅游推荐的多样性。在我们的设置中,三个基于LLM的代理(个性化、流行度和可持续性)从不同角度生成城市建议。然后,一个非LLM调节器通过迭代约束优化合并并完善这些提议,确保每个代理的观点得到体现,同时减少虚假或重复输出。使用不同规模和模型家族的LLM对欧洲城市查询进行的大量离线实验表明,与单代理基线相比,COLLAB-REC提高了多样性和整体相关性,同时揭示了经常被忽视的较少访问的目的地。这种平衡的、上下文感知的方法更好地捕捉了更广泛的用户和系统级考虑因素,凸显了多利益相关者协作在LLM驱动的推荐系统中的潜力。代码、数据和其他工件可在此处获取:https://github.com/ashmibanerjee/collab-rec,而使用的提示包含在附录中。

英文摘要

We propose COLLAB-REC, a multi-agent framework designed to counteract popularity bias and improve diversity in tourism recommendations. In our setup, three LLM-based agents (Personalization, Popularity, and Sustainability) generate city suggestions from different perspectives. A non-LLM moderator then merges and refines these proposals through iterative constrained refinement, ensuring that each agent's viewpoint is represented while reducing spurious or repeated outputs. Extensive offline experiments on European city queries using LLMs of different sizes and model families show that COLLAB-REC improves both diversity and overall relevance compared to a single-agent baseline, while surfacing lesser-visited destinations that are often overlooked. This balanced, context-aware approach better captures a broader range of user and system-level considerations, highlighting the potential of multi-stakeholder collaboration in LLM-driven recommender systems. Code, data, and other artifacts are available here: https://github.com/ashmibanerjee/collab-rec, while the prompts used are included in the appendix.

2508.02197 2026-06-09 cs.AI

A Message Passing Realization of Expected Free Energy Minimization

期望自由能最小化的信息传递实现

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

发表机构 * Eindhoven University of Technology, 5612 AP Eindhoven, the Netherlands(埃因霍温理工大学,荷兰埃因霍温5612 AP) GN Hearing, 5612 AB Eindhoven, The Netherlands(GN听力,荷兰埃因霍温5612 AB)

AI总结 本文提出基于因子图的期望自由能最小化信息传递方法,通过将期望自由能最小化转化为变分自由能最小化问题,实现高效策略推断,并在存在epistemic不确定性环境中验证了其有效性。

详情
Journal ref
In: International Workshop on Active Inference, pp. 69-84. Springer, Cham, 2022
AI中文摘要

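我们提出了一种在因子图上进行期望自由能(EFE)最小化的信息传递方法,其理论基础来自arXiv:2504.14898。通过将EFE最小化重新表述为带有认知先验(epistemic priors)的变分自由能最小化,我们把组合搜索问题转化为可用标准变分技术求解的易处理推断问题。将该信息传递方法应用于因子化的状态空间模型,可实现高效的策略推断。我们在具有认知不确定性的环境中评估了该方法:一个随机网格世界和一个部分可观测的Minigrid任务。采用该方法的智能体在这些任务上始终优于传统的KL控制智能体,在不确定性下表现出更稳健的规划和更高效的探索。在随机网格世界环境中,最小化EFE的智能体会避开高风险路径;在部分可观测的Minigrid设置中,它们会进行更系统的信息搜集。该方法将主动推断理论与实际实现联系起来,为认知先验在人工智能体中的有效性提供了实证证据。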

英文摘要

We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.

2309.10370 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

浅层神经网络的几何结构与构造性${\mathcal L}^2$成本最小化

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文研究浅层ReLU网络在欠参数化情况下的成本最小化问题,通过构造上界揭示分类数据的几何结构,不依赖梯度下降。证明了成本函数最小值的上界与训练数据信噪比相关,并确定了特定子空间的构造性训练网络。

Comments AMS Latex, 29 pages. Experimental evidence added. To appear in Physica D: Nonlinear Phenomena

详情
Journal ref
Phys. D, 490, Article No. 135176 (2026)
AI中文摘要

本文通过显式构造上界,探讨欠参数化浅层ReLU网络中成本(损失)最小化问题,不使用梯度下降方法。重点在于阐明近似和精确极小值的几何结构。考虑$L^2$成本函数,输入空间$\mathbb{R}^M$,输出空间${\mathbb R}^Q$,其中$Q\leq M$,训练输入样本大小可任意大。证明了成本函数最小值的上界为$O(\delta_P)$,其中$\delta_P$衡量训练数据的信噪比。在特殊情况$M=Q$下,显式确定了成本函数的精确退化局部极小值,并显示该精确值与$Q\leq M$时获得的上界相比,相对误差为$O(\delta_P^2)$。上界的证明给出了一个构造性训练的网络;我们证明该网络度量化了输入空间$\mathbb{R}^M$中的特定$Q$维子空间。我们还评论了在给定上下文中成本函数全局极小值的特征化问题。

英文摘要

In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $L^2$ cost function, input space $\mathbb{R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

2507.23592 2026-06-09 cs.RO cs.HC cs.SY eess.SY

Human-Exoskeleton Kinematic Calibration to Improve Hand Tracking for Dexterous Teleoperation

人-外骨骼运动学校准以提高手部跟踪用于灵巧遥操作

Haiyun Zhang, Stefano Dalla Gasperina, Saad N. Yousaf, Toshimitsu Tsuboi, Tetsuya Narita, Ashish D. Deshpande

发表机构 * Walker Department of Mechanical Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校沃克机械工程系) Sony Group Corporation, Tokyo, Japan(索尼集团公司,日本东京) Meta Reality Labs Research, Redmond, WA, USA(Meta现实实验室研究部,美国华盛顿州雷德蒙德)

AI总结 本文提出一种针对手部外骨骼的个性化校准框架,通过残差加权优化估计虚拟链接参数,减少关节和指尖跟踪误差,提升遥操作精度。

Comments 8 pages, 10 figures, 1 supplementary video, submitted to RA-L

详情
AI中文摘要

手部外骨骼是实现灵巧遥操作和沉浸式操作界面的关键工具,但由于用户特定的解剖差异和穿戴不一致,实现准确的手部跟踪仍是一项挑战。这些问题会导致运动学错位,降低跟踪性能并限制其在精密任务中的适用性。本文提出了一种面向基于外骨骼的手部跟踪的个体化校准框架,通过残差加权优化估计虚拟连杆参数。我们引入一种数据驱动方法,利用动作捕捉真值经验性地调整代价函数权重,实现跨用户的准确且一致的校准。该方法在Maestro手部外骨骼上实现,并在七名健康参与者身上进行了验证,在多样的手部几何形态下显著降低了关节和指尖跟踪误差。基于Unity虚拟手的定性可视化进一步展示了运动保真度的提升。所提框架可推广到具有闭环运动学和极少传感的外骨骼,为高保真遥操作和机器人学习应用奠定了基础。

英文摘要

Hand exoskeletons are critical tools for dexterous teleoperation and immersive manipulation interfaces, but achieving accurate hand tracking remains a challenge due to user-specific anatomical variability and donning inconsistencies. These issues lead to kinematic misalignments that degrade tracking performance and limit applicability in precision tasks. We propose a subject-specific calibration framework for exoskeleton-based hand tracking that estimates virtual link parameters through residual-weighted optimization. A data-driven approach is introduced to empirically tune cost function weights using motion capture ground truth, enabling accurate and consistent calibration across users. Implemented on the Maestro hand exoskeleton with seven healthy participants, the method achieved substantial reductions in joint and fingertip tracking errors across diverse hand geometries. Qualitative visualizations using a Unity-based virtual hand further demonstrate improved motion fidelity. The proposed framework generalizes to exoskeletons with closed-loop kinematics and minimal sensing, laying the foundation for high-fidelity teleoperation and robot learning applications.

2506.03106 2026-06-09 cs.CL cs.AI

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Critique-GRPO:通过自然语言和数值反馈提升大语言模型推理能力

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

发表机构 * HCCL, The Chinese University of Hong Kong, Hong Kong, China(香港中文大学人机通讯实验室,香港,中国) University of Cambridge, Cambridge, United Kingdom(剑桥大学,剑桥,英国) MMLab, The Chinese University of Hong Kong, Hong Kong, China(香港中文大学多媒体实验室,香港,中国) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,上海,中国)

AI总结 本文提出Critique-GRPO框架,结合自然语言和数值反馈提升LLM推理能力,实验显示其在多个任务中优于传统方法,显著提升推理性能。

Comments Accepted by ICML 2026 Spotlight

详情
AI中文摘要

近期利用数值奖励的强化学习(RL)进展显著增强了大语言模型(LLM)的复杂推理能力。然而,我们发现纯数值反馈存在三个根本限制:性能进入平台期、自发的自我反思无效以及持续性失败。我们证明,在获得自然语言批评后,陷入平台期的RL模型能够成功改进失败的解答。受此启发,我们提出Critique-GRPO,一种整合自然语言反馈和数值反馈进行策略优化的在线RL框架。该方法使LLM能够同时从初始响应和批评引导的改进中学习,有效内化两个阶段的探索收益。大量实验显示,Critique-GRPO优于所有参与比较的监督微调和基于RL的微调方法,在八个具有挑战性的推理任务上,各种Qwen模型的平均Pass@1提升约+15.0-21.6%,Llama-3.2-3B-Instruct提升约+7.3%。值得注意的是,Critique-GRPO通过自我批评实现有效的自我改进,相较于GRPO取得显著提升,例如在AIME 2024上Pass@1提升+16.7%。代码和模型已发布:https://github.com/zhangxy-2019/critique-GRPO

英文摘要

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024. The code and models are released at: https://github.com/zhangxy-2019/critique-GRPO
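For readers unfamiliar with GRPO, the numerical side of the feedback can be sketched as a group-relative advantage: each sampled response's reward is normalized by the mean and standard deviation of its group. The snippet below is a rough illustration under stated assumptions, not the released implementation: the function name `group_relative_advantages` is hypothetical, population standard deviation is one common choice among several, and pooling initial responses with critique-guided refinements in a single group is an assumption, since the abstract does not say how the two sets are combined.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Standard GRPO-style advantage: (r - mean) / (std + eps).
    Population std is used here as an assumed convention.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy rewards (1.0 = correct, 0.0 = incorrect); the values are made up.
initial_rewards = [0.0, 0.0, 1.0, 0.0]   # first-attempt responses
refined_rewards = [1.0, 1.0]             # critique-guided refinements
# Assumption: both sets are pooled into one group before normalization.
advantages = group_relative_advantages(initial_rewards + refined_rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced, while those below the mean are pushed down.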

2412.11439 2026-06-09 cs.LG cs.AI physics.chem-ph

Sampling Out-of-Distribution Chemical Spaces via Bayesian Flow

通过贝叶斯流采样分布外化学空间

Nianze Tao, Minori Abe

发表机构 * Hiroshima University(广岛大学) Tokyo University of Agriculture(东京农业大学)

AI总结 本文提出利用贝叶斯流网络生成高质量分布外分子,通过强化学习策略和可控的类微分方程求解器生成过程提升采样效率,并引入半自回归策略提升模型性能。

Comments 35 pages, 14 figures, 9 tables

详情
AI中文摘要

生成性质优于训练空间的新型分子,即分布外生成,对从头药物设计至关重要。然而,基于分布学习的模型(如扩散模型)难以应对这一挑战,因为这些方法旨在尽可能贴近训练数据的分布。在本文中,我们证明贝叶斯流网络,特别是ChemBFN模型,能够内在地生成适用于多种场景的高质量分布外样本。我们向ChemBFN添加了强化学习策略,并采用可控的、类似常微分方程求解器的生成过程以加速采样。最重要的是,我们在训练和推理过程中引入了半自回归策略,提升了模型性能并超越了最先进的模型。此外,本文还给出了采用半自回归方法的ChemBFN中分布外生成的理论分析。

英文摘要

Generating novel molecules with higher property values than those in the training space, namely out-of-distribution generation, is important for de novo drug design. However, it is not easy for distribution learning-based models, for example diffusion models, to solve this challenge, as these methods are designed to fit the distribution of the training data as closely as possible. In this paper, we show that Bayesian flow networks, especially the ChemBFN model, are capable of intrinsically generating high-quality out-of-distribution samples that meet several scenarios. A reinforcement learning strategy is added to ChemBFN, and a controllable ordinary differential equation solver-like generating process is employed that accelerates the sampling process. Most importantly, we introduce a semi-autoregressive strategy during training and inference that enhances the model performance and surpasses the state-of-the-art models. A theoretical analysis of out-of-distribution generation in ChemBFN with the semi-autoregressive approach is included as well.

2602.13271 2026-06-09 cs.AI cs.HC cs.LG

Human-Centered Explainable AI for Security Enhancement: A Deep Intrusion Detection Framework

面向安全增强的人本可解释AI:一种深度入侵检测框架

Md Muntasir Jahid Ayan, Md. Shahriar Rashid, Tazzina Afroze Hassan, Hossain Md. Mubashshir Jamil, Mahbubul Islam, Lisan Al Amin, Rupak Kumar Das, Farzana Akter, Faisal Quader

发表机构 * Department of Computer Science and Engineering, United International University (UIU), Dhaka 1212, Bangladesh(计算机科学与工程系,国际联合大学(UIU),达卡1212,孟加拉国) Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur 1704, Bangladesh(电气与电子工程系,伊斯兰科技大学,加济布尔1704,孟加拉国) Department of Computer Science and Engineering (CSE), University of Asia Pacific (UAP), Dhaka 1207, Bangladesh(计算机科学与工程系(CSE),亚洲太平洋大学(UAP),达卡1207,孟加拉国) Department of Information Systems, University of Maryland, Baltimore, 21250, Maryland, USA(信息系统系,马里兰大学,巴尔的摩,21250,美国) College Of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA(信息科学与技术学院,宾夕法尼亚州立大学,大学公园,PA 16802,美国) Department of Information Technology, Washington University of Science and Technology, Alexandria, VA(信息技术系,华盛顿科技大学,亚历山大,VA) College of Engineering and Information Technology, University of Maryland, College Park, 20742, Maryland, USA(工程与信息技术学院,马里兰大学,学院公园,20742,美国)

AI总结 本文提出一种结合可解释AI的深度入侵检测框架,利用CNN和LSTM捕捉流量序列的时间依赖性,通过SHAP实现模型可解释性,提升安全分析的透明度与可靠性。

详情
AI中文摘要

随着网络威胁的复杂性和频率增加,需要准确且可解释的入侵检测系统(IDS)。本文提出了一种新颖的IDS框架,整合可解释人工智能(XAI)以增强深度学习模型的透明性。该框架在NSL-KDD基准数据集上进行实验评估,显示优于传统IDS和黑箱深度学习模型。所提方法结合卷积神经网络(CNN)和长短期记忆网络(LSTM)以捕捉流量序列的时间依赖性。深度学习结果表明,CNN和LSTM的准确率均达到0.99,其中LSTM在宏平均精度、召回率和F-1分数上优于CNN。对于加权平均精度、召回率和F-1分数,两种模型得分几乎相同。为确保可解释性,XAI模型SHapley Additive exPlanations(SHAP)被纳入,使安全分析师能够理解和验证模型决策。SHAP指出,srv_serror_rate、dst_host_srv_serror_rate和serror_rate是两个模型中的一些重要特征。我们还基于IPIP6和Big Five人格特质进行了以信任为导向的专家调查,通过交互式UI评估系统的可靠性和可用性。本工作强调了在网络安全解决方案中结合性能和透明性的潜力,并通过自适应学习推荐未来改进以实现实时威胁检测。

英文摘要

The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presented a novel IDS framework that integrated Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework was evaluated experimentally using the benchmark dataset NSL-KDD, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks for capturing temporal dependencies in traffic sequences. Our deep learning results showed that both CNN and LSTM reached 0.99 for accuracy, whereas LSTM outperformed CNN at macro average precision, recall, and F-1 score. For weighted average precision, recall, and F-1 score, both models scored almost similarly. To ensure interpretability, the XAI model SHapley Additive exPlanations (SHAP) was incorporated, enabling security analysts to understand and validate model decisions. Some notable influential features were srv_serror_rate, dst_host_srv_serror_rate, and serror_rate for both models, as pointed out by SHAP. We also conducted a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system's reliability and usability. This work highlighted the potential of combining performance and transparency in cybersecurity solutions and recommended future enhancements through adaptive learning for real-time threat detection.

2602.05027 2026-06-09 cs.SD cs.AI

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

AudioSAE:利用稀疏自编码器理解音频处理模型

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

发表机构 * Huawei Noah’s Ark Lab(华为诺亚方舟实验室)

AI总结 本文在Whisper和HuBERT的编码器层训练稀疏自编码器(SAE),评估其稳定性和可解释性,并展示其在特征解耦、概念擦除、语音检测优化及与人类脑电活动对齐方面的实用价值。

Comments Accepted to EACL 2026, main track

详情
Journal ref
Proceedings of EACL 2026, pages 3221-3254
AI中文摘要

稀疏自编码器(SAE)是解释神经表征的强大工具,但它们在音频领域的应用尚未充分探索。我们在Whisper和HuBERT的所有编码器层训练SAE,对其稳定性、可解释性进行了广泛评估,并展示了其实用性。超过50%的特征在随机种子间保持一致,且重建质量得以保持。SAE特征捕获了通用声学和语义信息以及特定事件,包括环境噪声和副语言声音(如笑声、低语),并有效解耦它们,仅需移除19-27%的特征即可擦除一个概念。特征引导将Whisper的虚假语音检测降低了70%,且词错误率(WER)增加可忽略不计,展示了实际应用价值。最后,我们发现SAE特征与语音感知过程中的人类脑电活动相关,表明其与人类神经处理的对齐。代码和检查点可在https://github.com/audiosae/audiosae_demo获取。

英文摘要

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, extensively evaluate their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
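To make the idea of erasing a concept by removing SAE features concrete, here is a toy sketch in plain Python. It assumes a basic ReLU-encoder, linear-decoder SAE; the abstract does not state which SAE variant is used, and the weights, dimensions, and helper names (`sae_encode`, `sae_decode`, `erase_features`) are all hypothetical and unrelated to the released checkpoints.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Sparse code: ReLU(W_enc @ x + b_enc)."""
    pre = [h + b for h, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(z, W_dec, b_dec):
    """Reconstruct the activation: W_dec @ z + b_dec."""
    return [h + b for h, b in zip(matvec(W_dec, z), b_dec)]

def erase_features(z, indices):
    """Zero the selected features before decoding (concept-erasure sketch)."""
    return [0.0 if i in indices else v for i, v in enumerate(z)]

# Hand-picked toy weights: 2-dim activation, 3 SAE features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

x = [2.0, 3.0]
z = sae_encode(x, W_enc, b_enc)                           # only two features fire
x_hat = sae_decode(z, W_dec, b_dec)                       # reconstruction
x_erased = sae_decode(erase_features(z, {1}), W_dec, b_dec)  # feature 1 removed
```

In this reading, steering would scale selected entries of `z` instead of zeroing them before the reconstruction is passed on to later layers.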

2602.01880 2026-06-09 cs.RO

Multimodal Large Language Models for Real-Time Situated Reasoning

多模态大语言模型用于实时情境推理

Giulio Antonio Abbo, Senne Lenaerts, Tony Belpaeme

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文探讨多模态大语言模型如何支持实时情境和价值感知决策,结合GPT-4o与模拟智能扫地机器人平台,展示其在家庭活动、社会规范和用户偏好推理中的能力,以及在清洁、舒适和安全等价值上的细致决策。

Comments Submitted to the interactivity track of the 21st ACM/IEEE International Conference on Human-Robot Interaction on December 2025, accepted January 2026

详情
Journal ref
HRI Companion 2026: Companion Proceedings of the 21st ACM/IEEE International Conference on Human-Robot Interaction
AI中文摘要

在本工作中,我们探讨多模态大语言模型如何支持实时的情境感知与价值感知决策。为此,我们将GPT-4o语言模型与TurtleBot 4平台结合,后者模拟家庭中的智能扫地机器人。模型通过视觉输入评估环境,并判断是否适合启动清洁。该系统展示了这些模型对家庭活动、社会规范和用户偏好进行推理的能力,并能做出与相关人员的价值观(如清洁、舒适和安全)一致的细致决策。我们在真实的家庭环境中演示了该系统,展示了其从有限视觉输入中推断情境和价值的能力。我们的结果突显了多模态大语言模型在增强机器人自主性和情境感知方面的潜力,同时也指出了与一致性、偏见和实时性能相关的挑战。

英文摘要

In this work, we explore how multimodal large language models can support real-time context- and value-aware decision-making. To do so, we combine the GPT-4o language model with a TurtleBot 4 platform simulating a smart vacuum cleaning robot in a home. The model evaluates the environment through vision input and determines whether it is appropriate to initiate cleaning. The system highlights the ability of these models to reason about domestic activities, social norms, and user preferences and take nuanced decisions aligned with the values of the people involved, such as cleanliness, comfort, and safety. We demonstrate the system in a realistic home environment, showing its ability to infer context and values from limited visual input. Our results highlight the promise of multimodal large language models in enhancing robotic autonomy and situational awareness, while also underscoring challenges related to consistency, bias, and real-time performance.

2511.10500 2026-06-09 cs.CV

Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising

可学习总变分与Lambda映射用于低剂量CT去噪

Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim

发表机构 * University of Michigan(密歇根大学)

AI总结 本文提出可学习总变分框架,通过结合展开的总变分求解器与LambdaNet预测像素级正则化图,实现空间自适应平滑,实验显示在低剂量CT去噪中优于传统TV和FBP+U-Net。

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)
AI中文摘要

尽管总变分(TV)在噪声抑制和边缘保持方面表现出色,但其对标量正则化参数的依赖限制了自适应性。在本研究中,我们提出了一种可学习总变分(LTV)框架,将展开的TV求解器与预测逐像素正则化图的LambdaNet相结合。所提出的框架以端到端方式训练,联合优化重建与正则化,从而实现空间自适应平滑。在DeepLesion数据集上使用逼真的LoDoPaB-CT模拟进行的实验表明,LTV相对传统TV和FBP+U-Net取得了一致的提升,PSNR最高提升3.7 dB,SSIM相对提升8%。LTV为低剂量CT去噪提供了一种可替代黑箱CNN的可解释方案。

英文摘要

While Total Variation (TV) excels in noise reduction and edge preservation, its reliance on a scalar regularization parameter limits adaptivity. In this study, we present a Learnable Total Variation (LTV) framework coupling an unrolled TV solver with a LambdaNet that predicts a per-pixel regularization map. The proposed framework is trained end-to-end to optimize reconstruction and regularization jointly, yielding spatially adaptive smoothing. Experiments on the DeepLesion dataset, using realistic LoDoPaB-CT simulation, show consistent gains over classical TV and FBP+U-Net, achieving up to +3.7 dB PSNR and 8% relative SSIM improvement. LTV provides an interpretable alternative to black-box CNNs for low-dose CT denoising.
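The effect of a per-pixel regularization map can be illustrated on a 1-D signal. The sketch below is not the paper's solver: it assumes a smoothed (Charbonnier-style) TV penalty, plain gradient steps standing in for the unrolled iterations, and a hand-set weight list `lam` standing in for what LambdaNet would predict. The function name `weighted_tv_denoise` and all numeric settings are illustrative assumptions.

```python
import math

def weighted_tv_denoise(y, lam, steps=200, lr=0.1, eps=0.01):
    """Minimize 0.5*sum((x-y)^2) + sum(lam[i]*sqrt((x[i+1]-x[i])^2 + eps)).

    lam[i] weights the difference between samples i and i+1, so a small
    value preserves an edge there and a large value smooths across it.
    """
    x = list(y)
    n = len(x)
    for _ in range(steps):
        # Gradient of the data-fidelity term.
        grad = [xi - yi for xi, yi in zip(x, y)]
        # Gradient of the spatially weighted, smoothed TV term.
        for i in range(n - 1):
            d = x[i + 1] - x[i]
            g = lam[i] * d / math.sqrt(d * d + eps)
            grad[i] -= g
            grad[i + 1] += g
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

def total_variation(v):
    """Sum of absolute neighbor differences."""
    return sum(abs(b - a) for a, b in zip(v, v[1:]))

# Noisy step signal; the weight is zero at the true edge (between index 3 and 4).
y = [0.1, -0.1, 0.1, -0.1, 1.1, 0.9, 1.1, 0.9]
lam = [0.2, 0.2, 0.2, 0.0, 0.2, 0.2, 0.2]
x = weighted_tv_denoise(y, lam)
```

With a single scalar weight, the same strength would apply at the edge and in the flat regions; the spatially varying `lam` lets the flat regions be smoothed while the jump is left intact, which is the adaptivity the abstract refers to.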