arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.27561 2026-05-28 cs.CV cs.AI

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Elena Sergeevna Kozachok, Sergey Sergeevich Seregin

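AI summary: This study proposes a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and reports a prospective single-centre clinical validation of the Melanoscope AI CDSS in Russian outpatient practice, with no false negatives and a specificity of 88.3%.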

Comments 24 pages, 6 figures, 5 tables, 21 references

English abstract

Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, but model interpretability and standardised patient routing remain key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four "Melanoma Day" sessions (Orel, Russia, June 2025 - April 2026). Results. In 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (sensitivity 100%; 95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi were placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P<0.15 / 0.15-0.50 / >=0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.
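
The quantities in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the zone labels ("low", "intermediate", "high") are hypothetical, the masks are assumed to be already binarised (the abstract does not say how activation maps are thresholded), and the interval method is an assumption. The reported lower bound of 47.8% is consistent with an exact Clopper-Pearson interval for 5 of 5 detected lesions, but the abstract does not name the method.

```python
from typing import Sequence, Tuple


def iou(mask_a: Sequence[Sequence[int]], mask_b: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks with the same shape."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 0.0


def route(p: float) -> str:
    """Three-zone routing using the thresholds from the abstract.

    The zone names are hypothetical; only the cut-offs 0.15 and 0.50 are given.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p < 0.15:
        return "low"
    if p < 0.50:
        return "intermediate"
    return "high"


def sensitivity_ci_all_detected(n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact (Clopper-Pearson) interval when all n positive cases are detected.

    With x = n successes the upper bound is 1 and the lower bound has the
    closed form (alpha / 2) ** (1 / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return ((alpha / 2) ** (1.0 / n), 1.0)


lower, upper = sensitivity_ci_all_detected(5)
print(f"95% CI for 5/5 detected: {lower:.1%} to {upper:.1%}")
```

The wide interval is a reminder that five malignant lesions give only weak evidence about sensitivity, which matches the abstract's own description of the validation as preliminary.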

2605.27551 2026-05-28 cs.AI cs.CR cs.IR cs.MM

On the Origin of Synthetic Information by Means of Steganographic Inheritance

Ching-Chun Chang, Isao Echizen

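AI summary: To address the difficulty of tracing the provenance of synthetic information, the paper proposes a heredity-like mechanism based on steganography that embeds traceable lineage traits, enabling parentage identification for synthetic information; theoretical analysis and experiments support the viability of the method.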

English abstract

The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring's life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.
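
The parentage query described above (extract a trait, compare it with candidate parents' traits, nominate the closest) can be sketched as a nearest-match lookup. This is only an illustration under stated assumptions: traits are taken to be real-valued vectors, cosine similarity is a stand-in for whatever comparison the paper uses, and `nominate_parent` and the pool layout are hypothetical names.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nominate_parent(
    extracted_trait: Sequence[float],
    reference_pool: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the candidate whose trait is most similar to the extracted one.

    `extracted_trait` plays the role of the steganographic decoder's output;
    `reference_pool` maps hypothetical parent identifiers to projector traits.
    """
    if not reference_pool:
        raise ValueError("reference pool is empty")
    best_id = max(reference_pool, key=lambda k: cosine(extracted_trait, reference_pool[k]))
    return best_id, cosine(extracted_trait, reference_pool[best_id])


pool = {"parent_a": [1.0, 0.0, 0.0], "parent_b": [0.0, 1.0, 0.0]}
# A noisy trait, as might survive processing of the offspring.
print(nominate_parent([0.9, 0.1, 0.0], pool))
```

In this framing, accuracy depends on how well the projector separates candidate parents and on how much the extracted trait degrades, which mirrors the abstract's statement that accuracy is a function of projector and stegosystem properties.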

2605.27546 2026-05-28 cs.CL cs.HC

Keyphrase Generative Representation of Youth Crisis Conversations Beyond Static Taxonomies

Abeer Badawi, Will Aitken, Lydia Sequeira, Jocelyn Rankin, Maia Norman, Elham Dolatabadi

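AI summary: The paper proposes Keyphrase Generative Representation (KGR), which constrains a large language model to generate conversation-specific keyphrases, and expands the original 19-label taxonomy into a 39-label hierarchy. In an evaluation on 129 conversations and 387 expert annotations, the expanded taxonomy reaches an accuracy of 0.96; KGR surfaces identity-linked themes missing from the fixed taxonomy and raises topic-retrieval accuracy from 0.25 to 0.70.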

English abstract

Crisis Responders (CRs) rapidly assess thousands of youth SMS conversations each year to identify mental health concerns and guide support. Yet youth distress is increasingly expressed through evolving and context-specific language that often does not fit fixed-label taxonomies. This work analyzed 703,975 de-identified Kids Help Phone conversations (2018-2023) and expanded KHP's 19-label issue taxonomy into a 39-label hierarchical schema. We then introduce Keyphrase Generative Representation (KGR), a constrained LLM generating concise, conversation-specific keyphrases, evaluated across 129 conversations and 387 expert annotations. The expanded taxonomy achieved expert consensus reliability, with an accuracy of 0.96, and expert review found that 81% of keyphrases accurately reflected content and 74% improved clarity. KGR surfaced identity-linked themes absent from the fixed taxonomy, including immigration problems and caregiver burden, and supported a topic-retrieval workflow that increased accuracy from 0.25 to 0.70 (+0.45) over the manual analyst process. KGR marks a shift toward hybrid, interpretable generative representations that extend crisis response beyond static taxonomies to surface emerging and culturally grounded patterns of youth distress.

2605.27545 2026-05-28 cs.CL

PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI

Snehasis Mukhopadhyay

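AI summary: Proposes the PAST2HARM framework, which uses past-tense reformulation and iterative escalation to systematically exploit safety vulnerabilities in multimodal text-to-image models, achieving high-success-rate jailbreaks in a black-box, gradient-free setting.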

English abstract

Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state-of-the-art multimodal text-to-image models. Building on prior findings that past-tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid-conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models (Gemini Nano Banana Pro, GPT Image 2, and SD XL), achieving attack success rates of 83%, 67%, and 100%, respectively, in a black-box, gradient-free setting. Adversarial prompts also transfer across models, with cross-model success rates above 50%. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self-harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training.

2605.27541 2026-05-28 cs.LG

SparseOpt: Addressing Normalization-induced Gradient Skew in Sparse Training

Mohammed Adnan, Rohan Jain, Tom Jacobs, Ekansh Sharma, Rahul G. Krishnan, Rebekka Burkholz, Yani Ioannou

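AI summary: To address the slow convergence of dynamic sparse training, the paper analyses the adverse effect of Batch Normalization on sparse training and proposes SparseOpt, a sparsity-aware optimizer that yields faster convergence and better generalization.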

Comments Accepted at the International Conference on Machine Learning (ICML) 2026

English abstract

Dynamic Sparse Training (DST) methods train neural networks by maintaining sparsity while dynamically adapting the network topology. Despite the promise of reduced computation, DST methods converge significantly slower than dense training, often requiring comparable training time to achieve similar accuracy. We demonstrate both analytically and empirically that Batch Normalization (BN) adversely affects sparse training, and propose SparseOpt, a sparsity-aware optimizer, to address this. Experiments on ResNet models across CIFAR-100 and ImageNet demonstrate consistently faster convergence and improved generalization with our proposed method. Our work highlights the limitations of current normalization layers in sparse training and provides the first systematic study of the interaction between Batch Normalization, sparse layers, and DST, taking a significant step toward making DST practically competitive with dense training.

2605.27539 2026-05-28 cs.RO

Synthetic Emotions vs. Gamification: Exploring Engagement Strategies for Small Social Robots in Different Age Groups

Morten Roed Frederiksen, Kasper Støy

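AI summary: Through two studies (a preference assessment with children aged 6-8 and a behavioral study with university students aged 20-27), this work compares two engagement strategies for a tactile robot: synthetic emotional feedback and point rewards. The children preferred emotional engagement, whereas the university students achieved higher task accuracy and sustained performance under the points system, pointing to age-related differences in engagement strategy effectiveness.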

Comments 7 pages

English abstract

Many children experience challenges in emotional regulation and social interaction, which can limit their participation in everyday activities and therapeutic programs. For socially assistive robots to be effective in this context, it is essential that children remain consistently and meaningfully engaged. We explore engagement strategies for a tactile robot designed to support children suffering from anxiety disorders through daily interactions. The robot delivers either synthetic emotional feedback or point rewards to encourage user participation. We evaluated these strategies through two studies: a preference assessment with 16 school children aged 6-8 years, and a behavioral study with 14 university students aged 20-27 years in naturalistic environments. The study with school children indicated a preference for emotional engagement over points-based approaches. The follow-up study with university students across a full day of interactions revealed contrasting results: points-based systems produced significantly higher task accuracy (p < 0.05) and sustained performance over time. Findings from different user groups suggest that stated preferences and behavioral outcomes can diverge depending on engagement context, highlighting the importance of validating design assumptions through observed interaction. This work contributes insights into age-related differences in engagement strategy effectiveness in human-robot interaction design.

2605.27533 2026-05-28 cs.RO

Inducing Calmness With Pocket-Sized Robotics: Reducing Movement and Heart Rate in Children through Hand-Held Tactile Interactions

Morten Roed Frederiksen, Kasper Støy, Maja Matarić

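AI summary: Using a rhythmic vibration-matching game on a hand-held tactile device, this study finds that tactile interaction significantly reduces children's physiological arousal (heart rate down by 3.56 bpm) and physical restlessness (overall movement down by 38%), promoting calm and focused states.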

Comments 34 pages, 2 tables, 7 figures

English abstract

Periods of heightened arousal or restlessness can interfere with children's ability to focus, self-regulate, and remain physically calm. Technologies that encourage embodied self-regulation through tactile interaction may provide a simple and accessible means of promoting calmness. This paper investigates how interaction with a pocket-sized tactile device influences physiological and behavioral markers of calmness in typically developing children. Building on prior work examining heart rate modulation, we present new findings on how tactile interaction affects full-body movement and postural stability. We employ a device that engages children through a hand-held rhythmic vibration-matching game, designed to focus attention and encourage stillness. Eighteen children participated in a within-subjects study with two conditions, with and without tactile interaction with a hand-held device, while their heart rate and body movement were recorded. Results show that the tactile game interaction reduced physiological arousal (heart rate decreased by 3.56 bpm, p < 0.01) and physical restlessness (overall movement decreased by 38%, p < 0.05), with attention-related body regions showing the greatest change toward stillness (45% reduction in movement). These findings demonstrate that brief tactile game-like engagement with a hand-held device can down-regulate physiological activation, promoting calm and focused states that support sustained attention and behavior regulation.

2605.27532 2026-05-28 cs.RO

SCALE-COMM: Shared, Contrastively-Aligned Latent Embeddings for MARL Communication

Mahmoud Abouelyazid, Eman Hammad

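AI summary: Proposes SCALE-COMM, a framework that uses self-supervised learning to obtain compact, stable latent communication representations, decoupling communication learning from policy optimization and improving the stability and sample efficiency of multi-agent coordination.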

Comments IEEE IV 2026

English abstract

Emergent communication enables Autonomous Mobile Robots (AMRs) operating under partial observability to coordinate effectively in decentralized multi-agent reinforcement learning (MARL) settings. However, existing approaches often struggle with unstable communication protocols, ungrounded message semantics, and interference between communication learning and policy optimization, leading to degraded coordination over time. We propose SCALE-COMM (Shared, Contrastively-Aligned Latent Embeddings for COMMunication), a self-supervised framework for learning compact, stable, and policy-relevant communication representations. SCALE-COMM decouples communication learning from policy optimization by training low-dimensional latent messages that capture task-relevant planning and traffic information, while enforcing consistency across agents and time. Across standard MARL benchmarks and a realistic warehouse coordination task, SCALE-COMM consistently outperforms existing communication frameworks in both representation quality and task performance. The learned communication space yields improved stability, sample efficiency, and throughput under policy fine-tuning, demonstrating the effectiveness of representation-driven communication for scalable multi-agent coordination.
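
The abstract does not state which contrastive objective SCALE-COMM uses. As a generic illustration of what "contrastively aligned" embeddings usually means, the sketch below computes an InfoNCE-style loss in plain Python: each anchor message should be closer to its own positive (for example, another agent's embedding of the same situation, which is an assumption here) than to the other messages in the batch. The function name and the temperature value are hypothetical.

```python
import math
from typing import List, Sequence


def info_nce(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    All other positives[j] (j != i) in the batch act as negatives.
    """
    def normalize(v: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else list(v)

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # matched pairs give the lower loss
```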

2605.27499 2026-05-28 cs.LG astro-ph.CO astro-ph.IM physics.comp-ph stat.ML

GenSBI: Generative Methods for Simulation-Based Inference in JAX

Aurelio Amerio

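AI summary: Presents GenSBI, a library that implements flow matching, score matching, and denoising diffusion generative models in JAX for simulation-based inference, offering a unified interface and several transformer architectures, and achieving near-ideal C2ST scores on standard benchmarks.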

Comments 48 pages + 1 appendix, 33 figures, 18 tables. For the associated Python code, see https://github.com/aurelio-amerio/GenSBI

English abstract

Flow and diffusion generative models have established themselves as widely adopted density estimators for simulation-based inference (SBI), extending naturally from neural posterior estimation to likelihood and joint density estimation. Their principled optimization objectives and freedom from architectural constraints have driven rapid adoption across the natural sciences. Yet the most widely used SBI libraries remain PyTorch-based, leaving researchers who develop their forward models and analysis pipelines in JAX without a native option. We present GenSBI, an open-source library that implements flow matching, score matching, and denoising diffusion entirely in JAX. The library offers three transformer-based architectures - SimFormer, Flux1, and a novel Flux1Joint that extends gate-modulated transformer blocks to joint density estimation - all interchangeable through a unified interface that decouples generative method, neural backbone, and inference mode. GenSBI provides an end-to-end workflow from training through posterior calibration (SBC, TARP, LC2ST) and supports custom architectures with domain-specific embedding networks. We validate the framework on standard SBI benchmarks, achieving near-ideal mean C2ST scores (0.50-0.56, where 0.50 is ideal) on SBIBM tasks with minimal per-task tuning and well-calibrated posterior coverage across all tested configurations. The code is publicly available at https://github.com/aurelio-amerio/GenSBI.
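
For readers unfamiliar with the C2ST metric quoted above: a classifier two-sample test trains a classifier to tell estimated posterior samples from reference samples, and an accuracy near 0.50 means the two sets are indistinguishable. The sketch below is a toy stand-in, not GenSBI or SBIBM code. It swaps the usual trained classifier for a leave-one-out 1-nearest-neighbour rule so that it runs with the standard library only; `c2st_1nn` is a hypothetical name.

```python
import random
from typing import List, Sequence


def _sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def c2st_1nn(samples_p: Sequence[Sequence[float]], samples_q: Sequence[Sequence[float]]) -> float:
    """Leave-one-out 1-NN accuracy at telling samples_p from samples_q.

    About 0.5 means the sets look alike; 1.0 means they are fully separable.
    """
    data = [(x, 0) for x in samples_p] + [(x, 1) for x in samples_q]
    correct = 0
    for i, (x, label) in enumerate(data):
        # Label of the nearest other point (ties go to the first one found).
        _, nearest_label = min(
            ((_sq_dist(x, y), lab) for j, (y, lab) in enumerate(data) if j != i),
            key=lambda t: t[0],
        )
        correct += int(nearest_label == label)
    return correct / len(data)


rng = random.Random(0)
same_p = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(150)]
same_q = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(150)]
shifted_q = [[rng.gauss(8, 1), rng.gauss(8, 1)] for _ in range(150)]
print("same distribution:", c2st_1nn(same_p, same_q))
print("shifted distribution:", c2st_1nn(same_p, shifted_q))
```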

2605.27495 2026-05-28 cs.CV cs.LG

Representation-Conditioned Diffusion Models for Guided Training Data Generation

Nithesh Chandher Karthikeyan, Jonas Unger, Gabriel Eilertsen

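AI summary: The paper proposes representation-conditioned diffusion models that generate synthetic images conditioned on DINOv2, DINOv3, and CLIP representations. On ImageNet100, top-1 classification accuracy is 10.76 percentage points higher than with class-conditioned generation and, with a scaled-up synthetic dataset, 2.0 percentage points higher than a classifier trained on real data.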

English abstract

Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.

2605.27491 2026-05-28 cs.RO

GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation

Boxiang Qiu, Liliang Chen, Yue Liao, Nan Wang, Lintao Wang, Jiayi Luo, Wenzhi Zhao, Shengcong Chen, Di Chen, Ye Li, Chen Gao, Shuicheng Yan, Si Liu, Maoqing Yao, Guanghui Ren

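AI summary: Proposes GE-Sim 2.0, a closed-loop video world simulator built on action-conditioned video generation. It is re-trained on thousands of hours of real-world robot data and adds three modules (a state expert, a world judge, and an acceleration framework) to achieve high-fidelity action following and broad trajectory coverage. At 2B parameters it tops the WorldArena leaderboard, ahead of dedicated robotic world models and general video generators, and policies trained on its rollouts and rewards show real-world gains.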

English abstract

We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a closed-loop video world simulator for robotic manipulation. Building on the action-conditioned video generation framework of Genie Envisioner, GE-Sim 2.0 is re-trained on thousands of hours of real-world robot data spanning teleoperation, contact-rich interaction, and on-robot policy deployment, substantially improving action-following fidelity and trajectory coverage. On top of this foundation, three new modules close the loop from video simulation to policy learning: a state expert that decodes proprioceptive state from video latents to support next-chunk prediction by downstream VLA policies; a world judge that scores generated rollouts against task instructions, yielding machine-verifiable success signals and rewards in place of manual inspection; and an acceleration framework that delivers a 25-frame rollout in 2.3 seconds on a single H100, with up to 4× frame skipping at inference for long-horizon evaluation. GE-Sim 2.0 tops the public WorldArena leaderboard at only 2B parameters, outperforming both dedicated robotic world models and closed-source general video generators, and policies trained against its rollouts and rewards translate into measurable real-world gains, establishing GE-Sim 2.0 as a practical platform for scalable evaluation and closed-loop learning of manipulation policies.

2605.27487 2026-05-28 cs.CV cs.AI

Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer

Andrii Ahitoliev, Pavlo Berezin

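AI summary: Handwritten text generation for non-Latin scripts such as Ukrainian lacks both data and studies of model generalisation. The authors build a Ukrainian handwritten word dataset, retrain DiffusionPen on it, and use cross-lingual, zero-shot, and few-shot transfer experiments to show that latent diffusion models are effective for cross-domain style transfer.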

Comments 16 pages, 7 figures. Submitted to ICTERI 2026

English abstract

Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-resource and non-Latin writing systems, leaving open how well existing models generalise beyond the Latin domain. Cyrillic, particularly Ukrainian, lacks both large-scale writer-labeled datasets and empirical evidence of such generalisation. To address this gap, we construct a Ukrainian handwritten word dataset of 126,177 images from 308 writers using connected-component segmentation, quality filtering, and targeted oversampling of underrepresented Ukrainian characters. We retrain DiffusionPen, a MobileNetV2 triplet-loss style encoder with a CANINE-conditioned latent diffusion U-Net, on this dataset without architectural modification, testing direct transfer from Latin to Cyrillic. We evaluate cross-domain style transfer in three settings: cross-lingual transfer from IAM English samples, zero-shot transfer to an early 20th-century Ukrainian manuscript, and few-shot imitation of contemporary writers. The model produces legible, style-consistent word images, indicating that few-shot latent diffusion models generalize beyond the Latin-script domain. We release the dataset, trained models, and evaluation protocol as a reproducible benchmark for writer-aware Cyrillic HTG, providing a foundation for extending stylized HTG to other underrepresented writing systems.

2605.27486 2026-05-28 cs.LG

Federated Learning for Multivariate Time Series Anomaly Detection in Industrial Automation

Khayyam Nosrati, Martin Uray, Saverio Messineo, Olaf Sassnick, Stefan Huber

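AI summary: Addressing the dataset challenges of multivariate time series anomaly detection under the federated learning paradigm, the paper introduces a dataset with cyclic dynamics and evaluates several MTSAD methods.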

Comments Preprint. Accepted at the DEXA International Workshop on Optimisation of Industrial Production with AI Algorithms 2026 (DEXA AI4IP 2026)

English abstract

Federated learning (FL) has broadened the horizon for multivariate time series anomaly detection (MTSAD). However, benchmarking such anomaly detection methods within the FL paradigm poses data-centric challenges. Existing datasets do not meet these challenges, since they do not simultaneously provide sufficient scale, accurate labels, and freedom from common flaws. In addition, the role of cyclic process behavior, which is common in discrete industrial automation, remains underexplored in current MTSAD research. This paper aims to shed more light on the literature and to address these gaps by introducing a dataset designed with cyclic dynamics arising from the repetitive nature of discrete automation processes, and by evaluating selected MTSAD methods on both the proposed dataset and a public benchmark dataset.

2605.27483 2026-05-28 cs.CL cs.AI cs.LG

Debate Helps Weak Judges Reward Stronger Models

Ethan Elasky, Frank Nakasako, Naman Goyal

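AI summary: Studies proposer-critic debate in a stronger-debater/weaker-judge setting and finds that debate significantly improves judge performance when the critic's classification ability exceeds the judge's and the judge treats critic speeches as claims to verify; a single independent critique achieves a similar effect at lower cost.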

English abstract

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.

2605.27482 2026-05-28 cs.LG cs.AI

Energy-Structured Low-Rank Adaptation for Continual Learning

Longhua Li, Lei Qi, Qi Tian, Xin Geng

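AI summary: Proposes E²-LoRA, which uses energy-concentrated and energy-ordered low-rank adaptation together with a dynamic rank allocation strategy to address task interference and knowledge compaction in continual learning, achieving state-of-the-art performance.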

Comments Accepted by ICML 2026

English abstract

While orthogonal subspace methods try to mitigate task interference in Continual Learning (CL), they often suffer from energy diffusion across the basis, hindering knowledge compaction and exhausting capacity for future tasks. We observe that output feature drift induced by parameter updates is inherently low-rank, and theoretically prove that preserving parameters along the principal directions of this drift minimizes the output reconstruction error. Motivated by this, we propose Energy-Concentrated and Energy-Ordered Low-Rank Adaptation (E²-LoRA). By explicitly ordering and concentrating knowledge into leading ranks, E²-LoRA frees capacity for subsequent tasks. Furthermore, we design a dynamic rank allocation strategy to balance stability and plasticity by jointly optimizing energy retention and model plasticity. Extensive experiments across multiple benchmarks demonstrate that E²-LoRA achieves state-of-the-art performance.
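
The idea of concentrating "energy" in leading ranks can be related to a standard fact: truncating a singular value decomposition to its top r directions gives the best rank-r approximation in Frobenius norm, and the squared error equals the sum of the discarded squared singular values. The sketch below picks the smallest rank that retains a target share of that energy. It is only an illustration of energy retention; the paper's actual dynamic rank allocation rule is not described in the abstract, and `energy_rank` and the `retain` parameter are hypothetical.

```python
from typing import Sequence


def energy_rank(singular_values: Sequence[float], retain: float = 0.95) -> int:
    """Smallest rank r whose top-r squared singular values hold >= `retain` of the total.

    The input could be the singular values of a feature-drift matrix (an
    assumption); they are sorted here, so any order is accepted.
    """
    if not 0.0 < retain <= 1.0:
        raise ValueError("retain must be in (0, 1]")
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0
    cumulative = 0.0
    for rank, energy in enumerate(energies, start=1):
        cumulative += energy
        if cumulative >= retain * total:
            return rank
    return len(energies)


# Energies are 9, 1, and 0.01: one direction holds about 89.9% of the total.
print(energy_rank([3.0, 1.0, 0.1], retain=0.85))
print(energy_rank([3.0, 1.0, 0.1], retain=0.90))
```

A spectrum that decays quickly needs few ranks to reach the target, which leaves more of the adapter's rank budget for later tasks.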

2605.27479 2026-05-28 cs.LG cs.AI

Resource-Constrained Affect Modelling via Variance Regularisation Pruning

资源约束下的情感建模:基于方差正则化剪枝

Kosmas Pinitas, Konstantinos Katsifis

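AI Summary Proposes Variance-Regularised Pruning (VR), a framework that accounts for cross-participant stability when pruning and maintains competitive CCC performance at 80% sparsity, making it suitable for resource-constrained affect-aware systems.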

Comments This paper has been accepted at the 2026 PErvasive Technologies Related to Assistive Environments (PETRA)

详情
AI中文摘要

情感计算系统越来越多地嵌入到普及和交互环境中,如自适应游戏、辅助技术和资源受限平台,在这些环境中,计算效率必须与跨不同用户的可靠性相平衡。模型剪枝提供了一种减少计算需求的有效方法,但现有方法通常仅优化稀疏性,而不考虑参数移除如何影响个体间的鲁棒性。在这项工作中,我们引入了方差正则化剪枝(VR),一种明确将跨参与者稳定性纳入稀疏化过程的剪枝框架。VR不依赖于平均预测误差,而是根据每个连接对预测准确性和用户间变异性的联合贡献来评估,优先保留在分布差异下仍然可靠的参数。我们在AGAIN数据集上评估了所提出的方法,该数据集包含在九个情感诱发游戏环境中收集的唤醒度标注。实验结果表明,即使在没有额外微调的情况下,VR在80%稀疏度下仍能保持竞争性的一致性相关系数(CCC)性能,突显了其在真实世界、资源受限的情感感知系统中的适用性。总体而言,所提出的框架支持开发紧凑、鲁棒的情感模型,这些模型能够在真实的交互环境中可靠运行。

英文摘要

Affective computing systems are increasingly embedded in pervasive and interactive environments, such as adaptive games, assistive technologies, and resource-constrained platforms, where computational efficiency must be balanced with reliability across diverse users. Model pruning offers an effective way to reduce computational demands, yet existing approaches typically optimise for sparsity alone, without accounting for how parameter removal impacts robustness across individuals. In this work, we introduce Variance-Regularised Pruning (VR), a pruning framework that explicitly incorporates cross-participant stability into the sparsification process. Rather than relying solely on average prediction error, VR evaluates each connection based on its joint contribution to both prediction accuracy and variability across users, prioritising parameters that remain reliable under distributional differences. We evaluate the proposed approach on the AGAIN dataset, which includes arousal annotations collected across nine affect-eliciting game environments. Experimental results demonstrate that VR maintains competitive Concordance Correlation Coefficient (CCC) performance even at 80% sparsity without additional fine-tuning, highlighting its suitability for deployment in real-world, resource-limited affect-aware systems. Overall, the proposed framework supports the development of compact, robust affective models that can operate reliably in real-world interactive environments.
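
One way to picture scoring a connection by both its average usefulness and its variability across users is sketched below. The scoring rule (mean importance minus a penalty on variance across participants), the per-participant `importance` table, and the names `variance_regularised_mask` and `lam` are all assumptions made for illustration; the abstract does not give the actual criterion.

```python
from statistics import mean, pvariance


def variance_regularised_mask(importance, sparsity=0.8, lam=1.0):
    """Return a keep/prune mask over connections.

    importance[c][p] is a hypothetical importance of connection c
    measured on participant p (e.g. loss increase when c is removed).
    Score = mean importance - lam * variance across participants, so a
    connection that matters consistently outranks an erratic one.
    """
    scores = [mean(row) - lam * pvariance(row) for row in importance]
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    # Indices of the highest-scoring connections survive pruning.
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    keep = set(order[:n_keep])
    return [i in keep for i in range(len(scores))]


importance = [
    [0.9, 0.9, 0.9],  # consistently useful
    [2.7, 0.0, 0.0],  # similar mean, but driven by one participant
    [0.1, 0.1, 0.1],
    [0.2, 0.3, 0.2],
    [0.0, 0.1, 0.0],
]
print(variance_regularised_mask(importance, sparsity=0.8))
```

With `lam=0` the rule falls back to ranking by average importance alone, which is the baseline behaviour the abstract contrasts VR against.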

2605.27476 2026-05-28 cs.LG cs.AI

Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective

通过对称注意力分解平衡扩散模型中的保真度与多样性:Hopfield视角

Hyunmin Cho, Woo Kyoung Han, Kyong Hwan Jin

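AI Summary Decomposes the transformer attention matrix into symmetric and skew-symmetric parts to interpret and modulate the fidelity-diversity trade-off in diffusion model generation from a Hopfield network perspective.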

Comments Accepted to ICML 2026 (Regular)

详情
AI中文摘要

我们将Transformer中的预softmax注意力矩阵$\mathbf{QK^\top}$表征为一个关联记忆矩阵,编码输入特征之间的成对关联。通过将该矩阵分解为对称和反对称部分,我们将对称分量解释为控制能量景观的结构,而反对称分量则驱动该景观上的循环。利用对称分量诱导的能量公式,我们推导出Hopfield风格的稳定性度量,用于量化检索特征的稳定性。我们观察到Hopfield风格稳定性度量与生成中的保真度-多样性权衡之间存在有意义的关联。最后,我们提出一个可控的旋钮,通过修改底层动力学的循环来调节这一权衡。代码可在我们的GitHub上获取(https://github.com/hyeon-cho/Attention-Symmetric-Decomposition)。

英文摘要

We characterize the pre-softmax attention matrix $\mathbf{QK^\top}$ in transformers as an associative memory matrix encoding pairwise associations between input features. By decomposing this matrix into its symmetric and skew-symmetric parts, we interpret the symmetric component as governing the structure of the energy landscape, and the skew-symmetric component as driving circulation on that landscape. Leveraging the energy formulation induced by the symmetric component, we derive Hopfield-style stability measures that quantify the stability of retrieved features. We observe meaningful correlations between Hopfield-style stability measures and the fidelity-diversity trade-offs in generation. Finally, we propose a controllable knob to modulate this trade-off by modifying the circulation of the underlying dynamics. Code is available at our GitHub (https://github.com/hyeon-cho/Attention-Symmetric-Decomposition).
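
The decomposition itself is elementary: any square matrix $A$ splits as $A = S + K$ with $S = (A + A^\top)/2$ symmetric and $K = (A - A^\top)/2$ skew-symmetric. The sketch below uses a made-up 3x3 matrix as a stand-in for $\mathbf{QK^\top}$ and checks one standard fact, that the skew part contributes nothing to a quadratic form $x^\top A x$. Reading that quadratic form as the "energy" is an assumption made here for illustration; the paper's energy formulation may differ.

```python
def decompose(a):
    """Split a square matrix into symmetric and skew-symmetric parts."""
    n = len(a)
    sym = [[(a[i][j] + a[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(a[i][j] - a[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


def quadratic_form(m, x):
    """Compute x^T m x."""
    n = len(x)
    return sum(x[i] * m[i][j] * x[j] for i in range(n) for j in range(n))


# Toy stand-in for a pre-softmax score matrix Q K^T (values are made up).
scores = [[2.0, 1.0, 0.0],
          [3.0, 1.0, 4.0],
          [0.0, -2.0, 5.0]]
sym, skew = decompose(scores)
x = [1.0, -2.0, 0.5]

# The skew part cancels pairwise in x^T K x, so a quadratic energy
# depends on the symmetric part alone.
print(quadratic_form(skew, x))
print(quadratic_form(scores, x), quadratic_form(sym, x))
```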

2605.27475 2026-05-28 cs.LG cs.AI

HEAL: Resilient and Self-* Hub-based Learning

HEAL:弹性且自适应的基于集线器的学习

Mohamed Amine Legheraba, Stefan Galkiewicz, Maria Gradinariu Potop-Butucaru, Sébastien Tixeuil

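AI Summary Proposes HEAL, a cross-layer decentralized learning framework that combines the strengths of Federated Learning, Gossip Learning, and Epidemic Learning. It uses a self-organizing, self-healing P2P overlay and the Elevator algorithm to select aggregator nodes dynamically, matching Federated Learning in crash-free settings and outperforming Gossip and Epidemic Learning under crashes and churn.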

详情
AI中文摘要

去中心化学习通过将数据和计算分布在节点上,增强了隐私性、可扩展性和容错性。一种流行的方法是联邦学习,它依赖于中央聚合器,但面临服务器脆弱性、可扩展性问题、隐私风险以及最重要的单点故障等挑战。另一种方法是八卦学习和流行病学习,它们通过节点间的点对点模型更新交换实现完全去中心化,确保了鲁棒性和隐私性,但代价是模型收敛速度较慢。在这项工作中,我们提出了一种新颖的去中心化学习框架,称为HEAL。HEAL是首个跨层去中心化学习框架,它利用优化的自组织和自愈底层P2P覆盖网络,结合了联邦学习、八卦学习和流行病学习的优势。借助最近提出的Elevator算法,HEAL将动态选择的节点提升为聚合器。通过仿真,我们证明HEAL在无崩溃环境中具有与联邦学习相似的性能,同时完全去中心化且具有容错性。在崩溃和波动频繁的环境中,HEAL优于八卦学习和流行病学习。

英文摘要

Decentralized learning enhances privacy, scalability, and fault tolerance by distributing data and computation across nodes. A popular approach is Federated Learning, which relies on a central aggregator yet faces challenges such as server vulnerabilities, scalability issues, privacy risks and, most importantly, a single point of failure. Alternatively, Gossip Learning and Epidemic Learning offer full decentralization through peer-to-peer exchanges of model updates, ensuring robustness and privacy at the price of slower model convergence. In this work, we introduce a novel decentralized learning framework called HEAL. HEAL is the first cross-layer decentralized learning framework that exploits an optimized self-organizing and self-healing underlying P2P overlay, combining the strengths of Federated Learning, Gossip Learning, and Epidemic Learning. Leveraging the recently proposed Elevator algorithm, HEAL promotes dynamically chosen nodes to act as aggregators. Through simulations, we demonstrate that HEAL performs similarly to Federated Learning in crash-free settings while being fully decentralized and fault-tolerant. In crash- and churn-prone environments, HEAL outperforms Gossip and Epidemic Learning.

2605.27470 2026-05-28 cs.LG cs.AI

Detect by Yourself: Self-Designing Agentic Workflows for Few-Shot Graph Anomaly Detection

自行检测:少样本图异常检测的自设计代理工作流

Tairan Huang, Qiang Chen, Yili Wang, Yueyue Ma, Changlong He, Xiu Su, Yi Chen

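AI Summary Proposes SignGAD, which replaces a fixed detector with self-designed, task-conditioned detection workflows, combining graph-encoding and detector selection with a guarded refit strategy to improve adaptability and reliability in few-shot graph anomaly detection.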

详情
AI中文摘要

图异常检测旨在识别属性图中的异常节点,并在实际应用中发挥重要作用。然而,现有的图异常检测方法仍面临两个关键挑战:1)固定流程,限制了其在有限监督下对不同图任务的适应性;2)弱证据,无法将上下文和结构异常信号明确纳入检测过程。在本文中,我们提出了一种新颖框架,即少样本图异常检测的自设计代理工作流(SignGAD)。具体来说,我们提出了一种新范式,将图异常检测任务从训练固定异常检测器重新定义为设计任务条件检测工作流。通过构建检测工作流,SignGAD选择合适的图编码和检测器设计以利用任务特定的异常证据。同时,我们引入了一种受保护的最终重拟合策略,通过校准重拟合接受度来优化所选工作流,从而增强有限监督下的可靠性。在多个真实世界数据集上进行的大量实验表明,SignGAD相比最先进方法取得了强劲性能,突显了其在图异常检测任务上的有效性。

英文摘要

Graph anomaly detection aims to identify anomalous nodes in attributed graphs and plays an important role in real-world applications. However, existing graph anomaly detection methods still face two key challenges: 1) fixed pipelines, which restrict their adaptability across different graph tasks under limited supervision; 2) weak evidence, which prevents them from explicitly incorporating contextual and structural anomaly signals into the detection process. In this paper, we propose a novel framework, self-designing agentic workflows for few-shot graph anomaly detection (SignGAD). Specifically, we propose a novel paradigm that reformulates the graph anomaly detection task from training a fixed anomaly detector to designing task-conditioned detection workflows. By constructing detection workflows, SignGAD selects suitable graph encodings and detector designs to exploit task-specific anomaly evidence. Meanwhile, we introduce a guarded final refit strategy that refines the selected workflow by calibrating refit acceptance, enhancing reliability under limited supervision. Extensive experiments conducted on several real-world datasets demonstrate that SignGAD achieves strong performance against state-of-the-art methods, highlighting its effectiveness on graph anomaly detection tasks.

2605.27469 2026-05-28 cs.LG cs.AI

Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift

架构驱动的偏移:面向捕捉逻辑偏移趋势的轻量级选择器

Zhong Ye, Yu Hu, Ruilin Tang

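AI Summary Proposes Architecture-driven Shift (ADS) as a lightweight proxy for logit shift, enabling efficient selection of pre-trained models for continual learning, and both derives theoretically and verifies empirically a monotonic correlation between ADS and logit shift.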

详情
AI中文摘要

持续学习是一种利用深度预训练神经网络能力的实用范式,但哪个预训练模型能更好地平衡“可塑性-稳定性”值得选择?逻辑偏移作为自然代理,因为它代表了持续学习场景中的逻辑偏移。然而,获取逻辑偏移需要巨大的计算成本,阻碍了大规模模型选择。现有的理论分析由于假设均匀隐藏层宽度,忽略了实际架构的结构异质性(可变宽度和深度),无法提供有效的替代方案。这引发了一个关键问题:异构架构与在先验任务(模型已训练过的任务)上的逻辑偏移之间理论上存在什么关系?为了回答这个问题,我们将逻辑偏移解耦为架构依赖和数据依赖,建立我们的框架,揭示了两种依赖的组合——定义为架构驱动偏移(ADS)——能够很好地捕捉逻辑偏移趋势,且只需少量数据样本即可计算。具体来说,对于在先验任务上优化良好的模型,较高的ADS与在当前任务训练后较大的逻辑偏移相关,这基于三个机制组件推导得出:(1)权重矩阵梯度关于层宽的谱范数缩放,(2)新任务的优化路径长度,以及(3)宽网络中的渐近任务冲突。跨越175多种不同架构的大量实证结果表明,ADS与逻辑偏移之间存在强单调相关性(最弱的Spearman相关系数$r_s=0.731$)。在实践中,我们证明了ADS可以作为预期校准误差的轻量级代理,预期校准误差是用于可靠持续学习模型选择的广泛使用的指标,在三个数据集的六个场景中得到了验证。

英文摘要

Continual Learning (CL) is a practical paradigm for exploiting the power of deep pre-trained neural networks, but which pre-trained model best balances plasticity and stability, and therefore deserves to be chosen? The logit shift serves as a natural proxy because it measures how a model's logits change in CL scenarios. However, obtaining the logit shift incurs a huge computational cost, which hinders large-scale model selection. Existing theoretical analyses fail to offer an efficient alternative because they assume uniform hidden-layer widths, ignoring the structural heterogeneity (variable width and depth) of real-world architectures. This raises a critical question: what theoretical relationship can be identified between a heterogeneous architecture and the logit shift on prior tasks (those the model has already been trained on)? To answer this question, we decouple the logit shift into an architecture dependency and a data dependency to establish our framework, which reveals that the combination of the two dependencies, defined as Architecture-driven Shift (ADS), captures the logit shift tendency well and is computable with few data samples. Specifically, for a model that is well optimized on prior tasks, higher ADS is associated with a larger logit shift after training on the current task, a result derived from three mechanistic components: (1) spectral norm scaling of weight matrix gradients with layer width, (2) the optimization path length of the new task, and (3) the asymptotic task conflict in wide networks. Extensive empirical results across more than 175 diverse architectures demonstrate a strong monotonic correlation (the weakest Spearman's $r_s=0.731$) between ADS and logit shift. Practically, we demonstrate that ADS can serve as a lightweight proxy for the expected calibration error, a widely used metric for reliable CL model selection, on three datasets across six scenarios.
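
The reported statistic, Spearman's $r_s$, measures monotonic rather than linear agreement: it is the Pearson correlation of the two rank vectors. A minimal standard-library sketch follows; the proxy scores and shift values are made up and do not come from the paper.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's r_s: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up proxy scores and measured shifts for five hypothetical models.
proxy = [0.1, 0.4, 0.2, 0.9, 0.7]
shift = [1.0, 3.5, 2.0, 9.0, 4.0]
print(spearman(proxy, shift))
```

Because only the ordering matters, a proxy can score highly here without predicting the magnitude of the shift, which is all that ranking candidate models requires.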

2605.27467 2026-05-28 cs.LG cs.AI cs.CV

Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility

液态神经网络与LSTM在序列模式识别中的比较分析:鲁棒性、效率与临床实用性

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

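AI Summary Compares Liquid Neural Networks (LNNs) with LSTM on four types of sequential data and finds LNNs superior in parameter efficiency and robustness, particularly in clinical settings where data are sparse.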

Comments 9 pages, 7 figures, 6 tables, The conference paper will appear in Proceedings of JCSSE 2026

详情
AI中文摘要

传统的循环神经网络(RNN)和长短期记忆网络(LSTM)在离散时间步上运行,往往无法捕捉现实世界物理过程的流体时间动态。液态神经网络(LNN),特别是闭式连续时间(CfC)网络,通过将隐藏状态演化建模为连续微分方程来解决这一问题。在本文中,我们在四种不同的序列模态上进行了全面的基准测试研究:神经形态事件数据(N-MNIST)、基于笔画的绘图(QuickDraw)、视觉手写(IAM)和生理时间序列(PhysioNet Sepsis-3)。此外,我们使用时间丢弃法进行了严格的压力测试,以评估模型对缺失数据的鲁棒性。我们的研究结果表明,LNN在原生时间域和数据稀疏普遍的临床环境中,始终提供优越的参数效率和显著更高的鲁棒性。本扩展预印本提供了关于相关数据集和LNN理论谱系的额外背景,并附有详细附录,记录了我们的完整实现和实验设置。

英文摘要

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.
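
A temporal-dropout stress test can be pictured as randomly deleting time steps from a sequence before it reaches the model. The sketch below is one plausible form, assuming independent per-step dropping with the timestamps of surviving steps retained; the abstract does not specify the exact protocol, and `temporal_dropout` and its parameters are hypothetical names.

```python
import random


def temporal_dropout(sequence, drop_prob, seed=0):
    """Randomly remove time steps, keeping the timestamps of the survivors.

    sequence is a list of (timestamp, value) pairs. Keeping timestamps
    matters for continuous-time models, which can use the irregular gaps.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    kept = [step for step in sequence if rng.random() >= drop_prob]
    # Never return an empty sequence; fall back to the first step.
    return kept or sequence[:1]


signal = [(t * 0.5, float(t * t)) for t in range(10)]
sparse = temporal_dropout(signal, drop_prob=0.4, seed=42)
print(len(signal), len(sparse))
print(sparse[:3])
```

Sweeping `drop_prob` upward and recording accuracy at each level gives a robustness curve on which two architectures can be compared.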

2605.27465 2026-05-28 cs.CV cs.AI

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge: 面向视觉Transformer无训练加速的显著性感知自适应令牌合并

Semi Lee, Hyejin Go, Hyesong Choi

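AI Summary Proposes AdaMerge, a training-free framework that advances the accuracy-compute Pareto frontier of token merging through two complementary mechanisms: salience-weighted similarity and adaptive merging intensity.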

Comments 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

视觉Transformer(ViT)中自注意力的二次计算成本构成了实际部署的基本瓶颈,激发了令牌缩减方面的活跃研究。在现有方法中,令牌合并(ToMe)已成为一种优雅的无训练解决方案;然而,其设计基于令牌平等的隐含前提,这与自注意力已充分证明的非均匀性相悖,并在激进压缩下导致高显著性令牌的信息丢失。我们通过AdaMerge解决了这一局限,该框架基于两个互补机制。首先,显著性加权相似度利用列式特征亲和度中心性作为令牌重要性代理,并将所得显著性分数纳入二分匹配分数,确保关键令牌对合并表示贡献更大。其次,自适应合并强度使用预先计算的逐层相似度统计量,根据输入特定的冗余性动态调整每层缩减数量。在ImageNet-1k上使用ViT-B/16,AdaMerge在所有FLOPs匹配条件下均持续优于ToMe、PiToMe和DSM。精度差距随压缩单调增大:在13.4G FLOPs操作点,AdaMerge的Top-1下降仅为-1.06%,而PiToMe为-1.45%,DSM为-4.62%。据我们所知,AdaMerge是首个将显著性加权相似度和自适应逐层缩减结合到单一无训练令牌合并框架中的方法,推动了ViT加速的精度-FLOPs帕累托前沿。

英文摘要

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

2605.27464 2026-05-28 cs.CV cs.AI

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

超越运动基元:基于头戴式IMU的行为活动识别

Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

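AI Summary Proposes HiT-HAR, a hierarchical model that uses head-mounted IMU data for behavioral-level activity recognition beyond conventional motion primitives, outperforming existing models on five-class action and eight-class scenario recognition.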

详情
AI中文摘要

AR智能眼镜需要连续的行为上下文来提供主动辅助,但其最实用的常开传感器——头戴式惯性测量单元(IMU)仅能检测行走或站立等运动基元。我们突破运动基元,实现行为级识别,定义了五个类别以平衡AR应用需求与传感器可观测性。为此,我们构建了一个包含16万样本的Ego4D数据集,采用四层质量保证框架覆盖8个活动场景,并提出了HiT-HAR,一个70.3万参数的层次模型,在五类动作和八类场景识别中优于先前的头戴式IMU模型。我们通过每类可分离性分析进一步绘制了头戴式IMU的可观测性边界,识别出哪些行为类别可靠可观测(移动),哪些受益于时间上下文(物体传递、任务操作),以及哪些场景依赖的信号重叠仍构成挑战。我们的结果表明,利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集公开于https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR。

英文摘要

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.

2605.27461 2026-05-28 cs.RO

A Factory-Floor Deployment Case Study of VLA Pipelines for Industrial Packaging Task: Workflow, Failures, and Lessons

工业包装任务的VLA流水线工厂部署案例研究:工作流、故障与经验教训

Brian Zhu, Philipp Schmitt, Philine Meister, Lukas Gensler, Momen Khalil, Emmanuele Poggi, Johannes Hechtl, Carsten Braunroth, Kai Wurm, Gokul Narayanan, Eugen Solowjow, Georg von Wichert, Andre Scholz, Felix Albrecht, Maxmillian Metzner

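AI Summary Deploys a pretrained Pi0.5 policy on an industrial packaging task at a Siemens factory, fine-tuning it iteratively and collecting 2535 on-site episodes, and summarises recurring failure modes in VLA pipeline deployment along with lessons for improving the workflow.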

详情
AI中文摘要

视觉-语言-动作(VLA)策略展示了有前景的操作能力,但其实际影响常受限于现实部署的可靠性要求。我们展示了西门子工厂(德国埃尔朗根GWE)中一项工业包装任务的部署研究:机器人必须从杂乱堆中拾取透明配件袋,将其插入纸板包装的剩余空腔,并确保袋子及其内容物保持在闭合平面以下。我们的目标是理解通过迭代微调和部署驱动的改进,将预训练的Pi0.5策略适配到单一工厂任务所需的实际工作量。该流水线包括数据收集、整理、微调、评估和针对性恢复数据收集的重复循环。我们从现场工厂设置中积累了2535个片段(10小时)。在本文中,我们贡献了一个工厂级VLA部署的实证报告,重点介绍了常见的故障模式和有助于改进部署工作流的经验教训。

英文摘要

Vision-Language-Action (VLA) policies have shown promising manipulation capabilities, yet their practical impact is often limited by the reliability demands of real-world deployment. We present a deployment study of an industrial packaging task at Siemens Factory (GWE, Erlangen, Germany), where a robot must pick a transparent accessory bag from a cluttered pile, insert it into the remaining cavity of a cardboard package, and ensure that the bag and its contents remain below the closing plane. Our goal is to understand the practical effort required to adapt a pretrained Pi0.5 policy to a single factory-floor task through iterative fine-tuning and deployment-driven refinement. The pipeline consists of repeated loops of data collection, curation, fine-tuning, evaluation, and targeted recovery data collection. We have accumulated 2535 episodes (10 hours) from the on-site factory settings. In this paper, we contribute an empirical account of a factory-floor VLA deployment, highlighting recurring failure modes and lessons that inform how to improve the deployment workflow.

2605.27460 2026-05-28 cs.CV

D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation

D$^2$Turb: 深度感知仿真与解耦学习用于单帧大气湍流抑制

Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin, Xun Liu, Peng Wang

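AI Summary Proposes D$^2$Turb, a framework that combines physics-based simulation with decoupled restoration through a depth-aware turbulence synthesis protocol and an adaptive structural prior injection mechanism, performing texture deblurring and geometric rectification for single-frame atmospheric turbulence.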

Comments 14 pages, 7 figures

详情
AI中文摘要

单帧大气湍流抑制由于空间变化模糊与非刚性几何畸变并存而本质上是病态的。现有的基于平面场仿真的端到端方法通常难以平衡纹理恢复与几何校正。为克服这一限制,我们提出D$^2$Turb,一个将物理仿真与显式解耦恢复相结合的统一框架。首先,我们引入深度感知湍流合成协议,将场景深度纳入相位到空间公式中,生成物理一致、深度相关的退化,并为解耦学习提供关键的中间倾斜监督信号。基于该仿真引擎,D$^2$Turb将恢复分解为两个交互阶段:纹理去模糊和几何校正。纹理去模糊阶段采用去模糊骨干网络恢复细节,同时保留几何畸变以供后续校正阶段使用。为缓解级联设计中常见的信息碎片化问题,我们进一步提出自适应结构先验注入(ASPI)机制,动态传递去模糊模块的深层结构表示以指导密集流预测进行空间去扭曲。大量实验表明,D$^2$Turb在合成和真实数据集上均达到最先进性能,在纹理恢复和几何保真度方面均有持续改进。我们的代码和预训练模型已在 https://github.com/HertzDot222/D2Turb 公开。

英文摘要

Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at https://github.com/HertzDot222/D2Turb.

2605.27456 2026-05-28 cs.LG

Metric-Aware PCA as a Linear Instance of Geometric Deep Learning

度量感知PCA作为几何深度学习的线性实例

Michael Leznik

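AI Summary Positions Metric-Aware Principal Component Analysis (MAPCA) within the geometric deep learning framework, establishing a precise correspondence between the two across six axes, including symmetry, equivariance, and invariance, and showing that MAPCA is a linear instance of geometric deep learning.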

详情
AI中文摘要

几何深度学习围绕数据域的对称性组织神经架构,对称群的选择作为几何先验,决定了可以学习哪些表示。度量感知主成分分析(MAPCA)通过正定度量矩阵参数化主成分分析,其规范子族在标准PCA和输出白化之间插值,对角度量点恢复不变PCA(IPCA)。本文将MAPCA置于几何深度学习框架中。度量被视为几何先验;保持它的正交群是其诱导的对称群;MAPCA解在该群下等变,所得谱不变;MAPCA的定义约束是等变网络中使用的Schur型权重约束的线性类比。在六个轴——域、对称群、等变性、不变性、架构原语和几何先验——上,我们构建了MAPCA与几何深度学习之间的精确字典。技术核心是一个唯一性定理,将IPCA刻画为MAPCA族中唯一的线性数据导出度量,该度量在任意对角缩放下等变,并投影到作用的不动点集上,在归一化下等价于精确形式的方差最大化准则。本文以三座桥梁结束:核PCA作为非线性扩展,谱图方法作为图上的MAPCA,以及深度MAPCA构造将定位扩展到深度等变网络。

英文摘要

Geometric deep learning organises neural architectures around the symmetries of their data domain, with the choice of symmetry group serving as a geometric prior that determines what representations can be learned. Metric-Aware Principal Component Analysis (MAPCA) parameterises principal component analysis by a positive-definite metric matrix, with a canonical subfamily interpolating between standard PCA and output whitening and a diagonal-metric point recovering Invariant PCA (IPCA). This paper positions MAPCA within the geometric deep learning framework. The metric is read as the geometric prior; the orthogonal group preserving it is the symmetry group it induces; MAPCA solutions are equivariant under this group with the resulting spectrum invariant; and MAPCA's defining constraint is the linear analogue of the Schur-type weight constraints used in equivariant networks. Across six axes - domain, symmetry group, equivariance, invariance, architectural primitive, and geometric prior - we construct a precise dictionary between MAPCA and geometric deep learning. The technical anchor is a uniqueness theorem characterising IPCA as the unique linear data-derived metric in the MAPCA family that is equivariant under arbitrary diagonal rescaling and projects onto the fixed-point set of the action, equivalent under normalisation to the variance-maximisation criterion in its precise form. The paper closes with three bridges: kernel PCA as the nonlinear extension, spectral graph methods as MAPCA on graphs, and a deep MAPCA construction extending the positioning into deep equivariant networks.
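
The invariance under diagonal rescaling can be seen in a two-feature toy case. The sketch below rests on two assumptions that the abstract does not confirm: that PCA under a metric $M$ can be read as the generalised eigenproblem $Cw = \lambda M w$ for covariance $C$, and that the diagonal-metric point corresponds to $M = \mathrm{diag}(C)$. Under those assumptions the problem reduces to PCA on the correlation matrix, whose 2x2 spectrum is $1 \pm |r|$.

```python
import math


def metric_pca_spectrum_2d(cov):
    """Eigenvalues of M^{-1/2} C M^{-1/2} for a 2x2 covariance C, M = diag(C).

    Assumption: metric-aware PCA is read as C w = lambda M w. With the
    diagonal metric this is PCA on the correlation matrix [[1, r], [r, 1]],
    whose eigenvalues are 1 + |r| and 1 - |r|.
    """
    (a, b), (_, c) = cov
    r = b / math.sqrt(a * c)
    return (1 + abs(r), 1 - abs(r))


def rescale(cov, d1, d2):
    """Covariance after rescaling the two features by d1 and d2 (C -> D C D)."""
    (a, b), (_, c) = cov
    return [[d1 * d1 * a, d1 * d2 * b], [d1 * d2 * b, d2 * d2 * c]]


cov = [[4.0, 1.2], [1.2, 1.0]]
print(metric_pca_spectrum_2d(cov))
# Changing units on either axis leaves the spectrum unchanged.
print(metric_pca_spectrum_2d(rescale(cov, 1000.0, 0.5)))
```

Standard PCA on the same two covariances would give very different spectra, since the rescaled first feature would dominate the variance.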

2605.27452 2026-05-28 cs.CV

Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent

微调视觉语言模型以理解当前损伤并使用质量卫士代理进行优先级评分

Takato Yasuno

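AI Summary Fine-tunes the LLaVA-1.5-7B vision-language model and combines it with a rule engine and a Quality Guard agent to automate bridge damage understanding and repair priority scoring, reducing rating variability and improving efficiency.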

Comments 23 pages, 11 figures, 13 tables

详情
AI中文摘要

日本的桥梁检查要求每五年进行一次强制性目视评估,然而不同工程师分配的定性损伤等级(a-e级)存在显著的评分者间变异性——这是实现一致基础设施管理的关键障碍。资深工程师的老龄化进一步威胁检查能力。本文提出了一种使用微调视觉语言模型(VLM)自动化桥梁损伤理解和修复优先级评分的方法。 我们使用QLoRA在多达4000对桥梁损伤图像和检查文本记录上微调LLaVA-1.5-7B,然后在固定的800张图像测试集上进行评估。模型输出识别结构构件和损伤模式的自然语言描述,基于此,一个基于规则的评分引擎计算五级修复优先级指数。一项渐进式训练研究(1k/2k/3k/4k样本)表明,2000个训练样本在仅2.9小时的训练中即可达到接近最优的验证损失;超过2000后,训练样本每翻倍,验证损失改善不超过0.2%,表现出明显的收益递减。此外,在保留测试集上的语义相似度在3000样本时达到峰值(0.6909),在4000样本时下降(0.6739),表明质量策划的中等规模数据优于更大但噪声更多的语料库。结合torch.compile()和批处理(batch_size=8)的推理优化实现了每张图像10.06秒——比未优化基线降低了70.2%。 我们的方法有助于桥梁检查中的数据治理,减少评分者间变异性,并提供AI辅助分诊以增强检查流程中的专家工程师。此外,我们引入了一个两阶段质量卫士,使用微调的Swallow-8B SLM在优先级评分前拒绝低质量的VLM输出,防止来自损坏或无法识别图像的虚假评分。

英文摘要

Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability -- a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining torch.compile() and batch processing (batch_size=8) achieves 10.06 seconds per image -- a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.
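
The unoptimized inference time is not stated, but it follows from the two reported figures: if 10.06 seconds per image is a 70.2% reduction, the baseline is 10.06 / (1 - 0.702). The short calculation below works this out; the function name is hypothetical and the result is derived arithmetic, not a number reported in the abstract.

```python
def implied_baseline(optimised_seconds, reduction):
    """Baseline latency implied by an optimised latency and a fractional reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return optimised_seconds / (1.0 - reduction)


baseline = implied_baseline(10.06, 0.702)
speedup = baseline / 10.06
print(round(baseline, 2))  # implied seconds per image before optimisation
print(round(speedup, 2))   # implied speed-up factor
```

This works out to roughly 33.8 seconds per image before optimization, or a speed-up of about 3.4x.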

2605.27451 2026-05-28 cs.CV

From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition

从情感到复杂行为:第十届ABAW研讨会与竞赛推进多模态以人为中心的人工智能

Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia, Eric Granger, Marco Pedersoli, Simon Bacon, Jens Madsen, Soufiane Belharbi, Muhammad Haseeb Aslam, Chunchang Shao, Guanyu Hu

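AI Summary Introduces the 10th ABAW Workshop and Competition, which advances the modelling, analysis, and understanding of human affect and behavior in real-world settings through multimodal challenges and papers.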

Comments accepted at CVPR 2026

详情
AI中文摘要

第十届真实世界情感与行为分析(ABAW)研讨会与竞赛,与CVPR 2026同期举办,持续推动在真实、无约束环境中对人类情感与行为的建模、分析和理解研究。研讨会保持双重结构,包括竞赛和论文轨道。ABAW竞赛引入了一系列多样化的挑战,针对情感与行为理解的关键方面,包括连续情感(效价-唤醒度)估计、离散情感(表情和动作单元)识别,以及更复杂的行为分析任务,如情感模仿强度估计、矛盾/犹豫识别和细粒度暴力检测。这些挑战基于大规模真实世界数据集,为最先进方法提供了全面的基准。与此同时,论文轨道展示了广泛的贡献,涵盖姿态、运动与行为估计、情感建模与多模态学习、基准、数据集与评估协议、公平性、鲁棒性与部署。总体而言,第十届ABAW研讨会与竞赛继续作为基准测试、合作与创新的关键平台,塑造下一代多模态、以人为中心的人工智能系统的发展。

英文摘要

The 10th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on the modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, and more complex behavior analysis tasks such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition, and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion & behavior estimation; affect modelling & multimodal learning; benchmarks, datasets & evaluation protocols; and fairness, robustness & deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration, and innovation, shaping the development of next-generation multimodal, human-centered AI systems.

2605.27431 2026-05-28 cs.LG cs.AI

Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey

应对多模态学习挑战的混合专家方法:综述

Liangwei Nathan Zheng, Wei Emma Zhang, Olaf Maennel, Lin Yue, Weitong Chen

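AI Summary Surveys how Mixture-of-Experts (MoE) addresses core challenges in multimodal learning, such as scalability, heterogeneity, and imperfect data, through efficient scaling, representation learning, and adaptive adaptation.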

Comments This survey paper has just been accepted by IJCAI 2026. Results were released by 30 April 2026. As I could not find a particular place to drop the acceptance email, I have uploaded it alongside the LaTeX files of the paper, named Acceptance_email.pdf

详情
AI中文摘要

混合专家(MoE)为多模态学习提供了一个自然兼容且可扩展的框架,在不同模态和任务中展现出强大的适应性。尽管其日益成功,但关于MoE方法解决多模态挑战的全面系统综述仍然缺乏。现有综述往往从方法分类学角度独立评估多模态学习或MoE,忽视了它们之间的独特相互作用。本综述通过回答一个核心问题来填补这一空白:\textit{MoE如何有效解决多模态挑战?}我们从三个关键视角进行探讨:(1) \textbf{MoE作为高效多模态引擎:}通过将计算成本与参数增长解耦,并通过选择性专家激活减轻模态冗余,实现可扩展的多模态建模;(2) \textbf{MoE作为多模态表示学习器:}整合互补的多意见专家知识,丰富对齐和交互表示;(3) \textbf{MoE作为多模态适配器:}提供模块化和灵活的机制,以建模不完美数据场景,如模态不平衡和模态缺失。通过广泛的文献综述,我们识别出关键研究空白,包括可解释路由、专家通信、模态集成和终身多模态学习。我们将本综述定位为未来研究的基础,旨在构建可解释且可持续的多模态混合专家系统。

英文摘要

Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability across diverse modalities and tasks. Despite its growing success, a comprehensive and systematic review of MoE methods addressing multimodal challenges remains lacking. Existing surveys tend to evaluate either multimodal learning or MoE independently from a method-taxonomy perspective, overlooking the unique interplay between them. This survey fills that gap by answering a central question: \textit{How does MoE effectively resolve multimodal challenges?} We approach this from three key perspectives: (1) \textbf{MoE as an Efficient Multimodal Engine:} enabling scalable multimodal modeling by decoupling computational cost from parameter growth and mitigating modality redundancy through selective expert activation; (2) \textbf{MoE as a Multimodal Representation Learner:} integrating complementary multi-opinion expert knowledge to enrich alignment and interaction representations; and (3) \textbf{MoE as a Multimodal Adapter:} providing a modular and flexible mechanism to model imperfect data scenarios such as modality imbalance and missing modalities. Through our extensive literature review, we identify critical research gaps, including interpretable routing, expert communication, modality integration, and lifelong multimodal learning. We position this survey as a foundation for future research toward interpretable and sustainable multimodal Mixture-of-Experts systems.
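
The decoupling of compute from parameter count via selective expert activation is usually realised with sparse top-k routing: only the k highest-scoring experts run for a given input. The sketch below shows that generic mechanism with toy scalar experts and made-up router logits; it is not drawn from any specific method in the survey.

```python
import math


def top_k_gate(logits, k=2):
    """Sparse gate: softmax over the k largest router logits, zero elsewhere."""
    top = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}  # shift for stability
    norm = sum(exps.values())
    return [exps[i] / norm if i in exps else 0.0 for i in range(len(logits))]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts; unselected experts are never called."""
    gates = top_k_gate(router_logits, k)
    return sum(g * experts[i](x) for i, g in enumerate(gates) if g > 0.0)


# Four toy scalar "experts"; only two run per input, whatever the total count.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
router_logits = [0.1, 2.0, 2.0, -1.0]  # made-up router output for this input
print(top_k_gate(router_logits))
print(moe_forward(3.0, experts, router_logits))
```

Adding more experts grows the parameter count, while the per-input cost stays at k expert calls plus the router.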

2605.27428 2026-05-28 cs.LG

$E^3$-Agent: An Executable and Evolving Agent for Resource Management of Edge Generative Inference

$E^3$-Agent: 一种用于边缘生成式推理资源管理的可执行且可演化智能体

Rui Bao, Yaping Sun, Zhiyong Chen, Feng Yang, Meixia Tao, Nan Li, Wenjun Zhang

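AI Summary Addresses unknown and time-varying device performance in edge generative inference with $E^3$-Agent, an executable and evolving agent that separates a fast-path router from a slow-path LLM meta-controller to enable online learning and adaptive resource management, reducing average latency by 65%-73% in dynamic scenarios.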

Comments 13 pages, 4 figures, 6 tables

详情
AI中文摘要

边缘部署的生成式推理日益面临两个实际现实:每设备每模型的性能在部署时通常是未知的,并且由于用户驱动的语义事件、后台负载和设备变动而呈现非平稳性。因此,在固定机制下离线调优的资源管理器可能变得脆弱且维护成本高昂。本文提出了$E^3$-Agent,一种用于边缘人工智能生成内容(AIGC)资源管理的可执行且可演化的智能体。$E^3$-Agent将做出毫秒级调度决策的快速路径路由器与慢速路径事件驱动的大语言模型(LLM)元控制器分离,后者通过工具接口暴露的小型显式控制面(包括风险门控、路由器配置和快速性能校准)来缓解机制变化。该智能体从执行反馈中在线学习,并持续适应未知且时变的服务时间映射。我们在由MLPerf衍生的设备模型测量先验驱动的离散事件模拟器中评估了$E^3$-Agent,涵盖了冷启动预热和三种动态机制:语义动态、设备变动和隐藏漂移。在动态场景中,与最佳静态基线相比,$E^3$-Agent将平均延迟降低了65%-73%,保持在用于评估的在线全信息Oracle的7%-10%以内,并有效抑制了语义退化下的卡顿率。

英文摘要

Edge deployments of generative inference increasingly face two practical realities: per-device per-model performance is often unknown at deployment time, and it is non-stationary due to user-driven semantic events, background load, and device churn. Consequently, a resource manager that is tuned offline under a fixed regime can become brittle and expensive to maintain. This paper presents $E^3$-Agent, an executable and evolving agent for edge artificial intelligence generated content (AIGC) resource management. $E^3$-Agent separates a fast-path router that makes millisecond-level dispatch decisions from a slow-path, event-driven large language model (LLM) meta-controller that mitigates regime shifts through a small, explicit control surface exposed via a tool interface, including risk gating, router configuration, and rapid performance calibration. The agent learns online from execution feedback and continuously adapts to unknown and time-varying service-time mappings. We evaluate $E^3$-Agent in a discrete-event simulator driven by MLPerf-derived device-model measurement priors, covering cold-start warmup and three dynamic regimes: semantic dynamics, device churn, and hidden drift. Across the dynamic scenarios, $E^3$-Agent reduces average latency by 65%-73% compared to the best static baseline, stays within 7%-10% of an online full-information Oracle used for evaluation, and effectively suppresses stutter rate under semantic degradation.
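
A fast-path router that learns service times online can be sketched as a greedy dispatcher with a running estimate per device. Everything below is an illustrative assumption: the exponentially weighted moving average (EWMA) update, the queue-length cost model, the class name `FastPathRouter`, and the device names. The abstract does not describe the router's internals at this level.

```python
class FastPathRouter:
    """Dispatch each request to the device with the lowest estimated finish time.

    Service times are unknown up front, so each device keeps an exponentially
    weighted moving average (EWMA) that is updated from execution feedback.
    """

    def __init__(self, devices, prior_seconds=1.0, alpha=0.3):
        self.alpha = alpha  # weight of the newest observation
        self.estimate = {d: prior_seconds for d in devices}
        self.queued = {d: 0 for d in devices}

    def dispatch(self):
        # Estimated finish time = (requests ahead + this one) * service time.
        device = min(
            self.estimate,
            key=lambda d: (self.queued[d] + 1) * self.estimate[d],
        )
        self.queued[device] += 1
        return device

    def feedback(self, device, observed_seconds):
        self.queued[device] -= 1
        old = self.estimate[device]
        self.estimate[device] = (1 - self.alpha) * old + self.alpha * observed_seconds


router = FastPathRouter(["phone", "laptop"], prior_seconds=1.0)
true_seconds = {"phone": 4.0, "laptop": 0.5}  # hidden from the router
for _ in range(20):
    d = router.dispatch()
    router.feedback(d, true_seconds[d])
print(router.estimate)
```

A purely greedy router like this stops sampling a device once its estimate looks poor, so it would miss a later recovery. That limitation is one plausible motivation for pairing the fast path with a slower control loop that can trigger recalibration.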