arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03459 2026-06-03 cs.SD cs.AI

Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary

François Pachet

AI summary: Proposes tonal parsimony, which lexicographically minimizes the number of modulations and then the number of distinct tonalities; combining dynamic programming with the fixed 24-tonality space, it reduces the tonal vocabulary in chord-sequence analysis while keeping the modulation count optimal.

Comments
20 pages, 1 figure

Abstract

We study the assignment of local tonalities to chord sequences, a task useful for harmonic analysis, composition, and jazz-oriented improvisation. Standard dynamic-programming approaches minimize modulations but can introduce unnecessarily many tonal centers. We compare this transition-only objective with pure minimum-vocabulary analysis and with tonal parsimony, which minimizes lexicographically the number of modulations and then the number of distinct tonalities. Although this joint objective is combinatorially hard in general, we give exact algorithms exploiting the fixed 24-tonality major/minor universe. On 31,032 LMD Chords sequences, tonal parsimony preserves the transition optimum while reducing tonal vocabulary in 55.8% of cases. With weighted jazz-substitution closure, it lowers mean tonalities from 3.802 to 3.206 and modulations from 16.728 to 12.141. On 1,555 annotated jazz standards, it improves compatible chord-scale agreement to 95.6%, supporting tractable professional-scale harmonic analysis.
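
The lexicographic objective described above can be illustrated with a small sketch. It is not the paper's algorithm: it assumes each chord comes with a set of compatible key names (the `compat` input is hypothetical), finds the minimum number of modulations with a standard dynamic program, and then searches key subsets of increasing size for the smallest vocabulary that still reaches that minimum. The input must be non-empty, and the brute-force subset search is only practical for small vocabularies.

```python
from itertools import combinations


def min_modulations(compat, allowed):
    """Fewest key changes when chord i takes a key from compat[i] & allowed.

    Returns None if some chord has no admissible key.
    """
    inf = float("inf")
    cost = None  # cost[k] = fewest modulations so far, ending in key k
    for keys in compat:
        options = keys & allowed
        if not options:
            return None
        if cost is None:
            cost = {k: 0 for k in options}
        else:
            best_prev = min(cost.values())
            # Either stay in key k, or modulate from the best previous key.
            cost = {k: min(cost.get(k, inf), best_prev + 1) for k in options}
    return min(cost.values())


def tonal_parsimony(compat):
    """Minimize modulations first, then the number of distinct keys."""
    universe = set().union(*compat)
    target = min_modulations(compat, universe)
    for size in range(1, len(universe) + 1):
        for subset in combinations(sorted(universe), size):
            if min_modulations(compat, set(subset)) == target:
                return target, set(subset)


# Toy input (hypothetical): the last chord fits C or F. A transition-only
# analysis may pick F and use three keys; reusing C needs only two.
compat = [{"C", "F"}, {"C"}, {"G", "D"}, {"G"}, {"C", "F"}]
modulations, keys = tonal_parsimony(compat)
print(modulations, sorted(keys))  # 2 ['C', 'G']
```

Because subsets are tried in order of size, the first subset that reaches the transition optimum is also a minimum vocabulary for that optimum.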

2606.03458 2026-06-03 cs.LG

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli

AI summary: Proposes KVarN, a calibration-free KV-cache quantization method that uses a Hadamard rotation and dual-scaling variance normalization to reduce quantization-error accumulation during autoregressive decoding, reaching state-of-the-art results on generative benchmarks at 2-bit precision.

Abstract

Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding as the KV-cache grows. KV-cache quantization can alleviate this, but current methods are evaluated under prefill-like settings, and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation over existing baselines. KVarN establishes a new state of the art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
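
A rough sketch of the pipeline named in the abstract (rotate, normalize along both axes, quantize to 2 bits) might look as follows. This is not the KVarN implementation: the use of RMS as the scale, the single row-then-column normalization pass, the fixed four-level grid `LEVELS`, and all function names are assumptions made for illustration. Each row stands for one token's vector, whose length must be a power of two.

```python
import math

LEVELS = (-1.5, -0.5, 0.5, 1.5)  # assumed 2-bit grid (4 levels)


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be 2**k)."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rms(values):
    """Root-mean-square scale; falls back to 1.0 for an all-zero slice."""
    return math.sqrt(sum(v * v for v in values) / len(values)) or 1.0


def nearest_level(x):
    """Index of the grid level closest to x."""
    return min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - x))


def compress(matrix):
    """Rotate each token row, scale rows then columns, and quantize."""
    rotated = [fwht(row) for row in matrix]
    row_scale = [rms(row) for row in rotated]
    normed = [[x / s for x in row] for row, s in zip(rotated, row_scale)]
    col_scale = [rms(col) for col in zip(*normed)]
    codes = [[nearest_level(x / c) for x, c in zip(row, col_scale)]
             for row in normed]
    return codes, row_scale, col_scale


def decompress(codes, row_scale, col_scale):
    """Undo scaling, then undo the rotation (the transform is self-inverse)."""
    rotated = [[LEVELS[q] * r * c for q, c in zip(row, col_scale)]
               for row, r in zip(codes, row_scale)]
    return [fwht(row) for row in rotated]


# Toy K matrix with one large-magnitude token row (hypothetical values).
K = [[0.1, -0.2, 0.05, 0.3], [8.0, -6.0, 7.0, 9.0]]
codes, row_scale, col_scale = compress(K)
K_hat = decompress(codes, row_scale, col_scale)
```

Storing a scale per token row is what keeps the large second row from dictating the quantization step of the small first row, which is the kind of token-scale error the abstract points to.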

2606.03453 2026-06-03 cs.CR cs.AI cs.MA

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Farooq Shaikh

AI summary: Proposes FORGE, a multi-agent system that uses graduated exploitation depth to bridge three isolated areas (exploit generation, prioritization, and detection rule engineering); on 603 CVEs it achieves 67.8% end-to-end L1+ exploitation and generates Sigma and Snort detection rules with few false positives.

Comments
18 pages, 4 figures, 3 tables. Accepted at the AgentCy Workshop at the 21st International Conference on Availability, Reliability and Security (ARES 2026). Keywords: Vulnerability assessment, Multi-agent systems, Exploit generation, Detection engineering, Risk prioritization

Abstract

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.

2606.03444 2026-06-03 cs.CV cs.AI

PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization

Ying Tang, Dong Li, Youjia Zhang, Zikai Song, Junqing Yu, Wei Yang

AI summary: Proposes the PRISM framework, a dual-stream Mixture-of-Experts (MoE) architecture with a two-stage paradigm (first deconstructing expertise so that experts specialize, then dynamically recomposing them into task-specific pathways) to address negative transfer when integrating vision foundation models, setting a new state of the art on PASCAL-Context and NYUD-v2.

Comments
Accepted to ICML 2026

Abstract

Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.

2606.03437 2026-06-03 cs.CL

Large Language Models Are Overconfident in Their Own Responses

Mario Sanz-Guerrero, Manuel Mager, Katharina von der Wense

AI summary: Studies the miscalibration of large language models caused by instruction tuning and chat templates, finds that an "ownership bias" makes models up to 26% more confident in their own responses, and proposes reducing overconfidence by presenting the model's answer as user input.

Comments
Accepted to ACL 2026 Findings

Abstract

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.

2606.03435 2026-06-03 cs.AI

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

Yuxin Zhang, Yiyao Li, Ping Shu Ho, Simon See, Zhenqin Wu, Kevin Tsia

AI summary: Proposes CP-Agent, a multimodal large language model built on the context-aware alignment module CP-CLIP, which generates interpretable, mechanism-relevant explanations of cell morphological changes under drug perturbations, achieves accurate treatment and mechanism-of-action discrimination (maximum F1-score of 0.896), and combines tool use with reasoning to produce structured reports that accelerate drug discovery.

Comments
ICLR 2026

Abstract

Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.

2606.03432 2026-06-03 cs.CR cs.AI cs.LG

A Hybrid Approach For Malware Classification Using Secondary Features Fusion

Raja Khurram Shahzad, Muhammad Mustaqeem, Haroon Elahi

AI summary: Proposes a method for malware detection and family classification that fuses API-call and n-gram features and uses a voting-based ensemble, reaching 99.72% accuracy and an AUC of 0.989 on the Microsoft dataset.

Abstract

The number of malware samples (whether variants or novel) is rapidly increasing, making malware detection and mitigation a complex problem. One approach to improving malware mitigation is automatic detection and malware family classification. However, traditional malware detection methods cannot classify detected malware into their respective families, hindering effective malware mitigation. Consequently, this paper proposes a method to automate malware detection and the classification of detected malware into their respective families. The proposed method uses feature fusion after extracting relevant malware features, such as API calls and fixed- and variable-length n-grams, with a customized feature selection method. Moreover, for the predictive model, a voting-based approach is proposed for algorithm fusion. For the experimental evaluation of the proposed method, both binary and multi-class classification approaches are applied to the data set provided by Microsoft. Finally, the experimental results are compared with the state of the art. The experimental results indicate the effectiveness and efficiency of the proposed approach, with an AUC of 0.989, an accuracy of 99.72%, and a log loss of 0.01.

2606.03430 2026-06-03 cs.CR cs.AI

FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems

Maxime Schwarzer, Laurin Holz, Tobias Huerten, Johannes Loevenich, Thies Moehlenhof, Roberto Rigolin F. Lopes, Veit Hagenmeyer

AI summary: Proposes FlowGuard, an identity-independent defence based on flow matching that detects whether queries are out-of-distribution (OOD) to defend energy-system intrusion detection systems against data-free model stealing attacks, keeping a stable detection rate in both single-client and distributed Sybil settings.

Abstract

Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS) deployed in energy infrastructure are vulnerable to model theft attacks, which allow adversaries to create evasive traffic offline. Current defences against model extraction rely either on identity-bound query monitoring, which is ineffective against distributed attackers (Sybil), or on prediction poisoning through soft-label perturbation, which is inapplicable to hard-label IDS deployments. Therefore, we propose FlowGuard, an identity-independent defence based on flow matching that classifies incoming queries as out-of-distribution (OOD) prior to IDS processing. This approach exploits the fact that queries generated synthetically for data-free model stealing attacks occupy a lower-dimensional manifold than real network traffic. This results in measurably lower log-likelihoods when using a Continuous Normalizing Flow that has been trained on legitimate data. We evaluate our method against PRADA and FDINet using MAZE and DisGUIDE attacks in single-client and distributed (100-client Sybil) settings. While PRADA's detection rate dropped to 0% when the distribution changed, our defence maintained a stable detection rate across both settings without relying on identity information. We discuss the scope and limitations of the approach, and outline potential applications to data-dependent attacks.
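
The likelihood-thresholding idea can be sketched in a few lines. To stay self-contained, this example swaps the Continuous Normalizing Flow for a much simpler diagonal Gaussian density, so it illustrates only the decision rule, not FlowGuard itself. The feature vectors, the quantile used for calibration, and all function names are hypothetical.

```python
import math


def fit_diag_gaussian(samples):
    """Fit per-feature mean and variance on legitimate traffic features."""
    dims = list(zip(*samples))
    means = [sum(d) / len(d) for d in dims]
    variances = [max(sum((x - m) ** 2 for x in d) / len(d), 1e-6)
                 for d, m in zip(dims, means)]
    return means, variances


def log_likelihood(x, model):
    """Log-density of one query under the fitted diagonal Gaussian."""
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))


def calibrate_threshold(samples, model, quantile=0.05):
    """Pick the threshold as a low quantile of legitimate log-likelihoods."""
    scores = sorted(log_likelihood(s, model) for s in samples)
    return scores[int(quantile * len(scores))]


def is_ood(x, model, threshold):
    """Flag a query using only its features; no client identity is needed."""
    return log_likelihood(x, model) < threshold


# Toy legitimate feature vectors (hypothetical values).
legit = [[0.0, 1.0], [0.2, 0.9], [-0.1, 1.1], [0.1, 1.0], [0.0, 0.8]]
model = fit_diag_gaussian(legit)
threshold = calibrate_threshold(legit, model)

print(is_ood([5.0, -4.0], model, threshold))   # True
print(is_ood([0.05, 0.95], model, threshold))  # False
```

Since the decision depends only on the query, splitting an attack across many Sybil clients does not change the score of any individual query.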

2606.03428 2026-06-03 cs.NE cs.AI cs.LG

PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers

Rachmad Vidya Wicaksana Putra, Achyuta Muthuvelan, Alberto Marchisio, Muhammad Shafique

AI summary: Proposes the PrimeSVT framework, which compresses Spiking Vision Transformers under accuracy and memory constraints through automated structured pruning and a prioritized compression policy, saving 26.68% memory with an accuracy loss within 3%.

Comments
8 pages, 8 figures, 3 tables

Abstract

The large sizes of Spiking Vision Transformers (SViTs) still hinder their embedded implementation, highlighting the need for model compression. State-of-the-art works compress SViT models through unstructured pruning, which needs specialized hardware accelerators for their specific sparsity patterns to maximize efficiency gains. Moreover, their manual approach requires a huge design time to find an appropriate pruning setting for each network, thus making this approach not scalable. To address this limitation, we propose PrimeSVT, a novel framework that performs automated memory-aware structured pruning on pre-trained SViT models, thereby maximizing their efficiency gains during inference amenable to widely-used computing architectures. To achieve this, PrimeSVT first sorts the SViT layers based on their sizes (i.e., number of parameters), identifies the targeted pruning layers based on their robustness under different pruning rates, then leverages this order for compressing the model layer-by-layer sequentially from the largest one to the smallest one (i.e., so-called prioritized compression policy), while considering the user-defined constraints (i.e., acceptable accuracy and memory saving). In each layer, PrimeSVT employs channel-wise filter pruning based on their L2-norm values to structurally remove the non-significant weights. Experimental results show that PrimeSVT saves 26.68% memory through automated single-shot pruning, while preserving accuracy within 3% (70.3% without fine-tuning and 72.9% with fine-tuning) from the original unpruned SViT model (73.3%), thus meeting the accuracy and memory constraints. These show that our PrimeSVT framework enables design automation for SViTs and their embedded implementation.
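
A simplified sketch of the largest-first policy with L2-norm channel pruning is given below. It is an assumption-laden illustration, not the PrimeSVT implementation: layers are plain lists of channels, a single pruning rate is used for every layer, the `evaluate` callback and `min_accuracy` bound are hypothetical stand-ins for the user-defined constraints, and the bookkeeping needed to shrink the next layer's inputs after removing output channels is omitted.

```python
import math


def param_count(layer):
    """Number of weights in a layer given as a list of channels."""
    return sum(len(channel) for channel in layer)


def prune_channels(layer, rate):
    """Drop the fraction `rate` of channels with the smallest L2 norm."""
    keep = len(layer) - int(rate * len(layer))
    norms = [math.sqrt(sum(w * w for w in channel)) for channel in layer]
    ranked = sorted(range(len(layer)), key=lambda i: norms[i], reverse=True)
    return [layer[i] for i in sorted(ranked[:keep])]


def prioritized_prune(model, rate, evaluate, min_accuracy):
    """Visit layers from largest to smallest; keep a pruned layer only if
    the accuracy reported by `evaluate` stays at or above `min_accuracy`."""
    pruned = dict(model)
    order = sorted(model, key=lambda n: param_count(model[n]), reverse=True)
    for name in order:
        candidate = dict(pruned)
        candidate[name] = prune_channels(model[name], rate)
        if evaluate(candidate) >= min_accuracy:
            pruned = candidate
    return pruned


# Toy model and a stand-in accuracy function (both hypothetical): accuracy
# collapses if the small layer loses a channel.
model = {
    "big": [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [0.0, 2.0]],
    "small": [[1.0], [0.5]],
}
evaluate = lambda m: 0.9 if len(m["small"]) == 2 else 0.5
result = prioritized_prune(model, rate=0.5, evaluate=evaluate, min_accuracy=0.7)
print(result["big"])    # [[3.0, 4.0], [0.0, 2.0]]
print(result["small"])  # [[1.0], [0.5]]
```

Visiting the largest layers first means the biggest memory savings are attempted before the accuracy budget is spent on small layers.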

2606.03421 2026-06-03 cs.RO

Reliability-Guided Depth Fusion for Glare-Resilient Navigation Costmaps

Shang-En Tsai

AI summary: To handle depth-measurement noise caused by reflective floors, glass boundaries, and similar surfaces, proposes a costmap construction method based on explicit depth-reliability modeling: DRM-Net predicts per-pixel reliability and a weighted-and-gated fusion mechanism suppresses erroneous occupancy updates; experiments show fewer false obstacles while maintaining real-time performance.

Abstract

Specular glare on reflective floors, glass boundaries, and glossy indoor surfaces frequently corrupts active-stereo RGB-D depth measurements, producing holes and spikes that accumulate as persistent phantom obstacles in occupancy-grid costmaps. This paper presents a glare-resilient costmap construction method based on explicit depth-reliability modeling. A lightweight Depth Reliability Map network (DRM-Net) predicts per-pixel measurement trustworthiness under specular interference, and a reliability-guided weighted-and-gated fusion (RGF) mechanism modulates occupancy updates before corrupted measurements are accumulated into the map. To support robust training and evaluation, the method uses pose-aligned multi-view reference-depth construction to reduce circular-supervision bias and is evaluated through fusion-variant ablations, parameter-sensitivity analysis, cross-condition tests, paired navigation comparisons, reliability-map metrics, and embedded runtime profiling. Experiments on a real mobile robotic platform equipped with an Intel RealSense D435 and a Jetson Orin Nano show that the proposed method reduces false obstacle insertion, improves free-space preservation, and maintains real-time throughput under reflective-floor, glass-wall, and natural-light glare conditions. These results support treating glare as a measurement-reliability problem rather than as a dense depth-completion problem for safety-critical indoor navigation.
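
One way to picture a weighted-and-gated occupancy update is the log-odds sketch below. It is a guess at the general shape of such a rule, not the paper's RGF mechanism: the sensor-model probabilities, the gate threshold, the clamp, and the function names are all assumptions, and the reliability value is taken as given rather than predicted by a network.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


L_OCC = logit(0.7)   # assumed log-odds increment for an occupied reading
L_FREE = logit(0.3)  # assumed log-odds increment for a free reading


def fuse_cell(log_odds, hit, reliability, gate=0.2, clamp=4.0):
    """Update one costmap cell from one depth measurement.

    Gating: measurements below `gate` reliability are ignored entirely.
    Weighting: accepted measurements are scaled by their reliability.
    """
    if reliability < gate:
        return log_odds
    delta = (L_OCC if hit else L_FREE) * reliability
    return max(-clamp, min(clamp, log_odds + delta))


def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Five glare-induced "hits" with low reliability leave the cell unchanged.
cell = 0.0
for _ in range(5):
    cell = fuse_cell(cell, hit=True, reliability=0.1)
print(probability(cell))  # 0.5

# A trusted free-space reading lowers the occupancy probability.
trusted = fuse_cell(0.0, hit=False, reliability=0.9)
```

In an unweighted update the five spurious hits would accumulate into a persistent phantom obstacle; with gating they never enter the map.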

2606.03420 2026-06-03 cs.CV

PHAF-Personalized Hand Avatars in a Flash

Meghana Shankar, Akanxit Upadhyay, Anmol Namdev, Green Rosh KS, Pawan Prasad BH

AI summary: Proposes PHAF, which quickly generates a personalized photo-realistic hand avatar from two images (dorsal and palmar views) by combining semantic guided mesh alignment, densified texture extraction, and a view-based inpainting network, producing high-quality multi-view renders with texture generation 30 times faster than existing methods.

Abstract

We present PHAF-Personalized Hand Avatars in a Flash, a personalized photo-realistic hand avatar which provides high-quality multi-view renders from just two images (dorsal and palmar views). Unlike slow optimization-based techniques, PHAF quickly generates personalized textures for real-time deployment on edge devices. Our approach combines semantic guided mesh alignment and densified texture extraction to transfer high-frequency details efficiently. A view-based inpainting network refines textures, ensuring a smooth, continuous appearance. PHAF generalizes to novel viewpoints and leverages a parametric hand model for accurate articulations, making it compatible with standard graphics engines. Experiments show it is comparable to existing methods in visual fidelity while drastically reducing texture generation time by 30 times, enabling practical AR/VR applications.

2606.03418 2026-06-03 cs.CV

IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection

Hengyang Zhou, Rongman Hong, Yuxuan Zhou, Jing Wang, Zhaoyan Pan

AI summary: Proposes Incongruity-aware Distribution Optimization (IDO), which models factual incongruity and modality incongruity to improve multimodal fake news detection.

Comments
Accepted by GlobalSouthML@ICML 2026

Abstract

Multimodal fake news detection aims to identify the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. However, misinformation often contains semantic information that is incongruous with the facts. To address these challenges, we propose Incongruity-aware Distribution Optimization (IDO) to improve the performance of fake news detection from the perspectives of factual incongruity and modality incongruity. For factual incongruity, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings and utilize a Gaussian distribution to model the uncertain correlation caused by factual incongruity. For modality incongruity, we utilize incongruity contrastive learning to learn cross-modal semantic information. Experiments demonstrate that IDO achieves state-of-the-art performance.

2606.03417 2026-06-03 cs.CV

A unified multi-task framework enables interpretable chest radiograph analysis

Lijian Xu, Ziyu Ni, Xinglong Liu, Xiaosong Wang, Hongsheng Li, Shaoting Zhang

AI summary: Proposes the IMT-CXR framework, which uses a unified transformer architecture to emulate radiologists' diagnostic workflow, covering disease recognition, attribute characterization, and traceable report generation; it performs competitively on ten benchmarks, and in a clinical evaluation 66% of AI-generated reports were rated comparable to or better than the original reports.

Abstract

While multimodal deep learning has advanced medical imaging analysis, existing black-box systems may remain confined to isolated tasks, often overlooking the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transformer for Chest X-ray Analysis), a framework that emulates radiologists' diagnostic workflow through three evidence-driven stages: 1) Disease recognition; 2) Attribute characterization (e.g., size, location, severity quantification); 3) Evidence-integrated report generation with traceable decision pathways. The framework employs a unified transformer architecture optimized via medical-domain instruction tuning, sequentially executing four clinical tasks: multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation. Experimental validation demonstrates competitive performance on ten CXR benchmarks under direct inference and fine-tuning settings. In a blinded evaluation of 160 historical reports from four medical centers, three radiologists rated 66% of AI-generated reports as comparable to or surpassing original clinical reports in diagnostic clarity, highlighting the framework's translational potential. By establishing traceable diagnostic pathways from anatomical findings to conclusions, this work bridges the gap between AI technical metrics and clinical utility, advancing trustworthy AI systems in medical imaging.

2606.03412 2026-06-03 cs.CL

Lexicons and grammars for language processing: industrial or handcrafted products?

Eric Laporte

AI summary: Analyzes two trends in building lexicons and grammars, manual handcrafting and automated industrial production, and discusses which approach, or a combination of both, gives the best results.

Journal ref
Léxico e gramática: dos sentidos à construção da significação, Cultura acadêmica, 2009, Trilhas Lingüísticas, 16, pp.51-84

Abstract

In recent years, the use of linguistic data for language processing has increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts such as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) have also been developed recently. Most processes of construction of lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more specialists of language processing realize that the information content of lexicons and grammars is richer than that of corpora, and hence the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve in two directions: either specialists of language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the process of construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the other depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss which of the two trends, handcrafting or generating industrially, or a combination of both, can give the best results or is the most realistic.

2606.03410 2026-06-03 cs.CV

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

Abhishek Kumar, Isha Motiyani, Tilak Kasturi, Ethan Seefried, Prahitha Movva, Tirthankar Ghosal

AI summary: To address the lack of a public benchmark for engineering diagrams, proposes Enginuity, the first open dataset for this domain, and evaluates frontier VLMs on two tasks, structured parts-table extraction and free-form visual question answering, revealing a systematic gap between part identification and description fidelity.

Abstract

Engineering diagrams pose a distinct challenge for vision-language models: unlike natural images or general documents, they encode information through dense spatial layouts, domain-specific symbols, and cross-references between visual callouts and structured parts tables. Despite their centrality to service, repair, and design workflows, there is no public benchmark for measuring VLM capabilities in this domain; existing datasets primarily focus on flowcharts, scientific figures, or business documents. To address this gap, we introduce Enginuity, the first open dataset and benchmark for evaluating VLMs on complex engineering diagrams. We define two benchmark tasks over a corpus of U.S. military service and repair manuals: structured parts-table extraction (Task 1) and free-form visual diagram question answering (VQA) (Task 2). We evaluate four frontier VLMs (GPT-5.2 Chat, Claude Opus 4.7, Gemma 4, Qwen3-VL-32B-Instruct) under zero-shot and chain-of-thought prompting. On Task 1, models reach Recall@all of 0.61-0.87 but Token F1pen of only 0.03-0.18, exposing a systematic gap between part identification and description fidelity. Task 2 reveals a consistent factual-reasoning gap across all models. A supporting analysis shows that token-overlap metrics under-report model capability on technical descriptions by 2-6x relative to semantic similarity, motivating LLM-as-judge calibration for domain-specific evaluation. We release the dataset, annotations, evaluation harness, and per-sample model outputs to support a reproducible study of VLM capability on engineering content.
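
The gap between identifying parts and matching their descriptions token by token can be seen in a small sketch. These are generic definitions, not the benchmark's official metrics: the abstract's "Token F1pen" variant is not reproduced here, `recall_at_all` is assumed to mean the fraction of reference part identifiers that appear in the prediction, and the example strings are invented.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Plain token-overlap F1 on whitespace tokens (case-insensitive)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_all(predicted_ids, reference_ids):
    """Fraction of reference part identifiers found in the prediction."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(reference & set(predicted_ids)) / len(reference)


# The parts are found, but the wording differs from the manual's phrasing.
print(recall_at_all(["1", "2", "3"], ["1", "2", "4", "5"]))  # 0.5
print(token_f1("hex bolt", "bolt, machine, hexagon head"))   # 0.0
print(token_f1("washer flat", "flat washer"))                # 1.0
```

The second call scores zero although a reader would accept the description, which is the kind of under-reporting by token-overlap metrics that the abstract describes.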

2606.03406 2026-06-03 cs.CV

SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching

Xu Pan, Qiyuan Ma, Mingyue Dong, He Chen, Wei Ji, Xianwei Zheng

AI总结 提出SAMatcher框架,通过共视性建模预测共视区域掩码和边界框作为结构先验,利用Segment Anything Model的对称交叉视图交互机制和统一监督方案,显著提升大视角和尺度变化下的特征匹配性能。

详情
Comments
14 pages
AI中文摘要

可靠的对应估计是图像处理中的一个基本问题,支撑着运动恢复结构、视觉定位和图像配准等应用。现有的基于学习的方法显著改进了局部特征表示,但大多数仍在像素或块级别操作,缺乏对跨视图共同可见区域的显式建模。我们提出了SAMatcher,一个通过共视性建模进行对应估计的特征匹配框架。SAMatcher不直接匹配局部特征,而是首先预测共视区域掩码和边界框作为对应估计的结构先验。基于Segment Anything Model (SAM),它引入了一种对称的交叉视图交互机制,实现了双向特征交换和跨视图语义对齐。我们进一步开发了一个统一的监督方案,通过掩码学习、边界框回归和掩码-边界框一致性约束联合优化掩码预测和边界框定位。在具有挑战性的基准上的大量实验表明,与现有的匹配流程相比,特别是在大视角和尺度变化下,性能有显著提升。我们的结果表明,最初为单目分割设计的基础模型可以通过显式的共视性建模有效地扩展到多视图对应推理,为图像匹配的结构化表示学习提供了新的视角。代码和项目页面:https://xupan.top/Projects/samatcher

英文摘要

Reliable correspondence estimation is a fundamental problem in image processing, underpinning applications such as Structure from Motion, visual localization, and image registration. Existing learning-based methods have significantly improved local feature representations, yet most still operate at the pixel or patch level and lack explicit modeling of regions that are jointly visible across views. We propose SAMatcher, a feature matching framework that formulates correspondence estimation through co-visibility modeling. Instead of directly matching local features, SAMatcher first predicts co-visible region masks and bounding boxes as structured priors for correspondence estimation. Built upon the Segment Anything Model (SAM), it introduces a symmetric cross-view interaction mechanism that enables bidirectional feature exchange and cross-view semantic alignment. We further develop a unified supervision scheme that jointly optimizes mask prediction and box localization through mask learning, box regression, and mask-box consistency constraints. Extensive experiments on challenging benchmarks demonstrate substantial improvements over existing matching pipelines, particularly under large viewpoint and scale variations. Our results show that foundation models originally designed for monocular segmentation can be effectively extended to multi-view correspondence reasoning through explicit co-visibility modeling, offering a new perspective on structured representation learning for image matching. Code and project page: https://xupan.top/Projects/samatcher

2606.03401 2026-06-03 cs.CV

Towards Characterizing Scientific Image Utility and Upgradability

面向科学图像效用与可升级性的表征

WenZhe Li, Qihang Yan, Liang Chen, Junying Wang, Farong Wen, Yijin Guo, Chunyi Li, Zicheng Zhang, Guangtao Zhai

AI总结 针对AI生成内容对科学图像完整性的威胁,提出SIU²A框架,通过效用(错误检测与修正可行性)和可升级性(修正质量)两个维度评估科学图像,并构建基准数据集揭示当前多模态系统在科学错误评估与忠实修正方面的显著局限。

详情
AI中文摘要

科学图像在研究交流中作为关键证据,但其完整性面临来自AI生成内容的前所未有的威胁,这些内容引入了微妙但严重的错误。现有的评估范式被证明是不充分的:感知质量指标与科学有效性相关性差,而语言模型缺乏特定领域的验证能力。为了解决这一差距,我们提出了科学图像效用与可升级性评估(SIU²A)框架,该框架引入了两个互补的科学图像评估维度。效用包括错误检测(识别科学不准确性)和修正可行性(评估错误是否可以被可靠修复)。可升级性衡量修正的质量。我们将科学图像损坏分为四种基本类型:细节失真、不完整性、虚假内容和实体混淆。基于这一分类,我们构建了SIU²A-Benchmark,这是一个包含专家标注用于错误识别和修复的数据集。该框架实现了一个两阶段评估协议:效用阶段评估错误检测能力和修复指令生成,而可升级性阶段评估修正是否在不损害现有准确信息的情况下忠实恢复科学有效性。实验表明,当前的多模态系统在科学错误评估和忠实修正方面表现出显著局限性,揭示了视觉感知与科学可用性之间的根本差距。

英文摘要

Scientific images function as critical evidence in research communication, yet their integrity faces unprecedented threats from AI-generated content that introduces subtle but consequential errors. Existing evaluation paradigms prove inadequate: perceptual quality metrics poorly correlate with scientific validity, while language models lack domain-specific verification capabilities. To address this gap, we propose the Scientific Image Utility and Upgradability Assessment (SIU²A) framework, which introduces two complementary dimensions for scientific image evaluation. Utility encompasses error detection (identifying scientific inaccuracies) and correction feasibility (assessing whether errors can be reliably repaired). Upgradability measures the quality of correction. We categorize scientific image corruption into four fundamental types: Detail Distortion, Incompleteness, False Content, and Entity Confusion. Based on this taxonomy, we construct SIU²A-Benchmark, a dataset with expert annotations for error identification and repair. The framework implements a two-stage evaluation protocol: the Utility stage evaluates error detection capability and repair instruction generation, while the Upgradability stage assesses whether corrections faithfully restore scientific validity without compromising existing accurate information. Experiments reveal that current multimodal systems exhibit significant limitations in both scientific error assessment and faithful correction, exposing a fundamental gap between visual perception and scientific usability.

2606.03399 2026-06-03 cs.CL cs.CR

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models

选择性令牌级密码学编辑用于大型语言模型的隐私保护临床部署

Farhan Sheth, Ziyuan Yang, Yongying Lan, Si Yong Yeo

AI总结 提出HERALD框架,通过令牌级密码学编辑仅加密敏感令牌,在保护隐私的同时保持下游模型效用,在分类和医疗问答任务上接近明文性能。

详情
Comments
33 pages, 8 figures, 26 tables
AI中文摘要

尽管大型语言模型(LLMs)越来越多地用于临床应用,但许多现有流程需要将原始敏感健康信息发送到远程服务器进行处理,这增加了隐私泄露的风险。缓解这种风险的一种自然方法是在传输前对数据进行加密。然而,加密整个数据集等直接解决方案会带来巨大的计算、对齐和通信开销,使得大规模实际部署不可行。为了在保护隐私的同时保持可用性,我们提出了通过自适应语言分解的医疗加密与编辑(HERALD),这是一个令牌级密码学编辑框架,通过仅加密敏感令牌同时保留上下文以供下游模型使用,实现了这种平衡。HERALD结合医学命名实体识别器(NER)与基于词性(POS)的策略来选择候选令牌,执行目标词形还原以稳定表面形式,并将每个受保护令牌替换为包裹在显式分隔符中的确定性密文。值得注意的是,HERALD是模型无关的,完全在客户端运行,确保敏感内容在存储、传输和处理过程中保持加密,无需更改下游模型。我们在公开数据集上对分类和医疗问答(MQA)任务评估了HERALD。在不同任务中,实验表明完全安全的基线遭受显著的效用损失,而HERALD始终恢复接近明文的性能。总体而言,HERALD提供了一种新颖的利用流程。

英文摘要

While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy leakage. A natural approach to mitigate this risk is to encrypt the data before transmission. However, straightforward solutions such as encrypting the entire dataset introduce prohibitive computational, alignment, and communication overheads, rendering large-scale practical deployment infeasible. To preserve privacy while maintaining usability, we present Healthcare Encryption & Redaction via Adaptive Linguistic Decomposition (HERALD), a token-level cryptographic redaction framework that achieves this balance by encrypting only sensitive tokens while preserving the surrounding context for downstream model utility. HERALD combines a medical named-entity recognizer (NER) with part-of-speech (POS) driven policies to select candidate tokens, performs targeted lemmatization to stabilize surface forms, and substitutes each protected token with a deterministic ciphertext wrapped in explicit delimiters. Notably, HERALD is model-agnostic and operates entirely on the client side, ensuring that sensitive content remains encrypted throughout storage, transmission, and processing without requiring changes to downstream models. We evaluated HERALD on both classification and medical question answering (MQA) tasks on public datasets. Across different tasks, experiments show that fully secured baselines suffer significant utility loss, whereas HERALD consistently recovers performance close to plaintext. Overall, HERALD provides a novel utilization pipeline.

2606.03398 2026-06-03 cs.CL cs.AI

Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Transformer建模计数器语言中栈表示的因果证据

Nishit Singh

AI总结 通过线性探针和消融实验,证明Transformer在计数器语言任务中学习的栈表示对其性能具有因果必要性。

详情
Comments
8 pages, 8 figures
AI中文摘要

形式语言已被证明是理解Transformer内部机制的有效途径。以往研究表明,在计数器语言上进行下一个词预测训练的Transformer会学习到与底层栈结构一致的表示。除了表示分析,本文还研究了这些表示的因果作用。我们训练线性探针从模型隐藏状态中预测每个词符处的栈深度,并从探针中提取主表示方向。从模型中消融该方向会导致序列准确率骤降至接近0%,这提供了强有力的经验证据,表明栈表示不仅是学习到的,而且对模型性能具有因果必要性。

英文摘要

Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysis, this paper investigates the causal role of these representations. Linear probes are trained to predict the stack depth at each token from the model's hidden states, and a principal representation direction is extracted from the probe. Ablation of this direction from the model causes sequential accuracy to collapse to near 0%, providing strong empirical evidence that the stack representation is not just learned, but is causally necessary for model performance.
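The ablation step can be pictured as removing one direction from each hidden-state vector. Below is a minimal sketch that assumes ablation means orthogonal projection, a common choice; the abstract does not specify the exact procedure. The function name and the plain-list representation are illustrative only.

```python
import math


def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`.

    Assumption: ablation is an orthogonal projection onto the complement
    of the probe direction. Both arguments are plain lists of floats.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]
```

After this operation the hidden state has zero dot product with the ablated direction, so a linear probe that reads only that direction no longer sees any signal.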

2606.03392 2026-06-03 cs.RO

OpenEAI-Platform: An Open-source Embodied Artificial Intelligence Hardware-Software Unified Platform

OpenEAI-Platform: 一个开源具身人工智能硬件-软件统一平台

Jinyuan Zhang, Luoyi Fan, Leiyu Wang, Yeqiang Wang, Yicheng Zhu, Cewu Lu, Nanyang Ye

AI总结 提出OpenEAI-Platform,集成低成本6+1自由度机械臂和可复现VLA模型,通过开源设计和两阶段训练在真实操作任务中超越商业臂,性能媲美大规模预训练基线。

详情
AI中文摘要

现实世界中的具身AI需要精确的硬件和稳健的视觉-语言-动作(VLA)策略。我们提出OpenEAI-Platform,一个完全开源平台,集成了低成本6+1自由度机械臂(OpenEAI-Arm)和可复现的VLA模型(OpenEAI-VLA)。OpenEAI-Arm提供开源机械设计以实现低制造成本,并采用柔顺控制方法以提高精度。OpenEAI-VLA基于Qwen3-VL-4B,使用扩散Transformer动作头,并仅使用开源机器人和多模态数据集进行两阶段训练。在四个真实操作任务中,OpenEAI-Arm在相同策略下优于两款商用6+1自由度机械臂,而OpenEAI-VLA在仅有限预训练数据下达到了与大规模预训练pi0基线相当的成功率。我们将发布完整的硬件设计、驱动程序、模型以及训练/数据流水线,以支持可复现研究和可扩展数据收集。我们的代码、布局和模型将在论文被接收后发布。

英文摘要

Embodied AI in the real world requires both accurate hardware and robust vision-language-action (VLA) policies. We present OpenEAI-Platform, a fully open-source platform that integrates a low-cost 6+1 degree-of-freedom (dof) robotic arm (OpenEAI-Arm) and a reproducible VLA model (OpenEAI-VLA). OpenEAI-Arm provides open-source mechanical designs for low manufacturing cost and compliant control methods for higher accuracy. OpenEAI-VLA builds on Qwen3-VL-4B and uses a Diffusion Transformer action head, and is trained in two stages with only open-source robot and multimodal datasets. Across four real-world manipulation tasks, OpenEAI-Arm outperforms two commercial 6+1-dof arms under the same policy, and OpenEAI-VLA achieves success rates comparable to the large-scale pretrained pi0 baseline with only limited pretraining data. We will release the full hardware designs, drivers, models, and training/data pipelines to support reproducible research and scalable data collection. Our codes, layouts, and models will be released after the paper is accepted.

2606.03391 2026-06-03 cs.LG cs.AI cs.CL

When Model Merging Breaks Routing: Training-Free Calibration for MoE

当模型合并破坏路由:MoE的无训练校准

Canbin Huang, Tianyuan Shi, Xiaojun Quan, Jingang Wang, Jianfei Zhang, Qifan Wang

AI总结 针对MoE架构中模型合并导致的路由崩溃问题,提出基于二阶曲率的无训练校准方法HARC,通过闭式解和共轭梯度法高效重对齐路由器,显著提升数学推理和代码生成性能。

详情
AI中文摘要

模型合并已成为一种无需重新训练即可整合多个LLM能力的成本效益方法。然而,现有的合并技术主要基于线性参数算术或优化,在应用于混合专家(MoE)架构时面临困难。我们识别出MoE合并中的一个关键失效模式,称为路由崩溃,其中合并后的路由器无法将令牌分派给合适的专家。路由崩溃源于非线性softmax和离散Top-k路由机制对合并引起的参数扰动的敏感性,这种敏感性进一步被MoE预训练期间施加的负载平衡约束放大。由于微调后的专家表现出不同的专长,即使是适度的错误路由也可能导致严重的性能下降。为解决此问题,我们提出Hessian感知路由器校准(HARC),一种无训练框架,利用二阶曲率信息重新对齐合并后的路由器。该方法采用闭式解,可通过无矩阵共轭梯度法高效求解。在数学推理和代码生成任务上的实验表明,HARC有效缓解了多种MoE合并基线中的路由崩溃,并带来了显著的性能提升。我们的代码可在 https://github.com/huangcb01/HARC 获取。

英文摘要

Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.
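The abstract mentions a matrix-free conjugate gradient solver. The sketch below shows that generic technique only: it solves A x = b for a symmetric positive-definite A that is available solely through a matrix-vector product callback. The objective HARC actually solves is not given in the abstract, so `toy_matvec` is a hypothetical stand-in for a Hessian-vector product.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=100):
    """Solve A x = b using only products A @ v supplied by `matvec`."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def toy_matvec(v):
    # Hypothetical stand-in for a Hessian-vector product: A = [[4, 1], [1, 3]].
    return [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]


solution = conjugate_gradient(toy_matvec, [1.0, 2.0])
```

The matrix is never stored; only its action on a vector is needed, which is what makes this style of solver attractive when the curvature matrix is too large to materialize.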

2606.03390 2026-06-03 cs.RO

Extreme Motion Generation via Hybrid Null-Space Control for Straight-Line Path Following

通过混合零空间控制实现直线路径跟踪的极端运动生成

Xinyi Yuan, Weiwei Wan, Kensuke Harada

AI总结 提出一种混合控制器,结合强化学习策略和模型控制,在关节极限附近切换,以最大化机械臂沿预定轨迹的笛卡尔路径长度,在7自由度Franka FR3上平均延长27%的路径长度。

详情
AI中文摘要

这项工作研究了“极端运动生成”,旨在在机械臂工作空间内沿预定义轨迹最大化笛卡尔路径长度。这一目标在工业中很重要,因为路径跟踪是许多任务(如表面涂层和焊接)的基础。更关键的是,极端运动使固定基座机械臂能够在有限可达性下利用运动学能力。然而,这种利用在实践中具有挑战性,因为机械臂必须在执行过程中主动避开安全边界,这本质上是一个长视界问题。因此,我们主张长视界决策应委托给基于学习的策略以最大化利用,而经典模型控制器覆盖近边界区域,其中学习策略由于稀疏数据覆盖而急剧退化。具体来说,我们提出的方法是一个步级混合控制器,根据归一化关节极限距离在基于强化学习的控制器和模型控制器之间切换。初始关节配置通过条件扩散采样获得,基于学习到的运动先验改进了可实现的路径长度。我们在7自由度Franka FR3上对10,000个直线路径跟踪任务评估了所提出的框架,平均滚动长度比基于模型的基线延长了27%。值得注意的是,某些任务产生了朝向运动极端的显著延伸,如统计结果中报告的最大改进所示。本文的项目网站和相关视频可在 https://yuan-xinyi.github.io/extreme-motion-generation/ 找到。

英文摘要

This work studies "extreme motion generation", which aims to maximize the Cartesian path length along a pre-defined trajectory within the manipulator's workspace. This objective is important in industry because path following is fundamental to a wide variety of tasks such as surface coating and welding. More critically, extreme motion enables a fixed-base manipulator to exploit its kinematic capability under limited reachability. However, such exploitation is challenging in practice, as the manipulator must actively avoid the safety boundary throughout execution, which is inherently a long-horizon problem. Accordingly, we claim that long-horizon decision-making should be delegated to a learning-based policy to maximize exploitation, while a classical model-based controller covers the near-boundary region, where the learning policy degrades sharply due to sparse data coverage. In detail, our proposed method is a step-level hybrid controller that switches between an RL-based and a model-based controller according to the normalized joint-limit distance. The initial joint configuration is sampled through conditional diffusion-based sampling, which improves the achievable path length based on the learned motion prior. We evaluate the proposed framework on 10,000 straight-line path-following tasks with a 7-DoF Franka FR3, extending the average rollout length by 27% over the model-based baseline. Notably, certain tasks yield a pronounced extension toward the motion extreme, as reflected in the maximum improvement reported in the statistical results. The project website and related videos of this paper can be found at https://yuan-xinyi.github.io/extreme-motion-generation/.
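The switching rule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' controller: the normalization (0 at a joint limit, 1 at mid-range), the threshold value of 0.1, and the controller labels are all hypothetical choices made for the example.

```python
def normalized_limit_distance(q, lower, upper):
    """Smallest joint-limit distance across joints, scaled to [0, 1].

    0 means some joint sits exactly at a limit; 1 means every joint is
    at the middle of its range. This scaling is an assumption.
    """
    return min(
        min(qi - lo, hi - qi) / ((hi - lo) / 2.0)
        for qi, lo, hi in zip(q, lower, upper)
    )


def select_controller(q, lower, upper, threshold=0.1):
    """Pick the controller for the current step (threshold is hypothetical)."""
    if normalized_limit_distance(q, lower, upper) < threshold:
        return "model_based"   # near a joint limit: fall back to the model
    return "rl_policy"         # otherwise let the learned policy act
```

Evaluating the rule at every control step is what makes the controller "step-level": the active controller can change from one step to the next as the arm approaches or leaves a joint limit.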

2606.03385 2026-06-03 cs.RO cs.AI

Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation

先抓取后规划与失败归因:一种用于精确且可泛化机器人操作的闭环两阶段框架

Jiahao Xu, Peiyuan Wang, Hanzhuo Zhang, Zihao Yu, Tianyu Fu, Hao Chen, Xuanhao Xiang, Jianbo Yu, Chenchen Fu, Wanyuan Wang

AI总结 提出GTP-FA框架,通过任务导向的两阶段抓取-规划流程和失败归因模型,在抓取和规划模块中分别注入任务先验和风险惩罚以及针对高风险初始状态进行数据收集和微调,显著提升机器人操作任务的成功率。

详情
Comments
32 pages, project page: https://sites.google.com/view/gtp-fa/
AI中文摘要

在机器人操作中,抓取与运动规划之间的紧密耦合常常掩盖失败的真实原因,导致低效的试错过程。为了实现高效的长时域操作,我们提出了GTP-FA(先抓取后规划与失败归因),一种面向任务的两阶段抓取-规划框架,该框架生成抓取候选并根据所选抓取执行下游运动规划。给定失败的操作轨迹,我们学习一个失败归因模型,该模型可泛化到未见过的抓取,并生成失败模式的稳定分布以进行诊断引导的优化。基于这些归因结果,我们以诊断驱动的方式优化两个模块:在抓取侧,我们将任务级先验和风险惩罚注入抓取候选评分和优化中,以抑制不稳定或与任务不兼容的抓取;在规划侧,我们通过数据收集和微调针对高风险初始状态,以解决真正的规划瓶颈。我们在仿真和真实机器人实验中评估了所提出的框架,并表明GTP-FA在基于RL、IL、扩散策略和VLA的设置中提升了相应的基础学习器,实现了显著更高的总体任务成功率。

英文摘要

In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. Given a failed manipulation trajectory, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes for diagnosis-guided optimization. Based on these attribution results, we then optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. We evaluate the proposed framework in both simulation and real-robot experiments, and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.

2606.03381 2026-06-03 cs.CR cs.AI

AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses

AI模型提取攻击:绕过防御中的单客户端假设

Maxime Schwarzer, Johannes F. Loevenich, Gustavo Sánchez, Laurin Holz, Thies Möhlenhof, Tobias Hürten, Roberto Rigolin F. Lopes, Veit Hagenmeyer

AI总结 本文通过提出CerberusAI框架,系统性地证明模型提取攻击中的单客户端假设(SCA)在高级持续性威胁(APT)等协同攻击者面前无效,并展示基本轮询查询分布策略即可绕过PRADA等防御机制,呼吁转向有状态、独立于身份的防御架构。

详情
AI中文摘要

确保部署在军事指挥控制(C2)系统和关键基础设施中的人工智能(AI)模型的保护对于维持信息优势至关重要。模型提取攻击(MEA)构成了重大威胁,因为它们使对手能够复制专有模型、泄露受保护信息并准备离线对抗性攻击。然而,当前的防御策略主要依赖于单客户端假设(SCA),即隐含地假设攻击源自孤立身份。本工作系统地证明了在协同威胁行为者(如高级持续性威胁APT)存在的情况下,SCA从根本上无效。我们引入了一个模块化、开源框架CerberusAI,用于可复现的模型窃取研究,并利用它模拟分布式攻击场景。我们的实证评估表明,成熟的防御机制(如防止深度神经网络模型窃取攻击PRADA)可以通过基本的轮询查询分布策略被绕过,导致检测性能显著下降。此外,我们证明即使是全局聚合方法也可以通过自适应流量混合使其在操作上变得无用。这些结果强调了在模型提取攻击领域需要向有状态、独立于身份的防御架构进行范式转变。本文最初发表于由信息系统技术(IST)科学与技术委员会IST-224-RSY组织的国际军事通信与信息系统会议(ICMCIS),该会议于2026年5月12-13日在英国巴斯举行,并获得了最佳论文奖。

英文摘要

Ensuring the protection of Artificial Intelligence (AI) models deployed in military Command and Control (C2) systems and critical infrastructure is essential for maintaining information superiority. Model Extraction Attacks (MEAs) pose a significant threat, as they enable adversaries to replicate proprietary models, compromise protected information, and prepare offline adversarial attacks. However, current defense strategies predominantly rely on the Single Client Assumption (SCA), the implicit assumption that attacks originate from isolated identities. This work systematically demonstrates that the SCA is fundamentally invalid in the presence of coordinated threat actors, such as Advanced Persistent Threats (APTs). We introduce a modular, open-source framework called CerberusAI for reproducible model-stealing research, and use it to simulate distributed attack scenarios. Our empirical evaluation shows that well-established defense mechanisms, such as Protecting Against Deep Neural Network Model Stealing Attacks (PRADA), can be bypassed by basic round-robin query distribution strategies, resulting in a significant reduction in detection performance. Furthermore, we demonstrate that even global aggregation approaches can be rendered operationally useless through adaptive traffic mixing. These results highlight the need for a paradigm shift towards stateful, identity-independent defense architectures in the field of model extraction attacks. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee (IST-224-RSY), held in Bath, United Kingdom, on 12-13 May 2026, where it won the best paper award.

2606.03374 2026-06-03 cs.RO

eMEM: A Hybrid Spatio-Temporal Memory System For Embodied Agents

eMEM:一种面向具身智能体的混合时空记忆系统

A. Haroon Rasheed, Maria Kabtoul

AI总结 提出eMEM混合图记忆系统,通过多索引架构和分层整合管道实现具身智能体在空间、时间和语义上的高效记忆检索,并在ProcTHOR-10K基准测试中达到80.8加权平均分。

详情
AI中文摘要

我们提出eMEM(具身记忆),一种基于混合图的记忆系统,用于在物理环境中运行的具身智能体。当前的智能体记忆架构,如Generative Agents、MemGPT和A-MEM,将记忆视为文本流或知识图谱,但具身智能体需要同时能够按意义、空间和时间进行搜索的记忆。eMEM通过一个统一在单一图模型背后的多索引架构(用于结构化存储的SQLite、用于近似最近邻语义搜索的hnswlib以及用于空间查询的R-tree)填补了这一空白。一个分层整合管道将原始感知观察转化为压缩摘要,模仿生物系统中海马体-新皮层的整合。十个面向智能体的回忆工具暴露了记忆检索原语,包括概念到位置的解析和跨层回忆,作为LLM工具调用的第一类操作。该系统完全嵌入式,与智能体在同一进程中运行。此外,我们引入了eMEM-Bench v1,这是一个我们在ProcTHOR-10K场景上构建的用于具身记忆评估的基准。该基准明确围绕八个认知心理学范式(DRM诱饵、模式分离、模式完成、源监控、上下文依赖检索、长时程干扰、序列位置和增强保留曲线)组织,每个范式都经过选择,使得结果能够对照人类和先前智能体记忆系统的更广泛记忆系统文献进行解释;这是像LoCoMo或OpenEQA这样的表面任务基准无法提供的诊断水平。eMEM在988个探针上获得80.8加权平均分,在模拟延迟从1小时到1年的房间独特项目上保持平稳的保留曲线。我们表明,纯RAG基线(flat_rag消融)在上下文依赖检索上损失30分,在DRM诱饵拒绝上损失29分,分别隔离了多层存储和整合的贡献。我们发布了系统和基准代码。

英文摘要

We present eMEM (Embodied Memory), a hybrid graph-based memory system for embodied agents operating in physical environments. Current agent memory architectures, such as Generative Agents, MemGPT, and A-MEM, treat memory as text streams or knowledge graphs, but embodied agents require memory that is simultaneously searchable by meaning, space, and time. eMEM fills this gap with a multi-index architecture (SQLite for structured storage, hnswlib for approximate nearest-neighbour semantic search, and an R-tree for spatial queries) unified behind a single graph model. A tiered consolidation pipeline transforms raw perceptual observations into compressed summaries, mirroring hippocampal-neocortical consolidation in biological systems. Ten agent-facing recall tools expose memory retrieval primitives, including concept-to-location resolution and cross-layer recall, as first-class operations for LLM tool calling. The system is fully embedded and runs in-process alongside the agent. In addition, we introduce eMEM-Bench v1, a benchmark we construct over ProcTHOR-10K scenes for embodied memory evaluation. The benchmark is organised explicitly around eight cognitive-psychology paradigms (DRM lures, pattern separation, pattern completion, source monitoring, context-dependent retrieval, long-horizon interference, serial position, and a foil-augmented retention curve), each chosen so that the result is interpretable against the broader memory-systems literature in humans and prior agent-memory systems, a level of diagnostic detail that surface-task benchmarks like LoCoMo or OpenEQA cannot provide. eMEM scores a weighted mean of 80.8 over 988 probes, with a flat retention curve at ceiling from 1 h to 1 yr of simulated delay on room-unique items. We show that a pure RAG baseline (the flat_rag ablation) loses 30 pt on context-dependent retrieval and 29 pt on DRM lure rejection, isolating the contribution of multi-layer storage and consolidation respectively. We release both the system and the benchmark code.

2606.03365 2026-06-03 cs.LG

Link Prediction or Perdition: the Seeds of Instability in Knowledge Graph Embeddings

链接预测还是预测失灵:知识图谱嵌入中不稳定的种子

Guillaume Méroué, Fabien Gandon, Pierre Monnin

AI总结 本文系统分析了多种知识图谱嵌入模型在链接预测中的稳定性,发现高性能模型在三元组预测和嵌入空间上存在显著不稳定性,且随机种子、超参数等因素独立引发同等程度的不稳定,投票机制仅能有限提升稳定性。

详情
Comments
Paper accepted at ESWC 2026 (https://2026.eswc-conferences.org)
AI中文摘要

知识图谱嵌入模型(KGEMs)是完成知识图谱的主要链接预测方法。标准评估协议强调基于排名的指标如MRR或Hits@K,但通常忽略随机种子对结果稳定性的影响。此外,这些指标掩盖了个别预测和嵌入空间组织中的潜在不稳定性。在这项工作中,我们对多个数据集上的多种KGEM进行了系统的稳定性分析。我们发现高性能模型实际上在三元组级别产生分歧预测,并具有高度可变的嵌入空间。通过隔离随机因素(即初始化、三元组排序、负采样、dropout、硬件),我们表明每个因素独立地引发相当程度的不稳定性。此外,对于给定模型,具有更好MRR的超参数配置并不能保证更稳定。而且,投票虽然是一种已知的补救机制,但只能提供有限的稳定性增强。这些发现凸显了当前基准测试协议的关键局限性,并引发了对KGEM用于知识图谱补全的可靠性的担忧。

英文摘要

Knowledge graph embedding models (KGEMs) constitute the main link prediction approach to complete knowledge graphs. Standard evaluation protocols emphasize rank-based metrics such as MRR or Hits@K, but usually overlook the influence of random seeds on result stability. Moreover, these metrics conceal potential instabilities in individual predictions and in the organization of embedding spaces. In this work, we conduct a systematic stability analysis of multiple KGEMs across several datasets. We find that high-performance models actually produce divergent predictions at the triple level and highly variable embedding spaces. By isolating stochastic factors (i.e., initialization, triple ordering, negative sampling, dropout, hardware), we show that each independently induces instability of comparable magnitude. Furthermore, for a given model, hyperparameter configurations with better MRR are not guaranteed to be more stable. Voting, although a known remediation mechanism, provides only a limited enhancement of stability. These findings highlight critical limitations of current benchmarking protocols and raise concerns about the reliability of KGEMs for knowledge graph completion.
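To see how aggregate rank metrics can hide triple-level disagreement, consider the small sketch below. MRR and Hits@K follow their standard definitions. The Jaccard overlap between two runs' predicted sets is only one possible stability measure, chosen here for illustration; the abstract does not say which measures the paper uses, and the toy ranks and entity names are invented for the example.

```python
def mrr(ranks):
    """Mean reciprocal rank of the correct entity over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def jaccard_overlap(pred_a, pred_b):
    """Overlap between the prediction sets of two runs (1.0 = identical)."""
    a, b = set(pred_a), set(pred_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical seeds: same ranks in a different order across triples.
ranks_seed0 = [1, 2, 4]
ranks_seed1 = [4, 1, 2]
```

Both seeds yield the same MRR and Hits@1 even though the per-triple ranks differ, which is why a per-prediction comparison such as `jaccard_overlap` can expose instability that the headline metrics conceal.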

2606.03363 2026-06-03 cs.CL

EntSQL: A Benchmark for Grounding Text-to-SQL in Long-Context Enterprise Knowledge

EntSQL:一个将Text-to-SQL置于长上下文企业知识中的基准

Chengxi Liao, Tao Xu, Zulong Chen, Chuanfei Xu, Yiyan Wang, Xinyun Wang, Yanlong Zhang, Xiaojun Chen, Zhibo Yang, Zeyi Wen

AI总结 提出EntSQL基准,通过包含1066个跨五个业务领域的中英文对齐示例,评估LLM在长上下文企业文档中基于私有业务知识生成SQL的能力,最佳系统仅达15.9%准确率。

详情
AI中文摘要

Text-to-SQL使得通过自然语言访问数据库成为可能,最近的LLM显著提升了其能力。现有的基准如Spider、BIRD和Spider 2.0评估了模式泛化、大规模数据库和现实工作流,但很大程度上忽略了SQL生成依赖于私有业务知识(如内部指标、报告惯例和组织规则)的企业场景。我们引入了EntSQL,一个面向企业的Text-to-SQL基准,用于评估在专有业务文档上的长上下文基础。EntSQL包含1066个跨五个业务领域的中英文对齐语义示例,大多数示例需要超越问题和模式的领域知识,并涉及复杂的SQL结构。在英文输入上,当提供长文档时,最佳评估系统仅达到15.9%,突显了在企业知识基础上生成SQL的难度。

英文摘要

Text-to-SQL enables natural language access to databases, and recent LLMs have substantially advanced its capabilities. Existing benchmarks such as Spider, BIRD, and Spider 2.0 evaluate schema generalization, large-scale databases, and realistic workflows, but largely overlook enterprise scenarios where SQL generation depends on private business knowledge, such as internal metrics, reporting conventions, and organizational rules. We introduce EntSQL, an enterprise-oriented Text-to-SQL benchmark for evaluating long-context grounding over proprietary business documents. EntSQL contains 1,066 aligned Chinese-English semantic examples across five business domains, with most examples requiring domain knowledge beyond the question and schema and involving complex SQL structures. On English inputs, the best evaluated system reaches only 15.9% when long-form documents are provided, highlighting the difficulty of grounding SQL generation in enterprise knowledge.

2606.03361 2026-06-03 cs.LG

Mitigating False Credit Propagation: Probabilistic Graphical Reward Aggregation for Rubric-Based Reinforcement Learning

缓解虚假信用传播:基于概率图奖励聚合的准则强化学习

Can Lv, Mingju Chen, Heng Chang, Shiji Zhou

AI总结 针对准则奖励中因忽略准则间依赖关系导致的虚假信用传播问题,提出概率图框架Graphical Event Aggregation for Rubric rewards (GEAR),通过建模潜在伯努利事件和软抑制传播实现依赖感知的奖励聚合,在多个基准上提升性能并减少信用泄漏。

详情
AI中文摘要

基于准则的奖励越来越多地用于开放式语言模型的后训练,但准则级别的分数通常作为独立效用进行聚合。这种扁平标量化忽略了准则间由准则指定的前提和激活关系,使得即使触发奖励或惩罚的条件不存在,奖励或惩罚仍被计入。我们将这种结构性的奖励聚合失败称为虚假信用传播(FCP)。为解决这一局限,我们提出GEAR(Graphical Event Aggregation for Rubric rewards),一种用于依赖感知准则聚合的概率图框架。GEAR将每个准则结果建模为类型化准则图中的潜在伯努利事件,从不受支持的父事件向其子事件传播软抑制,并将结果事件概率聚合为归一化的期望符号效用。这产生了一个线性时间的奖励计算,可以插入到标准的基于准则的RL流程中,而无需改变外部优化算法。在HealthBench、WritingBench和PLawBench上使用两种策略骨干的实验表明,GEAR一致优于扁平聚合和确定性门控,相对于扁平聚合实现了高达15.5%的相对增益。FCP诊断进一步显示,相对于扁平聚合,GEAR减少了96.5%的泄漏,同时保留了比确定性门控更多的许可下游效用。我们的代码在 https://github.com/LvCan926/GEAR 公开。

英文摘要

Rubric-based rewards are increasingly used for open-ended language model post-training, but criterion-level scores are often aggregated as independent utilities. This flat scalarization ignores rubric-specified prerequisite and activation relations among criteria, allowing reward or penalty to be counted even when the condition that licenses it is absent. We call this structural reward-aggregation failure False Credit Propagation (FCP). To address this limitation, we propose GEAR (Graphical Event Aggregation for Rubric rewards), a probabilistic graphical framework for dependency-aware rubric aggregation. GEAR models each criterion outcome as a latent Bernoulli event in a typed rubric graph, propagates soft suppression from unsupported parent events to their children, and aggregates the resulting event probabilities into a normalized expected signed utility. This yields a linear-time reward computation that can be plugged into standard rubric-based RL pipelines without changing the outer optimization algorithm. Experiments on HealthBench, WritingBench, and PLawBench with two policy backbones show that GEAR consistently improves over flat aggregation and deterministic gating, achieving relative gains of up to 15.5% over flat aggregation. FCP diagnostics further show that GEAR reduces leakage by 96.5% relative to flat aggregation while preserving more licensed downstream utility than deterministic gating. Our code is publicly available at https://github.com/LvCan926/GEAR.
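The idea of dependency-aware aggregation can be illustrated with a toy sketch. Everything specific here is an assumption rather than the paper's formulation: each criterion's probability is gated by the product of its parents' gated probabilities (one simple form of soft suppression), and the reward is the weighted sum divided by the total absolute weight. The criterion names are invented.

```python
def aggregate_rubric_reward(probs, weights, parents):
    """Toy dependency-aware rubric aggregation (illustrative assumptions only).

    probs:   dict criterion -> probability the criterion is satisfied,
             listed in topological order (parents before children).
    weights: dict criterion -> signed utility (positive reward, negative penalty).
    parents: dict criterion -> list of prerequisite criteria.
    """
    effective = {}
    for name, p in probs.items():
        gate = 1.0
        for parent in parents.get(name, []):
            gate *= effective[parent]   # an unsupported parent suppresses the child
        effective[name] = p * gate
    total = sum(abs(w) for w in weights.values())
    return sum(weights[n] * effective[n] for n in effective) / total


probs = {"cites_source": 0.0, "source_is_recent": 1.0}
weights = {"cites_source": 1.0, "source_is_recent": 1.0}
parents = {"source_is_recent": ["cites_source"]}
reward = aggregate_rubric_reward(probs, weights, parents)
```

Flat aggregation would award 0.5 here for a "recent source" even though no source was cited. With gating, the child criterion earns nothing because its prerequisite is unsupported. Each node and edge is visited once, so the cost is linear in the size of the rubric graph.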

2606.03359 2026-06-03 cs.SD cs.CL cs.LG

Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection

基于注意力机制的残差连接LSTM网络的语音情感识别

Daniil Krasnoproshin, Maxim Vashkevich

AI总结 提出ResLSTM-SA轻量级架构,在LSTM中集成残差连接和软注意力,在RAVDESS数据集上以46.8k参数达到0.6517 UAR,优于传统基线且适合边缘部署。

详情
Comments
6 pages, 5 figures, DSPA 2026
AI中文摘要

语音情感识别是现代人机交互系统的重要组成部分。然而,许多最先进的方法依赖于具有高计算和内存需求的大型预训练模型,限制了其适用性。本文提出了ResLSTM-SA,一种轻量级架构,在基于LSTM的框架中集成了残差连接和软注意力。在RAVDESS数据集上,在严格的说话人独立划分下进行评估,所提出的模型在未加权平均召回率(UAR)方面优于传统的基于注意力的LSTM基线以及几种先前报道的CNN和混合CNN-LSTM架构。性能最佳的变体(ResLSTM-SA-h64)仅用46.8k可训练参数就达到了0.6517的最大UAR,以比大规模自监督替代方案少三个数量级的参数提供了具有竞争力的准确性,从而能够在边缘设备和实时语音助手上高效部署。源代码可在以下网址获取:https://github.com/Mak-Sim/ResLSTM-SER。

英文摘要

Speech emotion recognition is an important component of modern human-computer interaction systems. However, many state-of-the-art approaches rely on large pretrained models with high computational and memory requirements, limiting their applicability. This paper proposes ResLSTM-SA, a lightweight architecture that integrates residual connections with soft attention within an LSTM-based framework. Evaluated on the RAVDESS dataset under strict speaker-independent partitioning, the proposed model outperforms conventional attention-based LSTM baselines and several previously reported CNN- and hybrid CNN-LSTM architectures in terms of unweighted average recall (UAR). The best-performing variant (ResLSTM-SA-h64) achieves a maximum UAR of 0.6517 with only 46.8k trainable parameters, delivering competitive accuracy with three orders of magnitude fewer parameters than large-scale self-supervised alternatives, thereby enabling efficient deployment on edge devices and real-time voice assistants. The source code is available at https://github.com/Mak-Sim/ResLSTM-SER.
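Unweighted average recall (UAR), the metric reported above, is the mean of per-class recalls, so every emotion class counts equally regardless of how many samples it has. A minimal sketch follows; the label strings in the usage note are made up for illustration.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall, with each class weighted equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)
```

For example, with four "happy" samples and one "sad" sample, a model that always predicts "happy" reaches 80% accuracy but a UAR of only 0.5, because recall on "sad" is zero.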

2606.03358 2026-06-03 cs.LG

The Impact of Temporal Granularity on Socio-Demographic Inference from Household Load Profiles

时间粒度对家庭负荷曲线社会人口推断的影响

Dejan Radovanovic, Maximilian Schirl, Andreas Unterweger, Günther Eibl

AI总结 本文通过分析15分钟到7天不同粒度负荷曲线对8个社会人口属性的预测影响,揭示了隐私-效用权衡中时间分辨率、特征提取和分类器选择的联合作用。

详情
Comments
30 pages, 10 figures, book chapter
AI中文摘要

智能电表数据可以揭示家庭敏感的社会人口特征,引发隐私担忧。虽然这一风险已在固定粒度下得到证实,但时间分辨率在塑造推断性能中的作用尚未得到充分探索。本文通过分析从15分钟到7天不同粒度的负荷曲线如何影响1589户家庭一年数据中八个社会人口属性的可预测性,填补了这一空白。我们引入了一个评估框架,其中分类器在全年数据上训练,但在任意周上测试,迫使模型跨季节和每周变化进行泛化。我们的结果显示了三个主要发现。首先,虽然粗化粒度降低了预测准确性,但出现了两个平台期:性能在15分钟到1小时之间稳定,以及在1到7天之间再次稳定。这揭示了在不牺牲效用的情况下进行数据最小化的机会。其次,可解释的手工特征和tsfresh特征仍然与基于CNN的自编码器嵌入具有竞争力,而XGBoost始终优于其他分类器。第三,特征重要性分析突出了静态和动态属性之间的差异:即使从粗粒度数据中也能推断出住宅面积,而游泳池使用则需要细粒度的时间信号。总体而言,我们的研究为智能计量中的隐私-效用权衡提供了新的见解,显示了时间分辨率、特征提取和分类器选择如何共同影响社会人口推断。

英文摘要

Smart meter data can reveal sensitive socio-demographic characteristics of households, raising privacy concerns. While this risk has been demonstrated at fixed granularities, the role of temporal resolution in shaping inference performance remains insufficiently explored. This paper addresses this gap by analyzing how load profiles with granularities from 15 minutes to 7 days affect the predictability of eight socio-demographic attributes in a dataset of 1,589 households over one year. We introduce an evaluation framework where classifiers are trained on year-round data but tested on arbitrary weeks, forcing generalization across seasonal and weekly variations. Our results show three main findings. First, while coarsening granularity reduces predictive accuracy, two plateaus emerge: performance is stable between 15 minutes and 1 hour, and again between 1 and 7 days. This reveals opportunities for data minimization without sacrificing utility. Second, interpretable handcrafted and tsfresh features remain competitive with CNN-based autoencoder embeddings, while XGBoost consistently outperforms alternative classifiers. Third, feature importance analysis highlights differences between static and dynamic attributes: dwelling size can be inferred even from coarse data, whereas swimming pool usage requires fine-grained temporal signals. Overall, our study provides new insights into the privacy-utility trade-off in smart metering, showing how temporal resolution, feature extraction, and classifier choice jointly influence socio-demographic inference.
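Coarsening the temporal granularity of a load profile can be sketched as block averaging. This is an assumption made for illustration: the abstract does not state whether the study averages power readings or sums energy readings, and the function name and handling of an incomplete final block are hypothetical.

```python
def coarsen(readings, factor):
    """Average consecutive groups of `factor` readings.

    With 15-minute readings, factor=4 gives hourly values and factor=96
    gives daily values. An incomplete final group is dropped.
    """
    n_blocks = len(readings) // factor
    return [
        sum(readings[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]
```

Averaging removes short peaks, such as a brief appliance spike, which is consistent with the finding that some attributes need fine-grained signals while others survive coarse aggregation.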