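arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems science.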

2605.17704 2026-05-19 cs.LG

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

Alon Bebchuk, Nir Shavit

AI summary: This paper studies the lottery ticket hypothesis in early feature space. Using combinatorial toy models, it identifies what winning tickets preserve and suggests that lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.

Abstract

The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, "winning tickets," that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that are already near, at initialization, to the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning ticket is thus a family of compatible code locations that jointly balance proximity to final codes with low inter-feature interference. Sparse retraining often re-expresses the same clause/template family on a different row, so the preserved object is family-level rather than microscopic row identity. We validate this account with lightweight probes based on feature-space distance and motion; in our setting, these probes frequently outperform established weight-based ticket discovery methods in both accuracy and exact code recovery. Although these findings are grounded in a toy setting, they suggest that the lottery ticket structure is governed by hidden feature-space geometry rather than weight-space subnetwork identity.
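
The rewind step described in the abstract can be illustrated with a minimal sketch. It assumes magnitude pruning as the weight-based ticket discovery method, which the abstract does not name; the function names, the flat weight list, and the noise used as a stand-in for training are all hypothetical.

```python
import random


def magnitude_mask(trained, keep_fraction):
    """Return a 0/1 mask that keeps the largest-magnitude trained weights."""
    k = max(1, round(keep_fraction * len(trained)))
    # Indices of the k weights with the largest absolute value after training.
    keep = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)[:k]
    mask = [0] * len(trained)
    for i in keep:
        mask[i] = 1
    return mask


def rewind_ticket(initial, trained, keep_fraction):
    """Apply the mask to the *initial* weights (the rewind step)."""
    mask = magnitude_mask(trained, keep_fraction)
    return mask, [m * w for m, w in zip(mask, initial)]


random.seed(0)
w_init = [random.gauss(0.0, 1.0) for _ in range(16)]
# Hypothetical stand-in for dense SGD training: perturb the initial weights.
w_trained = [w + random.gauss(0.0, 0.5) for w in w_init]
mask, ticket = rewind_ticket(w_init, w_trained, keep_fraction=0.25)
```

The sparse `ticket` would then be retrained in isolation with `mask` held fixed. The paper's own probes operate on feature-space distance and motion rather than on weight magnitudes, so this shows only the baseline procedure the hypothesis refers to.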

2605.17698 2026-05-19 cs.LG cs.MA

Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces

Seth Karten, Cameron Crow, Chi Jin

AI summary: This work proposes Agent Bazaar, a framework for evaluating the economic alignment of multi-agent systems. Analyzing two failure modes (algorithmic instability and Sybil deception), it finds that models largely fail to self-regulate, and it proposes training methods for economic alignment together with the Economic Alignment Score (EAS).

Comments: 17 pages, 9 figures

Abstract

The deployment of Large Language Models (LLMs) as autonomous economic agents introduces systemic risks that extend beyond individual capability failures. As agents transition to directly interacting with marketplaces, their collective behavior can amplify volatility and mask deception at scale. We introduce the Agent Bazaar, a multi-agent simulation framework for evaluating Economic Alignment, the capacity of agentic systems to preserve market stability and integrity. We identify two failure modes: (1) Algorithmic Instability in a B2C market ("The Crash"), where firms amplify price volatility until the market collapses, and (2) Sybil Deception in a C2C market ("The Lemon Market"), where a single deceptive agent controlling multiple coordinated seller identities floods the market with fraudulent listings, eroding trust and consumer welfare. We evaluate frontier and open-weight models across both scenarios and find that models largely fail to self-regulate, with failure severity varying by model rather than by size. We propose economically aligned harnesses, Stabilizing Firms and Skeptical Guardians, that improve outcomes but remain fragile under harder market conditions. To close this gap, we train agents with REINFORCE++ using an adaptive curriculum, producing a 9B model that outperforms all evaluated frontier and open-weight models. We propose the Economic Alignment Score (EAS), a 4-component scalar metric aggregating stability, integrity, welfare, and profitability, enabling direct cross-model comparison. Our results show that economic alignment is orthogonal to general capability and can be directly trained with targeted RL.

2605.17693 2026-05-19 cs.LG cs.AI

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

Yuan Xue, Daniel Kudenko, Megha Khosla

AI summary: This paper proposes DEPPA, a method built on Denoising Diffusion Policy Optimization that fine-tunes a pre-trained pocket-aware diffusion model via reinforcement learning to optimize multiple properties, including binding affinity, drug-likeness, synthesizability, and diversity.

Abstract

Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at https://github.com/xy9485/DePPA.
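
The MDP view in the abstract, with reward evaluated only on the final ligand and a coarse denoising schedule during fine-tuning, can be sketched as follows. This is an illustration under assumptions, not the DEPPA implementation: the evenly spaced schedule, the weighted-sum reward, the property names, and the score-function (REINFORCE-style) surrogate are all hypothetical choices.

```python
def coarse_schedule(num_train_steps, num_rl_steps):
    """Evenly spaced denoising timesteps from the noisiest (T-1) down to 0."""
    if num_rl_steps < 2:
        raise ValueError("need at least two steps")
    last = num_train_steps - 1
    return [round(last * (num_rl_steps - 1 - i) / (num_rl_steps - 1))
            for i in range(num_rl_steps)]


def terminal_reward(properties, weights):
    """Weighted sum of property scores measured on the final ligand only."""
    return sum(weights[name] * value for name, value in properties.items())


def surrogate_loss(step_log_probs, reward, baseline=0.0):
    """Score-function surrogate: every denoising step shares the terminal
    reward, because intermediate noisy states receive no reward of their own."""
    return -(reward - baseline) * sum(step_log_probs)


# Hypothetical numbers: 1000 training timesteps coarsened to 10 RL steps.
timesteps = coarse_schedule(num_train_steps=1000, num_rl_steps=10)
reward = terminal_reward({"affinity": 0.8, "qed": 0.6},
                         {"affinity": 1.0, "qed": 0.5})
loss = surrogate_loss([-1.0, -2.0], reward)
```

Fewer denoising steps per rollout means fewer policy evaluations per generated molecule, which is one plausible reading of why a coarse scheduler makes fine-tuning more efficient.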

2605.17692 2026-05-19 cs.LG math.OC

Exact Convex Reformulations of Linear Neural Networks via Completely Positive Lifting

Karthik Prakhya, Alp Yurtsever

AI summary: This paper gives an exact reformulation of the deep linear neural network training problem as a convex optimization problem in a lifted space over a completely positive cone, with the nonconvexity encoded in the cone constraint, and shows its connection to copositive programming.

Abstract

We show that the training problem of a deep linear neural network under the squared loss admits an exact convex reformulation in a lifted space over a generalized completely positive cone. The reformulation has the same optimal value as the original nonconvex problem and is linear in the lifted variables, with all nonconvexity encoded in the cone constraint. Its ambient lifted dimension depends only on the input and output dimensions, independent of the network depth and the number of data points, and the bottleneck width enters only through scalar constraints. The construction proceeds by reducing the multilayer parameterization to a bilinear factorization, lifting it to a rank-constrained semidefinite program, expressing the rank constraint via a complementarity condition, and applying a completely positive lifting. While the resulting formulation is computationally intractable in general, it gives an exact conic representation of the nonconvexity induced by linear factorization and connects linear neural network training with copositive programming.
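
The first step of the construction, reducing the multilayer parameterization to a bilinear factorization, can be written out as a short worked equation. The notation is an assumption made for illustration and is not taken from the paper: $X$ and $Y$ are the input and target matrices, $W_1,\dots,W_L$ the layer weights, and $r$ the bottleneck (smallest hidden) width.

```latex
\min_{W_1,\dots,W_L} \; \bigl\| W_L \cdots W_1 X - Y \bigr\|_F^2
\;=\;
\min_{U \in \mathbb{R}^{d_{\mathrm{out}} \times r},\; V \in \mathbb{R}^{r \times d_{\mathrm{in}}}}
\; \bigl\| U V X - Y \bigr\|_F^2
\;=\;
\min_{Z \,:\, \operatorname{rank}(Z) \le r} \; \bigl\| Z X - Y \bigr\|_F^2 .
```

The equalities hold because a product of matrices has rank at most its smallest inner dimension, and any matrix of rank at most $r$ can be factored through layers whose widths are all at least $r$. Depth therefore affects the problem only through $r$, which is consistent with the abstract's statement that the lifted dimension is independent of network depth.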

2605.17691 2026-05-19 cs.CL cs.AI

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

M. Mikail Demir, M. Abdullah Canbaz

AI summary: This paper proposes a new evaluation framework that benchmarks modern large language models on an expert-annotated dataset and introduces an Average Severity Error metric to better measure the practical impact of classification errors.

Comments: Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP

DOI: 10.18653/v1/2025.nllp-1.13

Abstract

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.
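
The abstract does not define the Average Severity Error, so the following is only one plausible reading: treat the treatment labels as ordered by severity and average the absolute rank gap between predicted and true labels. The label names, their ordering, and the single-label simplification are hypothetical.

```python
# Hypothetical ordinal severity ranks; the paper's label schema may differ.
SEVERITY = {"neutral": 0, "distinguished": 1, "criticized": 2, "overruled": 3}


def average_severity_error(y_true, y_pred, severity=SEVERITY):
    """Mean absolute gap in severity rank between true and predicted labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    gaps = [abs(severity[t] - severity[p]) for t, p in zip(y_true, y_pred)]
    return sum(gaps) / len(gaps)


# Missing an "overruled" treatment costs 3; a correct "neutral" costs 0.
ase = average_severity_error(["overruled", "neutral"], ["neutral", "neutral"])
```

Unlike plain accuracy, a metric of this shape penalizes confusing "overruled" with "neutral" more heavily than confusing two adjacent categories, which matches the motivation given in the abstract.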

2605.17686 2026-05-19 cs.CV

Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision

Mohamad Yazan Sadoun, Sarah Sharif, Yaser Mike Banad

AI summary: This paper proposes an event-camera vision method based on brain-inspired spike-timing-dependent plasticity (STDP). Three local STDP modules run on a single CPU thread without GPU support, improving label efficiency and detection performance.

Abstract

Deploying event-camera object detectors is constrained by per-frame labeling requirements and GPU compute demands. This work introduces three local spike-timing-dependent plasticity (STDP) modules, including sequence, candidate, and tube-reliability modules, that operate on a single CPU thread without GPU support. On the FRED drone benchmark, the proposed framework spans three label-efficient supervision tiers. A strict zero-label detector achieves 53.8% mAP@30, approximately 26 train-derived bits achieve 76.9% mAP@30, and an STDP candidate-reliability gate achieves 78.60 +/- 0.42% mAP@30. Under acquisition-order drift, the cohort gate outperforms streaming k-means by 2.03 +/- 0.58 percentage points across 20 of 20 positive trials, while a no-drift control falsifies the effect. STDP reduces single-model variance by 6.6 times, and one trained gate matches a 44-seed ensemble bound. The gate transfers to Intel Lava with 89% top-2 agreement. On the EVUAV benchmark, a tube-level STDP layer reduces false alarms from 454 to 331e-4 at Pd >= 88%. Dense gradient-trained detectors cannot provide this combination of gradient training, dense matrix multiplication, and local plasticity-free operation by construction.
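
For readers unfamiliar with STDP, the textbook pair-based update rule is sketched below. It is not the paper's sequence, candidate, or tube-reliability module; the amplitudes, time constants, and weight bounds are hypothetical defaults.

```python
import math


def stdp_delta(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one spike pair, with dt = t_post - t_pre in ms.

    Pre-before-post (dt > 0) strengthens the synapse; post-before-pre
    (dt < 0) weakens it. Both effects decay exponentially with |dt|.
    """
    if dt > 0:
        return a_plus * math.exp(-dt / tau_plus)
    if dt < 0:
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0


def apply_stdp(weight, dt, w_min=0.0, w_max=1.0):
    """Update a weight and clamp it to [w_min, w_max]."""
    return min(w_max, max(w_min, weight + stdp_delta(dt)))


w = apply_stdp(0.5, dt=5.0)  # pre fired 5 ms before post: potentiation
```

The update uses only the spike times of the two neurons a synapse connects, so it needs neither backpropagated gradients nor dense matrix multiplication, which is why rules of this kind can run on a single CPU thread.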

2605.17685 2026-05-19 cs.CV cs.AI cs.CR cs.SY eess.SP eess.SY

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

Islameddine Arioua, Amir Benzaoui, Abdelhafid Zeroual, Lotfi Houam

AI summary: This paper proposes a hybrid framework that combines 1D and 2D CNNs through an attention-guided fusion mechanism to improve the robustness and performance of ECG biometric recognition. Experiments report high identification accuracy on several datasets.

DOI: 10.1016/j.dsp.2026.106252
Journal ref: Digital Signal Processing 2026

Abstract

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.
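
Input-dependent weighting of the two branches can be illustrated with a minimal sketch. It assumes both branches emit feature vectors of equal length and that attention scores come from a learned linear scoring vector per branch; these details, and all names, are hypothetical rather than taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(feat_1d, feat_2d, score_w_1d, score_w_2d):
    """Fuse two same-length feature vectors with input-dependent weights."""
    s1 = sum(f * w for f, w in zip(feat_1d, score_w_1d))
    s2 = sum(f * w for f, w in zip(feat_2d, score_w_2d))
    a1, a2 = softmax([s1, s2])
    fused = [a1 * x + a2 * y for x, y in zip(feat_1d, feat_2d)]
    return fused, (a1, a2)


# With zero scoring vectors both branches get weight 0.5 (plain averaging).
fused, weights = attention_fuse([1.0, 3.0], [3.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

A static fusion strategy would fix the two weights in advance; here they are recomputed from the features of each input, so a noisy raw signal could in principle shift weight toward the time-frequency branch.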

2605.17684 2026-05-19 cs.AI cs.SE

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

Jingni Huang, Peter Bloodsworth

AI summary: This paper proposes EGI, a multimodal emotional AI framework that integrates four selected AI models to monitor, in real time, the emotions unconsciously expressed by Scrum Masters and meeting organizers, improving emotion awareness in team dynamics.

Abstract

While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizers, whose impact on team dynamics is crucial. This paper proposes a novel application integrating four carefully selected and recommended AI models to monitor the unconsciously expressed emotions of these key roles. This is achieved through: real-time transcription using a speech-to-text model; thresholding for intonation analysis to detect emotional cues in prosody; applying emotion-based vocabulary matching to identify sentiment in spoken content; and providing context-aware suggestions containing emotion keywords using an open-source, multi-module AI API. The system achieved an ASR word error rate (WER) of 10% in simulated meeting environments. Our evaluation shows that real-time feedback significantly improves emotion awareness during simulated agile meetings, providing Scrum Masters and meeting organizers with real-time and practical suggestions to help them quickly identify and minimize the expression of negative emotions, fostering more positive and effective team interactions.

2605.17682 2026-05-19 cs.CV

GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning

Cheng Chen, Hao Huang, Saurabh Bagchi

AI summary: This work proposes GEM, a Gaussian Evolution Model for efficient occupancy forecasting and motion planning, addressing the limits of existing methods in temporal flexibility, scene evolution, and matching continuous-time dynamics.

Abstract

Future 3D semantic occupancy forecasting and motion planning are central to autonomous driving, as they require models to reason about how surrounding scenes evolve and how the ego vehicle should act. Existing occupancy world models commonly discretize scenes into latent embeddings, volumetric features, or quantized tokens, and forecast future states through fixed-step autoregressive generation. This limits temporal flexibility, obscures scene evolution, accumulates errors over long horizons, and poorly matches the continuous-time dynamics of real driving scenes. We propose GEM, a Gaussian Evolution Model for non-autoregressive occupancy world modeling, where driving scenes are represented as explicit continuous 4D Gaussian primitives with learned dynamics. Instead of rolling out future occupancy states step by step, GEM directly queries the Gaussian world representation at arbitrary timestamps and splats the corresponding conditional 3D Gaussians into semantic occupancy volumes. This enables efficient forecasting over the full horizon while retaining a compact and interpretable scene representation. By decoupling spatial geometry, temporal support, and primitive motion, GEM makes the predicted world easier to inspect, as each primitive's evolution can be followed continuously over time. The same representation also supports motion planning by predicting future ego trajectories from the learned Gaussian world. Extensive experiments show that GEM achieves state-of-the-art future semantic occupancy forecasting and strong motion planning performance, while providing flexible temporal querying.
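
Querying a 4D Gaussian at an arbitrary timestamp amounts to standard Gaussian conditioning, sketched below for one primitive over (x, y, z, t). The parameter layout and the use of the unnormalized temporal marginal as a weight are assumptions; the abstract does not specify how GEM parameterizes its primitives or their learned dynamics.

```python
import math


def condition_on_time(mean, cov, t):
    """Condition a 4D Gaussian over (x, y, z, t) on a query time t.

    Returns the conditional 3D mean, the conditional 3x3 covariance, and an
    unnormalized temporal weight that fades away from the primitive's
    temporal center.
    """
    mu_t = mean[3]
    var_t = cov[3][3]
    cross = [cov[i][3] for i in range(3)]  # space-time covariance column
    dt = t - mu_t
    cond_mean = [mean[i] + cross[i] * dt / var_t for i in range(3)]
    cond_cov = [[cov[i][j] - cross[i] * cross[j] / var_t for j in range(3)]
                for i in range(3)]
    weight = math.exp(-0.5 * dt * dt / var_t)
    return cond_mean, cond_cov, weight


# Hypothetical primitive centered at the origin at t = 1, drifting along x.
mean = [0.0, 0.0, 0.0, 1.0]
cov = [[1.0, 0.0, 0.0, 0.5],
       [0.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.5, 0.0, 0.0, 1.0]]
pos, shape, weight = condition_on_time(mean, cov, t=3.0)
```

Because `t` is a continuous argument, the same primitive can be evaluated at any timestamp without stepping through intermediate states, which is the non-autoregressive property the abstract emphasizes.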

2605.17681 2026-05-19 cs.RO

PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots

Jiarong Kang, Kunzhao Ren, Tao Pang, Xiaobin Xiong

AI summary: This work proposes PRIME, which combines differentiable contact dynamics with smoothed complementarity constraints to obtain physically consistent motion trajectories and inertial parameter estimates from onboard sensor data, improving the accuracy of robot motion estimation.

Comments: Robotics: Science and Systems 2026

Abstract

Humanoid and legged robots interact with the environment through intermittent contacts, making accurate motion estimation fundamentally dependent on reasoning about contact dynamics. However, standard sensing pipelines, whether based on onboard proprioception with Extended Kalman Filters (EKFs) or external motion capture systems, recover only kinematics, while contact forces, contact timing, and inertial parameters remain unobserved. As a result, purely kinematic reconstructions often violate rigid-body dynamics, particularly during contact-rich motions. To enable accurate motion estimation from onboard kinematics in real-world deployment, we propose PRIME (Physically-consistent Robotic Inertial and Motion Estimation), a Maximum A Posteriori (MAP) formulation that refines measured kinematics and actuator commands into a dynamically consistent trajectory while jointly estimating frictional contact forces and physically consistent inertial parameters. Our approach incorporates differentiable contact dynamics with smoothed complementarity constraints and an Anitescu-style friction model, yielding a smooth optimization problem that remains tractable across versatile contact transitions. We evaluate PRIME on contact-rich locomotion with quadrupedal robots and the Unitree G1 humanoid, demonstrating improved trajectory consistency and accurate inertial parameter identification. Beyond improving state estimation and feedback control with calibrated inertial parameters, PRIME produces force- and contact-annotated motion reconstructions from real robots in deployment, which can be used to provide high-quality data for downstream learning applications, including large-scale behavior modeling and robot foundation models.

2605.17673 2026-05-19 cs.CV

A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures

E. O. Rodrigues, T. M. Porcino, Aura Conci, Aristofanes C. Silva

AI summary: This paper proposes a simple finger-knuckle print recognition method that uses a Sobel-based edge filter, noise reduction, and similarity measures, producing binary images that are efficient to process and store. Experiments report a recognition rate of up to 17.02% on a large dataset.

DOI: 10.1109/IWSSIP.2016.7502723
Journal ref: 2016 International Conference on Systems, Signals and Image Processing (IWSSIP)

Abstract

The objective of this work is to propose a novel methodology for finger-knuckle print recognition, where a finger-knuckle print is essentially a digital photo of the finger-knuckle region. We employ very simple concepts of visual computing, such as a filter based on the Sobel operator for finding edges and a simple noise reduction algorithm. These operations are exceptionally fast and produce binary images, which are very efficient to process and to store. Alongside this preprocessing, several similarity measures were considered and evaluated for the task. After preprocessing, an input finger image is compared to all the finger images in the dataset, one by one. We obtained up to 17.02% successful recognitions (true positive rate) on a large dataset.
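
The pipeline in the abstract (Sobel edges, binarization, then one-by-one comparison against the dataset) can be sketched in a few lines. The threshold value, the choice of Jaccard overlap as the similarity measure, and the function names are assumptions for illustration; the noise reduction step is omitted because the abstract does not describe it.

```python
def sobel_binary(img, threshold):
    """Binary edge map: 1 where the Sobel gradient magnitude >= threshold."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2 * img[y][x+1] + img[y+1][x+1]) \
               - (img[y-1][x-1] + 2 * img[y][x-1] + img[y+1][x-1])
            gy = (img[y+1][x-1] + 2 * img[y+1][x] + img[y+1][x+1]) \
               - (img[y-1][x-1] + 2 * img[y-1][x] + img[y-1][x+1])
            if (gx * gx + gy * gy) ** 0.5 >= threshold:
                out[y][x] = 1
    return out


def jaccard(a, b):
    """Overlap of two binary images: |A and B| / |A or B|."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += pa & pb
            union += pa | pb
    return inter / union if union else 1.0


def identify(query, gallery, threshold=100):
    """Compare the query with every gallery image and return the best match."""
    q = sobel_binary(query, threshold)
    return max(gallery,
               key=lambda name: jaccard(q, sobel_binary(gallery[name], threshold)))


# Two toy 5x5 grayscale images: a vertical edge and a horizontal edge.
img_v = [[0, 0, 255, 255, 255] for _ in range(5)]
img_h = [[0] * 5, [0] * 5, [255] * 5, [255] * 5, [255] * 5]
best = identify(img_v, {"subject_a": img_v, "subject_b": img_h})
```

Binary edge maps can be packed one bit per pixel and compared with bitwise operations, which is what makes this kind of pipeline cheap to store and fast to match.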

2605.17672 2026-05-19 cs.CL

Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Dehai Min, Giovanni Vaccarino, Huiyi Chen, Yongliang Wu, Gal Yona, Lu Cheng

AI summary: This paper proposes PUMA, a framework that identifies reasoning-level semantic redundancy to enable semantic-preserving early exit, reducing token consumption while maintaining accuracy and the integrity of the reasoning chain.

Comments: under review

Abstract

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.
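
The two-stage decision described in the abstract, where a redundancy detector proposes an exit and answer-level verification confirms it, can be sketched as below. Word-overlap similarity is a crude stand-in for the paper's Redundancy Detector, and the `verify` callback, threshold, and function names are hypothetical.

```python
def word_overlap(a, b):
    """Jaccard overlap of the word sets of two reasoning steps."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def find_exit(steps, verify, sim_threshold=0.6):
    """Return the index of the last reasoning step to keep (inclusive).

    A step is a candidate exit when it closely repeats an earlier step;
    the exit is taken only if `verify` accepts the retained prefix.
    """
    for i in range(1, len(steps)):
        redundant = any(word_overlap(steps[i], prev) >= sim_threshold
                        for prev in steps[:i])
        if redundant and verify(steps[:i]):
            return i - 1
    return len(steps) - 1


steps = [
    "compute 3 times 4 equals 12",
    "add 5 to get 17",
    "so the answer is 17",
    "the answer is 17 so",  # restates the previous step, adds nothing new
]
# Hypothetical verifier: the retained prefix must already state the answer.
last_kept = find_exit(steps, verify=lambda prefix: "17" in prefix[-1])
```

Requiring both signals is the point of the design: redundancy alone could cut a chain that repeats itself before reaching an answer, and answer readiness alone could cut before self-correction finishes.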

2605.17671 2026-05-19 cs.LG cs.AI

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

Michael Arbel, Basile Terver, Jean Ponce

AI summary: This paper proposes PEIRA, a non-contrastive self-supervised learning method with an explicit objective based on aligning linear regressors, and supports it with theoretical analysis and experiments on ImageNet-1K and CIFAR-10.

Abstract

Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.

2605.17669 2026-05-19 cs.AI

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

Yang Zhang, Nada Mimouni, Jean-Claude Moissinac, Fayçal Hamdi

AI summary: This paper proposes a multimodal approach that uses language and vision models to extend cultural heritage knowledge graphs. It builds the multimodal knowledge graph WJoconde and an evaluation benchmark, improving the efficiency and reliability of knowledge graph extension.

Abstract

The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast amounts of data. However, the construction and expansion of these KGs often face challenges due to the diverse and complex nature of cultural heritage information. In this paper, we propose a novel approach for extending KG resources in the domain of cultural heritage, which we applied to French data. First, we introduce a new knowledge graph in the domain of French cultural heritage, WJoconde, which is distinguished by its multimodality as it integrates both textual and image information of the entities. We further introduce three variants of WJoconde to facilitate downstream research, such as Knowledge Graph Completion (KGC). We also built a comprehensive benchmark for KGC methods on our dataset. Second, we propose a new framework for extending cultural heritage KGs using multi-modal approaches leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), which includes automated data extraction from unstructured resources combined with a special validation pipeline for grounding the output of both models, to further extend WJoconde. Our results show that by integrating the rich text and image information in cultural heritage data, we can efficiently enhance KGs with high reliability. We open-source all code and benchmark datasets with text and images, as well as the original data with an interactive access point.

2605.17668 2026-05-19 cs.CV

Deep learning-based compression of giga-resolution whole slide images

Maren Høibø, Etienne Gaucher, Ingerid Reinertsen, Marit Valla, Erik Smistad

AI summary: This paper studies deep learning-based compression of whole slide images, comparing deep learning methods with conventional codecs (JPEG, JPEG-2000, JPEG-XL) for glass removal and compression. Deep learning compression reduces file size more effectively but has longer decompression times.

Abstract

Implementation of digital pathology leads to an increased number of whole slide images (WSIs). The large size of WSIs is challenging. Today, WSIs are compressed with codecs like JPEG resulting in several gigabytes per WSI, and large amounts of space are wasted storing glass. In this study, deep learning-based tissue segmentation for glass removal, and deep learning compression methods were explored and compared with JPEG, JPEG-2000 and JPEG-XL. Image pyramids (N=21) with intact glass, glass replaced by single-colored pixels, and glass replaced by zero-byte tiles were created and compressed with JPEG, JPEG-XL and a deep learning model. Additionally, several compression models were evaluated on a tissue patch dataset and compared with JPEG, JPEG-2000 and JPEG-XL. Removing glass reduced file sizes considerably for JPEG and JPEG-XL. Deep learning-based image compression reduced the WSI size by 43-72% compared to JPEG compression, whereas deep learning-based glass removal reduced the WSI size by 0.3-33%, and 6-62% using only single-colored pixels and removing all-glass tiles, respectively. Combining the two gave a small improvement to a 44-80% total size reduction which indicates that deep learning-based image compression is able to efficiently compress glass tiles, whereas JPEG is not. On the tissue patch dataset, the best deep learning-based compression models saved on average ~35-40% per patch compared to JPEG, while keeping an average SSIM above 0.95, whereas JPEG-XL and JPEG-2000 saved 17% and 14%, respectively while keeping an SSIM of 0.96. However, the deep learning models had higher decompression times than JPEG and JPEG-XL.

2605.17661 2026-05-19 cs.RO cs.CV

Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping

U. V. B. L. Udugama, George Vosselman, Francesco Nex

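AI summary: This paper presents Mono-Hydra++, a real-time monocular RGB plus IMU pipeline for indoor metric-semantic mapping and hierarchical 3D scene graph construction. It combines the M2H-MX multi-task model with a deep-feature visual-inertial odometry front end, enabling real-time mapping and scene graph construction on resource-constrained robotic platforms without active depth sensors.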

Comments Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. 50 pages, figures and tables included. Code: https://github.com/BavanthaU/mono-hydra-pp.git

Abstract

Autonomous agile robots need more than metric geometry: they must understand objects, rooms, places, and spatial relations for search, inspection, exploration, and human robot interaction. Conventional metric maps support localization and collision avoidance, but do not provide this semantic and relational structure. 3D scene graphs address this gap by connecting geometry with object level and room level understanding. Building such representations on agile platforms remains difficult because aerial and lightweight robots operate under strict payload, power, and compute limits, making RGB-D cameras and LiDAR sensors impractical for many onboard settings. We present Mono-Hydra++, a real time monocular RGB plus IMU pipeline for indoor metric semantic mapping and hierarchical 3D scene graph construction. The system combines M2H-MX, a DINOv3 based multi-task model for depth and semantics, with a deep feature visual inertial odometry front end, sparse predicted depth constraints in the VIO derived pose graph, semantic masking for dynamic regions, and pose aware temporal alignment before volumetric fusion in the Mono-Hydra backend. On the Go-SLAM ScanNet evaluation subset, Mono-Hydra++ achieves 1.6% lower average trajectory error than the strongest RGB-D baseline in our comparison, while using only monocular RGB plus IMU input. On calibrated 7-Scenes, it improves average ATE by 29.8% over the strongest competing calibrated baseline. We further validate Mono-Hydra++ in a real ITC building deployment using RealSense RGB plus IMU and demonstrate embedded feasibility by deploying the ONNX/TensorRT FP16 M2H-MX-L perception model at 25.53 FPS on a Jetson Orin NX 16GB. These results show that Mono-Hydra++ can provide real time metric semantic mapping and scene graph construction for resource constrained robotic platforms without relying on active depth sensors.

2605.17658 2026-05-19 cs.LG

When a Zero-Shooter Cheats: Improving Age Estimation via Activation Steering

Erik Imgrund, Pia Hanfeld, Klim Kireev, Konrad Rieck

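AI summary: This paper studies the "identity shortcut" that arises in zero-shot age estimation with vision-language models and proposes an activation steering method that improves age estimation accuracy, reducing mean absolute error by up to 25%.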

Abstract

Different age-related regulations have been proposed to protect minors from harmful content and interactions online. Automated age estimation is central to enforcing such regulations, and vision-language models (VLMs) achieve state-of-the-art performance on this task. However, we find that the zero-shot nature of VLM-based age estimation produces an unexpected side effect we call the identity shortcut: Instead of estimating age from visual features, VLMs tend to identify the depicted person and infer their age from memorized knowledge. This phenomenon leads to substantially incorrect predictions when non-celebrities are misidentified as celebrities. It also produces deceptively high robustness to noise and adversarial perturbations on celebrity images, which dominate popular benchmarks. To mitigate this, we propose an activation steering method that suppresses the shortcut by intervening on the hidden states of the VLM. This method improves age estimation accuracy for both memorized and unseen identities, reducing mean absolute error by up to 25% across popular benchmarks.
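
One common way to intervene on hidden states is to remove their component along a learned direction. The sketch below illustrates that general idea only; the abstract does not say how the steering direction is obtained or applied, so the mean-difference direction, the `alpha` strength, and all function names here are assumptions made for this example.

```python
import math


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def mean_difference_direction(with_shortcut, without_shortcut):
    """Hypothetical steering direction: difference of mean hidden states
    between inputs that trigger the shortcut and inputs that do not."""
    dim = len(with_shortcut[0])
    mean_a = [sum(h[i] for h in with_shortcut) / len(with_shortcut) for i in range(dim)]
    mean_b = [sum(h[i] for h in without_shortcut) / len(without_shortcut) for i in range(dim)]
    return unit([a - b for a, b in zip(mean_a, mean_b)])


def steer(hidden, direction, alpha=1.0):
    """Subtract alpha times the projection of `hidden` onto `direction`.
    With alpha = 1 the component along the direction is removed entirely."""
    projection = sum(h * d for h, d in zip(hidden, direction))
    return [h - alpha * projection * d for h, d in zip(hidden, direction)]
```

In a real model the same operation would be applied to the hidden states of selected layers during the forward pass; which layers and which strength work best would have to be determined empirically.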

2605.17653 2026-05-19 cs.LG cs.AI

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

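AI summary: This paper presents LLMForge, a multi-backend hardware-aware neural architecture search framework. Infinite-Head Attention expands the per-layer attention configuration space, and Forge-Former together with Forge-DSE makes architecture search for edge language models efficient. The searches yield differently shaped architectures on different hardware substrates, with variants optimized for different performance metrics.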

Abstract

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.
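
The Pareto front mentioned in this abstract is the set of candidates that no other candidate beats on every objective at once. A minimal sketch of that filter, assuming all objectives are minimized and using made-up (loss, energy) pairs rather than any numbers from the paper:

```python
def dominates(a, b):
    """True if candidate `a` is at least as good as `b` on every objective
    and strictly better on at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the candidates that no other candidate dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (validation loss, energy per token) pairs for four architectures.
candidates = [(1.0, 9.0), (2.0, 4.0), (3.0, 6.0), (4.0, 3.0)]
front = pareto_front(candidates)
```

NSGA-II builds on this non-dominated sorting step and adds crowding-distance selection; that part is omitted here.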

2605.17652 2026-05-19 cs.CL

Beyond Transcripts: Iterative Peer-Editing with Audio Unlocks High-Quality Human Summaries of Conversational Speech

Kaavya Chaparala, Thomas Thebaud, Jesús Villalba López, Laureano Moro-Velazquez, Peter Viechnicki, Najim Dehak

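AI summary: This study compares the quality of human summaries written from audio and from transcripts and examines how peer-editing improves speech summaries. Audio-based summaries are less informative and more compressed than transcript-based ones, but iterative peer-editing with audio closes the gap.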

Comments Accepted in LREC 2026

DOI: 10.63317/4d596vd4x2xr

Abstract

There are not enough established benchmarks for the task of speech summarization. Creating new benchmarks demands human annotation, as LLMs could embed systemic errors and bias into datasets. We test ten annotation workflows varying input modality (audio, transcript, or both) and the inclusion of editing (self or peer-editing) to investigate potential quality tradeoffs from using human annotators to summarize audio. We compare human audio-based summaries to human transcript-based summaries to track the impact of the different information modalities on summary quality. We also compare the human outputs against four LLM benchmarks (three text, one audio) to examine whether human-written summaries are less informative than highly fluent automated outputs. We find that audio-based summaries are less informative and more compressed than transcript summaries. However, iterative peer-editing with audio mitigates this difference, enabling audio-based summaries to be as informative as their transcript counterparts and LLM summaries. These findings validate iterative peer-editing among human annotators for the creation of benchmarks informed by both lexical and prosodic information. This enables crucial dataset collection even in settings where transcripts are unavailable.

2605.17651 2026-05-19 cs.LG

Counterfactual Explanations Under Concept Drift

Marcin Kostrzewa, Jerzy Stefanowski, Maciej Zięba

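AI summary: This paper studies how to keep counterfactual explanations valid when the data changes over time and proposes a lightweight update scheme that repairs existing explanations while keeping them close to the original instance.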

2605.17648 2026-05-19 cs.AI

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li

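AI summary: This paper proposes SAPO, a step-aligned policy optimization method that addresses the training instability caused by sparse exact-match feedback in generative recommendation and improves the training of reasoning-based generative recommenders.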

Abstract

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically using an outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, the outcome reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.
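
As a rough illustration of the credit-assignment idea in this abstract, the sketch below contrasts a single outcome-level advantage broadcast over the whole response with a separate group-relative advantage per reasoning step. The per-step 0/1 rewards, the standard-deviation normalization, and the function names are assumptions made for this example; the abstract does not specify them.

```python
from statistics import mean, pstdev


def group_relative(values, eps=1e-6):
    """Normalize rewards within a group of sampled responses."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def outcome_advantages(step_rewards):
    """Baseline: one exact-match reward per response, broadcast to all steps."""
    outcome = [1.0 if all(r == 1.0 for r in resp) else 0.0 for resp in step_rewards]
    advantages = group_relative(outcome)
    return [[a] * len(resp) for a, resp in zip(advantages, step_rewards)]


def step_aligned_advantages(step_rewards):
    """Step-aligned variant: normalize each reasoning step across the group,
    so a response is only penalized at the steps where it went wrong.
    step_rewards[g][t] is the (assumed) reward of response g at step t."""
    num_steps = len(step_rewards[0])
    per_step = [group_relative([resp[t] for resp in step_rewards]) for t in range(num_steps)]
    return [[per_step[t][g] for t in range(num_steps)] for g in range(len(step_rewards))]


# Four sampled responses, three SID tokens each (1.0 = token matches the target).
rewards = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
broadcast = outcome_advantages(rewards)
aligned = step_aligned_advantages(rewards)
```

In this toy group every response gets the first token right, so the step-aligned advantage at step 0 is zero for all of them, whereas the broadcast version assigns a negative advantage to that correct step in each failed response.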

2605.17642 2026-05-19 cs.LG

TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates

Meysam Alishahi, Yan Zheng, Junpeng Wang, Chin-Chia Michael Yeh, Jeff M. Phillips

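AI summary: This paper proposes a tabular data generation method based on kernel density estimates that nearly matches the accuracy and leakage avoidance of existing methods with almost no training time and scales to very large datasets.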

Abstract

Tabular data generation considers a large table with multiple columns -- each column comprised of numerical, categorical, or sometimes ordinal values. The goal is to produce new rows for the table that replicate the distribution of rows from the original data -- without just copying those initial rows. The last 4 years have seen enormous progress on this problem, mostly using computationally expensive methods that employ one-hot encoding, VAEs, and diffusion. This paper describes a new approach to the problem of tabular data generation. By employing copula transformations and modeling the distribution as a kernel density estimate we can nearly match the accuracy and leakage-avoidance achievements of the previous methods, but with almost no training time. Our method is very scalable, and can be run on data sets orders of magnitude larger than prior state-of-the-art on a simple laptop. Moreover, because we employ kernel density estimates, we can store the model as a coreset of the original data -- we believe the first for generative modeling -- and as a result, require significantly less space as well. Our code is available here: https://github.com/tabkde/tabkde-main
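
To make the combination of a copula transformation and a kernel density estimate concrete, here is a minimal sketch for numeric columns only. It is not the TabKDE implementation: the rank-based normal scores, the Gaussian kernel, the fixed bandwidth, and the function names are all assumptions for illustration, and categorical columns and coresets are left out.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def to_normal_scores(column):
    """Copula-style transform: replace each value by the standard normal
    quantile of its rank, so every column has the same marginal shape."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def from_normal_score(z, sorted_column):
    """Map a latent value back through the empirical quantiles of a column."""
    u = _STD_NORMAL.cdf(z)
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_rows(table, num_samples, bandwidth=0.1, seed=0):
    """Sample from a Gaussian KDE in latent space: pick a stored row at
    random, add kernel noise, then invert the per-column transform."""
    rng = random.Random(seed)
    columns = list(zip(*table))
    latent_rows = list(zip(*(to_normal_scores(col) for col in columns)))
    sorted_columns = [sorted(col) for col in columns]
    samples = []
    for _ in range(num_samples):
        center = rng.choice(latent_rows)
        noisy = [z + rng.gauss(0.0, bandwidth) for z in center]
        samples.append(tuple(from_normal_score(z, col) for z, col in zip(noisy, sorted_columns)))
    return samples
```

There is no training loop: the "model" is just the stored latent rows, which is why a KDE-based generator can be fit almost instantly and, in principle, compressed by keeping only a representative subset of rows.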

2605.17641 2026-05-19 cs.AI cs.CL

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

Saksham Sahai Srivastava

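AI summary: This paper proposes Causal Memory Intervention (CMI), which selects long-term memories for LLM agents by estimating their causal effect on the answer, improving answer quality and robustness, and introduces the Causal-LoCoMo benchmark for evaluation.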

Comments 12 pages, 3 figures, 3 tables

Abstract

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.
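
The core idea of selecting memories by their effect on the answer, rather than by similarity, can be sketched as a simple intervention loop. This is an illustrative reading of the abstract, not the CMI algorithm itself: the single-memory intervention, the `answer_quality` callback, and the zero threshold are assumptions.

```python
def causal_effects(candidates, answer_quality):
    """Estimate each memory's effect as the change in answer quality when
    that memory alone is added to an otherwise empty context."""
    baseline = answer_quality([])
    return {memory: answer_quality([memory]) - baseline for memory in candidates}


def select_memories(candidates, answer_quality, min_effect=0.0):
    """Keep memories whose estimated effect is strictly positive."""
    effects = causal_effects(candidates, answer_quality)
    return [memory for memory in candidates if effects[memory] > min_effect]


def toy_quality(memories):
    """Stand-in for scoring a model answer; a real scorer would call the LLM."""
    score = 1.0
    if "user moved to Berlin in 2024" in memories:
        score += 1.0  # useful: supports the correct answer
    if "user lives in Paris" in memories:
        score -= 1.0  # stale: pushes the model toward a wrong answer
    return score


bank = ["user moved to Berlin in 2024", "user lives in Paris", "user likes tea"]
selected = select_memories(bank, toy_quality)
```

Both location memories would look relevant to a similarity-based retriever; only the intervention separates the helpful one from the stale one.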

2605.17639 2026-05-19 cs.CL cs.IR

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

Volodymyr Ovcharov

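AI summary: This paper tests the assumption that co-citation structure is a stable retrieval signal in legal information systems. Using the UA-StatuteRetrieval benchmark, built from 396 million citations over 20 years, it finds that Adamic-Adar MRR declines 33% on a fixed set of articles and 47% under a train/test temporal split, confirming genuine temporal decay rather than compositional shift or an evaluation artifact.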

Comments 12 pages, 8 figures, 4 tables. Dataset: https://huggingface.co/datasets/overthelex/ua-statute-retrieval

Abstract

Co-citation structure is widely assumed to provide stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27), confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (>100K citations) resist decay, but mid-frequency articles (1K-10K) -- the practical retrieval frontier -- lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at https://huggingface.co/datasets/overthelex/ua-statute-retrieval.
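
For readers unfamiliar with the metric, the sketch below shows one plausible way to compute Adamic-Adar scores on a bipartite decision-article graph and evaluate them with leave-one-out MRR. The weighting by the number of articles each shared decision cites, the tie-breaking, and the function names are assumptions for this example; the benchmark's exact protocol may differ.

```python
import math


def build_cited_by(cites):
    """Invert a decision -> articles mapping into article -> decisions."""
    cited_by = {}
    for decision, articles in cites.items():
        for article in articles:
            cited_by.setdefault(article, set()).add(decision)
    return cited_by


def adamic_adar(article_a, article_b, cited_by, cites):
    """Sum 1 / log(degree) over decisions that cite both articles."""
    shared = cited_by[article_a] & cited_by[article_b]
    return sum(1.0 / math.log(len(cites[d])) for d in shared if len(cites[d]) > 1)


def reciprocal_rank(ranked, target):
    """1 / rank of the target in a ranked list, or 0 if it is absent."""
    for position, item in enumerate(ranked, start=1):
        if item == target:
            return 1.0 / position
    return 0.0


def leave_one_out_mrr(cites):
    """Hide one cited article per query and rank candidates by their total
    Adamic-Adar score to the articles that remain in the same decision."""
    ranks = []
    for held_decision, articles in cites.items():
        if len(articles) < 2:
            continue
        rest = {d: a for d, a in cites.items() if d != held_decision}
        cited_by = build_cited_by(rest)
        for target in articles:
            context = articles - {target}
            candidates = set(cited_by) - context
            scores = {
                c: sum(adamic_adar(c, k, cited_by, rest) for k in context if k in cited_by)
                for c in candidates
            }
            ranked = sorted(candidates, key=lambda c: (-scores[c], c))
            ranks.append(reciprocal_rank(ranked, target))
    return sum(ranks) / len(ranks) if ranks else 0.0
```

This brute-force version recomputes the graph for every held-out decision, which is fine for a toy graph but would need incremental updates at the scale the abstract describes.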

2605.17638 2026-05-19 cs.CV

TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts

Sophokles Ktistakis, Rui Wang, Bastian Grande, Hugo Sax

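AI summary: This paper presents TouchMap-OR, a multi-view RGB-D vision system for identity-resolved reconstruction of hand-surface contacts in operating rooms. It fuses multi-view hand reconstructions with tracked clinicians to obtain consistent hand trajectories and builds a semantic 3D model of the operating room so that trajectories can be mapped to specific surfaces and contact times and locations inferred.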

Abstract

Hand-surface interactions between clinicians, patients, and medical equipment play a central role in pathogen transmission during medical procedures. However, these interactions remain largely unobserved, as current infection-prevention practices rely on manual observation and cannot reconstruct detailed contact histories. In this work we formulate the problem of identity-resolved hand-surface interaction reconstruction in operating rooms and introduce TouchMap-OR, a multi-view RGB-D vision system that models clinicians, articulated hand geometry, and the semantic structure of the clinical environment to infer when and where contacts occur. The system reconstructs globally consistent multi-person 3D skeleton tracks across cameras while estimating articulated MANO hand meshes from RGB observations aligned to depth data. Multi-view hand reconstructions are fused and associated with tracked clinicians to obtain consistent left and right hand trajectories. A semantic 3D model of the operating room is built from multi-view segmentation and depth fusion, enabling reconstructed hand trajectories to be mapped to specific surfaces, including medical equipment, movable objects, and patient body sites. Temporal hand-surface proximity is used to infer contact episodes describing which clinician touched which surface and when. We evaluate TouchMap-OR on recordings from three real anesthesia inductions with manually annotated contact events. TouchMap-OR achieves 0.75 binary contact F1, outperforming tracking-based baselines while maintaining comparable multi-person tracking accuracy and achieving 0.96 identity attribution accuracy.
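
The step from temporal hand-surface proximity to contact episodes can be pictured as thresholding a per-frame distance signal and keeping runs that last long enough. The sketch below assumes a distance threshold in meters and a minimum duration in frames; both values and the function name are hypothetical, and the system's actual inference rule is not given in the abstract.

```python
def contact_episodes(distances, threshold=0.02, min_frames=3):
    """Return (first_frame, last_frame) pairs for runs where the hand stays
    closer than `threshold` to a surface for at least `min_frames` frames."""
    episodes = []
    start = None
    for frame, distance in enumerate(distances):
        if distance < threshold:
            if start is None:
                start = frame  # a new candidate episode begins
        else:
            if start is not None and frame - start >= min_frames:
                episodes.append((start, frame - 1))
            start = None
    # Close an episode that is still open when the recording ends.
    if start is not None and len(distances) - start >= min_frames:
        episodes.append((start, len(distances) - 1))
    return episodes


# Distances (in meters) of one hand to one surface over ten frames.
signal = [0.10, 0.01, 0.01, 0.01, 0.10, 0.01, 0.10, 0.01, 0.01, 0.01]
episodes = contact_episodes(signal)
```

The minimum-duration filter drops the single-frame dip at frame 5, which is the kind of noise a raw threshold would otherwise report as a touch.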

2605.17633 2026-05-19 cs.CV cs.AI

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

Hoai-Chau Tran, Chi H. Nguyen, Duy M. H. Nguyen, Mathias Niepert, Fan Lai, Khoa D. Doan

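AI summary: This paper proposes SparseSAM, a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity, speeding up inference and reducing memory use with little loss in segmentation quality.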

Abstract

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the token length to process, yet introduce non-trivial runtime overhead and encounter catastrophic quality drop under high compression. Other methods applying Sparse Attention focus on attention alone, leaving the MLP fully dense and capping achievable speedup. We propose SparseSAM, a (i) training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces a (iii) Residual-Consistency MLP that routes only informative tokens through the MLP while propagating remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a 0.4 density and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token merging advances, while achieving 2x faster inference and 2.8x memory reduction.
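
A Z-order (Morton) permutation reorders a 2D grid of tokens so that tokens that are close in the image tend to be close in the sequence, which is what makes fixed block patterns useful. The sketch below shows only that generic permutation for a row-major token grid; how Stripe-Sort Attention builds its sparse pattern on top of it is not described in the abstract, and the function names are assumptions.

```python
def morton_code(row, col, bits=16):
    """Interleave the bits of (row, col): column bits go to even positions,
    row bits to odd positions."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


def z_order_permutation(height, width):
    """Row-major token indices sorted by the Morton code of their position."""
    positions = [(r, c) for r in range(height) for c in range(width)]
    return sorted(range(len(positions)), key=lambda i: morton_code(*positions[i]))


def inverse_permutation(order):
    """Permutation that restores the original token order."""
    inverse = [0] * len(order)
    for new_index, old_index in enumerate(order):
        inverse[old_index] = new_index
    return inverse


order = z_order_permutation(4, 4)
```

For a 4x4 grid the first four entries of `order` are the tokens of the top-left 2x2 patch, so a contiguous block of the permuted sequence corresponds to a compact image region. Because the permutation depends only on the grid size, it can be computed once and reused for every image.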

2605.17626 2026-05-19 cs.LG cs.SE

Verifier-Guided Code Translation via Meta-Step Decoding

Tianyang Zhou, Somesh Jha, Mihai Christodorescu, Kirill Levchenko, Varun Chandrasekaran

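AI summary: This study proposes Decoding Time Verification (DTV), a meta-step decoding framework that interleaves verifier calls with generation, raising code translation pass rates while using fewer tokens.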

Comments 31 pages, 8 figures

Abstract

Test-time scaling is an important mechanism for improving large language models, especially on tasks with deterministic verifiers. Code translation is a canonical example: the source program constrains valid outputs, while compilers, type checkers, and behavioral checks provide exact pass/fail feedback. Existing approaches typically apply these verifiers only after generation, which is inefficient because early errors corrupt the autoregressive context and are rarely corrected later. We introduce Decoding Time Verification (DTV), a framework that treats structural boundaries as meta steps for verifier-guided decoding. DTV interleaves generation with verifier calls under a state-machine controller that enforces valid prefixes, using structural-boundary checks and structure-aware rollback to prevent error propagation while reducing wasted tokens. We evaluate DTV on C-to-Rust and JavaScript-to-TypeScript translation. Using Qwen3-4B as the primary generator under matched token budgets, DTV improves pass rates from 72.3% to 82.0% on C-to-Rust and from 33.3% to 46.0% on JavaScript-to-TypeScript relative to matched self-refinement baselines, while using fewer tokens per case; the same trend largely transfers to Gemma-4-E4B. In the evaluated cost-matched grid, DTV achieves a more favorable pass-rate-cost tradeoff than post-hoc verification or sampling-based scaling. These results show that verifier-guided decoding is an effective use of inference-time compute for code translation.
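
The control flow of interleaving generation with verification can be reduced to a small loop: generate one structural unit, verify the resulting prefix, and roll back to the last verified boundary on failure. The sketch below is a simplified stand-in for the state-machine controller in the abstract; the `propose` and `verify` callbacks, the retry limit, and the unit granularity are assumptions.

```python
def verifier_guided_decode(propose, verify, num_units, max_retries=3):
    """Build an output unit by unit, accepting a unit only if the verifier
    passes on the extended prefix. Returns None if a unit cannot be fixed."""
    prefix = []
    for unit_index in range(num_units):
        for attempt in range(max_retries):
            candidate = propose(prefix, unit_index, attempt)
            if verify(prefix + [candidate]):
                prefix.append(candidate)  # commit: this boundary is now verified
                break
            # Otherwise discard the candidate; the verified prefix is untouched.
        else:
            return None
    return prefix


def toy_propose(prefix, unit_index, attempt):
    """Stand-in generator that gets unit 1 wrong on its first attempt."""
    if unit_index == 1 and attempt == 0:
        return "BAD"
    return f"unit{unit_index}"


def toy_verify(prefix):
    """Stand-in for a compiler or type checker run on the partial output."""
    return "BAD" not in prefix


result = verifier_guided_decode(toy_propose, toy_verify, num_units=3)
```

Because the bad unit is rejected before anything is generated after it, later units are never conditioned on the error, which is the inefficiency of post-hoc verification that the abstract points to.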

2605.17625 2026-05-19 cs.AI

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

Nikola Milosevic

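AI summary: This paper evaluates a dual-process memory architecture that addresses context window saturation for scientific agents on long-horizon tasks by separating immediate episodic needs from long-term consolidated knowledge, improving accuracy and scalability in large-scale scientific workflows.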

Abstract

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.
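
The split between a constant episodic window and a slowly growing store of consolidated facts can be sketched with two plain data structures. This is a minimal illustration under stated assumptions: the class name, the key-value form of facts, and the rule that a newer value overwrites an older one are hypothetical, and real consolidation would be done by a model rather than passed in by hand.

```python
from collections import deque


class DualProcessMemory:
    """Episodic memory is a fixed-size window of recent messages;
    semantic memory is a dictionary of consolidated facts."""

    def __init__(self, window_size=10):
        self.episodic = deque(maxlen=window_size)
        self.semantic = {}

    def add(self, message, facts=None):
        """Store a message and any facts extracted from it. A later value
        for the same key replaces the earlier one, which is one simple way
        to handle parameters that change during an experiment."""
        self.episodic.append(message)
        for key, value in (facts or {}).items():
            self.semantic[key] = value

    def context(self):
        """Prompt context: consolidated facts first, then recent messages."""
        facts = [f"{key} = {value}" for key, value in self.semantic.items()]
        return facts + list(self.episodic)


memory = DualProcessMemory(window_size=10)
for step in range(25):
    memory.add(f"message {step}", facts={"learning_rate": 0.1 / (step + 1)})
memory.add("switched optimizer", facts={"optimizer": "AdamW"})
```

After 26 messages the context still holds only the last ten messages plus two facts, so its size is driven by the number of distinct facts rather than by the length of the conversation.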

2605.17624 2026-05-19 cs.CV cs.AI cs.LG

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

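AI summary: This paper investigates invariant and equivariant semi-supervised learning for training multi-task models on partially labeled datasets, using FixMatch and its equivariant extension Dense FixMatch. On Cityscapes and BDD100K, for object detection and semantic segmentation, both approaches outperform supervised baselines in most situations, with the largest gains when few labeled samples are available.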

Comments https://github.com/miquelmarti/DenseFixMatch

Abstract

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
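
For context, the unlabeled-data term of FixMatch uses confident predictions on a weakly augmented view as pseudo-labels for a strongly augmented view of the same sample. The sketch below implements that term for classification probabilities in plain Python; the 0.95 threshold is the commonly used default, and the function name and list-based inputs are choices made for this example rather than the paper's code.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy between pseudo-labels from the weak view and predictions
    on the strong view, counting only confident pseudo-labels and averaging
    over the whole unlabeled batch."""
    total = 0.0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence >= threshold:
            pseudo_label = weak.index(confidence)
            total += -math.log(strong[pseudo_label])
    return total / len(weak_probs)


# Two unlabeled samples: the first is confident, the second is masked out.
weak = [[0.97, 0.03], [0.60, 0.40]]
strong = [[0.50, 0.50], [0.90, 0.10]]
loss = fixmatch_unlabeled_loss(weak, strong)
```

The equivariant extension for dense tasks additionally has to apply the same geometric transform to the pseudo-label map as to the strongly augmented image, so that each pixel's pseudo-label stays aligned with its prediction; that step is not shown here.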

2605.17620 2026-05-19 cs.CV cs.AI cs.LG

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

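AI summary: This paper presents SynVA, a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. It combines new flow-matching-based methods with learning-based approaches to generate realistic vessel geometries and anatomically plausible aneurysms, and releases a large labeled dataset for downstream medical image analysis tasks.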

Abstract

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.
