arXivDaily: daily arXiv digest, updated Monday to Friday
2606.18803 2026-06-18 cs.AI cs.CY New submission

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

Affiliations: Didichuxing Co. Ltd

AI summary: Proposes ProfiLLM, an agentic LLM data pipeline that combines tool-augmented global knowledge mining with utility-aligned profile exploration to build user profiles from platform-scale behavioral logs for industrial ride-hailing dispatch. On DiDi's production dispatcher it reports up to +6.14% relative AUC in outcome prediction and up to +4.35% GMV in dispatching simulation.

Abstract

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.
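
The step in module (2) that turns scored candidate profiles into preference pairs can be pictured with a small sketch. This is not the authors' implementation: the function names, the toy scoring function standing in for the "lightweight downstream utility proxy", and the `min_margin` threshold are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def build_preference_pairs(
    candidates: List[str],
    utility_proxy: Callable[[str], float],
    min_margin: float = 0.0,
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs ordered by proxy utility.

    `utility_proxy` is a hypothetical stand-in for a downstream utility
    estimate; `min_margin` (also an assumption) drops pairs whose scores
    are too close to be informative.
    """
    scored = [(utility_proxy(c), c) for c in candidates]
    pairs = []
    for (score_a, a), (score_b, b) in combinations(scored, 2):
        if abs(score_a - score_b) <= min_margin:
            continue  # skip ties and near-ties
        pairs.append((a, b) if score_a > score_b else (b, a))
    return pairs


def toy_proxy(profile: str) -> float:
    # Toy proxy (illustrative only): rewards profiles that mention a region.
    return 1.0 if "region" in profile else 0.0
```

Pairs in this (chosen, rejected) form are the kind of input a DPO fine-tuning step consumes. In a real pipeline the proxy would be a trained predictor, not a keyword check.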

2606.18797 2026-06-18 cs.CL New submission

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Qingyu Lu, Ruochen Li, Liang Ding, Yufei Xia, Youxiang Zhu, Dacheng Tao

Affiliations: Nanyang Technological University; Technical University of Munich; Alibaba; University of Glasgow; University of Massachusetts Boston

AI summary: Examines whether LLM-based metrics for radiology report evaluation can separate clinically significant errors from harmless variation, identifies a widespread discrimination bias, and trains lightweight metrics on synthetic data that surpass 32B-scale medical LLMs, making them a practical choice for cost-sensitive deployment.

Comments: Under review

Abstract

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by reducing report quality to a medically ungrounded scalar. Although Large Language Models (LLMs) possess rich medical knowledge, they likewise struggle to draw a reliable boundary between clinically significant errors and harmless variation. We study this boundary using the ReEvalMed benchmark as a testbed and evaluate metric-level clinical significance along two axes: detecting true clinical errors ("Discrimination") and tolerating insignificant variations ("Robustness"). Across 8 LLM evaluators under one-pass and two-pass settings, we identify a widespread discrimination bias: models effectively detect errors but also over-penalize harmless rephrasings. To mitigate this, we synthesize 4k report pairs and train lightweight interpretable metrics on Qwen3-8B and MedGemma-4B. Our trained metric sharpens the clinical significance boundary, surpassing 32B-scale medical LLMs and remaining competitive with proprietary models. Crucially, the more costly two-pass setting fails to consistently improve overall performance and mainly trades discrimination for robustness. These findings suggest one-pass trained metrics as the practical choice for cost-sensitive deployment, with two-pass inference reserved for settings where the Discrimination-Robustness (D-R) balance is critical. We will release the dataset and metric.

2606.18793 2026-06-18 cs.CV New submission

Fuzzy-Geometric Branch-Point Modeling for Structure-Aware Augmentation of Handwritten Chinese Characters

Dongbin Jiao, Yibo Lyu, Qiulu Wei, Fuxiang Lu, Shengcai Liu, Shi Yan

Affiliations: School of Information Science and Engineering, Lanzhou University; Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology

AI summary: To address data scarcity and structural distortion in handwritten Chinese character augmentation, proposes a fuzzy geometry-driven structure-aware (FGSA) framework that models branch points as fuzzy sets, optimizes the membership field without supervision, and synthesizes samples via cubic Bézier reconstruction with multi-strategy perturbations, significantly reducing the word-level error rate.

Abstract

Data scarcity and structural distortion significantly limit handwriting recognition in high-security authentication. Existing augmentation methods often cause topological and morphological damage, particularly when processing complex Chinese characters where stroke intersections, ligatures, and sharp turns render traditional branch-point detection unreliable. To address this, this paper proposes a fuzzy geometry-driven structure-aware (FGSA) augmentation framework. We model branch points as fuzzy sets within the skeleton space, constructing a continuous branch-point membership field by integrating topological neighborhood evidence with direction field divergence. This membership field is adaptively optimized via an unsupervised surrogate objective, enabling robust stroke decoupling without manual annotation. Finally, kinematically-aligned samples are synthesized through parameterized cubic Bézier reconstruction and multi-strategy perturbations, ensuring a balance between structural fidelity and sample diversity. Moreover, we establish LZUSig, a large-scale, highly challenging dataset specifically dedicated to fine-grained structural degradation in Chinese handwritten signatures. Extensive experiments on CASIA-HWDB1.1, ChiSig, and LZUSig demonstrate that FGSA significantly reduces the word-level error rate ($\Delta$WER), achieving optimal recognition gains over the compared baselines. More importantly, it strikes a robust trade-off among task gain, structural fidelity, and discriminative feature preservation, offering a highly controllable solution for handwriting augmentation.

2606.18790 2026-06-18 cs.SD cs.AI cs.LG New submission

Closing the Loop: PID Feedback Control for Interpretable Activation Steering in Symbolic Music Generation

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas, Theodoros Giannakopoulos, Themos Stafylakis

Affiliations: Athens University of Economics and Business; Orfium Research; Hellenic Mediterranean University; Archimedes / Athena Research Center

AI summary: Proposes an inference-time activation steering framework based on PID feedback control. It extracts latent pitch and duration directions with the Difference-in-Means method and decouples multi-attribute steering with Gram-Schmidt orthogonalization, enabling fine-grained, interpretable attribute modulation in symbolic music generation.

Comments: Accepted at Learning to Listen: ICML 2026 Workshop on Machine Learning for Audio (43rd International Conference on Machine Learning - ICMLMLA26), 4 pages main (11 total), 2 figures

Abstract

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
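
The two geometric ingredients named in the abstract, a difference-in-means direction and a Gram-Schmidt step that removes one direction's component from another, can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the authors' code: the function names, the toy activations, and the additive `steer` update are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def diff_mean(positive: Sequence[Vector], negative: Sequence[Vector]) -> Vector:
    """Difference-in-means direction: mean(positive) - mean(negative)."""
    return [p - q for p, q in zip(mean(positive), mean(negative))]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v: Vector, u: Vector) -> Vector:
    """One Gram-Schmidt step: remove from v its component along u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


def steer(hidden: Vector, directions: Sequence[Vector], alphas: Sequence[float]) -> Vector:
    """Add scaled steering directions to an activation (assumed additive update)."""
    out = list(hidden)
    for direction, alpha in zip(directions, alphas):
        out = [h + alpha * x for h, x in zip(out, direction)]
    return out
```

After `orthogonalize`, moving along the second direction no longer changes the projection onto the first, which is the sense in which the two attributes are decoupled in this sketch.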

2606.18788 2026-06-18 cs.CV cs.CL New submission

HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space

Jaward Sesay, Yue Yu, Börje F. Karlsson

Affiliations: Beijing Institute of Technology; Beijing Academy of Artificial Intelligence

AI summary: Proposes HandwritingAgent, which uses a large reasoning model to autoregressively generate handwriting stroke sequences in SVG format without style-specific training. Style is controlled through natural language and a reference image, and the agent matches or surpasses state-of-the-art methods on imitation, recognition, multilingual synthesis, and complex mathematical expression generation.

Abstract

Teaching machines to emulate natural handwriting styles remains an open challenge, as it requires synthesizing stroke sequences that dynamically vary in shape, texture, pressure and script - not only across individuals, but also within a single person's handwriting. Attempts at this challenge have largely explored deep learning methods in both online and offline settings. However, these approaches are often constrained by style-specific architectural choices, heavy reliance on large datasets, high compute costs, and a lack of flexible control over writing styles through natural language. To this end, we introduce HandwritingAgent, a language-driven agent that can synthesize natural handwriting sequences directly in Scalable Vector Graphics (SVG) format with no need for style-specific training. The agent leverages a large reasoning model to geometrically analyse and autoregressively generate target handwritten glyphs as stroke sequences in a discrete grid canvas environment. Generation is conditioned on texts provided in either conversational or non-conversational mode, along with a reference handwriting-style image. Experiments on diverse handwriting tasks spanning imitation, recognition, multi-lingual handwriting synthesis, and generation of complex handwritten maths and science expressions indicate substantial improvement in performance, with HandwritingAgent matching or surpassing state-of-the-art generative handwriting models, while providing a more efficient, controllable, and generalizable synthesis method.

2606.18787 2026-06-18 cs.CV New submission

Learned Radius Estimation for UDF-Based Point Cloud Reconstruction

Eito Ogawa, Hiroshi Watanabe

Affiliations: Graduate School of FSE, Waseda University, Tokyo, Japan

AI summary: Proposes a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. It is trained on off-grid target radii obtained by parabolic interpolation, improving fine-scale accuracy in point cloud surface reconstruction.

Abstract

Surface reconstruction from point clouds is important for consumer-grade 3D capture, including AR/VR and indoor scanning. Local-patch Unsigned Distance Field (UDF) methods are lightweight and generalizable, but their accuracy depends on the support radius, traditionally fixed or selected by a one-dimensional curvature heuristic that cannot capture heterogeneous local geometry. We propose a learned per-query radius selector that predicts a continuous support radius and plugs into a frozen LoSF-UDF backbone. The selector is trained using off-grid target radii obtained by parabolic interpolation of cached UDF error curves. Experiments show improved fine-scale reconstruction accuracy.
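
Parabolic interpolation of a sampled error curve is a standard way to obtain an "off-grid" minimizer: fit a parabola through the best grid point and its two neighbors, then take the vertex. The sketch below shows that step only. The function name and the boundary and degenerate-case handling are assumptions, and it does not reproduce how the paper caches or evaluates its UDF error curves.

```python
from typing import Sequence


def parabolic_min(radii: Sequence[float], errors: Sequence[float]) -> float:
    """Estimate the off-grid radius that minimizes a sampled error curve."""
    i = min(range(len(errors)), key=errors.__getitem__)
    if i == 0 or i == len(errors) - 1:
        return radii[i]  # minimum on the boundary: no bracket to fit
    x0, x1, x2 = radii[i - 1], radii[i], radii[i + 1]
    y0, y1, y2 = errors[i - 1], errors[i], errors[i + 1]
    # Vertex of the parabola through (x0, y0), (x1, y1), (x2, y2).
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:
        return x1  # flat or collinear samples: keep the grid point
    return x1 - 0.5 * num / den
```

For an exactly quadratic error curve the vertex is recovered exactly. For real curves it is only a local approximation around the best grid radius.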

2606.18786 2026-06-18 cs.AI New submission

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii

Affiliations: Graduate School of Informatics, Nagoya University; School of Information and Data Sciences, Nagasaki University

AI summary: Proposes the R2D-RL environment, which connects RCSS2D to a Python MARL interface through shared-memory communication and cycle-level synchronization. It supports full-field and scenario-based training with configurable opponents, discrete and hybrid action spaces, EPV-based reward shaping, and parallel execution.

Comments: Code is available at: https://github.com/open-starlab/R2DRL

Abstract

Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.

2606.18785 2026-06-18 cs.LG cs.AI New submission

Bayesian Anytime Pareto Set Identification for Multi-Objective Multi-Armed Bandits

Lennert Saerens, Bram Silue, Eleni Litsa, Peter Vrancx, Pieter Libin

Affiliations: imec Data Science Institute, Interuniversity Institute of Biostatistics and Statistical Bioinformatics, UHasselt

AI summary: Proposes Top-Two Pareto Front Thompson Sampling (TTPFTS), presented as the first anytime multi-objective multi-armed bandit algorithm for Pareto set identification. It is validated on synthetic environments and an ultra-large molecular library, and is accompanied by a new uncertainty quantification metric.

Comments: 26 pages, 13 figures

Abstract

Identifying Pareto optimal solutions is critical to support multi-objective decision-making. We introduce the first anytime Multi-Objective Multi-Armed Bandit algorithm for the Pareto Set Identification problem, taking a Bayesian approach: Top-Two Pareto Front Thompson Sampling (TTPFTS). We benchmark TTPFTS against state-of-the-art fixed-budget Pareto Set Identification algorithms on synthetic environments. Next, we demonstrate its practical utility in a challenging multi-objective molecular discovery setting by efficiently exploring an ultra-large synthesis-on-demand molecular library. Furthermore, we introduce a novel uncertainty quantification metric that estimates our algorithm's confidence in the predicted Pareto set. We demonstrate that this metric effectively proxies true performance, yielding a robust methodology for monitoring learning progress in complex settings. Finally, we complement these empirical findings with a theoretical proof of the algorithm's asymptotic correctness.
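
For readers new to Pareto set identification, the sketch below shows the object being identified and one simple way posterior samples can express confidence in it. It is not TTPFTS and not the paper's uncertainty metric: the independent Gaussian posterior, the `membership_frequency` statistic, and all names are assumptions chosen for illustration.

```python
import random
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(means: Dict[str, Sequence[float]]) -> List[str]:
    """Arms whose mean vector is not dominated by any other arm."""
    return sorted(
        arm
        for arm, mu in means.items()
        if not any(dominates(other, mu) for name, other in means.items() if name != arm)
    )


def membership_frequency(
    post_mean: Dict[str, Sequence[float]],
    post_std: Dict[str, float],
    n_samples: int = 200,
    seed: int = 0,
) -> Dict[str, float]:
    """Fraction of posterior draws in which each arm is Pareto optimal.

    Assumes an independent Gaussian posterior per objective (illustrative only).
    """
    rng = random.Random(seed)
    counts = {arm: 0 for arm in post_mean}
    for _ in range(n_samples):
        draw = {
            arm: [rng.gauss(m, post_std[arm]) for m in mu]
            for arm, mu in post_mean.items()
        }
        for arm in pareto_set(draw):
            counts[arm] += 1
    return {arm: c / n_samples for arm, c in counts.items()}
```

Frequencies near 0 or 1 suggest the posterior has settled on an arm's status, while values near 0.5 flag arms that may still need samples.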

2606.18783 2026-06-18 cs.CV New submission

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

Yunus Sevim, Behçet Uğur Töreyin

Affiliations: Aselsan; Istanbul Technical University

AI summary: Proposes REEM, a framework that uses the signal-to-clutter ratio as a visibility prior and applies a differentiable modulation to the soft-IoU loss, improving detection of low-visibility targets with no additional parameters or inference overhead.

Comments: Accepted at CVPR 2026 Workshops (PBVS). Published version: https://openaccess.thecvf.com/content/CVPR2026W/PBVS/html/Sevim_SCR-Guided_Difficulty-Aware_Optimization_for_Infrared_Small_Target_Detection_CVPRW_2026_paper.html

Abstract

Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github.com/yall-in-one/Reemm.
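
The idea of weighting a soft-IoU loss by target visibility can be sketched as follows. The SCR definition used here (absolute mean contrast divided by background standard deviation) and the soft-IoU form are common conventions, but the weighting function `scr_weight` is a hypothetical choice made for this example. The abstract does not specify REEM's actual modulation, so this should not be read as the authors' method.

```python
from statistics import mean, pstdev
from typing import Sequence


def local_scr(target: Sequence[float], background: Sequence[float], eps: float = 1e-6) -> float:
    """SCR = |mean(target) - mean(background)| / std(background)."""
    return abs(mean(target) - mean(background)) / (pstdev(background) + eps)


def soft_iou_loss(pred: Sequence[float], mask: Sequence[float], eps: float = 1e-6) -> float:
    """1 - soft IoU between predicted probabilities and a binary mask."""
    inter = sum(p * m for p, m in zip(pred, mask))
    union = sum(pred) + sum(mask) - inter
    return 1.0 - (inter + eps) / (union + eps)


def scr_weight(scr: float, strength: float = 1.0) -> float:
    """Hypothetical modulation: the weight grows smoothly as SCR falls."""
    return 1.0 + strength / (1.0 + scr)


def modulated_loss(pred: Sequence[float], mask: Sequence[float], scr: float) -> float:
    """Soft-IoU loss scaled by the assumed visibility weight."""
    return scr_weight(scr) * soft_iou_loss(pred, mask)
```

Because the weight depends only on the input image and the ground-truth mask, a scheme like this changes the training signal but leaves the network and its inference path untouched, which matches the property the abstract emphasizes.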

2606.18782 2026-06-18 cs.CL cs.AI New submission

RedactionBench

RedactionBench: A Privacy-Protection Benchmark Grounded in Contextual Integrity

Sean Brynjólfsson, Shashvat Jayakrishnan, Esha Sali, Diptanshu Purwar, Madhav Aggarwal

Affiliations: A10 Networks, Inc.

AI summary: RedactionBench uses 200 documents across 11 domains to evaluate contextual privacy in redaction, proposes the R-Score metric, shows that contextual redaction judgments are subjective, and aims to advance privacy-preserving systems.

Abstract

Large Language Models are increasingly applied to sensitive domains that require redaction of personally identifiable information (PII). While redacting PII is a data cleaning prerequisite, existing benchmarks conflate extraction mechanics with privacy semantics. A public phone number is not equivalent to a phone number in a medical record. Whether information constitutes a violation depends heavily on who holds it, why, and in what context, fundamentally differentiating redaction from simple entity recognition. Grounded in contextual integrity, we introduce RedactionBench, a manually annotated benchmark comprising 200 diverse documents across 11 domains, mostly seeded from real-world sources. We also introduce R-Score, a novel character-level metric that treats semantically similar redactions equally and nullifies shallow formatting choices, such as varying masking styles for phone numbers. Evaluations across Named Entity Recognition models, entity extraction Small Language Models, and frontier models equipped with agentic tools demonstrate that contextual redaction remains an unsolved problem. A human evaluation with over 80 users on RedactionBench reveals a stark dichotomy in privacy perceptions. Annotators show consensus with target labels for mandatory redactions (89.4 percent) and safe text preservations (94.1 percent), but fail to agree on contextual redactions (47.7 percent). This variance demonstrates the subjective nature of contextual privacy and motivates R-Score, which decouples contextual ambiguity from strict precision. We compare 35 models across families and report their performance in redacting PII. Finally, we release RedactionBench to establish a baseline for future privacy-preserving systems, hoping to inspire efficient model design and standardized evaluations.

2606.18781 2026-06-18 cs.CL New submission

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

Affiliations: Chongqing University; State Key Laboratory of AI Safety; Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced; University of Queensland; University of Chinese Academy of Sciences

AI summary: Single-vector encoding can weaken the evidence carried by a short decisive span in a long document. The paper proposes DICE, a training-free chunk evidence aggregation strategy that encodes chunks independently and aggregates them into a single vector, improving retrieval while preserving the standard interface.

Comments: Code is available at https://github.com/PunchlineAAAA/DICE

Abstract

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
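
The split, encode, and aggregate pattern can be illustrated with a toy example. Two things here are assumptions and are labeled as such in the code: the aggregation rule (a mean of unit-normalized chunk vectors) and the `evidence_gap` quantity, which only follows the abstract's verbal description of EDI. The abstract gives neither DICE's aggregator nor EDI's formula, and the encoder is passed in by the caller.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale v to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def chunk_vectors(text: str, encode: Callable[[str], Vector], size: int) -> List[Vector]:
    """Split text into fixed-size token chunks and encode each independently."""
    tokens = text.split()
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [encode(" ".join(c)) for c in chunks]


def aggregate(vectors: Sequence[Vector]) -> Vector:
    """ASSUMED aggregator: mean of unit-normalized chunk vectors."""
    unit = [normalize(v) for v in vectors]
    return [sum(col) / len(unit) for col in zip(*unit)]


def evidence_gap(query: Vector, doc: Vector, chunks: Sequence[Vector]) -> float:
    """ASSUMED EDI-style gap: best chunk similarity minus document similarity."""
    return max(cosine(query, c) for c in chunks) - cosine(query, doc)
```

A larger gap means the single document vector under-represents the strongest chunk. The output of `aggregate` is still one vector per document, so the one-query-one-document ranking interface is unchanged.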

2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

AI summary: Proposes SAMA, a semantic anchor-aligned augmentation framework. Structured semantic anchors guide a multi-expert multimodal LLM to generate high-fidelity text, an anchor-preserving diffusion mechanism synthesizes images, and a dual-constraint filtering module selects samples, improving low-resource multimodal information extraction.

Comments: Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), which covers tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.18778 2026-06-18 cs.LG stat.ML New submission

Online Distributional Prediction via Latent Cluster Geometry Under Drift and Corruption

Navyansh Mahla, Prateek Chanda, Ganesh Ramakrishnan

Affiliations: Indian Institute of Technology, Bombay

AI summary: For online distributional prediction in non-stationary streams, proposes a Gibbs quasi-posterior over latent cluster geometries, sampled with reversible-jump MCMC, plus a restarted variant to handle drift. The method achieves sublinear Wasserstein regret under a sublinear corruption budget and sublinear transport action.

Abstract

Online learning in non-stationary streams is often formulated as tracking a point estimate, but many applications require predicting the full data-generating distribution. We study online distributional prediction under drift and adversarial corruption. Our approach represents each candidate law through a latent cluster geometry: a variable-size configuration of centers that organizes probability mass and induces a predictive distribution. A Gibbs quasi-posterior over these configurations yields an online predictor by posterior averaging, and the resulting variable-dimensional posterior can be sampled with reversible-jump MCMC. The method therefore avoids specifying a parametric streaming law while retaining a structured latent space for uncertainty, regularization, and comparison. We evaluate performance by cumulative Wasserstein-1 regret against the time-varying true law. The analysis separates two effects: corruption perturbs the loss-based posterior update, whereas drift makes long-horizon posterior memory stale. We address the latter with a restarted variant that temporally localizes the same quasi-Bayesian update. The resulting high-probability bounds decompose into a PAC-Bayesian complexity term, a corruption-sensitive posterior perturbation term, and a dynamic optimal-transport term driven by \(A_T^{\mathrm{OT}}=\sum_{t=2}^T W_2^2(p_{t-1}^*,p_t^*)\). Under bounded support, stable latent geometry, predictive-map regularity, oracle realizability, localized restart windows, sublinear transport action, and sublinear corruption budget, the restarted predictor achieves sublinear cumulative Wasserstein regret. These guarantees require no parametric model for the stream, drift mechanism, or corruption process.
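
The two quantities the bound is stated in, cumulative Wasserstein-1 regret and the transport action \(A_T^{\mathrm{OT}}\), are easy to compute in one dimension when every distribution is an empirical sample of the same size, because the optimal coupling then matches sorted values. The sketch below covers only that special case. Reading the regret as \(\sum_t W_1(\hat p_t, p_t^*)\) is an assumption about the paper's definition, and the function names are illustrative.

```python
from typing import List, Sequence

Sample = Sequence[float]


def _paired(x: Sample, y: Sample) -> List[tuple]:
    """Sorted pairing, which is the optimal coupling for equal-size 1-D samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal-size samples")
    return list(zip(sorted(x), sorted(y)))


def w1(x: Sample, y: Sample) -> float:
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    return sum(abs(a - b) for a, b in _paired(x, y)) / len(x)


def w2_squared(x: Sample, y: Sample) -> float:
    """Squared Wasserstein-2 distance under the same assumptions."""
    return sum((a - b) ** 2 for a, b in _paired(x, y)) / len(x)


def cumulative_w1_regret(predicted: Sequence[Sample], truth: Sequence[Sample]) -> float:
    """Sum over rounds of W1(prediction_t, true law_t) (assumed regret form)."""
    return sum(w1(p, q) for p, q in zip(predicted, truth))


def transport_action(truth: Sequence[Sample]) -> float:
    """A_T^OT: sum over t >= 2 of W_2^2 between consecutive true laws."""
    return sum(w2_squared(truth[t - 1], truth[t]) for t in range(1, len(truth)))
```

A predictor that never updates accumulates regret as the true law moves away from it, and the transport action measures how much of that movement there is in total.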

2606.18774 2026-06-18 cs.LG New submission

RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing

Guannan Lai, Haoran Hu, Han-Jia Ye

Affiliations: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; SinapisAI

AI summary: Presents the RouteJudge platform, which evaluates the decision quality of LLM routing strategies through anonymous pairwise comparisons, and releases the ORBIT toolbox to standardize routing workflows for reproducible, preference-aware evaluation.

Comments: Accepted by Pluralistic Alignment Workshop at ICML 2026

Abstract

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at https://routejudge.cn. Different from model-level response evaluation, RouteJudge focuses on router-level decision quality. For each user query, multiple routing strategies independently recommend candidate models under the same model pool and budget constraints. The selected model responses are then presented to users through anonymous pairwise comparisons, and the resulting user preferences are attributed back to the routing strategies behind the compared responses. Each evaluation record stores the query, routing decisions, model responses, preference labels, cost, latency, and task metadata, enabling preference-aware, cost-aware, and task-conditioned analysis of LLM routers. To support the continuous expansion of routing methods in RouteJudge, we further release ORBIT (Optimal Routing and Budgeted Inference Toolbox), a modular and extensible toolbox that standardizes the end-to-end workflow of LLM routing. ORBIT provides unified interfaces for benchmark loading, query representation, router implementation, budget-aware evaluation, and method comparison, allowing researchers to develop and evaluate routing algorithms under consistent protocols. It also serves as the submission and integration layer for RouteJudge: researchers can implement routing methods within ORBIT, validate them on existing routing benchmarks, and submit compatible routers for online preference-based evaluation. The code of ORBIT is available at https://github.com/AIGNLAI/LAMDA-ORBIT.
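
Attributing a pairwise user preference back to the routers behind the two responses can be sketched with a small tally. The record fields and the win-rate statistic below are hypothetical and pared down. They are not the RouteJudge schema or the ORBIT API, and a real record would also carry the responses, cost, latency, and task metadata the abstract lists.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Comparison:
    """One anonymous pairwise comparison (hypothetical, minimal schema)."""
    query: str
    router_a: str
    router_b: str
    preferred: str  # "a", "b", or "tie"


def router_win_rates(records: Iterable[Comparison]) -> Dict[str, float]:
    """Credit each preference to the router that chose the preferred model."""
    wins: Dict[str, float] = defaultdict(float)
    games: Dict[str, int] = defaultdict(int)
    for r in records:
        games[r.router_a] += 1
        games[r.router_b] += 1
        if r.preferred == "a":
            wins[r.router_a] += 1.0
        elif r.preferred == "b":
            wins[r.router_b] += 1.0
        else:  # a tie gives half a win to each side
            wins[r.router_a] += 0.5
            wins[r.router_b] += 0.5
    return {name: wins[name] / games[name] for name in games}
```

A plain win rate ignores which opponents a router faced. A rating model such as Bradley-Terry would be a natural refinement, though the abstract does not say which aggregation the platform uses.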

2606.18773 2026-06-18 cs.LG cs.AI 新提交

Private Learning with Public Feature Conditioning

基于公共特征条件化的私有学习

Shuli Jiang, Walid Krichene, Nicolas Mayoraz

发表机构 * Microsoft(微软) Google Research(谷歌研究院)

AI总结 针对标签差分隐私回归问题,提出Cond-DP方法,利用公共特征矩阵的结构信息构造条件化矩阵以加速优化,在凸、强凸和非凸设置下提供收敛保证,并在线性回归中实现比DPSGD更快的收敛速度。

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML 2026). 26 pages, 9 figures

详情
AI中文摘要

我们研究了每个数据样本包含公共、非敏感特征的设置下的差分隐私(DP)回归问题——这在推荐和广告系统等应用中很常见。虽然这种标签DP或半敏感特征设置主要在分类背景下进行了探索,但有效的回归方法仍未被充分研究。我们提出了Cond-DP,一种DPSGD的条件化变体,它利用公共特征矩阵的结构来改善隐私约束下的优化。受这些公共特征通常表现出快速衰减谱的观察启发,Cond-DP引入了一个数据驱动的条件化矩阵来重塑优化景观并加速收敛。我们为凸、强凸和非凸设置提供了收敛保证,并将标准DPSGD作为条件化矩阵为单位矩阵时的特例。我们展示了如何直接从公共特征为Cond-DP构造有效的条件化矩阵,从而在私有线性回归中实现比DPSGD更快的收敛速度,且不增加额外的隐私成本。实验表明,在标签DP下,使用该条件化矩阵的Cond-DP在多种数据集和模型架构上持续优于最先进的基线方法,展示了强大且稳健的实际性能。

英文摘要

We study differentially private (DP) regression in settings where each data sample includes public, non-sensitive features -- common in applications such as recommendation and advertising systems. While such label-DP or semi-sensitive-feature settings have been primarily explored in the context of classification, effective approaches for regression remain underexplored. We introduce Cond-DP, a conditioned variant of DPSGD that leverages the structure of public feature matrices to improve optimization under privacy constraints. Motivated by the observation that these public features often exhibit rapidly decaying spectra, Cond-DP incorporates a data-driven conditioning matrix to reshape the optimization landscape and accelerate convergence. We provide convergence guarantees for convex, strongly convex, and non-convex settings, and recover standard DPSGD as a special case when the conditioning matrix is the identity. We show how to construct an effective conditioning matrix for Cond-DP directly from public features, enabling provably faster convergence than DPSGD in private linear regression without incurring additional privacy cost. Empirically, Cond-DP with this conditioning matrix consistently outperforms state-of-the-art baselines across a wide range of datasets and model architectures under label DP, demonstrating strong and robust performance in practice.
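As a rough illustration of what a conditioned DPSGD step might look like for linear regression with public features, consider the sketch below. Everything in it is an assumption made for illustration: the function name, the choice of a regularized inverse covariance as conditioning matrix, and the clipping and noise settings. It performs no privacy accounting and is not the authors' algorithm.

```python
import numpy as np


def conditioned_label_dp_step(w, X, y, C, lr, clip, noise_mult, rng):
    """One noisy, preconditioned gradient step for squared loss (illustrative)."""
    residual = X @ w - y                     # depends on the private labels
    per_example = residual[:, None] * X      # per-example gradients, shape (n, d)
    norms = np.linalg.norm(per_example, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    clipped = per_example * scale            # bound each example's contribution
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    noisy_grad = (clipped.sum(axis=0) + noise) / len(y)
    # C is built from public features only, so applying it is post-processing.
    return w - lr * (C @ noisy_grad)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.1])  # fast-decaying spectrum
y = X @ np.array([1.0, -2.0, 0.5])
# Assumed conditioning choice: regularized inverse of the public covariance.
C = np.linalg.inv(X.T @ X / len(X) + 1e-3 * np.eye(3))
w = conditioned_label_dp_step(np.zeros(3), X, y, C, lr=0.5, clip=1.0,
                              noise_mult=1.0, rng=rng)
```

Passing `np.eye(3)` as `C` reduces the step to a plain clipped, noisy gradient step, which mirrors the abstract's remark that standard DPSGD is recovered when the conditioning matrix is the identity.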

2606.18772 2026-06-18 cs.RO 新提交

HALOMI: Learning Humanoid Loco-Manipulation with Active Perception from Human Demonstrations

HALOMI: 从人类演示中学习具有主动感知的人形机器人全身操控

Zehui Zhao, Yuxuan Zhao, Gaojing Zhang, Chenxi Liu, Maolin Zheng, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Sussex(萨塞克斯大学) East China University of Science and Technology(华东理工大学)

AI总结 提出HALOMI框架,通过扩展通用操控接口(UMI)实现主动感知,利用流形约束控制器和观察-动作对齐,使Unitree G1人形机器人在五项真实任务中平均成功率达85%。

详情
AI中文摘要

人类演示可以大规模收集,并自然捕捉主动的手眼协调,是学习人形机器人全身操控的有前景的数据源。然而,直接将人类演示迁移到人形机器人需要精确的世界坐标系跟踪控制器,这在分布外(OOD)目标下通常脆弱,而人形差异在自我中心观察和动作执行中持续存在。为解决这些挑战,我们提出HALOMI,一个从人类演示中学习具有主动感知的人形机器人全身操控的可扩展框架。HALOMI扩展了通用操控接口(UMI)并加入自我中心感知,以大规模收集自我视角和手腕视角观察以及头-手轨迹。我们进一步提出一个流形约束控制器,在学习的潜在行为流形中规划,以实现世界坐标系中精确鲁棒的头-手跟踪。为弥合人形差异,我们进行自我视角对齐,并引入控制器感知的参考轨迹自适应,以减少观察和动作执行中的不匹配。我们在配备活动脖子的Unitree G1人形机器人上验证HALOMI,涉及导航、抓取、双手操控、全身协调和动态行为五项真实任务。在三个定量评估的任务中,HALOMI平均成功率达85%,而额外定性演示显示其支持动态抛掷和深蹲抓取的能力。

英文摘要

Human demonstrations, which can be collected at scale and naturally capture active hand-eye coordination, are a promising data source for learning humanoid loco-manipulation. However, directly transferring human demonstrations to humanoids requires a precise world-frame tracking controller, which is often brittle under out-of-distribution (OOD) targets, while human-to-humanoid gaps persist in both egocentric observation and action execution. To address these challenges, we present HALOMI, a scalable framework for learning humanoid loco-manipulation with active perception from human demonstrations. HALOMI extends Universal Manipulation Interface (UMI) with egocentric sensing to collect ego-view and wrist-view observations along with head-hand trajectories at scale. We further propose a manifold-constrained controller that plans in a learned latent behavior manifold to enable precise and robust head-hand tracking in the world frame. To bridge the human-to-humanoid gap, we perform ego-view alignment and introduce a controller-aware reference trajectory adaptation to reduce mismatch in both observation and action execution. We validate HALOMI on a Unitree G1 humanoid robot with an actuated neck across five real-world tasks involving navigation, grasping, bimanual manipulation, whole-body coordination, and dynamic behaviors. Across the three quantitatively evaluated tasks, HALOMI achieves an average success rate of 85%, while additional qualitative demonstrations show its ability to support dynamic tossing and deep-squat grasping.

2606.18767 2026-06-18 cs.CL 新提交

Output Vector Editing for Memorization Mitigation in Large Language Models

输出向量编辑:缓解大型语言模型中的记忆化问题

Ahmad Dawar Hakimi, Kaiwei Lei, Isabelle Augenstein, Hinrich Schütze

发表机构 * Center for Information and Language Processing, LMU Munich(慕尼黑大学语言与信息处理中心) Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系) Munich Center for Machine Learning(慕尼黑机器学习中心) Pioneer Centre for AI(人工智能先锋中心)

AI总结 提出输出向量编辑方法,通过约束优化修改MLP神经元输出向量引入干扰项,在不改变激活值的情况下抑制记忆化序列,在OLMo-7B上实现87.9%抑制率,并揭示MLP编辑的机制边界。

详情
AI中文摘要

大型语言模型会记忆并复现训练数据中的序列,从而带来隐私、版权和安全风险。现有的神经元级缓解方法将编辑等同于将神经元激活归零,但激活仅控制神经元是否参与;输出向量才是写入残差流的内容,并通过叠加编码多个特征。我们提出输出向量编辑,这是一种约束优化的权重编辑方法,定位负责记忆化延续的一小组MLP神经元,并最小程度地修改其输出向量,以在词汇空间中引入干扰项,从而重定向它们在残差流中的贡献,同时保持激活不变。在四个模型(SmolLM-360M、OLMo-1B、OLMo-7B、Llama2-7B,参数规模从360M到7B)上进行评估,我们重点研究OLMo-7B(其开放权重和预训练语料库支持系统化挖掘),挖掘了6831个记忆化序列,实现了高达87.9%的抑制率。在相同定位的神经元上,与零消融相比,2.7倍的差距表明抑制来自输出向量编辑,而非仅定位。四种编辑模式涵盖了从激进抑制到最小重定向的谱系;集成使用时覆盖了96.5%的记忆化序列,而我们推荐的单一模式配置达到了81.5%,且没有灾难性的局部性失败。我们进一步识别了一个机制边界:约14%的序列无法通过仅MLP编辑达到;虽然这些失败总体上并非由注意力驱动,但消融贡献最大的注意力头可恢复其中60-64%,对于从前缀复制token的延续,恢复更强,这表明注意力是互补的后备机制而非主要机制。编辑模式排序和成功-局部性权衡在所有四个模型上迁移,成功率随模型规模而非家族增长。

英文摘要

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ~14% of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60-64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.

2606.18765 2026-06-18 cs.CV 新提交

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

SpectralDiT:流匹配DiT的时间步条件谱残差校正

Jiayu Tian

发表机构 * Peking University(北京大学)

AI总结 提出SpectralDiT,通过时间步条件谱残差校正模块,在CIFAR-10和ImageNet-100上以极少额外计算和参数提升流匹配DiT的生成质量,FID分别降低5.1%和8.7%。

详情
AI中文摘要

我们提出SpectralDiT,一种对流匹配扩散变换器(Diffusion Transformers)的轻量级修改,它在MLP残差分支中添加了时间步条件谱校正。该模块将每个残差更新分解为补丁-令牌网格上的低频和高频分量,然后学习一个零初始化的加法门,使得模型最初与基线DiT匹配。在CIFAR-10像素空间生成中,SpectralDiT在补丁大小为1时将FID从20.78提升至19.71,并缩小了径向傅里叶谱差距。此外,我们将方法扩展到ImageNet-100上的潜在扩散。在额外理论FLOPs增加0.6%和参数增加1.36%的情况下,SpectralDiT改进了潜在流匹配,在无分类器引导(CFG 2.0)下实现了8.7%的相对FID降低。所有报告结果均为五个种子的平均值。在CIFAR-10上的消融实验和门控可视化揭示了稳定的块特定谱校正模式。

英文摘要

We propose SpectralDiT, a lightweight modification to flow-matching Diffusion Transformers that adds timestep-conditioned spectral correction to the MLP residual branch. The module decomposes each residual update into low- and high-frequency components on the patch-token grid, then learns a zero-initialized additive gate so the model initially matches the baseline DiT. On CIFAR-10 pixel-space generation, SpectralDiT improves FID from 20.78 to 19.71 at patch size 1 and reduces the radial Fourier spectrum gap. Furthermore, we scale our method to latent diffusion on ImageNet-100. With 0.6% additional theoretical FLOPs and 1.36% additional parameters, SpectralDiT improves latent flow-matching, achieving an 8.7% relative FID reduction under classifier-free guidance (CFG 2.0). All reported results are averaged over five seeds. Ablations and gate visualizations on CIFAR-10 reveal stable block-specific spectral correction patterns.
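The zero-initialized additive gate mentioned above can be illustrated with a small NumPy sketch. The abstract does not say how the frequency split is computed, so the FFT-based radial low-pass, the `cutoff` value, and the scalar gates below are assumptions; in the actual model the gates would presumably be produced from the timestep rather than passed in by hand.

```python
import numpy as np


def spectral_split(residual, cutoff):
    """Split a (H, W, D) token-grid update into low- and high-frequency parts."""
    H, W, _ = residual.shape
    spec = np.fft.fft2(residual, axes=(0, 1))
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    low_mask = (np.sqrt(fy ** 2 + fx ** 2) <= cutoff)[:, :, None]
    low = np.fft.ifft2(spec * low_mask, axes=(0, 1)).real
    return low, residual - low


def corrected_residual(residual, gate_low, gate_high, cutoff=0.25):
    """Additive gated correction; zero gates reproduce the baseline update."""
    low, high = spectral_split(residual, cutoff)
    return residual + gate_low * low + gate_high * high


rng = np.random.default_rng(0)
update = rng.normal(size=(8, 8, 4))            # 8x8 patch grid, 4 channels
same = corrected_residual(update, 0.0, 0.0)    # gates at their zero initialization
boosted = corrected_residual(update, 0.0, 0.5) # amplify high frequencies only
```

With both gates at zero the corrected update equals the input, which is the property that lets a model of this kind start training from the baseline DiT behaviour.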

2606.18753 2026-06-18 cs.CV 新提交

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

SMART:一种灵活、可解释且可扩展的高分辨率成像数据时空脑图谱

John Kalkhof, Boris Gutman, Emile d'Angremont, Daniel C. Alexander, Marco Lorenzi

发表机构 * Illinois Institute of Technology(伊利诺伊理工学院) Amsterdam University Medical Center(阿姆斯特丹大学医学中心) University College London(伦敦大学学院)

AI总结 提出SMART框架,通过解耦全局疾病动态与患者特定解剖表现,学习连续疾病时间图谱,实现高分辨率3D医学图像中时空变化的灵活、可解释和可扩展建模。

详情
AI中文摘要

我们介绍了SMART,一个从纵向高分辨率3D医学图像中学习灵活、可解释且可扩展的时空脑图谱的框架。现有的时空图谱构建方法依赖于黑盒生成模型,缺乏灵活性、限制可解释性,并且难以扩展到高维数据。SMART通过学习一个连续的疾病时间图谱来解决这些挑战,该图谱将全局群体级疾病动态与患者特定的解剖表现解耦。在解剖学启发先验的指导下,SMART通过区域特异性微分方程,沿着共享的疾病时间线建模可解释的全局区域进展轨迹。全局轨迹进一步通过由灵活且可扩展的多尺度神经细胞自动机参数化的密集微分同胚位移,个性化到个体解剖结构。在阿尔茨海默病的五个纵向MRI数据集(ADNI-1/GO/2、OASIS-3、AIBL;>1300名受试者)上评估,SMART产生了解剖学上有意义的疾病进展预测,并实现了最先进的预测准确性和比对抗性和扩散基线更好的时间一致性。我们的方法为高维医学图像时间序列中时空变化的灵活、可解释和可扩展建模建立了一个新范式。

英文摘要

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

2606.18749 2026-06-18 cs.CV 新提交

Toward Training-Free Zero-Shot Anomaly Detection in 3D Medical Images: A Batch-Based Approach Using 2D Foundation Models

迈向3D医学图像的无训练零样本异常检测:基于批次的方法使用2D基础模型

Tai Le-Gia

发表机构 * Chungnam National University(忠南大学)

AI总结 提出CS3F框架,利用2D基础模型对3D医学图像进行零样本异常检测,通过沿多轴分解、切片编码和跨主体相似性计算异常分数,并引入粗到细的分词策略减少信号衰减。

详情
AI中文摘要

零样本异常检测(ZSAD)在医学成像中具有吸引力,因为临床系统必须处理异构采集协议、变化的患者群体以及可能缺乏标注训练数据的病理。大多数现有的零样本异常检测方法是为2D图像设计的,它们直接扩展到3D医学体积受到大规模体积基础模型稀缺或利用体积上下文困难的限制。我们提出CS3F,一个无训练的基于批次的框架,用于3D医学图像中的ZSAD,使用2D基础模型。每个体积沿多个解剖轴分解,并由2D视觉变换器逐切片编码。然后通过池化相邻切片特征将其转换为局部体积令牌。异常分数通过跨主体互相似性获得:在其他主体中缺乏相似令牌的令牌被赋予更高的异常分数。为了减少深度池化引起的病灶信号衰减,我们引入了一种粗到细的分词策略,无需穷举匹配即可实现细分辨率体积评分。CS3F在脑部MRI上针对转移瘤、胶质瘤和中风进行评估,并在肺部CT上验证其泛化能力,超越标准图谱对齐的脑部MRI。结果表明,冻结的2D基础模型可以支持3D医学图像中的异常定位,且细分词化的益处很大程度上取决于病灶对比度和成像模态。

英文摘要

Zero-shot anomaly detection (ZSAD) is attractive for medical imaging because clinical systems must handle heterogeneous acquisition protocols, changing patient populations, and pathologies for which annotated training data may be unavailable. Most existing zero-shot anomaly detection methods are designed for 2D images, and their direct extension to 3D medical volumes is limited by the scarcity of large-scale volumetric foundation models or by the difficulty of utilizing volumetric context. We propose CS3F, a training-free batch-based framework for ZSAD in 3D medical images using 2D foundation models. Each volume is decomposed along multiple anatomical axes and encoded slice-wise by a 2D vision transformer. The slice features are then converted into localized volumetric tokens by pooling neighboring slices. Anomaly scores are obtained from cross-subject mutual similarity: tokens that lack close analogues in other subjects are assigned higher anomaly scores. To reduce the attenuation of focal lesion signals caused by depth pooling, we introduce a coarse-to-fine tokenization strategy that enables fine-resolution volumetric scoring without exhaustive matching. CS3F is evaluated on brain MRI across metastases, glioma, and stroke, and is validated on lung CT to test generalizability beyond atlas-aligned brain MRI. The results show that frozen 2D foundation models can support anomaly localization in 3D medical images, and that the benefit of fine tokenization depends strongly on lesion contrast and imaging modality.
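The cross-subject scoring idea can be sketched as follows. This is a simplified, one-directional reading under assumptions: the scoring rule (one minus the best cosine match in each other subject, averaged over subjects) and the toy two-dimensional features are hypothetical, and the abstract's "mutual" similarity and coarse-to-fine tokenization are not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def anomaly_scores(batch):
    """batch: list of subjects, each a list of token feature vectors.

    A token scores high when no other subject contains a similar token.
    """
    scores = []
    for i, tokens in enumerate(batch):
        subject_scores = []
        for tok in tokens:
            best = [max(cosine(tok, other) for other in batch[j])
                    for j in range(len(batch)) if j != i]
            subject_scores.append(1.0 - sum(best) / len(best))
        scores.append(subject_scores)
    return scores


batch = [
    [[1.0, 0.0], [0.0, 1.0]],      # subject 0
    [[1.0, 0.1], [0.1, 1.0]],      # subject 1
    [[1.0, 0.0], [-1.0, -1.0]],    # subject 2: second token has no analogue
]
scores = anomaly_scores(batch)
```

In this toy batch the second token of subject 2 receives the highest score because neither of the other subjects contains anything close to it.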

2606.18747 2026-06-18 cs.RO cs.AI 新提交

Generating Natural and Expressive Robot Gestures through Iterative Reinforcement Learning with Human Feedback using LLMs

通过基于人类反馈的迭代强化学习利用大语言模型生成自然且富有表现力的机器人手势

Chris Lee, Flora Salim, Benjamin Tag, Francisco Cruz

发表机构 * University of New South Wales(新南威尔士大学) Universidad Central de Chile(智利中央大学)

AI总结 针对社交机器人手势生成僵硬问题,提出将ChatGPT集成到Pepper机器人中生成共语手势,并引入基于人类反馈的迭代强化学习(RLHF)优化手势,实验表明RLHF提升了手势的表现力、相关性和流畅性。

Comments 8 Pages, 6 Figures

详情
AI中文摘要

富有表现力的手势对于自然有效的沟通至关重要,当仅靠语言线索不足时(例如,指向),手势可以补充言语。对于像Pepper这样的人形社交机器人,产生自然且富有表现力的动作对于改善人机交互(HRI)和长期接受度至关重要。然而,由于依赖专家编写的动画,生成手势仍然具有挑战性,导致行为僵硬,难以适应动态和多样化的环境。或者,机器学习方法通常难以捕捉感知的自然性,随着自由度的增加而变得更加困难。因此,产生富有表现力的机器人手势需要一个能够适应环境同时遵守社会规范和物理约束的系统。大语言模型(LLMs)的最新进展使得动态代码生成成为可能,为从自然语言实时合成手势提供了新的机会。在本文中,我们将ChatGPT集成到人形机器人Pepper中,以生成与对话输出一致的共语手势。虽然这一基线实现了灵活的手势生成,但生成的动作通常被认为僵硬且不自然。为了解决这一限制,我们引入了一种基于人类反馈的迭代强化学习(RLHF)系统,该系统根据用户评估微调手势生成,并利用迭代用户研究比较Pepper生成的手势。我们的结果表明,RLHF改进了LLM的共语生成能力,产生了更富有表现力、相关且流畅的动作。

英文摘要

Expressive gestures are essential for natural and effective communication, complementing speech when verbal cues alone are insufficient (e.g., pointing). For social robots such as the humanoid Pepper, producing natural and expressive movements is critical for improving human-robot interaction (HRI) and long-term acceptance. However, generating gestures remains challenging due to reliance on expert-authored animations, resulting in rigid behaviors that are impractical for dynamic and diverse environments. Alternatively, machine learning approaches often struggle to capture perceived naturalness, becoming increasingly challenging with more degrees of freedom. Consequently, producing expressive robot gestures requires a system that can adapt to the environment while adhering to social norms and physical constraints. Recent advances in large language models (LLMs) enable dynamic code generation, offering new opportunities for runtime gesture synthesis from natural language. In this paper, we integrate ChatGPT into the humanoid robot Pepper to generate co-speech gestures aligned with conversational output. While this baseline enables flexible gesture generation, the resulting motions are often perceived as stiff and unnatural. To address this limitation, we introduce an iterative reinforcement learning with human feedback (RLHF) system that finetunes gesture generation based on user evaluations, leveraging an iterative user study to compare Pepper's generated gestures. Our results show that RLHF improved the LLM's co-speech generative capabilities, producing more expressive, relevant and fluid movements.

2606.18746 2026-06-18 cs.AI 新提交

What Must Generalist Agents Remember?

通用型智能体必须记住什么?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文形式化论证了通用型智能体为在多个环境和目标下近似最优行动,必须存储领域相关信息以区分观察瓶颈处的不兼容最优动作,并证明记忆可用于重构局部转移动态。

详情
AI中文摘要

本文形式化地阐述了通用型智能体为了在多个环境和目标下近似最优地行动,必须在记忆中存储什么。它表明,当两个领域共享一个观察瓶颈但需要不兼容的最优动作时,任何一致近似最优的策略必须在该瓶颈处诱导出不同的记忆分布。这一结果产生了一个分离定理:足够成功的智能体不能仅依赖当前状态观察,而必须在记忆中保留领域相关信息。本文进一步证明,如果智能体的记忆包含足够的信息来估计相关目标的值,那么该记忆可用于近似重构智能体的局部转移动态。综合这些结果,将记忆刻画为支持领域区分、转移模型重构和通用型智能体规划的基板。

英文摘要

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

2606.18738 2026-06-18 cs.SD 新提交

GRIDEX: Grid-Grounded Forensic Explanations for Deepfake Spectrogram Analysis

GRIDEX:基于网格的深度伪造频谱图取证解释

Thi Ngan Ha Do, Tingmin Wu, Alsharif Abuadbba, Kristen Moore

发表机构 * CSIRO(澳大利亚联邦科学与工业研究组织)

AI总结 提出GRIDEX框架,通过两阶段学习(SFT+GRPO)定位频谱图异常区域并生成结构化取证解释,提升伪造检测的可解释性。

详情
AI中文摘要

语音生成技术的进步使得人工语音越来越逼真。尽管现代分类模型在深度伪造检测方面可以达到高准确率,但它们不会产生证据,例如指出欺骗线索在频谱图中的位置及其声学含义,从而限制了它们在取证中的实用性。完整频谱图的人工分析是资源密集型的,因此证据应将注意力集中在最具诊断性的区域。此外,现有的可解释性方法在将上下文属性与局部证据联系起来方面的能力有限,使得解释更难验证。为了克服这一限制,我们提出了GRIDEX,这是一个流水线,当给定深度伪造频谱图时,它会生成其异常的取证解释。该流水线(i)选择频谱图中前K个异常区域,并(ii)为每个异常生成解释。这些解释遵循分类声学字段的模式,包括时间、频谱、语音信息和解释文本。据我们所知,这是第一个使用区域定位为深度伪造频谱图生成结构化取证解释的框架。GRIDEX采用两阶段学习范式进行训练,该范式将监督微调(SFT)与群体相对策略优化(GRPO)相结合。在我们的数据集上的实验表明,与强大的视觉语言模型(VLM)基线相比,伪影定位和解释质量有所提高。数据集和代码将在发表后发布。

英文摘要

The advancement of speech generation technologies has made artificial speech increasingly realistic. Although modern classification models can achieve high accuracy in deepfake detection, they do not produce evidence, such as indicating where spoof cues appear in the spectrogram and what they imply acoustically, which limits their usefulness in forensic settings. Manual analysis of full spectrograms is resource-intensive, so evidence should narrow attention to the most diagnostic regions. Moreover, existing explainability methods have limited capabilities in connecting contextual attributes to localized evidence, making explanations harder to verify. To overcome this limitation, we propose GRIDEX, a pipeline that, when given a deepfake spectrogram, generates forensic explanations of its anomalies. The pipeline (i) selects top-K anomalous regions in the spectrogram and (ii) produces an explanation for each anomaly. The explanations follow a schema of categorical acoustic fields, including temporal, spectral, phonetic information and interpretation text. To our knowledge, this is the first framework to generate structured forensic explanations using regional grounding for deepfake spectrograms. GRIDEX is trained with a two-stage learning paradigm that combines supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO). Experiments on our dataset show improved artifact localization and explanation quality over strong vision-language model (VLM) baselines. The dataset and code will be released upon publication.

2606.18732 2026-06-18 cs.LG cs.CV 新提交

Low-Cost Neuromorphic Fall Detection Using Synthetic Event Data and Hybrid SNNs

低成本神经形态跌倒检测:使用合成事件数据和混合SNN

Guillermo Rojas, Gonzalo Soto, Daniel Yunge

发表机构 * School of Electrical Engineering, Pontificia Universidad Católica de Valparaíso, Chile(瓦尔帕莱索天主教大学电气工程学院)

AI总结 提出混合SNN-CNN模型,从智能手机视频合成事件相机数据,实现高效准确的跌倒检测。

Comments 4 pages, 6 figures, presented at ICONS 2025 during the Poster Session, but not published

详情
AI中文摘要

本工作提出了混合模型,将脉冲神经网络(SNN)与卷积神经网络(CNN)组件集成,以从传统智能手机视频生成的模拟事件相机数据(动态视觉传感器,DVS)中学习。主要针对人类跌倒检测,该方法通过将视频帧转换为事件数据,利用SNN的能效和时空处理能力。通过多个数据集上的模拟评估所提出的模型,并将其性能与传统机器学习模型进行比较。结果表明,在不牺牲准确性的情况下显著提高了效率,强调了将SNN和DVS技术结合用于现实环境中复杂任务的潜力。

英文摘要

This work presents the development of hybrid models that integrate spiking neural networks (SNNs) with components of convolutional neural networks (CNNs) to learn from simulated event-based camera data (Dynamic Vision Sensor, DVS) generated from conventional smartphone videos. Aimed primarily at human fall detection, the approach leverages the energy efficiency and spatio-temporal processing capabilities of SNNs by converting video frames into event-based data. The proposed models are evaluated through simulations on multiple datasets, comparing their performance to that of traditional machine learning models. Results demonstrate significant gains in efficiency without sacrificing accuracy, underscoring the potential of combining SNNs and DVS technology for complex tasks in real-world environments.
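Converting ordinary video frames into event-style data is commonly done by thresholding changes in log intensity per pixel. The sketch below follows that common simplified model; the abstract does not name the simulator used, so the function name, the threshold value, and the reset rule are assumptions for illustration only.

```python
import math


def frames_to_events(frames, threshold=0.2, eps=1e-3):
    """Emit (t, y, x, polarity) when log intensity changes by >= threshold.

    frames: list of 2D lists with grayscale values in [0, 1].
    """
    ref = [[math.log(v + eps) for v in row] for row in frames[0]]
    events = []
    for t, frame in enumerate(frames[1:], start=1):
        for y, row in enumerate(frame):
            for x, v in enumerate(row):
                diff = math.log(v + eps) - ref[y][x]
                if abs(diff) >= threshold:
                    events.append((t, y, x, 1 if diff > 0 else -1))
                    ref[y][x] = math.log(v + eps)  # reset reference after firing
    return events


frames = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.5, 0.9], [0.5, 0.5]],   # one pixel brightens
    [[0.5, 0.9], [0.1, 0.5]],   # another pixel darkens
]
events = frames_to_events(frames)
```

Only pixels that change produce output, which is the sparsity a spiking network can exploit when a person moves through an otherwise static scene.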

2606.18728 2026-06-18 cs.CL 新提交

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

LegalWorld: 法律智能体的生命周期交互环境

Songhan Zuo, Shengbin Yue, Tao Chiang, Guanying Li, Yun Song, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Northwest University of Political Science and Law(西北政法大学)

AI总结 提出LegalWorld,一个将中国民事诉讼建模为五阶段因果链的生命周期交互环境,基于75309对判决书构建,并评估多智能体在连续诉讼中的能力差异。

详情
AI中文摘要

民事诉讼本质上是一个生命周期过程:律师第一天起草的内容会约束数月后庭审的走向。然而,现有的法律基准评估的是孤立的子任务,而先前的法律智能体模拟器每次从共享的真实情况重新初始化场景,忽略了跨阶段的因果依赖关系。我们提出LegalWorld,一个生命周期交互环境,将中国民事诉讼建模为五个阶段(七个子场景)的因果连接状态链,基于75,309对中国民事判决书构建。我们为其配备了可重用的基础设施(本地记忆、全局案件记忆、技能/工具库),确保每个争议在其整个生命周期中保持一致。在此环境基础上,我们构建了LongJud-Bench,用于评估智能体在所有五个连接阶段的能力。来自217名法律背景评估者的18,992个评分证实,LegalWorld的轨迹在程序上忠实且角色一致;跨模型的能力级评估揭示了聚合分数无法暴露的显著分歧,没有单一骨干模型在咨询、起草和庭审辩护中均领先。详细资源将公开发布。

英文摘要

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize each scenario from shared ground truth, leaving cross-stage causal dependencies unmodeled. We present LegalWorld, a life-cycle interactive environment that models Chinese civil litigation as a causally connected state chain of five stages (seven sub-scenarios), grounded in 75,309 paired Chinese civil judgments. We pair it with reusable infrastructure (local memory, global case memory, a Skill/Tool library) that keeps each dispute consistent across its full life cycle. Building on this environment, we construct LongJud-Bench to evaluate agent capability across all five connected stages. 18,992 ratings from 217 legal-background evaluators confirm that LegalWorld trajectories are procedurally faithful and role-consistent; and a capability-level cross-model evaluation reveals sharp divergences that aggregate scores cannot expose, with no single backbone leading across consultation, drafting, and courtroom advocacy. Detailed resources will be released publicly.

2606.18726 2026-06-18 cs.LG cs.AI 新提交

Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring

基于图锚定交叉注意力Transformer神经网络的预测过程监控中结构约束完整事件序列生成

Fang Wang, Ernesto Damiani

发表机构 * Department of Computer Science, University of Milan(米兰大学计算机科学系)

AI总结 提出图锚定交叉注意力Transformer(GGATN),通过全局过程图作为结构化记忆、Transformer自注意力编码序列位置、图锚定交叉注意力注入过程拓扑,结合维特比式图约束解码,一次性生成完整事件序列,在六个基准日志上优于LLM基线。

Comments 40 pages

详情
AI中文摘要

结构约束的事件序列生成仍然具有挑战性,因为生成的路径必须保持转移可行性、时间顺序、终止和属性一致性。在预测过程监控(PPM)中,这一挑战表现为完整事件序列生成,而现有工作主要处理子任务,如下一个活动、剩余时间、结果和属性预测。本文提出了图锚定交叉注意力Transformer神经网络(GGATN)用于这一统一的PPM任务。GGATN使用全局过程图作为结构化活动记忆,通过Transformer自注意力对序列位置进行上下文化,并通过图锚定交叉注意力注入过程拓扑。与自回归解码不同,GGATN一次性生成活动、时间戳、长度以及事件级和序列级属性,随后进行维特比风格的图约束解码以获得可行路径和显式终止。在六个基准事件日志上的实验表明,其生成质量优于局部指令提示的LLM基线。GGATN在序列相似性、Damerau-Levenshtein相似性、基于二元组的控制流相似性和持续时间分布方面取得了强劲性能,同时保持零幻觉活动和零序列级属性不一致。消融分析证实了全局图编码器作为稳定的结构先验。可解释性分析展示了图结构、序列上下文、反馈细化和约束解码如何塑造生成过程。

英文摘要

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this challenge appears as full event sequence generation, whereas existing work mainly addresses component tasks such as next activity, remaining time, outcome, and attribute prediction. This paper proposes the Graph Grounded Cross Attention Transformer Neural Network (GGATN) for this unified PPM task. GGATN uses a global process graph as structured activity memory, contextualizes sequence positions through Transformer self attention, and injects process topology through graph grounded cross attention. Unlike autoregressive decoding, GGATN generates activities, timestamps, length, and event level and sequence level attributes in a single pass, followed by Viterbi style graph constrained decoding for feasible paths and explicit termination. Experiments on six benchmark event logs show more reliable generation quality than local instruction prompted LLM baselines. GGATN achieves strong performance on sequence similarity, Damerau Levenshtein similarity, bigram based control flow similarity, and duration distribution, while maintaining zero hallucinated activities and zero sequence level attribute inconsistency. Ablation analyses confirm the global graph encoder as a stable structural prior. Interpretability analyses show how graph structure, sequence context, feedback refinement, and constrained decoding shape generation.
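The graph-constrained decoding step can be illustrated with a standard Viterbi search restricted to edges of a process graph. This is a generic sketch, not the GGATN decoder: the function name, the log-score inputs, and the toy activities are hypothetical, and timestamps and attributes are left out.

```python
import math


def constrained_viterbi(scores, allowed, start, end):
    """Best-scoring activity path that only uses edges of the process graph.

    scores: list over positions of {activity: log-score}.
    allowed: set of (previous_activity, next_activity) edges.
    """
    best = {a: (s if a == start else -math.inf) for a, s in scores[0].items()}
    back = []
    for step in scores[1:]:
        new_best, pointers = {}, {}
        for act, s in step.items():
            cands = [(best[p] + s, p) for p in best if (p, act) in allowed]
            new_best[act], pointers[act] = max(cands, default=(-math.inf, None))
        best = new_best
        back.append(pointers)
    if best.get(end, -math.inf) == -math.inf:
        return None                       # no feasible path terminates in `end`
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


scores = [
    {"register": 0.0, "review": -1.0, "pay": -2.0},
    {"register": -3.0, "review": -0.5, "pay": -0.1},  # raw argmax would pick "pay"
    {"register": -3.0, "review": -2.0, "pay": -0.1},
]
allowed = {("register", "review"), ("review", "review"), ("review", "pay")}
path = constrained_viterbi(scores, allowed, "register", "pay")
```

A per-position argmax would output register, pay, pay, which uses an edge the graph does not contain; the constrained search instead returns the feasible path through "review" and ends at the required terminal activity.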

2606.18723 2026-06-18 cs.CV cs.LG 新提交

Clinically Aligned Geometry Constraints for Robust IVUS Vessel Boundary Segmentation

临床对齐的几何约束用于鲁棒的IVUS血管边界分割

Yunshu Chen, Litao Yang, Giuseppe Di Giovanni, Jordan Tan, Deval Mehta, Andrew Lin, Derek Chew, Masasi Fujino, Julie Butters, Stephen Nicholls, Zongyuan Ge, Kyung Hoon Cho

发表机构 * AIM For Health Lab, Monash University(莫纳什大学AIM健康实验室) Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University(莫纳什大学信息技术学院数据科学与人工智能系) Monash University Victorian Heart Institute(莫纳什大学维多利亚心脏研究所) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) National Cerebral and Cardiovascular Center(国立循环器病研究中心) Department of Cardiology, Chonnam National University Hospital and Medical School(全南大学医院和医学院心脏病学系)

AI总结 提出GeoCat网络,通过双编码器与可微几何一致性损失,在IVUS分割中降低边界漂移和拓扑错误,提升临床几何测量精度。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

血管内超声(IVUS)管腔和外弹性膜(EEM)分割对于定量评估冠状动脉斑块负荷至关重要。管腔或EEM勾画的误差会直接传播到斑块面积、斑块负荷和几何测量中。然而,优先考虑重叠分数的标准方法常常遭受边界漂移和拓扑错误,导致临床测量不准确。我们提出GeoCat,一个几何一致性网络,使用双笛卡尔-极坐标编码器,结合跨域注意力和时间融合,处理5帧IVUS片段。可微的几何一致性损失直接监督临床相关描述符,包括直径、方向和横截面积。该模型在来自146名患者的12,242张标注帧上训练,这些帧使用两种商用IVUS系统采集。我们使用分割准确性和斑块相关临床指标评估性能,包括Dice/IoU、边界测量(95HD(mm)、ASSD)、拓扑违规率和临床几何误差(dmax/dmin、角度和面积)。在我们的数据集上,GeoCat实现了0.93的Dice,将95HD降低到0.14 mm,并将拓扑违规率降低到1.0%。重要的是,它显著提高了几何保真度,产生0.13-0.16 mm的直径误差和约8度的角度误差,支持可靠的斑块负荷量化。

英文摘要

Intravascular ultrasound (IVUS) lumen and external elastic membrane (EEM) segmentation is important for quantitative coronary plaque burden assessment. Errors in lumen or EEM delineation directly propagate to plaque area, plaque burden and geometric measurements. However, standard methods prioritising overlap scores often suffer from boundary drift and topology errors, leading to inaccurate clinical measurements. We present GeoCat, a geometry-consistent network that processes 5-frame IVUS clips using dual Cartesian-polar encoders with cross-domain attention and temporal fusion. A differentiable geometry consistency loss directly supervises clinically relevant descriptors including diameters, orientations, and cross-sectional areas. The model is trained on 12,242 annotated frames from 146 patients acquired with two commercial IVUS systems. We evaluate performance using both segmentation accuracy and plaque-relevant clinical metrics, including Dice/IoU, boundary measures (95HD in mm, ASSD), topology violation rate, and clinical geometry errors (dmax/dmin, angles, and areas). On our dataset, GeoCat achieves a Dice of 0.93, reduces 95HD to 0.14 mm, and lowers topology violations to 1.0%. Importantly, it significantly improves geometric fidelity, yielding diameter errors of 0.13-0.16 mm and angular errors of ~8 degrees, supporting reliable plaque burden quantification.
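To see why delineation errors propagate into clinical numbers, it helps to compute the quantities directly from masks. The sketch below uses the usual definitions of Dice overlap and plaque burden (plaque area divided by EEM area); the helper names and the toy 4x4 masks are illustrative assumptions and do not come from the paper's code.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two binary masks given as 2D lists of 0/1."""
    inter = sum(a & b for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb))
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    return 2.0 * inter / total if total else 1.0


def plaque_burden(lumen_mask, eem_mask, mm2_per_pixel):
    """Plaque burden (%) = (EEM area - lumen area) / EEM area * 100."""
    lumen_area = sum(map(sum, lumen_mask)) * mm2_per_pixel
    eem_area = sum(map(sum, eem_mask)) * mm2_per_pixel
    return 100.0 * (eem_area - lumen_area) / eem_area


eem = [[1, 1, 1, 1] for _ in range(4)]            # 16-pixel vessel cross-section
lumen = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
shifted = [[0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0]]

burden = plaque_burden(lumen, eem, mm2_per_pixel=0.01)
overlap = dice(lumen, shifted)                    # one-pixel boundary drift
```

Note that the shifted lumen has the same area, so the burden is unchanged while the boundary is clearly wrong; this is one reason overlap and area metrics are reported alongside boundary and geometry errors.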

2606.18721 2026-06-18 cs.CV 新提交

Rethinking the Pointer Loss in Table Structure Recognition: Geometry-Aware Pointer Loss for Spatial Locality

重新思考表格结构识别中的指针损失:面向空间局部性的几何感知指针损失

Hong-Jun Choi, Jongho Lee, Jaeyoung Kim

发表机构 * Teamreboott Inc.(Teamreboott公司)

AI总结 针对指针网络在表格结构识别中相邻单元格错误占79.6%的问题,提出几何感知指针损失,通过反距离加权重写交叉熵目标,聚焦邻近单元格梯度,在不增加推理成本下提升性能。

详情
AI中文摘要

使用指针网络的表格结构识别(TSR)通过预测HTML序列同时将标签与检测到的文本(或单元格)区域对齐,取得了令人印象深刻的结果。然而,我们的分析揭示,当指针网络失败时,79.6%的错误发生在空间相邻的单元格之间(曼哈顿距离<=2)。尽管如此,标准交叉熵损失对所有负候选样本赋予相同权重。在这项工作中,我们提出了几何感知指针(GAP)损失,它根据与真实值的空间邻近性重新加权交叉熵目标。通过应用反距离加权,GAP将梯度流集中在模型最困难的区域:相邻单元格比远处单元格获得更强的梯度。我们的方法仅需对损失计算进行简单修改,保持相同的模型架构且零额外推理成本。在PubTabNet和SynthTabNet上的大量实验表明,GAP持续减少相邻单元格错误,达到了新的最先进性能。我们的发现表明,在损失层面融入几何归纳偏置为鲁棒TSR提供了一种简单而有效的方法。我们的代码可在以下网址获取:https://github.com/teamreboott/GAP

英文摘要

Table Structure Recognition (TSR) using a pointer network achieves impressive results by predicting HTML sequences while aligning tags to detected text (or cell) regions. However, our analysis reveals that when pointer networks fail, 79.6% of errors occur between spatially adjacent cells (Manhattan distance <= 2). Despite this, standard cross-entropy loss weights all negative candidates equally. In this work, we propose Geometry-Aware Pointer (GAP) Loss, which reweights the cross-entropy objective based on spatial proximity to ground truth. By applying inverse distance weighting, GAP focuses gradient flow where the model struggles most: immediate neighbors receive stronger gradients than distant cells. Our approach requires only a straightforward modification to the loss computation, maintaining the same model architecture with zero additional inference cost. Extensive experiments on PubTabNet and SynthTabNet demonstrate that GAP consistently reduces adjacent-cell errors, achieving new state-of-the-art performance. Our findings suggest that incorporating geometric inductive biases at the loss level provides a simple yet effective approach to robust TSR. Our code is available at https://github.com/teamreboott/GAP
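One way to picture an inverse-distance reweighting of the pointer cross-entropy is shown below. The exact formulation in the paper is not given in the abstract, so the weight `1 + alpha / distance` applied to negative candidates inside the log-partition term is an assumption chosen for illustration, as are the function and parameter names. Candidate cells are assumed to have distinct grid positions.

```python
import math


def gap_loss(logits, cells, target, alpha=1.0):
    """Cross-entropy with negatives reweighted by inverse Manhattan distance.

    logits: one score per candidate cell; cells: (row, col) per candidate;
    target: index of the ground-truth cell.
    """
    tr, tc = cells[target]
    weights = []
    for i, (r, c) in enumerate(cells):
        dist = abs(r - tr) + abs(c - tc)
        weights.append(1.0 if i == target else 1.0 + alpha / dist)
    # Weighted log-partition: nearby negatives count more in the denominator,
    # so their logits receive larger gradients than distant ones.
    m = max(logits)
    log_z = m + math.log(sum(w * math.exp(z - m) for w, z in zip(weights, logits)))
    return log_z - logits[target]


logits = [2.0, 1.0, 1.0]
cells = [(0, 0), (0, 1), (3, 3)]     # target, adjacent cell, distant cell
plain = gap_loss(logits, cells, target=0, alpha=0.0)
weighted = gap_loss(logits, cells, target=0, alpha=1.0)
```

Setting `alpha=0.0` gives ordinary cross-entropy, and the change touches only the training loss, which is consistent with the abstract's claim of zero additional inference cost.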

2606.18717 2026-06-18 cs.CL cs.AI 新提交

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus: 一种面向土耳其语的形态感知神经分词器和词嵌入器

Tolga Şakar

发表机构 * Independent Researcher(独立研究者)

AI总结 针对土耳其语粘着特性,提出Morpheus神经词素边界模型,实现无损可逆分词与结构化词嵌入,在可逆分词器中达到最低比特每字符(1.425),词素对齐F1提升至0.61,GPU内存节省约19%。

详情
AI中文摘要

土耳其语是粘着语:意义由词素承载,然而驱动现代语言模型的子词分词器根据语料库统计分割单词,切碎了承载语义的后缀,并且在WordPiece和基于规则的分析器的情况下,无法将其输出解码回原始文本。本文提出Morpheus,一个面向土耳其语的神经词素边界模型,它同时是一个无损的、形态感知的分词器和一个词嵌入生成器。一个可微的泊松-二项式动态规划程序在训练期间将每个字符的边界概率转化为软词素隶属度,在推理时转化为精确的片段,无需字符串归一化,因此$\mathrm{decode}(\mathrm{encode}(w)) = w$由构造保证。由于该模型是神经模型,相同的正向传播在分词的同时也输出结构化的词嵌入。在可逆分词器中——唯一适用于生成的分词器——Morpheus达到了最低的比特每字符(1.425),将子词家族的金标准词素对齐大致翻倍(MorphScore宏F1从约0.32提升至0.61),并且相比64K词汇量的子词分词器节省了约19%的GPU内存。作为嵌入器,冻结的Morpheus向量在词汇检索(根家族MAP 0.85)和同根验证(ROC-AUC 1.00)上领先,超越了多语言检索器BGE-M3和BERTurk;在上下文和屈折依赖的任务(NER、格/数探测)上,更重的上下文编码器仍然领先——我们将这一权衡归因于Morpheus以词根为中心的几何结构。代码:https://github.com/lonewolf-rd/TurkishMorpheus ;模型:https://huggingface.co/lonewolflab/Morpheus-TR-50K ;交互演示:https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo

English abstract

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
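The abstract does not spell out the dynamic program, so the sketch below shows one plausible reading. It assumes the segment index of a character equals the number of boundaries before it, which follows a Poisson-binomial distribution over the per-gap boundary probabilities. The function names, the 0.5 threshold, and the example probabilities are hypothetical; nothing here uses the released Morpheus code or model.

```python
def soft_memberships(boundary_probs):
    """rows[t][k] = P(character t belongs to segment k).

    boundary_probs[i] is the probability of a morpheme boundary between
    character i and character i + 1, so a word of n characters has n - 1 entries.
    """
    n = len(boundary_probs) + 1
    dist = [1.0] + [0.0] * (n - 1)  # character 0 is always in segment 0
    rows = [dist]
    for p in boundary_probs:
        nxt = [0.0] * n
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # no boundary: stay in segment k
            if k + 1 < n:
                nxt[k + 1] += mass * p   # boundary: advance to segment k + 1
        rows.append(nxt)
        dist = nxt
    return rows

def encode(word, boundary_probs, threshold=0.5):
    """Hard segmentation: cut wherever the boundary probability exceeds the threshold."""
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p > threshold:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments

def decode(segments):
    """Concatenation only, with no normalization, so the round trip is exact."""
    return "".join(segments)

word = "evlerde"  # ev + ler + de
probs = [0.1, 0.9, 0.1, 0.2, 0.8, 0.1]  # made-up boundary probabilities
print(encode(word, probs))
print(soft_memberships([0.5, 0.5])[2])
```

Because `encode` only slices the input string and `decode` only concatenates, `decode(encode(w)) == w` holds for any probabilities. The soft rows consist of sums and products of the probabilities, so gradients can flow through them when the same recurrence is written in an autodiff framework.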

2606.18709 2026-06-18 cs.CL New submission

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Han Chen, Ming Li, Chenguang Wang, Yijun Liang, Dawei Zhou, Hong Jiao, Tianyi Zhou

Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); University of Maryland; Virginia Tech

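AI summary: This study evaluates how well 42 LLMs predict item discrimination in zero-shot settings. Direct prediction correlates only weakly with human-calibrated discrimination (best Spearman 0.152), and response-based CTT calibration yields a limited correlation (0.241), indicating that LLMs cannot yet reliably capture item discrimination.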

English abstract

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing works have explored whether large language models (LLMs) can estimate item difficulty, it remains unclear whether they can capture item discrimination. In this work, we evaluate 42 proprietary and open-weight LLMs in zero-shot settings using two complementary approaches: direct discrimination prediction, where models explicitly estimate an item's discrimination value from its content, and response-based Classical Test Theory (CTT) calibration, where LLM answers are treated as synthetic student responses to compute discrimination scores. Our results show that direct prediction yields weak alignment with human-calibrated discrimination: the best-performing model reaches only a Spearman correlation of 0.152. Response-based CTT calibration provides a stronger but still limited signal, with the all-persona synthetic respondent pool reaching a Spearman correlation of 0.241. These findings highlight item discrimination as an open challenge for LLM-based psychometric evaluation: current LLMs contain non-random discrimination-relevant signal, but they do not yet reliably capture how assessment items distinguish human students.
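The response-based pipeline can be sketched as follows. The abstract does not say which CTT index is used, so this assumes the corrected item-total (point-biserial) correlation, a common CTT discrimination statistic, and then compares two sets of discrimination values with a Spearman correlation. All function names and the toy response matrix are hypothetical, and returning 0.0 for a zero-variance item is a convention chosen here.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def item_discrimination(responses):
    """Corrected item-total correlation for each item.

    responses[s][i] is 1 if respondent s answered item i correctly, else 0.
    Each item is correlated with the total score on the remaining items.
    """
    totals = [sum(row) for row in responses]
    result = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]
        result.append(pearson(item, rest))
    return result

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy matrix: 4 synthetic respondents (rows) by 3 items (columns).
synthetic = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
print(item_discrimination(synthetic))
```

In the study's setting, each row would be one LLM (or persona) treated as a synthetic student, and `spearman` would compare the resulting item values against human-calibrated discrimination. The rest score is used instead of the full total so that an item is not correlated with itself.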