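arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.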

2605.09253 2026-06-02 cs.CL cs.AI

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

Yuxuan Jiang, Runchao Li, Shubhashis Roy Dipta, Dawei Li, Zhao Yang

Affiliations: University of Maryland, Baltimore County; Case Western Reserve University; Arizona State University; VU Amsterdam

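AI summary: This paper studies "Rock Tokens," tokens whose loss stays persistently high during on-policy distillation. It finds that they account for a large share of the gradient yet contribute little functionally, and argues that bypassing them can streamline alignment.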

Abstract

While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionately drives reasoning gains, an analogous token-level understanding of On-Policy Distillation (OPD) remains largely unexplored. In this work, we investigate high-loss tokens. As the most direct signal of student-teacher mismatch under OPD's per-token KL objective, these tokens should, according to existing studies, progressively diminish as training converges; however, our empirical analysis shows otherwise. Even after OPD training reaches apparent saturation, a substantial subset of tokens continues to exhibit persistently high loss; these tokens, which we term Rock Tokens, can account for up to 18% of the tokens in generated outputs. Our investigation reveals two startling paradoxes. First, although their high frequency gives them a disproportionately large share of the total gradient norm, Rock Tokens themselves remain stagnant throughout training, resisting teacher-driven corrections. Second, through causal intervention, we find that these tokens contribute negligibly to the model's actual reasoning performance. These findings suggest that a vast amount of optimization bandwidth is spent on structural and discourse residuals that the student model cannot or need not internalize. By deconstructing these dynamics, we demonstrate that strategically bypassing these "stumbling blocks" can significantly streamline the alignment process, challenging the necessity of uniform token weighting and offering a more efficient paradigm for large-scale model distillation.
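
The per-token KL objective and the idea of flagging high-loss tokens can be illustrated with a small sketch. The KL direction (student relative to teacher), the fixed threshold, and all function names are assumptions made for illustration; the abstract does not specify them. The sketch flags tokens at a single checkpoint, whereas identifying Rock Tokens would additionally require checking that the flag persists across training.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position; inputs have shape [T, V]."""
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return (np.exp(ls) * (ls - lt)).sum(axis=-1)

def high_loss_mask(kl, threshold=0.5):
    # Hypothetical criterion: a fixed threshold on the per-token KL.
    return kl > threshold

# Toy logits: the teacher is uniform everywhere; the student disagrees at position 2 only.
teacher = np.zeros((6, 10))
student = teacher.copy()
student[2, 0] = 10.0

kl = per_token_kl(student, teacher)
mask = high_loss_mask(kl)
print(kl.round(3), mask)
```

Tracking `mask` over successive checkpoints would give one possible way to separate tokens whose loss decays from those that stay high.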

2605.09098 2026-06-02 cs.CL

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

Luke Zhang, Justin Vasselli, Aditya Khan, York Hay Ng, En-Shiun Annie Lee

Affiliations: University of Toronto, Canada; Nara Institute of Science and Technology, Japan; Ontario Tech University, Canada

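AI summary: Proposes the Dynamic Meta-Metrics (DMM) framework, which improves machine translation evaluation by combining existing metrics conditioned on the source sentence. Experiments show that an MLP combination outperforms linear and Gaussian-process ensembles, and a soft-conditioning extension brings further gains.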

Comments 5 pages, ACL SRW 2026

2605.08398 2026-06-02 cs.LG cs.CV

Exploring and Exploiting Stability in Latent Flow Matching

Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim

Affiliations: University of California, Berkeley

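AI summary: This paper shows that latent flow-matching models are robust to data reduction and model-capacity shrinkage, and exploits this stability to derive more efficient training and inference algorithms, including data savings and a more than two-fold inference speedup.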

Comments Accepted at ICML 2026

Abstract

In this work, we show that Latent Flow-Matching (LFM) models are robust to different types of perturbations, including data reduction and model capacity shrinkage. We characterize this stability by these models' tendency to generate similar outputs under identical noise seeds. We provide a perspective relating this phenomenon to flow matching theory, which indicates that this stability is inherent to the FM objective. We further exploit this stability to derive practical algorithms for more efficient training and inference. Concretely, first, we show that by training LFM models on significantly reduced datasets, performance is preserved, and in compute-constrained regimes, the model converges faster while maintaining quality. This yields multiple advantages, including savings in the training time due to faster convergence, and alleviating annotation effort when training conditional models. Second, LFM stability under architectural shrinkage gives rise to a two-model coarse-to-fine approach, one using a light-weight architecture for the first phase of the FM trajectory, and one with higher capacity for the second, thereby reducing the inference cost substantially. To determine which samples are informative, we introduce three sample-scoring criteria and evaluate them under standard metrics for generative models. Our results are thoroughly evaluated on multiple datasets, demonstrating the practical advantage of this stability, including data savings and a more than two-fold inference speedup while generating comparable outputs.
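
The two-model coarse-to-fine idea can be sketched as an Euler sampler that hands the trajectory from a small velocity model to a larger one partway through. The switch point, the step count, and the toy velocity functions below are illustrative assumptions; the abstract does not give these details.

```python
import numpy as np

def coarse_to_fine_sample(x0, v_small, v_large, n_steps=10, t_switch=0.5):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1, swapping models at t_switch."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    switch_step = int(round(t_switch * n_steps))
    calls = {"small": 0, "large": 0}
    for i in range(n_steps):
        t = i * dt
        if i < switch_step:
            x = x + dt * v_small(x, t)   # first phase: lightweight model
            calls["small"] += 1
        else:
            x = x + dt * v_large(x, t)   # second phase: higher-capacity model
            calls["large"] += 1
    return x, calls

# Toy stand-ins for the two networks: both push the sample toward a fixed target.
target = np.array([1.0, -2.0])
v_small = lambda x, t: target - x
v_large = lambda x, t: (target - x) / max(1.0 - t, 1e-3)  # straight-line field to the target

sample, calls = coarse_to_fine_sample(np.zeros(2), v_small, v_large)
print(sample, calls)
```

In this sketch half of the function evaluations go to the cheaper model, which is where an inference saving of the kind the abstract describes would come from.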

2605.07971 2026-06-02 cs.CV cs.LG

Utility-Preserving De-Identification in Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

Zhuqian Zhou, Kirk Vanacore, Bakhtawar Ahtisham, Jinsook Lee, Doug Pietrzak, Daryl Hedley, Jorge Dias, Chris Shaw, Ruth Schäfer, René F. Kizilcec

Affiliations: University of Washington

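AI summary: Numeric expressions in math tutoring dialogues resemble identifiers and cause over-redaction during de-identification. To address this, the paper introduces the MathEd-PII benchmark dataset and uses domain-aware prompting strategies (F1 up to 0.821) to detect PII effectively while preserving data utility.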

Abstract

Large-scale sharing of dialogue data is key to advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce data utility. This work asks how to detect PII while preserving educational utility, focusing on this "numeric ambiguity" problem. We introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, built with human-in-the-loop LLM annotation. Using density-based segmentation, we show that false PII redactions cluster in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and three LLM-based approaches with basic, math-aware, and segment-aware prompting. Domain-aware prompting, including both math-aware (F1: 0.802) and segment-aware versions (F1: 0.821), substantially outperforms the baseline (F1: 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
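
The "numeric ambiguity" problem is easy to reproduce with a deliberately naive pattern. The regular expression below is a hypothetical stand-in for a generic date rule; it is not Presidio's recognizer and not the paper's method. It only shows why fractions in a tutoring turn can be mistaken for dates.

```python
import re

# Hypothetical, deliberately naive date pattern (e.g. 10/12 or 10/12/2024).
DATE_LIKE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")

def naive_redact(text):
    """Replace every date-like match with a placeholder."""
    return DATE_LIKE.sub("[DATE]", text)

tutor_turn = "So 3/4 plus 1/4 equals 1, right?"
scheduling_turn = "Our next session is on 10/12/2024."

print(naive_redact(tutor_turn))       # the fractions are redacted as if they were dates
print(naive_redact(scheduling_turn))  # the genuine date is redacted as intended
```

Both turns trigger the same rule, so a detector without domain context cannot tell instructional content from an identifier.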

2602.04672 2026-06-02 cs.CV cs.GR cs.RO

AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen

Affiliations: State Key Lab of CAD & CG, Zhejiang University; Zhejiang University of Technology

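AI summary: Proposes the AGILE framework, which uses a vision-language model to guide the generation of a complete object mesh and combines an anchor-and-track strategy with contact-aware optimization to robustly reconstruct hand-object interactions from monocular video, producing simulation-ready assets.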

Comments 16 pages, SIGGRAPH 2026

Abstract

Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, ARCTIC, and in-the-wild videos show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior methods frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications. Project page: https://agile-hoi.github.io.

2411.13109 2026-06-02 cs.RO

Special Unitary Parameterized Estimators of Rotation

Akshay Chandrasekhar

Affiliations: University of California, Berkeley

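AI summary: This paper revisits the rotation estimation problem through special unitary matrices, proposes two new continuous representations for learning rotations in neural networks, and validates them experimentally.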

Comments Published at ICLR 2026; clarified paper contribution and theoretical narrative; 33 pages
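
As background for the link between special unitary matrices and rotations, the following sketch applies the standard adjoint map from SU(2) to SO(3). It is a textbook construction, not one of the paper's two proposed representations, and the function names are illustrative.

```python
import numpy as np

# Pauli matrices, used as a basis for the adjoint action.
PAULI = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def su2_from_complex(a, b):
    """Build U = [[a, -conj(b)], [b, conj(a)]] after normalizing |a|^2 + |b|^2 = 1."""
    n = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
    a, b = a / n, b / n
    return np.array([[a, -np.conj(b)], [b, np.conj(a)]])

def so3_from_su2(U):
    """Adjoint map: R[i, j] = 0.5 * Re tr(sigma_i U sigma_j U^dagger)."""
    Ud = U.conj().T
    return np.array(
        [[0.5 * np.trace(si @ U @ sj @ Ud).real for sj in PAULI] for si in PAULI]
    )

R = so3_from_su2(su2_from_complex(1 + 2j, 0.5 - 1j))
print(np.round(R, 3))
```

Any pair of complex numbers, once normalized, yields a valid rotation matrix, which is one reason unconstrained network outputs can be mapped to rotations through SU(2).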

2605.05427 2026-06-02 cs.AI

When AI Reviews Science: Can We Trust the Referee?

Jialiang Wang, Yuchen Liu, Hang Xu, Kaichun Hu, Shimin Di, Wangze Ni, Linan Yue, Min-Ling Zhang, Kui Ren, Lei Chen

Affiliations: School of Electronic Engineering, Southeast University; Zhejiang University

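AI summary: Addressing the security and reliability of AI peer review, this paper builds a taxonomy of attack types and experimentally measures the effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores, providing a baseline for assessing the reliability of AI peer review.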

DOI: 10.59717/j.xinn-inform.2026.100030
Journal ref: The Innovation Informatics 2:100030 (2026)

Abstract

The volume of scientific submissions continues to climb, outpacing the capacity of qualified human referees and stretching editorial timelines. At the same time, modern large language models (LLMs) offer impressive capabilities in summarization, fact checking, and literature triage, making the integration of AI into peer review increasingly attractive and, in practice, unavoidable. Yet early deployments and informal adoption have exposed acute failure modes. Recent incidents have revealed that hidden prompt injections embedded in manuscripts can steer LLM-generated reviews toward unjustifiably positive judgments. Complementary studies have also demonstrated brittleness to adversarial phrasing, authority and length biases, and hallucinated claims. These episodes raise a central question for scholarly communication: when AI reviews science, can we trust the AI referee? This paper provides a security- and reliability-centered analysis of AI peer review. We map attacks across the review lifecycle: training and data retrieval, desk review, deep review, rebuttal, and the system level. We instantiate this taxonomy with four treatment-control probes on a stratified set of ICLR 2025 submissions, using two advanced LLM-based referees to isolate the causal effects of prestige framing, assertion strength, rebuttal sycophancy, and contextual poisoning on review scores. Together, this taxonomy and experimental audit provide an evidence-based baseline for assessing and tracking the reliability of AI peer review and highlight concrete failure points to guide targeted, testable mitigations.

2604.22896 2026-06-02 cs.RO cs.LG

Magnetic Indoor Localization through CNN Regression and Rotation Invariance

Helge Rosé, Konstantin Klipp, Tom Koubek, Bernd Schäufele, Ilja Radusch

Affiliations: University of Freiburg

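AI summary: Proposes training lightweight CNN models on rotation-invariant features (magnetic field norm and projection onto the gravity axis) for indoor localization without orientation calibration, matching or exceeding state-of-the-art accuracy on the MagPie dataset.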

Comments Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)

DOI: 10.1109/ICMCR69541.2026.11533953

Abstract

Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation-invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My, Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation-invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one-third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation-invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.
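
The two rotation-invariant features can be checked numerically. The sketch below computes Mn and Mg from synthetic magnetometer and gravity vectors and confirms that they do not change when the device frame is rotated. The array layout, the synthetic data, and the function names are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def invariant_features(mag, grav):
    """Return (Mn, Mg) per sample: field norm and projection onto the unit gravity axis."""
    mag, grav = np.asarray(mag, dtype=float), np.asarray(grav, dtype=float)
    g_unit = grav / np.linalg.norm(grav, axis=-1, keepdims=True)
    mn = np.linalg.norm(mag, axis=-1)
    mg = (mag * g_unit).sum(axis=-1)
    return np.stack([mn, mg], axis=-1)

def random_rotation(rng):
    """Random proper rotation built from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]   # flip one column so that det(q) = +1
    return q

rng = np.random.default_rng(0)
mag = rng.normal(size=(5, 3))               # synthetic magnetometer samples (device frame)
grav = np.tile([0.0, 0.0, 9.81], (5, 1))    # synthetic gravity vectors (device frame)
R = random_rotation(rng)

feats = invariant_features(mag, grav)
feats_rot = invariant_features(mag @ R.T, grav @ R.T)  # same samples after rotating the device
print(np.abs(feats - feats_rot).max())
```

Reorienting the device rotates the magnetic and gravity vectors together, so the norm and their inner product are unchanged; the raw (Mx, My, Mz) components are not.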

2604.07967 2026-06-02 cs.CL cs.AI

AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification

Hongyi Cen, Mingxin Wang, Yule Liu, Jingyi Zheng, Hanze Jia, Tan Tang

Affiliations: Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou)

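AI summary: Proposes the AtomEval protocol, which uses atomic decomposition and a preservation gate to separate valid verifier evasion from proposition-changing rewrites, and introduces the VASR metric to address inflation of conventional ASR.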

Abstract

Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR) can be inflated when rewrites change, weaken, or correct the false proposition they are supposed to preserve. We introduce AtomEval, a validity-aware evaluation protocol for fixed-evidence adversarial claim rewriting. AtomEval represents claims as subject-relation-object-modifier (SROM) atoms, applies a one-way preservation gate to separate valid verifier evasion from proposition-changing rewrites, and reports validity-aware attack success rate (VASR), which counts only verifier-evasive rewrites that preserve the original false proposition. AtomEval further provides fine-grained diagnostics that explain both proposition-level failures and non-minimal valid rewrites. On FEVER refuted-claim rewriting, AtomEval exposes and explains ASR inflation: many apparent attacks fool the verifier by altering, weakening, or correcting the proposition they should preserve. By making attacked-proposition preservation explicit and measurable, AtomEval provides a stable evaluation target for adversarial rewriters that must balance verifier evasion with proposition preservation.
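
The gap between ASR and VASR can be made concrete with a few lines. In this sketch each rewrite is reduced to two booleans, and both rates are taken over all rewrites; that denominator, the record format, and the toy outcomes are assumptions for illustration, not the paper's exact definitions or results.

```python
def attack_rates(records):
    """records: iterable of (evaded_verifier, preserved_proposition) booleans."""
    records = list(records)
    n = len(records)
    # Conventional ASR counts every rewrite that fools the verifier.
    asr = sum(evaded for evaded, _ in records) / n
    # VASR counts only evasive rewrites that also pass the preservation gate.
    vasr = sum(evaded and preserved for evaded, preserved in records) / n
    return asr, vasr

# Synthetic outcomes for four rewrites (not results from the paper).
toy = [(True, True), (True, False), (True, False), (False, True)]
asr, vasr = attack_rates(toy)
print(asr, vasr)  # 0.75 0.25
```

Two of the three apparent successes changed the proposition, so they inflate ASR without being valid attacks.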

2602.17513 2026-06-02 cs.CL

Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Baris Karacan, Barbara Di Eugenio, Patrick Thornton

Affiliations: Massachusetts Institute of Technology

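AI summary: By curating an obstetrics notes dataset, evaluating transformer-based supervised models, and conducting the first comparison against zero-shot large language models, the paper finds that supervised models perform strongly in-domain but degrade substantially out-of-domain, whereas zero-shot models adapt robustly out-of-domain once hallucinations are corrected.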

Comments 14 pages. Camera-ready version accepted at LREC 2026; includes minor revisions and an appendix. To appear in the conference proceedings

DOI: 10.63317/4ktoypuohtci
Journal ref: Proceedings of the 2026 Language Resources and Evaluation Conference (LREC 2026), pages 2594-2607, Palma, Spain. ELRA 2026

Abstract

Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

2604.20308 2026-06-02 cs.LG

Sheaf Neural Networks on SPD Manifolds: Second-Order Geometric Representation Learning

Yuhan Peng, Junwen Dong, Yuzhi Zeng, Hao Li, Ce Ju, Huitao Feng, Diaaeldin Taha, Anna Wienhard, Kelin Xia

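AI summary: To overcome the limits that the linear structure of Euclidean space imposes on graph neural networks, the paper proposes the first sheaf neural network operating on the manifold of symmetric positive definite matrices, using its Lie group structure to define sheaf operators for second-order geometric representation learning, and achieves state-of-the-art results on 6 of 7 MoleculeNet benchmarks.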

Abstract

Graph neural networks face two fundamental challenges rooted in the linear structure of Euclidean vector spaces: (1) Current architectures represent geometry through vectors (directions, gradients), yet many tasks require matrix-valued representations that capture relationships between directions, such as how atomic orientations covary in a molecule. These second-order representations are naturally captured by points on the manifold of symmetric positive definite (SPD) matrices; (2) Standard message passing applies shared transformations across edges. Sheaf neural networks address this via edge-specific transformations, but existing formulations remain confined to vector spaces and therefore cannot propagate matrix-valued features. We address both challenges by developing the first sheaf neural network that operates natively on the SPD manifold. Our key insight is that the SPD manifold admits a Lie group structure, enabling well-posed analogs of sheaf operators without projecting to Euclidean space. Theoretically, we prove that SPD-valued sheaves are strictly more expressive than Euclidean sheaves: they admit consistent configurations (global sections) that vector-valued sheaves cannot represent, directly translating to richer learned representations. Empirically, our sheaf convolution effectively transforms rank-1 directional inputs into full-rank matrices encoding local geometric structure. Our dual-stream architecture achieves SOTA on 6/7 MoleculeNet benchmarks, with the sheaf framework providing consistent depth robustness.
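
One well-known Lie group structure on SPD matrices is the Log-Euclidean one, where the group operation is exp(log A + log B). The abstract does not say which structure the paper uses, so the sketch below is an assumption chosen only to show that such an operation stays on the SPD manifold without any projection step.

```python
import numpy as np

def spd_log(a):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.log(w)) @ v.T

def sym_exp(s):
    """Matrix exponential of a symmetric matrix; the result is always SPD."""
    w, v = np.linalg.eigh(s)
    return (v * np.exp(w)) @ v.T

def log_euclidean_product(a, b):
    """Group operation exp(log a + log b): commutative, with the identity matrix as identity."""
    return sym_exp(spd_log(a) + spd_log(b))

def random_spd(rng, n=3):
    m = rng.normal(size=(n, n))
    return m @ m.T + n * np.eye(n)   # shifted Gram matrix, hence SPD

rng = np.random.default_rng(0)
A, B = random_spd(rng), random_spd(rng)
C = log_euclidean_product(A, B)
print(np.linalg.eigvalsh(C))   # all eigenvalues are positive
```

The ordinary matrix product of two SPD matrices is generally not symmetric, which is why a dedicated group operation of this kind is needed.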

2604.19786 2026-06-02 cs.CL

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

Edward Ajayi, Prasenjit Mitra

Affiliations: Carnegie Mellon University Africa

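AI summary: Proposes HumorRank, a tournament-based framework that ranks textual humor generation through theory-grounded pairwise preference judgments and produces a global leaderboard via Bradley-Terry estimation.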

Abstract

Humor remains difficult to evaluate in large language models (LLMs) because what makes a response funny is subjective, comparative, and shaped by interacting comedic mechanisms rather than a single scalar property. Existing humor evaluation protocols therefore tend to produce isolated scores or task-specific judgments that are difficult to compare across models. We introduce HumorRank, a tournament-based framework for ranking textual humor generation through theory-grounded pairwise preference judgments. Across SemEval-2026 MWAHAHA and Humor Transfer Bench, HumorRank evaluates nine proprietary, open-weight, and specialized models using LLM-based comparative judgments informed by the General Theory of Verbal Humor (GTVH), with tournament aggregation yielding global rankings via Bradley-Terry estimation. The resulting rankings are cross-judge stable: independent Llama and Qwen LLM judges achieve Kendall τ = 0.889 on both benchmarks. The leaderboard reveals clear model stratification, showing that strong humor generation depends not only on scale but on mastery of comedic mechanisms such as incongruity, conciseness, escalation, and absurdity. HumorRank provides a scalable and interpretable methodology for benchmarking LLM-generated humor without relying solely on isolated automatic metrics or limited human evaluation.
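
Turning pairwise judgments into a global ranking with Bradley-Terry estimation can be sketched with the classic minorization-maximization update. The win matrix below is synthetic and the three "models" are hypothetical; the paper's actual fitting procedure and data are not given in the abstract.

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of times item i beat item j. Returns strengths summing to 1."""
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T                  # comparisons played by each pair
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom
        p = p / p.sum()
    return p

# Synthetic round-robin among three hypothetical models, ten judgments per pair.
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]])
strengths = bradley_terry(wins)
ranking = np.argsort(-strengths)
print(strengths.round(3), ranking)
```

Running the same procedure with judgments from two different judges and comparing the resulting orderings is the kind of cross-judge check that a rank correlation such as Kendall's τ summarizes.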

2603.15956 2026-06-02 cs.RO cs.AI

FlowC2S: Flowing from Current to Succeeding Frames for Fast and Memory-Efficient Video Continuation

Hovhannes Margaryan, Quentin Bammey, Christian Sandor

Affiliations: Team ARAI, Université Paris-Saclay, CNRS, LISN, France; LTCI, Télécom Paris, Institut Polytechnique de Paris, France

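AI summary: Proposes FlowC2S, which fine-tunes a pre-trained text-to-video flow model to learn the vector field between the current and succeeding video chunks, using inherent optimal couplings and target inversion for fast, memory-efficient video continuation.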

Abstract

This paper introduces a novel methodology for generating fast and memory-efficient video continuations. Our method, dubbed FlowC2S, fine-tunes a pre-trained text-to-video flow model to learn a vector field between the current and succeeding video chunks. Two design choices are key. First, we introduce inherent optimal couplings, utilizing temporally adjacent video chunks during training as a practical proxy for true optimal couplings, resulting in straighter flows. Second, we incorporate target inversion, injecting the inverted latent of the target chunk into the input representation to strengthen correspondences and improve visual fidelity. By flowing directly from current to succeeding frames, instead of the common combination of current frames with noise to generate a video continuation, we reduce the dimensionality of the model input by a factor of two. The proposed method, fine-tuned from LTXV and Wan, surpasses the state-of-the-art scores across quantitative evaluations with FID and FVD, with as few as five neural function evaluations.
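
The idea of flowing directly from the current chunk to the succeeding one, rather than from noise, can be sketched with a linear interpolation path and a few Euler steps. The linear path, the toy latents, and the oracle velocity function are assumptions for illustration; target inversion and the actual network are not modeled here.

```python
import numpy as np

def interpolate(x_cur, x_next, t):
    """Linear path x_t = (1 - t) * x_cur + t * x_next and its constant target velocity."""
    x_t = (1.0 - t) * x_cur + t * x_next
    return x_t, x_next - x_cur

def euler_continue(x_cur, velocity_fn, n_steps=5):
    """Integrate from the current chunk's latent (t=0) to a predicted next chunk (t=1)."""
    x = x_cur.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)   # one function evaluation per step
    return x

rng = np.random.default_rng(0)
x_cur = rng.normal(size=(4, 8))    # toy latent of the current chunk
x_next = rng.normal(size=(4, 8))   # toy latent of the temporally adjacent next chunk

_, v_target = interpolate(x_cur, x_next, 0.3)
oracle = lambda x, t: v_target     # stand-in for a perfectly trained velocity network
pred = euler_continue(x_cur, oracle)
print(np.abs(pred - x_next).max())
```

When the learned flow is close to straight, as in this idealized case, a handful of Euler steps already lands on the target, which is consistent with the small number of function evaluations the abstract reports.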
