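arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.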

2606.13196 2026-06-12 cs.AI cs.CY New submission

Under What Conditions Can a Machine Become Genuinely Creative?

Yong Zeng

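Affiliations: Concordia University

AI summary: Drawing on Designics theory, the paper proposes ten requirements for genuine machine creativity, illustrates their computational tractability through example studies, and argues that current generative AI systems do not by themselves establish genuine creativity.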

Abstract

Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can become genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

2606.13194 2026-06-12 cs.LG New submission

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger

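Affiliations: Karlsruhe Institute of Technology; IPAI Foundation gGmbH

AI summary: To address the comparability crisis in wearable human activity recognition, the authors build a large-scale benchmark of 30 datasets and evaluate 17 architectures. Predictive performance is found to have plateaued, while compact neural models and Random Forests define the Pareto frontier for deployment efficiency.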

Comments: 20 pages, 9 Figures, 3 Tables

Abstract

Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
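The Pareto-frontier claim in the abstract above can be made concrete with a small sketch. It assumes just two axes (macro-F1, higher is better; latency, lower is better), whereas the paper also measures peak memory and model size. The model names and numbers are placeholders invented for illustration and are not results from the paper.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    macro_f1: float    # higher is better
    latency_ms: float  # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.macro_f1 >= b.macro_f1 and a.latency_ms <= b.latency_ms
    strictly_better = a.macro_f1 > b.macro_f1 or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Placeholder values for illustration only; NOT measurements from the paper.
models = [
    Candidate("compact_net", 0.80, 2.0),
    Candidate("large_recurrent", 0.80, 40.0),
    Candidate("tree_ensemble", 0.76, 0.5),
    Candidate("hybrid", 0.78, 25.0),
]
frontier = pareto_frontier(models)
print([c.name for c in frontier])
```

In this toy setup the larger models are dominated because a cheaper model matches or beats their score, which mirrors the qualitative pattern the abstract describes.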

2606.13192 2026-06-12 cs.AI New submission

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng, Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao

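Affiliations: Ant Group

AI summary: Proposes UXBench, a benchmark of 2,000 VQA samples for evaluating the UI-based reasoning of multimodal LLMs, and UI-UX, a model that uses a reward routing mechanism and an asymmetric transition reward to reach 0.7963 accuracy on UXBench, surpassing Claude-4.5-Sonnet's 0.6550.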

Comments: 10 pages, 6 figures, Accepted at CVPR 2026 Findings

Abstract

User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). Applications of multimodal large language models (MLLMs) to user interfaces, such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation, are evolving rapidly. However, research on evaluating UX from UI screenshots is still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on the Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 and surpassing Claude-4.5-Sonnet's 0.6550, while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.

2606.13191 2026-06-12 cs.LG New submission

The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

Ryosuke Sakamoto, Kotaro Sakamoto

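Affiliations: Institute for the Advanced Study of Human Biology, Institute for Advanced Study, Kyoto University; Graduate School of Engineering, The University of Tokyo

AI summary: Explains phase-transition-like behaviour in generative dynamics through the geometry of projection caustics, and proposes the Critical Boundary Detector (CBD), which diagnoses score-direction instability, localises mode commitment, and supports control in sensitive regions.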

Abstract

Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.
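The idea that the nearest-point projection onto the data support can cease to be unique is easy to see on a toy case. The sketch below assumes a hypothetical two-point support in the plane; it is not the paper's CBD, only an illustration of where the projection becomes ambiguous.

```python
import math


def nearest_points(x, support, tol=1e-9):
    """Return every support point whose distance to x is within tol of the minimum."""
    dists = [math.dist(x, s) for s in support]
    d_min = min(dists)
    return [s for s, d in zip(support, dists) if d - d_min <= tol]


# Hypothetical two-point data support, chosen only for illustration.
support = [(-1.0, 0.0), (1.0, 0.0)]

off_axis = nearest_points((0.3, 0.5), support)  # closer to the right-hand point
on_axis = nearest_points((0.0, 0.5), support)   # equidistant from both points
print(off_axis)
print(on_axis)
```

Points on the vertical axis are equidistant from both support points, so a small perturbation to either side changes which point the projection selects. That is the kind of sensitivity the abstract associates with sharp transitions.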

2606.13189 2026-06-12 cs.CL New submission

SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection

Fuqiang Niu, Bowen Zhang

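Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; School of Artificial Intelligence, Shenzhen Technology University

AI summary: Proposes SICI, a seven-dimensional semantic-pragmatic complexity index for diagnosing stance detection difficulty. As complexity rises, LLM errors shift regime from over-attribution to concentrated abstention, and interventions mostly move models along the attribution-abstention axis rather than removing the bottleneck.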

Abstract

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target--text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability ($\alpha=0.771$). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution--abstention axis rather than removing the high-complexity bottleneck.

2606.13188 2026-06-12 cs.CV cs.AI New submission

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

Affiliations: CAVE Labs, C-IoT, Dept. of CSE, PES University; C-IoT, Dept. of CSE, PES University

AI summary: Proposes an end-to-end network combining a 3D Swin Transformer with a GAT head that generates smooth cardiac surface meshes directly from medical images without conventional post-processing, achieving a mean Chamfer distance of 1.8 mm on MM-WHS 2017.

Abstract

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
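For readers unfamiliar with the mesh metric quoted above, here is a minimal sketch of one common form of the Chamfer distance between two point sets, such as vertices sampled from a predicted and a reference surface. Conventions vary (sum or mean, squared or unsquared distances), and the abstract does not say which one the paper uses, so the averaging chosen here is an assumption. The brute-force search is only suitable for small inputs.

```python
import math

Point = tuple[float, float, float]


def _mean_nearest(src: list[Point], dst: list[Point]) -> float:
    """Average distance from each point in `src` to its nearest neighbour in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: list[Point], b: list[Point]) -> float:
    """Symmetric Chamfer distance: mean of the two directed averages."""
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))


# Two toy point sets, offset by one unit along z (illustrative values only).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

If the coordinates are in millimetres, the result is in millimetres as well, which is how a figure such as a mean Chamfer distance in mm would be read.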

2606.13187 2026-06-12 cs.CL New submission

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang

Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; School of Artificial Intelligence, Shenzhen Technology University; School of Urban Planning and Design, Peking University

AI summary: Presents BioStance, a dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions covering six controversial targets, labeled with a three-class stance scheme at high reliability to support context-aware stance detection research.

Abstract

Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff's $\alpha$ of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.
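The reliability figure above is a Krippendorff's alpha. As a rough sketch of how such a value is obtained for nominal labels, the function below builds the coincidence matrix and compares observed with expected disagreement. The function name and the toy labels are illustrative assumptions, and it does not handle the degenerate case where only one label value ever occurs.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels; each unit lists the labels it received."""
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        # Ordered pairs of labels from distinct annotators, weighted by 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n * (n - 1))
    return 1.0 - observed / expected


# Toy annotations from three annotators (invented; not from the dataset).
toy_units = [
    ["Favor", "Favor", "Favor"],
    ["Against", "Against", "Against"],
    ["None", "None", "Favor"],
]
alpha = krippendorff_alpha_nominal(toy_units)
print(alpha)
```

Alpha equals 1 under perfect agreement and falls toward 0 as agreement approaches what chance alone would produce.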

2606.13184 2026-06-12 cs.CL New submission

LAUKIN: A Multi-jurisdictional Common Law Contract Dataset

Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong

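Affiliations: Computer Science and Engineering, UNSW, Sydney Australia; Law and Justice, UNSW, Sydney Australia

AI summary: Motivated by the need for cross-jurisdictional contract review, the authors build LAUKIN, a dataset of clause pairs across Australia, the UK, and India, labelled for legal equivalence via multi-stage retrieval and expert annotation. Benchmarks show that cross-jurisdictional equivalence classification is challenging.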

Comments: 5 pages, 2 figures, 4 tables

Abstract

Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.
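The headline metric above is macro-F1, the unweighted mean of per-class F1 scores. A minimal sketch for a two-class task follows. The label strings and predictions are invented for illustration, and a class with no true or predicted instances is scored as 0 here, which is one convention among several.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy labels; NOT drawn from the dataset.
gold = ["Equivalent", "Equivalent", "Not Equivalent", "Not Equivalent"]
pred = ["Equivalent", "Not Equivalent", "Not Equivalent", "Not Equivalent"]
score = macro_f1(gold, pred)
print(score)
```

Because every class counts equally, macro-F1 penalises a classifier that does well only on the more frequent label.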

2606.13178 2026-06-12 cs.LG New submission

Loss-Shift Transfer via Bayes Quotients

Vasileios Sevetlidis

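Affiliations: Athena Research Center; Democritus University of Thrace; International Hellenic University

AI summary: Studies loss shift, where the data distribution is fixed but the loss changes. Bayes quotients formalize a refinement order on losses; a representation that is minimal for a coarser loss is shown to be insufficient for a strictly finer loss, with an exact quantitative identity for finite-output log loss.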

Abstract

Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called loss shift. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.
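The log-loss identity stated above can be checked numerically on a toy example. The sketch assumes a hypothetical joint law (X uniform on four values, Y binary) and a representation that keeps only the 0-1-loss Bayes decision. It compares the excess log-loss risk H(Y|Z) - H(Y|X) with the conditional mutual information I(X;Y|Z). This is a standard information-theoretic check written for this digest, not the paper's Bayes-quotient construction.

```python
import math
from collections import defaultdict

# Hypothetical toy joint law: X uniform on {0, 1, 2, 3}, Y binary.
p_x = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
p_y1_given_x = {0: 0.9, 1: 0.6, 2: 0.4, 3: 0.1}


def phi(x: int) -> int:
    """Coarse representation: keep only the Bayes-optimal class under 0-1 loss."""
    return 1 if p_y1_given_x[x] > 0.5 else 0


def joint():
    """Yield (x, y, probability) triples of the joint law P(X, Y)."""
    for x, px in p_x.items():
        yield x, 1, px * p_y1_given_x[x]
        yield x, 0, px * (1.0 - p_y1_given_x[x])


def cond_entropy(key) -> float:
    """H(Y | key(X)) in nats, i.e. the Bayes-optimal log-loss given key(X)."""
    p_ky, p_k = defaultdict(float), defaultdict(float)
    for x, y, p in joint():
        p_ky[(key(x), y)] += p
        p_k[key(x)] += p
    return -sum(p * math.log(p / p_k[k]) for (k, _), p in p_ky.items() if p > 0)


def cond_mutual_info() -> float:
    """I(X; Y | Z) with Z = phi(X), evaluated from its definition."""
    p_zy, p_z = defaultdict(float), defaultdict(float)
    for x, y, p in joint():
        p_zy[(phi(x), y)] += p
        p_z[phi(x)] += p
    total = 0.0
    for x, y, p in joint():
        z = phi(x)
        p_xy_given_z = p / p_z[z]
        p_x_given_z = p_x[x] / p_z[z]
        p_y_given_z = p_zy[(z, y)] / p_z[z]
        total += p * math.log(p_xy_given_z / (p_x_given_z * p_y_given_z))
    return total


excess = cond_entropy(phi) - cond_entropy(lambda x: x)
print(excess, cond_mutual_info())
```

Here `phi` preserves the optimal classification decision for every x, yet it merges inputs with different posteriors, so the optimal log-loss worsens by exactly the information about Y that the merge discards.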

2606.13176 2026-06-12 cs.AI New submission

Mental-R1: Aligning LLM Reasoning for Mental Health Assessment

Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

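Affiliations: University of Oxford; Oxford Suzhou Centre for Advanced Research

AI summary: Proposes Cognitive Relative Policy Optimization (CRPO), which aligns LLM reasoning with human cognitive processes through stage-dependent uncertainty modeling and stage-wise entropy regularization, improving weighted F1 by an average of 10.4 percentage points over the best reinforcement learning baseline across 8 mental health datasets.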

Abstract

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.
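The abstract does not give the form of the stage-wise entropy regularizer, so the following is only a guess at its general shape: an entropy term whose coefficient is positive in early reasoning stages (rewarding exploration) and negative in late stages (rewarding confident decisions). The linear schedule, the `beta` value, and the function names are all assumptions made for illustration.

```python
import math


def entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def stage_coefficient(stage: int, num_stages: int, beta: float = 0.01) -> float:
    """Hypothetical schedule: +beta at the first stage, linearly down to -beta at the last."""
    if num_stages < 2:
        return 0.0
    return beta * (1.0 - 2.0 * stage / (num_stages - 1))


def entropy_bonus(probs: list[float], stage: int, num_stages: int) -> float:
    """Term added to the objective: rewards entropy early, penalises it late."""
    return stage_coefficient(stage, num_stages) * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]
print(entropy_bonus(uniform, stage=0, num_stages=3))  # positive: exploration rewarded
print(entropy_bonus(uniform, stage=2, num_stages=3))  # negative: uncertainty penalised
```

In a policy-gradient setting a term like this would be added to the per-token or per-stage objective; how CRPO actually segments stages and weights the term is not stated in the abstract.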

2606.13174 2026-06-12 cs.LG cs.CL New submission

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

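Affiliations: University of Notre Dame; IBM Research; Tencent AI Lab

AI summary: Proposes TRACE, which compiles user corrections into atomic rules enforced at runtime, substantially reducing preference violations by coding agents on later tasks and improving on memory-only approaches.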

Abstract

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at this https URL, and the deployable skill is available at this https URL.
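To make the "compile corrections into runtime checks" idea concrete, here is a minimal sketch of a gate that blocks task completion until every rule passes. The `Rule` structure, the example rules, and the function names are hypothetical; TRACE's actual rule format and enforcement hooks are not described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One atomic rule distilled from a user correction (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]  # returns True when the output complies


def compile_rules(rules: list[Rule]) -> Callable[[str], list[str]]:
    """Bundle rules into one gate that returns the descriptions of violated rules."""
    def gate(output: str) -> list[str]:
        return [r.description for r in rules if not r.check(output)]
    return gate


def complete_task(output: str, gate: Callable[[str], list[str]]) -> str:
    """Allow completion only when every compiled check passes."""
    violations = gate(output)
    if violations:
        raise ValueError("blocked by rules: " + "; ".join(violations))
    return output


# Example rules a user might have stated in chat (invented for illustration).
rules = [
    Rule("use spaces, never tabs", lambda out: "\t" not in out),
    Rule("do not leave TODO markers", lambda out: "TODO" not in out),
]
gate = compile_rules(rules)
print(gate("x = 1  # TODO"))
```

The design point is that a rule held in memory can be ignored by the model, whereas a check that must pass before completion cannot be skipped silently.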

2606.13172 2026-06-12 cs.LG New submission

Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

Affiliations: Laboratory of Bioengineering and Nanosciences (LBN), University of Montpellier; EuroMov Digital Health in Motion, University of Montpellier, IMT Mines Alès; Certified Sophrologist, Sensorimotor Practice; Emeritus Professor, University of Montpellier

AI summary: Proposes the VER framework, which monitors the adequacy of learned representations by identifying persistent residual structures, complementing conventional evaluation methods.

Comments: 22 pages, 1 figure. Conceptual framework for representation diagnostics in machine learning

Abstract

Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

2606.13171 2026-06-12 cs.CL cs.AI New submission

NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning

Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu

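Affiliations: Central South University; Tsinghua University; Nanjing University; Suzhou Aerospace Information Research Institute; McGill University

AI summary: Targeting two types of hallucination in LLM-based news timeline summarization, unfaithful content and information omission, the paper proposes NTS-CoT, whose three modules (Element-CoT, Date Selection, and Causal-CoT) mitigate hallucinations and outperform existing methods on three benchmarks.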

Abstract

The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing work. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at this https URL.

2606.13169 2026-06-12 cs.RO New submission

Redesigning Regularization for Effective Policy Smoothing

Taisuke Kobayashi, Naoto Yamanaka

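Affiliations: National Institute of Informatics (NII); The Graduate University for Advanced Studies (SOKENDAI)

AI summary: Addressing policy smoothing in reinforcement learning, the paper identifies discrepancies between the theory and implementation of existing regularization and proposes remedies. The modified regularization yields smooth motion and better control performance across multiple tasks and algorithms, and sim-to-real experiments on a quadruped robot show that smoothness improves robustness to sudden changes in target velocity commands.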

Comments: submitted to RA-L

Abstract

This paper proposes a novel regularization design to effectively smooth policy functions in reinforcement learning. While regularization that enhances "global" Lipschitz continuity was initially considered, it has been limited to "local" Lipschitz continuity due to a tradeoff between smoothness and expressiveness. However, it has become apparent that the original implementation is cumbersome and does not provide sufficient smoothing, leading to a preference for simpler implementations. This stems from a discrepancy between theory and implementation, and a more appropriate implementation can be expected to facilitate smoothing. Therefore, this paper identifies three reasons why the original implementation does not function adequately and provides remedies for them. This modified regularization performs well across multiple tasks and algorithms, successfully achieving smooth motion while improving control performance. Furthermore, by applying it to sim-to-real reinforcement learning for a quadruped robot, it is demonstrated that smooth motion provides robustness against sudden changes in target velocity commands.
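As background for the local Lipschitz regularization discussed above, the sketch below shows a generic local-smoothness penalty: the mean squared change in a policy's output when the state is perturbed within a small box. It is a common baseline form, not the redesigned regularizer the paper proposes. The toy policy, the `radius`, and the sample count are assumptions for illustration.

```python
import math
import random


def policy(state: list[float], weights: list[list[float]]) -> list[float]:
    """Toy deterministic policy: a tanh-squashed linear map (stand-in for a network)."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]


def local_smoothness_penalty(state, weights, radius=0.05, samples=8, rng=None):
    """Mean squared output change over random perturbations inside a small box."""
    rng = rng or random.Random(0)  # fixed seed so repeated calls use the same noise
    base = policy(state, weights)
    total = 0.0
    for _ in range(samples):
        noisy = [s + rng.uniform(-radius, radius) for s in state]
        out = policy(noisy, weights)
        total += sum((a - b) ** 2 for a, b in zip(out, base))
    return total / samples


state = [0.0, 0.0]
gentle = local_smoothness_penalty(state, [[0.1, -0.1]])
steep = local_smoothness_penalty(state, [[10.0, -10.0]])
print(gentle, steep)
```

During training such a penalty would be added to the policy loss with a weight. The steeper policy reacts more strongly to the same perturbations and therefore receives the larger penalty.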

2606.13168 2026-06-12 cs.LG 新提交

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

路由何时变得可解释？对块注意力残差的因果探针

Aydin Javadov

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结研究块注意力残差中路由的可解释性，发现仅当路由参与训练时才出现结构化深度路由，且路由权重与因果重要性存在分离，需用因果干预验证。

AI中文摘要

块注意力残差（Block AttnRes）通过将固定的加性残差替换为基于早期深度源表示的学习softmax，在前向传播中将跨层路由暴露为可检查的张量。这是一个诱人的可解释性目标：通常间接推断的信息流现在可以直接观察。我们询问这种暴露是否足以进行机制解释。我们在相同的路由消融干预下探测了两个同规模（0.6B）的Block AttnRes检查点：一个是通过确定性近因偏差调度（代码库将其视为路由等效加载路径）包装的普通Qwen3推理，另一个是从头训练且路由作为优化一部分的Block AttnRes Qwen3。包装基线的路由权重与内容无关，并重现了调度的分析预测。而训练的AttnRes检查点则表现出三种局部路由模式：通过早期层MLP的嵌入源路径、通过早期层注意力和MLP的当前状态路径，以及通过后期层注意力的较旧历史路径。除了这种分层之外，我们发现平均路由质量与因果重要性之间存在明显分离：在两个子层中，最大的质量切片并非最大的因果贡献，并且一个源家族在干预下携带了可观的质量但没有可检测的因果作用。因此，路由的架构暴露对于机制解释是必要但不充分的：只有当路由是训练的一部分时，结构化的深度路由才会出现，即使如此，描述性路由总结也应被视为待因果干预检验的候选假设，而非其本身的机制证据。

英文摘要

Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale (0.6B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
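
To make the contrast between a fixed additive residual and a softmax over depth sources concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the single `query` vector used to score sources, the `recency_schedule` function, and its `decay` parameter are hypothetical stand-ins, not the parametrization used in the paper or its codebase.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def additive_residual(sources):
    # Standard residual stream: every earlier source is summed with weight 1.
    return sources.sum(axis=0)

def routed_residual(sources, query):
    # Hypothetical learned routing: score each depth source against a query
    # vector, then mix the sources with softmax weights (content-dependent).
    scores = sources @ query
    weights = softmax(scores)
    return weights @ sources, weights

def recency_schedule(num_sources, decay=1.0):
    # Hypothetical deterministic schedule: weights depend only on source age,
    # never on content, so newer sources always receive more mass.
    ages = np.arange(num_sources)[::-1]   # last (newest) source has age 0
    return softmax(-decay * ages)

rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 8))     # 4 depth sources, hidden size 8
query = rng.standard_normal(8)
mixed, weights = routed_residual(sources, query)
schedule = recency_schedule(4)
```

The `weights` vector is the kind of inspectable routing tensor the abstract refers to; the abstract's caution is that a large entry in such a vector is a hypothesis about importance, which still needs an ablation to confirm.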

2606.13156 2026-06-12 cs.CV cs.AI 新提交

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

迭代视觉思维：通过视觉反馈教会视觉语言模型空间自我修正

Animesh Tripathy, Aswanth Krishnan

发表机构 * QpiAI India Pvt. Ltd（QpiAI印度私人有限公司）

AI总结提出迭代视觉思维（IVT）框架，通过视觉反馈闭环和两阶段训练（SFT+GRPO），使视觉语言模型具备空间自我修正能力，在三个基准上提升指标2.4-3.2个百分点。

AI中文摘要

视觉语言模型（VLM）在单次空间定位上表现强劲，但缺乏观察和修正自身预测的机制。我们发现，简单地提示VLM在其预测的渲染可视化上迭代会导致灾难性失败：指代表达理解的Acc@0.5从79.6%骤降至48.7%（下降31个百分点），揭示了定位能力与自我修正能力之间的根本差距。我们提出迭代视觉思维（IVT），一种闭环框架，其中模型预测边界框，观察预测在图像上的渲染结果，并通过视觉反馈迭代优化。两阶段训练方案弥合了自我修正差距：首先，我们利用基础模型自身的预测作为真实错误，并提示教师VLM生成修正推理轨迹，从而无需人工标注即可获得监督数据；其次，我们应用组相对策略优化（GRPO）和简单的IoU奖励来稳定多步优化。在涵盖RefCOCOg、Ref-Adv和Ref-L4的混合基准（505个测试样本）上，使用IVT的SFT预热在每个指标上都超过了单次基础模型：Acc@0.5升至82.0%（+2.4个百分点），Acc@0.7升至74.1%（+3.2个百分点），Acc@0.9升至48.3%（+2.8个百分点）。GRPO进一步将每步IoU退化减少了5倍，稳定了优化轨迹。所有训练仅使用单个GPU上的2400个样本，表明空间自我修正是一种可学习的能力，可以在适度规模下灌输。

英文摘要

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
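
The IoU reward and the Acc@t metrics quoted above can be illustrated with a short sketch. The box format `(x1, y1, x2, y2)` and the function names `iou` and `acc_at` are assumptions made for this example; the abstract does not specify the paper's exact reward implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at(preds, targets, threshold):
    """Fraction of predictions whose IoU with the target reaches the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)

preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
score = acc_at(preds, targets, 0.5)   # first box matches exactly, second overlaps by 1/7
```

A refinement step that lowers `iou` against the ground truth is what the abstract calls per-step IoU degradation; using `iou` directly as a scalar reward penalizes such steps.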

2606.13148 2026-06-12 cs.AI 新提交

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

TerraBench: 智能体能否对异构地球系统数据进行推理？

Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结提出TerraBench基准，基于TerraAgent框架，通过结合大语言模型规划与科学工具，实现跨网格数据、卫星图像、地理空间和模拟器的交互式推理，包含403个任务和24,500个执行步骤。

AI中文摘要

气候和环境决策日益需要对异构输入进行推理，包括网格化物理数据、卫星图像、地理空间背景和模拟器输出。天气和气候基础模型可以很好地预测，但不能以语言进行交互式推理，而大型语言模型（LLM）可以用语言推理，但不能直接操作高维地球系统数据。因此，地球科学中的真实科学工作流仍然得不到充分支持。我们引入了TerraBench，一个基于地球科学推理的基准，构建于TerraAgent之上，这是一个ReAct风格的可执行框架，它交织推理、工具调用和观察，将LLM规划与环境检索、地理空间处理、模拟和基于工件的计算等科学工具相结合。TerraBench在单一可执行界面中统一了对地球观测图像、网格化数据、GIS推理和模拟的分析，而先前的基准将这些能力隔离为狭窄的独立任务。它也是该领域中第一个将过程级工具使用指标与容忍度感知数值评分配对的方法。该基准包含403个广泛的智能体任务，涵盖三个轨道（基础、模拟器基础和文档基础验证）和八个应用领域，共24,500个经过验证的执行步骤。这些结果表明，可靠的地球科学智能体必须超越工具访问，协调异构工作流，精确参数化工具，并保留工件来源。

英文摘要

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning, and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.
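
As a rough illustration of what tolerance-aware numeric scoring can look like, the sketch below accepts an answer when it falls within a relative or absolute tolerance of the reference. The function name `tolerance_score` and the default 5% tolerance are assumptions for this example; the abstract does not state TerraBench's actual scoring rule.

```python
import math

def tolerance_score(predicted, reference, rel_tol=0.05, abs_tol=0.0):
    """Return 1.0 if the prediction is within tolerance of the reference, else 0.0."""
    within = math.isclose(predicted, reference, rel_tol=rel_tol, abs_tol=abs_tol)
    return 1.0 if within else 0.0

close_enough = tolerance_score(102.0, 100.0)   # 2% off, inside the 5% band
too_far = tolerance_score(110.0, 100.0)        # 10% off, outside the 5% band
```

Compared with exact-match scoring, a tolerance band avoids penalizing an agent for harmless rounding or unit-conversion noise while still rejecting materially wrong values.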

2606.13142 2026-06-12 cs.CL 新提交

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

HyPE：基于类别感知的超图编码与持久边嵌入用于人物角色对话

Sangwon Youn, Yoonjin Jang, Youngjoong Ko

发表机构 * Sungkyunkwan University（成均馆大学）

AI总结提出HyPE框架，通过将人物角色文本解析为四元组并构建超图，利用HyperGCN和持久边嵌入（PEE）编码高阶关系，在PersonaChat上优于句子级池化基线。

Comments: 11 pages, 2 figures, 4 tables

AI中文摘要

人物角色对话系统旨在生成与说话者角色一致的回复，但现有方法将角色视为一组扁平句子，未能建模角色属性间的高阶关系——例如，多个角色句子共享一个主题类别。我们提出HyPE（超图角色编码器）框架，该框架（i）将每个承载角色的文本分析为（核心、表达、情感、类别）四元组，以及（ii）将角色元素组织成一个超图，其超边由共享类别标签诱导。HyperGCN超图神经网络将此结构传播为角色摘要向量和软记忆库，以条件化回复生成器。我们进一步提出持久边嵌入（PEE），即轻量级的每类别可学习先验，融合到HyperGCN的消息传递步骤中。在贪婪解码下的PersonaChat上，HyPE在GPT-2、LLaMA-3.2-3B和Qwen2.5-3B骨干网络上一致优于句子级池化基线，表明结构化的超边级角色编码在不同模型规模上提供了可迁移的优势。

英文摘要

Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.
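
The idea of hyperedges induced by shared category labels can be shown with a small incidence-matrix sketch. The function name `category_incidence` and the example labels are hypothetical; the abstract does not describe how HyPE stores its hypergraph.

```python
def category_incidence(categories):
    """Build a node-by-hyperedge incidence matrix from category labels.

    Row i is persona element i; column j is the hyperedge for label j.
    An entry is 1 when the element carries that category label.
    """
    labels = sorted(set(categories))
    column = {label: j for j, label in enumerate(labels)}
    incidence = [[0] * len(labels) for _ in categories]
    for i, label in enumerate(categories):
        incidence[i][column[label]] = 1
    return labels, incidence

# Four persona elements; elements 0 and 2 share the "hobby" category.
categories = ["hobby", "work", "hobby", "family"]
labels, H = category_incidence(categories)
```

Each column with more than one nonzero entry is a hyperedge linking several persona elements at once, which is the high-order relation a flat set of sentences cannot express.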

2606.13141 2026-06-12 cs.AI 新提交

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

重新思考长视频中的RAG：检索什么以及如何使用？

Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang, Juntae Lee, Kyuwoong Hwang, Fatih Porikli, Hwanjun Song

发表机构 * Department of Computer Science, Cranberry-Lemon University（蔓越莓柠檬大学计算机科学系）

AI总结针对视频检索增强生成中检索粒度单一和基准测试缺陷，提出V-RAGBench基准和CARVE方法，通过分块自适应重排序实现多配置交错证据，显著提升性能。

AI中文摘要

检索增强生成正从文本扩展到长、自我中心的视频，系统必须跨多种模态和时间粒度选择与查询相关的块。然而，VideoRAG的进展受到两个差距的限制：现有基准允许无需视频即可回答查询，掩盖了检索错误；先前方法对每个查询应用单一模态-粒度配置，忽略了块级变异性。我们通过引入V-RAGBench（一个⟨查询，证据块，答案⟩三元组基准，支持检索和生成的忠实、解耦评估）和CARVE（一种简单方法，跨配置运行并行检索器并采用块自适应重排序以识别每个块的最佳配置）来解决这两个问题。每个块随后以其在检索期间选择的最佳配置进入生成器，产生一种交错证据形式，其中块级决策在检索和生成两个阶段传播。CARVE优于八种最近的VideoRAG基线，提供给生成器的块交错多种配置而非共享单一配置，这是查询级方法无法实现的行为。

英文摘要

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of ⟨query, evidence chunk, answer⟩ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.
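
One way to read "chunk-adaptive reranking" is sketched below: every chunk keeps the configuration that scored best for it, and the top chunks are then passed on with mixed configurations. This is a simplified guess for illustration only. The function name `select_interleaved`, the score dictionary layout, and the configuration names `frames` and `transcript` are all hypothetical and not taken from the paper.

```python
def select_interleaved(scores, top_k):
    """Pick a winning configuration per chunk, then keep the top_k chunks.

    scores: {chunk_id: {config_name: reranker_score}}
    Returns a list of (chunk_id, winning_config), best chunk first.
    """
    winners = []
    for chunk_id, by_config in scores.items():
        config = max(by_config, key=by_config.get)   # best config for this chunk
        winners.append((chunk_id, config, by_config[config]))
    winners.sort(key=lambda w: w[2], reverse=True)
    return [(chunk_id, config) for chunk_id, config, _ in winners[:top_k]]

scores = {
    "c1": {"frames": 0.2, "transcript": 0.9},
    "c2": {"frames": 0.7, "transcript": 0.4},
    "c3": {"frames": 0.1, "transcript": 0.3},
}
evidence = select_interleaved(scores, top_k=2)
```

In this toy input the selected evidence mixes two configurations, whereas a query-level method would have to commit to either `frames` or `transcript` for all chunks.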

2606.13136 2026-06-12 cs.CV cs.LG eess.IV 新提交

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

一种可扩展且轻量级的统一架构用于像素合并图像传感器的去马赛克

Saurabh Kumar, Nutan Sairam Yenneti

发表机构 * Samsung Research Institute Bangalore（三星研究院班加罗尔分院）

AI总结提出模块化统一架构，通过无学习CFA识别模块和轻量级设计，实现多种像素合并传感器的去马赛克，提升图像质量并降低资源消耗。

2606.13135 2026-06-12 cs.CV cs.AI 新提交

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

皮肤肿瘤皮肤镜图像的级联分类：可控敏感度与外部临床验证

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)（俄罗斯科学院伊万尼科夫系统编程研究所）； Orel Oncological Dispensary（奥廖尔肿瘤医院）

AI总结本研究比较了四种深度学习架构在皮肤镜图像分类中的表现，提出一种两阶段级联分类方案，通过可调分诊阈值实现敏感度控制，并在外部临床数据集上验证了泛化差距。

Comments: 28 pages, 8 figures, 10 tables

AI中文摘要

目的：比较皮肤肿瘤皮肤镜图像的深度学习架构和分类方案，并评估从开放国际数据集到俄罗斯临床独立数据集的泛化能力。方法：在三种方案中比较四种架构（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）：二分类（恶性/良性）、单阶段四分类（良性、MEL、SCC、BCC）和两阶段级联（二分类分诊，然后三分类MEL/SCC/BCC）。所有模型使用ImageNet预训练权重和单一增强协议，在聚合的开放ISIC Archive数据上训练，并在内部保留样本和两个临床数据集（Melanoscope AI移动系统；谢切诺夫大学）上评估。结果：内部二分类阶段达到ROC-AUC 0.952-0.966；在谢切诺夫大学数据集上降至0.797-0.893，敏感度降至0.53-0.67，ECE从0.02升至0.27-0.39，且低估恶性，量化了排序和校准中的泛化差距。配对检验确认了临床数据上的一个架构间结果：二分类阶段ViT-B/16的缺陷（p<0.05）；在区分阶段，没有架构显示出显著优势。级联方案在大多数架构上提高了宏F1，但仅对ViT-B/16显著，通过恢复被分配到主导良性类别的恶性病变。在ISIC MILK10k上，直接11分类的平均类别敏感度为0.525。结论：可调分诊阈值提供了标准单阶段（argmax）分类无法实现的敏感度控制，并更好地再现了临床鉴别诊断逻辑。持续的泛化差距要求在部署前进行外部临床验证和重新校准。

英文摘要

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
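
The sensitivity control described in the conclusion can be sketched as a two-stage decision rule. The function name `cascade_predict`, the probability inputs, and the threshold values are assumptions for illustration; they are not the study's models or operating points.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Two-stage cascade: binary triage, then MEL/SCC/BCC differentiation.

    p_malignant: stage-1 probability that the lesion is malignant.
    subtype_probs: stage-2 probabilities over malignant subtypes.
    """
    if p_malignant < triage_threshold:
        return "benign"
    # Only lesions that pass triage reach the three-class stage.
    return max(subtype_probs, key=subtype_probs.get)

subtypes = {"MEL": 0.6, "SCC": 0.1, "BCC": 0.3}
default_call = cascade_predict(0.3, subtypes, triage_threshold=0.5)
sensitive_call = cascade_predict(0.3, subtypes, triage_threshold=0.2)
```

Lowering `triage_threshold` sends more borderline lesions to the second stage, trading specificity for sensitivity. A single-stage argmax over four classes has no such knob, because the benign class simply wins whenever its probability is largest.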

2606.13127 2026-06-12 cs.CV 新提交

Fully Distributed Multi-View 3D Tracking in Real-Time

全分布式多视角3D实时跟踪

Byron Hernandez, Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros

发表机构 * University of Florida（佛罗里达大学）； NVIDIA Corporation（英伟达公司）

AI总结提出MV3DT全分布式框架，通过点对点协作实现实时多视角3D跟踪，无需中央聚合，在WILDTRACK上达到94.3% IDF1和93.3% MOTA，支持100摄像头30 FPS运行。

Comments: 18 pages, 4 figures, 2 algorithms, 4 tables

AI中文摘要

具有重叠视野的多摄像头跟踪通常依赖于集中式融合，这造成了计算瓶颈，阻碍了大规模部署。我们提出了MV3DT，一个用于实时多视角3D跟踪的全分布式框架，通过点对点协调实现精确的身份传播和遮挡恢复，消除了中央聚合的需要。每个摄像头节点执行一个轻量级模块化流水线，包括单目3D感知、分布式多视角关联以及通过轻量级消息传递的协作融合。MV3DT在WILDTRACK上达到了94.3%的IDF1和93.3%的MOTA，与最先进的集中式方法相当，同时展示了卓越的可扩展性，在100个摄像头上以30 FPS运行，摄像头间延迟小于10毫秒，通信开销仅为2.2%。在给定相机标定的情况下，MV3DT以零样本方式运行，无需特定场景学习，可直接部署在新环境中。这些结果确立了MV3DT作为大规模重叠摄像头网络中实时多视角跟踪的实用解决方案。

英文摘要

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 94.3% IDF1 and 93.3% MOTA on WILDTRACK, competitive with state-of-the-art centralized methods, while demonstrating superior scalability by sustaining 30 FPS on 100 cameras with less than 10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

2606.13126 2026-06-12 cs.LG cs.AI cs.CL 新提交

MiniPIC: Flexible Position-Independent Caching in <100LOC

MiniPIC: 少于100行代码的灵活位置无关缓存

Nathan Ordonez (1), Thomas Parnell (1) ((1) IBM Research)

发表机构 * IBM Research（IBM研究院）

AI总结提出MiniPIC，通过无位置编码KV缓存和用户控制缓存重用原语，在vLLM中实现多种位置无关缓存方法，显著提升预填充吞吐量并降低首个令牌延迟。

Comments: 13 pages, 5 figures

AI中文摘要

检索增强和代理工作负载重复预填充可预测的结构化输入（我们称之为“跨度”），例如文档和代码文件。然而，vLLM等引擎中的前缀缓存无法重用KV条目，除非它们与另一个请求共享相同的前缀，而生产级推理服务器中的位置无关缓存（PIC）实现通常需要大量服务器代码更改或将KV状态保留在服务器外部，从而产生主机到设备的传输开销。我们提出了极简PIC（MiniPIC）：一种最小化、灵活且快速的vLLM设计，由两个组件构建：无位置编码的KV缓存和用户控制的缓存重用原语。MiniPIC在KV缓存中存储未旋转的K向量，在注意力内部使用每请求逻辑位置对K块应用RoPE，并公开三个面向用户和令牌级别的原语：块对齐填充、跨度分隔符（SSep）和提示依赖（PDep），这些原语修改哈希行为和有效的块级因果注意力结构。通过少于100行的核心引擎更改加上自定义注意力后端，这些原语足以在同一个运行的vLLM实例中实现多种PIC方法，包括Block-Attention、EPIC和Prompt Cache，同时原生集成KV缓存CPU卸载实现。在2WikiMultihopQA上，使用交错调度的MiniPIC相比基线vLLM将预填充吞吐量提高了49%，将缓存跨度的首个令牌时间减少了最多两个数量级，保持了未缓存跨度的线性预填充扩展，并且仅产生5.7%的最坏情况开销。

英文摘要

Retrieval-augmented and agentic workloads repeatedly prefill recurring, predictable, structured inputs (which we call "spans") such as documents and code files. Yet prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible, and fast vLLM design built from two ingredients: a positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing, token-level primitives (block-aligned padding, span separator (SSep), and prompt depend (PDep)) that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
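
Why storing unrotated K vectors makes a cached span reusable at any offset can be seen from a small NumPy sketch of RoPE. This is a generic illustration of rotary embeddings, not MiniPIC's attention backend; the `rope` helper, the vector size, and the positions are assumptions chosen for the example.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Apply rotary position embedding to a 1-D vector of even length."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # consecutive pairs (x1, x2)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
k_cached = rng.standard_normal(8)   # stored without any rotation applied
q = rng.standard_normal(8)

# The same cached key is rotated at lookup time with whatever logical
# position the current request assigns to it.
score_a = rope(q, 12) @ rope(k_cached, 10)     # span placed near the start
score_b = rope(q, 112) @ rope(k_cached, 110)   # same span placed 100 tokens later
```

Because the rotated dot product depends only on the offset between query and key positions, `score_a` and `score_b` agree. A cache that stored already-rotated keys would bake in position 10 and could not be reused at position 110 without undoing the rotation.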

2606.13125 2026-06-12 cs.LG cs.AI 新提交

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

选择与改进：理解推理后训练的机制

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

发表机构 * Microsoft Research NYC（微软研究院纽约）； UIUC（伊利诺伊大学厄巴纳-香槟分校）

AI总结通过控制实验揭示强化学习后训练通过策略选择和策略改进两种机制提升推理能力，并指出SFT数据和RL数据的不同作用。

2606.13121 2026-06-12 cs.CL cs.AI cs.SD 新提交

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

NaturalFlow: 减少同步语音到语音翻译中破坏自然语音流的停顿

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

发表机构 * IPAI and ECE, Seoul National University（首尔大学IPAI与ECE）； Department of AI, University of Seoul（首尔市立大学人工智能系）

AI总结提出一个流畅性感知优化框架，通过利用模型内部信号（如语言多样性和语音时长的时间变异性）最小化块间静音，在同步翻译的低延迟和连续翻译的自然流畅之间找到平衡点。

Comments: Proceedings of the 26th Interspeech Conference, Long Paper

AI中文摘要

同步语音到语音翻译旨在通过最小化延迟实现近实时通信，为连续翻译的高延迟提供了一种引人注目的实时替代方案。然而，过度追求低延迟往往会导致碎片化的块状语音。因此，听众会遭受不自然的声学流，其中频繁的停顿可能会增加他们的认知负荷。为了弥补这一差距，我们引入了一个流畅性感知优化框架，旨在发现同步翻译的低延迟优势与连续翻译的自然流畅之间的最佳平衡点。我们的框架通过利用模型内部信号（包括语言多样性和语音时长的诱导时间变异性）来最小化块间静音。在短文本和长文本基准上的实验表明，我们的框架在保持竞争性延迟和翻译质量的同时，产生了自然的语音流。

英文摘要

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

2606.13120 2026-06-12 cs.CL 新提交

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

EvoBrowseComp: 基于演化知识的搜索智能体基准测试

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

发表机构 * Northeastern University, China（东北大学（中国））； Weixin AI, Tencent Inc, China（腾讯微信AI（中国））

AI总结提出EvoBrowseComp，一个通过实时网络遍历自动生成400道英文和400道中文无污染复杂问题的演化基准，用于评估搜索智能体在动态知识环境中的真实浏览能力。

Comments: 14 pages, under review

AI中文摘要

搜索智能体——即增强搜索工具的大型语言模型——加剧了对未来验证基准的需求。现有的基准如BrowseComp依赖静态知识，容易受到测试集污染和参数记忆的影响。因此，模型可以通过事实回忆而非真正检索获得高分，通过推理捷径掩盖真实的浏览能力。在本文中，我们介绍EvoBrowseComp，一个包含400道英文和400道中文无污染复杂问题的演化基准，通过实时网络遍历合成。为了收集这些问题，我们设计了一个三智能体协作框架：（1）QA合成智能体，从实时网络中检索新鲜知识以合成问答对；（2）信息过滤智能体，根据可信度和流行度过滤检索到的知识，以阻断参数捷径；（3）高级指导智能体，将问题形式化为推理图，以减少合成问答对中的逻辑冗余和捷径。由于该框架支持全自动合成，EvoBrowseComp可以定期更新以防止数据污染并保持时间新鲜度。大量实验证实了其高难度，需要广泛的横向搜索。它为自动更新、高难度的基准测试建立了一个可扩展的范式，与不断发展的世界知识和不断进步的智能体能力保持同步。

英文摘要

Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

2606.13119 2026-06-12 cs.LG cs.AI cs.NE 新提交

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

MP3：面向时空预测的多周期模式预训练

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； Eindhoven University of Technology（埃因霍温理工大学）

AI总结针对时空数据中短窗口输入导致的时间幻象问题，提出多周期模式预训练插件MP3，通过多周期时间建模、空间建模和跨周期因果交互，提升现有STGNN的预测性能。

AI中文摘要

时空预测在交通、气候和能源等多个领域至关重要。城市时空数据表现出时间幻象：相似的短窗口输入具有不同的未来趋势，反之亦然。现有的时空图神经网络（STGNN）无法有效识别此类幻象。我们认为核心原因在于短窗口输入具有不完整的周期观测、异质的全局空间相关性和跨周期叠加因果性。为弥补这一差距，我们开发了一种新颖的多周期模式预训练（MP3），这是一种用于区分时间幻象的即插即用预训练插件。MP3提出了两项核心创新：（1）多周期模式学习旨在从长时间序列中学习多周期模式。具体地，多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模使用瓶颈投影和全局记忆库来高效捕获异质的全局空间关系。跨周期模式交互采用因果增强的Transformer来捕获不同周期模式之间的依赖关系。（2）该插件可以无缝集成到现有的STGNN骨干中，以增强其预测性能。在五个真实世界数据集（包括大规模数据集CA）上的五个STGNN基线实验验证了MP3的有效性、优越的可扩展性和强适应性，其在所有评估基线上带来了一致且稳健的性能提升。平均而言，MP3将MAE降低了4.7%，RMSE降低了5.0%。代码可在此https URL获取。

英文摘要

Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirages: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason is that short-window inputs have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) Multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including a large-scale dataset, CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at this https URL.

2606.13115 2026-06-12 cs.CL cs.AI 新提交

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

G-Long：面向高效长期对话代理的图增强记忆管理

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

发表机构 * Sungkyunkwan University（成均馆大学）

AI总结提出G-Long框架，利用微调小语言模型进行结构化三元组提取和关联检索，并引入注意力感知重要性评分机制，在降低计算开销的同时，在响应生成和记忆检索上达到最优性能。

Comments: 22 pages, 8 figures, 14 tables

AI中文摘要

尽管大型语言模型（LLMs）推动了开放域对话系统的发展，但由于长上下文推理的固有限制以及处理大量原始文本的低效性，保持长期一致性仍然是一个挑战。现有方法通常依赖于非结构化记忆存储（容易导致信息丢失）或计算成本高昂的LLMs（导致高延迟）。为了解决这些限制，我们提出了G-Long，一个图增强框架，利用微调的小语言模型（sLM）进行结构化三元组提取和关联检索，显著降低了运营成本。此外，我们引入了新颖的注意力感知重要性评分机制，利用T5摘要器的内在交叉注意力信号来识别显著记忆。跨多个基准的大量实验表明，G-Long在响应生成和记忆检索方面均达到了最先进的性能，在MSC上响应质量提升高达9.8%，在LME上检索召回率提升高达40.8%，同时显著降低了计算开销。

英文摘要

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly reducing computational overhead.

2606.13111 2026-06-12 cs.CL 新提交

MÖVE: A Holistic LLM Benchmark for the German Public Sector

MÖVE：德国公共部门的大语言模型整体基准

Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland

发表机构 * Innovations Department, Bundesdruckerei GmbH（德国联邦印钞公司创新部）

AI总结提出MÖVE基准，从性能和治理两个维度评估39个LLM在德国公共部门的应用，发现无单一模型全面领先，模型大小非质量可靠指标。

AI中文摘要

我们提出MÖVE（Modelle für die Öffentliche Verwaltung Evaluieren），一个用于评估德国公共部门背景下大语言模型（LLM）的整体基准。尽管LLM在公共管理中日益普及，但模型选择仍然很大程度上是临时的，现有基准提供的指导有限：它们主要面向英语、内容以美国为中心，并且只关注任务性能。MÖVE通过评估39个模型在两个互补维度上填补这些空白。性能标准涵盖摘要、问答和主题提取。治理标准评估幻觉倾向、能耗、提供商透明度、与德国宪法价值观的一致性以及对德国政党立场的知识。总共，我们使用了十个德语数据集，包括我们构建的反映公共管理领域的金标准和银标准数据集。我们采用多指标评估策略，结合经典NLP指标、基于嵌入的方法和LLM作为评判的方法。我们的结果表明，没有单一模型在所有标准上占主导地位：顶级表现者因任务而异，模型大小本身是质量的糟糕预测指标。我们进一步评估基准本身，分析其统计精度、LLM评判可靠性、私有数据集对模型排名的影响、结果对提示表述的敏感性以及能耗估计的有效性。MÖVE被设计为一个活跃开发中的动态基准；结果公开于此https URL。

英文摘要

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of German political parties' positions. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at this https URL.

2606.13108 2026-06-12 cs.CV 新提交

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

PP-OCRv6: 从1.5M到34.5M参数，在OCR任务上超越十亿级视觉语言模型

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

发表机构 * PaddlePaddle Team, Baidu Inc.（百度公司飞桨团队）

AI总结提出轻量级OCR系统PP-OCRv6，通过统一MetaFormer架构和结构化重参数化，在服务器到边缘设备上以少数量级参数超越十亿级VLM，中模型识别准确率83.2%，检测Hmean 86.2%。

AI中文摘要

视觉语言模型（VLM）在通用视觉语言任务上取得了令人印象深刻的结果，但在应用于专用OCR场景时，它们存在幻觉、定位不精确和计算成本过高的问题。本文提出PP-OCRv6，一个轻量级OCR系统，结合了架构创新和数据中心优化。PP-OCRv6围绕统一的MetaFormer风格构建块重新设计了骨干网络、检测颈和识别颈，采用结构化重参数化，将空间token混合与通道混合解耦，并通过任务特定的步长配置支持两个任务。三个模型层级（中、小、微）共享相同的构建块原语，覆盖从服务器到边缘的部署场景。在我们的内部基准测试中，PP-OCRv6_medium实现了83.2%的识别准确率和86.2%的检测Hmean，分别比PP-OCRv5_server高出+5.1%和+4.6%，同时以数量级更少的参数超越了Qwen3-VL-235B、GPT-5.5和Gemini-3.1-Pro。微层级在Intel Xeon CPU上实现了比PP-OCRv5_mobile快3.9倍的推理速度，同时保持相当的准确率。

英文摘要

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9× faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.
