arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.13161 2026-06-12 cs.DM cs.DS New submission

Exhaustive Generation of Genus-One Knot and Link Diagrams via Maps on the Torus

Alexander Omelchenko

AI summary: Proposes an algorithmic framework based on the theory of maps on surfaces that encodes torus projections as permutation pairs and exhaustively generates and tabulates genus-one knot and link diagrams. It is validated against published classifications for crossing numbers N <= 5, extended to N = 8, yields more than 33,000 types, and proves several structural properties.

Comments
In Proceedings GASCom 2026, arXiv:2606.09910

Abstract

We present an algorithmic framework for the exhaustive generation and tabulation of knot and link diagrams on the thickened torus T^2 x I, based on the theory of maps on surfaces. Cellular 4-regular torus projections are encoded by permutation pairs (alpha, sigma), and unsensed equivalence classes are enumerated completely and without duplication via canonical representatives. Crossing assignments, local diagram-level reductions, and the generalized Kauffman-type bracket are formulated entirely within the same permutation model. The pipeline is validated against published genus-one classifications for crossing numbers N <= 5 and then extended to N = 6, 7, 8, producing, to our knowledge, the first complete genus-one tabulation at these crossing numbers under the stated comparison conventions. The resulting dataset contains more than 33,000 knot and link types. Besides the tables, the computation yields proved structural facts, including a parity statement for the a-span of the bracket and a sharp upper bound N-1 for the number of bigon faces in a 4-regular torus map. It also suggests several conjectures, among them a formula for the maximum number of straight-ahead components, the absence of equi-quadrilateral knot projections, and a 4N upper bound for the genus-one bracket span.
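
As a hedged illustration of the permutation encoding (a minimal sketch, not the paper's implementation), one common convention labels the 4N darts of a 4-regular map 0..4N-1, lets sigma rotate the darts around each vertex, and lets alpha be the fixed-point-free involution pairing the two darts of each edge. Faces are then the cycles of the composition, and Euler's formula gives the genus. The function names, the composition order, and the two one-vertex examples are assumptions made for illustration; the paper's own conventions may differ.

```python
def cycles(perm):
    """Return the cycles of a permutation given as a list (i -> perm[i])."""
    seen, out = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = perm[x]
        out.append(cyc)
    return out


def genus(sigma, alpha):
    """Genus of a connected orientable map encoded by (alpha, sigma).

    Assumes the map is connected; vertices are cycles of sigma,
    edges are cycles of alpha, faces are cycles of "alpha, then sigma".
    """
    phi = [sigma[alpha[d]] for d in range(len(sigma))]
    v, e, f = len(cycles(sigma)), len(cycles(alpha)), len(cycles(phi))
    chi = v - e + f  # Euler characteristic
    return (2 - chi) // 2


# One 4-valent vertex (N = 1) with darts 0, 1, 2, 3 in rotation order.
sigma = [1, 2, 3, 0]
torus_alpha = [2, 3, 0, 1]   # opposite darts glued: one face, genus 1
sphere_alpha = [1, 0, 3, 2]  # adjacent darts glued: three faces, genus 0

print(genus(sigma, torus_alpha), genus(sigma, sphere_alpha))
```

For a 4-regular map with N vertices there are 2N edges, so the torus condition V - E + F = 0 forces exactly N faces.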

2606.13160 2026-06-12 cs.DS cs.DM New submission

(Un)ranking Permutation Classes

Nathanaël Hassler (LIB, Université Bourgogne Europe), Vincent Vajnovszki (LIB, Université Bourgogne Europe)

AI summary: Presents ranking and unranking methods, in lexicographic or colexicographic order, for permutations avoiding a pattern of length three.

Comments
In Proceedings GASCom 2026, arXiv:2606.09910

Abstract

Permutations avoiding a pattern of length three are enumerated by the Catalan numbers. In this work, we present methods for ranking and unranking such permutations in lexicographic or colexicographic order.
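
To make the terms concrete, the following is a brute-force reference sketch of ranking and unranking in lexicographic order. It enumerates all permutations, so it is exponential and is not the method of the paper, which is not described in this abstract; the function names are hypothetical.

```python
from itertools import combinations, permutations


def contains(perm, pattern):
    """True if perm has a subsequence order-isomorphic to pattern."""
    k = len(pattern)
    for idx in combinations(range(len(perm)), k):
        vals = [perm[i] for i in idx]
        if all((vals[a] < vals[b]) == (pattern[a] < pattern[b])
               for a in range(k) for b in range(a + 1, k)):
            return True
    return False


def avoiders(n, pattern):
    """Pattern-avoiding permutations of 1..n in lexicographic order.

    itertools.permutations yields tuples in lexicographic order
    when its input is sorted, so the filtered list stays sorted.
    """
    return [p for p in permutations(range(1, n + 1))
            if not contains(p, pattern)]


def rank(perm, pattern):
    """Position of perm among the avoiders of the same length."""
    return avoiders(len(perm), pattern).index(tuple(perm))


def unrank(r, n, pattern):
    """Inverse of rank: the r-th avoider of length n."""
    return avoiders(n, pattern)[r]


# The counts match the Catalan numbers 1, 2, 5, 14, 42.
print([len(avoiders(n, (1, 3, 2))) for n in range(1, 6)])
```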

2606.13158 2026-06-12 cs.DM cs.CG New submission

On the Counting Sequence of Z-convex Polyominoes

Luca Castelli (University of Insubria), Paolo Massazza (University of Insubria)

AI summary: Studies the enumeration of convex polyominoes with degree of convexity at most 2 and uses a C++ program to compute the longest counting sequence known to date.

Comments
In Proceedings GASCom 2026, arXiv:2606.09910

Abstract

The degree of convexity of a convex polyomino P is the smallest integer k such that any two cells of P can be joined by a monotone path inside P with at most k changes of direction. In this paper, we present a set of formulas and equations that form the basis of a C++ program that computes the longest counting sequence known to date (with respect to the area) of convex polyominoes of degree of convexity at most 2.

2606.13156 2026-06-12 cs.CV cs.AI New submission

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Animesh Tripathy, Aswanth Krishnan

Affiliations: QpiAI India Pvt. Ltd

AI summary: Proposes the Iterative Visual Thinking (IVT) framework, which gives vision-language models spatial self-correction through a closed visual-feedback loop and two-phase training (SFT + GRPO), improving metrics by 2.4 to 3.2 percentage points on a mixed benchmark drawn from three datasets.

Abstract

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31-percentage-point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
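
The IoU reward and the Acc@0.5 / Acc@0.7 / Acc@0.9 metrics can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) and counts a prediction as correct when its IoU is at least the threshold; the box format, the inclusive comparison, and the function names are assumptions, not details taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at(preds, targets, threshold):
    """Fraction of predictions whose IoU with the target reaches threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)


# Made-up boxes: one exact match, one partial overlap (IoU = 1/7).
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(acc_at(preds, targets, 0.5))
```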

2606.13155 2026-06-12 cs.DM New submission

Snake Polyominoes of Maximal Area in a Rectangle

Alexandre Blondin Massé (LACIM, Université du Québec à Montréal), Alain Goupil (LACIM, Université du Québec à Trois-Rivières)

AI summary: Presents a generation algorithm for snake polyominoes in a discrete rectangle and gives exact formulas for the maximal area when the height is at most 5.

Comments
In Proceedings GASCom 2026, arXiv:2606.09910

Abstract

Given a discrete rectangle R of dimensions h x w, let W be the set of snake-like polyominoes contained in R represented as binary matrices, i.e. polyominoes whose underlying simple graph is a chain with respect to the 4-adjacency relation. We present an algorithm that generates W for any h and w. Also, let a be the maximal area that can be realized by an element of W. We provide exact formulas for a when h <= 5, for any w.

2606.13152 2026-06-12 cs.DM New submission

Fibonacci and Catalan Numbers Meet in Staircase Polyominoes

Jean-Luc Baril (LIB, Université Bourgogne Europe, Dijon, France), José Luis Ramírez (Departamento de Matemáticas, Universidad Nacional de Colombia, Bogotá, Colombia), Samuel Ramírez (Departamento de Matemáticas, Universidad Nacional de Colombia, Bogotá, Colombia), Diego Villamizar (Department of Mathematics, Xavier University of Louisiana, New Orleans, LA)

AI summary: Studies generating functions of Fibonacci (staircase) polyominoes, obtaining explicit closed forms through a catalytic functional equation and the kernel method, with coefficients involving Catalan numbers.

Comments
In Proceedings GASCom 2026, arXiv:2606.09910

Abstract

We study Fibonacci (staircase) polyominoes, a class of column-convex polyominoes whose lower boundary is a staircase with unit vertical steps. We derive multivariate generating functions that refine Turban's Fibonacci-number enumeration by tracking additional perimeter and area parameters. The proofs use a catalytic functional equation and, in a perimeter specialization, the kernel method, leading to explicit closed forms and Catalan-number coefficient formulas.

2606.13151 2026-06-12 cs.DS cs.DM New submission

Random Generation of $k$-coloured Motzkin Paths

Elena Barcucci (University of Florence), Antonio Bernini (University of Florence), Stefano Bilotta (University of Florence), Renzo Pinzani (University of Florence)

AI summary: Studies k-coloured Motzkin paths, examining analytically and combinatorially their connection with the number of prefixes ending at odd height, and gives a linear-time random generation algorithm.

Comments
In Proceedings GASCom 2026, arXiv:2606.09910

Abstract

We study k-coloured Motzkin paths, namely Motzkin paths in which horizontal steps can be coloured in k different ways, and investigate their connection with the number of prefixes ending at odd height from both an analytical and a combinatorial point of view. Moreover, the combinatorial approach provides a random generation algorithm for k-coloured Motzkin paths in linear time.
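
The definition can be checked with a small counting sketch. It counts k-coloured Motzkin paths of length n by dynamic programming over the current height; it only illustrates the objects being generated and is not the linear-time random generator of the paper. The function name is hypothetical.

```python
def coloured_motzkin(n, k):
    """Number of Motzkin paths of length n whose flat steps have k colours."""
    # ways[h] = number of valid prefixes of the current length ending at height h
    ways = [1] + [0] * n
    for _ in range(n):
        nxt = [0] * (n + 1)
        for h, w in enumerate(ways):
            if w == 0:
                continue
            nxt[h] += k * w          # horizontal step, k possible colours
            if h + 1 <= n:
                nxt[h + 1] += w      # up step
            if h > 0:
                nxt[h - 1] += w      # down step, never below the axis
        ways = nxt
    return ways[0]                   # full paths must return to height 0


print([coloured_motzkin(n, 1) for n in range(6)])  # Motzkin numbers
print([coloured_motzkin(n, 2) for n in range(5)])  # shifted Catalan numbers
```

With k = 1 the counts are the Motzkin numbers, and with k = 2 they are the Catalan numbers shifted by one index, a classical fact that makes a convenient sanity check.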

2606.13148 2026-06-12 cs.AI New submission

TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?

Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan

Affiliations: Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes the TerraBench benchmark, built on the TerraAgent framework, which couples LLM planning with scientific tools for interactive reasoning across gridded data, satellite imagery, geospatial context, and simulators. It comprises 403 tasks and 24,500 execution steps.

Abstract

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning, and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.

2606.13145 2026-06-12 cs.IR New submission

The Clustering Strikes Back: Building Cost-Effective and High-Performance ANNS at Scale with Helmsman

Yuchen Huang, Baiteng Ma, Yiping Sun, Yang Shi, Xiao Chen, Xiaocheng Zhong, Zhiyong Wang, Yao Hu, Erci Xu, Chuliang Weng

AI summary: To address the high cost of in-memory ANNS, proposes Helmsman, a clustering-based ANNS system on all-flash servers that uses a userspace storage stack, leveling-learned pruning, and GPU-accelerated construction to save over 90% of hardware costs and rebuild billion-scale indexes within hours.

Comments
Accepted by OSDI'26

Abstract

RedNote (a.k.a. Xiaohongshu, a global-scale social network platform) widely adopts approximate nearest neighbor search (ANNS) to power its search, recommendation, and advertising services. Due to demanding Service Level Agreements (SLAs), we have to rely on in-memory graph-based ANNS (i.e., HNSW) to provide high throughput and low latency. However, the ever-growing user base and content volume have led to an explosive increase in memory footprint and consequently huge CapEx and OpEx. After exploring various alternatives, we find that building a clustering-based ANNS on top of all-flash servers can be promising. Yet, we still experience severe overheads from the kernel I/O stack, a fixed pruning strategy, and slow index construction. We present HELMSMAN, a high-performance and cost-effective clustering-based ANNS system, which combines an ANNS-oriented userspace storage stack, a leveling-learned pruning module, and GPU-accelerated construction pipelines. HELMSMAN saves over 90% of hardware costs and enables billion-scale index (re)builds within hours. In the current production deployment, which has operated stably for several months, 40 machines now host ANNS workloads that previously required about 35,000 cores and 0.35 PB DRAM.

2606.13142 2026-06-12 cs.CL New submission

HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue

Sangwon Youn, Yoonjin Jang, Youngjoong Ko

Affiliations: Sungkyunkwan University

AI summary: Proposes the HyPE framework, which parses persona text into quadruples, builds a hypergraph, and encodes high-order relations with HyperGCN and Persistent Edge Embeddings (PEE), outperforming sentence-level pooling baselines on PersonaChat.

Comments
11 pages, 2 figures, 4 tables

Abstract

Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.
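
The hypergraph construction in step (ii) can be sketched as a grouping of quadruples by category, where each category becomes one hyperedge over the indices of the persona elements that carry it. This is a minimal illustration only: the example quadruples and the function name are made up, and the paper's actual parsing and graph representation are not given in this abstract.

```python
from collections import defaultdict


def build_hyperedges(quadruples):
    """Map each category label to the indices of the elements sharing it."""
    edges = defaultdict(list)
    for idx, (_core, _expression, _sentiment, category) in enumerate(quadruples):
        edges[category].append(idx)
    return dict(edges)


# Hypothetical (Core, Expression, Sentiment, Category) quadruples.
persona = [
    ("dog", "I have a dog", "positive", "pets"),
    ("cat", "I dislike cats", "negative", "pets"),
    ("guitar", "I play guitar", "positive", "hobbies"),
]

print(build_hyperedges(persona))
```

Unlike an ordinary graph edge, a hyperedge may join any number of nodes, which is what lets one category tie together several persona sentences at once.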

2606.13141 2026-06-12 cs.AI New submission

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?

Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang, Juntae Lee, Kyuwoong Hwang, Fatih Porikli, Hwanjun Song

Affiliations: Department of Computer Science, Cranberry-Lemon University

AI summary: To address single-configuration retrieval and benchmark flaws in video retrieval-augmented generation, proposes the V-RAGBench benchmark and the CARVE method, which uses chunk-adaptive reranking to interleave evidence from multiple configurations and outperforms eight recent VideoRAG baselines.

Abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.

2606.13140 2026-06-12 cs.SI New submission

MIDSim: Simulating Multi-Channel Information Diffusion in Social Media with LLM-Powered Multi-Agent System

Lexi Liu, Qi Cao, Yuanhao Liu, Huawei Shen, Xueqi Cheng

AI summary: Proposes an LLM-powered multi-agent system that jointly models social and algorithmic exposure streams to simulate multi-channel information diffusion, outperforming baselines on three real-world datasets.

Abstract

Information diffusion in social media shapes public opinion and collective behavior, making its modeling and simulation an important research problem. Existing studies have investigated information diffusion through epidemic-based, cascade-based, and point process models. However, they predominantly focus on diffusion through social links, overlooking other diffusion channels enabled by platform algorithms (e.g., recommender systems) and failing to capture user behavioral complexity. To address these limitations, we propose an LLM-powered multi-agent system for simulating multi-channel information diffusion, where large language models instantiate personalized user agents and the diffusion process jointly models social and algorithmic exposure streams. We further construct three real-world diffusion datasets spanning Sina Weibo, RedNote, and Twitter, containing diffusion records, user profiles, historical posts, and social relationships. Experimental results on real diffusion events show that our proposed framework realistically simulates macro diffusion phenomena and generates diverse comment content, significantly outperforming baselines.

2606.13138 2026-06-12 cs.OH New submission

The Limits of Time

B. Biira, Amelia Lee Doğan

AI summary: Through a systematic literature review, identifies five types of engagement with time and temporality in the LIMITS community and argues that explicit attention to concepts of time would enrich LIMITS research.

Abstract

The LIMITS community was founded to foster conversations that move away from growth-oriented visions and values in computing toward a focus on long-term well-being. This orientation, we argue, inherently engages questions of time and temporality. Prior work has shown that temporal frameworks shape how futures are imagined, which problems are understood to be worth attending to, and which solutions or alternatives are pursued. We begin this paper with the authors' observations of time in their lived experience, and then extend these observations to the LIMITS community. Through a systematic literature review of the last decade of LIMITS scholarship, we identify ways that explicit attention to how concepts of time and temporality are understood would enrich LIMITS scholarship. Within the LIMITS scholarship that does engage with time, we identify five recurring types of temporal engagement: computing time, methodological and design time, politics and ethics of time, biological and ecological time, and afterlife and waste time. Together, these engagement types highlight how implicit assumptions about time are embedded across research practices, design approaches, and accounts of technological impact within LIMITS work. We discuss these findings in relation to cross-disciplinary scholarship that takes time as an analytic concern and consider how these patterns point to a broader need for more explicit, plural, and situated engagements with time in the LIMITS community, and why this matters for the community's commitments.

2606.13136 2026-06-12 cs.CV cs.LG eess.IV New submission

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

Saurabh Kumar, Nutan Sairam Yenneti

Affiliations: Samsung Research Institute Bangalore

AI summary: Proposes a modular unified architecture that, with a learning-free CFA-identification module and a lightweight design, demosaics a range of pixel-bin sensors, improving image quality while reducing resource consumption.

Abstract

Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

2606.13135 2026-06-12 cs.CV cs.AI New submission

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

Affiliations: Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS); Orel Oncological Dispensary

AI summary: Compares four deep learning architectures for classifying dermoscopic images, proposes a two-stage cascade whose tunable triage threshold gives control over sensitivity, and quantifies the generalization gap on external clinical datasets.

Comments
28 pages, 8 figures, 10 tables

Abstract

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
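
The difference between argmax classification and a cascade with a tunable triage threshold can be sketched as follows. This is a toy illustration under stated assumptions: the probabilities are made up, the function names are hypothetical, and the comparison `p >= threshold` is one possible convention rather than the authors' exact rule.

```python
def cascade_predict(p_malignant, subtype_probs, threshold=0.5):
    """Stage 1: binary triage; stage 2: pick MEL/SCC/BCC if flagged."""
    if p_malignant < threshold:
        return "benign"
    return max(subtype_probs, key=subtype_probs.get)


def sensitivity(p_malignant_list, labels, threshold):
    """Fraction of truly malignant cases (label 1) flagged by triage."""
    positives = [p for p, y in zip(p_malignant_list, labels) if y == 1]
    return sum(p >= threshold for p in positives) / len(positives)


# Made-up triage scores; the first four lesions are malignant.
scores = [0.9, 0.6, 0.35, 0.2, 0.4, 0.1]
labels = [1, 1, 1, 1, 0, 0]

for thr in (0.5, 0.3):
    print(thr, sensitivity(scores, labels, thr))

subtypes = {"MEL": 0.7, "SCC": 0.2, "BCC": 0.1}
print(cascade_predict(0.4, subtypes, threshold=0.5))
print(cascade_predict(0.4, subtypes, threshold=0.3))
```

Lowering the threshold sends more lesions to the second stage, which raises sensitivity at the cost of more false alarms; a single-stage argmax offers no such dial.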

2606.13133 2026-06-12 cs.DS cs.LG New submission

Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling

Kaito Baba, Evripidis Bampis, Giorgos Mitropoulos

AI summary: Proposes a learning-augmented algorithm for makespan scheduling on unrelated machines that uses predictions of heavy job assignments to achieve a $(1+\varepsilon)$-approximation when predictions are accurate, degrading to a 2-approximation as the error grows.

Comments
22 pages, 3 figures

Abstract

Recently, Antoniadis et al. (ICLR 2025) proposed a framework for incorporating predictions to approximate NP-hard selection problems. Despite its simplicity, this approach tightly matches theoretical lower bounds, making its generalization highly compelling. We address an open question raised in the work of Antoniadis et al., concerning the extension of this approach to other important problems outside the class of selection problems, such as scheduling. We develop a learning-augmented algorithm for the makespan minimization problem on unrelated machines, denoted by $R\|C_{\max}$. By using predictions of heavy job assignments, we achieve a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions that smoothly degrades to a worst-case 2-approximation as the error increases. We conclude our work with an empirical analysis of our method.
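
For readers unfamiliar with $R\|C_{\max}$, the objective can be stated in a few lines: each job has a machine-dependent processing time, and the makespan is the largest machine load. The sketch below defines the objective and finds an optimum by exhaustive search on a tiny made-up instance. It does not implement the learning-augmented algorithm of the paper; the instance and function names are assumptions.

```python
from itertools import product


def makespan(p, assignment):
    """Maximum machine load; p[i][j] is the time of job j on machine i."""
    loads = [0] * len(p)
    for job, machine in enumerate(assignment):
        loads[machine] += p[machine][job]
    return max(loads)


def brute_force_opt(p):
    """Try all m**n assignments; only feasible for tiny instances."""
    machines, jobs = len(p), len(p[0])
    return min(product(range(machines), repeat=jobs),
               key=lambda a: makespan(p, a))


# Two machines, three jobs (made-up processing times).
p = [
    [2, 3, 4],  # machine 0
    [5, 1, 2],  # machine 1
]
best = brute_force_opt(p)
print(best, makespan(p, best))
```

The exhaustive search is exponential in the number of jobs, which is why approximation algorithms, with or without predictions, matter for this problem.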

2606.13127 2026-06-12 cs.CV New submission

Fully Distributed Multi-View 3D Tracking in Real-Time

Byron Hernandez, Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros

Affiliations: University of Florida; NVIDIA Corporation

AI summary: Proposes MV3DT, a fully distributed framework that achieves real-time multi-view 3D tracking through peer-to-peer coordination without central aggregation, reaching 94.3% IDF1 and 93.3% MOTA on WILDTRACK and sustaining 30 FPS on 100 cameras.

Comments
18 pages, 4 figures, 2 algorithms, 4 tables

Abstract

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 94.3% IDF1 and 93.3% MOTA on WILDTRACK, competitive with state-of-the-art centralized methods, while demonstrating superior scalability by sustaining 30 FPS on 100 cameras with less than 10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

2606.13126 2026-06-12 cs.LG cs.AI cs.CL New submission

MiniPIC: Flexible Position-Independent Caching in <100LOC

Nathan Ordonez (1), Thomas Parnell (1) ((1) IBM Research)

Affiliations: IBM Research

AI summary: Proposes MiniPIC, which uses a positional-encoding-free KV cache and user-controlled cache-reuse primitives to implement multiple position-independent caching methods in vLLM, substantially improving prefill throughput and reducing time-to-first-token.

Comments
13 pages, 5 figures

Abstract

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
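
Why storing unrotated K vectors makes a cache entry reusable at any position can be seen in a small pure-Python sketch of rotary position embedding (RoPE). The rotation is applied when the key is read, using whatever logical position the current request assigns to it, and the attention score then depends only on the relative offset between query and key. This is an illustrative toy, not vLLM or MiniPIC code: the adjacent-pair layout, the base of 10000, and the function names are assumptions.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
cached_k = [1.0, 0.4, -0.7, 0.2]  # stored without any rotation

# The same cached key is placed at logical position 3 in one request
# and at position 10 in another; the query sits two tokens later.
score_a = dot(rope(q, 5), rope(cached_k, 3))
score_b = dot(rope(q, 12), rope(cached_k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Had the key been cached after rotation at position 3, reusing it at position 10 would require undoing and redoing the rotation, which is the coupling a positional-encoding-free cache avoids.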

2606.13125 2026-06-12 cs.LG cs.AI New submission

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

选择与改进:理解推理后训练的机制

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

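Affiliations: Microsoft Research NYC; UIUC (University of Illinois Urbana-Champaign)

AI summary: Controlled experiments show that reinforcement-learning post-training improves reasoning through two mechanisms, strategy selection and strategy improvement, and that SFT data and RL data play different roles in activating them.

Details
AI Chinese abstract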

强化学习已迅速成为推理和编码模型训练的关键组成部分,但从机制角度理解仍不足。我们研究通过强化学习后训练如何以及通过哪些底层过程获取或增强能力。基于Qwen-2.5-1.5B的受控数学推理实验分析揭示了两种核心机制:策略选择和策略改进。我们的结果强调了SFT数据和强化学习数据在激活这些机制中的作用,特别展示了监督模型使用多种推理策略如何实现策略选择,以及增加强化学习数据难度如何实现策略改进。综合来看,我们的结果为RL训练提供了机制性见解,并提出了继续扩展推理能力的实用干预措施。

English abstract

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

2606.13121 2026-06-12 cs.CL cs.AI cs.SD new submission

NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation

NaturalFlow: 减少同步语音到语音翻译中破坏自然语音流的停顿

Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

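Affiliations: IPAI and ECE, Seoul National University; Department of AI, University of Seoul

AI summary: Proposes a fluency-aware optimization framework that minimizes inter-chunk silences using model-internal signals (such as linguistic diversity and temporal variability in speech durations), seeking a balance between the low latency of simultaneous translation and the natural flow of consecutive translation.

Details
Comments
Proceedings of the 26th Interspeech Conference, Long Paper
AI Chinese abstract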

同步语音到语音翻译旨在通过最小化延迟实现近实时通信,为连续翻译的高延迟提供了一种引人注目的实时替代方案。然而,过度追求低延迟往往会导致碎片化的块状语音。因此,听众会遭受不自然的声学流,其中频繁的停顿可能会增加他们的认知负荷。为了弥补这一差距,我们引入了一个流畅性感知优化框架,旨在发现同步翻译的低延迟优势与连续翻译的自然流畅之间的最佳平衡点。我们的框架通过利用模型内部信号(包括语言多样性和语音时长的诱导时间变异性)来最小化块间静音。在短文本和长文本基准上的实验表明,我们的框架在保持竞争性延迟和翻译质量的同时,产生了自然的语音流。

English abstract

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

2606.13120 2026-06-12 cs.CL new submission

EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge

EvoBrowseComp: 基于演化知识的搜索智能体基准测试

Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

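Affiliations: Northeastern University, China; Weixin AI, Tencent Inc, China

AI summary: Proposes EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions generated automatically via live-web traversal, for evaluating the genuine browsing ability of search agents in a dynamic knowledge environment.

Details
Comments
14 pages, under review
AI Chinese abstract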

搜索智能体——即增强搜索工具的大型语言模型——加剧了对未来验证基准的需求。现有的基准如BrowseComp依赖静态知识,容易受到测试集污染和参数记忆的影响。因此,模型可以通过事实回忆而非真正检索获得高分,通过推理捷径掩盖真实的浏览能力。在本文中,我们介绍EvoBrowseComp,一个包含400道英文和400道中文无污染复杂问题的演化基准,通过实时网络遍历合成。为了收集这些问题,我们设计了一个三智能体协作框架:(1)QA合成智能体,从实时网络中检索新鲜知识以合成问答对;(2)信息过滤智能体,根据可信度和流行度过滤检索到的知识,以阻断参数捷径;(3)高级指导智能体,将问题形式化为推理图,以减少合成问答对中的逻辑冗余和捷径。由于该框架支持全自动合成,EvoBrowseComp可以定期更新以防止数据污染并保持时间新鲜度。大量实验证实了其高难度,需要广泛的横向搜索。它为自动更新、高难度的基准测试建立了一个可扩展的范式,与不断发展的世界知识和不断进步的智能体能力保持同步。

English abstract

Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

2606.13119 2026-06-12 cs.LG cs.AI cs.NE new submission

MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

MP3:面向时空预测的多周期模式预训练

Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

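Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; Eindhoven University of Technology

AI summary: Targets the temporal-mirage problem caused by short-window inputs in spatio-temporal data, proposing MP3, a multi-period pattern pre-training plugin that improves the forecasting performance of existing STGNNs through multi-period temporal modeling, spatial modeling, and cross-period causal interaction.

Details
AI Chinese abstract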

时空预测在交通、气候和能源等多个领域至关重要。城市时空数据表现出时间幻象:相似的短窗口输入具有不同的未来趋势,反之亦然。现有的时空图神经网络(STGNN)无法有效识别此类幻象。我们认为核心原因在于短窗口输入具有不完整的周期观测、异质的全局空间相关性和跨周期叠加因果性。为弥补这一差距,我们开发了一种新颖的多周期模式预训练(MP3),这是一种用于区分时间幻象的即插即用预训练插件。MP3提出了两项核心创新:(1)多周期模式学习旨在从长时间序列中学习多周期模式。具体地,多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模使用瓶颈投影和全局记忆库来高效捕获异质的全局空间关系。跨周期模式交互采用因果增强的Transformer来捕获不同周期模式之间的依赖关系。(2)该插件可以无缝集成到现有的STGNN骨干中,以增强其预测性能。在五个真实世界数据集(包括大规模数据集CA)上的五个STGNN基线实验验证了MP3的有效性、优越的可扩展性和强适应性,其在所有评估基线上带来了一致且稳健的性能提升。平均而言,MP3将MAE降低了4.7%,RMSE降低了5.0%。代码可在此https URL获取。

English abstract

Spatio-temporal forecasting is crucial in diverse fields such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirages: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason is that short-window inputs have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) Multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) The plugin integrates seamlessly into existing STGNN backbones to strengthen their forecasting performance. Experiments with five STGNN baselines across five real-world datasets (including the large-scale dataset CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at this https URL.
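For readers unfamiliar with the two reported metrics, the following minimal sketch shows how MAE, RMSE, and a relative reduction between a backbone and a plugin-augmented model could be computed. The function names and all numbers are made up for illustration; they are not results or code from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def relative_reduction(baseline, improved):
    """Fractional error reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

# Made-up values, not data from the paper.
y_true = [10.0, 12.0, 14.0, 16.0]
backbone_pred = [11.0, 10.0, 15.0, 19.0]
plugin_pred = [10.5, 11.0, 14.5, 18.0]

mae_gain = relative_reduction(mae(y_true, backbone_pred), mae(y_true, plugin_pred))
rmse_gain = relative_reduction(rmse(y_true, backbone_pred), rmse(y_true, plugin_pred))
print(f"MAE reduction: {mae_gain:.1%}, RMSE reduction: {rmse_gain:.1%}")
```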

2606.13115 2026-06-12 cs.CL cs.AI new submission

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

G-Long:面向高效长期对话代理的图增强记忆管理

Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

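Affiliations: Sungkyunkwan University

AI summary: Proposes the G-Long framework, which uses a fine-tuned small language model for structured triplet extraction and associative retrieval and introduces an attention-aware importance scoring mechanism, achieving state-of-the-art performance in response generation and memory retrieval while reducing computational overhead.

Details
Comments
22 pages, 8 figures, 14 tables
AI Chinese abstract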

尽管大型语言模型(LLMs)推动了开放域对话系统的发展,但由于长上下文推理的固有限制以及处理大量原始文本的低效性,保持长期一致性仍然是一个挑战。现有方法通常依赖于非结构化记忆存储(容易导致信息丢失)或计算成本高昂的LLMs(导致高延迟)。为了解决这些限制,我们提出了G-Long,一个图增强框架,利用微调的小语言模型(sLM)进行结构化三元组提取和关联检索,显著降低了运营成本。此外,我们引入了新颖的注意力感知重要性评分机制,利用T5摘要器的内在交叉注意力信号来识别显著记忆。跨多个基准的大量实验表明,G-Long在响应生成和记忆检索方面均达到了最先进的性能,在MSC上响应质量提升高达9.8%,在LME上检索召回率提升高达40.8%,同时显著降低了计算开销。

English abstract

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly reducing computational overhead.

2606.13113 2026-06-12 eess.SY cs.RO new submission

MPC for underactuated spacecraft control with a Lyapunov supervised physics-informed neural network correction layer

基于李雅普诺夫监督的物理信息神经网络校正层的欠驱动航天器MPC控制

Amirhossein Ayanmanesh Motlaghmofrad, Carlo Cena, Mauro Martini, Marcello Chiaberge

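AI summary: For attitude control of underactuated spacecraft, proposes a hierarchical architecture combining nonlinear model predictive control, a physics-informed neural network, and a Lyapunov-based supervisory mechanism, reducing steady-state error under uncertainty while maintaining robustness.

Details
Comments
Accepted at SPAICE (AI in and for Space) 2026
AI Chinese abstract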

欠驱动航天器面临可控性限制和对环境干扰的高度敏感性,使得姿态机动和稳定复杂化。由于沿欠驱动轴缺乏控制能力,传统控制器无法直接稳定所有姿态分量,因此需要参考规划策略。此外,MPC方法对惯性不确定性和未建模动态耦合仍然敏感,导致在失配下跟踪性能下降。为解决这些问题,我们考虑一种集成三层的分层架构:(i) 非线性模型预测控制器(NMPC),用于约束和欠驱动感知的机动规划以及在执行器限制下的标称闭环稳定性;(ii) 物理信息神经网络(PINN),在仿真数据上离线训练以估计残余干扰力矩,其损失项强制执行与刚体旋转动力学的一致性;(iii) 基于李雅普诺夫的监督安全机制,在线评估学习到的校正并限制或抑制其影响,以保持基线控制器的稳定性特性。该架构在模拟反作用轮动力学、执行器饱和及环境干扰的高保真仿真环境中进行评估。蒙特卡洛研究表明,与独立NMPC相比,稳态姿态误差有统计显著的降低,同时在不确定性下保持鲁棒行为。监督层确保当基于学习的增强不可靠时,能够优雅地退化到纯模型控制。

English abstract

Underactuated spacecraft face controllability limitations and heightened sensitivity to environmental disturbances, complicating attitude maneuvering and stabilization. Due to the lack of control authority along the underactuated axis, conventional controllers cannot directly stabilize all attitude components and therefore require reference planning strategies. Furthermore, MPC approaches remain sensitive to inertia uncertainty and unmodeled dynamic couplings, resulting in degraded tracking performance under mismatch. To address these issues, we consider a hierarchical architecture integrating three layers: (i) a nonlinear model predictive controller (NMPC) for constraint- and underactuation-aware maneuver planning and nominal closed-loop stability under actuator limits; (ii) a physics-informed neural network (PINN) trained offline on simulation data to estimate residual disturbance torques, with loss terms that enforce consistency with rigid-body rotational dynamics; (iii) a Lyapunov-based supervisory safety mechanism that evaluates the learned correction online and bounds or suppresses its influence to preserve the stability properties of the baseline controller. The architecture is evaluated in a high-fidelity simulation environment modelling reaction wheel dynamics, actuator saturation, and environmental disturbances. Monte Carlo studies show statistically significant reductions in steady-state attitude error relative to standalone NMPC while maintaining robust behavior under uncertainty. The supervisory layer ensures graceful degradation to purely model-based control when the learning-based augmentation is unreliable.
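The supervisory idea in layer (iii), bounding or suppressing a learned correction so that it cannot degrade the baseline controller, can be sketched on a toy problem. The code below is an illustration under strong assumptions and is not the paper's mechanism: the scalar error dynamics `step`, the quadratic candidate `lyapunov`, the scaling schedule, and the function name `supervise` are all hypothetical.

```python
def lyapunov(x):
    """Toy quadratic Lyapunov candidate V(x) = x^2 (assumed for illustration)."""
    return x * x

def step(x, u, dt=0.1):
    """Hypothetical scalar error dynamics: x_next = x + dt * u."""
    return x + dt * u

def supervise(x, u_base, u_corr, scales=(1.0, 0.5, 0.25)):
    """Return the largest scaled correction that does not worsen V versus the baseline."""
    v_base = lyapunov(step(x, u_base))
    for s in scales:
        u = u_base + s * u_corr
        if lyapunov(step(x, u)) <= v_base:
            return u, s  # correction accepted, possibly bounded by s < 1
    return u_base, 0.0   # correction suppressed: fall back to the baseline command

x = 1.0        # current error
u_base = -2.0  # baseline (model-based) command
helpful = supervise(x, u_base, u_corr=-1.0)  # pushes the error further toward zero
harmful = supervise(x, u_base, u_corr=5.0)   # would increase the error
print(helpful, harmful)
```

In this toy setting the helpful correction is applied in full, while the harmful one is rejected at every scale and the baseline command is used unchanged, which mirrors the graceful-degradation behavior described in the abstract.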

2606.13111 2026-06-12 cs.CL new submission

MÖVE: A Holistic LLM Benchmark for the German Public Sector

MÖVE:德国公共部门的大语言模型整体基准

Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland

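Affiliations: Innovations Department, Bundesdruckerei GmbH

AI summary: Proposes the MÖVE benchmark, which evaluates 39 LLMs for German public-sector use along performance and governance dimensions, finding that no single model leads across the board and that model size is not a reliable indicator of quality.

Details
AI Chinese abstract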

我们提出MÖVE(Modelle für die Öffentliche Verwaltung Evaluieren),一个用于评估德国公共部门背景下大语言模型(LLM)的整体基准。尽管LLM在公共管理中日益普及,但模型选择仍然很大程度上是临时的,现有基准提供的指导有限:它们主要面向英语、内容以美国为中心,并且只关注任务性能。MÖVE通过评估39个模型在两个互补维度上填补这些空白。性能标准涵盖摘要、问答和主题提取。治理标准评估幻觉倾向、能耗、提供商透明度、与德国宪法价值观的一致性以及对德国政党立场的知识。总共,我们使用了十个德语数据集,包括我们构建的反映公共管理领域的金标准和银标准数据集。我们采用多指标评估策略,结合经典NLP指标、基于嵌入的方法和LLM作为评判的方法。我们的结果表明,没有单一模型在所有标准上占主导地位:顶级表现者因任务而异,模型大小本身是质量的糟糕预测指标。我们进一步评估基准本身,分析其统计精度、LLM评判可靠性、私有数据集对模型排名的影响、结果对提示表述的敏感性以及能耗估计的有效性。MÖVE被设计为一个活跃开发中的动态基准;结果公开于此https URL。

English abstract

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of the positions of German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at this https URL.

2606.13108 2026-06-12 cs.CV new submission

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

PP-OCRv6: 从1.5M到34.5M参数,在OCR任务上超越十亿级视觉语言模型

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

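Affiliations: PaddlePaddle Team, Baidu Inc.

AI summary: Proposes PP-OCRv6, a lightweight OCR system that uses a unified MetaFormer-style architecture and structural reparameterization to surpass billion-scale VLMs with orders of magnitude fewer parameters across server-to-edge deployments; the medium model reaches 83.2% recognition accuracy and 86.2% detection Hmean.

Details
AI Chinese abstract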

视觉语言模型(VLM)在通用视觉语言任务上取得了令人印象深刻的结果,但在应用于专用OCR场景时,它们存在幻觉、定位不精确和计算成本过高的问题。本文提出PP-OCRv6,一个轻量级OCR系统,结合了架构创新和数据中心优化。PP-OCRv6围绕统一的MetaFormer风格构建块重新设计了骨干网络、检测颈和识别颈,采用结构化重参数化,将空间token混合与通道混合解耦,并通过任务特定的步长配置支持两个任务。三个模型层级(中、小、微)共享相同的构建块原语,覆盖从服务器到边缘的部署场景。在我们的内部基准测试中,PP-OCRv6_medium实现了83.2%的识别准确率和86.2%的检测Hmean,分别比PP-OCRv5_server高出+5.1%和+4.6%,同时以数量级更少的参数超越了Qwen3-VL-235B、GPT-5.5和Gemini-3.1-Pro。微层级在Intel Xeon CPU上实现了比PP-OCRv5_mobile快3.9倍的推理速度,同时保持相当的准确率。

English abstract

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9× faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

2606.13107 2026-06-12 cs.CR cs.NI new submission

The Invisible Ink of the Android Malware World: A Longitudinal Study on the Usage of Covert Communication Channels

Android恶意软件世界的隐形墨水:隐蔽通信信道使用的纵向研究

Zeya Umayya, Manan Aggarwal, Manan Chugh, Mann Nariya, Yogesh Kaushik, Sambuddho Chakravarty

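AI summary: Presents the first longitudinal study of covert channel (CC) use in the Android malware ecosystem, analyzing 3.5 million malicious APKs with a multistage pipeline that combines static and dynamic analysis; finds that CC usage grew from 0.30% in 2012 to 50% in 2025 and reveals how CC usage patterns have evolved.

Details
Comments
21 pages, 23 figures, EuroS&P 2026
AI Chinese abstract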

代理、VPN和Tor长期以来帮助隐私社区和受审查地区的用户对抗审查。然而,同样的工具可能被恶意软件和僵尸网络恶意利用,以隐藏其与外部命令和控制服务器的通信。尽管恶意软件攻击的激增加剧了这一关键担忧,但尚无纵向研究分析恶意应用程序如何使用隐蔽信道(CC)来逃避检测。我们通过首次研究Android恶意软件生态系统中隐蔽信道的使用来填补这一空白。为此,我们开发了一个结合静态和动态分析的多阶段流水线,以调查系统和网络层面的特征。我们将此流水线应用于2009年至2025年7月间的350万Android恶意软件语料库。我们精心设计的静态验证规则发现了288K个使用CC的APK,涵盖511个恶意软件家族,CC使用率从2012年的0.30%指数增长到2025年的50%。总体而言,在动态分析中,我们识别出19,308个唯一IP地址,分布在85个国家,其中我们能够明确验证17个国家的59个IP地址存在CC。此外,我们进行了一项跨越16年以上的基于CC的恶意软件纵向数据集研究,发现CC使用已经演变,例如,一些恶意软件采用多个CC;其他恶意软件则定期切换CC(一个家族在2019年至2025年间切换了40次CC使用)。

English abstract

Proxies, VPNs and Tor have long helped the privacy community and users in censored regions to fight censorship. However, the same tools can be maliciously exploited by malware and botnets to conceal their communication with external command and control servers. Despite being a critical concern fueled by the proliferation of malware-based attacks, no longitudinal studies have analyzed how malware applications use covert channels (CC) to evade detection. We fill this gap by performing the first study of the usage of covert channels in the Android malware ecosystem. To that end, we develop a multistage pipeline that combines static and dynamic analysis to investigate both system and network-level features. We applied this pipeline to a corpus of 3.5M Android malware samples spanning 2009 to July 2025. Our carefully crafted static validation rules uncovered 288K APKs that used CCs, spanning 511 malware families, with CC usage growing exponentially from 0.30% (2012) to 50% (2025). Overall, in dynamic analysis, we identified 19,308 unique IP addresses being contacted in 85 countries, out of which we were able to explicitly validate the presence of CCs for 59 IP addresses across 17 countries. Further, we performed a longitudinal dataset study spanning over 16 years for CC-based malware and found that CC usage has evolved, e.g., some malware adopted more than one CC; others switched between them periodically (one family switched CC usage 40 times from 2019 to 2025).
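To put the two reported endpoints (0.30% in 2012, 50% in 2025) in perspective, the following back-of-the-envelope sketch computes the annual growth factor and doubling time they would imply. It assumes a constant exponential rate between the two endpoints, which is a simplification introduced here and not a model fitted in the paper; the function name is hypothetical.

```python
import math

def annual_growth_factor(start_share, end_share, years):
    """Constant yearly multiplier that takes start_share to end_share in `years` years."""
    return (end_share / start_share) ** (1.0 / years)

# Endpoints quoted in the abstract; the constant-rate assumption is ours.
factor = annual_growth_factor(0.30, 50.0, 2025 - 2012)
doubling_years = math.log(2) / math.log(factor)
print(f"implied yearly factor: {factor:.2f}, doubling time: {doubling_years:.2f} years")
```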

2606.13106 2026-06-12 cs.LG cs.CL new submission

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

揭秘隐状态循环:基于在线强化学习的可切换潜在推理

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

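Affiliations: HKUST(GZ); University of Cambridge; NTU (Nanyang Technological University); JoinQuant; HKUST

AI summary: Proposes the SWITCH framework, which uses discrete boundary tokens to make hidden-state-recurrence reasoning compatible with on-policy reinforcement learning and to support causal mechanistic analysis; experiments show it outperforms existing methods.

Details
AI Chinese abstract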

潜在思维链通过用连续的隐状态循环替换可见推理轨迹来压缩推理,但现有公式难以用标准在线强化学习(RL)优化,且难以进行因果解释。我们的关键见解是,一对显式的边界令牌可以同时解决这两个问题:离散的进入和退出锚点使潜在块与标准在线RL兼容,并且相同的锚点为机制分析提供了自然立足点。基于此,我们提出SWITCH,一个可切换的潜在推理框架。模型发出<swi>进入潜在模式,</swi>退出。由于边界是普通的离散令牌,GRPO策略比率在每个决策点都有明确定义。相同的锚点也使潜在步骤暴露于直接探测和因果干预。我们通过可见到潜在的课程和Switch-GRPO目标训练模型,该目标通过循环潜在计算传播梯度。SWITCH在相似规模下始终优于先前的隐状态循环潜在推理方法。通过边界令牌的机制分析进一步揭示了三个发现:(i)<swi>是一个尖锐局部化的学习切换策略,而非风格化伪影;(ii)它开启的潜在步骤执行特定于问题的、因果重要的计算,而非作为惰性占位符;(iii)该计算集中在进入时的单个隐状态转换上。这些结果表明,隐状态循环潜在推理既可RL训练,又可进行直接机制分析,包括在线RL本身如何从内部改进模型。

English abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.
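Because the boundary tokens are ordinary discrete tokens, the policy ratio at those decision points is simply a ratio of token probabilities. The sketch below illustrates the generic GRPO ingredients only (group-normalized advantages and a PPO-style clipped ratio); it is not the paper's Switch-GRPO objective. The function names, the use of the population standard deviation, the reward values, the log-probabilities, and the clip range are illustrative assumptions.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(logp_new, logp_old, advantage, clip=0.2):
    """Clipped surrogate for one discrete token decision (e.g. emitting a boundary token)."""
    ratio = math.exp(logp_new - logp_old)  # well-defined for discrete tokens
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)

# Hypothetical group of four sampled responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_advantages(rewards)

# Hypothetical log-probabilities of the same token under the new and old policies.
term = clipped_term(logp_new=-0.9, logp_old=-1.2, advantage=advs[0])
print(advs, term)
```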

2606.13105 2026-06-12 cs.LG new submission

Disparate Impact in Synthetic Data Generation

合成数据生成中的差异性影响

Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi

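Affiliations: Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL

AI summary: Revisits the disparate-impact notion of fairness in synthetic data generation, noting that non-disparate impact is achieved when the synthetic distribution matches the real one, analyzing why SDG can fail to reach it (expressive power, sampling error, estimation error from differential privacy), and proposing a group-wise learning strategy to improve overall utility and fairness.

Details
AI Chinese abstract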

我们重新审视合成数据生成(SDG)中差异性影响的公平性概念,该概念评估生成记录的效用是否在不同敏感群体间相同。我们的方法不同于现有的公平SDG工作,后者旨在纠正观测分布中的不当偏差,从而将SDG重新定义为学习一个并非真实数据分布的分布。相比之下,当合成分布与真实分布相同时,非差异性影响得以显著实现。我们揭示了SDG可能无法达到该解决方案的原因,并讨论了近似误差和估计误差为何会发生以及可能在不同群体间存在差异。我们特别关注了SDG方法相对于分布复杂性的表达能力、群体比例导致的抽样误差以及差分隐私机制引起的估计误差。我们在人工和真实数据上展示了差异性影响的案例,重点关注依赖概率图模型的SDG方法。我们还引入了一种学习分组SDG模型的策略,并说明了它在许多情况下如何提升整体效用及其公平性。

English abstract

We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

2606.13104 2026-06-12 cs.LG new submission

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

权威、真实性与引文偏差:研究大语言模型认知易感性的大规模多领域基准

Aryan Khurana, Aravind Ramana RN, Dhruv Kumar

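AI summary: Proposes the AuthorityBench benchmark, which uses a 2x2 factorial design to isolate the effect of citation-based authority signals on LLM epistemic behavior; finds that citation presence, whether real or fabricated, raises hallucination rates, and that pairing true claims with fabricated citations raises them by 3 to 22 percentage points.

Details
Comments
10 pages, 5 figures. Accepted to AI4GOOD and EIML at ICML 2026
AI Chinese abstract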

大型语言模型越来越多地部署在引文增强的环境中,但引文存在对模型行为的影响(独立于事实内容)仍知之甚少。我们引入了AuthorityBench,一个包含220,564个提示的多领域基准,用于隔离基于引文的权威信号如何影响LLM的认知行为。该基准采用完全平衡的2x2因子设计,交叉声明真实性(claim veracity)与引文真实性(citation veracity),这是首个这样做的基准,涵盖四个领域(常识、科学、法律和医学),并在40个提示模板、四个场所声望等级和一个国家编码的作者姓名数据集上进行受控变化。评估七个模型在12个结构化研究问题上的表现,我们发现引文的存在(无论是真实的还是捏造的)相对于无引文基线一致地提高了幻觉率。当捏造的引文伴随真实声明时,这种效应最强,使幻觉率提高3到22个百分点,在常识领域达到35%到77%,而法律声明相对稳健,场所声望和作者人口统计学影响可忽略不计。所有数据集和评估代码均可在以下网址获取:this https URL

English abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: this https URL
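The fully balanced 2x2 factorial design described above can be pictured as a simple enumeration of conditions. The sketch below is only an illustration of how such a grid might be laid out: the label strings, variable names, and dictionary layout are assumptions, and the real benchmark additionally varies prompt templates, venue prestige tiers, and author names.

```python
from itertools import product

# Two binary factors crossed in the 2x2 design (label strings are hypothetical).
CLAIM_VERACITY = ("true_claim", "false_claim")
CITATION_VERACITY = ("real_citation", "fabricated_citation")

# The four domains named in the abstract.
DOMAINS = ("general knowledge", "science", "law", "medicine")

cells = list(product(CLAIM_VERACITY, CITATION_VERACITY))
conditions = [
    {"domain": domain, "claim": claim, "citation": citation}
    for domain, (claim, citation) in product(DOMAINS, cells)
]

# Balance check: every (claim, citation) cell appears once per domain.
counts = {cell: 0 for cell in cells}
for cond in conditions:
    counts[(cond["claim"], cond["citation"])] += 1
print(len(cells), len(conditions), counts)
```

A no-citation baseline, against which the abstract reports hallucination rates, would sit outside this grid as a separate reference condition.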