arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10572 2026-06-10 cs.AI New submission

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

Affiliations: School of Computing, National University of Singapore

AI summary: Proposes the Latent Memory paradigm, which compresses each evidence item into a single high-dimensional latent token and uses unified training to enable efficient retrieval and generation, reaching competitive QA performance with 3-10x fewer tokens in resource-constrained settings.

Abstract

External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs, resulting in high token consumption and storage pressure, making it unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are directly prompted to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
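
The retrieval step described in the abstract (embed the query into the latent space, then fetch the most relevant latent tokens) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the three-dimensional vectors stand in for compressor outputs, cosine similarity stands in for whatever scoring the authors use, and the names `retrieve_latents` and `memory` are hypothetical rather than taken from the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_latents(query_vec, memory, k=2):
    """Return the ids of the k memory items whose latent vector best matches the query."""
    ranked = sorted(memory.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Toy memory: one latent vector per evidence item (text or image), all hypothetical.
memory = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "img_c": [0.7, 0.7, 0.0],
}
top_ids = retrieve_latents([1.0, 0.2, 0.0], memory, k=2)
```

In the paradigm the abstract describes, the retrieved latent tokens (not the raw text or images) would then be passed to the generator, which is where the token savings come from: each evidence item costs one token regardless of its original length.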

2606.10571 2026-06-10 cs.CV cs.AI cs.CR New submission

Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction

Lijia Yu, Jiuxin Cao, Yuchen Qiang, Changhao Chen, Yifei Huang, Bo Liu

Affiliations: School of Cyber Science and Engineering, Southeast University; Purple Mountain Laboratories; School of Computer Science and Engineering, Southeast University

AI summary: Proposes DeBias-Attack, which uses gradient correction to remove surrogate-specific bias and improve the transferability of adversarial examples across VLP models; experiments validate its effectiveness on a range of models and tasks.

Comments
17 pages, 7 figures, 10 tables

Abstract

Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A key property is cross-model transferability, which enables transfer-based black-box attacks. However, existing attacks often rely heavily on the surrogate model, causing cross-model performance drops. One reason is that adversarial optimization may follow surrogate model responses more than input semantics, making the update direction effective on the surrogate but less transferable to unseen targets. We refer to this dependency as surrogate-specific bias. Motivated by this observation, DeBias-Attack improves transferability by correcting surrogate-specific bias in adversarial optimization directions. It maintains two perturbation branches. The main branch optimizes a perturbation on the original image and obtains the adversarial gradient used to disrupt image-text alignment. The reference branch optimizes a perturbation on a weak-semantic image constructed from the dataset mean image with small Gaussian noise resampled at each iteration. Since this weak-semantic image contains little clear visual content, its optimization reflects surrogate responses more than image semantics, and its reference gradient estimates surrogate-specific bias. DeBias-Attack removes the aligned projection of the main gradient on the reference gradient before updating the adversarial image, then performs context-aware text substitution using the updated adversarial image. DeBias-Attack is the first transfer-based VLP attack that corrects surrogate-specific bias through gradient correction. Experiments show strong performance across VLP models, downstream tasks, and open-source and closed-source multimodal large language models.
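
The gradient-correction step (removing the aligned projection of the main gradient on the reference gradient) can be sketched as plain vector arithmetic. This is a minimal illustration under stated assumptions, not the authors' implementation: gradients are flat Python lists, and "aligned" is read here as "the two gradients have a positive inner product", which the abstract does not spell out. The function name `remove_aligned_projection` is hypothetical.

```python
def remove_aligned_projection(g_main, g_ref, eps=1e-12):
    """Subtract from g_main its projection onto g_ref, but only when the two
    gradients point in a similar direction (positive inner product).
    The positive-inner-product condition is an assumption of this sketch."""
    dot = sum(m * r for m, r in zip(g_main, g_ref))
    ref_sq = sum(r * r for r in g_ref)
    if dot <= 0 or ref_sq < eps:
        # Not aligned, or a degenerate reference gradient: leave g_main unchanged.
        return list(g_main)
    scale = dot / ref_sq
    return [m - scale * r for m, r in zip(g_main, g_ref)]

# Main gradient with a component along the reference direction.
corrected = remove_aligned_projection([3.0, 4.0], [1.0, 0.0])
# Main gradient pointing against the reference direction is kept as-is.
untouched = remove_aligned_projection([-3.0, 4.0], [1.0, 0.0])
```

After the correction, the remaining gradient is orthogonal to the reference gradient, so the update no longer moves along the direction that the weak-semantic image attributes to the surrogate model.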

2606.10569 2026-06-10 cs.CL cs.AI New submission

Hidden Consensus: Preference-Validity Compression in Human Feedback

Dorcas Chia Ern Chua, Karen Myn Hui Lee, Jia Yue Tan, Zhen Xue Gue, Norzalena Abdul Hamid, Azima Binti Azmi, Keat Mei Yeong, Aizat Izyani binti Mujab, Hafsah Noor Azam, Chee Guo Khoo, Han Ying Lim, Chee Seng Chan

Affiliations: YTL AI Labs; Universiti Malaya; Monash University Malaysia; Universiti Malaysia Sarawak

AI summary: Identifies the problem of preference-validity compression: RLHF compresses plural valid feedback into a single reward target, which biases the measurement of alignment. In an analysis of a Malaysian corpus, 79% of prompts have more than one majority-supported response, indicating that majority aggregation measures argmax acceptability rather than plural alignment.

Comments
28 pages. When AI learns from human feedback, it forces a single "correct" answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away

Abstract

Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.
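
The contrast between single-winner aggregation and keeping every majority-supported response can be made concrete with a small sketch. The counts in the abstract are consistent with three annotators per prompt (107 prompts × 3 = 321 preference events). The rest is assumed for illustration: each annotator is modeled as selecting a set of acceptable responses, "majority-supported" is taken to mean accepted by more than half of the annotators, and ties for the single winner are broken alphabetically. The paper may define these differently.

```python
from collections import Counter

def acceptance_counts(selections):
    """Count how many annotators marked each response as acceptable."""
    return Counter(r for chosen in selections for r in set(chosen))

def majority_supported(selections):
    """Responses accepted by more than half of the annotators (assumed threshold)."""
    counts = acceptance_counts(selections)
    return sorted(r for r, c in counts.items() if c > len(selections) / 2)

def single_winner(selections):
    """Argmax aggregation: keep only the most-accepted response (alphabetical tie-break)."""
    counts = acceptance_counts(selections)
    return max(sorted(counts), key=lambda r: counts[r])

# One hypothetical prompt with three annotators, each selecting acceptable responses.
selections = [{"A", "B"}, {"A", "B"}, {"B", "C"}]
kept_plural = majority_supported(selections)
kept_argmax = single_winner(selections)
```

In this toy prompt, response A has majority support but is discarded by argmax aggregation, which is the kind of loss the abstract calls Preference-Validity Compression.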

2606.10568 2026-06-10 cs.RO New submission

VeriSpace: Spatially Grounded Action Verification for Vision-Language-Action Models

Guiyu Zhao, Longteng Guo, Junyou Zhu, Jun Fu, Yanghong Mei, Bin Cao, Jie Jiang, Xingjian He, Jing Liu

Affiliations: Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes VeriSpace, a 3D-aware action verifier that selects among candidate actions at test time using Dual-Path 3D-Injected Scene Encoding and Spatially-Grounded Action Reasoning, improving the reliability of VLA models.

Comments
Submitted to ACM MM

Abstract

Vision-language-action (VLA) models have shown strong promise for robotic manipulation, but their reliability at test time remains limited by one-shot action prediction, where even small action errors can cause grasp failure, collision, or incorrect task progression. A natural alternative is to equip VLA systems with test-time verification, allowing multiple candidate actions to be proposed and evaluated before execution. However, reliable action verification is challenging because it requires not only distinguishing subtle geometric differences between candidate actions, but also assessing whether an action makes meaningful progress toward the task goal. We present VeriSpace, a 3D-aware action verifier for test-time action selection in VLA systems. VeriSpace evaluates candidate actions through two key components: Dual-Path 3D-Injected Scene Encoding, which constructs a scene representation that jointly preserves visual semantics and explicit 3D geometry, and Spatially-Grounded Action Reasoning, which evaluates each action by reasoning over task-relevant spatial relations, geometric validity, and expected goal progress. Together, these components enable more reliable discrimination between subtle yet outcome-critical action candidates while remaining fully compatible with existing VLA policies. Experiments on public benchmarks and real-world robotic manipulation tasks show that VeriSpace consistently improves decision reliability over both underlying VLA policies and prior verification-based methods, yielding substantial gains in both in-distribution and out-of-distribution settings.

2606.10554 2026-06-10 cs.CL cs.AI New submission

Benchmarking Knowledge Editing using Logical Rules

Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

Affiliations: Data Science Group, Heinz Nixdorf Institute, Paderborn University

AI summary: Proposes a logical-rule-based benchmark that evaluates how well knowledge editing methods handle the logical consequences of a single edit, and finds that existing methods lose up to 24% in performance on entailed knowledge.

Journal ref
The Semantic Web. ISWC 2025. Lecture Notes in Computer Science, vol 16141. Springer, Cham
Comments
Accepted at the 24th International Semantic Web Conference 2025

Abstract

Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.
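
To illustrate what "logical consequences of a single fact edit" means, the sketch below applies one two-hop rule to a tiny knowledge graph before and after an edit. The facts, the relation names, and the rule format `(r1, r2, r_out)` are all hypothetical; the abstract does not say how the benchmark represents rules or which knowledge graph it uses.

```python
def entailed_facts(facts, rule):
    """Apply a two-hop rule (r1, r2, r_out): if (x, r1, y) and (y, r2, z) hold,
    derive (x, r_out, z). Facts are (subject, relation, object) triples."""
    r1, r2, r_out = rule
    derived = set()
    for s1, p1, o1 in facts:
        if p1 != r1:
            continue
        for s2, p2, o2 in facts:
            if p2 == r2 and s2 == o1:
                derived.add((s1, r_out, o2))
    return derived

# Hypothetical rule: born_in(x, y) and located_in(y, z) imply born_in_country(x, z).
rule = ("born_in", "located_in", "born_in_country")
background = {("Paris", "located_in", "France"), ("Berlin", "located_in", "Germany")}

before = entailed_facts(background | {("Ada", "born_in", "Paris")}, rule)
# A single edit to the direct fact changes what the rule entails.
after = entailed_facts(background | {("Ada", "born_in", "Berlin")}, rule)
```

A multi-hop question such as "In which country was Ada born?" probes the entailed fact rather than the edited one, which is where the abstract reports the performance gap.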

2606.10550 2026-06-10 cs.CV cs.GR New submission

PrismAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

Chufeng Fang, Dongdong Teng, Lilin Liu

Affiliations: Sun Yat-sen University

AI summary: Proposes the PrismAvatar system, which reconstructs a controllable head avatar from monocular video and uses a subpixel-encoded lenticular display for real-time glasses-free stereoscopic communication, with pseudo-multiview supervision and contour-aware losses to improve side-view quality.

Comments
10 pages, 5 figures, 3 tables

Abstract

Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present PrismAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, PrismAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, PrismAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.

2606.10543 2026-06-10 cs.LG cs.AI cs.ET q-bio.QM New submission

Flexible Flows for Biological Sequence Design

Yogesh Verma, Dani Korpela, Harri Lähdesmäki, Vikas Garg

Affiliations: Aalto University; YaiYai Ltd; OpenProtein.AI

AI summary: Proposes a structured coupling, a latent edit-based rate parameterization, and a latent classifier-free guidance mechanism that enable variable-length sequence generation and fine-grained control, achieving state-of-the-art performance on a range of biological sequence tasks.

Abstract

Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.

2606.10541 2026-06-10 cs.CV New submission

GRAR: Glass-induced Reflection Artifact Removal in LiDAR Point Clouds

Wanpeng Shao, Zeyi Guo, Bo Zhang, Yifei Xue, Tie Ji, Yizhen Lao

Affiliations: College of Computer Science and Electronic Engineering, Hunan University; School of Design, Hunan University; School of Finance and Statistics, Hunan University

AI summary: Proposes a two-stage framework that first segments glass regions precisely using a multi-modal vision foundation model and geometric cues, then removes reflection artifacts with a physics-driven Reflection-aware Local-Global Geometric Similarity descriptor, outperforming existing methods on multiple public datasets.

Abstract

Terrestrial Laser Scanning (TLS) point clouds captured in urban environments frequently suffer from glass-induced reflection artifacts, severely degrading downstream applications. Existing reflection artifact removal methods generally rely on ideal reflection symmetry assumptions, yet their performance is limited by inaccurate glass estimation and insufficient geometric representations. To address these issues, we propose a novel unified framework aimed at robust reflection artifact removal. In the first stage, we leverage a multi-modal vision foundation model to produce initial glass masks, which are then refined using geometric cues to achieve high-precision glass regions, followed by glass completion to recover missing regions caused by no-return measurements on transparent surfaces. In the second stage, we propose a physics-driven descriptor, termed Reflection-aware Local-Global Geometric Similarity (RE-LGGS), which is grounded in actual laser reflection geometry and jointly encodes multi-scale geometric structures and orientation consistency using PCA-based local shape representations, thereby significantly improving robustness against imperfect observations. Extensive experiments on multiple public TLS datasets demonstrate that our framework consistently outperforms state-of-the-art methods in reflection artifact removal.

2606.10537 2026-06-10 cs.CL New submission

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Jing Xiong, Qi Han, Shansan Gong, Yunta Hsieh, Chengyue Wu, Chaofan Tao, Chenyang Zhao, Ngai Wong

Affiliations: The University of Hong Kong; University of Michigan, Ann Arbor; LMSYS Org

AI summary: Addresses the quadratic growth in computation caused by diffusion language models repeatedly re-encoding the prefix in long contexts. The proposed Prefilling-dLLM framework caches KV representations per chunk and selects relevant chunks based on sparsity for efficient decoding, achieving state-of-the-art acceleration results on benchmarks such as LongBench.

Comments
Technical Report

Abstract

Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1--28.0x speedup at 8K--32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.
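
The chunk-and-select idea (partition the prefix into chunks, cache each chunk once, then decode against only the top-K most relevant chunks) can be sketched in a few lines. This is a toy under stated assumptions: each chunk is summarized by a single hypothetical key vector, and relevance is a plain dot product with a query vector. The abstract does not specify the scoring function or the intra-chunk token sparsity rule, so neither is modeled here.

```python
def split_into_chunks(tokens, n_chunks):
    """Partition a token list into contiguous chunks of equal size (the last may be
    shorter). Uses ceil(len / n_chunks) tokens per chunk, so it yields at most
    n_chunks chunks."""
    size = -(-len(tokens) // n_chunks)  # ceiling division
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def select_top_k_chunks(chunk_keys, query, k):
    """Score each cached chunk summary against the query and return the indices of
    the k best chunks, restored to their original order in the prefix."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in chunk_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

chunks = split_into_chunks(list(range(10)), n_chunks=4)
# One hypothetical summary vector per chunk, computed once and cached.
chunk_keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.9, 0.1]]
selected = select_top_k_chunks(chunk_keys, query=[1.0, 0.0], k=2)
```

Because the chunk representations are cached once, each denoising step only attends over the selected chunks plus the tokens being decoded, which is the source of the per-step saving the abstract describes.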

2606.10533 2026-06-10 cs.CV New submission

Audio-Visual Exchange-Aware Token Pruning for Efficient Audio-Visual Captioning

Zihan Meng, Dexiang Hong, Weidong Chen, Ziyu Zhou, Bo Hu, Zhendong Mao

Affiliations: University of Science and Technology of China

AI summary: Proposes AVEX-Prune, a reinforcement-learning-based method that selects high-confidence tokens through a cross-modal token exchange strategy and preserves full-token quality at a 40% retention ratio.

Abstract

Audio-visual captioning generates natural language descriptions from video and audio content. Multimodal LLMs have advanced this task, but both modalities contribute many tokens to the LLM input, where prefill self-attention scales quadratically. Existing token-pruning methods usually retain tokens by attention, saliency, or cross-entropy loss, yet hard threshold selection makes it difficult to retain the tokens that are truly valuable, especially highly confusing tokens near the decision boundary. To this end, we propose AVEX-Prune, an RL-based audio-visual dynamic token pruning method. AVEX-Prune uses an audio-visual token exchange strategy to select truly valuable tokens by replacing low-confidence retained tokens with high-confidence candidate tokens from the same or the other modality, and measuring the resulting differences in caption generation. AVEX-Prune preserves full-token quality at a 40% retention ratio on both VILA 1.5-8B (54.5 vs. 54.6) and VideoLLaMA 2 (57.0 vs. 56.8).

2606.10532 2026-06-10 cs.AI New submission

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Yunhan Jiang, Wenbin Duan, Shasha Guo, Liang Pang, Xiaoqian Sun, Huawei Shen

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes the ActiveMem framework, which decouples memory from core reasoning and uses a distributed active memory system to accumulate semantic gists, achieving high accuracy with low overhead on long-horizon reasoning tasks.

Abstract

Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade-off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp-Plus and GAIA show that ActiveMem achieves state-of-the-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long-horizon reasoning.

2606.10530 2026-06-10 cs.LG cs.AI New submission

Machine Learning Methods for Studying Latent Neural Activity Dynamics

Shufeng Kong, Fumei Deng, Xinyi Dong, Caihua Liu, Weiwei Chen, Yingheng Wang, Daniel Cao, Azahara Oliva, Antonio Fernandez-Ruiz, Carla Gomes

Affiliations: School of Software Engineering, Sun Yat-sen University; Department of Computer Science, Cornell University; Department of Neurobiology and Behavior, Cornell University; Department of Ecology and Evolutionary Biology, Cornell University; School of Computer Science and Artificial Intelligence, Foshan University

AI summary: Surveys latent variable models from state-space models to deep generative models, covering single-region dynamics, multi-region communication, and behavior-aligned modeling, and discusses large-scale neural foundation models and future challenges.

Comments
Accepted by IJCAI 2026 survey track

Abstract

Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which includes models such as linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we conclude and discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research for bridging interpretable brain dynamics with reliable neural decoding.

2606.10528 2026-06-10 cs.LG cs.CL New submission

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

Guozheng Li, Xiyan Fu, Yiwen Guo

Affiliations: Southeast University; Nanyang Technological University; Independent Researcher

AI summary: Proposes representation-aware advantage estimation, which uses reward model hidden states as auxiliary signals and computes advantages via graph propagation, improving the sample efficiency and robustness of RLHF.

Abstract

Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph, where nodes correspond to responses and edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
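
One way to picture graph propagation over a sampled group is the sketch below. It is not the GraphAE update rule, which the abstract does not give. The assumptions are: the baseline advantage is the reward minus the group mean, edge weights are non-negative cosine similarities between hypothetical RM hidden states, and a mixing coefficient `alpha` blends each response's own advantage with the similarity-weighted average of its neighbors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def graph_advantages(rewards, hidden_states, alpha=0.5):
    """Blend each response's group-relative advantage with the similarity-weighted
    average advantage of the other responses in the group (one propagation step)."""
    n = len(rewards)
    mean_reward = sum(rewards) / n
    base = [r - mean_reward for r in rewards]
    smoothed = []
    for i in range(n):
        # Edge weights to every other node; negative similarities are clipped to zero.
        weights = [
            max(cosine(hidden_states[i], hidden_states[j]), 0.0) if j != i else 0.0
            for j in range(n)
        ]
        total = sum(weights)
        if total == 0.0:
            # No similar neighbors: keep the response's own advantage.
            smoothed.append(base[i])
            continue
        neighbor_avg = sum(w * a for w, a in zip(weights, base)) / total
        smoothed.append((1 - alpha) * base[i] + alpha * neighbor_avg)
    return smoothed

# Two responses with identical hidden states but conflicting scalar rewards,
# plus one unrelated response (all values hypothetical).
rewards = [1.0, 0.0, 0.5]
hidden_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
advantages = graph_advantages(rewards, hidden_states, alpha=0.5)
```

In this toy group, the two responses that the reward model represents identically end up with the same advantage, which shows how neighbor information can damp noise in the scalar reward.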

2606.10522 2026-06-10 cs.CV New submission

GUI-AC: Enhancing Continual Learning in GUI Agents

Can Lin, Tao Feng, Hangjie Yuan, Dan Zhang, Yifan Zhu, Zhonghong Ou

Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; Zhejiang University; National University of Singapore

AI summary: Targets distribution shift and reinforcement fine-tuning instability in the continual learning of GUI agents. The proposed GUI-AC method improves performance through adaptive advantage and dynamic clipping mechanisms, surpassing existing baselines.

Abstract

Graphical User Interfaces (GUIs) serve as the dominant medium for human-computer interaction, yet building GUI agents that generalize across the vast diversity of real-world interface environments, with the same flexibility and robustness that humans naturally exhibit, remains unsolved. Notably, GUI data are inherently non-stationary: the continual emergence of previously unseen interface instances (e.g., novel domains and resolutions) induces persistent distribution shifts, significantly impeding the continual learning of existing GUI agents. Reinforcement fine-tuning (RFT) has attracted considerable attention as a promising approach. Nevertheless, RFT exhibits pronounced instability in its grounding capability, manifested as sharp reward discontinuities and high-variance oscillations. The imbalanced distribution of rollout outcomes introduces substantial noise into advantage estimation, leading to policy overconfidence. The fixed clipping bound suppresses the increase in policy probabilities needed to adapt to new distributions, leading to a collapse in exploration capacity. To address these challenges, we propose GUI-AC, a method that enhances the continual learning capability of GUI agents. GUI-AC introduces grounding certainty to support two core mechanisms: (i) Adaptive Advantage, which down-weights noisy advantage estimates to prevent policy overconfidence; and (ii) Dynamic Clipping, which relaxes the clipping bound to encourage exploration range. Extensive experiments show that these mechanisms jointly improve performance, enabling our method to surpass state-of-the-art baselines. Code is available anonymously at https://anonymous.4open.science/r/GUI-AC.

2606.10517 2026-06-10 cs.CV 新提交

LAFP: Preserving Latent Action Structure in Latent Policy Learning via Flow Matching

LAFP:通过流匹配在潜在策略学习中保留潜在动作结构

Jiexi Lyu, Xizhou Bu, Qingqiu Huang, Chufeng Tang, Xiaoshuai Hao, Hongbo Wang, Wei Li

发表机构 * Fudan University(复旦大学) Morphi

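AI summary: Proposes LAFP, which uses flow matching to learn latent policies and adds an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment; on imitation learning tasks it improves success rate by up to 10-15% with less than 1x additional inference overhead.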

AI中文摘要

从大规模无标签视频中学习高质量潜在动作,并结合有限真实交互数据训练动作解码器,已成为可扩展潜在策略学习的一种有前景的范式。然而,现有方法通常依赖行为克隆,这倾向于将固有的多模态动作分布坍缩为单模态分布,从而破坏预训练的潜在动作结构。虽然流匹配提供了一种潜在的替代方案,但由于学习策略的随机性,直接应用它会导致动作解码器训练中潜在动作与物理动作之间的错位。为了解决这些问题,我们提出了潜在动作流策略(LAFP),它利用流匹配进行潜在策略学习,并引入推理时插值机制来缓解随机性引起的错位。实验结果表明,LAFP在下游模仿学习任务上持续优于先前方法,成功率提升高达10-15%,而推理开销增加不到1倍。

英文摘要

Learning high-quality latent actions from large-scale unlabeled videos, coupled with limited real-world interaction data for training an action decoder, has emerged as a promising paradigm for scalable latent policy learning. However, existing approaches typically rely on behavior cloning, which tends to collapse inherently multimodal action distributions into unimodal ones, thereby degrading the pretrained latent action structure. While flow matching provides a potential alternative, applying it directly leads to a misalignment between latent actions and physical actions during action decoder training, due to the stochastic nature of the learned policy. To address these issues, we propose Latent Action Flow Policy (LAFP), which leverages flow matching for latent policy learning and introduces an inference-time interpolation mechanism to mitigate stochasticity-induced misalignment. Experimental results demonstrate that LAFP consistently outperforms prior methods on downstream imitation learning tasks, achieving an improvement of up to 10-15% in success rate while incurring less than 1x additional inference overhead.
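For readers unfamiliar with flow matching, here is a minimal sketch of the generic training target and sampler, assuming the common straight-line path between a noise sample `x0` and a data sample `x1`. The helper names are hypothetical, and LAFP's network, conditioning, and inference-time interpolation mechanism are not reproduced here.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Because sampling starts from random noise, two runs from the same observation can land on different modes. That preserves multimodality, and it is also the source of the stochasticity the abstract refers to.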

2606.10507 2026-06-10 cs.AI 新提交

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

HIPIF: 面向长视界LLM智能体学习的层次化规划与信息折叠

Juncheng Diao, Zhicong Lu, Peiguang Li, Yongwei Zhou, Changyuan Tian, Qingbin Li, Rongxiang Weng, Jingang Wang, Xunliang Cai

发表机构 * Meituan(美团) University of Chinese Academy of Sciences(中国科学院大学)

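AI summary: Proposes hierarchical planning and information folding, which reduces long-context interference through subgoal decomposition and history folding, and combines hierarchical reflection with subgoal-oriented process rewards to improve LLM performance on multi-turn long-horizon tasks.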

AI中文摘要

尽管大型语言模型(LLM)在广泛任务中展现出作为自主智能体的强大能力,但其性能在多轮长视界智能体任务中常常下降。现有方法通过细粒度信用分配以缓解长视界稀疏奖励,以及通过层次化强化学习分解任务并减少长期依赖,取得了进展。然而,这些方法仍未直接解决长上下文干扰问题,即持续增长的历史记录削弱了智能体跟踪全局任务状态的能力,并损害了后续推理和决策。受人类通过子目标分解和已完成进度总结处理复杂任务的方式启发,我们提出了面向长视界LLM智能体学习的层次化规划与信息折叠(HIPIF)。HIPIF端到端地训练智能体,使其围绕显式子目标组织长视界执行,同时折叠已完成的子目标历史以减少长上下文干扰。此外,为稳定基于子目标的规划与执行,HIPIF结合了层次化反思和面向子目标的过程奖励,以指导子目标的生成、转换和执行,而无需依赖昂贵的辅助模型或特定任务的专家轨迹。在三个公开可用的智能体基准上的广泛实验证明了我们方法的有效性。

英文摘要

While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.

2606.10501 2026-06-10 cs.RO 新提交

Uncovering Vulnerability of Vision-Language-Action Models under Joint-Level Physical Faults

揭示视觉-语言-动作模型在关节级物理故障下的脆弱性

Minsoo Jo, Taeju Kwon, Junha Chun, Youngjoon Jeong, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University(首尔大学数据科学研究生院)

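AI summary: Shows that VLA models degrade markedly under joint-level physical faults in robots (such as actuator degradation and increased friction), and proposes J-PARC, a lightweight residual calibration framework that infers the joint-fault regime and adaptively corrects actions to improve robustness.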

AI中文摘要

在真实机器人系统中部署视觉-语言-动作(VLA)模型不仅需要对语义和感知变化具有鲁棒性,还需要对改变动作物理实现方式的实体侧故障具有鲁棒性。真实机器人可能经历由执行器退化、硬件故障、安全限制、碰撞损坏或磨损引起的摩擦导致的关节级变化。这些故障至关重要,因为它们改变了策略的动作到运动接口,破坏了指令动作、实现运动与后续观测之间的学习闭环关系。在这项工作中,我们研究了真实的关节级物理故障,并表明当预测动作通过受扰动的机器人身体执行时,VLA模型是脆弱的。我们的分析揭示了关节依赖效应,受影响关节的任务成功率呈现异质性退化。我们还表明,性能下降不能仅归因于物理不可行性,因为可行的故障(如增加的关节摩擦)仍能显著降低成功率并引发闭环执行不匹配。受这些发现的启发,我们提出了关节级物理故障感知残差校准器(J-PARC),这是一个构建在冻结VLA策略之上的轻量级残差校准框架。J-PARC从最近的关节动力学中推断出潜在的关节故障状态,并在此状态下调节共享的残差校准器,从而实现对故障关节的自适应动作修正。实验表明,J-PARC在关节级故障下提高了鲁棒性,同时保持了无故障环境下的性能。

英文摘要

Deploying Vision-Language-Action (VLA) models in real robotic systems requires robustness not only to semantic and perceptual variations, but also to embodiment-side faults that change how actions are physically realized. Real robots can experience joint-level changes caused by actuator degradation, hardware faults, safety limits, collision damage, or wear-induced friction. These faults are critical because they alter the action-to-motion interface of a policy, disrupting the learned closed-loop relationship between commanded actions, realized motion, and subsequent observations. In this work, we study realistic joint-level physical faults and show that VLA models are vulnerable when predicted actions are executed through a perturbed robot body. Our analysis reveals joint-dependent effects, with heterogeneous degradation in task success across affected joints. We also show that performance drops cannot be attributed solely to physical infeasibility, since feasible faults such as increased joint friction can still substantially reduce success rates and induce closed-loop execution mismatch. Motivated by these findings, we propose Joint-level Physical-fault Aware Residual Calibrator (J-PARC), a lightweight residual calibration framework built on top of a frozen VLA policy. J-PARC infers a latent joint-fault regime from recent joint dynamics and conditions a shared residual calibrator on this regime, enabling adaptive action correction across faulty joints. Experiments show that J-PARC improves robustness under joint-level faults while preserving fault-free environment performance.

2606.10499 2026-06-10 cs.LG cs.AI 新提交

MoE Enhanced Federated Learning for Spatiotemporal Prediction

基于混合专家模型增强的联邦学习用于时空预测

Zhehao Dai, Xiao Han, Zhaolin Deng, Zijian Zhang, Xiangyu Zhao, Guojiang Shen, Xiangjie Kong

发表机构 * Zhejiang University of Technology, Zhejiang Key Laboratory of Visual Information Intelligent Processing(浙江工业大学,浙江省可视信息智能处理重点实验室) Jilin University(吉林大学) City University of Hong Kong(香港城市大学)

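AI summary: Proposes the MoE-FedTP framework, which uses lightweight Mixture-of-Experts networks and a gating mechanism for privacy-preserving cross-city spatiotemporal prediction, mitigating data scarcity and heterogeneity.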

AI中文摘要

交通预测是智能交通系统和城市计算的基础,然而由于传感器部署有限和城市发展不均衡,许多城市仍然面临交通数据稀缺的问题。跨城市知识转移因此受到越来越多的关注,使数据丰富的城市能够帮助数据稀缺的城市。然而,集中式方法引发了隐私问题,而现有的联邦方法难以应对城市间显著的时空异质性。为了解决这些挑战,我们提出了MoE-FedTP,一种基于轻量级混合专家(MoE)网络的个性化联邦跨城市时空预测框架。MoE-FedTP首先利用时空神经网络从源城市和目标城市提取特征,然后通过部分参数共享引入来自不同源城市的专家网络集合。门控机制动态融合专家以捕捉多样的交通动态,在保护隐私的同时实现城市异质性的细粒度建模。在四个真实世界交通数据集上的实验表明,MoE-FedTP始终优于最先进的跨城市和联邦学习基线,证明了其在提高数据稀缺城市预测准确性方面的有效性。

英文摘要

Traffic prediction is fundamental to intelligent transportation systems and urban computing, yet many cities continue to suffer from traffic data scarcity due to limited sensor deployment and uneven urban development. Cross-city knowledge transfer has thus attracted increasing attention, enabling data-rich cities to assist data-scarce ones. However, centralized approaches raise privacy concerns, while existing federated methods struggle with pronounced spatiotemporal heterogeneity across cities. To address these challenges, we propose MoE-FedTP, a personalized federated cross-city spatiotemporal prediction framework based on lightweight Mixture-of-Experts (MoE) networks. MoE-FedTP first employs spatiotemporal neural networks to extract features from both source and target cities, then introduces a set of expert networks derived from different source cities through partial parameter sharing. A gating mechanism dynamically fuses the experts to capture diverse traffic dynamics, achieving fine-grained modeling of urban heterogeneity while preserving privacy. Experiments on four real-world traffic datasets show that MoE-FedTP consistently outperforms state-of-the-art cross-city and federated learning baselines, demonstrating its effectiveness in enhancing prediction accuracy for data-scarce cities.
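The gating step described in this abstract can be sketched as a softmax-weighted sum of expert outputs. This is a generic Mixture-of-Experts fusion, not MoE-FedTP's actual architecture: the function names and the idea of passing precomputed gate scores are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(gate_scores, expert_outputs):
    """Fuse per-expert prediction vectors using softmax gate weights.

    gate_scores    : one raw score per expert (hypothetical gate output)
    expert_outputs : one prediction vector per expert, all the same length
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights
```

In a federated setting only model parameters would be exchanged, so the target city could run this fusion locally on its own features.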

2606.10495 2026-06-10 cs.RO 新提交

Act on What You See: Unlocking Safe Social Navigation in Vision-Language-Action Models

Act on What You See: 在视觉-语言-动作模型中解锁安全社交导航

Qingzi Wang, Xiyang Wu, Guangyao Shi, Dianwei Chen, Xianfeng Yang, Dinesh Manocha

发表机构 * University of Maryland(马里兰大学) University of Southern California(南加州大学)

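AI summary: Proposes the SALSA framework, a two-stage annotation-free post-training procedure (social behavioral alignment and temporal safety alignment) that lets pretrained VLA models use representations they already have for safe social navigation, reducing near-collisions by 86.4%.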

AI中文摘要

安全社交导航要求机器人区分行人与普通障碍物,并在危险迫近前做出反应。我们表明,预训练的视觉-语言-动作(VLA)模型已在其内部表征中编码了行人-物体区分和未来碰撞信号,但行为克隆未能将这些信号转化为社交上合适的动作。为解决这一不匹配问题,我们提出SALSA,一个两阶段无标注后训练框架:(1)社交行为对齐将中间层社交特征桥接到动作头,并在反事实人-物场景对上训练以打破视觉显著性捷径;(2)时间安全对齐提供自动生成的未来风险监督,实现预期性碰撞避免。在SCAND和实际部署中,SALSA将近距离碰撞减少86.4%,并将社交反事实准确率从53%提升至93%,表明通过教导VLA策略利用其已拥有的表征来行动,可以实现更安全的社交导航。这些结果表明,通过更好地对齐潜在表征与动作生成,预训练VLA策略可被调整用于更安全的社交导航。

英文摘要

Safe social navigation requires robots to distinguish people from ordinary obstacles and to react before danger becomes imminent. We show that pretrained Vision-Language-Action (VLA) models already encode pedestrian-object distinctions and future collision signals in their internal representations, but behavior cloning fails to translate these signals into socially appropriate actions. To address this mismatch, we propose SALSA, a two-stage annotation-free post-training framework: (1) social behavioral alignment bridges intermediate-layer social features to the action head and trains on counterfactual human-object scene pairs to break visual saliency shortcuts; (2) temporal safety alignment provides automatically generated future-risk supervision to enable anticipatory collision avoidance. On SCAND and real-world deployment, SALSA reduces near-collisions by 86.4% and improves social counterfactual accuracy from 53% to 93%, demonstrating that safer social navigation can be achieved by teaching VLA policies to act on representations they already possess. These results show that pretrained VLA policies can be adapted for safer social navigation by better aligning their latent representations with action generation.

2606.10492 2026-06-10 cs.CV 新提交

PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

PathRelax: 并行路径松弛推测雅可比解码加速自回归文本到图像生成

Haodong Lei, Hongsong Wang, Bingxuan Dai, Pan Zhou

发表机构 * College of Software Engineering, Southeast University(东南大学软件工程学院) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) School of Cyber Science and Engineering, Southeast University(东南大学网络空间安全学院) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算机与信息系统学院)

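AI summary: To address slow inference caused by long token sequences in autoregressive text-to-image models, proposes a parallel-path cross-relaxed speculative Jacobi decoding framework that expands the search space with a multi-sequence draft tree and exploits cross-path semantic similarity to raise acceptance rates, achieving 3.95-4.18x speedups.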

Comments
10 pages, 5 figures
AI中文摘要

自回归文本到图像模型对高分辨率图像生成的需求日益增长,导致令牌序列延长,显著增加了计算成本和推理时间。然而,现有的加速自回归文本到图像模型的最先进方法依赖于链式结构的草稿令牌序列,导致草稿令牌搜索效率低下且接受长度有限。为了解决这一问题,我们提出了并行路径交叉松弛推测雅可比解码(PathSpec),一种通过多序列草稿树结构提升效率的新框架。我们的并行路径推测雅可比解码(PathExplore)扩展了令牌搜索空间,在不牺牲图像质量的情况下实现了更高的加速比。此外,我们引入了跨路径松弛验证(PathRelax),利用序列间的语义相似性进一步提高令牌接受率。在Parti-Prompts、MSCOCO2017和T2ICompBench数据集上的评估表明,我们的方法分别实现了4.14倍、3.95倍和4.18倍的加速比。值得注意的是,PathExplore在没有任何松弛采样的条件下,在加速比上优于GSD和LANTERN等松弛采样方法。此外,PathRelax的松弛机制可以与其他松弛技术无缝集成,实现进一步加速,为实时文本到图像生成提供了高效解决方案。我们的代码可在 https://github.com/Haodong-Lei-Ray/PathSpec 获取。

英文摘要

The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax), which exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves speedup ratios of 4.14×, 3.95×, and 4.18×, respectively. Remarkably, PathExplore, without any relaxed sampling, achieves a higher speedup ratio than relaxed sampling methods such as GSD and LANTERN. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at https://github.com/Haodong-Lei-Ray/PathSpec.
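As background, plain Jacobi decoding refreshes every draft position in parallel and stops at a fixed point, which matches the greedy autoregressive result. The sketch below uses a toy deterministic `next_token` function as a hypothetical stand-in for a model. It shows only the basic chain-structured iteration, not the draft tree or relaxed verification of PathSpec.

```python
def next_token(prefix):
    """Toy deterministic 'model' (hypothetical): maps a prefix to one token."""
    return (sum(prefix) * 31 + len(prefix)) % 7


def autoregressive_decode(prompt, n):
    """Reference: generate n tokens one at a time."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init=0):
    """Iterate a length-n draft until it stops changing.

    Each sweep recomputes every position from the previous draft; with a
    real model this would be one batched forward pass. After sweep k the
    first k tokens are final, so at most n + 1 sweeps are needed.
    """
    draft = [init] * n
    iterations = 0
    while True:
        iterations += 1
        new_draft = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new_draft == draft:
            return draft, iterations
        draft = new_draft
```

Speedups come from sweeps in which several positions become final at once; speculative variants such as those in the abstract aim to accept more tokens per sweep.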

2606.10489 2026-06-10 cs.AI 新提交

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

PlanGPT的补充研究:使用定义性能指标评估并与规划器比较

Youssef Abdelkader, Humbert Fiorino, Damien Pellier

发表机构 * Univ. Grenoble Alpes - LIG(格勒诺布尔阿尔卑斯大学 - 信息学实验室(LIG))

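AI summary: Runs complementary experiments on the large language model PlanGPT, evaluating it on plan cost and plan generation time and comparing it with a traditional planner; finds that PlanGPT is no better than a greedy search strategy.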

Comments
7 pages
AI中文摘要

自动规划是人工智能(AI)的一个子领域,其主要目标是生成一系列动作(称为规划),帮助我们从初始状态达到目标状态。规划问题由一组对象、初始状态和期望目标状态定义。目标是计算一个从初始状态到目标状态的规划。生成规划的程序称为规划器。在本文中,我们对去年发布的最新LLM(PlanGPT)进行了补充研究。我们重新进行了一些实验,以验证使用LLM进行规划是否恰当且有价值。我们还检查了官方PlanGPT论文中关于规划覆盖的结果是否正确,并对PlanGPT的性能进行了更全面的研究:在我们的论文中,PlanGPT的性能使用两个指标进行评估:规划成本和规划生成时间。将PlanGPT的结果与同一规划和相同指标下传统规划器产生的结果进行比较。我们发现PlanGPT并不优于贪心搜索策略。

英文摘要

Automated Planning is a subfield of Artificial Intelligence (AI) whose main objective is to generate a sequence of actions, known as a plan, that leads from an initial state to a goal state. A planning problem is defined by a set of objects, an initial state, and a desired goal state. The objective is to compute a plan that leads from the initial state to the goal state. Programs that generate plans are called planners. In this paper, we present a complementary study of PlanGPT, a state-of-the-art LLM released last year. We reran some experiments to verify whether planning with LLMs is pertinent and worthwhile. We also checked whether the plan coverage results reported in the official PlanGPT paper were correct, and we performed a more comprehensive study of PlanGPT's performance using two metrics: Plan Cost and Plan Generation Time. The results of PlanGPT were compared to those produced by a traditional planner for the same plans and the same metrics. We found that PlanGPT is no better than a greedy search strategy.

2606.10488 2026-06-10 cs.CV 新提交

5% > 100%: Flatness Preference is All You Need for Multimodal Parameter-Efficient Fine-Tuning

5% > 100%: 平坦性偏好是您进行多模态参数高效微调所需的一切

Yifan Zhu, Can Lin, Hangjie Yuan, Zixiang Zhao, Pengfei Zhang, Tao Feng, Zhonghong Ou

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Zhejiang University(浙江大学) ETH Zürich(苏黎世联邦理工学院) Anhui University of Science and Technology(安徽理工大学) Tsinghua University(清华大学) State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

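AI summary: Reveals a flatness preference common across parameter-efficient fine-tuning methods, in which a small fraction of sharp dimensions dominates generalization, and proposes FlatPO to optimize those dimensions and improve generalization.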

AI中文摘要

参数高效微调(PEFT)方法为将大型模型适应特定领域的多模态下游任务提供了一种简化和高效的工具。尽管这些方法在实践中证明了其实际效果,但其主要方面仍未得到充分探索。因此,我们仍然对各种PEFT方法中的潜在泛化机制以及如何进一步增强它们感到好奇。在本文中,我们揭示了各种PEFT中普遍存在的平坦性偏好,其中一小部分尖锐维度主导了PEFT的泛化。这一发现暗示了一种有吸引力的可能性:我们可能只需关注这一小部分尖锐维度而非所有维度,就能获得更好的泛化。此外,我们提出了平坦性偏好优化(FlatPO)来平坦化这些关键的尖锐维度,使各种PEFT朝向更好的泛化。大量实验证明了我们的发现和所提方法的有效性。代码可在以下网址获取:https://github.com/Can-Lin/FlatPO。

英文摘要

Parameter-Efficient Fine-Tuning (PEFT) methods provide a streamlined and efficient tool for adapting large models to domain-specific multimodal downstream tasks. Although these methods have proven effective in practice, their underlying principles remain under-explored. We therefore investigate the generalization mechanisms underlying various PEFT methods and how they can be further enhanced. In this paper, we reveal a flatness preference widely present across PEFT methods, in which a small fraction of sharp dimensions dominates generalization. This finding suggests an appealing possibility: better generalization may be achievable by attending only to this small fraction of sharp dimensions rather than to all of them. Furthermore, we propose Flatness Preference Optimization (FlatPO) to flatten these key sharp dimensions, steering various PEFT methods toward better generalization. Extensive experiments support our findings and demonstrate the effectiveness of the proposed method. Code is available at https://github.com/Can-Lin/FlatPO.

2606.10487 2026-06-10 cs.LG cs.AI 新提交

Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs

早停早省:隐藏状态探针作为LLM输出流式审核的实用方案

Huizhen Shu, Xuying Li, Piao Xue

发表机构 * ModelOneAI yunshanai(云山AI)

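AI summary: Proposes lightweight token-level probes on hidden states that detect unsafe outputs in real time inside the decoding loop with no additional forward pass, enabling sub-millisecond safety checks that can halt or modify generation early.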

Comments
Technical Report. 14 pages, 3 figures, 4 tables
AI中文摘要

在面向用户的系统中部署大型语言模型需要高效的输出安全过滤。现有方法通常依赖于生成后应用的单独审核模型,这会使推理成本翻倍,并且仅在生成完成后检测违规。我们观察到审核所需的信号已经存在于模型隐藏状态中。基于此,我们训练了轻量级的词元级探针,直接操作内部激活,生成每个词元的安全分数,这些分数可以聚合用于离线评估和在线干预。该探针重用生成器的激活,无需额外的前向传播,从而在解码循环内实现亚毫秒级的逐词元安全检查。应用于单个中间层的探针可以恢复强防护模型的大部分决策,作为一个低成本替代方案,优化延迟而非准确性。在流式设置中,它可以在不安全输出完全生成之前暂停或修改它们,用连续的词元级监控取代序列结束时的审核。与事后和流式防护模型相比,我们的方法实现了数量级的计算开销降低,且延迟成本最小。我们还提供了一个实用的部署方案,包括层选择、聚合策略、探测频率和触发阈值。最后,我们展示了探针的线性分量对应于残差空间中的一个方向,从而能够以可忽略的成本实现检测和激活引导。

英文摘要

Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a separate moderation model applied after generation, which doubles inference cost and only detects violations after generation completes. We observe that the signal needed for moderation is already present in the model's hidden states. Based on this, we train lightweight token-level probes that operate directly on internal activations, producing per-token safety scores that can be aggregated for both offline evaluation and online intervention. The probe reuses activations from the generator and requires no additional forward pass, enabling sub-millisecond per-token safety checks inside the decoding loop. A probe applied to a single middle layer recovers most decisions of a strong guard model, acting as a low-cost surrogate optimized for latency rather than accuracy. In streaming settings, it can halt or modify unsafe outputs before they are fully generated, replacing end-of-sequence moderation with continuous token-level monitoring. Compared to post hoc and streaming guard models, our method achieves orders of magnitude lower compute overhead with minimal latency cost. We also provide a practical deployment recipe, including layer selection, aggregation strategy, probing frequency, and triggering thresholds. Finally, we show that the probe's linear component corresponds to a direction in residual space, enabling both detection and activation steering at negligible cost.
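The streaming setup described above can be sketched as a linear probe applied to each token's hidden state, with generation halted once an aggregated score crosses a threshold. The probe weights, the sliding-window mean, and the threshold are hypothetical choices for illustration; the report's own layer selection, aggregation strategy, and trigger values are not given in the abstract.

```python
import math


def probe_score(hidden_state, weights, bias):
    """Logistic-regression probe: unsafe probability for one token."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def stream_with_probe(steps, weights, bias, threshold=0.5, window=2):
    """Emit tokens until the windowed mean probe score exceeds threshold.

    steps : iterable of (token, hidden_state) pairs, as a decoding loop
            would produce them (the hidden state is reused, so no extra
            forward pass is needed).
    Returns (emitted_tokens, halted).
    """
    emitted = []
    recent = []
    for token, hidden in steps:
        recent.append(probe_score(hidden, weights, bias))
        recent = recent[-window:]  # keep only the last `window` scores
        if sum(recent) / len(recent) > threshold:
            return emitted, True   # halt before emitting this token
        emitted.append(token)
    return emitted, False
```

A larger window smooths out single-token spikes at the price of a slower reaction, which is the kind of trade-off a deployment recipe would have to tune.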

2606.10481 2026-06-10 cs.LG cs.AI cs.CL cs.CR stat.ML 新提交

Advancing the State-of-the-Art in Empirical Privacy Auditing

推进经验隐私审计的最新水平

Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz

发表机构 * Google Research(谷歌研究院)

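AI summary: Proposes generating synthetic canaries via high-temperature sampling for empirical privacy auditing, introduces an auxiliary-model-based audit of synthetic data, and systematically studies how model capacity and canary entropy interact to affect memorization.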

AI中文摘要

大型语言模型的参数高效微调可能表现出对个别训练示例的问题性记忆。经验隐私审计(EPA)通过测量成员推断(MI)或重构攻击上的实际数据泄露来量化这种风险。EPA的一个关键挑战是设计与隐私敏感训练数据混合的“金丝雀”示例。我们提出通过从LLM中进行高温采样($T \geq 0.8$)生成合成金丝雀,使用针对隐私敏感训练数据定制的提示。这些金丝雀作为高影响异常值,确保高可识别性,从而实现强审计。此外,由于金丝雀本身是非私有的,它们是可检查的,并且可以重复插入,而不会危及真实数据的隐私。在隐私敏感数据上微调的模型的一个重要用途是生成合成数据。这也带来了隐私风险。我们引入了一种强大的合成数据审计方法,基于在合成数据上微调辅助模型。然后,对原始金丝雀的辅助模型进行审计,可以强有力地估计通过合成数据的隐私泄露。最后,利用我们强大的审计方法,我们系统研究了模型容量和金丝雀熵对记忆化的交互影响。

英文摘要

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage under membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.
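To see why higher sampling temperatures yield more unusual, higher-entropy canaries, the sketch below applies temperature scaling to a toy set of next-token logits and measures the entropy of the resulting distribution. The logits and function names are made up for illustration, and this is not the paper's canary-generation pipeline.

```python
import math


def temperature_softmax(logits, temperature):
    """Softmax of logits / temperature (temperature must be positive)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy logits (hypothetical): entropy rises as the temperature increases.
toy_logits = [2.0, 1.0, 0.0]
entropies = {t: entropy(temperature_softmax(toy_logits, t))
             for t in (0.2, 0.8, 1.5)}
```

As the temperature grows the distribution approaches uniform, so sampled sequences are less predictable and stand out more clearly as outliers during an audit.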

2606.10479 2026-06-10 cs.AI 新提交

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

ComBench: 奥林匹克级组合数学中严格证明推理与构造实现的基准测试

Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Tsinghua University(清华大学) The Chinese University of Hong Kong(香港中文大学)

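AI summary: Proposes the ComBench benchmark of 100 Olympiad-level combinatorics problems in two categories, analysis and construction, evaluating LLM reasoning through rubric-guided grading and construction verification; the strongest model reaches only 65.4% overall average accuracy, and proof reasoning and constructive realization turn out to be distinct capabilities.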

Comments
39 pages, 6 figures, 26 tables. Project page: https://simplified-reasoning.github.io/ComBench/docs/
AI中文摘要

组合数学是奥林匹克级数学问题解决的核心,需要深入的离散推理、创造性构造和严格的结构洞察。最近的证据表明,即使今天最强的前沿模型在奥林匹克组合问题上仍表现不均,揭示了创造性数学推理方面的差距。我们引入了ComBench,一个奥林匹克级组合数学基准,用于评估和诊断大语言模型的组合推理能力。ComBench包含100道人工标注的竞赛级问题,围绕两个互补的设置组织:以分析为中心的问题,主要需要严格的数学论证;以及以构造为中心的问题,除了正确性证明外还需要显式构造。评估协议结合了基于评分标准的证明评分和确定性构造验证,揭示了证明质量和构造有效性存在分歧的情况。对前沿开源和闭源模型的实验表明,ComBench远未饱和:最强模型总体平均准确率为65.4%,总体Best@4为75.3%。我们进一步发现,严格证明推理和构造实现是不同的能力:Kimi-K2.6在分析中心的证明评分上落后于GPT-5.5,但在构造中心的Best@4上超过它,而存在性和构造问题在代表性前沿模型中始终是最难的。

英文摘要

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.

2606.10478 2026-06-10 cs.CV 新提交

3D-CoS: A New 3D Reconstruction Paradigm Based on VLM Code Synthesis

3D-CoS:基于VLM代码合成的新型3D重建范式

Yuhao Wang, Puyi Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yu Cheng

发表机构 * Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学) Microsoft(微软) University of Oxford(牛津大学)

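AI summary: Proposes the 3D Code Synthesis (3D-CoS) paradigm, which represents 3D assets as executable Blender code and uses VLMs for programmatic reconstruction, offering strong controllability and localized editing.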

Comments
Preprint. 24 pages, 11 figures
AI中文摘要

最近的3D重建和编辑系统大多基于隐式或显式表示,如NeRF、点云或网格。尽管这些表示能够实现高保真渲染,但它们本质上是低层次的,难以通过编程控制。相比之下,我们提出并系统评估了一种新的3D重建范式——3D代码合成(3D-CoS),其中3D资产被构建为可执行的Blender代码,这是一种可编程且可解释的媒介。为了评估当前VLM使用代码表示3D对象的能力,我们在统一协议下评估了代表性的开源和闭源VLM在基于代码的重建中的表现。我们进一步引入了一套结构化的代码合成工作流,包括基于蓝图的规划、Blender API文档的检索增强生成(RAG)、少样本几何演示以及用于逐部分代码生成的组件级Agent工作流。为了展示这种表示的独特优势,我们进一步评估了局部文本驱动的修改,并将我们的基于代码的编辑与基于点云的3D编辑基线进行了比较。我们的研究表明,代码作为3D表示提供了强大的可控性和局部性,在目标编辑评估中产生了更强的编辑保真度和更好的未编辑区域保持。我们的工作还分析了这种范式的潜力,描绘了当前VLM在程序化3D建模中的能力边界,并强调了代码合成作为可编辑3D重建的一个有前景的方向。

英文摘要

Most recent 3D reconstruction and editing systems operate on implicit and explicit representations such as NeRF, point clouds, or meshes. While these representations enable high-fidelity rendering, they are fundamentally low-level and hard to control programmatically. In contrast, we propose and systematically evaluate a new 3D reconstruction paradigm, 3D Code Synthesis (3D-CoS), where 3D assets are constructed as executable Blender code, a programmatic and interpretable medium. To assess how well current VLMs can use code to represent 3D objects, we evaluate representative open-source and closed-source VLMs in code-based reconstruction under a unified protocol. We further introduce a suite of structured code-synthesis workflows, including blueprint-based planning, Retrieval-Augmented Generation (RAG) over Blender API documentation, few-shot geometric demonstrations, and a component-level Agent workflow for part-wise code generation. To demonstrate the unique advantages of this representation, we further evaluate localized text-driven modifications and compare our code-based edits with a point-cloud-based 3D editing baseline. Our study shows that code as a 3D representation offers strong controllability and locality, yielding stronger edit fidelity and better preservation of unedited regions in our targeted editing evaluation. Our work also analyzes the potential of this paradigm, delineates the current capability frontier of VLMs for programmatic 3D modeling, and highlights code synthesis as a promising direction for editable 3D reconstruction.

2606.10471 2026-06-10 cs.CL cs.AI 新提交

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

使用递归神经张量网络检测生物医学文本中的推测性语言

Dhruv Dixit

发表机构 * Stevens Institute of Technology(史蒂文斯理工学院)

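AI summary: Uses distributed sentence representations and deep learning, applying a Recursive Neural Tensor Network (RNTN) to automatically detect speculative language in biomedical literature; it slightly outperforms a linear bigram SVM (F1 = 0.885 vs. 0.881).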

Comments
12 Pages
AI中文摘要

在本研究中,我们通过利用分布式句子表示和先进的深度学习技术,深入探讨了生物医学文章中推测性语言的自动检测。这种识别的意义延伸至信息检索、多文档摘要以及新知识的探索。我们的探索涵盖了两种获取分布式句子表示的不同方法:段落向量模型和递归神经张量网络。然后,将这些方法与三种基础基线算法进行严格比较:支持向量机、朴素贝叶斯和模式匹配。我们的发现表明,递归神经张量网络(RNTN)的性能(F1=0.885)略优于表现最佳的基线线性双元SVM(F1=0.881)。同时,段落向量模型即使在使用大规模未标记数据集进行广泛训练后,效果也较差(F1=0.368)。我们对影响这些性能差异的因素进行了全面讨论,并为未来的研究方向提供了有见地的建议。

英文摘要

We study the automated detection of speculative language in biomedical articles using distributed sentence representations and deep learning techniques. Identifying such language matters for information retrieval, multi-document summarization, and the exploration of new knowledge. We examine two approaches for obtaining distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These are compared against three baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. The Recursive Neural Tensor Network (RNTN) shows a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training on a large unlabeled dataset. We discuss the factors behind these performance differences and offer recommendations for future research.
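The F1 values quoted in this abstract are harmonic means of precision and recall. A minimal sketch of the computation from confusion counts follows; the counts in the usage note are invented for illustration and are not results from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if true_pos == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

For example, a classifier that flags 10 sentences as speculative, 8 of them correctly, while missing 2 speculative sentences has precision 0.8 and recall 0.8, so `f1_score(8, 2, 2)` is 0.8 up to floating-point rounding.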

2606.10468 2026-06-10 cs.CV 新提交

Geometric Coastline Localization using Vision-Language Models

基于视觉语言模型的海岸线几何定位

Rafia Malik, Bernhard Pfahringer, Karin Bryan, Mark Dickson, Eibe Frank

发表机构 * The University of Waikato(怀卡托大学) The University of Auckland(奥克兰大学)

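AI summary: Treats coastline extraction as a geometric boundary localization task and builds CoastlineVLM-7B on the GeoChat-7B/LLaVA-1.5 architecture to predict polylines directly rather than segmentation masks, outperforming traditional segmentation methods on geometric metrics.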

AI中文摘要

遥感图像中的海岸线检测通常被表述为逐像素分割问题,通过后处理从预测掩码中提取最终海岸线。这种表述将海岸线几何(海岸变化分析中使用的主要表示)降级为次要产物而非学习目标。在实践中,海岸线由地貌代理(如植被线、沙丘趾或悬崖边缘)定义,而非像素级分割方法中常用的瞬时水陆边界。在这项工作中,我们从表示角度重新审视海岸线提取,并将任务表述为几何边界定位。我们使用新西兰海岸变化数据集(NZCCD)和来自新西兰土地信息局(LINZ)的高分辨率航空影像,开发了CoastlineVLM-7B,这是一个基于GeoChat-7B/LLaVA-1.5架构的视觉语言模型(VLM),联合执行海岸线存在检测、代理类型分类和海岸线定位。该模型直接预测海岸线为折线,而非密集分割掩码。我们在严格的单像素边界监督下,将CoastlineVLM-7B与分割基线进行评估。结果表明,基于几何的指标比像素重叠指标(如交并比IoU)更适合评估海岸线定位质量。CoastlineVLM-7B改善了与参考海岸线的全局几何对齐,将豪斯多夫距离从37.74米降至31.84米,地球移动距离从21.12米降至17.32米。这些结果表明,输出表示是海岸线提取中的关键设计选择,而面向几何的学习结合视觉语言模型的语义推理能力,与运营海岸监测中海岸线的定义和评估方式高度一致。

英文摘要

Coastline detection in remote sensing imagery is commonly formulated as a pixel-wise segmentation problem, where the final coastline is extracted from a predicted mask through post-processing. This formulation relegates coastline geometry, the primary representation used in coastal change analysis, to a secondary artifact rather than the learning objective. In practice, coastlines are defined by geomorphic proxies such as vegetation lines, dune toes, or cliff edges, rather than an instantaneous land-water boundary often used in pixel-based segmentation approaches. In this work, we revisit coastline extraction from a representation perspective and formulate the task as geometric boundary localization. We use the New Zealand Coastal Change Dataset (NZCCD) and high-resolution aerial imagery from Land Information New Zealand (LINZ) to develop CoastlineVLM-7B, a vision-language model (VLM) built on the GeoChat-7B/LLaVA-1.5 architecture that jointly performs coastline presence detection, proxy-type classification, and coastline grounding. The model directly predicts a coastline as a polyline rather than a dense segmentation mask. We evaluate CoastlineVLM-7B against segmentation baselines under strict one-pixel boundary supervision. Results show that geometry-based metrics are more suitable for assessing coastline localization quality than pixel-overlap metrics such as Intersection over Union (IoU). CoastlineVLM-7B improves global geometric alignment with reference coastlines, reducing Hausdorff distance from 37.74 m to 31.84 m and Earth Mover's Distance from 21.12 m to 17.32 m. These results indicate that output representation is a critical design choice in coastline extraction, and that geometry-oriented learning, combined with the semantic reasoning capabilities of vision-language models, aligns well with how coastlines are defined and evaluated in operational coastal monitoring.
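The Hausdorff distance reported above measures the worst-case gap between two curves. The sketch below computes it over polyline vertices only, which approximates the true curve-to-curve distance and assumes both polylines are sampled densely enough. Coordinates are taken to be planar and in metres, and the sample points in the usage note are invented.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two vertex lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

For a predicted polyline `[(0, 0), (1, 0), (2, 0)]` and a reference `[(0, 1), (1, 1), (2, 3)]`, the result is 3.0, driven entirely by the single outlying reference vertex. That sensitivity to the worst point is why it complements averaged measures such as Earth Mover's Distance.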

2606.10467 2026-06-10 cs.CL 新提交

Large Language Models as Modal Models in Linguistics

大语言模型作为语言学中的模态模型

Haruto Suzuki, Saku Sugawara

发表机构 * Keio University(庆应义塾大学) National Institute of Informatics(国立信息学研究所) University of Tokyo(东京大学)

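AI summary: This paper applies the modal-modeling framework from the philosophy of science to argue that large language models, as minimal models, have genuine epistemic value and can provide how-possibly explanations, but do not yet meet the conditions for how-actually explanations; their explanatory power lies on a continuum between the two.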

Abstract

The rapid advancement of large language models (LLMs) has intensified debates about their significance for linguistic theory. These debates are commonly divided into three positions: insulationism, which regards LLMs as irrelevant to human language; eliminativism, which claims that LLMs can replace traditional linguistic theories; and conciliationism, which views them as useful tools for linguistic research. To clarify these positions, this paper applies the framework of modal modeling from the philosophy of science. We argue that LLMs possess genuine epistemic value as minimal models, even without structural correspondence to human cognition. In particular, they can provide how-possibly explanations (HPEs) by testing modal claims about language acquisition and linguistic competence. We then examine the conditions under which LLMs could qualify as how-actually explanations (HAEs) of human language, drawing on the mechanistic account of scientific explanation. We argue that current LLMs do not yet satisfy these requirements. On the basis of this analysis, we propose understanding the explanatory power of LLMs as lying on a continuum between HPEs and HAEs. This framework avoids both overstating and understating their explanatory significance and offers a more precise basis for evaluating the role of LLMs in the scientific study of language.

2606.10466 2026-06-10 cs.LG cs.AI New submission

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

Du Yin, Hao Xue, Jinliang Deng, Yang Yang, Shuang Ao, Arian Prabowo, Flora Salim

Affiliations: University of New South Wales; HKUST(GZ), the Hong Kong University of Science and Technology (Guangzhou); BUAA, Beihang University

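AI summary: Proposes UPLOTS, a prompt-guided framework built on a unified pretrained language model that uses dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping for cross-domain constrained time-series generation; experiments on four benchmarks support its generalization and its benefit for data augmentation.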

Abstract

In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalability and fails to leverage shared temporal structures across domains. To address this fragmentation, we propose UPLOTS, a Unified, Prompt-guided Language model framework fOr constrained Time-Series Generation across diverse domains. Instead of building task-specific models, UPLOTS leverages a single pre-trained transformer backbone guided by learned constraint prompts, enabling on-demand generation with precise pattern control. One key innovation is our dynamic multi-dataset loss re-weighting and prompt-to-pattern mapping, which allows UPLOTS to internalize diverse temporal structures during training and conditionally generate them at inference. We evaluate UPLOTS on four real-world benchmarks and multiple constraint settings, including peak-period, calendar, load-level, and volatility patterns. Additional held-out constraint-combination and downstream forecasting experiments further demonstrate that UPLOTS generalizes beyond the original peak-pattern setting and improves data augmentation under scarce real-data regimes. Our code and baselines are available at an anonymous GitHub repository: https://anonymous.4open.science/r/UPLOTS-6C36.