arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.13439 2026-06-12 cs.CL cs.LG New submission

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

Mohammed Bouri, Mohammed Erradi, Adnane Saoud

Affiliations: College of Computing, Mohammed VI Polytechnic University; ENSIAS, University Mohamed V of Rabat; CID Development

AI summary: Proposes S-GBT, a second-order method that bounds the Hessian element-wise and adds a regularization term. Combining first- and second-order regularization improves certified robustness against word substitution attacks; validated on LSTMs and CNNs, with certified robust accuracy improved by up to 23.4%.

Comments
The paper has been accepted at NETYS 2026 - 14th edition of the International Conference on Networked Systems

Abstract

Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first-order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second-order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs of the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first- and second-order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.
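
As a hedged illustration of how a linear term and a quadratic term can jointly bound an output change, the block below states the generic second-order Taylor bound for a twice continuously differentiable scalar output $f$. It is not the paper's S-GBT statement. The symbols $\delta$ (the embedding perturbation induced by a substitution) and $M$ (an element-wise bound on the Hessian entries along the segment from $x$ to $x+\delta$) are assumptions introduced here for illustration.

```latex
\begin{align*}
f(x+\delta) &= f(x) + \nabla f(x)^{\top}\delta
  + \tfrac{1}{2}\,\delta^{\top}\nabla^{2} f(\xi)\,\delta,
  \qquad \xi = x + t\delta,\ t \in (0,1), \\
\bigl|\nabla^{2} f(z)_{ij}\bigr| \le M_{ij}\ \ \forall z \in [x, x+\delta]
  \;&\Longrightarrow\;
  \bigl|f(x+\delta) - f(x)\bigr|
  \le \sum_{i} \bigl|\partial_i f(x)\bigr|\,|\delta_i|
  + \tfrac{1}{2}\sum_{i,j} M_{ij}\,|\delta_i|\,|\delta_j| .
\end{align*}
```

Under this reading, a first-order regularizer shrinks the linear sum and a second-order regularizer shrinks the entries of $M$, which is one plausible way to interpret "controlling both the gradient and its variation".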

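2606.13435 2026-06-12 cs.RO New submission

GIVE: Grounding Human Gestures in Vision-Language-Action Models

Pengfei Liu, Gen Li, Junqiao Fan, Boyu Ma, Jindou Jia, Yang Xiao, Jianfei Yang

Affiliations: MARS Lab, Nanyang Technological University

AI summary: To address the inaccurate intent understanding that results when VLA models ignore gestures, proposes GIVE, which strengthens gesture understanding through complementary visual and semantic pathways. In real-world HRI experiments, target object recognition accuracy improves by 40% and task success rate by 80%.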

Comments
Project page: this https URL

Abstract

Human communication is inherently multimodal, where language is often accompanied by non-verbal cues such as gestures to convey intentions. However, current Vision-Language-Action (VLA) models treat robotic manipulation as a pure text-driven task, overlooking the important role of gestures in Human-Robot Interaction (HRI). This often leads to inaccurate intent grounding and unreliable manipulation when language instructions are ambiguous or underspecified. To address this challenge, we propose GIVE (Gesture Intent via Visual-Semantic Enhancement), an effective approach that enhances pre-trained VLA models with human gesture understanding without architectural modifications. Specifically, GIVE incorporates gesture information through two complementary pathways: a visual pathway that overlays hand skeletons and fingertip rays onto robot observations for explicit object grounding, and a semantic pathway that generates high-level descriptions of human gestures and task instructions for robust intent grounding. By jointly leveraging visual and semantic guidance, GIVE enables VLA policies to better associate gestures with manipulation behaviors and adapt to dynamic interaction intents. In real-world HRI experiments, GIVE substantially outperforms the baseline, improving target object recognition accuracy by 40% and overall task success rate by 80%, while demonstrating strong robustness and generalization to unseen spatial layouts and diverse participants.

2606.13432 2026-06-12 cs.CV cs.AI New submission

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

Affiliations: Kuaishou Technology; Tsinghua University; University of Science and Technology of China

AI summary: Proposes the OmniDirector framework, which encodes camera parameters as grid motion videos and trains on million-scale camera grid-video pairs to clone multi-shot camera motion without cross-paired data, with strong controllability.

Comments
12 pages, 8 figures

Abstract

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations, which fail to handle multi-shot generation, or synthesize cross-paired data, which suffers from data scarcity; both lead to poor performance on complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: this https URL

2606.13427 2026-06-12 cs.CV New submission

VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

Affiliations: University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary: Proposes the VietFashion benchmark, centered on the Ao Dai, a traditional Vietnamese garment, for multi-target retrieval that combines hand-drawn sketches with text descriptions. It reveals the shortcomings of existing methods in fine-grained cultural semantics and cross-modal composition.

Comments
ICMR 2026. Project page: this https URL

Abstract

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. The textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: this https URL.

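2606.13411 2026-06-12 cs.CL New submission

An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect

Dihia Lanasri, Fatima Benbarek

Affiliations: ATM Mobilis; USTHB, Algiers

AI summary: To address resource scarcity, code-switching, and related problems in rumour detection for the Algerian dialect, proposes an end-to-end hybrid framework that combines Transformer embeddings with a classical classifier, reaching an F1 of 0.84, and finds that domain-specific pre-training matters more than model size.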

Abstract

The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

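2606.13410 2026-06-12 cs.CV cs.HC New submission

Person Identification from Contextual Motion

Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni

Affiliations: Technion – Israel Institute of Technology; University of Haifa

AI summary: Proposes a generative model of how action instances are created and derives probabilistic identity inference schemes for surveillance and authentication applications. Introduces an interactive person identification scenario that maximizes mutual information through a sequential message exchange, achieving high recognition rates.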

Abstract

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information between the expected response and the subject's identity. Once recorded, the response is used to update the posterior probability over possible subject identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
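
The interactive loop described in the abstract (pick the cue with the highest mutual information, observe a response, update the posterior, stop at a confidence threshold) can be sketched with discrete distributions. This is a minimal sketch under stated assumptions, not the authors' model: responses are assumed to be discretized into a few categories, and `cue_models`, `respond`, and the 0.95 threshold are hypothetical names and values.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def mutual_information(prior, likelihoods):
    """I(identity; response) = H(response) - H(response | identity).

    likelihoods[s][r] is the assumed P(response r | subject s) for one cue.
    """
    n_responses = len(likelihoods[0])
    marginal = [sum(p * lik[r] for p, lik in zip(prior, likelihoods))
                for r in range(n_responses)]
    conditional = sum(p * entropy(lik) for p, lik in zip(prior, likelihoods))
    return entropy(marginal) - conditional


def select_cue(prior, cue_models):
    """Index of the cue whose expected response is most informative about identity."""
    return max(range(len(cue_models)),
               key=lambda c: mutual_information(prior, cue_models[c]))


def posterior_update(prior, likelihoods, response):
    """Bayes rule: P(subject | response) is proportional to P(response | subject) P(subject)."""
    unnormalized = [p * lik[response] for p, lik in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def identify(cue_models, respond, threshold=0.95, max_steps=20):
    """Run the sequential session until one identity is confident enough."""
    n_subjects = len(cue_models[0])
    belief = [1.0 / n_subjects] * n_subjects
    for _ in range(max_steps):
        if max(belief) >= threshold:
            break
        cue = select_cue(belief, cue_models)
        belief = posterior_update(belief, cue_models[cue], respond(cue))
    return belief


if __name__ == "__main__":
    # Two hypothetical subjects, two cues: cue 0 is uninformative, cue 1 separates them.
    cue_models = [
        [[0.5, 0.5], [0.5, 0.5]],
        [[0.9, 0.1], [0.1, 0.9]],
    ]
    print(identify(cue_models, respond=lambda cue: 0))
```

Choosing the cue greedily by mutual information is a common one-step-lookahead heuristic; whether the paper uses exactly this criterion per step is only stated at the level of the abstract.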

2606.13407 2026-06-12 cs.AI New submission

Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms

Hiba Ahmed, Alexander E.I. Brownlee, Jason Adair, Simon T. Powers

Affiliations: Computing Science and Mathematics, University of Stirling

AI summary: Proposes a metaheuristic approach based on Iterated Local Search and Simulated Annealing that optimizes appliance start times to maximize solar energy utilization, and handles multi-day task spillover.

Comments
9 pages; full results and methodology for poster paper accepted to GECCO 2026

Abstract

Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours, often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user-preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times while considering appliance operating durations, power consumption, inverter limits, battery state-of-charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
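
To make the optimization problem concrete, here is a minimal Simulated Annealing sketch for choosing appliance start times over a single day. It is a simplification under stated assumptions rather than the paper's formulation: the battery, inverter limits, and multi-day spillover are omitted, and the solar forecast, appliance list, cost weighting, and cooling schedule are all hypothetical.

```python
import math
import random

# Hypothetical hourly solar forecast in kW for one day (24 entries).
SOLAR = [0.0] * 6 + [0.5, 1.0, 2.0, 3.0, 3.5, 4.0, 4.0, 3.5, 3.0, 2.0, 1.0, 0.5] + [0.0] * 6

# Hypothetical appliances: (name, power in kW, duration in hours, preferred start hour).
APPLIANCES = [("cooker", 2.0, 1, 18), ("washer", 1.0, 2, 7), ("dryer", 2.5, 2, 20)]


def cost(starts, weight=0.1):
    """Energy not covered by solar (kWh) plus a weighted shift from preferred start times."""
    load = [0.0] * 24
    for start, (_, power, duration, _) in zip(starts, APPLIANCES):
        for hour in range(start, start + duration):
            load[hour] += power
    uncovered = sum(max(0.0, l - s) for l, s in zip(load, SOLAR))
    inconvenience = sum(abs(start - a[3]) for start, a in zip(starts, APPLIANCES))
    return uncovered + weight * inconvenience


def anneal(seed=0, steps=5000, t0=5.0, cooling=0.999, weight=0.1):
    """Simulated Annealing over start hours, starting from the preferred schedule."""
    rng = random.Random(seed)
    current = [a[3] for a in APPLIANCES]
    current_cost = cost(current, weight)
    best, best_cost = current[:], current_cost
    temperature = t0
    for _ in range(steps):
        # Neighbor move: shift one appliance by one hour, clipped to stay within the day.
        candidate = current[:]
        i = rng.randrange(len(candidate))
        latest_start = 24 - APPLIANCES[i][2]
        candidate[i] = min(max(candidate[i] + rng.choice((-1, 1)), 0), latest_start)
        candidate_cost = cost(candidate, weight)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        temperature *= cooling
    return best, best_cost


if __name__ == "__main__":
    schedule, value = anneal()
    print(dict(zip((a[0] for a in APPLIANCES), schedule)), round(value, 3))
```

The `weight` parameter controls the trade-off between solar utilization and user inconvenience; a multi-day variant would carry unfinished tasks into the next day's appliance list before re-running the search.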

2606.13405 2026-06-12 cs.AI cs.MA New submission

Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda

Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

Affiliations: German Research Center for Artificial Intelligence (DFKI); Saarland University

AI summary: Proposes treating the symbolic structures already present in a domain (regulations, process models, compliance constraints) as core architectural components of the agent, achieving compliance-by-construction as a complement to guardrail monitoring, and lists neuro-symbolic research challenges.

Comments
Accepted as a poster in NILA Workshop @ IJCAI-ECAI 2026

Abstract

LLM-based agents are entering regulated industries, where they automate judgment-intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges at the foundational and capability levels and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high-impact research domain.

2606.13394 2026-06-12 cs.RO New submission

GeoHAT: Geometry-Adaptive Hybrid Action Transformer for Mobile Manipulation

Xiangyu Zhu, Renjun Wu, Luzhou Ge, Jinyan Liu, Xuesong Li

Affiliations: Beijing Institute of Technology

AI summary: Proposes the GeoHAT framework, which injects geometric information through a lightweight Fourier spatial encoder and uses a hybrid whole-body action decoder to decompose arm and base actions, improving the success rate on the ManiSkill-HAB benchmark by 23.7%.

Abstract

Whole-body mobile manipulation requires coordinating a mobile base and a manipulator under shifting viewpoints, posing challenges in geometric perception and action generation. Current policies rely either on 2D features or on sparse 3D representations that lack dense spatial structure, and typically encode arm and base within one action vector that ignores their distinct control demands. Moreover, existing dense fusion strategies risk corrupting pretrained representations under noisy depth while incurring heavy computational overhead. We present GeoHAT, an end-to-end diffusion-based framework built on a simple principle: geometry should be injected only where reliable and attended to only where needed. GeoHAT employs a lightweight Fourier spatial encoder that maps dense per-pixel 3D coordinates into geometric tokens without an additional 3D vision backbone. These tokens are then selectively injected into vision foundation model features through per-token gated fusion modulated by depth validity, preserving the semantic prior while enriching spatial understanding. For action generation, a Hybrid Whole-Body Action Decoder decomposes arm and base into distinct subspaces and lets each action modality attend to its task-relevant visual context through sparse cross-attention, while causal temporal modeling captures intra-timestep coordination and inter-timestep dependencies. Experiments on the ManiSkill-HAB simulation benchmark demonstrate that GeoHAT achieves a 79.3% mean success rate, surpassing the strongest baseline by 23.7%. Furthermore, real-world experiments on diverse tasks confirm consistent improvements over all baselines.
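
The idea of turning per-pixel 3D coordinates into tokens and injecting them only where depth is valid can be illustrated with a standard sinusoidal Fourier feature map and a hard gate. This is a sketch under assumptions, not GeoHAT's encoder: the octave-spaced frequencies, the `num_bands` parameter, and the binary gate (the abstract describes a learned, per-token gate modulated by depth validity) are all illustrative choices.

```python
import math


def fourier_encode(xyz, num_bands=4):
    """Map one 3D point to sin/cos features at octave-spaced frequencies.

    Returns 3 * num_bands * 2 values; the frequency schedule is an assumption.
    """
    features = []
    for coord in xyz:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


def gated_fuse(visual_token, geo_token, depth_valid):
    """Add the geometric token to the visual token only where depth is valid.

    A hard 0/1 gate stands in for the learned gate described in the abstract.
    """
    gate = 1.0 if depth_valid else 0.0
    return [v + gate * g for v, g in zip(visual_token, geo_token)]


if __name__ == "__main__":
    geo = fourier_encode((0.1, -0.2, 1.5))
    visual = [0.0] * len(geo)
    print(len(geo), gated_fuse(visual, geo, depth_valid=False) == visual)
```

With an invalid depth reading the visual feature passes through unchanged, which is the property that protects the pretrained representation from noisy geometry.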

2606.13392 2026-06-12 cs.AI New submission

MiniMax Sparse Attention

Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Pengyu Zhao

Affiliations: MiniMax; Peking University; NVIDIA; Zhejiang University; Huazhong University of Science and Technology

AI summary: Proposes MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention in which a lightweight index branch selects Top-k key-value blocks for efficient long-context processing. On a 109B model at 1M context it cuts attention compute by 28.4x and delivers 14.2x prefill and 7.6x decoding speedups.

Comments
30 pages, 14 figures

Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: this https URL. A production-grade natively multimodal model powered by MSA has been publicly released at: this https URL.
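
The two-stage pattern in the abstract (score key-value blocks cheaply, keep the Top-k, then run exact attention over only those blocks) can be sketched for a single query in plain Python. This is an illustration under assumptions, not MSA itself: the block score used here (dot product with the block's mean key) is a hypothetical stand-in for the paper's Index Branch, and per-group selection for GQA is omitted.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(q, keys, values):
    """Exact scaled softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / norm for d in range(dim)]


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Select the top_k key-value blocks by a cheap score, then attend exactly over them."""
    n_blocks = math.ceil(len(keys) / block_size)
    blocks = [range(b * block_size, min((b + 1) * block_size, len(keys)))
              for b in range(n_blocks)]

    def block_score(indices):
        # Assumed index score: query dotted with the mean key of the block.
        mean_key = [sum(keys[i][d] for i in indices) / len(indices) for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(n_blocks), key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_k])  # keep original token order inside the selection
    selected = [i for b in chosen for i in blocks[b]]
    return attention(q, [keys[i] for i in selected], [values[i] for i in selected])


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    values = [[float(i), 1.0] for i in range(4)]
    print(block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=1))
```

When `top_k` equals the number of blocks the result coincides with dense attention, which gives a simple correctness check; the compute saving comes from attending over `top_k * block_size` tokens instead of the full context.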

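2606.13382 2026-06-12 cs.CV cs.AI New submission

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

Zian Yang, Zixin Wang

Affiliations: Fudan University

AI summary: Proposes SmartFont, a diffusion framework that combines global content-style generation with weakly supervised local corrective experts and introduces a denoising-state condition allocation module that dynamically weights global and local features, balancing global completeness and local detail fidelity in few-shot font generation.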

Abstract

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves both glyph quality and local detail fidelity.

2606.13379 2026-06-12 cs.LG cs.AR cs.ET New submission

Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

Affiliations: Machine Learning and Human Language Technology Group, Faculty of Computer Science, RWTH Aachen University; Apptek GmbH

AI summary: To address the loss of analog-to-digital conversion precision caused by positional encodings in memristor-based analog computation, either adjusts the proportion of ADC weight and precision bits or removes encoding-related linear transformations, reducing the degradation by about 50% and 30% relative, respectively.

Comments
Accepted at Interspeech 2026

Abstract

Memristors offer a new opportunity for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix multiplication. Yet computations on these devices are currently subject to large distortion, both in weight programming and in execution. In this work, we find that large output values of transformed positional encodings cause major degradation in the analog-to-digital conversion (ADC) that is part of memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by about 50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case, the degradation can be reduced by about 30% relative after removing encoding-related linear transformations.

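2606.13370 2026-06-12 cs.AI New submission

A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget

Joe Dwyer

Affiliations: Department of Computer Information Science, ECPI University

AI summary: Using a repeated measures design, analyzes how validation loss, perplexity, and other metrics change with token count when training a small Llama model under a fixed compute budget. Finds rapid early improvement followed by non-monotonic degradation, suggesting that compute-aware evaluation should examine training trajectories rather than endpoint metrics.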

Abstract

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
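
The relationship between the reported validation losses and perplexity, and one simple way to count loss backslides, can be sketched as follows. The sketch assumes the loss is mean cross-entropy in nats per token, so perplexity is its exponential; the abstract does not state the units. The `count_backslides` rule (any interval where loss rises above the previous interval) is a hypothetical definition, not the study's predefined criterion.

```python
import math


def perplexity(mean_cross_entropy_nats):
    """Perplexity implied by a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy_nats)


def count_backslides(losses, tolerance=0.0):
    """Number of intervals whose loss exceeds the previous interval's loss by more than tolerance."""
    return sum(1 for prev, cur in zip(losses, losses[1:]) if cur - prev > tolerance)


if __name__ == "__main__":
    # Mean validation losses quoted in the abstract.
    reported = {"initialization": 8.3552, "near 4M tokens": 2.7996, "final checkpoint": 3.9010}
    for label, loss in reported.items():
        print(f"{label}: loss={loss:.4f}, perplexity={perplexity(loss):.1f}")
```

Because perplexity is a monotonic function of loss, the fall-then-rise pattern in loss necessarily appears in perplexity as well, which matches the abstract's statement that both metrics follow the same trajectory.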

2606.13366 2026-06-12 cs.CV cs.MM New submission

Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization

Sanxin Jiang, Jiro Katto, Heming Sun

Affiliations: Shanghai University of Electric Power; Waseda University; Institute of Science Tokyo

AI summary: Proposes the DCIC framework, which combines a learned codec with a diffusion-based decoder and uses joint distortion and idempotence constraints to navigate the rate-distortion-perception Pareto frontier continuously, with no additional rate overhead.

Abstract

The rate-distortion-perception (RDP) trade-off extends classical rate-distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate-perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint (requiring that re-encoding the restored image recovers the base codec reconstruction) serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors $(K_D, K_P)$ jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC$_{RD}$ ($K_P{=}0$) and DCIC$_{RP}$ ($K_D{=}0$) arise as boundary curves, with DCIC$_{RDP}$ ($K_D = K_P=1$) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC$_{RDP}$ achieves superior BD-PSNR over all perceptual codecs, while DCIC$_{RP}$ matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.

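2606.13364 2026-06-12 cs.LG cs.CV New submission

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

Affiliations: Technion; NVIDIA

AI summary: Proposes the VideoMDM framework, which uses 2D poses from monocular videos to learn a 3D motion prior with a diffusion model, approximating 3D supervision with a depth-weighted 2D reprojection loss and approaching fully 3D-supervised performance on HumanML3D.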

详情
Comments
this https URL
AI中文摘要

我们提出VideoMDM,一个基于扩散的框架,直接从单目视频中提取的精确2D姿态训练3D人体运动先验,无需任何3D真实数据。预训练的2D到3D提升器提供近似的3D姿态序列,作为有噪声的教师:这些序列被扩散,模型在3D空间去噪,并通过重投影预测并与精确关键点比较在2D空间进行监督。我们证明,在温和假设下,深度加权的2D重投影损失在期望上等价于直接3D监督,并将标准3D运动正则化器——速度一致性和过参数化表示对齐——适应到这一2D设置。与仅在推理时将2D提升到3D的方法不同,VideoMDM在训练期间学习一个连贯的3D运动流形。在HumanML3D上,它几乎缩小了与完全3D监督的MDM的差距(FID 0.88 vs 0.54);在真实视频数据集Fit3D和NBA上,该方法学习生成一致被人类偏好的运动,并取得了强定量结果。

英文摘要

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
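
The depth-weighting idea can be illustrated with a minimal sketch under a pinhole-camera assumption. The function names, the `focal` parameter, and the specific weight `(z / focal) ** 2` are illustrative assumptions, not the paper's formulation: the sketch only shows why scaling a 2D error by depth can make it comparable to a lateral 3D error.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) with z > 0 (assumed camera model)."""
    x, y, z = point
    return (focal * x / z, focal * y / z)


def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1.0):
    """Mean squared 2D reprojection error, weighted by squared predicted depth.

    A lateral 3D offset d at depth z shrinks to focal * d / z on the image
    plane, so multiplying the squared 2D error by (z / focal) ** 2 undoes that
    shrinkage. The weight is a hypothetical choice made for illustration.
    """
    total = 0.0
    for (x, y, z), (u_t, v_t) in zip(pred_3d, target_2d):
        u, v = project((x, y, z), focal)
        weight = (z / focal) ** 2
        total += weight * ((u - u_t) ** 2 + (v - v_t) ** 2)
    return total / len(pred_3d)


# The same 0.1 lateral error at depth 2 and at depth 4 yields the same loss.
near = depth_weighted_reprojection_loss([(0.1, 0.0, 2.0)], [(0.0, 0.0)])
far = depth_weighted_reprojection_loss([(0.1, 0.0, 4.0)], [(0.0, 0.0)])
```

Without the weight, the far joint would contribute a quarter of the near joint's loss even though both are equally wrong in 3D.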

2606.13361 2026-06-12 cs.AI cs.CE cs.MA 新提交

Can I Buy Your KV Cache?

我能买你的KV缓存吗?

Luoyuan Zhang

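Affiliations * Harbin Institute of Technology, Shenzhen (HITSZ)

AI summary: Addresses AI agents repeatedly recomputing the KV cache for the same document: a publisher precomputes the KV cache and other agents pay to load it and skip prefill. Experiments on Qwen3-4B show a 9-50x reduction in compute cost, and the paper outlines an agent-native prefill CDN architecture.

Details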
AI中文摘要

现在,在世界各地,AI代理正在重复同样的荒谬行为:为了读取一份文档,每个代理都从头开始重新计算。每个代理都重新运行预填充——大型模型最计算密集的步骤——在相同的文本上,只是为了重建一个与之前代理刚刚构建的完全相同的键值(KV)缓存。相同的答案,被计算了一百万次。我们提出了一个几乎粗鲁简单的建议:只计算一次。让发布者预计算文档的KV缓存,然后让每个其他代理购买加载该缓存并跳过预填充的权利。这可行,并且是token精确的:加载预计算的KV并继续与从头开始预填充匹配(24/24个贪婪token,并且在logits级别),没有准确度损失。在Qwen3-4B上,重用比预填充计算便宜9-50倍,并且差距随长度增加而扩大(预填充的注意力与L^2成比例),因此一次重用就足以收回成本。然后关键部分:KV存储在哪里。传输它失败了,因为KV几乎不可压缩,因此每次加载的出口成本比它节省的预填充成本还要高。将其托管在提供方侧,正如生产中的提示缓存那样,完全消除了出口成本。奖励的大小由我们测量的计算节省决定:为80M代理提供一份热门的3774-token文档,重新预填充成本约150万美元,而重用计算成本仅约3万美元(减少49.7倍)。API收取的0.1倍缓存读取关税在测量范围内为用户提供了10倍的折扣,因此10倍是下限,而测量的约50倍计算节省超过了它,与物理约50倍的差距是提供方的利润:每份热门文档数百万美元。我们构建了由此产生的代理原生预填充CDN,并将无损KV压缩和跨方支付层作为开放问题。

英文摘要

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
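
The headline figures in the abstract can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that uses only the numbers quoted above (80M agents, about $1.5M of re-prefill, a 49.7x compute saving, a 0.1x cache-read tariff); the function names are hypothetical.

```python
def reuse_compute_cost(prefill_cost_usd, saving_factor):
    """Compute cost of serving from cache, given how many times cheaper reuse is."""
    return prefill_cost_usd / saving_factor


def user_discount(cache_read_tariff):
    """Discount factor seen by a user billed at a fraction of the normal input price."""
    return 1.0 / cache_read_tariff


prefill_total = 1.5e6                        # ~$1.5M to re-prefill for all agents
per_agent_prefill = prefill_total / 80e6     # cost of one agent's prefill
reuse_total = reuse_compute_cost(prefill_total, 49.7)
discount = user_discount(0.1)

print(f"per-agent prefill: ${per_agent_prefill:.5f}")
print(f"reuse compute:     ${reuse_total:,.0f}")
print(f"user discount:     {discount:.0f}x")
```

The reuse total comes out at roughly $30,000, which matches the ~$0.03M quoted in the abstract, and the 0.1x tariff corresponds to the stated 10x user discount.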

2606.13355 2026-06-12 cs.RO cs.AI 新提交

Real-Time Execution with Autoregressive Policies

基于自回归策略的实时执行

Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

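Affiliations * Korea Institute of Science and Technology, Seoul National University, Google Research

AI summary: Achieves real-time execution for autoregressive policies through asynchronous inference and constrained decoding, improving task completion speed while guaranteeing low latency; experiments show it outperforms flow-matching policies.

Details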
AI中文摘要

实时执行通过异步推理实现平滑动作轨迹和快速响应,对于大规模视觉-语言-动作模型的实际部署至关重要。然而,近期关于实时执行的工作主要关注扩散策略的变体,尽管自回归策略在同步推理中滚动速度较慢,更需要实时性。相比之下,我们证明自回归策略可以通过调整分词范围和应用约束解码来实现实时执行,从而保证严格的延迟界限,支持多轨迹解码以最大化性能。在模拟和真实环境中,我们发现自回归策略始终优于同等水平的流匹配策略,同时显著提升了同步推理的任务完成速度。结合自回归策略的固有优势(如更快的收敛速度和更好的指令遵循泛化能力),这些结果证实自回归策略仍是一种支持实时执行的竞争性策略类型。

英文摘要

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

2606.13352 2026-06-12 cs.RO 新提交

Low cost, easily manufactured, highly flexible strain and touch sensitive fiber for robotics applications

低成本、易制造、高柔性应变与触觉传感纤维用于机器人应用

Christian Diaz Herrera, Srushti Raste, Simin Liu, Miles Modeste, Jiyang (Patton) Yin, Katelyn McCall, Yuxing Jared Yao, Roopkamal Chahal, Simon Chidley, Trung Ha, T. David Westmoreland, Sonia Roberts

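Affiliations * Wesleyan University

AI summary: Presents a conductive fiber that can be made quickly using only inexpensive commercial parts and tools and that provides both resistive strain sensing and capacitive touch sensing; experiments demonstrate its use in robotic grasping, pose estimation, and near-field tracking.

Details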
AI中文摘要

现有的机器人拉伸和触觉传感器通常在材料成本、所需制造设备或制造时间方面至少有一项昂贵。我们提出并实验表征了一种导电纤维,仅使用廉价的商用现成部件(导电线程$0.07/英尺,硅胶管$0.94/英尺)和工具(环形针穿线器$2),可快速制造(20厘米长度2分钟)。我们展示了其作为电阻应变传感器的三种应用:触发气动辅助手指的抓取、感知气动机器人带的位置、以及估计柔性固体的姿态。我们还展示了其作为电容传感器的两种应用:首先,作为触觉传感器触发商业机器人手臂移动;其次,作为近场传感器使机器人手臂跟随移动的手。电容传感器通过编织制成,展示了纤维的高柔性。我们讨论了提高制造可扩展性的方法及其成本权衡。最后,我们展示了一种修复切断纤维的方法。

英文摘要

Existing stretch and touch sensors for robots are generally expensive with respect to at least one of material costs, required manufacturing equipment, or manufacturing time. We present and experimentally characterize a conductive fiber made using only inexpensive commercial off-the-shelf parts (conductive thread at $0.07/ft, silicone tubing at $0.94/ft) and tools (loop-style needle threader at $2), which can be manufactured quickly (20 cm length in 2 minutes). We demonstrate its use as a resistive strain sensor with three applications: triggering a grasp in a pneumatically actuated assistive finger, sensing the pose of a pneumatically actuated robotic strap, and estimating the pose of a flexible solid. We also demonstrate that it can be used as a capacitive sensor with two applications: first, as a touch sensor that triggers a commercial robot arm to move, and second, as a near-field sensor enabling the robot arm to follow a moving hand. The capacitive sensors are knitted, showcasing the high flexibility of the fiber. We discuss methods for improving manufacturing scalability and their cost trade-offs. Finally, we demonstrate a method for repairing a cut fiber.

2606.13349 2026-06-12 cs.CL 新提交

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

从被动生成到主动调查:一种主动的科学同行评审代理

Haishuo Fang, Yue Feng, Iryna Gurevych

Affiliations * Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt; National Research Center for Applied Cybersecurity ATHENE, Germany; School of Computer Science, University of Birmingham

AI summary: Proposes ProReviewer, an LLM-based proactive scientific peer review agent that models reviewing as a Markov Decision Process and uses a structured review log to guide proactive investigation; it achieves the highest average score across five quality dimensions, outperforming existing methods.

Details
AI中文摘要

大型语言模型(LLM)在自动化科学同行评审方面显示出潜力。然而,现有方法通常难以生成有具体证据支持的深入评审。我们认为,一个关键限制是缺乏根据累积证据主动调查论文可疑部分的灵活性,就像人类评审员所做的那样。在本文中,我们探讨如何使基于LLM的评审代理能够进行这种主动调查。我们发现,这可以自然地表述为马尔可夫决策过程(MDP),并提出了ProReviewer,一种科学同行评审代理,它通过维护的结构化评审日志主动评审论文。结构化评审日志作为代理的工作空间,用于跟踪评审过程中收集的证据和中间发现。实验表明,使用8B骨干网络、通过监督微调训练并通过强化学习优化的ProReviewer,在五个质量维度上取得了最高平均分,相对优于基于提示的方法(使用更大的前沿LLM)高达39%,优于最强的微调基线16%。在人工评估中,它也取得了对基线最高的胜率。

英文摘要

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, with relative gains of up to 39% over prompt-based methods that use much larger frontier LLMs and of 16% over the strongest fine-tuned baseline. It also attains the highest win rates against baselines in human evaluation.

2606.13348 2026-06-12 cs.CL cs.AI 新提交

IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds

IVIE:一种用于增量且经过验证的交互式小说世界生成的神经符号方法

Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

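Affiliations * Instituto de Computación, Facultad de Ingeniería, Universidad de la República

AI summary: Proposes IVIE, a neuro-symbolic approach that combines the creativity of LLMs with the coherence of symbolic validation, building playable interactive fiction worlds through a four-stage incremental generation pipeline; human evaluation shows it generates immersive, thematically coherent worlds that balance flexibility with narrative consistency.

Details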
Comments
10 pages, 3 figures. To appear in the Proceedings of the 16th International Conference on Computational Creativity (ICCC'26), June 2026
AI中文摘要

交互式小说中的计算创造力面临一个基本矛盾:大型语言模型(LLM)可能产生创意叙事,但难以维持世界连贯性,而符号系统确保一致性但缺乏创意灵活性。我们提出IVIE(增量与验证的交互体验),一种从零开始生成完整且可玩的交互式小说世界的神经符号方法。基于PAYADOR的神经符号框架,IVIE实现了一个四阶段增量生成管道,将创意决策——设定与角色创建、谜题设计——委托给LLM,同时通过符号验证将世界状态接地。该系统生成具有相互关联的地点、功能性物品、非玩家角色和连贯谜题的世界,所有这些都围绕一个中心目标导向架构组织。人类评估表明,该方法生成了沉浸式、主题连贯的世界,具有高玩家参与度。结果似乎表明,神经符号方法成功平衡了灵活性与叙事连贯性:符号验证在不消除生成自由的情况下将LLM生成接地。然而,挑战依然存在:LLM的不一致性偶尔会绕过谜题约束,客观验证的空白允许一些结构上不可能的目标。我们为未来的神经符号交互式叙事系统确定了关键设计考虑因素,特别是关于LLM的能力及其局限性。

英文摘要

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLMs) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions (setting and character creation, puzzle design) to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neuro-symbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.

2606.13345 2026-06-12 cs.CV 新提交

JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

JointEdit3D:统一潜在空间中的前馈3D场景编辑

Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan

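Affiliations * East China Normal University, Shanghai Artificial Intelligence Laboratory, Fudan University, Tencent

AI summary: Proposes JointEdit3D, which performs feed-forward 3D scene editing via asymmetric latent inpainting in a unified RGB-geometry reconstruction-generation latent space, introduces a SceneAnchor branch and edit/background-aware losses, and builds the SceneEdit3D-15K dataset and SceneEdit3D-Bench benchmark, significantly improving edited-region quality and 3D structural completeness.

Details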
Comments
Preprint. Project page: this https URL
AI中文摘要

现有的3D场景编辑方法通常依赖于对显式3D表示进行逐场景优化或级联编辑-重建流水线,导致测试时成本高、3D感知有限以及结构不一致。为了在编辑过程中耦合外观合成和几何预测,我们构建了一个统一的RGB-几何重建生成潜在空间,并将其适应于前馈3D场景编辑。由此产生的框架JointEdit3D通过仅观察单个编辑后的RGB参考潜在变量,并在源场景锚定下生成剩余的RGB视图和编辑后的几何潜在变量,执行非对称潜在修复。JointEdit3D引入了一个专门的SceneAnchor分支来注入源场景结构而不强制直接复制,并采用编辑/背景感知损失来平衡编辑区域的保真度与未编辑内容的保持。为了解决缺乏用于标准化3D场景编辑评估的配对资源的问题,我们引入了SceneEdit3D-15K数据集,该数据集包含15K个配对编辑样本和渲染器提供的3D注释,以及SceneEdit3D-Bench,一个精心挑选的100样本基准。实验表明,JointEdit3D在保持竞争性背景保留的同时,在编辑区域质量和3D结构完整性方面优于先前基线。

英文摘要

Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

2606.13340 2026-06-12 cs.RO 新提交

EMG-Based Adaptation of Anisotropic Virtual Fixtures for Robot-Assisted Surgical Resection and Dissection

基于EMG的各向异性虚拟夹具自适应方法用于机器人辅助手术切除与解剖

Dario Onfiani, Michael Dyck, Luigi Biagiotti, Julian Klodmann

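Affiliations * University of Modena and Reggio Emilia, German Aerospace Center (DLR)

AI summary: Proposes a framework that adaptively modulates an anisotropic virtual fixture based on EMG signals, adjusting the constraint dynamically by inferring the surgeon's intent in real time; experiments show improved surgical accuracy and movement consistency and reduced cognitive load.

Details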
AI中文摘要

本文针对机器人辅助腹腔镜手术中的精细任务(如切除和解剖),开发了一种自适应辅助系统。尽管虚拟夹具在引导外科医生运动方面具有显著优势,但传统虚拟夹具通常由固定几何形状定义,缺乏适应手术流程或外科医生即时意图的灵活性。为解决这些局限性,我们提出了一种自适应各向异性虚拟夹具的新框架。此外,我们引入了一种直观的控制接口,该接口基于从EMG信号推断的外科医生意图,实时调节夹具的几何形状。该方法允许外科医生通过收缩前臂肌肉动态扩展或解除约束,实现精确引导运动和工具自由重新定位之间的无缝切换。基于标准化手术训练任务的初步用户研究实验结果表明了所提方法的有效性。该系统在任务精度和运动一致性方面表现出显著改善,同时降低了感知认知负荷、努力和挫败感。

英文摘要

In this paper, we address the development of an adaptive assistance system for robot-assisted laparoscopic surgery, specifically for delicate tasks such as resection and dissection. Although virtual fixtures offer significant advantages for guiding a surgeon's movements, conventional virtual fixtures are often defined by fixed geometries and lack the flexibility to adapt to the surgical workflow or the surgeon's immediate intent. To address these limitations, we propose a novel framework for an adaptive and anisotropic virtual fixture. In addition, we introduce an intuitive control interface that modulates the fixture's geometry in real time based on the surgeon's intent, inferred from EMG signals. This approach allows the surgeon to dynamically expand or disengage the constraint by contracting their forearm muscles, enabling seamless transitions between precise guided motion and free repositioning of the tool. Experimental results from a pilot user study, based on a standardized surgical training task, demonstrate the effectiveness of the proposed method. The system showed significant improvements in task accuracy and movement consistency, alongside a reduction in perceived cognitive load, effort, and frustration.

2606.13338 2026-06-12 cs.LG 新提交

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

导航安全-保真度权衡:通过概率场景进行电力系统的大规模多变量时间序列预测

Kaijie Xu, Anqi Wang, Xilin Dai

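Affiliations * ZJU-UIUC Institute, Zhejiang University

AI summary: Addresses the inability of existing benchmarks to evaluate the safety-fidelity trade-off in massive-variate probabilistic forecasting by introducing PowerPhase, a power-system benchmark with up to 36,964 channels, and PowerForge, a scenario-based quantile forecaster that achieves the best average rank across multiple grids.

Details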
AI中文摘要

概率预测模型越来越多地部署在具有不同通道物理特性和运行约束的多变量系统上,但现有基准无法大规模评估这两个属性。公开的规范多变量基准最多包含2,000个通道,而电力系统基准要么缺乏时间结构,要么缺乏概率评估。我们提出PowerPhase,这是一个基于六个输电网络构建的概率预测基准,联合预测通道数从2,000到36,964,比流行的规范多变量基准高出一个数量级以上。每个目标轨迹是交流潮流求解的输出,PowerPhase配备了约束感知指标,包括Safety_mBrier、NECV和CVaR-alpha,作为CRPS和Distortion的补充。在八个基线和三个随机种子上,分布准确性和约束满足对模型进行不同排序,我们将这种权衡称为安全-保真度。我们进一步提出PowerForge,一种基于场景的分位数预测器,具有类型特定的解码头和变量组之间的因果桥,在每个网格上实现了最佳平均排名。

英文摘要

Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks lack either temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
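
To show why a tail metric such as CVaR-alpha can rank models differently from an average-case score, here is a minimal sketch of the generic empirical CVaR over sampled scenarios. It is the textbook definition, not PowerPhase's exact metric, and the violation values are made-up illustrative numbers.

```python
import math


def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of losses (larger loss = worse).

    The rounding guards against floating-point noise such as
    (1 - 0.7) * 10 evaluating to 3.0000000000000004.
    """
    ordered = sorted(losses, reverse=True)
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


# Hypothetical per-scenario constraint violations (e.g. line overload, p.u.).
violations = [0.0] * 7 + [0.1, 0.2, 0.5]
mean_violation = sum(violations) / len(violations)
tail_violation = empirical_cvar(violations, alpha=0.8)
```

Here the mean violation is small because most scenarios are safe, while the CVaR at alpha = 0.8 averages only the two worst scenarios and is therefore much larger. A forecaster can look good on the first number and poor on the second.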

2606.13332 2026-06-12 cs.CV 新提交

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

OR-Action: 细粒度动作的多角色视频理解

Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab

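Affiliations * Technical University of Munich, Munich Center for Machine Learning (MCML), Carl Zeiss AG

AI summary: Addresses the lack of temporal modeling in scene-graph methods for operating room activity understanding by proposing a fine-grained, multi-role action benchmark built on a public dataset and a vision-only temporal model that significantly outperforms graph-based methods, along with a multi-view to single-view feature alignment strategy that improves single-view performance.

Details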
AI中文摘要

对手术室活动的细粒度理解能够实现工作流感知的辅助,但由于杂乱、遮挡和有限的感知,仍然困难。建模该环境的主流方法是使用场景图作为OR交互的可解释表示。然而,在没有显式时间建模的情况下,将它们的逐帧关系预测转换为时间上延伸的细粒度动作是具有挑战性的。为了对当前OR理解方法进行原则性的时间评估,我们引入了第一个以动作为中心的基准,该基准基于公开可用的自我中心-外部中心OR数据集,通过定义细粒度的多角色动作分类法,并通过从地面真实场景图状态变化中蒸馏生成密集动作片段。在该基准上的实验表明,当前的场景图预测方法难以建模时间结构,即使通过图神经网络添加显式建模也是如此。因此,我们引入了一种纯视觉时间模型,当使用所有可用的自我中心视频作为输入时,该模型显著优于基于图的方法。在此模型基础上,我们还引入了一种新颖的多视角到单视角特征对齐策略,提高了多角色动作识别的单视角性能,减少了对大量自我中心视频采集的需求。基准和代码将在接收后发布。

英文摘要

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi-view to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

2606.13322 2026-06-12 cs.CL 新提交

Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation

基于LLM并行文本生成的低延迟实时音频游戏解说系统

Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

Affiliations * The University of Tokyo, National Institute of Advanced Industrial Science and Technology, Technical University of Munich, Keio University, Carnegie Mellon University, Nara Women’s University

AI summary: Proposes a low-latency real-time game commentary system that runs text generation in parallel with speech playback, reducing mean inter-utterance silence from 9.6 seconds to 0.3 seconds and significantly improving commentary rhythm.

Details
Comments
Accepted at IJCAI-ECAI 2026 (Demonstrations Track)
AI中文摘要

我们提出了一种低延迟实时音频游戏解说系统,可直接从实时游戏视频生成语音解说。在这种端到端设置中,关键瓶颈是累积等待时间;传统流程顺序执行帧捕获、文本生成和语音合成,且直到语音播放完成才请求下一次生成。这种严格顺序性导致语句间出现长且不自然的静默。为解决这一延迟瓶颈,我们的系统将文本生成与语音播放并行运行,并预先缓冲多个候选语句,从而在播放边界实现即时合成。在快节奏游戏视频上的实验表明,与顺序基线相比,我们的并行设计将平均句间静默从9.6秒降至0.3秒。它还将与专业演讲的静默时间模式相似度提高了40%以上,一项包含120名经验游戏玩家的用户研究证实,感知到的说话节奏显著改善。我们的演示视频可在以下网址获取:this https URL。

英文摘要

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: this https URL.
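
The effect of overlapping generation with playback can be seen in a small timeline simulation. This is a simplified sketch, not the authors' system: it assumes one utterance is generated at a time, ignores frame capture and speech synthesis latency, and lets the generator run back-to-back instead of modelling the candidate buffer. All names and timings are illustrative.

```python
def sequential_silences(gen_times, play_times):
    """Silence before each utterance after the first, when generation of the
    next utterance starts only once the previous one has finished playing."""
    return list(gen_times[1:])


def parallel_silences(gen_times, play_times):
    """Silence before each utterance after the first, when the generator keeps
    working while the previous utterance is still being played."""
    silences = []
    gen_done = gen_times[0]                # utterance 0 is ready at this time
    play_end = gen_done + play_times[0]    # and finishes playing at this time
    for gen, play in zip(gen_times[1:], play_times[1:]):
        gen_done += gen                    # generation ran during playback
        start = max(play_end, gen_done)    # wait only if text is not ready
        silences.append(start - play_end)
        play_end = start + play
    return silences


gen_times = [3, 3, 3, 3]    # seconds to generate each utterance (made up)
play_times = [4, 4, 4, 4]   # seconds to speak each utterance (made up)
print(sequential_silences(gen_times, play_times))
print(parallel_silences(gen_times, play_times))
```

In this toy setting the sequential pipeline pauses for the full generation time before every utterance, while the parallel one pauses only when generation takes longer than playback.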

2606.13317 2026-06-12 cs.CL 新提交

SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents

SkillCAT: 面向LLM智能体的对比评估与拓扑感知技能自进化

Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

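Affiliations * School of Computer Science, Wuhan University; School of Computer Science, Fudan University

AI summary: Proposes SkillCAT, a framework that achieves training-free skill self-evolution for LLM agents through three stages (contrastive causal extraction, assessment-augmented evolution, and topology-aware task execution), with average gains of up to 40.40% across multiple benchmarks.

Details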
Comments
9 pages, 6 figures
AI中文摘要

LLM智能体的技能自进化方法旨在将执行轨迹转化为可复用的技能文档,但当前流程通常每个任务只学习一条轨迹,在检查前合并候选技能补丁,并在推理前加载完整技能语料库。我们提出SkillCAT,一个无需训练的框架,将该过程分为三个阶段。对比因果提取(CCE)为每个任务采样多条轨迹,并比较同任务的成功/失败对,以识别解释结果差异的证据。评估增强进化(AAE)在源任务克隆上回放每个候选补丁,并在层次化技能补丁合并前仅保留改善或保持任务结果的补丁。拓扑感知任务执行(TTE)将进化后的技能编译成可路由的子技能拓扑,因此推理仅加载与任务相关的能力节点。我们在常见智能体基准上评估SkillCAT,包括SpreadsheetBench、WikiTableQuestions和DocVQA,并进一步测试跨模型和分布外泛化。在这些设置中,SkillCAT将基线平均得分提升高达40.40%,展示了无需模型训练的可靠技能进化。

英文摘要

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.
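
The assessment step (AAE) amounts to a gate that a candidate patch must pass before merging. The sketch below is one possible reading of "keeps only patches that improve or preserve task outcomes"; the `replay` callback, the per-task score dictionaries, and the patch names are hypothetical stand-ins for whatever the framework actually records.

```python
def filter_patches(candidates, baseline, replay):
    """Return the candidate patches that are no worse than the baseline on
    every source-task clone.

    baseline: dict mapping task id -> score without the patch
    replay:   callable mapping a patch -> dict of task id -> score with it
    """
    kept = []
    for patch in candidates:
        outcomes = replay(patch)
        if all(outcomes[task] >= baseline[task] for task in baseline):
            kept.append(patch)
    return kept


# Toy example: scores are 1.0 for success and 0.0 for failure.
baseline = {"t1": 0.0, "t2": 1.0}
recorded = {
    "patch_a": {"t1": 1.0, "t2": 1.0},   # fixes t1, keeps t2
    "patch_b": {"t1": 1.0, "t2": 0.0},   # fixes t1 but breaks t2
    "patch_c": {"t1": 0.0, "t2": 1.0},   # changes nothing
}
kept = filter_patches(list(recorded), baseline, recorded.__getitem__)
```

Only `patch_a` and `patch_c` survive. `patch_b` is rejected because it regresses a task that previously succeeded, which is the failure mode that merging before checking would let through.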

2606.13316 2026-06-12 cs.AI 新提交

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

ReSum: 通过强化学习协同LLM推理与摘要生成

Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu

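Affiliations * University of Science and Technology of China; AMAP, Alibaba Group

AI summary: Proposes ReSum, a framework that uses self-summarization to let LLMs compress and organize their reasoning trajectories, adaptively triggering summarization through contrastive evaluation; it improves performance by 4% while reducing reasoning length by 18.6%.

Details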
Comments
24 pages, including 13 pages of main text and 11 pages of appendix
AI中文摘要

可验证奖励强化学习(RLVR)是提升大语言模型(LLM)长程推理的核心技术。然而,现有RLVR方法常鼓励不必要的长推理轨迹,这会降低推理连贯性并耗尽可用上下文预算。现有的长上下文组织方法通常依赖外部机制来组织轨迹,而非让模型自主管理推理过程。为解决此局限,我们提出ReSum,一种新颖的RLVR框架,使LLM能够通过自摘要压缩和组织其推理轨迹。我们的初步研究表明,自摘要通过降低token级熵来稳定生成,并且引入“摘要”短语可显著减少从错误轨迹前缀传播的误差。受此启发,ReSum采用一种摘要感知的自适应轨迹机制,通过对比评估自摘要是否有利于当前推理过程。具体而言,当模型自发触发自摘要时,ReSum屏蔽摘要短语以创建对比分支;对于非摘要位置,则随机注入该短语以创建匹配分支。我们进一步设计了摘要感知优势函数,以实现对比轨迹之间更细粒度的比较。大量实验表明,ReSum在平均提升4%性能的同时,将推理长度减少18.6%。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.

2606.13315 2026-06-12 cs.CV eess.IV 新提交

Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

用于3D脑部MRI的掩码和预测自监督基础模型

Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal

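Affiliations * Istanbul Technical University, NYU Langone Health

AI summary: Studies self-supervised foundation models for MRI-based disease detection, proposing a spectral-domain reconstruction loss (for MAE) and variance-covariance regularization (for JEPA), and shows across five downstream tasks that it matters to match the objective design to the task structure.

Details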
AI中文摘要

自监督基础模型在医学影像中展现出巨大潜力。然而,现有的MRI基础模型研究主要强调分割和密集预测任务,而针对基于MRI的疾病检测的自监督基础模型的系统研究仍然有限。在这项工作中,我们研究了两种主要的自监督预训练范式用于基于MRI的疾病检测:通过掩码自编码器(MAE)的基于重建的学习和通过联合嵌入预测架构(JEPA)的预测表示学习。我们通过引入一种新颖的MAE频谱域重建损失来增强对细粒度解剖结构的敏感性,并通过在我们的JEPA框架中集成方差-协方差正则化(VCR)来鼓励去相关的潜在表示,从而研究辅助目标的作用。我们的模型在对比度无关的设置下,在异质单对比度MRI体积上进行预训练,无需模态拼接。在五个下游疾病检测任务中,我们的结果突出了自监督目标设计对医学基础模型预训练的重要性,表明每个目标的下游收益由其与任务结构的相关性决定。具体来说,当下游判别信号以强高频解剖结构为特征时,频谱正则化带来最大的改进;而当判别信息跨越多个去相关的特征维度时,协方差正则化最为有益。具有频谱域监督的MAE在基于MRI的疾病检测中始终实现优越的下游性能。这些发现表明,医学影像中的自监督目标编码了特定的偏差,其下游收益根本上取决于任务的结构。

英文摘要

Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance-covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.
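
For readers unfamiliar with variance-covariance regularization, the following sketch shows one common form of it, a VICReg-style penalty, written in plain Python for a tiny batch. Whether the paper's VCR term matches this form exactly is an assumption; `gamma` and `eps` are illustrative hyperparameters.

```python
import math


def variance_covariance_penalty(z, gamma=1.0, eps=1e-4):
    """Return (variance_term, covariance_term) for a batch of embeddings.

    z is a list of n embedding vectors of dimension d. The variance term is a
    hinge that penalizes dimensions whose standard deviation falls below
    gamma; the covariance term penalizes squared off-diagonal covariances,
    which pushes dimensions towards being decorrelated.
    """
    n, d = len(z), len(z[0])
    means = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in z]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    var_term = sum(
        max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(d)
    ) / d
    cov_term = sum(
        cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j
    ) / d
    return var_term, cov_term


decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [-1.0, -1.0], [1.0, 1.0], [-1.0, -1.0]]
_, cov_decorrelated = variance_covariance_penalty(decorrelated)
_, cov_redundant = variance_covariance_penalty(redundant)
```

The second batch carries the same information in both dimensions, so its covariance term is large, while the first batch incurs no covariance penalty.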

2606.13312 2026-06-12 cs.CV cs.GR 新提交

MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

MagPlus: 通过可学习放大桥接微表情到常规表情

Sliman Jammal, Andrei Sharf

发表机构 * Ben-Gurion University of the Negev(内盖夫本-古里安大学)

AI总结 提出MagPlus管道,通过可学习放大将微表情运动映射到常规表情范围,再利用标准表情模型处理,最后用DeMagPlus恢复强度,无需重新训练即可生成逼真微表情。

详情
AI中文摘要

面部微表情是短暂而细微的面部运动,为真实人类情感提供重要线索。然而,由于标注的微表情数据有限且底层面部运动极其微弱,建模和生成微表情仍然困难。现有的微表情生成方法因此常面临质量有限、鲁棒性弱和泛化能力差的问题。我们提出MagPlus,一个可迁移的微表情处理管道,将微表情分析与标准面部动画模型连接起来。MagPlus不是从头训练专用生成器,而是学习将细微面部运动放大到常规表情范围,将微表情转换为与现有面部表情处理模型兼容的信号。放大后的序列随后被标准面部表情模型用于迁移和合成等任务。互补的DeMagPlus模块将生成的运动恢复为逼真的微表情强度水平,同时保留合成的动态。我们使用四个面部动画模型评估该框架:FOMM、FSRT、MetaPortrait和EmoPortraits。这些模型均未在微表情数据上训练。实验表明,MagPlus-DeMagPlus使预训练的宏表情模型能够生成更逼真的微表情运动,而无需重新训练主干网络。

英文摘要

Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.

2606.13311 2026-06-12 cs.LG cs.AI New submission

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

Yongmin Kim, ByeongHoon Jeon, Sungil Kim

Affiliation: Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)

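AI summary: Proposes the RGFiLM module, which uses a rarity-controlled gate to regulate the strength of context modulation, addressing the high false-alarm rate that rare contexts cause in contextual anomaly detection; it achieves the best F1-FPR trade-off among the compared methods on maritime trajectory anomaly detection.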

Details

Abstract

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions in which rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1–False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
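A minimal sketch of rarity-gated feature-wise modulation, under stated assumptions, may make the mechanism concrete. The abstract does not give the formulas, so everything here is illustrative: the rarity score is taken as one minus a context's frequency relative to the most common context, the gate is a sigmoid of that score, and "more decisive under rare contexts" is read as the gate opening toward 1. The names `rarity_scores`, `gate`, and `rarity_gated_film`, and the parameters `sharpness` and `threshold`, are hypothetical; in a real model `gamma` and `beta` would be produced from the context by a learned network rather than fixed.

```python
import math
from collections import Counter


def rarity_scores(contexts):
    """Map each discrete context label to a rarity score in [0, 1).

    Assumption: rarity = 1 - count / count_of_most_common_context.
    """
    counts = Counter(contexts)
    most_common = max(counts.values())
    return {c: 1.0 - n / most_common for c, n in counts.items()}


def gate(rarity, sharpness=8.0, threshold=0.5):
    """Sigmoid gate: near 0 for frequent contexts, near 1 for rare ones."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - threshold)))


def rarity_gated_film(hidden, gamma, beta, rarity):
    """Apply FiLM-style scale and shift, attenuated by the rarity gate.

    With g = 0 the hidden features pass through unchanged; with g = 1
    this reduces to (1 + gamma) * h + beta.
    """
    g = gate(rarity)
    return [(1.0 + g * gm) * h + g * b for h, gm, b in zip(hidden, gamma, beta)]


# Made-up context labels: 90 frequent observations, 10 rare ones.
contexts = ["calm"] * 90 + ["storm"] * 10
scores = rarity_scores(contexts)

hidden, gamma, beta = [1.0, -2.0], [0.5, 0.5], [0.1, -0.1]
frequent_out = rarity_gated_film(hidden, gamma, beta, scores["calm"])
rare_out = rarity_gated_film(hidden, gamma, beta, scores["storm"])
```

With these toy numbers the frequent context leaves the hidden features almost untouched, while the rare context receives close to the full scale and shift.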