Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis
生成式动画:面向提示驱动运动合成的多模型流水线
AI总结 提出一种结合大语言模型和分割模型的流水线,将自然语言提示自动转换为符合场景几何、深度遮挡和3D透视变换的动画运动路径。
生成式动画:面向提示驱动运动合成的多模型流水线
Mannat Khurana, Sanyam Jain, Rishav Agarwal
AI总结 提出一种结合大语言模型和分割模型的流水线,将自然语言提示自动转换为符合场景几何、深度遮挡和3D透视变换的动画运动路径。
动画将数字文档提升为沉浸式体验,然而创建自定义运动路径仍然繁琐,需要设计师手动选择预设、绘制贝塞尔点并配置时间属性。我们引入了生成式动画,这是一个将自然语言提示转换为生产就绪动画的系统。通过将用于语义解析的大语言模型(LLMs)与用于视觉基础的Segment Anything Model(SAM)串联,我们的流水线自动生成尊重场景几何、处理基于深度的遮挡并考虑3D透视变换的运动路径。我们通过三个用例演示该系统:轮廓跟随轨迹、具有z轴顺序意识的轨道动画以及变换对象上的透视对齐运动。
Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers to manually select presets, plot Bézier points, and configure timing properties. We introduce Generative Animations, a system that transforms natural language prompts into production-ready animations. By chaining Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, our pipeline automatically generates motion paths that respect scene geometry, handle depth-based occlusions, and honor 3D perspective transforms. We demonstrate the system through three use cases: contour-following trajectories, orbital animations with z-order awareness, and perspective-aligned motion on transformed objects.
EpiCurveBench: 评估视觉语言模型在流行病曲线数字化中的表现
Thomas Berkane, Maimuna S. Majumder
AI总结 针对现有图表数据提取基准中忽略时间序列结构的问题,提出包含1000张真实流行病曲线图像的EpiCurveBench基准和基于动态规划的EpiCurveSimilarity评估指标,实验表明最强模型仅达52.3% ECS,且ECS能更好区分模型性能。
使用视觉语言模型(VLM)进行图表到数据提取的评估,越来越多地依赖于那些显示递减余量的基准(前沿VLM在ChartQA上超过89%)以及将提取点视为无序键值对的指标,忽略了时间序列的时间结构,并将小的对齐偏移视为灾难性失败。我们通过EpiCurveBench(一个从多种公共卫生来源精选的1000张真实流行病曲线图像基准)和EpiCurveSimilarity(ECS,一种通过动态规划对齐预测序列和真实序列的评估指标,容忍局部时间偏移和间隙,同时按比例惩罚它们)来解决这两个空白。评估六种方法——三种前沿闭源VLM、一种开源VLM和两种专门的图表提取系统——我们发现最强的模型仅达到52.3% ECS,并且ECS将四种通用VLM分散在25个百分点的范围内,而键值指标(RMS、SCRM)将它们压缩在5个百分点的范围内。我们进一步针对四个下游流行病学汇总统计量验证ECS,发现更高的ECS预测更小的总计数、峰值时间和峰值幅度误差,以及更高的增长率保真度;在所有四个统计量中,ECS的相关性比动态时间规整强1.5-3.6倍,后者缺乏间隙惩罚,因此无法区分截断预测与时间保真预测。EpiCurveBench针对一个高影响力的公共卫生应用——解锁被困在已发表图表中的数十年的疫情数据——但该基准和指标直接适用于任何结构化时间序列图表提取场景。
Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods--three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems--we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity; across all four, ECS correlates 1.5--3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application--unlocking decades of outbreak data trapped in published figures--but the benchmark and metric apply directly to any structured time-series chart-extraction setting.
并非所有标记都同等重要:基于关键标记监督的动态上下文向量蒸馏用于长医学报告生成
Ning Wu, Rui Liu, Xinkun Lin, Weixing Chen, Jinxi Xiang, Tao Wei, Lina Yao, Mingjie Li
AI总结 提出DIVE框架,通过关键标记监督和状态条件动态引导,解决长文本生成中标记级蒸馏忽略关键标记的问题,在医学报告生成任务上取得最佳性能。
将示范效果蒸馏到隐藏空间干预中提供了一种轻量级的替代全微调的方法。然而,现有的多模态变体主要是在短文本任务上评估的,其中输出在几个标记后结束。将这些方法扩展到长文本生成暴露了一个基本但未充分研究的局限性:标记级蒸馏隐式地将所有输出标记视为同等信息量,但长文本输出由高频模板和语法标记主导,而实际决定输出质量的标记稀疏分布。在医学报告生成(MRG)中,有两种这样的关键标记突出:决定诊断内容的病理相关标记和决定终止的序列结束(EOS)事件。两者在均匀交叉熵下都受到不足的监督,自回归解码通过偏离教师强制轨迹进一步加剧了问题。我们提出DIVE,一个冻结骨干的蒸馏框架,通过两种与这些失败相匹配的互补机制来解决长文本报告生成。关键标记监督通过提高病理相关标记和EOS事件的交叉熵贡献来恢复监督平衡,确保内容保真度和终止在训练期间学习,而不是在解码时施加。状态条件动态引导用隐藏状态相关的适配器替换固定的开环残差,允许注入信号随着解码漂移而适应。在MIMIC-CXR和CheXpert Plus上使用两个医学VLM骨干的实验表明,DIVE在词汇和临床代理指标中始终位列最强方法之一。我们的方法在所有数据集-骨干设置中实现了最佳的BLEU-4、ROUGE-L和RadGraph F1,同时在粗粒度标签级CheXbert F1上保持竞争力。
Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset--backbone settings, while remaining competitive on coarse label-level CheXbert F1.
在大音频语言模型中学习何时在聆听时思考
Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu
AI总结 提出一种可学习的等待-思考-回答控制机制,通过多奖励强化学习优化大音频语言模型在流式语音交互中的推理时机,在提升准确率的同时减少响应延迟。
近期大音频语言模型(LALMs)的进展使得实时、流式的语音交互越来越实用。在这种场景下,推理质量和响应速度紧密耦合:将推理延迟到语音端点可以提高答案质量,但会将思考时间转移到用户可见的响应延迟中,而过早回答则可能在决定性证据到达之前做出承诺。我们为LALMs引入了一种可学习的等待-思考-回答控制公式。受人类对话渐进性启发,控制器在部分音频证据下决定何时等待、何时外化紧凑的推理更新、以及何时回答。以Qwen2.5-Omni-7B为基础模型,我们从语音推理数据中构建对齐的等待-思考-回答轨迹,使用监督微调(SFT)训练控制器,然后应用解耦裁剪和动态采样策略优化(DAPO)。奖励结合了答案正确性、动作有效性、更新时机、延迟同步、推理质量和链一致性,优化完整的等待-思考-回答轨迹,而不仅仅是最终答案。在一个六任务合成语音推理问答(SRQA)基准上,六奖励DAPO控制器将行加权准确率从67.6%提升到70.3%,同时在相同Qwen部署环境下将端点后最终思考长度减少14%。在一个包含186个人类录音的真实音频基准(Real Audio Bench)上,作为超越文本转语音(TTS)渲染语音的迁移检查,控制器家族仍然有效:SFT实现了最强的准确率,而六奖励DAPO控制器是唯一最终思考长度低于基础模型的学习变体。这些结果表明,流式模型应该学习在音频流中何时使中间推理显式化。
Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.
超越二元:认知评分层级中的语音表征
Serli Kopar, Roshan Prakash Rane, Christian Mychajliw, Lydia Federmann, Gerhard Eschweiler, Daniela Berg, Sam Gijsen, Paula Andrea Perez-Toro, Kerstin Ritter
AI总结 本研究利用5,754份德语神经心理学评估录音,比较手工声学特征与自监督学习嵌入在轻度认知障碍认知评估层级(任务、领域、全局)中的表现,发现任务约束与评估层级之间的关联。
本研究考察了轻度认知障碍中语音表征与认知评估层级结构之间的关系。利用5,754份德语神经心理学评估录音,我们在三个评分层级(任务、领域和全局)上评估了六项认知任务。我们比较了手工声学特征与自监督学习(SSL)嵌入。结果表明,尽管SSL表示在较低层级通常优于手工特征,但这种趋势在MCI分类中发生逆转。此外,任务特定约束影响性能:响应自由度较大的任务随着层级增加表现出性能稀释,表明“专家”表示,而高度结构化任务的性能向更高层级增加,表明“通才”表示。这些发现揭示了自动临床语音分析中任务约束与评估层级之间的联系。
This study examines the relationship between speech representations and the hierarchical structure of cognitive assessment in mild cognitive impairment. Utilizing 5,754 German neuropsychological assessment recordings, we evaluate six cognitive tasks across three score levels: task, domain, and global levels. We compare hand-crafted acoustic features with self-supervised learning (SSL) embeddings. Results show that although SSL representations generally outperform hand-crafted features at lower levels, this trend reverses for MCI classification. Furthermore, task-specific constraints influence performance: tasks with greater response freedom exhibit performance dilution as hierarchical levels increase, suggesting ``specialist'' representations, whereas the performance of highly structured tasks increases toward higher levels, suggesting ``generalist'' representations. These findings show links between task constraints and assessment hierarchy in automated clinical speech analysis.
MAIGO: 通过历史清理的在线策略自蒸馏缓解对话丢失
Haoyu Zheng, Yun Zhu, Shu Yuan, Shangming Chen, Qing Wang, Wenqiao Zhang, Jun Xiao, Yueting Zhuang
AI总结 针对大语言模型在多轮对话中性能下降(对话丢失)的问题,提出MAIGO方法,通过在线策略自蒸馏和清理历史助手回复来减少自污染,无需验证器或推理时辅助,显著提升多轮对话准确性。
大语言模型通常能从完整指定的提示中解决任务,但当相同需求在多轮中展开时,性能会下降,这被称为对话丢失(LiC)差距。我们将这种退化部分归因于自污染:中间助手的回复进入后续上下文,并将早期偏差向前传递。受此机制启发,我们提出了MAIGO,一种在线策略自蒸馏方法,通过使用模型自身策略的历史清理参考来减少这种污染。对于中间轮次,MAIGO移除先前的助手回复,同时保留用户可见的分片前缀;对于回答轮次,它从基于完整用户侧对话的配对全视图参考中蒸馏。一个可靠性权重降低与干净参考不一致的中间轮次样本的权重。MAIGO不需要验证器奖励、状态标签或推理时辅助。在具有确定性验证器的LiC配对视图协议下,MAIGO将Qwen2.5-7B-Instruct的SHARDED准确率从52.8提升至66.1,SHARDED/FULL比率从66.5%提升至84.1%,同时保持FULL准确率在2.3个点以内。这些结果表明,自污染是LiC差距中一个可训练的成分。
Large language models often solve tasks from a fully specified prompt but degrade when the same requirements unfold over multiple turns, known as the lost-in-conversation (LiC) gap. We trace part of this degradation to self-contamination: intermediate assistant replies enter later context and carry early deviations forward. Motivated by this mechanism, we propose MAIGO, an on-policy self-distillation method that reduces this contamination using history-cleaned references from the model's own policy. For middle turns, MAIGO removes prior assistant replies while preserving the user-visible sharded prefix; for answer turns, it distills from paired full-view references conditioned on the completed user-side dialogue. A reliability weight downweights middle-turn samples that disagree with the clean reference. MAIGO requires no verifier rewards, state labels, or inference-time scaffolding. Under the LiC paired-view protocol with deterministic verifiers, MAIGO improves Qwen2.5-7B-Instruct SHARDED accuracy from 52.8 to 66.1 and the SHARDED/FULL ratio from 66.5% to 84.1%, while keeping FULL accuracy within 2.3 points. These results show that self-contamination is a trainable component of the LiC gap.
FoundObj: 自监督基础模型作为无标签3D物体分割的奖励
Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang
AI总结 提出FoundObj框架,利用自监督2D/3D基础模型的语义和几何先验作为奖励,通过强化学习引导超点合并,实现无标注复杂场景3D物体分割。
我们解决了在训练过程中不依赖任何场景级人类标注的复杂场景点云中3D物体分割的挑战性任务。现有方法通常局限于识别简单物体,这主要是由于学习过程中物体先验不足。在本文中,我们提出了FoundObj,一个新颖的框架,其特点是基于超点的物体发现代理,该代理在我们的创新语义和几何奖励模块的指导下逐步合并合适的相邻超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验,为物体发现代理提供互补反馈,并通过强化学习实现对多类物体的鲁棒识别。在多个基准上的大量实验表明,我们的方法始终优于现有基线。值得注意的是,我们的方法在零样本和长尾场景中表现出强大的泛化能力,突显了其在可扩展、无标签3D物体分割方面的潜力。
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.
AI在声音设计师工作流程与体验中的整合研究
Nelly Garcia, Joshua Reiss
AI总结 通过混合方法研究(76人调查+20人访谈),发现当前AI工具在快速消费媒体中表现良好,但缺乏高端声音设计所需的叙事复杂性,从业者偏好辅助性、任务特定的应用,而非端到端生成系统。
人工智能正越来越多地被整合到专业音频制作工作流程中,然而开发者生产的工具与实际声音设计师的需求之间仍存在差距。本文通过一项混合方法研究调查了这一差距,包括对76名从业者的调查以及对20名行业专业人士的后续半结构化访谈。使用描述性统计分析和主题分析对结果进行分析,以识别两个数据集中的模式。我们的分析得出了五个主题:上下文、工作流程、潜力、风险和正确使用。我们的工作表明,当前的AI工具在快速消费媒体环境中表现良好,但缺乏高端声音设计(电影、沉浸式体验等)所需的叙事复杂性。从业者表现出对辅助性、任务特定应用的偏好,特别是在音频修复和库管理方面,而不是端到端生成系统。这项工作为创意产业中AI及AI增强工具的使用正在进行的讨论做出了贡献。我们从声音设计师和创意音频从业者的角度报告了该领域的当前状况,并根据我们的发现为声音技术专家和开发者提供了一系列建议,以指导开发更明智的AI声音设计工具。
Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences etc). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the on-going discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendation for sound technologist and developers based on our findings to guide the development of more informed AI tools for sound design.
将文本嵌入与利益相关者关联对齐
Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell
AI总结 提出利益相关者对齐练习方法,通过评估嵌入模型与人类专家的语义距离一致性,发现神经文本嵌入在丹麦政策案例中可靠性显著低于专家(差距19-26个百分点),且该差距在美国联邦AI用例中复现(16个百分点)。
文本嵌入被广泛用于分析大型复杂文本语料库。然而,尚不清楚这些嵌入是否捕捉到与使用它们的人类专家相同的语义距离。确保嵌入表示与人类意图一致对于有效分析至关重要。我们提出了利益相关者对齐练习,这是一种使专家关联显式化并将嵌入模型结果扎根于人类理解的方法。在我们关于丹麦政策问题的主要案例研究中,我们发现神经文本嵌入的可靠性远低于人类专家(差距19-26个百分点),并且这种不对齐会传播到下游聚类性能(练习排名与聚类质量之间的Spearman $ρ=0.9$)。一项关于美国联邦AI用例的二次研究使用数字协议和不同的专家社区在英语中复现了该差距(16个百分点)——表明该差距并非单一工具或领域的产物。利益相关者对齐练习提供了一种实用方法,用于评估嵌入模型是否捕捉到对领域专家最重要的语义区分。
Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $ρ=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.
TCBiRRT:紧耦合双臂空间机械臂的任务空间随机扩展快速运动规划
Jiawei Zhang, Xinhao Miao, Jifeng Guo, Qinghua Li, Chengchao Bai
AI总结 针对紧耦合双臂空间机械臂在闭链约束下的运动规划问题,提出一种任务空间约束的双向快速随机扩展树算法(TCBiRRT),通过在任务空间直接采样和节点扩展,结合路径逆运动学映射和重抓取机制,显著提高规划成功率和速度。
在大型空间结构的在轨组装中,为紧耦合双臂空间机械臂规划在闭链约束下的运动路径是一个基础且具有挑战性的问题。闭链约束显著减少了可行构型空间,使得现有规划器难以高效生成无碰撞运动,尤其是在杂乱环境中。为解决这一问题,本文提出了一种任务空间约束的双向快速随机扩展树算法,称为TCBiRRT。与在高维构型空间中运行的传统方法不同,所提方法直接在由操作对象位姿定义的任务空间中进行随机采样和节点扩展。开发了一种任务空间节点扩展策略来生成候选对象运动,然后通过路径逆运动学算法将其映射到连续关节路径。该方法进一步与双向RRT框架和重抓取机制集成,以高效连接两个随机树。在具有不同环境复杂度的代表性在轨组装场景中进行了大量仿真。结果表明,与最先进的规划器相比,TCBiRRT实现了显著更高的成功率和数量级的规划时间改进。所提方法为紧耦合双臂空间机械臂的运动规划提供了一种高效且鲁棒的解决方案。
Planning the motion path for a tightly coupled dual-arm space manipulator under closed-chain constraints is a fundamental yet challenging problem in on-orbit assembly of large-scale space structures. The closed-chain constraints significantly reduce the feasible configuration space, making it difficult for existing planners to efficiently generate collision-free motions, especially in cluttered environments. To address this issue, this paper proposes a task-space constrained bidirectional rapidly-exploring random tree algorithm, termed TCBiRRT. Unlike conventional methods that operate in the high-dimensional configuration space, the proposed approach performs random sampling and node expansion directly in the task space defined by the manipulated object pose. A task-space node expansion strategy is developed to generate candidate object motions, which are then mapped to continuous joint paths using a path inverse kinematics algorithm. The method is further integrated with a bidirectional RRT framework and a regrasp mechanism to efficiently connect two random trees. Extensive simulations are conducted in representative on-orbit assembly scenarios with varying levels of environmental complexity. The results demonstrate that TCBiRRT achieves significantly higher success rates and orders-of-magnitude improvements in planning time compared to state-of-the-art planners. The proposed method provides an efficient and robust solution for motion planning of tightly coupled dual-arm space manipulators.
符号查询还是语义检索?面向半结构化问答的数据集与方法
Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales, Cristina Cornelio
AI总结 提出 DualGraph 框架,通过文本知识图谱和符号知识图谱双视图实现半结构化文档的语义检索与符号查询结合,并在 SpecsQA 基准上超越现有方法。
检索增强生成(RAG)系统通常通过查询与文档块之间的语义相似性来检索证据。虽然这种方法对非结构化文本有效,但在半结构化语料库上可靠性较低,因为回答可能需要跨多个文档的结构化属性进行精确过滤、聚合或穷举检索。符号方法支持此类操作,但在嘈杂的自然语言语料库上往往脆弱。我们通过 DualGraph 解决了这一差距,这是一个 RAG 框架,通过两种互补视图表示文档:用于语义检索的文本知识图谱和用于对类型化主语-谓语-宾语三元组进行符号查询的符号知识图谱。基于这两个组件,我们提供了多种策略来选择或组合语义和符号证据。我们还引入了 SpecsQA,这是一个来自商业购物网站的基准测试,包含半结构化产品文档和人工策划的问题,涵盖开放式和面向规格的检索。实验表明,DualGraph 在各种问题类型上始终优于最先进的密集检索、GraphRAG、符号和基于表格的基线。代码和数据可在 https://github.com/corneliocristina/DualGraphRAG 获取。
Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject--predicate--object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence.We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types.Code and data are available at https://github.com/corneliocristina/DualGraphRAG.
因果特征在战略分类中的作用:鲁棒性与对齐
Antonio Gois, Sophia Gunluk, Nir Rosenfeld, Nidhi Hegde, Simon Lacoste-Julien, Dhanya Sridhar
AI总结 本文通过因果模型分析战略分类中的分布偏移,证明因果分类在噪声有界时达到最优误差,并分解OOD交叉熵风险,揭示因果特征在长期激励对齐中的优势。
在战略分类中,机构(例如银行)预期用户会改变其特征以提高分类任务(例如贷款偿还)中的效用,从而进行适应。由于关键挑战是用户引起的分布偏移,我们转向因果模型,该模型已被证明可以限制最坏情况下的分布外(OOD)风险,并建立了几个将因果关系与战略分类联系起来的新结果。首先,我们证明,当噪声以某种方式有界时,因果分类在任何足够大的适应后都能达到最优分类误差。其次,当这些假设不成立时,我们证明最优分类器的OOD交叉熵风险分解为一个OOD偏差项和一个由未使用所有可观测特征引起的项,从而使我们能够理解因果分类器何时具有优势。最后,我们证明使用因果特征可以允许机构与用户之间的长期激励对齐,这与先前强调此类方法社会成本的工作形成对比。我们在合成数据上凭经验验证了我们的理论,发现我们的结果预测了实际行为。
In strategic classification, an institution (e.g., a bank) anticipates adaptation from users who change their features to increase utility in a classification task (e.g., loan repayment). Since a key challenge is the distribution shift induced by users, we turn to causal models, which have been shown to bound the worst-case out-of-distribution (OOD) risk, and establish several new results that link causality and strategic classification. First, we show that causal classification leads to optimal classification error after any sufficiently large adaptation, when the noise is bounded in a certain way. Second, when these assumptions do not hold, we show OOD cross-entropy risk of optimal classifiers decomposes into an OOD bias term and a term arising from not using all observable features, allowing us to understand when causal classifiers have an advantage. Finally, we show that the use of causal features can allow alignment of long-term incentives between institutions and users, contrasting with previous work that highlights social costs of such approaches. We validate our theory empirically on synthetic data, finding that our results predict behavior in practice.
马达加斯加语动词变位的形式化
Joro Ny Aina Ranaivoarison, Eric Laporte, Baholisoa Simone Ralalaoherivony
AI总结 本文基于Unitex平台,通过构建电子词典和有限状态转换器,实现了马达加斯加语简单动词的形态分析,并优先保证可读性以便语言学家扩展更新。
本文报告了为构建基于词典的马达加斯加语简单动词形态分析器所进行的核心语言工作。该分析器使用Unitex平台,包括构建马达加斯加语简单动词的电子词典。数据基于形态特征进行编码。动词词干的形态变化及其与屈折词缀的组合通过可编辑图表示的有限状态转换器进行形式化。78个转换器使Unitex能够生成词干变体的词典。另外271个转换器被形态分析器用于识别变位动词中的词干和词缀。词典和转换器的设计优先考虑可读性,以便语言学家能够扩展和更新它们。
This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. It uses the Unitex platform and comprised the contruction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.
具有复数值乘积单元的动力学系统模型发现
Martin Brückmann, Babette Dellen, Uwe Jaekel
AI总结 提出基于复数值乘积单元网络的数据驱动方法,直接从观测轨迹学习包含分数或负指数单项式的稀疏线性组合,从而发现动力学系统的控制方程。
从观测轨迹中发现动力学系统的控制方程比单纯预测未来状态能更深入地理解其结构。我们提出一种基于复数值乘积单元网络的数据驱动模型发现方法,其中每个单元表示一个复数值单项式,网络输出是这些单项式的稀疏线性组合。与SINDy等基于库的方法不同,我们的方法不需要预定义候选函数集:相关的单项式(包括分数或负指数)直接从数据中学习。在四个混沌基准系统(Lorenz63、Lorenz84、四翼吸引子和Lorenz63的分数阶变体)上,使用至少3000个训练点,我们对前三个系统在90%的试验中恢复了精确的控制方程,对分数阶情况在70-90%的试验中恢复。应用于真实世界的人体步态加速度计信号,模型产生具有有界预测误差的稳定轨迹,在比训练间隔长三倍的测试时间范围内,RMSE约为信号幅度范围的12-14%,展示了其在高维系统(其中解析方程不可用)中的潜力。
Discovering the governing equations of a dynamical system from observed trajectories provides deeper insight into its structure than mere prediction of future states. We present a data-driven approach to model discovery based on complex-valued product-unit networks, in which each unit represents a complex monomial and the network output is a sparse linear combination of such monomials. In contrast to established library-based methods such as SINDy, our approach does not require a predefined set of candidate functions: the relevant monomials, including those with fractional or negative exponents, are learned directly from data. Across four chaotic benchmark systems (Lorenz63, Lorenz84, the Four-Wing attractor, and a fractional variant of Lorenz63), we recover the exact governing equations in 90% of trials for the first three systems, and in 70-90% of trials for the fractional case, using at least 3000 training points. Applied to real-world human-gait accelerometer signals, the model produced stable trajectories with bounded prediction errors, corresponding to an RMSE of approximately 12-14% of the signal amplitude range over a test horizon three times longer than the training interval, demonstrating its potential for high-dimensional systems in which analytic equations are unavailable.
检测不等于解决:检索增强型大语言模型中的监控控制差距
Zhe Yu, Wenpeng Xing, Chen Ye, Xuyang Teng, Bo Yang, Changting Lin, Meng Han
AI总结 本文通过多轮文档累积协议发现检索增强型大语言模型存在监控控制差距,即模型能识别矛盾证据但无法安全约束最终建议,并揭示其机制在于行动选择缺陷。
检索增强型大语言模型被部署用于证据质量决定行动安全的任务,但评估协议假设单轮鲁棒性能够预测证据跨轮累积时的鲁棒性。我们证明这一假设根本错误。模型存在监控-控制差距:它们容易承认矛盾证据,但这种意识无法约束最终建议——检测认知冲突并不意味着安全解决它。通过跨四个模型家族(1.5B-32B参数)和超过50,000次轮次级评估的多轮文档累积协议,我们证明单轮诊断系统性地高估了RAG安全性,矛盾承认与安全解决不相关(这一模式得到针对性人工验证的证实),并且不存在通用的提示修复方法。汇聚的机制证据——隐藏状态探测、注意力分析和响应策略分类——指向行动选择作为最可能的缺陷所在:危险相关信息被内部表示并在不安全生成期间获得增强的注意力,但未能约束输出行为。在检索增强系统可被信任用于高风险场景之前,必须测量并弥合模型识别与行动之间的差距。
Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.
LitSeg: 面向文学RAG的叙事感知文档分割
Ruikang Zhang, Zhanni Chen, Yiqiao Cai, Qi Su
AI总结 提出LitSeg,一种基于叙事理论引导的文档分割框架,通过多阶段提示提取事件、梳理叙事线索并定位转折点,以解决现有分割方法忽视文学叙事结构导致检索与生成性能下降的问题,并引入轻量版LitSeg-Lite通过数据蒸馏降低计算开销。
检索增强生成(RAG)通过引入外部知识增强了大型语言模型(LLMs),特别是在文学作品等长尾领域。然而,RAG中关键的文档分割步骤仍未得到充分探索。现有策略通常语义盲目,忽视了文学作品复杂的叙事结构,常常导致情节碎片化和指代不清,严重阻碍了检索和生成性能。为了解决这一问题,我们提出了LitSeg,一种新颖的叙事理论引导的分割框架。通过采用多阶段提示,LitSeg明确提取有效事件,梳理叙事线索,阐明叙事结构,并定位转折点以指导分割。为了减轻大规模模型多阶段推理的计算开销,我们进一步引入了LitSeg-Lite,一种轻量级的单遍分块器,通过两阶段训练策略在LitSeg生成的数据上进行微调,将复杂过程蒸馏为单次推理。大量实验表明,通过结构独立的文本块,我们的方法在检索准确性和上下文相关性上显著优于基线,最终提升了下游问答性能,而消融研究验证了叙事学指导和数据蒸馏的有效性。
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.
Touch-R1:在多模态大语言模型中强化触觉推理
Yingxin Lai, Yafei Zhou, Fucai Zhu, Siyu Zhu, Weihao Yuan
AI总结 针对触觉推理中物理属性序数性和跨传感器分布偏移的挑战,提出基于触觉接地GRPO目标训练的Touch-R1模型,在TouchReason-Bench上平均性能超过Octopi-13B和GPT-4o。
虽然基于规则的强化学习最近在多模态模型中催化了显式推理,但触觉推理仍然很大程度上未被探索。现有的触觉语言模型主要依赖于监督或对比目标,这限制了它们将预测基于物理证据或纠正误导性视觉先验的能力。触觉推理引入了两个模态特定的挑战:物理属性(如硬度、粗糙度)的序数性质,以及光学触觉硬件固有的跨传感器分布偏移。在这项工作中,我们引入了TouchReason-1M,一个大规模多模态数据集,包含来自四个不同传感器的超过100万同步触觉对,以及TouchReason-Bench,一个用于评估触觉感知和视觉-触觉冲突解决的严格框架。在此基础上,我们提出了Touch-R1,一个基于Qwen2.5-VL-7B的触觉推理多模态大语言模型。Touch-R1通过一个触觉接地的GRPO目标进行训练,该目标结合了序数感知准确性、跨传感器物理一致性、结构化格式控制以及输入侧触觉接地目标。具体来说,触觉使用奖励仅在真实触觉输入相对于去除、打乱或噪声掩蔽触觉流的反事实控制产生更优正确性时赋予信用。在TouchReason-Bench上,Touch-R1-7B平均优于Octopi-13B 18.4%和GPT-4o 24.7%。其结构化推理轨迹揭示了探测、比较和修正的涌现行为,表明R1风格的推理可以有效地基于物理接触。
While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4\% and GPT-4o by 24.7\% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
Chaos-SSL:基于混沌变换的注意力自监督学习框架用于医学图像分类
Joao Batista Florindo
AI总结 提出Chaos-SSL框架,利用一维混沌映射作为非线性数据增强进行自监督预训练,并结合注意力融合模型,在皮肤病变和糖尿病视网膜病变分类上达到与最先进方法竞争的性能。
自监督学习(SSL)已成为缓解对大规模标注数据集依赖的强大范式,这是医学图像分析中的常见瓶颈。然而,依赖简单几何和颜色增强的标准SSL方法可能无法捕捉到分类细微病理所需的细粒度、复杂纹理细节。本文介绍了Chaos-SSL,一种新颖的两阶段医学图像分类框架。在第一阶段,我们提出了一种新的自监督预训练策略,利用一维混沌映射(Logistic、Tent和Sine)作为对比学习的复杂非线性增强。我们假设这些混沌变换创建了“更难”且语义更丰富的视图,迫使网络学习细粒度医学纹理的鲁棒表示。在第二阶段,我们引入了一种基于注意力的融合模型,该模型动态地将来自Chaos-SSL模型的专门特征与来自更大的ImageNet预训练模型的通用特征相结合。我们在两个公共数据集上验证了我们的方法:ISIC 2018(皮肤病变)和APTOS 2019(糖尿病视网膜病变)。我们的结果表明,使用Tent映射预训练30个epoch的Chaos-SSL模型,随后进行注意力融合,其性能与最先进方法完全竞争,在ISIC 2018上达到0.9261的准确率,在APTOS 2019上达到0.8726的准确率。这显著优于现有的SSL方法,包括几种最新方法。
Self-Supervised Learning (SSL) has emerged as a powerful paradigm to mitigate the reliance on large, annotated datasets, a common bottleneck in medical image analysis. However, standard SSL methods, which rely on simple geometric and color augmentations, may fail to capture the fine-grained, complex textural details necessary for classifying subtle pathologies. This paper introduces Chaos-SSL, a novel two-stage framework for medical image classification. In the first stage, we propose a new self-supervised pre-training strategy that leverages 1D chaotic maps (Logistic, Tent, and Sine) as a complex, non-linear augmentation for contrastive learning. We hypothesize that these chaotic transformations create ``harder'' and more semantically-rich views, forcing a network to learn robust representations of fine-grained medical textures. In the second stage, we introduce an attention-based fusion model that dynamically combines the specialized features from our Chaos-SSL model with the general-purpose features of a larger, ImageNet-pre-trained model. We validate our method on two public datasets: ISIC 2018 (skin lesions) and APTOS 2019 (diabetic retinopathy). Our results demonstrate that the Chaos-SSL model pre-trained with a Tent map for 30 epochs, followed by attention fusion, achieves performance fully competitive with the state-of-the-art, yielding an accuracy of 0.9261 on ISIC 2018 and 0.8726 on APTOS 2019. This significantly outperforms existing SSL methods, including several recent approaches.
图像是否也值得16x16=256个超像素?一个用于注意力图像分类的框架
Pedro Henrique da Costa Avelar, Anderson R. Tavares, Luís C. Lamb
AI总结 提出超像素变换器(SPT)框架,统一超像素图像分类与视觉变换器,通过多维正弦余弦位置编码和增强的补丁数据结构,在多个数据集上优于超像素图神经网络方法,与视觉变换器竞争。
基于超像素的图像分类传统上利用图神经网络(GNN)处理不规则图像表示。计算机视觉的最新进展,由视觉变换器(ViT)驱动,引入了自注意力模型的新范式,在各种任务中超越了卷积神经网络(CNN)。然而,GNN、超像素和变换器之间的协同联系仍未探索。在这项工作中,我们提出了超像素变换器(SPT),这是一个统一超像素图像分类和ViT的新框架。SPT将超像素图像分类与图注意力网络(SICGAT)模型和ViT泛化,以支持任意超像素分块策略、连接图和位置编码。我们引入了改进,包括多维正弦余弦位置编码和完全包含超像素形状和颜色信息的增强补丁数据结构。通过在CIFAR10、FashionMNIST和Imagenette等数据集上测试SPT,采用各种超像素生成和图连接策略,我们证明SPT相比以前的超像素GNN方法实现了优越的性能,并与ViT保持竞争力。值得注意的是,我们的方法解决了SICGAT的局限性,例如像素聚合过程中的信息丢失,并展示了受限图连接如何增强ViT性能。SPT弥合了基于超像素和变换器模型之间的差距,为跨领域泛化和混合注意力框架的未来创新开辟了道路,并表明图像也值得$16\times16$个超像素。
Superpixel-based image classification has traditionally leveraged graph neural networks (GNNs) for processing irregular image representations. Recent advances in computer vision, driven by Vision Transformers (ViTs), have introduced new paradigms in self-attentional models, surpassing convolutional neural networks (CNNs) in various tasks. However, a synergistic connection between GNNs, superpixels, and transformers remains unexplored. In this work, we propose Superpixel Transformers (SPT), a novel framework that unifies superpixel-based image classification and ViTs. SPT generalizes the Superpixel Image Classification with Graph Attention Networks (SICGAT) model and ViT to support arbitrary superpixel-based chunking strategies, connectivity graphs, and positional encodings. We introduce refinements including a multidimensional sine-cosine positional encoding and an enriched patch data structure that fully incorporates superpixel shape and color information. By testing SPT across datasets such as CIFAR10, FashionMNIST, and Imagenette, with various superpixel generation and graph connectivity strategies, we demonstrate that SPT achieves superior performance compared to previous superpixel-based GNN methods and remains competitive with ViTs. Notably, our approach addresses the limitations of SICGAT, such as information loss during pixel aggregation, and shows how constrained graph connectivity can enhance ViT performance. SPT bridges the gap between superpixel-based and transformer models, opening avenues for cross-domain generalization and future innovations in hybrid attentional frameworks, and showing that an image can also be worth $16\times16$ superpixels.
VitaBench 2.0:评估长期用户交互中的个性化与主动型代理
Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
AI总结 针对现有代理基准忽视用户偏好推断与利用的问题,提出VitaBench 2.0基准,通过时间序列任务和可扩展记忆接口评估代理在长期交互中的个性化与主动性,实验表明最先进模型仍面临挑战。
大型语言模型已演变为交互式代理,与用户在现实任务中协作。在这种设置下,有效协作越来越依赖于理解用户未明确表达的内容,因为用户意图往往反映在碎片化的日常交互中,需要个性化建模和主动交互。然而,现有的代理基准主要评估推理和工具使用,在很大程度上忽视了在现实场景中推断和利用用户偏好的挑战。为解决这一差距,我们引入了VitaBench 2.0,这是一个用于评估长期用户交互中个性化与主动代理行为的基准。在VitaBench 2.0中,任务被组织为单个用户的时间顺序序列,其中偏好嵌入在碎片化和异构的交互中。成功完成任务要求代理从这些交互中持续提取、利用和更新用户偏好。我们进一步通过要求代理识别缺失信息并在决策前主动从用户或环境中获取信息的任务来评估主动性。为了支持系统分析,我们提供了一个可扩展的记忆接口,使得不同记忆架构之间的受控比较成为可能。我们对一系列前沿专有和开源LLM进行了基准测试。结果表明,即使对于最先进的模型,现实世界的个性化仍然极具挑战性,揭示了当前能力与实际需求之间的巨大差距。广泛的分析进一步揭示了当前代理在现实世界个性化决策中的失败模式和能力瓶颈,为未来的模型改进提供了见解。
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.
StepOPSD: 面向智能体强化学习的步骤感知在线偏好蒸馏
Yanfei Zhang, Xu Lin, Chenglin Wu
AI总结 提出StepOPSD框架,以智能体步骤为信用分配单元,通过事后增强教师上下文重新评分步骤段,并在GRPO更新前进行归一化每步信用预算的优势塑造,解决多轮智能体强化学习中的信用分配不匹配问题。
多轮智能体的强化学习存在信用分配不匹配问题:奖励稀疏且基于轨迹,而成功往往取决于少数局部决策。现有的在线策略蒸馏(OPD)提供了更密集的令牌级监督,但通常将异质的智能体轨迹视为整体字符串而非因果交互单元。我们提出StepOPSD,一种事后回放偏好自蒸馏框架,以智能体步骤作为信用重分配的单位。StepOPSD将轨迹分解为以动作中心的步骤段,在事后增强的教师上下文中重新评分,并将令牌级对数概率差距转化为符号保持的优势塑造,在GRPO更新前进行归一化的每步信用预算。在ALFWorld和Search-QA上使用Qwen3-1.7B和Qwen2.5-3B-Instruct的实验中,StepOPSD在对局部因果错误最敏感的子集上取得了最佳或次佳结果,包括ALFWorld Heat(79.1%)、PickTwo(95.0%)、Search-QA TriviaQA(61.6%)的第一名,以及HotpotQA(40.4%)的并列最佳。结果进一步揭示了一致的双旋钮定律:较小的α_clip作为广泛稳定的局部信任区域,而最优全局混合强度λ_mix依赖于任务。这些发现表明,当轨迹级奖励与决定下游成功的局部动作弱对齐时,步骤感知蒸馏最为有用。
Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.
无监督深度图像先验用于稀疏视角和有限角度电子断层扫描
Serge Brosset, Daniel del Pozo Bueno, Thomas David, Laure Guetaz, Philippe Ciuciu, Zineb Saghi
AI总结 提出无监督深度图像先验方法,在稀疏视角和有限角度条件下实现与监督方法相当的电子断层重建性能,并应用于实验数据验证其可靠性。
电子断层扫描(ET)在纳米材料的三维(3D)表征中发挥着重要作用。然而,在有限角度和稀疏视角条件下,传统算法会产生退化的重建结果,影响所得3D数据的质量和可解释性。本文提出深度图像先验(DIP),一种无监督的深度学习(DL)方法,用于高度退化的断层扫描采集,并通过模拟数据证明,即使在倾斜范围仅为60°、倾斜步长为10°的情况下,其性能也与需要训练数据集的监督方法相当。然后,我们将其应用于实验数据,并表明它在稀疏视角和有限角度条件下都能实现可靠的3D量化,突显了其在广泛材料和采集模式中的潜力。
Electron tomography (ET) plays an important role in the three-dimensional (3D) characterization of nanomaterials. However, under limited-angle and sparse-view conditions, conventional algorithms produce degraded reconstructions, which compromise the quality and interpretability of resulting 3D data. In this paper, we present deep image prior (DIP), an unsupervised deep learning (DL) approach, for highly degraded tomography acquisitions and demonstrate, using simulated data, that its performance is comparable to that of supervised approaches requiring training datasets, even for tilt ranges as limited as 60° and tilt increments of 10°. We then apply it to experimental data and show that it enables reliable 3D quantification under both sparse-view and limited-angle conditions, highlighting its potential for a wide range of materials and acquisition modalities.
ICCU: 通过模式诱导拒绝规则进行上下文持续遗忘
Ruihao Pan, Suhang Wang
AI总结 提出ICCU框架,通过从遗忘数据中诱导可读拒绝规则并在推理时应用,无需修改模型参数,实现高效、无干扰的持续机器遗忘。
机器遗忘旨在从训练好的语言模型中移除特定数据的影响。在实际部署中,遗忘请求通常顺序到达,这对现有的基于微调的方法提出了挑战:对每个请求进行微调成本高昂、累积效用损失,并可能导致跨请求干扰。为了解决这些问题,我们提出了ICCU(上下文持续遗忘),一种上下文持续遗忘框架,它从遗忘数据集中诱导出可读的拒绝规则,并在推理时作为过滤器或通过系统提示应用,而不修改模型参数。由于规则作为与顺序无关的并集累积,ICCU是组合的且无跨请求干扰,并且原始遗忘集数据可以在规则诱导后丢弃。大量实验表明,ICCU有效抑制目标知识同时保持效用,可扩展到顺序请求,并且对释义和跨语言查询保持鲁棒性。
Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning requests often arrive sequentially, which challenges existing fine-tuning-based methods: fine-tuning each request is costly, accumulates utility loss, and may cause cross-request interference. To address these issues, we propose ICCU (In-Context Continual Unlearning), an in-context continual unlearning framework that induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Extensive experiments show that ICCU effectively suppresses target knowledge while preserving utility, scales across sequential requests, and remains robust to paraphrased and cross-lingual queries.
利用视觉信号实现视觉-语言生成中鲁棒的词元级不确定性
Joseph Hoche, David Brellmann, Gianni Franchi
AI总结 针对大型视觉语言模型不确定性量化中视觉信息利用不足的问题,提出基于视觉锚定的词元级不确定性量化框架VIG-TUQ,通过加权语言不确定性与视觉锚定分数,无需训练即可提升不确定性估计性能。
不确定性量化(UQ)对于大型视觉语言模型(LVLMs)的可靠预测和实际部署仍然是一个关键挑战。然而,现有方法大多源自LLM文献,主要关注语言模态,而视觉信息对LVLM不确定性的贡献在很大程度上未被探索。在本文中,我们研究了LVLMs如何处理视觉信息,以及这一过程是否可用于改进不确定性估计。通过分析生成过程中视觉特征整合后的隐藏表示,我们观察到高置信度预测比不确定预测更依赖于视觉内容。基于这一发现,我们提出了视觉锚定语元级UQ(VIG-TUQ),这是一个无需训练的框架,通过用视觉锚定分数加权词元级语言不确定性,将视觉锚定显式纳入不确定性估计。我们在多个数据集和不同的LVLM架构(包括早期融合、晚期融合和原生融合模型)上评估了VIG-TUQ。结果表明,我们的方法通常优于现有的词元级不确定性方法。代码和数据将在接收后公开。
Uncertainty quantification (UQ) remains a critical challenge in Large Vision Language Models (LVLMs) for reliable predictions and real-world deployment. However, most existing methods are adapted from the LLM literature and primarily focus on the language modality, leaving the contribution of visual information to LVLM uncertainty largely underexplored. In this paper, we investigate how LVLMs process visual information and whether this process can be used to improve uncertainty estimation. By analyzing hidden representations after the integration of visual features during the generation process, we observe that high-confidence predictions rely more heavily on visual content than uncertain ones. Building on this insight, we propose Visual-Grounded Token UQ (VIG-TUQ), a training-free framework that explicitly incorporates visual grounding into uncertainty estimation by weighting token-level language uncertainty with visual grounding scores. We evaluate VIG-TUQ on multiple datasets and across diverse LVLM architectures, including early-fusion, late-fusion, and native-fusion models. Results indicate that our method often improves upon existing token-level uncertainty approaches. Code and data will be made available upon acceptance.
现代事后水印方法能否击败断箭?
Enoal Gesny, Eva Giboulot
AI总结 本文通过公平比较现代与经典事后水印方法在多种攻击下的鲁棒性和安全性,发现经典方法在现实场景中更优。
随着扩散模型等生成模型的快速普及,数字水印已成为识别AI生成图像的关键解决方案。现代事后水印方案利用神经网络实现极低的误报率,同时对常见图像变换保持鲁棒性。然而,这些现代方法与经典方法之间缺乏比较,特别是在鲁棒性和安全性优先于极低误报概率的现实场景中。本文提出了现代与经典事后水印在多种经典增强和近期复杂攻击下的鲁棒性和安全性的公平比较。实验表明,在现实场景中,经典水印在保持鲁棒性的同时,在安全性方面优于现代技术。
With the rapid proliferation of generative models, such as diffusion models, digital watermarking has emerged as a crucial solution for identifying AI-generated images. Modern post-hoc watermarking schemes use neural networks to achieve an extremely low false-alarm rate while remaining robust to common image transformations. However, there is a lack of comparison between these modern methods and classic ones, particularly in real-world scenarios where robustness and security take precedence over achieving an extremely low false-alarm probability. In this paper, we propose a fair comparison of robustness and security between modern and classic post-hoc watermarking across various types of classic augmentations and recent sophisticated attacks. Our experiments show that, in a realistic scenario, classic watermarking outperforms modern techniques in terms of security while maintaining robustness.
面向移动GUI导航的视觉语言模型:缩放、基准测试与推理
Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu, Jian Luan
AI总结 本文系统研究了视觉语言模型在移动GUI导航中的数据缩放、基准测试与推理,提出了大规模数据集HyperTrack和开源工具包GUIEvalKit,并发现基于强化学习的微调优于监督微调,尤其在域外场景中表现更佳。
视觉语言模型(VLM)在移动GUI导航方面取得了快速进展。本文针对该领域中基于VLM的智能体,系统研究了数据缩放、基准测试和推理。为了促进严格评估,我们引入了HyperTrack,这是一个大规模数据集,包含超过650个中国移动应用程序的16000多个真实世界任务,以及GUIEvalKit,一个用于在离线GUI导航任务上统一基准测试VLM的开源工具包。利用HyperTrack,我们分析了训练数据规模对监督微调和基于强化学习的微调的影响。我们的结果表明,基于强化学习的微调始终优于监督微调,特别是在域外设置中,突出了数据缩放与强化学习之间的协同作用。借助GUIEvalKit,我们进一步对最先进的VLM进行了基准测试,并分析了交互历史和推理能力如何影响任务完成。HyperTrack和GUIEvalKit共同为在移动GUI导航任务中开发和评估VLM智能体提供了一个全面的平台。
Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.
基本前向-后向分裂诱导网络的深层极限与稳定性分析(II):学习问题
Xuan Lin, Chunlin Wu
AI总结 本文研究基本前向-后向分裂(FBS)诱导网络的训练问题,证明其收敛到深层极限系统的学习问题,并给出扰动稳定性分析。
源自迭代优化方案和数值常/偏微分方程(ODE/PDE)的深度展开神经网络在过去十年中引起了数据科学界的广泛关注。其中,许多重要的网络架构是从基本的前向-后向分裂(FBS)算法构建的。在本文中,我们继续研究最基本的FBS诱导网络,该网络通过引入直接参数松弛从原始FBS算法展开。基于我们先前前向系统分析中的差分/微分包含公式,我们在此考虑相应学习问题的一些理论方面。在一些温和假设下,我们建立了基本FBS诱导网络的训练问题收敛到深层极限系统的学习问题的一般收敛性质,这意味着一个$\Gamma$-收敛论证,表明网络最优学习参数的任意聚点是深层极限系统学习问题的解。还对这些学习问题的扰动稳定性进行了定性分析。进行了一个简单的数值实验以验证我们的主要一般收敛结果。
Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs) have attracted much attention in data science over the last decade. Therein, numerous important network architectures were constructed from the basic forward-backward-splitting (FBS) algorithm. In this paper, we continue our research on the most basic FBS-induced network, an architecture unrolled from the original FBS algorithm by incorporating direct parameter relaxations. Following the difference/differential inclusion formulations in our previous forward system analyses, we here consider some theoretical aspects of corresponding learning problems. Under some mild assumptions, we establish a general convergence property of the training problem of the basic FBS-induced network to the learning problem of the deep-layer limit system, implying a $Γ$-convergence argument showing that any cluster point of the optimal learning parameters for the network is a solution to the learning problem of the deep-layer limit system. A qualitative analysis of perturbation stabilities of these learning problems is also presented. A simple numerical experiment is conducted to validate our main general convergence result.
图像阈值化:理解评估指标对特定评估函数的偏差
Eslam Hegazy, Mohamed Gabr
AI总结 本文通过分析BSDS500数据集上所有可能阈值的阈值化目标函数与质量指标的相关性,揭示了Otsu准则与SSIM和PSNR的高相关性,以及Kapur熵的弱相关性,表明存在固有的指标-目标函数偏差。
多级图像阈值化广泛应用于从医学成像到遥感的分割任务中。经典的目标函数,如Otsu的类间方差和Kapur的熵,通常通过元启发式算法进行优化,并使用结构相似性指数(SSIM)和峰值信噪比(PSNR)等指标评估性能。这些评估隐含地假设SSIM和PSNR提供了分割质量的无偏度量。在本研究中,我们通过分析BSDS500数据集中所有可能阈值下阈值化目标函数与质量指标之间的相关性来检验这一假设。结果表明,Otsu准则始终与SSIM和PSNR表现出高相关性,而Kapur熵的相关性较弱且变化较大。Otsu在所有图像上与PSNR的相关性优于Kapur,在超过91%的图像上与SSIM的相关性也优于Kapur。我们的发现揭示了一种固有的指标-目标函数偏差。这项工作强调了需要更中立的评估框架,并激励将分析扩展到其他阈值化准则和领域。本文的源代码可在https://w3id.org/met-dp/icpr26-95找到。
Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu's between-class variance and Kapur's entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu's criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur's entropy demonstrates weaker and more variable correlation. Otsu outperforms Kapur in correlation with PSNR for all images and with SSIM for over 91%. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. Source code of this paper can be found at https://w3id.org/met-dp/icpr26-95
超越数据网格幻象:设计现代AI增强型湖仓以弥合理论与实践差距
Oliver Angélil, Jan Migon
AI总结 针对企业数据平台中领域自服务与整体治理之间的张力,提出一种基于现代湖仓架构的AI增强型中心辐射模型,通过中心卓越中心提供共享服务与AI治理,领域团队逐步承担更多责任,以平衡灵活性与控制,并通过数据产品采纳率、查找时间和洞察时间三个指标评估架构效果。
企业数据平台面临着领域自服务与整体治理之间的持久张力。数据网格范式提出了去中心化的领域所有权作为解决方案,但纯粹的实现往往效果不佳:团队在没有足够的平台成熟度、工具或协调机制的情况下继承了新的责任。本文认为,通过在现代湖仓架构上叠加AI增强的中心辐射模型,可以缓解灵活性与控制之间的权衡。中心枢纽(卓越中心)提供共享平台服务、策略自动化和AI驱动的治理,自动标准化数据产品、生成质量规则、起草数据合约并审查变更以检测回归。领域辐条拥有业务语义、产品积压和本地迭代节奏,随着成熟度提高逐步承担更多责任。执行治理任务的同一LLM也降低了领域从业者发展跨业务和数据工程的真正跨职能专业知识的门槛,使辐条团队能够承担更大的端到端所有权,而无需按比例增加对中心的依赖。自然语言对话界面进一步为业务用户民主化访问,释放了历史上未充分利用的企业数据。在组织方面,我们提出了一个分阶段框架,将所有权从中心转移到辐条,避免了集中式瓶颈和不协调的去中心化。我们通过三个结果指标评估架构:数据产品采纳率、查找时间和洞察时间,这些指标将平台成功与可衡量的业务价值而非内部活动联系起来。
Enterprise data platforms face an enduring tension between domain self-service and holistic governance. The data mesh paradigm proposed decentralized domain ownership as a remedy, but pure implementations frequently underdeliver: teams inherit new responsibilities without the platform maturity, tooling, or coordination mechanisms needed to exercise them effectively. This paper argues that the flexibility-versus-control trade-off can be relaxed through an AI-augmented hub-and-spoke model layered on a modern lakehouse architecture. A central hub (Center of Excellence) provides shared platform services, policy automation, and AI-enabled governance, automatically standardizing data products, generating quality rules, drafting data contracts, and reviewing changes for regressions. Domain spokes own business semantics, product backlogs, and local iteration cadence, progressively assuming greater responsibility as they mature. The same LLMs that automate governance tasks also lower the barrier for domain practitioners to develop genuine cross-functional expertise spanning business and data engineering, enabling spoke teams to take on greater end-to-end ownership without proportionally increasing their dependence on the hub. Natural-language conversational interfaces further democratize access for business users, exposing historically underutilized enterprise data. On the organizational side, we propose a staged framework that shifts ownership from hub to spokes, avoiding both centralized bottlenecks and uncoordinated decentralization. We evaluate the architecture through three outcome metrics: data product adoption, time-to-find, and time-to-insight, that tie platform success to measurable business value rather than internal activity.
DEI:质量-多样性搜索中的进化推理多样性
John Donaghy, Shikhar Rastogi
AI总结 提出DEI框架,通过异构大语言模型作为变异算子进行分布式质量-多样性搜索,实验表明模型多样性比并行性更能提升搜索性能。
我们提出DEI:进化推理中的多样性,一个分布式质量-多样性(QD)搜索框架,该框架将异构大语言模型(LLM)分配为变异算子,在通过非阻塞集合操作通信的对等节点间运行。与同质并行搜索(在所有工作节点上复制单一模型的归纳偏差)不同,DEI将每个LLM独特的创造性先验视为行为新颖性的互补来源。通过DEI扩展数字红皇后框架,节点在每轮结束时共享局部最优解,以播种下一轮的种群。这产生了跨模型的对抗压力,推动了超越模型内自博弈的鲁棒性。在Core War领域(一个竞争性编程基准,其中Redcode战士程序在模拟机器中战斗)上评估,一个四节点异构集成(GPT-5.4-mini、Claude Sonnet 4.6、GPT-5.2和Claude Haiku 4.5)在相等的总LLM调用预算下,相比单节点基线,实现了124%更高的合并存档QD分数(45.90 vs. 20.46)和28%更高的覆盖率(80.6% vs. 63.0%的单元格)。异构集成还在QD分数、覆盖率和所有四个模型家族的保留解泛化性上优于同等预算的同质集成。这些结果首次提供了经验证据,表明模型多样性(而非仅仅是并行性)是分布式基于LLM的QD搜索中增益的关键驱动因素。
We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model's inductive biases across all workers, DEI treats each LLM's distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round's population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves 124 percent higher merged-archive QD-Score (45.90 vs. 20.46) and 28 percent higher coverage (80.6 percent vs. 63.0 percent of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.