IRIS: time-structured manifold projections
Brian Ondov, Chia-Hsuan Chang, Weipeng Zhou, Xingjian Zhang, Xueqing Peng, Yutong Xie, Huan He, Qiaozhu Mei, Hua Xu
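AI summary: Proposes IRIS, an algorithm that combines time ordering with manifold topology to address the inability of t-SNE and UMAP to convey temporal dynamics; it applies to visualizing dynamic biomedical data such as scRNA-seq and comparative metagenomics.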
High-dimensional biomedical data, such as cell-by-gene matrices, are increasingly generated over time. However, manifold learning algorithms such as t-SNE and UMAP cannot incorporate time-ordering in their layouts, which obscures the dynamics of cell types or other classes. As a solution, we present IRIS, a new manifold learning algorithm that structures layouts both chronologically and by manifold topology. IRIS can visualize a wide range of dynamic biomedical data, including scRNA-seq, comparative metagenomics, and literature.
Differentially Private Preference Data Synthesis for Large Language Model Alignment
Fengyu Gao, Jing Yang
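AI summary: Proposes DPPrefSyn, an algorithm that builds on the Bradley-Terry preference model and DP-PCA to generate differentially private synthetic preference data for privacy-preserving preference alignment.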
Preference alignment is a crucial post-training step for large language models (LLMs) to ensure their outputs align with human values. However, post-training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy-preserving preference alignment. DPPrefSyn is a principled framework grounded in the Bradley-Terry preference model and the intrinsic geometric structure of pairwise human preference data. It first learns an underlying preference model from private data with formal differential privacy guarantees, and then leverages the learned model together with public prompts to synthesize high-quality preference data. It exploits the shared linear structure of per-cluster reward models to effectively capture heterogeneous human preferences in private datasets, and leverages DP Principal Component Analysis (DP-PCA) to improve learning accuracy. Extensive experimental results demonstrate that DPPrefSyn achieves competitive alignment performance under strong DP guarantees. These findings highlight the potential of synthetic preference data as a practical alternative for privacy-preserving preference alignment across a broad range of applications. To the best of our knowledge, this is the first work to generate DP synthetic preference data for LLM alignment. Our code is available at https://github.com/gfengyu/Differentially-Private-Preference-Data-Synthesis.
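The Bradley-Terry preference model named in the abstract can be illustrated with a minimal sketch. It only shows the standard model (the probability that one response is preferred is the logistic function of the reward difference) and its negative log-likelihood; it is not DPPrefSyn itself and adds no differential-privacy noise. The function names and the scalar rewards are illustrative assumptions.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def bt_negative_log_likelihood(pairs) -> float:
    """Average negative log-likelihood over (reward_chosen, reward_rejected) pairs.

    Each pair holds hypothetical scalar rewards for the preferred and the
    rejected response; a reward model would be trained to minimize this value.
    """
    return -sum(math.log(bt_preference_prob(c, r)) for c, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Equal rewards give an even chance; a larger gap favors the first response.
    print(bt_preference_prob(0.0, 0.0))
    print(bt_preference_prob(2.0, 0.0))
    print(bt_negative_log_likelihood([(1.5, 0.2), (0.3, 0.1)]))
```

A larger reward gap between the chosen and the rejected response lowers the loss, which is what lets pairwise human judgments train a reward model.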
Conformal Reliability: A New Evaluation Metric for Conditional Generation
Yachen Gao, Xinwei Sun, Yikai Wang, Ye Shi, Jingya Wang, Jianfeng Feng, Yanwei Fu
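AI summary: Proposes a reliability score based on conformal prediction as a new evaluation metric for conditional generative models, and develops the CReL framework to compute it efficiently; experiments demonstrate its validity and interpretability.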
Conditional generative models have recently achieved remarkable success in various applications. However, a suitable metric for evaluating the reliability of these models, which takes into account their inherent uncertainty, is still lacking. Existing metrics, which typically assess a single output, may fail to capture the variability or potential risks in generation. In this paper, we propose a novel evaluation metric called reliability score based on conformal prediction, which measures the worst-case performance within the prediction set at a pre-specified confidence level. However, computing this score is challenging due to the high-dimensional nature of the output space and the nonconvexity of both the metric function and the prediction set. To efficiently compute this score, we introduce Conformal ReLiability (CReL), a framework that can (i) construct the prediction set with desired coverage; and (ii) accurately optimize the reliability score within the constructed prediction set. We provide theoretical results on coverage and demonstrate empirically that our method produces more informative prediction sets than existing approaches. Experiments on synthetic data and the image-to-text and text-to-image tasks further demonstrate the interpretability of our new metric, and the validity and effectiveness of our computational framework. Source code can be found at https://ggc29.github.io/CReL/.
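The idea of a worst-case score inside a conformal prediction set can be sketched on a toy problem. The sketch below uses standard split conformal prediction to pick a threshold from calibration scores, then takes the minimum quality over a small finite list of candidate outputs. The finite candidate list, the function names, and the dictionary fields are assumptions made for illustration; the abstract describes a high-dimensional, nonconvex setting that CReL handles by optimization, which this sketch does not attempt.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested level.
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def reliability_score(candidates, nonconformity, quality, threshold):
    """Worst-case quality among candidates that fall inside the prediction set."""
    prediction_set = [c for c in candidates if nonconformity(c) <= threshold]
    if not prediction_set:
        return None
    return min(quality(c) for c in prediction_set)


if __name__ == "__main__":
    calibration = [5, 1, 4, 2, 3, 9, 8, 7, 6]  # hypothetical nonconformity scores
    tau = conformal_threshold(calibration, alpha=0.5)
    outputs = [
        {"score": 1, "quality": 0.9},
        {"score": 4, "quality": 0.6},
        {"score": 7, "quality": 0.1},
    ]
    print(tau, reliability_score(outputs, lambda c: c["score"], lambda c: c["quality"], tau))
```

Lowering alpha widens the prediction set, so the worst-case quality can only stay the same or drop; that is the sense in which the score is tied to a pre-specified confidence level.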
Anchoring LLM Gender Bias to Human Baselines: A Cross-Lingual Audit
Jiwoo Choi, Seonwoo Ahn, Tongxin Zhang, Seohyon Jung
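AI summary: Using the HEXACO-100 personality inventory, audits six large language models for gender stereotyping across English, Korean, Chinese, and Japanese; finds that the range of their bias is about 2.5 times the human cross-country range, and introduces a four-pattern framework (concordance, suppression, reorganization, amplification) to describe cross-lingual behavior.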
We audit six large language models (LLMs) for gender stereotyping across English, Korean, Chinese, and Japanese. Three were developed primarily for English-language use (Claude, GPT, Gemini) and three for East Asian use (DeepSeek, Syn-Pro, HyperCLOVA X). We adopt the HEXACO-100 personality inventory and anchor each model against a cross-cultural human dataset spanning 48 countries to ask not whether LLMs are biased, but how far their gender attributions drift from the populations they are deployed among. Our findings show that their stereotyping spans a range roughly 2.5 times wider than the entire cross-country range found in humans, and the effect can compound across languages. One English-centric model, prompted in Korean, reached 5 times the local baseline, even when the prompt stated the candidate had already been hired, which often dampens human stereotyping. To characterize such behaviors without ranking them, we introduce a four-pattern framework -- concordance, suppression, reorganization, and amplification -- across 24 (model x language) cells. Item-level analysis reveals that translation does not just rescale stereotypes but also changes the attributes tied to them, hiding significant rearrangement under a surface that appears well-calibrated. Our results ultimately suggest that no single debiasing pipeline is likely to address bias evenly across linguistic boundaries.
PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
Swastik Roy, Rajkumar Pujari, Tharindu Kumarage, Charith Peris, Rahul Gupta, Anna Rumshisky, Pradeep Natarajan, Venkatesh Saligrama
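AI summary: Proposes PReMISE, a framework that discovers a policy-level rubric set from human preference data and audits rubrics along four axes (structural adequacy, reliability, preference fit, adversarial robustness); two repair operations, preference-rank selection and reliability-constrained refinement, raise judge accuracy and reduce exploitability.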
LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be "helpful and factual" can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources, no raw source is simultaneously reliable, preference-predictive, and adversarially robust, and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from 65.0% to 68.6%, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from 46.4% to 36.0% with little change in inter-judge agreement (α = .531 → .519).
Design and Evaluation of a Multi-Agent AI Oracle System for Prediction Market Resolution
Tarun Kota
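AI summary: Designs and evaluates multi-agent LLM architectures as oracles for prediction market resolution, comparing two mechanisms, independent aggregation and deliberative consensus, against single-model baselines on KalshiBench; independent aggregation with confidence-weighted voting reaches 83.43% accuracy, deliberative consensus degrades performance through error propagation, and routing criteria for hybrid AI-human oracles are proposed.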
Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems trade off fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self-correction mechanism. We evaluate whether multi-agent LLM architectures can improve oracle resolution accuracy over single-model baselines. We compare independent aggregation and deliberative consensus against single-LLM baselines (GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date to isolate reasoning from retrieval quality. Independent aggregation with confidence-weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. Deliberative consensus degrades accuracy to approximately 76 percent, below every single-model baseline, a drop we attribute to error propagation during debate, where confidently wrong models flip correct ones. Error correlations across models (0.529-0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi-agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI-human oracle systems: auto-resolving only unanimous, high-confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter-agent disagreement flagging the remainder for human review.
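Confidence-weighted voting and the routing rule described above can be sketched as follows. The abstract does not give the weighting formula or the confidence cutoff, so summing raw confidences per answer, the `min_confidence=0.9` default, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict


def confidence_weighted_vote(votes):
    """Return the answer with the largest summed confidence.

    `votes` is a list of (answer, confidence) tuples, one per agent.
    Ties go to the answer that appeared first.
    """
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


def route(votes, min_confidence=0.9):
    """Auto-resolve only when all agents agree and each is confident enough.

    The threshold is a hypothetical value, not one reported in the abstract.
    """
    answers = {answer for answer, _ in votes}
    unanimous = len(answers) == 1
    confident = all(confidence >= min_confidence for _, confidence in votes)
    if unanimous and confident:
        return ("auto", next(iter(answers)))
    return ("human_review", None)


if __name__ == "__main__":
    split = [("yes", 0.6), ("no", 0.95), ("yes", 0.5)]
    agreed = [("yes", 0.95), ("yes", 0.92), ("yes", 0.99)]
    print(confidence_weighted_vote(split), route(split))
    print(confidence_weighted_vote(agreed), route(agreed))
```

Because each agent answers independently, one confidently wrong agent can be outvoted but cannot change the other agents' answers, which is the property the deliberative setup loses.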
Feat2Go: Visual-Feature-Grounded Value Estimation for Embodied Reinforcement Learning
Junyang Shu, Zhiwei Lin, Bingqing Wei, Yongtao Wang
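AI summary: Proposes Feat2Go, a framework that extracts patch-level subgoal similarity from a pretrained visual world model and clusters episodes into semantic stages, then trains an embodied value model to predict structural progress and reshape terminal rewards, substantially improving reinforcement learning performance of VLA models on single-arm and bimanual manipulation tasks.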
Reinforcement learning is a promising approach for improving the capabilities of vision-language-action (VLA) models while avoiding the heavy data requirements of imitation learning. However, its effectiveness for VLA models is often constrained by sparse supervision and the difficulty of designing informative reward signals for long-horizon manipulation. In this work, we present Feat2Go, a fine-grained value estimation framework for embodied reinforcement learning. Specifically, Feat2Go first derives a continuous progress target from a pretrained visual world model by measuring patch-level similarity to subgoal states and partitioning episodes into semantic stages with trend-based clustering. We then train an embodied value model to predict this structural progress from the current observation and task instruction, and use the predicted value to reshape terminal rewards during policy optimization. The proposed framework is compatible with existing VLA policy reinforcement learning pipelines, including PPO and GRPO, and does not rely on manual reward engineering. Extensive experiments on ManiSkill3 and RoboTwin 2.0 demonstrate that Feat2Go consistently improves the performance of existing VLA models under both single-arm and bimanual manipulation settings. More specifically, on ManiSkill3, Feat2Go improves OpenVLAOFT from 17.5% to 82.9% average out-of-distribution success while retaining 96.9% in-distribution performance. On RoboTwin 2.0, Feat2Go achieves an average success rate of 88.8% in domain-randomized task settings, outperforming prior reinforcement learning methods.
MechVQA: Benchmarking and Enhancing Multimodal Large Language Models on Comprehensive Mechanical Drawing Understanding
Qian Kou, Xiaofeng Shi, Yulin Li, Xiaosong Qiu, Xinyang Wang, Hua Zhou, Cao Dongxing
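AI summary: To address the weakness of multimodal large language models on mechanical engineering drawings, introduces MechVQA, the first comprehensive mechanical drawing understanding dataset, and develops the MechVL model, whose multi-stage training substantially improves performance.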
Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understanding dataset, MechVQA, created through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3k high-density pictures with 21K question-answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLM understanding on real-world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain-specialized baseline. Extensive experimental results demonstrate that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.
OpenSTBench: Speech Translation Beyond Semantic Evaluation
Yanjie An, Yuxiang Zhao, Yichi Zhang, Qixi Zheng, Yujie Tu, Keqi Deng, Kai Yu, Xie Chen
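AI summary: Proposes OpenSTBench, a unified multidimensional evaluation framework that jointly assesses translation quality, speech quality, temporal consistency, and other aspects of speech translation systems, revealing cross-dimensional differences between systems.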
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.
On the Impact of Retrieved Content Representation on RAG Pipelines
Jonathan J Ross, Bevan Koopman, Anton van der Vegt, Guido Zuccon
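AI summary: Through controlled experiments, studies how different representations of retrieved documents (selection, summarisation, reformulation, and so on) affect RAG generation accuracy, and finds that answer retention is the main determinant.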
Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.
XLGoBench: Detecting Cross-Lingual Skill Gaps with Algorithmic Tasks
Purvam Jain, Preethi Jyothi, Vihari Piratla, Suvrat Raju
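AI summary: Proposes a benchmark of synthetic algorithmic tasks that detects cross-lingual capability gaps in large language models by having them perform the same task in different languages; experiments reveal persistent gaps in several state-of-the-art models.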
We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity, allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.
AbstainGNN: Teaching Graph Neural Networks to Abstain in Graph Classification
Xixun Lin, Zhiheng Zhou, Zhengyin Zhang, Yancheng Chen, Shuai Zhang, Ge Zhang, Shichao Zhu, Lixin Zou, Chuan Zhou, Peng Zhang, Shirui Pan, Yanan Cao
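AI summary: Proposes AbstainGNN, a framework whose theory-driven abstention mechanism lets GNNs decline to predict when uncertain, avoiding incorrect decisions, and which uses PAC-Bayes theory to optimize the trade-off between classification and abstention.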
Graph classification is a core task in graph data mining with widespread real-world applications. Recent advances in graph neural networks (GNNs) have led to substantial performance improvements for graph classification. However, existing GNNs are typically forced to make predictions even under high uncertainty or unknown conditions, resulting in unreliable decisions that can severely impact downstream tasks, particularly in safety-critical scenarios. To address this critical limitation, we propose AbstainGNN, a novel and theory-driven framework for graph classification with abstention, which enables GNNs to reject uncertain predictions instead of producing incorrect decisions. Specifically, AbstainGNN explicitly models both the predictive function and the abstention function, allowing for effective utilization of graph structural information. Moreover, unlike existing heuristic abstention methods, we theoretically characterize the trade-off between classification errors and rejection costs from a PAC-Bayesian generalization perspective, and derive a unified learning objective for model optimization. Guided by this theoretical insight, we further develop an efficient two-stage training strategy consisting of predictive function warm-start and abstention function calibration. Extensive experiments on five benchmark datasets show that AbstainGNN outperforms existing abstention methods, achieving superior classification performance under the same rejection rates.
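The evaluation setting, classification accuracy at a given rejection rate, can be illustrated with a small sketch. Note that it uses a plain confidence threshold on class probabilities, which is the kind of heuristic abstention rule the abstract contrasts AbstainGNN against; the learned abstention function and the PAC-Bayesian objective are not reproduced here. All names and the example probabilities are hypothetical.

```python
def predict_or_abstain(class_probs, threshold):
    """Return the most likely class index, or None when its probability is below the threshold."""
    best = max(range(len(class_probs)), key=class_probs.__getitem__)
    return best if class_probs[best] >= threshold else None


def selective_metrics(prob_rows, labels, threshold):
    """Compute (rejection_rate, accuracy on accepted samples) for one threshold."""
    decisions = [predict_or_abstain(probs, threshold) for probs in prob_rows]
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    rejection_rate = 1 - len(accepted) / len(labels)
    if not accepted:
        return rejection_rate, None
    accuracy = sum(d == y for d, y in accepted) / len(accepted)
    return rejection_rate, accuracy


if __name__ == "__main__":
    probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8], [0.4, 0.6]]  # hypothetical graph-level outputs
    labels = [0, 1, 1, 0]
    print(selective_metrics(probs, labels, threshold=0.0))
    print(selective_metrics(probs, labels, threshold=0.7))
```

Sweeping the threshold traces out an accuracy-versus-rejection curve, which is how two abstention methods can be compared at matched rejection rates.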
Learning Agent-Compatible Context Management for Long-Horizon Tasks
Lu Yi, Runlin Lei, Liuyi Yao, Yuexiang Xie, Yuyang Li, Wenhao Zhang, Zhewei Wei, Yaliang Li, Jian-Yun Nie
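AI summary: Proposes AdaCoM, which trains an external LLM with end-to-end reinforcement learning to manage the context of a frozen agent, improving performance on long-horizon tasks and revealing a fidelity-reliability trade-off.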
LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through agent-side context control or fixed strategies such as summarization. These approaches require training the agent itself to adapt, which is impractical for closed-source agents and ignores that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.
Text-Guided Feature Disentanglement for Cross-Modal Gait Recognition
Zhiyang Lu, Ming Cheng
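AI summary: To bridge the modality gap between LiDAR and RGB cameras, proposes TCFDNet, a network that uses textual priors to guide the disentanglement of modality-shared features; CLIP-based alignment, feature disentanglement, and stability enhancement enable cross-modal gait recognition, with state-of-the-art results on SUSTech1K and FreeGait.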
Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait Recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top-k matched textual descriptions to reconstruct modality-specific representations and derives modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on the SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.
Two-Degree-of-Freedom Vibratory Transport within a Grasp
C. L. Yako, Shenli Yuan, Kenneth Salisbury
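AI summary: Uses asymmetric vibrations to achieve two-degree-of-freedom (DoF) in-hand manipulation of grasped parts; closed-loop position control produces a periodic stick-slip waveform, the effect of waveform parameters on average velocity is analyzed, and the analysis is validated experimentally.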
In this paper, we use asymmetric vibrations to demonstrate two-degree-of-freedom (DoF) in-hand manipulation of grasped parts. The asymmetric vibrations are achieved through closed-loop position control of a moving surface, which applies a periodic stick-slip waveform to the part to be manipulated. We show analytically how two vibratory waveform parameters, the sticking acceleration and the slipping acceleration, affect average part velocity when moving against gravity. The theoretical trends are then validated using an experimental setup where the squeeze force is controlled and part motion is recorded by a high-resolution encoder. We also develop a 2-DoF vibratory surface capable of translation in one direction and rotation about the surface normal. Using two of these 2-DoF surfaces in a parallel jaw gripper configuration, we bidirectionally translate and rotate a variety of grasped parts, and demonstrate that the same waveform trends for translation also persist for in-plane rotation.
Object-Informed Model Predictive Path Integral Control for Non-Prehensile Robot Manipulation
Nikola Raicevic, Bharath Raam Radhakrishnan, Chenbin Yu, Ki Myung Brian Lee, Nikolay Atanasov
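AI summary: Proposes a hierarchical model predictive path integral (MPPI) control framework in which an object-level plan guides robot-level planning, enabling efficient long-horizon planning for non-prehensile manipulation.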
Long-horizon planning for non-prehensile robot manipulation is challenging due to underactuated and discontinuous interactions. We propose a hierarchical formulation of model predictive path integral (MPPI) control that guides robot-level planning with a separately computed object-level plan to achieve efficient long-horizon prediction. We first solve a simplified object-only problem, assuming the object can be actuated directly, and use the planned object trajectory as a reference in solving the joint robot-object planning problem. We evaluate our method in both simulation and hardware using a 6-DoF xArm6 manipulator to perform object pushing tasks in which the target object must reach a goal while avoiding static obstacles, necessitating non-myopic reasoning. Our object-informed MPPI increases task success by 40% with a 26% faster control frequency in simulation, and by 20% in real experiments with similar computation as regular MPPI.
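For readers unfamiliar with MPPI, the core update can be sketched in a few lines: sampled control perturbations are weighted by the exponential of their negative rollout cost, and the weighted average is added to the nominal control sequence. This is only the generic MPPI step with scalar controls. The hierarchical part described in the abstract (an object-level plan used as a reference for the robot-level problem) is not shown, and the function names and the temperature parameter are illustrative assumptions.

```python
import math


def mppi_weights(costs, temperature):
    """Softmin weights over rollout costs; subtracting the minimum keeps exp() stable."""
    best = min(costs)
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mppi_update(nominal, perturbations, costs, temperature):
    """Add the cost-weighted average perturbation to each step of the nominal controls.

    `nominal` is a list of scalar controls over the horizon; each entry of
    `perturbations` is one sampled noise sequence of the same length, and
    `costs` holds the rollout cost of each perturbed sequence.
    """
    weights = mppi_weights(costs, temperature)
    return [
        nominal[t] + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t in range(len(nominal))
    ]


if __name__ == "__main__":
    nominal = [0.0, 0.0]
    samples = [[1.0, 1.0], [-1.0, -1.0]]
    # The first sample is much cheaper, so the update moves toward it.
    print(mppi_update(nominal, samples, costs=[0.0, 5.0], temperature=1.0))
```

A lower temperature concentrates the weight on the cheapest rollouts, while a higher one averages more broadly across samples.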
An Efficient and Uncertainty-Aware Diffusion Framework for Offline-to-Online Reinforcement Learning
Ha Manh Bui, Metod Jazbec, Eric Nalisnick, Anqi Liu
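AI summary: Proposes DUAL, a framework that uses the prior knowledge of a diffusion model to distill a fast-sampling diffusion policy and transition model, and applies a Laplace approximation with distance-based transition-state-shift detection for uncertainty quantification, improving the exploration-exploitation balance in the online phase.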
Offline-to-Online Reinforcement Learning (O2O-RL) leverages an offline, pre-trained policy to minimize costly online interactions. Although data-efficient, O2O-RL is susceptible to shifts between offline and online distributions. Existing work aims to mitigate the harm of this shift by finetuning the policy on trajectory data sampled from a diffusion model. Inspired by this line of work, we propose DUAL: an efficient Diffusion Uncertainty-Aware framework for offline-to-online reinforcement Learning. DUAL utilizes the prior knowledge of the diffusion model to distill a fast-sampling diffusion actor policy and transition model in the offline phase. DUAL also employs a Laplace approximation and distance transition-state-shift detection, thereby using uncertainty quantification to improve exploration versus exploitation in the online phase. We formally show that our actor loss with the Laplace approximation provides a proxy for a principled estimate of epistemic uncertainty. Empirically, DUAL improves the online expected return over O2O-RL baselines across multiple settings and environments.
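CameraNoise: Faithful Camera Control in Video Diffusion via Geometry-Flow-Guided Noise Warping
Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, Zuxuan Wu, Yu-Gang Jiang
AI summary: Proposes CameraNoise, which uses geometry-flow-guided noise warping to encode camera motion as a temporally coherent stochastic representation, enabling faithful and geometrically consistent camera control in video diffusion.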
Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flow-to-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness. The project page and code are available at: https://gulucaptain.github.io/CameraNoise/.
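Eywa: Provenance-Grounded Long-Term Memory for AI Agents
Resham Joshi
AI summary: Proposes the Eywa architecture, which provides auditable long-term memory by storing evidence before deriving facts, validating memories, and retrieving through a deterministic multi-route read path (zero LLM calls), and reports high accuracy on several benchmarks.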
AI agents that persist across sessions need memory they can retrieve, audit, update, and erase. Existing memory systems often collapse source evidence, extracted facts, retrieved context, and answer policy into one opaque prompt path, making failures difficult to diagnose: a wrong answer may come from missing evidence, unsupported extraction, stale state, retrieval loss, or answer-model behavior. We present Eywa, a provenance-grounded memory architecture built around evidence before belief. Eywa stores immutable source evidence before deriving canonical facts, validates extracted memories against typed signals and source support, and retrieves bounded memory context through a deterministic multi-route read path with zero LLM calls inside retrieval. Retrieved context is returned separately from answer instructions, allowing the same memory substrate to be evaluated across frontier, budget, and local answer models. Under a frozen, artifact-recorded retrieval configuration, Eywa reaches 90.19% judge accuracy on the LoCoMo C1-C4 split with Claude Sonnet 4.6 write and QA roles. On LongMemEval-S, it reaches 88.2% retrieval-sufficiency accuracy. On BEAM, a 700-question technical-memory stress benchmark, it reaches 81.45% mean nugget score and 85.29% pass@score >= 0.5. Full per-question artifacts, including questions, gold answers, model answers, retrieved context, and labels, are published at https://eywa.to/research.
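SSR: Extending Robust and Symmetric Humanoid Traversal to the Open World
Ruiqi Yu, Yiwen Wang, Yuan Hao, Jun WU, Qiuguo Zhu
AI summary: Proposes the SSR framework, which introduces imagined foothold guidance, equivariant latent-space symmetry augmentation, and terrain-specific multi-discriminator motion priors to enable safe and stable vision-based humanoid traversal in the open world.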
Extending humanoid traversal to the open world is key to practical deployment in human environments, but remains challenging. The robot must use vision to ensure safe and reliable foot placement on heterogeneous terrain under highly dynamic motion, while producing coordinated, natural whole-body behaviors. We propose SSR, an efficient end-to-end framework for egocentric vision-based humanoid traversal that jointly learns these capabilities. SSR introduces imagined foothold guidance, which learns to model forthcoming swing-foot contacts and evaluates their support to guide pre-touchdown swings toward stable regions, reducing edge slips. It further employs equivariant latent-space symmetry augmentation to efficiently induce bilateral coordination under high-dimensional visual observations, and uses terrain-specific multi-discriminator motion priors to encourage human-like behavior across scenes. Extensive experiments show that SSR achieves safe, stable, and high-quality locomotion on diverse real-world terrains, including stairs with varied structures and extreme challenges such as wide gaps and high platforms, while enabling reliable long-horizon traversal in open outdoor environments.
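DisPlace: Discriminative Place Projection for Multi-Reference Visual Place Recognition
Dhyey Manish Rajani, Michael Milford, Tobias Fischer
AI summary: Proposes the DisPlace framework, which fuses multi-reference descriptors via a generalized eigenvalue problem that maximizes between-place separability and suppresses within-place variation, improving the robustness of visual place recognition under varying conditions.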
A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.
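As a rough illustration of fusing descriptors through a generalized eigenvalue problem, the sketch below builds LDA-style between-place and within-place scatter matrices and solves S_b v = λ S_w v by Cholesky whitening. The choice of scatter matrices, the regularizer `reg`, and the function name `discriminative_projection` are assumptions made here; the abstract does not specify the exact matrices DisPlace uses.

```python
import numpy as np

def discriminative_projection(descs, place_ids, k, reg=1e-6):
    """Return the top-k generalized eigenvectors of (S_b, S_w) and their eigenvalues.

    descs: (n, d) reference descriptors; place_ids: (n,) place labels.
    """
    descs = np.asarray(descs, dtype=float)
    place_ids = np.asarray(place_ids)
    d = descs.shape[1]
    mean_all = descs.mean(axis=0)
    s_w = np.zeros((d, d))  # within-place scatter (variation across references)
    s_b = np.zeros((d, d))  # between-place scatter (separation of place means)
    for p in np.unique(place_ids):
        x = descs[place_ids == p]
        mu = x.mean(axis=0)
        centered = x - mu
        s_w += centered.T @ centered
        diff = (mu - mean_all)[:, None]
        s_b += len(x) * (diff @ diff.T)
    s_w += reg * np.eye(d)  # keep S_w positive definite
    # With S_w = L L^T, the problem becomes a symmetric one for L^{-1} S_b L^{-T}.
    chol = np.linalg.cholesky(s_w)
    chol_inv = np.linalg.inv(chol)
    evals, evecs = np.linalg.eigh(chol_inv @ s_b @ chol_inv.T)
    order = np.argsort(evals)[::-1][:k]
    return chol_inv.T @ evecs[:, order], evals[order]

# Toy data: dimension 0 encodes place identity, dimension 1 is condition noise.
offsets = [(-0.1, -1.0), (-0.1, 1.0), (0.1, -1.0), (0.1, 1.0)]
descs = [(c + dx, dy) for c in (0.0, 5.0, 10.0) for dx, dy in offsets]
place_ids = [p for p in (0, 1, 2) for _ in offsets]
w, evals = discriminative_projection(descs, place_ids, k=1)
print(w.ravel(), evals)
```

On this toy data the leading direction aligns with dimension 0, the one that separates places, even though dimension 1 has the larger within-place variance.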
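Pairwise Reference Alignment as a Model-Level Ordinal Observable
Mujing Li
AI summary: Defines pairwise reference alignment as an ordinal observable induced by a model scoring function, proposes a centered order-parameter-like statistic and a margin-based extension, gives finite-sample estimators and concentration bounds, and validates them with experiments on Qwen2.5 and RewardBench.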
Pairwise preference data is widely used in language-model evaluation and alignment, often for model ranking, reward modeling, or preference optimization. This note formulates a more basic measurement question: given a reference distribution of pairwise preferences, what model-level quantity is estimated when we test whether a model ranks preferred responses above rejected responses? We define pairwise reference alignment as an ordinal observable induced by a model scoring function. Given a reference pair distribution $P_{\mathrm{pair}}$ over triples $(x,y^+,y^-)$, and a scalar model score $S_M(x,y)$, we define the alignment observable as the probability that the model-induced ordering agrees with the reference preference ordering. We further define a centered order-parameter-like statistic and discuss a margin-based extension. The resulting quantities admit simple finite-sample estimators and concentration bounds under independent sampling assumptions. This note does not introduce a new benchmark. It provides a conceptual and statistical formulation for pairwise reference alignment, clarifies the role of the reference pair distribution, and distinguishes the general ordinal observable from scoring choices such as normalized log-probability or energy-based scores. We also provide an initial empirical study on Qwen2.5 models and RewardBench, where the proposed statistics increase with model size and instruction tuning and vary across reference-pair subsets as predicted by the formulation.
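The finite-sample estimator described above can be sketched as follows: the empirical agreement rate over sampled triples, a centered statistic, and a Hoeffding confidence radius under independent sampling. The centering `2 * A - 1`, the treatment of ties as disagreement, and the function names are assumptions made for illustration; the note's exact definitions may differ.

```python
import math

def alignment_estimate(score_pairs):
    """Fraction of pairs where the preferred response outscores the rejected one.

    score_pairs: list of (s_plus, s_minus), i.e. S_M(x, y+) and S_M(x, y-).
    Ties count as disagreement (an assumption of this sketch).
    """
    if not score_pairs:
        raise ValueError("need at least one pair")
    agree = sum(1 for s_plus, s_minus in score_pairs if s_plus > s_minus)
    return agree / len(score_pairs)

def centered_statistic(a_hat):
    """Map agreement in [0, 1] to [-1, 1]; 0 corresponds to chance-level ordering."""
    return 2.0 * a_hat - 1.0

def hoeffding_radius(n, delta=0.05):
    """Two-sided Hoeffding radius for the mean of n i.i.d. values in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, 1.5), (0.7, 0.7)]
a_hat = alignment_estimate(pairs)
print(a_hat, centered_statistic(a_hat), hoeffding_radius(len(pairs)))
```

With probability at least `1 - delta`, the true agreement probability lies within `hoeffding_radius(n, delta)` of the estimate, provided the triples are drawn independently from the reference pair distribution.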
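Chain-of-Thought and Compressed Looped Transformers: A Memory-Budget Separation
Haozhou Zhang
AI summary: By comparing three memory regimes (the compressed latent loop, the full sequence-state loop, and the chain-of-thought scratchpad), shows that the memory budget of a compressed looped Transformer limits its reasoning ability, whereas chain-of-thought solves harder problems by extending the context.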
Chain-of-thought prompting and looped Transformers both give a fixed model more test-time computation, but they differ in what they remember. Chain-of-thought stores intermediate state in generated tokens that remain in the context, whereas a looped Transformer carries state through recurrent hidden activations. We argue that this persistent mutable memory is a central resource for test-time reasoning. We compare three memory regimes, the compressed latent loop, the full sequence-state loop, and the chain-of-thought scratchpad. Our main result shows that a compressed loop is limited by the size of its recurrent state. Running the loop longer adds computation but does not by itself create a growing scratchpad, so a loop with a small recurrent state remains a small-space reasoner even when run for many steps. Under a standard complexity assumption, such loops cannot decide problems that are P-complete under logspace reductions, whereas polynomial-length chain-of-thought can. The separation is specific to compressed loops, as full sequence-state loops carry state at every input position and live in a memory-rich regime closer to explicit scratchpads. Controlled pointer-chasing and associative-recall sweeps illustrate this memory-budget view, with performance sensitive to whether the persistent-state budget matches the task's working-memory demand.
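Efficient Diffusion Large Language Models via Temporal-Spatial Parallel Decoding and Confidence Extrapolation
Zekai Li, Ji Liu, Yiqing Huang, Ziqiong Liu, Dong Li, Emad Barsoum
AI summary: Proposes two methods, Temporal-Spatial Parallel Decoding (TSPD) and Confidence Extrapolation (CE), which dynamically control the denoising trajectory to cut redundant iterations and speed up inference for diffusion large language models.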
Diffusion-based large language models (dLLMs) support parallel text generation via iterative denoising, yet inference remains latency-heavy because many steps are spent on redundant refinement and repeated remasking of tokens whose final values are already determined. Prior acceleration methods mainly depend on step-local confidence heuristics or fixed schedules, which are sensitive to prompt and task variation and ignore strong positional effects within a sequence. We cast diffusion decoding as a dynamic control problem and show that token-wise denoising trajectories provide the key signal for reliable control. We propose a trace-aware decoding framework with two components. First, Temporal-Spatial Parallel Decoding (TSPD) uses a lightweight temporal-spatial controller that consumes per-token trajectory features, including confidence, entropy, and momentum, together with token position, to decide when a token has converged and can be safely fixed. Second, we introduce Confidence Extrapolation (CE), a training-free state-space module that forecasts future logit trends with uncertainty to support proactive decisions, including safe look-ahead and targeted stabilization when trajectories are oscillatory or underconfident. Together, TSPD and CE reduce unnecessary denoising iterations while preserving output quality, and they compose cleanly with system optimizations such as KV caching.
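SLAP: Semantic Least Action Principle for Variational Video-Language Modeling
Xiang Fang, Wanlong Fang
AI summary: Proposes the Semantic Least Action Principle (SLAP), which models video interpolation as a boundary value problem on a Riemannian manifold and preserves object persistence via the discrete Euler-Lagrange equations, addressing the temporal gap in large video-language models.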
In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental "temporal gap", rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the Semantic Least Action Principle (SLAP). Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments show the effectiveness of our proposed SLAP.
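FLAG: Maximum Entropy Reinforcement Learning with Flow Policies via Latent-Augmented Guidance
Sungha Kim, Gawon Lee, Jusuk Lee, Jonghae Park, H. Jin Kim, Daesol Cho
AI summary: Proposes FLAG, which augments the state space with a latent variable and optimizes a proxy maximum-entropy objective to address importance weight collapse, enabling expressive policy optimization in high-dimensional control tasks.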
Maximum entropy reinforcement learning (MaxEnt-RL) enables robust exploration, yet practical implementations often restrict policies to simple Gaussians. While recent approaches incorporate expressive generative policies via importance-weighted supervised learning, they are prone to importance weight collapse, which limits their scalability in high-dimensional action spaces. Our key insight is to mitigate this limitation by localizing the sampling region, avoiding the weight degeneracy induced by importance sampling over the entire action space. To instantiate this insight, we introduce FLAG (Flow policy with Latent-Augmented Guidance). FLAG augments the state space with a flow latent variable and optimizes a provably consistent proxy MaxEnt-RL objective. We empirically demonstrate that FLAG enables expressive policy optimization with limited importance samples and scales to high-dimensional control tasks. Furthermore, FLAG achieves state-of-the-art performance across challenging benchmarks. Our project webpage: https://flag-rl.github.io/
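Importance weight collapse, and why a localized sampling region can help, can be illustrated with a small experiment. The sketch below reweights samples from a uniform proposal toward a narrow Gaussian target and reports the Kish effective sample size (ESS). The target, the box-shaped proposals, and the function names are assumptions chosen for illustration; this is not the FLAG objective or its sampling scheme.

```python
import math
import random

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def ess_uniform_proposal(dim, half_width, n=2000, sigma=0.2, seed=0):
    """ESS when a Gaussian target N(0, sigma^2 I), restricted to the box
    [-half_width, half_width]^dim, is reweighted from a uniform proposal on it."""
    rng = random.Random(seed)
    weights = []
    for _ in range(n):
        sq = sum(rng.uniform(-half_width, half_width) ** 2 for _ in range(dim))
        # The uniform proposal density is constant, so it cancels after normalization.
        weights.append(math.exp(-sq / (2.0 * sigma * sigma)))
    return effective_sample_size(weights)

for dim in (1, 6):
    whole_space = ess_uniform_proposal(dim, half_width=1.0)
    localized = ess_uniform_proposal(dim, half_width=0.5)
    print(dim, round(whole_space), round(localized))
```

Sampling over the whole box leaves only a handful of effective samples as the dimension grows, while the smaller box around the target retains far more, which is the intuition behind localizing the sampling region.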
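Immuno-VLM: Open-World Trustworthiness for Large Vision-Language Models via Generative Semantic Antibodies
Xiang Fang, Wanlong Fang, Wei Ji
AI summary: Targets the "Hubris of Semantics" problem, in which large vision-language models deployed in the open world misclassify unknown anomalies into known categories with high confidence because they lack negative knowledge. Proposes Immuno-VLM, a framework inspired by immunological negative selection, which uses the generative reasoning of large language models to actively produce "semantic antibodies" (textual descriptions of near-distribution outliers) that bound the decision space of known classes, reaching a new state of the art on ImageNet-1K and four OOD benchmarks.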
Large Vision-Language Models have achieved unprecedented success in zero-shot recognition by aligning visual features with broad semantic concepts. However, this semantic abstraction creates a critical vulnerability in open-world deployment: the "Hubris of Semantics", where models force-fit unknown anomalies into known categories with high confidence due to the lack of explicit negative knowledge. To address this Open-World Trustworthiness Paradox, we propose Immuno-VLM, a bio-inspired framework that adapts the biological principle of Immunological Negative Selection to high-dimensional latent spaces. Departing from traditional Open-Set Recognition methods that rely on passive density estimation or inefficient pixel-space outlier generation, Immuno-VLM leverages the generative reasoning of Large Language Models to actively hallucinate "Semantic Antibodies", textual descriptions of near-distribution outliers (e.g., look-alikes, contextual anomalies) that effectively bound the decision space of known classes. Extensive experiments on ImageNet-1K and four challenging OOD benchmarks reveal that Immuno-VLM establishes a new state-of-the-art.
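Annotations Are Not All You Need: A Cross-Modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding
Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, Kai Zou
AI summary: Proposes a cross-modal knowledge transfer network that transfers entity-aware and event-aware knowledge from image-noun and video-verb tasks to perform unsupervised temporal sentence grounding without paired video-query annotations.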
This paper addresses the task of temporal sentence grounding (TSG). Although prior work has made notable progress on this important topic, it relies heavily on large numbers of expensive paired video-query annotations, which take tremendous human effort to collect in real-world applications. We therefore target a more practical but more challenging TSG setting: unsupervised temporal sentence grounding, where neither paired video-query annotations nor segment boundary annotations are available during network training. Since other cross-modal tasks provide many cheap and easily available labels, we collect their simple cross-modal alignment knowledge and transfer it to our complex scenario: 1) We first explore entity-aware, object-guided appearance knowledge from the paired Image-Noun task and adapt it to each independent video frame; 2) We then extract an event-aware action representation from the paired Video-Verb task and further refine it for more practical but complicated real-world cases with a newly proposed copy-paste approach; 3) By modulating and transferring both appearance and action knowledge into our challenging unsupervised task, our model can directly use this general knowledge to correlate videos and queries and accurately retrieve the relevant segment without training. Extensive experiments on two challenging datasets (ActivityNet Captions and Charades-STA) demonstrate the effectiveness of our approach, which outperforms existing unsupervised methods and is even competitive with supervised ones.
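Is the Last Layer Enough for Uncertainty Quantification?
Joseph Wilson, Chris van der Heide, Liam Hodgkinson, Fred Roosta
AI summary: Through theoretical analysis and empirical evaluation, compares full-network and last-layer linearization for epistemic uncertainty quantification in deep neural networks, finding that the last-layer approximation retains comparable UQ performance while substantially improving computational efficiency.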
Epistemic uncertainty quantification (UQ) for deep neural networks (DNNs) is a requirement for safe adoption of AI in mission-critical settings. Several leading methods for UQ linearize DNNs to form Bayesian Generalized Linear Models (GLMs), where epistemic uncertainty is modeled via the predictive posterior distribution. Linearizing around the parameters of the final connected layer of a DNN is a commonly used approximation for reducing the computational burden of such GLMs, though it is often believed to come at the cost of degraded performance. In this work, we compare GLMs arising from full-network and last-layer linearization using both theoretical and empirical approaches. We first employ tools from random matrix theory to conduct a theoretical comparison; this analysis reveals no meaningful improvement in the UQ capabilities of full linearization. Coupled with a large-scale empirical evaluation across a range of modern machine learning tasks, we arrive at the following conclusion: a last-layer approximation yields comparable UQ performance while offering substantially improved computational efficiency.
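GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation
Beichen Shao, Mengying Xie, Heng Su, Wanyi Zhang, Mingyan Li, Yan Ding, Fausto Giunchiglia, Chao Chen
AI summary: Proposes the GSAM framework, in which a vision-based perceiver generates kinematic parameters, a VLM-based refiner corrects them with commonsense reasoning, an interaction constraint function generator integrates obstacle-avoidance knowledge, and a kinematic-aware planner verifies trajectory reachability; on 50 hinge tasks it reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% over the best baseline.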
Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, vision-motion planning, and large language/vision-language models (LLMs/VLMs), but often overlook the diversity of articulated objects and the complexity of interactions between end-effector and handle, leading to limited generalization and destructive collisions. To address this, we propose GSAM, a generalizable and safe robotic framework for articulated object manipulation. Specifically, a vision-based perceiver generates the kinematic parameters. Considering that pre-trained markers in the perceiver yield raw estimations that may deviate from commonsense, we present a fine-tuned VLM-based refiner that uses chain-of-thought (CoT) commonsense reasoning to refine perception. To prevent destructive collisions, we design an interaction constraint function generator, integrating articulated object, interaction pose, and obstacle avoidance knowledge into a base. An LLM then functionalizes these constraints and applies them to trajectory and posture planning. A kinematic-aware manipulation planner verifies reachability for trajectory and posture. Experiments on 50 hinge tasks across 5 object categories and 50 randomly initialized end-effector-handle configurations show that GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline, respectively demonstrating the superior object generalization and interaction safety of GSAM in practical scenarios.