BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction
Qi Wang, Peijie Wang, Fei Yin, Cheng-Lin Liu
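AI summary: Proposes the BiNSGPS framework, which uses bidirectional neuro-symbolic interaction between a multimodal large language model adviser and a symbolic solver to dynamically rectify inconsistent formal representations or propose auxiliary hypotheses, addressing early-stage errors and symbolic conflicts in geometry problem solving.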
Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbolic methods, which exhibit limited adaptability, and neural methods, which are prone to hallucinations. Recent neuro-symbolic hybrids predominantly rely on a unidirectional pipeline in which neural outputs are fed into solvers without feedback, making the system brittle to early-stage errors. To break this unidirectional bottleneck, we propose BiNSGPS, a framework that establishes Bidirectional Neuro-Symbolic Interaction (BiNS) between a multimodal large language model (MLLM) Adviser and a Symbolic Solver. The MLLM Adviser actively incorporates feedback from the symbolic solver to dynamically rectify inconsistent formal representations or propose auxiliary hypotheses, resolving symbolic conflicts and facilitating complex deductions.
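ALINC: Active Learning for Inductive Node Classification via Graph Sampling
Pascal Plettenberg, Denis Huseljic, André Alcalde, Bernhard Sick, Josephine M. Thomas
AI summary: Proposes the ALINC framework, which addresses active learning for inductive node classification through graph-level sampling strategies and evaluates multiple strategies and aggregation methods.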
Active learning (AL) for node classification typically focuses on selecting the most informative nodes for annotation within one or a few large graphs (e.g., in social network analysis). However, in other domains, such as molecular chemistry or electronic design automation, datasets consist of thousands of independent graphs. In many of these inductive settings, annotating an individual node requires a full-graph analysis, which effectively yields the remaining node labels on-the-fly. Therefore, these scenarios require AL strategies that select entire graphs instead of single nodes, a problem which has not been tackled in the literature so far. Thus, we introduce ALINC, an AL framework for inductive node classification via graph sampling. It bridges the existing methodological gap by elevating node-level utility measures to graph-level selection criteria through various aggregation mechanisms. In an extensive benchmark including ten strategies, three aggregation methods, and four datasets, we identify CoreSet, TypiClust, and BADGE as the top-performing graph sampling strategies. Our detailed analysis further reveals that the choice of the aggregation method is pivotal, as it substantially affects model performance and annotation costs. Finally, we demonstrate the effectiveness of ALINC in two use case studies: site-of-metabolism prediction in molecules and design automation of printed circuit board schematics.
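The idea of lifting node-level utility to a graph-level selection criterion can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: predictive entropy stands in for the node-level utility, and mean, max, and sum stand in for the aggregation methods, which the abstract does not name. The function names are hypothetical and are not ALINC's API.

```python
import math

def node_entropy(probs):
    """Shannon entropy of one node's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical aggregation choices; the abstract only says there are three.
AGGREGATORS = {
    "mean": lambda scores: sum(scores) / len(scores),
    "max": max,
    "sum": sum,
}

def graph_utility(node_probs, how="mean"):
    """Aggregate node-level utilities into one score for the whole graph."""
    scores = [node_entropy(p) for p in node_probs]
    return AGGREGATORS[how](scores)

def select_graphs(pool, budget, how="mean"):
    """Return the ids of the `budget` graphs with the highest aggregated utility."""
    ranked = sorted(pool, key=lambda gid: graph_utility(pool[gid], how), reverse=True)
    return ranked[:budget]

# Toy pool: graph id -> per-node class probabilities from the current model.
pool = {
    "small_uncertain": [[0.5, 0.5]],
    "large_mixed": [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],
    "confident": [[0.99, 0.01], [0.99, 0.01]],
}

picked_by_mean = select_graphs(pool, budget=1, how="mean")
picked_by_sum = select_graphs(pool, budget=1, how="sum")
```

In this toy pool the two aggregations pick different graphs: the mean favours the small, uniformly uncertain graph, while the sum favours the larger graph. This is one simple way in which the aggregation choice can change what gets annotated and how many nodes that annotation covers.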
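QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples
Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang
AI summary: Proposes the QO-Bench benchmark, which uses deterministic evaluation over typed event tuples to diagnose where retrieval-augmented generation systems fail when executing query operators such as joins and intersections.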
Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables operator-level diagnosis such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework -- index-time preservation versus query-time execution -- predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution -- not retrieval alone -- is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
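A small sketch may make the scoring design concrete: the gold answer is computed deterministically from typed tuples, and a system's answer is scored by exact-match recall. The tuple fields, the event types, and the company names below are all invented for illustration; the benchmark's actual schema is not given in the abstract.

```python
from typing import NamedTuple

class Event(NamedTuple):
    # Hypothetical typed event tuple; field names are illustrative only.
    company: str
    event_type: str
    date: str

def intersect_companies(events, type_a, type_b):
    """Intersection operator: companies that have events of both types."""
    with_a = {e.company for e in events if e.event_type == type_a}
    with_b = {e.company for e in events if e.event_type == type_b}
    return with_a & with_b

def exact_match_recall(predicted, gold):
    """Fraction of gold answers that appear verbatim in the prediction."""
    gold = set(gold)
    if not gold:
        return 1.0  # assumption: an empty gold set counts as fully recalled
    return len(gold & set(predicted)) / len(gold)

events = [
    Event("Acme", "acquisition", "2024-01-10"),
    Event("Acme", "layoff", "2024-03-02"),
    Event("Globex", "acquisition", "2024-02-14"),
    Event("Initech", "layoff", "2024-02-20"),
    Event("Initech", "acquisition", "2024-05-01"),
]

gold = intersect_companies(events, "acquisition", "layoff")
# A system that retrieved relevant-looking text but answered one company wrongly:
score = exact_match_recall(["Acme", "Globex"], gold)
```

Because the gold set comes from executing the operator over the tuples, a wrong answer can be traced to the operator itself rather than to a judge model's opinion of the passage.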
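CYGNET: A Cypher Gate for Neural Execution Triage and Cost Control
Nikodem Tomczak
AI summary: Proposes the CYGNET gate, which uses pre-execution validation and error correction to intercept structurally broken Cypher queries efficiently and flag overly expensive execution plans while preserving generation accuracy.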
Language models acting as agents over knowledge graphs generate Cypher queries that fail structurally (crashing at the database) or semantically (executing but returning wrong results). We place a pre-execution gate between query generation and a production Neo4j database. The gate validates structure through a four-backend chain culminating in execution against a mirror graph at 5.6 ms median latency. Structurally broken queries are routed to a corrector that iterates structured error feedback through a language model. On seven CypherBench schemas (2348 questions, ACL 2025) the pipeline maintains generation accuracy on every model tested, confirming it operates as a safe defensive layer. The corrector achieves 81% to 95% success across five models (mean 89%). On a template-generated corpus across nine schemas the gate catches 100% of parse errors, 100% of constraint violations, and 100% of schema-reference errors in path queries with labelled endpoints, at zero false positives across 1135 queries. Property sibling-swaps where the substituted name is valid on the target label score 0%, marking the formal boundary where structural validation ends and semantic validation must begin. A planner-based cost gate flags catastrophic plan structures before execution.
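Explainable Safe Reinforcement Learning
Sabine Rieder, Stefan Pranger, Debraj Chakraborty, Jan Křetínský, Bettina Könighofer
AI summary: Proposes an explainable safe reinforcement learning method based on a hierarchy of decision trees; it uses a world model to analyze state risk and construct the shield, producing understandable explanations while retaining safety guarantees.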
Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque. Shielding is a prominent model-based technique for enforcing safety in reinforcement learning. However, because shields are automatically synthesized using rigorous formal methods, their decisions are often similarly difficult for humans to interpret. Recently, decision trees have become a customary way to represent controllers and policies. However, since shields are inherently non-deterministic, their decision tree representations become too large to be explainable in practice. To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human-interpretable explanations of the shield's decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top-down, case-based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this analysis, we construct both the shield and a high-level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), explaining why a situation may be safety-critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe. Our method facilitates explainability of the safety aspect in safe-by-shielding reinforcement learning, requires no additional information beyond what is already used for shielding, incurs minimal overhead, and integrates readily into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.
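VentAgent: When LLMs Learn to Breathe: Multi-Objective Arbitration for ARDS Ventilation
Teqi Hao, Yuxuan Fu, Xiaoyu Tan, Shaojie Shi, Bohao Lv, Yinghui Xu, Xihe Qiu
AI summary: Proposes VentAgent, a hierarchical framework that uses large language models as transparent arbitrators and recasts mechanical ventilation control as a dynamic multi-objective arbitration process with three stages (perception, planning, orchestration); it outperforms reinforcement learning and classical control baselines on a physiological simulator and provides interpretable reasoning chains.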
Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.
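RAMPART: Registry-based Agent Memory with Priority-Aware Runtime Transformation
Nikodem Tomczak
AI summary: Proposes RAMPART, a compile-time memory model and pure in-RAM block registry that assembles context through programmable runtime operations and five primitives; experiments show that block position and grouping significantly affect task success, and that shared-registry coordination incurs zero prompt-token cost.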
RAMPART is a compile-time memory model and pure in-RAM block registry for LLM-based agents. Context assembly is a programmable runtime operation where content is compiled from a structured registry under explicit policy for ordering, inclusion, and eviction. Five composable primitives (promote, gate, write, evict, rollback) act on named addressable blocks before compilation at zero prompt-token cost. Provenance tags and non-evictable authorship flags implement a permissioned memory model with block-level ownership. Controlled probes with Qwen3-8B Q4 show that compile-time placement and the structural relationship between blocks and the task query affect task success, with the cliff falling at roughly the seventh block position when the task follows the registry and the twelfth when it precedes. Grouping the critical block with content-adjacent neighbours and promoting the group as a unit lifts task success by tens of percentage points at positions where single-block placement fails. Cross-model replication on Qwen2.5-7B, Llama-3.1-8B, Mistral-7B-v0.3, and Qwen3-14B shows the content-priming effect appears at the same absolute positions across families, with magnitude varying with model strength. Block grouping raises Mistral's mean pass rate roughly fivefold at the hardest registry size, and a smaller model with the intervention can outperform a larger model without it in the mid-registry zone. Relevance gating reduces prompt cost by 67.8% while recovering 83% of the promoted-condition success rate. Schema eviction produces 0% invocations against 100% with the schema present, a property policy-based approaches cannot guarantee by construction. Shared-registry coordination reduces inter-agent communication to a method call at zero coordination token cost.
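To make the five primitives easier to picture, here is a toy in-memory registry. It is a sketch under stated assumptions, not RAMPART's implementation: the class name, method signatures, the priority-based ordering, and the snapshot-based rollback are all hypothetical choices made for illustration.

```python
import copy

class BlockRegistry:
    """Toy registry of named blocks that are compiled into a prompt on demand."""

    def __init__(self):
        self._blocks = {}   # name -> {"text", "priority", "enabled", "evictable"}
        self._history = []  # snapshots taken before each mutation, for rollback

    def _snapshot(self):
        self._history.append(copy.deepcopy(self._blocks))

    def write(self, name, text, priority=0, evictable=True):
        self._snapshot()
        self._blocks[name] = {"text": text, "priority": priority,
                              "enabled": True, "evictable": evictable}

    def promote(self, name, priority):
        """Move a block earlier in the compiled context by raising its priority."""
        self._snapshot()
        self._blocks[name]["priority"] = priority

    def gate(self, name, enabled):
        """Include or exclude a block from compilation without deleting it."""
        self._snapshot()
        self._blocks[name]["enabled"] = enabled

    def evict(self, name):
        """Remove a block entirely; non-evictable blocks are protected."""
        if not self._blocks[name]["evictable"]:
            raise PermissionError(f"block {name!r} is not evictable")
        self._snapshot()
        del self._blocks[name]

    def rollback(self):
        """Undo the most recent mutation."""
        self._blocks = self._history.pop()

    def compile(self):
        """Assemble enabled blocks, highest priority first (ties keep write order)."""
        live = [b for b in self._blocks.values() if b["enabled"]]
        live.sort(key=lambda b: -b["priority"])
        return "\n".join(b["text"] for b in live)

registry = BlockRegistry()
registry.write("system", "You are a helpful agent.", priority=10, evictable=False)
registry.write("notes", "Scratch notes.")
registry.write("tool_schema", "search(query: str)")
registry.promote("tool_schema", 5)
prompt_with_schema = registry.compile()

registry.evict("tool_schema")
prompt_without_schema = registry.compile()

registry.rollback()
restored = registry.compile()
```

None of these operations adds tokens to the prompt; they only change what the next `compile()` call emits. Evicting the schema block removes it from the compiled text entirely, which is the kind of by-construction guarantee the abstract contrasts with policy-based approaches.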
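Learning Symplectic Model Order Reduction Based on a Symplectic Embedding Approximation Theorem
Liyi Feng, Yifa Tang, Yulin Xie, Ruili Zhang, Aiqing Zhu
AI summary: To address the tendency of model reduction for high-dimensional Hamiltonian systems to destroy symplectic structure, proposes symplecticity-preserving autoencoders (SpAE), which parameterize the decoder as a symplectic embedding and the encoder as a symplectic projection, preserving symplectic structure while improving reconstruction and prediction accuracy.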
High-dimensional Hamiltonian systems play a central role in many scientific and engineering disciplines, with dynamics evolving on symplectic manifolds. Although deep learning provides powerful tools for constructing low-dimensional surrogates from data, the intrinsic symplectic structure is easily destroyed during model reduction. As a result, a standard autoencoder may produce latent coordinates that do not support a Hamiltonian flow, leading to unstable long-time prediction. In this paper, we first establish a universal approximation theorem for symplectic embeddings. Based on this theory, we propose symplecticity-preserving autoencoders (SpAE), in which the decoder is parameterized as a symplectic embedding and the encoder is constructed as the corresponding symplectic projection. This architecture is expressive enough to approximate nonlinear symplectic embeddings and the associated symplectic projections, preserves the symplectic structure exactly by construction, and can be trained by standard unconstrained optimization, thereby improving both reconstruction and prediction accuracy. Extensive experiments on high-dimensional lattice and particle systems demonstrate the effectiveness of the proposed method.
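MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer
Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi
AI summary: Proposes MeshFlow, which uses a variational autoencoder to map mesh topology and vertex coordinates into a continuous latent space and a rectified flow transformer to generate meshes in parallel; it is 18x faster than autoregressive methods with excellent accuracy.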
We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow
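QuBLAST: A Framework for Quantizing Large Language Models with a Block-Level Compression Approach and Activation Scaling Strategy
Pasindu Wickramasinghe, Achyuta Muthuvelan, Rachmad Vidya Wicaksana Putra, Minghao Shao, Muhammad Shafique
AI summary: To ease deployment of large language models, proposes the QuBLAST framework, which uses block-level mixed-precision quantization and an activation scaling strategy to reduce model size by 40%-45.2% while keeping the perplexity increase within 5%.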
LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically incur huge computational and memory costs, which makes them difficult to deploy on embedded systems. To address this, state-of-the-art methods typically employ uniform post-training quantization (PTQ) across the attention blocks of the network, overlooking the potential of applying different quantization levels within the same network. They also employ complex operations to mitigate the negative impact of activation outliers, incurring high computational overheads. Moreover, they have not been evaluated on emerging LLMs with non-conventional attention architectures (e.g., state-space models), which pose different challenges for quantization. To address these limitations, we propose QuBLAST, a novel PTQ methodology that employs a block-level compression approach with an activation scaling strategy for LLMs. The block-level compression approach enables mixed-precision quantization across blocks of the network, while the activation scaling strategy efficiently mitigates the negative impact of activation outliers. Specifically, QuBLAST first analyzes the sensitivity of different attention blocks in the pre-trained model through a cross-entropy loss analysis. QuBLAST leverages this sensitivity analysis to determine the weight quantization level for each attention block in the model. Furthermore, QuBLAST employs an activation scaling map for each block to control the range of activation values and mitigate the negative impact of activation outliers, thereby enabling better quantization results. Experimental results show that QuBLAST reduces model sizes by 40%-45.2% across different model architectures (i.e., Qwen3-8B, Llama3-8B, Mistral v0.1-8B, and Falcon H1R-7B), while keeping the perplexity increase within 5% on the WikiText-2 and WikiText-103 datasets.
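A Normative Intermediate Representation for ASP-based Compliance Reasoning
Yangfan Wu, Huanyu Yang, Jianmin Ji
AI summary: Proposes MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning, with a staged operational semantics and an executable compilation; it is applied to Chinese ADAS regulations through an LLM-assisted pipeline and evaluated on extraction quality and the efficiency of modular and incremental solving.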
We propose MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning. Its core fragment has a staged operational semantics, while MONIR-ASP provides an executable compilation and extensions for external functions, temporal rules, and stable-model reasoning. We instantiate the framework on Chinese ADAS regulations and standards with an LLM-assisted pipeline. Experiments evaluate extraction quality and the efficiency of modular and incremental ASP solving.
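BPDA-GMM: Bayesian Probabilistic Data Association with Gaussian Mixture Models for Semantic SLAM
Thanh Nguyen Canh, Haolan Zhang, Xiem HoangVan, Antonio Sgorbissa, Nak Young Chong
AI summary: Proposes BPDA-GMM, an online Bayesian probabilistic data association framework that uses a Dirichlet-process prior and a Chinese Restaurant Process model to associate detections with a growing landmark set in semantic SLAM, and applies α-divergence tempering to ambiguous associations, improving trajectory accuracy and the robustness of semantic mapping.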
Probabilistic data association (PDA) improves semantic SLAM in perceptually aliased scenes, but existing methods often assume a fixed landmark set, recompute association weights as the map grows, or rely on hand-tuned null-hypothesis weights. To address these limitations, we propose BPDA-GMM, an online Bayesian PDA framework for semantic SLAM with a growing object-level map. BPDA-GMM uses a Dirichlet-process prior to induce a Chinese Restaurant Process (CRP) association model, where accumulated evidence favors existing landmarks, and the concentration parameter assigns probability mass to new landmarks. For each semantic detection, plausible candidates are selected by a joint semantic-geometric gate, CRP-weighted association probabilities are computed, and object landmarks are updated as semantic Gaussians in closed form. The resulting landmark set forms a Gaussian mixture model, and its dominant component is passed to the back-end as a max-mixture semantic factor. When association weights are inconclusive, an ambiguity-triggered $\alpha$-divergence tempering step improves discrimination. Finally, a decoupled back-end zeroes the pose Jacobian of semantic factors, allowing noisy detections to refine landmarks without directly perturbing the trajectory. Experiments in simulation and on a real indoor dataset demonstrate improved trajectory accuracy, semantic mapping quality, and robustness to perceptual aliasing and classifier errors over state-of-the-art baselines. Code and video are publicly available at https://github.com/thanhnguyencanh/BPDA-SLAM.
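The CRP-weighted association step can be sketched with the textbook Dirichlet-process mixture rule: an existing landmark is weighted by its observation count times the detection likelihood, and a new landmark by the concentration parameter times a base likelihood. This is an assumption about the general form only; the paper's exact weights, its semantic term, and its gating are not given in the abstract, and the isotropic Gaussian and all names below are illustrative.

```python
import math

def isotropic_gaussian(x, mean, sigma):
    """Density of an isotropic Gaussian N(mean, sigma^2 I) evaluated at x."""
    squared_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (len(x) / 2.0)
    return math.exp(-0.5 * squared_dist / sigma ** 2) / norm

def crp_association(detection, landmarks, alpha, new_likelihood):
    """Return association probabilities over the existing landmarks plus one
    final entry for 'create a new landmark'."""
    weights = [
        lm["count"] * isotropic_gaussian(detection, lm["mean"], lm["sigma"])
        for lm in landmarks
    ]
    weights.append(alpha * new_likelihood)  # mass reserved for a new landmark
    total = sum(weights)
    return [w / total for w in weights]

landmarks = [
    {"mean": (0.0, 0.0), "sigma": 0.5, "count": 5},  # often-observed landmark
    {"mean": (4.0, 0.0), "sigma": 0.5, "count": 1},  # rarely observed landmark
]

# A detection close to the first landmark, and one far from both.
probs_near = crp_association((0.1, 0.0), landmarks, alpha=1.0, new_likelihood=0.01)
probs_far = crp_association((10.0, 10.0), landmarks, alpha=1.0, new_likelihood=0.01)
```

The counts play the role of accumulated evidence: a well-observed landmark attracts nearby detections, while a detection that fits no existing landmark falls through to the new-landmark entry without a hand-tuned null-hypothesis weight per landmark.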
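Beyond Symmetric Alignment: Spectral Diagnosis of Modality Imbalance in Medical Vision-Language Models
Alessandro Gambetti, Qiwei Han, Cláudia Soares, Hong Shen
AI summary: Proposes the asymmetric Spectral Alignment Score (SAS), which quantifies modality information imbalance through eigenvalue-weighted per-eigenmode correlations; an evaluation of 15 VLMs on medical image-text datasets finds that medical images retain richer structural information than clinical reports and that SAS correlates most strongly with retrieval performance.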
Vision-Language Models (VLMs) struggle when applied to medical image-text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross-modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue-weighted per-eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image-text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero-label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: https://github.com/iamalegambetti/medical-vlms-assessment.
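Hybrid Adversarial Defence for Natural Language Understanding Tasks
Manar Abouzaid, Yang Wang, Chenghua Lin, Stuart E. Middleton
AI summary: Proposes a hybrid defence framework combining entropy, uncertainty, and geometric features that improves both clean-task performance and adversarial robustness on several natural language understanding datasets.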
Large Language Models (LLMs) are vulnerable both to hallucination and adversarial manipulation. Although these problems are closely related, existing defences typically address them separately. We investigate a hybrid defence framework that combines entropy-based models, designed to reduce hallucinations, with uncertainty-based models and geometric-based models, designed to reduce vulnerability. Under in-domain tests on Natural Language Understanding datasets (FEVER, HotpotQA, CSQA, SIQA) we find our hybrid model improves both clean-task performance (up to 43.34% increase in accuracy) and adversarial robustness (up to 64.92% improvement in accuracy and 62.27% reduction in attack success rate). For out-of-distribution datasets (AeroEngQA, CPIQA) we see similar adversarial robustness from our hybrid model (up to 57.14% improvement in accuracy). For prompt injection (SafeGuard) and jailbreak detection (AdvBench, DAN) datasets our hybrid model is also very strong (up to 51% reduction in attack success rate compared to state-of-the-art baseline models). Overall, our results show that combining entropy, uncertainty and geometric features provides a more effective defence strategy than using any single feature alone for both in-domain and out-of-distribution tasks.
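COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations
Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, Liqiang Nie
AI summary: To handle samples in composed image retrieval that are visually similar but differ in attributes, proposes COMBINER, a unified cross-modal representation method based on attribute prototypes that improves retrieval accuracy through adaptive semantic disentanglement, unified prototype-based composition, and dual relations modeling.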
Composed Image Retrieval (CIR) is a challenging retrieval task that aims to locate specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) missing supervision signals. To tackle these obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which disentangles attribute features based on multimodal primitive features. Secondly, we propose a Unified Prototype-based Composition module, which constructs cross-modal unified prototypes (CUP) and facilitates multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which mines pairwise and neighbor relations based on attribute similarity. Compared to traditional CIR methods that model neighbor relations, COMBINER is the first study to address the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be available at https://github.com/Lee-zixu/COMBINER
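Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval
Olivier Jeunen
AI summary: Proposes the DINOSAUR framework, which samples multiple embeddings per item to build the index and samples the user embedding at retrieval time, implicitly marginalising over embedding uncertainty and improving long-tail item coverage without changes to model architecture or index infrastructure.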
Approximate Nearest Neighbour search indices form the backbone of real-world recommender systems, enabling real-time candidate retrieval over million-item catalogues. Typically, a single point estimate embedding is learnt for every user and every item. At serving time, the user embedding queries the index for relevant items. Since these representations are learnt from sparse interaction data, they are noisy and might fail to capture all the nuances that contribute to "relevance" -- ignoring the fundamental uncertainty that is inherent to them. The result is a retrieval pipeline that is systematically biased toward the small minority of popular head items with well-estimated embeddings, at the expense of the long-tail majority of niche, diverse, and serendipitous content. We propose DINOSAUR (Distributional Approximate Nearest Neighbour Search for Uncertainty-Aware Retrieval): a simple and infrastructure-compatible framework to incorporate embedding uncertainty into candidate generation. Rather than indexing point estimates, DINOSAUR samples $S_i$ embeddings per item and constructs an index on this augmented set. Analogously, at query time, a user embedding is sampled. This two-sided stochastic retrieval process implicitly marginalises over embedding uncertainty, without requiring changes to model architecture or ANN index infrastructure. On the analytical side, we show that DINOSAUR recovers standard point-estimate retrieval as uncertainty vanishes, and we characterise how increased embedding variance expands the regions of latent space in which uncertain items are retrievable. Reproducible empirical observations align with these expectations, showing large coverage gains with small losses in offline recall.
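The two-sided sampling procedure can be sketched in a few lines. Several things here are assumptions for illustration: embeddings are modelled as isotropic Gaussians, similarity is the inner product, every item gets the same number of samples, and an exhaustive scan stands in for a real ANN index. The function names are hypothetical.

```python
import random

def sample_embedding(mean, std, rng):
    """Draw one embedding from N(mean, std^2 I); std = 0 returns the mean."""
    return [rng.gauss(m, std) for m in mean]

def build_index(items, samples_per_item, rng):
    """Index several sampled embeddings per item instead of one point estimate."""
    index = []
    for item_id, (mean, std) in items.items():
        for _ in range(samples_per_item):
            index.append((item_id, sample_embedding(mean, std, rng)))
    return index

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(index, user_mean, user_std, k, rng):
    """Sample a user embedding, rank all indexed samples by inner product,
    and return the top-k distinct item ids (brute force in place of ANN)."""
    query = sample_embedding(user_mean, user_std, rng)
    ranked = sorted(index, key=lambda entry: dot(query, entry[1]), reverse=True)
    seen, results = set(), []
    for item_id, _ in ranked:
        if item_id not in seen:
            seen.add(item_id)
            results.append(item_id)
        if len(results) == k:
            break
    return results

rng = random.Random(0)

# With zero variance everywhere, the procedure reduces to point-estimate retrieval.
certain_items = {"head": ([1.0, 0.0], 0.0), "tail": ([0.6, 0.6], 0.0)}
certain_index = build_index(certain_items, samples_per_item=3, rng=rng)
point_estimate_top1 = retrieve(certain_index, [1.0, 0.0], 0.0, k=1, rng=rng)

# Giving the tail item a wide distribution lets some of its samples land
# closer to the query, so it can surface in the top-1 on some draws.
uncertain_items = {"head": ([1.0, 0.0], 0.0), "tail": ([0.6, 0.6], 0.5)}
uncertain_index = build_index(uncertain_items, samples_per_item=3, rng=rng)
stochastic_top1 = retrieve(uncertain_index, [1.0, 0.0], 0.1, k=1, rng=rng)
```

The index is still a flat collection of vectors tagged with item ids, which is why the approach needs no change to the serving infrastructure; only the number of indexed vectors grows.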
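Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection
Yongzi Yu, Ao Li, Le Wang, Ziyue Li, Fugee Tsung, Yuxuan Liang, Man Li
AI summary: Proposes DMAIC-IAD, a DMAIC-inspired multi-agent system that first distills standardized operating procedures (SOPs) and then generates strategies, and that introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials, improving average detection performance across four modalities by 37.76%.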
Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.
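Learning Admissible Heuristics via Cost Partitioning
Hugo Barral, Quentin Cappart, Marie-José Huguet, Sylvie Thiébaux
AI summary: Proposes a framework that exploits the Lagrangian dual equivalence between cost partitioning and multiplier prediction to learn admissible cost partitions with graph encodings and a self-attention network, yielding what the authors describe as the first machine-learned heuristic guaranteed to be admissible.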
Admissible heuristics are essential for optimal planning, yet learning them remains challenging due to the risk of overestimation. Cost partitioning combines multiple abstraction heuristics while preserving admissibility, but computing optimal partitions online is expensive. We propose a framework that learns to infer admissible cost partitions by leveraging the Lagrangian dual equivalence between cost partitioning and multiplier prediction. Planning states and patterns are encoded as labelled graphs, and an action-centric variant of the Weisfeiler-Leman algorithm extracts structural feature vectors. A deep architecture with axial self-attention and a softmax output layer maps these features to cost weights that satisfy the partition constraints by construction, ensuring admissibility. Experiments demonstrate reduced node expansions compared to suboptimal partitioning baselines while maintaining strict admissibility. To our knowledge, this is the first machine-learned heuristic guaranteed to be admissible.
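Why a softmax output satisfies the partition constraints by construction can be shown in a few lines. This is only a sketch of that one idea, not the paper's architecture: the logits are assumed to come from some learned model, and the function names are hypothetical. For each action, the softmax weights are non-negative and sum to 1, so the per-heuristic cost shares never exceed the original action cost.

```python
import math

def softmax(logits):
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def partition_costs(action_costs, logits_per_action):
    """Split each action's cost across heuristics using softmax weights.

    Returns a dict mapping each action to its list of cost shares,
    one share per abstraction heuristic.
    """
    return {
        action: [w * cost for w in softmax(logits_per_action[action])]
        for action, cost in action_costs.items()
    }

def is_valid_partition(action_costs, partition, tol=1e-9):
    """Check the cost-partitioning constraints: shares >= 0 and sum <= original cost."""
    return all(
        all(share >= 0 for share in shares)
        and sum(shares) <= action_costs[action] + tol
        for action, shares in partition.items()
    )
```

Because the constraint holds for any logits, an untrained or badly trained network can only produce a weak partition, never an inadmissible one; the sum of admissible heuristics evaluated under such shares remains admissible.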
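A Systematic Evaluation of Positional Bias in Multi-Video Summarization with Multimodal Large Language Models
Huangchen Xu, Yuan Wu, Yi Chang
AI summary: Systematically evaluates the positional bias of multimodal large language models in multi-video summarization, using a new benchmark and three complementary metrics to show that the bias is domain- and model-dependent, and analyzes prompt-level mitigation methods.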
Multimodal Large Language Models (MLLMs) are increasingly used for video understanding, yet their reliability under multi-video inputs remains poorly understood. We study positional bias in multi-video summarization, where the quality of a per-video summary can change with the video's input slot even when the underlying content is unchanged. We construct a benchmark from ActivityNet and News videos, covering Cooking, Domestic, Leisure, and News settings with two- and four-video inputs. We evaluate nine open-source and proprietary MLLMs and measure position effects with three complementary metrics: Coverage, Directional Positional Bias (DPB), and Middle-Edge Gap (MEG). Our results show that positional effects are domain- and model-dependent: signed directional bias can be small even when middle positions underperform, and increasing visual or generation budget does not uniformly remove the imbalance. We further analyze prompt-level mitigation methods. Together, the results show that multi-video summarization remains sensitive to input protocol and position, motivating more robust order-invariant multimodal systems.
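Ekka: Automated Diagnosis of Silent Errors in LLM Inference
Yile Gu, Zhen Zhang, Shaowei Zhu, Xinwei Fu, Jun Wu, Yida Wang, Baris Kasikci
AI summary: Proposes Ekka, a system that automatically diagnoses silent errors in LLM inference frameworks by aligning and comparing intermediate execution states via differential debugging, reaching 80% pass@1 and 88% pass@5 diagnosis accuracy on a benchmark of real-world errors.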
LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process can introduce silent errors where output quality silently degrades without any explicit error signals. Diagnosing silent errors is notoriously difficult due to the substantial semantic gap between the high-level symptoms and the low-level root causes. We observe that diagnosis of silent errors can be effectively framed as a differential debugging problem by leveraging the existence of semantically correct reference implementations. We propose Ekka, an automated diagnosis system that identifies root causes by systematically aligning and comparing intermediate execution states between a target and a reference framework. We constructed a benchmark of real-world silent errors from popular serving frameworks, where Ekka shows 80% pass@1 diagnosis accuracy and 88% pass@5 diagnosis accuracy, outperforming state-of-the-art systems. Ekka also diagnoses 4 new silent errors from serving frameworks, all of which have been confirmed by the developers.
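The differential-debugging framing can be sketched as a search for the first intermediate state at which the target departs from the reference. The sketch assumes the hard part, aligning states across two frameworks, has already been done and that each state is a named list of floats; the function name and tolerance are hypothetical and not taken from Ekka.

```python
def first_divergence(reference_states, target_states, rel_tol=1e-3):
    """Return the name of the first aligned stage whose outputs differ, or None.

    Each argument is an ordered list of (stage_name, values) pairs that are
    assumed to be already aligned one-to-one between the two frameworks.
    """
    for (ref_name, ref), (tgt_name, tgt) in zip(reference_states, target_states):
        if ref_name != tgt_name:
            raise ValueError(f"states are not aligned: {ref_name!r} vs {tgt_name!r}")
        if len(ref) != len(tgt):
            return ref_name
        # Scale the error by the reference magnitude so the tolerance is relative.
        scale = max(max((abs(x) for x in ref), default=0.0), 1e-12)
        error = max((abs(a - b) for a, b in zip(ref, tgt)), default=0.0) / scale
        if error > rel_tol:
            return ref_name
    return None
```

Stages after the first divergent one usually differ as well, so reporting only the earliest mismatch points closer to the root cause than to its downstream symptoms.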
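4D Reconstruction from Sparse Dynamic Cameras
Kazuki Ozeki, Shun Kenney, Yuto Shibata, Eisuke Takeuchi, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Yuki Mitsufuji, Yoshimitsu Aoki
AI summary: For 4D reconstruction in a sparse dynamic camera setup, proposes a 3D track initialization method that ensures spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking, adds a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy, and shows improved reconstruction quality in dynamic regions on the newly collected LetCamsGo dataset.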
Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on a practical alternative: a sparse dynamic camera setup, in which a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense fixed-camera methods are insufficient, since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrates that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.
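Synthetic Personas: How Do LLMs Imitate Individual Respondents Using Socio-Economic Microdata?
Leonard Kinzinger, Jochen Hartmann
AI summary: Builds individual-level digital twins from German Socio-Economic Panel data and evaluates how construction choices (model, information depth, embedding method, reasoning mode) affect accuracy over more than 2 million twin responses, finding a cost-efficient Pareto point at the 75% entropy quartile and a best-cell accuracy of 78.8%.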
LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
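Ranking survey items by normalized Shannon entropy can be sketched as follows. Two details are assumptions, since the abstract does not specify them: entropy is normalized by the logarithm of the number of observed answer categories, and items are ranked from most to least informative. The function names are hypothetical.

```python
import math
from collections import Counter

def normalized_entropy(responses):
    """Shannon entropy of the answer distribution, scaled to [0, 1].

    Normalization divides by log(k), where k is the number of distinct
    observed answers (an assumption; other normalizers are possible).
    """
    counts = Counter(responses)
    k = len(counts)
    if k <= 1:
        return 0.0  # a constant answer carries no information
    n = len(responses)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)

def rank_by_entropy(items):
    """Return item names sorted from highest to lowest normalized entropy."""
    return sorted(items, key=lambda name: normalized_entropy(items[name]), reverse=True)
```

A cumulative information depth, such as the 75 percent level mentioned above, could then be formed by taking a prefix of this ranking; how the paper defines its five depths exactly is not stated in the abstract.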
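Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou
AI summary: Introduces the Fine-grained Fragment Retrieval task and addresses it with F2RVLM, a generative retrieval model trained with reinforcement learning, and FFRS, a two-stage system, to precisely locate multi-utterance, multi-image fragments in multi-modal long-form dialogues.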
With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
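VCIFBench: Evaluating Complex Instruction Following in Video Understanding
Huangchen Xu, Yuan Wu, Yi Chang
AI summary: Proposes the VCIFBench benchmark, which uses a hybrid verification pipeline to evaluate how well multimodal large models follow complex instructions with content, format, style, and structure constraints in video understanding; experiments show that joint constraint satisfaction remains challenging and that DPO training can improve performance.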
Multimodal large language models have made rapid progress in video understanding, yet existing benchmarks largely rely on simple prompts and provide limited evidence about whether models can satisfy explicit output constraints. We introduce VCIFBench, a benchmark for evaluating complex instruction following in video understanding. VCIFBench constructs constraint-rich instructions from both benchmark-adapted and directly video-grounded prompts, covering content, format, style, and structure requirements, and evaluates model outputs with a hybrid verification pipeline. The benchmark contains 306 satisfiable test instructions, a 540-pair DPO preference dataset, and a 30-item conflict diagnostic subset. Experiments on 10 MLLMs show that joint constraint satisfaction remains challenging. We further show that DPO training on VCIFBench data can improve instruction-following performance.
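SHB-AE: A Spherical Harmonic Beamforming Based Ambisonics Encoding and Up-scaling Method for Smartphone Microphone Arrays
Yuhuan You, Yufan Qian, Tianshu Qu, Bin Wang, Xueyang Lv
AI summary: For smartphone microphone arrays, proposes SHB-AE, a spherical harmonic beamforming based Ambisonics encoding and up-scaling method that designs beamformers for each order of spherical harmonic functions and achieves fourth-order Ambisonics encoding and up-scaling with only four irregularly arranged microphones.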
With the rapid development of virtual reality (VR) and augmented reality (AR), spatial audio recording and reproduction have gained increasing research interest. Higher Order Ambisonics (HOA) stands out for its adaptability to various playback devices and its ability to integrate head orientation. However, current HOA recordings often rely on bulky spherical microphone arrays (SMA), and portable devices like smartphones are limited by array configuration and number of microphones. We propose SHB-AE, a spherical harmonic beamforming based method for Ambisonics encoding using a smartphone microphone array (SPMA). By designing beamformers for each order of spherical harmonic functions based on the array manifold, the method enables Ambisonics encoding and up-scaling. Validation on a real SPMA and its simulated free-field counterpart in noisy and reverberant conditions showed that the method successfully encodes and up-scales Ambisonics up to the fourth order with just four irregularly arranged microphones.
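HalfNet: Stochastic Neural Networks with Learned Subspace Geometry
Ethem Alpaydin
AI summary: Proposes HalfNet, which randomly samples weights from a Gaussian with a learnable low-rank covariance matrix, matching the performance of fully connected networks with fewer parameters and highlighting the key role of weight-space geometry in predictive power.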
Many researchers have investigated neural networks with some of their weights fixed to values randomly drawn from a given distribution, e.g., $N(0, I)$. Our proposed HalfNet draws random weights from $N(0, \Sigma)$, where $\Sigma$, which defines the geometry of the distribution, has a low-rank factorization that we learn from data. Experiments on MNIST and CIFAR-10 demonstrate that HalfNet can match the performance of fully trained multilayer perceptrons while using substantially fewer parameters. Spectral analysis indicates that much of the predictive power of neural networks lies in the geometry of their weight space rather than in the precise values of individual parameters, and we observe that accuracy scales smoothly with rank. HalfNet is not a neural architecture trick for low-rank structure; it implements a data-dependent random embedding that can also be interpreted through supervised metric learning, or random-feature and kernel perspectives.
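Drawing a weight vector from a Gaussian with low-rank covariance can be sketched with the usual reparameterization: if $\Sigma = L L^\top$ for a $d \times r$ factor $L$ and $z \sim N(0, I_r)$, then $w = L z$ has covariance $\Sigma$. Whether HalfNet samples and trains exactly this way is an assumption; the abstract only states that $\Sigma$ has a learned low-rank factorization. The function names are hypothetical.

```python
import random

def sample_weight_vector(factor, rng):
    """Sample w = L z with z ~ N(0, I_r), so that Cov(w) = L L^T.

    `factor` is the d x r matrix L given as a list of d rows of length r.
    Only L would be learned; z stays random.
    """
    rank = len(factor[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(rank)]
    return [sum(l_ij * z_j for l_ij, z_j in zip(row, z)) for row in factor]

def covariance_from_factor(factor):
    """Return the implied d x d covariance matrix L L^T."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in factor]
            for row_i in factor]
```

The factor holds $d \cdot r$ numbers instead of the $d(d+1)/2$ free entries of a full covariance matrix, which illustrates how a small rank $r$ keeps the learned part small.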
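Reconstructing Unobservable Temperature Fields via Simulation-Assisted Smart Sensing
Monika Stipsitz, Hèlios Sanchis-Alepuz, Jacob Reynvaan, Silvester Sabathiel
AI summary: Proposes generating datasets from randomized physics-based simulations to train a neural network that reconstructs internal temperature fields from sparse sensors, enabling real-time online monitoring.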
Real-time monitoring of the temperature distribution within components and sub-structures is a challenging topic in many systems due to restrictions on feasible sensor locations. While machine learning (ML) proves a versatile tool in many applications, its adoption for high-resolution thermal monitoring is hindered by the limited availability of high-quality datasets for training. In this work, we propose a novel approach for generating datasets for industrial applications based on randomized physics-based simulations. We demonstrate the approach in a proof-of-concept hardware setup: a neural network (NN) trained only on such a synthetic dataset is used to reconstruct the internal temperature field from sparse sensors embedded in the hardware. The NN-based reconstructions not only outperform Kriging in robustness but also enable real-time inference, making the method suitable for online monitoring of otherwise unobservable thermal states.
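SCI-PRM: A Tool-Aware Process Reward Model for Scientific Reasoning Verification
Xiangyu Zhao, Hengyuan Zhao, Yiheng Wang, Wanghan Xu, Yuhao Zhou, Qinglong Cao, Zhiwang Zhou, Lei Bai, Wenlong Zhang, Xiao-Ming Wu
AI summary: To address tool use and factual consistency in scientific reasoning, proposes Sci-PRM, built by constructing the SCIPRM70K dataset of Chain-of-Tool trajectories and training a process reward model that provides fine-grained supervision for test-time scaling and reinforcement learning, improving foundation-model performance.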
While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains such as biology, chemistry, and physics remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in a single inference pass. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
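Best-of-N selection with a process reward model can be sketched as scoring every step of each candidate trajectory and keeping the trajectory with the best aggregate. The `step_scorer` callable is a hypothetical stand-in for a model such as Sci-PRM, and aggregating by the minimum step score is an assumption (a common choice, since one bad step can invalidate a solution); the abstract does not say which aggregation is used.

```python
def best_of_n(candidates, step_scorer, aggregate=min):
    """Return the candidate trajectory with the highest aggregated step score.

    candidates: non-empty list of trajectories, each a non-empty list of steps.
    step_scorer: callable mapping a step to a float reward (hypothetical PRM).
    aggregate: how step rewards combine into one trajectory score.
    """
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(
        candidates,
        key=lambda trajectory: aggregate([step_scorer(step) for step in trajectory]),
    )
```

Passing a different `aggregate`, for example a mean, changes how strongly a single weak step penalizes an otherwise good trajectory.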
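ReSGA: A Large Tail Risk Model for Learning Value-at-Risk and Expected Shortfall
Yichi Zhang, Ke Zhu, Zhoufan Zhu
AI summary: Proposes the retrieval-enhanced self-grouping autoencoder (ReSGA), which uses millions of parameters to capture cross-sectional dependence and long-term temporal dynamics of assets, outperforms twelve benchmark models on US equity data from 1926 to 2023, and delivers economic gains through a new size-enhanced left-side momentum strategy.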
Learning Value-at-Risk (VaR) and Expected Shortfall (ES) is important for managing financial risks effectively. Existing approaches with limited parameters are vulnerable to model misspecification in the era of big data. To address this limitation, we propose a large tail risk model, the retrieval-enhanced self-grouping autoencoder (ReSGA), which is designed with millions of parameters to exploit the rich cross-sectional dependence and long-term temporal dynamics of assets using their characteristics. Applied to monthly US equity returns from 1926 to 2023 with 153 firm characteristics, ReSGA outperforms twelve econometric and machine learning competitors in terms of out-of-sample loss and statistical backtesting. In addition, its forecast advantages can translate into significant economic gains from long-short decile portfolios that are constructed by a new size-enhanced left-side momentum strategy. To clarify the role of complexity, we further conduct a systematic scaling analysis and demonstrate that improvements in joint VaR-ES forecasting are primarily driven by data complexity rather than model complexity. Finally, our analyses of group-importance and transfer-learning exhibit the interpretability and cross-market generalizability of ReSGA.
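For readers unfamiliar with the two risk measures, a minimal empirical sketch follows. It is not ReSGA, which learns conditional forecasts; it only computes historical estimates from a sample of returns. The sign convention (both measures reported as returns, so they are typically negative) and the simple tail-count rule are assumptions.

```python
import math

def empirical_var_es(returns, alpha=0.05):
    """Historical Value-at-Risk and Expected Shortfall at level alpha.

    VaR is the largest return within the worst alpha fraction of outcomes;
    ES is the mean return within that same tail. Both are expressed as
    returns, not as positive losses (a convention chosen for this sketch).
    """
    ordered = sorted(returns)
    n_tail = max(1, math.floor(alpha * len(ordered)))
    tail = ordered[:n_tail]
    value_at_risk = tail[-1]
    expected_shortfall = sum(tail) / len(tail)
    return value_at_risk, expected_shortfall
```

Because ES averages over everything at or below the VaR threshold, it is never larger than VaR under this convention, which is why the two are usually forecast jointly.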
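Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets Based on Deep Reinforcement Learning
Damian Lebiedź, Robert Ślepaczuk
AI summary: Proposes a hierarchical "Filter-then-Rank" pair selection method and a "Fixed Risk, Adaptive Mean" execution model combined with a deep reinforcement learning execution overlay, achieving statistical arbitrage performance in cryptocurrency markets that outperforms heuristic baselines.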
This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical "Filter-then-Rank" pair selection methodology and a proprietary "Fixed Risk, Adaptive Mean" execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies. Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.