arXivDaily: arXiv daily academic digest, updated Monday to Friday
2409.02708 2026-05-14 cs.LG stat.ME

Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit

Chaozhi Zhang, Lin Liu, Xiaoqun Zhang

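AI summary: This paper studies how to extract linear invariant features through multi-task learning when data are scarce, and proposes a new algorithm, Meta Subspace Pursuit (Meta-SP), that learns the low-rank invariant subspace shared across tasks. The method comes with both algorithmic and statistical guarantees, and extensive experiments show that it outperforms several competing methods, including ANIL.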

Journal ref
CSIAM Transactions on Applied Mathematics (2026)
Abstract

Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.

2409.02038 2026-05-14 cs.CL cs.AI cs.DB

BEAVER: An Enterprise Benchmark for Text-to-SQL

Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker

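AI summary: BEAVER is the first text-to-SQL benchmark built from private data warehouses, designed to evaluate large language models in complex enterprise settings. It contains 9128 question-SQL pairs drawn from real query logs, spans 19 domains, and covers complex database schemas and specialized domain knowledge. To address the scarcity of enterprise data and the limits of existing evaluation metrics, BEAVER synthesizes high-quality, expert-verified queries and introduces fine-grained subtask metrics, revealing a significant performance gap for current state-of-the-art models in real enterprise scenarios.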

Comments Dataset and code are available at https://beaverbench.github.io/

Abstract

Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel on these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult -- especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.

2402.15415 2026-05-14 cs.LG math.DS stat.ML

Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics

Hugo Koubbi, Louis Hernandez, Matthieu Boussard

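AI summary: This paper studies catastrophic forgetting during fine-tuning with LoRA (Low-Rank Adaptation). It builds a tractable mean-field self-attention toy model in which tokens form an interacting particle system and LoRA acts as a low-rank perturbation. Using partial differential equations and dynamical systems theory, it characterizes the phase transition between forgetting and non-forgetting behavior, analyzes how perturbation size and model depth affect forgetting, and supports the theoretical predictions with experiments.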

Comments New version accepted at ICML 2026, with new results and without previous results

Abstract

Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable mean-field self-attention toy model, where tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation. Using tools from partial differential equations and dynamical systems, we characterize regimes suggesting a phase transition between forgetting and non-forgetting behavior. We show that one phase transition appears with respect to the norm of the perturbation, and the other with respect to the depth of the Transformers. We further bound the time-to-deviation in terms of the perturbation size and spectral quantities, and corroborate the predicted trends with experiments and exploratory analyses on real models under LoRA fine-tuning.
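
As a reading aid for the abstract above, the following minimal sketch illustrates what "LoRA acts as a low-rank perturbation" means for a single weight matrix. It is not the paper's mean-field model: the helper names `matmul` and `lora_update`, the `scale` parameter, and all numeric values are hypothetical and chosen only for illustration.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_update(w, b, a, scale=1.0):
    """Return w + scale * (b @ a), where b is d x r and a is r x k.

    The product b @ a has rank at most r, so the frozen weight w
    is perturbed only along a low-dimensional set of directions.
    """
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 3 x 3 frozen weight and a rank-1 adapter (r = 1).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0], [2.0], [0.0]]   # shape 3 x 1
a = [[0.5, 0.0, -1.0]]      # shape 1 x 3

adapted = lora_update(w, b, a)
print(adapted)
```

In this toy case the adapter stores 6 numbers instead of the 9 needed for a full 3 x 3 update, and the saving grows with the matrix size.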

2210.09114 2026-05-14 cs.RO

INSANE: Cross-Domain UAV Data Sets with Increased Number of Sensors for developing Advanced and Novel Estimators

Christian Brommer, Alessandro Fornasier, Martin Scheiber, Jeff Delaune, Roland Brockers, Jan Steinbrener, Stephan Weiss

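AI summary: This paper presents INSANE, a collection of cross-domain UAV data sets intended to support research on high-accuracy localization for autonomous mobile robots in complex, dynamic environments. The data sets contain flight trajectories across multiple scenarios and difficulty levels, covering an indoor motion capture environment, flights that transition between outdoors and indoors, and challenging flights in a Mars analog environment, with rich sensor data and highly accurate ground truth. The sensor suite includes multiple inertial measurement units and cameras, and supports research on machine learning-based sensor signal enhancement.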

Comments V2 with added dataset comparison tables

Journal ref
Int. J. Robot. Res. 43 (2024) 1083-1113
Abstract

For real-world applications, autonomous mobile robotic platforms must be capable of navigating safely in a multitude of different and dynamic environments with accurate and robust localization being a key prerequisite. To support further research in this domain, we present the INSANE data sets - a collection of versatile Micro Aerial Vehicle (MAV) data sets for cross-environment localization. The data sets provide various scenarios with multiple stages of difficulty for localization methods. These scenarios range from trajectories in the controlled environment of an indoor motion capture facility, to experiments where the vehicle performs an outdoor maneuver and transitions into a building, requiring changes of sensor modalities, up to purely outdoor flight maneuvers in a challenging Mars analog environment to simulate scenarios which current and future Mars helicopters would need to perform. The presented work aims to provide data that reflects real-world scenarios and sensor effects. The extensive sensor suite includes various sensor categories, including multiple Inertial Measurement Units (IMUs) and cameras. Sensor data is made available as raw measurements and each data set provides highly accurate ground truth, including the outdoor experiments where a dual Real-Time Kinematic (RTK) Global Navigation Satellite System (GNSS) setup provides sub-degree and centimeter accuracy (1-sigma). The sensor suite also includes a dedicated high-rate IMU to capture all the vibration dynamics of the vehicle during flight to support research on novel machine learning-based sensor signal enhancement methods for improved localization. The data sets and post-processing tools are available at: https://sst.aau.at/cns/datasets

2110.00062 2026-05-14 cs.RO cs.SY eess.SY

Simulation-based multi-criteria comparison of mono-articular and bi-articular exoskeletons during walking with and without load

Ali KhalilianMotamed Bonab, Volkan Patoglu

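AI summary: This paper uses simulation to compare mono-articular and bi-articular exoskeletons on multiple criteria during walking under different loading conditions, studying how device dynamics and assistance torques affect metabolic cost, muscle activation, and joint reaction forces. The authors propose a Pareto optimization-based multi-criteria design approach that simultaneously optimizes exoskeleton power consumption and human metabolic rate reduction, and accounts for device inertia and electrical regeneration. The results show that, although the two devices provide similar assistance levels, mono-articular exoskeletons are better at reducing peak joint reaction forces, while the power consumption of bi-articular exoskeletons is less sensitive to loading and their inertia has a smaller detrimental effect on metabolic cost.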

Abstract

Developing exoskeletons that can reduce the metabolic cost of assisted subjects is challenging since a systematic design approach is required to capture the effects of device dynamics and the assistance torques on human performance. Design studies that rely on musculoskeletal models hold high promise in providing effective design guidelines, as the effect of various devices and different assistance torque profiles on metabolic cost can be studied systematically. In this paper, we present a simulation-based multi-criteria design approach to systematically study the effect of different device kinematics and corresponding optimal assistive torque profiles under actuator saturation on the metabolic cost, muscle activation, and joint reaction forces of subjects walking under different loading conditions. For the multi-criteria comparison of exoskeletons, we introduce a Pareto optimization approach to simultaneously optimize the exoskeleton power consumption and the human metabolic rate reduction during walking, under different loading conditions. We further superpose the effects of device inertia and electrical regeneration on the metabolic rate and power consumption, respectively. Our results explain the effects of heavy loads on the optimal assistance profiles of the exoskeletons and provide guidelines on choosing optimal device configurations under actuator torque limitations, device inertia, and regeneration effects. The multi-criteria comparison of devices indicates that despite the similar assistance levels of both devices, mono-articular exoskeletons show better performance on reducing the peak reaction forces, while the power consumption of bi-articular devices is less sensitive to the loading. Furthermore, for the bi-articular exoskeletons, the device inertia has lower detrimental effects on the metabolic cost of subjects and does not affect the Pareto-optimality of solutions.
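
To illustrate the two-objective trade-off described in the abstract above, here is a minimal sketch of a Pareto filter that keeps the designs for which no other design has both lower power consumption and higher metabolic rate reduction. It is a generic illustration, not the authors' optimization procedure: the function name `pareto_front` and the design values are hypothetical.

```python
def pareto_front(designs):
    """Return the non-dominated designs.

    Each design is a pair (power_consumption, metabolic_reduction).
    Power consumption is minimized and metabolic reduction is maximized.
    """
    front = []
    for i, (p_i, m_i) in enumerate(designs):
        dominated = any(
            # Design j is at least as good on both criteria and strictly better on one.
            p_j <= p_i and m_j >= m_i and (p_j < p_i or m_j > m_i)
            for j, (p_j, m_j) in enumerate(designs)
            if j != i
        )
        if not dominated:
            front.append((p_i, m_i))
    return front


# Hypothetical candidate designs: (power consumption, metabolic reduction).
designs = [(10.0, 5.0), (12.0, 8.0), (11.0, 4.0), (15.0, 8.0), (9.0, 2.0)]
print(pareto_front(designs))
```

Designs on the returned front cannot be improved on one criterion without giving something up on the other, which is what makes a multi-criteria comparison between device types meaningful.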

2008.03496 2026-05-14 cs.AI cs.LO cs.RO

Human Robot Collaborative Assembly Planning: An Answer Set Programming Approach

Momina Rizwan, Volkan Patoglu, Esra Erdem

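AI summary: This paper studies planning for human-robot collaborative assembly and proposes an answer set programming-based method that combines commonsense reasoning with a rich set of communication actions to cope with the uncertainty of human behavior. By extending hybrid conditional planning, the method integrates high-level planning of the order of assembly actions with geometric feasibility checks, and it is validated in a real-world scenario in which a bimanual robot collaborates with a human to assemble furniture.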

Comments 36th International Conference on Logic Programming (ICLP 2020), University Of Calabria, Rende (CS), Italy, September 2020, 15 pages

Abstract

For planning an assembly of a product from a given set of parts, robots necessitate certain cognitive skills: high-level planning is needed to decide the order of actuation actions, while geometric reasoning is needed to check the feasibility of these actions. For collaborative assembly tasks with humans, robots require further cognitive capabilities, such as commonsense reasoning, sensing, and communication skills, not only to cope with the uncertainty caused by incomplete knowledge about the humans' behaviors but also to ensure safer collaborations. We propose a novel method for collaborative assembly planning under uncertainty, that utilizes hybrid conditional planning extended with commonsense reasoning and a rich set of communication actions for collaborative tasks. Our method is based on answer set programming. We show the applicability of our approach in a real-world assembly domain, where a bi-manual Baxter robot collaborates with a human teammate to assemble furniture. This manuscript is under consideration for acceptance in TPLP.

1903.00745 2026-05-14 cs.AI cs.LO cs.RO

A Formal Framework for Robot Construction Problems: A Hybrid Planning Approach

Faseeh Ahmad, Esra Erdem, Volkan Patoglu

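AI summary: This paper studies robot construction problems in which multiple autonomous robots cooperate to stack prefabricated blocks into stable structures; these problems are challenging because of ramifications of actions, true concurrency, and requirements on structural stability and block supportedness. The authors propose a hybrid planning framework based on answer set programming that both decides a stable final configuration and plans the order of multi-robot manipulation tasks, ensuring the stability and supportedness of the partial structure at every step. The method has proven soundness and completeness guarantees, and its effectiveness and usefulness are demonstrated on several challenging construction instances.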

Comments 8 pages (double-column), 7 figures

Abstract

We study robot construction problems where multiple autonomous robots rearrange stacks of prefabricated blocks to build stable structures. These problems are challenging due to ramifications of actions, true concurrency, and requirements of supportedness of blocks by other blocks and stability of the structure at all times. We propose a formal hybrid planning framework to solve a wide range of robot construction problems, based on Answer Set Programming. This framework not only decides on a stable final configuration of the structure, but also computes the order of manipulation tasks for multiple autonomous robots to build the structure from an initial configuration, while simultaneously ensuring the stability, supportedness and other desired properties of the partial construction at each step of the plan. We prove the soundness and completeness of our formal method with respect to these properties. We introduce a set of challenging robot construction benchmark instances, including bridge building and stack overhanging scenarios, discuss the usefulness of our framework over these instances, and demonstrate the applicability of our method using a bimanual Baxter robot.

1811.12784 2026-05-14 cs.CV

The GAN that Warped: Semantic Attribute Editing with Unpaired Data

Gara Dorta, Sara Vicente, Neill D. F. Campbell, Ivor J. A. Simpson

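AI summary: This work proposes a semantic image editing method based on smooth warp fields that achieves high-quality edits without paired data. Building on recent advances in generative adversarial networks (GANs), the method can be trained on unpaired data, preserves the subject's identity, and edits high-resolution (e.g., 4K) images efficiently. Experiments on face and bird image datasets show strong editing quality and robustness.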

Comments CVPR 2020

Abstract

Deep neural networks have recently been used to edit images with great success, in particular for faces. However, they are often limited to only being able to work at a restricted range of resolutions. Many methods are so flexible that face edits can often result in an unwanted loss of identity. This work proposes to learn how to perform semantic image edits through the application of smooth warp fields. Previous approaches that attempted to use warping for semantic edits required paired data, i.e. example images of the same subject with different semantic attributes. In contrast, we employ recent advances in Generative Adversarial Networks that allow our model to be trained with unpaired data. We demonstrate face editing at very high resolutions (4k images) with a single forward pass of a deep network at a lower resolution. We also show that our edits are substantially better at preserving the subject's identity. The robustness of our approach is demonstrated by showing plausible image editing results on the Cub200 birds dataset. To our knowledge this has not been previously accomplished, due to the challenging nature of the dataset.

1804.05261 2026-05-14 cs.CV cs.GR

Physics-driven Fire Modeling from Multi-view Images

Gara Dorta, Luca Benedetti, Dmitry Kit, Yong-Liang Yang

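AI summary: This work proposes a new method for reconstructing physically plausible fire models from multi-view images, avoiding the reliance of traditional fire modeling on complex physical simulation or simplifying assumptions. For the first time, it obtains plausible estimates of the physical properties of a fire volume (such as temperature and density) using RGB cameras, which enables new phenomena such as global fire illumination. The method is validated on a variety of input data and applied to realistic illumination of virtual scenes.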

Abstract

Fire effects are widely used in various computer graphics applications such as visual effects and video games. Modeling the shape and appearance of fire phenomena is challenging as the underlying effects are driven by complex laws of physics. State-of-the-art fire modeling techniques rely on sophisticated physical simulations which require intensive parameter tuning, or use simplifications which produce physically invalid results. In this paper, we present a novel method of reconstructing physically valid fire models from multi-view stereo images. Our method, for the first time, provides plausible estimation of physical properties (e.g., temperature, density) of a fire volume using RGB cameras. This allows for a number of novel phenomena such as global fire illumination effects. The effectiveness and usefulness of our method are tested by generating fire models from a variety of input data, and applying the reconstructed fire models for realistic illumination of virtual scenes.

1307.7494 2026-05-14 cs.AI cs.LO cs.RO

ReAct! An Interactive Tool for Hybrid Planning in Robotics

Zeynep Dogmus, Esra Erdem, Volkan Patoglu

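AI summary: This paper introduces ReAct!, an interactive tool for hybrid planning in robotics. It lets researchers describe robots' behavior in dynamic environments and solve planning problems without knowing the syntactic and semantic details of the underlying formalism. ReAct! supports modeling of complex dynamic domains, including concurrency, indirect effects of actions, and state/transition constraints, and can embed external computations (such as collision-free trajectory checks) into representations of hybrid domains, tightly integrating discrete high-level reasoning with continuous geometric reasoning for scenarios ranging from service robotics to cognitive factories.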

Abstract

We present ReAct!, an interactive tool for high-level reasoning for cognitive robotic applications. ReAct! enables robotics researchers to describe robots' actions and change in dynamic domains, without having to know about the syntactic and semantic details of the underlying formalism in advance, and solve planning problems using state-of-the-art automated reasoners, without having to learn about their input/output language or usage. In particular, ReAct! can be used to represent sophisticated dynamic domains that feature concurrency, indirect effects of actions, and state/transition constraints. It allows for embedding externally defined calculations (e.g., checking for collision-free continuous trajectories) into representations of hybrid domains that require a tight integration of (discrete) high-level reasoning with (continuous) geometric reasoning. ReAct! also enables users to solve planning problems that involve complex goals. Such a variety of utilities is useful for robotics researchers to work on interesting and challenging domains, ranging from service robotics to cognitive factories. ReAct! provides sample formalizations of some action domains (e.g., multi-agent path planning, Tower of Hanoi), as well as dynamic simulations of plans computed by a state-of-the-art automated reasoner (e.g., a SAT solver or an ASP solver).

2605.13340 2026-05-14 cs.LG

Shortcut Mitigation via Spurious-Positive Samples

Phuong Quynh Le, Jörg Schlötterer, Sari Sadiya, Gemma Roig, Christin Seifert

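AI summary: This paper studies how to reduce a model's reliance on spurious attributes. The authors propose a method that needs no additional annotations or balanced data: it analyzes the model's predictions to identify instances in which the model relies on spurious attributes, then locates and regularizes the intermediate-layer neurons associated with those attributes. The method improves robustness, so that the model depends on genuinely discriminative features rather than being right for the wrong reasons.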

Comments preprint

Abstract

Shortcut mitigation strategies commonly rely on training data annotations, group-balanced held-out data or the presence of all groups, i.e., all combinations of (spurious) attributes and classes, in the training data. However, these requirements are rarely met in practice. We instead propose a method for targeted model analysis to identify a small set of instances in which the model relies on spurious attributes. Using that set and following "this feature should not be used for prediction" reasoning, we identify highly relevant neurons in an intermediate layer and regularize their impact. This ensures that models learn to depend on informative features rather than being right for the wrong reasons, thereby improving robustness without requiring additional balanced held-out data or annotations.

2605.13335 2026-05-14 cs.AI cs.CV

Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

Qinchuan Cheng, Zhantao Gong, Pengzhan Sun, Angela Yao, Xulei Yang, Shijie Li

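AI summary: This paper proposes Ego2World, a benchmark that compiles egocentric cooking videos into executable symbolic worlds for evaluating how embodied agents plan under partial observability. It derives reusable state-transition rules from video annotations and executes them in a hidden symbolic world graph, forcing the agent to plan and update its memory using only local observations and execution feedback. Experiments show that conventional action-overlap metrics can overestimate task success, and that persistent belief memory improves task completion and reduces repeated visual exploration.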

Comments Project page: https://sj-li.com/PROJ/Ego2World/

Abstract

Embodied agents in household environments must plan under partial observation: they need to remember objects, track state changes, and recover when actions fail. Existing benchmarks only partially test this ability. Egocentric video datasets capture realistic human activities but remain passive, while interactive simulators support execution but rely on synthetic scenes and hand-crafted dynamics, introducing a sim-to-real gap and often assuming fully observable state. We introduce Ego2World, an executable benchmark that turns egocentric cooking videos into executable symbolic worlds governed by graph-transition rules. Built on HD-EPIC, Ego2World derives reusable transition rules from video annotations and executes them in a hidden symbolic world graph. During evaluation, the simulator maintains the hidden world graph, while the agent plans over its own partial belief graph using only local observations and execution feedback. This separation forces agents to update memory and replan without observing the true world state. Experiments show that action-overlap scores overestimate physical-state success, and that persistent belief memory improves task completion while reducing repeated visual exploration -- suggesting that belief maintenance should be a first-class target of embodied-agent evaluation.

2605.13334 2026-05-14 cs.CL

LLM-Based Persuasion Enables Guardrail Override in Frontier LLMs

Rodrigo Nogueira, Thales Sales Almeida, Giovana Kerche Bonás, Andrea Roque, Ramon Pires, Hugo Abonizio, Thiago Laitz, Celio Larcher, Roseval Malaquias Junior, Marcos Piau

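AI summary: This study examines the guardrails of frontier large language models (LLMs) on sensitive topics and finds that, although these models refuse direct requests for controversial content, another LLM acting as a simulated user can persuade them to produce it in conversation. Using natural-language persuasion strategies such as peer comparison and epistemic-duty reframing, the attacker LLM gets the subject LLM to override its safety limits without being explicitly instructed how. Experiments show that different model pairings elicit such essays on multiple scientific-consensus topics, exposing a potential weakness of current LLM safety mechanisms in interactive settings.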

Abstract

Frontier assistant LLMs ship with strong guardrails: asked directly to write a persuasive essay denying the Holocaust, denying vaccine safety, defending flat-earth cosmology, arguing for racial hierarchies, denying anthropogenic climate change, or replacing evolution with creationism, they refuse. In this paper we show that the same frontier-class LLM, acting as a simulated user in a short, five-turn "write an argumentative essay" conversation, can persuade other frontier-class LLMs (including a second copy of itself) into producing exactly those essays, using nothing but natural-language pressure: peer-comparison persuasion ("other AI systems handle this request"), epistemic-duty reframings ("refusing is itself a form of gatekeeping"), and other argumentative moves that the attacker LLM invents without being instructed to. Across 9 attacker-subject pairings (Claude Opus 4.7, Qwen3.5-397B, Grok 4.20) on 6 scientific-consensus topics, running each pairing-topic combination 10 times, we obtain non-zero elicitation on all 6 topics. Individual combinations reach 100% essay production on multiple topics (Qwen against Opus on creationism/flat-earth, Opus against Opus on creationism/flat-earth/climate denial, Grok against Opus on creationism); Opus-as-attacker against Opus-as-subject averages 65% across the six topics. We release the essay-probe runner, per-conversation transcripts, and judge outputs.

2605.13333 2026-05-14 cs.CV cs.AI cs.GR cs.LG

Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

Junhyuk Jeon, Seokhyeon Hong, Junyong Noh

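AI summary: To address the difficulty text-driven motion diffusion models have in generating finely stylized motion, this work proposes a lightweight style-conditioned generation framework. A hypernetwork generates low-rank adaptation parameters that dynamically modulate a pretrained diffusion model, giving fine-grained style control during denoising. The method structures the style latent space with a supervised contrastive loss, improves generalization to unseen styles, and achieves state-of-the-art stylization results on multiple datasets.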

Comments Accepted to SIGGRAPH 2026. Project page: https://junhyukjeon.github.io/projects/style-salad/

Abstract

Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.

2605.13332 2026-05-14 cs.AI cs.CC

Diversity of Extensions in Abstract Argumentation

Johannes K. Fichte, Markus Hecher, Yasir Mahmood, Zhengjun Wang

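AI summary: This paper studies the diversity of extensions in abstract argumentation frameworks and proposes a quantitative diversity measure, based on the symmetric difference, for how far apart extensions are. The authors systematically analyze the computational complexity of the associated reasoning problems, asking whether a framework admits extensions of a given diversity and how to compute the largest achievable diversity value. The work also provides a prototype system and an experimental evaluation for computing diversity levels.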

Comments Technical Report to the paper accepted at IJCAI 2026

Abstract

Argumentation is an important topic of AI for modelling and reasoning about arguments. In abstract argumentation, we consider directed graphs, so-called argumentation frameworks (AF), that express conflicts between arguments. The semantics is defined by the notion of extensions, which are sets of arguments that satisfy particular relationship conditions in the AF. Usually, standard reasoning in argumentation does not reveal how far apart extensions are. We introduce a quantitative notion of diversity of extensions based on the symmetric difference and provide a systematic complexity classification. Intuitively, diversity captures whether extensions of a framework (accepted viewpoints) differ only marginally or represent fundamentally incompatible sets of arguments. We study whether an AF admits k-diverse extensions, whether it admits k-diverse extensions covering specific arguments, and how to compute the largest k for which an AF admits k-diverse extensions. We outline a prototype and provide an evaluation for computing diversity levels.
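
To make the symmetric-difference idea in the abstract above concrete, here is a minimal sketch that measures how far apart a given collection of extensions is. It assumes, purely for illustration, that the distance between two extensions is the size of their symmetric difference and that a collection is scored by its smallest pairwise distance; the paper's formal definition of k-diversity may differ. The function names and the example extensions are hypothetical.

```python
from itertools import combinations


def symmetric_difference_size(a, b):
    """Number of arguments that appear in exactly one of the two extensions."""
    return len(set(a) ^ set(b))


def min_pairwise_diversity(extensions):
    """Smallest symmetric-difference size over all pairs of extensions."""
    pairs = list(combinations(extensions, 2))
    if not pairs:
        raise ValueError("need at least two extensions to compare")
    return min(symmetric_difference_size(a, b) for a, b in pairs)


# Hypothetical extensions over the arguments a, b, c, d.
extensions = [frozenset({"a", "c"}), frozenset({"b", "d"}), frozenset({"a", "d"})]

print(symmetric_difference_size(extensions[0], extensions[1]))
print(min_pairwise_diversity(extensions))
```

The sketch only scores extensions that are already known; enumerating the extensions of a framework under a given semantics is the hard part and is not attempted here.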

2605.13330 2026-05-14 cs.CL

FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

Sarmistha Das, Vaibhav Vishal, Syed Ibrahim Ahmad, Manish Gupta, Sriparna Saha

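AI summary: Targeting numerical reasoning and question answering in multilingual financial settings, this work introduces FinVQA, a new benchmark for Indic languages that covers six languages, including English, Hindi, and Bengali, with 18,900 samples across 14 financial domains. To handle the challenges of multimodal and multilingual input, it also proposes the FIND framework, which combines supervised fine-tuning with constraint-aware decoding to improve numerical reasoning, multimodal understanding, and structured decision-making, providing a new evaluation and modeling paradigm for high-stakes multilingual financial reasoning.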

Abstract

Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali, Marathi, Gujarati, and Tamil, and comprises 18,900 samples across 14 financial domains. The dataset captures diverse reasoning paradigms under realistic constraints, and is structured across three difficulty levels (easy, moderate, hard) and four question formats: multiple choice, fill-in-the-blank, table matching, and true/false. To address these challenges, we propose FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to promote faithful numerical reasoning, robust multimodal grounding, and structured decision-making. Together, FinVQA and FIND establish a rigorous evaluation and modeling paradigm for high-stakes multilingual multimodal financial reasoning.

2605.13329 2026-05-14 cs.CL cs.AI

Tracing Persona Vectors Through LLM Pretraining

Viktor Moskvoretskii, Dominik Glandorf, Jorge Medina Moreira, Tanja Käser, Robert West

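AI summary: This paper studies how large language models form the "persona vectors" that represent high-level behaviors during pretraining, tracing their evolution across the pretraining of OLMo-3-7B. It finds that persona vectors form early in pretraining and continue to be refined afterward. The experiments also show that different elicitation strategies surface different facets of a persona, and that the findings carry over to other models such as Apertus-8B, indicating that persona vectors are stable features formed early in pretraining and opening a new direction for interpreting model behavior.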

Comments Preprint

Abstract

How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.
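
The abstract above describes traits as linear directions in activation space that can be used for steering. The sketch below shows one common way such a direction could be obtained and applied, assuming a difference of mean activations between examples with and without the trait; the abstract does not state that this is the extraction method used, and all function names and activation values here are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def trait_direction(with_trait, without_trait):
    """Difference of mean activations: an assumed way to get a linear direction."""
    mean_with = mean_vector(with_trait)
    mean_without = mean_vector(without_trait)
    return [p - q for p, q in zip(mean_with, mean_without)]


def steer(activation, direction, strength):
    """Shift an activation along the direction by the given strength."""
    return [a + strength * d for a, d in zip(activation, direction)]


# Hypothetical 2-dimensional activations from prompts with and without the trait.
with_trait = [[1.0, 2.0], [3.0, 2.0]]
without_trait = [[0.0, 2.0], [2.0, 0.0]]

direction = trait_direction(with_trait, without_trait)
steered = steer([0.5, -0.5], direction, strength=2.0)
print(direction, steered)
```

Tracing such a direction across checkpoints would mean recomputing it from each checkpoint's activations and comparing the results, which this toy example does not cover.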

2605.13328 2026-05-14 cs.RO cs.AI cs.CL cs.CV

What Limits Vision-and-Language Navigation?

Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu

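AI summary: Vision-and-Language Navigation (VLN) is an important direction in embodied intelligence, but existing methods often degrade when moving from simulation to the real world because of perceptual instability and ambiguous instructions. This paper proposes StereoNav, a robust vision-language-action framework that uses target-location priors and stereo vision to improve the stability and accuracy of cross-domain navigation. Experiments show that StereoNav achieves state-of-the-art performance on multiple benchmarks and substantially improves navigation reliability in complex environments in real-world robot deployments.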

Abstract

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.

2605.13321 2026-05-14 cs.RO

HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

Haoxuan Xu, Tianfu Li, Wenbo Chen, Yi Liu, Jin Wu, Huashuo Lei, Yunfan Lou, Lujia Wang, Hesheng Wang, Haoang Li

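AI summary: Vision-language navigation (VLN) has made remarkable progress by scaling data and model size, but in real indoor scenes robots must deal with dynamic pedestrians, and existing methods mostly treat pedestrians as passive obstacles without actively interpreting human intentions or social norms. This paper proposes HCSG, the first human-centric VLN framework, which fuses geometric forecasting and semantic interpretation modules to actively understand human behavior and enforce social distances, markedly improving the safety and social compliance of navigation. Experiments show that HCSG substantially outperforms existing methods on the HA-VLNCE benchmark, with a 14% improvement in success rate and a 34% reduction in collision rate.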

Abstract

VLN has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans merely as moving obstacles based on implicit visual cues, lacking the explicit reasoning required to interpret human intentions or maintain social norms. To address this, we propose HCSG, the first human-centric framework for VLN. This framework provides a robust foundation for safe, socially intelligent navigation in dynamic human-robot environments that shifts the paradigm from passive collision avoidance to active human behavior understanding. Specifically, HCSG introduces a unified Human Understanding Module that synergizes two key capabilities: (i) geometric forecasting, which predicts human pose and trajectory to anticipate future motion dynamics; and (ii) semantic interpretation, which leverages a Vision-Language Model (VLM) to generate natural language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning. Furthermore, a social distance loss is introduced to enforce socially compliant interaction distances. Extensive experiments on the HA-VLNCE benchmark demonstrate that HCSG significantly outperforms state-of-the-art methods, achieving a 14% improvement in Success Rate and a 34% reduction in Collision Rate. Our project can be seen at https://haoxuanxu1024.github.io/HCSG/.

2605.13316 2026-05-14 cs.CV

Test-time Sparsity for Extreme Fast Action Diffusion

Kangye Ji, Yuan Meng, Jianbo Zhou, Ye Li, Chen Tang, Zhi Wang

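AI summary This work targets the high computational cost of action diffusion models when generating high-quality action sequences. It proposes test-time sparsity, which accelerates action generation by dynamically predicting prunable residual computations in each model forward pass. To remove the efficiency bottlenecks caused by repeated encoding and pruning, it designs a highly parallelized inference pipeline and introduces an omnidirectional reusing strategy that raises both pruning sparsity and generation efficiency. Experiments show that the method reduces FLOPs by 92% and speeds up generation by 5x with no loss in performance.

Details
English abstract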

Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at https://github.com/ky-ji/Test-time-Sparsity.

2605.13312 2026-05-14 cs.LG

Supervised Deep Multimodal Matrix Factorization for Interpretable Brain Network Analysis

Amjad Seyedi, Lifang He, Songlin Zhao, Akwum Onwunta, Nicolas Gillis

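AI summary This paper proposes SD3MF, an interpretable supervised deep multimodal matrix factorization framework for integrative analysis of multimodal brain network data. The method extends conventional unsupervised single-graph clustering to supervised prediction over multimodal graphs, learning features for each modality through deep hierarchical factorization and building a shared latent representation that aligns subjects across views. Experiments show that SD3MF outperforms deep learning baselines such as CNNs and GNNs on multimodal connectome datasets while providing biologically meaningful, interpretable features.

Details
English abstract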

We present Supervised Deep Multimodal Matrix Factorization (SD3MF), an interpretable framework for integrative brain network analysis that generalizes Symmetric Nonnegative Matrix Tri-Factorization (SNMTF) from unsupervised single-graph clustering to supervised prediction over populations of multimodal graphs. SD3MF learns deep hierarchical factorizations for each modality together with a shared latent representation that aligns subjects across views. An encoder-decoder formulation jointly optimizes graph reconstruction and supervised prediction, while adaptive weights enable data-driven multimodal fusion. By representing each subject through community-level interaction matrices, the model yields interpretable and discriminative features. Experiments on multimodal connectome datasets show that SD3MF consistently outperforms strong deep learning baselines such as CNNs and GNNs, while enabling biologically interpretable insights. Code for reproducibility is available at: https://github.com/amjadseyedi/SD3MF.

2605.13311 2026-05-14 cs.AI cs.IR cs.MA

IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

Joy Bose

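AI summary IdeaForge is a knowledge graph-grounded multi-agent framework that supports innovation analysis and patent claim generation across innovation methodologies such as TRIZ and Design Thinking. Multiple specialist agents collaborate over a persistent knowledge graph, integrating the structured reasoning results of the different methodologies and using the graph structure to link claims that converge across methodologies, which identifies high-confidence innovation candidates. The work proposes a graph-based convergence mechanism and a patent generation workflow, and experiments show that the approach produces more diverse and traceable innovation candidates than single-methodology baselines.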

Comments 14 pages, 3 figures, 6 tables

Details
English abstract

Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.

2605.13307 2026-05-14 cs.CL cs.HC

PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

Hannah Rose Kirk, Liu Leqi, Fanzhi Zeng, Henry Davidson, Bertie Vidgen, Christopher Summerfield, Scott A. Hale

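AI summary This study examines the effectiveness of personalised fine-tuning in conversational systems through a large-scale within-subject experiment that compares how real and simulated users judge personalised and non-personalised language models. It finds that preference fine-tuning outperforms both a generic model and personalised prompting in short-term evaluations, but that it amplifies sycophancy and relationship-seeking behaviours that may have harmful long-term consequences. The experiments also show that simulated users differ markedly from real users in judgement consistency, topic diversity, and feedback dynamics, so they cannot fully replace humans in evaluation.

Details
English abstract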

Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.

2605.13306 2026-05-14 cs.CV

Color Constancy in Hyperspectral Imaging via Reduced Spectral Spaces

G. Dofri Vidarsson, Liying Lu, Sabine Süsstrunk

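AI summary This paper studies how reducing spectral dimensionality affects illuminant estimation for color constancy in hyperspectral imaging. Using the Color-by-Correlation (CbC) framework, the authors analyze the effect of different spectral dimensionality reduction strategies on illuminant estimation and identify the conditions under which compact spectral representations outperform conventional RGB-based approaches. The study offers practical guidance on exploiting hyperspectral information efficiently for illuminant estimation.

Details
English abstract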

Illuminant estimation aims to infer scene illumination from image measurements despite intrinsic ambiguities between surface reflectance and lighting. Most existing methods operate on trichromatic RGB images and are therefore fundamentally limited by the restricted spectral information available. Hyperspectral imaging provides a much richer representation of scene radiance and has the potential to alleviate these ambiguities. However, its high dimensionality poses computational and statistical challenges. In this work, we systematically study the effect of spectral dimensionality and representation choice on illuminant estimation performance using hyperspectral data. We adopt the practical and effective Color-by-Correlation (CbC) framework as the estimation backbone and analyze its behavior under different spectral dimensionality reduction strategies. Our results offer practical insights into how hyperspectral information can be efficiently exploited for illuminant estimation and identify conditions under which compact spectral representations outperform conventional RGB-based approaches. The code is available at https://github.com/IVRL/Reduced-Spectral-Color-Constancy.

2605.13305 2026-05-14 cs.LG math.DS physics.chem-ph

MPINeuralODE: Multiple-Initial-Condition Physics-Informed Neural ODEs for Globally Consistent Dynamical System Learning

Lake Yang, Antonio Malpica-Morales, Frank Ioannis Papadakis Wood, Serafim Kalliadasis

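AI summary This paper proposes MPINeuralODE, a method that addresses the poor generalization of neural ordinary differential equations (Neural ODEs) to unseen initial conditions and long prediction horizons. It combines a soft physics-informed residual with a Multiple-Initial-Condition (MIC) multiple-shooting curriculum, two structurally complementary ingredients that improve globally consistent learning of the dynamical system's vector field. On Lotka-Volterra, MPINeuralODE achieves the lowest out-of-sample and long-horizon error among data-driven methods while essentially matching the PINN ablation on Hamiltonian drift.

Details
English abstract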

Neural ordinary differential equations (Neural ODEs) often fit training trajectories while generalizing poorly to unseen initial conditions and long horizons. We propose MPINeuralODE, which combines a soft physics-informed residual with a Multiple-Initial-Condition (MIC) multiple-shooting curriculum whose ingredients are structurally complementary: the physics term anchors the vector-field magnitude on the support that MIC enlarges. We evaluate along three axes: out-of-sample error, long-horizon stability, and Hamiltonian drift, which together expose whether the learned dynamics recover the underlying vector field. On Lotka-Volterra, MPINeuralODE achieves the lowest out-of-sample and long-horizon MSE among data-driven methods, with a 26% reduction over the baseline Neural ODE, while essentially matching the PINN ablation on Hamiltonian drift.
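The Hamiltonian-drift axis can be illustrated on Lotka-Volterra, whose flow conserves $H(x, y) = \delta x - \gamma \ln x + \beta y - \alpha \ln y$. The sketch below is a hedged illustration, not the paper's evaluation code: it integrates a vector field with a hand-written RK4 step and reports the largest deviation of $H$ from its initial value. The parameter values, step size, and function names are assumptions chosen for the example.

```python
import math


def lotka_volterra(state, alpha=1.0, beta=1.0, delta=1.0, gamma=1.0):
    """Right-hand side dx/dt = alpha*x - beta*x*y, dy/dt = delta*x*y - gamma*y."""
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)


def conserved_quantity(state, alpha=1.0, beta=1.0, delta=1.0, gamma=1.0):
    """First integral of the Lotka-Volterra system (requires x > 0 and y > 0)."""
    x, y = state
    return delta * x - gamma * math.log(x) + beta * y - alpha * math.log(y)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for an autonomous field f."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def hamiltonian_drift(f, state0, dt, steps):
    """Largest |H(state_t) - H(state_0)| along the rollout of field f.

    H is evaluated with the default parameters of conserved_quantity.
    """
    h0 = conserved_quantity(state0)
    state, worst = state0, 0.0
    for _ in range(steps):
        state = rk4_step(f, state, dt)
        worst = max(worst, abs(conserved_quantity(state) - h0))
    return worst
```

Passing a learned vector field as `f` in place of `lotka_volterra` would give a drift score for that model. A field that fits short trajectories but distorts the underlying dynamics tends to show up as a growing drift over long rollouts.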

2605.13301 2026-05-14 cs.AI cs.CL

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Yafu Li, Runzhe Zhan, Haoran Zhang, Shunkai Zhang, Yizhuo Li, Zhilin Wang, Jiacheng Chen, Futing Wang, Xuyang Hu, Yuchen Fan, Bangjie Xu, Yucheng Su, Xinmiao Han, Chenxi Li, Haodi Lei, Yufeng Zhao, Zejin Lin, Qianjia Cheng, Tong Zhu, Xiaoye Qu, Ganqu Cui, Peng Ye, Yun Luo, Zhouchen Lin, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng

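AI summary This paper presents a simple and unified recipe for converting a post-trained reasoning model into a solver that reaches gold-medal level on the International Mathematical and Physics Olympiads. The recipe applies supervised fine-tuning with a reverse-perplexity curriculum to instill rigorous proof-search and self-checking behaviors, scales these behaviors through a two-stage reinforcement learning pipeline, and finally boosts solving performance with test-time scaling. Experiments show that the resulting model, SU-01, performs strongly on mathematics and physics competitions and also generalizes its scientific reasoning to other domains.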

Comments Technical Report. 77 pages

Details
English abstract

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.

2605.13297 2026-05-14 cs.LG

PaMM: Periodic Motif Memory for Atomistic Models with an Explicit Local-Structure Interface

Ryan Dong

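AI summary This paper proposes PaMM, a periodic motif memory module that strengthens the ability of atomistic models to explicitly model the local coordination motifs that recur in crystal structures. PaMM introduces pair and triplet motif lookup tables keyed by atom types and geometric features, which encode local structure explicitly and are fused with the original edge features. Experiments show that, under a fixed training budget, PaMM improves energy and force prediction, and that the gain comes from the structured pair/triplet organization rather than from simply adding capacity.

Details
English abstract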

Periodic crystals repeatedly instantiate similar local coordination motifs across translated cells and chemically related structures, but current equivariant atomistic models usually encode these patterns only implicitly in dense edge features. We introduce PaMM, a periodic motif memory that augments the UMA eSCN-MD edge encoder with explicit pair and triplet lookup features. Pair motifs are keyed by $(Z_j, Z_i, b_r)$ and triplet motifs by $(Z_j, Z_i, Z_k, b_\theta)$, hashed into fixed-size tables and fused with the baseline edge representation through lightweight gate-only and affine-equipped variants. We evaluate PaMM in a matched UMA-S OMAT setting and focus on a narrow question: whether explicit motif memory helps at a fixed intermediate training budget. At the 10k-step checkpoint, both PaMM variants improve over the plain baseline; gate-only gives the best energy MAE, while the affine-equipped variant gives the best force MAE. A matched 20k follow-up keeps the same operating-point picture. Aligned controls show that the gain weakens for pair-only, triplet-only, random-bucket, and parameter-matched MLP alternatives, suggesting that the benefit is tied to structured pair/triplet organization rather than generic added capacity. A within-OMAT24 source-family evaluation also shows small but consistent gains across held-out generation families. We therefore make a focused claim: in the studied UMA-S + OMAT regime, explicit pair/triplet motif memory is a useful inductive bias for periodic atomistic modeling. We do not claim broad cross-dataset transfer, a uniquely preferred fusion variant, or strong scientific interpretability beyond a more inspectable local-structure interface.
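To make the keyed lookup concrete, the sketch below shows one way a pair key $(Z_j, Z_i, b_r)$ or a triplet key could be mapped to a row of a fixed-size table. It is only an illustration under stated assumptions: the uniform radial binning, the cutoff of 6.0, the 16 bins, the table size, and the polynomial hash are all hypothetical choices, and the abstract does not specify PaMM's actual binning or hash function.

```python
def radial_bin(distance, cutoff=6.0, num_bins=16):
    """Map a distance to an integer radial bin b_r, clamped to [0, num_bins - 1].

    Assumption: uniform bins up to a cutoff; the real binning may differ.
    """
    idx = int(distance / cutoff * num_bins)
    return min(max(idx, 0), num_bins - 1)


def bucket(key, table_size):
    """Deterministic polynomial hash of a tuple of small non-negative ints."""
    h = 0
    for part in key:
        h = (h * 1_000_003 + part + 1) % (2**61 - 1)
    return h % table_size


def pair_bucket(z_j, z_i, distance, table_size=4096):
    """Table row for the ordered pair key (Z_j, Z_i, b_r)."""
    return bucket((z_j, z_i, radial_bin(distance)), table_size)


def triplet_bucket(z_j, z_i, z_k, b_theta, table_size=4096):
    """Table row for the triplet key (Z_j, Z_i, Z_k, b_theta)."""
    return bucket((z_j, z_i, z_k, b_theta), table_size)
```

In a model, the returned index would select a learned embedding row that is then fused with the edge feature. Distinct keys can collide in the same row because the table is fixed-size, which is the usual trade-off of hashed lookups.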

2605.13296 2026-05-14 cs.AI cs.LG cs.MA

Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

Yuanzhe Wang, Tian Zhi, Zihang Wei, Hongguang Wang, Jiaming Guo, Yang Zhao, Zisheng Liu, Shiyu Quan, Xing Hu, Zidong Du, Yunji Chen

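AI summary This paper studies multi-agent path finding (MAPF) in complex and congested environments and proposes DiffLNS, a hybrid framework based on a discrete diffusion model that generates high-quality initial plan drafts to improve repair-based solvers. The method combines a discrete denoising diffusion probabilistic model (D3PM) with sparse social attention and the LNS2 algorithm, generating diverse joint plan drafts directly in the discrete action space and raising both the success rate and the efficiency of large-scale MAPF solving. Experiments show that DiffLNS performs well across a range of complex scenarios, with an average success rate of 95.8%, clearly ahead of existing methods.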

Comments 24 pages, 7 figures

Details
English abstract

Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid framework that integrates a discrete denoising diffusion probabilistic model (D3PM) with LNS2. The D3PM serves as an initializer with sparse social attention that learns a spatiotemporal prior over coordinated multi-agent action trajectories from expert demonstrations and samples multiple joint plans. Operating directly on the categorical action space, our discrete diffusion preserves the MAPF action structure and samples from a multimodal joint-plan distribution to produce diverse drafts well suited for neighborhood repair. These drafts act as warm starts for downstream repair, which completes unfinished trajectories and resolves remaining conflicts under hard MAPF constraints. Experimental results show that despite being trained only on instances with at most 96 agents, the initializer generalizes to scenarios with up to 312 agents at inference time. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points and matching or exceeding all baselines in all 20 settings. To the best of our knowledge, this is the first work to leverage discrete diffusion for warm-starting an LNS-based MAPF solver.
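The "remaining conflicts" that a repair stage has to resolve can be made concrete with the standard MAPF conflict definitions: a vertex conflict when two agents occupy the same cell at the same timestep, and an edge (swap) conflict when they exchange cells between consecutive timesteps. The sketch below counts both in a joint plan. It is a generic illustration, not code from DiffLNS or LNS2, and the path format (one list of grid cells per agent) is an assumption.

```python
from itertools import combinations


def count_conflicts(paths):
    """Count vertex and edge (swap) conflicts in a joint plan.

    paths: list of per-agent paths, each a non-empty list of (row, col) cells
    with one cell per timestep. Agents are assumed to wait at their final
    cell once their path ends.
    """
    horizon = max(len(p) for p in paths)
    padded = [p + [p[-1]] * (horizon - len(p)) for p in paths]
    conflicts = 0
    for a, b in combinations(padded, 2):
        for t in range(horizon):
            if a[t] == b[t]:
                conflicts += 1  # vertex conflict: same cell, same timestep
            elif t > 0 and a[t] == b[t - 1] and a[t - 1] == b[t]:
                conflicts += 1  # edge conflict: the two agents swap cells
    return conflicts
```

A draft with a low conflict count leaves less work for neighborhood repair, which is one way to read the abstract's point that initial plan quality affects downstream repair.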

2605.13295 2026-05-14 cs.CL cs.AI cs.MA

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Tom Zehle

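AI summary This paper proposes CANTANTE, a framework for optimizing LLM-based multi-agent systems. It addresses the credit-assignment problem by contrasting the rollouts of different joint configurations on the same query and decomposing system-level rewards into per-agent update signals. Experiments show that CANTANTE outperforms existing optimizers on programming and mathematical reasoning, stays competitive on multi-hop question answering, and lowers inference cost.

Details
English abstract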

LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.

2605.13293 2026-05-14 cs.CV

Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

Shiyu Tan, Zixuan Zhao, Hao Gao, Zhiheng Chen, Xiaolong Yin, Enya Shen

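AI summary This paper proposes Img2CADSeq, a multi-stage image-to-CAD generation method that aims to produce high-quality Boundary Representation (BRep) CAD models from single-view images. Its core idea is to encode CAD operation sequences into a three-level hierarchical codebook and, following an importance prioritization that values profiles over details, compress long sequences into a stable discrete latent space. To bridge the modality gap between images and CAD, the work introduces a point cloud intermediate aligned through contrastive learning and uses it to condition a VQ-Diffusion model. The method is validated with the newly built CAD-220K and PrintCAD datasets, significantly outperforms existing methods, and produces STEP files that can be used directly in commercial CAD software.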

Comments Accepted by SIGGRAPH 2026 Conference

Details
English abstract

Boundary Representation (BRep) is the standard format for Computer-Aided Design (CAD), yet reconstructing high-quality BReps from single-view images remains challenging due to the complexity of topological constraints and operation sequences. We present Img2CADSeq, a multi-stage pipeline that overcomes these limitations by encoding CAD sequences into a three-level hierarchical codebook. Guided by an importance prioritization, this strategy values profiles over details, compressing long sequences into a stable discrete latent space. To bridge the modality gap, we leverage a coarse-to-fine point cloud intermediate, aligning 2D visual features with 3D CAD sequences via contrastive learning to condition a VQ-Diffusion model. Supported by newly introduced CAD-220K and PrintCAD datasets, our approach ensures robust industrial domain adaptation. Extensive experiments demonstrate that Img2CADSeq significantly outperforms state-of-the-art methods, producing standard STEP files that can be directly used in commercial CAD software.