arXivDaily: daily arXiv digest, updated Monday through Friday
2606.01435 2026-06-02 cs.AI cs.CL cs.IR

Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

Vikas Reddy, Sumanth Challaram

Affiliations * IIT Kgp (Indian Institute of Technology Kharagpur)

AI summary: To address the poor fact-conflict resolution of LLM-based memory systems, the paper replaces LLM judgment with deterministic aggregation (candidate extraction plus Python max(serial)), gaining 10.8 points on the single-hop task and extending the approach to multi-hop.

Abstract

LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA.
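
The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes candidate extraction has already produced (serial, value) pairs for one fact; the `Candidate` type, the `resolve_conflict` name, and the example values are hypothetical and not taken from the paper.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """One extracted value for a fact, tagged with its serial number (hypothetical type)."""
    serial: int
    value: str


def resolve_conflict(candidates: List[Candidate]) -> Optional[str]:
    """Return the value with the largest serial, or None if nothing was extracted."""
    if not candidates:
        return None
    # Deterministic aggregation: no LLM call, the newest serial always wins.
    return max(candidates, key=lambda c: c.serial).value


# Illustrative data only: three contradictory values for the same fact.
retrieved = [
    Candidate(serial=12, value="Paris"),
    Candidate(serial=87, value="Lyon"),
    Candidate(serial=40, value="Nice"),
]
answer = resolve_conflict(retrieved)
```

Under the same assumption, swapping the `serial` field for a timestamp gives the max(timestamp) variant mentioned for LongMemEval.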

2606.01434 2026-06-02 cs.CL

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Qing Wang, Bo Li, Jialu Liang, Daling Shi, Bob Zhang, Qianqian Song

Affiliations * Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida; PAMI Research Group, Department of Computer and Information Science, Faculty of Science and Technology, University of Macau

AI summary: Proposes DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills through a reflection-driven state machine, and builds DrugAudit, a 3,772-item authority-aware benchmark; DrugClaw ranks first across the reported benchmarks.

Abstract

Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).
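
The abstract reports an inter-judge kappa of 0.88 without naming the statistic. As a sketch, assuming it is Cohen's kappa computed over the categorical verdicts of two judges, agreement could be computed as follows; the function name and the example labels are illustrative only.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(judge_a)
    # Observed agreement: fraction of items on which both judges agree.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement: product of the judges' marginal label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative verdicts from two hypothetical judges on four items.
kappa = cohens_kappa(["y", "y", "n", "n"], ["y", "y", "n", "y"])
```

Here the judges agree on 3 of 4 items, chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.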

2606.01425 2026-06-02 cs.LG

Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization

Gishnu Madhu, Feng Liu, Souma Chowdhury

Affiliations * Department of Computer Science and Engineering; Department of Mechanical and Aerospace Engineering

AI summary: Proposes a graph-neural-network-based abstraction that maps the combinatorial space to a directed graph, improving search efficiency in mixed-combinatorial nonlinear programming problems.

Comments
Accepted for presentation at ASME IDETC 2026

Abstract

Mixed-combinatorial nonlinear programming (MCNLP) problems arise in many engineering design and planning applications, e.g., due to categorical, component, and geometric design choices, as well as joint task and motion planning. Traditional representations of combinatorial spaces, such as integer or binary encoding, often introduce spurious relations, increase dimensionality, and require additional compatibility constraints. Instead, this paper draws on recent developments in robot planning and vehicle/network routing domains that aim to learn search heuristics over combinatorial spaces using graph neural networks (GNNs). More specifically, this paper presents a first-of-its-kind structured abstraction of the combinatorial space by learning a mapping from an undirected fully connected graph of combinations to a directed graph indicating improvement directions using an Edge Field Graph Network (EFGN). To demonstrate the utility of this new way of abstracting the combinatorial space in solving MCNLPs, we adopt a recent optimization framework that purely searches over the non-combinatorial (e.g., continuous) variables and retrieves the best-suited combination for each candidate design by using the abstraction model, akin to a recommender system. The presented direction-aware abstraction model provides a potentially more scalable and interpretable retrieval of combinations compared to the original recommendation system in that framework. For evaluation, the proposed method is integrated with well-known particle swarm optimization and genetic algorithm solvers on three benchmark nonlinear problems with varying numbers of combinations and variables. Compared to baseline solvers using indexified combinations, the GNN-based recommender consistently achieves better mean optimum values and robustness across multiple runs.

2606.01421 2026-06-02 cs.LG

Target localization, identification and sensing using latent symmetries

David Dukov, Malte Röntgen, Bryn Davies

Affiliations * Mathematics Institute, University of Warwick; Eastern Institute for Advanced Study; Eastern Institute of Technology Ningbo, Zhejiang, China

AI summary: Uses arrays of scatterers designed with latent symmetries as sensors: by analysing the degree of symmetry breaking, combined with Bayesian inference or an artificial neural network, the method identifies the radius and localizes the position of an intruder scatterer.

Comments
Submitted to SIAM Journal on Imaging Sciences

Abstract

We show that an array of scatterers which has been designed to have latent ("hidden") symmetries can be used as a sensor. We use the capacitance matrix as a canonical model for three-dimensional hybridisation and study how the introduction of an "intruder" scatterer breaks the latent symmetries. By analysing the degree to which each symmetry is broken, we identify the radius of the intruder and localize its position. This can be achieved using a dictionary-based approach; however, Bayesian inference or an artificial neural network (multi-layer perceptron) performs better in the presence of measurement noise. To our knowledge, this is the first time latent symmetries have been exploited successfully for sensing problems. It is also the first time latent symmetries have been observed in a three-dimensional open system that cannot be approximated by a sparse graph.

2606.01419 2026-06-02 cs.CV

DENSER: Depth-Guided Ensemble with Staged EFA-GS Reconstruction for Soccer Novel View Synthesis

Parthsarthi Rawat

Affiliations * GameChanger by Dick’s Sporting Goods

AI summary: Proposes DENSER, which improves novel view synthesis for soccer scenes through a depth-guided ensemble and staged EFA-GS reconstruction, combining camera-height-based loss weighting, monocular depth supervision, and a three-model pixel-average ensemble.

Comments
CVPR 2026 SoccerNet Novel View Synthesis Challenge, Rank 1

Abstract

We propose DENSER, a Depth-guided ENSemble with Staged EFA-GS Reconstruction for soccer novel view synthesis. DENSER extends EFA-GS with three key contributions: (1) camera-height-based loss weighting that prioritises ground-level broadcast views, (2) monocular depth supervision from Depth-Anything-V2 to regularise geometry in textureless regions, and (3) a three-model pixel-average ensemble whose members diverge from a shared base checkpoint by varying training length and Gaussian scale clamping. On five held-out challenge scenes we achieve a mean PSNR of 29.89 dB, SSIM of 0.791, and LPIPS of 0.366.
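
As a rough illustration of the third contribution and of the headline metric, the sketch below averages several renders pixel by pixel and computes PSNR. It assumes single-channel images stored as nested lists of floats in [0, 1]; the function names and sample values are hypothetical, and the actual DENSER pipeline is not described at this level of detail in the abstract.

```python
import math


def pixel_average(images):
    """Average equally sized images (rows of floats in [0, 1]) pixel by pixel."""
    count = len(images)
    return [
        [sum(pixels) / count for pixels in zip(*rows)]
        for rows in zip(*images)
    ]


def psnr(prediction, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [
        (p - t) ** 2
        for pred_row, target_row in zip(prediction, target)
        for p, t in zip(pred_row, target_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Three hypothetical ensemble members rendering the same 1x2 view.
renders = [
    [[0.0, 1.0]],
    [[0.3, 0.7]],
    [[0.6, 0.4]],
]
ensemble = pixel_average(renders)
score = psnr([[0.5, 0.5]], [[0.6, 0.6]])
```

With a per-pixel error of 0.1 the mean squared error is 0.01, so the second call evaluates to roughly 20 dB.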

2606.01417 2026-06-02 cs.AI

GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway

Ahmet Kaplan

Affiliations * Turkey's e-Government Gateway

AI summary: Addressing the lack of structured technical governance infrastructure in Turkey's e-government platform, the paper proposes GovAI-Pipe, a four-layer AI governance pipeline built with Design Science Research methodology that maps the AI model lifecycle to governance checkpoints, and demonstrates its auditable technical implementation on high-risk use cases.

Comments
7 pages

Abstract

Turkey's e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasingly integrating artificial intelligence into citizen-facing applications such as chatbot assistants and eligibility assessments. However, no structured technical governance infrastructure currently connects high-level AI policy frameworks, such as the EU AI Act, OECD AI Principles, and Turkey's own National AI Strategy, to the operational reality of deploying AI within a centralized e-government platform. We propose GovAI-Pipe, a four-layer governance pipeline designed using Design Science Research methodology that maps the AI model lifecycle to governance checkpoints: (1) pre-deployment validation for bias testing, explainability, and privacy impact assessment; (2) deployment governance for risk-tier classification and approval workflows; (3) runtime monitoring for drift detection, fairness tracking, and human-in-the-loop escalation; and (4) post-incident governance for audit trails, rollback, and citizen redress. Each layer is anchored to specific provisions of the EU AI Act, the GDPR data protection framework, and the National AI Strategy. We demonstrate the framework through two high-risk e-Devlet use cases, showing how GovAI-Pipe operationalizes governance principles as auditable, technical pipeline components.

2606.01416 2026-06-02 cs.AI

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Rahul Suresh Babu, Adarsh Agrawal

Affiliations * Independent Researcher; Senior Member, IEEE

AI summary: Proposes a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem: it maps failure signals to failure classes, selects recovery actions, and verifies recovered trajectories, reaching 98.8% task success on a 100-task fault-injection benchmark and outperforming retry-only and full-replanning baselines.

Abstract

Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8% task success, compared with 94.5% for retry-only and 93.8% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0% versus 85.3% and 88.2%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.
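
The control loop outlined in the abstract (signal, inferred failure class, targeted recovery action, explicit budget, verification, trace) can be sketched as below. This is a minimal illustration under stated assumptions: the signal names, failure classes, action names, and the `run_with_recovery` interface are all hypothetical and are not the paper's taxonomy or API.

```python
# Hypothetical mapping from observable signals to inferred failure classes.
SIGNAL_TO_CLASS = {
    "timeout": "tool_unavailable",
    "schema_error": "malformed_arguments",
    "verifier_reject": "unverified_output",
}

# Hypothetical targeted recovery action for each failure class.
CLASS_TO_ACTION = {
    "tool_unavailable": "retry_with_backoff",
    "malformed_arguments": "repair_arguments",
    "unverified_output": "re_retrieve_context",
}


def run_with_recovery(step, verify, budget):
    """Run `step` until `verify` accepts its output or the recovery budget is spent.

    `step(action)` returns (output, signal); signal is None when no fault was observed.
    Returns (output or None, trace), where trace records each recovery decision.
    """
    trace = []
    action = "initial"
    for attempt in range(budget + 1):  # one initial attempt plus `budget` recoveries
        output, signal = step(action)
        if signal is None and not verify(output):
            signal = "verifier_reject"  # surface a silent failure as a signal
        if signal is None:
            return output, trace
        if attempt == budget:
            break  # budget exhausted: report failure instead of looping
        failure_class = SIGNAL_TO_CLASS.get(signal, "unknown")
        action = CLASS_TO_ACTION.get(failure_class, "full_replan")
        trace.append((signal, failure_class, action))
    return None, trace


def demo_step(action):
    """Toy tool: times out first, then returns a wrong-but-plausible answer."""
    if action == "initial":
        return None, "timeout"
    if action == "retry_with_backoff":
        return "wrong", None
    return "right", None


def is_right(output):
    return output == "right"


result, trace = run_with_recovery(demo_step, is_right, budget=2)
```

With a budget of two recoveries the toy run ends with the verified answer; with a budget of one it stops after the verifier rejects the plausible but wrong output and returns `None`, which is the bounded behaviour the abstract describes.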

2606.01414 2026-06-02 cs.CV

Agent Skills Should Go Beyond Text: The Case for Visual Skills

Binxiao Xu, Ruichuan An, Bocheng Zou, Hang Hua

Affiliations * Peking University; University of Wisconsin; MIT-IBM Watson AI Lab

AI summary: Existing skill-learning methods store experience only as text, which bottlenecks visual tasks. The paper proposes a multimodal skill paradigm that combines textual logic with visual support, plus an automatic system that converts experience into reusable visual skills, which consistently outperform text-only skills on GUI and other visual-centric tasks.

Abstract

Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf{\NAME}, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \textbf{\SYSTEM}, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.

2606.01412 2026-06-02 cs.LG cs.IT math.IT

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

Shihao Zhang, Rayan Saab

Affiliations * Department of Mathematics, University of California San Diego; Department of Mathematics and Halıcıoğlu Data Science Institute, University of California San Diego

AI summary: Proposes GPTQ-intrinsic LoRA, which folds the low-rank correction directly into the GPTQ quantization pass, establishes near-optimality against information-theoretic lower bounds, and outperforms existing methods on language and vision models.

Abstract

Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice $L=V_r$, where $V_r$ contains the top right singular vectors of $X$, we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on $\|X\|_F^2$ is replaced by the rank-$r$ residual $\|X-X_r\|_F^2$, up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.
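
A short calculation suggests why the rank-$r$ residual of $X$ appears once $L=V_r$. It is only an intuition sketch and not the paper's bound: it assumes a thin SVD $X=U\Sigma V^\top$ with $\sigma_r>0$, holds $Q$ fixed, picks the least-squares optimal $R$, and ignores the regularization terms mentioned in the abstract.

```latex
% Assumptions (not from the paper): X = U \Sigma V^\top, \sigma_r > 0, Q fixed,
% X_r = X V_r V_r^\top is the best rank-r approximation of X.
\begin{align*}
R^\star &= \arg\min_R \|X(W-Q) - X V_r R\|_F^2
         = (X V_r)^{+} X (W-Q) = V_r^\top (W-Q), \\
X(W-Q) - X V_r R^\star &= X \,(I - V_r V_r^\top)(W-Q) = (X - X_r)(W-Q), \\
\|XW - X(Q + V_r R^\star)\|_F^2 &= \|(X - X_r)(W-Q)\|_F^2
  \le \|X - X_r\|_2^2 \,\|W-Q\|_F^2
  \le \|X - X_r\|_F^2 \,\|W-Q\|_F^2 .
\end{align*}
```

The second line uses $(XV_r)^{+}X = \Sigma_r^{-1}U_r^\top U\Sigma V^\top = V_r^\top$, so the optimal compensation removes exactly the component of the quantization error that lies in the top-$r$ right singular subspace of $X$.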

2606.01402 2026-06-02 cs.LG cs.AI

Neural Network Compression by Approximate Differential Equivalence

Ravi Dhiman, Andrea Passarella, Mirco Tribastone, Lorenzo Valerio

Affiliations * IMT School for Advanced Studies Lucca; IIT CNR

AI summary: Proposes compressing neural networks by aggregating functionally similar neurons: the network is encoded as a polynomial ODE system and lumped via Approximate Forward Differential Equivalence, giving a smooth trade-off between model size and accuracy.

Comments
19 pages, 4 figures

Abstract

Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We propose a complementary approach that compresses models by aggregating neurons with similar functional behavior rather than removing weights independently. Our method encodes a trained network as a polynomial ODE system and applies a lumping method called Approximate Forward Differential Equivalence to identify neurons with approximately matching induced dynamics. A single tolerance parameter, $\varepsilon$, controls the compression level and induces a smooth trade-off between model size and predictive accuracy. We evaluate the method on synthetic datasets derived from nonlinear dynamical systems with known ground-truth behavior and on public regression benchmarks. Across both settings, the proposed approach achieves substantial parameter reduction while preserving accuracy, and consistently compares favorably with magnitude-based pruning and Wanda at similar compression levels. These results suggest that differential equivalence-based aggregation is a principled and effective alternative to conventional weight-centric pruning.

2606.01400 2026-06-02 cs.CL cs.AI

Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs

Denica Kjorvezir, Marko Djukanović, Ana Gjorgjevikj, Gjorgjina Cenikj, Tome Eftimov

Affiliations * Computer Systems Department, Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia; Center for Astrophysics and Cosmology, University of Nova Gorica, Nova Gorica, Slovenia

AI summary: Proposes a prompt-selection framework based on maximum independent sets of similarity graphs; by selecting a diverse, non-redundant subset it substantially cuts benchmarking cost while keeping LLM rankings consistent.

Abstract

Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph -- nodes are prompts connected if their embedding-space distance falls above a configurable threshold -- and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis -- that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline -- is strongly confirmed: Kendall's $W \geq 0.90$ in 99.2% of stochastic configurations (mean $W = 0.997 \pm 0.008$), while at higher percentile thresholds selected subsets achieve 25-48% prompt reduction on average. Ranking divergence from the full benchmark ($\rho < 0.95$) occurs in only 15.95% of configurations, concentrated at low thresholds ($p_{10}$-$p_{20}$) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
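
To make the selection step concrete, here is a minimal sketch of a greedy independent-set heuristic on a prebuilt graph. It assumes the edge list has already been derived from embedding distances and a threshold; the minimum-degree rule and the function name are illustrative choices and may differ from the GREEDY solver evaluated in the paper.

```python
def greedy_independent_set(num_nodes, edges):
    """Greedy minimum-degree heuristic for a maximal independent set.

    Nodes are 0..num_nodes-1 and `edges` are undirected pairs. The node with
    the fewest remaining neighbours is picked repeatedly (ties broken by
    index) and its neighbours are discarded.
    """
    neighbours = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)
    remaining = set(range(num_nodes))
    selected = []
    while remaining:
        pick = min(remaining, key=lambda v: (len(neighbours[v] & remaining), v))
        selected.append(pick)
        remaining -= neighbours[pick] | {pick}
    return selected


# Toy graph: five prompts connected in a chain 0-1-2-3-4.
chain_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
subset = greedy_independent_set(5, chain_edges)
```

On the chain the heuristic keeps prompts 0, 2, and 4, so no two selected prompts share an edge. A denser graph leaves fewer prompts in the subset, which matches the abstract's remark that overly dense graphs are the main failure mode.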

2606.01398 2026-06-02 cs.RO

A Sonar-Visual Dataset for Cross-Modal Underwater Robot Perception

Weitung Chen, Phil Tinn, Per Gunnar Auran, Martin Ludvigsen, Peter Halland Haro

Affiliations * Massachusetts Institute of Technology; SINTEF; Norwegian University of Science and Technology

AI summary: Presents SOVIS, a dataset of more than 76,000 paired sonar-visual frames, with an end-to-end pipeline that synchronizes and cleans the data and an interactive annotation tool that speeds up labeling; a cross-modal fish-detection task shows a 7x improvement in mAP@0.10.

Comments
6 pages, 7 figures, 3 tables. Accepted to IEEE ICRA 2026 S2S Workshop (From Sea to Space: Advancing Perception in Harsh Domains)

Abstract

Underwater robots typically use both cameras and sonar for perception to leverage the rich semantic details of vision and the robust range measurements of acoustics. However, learning to map between these modalities via cross-modal prediction remains underexplored due to limited sonar-visual paired datasets. We present SOVIS, a sonar-visual dataset for cross-modal underwater perception. SOVIS comprises over 76,000 paired frames collected across 17 dives at six sites in the Trondheimfjord, supported by an end-to-end pipeline that cleans and synchronizes the cross-modal sensor data. We also introduce an interactive annotation tool designed to accelerate the labeling process for this paired data. Finally, we demonstrate a proof-of-concept cross-modal fish detection task using a small subset of labeled data, achieving a 7x improvement in mAP@0.10 over a monocular camera baseline. SOVIS serves as the first step toward advancing cross-modal underwater perception research, enabling research directions such as dense sonar prediction from monocular images.

2606.01397 2026-06-02 cs.RO cs.LG cs.SY eess.SY

Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision

Mehmet Iscan, Batuhan Temiz

Affiliations * PythaLab, Yildiz Technical University, Istanbul, Turkey; Turkish Aerospace (TUSAŞ), Ankara, Turkey

AI summary: Proposes an autopilot-preserving residual command-supervision framework that selects a residual from a finite, bounded action set using an HJB-inspired semi-discrete value-iteration critic and a finite-action shield inspired by control Lyapunov and barrier functions, substantially reducing path-tracking error.

Comments
47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark; source code and a demonstration video are linked in the paper.

Abstract

A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.
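
The selection logic described above (score candidates, rank by advantage over the no-op, filter through a shield that always keeps the no-op, then project into the command envelope) can be sketched as follows. The critic, the safety check, and all numeric values are placeholders chosen for illustration; they are not the paper's HJB critic or its Lyapunov/barrier conditions.

```python
NOOP = (0.0, 0.0, 0.0)  # zero residual on (airspeed, altitude, heading)


def select_residual(actions, score, is_safe, noop=NOOP):
    """Pick the safe residual with the best advantage over the no-op.

    `score(a)` is a hypothetical critic value (larger is better). The no-op is
    returned when no safe candidate has a strictly positive advantage.
    """
    baseline = score(noop)
    best, best_advantage = noop, 0.0
    for action in actions:
        if not is_safe(action):
            continue  # shield: drop candidates that fail the safety check
        advantage = score(action) - baseline
        if advantage > best_advantage:
            best, best_advantage = action, advantage
    return best


def project_to_envelope(reference, residual, lower, upper):
    """Add the residual to the reference and clamp each channel to [lower, upper]."""
    return tuple(
        min(max(r + d, lo), hi)
        for r, d, lo, hi in zip(reference, residual, lower, upper)
    )


# Placeholder critic: rewards closing a 10 m altitude gap, penalises other residuals.
def toy_score(a):
    return -(abs(a[0]) + abs(a[1] - 10.0) / 10.0 + abs(a[2]))


candidates = [NOOP, (2.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 0.3)]
chosen = select_residual(candidates, toy_score, is_safe=lambda a: True)
shielded = select_residual(candidates, toy_score, is_safe=lambda a: abs(a[1]) <= 5.0)
command = project_to_envelope(
    (25.0, 100.0, 1.0), chosen, (15.0, 50.0, -3.14), (35.0, 105.0, 3.14)
)
```

With the permissive check the altitude residual wins and the projected altitude command is clamped to the 105 m envelope limit; with the stricter check that residual is filtered out and the supervisor falls back to the no-op, leaving the autopilot reference unchanged.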

2606.01393 2026-06-02 cs.CL cs.AI cs.CV

Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing

Minglai Yang, Xinyan Velocity Yu, Pengyuan Li, Xinyu Guo, Zhenting Qi, Konwoo Kim, Longtian Ye, Xiaolong Luo, Jinhe Bi, Henry Zhang, Haris Riaz, Xuan Zhang, Yunze Xiao, Bangya Liu, Tom Tang, Yunfei Zhao, Qunshu Lin, Zihan Wang, Minghao Liu, Michael Lingzhi Li, Yilun Du, Jesse Thomason, Rogerio Feris, Alex Pentland, Zexue He

Affiliations * Stanford University; MIT; Carnegie Mellon University; University of Southern California; Harvard University; IBM Research; University of Arizona; Duke University; UC Berkeley; LMU Munich

AI summary: Proposes the Dr. DocBench benchmark, which selects challenging documents from a multilingual book corpus through parser-failure-based sampling, spans 52 BISAC subject domains, and provides 65k high-quality annotations for evaluating expert-level document parsing.

Comments
27 pages, 13 figures, 14 tables

Abstract

Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formulas, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual content. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.

2606.01386 2026-06-02 cs.AI cs.CL cs.DC cs.LG

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

GuidaPA: 通过联邦学习为公共行政提供隐私保护的聊天机器人

Daniel M. Jimenez-Gutierrez, Albenzio Cirillo, Raffaele Nicolussi, Alessio Beltrame, Andrea Vitaletti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出GuidaPA,一个基于联邦学习(FL)在意大利公共行政文档上训练的隐私保护聊天机器人,通过参数高效的联邦微调(QLoRA)和角色访问控制,在保持数据本地化的同时实现了接近集中式微调的答案质量。

详情
Comments
Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)
AI中文摘要

我们提出了GuidaPA,一个为意大利公共行政(PA)设计的隐私保护聊天机器人,它通过联邦学习(FL)在两个国家PA平台SIGESON和SIDFORS的文档上进行训练。我们的语料库包括约8页的SIGESON手册和31页的SIDFORS手册/常见问题解答;虽然本研究使用公开文档作为安全代理,但预期的部署将扩展到受限制的内部来源(例如,工单、官员手册、数据库提取),这些数据由于监管和组织约束无法集中汇集。GuidaPA集成了基于角色的访问控制、安全的客户端预处理、对非独立同分布效应的显式监控以及大语言模型的参数高效联邦微调。使用QLoRA(4位)进行15轮联邦训练,每个客户端采用80/20的训练-测试划分,我们使用ROUGE、BLEU-4和METEOR评估答案质量。最佳联邦模型达到了ROUGE-1/2/L分别为61.10/55.77/59.44,BLEU-4为45.02,METEOR为63.94——接近私有集中式微调的性能,同时保持数据在本地。与通用基线相比,领域微调将ROUGE-1从41.45提高到62.18,BLEU-4从26.97提高到50.90。总体而言,结果表明FL可以在不进行集中数据共享的情况下,为公共服务提供高质量的对话式AI。

英文摘要

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that cannot be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94, close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing.

2606.01380 2026-06-02 cs.CV

Training-free image inversion for one-step diffusion models

无需训练的一步扩散模型图像反演

Tao Wu, Senmao Li, Yaxing Wang, Shiqi Yang, Kai Wang, Joost van de Weijer

发表机构 * CVC, University of Alabama in Birmingham(CVC,阿拉巴马大学伯明翰分校) Machine Intelligence Institute, Masdar Institute of Science and Technology(机器智能研究所,马斯达尔科学技术学院) Jilin University(吉林大学) City University of Hong Kong, Department of Geography(香港城市大学地理系)

AI总结 提出一种无需训练的反演框架TFinv,通过迭代噪声对齐和后缀学习解决一步扩散模型中真实图像反演与编辑的关键挑战,实现高效编辑。

详情
Comments
Accepted to Pattern Recognition
AI中文摘要

在这项工作中,我们为一步扩散模型引入了一种新颖的无需训练的反演(TFinv)框架,解决了真实图像反演和编辑中的关键挑战。我们首先确定了阻碍真实图像反演和编辑的两个关键因素:(1)初始潜在可编辑性,与初始噪声与理想高斯分布之间的距离有关;(2)描述差距,即文本描述与图像表示之间的对齐。这两个因素都影响一步扩散模型的反演效率和可编辑性。然后,我们提出了两种新颖的技术:迭代噪声对齐(iterNA),它最小化分布差距以与正态高斯分布对齐;以及后缀学习(suffL),它通过引入学习到的后缀提示令牌来增强文本到图像的描述对齐。这些技术能够将输入图像精确反演为其初始噪声表示,并促进图像编辑。此外,我们提出了一种基于掩码的编辑技术,用于局部编辑同时保持背景完整性。在PIE-Bench数据集上的全面实验验证了我们的方法TFinv不仅在一步扩散编辑中实现了最先进的性能,而且在效率上显著优于现有的多步方法。代码可在https://github.com/tttao-uwu/TFinv.git获取。

英文摘要

In this work, we introduce a novel training-free inversion (TFinv) framework for one-step diffusion models, addressing key challenges in real image inversion and editing. We first identify two critical factors hampering real-image inversion and editing: (1) Initial Latent Editability, which is related to the distance between the initial noise and the ideal Gaussian distribution, and (2) Caption Gap, which means the alignment between text captions and image representations. Both factors influence inversion efficiency and the editability of one-step diffusion models. Then, we propose two novel techniques: iterative noise alignment (iterNA), which minimizes the distribution gap to align with the normal Gaussian distribution, and suffix learning (suffL), which enhances text-to-image caption alignment by introducing learned suffix prompt tokens. These techniques enable precise inversion of input images into their initial noise representations and facilitate image editing. Furthermore, we propose a mask-based editing technique for localized edits while preserving background integrity. Comprehensive experiments on the PIE-Bench dataset validate that our method TFinv not only achieves state-of-the-art performance in one-step diffusion editing, but also significantly outperforms existing multistep approaches in efficiency. The code is available at https://github.com/tttao-uwu/TFinv.git.

2606.01374 2026-06-02 cs.LG

From Performance to Viability: A Bootstrap Framework for Latent-Space Representation Learning in Adaptive Biological Systems

从性能到生存力:自适应生物系统中潜在空间表示学习的自举框架

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * Laboratory of Bioengineering and Nanosciences (LBN)(生物工程与纳米科学实验室) University of Montpellier(蒙彼利埃大学) EuroMov Digital Health in Motion(EuroMov数字健康运动) IMT Mines Alès Certified Sophrologist, Sensorimotor Practice(认证Sophrologist,运动感知实践)

AI总结 针对自适应生物系统中性能相似但组织不同的问题,提出一个五级自举框架,通过逐步引入潜在组织、纵向生存力和内部预测近似,从观测不足中学习更具信息量的表示。

详情
Comments
25 pages. Methodological framework for latent-space representation learning in adaptive biological systems
AI中文摘要

可观测性能通常用于表征生物系统。然而,在自适应系统中,相似性能可能源于不同的组织,且在给定时间看似相似的配置可能遵循不同的纵向轨迹。这一局限性促使我们提出一种方法论框架,以超越基于性能的解释,而无需事先假设完整的机制模型。本文提出了一个用于自适应生物系统中潜在空间表示学习的自举框架。这里的自举是在方法论和认识论意义上使用的:当先前的表示不足以解释观察到的自适应动态时,引入新的分析层次。该框架围绕五个层次组织:可观测性能、动态组织、潜在组织、纵向生存力和内部预测近似。通过三个先前报道的步态-遮挡研究来说明该框架,这些研究仅作为方法论案例序列,而非新的实验证据。本文形式化了性能分析如何导致潜在组织,静态潜在组织如何导致纵向生存力,以及观察到的生存力如何导致内部预测近似。贡献不是新的学习算法、临床协议或数据集,而是一个用于潜在空间表示学习的自举框架,描述了如何从自适应生物数据的观测不足中涌现出更具信息量的表示。

英文摘要

Observable performance is commonly used to characterize biological systems. In adaptive systems, however, similar performances may arise from distinct organizations, and configurations that appear comparable at a given time may follow different longitudinal trajectories. This limitation motivates a methodological framework for moving beyond performance-based interpretation without assuming a complete mechanistic model in advance. This article proposes a bootstrap framework for latent-space representation learning in adaptive biological systems. Here, bootstrap is used in a methodological and epistemological sense: new analytical levels are introduced when the preceding representation becomes insufficient to account for observed adaptive dynamics. The framework is organized around five levels: observable performance, dynamic organization, latent organization, longitudinal viability, and internal predictive approximation. The framework is illustrated by three previously reported gait-occlusion studies, used here only as a methodological case sequence and not as new experimental evidence. The article formalizes how performance analysis led to latent organization, how static latent organization led to longitudinal viability, and how observed viability led to internal predictive approximation. The contribution is not a new learning algorithm, clinical protocol, or dataset, but a bootstrap framework for latent-space representation learning describing how increasingly informative representations can emerge from observational insufficiencies in adaptive biological data.

2606.01372 2026-06-02 cs.LG cs.AI cs.CV

BRo-JEPA: Learning Modular Arithmetic in Latent Space

BRo-JEPA:在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) NYU Langone Health(纽约大学Langone医疗中心)

AI总结 本文提出BRo-JEPA模型,通过在潜空间中施加模10算术的循环结构,实现零样本泛化,解决了标准模型无法外推未见操作的问题。

详情
Comments
10 pages, 14 figures
AI中文摘要

神经网络能否学习抽象的代数规则,还是仅仅记忆训练模式?我们使用MNIST数字作为状态,模算术运算作为动作,在JEPA风格的潜世界模型中进行研究。标准监督基线和带有加法操作嵌入的JEPA模型能够学习已见操作,但无法可靠地外推到未见操作。为了弥补这一差距,我们引入了一个块旋转预测器,在潜空间中施加模10算术的循环结构。这使得模型具有强大的零样本泛化能力,最佳的基于ResNet的JEPA块旋转模型达到了99.46%的零样本准确率和99.46%的展开准确率。我们的结果表明,当架构与问题结构匹配时,潜世界模型可以学习符号变换规则。我们的代码可以在此处访问:https://github.com/DL-World-Models/mnist-math。

英文摘要

Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as states and modular arithmetic operations as actions in a JEPA-style latent world model. Standard supervised baselines and JEPA models with additive operation embeddings fit seen operations but fail to extrapolate reliably to unseen ones. To bridge this gap, we introduce a block-rotation predictor that imposes the circular structure of modulo-10 arithmetic in latent space. This enables strong zero-shot generalization, with the best ResNet-based JEPA block-rotation model achieving 99.46% zero-shot and 99.46% rollout accuracy. Our results suggest that latent world models can learn symbolic transformation rules when architecture matches the structure of the problem. Our code is available at https://github.com/DL-World-Models/mnist-math.
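A minimal sketch of how a block rotation can encode modulo-10 arithmetic in a latent vector. The abstract does not specify the predictor's parameterization, so everything here is an assumption: the hypothetical `block_rotate` function treats the latent as consecutive 2-D blocks and rotates each by the fixed angle 2πk/10 for the operation "+k". Under that assumption, composing "+3" and "+7" gives a full turn and returns the original latent, which is the cyclic structure the abstract refers to.

```python
import math

def block_rotate(z, k, modulus=10):
    """Rotate every consecutive 2-D block of z by 2*pi*k/modulus (assumed form)."""
    if len(z) % 2 != 0:
        raise ValueError("latent dimension must be even")
    theta = 2.0 * math.pi * k / modulus
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(z), 2):
        x, y = z[i], z[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out

z = [1.0, 0.0, 0.5, -2.0]                      # toy latent with two blocks
z_plus_3 = block_rotate(z, 3)
z_plus_3_plus_7 = block_rotate(z_plus_3, 7)    # 3 + 7 = 10, i.e. 0 mod 10
composed = block_rotate(block_rotate(z, 4), 5) # "+4" then "+5"
direct = block_rotate(z, 9)                    # "+9" in one step
```

Because rotations compose by adding angles, an operation never seen in training (say "+9") is determined by the same rule as the seen ones, which is one way to read the zero-shot result.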

2606.01367 2026-06-02 cs.RO cs.CV

ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo

ActMVS:基于单目多视图立体的主动场景重建

Guo Pu, Yixuan Han, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机研究所)

AI总结 提出ActMVS框架,通过视图因子图构建和全局深度优化,实现单目相机在线生成高质量、全局一致的密集深度图,支持机器人/UAV的主动场景重建与安全轨迹规划。

详情
Comments
ICRA 2026
AI中文摘要

主动场景重建使机器人/UAV能够自主规划轨迹并重建环境,无需昂贵的手动数据采集。与被动方法不同,主动重建需要实时构建高置信度占据地图以实现无碰撞导航。现有方法依赖深度传感器更新占据地图,增加了平台成本和重量。为推进空间智能,我们旨在实现纯视觉单目解决方案。然而,当前单目场景重建方法离线运行,无法在机器人/UAV导航所需的帧率下提供全局一致的密集深度。为弥补这一差距,我们引入ActMVS,这是首个单目主动重建框架。我们的框架集成了用于信息多视图立体深度预测的视图因子图构建,以及全局深度优化,从而实现在线生成高质量、全局一致的密集深度图。这使得单目机器人/UAV能够在重建过程中维护可靠的占据地图,以实现安全的轨迹规划。在Replica数据集上的实验表明,其性能与RGB-D方法相当。我们的代码和数据可在https://github.com/TrickyGo/ActMVS获取。

英文摘要

Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robot/UAV navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on the Replica dataset demonstrate performance competitive with RGB-D methods. Our code and data are available at https://github.com/TrickyGo/ActMVS.

2606.01365 2026-06-02 cs.AI

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

多智能体LLM系统中浪费计算资源的早期诊断:基于故障感知的可观测性

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种故障感知的可观测性框架,通过在线轨迹信号诊断多智能体LLM系统中的浪费计算,并在GAIA验证集上评估,揭示不同故障机制及其与资源消耗的关系。

详情
AI中文摘要

使用工具的多智能体大语言模型(LLM)系统在产生答案之前,通过模型令牌、工具调用、重试和代码执行来消耗计算资源。当运行失败时,最终答案评估揭示了终点,但通常无法揭示轨迹停止可恢复进展的时间点。本文引入了一个故障感知的可观测性框架,用于诊断多智能体LLM轨迹中的浪费计算。该框架将重复出现的故障模式映射到在线轨迹信号,包括工具可靠性、执行恢复、编排循环、证据可用性、信息变化和预算压力。我们在一个三智能体问答系统中实例化该框架,并在相同的执行上限下对165条GAIA验证轨迹进行评估。操作故障仍然常见:22/53的1级运行、33/86的2级运行和12/26的3级运行未能产生可用的最终答案。轨迹揭示了这些结果背后的不同机制,包括证据不足、重复动作循环、最大步数终止、工具故障连续以及成功执行但无有用输出的调用。平均令牌使用量从1级的8,152个令牌上升到3级的16,389个令牌,而证据可用性和句子级支持则出现分歧。一项缓存的10条轨迹LLM评判基础审计表明,廉价的在线信号和更深入的语义指标捕捉了故障的互补层面。结果将故障感知可观测性定位为原始执行日志与最终答案准确性之间的诊断层。

英文摘要

Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three-agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs fail to produce a usable final answer. The traces expose different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.
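The per-level counts quoted in the abstract can be turned into rates with a few lines of arithmetic. The sketch below uses only the numbers stated above (22/53, 33/86, 12/26, and the mean token counts); the variable names are illustrative. It also confirms that the three levels add up to the 165 traces mentioned.

```python
# (failed runs, total runs) per GAIA level, as reported in the abstract
failed = {1: (22, 53), 2: (33, 86), 3: (12, 26)}

# Fraction of runs per level that produced no usable final answer
rates = {level: f / n for level, (f, n) in failed.items()}

total_failed = sum(f for f, _ in failed.values())
total_runs = sum(n for _, n in failed.values())
overall_rate = total_failed / total_runs

# Growth in mean token use from level 1 to level 3
token_ratio = 16389 / 8152

for level, rate in rates.items():
    print(f"level {level}: {rate:.1%} operational failures")
print(f"overall: {total_failed}/{total_runs} = {overall_rate:.1%}")
print(f"mean tokens, level 3 vs level 1: {token_ratio:.2f}x")
```

Roughly four in ten runs fail operationally at every level, while mean token use about doubles from level 1 to level 3.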

2606.01363 2026-06-02 cs.LG cs.SY eess.SY

All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

所有模型都是错的,知道哪里有用:强化学习中的模型不确定性

Bernd Frauenknecht, Devdutt Subhasish, Artur Eisele, Friedrich Solowjow, Sebastian Trimpe

发表机构 * German Federal Ministry of Research, Technology and Space (BMFTR)(德国联邦研究、技术和空间部) Robotics Institute Germany (RIG)(德国机器人研究所) Institute for Data Science in Mechanical Engineering, RWTH Aachen University(机械工程数据科学研究所,亚琛工业大学) NHR Center NHR4CES at RWTH Aachen University(亚琛工业大学NHR4CES中心)

AI总结 提出通过针对性处理概率模型的不确定性来减轻模型利用的框架,并展示在硬件直接学习和安全探索方面的成功。

详情
AI中文摘要

基于模型的强化学习(MBRL)从学习的动力学模型中推断环境信息,并具有解决数据高效和机器人安全学习等开放性问题的潜力。然而,学习到的动力学模型的不准确性通常被智能体利用,严重阻碍了MBRL方法的能力。我们提出了一个通过针对性处理不确定性来有效减轻模型利用的框架。我们展示了在硬件直接学习和安全探索方面的近期成功,并讨论了不确定性感知MBRL的未来方向。

英文摘要

Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and has the potential to address open problems such as data-efficient and safe learning in robotics. However, inaccuracies of the learned dynamics model are typically exploited by the agent, substantially hampering the capabilities of MBRL methods. We present a framework for dealing with inaccuracies of probabilistic models through targeted handling of uncertainty that effectively mitigates model exploitation. We present recent successes in learning directly on hardware and safe exploration, and discuss future directions for uncertainty-aware MBRL.

2606.01361 2026-06-02 cs.CV

Diamonds in the Sky: Pareidolic Animals in Clouds

天空中的钻石:云中的空想性动物

Miriam Horovicz, Yacov Hel-Or, Yael Moses

发表机构 * Reichman University, Israel(里奇曼大学,以色列)

AI总结 提出基于扩散模型的方法,预测人们可能在云中感知到的空想性动物,并通过生成相似形状的动物图像和变形视频辅助识别。

详情
AI中文摘要

人们常在云中看到动物形状,这种现象被称为空想性错视。我们提出一种基于AI的方法,旨在预测人们可能在云中感知到哪些动物,尽管最先进的识别方法通常无法检测到此类动物。此外,我们引入一种方法帮助个体感知特定的空想性动物,即使他们最初未能识别。我们的方法使用扩散模型将云片段转换为视觉上类似于原始云的动物形状。这种扩散技术的灵感来源于观察:扩散过程仅在目标动物与云形状相似时成功,且微妙的视觉线索通常足以帮助个体识别特定的空想性动物。从扩散模型成功生成的图像随后用于预测空想性动物。此外,使用从生成图像过渡回原始云片段的短变形视频进一步增强人类对空想性动物的感知。

英文摘要

People often see animal shapes in clouds, a phenomenon known as pareidolia. We propose an AI-based method that aims to predict which animals people are likely to perceive in clouds, even though state-of-the-art recognition methods typically fail to detect such animals. Additionally, we introduce a method to assist individuals in perceiving specific pareidolic animals, even if they did not recognize them initially. Our approach uses a diffusion model to transform cloud segments into animal shapes that visually resemble the original cloud. This diffusion technique is inspired by the observation that the diffusion process succeeds only when the target animal resembles the shape of the cloud, and that subtle visual hints often suffice to help individuals recognize specific pareidolic animals. A generated image, successfully derived from the diffusion model, is then used to predict the pareidolic animal. Additionally, a short morphing video transitioning from the generated image back to the original cloud segment is employed to further enhance human perception of the pareidolic animals.

2606.01352 2026-06-02 cs.AI

FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

FlowTime: 基于流的个性化先验实现连续生成式观看时间预测

Hongxu Ma, Han Zhou, Chenghou Jin, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Shanghai University of Finance and Economics(上海财经大学) Kuaishou Technology(快手科技) Tongji University(同济大学)

AI总结 针对现有观看时间预测方法在范式上的局限性,提出连续生成式回归范式及FlowTime方法,利用一步生成变分自编码器和基于流的个性化先验,有效建模多模态用户-物品交互模式,显著提升预测性能。

详情
Comments
Accepted by KDD'26
AI中文摘要

观看时间已成为短视频推荐系统中优化深度用户参与度的关键指标。然而,当前的观看时间预测方法存在固有的范式特定局限性。直接回归因单峰高斯假设而面临均值崩溃,序数回归因刚性离散化而受到量化误差的困扰。同样,离散生成式回归则面临高推理延迟和启发式词汇表设计的问题。除了这些具体缺陷外,一个共同的不足是无法捕捉用户-物品交互模式的内在多模态性和异质性。为应对这些挑战,我们首先从因果角度重新审视观看时间预测问题,并将这些用户特定模式识别为调节观看时间结果的结构性混淆因素,其中相同的兴趣在不同用户习惯条件下表现为不同的观看时间结果。然后,我们正式提出一种新的(即第四种)范式——连续生成式回归,并引入FlowTime,一种利用一步生成变分自编码器的新方法。FlowTime有效规避了迭代去噪的延迟,同时保持了连续潜在空间的表达能力。此外,我们设计了一种基于流的个性化先验,利用归一化流将标准高斯先验扭曲为复杂的历史条件流形,从而实现对多模态交互模式的自适应建模。最后,我们构建了TimeRec,首个开源观看时间预测库,并引入一种新的个性化指标,以建立严格的基准测试标准。广泛的离线实验和在线A/B测试表明,FlowTime显著优于现有最先进方法。

英文摘要

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new, fourth paradigm, Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages normalizing flows (NFs) to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over state-of-the-art methods.

2606.01351 2026-06-02 cs.AI

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

识别你的编排器:面向LLM多智能体系统的熵动力学视角

Junze Zhu, Weihao Chen, Xuanwang Zhang, Zhen Wu, Xinyu Dai

发表机构 * Junze Zhu, Weihao Chen, Xuanwang Zhang, Zhen Wu, Xinyu Dai(朱俊泽、陈伟浩、张轩望、伍震、戴新宇)

AI总结 提出平均场熵动力学框架,通过逆工作流生成(IWG)合成高复杂度基准,揭示推理型模型作为编排器时因上下文压缩而失效的“推理陷阱”,为多智能体系统架构设计提供物理可解释参数。

详情
AI中文摘要

从单轮模型到多智能体系统(MAS)的转变有望增强问题解决能力,但集中式编排拓扑仍然是一个关键脆弱点。为分析此问题,我们提出平均场熵动力学框架,将编排过程建模为由任务解决和累积上下文加载两种竞争力量支配的系统。为便于验证,我们引入逆工作流生成(IWG),一种多智能体流水线,用于合成具有密集中间检查点的过程可验证、高复杂度基准。我们证明熵动力学模型拟合经验轨迹,提供量化系统稳定性和性能崩溃的物理可解释参数。关键的是,我们的分析揭示了“推理陷阱”:尽管推理密集型模型在孤立任务中表现出色,但由于上下文压缩,它们作为编排器时经常失败。阐明编排器背后的物理机制并量化系统不确定性,为MAS的架构设计提供了见解。

英文摘要

The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a "Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. Elucidating the physical mechanisms underlying the orchestrator and quantifying systemic uncertainty offers insights for the architectural design of MAS.

2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite:一种轻量级频率分解线性模型,具有自适应可逆归一化,用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * arXiv.org

AI总结 提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器,通过可学习的无损谱滤波器进行频带分解和线性预测,并引入自适应可逆实例归一化(A-RevIN)处理非平稳性,在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

详情
Comments
26 pages, 5 figures
AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色,但仍存在两个问题:可逆实例归一化(RevIN)使用单一回溯统计量对整个预测区间进行去归一化,在非平稳性下不准确;时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器:一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带,由每个频带的线性头进行预测,与低通截断方法不同,高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型,在长回溯(L=336)时,其平均误差低于PatchTST Transformer(0.3244 vs 0.3587 MSE),同时参数减少4倍,内存减少2.2倍,在单块4 GB笔记本GPU上每轮时间减少2.2倍;尽管幅度不大,但在所有匹配单元上的配对Wilcoxon检验中,其改进具有统计显著性(p < 1e-5)。我们进一步引入自适应可逆实例归一化(A-RevIN),一种自适应可逆归一化,严格推广了RevIN(在其门关闭时完全恢复),在非平稳性下起作用,并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集(ILI,MSE降低约5%)和一个受控合成漂移扫描中验证了这一点,其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融(Linear和RLinear是FreqLite的特例),所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.
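A minimal sketch of what a lossless, partition-of-unity spectral split can look like. The abstract does not say how the masks are parameterized, so this assumes one hypothetical choice: per-frequency logits passed through a softmax across bands, so the band weights sum to one at every frequency bin and the band signals add back up to the input. The helper names (`dft`, `idft`, `split_bands`) and the fixed logits are illustrative, and a naive DFT is used to keep the example dependency-free.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def split_bands(x, logits):
    """Split x into len(logits) bands; logits[b][m] scores band b at frequency m."""
    n = len(x)
    X = dft(x)
    n_bands = len(logits)
    # Softmax across bands: weights at each frequency sum to 1 (partition of unity).
    weights = [softmax([logits[b][m] for b in range(n_bands)])
               for m in range(n // 2 + 1)]
    bands = []
    for b in range(n_bands):
        # Bins k and n-k share a weight so each band signal stays real-valued.
        masked = [X[k] * weights[min(k, n - k)][b] for k in range(n)]
        bands.append([v.real for v in idft(masked)])
    return bands

n = 16
low = [math.sin(2 * math.pi * t / n) for t in range(n)]
high = [0.5 * math.sin(2 * math.pi * 6 * t / n) for t in range(n)]
x = [a + b for a, b in zip(low, high)]

# Hypothetical logits: band 0 claims frequencies 0..2, band 1 the rest.
logits = [[20.0 if m <= 2 else -20.0 for m in range(n // 2 + 1)],
          [-20.0 if m <= 2 else 20.0 for m in range(n // 2 + 1)]]

bands = split_bands(x, logits)
reconstruction = [sum(band[t] for band in bands) for t in range(n)]
max_error = max(abs(r - v) for r, v in zip(reconstruction, x))
```

In a trained model the logits would be learned and each band would feed its own linear head; the point here is only that no spectral content is discarded, in contrast to low-pass truncation.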

2606.01338 2026-06-02 cs.CL

Benchmarking Local LLMs for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware

在生物制药制造中本地LLM的自然语言到SQL查询基准测试:消费级硬件上的实证基准

Sagar Bhetwal, Rajan Bastakoti, Nirajan Acharya, Gaurav Kumar Gupta

发表机构 * Department of Computer Science, University of the Cumberlands(坎伯兰大学计算机科学系) Department of Computer Science, DePaul University(德保罗大学计算机科学系) Youngstown State University(扬斯敦州立大学)

AI总结 本研究评估了四种本地部署的开源大语言模型在生物制药制造数据库上的自然语言到SQL生成性能,发现代码调优的通用模型优于领域特定模型,但当前性能仍需人工监督。

详情
AI中文摘要

生物制药制造组织在FDA指南、欧盟良好生产规范(GMP)和欧盟AI法案等监管框架下运营,这些框架可能限制基于云的人工智能系统的使用。本地部署的大语言模型(LLM)提供了一种保护隐私的替代方案,但它们在制药制造任务中的适用性仍未得到充分探索。本研究评估了四种通过Ollama本地部署的开源LLM(Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B和Meditron 7B)在制药制造数据库上的自然语言到SQL生成能力。开发了一个基于FastAPI的评估平台PharmaBatchDB AI,使用一个包含约63,000条记录的合成Microsoft SQL Server数据库,涵盖批次、制造执行系统(MES)和在线清洗(CIP)模块。模型在60个领域特定的自然语言问题上进行了基准测试,使用的指标包括SQL提取率、SQL合规性、事实一致性、ROUGE-L、幻觉率、吞吐量和延迟。Qwen 2.5 Coder 7B、Llama 3.1 8B和Mistral 7B为所有评估任务生成了SQL,而Meditron 7B由于上下文窗口限制和SQL生成能力差,几乎在所有任务上失败。Llama 3.1 8B实现了最高的SQL合规性,而Qwen 2.5 Coder 7B在整体文本相似性和事实一致性方面最强。两个领先模型之间的性能差异在统计上不显著。结果表明,代码调优的通用LLM在制药制造数据的结构化查询生成上优于领域特定的生物医学模型。尽管完全本地化、符合GxP的NLQ系统在消费级硬件上是可行的,但当前性能水平在监管使用中仍需人工监督和下游验证。

英文摘要

Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use.

2606.01336 2026-06-02 cs.CL

LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning

LongAttnComp:跨族上下文压缩用于长上下文推理

Mengmeng Ji, Ravi Shanker Raju, Jonathan Lingjie Li, Chen Wu

发表机构 * SambaNova Systems, Inc.(SambaNova系统公司)

AI总结 提出LongAttnComp方法,通过微调轻量级交叉注意力评分层并引入令牌级分块、令牌预算top-p算法、位置重排序和格式无关查询解析器,结合两阶段微调策略,在长上下文推理任务中实现与全上下文相当或更优的准确率。

详情
Comments
Under review
AI中文摘要

随着实际应用越来越需要处理10万+令牌的输入,上下文长度与推理效率之间的差距已成为关键瓶颈。上下文压缩提供了一种在保持任务准确性的同时降低预填充成本的方法。然而,现有的无训练注意力方法在代码推理等要求高的长上下文任务中留下了显著差距。我们提出了LongAttnComp,这是AttnComp的长上下文适配,它微调了一个轻量级的交叉注意力评分层,并引入了令牌级分块、令牌预算top-p算法、位置重排序和格式无关的查询解析器。我们进一步为压缩器设计了两阶段微调方案:阶段1从NIAH风格数据构建通用检索基础,阶段2通过多跳和推理数据扩展,以覆盖更广泛的长上下文任务。在InfiniteBench Code-Debug上,LongAttnComp匹配或超过全上下文准确率,显著优于无训练基线,并在来自三个族的四个目标模型上迁移。在LongBench v2上,两阶段方案在很大程度上缩小了阶段1在多文档推理上的差距,同时保持了Code-Debug的性能。

英文摘要

As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces token-level chunking, a token-budget top-p algorithm, positional reordering, and a format-agnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
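The abstract names a "token-budget top-p algorithm" and "positional reordering" without describing them, so the following is only one plausible reading, not the paper's procedure. In this hypothetical `select_chunks`, chunk scores are normalized to a distribution, chunks are taken in descending score order until their cumulative mass reaches `top_p`, any chunk that would exceed `token_budget` is skipped, and the kept chunks are returned in their original document order.

```python
def select_chunks(scores, lengths, top_p=0.75, token_budget=100):
    """Pick chunk indices by top-p mass under a token budget (assumed scheme)."""
    total = sum(scores)
    probs = [s / total for s in scores]
    # Highest-scoring chunks first; ties keep their original order.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, mass, used = [], 0.0, 0
    for i in order:
        if mass >= top_p:
            break                      # enough attention mass covered
        if used + lengths[i] > token_budget:
            continue                   # chunk does not fit, try a smaller one
        kept.append(i)
        mass += probs[i]
        used += lengths[i]
    return sorted(kept)                # restore original positions

scores = [2, 10, 1, 6, 1]              # hypothetical relevance scores per chunk
lengths = [40, 30, 20, 50, 10]         # tokens per chunk

roomy = select_chunks(scores, lengths, top_p=0.75, token_budget=100)
tight = select_chunks(scores, lengths, top_p=0.75, token_budget=60)
```

With the roomy budget the two strongest chunks already cover the requested mass; with the tight budget the second-strongest chunk no longer fits, so smaller chunks fill the remaining space instead.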

2606.01334 2026-06-02 cs.CV

HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition

HOLA: 面向开放集3D识别的全息多模态对齐

Koby Aharonov, Oren Shrout, Ayellet Tal

发表机构 * Technion – Israel Institute of Technology(以色列理工学院)

AI总结 提出HOLA方法,通过解耦多正例对比损失和对齐点云与多视图图像及文本描述,实现开放集3D识别中的全息多模态对齐,在长尾基准上取得最先进零样本性能。

详情
AI中文摘要

开放集3D识别需要模型能够泛化到罕见或未见类别。最近的方法通过将语言-视觉知识蒸馏到3D编码器来解决这一问题,通常依赖重型2D ViT,并将每个点云与单张图像或标题对齐,从而将表示锚定到局部视图。我们提出将每个点云与多张图像和文本描述对齐,以捕获对3D对象的更全面理解。为实现这一想法,必须设计一个损失函数,能够联合对齐一个3D实例与多个匹配信号(多视图图像和多个文本),同时将正例聚合与负例竞争分离。我们引入了这样的函数,称为解耦多正例对比损失。我们的公式增强了损失对困难负例的难度感知关注,避免了当多个正例与所有负例共享同一个softmax时出现的“聚光灯拥挤”现象。作为补充,我们提出了一个轻量级文本适配器,仅应用于网络标题,减少了与精心标注之间的领域差距,并能够有效利用大规模无监督文本。我们的模型在长尾基准上展示了最先进的开放词汇性能,在保持高帧率的同时实现了显著的零样本改进。

英文摘要

Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals, multi-view images and multiple texts, while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss's hardness-aware focus on challenging negatives, avoiding the "spotlight crowding" that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.

2606.01332 2026-06-02 cs.RO

S2M-Trek: From Single to Multi-Sphere Transport via Per-Frame Deep Sets on a Wheel-Legged Robot

S2M-Trek: 从单球到多球运输:基于轮腿机器人的逐帧深度集方法

Zong Chen, Xuebin Li, Jinpeng Xiao, Shaoyang Li, Ben Liu, Min Li, Zhouping Yin, Yiqun Li

发表机构 * School of Mechanical Science and Engineering, Huazhong University of Science and Technology(华中科技大学机械科学与工程学院) School of Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学学院)

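AI summary: For the dynamic manipulation problem of transporting several free-rolling spheres simultaneously on the back of a wheel-legged quadruped robot, the paper proposes a Per-Frame Deep Sets (PFDS) encoder. Per-frame permutation-invariant pooling resolves the permutation-symmetry mismatch of history-concatenation encoders and achieves 100% no-drop transport of five spheres in simulation.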

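Details
AI abstract (translated from Chinese)

We study the problem of scaling dynamic manipulation from a single free-rolling sphere to several spheres transported simultaneously on the back of a wheel-legged quadruped robot, with no fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their order may change independently in every history frame, which produces a per-frame permutation symmetry that standard history-concatenation set encoders do not explicitly enforce; those encoders impose only a shared, diagonal permutation symmetry over the whole history. We show that this symmetry mismatch causes a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders stall at or below the two-sphere stage, and the history-concatenation Deep Sets baseline (HCDS) cannot get past the two-sphere stage in our runs unless ball-to-slot assignments are randomized during training, which suggests that it exploits slot indices as a curriculum shortcut instead of learning identity-free multi-sphere dynamics. We propose Per-Frame Deep Sets (PFDS), which performs permutation-invariant pooling within each history frame before the temporal readout; we prove that PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2 \times 2$ ablation (encoder architecture and slot randomization) separates the architectural and data-augmentation pathways; PFDS reaches the five-sphere stage on all five random seeds, with 100% no-drop transport in simulation. We further distill the PFDS teacher into TactSet via DAgger, replacing privileged sphere-state observations with a $16 \times 16$ Boolean union contact map, which yields a compact and naturally $\Gframe$-invariant tactile representation.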

English abstract

We study the problem of scaling dynamic loco-manipulation from a single free-rolling sphere to multiple spheres transported simultaneously on the back of a wheel-legged quadruped, without fences, grippers, or mechanical stops. Multiple identical free-rolling spheres form an unordered set with no persistent identity: their ordering may change independently at each history frame, creating a per-frame permutation symmetry that standard history-concatenation set encoders do not explicitly enforce; these encoders impose only a shared, diagonal permutation symmetry over the full history. We show that this symmetry mismatch leads to a concrete failure mode in curriculum-based reinforcement learning. Within the same PPO training budget, flat MLPs and branch-wise encoders plateau at or below the two-sphere stage, while a history-concatenation Deep Sets baseline (HCDS) fails to progress past the two-sphere stage in our runs unless ball-to-slot assignments are randomised during training, suggesting that it exploits slot indices as a curriculum shortcut rather than learning identity-free multi-sphere dynamics. We propose Per-Frame Deep Sets (PFDS), which performs permutation-invariant pooling within each history frame before temporal readout; we prove that PFDS is $\Gframe$-invariant and universally approximates continuous $\Gframe$-invariant policies. A $2 \times 2$ ablation over encoder architecture and slot randomisation separates the architectural and data-augmentation pathways, and PFDS reaches the five-sphere stage with 100% no-drop transport in simulation across all five random seeds. We further distill the PFDS teacher into TactSet via DAgger, replacing privileged sphere-state observations with a $16 \times 16$ Boolean union contact map, yielding a compact and naturally $\Gframe$-invariant tactile representation.
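The symmetry mismatch described in this abstract can be illustrated with a toy example. The sketch below is not the authors' network: the 2-D sphere state, the hand-written feature maps, and the function names are all hypothetical stand-ins for learned components. It only shows that pooling inside each frame is unaffected by reordering spheres independently per frame, whereas concatenating each slot's history before pooling is invariant only to a reordering shared by all frames.

```python
from typing import List, Sequence, Tuple

Ball = Tuple[float, float]   # hypothetical 2-D sphere state
Frame = Sequence[Ball]       # one history frame, spheres in arbitrary slot order


def ball_features(ball: Ball) -> List[float]:
    """Toy per-sphere feature map (stands in for a learned network)."""
    x, y = ball
    return [x, y, x * y]


def token_features(token: Sequence[float]) -> List[float]:
    """Toy feature map over one slot's concatenated history (mixes neighbouring entries)."""
    return [sum(a * b for a, b in zip(token, token[1:]))]


def sum_pool(rows: Sequence[Sequence[float]]) -> List[float]:
    """Permutation-invariant pooling: element-wise sum over the set."""
    return [sum(col) for col in zip(*rows)]


def per_frame_encode(history: Sequence[Frame]) -> List[float]:
    """Pool inside each frame first, then read out over time (here: concatenation)."""
    out: List[float] = []
    for frame in history:
        out.extend(sum_pool([ball_features(b) for b in frame]))
    return out


def history_concat_encode(history: Sequence[Frame]) -> List[float]:
    """Concatenate each slot's states over time, then pool once over slots."""
    n_slots = len(history[0])
    tokens = [[v for frame in history for v in frame[s]] for s in range(n_slots)]
    return sum_pool([token_features(t) for t in tokens])


history = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0), (7.0, 8.0)]]
shared = [list(reversed(f)) for f in history]            # same swap in every frame
independent = [history[0], list(reversed(history[1]))]   # swap only in the last frame

print(per_frame_encode(history) == per_frame_encode(independent))            # True
print(history_concat_encode(history) == history_concat_encode(shared))       # True
print(history_concat_encode(history) == history_concat_encode(independent))  # False
```

The last comparison fails because swapping spheres in only one frame changes which states are glued together in each slot's token, which is the kind of slot dependence the abstract attributes to history-concatenation encoders.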

2606.01329 2026-06-02 cs.LG q-bio.BM

Conditioned free-energy density of proteins using unbalanced solutions to constraint satisfaction problems

使用约束满足问题的不平衡解的条件化蛋白质自由能密度

Pratik Worah, Subhash Khot, Srinivasa Varadhan

Affiliations * CIMS, NYU (Courant Institute of Mathematical Sciences, New York University)

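AI summary: The paper reduces computing the log-partition function (free energy) of conditioned inhomogeneous Curie-Weiss spin Hamiltonians to an unbalanced $2 \to 1$ norm computation, designs a polynomial-time SDP algorithm for it, and applies the method to the protein Ubiquitin to explore the free-energy landscape and identify flexible regions.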

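Details
AI abstract (translated from Chinese)

We show that computing the log-partition function (free energy) of conditioned inhomogeneous Curie-Weiss spin Hamiltonians reduces to an unbalanced $2 \to 1$ norm computation. We design a polynomial-time SDP algorithm for this problem and prove a lower bound on the amount of unbalance it achieves. Applied to the protein Ubiquitin, the framework starts from a known crystal structure, explores alternative backbone conformations in the free-energy landscape, and identifies the protein's flexible regions while preserving its native secondary structure.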

English abstract

We show that computing the log-partition function (free energy) of conditioned inhomogeneous Curie-Weiss spin Hamiltonians reduces to an unbalanced $2 \to 1$ norm computation, and we design a polynomial-time SDP algorithm for this problem together with a proven lower bound on the amount of unbalance achieved. Applied to the protein Ubiquitin, the framework starts from a known crystal structure, explores alternative backbone conformations across the free-energy landscape, and identifies flexible regions of the protein while preserving its native secondary structure.
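For orientation, under the standard definition the $2 \to 1$ operator norm of a matrix $A$ is $\max_{\|x\|_2 \le 1} \|Ax\|_1$, which equals $\max_{s \in \{\pm 1\}^m} \|A^{\top} s\|_2$ over sign vectors $s$. The sketch below brute-forces that identity for tiny matrices. It covers only the ordinary norm: the unbalanced, conditioned variant and the SDP algorithm from the abstract are not reproduced here, and the function name is a hypothetical choice.

```python
import itertools
import math
from typing import Sequence


def two_to_one_norm(a: Sequence[Sequence[float]]) -> float:
    """Brute-force ||A||_{2->1} as the max over sign vectors s of ||A^T s||_2."""
    n_rows, n_cols = len(a), len(a[0])
    best = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n_rows):
        # j-th entry of A^T s
        col = [sum(signs[i] * a[i][j] for i in range(n_rows)) for j in range(n_cols)]
        best = max(best, math.sqrt(sum(v * v for v in col)))
    return best


print(two_to_one_norm([[3.0, 4.0]]))              # 5.0
print(two_to_one_norm([[1.0, 0.0], [0.0, 1.0]]))  # sqrt(2)
```

The enumeration visits $2^m$ sign vectors for an $m$-row matrix, so it is usable only as a correctness check on very small instances.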