arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.08364 2026-06-09 cs.CV cs.AI New submission

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

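Affiliations: Herman Ostrow School of Dentistry, University of Southern California; Viterbi School of Engineering, University of Southern California

AI summary: Studies how well DINO-family self-supervised ViTs transfer to CBCT-based detection of temporomandibular joint osteoarthritis, and finds that partially unfreezing the final two transformer blocks raises AUC from 0.671 to 0.902, indicating that adaptation strategy matters more than backbone choice.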

Abstract

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
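
The abstract describes aggregating per-slice ViT embeddings with attention-based multiple instance learning. The sketch below shows one common formulation of attention pooling (a softmax over scores of the form w . tanh(V h_k)); whether the paper uses exactly this form is an assumption, and the parameter values `V`, `w` and the toy embeddings are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mil_pool(slice_embeddings, V, w):
    """Pool per-slice embeddings into one patient-level vector.

    Attention score for slice k: w . tanh(V h_k); weights are the softmax
    of the scores; the output is the weighted sum of slice embeddings.
    V and w stand in for learned parameters (hypothetical values here).
    """
    scores = []
    for h in slice_embeddings:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    weights = softmax(scores)
    dim = len(slice_embeddings[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, slice_embeddings))
              for d in range(dim)]
    return pooled, weights

# Toy example: three "slices" with 2-D embeddings (made-up numbers).
slices = [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
pooled, weights = attention_mil_pool(slices, V, w)
```

In a full pipeline the pooled vector would feed a binary OA/Normal classifier head, and the attention weights give a rough indication of which slices drove the patient-level prediction.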

2606.08360 2026-06-09 cs.LG cs.AI New submission

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe

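Affiliations: Harvard University

AI summary: Addresses covariate-dependent arrivals in peer-referral recruitment with Generative Frontier Planning (GFP), which makes planning efficient through deterministic backups and marginal greedy allocation, and outperforms baseline methods in simulation experiments.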

Abstract

Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d. from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose Generative Frontier Planning (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a (1-1/e)-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d. dynamic-programming baselines across four discount factors.
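
The marginal greedy allocation mentioned in the abstract can be illustrated with a small sketch: coupons are handed out one at a time, each to the recruit whose next coupon adds the most value. The value function below (`toy_value`, with made-up weights and a made-up recruitment probability) is only a stand-in with diminishing returns, not the paper's covariate-coverage surrogate.

```python
def greedy_allocate(candidates, budget, value):
    """Hand out `budget` referral coupons one at a time, each to the
    candidate whose extra coupon adds the most to `value`."""
    alloc = {c: 0 for c in candidates}
    for _ in range(budget):
        base = value(alloc)
        best, best_gain = None, 0.0
        for c in candidates:
            alloc[c] += 1                 # tentatively give c one more coupon
            gain = value(alloc) - base    # marginal gain of that coupon
            alloc[c] -= 1
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:                  # no coupon adds value any more
            break
        alloc[best] += 1
    return alloc

# Hypothetical surrogate with diminishing returns: each coupon given to
# recruit c succeeds independently with probability P_RECRUIT, and
# WEIGHT[c] is the value of reaching c's part of the population at least once.
WEIGHT = {"a": 1.0, "b": 0.6}
P_RECRUIT = 0.5

def toy_value(alloc):
    return sum(WEIGHT[c] * (1.0 - (1.0 - P_RECRUIT) ** k)
               for c, k in alloc.items())

allocation = greedy_allocate(["a", "b"], budget=3, value=toy_value)
```

Here the first coupon goes to "a" (gain 0.5), the second to "b" (gain 0.3), and the third back to "a" (gain 0.25), because each additional coupon for the same recruit is worth less than the previous one.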

2606.08357 2026-06-09 cs.CL New submission

Forward-Free Diffusion Language Models

Haotian Sun, Rushi Qiang, Yuqian Zheng, Bo Dai

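Affiliations: Georgia Institute of Technology

AI summary: Proposes FReDA, a diffusion language model that needs no hand-designed forward process; it uses recursive distribution refinement with model-generated drafts as implicit intermediate states, outperforms larger models on reasoning and coding tasks, and achieves a 1.5-1.8x speedup.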

Abstract

Diffusion language models generate text through iterative denoising, offering a powerful alternative to autoregressive generation. However, discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so artificial corruption schemes have been proposed for the forward process. Such prescribed forward processes often produce states that are mathematically convenient but misaligned with the drafts and errors encountered during generation, resulting in degraded sample quality. To address this limitation, we propose FReDA, a forward-free diffusion language model that eliminates the need for a hand-designed forward process. We formulate diffusion language modeling as recursive distribution refinement, in which model-generated drafts serve as implicit intermediate states, and the learned refinement model progressively moves the draft distribution toward the target distribution. Concretely, FReDA refines drafts by proposing candidate draft sequences and either directly performing self-refinement or selecting among parallel candidates via best-of-N refinement. With this design, FReDA is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations. Extensive evaluations in the sub-8B regime show that FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, achieving absolute gains of up to 15%, while reaching a 1.5-1.8x average speedup over diffusion baselines and scaling effectively with additional refinement computation.

2606.08348 2026-06-09 cs.CL New submission

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Xiaojun Wu, Cehao Yang, Honghao Liu, Xueyuan Lin, Wenjie Zhang, Zhichao Shi, Xuhui Jiang, Chengjin Xu, Jia Li, Jian Guo

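Affiliations: IDEA Research; The Hong Kong University of Science and Technology (Guangzhou); DataArcTech Ltd.

AI summary: Proposes the Bayesian-Agent framework, which treats reusable skills as hypotheses and uses posterior distributions to guide skill evolution (patch, split, compress, and so on); it improves performance substantially on several benchmarks, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization.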

Comments
15 pages, 6 figures
Abstract

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
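
To make the idea of a feature-conditioned categorical posterior concrete, here is a minimal Dirichlet-categorical sketch. It is not the code in the Bayesian-Agent repository: the class `SkillPosterior`, the outcome categories, the feature names, and the thresholds that map posterior state to actions are all hypothetical, and only a subset of the actions named in the abstract is covered.

```python
from collections import defaultdict

# Hypothetical outcome categories for a verified trajectory.
OUTCOMES = ("success", "tool_error", "wrong_answer")

class SkillPosterior:
    """Dirichlet-categorical posterior over outcomes, one per context feature."""

    def __init__(self, prior=1.0):
        self.prior = prior  # symmetric Dirichlet pseudo-count
        self.counts = defaultdict(lambda: {o: 0 for o in OUTCOMES})

    def update(self, feature, outcome):
        """Record one verified trajectory outcome under a context feature."""
        self.counts[feature][outcome] += 1

    def evidence(self, feature):
        """Number of observed trajectories for this feature."""
        return sum(self.counts[feature].values())

    def mean(self, feature):
        """Posterior mean probability of each outcome category."""
        c = self.counts[feature]
        total = sum(c.values()) + self.prior * len(OUTCOMES)
        return {o: (c[o] + self.prior) / total for o in OUTCOMES}

def choose_action(post, feature, min_evidence=5):
    """Map posterior state to an action (thresholds are illustrative only)."""
    if post.evidence(feature) < min_evidence:
        return "explore"            # too little evidence to judge the skill
    p_success = post.mean(feature)["success"]
    if p_success >= 0.8:
        return "compress"           # reliable skill: shorten its prompt text
    if p_success <= 0.2:
        return "retire"             # skill rarely helps in this context
    return "patch"                  # mixed record: add a failure-mode patch

post = SkillPosterior()
action_new = choose_action(post, "long_context")
for _ in range(8):
    post.update("long_context", "success")
action_good = choose_action(post, "long_context")
for _ in range(6):
    post.update("short_context", "tool_error")
action_bad = choose_action(post, "short_context")
```

The pseudo-count prior is what separates this from raw success counting: with few observations the posterior mean stays near uniform, so the rule falls back to exploration instead of trusting a handful of outcomes.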

2606.08346 2026-06-09 cs.CL cs.LG New submission

CATPO: Critique-Augmented Tree Policy Optimization

Ayush Singh, Umang Goyal, Ankur Dahiya

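Affiliations: Indian Institute of Technology Roorkee; Vision and Language Group

AI summary: Proposes CATPO, which uses a tree informativeness score and critique-guided healing to keep uninformative trees from wasting compute in tree-structured reinforcement learning, and improves accuracy on mathematical reasoning tasks.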

Comments
14 pages, 1 figure, 6 tables
Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.

2606.08341 2026-06-09 cs.RO New submission

Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

Fnu Heman, Yixuan Wang, Kolin Xu, Conner Wallace, John Dang, Akhil Joshi, Jun Sheng, Pinhas Ben-Tzvi, Mingyu Cai

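Affiliations: University of California, Riverside; University of Miami

AI summary: Proposes an uncertainty-aware intention prediction framework that combines hierarchical transfer learning, conformal prediction, and VLM-guided correction; pretraining on human demonstrations improves action segmentation with only a small amount of robot data.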

Comments
7 pages, 6 figures. Preprint version
Abstract

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for enabling timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems require continuous understanding of user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly and hardware-limited, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this challenge, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) VLM-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results show that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.
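
Frame-level prediction sets with a coverage guarantee can be produced by split conformal prediction. The sketch below is the standard textbook recipe, not necessarily the paper's module: the nonconformity score (one minus the probability of the true class) and the toy calibration data are assumptions, and the guarantee relies on calibration and test frames being exchangeable, which may hold only approximately for video frames.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a score threshold for target coverage 1 - alpha.

    Nonconformity score of a calibration frame: 1 - p(true class).
    The threshold is the ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:          # too few calibration frames: keep every class
        return 1.0
    return scores[k - 1]

def prediction_set(probs, qhat):
    """All classes whose nonconformity score is within the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= qhat]

# Toy calibration set: 3 frames, 3 action classes (made-up probabilities).
cal_probs = [[0.9, 0.1, 0.0], [0.4, 0.6, 0.0], [0.3, 0.3, 0.4]]
cal_labels = [0, 1, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

confident_set = prediction_set([0.9, 0.05, 0.05], qhat)
ambiguous_set = prediction_set([0.5, 0.45, 0.05], qhat)
```

A confident frame yields a single-class set, while an ambiguous frame yields several classes; set size is then a natural signal for deciding which segments to send for closer review.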

2606.08340 2026-06-09 cs.AI cs.LG cs.MA New submission

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey

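Affiliations: University of Edinburgh; University of Oxford; University College London

AI summary: Introduces Alem, a JAX-based benchmark for open-ended multi-agent coordination, and evaluates the zero-shot coordination ability of 13 modern LLMs in a long-horizon survival world, finding that coordination is a distinct bottleneck for frontier LLM agents.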

Comments
42 pages, preprint
Abstract

As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.

2606.08336 2026-06-09 cs.CV New submission

Beyond Raw Signals: Undecoded Generative Latents as Privileged Synthetic Data

Cristian Sbrolli, Nicolas Michel, Matteo Matteucci, Toshihiko Yamasaki

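Affiliations: Politecnico di Milano; The University of Tokyo

AI summary: Proposes Direct Latent Augmentation (DLA), which uses undecoded generative latents as privileged information, and Multilayer Explicit Simulated Synesthesia (MESSy), which transfers this dense knowledge to a purely visual student model, avoiding the inefficiency of the decode-encode loop.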

Abstract

While multimodal integration significantly improves computer vision models, deploying them incurs prohibitive inference costs and requires scarce, perfectly paired datasets. Recent methods address this data bottleneck by synthesizing missing modalities via generative AI, yet they introduce a severe inefficiency: the Decode-Encode Loop. Specifically, information-rich generative latents are decoded into noisy raw signals, forcing the downstream classifier to waste capacity re-encoding them. To bypass this bottleneck, we propose Direct Latent Augmentation (DLA), utilizing undecoded generative latents directly as privileged information. Furthermore, to transfer this dense knowledge to a purely visual student, we introduce Multilayer Explicit Simulated Synesthesia (MESSy). Instead of enforcing rigid representation matching, which forces the student to distort its native visual features to accommodate complex multimodal topologies, MESSy uses a predictive objective to safely internalize these physical priors. Empirical results demonstrate that our framework significantly outperforms raw data augmentation and traditional distillation. Ultimately, our approach yields highly accurate unimodal students with "synesthetic" latent structures that are inherently aligned with physical properties they have never directly observed.

2606.08332 2026-06-09 cs.CV New submission

SMI: Efficient Self-Supervised Learning via Mutual-Information-Inspired Dependency Optimization

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

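Affiliations: Universitat Pompeu Fabra; Universitat Autònoma de Barcelona

AI summary: Proposes SMI, which optimizes self-supervised learning through a nonlinear transformation of a sample-level dependency matrix; with a ResNet-50 on ImageNet it reaches competitive performance at lower computational complexity, and it improves transfer performance on low-resource tasks.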

Abstract

Self-supervised learning (SSL) has achieved remarkable representation learning performance, but many existing methods rely on large batch sizes, memory banks, momentum encoders, or global synchronization mechanisms that substantially increase computational cost and training complexity. In this work, we propose Semantic Mutual Information (SMI), a lightweight self-supervised objective derived from a mutual-information-inspired dependency formulation under Gaussian assumptions. Unlike conventional correlation matching objectives that operate on high-dimensional feature correlation matrices, SMI performs optimization on a sample-level dependency matrix through a nonlinear transformation of pairwise correlations. This formulation induces distinct optimization dynamics that emphasize strongly dependent semantic pairs while maintaining representation diversity. Experimental results on ImageNet using a ResNet-50 backbone demonstrate that SMI achieves competitive linear evaluation performance relative to state-of-the-art SSL approaches while substantially reducing computational complexity. Across multiple low-resource benchmarks, SMI consistently improves transfer performance over Barlow Twins, particularly on fine-grained datasets. Furthermore, analyses of optimization dynamics and representation geometry suggest improved alignment-redundancy balance, greater feature diversity, and more spatially localized semantic representations. These results indicate that nonlinear dependency optimization provides an effective and computationally efficient alternative to conventional correlation-based self-supervised learning objectives.

2606.08327 2026-06-09 cs.CL cs.AI cs.LG New submission

Chiaroscuro Attention: Spending Compute in the Dark

Prateek Kumar Sikdar

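Affiliations: Accenture

AI summary: Proposes CHIAR-Former, a hybrid Transformer with spectral-entropy routing in which DCT spectral mixing complements full attention; it reaches PPL 36.54 on WikiText-103 with 62.5% fewer attention FLOPs, a 45% improvement over the full-attention baseline.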

Comments
8 pages, 6 figures, 3 tables
Abstract

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
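
One plausible reading of per-token spectral entropy is the Shannon entropy of a token embedding's normalised DCT power spectrum. The sketch below computes that quantity and applies a threshold rule; the exact definition used by CHIAR-Former, the direction of the routing rule, and the threshold value are not given in the abstract and are assumptions here.

```python
import math

def dct2(x):
    """Unnormalised DCT-II of a list of floats."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi / n * (t + 0.5) * k) for t in range(n))
            for k in range(n)]

def spectral_entropy(x, eps=1e-12):
    """Entropy of the normalised DCT power spectrum, scaled to about [0, 1]."""
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps
    probs = [p / total for p in power]
    h = -sum(p * math.log(p + eps) for p in probs)
    return h / math.log(len(x))

def route(x, threshold=0.5):
    """Hypothetical rule: spectrally simple tokens go to cheap DCT mixing,
    spectrally complex tokens go to full self-attention."""
    return "dct" if spectral_entropy(x) < threshold else "attention"

flat_token = [1.0] * 8            # all energy in the DC coefficient
spiky_token = [1.0] + [0.0] * 7   # energy spread over many coefficients
flat_route = route(flat_token)
spiky_route = route(spiky_token)
```

A constant vector concentrates its energy in a single coefficient and scores near zero, while an impulse spreads energy across the spectrum and scores high, which is the contrast such a router would exploit.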

2606.08322 2026-06-09 cs.LG stat.ME New submission

Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA

Andreas Schlapbach

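Affiliations: Swiss Federal Railways (SBB); University of Berne

AI summary: Replicates the clustering experiment of Renold et al. on US airline profit cycles from 1995 to 2020; using PCA and kernel PCA, it finds that the six-cluster taxonomy is geometrically robust across the original 7-dimensional space and the 3-dimensional PC space, and confirms that the data lie on an intrinsically linear manifold.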

Abstract

To characterize the US airline profit cycles from 1995 to 2020, Renold et al. (2023) combine k-means clustering, principal component analysis, and system dynamics modelling. We replicate their clustering experiment in three spaces -- the original 7-dimensional raw-variable space, a 3-dimensional PC score space, and a 4-dimensional PC score space -- using the dataset that they graciously included in their paper. We show that the six-cluster taxonomy is geometrically robust: k-means in 3-PC space produces cluster assignments that are bit-for-bit identical to those in 7D raw space. As a nonlinearity check, we apply kernel PCA under six kernels spanning three families plus a linear baseline. All six kernels preserve the six-cluster assignment in 2D. A 1D diagnostic tightens this: the linear kernel conflates the COVID year C_3 with the peak-profit cluster C_0, whereas all five non-baseline kernels shift C_3 to overlap only the post-financial-crisis cluster C_5. Agreement across the kernel families confirms an intrinsically linear manifold with no hidden curvature. The silhouette criterion reveals that the dataset structurally supports only three clusters, not six. Collinearity in the raw 7D space suppresses the silhouette signal that would otherwise identify k=3 as the structurally motivated choice.

2606.08314 2026-06-09 cs.AI New submission

Integrating Deep Learning Demand Forecasting with Multi-Objective Optimization for Circular Coffee Supply Chains: A Data-Driven Framework for Cost, Emissions, and Freshness Management

Gerçek Budak, Faraz Gholamzadeh Gharehgheshlaghi, Melika Barjesteh Vaezi, Ahmad Gholizadeh Lonbar

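Affiliations: Ankara Yıldırım Beyazıt University; Texas Tech University; University of Alabama

AI summary: Proposes a two-phase framework: a CNN-LSTM model first forecasts demand (MAE = 22.87, R² = 0.90), then a tri-objective MILP model optimizes cost, carbon emissions, and freshness; 25 Pareto solutions are obtained for the circular supply chain, and a balanced policy cuts emissions by 22.4% for only a 9.9% cost increase.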

Abstract

The coffee supply chain is one of the most complex agri-food networks, marked by geographically dispersed production, multi-tier coordination, and high sensitivity to quality and freshness. While sustainability and digitalization have gained attention, demand forecasting, optimization, and traceability are often treated separately. This study presents a two-phase integrated framework. First, a hybrid CNN-LSTM model is used for demand forecasting. On the public Coffee Chain Sales dataset with chronological 70/15/15 splitting, the model achieves MAE of 22.87 and R^2 of 0.90, outperforming the best deep learning benchmark by ~12% and classical methods by over 30%. In the second phase, the forecasted demand feeds a tri-objective mixed-integer linear programming (MILP) model that jointly minimizes cost, minimizes carbon emissions, and maximizes product freshness in a multi-period, multimodal, closed-loop supply chain with circular recovery. Freshness is modeled via exponential decay based on inventory age. Using the epsilon-constraint method, 25 Pareto solutions are obtained. Sensitivity and policy analyses show that balanced sustainability policies can reduce emissions by 22.4% with only a 9.9% cost increase while maintaining near-optimal freshness. Keywords: Coffee supply chain; Deep learning; Demand forecasting; Multi-objective optimization; Circular economy; CNN-LSTM; Mixed-integer linear programming.
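
Two ingredients of the second phase can be illustrated in a few lines: exponential freshness decay with inventory age, and the epsilon-constraint idea of minimising one objective while capping another. This is a toy sketch under stated assumptions: the decay rate and the candidate plans are invented, only two objectives are shown, and a finite list of plans is enumerated where the paper solves a tri-objective MILP.

```python
import math

def freshness(age_days, decay_rate):
    """Exponential freshness decay: 1.0 when fresh, approaching 0 with age."""
    return math.exp(-decay_rate * age_days)

def epsilon_constraint(plans, emission_caps):
    """For each emissions cap (epsilon), pick the cheapest feasible plan.

    Sweeping the cap traces out points on the cost/emissions Pareto front.
    """
    front = []
    for cap in emission_caps:
        feasible = [p for p in plans if p["emissions"] <= cap]
        if not feasible:
            continue
        best = min(feasible, key=lambda p: p["cost"])
        if best not in front:
            front.append(best)
    return front

# Hypothetical candidate plans (made-up cost and emissions figures).
plans = [
    {"name": "A", "cost": 100, "emissions": 50},
    {"name": "B", "cost": 110, "emissions": 40},
    {"name": "C", "cost": 130, "emissions": 30},
    {"name": "D", "cost": 140, "emissions": 45},  # dominated by B
]
front = epsilon_constraint(plans, emission_caps=[50, 40, 30])
front_names = [p["name"] for p in front]
```

Plan D never appears on the front because plan B is both cheaper and cleaner; tightening the cap moves the selection from A to B to C, trading cost for emissions.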

2606.08312 2026-06-09 cs.AI cs.FL New submission

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

Ashkan Ansarifard, Matteo Mancanelli, Elena Umili, Fabio Patrizi

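Affiliations: Sapienza University of Rome

AI summary: Proposes a neuro-symbolic framework that compiles LTLf constraints into DFAs and injects them into Transformer policies through a differentiable loss, improving constraint satisfaction on navigation tasks while keeping returns competitive.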

Comments
Accepted at the Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs (SKILLED-LLMs 2026), co-located with KR 2026 and FLoC 2026, Lisbon, Portugal
Abstract

In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.
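
As an illustration of DFA progression, consider the LTLf formula "eventually goal, and never hazard" (F goal & G !hazard), hand-compiled into a three-state DFA. The sketch shows a hard progression over symbolic traces and one possible soft relaxation that propagates a probability distribution over DFA states; the relaxation, the state numbering, and the assumption that label probabilities are independent are illustrative choices, not the paper's construction.

```python
# DFA states: 0 = goal not yet reached, 1 = goal reached (accepting),
#             2 = hazard seen (rejecting sink).

def step(state, labels):
    """Advance the DFA by one trace step; `labels` is a set of true atoms."""
    if state == 2 or "hazard" in labels:
        return 2
    if state == 1 or "goal" in labels:
        return 1
    return 0

def satisfies(trace):
    """True if the finite trace satisfies F goal & G !hazard."""
    state = 0
    for labels in trace:
        state = step(state, labels)
    return state == 1

def soft_satisfaction(prob_trace):
    """Acceptance probability when each step gives (p_goal, p_hazard).

    Propagates a distribution over DFA states; every operation is a sum or
    product, so the result is differentiable in the input probabilities.
    """
    d0, d1, d2 = 1.0, 0.0, 0.0
    for p_goal, p_hazard in prob_trace:
        safe = 1.0 - p_hazard
        d0, d1, d2 = (d0 * safe * (1.0 - p_goal),
                      d1 * safe + d0 * safe * p_goal,
                      d2 + (d0 + d1) * p_hazard)
    return d1

ok = satisfies([set(), {"goal"}, set()])
unsafe = satisfies([{"goal"}, {"hazard"}])
soft = soft_satisfaction([(0.5, 0.0), (0.5, 0.5)])
```

A regulariser could then penalise low acceptance probability, for example by adding the negative log of `soft_satisfaction` to the training loss, so that the policy is pushed toward traces the DFA accepts.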

2606.08310 2026-06-09 cs.AI cs.MA New submission

To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation

John Chen, Sihan Cheng, Can Gurkan, H M Abdul Fattah

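Affiliations: University of Arizona; Northwestern University

AI summary: Studies LLMs that spontaneously escalate to nuclear authorization in the complex game Civilization V; three prompt interventions show that ethical reasoning does not reliably eliminate escalation, and three failure pathways are identified, underscoring the need to test whether ethical reasoning is spontaneous and behaviorally effective in complex decision-making contexts.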

英文摘要

Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical competence on dilemmas such as trolley problems, this competence may not translate to complex, agentic scenarios. We study this gap in Civilization V, a multiplayer game with a complex decision-making landscape including economy, diplomacy, technology, and military strategy. Starting from 130 high-tension LLM self-play episodes, in which an LLM player spontaneously escalated nuclear authorization, we replay them across 13 models with three prompt interventions: an ethical prompt naming nuclear harm, removal of the previous model's decision-making rationale, and high-stakes framing emphasizing real-world impacts. Neither the interventions nor their combinations reliably eliminate emergent escalation. We identify three failure pathways: ethical reasoning that fails to surface without prompting, fails to appear even when prompted, or surfaces but fails to take effect when strategic counter-factors dominate. Evaluations of agentic models, therefore, must test whether ethical reasoning is spontaneously invoked and behaviorally effective in complex decision-making contexts, beyond whether it can be elicited in isolation.

2606.08309 2026-06-09 cs.LG cs.CV 新提交

Where the Score Lives: A Wavelet View of Diffusion

分数函数所在之处:扩散的小波视角

Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba

发表机构 * The Kempner Institute for the Study of Natural and Artificial Intelligence(肯普纳自然与人工智能研究所) Harvard University(哈佛大学)

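AI summary: Proposes a parameterization of the score function in a 2D orthogonal wavelet basis; a moment-based analysis of the data distribution reveals the inductive biases of different architectures and helps explain how score networks interact with the data distribution in diffusion models.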

详情
Journal ref
Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300
Comments
20 pages, 12 figures, AISTATS 2026
AI中文摘要

基于分数的生成模型在过去十年中在生成多样化视觉上合理的图像方面取得了显著成功。在扩散建模中,包括CNN、U-Net和Transformer在内的多种架构被用作分数近似网络;然而,迄今为止,关于这些架构选择如何影响生成行为的了解相对较少。在这项工作中,为了提供对此领域的见解,我们提出了一种使用二维正交小波基展开的分数函数的解析可解参数化。特别地,我们根据数据分布的矩推导出可解释的最优分数函数。我们利用这种参数化提供了一种与架构无关的、基于矩的分析,揭示了数据分布的哪些属性对去噪最为重要。我们的分数机器足够灵活,可以部分模仿多种架构(包括U-Net和CNN)的相关归纳偏差,朝着理解不同分数架构为何表现出不同生成行为迈出了一步。由于我们的分数函数可以根据数据矩解析求解,我们可以开始理解数据分布如何与分数网络相互作用,从而产生我们在扩散模型中观察到的行为。

英文摘要

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

2606.08308 2026-06-09 cs.LG 新提交

Fourier fractal dimension to predict the generalization of deep neural networks

傅里叶分形维数预测深度神经网络的泛化能力

Joao B. Florindo, Davi Wanderley Misturini

发表机构 * Institute of Mathematics, Statistics and Scientific Computing - University of Campinas(坎皮纳斯大学数学、统计与科学计算研究所)

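AI summary: Proposes the Fourier fractal dimension of weight variations as a generalization measure and designs a Fourier-based optimizer that regularizes it; on CIFAR-10, SVHN, and MNIST the measure correlates strongly with the generalization gap.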

详情
AI中文摘要

在不依赖留出验证数据的情况下预测深度神经网络的泛化性能是机器学习中的一个基本挑战。虽然随机梯度下降驱动这些高度参数化模型的优化,但其重尾、非高斯动力学在参数空间中诱导出复杂的、尺度不变的轨迹。在本文中,我们提出了一种基于网络权重变化的傅里叶分形维数的新型泛化度量。通过分析频域中Lévy驱动的随机微分方程的特征函数,我们提取出一个能够稳健捕捉学习过程几何复杂性的度量。此外,我们引入了一种定制的基于傅里叶的优化器,旨在训练过程中主动正则化该分形维数。在CIFAR-10、SVHN和MNIST数据集上的大量实证评估表明,我们提出的傅里叶泛化度量与实际泛化差距具有强相关性。我们的方法实现了最先进的Kendall秩相关系数,优于现有的基于范数、基于间隔和PAC-Bayesian度量。最终,这项工作凸显了频域分形分析作为模型泛化能力的强大预测器以及开发更稳定优化算法的原则性基础的潜力。

英文摘要

Predicting the generalization performance of deep neural networks without relying on hold-out validation data is a fundamental challenge in machine learning. While Stochastic Gradient Descent (SGD) drives the optimization of these highly parameterized models, its heavy-tailed, non-Gaussian dynamics induce complex, scale-invariant trajectories in the parameter space. In this paper, we propose a novel generalization measure based on the Fourier fractal dimension of the network's weight variations. By analyzing the characteristic function of the Lévy-driven stochastic differential equations in the frequency domain, we extract a metric that robustly captures the geometric complexity of the learning process. Furthermore, we introduce a customized Fourier-based optimizer designed to actively regularize this fractal dimension during training. Extensive empirical evaluations on the CIFAR-10, SVHN, and MNIST datasets demonstrate that our proposed Fourier generalization measure exhibits a strong correlation with the actual generalization gap. Our method achieves state-of-the-art Kendall rank correlation coefficients, outperforming a wide array of existing norm-based, margin-based, and PAC-Bayesian measures. Ultimately, this work highlights the potential of frequency-domain fractal analysis as both a powerful predictor for model generalizability and a principled foundation for developing more stable optimization algorithms.
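The evaluation above ranks a generalization measure against the observed generalization gap using the Kendall rank correlation. A minimal sketch of that statistic (the tau-a variant, where ties count as neither concordant nor discordant) follows; the `measure` and `gap` values are hypothetical numbers for illustration, not results from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-model values: a complexity measure and the measured
# train/test gap for four trained networks.
measure = [1.2, 1.5, 1.9, 2.4]
gap = [0.03, 0.05, 0.04, 0.09]
tau = kendall_tau(measure, gap)  # 5 concordant pairs, 1 discordant pair
```

A value near 1 means the measure orders the models almost the same way as the true gap does; a value near 0 means the ordering is uninformative.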

2606.08307 2026-06-09 cs.CL 新提交

Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic-Language X Communities

理解阿拉伯语X社区中心理健康话语的社会文化维度

Amal Alqahtani, Rana Salama, Mona Diab

发表机构 * King Saud University(沙特国王大学) Cairo University(开罗大学) Carnegie Mellon University(卡内基梅隆大学)

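AI summary: Uses GPT-4.1 to identify X users who make personal disclosures and analyzes discourse on borderline personality disorder, bipolar disorder, and ADHD; finds that vocabulary patterns differ by condition, and contributes a reusable LLM-assisted disclosure pipeline and a cultural keyword framework.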

详情
Comments
Accepted to the SMM4H-HeaRD Workshop, co-located with the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
AI中文摘要

计算心理健康研究主要集中于英语人群,阿拉伯语话语相对缺乏研究。我们提出一项探索性计算研究,包含来自607名用户的8147条推文,这些用户被GPT-4.1个人披露流程分类为三个特定病症的阿拉伯语X(原Twitter)社区中可能具有亲身经历的作者。我们关注与边缘型人格障碍(BPD)、双相障碍和ADHD相关的话语,并使用多领域文化关键词框架描述社区相关的语言模式。结果表明,在该语料库中,双相障碍推文包含更多宗教和医学词汇,BPD推文包含更多关系、身份和情绪困扰词汇,而ADHD推文更常关注实际症状和药物管理。我们将这些模式视为假设生成而非验证性的,因为语料库在不同病症间不平衡,某些子语料库在时间上集中,且关键词框架是初步操作化而非经过验证的测量工具。本文贡献了一个可复用的LLM辅助个人披露流程和一个针对阿拉伯语心理健康话语的探索性文化关键词框架。

英文摘要

Computational mental health research has predominantly centered on English-speaking populations, leaving Arabic-language discourse comparatively under-examined. We present an exploratory computational study of 8,147 tweets from 607 users classified by a GPT-4.1 personal-disclosure pipeline as likely lived-experience authors in three condition-specific Arabic-language X (formerly Twitter) Communities. We focus on discourse related to borderline personality disorder (BPD), bipolar disorder, and ADHD, and characterize community-associated linguistic patterns using a multi-domain cultural keyword framework. The results suggest that in this corpus, Bipolar tweets contain more religious and medical vocabulary, BPD tweets contain more relational, identity, and emotional-distress vocabulary, and ADHD tweets more often focus on practical symptoms and medication management. We treat these patterns as hypothesis-generating rather than confirmatory because the corpus is imbalanced across conditions, some subcorpora are temporally concentrated, and the keyword framework is an initial operationalization rather than a validated measurement instrument. The paper contributes a reusable LLM-assisted personal-disclosure pipeline and an exploratory cultural keyword framework for Arabic mental health discourse.

2606.08303 2026-06-09 cs.LG 新提交

GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks

GeoGNN:使用双塔图神经网络的时间序列地理定位

Toan Tran, Waqwoya Abebe, Abhishek Potnis, Supriya Chinthavali, Cyrus Shahabi, Li Xiong, Dalton Lunga

发表机构 * Emory University(埃默里大学) Oak Ridge National Laboratory(橡树岭国家实验室) University of Southern California(南加州大学)

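AI summary: Proposes GeoGNN, a two-tower architecture that learns spatial embeddings from a geographic adjacency graph, pairs them with time-series representations, and matches the two by dot product to geolocate time series; on electricity-consumption datasets it improves geolocalization accuracy by about 27% on average.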

详情
AI中文摘要

本文研究时间序列地理定位的新概念,目标是推断每个原始时间序列的地理来源。成功的地理定位可以为时间序列提供空间上下文,支持下游位置感知应用。我们形式化了该问题,借鉴图像地理定位的核心思想建立了强基线,并提出了GeoGNN——一种双塔架构。训练时,GeoGNN的空间塔通过利用地理邻接图学习地理单元候选的嵌入,而时间塔从时间序列中提取信息表示。推理时,每个时间表示与候选地理嵌入通过点积相似度匹配,并结合辅助分类头,以预测时间序列关联的地理来源。在全国范围的大规模电力消费数据集上的实验表明,GeoGNN在数据集上取得了最佳性能,并将细粒度和粗粒度地理定位精度平均提高了约27%。

英文摘要

This paper investigates a novel concept of time series geolocalization, where the goal is to infer the geographic origin of each raw time series. Successful geolocalization can provide spatial context to time series, enabling downstream location-aware applications. We formalize the problem, adapt core ideas from image geolocalization to establish strong baselines, and propose GeoGNN, a two-tower architecture. During training, GeoGNN's spatial tower learns embeddings of geographic cell candidates by leveraging the geographic adjacency graph, while the temporal tower extracts informative representations from time series. During inference, each temporal representation is matched against candidate geographic embeddings using dot-product similarity, combined with an auxiliary classification head, to predict the time series' associated geographic origin. Experiments on large-scale, countrywide electricity-consumption datasets demonstrate that GeoGNN achieves the best performance across datasets and enhances both fine- and coarse-grained geolocalization accuracy by ~27% on average.
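The inference step described above (match a temporal representation against candidate cell embeddings by dot product, combined with an auxiliary classification head) can be sketched as follows. This is an illustration under assumptions, not GeoGNN's code: the cell names, the embeddings, and the additive combination rule with `aux_weight` are all hypothetical, since the abstract does not say how the two signals are combined.

```python
def dot(u, v):
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def geolocate(series_embedding, cell_embeddings, aux_logits=None, aux_weight=0.5):
    """Score every candidate cell and return the best one plus all scores.

    series_embedding: output of the temporal tower for one time series.
    cell_embeddings:  dict cell_id -> embedding from the spatial tower.
    aux_logits:       optional dict cell_id -> logit from a classification head
                      (combined additively here, which is an assumption).
    """
    scores = {}
    for cell, embedding in cell_embeddings.items():
        score = dot(series_embedding, embedding)
        if aux_logits is not None:
            score += aux_weight * aux_logits.get(cell, 0.0)
        scores[cell] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical 2-D embeddings for three geographic cells.
cells = {"cell_a": [1.0, 0.0], "cell_b": [0.0, 1.0], "cell_c": [0.7, 0.7]}
query = [0.9, 0.2]
best_cell, cell_scores = geolocate(query, cells)
```

Because the candidate embeddings do not depend on the query, they can be computed once and reused for every time series at inference.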

2606.08302 2026-06-09 cs.CV 新提交

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

HACK++:面向高效视觉自回归建模的更有效的头部感知键值压缩

Ziran Qin, Yuchen Jiang, Mingbao Lin, Youru Lv, Hang Guo, Wen Fei, Weiyao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Rakuten(乐天) Tsinghua University(清华大学)

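AI summary: Targets the high compute and memory overhead caused by cross-scale KV caches in VAR models with HACK++, a training-free head-aware compression framework that classifies head types offline and allocates budgets adaptively, maintaining near-lossless generation under very small cache budgets.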

详情
AI中文摘要

视觉自回归(VAR)模型采用下一尺度预测范式,以显著更少的解码步骤实现高质量生成。然而,现有VAR模型由于跨尺度键值(KV)缓存的累积,面临严重的注意力复杂度和内存开销。本文通过将KV缓存压缩引入下一尺度范式来应对这一挑战。我们首先深入分析VAR注意力,观察到注意力头可以稳定地分为两个功能不同的类别:上下文头关注保持语义一致性,而结构头保持空间连贯性。它们的功能差异使得现有的一刀切压缩方法在VAR模型上表现不佳。我们进一步发现,两种头部类型对历史尺度的依赖程度不同,且这种依赖在不同层和生成步骤中发生变化,这要求自适应的缓存预算分配。为解决这些问题,我们提出HACK++,一种针对VAR模型的无训练头部感知键值压缩框架。通过一次性离线校准,HACK++分类头部类型并推导头部特定先验。在推理时,它将注意力与缓存压缩在独立预算下解耦,在压缩累积缓存时采用更激进的策略,通过模式特定策略和依赖感知预算分配来限制当前尺度的注意力成本。在多个VAR模型上进行的广泛实验,涵盖文本到图像、类别条件和统一理解与生成任务,验证了HACK++的有效性和泛化能力。例如,在Infinity-2B/8B上,HACK++在仅30%注意力预算和10%缓存预算下保持近无损生成,即使在1%缓存预算下也保持稳健。

英文摘要

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.

2606.08295 2026-06-09 cs.CL 新提交

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

TLRD: 通过三级理由蒸馏教授LLMs在表格数据上进行推理

Tianyuan Liang, Xuwei Tan, Lei Shi, Junsheng Zhong, Ziyu Hu, Tian Xie, Zhiqun Zuo, Xiaodong Yu, Xueru Zhang

发表机构 * The Ohio State University(俄亥俄州立大学) Stevens Institute of Technology(史蒂文斯理工学院)

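AI summary: Proposes TLRD, which converts tabular datasets into structured rationale supervision through tri-level rationale distillation, so that LLMs deliver zero-overhead prediction and interpretable reasoning from raw features alone and substantially narrow the performance gap to tree ensembles.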

详情
AI中文摘要

表格数据是存储现实世界信息的主要媒介,驱动着机器学习的许多工业应用。传统预测器实现了强大的预测性能,但不提供决策所必需的可读、案例特定的解释。大型语言模型(LLMs)可以通过生成预测和解释来自然弥合这一差距。然而,数据集特定的模式(如特征分布和交互)使LLMs难以理解和推理表格数据,而仅标签微调在提高性能的同时会导致灾难性遗忘。为了解决这个问题,我们提出了三级理由蒸馏(TLRD),一个将仅标签表格数据集转换为LLMs的结构化理由监督的框架。TLRD使用高容量教师模型,基于三个互补的证据级别(实例级特征、数据集级分布上下文和比较级检索邻居)合成理由语料库,然后将理由蒸馏到学生LLMs中,从而仅从原始特征实现零开销预测和基于理由的解释。在多个领域数据集上的实验表明,TLRD显著缩小了LLMs与最先进树集成模型之间的性能差距,同时产生基于理由且可读的解释,为高风险决策提供了有价值的参考。

英文摘要

Tabular data is a primary medium for storing real-world information, driving many industrial applications of machine learning. Traditional predictors achieve strong predictive performance but do not provide readable, case-specific explanations essential for decision-making. Large Language Models (LLMs) can naturally bridge this gap by generating predictions alongside explanations. However, dataset-specific patterns, such as feature distributions and interactions, make tabular data difficult for LLMs to understand and reason over, while label-only fine-tuning improves performance at the cost of catastrophic forgetting. To address this problem, we propose Tri-Level Rationale Distillation (TLRD), a framework that converts label-only tabular datasets into structured rationale supervision for LLMs. TLRD uses a high-capacity teacher to synthesize a rationale corpus grounded in three complementary levels of evidence: instance-level feature, dataset-level distributional context, and comparison-level retrieved neighbors, then distills the rationale into student LLMs, enabling zero-overhead prediction and grounded explanation from raw features only. Experiments on multiple domain datasets show that TLRD significantly closes the performance gap between LLMs and state-of-the-art tree ensembles while producing grounded and readable explanations, offering a valuable reference for high-stakes decision-making.

2606.08292 2026-06-09 cs.AI 新提交

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

消融可逆头不传递:对Transformer中机制角色声称的压力测试

Philip Quirke

发表机构 * Martian

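AI summary: Finds that an attention head passing three tests (necessity, linear encoding, and recovery after ablation) is still not enough to establish its role; introduces the KID framework and activation transduction under matched controls to expose the shortfall in such role claims.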

详情
Comments
9 pages, 1 figure
AI中文摘要

在机制可解释性中,注意力头通常被提升为角色声称(例如,“这个头表示加法”),当它们对某个行为是必要的、线性编码该行为,并且在消融后恢复该行为时。我们证明这种证据是不充分的:在三个7-8B指令微调模型和五个计算家族中,通过所有三个检查的头在匹配控制下将其激活修补到不同提示时,通常无法传递计算。我们引入KID(知道/意图/做),一个注意力头的角色分配视角,并将其与一个三阶段流程配对:能力选择性筛选(CSS)、奇异值分解(SVD)和匹配控制下的激活转导。我们的结果记录了一个初步的角色分类(包括提示轨迹稳定器、答案侧logit偏置头和软计算模式载体),并表明相同答案控制(一个共享答案字符串但不共享请求计算的转导目标)是一种未被充分利用的检查,它暴露了伪装成语义特异性的广泛状态转移。

英文摘要

In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched into a different prompt under matched controls. We introduce KID (Knowing / Intent / Doing), a role-assignment lens for attention heads, and pair it with a three-stage pipeline: capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls. Our results document a preliminary role taxonomy (including prompt-trajectory stabilizers, answer-side logit-bias heads, and soft computation-pattern carriers) and show that the same-answer control (a transduction target sharing the answer string but not the requested computation) is an underused check that exposes broad state transfer masquerading as semantic specificity.

2606.08288 2026-06-09 cs.RO 新提交

MotionVLA: Injecting Geometric Motion into Vision-Language-Action Model

MotionVLA: 将几何运动注入视觉-语言-动作模型

Shanglin Yuan, Weiheng Zhao, Xianda Guo, Wei Sui, Li Yu, Wenyu Liu, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学) D-Robotics(地瓜机器人) Wuhan University(武汉大学)

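AI summary: Proposes MotionVLA, a motion-history interface that converts a past video window into compact, time-continuous trajectory-field tokens, addressing geometric drift and fragmented temporal cues in long-horizon manipulation and improving action smoothness and execution efficiency.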

详情
Comments
17 pages, 8 figures
AI中文摘要

视觉-语言-动作(VLA)模型越来越多地基于历史、深度或4D特征来调节机器人策略,以解决长程操作中的歧义。然而,更多的时空证据并不一定更好:当注入的证据不是运动一致的时,它可能引入几何漂移、碎片化的时间线索和不稳定的动作生成。这提出了一个简单的问题:VLA应该记住过去的帧,还是记住连接它们的运动?我们引入了MotionVLA,一个运动历史接口,它将短时间仅包含过去的视频窗口转换为紧凑的、时间连续的轨迹场令牌。MotionVLA不是将历史视为一组稀疏的独立提升帧,而是将最近的观测表示为物理一致的运动证据。当前的视觉令牌查询这个历史以检索任务相关的运动信息,然后在轨迹基础的监督下重新耦合到VLA流中。在模拟基准和初步真实机器人部署上的实验表明,MotionVLA改善了长程操作,同时产生了更平滑、更直接的执行。这些结果表明,有效的VLA记忆不仅仅是提供更多的4D上下文,而是暴露可用于控制的运动一致证据。

英文摘要

Vision-language-action (VLA) models increasingly condition robot policies on history, depth, or 4D features to resolve ambiguity in long-horizon manipulation. However, more spatiotemporal evidence is not necessarily better: when the injected evidence is not motion-consistent, it can introduce geometric drift, fragmented temporal cues, and unstable action generation. This raises a simple question: should a VLA remember past frames, or remember the motion that connects them? We introduce MotionVLA, a motion-history interface that converts a short past-only video window into compact, time-continuous trajectory-field tokens. Instead of treating history as a sparse set of independently lifted frames, MotionVLA represents recent observations as physically coherent motion evidence. Current visual tokens query this history to retrieve task-relevant motion information, which is then recoupled into the VLA stream under trajectory-grounded supervision. Experiments across simulation benchmarks and preliminary real-robot rollouts show that MotionVLA improves long-horizon manipulation while producing smoother and more direct executions. These results suggest that effective VLA memory is not just about providing more 4D context, but about exposing motion-consistent evidence that is usable for control.

2606.08286 2026-06-09 cs.SD 新提交

FXplorer: A Map-Based Interface for Exploratory Audio Effect Design

FXplorer: 一种基于地图的探索性音频效果设计界面

Annie Chu, Jason Brent Smith, Bryan Pardo

发表机构 * Northwestern University(西北大学)

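AI summary: Presents the FXplorer interface, which organizes audio effects in a perceptual 2D space and combines spatial interaction with embedding methods to unify continuous browsing and parameter refinement, supporting interactive preset editing and interpolation.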

详情
Comments
Accepted to NIME 2026. Project page: https://anniejchu.github.io/fxplorer/
AI中文摘要

音频效果(FX)在当代音乐实践中塑造声音。然而,大多数界面将它们呈现为离散模块和参数,这有利于针对性调整而非探索性聆听。这种分离使得难以建立关于可能变换的更广阔空间的直觉,也难以在搜索和精调之间流畅移动。我们提出FXplorer,一个将音频效果组织在感知信息丰富的二维空间中的界面,允许将声音变换作为连续景观而非孤立预设进行浏览。通过结合既定的空间交互方法和可解释的DAW风格控制,以及基于嵌入的相似性和语义搜索的机器学习方法,该系统将探索和参数精调整合到单个工作空间中。FXplorer通过允许用户交互式编辑和插值效果预设,支持作曲、制作或表演。

英文摘要

Audio effects (FX) shape sound in contemporary music practice. However, most interfaces present them as discrete modules and parameters that favor targeted adjustment over exploratory listening. This separation can make it difficult to build intuition about the broader space of possible transformations or to move fluidly between searching and refinement. We present FXplorer, an interface that organizes audio effects within a perceptually informed 2D space, allowing sound transformations to be browsed as a continuous landscape rather than as isolated presets. By combining established spatial interaction approaches and interpretable DAW-style controls with recent embedding-based machine learning methods for similarity and semantic search, the system brings exploration and parameter refinement into a single workspace. FXplorer supports composition, production, or performance by allowing users to edit and interpolate between effect presets interactively.

2606.08284 2026-06-09 cs.CV cs.RO 新提交

G2G: Exploiting Intra-Group Geometry for Inter-Group Pose Estimation

G2G:利用组内几何进行组间姿态估计

Yufei Wei, Shuhao Ye, Chenxiao Hu, Yiyuan Pan, Dongyu Feng, Rong Xiong, Yue Wang, Yanmei Jiao

发表机构 * State Key Laboratory of Industrial Control and Technology, Zhejiang University(浙江大学工业控制技术国家重点实验室) Zhejiang Humanoid Robot Innovation Center Co., Ltd.(浙江人形机器人创新中心有限公司) School of Information Science and Engineering, Hangzhou Normal University(杭州师范大学信息科学与工程学院)

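AI summary: Proposes G2G, which freezes a multi-view foundation model and adds three lightweight trainable modules (a perceiver resampler, a cross-group bridge, and a multi-frame pose head); supervised only by relative poses, it estimates inter-group 6-DoF pose and reaches state-of-the-art accuracy on four datasets.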

详情
AI中文摘要

恢复两个图像组之间的相对6-DoF姿态是跨序列重定位和多相机刚性里程计的基础。每个组通过视觉里程计或刚性校准携带已知的组内几何,预训练的多视图骨干网络已经将这种几何融合到视觉特征中。然而,当前模型将所有视图视为非结构化集合,缺少跨组推理的关键环节。我们提出G2G,该方法保持基础模型完全冻结,并添加三个轻量可训练模块来桥接两个组:感知器重采样器、带有合并自注意力的跨组桥接模块以及多帧姿态头。可训练部分总计约32M参数,不到完整模型的6%,且仅由相对姿态监督。在四个数据集(涵盖室内外仿真、真实世界跨季节采集以及零样本仿真到真实迁移)上,G2G在两个任务上都达到了最先进的精度,而每个基线都使用其完整的原始监督进行重新训练。代码可在https://github.com/WeiYuFei0217/G2G获取。

英文摘要

Recovering the relative 6-DoF pose between two image groups underlies cross-sequence relocalization and multi-camera rig odometry. Each group carries known intra-group geometry from visual odometry or rig calibration, and pretrained multi-view backbones already fuse such geometry into visual features. Yet current models treat all views as an unstructured set, leaving cross-group reasoning as the missing piece. We introduce G2G, which keeps the foundation model entirely frozen and adds three lightweight trainable modules to bridge the two groups: a perceiver resampler, a cross-group bridge with merged self-attention, and a multi-frame pose head. The trainable footprint totals about 32M parameters, under 6% of the full model, and is supervised only by relative poses. Across four datasets that span indoor and outdoor simulation, real-world cross-season capture, and zero-shot sim-to-real transfer, G2G attains state-of-the-art accuracy on both tasks, while every baseline is retrained with its full original supervision. Code is available at https://github.com/WeiYuFei0217/G2G.

2606.08282 2026-06-09 cs.AI 新提交

From Validator Selection to Portfolio Collection Optimization in Proof-of-Stake Blockchains

从验证者选择到权益证明区块链中的投资组合收集优化

Jonas Gehrlein, Grzegorz Miebs, Matteo Brunelli, Adam Mielniczuk, Miłosz Kadziński

发表机构 * Parity Technologies AG Institute of Computing Science, Poznan University of Technology(波兹南工业大学计算科学研究所) Department of Industrial Engineering, University of Trento(特伦托大学工业工程系)

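AI summary: For the multi-criteria problem of nominators choosing validators in proof-of-stake blockchains, proposes a bi-objective optimization framework that jointly maximizes the expected utility of validators (portfolio quality and profitability) and the expected entropy of the allocation (risk diversification), solved with active preference learning and a multi-objective evolutionary algorithm, plus an interactive binary-search navigation procedure to settle on a satisfactory trade-off.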

详情
Comments
24 pages, 5 figures, 3 tables
AI中文摘要

我们考虑权益证明区块链环境中出现的一个问题,其中称为提名者的代理选择验证者——负责维护区块链物理基础设施的实体。选择过程本质上是主观和多准则的,并且结合了提名者通常通过多个账户操作的事实。这引出了一个投资组合选择问题,其中代理寻求将其提名分配到多个账户以分散风险。我们提出了一个决策支持框架来优化这一选择,通过同时最大化两个目标:可能分配的验证者的期望效用,代表组合质量和盈利能力;以及分配的期望熵,代表跨 stash 的多样化和风险缓解。验证者效用通过基于多属性价值理论的原始主动偏好学习过程推导,重点关注排名靠前的验证者。所得的双目标优化问题通过多目标进化算法求解,为了支持最终选择,我们引入了一个交互式二分搜索导航程序,该程序引导提名者穿过前沿,并仅通过几个问题确定一个满意的折衷。数值实验检验了优化策略,而涉及五位经验丰富的提名者的专家评估证实了该方法的实际相关性和有用性。

英文摘要

We consider a problem arising in proof-of-stake blockchain environments, where agents called nominators select validators - entities responsible for maintaining the blockchain's physical infrastructure. The selection process is inherently subjective and multi-criterial and combines with the fact that nominators commonly operate through multiple accounts. This gives rise to a portfolio selection problem, where agents seek to distribute their nominations across accounts to diversify risk. We propose a decision support framework to optimize this selection by simultaneously maximizing two objectives: the expected utility of the validators likely to be allocated, representing portfolio quality and profitability, and the expected entropy of the allocation, representing diversification and risk mitigation across stashes. Validator utilities are derived using an original active preference learning procedure based on multi-attribute value theory, with emphasis on top-ranked validators. The resulting bi-objective optimization problem is solved with a multi-objective evolutionary algorithm and, to support the final choice, we introduce an interactive binary search navigation procedure that guides the nominator through the front and identifies a satisfactory trade-off with only a few questions. Numerical experiments examine the optimization strategies, while an expert assessment involving five experienced nominators confirms the approach's practical relevance and usefulness.
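The two objectives named above pull in opposite directions, which a small sketch makes concrete. It is a simplification under stated assumptions: the allocation is treated as a set of fractions summing to 1, utility as a weighted sum, and diversification as Shannon entropy; the validator names and utility values are hypothetical, and the paper's expectation over which validators actually get allocated is not modeled.

```python
import math


def expected_utility(allocation, utilities):
    """Utility of a portfolio: allocation-weighted sum of validator utilities."""
    return sum(share * utilities[v] for v, share in allocation.items())


def entropy(allocation):
    """Shannon entropy (nats) of the allocation; higher means more diversified."""
    return -sum(p * math.log(p) for p in allocation.values() if p > 0)


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (both objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Hypothetical validator utilities in [0, 1].
utilities = {"v1": 0.9, "v2": 0.6, "v3": 0.5}
concentrated = {"v1": 1.0, "v2": 0.0, "v3": 0.0}
spread = {"v1": 1 / 3, "v2": 1 / 3, "v3": 1 / 3}

obj_concentrated = (expected_utility(concentrated, utilities), entropy(concentrated))
obj_spread = (expected_utility(spread, utilities), entropy(spread))
```

Neither portfolio dominates the other here: one has the higher utility, the other the higher entropy, so both sit on the trade-off front that the nominator has to navigate.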

2606.08281 2026-06-09 cs.RO cs.SY eess.SY 新提交

Impedance MPC for Physical Human-Robot Interaction: Predictive Disturbance Rejection with Joint-Limit Safety

阻抗MPC用于物理人机交互:具有关节极限安全性的预测性扰动抑制

Yongyan Cao, Jinshan Tang

发表机构 * Voryx Robotic LLC George Mason University(乔治梅森大学)

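AI summary: To reconcile trajectory accuracy with safety in physical human-robot interaction, proposes a two-layer Impedance MPC that analytically cancels the dynamics and estimates persistent disturbances with a Kalman filter to achieve zero steady-state error, and enforces joint-limit safety with a null-space barrier and a workspace projection.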

详情
Comments
7 pages and 3 figures
AI中文摘要

物理人机交互(pHRI)要求在非计划接触下同时实现轨迹精度和顺应性安全。经典阻抗控制在持续人力作用下会产生非零稳态位置误差(施加力除以任务刚度),积分作用仅在狭窄的稳定增益预算内减少该误差。我们提出一种双层阻抗MPC来解决这一矛盾。第一层解析抵消重力、科里奥利力和任务空间惯性,将剩余被控对象简化为具有恒定状态转移矩阵的构型无关双积分器。第二层以100 Hz求解30变量凸QP,利用该恒定结构使得自由响应矩阵仅需预计算一次;增广卡尔曼滤波器估计持续扰动状态,提供形式化的零稳态误差保证。零空间逆势垒和任务空间工作空间投影在测试工作空间内保证关节极限安全。在7自由度Franka FR3上,与经典阻抗在持续15 N力下的44.8 mm稳态误差相比,带卡尔曼增广的阻抗MPC达到亚0.05 mm稳态误差(降低超过800倍),在四个3-D圆上实现亚毫米跟踪,并对测量噪声和高达30%的惯性失配具有优雅鲁棒性。

英文摘要

Physical human-robot interaction (pHRI) demands simultaneous trajectory accuracy and compliant safety under unplanned contact. Classical impedance control incurs a nonzero steady-state position error under sustained human force -- the applied force divided by the task stiffness -- which integral action reduces only within a narrow stable-gain budget. We present a two-layer Impedance MPC that resolves this tension. Layer 1 analytically cancels gravity, Coriolis, and task-space inertia, reducing the residual plant to a configuration-independent double integrator with a constant state-transition matrix. Layer 2 solves a 30-variable convex QP at 100 Hz, exploiting this constant structure so the free-response matrix is precomputed once; an augmented Kalman filter estimates the persistent disturbance state, giving a formal zero-steady-state-error guarantee. A null-space inverse-barrier potential and a task-space workspace projection enforce joint-limit safety across the tested workspace. On a 7-DOF Franka FR3, Impedance MPC with Kalman augmentation attains sub-0.05 mm steady-state error versus 44.8 mm for classical impedance (a >800-fold reduction) under a sustained 15 N force, sub-millimeter tracking on four 3-D circles, and graceful robustness to measurement noise and inertial mismatch up to 30%.
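The numbers above can be checked with a back-of-envelope sketch. It assumes the classical baseline obeys the stated relation exactly (error = force / stiffness); the resulting stiffness is an inference from the reported figures, not a value given in the abstract. The double-integrator matrices are the standard zero-order-hold discretization at the stated 100 Hz rate, and the function names are illustrative.

```python
def steady_state_error(force_n, stiffness_n_per_m):
    """Classical impedance control: static offset under a sustained force."""
    return force_n / stiffness_n_per_m


def implied_stiffness(force_n, error_m):
    """Invert the same relation to recover the stiffness implied by an error."""
    return force_n / error_m


def double_integrator_step(position, velocity, accel, dt=0.01):
    """One step of x+ = A x + B u with A = [[1, dt], [0, 1]], B = [dt^2/2, dt].

    A and B do not depend on the robot configuration, so they can be
    computed once and reused at every control step.
    """
    new_position = position + dt * velocity + 0.5 * dt * dt * accel
    new_velocity = velocity + dt * accel
    return new_position, new_velocity


force = 15.0                 # N, sustained push reported in the abstract
classical_error = 44.8e-3    # m, reported for classical impedance
mpc_error_bound = 0.05e-3    # m, "sub-0.05 mm" upper bound

stiffness = implied_stiffness(force, classical_error)   # roughly 335 N/m
reduction = classical_error / mpc_error_bound            # roughly 896
```

Since 0.05 mm is only an upper bound on the MPC error, the ratio of about 896 is a lower bound on the improvement, consistent with the reported >800-fold reduction.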

2606.08278 2026-06-09 cs.RO 新提交

SIMPLE: Simulation-Based Policy Learning and Evaluation for Humanoid Loco-manipulation

SIMPLE:基于仿真的人形机器人全身操作策略学习与评估

Songlin Wei, Zhenhao Ni, Jie Liu, Zhenyu Zhao, Junjie Ye, Hongyi Jing, Junkai Xia, Xiawei Liu, Michael Leong, Liang Heng, Di Huang, Yue Wang

发表机构 * USC Physical Superintelligence (PSI) Lab(南加州大学物理超级智能实验室)

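AI summary: Presents SIMPLE, a simulation platform that combines MuJoCo dynamics with IsaacSim rendering and includes 60 whole-body tasks, 50 indoor scenes, and 1,000+ object assets; it supports automated trajectory generation and VR teleoperation data collection and integrates several mainstream policies. Experiments show a strong correlation between simulated and real-world performance, and zero-shot transfer is possible.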

详情
AI中文摘要

人形基础模型的发展速度超过了我们评估它们的能力。虽然真实世界测试成本高昂且难以复现,但现有的仿真基准主要关注桌面或轮式机器人。针对全身人形操作的可扩展且可复现的基准仍然是一个开放问题。为此,我们提出了SIMPLE,一个用于人形策略学习和评估的统一仿真测试平台。SIMPLE将MuJoCo的精确接触丰富动力学与IsaacSim的光真实感渲染相结合。它提供了一个大规模环境,包含60个多样的全身任务、50个室内场景和超过1000个物体资产。为了促进可扩展的数据收集,该框架集成了两个数据生成流水线:通过运动规划自动生成轨迹和低延迟VR遥操作接口。我们进一步在SIMPLE中大规模集成并基准测试了主流人形策略,包括轻量级模仿网络、大型视觉-语言-动作(VLA)模型以及最新的世界动作模型(WAM)。我们的实验揭示了策略在仿真和真实世界中的性能之间存在强相关性。此外,我们证明了在SIMPLE中收集的数据上训练的策略可以在相似设置下零样本迁移到物理人形机器人上,为人形机器人研究提供了稳健且可复现的基础。

英文摘要

Humanoid foundation models are advancing faster than we can evaluate them. While real-world testing is expensive and difficult to reproduce, existing simulation benchmarks focus primarily on table-top or wheeled robots. A scalable and reproducible benchmark for whole-body humanoid loco-manipulation remains an open problem. To this end, we present SIMPLE, a unified simulation testbed for humanoid policy learning and evaluation. SIMPLE couples the accurate contact-rich dynamics of MuJoCo with the photorealistic rendering of IsaacSim. It provides a large-scale environment comprising 60 diverse whole-body tasks, 50 indoor scenes, and over 1,000 object assets. To facilitate scalable data collection, the framework integrates two data generation pipelines: automated trajectory generation via motion planning and a low-latency VR teleoperation interface. We further integrate and benchmark mainstream humanoid policies at scale in SIMPLE, including lightweight imitation networks, large vision-language-action (VLA) models, and recent world action models (WAMs). Our experiments reveal a strong correlation between policy performance in simulation and the real world. Furthermore, we demonstrate that policies trained on data collected in SIMPLE can be transferred zero-shot to physical humanoid robots under similar settings, providing a robust and reproducible foundation for humanoid robotics research.

2606.08277 2026-06-09 cs.CV 新提交

Remember with Confidence: Uncertainty Quantification for Spatio-temporal Memory with Probabilistic Guarantees

自信记忆:具有概率保证的时空记忆不确定性量化

Harry Zhang, Nicolas Gorlo, Luca Carlone

发表机构 * MIT(麻省理工学院)

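AI summary: To handle noisy, viewpoint-inconsistent VLM captions in long-horizon robot operation, proposes an object-level semantic uncertainty score and integrates it into the UQ-DAAAM system, which reduces uncertainty by actively selecting high-quality views and fusing multi-view captions, with probabilistic guarantees.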

详情
AI中文摘要

长期机器人操作需要时空记忆来记录环境状态并在下游推理中回忆。场景图和检索增强系统将VLM描述锚定到持久的3D实体,并带有丰富的语义描述。然而,VLM描述存在噪声且视角不一致,现有系统将其视为神谕,没有机制检测不可靠的存储描述。我们引入了多视角VLM记忆的目标级语义不确定性:一种衡量目标中心跨视角语义描述分散度并识别语义未解决目标的分数。然后,我们将不确定性分数集成到一个高级空间语义记忆系统中,称为UQ-DAAAM。UQ-DAAAM利用该分数,在固定查询预算下通过选择高质量视图并将多视角描述融合为单一目标描述,主动优化不确定目标。我们还推导了概率保证,表明更高质量的候选视图(根据我们的方法选择)更有可能降低不确定性。实验表明,不确定性量化可以使具身4D记忆系统更可靠、更有效。特别是在OC-NaVQA基准上,UQ-DAAAM相比基线实现了显著更大的不确定性降低和更好的时空问答性能。

英文摘要

Long-horizon robot operation requires spatio-temporal memory to record the environment state and recall it for downstream reasoning. Scene graphs and retrieval-augmented systems ground VLM descriptions to persistent 3D entities with rich semantic descriptions. However, VLM captions are noisy and viewpoint-inconsistent, and existing systems treat them as an oracle with no mechanism to detect unreliable stored descriptions. We introduce object-level semantic uncertainty for multi-view VLM memory: a score that measures object-centric cross-view semantic scatter of captions and identifies semantically unresolved objects. Then, we include our uncertainty scores in an advanced spatial-semantic memory system that we dub UQ-DAAAM. UQ-DAAAM uses this score to actively refine uncertain objects under a fixed query budget by selecting high-quality views and fusing the resulting multi-view captions into a single object description. We also derive probabilistic guarantees showing that higher-quality candidate views (as selected by our approach) are more likely to reduce uncertainty. Our experiments show that uncertainty quantification can make embodied 4D memory systems more reliable and more effective. In particular, on the OC-NaVQA benchmark, UQ-DAAAM achieves substantially larger uncertainty reduction and better spatio-temporal question answering performance than baselines.
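One plausible way to turn "cross-view semantic scatter of captions" into a number is the mean pairwise cosine distance between caption embeddings of the same object. The sketch below uses that definition as an assumption; the abstract does not give the paper's exact score, and the embeddings, object names, and `select_for_refinement` helper are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def semantic_scatter(caption_embeddings):
    """Mean pairwise cosine distance across views; 0 means fully consistent."""
    pairs = list(combinations(caption_embeddings, 2))
    if not pairs:
        return 0.0  # a single view gives no evidence of disagreement
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def select_for_refinement(scatter_by_object, budget):
    """Pick the `budget` objects with the highest scatter to re-query."""
    ranked = sorted(scatter_by_object, key=scatter_by_object.get, reverse=True)
    return ranked[:budget]


# Hypothetical caption embeddings for two objects seen from several views.
memory = {
    "mug": [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0]],   # views agree
    "box": [[1.0, 0.0], [0.0, 1.0]],               # views disagree
}
scores = {name: semantic_scatter(embs) for name, embs in memory.items()}
to_refine = select_for_refinement(scores, budget=1)
```

Ranking by scatter spends the fixed query budget on the objects whose stored descriptions are least trustworthy.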

2606.08275 2026-06-09 cs.LG cs.AI 新提交

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

因果智能体回放:LLM智能体故障的反事实归因

Jaineet Shah

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

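AI summary: Proposes Causal Agent Replay (CAR), which uses a structural causal model and interventions to attribute LLM-agent failures to individual steps counterfactually, addressing the inability of existing methods to locate the step where the decision was made.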

Comments
Open-source: https://github.com/jaineet17/causal-agent-replay
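AI-generated Chinese abstract (translated)

When an LLM agent fails (for example, by issuing a refund it should not have, calling the wrong tool, or leaking data), existing tools can only answer what happened (observability) or whether the run passed (evaluation), not which step caused the failure. The intuitive heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and attribution by an LLM judge is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We propose Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, re-executes the trajectory under the same stochastic policy, and measures the change in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte Carlo Shapley estimator that allocates credit across interacting steps. Every effect comes with a confidence interval. We validate on synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic value 0.91). CAR is open source and runs on hosted or free local models.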

English abstract

When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
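
To make the idea of a budget-bounded Monte Carlo Shapley estimator concrete, here is a minimal Python sketch of generic permutation-sampling Shapley on a toy value function. It is not CAR's implementation: `toy_value`, `monte_carlo_shapley`, and the 0.9 payoff are hypothetical, and the toy simply assumes that intervening on steps 0 and 1 together shifts the outcome while step 2 has no effect. In a real setting each call to the value function would require re-executing the agent, so the budget is roughly `n_permutations * n_steps` evaluations.

```python
import random


def toy_value(intervened_steps):
    """Hypothetical outcome shift: only intervening on steps 0 AND 1 helps."""
    return 0.9 if {0, 1} <= intervened_steps else 0.0


def monte_carlo_shapley(value_fn, n_steps, n_permutations, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orders."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    totals = [0.0] * n_steps
    order = list(range(n_steps))
    for _ in range(n_permutations):
        rng.shuffle(order)
        coalition = set()
        previous = value_fn(coalition)
        for step in order:
            coalition.add(step)
            current = value_fn(coalition)
            totals[step] += current - previous  # marginal contribution of step
            previous = current
    return [total / n_permutations for total in totals]


shapley_estimates = monte_carlo_shapley(toy_value, n_steps=3, n_permutations=2000)
efficiency_sum = sum(shapley_estimates)
```

In this toy the exact Shapley values are 0.45, 0.45, and 0, so the estimates for the two interacting steps should land near 0.45 each, and their sum equals the value of intervening on all steps (the efficiency property).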

2606.08272 2026-06-09 cs.CL cs.AI New submission

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

Mohsina Bilal, Gopakumar G

Affiliations: National Institute of Technology Calicut

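AI summary: Presents AgriGov, a trilingual dataset built through automated scraping, a translation pipeline, and human post-editing, yielding about 8,000 sentence-aligned parallel pairs in the agricultural policy domain to support machine translation, question answering, and related applications.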

Comments
15 pages, 4 figures, Submitted to: Sadhana, Elsevier
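AI-generated Chinese abstract (translated)

AgriGov is a carefully curated trilingual (English-Hindi-Marathi) dataset that addresses the lack of domain-grounded multilingual resources for agricultural policy and farmer welfare schemes. We first used automated scraping to collect and structure data on 50 government schemes from trusted portals, organizing it into predefined semantic fields (such as title, eligibility, application process, documents, and exclusions). Translation used a pipeline combining the Google Translate API, MarianMT, and human post-editing, producing a domain-specific Hindi-Marathi dataset of about 2,100 source segments. To broaden coverage, we augmented the dataset with sentences from the Samanantar corpus, yielding about 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now provides a strong resource for fine-tuning machine translation models in this domain. AgriGov is designed for applications such as domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its main contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.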

English abstract

AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. We first collected and structured data on 50 government schemes from trusted portals using automated scraping, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were produced by a pipeline combining the Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2,100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers a robust resource for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.
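
The predefined semantic fields named in the abstract could be represented roughly as follows. This Python sketch is only an assumption about how such a record might be modeled: the class name `SchemeRecord`, the helper `missing_fields`, the field types, and the example values are hypothetical and not taken from the dataset.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SchemeRecord:
    """One scheme, organized into the semantic fields listed in the abstract."""
    title: str
    eligibility: str
    application_process: str
    documents: List[str] = field(default_factory=list)
    exclusions: List[str] = field(default_factory=list)


def missing_fields(record):
    """Return the names of fields that are empty, e.g. to flag for human review."""
    return [name for name, value in asdict(record).items() if not value]


# Hypothetical record with the eligibility field left unfilled by scraping.
example_record = SchemeRecord(
    title="Example Scheme",
    eligibility="",
    application_process="Apply through the official portal.",
    documents=["Identity proof"],
    exclusions=["Applicants already enrolled in the scheme"],
)
fields_to_review = missing_fields(example_record)
```

A fixed schema like this makes it easy to check each scraped scheme for gaps before it enters the translation and post-editing stages.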