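arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.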

2606.11654 2026-06-12 cs.IR cs.CL cs.HC cs.SI New submission

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

Kazuki Nakayashiki, Keisuke Watanabe

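Affiliation: Glasp Inc.

AI summary: This paper studies how to predict a document's crowd highlight salience from its text before any reader marks exist. It proposes a logistic ranker over sentence embeddings and positional/contextual features that beats the lead (position) baseline by 0.044 average precision, and shows that the edge comes from learning on real reader marks.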

Comments: 10 pages, 3 figures, 4 tables

Abstract

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
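
The evaluation described above (average precision against a lead baseline, with a by-document bootstrap) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the percentile interval, and the resample count are assumptions introduced here.

```python
import random

def average_precision(scores, labels):
    """AP of a ranking: mean of precision@k over the ranks k that hold a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def lead_scores(n_sentences):
    """Lead (position) baseline: earlier sentences get higher scores."""
    return [-i for i in range(n_sentences)]

def bootstrap_delta_ci(per_doc_deltas, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-document AP difference.

    Whole documents are resampled with replacement, so sentences from
    one document always stay together (hypothetical helper).
    """
    rng = random.Random(seed)
    n = len(per_doc_deltas)
    means = sorted(
        sum(rng.choice(per_doc_deltas) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Here `per_doc_deltas` would hold, for each held-out document, the model's AP minus the lead baseline's AP. An interval that sits entirely above zero corresponds to the kind of "small but robust edge" the abstract reports.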

2606.11255 2026-06-12 cs.LG New submission

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Taha Bouhsine

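Affiliation: Azetta AI

AI summary: Proposes a random-feature construction for the class of Bernstein-Schur kernels that sketches the finite modulation and randomizes the completely monotone radial factor, yielding unbiased estimates and operator-norm bounds, with the yat-kernel family as the application.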

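AI abstract (translated from Chinese)

Bernstein-Schur kernels are products of a finite-feature kernel (a kernel with an explicit finite-dimensional feature map) and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates that random features usually exploit, so neither generic Bochner sampling nor polynomial sketching applies directly to the full kernel. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and randomizes the completely monotone radial factor by sampling its univariate Bernstein-Widder scale and then applying Gaussian random Fourier features (whose frequencies remain d-dimensional). The feature dimension is Dm, set by the sketch size m and the number of radial draws D, independent of the O(d^2) size of the exact modulation feature. Keeping the modulation exact is the analyzable limit (m→∞): there we prove unbiasedness, the exact variance of the recommended flat estimator, an expected matrix-Bernstein operator-norm bound (with a matching high-probability tail) controlled by the top eigenvalues of the kernel and modulation Gram matrices and an intrinsic dimension rather than the crude N max_{ij} entrywise route, and a deterministic relative-spectral kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator inherits the same intrinsic-dimension operator-norm guarantee plus one tunable additive sketch term, controlled by m independently of D. The motivating instance is the biased yat-kernel k_{yat,b}(w,x)=(w^⊤x+b)^2/(‖w-x‖^2+ε), b≥0, whose family contains the inverse-multiquadric kernel through finite differences in b; for it, the radial mixture is an IMQ spectral sampler, and one frequency per scale is variance-optimal under a fixed radial-feature budget.

Abstract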

Bernstein--Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein--Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the \emph{exact} intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
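
For the yat-kernel, the radial randomization described above can be illustrated with a small Monte Carlo check. This is a sketch under stated assumptions, not the paper's estimator: the modulation $(w^\top x+b)^2$ is kept exact, the function names and default parameters are hypothetical, and the kernel value is estimated directly rather than through an explicit feature map. It relies on two standard identities: $1/(r^2+\varepsilon) = \varepsilon^{-1}\,\mathbb{E}_{s\sim\mathrm{Exp}(\varepsilon)}[e^{-s r^2}]$ and $e^{-s r^2} = \mathbb{E}_{\omega\sim N(0,\,2sI)}[\cos(\omega^\top(w-x))]$.

```python
import math
import random

def yat_kernel(w, x, b=1.0, eps=1.0):
    """Exact biased yat-kernel: (w.x + b)^2 / (||w - x||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq_dist + eps)

def yat_kernel_mc(w, x, b=1.0, eps=1.0, n_draws=20000, seed=0):
    """Monte Carlo estimate: exact modulation times a randomized radial factor.

    Each draw samples a scale s ~ Exp(rate=eps), then one Gaussian
    frequency omega ~ N(0, 2s I), and averages cos(omega . (w - x)).
    """
    rng = random.Random(seed)
    diff = [wi - xi for wi, xi in zip(w, x)]
    total = 0.0
    for _ in range(n_draws):
        s = rng.expovariate(eps)          # Bernstein-Widder scale
        sigma = math.sqrt(2.0 * s)        # std of each frequency coordinate
        phase = sum(rng.gauss(0.0, sigma) * di for di in diff)
        total += math.cos(phase)
    radial = total / (n_draws * eps)      # estimates 1 / (||w - x||^2 + eps)
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) ** 2 * radial
```

In an actual random-feature map, $\cos(\omega^\top(w-x))$ would be split into $\cos(\omega^\top w)\cos(\omega^\top x)+\sin(\omega^\top w)\sin(\omega^\top x)$ so that each input is featurized independently; the direct form above only checks that the estimate is unbiased.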

2606.11238 2026-06-12 q-fin.GN cs.AI New submission

Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination

Lasse Dierich, Orestis Schinas

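Affiliations: ShipFinance.ai; HHX.blue GmbH; Technical University of Munich; University of the Aegean

AI summary: This paper examines applications of AI in ship finance and proposes a modular LLM-based architecture for document comprehension, information extraction, and workflow automation to support the loan application process.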

Comments: 9 pages, 1 figure

Abstract

Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, contractual, and regulatory information from heterogeneous and largely unstructured sources. Increasing environmental regulation and ESG reporting requirements are adding further complexity to underwriting and loan-origination processes. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), create new opportunities for processing and analysing such information. This paper reviews potential applications of AI in ship finance, with a particular focus on LLM-based systems for document comprehension, information extraction, and workflow automation. We present this http URL, a modular agentic architecture to support loan application workflows in ship finance. The proposed system combines an LLM-based extraction module, financial analysis components, external maritime data services, and a controlled document-generation module with a chatbot interface to support the preparation of standardized financing applications. The paper discusses the key challenges for using such models in production. We argue that AI-assisted systems can support maritime finance professionals in managing increasingly complex information and reporting requirements.

2606.10231 2026-06-12 eess.AS cs.SD New submission

LLM can Read Spectrogram: Encoder-free Speech-Language Modeling

Ruchao Fan, Yiming Wang, Yuxuan Hu, Bo Ren, Yufei Xia, Xiaofei Wang, Yao Qian, Shujie Liu, Jinyu Li

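Affiliation: arXiv.org

AI summary: Proposes Mel-LLM, an architecture that needs no dedicated speech encoder and feeds Mel spectrogram patches into the LLM through a linear projection. It is validated on ASR and TTS: ASR performance is comparable to encoder-based approaches, and TTS is shown to be feasible in preliminary results.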

Abstract

Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantically rich representations that the LLM can consume. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly, without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and in production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.
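
The input path described above (Mel patches passed through a single linear projection) might look roughly like this. It is a shape-level sketch only: the patch size, the non-overlapping patching, and the helper names are assumptions, and the abstract's "light pre-processing" is not modeled.

```python
def patchify(mel, patch_frames):
    """Split a [frames][mel_bins] spectrogram into flattened, non-overlapping patches."""
    n_patches = len(mel) // patch_frames  # a trailing partial patch is dropped
    patches = []
    for p in range(n_patches):
        block = mel[p * patch_frames:(p + 1) * patch_frames]
        patches.append([value for frame in block for value in frame])
    return patches

def linear_project(patches, weight, bias):
    """Map each flattened patch to the embedding size: y = W x + b.

    `weight` has one row per output dimension (hypothetical layout).
    """
    return [
        [sum(w_ij * x_j for w_ij, x_j in zip(row, patch)) + b_i
         for row, b_i in zip(weight, bias)]
        for patch in patches
    ]
```

Each projected patch then plays the role of one input embedding for the LLM, in place of the output of a pre-trained speech encoder.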

2606.11190 2026-06-12 cs.LG New submission

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero

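Affiliations: Technion; Genentech; Brown University; Meta AI, FAIR

AI summary: Proposes a unified linear framework that uses a signal-plus-noise model to expose the complementary failure modes of cross-modal alignment and prediction, builds a four-regime phase diagram to guide the choice of multimodal learning objective, and validates it in nonlinear experiments.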

Abstract

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at this https URL.

2606.10716 2026-06-12 cs.CL cs.AI New submission

Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings

Roberto Martínez-Cruz, Alvaro J. López-López, José Portela

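Affiliations: Institute for Research in Technology, ICAI School of Engineering, Comillas Pontifical University; DD-AIM, Senior Machine Learning Researcher

AI summary: Proposes an attention expansion mechanism that uses pre-trained word embeddings to augment the contextual representations of PLMs, widening the effective context without full-document attention or expensive LLM inference and significantly improving keyphrase extraction on long documents.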

Abstract

Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate rich contextualized representations. However, long-document KPE remains challenging because salient keyphrase evidence may be scattered across distant document sections that cannot be jointly captured within the limited context window of most PLMs. Although long-context large language models (LLMs) can process broader textual contexts, their computational cost limits their practicality for efficient and high-throughput KPE. To overcome this limitation, we propose an attention expansion mechanism that augments PLM token representations with information from surrounding out-of-context chunks using pre-trained word embeddings. The proposed mechanism expands the effective contextual scope of PLM-based KPE models without requiring full-document attention or expensive LLM-based inference. We evaluate our approach across five PLM backbones, including general-purpose, scientific, task-specific, and long-context encoders, using two training regimes and five benchmark corpora from scientific and news domains. Experimental results demonstrate that attention expansion consistently enhances KPE performance across all evaluation settings, outperforming state-of-the-art models and yielding notable improvements in F1 score. The improvements extend to domain-specific, task-specialized, and native long-context models, showing that the proposed mechanism provides complementary information rather than merely compensating for limited input length. These results establish attention expansion as an efficient and effective strategy for long-document KPE.

2606.10683 2026-06-12 cs.RO cs.AI cs.CV New submission

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

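Affiliations: Fudan University; Hefei University of Technology; Rimbot; Beijing University of Posts and Telecommunications

AI summary: Proposes the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface, and builds on it UniDexTok, a retargeting-free state tokenizer that learns discrete tokens from real joint states. It gives heterogeneous dexterous hands a unified representation and cuts reconstruction error by more than 98%.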

Abstract

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make a shared state representation much harder to define than for parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
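
The reported reductions follow directly from the before and after values in the abstract. A quick arithmetic check (the helper name is introduced here for illustration):

```python
def relative_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100.0 * (before - after) / before

mpjae = relative_reduction(15.63, 0.16)  # joint angle error, degrees
mpjpe = relative_reduction(18.51, 0.18)  # joint position error, millimetres

print(f"MPJAE reduction: {mpjae:.2f}%")  # prints "MPJAE reduction: 98.98%"
print(f"MPJPE reduction: {mpjpe:.2f}%")  # prints "MPJPE reduction: 99.03%"
```

The position figures also match the "centimeter-scale to sub-millimeter" wording: 18.51 mm is about 1.85 cm, and 0.18 mm is below one millimetre.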

2606.10678 2026-06-12 cs.LG New submission

One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data

Amrijit Biswas, Mustafa Kamal, Robin Krambroeckers, M. M. Lutfe Elahi, Sifat Momen, Nabeel Mohammed, Shafin Rahman

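Affiliations: RobotBulls Labs; North South University

AI summary: Proposes a two-stage, model-agnostic framework that explicitly decouples forecasting from residual learning and uses a meta-corrector to dynamically model structured error patterns, improving the accuracy of transformer forecasters.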

Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

Abstract

Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.
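
The two-stage idea (a base forecaster followed by a separate model fit to its residuals) can be shown with deliberately simple stand-ins. Nothing below is the paper's architecture: the naive last-value forecaster replaces the base transformer, a per-step mean residual replaces the meta-corrector, and the series is univariate; all names are hypothetical.

```python
def base_forecast(history, horizon):
    """Stage 1 stand-in: repeat the last observed value (naive forecast)."""
    return [history[-1]] * horizon

def fit_residual_corrector(histories, targets, horizon):
    """Stage 2 stand-in: learn the base model's mean residual per horizon step."""
    sums = [0.0] * horizon
    for history, target in zip(histories, targets):
        pred = base_forecast(history, horizon)
        for h in range(horizon):
            sums[h] += target[h] - pred[h]
    return [s / len(histories) for s in sums]

def corrected_forecast(history, correction):
    """Final forecast = base prediction + learned residual correction."""
    base = base_forecast(history, len(correction))
    return [b + c for b, c in zip(base, correction)]
```

On a steadily rising series the naive forecaster is biased low at every step, and the second stage recovers exactly that systematic error. This is the simplest case of treating residuals as structured signal rather than irreducible noise.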

2606.10616 2026-06-12 cs.AI New submission

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

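Affiliations: Huawei Noah's Ark Lab; Department of Computer Science, City University of Hong Kong

AI summary: To address the finite context windows of long-horizon language agents, proposes the OSL-MR framework, which formulates memory retention as a constrained stochastic optimization problem and learns query-conditioned evidence value under a strict separation between online-observable features and offline supervision. Experiments show it outperforms existing methods under tight budgets.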

Abstract

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, making memory retention a fundamental resource-allocation problem. Existing memory systems improve management through heuristic scoring, retrieval optimization, or learned compression, but largely treat retention as a local decision problem and do not explicitly model its long-term consequences under realistic observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization problem with explicit budget feasibility, evidence utility, and delayed costs including miss penalties, reacquisition delays, and stale-information risk. We then propose OSL-MR (Observability-Safe Learning for Memory Retention), a novel framework that enforces a strict separation between online-observable features and offline-available supervision (OAS). OSL-MR combines an evidence learner trained from realized evidence supervision with a Mixed-Score heuristic that serves both as a deployable online-safe baseline and as a structured inductive prior for learning. The resulting policy learns query-conditioned evidence value directly from interaction data while remaining deployable under the same observability constraints. Experiments on LOCOMO and LongMemEval show that OSL-MR consistently outperforms recency-based methods, Generative Agents-style scoring, and other heuristic baselines, particularly under tight memory budgets. The Mixed-Score prior further improves precision while preserving recall, and sensitivity analysis demonstrates robustness across a wide range of cost configurations.
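
Budget-feasible retention with a heuristic score can be sketched as a greedy selection. The scoring rule below is a hypothetical blend of recency and relevance chosen for illustration; it is not the paper's Mixed-Score heuristic, and the field names, weights, and greedy rule are assumptions.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    tokens: int       # cost against the context budget
    age: int          # steps since the item was written
    relevance: float  # online-observable relevance proxy in [0, 1]

def mixed_score(item, recency_weight=0.5, decay=0.9):
    """Hypothetical blend of exponentially decayed recency and relevance."""
    recency = decay ** item.age
    return recency_weight * recency + (1.0 - recency_weight) * item.relevance

def retain(items, budget):
    """Greedy retention: keep the best score-per-token items that fit the budget."""
    ranked = sorted(items, key=lambda it: mixed_score(it) / it.tokens, reverse=True)
    kept, used = [], 0
    for item in ranked:
        if used + item.tokens <= budget:
            kept.append(item)
            used += item.tokens
    return kept
```

The score uses only quantities available at decision time (age, a relevance proxy, token cost), which is the kind of online-observable input the abstract's observability constraint calls for. A learned evidence value could replace `mixed_score` without changing the budget logic.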

2606.10403 2026-06-12 cs.CL New submission

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Sanghee Park, Geewook Kim, Kee-Eung Kim

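Affiliations: NAVER Cloud AI; KAIST AI

AI summary: Introduces the KCSAT-ML benchmark (664 Korean college entrance exam math problems, including a 339-item core set with official error rates) and the Difficulty-aligned Reasoning Gain (DRG) metric. Together they reveal that vision-language models' accuracy collapses on items with high human error rates, that test-time scaling is non-monotonic, and that anti-scaling and overthinking coexist within a single model family.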

Comments: 18 pages, 14 figures, 8 tables

Abstract

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at this https URL.

2606.09855 2026-06-12 cs.MM cs.CV cs.LG New submission

MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

Joonhyung Bae

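Affiliation: Korea Advanced Institute of Science and Technology (KAIST)

AI summary: Presents MinhwaNet, which generates object evidence maps from a part-level detector, and finds that in Korean folk painting a list of symbols is not enough to predict a painting's genre: how the symbols are arranged matters more. The paper calls this a faithful-but-insufficient dissociation.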

Abstract

Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols (a tiger for protection, a pair of birds for marital harmony, a peony for wealth) that recur across many of its painted genres. This suggests an obvious computational approach: identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions (genre) from a style label that does not (era), a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

2606.11000 2026-06-12 quant-ph cs.LG cs.NE New submission

Analog Quantum Asynchronous Event-Based Graph Neural Network

Kristian Sotirov, Shaheen Acheche, Antonio A. Gentile, Osvaldo Simeone

Affiliations: King’s Communications, Learning and Information Processing (KCLIP) lab; Centre for Intelligent Information Processing Systems (CIIPS); Department of Engineering; Pasqal SAS; Institute for Intelligent Networked Systems (INSI); Northeastern University London

AI summary: Proposes the quantum analog asynchronous event-based graph neural network (QA-AEGNN), which maps event data onto an atom array on a neutral-atom quantum processor and uses the Rydberg Hamiltonian to mirror message passing, for efficient event-graph computation.

Comments: 31 pages, 8 figures, initial version

Abstract

Asynchronous, event-based graph neural networks (AEGNNs) have recently emerged as an efficient paradigm for processing the sparse and high-temporal-resolution data from event cameras. In this paper, we propose quantum analog AEGNNs (QA-AEGNNs), a novel framework to implement an AEGNN on a neutral-atom quantum computer. Neutral-atom quantum processors offer a programmable analog quantum computing platform based on controllable Rydberg-atom interactions. To this end, we map the streaming event data to an array of trapped neutral atoms, where each atom represents a graph node (event) and is positioned such that geometric proximity reflects the spatio-temporal neighborhood of events. The native Rydberg Hamiltonian of the quantum processor is programmed to mirror the message-passing computations of the AEGNN, with atomic qubit states serving as node feature embeddings and inter-atom interactions realizing graph edges. Furthermore, we propose a hybrid quantum-classical training scheme in which the analog Hamiltonian parameters (e.g., laser pulse amplitudes and detunings) are optimized using classical feedback to learn the quantum AEGNN model from data. Our approach leverages the continuous Hamiltonian dynamics and massive parallelism of neutral-atom quantum systems to natively execute event-based graph computations with potential accuracy improvements.

2606.09639 2026-06-12 cs.CV New submission

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Zhejiang University; The University of Tokyo; Nanyang Technological University

AI summary: Introduces CineDance-1M, a large-scale multi-shot, long-form audio-video dataset built with a three-stage curation pipeline, together with the CineBench evaluation suite, to enable high-quality joint generation.

Abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

2606.08765 2026-06-12 cs.RO cs.CV new submission

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

RGB-S: 用于鲁棒灵巧操作的图像对齐触觉显著性

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

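Affiliations: ShanghaiTech University; Beijing Institute for General Artificial Intelligence

AI summary: Proposes the RGB-S framework, which uses forward kinematics and camera calibration to project tactile sensor locations onto the RGB image plane and renders force-modulated Gaussian saliency maps, explicitly aligning touch with vision; under severe occlusion, dexterous manipulation success improves by 26.7 percentage points.

Comments: 20 pages, 7 figures

AI Chinese abstract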

有效的视觉-触觉整合对于机器人灵巧操作至关重要，尤其是在视觉观测不可靠或被遮挡时。然而，将稀疏、异构的触觉测量与密集的视觉表示鲁棒地对齐仍然是一个基本挑战。大多数现有方法需要策略从有限的演示中隐式学习跨模态对应关系，而不利用几何先验。因此，它们在视觉观测退化时往往数据效率低且泛化能力差。为解决这一限制，我们提出一个框架，显式地将物理接触锚定在图像域中。利用机器人正向运动学和相机标定，我们将触觉传感器位置直接投影到RGB图像平面上。然后，我们渲染力调制的高斯显著性图，以模拟由运动学和标定误差引起的空间不确定性。通过零初始化的条件架构整合这些2D空间锚点，我们的方法将物理接触先验注入标准视觉骨干网络，同时保留预训练的视觉表示。我们在模拟和现实世界的六项灵巧操作任务中评估了我们的方法，在严重视觉遮挡下。现实世界实验表明，在图像域中显式的RGB-S锚定将现实世界遮挡操作成功率比最强的隐式视觉-触觉基线提高了26.7个百分点，表明其空间推理能力和对遮挡的鲁棒性得到了改善。项目页面：touch-as-saliency.github.io

English abstract

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
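The projection and rendering steps described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the extrinsic matrix `T_cam_base`, the sensor position, the image size, and `sigma_px` are made-up values, and forward kinematics is assumed to have already produced the sensor position in the robot base frame.

```python
import math

def transform(T, p):
    """Apply a 4x4 homogeneous transform (nested lists) to a 3D point."""
    hom = (*p, 1.0)
    return tuple(sum(T[r][c] * hom[c] for c in range(4)) for r in range(3))

def project_point(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point is behind the camera")
    return fx * x / z + cx, fy * y / z + cy

def saliency_map(width, height, contacts, sigma_px=4.0):
    """Sum of Gaussians centred on projected contacts, each scaled by its force."""
    grid = [[0.0] * width for _ in range(height)]
    for (u, v), force in contacts:
        for row in range(height):
            for col in range(width):
                d2 = (col - u) ** 2 + (row - v) ** 2
                grid[row][col] += force * math.exp(-d2 / (2.0 * sigma_px ** 2))
    return grid

# Hypothetical calibration: the camera sits 0.5 m along the base z-axis.
T_cam_base = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 0.5], [0, 0, 0, 1]]
sensor_in_base = (0.05, -0.02, 0.0)  # assumed output of forward kinematics
uv = project_point(transform(T_cam_base, sensor_in_base), fx=100, fy=100, cx=32, cy=24)
heat = saliency_map(64, 48, [(uv, 2.0)])  # one contact with force 2.0 (arbitrary units)
```

In this toy setup the Gaussian width stands in for the spatial uncertainty the abstract attributes to kinematic and calibration errors; how the paper actually sets the width and force scaling is not stated in the abstract.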

2606.08098 2026-06-12 cs.AI cs.LG new submission

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

何时委托优于多数？一种基于委托的多样本LLM推理聚合器

Yasushi Sakai, Allen Song, Kent Larson

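Affiliation: MIT Media Lab

AI summary: Proposes a delegation-based aggregator, PPV, that uses each sample's letter entropy and reasoning-geometry signals to beat majority voting on MMLU-Pro by 1.5 percentage points, with no labels or training required.

Comments: Preprint. 16 pages, 5 figures, 4 tables

AI Chinese abstract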

多数投票是对多样本LLM推理进行无监督聚合的主流方法。我们证明，将每个样本携带的信号输入基于委托的聚合器（传播代理投票，PPV）可产生一种无监督共识规则，在MMLU-Pro上整体比多数投票高1.5个百分点，在非平凡子集上高2.24个百分点（配对McNemar p ~ 1.0e-14，n = 8,099）。多数投票丢弃了每个样本携带的两个自由信号：组内字母熵和组间推理几何。PPV暴露了两个每个投票者使用的杠杆，它们恰好消耗这些信号：WHEN（投票者保留自己选择的权重）和WHOM（如何将剩余权重分配给同行）。我们使用字母熵驱动WHEN，使用以问题为中心的嵌入余弦驱动WHOM。该方法不需要真实标签和辅助训练：对于每个问题，我们将128个采样生成划分为16组，计算每组的字母级语义熵和推理嵌入质心，并将两者输入随机委托矩阵，其平稳分布选择共识答案。我们通过一个例子说明PPV如何推翻一个明显的10-6多数（错误答案）：10票的多数簇几何上不连贯（平均簇内余弦-0.02），而6票的少数簇紧凑（+0.26），因此传播的委托质量集中在少数派的答案上，尽管仅凭熵会使多数保持领先。我们还报告了具有负面结果的委托策略，这些策略限制了无监督LLM聚合的设计空间：没有问题内的置信度模式集成能够缩小与oracle的差距。

English abstract

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
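The idea of choosing a consensus answer from the stationary distribution of a stochastic delegation matrix can be illustrated with a toy sketch. The abstract does not give the exact PPV formulas, so everything here is an assumption made for illustration: the mapping from entropy to self-weight (`keep = 1 - H / max_entropy`), the use of clipped cosine similarity for splitting the remainder, the fallback when no peer is similar, and the three made-up groups and centroids.

```python
import math

def letter_entropy(letters):
    """Shannon entropy (nats) of the answer letters inside one group."""
    n = len(letters)
    counts = {a: letters.count(a) for a in set(letters)}
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def delegation_matrix(entropies, centroids, max_entropy):
    """Row-stochastic matrix: self-weight from entropy (When), peer split from cosine (Whom)."""
    n, dim = len(centroids), len(centroids[0])
    mean = [sum(c[d] for c in centroids) / n for d in range(dim)]
    centered = [[x - m for x, m in zip(c, mean)] for c in centroids]  # per-question centering
    rows = []
    for i in range(n):
        keep = 1.0 - min(entropies[i] / max_entropy, 1.0)  # assumed mapping
        sims = [0.0 if j == i else max(cosine(centered[i], centered[j]), 0.0) for j in range(n)]
        total = sum(sims)
        if total == 0.0:  # no similar peer: the group keeps all of its weight
            row = [1.0 if j == i else 0.0 for j in range(n)]
        else:
            row = [(1.0 - keep) * s / total for s in sims]
            row[i] = keep
        rows.append(row)
    return rows

def stationary(P, iters=500):
    """Power iteration for pi = pi P, starting from the uniform distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def consensus(groups, centroids, max_entropy):
    picks = [max(sorted(set(g)), key=g.count) for g in groups]  # each group's modal letter
    pi = stationary(delegation_matrix([letter_entropy(g) for g in groups], centroids, max_entropy))
    mass = {}
    for pick, weight in zip(picks, pi):
        mass[pick] = mass.get(pick, 0.0) + weight
    return max(sorted(mass), key=mass.get), pi

groups = [["B"] * 4, ["A"] * 4, ["A", "A", "C", "D"]]   # made-up samples
centroids = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.2]]       # made-up reasoning centroids
answer, pi = consensus(groups, centroids, max_entropy=math.log(4))
```

In this toy case two of three groups pick "A", so a plain majority over group picks would answer "A"; the uncertain third group delegates most of its weight to the geometrically closer first group, and the propagated mass selects "B" instead.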

2606.07515 2026-06-12 cs.CL cs.AI cs.HC math.PR new submission

How reliable are LLMs when it comes to playing dice?

LLM 在掷骰子时有多可靠？

Luca Avena, Gianmarco Bet, Bernardo Busoni

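Affiliation: Università degli Studi di Firenze

AI summary: A benchmark of discrete probability problems shows that LLMs reach 0.96 accuracy on standard problems but only 0.59 on counterintuitive ones, and that they are vulnerable to token bias and misleading prompts.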

2606.07489 2026-06-12 cs.AI econ.GN new submission

How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

AI代理如何重塑知识工作：自主性、效率与范围

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

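Affiliations: Harvard Business School; Perplexity AI

AI summary: Based on Perplexity product data, the study finds that AI agents executing tasks end to end raise autonomous work per session from 33 seconds to 26 minutes, cut estimated time by 87% and cost by 94%, and expand the scope and cognitive level of the work attempted.

AI Chinese abstract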

前沿AI系统正从对话式助手转向端到端执行任务的自主代理，弥合智能与实用性之间的差距。利用Perplexity的Search和Computer产品的生产数据，我们通过研究AI代理如何加速和重塑知识工作来考察这一转变。三个关键实证发现出现。首先，使用具有几乎相同初始查询对的会话作为同一底层任务的自然实验，Computer每个用户会话执行26分钟的自主工作，而Search为33秒。Computer自动化了Search用户可能手动编排和实现的任务分解与执行。因此，Computer将后续查询分布转向更高层次的工作，如验证和扩展。自主性也提高了执行质量，Computer上每次查询的不满意率比Search低55%。其次，由于其自主性优势，Computer在匹配任务上将完成时间从269分钟减少到36分钟，与仅配备Search的人类相比，估计时间和成本分别降低87%和94%。第三，Computer改变了用户尝试的工作范围：Computer查询更常跨越职业边界，需要更高层次的认知，利用更广泛的专业知识，采取将相互依赖的子任务捆绑到单个查询中的复合任务形式，并解锁了同一用户在Search使用中基本不存在的工作活动。综合来看，证据表明AI代理加速工作流程、提高输出质量、降低成本，并扩展自动化工作的广度和深度。

English abstract

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.
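The headline time figures can be checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract; the helper name `relative_reduction` is ours. The 94% cost reduction cannot be reproduced this way, because the abstract gives no cost inputs.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

time_cut = relative_reduction(269, 36)   # completion time in minutes on matched tasks
autonomy_ratio = (26 * 60) / 33          # 26 minutes vs. 33 seconds of autonomous work

print(f"time reduction: {time_cut:.1%}")          # 86.6%, reported as 87%
print(f"autonomy ratio: {autonomy_ratio:.1f}x")   # 47.3x
```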

2606.07436 2026-06-12 cs.CV new submission

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Skill-3D：面向智能体3D空间推理的场景感知技能进化

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

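Affiliations: Zhejiang University; University of Technology Sydney; OPPO Research Institute

AI summary: Proposes the Skill-3D framework, in which a scene memory and a skill library co-evolve so that the agent selects tools adaptively per scene, markedly improving the correctness and sufficiency of tool use in 3D spatial reasoning.

AI Chinese abstract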

本文探索智能体3D空间理解，即MLLM智能体通过工具使用进行3D推理。现有方法在3D场景下常误用工具并表现出有偏的工具偏好，使得智能体范式相比非智能体策略仅有边际提升。我们揭示3D空间推理任务在不同场景下具有异质性，而这些智能体对所有场景采用统一的工具使用策略，而非根据具体场景和任务选择工具。为解决此问题，我们提出Skill-3D，一种学习自进化场景感知技能的框架。具体而言，Skill-3D识别任务场景并将智能体的工具使用轨迹记录到场景记忆中，其中来自相似场景的成功轨迹被聚合和蒸馏成可复用的场景感知技能，失败的轨迹作为教训附加到该技能上。在训练过程中，一旦相似场景再次出现，注入相应技能以引导智能体，产生新轨迹，其成功和失败进一步优化技能，形成记忆和技能库共同进化的循环。实验表明，Skill-3D显著提升了3D空间推理中的工具利用率（在VSI-Bench上从39%提升至78%），推动智能体正确且充分地使用工具。例如，在MMSI-Bench上，它将Gemini-3-Flash提升了67%。此外，我们在技能引导的轨迹上进行智能体后训练，使Qwen3-VL-8B在VSI-Bench上提升了43%。

English abstract

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.

2606.07334 2026-06-12 cs.SD cs.LG new submission

How Far Can Chord-Symbol Time-Series Adaptation Carry Genre Identity? Capabilities and Boundaries in Multi-Genre Chord-Symbol Modeling

和弦符号时间序列适应能承载多远流派身份？多流派和弦符号建模的能力与边界

Jinju Lee

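Affiliation: PearlLeeStudio

AI summary: Evaluates five adaptation methods (LoRA, IA3, BitFit, prefix tuning, and full fine-tuning) for extending a frozen Music Transformer base, released as a pop-jazz checkpoint, to 11 target genres; all methods improve chord prediction, but chord symbols alone do not carry complete genre identity.

Comments: v2: corrected frozen-base checkpoint description after weight-level verification (released F1 coincides with the pop-only Phase-0 baseline; selection artifact); added released-adapter rank-selection disclosure; all reported numbers unchanged

AI Chinese abstract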

和声是一个紧凑的符号层，其中数学音高关系、声学协和与音乐惯例交汇。本报告将和弦符号序列视为音乐的不完全表示，而是作为可解释、可控的时间序列用于流派局部和声建模。从一个冻结的流行爵士音乐变换器检查点开始，我评估了小型适应接口能将模型扩展到11个目标流派的程度：布鲁斯、波萨诺瓦、巴赫众赞歌、乡村、电子、民谣、放克、福音、嘻哈、R&B/灵魂乐和摇滚。主要比较了LoRA、IA3、BitFit、前缀微调和全微调在11个流派和3个种子上的表现，构成完整的165个单元格网格。所有五种方法在保留和弦预测上都优于冻结基线，宏观增益从+2.89到+3.61分；LoRA和IA3得分最高，但经Holm和Benjamini-Hochberg校正的Wilcoxon检验不支持决定性优胜者。一个匹配数据量的对照实验进一步明确了这一点：当流派被子采样到共同语料库大小时，IA3保持领先，但LoRA的全数据优势消失并跌至最后，表明小差距部分由数据驱动。一个控制标记基线也很强，错误流派适配器通常优于冻结基线，表明大部分效果来自对可重用和声基底的轻量级条件化，而非特定适配器家族。额外的诊断（秩扫描、错误流派轮换、基础检查点消融、仅和弦流派分类、生成输出统计、真实歌曲评估和重复分析）支持一个有限的结论：和弦符号适应可靠地改进了流派局部和声预测，但仅靠和弦符号不能承载完整的流派身份。因此，本报告避免关于感知流派真实性或完整音乐质量的声明，这需要受控的听众或音乐家评估。

English abstract

This report treats chord-symbol sequences as an interpretable, controllable time series for genre-local harmonic modeling. The frozen Music Transformer base - released as a pop-jazz fine-tune endpoint but verified in this revision weight-identical to the pop-only Phase-0 baseline, so all gains are measured over a pure-pop prior (see Changes in v2) - is extended to eleven target genres: blues, bossa nova, Bach chorales, country, electronic, folk, funk, gospel, hip-hop, R&B/soul, and rock. The main evaluation compares LoRA, IA3, BitFit, prefix tuning, and full fine-tuning over 11 genres and 3 seeds, a complete 165-cell grid. All five methods improve over the frozen base on held-out chord prediction (macro gains +2.89 to +3.61 percentage points); LoRA and IA3 score highest, but pairwise Wilcoxon tests with Holm and Benjamini-Hochberg correction do not support a decisive winner. A matched-data-size control sharpens this: at a common corpus size IA3 stays on top while LoRA drops to last, so the small method gaps are partly data-driven rather than representational. A control-token baseline is also strong, and wrong-genre adapters often beat the frozen base, suggesting the adaptation effect is largely lightweight conditioning over a reusable harmonic base rather than genre-specific adapter memory. Further diagnostics (rank sweeps, wrong-genre rotation, a base-checkpoint ablation that v2 reinterprets as a same-weights control, chord-only genre classification, output-distribution statistics, real-song evaluation, duplicate analysis) support a bounded conclusion: chord-symbol adaptation reliably improves genre-local harmonic prediction, but chord symbols alone do not carry complete genre identity. Perceived genre authenticity and musical quality are left to controlled listener evaluation.
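The Holm correction mentioned in this abstract is a standard step-down procedure, sketched below together with the grid arithmetic. The p-values are made up for illustration and are not results from the report; the report's own test setup (which pairs, which statistic inputs) is not described in the abstract.

```python
import math

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted

grid_cells = 5 * 11 * 3          # methods x genres x seeds, matching the 165-cell grid
method_pairs = math.comb(5, 2)   # pairwise comparisons among the five methods
example = holm_adjust([0.01, 0.04, 0.03])  # hypothetical raw p-values
```

With ten pairwise comparisons, the smallest raw p-value is multiplied by ten before being compared with the significance level, which is one reason small gaps between methods may not yield a decisive winner.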

2606.07218 2026-06-12 cs.IR cs.CL new submission

HKVM-RAG: Key-Value-Separated Hypergraph Evidence Organization for Multi-Hop RAG

HKVM-RAG：用于多跳RAG的键值分离超图证据组织

Mingyu Zhang, Ying Ma

Affiliations: Faculty of Computing, Harbin Institute of Technology; School of Computer and Information Engineering, Henan University

AI summary: Proposes HKVM-RAG, a key-value-separated evidence-organization layer that uses hypergraph key-value retrieval to better expose evidence chains in multi-hop RAG, improving F1 on three benchmarks.

Comments: Submitted to ICDE 2027. 13 pages, 3 figures

AI Chinese abstract

多跳RAG提出了一个超越段落匹配的数据工程问题：在固定检索预算下，系统必须将检索到的文本组织成能够暴露答案链的证据单元。密集检索器独立评分段落，而基于图的记忆使关联显式化，但通常依赖于成对或实体中心的键，这些键会碎片化多跳证据。我们提出HKVM-RAG，一个键值分离的证据组织层。它从缓存的段落级LLM证据元组中组装答案路径超边，并将其用作检索键，同时保留段落文本作为答案值。为了隔离键空间设计，我们的固定基底协议在成对图和超图变体中保持元组缓存、候选段落、阅读器和评估预算不变。加权超图键值检索在2WikiMultiHopQA上比KG-PPR提高+3.426 F1，在MuSiQue上提高+3.592 F1；HotpotQA显示更高的结构化支持覆盖率不一定带来独立的答案F1增益。因此，我们将WHG-KV视为一种证据控制信号，而非密集检索的替代。Oracle和训练到开发分析表明支持选择是可修复的，一个密集感知控制器使用冻结的ColBERTv2和HKVM排名/分数特征，结合折外HKVM预测。它在三个基准上分别达到88.846、65.073和85.810 F1，比ColBERTv2提高+11.084、+6.763和+5.966 F1。源级消融实验表明，匹配的非WHG结构化信号无法达到WHG-KV的增益。这些结果提供了有界证据，表明键值分离的超图组织可以作为多跳RAG的可重用证据控制机制。

English abstract

Multi-hop RAG poses a data-engineering problem beyond passage matching: under fixed retrieval budgets, a system must organize retrieved text into evidence units that expose answer chains. Dense retrievers score passages independently, while graph-based memories make associations explicit but often rely on pairwise or entity-centered keys that fragment multi-hop evidence. We present HKVM-RAG, a key-value-separated evidence-organization layer. It assembles answer-path hyperedges from cached passage-level LLM evidence tuples and uses them as retrieval keys, while retaining passage text as answer values. To isolate key-space design, our fixed-substrate protocol holds the tuple cache, candidate passages, reader, and evaluation budget constant across pairwise graph and hypergraph variants. Weighted hypergraph key-value retrieval improves over KG-PPR by +3.426 F1 on 2WikiMultiHopQA and +3.592 F1 on MuSiQue; HotpotQA shows that higher structured support coverage need not yield standalone answer-F1 gains. We therefore study WHG-KV as an evidence-control signal rather than a dense-retrieval replacement. Oracle and train-to-dev analyses identify support selection as repairable, and a dense-aware controller combines frozen ColBERTv2 and HKVM rank/score features using out-of-fold HKVM predictions. It reaches 88.846, 65.073, and 85.810 F1 on the three benchmarks, improving over ColBERTv2 by +11.084, +6.763, and +5.966 F1. Source-level ablations show that matched non-WHG structured signals do not match the WHG-KV gains. These results provide bounded evidence that key-value-separated hypergraph organization can serve as a reusable evidence-control mechanism for multi-hop RAG.

2606.06525 2026-06-12 cs.GR cs.AI new submission

Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems

用于三维框架系统自动化结构分析的主体化大型语言模型

Ziheng Geng, Ian Franklin, Santiago Martinez, Jiachen Liu, Yunhe Zhao, Minghui Cheng

Affiliations: Department of Civil and Architectural Engineering, University of Miami; School of Architecture, University of Miami; HBC Engineering Company; Department of Electrical and Computer Engineering, University of Miami

AI summary: Proposes an agentic LLM framework that uses a projection-based representation and an agent pipeline to automate structural analysis of 3D frames from natural-language input, reaching 90% average accuracy.

AI Chinese abstract

大型语言模型（LLM）已成为跨领域具有强推理能力的强大基础模型。除了反应式文本生成，主体化LLM通过模块化任务分解和协调工具使用实现自主工作流执行。在结构工程中，最近的工作开发了用于平面框架自动化分析的主体化LLM。然而，由于不规则几何表示、拓扑一致性和长程推理的挑战，它们向3D框架的扩展仍未充分探索。本文提出了一种主体化LLM框架，用于从自然语言输入自动化分析3D框架。不规则3D框架通过投影到2D平面表示，其中正交网格线定义空间坐标，楼层数矩阵编码每个网格单元的垂直拉伸。基于此表示，框架建立了一个多智能体流水线：问题分析智能体将输入解析为结构化JSON；楼层分解智能体推导每层的空间布局；3D几何由节点、梁、板和柱智能体组装；支撑和荷载智能体分配边界和荷载条件，代码翻译智能体生成可执行的SAP2000脚本。在十个代表性3D框架上评估，所提框架在重复试验中平均准确率达到90%，表现出一致且可靠的性能。

English abstract

Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive text generation, agentic LLMs enable autonomous workflow execution through modular task decomposition and coordinated tool use. In structural engineering, recent efforts have developed agentic LLMs for automated analysis of plane frames. However, their extension to 3D frames remains underexplored due to challenges in irregular geometric representation, topological consistency, and long-horizon reasoning. This paper proposes an agentic LLM framework for automated structural analysis of 3D frames from natural language inputs. Irregular 3D frames are represented by projection onto a 2D plan, where orthogonal gridlines define spatial coordinates and a matrix of number of stories encodes vertical extrusion of each grid cell. Building on this representation, the framework establishes a multi-agent pipeline: a problem analysis agent parses input into structured JSON; a floor decomposition agent derives the spatial layout of each floor; the 3D geometry is assembled by node, girder, slab, and column agents; support and load agents assign boundary and loading conditions, and code translation agents generate executable SAP2000 script. Evaluated on ten representative 3D frames, the proposed framework achieves an average accuracy of 90% across repeated trials, demonstrating consistent and reliable performance.
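The plan-projection representation (gridlines plus a matrix of story counts) can be illustrated by extruding it into nodes and columns. This is a sketch under our own assumptions, not the paper's encoding: story height is uniform, a grid point rises as high as the tallest cell touching it, and girders, slabs, supports, and loads are omitted.

```python
def corner_height(stories, i, j):
    """Largest story count among the (up to four) cells touching grid point (i, j)."""
    rows, cols = len(stories), len(stories[0])
    return max(
        stories[r][c]
        for r in (j - 1, j)
        for c in (i - 1, i)
        if 0 <= r < rows and 0 <= c < cols
    )

def extrude(x_lines, y_lines, stories, story_height):
    """Return node coordinates keyed by (i, j, level) and columns as node-key pairs."""
    nodes, columns = {}, []
    for j, y in enumerate(y_lines):
        for i, x in enumerate(x_lines):
            top = corner_height(stories, i, j)
            if top == 0:
                continue  # no building touches this grid point
            for level in range(top + 1):
                nodes[(i, j, level)] = (x, y, level * story_height)
                if level > 0:
                    columns.append(((i, j, level - 1), (i, j, level)))
    return nodes, columns

# Hypothetical plan: two bays along x, one along y; the left cell has 2 stories, the right has 1.
x_lines, y_lines = [0.0, 6.0, 12.0], [0.0, 5.0]
stories = [[2, 1]]  # stories[r][c] is the cell between y_lines[r..r+1] and x_lines[c..c+1]
nodes, columns = extrude(x_lines, y_lines, stories, story_height=3.5)
```

The appeal of such an encoding is that an irregular, stepped building stays a small 2D matrix, which is presumably easier for a language model to produce consistently than a full 3D node list.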

2605.18898 2026-06-12 cs.LG stat.ML cross-list

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

一种用于诊断Transformer权重分布的双参数Weibull框架

Tiexin Ding

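Affiliation: Independent Researcher

AI summary: Proposes a two-parameter Weibull framework for analyzing element-wise weight magnitude distributions in transformers; experiments characterize how k is distributed across module types and how lambda changes during training.

Comments: 27 pages, 14 figures. Companion library npm-weibull-py and benchmark database available at this https URL

AI Chinese abstract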

我们应用Weibull分布——极值理论中的一个双参数家族——作为诊断框架，用于分析Transformer中元素权重幅度分布。在初始化时，i.i.d.高斯权重给出|w| ~ HalfNormal，产生k ~ 1.20通过中间80%概率-图拟合（此工作中的协议）。这个锚点使k成为一种原则性的、架构无关的训练动态测量工具；在每个层的每个检查点独立拟合每个权重矩阵，使能够进行每组件、每层和每步的诊断，这些聚合统计无法解决。将此框架应用于12个模型，涵盖7个架构家族（Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3）揭示了三个发现。首先，FFN模块和注意力输出投影W_o——传输类——落在狭窄的k带中：在12个条目中，中位数终端k在[1.186, 1.204]之间（跨家族CV=0.51%），在SwiGLU/GeLU激活、Pre-LN/QK-Norm放置和70M-14B大小之间共享。其次，注意力输入投影W_q, W_k——选择类——脱离Weibull家族，其严重程度由存储形状决定：分别存储Q/K（OLMo-1, OLMo-2）产生k在[0.76, 0.99]（深层）；GQA模型产生k在[1.10, 1.16]（轻微）；Pythia的合并W_qkv占据过渡区，跟踪训练预算T/tau单调递增。第三，lambda在训练过程中显著增长，并在Pythia家族中与sqrt(eta/lambda_wd)成比例（Pearson r=0.94，三种传输类型），方向上与Fan等人（2025）一致。这两个参数携带独立信息：k标记功能类别，lambda标记训练进度。我们发布了npm-weibull-py v0.4（Python库）和DATABASE_v9_1在https://github.com/tiexinding/NPM-Weibull-public。

English abstract

We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public.
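The initialization anchor (half-normal magnitudes giving k near 1.20) can be reproduced approximately with a small probability-plot fit. This sketch does not use the released npm-weibull-py library, and the paper's exact protocol is not spelled out in the abstract, so the plotting positions `(i + 0.5) / n`, the ordinary least-squares regression, the standard deviation 0.02, and the sample size are all our assumptions.

```python
import math
import random

def weibull_fit_middle80(values):
    """Fit Weibull (k, lam) by regressing ln(-ln(1 - p)) on ln(x) over the middle 80%."""
    xs = sorted(v for v in values if v > 0)
    n = len(xs)
    points = []
    for i, x in enumerate(xs):
        p = (i + 0.5) / n  # assumed plotting position
        if 0.1 <= p <= 0.9:
            points.append((math.log(x), math.log(-math.log(1.0 - p))))
    mean_x = sum(a for a, _ in points) / len(points)
    mean_y = sum(b for _, b in points) / len(points)
    sxx = sum((a - mean_x) ** 2 for a, _ in points)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in points)
    k = sxy / sxx                        # slope of the Weibull plot is the shape
    intercept = mean_y - k * mean_x      # intercept equals -k * ln(lam)
    lam = math.exp(-intercept / k)
    return k, lam

rng = random.Random(0)
weights = [abs(rng.gauss(0.0, 0.02)) for _ in range(50000)]  # |w| for Gaussian-initialized weights
k, lam = weibull_fit_middle80(weights)
```

The fit relies on the Weibull CDF identity ln(-ln(1 - F(x))) = k ln(x) - k ln(lam), so the shape k does not depend on the weight scale, while lam scales with the standard deviation.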

2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML new submission

Valid Inference with Synthetic Data via Task Exchangeability

通过任务可交换性实现基于合成数据的有效推断

Lezhi Tan, Tijana Zrnic

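AI summary: Proposes task exchangeability, a condition under which statistical inference with synthetic data remains valid in scientific research, with applications to opinion surveys and AI evaluation.

AI Chinese abstract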

越来越多的工作主张在科学研究中使用合成数据。例如，社会科学家主张在试点研究中使用LLM生成的“硅样本”；AI评估越来越依赖“LLM作为裁判”的输出；蛋白质组学研究通过生成合成蛋白质结构的生成模型加速。这些发展引发了一个有趣的可能性：合成数据可以帮助研究人员提出更多问题、进行更多研究并加速发现。但它们也引发了一个根本性的担忧：合成数据可能有偏、有噪声且设定错误。在这项工作中，我们提出了在科学研究中使用合成数据的统计原则，并具有可证明的有效性保证。关键见解是一个我们称为任务可交换性的新技术条件。非正式地说，这是一个要求，即研究人员可以识别出有真实数据可用的历史任务，使得他们当前感兴趣的任务与历史任务在适当的数学意义上可交换。我们开发了在任务可交换性下进行有效推断的方法，以及即使在可交换性之外也能提供保证的扩展。我们通过硅样本的民意调查和自动评分器的AI评估来展示该框架。

English abstract

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

2606.13544 2026-06-12 eess.AS cs.AI cs.CL new submission

Adaptive Turn-Taking for Real-time Multi-Party Voice Agents

自适应轮流发言：面向实时多方语音代理

Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

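AI summary: Proposes ModeratorLM, a role-conditioned speech LLM that uses chunk-wise streaming and chain-of-thought reasoning to take turns adaptively in multi-party conversations, substantially improving turn-taking precision and recall.

Comments: Accepted for publication at Interspeech 2026

AI Chinese abstract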

多方口语对话中的轮流发言仍然是语音代理面临的基本挑战，特别是在动态的发言权竞争和用户期望变化的情况下。我们提出ModeratorLM，一种角色扮演语音代理，它在多方环境中根据明确分配的角色来调节轮流发言行为。该系统基于以分块流式方式运行的语音大语言模型。我们进一步引入了一种推理增强变体，该变体结合了对对话上下文和分配角色的链式推理。我们构建了RolePlayConv，一个大规模合成数据集，包含具有多种助手角色的口语多方对话。在真实会议数据和RolePlayConv上的实验表明，与无角色条件的基线相比，轮流发言精度提高了40%以上，召回率提高了70%以上，同时大幅减少了误报中断。

English abstract

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show that turn-taking precision improves by over 40% and recall by more than 70%, while false-positive interruptions are substantially reduced compared to non-role-conditioned baselines.

2606.13450 2026-06-12 eess.AS cs.SD new submission

Endpoint Anticipation for Low-Latency Spoken Dialogue

低延迟口语对话的端点预测

Sathvik Udupa, Shinji Watanabe, Petr Schwarz, Jan Cernocky

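AI summary: Proposes endpoint anticipation, which forecasts end-of-turn signals ahead of time so that LLM and TTS pipelines can run speculatively on partial context, reducing average latency by 505 ms.

Comments: Accepted at Interspeech 2026

AI Chinese abstract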

虽然低延迟交互对于口语对话至关重要，但级联架构通常受限于反应式话轮结束检测。我们提出端点预测，从反应式检测转向主动预测结束信号。我们的基于语音的模型可提前最多2.56秒预测端点，从而能够在部分上下文中投机执行LLM和TTS流水线。我们引入指标来量化实现的延迟降低与计算冗余之间的权衡。在对话和任务导向数据集上的评估表明，我们的模型始终优于基于VAP的竞争基线。与Unmute框架的集成展示了平均延迟降低505毫秒，投机计算增加28.4%，有效掩盖了顺序瓶颈，从而在实时语音到语音交互中实现复杂推理。

English abstract

While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.

2606.13295 2026-06-12 stat.ML cs.LG stat.ME new submission

Simultaneous Latent Budget Trees for Stratified Classification

用于分层分类的同时潜在预算树

Cristian Buoncompagni, Stefano Pellegrino, Giulia Vannucci, Roberta Siciliano

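AI summary: Proposes Simultaneous Latent Budget Trees, a framework that handles a stratification factor through a model-based split rule to give interpretable classification, applied to gender-related differences in amyotrophic lateral sclerosis.

AI Chinese abstract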

在可解释人工智能时代，单棵树因其易于解释而重新受到关注。本文介绍了同时潜在预算树，这是一个概率机器学习框架，用于在存在分层因素（如时间、空间或人口统计变量）作为控制变量或潜在混杂因素时的分类树。标准的树生长过程并非设计用于优化条件分裂规则。提出了一种基于模型的分裂规则，其中子节点被解释为同时混合模型（如同时潜在预算模型及其约束版本）的潜在成分，该模型拟合于父节点。混合参数驱动观测值（不同组别不同）到达子节点，而潜在预算参数更新控制变量每个水平的响应类别轮廓。参数通过最小二乘法估计，考虑模型的神经网络视角。信息丰富的树结构可以通过节点和路径上的解释辅助工具进行交互式可视化，包括视觉剪枝和决策树选择过程。提出了适当的措施来处理不平衡的响应类别分布。所提出的方法应用于调查肌萎缩侧索硬化症疾病进展中的性别相关差异。SLBT库及其各种基于树的算法可在链接的GitHub仓库中获取。

English abstract

In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor, such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneous mixture model, such as the Simultaneous Latent Budget Model and its constrained versions, fitted to the parent node. Mixing parameters drive the observations, differently for each group, to the child nodes, whereas latent budget parameters update the response-class profile of each level of the control variable. Parameters are estimated by least squares, adopting a neural network perspective of the model. An informative tree structure can be interactively visualized with interpretation aids on the nodes and the paths, including visual pruning and a decision tree selection procedure. Suitable measures are proposed to handle an unbalanced response class distribution. The proposed methodology is applied to investigate gender-related differences in disease progression of Amyotrophic Lateral Sclerosis. The SLBT library with the various tree-based algorithms is available in the linked GitHub repository.

2606.13277 2026-06-12 stat.ML cs.LG new submission

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

ProtoX-AD：自解释的时间序列异常检测与特征描述

Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

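AI summary: Proposes ProtoX-AD, a framework that uses prototype learning to make self-supervised time series anomaly detection explainable, keeping detection performance while giving semantically consistent explanations of anomaly characteristics.

Comments: 26 pages, 8 figures

AI Chinese abstract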

时间序列异常检测（TSAD）的最新进展突显了自监督分类方法的有效性。这些方法对正常训练样本应用变换，训练分类器识别变换特定模式，从而通过增加分类误差来帮助识别异常。尽管性能强大，但一个重大挑战是缺乏可解释性，因为它们对标记异常的特征提供的洞察有限。为了解决这一局限，我们提出了ProtoX-AD，一种基于原型的自解释框架，用于自监督TSAD。ProtoX-AD学习变换感知的潜在表示以及可解释的原型，从而实现准确的异常检测和通过基于原型的解释识别不同的异常轮廓。此外，它允许系统分析变换设计如何影响检测性能和可解释性。在合成和真实世界数据集上的实验结果表明，ProtoX-AD实现了与其黑盒对应物相当的检测性能，同时比现有的可解释基线提供更一致和语义上有意义的解释。我们的代码在此 https URL 公开。

English abstract

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at this https URL.

2606.13193 2026-06-12 eess.AS cs.PL cs.SD new submission

A Dual-Mode Faust-to-CLAP Compilation System

双模式 Faust 到 CLAP 编译系统

Facundo Franchino (1), Stéphane Letz (2), Jatin Chowdhury (3) ((1) University of York, (2) GRAME-CNCM, (3) Massachusetts Institute of Technology)

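AI summary: Proposes the faust2clap framework, which supports static compilation and dynamic interpretation modes and uses an address-based identity matching algorithm and a stable slot allocation scheme to preserve DSP parameter identity, enabling efficient compilation and hot reloading.

Comments: 4 pages, 4 figures, 1 algorithm. Presented at the International Faust Conference (IFC-26), Lyon, France, June 2026

AI Chinese abstract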

我们描述了 faust2clap，一个建立从 Faust DSP 规范到 CLAP 格式的首个官方维护编译路径的框架。该系统以两种不同模式运行。静态模式采用提前编译以生成最优效率的原生二进制文件，而动态模式使用运行时解释以允许在不中断宿主应用程序的情况下修改 DSP 代码。后一种能力解决了音频软件开发中一个长期存在的摩擦，即编辑、编译和重载循环的累积开销。我们详细阐述了两种模式背后的算法机制，特别关注参数身份问题。为了在结构 DSP 突变中保留参数值及其与宿主自动化的绑定，我们引入了一种基于地址的身份匹配算法和一种稳定的槽位分配方案。该实现包含约 2400 行 C++ 架构和 Python 工具代码，并已集成到 Faust 主发行版中。

英文摘要

We describe faust2clap, a framework establishing the first officially maintained compilation pathway from Faust DSP specifications to the CLAP format. The system operates in two different modes. A static mode employs ahead-of-time compilation to yield native binaries of optimal efficiency, while a dynamic mode uses runtime interpretation to permit DSP code modification without interrupting the host application. This latter capability addresses a persistent friction in audio software development, namely the cumulative overhead of the edit, compile, and reload cycle. We detail the algorithmic machinery underlying both modes, focusing specifically on the problem of parameter identity. To preserve both parameter values and their bindings to host automation across structural DSP mutations, we introduce an address-based identity matching algorithm and a stable slot allocation scheme. The implementation, comprising approximately 2,400 lines of C++ architecture and Python tooling code, has been integrated into the main Faust distribution.
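The parameter-identity problem named in this abstract can be illustrated with a small sketch. The abstract does not describe the actual algorithm, so everything below is an assumption: the `ParamRegistry` class, its method names, the example addresses, and the policy that a slot is never reassigned once allocated are hypothetical and are not taken from faust2clap.

```python
class ParamRegistry:
    """Maps parameter addresses to stable slot indices across DSP reloads (hypothetical)."""

    def __init__(self):
        self.slot_of = {}    # address -> slot index, never reassigned (assumed policy)
        self.values = {}     # address -> last known value
        self.active = set()  # addresses present in the current DSP

    def reload(self, new_params):
        """new_params: dict of address -> default value from the rebuilt DSP."""
        for address, default in new_params.items():
            if address not in self.slot_of:
                # Unseen address: allocate the next unused slot and take the default.
                self.slot_of[address] = len(self.slot_of)
                self.values[address] = default
            # Known address: keep both its slot and its current value.
        self.active = set(new_params)

    def set_value(self, address, value):
        if address not in self.active:
            raise KeyError(address)
        self.values[address] = value


reg = ParamRegistry()
reg.reload({"/synth/freq": 440.0, "/synth/gain": 0.5})
reg.set_value("/synth/gain", 0.8)  # e.g. written by host automation

# The DSP is edited: "freq" is removed, "cutoff" is added, "gain" is kept.
reg.reload({"/synth/cutoff": 1000.0, "/synth/gain": 0.5})
```

After the second reload, `/synth/gain` keeps slot 1 and the value 0.8, while `/synth/cutoff` receives a fresh slot. Matching by address rather than by position is what lets a host binding survive a structural change in this sketch.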


2606.13146 2026-06-12 stat.ML cs.LG stat.ME New submission

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

Federico P. Cortese, Alessio Farcomeni

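AI summary: Proposes a robust feature-weighted jump model that achieves robustness through the Tukey biweight loss and introduces state-specific feature weights. It outperforms competing methods in simulations and in empirical applications.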
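For readers unfamiliar with the Tukey biweight loss mentioned in the summary, a minimal sketch of the standard definition follows. How the paper embeds it in the jump-model objective is not stated here, and the tuning constant `c = 4.685` is the common textbook default (about 95% efficiency under Gaussian errors), assumed here rather than taken from the paper.

```python
def tukey_biweight(r, c=4.685):
    """Standard Tukey biweight loss for a residual r.

    Behaves like r**2 / 2 near zero and is constant (c**2 / 6) once |r| >= c,
    so a gross outlier contributes a bounded amount to the objective.
    """
    cap = c * c / 6.0
    if abs(r) >= c:
        return cap
    u = (r / c) ** 2
    return cap * (1.0 - (1.0 - u) ** 3)


small = tukey_biweight(0.1)     # close to 0.1**2 / 2
outlier = tukey_biweight(50.0)  # capped at c**2 / 6
```

The cap is the source of robustness: beyond `c`, making a residual larger does not change the loss, so extreme observations cannot dominate the fit.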

2606.13109 2026-06-12 eess.AS cs.SD New submission

Generating Training Targets for Real-World Speech Enhancement via Close-to-Distant Microphone Projection

Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo, Marc Delcroix, Shoko Araki

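AI summary: Proposes close-to-distant microphone projection (C2D projection), which generates paired data from real recordings and realizes the projection with a parametric multichannel Wiener filter. Neural networks trained on this data outperform the existing GSS method in distant speech enhancement.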

Abstract

Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.
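As background, the abstract says the projection uses a variant of the PMWF but does not specify it. The textbook PMWF for a single frequency bin is shown below for orientation only; it is not the authors' formulation, and the symbols are assumptions of this note: $\boldsymbol{\Phi}_s$ and $\boldsymbol{\Phi}_n$ are the spatial covariance matrices of the target and the noise, $\mathbf{u}$ is a one-hot vector selecting the reference channel, and $\beta \ge 0$ trades noise reduction against speech distortion.

```latex
\mathbf{w}_{\beta}
  = \frac{\boldsymbol{\Phi}_n^{-1}\boldsymbol{\Phi}_s}
         {\beta + \operatorname{tr}\!\left(\boldsymbol{\Phi}_n^{-1}\boldsymbol{\Phi}_s\right)}\,\mathbf{u},
\qquad
\hat{s} = \mathbf{w}_{\beta}^{\mathsf{H}}\,\mathbf{x}
```

With $\beta = 0$ this reduces to the MVDR beamformer, and with $\beta = 1$ to the multichannel Wiener filter.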
