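arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.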

2606.07525 2026-06-09 cs.CL cs.AI New submission

Implicit Causal Graph Construction in Text via Chain Discovery

Liesbeth Allein, Marie-Francine Moens

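Affiliations: KU Leuven; Ghent University

AI summary: The study uses large language models to infer intermediate events between textual cause-effect pairs and thereby build implicit causal graphs, compares end-to-end construction with causal chain discovery, explores multi-model ensembling strategies, and evaluates on 1,560 scientifically validated causal pairs.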

Abstract

Causal graphs in text are typically populated by observable, predefined events. In contrast, we study implicit causal graph construction from text by treating each described cause-effect pair as the begin- and endpoint of an underlying latent causal graph and using large language models (LLMs) to infer intermediate causal events. We compare end-to-end graph construction with methods that frame the task as causal chain discovery. In the latter, graphs are built either by aggregating inferred chains or by progressively expanding partial chains through an iterative search process. We further explore Wisdom of the Crowd extensions that access causal knowledge from multiple LLMs in post-hoc aggregation and collaborative inference settings. We analyze trade-offs among these approaches and evaluate the validity of inferred causal relations using a manually curated database of 1,560 scientifically validated causal pairs. This database-based evaluation is proposed as reliable, resource-efficient, and transferable to settings where ground-truth graphs are unavailable.

2606.07524 2026-06-09 cs.CL cs.AI New submission

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

Zirui Wang, Yusen Hou, Shaofeng Liang, Bowen Tian, Yanlin Zhang, Wenshuo Chen, Yutao Yue

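Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Deep Interdisciplinary Intelligence Lab (DI2 Lab)

AI summary: Proposes ABLE, a framework that builds model embeddings from gradient-based feature attributions and tokenizer-agnostic word-level alignment, enabling efficient comparison of heterogeneous LLMs, with strong results on relation prediction, model routing, and benchmark score prediction.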

Abstract

The explosive growth of large language models (LLMs) has created a heterogeneous and poorly documented ecosystem, making systematic model comparison increasingly important for provenance auditing, security analysis, and model selection. Existing representation methods struggle to address this setting efficiently. Approaches analyzing internal parameters are powerful when architectures are compatible, but face scalability barriers under structural heterogeneity, while methods relying on external outputs may conflate models with similar behaviors and are difficult to align in richer output spaces across different tokenizers. To bridge this gap, we propose ABLE (Attribution-Based Large-model Embedding), a framework that leverages the interpretability space to construct model representations. By aggregating gradient-based feature attributions via a tokenizer-agnostic word-level alignment, ABLE captures model-specific input-sensitivity patterns rather than only surface-level outputs. Beyond empirical utility, we provide a stability analysis showing that, under standard regularity assumptions for differentiable Transformer-style models, ABLE induces a Lipschitz-continuous parameter-to-embedding map with finite-sample convergence guarantees. Extensive experiments on 239 open-source LLMs demonstrate that our training-free approach achieves competitive or superior performance in relation prediction, model routing, and benchmark score prediction.
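
One way to picture the tokenizer-agnostic word-level alignment mentioned in the abstract: per-token attribution scores are pooled onto whitespace-delimited words through character offsets, so models with different tokenizers yield vectors of the same length. The sketch below is illustrative only. The function names, the whitespace definition of a word, and the sum pooling are all assumptions, not details taken from the paper.

```python
def words_with_spans(text):
    """Return (start, end) character spans of whitespace-delimited words."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def word_level_attribution(text, token_spans, token_scores):
    """Sum token attributions onto the first word each token overlaps.

    token_spans are (start, end) character offsets, as many tokenizers
    can report; sum pooling is an assumption made for this sketch.
    """
    word_spans = words_with_spans(text)
    totals = [0.0] * len(word_spans)
    for (t_start, t_end), score in zip(token_spans, token_scores):
        for w, (w_start, w_end) in enumerate(word_spans):
            if t_start < w_end and t_end > w_start:  # spans overlap
                totals[w] += score
                break
    return totals


text = "unbelievable results"
# Two hypothetical tokenizers splitting the same text differently.
model_a = word_level_attribution(
    text, [(0, 2), (2, 9), (9, 12), (13, 20)], [0.1, 0.2, 0.3, 0.4])
model_b = word_level_attribution(
    text, [(0, 12), (13, 16), (16, 20)], [0.6, 0.1, 0.3])
```

Both calls return one score per word, so the two vectors can be compared directly even though the token counts differ.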

2606.07523 2026-06-09 cs.CL cs.AI New submission

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

Samir Wagle, Abiral Adhikari, Reewaj Khanal, Batsal Bhandari, Prashant Manandhar, Praveen Acharya, Bal Krishna Bal

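Affiliations: Dublin City University

AI summary: Presents the first retrieval-augmented generation model for Nepali legal question answering, retrieving case law with BM25 and E5 models and reaching 91% top-1 precision and 74% groundedness of generated answers.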

Abstract

Legal domains in high-resource languages like English have widely adopted artificial intelligence for legal question answering. However, data scarcity in low-resource languages such as Nepali has limited the training of large language models on Nepali legal texts. This study presents the first application of a Retrieval Augmented Generation (RAG) based model for Nepali legal question answering using case laws extracted from the Nepal Kanun Patrika digital archive. Using BM25 on chunked documents, the approach achieved a top-1 precision of 91 percent, versus up to 75 percent with the multilingual E5 large model. Evaluation of generated answers showed 74 percent groundedness, 85 percent truthfulness according to an automated judge model, and 84 percent human-evaluated truthfulness when using BM25 document retrieval, with a 92 percent successful answer generation rate. These results demonstrate that the RAG pipeline can effectively address the gap in legal question answering for low-resource languages and provide a foundation for reliable AI systems in the Nepali legal domain.
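
The retrieval step and its headline metric can be sketched in a few lines. The block below is a minimal Okapi BM25 scorer over pre-tokenized chunks plus a top-1 precision computation. The tokenization, the parameter values `k1=1.5` and `b=0.75`, and the toy English chunks are assumptions for illustration; the paper's actual pipeline and Nepali preprocessing are not described in the abstract.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized chunk against a tokenized query (Okapi BM25)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of chunks that contain each term.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def precision_at_1(queries, relevant_idx, docs_tokens):
    """Fraction of queries whose top-ranked chunk is the relevant one."""
    hits = 0
    for query, rel in zip(queries, relevant_idx):
        scores = bm25_scores(query, docs_tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == rel)
    return hits / len(queries)


# Toy, hypothetical chunks and queries (already tokenized).
chunks = [
    ["court", "ruled", "property", "dispute"],
    ["tax", "appeal", "dismissed"],
    ["custody", "child", "mother"],
]
queries = [["property", "dispute"], ["tax", "appeal"], ["child", "custody"]]
p_at_1 = precision_at_1(queries, [0, 1, 2], chunks)
```

In this toy setup every query shares terms with exactly one chunk, so each is ranked first; real legal text would need language-appropriate tokenization before scoring.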

2606.07522 2026-06-09 cs.CL cs.LG cs.SI New submission

Community-Specific Slang and Entity Detection via Semantic Shift in Fine-Tuned Language Models

Julia Kruk, Sanchita Porwal, Amitrajit Bhattacharjee, Mansi Phute

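Affiliations: Georgia Institute of Technology

AI summary: Proposes an unsupervised method that automatically identifies slang, distinctive entities, and folk expressions in online community text by measuring how far each word's meaning shifts between the model before and after fine-tuning.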

Comments 6 pages, 6 figures, 2 tables

2606.07521 2026-06-09 cs.CL cs.AI New submission

Evaluating Hallucinations in Domain-Adapted Large Language Models

Sanchita Porwal, Sai Prasath S, Xingjian Bi, Madelyn Scandlen

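Affiliations: College of Computing, Georgia Institute of Technology

AI summary: Fine-tunes a Llama-2 model and tests its memorization, recall, and reasoning; finds that domain-adapted LLMs hallucinate when generating new domain-specific information, indicating that fine-tuning alone does little to mitigate hallucination.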

Comments 13 pages, 2 figures, 3 tables

Abstract

This study investigates the phenomenon of hallucinations in domain-adapted Large Language Models (LLMs), focusing on the fine-tuning of the Llama-2 model with the Lamini dataset. Hallucinations, or the generation of nonsensical or unfaithful content by LLMs, pose a significant challenge, especially when these models are fine-tuned with domain-specific data. Our methodology involves a series of experiments testing memorization, recall, and reasoning capabilities of the fine-tuned LLM, comparing its performance on novel question-answer pairs and domain-specific information. We found that while the model shows proficiency in tasks similar to its training data, its capability to accurately reason about and recall new domain-specific information remains limited, leading to instances of hallucination. The model demonstrates a tendency to provide correct answers with extra information, suggesting an inclination toward over-generation. These results suggest important limitations of fine-tuning-only approaches for mitigating hallucinations when adapting LLMs to specialized domains and underscore the need for more robust adaptation methods. The study also provides insights into the varying performance of LLMs on different types of information, revealing a comparative weakness in handling domain-specific queries.

2606.07520 2026-06-09 cs.CL cs.LG New submission

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Wu Ning, Haonan Song, Dandan Tu, Qixun Zhang, Yuxiang He, Bibo Cai, Ting Liu

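Affiliations: Harbin Institute of Technology SCIR Lab; Peking University; Huawei Technologies Co., Ltd

AI summary: To address reward hacking and high computational overhead when LLMs follow unverifiable constraints, proposes TinyJudge, which uses an ensemble of small language models to provide rewards; across five benchmarks it improves average performance by about 10% and reward precision by 12%, with a 3x training speedup.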

Comments ACL 2026 Main Conference;15 pages, 9 figures

Abstract

Instruction Following (IF) is a core capability of LLMs, requiring strict adherence to diverse constraints, ranging from verifiable ones (e.g., output length) to unverifiable ones (e.g., tone). Reinforcement learning with verifiable rewards has emerged as a paradigm for IF tasks, leveraging LLM-as-a-judge to assess unverifiable constraints. However, we empirically find that this approach remains a significant bottleneck, suffering from severe reward hacking and higher computational overhead. In this work, we first analyze the generalization capabilities of unverifiable constraints and discover that specific constraints exhibit distinct, high-generalization patterns. Motivated by this, we propose TinyJudge, a framework that employs an ensemble of specialized tiny language models ($\sim0.6B$) to provide rewards for soft constraints. By distilling expertise from frontier models into these tiny models, it achieves high-precision, lightweight evaluation. Extensive evaluations across five benchmarks demonstrate that TinyJudge outperforms the baselines by $\sim10\%$ in average performance and $12\%$ in reward precision. Crucially, it also achieves a $3\times$ speedup in total training time. Our work provides a scalable and robust path for aligning LLMs with unverifiable human instructions.

2606.07519 2026-06-09 cs.CL cs.AI New submission

Bidirectional Small-Granularity Search between Code and Text

Marco A. Valenzuela-Escárcega, Enrique Noriega-Atala, Gus Hahn-Powell, Clayton T. Morrison, Mihai Surdeanu

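Affiliations: Lex Machina; The University of Arizona

AI summary: Proposes the task of bidirectional small-granularity search and trains models on automatically generated data to link text in scientific publications directly to code snippets, supporting cross-modal retrieval.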

Abstract

We introduce the novel task of bidirectional small-granularity search between code and text, where the queries are small snippets of text or code and the results are also small fragments of the opposite modality, i.e., code or text. This task establishes direct links between text in scientific publications and corresponding code segments, in support of better and faster understanding of scientific methods. We introduce a large dataset for the proposed task that includes a training partition with textual descriptions of code generated automatically using GPT-4, and three testing partitions, one in-domain and two out-of-domain (OOD) that contain manually-annotated data as well as material from other domains. We also propose a modular approach to address this task. Our approach shares an encoder across four different subtasks that learn start/end of answer spans in both directions. We show that our method achieves good results in-domain, and encouraging results OOD. This suggests that addressing this task with automatically-generated data is possible, but there is exciting future work to be done.

2606.07102 2026-06-09 cs.CV cs.AI New submission

GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection

Taisei Saito, Koretaka Ogata, Takafumi Hiroi

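AI summary: Proposes GP-Adapter, a training-free framework that augments CLIP with Gaussian process uncertainty modeling for few-shot classification and out-of-distribution detection; it needs no backbone fine-tuning and relies only on a small cache and lightweight hyperparameter selection.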

Comments 8 pages, 6 figures, Accepted at IJCNN 2026

Abstract

We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs modality-specific, class-wise one-class GPs on top of frozen CLIP embeddings using an RBF kernel for image features and a linear kernel for text prompts and fuses their predictive statistics to produce a variance-aware confidence score for OOD detection. The method requires no fine-tuning of the CLIP backbone and relies only on a small $K$-shot cache and lightweight hyperparameter selection, with memory cost scaling as $O(CK^2)$ for $C$ classes and $K$ shots. Experiments on ImageNet and multiple OOD benchmarks show that GP-Adapter provides competitive few-shot performance and consistently improves OOD detection when combined with prompt-learning baselines, highlighting the complementarity between GP-based uncertainty modeling and prompt learning. Overall, our results suggest that integrating probabilistic inference with large pre-trained vision-language models can improve reliability in low-data and distribution-shifted settings. Code is available at https://github.com/tms-byte/GP-Adapter
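
The quantities named in the abstract can be written out with the textbook Gaussian process posterior. The block below is a sketch under stated assumptions, not the paper's exact formulation: the length scale $\ell$, the noise term $\sigma^2$, and the convention of setting all one-class targets to 1 are assumed here. $K_c$ is the Gram matrix over the $K$ cached features of class $c$, and $k_*$ is the vector of kernel values between a test feature $x_*$ and those cached features.

```latex
% Kernels named in the abstract; \ell and \sigma^2 are assumed hyperparameters.
k_{\mathrm{RBF}}(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^2}{2\ell^2}\right),
\qquad
k_{\mathrm{lin}}(t, t') = t^\top t'

% Standard GP posterior for class c, with all K targets set to 1
% (a common one-class GP convention, assumed here).
\mu_c(x_*) = k_*^\top \left(K_c + \sigma^2 I\right)^{-1} \mathbf{1},
\qquad
\sigma_c^2(x_*) = k(x_*, x_*) - k_*^\top \left(K_c + \sigma^2 I\right)^{-1} k_*

% K_c \in \mathbb{R}^{K \times K}, so one Gram matrix per class costs C K^2 entries.
```

The last comment is where the $O(CK^2)$ memory cost quoted in the abstract comes from: one $K \times K$ matrix is stored for each of the $C$ classes.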

2606.07066 2026-06-09 cs.CL New submission

Modeling semantic association in self-paced reading with language model embeddings

Sara Møller Østergaard, Kenneth Enevoldsen, Afra Alishahi, Bruno Nicenboim

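Affiliations: Department of Computational Cognitive Science, Tilburg University; Center for Humanities Computing, Aarhus University

AI summary: Quantifies semantic association with ten implementations of language model embeddings and uses Bayesian models to analyze its effects on the N400 and self-paced reading times; finds that sentence embeddings reliably capture semantic association beyond word predictability.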

Abstract

Semantic association between a word and its context has been identified as an important component of reading comprehension, even when word predictability is accounted for. Recent research has highlighted the potential of language model (LM) embeddings to quantify semantic association. Yet, embedding-based semantic association has been operationalized in a myriad of ways. In this study, we use embeddings from LMs to estimate semantic association on a corpus of joint electroencephalography (EEG) and self-paced reading of natural Dutch texts. Semantic association is calculated in ten different implementations that vary the embedding model and context length. The effects of semantic association across the different implementations on the N400 and self-paced reading times are examined using Bayesian hierarchical models and Bayes factors. The results show that the choice of embedding model can alter the estimated effect of semantic association on both the N400 and self-paced reading times. Furthermore, the results demonstrate a promising potential of sentence embeddings for capturing semantic association, as only implementations relying on sentence embeddings indicate reliable results of semantic association beyond word predictability on both neural and behavioral measures. Together, these findings highlight the importance of methodological choices in quantifying semantic association.
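
A common way to operationalize embedding-based semantic association is the cosine similarity between a target word's vector and a pooled vector for its preceding context. The sketch below shows that idea with a configurable context length. The abstract does not state which similarity measure or pooling the ten implementations use, so cosine similarity, mean pooling, and all function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semantic_association(word_vec, context_vecs, window=None):
    """Similarity of a word to the mean of its last `window` context vectors.

    window=None uses the whole preceding context.
    """
    context = context_vecs if window is None else context_vecs[-window:]
    return cosine(word_vec, mean_vector(context))


# Toy 2-d vectors standing in for real embeddings.
full_context = semantic_association([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
short_context = semantic_association([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], window=1)
```

Changing `window` alone changes the estimate for the same word, which is the kind of implementation sensitivity the study examines.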

2605.03357 2026-06-09 cs.LG math.OC Updated version

Population-Aware Imitation Learning in Mean-field Games with Common Noise

Grégoire Lambrecht, Mathieu Laurière

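Affiliations: Institut National des Sciences et Techniques de l'Information et des Systèmes (INSTI)

AI summary: For mean-field games with common noise, proposes a population-aware imitation learning framework with two proxies, behavioral cloning and adversarial divergence, establishes finite-sample error bounds, and computes expert policies with generalized fictitious play and deep learning; experiments show that population-aware policies are essential for coping with the stochasticity.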

Abstract

Mean Field Games (MFGs) provide a powerful framework for modeling the collective behavior of large populations of interacting agents. In this paper, we address the problem of Imitation Learning (IL) in MFGs subject to common noise, where the population distribution evolves stochastically. This stochasticity compels agents to adopt population-aware policies to respond to aggregate shocks. We formulate two distinct learning objectives: recovering a Nash equilibrium and maximizing performance against an expert population. We investigate two imitation proxies: Behavioral Cloning (BC) and Adversarial (ADV) divergence. We then establish finite-sample error bounds showing that minimizing these proxies effectively controls both the policy's exploitability and its performance gap relative to the expert. Furthermore, we propose a numerical framework using generalized Fictitious Play and Deep Learning to compute expert population-aware policies. Through experiments on three environments we demonstrate that standard population-unaware policies fail to capture the equilibrium dynamics. Our results highlight that learning population-aware policies is crucial to avoid being misled by the randomness inherent in common noise.

2605.03229 2026-06-09 cs.CL cs.LG Updated version

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

Prakhar Gupta, Garv Shah, Satyam Goyal, Anirudh Kanchi

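Affiliations: University of Washington

AI summary: Proposes sparse memory finetuning (SMF), which adds key-value memory layers and updates only the memory rows most active on the current batch; it improves MedMCQA by 2.5 percentage points while keeping the forgetting probes (WikiText perplexity and TriviaQA accuracy) within 1 percentage point of the baseline, outperforming LoRA and full finetuning.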

2605.03226 2026-06-09 cs.LG cs.AI cs.CR Updated version

Self-Mined Hardness for Safety Fine-Tuning

Prakhar Gupta, Garv Shah, Donghua Zhang

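Affiliations: University of California, Berkeley

AI summary: Scores prompt difficulty from the model's own rollouts and safety fine-tunes on the hardest prompts, cutting the attack success rate on Llama-3 models to 1-3% at the cost of a higher refusal rate; mixing in benign prompts balances the trade-off.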

Abstract

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
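
The scoring and selection step described in the abstract can be sketched as follows: a prompt's hardness is the fraction of the model's rollouts judged harmful, and training uses the hardest share of an eligible pool. The eligibility rule used here (at least one harmful rollout to signal difficulty and at least one safe rollout to train on) and the function names are assumptions; the abstract does not define the eligible pool.

```python
def hardness(judgements):
    """Fraction of rollouts judged harmful (True means harmful)."""
    return sum(judgements) / len(judgements)


def select_hardest(pool, fraction=0.5):
    """Return the hardest `fraction` of eligible prompts, hardest first.

    pool maps each prompt to a list of per-rollout harmfulness judgements.
    Eligibility (assumed): some rollout was harmful and some was safe, so
    the prompt is difficult yet still has a safe rollout to fine-tune on.
    """
    eligible = {p: j for p, j in pool.items() if any(j) and not all(j)}
    ranked = sorted(eligible, key=lambda p: hardness(eligible[p]), reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical judgements for four rollouts per prompt.
pool = {
    "a": [True, True, True, False],     # hardness 0.75
    "b": [True, False, False, False],   # hardness 0.25
    "c": [False, False, False, False],  # never harmful: not eligible
    "d": [True, True, True, True],      # no safe rollout: not eligible
    "e": [True, True, False, False],    # hardness 0.50
}
hardest_half = select_hardest(pool)
```

Replacing the sort with a random shuffle gives the "random half" baseline that the abstract compares against.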

2605.05138 2026-06-09 cs.AI Updated version

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Sergey Rodionov

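Affiliations: SingularityNET

AI summary: Presents a coding-agent system that maintains an executable Python world model, verifies it against observations, refactors it toward simpler abstractions, and plans inside the model; it reports preliminary results on ARC-AGI-3 games, fully solving 15 games with GPT-5.5 at high reasoning effort.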

Comments 13 pages. Accepted for publication at AGI-2026

Abstract

We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against previous observations, refactors it toward simpler abstractions as a practical proxy for an MDL-like simplicity bias, and plans through the model before acting. The system is intentionally direct: it uses a scripted controller, predefined world-model interfaces, verifier programs, and a plan executor, but no hand-coded game-specific logic. The agent-facing prompts, workspace, and controller contain no game-specific code, game-specific prompts, hand-coded heuristics, hidden solutions, or other game-specific information; the same agent and prompts are used across games. Because the coding agent has broad system access, we audit unintended information channels, describe earlier vulnerable harnesses, and explain how the current harness closes observed leakage channels while reducing benchmark-specific information exposure. We report results on the 25 public ARC-AGI-3 games. Each playthrough starts from a fresh agent instance and clean workspace, with no access to files or conversation state from earlier playthroughs. With GPT-5.5 high reasoning effort, the agent fully solved 15 games and achieved a mean per-game RHAE of 58.12%. With GPT-5.4 high reasoning effort, it fully solved 8 games and achieved a mean per-game RHAE of 41.29%. Performance on the private validation set, which is not yet available to us, remains to be tested. Overall, the results provide preliminary evidence that verifier-driven executable world models are a promising approach for ARC-AGI-3 agents. Full run artifacts are released with the code at https://github.com/astroseger/arc-3-agents-baseline1.

2605.03395 2026-06-09 cs.SD cs.AI cs.LG cs.MM Updated version

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Jaavid Aktar Husain, Dorien Herremans

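Affiliations: AMAAI Lab, Singapore University of Technology and Design

AI summary: Proposes APEX, which uses MERT audio embeddings to jointly predict popularity metrics and five aesthetic quality dimensions for AI-generated music, and validates on the Music Arena dataset that aesthetic features generalize to preference prediction.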

Abstract

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored, in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals (stream and like scores) alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

2605.03058 2026-06-09 cs.LG cs.AI Updated version

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

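Affiliations: Università della Svizzera italiana

AI summary: Proposes MechaRule, which grounds rule extraction in LLM circuits by localizing sparse agonist activations; with adaptive group testing and confidence-guided pruning it identifies key neurons with high recall at very low cost, validated on arithmetic and jailbreaking tasks.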

Comments Accepted for publication at KDD'2026

DOI: 10.1145/3770855.3818091

Abstract

A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6--100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%.
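
The intuition behind the O(k log(N/k) + k) intervention count can be seen in a plain hierarchical-halving group test: ablate a whole group, prune it with a single intervention if nothing flips, and split it only if something does. The sketch below is a simplified, noise-free illustration and not MechaRule itself. It assumes an idealized oracle `group_flips` that returns True exactly when the ablated group contains an agonist, and it omits the confidence-guided conservative pruning described in the abstract.

```python
def find_agonists(candidates, group_flips):
    """Locate agonists by hierarchical halving.

    group_flips(group) stands in for one joint-ablation intervention and
    must return True when ablating `group` flips the behavior.
    Returns the sorted agonists and the number of interventions used.
    """
    tests = 0
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        tests += 1
        if not group_flips(group):
            continue  # no agonist inside: the whole group is pruned at once
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), tests


# Hypothetical setup: 3 agonists hidden among 1024 candidate activations.
true_agonists = {3, 500, 901}
agonists, n_tests = find_agonists(
    range(1024), lambda group: any(i in true_agonists for i in group))
```

With few agonists, most of the candidate set is discarded in large blocks, so the number of interventions stays far below the 1024 needed to ablate every candidate individually.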

2605.02950 2026-06-09 cs.LG cs.AI Updated version

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces

Mohit Kumar, Somayeh Kargaran, Bernhard A. Moser, Manuela Geiß

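Affiliations: University of Rostock; Software Competence Center Hagenberg GmbH

AI summary: Proposes the Kernel Affine Hull Machine (KAHM) as a lightweight query encoder that, for a fixed teacher representation space, replaces neural query encoding with posterior-weight estimation in an RKHS, giving compute-efficient semantic retrieval with strong performance.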

Abstract

Transformer-based semantic encoders are effective for retrieval, but in many deployments the recurring bottleneck is online query encoding rather than offline corpus indexing. This paper studies whether, once a strong teacher representation space and corpus index are fixed, repeated neural query encoding can be replaced by a substantially lighter and analytically explicit estimator. We formulate fixed-teacher lexical-to-semantic encoding as a conditional-mean estimation problem in which the target semantic vector is represented as a noisy mixture of semantic prototypes weighted by posterior cluster probabilities. Kernel Affine Hull Machine (KAHM) geometry is used to estimate these posterior weights from inexpensive lexical features in an explicitly identified RKHS hypothesis space, and the semantic prototypes are refined by normalized least-mean-squares updates from noisy teacher embeddings. This yields a backpropagation-free query-side encoder together with an end-to-end error decomposition into posterior-approximation, finite-sample/generalization, and teacher-noise terms. We instantiate the approach on a controlled Austrian-law retrieval benchmark with 5,000 test queries, 84 candidate laws, and 10,762 aligned retrieval units, using law-specific encoders into a frozen Mixedbread embedding space. Among evaluation-matched learned adapters, KAHM achieves the strongest teacher-space reconstruction and the best rank-sensitive retrieval performance at all evaluated cutoffs. At k=20, it obtains MRR@20 = 0.504, Hit@20 = 0.694, and Top-1 Accuracy = 0.411, while reducing online per-query time by a factor of 8.53 relative to direct transformer query encoding in the reported CPU setting. The results support KAHMs as compute-efficient encoders for supervised fixed-representation deployment regimes.
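
For readers unfamiliar with the reported metrics, the sketch below shows how MRR@k, Hit@k, and Top-1 accuracy are conventionally computed from ranked result lists. It assumes a single gold item per query, which may not match the benchmark's exact protocol, and the toy rankings are invented for illustration.

```python
def mrr_at_k(rankings, gold, k=20):
    """Mean reciprocal rank of the gold item within the top k (0 if absent)."""
    total = 0.0
    for ranking, target in zip(rankings, gold):
        top = ranking[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(gold)


def hit_at_k(rankings, gold, k=20):
    """Fraction of queries whose gold item appears within the top k."""
    hits = sum(target in ranking[:k] for ranking, target in zip(rankings, gold))
    return hits / len(gold)


# Three toy queries; the third gold item is never retrieved.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "x"]
mrr = mrr_at_k(rankings, gold)        # (1 + 1/2 + 0) / 3
hit = hit_at_k(rankings, gold)        # 2 of 3 queries
top1 = hit_at_k(rankings, gold, k=1)  # Top-1 accuracy is Hit@1
```

MRR rewards placing the gold item near the top of the list, while Hit@k only checks whether it appears at all, which is why the abstract reports both.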

2605.01799 2026-06-09 cs.CV Updated version

Embody4D: A Generalist Data Engine for Embodied 4D World Modeling

Peiyan Tu, Hanxin Zhu, Jingwen Sun, Shaojie Ren, Cong Wang, Yuyan Xu, Jiayi Luo, Xiaoqian Cheng, Zhibo Chen

Affiliations: Zhejiang University; Beijing Zhongguancun Academy; University of Science and Technology of China; Institute of Automation, Chinese Academy of Sciences; Shanghai Jiao Tong University; Beihang University

AI summary: Proposes Embody4D, a video-to-video world model that converts monocular robot videos into multi-view videos using a 3D-aware synthesis pipeline, latent confidence-aware expert modulation, and interaction-aware attention, addressing sparse viewpoints in embodied AI and improving downstream planning and learning.

English abstract

Embodied agents require robust and comprehensive 3D spatiotemporal representations to support spatial reasoning, manipulation understanding, and downstream decision making. However, existing robot data are typically captured from fixed or sparse viewpoints, providing only partial and view-dependent observations, which limits multi-view perception and generalization across viewpoints. Given the difficulty of collecting additional viewpoints in real-world settings, we propose Embody4D, a dedicated video-to-video world model for embodied scenarios to bridge this observation gap by transforming a monocular robot video into novel-view videos from flexible target camera viewpoints. First, to tackle training data scarcity, we introduce a 3D-aware compositional synthesis pipeline to curate a heterogeneous dataset compositing cross-embodiment robotic arms with diverse backgrounds, promoting broad generalization. Second, to enforce geometric stability, we devise a latent confidence-aware expert modulation strategy, which estimates the reliability of warped latent priors and adaptively routes regions to copy, repair, or inpaint experts for spatiotemporally consistent 4D generation. Finally, to enhance the fidelity of the manipulation, we incorporate an interaction-aware attention mechanism that explicitly attends to the robotic interaction regions. Extensive experiments show that Embody4D achieves state-of-the-art performance on visual evaluation benchmarks, while both simulated and real-world robotic experiments further demonstrate its effectiveness as a robust data engine for synthesizing high-fidelity, view-consistent videos that empower downstream robotic planning and learning.

2605.01616 2026-06-09 cs.LG cs.AI cs.CY cs.NI Version update

Learning Behavioral Signals from Encrypted Smartphone Network Traffic

Rameen Mahmood, Omar El Shahawy, Souptik Barua, Zachary Beattie, Jeffrey Kaye, Xuhai "Orson" Xu, Chao-Yi Wu, Danny Yuxing Huang

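Affiliations * New York University; NYU Langone Health; NYU Grossman School of Medicine; Oregon Health & Science University; Columbia University; Harvard Medical School

AI summary: Uses a transformer-based model with user-specific adapters to learn behavioral representations from encrypted network traffic. Analysis with sparse representation learning and generalized estimating equations finds that stress, loneliness, and sleep disturbance are associated with between-person differences, within-person fluctuations, and a combination of both, respectively, and that the learned representations outperform conventional handcrafted features.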

Comments 19 pages, 6 figures

English abstract

Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interactions with personal devices. We investigate whether encrypted smartphone network traffic can serve as a passive sensing signal for behavioral states related to sleep disturbance, stress, and loneliness. To capture both population-level patterns and individual-specific behavior, we employ a transformer-based model with user-specific adapters that learns representations of network activity while accounting for personal baselines and deviations from them. To improve interpretability, we further analyze these representations using sparse representation learning to identify latent behavioral features associated with distinct activity patterns. We relate the resulting features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, enabling separation of stable between-person differences from within-person changes over time. Our analysis reveals that the three outcomes are characterized by different temporal dynamics: stress is predominantly associated with persistent between-person variation, loneliness is more strongly linked to within-person fluctuations, and sleep disturbance reflects a combination of both. Importantly, these within-person behavioral signals are not recovered by conventional handcrafted network-traffic features, highlighting the advantages of learned representations for longitudinal behavioral modeling. Overall, our findings demonstrate that encrypted network traffic contains interpretable behavioral information and can support passive, scalable monitoring of behavioral dynamics, particularly changes relative to an individual's typical pattern of activity.
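
The Mundlak decomposition mentioned in this abstract separates each repeated measurement into a stable person-level mean (between-person part) and a deviation from that mean (within-person part). A minimal sketch, assuming one feature value per observation and a hypothetical helper name `mundlak_split`; the generalized estimating equations that would use both parts as regressors are not shown.

```python
from collections import defaultdict

def mundlak_split(person_ids, values):
    """Split each observation into a person mean and a within-person deviation."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, v in zip(person_ids, values):
        sums[pid] += v
        counts[pid] += 1
    # Person-level means capture stable between-person differences.
    means = {pid: sums[pid] / counts[pid] for pid in sums}
    between = [means[pid] for pid in person_ids]
    # Deviations capture how an observation departs from that person's baseline.
    within = [v - means[pid] for pid, v in zip(person_ids, values)]
    return between, within
```

Entering both columns in the same model lets a coefficient on `between` be read as a between-person association and a coefficient on `within` as a within-person one.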

2606.07435 2026-06-09 cs.CV cs.CL Version update

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

Rishabh Jain, Naomi Harte

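Affiliations * Sigmedia Group, School of Engineering, Trinity College Dublin

AI summary: Comparing VSR systems with humans on the MaFI dataset shows that models achieve higher overall accuracy but make different errors than humans, relying mainly on language cues from training data rather than visual perception.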

Comments Accepted at INTERSPEECH 2026

English abstract

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

2606.07431 2026-06-09 cs.CV Version update

OpenGlass: Ultra-Low-Power On-Device AI Eyewear with Event-based Vision

Pietro Bonazzi, Julian Moosmann, Ahmet Celik, Philipp Mayer, Michele Magno

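Affiliations * Department of Information Technology and Electrical Engineering, ETH Zürich

AI summary: Proposes OpenGlass, an open-source smart glasses platform with a modular design, event-driven power management, and a GAP9 RISC-V SoC for low-power on-device ML, reaching 83.94% cross-subject gesture recognition accuracy on the LynX dataset.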

English abstract

Smart eyewear enables unobtrusive, context-aware interaction through multimodal sensors and on-device intelligence, but is severely limited by power, memory, and compute constraints in a compact form factor. Open-hardware platforms supporting event-based vision and embedded ML at this scale are rare. This work introduces an open-source smart glasses platform for rapid prototyping of novel sensors and algorithms. Its modular design uses a flexible FPC interposer to support both event-based and frame-based cameras without full PCB redesign. A hardware-software co-designed power management system combines a configurable PMIC with event-driven wake-up via an nRF5340 coordinator, keeping the GAP9 RISC-V SoC powered down between inferences. The prototype achieves up to 11.5 hours of continuous on-device ML from a 200 mAh battery. As a demonstration, an egocentric hand gesture recognition pipeline was evaluated on the LynX dataset using polarity-separated event histograms from a Prophesee GENX320 camera. R(2+1)D achieved the best cross-subject accuracy of 83.94% (macro F1 = 0.781) under leave-two-subjects-out validation, with 78.3 ms end-to-end inference latency on the GAP9. Temporal augmentation and removal of ambiguous classes provided the largest gains (+8.9 pp). All hardware designs, firmware, and models are released open source.

2606.07419 2026-06-09 cs.CV Version update

DisPOSE: Projected Polystochastic Diffusion for Self-Supervised Multi-View 3D Human Pose Estimation

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

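Affiliations * Imperial College London; Technical University of Munich

AI summary: Proposes the DisPOSE framework, which models multi-view person assignment as a generative diffusion process over the space of polystochastic tensors and uses differentiable Sinkhorn projections and a hypergraph-convolutional decoder for self-supervised 3D human pose estimation, performing strongly on standard datasets and in occluded operating-room scenes.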

English abstract

Recovering 3D human poses for multiple individuals from different camera views is a fundamental bottleneck for analyzing interacting behaviors. Existing self-supervised approaches leverage synthetic catalogues of 3D poses; however, this leads to poor generalization in real-world scenarios due to distribution shifts. We therefore introduce DisPOSE, a self-supervised framework that approximates the inherently discrete multi-view person-assignment problem as a generative diffusion process over the space of polystochastic tensors. By employing differentiable Sinkhorn projections during denoising, our model learns to guide solutions toward valid and feasible assignments based on 2D image priors. The complete 3D skeletons of localized individuals are then regressed using a Hypergraph-Convolutional Decoder that explicitly models relational structures and articulated joints across multiple views. The proposed approach outperforms current state-of-the-art self-supervised methods on standard datasets and demonstrates strong performance on a newly proposed benchmark featuring highly occluded scenes from surgical operating rooms. Our diffusion-based localization demonstrates high label efficiency, retaining 99% of its performance with only 10% of the pseudo-labels. Notably, disentangling the assignment and root regression components while maintaining differentiability makes DisPOSE nearly agnostic to different camera arrangements.
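
The Sinkhorn projection named in this abstract can be illustrated in its simplest two-way form, where alternating row and column normalization pushes a positive matrix toward a doubly stochastic one. This is only a matrix-level sketch under stated assumptions: the paper works with polystochastic tensors across several views, and the function name, `n_iters`, and the temperature `tau` are choices made here for illustration.

```python
import numpy as np

def sinkhorn(scores, n_iters=50, tau=1.0):
    """Approximately project a square score matrix onto doubly stochastic matrices."""
    # Exponentiate for positivity; subtracting the max keeps exp() numerically stable.
    P = np.exp((scores - scores.max()) / tau)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # make every row sum to 1
        P = P / P.sum(axis=0, keepdims=True)  # make every column sum to 1
    return P
```

Every step is a division by a sum, so the whole loop is differentiable, which is what allows such a projection to sit inside a learned denoising process.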

2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME Version update

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

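Affiliations * The University of Tokyo; RIKEN

AI summary: Proposes the CapCode framework, which detects cheating by models on coding tasks through capped evaluation, and the CapReward reward design to prevent cheating. Experiments show the approach detects and reduces cheating.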

English abstract

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
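
The abstract does not say how "substantially above the cap" is decided. One hypothetical way to make the idea concrete is a one-sided binomial tail check: if an honest solver passes each randomized test independently with probability at most `cap`, a score far above the cap has a very small tail probability. The independence assumption, the function names, and the threshold `alpha` are all assumptions introduced here and are not taken from the paper.

```python
from math import comb

def tail_probability(n_tests, n_passed, cap):
    """P(X >= n_passed) for X ~ Binomial(n_tests, cap)."""
    return sum(
        comb(n_tests, k) * cap**k * (1 - cap) ** (n_tests - k)
        for k in range(n_passed, n_tests + 1)
    )

def looks_like_cheating(n_tests, n_passed, cap, alpha=0.01):
    """Flag a score that is implausibly high for a non-cheating solver."""
    return tail_probability(n_tests, n_passed, cap) < alpha
```

For example, with a cap of 0.8, passing all 100 tests would be flagged, while passing 80 of 100 would not.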

2606.07118 2026-06-09 cs.RO Version update

QuadVerse: An Integrated Framework Aligning Visual-Physical Reality for Quadruped Simulation

Yuxiang Chen, Yuanhao Wang, Ziheng Zhang, Meng Zhang, Yu Liu, Yufei Jia, Tiancai Wang, Erjin Zhou, Jin Xie

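Affiliations * Nanjing University; BUPT; DEXMAL; Tsinghua University

AI summary: Proposes the QuadVerse framework, which uses reconstructed scenes to calibrate vision, physics, and actuators, narrowing the sim-to-real gap with 3DGS and contact calibration and enabling zero-shot deployment of visual-navigation policies.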

English abstract

Simulation is central to robot learning, yet the sim-to-real gap remains a major bottleneck. Existing approaches often tackle visual or dynamic gaps separately, overlooking how these individual mismatches accumulate and propagate throughout the robot's state evolution. In this paper, we introduce QuadVerse, an integrated framework that uses reconstructed scenes as a calibration substrate for aligning visual perception, physical interaction, and actuator dynamics. From captured RGB videos, we reconstruct geometry-constrained 3D Gaussian Splatting (3DGS) scenes that support batched photorealistic ego-view rendering and collision-ready semantic mesh extraction. The meshes further enable contact calibration by initializing spatially varying friction priors and refining them through trajectory-based posterior search. To address remaining actuator discrepancies, QuadVerse trains a residual dynamics compensator by replaying real-world trajectories on the contact-calibrated terrain, reducing the entanglement between terrain-induced contact errors and actuator non-idealities. Experiments show that QuadVerse improves reconstruction quality and locomotion tracking over relevant baselines. Leveraging this foundation, we demonstrate robust zero-shot visual-navigation policy deployment without task-specific real-world rollouts.

2606.07108 2026-06-09 cs.AI Version update

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

Tengyao Tu, Yulin Li, Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li, Min Zhang

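Affiliations * Harbin Institute of Technology, Shenzhen; Zhongguancun Academy; Huawei Noah’s Ark Lab; Shenzhen Loop Area Institute; Tsinghua University

AI summary: Proposes DyCon, a training-free framework that uses step-level embeddings to model how difficulty evolves during reasoning and controls reasoning depth, reducing redundant steps and improving efficiency without loss of accuracy.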

Comments Accepted at ICML 2026

English abstract

Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM's step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Code is available at https://github.com/yu-lin-li/DyCon.

2606.07047 2026-06-09 cs.AI Version update

Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search

Alvin Zou, Muhammad Suhail Saleem, Maxim Likhachev

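Affiliations * University of California, Berkeley

AI summary: Proposes the Front-to-Attractors (F2A) heuristic class, which replaces the full frontier with a dynamically maintained set of attractors, retaining much of the informativeness of Front-to-Front at far lower computational cost. Experiments show up to 11.2x fewer pairwise evaluations than F2F and 4.8x fewer node expansions than F2E on average.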

English abstract

Heuristics play a central role in the performance of bidirectional search algorithms, which commonly rely on two main classes. Front-to-end (F2E) heuristics estimate the distance from a state s to the target of the search (the goal for forward search or the start for backward search). In contrast, front-to-front (F2F) heuristics estimate the distance from s to the opposite search frontier using a pairwise function h(s, s'), where s' ranges over frontier states. Although F2F heuristics are typically more informative and therefore reduce the number of node expansions, their reliance on extensive pairwise evaluations incurs substantial computational overhead. To address this limitation, we introduce a new heuristic class, front-to-attractors (F2A), that preserves much of the informativeness of F2F while dramatically reducing its computational cost. Rather than evaluating distances to all states on the opposite frontier, F2A estimates the distance from s to a small, dynamically maintained set of attractors in the opposite search direction. These attractors serve as a surrogate for the full frontier, enabling rich heuristic guidance at a fraction of the computational expense while maintaining the optimality guarantees offered by F2F. We evaluate F2A across multiple domains and show that it reduces the number of pairwise evaluations by up to 11.2x compared to F2F, while achieving 4.8x fewer node expansions than F2E on average.

2606.06915 2026-06-09 cs.CL cs.AI cs.LG Version update

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

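Affiliations * MBZUAI; ETH Zürich; Imperial College London; NUS; Accenture; Innopolis University; Independent Researcher

AI summary: Proposes the ThinkBooster framework, which provides a modular library, a joint evaluation benchmark, and a deployable proxy service for test-time compute scaling of LLM reasoning, and examines performance-compute trade-offs on mathematical and coding tasks.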

English abstract

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
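
The multi-sample generation and verifier-based reranking mentioned in this abstract is often called best-of-N selection. A minimal sketch follows, assuming two hypothetical callables supplied by the caller: `generate(prompt, seed)` returns one candidate answer and `score(prompt, candidate)` returns a number where higher is better. Neither is part of the ThinkBooster library, whose actual API is not described here.

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n candidate answers and return the one the scorer ranks highest."""
    # Inference cost grows linearly with n: one generation and one scoring call each.
    candidates = [generate(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))
```

The quality-cost trade-off the abstract refers to shows up directly in `n`: larger values give the scorer more candidates to choose from at proportionally higher compute.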

2606.06656 2026-06-09 cs.AI cs.LO Version update

A Study of Parallel Continuous Local Search

Cody J Christopher, Charles Gretton

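Affiliations * School of Computing, Australian National University

AI summary: Studies parallel Continuous Local Search (CLS) for satisfiability problems with symmetric pseudo-Boolean constraints, finding that redundant constraints can inhibit convergence, that CLS quickly completes partial assignments in hybrid solving, and that local search rapidly converges to a stable distribution of solution quality because of saddle-dense objectives.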

English abstract

We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.

2606.06554 2026-06-09 cs.LG cs.AI Version update

Multi-Scale Feature Attention Network for Polymer Classification Using Terahertz Spectroscopy

Roshni Mahtani, Ilán Carretero, Laura Monroy, Aldo Moreno-Oyervides, Oscar Elías Bonilla-Manrique, Rocío del Amor

Affiliations * Instituto Universitario de Investigación e Innovación en Tecnología Centrada en el Ser Humano, HUMAN-tech, Universitat Politècnica de València; Department of Electronic Technology, Universidad Carlos III de Madrid; Artikode Intelligence S.L.

AI summary: Proposes the Multi-Scale Feature Attention Network (MSFAN), which combines feature gating with multi-scale parallel convolutions to classify 12 types of polymers from terahertz dual-comb spectroscopy, reaching 85.2% accuracy.

Comments Accepted in EUSIPCO'26

English abstract

Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz (THz) spectroscopy offers a promising alternative, providing high-resolution and non-destructive measurements. In this work, we leverage THz signals to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz spectroscopy with deep learning techniques for effective, scalable, and interpretable polymer classification.

2606.06443 2026-06-09 cs.CL cs.MM cs.SI Version update

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo

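Affiliations * Fudan University; University of Rochester; Shanghai Innovation Institute

AI summary: Audits LLM stance simulation with a counterfactual context revision framework, comparing text-only and multimodal strategies and evaluating average directional stance shift and stance transition rate to characterize the effectiveness and robustness of context sensitivity.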

English abstract

Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user's stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user's stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.

2606.06407 2026-06-09 cs.CV cs.IR cs.LG eess.IV Version update

A Vision-language Framework for Comparative Reasoning in Radiology

Tengfei Zhang, Ziheng Zhao, Xiaoman Zhang, Lisong Dai, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

Affiliations * University of Science and Technology of China; Shanghai Artificial Intelligence Laboratory; School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Biomedical Informatics, Harvard Medical School; Department of Radiology, Renmin Hospital of Wuhan University; Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University

AI summary: Proposes an entity-aware cross-image reasoning framework. By building the large-scale comparative imaging dataset MedReCo-DB and developing the MedReCo and MedReCo-VLM models, it supports reference-case retrieval and temporal comparative interpretation and substantially improves comparative reasoning performance in radiology.

English abstract

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision-language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.


A Vision-language Framework for Comparative Reasoning in Radiology