Duc-Manh Phan, Quoc-Duy Tran, Duy-Khang Do, Anh-Tuan Vo, Hai-Dang Nguyen, Trong Le Do, Mai-Khiem Tran, Vinh-Tiep Nguyen, Tam V. Nguyen, Isao Echizen, Minh-Triet Tran, Trung-Nghia Le

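Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh; University of Information Technology, VNU-HCM; University of Dayton; National Institute of Informatics

AI summary: To address the difficulty of detecting realistic disaster images generated by diffusion models, the paper proposes a benchmark dataset of 30,000 images (6,000 real, 24,000 synthetic). Experiments show that fine-tuned detectors lose up to 50% accuracy on unseen generators and that zero-shot detectors are also unstable, underscoring the urgent need for cross-domain detection.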

Comments SOICT 2025

Abstract

The rapid advancement of text-to-image diffusion models has enabled the creation of highly photorealistic synthetic images that closely resemble real photographs, making it increasingly difficult to distinguish authentic content from AI-generated fabrications. This poses challenges for cybersecurity, digital forensics, and disaster response, where fake imagery of floods, fires, or earthquakes can spread misinformation or disrupt emergency operations. To address this, we introduce Forged Calamity, a benchmark dataset for synthetic disaster detection containing 30,000 images, including 6,000 real and 24,000 synthetic samples generated by four diffusion models. Comprehensive experiments across fine-tuned and zero-shot settings reveal consistent weaknesses in current forensic approaches. Fine-tuned detectors perform well in-distribution but lose up to 50% accuracy on unseen generators or disaster types, showing overfitting to model-specific artifacts. Zero-shot generalized detectors also struggle to maintain stable accuracy, with only limited resilience in a few representation-robust models. These findings highlight persistent generalization gaps and the urgent need for domain- and model-agnostic detection methods to ensure visual authenticity in the diffusion era.

2606.18553 2026-06-18 cs.CV New submission

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

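Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary: Proposes a retrieval-augmented image captioning framework built on hierarchical multi-modal article retrieval. Structure-aware retrieval and contextual refinement are combined with a VLM and an LLM to generate captions rich in contextual detail; the system placed 5th in the EVENTA 2025 challenge.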

Comments SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18543 2026-06-18 cs.AI cs.CL cs.SE New submission

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

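Affiliations: Princeton University

AI summary: Proposes CEO-Bench, which simulates operating a startup for 500 days to evaluate the combined decision-making ability of language model agents in long-horizon, uncertain, and dynamic environments.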

Abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18539 2026-06-18 cs.LG stat.ML New submission

TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults

Yuyang Zhao, Lian Xu, Hao Miao, Chenxi Liu, Hao Xue

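Affiliations: Ray-zyy

AI summary: Proposes the TS-Fault benchmark, which evaluates the robustness of time series forecasting models under parameterized fault scenarios organized along two axes (observation- vs mechanism-level; univariate vs multivariate). It finds that clean-data accuracy anti-correlates with robustness, that mechanism-level faults reshuffle rankings, and that foundation models are the most fragile.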

Abstract

Time series forecasting (TSF) underpins consequential decisions in energy, transportation, finance, and healthcare, yet TSF models are almost universally ranked by a single number (e.g., average error) on clean held-out data, under the implicit assumption that it predicts deployed reliability. However, real faults are not i.i.d. noise but structured events with temporal shape, broken cross-variable dependencies, regime change coupled with missingness, and causal propagation across a sensing pipeline. Treating TSF robustness as a data-quality problem, we present TS-Fault, a benchmark that evaluates forecasting models under explicit, parameterized fault scenarios with controllable semantic difficulty. TS-Fault organizes recurring failures into four modes along two orthogonal axes (observation- vs mechanism-level; univariate vs multivariate) and injects each fault into the most prediction-critical window via a unified importance score. This design enables robustness to be tested against the structures models actually rely on, rather than reduced to generic noise sensitivity. We evaluate 21 models across 6 datasets, 4 modes, and 5 difficulty levels under a paired clean/corrupt protocol. The results reveal three findings that contradict common leaderboard intuition: (i) clean-data accuracy anti-correlates with robustness; (ii) clean rankings are preserved under observation-level faults but reshuffled under mechanism-level faults; and (iii) all catastrophic failures occur under mechanism-level faults, with foundation models achieving the highest clean-data accuracy yet exhibiting the greatest fragility. The code is publicly available at https://github.com/Ray-zyy/TS-Fault.

2606.18538 2026-06-18 cs.LG stat.ML New submission

Effects of sparsity and superposition on loss in simple autoencoders

Mriganka Basu Roy Chowdhury, Eric McLaughlin Weiner

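Affiliations: Department of Statistics, UC Berkeley; Department of Materials Science, UC Berkeley

AI summary: Studies the claim that polysemanticity in neural networks arises from superposition. By deriving upper and lower bounds on the L2 reconstruction loss of an autoencoder with sparse inputs, the paper rigorously corroborates and extends the empirical results of Elhage et al.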

Comments 16 pages, 3 figures

Abstract

One of the major difficulties in the mechanistic interpretability of neural networks is the occurrence of polysemanticity, which suggests that each neuron is typically responsible for multiple different tasks, impeding a clean interpretation of their function. The seminal paper of Elhage et al. (2022) argues that this occurs due to superposition, a phenomenon where the neural network represents distinct features as non-orthogonal directions in a lower-dimensional space, a strategy that allows much greater compression of the data without sacrificing fidelity due to the feature sparsity of input vectors. Elhage et al. (2022) empirically validate these hypotheses in a rather natural and simple autoencoder with sparse inputs. The contribution of the present work is to analyze the mathematical basis for the occurrence and optimality of superposition, while rigorously corroborating some of their findings. In particular, we provide upper and lower bounds for the L2 reconstruction loss, tight in the very sparse regime, for power activation functions. A short list of interesting open problems is also included at the end.
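Illustrative sketch (not taken from the paper): the superposition idea can be seen in the smallest instance of the toy autoencoder of Elhage et al. (2022), where the reconstruction is ReLU(W^T W x + b). The weights, inputs, and helper names below are assumptions chosen for illustration; two features share a single hidden dimension as an antipodal pair, and ReLU is used rather than a general power activation.

```python
def relu(v):
    return [max(0.0, t) for t in v]

def reconstruct(W, x, b=None):
    """x_hat = ReLU(W^T W x + b), with W given as m rows of n columns."""
    n = len(x)
    b = b or [0.0] * n
    # Encode: project the n features down to m hidden dimensions.
    h = [sum(row[j] * x[j] for j in range(n)) for row in W]
    # Decode with the transposed weights, then apply the nonlinearity.
    pre = [sum(W[i][j] * h[i] for i in range(len(W))) + b[j] for j in range(n)]
    return relu(pre)

def l2_loss(W, x):
    """Squared L2 reconstruction error for a single input vector."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, reconstruct(W, x)))

# Hypothetical weights: two features packed into one hidden dimension.
W = [[1.0, -1.0]]

print(l2_loss(W, [0.7, 0.0]))  # sparse input: only one feature active
print(l2_loss(W, [0.7, 0.5]))  # dense input: both features active
```

With only one feature active the reconstruction is exact, while activating both features at once produces interference and a nonzero loss. This is why sparsity makes the compressed, non-orthogonal representation cheap.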

2606.18537 2026-06-18 cs.LG New submission

Concept Modulation Models: A Unified Framework for Identifiability and Extrapolation

Soheun Yi, Yizhou Lu, Chandler Squires, Pradeep Ravikumar

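Affiliations: Department of Statistics and Data Science, Carnegie Mellon University; Machine Learning Department, Carnegie Mellon University

AI summary: Proposes concept modulation models (CMMs), which use attribute potentials to unify the analysis of identifiability and extrapolation in conditional latent variable models, lift transition-based identifiability to conditional settings, and yield algebraic extrapolation criteria.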

Abstract

Reliable generalization in conditional latent variable models requires understanding both identifiability and extrapolation: how observed variation across attributes determines latent structure, and how that structure determines distributions at unseen attributes. However, existing identifiability and extrapolation guarantees are largely model-specific, with separate analyses in nonlinear ICA, causal representation learning, perturbation modeling, and related conditional latent variable models. We introduce concept modulation models (CMMs), an attribute-indexed class of conditional generative models with structure $A\to \Lambda\to C\to X$, where attributes select modulators, modulators induce latent concept laws, and concepts generate observed features. CMMs lift transition-based identifiability to conditional settings by showing that feature agreement on observed attributes induces a latent concept transition constrained by the CMM class. We express these constraints through attribute potentials, log-density ratios between attribute-conditioned concept laws, separating the generic lifting step from model-specific rigidity arguments. The same potentials control extrapolation: agreement at unseen attributes holds exactly when the transported attribute-potential identities extend to those attributes. This yields algebraic extrapolation criteria, identifies the common potential-based proof objects behind several existing identifiability and extrapolation results, and, when combined with the model-specific rigidity arguments in those works, recovers their stated conclusions.

2606.18508 2026-06-18 cs.CL cs.IR New submission

Neural Phase Correlation

Cole Reynolds

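Affiliations: Weyl Labs

AI summary: Proposes a learned generalization of phase correlation that decomposes the transformation on a learnable basis. The approach extends to non-rigid deformations and unitary dynamics and matches or exceeds existing methods on cardiac MRI and echocardiography datasets.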

Abstract

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark, the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography, it matches the state of the art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.
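Illustrative sketch (not the paper's learned method): classical phase correlation, the fixed-basis baseline the abstract refers to, recovers a global translation from the phase of the cross-power spectrum. The version below assumes a 1-D signal with a circular shift and uses a naive DFT to stay dependency-free; the function names and test signal are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short signals."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def phase_correlation_shift(a, b):
    """Estimate the circular shift s such that b[t] == a[(t - s) % n]."""
    fa, fb = dft(a), dft(b)
    cross = []
    for u, v in zip(fa, fb):
        c = v * u.conjugate()          # cross-power spectrum
        mag = abs(c)
        # Keep only the phase; skip bins with no energy.
        cross.append(c / mag if mag > 1e-12 else 0j)
    corr = dft(cross, inverse=True)    # ideally a single peak at the shift
    return max(range(len(corr)), key=lambda t: corr[t].real)

signal = [0, 1, 4, 9, 3, 0, 2, 7, 1, 0, 5, 2, 0, 0, 6, 1]
shifted = [signal[(t - 5) % len(signal)] for t in range(len(signal))]
print(phase_correlation_shift(signal, shifted))  # 5
```

Because the Fourier basis diagonalizes translations only, this estimator cannot express rotations or non-rigid deformations, which is the restriction the abstract says a learned basis is meant to lift.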

2606.18487 2026-06-18 cs.LG cs.AI cs.CL New submission

SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

Siddharth Aphale, Kelly Liu

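Affiliations: Stanford University

AI summary: Finds that SFT overtraining lowers the entropy of the rollout distribution, so that the advantage signal in GRPO vanishes and rank inversion follows. Proposes an entropy-based two-stage diagnostic that flags high-risk checkpoints early.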

Comments 14 pages, 6 figures. Accepted at the Deep Learning for Code (DL4C) Workshop at ICML 2026

Abstract

The standard heuristic of selecting the SFT checkpoint with the highest pass@1 for GRPO can fail when SFT compresses the rollout distribution. For binary rewards, the expected within-group advantage variance is $p(1{-}p)(g{-}1)/g$; when early GRPO drives $p$ below $p^*(g)$, most groups have identical rewards and provide no group-relative signal. We study SFT depth ladders for Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B. We test Qwen2.5-Coder-3B across five depths and three seeds, and DeepSeek-Coder-6.7B across four matched depths and three seeds. On Qwen, pre-RL pass@1 rises with SFT depth, but peak GRPO pass@10 falls from $0.806$ to $0.481$ (3-seed mean, $n{=}20$); pre-RL entropy is positively associated with the GRPO outcome ($\rho{=}{+}0.69$). On DeepSeek, pass@1 remains far above $p^*(8){=}0.083$, and GRPO outcomes compress rather than invert. A two-stage diagnostic, combining pre-RL entropy triage with an early GRPO entropy monitor, flags high-risk checkpoints and can stop failing runs early. Simple KL-to-reference regularisation and label-smoothing variants do not rescue the collapsed Qwen checkpoint in our setting, suggesting the failure is not a trivial GRPO hyperparameter artefact.
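Illustrative sketch (not the authors' code): the variance expression in the abstract can be checked by exact enumeration over the number of successes in a group of g binary rewards. The abstract does not define $p^*(g)$; the function `half_uniform_threshold` below assumes it is the success rate at which exactly half of all groups are all-fail, an assumption that happens to reproduce the quoted value of 0.083 for g = 8. All function names are hypothetical.

```python
from math import comb

def expected_group_variance(p, g):
    """Exact expected (population) variance of g i.i.d. Bernoulli(p) rewards."""
    total = 0.0
    for k in range(g + 1):
        prob = comb(g, k) * p**k * (1 - p) ** (g - k)
        mean = k / g
        total += prob * mean * (1 - mean)  # variance of a 0/1 group with k ones
    return total

def prob_uniform_group(p, g):
    """Probability that all g rewards are identical, so advantages are all zero."""
    return p**g + (1 - p) ** g

def half_uniform_threshold(g):
    """Assumed definition of p*(g): the p at which half of groups are all-fail."""
    return 1 - 2 ** (-1 / g)

p, g = 0.3, 8
print(expected_group_variance(p, g), p * (1 - p) * (g - 1) / g)  # the two agree
print(round(half_uniform_threshold(8), 3))  # 0.083
print(prob_uniform_group(0.05, 8))          # above 0.5: most groups carry no signal
```

Under this reading, a checkpoint whose per-prompt success rate falls below roughly 8% with groups of eight yields mostly uniform groups, which matches the abstract's description of the group-relative signal vanishing.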

2606.18484 2026-06-18 cs.CV New submission

Vines-DB: An RGB image dataset for multi-species ornamental vine segmentation

Saroj Burlakoti, Utsav Bhandari, Aaron Etienne, Shital Poudyal

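Affiliations: Department of Plants, Soils and Climate, Utah State University; Department of Applied Sciences, Technology and Education, Utah State University

AI summary: To support multi-class instance segmentation in precision horticulture and urban ecology, the authors built Vines-DB, an RGB image dataset of seven ornamental vine species. Manual annotation and augmentation yield 2,307 images, split into training, validation, and test sets.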

Comments 7 pages, 1 figure. Source data repository: OSF (DOI: 10.17605/OSF.IO/YJHCK)

Abstract

The Vines-DB dataset contains 1,218 original high-resolution RGB images of seven ornamental vine species collected under field conditions at the Utah Agricultural Experiment Station's Greenville Research Farm in Logan, Utah, USA. The dataset was generated from 168 individual vine plants that were transplanted in 2022 and photographed repeatedly across multiple months during the 2023 and 2024 growing seasons (July-October). Images were captured with an iPhone 16 Pro equipped with a 48 MP camera between 10:00 AM and 12:00 PM under daylight. Vines were grown on 1.2 m x 2.4 m trellises and photographed from a distance of 1 m against black or white Styrofoam backdrops to improve contrast and reduce background noise. The dataset includes Akebia quinata, Campsis radicans, Hydrangea anomala petiolaris, Lonicera x heckrottii, Campsis x tagliabuana 'Madame Galen', Parthenocissus quinquefolia, and Wisteria floribunda. All original images were manually annotated in Roboflow by trained annotators to produce polygon-based instance segmentation masks for eight classes, including seven species and background. After preprocessing and data augmentation, the working dataset was expanded to 2,307 images for model development and evaluation. The augmented dataset was divided into 2,019 training images, 192 validation images, and 96 test images using stratified sampling to maintain balanced representation. Vines-DB supports the development and evaluation of deep learning models for multi-class instance segmentation in precision horticulture and urban ecology. The dataset enables applications such as automated canopy cover estimation, species identification, and scalable field phenotyping. In addition, repeated monthly imaging of the plants captures temporal variation in canopy development and plant appearance, increasing the dataset's utility for segmentation benchmarking under realistic field conditions.

2606.18479 2026-06-18 cs.LG cs.CY New submission

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

Bruno Scarone, Ricardo Baeza-Yates

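Affiliations: Northeastern University; KTH Royal Institute of Technology

AI summary: Shows that reject inference methods in credit scoring produce misleading evaluation metrics because of a feedback loop, and proposes a small amount of exploration to break the loop and diagnose the problem.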

Comments Accepted to ECML PKDD 2026 (Research Track)

Abstract

Reject inference methods are widely used to mitigate survival bias in credit scoring, yet their effectiveness remains poorly understood. We systematically evaluate several such methods and uncover a structural failure mode: in a natural retraining cycle, models whose accuracy improves while recall collapses create an illusion of improvement that leads practitioners to believe the system is getting better when, in fact, its rejection quality (the ability to correctly screen out defaulters) is deteriorating. We then propose a controlled exploration strategy that breaks the feedback loop without statistical assumptions: the lender deliberately approves a fraction of rejected applicants and observes their true outcomes. We show that accuracy and rejection quality give opposite recommendations on whether to explore: accuracy favors no exploration, while rejection quality improves with it, confirming that standard evaluation metrics are misleading under selection bias. Even minimal exploration rates (2-5%) prove sufficient in our experiments to diagnose the severity of the feedback loop at near-zero cost. Our findings are consistent across two machine learning methods and three real-world datasets, and suggest that standard evaluation protocols are inadequate for assessing models trained under survival bias.
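Illustrative sketch (hypothetical numbers, not results from the paper): accuracy and recall can move in opposite directions whenever defaulters are a minority, which is the arithmetic behind the "illusion of improvement". The confusion-matrix counts below are invented for 1,000 applicants of whom 100 default, with "positive" meaning a defaulter who is rejected.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all applicants classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """Share of true defaulters that the model screens out."""
    return tp / (tp + fn)

# Hypothetical counts before and after a retraining cycle.
before = dict(tp=60, fp=150, tn=750, fn=40)
after = dict(tp=20, fp=20, tn=880, fn=80)

for name, m in (("before", before), ("after", after)):
    print(name, accuracy(**m), recall(m["tp"], m["fn"]))
```

In this made-up case accuracy rises from 0.81 to 0.90 while recall drops from 0.6 to 0.2: the retrained model looks better on the headline metric only because it rejects far fewer good applicants, even though it now lets most defaulters through.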

2606.18478 2026-06-18 cs.CV New submission

Possible or Certain? A Benchmark for Evaluating the Preservation of Diagnostic Uncertainty in Clinical Text

Hongbo Du, Zixin Lu, Jiaming Qu

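Affiliations: Trine University; University of Michigan; Amazon

AI summary: Builds a benchmark with 9,184 uncertainty annotations to evaluate how well LLMs preserve diagnostic uncertainty in clinical text. LLMs retain the original uncertainty cues less than half the time and struggle to distinguish adjacent levels.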

Abstract

Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertainty remains underexplored. In clinical practice, phrases such as "possible pneumonia" communicate the strength of available evidence and directly guide decisions about follow-up testing and treatment. Altering these uncertainty expressions can change the clinical meaning entirely. In this paper, we systematically evaluated this problem in two steps. First, we constructed a benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. Second, we evaluated three LLMs on this benchmark. Our results show that (1) LLMs preserve the original uncertainty cues poorly, often less than half the time; (2) LLMs struggle with nuanced distinctions between adjacent levels. This work reveals a failure mode not captured by standard evaluation metrics and provides implications for the safe deployment of LLMs in clinical workflows.

2606.18469 2026-06-18 cs.LG cs.AI New submission

Structured Representation Learning with Locally Linear Embeddings and Adaptive Feature Fusion

Somjit Nath, Jackson J Cone, Derek Nowrouzezahrai, Samira Ebrahimi Kahou

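Affiliations: Mila – Quebec AI Institute

AI summary: Inspired by neuroscience, proposes a reinforcement learning framework that uses locally linear embeddings to capture the local structure of states and an attention mechanism to adaptively fuse dynamics and reward features, improving learning efficiency.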

Comments Published in Transactions on Machine Learning Research (04/2026)

Abstract

Neuroscientific research has revealed that the brain encodes complex behaviors by leveraging structured, low-dimensional manifolds and dynamically fusing multiple sources of information through adaptive gating mechanisms. Inspired by these principles, we propose a novel reinforcement learning (RL) framework that encourages the disentanglement of dynamics-specific and reward-specific features, drawing direct parallels to how neural circuits separate and integrate information for efficient decision-making. Our approach leverages locally linear embeddings (LLEs) to capture the intrinsic, locally linear structure inherent in many environments, mirroring the local smoothness observed in neural population activity, while concurrently deriving reward-specific features through the standard RL objective. An attention mechanism, analogous to cortical gating, adaptively fuses these complementary representations on a per-state basis. Experimental results on benchmark tasks demonstrate that our method, grounded in neuroscientific principles, improves learning efficiency and overall performance compared to conventional RL approaches, highlighting the benefits of explicitly modeling local state structures and adaptive feature selection as observed in biological systems.

2606.18466 2026-06-18 cs.CL New submission

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Michael McAuliffe, Kaylynn Gunter, Michael Wagner, Morgan Sonderegger

Affiliations: University of Wisconsin-Madison; McGill University; Centre for Brain, Language, and Music; University of Oregon

AI summary: Describes the development of MFA 3.0 since version 1.0 and evaluates it on English, Japanese, and Korean, where it achieves the best or near-best performance on four benchmark datasets with a mean boundary error below 15 ms.