Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi Using Real-World Conversations

Bhaskar Singh, Shobhit Banga, Mahima Manik, Pranav Sharma

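AI summary: This paper presents the first open, reproducible full-duplex spoken dialogue system for Hindi, built by adapting the Moshi architecture with a custom Hindi tokeniser and training on 26,000 hours of real conversational data; the system exhibits natural interruptions, overlaps, and backchannels.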

Abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

2604.23017 2026-05-26 cs.LG cs.NA math.CV math.NA

Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment

Daniel J. Tan, Jayne Hui Zhen Chan, Kai Wen Hwang, Arturo Yong Yao Neo, Kay Choong See, Mengling Feng

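AI summary: Proposes the CN-PR framework, which uses a large language model to extract trajectory-level preferences from discharge summaries and learns a reward function through preference optimization, improving sepsis treatment outcomes in offline reinforcement learning.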

Abstract

Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are sparse, delayed, and difficult to explicitly specify. Although structured clinical data capture physiologic states, they often fail to reflect broader aspects of patient trajectories such as treatment response, recovery dynamics, and intervention burden. Clinical narratives, by contrast, encode longitudinal clinician assessments of disease progression, treatment effectiveness, and recovery, providing a potential source of trajectory-level supervision beyond predefined outcome metrics. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework that learns reward functions directly from discharge summaries by treating clinical narratives as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores and construct pairwise preferences between patient trajectories to learn rewards through preference-based optimization. To account for variability in narrative informativeness, we incorporate a task relevance signal that weights supervision according to its relevance to the downstream decision-making task. We evaluate CN-PR in dynamic sepsis treatment using offline RL. The learned reward demonstrated strong monotonic alignment with trajectory quality scores and produced policies associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining mortality performance comparable to outcome-based reward baselines. These findings were preserved under external validation. Our results suggest that clinical narratives provide a scalable and expressive source of supervision for reward learning in dynamic treatment regimes.
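The pairwise-preference step described in the abstract can be sketched roughly as follows. This is a minimal illustration assuming a Bradley-Terry style objective, which the abstract does not confirm; the function names, the `min_gap` threshold, and the way the relevance weight enters the loss are all hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def preference_pairs(scores, min_gap=0.0):
    """Build (preferred, rejected) index pairs from per-trajectory quality scores.

    `scores` stands in for the LLM-derived trajectory quality scores;
    `min_gap` is a hypothetical threshold for ignoring near-ties.
    """
    pairs = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if scores[i] - scores[j] > min_gap:
                pairs.append((i, j))
    return pairs


def bradley_terry_loss(reward_preferred, reward_rejected, weight=1.0):
    """Weighted negative log-likelihood that the preferred trajectory wins.

    `weight` is an assumed stand-in for the task relevance signal.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin)
    return weight * softplus(-margin)
```

In this sketch a pair whose narrative is judged less relevant simply contributes less to the total loss; the paper may combine the relevance signal differently.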

2604.08870 2026-05-26 cs.LG cs.AI

Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations

Rafael da Silva, Jeff Eicher, Gregory Longo

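AI summary: Using the OULAD dataset, this study evaluates dropout-risk models with a harmonized survival-analysis benchmark (covering a dynamic weekly representation and a continuous-time representation) and finds that temporal-behavioral features are more predictive than static background attributes.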

Comments 34 pages, 14 figures, 18 tables. Includes appendix with reliability diagrams, sensitivity analyses, and dataset audit tables

Abstract

Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heterogeneous protocols, prioritizing discrimination over temporal interpretability and calibration. This study introduces a survival-oriented benchmark for temporal dropout risk modelling using the Open University Learning Analytics Dataset (OULAD). Two harmonized arms are compared: a dynamic weekly arm, with models in person-period representation, and a comparable continuous-time arm, with an expanded roster of families -- tree-based survival, parametric, and neural models. The evaluation protocol integrates four analytical layers: predictive performance, ablation, explainability, and calibration. Results are reported within each arm separately, as a single cross-arm ranking is not methodologically warranted. Within the comparable arm, Random Survival Forest leads in discrimination and horizon-specific Brier scores; within the dynamic arm, Poisson Piecewise-Exponential leads narrowly on integrated Brier score within a tight five-family cluster. No-refit bootstrap sampling variability qualifies these positions as directional signals rather than absolute superiority. Ablation and explainability analyses converged, across all families, on a shared finding: the dominant predictive signal was not primarily demographic or structural, but temporal and behavioral. Calibration corroborated this pattern in the better-discriminating models, with the exception of XGBoost AFT, which exhibited systematic bias. These results support the value of a harmonized, multi-dimensional benchmark in Learning Analytics and situate dropout risk as a temporal-behavioral process rather than a function of static background attributes.
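For readers unfamiliar with the person-period representation and the Brier score mentioned in the abstract, a small sketch may help. The field names are hypothetical, and the Brier score here ignores censoring (no inverse-probability weighting), so it is a simplification and not the benchmark's actual protocol.

```python
def to_person_period(student_id, last_week, dropped_out):
    """Expand one student into one row per week at risk.

    The event flag is 1 only in the final observed week of a student
    who dropped out; censored students get 0 in every row.
    """
    rows = []
    for week in range(1, last_week + 1):
        event = 1 if (dropped_out and week == last_week) else 0
        rows.append({"id": student_id, "week": week, "event": event})
    return rows


def survival_from_hazards(hazards):
    """Turn weekly hazards h_k into survival S(t) = prod_{k<=t} (1 - h_k)."""
    survival, curve = 1.0, []
    for h in hazards:
        survival *= 1.0 - h
        curve.append(survival)
    return curve


def brier_score(predicted_survival, still_enrolled):
    """Mean squared error between predicted survival probability and
    observed status (1 = still enrolled) at a single horizon."""
    errors = [(p - y) ** 2 for p, y in zip(predicted_survival, still_enrolled)]
    return sum(errors) / len(errors)
```

A discrete-time model fitted on such rows predicts a weekly hazard, which `survival_from_hazards` converts into the survival curve that a horizon-specific Brier score evaluates.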

2604.08213 2026-05-26 cs.CV cs.AI

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

Xiangyuan Wang, Honghao Cai, Yunhao Bai, Chao Hui, Tianze Zhou, Haohua Chen, Hao Shi, Yuling Wu, Yao Hu, Xu Tang, Yibo Chen, Wei Zhu

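AI summary: Proposes EditCaption, a two-stage post-training pipeline that improves image-editing instruction synthesis through human-refined SFT and Hardness-Adaptive Error-Aware DPO (HAE-DPO), substantially reducing the critical-error rate and surpassing existing models.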

Abstract

High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructing such training triplets at scale remains costly. Recent pipelines often rely on vision-language models to synthesize editing instructions automatically, but we find that strong VLMs still struggle to describe visual transformations between image pairs. In particular, they exhibit three recurring failure modes: orientation inconsistency, viewpoint ambiguity, and missing fine-grained attributes. In a human evaluation on 400 image pairs, several open-source VLM baselines produce critical-error rates above 47%, making many synthesized instructions unsuitable for downstream training. To address this, we propose EditCaption, a two-stage post-training pipeline for image editing instruction synthesis. First, we construct a 100K supervised fine-tuning dataset through GLM-based auto-captioning, EditScore filtering, and human refinement. Second, we collect 10K human-annotated preference pairs, where each rejected instruction is labeled with its primary error type and severity. Based on this dataset, we propose Hardness-Adaptive Error-Aware DPO (HAE-DPO), a task-adapted DPO objective that introduces an adaptive margin based on human-labeled severity, failure-mode type, and reference-model hardness. Experiments across three benchmarks demonstrate that our 235B model with SFT+HAE-DPO achieves state-of-the-art performance among open-source and closed models, scoring 4.720 on Eval-400, 4.672 on HQ-Edit, and 4.651 on ByteMorph-Bench -- surpassing Gemini-3-Pro on all three. Human evaluation confirms critical error rates drop from 47.75% to 17.50%, with correct rates improving from 41.75% to 70.25%, surpassing Gemini-3-Pro (66.00%).
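To make the idea of a DPO objective with a margin concrete, here is a rough sketch. The first function is the standard DPO loss with a margin subtracted from the logit. The second function is purely hypothetical: the abstract names the three inputs to the margin (severity, failure-mode type, reference-model hardness) but not how they are combined, so the formula, the `scale` parameter, and the hardness rule are assumptions for illustration only.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_margin_loss(policy_chosen, policy_rejected,
                    ref_chosen, ref_rejected, beta=0.1, margin=0.0):
    """DPO loss on sequence log-probabilities with a margin on the logit.

    With margin=0 this reduces to the standard DPO loss.
    """
    logit = beta * ((policy_chosen - ref_chosen)
                    - (policy_rejected - ref_rejected))
    # -log(sigmoid(logit - margin)) equals softplus(-(logit - margin))
    return softplus(-(logit - margin))


def illustrative_margin(severity, type_weight,
                        ref_chosen, ref_rejected, scale=0.5):
    """Hypothetical margin, NOT the paper's formula.

    A pair counts as hard when the reference model already scores the
    rejected instruction at least as high as the chosen one.
    """
    hardness = 1.0 if ref_rejected >= ref_chosen else 0.0
    return scale * severity * type_weight * (1.0 + hardness)
```

Under these assumptions, a severe error on a pair the reference model gets wrong yields a larger margin, so the policy must separate chosen from rejected more strongly before the loss becomes small.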

2604.07039 2026-05-26 cs.RO cs.AI

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

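AI summary: Proposes the AEROS architecture, which models a robot as a single persistent intelligent subject whose capabilities are extended through installable Embodied Capability Modules, enabling modular extensibility, composable capability execution, and consistent system-level safety.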

Comments Submitted to Engineering Applications of Artificial Intelligence (EAAI). 48 pages, 5 figures, 9 tables

Abstract

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approaches either couple skills within monolithic architectures or decompose functionality into loosely coordinated modules or multiple agents, often without a coherent model of identity and control authority. We argue that a robot should be modeled as a single persistent intelligent subject whose capabilities are extended through installable packages. We formalize this view as AEROS (Agent Execution Runtime Operating System), in which each robot corresponds to one persistent agent and capabilities are provided through Embodied Capability Modules (ECMs). Each ECM encapsulates executable skills, models, and tools, while execution constraints and safety guarantees are enforced by a policy-separated runtime. This separation enables modular extensibility, composable capability execution, and consistent system-level safety. We evaluate a reference implementation in PyBullet simulation with a Franka Panda 7-DOF manipulator across eight experiments covering re-planning, failure recovery, policy enforcement, baseline comparison, cross-task generality, ECM hot-swapping, ablation, and failure boundary analysis. Over 100 randomized trials per condition, AEROS achieves 100% task success across three tasks versus baselines (BehaviorTree.CPP-style and ProgPrompt-style at 92-93%, flat pipeline at 67-73%), the policy layer blocks all invalid actions with zero false acceptances, runtime benefits generalize across tasks without task-specific tuning, and ECMs load at runtime with 100% post-swap success.

2604.05550 2026-05-26 cs.CL cs.CE

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

Yu Li, Chenyang Shao, Xinyang Liu, Ruotong Zhao, Peijie Liu, Hongyuan Su, Zhibin Chen, Qinglong Yang, Anjie Xu, Yi Fang, Qingbin Zeng, Tianxing Li, Jingbo Xu, Fengli Xu, Yong Li, Tie-Yan Liu

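AI summary: Proposes AutoSOTA, a system with a multi-agent architecture that fully automates the path from paper reproduction to model optimization and discovered 105 new SOTA models that surpass the original methods.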

Characterizing Linear Alignment Across Language Models

Matt Gorbett, Suman Jana

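AI summary: Investigates whether linear alignment exists between independently trained large language models and explores its applications in text generation, embedding classification, out-of-distribution detection, and privacy-preserving cross-silo inference.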

Abstract

Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, this capability unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we investigate the extent to which representational convergence enables practical linear alignment between large language models. Specifically, we learn affine transformations between the final hidden states of independent models and empirically evaluate these mappings across text generation, embedding classification, and out-of-distribution detection. We find that performance is largely preserved across model pairs, and show for the first time that linear alignment sometimes enables text generation across independently trained models. We further highlight a potential application of linear alignment for privacy-preserving cross-silo inference. The framework learns an affine transformation over a shared public dataset and uses homomorphic encryption to protect client queries. By encrypting only the linear classification operation, the method achieves sub-second inference latency.

2603.16105 2026-05-26 cs.CL cs.AI

Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca

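AI summary: Proposes ZipCal, a model-agnostic data curation strategy based on Zipfian power laws that selects calibration data by maximizing lexical diversity; for pruning and quantization it performs on par with a state-of-the-art method that relies on model perplexity while being about 240 times faster.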

Comments Added statistical analysis, mechanistic analysis and a comparison with a generative baseline. 22 pages

Abstract

Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both intra- and inter-tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive at large-scale models and datasets, while ZipCal is on average about 240× faster due to its tractable linear complexity. We make the code and the experiments available at https://github.com/FrancescoMonaco/ZipCal.

2603.12983 2026-05-26 cs.CL cs.AI

Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation

Boxuan Lyu, Haiyue Song, Zhi Qu

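AI summary: Proposes a self-evolving iterative MBR distillation framework based on Minimum Bayes Risk decoding that uses an off-the-shelf large language model to generate pseudo-labels, surpassing supervised baselines on error span detection without human annotation.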

2603.09943 2026-05-26 cs.AI

PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs

Jinyue Li, Yuci Liang, Qiankun Li, Xinheng Lyu, Jiayu Qian, Huabao Chen, Kun Wang, Zhigang Zeng, Anil Anthony Bharath, Yang Liu

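AI summary: Proposes PathMem, a framework that uses a dynamic transition mechanism between long-term memory and working memory to integrate structured pathology knowledge with interpretable memory control, reaching SOTA on WSI report generation and open-ended diagnosis tasks.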

Abstract

Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy, grading criteria, and clinical evidence. In practice, diagnostic reasoning requires linking morphological evidence with formal diagnostic and grading criteria. Although multimodal large language models (MLLMs) demonstrate strong vision language reasoning capabilities, they lack explicit mechanisms for structured knowledge integration and interpretable memory control. As a result, existing models struggle to consistently incorporate pathology-specific diagnostic standards during reasoning. Inspired by the hierarchical memory process of human pathologists, we propose PathMem, a memory-centric multimodal framework for pathology MLLMs. PathMem organizes structured pathology knowledge as a long-term memory (LTM) and introduces a Memory Transformer that models the dynamic transition from LTM to working memory (WM) through multimodal memory activation and context-aware knowledge grounding, enabling context-aware memory refinement for downstream reasoning. PathMem achieves SOTA performance across benchmarks, improving WSI-Bench report generation (12.8% WSI-Precision, 10.1% WSI-Relevance) and open-ended diagnosis by 9.7% and 8.9% over prior WSI-based models.
