arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.07107 2026-06-08 cs.RO New submission

Coarse-to-Control: Action-Token Planning for Vision-Language-Action Models

Jinhao Wu, Shiduo Zhang, Yicheng Liu, Xiaopeng Yu, Sixian Li, Siyin Wang, Hang Zhao, Jing Huo, Yang Gao, Jingjing Gong, Xipeng Qiu, Yu-Gang Jiang

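Affiliations: Nanjing University; Shanghai Innovation Institute; Fudan University; Tsinghua University

AI summary: Proposes the Coarse-to-Control framework, which introduces native planning in the action-token space: the policy first predicts a sequence of coarse action tokens and then generates executable actions, improving performance on long-horizon tasks.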

Abstract

Most vision-language-action (VLA) models map observations directly to actions without explicit intermediate planning, which limits performance on long-horizon tasks where early mistakes compound. We propose Coarse-to-Control, a plan-execute VLA that introduces planning natively in the action-token space. The key idea is to let the policy first predict a compact sequence of coarse action tokens that summarize the intended future trajectory, and then generate executable action tokens conditioned on this plan. Because both planning and execution share a unified discrete action vocabulary, the plan stays close to the control manifold and provides directly actionable guidance rather than an abstract hint that must be translated back to motor commands. Experiments on LIBERO, SimplerEnv-WidowX, and real-world manipulation tasks show that action-token planning consistently improves over direct action generation, with the largest gains on long-horizon multi-stage tasks.

2606.07103 2026-06-08 cs.CL New submission

Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

Zhuo Liu, Haozheng Du, Xiangxiang Xu, Hangfeng He

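Affiliations: University of Rochester

AI summary: Proposes an evaluation method with controlled content overlap, using parallel Bible translations to construct the overlap parameter α. Low-overlap models are found to rely on content cues while high-overlap models are more robust, giving a diagnostic for separating style learning from content shortcuts.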

Comments
9 pages

Abstract

Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter $α$ as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content ($α=0$) to fully shared content ($α=1$). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as $α$ increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.
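
The overlap parameter can be illustrated with a small sketch. The abstract only says that α is a "normalized residual of mutual information"; the form used here, α = 1 − I(C; S) / H(S) with C the content identity and S the style label, is an assumption chosen because it reproduces the two stated endpoints. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def overlap_alpha(pairs):
    """Assumed form: alpha = 1 - I(C; S) / H(S).

    pairs: iterable of (content_id, style_label) tuples.
    Requires at least two style labels, otherwise H(S) is zero.
    """
    pairs = list(pairs)
    h_style = entropy(Counter(s for _, s in pairs).values())
    h_content = entropy(Counter(c for c, _ in pairs).values())
    h_joint = entropy(Counter(pairs).values())
    mutual_info = h_style + h_content - h_joint
    return 1.0 - mutual_info / h_style


# Toy data: each verse appears in one style only (no shared content) ...
disjoint = [("v1", "A"), ("v2", "A"), ("v3", "B"), ("v4", "B")]
# ... versus every verse appearing in both styles (fully shared content).
shared = [("v1", "A"), ("v2", "A"), ("v1", "B"), ("v2", "B")]

print(overlap_alpha(disjoint), overlap_alpha(shared))
```

Under this assumed form, content identity fully determines the style label in the disjoint case, so the mutual information equals H(S) and α is 0; in the shared case content carries no information about style and α is 1.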

2606.07098 2026-06-08 cs.CL cs.LG New submission

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini

Affiliations: Department of Computer Science, Aalborg University Copenhagen; MaLGa-DIBRIS, University of Genoa; INFN, Sezione di Genova; European Organization for Nuclear Research (CERN); Ceva, Inc.

AI summary: Proposes SigmaScale, which learns auxiliary scaling matrices to improve truncated-SVD compression of LLMs and lowers the effective rank of weight matrices, reaching competitive performance on Llama 3.1 8B Instruct and Qwen3-8B.

Abstract

We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.
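
The effective-rank measure mentioned above can be sketched as follows. The abstract does not define "effective-rank entropy"; this sketch assumes the common definition, in which the singular values are normalized to sum to one, their Shannon entropy is taken, and the effective rank is the exponential of that entropy. The singular values below are made up for illustration.

```python
import math


def effective_rank_entropy(singular_values):
    """Shannon entropy (nats) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return -sum(p * math.log(p) for p in probs)


def effective_rank(singular_values):
    """Exponential of the entropy: ranges from 1 up to the number of values."""
    return math.exp(effective_rank_entropy(singular_values))


flat = [1.0, 1.0, 1.0, 1.0]      # energy spread evenly: hard to truncate
peaked = [10.0, 0.1, 0.1, 0.1]   # energy concentrated: easy to truncate

print(effective_rank(flat), effective_rank(peaked))
```

A spectrum with lower entropy concentrates more energy in the leading singular values, which is why a drop in this quantity would be expected to go along with a lower truncation loss.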

2606.07093 2026-06-08 cs.LG New submission

The discovery of the effects of women employment participation on the fertility of developing countries: A panel data approach

Thi Kim Ngan Nguyen

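Affiliations: Tokyo International University

AI summary: Using a panel data approach, the paper splits 115 developing countries into four continental groups and finds that the effect of female labor force participation on fertility varies by region, with a significant negative relationship only in the Americas.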

Abstract

Fertility in developing countries has declined significantly over the last few decades; at the same time, the role of women in the workplace has improved. To gain better insight into the causal effect of women's labor market participation rate on the total fertility rate in the developing world, this paper divides a dataset of 115 developing countries over the period 1991-2018 into four continental groups (Africa, North/South America, Asia/Pacific, Europe) and then applies a data-driven panel data econometric procedure to mitigate omitted-variable bias. The results suggest that the fertility behavior of women in North and South America is influenced by their career choices; in other regions, other factors might matter more to women when they consider having children. In conclusion, policymakers can refer to this paper when formulating policies that provide more incentives in reproductive decision-making, and further research in the field needs to consider family policies and patrilocality in developing countries as important data.

2606.07089 2026-06-08 cs.RO New submission

Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning

Yinzhou Tang, Jingbo Xu, Yu Shang, Zihao Song, Chen Gao, Wei Wu, Yong Li

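Affiliations: Tsinghua University; Manifold AI

AI summary: Proposes AdaWAM, which uses a lightweight dynamic router to adaptively trigger textual or visual reasoning, improving inference efficiency and performance on long-horizon, complex tasks.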

Abstract

World Action Models (WAMs) offer a promising approach to embodied intelligence, yet existing methods rely heavily on video prediction as action priors and lack adaptive multimodal reasoning, limiting their effectiveness on long-horizon, complex tasks. We observe that WAMs require different multimodal reasoning modes under different execution contexts: textual reasoning is essential during task transitions to guide high-level action prediction, while visual reasoning is critical during fine-grained manipulation for precise control. Motivated by this observation, we propose AdaWAM, a world action model with adaptive multimodal reasoning abilities. AdaWAM integrates a lightweight dynamic router that autonomously triggers textual or visual reasoning as needed during task execution. Experiments on both simulated and real-world embodied tasks show that AdaWAM substantially improves inference efficiency while outperforming state-of-the-art embodied policies. Code and demos are available at: https://adawam.github.io/.

2606.07083 2026-06-08 cs.RO New submission

Predictive Style Matching: Natural and Robust Humanoid Locomotion

Simeon Nedelchev, Ekaterina Chaikovskaia, Egor Davydenko, Eduard Zaliaev, Roman Gorbachev

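Affiliations: Moscow Institute of Physics and Technology (MIPT); Innopolis University; Sber Robotics Center

AI summary: Proposes Predictive Style Matching (PSM), in which an offline predictor maps the robot's lower-body state to upper-body joint and gait targets, substantially reducing style error while keeping the robustness of task-only rewards.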

Abstract

Reinforcement learning has become the prevailing approach to humanoid locomotion control: policies transfer reliably from simulation to hardware and recover gracefully from disturbances. Motion quality, however, still lags behind: task-only rewards often converge to stiff, asymmetric gaits, while motion imitation methods improve appearance but become more sensitive to external disturbances because reference signals can oppose the transient poses needed to regain balance. We propose Predictive Style Matching, in which an offline predictor maps the robot's lower-body state history and velocity commands to interpretable upper-body joint and gait targets that shape the rewards during training. Because the targets are state-conditioned rather than time-indexed and the predictor is used only at training time, the deployed controller inherits the proprioceptive interface and inference cost of a task-only RL baseline. On the Unitree G1, in both simulation and hardware, PSM reduces upper-body style error by roughly an order of magnitude over task-only RL while preserving its fall-recovery rate, whereas the motion-imitation baseline attains the lowest style error but fails to recover from disturbances about five times as often.

2606.07080 2026-06-08 cs.SD cs.AI eess.AS New submission

dots.tts Technical Report

Shi Lian, Changtao Li, Bohan Li, Hankun Wang, Da Zheng, Junfeng Tian, Yufeng Ma, Colin Zhang, Kai Yu

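Affiliations: ByteDance

AI summary: Presents a 2B-parameter continuous autoregressive TTS foundation model that combines a multi-objective AudioVAE, full-history-conditioned flow matching, and reward-free self-corrective post-training; it achieves the best average performance on Seed-TTS-Eval and supports low-latency inference.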

Abstract

We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous latent space. Compared with existing continuous autoregressive models, our key innovations are threefold. First, we train an AudioVAE with multiple objectives to build a semantically structured and prediction-friendly continuous speech space. Second, we use full-history conditioning in the flow-matching head to preserve long-range consistency and reduce drift during generation. Third, we apply reward-free self-corrective post-training to the flow-matching head to further improve robustness and acoustic quality. After being trained on a large-scale multilingual corpus, dots.tts achieves the best average performance on Seed-TTS-Eval, with WERs of 0.94%/1.30%/6.60% and SIM scores of 81.0/77.1/79.5 on the zh/en/zh-hard test sets, respectively. Across other benchmarks, dots.tts also consistently demonstrates open-source state-of-the-art performance, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness. For efficient inference, we further apply CFG-aware MeanFlow distillation, enabling low-latency speech generation with first-packet latencies of 85/54 ms in output streaming and dual-streaming modes, respectively. To facilitate reproducible research and practical deployment, we release the training and inference code, together with the pretrained, post-trained, and MeanFlow-distilled checkpoints, under the Apache 2.0 license.

2606.07079 2026-06-08 cs.CV New submission

AsyncPatch Diffusion: spatially-flexible image generation

Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff

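Affiliations: Google DeepMind; University of Amsterdam; The Netherlands Cancer Institute

AI summary: Proposes the AsyncPatch Diffusion framework, which assigns different noise levels to different spatial regions to enable heterogeneous denoising trajectories, natively supporting inpainting and adaptive generation while maintaining generation quality.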

Comments
36 pages, 14 figures

Abstract

Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
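
One way to picture a noise-level sampler that controls both the average corruption and its spatial spread is sketched below. This is not the paper's sampler, whose form the abstract does not give: the uniform draws, the clipping to [0, 1], and the names `sample_patch_noise_levels` and `max_spread` are all assumptions made for illustration.

```python
import random


def sample_patch_noise_levels(num_patches, max_spread, rng):
    """Hypothetical sampler: draw a shared mean level, then a per-sample spread.

    A spread of zero reproduces conventional diffusion (one noise level for
    every patch); larger spreads give more heterogeneous configurations.
    """
    mean_level = rng.random()                # average corruption in [0, 1)
    spread = rng.uniform(0.0, max_spread)    # spatial variability for this sample
    levels = []
    for _ in range(num_patches):
        offset = spread * (2.0 * rng.random() - 1.0)
        # Clip so every patch keeps a valid noise level in [0, 1].
        levels.append(min(1.0, max(0.0, mean_level + offset)))
    return levels


rng = random.Random(0)
homogeneous = sample_patch_noise_levels(16, 0.0, rng)
mixed = sample_patch_noise_levels(16, 0.5, rng)
```

Because the spread is itself sampled, homogeneous configurations remain well represented in training, unlike fully independent per-patch sampling. Note that clipping shifts the realized mean slightly when the mean level is near 0 or 1.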

2606.07074 2026-06-08 cs.LG cs.AI New submission

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu

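Affiliations: Zhejiang University; Ant Group

AI summary: Proposes the SlimSearcher framework, which uses Pareto-efficient filtration and Adaptive Reward Gating to cut tool-call rounds by 17%-58% while maintaining or improving accuracy.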

Comments
17 pages, 8 figures

Abstract

Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that go far beyond what these tasks require and leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
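
The idea of cascading a correctness gate with cohort-relative efficiency can be sketched roughly as below. The abstract does not give the reward formula, so everything here is an assumption: the savings relative to the cohort mean, the `bonus_weight` parameter, and the function name `gated_rewards` are hypothetical.

```python
def gated_rewards(cohort, bonus_weight=0.2):
    """Hypothetical gated reward for one sampled cohort of rollouts.

    cohort: list of dicts with keys 'correct' (bool), 'tool_calls' (int)
    and 'tokens' (int). Incorrect rollouts get 0 regardless of how short
    they are; correct ones get 1 plus a bonus for using fewer tool calls
    and tokens than the cohort average.
    """
    mean_calls = sum(r["tool_calls"] for r in cohort) / len(cohort)
    mean_tokens = sum(r["tokens"] for r in cohort) / len(cohort)
    rewards = []
    for r in cohort:
        if not r["correct"]:
            rewards.append(0.0)  # correctness gate: no efficiency credit
            continue
        call_saving = max(0.0, 1.0 - r["tool_calls"] / mean_calls) if mean_calls else 0.0
        token_saving = max(0.0, 1.0 - r["tokens"] / mean_tokens) if mean_tokens else 0.0
        rewards.append(1.0 + bonus_weight * 0.5 * (call_saving + token_saving))
    return rewards


cohort = [
    {"correct": True, "tool_calls": 2, "tokens": 1000},   # correct and frugal
    {"correct": True, "tool_calls": 6, "tokens": 3000},   # correct but costly
    {"correct": False, "tool_calls": 1, "tokens": 500},   # short but wrong
]
rewards = gated_rewards(cohort)
```

Because the bonus is only reachable through the gate, a short but wrong rollout cannot outscore a long but correct one, which is the kind of brevity bias an absolute length penalty would invite.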

2606.07069 2026-06-08 cs.CL cs.CY New submission

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

Yerzhan Sapenov, Jaromir Savelka

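Affiliations: Independent Scholar; School of Computer Science, Carnegie Mellon University

AI summary: Introduces mmPISA-bench, a PISA-based multilingual reasoning benchmark of 25 multiple-choice questions with official translations into 43 languages. It evaluates LLMs across languages, reasoning effort levels, and translation types, finding that modern LLMs reason effectively in all evaluated languages and that machine translation does not hurt accuracy, although some languages are both more expensive and less accurate.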

Abstract

We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages, achieving accuracy comparable to human test-takers, with some performance variations across covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

2606.07068 2026-06-08 cs.LG New submission

Bias in Filter Feature Selection Evaluation: A Meta-Analysis of Datasets, Baselines, and Experimental Design Choices

Malick Ebiele, Malika Bendechache, Rob Brennan

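Affiliations: University College Dublin; University of Galway; ADAPT Centre

AI summary: An analysis of 28 high-profile filter feature selection studies finds that the numbers of datasets, baselines, and new methods explain 33% of the variance in performance, reveals potential bias in evaluation, and offers five evidence-based recommendations.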

Abstract

Background: Since 1990, many feature selection methods have been proposed across heterogeneous applications. To validate the usefulness of a new method, it needs to be compared against at least one baseline method from the existing literature on a feature selection task using at least one dataset. Recent developments in tabular Deep Learning (DL) and data valuation in Machine Learning (ML) suggest that the evaluation of new methods, algorithms, and models may be consciously or unconsciously biased. We hypothesise that a similar trend exists in feature selection (FS), particularly in filter feature selection (FFS). The aim of this study is therefore to examine FFS studies to identify factors that influence the evaluation and that might constitute entry points for bias, in order to recommend stronger principles for FFS evaluation. Methods: We analyse a sample of 28 high-profile FFS studies published between 1994 and 2025. The analysis provides reflections on how to examine FFS studies, highlights lessons learned throughout the process, and gives five evidence-based recommendations for future FFS evaluation. Results: Multivariate Linear Regression analysis achieved a score of $R^2=0.33$. This means that 33% of the variance in the performance of new methods against chosen baselines (win rate) is explained by the number of datasets (#Datasets), the number of baselines (#Baselines), and the number of new methods (#NewMethods). Discussion: $R^2=0.33$ is considered moderate explanatory power, which is promising given that this is the first such study. The moderate result is due to the fact that win rate is influenced by additional factors such as the maturity of the feature selection domain, the type of datasets and baselines, and the simplicity of the regression model used to explain the relationship.
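
For readers unfamiliar with the statistic, the sketch below shows how a coefficient of determination is computed from observed and fitted values, using the standard definition R² = 1 − SS_res / SS_tot. The win-rate numbers are invented for illustration and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: share of variance explained by the fit."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Made-up win rates for four hypothetical studies.
win_rate = [0.5, 0.75, 1.0, 0.75]
# A model that always predicts the mean explains none of the variance.
mean_only = [0.75, 0.75, 0.75, 0.75]

print(r_squared(win_rate, mean_only), r_squared(win_rate, win_rate))
```

An R² of 0.33 therefore sits between these two extremes: the three predictors do better than predicting the mean win rate, but leave about two thirds of the variance unexplained.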

2606.07058 2026-06-08 cs.LG cs.CV math.AT stat.ML New submission

Constructing VAE Latent Spaces with Prescribed Topology

Jilles S. van Hulst, Jakub M. Tomczak, W. P. M. H. Heemels, Duarte J. Antunes

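Affiliations: Control Systems Technology Section, Department of Mechanical Engineering, Eindhoven University of Technology; Nature Innovation Laboratory (NatInLab)

AI summary: To address the mismatch between the standard Gaussian prior and data manifolds with non-Euclidean topology, proposes a constructive mathematical framework that uses factorized distributions and the reparameterization trick to design topology-matched priors for manifolds with a product covering space (cylinders, tori, Möbius strips, and others), improving reconstruction quality and representation fidelity.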

Comments
16 pages, 7 figures

Abstract

Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When the data lies on a manifold with non-Euclidean topology, the standard Gaussian prior introduces a topological mismatch that degrades reconstruction quality and prevents faithful representation. We present a constructive mathematical framework that resolves this mismatch for all manifolds that admit a product covering space. These are manifolds expressible as products of elementary factors (circles, intervals, or lines) or as quotients of such products by a finite symmetry group. The class includes cylinders, tori, Möbius strips, Klein bottles, and real projective spaces. Factorized distributions over the elementary factors yield product topologies with closed-form, decoupled KL divergences, so that each latent factor can be shaped independently while keeping training tractable. We catalogue reparametrizable encoder-prior pairs for periodic, bounded, and unbounded supports, and provide coordinate transformations that allow standard neural networks to output non-Euclidean parameters with smooth gradients. For quotient manifolds, the decoder receives group-invariant features of the covering-space coordinates, so that identified points produce identical outputs. Anchor constraints fix the coordinate system relative to the data or create soft topological holes. Experiments on synthetic manifolds and real-image datasets (rotated and cyclically shifted MNIST) confirm that a topology-matched prior aligns KL regularization with the data manifold. The resulting topology-aware models outperform the Gaussian baseline at all practically relevant regularization strengths. The code is available at https://github.com/JvHulst/VAE-Topology.
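
Two of the ingredients described above can be illustrated with a small sketch: a coordinate transformation that turns unconstrained network outputs into a periodic coordinate, and group-invariant decoder features for a quotient manifold. The specific choices here (`atan2` for the angle, and the classic Möbius-strip embedding as the invariant features under the identification (θ, w) ~ (θ + π, −w)) are illustrative assumptions, not necessarily the constructions used in the paper.

```python
import math


def angle_from_outputs(a, b):
    """Map two unconstrained network outputs to an angle on the circle.

    Assumes (a, b) is not the origin, where the angle is undefined.
    """
    return math.atan2(b, a)


def mobius_invariant_features(theta, w):
    """Features unchanged under (theta, w) -> (theta + pi, -w).

    The cylinder coordinates (theta, w) cover the Mobius strip twice; feeding
    the decoder only these features makes identified points decode identically.
    """
    return (
        math.cos(2 * theta),
        math.sin(2 * theta),
        w * math.cos(theta),
        w * math.sin(theta),
    )


theta, w = 0.7, 0.3
point = mobius_invariant_features(theta, w)
identified = mobius_invariant_features(theta + math.pi, -w)
```

The two feature tuples agree up to floating-point error, so a decoder that sees only these features cannot distinguish the two cover points, which is the property the quotient construction needs.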

2606.07054 2026-06-08 cs.CL cs.AI cs.CR cs.LG New submission

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

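Affiliations: University of Massachusetts at Amherst; Adobe Research; Dolby Labs; University of Oregon; Cisco

AI summary: Proposes the TRACE framework, whose TIJ loop identifies high-signal regions, accumulates evidence across steps, and synthesizes a trajectory-level verdict. It reaches an F1 of 0.713 and a recall of 0.844 on ten SHADE-Arena task domains and is especially strong at long-range evidence linking.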

Abstract

Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.

2606.07053 2026-06-08 cs.CV cs.LG New submission

TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

Dian Gu, Zhengyi Yang

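Affiliations: Institute of Automation, Chinese Academy of Sciences

AI summary: Proposes TrioPose, a native triple-stream pose-aware DiT built on the SD3.5M architecture. Layer-wise activation and zero-initialized dual-residual injection preserve pre-trained stability, and a Learnable Relational Bias Mask plus Pose-Guided Spatial Loss Weighting yield state-of-the-art multi-person pose-guided generation, with an AP of 64.33 on Human-Art.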

Comments
15 pages (9 pages main body, 6 pages references and appendix), 3 figures, 5 tables

Abstract

Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of 64.33 on Human-Art, representing a 30% improvement over prior art, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.

2606.07044 2026-06-08 cs.LG New submission

Hierarchical Forecast Reconciliation for Urban Rail Transit Demand Prediction under Operational Disruptions

Dang Viet Anh Nguyen, Alma Fazlagic, Kristine Pryds Loft, Filipe Rodrigues

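Affiliations: Technical University of Denmark (DTU)

AI summary: To address inconsistency between station-level and OD flow forecasts in urban rail transit, proposes the first hierarchical reconciliation framework, in which a neural Fully Connected Reconciler (FCR) learns a non-linear mapping that guarantees structural coherence; OD forecasting error falls by up to 17.45% in disruption scenarios.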

Comments
33 pages, 6 figures, 16 tables

Abstract

Accurate and coherent passenger demand forecasting is essential for Urban Rail Transit (URT) operations. Passenger demand has a hierarchical structure in which origin-destination (OD) flows aggregate to station-level inflows and outflows through conservation constraints. In practice, station-level and OD-level forecasts are often generated independently, producing incoherent predictions that violate these constraints and introduce inconsistencies into operational decision-making. Such issues become more severe during disruptions, when forecasting reliability is most critical. This paper presents the first hierarchical forecast reconciliation framework for joint station-level and OD-level URT demand prediction. A neural Fully Connected Reconciler (FCR) learns a non-linear mapping from incoherent base forecasts to coherent hierarchical predictions while guaranteeing exact structural consistency by construction. The method is benchmarked against OLS, WLS, and Minimum Trace (MinT) variants using Rejsekort smart-card data from the Copenhagen S-train network under one-step, multi-step, and disruption forecasting scenarios. Results show that reconciliation consistently improves OD forecasting accuracy while ensuring hierarchical coherence. Under normal conditions, FCR performs competitively with MinT-based methods. An oracle analysis indicates that perfect station-level forecasts could reduce OD prediction error by up to 34 percent, highlighting the value of improved base forecasts. Under severe disruptions, FCR outperforms classical methods, reducing OD forecasting error by up to 17.45 percent in multi-step destination-side delay scenarios. These findings establish hierarchical reconciliation as an effective mechanism for improving forecast robustness, with the largest benefits occurring under the most challenging operating conditions.
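
The conservation constraints described above can be made concrete with a small sketch that aggregates an OD matrix into station totals and measures how far independently produced station forecasts are from coherence. This is plain bottom-up aggregation, the simplest way to obtain coherent totals, and not the FCR or MinT methods from the paper; the function names and the toy numbers are assumptions.

```python
def aggregate_od(od):
    """Sum an OD matrix into per-station totals.

    od[i][j] is the forecast flow from origin station i to destination j.
    Returns (origin_totals, destination_totals).
    """
    origin_totals = [sum(row) for row in od]
    destination_totals = [sum(col) for col in zip(*od)]
    return origin_totals, destination_totals


def incoherence(od, origin_forecast, destination_forecast):
    """Largest absolute gap between station forecasts and aggregated OD flows."""
    origin_totals, destination_totals = aggregate_od(od)
    gaps = [abs(a - b) for a, b in zip(origin_totals, origin_forecast)]
    gaps += [abs(a - b) for a, b in zip(destination_totals, destination_forecast)]
    return max(gaps)


# Toy three-station network (numbers invented for illustration).
od = [
    [0, 30, 10],
    [20, 0, 5],
    [15, 25, 0],
]
origin_totals, destination_totals = aggregate_od(od)
```

Station forecasts that equal these totals have zero incoherence by construction; an independently produced station forecast generally will not, and closing that gap is the job of a reconciliation method.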

2606.07036 2026-06-08 cs.CV cs.AI cs.CE cs.LG 新提交

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

STREAM: 用于数字组织病理学图像生成的随机黎曼流匹配与各向异性解码器

Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

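Affiliations: DEEPNOID Inc.

AI summary: Proposes STREAM, a framework that uses the patch-token features of a histopathology vision foundation model as the latent space and generates high-quality histopathology images via Riemannian flow matching, addressing conditioning collapse. An anisotropic decoder further improves generation quality.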

详情
Comments
27 pages, 7 figures
AI中文摘要

合成组织病理学图像生成解决了计算病理学中的关键挑战,包括患者隐私和对基础模型大规模训练数据日益增长的需求。潜在扩散模型主导了图像生成领域,最近的研究强调潜在空间的选择对生成图像的质量至关重要。现有的组织病理学最先进生成模型使用预训练的视觉基础模型(VFM)作为条件信号,我们观察到这会导致“条件崩溃”,即条件信号主导潜在空间,降低生成样本的质量和多样性。因此,我们转而使用预训练的组织病理学VFM作为潜在空间本身,利用其编码丰富语义信息的patch-token特征。我们经验性地表明,这些特征经过$\ell_2$归一化,位于单位超球面$\mathcal{S}^{d-1}$上,具有强烈的角度主导性和内在曲率,使其自然适用于黎曼公式。因此,我们提出了STREAM,这是第一个在病理学领域应用黎曼流匹配的框架。STREAM包括两个阶段:1)一种桥式随机扰动,在$\mathcal{S}^{d-1}$上建立每个token的可整流性,用于在潜在空间中训练扩散变换器(DiT);2)一种新颖的各向异性解码器,对速度场雅可比矩阵的低能量方向分配鲁棒性,同时保持其高能量方向的保真度。STREAM在乳腺癌和结直肠癌数据集上实现了最先进的重建和生成性能。代码将在接收后公开发布。

英文摘要

Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to "conditioning collapse," where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
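Since the abstract places the features on the unit hypersphere, a small sketch of geodesic interpolation may help show what a Riemannian path on $\mathcal{S}^{d-1}$ looks like. This is the generic great-circle (slerp) path and its time derivative, a common choice for flow matching on spheres. It is not the paper's bridge-type stochastic perturbation. The function names are hypothetical, and the inputs are assumed to be unit vectors that are not antipodal.

```python
import math

def _angle(x0, x1):
    """Angle between two unit vectors, with the dot product clamped for safety."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    return math.acos(dot)

def slerp(x0, x1, t):
    """Point at time t on the great-circle geodesic from x0 to x1."""
    theta = _angle(x0, x1)
    if theta < 1e-8:  # nearly identical points: the path is degenerate
        return list(x0)
    w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]

def geodesic_velocity(x0, x1, t):
    """Time derivative of slerp; it is tangent to the sphere at slerp(x0, x1, t)."""
    theta = _angle(x0, x1)
    if theta < 1e-8:
        return [0.0] * len(x0)
    w0 = -theta * math.cos((1.0 - t) * theta) / math.sin(theta)
    w1 = theta * math.cos(t * theta) / math.sin(theta)
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]

point = slerp([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
velocity = geodesic_velocity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
```

The interpolated point stays on the sphere and the velocity is orthogonal to it. A straight-line Euclidean interpolation would preserve neither property.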

2606.07033 2026-06-08 cs.AI cs.CV 新提交

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

层次化语义约束异构图用于音视频事件定位

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

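Affiliations: Harbin Institute of Technology; Peng Cheng Laboratory; Harbin Institute of Technology Suzhou Research Institute

AI summary: Proposes a hierarchical semantic-constrained heterogeneous graph framework that combines heterogeneous graph construction, bidirectional semantic constraints, and hierarchical regularization in hyperbolic space to address cross-scale consistency and hierarchical semantic consistency in open-vocabulary audio-visual event localization.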

详情
AI中文摘要

开放词汇音视频事件定位(OV-AVEL)联合建模音视频线索,以识别并时间定位事件,包括训练中未见过的类别。现有方法主要在欧几里得空间中学习联合音视频表示,但仍面临两个重大挑战。首先,未见类别缺乏监督信号,难以在多个时间尺度上保持音视频一致性。其次,片段级与视频级语义之间缺乏层次约束,导致模型无法在不同层级间建立语义一致性。为解决这些挑战,我们提出一种层次化语义约束异构图(HSCHG)用于音视频事件定位框架。我们首先在欧几里得空间中构建一个异构层次图,包含音频和视觉片段节点及其对应的视频级节点。我们使用多方向时间边来捕获每个模态内的完整时间信息。同时,我们采用双阈值过滤门控融合策略,仅在对齐置信度高时引入跨模态信息。此外,我们在片段级和视频级表示之间引入双向语义约束,以实现不同层级间的语义一致性。基于此,我们将多级音视频表示和文本原型统一映射到双曲空间中。我们使用层次蕴含正则化损失来表征视频与片段之间的层次关系。大量实验结果表明,我们的方法在OV-AVEL基准上优于现有方法。消融研究进一步验证了我们方法的有效性。

英文摘要

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. In addition, we employ a gated fusion strategy with dual-threshold filtering, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Building on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
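The step of mapping Euclidean representations into hyperbolic space can be sketched with the exponential map at the origin of a Poincaré ball. The abstract does not state which hyperbolic model or curvature the authors use, so the ball model, the curvature parameter `c`, and the function name below are assumptions chosen for illustration.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of a Poincare ball with curvature -c.

    Maps a Euclidean (tangent) vector v to a point inside the ball of
    radius 1 / sqrt(c).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    sqrt_c = math.sqrt(c)
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]

embedding = [3.0, 4.0]             # hypothetical Euclidean feature, norm 5
hyperbolic_point = expmap0(embedding)
hyperbolic_norm = math.sqrt(sum(x * x for x in hyperbolic_point))
```

The direction of the vector is preserved while its norm is squashed to `tanh(sqrt(c) * norm) / sqrt(c)`, so every finite input lands strictly inside the ball in exact arithmetic.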

2606.07030 2026-06-08 cs.SD cs.AI cs.CL cs.LG 新提交

Phonetic Error Analysis of Raw Waveform Acoustic Models

原始波形声学模型的音素错误分析

Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

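Affiliations: Centre for Speech Technology Research (CSTR), University of Edinburgh, UK; Cisco, UK; SLAI & CUHK-SZ, China; King's College London, UK

AI summary: Decomposes the phone error rate and analyses confusion matrices, finding that BLSTM layers benefit transition-dependent classes most, that WSJ transfer learning improves consonants roughly three times more than vowels, and that confusion patterns reflect inherent phonetic similarities.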

详情
Comments
INTERSPEECH2026
AI中文摘要

我们分析了原始波形声学模型在TIMIT音素识别中的错误模式,超越了整体音素错误率(PER)。将PER按三个广义语音类别(BPC)分解,并从替换错误构建混淆矩阵。我们的模型将参数化(SincNet, Sinc2Net)或非参数化CNN与双向LSTM相结合,在开发/测试集上分别达到13.9%/15.3%的PER,这是原始波形模型在TIMIT上的最佳报告结果。来自WSJ的迁移学习将PER降至11.3%/12.3%,超越了Filterbank基线。每个BPC的分析表明,BLSTM层对过渡依赖类提升最大,而WSJ迁移学习对辅音的改进约是元音的三倍。原始波形和Filterbank系统的混淆模式一致,表明主要混淆反映了固有的音素相似性。

英文摘要

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.
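The two analyses described here, confusion counts built from substitution errors and an error breakdown by broad phonetic class, can be sketched as follows. The phone-to-class map is a heavily abbreviated, hypothetical stand-in, and the input is assumed to be already-aligned (reference, hypothesis) phone pairs. Insertions and deletions, which also contribute to PER, are left out of this sketch.

```python
from collections import Counter

# Hypothetical, abbreviated phone-to-class map (an assumption for illustration).
BPC = {"aa": "vowel", "iy": "vowel",
       "s": "fricative", "z": "fricative",
       "p": "stop", "b": "stop"}

def substitution_confusions(aligned_pairs):
    """Count (reference, hypothesis) pairs where the phones differ."""
    return Counter((ref, hyp) for ref, hyp in aligned_pairs if ref != hyp)

def substitution_rate_by_class(aligned_pairs):
    """Fraction of reference phones in each class that were substituted."""
    totals, errors = Counter(), Counter()
    for ref, hyp in aligned_pairs:
        cls = BPC[ref]
        totals[cls] += 1
        if ref != hyp:
            errors[cls] += 1
    return {cls: errors[cls] / totals[cls] for cls in totals}

pairs = [("s", "z"), ("s", "s"), ("z", "s"),
         ("p", "b"), ("p", "p"), ("b", "b"),
         ("aa", "aa"), ("iy", "iy")]
confusions = substitution_confusions(pairs)
rates = substitution_rate_by_class(pairs)
```

In this toy input the only confusions are voicing pairs within the same class (s/z and p/b), which is the kind of pattern the abstract attributes to inherent phonetic similarity.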

2606.07024 2026-06-08 cs.CV 新提交

GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding

GuideCAD: 基于前缀嵌入的轻量级多模态3D CAD模型生成框架

Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim

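Affiliations: Convergence Research Center for Insect Vectors; Incheon National University; Center for Brain-Machine Interface

AI summary: Proposes GuideCAD, which uses a mapping network with few trainable parameters to convert image embeddings into prefix embeddings and combines a pretrained large language model with a transformer decoder to generate 3D CAD models, using approximately four times fewer parameters and achieving twice the training efficiency of fine-tuning approaches.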

详情
AI中文摘要

用于3D CAD生成的多模态方法需要大量计算资源,因此需要高效训练。为此,我们提出GuideCAD,利用语义丰富的视觉-文本表示,仅用少量可训练参数即可生成3D CAD模型。具体而言,GuideCAD使用映射网络将图像嵌入转换为前缀嵌入,使预训练的大语言模型能够整合视觉和文本信息。随后,基于Transformer的解码器利用视觉-文本嵌入预测构建序列,从而生成3D CAD模型。为了实验评估,我们构建了一个新数据集,称为GuideCAD,包含文本-图像对。每对包括一个表示3D CAD构建序列的文本提示及其对应的3D CAD图像。实验结果表明,与微调方法相比,GuideCAD在生成质量相当的情况下,参数减少约四倍,训练效率提升两倍。我们已在以下网址发布方法的源代码和数据集:this https URL

英文摘要

Multi-modal approaches used for 3D CAD generation require substantial computational resources, necessitating efficient training. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations and trains only a small number of parameters to generate 3D CAD models. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. A transformer-based decoder then predicts the construction sequence from the visual-textual embeddings to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, also referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates comparably high-quality 3D CAD models while using approximately four times fewer parameters and achieving twice the training efficiency compared to fine-tuning approaches. We have released the source code and dataset for our method at: https://github.com/mskimS2/GuideCAD

2606.07020 2026-06-08 cs.CL 新提交

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

MADE:超越评分——通过多语言智能诊断引擎实现细粒度评估洞察

Yilun Liu, Miao Zhang, Shimin Tao, Minggui He, Chunguang Zhao, Chenxin Liu, Li Zhang, Chen Liu, Cheng Qian, Liqun Deng, Xiaojun Meng, Daimeng Wei

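Affiliations: Huawei

AI summary: Proposes MADE, a multilingual agentic diagnosing engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance inspection, multilingual and cultural reflection, and report synthesis. In a large-scale setting (33 model families, 11 benchmarks, 26 languages), it improves diagnosis report quality by 47% over the strongest shared baseline and is preferred by human experts in 87.9% of pairwise comparisons.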

详情
AI中文摘要

多语言和多文化基准现在覆盖数十种语言和模型族,但由此产生的得分景观仍然指标丰富而洞察贫乏,需要进行细粒度的多语言评估后诊断。然而,单个LLM和开放式智能体很容易被冗长、嘈杂的诊断输入所淹没,并且没有可重用的分类法。为了解决这个问题,我们提出了MADE,一个多语言智能诊断引擎,它将评估后分析分解为规划、聚合分析、实例级案例检查、多语言和文化反思以及基于事实的报告合成。MADE与一个专家主导的54个查询和15种语言的诊断集配对,在大规模多语言评估基础(33个模型族、11个基准、26种语言、34种文化、866万条评估记录)上进行评估。实验表明,MADE在诊断报告质量上比最强的共享基线高出47%,并且在87.9%的成对比较中被多语言人类专家偏好。与多语言专家一起应用,MADE进一步揭示了关于部署、迭代和跨文化陷阱的四个可操作发现,将基准得分表转化为模型选择和修复指南。

英文摘要

Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis. However, single LLMs and open-ended agents are easily swamped by the long, noisy diagnostic input, and no reusable taxonomy exists for it. To address this, we propose MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query and 15-language diagnostic set, evaluated on top of a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE further surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.

2606.07017 2026-06-08 cs.AI cs.CL cs.ET 新提交

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

基础模型智能体的仿真到现实差距:统一MDP视角

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

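Affiliations: Arizona State University

AI summary: Formalizes the evaluation and training gap of foundation model agents as a classical sim-to-real problem, builds a unified framework around the four MDP elements (observation, action, transition, reward), and advocates adopting established solutions such as domain randomization.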

详情
Comments
7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track
AI中文摘要

基础模型智能体越来越多地被部署用于现实世界决策,但受到仿真到现实差距的影响。虽然机器人学和经典控制有成熟的框架来解决这一差距,但基础模型社区将智能体鲁棒性视为一个全新的现象。我们的论文提出将基础模型智能体评估和训练差距形式化为一个经典的仿真到现实问题,完全围绕马尔可夫决策过程的四个要素构建,包括观测、动作、转移和奖励。在本文中,我们设定了一个全面的研究议程,将经典差异转化为基础模型领域,并倡导采用域随机化等成熟解决方案。我们提供了具体示例,例如多语言工具调用,以展示尽管语义意图正确,但观测空间差距如何导致操作无效的动作。最终,这一议程旨在推动范式转变,产生统一的词汇和标准化的压力测试基准,以培养新一代高度可信的智能体,用于可靠的现实世界应用。

英文摘要

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. We set out a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates adopting established solutions such as domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation-space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress-test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

2606.07015 2026-06-08 cs.SD cs.AI 新提交

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

面向统一歌曲生成与带伴奏共生成的歌声转换

Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

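Affiliations: Northwestern Polytechnical University; Kuaishou Technology; Beijing Institute of Technology; Institute of Automation, Chinese Academy of Sciences; University of Science and Technology of China; Shanghai Jiao Tong University

AI summary: Proposes UniSinger, a framework built on a multimodal diffusion transformer that unifies zero-shot song generation and singing voice conversion with accompaniment co-generation, using a shared speaker embedding space and a curriculum learning strategy to achieve cross-task timbre control and multi-task optimization.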

详情
AI中文摘要

尽管歌曲生成和歌声转换(SVC)已显著发展,但长期以来它们被孤立开发:前者缺乏零样本说话人克隆,而后者忽略了人声-伴奏协同。为弥合这一差距,我们提出UniSinger,这是首个统一说话人克隆歌曲生成与伴奏共生成SVC的端到端框架。基于多模态扩散Transformer,我们构建了一个统一的说话人嵌入空间,将说话人表示从SVC迁移到歌曲生成,从而实现细粒度的跨任务音色控制。为缓解多任务优化冲突,我们设计了一种课程学习策略,使用任务特定的模态掩码来引导模型逐步掌握语义内容、人声音色和伴奏之间的生成机制。实验表明,在两个任务上均达到最先进性能,并实现了互补优势,为智能音乐制作提供了新可能性。

英文摘要

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker-cloning song generation and SVC with accompaniment co-generation. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space that transfers speaker representations from SVC to song generation, enabling fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy that uses task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show that UniSinger achieves state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.

2606.07013 2026-06-08 cs.RO cs.HC 新提交

A Multi-Operator Mixed-Reality Interface for Multi-Robot Control and Coordination: Co-Located and Private Workspace Collaboration

面向多机器人控制与协调的多操作员混合现实界面:共位与私有工作空间协作

Omotoye Shamsudeen Adekoya, Antonio Sgorbissa, Carmine Tommaso Recchiuto

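Affiliations: DIBRIS Department, RICE Laboratory, University of Genoa

AI summary: Presents a mixed-reality interface extended to multi-operator collaboration that supports a co-located shared workspace mode and a private workspace mode, and prevents command conflicts through registration-driven scene construction, lightweight session synchronization, and per-robot control leases. Experiments show comparable task performance across the two modes, but the co-located mode significantly improves perceived collaboration and is preferred by operators.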

详情
Comments
Submitted to RO-MAN 2026
AI中文摘要

多操作员控制机器人团队不仅需要访问相同的任务信息,还需要维护共享态势感知并防止冲突干预的机制。基于我们之前的HORUS界面(统一系统的整体操作现实),我们提出了一种混合现实界面,将单操作员多机器人监督扩展到协作式多操作员使用。该系统支持两种互补模式:共位共享工作空间,操作员在同一物理位置观察和操作同一张迷你地图;以及私有工作空间模式,操作员通过独立放置的本地工作空间执行相同任务。该架构结合了注册驱动的场景构建、轻量级共享会话同步以及每机器人控制租约,以支持协作监控、任务分配和远程操作,同时防止冲突命令。我们在一项人类受试者研究中评估了该方法,共有36名参与者(18对)在两个搜索环境中控制三台Nova Carter移动机器人。两种模式下的客观任务性能相当,表明两种模式都支持有效的任务执行。然而,共位共享工作空间显著改善了感知协作、共享理解和交接清晰度,并且是首选的协作模式。这些结果表明,即使底层机器人控制工具保持不变,物理上共置MR工作空间也能改善操作员的协调方式。

英文摘要

Multi-operator control of robot teams requires not only access to the same mission information, but also mechanisms for maintaining shared awareness and preventing conflicting interventions. Building on our previous HORUS interface (Holistic Operational Reality for Unified Systems), we present a mixed-reality interface that extends single-operator multi-robot supervision to collaborative multi-operator use. The system supports two complementary modes: a co-located shared workspace, in which operators observe and manipulate the same mini-map in the same physical location, and a private-workspace mode, in which operators work on the same mission through independently placed local workspaces. The architecture combines registration-driven scene construction, lightweight shared-session synchronization, and per-robot control leases to support collaborative monitoring, tasking, and teleoperation while preventing conflicting commands. We evaluated the approach in a human-subject study with 36 participants (18 pairs) controlling three Nova Carter mobile robots in two search environments. Objective task performance was comparable across the two modes, indicating that both modes supported effective mission execution. However, the co-located shared workspace significantly improved perceived collaboration, shared understanding, and handoff clarity, and was the preferred collaborative mode. These results indicate that physically co-locating the MR workspace improves how operators coordinate even when the underlying robot-control tools remain unchanged.
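The idea of a per-robot control lease can be sketched as a time-limited, exclusive grant: only the current lease holder may command a given robot, and the lease lapses unless renewed. The class below is a minimal illustration, not the HORUS implementation. The class name, method names, and five-second duration are assumptions, and the clock is passed in explicitly to keep the example deterministic.

```python
class LeaseManager:
    """Grants at most one operator a time-limited right to command each robot."""

    def __init__(self, duration_s=5.0):
        self.duration_s = duration_s
        self._leases = {}  # robot_id -> (operator_id, expiry time in seconds)

    def _active_holder(self, robot_id, now):
        """Return the operator holding an unexpired lease, or None."""
        lease = self._leases.get(robot_id)
        if lease is None or lease[1] <= now:
            return None
        return lease[0]

    def acquire(self, robot_id, operator_id, now):
        """Take or renew the lease; fails if another operator holds it."""
        holder = self._active_holder(robot_id, now)
        if holder is not None and holder != operator_id:
            return False
        self._leases[robot_id] = (operator_id, now + self.duration_s)
        return True

    def may_command(self, robot_id, operator_id, now):
        """Commands are accepted only from the current lease holder."""
        return self._active_holder(robot_id, now) == operator_id

    def release(self, robot_id, operator_id, now):
        """Explicit handoff: the holder gives up the lease early."""
        if self._active_holder(robot_id, now) == operator_id:
            del self._leases[robot_id]

manager = LeaseManager(duration_s=5.0)
alice_ok = manager.acquire("carter_1", "alice", now=0.0)
bob_blocked = not manager.acquire("carter_1", "bob", now=1.0)
```

Expiry means a disconnected operator cannot block a robot indefinitely, while explicit release supports a deliberate handoff between operators.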

2606.07012 2026-06-08 cs.RO 新提交

Task Editing for Generalizable 3D Visuomotor Policy Learning

面向可泛化3D视觉运动策略学习的任务编辑

Jian-Jian Jiang, YiHan Yang, Lan Wei, Yuming Luo, Xiao-Ming Wu, Xuhang Chen, Bin Fan, Dandan Zhang, Wei-Shi Zheng

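Affiliations: Sun Yat-sen University; Imperial College London; Nanyang Technological University; South China University of Technology

AI summary: Proposes Task-Edit, a framework that decomposes tasks into scene, skill, and object components and flexibly recombines them to generate diverse trajectories, improving the generalization of 3D visuomotor policies on long-horizon manipulation tasks.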

详情
Comments
8 pages, 4 figures
AI中文摘要

3D视觉运动策略为复杂机器人操作提供了有前景的方向,因为深度图和点云为空间推理提供了丰富的几何信息。然而,它们的成功通常依赖于大规模的真实世界演示,这些演示的收集成本高昂且耗时。为此,现有方法通常使用演示生成策略,通过对人类收集的演示应用以对象为中心的变换(如改变对象姿态或尺度)来提高数据效率。虽然这些变换在局部变化上有效,但它们很大程度上保留了原始场景结构和技能序列,限制了合成复杂任务中多样化的场景-技能-对象组合的能力。在本文中,我们提出Task-Edit,一种新颖的演示生成框架,从任务中心编辑的角度生成多样化轨迹。Task-Edit的关键见解是将任务分解为场景、技能和对象组件,并灵活地重新组合它们。通过这种方式,Task-Edit实现了可扩展的演示生成,并显著提高了长程操作任务的泛化能力。我们通过大量真实世界实验评估了Task-Edit,并展示了三个优势:(1)有效性:Task-Edit在各种真实世界任务和机器人形态上显著提升了3D视觉运动策略。(2)泛化性:Task-Edit提高了模型在不同场景设置下的泛化能力。(3)适用性:Task-Edit使模型能够处理真实世界中难以收集的场景,包括抗干扰、避障和未见过的杂乱场景。

英文摘要

3D visuomotor policies offer a promising direction for complex robotic manipulation, as depth maps and point clouds provide rich geometric information for spatial reasoning. However, their success often depends on large-scale real-world demonstrations, which are costly and time-consuming to collect. To this end, existing methods commonly use demonstration generation strategies to improve data efficiency by applying object-centric transformations to human-collected demonstrations, such as varying object poses or scales. While effective for local variation, these transformations largely preserve the original scene structure and skill sequence, limiting their ability to synthesize diverse scene-skill-object combinations for complex tasks. In this paper, we propose Task-Edit, a novel demonstration generation framework that generates diverse trajectories from a task-centric editing perspective. The key insight of Task-Edit is to decompose a task into scene, skill and object components, and flexibly recombine them. In this way, Task-Edit enables scalable demonstration generation and significantly improves generalization for long-horizon manipulation tasks. We evaluate Task-Edit through extensive real-world experiments and demonstrate three advantages: (1) Effectiveness: Task-Edit significantly improves 3D visuomotor policies across various real-world tasks and robot embodiments. (2) Generalizability: Task-Edit improves model generalization across different scenario setups. (3) Applicability: Task-Edit enables models to handle scenarios that are difficult to collect in the real world, including disturbance resistance, obstacle avoidance and unseen cluttered scenes.

2606.07006 2026-06-08 cs.LG cs.CL 新提交

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

RASFT: 用于推理的滚动自适应监督微调

Yongliang Miao, Fengyuan Liu, Wei Shi, Yanguang Liu, Fei Sun, Na Zou, Mengnan Du

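Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Artificial Intelligence Laboratory; New Jersey Institute of Technology; Institute of Computing Technology, CAS

AI summary: Proposes RASFT, which calibrates expert supervision by problem-level solvability estimated from on-policy rollouts. It strengthens guidance where the model struggles, relaxes imitation and incorporates self-generated trajectories where the model is reliable, and uses a clipped inverse ratio to constrain policy drift. It outperforms SFT and RL methods on multiple reasoning benchmarks.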

详情
AI中文摘要

监督微调(SFT)是一种通过模仿离线专家演示来使大型语言模型适应推理任务的流行方法,通常将单个专家轨迹视为目标行为。然而,推理并非简单的路径模仿:严格遵循一个演示解决方案可能会过度拟合表面形式并抑制模型自身的推理分布。我们提出了滚动自适应监督微调(RASFT),这是一种策略感知的SFT框架,它根据从验证的策略rollout中估计的问题级可解性来校准专家监督。对于每个问题,当当前策略困难时,RASFT加强专家指导,而当模型已经表现出可靠的推理行为时,放松严格模仿并纳入正确的自生成轨迹。为了保留有用的推理先验,RASFT进一步引入了冻结参考模型与当前策略之间的裁剪逆比,以约束过度的策略漂移。在六个数学推理基准和两个代码推理基准上的多个模型实验表明,RASFT在整体性能上优于SFT、SFT变体和代表性RL方法。代码可在该https URL获取。

英文摘要

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.

2606.07000 2026-06-08 cs.AI 新提交

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

教方法而非答案:用于多模态策略优化的特权辅导蒸馏

Shizhe Xiang, Ke An, Wenlong Yu, Yue Liu, Jian Luan, Pei Fu, Qilong Wang

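Affiliations: Tianjin University; Beijing Institute of Technology; Singapore Management University; University of Chinese Academy of Sciences; Xiaomi Inc

AI summary: Proposes PTD-PO, which constructs privileged hints to provide dense token-level supervision without exposing the answer and stabilizes distillation with a Top-K Jensen-Shannon divergence, improving multimodal reasoning performance.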

详情
AI中文摘要

最近的后训练方法,特别是具有可验证奖励的强化学习(RLVR),显著增强了大型视觉语言模型(LVLMs)的推理能力。然而,可验证奖励的稀疏性为失败的rollout提供了很少的令牌级监督,常常导致复杂多模态推理任务中的低效探索。尽管策略蒸馏可以提供密集的指导,但基于外部教师的方法引入了大量计算开销,而基于答案条件微调的方法可能暴露答案级信息并诱导类似捷径的生成行为。为解决这些限制,我们提出了PTD-PO,一种用于RLVR的特权辅导蒸馏策略优化框架,在不向学生策略暴露答案的情况下提供密集指导。具体来说,PTD-PO从空间注意力引导和中间文本推理步骤中构建结构化的特权提示,并通过上下文学习将其用于生成逐步的令牌分布监督。学生仍在原始无答案上下文中优化,其失败的rollout在令牌分布级别与提示增强的参考模型对齐。为进一步稳定引导和无引导上下文之间分布偏移下的蒸馏,我们引入了Top-K Jensen-Shannon散度目标,专注于对齐信息性令牌概率,同时减少内存开销。在2B到8B参数的LVLMs上的实验表明,PTD-PO持续优于RLVR和蒸馏基线,缓解了熵崩溃,并提高了复杂多模态推理性能。

英文摘要

Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning ability of Large Vision-Language Models (LVLMs). However, the sparse nature of verifiable rewards provides little token-level supervision for failed rollouts, often leading to inefficient exploration in complex multimodal reasoning tasks. Although policy distillation can offer dense guidance, external teacher based methods introduce substantial computational overhead, while answer conditioned tuning methods may expose answer-level information and induce shortcut-like generation behavior. To address these limitations, we propose PTD-PO, a Privileged Tutoring Distillation Policy Optimization framework for RLVR that provides dense guidance without exposing the answer to the student policy. Specifically, PTD-PO constructs structured privileged hints from spatial attention guidance and intermediate textual reasoning steps, and uses them through in-context learning to produce step-wise token-distribution supervision. The student is still optimized under the original answer-free context, and its failed rollouts are aligned with the hint-augmented reference model at the token-distribution level. To further stabilize distillation under the distribution shift between guided and unguided contexts, we introduce a Top-K Jensen-Shannon divergence objective that focuses alignment on informative token probabilities while reducing memory overhead. Experiments on LVLMs ranging from 2B to 8B parameters show that PTD-PO consistently outperforms RLVR and distillation baselines, mitigates entropy collapse, and improves complex multimodal reasoning performance.
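One plausible reading of a Top-K Jensen-Shannon objective is sketched below: keep only the K tokens to which the hint-augmented reference assigns the most probability, renormalize both distributions on that support, and compute the JS divergence there. The abstract does not give the exact formulation, so the choice of support, the renormalization, and the function names are assumptions. Inputs are assumed to be strictly positive probabilities, as produced by a softmax.

```python
import math

def _renormalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def _kl(a, b):
    """KL(a || b) for distributions on the same support; 0 * log(0) is treated as 0."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0.0)

def top_k_js(student_probs, reference_probs, k):
    """JS divergence restricted to the reference model's K most likely tokens."""
    top = sorted(range(len(reference_probs)),
                 key=lambda i: reference_probs[i], reverse=True)[:k]
    p = _renormalize([reference_probs[i] for i in top])
    q = _renormalize([student_probs[i] for i in top])
    m = [(a + b) / 2.0 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

reference = [0.70, 0.20, 0.05, 0.05]  # hypothetical hint-augmented distribution
student = [0.40, 0.40, 0.10, 0.10]    # hypothetical student distribution
divergence = top_k_js(student, reference, k=2)
```

Restricting the sum to K entries is what would reduce memory in practice, since only K probabilities per position need to be kept instead of the full vocabulary. The JS divergence is also bounded by ln 2, unlike KL, which is one reason it can be more stable under distribution shift.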

2606.06994 2026-06-08 cs.CL cs.DB 新提交

Principles of Concept Representation in Sentence Encoders

句子编码器中概念表示的原则

Isabelle Mohr, John Dujany, Jonathan Souquet, Andre Freitas

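Affiliations: Idiap Research Institute; Merck KGaA

AI summary: Studies, through the lens of representational compositionality, the conditions under which sentence encoders produce good concept representations, and identifies four principles: fine-tuning recalibrates rather than expands the latent geometry (P1); semantic signal concentrates in the final transformer layer (P2); hard negatives improve discrimination but not retrieval ranking (P3); and the effectiveness of supervision depends on the composition type of the target concept (P4).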

详情
AI中文摘要

是什么让句子编码器产生良好的概念表示?我们通过表征组合性的视角来探讨这个问题:只有当编码器的潜在空间允许相应语义算子的低失真实现时,它才支持一个概念族。这一框架预测了当前编码器成功之处以及它们在结构上与监督不匹配的地方。通过在WordNet和Wiktionary的330万同义词和定义对上训练的编码器条件进行受控消融实验,在三个去污染分割和一个修饰语标记的名词短语基准上进行评估,我们确定了四个原则。微调重新校准潜在几何而非扩展它(P1)。语义信号在概念特定训练开始前集中在最后的Transformer层,使得跨层池化变得多余(P2)。硬负例改善了区分性和压力测试鲁棒性,但不提升检索排序,表明校准和排序是可独立处理的(P3)。最后,监督的有效性取决于目标概念的组合类型。外延训练有助于交性和子性概念族,但损害关系性和内涵性概念族,暴露了当前训练范式的结构性限制(P4)。我们发布了两个新的评估数据集:一个DBpedia语义差距基准和一个修饰语标记的名词短语释义套件。

英文摘要

What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.

2606.06990 2026-06-08 cs.LG 新提交

Accelerating Reproducible Research in Synthetic EHR Generation

加速可复现的合成电子健康记录生成研究

Jalen Jiang, Chufan Gao, Ethan Rasmussen, Stephen Z. Xie, Jimeng Sun

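Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes a lightweight end-to-end benchmarking framework that unifies data loading, standardized training, and architecture-agnostic evaluation. It reimplements several baseline models, adds a GPT-2 baseline, and pairs a privacy-utility evaluation suite with bootstrapped confidence intervals and an analysis of long-tailed performance to promote community-driven reproducibility.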

详情
AI中文摘要

生成高保真合成电子健康记录(EHR)对于在保护患者隐私的同时推进医学研究至关重要。然而,现有生成模型之间的直接比较因代码库分散、数据加载器不兼容、库依赖冲突以及评估协议不一致而受到阻碍。为解决这些问题,我们引入了一个轻量级、端到端的可复现合成EHR评估基准框架,组织为统一流水线,涵盖数据摄取、标准化模型训练和架构无关评估。我们当前的实现针对纵向ICD诊断代码的生成——这是该文献中最常研究的模态——并基于社区维护的PyHealth库构建。我们在完整ICD-9词汇粒度下重新实现并统一了强基线模型(MedGAN、CorGAN、PromptEHR、HALO),并从通用序列建模文献中添加了一个轻量级GPT-2基线。我们贡献了一个严格的、架构无关的隐私-效用评估套件,该套件同样适用于GAN和基于Transformer的生成器,并报告了所有指标的自助置信区间。我们进一步分析了现有模型在长尾分布上的不佳表现,并讨论了框架在诊断代码之外的可扩展性。通过降低在单一流水线下运行、扩展和评估的工程障碍,我们为社区驱动的可复现性和合成EHR模型基准测试提供了一个起点。

英文摘要

The generation of high-fidelity synthetic Electronic Health Records (EHR) is crucial for advancing medical research while preserving patient privacy. However, head-to-head comparison of existing generative models is hindered by disjointed codebases, incompatible data loaders, conflicting library dependencies, and inconsistent evaluation protocols. To address these gaps, we introduce a lightweight, end-to-end benchmarking framework for reproducible synthetic EHR evaluation, organized as a unified pipeline spanning data ingestion, standardized model training, and architecture-agnostic evaluation. Our current implementation targets the generation of longitudinal ICD diagnosis codes -- the most commonly studied modality in this literature -- and is built on the community-maintained PyHealth library. We reimplement and unify strong baselines (MedGAN, CorGAN, PromptEHR, HALO) under full ICD-9 vocabulary granularity, and add a lightweight GPT-2 baseline from the general-purpose sequence-modeling literature. We contribute a rigorous, architecture-agnostic privacy-utility evaluation suite that applies identically to GAN- and transformer-based generators, and report bootstrapped confidence intervals across all metrics. We further analyze the poor long-tailed performance of existing models and discuss the extensibility of our framework beyond diagnosis codes. By lowering the engineering barrier to running, extending, and evaluating models under a single pipeline, we provide a starting point for community-driven reproducibility and benchmarking of synthetic EHR models.
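The bootstrapped confidence intervals mentioned here can be illustrated with a generic percentile bootstrap. This is a textbook sketch, not the framework's own code. The function name, the resample count, and the per-patient scores are hypothetical.

```python
import random

def bootstrap_ci(values, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    stats = []
    for _ in range(n_resamples):
        # Resample the data with replacement, keeping the original size.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        stats.append(stat(resample))
    stats.sort()
    lo_idx = max(0, int((alpha / 2.0) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1.0 - alpha / 2.0) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-patient utility scores for one generator.
scores = [0.61, 0.58, 0.70, 0.66, 0.55, 0.63, 0.69, 0.60, 0.64, 0.57]
low, high = bootstrap_ci(scores, mean)
```

Reporting `(low, high)` alongside the point estimate makes it easier to tell whether a gap between two generators exceeds resampling noise.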

2606.06985 2026-06-08 cs.CL eess.AS 新提交

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

基于大语言模型生成近误样本的对比训练用于鲁棒语码转换语音识别

Tung X. Nguyen, Hieu Minh Truong, Giang-Son Nguyen, Nhu Vo, Wray Buntine, Dung D. Le

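Affiliations: VinUniversity; University of Technology Sydney; Monash University

AI summary: Proposes a POI-aware contrastive training framework that generates near-miss negatives with a large language model, filters them, and fine-tunes Whisper-small with a POI-weighted cross-entropy and a multi-negative contrastive loss, reducing error rates by over 2% on code-switching speech recognition.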

详情
Comments
Accepted at INTERSPEECH 2026
AI中文摘要

语码转换(CS)是指在单个话语中交替使用多种语言,这对自动语音识别(ASR)仍然具有挑战性。为了解决这个问题,我们提出了一个兴趣点(POI)感知的对比训练框架,该框架提高了CS关键区域的识别能力。我们首先采用文献中的POI检测方法识别CS片段,然后通过扰动ASR N-best输出中的POI并利用大语言模型扩展候选,构建声学上合理的近误假设。通过声学、音位和文本约束过滤,保留困难但合理的负样本。最后,我们使用POI加权交叉熵锚点目标以及多负例对比排序损失,通过LoRA微调Whisper-small。在CS-FLEURS(cmn-eng)和ViMedCSS(vie-eng)上的实验表明,与标准LoRA微调相比,通用错误率和CS感知错误率均持续降低超过2%。

英文摘要

Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting a POI detection method from the literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding the candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.

2606.06984 2026-06-08 cs.LG 新提交

Accelerating Multi-Objective Bayesian Optimisation via Predictive-Gradient Catalysts

通过预测梯度催化剂加速多目标贝叶斯优化

Alma Rahat, Tinkle Chugh, Jonathan Fieldsend, Richard Allmendinger

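Affiliations: Loughborough University; University of Exeter; The University of Manchester

AI summary: Proposes using Gaussian process predictive gradients as auxiliary signals to augment existing Pareto-compliant acquisition functions, accelerating the convergence of multi-objective Bayesian optimisation toward the global Pareto set.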

详情
Comments
Parallel Problem Solving From Nature (PPSN), 2026
AI中文摘要

本文提出了一种通用的多目标贝叶斯优化(MOBO)加速机制,利用高斯过程预测梯度作为辅助信号。该方法并非取代现有的Pareto兼容采集函数,而是通过从代理模型导出的梯度中获取局部平稳性信息来增强它们,从而在有限的评估预算下更快地收敛到全局Pareto集。研究了两种催化剂实例:自适应多重梯度下降算法催化剂(MGDA)和预定义权重变体,后者在预算紧张时能够实现聚焦探索。在DTLZ基准测试套件(使用2个目标和10个决策变量)上的实验表明,当代理模型准确时,特别是对于平稳问题,预测梯度催化相比其他采集函数(EHVI、AugTch、tMPoI、SAF)能够带来显著的加速。

英文摘要

This paper presents a general acceleration mechanism for multi-objective Bayesian optimisation (MOBO) that leverages Gaussian process predictive gradients as auxiliary signals. Rather than replacing existing Pareto-compliant acquisition functions, the proposed approach augments them with local stationarity information obtained from the surrogate gradients, enabling faster convergence toward the global Pareto set under limited evaluation budgets. Two catalyst instantiations are investigated: an adaptive catalyst based on the Multiple-Gradient Descent Algorithm (MGDA) and a predefined-weight variant that enables focused exploration when budgets are tight. Experiments on the DTLZ benchmark suite (using 2 objectives and 10 decision variables) show that predictive gradient catalysis can deliver significant acceleration compared to other acquisition functions (EHVI, AugTch, tMPoI, SAF) when surrogates are accurate, particularly for stationary problems.
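For two objectives, the core MGDA step has a closed form: find the minimum-norm convex combination of the two gradients. The sketch below implements that textbook step on plain vectors. In the paper the gradients would come from Gaussian process surrogates, and how the result is combined with an acquisition function is not specified in the abstract, so this shows only the building block. The function name is hypothetical.

```python
def min_norm_direction(g1, g2):
    """Minimum-norm point on the segment between gradients g1 and g2.

    Returns (d, gamma) with d = gamma * g1 + (1 - gamma) * g2. For
    minimisation, -d is a common descent direction for both objectives;
    d == 0 indicates a Pareto-stationary point.
    """
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        return list(g1), 1.0  # identical gradients: any weight gives g1
    # Unconstrained minimiser of ||gamma * g1 + (1 - gamma) * g2||^2 ...
    gamma = sum((b - a) * b for a, b in zip(g1, g2)) / diff_sq
    # ... clipped to [0, 1] so the weights stay a convex combination.
    gamma = max(0.0, min(1.0, gamma))
    d = [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)]
    return d, gamma

orthogonal, w_orth = min_norm_direction([1.0, 0.0], [0.0, 1.0])
opposed, w_opp = min_norm_direction([1.0, 0.0], [-1.0, 0.0])
```

With orthogonal gradients the result is their average. With exactly opposed gradients the result is the zero vector, which signals that no direction improves both objectives, the kind of local stationarity information the abstract refers to.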