arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.06877 2026-06-08 cs.RO cs.AI New submission

Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints

Qiwei Du, Zitong Zhan, Shaoshu Su, Bowen Li, Yi Du, Zhipeng Zhao, Taimeng Fu, Sebastian Scherer, Jiaoyang Li, Chen Wang

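Affiliations: Spatial AI & Robotics (SAIR) Lab, University at Buffalo, NY 14260; Robotics Institute, Carnegie Mellon University, PA 15213

AI summary: Proposes an imperative learning-based bilevel optimization framework that prunes task-irrelevant objects with a neural scorer and stabilizes lower-level planning with a 3R strategy (Repair, Restart, Rollback), reporting an 80.04% reduction in failure rate and a 57.14% reduction in planning time on three benchmarks.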

Abstract

Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex logical constraints, including object affordances, spatial relationships, and sequential action dependencies. Recent neuro-symbolic methods improve planning efficiency by learning object-importance scores to prune task-irrelevant objects, but they typically rely on fixed offline supervision generated from full search spaces. This creates a train-test mismatch: at deployment, the planner operates in pruned search spaces induced by the model's own imperfect predictions, leading to exposure bias and degraded planning performance. To address this challenge, we formulate object-importance learning for task planning as an imperative learning-based bilevel optimization problem. The upper level optimizes a neural scorer, while the lower level solves a symbolic planning problem in the score-pruned search space. To stabilize this learning process, we introduce a 3R strategy into the lower-level planning, using parallel Repair, Restart, and Rollback recovery to provide reliable and adaptive feedback for upper-level learning. Experiments on three challenging benchmarks demonstrate state-of-the-art performance, including an 80.04% reduction in failure rate and a 57.14% reduction in planning time. We further validate the framework on a quadruped-based mobile manipulator in simulation and the real world, demonstrating its potential for efficient and deployable neuro-symbolic task planning.
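
Illustrative sketch (not the authors' implementation): the abstract describes planning inside a search space pruned by object-importance scores, and the failure that follows when the scorer underrates an object the task needs. The toy below shows that mismatch with a stand-in planner and a simple threshold-relaxation fallback. The names `prune`, `toy_planner`, and `plan_with_relaxation`, the scores, and the threshold schedule are all assumptions; the fallback is a generic retry and is not the paper's 3R strategy.

```python
def prune(scores, threshold):
    """Keep only the objects whose importance score reaches the threshold."""
    return {obj for obj, score in scores.items() if score >= threshold}


def toy_planner(objects, required):
    """Stand-in for a symbolic planner: it succeeds only if every object
    the task needs survived pruning (hypothetical, for illustration)."""
    return sorted(required) if required <= objects else None


def plan_with_relaxation(scores, required, thresholds=(0.5, 0.25, 0.125, 0.0)):
    """Plan in the pruned space and loosen the threshold after each failure."""
    for threshold in thresholds:
        kept = prune(scores, threshold)
        plan = toy_planner(kept, required)
        if plan is not None:
            return plan, kept, threshold
    return None, set(scores), None


# Hypothetical scores from an imperfect scorer: "key" is needed but underrated.
scores = {"cup": 0.9, "table": 0.7, "key": 0.2, "lamp": 0.05}
plan, kept, used = plan_with_relaxation(scores, required={"cup", "key"})
print(plan, sorted(kept), used)
```

With these made-up scores the first two thresholds prune away "key", so the planner fails until the threshold is relaxed; "lamp" stays pruned throughout.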

2606.06870 2026-06-08 cs.RO New submission

What Is My Robot Thinking? Design Considerations for Transparent and Trustworthy Shared Autonomy

Atharv Belsare, Zohre Karimi, Connor Mattson, Rushiil Nakka, Daniel S. Brown

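Affiliations: Kahlert School of Computing, University of Utah; Robotics Center, University of Utah

AI summary: A user study of how interface transparency (feedback modality and information richness) affects coordination and trust in shared autonomy. Feedback improves intent alignment and reduces corrective interventions, visual feedback is preferred over auditory feedback, the preferred information richness depends on task complexity, and revealing the full belief distribution does not consistently improve alignment or trust.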

Comments
9 pages, 5 Figures, Code and videos are available at https://sites.google.com/view/design-t2-sa/home. Under review at IROS 2026

Abstract

Assistive robots operating under shared autonomy must balance user control with autonomous assistance. Because robot actions depend on internal intent inference that is not directly observable, mismatches between inferred and intended goals can undermine coordination and trust. We investigate how interface-level transparency, including feedback modality (visual vs. auditory) and information richness (sparse vs. rich), shapes interaction in a vision-based shared autonomy system. In a user study with N=25 participants across two assistive manipulation tasks, we evaluate how these designs influence coordination and trust. Providing feedback significantly improves intent alignment and reduces corrective intervention, indicating that making the inferred goal legible accelerates convergence in shared control. Participants preferred visual over auditory feedback, while preferences for sparse versus rich information depended on task complexity. We also found that revealing the full belief distribution did not consistently improve alignment or trust. Together, these findings indicate that effective transparency enhances coordination primarily through goal legibility, while trust depends on task-appropriate information exposure rather than maximal disclosure. Based on these results, we outline guidelines for designing transparent shared autonomy systems.

2606.06869 2026-06-08 cs.AI New submission

Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

Yunhan Wang, Yuda Wang, Zhiying Tu, Mingqiang Song, Li Song, Kun Li, Dianhui Chu, Bolin Zhang

Affiliations: Harbin Institute of Technology, Weihai; Harbin Institute of Technology (Weihai) Qingdao Research Institute; Shandong Key Laboratory of Digital Service Computing Technology and Systems; Weihai Municipal Hospital; Shanghai Taizhu Technology Co., Ltd; Tianjin Zhifu Qihuang Medical Technology Co., Ltd

AI summary: Proposes a knowledge-enhanced visual diagnostic system that uses knowledge-graph constraints, information gain-driven questioning, and multimodal treatment presentation to improve the transparency of traditional Chinese medicine syndrome differentiation and the interpretability of treatment plans.

Comments
29 pages, 9 figures, 5 tables, including supporting information

Abstract

Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment. Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact, semantic, fuzzy, and large language model verification), an information gain-driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian-acupoint models, and evidence-based literature. Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen's d = 1.82, p < 0.001), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95). Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph-driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.
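
Illustrative sketch (not the authors' implementation): the abstract mentions an information gain-driven questioning strategy. One common way to make that concrete, assumed here, is to ask about the symptom whose yes/no answer is expected to reduce the entropy of the belief over syndromes the most. The placeholder names (`syndrome_a`, `symptom_1`, and so on) and all probabilities are invented for illustration; the knowledge graph and the genetic-algorithm tuning are not modelled.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a {label: probability} distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior, likelihood, answer):
    """Bayes update, where likelihood[s] = P(symptom present | syndrome s)."""
    unnorm = {s: prior[s] * (likelihood[s] if answer else 1 - likelihood[s])
              for s in prior}
    total = sum(unnorm.values())
    return {s: v / total for s, v in unnorm.items()}


def information_gain(prior, likelihood):
    """Expected entropy reduction from asking about one symptom."""
    p_yes = sum(prior[s] * likelihood[s] for s in prior)
    gain = entropy(prior)
    for answer, p_answer in ((True, p_yes), (False, 1 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(posterior(prior, likelihood, answer))
    return gain


# Hypothetical belief over two syndromes and two candidate questions.
prior = {"syndrome_a": 0.5, "syndrome_b": 0.5}
questions = {
    "symptom_1": {"syndrome_a": 1.0, "syndrome_b": 0.0},  # fully discriminative
    "symptom_2": {"syndrome_a": 0.5, "syndrome_b": 0.5},  # uninformative
}
gains = {q: information_gain(prior, lik) for q, lik in questions.items()}
best_question = max(gains, key=gains.get)
print(best_question, gains)
```

In this toy setting the discriminative question is worth one full bit and the uninformative one is worth nothing, so the first is asked.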

2606.06867 2026-06-08 cs.CV New submission

Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis

Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You, Carla Pitarch-Abaigar, Spyridon Bakas

Affiliations: Division of Computational Pathology, Department of Pathology and Laboratory Medicine, Indiana University School of Medicine; IU Melvin and Bren Simon Comprehensive Cancer Center; Departments of Biostatistics and Health Data Science; Radiology and Imaging Sciences; Neurological Surgery; Indiana University School of Medicine; Department of Computer Science, Luddy School of Informatics, Computing, and Engineering

AI summary: Proposes Multi-FRuGaL, a decomposition-aware, adaptive gated intermediate-fusion framework that learns modality-level representations under missing modalities and separates redundant from complementary signals to improve cancer diagnosis and prognosis.

Abstract

Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.
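
Illustrative sketch (not the authors' implementation): the abstract describes gated fusion that stays well-defined when modalities are missing. A minimal way to get that property, assumed here, is to take a softmax over the gate logits of the available modalities only and fuse their embeddings with the resulting weights. The function name `gated_fusion`, the modality names, and the fixed gate logits are hypothetical; in the paper the gates are input-conditioned and the decomposition layer is not modelled here.

```python
import math


def gated_fusion(embeddings, gate_logits):
    """Fuse the available modality embeddings with softmax gate weights.

    embeddings: {modality: list of floats, or None if the modality is missing}
    gate_logits: {modality: float}, supplied by hand in this sketch
    """
    present = [m for m, emb in embeddings.items() if emb is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    top = max(gate_logits[m] for m in present)
    exps = {m: math.exp(gate_logits[m] - top) for m in present}
    total = sum(exps.values())
    weights = {m: exps[m] / total for m in present}
    dim = len(embeddings[present[0]])
    fused = [sum(weights[m] * embeddings[m][i] for m in present)
             for i in range(dim)]
    return fused, weights


# Hypothetical patient record with the pathology modality missing.
embeddings = {"ct": [1.0, 0.0], "pathology": None, "clinical": [0.0, 1.0]}
gate_logits = {"ct": 0.0, "pathology": 3.0, "clinical": 0.0}
fused, weights = gated_fusion(embeddings, gate_logits)
print(fused, weights)
```

Because the missing modality is excluded before normalisation, its large gate logit has no effect and the remaining weights still sum to one.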

2606.06866 2026-06-08 cs.LG nucl-th New submission

Product units in gated recurrent units improve nuclear-mass prediction

Ziyuan Li, Paulo S. A. Freitas, John W. Clark, Babette Dellen

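Affiliations: University of Applied Sciences Koblenz; Technical University of Munich; University of Madeira; Washington University in St. Louis

AI summary: Proposes a complex-valued additive-multiplicative product-unit gated recurrent unit (AM-PU-GRU) that combines product-unit transformations with complex-domain computation, reaching an interpolation RMSE of 0.227 MeV and an extrapolation RMSE of 0.179 MeV in nuclear-mass prediction and outperforming the compared models.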

Comments
Accepted at ICCS 2026

Abstract

The prediction of masses of atomic nuclei using machine learning can complement theoretical models and advance the exploration of poorly known domains of the nuclear chart. We propose a machine learning technique based on gated recurrent units (GRU), which have demonstrated competitive performance in nuclear-mass prediction by exploiting long-term dependencies. By integrating multiplicative interactions and product-unit transformations within recurrent units, we report significant improvements in nuclear-mass prediction. Computations are performed in the complex domain to jointly capture amplitude and phase dynamics. For interpolation and temporal-extrapolation tasks based on the atomic mass evaluation (AME2016 and AME2020), the complex additive-multiplicative product-unit gated recurrent unit (AM-PU-GRU) model consistently achieves the lowest prediction errors, with an interpolation RMSE of 0.227 ± 0.004 MeV and an extrapolation RMSE of 0.179 ± 0.015 MeV. These results surpass other state-of-the-art machine learning models and also outperform the real-valued GRU baseline and product-unit ablation variants, while remaining robust to different theoretical priors, including WS4 and SEMF. Our findings establish complex-valued product-unit recurrent networks as a new benchmark for sequence-based nuclear-mass prediction.
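
Illustrative sketch (not the authors' implementation): a product unit in its usual textbook form computes a weighted product of its inputs, prod_i x_i ** w_i, which can be evaluated as exp(sum_i w_i * log x_i). Working in the complex domain keeps the logarithm defined for negative inputs, which is one plausible reading of why the abstract mentions amplitude and phase. The function name `product_unit` is hypothetical, and how the unit is wired into the GRU gates is not shown.

```python
import cmath


def product_unit(inputs, exponents):
    """Evaluate prod_i inputs[i] ** exponents[i] through the complex logarithm.

    Inputs must be nonzero, since log(0) is undefined.
    """
    log_sum = sum(w * cmath.log(complex(x)) for x, w in zip(inputs, exponents))
    return cmath.exp(log_sum)


positive = product_unit([2.0, 3.0], [2.0, 1.0])   # 2**2 * 3**1
negative = product_unit([-4.0], [0.5])            # principal square root of -4
print(positive, negative)
```

The second call returns a purely imaginary value (up to rounding), showing how a negative input turns into a phase rather than an error.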

2606.06865 2026-06-08 cs.CL New submission

Are Large Language Models Suitable for Graph Computation? Progress and Prospects

Yuting Zhang, Yi Han, Kai Wang, Wei Ni, Angela Bonifati, Wenjie Zhang

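Affiliations: University of New South Wales; Antai College of Economics and Management, Shanghai Jiao Tong University; Edith Cowan University; Lyon 1 University

AI summary: Surveys LLMs for graph computation through a role-based taxonomy, analyzing two paradigms (LLMs as executors and LLMs as planners). It finds LLMs promising for simple, small-scale tasks but unreliable for large-scale and exactness-demanding ones, and summarizes datasets and future directions.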

Abstract

Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.
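
Illustrative sketch (not taken from the survey): in the planner paradigm described above, the model does not compute the answer itself but emits a structured call to an exact external tool. The toy below assumes a hypothetical tool registry `TOOLS` and a hand-written `plan` dictionary standing in for the LLM's output; only the breadth-first search is real computation.

```python
from collections import deque


def shortest_path_length(edges, source, target):
    """Exact BFS over an undirected, unweighted graph given as an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbour in adjacency.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return None  # target is unreachable


# Hypothetical registry of tools the planner is allowed to call.
TOOLS = {"shortest_path_length": shortest_path_length}


def execute(plan):
    """Run one tool call of the kind a planner LLM would emit."""
    return TOOLS[plan["tool"]](**plan["args"])


# Stand-in for the planner's structured output.
plan = {
    "tool": "shortest_path_length",
    "args": {"edges": [(1, 2), (2, 3), (3, 4), (1, 4)], "source": 1, "target": 3},
}
answer = execute(plan)
print(answer)
```

Delegating the search keeps the result exact regardless of graph size, which is the property the executor paradigm struggles to guarantee.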

2606.06864 2026-06-08 cs.CV cs.LG New submission

LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification

Yonghan Shin, Won-Ki Jeong

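Affiliations: Department of Computer Science and Engineering, Korea University, Seoul, Korea

AI summary: Proposes LRMIL, which transfers high-resolution knowledge to low-resolution representations through two-stage knowledge distillation and uses only low-resolution patches at inference, substantially reducing computational cost while improving classification performance.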

Abstract

Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.
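
Illustrative sketch (not the authors' implementation): the abstract names two distillation stages but not their loss functions. The toy below assumes the most common choices: a mean squared error that pulls a low-resolution patch embedding toward its high-resolution counterpart, and a slide-level loss that mixes cross-entropy on the label with a temperature-softened KL term toward the teacher. The function names, `alpha`, and `temperature` are all hypothetical.

```python
import math


def patch_alignment_loss(student_embedding, teacher_embedding):
    """Stage 1 (assumed form): MSE between low- and high-resolution embeddings."""
    pairs = list(zip(student_embedding, teacher_embedding))
    return sum((s - t) ** 2 for s, t in pairs) / len(pairs)


def softmax(logits, temperature=1.0):
    top = max(logits)
    exps = [math.exp((z - top) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def slide_distillation_loss(student_logits, teacher_logits, label,
                            alpha=0.5, temperature=2.0):
    """Stage 2 (assumed form): label cross-entropy plus KL(teacher || student)."""
    cross_entropy = -math.log(softmax(student_logits)[label])
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return alpha * cross_entropy + (1 - alpha) * kl


aligned = patch_alignment_loss([1.0, 2.0], [1.0, 4.0])
agree = slide_distillation_loss([0.0, 0.0], [0.0, 0.0], label=0)
disagree = slide_distillation_loss([0.0, 0.0], [2.0, 0.0], label=0)
print(aligned, agree, disagree)
```

When student and teacher logits coincide the KL term vanishes and only the supervised term remains, so the teacher adds a penalty only where the two disagree.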

2606.06861 2026-06-08 cs.LG cs.AI New submission

Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks

Ziyuan Li, Uwe Jaekel, Babette Dellen

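Affiliations: University of Applied Sciences Koblenz; Technical University of Munich

AI summary: Proposes product-unit residual networks (PURe), which model feature interactions explicitly to improve robustness and interpretability, and reports performance competitive with or better than MLP baselines on synthetic and real-world datasets.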

Comments
Accepted at ICCS 2026

Abstract

Understanding nonlinear feature interactions is crucial in science and engineering, yet standard multilayer perceptrons (MLPs) often capture such interactions only implicitly, leading to entangled representations that can impair robustness and interpretability. We investigate product-unit residual networks (PURe) that integrate multiplicative product units with residual connections to explicitly model cross-feature couplings while stabilizing optimization. We conduct a systematic evaluation on an interaction-driven synthetic benchmark and two real-world datasets, assessing predictive accuracy, robustness to Gaussian feature noise, and performance under limited training data, and we compare real- and complex-valued variants under a matched parameter budget. Beyond accuracy, SHapley Additive exPlanations (SHAP)-based interaction analyses show that PURe learns more concentrated and structurally coherent interaction patterns than MLP baselines. Overall, PURe achieves competitive or improved performance, better robustness and sample efficiency in low-data regimes, and enhanced interaction-level interpretability.

2606.06857 2026-06-08 cs.CL New submission

Interpreting Brain Responses to Language with Sparse Features from Language Models

Michael A. Lepori, Kendrick Kay, Greta Tuckute

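Affiliations: Brown University; University of Minnesota; Harvard University

AI summary: Proposes Augmented Sparse Encoding Models, which replace dense LM hidden states with hierarchically organized sparse autoencoder features and add surprisal as a predictor, to interpret responses in human language cortex. The fronto-temporal language network is predicted by a common set of features, and brain responses are best explained by the most general features in LM representations.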

Abstract

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

2606.06854 2026-06-08 cs.LG New submission

The Geometry of Last-Layer Model Stealing

Snigdha Chandan Khilar

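Affiliations: Independent Researcher

AI summary: Uses geometry to explain how a machine learning model can be stolen with an existing, well-known method, gives the conditions for perfectly copying the final layer of a transformer network, and shows the limits that apply to the hidden layers.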

Abstract

This paper uses geometry to explain how a machine learning model can be stolen using an already existing well-known method. The author has shown the exact conditions required to perfectly copy the final layer of a transformer network. When looking deeper into the hidden layers the author has explained clear limits. The author has also demonstrated that a hidden network cannot be fully reverse engineered just by looking at the final results. The research clearly maps out what can and cannot be stolen from a model.
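
Illustrative sketch (an assumption about the setting, since the abstract does not name the method): if a model's logits are a linear map W applied to a hidden vector h, then every observed logit vector lies in a subspace whose dimension equals the hidden width, and W can be identified at best up to an invertible change of basis. The toy below checks both facts with exact integer arithmetic. The matrices and helper names are invented for illustration.

```python
from fractions import Fraction


def rank(rows):
    """Matrix rank by exact Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Toy final layer: vocabulary of 4 tokens, hidden width 2.
W = [[1, 0], [0, 1], [1, 1], [2, -1]]
hidden_states = [[1, 2], [3, 1], [0, 5], [2, 2], [-1, 4]]
logits = [matvec(W, h) for h in hidden_states]

# A different final layer and hidden states that give identical logits.
A = [[2, 1], [1, 1]]
A_inv = [[1, -1], [-1, 2]]
W_alt = matmul(W, A)
logits_alt = [matvec(W_alt, matvec(A_inv, h)) for h in hidden_states]

print(rank(logits), logits == logits_alt)
```

The rank of the observed logits reveals the hidden width, while the second check shows that two different final layers can be indistinguishable from outputs alone.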

2606.06853 2026-06-08 cs.CV cs.AI New submission

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen

Affiliations: School of Computer Science and Engineering, Beihang University; Beijing Digital Native Digital City Research Center; School of Computer Science, Peking University; School of Artificial Intelligence, Beijing University of Posts and Telecommunications

AI summary: Proposes MotionEnhancer, which distills motion priors from a video diffusion model and uses attention alignment to strengthen the motion understanding of vision-language models, without additional parameters or architectural changes, and yields consistent gains on motion-level video understanding benchmarks.

Comments
Accepted by CVPR 2026

Abstract

The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

2606.06850 2026-06-08 cs.CV New submission

CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs

Fuchen Li, Xinyang Wang, Yahui Zhang, Yuhan Chen, Jiahong Guo, Zhuohan Qin, Wenbo Ma

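Affiliations: University of Florida; University of Southampton; Chongqing University; Qingdao University; Intel Asia-Pacific Research & Development Ltd

AI summary: Proposes CFRNet, a 2.0M-parameter ResNet-style restoration network trained with Cycle-Consistent Fixed-Point Training (CCFP) for blind face restoration on consumer NPUs, balancing speed and quality; LPIPS at three cycles is 31% lower than at one cycle.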

Comments
12 pages. Code and project page will be released

Abstract

Blind face restoration on consumer devices has to balance image quality against speed and memory. Strong methods such as GFPGAN and CodeFormer give good perceptual quality, but they rely on large pretrained generative priors and on operators such as attention, codebook lookup, and style modulation that are hard to compile and quantize on the small neural processing units (NPUs) used in consumer hardware. Small convolutional restorers run fast enough, but they tend to over-smooth and to leave artifacts around the eyes, nose, and mouth. We present CFRNet, a 2.0M-parameter ResNet-style restorer for on-device use at 256×256, the common face-crop size on consumer NPUs. The main idea is Cycle-Consistent Fixed-Point Training (CCFP). Instead of training the network for one pass and then running it several times by hand, we train it to act as a fixed-point operator, so that applying it again to a restored face does not change the face. CCFP uses three training losses, namely progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss, and it adds no cost at inference. To compare fairly under our deployment limits, we retrain all baselines from scratch at the same 256×256 resolution. On a 300-image test set, CFRNet reaches the best perceptual score (LPIPS 0.250 at three cycles, which is 31% lower than one cycle) and also the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the same baselines cannot be compiled to that chip. The cycle count k acts as a simple quality knob that needs no retraining: PSNR is best at k = 2 and LPIPS keeps improving up to k = 3. We further show that the same idea works with a plain CNN that is even easier to deploy, and we run the model in real time on an in-car driver-monitoring board.
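
Illustrative sketch (not the authors' implementation): the fixed-point idea in the abstract says that applying the restorer to its own output should change nothing. One natural way to measure that, assumed here, is the mean squared difference between one pass and two passes. The toy operators `clip01` and `halfway` stand in for a network, and the squared-error form of `idempotence_loss` is an assumption; the abstract does not state the exact loss.

```python
def apply_cycles(restore, x, k):
    """Run the restorer k times, feeding each output back in as input."""
    for _ in range(k):
        x = restore(x)
    return x


def idempotence_loss(restore, x):
    """Mean squared change caused by a second pass over a restored input."""
    once = restore(x)
    twice = restore(once)
    return sum((a - b) ** 2 for a, b in zip(twice, once)) / len(once)


def clip01(x):
    """Toy operator that is already a fixed point after one pass."""
    return [min(1.0, max(0.0, v)) for v in x]


def halfway(x):
    """Toy operator that keeps moving: each pass goes halfway toward 1.0."""
    return [v + 0.5 * (1.0 - v) for v in x]


stable = idempotence_loss(clip01, [-0.5, 0.3, 1.7])
drifting = idempotence_loss(halfway, [0.0])
after_three = apply_cycles(halfway, [0.0], 3)
print(stable, drifting, after_three)
```

A loss of zero means extra cycles are harmless, while a positive loss flags an operator whose output still drifts with every additional pass.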

2606.06842 2026-06-08 cs.CL New submission

CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

Chenshuo Pan, Yu Zhao, Jie Zhang, Changzai Pan, Zhenhe Wu, Jiayi Liang, Yujie Mao, Shuangyong Song, Yongxiang Li, Zhongjiang He

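Affiliations: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd

AI summary: Proposes CRAFT, a unified counterfactual reasoning framework that recasts tabular question answering and fact verification as a bidirectional verification process. It constructs statements and their counterfactual variants and integrates the evidence from both with weights, improving performance on complex table reasoning.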

Comments
24 pages, 10 figures

Abstract

Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.
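
Illustrative sketch (not the authors' implementation): the abstract says evidence from the original and counterfactual reasoning paths is integrated through a weighted mechanism but does not give the formula. The toy below assumes the simplest reading: support for a statement is combined with the lack of support for its counterfactual using a single weight. The function names, the weight of 0.6, and the 0.5 decision threshold are all hypothetical.

```python
def integrate(p_original, p_counterfactual, weight=0.6):
    """Blend support for a statement with evidence against its counterfactual.

    p_original: estimated probability that the statement holds
    p_counterfactual: estimated probability that the counterfactual holds
    """
    return weight * p_original + (1 - weight) * (1 - p_counterfactual)


def verify(p_original, p_counterfactual, weight=0.6, threshold=0.5):
    """Accept the statement when the blended score clears the threshold."""
    return integrate(p_original, p_counterfactual, weight) >= threshold


consistent = integrate(0.8, 0.1)   # both paths favour the statement
conflicted = integrate(0.6, 0.9)   # the counterfactual path disagrees
print(consistent, conflicted, verify(0.6, 0.9))
```

The second case shows the intended effect: a statement that looks plausible on its own is rejected once its counterfactual is found to be even better supported.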

2606.06840 2026-06-08 cs.CL cs.AI cs.LG New submission

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

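Affiliations: Khoury College of Computer Sciences, Northeastern University

AI summary: Studies the mechanism by which modern reasoning models perform zero-shot multi-label classification over label spaces of up to millions of labels, characterizes it as two phases (candidate shortlisting followed by fine-grained reasoning), and builds on that characterization a mechanistic distillation strategy that outperforms standard distillation.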

英文摘要

Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: a broad "shortlisting" of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.

2606.06835 2026-06-08 cs.CL 新提交

Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Translate-R1:通过强化学习实现成本感知的翻译工具使用

Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj, Meet Vadera, Priyanka Nigam, Bing Yin

Affiliations: Amazon Stores Foundation AI

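AI summary: Proposes a reinforcement-learning-based gated policy that lets an LLM assess its own comprehension and call a translation tool only when needed, raising reward and lowering translation cost across 22 languages.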

详情
Comments
14 pages main text plus appendix, 7 figures, 11 tables
AI中文摘要

LLM在不同语言上的性能差距已有充分记录,而原生缩小差距需要对大多数语言不存在的语料库进行预训练或微调。翻译提供了一种替代方案:将输入转换为模型的主导语言,从而立即释放其全部能力。然而,对每个输入都应用翻译对于模型已能处理的语言来说是浪费的,而将选择权留给模型则相反地失败,因为LLM过于自信,即使无法理解输入也会跳过工具。先前的工作通过语言特定规则、领域启发式、语言标识符或外部路由器来解决这一问题,每种方法都需要手动工程。我们转而学习一个单一策略,仅从奖励中决定何时翻译,开发出语言和领域自适应的内省能力,评估自身理解能力,并仅在无法原生解决任务时调用翻译。使用我们保留答案的翻译流水线构建的数据,我们在后训练的Qwen3-4B上继续RL,涵盖3个资源层级(高、低、极低)的22种语言和5个领域,并引入置信度门控GSPO用于成本敏感的工具使用。门控策略在基线基础上将奖励提升:高资源+4.6,低资源+23.5,极低资源+17.5。与几乎总是翻译的无约束策略相比,它以63%的成本保留了全部奖励,并在87%的成本敏感范围内是帕累托最优的。此外,为了模拟在完全未见语言上的行为,我们创建了2种合成语言,在这些语言上,我们的门控策略比过度自信的基线(即使在这些不可理解的输入上也未充分利用工具)提升了+18.7。该策略零样本迁移到9种保留语言,我们分析了工具使用在训练过程中如何按语言和领域出现。

英文摘要

The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model's dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.

2606.06834 2026-06-08 cs.CL q-bio.GN 新提交

The Dark Regulome: Disentangling Predictability from Regulation in Genomic Foundation Models

暗调控组:从基因组基础模型中分离可预测性与调控性

Chahat Baranwal, Aadtya Baranwal, Lakshya Nitin Tandon

Affiliations: IIT Jodhpur; University of Central Florida; Northeastern University

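AI summary: Proposes a residualization-and-permutation diagnostic that separates sequence predictability from regulatory signal in the in-silico mutagenesis scores of genomic foundation models, reveals a 10kb proximal-regulatory horizon, and shows that cross-architecture decomposition distinguishes a predictability layer from a regulatory-output layer, offering a general tool for regulatory studies of the dark genome.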

详情
AI中文摘要

高级别胶质瘤通过与神经元的突触整合到神经回路中,这引发了一个问题:哪些非编码元件塑造了肿瘤细胞中的突触形成基因表达。写在暗基因组上的调控程序,我们称之为$\textit{暗调控组}$,是探索的自然底物,而序列基础模型通过计算机诱变(ISM)提供了一条零样本路径;然而,基于似然的评分与局部序列可预测性存在同义反复的耦合,使得调控解释不充分。在三个架构不同的基础模型(Caduceus-Ph、HyenaDNA、Enformer)和92个胶质瘤相关位点的30,448个暗基因组元件上,我们引入了一种残差化-置换诊断方法,以分离由可预测性驱动和由调控驱动的RIS方差。一个尖锐的10kb近端调控边界在我们应用的所有控制中仍然存在,但LM衍生的元件类别层次结构则不然:一个六特征线性基线在AUC=0.985时匹配Caduceus的十分位数成员。跨架构分解清晰地分离了序列可预测性层(两个语言模型共同对长且可预测的转座元件进行排序)和调控输出层(只有Enformer保留了区分cCRE的信号),两个前100列表之间完全没有重叠。然后,保守性、脑cis-eQTL和STRING-PPI交叉检查锚定了哪些生物学信息得以保留:所有三个模型的前100个元件在匹配脑eQTL方面每个模型富集了3.3倍($p_\mathrm{emp} < 5\times 10^{-3}$),而一个诱人的转座元件调控层和一个显著的NRXN1+NLGN1蛋白对收敛在构建适当的置换检验后均未通过。我们将该诊断方法作为任何基于ISM的调控研究的通用方法工具提供。

英文摘要

High-grade gliomas integrate into neural circuits through functional synapses with neurons, raising the question of which noncoding elements shape synaptogenic gene expression in tumor cells. The regulatory program written across the dark genome, what we call the $\textit{dark regulome}$, is the natural substrate to probe, and sequence foundation models offer a zero-shot route through in-silico mutagenesis (ISM); yet likelihood-based scoring is tautologically coupled to local sequence predictability, leaving the regulatory interpretation underdetermined. Across three architecturally distinct foundation models (Caduceus-Ph, HyenaDNA, Enformer) and 30,448 dark genome elements at 92 glioma-relevant loci, we introduce a residualization-and-permutation diagnostic that separates predictability-driven from regulation-driven RIS variance. A sharp 10kb proximal-regulatory horizon survives every control we apply, but the LM-derived element-class hierarchy does not: a six-feature linear baseline matches Caduceus top-decile membership at AUC $= 0.985$. Cross-architecture decomposition cleanly separates a sequence-predictability layer (the two language models co-rank long well-predicted transposable elements) from a regulatory-output layer (Enformer alone retains residual cCRE-discriminative signal), with literally zero overlap between the two top-100 lists. Conservation, brain cis-eQTL, and STRING-PPI cross-checks then anchor what biology survives: top-100 elements across all three models are $3.3\times$ enriched per model for matching brain eQTLs ($p_\mathrm{emp} < 5\times 10^{-3}$), while a tempting transposable-element regulatory layer and a striking NRXN1+NLGN1 protein-pair convergence both fail proper permutation tests once those tests are constructed. We deliver the diagnostic as a general methodological tool for any ISM-based regulatory study.

2606.06833 2026-06-08 cs.LG cs.AI cs.CR 新提交

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

听弦外之音:面向声学对抗攻击的语言模型先验

Jiani Xie, Andrew C. Cullen, Paul Montague, Benjamin I. P. Rubinstein

Affiliations: University of Melbourne; DST Group

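AI summary: Proposes the Semantic Gambit attack, which uses a large language model to supply predictive context in real time and thereby break the causal limitation, raising the corpus-level word error rate of real-time ASR systems to 35.6%, a three-fold increase over the current state of the art.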

详情
AI中文摘要

在实时环境中运行的自动语音识别(ASR)系统必须在严格的时间约束下处理声学输入,其转录决策本质上基于不完整信息。这种因果约束成为攻击者的信息瓶颈,显著限制了攻击性能。我们的新攻击方法Semantic Gambit通过实时利用大语言模型提供的预测上下文,突破了这一因果限制。实验表明,这种增强方式可将语料级词错误率提升至35.6%——比当前最优方法提高三倍。最终,这项工作揭示了如何利用常见的低延迟LLM工具系统地破坏实时ASR流水线。

英文摘要

Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, where transcription decisions are inherently made on incomplete information. This causal constraint serves as an information bottleneck on attackers, significantly limiting attack performance. Our new Semantic Gambit attack breaks this causal limitation by augmenting the adversary with predictive context derived from a Large Language Model in real-time. Our experiments show that this form of augmentation can elevate the corpus-level Word Error Rate to 35.6% -- a three-fold increase over the current state-of-the-art. Ultimately, this work reveals how common, low-latency LLM tooling can be exploited to systematically subvert real-time ASR pipelines.
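The headline number above is a corpus-level Word Error Rate. For readers less familiar with the metric, the following is a minimal sketch of the standard definition (word-level edit distance summed over utterances, divided by the total number of reference words). It is not the authors' evaluation code; the whitespace tokenization and the function names `edit_distance` and `corpus_wer` are assumptions made for illustration, and the paper may apply text normalization that is not shown here.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp_words[:j].
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER over (reference, hypothesis) string pairs.

    Assumption: words are obtained by whitespace splitting.
    """
    total_errors = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("the references contain no words")
    return total_errors / total_ref_words
```

Because errors are pooled before dividing, long utterances weigh more than short ones, which is what distinguishes a corpus-level figure from an average of per-utterance rates.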

2606.06832 2026-06-08 cs.RO 新提交

STRIPS-WM: Learning Grounded Propositional STRIPS-style World Models from Images

STRIPS-WM:从图像学习基于命题的STRIPS风格世界模型

Abhiroop Ajith, Constantinos Chamzas

Affiliations: Worcester Polytechnic Institute

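AI summary: Proposes STRIPS-WM, a framework that learns symbolic world models from image transitions for robot visual task planning and improves planning success rates.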

详情
AI中文摘要

执行长时域视觉操作的机器人观察高维图像,但成功的规划依赖于与动作相关的事实:当前可以做什么以及之后会发生什么变化。有用的规划表示应丢弃无关的视觉细节,同时保留动作的适用性和效果。经典任务规划器通过具有前提条件和效果的符号操作符利用这种结构,但从原始视觉经验中获得此类表示仍然具有挑战性。我们研究了一个视觉任务规划设置,其中机器人仅接收图像转换:当前图像、执行的高级动作以及结果图像。在测试时,给定起始图像和目标图像,机器人必须产生一系列达到目标的高级动作。为了解决这个问题,我们引入了STRIPS-WM,一个直接从视觉转换中学习基于图像的STRIPS风格世界模型的框架。STRIPS-WM首先从图像中诱导出有限的抽象转换图,然后学习潜在二元谓词和每个动作标签的一个基于命题的操作符。学习到的操作符形成一个具有稀疏前提条件和添加/删除效果的符号动作模型。最后,学习到的谓词被蒸馏到视觉编码器中,使得能够直接从新的起始和目标图像进行经典规划。在视觉重排任务上的实验表明,STRIPS-WM在图像到规划的成功率上优于测试的视觉展开、潜在图搜索和潜在符号基线。

英文摘要

Robots performing long-horizon visual manipulation observe high-dimensional images, but successful plans depend on action-relevant facts: what can be done now and what changes afterward. A useful planning representation should discard irrelevant visual details while preserving action applicability and effects. Classical task planners exploit this structure through symbolic operators with preconditions and effects, but obtaining such representations from raw visual experience remains challenging. We study a visual task-planning setting in which a robot receives only image transitions: the current image, executed high-level action, and the resulting image. At test time, given a start image and a goal image, the robot must produce a sequence of high-level actions that reaches the goal. To address this problem, we introduce STRIPS-WM, a framework for learning image-grounded STRIPS-style world models directly from visual transitions. STRIPS-WM first induces a finite abstract transition graph from images, then learns latent binary predicates and one grounded propositional operator per action label. The learned operators form a symbolic action model with sparse preconditions and add/delete effects. Finally, the learned predicates are distilled into a visual encoder, enabling classical planning directly from novel start and goal images. Experiments on visual rearrangement tasks show that STRIPS-WM improves image-to-plan success over the tested visual rollout, latent graph-search and latent-symbolic baselines.
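To make the "preconditions and add/delete effects" vocabulary concrete, here is a small sketch of a grounded propositional STRIPS operator and a breadth-first planner over it. It illustrates only the classical-planning side: in STRIPS-WM the predicates are latent and learned from images, whereas the predicate names, the two operators, and the helper names (`Operator`, `apply_op`, `plan`) below are hand-written assumptions for a toy two-block domain.

```python
from collections import deque
from typing import NamedTuple


class Operator(NamedTuple):
    name: str
    pre: frozenset      # propositions that must hold before the action
    add: frozenset      # propositions made true by the action
    delete: frozenset   # propositions made false by the action


def apply_op(state, op):
    """STRIPS successor: remove the delete effects, then add the add effects."""
    return (state - op.delete) | op.add


def plan(start, goal, operators):
    """Breadth-first search; returns a shortest list of action names, or None."""
    start, goal = frozenset(start), frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal proposition holds
            return path
        for op in operators:
            if op.pre <= state:  # operator is applicable
                successor = apply_op(state, op)
                if successor not in seen:
                    seen.add(successor)
                    frontier.append((successor, path + [op.name]))
    return None


# Hypothetical toy domain: pick up block A and stack it on block B.
OPERATORS = [
    Operator("pick_a",
             pre=frozenset({"a_on_table", "hand_empty"}),
             add=frozenset({"holding_a"}),
             delete=frozenset({"a_on_table", "hand_empty"})),
    Operator("stack_a_on_b",
             pre=frozenset({"holding_a"}),
             add=frozenset({"a_on_b", "hand_empty"}),
             delete=frozenset({"holding_a"})),
]

toy_plan = plan({"a_on_table", "b_on_table", "hand_empty"}, {"a_on_b"}, OPERATORS)
```

In this toy domain `toy_plan` is `["pick_a", "stack_a_on_b"]`. A real planner would typically use heuristic search rather than plain breadth-first search; the sketch keeps BFS so the operator semantics stay visible.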

2606.06829 2026-06-08 cs.RO 新提交

Three-dimensional hydro-cluttered locomotion by an undulatory robot

三维水杂波环境中的波动机器人运动

Tianyu Wang, Matthew Fernandez, Galen Tunnicliffe, Nikolas Cornell, Justin Duong, Donoven Dortilus, Zhaochen J. Xu, Patricia Meza, Sean Lublinsky, Darsh Parikh, Jianfeng Lin, Emily Grace, Daniel I. Goldman

Affiliations: Institute for Robotics and Intelligent Machines, Georgia Institute of Technology; School of Physics, Georgia Institute of Technology; George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology; School of Electrical and Computer Engineering, Georgia Institute of Technology; Department of Mechanical and Industrial Engineering, Northeastern University; Ransom Everglades School

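AI summary: Presents the AquaMILR robot, which uses programmable body compliance and depth regulation to achieve fast, robust forward locomotion in three-dimensional hydro-cluttered environments, with inertia-induced rolling acting as a spontaneous recovery mechanism.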

详情
AI中文摘要

水生机器人扩展了人类进入水下环境的能力,但许多水下空间包含可能干扰开放水域运动的障碍物。在“水杂波”环境中,水与刚性和柔性杂物交织,使得身体与障碍物的接触不可避免。在这些空间中操作需要能够调节和利用接触的机器人,但这一机制仍然难以建模或模拟。基于近期在具有地形适应能力的无肢机器人机械智能方面的进展,我们利用AquaMILR(一种细长无肢机器人)开发了三维水生运动原理,该机器人结合了双侧缆绳驱动、可编程体顺应性、分布式深度调节、耐腐蚀外壳以及用于无系留现场操作的板载电源和电子设备。系统的机器人物理实验表明,可编程体顺应性调节身体变形,并将身体-环境相互作用转化为跨增强水杂波约束强度的快速、鲁棒的前向推进。深度调节提供了三维通道,使机器人能够绕过杂物、从阻塞中恢复,并继续通过原本无法通行的路径。在潜在卡滞场景中,涌现的惯性诱导滚动作为一种自发恢复机制,使机器人摆脱可能导致失败的杂物,无需额外控制即可继续运动。在红树林水生环境中的机器人测试表明,这些原理可转化为实际操作,实现导航和无法进入根区的板载视觉检查。这些结果确立了水杂波运动原理和一种设计范式,其中水生机器人将环境复杂性作为运动资源加以利用。

英文摘要

Aquatic robots have expanded human access to underwater environments, yet many underwater spaces contain obstacles that can disrupt open-water locomotion. In "hydro-cluttered" environments, water is interspersed with rigid and flexible clutter, making body-obstacle contact unavoidable. Operating in these spaces requires robots that can regulate and exploit contact, but this regime remains difficult to model or simulate. Building on recent advances in mechanical intelligence in terradynamically capable limbless robotics, we develop principles for 3D aquatic locomotion using AquaMILR, an elongate limbless robot that combines bilateral cable-driven actuation, programmable body compliance, distributed depth regulation, corrosion-resistant enclosures, and onboard power and electronics for untethered field operation. Systematic robophysical experiments reveal that programmable body compliance regulates body deformation and converts body-environment interactions into fast, robust, forward progression across increasing hydro-clutter constraint strength. Depth regulation provides three-dimensional access, allowing the robot to bypass clutter, recover from obstruction, and continue through otherwise inaccessible routes. In potential jamming scenarios, emergent inertia-induced rolling acts as a spontaneous recovery mechanism, freeing the robot from clutter that would otherwise lead to failure and allowing locomotion to continue without additional control. Tests of the robot in an aquatic mangrove field demonstrate that these principles transfer to practical operation, enabling navigation and onboard visual inspection of inaccessible root zones. These results establish principles for hydro-cluttered locomotion and a design paradigm in which aquatic robots exploit environmental complexity as a locomotor resource.

2606.06828 2026-06-08 cs.CV cs.LG 新提交

AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

AdaGRPO: 一种面向基于流的GRPO的能力感知自适应增强方法

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin

Affiliations: Shanghai Jiao Tong University; S-Lab, Nanyang Technological University; Shanghai AI Laboratory; University of Science and Technology of China; Stanford University; Shanghai Innovation Institute; The Chinese University of Hong Kong; Fudan University; CPII under InnoHK; Adobe Research

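AI summary: Proposes AdaGRPO, which uses an online curriculum filtering strategy and cross-level advantage fusion to address random prompt selection and the lack of a global perspective in advantage estimation in GRPO for flow models, improving training stability and performance.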

详情
Comments
Project Website: https://bujiazi.github.io/adagrpo.github.io/
AI中文摘要

组相对策略优化(GRPO)在将文本到图像(T2I)流模型与人类偏好对齐方面取得了显著成功。然而,我们发现当前基于流的GRPO的学习循环与学习者的当前能力基本脱钩,在提示选择和优势估计方面存在关键盲点:(i)现有方法随机采样提示,忽视了数据选择对强化学习(RL)效能的重大影响——这一因素在大型语言模型的GRPO中被证明至关重要;(ii)它们仅依赖组内统计来评估样本质量,缺乏准确衡量真实策略改进的全局视角。为解决这些问题,我们提出了自适应GRPO(AdaGRPO),一种专为流模型设计的新型能力感知RL算法。具体而言,AdaGRPO由两个主要部分组成:(i)在线课程过滤策略:动态跟踪模型的能力,并自适应选择与其当前学习边界最匹配的提示;(ii)跨层级优势融合:协同整合细粒度组内优势与宏观全局优势,提供全面无偏的策略评估。作为轻量级即插即用模块,AdaGRPO可无缝集成到现有框架如Flow-GRPO、DanceGRPO和Flow-CPS中。大量实验表明,AdaGRPO持续推动性能提升,同时显著稳定流模型的GRPO训练。

英文摘要

Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots at both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy--a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.

2606.06825 2026-06-08 cs.CL cs.AI 新提交

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Progress-SQL: 通过渐进式奖励改进文本到SQL的强化学习

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

Affiliations: East China Normal University

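AI summary: Proposes Progress-SQL, a multi-turn reinforcement learning framework that uses an Oracle-guided Diagnostic Tree (ODT) to produce clause-level structural feedback and combines it with progressive rewards (structural alignment, lexical alignment, a latency reward, and an execution status reward) to improve the accuracy and robustness of Text-to-SQL generation.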

详情
AI中文摘要

强化学习最近在改进大型语言模型进行文本到SQL生成方面显示出潜力,但现有方法通常优化基于单个SQL状态定义的一次性奖励。这种奖励为迭代SQL纠正提供的指导有限,不足以捕捉多轮SQL改进的提升。在本文中,我们提出Progress-SQL,一种具有渐进式奖励的多轮强化学习框架,用于文本到SQL。我们的方法引入Oracle引导诊断树(ODT),它将SQL查询抽象为子句级结构轮廓,并为下一轮改进生成诊断反馈。为了提供密集且稳健的奖励信号,我们将基于ODT的结构对齐与词汇对齐相结合,并定义一个渐进式奖励,衡量从初始SQL到最终SQL的改进。我们进一步加入一个偏好早期正确性的渐进延迟奖励和一个鼓励从无效SQL中恢复的执行状态奖励。在BIRD、Spider和Spider鲁棒性变体上的实验表明,我们的方法在主要评估和鲁棒性评估上均一致提升了文本到SQL的性能。

英文摘要

Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.

2606.06823 2026-06-08 cs.LG cs.AI q-fin.ST 新提交

PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance

PandaAI: 一种用于量化金融中神经符号数据分析与集成决策的实用智能体CQ2

Yuqi Li, Siyuan Liu, Bingjun Liu

Affiliations: Panda AI

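AI summary: To cope with the low signal-to-noise ratio and non-stationarity of financial data, proposes PandaAI, a closed-loop neuro-symbolic LLM agent that combines market regime modeling with constrained alpha generation and achieves risk-aware decision-making through domain fine-tuning and a modular architecture; on CSI 300 data it attains an 18.2% higher Rank IC and a 25.7% lower maximum drawdown.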

详情
AI中文摘要

尽管深度学习在各个领域表现出色,但由于金融数据的低信噪比(SNR)和非平稳性,其在金融序列决策中的应用仍然具有挑战性。利用大型语言模型(LLM)的推理能力,我们提出了PandaAI,一种具有市场状态建模和约束alpha生成的闭环神经符号LLM智能体,它桥接了通用LLM推理与金融严谨性,并抑制了LLM生成输出的金融毒性。为了弥合通用语言能力与金融严谨性之间的差距,我们微调了一个领域特定的LLM。此外,我们将此LLM集成到模块化架构中,形成一个闭环系统。与传统优化孤立预测指标的模型不同,PandaAI被设计为一种神经符号智能体,以明确的风险意识在复杂、真实的金融环境中导航。在沪深300股票数据上的大量实验表明,PandaAI比最先进的时间序列模型实现了18.2%更高的Rank IC和25.7%更低的最大回撤。我们的约束LLM生成和双通道适应方法为LLM在高风险序列决策场景中的部署提供了一种通用范式。

英文摘要

While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to the low Signal-to-Noise Ratio (SNR) and non-stationarity of financial data. Leveraging the reasoning capabilities of Large Language Models (LLMs), we propose PandaAI, a closed-loop neuro-symbolic LLM agent with market regime modeling and constrained alpha generation, which bridges general LLM reasoning with financial rigor and suppresses the financial toxicity of LLM-generated outputs. To bridge the gap between general linguistic capability and financial rigor, we fine-tune a domain-specific LLM. Furthermore, we integrate this LLM into a modular architecture and form a closed-loop system. Unlike traditional models that optimize isolated prediction metrics, PandaAI is designed as a neuro-symbolic agent that navigates the complex, real-world financial environment with explicit risk awareness. Extensive experiments on CSI 300 stock data show that PandaAI achieves an 18.2% higher Rank IC and 25.7% lower maximum drawdown than state-of-the-art time-series models. Our constrained LLM generation and dual-channel adaptation method provide a general paradigm for LLM deployment in high-stakes sequential decision-making scenarios.
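The two reported metrics have conventional definitions that can be sketched briefly: Rank IC is usually the Spearman rank correlation between predicted scores and realized returns across a set of stocks, and maximum drawdown is the largest peak-to-trough loss of an equity curve as a fraction of the peak. The code below follows those common definitions; it is not taken from the paper, and the function names and the average-rank tie handling are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rank_ic(predicted_scores, realized_returns):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted_scores), average_ranks(realized_returns))


def max_drawdown(equity_curve):
    """Largest fractional drop from a running peak; assumes positive equity values."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst
```

In practice Rank IC is computed per trading day over the cross-section of stocks and then averaged over days; the sketch shows a single cross-section.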

2606.06820 2026-06-08 cs.LG cs.AI 新提交

SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling

SCALE: 可扩展的交叉注意力学习与外推方法用于智能体工作流调度

Zhifei Xu, Jierui Lan, Zixuan Liang, Aiji Liang, Jinxi He

Affiliations: Faculty of Arts and Sciences, Beijing Normal University

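AI summary: Proposes the SCALE scheduler, which uses a cross-attention pointer network and Structured Representation Regularization to obtain a deep reinforcement learning workflow scheduler that generalizes to clusters of different sizes without fine-tuning.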

详情
Comments
Submitted to Computer Networks
AI中文摘要

智能体大型语言模型系统将复杂任务分解为工作流有向无环图,其原语必须在异构集群上调度。现有的深度强化学习调度器与固定集群大小绑定,当服务器数量变化时需要重新训练。我们提出SCALE(可扩展的交叉注意力学习与外推),一种无需微调即可泛化到未见过的集群规模的深度强化学习调度器。SCALE采用交叉注意力指针网络,其中任务特征查询服务器特征,因此架构通过构造接受任意数量的服务器。然而,我们观察到仅排列不变架构并不能保证在新规模下的良好性能——随着服务器数量增长,注意力特征经历分布偏移。为了解决这个问题,我们引入结构化表示正则化:一种去相关损失结合朝向标准正态的KL惩罚,使特征统计量无论输入大小都保持稳定。在16个节点上训练并直接在32和48个节点上测试,SCALE在N=48时相对于没有SRR的相同架构将平均响应时间降低了8.9%,确认了显式正则化对于缩小规模泛化差距是必要的。

英文摘要

Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be scheduled on heterogeneous clusters. Existing deep reinforcement learning (DRL) schedulers are tied to a fixed cluster size and require retraining whenever the number of servers changes. We propose SCALE (Scalable Cross-Attention Learning with Extrapolation), a DRL scheduler that generalizes to unseen cluster scales without fine-tuning. SCALE employs a cross-attention pointer network where task features query against server features, so the architecture accepts any number of servers by construction. We observe, however, that a permutation-invariant architecture alone does not guarantee good performance at new scales: the attention feature undergoes distribution shift as the server count grows. To counter this, we introduce Structured Representation Regularization (SRR): a decorrelation loss combined with a KL penalty toward the standard normal, which keeps feature statistics stable regardless of input size. Trained on 16 nodes and tested directly on 32 and 48 nodes, SCALE reduces average response time by 8.9% at N=48 relative to the same architecture without SRR, confirming that explicit regularization is necessary to close the scale-generalization gap.
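The abstract describes SRR only as "a decorrelation loss combined with a KL penalty toward the standard normal". One plausible reading, sketched below, treats each feature dimension as a Gaussian with the batch mean and variance, penalizes its KL divergence from N(0, 1), and adds the squared off-diagonal correlations between dimensions. This is an assumption about the general shape of such a regularizer, not the paper's formulation; the name `srr_penalty` and the two weights are hypothetical.

```python
import math


def srr_penalty(features, decor_weight=1.0, kl_weight=1.0):
    """Return (total, decorrelation, kl) for a batch of feature vectors.

    features: list of n rows, each a list of d floats.
    Assumption: batch statistics stand in for the feature distribution.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    # Population covariance between every pair of feature dimensions.
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    variances = [cov[j][j] for j in range(d)]
    if min(variances) <= 0.0:
        raise ValueError("every feature dimension needs non-zero variance")
    # KL( N(mean, var) || N(0, 1) ), summed over dimensions.
    kl = sum(0.5 * (v + m * m - 1.0 - math.log(v))
             for m, v in zip(means, variances))
    # Squared correlation for each unordered pair of distinct dimensions.
    decor = sum((cov[i][j] / math.sqrt(variances[i] * variances[j])) ** 2
                for i in range(d) for j in range(i + 1, d))
    return decor_weight * decor + kl_weight * kl, decor, kl
```

Both terms are zero when the features are already zero-mean, unit-variance, and uncorrelated, so the penalty pulls feature statistics toward a fixed target no matter how many servers contributed to the batch. In a training loop the same quantities would be computed with a differentiable tensor library rather than plain lists.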

2606.06812 2026-06-08 cs.CL 新提交

Quantifying Media Representation Dynamics Across 25 Years of News Reporting on Policing-related Deaths

量化25年警务相关死亡新闻报道中的媒体表征动态

Farhan Samir, Jappun Dhillon, Meghna Ravikumar, Syed Ishtiaque Ahmed, Vered Shwartz

Affiliations: University of Toronto; University of British Columbia

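AI summary: Analyzes 4,000 Canadian news articles spanning 25 years and proposes the PerspectiveGap model, finding that state-bureaucrat perspectives appear nearly three times as often as perspectives from members of the public, and that civilian representation has increased in recent years.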

详情
Journal ref
Proceedings of the 18th ACM Web Science Conference 2026 (pp. 421-429)
Comments
9 pages, 6 figures. Websci'26
AI中文摘要

我们进行了迄今为止最大规模的加拿大警务相关死亡新闻叙事计算分析,涵盖了过去25年间的4000篇文章。我们开发了一个新颖的计算模型PerspectiveGap,该模型基于先前关于警务媒体表征的社会学研究。我们发现,关于警务相关死亡的报道平均而言,国家官僚视角的出现频率几乎是其他公众成员(包括亲属、社区成员、目击者、代表家庭的律师或公民自由团体)视角的三倍。相当一部分文章完全没有平民行为者的观点,尽管近年来平民代表有所增加。定性分析表明,国家官僚对这些死亡的描述往往是临床和程序性的,而平民话语则带有明显更多的情感色彩。这里开发的PerspectiveGap框架可以适用于其他司法管辖区,提供了一种可扩展的方法来分析媒体系统如何构建关于警务和问责的叙事。

英文摘要

We perform the largest known computational analysis of Canadian news narratives about police-involved deaths, spanning 4,000 articles from the last quarter-century. We develop a novel computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing. We find that reporting on police-involved deaths on average features perspectives from state bureaucrats at nearly three times the rate of perspectives from other members of the public, including relatives, community members, eyewitnesses, lawyers representing the family, or civil liberties groups. A considerable fraction of articles contain no points of view from civilian actors, though civilian representation has increased in recent years. Qualitatively, we find that state bureaucrats' accounts of these deaths tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The PerspectiveGap framework developed here can be contextualized to other jurisdictions, offering a scalable approach for analyzing how media systems construct narratives around policing and accountability.

2606.06806 2026-06-08 cs.SD eess.AS 新提交

Leveraging Soft Distributions of SSL-Derived Discrete Speech Tokens for Downstream Inference

利用SSL导出的离散语音标记的软分布进行下游推理

Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu

Affiliations: The University of Tokyo; National Institute of Advanced Industrial Science and Technology (AIST)

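AI summary: Proposes using soft token assignment at downstream inference time, retaining the training efficiency of hard discretization while increasing expressiveness at inference; the method outperforms hard assignment on ASR and speech synthesis and surpasses continuous SSL features on non-native ASR.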

详情
Comments
Accepted to Interspeech2026
AI中文摘要

从自监督学习(SSL)模型获得的离散语音标记在保持强大性能的同时提供高效的数据压缩,并已广泛用作各种任务中的中间表示。然而,离散化不可避免地导致信息丢失,与连续SSL特征相比性能下降。在这项工作中,我们提出仅在下游推理期间应用软标记分配。这种方法保留了训练期间硬离散化的效率,同时增强了推理时标记的表达力。所提出的方法在ASR和语音合成任务上均优于传统的硬分配,并且对域外数据表现出特别强的泛化能力。对于非母语语音的ASR,它甚至超过了使用连续SSL特征的模型。此外,对所得表示的分析表明,与传统的硬分配相比,它们与音素的对齐更准确。

英文摘要

Discrete speech tokens obtained from self-supervised learning (SSL) models provide efficient data compression while maintaining strong performance, and have been widely used as intermediate representations in various tasks. However, discretization inevitably causes information loss, leading to degraded performance compared with continuous SSL features. In this work, we propose to apply soft token assignment only during downstream inference. This approach preserves the efficiency of hard discretization during training while enhancing the expressiveness of the tokens at inference. The proposed method outperforms conventional hard assignment on both ASR and speech synthesis tasks, and exhibits particularly strong generalizability to out-of-domain data. For ASR of non-native speech, it even surpasses models using continuous SSL features. Moreover, analysis of the resulting representations shows they align more accurately with phonemes compared with conventional hard assignment.
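The contrast between hard and soft token assignment can be illustrated with a toy codebook. The sketch below assumes tokens come from nearest-centroid quantization of SSL features, and that the soft variant is a temperature-scaled softmax over negative squared distances whose weights then mix the downstream token embeddings. These choices, and the names `hard_assign`, `soft_assign`, and `soft_embedding`, are illustrative assumptions; the abstract does not specify how the soft distribution is computed.

```python
import math


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_assign(feature, codebook):
    """Index of the nearest centroid (conventional discretization)."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


def soft_assign(feature, codebook, temperature=1.0):
    """Distribution over tokens: softmax of negative squared distances."""
    logits = [-squared_distance(feature, c) / temperature for c in codebook]
    largest = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - largest) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def soft_embedding(probabilities, embedding_table):
    """Expected token embedding under the soft distribution."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probabilities, embedding_table))
            for d in range(dim)]
```

With a one-hot distribution `soft_embedding` reduces to an ordinary embedding lookup, which is why a model trained on hard tokens can consume soft inputs at inference without retraining. A frame that lies between two centroids keeps weight on both tokens instead of being forced onto one.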

2606.06804 2026-06-08 cs.LG stat.AP 新提交

Interpreting Learning Under Competing Models: Joint and Stepwise Approaches for Dynamic Cognitive Diagnosis

解释竞争模型下的学习:动态认知诊断的联合与逐步方法

Yawen Ma, Sahoko Ishida, Kate Cain, Gabriel Wallin

Affiliations: School of Mathematical Sciences, Lancaster University; Department of Computer Science, University of Oxford; Department of Psychology, Lancaster University

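AI summary: Examines how, when the item-skill structure is unknown, estimating the Q-matrix jointly with the learning process rather than fixing the Q-matrix first changes conclusions about learner development; analyzes reading-game data with dynamic cognitive diagnostic models and finds in simulation that the joint analysis is more reliable when the item-skill structure is uncertain and the item pool changes between grades.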

详情
AI中文摘要

数字学习环境记录学习者对单个项目的反应,使得研究特定技能的发展而非总体分数成为可能。从这些数据中得出关于学习的结论需要一个将反应与潜在技能联系起来的模型,并追踪掌握程度随时间的变化。当每个项目测量的技能未知时,分析者必须决定是联合估计这种结构(Q矩阵)与学习过程,还是先确定它再研究学习。我们表明,这一决定可以改变关于学习者如何发展的实质性结论。使用动态认知诊断模型,我们分析了两个阅读游戏的数据,这些游戏测量了从二年级到三年级的词汇和理解能力,项目文本嵌入为未知的Q矩阵提供了先验信息。联合分析和偏差校正的逐步分析一致认为,大多数学习者朝着掌握两种技能的方向发展,但在三年级时有多少人仍然只部分熟练的问题上存在分歧,从而改变了阅读进展的报告方式。模拟研究确定了两种分析何时出现分歧,并表明当项目-技能结构不确定且项目池在不同年级之间变化时,联合分析更可靠。我们提供了两种分析的R代码。

英文摘要

Digital learning environments record learners' responses to individual items, making it possible to study the development of specific skills rather than overall scores. Drawing conclusions about learning from these data requires a model that links responses to latent skills and tracks how mastery changes over time. When the skills measured by each item are unknown, the analyst must decide whether to estimate this structure, the Q-matrix, jointly with the learning process, or to establish it first and study learning afterwards. We show that this decision can change substantive conclusions about how learners develop. Using dynamic cognitive diagnostic models, we analyse data from two reading games measuring vocabulary and comprehension from Grade 2 to Grade 3, with item-text embeddings providing prior information for the unknown Q-matrix. A joint analysis and a bias-corrected stepwise analysis agree that most learners move toward mastering both skills, but disagree about how many remain only partially proficient at Grade 3, changing how reading progress would be reported. A simulation study identifies when the two analyses diverge and shows that joint analysis is more reliable when the item-skill structure is uncertain and the item pool changes between grades. We provide R code for both analyses.

2606.06797 2026-06-08 cs.CL 新提交

Korean Culture into LLM Alignment: Toward Cultural Coherence

将韩国文化融入大语言模型对齐:迈向文化一致性

MinJae Jung, Minwoo Kim

Affiliations: SKT; LG AI Research; Kanana Team

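AI summary: For cultural alignment of large language models, argues for a constructive definition rather than only suppressing negative outputs; designs a prompt-based seed generator that expands a Korean harm taxonomy, formulates a safe-response policy grounded in Korean law, social norms, and interpretive conventions, and uses DPO fine-tuning to raise the Korean cultural safe rate without large degradation of general capability.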

详情
Comments
Accepted to ICML 2026 Workshop on Culture X AI
AI中文摘要

大语言模型的文化方面工作主要集中于负面目标:抑制哪些输出。我们认为还需要一个建设性的对应部分,即文化一致性响应的操作性定义,而不仅仅是它必须避免什么,并针对韩国进行了实例化。我们设计了一个围绕基于提示的LLM种子生成器的对齐数据流水线,该生成器扩展了韩国危害分类,其核心是韩国文化适应的安全响应策略:一个基于韩国法律框架、社会规范和解释惯例的逐类别指南,三个前沿模型各自根据该指南生成候选响应。对所得三元组进行DPO微调提高了六个开源LLM的韩国文化安全率,同时未导致韩国通用能力基准的大幅下降,定性输出显示微调模型能够引用韩国法规和机构程序,并在适当时提供建设性的韩国背景信息以及拒绝回答。

英文摘要

Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is rather than only what it must avoid, and instantiate it for Korean. We design an alignment-data pipeline around a prompt-based LLM seed generator that expands a Korean harm taxonomy, with a Korean-culturally-adapted safe-response policy at its centre: a per-category guideline grounded in Korean legal frameworks, social norms, and interpretive conventions, against which three frontier models each produce a candidate response. DPO fine-tuning on the resulting triplets improves the Korean cultural safe rate across six open-weight LLMs while causing no large degradation on Korean general-capability benchmarks, and qualitative outputs show fine-tuned models naming Korean statutes and institutional procedures and, where appropriate, supplying constructive Korean-context information alongside refusal.

2606.06794 2026-06-08 cs.CL cs.IR 新提交

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

TA-RAG: 面向同伴支持健康沟通的语气感知检索增强生成

Yong-Bin Kang, Anthony McCosker

Affiliations: Swinburne University of Technology

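AI summary: Proposes the TA-RAG framework, which embeds tone control (stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing) into a RAG pipeline through lightweight prompting, without model fine-tuning, to improve the quality of sensitive health communication.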

详情
Comments
5 pages, 5 figures, CIKM 2026 submission manuscript
AI中文摘要

检索增强生成(RAG)成功地将大型语言模型(LLM)的输出建立在可信文档上,但仅靠事实依据不足以支持敏感的同伴健康沟通。在HIV同伴支持等领域,回复还必须易于理解、无污名化、富有同理心并针对接收者定制。本文提出TA-RAG,一个轻量级的、基于提示的语气感知RAG框架,它将明确的语气控制嵌入到RAG管道中,无需模型微调。我们通过四个核心组件来操作化语气:无污名化改写、可读性调整、接收者适应和同理心重述。我们使用来自澳大利亚HIV在线学习(HOLA)、UNAIDS术语指南、可读性指标、澳大利亚HIV感染者协会(NAPWHA)的同伴支持标准以及公共同理心数据集的问题,通过组件级测试评估TA-RAG。结果表明,TA-RAG的组件在保留关键内容的同时,提高了其目标沟通质量。这些发现强调,基于提示的语气控制是使RAG输出适用于敏感同伴支持健康沟通的一个潜在方向。

英文摘要

Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV peer support, responses must also be accessible, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without requiring model fine-tuning. We operationalise tone across four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing. We evaluate TA-RAG through component-level tests using questions derived from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from the National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. Results show that TA-RAG's components improve their targeted communication quality while preserving key content. These findings emphasise that prompt-based tone control is a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

2606.06790 2026-06-08 cs.RO cs.LG cs.SY eess.SY 新提交

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

学习具有主动铰接悬挂的行星探测车的全地形运动

Arthur Bouton, Tristan D. Hasseler, Michael Paton, Travis Brown, Jacob Levy, William Reid, Joshua Martin, Hari Nayar

发表机构 * Jet Propulsion Laboratory, California Institute of Technology(喷气推进实验室,加州理工学院) Center for Autonomy, University of Texas at Austin(自主性中心,德克萨斯大学奥斯汀分校) Space Systems Laboratory, University of Maryland(空间系统实验室,马里兰大学)

AI总结 提出一种带有主动万向悬挂的四轮行星探测车概念,利用强化学习训练单一神经网络控制器,实现自主越障和全地形运动,通过策略整合和零样本迁移在物理车上验证。

详情
Comments
21 pages, 26 figures
AI中文摘要

本文介绍了ERNEST,一种四轮行星探测车概念,配备了两自由度主动万向悬挂系统,结合偏航和滚转驱动,实现车轮重构、转向和主动载荷再分配。一个单一的神经网络控制器,经过训练以在挑战性地形上跟踪期望路径,完全释放了这种驱动悬挂系统在自主越障中的能力。利用高保真DARTS仿真引擎开发了强化学习框架,该引擎结合了刚体接触动力学和Bekker-Wong地面力学,使得能够出现适应松散土壤条件的运动策略。为了在异质地形上获得单一统一控制器,一种策略整合策略将地形专业化智能体的经验合并到一个神经网络中,消除了对显式地形分类和控制器切换的需求。得到的控制器结合了本体感觉和外感觉反馈,包括稀疏立体视觉导出的地形高程、底盘姿态、关节状态和力-扭矩测量。通过领域随机化、传感器噪声注入和模型到真实(model-to-real)的系统辨识,实现了到物理车的零样本迁移。实验结果表明,该控制器能够自主穿越岩石场、凸起陷阱、轮高台阶、沙波纹和沙坡。在20°沙坡上,尽管增加了驱动,学习到的控制器在干沙上降低了37%的运输成本,并在湿沙上实现了优越的性能,而被动悬挂在湿沙上完全无法移动。

英文摘要

This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a bump trap, a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized.
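
For readers unfamiliar with the 37% figure, cost of transport is commonly defined as the energy spent divided by the product of weight and distance travelled. The sketch below uses that standard dimensionless definition; the abstract does not say which variant the authors use, and the energy, mass, and distance values are made-up placeholders, not results from the paper.

```python
G_EARTH = 9.81  # standard gravity in m/s^2 (assumption: Earth-based testing)


def cost_of_transport(
    energy_j: float, mass_kg: float, distance_m: float, g: float = G_EARTH
) -> float:
    """Dimensionless cost of transport: energy / (mass * g * distance)."""
    if mass_kg <= 0 or distance_m <= 0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * g * distance_m)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical numbers chosen only to illustrate a 37% reduction.
passive = cost_of_transport(energy_j=1000.0, mass_kg=50.0, distance_m=2.0)
active = cost_of_transport(energy_j=630.0, mass_kg=50.0, distance_m=2.0)
reduction = relative_reduction(passive, active)
```

With mass and distance held equal, the reduction depends only on the energy ratio, which is why an actively actuated suspension can still come out ahead despite the extra actuator power.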

2606.06788 2026-06-08 cs.CL cs.HC 新提交

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

像对五岁小孩一样解释或随我选择:评估语言模型响应的交互潜力

Indu Panigrahi, Tal August

发表机构 * Siebel School of Computing and Data Science(计算与数据科学学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出基于语言复杂度的交互评估框架,测试GPT-5.1等模型生成不同复杂度响应的能力,发现最佳模型仅46%时间正确调整复杂度。

详情
Comments
Preprint
AI中文摘要

在科学信息寻求任务中对大型语言模型(LLMs)的评估日益以使用为中心,例如与真实用户进行实时或多轮评估。这些评估仍然假设单一的静态聊天界面,但随着模型被集成到新界面中,评估必须转向纳入特定于界面的标准。我们基于一项有16名参与者的形成性研究,提出了一个新的评估框架,该框架测试模型生成对同一查询的多个响应的能力,这些响应沿语言的可解释轴(语言复杂度)变化,灵感来自以人为本的设计文献中的直接操作界面。我们评估了GPT-5.1、GPT-5 mini、Claude Sonnet 4.5 + Thinking和DeepSeek-V3.1,为98个科学查询生成了5个不同语言复杂度级别的响应。虽然模型在不同响应之间变化复杂度,但大多数变化仍然不一致,表现最佳的模型(Claude Sonnet 4.5)仅46%的时间在正确方向上移动了可靠的复杂度度量。我们的发现在增加样本量和替代复杂度级别时仍然成立。

英文摘要

Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with 16 participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction 46% of the time. Our findings hold with increased sample size and alternative complexity levels.
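
One plausible reading of "shifting complexity measures in the correct direction" is sketched below. The abstract does not define the metric, so this assumes a single complexity score where higher means more complex, responses ordered from the simplest requested level to the most complex, and a strict increase between adjacent levels counting as correct. The function name and the example scores are hypothetical.

```python
from typing import Sequence


def correct_direction_rate(scores: Sequence[float]) -> float:
    """Fraction of adjacent level pairs whose score strictly increases.

    `scores[i]` is the complexity score of the response requested at level i,
    with level 0 the simplest. Ties count as incorrect (an assumption).
    """
    if len(scores) < 2:
        raise ValueError("need at least two levels")
    pairs = list(zip(scores, scores[1:]))
    correct = sum(1 for lower, higher in pairs if higher > lower)
    return correct / len(pairs)


# Hypothetical scores for 5 responses to one query, simplest level first.
example_scores = [4.0, 6.5, 6.0, 9.0, 9.0]
rate = correct_direction_rate(example_scores)
```

Averaging this rate over all queries and over several complexity measures would give a single per-model percentage comparable in spirit to the 46% reported above, though the paper's exact aggregation may differ.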