arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28745 2026-05-28 cs.CL

Stance Detection in Prediction Markets: Addressing Imbalanced Trader Commentary via Counterfactual Augmentation and Market Context

Thomas Mbrice

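AI summary: To address the extreme class imbalance in stance detection on prediction market commentary, the paper combines market context with LLM-driven counterfactual augmentation, significantly improving recall and F1 for the minority (Anti) class.

Comments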
14 pages, 9 figures
Abstract

Prediction markets such as Polymarket aggregate crowd beliefs into real-time probability estimates, and the comments traders post beneath each market contain rich directional stance signals that prices alone cannot capture. This work introduces the first stance detection study applied to prediction market commentary, a domain characterized by extreme brevity, trader-specific vernacular, and severe class imbalance (only 8.7% of comments oppose the market outcome). RoBERTa-base is fine-tuned across a 4 x 3 ablation: four input configurations ({2-class, 3-class} x {with/without market context}) and three augmentation conditions (baseline, 50% synthetic, 100% synthetic). Synthetic minority-class samples are generated via LLM-driven Pro -> Anti counterfactual flips using the Anthropic API. Results show that (1) market context is the single most impactful factor, raising 3-class Anti recall from 0.10 to 0.45; (2) counterfactual augmentation is conditionally effective, improving Anti F1 in weak configurations (0.10 -> 0.24) while degrading strong ones (2-class-ctx macro F1: 0.68 -> 0.50 at full dose); and (3) 50% augmentation is the optimal dose, with 100% consistently hurting performance. Attention-based interpretability analysis provides mechanistic support for all three findings.
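
The abstract reports Anti recall, Anti F1, and macro F1 rather than accuracy, which matters under severe class imbalance. The following is a minimal sketch of those metrics in plain Python; the toy labels and the helper names `f1_for_label` and `macro_f1` are illustrative assumptions, not the paper's evaluation code.

```python
def f1_for_label(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    return sum(f1_for_label(y_true, y_pred, lab)[2] for lab in labels) / len(labels)


# Toy data (hypothetical): a classifier that always predicts the majority class.
y_true = ["Pro"] * 9 + ["Anti"]
y_pred = ["Pro"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
anti_recall = f1_for_label(y_true, y_pred, "Anti")[1]
score = macro_f1(y_true, y_pred, ["Pro", "Anti"])
```

In this toy case accuracy looks high while Anti recall is zero and macro F1 is low, which is why per-class and macro-averaged scores are the more informative numbers for a rare class.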

2605.28741 2026-05-28 cs.CV

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

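AI summary: Proposes SeProD, a self-prophetic decoding framework that uses the intrinsic single-step capabilities of the pre-trained model to strengthen coherent multi-step visual search reasoning in LVLMs in a training-free, plug-and-play manner, with consistent gains on all 12 splits of 4 benchmarks.

Comments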
Accepted at ICML 2026
Abstract

Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.

2605.28740 2026-05-28 cs.CL cs.AI

Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

AI summary: Proposes Reverse Probing, a framework that extracts token-level uncertainty signals from a model's internal activations using pre-existing labeled summaries, enabling efficient and interpretable uncertainty quantification in clinical text.

Abstract

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.

2605.28739 2026-05-28 cs.LG cs.AI cs.NE q-bio.QM

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tirtharaj Dash

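AI summary: Proposes BIRDNet, which mines Boolean implication relationships between features and encodes them as a sparse, interpretable neural network, sharply reducing parameter count while retaining high accuracy and recovering known biological signatures in transcriptomic and proteomic data.

Comments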
5 pages; 1 figure, 4 tables
Abstract

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.
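
The sparsity claim follows from each hidden unit binding to only its two features: a unit has 2 active weights out of $d$ possible, so the active fraction of a layer is at most $2/d$. The following is a minimal sketch of such a connectivity mask; the rule list and the function names `bir_mask` and `active_fraction` are hypothetical and are not taken from the BIRDNet repository.

```python
def bir_mask(rules, d):
    """Build a (num_rules x d) 0/1 mask; row r is active only on rule r's two features."""
    mask = [[0] * d for _ in rules]
    for r, (i, j) in enumerate(rules):
        mask[r][i] = 1
        mask[r][j] = 1
    return mask


def active_fraction(mask):
    """Share of weight positions that the mask leaves active."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Hypothetical mined rules over d = 8 features, each a pair of feature indices.
rules = [(0, 1), (2, 3), (0, 3)]
mask = bir_mask(rules, d=8)
fraction = active_fraction(mask)  # equals 2 / 8 when every rule uses two distinct features
```

A dense layer of the same shape would use all `len(rules) * d` weights, which is where the parameter savings reported in the abstract would come from under this reading.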

2605.28736 2026-05-28 cs.RO

Imitation Learning for Robot Assistance in Open Surgery: A Multi-Policy Evaluation on Suture Following

Xucheng Wang, Zhizhou Yang, Xiaoman Zhang, Sung Eun Kim, Romain Hardy, Pranav Rajpurkar

AI summary: Presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, using suture following (the grab-pull-release motion an assistant performs at every stitch) as the task. Comparing four policies (ACT, Diffusion Policy, SmolVLA, π₀) across 28 trained models, the study finds that π₀ is best in data efficiency, background robustness, and trajectory smoothness, and that it reaches a 92% stitch completion rate in a suturing trial.

Abstract

This study presents the first evaluation of general-purpose imitation learning for surgeon-robot collaborative assistance in open surgery, targeting suture following: the grab-pull-release motion an assistant performs at every stitch. We collect 160 teleoperated demonstrations (32,374 frames) on an open-source robot arm and benchmark four architecturally diverse imitation learning policies (ACT, Diffusion Policy, SmolVLA, $π_0$) across 28 trained models, evaluated in 32 configurations along three clinically motivated dimensions: dataset size, camera viewpoint, and background variation. Our results demonstrate that under ideal conditions, the four policies achieve 50-75% task success, with depth error as the dominant failure mode across all architectures. Among all policies, $π_0$ achieves the strongest results with a pretrained vision-language backbone, demonstrating superior data efficiency, greater robustness to background variation, and smoother trajectories compatible with surgical workflow. When deployed in a surgeon-robot suturing trial, $π_0$ yields a 92% stitch completion rate. These findings establish collaborative robotic assistance in open surgery as a feasible target for imitation learning and highlight depth perception and end-effector design as key priorities for clinical translation.

2605.28735 2026-05-28 cs.CV

SeeGroup: Multi-Layer Depth Estimation of Transparent Surfaces via Self-Determined Grouping

Hongyu Wen, Jia Deng

AI summary: Proposes SeeGroup, which models multi-layer depth as a point process and uses a permutation-invariant loss to enable adaptive grouping, significantly improving multi-layer depth estimation accuracy for transparent surfaces.

Abstract

Transparent objects are common in daily life, and it is important to understand their multi-layer depth, including the transparent surface and the objects behind it. Existing methods for multi-layer depth typically extend single-layer prediction. They define layers by the front-to-back ordering of 3D points and predict the layers sequentially. However, as layered geometry can admit multiple valid groupings of 3D points into layers, a predefined grouping strategy is inherently restrictive. In this work, we propose SeeGroup, a multi-layer depth estimation method that avoids imposing a predefined grouping and allows the model itself to adaptively assign surfaces to depth maps. We formulate per-pixel multi-layer depth as a point process, treating depth layers as unordered events along each camera ray. This induces a permutation-invariant likelihood over the observed depth layers, yielding a loss that naturally supports arbitrary layer groupings. Experiments demonstrate that our method significantly advances the state of the art in multi-layer depth estimation, improving quadruplet relative depth accuracy on the LayeredDepth benchmark from 61.34% to 70.09%. Code is available at https://github.com/princeton-vl/SeeGroup.

2605.28734 2026-05-28 cs.CR cs.CL cs.LG

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

Richard J. Young, Gregory D. Moody

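AI summary: Builds a prompt bank labeled by five-judge consensus (4,748 executable malicious-code requests and 1,923 harmful security-knowledge requests), providing a reliable, cross-corpus comparable benchmark for measuring how coding models refuse malicious-code requests.

Comments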
21 pages, 9 figures, 5 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) under a five-judge panel; 6,675 prompts, 33,375 classification calls, Fleiss' kappa = 0.767
Abstract

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.
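
The panel agreement above is reported as Fleiss' kappa, which compares observed per-item agreement among a fixed number of raters with the agreement expected from the category marginals. The following is a minimal sketch of the standard formula in plain Python; the toy ratings are invented for illustration and are not drawn from the prompt bank.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    Note: the result is undefined (division by zero) if every rating
    falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs per item, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Expected agreement from the overall category proportions.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_exp = sum(p * p for p in p_cat)

    return (p_bar - p_exp) / (1 - p_exp)


# Toy example (hypothetical labels): five judges, three prompts.
toy = [
    ["CODE"] * 5,
    ["KNOWLEDGE"] * 5,
    ["CODE", "CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE"],
]
kappa = fleiss_kappa(toy, ["CODE", "KNOWLEDGE"])
total_calls = 6675 * 5  # prompts x judges, as stated in the abstract
```

Unanimous panels push kappa toward 1, while split votes such as the third toy prompt pull it down.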

2605.28733 2026-05-28 cs.AI

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Xiaohang Feng, Yiling Xie

AI summary: Proposes a utility-aware multimodal contrastive learning framework that introduces a Utility-Aware InfoNCE loss to optimize product image generation, so that images stay semantically aligned while increasing market demand.

Abstract

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.

2605.28732 2026-05-28 cs.CL cs.AI cs.LG

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

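AI summary: Proposes MemTrace, a framework that builds executable memory evolution graphs for fine-grained error tracing, uses an automatic attribution method to locate root causes, and then optimizes prompts to improve downstream task performance.

Comments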
Ongoing work
Abstract

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.

2605.28730 2026-05-28 cs.AI

AlphaTransit: Learning to Design City-scale Transit Routes

Bibek Poudel, Sai Swaminathan, Weizi Li

AI summary: To address delayed feedback in transit route design, proposes AlphaTransit, a framework that couples Monte Carlo Tree Search with a neural policy-value network and attains the highest service rate on the Bloomington benchmark.

Abstract

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available at https://github.com/poudel-bibek/AlphaTransit.
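
The abstract does not state which selection rule the search uses. One common way to combine a policy prior with value estimates in MCTS is the AlphaZero-style PUCT score, sketched below as an assumption rather than as AlphaTransit's actual rule. The dictionary fields `P`, `N`, `W`, the constant `c_puct`, and the `+ 1` inside the square root (added so unvisited nodes still get an exploration bonus) are all illustrative choices.

```python
import math


def puct_select(children, c_puct=1.5):
    """Return the index of the child with the highest PUCT score.

    Each child is a dict with:
      P: prior probability from the policy network
      N: visit count
      W: sum of value estimates backed up through this child
    """
    total_visits = sum(ch["N"] for ch in children)

    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0  # mean value so far
        u = c_puct * ch["P"] * math.sqrt(total_visits + 1) / (1 + ch["N"])  # exploration bonus
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))


# Hypothetical route-extension candidates at one decision point.
fresh = [{"P": 0.2, "N": 0, "W": 0.0}, {"P": 0.8, "N": 0, "W": 0.0}]
explored = [{"P": 0.8, "N": 100, "W": 10.0}, {"P": 0.2, "N": 0, "W": 0.0}]

first_pick = puct_select(fresh)      # no visits yet: the prior decides
later_pick = puct_select(explored)   # a heavily visited, low-value child loses to an unvisited one
```

Under this rule the value network replaces simulator rollouts as the source of `W`, which matches the abstract's description of lookahead without rollouts inside the tree.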

2605.28729 2026-05-28 stat.ML cs.LG

Beyond Lipschitz: Data-Driven Robustness via Discrete Modulus of Continuity

Jürgen Dölz, Michael Multerer, Michele Palma

AI summary: Proposes a data-driven robustness framework based on the discrete modulus of continuity (DMOC), a nonlinear generalization of Lipschitz continuity, together with a scalable minibatch algorithm, enabling fine-grained robustness assessment relative to the data distribution.

Abstract

Robustness of neural networks is commonly quantified via local or global Lipschitz constants. However, Lipschitz continuity can be overly coarse or overly restrictive as a global robustness measure, failing to capture nuanced, data-dependent behavior. We propose a data-driven, architecture-agnostic framework based on the discrete modulus of continuity (DMOC), a nonlinear generalization of Lipschitz continuity that provides a finer notion of robustness. Unlike many existing approaches, DMOC does not require access to model internals and instead evaluates regularity relative to the data distribution. This shifts the focus from the model to the data, which provide a data-driven baseline of regularity against which the network's robustness is assessed. We establish convergence results for DMOC-induced seminorms with explicit data-driven rates in terms of the separation distance, and introduce a scalable minibatch algorithm that reduces the quadratic cost of exact computation, enabling application to large-scale data sets such as ImageNet. Empirically, DMOC serves as an architecture-independent diagnostic: it distinguishes trained from untrained networks, reveals underfitting and overfitting regimes, and yields, as a special case, tight Lipschitz estimates comparable to state-of-the-art methods such as ECLipsE and ECLipsE-fast.
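
The abstract does not spell out the DMOC definition. A plausible reading, assumed here, is the classical modulus of continuity restricted to sample pairs: the largest output gap among pairs whose inputs lie within a distance `delta`. The sketch below computes that quantity exactly over all pairs, which has the quadratic cost the abstract mentions, and also an empirical Lipschitz ratio for comparison. The function names and toy data are hypothetical.

```python
import math
from itertools import combinations


def discrete_modulus(xs, ys, delta):
    """Largest output distance among sample pairs whose inputs are within delta."""
    gaps = [
        math.dist(ys[i], ys[j])
        for i, j in combinations(range(len(xs)), 2)
        if math.dist(xs[i], xs[j]) <= delta
    ]
    return max(gaps, default=0.0)


def empirical_lipschitz(xs, ys):
    """Largest output/input distance ratio over pairs with distinct inputs."""
    ratios = [
        math.dist(ys[i], ys[j]) / math.dist(xs[i], xs[j])
        for i, j in combinations(range(len(xs)), 2)
        if math.dist(xs[i], xs[j]) > 0
    ]
    return max(ratios, default=0.0)


# Toy black-box model f(x) = x^2 sampled at four points (only inputs and outputs are used).
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [(x[0] ** 2,) for x in xs]

omega_1 = discrete_modulus(xs, ys, delta=1.0)
omega_half = discrete_modulus(xs, ys, delta=0.5)
lip = empirical_lipschitz(xs, ys)
```

Only model inputs and outputs are needed, consistent with the abstract's point that the approach does not require access to model internals.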

2605.28726 2026-05-28 cs.RO cs.LG

How VLAs Fail Differently: Black-Box Action Monitoring Reveals Architecture-Specific Failure Signatures

Krishnam Gupta

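AI summary: Using black-box action monitoring, the paper finds that vision-language-action (VLA) architectures fail in fundamentally different, predictable ways at the motor-command level, and shows that architecture-matched monitor selection is essential.

Comments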
Accepted at IEEE ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots", Vienna, June 2026. Non-archival workshop. 5 pages, 2 figures, 22 references
Abstract

We discover that VLA architectures fail in fundamentally different, predictable ways at the motor-command level. Running VQ-BeT, Diffusion Policy, and ACT on identical evaluation protocols (n=450 episodes across PushT and ALOHA 14-DOF bimanual manipulation), we find: (1) direction reversal rate is a universal failure predictor across all three architectures (AUROC=0.93, 0.79, 0.91; p<0.001); (2) jerk monitoring is predictive only for discrete-token architectures, following a discrete-to-continuous gradient (0.88, 0.69, 0.41); (3) velocity violations alone are non-predictive everywhere (AUROC 0.41-0.69), yet velocity checking is the most common safety mechanism in VLA deployment code; and (4) for continuous-family VLAs, velocity monitoring provides effectively zero predictive signal (AUROC=0.52 on ACT, 0.41 on Diffusion), proving that architecture-matched monitor selection is essential. These results quantify a monitoring consequence of the well-known discrete/continuous VLA distinction: the two families produce qualitatively different failure signatures that require different monitors. No single monitor works universally; architecture-matched selection is required. This finding was enabled by SafeContract, a training-free, black-box action monitoring toolkit with conformal calibration. Code: https://github.com/krishnam94/vla-edge
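
The abstract does not define direction reversal rate precisely. One plausible black-box formulation, assumed here, is the fraction of consecutive action-displacement pairs that point in opposing directions (negative dot product). The sketch below pairs that monitor with a rank-based AUROC over per-episode scores; it does not use the SafeContract API, and all names and data are hypothetical.

```python
def direction_reversal_rate(actions):
    """Fraction of consecutive displacement pairs with a negative dot product."""
    deltas = [[b - a for a, b in zip(p, q)] for p, q in zip(actions, actions[1:])]
    pairs = list(zip(deltas, deltas[1:]))
    if not pairs:
        return 0.0
    reversals = sum(
        1 for d0, d1 in pairs if sum(u * v for u, v in zip(d0, d1)) < 0
    )
    return reversals / len(pairs)


def auroc(scores, failed):
    """Probability that a failed episode scores higher than a successful one (ties count half)."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy 1-D action sequences (hypothetical): one smooth, one oscillating.
smooth = [(0.0,), (1.0,), (2.0,), (3.0,)]
jittery = [(0.0,), (1.0,), (0.0,), (1.0,)]

smooth_rate = direction_reversal_rate(smooth)
jittery_rate = direction_reversal_rate(jittery)
toy_auroc = auroc([0.9, 0.1, 0.5, 0.5], [True, False, True, False])
```

An AUROC near 0.5, as the abstract reports for velocity checks on continuous-family policies, means the monitor's score ranks failed and successful episodes no better than chance.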

2605.28722 2026-05-28 cs.AI

Multi-Adapter Representation Interventions via Energy Calibration

Manjiang Yu, Hongji Li, Junwei Chen, Xue Li, Priyanka Singh, Yang Cao, Lijie Hu

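AI summary: Proposes MARI, which uses a competitive multi-adapter mechanism and an energy-based gating module to adaptively determine intervention direction and strength, improving alignment performance while preserving general capabilities.

Comments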
Accepted by ICML 2026
Abstract

Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi-Adapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at https://github.com/V1centNevwake/MARI.

2605.28721 2026-05-28 cs.AI

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu

AI summary: Using diagnostic studies, the paper finds that LLM-based search agents exhibit Intrinsic Knowledge Dependence (IKD), relying on the model's internal knowledge rather than external evidence, and introduces the LiveBrowseComp benchmark to evaluate deep-search ability beyond intrinsic knowledge coverage.

Abstract

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.

2605.28717 2026-05-28 cs.AI cs.AR cs.NI

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Bojie Li

AI summary: Targeting the RDMA bottleneck at the datacenter network interface, OpenURMA implements Huawei's UB protocol specification at three tiers (RTL, SystemC, and gem5) and shows that, on a 64-byte remote fetch, UB achieves 4.37x lower latency and 2.80x higher throughput than RoCEv2 RC.

Abstract

Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand. Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon. OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.

2605.28714 2026-05-28 cs.CL cs.AI

IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents

IPO-Mine:用于长多模态IPO文档的章节结构化分析的工具包和数据集

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel, Liqin Ye, Arnav Hiray, Rutwik Routu, Prasun Banerjee, Siddhartha Somani, Sudheer Chava

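AI summary: This paper presents the IPO-Mine toolkit and dataset, which parse IPO filings into standardized section-structured text and images, build a large-scale multimodal dataset, and define chart evaluation tasks, revealing alignment challenges for multimodal models in long-document analysis.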

详情
Comments
12 pages
AI中文摘要

首次公开募股(IPO)文件是私营公司上市时发布的文件,允许个人(散户)投资者购买其股票。这些文件描述了公司的业务、财务状况和风险,是包含叙述性文本和图像的长篇多模态文档。尽管它们对金融市场至关重要,但目前缺乏用于使用现代语言和多模态模型研究IPO文件的大规模标准化数据集或基准。这些文档带来了重大挑战:文件通常超过50万词,且缺乏一致的结构组织。我们引入了IPO-Toolkit,这是一个开源框架,用于下载和解析IPO文件,将其标准化为章节结构化文本和提取的图像。该工具包分割文件、提取嵌入的图像,并生成结构化输出,从而支持对长多模态文档进行大规模、可重复的分析工作流。利用这一基础设施,我们构建了IPO-Dataset,这是一个大规模、章节结构化的多模态数据集,涵盖1994年至2026年超过109,000份IPO文件及其修订版,包含超过76,000张图像。我们针对提取的金融图表建立了结构化评估任务,包括图表质量和误导性评估。我们的实验表明,最先进的多模态模型在这些任务上常常与专家人类判断存在分歧,揭示了在长篇幅真实监管文档上进行多模态推理时的对齐挑战。除了基准测试,IPO-Dataset还支持对章节级文本变异以及视觉和文本披露实践的跨行业差异进行大规模分析。我们的代码、数据集和网站根据CC-BY-4.0公开提供。

英文摘要

An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.

2605.28713 2026-05-28 cs.AI

Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

思维即压缩:你的推理模型其实是一个上下文压缩器

Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang, Zhiyuan Sun, Daiting Shi

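AI summary: This paper proposes the Thinking as Compression (TaC) paradigm, which uses a reasoning model's own thinking traces as compressed context, and achieves controllable compression through reward-driven optimization (TaC-C), significantly outperforming existing methods on long-context QA tasks.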

详情
Comments
Under Review
AI中文摘要

上下文压缩旨在缩短长上下文输入,同时最小化信息损失,以加速LLM推理。现有方法虽有前景,但通常依赖复杂的压缩模块或针对压缩的训练,忽视了LLM的内在能力。相比之下,本文揭示推理模型本身可以通过组织任务相关信息自然地压缩长上下文。因此,我们提出思维即压缩(TaC),一种将思维本身视为压缩上下文的新压缩范式。无需专用压缩器,TaC直接提示推理模型生成思维痕迹作为缩短的上下文,已优于大多数代表性压缩方法。进一步,鉴于原始思维输出可能难以控制预算和存在捷径行为,我们引入带约束的思维即压缩(TaC-C),利用简单的奖励驱动优化框架,激发内在思维成为紧凑且可控的压缩上下文。在四个长上下文QA基准上的实验表明,TaC-C一致优于现有基线。在4倍和8倍压缩比下,它在平均F1上分别超过最强竞争对手17.4%和23.4%,在平均精确匹配分数(EM)上分别超过15.7%和21.7%。

英文摘要

Context compression aims to shorten long context inputs with minimal information loss in order to accelerate LLM inference. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, and already outperforms most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), which leverages a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

2605.28710 2026-05-28 cs.CL cs.AI

Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study

迈向可靠的多语言LLM作为评判者:一项实证研究

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

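AI summary: By analyzing instruction translation, monolingual versus multilingual supervision, and model size, this study examines how to develop multilingual LLM judges with and without in-domain data, and identifies key trade-offs: fine-tuned small models can match proprietary models when in-domain data is available, whereas zero-shot large models are more effective out of domain.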

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于生成文本的自动评估,然而大多数先前工作集中在英语上。尽管对多语言评估的需求日益增长,将基于LLM的评估器扩展到多语言环境仍然具有挑战性,特别是对于低资源语言和领域内数据稀缺的场景。本文探索了开发多语言LLM评判者的几种策略,考虑了是否有领域内数据可用于微调。我们系统分析了英语、西班牙语和巴斯克语(代表高、中、低资源语言),考虑了指令翻译、单语与多语言监督以及模型规模。为了评估,我们将两个现有的元评估数据集扩展到巴斯克语和西班牙语。我们的结果揭示了关键的权衡:当领域内数据可用时,微调的小模型可以达到与专有模型相当的性能,而在域外设置中,使用较大模型的零样本评估更为有效。我们还观察到,在域外数据上进行微调可能会对模型性能产生不利影响。这些发现为构建高效、可靠的多语言评估流程提供了实用指导。数据和代码公开在hitz-zentroa/mJudge。

英文摘要

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.

2605.28707 2026-05-28 cs.AI cs.LG

Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI

超越二元道德判断:在AI中建模伦理多元主义

Aisha Aijaz, Rahul Goel, Arnav Batra, Raghava Mutharaju

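AI summary: Proposes a framework that models moral reasoning as a distribution over normative ethical theories (ethical pluralism), implemented with a two-stream normative-semantic architecture and stacked ensemble learning, reaching 88.89% accuracy on 450 cases.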

详情
AI中文摘要

在社会关键领域的决策中,AI系统正以不同能力越来越多地参与。然而,尽管自主系统无处不在,大多数处理自主道德决策的方法仍诉诸于标量或二元判断。这些方法对于可接受的道德推理是不够的,因为它们提供的解释很少,遗漏了必须包含以支持问责的关键背景和理论信息。为此,我们提出了一个将道德推理建模为规范性伦理理论或伦理多元主义分布的框架。我们引入了一个整合这些理论的规范伦理单纯形。还准备了涵盖15个细分子理论的450个案例基准,用于堆叠集成学习。这些案例描述了自然语言中的伦理困境,并具有相关的提取上下文特征。单纯形的实现通过双流规范-语义架构完成,随后是规范信息的融合和顺序堆叠集成,以学习三个广泛理论(后果主义、美德伦理学和道义论)及其15个子类别的最佳拟合。我们的实验表明,将上下文和规范先验与语义嵌入相结合显著提高了分类性能,准确率达到88.89%。我们进行了消融研究,以表明结构化伦理表示超越了类比推理的贡献,并且所选的堆叠架构由于逐步学习粒度而给出了最佳结果。还通过熵、置信度和可视化分析了伦理多元主义。因此,将伦理多元主义建模为概率性规范分布支持类人道德推理、伦理分歧分析以及未来AI系统中的对齐。

英文摘要

AI systems are increasingly involved, in varying capacities, in critical decision-making in socially consequential settings. Yet, despite the ubiquity of autonomous systems, most approaches to autonomous moral decision-making resort to scalar or binary judgments. These methods are insufficient for acceptable moral reasoning: they provide little explanation and leave out essential contextual and theoretical information needed to support accountability. To address this, we propose a framework that models moral reasoning as a distribution over normative ethical theories, that is, as ethical pluralism. We introduce a normative ethics simplex that integrates these theories. We also prepared a benchmark of 450 cases across 15 fine-grained subtheories for stacked ensemble learning. These cases describe ethical dilemmas in natural language and have associated extracted contextual features. The simplex is implemented via a two-stream normative-semantic architecture, followed by fusion of normative information and a sequential stacking ensemble that learns the best fit among the three broad theories (consequentialism, virtue ethics, and deontology) and the 15 subcategories. Our experiments demonstrate that integrating contextual and normative priors with the semantic embeddings significantly improves classification performance, reaching an accuracy of 88.89%. We conducted ablation studies to show that structured ethical representations contribute beyond analogical reasoning, and that the chosen stacking architecture gives the best results due to the gradual learning of granularity. Ethical pluralism is also analyzed through entropy, confidence, and visualization. Thus, modeling ethical pluralism as a probabilistic normative distribution supports human-like moral reasoning, ethical disagreement analysis, and future alignment in AI systems.
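To make the idea of "a distribution over normative ethical theories" concrete, here is a minimal sketch of placing a case on a three-theory simplex and measuring how pluralistic the judgment is via Shannon entropy. The softmax mapping, the function names, and the example scores are illustrative assumptions; the abstract does not say how the authors compute their distribution.

```python
import math

THEORIES = ("consequentialism", "virtue ethics", "deontology")

def to_simplex(scores):
    """Softmax: map arbitrary real scores to a point on the probability simplex."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(dist):
    """Shannon entropy in bits: 0 when one theory dominates, log2(3) for a uniform mix."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Hypothetical per-theory scores for one dilemma (not taken from the paper's benchmark).
dist = to_simplex([2.0, 0.5, 1.0])
for name, p in zip(THEORIES, dist):
    print(f"{name}: {p:.3f}")
print(f"entropy: {entropy_bits(dist):.3f} bits")
```

A low entropy would indicate that one theory explains the case almost alone, while a value near log2(3) would indicate genuine disagreement between theories.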

2605.28705 2026-05-28 cs.LG

Understanding Generalization and Forgetting in In-Context Continual Learning

理解上下文持续学习中的泛化与遗忘

Guangyu Li, Meng Ding, Lijie Hu

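AI summary: Proposes the first theoretical framework for in-context continual learning, analyzing the generalization and forgetting behavior of a pretrained Transformer that processes multiple sequential tasks within a single prompt, and revealing the interference and bias induced by attention mechanisms.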

详情
Comments
accepted by ICML 2026
AI中文摘要

上下文学习(ICL)的强大之处在于使大型语言模型能够仅通过基于提示的推理来适应新任务,完全绕过了参数更新的需要。现有理论主要在单任务设置下研究ICL,而现实中的提示通常包含异构任务序列,这导致我们无法理解大型语言模型是否在推理过程中隐式地执行持续学习。为了弥补这一差距,我们提出了首个用于上下文持续学习的理论框架,模拟预训练Transformer如何通过共享注意力机制在单个提示内处理多个顺序任务。聚焦于线性和掩码线性自注意力,我们推导了顺序任务提示下模型预测的误差表达式,并分析了它们的泛化和遗忘行为。我们的结果表明,标准注意力机制通过均匀或因果地聚合历史上下文,不可避免地引起任务间干扰,导致系统性偏差。我们进一步提供了预测误差的偏差-方差-干扰分解,刻画了历史上下文信息何时产生正迁移或可证明的负迁移。这一分析揭示了基于注意力的持续推理的基本限制,并为长提示中的顺序敏感性和性能退化提供了理论解释。

英文摘要

In-context learning (ICL) derives its power from enabling Large Language Models to adapt to new tasks via prompt-based reasoning alone, entirely bypassing the need for parameter updates. Existing theories primarily study ICL in single-task settings, while real-world prompts often contain sequences of heterogeneous tasks, leaving a gap in understanding whether Large Language Models implicitly perform continual learning during inference. To bridge this gap, we propose the first theoretical framework for in-context continual learning, modeling how a pretrained Transformer processes multiple sequential tasks within a single prompt through shared attention mechanisms. Focusing on linear and masked linear self-attention, we derive error expressions for model predictions under sequential task prompts and analyze their generalization and forgetting behavior. Our results reveal that standard attention mechanisms inevitably induce intertask interference by uniformly or causally aggregating historical contexts, leading to systematic bias. We further provide a bias-variance-interference decomposition of prediction error, characterizing when historical in-context information yields positive transfer or provable negative transfer. This analysis exposes fundamental limits of attention-based continual inference and offers theoretical explanations for order sensitivity and performance degradation in long prompts.
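The interference claim can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a simplified uniform linear-attention readout of the form (d / n) * sum_i y_i * x_i, deterministic basis-vector inputs, and two hypothetical linear tasks. Under those assumptions, a prompt containing only the second task recovers its weights, while a prompt that also carries the first task yields an average of the two weight vectors, which is a systematic bias for queries about the second task.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def basis(d):
    """Standard basis vectors e_1..e_d, used as deterministic in-context inputs."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

def make_context(w):
    """In-context examples (x, y) with y = <w, x> for one linear task."""
    return [(x, dot(w, x)) for x in basis(len(w))]

def linear_attention_estimate(context, d):
    """Uniform (unmasked) linear-attention readout: (d / n) * sum_i y_i * x_i."""
    n = len(context)
    est = [0.0] * d
    for x, y in context:
        for j in range(d):
            est[j] += y * x[j]
    return [d * e / n for e in est]

w1 = [1.0, 0.0, 2.0]   # hypothetical task 1 weights
w2 = [-1.0, 4.0, 0.0]  # hypothetical task 2 weights

single = linear_attention_estimate(make_context(w2), 3)
mixed = linear_attention_estimate(make_context(w1) + make_context(w2), 3)
print(single)  # recovers w2
print(mixed)   # elementwise average of w1 and w2
```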

2605.28704 2026-05-28 cs.LG

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

具有任意归约顺序和不精确激活实现的浮点神经网络的表达能力

Yeachan Park, Geonho Hwang, Wonyeol Lee, Sejun Park

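AI summary: This paper studies whether floating-point neural networks can exactly represent arbitrary functions between floating-point domains under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. It introduces a general distinguishability framework, shows that the ability of the first layer to distinguish every pair of distinct inputs is necessary for universal representability, proves that a suitable form of distinguishability is also sufficient under mild conditions, and thereby establishes universal representability results for practical activation functions such as Sigmoid, tanh, and ReLU.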

详情
AI中文摘要

大多数现有的神经网络表达能力理论假设精确实数运算,而实际神经网络是在有限精度浮点算术下执行的,其执行语义依赖于实现。最近的工作开始研究浮点神经网络的表达能力,但现有结果仅限于高度受限的激活函数和理想化假设,如固定的从左到右归约顺序和正确舍入的激活实现。在这项工作中,我们研究了在广义浮点执行语义下浮点神经网络的表达能力,包括任意归约顺序和具有有界ulp误差的不精确激活实现。我们探讨了浮点神经网络何时能够精确表示浮点域上的任意函数。为此,我们引入了一个通用的可区分性框架,并表明在第一层中区分每对不同输入的能力是通用可表示性的必要条件。这一表征产生了广泛的不具备通用可表示性的激活实现类别,扩展了先前孤立的反例,如正确舍入的余弦激活。我们进一步证明,在激活实现的温和条件下,适当形式的可区分性也是通用可表示性的充分条件。利用这一框架,我们为一大类实际激活函数建立了通用可表示性结果,包括Sigmoid、tanh、ReLU、ELU、SeLU、GeLU、Swish、Mish和sin的实现,这些结果在比以前已知的显著更现实的浮点执行模型下成立。

英文摘要

Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. We investigate when floating-point neural networks can represent arbitrary functions between floating-point domains exactly. To this end, we introduce a general distinguishability framework and show that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability. This characterization yields broad classes of activation implementations that are not universal representators, extending previous isolated counterexamples such as the correctly rounded cosine activation. We further prove that a suitable form of distinguishability is also sufficient for universal representability under mild conditions on the activation implementation. Using this framework, we establish universal representability results for a broad class of practical activation functions, including implementations of $\mathrm{Sigmoid}$, $\tanh$, $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{SeLU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Mish}$, and $\sin$, under significantly more realistic floating-point execution models than previously known.
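Why the reduction order matters can be seen directly in IEEE 754 double precision, where addition is not associative. The short sketch below uses plain Python floats and is only an illustration of the phenomenon; it does not reproduce the paper's execution model. `math.ulp` requires Python 3.9 or later.

```python
from functools import reduce
import math

def sum_left_to_right(xs):
    """((x0 + x1) + x2) + ... : the fixed order assumed by earlier analyses."""
    return reduce(lambda acc, x: acc + x, xs)

def sum_right_to_left(xs):
    """x0 + (x1 + (x2 + ...)) : one of many other valid reduction orders."""
    return reduce(lambda acc, x: x + acc, reversed(xs))

xs = [0.1, 0.2, 0.3]
print(sum_left_to_right(xs))   # 0.6000000000000001
print(sum_right_to_left(xs))   # 0.6

# Reordering the same three summands changes the result from 0.0 to 1.0,
# because adjacent doubles near 1e16 are 2.0 apart and the 1.0 is absorbed.
print(math.ulp(1e16))                          # 2.0
print(sum_left_to_right([1e16, 1.0, -1e16]))   # 0.0
print(sum_left_to_right([1e16, -1e16, 1.0]))   # 1.0
```

A network whose pre-activations are computed by a parallel or tree reduction can therefore produce different values from the same weights and inputs, which is the kind of implementation dependence the abstract refers to.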

2605.28703 2026-05-28 cs.NE cs.AI cs.DS math.OC

A Fresh Look at Lamarckian Evolution and the Baldwin Effect

对拉马克进化与鲍德温效应的重新审视

Inès Benito, Johannes F. Lutzeyer, Benjamin Doerr

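AI summary: Through experiments and theoretical analysis, compares Lamarckian, Baldwinian, and Darwinian evolution on Maximum Independent Set and Maximum Cut problems, shows that local-search-augmented evolutionary algorithms (Baldwinian evolution in particular) significantly outperform Darwinian evolution, and gives theoretical runtime bounds.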

详情
Comments
To appear in the proceedings of PPSN 2026
AI中文摘要

鲍德温和拉马克进化在进化算法中已存在很长时间,但从未主导学术文献或实际应用。在这项工作中,我们使用现代实证和理论方法重新审视拉马克和鲍德温进化,并将其与一般的达尔文进化进行严格比较。在实证方面,我们在来自近期GraphBench基准的六个不同数据集的图上,针对最大独立集和最大割问题运行了一套全面的实验。我们的结果表明,鲍德温和拉马克进化始终优于达尔文进化,证实了局部搜索增强进化算法的巨大潜力。值得注意的是,在绝大多数情况下,所有进化算法都优于最近的深度学习基线,并接近高度专业化的启发式和精确求解器的性能。此外,我们报告了一组适用于所有研究进化类型的高性能通用参数,希望未来对从业者有用。在理论方面,我们将现有的欺骗性前导块基准扩展到任意块长度,并使用现代理论运行时分析工具来证明预期运行时的上下界。对于大于二的块长度,鲍德温进化渐近快于拉马克进化,而拉马克进化渐近快于达尔文进化。当考虑适应度评估中局部搜索过程的成本时,排序取决于实现方式,鲍德温进化从较小的块长度开始就保持最快,这解释了其强大的实证性能。

英文摘要

Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with the generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolution consistently outperform Darwinian evolution, confirming the great potential of local search augmented evolutionary algorithms. Notably, in the great majority of cases, all EAs outperform recent deep learning baselines and approach the performance of highly specialised heuristic and exact solvers. We furthermore report a high-performing set of generalist parameters for all studied evolution types that we hope will be of use to practitioners in future. On the theoretical side, we extend the existing Deceptive Leading Block benchmark to arbitrary block length and use tools from modern theoretical runtime analysis to prove upper and lower bounds on the expected runtime. For block lengths greater than two, Baldwinian evolution is asymptotically faster than Lamarckian which is asymptotically faster than Darwinian evolution. When accounting for the cost of the local search procedure in fitness evaluations, the ordering depends on the implementation with Baldwinian evolution staying fastest from small block lengths onwards, explaining its strong empirical performance.
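The difference between the three evolution types comes down to what happens to the result of local search during fitness evaluation. The sketch below shows this on Maximum Cut with a single-bit-flip hill climber; the graph, the local search rule, and the function names are illustrative assumptions, not the authors' implementation.

```python
def cut_value(edges, bits):
    """Number of edges whose endpoints lie on different sides of the cut."""
    return sum(1 for u, v in edges if bits[u] != bits[v])

def local_search(edges, bits):
    """First-improvement single-bit-flip hill climbing until no flip helps."""
    best = list(bits)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            cand = best[:i] + [1 - best[i]] + best[i + 1:]
            if cut_value(edges, cand) > cut_value(edges, best):
                best, improved = cand, True
    return best

def evaluate(edges, genotype, mode):
    """Return (fitness, genotype kept in the population) for one individual."""
    if mode == "darwinian":
        return cut_value(edges, genotype), list(genotype)
    refined = local_search(edges, genotype)
    fitness = cut_value(edges, refined)
    if mode == "lamarckian":
        return fitness, refined          # learned improvement is written back
    if mode == "baldwinian":
        return fitness, list(genotype)   # only the fitness reflects the learning
    raise ValueError(mode)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-cycle
start = [0, 0, 0, 0]
for mode in ("darwinian", "lamarckian", "baldwinian"):
    print(mode, evaluate(edges, start, mode))
```

In this toy case the Lamarckian and Baldwinian individuals receive the same fitness, but only the Lamarckian one carries the improved bit string into the next generation.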

2605.28699 2026-05-28 cs.AI

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

TRACER: 基于内部强化信用与轮次级遗憾匹配的多LLM协作推理

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

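AI summary: Proposes the TRACER framework, in which a controller-regret layer and a generation-credit layer learn when to speak and what to say, respectively, addressing sparse rewards, free-riding, and oscillation under fixed protocols in multi-agent reinforcement learning, and achieving cooperative reasoning with mathematical convergence guarantees.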

详情
Comments
25 pages, 3 figures
AI中文摘要

大型语言模型越来越依赖强化学习或多智能体提示来改进推理,但这两个范式仍然难以结合。将单智能体强化学习直接应用于多轮多智能体系统面临以下困境:i) 稀疏奖励、角色级搭便车和过高的训练开销。ii) 智能体仅模仿协作。iii) 固定协作协议陷入振荡的局部最优。我们引入TRACER,一个用于协作多LLM推理的轮次级强化框架。TRACER将协作决策分为控制器-遗憾层和生成-信用层,其中控制器通过遗憾匹配学习智能体是否应在当前轮次发言或跳过,生成-信用层则使用角色特定的GSPO奖励优化提议者和评审者的发言。这种设计i) 在动作模式和生成话语两个层面分配信用,从而避免搭便车和稀疏奖励。我们仅扩展控制器做出的选择,从而大幅降低训练的计算成本。此外,ii) 智能体在学习何时发言和说什么的过程中获得协作能力。最后,iii) 通过巧妙设计二元动作,我们将为有限动作空间建立的经典博弈论扩展到深度学习,从而实现数学上严格的收敛。我们在GSM8K训练集上训练所有局部RL方法,并在保留的GSM8K、MATH500和GPQA-Diamond上评估域内准确率、跨基准泛化能力、推理成本和修正保持行为。所得框架提供了一个紧凑且可复现的测试平台,用于研究超越固定辩论、投票或聚合协议的学习协作策略。代码可在https://github.com/Shark-Forest/TRACER获取。

英文摘要

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents merely imitate collaboration; and iii) fixed collaboration protocols fall into oscillating local optima. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn through regret matching whether the agents should speak or skip the current round, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards; because only the choices made by the controllers are expanded, the computational cost of training is greatly reduced. Moreover, ii) agents acquire collaborative capability as they learn when to speak and what to say. Finally, iii) by designing the actions to be binary, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.
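The controller's speak-or-skip decision is learned with regret matching. As a rough sketch of that classical rule over a binary action set: each action is played in proportion to its positive cumulative regret, with a uniform fallback when no regret has accumulated. The class name, the utility values, and the update interface are hypothetical; how TRACER defines per-action utilities is not stated in the abstract.

```python
ACTIONS = ("speak", "skip")

class RegretMatcher:
    """Regret matching over a finite action set (here: speak or skip)."""

    def __init__(self, n_actions=2):
        self.cum_regret = [0.0] * n_actions

    def strategy(self):
        """Play each action in proportion to its positive cumulative regret."""
        positive = [max(r, 0.0) for r in self.cum_regret]
        total = sum(positive)
        if total == 0.0:
            return [1.0 / len(positive)] * len(positive)  # no regret yet: uniform
        return [p / total for p in positive]

    def update(self, utilities, played):
        """Add the regret of not having played each alternative action."""
        for a, u in enumerate(utilities):
            self.cum_regret[a] += u - utilities[played]

controller = RegretMatcher()
print(controller.strategy())  # [0.5, 0.5]

# Hypothetical round: the agent skipped (index 1), but speaking would have earned 1.0.
controller.update(utilities=[1.0, 0.0], played=1)
print(controller.strategy())  # [1.0, 0.0]
```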

2605.28697 2026-05-28 eess.IV cs.AI cs.CV

Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?

深度学习应变估计:基于物理的模拟是解决方案吗?

Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén, Sigve Karlsen, Thor Edvardsen, Harald Brunvand, Md Abulkalam Azad, Havard Dalen, Bjørnar Grenne, Gabriel Kiss, Pierre-Yves Courand, Lasse Lovstakken, Pierre-Marc Jodoin, Olivier Bernard

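AI summary: To address the lack of reliable motion references for strain estimation in echocardiography, proposes a simulation strategy that combines speckle decorrelation measures from real videos with an iterative refinement process, producing a realistic dataset for training a motion estimation algorithm that outperforms the clinical reference on global and regional strain.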

详情
Comments
10 pages
AI中文摘要

斑点追踪超声心动图(STE)是心肌应变估计的临床标准。尽管在全局应变(GLS)上表现良好,但其区域应变的准确性仍然有限,尽管这一生物标志物对于早期诊断和表征细微异常高度相关。深度学习是一种有前景的替代方案,但其发展受到缺乏可靠运动参考的限制。现有解决方案要么依赖于STE衍生的标签,要么依赖于基于物理模型生成的模拟,但这些合成序列与临床数据相比仍缺乏足够的真实性。在本文中,我们提出了一种新的模拟策略,该策略结合了来自真实视频的散斑去相关测量,并使用迭代细化过程来改善模拟中的运动真实性。我们创建了一个包含1,478个视频及其参考运动的开源逼真数据集,用于训练超声心动图运动估计算法。所提出的方法在全局和区域应变上实现了无与伦比的性能,特别是在专家间设置中,GLS变异性达到1.42%,而临床参考为1.78%。

英文摘要

Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.
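For readers unfamiliar with the quantity being estimated, strain is the relative change in myocardial length with respect to a reference frame, usually expressed in percent. The sketch below computes a Lagrangian strain curve from a sequence of contour lengths; the length values are hypothetical, and taking the first frame as the reference (typically end-diastole) is an assumption.

```python
def lagrangian_strain_percent(length, reference_length):
    """Lagrangian strain: relative length change versus the reference frame, in percent."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return 100.0 * (length - reference_length) / reference_length

def peak_strain(lengths):
    """Strain curve over a cycle (first frame as reference) and its most negative value."""
    curve = [lagrangian_strain_percent(l, lengths[0]) for l in lengths]
    return curve, min(curve)

# Hypothetical myocardial contour lengths in mm over one cardiac cycle.
lengths_mm = [100.0, 95.0, 88.0, 82.0, 90.0, 99.0]
curve, peak = peak_strain(lengths_mm)
print(peak)  # -18.0
```

Applying the same formula to the whole contour gives a global value, and applying it per segment gives regional values.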

2605.28693 2026-05-28 q-bio.NC cs.AI

Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images

反向传播与大脑对图像响应的层级结构之间的错位

Joséphine Raugel, Maximilian Seitzer, Marc Szafraniec, Huy V. Vo, Jérémy Rapin, Patrick Labatut, Piotr Bojanowski, Valentin Wyart, Jean-Rémi King

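AI summary: Using fMRI and MEG recordings of human brain responses to natural images, finds that backpropagated gradients of pretrained models predict signals in higher-level visual cortex and at later latencies, but that their spatial and temporal organization does not match the brain's hierarchy, suggesting that deep networks and the brain may rely on different learning mechanisms.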

详情
Comments
13 pages, 9 figures
AI中文摘要

反向传播是深度学习核心的学习机制。然而,该算法是否以及如何在大脑中实现仍存在高度争议。特别是,虽然预训练模型的前向激活可靠地映射到视觉处理的皮层层级结构,但反向传播梯度是否表现出类似的对应关系尚不清楚。在这里,我们利用功能性磁共振成像(fMRI)和脑磁图(MEG)记录人类对自然图像的脑响应来探讨这一问题。为此,我们将前向激活的标准编码分析扩展到将反向传播梯度映射到神经数据。聚焦于最近的自监督视觉模型(DINOv3)并在八个视觉模型上复现结果,我们发现反向传播梯度能够可靠地预测fMRI和MEG信号,尤其是在高级视觉皮层和较晚的潜伏期。然而,这些反向传播梯度在大脑中的空间和时间组织与生物合理反向传播机制预期的模式不同:具体而言,梯度计算的顺序及其空间组织均与人类大脑的时间和空间层级结构相偏离。这些结果表明,尽管深度网络和大脑可能共享相似的表征内容,但它们可能依赖根本不同的机制来学习这些表征。

英文摘要

Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward activations to map backpropagated gradients onto neural data. Focusing on a recent self-supervised vision model (DINOv3) and reproducing results on eight vision models, we find that backpropagated gradients can reliably predict both fMRI and MEG signals, specifically in higher-level visual cortex and for later latencies. However, the spatial and temporal organization of these backpropagated gradients in the brain diverges from the patterns expected under a biologically plausible backpropagation mechanism: specifically, both the order in which gradients are computed and their spatial organization diverge from the temporal and spatial hierarchies of the human brain. Together, these results suggest that, although deep networks and the brain may share similar representational content, they likely rely on fundamentally different mechanisms to learn those representations.

2605.28691 2026-05-28 cs.CV

OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning

OSP-Next: 结合稀疏序列并行、HiF8量化和强化学习的高效高质量视频生成

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu, Xinhua Cheng, Li Yuan

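AI summary: Proposes OSP-Next, a text-to-video generation model that combines a hybrid full-sparse attention architecture, Sparse Sequence Parallelism (SSP), HiF8 quantization, and Mix-GRPO post-training to substantially improve efficiency while maintaining high quality, achieving speedups of more than 1.5x on NVIDIA H200 and Ascend 950PR.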

详情
AI中文摘要

扩散Transformer在视频生成中取得了高质量,但全注意力的二次成本限制了效率。我们提出OSP-Next,一种高效的文本到视频生成模型,集成了稀疏注意力、并行、量化和强化学习。OSP-Next采用混合全稀疏注意力架构,其中稀疏组件通过Skiparse-2D注意力实现。这种固定模式机制沿空间维度应用逐token和逐组的稀疏注意力,利用局部性同时保持与FlashAttention内核的原生兼容性。基于Skiparse-2D注意力中重排的局部等价性,我们进一步提出稀疏序列并行(SSP),它将子序列划分到多个rank,并通过一次All-to-All通信切换稀疏模式。与Ulysses序列并行(SP)相比,SSP为稀疏注意力提供了原生并行策略,并将通信量减少了75%。OSP-Next还引入了HiF8量化,以实现8位量化和稀疏微调的稳定联合训练,并应用Mix-GRPO后训练来提升稀疏模型的性能。实验表明,OSP-Next的VBench总得分为83.73%,超过了Wan2.1基线。在5秒720P和5秒768P设置下,OSP-Next在NVIDIA H200 GPU上实现了高达1.64倍的单GPU加速和超过1.52倍的八GPU加速。此外,在VBench总分仅下降0.4%的情况下,OSP-Next-HiF8在单个Ascend 950PR上分别实现了1.69倍和2.27倍的加速,展示了OSP-Next跨硬件平台的效率和性能。

英文摘要

Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64$\times$ single-GPU speedup and over 1.52$\times$ eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69$\times$ and 2.27$\times$ speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.

2605.28687 2026-05-28 cs.SD physics.med-ph

Cross-modal characterization of infant cry: validation of a chest-surface accelerometer in extracting acoustic vocal function measures

婴儿哭声的跨模态表征:胸表加速度计在提取声学发声功能测量中的验证

Winko W. An, Saketh Sundar, Lisa Yankowitz, Daryush D. Mehta, Carol L. Wilkinson

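AI summary: This study validates a chest-surface accelerometer for infant cry analysis and finds that it reliably captures acoustic features such as fundamental frequency and jitter, offering a noise-robust and privacy-preserving alternative for clinical research.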

详情
AI中文摘要

背景:婴儿哭声声学为早期神经发育提供了有前景的窗口,并可能作为神经发育障碍的可扩展生物标志物。然而,传统的基于麦克风的录音在现实临床环境中极易受到环境噪声的影响,并引发隐私问题。胸表加速度计通过直接捕获来自喉部的振动,可能提供一种稳健的替代方案。方法:我们通过比较常规疫苗接种期间从加速度计和同步记录的麦克风信号中提取的声学特征,评估了胸戴加速度计用于婴儿哭声分析的有效性。最终样本包括来自多样化儿科人群的85名婴儿(41名4个月大;44名12个月大)。从两种模态中提取了七种发声测量指标,包括基频、抖动、 shimmer、倒谱峰值突出度和谐波噪声比。使用组内相关系数评估模态间的一致性和一致性。结果:加速度计和麦克风录音之间的基频表现出极好的一致性(ICC > 0.94)。抖动测量也显示出良好到极好的一致性,而倒谱峰值突出度显示出中等一致性。Shimmer和谐波噪声比在模态间显示出较低的一致性绝对值和系统偏差,反映了信号传输和噪声敏感性可能存在的差异。结论:总之,胸表加速度计可以可靠地捕获婴儿哭声的几个临床相关声学特征,特别是基频和抖动的时间测量。这种方法为基于麦克风的录音提供了一种噪声鲁棒且保护隐私的替代方案,支持其在可扩展的临床和发育研究应用中的潜在用途。

英文摘要

Background: Infant cry acoustics provide a promising window into early neurodevelopment and may serve as scalable biomarkers for neurodevelopmental disorders. However, conventional microphone-based recordings are highly susceptible to environmental noise and raise privacy concerns in real-world clinical settings. Chest-surface accelerometers may offer a robust alternative by capturing vibrations directly from the larynx. Methods: We evaluated the validity of a chest-mounted accelerometer (ACC) for infant cry analysis by comparing acoustic features derived from ACC and simultaneously recorded microphone (MIC) signals during routine vaccination visits. The final sample included 85 infants (41 at 4 months; 44 at 12 months) from a diverse pediatric population. Seven vocal measures were extracted from both modalities, including fundamental frequency (F0), jitter, shimmer, cepstral peak prominence (CPP), and harmonics-to-noise ratio (HNR). Agreement and consistency between modalities were assessed using intraclass correlation coefficients (ICCs). Results: F0 demonstrated excellent agreement between ACC and MIC recordings (ICC > 0.94). Jitter measures also showed good-to-excellent agreement, while CPP demonstrated moderate agreement. Shimmer and HNR showed lower absolute agreement and systematic bias between modalities, reflecting possible differences in signal transmission and noise sensitivity. Conclusion: In summary, chest-surface accelerometers can reliably capture several clinically relevant acoustic features of infant cry, particularly temporal measures of F0 and jitter. This approach offers a noise-robust and privacy-preserving alternative to microphone-based recordings, supporting its potential use in scalable clinical and developmental research applications.
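The distinction between consistency and absolute agreement explains how a measure can track well across modalities and still show systematic bias. The sketch below computes the two standard single-measurement, two-way ICC forms, ICC(3,1) for consistency and ICC(2,1) for absolute agreement, from a subjects-by-modalities table. Which ICC forms the authors actually used is not stated in the abstract, and the F0 values are hypothetical.

```python
def icc_two_way(ratings):
    """ICC(3,1) consistency and ICC(2,1) absolute agreement for an n x k table.

    ratings: one row per subject, one column per measurement modality.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    agreement = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    return consistency, agreement

# Hypothetical F0 values in Hz: the second modality reads a constant 10 Hz higher.
mic = [300.0, 350.0, 400.0, 450.0]
acc = [f + 10.0 for f in mic]
consistency, agreement = icc_two_way([list(pair) for pair in zip(mic, acc)])
print(consistency, agreement)
```

With a pure constant offset, consistency is perfect while absolute agreement falls below 1, which mirrors the pattern of systematic bias described for shimmer and HNR.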

2605.28684 2026-05-28 cs.LG cs.CE cs.NA math.NA physics.comp-ph

History-aware adaptive reduced-order models via incremental singular value decomposition

基于增量奇异值分解的历史感知自适应降阶模型

Amirpasha Hedayat, Ali Mohaghegh, Laura Balzano, Cheng Huang, Karthik Duraisamy

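AI summary: To address the loss of accuracy when a reduced-order model's online dynamics leave the offline training regime, proposes a projection-based adaptive reduced-order framework built on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations supply correction snapshots for online basis updates, and shows on three nonlinear problems that it outperforms existing methods.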

详情
Comments
50 pages, 27 figures, Preprint submitted to Elsevier
AI中文摘要

降阶模型(ROM)可以加速高维动力学模拟,但当在线动态偏离离线训练数据所代表的区域时,其精度通常会下降。我们开发了一种基于增量奇异值分解(iSVD)的投影自适应ROM框架,其中偶尔的全阶算子评估为在线基更新提供校正快照。这里考虑的侵入式ROM完全由基参数化,因此每次更新自然传播到降阶算子和超降阶机制。通过其演变的奇异结构,iSVD保留了观测动态的编码历史,在这个意义上具有历史感知能力。我们在三个复杂度递增的非线性问题上研究了该方法:一维粘性Burgers方程、Sod激波管和刚性一维十种组分旋转爆震发动机(RDE)。Burgers问题用于分析该方法,并将iSVD与替代基自适应规则进行比较,表明历史感知更新优于瞬时更新,且iSVD整体性能最强。Sod和RDE案例表明,这些优势在更具挑战性的可压缩流设置中持续存在。对于RDE问题,iSVD自适应ROM在预测精度和计算效率上都优于当前最先进的直接自适应ROM基线。成本分析表明,主要的在线成本来自与全阶模型交互以获取校正快照,而iSVD更新本身可忽略不计。这些结果将iSVD确定为在线学习降阶子空间的有效机制,并指出了使ROM在其初始训练窗口长几个数量级的时间范围内保持预测性的路径。

英文摘要

Reduced-order models (ROMs) can accelerate high-dimensional dynamical simulations, but their accuracy often deteriorates when online dynamics leave the regime represented by offline training data. We develop a projection-based adaptive ROM framework based on incremental singular value decomposition (iSVD), in which occasional full-order operator evaluations provide correction snapshots for online basis updates. The intrusive ROMs considered here are fully parameterized by the basis, so each update naturally propagates to reduced operators and hyper-reduction machinery. Through its evolving singular structure, iSVD retains an encoded history of the observed dynamics and is history-aware in this sense. We study the method on three nonlinear problems of increasing complexity: the one-dimensional viscous Burgers equation, the Sod shock tube, and a stiff one-dimensional ten-species rotating detonation engine (RDE). The Burgers problem is used to analyze the method and compare iSVD with alternative basis adaptation rules, showing that history-aware updates outperform instantaneous updates and that iSVD gives the strongest overall performance. The Sod and RDE cases demonstrate that these advantages persist in more challenging compressible-flow settings. For the RDE problem, the iSVD adaptive ROM improves upon the current state-of-the-art Direct adaptive ROM baseline in both predictive accuracy and computational efficiency. A cost analysis shows that the dominant online cost comes from interacting with the full-order model to obtain correction snapshots, while the iSVD update itself is negligible. These results identify iSVD as an effective mechanism for online learning of reduced subspaces and suggest a path toward ROMs that remain predictive over horizons several orders of magnitude longer than their initial training window.
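
The online basis update described in the abstract can be sketched as a rank-one incremental SVD step. This is a minimal illustration of a standard column-append update (in the style of Brand's incremental SVD), not the authors' implementation; the function name `isvd_update`, the `max_rank` truncation, and the tolerance are assumptions, and only the left singular vectors and singular values are tracked.

```python
import numpy as np


def isvd_update(U, s, c, max_rank, tol=1e-12):
    """Fold one new snapshot c into a thin SVD (U, s) of past snapshots.

    U: (n, r) orthonormal basis, s: (r,) singular values, c: (n,) snapshot.
    Returns the updated (U, s), truncated to at most max_rank modes.
    Hypothetical helper for illustration only.
    """
    r = s.size
    p = U.T @ c                # component of c inside the current subspace
    res = c - U @ p            # component of c outside the current subspace
    rho = np.linalg.norm(res)

    if rho > tol:
        # Snapshot adds a new direction: extend the basis by one column.
        K = np.zeros((r + 1, r + 1))
        K[:r, :r] = np.diag(s)
        K[:r, r] = p
        K[r, r] = rho
        U_ext = np.hstack([U, (res / rho)[:, None]])
    else:
        # Snapshot already lies in the subspace: only rotate and rescale.
        K = np.hstack([np.diag(s), p[:, None]])
        U_ext = U

    # SVD of the small core matrix gives the rotation of the extended basis.
    Uk, sk, _ = np.linalg.svd(K, full_matrices=False)
    keep = min(max_rank, sk.size)
    return (U_ext @ Uk)[:, :keep], sk[:keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 6))   # hypothetical snapshot matrix
    U, s, _ = np.linalg.svd(A[:, :3], full_matrices=False)
    for j in range(3, 6):              # stream the remaining snapshots
        U, s = isvd_update(U, s, A[:, j], max_rank=6)
    print(np.max(np.abs(s - np.linalg.svd(A, compute_uv=False))))
```

Because the singular values carry the weight of all earlier snapshots into each update, the basis reflects accumulated history rather than only the latest correction, which is the sense in which the abstract calls iSVD history-aware. Each update costs one small dense SVD of size at most (r+1) by (r+1), in line with the abstract's remark that the update itself is cheap relative to full-order model evaluations.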

2605.28683 2026-05-28 cs.AI

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

VeriTrip: 面向非结构化网络语料的旅行规划智能体可验证基准

Yuting Xu, Jiayi Tian, Jian Liang, Xin Xiong, Hang Zhang, Mu Xu, Xiao-Yu Zhang

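AI summary: Proposes the VeriTrip benchmark, which uses a multimodal retrieval base and a verifiable knowledge base to evaluate agents' evidence-grounded travel planning over unstructured web corpora, and reveals a retrieval-reasoning trade-off.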

详情
Comments
10 pages, 4 figures
AI中文摘要

现有基准通过建立以API为中心的范式为旅行规划智能体奠定了基础。然而,随着自主智能体能力的不断提升,其评估必须从简单的工具执行扩展到处理开放网络的固有复杂性。当前基准绕过了核心认知障碍:它们未能考虑信息噪声,忽略了多源事实矛盾,并且忽视了将视觉感知融入逻辑规划的必要性。我们引入了VeriTrip,一个旨在满足智能体鲁棒性和可靠性日益增长需求的可验证基准。VeriTrip将评估重点转向基于非结构化多模态网络语料的证据推理。它建立了一个源自真实世界的多模态检索库(MRB),迫使智能体自主协调跨异构数据的查询。同步的可验证知识库(VKB)支持逐单元验证协议,精确量化事实可靠性,区分系统性推理失败与参数幻觉。我们在领先的多模态大语言模型上的评估揭示了一个关键的“检索-推理权衡”:自主检索的认知负荷显著侵蚀了指令保持能力。VeriTrip为能够在无约束多模态环境中运行的下一代规划智能体提供了严格的基础。

英文摘要

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of autonomous agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical "retrieval-reasoning trade-off": the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.

2605.28680 2026-05-28 cs.HC cs.AI cs.CY

AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness

职场中的AI:人工智能对感知工作体面性和意义性的影响

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

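AI summary: Through interviews with 24 employees in the IT, service, and healthcare sectors, this study examines the perceived impact of AI on job satisfaction and finds that occupational domains differ in how they expect AI to change job decency and meaningfulness, which in turn affects overall satisfaction.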

详情
Comments
Accepted to CSCW 2026 / Proceedings of the ACM on Human-Computer Interaction (PACMHCI)
AI中文摘要

人工智能在工作场所的普及正在改变我们的工作方式。虽然现有关于人机协作的研究通常优先考虑绩效,但对其体验结果知之甚少。通过对24名来自信息技术、服务和医疗行业的员工进行访谈,本文考察了AI通过感知工作体面性和意义性对当前和未来工作满意度的影响。我们的结果显示,AI对整体工作满意度的预期影响因职业领域而异,对其潜在的体面性和意义性的感知也不同。例如,IT和医疗行业预期在工时等体面性方面满意度提高,但由于误解AI将处理大部分任务,在社交形象等意义性方面满意度下降。相反,服务行业员工预计工时无改善,但由于与AI合作带来的地位提升感知,社会地位会提高。

英文摘要

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about its experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare workers anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.