arXivDaily: daily arXiv digest, updated Monday through Friday
2606.04602 2026-06-12 cs.AI updated version

Parthenon Law: A Self-Evolving Legal-Agent Framework

Hejia Geng, Leo Liu

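Affiliations: tapntell.ai

AI summary: This paper proposes the Parthenon framework, which factors an agent into components such as model, tools, and knowledge and adds an anti-leakage learning loop, so that legal-domain LLM agents can evolve from experience and perform substantially better on legal matters.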

Abstract

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- 12,510 agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.
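The gap the abstract describes between per-criterion accuracy and strict matter completion can be illustrated with a small sketch. The function names and the toy rubric data below are assumptions made for illustration; they are not taken from the paper or from Harvey LAB.

```python
def per_criterion_accuracy(results):
    """Fraction of all rubric criteria passed, pooled over matters."""
    total = sum(len(matter) for matter in results)
    passed = sum(sum(matter) for matter in results)
    return passed / total


def strict_completion_rate(results):
    """Fraction of matters in which every single criterion passed."""
    return sum(all(matter) for matter in results) / len(results)


# Hypothetical data: 4 matters, 5 criteria each (True = criterion met).
toy = [
    [True, True, True, True, False],
    [True, True, True, False, True],
    [True, True, True, True, True],
    [True, False, True, True, True],
]

print(per_criterion_accuracy(toy))   # 17 of 20 criteria pass -> 0.85
print(strict_completion_rate(toy))   # 1 of 4 matters fully passes -> 0.25
```

One missed criterion is enough to fail a matter under the strict metric, so the two numbers can diverge widely even when most criteria are met.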

2606.04525 2026-06-12 cs.CL cs.LG q-bio.GN updated version

GENEB: Why Genomic Models Are Hard to Compare

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

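Affiliations: GitHub, arXiv

AI summary: To address the fragmented evaluation of genomic foundation models, the paper proposes the GENEB benchmark, which compares 40 models on 100 tasks under a unified probing protocol and reports key findings such as unstable model rankings and limited gains from scale.
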
Comments
change first page figure, fix model sizes, add more consistency

Abstract

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

2606.04474 2026-06-12 cs.CL eess.AS updated version

Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu

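Affiliations: School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; ByteDance, China

AI summary: This paper diagnoses entity binding failures in the logical reasoning of speech LLMs and proposes an entity-aware chain-of-thought method that substantially improves reasoning accuracy.
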
Comments
INTERSPEECH 2026

Abstract

Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this gap is not a uniform cognitive deficit. Evaluating two architecturally diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. Yet on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this as an entity binding failure: continuous speech features blur precise entity-property associations during implicit reasoning. To validate this diagnosis, we introduce Entity-Aware Chain-of-Thought (EA-CoT), a lightweight inference-time intervention forcing SLLMs to enumerate entities and bind them to claims before reasoning. EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4 percentage-point accuracy gain. Ablations confirm the gains stem from explicit semantic binding, reframing the gap as an elicitation failure rather than a missing capability.

2606.04364 2026-06-12 cs.CV cs.LG updated version

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

Dhanesh Ramachandram

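Affiliations: Vector Institute

AI summary: Proposes a part-factorized concept bottleneck model that constrains attention with a spatial prior, providing interpretability and better localization in fine-grained recognition.
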
Comments
Updated results with GlobalAttention Tokens

Abstract

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
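As a rough illustration of "a Gaussian prior injected additively in log space into the attention logits", the sketch below adds the log of an unnormalised 2D Gaussian to per-patch logits before the softmax. It assumes an isotropic Gaussian, a 4x4 patch grid, and hypothetical function names; the paper's actual parameterisation may differ.

```python
import math


def gaussian_log_prior(grid_h, grid_w, mean, sigma):
    """Log of an unnormalised isotropic 2D Gaussian at each patch centre (row-major)."""
    mean_y, mean_x = mean
    return [
        -((y - mean_y) ** 2 + (x - mean_x) ** 2) / (2.0 * sigma ** 2)
        for y in range(grid_h)
        for x in range(grid_w)
    ]


def softmax(values):
    """Numerically stable softmax over a flat list."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(logits, log_prior):
    """Attention weights for one part query: softmax(logits + log prior)."""
    return softmax([l + p for l, p in zip(logits, log_prior)])


# With flat (uninformative) logits the attention is shaped by the prior alone.
grid_h = grid_w = 4
logits = [0.0] * (grid_h * grid_w)
prior = gaussian_log_prior(grid_h, grid_w, mean=(1.0, 2.0), sigma=1.0)
attn = part_attention(logits, prior)
peak = max(range(len(attn)), key=attn.__getitem__)
print(divmod(peak, grid_w))  # (1, 2): the patch at the prior mean
```

Adding the prior in log space is equivalent to multiplying the unnormalised attention weights by the Gaussian, so each part query is biased toward its own region without hard masking.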

2606.04009 2026-06-12 stat.ML cs.AI cs.LG updated version

Counterfactual Explanations for Deep Two-Sample Testing

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert

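Affiliations: Hasso-Plattner-Institute, University of Potsdam; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai

AI summary: For deep two-sample tests, proposes a counterfactual explanation framework based on a diffusion autoencoder and MMD optimization that generates sample-level edits revealing the features that drive rejection of the null hypothesis.
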
Comments
17 pages

Abstract

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
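The quantities the abstract relies on, an MMD statistic and a two-sample p-value, can be sketched in a few lines. This is a generic kernel two-sample test with a permutation p-value, not the authors' deep test: the RBF kernel, the bandwidth, the 1D toy data, and the mean shift standing in for counterfactual edits are all assumptions for illustration.

```python
import math
import random


def rbf(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    kxx = sum(rbf(a, b, bandwidth) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, bandwidth) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, bandwidth) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def permutation_p_value(xs, ys, n_perm=200, seed=0):
    """Share of label permutations whose MMD is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd2(xs, ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical 1D features: source and target differ in mean.
rng = random.Random(1)
source = [(rng.gauss(0.0, 1.0),) for _ in range(20)]
target = [(rng.gauss(3.0, 1.0),) for _ in range(20)]
edited = [(x[0] + 3.0,) for x in source]  # stand-in for counterfactual edits

p_before = permutation_p_value(source, target)
p_after = permutation_p_value(edited, target)
print(p_before, p_after)
```

A p-value that rises after editing is the distribution-level signal the abstract reports: the test no longer separates the edited source set from the target as clearly.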

2606.03096 2026-06-12 cs.CL updated version

Can Factual Opinions Be Edited (Manipulated) in Large Language Models?

Yuanpu Cao, Ziyi Yin, Fenglong Ma, Jinghui Chen

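Affiliations: The Pennsylvania State University

AI summary: Proposes the FOE benchmark to evaluate how well current knowledge-editing techniques can manipulate factual opinions (such as the stances of public figures), finds that they achieve only superficial changes and fail to keep opinions consistent with evidence, and then proposes a self-generated evidence-aligned method that achieves opinion-evidence alignment.
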
Comments
Accepted to the ACL 2026 Main Conference

Abstract

Large Language Models (LLMs) are increasingly integrated into various domains, making knowledge editing techniques crucial yet potentially hazardous. Current editing methods primarily target atomic facts, overlooking the significant risks associated with manipulating factual opinions, e.g., documented stances of public figures on societal issues. Such manipulation could reshape public images, influence elections, and alter societal views. To systematically assess this threat, we introduce the Factual Opinion Editing with Evidence (FOE) benchmark, which encompasses 261 public figures, 19 issue categories, and 2,178 complete opinion records. Our evaluations demonstrate that current editing techniques struggle significantly with factual opinions, often achieving only superficial changes while failing to preserve consistency between the edited opinion and the supporting evidence generated by the model. To address this limitation, we further propose a simple yet effective Self-Generated Evidence-Aligned method that achieves opinion-evidence alignment without relying on explicit instructions. Together, our benchmark and method provide a foundation for understanding the emerging security implications of factual opinion editing in LLMs.

2606.02778 2026-06-12 astro-ph.EP astro-ph.IM cs.LG updated version

One Transit Is All You Need: Detecting Exoplanets Through Learned Stellar Behaviour with EXOVEIL

Pratik Priyanshu

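Affiliations: SRH Hochschule

AI summary: Proposes EXOVEIL, a system that uses a Transformer world model and self-supervised learning to detect single-transit events in raw light curves, achieving high recall on Kepler data and zero-shot transfer to the TESS and PLATO missions.
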
Comments
v3: appendix gallery of confirmed-planet recoveries added; Section 6 candidate catalogue reframed as transit-like anomalies for follow-up; TLS comparison table expanded

Abstract

I present EXOVEIL, a transit detection system that learns what a star's brightness should look like and flags when reality disagrees. Unlike existing systems that require phase-folded input, EXOVEIL operates on raw flux time series and can detect planets that transit only once. A Transformer world model, trained on 16,499 Kepler light curves with transit-masked self-supervised learning, predicts expected stellar flux. A matched-filter detector with variance weighting extracts transit signals from the prediction residuals. A learned classifier (XGBoost) separates planets from false positives, achieving AUC 0.938 on Kepler DR25. Applied to single-transit injection-recovery, EXOVEIL recovers 32% of transits at 1000 ppm depth, a task where all classification-based systems score 0% by construction. A blind search of 3,737 Kepler stars yields 179 new transit-like signals not present in the DR25 TCE catalogue, including 46 monotransit candidates. Applied without retraining to 47 confirmed TESS planets in the PLATO LOPS2 field, EXOVEIL achieves 100% recovery, demonstrating zero-shot cross-mission transfer. At PLATO's 25-second cadence, detection reaches 100 ppm -- approaching the Earth-analog regime. I provide the first application of conformal prediction to transit detection (95.9% empirical coverage) and release the system as pip install exoveil with pretrained weights and a candidate catalogue.
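A variance-weighted matched filter on prediction residuals can be sketched as below. This is a generic box-template filter written for illustration, not the EXOVEIL implementation or its API: the function name, the box-shaped template, and the noise-free toy residuals are assumptions.

```python
import math


def box_matched_filter(residuals, variances, width):
    """Slide a box template over the residuals.

    Returns (start, depth, snr) for the window with the highest SNR, where
    depth is the inverse-variance weighted mean flux drop in that window.
    """
    best = None
    for start in range(len(residuals) - width + 1):
        weights = [1.0 / variances[i] for i in range(start, start + width)]
        weight_sum = sum(weights)
        # Positive depth means the observed flux fell below the prediction.
        depth = -sum(w * residuals[start + k] for k, w in enumerate(weights)) / weight_sum
        snr = depth * math.sqrt(weight_sum)
        if best is None or snr > best[2]:
            best = (start, depth, snr)
    return best


# Toy residuals (observed minus predicted flux) with one injected 1000 ppm dip.
residuals = [0.0] * 200
for i in range(80, 90):
    residuals[i] = -1e-3
variances = [(2e-4) ** 2] * 200  # assumed constant per-point noise variance

start, depth, snr = box_matched_filter(residuals, variances, width=10)
print(start, depth, snr)
```

Weighting each point by the inverse of its variance means that noisy stretches of the light curve contribute less to the detection statistic than quiet ones.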

2606.02133 2026-06-12 cs.LG cs.AI updated version

Variational Learning for Insertion-based Generation

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

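Affiliations: University of Cambridge

AI summary: Proposes the Insertion Process (IP) model, which jointly learns insertion position, content, and termination through permutation-based variational inference, supports variable-length generation, and improves non-autoregressive sequence modeling quality.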

Abstract

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
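One way to picture the correspondence between insertion trajectories and permutations is the sketch below. It is an illustrative construction under stated assumptions, not the paper's formalisation: a permutation lists the final positions in the order they are generated, a trajectory is a list of (slot, token) insertions, and all function names are hypothetical.

```python
from itertools import permutations


def permutation_to_trajectory(tokens, order):
    """Turn a generation order over final positions into (slot, token) insertions."""
    placed = []  # final positions inserted so far, kept in ascending order
    trajectory = []
    for pos in order:
        slot = sum(1 for p in placed if p < pos)  # index in the partial sequence
        trajectory.append((slot, tokens[pos]))
        placed.insert(slot, pos)
    return trajectory


def replay(trajectory):
    """Apply the insertions one by one and return the generated sequence."""
    sequence = []
    for slot, token in trajectory:
        sequence.insert(slot, token)
    return sequence


def trajectory_to_permutation(trajectory):
    """Recover the generation order from the slots alone."""
    steps = []  # steps[pos] = step at which final position pos was inserted
    for step, (slot, _) in enumerate(trajectory):
        steps.insert(slot, step)
    order = [0] * len(steps)
    for pos, step in enumerate(steps):
        order[step] = pos
    return order


tokens = ["C", "=", "O", "N"]
trajectory = permutation_to_trajectory(tokens, [2, 0, 3, 1])
print(trajectory)          # [(0, 'O'), (0, 'C'), (2, 'N'), (1, '=')]
print(replay(trajectory))  # ['C', '=', 'O', 'N']
```

Every order over the final positions yields exactly one trajectory that rebuilds the sequence, and the order can be read back from the trajectory, which is the property that lets a likelihood be written as a sum over permutations.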

2606.01621 2026-06-12 cs.CV cs.RO updated version

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang

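Affiliations: Carnegie Mellon University; Nanyang Technological University

AI summary: Proposes the Goal2Pixel paradigm, which reformulates vision-and-language navigation in continuous environments (VLN-CE) as navigable pixel grounding: the image plane serves as a unified spatial interface, the model predicts a visible navigable pixel that is back-projected into a 3D waypoint, and a visibility-aware keyframe memory and coordinate-aware auxiliary losses yield competitive performance with fewer VLM calls.
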
Comments
8 pages

Abstract

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project page: https://baobao0926.github.io/Goal2Pixel/
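The pixel interface described in the abstract can be sketched as two small functions: one that maps a predicted pixel on a padded canvas to an action, and one that back-projects an image pixel to a 3D point. The canvas layout, the margin size, the pinhole camera model, and the availability of a depth value are all assumptions made here for illustration; the paper may define these differently.

```python
def decode_pixel(u, v, width, height, margin):
    """Map a pixel on the padded canvas to (action, image_pixel).

    Assumed layout: [left margin | image | right margin] with a bottom
    margin underneath; u is the column and v is the row.
    """
    if v >= height:
        return ("stop", None)
    if u < margin:
        return ("turn_left", None)
    if u >= margin + width:
        return ("turn_right", None)
    return ("forward", (u - margin, v))


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth.

    Returns the camera-frame point (X, Y, Z), with Z along the optical axis.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical 640x480 camera with a 32-pixel directive margin.
action, pixel = decode_pixel(512, 240, width=640, height=480, margin=32)
print(action, pixel)  # forward (480, 240)
waypoint = backproject(pixel[0], pixel[1], depth=2.0, fx=320.0, fy=320.0, cx=320.0, cy=240.0)
print(waypoint)  # (1.0, 0.0, 2.0)
```

Because one predicted pixel already encodes both a direction and a distance, a single model call can replace a run of short motion primitives, which is consistent with the lower call counts the abstract reports.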

2606.01538 2026-06-12 cs.GR cs.CV cs.LG updated version

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

Žiga Kovačič, Kevin Ellis

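Affiliations: Cornell University

AI summary: Builds a dataset of 2D Material Point Method (MPM) simulations to study how well physical dynamics can be inferred from video and extrapolated in time, comparing the strengths and weaknesses of code generation and video diffusion approaches.
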
Comments
16 pages, 13 figures. Project page: this https URL

Abstract

To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.

2606.01172 2026-06-12 cs.LG stat.ME stat.ML updated version

Revisiting Neural Processes via Fourier Transform and Volterra Series

Peiman Mohseni, Nick Duffield, Raymond K. W. Wong

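Affiliations: University of Cambridge

AI summary: Using the Volterra expansion and set Fourier convolutions, the paper proposes two new conditional neural process models that address the interpretability and computational-efficiency limitations of existing translation-equivariant neural processes.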

Abstract

Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions -- especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods' efficacy against state-of-the-art baselines.
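For readers unfamiliar with the Volterra expansion mentioned above, the classical form for a translation-invariant (time-invariant) operator is shown below. This is the textbook series in one variable, given as background; the symbols $h_n$, $x$, $y$, and the truncation order $N$ are notation chosen here, and the paper's exact formulation may differ.

```latex
y(t) = h_0 + \sum_{n=1}^{N} \int \cdots \int
       h_n(\tau_1, \dots, \tau_n)
       \prod_{i=1}^{n} x(t - \tau_i) \, d\tau_1 \cdots d\tau_n
```

The $n = 1$ term is an ordinary convolution of the input $x$ with the kernel $h_1$, and the higher-order terms are the "higher-order convolutions" the abstract refers to.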

2606.00807 2026-06-12 cs.AI cs.HC updated version

Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation

Nicholas Davis

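Affiliations: Georgia Institute of Technology; Co-Creative AI Consulting

AI summary: Proposes interaction as the primary unit of analysis and, drawing on theories such as distributed and embodied cognition, argues that intelligence emerges from interaction dynamics rather than isolated computation, introducing the Interaction-Centered Intelligence framework.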

Abstract

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

2605.31419 2026-06-12 cs.CV cs.RO updated version

Triangle Splatting SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

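Affiliations: Software Performance Optimisation Group; Department of Computing

AI summary: Proposes the first dense RGB-D SLAM system that uses differentiable triangles as the 3D map representation, performing tracking and mapping through online differentiable rendering and supporting on-the-fly mesh conversion and editing.
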
Comments
26 pages, 11 figures

Abstract

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

2605.28507 2026-06-12 cs.LG updated version

Universal Time Series Generation with Neural Controlled Differential Equations

Torben Berndt, Elyes Farjallah, Leif Seute, Raeid Saqur, Benjamin Walker, Jan Stühmer

Affiliations: Heidelberg Institute for Theoretical Studies; IAR, Karlsruhe Institute of Technology; Max Planck Institute for Polymer Research; IWR, Heidelberg University; Dept. of Computer Science, University of Toronto; Mathematical Institute, University of Oxford; Vector Institute, Toronto, Canada

AI summary: Proves that structured linear controlled differential equations (SLiCEs) are universal time-series generators and proposes Generative SLiCEs (G-SLiCEs) for flow matching on path space, which perform well in probabilistic forecasting and downstream tasks, especially on irregular grids.

Abstract

Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

2605.26358 2026-06-12 physics.flu-dyn cs.LG updated version

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows

Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano

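Affiliations: Mathematical Institute, University of Oxford; Aerospace and Mechanical Engineering, University of Notre Dame

AI summary: Proposes DARSM, a physics-derived deep learning closure model in which a neural network maps flow invariants to the empirical parameters of an implicit algebraic Reynolds stress equation and adjoint equations enable end-to-end optimization; on the square-duct and periodic-hill benchmarks it reduces mean velocity error by a factor of 2-4.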

Abstract

Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and accurately generalise across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by 2-4× across Reynolds number, geometries, and flow regimes, with peak case-level reductions of 12×. The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.

2605.26144 2026-06-12 cs.SE cs.AI cs.CV version update

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

JunJia Guo, Yuhang Yao, Jiawei (Joe) Zhou, Jingdi Chen

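Affiliations * University of Arizona; Zoom; Stony Brook University

AI summary Proposes the VISTA benchmark, which uses multiple input conditions and evaluation metrics to measure how well LLM-based agents generate functionally complete, visually coherent web apps from visual specifications.

Details
Comments
Project page: this https URL Code: this https URL Dataset: this https URL
AI abstract (translated from Chinese)

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike earlier code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functionally complete, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only, free stack choice; (2) text plus reference screenshots, three specified stacks; (3) text plus reference screenshots, free stack choice; (4) text plus screenshots and a pruned Figma structure, a single specified stack; and (5) text plus screenshots and a pruned Figma structure, free stack choice. To enable robust evaluation, every page in the benchmark is manually annotated with its interactive UI components and roughly three visual anchor points, which addresses known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-based reference matching, behavior-specific browser tests, and CLIP-based visual similarity, which together measure structural alignment, behavioral completeness, and overall visual fidelity. Using VISTA, we evaluate four agent systems drawn from two model families and two harnesses. We find that visual fidelity and functional correctness are partially decoupled across input conditions and agents, and that agents' editing styles differ sharply but are largely orthogonal to task quality. VISTA establishes a rigorous, reproducible foundation for advancing research on agent-based software engineering.

English abstract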

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

2605.29906 2026-06-12 cs.LG version update

Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Anton Bozhedarov, Dmitry V. Dylov

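Affiliations * AvaCapo; Potsdam University; Applied AI Institute; Computational Imaging Lab; AXXX; Innopolis University

AI summary Proposes Text2BFM, a framework that aligns natural language with a pretrained behavioral foundation model and generates long composite motions in the latent policy space, with no end-to-end motion generator.

Details
AI abstract (translated from Chinese)

Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods usually generate pose trajectories or motion tokens directly from language, which forces a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable on long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework to align natural language with a pretrained behavioral foundation model (BFM) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM and uses it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses sequences of BFM policy latents into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation takes place on this compact behavioral manifold with a lightweight conditional generator, and the resulting latent-encoded behaviors are decoded into policy latents that drive the pretrained, frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and performs strongly on long, compositional text descriptions.

English abstract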

Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods typically generate pose trajectories or motion tokens directly from language, forcing a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable for long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework that aligns natural language with pretrained Behavioral Foundation Models (BFMs) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM, using it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation is performed in this compact behavioral manifold with a lightweight conditional generator, and the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and strong performance on long, compositional textual descriptions.

2601.01901 2026-06-12 cs.LG version update

FedBiCross: Personalized One-Shot Federated Learning on Medical Images

Yuexuan Xia, Yinghao Zhang, Yalin Liu, Hong-Ning Dai, Yong Xia

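Affiliations * School of Computer Science and Engineering, Northwestern Polytechnical University, China; School of Science and Technology, Hong Kong Metropolitan University, Hong Kong; Department of Computer Science, Hong Kong Baptist University, Hong Kong

AI summary Proposes FedBiCross, a framework that uses clustering, bi-level cross-cluster optimization, and personalized distillation to address weak knowledge distillation in one-shot federated learning under non-IID data; it outperforms existing methods on four medical image datasets.

Details
Comments
Accepted by BlockSys 2026. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections
AI abstract (translated from Chinese)

One-shot federated learning (OSFL) based on data-free knowledge distillation trains a model in a single communication round without sharing raw data, which makes OSFL attractive for privacy-sensitive medical applications. Existing methods, however, aggregate the predictions of all clients to form a global teacher. Under non-IID data, conflicting predictions dilute one another during averaging and produce less informative soft labels, which weakens distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by the similarity of their model outputs to form coherent sub-ensembles; (2) bi-level cross-cluster optimization, which learns adaptive weights to selectively exploit beneficial cross-cluster knowledge while suppressing negative transfer; and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets show that FedBiCross consistently outperforms state-of-the-art baselines across different degrees of non-IID data.

English abstract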

Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions dilute each other during averaging, yielding less informative soft labels that weaken distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.
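The dilution effect described in the abstract can be illustrated with a toy calculation. The sketch below is not the FedBiCross algorithm; it only shows, with made-up client predictions, that averaging two confident but conflicting distributions gives a higher-entropy (less informative) soft label than averaging two clients that agree. All names and numbers are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def average(dists):
    """Element-wise mean of equally weighted probability vectors."""
    n = len(dists)
    return [sum(col) / n for col in zip(*dists)]

# Hypothetical client predictions for one sample over three classes.
client_a = [0.90, 0.05, 0.05]  # confident in class 0
client_b = [0.05, 0.90, 0.05]  # confident in class 1 (conflicts with A)
client_c = [0.88, 0.07, 0.05]  # agrees with A

# A global teacher averages everyone; a cluster teacher averages similar clients.
global_teacher = average([client_a, client_b])
cluster_teacher = average([client_a, client_c])

print("global  entropy:", round(entropy(global_teacher), 3))
print("cluster entropy:", round(entropy(cluster_teacher), 3))
```

In this toy setting the global teacher splits its mass between two classes, while the cluster teacher stays close to its members. That is the intuition behind grouping clients by output similarity before distillation.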

2512.15133 2026-06-12 cs.CE cs.AI version update

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan

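Affiliations * The Hong Kong Polytechnic University; Mohamed bin Zayed University of Artificial Intelligence

AI summary Proposes HD-Prot, a hybrid diffusion protein language model that extends a sequence pLM to multiple modalities through continuous structure tokens, enabling joint sequence-structure modeling and achieving competitive performance on a range of tasks.

Details
Comments
This is the long version of the corresponding paper to appear at KDD 2026
AI abstract (translated from Chinese)

Proteins inherently possess a consistent sequence-structure duality. Abundant protein sequence data, which is easily represented as discrete tokens, has driven fruitful development of protein language models (pLMs). A key remaining challenge, however, is how to integrate continuous structural knowledge into pLMs effectively. Current methods usually discretize protein structures to fit the language modeling framework, which inevitably loses fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that these concerns can be avoided: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, that is, high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose HD-Prot, a hybrid diffusion protein language model that places a continuous-valued diffusion head on top of a discrete pLM so that it can handle discrete and continuous tokens seamlessly for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions through categorical prediction for sequences and continuous diffusion for structures. Extensive results show that HD-Prot achieves competitive performance on unconditional sequence-structure co-generation, motif scaffolding, protein structure prediction, and inverse folding. Moreover, although it was developed with limited computational resources (less than one-tenth of the budget for modality-extension fine-tuning), our method is on par with state-of-the-art multimodal pLMs. This highlights the viability of estimating categorical and continuous distributions simultaneously within a unified language model architecture and offers a promising alternative direction for multimodal pLMs.

English abstract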

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

2605.25225 2026-06-12 cs.LG cs.AI version update

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

David N. Olivieri, Antonio F. Pérez Rodríguez

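Affiliations * Universidade de Vigo; Independent Researcher

AI summary Proposes a field-theoretic framework that treats the residual stream as a depth-token field and organizes and predicts Transformer activation-patching interventions through localized source insertion, sensitivity-field prediction, empirical Green-function response, and an adjoint variational problem; the forward response theory is validated on GPT-2-style autoregressive Transformers.

Details
AI abstract (translated from Chinese)

Mechanistic interpretability commonly uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally meaningful subspaces in a Transformer's activation space. This paper develops a field-theoretic framework for organizing and predicting such interventions. Treating the residual stream as a depth-token field, we formulate patching as localized source insertion, patch effects as sensitivity-field predictions, downstream propagation as an empirical Green-function response, and patch selection as an adjoint variational problem. Experimentally, we test the forward response theory by applying localized residual-field interventions in GPT-2-style autoregressive Transformers and observing the induced residual-field differences and logit-difference responses. We identify a bounded local linear regime; predict patch effects from first-order sensitivities across residual sites; measure structured anisotropic propagation across depth and token position; build response descriptions from high-sensitivity sites and sliced Green operators; and show that prompt-induced residual displacements can transfer answer behavior. These results establish response objects (sensitivities, propagation fields, and Green-operator slices) as a practical language for organizing patching experiments and as the forward mathematical basis for formulating patch-site inference and cross-scale transfer.

English abstract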

Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.
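The response objects named in the abstract can be written as a first-order expansion. The notation below is an assumption made for illustration and is not taken from the paper: $h_{\ell,t}$ is the residual stream at layer $\ell$ and token $t$, $y$ is a scalar readout such as a logit difference, and $\delta$ is the patch inserted at site $s = (\ell, t)$.

```latex
% Assumed notation (not the paper's): patch \delta inserted at site s = (\ell, t).
\begin{align}
  % First-order sensitivity prediction of the patch effect on the readout y
  \Delta y &\approx \left\langle \nabla_{h_s} y,\ \delta \right\rangle, \\
  % Downstream propagation to a later site s' = (\ell', t') via a Green operator,
  % taken here to be the Jacobian of h_{s'} with respect to h_s
  \Delta h_{s'} &\approx G_{s' \leftarrow s}\, \delta,
  \qquad G_{s' \leftarrow s} = \frac{\partial h_{s'}}{\partial h_s}, \\
  % Adjoint relation: the sensitivity is the readout gradient pulled back through G
  \nabla_{h_s} y &= \sum_{s'} G_{s' \leftarrow s}^{\top}\, \nabla_{h_{s'}} y
  \quad \text{(sum over the sites } s' \text{ the readout is computed from).}
\end{align}
```

Under these assumptions, in a causal autoregressive model $G_{s' \leftarrow s}$ can be nonzero only when $\ell' \ge \ell$ and $t' \ge t$. The linear approximations are expected to hold only for small $\delta$, which matches the abstract's description of a bounded local linear regime.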

2605.03460 2026-06-12 cs.AI cs.LG version update

FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models

Seunghan Lee, Jun Seo, Jaehoon Lee, Sungdong Yoo, Minjae Kim, Tae Yoon Lim, Dongwan Kang, Hwanil Choi, Soonyoung Lee, Wonbin Ahn

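Affiliations * LG AI Research

AI summary To address the failure of time series reasoning models in finance, proposes FinSTaR, built on a 2x2 capability taxonomy, which uses Compute-in-CoT and Scenario-Aware CoT strategies to reach 78.9% average accuracy on the FinTSR-Bench benchmark.

Details
Comments
KDD Workshop on SciSoc Agents & LLMs 2026
AI abstract (translated from Chinese)

Time series reasoning models (TSRMs) perform well in general domains but fail consistently in the financial domain, which has unique characteristics. We propose a general 2x2 capability taxonomy that partitions TSRM capabilities by crossing (1) single-entity versus multi-entity analysis with (2) assessment of the current state versus prediction of future behavior. We instantiate the taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is especially critical, as ten financial reasoning tasks, and build the FinTSR-Bench benchmark on S&P stocks. We then propose FinSTaR (Financial Time Series Thinking and Reasoning), which is trained on FinTSR-Bench and uses a different chain-of-thought (CoT) strategy for each category. For assessment, which is deterministic (computable from observable data), we use Compute-in-CoT, a programmatic CoT that lets the model derive answers directly from raw prices. For prediction, which is inherently stochastic (subject to unobservable factors), we use Scenario-Aware CoT, which generates several scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The method reaches 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. We also show that the four capability categories are complementary and mutually reinforcing under joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. The code is public: https://github.com/seunghan96/FinSTaR.

English abstract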

Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial domain, which exhibits unique characteristics. We propose a general 2 x 2 capability taxonomy for TSRMs by crossing 1) single-entity vs. multi-entity analysis with 2) assessment of the current state vs. prediction of future behavior. We instantiate this taxonomy in the financial domain, where the distinction between deterministic assessment and stochastic prediction is particularly critical, as ten financial reasoning tasks, forming the FinTSR-Bench benchmark based on S&P stocks. To this end, we propose FinSTaR (Financial Time Series Thinking and Reasoning), trained on FinTSR-Bench with distinct chain-of-thought (CoT) strategies tailored to each category. For assessment, which is deterministic (i.e., computable from observable data), we employ Compute-in-CoT, a programmatic CoT that enables models to derive answers directly from raw prices. For prediction, which is inherently stochastic (i.e., subject to unobservable factors), we adopt Scenario-Aware CoT, which generates diverse scenarios before making a judgment, mirroring how financial analysts reason under uncertainty. The proposed method achieves 78.9% average accuracy on FinTSR-Bench, substantially outperforming LLM and TSRM baselines. Furthermore, we show that the four capability categories are complementary and mutually reinforcing through joint training, and that Scenario-Aware CoT consistently improves prediction accuracy over standard CoT. Code is available at this https URL.

2605.24488 2026-06-12 cs.CV cs.GR version update

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung

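Affiliations * Sogang University

AI summary Proposes a motion classification pipeline that uses only SMPL skeleton trajectories and Laban Movement Analysis descriptors to detect suggestive and explicit movement, reaching 57.3% four-class accuracy across four tiers.

Details
Comments
5 pages, 2 figures, 3 tables. Extended version of a poster accepted to SIGGRAPH 2026
AI abstract (translated from Chinese)

Content moderation in online multiplayer 3D virtual environments has recently been handed to automated, AI-based pipelines. However, the field mainly addresses detection of illegal content in images, video, and audio, which leaves detection of suggestive motion as a blind spot. We propose a motion-only classification pipeline that uses Laban Movement Analysis (LMA) descriptors to detect suggestive and explicit movement from SMPL skeleton trajectories. On 20,514 motion clips (17+ hours) spanning four ordered tiers (everyday, artistic, suggestive, explicit), logistic regression on 110 LMA features reaches 57.3% four-class accuracy (2.3x chance), 72.1% three-class accuracy, and 78.7% binary SFW/NSFW accuracy. Confusion is concentrated between adjacent tiers rather than non-adjacent ones. In addition, different movement qualities dominate at each tier of the taxonomy, and no single feature drives classification, which suggests that the four-tier structure reflects genuinely distinct movement patterns.

English abstract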

Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.
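The indirectness measure described in the abstract (path length divided by net displacement) is simple to compute for a single joint. The sketch below is a minimal illustration on made-up 3D samples, not the authors' feature pipeline; the function name and the handling of zero displacement are assumptions.

```python
import math

def indirectness(points):
    """Path length divided by net displacement for one joint trajectory.

    `points` is a sequence of (x, y, z) samples. A value of 1.0 means a
    perfectly straight path; larger values mean a more tortuous path.
    """
    if len(points) < 2:
        raise ValueError("need at least two samples")
    path = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    net = math.dist(points[0], points[-1])
    if net == 0.0:
        # Assumption: a trajectory that ends where it started is treated
        # as maximally indirect instead of raising a division error.
        return math.inf
    return path / net

# Hypothetical trajectories for illustration.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
zigzag = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]

print(indirectness(straight))
print(indirectness(zigzag))
```

The straight path gives a ratio of 1, and the detour through (1, 1, 0) raises it to about 1.41. A real descriptor would probably compute this per joint over a sliding window and then aggregate, but those details are not given in the abstract.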

2605.17770 2026-06-12 cs.AI cs.CL version update

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Junyao Yang, Chen Qian, Kun Wang, Linfeng Zhang, Quanshi Zhang, Yong Liu, Dongrui Liu

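Affiliations * National University of Singapore; Renmin University of China; Shanghai Jiao Tong University; Nanyang Technological University

AI summary This paper finds a robust negative correlation between token entropy and logit gradients in large reasoning models (Entropy-Gradient Inversion) and proposes Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds it into RL reward regularization to improve reasoning performance.

Details
Comments
The authors are withdrawing this manuscript due to fundamental inaccuracies in the institutional affiliations and administrative attributions provided at the time of submission. As this version cannot be validated under the correct institutional framework, the authors request its formal withdrawal from the repository. No immediate replacement is intended
AI abstract (translated from Chinese)

Advances in Large Reasoning Models (LRMs) have driven a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, achieving state-of-the-art performance on complex mathematical and logical tasks. However, the field faces a fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, as well as the instability of reinforcement learning (RL) for reasoning optimization, which relies on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that serves as a clear geometric fingerprint of LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on a variety of reasoning benchmarks across multiple model scales show that CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion correlates directly with better reasoning performance.

English abstract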

The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systematic, step-by-step "slow thinking" reasoning, unlocking state-of-the-art performance in complex mathematical and logical tasks. However, the field faces the fundamental gap between token-level behavioral analysis and internal reasoning mechanisms, and the instability of reinforcement learning (RL) for reasoning optimization relying on costly external verifiers. We identify and formally define Entropy-Gradient Inversion, a robust negative correlation between token entropy and logit gradients that acts as a definitive geometric fingerprint for LRM reasoning capability. Building on this, we propose Correlation-Regularized Group Policy Optimization (CorR-PO), which embeds this inversion signature into RL reward regularization. Extensive experiments on various reasoning benchmarks across multiple model scales show CorR-PO consistently outperforms state-of-the-art baselines, confirming that stronger inversion directly correlates with superior reasoning performance.

2605.22641 2026-06-12 cs.CL cs.AI cs.LG version update

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

Víctor Yeste, Paolo Rosso

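Affiliations * PRHLT Research Center, Universitat Politècnica de València, Spain; School of Science, Engineering and Design, Universidad Europea de Valencia, Spain; Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)

AI summary This study systematically compares how context scope, retrieval-augmented moral knowledge, and model scale affect Schwartz value detection in political texts, finding that full-document context and retrieved knowledge help supervised encoders but offer limited help to zero-shot LLMs, and that scaling models does not guarantee gains.

Details
Comments
Code: this https URL, best model: this https URL, 18 pages, 3 figures
AI abstract (translated from Chinese)

Detecting Schwartz values in political text is challenging because implicit cues often depend on the surrounding argument and on subtle differences between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; settings without and with retrieval augmentation (using a curated moral knowledge base); supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs ranging from 12B to 123B parameters. The results show that more context is not always better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input but does not consistently help zero-shot LLMs. In matched comparisons, retrieved moral knowledge is more consistently useful, improving every tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large, and from 12B to larger LLMs, does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants. Per-value analysis shows that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treat longer inputs or larger models as universal improvements.

English abstract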

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

2602.00122 2026-06-12 cs.CV cs.AI cs.MM version update

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

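Affiliations * UCAS (University of Chinese Academy of Sciences); CASIA (Institute of Automation, Chinese Academy of Sciences); Tencent; CMU (Carnegie Mellon University); WashU (Washington University in St. Louis); SJTU (Shanghai Jiao Tong University); XDU (Xidian University)

AI summary This paper proposes VDE Bench, a benchmark dedicated to evaluating image editing models on bilingual Chinese-English and complex visual document editing tasks; with a high-quality dataset and a new evaluation framework, it systematically quantifies the accuracy of text modifications.

Details
AI abstract (translated from Chinese)

In recent years, image editing models have made significant progress, letting users manipulate visual content flexibly and interactively through natural language instructions. However, an important yet underexplored research direction is dense visual document image editing, which involves modifying the text in an image while faithfully preserving the original text style and background context. Existing methods focus mainly on English scenarios and on images with relatively sparse text, so they cannot adequately handle dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human-annotated and human-evaluated benchmark designed specifically to assess image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a dataset of 942 instruction-based image editing samples whose seed images cover dense Chinese and English text documents, including academic papers, posters, presentation slides, examination materials, and newspapers. We also introduce a new evaluation framework that systematically quantifies editing performance at the OCR parsing level, enabling fine-grained assessment of text modification accuracy. Based on this benchmark, we comprehensively evaluate representative image editing models. Human verification shows that human judgments agree with the automated evaluation metrics. VDE Bench is the first systematic benchmark for evaluating image editing models on bilingual dense-text visual documents.

English abstract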

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.

2605.20763 2026-06-12 cs.LG version update

ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

Shaghayegh Fazliani, Krissh Chawla, Jack Guo, Yiren Shen, Matthias Ihme, Madeleine Udell

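Affiliations * Stanford University; Spinoza Labs

AI summary This paper proposes ShapeBench, an open-source aerodynamic shape optimization benchmark with a unified API covering 103 tasks and eight shape categories. Systematic analysis with validated surrogate models and a high-fidelity CFD pipeline shows substantial variation in optimizer rankings across shape categories and problem formulations, underscoring the need for more general-purpose methods.

Details
AI abstract (translated from Chinese)

Rapid progress in aerodynamic shape optimization (ASO) has outpaced the standardized evaluation frameworks currently available. Fair comparison requires a unified benchmark that covers diverse shape classes, objective formulations, and matched budgets. We introduce ShapeBench, an open-source ASO benchmark of 103 tasks spanning eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate model for fast search; where feasible, a high-fidelity computational fluid dynamics (CFD) pipeline is provided for final verification, which enables systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol and well-configured baselines for fair comparison under a consistent budget metric, allowing comparison between classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench show substantial variation in optimizer rankings across shape categories and problem formulations, with a mean pairwise Spearman ρ = 0.013, so single-task conclusions do not generalize reliably across problem classes. The benchmark is also far from saturated; classical methods are rarely applicable to all shape categories and tasks, which further underscores the need for more general-purpose methods.

English abstract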

Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently-available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.
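A mean pairwise Spearman correlation like the one quoted in the abstract can be sketched as follows. This is an illustrative reconstruction with invented scores, and the abstract does not say exactly how the statistic was computed. It ranks the same optimizers on each task, correlates the rankings for every pair of tasks, and averages the results.

```python
from itertools import combinations

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def mean_pairwise_spearman(scores_by_task):
    """Average Spearman rho over all unordered pairs of tasks."""
    pairs = list(combinations(scores_by_task.values(), 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Hypothetical scores (higher is better) of four optimizers on three tasks.
scores = {
    "task_a": [0.9, 0.7, 0.5, 0.3],
    "task_b": [0.3, 0.5, 0.7, 0.9],  # reversed ranking relative to task_a
    "task_c": [0.9, 0.7, 0.5, 0.3],  # same ranking as task_a
}

print(mean_pairwise_spearman(scores))
```

A value near zero, as reported in the abstract, would mean that knowing which optimizer wins on one task says almost nothing about which wins on another.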

2605.01733 2026-06-12 cs.CV cs.AI version update

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

Zeshang Li, Shuoyang Zhang

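Affiliations * arXiv.org

AI summary This paper proposes GEASS, a training-free module that uses gating, weighting, and an evidence bar to decide how much caption information the model consumes for each query, improving the accuracy of vision-language models.

Details
Comments
18 pages, 12 figures
AI abstract (translated from Chinese)

Vision-language models (VLMs) perform well at grounded reasoning but remain prone to object hallucination. Recent work treats automatically generated captions as a uniformly positive resource, but we find that blindly embedding a caption can lower performance rather than raise it: on HallusionBench, the accuracy of Qwen2.5-VL-3B drops by nearly 10 points. Two structural properties explain this. First, a caption anchors not only the model's final answer but also its reasoning trajectory and lexical choices. Second, caption errors are asymmetric: omissions far outnumber fabrications, but each fabrication has a larger per-instance effect. Caption utility is therefore query-specific rather than corpus-specific. We propose GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free module that decides how much caption information the model consumes for each query: it gates the caption by the confidence of the clean path, weights it by the entropy reduction it produces, and raises the evidence bar when the two pathways disagree. Experiments with four VLMs on POPE and HallusionBench show that GEASS outperforms both vanilla inference and contrastive decoding, at a cost of only two extra forward passes per query.

English abstract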

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence -- assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B$^\dagger$ on HallusionBench by nearly ten points. To understand why, we build GD-Probe, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a per-query property: the same caption helps global questions and harms detail ones, through a single mechanism -- an embedded caption competes with the image for attention and pulls the model's evidence onto its own text -- whose sign is set by whether the caption covers the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.

2605.18817 2026-06-12 cs.LG version update

Multi-Token Residual Prediction

Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu

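Affiliations * New York University; New York University Shanghai; Nos Research; Modal

AI summary This paper proposes Multi-token Residual Prediction, a lightweight module that exploits the similarity of logit distributions at adjacent denoising steps to perform dependency-aware multi-token denoising in a single backbone forward pass, improving denoising efficiency at low cost.

Details
AI abstract (translated from Chinese)

Diffusion language models (DLMs) generate text by iteratively denoising masked token sequences, offering a trade-off between parallelism and quality relative to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Instead of running the backbone again to obtain the next step's logits, MRP predicts the residual between steps from the backbone's hidden states, so more tokens are denoised per backbone forward pass at low cost. We deploy MRP in two inference modes: direct decoding, which uses the corrected logits without verification for a tunable quality-speed trade-off; and speculative decoding, which verifies MRP's proposals against the backbone for lossless acceleration. Experiments on SDAR models at the 1.7B, 4B, and 8B scales show lossless speedups of up to 1.42x in SGLang on reasoning and code generation benchmarks.

English abstract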

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We apply MRP across the two operating regimes of DLM decoding. In the high-quality-low-throughput static denoising regime, MRP serves as a drafter for speculative decoding: its proposals are verified against the backbone, yielding lossless acceleration of up to 1.4x in SGLang. In the low-quality-high-throughput dynamic denoising regime, MRP instead drives a remasking scheme that revokes over-eager reveals, recovering most of the accuracy lost to aggressive low-threshold decoding and improving accuracy by up to 22.6 points on code generation task HumanEval and 17.7 points on reasoning task GSM8K.

2605.18231 2026-06-12 cs.LG Version update

Attacking the First-Principle: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

Attacking First Principles: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

Gabriel Sauger (UL, CNRS, LORIA, Inria), Jean-Yves Marion (UL, CNRS, LORIA, Inria), Sazzadur Rahaman, Victor Matrat (CNRS, UL, LORIA, Inria), Vincent Tourneur (UL, CNRS, LORIA, Inria), Muaz Ali

Affiliations: LORIA; University of Arizona

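AI Summary: This paper proposes the Kelpie framework, the first to carry out mimicry attacks on binary function classifiers in a black-box, query-free setting. It shows that the attack is effective across different model architectures, validates its feasibility with practical case studies, and raises questions about the reliability and security of existing machine learning-based binary function classifiers.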

Details
AI Chinese Abstract

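Binary function classifiers play a key role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine learning-based classifiers are vulnerable to adversarial attacks that can evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attack, in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie uses code transformations that preserve the functionality of the malicious payload while causing it to be misclassified as a desired class. Through extensive experiments, we show that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers, representing different model architectures, without interacting with them directly. We further validate our approach with a practical demonstration involving a keylogger and a wiper hidden inside benign-looking functions. To our knowledge, this work is the first to demonstrate such a mimicry attack in a black-box, zero-query setting, raising significant questions about the reliability and security of existing machine learning-based binary function classifiers.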

English Abstract

Binary function classifiers play a crucial role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine learning-based classifiers are vulnerable to adversarial attacks that can evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attacks, on binary function classifiers in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie leverages code transformations that preserve the functionality of malicious payloads while causing them to be misclassified as we want. Through extensive experimentation, we demonstrate that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers representing different model architectures without requiring direct interaction with them. We further validate our approach with a practical demonstration, involving a keylogger and a wiper concealed within benign-looking functions embedded in an application. This work, to our best knowledge, is the first to demonstrate such a mimicry attack in a black-box, zero-query context, raising important questions about the reliability and security of existing machine learning-based binary function classifiers.

2603.11395 2026-06-12 cs.LG cs.AI Version update

ARROW: Augmented Replay for RObust World models

ARROW: Augmented Replay for Robust World Models

Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo

Affiliations: Imam Mohammad Ibn Saud Islamic University (IMSIU); Monash University; University of New South Wales, Sydney; Cerenaut

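AI Summary: This paper proposes ARROW, a model-based continual reinforcement learning algorithm that uses a memory-efficient replay buffer to reduce catastrophic forgetting. It is evaluated on tasks without shared structure and on tasks with shared structure, showing substantially less forgetting on the former while maintaining comparable forward transfer.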

Details
Comments
36 pages and 11 figures (includes Appendix)
AI Chinese Abstract

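Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance on both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges because of their large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive world model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW in two challenging continual RL settings: tasks without shared structure (Atari) and tasks with shared structure (Procgen CoinRun variants). Compared with model-free and model-based baselines with replay buffers of the same size, ARROW shows substantially less forgetting on tasks without shared structure while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning and warrant further research.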

English Abstract

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: Tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same-size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
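The two-buffer design described in the abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not ARROW's implementation: the class name `DualReplayBuffer` is hypothetical, and reservoir sampling is swapped in here as one simple way to keep a long-term buffer that approximates the distribution of everything seen so far. The abstract itself only says "intelligent sampling" and does not specify the method.

```python
import random
from collections import deque

class DualReplayBuffer:
    """Hypothetical short-term FIFO plus long-term reservoir buffer."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        # Short-term buffer: a FIFO that keeps only the most recent items.
        self.short = deque(maxlen=short_capacity)
        # Long-term buffer: a uniform sample over all items ever added
        # (reservoir sampling is an assumption, not the paper's method).
        self.long = []
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Algorithm R: the new item replaces a random slot with
            # probability long_capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, batch_size):
        """Draw a batch with replacement from the union of both buffers."""
        pool = list(self.short) + self.long
        if not pool:
            return []
        return [self.rng.choice(pool) for _ in range(batch_size)]

# Toy usage: 100 transitions pass through a buffer holding at most 3 + 4.
buf = DualReplayBuffer(short_capacity=3, long_capacity=4)
for step in range(100):
    buf.add(step)
print(list(buf.short))  # [97, 98, 99]
print(len(buf.long))    # 4
```

The total memory stays fixed at `short_capacity + long_capacity` items regardless of how many tasks have been seen, which is the property the abstract contrasts with large replay buffers.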