arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.07878 2026-06-09 cs.LG New submission

Still: Amortized KV Cache Compaction in a Single Forward Pass

Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby

Institutions: Baseten

AI summary: Proposes Still, a method that compacts the KV cache with a lightweight per-layer Perceiver in a single forward pass, balancing speed and quality across compression ratios from 8× to 200× and context lengths from 8k to 128k, and beating the strongest baseline by 8-22 points on long-context tasks.

Abstract

The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory. Existing compaction methods satisfy only part of this requirement: selection methods are lightweight but subset-bound, while synthesis methods are expressive but rely on per-context optimization. Here we introduce Still, a small per-layer Perceiver trained once against a frozen base model that produces compact keys and values in a single forward pass. On Qwen and Gemma models, Still occupies the favorable side of the speed--quality frontier across compression ratios from $8\times$ to $200\times$ and context lengths from $8$k to $128$k. On the long-context RULER grid, Still exceeds the strongest baseline by 8--22 points. The same compact cache also supports free-form summarization, preserving most of the full-context gain on HELMET and winning a pairwise LongBench summarization comparison against KV-Distill. Because compaction is a forward pass, Still can be applied iteratively, entering a long-horizon regime unavailable to per-context methods. We show that amortization makes long-context cache compaction tractable, and synthesis makes its compact state useful at extreme compression.
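To make the idea of synthesizing compact keys and values in one forward pass concrete, here is a minimal NumPy sketch of Perceiver-style cross-attention over a cached layer. It is an illustration under assumptions, not the paper's architecture: the function `compact_kv`, the use of random (untrained) latent queries, and the choice to reuse one attention map for both keys and values are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_kv(keys, values, latents):
    """Cross-attend m latent queries over n cached (key, value) pairs.

    Hypothetical simplification: each compact key/value is a convex
    combination of the originals, weighted by latent-to-key attention.
    """
    d = keys.shape[-1]
    attn = softmax(latents @ keys.T / np.sqrt(d), axis=-1)  # shape (m, n)
    return attn @ keys, attn @ values                       # each (m, d)

rng = np.random.default_rng(0)
n, d, ratio = 2048, 64, 8          # toy sizes; 8x is the low end of the quoted range
m = n // ratio                     # number of compact slots
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
latents = rng.standard_normal((m, d))  # would be learned in a trained compactor

ck, cv = compact_kv(keys, values, latents)
print(ck.shape, cv.shape)
```

Because the compaction here is a single matrix pass with no per-context optimization, it can be re-applied to its own output, which is the property the abstract points to when it mentions iterative application.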

2606.07877 2026-06-09 cs.CL New submission

Whose Norms? Disentangling Cultural and Personal Alignment in Large Language Models

Angana Borah, Isabelle Augenstein, Rada Mihalcea

Institutions: University of Michigan - Ann Arbor; University of Copenhagen

AI summary: Proposes the PACT framework to evaluate how LLMs trade off cultural norms against personal preferences, finding that models are influenced more by country context than by age or gender, and that human-LLM alignment fails to capture cultural pluralism.

Comments Preprint under review

Abstract

Large language models are increasingly used for social decision-making situations that require balancing cultural norms with personal preferences. For example, a user preferring honesty might ask whether to correct a coworker publicly when local norms favor indirect feedback. Yet existing research studies cultural alignment and personalization largely separately. We introduce PACT, the Personal-Preference and Cultural-Norm Trade-off framework, which evaluates whether models choose to follow a cultural norm or allow personal preferences. We find that LLMs vary in how rigidly they enforce cultural norms, with behavior shifted more by country context (7.8%) than age (1%) and gender (0.7%) and shifting non-uniformly after instruction tuning. Furthermore, our five-country human study on PACT shows that culture-following in humans is mainly driven by scenario country, with the lowest agreement when participants judge their own cultural contexts, showing within-culture pluralism. Finally, human-LLM alignment experiments show that models can match majority choices, but fail to capture response distributions and uncertainty (with best correlations reaching only 0.24). Together, these findings motivate alignment evaluations that go beyond majority to capture cultural pluralism and disagreement in social judgment.

2606.07874 2026-06-09 cs.AI New submission

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant

Institutions: University of Oxford; Cohere

AI summary: Studies how LLMs acting as safety judges depend on in-context information and how steerable they are to different safety definitions, finding that they struggle to adjust their evaluations when the context or safety definition contradicts their own priors.

Abstract

LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

2606.07872 2026-06-09 cs.CV New submission

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng

Institutions: Imperial College London

AI summary: Proposes the VisualFLIP benchmark, which uses paired image perturbations to test whether multimodal LLMs truly rely on critical visual evidence, and finds that correct predictions and evidence dependence can come apart.

Abstract

When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/
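The two metrics described above can be sketched in a few lines of Python. This is one reading of the abstract's wording, not the benchmark's official scorer: exact string matching, the tuple layout, and the function name `pair_metrics` are assumptions made for illustration.

```python
def pair_metrics(pairs):
    """Compute pair accuracy and Collapse Rate (CR).

    pairs: list of (pred_a, gold_a, pred_b, gold_b) strings, one tuple per
    perturbation pair. Exact-match scoring is an assumption of this sketch.
    """
    both = 0          # pairs with both sides solved
    at_least_one = 0  # pairs with at least one side solved (CR denominator)
    collapsed = 0     # ...of those, pairs repeating one non-empty answer
    for pred_a, gold_a, pred_b, gold_b in pairs:
        ok_a, ok_b = pred_a == gold_a, pred_b == gold_b
        if ok_a and ok_b:
            both += 1
        if ok_a or ok_b:
            at_least_one += 1
            if pred_a == pred_b and pred_a != "":
                collapsed += 1
    pair_acc = both / len(pairs) if pairs else 0.0
    cr = collapsed / at_least_one if at_least_one else 0.0
    return pair_acc, cr

# Toy predictions (not benchmark data).
pairs = [
    ("3", "3", "4", "4"),              # both sides solved
    ("3", "3", "3", "4"),              # answer did not update: collapse
    ("left", "right", "left", "left"), # same answer, only second side right
    ("", "yes", "", "no"),             # empty answers, neither side solved
]
pair_acc, cr = pair_metrics(pairs)
print(pair_acc, cr)
```

Since the gold answer flips within a pair, a repeated answer can be right on at most one side, which is why collapse and pair accuracy capture different failure modes.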

2606.07867 2026-06-09 cs.CL New submission

The Cold-Start Safety Gap in LLM Agents

Chung-En Sun, Linbo Liu, Tsui-Wei Weng

Institutions: University of California, San Diego

AI summary: Finds that tool-calling LLM agents are most vulnerable at the start of a session and become safer as they carry out regular tasks; proposes the SODA benchmark and verifies that a warm-up strategy narrows the cold-start safety gap.

Abstract

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap

2606.07866 2026-06-09 cs.AI cs.MA New submission

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim

Institutions: Argonne National Laboratory; Idaho National Laboratory

AI summary: Proposes the Regulatory Context Protocol (RCP), an agent-to-agent communication standard that turns the human-to-human process between regulators and applicants into a structured, auditable agentic channel, cutting costs by 50-77% and timelines by 65% in nuclear reactor licensing.

Comments 26 pages, 10 figures

Abstract

Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that replaces the formal human-to-human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety-significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi-agent pilot. Against an 89M USD, 42-month Reconstructed Baseline, RCP cuts costs by 50-77 percent (21M-44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M-74M USD and 21 months. The residual cost-and-time gap is structural, not algorithmic: it traces to the inter-organizational pipeline that only an agent-to-agent standard can compress. The same bottleneck - formal multi-party review under strict auditability requirements - characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50-77 percent reduction implies savings on the order of 210-330 billion USD per year - approaching 1 percent of US GDP.
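The headline percentages can be cross-checked against the dollar and month figures quoted in the abstract. The short Python calculation below uses only those quoted numbers; the variable names are illustrative, and the extrapolation simply applies the 50-77 percent range to the stated paperwork burden, as the abstract does.

```python
# Figures quoted in the abstract (millions of USD, months, billions of USD).
baseline_cost, baseline_months = 89.0, 42.0
rcp_cost_low, rcp_cost_high = 21.0, 44.0
rcp_months = 15.0
paperwork_burden = 426.5

def reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

cost_cut_max = reduction(baseline_cost, rcp_cost_low)    # about 0.76
cost_cut_min = reduction(baseline_cost, rcp_cost_high)   # about 0.51
time_cut = reduction(baseline_months, rcp_months)        # about 0.64

# Extrapolated savings if a 50-77 percent cut applied to the whole burden.
savings_low = paperwork_burden * 0.50
savings_high = paperwork_burden * 0.77

print(f"cost cut: {cost_cut_min:.0%} to {cost_cut_max:.0%}")
print(f"time cut: {time_cut:.0%}")
print(f"savings: {savings_low:.0f}B to {savings_high:.0f}B USD per year")
```

The recomputed cuts land within about a point of the quoted 50-77 percent and 65 percent, which is consistent with the abstract reporting rounded inputs, and the savings range matches the quoted order of 210-330 billion USD.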

2606.07865 2026-06-09 cs.LG cs.AI physics.comp-ph stat.ML New submission

Instrumented data for causal scientific machine learning

Daniel N. Wilke

Institutions: University of the Witwatersrand

AI summary: Proposes instrumented data as a third option beyond observational data and template synthetic data: each datum carries the mechanistic model that produced it, an explicit uncertainty, and an executable family of counterfactuals. It is realised through V&V-instrumented image-to-simulation pipelines and supports causal interventions.

Comments 10 pages, 2 figures

Abstract

Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process but only for the simulator's template, not the case a user faces. We argue a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and a propagated aleatoric/epistemic uncertainty. The substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.

2606.07861 2026-06-09 cs.CV cs.AI New submission

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State

Institutions: University of Luxembourg; Foyer S.A.; Université Paris-Saclay

AI summary: Proposes the FineSightBench benchmark, which separates perception from reasoning tasks at scales of 4-48 pixels, and finds that VLM perception saturates at around 12 pixels while reasoning remains limited at larger scales, revealing fundamental deficiencies in fine-scale visual reasoning.

Comments 25 pages

Abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? To this end, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4--48px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.

2606.07856 2026-06-09 cs.LG New submission

Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

Igor Lima Strozzi

Institutions: Federal University of Rio de Janeiro

AI summary: On a free-verifier domain, using teacher-free self-training (STaR) and critic-guided selection, finds that self-training amplifies model capability but does not compound it, as confirmed by a pass@$K$ crossover diagnostic.

Abstract

When a language model trains on its own verified outputs, does it acquire capability beyond its base, or merely get better at expressing capability the base already had? We make the question decidable with a teacher-free "constellation" -- a generator, a learned critic, and a free exact verifier -- on a FlashFill-style "trapdoor" DSL, where verified (problem, solution) pairs are cheap to synthesize, hard to invert, and free to check exactly. Everything runs on one 4-bit Qwen3-4B on a single 24 GB GPU, with no model in the loop larger than the base. We report three findings. (i) Critic-guided selection beats verifier-filtered best-of-$k$ by $+9.1$ pp ($6/6$ seeds), with the entire gain localized to tasks where candidates disagree on held-out inputs. (ii) Per-round STaR self-training raises the ceiling but never accelerates -- the gain tracks remaining headroom and decelerates across $K=4$ independent training trajectories. (iii) The domain has no clean zero-capability frontier, so the usual "$0\% \to$ climb $=$ emergence" test is invalid here. A measured pass@$K$ crossover settles the diagnosis: the trained model wins at the operating budget (pass@$8$) but the base overtakes it at a large budget (pass@$64$) on every trajectory, so self-training concentrates probability mass rather than expanding reach. This is amplification, not compounding. ($K=4$ is indicative, not yet a robust across-trajectory CI.)
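A toy calculation shows how a pass@$K$ crossover can arise when training concentrates probability mass instead of expanding reach. The per-task success probabilities below are invented for illustration and are not taken from the paper; the sketch also assumes independent samples, so that pass@$K$ on a task with per-sample success probability p equals 1 - (1 - p)^K.

```python
def mean_pass_at_k(success_probs, k):
    """Average pass@k over tasks, assuming k independent samples per task."""
    return sum(1 - (1 - p) ** k for p in success_probs) / len(success_probs)

# Hypothetical models on 10 tasks:
# the base model has a small chance on every task (broad reach),
# the trained model is strong on 6 tasks and never solves the other 4.
base = [0.05] * 10
trained = [0.5] * 6 + [0.0] * 4

for k in (8, 64):
    print(k, round(mean_pass_at_k(base, k), 3), round(mean_pass_at_k(trained, k), 3))
```

In this toy setup the trained model leads at a budget of 8 samples while the base model overtakes it at 64, because the trained model's curve is capped by the tasks it can no longer reach. That is the qualitative pattern the abstract uses to distinguish amplification from compounding.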

2606.07855 2026-06-09 cs.RO math.OC New submission

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

Qiang Le, Yaguang Yang, Isaac E. Weintraub

Institutions: Hampton University; Air Force Research Laboratory

AI summary: Proposes a path-planning method based on Deep Deterministic Policy Gradient that models threats as circular no-go zones and uses a reward function to guide the agent in learning a state-to-action mapping and finding the largest set of safe starting points; it is faster than traditional optimal control methods and suited to real-time applications.

Comments 14 pages, 12 figures

Abstract

Path-planning for autonomous vehicles in threat-laden environments is a fundamental challenge because the problem is nonlinear and nonconvex even in the simplest scenarios. While traditional optimal control methods can be used to find ideal paths, the computation is often too slow for real-time decision-making. To solve this challenge, we propose a method based on Deep Deterministic Policy Gradient (DDPG) and model the threat as possibly multiple circular 'no-go' zones. A mission is regarded as a failure if the vehicle enters a restricted zone at any time or does not reach a neighborhood of the destination. The DDPG agent is trained through trial and error in a simulated environment, learning a direct mapping from its current state (position and heading) to a series of feasible actions that guide the agent to safely reach its destination. The reward function has three parts: (a) an attractive field centered at the final destination, (b) repulsive fields centered at the origins of the circular obstacles, and (c) a penalty on control energy consumption (the magnitude of heading change), which indirectly favors straight paths. The DDPG trains the agent using these incentives to find the largest possible set of starting points wherein a safe path to the destination is guaranteed. This provides critical information for mission planning, showing beforehand whether a task is achievable from a given starting point, assisting pre-mission planning activities. The approach is validated in simulation. A comparison between the DDPG method and a traditional optimal control (pseudo-spectral) method is carried out. The results show that the learning-based agent produces effective paths while being significantly faster, making it a better fit for real-time applications.
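The three-part reward described in the abstract can be sketched as follows. The abstract names the components but not their functional forms, so everything specific here is an assumption: the negative-distance attraction, the Gaussian-shaped repulsion, the absolute-value control penalty, the weights, and the function name `reward`.

```python
import math

def reward(pos, heading_change, goal, obstacles,
           w_goal=1.0, w_obs=1.0, w_ctrl=0.1):
    """Toy three-part reward: attraction + repulsion + control penalty.

    pos, goal: (x, y) tuples; obstacles: list of (cx, cy, radius);
    heading_change: control input in radians. All forms are assumed.
    """
    # (a) attractive field centered at the destination
    attract = -w_goal * math.dist(pos, goal)
    # (b) one repulsive field per circular no-go zone, centered at its origin
    repel = 0.0
    for cx, cy, radius in obstacles:
        d = math.dist(pos, (cx, cy))
        repel -= w_obs * math.exp(-(d / radius) ** 2)
    # (c) penalty on heading-change magnitude, which favors straight paths
    control = -w_ctrl * abs(heading_change)
    return attract + repel + control

goal = (10.0, 0.0)
zones = [(5.0, 0.0, 1.0)]
print(reward((5.0, 0.5), 0.0, goal, zones))
print(reward((5.0, 0.5), 0.0, goal, []))
```

In a DDPG setup this scalar would be returned by the simulated environment at each step; terminal conditions such as entering a zone or reaching the destination neighborhood would be handled separately.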

2606.07853 2026-06-09 cs.CL cs.AI New submission

Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese

Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

Institutions: Federal University of Rio de Janeiro; Toronto Metropolitan University

AI summary: Introduces ClinicalBr, the first bilingual clinical benchmark built from Brazilian case reports; evaluating four models shows that the Portuguese-English performance gap is task-dependent, with a clear English advantage in diagnosis retrieval that disappears in the other tasks.

Abstract

Large Language Models are transforming clinical decision support and its application in real-world scenarios. Yet most benchmarks are conducted in English, and cross-lingual evaluation is needed to tackle the language gaps in global access. We introduce ClinicalBr, the first bilingual clinical decision-making benchmark built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models: MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini, across both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.1 accuracy points. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.
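For readers unfamiliar with how an F1 score applies to a list-valued task such as exam recommendation, the sketch below computes a set-based F1 between recommended and reference exams. The abstract does not say how ClinicalBr matches items, so exact string matching and the function name `set_f1` are assumptions; the exam names are made up.

```python
def set_f1(predicted, reference):
    """Set-based F1 between predicted and reference items (exact match)."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0  # also covers empty predictions or references
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical case: four exams recommended, two in the reference report.
score = set_f1(["cbc", "ct", "mri", "ecg"], ["cbc", "biopsy"])
print(round(score, 3))
```

Under a metric of this kind, long recommendation lists that overlap only slightly with the reference are penalized through precision, which is one plausible reason a list-valued task could score low.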

2606.07835 2026-06-09 cs.LG New submission

Mitigating the Contractivity Trap in Diffusion ODEs via Stein Stabilization

Shigui Li, Delu Zeng

Institutions: University of Science and Technology of China

AI summary: Addresses the contractivity trap in large-step inference with the deterministic probability flow ODE of diffusion models by proposing the SteinDiff framework, which regularizes solver updates through a Stein-derived, geometry-aware residual correction mechanism and improves generation quality without reference samples.

Comments 32 pages, 12 figures. Accepted to ICML 2026

Abstract

A fundamental tension exists in the large-step inference of diffusion models via their deterministic probability flow ordinary differential equation (PF-ODE) trajectories, which we identify as the contractivity trap: efficient inference favors large step sizes, while aggressive steps and highly expressive denoisers can undermine contraction-based stability certificates for error suppression. To address this, we propose SteinDiff, a step-wise inference-time stabilization framework that employs Stein-derived corrections without requiring reference samples. Specifically, SteinDiff introduces a geometry-aware residual correction mechanism that regularizes large-step solver updates without retraining. To this end, we derive a closed-form Stein correction coefficient for step-wise solver adjustment, enabling reference-free adaptation to local data geometry. We further establish a score-controlled perturbation bound under distributional shifts and provide a complementary Stein perspective on EDM-style parameterizations. Extensive experiments demonstrate that SteinDiff mitigates severe artifacts and improves generative quality across large-step inference settings.

2606.07819 2026-06-09 cs.AI cs.LG New submission

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha

Institutions: UiT The Arctic University of Norway; University of Oslo, Norway

AI summary: Proposes an end-to-end framework that combines a mixed-precision quantization strategy minimizing global error with joint optimization of structural pruning and quantization policies, substantially lowering perplexity at ultra-low bit widths.

Abstract

Recently, the efficiency of Large Language Model (LLM) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding suboptimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.

2606.07818 2026-06-09 cs.CL cs.NE New submission

Representational Similarity and Model Behavior in Multi-Agent Interaction

Yujin Potter, Seun Eisape, Shiyang Lai, Alexander Huth, James Evans, Been Kim, Jacob Eisenstein, Dawn Song, Alane Suhr

Institutions: University of Washington

AI summary: Studies how representational similarity between pairs of LLMs affects cooperation and innovation, finding that high similarity promotes cooperation but reduces novelty, with early-layer similarity showing the strongest association.

Comments ICML 2026

Abstract

Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining interactions between large language models. In our experiments, 276 model pairs interact across eight games spanning both cooperation and novelty. We find that pairs with more similar representation spaces achieve significantly higher cooperation but exhibit reduced novelty and creativity. The effects of representational similarity on cooperation and novelty remain robust even after controlling for other factors such as performance disparity and model size. We also find that similarity in the early layers consistently shows the strongest association with cooperation and novelty, compared to the middle and later layers. This suggests that a central factor underlying these patterns could be the extent to which the two models share lexical and semantic grounding. Overall, representational similarity can be an important consideration in multi-agent system design.
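The abstract does not state which similarity measure the authors use. As an illustration of what comparing two models' representation spaces can look like, the sketch below implements linear centered kernel alignment (CKA), one widely used measure; choosing it here is an assumption, as are the random activation matrices standing in for real layer outputs.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (examples, features).

    Returns a value in [0, 1]; it is invariant to orthogonal transforms
    and isotropic scaling of either representation.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
acts_a = rng.standard_normal((200, 16))                # stand-in for model A, one layer
rotation, _ = np.linalg.qr(rng.standard_normal((16, 16)))
acts_rotated = 3.0 * acts_a @ rotation                 # same geometry, rotated and scaled
acts_b = rng.standard_normal((200, 16))                # unrelated representation

print(linear_cka(acts_a, acts_rotated), linear_cka(acts_a, acts_b))
```

A per-layer analysis like the one the abstract describes would compute such a score for each pair of models at matched depths, using activations collected on a shared set of inputs.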

2606.07813 2026-06-09 cs.RO cs.CV New submission

MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots

Aniket Patil, Mandeep Singh, Uday Girish Maradana, Nitin J. Sanket

Institutions: Worcester Polytechnic Institute; Perception and Autonomous Robotics (PeAR) Group, Robotics Engineering Department, Worcester Polytechnic Institute

AI summary: Proposes the MinNav navigation stack, which uses optical flow and its uncertainty to let tiny aerial robots fly past static and dynamic obstacles and through unknown-shaped gaps without prior knowledge; active exploration raises the success rate, which reaches 70% in experiments with far less computation than depth-based methods.

Comments Accepted for publication at ICRA 2026. Link to Project page https://pear.wpi.edu/research/minnav.html

Abstract

Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to its perfect balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve the success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps, with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth-based methods while requiring orders of magnitude less computation, and can readily run onboard tiny aerial robots. The accompanying video, supplementary material, code and dataset can be found at https://pear.wpi.edu/research/minnav.html

2606.07812 2026-06-09 cs.AI cs.CL 新提交

Scaling Participation in Modular AI Systems

模块化AI系统中的参与扩展

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

发表机构 * University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出参与扩展范式,通过多方贡献小模型构建模块化AI系统,在15项任务上比单体大语言模型提升高达15.4%,并展现涌现能力。

AI中文摘要

人类是由多面才能和需求组成的马赛克,任何真正智能的AI必须反映这种丰富性。然而,所有人使用的LLM却由少数人构建——一个集中化的单体AI模型市场,其结构上不适合捕捉人类知识、推理和价值观的多样性。本文介绍参与扩展,一种新范式,其中模块化AI系统通过不同利益相关者的贡献自下而上构建。参与者贡献基于自身兴趣和优先级训练的小模型;这些模型随后在模块化框架中作为组合式AI系统协作。参与式AI系统在15项任务(如推理和事实性)上比单体LLM高出最多15.4%,超越了比所有贡献组件总和更大的模型。进一步实验表明,参与式AI系统受益于贡献者多样性,显著改善每个贡献者的原始优先级,并展现出涌现能力,使其能解决超过15%的所有单个模型失败的问题。参与扩展为从单体现状向开放、自下而上、协作的AI未来过渡提供了技术基础。

英文摘要

Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

2606.07810 2026-06-09 cs.CL cs.AI cs.LG 新提交

SLMJury: Can Small Language Models Judge as Well as Large Ones?

SLMJury:小型语言模型能否像大型模型一样进行评判?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

发表机构 * LNMIIT Virginia Tech(弗吉尼亚理工大学)

AI总结 提出SLMJury框架,评估小型语言模型作为评判者的能力,发现领域依赖的过度思考效应、领域泛化差异、闭端与开端评判能力分离,以及多智能体辩论降低准确性。

AI中文摘要

大型语言模型(LLMs)被广泛用作评估模型输出的评判者,但其高成本、延迟和不透明性限制了可扩展性。我们引入SLMJury,一个评估小型语言模型(SLMs)作为评判者的框架,涵盖两种范式:闭端二元正确性和开端质量评分。我们在四个模型家族的16个SLM评判者(0.6B-14B参数)上,跨十个基准进行基准测试:八个闭端任务涵盖数学、科学和通用推理(每个配置N=64,824个判断),以及用于摘要和对话评分的SummEval和MT-Bench。我们将评判形式化为预算条件函数,并研究五个维度。得出四个发现。(1)过度思考效应是领域依赖的:对于大多数评判者,快速10令牌判决在数学评判上匹配或优于扩展推理(在有帮助的情况下提升2-7%),而推理在通用任务上胜出高达23%。(2)领域泛化区分了模型家族,数学到通用准确率差距从低于10%到接近40%不等。(3)闭端和开端评判依赖不同的能力:最佳二元评判者(Phi-4)在MT-Bench上降至第9名,而经过推理训练的模型则反转了这一顺序。(4)在反思-批判-改进(RCR)辩论协议下,多智能体辩论在所有测试配置中降低了准确性,而顶级评判者抵抗六种对抗性人格的方差<=0.55%。可靠的自动评估不需要大型专有模型,但没有单一的SLM占主导地位。排行榜可在https://anishh15.github.io/SLMJury/获取,我们的框架代码和pip包公开在https://github.com/anishh15/SLMJury和https://pypi.org/project/slmjury/。

英文摘要

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

2606.07808 2026-06-09 cs.AI 新提交

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

指令层级失效之处:诊断与修复推理语言模型的故障

Sanjay Kariyappa, G. Edward Suh

发表机构 * NVIDIA(英伟达)

AI总结 提出白盒诊断框架,将指令层级失效定位为指令识别、冲突解决和响应实现三个环节,并设计两种免训练自监控机制,将违规率降低81-99%。

AI中文摘要

部署在智能体工作流中的推理语言模型必须遵循指令层级:当来自不同来源的指令冲突时,模型应服从最高权限的适用指令。现有基准主要端到端地衡量这种行为,询问最终响应是否合规。然而,不合规的响应可能源于几种不同的故障:模型可能无法识别上下文中的相关指令,无法解决已识别指令之间的冲突,或者在推理中正确解决了冲突但仍产生违规响应。我们引入了一个白盒诊断框架,将指令层级失效定位为指令识别、冲突解决和响应实现,使故障更具可解释性。我们在IHEval和IHChallenge的长上下文改编版本上评估了三个推理模型——Gemma-4-31B-IT、Qwen3.6-35B-A3B和Claude Sonnet 4.6,发现主要故障模式因模型、任务和上下文长度而异。基于模型在明确提示时通常能检测冲突并输出违规的观察,我们提出了两种免训练的自监控机制:用于生成前低延迟冲突检测的并行输入监控器,以及用于响应级审查和修复的顺序输出监控器。在Gemma-4-31B-IT、Claude Sonnet 4.6和GPT-5.3上,最强的监控器将规则遵循违规率降低了81-99%,其中GPT-5.3在静态攻击下降低86%,在自适应攻击下降低45%。

英文摘要

Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models--Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6--on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.

2606.07805 2026-06-09 cs.AI cs.MA 新提交

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

超越古德哈特定律:多智能体系统中合规性评估的动态基准

Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu

发表机构 * Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Monash University(莫纳什大学)

AI总结 针对多智能体系统在压力下可能违反安全规则的问题,提出MAC-Bench动态对抗基准,通过SERV流水线生成无污染场景,并引入CSR和MG指标评估前沿模型的合规性。

AI中文摘要

大型语言模型(LLMs)从被动助手向自主、可执行智能体的快速演进引入了关键的操作风险。当前大多数评估框架忽视了程序合规性,导致“马基雅维利”行为——智能体为最大化奖励而策略性地违反安全规则,这是古德哈特定律的直接体现。为解决这一盲点,我们提出MAC-Bench,一个动态对抗基准,旨在评估多智能体系统在现实压力下的程序对齐。我们提出了SERV(种子-进化-精炼-验证)流水线,一种“智能体即基准”范式,将非结构化法律文本转化为可执行、无污染的场景。通过合成全息沙盒环境并注入校准的社会工程压力向量,MAC-Bench迫使智能体在任务成功与监管遵守之间进行帕累托最优权衡。我们引入了新指标:合规加权成功率(CSR)和马基雅维利差距(MG),并对最先进的前沿模型进行了全面评估,揭示了成功与合规之间的普遍权衡。

英文摘要

The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to "Machiavellian" behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed - Evolve - Refine - Verify) pipeline, an "Agent-as-a-Benchmark" paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models that reveals pervasive trade-offs between success and compliance.

2606.07801 2026-06-09 cs.AI 新提交

Improving Multimodal Reasoning via Worst Dimension Optimization

通过最差维度优化改进多模态推理

Haocheng Lv, Huaping Zhang, Qiuchi Li, Lei Li, Chunxiao Gao

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出最差维度优化方法,通过识别并优先优化推理路径中最差的约束维度,提升多模态推理的整体有效性。

AI中文摘要

多模态推理需要一条在广泛约束(从视觉基础到逻辑一致性)下保持完整性的路径。然而,当前的过程奖励模型关注于启发式定义的奖励,这些奖励平等地权衡这些因素,可能导致主导因素掩盖个别维度的失败,从而无法保证推理过程的一般有效性。

英文摘要

Multimodal reasoning requires a path that retains integrity across a wide range of constraints, from visual grounding to logical consistency. However, current Process Reward Models rely on heuristically defined rewards that weigh these factors equally, so dominant factors can conceal failures in individual dimensions, and the validity of the reasoning process as a whole is not guaranteed.

2606.07798 2026-06-09 cs.AI cs.LG q-bio.NC 新提交

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

在资源受限环境中利用常规数据重建和预测阿尔茨海默病患者的疾病轨迹

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy

发表机构 * Yardi School of Artificial Intelligence (ScAI), Indian Institute of Technology Delhi(印度理工学院德里分校亚迪人工智能学院) Department of Neurology, Vardhman Mahavir Medical College and Safdarjung Hospital(瓦尔丹·马哈维尔医学院和萨夫达戎医院神经内科) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系)

AI总结 提出GNOVA框架,结合GRU编码器和神经ODE解码器的变分自编码器,利用常规临床数据(无需神经影像或生物标志物)实现认知评分的双向预测、插值/外推及不确定性估计,在ADNI数据集上取得低误差。

AI中文摘要

阿尔茨海默病是一种进行性神经退行性疾病,其进展在不同患者间差异显著。现有工作旨在预测患者未来的认知状态,但很少关注从既往就诊中重建状态。此外,当前研究中,量化预测不确定性仍未被充分探索,且依赖于MRI、PET和CSF等昂贵模态,限制了在资源有限环境中的部署。在本研究中,我们的主要目标是:第一,从不规则就诊中双向预测认知评分,以呈现完整的疾病轨迹;第二,实现插值和外推能力,以辅助临床医生做出知情预后决策;第三,为所有预测提供校准良好的不确定性估计;最后,利用常规就诊中可用的模态实现上述目标。我们提出了一个统一框架GNOVA:GRU-神经ODE变分自编码器。该架构在变分自编码器框架内结合了门控循环单元编码器和神经ODE解码器。在我们的工作中,我们预测了CDR-SB和MMSE评分。GRU编码器允许在任何时间点输入任意数量的数据。神经ODE解码器执行连续估计,允许在任何期望的时间点进行插值和外推。变分自编码器允许预测中的不确定性估计。我们使用了ADNI数据集中1727名患者超过10年的数据;该模型在无需任何神经影像或生物标志物数据的情况下,对CDR-SB和MMSE评分分别实现了1.35和2.28的平均绝对误差。特征消融研究表明,年龄、BMI和APOE4状态是强预测因子。所提出的框架能够重建不完整的患者病史并预测未来的认知状态。

英文摘要

Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, current research leaves the quantification of predictive uncertainty underexplored and relies on costly modalities such as MRI, PET, and CSF, which limits deployment in resource-limited settings. Our objectives are fourfold: first, to predict cognitive scores bidirectionally from irregular visits so as to present the complete disease trajectory; second, to enable interpolation and extrapolation to assist clinicians in informed prognostic decision-making; third, to provide well-calibrated uncertainty estimates for all predictions; and finally, to achieve these objectives using only the modalities available during routine visits. We propose a unified framework, GNOVA: a GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit (GRU) encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE scores. The GRU encoder accepts any number of inputs at any time points. The Neural ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The variational autoencoder formulation allows for uncertainty estimation in predictions. We worked with 1,727 patients from the ADNI dataset followed over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

2606.07790 2026-06-09 cs.LG 新提交

Byzantine Cheap Talk: Adversarial Resilience and Topology Effects in LLM Coordination Games

拜占庭廉价谈话:LLM协调博弈中的对抗韧性与拓扑效应

Aya El Mir, Martin Takáč, Salem Lahlou

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 研究多智能体LLM在协调博弈中面对拜占庭攻击和通信拓扑限制的脆弱性,发现智能体无法集体适应背叛,且显式限制拓扑会破坏合作,而隐式限制则不影响。

Comments Accepted at NETYS 2026 (The International Conference on Networked Systems)

AI中文摘要

多智能体LLM系统越来越依赖通信协议进行协调,但它们在对抗和结构约束下的鲁棒性仍然知之甚少。基于先前工作表明廉价谈话通道能够在LLM协调博弈中实现合作,我们在一个4人Stag Hunt博弈中,跨越六个模型系列和720次试验,研究了两个脆弱性类别。首先,当拜占庭智能体发出合作信号但背叛时,非拜占庭智能体在一轮内检测到背叛,但未能集体适应:相当一部分智能体尽管反复被利用仍继续合作,由于博弈的一致同意支付结构而无法恢复协调。其次,显式限制通信拓扑会完全破坏合作,而应用相同的隐式限制则保持近乎完美的合作。这表明协调失败源于智能体关于隐藏信息的元推理,而非信息损失本身。我们识别出两种在所有模型队列中复现的稳定行为原型:背叛倾向模型在背叛后永久切换,以及合作坚持模型以显著的个人成本继续合作。这些发现揭示了具体的安全漏洞:通信通道可被利用为对抗注入向量,且向智能体披露网络拓扑即使在没有任何对手的情况下也会削弱协调。

英文摘要

Multi-agent LLM systems increasingly rely on communication protocols for coordination, yet their robustness under adversarial and structural constraints remains poorly understood. Building on prior work showing that cheap-talk channels enable cooperation in LLM coordination games, we investigate two vulnerability classes in a 4-player Stag Hunt across six model families and 720 trials. First, when Byzantine agents signal cooperation but defect, non-Byzantine agents detect the betrayal within one round yet fail to adapt collectively: a substantial fraction continue cooperating despite repeated exploitation, unable to recover coordination due to the game's unanimity payoff structure. Second, explicitly restricting communication topology collapses cooperation, while applying identical restrictions silently preserves near-perfect cooperation. This establishes that coordination failure stems from agents' meta-reasoning about hidden information, not information loss itself. We identify two stable behavioral archetypes that replicate across all model cohorts: Defection-Prone models that switch permanently after betrayal, and Cooperation-Persistent models that continue cooperating at significant individual cost. These findings reveal concrete security vulnerabilities: communication channels can be exploited as adversarial injection vectors, and disclosing network topology to agents can degrade coordination even without any adversary present.
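To make the "unanimity payoff structure" concrete, here is a minimal sketch of a 4-player Stag Hunt in which the stag reward is paid only if every player chooses stag. The abstract does not give the payoff values, so `stag_reward`, `hare_reward`, `sucker_payoff`, and the function name are hypothetical placeholders chosen only for illustration.

```python
from typing import List

def stag_hunt_payoffs(
    actions: List[str],
    stag_reward: float = 4.0,    # hypothetical value, not from the paper
    hare_reward: float = 2.0,    # hypothetical value, not from the paper
    sucker_payoff: float = 0.0,  # hypothetical value, not from the paper
) -> List[float]:
    """Payoffs for one round of an N-player Stag Hunt with a unanimity rule.

    Each action is "stag" (cooperate) or "hare" (defect). Stag hunters are
    rewarded only when all players choose stag; hare always pays a safe amount.
    """
    all_stag = all(a == "stag" for a in actions)
    payoffs = []
    for a in actions:
        if a == "hare":
            payoffs.append(hare_reward)
        elif all_stag:
            payoffs.append(stag_reward)
        else:
            payoffs.append(sucker_payoff)
    return payoffs

# One defector (for example a Byzantine agent) is enough to wipe out
# the payoff of every cooperating player under this rule.
cooperative_round = stag_hunt_payoffs(["stag"] * 4)
betrayed_round = stag_hunt_payoffs(["stag", "stag", "stag", "hare"])
```

Under these assumed values, a single defection drops each cooperator from the stag reward to the sucker payoff, which is one way to read why agents that keep cooperating after a betrayal cannot restore coordination on their own.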

2606.07789 2026-06-09 cs.LG stat.ML 新提交

A Framework for Evaluating and Benchmarking Concept Drift Detection Methods

概念漂移检测方法的评估与基准测试框架

Vitor Cerqueira, Heitor Murilo Gomes, Marco Heyden, Bernhard Pfahringer, Albert Bifet

发表机构 * University of Coimbra(科英布拉大学) Victoria University of Wellington(惠灵顿维多利亚大学) Commerzbank(德国商业银行) University of Waikato(怀卡托大学) AI Institute, University of Waikato(怀卡托大学人工智能研究所)

AI总结 提出一个包含漂移模拟、时序感知评估和超参数优化协议的基准测试框架,在7个真实数据集上评估14种漂移检测方法,揭示其优劣并建立基线性能。

Comments Accepted in KDD'26

AI中文摘要

数据流挖掘从根本上受到概念漂移的挑战,其中分布变化可能降低模型性能。尽管漂移检测方法层出不穷,但该领域的进展受到不一致评估实践的阻碍:研究依赖于过度简化的合成数据生成器,采用不兼容的指标,并且在超参数选择上缺乏透明度,使得公平比较变得困难。我们通过一个新颖的基准测试框架来解决这一差距,该框架包含三个贡献:(1)一种漂移模拟方法,通过蒙特卡洛试验将受控的分布变化注入真实世界数据集,在保留真实数据复杂性的同时实现监督评估;(2)一种用于漂移检测的评估协议,具有时序感知标准,包括推导出跨流可比较的新指标(例如,F1检测分数、归一化检测时间);(3)我们提倡一种留一数据集超参数优化协议,用于漂移检测方法,以促进跨异构流动态的配置鲁棒性。我们在7个真实世界数据集上对14种广泛使用的漂移检测方法进行了基准测试,涵盖4种漂移类型(类别先验、标签交换、特征排列、特征过滤),每种类型均包括突变和渐变转换。我们的实验结果揭示了当前漂移检测方法的优缺点,同时为该领域的未来研究建立了基线性能指标。所有代码和实验均公开可用。

英文摘要

Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent evaluation practices: studies rely on oversimplified synthetic data generators, adopt incompatible metrics, and lack transparency in hyperparameter selection, making fair comparisons difficult. We address this gap with a novel benchmarking framework comprising three contributions: (1) a drift simulation method that injects controlled distributional changes into real-world datasets via Monte Carlo trials, enabling supervised evaluation while preserving real-world data complexity; (2) an evaluation protocol for drift detection with timing-aware criteria, including the derivation of new metrics (e.g., F1 detection score, normalized detection time) that are comparable across streams; and (3) a leave-one-dataset-out hyperparameter optimization protocol for drift detection methods that promotes configuration robustness across heterogeneous stream dynamics. We benchmark 14 widely used drift detection methods on 7 real-world datasets across 4 drift types (class prior, label swap, feature permutation, feature filtering), each under both abrupt and gradual transitions. Our experimental results provide insights into the strengths and weaknesses of current drift detection approaches while establishing baseline performance metrics for future research in this area. All code and experiments are publicly available.
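As a rough illustration of the drift-simulation idea in contribution (1), the sketch below injects a label-swap drift into the labels of an existing stream at a chosen position, either abruptly or gradually. The function name `inject_label_swap`, its parameters, and the linear ramp used for gradual drift are assumptions made for this example and are not taken from the paper's released code.

```python
import numpy as np

def inject_label_swap(y, drift_start, width=0, class_a=0, class_b=1, seed=0):
    """Return a copy of y in which class_a and class_b are swapped after drift_start.

    width == 0 gives an abrupt drift. width > 0 ramps the per-sample swap
    probability linearly from 0 to 1 over `width` samples (an assumed
    gradual-transition scheme, chosen here for simplicity).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    y_out = y.copy()
    idx = np.arange(len(y))
    if width > 0:
        p_swap = np.clip((idx - drift_start) / width, 0.0, 1.0)
    else:
        p_swap = (idx >= drift_start).astype(float)
    # Draw one uniform number per sample and swap where it falls below p_swap.
    swap = rng.random(len(y)) < p_swap
    is_a = (y == class_a) & swap
    is_b = (y == class_b) & swap
    y_out[is_a] = class_b
    y_out[is_b] = class_a
    return y_out

# Abrupt drift at position 5 in a toy all-zero label stream.
abrupt = inject_label_swap(np.zeros(10, dtype=int), drift_start=5)
# Gradual drift starting at 20 and completing 30 samples later.
gradual = inject_label_swap(np.zeros(100, dtype=int), drift_start=20, width=30)
```

Because the drift position is known by construction, a detector's alarms can then be scored against it, which is what makes the evaluation supervised; repeating the injection with different seeds and positions would correspond to the Monte Carlo trials mentioned in the abstract.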

2606.07783 2026-06-09 cs.CL 新提交

Evaluating RAG Reliability under Clean, Misleading, and Mixed Retrieval

评估RAG在干净、误导和混合检索下的可靠性

Sevgi Yigit-Sert

发表机构 * Ankara University(安卡拉大学)

AI总结 提出评估协议,通过参数覆盖和置信度指标,系统测试RAG系统在干净、有毒和混合证据下处理参数知识与检索证据冲突的鲁棒性。

AI中文摘要

检索增强生成(RAG)通过将答案基于检索到的证据,被广泛用于提高大型语言模型(LLMs)的事实可靠性。然而,在信息丰富的误导环境中,检索到的内容可能包含看似合理但不正确的信息,引发了对基于RAG的信息访问系统可靠性的担忧。在这项工作中,我们提出了一种评估协议,以系统地测试RAG系统如何处理参数知识与从具有不同数量误导信息的上下文中检索到的证据之间的冲突。我们针对模型在无检索时也能正确回答的事实性问题,使用干净、有毒和混合证据来测试系统。所提出的分析框架结合了参数覆盖和置信度指标,以评估误导信息何时以及如何影响LLMs的生成过程。本研究旨在为信息混乱场景下RAG系统的鲁棒性提供见解。

英文摘要

Retrieval-Augmented Generation (RAG) is widely used to improve the factual reliability of large language models (LLMs) by grounding answers in retrieved evidence. In misinformation-rich environments, however, retrieved content may include plausible but incorrect information, raising concerns about the reliability of RAG-based information access systems. In this work, we propose an evaluation protocol to systematically test how a RAG system handles conflicts between parametric knowledge and evidence retrieved from contexts with varying amounts of misleading information. We target factoid questions that the model answers correctly even without retrieval, and use them to test the system with clean, poisoned, and mixed evidence. The proposed analytical framework combines parametric override and confidence metrics to assess when and how misleading information affects the generation process of LLMs. This study aims to provide insights into the robustness of RAG systems in information disorder scenarios.

2606.07780 2026-06-09 cs.AI cs.CV cs.LG 新提交

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

土地覆盖与洪水类型控制基于卫星的洪水测绘在不同全球洪水事件中的检测极限

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

发表机构 * Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心) Space and Earth Science Data Analysis(空间与地球科学数据分析) NASA Marshall Space Flight Center(NASA马歇尔太空飞行中心)

AI总结 研究利用Prithvi-EO-2.0模型在19个全球洪水事件中评估卫星洪水测绘的检测能力,发现检测精度取决于土地覆盖和洪水类型,农田和河流洪水检测效果较好,而树木覆盖和建成区检测近乎为零。

AI中文摘要

洪水是最具破坏性的自然灾害之一,在气候变化下其频率增加使得基于卫星的淹没测绘对灾害响应至关重要。基于卫星档案预训练的地理空间基础模型提供了地理可迁移性,但其在多样、未见事件中的操作可靠性尚未被表征。在此,我们在跨越六大洲、八个气候带和六种洪水机制的19个分布外洪水事件(2017-2025年)中部署Prithvi-EO-2.0,并针对两个独立参考产品进行验证。检测精度共同依赖于土地覆盖和洪水类型,农田产生最高一致性(IoU=52%),河流事件检测最强(F1=0.69),而树木覆盖和建成区显示近乎零检测(IoU=4%),无论洪水机制如何。双参考验证揭示,明显的模型误差部分反映了参考产品之间的定义不一致而非检测失败。迭代流水线测试识别出23种故障模式,其中流水线工程在初始误差中占主导地位,超过模型容量。这些发现为操作卫星洪水测绘建立了环境依赖的检测边界。

英文摘要

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.
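The agreement scores quoted above (IoU and F1) are standard overlap metrics between a predicted flood mask and a reference mask. A minimal sketch of how they can be computed, optionally restricted to one land-cover class, follows; the function name `iou_and_f1` and the `mask` argument are illustrative choices, not part of the paper's pipeline.

```python
import numpy as np

def iou_and_f1(pred, ref, mask=None):
    """Compute (IoU, F1) of the flooded class for two binary rasters.

    pred, ref: array-likes of the same shape, truthy where flooded.
    mask: optional boolean array selecting the pixels to score, e.g. all
          pixels of one land-cover class (an assumption for illustration).
    """
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    if mask is not None:
        mask = np.asarray(mask, dtype=bool)
        pred, ref = pred[mask], ref[mask]
    tp = int(np.sum(pred & ref))    # flooded in both
    fp = int(np.sum(pred & ~ref))   # predicted flooded, reference dry
    fn = int(np.sum(~pred & ref))   # missed flood pixels
    union = tp + fp + fn
    iou = tp / union if union else float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn) if union else float("nan")
    return iou, f1

pred = [[1, 1], [0, 0]]
ref = [[1, 0], [1, 0]]
iou, f1 = iou_and_f1(pred, ref)
```

For the same counts, F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU; this is worth keeping in mind when comparing the IoU figures per land cover with the F1 figures per flood type.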

2606.07778 2026-06-09 cs.CL 新提交

Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora

解锁潜在价值:基于分类法从低层级网络语料库中恢复高性能数据

Neeraj Varshney, Sanket Lokegaonkar, Nasser Zalmout, Qingyu Yin, Priyanka Nigam, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 提出一种分类驱动框架,通过引入时效性和文化特异性两个新维度,结合两阶段过滤方法,从低质量网络数据中恢复高性能子集,在推理和编码任务上显著超越未过滤的高质量数据。

AI中文摘要

主流的预训练网络数据筛选流程将文档质量压缩为单一复合分数,系统性地遗漏了评分器权重不足维度上的高价值内容。我们提出一个分类驱动的框架,通过沿复合分数无法捕捉的语义有意义维度进行过滤来恢复这一价值。首先,基于ESSENTIAL-WEB分类法,我们引入两个新维度:时效性和文化特异性,它们与现有维度的成对NMI较低。我们使用Qwen2.5 32B对1400万文档进行标注,并蒸馏成一个轻量级0.5B模型。为实现快速的语料库级标注,我们额外在E5嵌入上训练了一个7300万参数的多任务MLP,推理吞吐量提升50倍。其次,为应对过滤配置的组合爆炸,我们引入一个计算高效的两阶段框架:第一阶段在小规模上识别最强维度信号;第二阶段从最优表现者中构建并评估合取和析取复合过滤器——以全规模定律成本的一小部分识别高性能配置。将所选过滤器应用于被降级的网络数据,分类过滤后的子集在性能上超过其未过滤基线,甚至超越最高质量层级。在中层数据上,我们的最佳过滤器在推理、编码和知识基准上分别比未过滤基线提升12.1%、9.5%和2.0%,在推理和编码上分别超过未过滤的顶层数据6.7%和13.7%。此外,来自典型生产阈值以下两个层级的数据,其过滤后的子集在推理和编码上比未过滤基线提升22.3%和19.5%,在编码基准上超越顶层数据。这些结果表明,大量潜在价值仍锁定在被降级的网络数据中,而多维分类过滤是解锁这些价值的原理性且计算高效的钥匙。

英文摘要

Dominant web data curation pipelines for pretraining collapse document quality into a single composite score, systematically missing high-value content along dimensions the scorer underweights. We present a taxonomy-driven framework that recovers this value by filtering along semantically meaningful dimensions that composite scores fail to capture. First, building on the ESSENTIAL-WEB taxonomy, we introduce two novel dimensions: timeliness and cultural specificity, both of which show low pairwise NMI with existing ones. We annotate 14M documents using Qwen2.5 32B and distill into a lightweight 0.5B model. To enable rapid corpus-wide annotation, we additionally train a 73M multi-task MLP on E5 embeddings, achieving 50x inference throughput. Second, to navigate the combinatorial explosion of filter configurations, we introduce a compute-efficient two-pass framework: Pass 1 identifies the strongest dimension signals at small scale; Pass 2 constructs and evaluates conjunctive and disjunctive compound filters from the top performers - identifying high-performing configurations at a fraction of full scaling-law cost. Applying the selected filters to deprioritized web data, taxonomy-filtered subsets outperform their unfiltered baselines and even surpass the highest-quality tier. On mid-tier data, our best filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, exceeding unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Furthermore, filtered data from two tiers below the typical production threshold improves by 22.3% on reasoning and 19.5% on coding over its unfiltered baseline, surpassing top-tier data on coding benchmarks. These results establish that vast latent value remains locked in deprioritized web data, and that multi-dimensional taxonomy filtering is a principled, compute-efficient key to unlocking it.
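The conjunctive and disjunctive compound filters of Pass 2 can be pictured as boolean combinations of single-dimension filters over per-document taxonomy labels. The sketch below is only an illustration under assumed names: the helpers `label_is`, `all_of`, and `any_of`, the key names, and the label values ("evergreen", "global", and so on) are hypothetical and do not come from the paper or from the ESSENTIAL-WEB taxonomy.

```python
from typing import Callable, Dict, Iterable, List

Doc = Dict[str, str]            # taxonomy dimension -> predicted label
Filter = Callable[[Doc], bool]

def label_is(dimension: str, allowed: Iterable[str]) -> Filter:
    """Single-dimension filter: keep a document whose label is in `allowed`."""
    allowed_set = set(allowed)
    return lambda doc: doc.get(dimension) in allowed_set

def all_of(*filters: Filter) -> Filter:
    """Conjunctive compound filter: every component must accept."""
    return lambda doc: all(f(doc) for f in filters)

def any_of(*filters: Filter) -> Filter:
    """Disjunctive compound filter: at least one component must accept."""
    return lambda doc: any(f(doc) for f in filters)

# Hypothetical labels for three documents.
docs: List[Doc] = [
    {"timeliness": "evergreen", "cultural_specificity": "global"},
    {"timeliness": "dated", "cultural_specificity": "global"},
    {"timeliness": "dated", "cultural_specificity": "regional"},
]

timely = label_is("timeliness", ["evergreen"])
general = label_is("cultural_specificity", ["global"])

kept_and = [d for d in docs if all_of(timely, general)(d)]
kept_or = [d for d in docs if any_of(timely, general)(d)]
```

A conjunction shrinks the kept subset while a disjunction enlarges it, which is why evaluating single dimensions first (Pass 1) and combining only the top performers (Pass 2) keeps the number of candidate configurations manageable.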

2606.07775 2026-06-09 cs.CV 新提交

DALE-CT: Depth-Aware Foundation Models for Computed Tomography

DALE-CT: 用于计算机断层扫描的深度感知基础模型

Evan W. Damron, Mahmut S. Gokmen, Mitchell A. Klusty, Caroline N. Leach, Emily B. Collier, V. K. Cody Bumgardner

发表机构 * University of Kentucky(肯塔基大学)

AI总结 提出DALE-CT,一种基于LeJEPA的2D切片模型,通过3D深度感知预训练(利用解剖掩膜和异常标注)提升表示质量,在CT多异常检测中达到与3D视觉语言模型近似的性能。

Comments 9 pages, 2 figures

AI中文摘要

自监督学习(SSL)的最新突破,如潜在欧几里得联合嵌入预测架构(LeJEPA),以及视觉编码器与语言模型集成的成功,推动了计算机断层扫描(CT)中对适应性强、高容量视觉编码器的需求。在这项工作中,我们探索了基于2D切片的架构作为处理体积CT数据的原生3D模型的灵活替代方案。使用CT-RATE数据集,我们从头开始训练了DALE-CT(深度感知潜在欧几里得计算机断层扫描),这是一个完全使用LeJEPA构建的2D模型系列,并将其与持续预训练的DINOv2基线进行了比较。为了提高表示质量,我们开发了一种新颖的3D深度感知预训练策略,该策略由来自自动解剖掩膜和人工标注异常的双重辅助监督密集支持。在使用多实例学习(MIL)进行多异常检测的线性探测评估下,该双监督模型(DALE-CT-2S)的冻结主干实现了0.833的宏AUROC。这一性能表明,从头开始使用显著更少的数据且无需文本监督,即可达到与最先进的3D视觉语言模型近乎相当的水平。为确保可重复性,所有训练代码、评估脚本和模型权重均已公开。

英文摘要

Recent breakthroughs in self-supervised learning (SSL), such as the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA), alongside successes in integrating visual encoders with language models, have driven the demand for adaptable, high-capacity vision encoders in Computed Tomography (CT). In this work, we explore 2D slice-based architectures as a flexible alternative to native 3D models for processing volumetric CT data. Using the CT-RATE dataset, we trained DALE-CT (Depth-Aware Latent-Euclidean Computed Tomography), a 2D model family built entirely from scratch using LeJEPA, and compared its performance against a continually pre-trained DINOv2 baseline. To enhance representation quality, we developed a novel 3D depth-aware pre-training strategy anchored by dense auxiliary supervision from both automated anatomical masks and human-annotated abnormalities. Under linear probe evaluation with Multiple Instance Learning (MIL) for multi-abnormality detection, the frozen backbone of this dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833. This performance demonstrates near-parity with state-of-the-art 3D vision-language models, achieved entirely from scratch with significantly less data and no textual supervision. To ensure reproducibility, all training code, evaluation scripts, and model weights have been made publicly available.

2606.07770 2026-06-09 cs.LG 新提交

Contrast encodes inductive bias: separating slow noise from dynamics in predictive representation learning

对比编码归纳偏置:在预测性表示学习中将慢噪声与动力学分离

Paarth Gulati, Ilya Nemenman

发表机构 * Emory University(埃默里大学)

AI总结 针对自监督方法在潜在空间预测动力学时混淆慢噪声与信号的问题,本文分析其根源为跨轨迹采样负样本的对比目标,提出通过轨迹内采样负样本消除预测捷径,从而强制编码动力学相关变量。

AI中文摘要

在潜在空间中学习表示并预测动力学的自监督方法(如JEPA)已被证明会混淆缓慢变化的噪声与它们旨在捕捉的动力学信号。具体来说,当噪声特征在每个轨迹内近似保持不变时,对比预测目标会优先编码这些特征,而不是控制系统的真实潜在变量。学习到的表示因此被轨迹特定噪声主导,下游性能随噪声强度下降,且即使增加训练轨迹的数量和持续时间也不会改善。我们认为这种失败是目标本身的属性,由一系列跨轨迹采样负样本的对比预测目标共享。为了说明这种普遍性,我们在两种设置中研究了失败模式及其补救措施:在合成移动点数据集上的标准SimCLR风格JEPA,以及在刚体摆电影上的DySIB(一种最近引入的用于提取动力学物理可解释表示的方法)。当负样本改为在单个轨迹内采样时,慢噪声无法区分该轨迹内的帧,从而消除了预测捷径。同时在许多这样的轨迹上训练一个编码器,迫使它编码与动力学相关的变量,更长的轨迹即使在强慢噪声下也能产生更好的表示。我们的结果指向了在动力学表示学习中设计对比预测目标的原则,特别是对于具有噪声实验观测的物理系统。

英文摘要

Self-supervised methods that learn representations and predict dynamics fully in the latent space, such as JEPA, have been shown to confuse slowly varying noise with the dynamical signals they aim to capture. Specifically, when noise features remain approximately constant within each trajectory, contrastive predictive objectives preferentially encode these features instead of the true latent variables governing the system. The learned representation then becomes dominated by trajectory-specific noise, so downstream performance degrades with noise strength and does not improve even as the number and duration of training trajectories increase. We argue that this failure is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories. To illustrate this generality, we study the failure mode and its remedy in two settings: a standard SimCLR-style JEPA on a synthetic moving-dot dataset, and DySIB, a recently introduced method designed for extracting physically interpretable representations of dynamics, on movies of a rigid-body pendulum. When negatives are instead sampled within a single trajectory, the slow noise can no longer distinguish frames within that trajectory, removing the predictive shortcut. Training one encoder simultaneously on many such trajectories then forces it to encode the variables relevant for the dynamics, with longer trajectories yielding better representations even for strong slow noise. Our results point toward principles for designing contrastive predictive objectives in dynamical representation learning, especially for physical systems with noisy experimental observations.
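The effect of within-trajectory negatives can be checked on a toy InfoNCE loss. The sketch below is a generic dot-product InfoNCE, not the JEPA or DySIB objectives studied in the paper; the function name and the dot-product similarity are assumptions made for illustration. With all negatives drawn from one trajectory, a feature that is constant along that trajectory adds the same value to every logit, so the softmax, and hence the loss, cannot use it.

```python
import numpy as np

def within_trajectory_infonce(pred, target, temperature=1.0):
    """InfoNCE loss over the frames of ONE trajectory.

    pred[t] is the predicted latent for the frame whose encoded latent is
    target[t]. Row t uses target[t] as the positive and every other row of
    target (same trajectory) as a negative. Similarity is a plain dot product.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    logits = pred @ target.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
pred = rng.normal(size=(6, 3))
target = pred + 0.01 * rng.normal(size=(6, 3))
base_loss = within_trajectory_infonce(pred, target)

# Append a "slow noise" feature that is constant within the trajectory.
slow = np.full((6, 1), 5.0)
noisy_loss = within_trajectory_infonce(np.hstack([pred, slow]),
                                       np.hstack([target, slow]))
```

Here `noisy_loss` matches `base_loss` up to floating-point error, since the constant feature shifts all logits equally. If negatives came from other trajectories with different constants, the same feature would separate positives from negatives and lower the loss, which is the shortcut described in the abstract.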

2606.07766 2026-06-09 cs.CV cs.AI 新提交

Quantum-Enhanced Similarity Measures for Polarimetric Materials Classification

量子增强的极化材料分类相似度度量

Sara Shojaei, Seyed Mohamad Ali Tousi, Emma Bennett, Param Sangani, Ali Shiri Sichani, Ilker Ersoy, Hadi Ali-Akbarpour, Filiz Bunyak, G. N. DeSouza

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出量子-经典混合流水线,将极化材料分类转化为点匹配问题,利用SWAP测试估计嵌入向量保真度,实现竞争性分类精度和开放集判别能力。

AI中文摘要

我们提出了一种用于极化材料分类的量子-经典混合流水线,将其视为点匹配问题。包含偏振光反射的体素立方体用于训练编码器,为立方体的体素生成32维嵌入。在推理时,丢弃编码器头部,将嵌入编码为量子态的概率幅。然后,SWAP测试电路估计查询立方体的每个32D嵌入与锚点立方体数据集之间的保真度。聚合的保真度作为材料相似度分数,具有最高聚合保真度的锚点类别被视为查询材料的类别。我们在一个包含23种材料(每种约800个样本)的数据集上评估了我们的方法,这些材料来自其Mueller矩阵。比较了所提出的量子SWAP测试的点匹配方法和使用最优传输的经典分类器。我们的结果展示了竞争性的分类精度以及开放集判别潜力,使其成为基于NISQ的材料识别的可行途径。

英文摘要

We present a quantum--classical hybrid pipeline for polarimetric material classification that casts the task as a point-matching problem. Voxel cubes containing polarized light reflections are used to train an encoder that produces 32-dimensional embeddings for the voxels of each cube. At inference, the encoder head is discarded and the embeddings are encoded as probability amplitudes of quantum states. Next, a SWAP-test circuit estimates the fidelity between each of the 32D embeddings from the query cube and those of a dataset of anchor cubes. The aggregated fidelity serves as a material similarity score, and the class of the anchor with the highest aggregated fidelity is taken as the class of the queried material. We evaluate our approach on a dataset of 23 materials ($\approx$800 samples each) derived from their Mueller matrices. We compare the proposed quantum SWAP-test point matching against a classical point-matching classifier based on Optimal Transport. Our results demonstrate competitive classification accuracy and potential for open-set discrimination, establishing the approach as a viable path toward NISQ-based material recognition.
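For readers unfamiliar with the SWAP test: for two pure states the ancilla is measured as 0 with probability 1/2 + 1/2·|⟨ψ|φ⟩|², so the fidelity can be estimated as 2·P(0) − 1. The sketch below simulates this classically with NumPy rather than building a circuit; the function names, the shot-sampling scheme, and the simple L2 normalization used for amplitude encoding are assumptions for illustration and are not the authors' implementation. A 32-dimensional embedding fits in the amplitudes of a 5-qubit state.

```python
import numpy as np

def amplitude_encode(x):
    """Normalize a real vector so its entries can serve as state amplitudes."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return x / norm

def swap_test_fidelity(a, b, shots=None, seed=0):
    """Fidelity |<psi|phi>|^2 between the amplitude encodings of a and b.

    shots=None returns the exact value. Otherwise the ancilla outcome is
    sampled `shots` times with P(0) = 1/2 + fidelity/2, and the estimate
    2*P_hat(0) - 1 is returned (clipped at 0).
    """
    psi, phi = amplitude_encode(a), amplitude_encode(b)
    fidelity = float(np.abs(np.vdot(psi, phi)) ** 2)
    if shots is None:
        return fidelity
    p0 = 0.5 + 0.5 * fidelity
    rng = np.random.default_rng(seed)
    zeros = rng.binomial(shots, p0)
    return max(0.0, 2.0 * zeros / shots - 1.0)

query = np.arange(1, 33, dtype=float)   # toy 32D embedding
same = swap_test_fidelity(query, 3.0 * query)
orthogonal = swap_test_fidelity([1.0, 0.0], [0.0, 1.0])
sampled = swap_test_fidelity(query, query, shots=1000)
```

Two consequences of this encoding are visible in the example: fidelity ignores vector magnitude (scaling `query` by 3 leaves it at 1), and a shot-based estimate carries sampling noise that shrinks roughly as one over the square root of the number of shots.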

2606.07760 2026-06-09 cs.LG 新提交

scCBGM: Interpretable Single-Cell Counterfactual Editing

scCBGM:可解释的单细胞反事实编辑

Alma Andersson, Aya Abdelsalam Ismail, Edward De Brouwer, Doron Haviv, Tommaso Biancalani, Kyunghyun Cho, Gabriele Scalia, Aïcha BenTaieb, Hector Corrada Bravo

发表机构 * University of Copenhagen(哥本哈根大学) University of Cambridge(剑桥大学) University of Amsterdam(阿姆斯特丹大学) University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学) University of Washington(华盛顿大学) University of Oxford(牛津大学)

AI总结 提出scCBGM框架,通过概念瓶颈架构和解耦惩罚实现单细胞反事实编辑,在组合泛化和反事实预测上表现优异。

Comments Accepted to ICML 2026; code at https://github.com/almaan/scCBGM

AI中文摘要

理解细胞表型及其对扰动的响应对于疾病生物学和治疗设计至关重要。单细胞RNA测序能够在细胞分辨率下进行表征,但条件的组合空间使得穷举实验映射不可行。我们引入了单细胞概念瓶颈生成模型(scCBGM),这是一个用于对单个细胞进行可解释且精确的反事实编辑的框架。scCBGM通过解码器跳跃连接和促进无维度约束解耦的交叉协方差惩罚,将概念瓶颈架构适应于单细胞数据。我们将该框架扩展到流匹配模型,从而在编码-解码和生成两种模式下实现概念引导的编辑。为了进行严格评估,我们开发了一个具有真实反事实的合成基准。在多个真实数据集上,scCBGM在组合泛化和反事实预测方面表现出优越性能,并通过合成数据上的细胞级验证和真实数据集上的群体级基准得到了支持。

英文摘要

Understanding cellular phenotypes and how they respond to perturbations is critical for disease biology and therapeutic design. Single-cell RNA sequencing enables characterization at cellular resolution, yet the combinatorial space of conditions makes exhaustive experimental mapping infeasible. We introduce single-cell Concept Bottleneck Generative Models (scCBGM), a framework for interpretable and precise counterfactual editing of individual cells. scCBGM adapts concept bottleneck architectures for single-cell data through decoder skip connections and a cross-covariance penalty that promotes disentanglement without dimensional constraints. We extend the framework to flow matching models, enabling concept-guided editing in both encoding-decoding and generation regimes. To enable rigorous evaluation, we develop a synthetic benchmark with ground-truth counterfactuals. Across multiple real datasets, scCBGM demonstrates superior performance in combinatorial generalization and counterfactual prediction, supported by cell-level validation on synthetic data and population-level benchmarks on real datasets.
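One common way to write a cross-covariance penalty is the squared Frobenius norm of the cross-covariance matrix between two groups of latent variables. The sketch below assumes the penalty is applied between concept activations and the remaining (residual) latent dimensions; that pairing, the function name, and the normalization by n − 1 are assumptions for illustration, since the abstract does not specify the exact form used in scCBGM.

```python
import numpy as np

def cross_covariance_penalty(concepts, residual):
    """Squared Frobenius norm of the cross-covariance between two latent blocks.

    concepts: (n_cells, k) concept activations.
    residual: (n_cells, m) remaining latent dimensions (assumed pairing).
    The two blocks may have different widths, so no dimensional
    constraint is imposed on either of them.
    """
    c = np.asarray(concepts, dtype=float)
    z = np.asarray(residual, dtype=float)
    c = c - c.mean(axis=0)
    z = z - z.mean(axis=0)
    cov = c.T @ z / (c.shape[0] - 1)   # shape (k, m)
    return float(np.sum(cov ** 2))

concepts = np.array([[1.0], [-1.0], [1.0], [-1.0]])
uncorrelated = np.array([[1.0], [1.0], [-1.0], [-1.0]])
entangled_penalty = cross_covariance_penalty(concepts, concepts)
clean_penalty = cross_covariance_penalty(concepts, uncorrelated)
```

The penalty is zero when the two blocks are linearly uncorrelated and grows as residual dimensions start to carry concept information. It only measures linear dependence, so it encourages disentanglement without guaranteeing it.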