arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.07881 2026-06-09 cs.LG New submission

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

Itay Elam, Eliron Rahimi, Avi Mendelson, Chaim Baskin

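Affiliations: Technion - Israel Institute of Technology; Ben-Gurion University of the Negev

AI summary: Proposes PACI, which uses local gradient accumulation to bound version drift and achieve bubble-free asynchronous pipeline parallelism. In GPT-style language-model pretraining it matches the stability and perplexity of synchronous 1F1B-flush, keeps the pipeline fully utilized, and improves training time-to-accuracy by up to 1.69×.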

English abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to $1.69\times$ over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
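One way to read the version-control idea is as simple counting. The sketch below is not PACI's scheduler. It assumes a stage applies one optimizer update every `accum_steps` micro-batch slots and that a micro-batch's backward pass runs `delay` slots after its forward pass at that stage; both names and the counting model are illustrative assumptions, not taken from the paper.

```python
def updates_crossed(forward_slot: int, delay: int, accum_steps: int) -> int:
    """Count optimizer updates applied between a micro-batch's forward and backward pass.

    Assumed model (hypothetical): the stage applies an update at slots
    accum_steps, 2 * accum_steps, ..., so the weight version seen at slot s
    is s // accum_steps.
    """
    backward_slot = forward_slot + delay
    return backward_slot // accum_steps - forward_slot // accum_steps


def max_version_drift(delay: int, accum_steps: int, horizon: int = 1000) -> int:
    """Worst-case number of updates crossed by any micro-batch in the first `horizon` slots."""
    return max(updates_crossed(s, delay, accum_steps) for s in range(horizon))


if __name__ == "__main__":
    for k in (1, 4, 8):
        print(f"accum_steps={k}: worst-case drift = {max_version_drift(7, k)}")
```

Under this counting model, the worst-case drift is the delay divided by the accumulation length, rounded up, so a longer accumulation window directly tightens the bound on how many updates a micro-batch can cross.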

2606.07878 2026-06-09 cs.LG New submission

Still: Amortized KV Cache Compaction in a Single Forward Pass

Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby

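Affiliations: Baseten

AI summary: Proposes Still, a lightweight Perceiver layer that compacts the KV cache in a single forward pass. It balances speed and quality across compression ratios of 8× to 200× and context lengths of 8k to 128k, and exceeds the strongest baseline by 8-22 points on long-context tasks.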

English abstract

The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory. Existing compaction methods satisfy only part of this requirement: selection methods are lightweight but subset-bound, while synthesis methods are expressive but rely on per-context optimization. Here we introduce Still, a small per-layer Perceiver trained once against a frozen base model that produces compact keys and values in a single forward pass. On Qwen and Gemma models, Still occupies the favorable side of the speed--quality frontier across compression ratios from $8\times$ to $200\times$ and context lengths from $8$k to $128$k. On the long-context RULER grid, Still exceeds the strongest baseline by 8--22 points. The same compact cache also supports free-form summarization, preserving most of the full-context gain on HELMET and winning a pairwise LongBench summarization comparison against KV-Distill. Because compaction is a forward pass, Still can be applied iteratively, entering a long-horizon regime unavailable to per-context methods. We show that amortization makes long-context cache compaction tractable, and synthesis makes its compact state useful at extreme compression.

2606.07877 2026-06-09 cs.CL New submission

Whose Norms? Disentangling Cultural and Personal Alignment in Large Language Models

Angana Borah, Isabelle Augenstein, Rada Mihalcea

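Affiliations: University of Michigan - Ann Arbor; University of Copenhagen

AI summary: Proposes the PACT framework for evaluating how large language models trade off cultural norms against personal preferences. Models are influenced more by country context than by age or gender, and human-LLM alignment fails to capture cultural pluralism.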

Comments
Preprint under review
English abstract

Large language models are increasingly used for social decision-making situations that require balancing cultural norms with personal preferences. For example, a user preferring honesty might ask whether to correct a coworker publicly when local norms favor indirect feedback. Yet existing research studies cultural alignment and personalization largely separately. We introduce PACT, the Personal-Preference and Cultural-Norm Trade-off framework, which evaluates whether models choose to follow a cultural norm or allow personal preferences. We find that LLMs vary in how rigidly they enforce cultural norms, with behavior shifted more by country context (7.8%) than age (1%) and gender (0.7%) and shifting non-uniformly after instruction tuning. Furthermore, our five-country human study on PACT shows that culture-following in humans is mainly driven by scenario country, with the lowest agreement when participants judge their own cultural contexts, showing within-culture pluralism. Finally, human-LLM alignment experiments show that models can match majority choices, but fail to capture response distributions and uncertainty (with best correlations reaching only 0.24). Together, these findings motivate alignment evaluations that go beyond majority to capture cultural pluralism and disagreement in social judgment.

2606.07874 2026-06-09 cs.AI New submission

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant

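Affiliations: University of Oxford; Cohere

AI summary: Studies how LLM safety judges depend on in-context information and how steerable they are to differing safety definitions, finding that they struggle to adjust their evaluations when the context or safety definition contradicts their own priors.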

English abstract

LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond human agreement in simple, static benchmarks. We therefore investigate two under-explored but crucial properties of LLMs-as-judges: their susceptibility to relying on in-context information, and their steerability to differing safety definitions, which may not align with their internal safety priors. We evaluate the safety judging abilities of many generalist LLMs and safety-specific judges, and investigate the impact of task demonstrations, novel in-context information, and changing safety definitions. We find that while LLM-judges can learn from new information, they are broadly unlikely to adjust their evaluations if the context or safety definition contradicts their prior.

2606.07872 2026-06-09 cs.CV New submission

VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?

Didi Zhu, Changrui Chen, Stefanos Zafeiriou, Jiankang Deng

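Affiliations: Imperial College London

AI summary: Proposes the VisualFLIP benchmark, which uses paired image perturbations to test whether multimodal large language models truly rely on task-critical visual evidence, and finds that correct predictions and evidence dependence can come apart.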

English abstract

When a multimodal large language model answers a visual reasoning question correctly, is the prediction actually supported by the task-critical visual evidence? Correct answers can coexist with flawed reasoning, making accuracy alone an incomplete test of grounding. We introduce VisualFLIP, a paired benchmark with 1,374 images arranged as same-question perturbation pairs across cardinality, attribute, spatial, and logic tasks. Each pair keeps the question fixed but minimally changes the evidence so the gold answer deterministically flips. We evaluate 24 MLLMs with pair accuracy, which requires solving both sides of a pair, and Collapse Rate (CR), which measures how often a model that solves at least one side repeats the same non-empty answer for both images. Together, these metrics show that paired correctness and evidence dependence are related but distinct: capable models can still fail to update after task-critical visual changes, and collapse becomes more severe for some models when the edited image follows an earlier answer in a sequential setting. Further details are available on our project page: https://didizhu-judy.github.io/VisualFLIP/
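The two metrics described in the abstract can be sketched as follows. This is a minimal reading of the prose definitions, not the benchmark's released scoring code: the `PairResult` record, exact-match string comparison, and the choice to return 0.0 when no pair is eligible are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairResult:
    """Predictions and gold answers for the two images of one perturbation pair (hypothetical record)."""
    pred_a: str
    gold_a: str
    pred_b: str
    gold_b: str


def pair_accuracy(results: List[PairResult]) -> float:
    """Fraction of pairs where both sides are answered correctly."""
    solved_both = sum(r.pred_a == r.gold_a and r.pred_b == r.gold_b for r in results)
    return solved_both / len(results)


def collapse_rate(results: List[PairResult]) -> float:
    """Among pairs with at least one side solved, the fraction where the model
    repeats the same non-empty answer for both images."""
    eligible = [r for r in results if r.pred_a == r.gold_a or r.pred_b == r.gold_b]
    if not eligible:
        return 0.0  # assumption: define the rate as 0 when nothing is eligible
    collapsed = sum(r.pred_a == r.pred_b and r.pred_a != "" for r in eligible)
    return collapsed / len(eligible)


# Made-up example pairs; the gold answer flips within each pair.
example = [
    PairResult("3", "3", "4", "4"),           # both sides solved
    PairResult("3", "3", "3", "4"),           # one side solved, answer repeated: collapse
    PairResult("red", "blue", "red", "green"),  # neither side solved: not eligible
    PairResult("2", "2", "", "5"),            # one side solved, empty second answer
]

if __name__ == "__main__":
    print(pair_accuracy(example), collapse_rate(example))
```

Because the gold answer flips within a pair, a repeated answer can be right on at most one side, which is why a model can score well on single images and still show a high collapse rate.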

2606.07867 2026-06-09 cs.CL New submission

The Cold-Start Safety Gap in LLM Agents

Chung-En Sun, Linbo Liu, Tsui-Wei Weng

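Affiliations: University of California, San Diego

AI summary: Finds that tool-calling LLM agents are most vulnerable at the start of a session and become safer as they complete regular tasks. Introduces the SODA benchmark and shows that a warm-up strategy narrows the cold-start safety gap.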

English abstract

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap

2606.07866 2026-06-09 cs.AI cs.MA New submission

Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study

Akshay J. Dave, David Grabaskas, Joseph A. Renevitz, Richard B. Vilim

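Affiliations: Argonne National Laboratory; Idaho National Laboratory

AI summary: Proposes the Regulatory Context Protocol (RCP), an agent-to-agent communication standard that turns the human-to-human process between regulators and applicants into a structured, auditable agentic channel, cutting costs by 50-77% and timelines by 65% in nuclear reactor review.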

Comments
26 pages, 10 figures
English abstract

Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in combined regulator and applicant labor. We present the Regulatory Context Protocol (RCP), an Agent-to-Agent communication standard that replaces the formal human-to-human pipeline between regulators and applicants with a structured, auditable agentic channel, while preserving human oversight at safety-significant decision points. The protocol is calibrated against an analysis of 1,236 documents from U.S. Nuclear Regulatory Commission advanced reactor dockets and demonstrated with a working multi-agent pilot. Against an 89M USD, 42-month Reconstructed Baseline, RCP cuts costs by 50-77 percent (21M-44M USD) and timelines by 65 percent (15 months). Without a shared protocol, Standalone Agents reach only 54M-74M USD and 21 months. The residual cost-and-time gap is structural, not algorithmic: it traces to the inter-organizational pipeline that only an agent-to-agent standard can compress. The same bottleneck - formal multi-party review under strict auditability requirements - characterizes pharmaceutical approvals, environmental permitting, financial supervision, and aviation certification. The US regulatory paperwork burden carries a 426.5 billion USD annual opportunity cost; replicated broadly, the projected 50-77 percent reduction implies savings on the order of 210-330 billion USD per year - approaching 1 percent of US GDP.
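The headline percentages can be re-derived from the figures quoted in the abstract. The check below uses only those quoted numbers; the variable names are illustrative, and small differences from the stated ranges presumably come from rounding of the dollar amounts.

```python
def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction from `baseline` to `new`, in percent."""
    return 100.0 * (baseline - new) / baseline


# Figures quoted in the abstract (millions of USD, months, billions of USD).
baseline_cost_musd = 89.0
rcp_cost_range_musd = (21.0, 44.0)
baseline_months, rcp_months = 42.0, 15.0
paperwork_burden_busd = 426.5

cost_cuts = [percent_reduction(baseline_cost_musd, c) for c in rcp_cost_range_musd]
time_cut = percent_reduction(baseline_months, rcp_months)
savings_busd = [paperwork_burden_busd * frac for frac in (0.50, 0.77)]

if __name__ == "__main__":
    print(f"cost cut: {cost_cuts[1]:.1f}% to {cost_cuts[0]:.1f}%")
    print(f"time cut: {time_cut:.1f}%")
    print(f"projected savings: {savings_busd[0]:.1f}B to {savings_busd[1]:.1f}B USD per year")
```

The computed cost cut of roughly 50.6% to 76.4% and time cut of about 64.3% are consistent with the quoted 50-77 percent and 65 percent, and the savings range of about 213 to 328 billion USD matches the "210-330 billion" order of magnitude.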

2606.07865 2026-06-09 cs.LG cs.AI physics.comp-ph stat.ML New submission

Instrumented data for causal scientific machine learning

Daniel N. Wilke

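Affiliations: University of the Witwatersrand

AI summary: Proposes instrumented data as a third option alongside observational data and template synthetic data: each datum carries the mechanistic model that produced it, an explicit uncertainty, and an executable family of counterfactuals. It is realized through V&V-instrumented image-to-simulation pipelines and supports causal interventions.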

Comments
10 pages, 2 figures
English abstract

Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process but only for the simulator's template, not the case a user faces. We argue a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and a propagated aleatoric/epistemic uncertainty. The substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.

2606.07861 2026-06-09 cs.CV cs.AI New submission

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State

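Affiliations: University of Luxembourg; Foyer S.A.; Université Paris-Saclay

AI summary: Proposes the FineSightBench benchmark, which separates perception from reasoning tasks at scales of 4-48 px. Perception in vision-language models saturates around 12 px while reasoning remains limited even at larger scales, revealing fundamental deficiencies in fine-scale visual reasoning.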

Comments
25 pages
English abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? As such, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4--48px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.

2606.07856 2026-06-09 cs.LG New submission

Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

Igor Lima Strozzi

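Affiliations: Federal University of Rio de Janeiro

AI summary: On a free-verifier domain, uses teacher-free self-training (STaR) and critic-guided selection to show that self-training amplifies model capability but does not compound it, as confirmed by a Pass@$K$ crossover diagnostic.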

English abstract

When a language model trains on its own verified outputs, does it acquire capability beyond its base, or merely get better at expressing capability the base already had? We make the question decidable with a teacher-free "constellation" -- a generator, a learned critic, and a free exact verifier -- on a FlashFill-style "trapdoor" DSL, where verified (problem, solution) pairs are cheap to synthesize, hard to invert, and free to check exactly. Everything runs on one 4-bit Qwen3-4B on a single 24 GB GPU, with no model in the loop larger than the base. We report three findings. (i) Critic-guided selection beats verifier-filtered best-of-$k$ by $+9.1$ pp ($6/6$ seeds), with the entire gain localized to tasks where candidates disagree on held-out inputs. (ii) Per-round STaR self-training raises the ceiling but never accelerates -- the gain tracks remaining headroom and decelerates across $K=4$ independent training trajectories. (iii) The domain has no clean zero-capability frontier, so the usual "$0\% \to$ climb $=$ emergence" test is invalid here. A measured pass@$K$ crossover settles the diagnosis: the trained model wins at the operating budget (pass@$8$) but the base overtakes it at a large budget (pass@$64$) on every trajectory, so self-training concentrates probability mass rather than expanding reach. This is amplification, not compounding. ($K=4$ is indicative, not yet a robust across-trajectory CI.)
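A pass@$K$ crossover of the kind described can be illustrated with the widely used unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $c$ of $n$ samples are correct. Whether the paper uses this exact estimator is not stated in the abstract, and the per-task counts below are made up purely to show the effect.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over tasks."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts out of N = 64 on four tasks.
N = 64
base_counts = [2, 2, 2, 2]       # thin but broad coverage
trained_counts = [32, 32, 0, 0]  # probability mass concentrated on two tasks

if __name__ == "__main__":
    for k in (8, 64):
        print(k, mean_pass_at_k(base_counts, N, k), mean_pass_at_k(trained_counts, N, k))
```

In this toy setup the concentrated model wins at the small budget and the broad model overtakes it at the large one, which is the signature the abstract reads as amplification rather than expanded reach.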

2606.07855 2026-06-09 cs.RO math.OC New submission

Path Planning Using Deep Deterministic Policy Gradient: A Reinforcement Learning Approach

Qiang Le, Yaguang Yang, Isaac E. Weintraub

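Affiliations: Hampton University; Air Force Research Laboratory

AI summary: Proposes a path-planning method based on Deep Deterministic Policy Gradient that models threats as circular no-go zones and uses a reward function to guide the agent in learning a state-to-action mapping and finding the largest set of safe starting points. It is faster than traditional optimal control methods and suited to real-time applications.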

Comments
14 pages, 12 figures
English abstract

Path planning for autonomous vehicles in threat-laden environments is a fundamental challenge because the problem is nonlinear and nonconvex even in the simplest scenarios. While traditional optimal control methods can be used to find ideal paths, they are often too slow for real-time decision-making. To address this challenge, we propose a method based on Deep Deterministic Policy Gradient (DDPG) and model the threat as one or more circular 'no-go' zones. A mission is regarded as a failure if the vehicle enters a restricted zone at any time or does not reach a neighborhood of the destination. The DDPG agent is trained through trial and error in a simulated environment, learning a direct mapping from its current state (position and heading) to a series of feasible actions that guide the agent safely to its destination. The reward function has three parts: (a) an attractive field centered at the final destination, (b) repulsive fields centered at the origins of the circular obstacles, and (c) a penalty on control energy consumption (the magnitude of heading change) that indirectly favors straight paths. DDPG trains the agent using these incentives to find the largest possible set of starting points from which a safe path to the destination is guaranteed. This provides critical information for pre-mission planning by showing beforehand whether a task is achievable from a given starting point. The approach is validated in simulation and compared with a traditional optimal control (pseudo-spectral) method. The results show that the learning-based agent produces effective paths while being significantly faster, making it a better fit for real-time applications.
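The three-part reward can be sketched as a single function. The abstract does not give the functional forms, so everything specific here is an assumption: a negative-distance attractive term, Gaussian-shaped repulsive terms scaled by each zone's radius, an absolute-value heading penalty, and the weights `w_goal`, `w_obs`, and `w_ctrl`.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]
Circle = Tuple[float, float, float]  # (center_x, center_y, radius)


def reward(pos: Point, heading_change: float, goal: Point,
           obstacles: Sequence[Circle],
           w_goal: float = 1.0, w_obs: float = 1.0, w_ctrl: float = 0.1) -> float:
    """Illustrative shaped reward with the three parts named in the abstract."""
    # (a) attractive field: reward grows as the vehicle nears the destination
    attract = -w_goal * math.dist(pos, goal)
    # (b) repulsive fields: strongest at each no-go zone's center, fading with distance
    repel = 0.0
    for cx, cy, radius in obstacles:
        d = math.dist(pos, (cx, cy))
        repel -= w_obs * math.exp(-((d / radius) ** 2))
    # (c) control penalty: larger heading changes cost more, favoring straight paths
    control = -w_ctrl * abs(heading_change)
    return attract + repel + control


if __name__ == "__main__":
    goal, zones = (10.0, 0.0), [(5.0, 0.0, 1.0)]
    print(reward((0.0, 0.0), 0.0, goal, zones))
    print(reward((5.0, 0.0), 0.3, goal, zones))
```

A shaped reward like this only encourages avoidance; the mission-failure rule in the abstract (entering a zone or missing the destination) would still need to be enforced by the environment's termination logic.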

2606.07853 2026-06-09 cs.CL cs.AI New submission

Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese

Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

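Affiliations: Federal University of Rio de Janeiro; Toronto Metropolitan University

AI summary: Introduces ClinicalBr, the first bilingual clinical benchmark, built from Brazilian case reports. Evaluating four models shows that the Portuguese-English performance gap is task-dependent: English has a clear advantage in diagnosis retrieval, and the gap disappears on the other tasks.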

English abstract

Large language models are transforming clinical decision support and its application in real-world scenarios. Yet most benchmarks are conducted in English, and cross-lingual evaluation is needed to address language gaps in global access. We introduce ClinicalBr, the first bilingual benchmark for clinical decision-making built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models, MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini, across both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.1 accuracy points. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.

2606.07822 2026-06-09 cs.CL cs.AI cs.LG New submission

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

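Affiliations: Carnegie Mellon University; Google; Scale AI

AI summary: Proposes the ACUTE protocol, which operationalizes language model activations to estimate confidence while balancing calibration and informativeness. It outperforms strong baselines on multiple-choice question answering, tool calling, and scientific document summarization, improving calibration, utility, and trustworthiness.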

Comments
Accepted to ICML 2026
English abstract

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.
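The point that calibration alone can be gamed is easy to see with a standard binned expected calibration error (ECE). The sketch below does not implement EURO or ACUTE, whose definitions are not given in the abstract; the outcome labels and confidence values are made up for illustration.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Binned ECE: weighted mean of |average confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical outcomes: 7 of 10 model outputs are correct.
correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
base_rate = sum(correct) / len(correct)

constant_conf = [base_rate] * len(correct)           # always predicts the base rate
sharp_conf = [0.9 if ok else 0.1 for ok in correct]  # separates right from wrong

if __name__ == "__main__":
    print(expected_calibration_error(constant_conf, correct))
    print(expected_calibration_error(sharp_conf, correct))
```

The constant predictor gets an ECE of essentially zero while telling the user nothing about which outputs to trust; the sharp predictor has a worse ECE yet ranks every correct output above every incorrect one, which is the gap a metric balancing calibration and informativeness is meant to close.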

2606.07819 2026-06-09 cs.AI cs.LG New submission

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha

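Affiliations: UiT The Arctic University of Norway; University of Oslo, Norway

AI summary: Proposes an end-to-end framework that combines a mixed-precision quantization strategy minimizing global error with joint optimization of structural pruning and quantization policies, substantially reducing perplexity at ultra-low bit widths.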

English abstract

Recently, the efficiency of Large Language Model (LLM) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding suboptimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.

2606.07813 2026-06-09 cs.RO cs.CV New submission

MinNav: Minimalist Navigation Using Optical Flow For Active Tiny Aerial Robots

Aniket Patil, Mandeep Singh, Uday Girish Maradana, Nitin J. Sanket

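Affiliations: Worcester Polytechnic Institute; Perception and Autonomous Robotics (PeAR) Group, Robotics Engineering Department, Worcester Polytechnic Institute

AI summary: Proposes MinNav, a navigation stack that uses optical flow and its uncertainty to let tiny aerial robots fly past static and dynamic obstacles and through unknown-shaped gaps without prior knowledge. Active exploration raises the success rate, which reaches 70% in experiments, with far less computation than depth-based methods.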

Comments
Accepted for publication at ICRA 2026. Link to Project page https://pear.wpi.edu/research/minnav.html
English abstract

Navigation using a monocular camera is pivotal for autonomous operation on tiny aerial robots due to its perfect balance of versatility, cost and accuracy. In this paper, we introduce MinNav, a navigation stack based on optical flow and its uncertainty to fly through a scene with static and dynamic obstacles and unknown-shaped gaps without any prior knowledge of the scene components and/or their locations/ordering. We further improve the success rate by using the activeness of the robot to move around in an exploratory way to find obstacles and navigate. We successfully evaluate and demonstrate the proposed approach in many real-world experiments in various environments with static and dynamic obstacles and unknown-shaped gaps, with an overall success rate of 70%. To the best of our knowledge, this is the first solution to tackle all the aforementioned navigation cases without prior knowledge using a monocular camera. Our approach is on par in performance with depth-based methods while requiring orders of magnitude less computation, and it can readily run onboard tiny aerial robots. The accompanying video, supplementary material, code and dataset can be found at https://pear.wpi.edu/research/minnav.html

2606.07812 2026-06-09 cs.AI cs.CL 新提交

Scaling Participation in Modular AI Systems

模块化AI系统中的参与扩展

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

发表机构 * University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出参与扩展范式,通过多方贡献小模型构建模块化AI系统,在15项任务上比单体大语言模型提升高达15.4%,并展现涌现能力。

详情
AI中文摘要

人类是由多面才能和需求组成的马赛克,任何真正智能的AI必须反映这种丰富性。然而,所有人使用的LLM却由少数人构建——一个集中化的单体AI模型市场,其结构上不适合捕捉人类知识、推理和价值观的多样性。本文介绍参与扩展,一种新范式,其中模块化AI系统通过不同利益相关者的贡献自下而上构建。参与者贡献基于自身兴趣和优先级训练的小模型;这些模型随后在模块化框架中作为组合式AI系统协作。参与式AI系统在15项任务(如推理和事实性)上比单体LLM高出最多15.4%,超越了比所有贡献组件总和更大的模型。进一步实验表明,参与式AI系统受益于贡献者多样性,显著改善每个贡献者的原始优先级,并展现出涌现能力,使其能解决超过15%的所有单个模型失败的问题。参与扩展为从单体现状向开放、自下而上、协作的AI未来过渡提供了技术基础。

英文摘要

Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few: a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

2606.07810 2026-06-09 cs.CL cs.AI cs.LG 新提交

SLMJury: Can Small Language Models Judge as Well as Large Ones?

SLMJury:小型语言模型能否像大型模型一样进行评判?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

发表机构 * LNMIIT Virginia Tech(弗吉尼亚理工大学)

AI总结 提出SLMJury框架,评估小型语言模型作为评判者的能力,发现领域依赖的过度思考效应、领域泛化差异、闭端与开端评判能力分离,以及多智能体辩论降低准确性。

详情
AI中文摘要

大型语言模型(LLMs)被广泛用作评估模型输出的评判者,但其高成本、延迟和不透明性限制了可扩展性。我们引入SLMJury,一个评估小型语言模型(SLMs)作为评判者的框架,涵盖两种范式:闭端二元正确性和开端质量评分。我们在四个模型家族的16个SLM评判者(0.6B-14B参数)上,跨十个基准进行基准测试:八个闭端任务涵盖数学、科学和通用推理(每个配置N=64,824个判断),以及用于摘要和对话评分的SummEval和MT-Bench。我们将评判形式化为预算条件函数,并研究五个维度。得出四个发现。(1)过度思考效应是领域依赖的:对于大多数评判者,快速10令牌判决在数学评判上匹配或优于扩展推理(在有帮助的情况下提升2-7%),而推理在通用任务上胜出高达23%。(2)领域泛化区分了模型家族,数学到通用准确率差距从低于10%到接近40%不等。(3)闭端和开端评判依赖不同的能力:最佳二元评判者(Phi-4)在MT-Bench上降至第9名,而经过推理训练的模型则反转了这一顺序。(4)在反思-批判-改进(RCR)辩论协议下,多智能体辩论在所有测试配置中降低了准确性,而顶级评判者抵抗六种对抗性人格的方差<=0.55%。可靠的自动评估不需要大型专有模型,但没有单一的SLM占主导地位。排行榜可在https://anishh15.github.io/SLMJury/获取,我们的框架代码和pip包公开在https://github.com/anishh15/SLMJury和https://pypi.org/project/slmjury/。

英文摘要

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges, quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

2606.07808 2026-06-09 cs.AI 新提交

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

指令层级失效之处:诊断与修复推理语言模型的故障

Sanjay Kariyappa, G. Edward Suh

发表机构 * NVIDIA(英伟达)

AI总结 提出白盒诊断框架,将指令层级失效定位为指令识别、冲突解决和响应实现三个环节,并设计两种免训练自监控机制,将违规率降低81-99%。

详情
AI中文摘要

部署在智能体工作流中的推理语言模型必须遵循指令层级:当来自不同来源的指令冲突时,模型应服从最高权限的适用指令。现有基准主要端到端地衡量这种行为,询问最终响应是否合规。然而,不合规的响应可能源于几种不同的故障:模型可能无法识别上下文中的相关指令,无法解决已识别指令之间的冲突,或者在推理中正确解决了冲突但仍产生违规响应。我们引入了一个白盒诊断框架,将指令层级失效定位为指令识别、冲突解决和响应实现,使故障更具可解释性。我们在IHEval和IHChallenge的长上下文改编版本上评估了三个推理模型——Gemma-4-31B-IT、Qwen3.6-35B-A3B和Claude Sonnet 4.6,发现主要故障模式因模型、任务和上下文长度而异。基于模型在明确提示时通常能检测冲突并输出违规的观察,我们提出了两种免训练的自监控机制:用于生成前低延迟冲突检测的并行输入监控器,以及用于响应级审查和修复的顺序输出监控器。在Gemma-4-31B-IT、Claude Sonnet 4.6和GPT-5.3上,最强的监控器将规则遵循违规率降低了81-99%,其中GPT-5.3在静态攻击下降低86%,在自适应攻击下降低45%。

英文摘要

Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models (Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6) on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.

2606.07805 2026-06-09 cs.AI cs.MA 新提交

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

超越古德哈特定律:多智能体系统中合规性评估的动态基准

Yiyang Zhao, Zhuo Zhang, Qingxuan Le, Lizhen Qu, Zenglin Xu

发表机构 * Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Monash University(莫纳什大学)

AI总结 针对多智能体系统在压力下可能违反安全规则的问题,提出MAC-Bench动态对抗基准,通过SERV流水线生成无污染场景,并引入CSR和MG指标评估前沿模型的合规性。

详情
AI中文摘要

大型语言模型(LLMs)从被动助手向自主、可执行智能体的快速演进引入了关键的操作风险。当前大多数评估框架忽视了程序合规性,导致“马基雅维利”行为——智能体为最大化奖励而策略性地违反安全规则,这是古德哈特定律的直接体现。为解决这一盲点,我们提出MAC-Bench,一个动态对抗基准,旨在评估多智能体系统在现实压力下的程序对齐。我们提出了SERV(种子-进化-精炼-验证)流水线,一种“智能体即基准”范式,将非结构化法律文本转化为可执行、无污染的场景。通过合成全息沙盒环境并注入校准的社会工程压力向量,MAC-Bench迫使智能体在任务成功与监管遵守之间进行帕累托最优权衡。我们引入了新指标:合规加权成功率(CSR)和马基雅维利差距(MG),并对最先进的前沿模型进行了全面评估,揭示了成功与合规之间的普遍权衡。

英文摘要

The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical operational risks. Most current evaluation frameworks neglect procedural compliance, leading to "Machiavellian" behaviors where agents strategically violate safety rules to maximize rewards, a direct manifestation of Goodhart's Law. To address this blind spot, we introduce MAC-Bench, a dynamic, adversarial benchmark designed to evaluate the procedural alignment of multi-agent systems under realistic pressure. We propose the SERV (Seed-Evolve-Refine-Verify) pipeline, an "Agent-as-a-Benchmark" paradigm that transforms unstructured legal texts into executable, contamination-free scenarios. By synthesizing holographic sandbox environments and injecting calibrated social-engineering pressure vectors, MAC-Bench forces agents into Pareto-optimal trade-offs between task success and regulatory adherence. We introduce novel metrics, the Compliance-Weighted Success Rate (CSR) and the Machiavellian Gap (MG), and conduct a comprehensive evaluation of state-of-the-art frontier models to reveal the pervasive trade-offs between success and compliance.

2606.07801 2026-06-09 cs.AI 新提交

Improving Multimodal Reasoning via Worst Dimension Optimization

通过最差维度优化改进多模态推理

Haocheng Lv, Huaping Zhang, Qiuchi Li, Lei Li, Chunxiao Gao

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出最差维度优化方法,通过识别并优先优化推理路径中最差的约束维度,提升多模态推理的整体有效性。

详情
AI中文摘要

多模态推理需要一条在广泛约束(从视觉基础到逻辑一致性)下保持完整性的路径。然而,当前的过程奖励模型关注于启发式定义的奖励,这些奖励平等地权衡这些因素,可能导致主导因素掩盖个别维度的失败,从而无法保证推理过程的一般有效性。

英文摘要

Multimodal reasoning requires a path that retains integrity over a wide range of constraints, from visual grounding to logic consistency. However, the current Process Reward Models focus on heuristically defined rewards that equally weigh these factors, which may lead to the concealment of individual dimension failures by the dominating factors, without guaranteeing the validity of the reasoning process in general.
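The masking effect described above can be illustrated with a small arithmetic sketch. This is not the paper's method (the abstract shown here is truncated before the method is described); the dimension names and scores are hypothetical, and the code only contrasts an equally weighted mean with the worst-scoring dimension.

```python
def mean_reward(scores):
    """Equally weighted aggregate over all constraint dimensions."""
    return sum(scores.values()) / len(scores)


def worst_dimension(scores):
    """Return (name, score) of the lowest-scoring dimension."""
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical per-dimension scores for two reasoning paths.
path_a = {"visual_grounding": 0.95, "logic_consistency": 0.95, "arithmetic": 0.20}
path_b = {"visual_grounding": 0.70, "logic_consistency": 0.70, "arithmetic": 0.70}

for label, path in (("A", path_a), ("B", path_b)):
    print(label, round(mean_reward(path), 3), worst_dimension(path))
```

Both paths receive the same mean score, yet path A contains a clear failure in one dimension that only the worst-dimension view exposes.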

2606.07798 2026-06-09 cs.AI cs.LG q-bio.NC 新提交

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

在资源受限环境中利用常规数据重建和预测阿尔茨海默病患者的疾病轨迹

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy

发表机构 * Yardi School of Artificial Intelligence (ScAI), Indian Institute of Technology Delhi(印度理工学院德里分校亚迪人工智能学院) Department of Neurology, Vardhman Mahavir Medical College and Safdarjung Hospital(瓦尔丹·马哈维尔医学院和萨夫达戎医院神经内科) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系)

AI总结 提出GNOVA框架,结合GRU编码器和神经ODE解码器的变分自编码器,利用常规临床数据(无需神经影像或生物标志物)实现认知评分的双向预测、插值/外推及不确定性估计,在ADNI数据集上取得低误差。

详情
AI中文摘要

阿尔茨海默病是一种进行性神经退行性疾病,其进展在不同患者间差异显著。现有工作旨在预测患者未来的认知状态,但很少关注从既往就诊中重建状态。此外,当前研究中,量化预测不确定性仍未被充分探索,且依赖于MRI、PET和CSF等昂贵模态,限制了在资源有限环境中的部署。在本研究中,我们的主要目标是:第一,从不规则就诊中双向预测认知评分,以呈现完整的疾病轨迹;第二,实现插值和外推能力,以辅助临床医生做出知情预后决策;第三,为所有预测提供校准良好的不确定性估计;最后,利用常规就诊中可用的模态实现上述目标。我们提出了一个统一框架GNOVA:GRU-神经ODE变分自编码器。该架构在变分自编码器框架内结合了门控循环单元编码器和神经ODE解码器。在我们的工作中,我们预测了CDR-SB和MMSE评分。GRU编码器允许在任何时间点输入任意数量的数据。神经ODE解码器执行连续估计,允许在任何期望的时间点进行插值和外推。变分自编码器允许预测中的不确定性估计。我们使用了ADNI数据集中1727名患者超过10年的数据;该模型在无需任何神经影像或生物标志物数据的情况下,对CDR-SB和MMSE评分分别实现了1.35和2.28的平均绝对误差。特征消融研究表明,年龄、BMI和APOE4状态是强预测因子。所提出的框架能够重建不完整的患者病史并预测未来的认知状态。

英文摘要

Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, in current research, the quantification of predictive uncertainty remains underexplored, and models rely on costly modalities such as MRI, PET, and CSF, limiting their deployment in resource-limited settings. In this research, our primary objectives are: first, to predict cognitive scores bidirectionally from irregular visits so as to present the complete disease trajectory; second, to enable interpolation and extrapolation to assist clinicians in informed prognostic decision making; third, to provide a well-calibrated uncertainty estimate for all predictions; and finally, to achieve these objectives using only the modalities available during routine visits. We propose a unified framework, GNOVA: a GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE scores. The GRU encoder allows for any number of inputs at any time point. The Neural ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The variational autoencoder allows for uncertainty estimation in predictions. We worked with 1,727 patients from the ADNI dataset over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

2606.07790 2026-06-09 cs.LG 新提交

Byzantine Cheap Talk: Adversarial Resilience and Topology Effects in LLM Coordination Games

拜占庭廉价谈话:LLM协调博弈中的对抗韧性与拓扑效应

Aya El Mir, Martin Takáč, Salem Lahlou

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 研究多智能体LLM在协调博弈中面对拜占庭攻击和通信拓扑限制的脆弱性,发现智能体无法集体适应背叛,且显式限制拓扑会破坏合作,而隐式限制则不影响。

详情
Comments
Accepted at NETYS 2026 (The International Conference on Networked Systems)
AI中文摘要

多智能体LLM系统越来越依赖通信协议进行协调,但它们在对抗和结构约束下的鲁棒性仍然知之甚少。基于先前工作表明廉价谈话通道能够在LLM协调博弈中实现合作,我们在一个4人Stag Hunt博弈中,跨越六个模型系列和720次试验,研究了两个脆弱性类别。首先,当拜占庭智能体发出合作信号但背叛时,非拜占庭智能体在一轮内检测到背叛,但未能集体适应:相当一部分智能体尽管反复被利用仍继续合作,由于博弈的一致同意支付结构而无法恢复协调。其次,显式限制通信拓扑会完全破坏合作,而应用相同的隐式限制则保持近乎完美的合作。这表明协调失败源于智能体关于隐藏信息的元推理,而非信息损失本身。我们识别出两种在所有模型队列中复现的稳定行为原型:背叛倾向模型在背叛后永久切换,以及合作坚持模型以显著的个人成本继续合作。这些发现揭示了具体的安全漏洞:通信通道可被利用为对抗注入向量,且向智能体披露网络拓扑即使在没有任何对手的情况下也会削弱协调。

英文摘要

Multi-agent LLM systems increasingly rely on communication protocols for coordination, yet their robustness under adversarial and structural constraints remains poorly understood. Building on prior work showing that cheap-talk channels enable cooperation in LLM coordination games, we investigate two vulnerability classes in a 4-player Stag Hunt across six model families and 720 trials. First, when Byzantine agents signal cooperation but defect, non-Byzantine agents detect the betrayal within one round yet fail to adapt collectively: a substantial fraction continue cooperating despite repeated exploitation, unable to recover coordination due to the game's unanimity payoff structure. Second, explicitly restricting communication topology collapses cooperation, while applying identical restrictions silently preserves near-perfect cooperation. This establishes that coordination failure stems from agents' meta-reasoning about hidden information, not information loss itself. We identify two stable behavioral archetypes that replicate across all model cohorts: Defection-Prone models that switch permanently after betrayal, and Cooperation-Persistent models that continue cooperating at significant individual cost. These findings reveal concrete security vulnerabilities: communication channels can be exploited as adversarial injection vectors, and disclosing network topology to agents can degrade coordination even without any adversary present.
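A minimal sketch of a unanimity payoff structure may help show why a single Byzantine defector is enough to break coordination. The payoff values (4 for a successful stag hunt, 2 for hare, 0 for a failed stag attempt) are assumptions chosen for illustration; the abstract does not state the values used in the experiments.

```python
def stag_hunt_payoffs(actions, stag_reward=4.0, hare_reward=2.0):
    """Payoffs for an n-player Stag Hunt with a unanimity rule.

    Stag pays off only if every player chooses "stag"; hare always pays
    a smaller safe reward. Reward values are illustrative assumptions.
    """
    all_stag = all(a == "stag" for a in actions)
    payoffs = []
    for a in actions:
        if a == "stag":
            payoffs.append(stag_reward if all_stag else 0.0)
        else:
            payoffs.append(hare_reward)
    return payoffs


print(stag_hunt_payoffs(["stag"] * 4))
# One player signals cooperation but plays hare: every cooperator gets 0.
print(stag_hunt_payoffs(["stag", "stag", "stag", "hare"]))
```

Under this rule, agents that keep choosing stag after a betrayal earn nothing for as long as the defector persists, which matches the cost borne by the "Cooperation-Persistent" archetype described above.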

2606.07789 2026-06-09 cs.LG stat.ML 新提交

A Framework for Evaluating and Benchmarking Concept Drift Detection Methods

概念漂移检测方法的评估与基准测试框架

Vitor Cerqueira, Heitor Murilo Gomes, Marco Heyden, Bernhard Pfahringer, Albert Bifet

发表机构 * University of Coimbra(科英布拉大学) Victoria University of Wellington(惠灵顿维多利亚大学) Commerzbank(德国商业银行) University of Waikato(怀卡托大学) AI Institute, University of Waikato(怀卡托大学人工智能研究所)

AI总结 提出一个包含漂移模拟、时序感知评估和超参数优化协议的基准测试框架,在7个真实数据集上评估14种漂移检测方法,揭示其优劣并建立基线性能。

详情
Comments
Accepted at KDD '26
AI中文摘要

数据流挖掘从根本上受到概念漂移的挑战,其中分布变化可能降低模型性能。尽管漂移检测方法层出不穷,但该领域的进展受到不一致评估实践的阻碍:研究依赖于过度简化的合成数据生成器,采用不兼容的指标,并且在超参数选择上缺乏透明度,使得公平比较变得困难。我们通过一个新颖的基准测试框架来解决这一差距,该框架包含三个贡献:(1)一种漂移模拟方法,通过蒙特卡洛试验将受控的分布变化注入真实世界数据集,在保留真实数据复杂性的同时实现监督评估;(2)一种用于漂移检测的评估协议,具有时序感知标准,包括推导出跨流可比较的新指标(例如,F1检测分数、归一化检测时间);(3)我们提倡一种留一数据集超参数优化协议,用于漂移检测方法,以促进跨异构流动态的配置鲁棒性。我们在7个真实世界数据集上对14种广泛使用的漂移检测方法进行了基准测试,涵盖4种漂移类型(类别先验、标签交换、特征排列、特征过滤),每种类型均包括突变和渐变转换。我们的实验结果揭示了当前漂移检测方法的优缺点,同时为该领域的未来研究建立了基线性能指标。所有代码和实验均公开可用。

英文摘要

Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent evaluation practices: studies rely on oversimplified synthetic data generators, adopt incompatible metrics, and lack transparency in hyperparameter selection, making fair comparisons difficult. We address this gap with a novel benchmarking framework comprising three contributions: (1) a drift simulation method that injects controlled distributional changes into real-world datasets via Monte Carlo trials, enabling supervised evaluation while preserving real-world data complexity; (2) an evaluation protocol for drift detection with timing-aware criteria, including the derivation of new metrics (e.g., F1 detection score, normalized detection time) that are comparable across streams; and (3) a leave-one-dataset-out hyperparameter optimization protocol for drift detection methods that promotes configuration robustness across heterogeneous stream dynamics. We benchmark 14 widely used drift detection methods on 7 real-world datasets across 4 drift types (class prior, label swap, feature permutation, feature filtering), each under both abrupt and gradual transitions. Our experimental results provide insights into the strengths and weaknesses of current drift detection approaches while establishing baseline performance metrics for future research in this area. All code and experiments are publicly available.
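As a rough sketch of what injecting a controlled drift into an existing stream can look like, the function below applies a label swap from a chosen position onward, either abruptly or over a transition window. The function name, its parameters, and the linear ramp used for the gradual case are assumptions for illustration, not the paper's implementation.

```python
import random


def inject_label_swap(labels, drift_start, width=0, a=0, b=1, seed=0):
    """Swap classes a and b from drift_start onward.

    width == 0 gives an abrupt drift; width > 0 ramps the swap
    probability linearly from 0 to 1 over `width` steps (an assumed
    model of a gradual transition).
    """
    rng = random.Random(seed)
    out = []
    for t, y in enumerate(labels):
        if t < drift_start:
            p = 0.0
        elif width == 0 or t >= drift_start + width:
            p = 1.0
        else:
            p = (t - drift_start) / width
        if y in (a, b) and rng.random() < p:
            y = b if y == a else a
        out.append(y)
    return out


stream = [0, 1] * 5
print(inject_label_swap(stream, drift_start=4))           # abrupt
print(inject_label_swap(stream, drift_start=2, width=4))  # gradual
```

Because the drift position is chosen by the experimenter, the true change point is known, which is what makes supervised scoring of a detector possible. Repeating the injection with randomly drawn `drift_start` values would give one simple form of Monte Carlo trial.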

2606.07783 2026-06-09 cs.CL 新提交

Evaluating RAG Reliability under Clean, Misleading, and Mixed Retrieval

评估RAG在干净、误导和混合检索下的可靠性

Sevgi Yigit-Sert

发表机构 * Ankara University(安卡拉大学)

AI总结 提出评估协议,通过参数覆盖和置信度指标,系统测试RAG系统在干净、有毒和混合证据下处理参数知识与检索证据冲突的鲁棒性。

详情
AI中文摘要

检索增强生成(RAG)通过将答案基于检索到的证据,被广泛用于提高大型语言模型(LLMs)的事实可靠性。然而,在信息丰富的误导环境中,检索到的内容可能包含看似合理但不正确的信息,引发了对基于RAG的信息访问系统可靠性的担忧。在这项工作中,我们提出了一种评估协议,以系统地测试RAG系统如何处理参数知识与从具有不同数量误导信息的上下文中检索到的证据之间的冲突。我们针对模型在无检索时也能正确回答的事实性问题,使用干净、有毒和混合证据来测试系统。所提出的分析框架结合了参数覆盖和置信度指标,以评估误导信息何时以及如何影响LLMs的生成过程。本研究旨在为信息混乱场景下RAG系统的鲁棒性提供见解。

英文摘要

Retrieval-Augmented Generation (RAG) is widely used to improve the factual reliability of large language models (LLMs) by grounding answers in retrieved evidence. In misinformation-rich environments, however, retrieved content may include plausible but incorrect information, raising concerns about the reliability of RAG-based information access systems. In this work, we propose an evaluation protocol to systematically test how a RAG system handles conflicts between parametric knowledge and evidence retrieved from context with varying amounts of misleading information. We target factoid questions that the model answers correctly even without retrieval, and use them to test the system with clean, poisoned, and mixed evidence. The proposed analytical framework combines parametric override and confidence metrics to assess when and how misleading information affects the generation process of LLMs. This study aims to provide insights into the robustness of RAG systems in information disorder scenarios.
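The protocol above can be sketched as a simple bookkeeping step. The abstract does not define its parametric override metric, so the `flip_rate` below is a hypothetical stand-in: among questions the model answers correctly without retrieval, it measures how often the answer becomes wrong under a given evidence condition. The record fields are likewise assumed.

```python
def flip_rate(records, condition):
    """Share of closed-book-correct questions answered wrongly under `condition`.

    Each record is a dict with the gold answer, the closed-book answer,
    and one answer per evidence condition (hypothetical field names).
    """
    eligible = [r for r in records if r["closed_book"] == r["gold"]]
    if not eligible:
        return 0.0
    flipped = sum(r[condition] != r["gold"] for r in eligible)
    return flipped / len(eligible)


records = [
    {"gold": "Paris", "closed_book": "Paris", "clean": "Paris", "poisoned": "Lyon", "mixed": "Paris"},
    {"gold": "1969", "closed_book": "1969", "clean": "1969", "poisoned": "1971", "mixed": "1971"},
    {"gold": "Au", "closed_book": "Ag", "clean": "Au", "poisoned": "Ag", "mixed": "Au"},
]
for cond in ("clean", "poisoned", "mixed"):
    print(cond, flip_rate(records, cond))
```

The third record is excluded from every rate because the model was already wrong without retrieval, mirroring the restriction to questions the model answers correctly on its own.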

2606.07778 2026-06-09 cs.CL 新提交

Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora

解锁潜在价值:基于分类法从低层级网络语料库中恢复高性能数据

Neeraj Varshney, Sanket Lokegaonkar, Nasser Zalmout, Qingyu Yin, Priyanka Nigam, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 提出一种分类驱动框架,通过引入时效性和文化特异性两个新维度,结合两阶段过滤方法,从低质量网络数据中恢复高性能子集,在推理和编码任务上显著超越未过滤的高质量数据。

详情
AI中文摘要

主流的预训练网络数据筛选流程将文档质量压缩为单一复合分数,系统性地遗漏了评分器权重不足维度上的高价值内容。我们提出一个分类驱动的框架,通过沿复合分数无法捕捉的语义有意义维度进行过滤来恢复这一价值。首先,基于ESSENTIAL-WEB分类法,我们引入两个新维度:时效性和文化特异性,它们与现有维度的成对NMI较低。我们使用Qwen2.5 32B对1400万文档进行标注,并蒸馏成一个轻量级0.5B模型。为实现快速的语料库级标注,我们额外在E5嵌入上训练了一个7300万参数的多任务MLP,推理吞吐量提升50倍。其次,为应对过滤配置的组合爆炸,我们引入一个计算高效的两阶段框架:第一阶段在小规模上识别最强维度信号;第二阶段从最优表现者中构建并评估合取和析取复合过滤器——以全规模定律成本的一小部分识别高性能配置。将所选过滤器应用于被降级的网络数据,分类过滤后的子集在性能上超过其未过滤基线,甚至超越最高质量层级。在中层数据上,我们的最佳过滤器在推理、编码和知识基准上分别比未过滤基线提升12.1%、9.5%和2.0%,在推理和编码上分别超过未过滤的顶层数据6.7%和13.7%。此外,来自典型生产阈值以下两个层级的数据,其过滤后的子集在推理和编码上比未过滤基线提升22.3%和19.5%,在编码基准上超越顶层数据。这些结果表明,大量潜在价值仍锁定在被降级的网络数据中,而多维分类过滤是解锁这些价值的原理性且计算高效的钥匙。

英文摘要

Dominant web data curation pipelines for pretraining collapse document quality into a single composite score, systematically missing high-value content along dimensions the scorer underweights. We present a taxonomy-driven framework that recovers this value by filtering along semantically meaningful dimensions that composite scores fail to capture. First, building on the ESSENTIAL-WEB taxonomy, we introduce two novel dimensions: timeliness and cultural specificity, both of which show low pairwise NMI with existing ones. We annotate 14M documents using Qwen2.5 32B and distill the annotations into a lightweight 0.5B model. To enable rapid corpus-wide annotation, we additionally train a 73M multi-task MLP on E5 embeddings, achieving 50x inference throughput. Second, to navigate the combinatorial explosion of filter configurations, we introduce a compute-efficient two-pass framework: Pass 1 identifies the strongest dimension signals at small scale; Pass 2 constructs and evaluates conjunctive and disjunctive compound filters from the top performers, identifying high-performing configurations at a fraction of full scaling-law cost. Applying the selected filters to deprioritized web data, taxonomy-filtered subsets outperform their unfiltered baselines and even surpass the highest-quality tier. On mid-tier data, our best filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, exceeding unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Furthermore, filtered data from two tiers below the typical production threshold improves by 22.3% on reasoning and 19.5% on coding over its unfiltered baseline, surpassing top-tier data on coding benchmarks. These results establish that vast latent value remains locked in deprioritized web data, and that multi-dimensional taxonomy filtering is a principled, compute-efficient key to unlocking it.
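The conjunctive and disjunctive compound filters of Pass 2 can be sketched as follows. The label values ("evergreen", "low") and the document schema are hypothetical; only the two dimension names, timeliness and cultural specificity, come from the abstract.

```python
def compound_filter(docs, predicates, mode="and"):
    """Keep documents passing all predicates ("and") or at least one ("or")."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    combine = all if mode == "and" else any
    return [d for d in docs if combine(p(d) for p in predicates)]


# Hypothetical single-dimension filters that a first pass might select.
def is_evergreen(doc):
    return doc["timeliness"] == "evergreen"


def is_culturally_general(doc):
    return doc["cultural_specificity"] == "low"


docs = [
    {"id": 1, "timeliness": "evergreen", "cultural_specificity": "low"},
    {"id": 2, "timeliness": "evergreen", "cultural_specificity": "high"},
    {"id": 3, "timeliness": "dated", "cultural_specificity": "low"},
    {"id": 4, "timeliness": "dated", "cultural_specificity": "high"},
]
filters = [is_evergreen, is_culturally_general]
print([d["id"] for d in compound_filter(docs, filters, "and")])
print([d["id"] for d in compound_filter(docs, filters, "or")])
```

A conjunction trades volume for precision, while a disjunction keeps more data; restricting the compounds to dimensions that already performed well alone is what keeps the search from growing combinatorially.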

2606.07775 2026-06-09 cs.CV 新提交

DALE-CT: Depth-Aware Foundation Models for Computed Tomography

DALE-CT: 用于计算机断层扫描的深度感知基础模型

Evan W. Damron, Mahmut S. Gokmen, Mitchell A. Klusty, Caroline N. Leach, Emily B. Collier, V. K. Cody Bumgardner

发表机构 * University of Kentucky(肯塔基大学)

AI总结 提出DALE-CT,一种基于LeJEPA的2D切片模型,通过3D深度感知预训练(利用解剖掩膜和异常标注)提升表示质量,在CT多异常检测中达到与3D视觉语言模型近似的性能。

详情
Comments
9 pages, 2 figures
AI中文摘要

自监督学习(SSL)的最新突破,如潜在欧几里得联合嵌入预测架构(LeJEPA),以及视觉编码器与语言模型集成的成功,推动了计算机断层扫描(CT)中对适应性强、高容量视觉编码器的需求。在这项工作中,我们探索了基于2D切片的架构作为处理体积CT数据的原生3D模型的灵活替代方案。使用CT-RATE数据集,我们从头开始训练了DALE-CT(深度感知潜在欧几里得计算机断层扫描),这是一个完全使用LeJEPA构建的2D模型系列,并将其与持续预训练的DINOv2基线进行了比较。为了提高表示质量,我们开发了一种新颖的3D深度感知预训练策略,该策略由来自自动解剖掩膜和人工标注异常的双重辅助监督密集支持。在使用多实例学习(MIL)进行多异常检测的线性探测评估下,该双监督模型(DALE-CT-2S)的冻结主干实现了0.833的宏AUROC。这一性能表明,从头开始使用显著更少的数据且无需文本监督,即可达到与最先进的3D视觉语言模型近乎相当的水平。为确保可重复性,所有训练代码、评估脚本和模型权重均已公开。

英文摘要

Recent breakthroughs in self-supervised learning (SSL), such as the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA), alongside successes in integrating visual encoders with language models, have driven the demand for adaptable, high-capacity vision encoders in Computed Tomography (CT). In this work, we explore 2D slice-based architectures as a flexible alternative to native 3D models for processing volumetric CT data. Using the CT-RATE dataset, we trained DALE-CT (Depth-Aware Latent-Euclidean Computed Tomography), a 2D model family built entirely from scratch using LeJEPA, and compared its performance against a continually pre-trained DINOv2 baseline. To enhance representation quality, we developed a novel 3D depth-aware pre-training strategy anchored by dense auxiliary supervision from both automated anatomical masks and human-annotated abnormalities. Under linear probe evaluation with Multiple Instance Learning (MIL) for multi-abnormality detection, the frozen backbone of this dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833. This performance demonstrates near-parity with state-of-the-art 3D vision-language models, achieved entirely from scratch with significantly less data and no textual supervision. To ensure reproducibility, all training code, evaluation scripts, and model weights have been made publicly available.

2606.07770 2026-06-09 cs.LG 新提交

Contrast encodes inductive bias: separating slow noise from dynamics in predictive representation learning

对比编码归纳偏置:在预测性表示学习中将慢噪声与动力学分离

Paarth Gulati, Ilya Nemenman

发表机构 * Emory University(埃默里大学)

AI总结 针对自监督方法在潜在空间预测动力学时混淆慢噪声与信号的问题,本文分析其根源为跨轨迹采样负样本的对比目标,提出通过轨迹内采样负样本消除预测捷径,从而强制编码动力学相关变量。

详情
AI中文摘要

在潜在空间中学习表示并预测动力学的自监督方法(如JEPA)已被证明会混淆缓慢变化的噪声与它们旨在捕捉的动力学信号。具体来说,当噪声特征在每个轨迹内近似保持不变时,对比预测目标会优先编码这些特征,而不是控制系统的真实潜在变量。学习到的表示因此被轨迹特定噪声主导,下游性能随噪声强度下降,且即使增加训练轨迹的数量和持续时间也不会改善。我们认为这种失败是目标本身的属性,由一系列跨轨迹采样负样本的对比预测目标共享。为了说明这种普遍性,我们在两种设置中研究了失败模式及其补救措施:在合成移动点数据集上的标准SimCLR风格JEPA,以及在刚体摆电影上的DySIB(一种最近引入的用于提取动力学物理可解释表示的方法)。当负样本改为在单个轨迹内采样时,慢噪声无法区分该轨迹内的帧,从而消除了预测捷径。同时在许多这样的轨迹上训练一个编码器,迫使它编码与动力学相关的变量,更长的轨迹即使在强慢噪声下也能产生更好的表示。我们的结果指向了在动力学表示学习中设计对比预测目标的原则,特别是对于具有噪声实验观测的物理系统。

英文摘要

Self-supervised methods that learn representations and predict dynamics fully in the latent space, such as JEPA, have been shown to confuse slowly varying noise with the dynamical signals they aim to capture. Specifically, when noise features remain approximately constant within each trajectory, contrastive predictive objectives preferentially encode these features instead of the true latent variables governing the system. The learned representation then becomes dominated by trajectory-specific noise, so downstream performance degrades with noise strength and does not improve even as the number and duration of training trajectories increase. We argue that this failure is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories. To illustrate this generality, we study the failure mode and its remedy in two settings: a standard SimCLR-style JEPA on a synthetic moving-dot dataset, and DySIB, a recently introduced method designed for extracting physically interpretable representations of dynamics, on movies of a rigid-body pendulum. When negatives are instead sampled within a single trajectory, the slow noise can no longer distinguish frames within that trajectory, removing the predictive shortcut. Training one encoder simultaneously on many such trajectories then forces it to encode the variables relevant for the dynamics, with longer trajectories yielding better representations even for strong slow noise. Our results point toward principles for designing contrastive predictive objectives in dynamical representation learning, especially for physical systems with noisy experimental observations.
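The shortcut described above can be reproduced with a toy calculation. The sketch assumes a degenerate encoder that outputs only a trajectory-specific noise vector, and uses a plain dot-product InfoNCE loss; neither the vectors nor the loss form is taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives):
    """InfoNCE loss with dot-product similarity (log-sum-exp for stability)."""
    logits = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# A "noise-only" encoder: every frame of a trajectory maps to the same
# vector, which differs between trajectories (hypothetical values).
traj_a = [5.0, 0.0]
traj_b = [0.0, 5.0]

anchor, positive = traj_a, traj_a  # two frames of trajectory A
cross_loss = info_nce(anchor, positive, [traj_b] * 3)   # negatives from B
within_loss = info_nce(anchor, positive, [traj_a] * 3)  # negatives from A

print(cross_loss, within_loss, math.log(4))
```

With cross-trajectory negatives the noise-only encoder drives the loss to nearly zero without encoding any dynamics. With within-trajectory negatives the same encoder is stuck at the chance level log(4), so the objective can only be lowered by encoding something that actually changes within a trajectory.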

2606.07756 2026-06-09 cs.CV cs.RO 新提交

DroneDAR: Long-Range Drone Distance Estimation Using Monocular Vision and Bounding-Box Features

DroneDAR: 使用单目视觉和边界框特征的长距离无人机距离估计

Knut Peterson, Zaid Mayers, David Han

发表机构 * iMaPLe Research Lab, Drexel University(德雷塞尔大学iMaPLe研究实验室)

AI总结 针对长距离小无人机距离估计的挑战,提出DroneDAR模型,结合卷积骨干网络和轻量级门控机制融合边界框特征,分析骨干容量、裁剪分辨率和回归损失对性能的影响,并探讨远距离失效模式。

详情
Comments
6 pages, 5 figures. Accepted to the 2026 International Conference on Advanced Visual and Signal-Based Systems (AVSS)
AI中文摘要

在长距离图像中准确估计小型无人机的距离对于跟踪和态势感知至关重要,但由于极端的目标尺度变化、背景杂波和噪声视觉线索,这仍然具有挑战性。本文研究了使用图像裁剪和边界框几何进行单目无人机距离估计,这是一种实际设置,其中检测器提供候选无人机区域,模型从外观和框派生特征预测距离。我们评估了一个Droneranger风格的基线,并引入了一个新的DroneDAR(无人机检测与测距)模型,该模型通过轻量级门控机制将卷积骨干网络与显式边界框线索相结合。实验分析了骨干网络容量、裁剪分辨率和回归损失函数如何影响不同距离范围内的性能。我们进一步研究了远距离下的常见失效模式,包括对边界框噪声的敏感性和裁剪中纹理细节的减少。结果为设计和训练在真实远距离条件下保持鲁棒性的距离估计器提供了指导,并指出了在无人机仅占据几个像素时提高可靠性的方向。

英文摘要

Accurate distance estimation for small drones in long-range imagery is important for tracking and situational awareness, yet remains challenging due to extreme target scale variation, background clutter, and noisy visual cues. This paper studies monocular drone distance estimation using image crops together with bounding-box geometry, a practical setting in which a detector provides a candidate drone region and the model predicts range from appearance and box-derived features. We evaluate a Droneranger-style baseline, and introduce a new DroneDAR (Drone Detection And Ranging) model that combines a convolutional backbone with explicit bounding-box cues through a lightweight gating mechanism. Experiments analyze how backbone capacity, crop resolution, and regression loss functions affect performance across distance regimes. We further examine common failure modes at long distances, including sensitivity to bounding-box noise and reduced texture detail in the crop. The results provide guidance for designing and training range estimators that remain robust under real-world long-range conditions and highlight directions for improving reliability when drones occupy only a few pixels.
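To see why bounding-box geometry carries range information, and why box noise hurts most at long range, consider the standard pinhole-camera relation. This is a geometric baseline rather than the DroneDAR model; the focal length and drone width are assumed values.

```python
def pinhole_distance(focal_px, real_width_m, box_width_px):
    """Range from the pinhole model: distance = f * W / w."""
    if box_width_px <= 0:
        raise ValueError("box width must be positive")
    return focal_px * real_width_m / box_width_px


FOCAL_PX = 2000.0    # assumed focal length in pixels
DRONE_WIDTH = 0.5    # assumed physical drone width in metres

# A one-pixel change in box width at long range versus short range.
for w in (4, 5, 100, 101):
    print(w, "px ->", round(pinhole_distance(FOCAL_PX, DRONE_WIDTH, w), 2), "m")
```

Going from a 4-pixel to a 5-pixel box moves the estimate by 50 m, whereas the same one-pixel error on a 100-pixel box moves it by about 0.1 m. This is consistent with the sensitivity to bounding-box noise reported for distant targets, and suggests why appearance cues are fused with box features rather than relying on geometry alone.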

2606.07728 2026-06-09 cs.LG 新提交

Characterizing the Discrete Geometry of ReLU Networks

表征ReLU网络的离散几何

Blake B. Gaines, Jinbo Bi

发表机构 * University of Connecticut(康涅狄格大学)

AI总结 本文研究全连接ReLU网络线性区域构成的复形,证明其连通图平均度上界为输入维度的两倍,且直径上界与输入维度无关。

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
Comments
Selected for an oral presentation at ICLR 2026. Tagged PDF, reviews, and discussions are available at https://openreview.net/forum?id=TgLW2DiRDG
AI中文摘要

众所周知,ReLU网络定义连续分段线性函数,其线性区域是输入空间中的多面体。这些区域构成一个完全划分输入空间的复形。这些区域组合的方式对网络行为至关重要,因为非线性仅发生在这些区域连接的边界处。然而,除了区域总数的界限外,关于这些复形的几何性质所知甚少,且精确计算复形对大多数网络而言是棘手的。在这项工作中,我们证明了关于这些复形的新的理论结果,这些结果对所有全连接ReLU网络都成立,特别是关于它们的连通图,其中节点对应区域,边存在于由面连接的每对区域之间。我们发现,无论网络的宽度和深度如何,该图的平均度上界是输入维度的两倍,并且该图的直径有一个不依赖于输入维度的上界,尽管区域数量随输入维度指数增长。我们通过在合成和真实数据上训练的网络进行的实验证实了我们的发现,这些实验为ReLU网络的几何提供了额外的见解。重现我们结果的代码可在https://github.com/bl-ake/ICLR-2026找到。

英文摘要

It is well established that ReLU networks define continuous piecewise-linear functions, and that their linear regions are polyhedra in the input space. These regions form a complex that fully partitions the input space. The way these regions fit together is fundamental to the behavior of the network, as nonlinearities occur only at the boundaries where these regions connect. However, relatively little is known about the geometry of these complexes beyond bounds on the total number of regions, and calculating the complex exactly is intractable for most networks. In this work, we prove new theoretical results about these complexes that hold for all fully-connected ReLU networks, specifically about their connectivity graphs in which nodes correspond to regions and edges exist between each pair of regions connected by a face. We find that the average degree of this graph is upper bounded by twice the input dimension regardless of the width and depth of the network, and that the diameter of this graph has an upper bound that does not depend on input dimension, despite the number of regions increasing exponentially with input dimension. We corroborate our findings through experiments with networks trained on both synthetic and real-world data, which provide additional insight into the geometry of ReLU networks. Code to reproduce our results can be found at https://github.com/bl-ake/ICLR-2026.
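The average-degree bound can be checked by hand in the simplest nontrivial case. The sketch below assumes a single hidden layer of n ReLU neurons on a 2D input, with activation boundaries forming n lines in general position (no two parallel, no three through a point), and treats every cell of that arrangement as one linear region. Such an arrangement has 1 + n + C(n, 2) cells, and each line is cut into n pieces, each separating exactly two cells. This is a textbook counting argument, not the paper's proof.

```python
from math import comb


def line_arrangement_graph(n):
    """Regions, adjacency edges, and average degree for n generic lines in 2D."""
    regions = 1 + n + comb(n, 2)   # cells of the arrangement
    edges = n * n                  # each line is split into n facets
    avg_degree = 2 * edges / regions
    return regions, edges, avg_degree


for n in (1, 2, 10, 100, 1000):
    print(n, line_arrangement_graph(n))
```

The average degree equals 4n^2 / (n^2 + n + 2), which approaches but never reaches 4, that is, twice the input dimension of 2, no matter how wide the layer becomes.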

2606.07726 2026-06-09 cs.LG 新提交

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

利用SySRs降低LLM评估成本:一种可证明利用模型相似性的Bandit算法

Zifan Lyu, Chahine Nejma, Tobias Wegel, Fanny Yang, Florian E. Dorner

发表机构 * ETH Zurich(苏黎世联邦理工学院) Centrale Supélec(中央理工-高等电力学院) ENS de Cachan(卡尚高等师范学校) MPI for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所,图宾根)

AI总结 提出SySRs算法,通过配对比较和自适应分配评估预算,利用模型相似性降低LLM评估成本,在15个基准上平均错误率最低。

详情
Comments
Published at ICML 2026
AI中文摘要

大型语言模型通常通过在每个测试查询上评估每个模型来进行基准测试。对于寻求部署最佳模型的从业者来说,这通常是浪费的:如果一个模型明显比其他模型表现更差,则无需精确估计其性能。最佳臂识别算法可以通过自适应分配评估预算来大幅降低成本。此外,语言模型通常对相同的提示做出相似的反应——先前的工作试图利用这一特性但结果好坏参半。我们提出了同步连续拒绝(SySRs),通过配对比较增强了经典的连续拒绝算法。与先前在最佳模型识别中利用模型相似性的尝试不同,我们的方法无超参数,并且具有随着评估模型之间相似性程度的提高而改善的性能保证。在经验上,我们的方法在15个标准基准上的平均错误率以及可靠识别最佳模型的最坏情况预算方面均优于所有基线。

英文摘要

Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no need to precisely estimate its performance. Best-arm identification algorithms can be naturally applied to drastically reduce costs by adaptively allocating evaluation budget. Further, language models often respond similarly to the same prompt, a property previous work has tried to leverage with mixed success. We propose Synchronized Successive Rejects (SySRs), augmenting the classical Successive Rejects algorithm with paired comparisons. Unlike prior attempts to leverage model similarity in best-model identification, our approach is hyperparameter-free and enjoys performance guarantees that improve with the degree of similarity between evaluated models. Empirically, our method outperforms all baselines in terms of average error rate across 15 standard benchmarks, and in terms of worst-case budget for reliably identifying the best model.
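For orientation, the classical Successive Rejects procedure that SySRs builds on can be sketched as below. This is the baseline with its standard phase lengths, run so that all surviving models are scored on the same query indices; it is not the SySRs algorithm itself, whose paired-comparison rejection rule is not spelled out in the abstract. The `pull(arm, query)` callback is a hypothetical interface.

```python
import math


def successive_rejects(pull, num_arms, budget):
    """Classical Successive Rejects with shared query indices.

    pull(arm, query_index) returns a score (e.g. 1.0 if the model answers
    the query correctly). One arm is eliminated per phase; the survivor
    is returned.
    """
    if budget < num_arms:
        raise ValueError("budget must be at least the number of arms")
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    totals = [0.0] * num_arms
    pulled = 0  # queries already evaluated on every active arm
    for phase in range(1, num_arms):
        n_k = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for q in range(pulled, n_k):
            for arm in active:
                totals[arm] += pull(arm, q)
        pulled = max(pulled, n_k)
        # Active arms share the same number of pulls, so totals rank like means.
        worst = min(active, key=lambda a: totals[a])
        active.remove(worst)
    return active[0]


# Toy benchmark: model 2 is always right, model 1 on even queries, model 0 never.
table = [
    [0.0] * 60,
    [1.0 if q % 2 == 0 else 0.0 for q in range(60)],
    [1.0] * 60,
]
print(successive_rejects(lambda a, q: table[a][q], num_arms=3, budget=60))
```

Clearly inferior models are dropped early, so later queries are spent only on the contenders, which is the source of the cost savings over evaluating every model on every query.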