Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs
Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier
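AI summary: This paper introduces derivation graphs to represent how do-calculus rules are applied and combined, characterizes the full space of observational and interventional probabilities that are equivalent under the do-calculus, shows that equivalent transformations can be achieved with at most four rule applications, and uses equivalent causal queries to obtain more efficient estimators.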
The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.
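For reference, the three rules that such derivations chain together can be written in their standard textbook form (Pearl, 1995). The notation below is the conventional one and is not taken from the paper: X, Y, Z, W are disjoint sets of variables in a causal DAG G, G_{\overline{X}} is G with all edges into X removed, G_{\underline{Z}} is G with all edges out of Z removed, and Z(W) is the set of Z-nodes that are not ancestors of any W-node in G_{\overline{X}}.

```latex
\begin{align*}
\text{Rule 1:}\quad & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{Rule 2:}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{Rule 3:}\quad & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```

Rule 1 inserts or deletes an observation, Rule 2 exchanges an action with an observation, and Rule 3 inserts or deletes an action.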
Text-to-Image Models Depend on the Text Encoder Less Than You Think
Nurit Spingarn, Noa Cohen, Tamar Rott Shaham, Tomer Michaeli
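AI summary: This paper finds that diffusion-transformer-based text-to-image models rely mainly on the word meanings and word order supplied by the text encoder rather than on full contextual information, and verifies this by constructing an embedding that contains only a bag of position-tagged words.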
Text-to-image models rely on text prompts as their primary interface to human intent. Prompts are encoded by a text encoder into embeddings that condition the image generation process. Beyond individual token meanings, text embeddings encode contextual information across the full prompt, such as compositionality and attribute binding. However, whether image models actually exploit this richer information remains underexplored. Here, we address the question: Which aspects of text representation are essential for image generation? We show that text-to-image diffusion transformer-based models commonly rely only on two relatively straightforward aspects of text representations: (i) the merging of adjacent tokens into a word representation, for words spanning multiple tokens, and (ii) word order, which is imprinted by the positional embedding of the text-encoder. To show this, we construct a new text embedding that encodes only individual word meanings and order but lacks any contextual information about the full prompt. We find that this bag of position-tagged words representation is sufficient to successfully guide image generation, achieving visual quality and text fidelity that are on par with full text embedding-guided generation. This demonstrates that, contrary to common belief, text-to-image models often do not use the rich information encoded in the text embedding beyond individual word meanings and word order. Instead, the decoding of complex linguistic structures is performed by the image model itself. Project webpage: https://nsping13.github.io/contextless-TTI/
Investigating the Adversarial Robustness of Multimodal Large Language Models
Hashmat Shadab Malik, Muzammal Naseer, Salman Khan
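AI summary: Through a systematic study of adversarial robustness in multimodal large language models, the paper proposes a diagnostic CLIP-alignment protocol that predicts how well robust vision encoders transfer, and shows that end-to-end multimodal adversarial training significantly improves performance under strong adversarial attacks.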
Multi-modal Large Language Models (MLLMs) achieve strong performance on vision-language tasks, but incorporating visual inputs through a vision encoder (e.g., CLIP) substantially expands the attack surface, making these models vulnerable to visual adversarial perturbations. Prior defenses typically preserve compatibility with pretrained MLLMs by enforcing strict alignment to CLIP's original embedding space during adversarial fine-tuning; while practical, this constraint fundamentally limits achievable robustness. We present a systematic investigation of adversarial robustness in MLLMs. We first introduce a diagnostic CLIP-alignment protocol that predicts, prior to full MLLM training, which robust vision encoders will transfer effectively to the multimodal setting, revealing that large-scale multimodal adversarial pretraining, rather than unimodal scale alone, is the critical factor for strong robustness transfer. Integrating such encoders into MLLMs via end-to-end multimodal training yields average gains of 28 CIDEr points on captioning and 11.7% VQA accuracy under strong adversarial attacks compared to constrained plug-and-play baselines. We further show that adversarial training applied directly to a standard non-robust MLLM degrades both clean and adversarial performance, establishing robust visual representations as a strict prerequisite, while end-to-end adversarial training from a robust backbone delivers additional gains of 1.9 CIDEr points and 4.3% VQA accuracy. Beyond training-time defenses, lightweight test-time visual stochastic transformations serve as an effective black-box defense for non-robust MLLMs, elevating adversarial performance from near-zero to levels comparable with robust models. Finally, we show that our robust models substantially reduce toxic generation under white-box visual jailbreak attacks. Code and pretrained weights will be released publicly.
When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models
Ding Zhang, Runtao Zhou, Wenqing Zheng, Rizal Fathony, Bayan Bruss, Chirag Agarwal
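AI summary: By analyzing the internal behavior of graph tokens in graph language models, the paper finds a decoupling between activation-level saliency and the utilization of graph information, revealing limitations in existing graph-token construction, placement, and alignment mechanisms.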
Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to graph learning tasks. By transforming graph topology and node information into graph tokens, GLMs allow LLMs to jointly process structured graph inputs and textual instructions. Yet, it remains unclear how LLMs internally interpret these graph tokens and whether graph tokens act as meaningful carriers of graph structure. In this work, we analyze how LLMs process graph information through graph-token behavior in representative GLM architectures. Findings. We find that the internal saliency of graph tokens in GLMs is not equivalent to graph information utilization. Graph sink tokens consistently emerge as activation-level outliers: they can be identified by massive activation values along a small set of hidden-state dimensions and are biased toward early graph-token positions. However, this activation-level saliency does not imply that these tokens are the main carriers of graph information. Unlike classical attention sinks in language and vision-language models, graph sink tokens do not necessarily attract the largest attention weights from query tokens. Through pruning, repositioning, and swapping interventions, we show that graph sink tokens are not the most important semantic or structural tokens for downstream prediction. Implications. Together, these results suggest that after current GLMs map graph structure into the LLM token space, the resulting graph-token representations do not naturally form a fully usable topology-aware internal representation; instead, they exhibit a decoupling between activation-level saliency and graph-semantic utility. This decoupling points to limitations in existing graph-token construction, placement, and alignment mechanisms.
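The abstract says sink tokens can be identified by massive activation values along a few hidden-state dimensions. A minimal sketch of one way to flag such outliers is given below. The function name, the ratio threshold of 50, and the toy hidden states are all assumptions made for illustration; the paper's actual detection criterion is not given in the abstract.

```python
from statistics import median


def find_sink_tokens(hidden_states, ratio=50.0):
    """Return indices of tokens whose peak |activation| is an outlier.

    `hidden_states` is a list of per-token activation vectors. A token is
    flagged when its largest absolute activation exceeds `ratio` times the
    median peak across tokens (an assumed heuristic, not the paper's rule).
    """
    peaks = [max(abs(value) for value in token) for token in hidden_states]
    typical_peak = median(peaks)
    return [i for i, peak in enumerate(peaks) if peak > ratio * typical_peak]


# Toy hidden states: 5 graph tokens, 4 dimensions, made-up numbers.
hidden_states = [
    [0.3, -0.8, 120.0, 0.1],   # early token with one massive dimension
    [0.5, 1.2, -0.4, 0.2],
    [-0.9, 0.1, 0.6, 0.3],
    [1.5, -0.2, 0.7, -1.1],
    [0.4, 1.1, -0.3, 0.8],
]
sinks = find_sink_tokens(hidden_states)
print(sinks)
```

Flagging a token this way only establishes activation-level saliency; per the abstract, it says nothing about whether the token carries graph information.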
Code-on-Graph: Iterative Programmatic Reasoning over Knowledge Graphs with Large Language Models
Weiwei Ding, Zixuan Li, Long Bai, Zhuo Chen, Kun Su, Fei Wang, Xiaolong Jin, Jin Zhang, Jiafeng Guo, Xueqi Cheng
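AI summary: Proposes the Code-on-Graph (CoG) framework, which represents knowledge graph schemas as Python classes and generates executable code to address inflexible operators and unscalable knowledge injection in existing LLM-KG integration, improving results on WebQSP, CWQ, and GrailQA by up to 10.5%.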
Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.
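The idea of representing KG schemas as Python classes can be illustrated with a small sketch. The class names, fields, and facts below are hypothetical and are not taken from the paper; they only show how retrieved facts might be instantiated as objects and then queried by generated code instead of being pasted into a prompt.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Film:
    """Hypothetical schema class for a 'film' entity type."""
    title: str
    release_year: int


@dataclass
class Person:
    """Hypothetical schema class for a 'person' entity type."""
    name: str
    films_directed: List[Film] = field(default_factory=list)


def instantiate(facts):
    """Turn retrieved (director, film, year) facts into schema objects."""
    people = {}
    for director, title, year in facts:
        person = people.setdefault(director, Person(director))
        person.films_directed.append(Film(title, year))
    return people


# Illustrative facts standing in for the output of one retrieval step.
facts = [
    ("Director A", "Film X", 1999),
    ("Director A", "Film Y", 2005),
    ("Director B", "Film Z", 2010),
]
people = instantiate(facts)

# The kind of code an LLM could generate against the class interface for
# the question "Which film by Director A was released first?"
earliest = min(people["Director A"].films_directed, key=lambda f: f.release_year)
answer = earliest.title
print(answer)
```

In this arrangement the prompt would only need the class definitions, while the facts themselves live in objects at execution time, which is how the abstract describes avoiding large-scale knowledge injection.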
Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making
Keigo Sakurai, Takahiro Ogawa, Miki Haseyama, Anjyu Anan, Kei Nakagawa
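AI summary: Proposes DOSS, which models objective selection as a classification problem with sequential rolling-window updates and combines confidence-aware gating with LLM oversight to select objectives dynamically in financial decision-making, reducing the risk of misselection and excessive switching.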
Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then select trades or allocations for an investor, and the chosen optimization objective often determines realized performance. However, because market conditions evolve over time, a fixed objective can be suboptimal across regimes, while regime-switching pipelines that rely on latent regime estimates can be noisy or delayed, and frequent switching can increase turnover and operational instability. In this paper, we propose DOSS (Dynamic Objective Selection with Safeguards), a learning-based selector that directly chooses the decision-relevant objective function at each time point from interpretable statistical summaries of recent returns, selecting among a small set of candidates (e.g., return-seeking, loss-averse, and risk-adjusted) without introducing intermediate regime variables. DOSS formulates objective selection as a classification problem over objectives and performs sequential updates with a rolling window to make forward-looking selections without temporal leakage, while also outputting a confidence score for each proposal. To mitigate misselection and excessive switching in deployment, DOSS applies confidence-aware gating with a fail-safe that overrides low-confidence proposals to a conservative default and enforces explicit controls tied to switching frequency. We further integrate governance by positioning a Large Language Model (LLM) as an oversight component rather than a generator of new objectives: the LLM is restricted to accept a proposed objective or override it to a predefined safe default, with deterministic rule-based constraints triggering overrides when needed.
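The gating step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the confidence threshold, the minimum holding period, the choice of "risk_adjusted" as the safe default, and the rule that the holding period also applies to fail-safe overrides are all made up for the example.

```python
from dataclasses import dataclass

SAFE_DEFAULT = "risk_adjusted"  # hypothetical conservative default


@dataclass
class Gate:
    """Confidence-aware gate with a simple switching-frequency control."""
    min_confidence: float = 0.6   # assumed threshold
    min_hold_steps: int = 3       # assumed minimum steps between switches
    current: str = SAFE_DEFAULT
    steps_since_switch: int = 0

    def step(self, proposed: str, confidence: float) -> str:
        self.steps_since_switch += 1
        # Fail-safe: a low-confidence proposal is replaced by the default.
        target = proposed if confidence >= self.min_confidence else SAFE_DEFAULT
        # Switching control: keep the current objective until the minimum
        # holding period has elapsed.
        if target != self.current and self.steps_since_switch >= self.min_hold_steps:
            self.current = target
            self.steps_since_switch = 0
        return self.current


gate = Gate()
proposals = [
    ("return_seeking", 0.9),
    ("return_seeking", 0.9),
    ("return_seeking", 0.9),
    ("loss_averse", 0.4),    # low confidence: overridden to the default
    ("loss_averse", 0.95),
    ("loss_averse", 0.95),
]
chosen = [gate.step(objective, conf) for objective, conf in proposals]
print(chosen)
```

An LLM oversight layer of the kind the abstract describes would sit after `step` and could only return either the gated objective or `SAFE_DEFAULT`.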
Multi$^2$: Hierarchical Multi-Agent Decision-Making with LLM-Based Agents in Interactive Environments
Sangeun Park, Minhae Kwon
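AI summary: Proposes Multi$^2$, a hierarchical multi-agent decision-making framework in which a high-level agent (System 1) generates sub-goals using supervised fine-tuning and a low-level agent (System 2) executes atomic actions using offline-to-online reinforcement learning, mitigating objective drift and enabling stable long-horizon control.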
A central goal of large language model (LLM) research is to build agentic systems that can plan, act, and adapt through sustained interaction with dynamic environments. While recent LLM-based agents exhibit impressive contextual reasoning, their long-horizon decision-making remains fragile, often suffering from objective drift, where goals and plans drift over extended interactions. We introduce Multi$^2$, a hierarchical multi-agent decision-making framework that explicitly decomposes agent behavior into complementary roles. A high-level agent (System 1) focuses on context-aware sub-goal generation using supervised fine-tuning (SFT), while a low-level agent (System 2) executes atomic actions through offline-to-online reinforcement learning (RL) in interactive environments. This separation enables stable long-horizon control, mitigates objective drift, and allows efficient adaptation. Across diverse interactive environments, Multi$^2$ consistently outperforms strong agentic baselines, demonstrating improved robustness and coordination in multi-turn interaction. Beyond performance, we introduce and release three hierarchical benchmark datasets, filling a long-standing gap in training and evaluating hierarchical decision-making for LLM-based agents.
Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Embedding Editing
Clara Haya Suslik, Or Shafran, Mor Geva
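AI summary: Proposes the EMBER module, which uses sparse matrix factorization to precisely erase concept-related features from token embeddings, strengthening the robustness and specificity of existing knowledge-erasure methods.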
As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through adversarial prompting or relearning. In this work, we hypothesize this limitation stems in part from existing methods overlooking the embedding layer. To address this, we introduce EMBedding ERasure (EMBER), a plug-n-play erasure module that leverages Sparse Matrix Factorization for precise erasure of concept-related features from token embeddings. Through comprehensive evaluations across diverse concepts on Gemma-2-2B-it and Llama-3.1-8B-Instruct, we find that augmenting existing methods with EMBER consistently improves erasure efficacy and specificity across task formats, with minimal coherence loss. Moreover, it dramatically improves robustness to relearning, reducing regained accuracy by up to 50%, limiting it to 35% on Llama compared to 70%-76% for prior methods. Further analysis shows that the coherence cost is localized, affecting only a small set of concept-exclusive tokens. Our work establishes that precise embedding-level intervention is necessary for robust concept erasure, and demonstrates that existing methods can benefit from such augmentation.
Face and Body Tracking for Human-Robot Interaction: An Egocentric Dataset
Jessica Wenninger, Gabriel Skantze
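AI summary: To address frequent identity switches in the egocentric view of social robots, the paper presents a custom-annotated egocentric dataset, systematically evaluates detection errors, compares face and body tracking, and analyzes the effect of extended spatial memory and appearance re-identification; the optimized pipeline reduces identity switches by 49%.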
To enable meaningful human-robot interaction (HRI), a robot must continuously assess engagement by consistently tracking users over time. State-of-the-art computer vision models, however, are heavily optimized for surveillance or autonomous driving. A social robot faces distinct egocentric challenges, such as humans bouncing, obstructing each other, or leaving the frame. Frequent identity switches (IDSW) cause the robot to lose its footing mid-conversation. To address this, we introduce a novel, custom-annotated egocentric dataset collected via the Furhat robot to capture complex social dynamics. We present a systematic evaluation isolating detection errors from tracking logic, comparing face versus body tracking, and assessing the impact of extended spatial memory and appearance re-identification (ReID). Results indicate that increasing spatial memory mitigates prolonged occlusions but fails on complex dynamic events. Integrating ReID resolves complex switches but exhibits opposing effects: it substantially improves body tracking stability, yet causes facial IDSW to spike due to profile angle sensitivity. Ultimately, our optimized pipeline reduces IDSW by 49%, mitigating interaction breakdowns. Because standard benchmarks lack dense, close-quarter occlusions, this work highlights the critical need for natively captured social dynamics to truly validate HRI perception models.
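As a rough illustration of the IDSW metric, the sketch below counts a switch whenever a ground-truth person is matched to a different tracker ID than the last time that person was matched, which is the usual multi-object-tracking convention. The input format, the function name, and the example frames are assumptions made for this example; the paper's evaluation code is not described in the abstract.

```python
def count_identity_switches(frames):
    """Count identity switches (IDSW) over a sequence of frames.

    `frames` is a list of dicts mapping ground-truth person id to the
    tracker id matched to that person in the frame. People who are
    occluded or out of frame are simply absent from the dict.
    """
    last_seen = {}
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            if gt_id in last_seen and last_seen[gt_id] != track_id:
                switches += 1
            last_seen[gt_id] = track_id
    return switches


frames = [
    {"alice": 1, "bob": 2},
    {"alice": 1},              # bob is occluded
    {"alice": 1, "bob": 3},    # bob returns under a new track id: 1 switch
    {"alice": 3, "bob": 1},    # the two ids are swapped: 2 more switches
]
idsw = count_identity_switches(frames)
print(idsw)
```

The third frame mirrors the occlusion case discussed in the abstract, where a longer spatial memory or ReID would be needed to reattach the returning person to the original ID.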
Does Language Shift Break Medical Vision-Language Models? A Case Study on Indonesian Radiology Visual Question Answering
Pieter Christy Yan Yudhistira, Dzaki Rafif Malik, Novanto Yudistira
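AI summary: By building IndoRad-VQA, an Indonesian radiology VQA dataset, the study evaluates the robustness of medical vision-language models under a non-English clinical language and finds an 8-25% performance gap between the English and Indonesian settings, indicating the need for more inclusive multilingual evaluation.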
Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs retain radiology reasoning ability when questions are asked in Bahasa Indonesia. Radiology question-answer pairs are translated into Indonesian with self-evaluation-based quality control to preserve clinical meaning, terminology consistency, and answer equivalence. We evaluate general-purpose, Southeast Asian multilingual, and medical-specific VLMs under English and Indonesian prompting settings. Beyond accuracy, we quantify the language robustness gap between English and Indonesian inputs. We also conduct an error analysis to identify failure modes of question answering, such as yes/no flips, laterality errors, and output-language mismatches. Our findings show that strong performance on English medical VQA benchmarks does not necessarily translate to robust behavior in Indonesian clinical contexts. We observe a performance gap of 8 to 25 percent between the English and Indonesian settings, depending on the evaluation metric. These results highlight the need for more inclusive multilingual evaluation of medical multimodal foundation models. The dataset is available at https://huggingface.co/datasets/Lab-IS/IndoRad-VQA.
SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
Yuan Xiong, Ziqi Miao, Qian Chen, Lijun Li, Yequan Wang, Shizhu He, Jun Zhao, Kang Liu
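AI summary: To address agents' lack of systematic skill construction, accumulation, and transfer, the paper proposes SkillPyramid, a hierarchical skill consolidation framework whose self-evolution mechanism composes, validates, and incorporates new skills during task execution, raising average reward by 38.0% and reducing execution steps by 27.7% on three benchmarks.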
Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.
Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models
Mariana Vargas Vieyra
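AI summary: Proposes a training-free survival regression method that uses tabular foundation models to predict event times and iteratively impute right-censored data, building an accelerated failure time model that performs comparably to models requiring training on standard benchmarks.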
Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. It is widely used in several domains, including healthcare and churn prediction, and a central challenge in applying it is that the time of the event is often only partially observed, a situation known as right-censoring. Tabular Foundation Models (TFMs) have attracted significant interest in recent years due to their ability to perform prediction tasks in a single forward pass, requiring no dataset-specific parameter fitting. Despite their success, their application to prediction tasks on time-to-event data remains difficult due to right-censoring. In this work, we present a training-free approach to survival regression that leverages TFMs to both predict the time of the event and iteratively impute right-censored data. Our method uses a TFM to construct an Accelerated Failure Time (AFT) model requiring no training beyond fitting a single scalar parameter. Subsequently, by building on the Buckley-James estimator, we introduce a non-parametric in-context estimator for right-censored data. Our experiments on standard survival analysis benchmarks show that our method is competitive with several parametric and semi-parametric survival regression models that require training, including Cox regression and parametric AFT models.
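For readers unfamiliar with the two ingredients named above, the generic AFT model and the classical Buckley-James imputation step can be written as follows. This is the textbook formulation, with f standing in for whatever predictor is used; how the TFM supplies f, and which scalar parameter is fitted, are not specified in the abstract and are left open here.

```latex
\begin{align*}
&\text{AFT model:} && \log T_i = f(x_i) + \varepsilon_i \\
&\text{Observed data:} && Y_i = \min(\log T_i, \log C_i), \qquad
   \delta_i = \mathbf{1}[T_i \le C_i] \\
&\text{Buckley-James imputation:} && Y_i^{*} = \delta_i\, Y_i
   + (1 - \delta_i)\, \mathbb{E}\!\left[\log T_i \mid \log T_i > Y_i,\ x_i\right] \\
&\text{with} && \mathbb{E}\!\left[\log T_i \mid \log T_i > Y_i,\ x_i\right]
   \approx f(x_i) + \frac{\int_{e_i}^{\infty} s \, d\hat{F}(s)}{1 - \hat{F}(e_i)},
   \qquad e_i = Y_i - f(x_i)
\end{align*}
```

Here T_i is the event time, C_i the censoring time, and \hat{F} is the Kaplan-Meier estimate of the residual distribution. The procedure alternates between refitting f on the imputed responses Y_i^{*} and recomputing the imputations.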
The DeepSpeak-Agentic Dataset
Sarah Barrington, Maty Bohacek, Hany Farid
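AI summary: This paper presents DeepSpeak-Agentic, a dataset containing more than 37 hours of semi-structured human-agent conversation videos, used to evaluate the automatic forensic identification of AI agents, to study the characteristics of human-agent interaction, and to serve as a benchmark for large language models and AI-generated voice and face technology.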
We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.
A Deep Dive into World Model Recovery in Supervised Fine-Tuned LLM Planners
Patrick Emami, Nan Qiang, Peter Graf
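AI summary: Through interpretability experiments, the paper studies how supervised fine-tuning affects the ability of large language models to recover world models in classical planning tasks, finding that fine-tuning leads models to linearly encode action validity and state predicates, and that broader state-space coverage yields more accurate world model recovery.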
Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Vision-and-Language Navigation
Xinhai Li, Xiaotao Zhang, Yuehao Huang, Jiankun Dong, Tianhang Wang, Sunyao Zhou, Yunzi Wu, Chengnuo Sun, Yunfei Ge, Qizhen Weng, Chi Zhang, Chenjia Bai, Xuelong Li
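AI summary: Proposes the unified GN0 framework, which combines the automatically generated large-scale navigation dataset GN-Matrix, a high-fidelity 3DGS-based simulation platform, the BEV benchmark GN-Bench, and the RL-driven navigation foundation model BAE to outperform existing methods on VLN tasks.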
Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse 3D scenes and develop an automated pipeline for large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first BEV-based benchmark incorporating dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered Bird's Eye View representations as compact memory, unlocking latent spatial reasoning in VLMs. Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.
Speedrunning Tabular Foundation Model Pretraining
Salih Bora Ozturk, Alexander Pfefferle, Frank Hutter
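AI summary: Proposes a speedrun competition format in which contributors optimize a single-file training script, achieving an 81x pretraining speedup on nanoTabPFN and establishing a community leaderboard for accumulating improvements.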
Pretraining cost is a major bottleneck for research on tabular foundation models, slowing the iteration cycle for new architectures, priors, and optimization ideas. Yet the community lacks a simple way to compare and accumulate pretraining speedups. We introduce a community speedrun for nanoTabPFN: contributors modify a single-file training script and compete to reach a fixed downstream ROC AUC target on subsampled TabArena using one NVIDIA L40S GPU. The current best record reaches the target in 0.92 minutes, an 81x speedup over the 74.32 minute baseline while using 22x fewer synthetic datasets. The speedrun format provides a simple protocol for the community to add, verify, and stack pretraining improvements, with the leaderboard open to contributions. Code and records are available at https://github.com/borawhocodess/modded-nanotabpfn.
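The reported speedup follows directly from the two wall-clock times quoted above. A quick check, using only the numbers given in the abstract:

```python
baseline_minutes = 74.32   # baseline time to reach the target, from the abstract
record_minutes = 0.92      # current best record, from the abstract

speedup = baseline_minutes / record_minutes
print(f"{speedup:.2f}x")   # the abstract rounds this ratio to 81x
```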
EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents
Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma
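AI summary: Proposes EvoDrive, the first automated LLM-based agentic evolution framework, which uses a simulator-grounded actor-critic architecture and a Pareto archive to perform multi-objective optimization of adversariality and realism in safety-critical scenario generation.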
Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.
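A Pareto archive of the kind mentioned above keeps every evaluated candidate that is not dominated by another. The sketch below shows the generic bookkeeping for two objectives where higher is better. The function names and the (adversariality, realism) scores are invented for illustration, and nothing here reflects how EvoDrive actually scores or stores candidates.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and
    strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_archive(archive, candidate):
    """Return a new archive: skip `candidate` if it is dominated or already
    present, otherwise add it and drop the entries it dominates."""
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]


# Each tuple is (adversariality, realism); the scores are made up.
archive = []
for scores in [(0.2, 0.9), (0.8, 0.3), (0.5, 0.5), (0.4, 0.4), (0.9, 0.35)]:
    archive = update_archive(archive, scores)
print(archive)
```

In this run (0.4, 0.4) is rejected because (0.5, 0.5) dominates it, and (0.8, 0.3) is evicted once (0.9, 0.35) arrives, so the archive retains three distinct trade-offs rather than a single best scalar score.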
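A Fast Onboard Methane Detection Pipeline Based on Mag1c-SAS and LinkNet
Jonáš Herec, Vít Růžička, Rado Pitoňák, Jan Sedmidubsky
AI summary: Proposes the Mag1c-SAS algorithm to accelerate methane detection and pairs it with a lightweight LinkNet model for denoising, enabling efficient, low-power methane leak detection on onboard hardware.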
Methane is a potent greenhouse gas, and detecting leaks early via hyperspectral satellite imagery can help climate change mitigation efforts. Meanwhile, many existing hyperspectral missions only capture areas manually targeted by operators, thus missing potential events of interest. To overcome slow downlink rates cost-effectively, onboard detection is a viable solution. However, traditional methane detection methods are too computationally demanding for resource-limited onboard hardware. This work accelerates methane detection by focusing on efficient, low-power algorithms. In particular, we test fast target detection ACE and CEM methods that have not been previously used for methane detection and propose Mag1c-SAS -- a significantly faster variant of the current state-of-the-art Mag1c algorithm. To explore their detection potential, we integrate them with a machine learning model based on U-Net and LinkNet. We evaluate our methods on the STARCOP dataset and a novel EMIT-MSeg dataset, which we introduce and open-source alongside a high-quality annotation strategy. The proposed Mag1c-SAS approach proves highly effective by operating ~80x faster than the original Mag1c approach, providing a visually similar, but noisier result. When additionally paired with the lightweight LinkNet approach, it effectively reduces noise, achieving AUPRC score improvements of over 30 pp on EMIT-MSeg compared to the baseline Mag1c approach, and an F1 score on STARCOP ~4 pp higher. We evaluate two novel band selection strategies and confirm the system's onboard viability through hardware profiling, demonstrating marginal power consumption and efficient CPU/RAM utilization. We release the final system in a user-friendly and lightweight PyPI library at: https://pypi.org/project/onboard-methane-detection/, alongside all experimental code, models, and data at: https://github.com/zaitra/methane-filters-benchmark.
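For reference, the two classical target detectors named in the abstract have standard textbook forms, given below. The abstract does not state which variants or estimation details the authors use, so the notation is an assumption: $x$ is a pixel spectrum, $t$ the target (methane) signature, $R$ the sample correlation matrix of the scene, and $\mu$, $\Sigma$ the background mean and covariance.

```latex
% Constrained Energy Minimization (CEM): minimize output energy subject to w^T t = 1
R = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\top}, \qquad
w_{\mathrm{CEM}} = \frac{R^{-1} t}{t^{\top} R^{-1} t}, \qquad
y_{\mathrm{CEM}}(x) = w_{\mathrm{CEM}}^{\top} x

% Adaptive Coherence Estimator (ACE), with \tilde{x} = x - \mu
D_{\mathrm{ACE}}(x) =
\frac{\left(t^{\top} \Sigma^{-1} \tilde{x}\right)^{2}}
     {\left(t^{\top} \Sigma^{-1} t\right)\left(\tilde{x}^{\top} \Sigma^{-1} \tilde{x}\right)}
```

Both detectors need only one matrix inverse per scene followed by per-pixel inner products, which is consistent with the abstract's description of them as fast.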
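Foley-Omni: A Unified Multimodal Generative Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
Ye Tao, Lupeng Liu, Xuenan Xu, Jiasun Feng, Jiarui Wang, Ying Qin, Shuiyang Mao, Wei Liu, Shuai Wang
AI summary: Proposes Foley-Omni, a unified multimodal audio generation model that jointly models speech, sound effects, and music in a shared latent generation process, moving from isolated task-level synthesis to complete video soundtrack generation, and builds the V2ST-Bench benchmark for holistic evaluation.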
Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.
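Beyond a Single Solution: A Multi-Hypothesis Collaborative Deep Unfolding Network for Image Compressive Sensing
Wenxue Cui, Hualin Li, Yuhang Qin, Yifu Xu, Xiaopeng Fan, Debin Zhao
AI summary: To address the ill-posedness of compressive sensing, proposes a multi-hypothesis collaborative deep unfolding network (MHC-DUN) that jointly optimizes across multiple solution spaces, uses AlphaNet to dynamically predict spatially varying step sizes for gradient descent, and designs a multi-hypothesis collaborative proximal mapping module to improve reconstruction quality.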
Recent deep unfolding networks (DUNs) have advanced Compressive Sensing (CS) by effectively integrating iterative optimization with deep learning architectures. However, most CS approaches confine their inference to a single solution space, neglecting the inherent ill-posedness of CS problems, which permits multiple plausible candidate hypotheses. In this paper, a novel Multi-Hypothesis Collaborative Deep Unfolding CS Network (MHC-DUN) is proposed, which explicitly models and leverages multiple hypotheses by jointly optimizing across diverse solution spaces. Specifically, following the Proximal Gradient Descent algorithm, MHC-DUN jointly performs gradient descent and proximal mapping within this multi-hypothesis paradigm. i) For gradient descent, a well-designed AlphaNet is introduced to dynamically predict spatially varying step sizes for all hypotheses, enabling collaborative gradient updates across multiple solutions. ii) For the proximal operator, a multi-hypothesis collaborative proximal mapping module is designed, which leverages both intra-hypothesis and inter-hypothesis correlation priors to jointly refine multiple solutions. To enable end-to-end training, a novel composite loss function is designed, which balances measurement fidelity, hypothesis diversity, and reconstruction accuracy, encouraging exploration of complementary solutions while maintaining reconstruction fidelity. Experimental results reveal that the proposed CS method outperforms existing CS networks.
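The proximal gradient descent iteration that MHC-DUN unfolds can be sketched in its classical, non-learned form. The sketch below is a plain single-hypothesis ISTA step with a fixed scalar step size and a soft-thresholding proximal operator; in the paper the step sizes are predicted by AlphaNet and the proximal mapping is a learned multi-hypothesis module, neither of which is reproduced here. The measurement matrix `Phi`, the measurements `y`, and the parameters are toy values chosen for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def soft_threshold(v: Vector, tau: float) -> Vector:
    """Proximal operator of tau * ||.||_1 (stand-in for a learned proximal module)."""
    return [math.copysign(max(abs(x) - tau, 0.0), x) for x in v]


def pgd_step(x: Vector, y: Vector, Phi: Matrix, alpha: float, lam: float) -> Vector:
    """One iteration: gradient step on 0.5 * ||Phi x - y||^2, then proximal mapping."""
    residual = [p - q for p, q in zip(matvec(Phi, x), y)]
    grad = matvec(transpose(Phi), residual)
    z = [xi - alpha * gi for xi, gi in zip(x, grad)]
    return soft_threshold(z, alpha * lam)


def objective(x: Vector, y: Vector, Phi: Matrix, lam: float) -> float:
    residual = [p - q for p, q in zip(matvec(Phi, x), y)]
    return 0.5 * sum(r * r for r in residual) + lam * sum(abs(v) for v in x)


# Toy problem: 2 measurements, 3 unknowns (under-determined, hence ill-posed).
Phi = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, 0.0]
alpha, lam = 0.3, 0.1  # alpha is below 1/L, where L = 3 is the largest eigenvalue of Phi^T Phi
x0 = [0.0, 0.0, 0.0]
x = x0
for _ in range(200):
    x = pgd_step(x, y, Phi, alpha, lam)
```

Because the system is under-determined, different initializations or regularizers lead to different plausible reconstructions, which is the ambiguity the multi-hypothesis design targets.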
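AUGUSTE: An Online Learning dApp for Predictive URLLC Scheduling
Maxime Elkael, Michele Polese, Yunseong Lee, Koichiro Furueda, Tommaso Melodia
AI summary: To address the high latency caused by scheduling requests in URLLC, proposes AUGUSTE, a MAC scheduling framework based on online machine learning that predicts packet arrivals and allocates resources in advance, achieving the best latency-overhead trade-off on a real 5G testbed.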
Ultra Reliable and Low Latency Communications (URLLC) was one of the main motivations behind 5G, with 3GPP advertising 1-10 ms latency targets for applications such as industrial automation, Vehicle-To-Everything (V2X), tactical edge networking, and unmanned-system control. Years on, real 5G Time Division Duplexing (TDD) networks still show median Uplink (UL) round-trip times in the 50-70 ms range, largely because of the Scheduling Request (SR) procedure that a User Equipment (UE) must complete before transmitting UL data. Existing remedies, primarily Configured Grant (CG) scheduling, only eliminate this overhead for strictly periodic traffic and require cross-layer synchronization, which has limited their adoption. We propose AUGUSTE (Anticipatory Uplink Grants for URLLC via Self-Adapting Temporal Estimation), a learning-based Medium Access Control (MAC) scheduling framework that embeds online Machine Learning (ML) models in the UL scheduler to predict packet arrivals and proactively allocate resources before an SR is issued. An adaptive state machine alternates between a learning phase that collects unbiased arrival statistics and a confident phase that exploits the learned predictions to schedule only when traffic is expected. We evaluate AUGUSTE on a real 5G testbed running OpenAirInterface across three URLLC traffic patterns (request-response, ML edge inference, and periodic autonomous reporting), and show that it operates at the best achievable point on the latency-overhead trade-off: it matches always-on scheduling's median Round Trip Time (RTT) (around 10 ms, halving the 20 ms SR-based baseline) at roughly one-tenth its resource cost (7-10 percent overhead).
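The alternation between a learning phase and a confident phase can be illustrated with a toy state machine. Everything in this sketch is an assumption made for illustration: the class name, the fixed-period estimator, the `min_samples` threshold, and the rule of falling back to learning after any misprediction. AUGUSTE itself uses online ML models whose details are not given in the abstract.

```python
from typing import List, Optional


class AnticipatoryScheduler:
    """Toy two-phase uplink scheduler (hypothetical; not the AUGUSTE implementation)."""

    def __init__(self, min_samples: int = 3) -> None:
        self.min_samples = min_samples
        self.phase = "learning"
        self.arrivals: List[int] = []  # slots in which a packet was observed
        self.period: Optional[int] = None

    def _reset(self) -> None:
        self.phase, self.period, self.arrivals = "learning", None, []

    def should_grant(self, slot: int) -> bool:
        if self.phase == "learning":
            return True  # grant every slot so arrivals are observed without bias
        return (slot - self.arrivals[-1]) % self.period == 0

    def observe(self, slot: int, packet_arrived: bool) -> None:
        granted = self.should_grant(slot)
        if packet_arrived:
            if not granted:
                self._reset()  # missed an arrival: the model is stale, relearn
            self.arrivals.append(slot)
        elif granted and self.phase == "confident":
            self._reset()  # wasted a predicted grant: relearn
        if self.phase == "learning" and len(self.arrivals) >= self.min_samples:
            gaps = {b - a for a, b in zip(self.arrivals, self.arrivals[1:])}
            if len(gaps) == 1:  # arrivals look strictly periodic
                self.phase, self.period = "confident", gaps.pop()


sched = AnticipatoryScheduler()
grants = 0
for slot in range(50):
    if sched.should_grant(slot):
        grants += 1
    sched.observe(slot, packet_arrived=(slot % 5 == 0))
```

On this periodic toy traffic the scheduler grants every slot only until it has seen three evenly spaced arrivals, then grants only the predicted slots, which is the mechanism behind the reduced resource overhead described in the abstract.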
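Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Jinnuo Liu, Yue Peng, Jinhan Niu, Hongyi Wen
AI summary: Proposes the NovelAPIBench benchmark, which dynamically discovers novel APIs, extracts decomposed knowledge bundles, and generates executable tasks to diagnose six categories of model errors in API use, finding that retrieval and parametric tuning are complementary.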
Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.
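Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic
Nicholas Leisegang, Thomas Meyer, Ivan Varzniczak
AI summary: By introducing situated standpoint conditionals, this paper lifts KLM-style non-monotonic rational entailment relations to a fragment of propositional defeasible standpoint logic (PDSL), shows that this fragment is expressible as a set of situated conditionals, and faithfully translates ranking-based entailment relations (such as rational and lexicographic closure) from the propositional case into PDSL while preserving complexity bounds.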
Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal logics. However, work in this field has focussed primarily on satisfiability checking, and monotonic notions of entailment, which may be inferentially weak. One particular modal logic where this has been introduced is propositional standpoint logics, where modalities can express the views of different viewpoints. This has resulted in the formalisation of propositional defeasible standpoint logic (PDSL). In this paper, we propose a means of lifting the class of (non-monotonic) rational entailment relations from traditional KLM-style reasoning to a fragment of PDSL. In order to do so, we extend the expressivity of PDSL via situated standpoint conditionals, allowing us to talk about a defeasible conditional holding in the context of a given standpoint. This allows us to re-characterise the syntax of PDSL in terms of situated conditionals, and shows that a large fragment of PDSL is expressible as a set of situated conditionals. We then focus on characterising non-monotonic entailment in this fragment, defining a method to transport any ranking-based entailment relation from the propositional case into the PDSL case. This is first described in the general case and then considered in the specific cases of rational and lexicographic closures, providing a faithful translation of each inference into PDSL. We also show that entailment-checking in this fragment of PDSL can be done largely using algorithms from the propositional case, while preserving complexity bounds.
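Graph Regularized Non-Negative Reduced Biquaternion Matrix Factorization for Color Image Recognition
Hailang Wu, Yonghe Liu, Bingxuan Yu, Chaoqian Li
AI summary: To address the fact that non-negative reduced biquaternion matrix factorization ignores local geometric structure, proposes a graph regularized model that preserves local structure through a graph Laplacian regularizer, and designs a component-wise alternating projected gradient algorithm that achieves competitive results in color image recognition.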
Non-negative reduced biquaternion matrix factorization (NRBMF) uses the product of reduced biquaternion (RB) matrices to incorporate the non-negativity constraints of color image pixels into the factorization process. However, NRBMF mainly focuses on reconstruction accuracy and does not exploit the local geometric structure of image data, which may limit the discriminative ability of the learned low-dimensional features. To address this issue, we propose a graph regularized non-negative reduced biquaternion matrix factorization (GNRBMF) model for color image recognition. The proposed model incorporates a graph Laplacian regularizer into the reduced biquaternion coefficient matrix, encouraging nearby samples in the original space to have similar representations in the learned feature space. Meanwhile, GNRBMF retains the non-negativity-preserving property of NRBMF in the reduced biquaternion domain. To solve the optimization problem, a component-wise alternating projected gradient algorithm is derived, and its convergence properties are analyzed. Experimental results demonstrate that the proposed GNRBMF model achieves competitive or superior recognition performance in some tested settings.
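The effect of a graph Laplacian regularizer is easiest to see in the familiar real-valued case, shown below. The notation is an assumption: $H = [h_1, \dots, h_n]$ is a real coefficient matrix with one column per sample, $W$ is a symmetric affinity matrix over samples, and $D$ is its diagonal degree matrix. How the paper carries this over to reduced biquaternion coefficients is not specified in the abstract.

```latex
L = D - W, \qquad D_{ii} = \sum_{j} W_{ij}

\operatorname{tr}\!\left(H L H^{\top}\right)
  = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij}\, \lVert h_i - h_j \rVert_2^{2}
```

Minimizing this term penalizes pairs of samples that are neighbors in the original space (large $W_{ij}$) but have distant representations $h_i$ and $h_j$, which is the locality-preserving behavior the abstract describes.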
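Safety Measurement of Fine-Tuned Large Language Models Should Be Grounded in Capability
Krishnapriya Vishnubhotla, Hillary Dawkins, Isar Nejadgholi, Svetlana Kiritchenko
AI summary: By anchoring fine-tuning to a specific capability goal, evaluates the effects of fine-tuning on model capability and safety along multiple dimensions, finding that fine-tuned models can produce incoherent outputs in response to safety prompts, that automated safety judgments are unreliable for such outputs, and that conclusions vary with the safety benchmark and the evaluator.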
Adapting foundation large language models to a user's task or preferred style through fine-tuning can compromise the model's safety. Previous work examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface three important issues: (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark and safety evaluator.
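Black-Box, Adaptive, Efficient, Transferable, Harmful, Applicable ... Attacks Are All You Need to Jailbreak LLMs
Vincent Limbach, Jonas Dornbusch, David Lüdke, Stephan Günnemann, Leo Schwinn
AI summary: Proposes Indirect Harm Optimization (IHO), which trains a masked diffusion language model attacker via iterative preference optimization, yielding a black-box, efficient, transferable adaptive attack that considerably raises attack success against layered defenses.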
Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.
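A Benchmark for Semi-Supervised Multi-Modal Crowd Counting
Haoliang Meng, Xiaopeng Hong, Yabin Wang, Wangmeng Zuo
AI summary: This paper constructs the first benchmark for semi-supervised multi-modal crowd counting, laying the foundation for the task by formulating a standardized protocol and evaluating a range of baseline methods.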
This paper constructs the first benchmark on semi-supervised multi-modal crowd counting. To lay the foundation for this unexplored task, we first formulate the semi-supervised multi-modal setting and a standardized protocol that specifies the labeled-unlabeled data partition across different labeled ratios. Next, to establish solid reference points, we carefully tailor a diverse set of representative baselines, including existing fully supervised multi-modal methods and semi-supervised single-modal methods. Then, we evaluate their performance under our proposed benchmark. Code and the data partition will be released at https://github.com/HenryCilence/Semi-supervised-Multimodal-Crowd-Counting.
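The Shape of Addition: The Geometry of Arithmetic in Large Language Models
Liuyuan Wen, Xun Zhu, Lihao Huang, Wenbin Li, Yang Gao
AI summary: By analyzing the geometry of the residual stream during multi-operand addition, identifies the Iso-Raw-Sum Trajectory (IRST) and builds the Noisy Quantization Model, which explains arithmetic errors as geometric slippages caused by internal neural noise, and uses a geometric consistency check to detect and correct quantization failures.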
Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discrete output. By analyzing the residual stream geometry during multi-operand addition, we identify the Iso-Raw-Sum Trajectory (IRST), a geometric structure where representations are anchored by semantic digits and modulated by continuous carry fibers. We propose the Noisy Quantization Model to explain this geometry, framing arithmetic errors as Geometric Slippages caused by internal neural noise pushing a continuous, latent Carry Potential across quantization thresholds. This geometric framework further elucidates Probe Versatility, explaining how lightweight probes can disentangle coexisting latent signals (such as ground truth versus hallucination) from a single activation vector. Finally, we validate these insights through a geometric consistency check method that effectively detects and corrects these quantization failures during inference. Our code is available at https://github.com/RL-MIND/Shape-of-Addition.
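The intuition behind the Noisy Quantization Model can be shown with a toy calculation. The sketch assumes a scalar carry potential, additive Gaussian noise with standard deviation `sigma`, and quantization by rounding to the nearest integer; these modeling choices and all names are illustrative assumptions, not the paper's formulation.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def slip_probability(carry_potential: float, sigma: float) -> float:
    """Probability that Gaussian noise pushes the quantized carry to a different integer."""
    intended = math.floor(carry_potential + 0.5)  # nearest-integer quantization
    upper = (intended + 0.5 - carry_potential) / sigma
    lower = (intended - 0.5 - carry_potential) / sigma
    return 1.0 - (normal_cdf(upper) - normal_cdf(lower))


# A potential far from any threshold is robust; one near a threshold is fragile.
safe = slip_probability(1.00, sigma=0.1)
fragile = slip_probability(1.45, sigma=0.1)
```

Under these assumptions the same noise level produces almost no errors when the potential sits at the center of a quantization bin and frequent errors when it sits near a bin boundary, which matches the abstract's description of slippage across quantization thresholds.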
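Spatial Transcriptomics-Guided Alignment Enhances Molecular Profiling in Pathology Foundation Models
Fengtao Zhou, Yingxue Xu, Zhengyu Zhang, Yihui Wang, Zhengrui Guo, Ling Liang, Jiabo Ma, Cheng Jin, Ziyi Liu, Huajun Zhou, Hongyi Wang, Du Cai, Chenglong Zhao, Xi Wang, Can Yang, Yu Wang, Wenbin Li, Feng Gao, Zhe Wang, Zhenhui Li, Xiuming Zhang, Li Liang, Hao Chen
AI summary: Proposes the STAMP framework, which uses spatial transcriptomics data and a pathway-informed alignment strategy to give pathology foundation models molecular awareness, and validates its clinical utility through a multi-tier evaluation.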
Comprehensive molecular profiling is essential for modern precision oncology but remains hindered by prohibitive costs, specimen exhaustion, and protracted turnaround times. While pathology foundation models (PFMs) have demonstrated potential for inferring molecular phenotypes from routine hematoxylin and eosin (H&E) whole-slide images (WSIs), current architectures primarily rely on vision-centric self-supervised learning or vision-language alignment, lacking the spatially resolved molecular supervision required to connect subtle morphological features with underlying genomic alterations. Spatial transcriptomics (ST) emerges as a transformative technology that enables transcriptomic quantification within intact tissue sections, thereby preserving the precise spatial link between histology and molecular profiles. In this study, we present a Spatial Transcriptomics-guided Alignment framework for Molecular Profiling (STAMP), which endows PFMs with intrinsic molecular awareness. To support this paradigm, we curated HumanST-1k, a human ST dataset spanning diverse anatomical organs and sequencing platforms. This atlas yields 1.8 million pairs of H&E patches and corresponding transcriptomic profiles, providing a corpus that links histological structures with their molecular states. To mitigate the technical noise inherent to raw transcriptomics, STAMP applies a pathway-informed alignment strategy that aggregates transcriptomic data into biologically functional pathways, which are subsequently integrated into PFMs via parameter-efficient fine-tuning. This alignment enriches the representation space of PFMs and unlocks their capacity to resolve sub-visual molecular signatures. The clinical utility of these augmented representations was validated through a multi-tier evaluation framework.
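The aggregation of gene-level measurements into pathway-level scores can be sketched as follows. The sketch assumes the simplest possible aggregation, the mean expression of a pathway's member genes measured at one spot; the gene and pathway names are placeholders, and the abstract does not say which aggregation function or pathway database STAMP uses.

```python
from typing import Dict, List


def pathway_scores(
    expression: Dict[str, float],
    pathways: Dict[str, List[str]],
) -> Dict[str, float]:
    """Average the expression of each pathway's member genes that were measured."""
    scores: Dict[str, float] = {}
    for name, genes in pathways.items():
        values = [expression[g] for g in genes if g in expression]
        if values:  # skip pathways with no measured genes
            scores[name] = sum(values) / len(values)
    return scores


# Placeholder data for a single spot; gene and pathway names are hypothetical.
spot_expression = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 9.0}
gene_sets = {
    "pathway_1": ["GENE_A", "GENE_B"],
    "pathway_2": ["GENE_C", "GENE_D"],  # GENE_D was not measured at this spot
    "pathway_3": ["GENE_E"],            # no measured genes, so no score
}
scores = pathway_scores(spot_expression, gene_sets)
```

Averaging over several genes reduces the influence of dropout or noise in any single gene, which is one plausible reason a pathway-level target is less noisy than raw transcript counts.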
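Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Different Urgency
Qi Han Wong
AI summary: Studies how large language models give different triage recommendations for identical neurological symptoms when only the patient's gender and age differ, finding that young women are systematically assigned lower urgency, with diagnostic substitution as the mechanism.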
We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. We release all code, prompts, and raw results.
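The trial count in the abstract can be checked with simple arithmetic, and a referral-rate gap of this kind is typically assessed with a test on a 2x2 table. The sketch below uses a two-sided Fisher exact test, which is an assumption: the abstract does not name the test used. The counts in the example (2 of 30 versus 29 of 30) are what 6.7% and 96.7% would correspond to if each rate came from a single condition with n = 30, which the abstract does not state explicitly.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Design check: (3 ages x 2 genders + 1 unspecified baseline) x 30 trials x 3 models.
conditions = 3 * 2 + 1
total_trials = conditions * 30 * 3

# Hypothetical counts: ER referrals vs. non-referrals for women (row 1) and men (row 2).
p_value = fisher_exact_two_sided(2, 28, 29, 1)
```

The design arithmetic reproduces the 630 trials reported in the abstract; the p-value is only as meaningful as the assumed counts.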