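arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.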

2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG 新提交

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

图像扩散模型的高效、鲁棒且抗共谋指纹识别

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

发表机构 * University of Florence（佛罗伦萨大学）； Shenzhen Campus of Sun Yat-sen University（中山大学深圳校区）； College of Cyber Security, Jinan University（暨南大学网络空间安全学院）； State Key Laboratory of Internet of Things for Smart City, University of Macau（澳门大学智慧城市物联网国家重点实验室）； Department of Computer and Information Science, Faculty of Science and Technology, University of Macau（澳门大学科技学院计算机与信息科学系）； University of Siena（锡耶纳大学）

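AI summary: Existing fingerprinting methods for generative text-to-image models lack robustness to collusion attacks. This paper encodes fingerprints in a personalized normalization module (PNM) and adds an anti-collusion mechanism based on lossless function-invariant parameter transformations, achieving high-fidelity, robust fingerprinting that is the first to proactively resist collusion.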

AI中文摘要

模型指纹识别，即将用户特定标识（指纹）嵌入生成输出中，最近已成为保护生成式文本到图像（T2I）模型知识产权并防止未经授权重新分发的流行解决方案。在这项工作中，我们揭示了现有生成模型指纹识别方法中一个先前未被探索的系统性漏洞：它们缺乏对共谋攻击的鲁棒性，其中多个攻击者结合他们的模型以移除或掩盖指纹。为了解决这个问题，我们迈出了为T2I模型开发具有抗共谋能力的鲁棒指纹识别方法的第一步。所提出的方法将比特串（即指纹）编码到集成到T2I模型中的个性化归一化模块（PNM）的系数中，从而可以从任何生成的图像中可靠地恢复指纹。为了防御共谋攻击并防止未经授权的模型重新分发，我们引入了一种基于无损函数不变参数变换的抗共谋机制。该机制显著降低了共谋模型的图像生成质量，使其实际上无法使用。此外，我们的方法允许开发者通过重新参数化PNM高效地创建多个带指纹的T2I模型副本，而无需重新训练。我们还引入了一种最坏情况优化策略，以提高对模型级攻击的鲁棒性。实验表明，所提出的方法在多个T2I图像生成和编辑任务中实现了高保真度和鲁棒性，指纹提取准确率超过99.5%。与现有方法相比，我们的方法首次通过显著增加共谋模型的FID，展示了对共谋攻击的显著主动鲁棒性。

英文摘要

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.
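
To illustrate why a function-invariant parameter transformation can defeat collusion by weight averaging, here is a minimal sketch. It assumes a toy one-hidden-layer ReLU network and uses hidden-unit permutation as the transformation. The abstract does not say which transformations the authors use, so `permute`, `average`, and the toy weights are hypothetical and chosen only for illustration.

```python
def relu(z):
    return max(0.0, z)

def forward(model, x):
    """Toy network: sum_j v_j * relu(w_j * x + b_j)."""
    w, b, v = model
    return sum(vj * relu(wj * x + bj) for wj, bj, vj in zip(w, b, v))

def permute(model, order):
    """Reorder hidden units; the computed function is unchanged."""
    w, b, v = model
    return ([w[i] for i in order], [b[i] for i in order], [v[i] for i in order])

def average(model_a, model_b):
    """Naive collusion: average corresponding parameters of two copies."""
    return tuple([(p + q) / 2 for p, q in zip(pa, pb)]
                 for pa, pb in zip(model_a, model_b))

# Hypothetical weights (w, b, v) for two distributed copies of the same model.
copy_a = ([1.0, -1.0], [0.0, 0.0], [1.0, 2.0])
copy_b = permute(copy_a, [1, 0])
colluded = average(copy_a, copy_b)

print(forward(copy_a, 1.0), forward(copy_b, 1.0))  # both copies agree: 1.0 1.0
print(forward(colluded, 1.0))                      # averaged copy collapses: 0.0
```

Each copy computes the same function, but the averaged parameters no longer line up unit by unit, so the colluded model's output degrades.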

2606.12976 2026-06-12 cs.AI 新提交

A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning

面向协作问题求解与AI推理数据集生成的数学论坛平台

Akbar Erkinov, Nurmukhammad Abdurasulov

发表机构 * Independent Researchers, San Francisco, CA, USA（独立研究者，美国加利福尼亚州旧金山）

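AI summary: Presents a forum system with an integrated image-to-LaTeX conversion pipeline that removes the friction of sharing mathematical content, supports desktop and mobile clients, and yields a community-validated dataset of mathematical problems for training AI reasoning.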

Comments: 11 pages, 3 figures

AI中文摘要

在在线论坛中分享数学内容仍然是学生和教师的一个显著痛点：编写原始LaTeX容易出错，独立的光学字符识别工具需要切换平台，而当前的论坛软件没有提供从公式照片到渲染帖子的集成路径。我们提出了一个统一系统，通过将图像到LaTeX转换管线直接嵌入论坛发布界面来消除这一摩擦。用户上传或拍摄数学表达式的图像；系统通过Mathpix OCR API路由该图像，检测返回的输出是LaTeX还是包含内联数学的纯文本，应用适当的分隔符规范化，并在帖子提交到数据库之前以LaTeX或Markdown模式提供实时预览。该架构分为三个松散耦合的层：图像处理、渲染和存储，并支持桌面和移动客户端。已提交一份涵盖核心方法的美国临时专利申请。我们描述了完整的系统设计、每个组件的细节、数据模式以及关键的技术创新，并将该工作与现有的独立工具和论坛平台进行对比，以展示其填补的实际空白。除了直接的可用性之外，我们认为这种部署的平台构成了一个持续增长、社区验证的数学问题和逐步解决方案数据集，该资源可用于训练和基准测试AI系统以实现准确的数学推理。

英文摘要

Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.
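
The detection and delimiter-normalisation step described in the abstract might look roughly like the sketch below. It does not call the Mathpix API and makes no claim about that API's response format. The heuristics, the function names `classify` and `normalise`, and the choice of `$...$` and `$$...$$` as target delimiters are all assumptions made for illustration.

```python
import re

# Assumed heuristics: explicit \( \) or \[ \] delimiters mean "text with math";
# bare LaTeX commands or ^{ / _{ mean the whole string is LaTeX.
LATEX_HINT = re.compile(r"\\(?:frac|sum|int|sqrt|begin|left|right)\b|[\^_]\{")
INLINE_MATH = re.compile(r"\\\((.+?)\\\)", re.DOTALL)
DISPLAY_MATH = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)

def classify(ocr_output):
    """Return 'text_with_math', 'latex', or 'plain_text'."""
    if INLINE_MATH.search(ocr_output) or DISPLAY_MATH.search(ocr_output):
        return "text_with_math"
    if LATEX_HINT.search(ocr_output):
        return "latex"
    return "plain_text"

def normalise(ocr_output):
    """Rewrite delimiters to $...$ (inline) and $$...$$ (display)."""
    if classify(ocr_output) == "latex":
        return "$$" + ocr_output.strip() + "$$"
    text = DISPLAY_MATH.sub(lambda m: "$$" + m.group(1).strip() + "$$", ocr_output)
    return INLINE_MATH.sub(lambda m: "$" + m.group(1).strip() + "$", text)

print(normalise(r"Solve \( x^2 = 4 \) for x."))  # Solve $x^2 = 4$ for x.
```

Using a function as the replacement argument avoids `re.sub` interpreting backslashes in the captured LaTeX.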

2606.12972 2026-06-12 cs.HC 新提交

From Prompts to Preferences: An Open-Source Platform for Generative AI-Enhanced Conjoint Analysis

从提示到偏好：生成式AI增强联合分析的开源平台

Philipp Brauner

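AI summary: Presents an open-source, self-hosted conjoint analysis survey platform that uses generative AI (a large language model and a text-to-image model) to create integrated stimuli formats, lowers the barrier to research, and is demonstrated in a proof-of-concept study.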

AI中文摘要

联合分析是营销研究、政治学、医疗保健和人机交互中广泛使用的偏好测量方法。尽管被广泛采用，但无法访问商业平台的研究人员面临重大障碍，因为现有工具要么昂贵，要么缺乏端到端的调查基础设施。本文提出了一个开源、自托管的Web应用程序，用于设计、部署和分析联合调查。除了传统的表格刺激外，该平台使用生成式AI生成集成刺激格式：由大语言模型生成的文本场景描述，以及由文本到图像模型生成的视觉刺激。研究者定义的基础提示通过联合配置文件进行参数化，可选的面向LLM的水平注释丰富了生成过程。结构化的设置向导、AI辅助属性建议和实时数据分析降低了联合分析方法新手研究者的技术门槛。完整的导出包包括所有刺激、其生成提示和响应数据，促进了透明度和可重复性。通过一项关于环境辅助生活（AAL，N=55）中护理机器人偏好的概念验证研究，使用AI生成的视觉刺激展示了该平台。本文讨论了AI辅助在联合设计中的作用，认为理论依据必须仍然是研究者的责任，并概述了genAI生成的刺激如何拓宽HCI及相关领域的方法论库。

英文摘要

Conjoint analysis is a widely used preference measurement method in marketing research, political science, healthcare, and human-computer interaction. Despite broad adoption, researchers without access to commercial platforms face significant barriers, as existing tools are either expensive or lack end-to-end survey infrastructure. This paper presents an open-source, self-hosted web application for designing, deploying, and analysing conjoint surveys. Beyond conventional tabular stimuli, the platform uses generative AI to produce integrated stimuli formats: textual scenario descriptions generated by a large language model, and visual stimuli by a text-to-image model. A researcher-defined base prompt is parameterised with the conjoint profile, and optional LLM-facing level annotations enrich the generation. A structured setup wizard, AI-assisted attribute suggestion, and live data analysis lower the technical barriers for researchers new to conjoint methodology. A full export bundle including all stimuli, their generating prompts, and response data facilitates transparency and reproducibility. The platform is demonstrated through a proof-of-concept study on care robot preferences for ambient assisted living (AAL, N=55) using AI-generated visual stimuli. The paper discusses the role of AI assistance in conjoint design, arguing that theoretical grounding must remain the researcher's responsibility, and outlining how genAI-generated stimuli can broaden the methodological repertoire for HCI and related fields.
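
The idea of a base prompt parameterised with the conjoint profile can be sketched as follows. The attributes, levels, prompt wording, and the full-factorial enumeration are hypothetical choices for illustration and are not taken from the platform.

```python
from itertools import product

# Hypothetical attributes and levels for a care-robot study.
attributes = {
    "appearance": ["humanoid", "machine-like"],
    "autonomy": ["low", "high"],
}

BASE_PROMPT = ("A photo of a care robot in a living room; "
               "appearance: {appearance}; autonomy: {autonomy}.")

def full_factorial(attrs):
    """Enumerate every combination of attribute levels as a profile dict."""
    names = list(attrs)
    return [dict(zip(names, levels)) for levels in product(*attrs.values())]

def build_prompt(base, profile, annotations=None):
    """Fill the base prompt; an optional annotation replaces a terse level
    with a richer description intended for the generative model."""
    annotations = annotations or {}
    enriched = {name: annotations.get((name, level), level)
                for name, level in profile.items()}
    return base.format(**enriched)

notes = {("autonomy", "low"): "asks for confirmation before every action"}
for profile in full_factorial(attributes):
    print(build_prompt(BASE_PROMPT, profile, notes))
```

Storing each generated prompt next to its profile is what would let an export bundle reproduce the stimuli later.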

2606.12971 2026-06-12 cs.LG 新提交

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

从二元对话中的语音和交互动态预测认知负荷

Tahiya Chowdhury

发表机构 * Department of Computer Science, Colby College（科尔比学院计算机科学系）

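AI summary: Studies whether speech and interaction-dynamics features predict perceived cognitive load in natural collaborative conversations, and finds that conversational interaction (for example, turn-taking) is a useful signal for load dimensions such as time pressure and mental work.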

Comments: Accepted to Interspeech 2026

AI中文摘要

从语音估计认知负荷主要在受控实验室环境中研究，对其在自然协作对话中的可靠性了解有限。我们研究语音和交互动态是否能预测二元对话中的感知认知负荷。我们分析了53对执行九项协作任务的对话音频，提取静态声学、动态和交互特征，训练双头门控循环单元编码器预测认知负荷分数。结果表明，对话交互为预测与时间压力、脑力工作、努力和任务表现相关的认知负荷提供了有用信号。时间需求与话轮转换动态（如重叠和说话者切换）相关，而脑力需求与说话者之间的不平衡参与相关。这些发现强调了任务结构和对话交互在自然协作环境中建模认知负荷的重要性。

英文摘要

Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.
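
Interaction features of the kind the abstract names (overlap, speaker switches, imbalanced participation) could be computed from diarised turns as in this sketch. The feature definitions and the `(speaker, start, end)` input format are assumptions; the paper's exact definitions are not given in the abstract.

```python
def turn_taking_features(turns):
    """turns: list of (speaker, start_s, end_s), sorted by start time.

    Simplification: overlap is measured only between consecutive turns.
    """
    switches = 0
    overlap = 0.0
    for (spk_a, _, end_a), (spk_b, start_b, end_b) in zip(turns, turns[1:]):
        if spk_a != spk_b:
            switches += 1
            overlap += max(0.0, min(end_a, end_b) - start_b)

    # Participation imbalance: spread of talk time as a share of the total.
    talk = {}
    for spk, start, end in turns:
        talk[spk] = talk.get(spk, 0.0) + (end - start)
    total = sum(talk.values())
    imbalance = (max(talk.values()) - min(talk.values())) / total if total else 0.0
    return {"switches": switches, "overlap_s": overlap, "imbalance": imbalance}

# Made-up dyad: B starts 0.5 s before A finishes.
example = [("A", 0.0, 4.0), ("B", 3.5, 6.0), ("A", 6.0, 7.0)]
print(turn_taking_features(example))
```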

2606.12970 2026-06-12 cs.DS 新提交

Binary Search Variants: A Comprehensive Analysis

二分搜索变体：全面分析

Ali Dasdan

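AI summary: Gives a unified treatment of five core binary search variants, six derived query functions, and four standard library implementations, introduces the combined search bsearch_ultimate, and provides every algorithm as Python code, Dafny formal proof, and pseudocode, validated by over 9,500 tests and 21 Dafny verifications.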

Comments: 57 pages, 1 figure

AI中文摘要

二分搜索概念上看似简单，但正确实现却出了名地困难。本文对二分搜索进行了统一处理：五种核心变体、六种派生查询函数以及四种标准库实现（BSD、glibc、Java、C++ STL），每种都附带一致的符号表示、循环不变量和分析。我们引入了bsearch_ultimate，一种组合搜索，可在单次调用中涵盖所有变体。每个算法都以同步的Python代码、Dafny形式化证明和伪代码形式提供。所有实现均经过超过9500次测试和21次Dafny形式化验证；另外六个故意有缺陷的实现展示了常见的错误类别以及Dafny检测它们的能力。我们还提供了易于记忆的规则，将边界选择与循环条件和更新公式联系起来。

英文摘要

Binary search is deceptively simple in concept yet notoriously difficult to implement correctly. This paper presents a unified treatment of binary search: five core variants, six derived query functions, and four standard library implementations (BSD, glibc, Java, C++ STL), each with consistent notation, loop invariants, and analysis. We introduce bsearch_ultimate, a combined search that subsumes all variants in a single call. Every algorithm is provided as synchronized Python code, Dafny formal proof, and pseudocode. All implementations are validated by over 9,500 tests and 21 Dafny formal verifications; an additional six deliberately faulty implementations demonstrate common bug categories and Dafny's ability to detect them. We also provide memorable rules linking boundary choices to loop conditions and update formulas.
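
As a concrete reminder of how boundary choice, loop condition, and update formula fit together, here are two textbook variants with their loop invariants. These are standard half-open-interval formulations, not necessarily the paper's five variants, and `bsearch_ultimate` is not reproduced here.

```python
def lower_bound(a, x):
    """First index i with a[i] >= x, or len(a) if there is none."""
    lo, hi = 0, len(a)              # search the half-open range [lo, hi)
    # Invariant: every a[:lo] < x and every a[hi:] >= x.
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if a[mid] < x:
            lo = mid + 1            # a[mid] is too small; exclude it
        else:
            hi = mid                # a[mid] may be the answer; keep it
    return lo

def upper_bound(a, x):
    """First index i with a[i] > x, or len(a) if there is none."""
    lo, hi = 0, len(a)
    # Invariant: every a[:lo] <= x and every a[hi:] > x.
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if a[mid] <= x:
            lo = mid + 1
        else:
            hi = mid
    return lo

def find(a, x):
    """Index of the first occurrence of x in sorted a, or -1 if absent."""
    i = lower_bound(a, x)
    return i if i < len(a) and a[i] == x else -1

data = [1, 2, 2, 2, 5]
print(lower_bound(data, 2), upper_bound(data, 2), find(data, 3))  # 1 4 -1
```

With a half-open range, `hi = mid` (not `mid - 1`) pairs with `while lo < hi`; mixing conventions is a common source of off-by-one bugs.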

2606.12969 2026-06-12 cs.AI 新提交

Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models

用于配电缺陷检测的多模态智能体：基础模型评估

Quan Quan

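AI summary: Proposes a multi-modal agent framework and systematically evaluates foundation models on perception, reasoning, and tool usage for closed-loop automation of power distribution defect detection.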

AI中文摘要

配电网络对可靠电力输送至关重要，但传统检测方法在语义理解、泛化和闭环自动化方面存在局限。为解决这些挑战，本文提出了一种专门用于配电缺陷检测的多模态智能体框架。本研究的核心是系统评估多模态基础模型作为统一认知引擎的能力。我们严格评估了它们在三个关键能力上的综合表现：（1）感知，模型必须准确识别设备并生成专家级的缺陷描述；（2）推理，模型根据视觉发现解释原因、评估严重性并基于领域知识规划维护策略；（3）工具使用，模型作为自主操作者执行动作——如查询知识库或生成工单——以实现闭环维护。为支持此评估，我们开发了领域特定的评估数据集和综合基准。实验结果表明了当前基础模型在这三个维度的优势与局限，为在高风险工业环境中部署自主智能体提供了实证依据。

英文摘要

The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions -- such as querying knowledge bases or generating work orders -- to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

2606.12966 2026-06-12 cs.LG cs.NE 新提交

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

电路同步先于泛化：来自Grokking Transformer中傅里叶结构的因果证据

Achyuthan Sivasankar

发表机构 * New York University（纽约大学）

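AI summary: Introduces the Frequency Synchronization Degree (FSD) metric, shows that on modular arithmetic tasks it synchronises 500-3,000 steps before grokking, and provides causal evidence, by varying weight decay, that the gap between synchronisation and grokking is a regularisation phenomenon.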

Comments: 16 pages, 6 figures, 10 tables

AI中文摘要

Grokking——模算术上的transformer从近乎随机突然转变为近乎完美的验证准确率——归因于傅里叶电路，但其时机、因果结构和可控性仍知之甚少。我们引入了频率同步度（FSD），一种无需先验电路知识的归一化、置换检验的傅里叶电路同步度量。在九个模加法配置（素数p∈{53,71,97,113,131}，三个种子）中，FSD在grokking前500-3000步同步（平均领先+1722步；所有九个为正，符号检验p≈0.004），并且在所有九个案例中先于受限logit损失基线（Nanda等人的排除损失），使其成为最早可用的预测器。我们提供了直接因果证据，证明相间间隙是一种正则化现象：在FSD峰值步骤分叉训练并变化权重衰减λ，会产生严格单调的更早grokking，且Δ_t与1/λ成正比。该定律在三个素数（p∈{53,97,131}；两个干净案例的R²=1.00和R²=0.99）上重复，表示为Δ_t ~ C/λ，与(1/λ)*log(||W_mem||/τ)一致。架构消融实验表明，仅注意力模型在强FSD前兆下grok；仅MLP模型从不grok；单层模型的FSD滞后，确认了前兆是多块电路属性。

英文摘要

Grokking -- where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy -- is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p~0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53,97,131}; R^2=1.00 and R^2=0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
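
Two of the quantitative claims can be checked or illustrated with a few lines. The sketch below computes an exact two-sided sign test (assuming a two-sided test, which reproduces the reported p of about 0.004 for nine positive leads out of nine) and fits the constant C in Delta_t ~ C/lambda by least squares. The weight-decay values and delays in the example are synthetic, not the paper's data.

```python
from math import comb

def sign_test_two_sided(n_positive, n_total):
    """Exact two-sided sign test p-value under a fair-coin null."""
    k = max(n_positive, n_total - n_positive)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)

def fit_inverse_law(lambdas, delays):
    """Least-squares C for delay = C / lambda (no intercept)."""
    xs = [1.0 / lam for lam in lambdas]
    return sum(x * d for x, d in zip(xs, delays)) / sum(x * x for x in xs)

print(sign_test_two_sided(9, 9))  # 0.00390625, i.e. about 0.004

# Synthetic example that follows the law exactly with C = 1000.
print(fit_inverse_law([0.5, 1.0, 2.0], [2000.0, 1000.0, 500.0]))  # 1000.0
```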

2606.12965 2026-06-12 cs.RO 新提交

EmbodiSteer: Steering Embodiment-Agnostic Visuomotor Policies with Joint-Space Guidance for Zero-Shot Cross-Embodiment Deployment

EmbodiSteer: 用关节空间引导的具身无关视觉运动策略实现零样本跨具身部署

Shihefeng Wang, Kangchen Lv, Mingrui Yu, Xiang Li

发表机构 * Department of Automation, Tsinghua University（清华大学自动化系）； Beijing Key Laboratory of Embodied Intelligence Systems（北京具身智能系统重点实验室）； Institute for Embodied Intelligence and Robotics, Tsinghua University（清华大学具身智能与机器人研究所）

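AI summary: Proposes EmbodiSteer, a framework that lifts inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates and adds whole-body collision-aware guidance, enabling zero-shot, embodiment-aware deployment that lowers collision rates and raises task success on simulated and physical robots.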

Comments: The first two authors contribute equally

AI中文摘要

可扩展的机器人模仿学习依赖于来自不同机器人的大规模异构数据或无身体数据，使得笛卡尔末端执行器动作成为具身无关策略学习的关键接口。然而，仅末端执行器的抽象使得笛卡尔策略对部署的机器人身体无感知，导致其在全身碰撞避免等机器人特定约束下脆弱。为克服这一限制，我们提出EmbodiSteer，一种无需训练的框架，将具身无关的视觉运动策略引导至零样本、具身感知的部署。EmbodiSteer将策略学习保持在笛卡尔空间，同时通过前向运动学和基于雅可比的更新，高效地将推理时的扩散采样提升到目标机器人的关节空间。在每个去噪步骤后，通过关节轨迹上的全身碰撞感知引导，机械臂可以在保持学习到的末端执行器行为的同时避开碰撞。与仅笛卡尔执行相比，EmbodiSteer在9个模拟机器人上将碰撞率降低46.1%，任务成功率提高28.5%，并在高度受限场景下的两个物理机器人上实现碰撞率降低90.0%，成功率提高36.7%。我们的项目页面位于此https URL。

英文摘要

Scalable robot imitation learning relies on large-scale heterogeneous data from diverse robots or body-free data, making Cartesian end-effector actions a key interface for embodiment-agnostic policy learning. However, end-effector-only abstraction leaves Cartesian policies unaware of the deployed robot body, making them brittle under robot-specific constraints such as whole-body collision avoidance. To overcome this limitation, we present EmbodiSteer, a training-free framework that steers embodiment-agnostic visuomotor policies toward zero-shot, embodiment-aware deployment. EmbodiSteer keeps policy learning in Cartesian space while efficiently lifting inference-time diffusion sampling into the target robot's joint space via forward kinematics and Jacobian-based updates. With whole-body collision-aware guidance over joint trajectories after each denoising step, the arm can be steered away from collisions while preserving learned end-effector behavior. Compared with Cartesian-only execution, EmbodiSteer reduces collision rate by 46.1% and improves task success rate by 28.5% across 9 simulated robots, and further achieves 90.0% collision rate reduction and 36.7% success rate increase on two physical robots in highly constrained scenarios. Our project page is at this https URL.
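
The general mechanism of mapping a Cartesian end-effector target into joint space with forward kinematics and a Jacobian can be sketched on a planar two-link arm. This is a generic damped least-squares update, assumed here for illustration; the abstract does not specify EmbodiSteer's actual update rule, and collision guidance is omitted. Link lengths, the damping value, and the function names are hypothetical.

```python
import math

L1, L2 = 1.0, 1.0  # hypothetical link lengths

def fk(q):
    """End-effector (x, y) of a planar two-link arm with joint angles q."""
    q0, q1 = q
    return (L1 * math.cos(q0) + L2 * math.cos(q0 + q1),
            L1 * math.sin(q0) + L2 * math.sin(q0 + q1))

def jacobian(q):
    """2x2 Jacobian d(x, y)/d(q0, q1), returned as two rows."""
    q0, q1 = q
    s1, c1 = math.sin(q0), math.cos(q0)
    s12, c12 = math.sin(q0 + q1), math.cos(q0 + q1)
    return ((-L1 * s1 - L2 * s12, -L2 * s12),
            (L1 * c1 + L2 * c12, L2 * c12))

def lift_step(q, target_xy, damping=1e-3):
    """One damped least-squares step: dq = J^T (J J^T + damping * I)^-1 e."""
    (a, b), (c, d) = jacobian(q)
    x, y = fk(q)
    e0, e1 = target_xy[0] - x, target_xy[1] - y
    m00 = a * a + b * b + damping
    m01 = a * c + b * d
    m11 = c * c + d * d + damping
    det = m00 * m11 - m01 * m01          # positive because of the damping term
    y0 = (m11 * e0 - m01 * e1) / det
    y1 = (m00 * e1 - m01 * e0) / det
    return (q[0] + a * y0 + c * y1, q[1] + b * y0 + d * y1)

def track(q, target_xy, steps=50):
    """Repeat the update so the end effector converges on the target."""
    for _ in range(steps):
        q = lift_step(q, target_xy)
    return q

q_final = track((0.3, 0.8), (1.2, 0.9))
print(fk(q_final))
```

A joint-space cost such as a collision penalty could be added to each step, since the whole arm configuration is available once the state lives in joint space.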

2606.12963 2026-06-12 cs.NI cs.DC cs.ET 新提交

ScaleAcross: Designing Multi-Data-Center Infrastructure for Geo-Distributed AI Training

ScaleAcross: 为地理分布式AI训练设计多数据中心基础设施

Naved Inam, Aryan Alpesh Bhavsar, Masabattula Teja Nikhil, Sidharth Sharma

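AI summary: Presents a scalable EVPN-VXLAN-based emulation framework for studying synchronization-intensive communication and cross-site data exchange in geo-distributed AI training, incorporating ECMP, BFD, and a queue-pair-aware traffic distribution mechanism intended to improve communication behavior.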

AI中文摘要

AI模型的快速增长和日益增长的数据主权要求正在推动跨多个数据中心的地理分布式AI训练的转变。这种部署引入了由同步密集型通信、跨站点数据交换和广域网延迟约束引起的系统级挑战。本文研究了EVPN-VXLAN作为地理分布式AI训练环境的基础设施基础，并提出了一个可扩展的仿真框架，用于在现实广域网条件下系统研究分布式AI工作负载。所提出的框架结合了VXLAN覆盖网络和基于EVPN的数据中心间连接，并使用ContainerLab和FRRouting（FRR）实现。该框架进一步集成了等价多路径（ECMP）路由、双向转发检测（BFD）和队列对感知流量分配机制，旨在改善同步密集型AI工作负载的通信行为，同时保持与商品基础设施的兼容性。通过使用真实的广域网仿真，我们表征了采用AllReduce和参数服务器通信模式的分布式训练工作负载下的通信和系统行为。结果提供了对地理分布式AI环境中流量分布、弹性和基础设施行为的见解，突显了可重现的多数据中心基础设施框架在可扩展分布式AI训练中的潜力。

英文摘要

The rapid growth of AI models and increasing data sovereignty requirements are driving the transition toward geo-distributed AI training across multiple data centers. Such deployments introduce system-level challenges arising from synchronization-intensive communication, cross-site data exchange, and wide-area latency constraints. This paper investigates EVPN-VXLAN as an infrastructure foundation for geo-distributed AI training environments and presents a scalable emulation framework for systematically studying distributed AI workloads under realistic wide-area conditions. The proposed framework combines VXLAN overlays with EVPN-based inter-data-center connectivity and is implemented using ContainerLab and FRRouting (FRR). The framework further incorporates Equal-Cost Multi-Path (ECMP) routing, Bidirectional Forwarding Detection (BFD), and a queue-pair-aware traffic distribution mechanism designed to improve communication behavior for synchronization-intensive AI workloads while preserving compatibility with commodity infrastructure. Using realistic WAN emulation, we characterize communication and system behavior under distributed training workloads employing AllReduce and Parameter Server communication patterns. Results provide insights into traffic distribution, resilience, and infrastructure behavior in geo-distributed AI environments, highlighting the potential of reproducible multi-data-center infrastructure frameworks for scalable distributed AI training.
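
One plausible reading of queue-pair-aware traffic distribution is that the queue pair identifier is added to the ECMP hash key, so that many queue pairs between the same two hosts spread across equal-cost paths instead of all landing on one. The sketch below illustrates that idea only; the abstract does not describe the actual mechanism, and `ecmp_path`, the flow tuple, and the use of SHA-256 are assumptions.

```python
import hashlib

def ecmp_path(flow, n_paths, queue_pair=None):
    """Pick an equal-cost path index by hashing the flow key.

    flow: tuple of header fields; queue_pair: optional extra hash input.
    """
    key = "|".join(str(field) for field in flow)
    if queue_pair is not None:
        key += f"|qp={queue_pair}"
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port).
flow = ("10.0.0.1", "10.1.0.1", 17, 49152, 50000)

same_tuple = {ecmp_path(flow, 4) for _ in range(8)}                # always one path
per_queue_pair = {ecmp_path(flow, 4, qp) for qp in range(64)}      # spreads out
print(len(same_tuple), len(per_queue_pair))
```

Without the extra input, a few large synchronisation flows between two sites hash to the same path every time, which is the polarisation problem such mechanisms aim to avoid.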

2606.12958 2026-06-12 cs.CV 新提交

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

YOLO-AMC：一种改进的带有注意力机制的YOLO架构用于建筑裂缝检测

Ching-Yu Tsai, Chia-Min Lin, Chih-Hsiang Yang, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University（淡江大学电机与计算机工程系）

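AI summary: Proposes YOLO-AMC, which removes C2PSA from YOLOv11 and introduces attention mechanisms (GAM, Res-CBAM, SA) to improve crack detection, reaching mAP@0.5 of 0.9917 on the test set and 110.95 FPS on an NVIDIA RTX 4090, balancing accuracy with deployment efficiency.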

Comments: 14 pages, 8 tables, 6 figures. Expanded version of IET ICETA 2025 conference paper

AI中文摘要

裂缝检测在基础设施检查和结构健康监测（SHM）中起着重要作用。然而，裂缝通常表现为薄、低对比度的结构，且容易受到背景噪声的影响，给现有目标检测模型带来了挑战。本研究提出了一种改进的基于YOLO的架构，集成了注意力机制，称为YOLO-AMC（用于裂缝检测的YOLO注意力机制），以增强自动裂缝检测性能。基于YOLOv11，移除了原始的C2PSA模块，并在Neck的多尺度特征融合层中引入了多种注意力机制，包括全局注意力机制（GAM）、残差卷积块注意力模块（Res-CBAM）和Shuffle Attention（SA），以加强跨尺度特征整合。实验结果表明，YOLO-AMC在多个评估指标上始终优于基线模型YOLOv11n和YOLOv8n。在评估的注意力模块中，GAM取得了最佳检测性能，在测试数据集上获得了mAP@0.5 = 0.9917和mAP@0.5:0.95 = 0.9506，高于YOLOv11（0.9833 / 0.9112）和YOLOv8（0.9707 / 0.8921）。此外，在保持7.6 GFLOPs计算复杂度的同时，所提出的模型在NVIDIA RTX 4090平台上达到了110.95 FPS，在Raspberry Pi 5边缘设备上约为5 FPS，展示了准确性与部署效率之间的良好权衡。本研究的实现代码可在GitHub上获取，网址见此https URL。

英文摘要

Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at this https URL.
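
For readers unfamiliar with the metrics: mAP@0.5 counts a detection as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5, and mAP@0.5:0.95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. A minimal IoU sketch follows; the `(x1, y1, x2, y2)` box format and the helper `is_match` are assumptions for illustration, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred, truth, threshold=0.5):
    """True if the prediction overlaps the ground truth enough to count."""
    return iou(pred, truth) >= threshold

# Thresholds used by mAP@0.5:0.95.
THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]

print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # one third: half of each box overlaps
```

Thin, elongated objects such as cracks are sensitive to small box offsets, which is why the stricter mAP@0.5:0.95 figure is typically lower than mAP@0.5.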

2606.12956 2026-06-12 cs.RO 新提交

SERF: Spatiotemporal Environment and Robot Feature Map for Long-Horizon Mobile Manipulation

SERF：面向长时域移动操作任务的时空环境与机器人特征地图

Sunghwan Kim, Byeonghyun Pak, Kehan Long, Yulun Tian, Nikolay Atanasov

发表机构 * UC San Diego（加州大学圣地亚哥分校）； Agency for Defense Development（国防发展局）； SceniX Inc.（SceniX公司）； University of Michigan（密歇根大学）

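AI summary: Proposes the SERF map, which represents the environment and the robot body as neural points in a shared latent space, updates them online, and feeds them as state input to a VLA model, improving reasoning in long-horizon mobile manipulation and outperforming image-only baselines on BEHAVIOR-1K.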

Comments: Project page: this https URL

AI中文摘要

长时域机器人移动操作需要对定位、环境变化和任务进度进行持续推理，而这些都难以仅从图像观测中推断。在本文中，我们表明，将移动操作策略条件化于一个时空特征地图可以改善长时域上的推理。该地图将环境和铰接机器人身体表示为共享潜空间中的神经点，并从自我中心观测和本体感受状态在线更新。我们使用基于对象的刚性跟踪更新环境神经点，并使用正向运动学更新机器人神经点。通过从多个参考帧和空间尺度提取地图标记，我们将时空环境与机器人特征（SERF）地图作为状态输入到视觉-语言-动作（VLA）模型中，为策略提供局部和全局上下文。我们在BEHAVIOR-1K（一个家庭环境中的长时域移动操作基准）上展示了SERF。实验表明，SERF VLA策略优于纯图像基线，通过遵循更直接的轨迹更快地达到子目标，提高了对场景配置变化的鲁棒性，并能从物体掉落失败中恢复。

英文摘要

Long-horizon robot mobile manipulation requires continual reasoning about localization, environment changes, and task progress, all of which are challenging to infer from image observations alone. In this paper, we show that conditioning a mobile manipulation policy on a spatiotemporal feature map improves reasoning over long horizons. The map represents the environment and the articulated robot body as neural points in a shared latent space and is updated online from egocentric observations and proprioceptive state. We update the environment neural points using object-level rigid tracking and the robot neural points using forward kinematics. We use our spatiotemporal environment and robot feature (SERF) map as a state input to a vision-language-action (VLA) model by extracting map tokens from multiple reference frames and spatial scales, providing the policy with both local and global context. We demonstrate SERF on BEHAVIOR-1K, a benchmark for long-horizon mobile manipulation in household environments. Experiments show that the SERF VLA policy outperforms image-only baselines, reaches subgoals faster by following more direct trajectories, improves robustness to scene-configuration shifts, and recovers from object-drop failures.

2606.12954 2026-06-12 cs.RO 新提交

Towards Reliable Sequential Object Picking in Clutter: The Runner-up Solution to RGMC 2025

面向杂乱环境中的可靠顺序物体抓取：RGMC 2025 亚军方案

Wei Yu, Xidan Zhang, Ziyi Zheng, Weijie Kong, Huixu Dong

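AI summary: For sequential object picking in cluttered environments, proposes an integrated hardware-software pipeline that combines a multifunctional gripper design with novel representations of object distribution and occlusion relationships, enabling efficient recognition, search, and sequential grasping; runner-up at RGMC 2025.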

Comments: First, Second and Third Coauthor contributed equally to this work

AI中文摘要

作为机器人操作中的长期挑战，在杂乱环境中稳定高效地抓取在工业场景中至关重要。尽管近期研究在杂乱抓取中取得了较高的成功率，但对于顺序物体搜索与分类等更具挑战性的任务，成熟解决方案仍然较少。本工作基于杂乱环境抓取基准（CEPB）解决杂乱环境中的顺序物体抓取问题，并展示了我们在ICRA 2025第十届机器人抓取与操作竞赛（RGMC）的“杂乱抓取”赛道中的方案。该任务提出了几个关键挑战。首先，它需要鲁棒且考虑碰撞的抓取，在包括刚性和可变形物体在内的多样化物体集上具有高成功率。其次，它要求高效搜索目标物体，这对方案的清理和搜索策略提出了严格要求。为应对上述挑战，我们设计了一个集成的硬件-软件流水线，结合了物体识别、清理和多模态抓取。主要贡献包括多功能夹爪的硬件设计以及杂乱空间中物体分布和遮挡关系的新表示。该流水线实现了对杂乱环境中物体的高效识别、搜索和顺序抓取，在实验室测试和竞赛场景中均表现出色，最终在RGMC 2025的“杂乱抓取”赛道中获得第二名。

英文摘要

As a long-standing challenge in robotic manipulation, stable and efficient grasping in cluttered environments is of great importance in industrial settings. While recent studies have achieved relatively high success rates in grasping from clutter, there remain few mature solutions for more demanding tasks such as sequential object search and sorting. This work addresses sequential object picking in cluttered environments based on the Cluttered Environment Picking Benchmark (CEPB) and presents our solution to the Pick-in-Clutter track of the 10th Robotic Grasping and Manipulation Competition (RGMC) at ICRA 2025. The task poses several key challenges. First, it requires robust and collision-aware grasping with high success rates across a diverse set of objects, including both rigid and deformable ones. Second, it demands efficient search for target objects, which places stringent requirements on the decluttering and searching strategies of the solution. To address the above challenges, we design an integrated hardware-software pipeline that combines object recognition, decluttering, and multi-modal grasping. The main contributions include the hardware design of a multifunctional gripper and novel representations for object distribution and occlusion relationships in cluttered space. This pipeline enables efficient recognition, search, and sequential grasping of objects in clutter, demonstrating strong performance in both laboratory tests and competition scenarios, and ultimately achieving second place in the Pick-in-Clutter track of the RGMC 2025.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 新提交

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ：面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University（斯坦福大学）； Stanford University School of Medicine（斯坦福大学医学院）； Ghent University（根特大学）

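AI summary: Presents OpenMedQ, a medical vision-language model pretrained on 14 datasets (~3.35M samples) that reaches BLEU-1 of 75.9 on PathVQA, surpassing Med-PaLM M variants of up to 562B parameters, and obtains the highest average macro-F1 (0.757) on 8 unseen medical classification tasks.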

Comments: Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

AI中文摘要

我们提出OpenMedQ，一个在迄今为止最广泛的完全开放医学混合数据集上预训练的医学视觉语言模型：包含14个数据集，总计约335万预训练样本，涵盖病理学、放射学、显微镜和纯文本临床问答。OpenMedQ在PathVQA上达到最先进的BLEU-1（75.9），击败了参数多达562B（约大80倍）的Med-PaLM M变体，并在VQA-MED上匹配了最佳报告的BLEU-1（64.5）。其视觉编码器在相同的下游配方下迁移到8个未见过的医学分类基准，获得了最高的平均macro-F1（0.757），优于BiomedCLIP（0.745）、PMC-CLIP（0.745）、PubMedCLIP（0.746）和从头训练的基线（0.616）。我们公开了代码，并提供了一个交互式演示，作为社区的可复现基线。

英文摘要

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757) among BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and an interactive demo is publicly available as a reproducible baseline for the community.
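
Macro-F1, the metric used for the classification comparison, is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. A minimal sketch with made-up labels:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Made-up three-class example: per-class F1 is 1, 2/3, and 4/5.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(macro_f1(y_true, y_pred))  # 37/45, about 0.822
```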

2606.12950 2026-06-12 cs.DC 新提交

Maestro: Workload-Aware Cross-Cluster Scheduling for LLM-Based Multi-Agent Systems

Maestro: 面向基于LLM的多智能体系统的工作负载感知跨集群调度

Jinghao Wang, Xiao Zhou, Xiaoyang Sun, Yihui Zhang, Yilong Li, Tianyu Wo, Xu Wang, Chunming Hu, Renyu Yang

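AI summary: Proposes Maestro, a scheduling system that uses agent semantics to predict output length and memory usage and applies hierarchical scheduling (node-level multi-model co-location, cluster-level latency-aware routing, global workflow-aware prioritization) to serve LLM multi-agent workloads under strict GPU budgets, reducing KV-reservation HBM by 67.2% and improving high-contention SLO attainment over EDF by 23.6 percentage points.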

Comments: Accepted to the 46th IEEE International Conference on Distributed Computing Systems (ICDCS 2026). 11 pages

AI中文摘要

基于大型语言模型的多智能体系统（LLM-MAS）已成为一种强大的范式，通过将复杂任务分解为专门LLM驱动的智能体的协作工作流来处理这些任务。然而，大规模部署此类多智能体工作负载带来了重大系统挑战。每个用户查询会引发LLM调用的迭代流水线，与单轮查询相比，极大地放大了资源消耗。在资源受限的云环境中，这些工作流面临解码阶段非确定性和输入依赖的成本、具有内存碎片和过度供应的重尾多模型需求，以及跨集群调度权衡。我们提出Maestro，一个为在严格GPU预算下服务LLM-MAS而设计的工作负载感知调度系统。Maestro明确利用智能体语义和角色：它预测每个阶段的输出长度和内存使用，并利用此预测驱动层次化调度器。在节点级别，Maestro通过层次化权重缓存和弹性内存供应实现动态多模型共置。在集群级别，它执行延迟感知路由以避免冷启动延迟和内存过载。在全局级别，它实施工作流感知优先级排序，以最小化交互式任务的队头阻塞。在原型实验和轨迹驱动模拟中，Maestro将KV预留HBM减少了67.2%，并将高竞争SLO达标率比EDF提高了23.6个百分点。

英文摘要

Large Language Model based Multi-Agent Systems (LLM-MAS) have emerged as a powerful paradigm for tackling complex tasks by breaking them into collaborative workflows of specialized LLM-powered agents. However, deploying such multi-agent workloads at scale poses significant system challenges. Each user query spawns an iterative pipeline of LLM calls, greatly amplifying resource consumption compared to single-turn queries. In resource-constrained cloud settings, these workflows face non-deterministic and input-dependent costs at decode stage, heavy-tailed multi-model requirements with memory fragmentation and over-provisioning, and cross-cluster scheduling trade-offs. We present Maestro, a workload-aware scheduling system designed for LLM-MAS serving under strict GPU budgets. Maestro explicitly leverages agent semantics and roles: it predicts the output length and memory usage of each stage and uses this prediction to drive a hierarchical scheduler. At the node level, Maestro enables dynamic multi-model co-location via hierarchical weight caching and elastic memory provisioning. At the cluster level, it performs latency-aware routing to avoid cold-start delays and memory overloads. At the global level, it enforces workflow-aware prioritization to minimize head-of-line blocking for interactive tasks. Across prototype experiments and trace-driven simulations, Maestro reduces KV-reservation HBM by 67.2% and improves high-contention SLO attainment over EDF by 23.6 percentage points.

2606.12949 2026-06-12 cs.CR cs.CV 新提交

ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection

ViPER：基于视觉的打包感知编码器用于鲁棒恶意软件检测

Fatima Qaiser, Bisma Tahir, Muhammad Abid Mughal, Nauman Shamim

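AI summary: Proposes ViPER, a dual-head architecture on a LoRA-adapted ViT-B/14 that jointly learns malware classification and packing detection, uses a packing-aware gating mechanism and frequency-weighted losses to handle packing label skew, and reaches 0.8521 balanced accuracy, 0.9260 ROC-AUC, and 0.9279 AUPR on 200,000 Windows PE images.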

AI中文摘要

基于可视化的恶意软件检测将原始二进制字节映射为灰度图像，并应用学习的视觉分类器，为传统分析流程提供了一种抗规避且无需反汇编的替代方案。然而，可执行文件打包仍然是一个关键的失效模式：打包后的二进制文件产生高熵图像，掩盖了这些模型所依赖的结构模式。由于打包在良性软件中也很常见（例如用于压缩或复制保护），仅凭打包状态并不能可靠地指示恶意性，且现有方法未在统一的监督框架内解决这一挑战。我们提出了ViPER，一种基于视觉的打包感知编码器，用于鲁棒的恶意软件检测。ViPER构建在LoRA适配的ViT-B/14骨干网络上，采用双头架构，联合学习恶意软件分类和打包检测。打包感知门控机制根据推断的打包状态调节恶意软件预测，从而为打包和未打包输入实现不同的决策边界。为了解决训练期间打包标签偏斜的问题，我们采用了频率加权损失，并在联合类别-打包层上进行分层采样。在20万张Windows PE字节图图像上的评估中，ViPER达到了0.8521的平衡准确率、0.9260的ROC-AUC和0.9279的AUPR，在所有主要指标上均优于代表性的最先进基线，同时打包检测AUC达到0.9949。

English abstract

Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.

2606.12946 2026-06-12 cs.CY New submission

Data Aphasia: An Institutional Counterfactual Study of the Stability of Academic Cognition Under Letter-Grade Evaluation Systems

数据失语症：字母评分制度下学术认知稳定性的制度反事实研究

Li Li, Yu Cao

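AI summary: Introduces the concept of "data aphasia". Converting percentage scores to letter grades reduces information entropy by about 69%, destabilizes the cluster structure, and causes large swings in diagnostic consistency, revealing how letter-grade systems affect cognitive stability.

Comments: 36 pages, 14 figures, 16 tables

AI Chinese abstract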

字母评分制度在实现减负目标的同时，是否影响了教育系统对学生学术结构的稳定认知？本文引入“数据失语症”概念，指因机构强制规定的数据呈现形式而对诊断信息表达造成的限制。利用75名小学生68次数学考试数据，采用制度反事实模拟方法将百分制成绩转换为A/B/C/D字母等级，并在信息、结构和诊断层面进行系统检验。结果显示：成绩转换后信息熵下降约69%；全样本下字母评分制度表面稳定（K=4），但移除单个极端锚点学生后，最优K从4增至8，个体诊断身份一致性从95%降至62%；时间一致性在52%至96%之间波动，远低于百分制93%-96%的基线。机制分析表明，离散化在68次考试中将特征空间压缩约19倍；标准化后产生大量伪异质性区域，使密度梯度平坦化，聚类边界对微小扰动高度敏感。基于此，本文提出双轨评价机制，并为理解教育评价改革的认知成本提供了可检验的分析框架。

English abstract

Does the letter-grade evaluation system, while achieving its burden-reduction goals, affect the education system's stable understanding of students' academic structures? This paper introduces the concept of "data aphasia," referring to restrictions on diagnostic information expression caused by institutionally mandated forms of data presentation. Using data from 68 mathematics examinations administered to 75 primary school students, we employ an institutional counterfactual simulation method to convert percentage scores into A/B/C/D letter grades and conduct systematic tests at the information, structural, and diagnostic levels. Results show that information entropy decreases by approximately 69% after grade conversion; under the full sample, the letter-grade system appears superficially stable (K=4), but removing a single extreme anchor student causes the optimal K to increase from 4 to 8 and individual diagnostic identity consistency to fall from 95% to 62%; temporal consistency fluctuates between 52% and 96%, far below the 93%-96% baseline of the percentage system. Mechanism analysis indicates that discretization compresses the feature space by approximately nineteenfold across 68 examinations; after standardization, it creates extensive pseudo-heterogeneity regions, flattens density gradients, and makes clustering boundaries highly sensitive to minor perturbations. Based on these findings, this paper proposes a dual-track evaluation mechanism and provides a testable analytical framework for understanding the cognitive costs of educational evaluation reform.
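The entropy comparison described in the abstract can be illustrated with a minimal sketch. The scores and the A/B/C/D cutoffs below are hypothetical (the abstract does not state the thresholds used in the paper), so the size of the drop here is only illustrative and is not expected to match the reported 69%.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def to_letter(score):
    """Map a percentage score to a letter grade.

    The cutoffs are assumptions for illustration only.
    """
    if score >= 90:
        return "A"
    if score >= 75:
        return "B"
    if score >= 60:
        return "C"
    return "D"


# Hypothetical scores for one examination.
scores = [98, 95, 93, 91, 88, 86, 84, 81, 79, 77, 74, 70, 66, 62, 58, 45]
letters = [to_letter(s) for s in scores]

h_scores = shannon_entropy(scores)      # entropy of the percentage scores
h_letters = shannon_entropy(letters)    # entropy after discretization
relative_drop = 1 - h_letters / h_scores

print(f"percentage: {h_scores:.3f} bits, letters: {h_letters:.3f} bits")
print(f"relative entropy drop: {relative_drop:.1%}")
```

Discretization can only merge distinct scores into the same symbol, so the entropy of the letter grades is never higher than that of the original scores.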

2606.12945 2026-06-12 cs.AI New submission

Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory

学习该记住什么：一种基于认知的多因素记忆价值模型

Zhibao Chen, Qian Cheng

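Affiliations: Huatai Securities; OneBeget.com

AI summary: For memory management in long-running LLM agents, proposes a multi-factor memory value function grounded in cognitive psychology. Its weights are learned by gradient-free optimization, and the single score jointly controls encoding depth, forget risk, and retrieval rank; on LongMemEval it clearly outperforms single-factor and recency strategies.

Comments: 11 pages, 3 figures

AI Chinese abstract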

长期运行的LLM代理积累的交互历史远超任何上下文窗口，迫使面临一个持续决策：在固定记忆预算下，哪些内容应深度编码、哪些应遗忘、哪些应检索。生产系统采用语义相似性或近因性——两者对于遗忘决策都是错误指定的，因为遗忘决策是在未来查询未知的整合时刻做出的。我们提出一个多因素记忆价值函数 V(m)=∑_i w_i f_i(m)，涵盖七个可解释因素（情感强度、目标相关性、价值对齐、自我/用户相关性、任务效用、可靠性和使用历史），这些因素来自认知心理学，其权重通过无梯度优化器从下游目标中学习，并且该单一标量统一控制编码深度、遗忘风险和检索排名。我们提出一个方法论观点：在LongMemEval上，针对保留的评估问题对目标相关性进行评分，使得黄金证据保留率达到≈0.98——这衡量的是检索，而非遗忘。在现实盲态模式下，学习到的多因素价值在479个可用案例中保留了0.770±0.011的黄金证据，而均匀权重为0.657，最佳单一因素为0.518，近因性为0.368；每对差距的95%自助法置信区间均高于零，且基于相同因素的神经网络与线性模型持平。学习到的权重是可解释的——可靠性、情感强度和自我/用户相关性占主导，而查询时的目标相似性在遗忘决策中被正确降权。一个带有植入混淆的受控合成任务证实，学习器恢复了分离性权重（保留率1.00），而均匀权重失败（0.62）。该基础架构是开源的；所有实验在单CPU上运行，无需API调用。

English abstract

Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode deeply, what to forget, and what to retrieve under a fixed memory budget. Production systems answer with semantic similarity or recency, both mis-specified for the forgetting decision, which is made at consolidation time before the future query is known. We propose a multi-factor memory value function $V(m)=\sum_i w_i f_i(m)$ over seven interpretable factors (emotional intensity, goal relevance, value alignment, self/user relevance, task utility, reliability, and usage history) drawn from cognitive psychology, whose weights are learned from a downstream objective by a gradient-free optimiser, and whose single scalar uniformly controls encoding depth, forget risk, and retrieval rank. We make a methodological point: on LongMemEval, scoring goal relevance against the held-out evaluation question saturates gold-evidence retention at $\approx 0.98$; this measures retrieval, not forgetting. In the realistic blind regime, a learned multi-factor value retains $0.770 \pm 0.011$ of gold evidence across 479 usable cases, versus 0.657 for uniform weights, 0.518 for the best single factor, and 0.368 for recency; every paired gap's 95% bootstrap CI is above zero, and a neural network over the same factors ties the linear model. The learned weights are interpretable: reliability, emotional intensity, and self/user relevance dominate, while query-time goal similarity is correctly down-weighted for the forgetting decision. A controlled synthetic task with planted confounds confirms the learner recovers a separating weighting (1.00 retention) where uniform weighting fails (0.62). The substrate is open-source; all experiments run on a single CPU with no API calls.
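As a rough illustration of the linear value function in the abstract, the sketch below scores each memory as a weighted sum of the seven named factors and keeps the top-scoring memories under a fixed budget. The `Memory` class, the `consolidate` helper, the factor scores, and the weights are all hypothetical; in the paper the weights are learned by a gradient-free optimiser, which is not shown here.

```python
from dataclasses import dataclass

# The seven factors named in the abstract.
FACTORS = (
    "emotional_intensity",
    "goal_relevance",
    "value_alignment",
    "self_user_relevance",
    "task_utility",
    "reliability",
    "usage_history",
)


@dataclass
class Memory:
    text: str
    factors: dict  # factor name -> score in [0, 1]; missing factors count as 0


def memory_value(memory, weights):
    """V(m) = sum_i w_i * f_i(m) over the seven factors."""
    return sum(weights[name] * memory.factors.get(name, 0.0) for name in FACTORS)


def consolidate(memories, weights, budget):
    """Keep the `budget` highest-value memories; the rest are forgotten."""
    ranked = sorted(memories, key=lambda m: memory_value(m, weights), reverse=True)
    return ranked[:budget]


# Hand-picked weights, for illustration only (not the learned values).
weights = {
    "emotional_intensity": 0.20,
    "goal_relevance": 0.05,
    "value_alignment": 0.10,
    "self_user_relevance": 0.20,
    "task_utility": 0.15,
    "reliability": 0.25,
    "usage_history": 0.05,
}

memories = [
    Memory("User is allergic to peanuts",
           {"reliability": 0.9, "emotional_intensity": 0.7, "self_user_relevance": 1.0}),
    Memory("Small talk about the weather",
           {"reliability": 0.5, "usage_history": 0.1}),
    Memory("Project deadline is Friday",
           {"reliability": 0.8, "task_utility": 0.9, "goal_relevance": 0.6}),
]

kept = consolidate(memories, weights, budget=2)
print([m.text for m in kept])
```

Because the same scalar ranks every memory, the sorted order could also serve as a retrieval ranking, which is the unification the abstract describes.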

2606.12944 2026-06-12 cs.LO New submission

Testing Theory of Truly Concurrent Processes

真正并发过程的理论测试

Yong Wang

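AI summary: Building on Hennessy's work, this paper establishes a testing semantics for truly concurrent process algebras, inheriting the trinity of operational, axiomatic, and denotational semantics.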

2606.12942 2026-06-12 cs.AI New submission

PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization

PRISMR: 通过参数化表示内化克服多模态列表排序中的解析崩溃

Hao Jiang, Xin Li, Annan Wang, Zhi Yang, Haoxiang Zhang, Yichi Zhang, Weisi Lin

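Affiliations: Nanyang Technological University; Peking University; Independent Researcher

AI summary: To address parse collapse in generative listwise ranking under long-context multimodal settings, proposes the PRISMR framework, which replaces transient in-context list processing with parametric structural conditioning. A lightweight hypernetwork encodes candidates in parallel and generates LoRA weights, substantially reducing parse collapse and improving ranking performance.

AI Chinese abstract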

基于大型多模态模型（LMM）的生成式列表排序旨在单次前向传播中捕获全局列表上下文，但其效果在长上下文多模态场景中会退化。我们识别出一种重复出现的失败模式——解析崩溃，即自回归解码器生成流畅但不完整的排序，通过静默省略候选并提前终止。这种失败源于有限的上下文利用，而非简单的格式错误，使得提示工程和约束解码不足以解决。我们提出PRISMR（参数化表示内化用于语义多模态排序）框架，用参数化结构条件替代临时的上下文内列表处理。PRISMR使用轻量级超网络并行编码多模态候选并生成项目特定的LoRA权重，这些权重被合成为LMM的实例特定适配器。这种范式在保留基础模型的同时，实现了更鲁棒的列表结构内化。我们进一步引入了一个大规模多模态评论排序基准用于评估。实验表明，PRISMR显著减少了解析崩溃，提高了列表排序性能，并有效跨领域和指令微调骨干网络迁移。

English abstract

Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effectiveness degrades in long-context multimodal scenarios. We identify a recurring failure mode, parse collapse, where the autoregressive decoder produces fluent yet incomplete rankings by silently omitting candidates and terminating early. This failure stems from limited context utilization rather than simple formatting mistakes, making prompt engineering and constrained decoding insufficient. We propose PRISMR (Parameterized Representation Internalization for Semantic Multimodal Ranking), a framework that replaces transient in-context list processing with parametric structural conditioning. PRISMR uses a lightweight hypernetwork to encode multimodal candidates in parallel and generate item-specific LoRA weights, which are synthesized into an instance-specific adapter for an LMM. This paradigm enables more robust internalization of list structure while preserving the base model. We further introduce a large-scale multimodal review-ranking benchmark for evaluation. Experiments demonstrate that PRISMR substantially reduces parse collapse, improves listwise ranking performance, and transfers effectively across domains and instruction-tuned backbones.

2606.12941 2026-06-12 cs.CL New submission

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

当上下文分片到达时的多轮推理：可扩展的分片与记忆增强强化学习

Shu Tong Luo, Wenqin Liu, Rui Liu, Mingming Gong, Jiaxian Guo

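Affiliations: The University of Melbourne; Google Research Australia

AI summary: When information arrives in fragments across conversation turns, LLM accuracy drops by up to 65%. The paper mitigates this by training models to maintain a compact rolling memory instead of a growing history, and introduces a low-cost sharding pipeline that converts single-turn QA into multi-turn fragmented episodes. The trained memory-augmented policy significantly improves multi-turn accuracy and generalizes zero-shot to harder tasks.

AI Chinese abstract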

当用户在多个对话轮次中透露任务关键信息时，尽管上下文完全可用，LLM的准确率下降高达65%。我们表明，这种“迷失在对话中”的退化可以通过训练模型维护紧凑的滚动记忆而不是关注增长的历史来大幅缓解。为了使这种训练可扩展，我们引入了一个低成本的分片流水线，将单轮QA数据集转换为多轮碎片化信息情节，消除了数小时手动标注的需求。仅在分片的GSM8K上训练，我们的记忆增强策略显著提高了多轮准确率，并零样本泛化到更难的数学和域外长上下文QA。此外，即使在测试时给定完整历史，记忆训练模型也优于全历史基线，这表明学习压缩比单独的全上下文暴露能诱导更稳健的增量推理。

English abstract

When a user reveals task-critical information across several conversation turns, LLM accuracy drops by up to 65% despite full context availability. We show that this Lost in Conversation degradation can be substantially mitigated by training models to maintain a compact rolling memory instead of attending to a growing history. To make such training scalable, we introduce a low-cost sharding pipeline that converts single-turn QA datasets into multi-turn fragmented-information episodes, eliminating the need for hours of manual annotation. Training only on sharded GSM8K, our memory-augmented policy significantly improves multi-turn accuracy and generalises zero-shot to harder math and out-of-domain long-context QA. Moreover, memory-trained models outperform full-history baselines even when given the full history at test time, suggesting that learning to compress induces more robust incremental reasoning than full-context exposure alone.
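The idea of turning a single-turn QA item into a multi-turn fragmented episode can be sketched as follows. This is only a guess at the simplest possible form, assuming one shard per sentence; the abstract does not describe how the actual pipeline splits or orders information, and `shard_question` and `to_episode` are hypothetical names.

```python
import re


def shard_question(question):
    """Split a question into sentence-level shards (a naive, assumed strategy).

    Splitting only on whitespace that follows '.', '?' or '!' keeps decimals
    such as 2.5 intact.
    """
    parts = re.split(r"(?<=[.?!])\s+", question)
    return [p.strip() for p in parts if p.strip()]


def to_episode(question, answer):
    """Build a multi-turn episode that reveals one shard per user turn."""
    shards = shard_question(question)
    turns = [{"turn": i + 1, "user": shard} for i, shard in enumerate(shards)]
    return {"turns": turns, "answer": answer}


episode = to_episode(
    "Tom has 3 apples. He buys 2.5 kg more. How many apples does he have?",
    "placeholder answer",
)
for turn in episode["turns"]:
    print(turn["turn"], turn["user"])
```

A memory-augmented policy would then update a compact summary after each turn instead of re-reading all earlier turns; that part is not sketched here.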

2606.12940 2026-06-12 cs.SD cs.LG New submission

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

自引导：通过解码器流形对齐增强神经编解码器

Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang

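AI summary: Proposes self-guidance, which aligns the decoder's internal manifolds with a lightweight feature-mapping loss. It improves the reconstruction quality of VQ-VAE neural speech codecs without changing inference, achieves state-of-the-art low-bitrate performance, and supports a 4x codebook reduction.

Comments: 20 pages, 9 figures, accepted to ICML 2026, demo website available at this https URL

AI Chinese abstract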

基于向量量化VAE（VQ-VAE）的神经语音编解码器是语音大语言模型的核心音频分词器，但其重建保真度受限于量化误差。常见的修复方法是修改量化器或增加模型容量，但这会复杂化下游语言建模。我们的核心思想是，在处理量化标记及其原始连续嵌入时，使用轻量级特征映射损失对齐解码器的内部特征流形。这需要最小的训练开销，且无需改变推理过程。应用于XCodec2时，自引导改善了所有重建指标，实现了低比特率下的最先进性能。值得注意的是，它实现了4倍码本缩减而无保真度损失，下游TTS实验表明，通过简化标记建模空间，这显著改善了基于LLM的合成。多项统计观察和可视化证实了解码器中内部流形对齐的增强。大量实验证实了其在各种归纳偏置下的通用性。因此，自引导建立了一种高效、广泛适用的高保真神经音频编码方法。

English abstract

Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

2606.12939 2026-06-12 cs.CV New submission

MAMVI: 3D Test-Time Adaptation via Masked Multi-View Point Clouds

MAMVI：通过掩蔽多视角点云实现3D测试时自适应

Inseok Kong, Geunyoung Jung, Jiyoung Jung

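Affiliations: Department of Geo Informatics, University of Seoul; Department of Artificial Intelligence, University of Seoul

AI summary: To counter the performance drop of 3D point cloud models under distribution shift, proposes MAMVI, which replaces sequential optimization with a unified single-step adaptation and combines a hybrid masking strategy with multi-view loss aggregation for fast, accurate test-time adaptation.

Comments: Accepted by ICPR 2026

AI Chinese abstract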

3D点云模型在传感器噪声、遮挡和环境变化引起的分布偏移下会出现显著的性能下降。测试时自适应（TTA）已成为在推理过程中缓解此问题的实用范式。最近，利用多视角增强在提升3D TTA性能方面显示出潜力。然而，现有的多视角方法通常受限于将每个视角独立处理的顺序优化。这种顺序优化由于重复的优化步骤导致显著的推理延迟，使得实时自适应不切实际。为了解决这个问题，我们提出了掩蔽多视角测试时自适应（MAMVI），它用统一的单步自适应替代顺序优化。具体来说，MAMVI利用一种混合掩蔽策略，结合固定比例以保持稳定性，以及Beta分布采样以增加多样性。通过聚合多个视角的损失，MAMVI基于多视角共识通过单次反向传播执行自适应。此外，使用基于置信度的自适应学习率来动态调整每个样本的自适应强度。在ModelNet-40C、ShapeNet-C和ScanObjectNN-C上的大量实验表明，MAMVI在ShapeNet-C和ScanObjectNN-C上达到了最先进的准确率。同时，它在ModelNet-40C上保持竞争力，同时推理速度提高了4.9-8.9倍，使其非常适合实时应用。我们的代码可在以下网址获取：this https URL

English abstract

3D point cloud models suffer significant performance degradation under distribution shifts caused by sensor noise, occlusions, and environmental changes. Test-time adaptation (TTA) has emerged as a practical paradigm for mitigating this issue during inference. Recently, leveraging multi-view augmentation has shown promise in improving 3D TTA performance. However, existing multi-view approaches are often constrained by sequential optimization that treats each view independently. This sequential optimization leads to substantial inference latency due to repetitive optimization steps, making real-time adaptation impractical. To address this, we propose Masked Multi-View Test-Time Adaptation (MAMVI), which replaces sequential optimization with a unified single-step adaptation. Specifically, MAMVI utilizes a hybrid masking strategy that combines fixed ratios for stability with Beta-distributed sampling for diversity. By aggregating losses across multiple views, MAMVI performs adaptation through a single backward pass based on multi-view consensus. Additionally, a confidence-based adaptive learning rate is used to dynamically adjust the adaptation intensity for each sample. Extensive experiments on ModelNet-40C, ShapeNet-C, and ScanObjectNN-C demonstrate that MAMVI achieves state-of-the-art accuracy on ShapeNet-C and ScanObjectNN-C. Moreover, it remains competitive on ModelNet-40C while delivering 4.9-8.9 times faster inference, making it highly suitable for real-time applications. Our code is available at this https URL
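The hybrid masking strategy (fixed ratios for stability, Beta-distributed ratios for diversity) might look roughly like the sketch below. The specific fixed ratios, the Beta parameters, and the function names are assumptions; the abstract does not give them, and the adaptation step itself (a single backward pass over the aggregated multi-view loss) is not implemented here.

```python
import random


def sample_mask_ratios(num_views, fixed_ratios=(0.3, 0.6), alpha=2.0, beta=5.0, rng=None):
    """Return one mask ratio per view.

    The first views use fixed ratios; the remaining views draw their ratio
    from a Beta(alpha, beta) distribution. All defaults are illustrative.
    """
    rng = rng or random.Random()
    ratios = list(fixed_ratios[:num_views])
    while len(ratios) < num_views:
        ratios.append(rng.betavariate(alpha, beta))
    return ratios


def mask_points(points, ratio, rng):
    """Randomly drop a `ratio` fraction of points, keeping at least one."""
    keep = max(1, round(len(points) * (1.0 - ratio)))
    return rng.sample(points, keep)


rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(1024)]

ratios = sample_mask_ratios(num_views=4, rng=rng)
views = [mask_points(cloud, r, rng) for r in ratios]
for r, v in zip(ratios, views):
    print(f"mask ratio {r:.2f} -> {len(v)} points kept")
```

In a full pipeline, each masked view would be passed through the model, the per-view losses would be summed or averaged, and one optimizer step would be taken on that aggregate.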

2606.12936 2026-06-12 cs.RO cs.AI New submission

An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics

面向湿实验室机器人的具身仿真平台、基准测试及数据高效增强框架

Zhe Liu, Huanbo Jin, Zhaohui Du, Zhe Wang, He Xu, Peijia Li, Jiaming Gu, Quan Lu, Qi Wang, Bin Ji, Ting Xiao

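AI summary: Presents Pipette, a platform with editable assets, a simulation-based data augmentation pipeline, and an 11-task benchmark; with 30 demonstrations per task, simulation augmentation raises SmolVLA's success rate from 44.1% to 74.7%.

Comments: 25 pages, 17 figures

AI Chinese abstract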

湿实验室机器人可以提高生物医学实验的可重复性、通量和安全性，但扩展其学习需要可定制的模拟器以进行安全和可重复的任务生成、开放的可编辑实验室资产，以及将有限演示转化为可用训练数据的高效管道。我们提出了Pipette，一个用于湿实验室机器人学习的具身仿真平台、基准测试和数据高效增强框架。Pipette发布了超过43个开源且可重新编辑的湿实验室资产，以及一个可扩展的资产构建管道。Pipette的一个关键组件是其基于仿真的数据增强管道，在仿真中重放人类演示，应用光照、相机、速度和动作扰动，并通过自动任务成功检查过滤生成的片段，从有限的手动演示中快速扩展可用的训练数据。我们进一步引入了一个包含11个任务的湿实验室具身基准测试，涵盖样本处理、培养器具操作、设备操作和精确放置。每个任务仅需30次演示，ACT实现了65.5%的平均成功率，而仿真增强将SmolVLA从44.1%提升至74.7%，将π0从40.4%提升至46.5%，验证了Pipette在数据高效的VLA训练和评估中的有效性。Pipette还支持自然语言驱动的场景构建和任务注册，降低了非专家用户定义新湿实验室机器人任务的门槛。

English abstract

Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires customizable simulators for safe and reproducible task generation, open editable laboratory assets, and efficient pipelines that turn limited demonstrations into usable training data. We present Pipette, an embodied simulation platform, benchmark, and data-efficient augmentation framework for wet-lab robot learning. Pipette releases over 43 open-source and re-editable wet-lab assets, together with an extensible asset-building pipeline. A key component of Pipette is its simulation-based data augmentation pipeline, which replays human demonstrations in simulation, applies lighting, camera, speed, and action perturbations, and filters generated episodes with automatic task success checks, rapidly expanding usable training data from limited manual demonstrations. We further introduce an 11-task wet-lab embodied benchmark covering sample handling, culture-ware manipulation, device operation, and precision placement. With only 30 demonstrations per task, ACT achieves a 65.5% average success rate, while simulation augmentation improves SmolVLA from 44.1% to 74.7% and $\pi_0$ from 40.4% to 46.5%, validating the effectiveness of Pipette for data-efficient VLA training and evaluation. Pipette also supports natural-language-driven scene construction and task registration, lowering the barrier for non-expert users to define new wet-lab robotic tasks.

2606.12935 2026-06-12 cs.AI New submission

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

MARS: 用于并行LLM测试时扩展的边际对抗风险控制停止策略

Wenbo Chen, Puheng Li, Mengyang Liu, Weijie Su, Tianpei Xie

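Affiliations: Amazon; Stanford University; University of Pennsylvania

AI summary: Proposes the MARS stopping rule, which monitors the aggregate vote at intermediate checkpoints and uses an adversarial bound to estimate future vote movement, saving 25-47% of self-consistency tokens while preserving accuracy.

AI Chinese abstract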

并行测试时扩展采样多个推理轨迹并对答案进行多数投票，提高了LLM的准确性，但需要轨迹运行至完成，导致大量计算开销。我们观察到，在中间检查点探测部分轨迹可以在不中断生成的情况下提取当前答案，揭示出不断演变的聚合投票。基于这一观察，我们引入了MARS，一种边际对抗性停止规则，它估计哪些活跃轨迹可能改变其答案，并在未来投票移动的保守边界下，一旦领先者保持安全就停止。该规则分离了两种不确定性来源。它学习轨迹级别的切换概率，这些概率决定了当前边际有多少可能被保留，同时通过从预热轨迹中校准的对抗性边界处理切换轨迹落在哪里的更难问题。在真实切换概率下，MARS以高概率保证提前停止的答案与完整预算投票一致。在实践中，一个五特征逻辑模型紧密匹配了神谕切换行为。在三个推理模型和三个竞赛数学基准上，MARS节省了25-47%的自一致性token，并在DeepConf Online（一个已经过滤和截断弱轨迹的强置信加权基线）之上额外节省14-29%，同时匹配相应完整预算基线的准确率。

English abstract

Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote. Based on this observation, we introduce MARS, a margin-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement. The rule separates two sources of uncertainty. It learns the trace-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces. With true switch probabilities, MARS guarantees with high probability that the early-stopped answer matches the full-budget vote. In practice, a five-feature logistic model closely matches oracle switching behavior. Across three reasoning models and three competition-math benchmarks, MARS saves 25-47% of self-consistency tokens and 14-29% on top of DeepConf Online, a strong confidence-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full-budget baselines.
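The flavor of a margin-based early-stopping check can be conveyed with a deliberately simplified sketch. It is not the MARS rule itself: here the number of traces that may still switch is bounded by the sum of the per-trace switch probabilities plus a hypothetical `slack` term, and each switch is assumed to hurt the leader as much as possible (one vote lost by the leader, one gained by a rival). The paper instead calibrates its adversarial bound from warmup traces and learns the switch probabilities with a logistic model.

```python
import math
from collections import Counter


def should_stop(current_answers, switch_probs, slack=1.0):
    """Stop early if the leader survives a worst-case reallocation of votes.

    current_answers: the answer currently probed from each active trace.
    switch_probs:    estimated probability that each trace changes its answer.
    slack:           extra safety margin on the switch count (an assumption).
    """
    votes = Counter(current_answers).most_common()
    leader_votes = votes[0][1]
    runner_up_votes = votes[1][1] if len(votes) > 1 else 0
    margin = leader_votes - runner_up_votes

    # Conservative cap on how many traces might still change their answer.
    max_switches = math.ceil(sum(switch_probs) + slack)

    # Worst case: every switch moves a vote from the leader to one rival,
    # shrinking the margin by 2 per switch.
    return margin > 2 * max_switches


# Nine traces agree and are unlikely to switch: the leader is safe.
safe = should_stop(["42"] * 9 + ["7"], [0.05] * 10)

# A one-vote lead with volatile traces: keep generating.
unsafe = should_stop(["42"] * 5 + ["7"] * 4 + ["13"], [0.4] * 10)

print(safe, unsafe)
```

The factor of 2 is what makes the check adversarial: it does not rely on knowing where switching traces land, only on how many of them there could be.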

2606.12930 2026-06-12 cs.LG New submission

Is Spurious Correlation Removal Always Learnable?

虚假相关性去除是否总是可学习的？

Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang, Xiaokang Zhang, Ruifan Zhang

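AI summary: Studies the computational barriers of invariant learning when the invariant structure is statistically identifiable. It shows that, conditional on the hardness of a sparse recovery primitive, there exist samplable multi-environment instances with a one-dimensional invariant subspace on which no polynomial-time algorithm reaches constant accuracy, and quantifies how environment diversity affects identifiability and risk.

Comments: Poster paper at ICML 2026

AI Chinese abstract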

即使不变结构在统计上是可识别的，不变学习也可能失败。我们展示了一个条件计算障碍：在由平均情况稀疏恢复归约驱动的黑盒可采样监督稀疏恢复原语下，存在具有一维预测不变子空间（$k=1$）的\emph{可采样}多环境实例，这些实例可以通过穷举搜索用多项式样本学习，而任何多项式时间常数精度恢复算法都会与该原语矛盾。我们进一步通过分离参数$\gamma$量化环境多样性，该参数控制可识别性和不变性目标的曲率。在充分多样性和局部高斯正则性下，极小极大风险为$\mathbb{E}[\dist(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$，在标签诱导的偏移下，在$n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$处发生相变，估计误差缩放比例与$1/\gamma^2$成正比。合成和真实数据集说明了预测的差距和转变，并激发了简单的多样性诊断。

English abstract

Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\operatorname{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.

2606.12925 2026-06-12 cs.CV cs.LG New submission

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

基于贝叶斯条件先验的多标签测试时自适应

Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Qing Gu

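AI summary: Proposes Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency by estimating anchor-conditioned priors online, improving the robustness of frozen vision-language models to distribution shift in multi-label recognition.

Comments: Accepted by ICML 2026

AI Chinese abstract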

多标签识别中，冻结的视觉语言模型（VLM）在分布偏移下表现脆弱：标准零样本推理独立评分每个标签，忽略共现结构，产生不连贯的标签集，其中主导概念抑制较弱但兼容的标签。我们引入贝叶斯条件先验（BCP）估计，一种无梯度的测试时自适应方法，在不调整主干网络的情况下注入标签依赖性。BCP将零样本logits视为在固定图像-文本似然下的边缘后验代理，并将偏移引起的误差主要归因于不匹配的标签先验。对于每个测试图像，它选择一个高置信度的锚定标签，并应用锚定条件的贝叶斯精炼。该更新在logit空间中是闭式的，并具有点互信息（PMI）解释，明确促进兼容标签并抑制不兼容标签。BCP通过从无标签测试流中在线估计锚定条件先验（使用轻量级二阶共现统计）来运行，无需目标标注，且仅增加单个前向传递之外的微不足道的开销。在标准多标签基准和多个CLIP主干网络上，BCP持续优于强TTA基线，例如将RN50的平均mAP从57.31提升至69.22，ViT-B/16从62.61提升至71.79。

English abstract

Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
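One way to read the PMI interpretation is as an additive correction in logit space: for anchor label a and any other label j, add log P(j | a) - log P(j), with both probabilities estimated from running co-occurrence counts over the test stream. The sketch below follows that reading. The class and function names, the additive smoothing, the `strength` factor, and the choice of the top-scoring label as anchor are all assumptions rather than details confirmed by the abstract.

```python
import math


class CooccurrenceStats:
    """Running first- and second-order label counts from an unlabeled stream."""

    def __init__(self, num_labels, smoothing=1.0):
        self.n = 0
        self.single = [0.0] * num_labels
        self.pair = [[0.0] * num_labels for _ in range(num_labels)]
        self.smoothing = smoothing

    def update(self, predicted_labels):
        """Record one test image's predicted label set."""
        self.n += 1
        for i in predicted_labels:
            self.single[i] += 1
            for j in predicted_labels:
                if i != j:
                    self.pair[i][j] += 1

    def pmi(self, j, anchor):
        """Smoothed pointwise mutual information: log P(j | anchor) - log P(j)."""
        s = self.smoothing
        p_j = (self.single[j] + s) / (self.n + 2 * s)
        p_j_given_a = (self.pair[anchor][j] + s) / (self.single[anchor] + 2 * s)
        return math.log(p_j_given_a / p_j)


def refine_logits(logits, stats, strength=1.0):
    """Shift every non-anchor logit by its PMI with the anchor label."""
    anchor = max(range(len(logits)), key=lambda i: logits[i])
    return [
        l if j == anchor else l + strength * stats.pmi(j, anchor)
        for j, l in enumerate(logits)
    ]


# Toy stream: labels 0 and 1 co-occur; label 2 appears only on its own.
stats = CooccurrenceStats(num_labels=3)
for _ in range(8):
    stats.update([0, 1])
    stats.update([2])

refined = refine_logits([3.0, 0.0, 0.0], stats)
print(refined)
```

With label 0 as anchor, the compatible label 1 is pushed up and the incompatible label 2 is pushed down, and no gradient step or backbone update is involved.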

2606.12924 2026-06-12 cs.AI New submission

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

迭代优化搜索：面向电子商务中智能搜索架构评估的双智能体仿真框架

Jetlir Duraj, Jayanth Yetukuri, Shuang Zhou, Dhruv Varma, Rui Kong, Ishita Khan, Qunzhi Zhou

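Affiliations: eBay Inc.

AI summary: Proposes a modular two-agent simulation framework that holds the buyer agent fixed to compare responder designs. It finds that rolling-window memory beats intent-extraction memory on both quality and speed, and that fixes guided by failure analysis cut failure and near-failure rates by 62%.

AI Chinese abstract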

我们提出了一个模块化的双智能体仿真框架，用于评估对话式购物助手架构。一个独立的买家智能体，配置了角色、任务和耐心水平，与一个可互换的应答器配对，该应答器与真实的电子商务搜索API集成。在实验中保持买家不变，可以在相同场景下对照比较应答器设计。利用跨越14个角色桶的2011次对话，我们建立了四个实证发现。首先，滚动窗口记忆在所有质量指标上优于意图提取记忆，同时每个查询速度快35%。其次，通过对应答器版本的系统性失败分析，实现了有针对性的修复，将整个数据集上的失败和接近失败率降低了62%，展示了快速的证据驱动迭代。第三，将应答器的LLM骨干从Gemini~2.5切换到Llama~3.3~70B，尽管架构相同，但性能下降了0.16-0.45点。最后，我们记录了前沿LLM评判者之间系统性的哲学分歧：Gemini奖励过程正确性，而Claude要求具体结果，尽管使用了相同的评估提示。

English abstract

We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer agent, configured with personas, missions, and patience levels, is paired with an interchangeable responder that integrates with a real e-commerce search API. Holding the buyer constant across experiments enables controlled comparison of responder designs on identical scenarios. Using 2011 conversations across 14 persona buckets, we establish four empirical findings. First, rolling-window memory outperforms intent-extraction memory on all quality metrics while being 35% faster per query. Second, illustrating rapid evidence-driven iteration, a systematic failure analysis of a responder version enables targeted fixes that reduce failure and near-failure rates by 62% across the full dataset. Third, swapping the responder LLM backbone from Gemini 2.5 to Llama 3.3 70B costs 0.16-0.45 points despite identical architecture. Finally, we document systematic philosophical disagreement between frontier LLM judges: Gemini rewards process correctness while Claude demands concrete outcomes, despite using the same evaluation prompt.

2606.12923 2026-06-12 cs.LG cs.AI cs.CL New submission

Order Is Not Control

秩序并非控制

Gareth Seneque, Lap-Hang Ho, Nafise Erfanian Saeedi, Jeffrey Molendijk, Tim Elson

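Affiliations: Australian Broadcasting Corporation

AI summary: Argues that order is not control, proposes a receiver-gated response law, and identifies it across biological, LLM, adapter, and stochastic-operator panels, showing that control is local and measurable.

Comments: 52 pages, 7 figures

AI Chinese abstract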

AI对齐、可解释性、引导和神经扰动研究识别出诱导秩序的对象。我们认为秩序并非控制。控制需要接收器门控的响应定律：一个分母索引算子，将物质状态、动作/驱动、浴和接收器状态映射到响应位移、汇、努力和盆地投影。我们在生物、大语言模型、适配器和随机算子面板中识别出该定律。这些定律是局部的：干预可以被接纳、饱和、变号、泄漏或过驱动，取决于介质、浴、接收器状态、动作端口和比较器。当有限努力在相同分母下移动目标或结果读出类别，而损伤、无效/规避、无效格式、过驱动和不必要努力保持有界时，控制被分配。小鼠ALM、秀丽隐杆线虫和斑马鱼面板提供了物理响应算子证据，同时排除了坐标同一性和控制器结论。大语言模型面板展示了生成输出响应定律：在四种物质条件下，响应向量的分量符号预测准确率为72.8-73.7%，非零分量上提升至84.3-84.8%；留出观察者以93.6%和91.7%的准确率预测系统效应和目标/预言家族。宪法条件适配器将易感性重塑为制备介质，随机算子面板将测量机会与可部署行动策略分离。这给出了介观控制层面的驱动-耗散响应系统描述：驱动通过制备介质、浴和接收器作用，产生接纳运动、阻抗、汇或过驱动。证据支持局部接纳控制和可测量的随机响应算子，同时将可部署的预生成控制、隐藏/logit因果充分性、生物到LLM坐标同一性以及字面热力学量排除在范围之外。

English abstract

AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not control. Control requires a receiver-gated response law: a denominator-indexed operator mapping material state, action/drive, bath, and receiver state to response displacement, sinks, effort, and basin projection. We identify it across biological, LLM, adapter, and stochastic-operator panels. The laws are local: an intervention can be admitted, saturated, sign-changing, leaky, or overdriven depending on medium, bath, receiver state, action port, and comparator. Control is assigned when finite effort moves a target or outcome-readout class under the same denominator while damage, null/evasive, invalid format, overdrive, and unnecessary effort stay bounded. Mouse ALM, C. elegans, and zebrafish panels provide physical response-operator evidence while excluding coordinate identity and controller conclusions. LLM panels show generated-output response laws: across four material conditions, response vectors are predictable at 72.8-73.7% component-sign accuracy, rising to 84.3-84.8% on nonzero components; held-out observers predict system-effect and target/oracle families at 93.6% and 91.7% accuracy. Constitution-conditioned adapters reshape susceptibility as prepared media, and stochastic-operator panels separate measured opportunity from deployable action policies. This gives a driven-dissipative response-system account at the mesoscopic control level: drives act through prepared media, baths, and receivers, producing admitted movement, impedance, sinks, or overdrive. The evidence supports local admitted control and measurable stochastic response operators, while leaving deployable pre-generation control, hidden/logit causal sufficiency, biological-to-LLM coordinate identity, and literal thermodynamic quantities outside scope.

2606.12922 2026-06-12 cs.CL cs.CY New submission

Polar: A Benchmark for Evaluating Political Bias in LLMs

Polar: 评估大语言模型中政治偏见的基准

Sangho Kim, Heejin Kim, Yoonhee Park, Hyunggeun Jeon, Jaejin Lee

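Affiliations: Graduate School of Data Science, Seoul National University; Dept. of Computer Science and Engineering, Seoul National University

AI summary: Introduces the Polar benchmark, which measures political bias in LLMs through option-level likelihoods across U.S. and South Korean political contexts, and finds that bias varies with context, issue, model group, and language.

Comments: Submitted to ARR 2026 May cycle

AI Chinese abstract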

大语言模型（LLM）中的政治偏见日益显著，但在不同政治和语言背景下难以可重复地测量。我们引入了Polar，一个包含4,026个实例的多项选择基准，通过选项级似然度而非基于提示的生成来测量政治偏见。Polar覆盖了两个意识形态轴和来自Manifesto Project的八个议题类别，并在美国和韩国政治语境中并行评估模型。在38个LLM中，测量的偏见随政治语境、议题类别、模型组和呈现语言系统性地变化。所有模型在美国政治内容上倾向于左翼进步派，但在韩国内容上表现出更居中且混合的模式。翻译实验进一步表明，仅呈现语言就能改变测量的偏见。这些发现凸显了对LLM中政治偏见进行多语言和跨语境评估的必要性。

English abstract

Political bias in large language models (LLMs) is increasingly significant, but difficult to measure reproducibly across political and linguistic contexts. We introduce Polar, a 4,026-instance multiple-choice benchmark that measures political bias through option-level likelihoods rather than prompt-based generation. Polar covers two ideological axes and eight issue categories derived from the Manifesto Project, and evaluates models in parallel across U.S. and South Korean political contexts. Across 38 LLMs, measured bias varies systematically with political context, issue category, model group, and presentation language. All models lean left-progressive on U.S. political content, but show more centered and mixed patterns on South Korean content. Translation experiments further show that presentation language alone can shift measured bias. These findings highlight the need for multilingual and cross-contextual evaluation of political bias in LLMs.

2606.12921 2026-06-12 cs.LG cs.AI New submission

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

LoRA-Muon：低秩流形上的谱最速下降

Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

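Affiliations: Ateneo de Manila University; EleutherAI; NaXys, UNamur

AI summary: Proposes the LoRA-Muon optimizer, which applies Muon's spectral steepest-descent rule to low-rank finetuning. It addresses LoRA's sensitivity to initialization and the poor transfer of optimal learning rates across ranks; on TinyShakespeare, a rank-32 run reaches lower validation loss than the dense baseline.

Comments: 20 pages, 4 figures

AI Chinese abstract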

低秩适应（LoRA）显著降低了微调深度学习模型的计算和内存成本，但通常比稠密训练更难调优：当使用因子级优化器（如AdamW）时，它对初始化选择敏感，其最优学习率在秩之间迁移性差，且常常无法超越稠密基线。我们通过将Muon优化器的谱最速下降规则应用于低秩设置，推导出LoRA-Muon。结合我们的分裂权重衰减规则，我们的主要主张是LoRA-Muon是全秩Muon和Shampoo族优化器的一个良好的低秩代理。其最优学习率在秩、宽度、深度和因子重缩放之间均可迁移。在我们计算匹配的TinyShakespeare研究中，秩2代理恢复了稠密最佳测试学习率，秩32的LoRA-Muon运行在种子平均扫描中达到了比稠密基线更低的平均验证损失。我们进一步表明，Spectron优化器依赖于任意的因子缩放，因此在从严重不平衡的因子开始微调时可能不太适用，并且LoRA-RITE的简化QR坐标核心实现了相同的谱更新。LoRA-Muon无需QR分解即可计算该更新，并避免存储二阶矩，使其更易于加速器使用且内存效率更高。

English abstract

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning deep learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-2 proxy recovers the dense best tested learning rate, and a rank-32 LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
