arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.08875 2026-06-09 cs.AI New submission

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

Yutong Song, Jiang Wu, Pengfei Zhang, Wenjun Huang, Honghui Xu, Nikil Dutt, Amir M. Rahmani

Affiliations: University of California, Irvine; Independent Researcher; Kennesaw State University

AI summary: Proposes the T²-GRPO framework, which decouples caregiver reinforcement learning into two normalized reward horizons and enforces safety through a binary hard veto. It extracts dense turn-level rewards from environment state transitions and combines them with trajectory-level evaluations to handle immediate patient feedback, long-term care outcomes, and safety constraints.

Abstract

Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is especially difficult: trajectory-level rewards are too sparse for turn-level credit assignment, while external LLM-based evaluators are costly and can misread fragmented or indirect patient responses. To address this issue, we propose Turn-Trajectory Group Relative Policy Optimization ($T^{2}$-GRPO), a framework that decouples caregiver RL into two normalized reward horizons and enforces safety through a binary hard veto. $T^{2}$-GRPO derives dense turn-level rewards directly from environment state transitions, measuring changes in patient distress and resistance from a frozen dementia patient simulator. These environment-grounded rewards are combined with trajectory-level evaluations through independent centered-rank normalization, which preserves heterogeneous reward signals and mitigates reward collapse. Extensive experiments on dementia caregiving show that $T^{2}$-GRPO outperforms competitive baselines, a substantial improvement for emotionally sensitive caregiver scenarios that require handling immediate patient feedback, long-term care outcomes, and safety constraints.
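
As a rough illustration of the reward combination described in this abstract, the Python sketch below normalizes turn-level and trajectory-level rewards independently with centered ranks inside one rollout group, then applies a binary veto. The function names, the [-0.5, 0.5] rank scaling, the weighted sum, the state keys `distress` and `resistance`, and the veto value are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Sequence


def centered_ranks(values: Sequence[float]) -> List[float]:
    """Map values to ranks rescaled to [-0.5, 0.5]; ties share their mean rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return [r / (n - 1) - 0.5 for r in ranks]


def turn_reward(prev_state: dict, next_state: dict) -> float:
    """Hypothetical turn reward: drop in simulated distress plus drop in resistance."""
    return ((prev_state["distress"] - next_state["distress"])
            + (prev_state["resistance"] - next_state["resistance"]))


def group_advantages(turn_totals, traj_scores, vetoed,
                     w_turn=0.5, w_traj=0.5, veto_value=-1.0):
    """Normalize each horizon on its own, combine, then apply the hard veto."""
    turn_adv = centered_ranks(turn_totals)   # turn-level horizon
    traj_adv = centered_ranks(traj_scores)   # trajectory-level horizon
    combined = [w_turn * t + w_traj * j for t, j in zip(turn_adv, traj_adv)]
    # Assumed veto rule: a flagged rollout gets a fixed penalty regardless of rank.
    return [veto_value if v else c for c, v in zip(combined, vetoed)]
```

Because each horizon is ranked separately, a reward source with a larger numeric scale cannot drown out the other. For three rollouts with turn totals `[3.0, 1.0, 2.0]`, trajectory scores `[0.2, 0.9, 0.5]`, and the third rollout vetoed, `group_advantages` returns `[0.0, 0.0, -1.0]`.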

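2606.08867 2026-06-09 cs.CL New submission

Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework

Aman Gupta, Kevin Rossell, Edesio Alcobaça, Jose Chrystian Lima Pacheco, Carolina Baptista de Lima, Shao Tang, Luiz Paulo Rabachini, Luis Moneda, Herbert Fei, Daniel Silva, Rohan Ramanath

Affiliations: Nubank

AI summary: Presents a unified framework that bridges offline development and online impact for customer support AI agents at Nubank's 100M+ user scale through evaluation-driven development, context engineering, human-in-the-loop prompt iteration, and LLM-judge consistency optimization. It reports results from five production deployments, including a strong correlation between offline metrics and online outcomes.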

Abstract

The rapid rise in LLM capabilities has made AI agents increasingly viable across a broad range of tasks. Among the most promising applications is building production-ready customer-facing agents, a challenge that demands coordinated excellence in evaluation methodology, context engineering, training, and online measurement. Yet these critical pillars are typically developed in isolation, creating blind spots that only surface after deployment. In this paper, we present a unified framework that bridges offline development with online impact for customer support AI agents at Nubank, a company with 100M+ users. Our approach integrates several key components: (1) structured context engineering tailored to customer support agents, (2) systematic human-in-the-loop prompt iteration, (3) rigorous LLM judge evaluation with measured inter-rater agreement and GEPA optimization for consistency, and (4) ideation-to-production validation. A central insight is that evaluation-pipeline quality directly determines iteration velocity. We present results from five production deployments spanning distinct domains: card delivery, debt management, credit-limit support, card management, and product explanation. These deployments deliver consistent customer-satisfaction gains while substantially accelerating iteration. In our card-delivery deployment, large-scale A/B testing yields a 37 percentage-point improvement in AI transactional Net Promoter Score and a 29 percentage-point gain in self-service rate over prior agent variants, alongside a strong correlation between offline simulation metrics and online outcomes, demonstrating that eval-driven development reliably predicts production impact. On most use cases, AI satisfaction reaches within a few percentage points of expert human agents.
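
The abstract mentions measured inter-rater agreement for the LLM judge without naming the statistic. As one possible illustration, the sketch below computes Cohen's kappa between a human rater and an LLM judge over categorical labels. The choice of kappa, the `pass`/`fail` labels, and the function name are assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for six conversations.
human = ["pass", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)
```

Here the raters agree on four of six items (observed agreement 2/3) while chance agreement is 1/2, so kappa is 1/3. Tracking a chance-corrected figure like this, rather than raw agreement, is one way to tell whether a judge prompt change actually improved consistency.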

2606.08866 2026-06-09 cs.CV New submission

Generalizing Geometry-Guided Mamba as a Plug-and-Play Context Module for CNN-based Semantic Segmentation

Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang

Affiliations: Tamkang University

AI summary: Uses geometry-guided Mamba (G-Mamba) as a plug-and-play context aggregation module that replaces the context heads of six CNN segmentation networks, yielding consistent mIoU gains on Cityscapes at a modest extra compute cost.

Abstract

CNN-based semantic segmentation networks usually rely on context heads such as ASPP, PPM, or attention modules to enlarge the receptive field. These heads are effective but may introduce heavy computation, memory cost, or boundary leakage. This paper revisits Directional Geometric Mamba (G-Mamba) from DGM-Net and studies it as a plug-and-play context aggregation module rather than a complete new segmentation architecture. The key idea is to inject geometric guidance into the selective scan process, allowing long-range feature propagation to be modulated by boundary and centripetal-flow cues. We replace the original context heads of six representative CNN segmentation models, including DeepLabV3+, DANet, CCNet, PSPNet, PSANet, and OCRNet, while keeping the ResNet-101 backbone unchanged. Results on Cityscapes show consistent mIoU gains with only moderate extra GFLOPs at $1024\times1024$ resolution, suggesting that geometry-guided SSM modules can serve as practical alternatives or enhancements to conventional CNN context heads.

2606.08864 2026-06-09 cs.CV cs.LG New submission

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

Juan Pablo Sotelo, Marina Gardella, Pablo Musé

Affiliations: Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay; Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, Gif-sur-Yvette, 91190 France

AI summary: Proposes inter-channel color correlations as a lightweight forensic cue. Standard RGB inputs are augmented with correlation maps and fed to a fixed CNN backbone trained on a limited compute budget, which separates real from AI-generated images and improves robustness to unseen generators.

Comments
This manuscript has been accepted for publication at the 28th International Conference on Pattern Recognition (ICPR 2026). The final published version will appear in the Springer LNCS proceedings
Abstract

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA
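
To make the idea of a pairwise inter-channel correlation feature concrete, here is a minimal pure-Python sketch that computes the Pearson correlation for every channel pair over one image patch. The patch representation (a list of RGB tuples), the function names, and the use of a single global correlation per patch are assumptions; the paper's correlation maps may be computed differently, for example per local window or in the Lab color space.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def channel_correlations(pixels: List[Tuple[float, float, float]],
                         names: str = "RGB") -> Dict[str, float]:
    """Correlation for every channel pair (R-G, R-B, G-B) over one patch."""
    channels = list(zip(*pixels))  # transpose to one sequence per channel
    return {
        f"{names[i]}-{names[j]}": pearson(channels[i], channels[j])
        for i, j in combinations(range(len(channels)), 2)
    }


# Toy patch in which G = 2R and B = 100 - R.
patch = [(10, 20, 90), (20, 40, 80), (30, 60, 70), (40, 80, 60)]
features = channel_correlations(patch)
```

On this toy patch the R-G correlation is 1 and the R-B and G-B correlations are -1, up to floating-point error. A detector along the lines described above would stack such features, computed per region, alongside the RGB input.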

2606.08860 2026-06-09 cs.CV New submission

Vision-Language Work Zone Intelligence for Safety-Critical Speed Regulation of Mixed-Autonomy Vehicles in Dynamic Environments

Angel Martinez-Sanchez, Kianna Ng, Wesley Maia, Laura Fleig, Maitrayee Keskar, Erika Maquiling, Yash Tandon, Parthib Roy, Mohan Trivedi, Ross Greer

Affiliations: UC Merced; Johns Hopkins; UC San Diego

AI summary: Proposes a real-time onboard perception pipeline that recognizes temporary work-zone speed limits from visual signage by fusing object detection with semantic verification and hysteresis-based state transitions, reaching 96.5% event-level recall and 68.7% event-level precision for work-zone detection on low-cost embedded hardware.

Abstract

Temporary work-zone speed limits are communicated through visually inconsistent signage and are often missing from digital maps, creating safety risks for human drivers and automated vehicle systems. We present a real-time, onboard perception pipeline that detects active work zones, recognizes associated temporary speed limits, and outputs a law-aware work-zone state and speed value suitable for driver alerts or downstream automated control. The system fuses object detections with semantic verification and temporally smoothed, hysteresis-based state transitions to reduce false activations and flicker in dynamic scenes, and runs fully on low-cost embedded hardware. Evaluated manually on an annotated subset of the ROADWork dataset (490 sequences), the system achieves inside-work-zone event-level recall of 96.5% and event-level precision of 68.7%. Speed-limit recognition evaluated on 35 minutes of in-house driving data attains 95.45% precision and 53.85% recall, with no incorrect speed classifications and a single false positive. These results demonstrate a practical, scalable approach for grounding work-zone speed awareness directly in onboard perception rather than maps or infrastructure. We release our source code for the proposed system pipeline on our GitHub repository: https://github.com/Mi3-Lab/workzone
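
The temporally smoothed, hysteresis-based state transition mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes one work-zone confidence score per frame, exponential smoothing, and two thresholds; the function name and every parameter value are hypothetical and are not taken from the released code.

```python
from typing import Iterable, List


def hysteresis_states(scores: Iterable[float],
                      enter: float = 0.7,
                      leave: float = 0.3,
                      alpha: float = 0.5) -> List[bool]:
    """Return a per-frame in-work-zone flag from noisy confidence scores.

    The score is smoothed with an exponential moving average; the state turns
    on only at or above `enter` and off only at or below `leave`.
    """
    if not leave < enter:
        raise ValueError("leave threshold must be below enter threshold")
    active = False
    smoothed = 0.0
    states = []
    for score in scores:
        smoothed = alpha * score + (1.0 - alpha) * smoothed
        if not active and smoothed >= enter:
            active = True
        elif active and smoothed <= leave:
            active = False
        states.append(active)
    return states


# Hypothetical per-frame detector confidences with a brief dip at frame 4.
frames = [0.9, 0.9, 0.9, 0.4, 0.9, 0.1, 0.1, 0.1]
flags = hysteresis_states(frames)
```

With these values the state activates on the third frame, stays active through the single low-confidence frame, and deactivates only after sustained low scores. The gap between the two thresholds is what suppresses flicker.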

2606.08858 2026-06-09 cs.CV cs.AI New submission

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

Hartwig Grabowski

Affiliations: Institute for Machine Learning and Analytics (IMLA), Offenburg University

AI summary: Proposes a handwritten character recognition method in which a deep neural network performs detection and classification as a single task, trained on artificially synthesized data, reaching an 88.28% recognition rate on real exam data.

Journal ref
In: Cavallucci D., Livotov P., Brad S. (eds), Towards AI-Aided Invention and Innovation, IFIP Advances in Information and Communication Technology, vol. 682, Springer Nature Switzerland, 2023, pp. 81-94
Comments
Author's accepted manuscript of a published Springer book chapter. 14 pages, 16 figures
Abstract

The automatic processing of handwritten forms remains a challenging task, in which detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach in which both steps, detection and classification, are executed in one task through a deep neural network. For this purpose, training data is not annotated by hand but manufactured artificially from the underlying forms and already existing datasets. It can be demonstrated that this single-task approach is superior to the state-of-the-art two-task approach. The current study focuses on handwritten Latin letters and employs the EMNIST dataset. However, limitations were identified with this dataset, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

2606.08857 2026-06-09 cs.CL New submission

PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf

Jiarui Liu, Terry Jingchen Zhang, Ryan Faulkner, X. Angelo Huang, Vilém Zouhar, Dominik Glandorf, Isabel Dahlgren, Van Q. Truong, Rishit Dagli, Yuen Chen, Felix Leeb, Punya Syon Pandey, Yves Bicker, Suvajit Majumder, Wenyuan Jiang, Zeju Qiu, Sankalan Pal Chowdhury, Bernhard Schölkopf, Mona Diab, Zhijing Jin

Affiliations: CMU; Jinesis Lab, University of Toronto & Vector Institute; EuroSafeAI; ETHZ; EPFL; UIUC; Max Planck Institute for Intelligent Systems, Tübingen, Germany

AI summary: Presents PaperMentor, a human-centered writing assistant that delivers inline suggestions in Overleaf, drawing on an expert skill library and 12 specialized agents to give actionable feedback. In a user study, 90.6% of its comments were rated actionable.

Comments
Accepted to the ACL 2026 Demo Track
Abstract

Expert writing feedback from experienced researchers is critical for early-career scholars to improve their manuscripts, yet high-quality feedback often remains scarce because reviewing research papers is labor-intensive. Emerging AI-powered writing assistants largely focus on grammar fixes or simulating peer review with final scores, yet they fall short of providing concrete, actionable suggestions that help students improve their papers during drafting. We present PaperMentor, a human-centered writing assistant system that delivers actionable suggestions as Overleaf-native inline comments while leaving the actual writing entirely to human authors. PaperMentor integrates an expert skill library carefully curated from established researchers' writing advice with 12 specialized agents covering different aspects of paper writing, such as formatting compliance, phrasing accuracy, and terminology consistency. In a user study (n=14), 90.6% of the generated comments were rated actionable and 67.5% were rated valid, significantly outperforming a GPT-5.2 baseline without the skill library. We release PaperMentor as open source for public use. Our code is publicly available under the AGPL-3.0 license at https://github.com/jiarui-liu/overleaf

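2606.08855 2026-06-09 cs.AI cs.CV cs.CY New submission

Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

Hartwig Grabowski, Michael Canz

Affiliations: Institute for Machine Learning and Analytics, Hochschule Offenburg; Hochschule Offenburg

AI summary: Addresses the limitations of fully and partially digital e-assessment in summative examinations with a hybrid approach that retains paper-based, problem-oriented tasks and enables semi-automated grading through a structured answer format and handwritten character recognition. Vision-capable large language models and two-pass validation are used to improve validity, fairness, and scalability.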

Comments
15 pages, 6 figures
Abstract

This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.

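2606.08854 2026-06-09 cs.LG cs.AI cs.CL stat.ML New submission

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

Affiliations: Red Hat; IBM

AI summary: Proposes sGPO, which spends a small amount of inference compute to estimate query difficulty up front and allocates the training budget adaptively, cutting training compute by a factor of three while matching or exceeding baseline performance.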

Abstract

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
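
The profiling pass described in this abstract can be sketched as a small planning function: estimate a success rate per query, drop trivial queries, sub-sample unsolved ones, set the group size to roughly the inverse success rate, and order queries from easy to hard. The function name, the clamping of group sizes to `[2, max_group]`, the sub-sampling fraction, and the use of `max_group` for unsolved queries are assumptions for illustration; the abstract gives only the inverse-success-rate rule.

```python
import math
import random
from typing import Dict, List, Tuple


def plan_training(success_counts: Dict[str, int],
                  n_profile: int,
                  max_group: int = 16,
                  keep_unsolved: float = 0.25,
                  seed: int = 0) -> List[Tuple[str, int]]:
    """Turn profiling counts into an easy-to-hard list of (query, group_size)."""
    rng = random.Random(seed)
    plan = []
    for query, solved in success_counts.items():
        rate = solved / n_profile
        if rate == 1.0:
            continue  # trivial: every profiling sample succeeded, so drop it
        if rate == 0.0:
            # Unsolved during profiling: keep only a random fraction.
            if rng.random() < keep_unsolved:
                plan.append((query, max_group, rate))
            continue
        # Group size ~ 1 / success rate, clamped to [2, max_group] (assumed).
        group = min(max_group, max(2, math.ceil(1.0 / rate)))
        plan.append((query, group, rate))
    # Curriculum: highest success rate (easiest) first.
    plan.sort(key=lambda item: -item[2])
    return [(query, group) for query, group, _ in plan]


# Hypothetical profiling results: successes out of 8 samples per query.
counts = {"q_easy": 8, "q_mid": 4, "q_hard": 1}
schedule = plan_training(counts, n_profile=8)
```

For these counts the schedule is `[("q_mid", 2), ("q_hard", 8)]`: the always-solved query is filtered out, and the query solved once in eight tries receives a group of eight rollouts, so that on average about one rollout per group succeeds.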

2606.08850 2026-06-09 cs.LG cs.AI cs.CL stat.ML New submission

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu

Affiliations: MIT; Red Hat; IBM

AI summary: Proposes inference-time scaling methods based on intrinsic statistics of parallel sample sets (length-adjusted tail entropy), using post-hoc candidate ranking and step-level resampling to improve performance on open-ended tasks without external verification.

Comments
preprint
Abstract

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.

2606.08849 2026-06-09 cs.AI New submission

A Resilience-as-a-Service assessment framework for coordinated disruption response in interdependent urban transit systems

Sara Jaber, S. M. Hassan Mahdavi, Neila Bhouri, Mostafa Ameli

Affiliations: Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France; VEDECOM, mobiLAB, Department of Human factors and Economics of Sustainable Mobility, Versailles, France

AI summary: Proposes a KPI-driven, time-indexed framework that combines an optimization model with agent-based simulation to assess the resilience of urban transit disruption responses along multiple dimensions such as vulnerability, adaptability, and robustness. A case study on the Paris RER B line shows the advantage of the coordinated strategy.

Abstract

Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a decision support framework to compare alternative disruption response solutions using a common set of dynamic, passenger-, operator-, and environment-oriented indicators. This paper proposes a KPI-driven, time-indexed framework to assess the resilience of disruption response solutions in urban transit systems. The framework combines an optimization model with a behavioral evaluation in agent-based simulation. It also accounts for the secondary service degradation induced on helper lines when in-service vehicles are withdrawn to support the disrupted corridor. Rather than treating resilience as a single score, it evaluates complementary dimensions including vulnerability, adaptability, robustness, resilience loss, responsiveness, cost-based performance, emissions, and equity. The framework is implemented for the RER B transit line in the Ile-de-France (Paris) network. Results show that the coordinated strategy provides the most balanced resilience profile, combining high service continuity with lower total disruption cost than single-mode alternatives, while also improving equity and maintaining competitive environmental performance. Sensitivity analysis further identifies the disruption conditions under which coordinated multimodal response is most valuable.

2606.08847 2026-06-09 cs.CV cs.AI cs.LG New submission

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

Affiliations: Faculty of Computer Science, MSA University, Egypt

AI summary: Proposes BLM-SGAN, which uses BERT's bidirectional attention to capture long-range dependencies and to address vanishing gradients and sequential-processing limits in GAN-based text-to-image generation, reporting state-of-the-art results on bird image generation.

Journal ref
Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025
Comments
Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025
Abstract

Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.

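2606.08844 2026-06-09 cs.CV cs.RO New submission

Geometry-Aware Fisheye-LiDAR Fusion for Robust 3D Object Detection in Low-Overlap Setups

Xiangzhong Liu, Xihao Wang, Hao Shen

Affiliations: Technical University of Munich

AI summary: Targets the geometric distortion and low overlap between fisheye cameras and LiDAR in sparse-view setups with a geometry-aware hybrid fusion framework. A distortion-aware LSS module and a dual-attention correction module fuse polar and Cartesian features, improving detection accuracy on three benchmarks.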

Comments
8 pages, 4 figures, submitted to RA-L
Abstract

As autonomous systems expand from capital-intensive robotaxis to cost-sensitive logistics, sensor configurations are increasingly optimized for coverage-per-cost. A prevalent sparse-view setup utilizes dual-fisheye cameras with a roof-mounted LiDAR, introducing severe geometric challenges: extreme radial distortion, minimal overlap, and misalignment between spherical projections and rectilinear grids. BEV fusion algorithms typically force image and point cloud modalities into unified Cartesian grids early in the pipeline, causing significant feature distortion and information loss for wide-view fisheye cameras. To address this, we propose a Geometry-Aware Hybrid Fusion (GA-HF) framework that explicitly accounts for fisheye geometry and BEV feature distortion, where fisheye features are lifted into a polar BEV grid via a Distortion-Aware Lift-Splat-Shoot (LSS) module to preserve native angular density, while LiDAR features are processed in native Cartesian space for metric fidelity of bounding box regression. To bridge these heterogeneous streams, we introduce a Dual-Attention Warping Correction module that applies spatial and channel attention to the warped camera features before fusion, explicitly suppressing artifacts in low-quality peripheral regions while enhancing high-quality semantic cues. GA-HF is evaluated on three benchmarks: KITTI-360, Dur360BEV, and Fisheye3DOD datasets. To the best of our knowledge, it is the first approach to explore LiDAR-fisheye camera fusion. On KITTI-360, GA-HF improves NDS by 4.2% over Cartesian baselines; on Dur360BEV, it surpasses both LiDAR-only and BEVFusion, while significantly reducing orientation error despite the geometric distortions; on Fisheye3DOD, it attains the highest detection score among all fusion methods.

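2606.08843 2026-06-09 cs.SD cs.LG New submission

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

Moshe Mandel, Shlomo E. Chazan

Affiliations: Independent, Israel; OriginAI, Israel

AI summary: Uses K-nearest-neighbor retrieval over WavLM representations to align non-parallel speech and build synthetic training pairs, adds a speaker loss for zero-shot voice conversion, and performs strongly across languages despite training only on English data.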

Abstract

We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.
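
As a toy illustration of how KNN retrieval could build a synthetic-to-real training pair, the sketch below replaces each frame of a real target utterance with the mean of its k nearest frames from another speaker's feature pool. Plain Python lists stand in for WavLM features; the cosine metric, the frame averaging, the retrieval direction, and all names are assumptions rather than details confirmed by the abstract.

```python
from math import sqrt
from typing import List, Sequence, Tuple

Frame = List[float]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_training_pair(target_frames: List[Frame],
                        other_pool: List[Frame],
                        k: int = 2) -> Tuple[List[Frame], List[Frame]]:
    """Return (synthetic_input, real_target) for supervised training.

    Each real target frame is swapped for the mean of its k most similar
    frames drawn from a different speaker's pool.
    """
    synthetic = []
    for frame in target_frames:
        nearest = sorted(other_pool,
                         key=lambda cand: cosine(frame, cand),
                         reverse=True)[:k]
        synthetic.append([sum(cand[d] for cand in nearest) / len(nearest)
                          for d in range(len(frame))])
    return synthetic, target_frames


# Toy 2-dimensional "features" for two target frames and a three-frame pool.
pool = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
synthetic, real = build_training_pair([[1.0, 0.0], [0.0, 1.0]], pool, k=1)
```

With `k=1` each target frame maps to the single most similar pool frame, so `synthetic` is `[[2.0, 0.0], [0.0, 3.0]]` while `real` is the untouched target. A conversion model would then be trained to map the synthetic input back to the real target, which requires no parallel recordings.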

2606.08841 2026-06-09 cs.AI cs.CV 新提交

ZIPP: Zero-shot Image Personalization from Personas

ZIPP:基于人物画像的零样本图像个性化生成

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

发表机构 * Adobe Media and Data Science Research (MDSR)(Adobe媒体与数据科学研究(MDSR)) IIIT-Delhi(德里英德拉普拉斯塔信息技术学院) SUNY at Buffalo(纽约州立大学布法罗分校)

AI总结 提出ZIPP方法,利用自然语言人物画像通过LLM改写提示词实现零样本图像个性化生成,无需用户数据或微调;引入ZIPBench基准,在多个评测中取得13-20%的提升。

详情
AI中文摘要

文本到图像扩散模型越来越多地部署在开放式创意环境中,但其输出仍然缺乏个性,优化的是整体审美而非个人品味。人类偏好是多元化的:一位喜欢柔和、怀旧肖像的用户可能偏爱充满活力的街头摄影,而另一位则倾向于梦幻的电影美学。现有方法需要密集的交互历史或逐用户微调,在冷启动场景中失败,并将上下文相关的偏好压缩为静态表示。我们提出了基于人物画像的零样本图像个性化生成(ZIPP),该方法以自然语言人物画像(用户身份和审美偏好的简洁描述符)为条件生成图像,无需任何用户特定数据或权重更新。ZIPP使用LLM从给定人物画像的角度重写提示词,引导扩散模型输出个性化结果。为了大规模挖掘人物画像,我们在一个包含2200万用户的Reddit交互图上训练了一个归纳式图注意力网络,采用双对比目标将图结构与视觉行为对齐,然后通过多模态大语言模型将学习到的表示转化为自然语言人物画像。我们引入了ZIPBench,这是首个零样本个性化基准,包含1500名用户、图挖掘的人物画像和4万张生成图像。在四个基准和涵盖五个模型家族的14个LLM上,人物画像条件化带来一致的性能提升(13-20%),前沿模型受益最大。在少样本设置中,ZIPP匹配或超过了基于每用户100多个示例微调的基线。ZIPP实现了最低的偏好分布散度(CMMD 0.16 vs 0.55),且经IPF归一化的人口统计评估表明,它显著减少了现有方法中存在的子群体偏差。人工评估证实,与通用生成相比胜率为79%,与所有微调基线相比胜率为58-65%。

英文摘要

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

2606.08833 2026-06-09 cs.CV 新提交

CSFlow: Aligning Flow Matching with Human Contrast Sensitivity

CSFlow: 将流匹配与人类对比敏感度对齐

Malgorzata Galinska, Bart Pogodzinski, Jan Eric Lenssen

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus(马克斯·普朗克信息学研究所,萨尔兰信息学园区)

AI总结 提出CSFlow加权方案,通过将人类对比敏感度函数与流匹配的迭代去噪步骤对齐,在傅里叶空间中引入软自回归结构,提升生成图像的视觉真实感,FID降低4.7%,Inception Score提升2.2%。

详情
AI中文摘要

我们引入了对比敏感流(CSFlow),这是一种将人眼的对比敏感度函数(CSF)与流匹配的迭代去噪步骤联系起来的加权方案。由于真实世界图像将信号集中在低空间频率,这些分量在连续扩散过程中比高频分量更早达到高信噪比。当使用扩散或流匹配模型生成图像时,这会在傅里叶空间中诱导一种软自回归结构,其中粗略的图像内容在精细细节之前稳定。同时,人类视觉系统对空间频率的敏感度不均:极低和极高的频率需要显著更高的对比度才能被感知。我们首次通过两个贡献将这些观察结果融合在一起:(1)一个估计每个反向流区间生成哪些频率的度量,以及(2)通过将每个噪声级别生成的频率与人类对比敏感度对齐获得的时间步权重。我们通过实验验证了我们的贡献,表明这些权重可以通过仅推理时间步修改或短时微调,将FID降低4.7%,Inception Score提高2.2%,GenEval分数提高2.5%,从而改善生成性能。定性上,我们发现我们的CSFlow权重导致生成的图像具有更好的视觉真实感和更少的卡通外观。

英文摘要

We introduce Contrast Sensitive Flow (CSFlow), a weighting scheme that connects the human eye's Contrast Sensitivity Function (CSF) to the iterative denoising steps of flow matching. Because real-world images concentrate signal at low spatial frequencies, these components reach a high signal-to-noise ratio earlier during continuous diffusion than high-frequency components. When generating images with diffusion or flow matching models, this induces a soft autoregressive structure in Fourier space, where coarse image content stabilizes before fine detail. Meanwhile, the human visual system is unequally sensitive to spatial frequencies: very low and very high frequencies require significantly higher contrast to be perceived. We merge these observations for the first time through two contributions: (1) a metric that estimates which frequencies are generated at each reverse flow interval and (2) timestep weights obtained by aligning the frequencies generated at each noise level with human contrast sensitivity. We validate our contributions experimentally, showing that these weights can improve generative performance by lowering FID by 4.7%, increasing Inception Score by 2.2%, and improving GenEval scores by 2.5%, using inference-only timestep modification or short fine-tuning. Qualitatively, we find that our CSFlow weights lead to better visual realism and a less cartoonish appearance in generated images.

2606.08831 2026-06-09 cs.AI 新提交

Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models

面向大语言模型的推理时保形推理与有效事实性控制

Ting Wang, Yuanjie Shi, Yan Yan, Huan Zhang

发表机构 * Machine Learning, ICML(机器学习,国际机器学习大会)

AI总结 提出推理时保形推理框架,将保形预测集成到推理图生成中,通过图级不确定性校准生成停止阈值,实现有效事实性控制。

详情
Comments
Accepted at ICML 2026
AI中文摘要

大型语言模型(LLMs)越来越多地执行多步推理,其中中间声明形成隐式有向无环图,其节点正确性在结构上依赖于其祖先。这使得事实不确定性具有结构性,而非节点错误的简单累积,并且需要对推理结构进行推理时不确定性量化。虽然保形预测(CP)提供了灵活的用户指定事实性控制,但现有工作仍然是事后性的,无法在生成过程中进行干预。为了填补CP灵活性与事后局限性之间的差距,我们提出了一种推理时保形推理(ITCR)框架,该框架将CP直接集成到推理图生成中。ITCR学习一种结构级事实性不确定性函数,该函数在不进行复杂建模假设的情况下,聚合推理图上的声明级事实性信号。然后,我们基于图级事实性不确定性设计非一致性分数,并校准保形阈值以决定何时停止生成。我们从理论上证明这种生成是嵌套的,为事实性控制提供了有效的覆盖保证。在多个数据集和覆盖目标上的实验证明了经验上的有效覆盖。在下游推理任务中,推理时校准的图比事后剪枝的图产生更准确的生成。

英文摘要

Large language models (LLMs) increasingly perform multi-step reasoning, where intermediate claims form implicit directed acyclic graphs whose node correctness is structurally conditioned on their ancestors. This makes factuality uncertainty structural, rather than a trivial accumulation of node-wise errors, and necessitates inference-time uncertainty quantification over the reasoning structure. While conformal prediction (CP) offers flexible user-specified factuality control, existing work remains post-hoc and cannot intervene during generation. To fill the gap between CP's flexibility and its post-hoc limitation, we propose an \emph{Inference-Time Conformal Reasoning (ITCR)} framework that integrates CP directly into reasoning graph generation. ITCR learns a structure-level factuality uncertainty function that aggregates claim-level factuality signals over reasoning graphs without complex modeling assumptions. We then design the non-conformity score based on graph-level factuality uncertainty and calibrate the conformal threshold to decide when to stop generation. We theoretically show such generation is nested, yielding valid coverage guarantees for factuality control. Experiments over multiple datasets and coverage objectives demonstrate empirically valid coverage. In downstream reasoning tasks, inference-time calibrated graphs yield more accurate generation than post-hoc pruned graphs.
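
For readers unfamiliar with conformal calibration, the threshold step can be sketched as generic split conformal prediction. This is an illustration under stated assumptions, not the ITCR procedure itself: the non-conformity scores are taken as given numbers, the threshold is the standard finite-sample quantile at rank ceil((n + 1)(1 - alpha)), and generation stops at the first step whose uncertainty exceeds it. The names `conformal_threshold` and `steps_kept` are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the requested
    level, which means "never stop early".
    """
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")
    return sorted(calibration_scores)[rank - 1]

def steps_kept(step_uncertainties, threshold):
    """Count reasoning steps generated before uncertainty first exceeds the threshold."""
    kept = 0
    for u in step_uncertainties:
        if u > threshold:
            break
        kept += 1
    return kept
```

Stopping at the first violation, rather than filtering individual steps, keeps the retained steps a prefix of the full sequence; this mirrors, in a simplified linear form, the nestedness that the abstract relies on for its coverage guarantee.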

2606.08828 2026-06-09 cs.RO 新提交

Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Video2Sim2Real:从单个人类视频实现全栈自主灵巧技能获取

Yunhai Han, Jianuo Qiu, Linhao Bai, Ziyu Xiao, Zihang Zeng, Yangcen Liu, Zhaodong Yang, Shalin Jain, Wenrui Ma, Jiaqi Fu, Yuqian Zheng, Manisha Natarajan, Muhammad Zubair Irshad, Kenneth Shaw, Matthew Gombolay, Zsolt Kira, Harish Ravichandar

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Pennsylvania(宾夕法尼亚大学) Toyota Research Institute(丰田研究所) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出Video2Sim2Real框架,从单个人类操作视频中重建数字孪生并提取运动先验,通过物体关键帧优化机器人配置,结合残差强化学习与碰撞感知规划,实现从仿真到真实世界的灵巧技能迁移。

详情
Comments
Website: https://video2sim2real.github.io/
AI中文摘要

人类操作视频是机器人学习的便捷直观来源。然而,由于感知误差和具身差距,直接将人类灵巧性迁移到机器人仍然具有挑战性。为此,我们引入Video2Sim2Real,一个从单个人类操作视频中自主获取技能的全栈框架。我们的框架首先使用现成的基础模型重建适用于仿真器的数字孪生,并提取机器人和物体运动先验。与将提取的机器人运动视为整个执行过程中的可靠参考不同,我们的关键思想是恢复并利用从演示技能中获得的最基本监督来源:我们识别以物体为中心的关键帧,利用仿真器中的物体信息优化相应的机器人配置,并将这些配置作为锚点来细化机器人运动,使其最终对环境产生期望的影响。为了弥合剩余的仿真到现实差距,我们引入了一种仿真到现实策略,将对噪声和不完整感知的鲁棒性与手-物交互动力学的变化解耦。具体来说,我们通过模仿学习从噪声的真实世界点云中重新校准机器人配置,并利用残差强化学习进行局部手指级自适应,以确保鲁棒且有效的交互。最后,一个碰撞感知的运动规划模块实现了对新颖物体配置的空间泛化。在多个日常操作任务中,Video2Sim2Real在模拟任务成功率、安全性和轨迹一致性上优于众多基线,并且比现有技术实现了更好的仿真到现实迁移。这些结果展示了从人类视频自主获取灵巧技能的一条有前景的路径。

英文摘要

Human manipulation videos are a convenient and intuitive source for robot learning. However, directly transferring human dexterity to robots remains challenging due to perception errors and the embodiment gap. To address this, we introduce Video2Sim2Real, a full-stack framework for autonomous skill acquisition from a single human manipulation video. Our framework first uses off-the-shelf foundation models to reconstruct a simulator-ready digital twin and extract robot and object motion priors. Rather than treating the extracted robot motion as a reliable reference throughout execution, our key idea is to recover and leverage the most fundamental sources of supervision from the demonstrated skill: we identify object-centric keyframes to optimize the corresponding robot configurations using object information from the simulator, and use these configurations as anchors that refine the robot motion such that it ultimately has the desired impact on the environment. To bridge the remaining sim-to-real gap, we introduce a sim-to-real strategy that decouples robustness to noisy and incomplete perception from variations in hand-object interaction dynamics. Specifically, we learn to recalibrate robot configurations from noisy real-world point clouds via imitation learning (IL), and leverage residual reinforcement learning (RL) to perform local finger-level adaptations that ensure robust and effective interactions. Finally, a collision-aware motion planning module enables spatial generalization to novel object configurations. Across several everyday manipulation tasks, Video2Sim2Real improves simulated task success, safety, and trajectory coherence over numerous baselines, and achieves better sim-to-real transfer than existing techniques. These results demonstrate a promising path toward autonomous dexterous skill acquisition from human videos.

2606.08826 2026-06-09 cs.CV astro-ph.GA 新提交

Classifying galaxies in the Galaxy10 DECals dataset using Inception and Residual CNNs

使用Inception和残差CNN对Galaxy10 DECals数据集中的星系进行分类

Lanz Anthonee A. Lagman, Prospero C. Naval, Reinabelle C. Reyes

发表机构 * University of the Philippines - Diliman(菲律宾大学迪利曼分校) Department of Computer Science, College of Engineering, University of the Philippines - Diliman(菲律宾大学迪利曼分校工程学院计算机科学系) National Institute of Physics, College of Science, University of the Philippines - Diliman(菲律宾大学迪利曼分校理学院国家物理研究所)

AI总结 本研究比较了ResNet101和InceptionV4在星系形态分类任务上的性能,两者均达到约90%的准确率,其中ResNet101表现更优,表明这两种CNN架构可作为未来巡天星系图像分类的稳健基础。

详情
Journal ref
Proc. Samahang Pisika Pilipinas 42, SPP-2024-2E-05 (2024)
Comments
4 pages, 3 figures, 2 tables, published in Proceedings of the 42nd Samahang Pisika ng Pilipinas Physics Conference (SPP 2024)
AI中文摘要

关于星系形态的图像数据预计在未来几年内将在数量和质量上都有所增加;因此,探索哪些适用于图像分类任务的深度学习架构具有成本效益非常重要。残差网络和Inception网络因其计算效率而成为探索分类卷积神经网络(CNN)的理想选择,这得益于残差连接和并行化Inception模块等技术,使得网络能够更深而不显著增加计算复杂度。在这项工作中,我们分析了ResNet101和InceptionV4在空间增强的Galaxy10 DECals数据集上的性能。保留星系的十类分类,我们修改了每个类别的图像数量。我们发现ResNet101和InceptionV4模型达到了约90%的准确率,与文献中报告的性能相当。在性能指标方面,ResNet101优于InceptionV4。我们的结果表明,这两种CNN架构中的任何一种都可以作为即将到来的巡天中星系图像分类专用管线的稳健基础。

英文摘要

Image data on galaxy morphology is expected to increase in both quantity and quality over the coming years; it is therefore important to explore which deep learning architectures adapted for image classification tasks are cost-effective. Residual and Inception networks are ideal for exploring classification convolutional neural networks (CNNs) due to their computational efficiency, achieved through techniques such as residual connections and parallelized inception modules, enabling deeper networks without excessively increasing computational complexity. In this work, we analyze the performance of ResNet101 and InceptionV4 on a spatially-augmented Galaxy10 DECals dataset. Retaining the ten-class classification of galaxies, we modify the image count of each class. We find that the ResNet101 and InceptionV4 models achieve accuracies of $\sim$90%, comparable with reported performance in the literature. In terms of performance metrics, ResNet101 is superior to InceptionV4. Our results indicate that either of these CNN architectures could serve as a robust foundation for specialized pipelines for classification of galaxy images from upcoming surveys.

2606.08816 2026-06-09 cs.LG cs.AI 新提交

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

知识图谱与推理大语言模型用于寻找简单而有效的转录组扰动预测因子

Jake Fawkes, Liam Hodgson, Jason Hartford

发表机构 * University College London(伦敦大学学院) University of Manchester(曼彻斯特大学) Valence Labs(Valence实验室) Recursion(Recursion公司)

AI总结 利用知识图谱的K近邻方法在基因敲除扰动预测中表现优异,结合强化学习优化的LLM可达到最先进性能。

详情
AI中文摘要

预测未见过的基因敲除扰动对转录组基因表达的影响仍然是虚拟细胞模型的一个极具挑战性的问题。最近,通过利用生物知识图谱提供相似扰动的概念,在训练扰动集之外实现了更好的外推。在这项工作中,我们证明了利用这些假设的最简单模型——知识图谱的K近邻——在此任务上取得了极具竞争力的性能,并且通过使用强化学习(RL)优化的LLM可以进一步提高预测性能。具体来说,我们发现K近邻方法在分布外扰动预测上几乎击败了所有方法,而当通过RL训练推理LLM以改变邻域时,它在Replogle等人(2022)的细胞系上获得了与当前最先进方法相当的性能。我们还证明,尽管没有直接训练,RL训练提高了LLM在差异表达预测下游任务上的性能。总体而言,这些发现证明了知识图谱作为模型先验的有效性,并显示出RL可以将LLM精炼为预测复杂生物反应的通用工具的早期迹象。

英文摘要

Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
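
The K-nearest-neighbour baseline can be sketched as follows. This is an illustrative reading, not the paper's code: it assumes each gene has a vector embedding derived from the knowledge graph, that similarity is squared Euclidean distance between embeddings, and that the prediction for an unseen knockout is the mean expression change of its k nearest training perturbations. The function name and data layout are hypothetical.

```python
def knn_predict(query_gene, embeddings, train_effects, k=5):
    """Predict the expression change of an unseen knockout.

    embeddings:    gene -> knowledge-graph embedding (assumed to be given)
    train_effects: perturbed gene -> observed expression-change vector
    """
    q = embeddings[query_gene]

    def sq_dist(gene):
        return sum((a - b) ** 2 for a, b in zip(q, embeddings[gene]))

    # Only perturbations seen in training can serve as neighbours.
    neighbours = sorted(train_effects, key=sq_dist)[:k]
    n_readouts = len(train_effects[neighbours[0]])
    return [
        sum(train_effects[g][i] for g in neighbours) / len(neighbours)
        for i in range(n_readouts)
    ]
```

In this framing, an LLM that "makes changes to the neighbourhood" would edit the `neighbours` list before averaging; how the paper implements that edit is not specified in the abstract.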

2606.08815 2026-06-09 cs.AI cs.CL cs.LG 新提交

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

推理的动量:策略优化中的密集内在信号

Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学) Eastern Institute of Technology(东方理工学院)

AI总结 针对GRPO在长链推理中因二元奖励导致的零优势崩溃和幻觉确定性失败模式,提出ISPO方法,通过内在信号密集化奖励,在三个基模型和五个数学推理基准上持续优于基线。

详情
Comments
14 pages, 6 figures, 8 tables
AI中文摘要

基于可验证奖励的强化学习已成为激发大型语言模型长链推理的强大范式。然而,现有基于组相对策略优化(GRPO)的方法依赖于二元结果奖励,这引发了两种结构性失败模式:零优势崩溃,即组内所有轨迹共享相同结果导致梯度消失;以及幻觉确定性,即模型在训练后期对错误轨迹变得过度自信。我们通过使用完全从策略自身条件概率计算的内在信号来密集化奖励,解决了这两种模式,并提出了ISPO(内在信号策略优化),它结合了衡量思考轨迹对最终答案信息量的序列级信号,以及令牌级方向性奖励,其幻觉确定性铰链惩罚关键决策令牌上的错误自信预测。在三个基模型和五个数学推理基准上,ISPO持续优于竞争基线,在零优势崩溃最频繁的最难基准上取得最大提升,训练动态诊断证实两种失败模式均被减少。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
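
Zero-Advantage Collapse is easy to reproduce with the usual group-relative advantage, (reward - group mean) / group standard deviation. The sketch below assumes that standard normalization; the `densify` helper and its `weight` parameter are hypothetical stand-ins for an intrinsic bonus and do not reproduce ISPO's actual signals.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: centre by the group mean, scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def densify(outcomes, intrinsic, weight=0.1):
    """Hypothetical densification: add a weighted intrinsic signal to each outcome."""
    return [o + weight * s for o, s in zip(outcomes, intrinsic)]

# All four rollouts are correct, so the binary reward carries no learning signal.
collapsed = group_advantages([1, 1, 1, 1])

# Adding any per-rollout intrinsic signal breaks the tie within the group.
dense = group_advantages(densify([1, 1, 1, 1], [0.2, 0.9, 0.1, 0.5]))
```

Here `collapsed` is all zeros, so the policy gradient for that group vanishes, while `dense` ranks the rollouts by their intrinsic signal even though their outcomes are identical.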

2606.08810 2026-06-09 cs.CL cs.LG 新提交

Continuous Language Diffusion as a Decoder-Interface Problem

连续语言扩散作为解码器-接口问题

Zhicheng Du, Lan Ma

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院, 清华大学)

AI总结 研究连续扩散语言模型如何从高斯噪声生成流畅文本,提出解码器-盆地机制,并设计诊断协议揭示标量指标隐藏的失败,通过接口相图解释令牌恢复行为。

详情
AI中文摘要

高斯扰乱的句子嵌入没有直接的语言解释,但连续扩散语言模型可以从它们生成流畅文本。我们通过嵌入式语言流(ELF)研究这一谜题,并识别出解码器-盆地机制:当轨迹到达原生解码器可以读取稳定令牌的区域时,去噪成功。我们引入了可去噪性、语义可恢复性、顺序敏感性、解码器兼容性和轨迹可靠性的诊断协议。它暴露了标量指标隐藏的失败:低均方误差可能丢弃语言内容,低困惑度可能反映低熵崩溃,干净的潜在重建可能与狭窄的解码器盆地共存。一个解码器-边界界解释了为什么令牌恢复依赖于边界和局部解码器敏感性,而不仅仅是潜在误差。审计公开的ELF检查点揭示了一个接口相图:早期预测弱可读,轨迹中期分歧标志竞争区域,晚期预测进入高边界最终令牌盆地。一旦进入,在生成的ELF状态上令牌实现出奇简单:冻结的T5令牌嵌入查找恢复了原生解码器决策的93%–96%,单个线性读出在32k样本时达到97.9%的一致性,在结构化残差尾部留下约1.1的困惑度差距。在显式诊断监控下,保守的边界门在去噪步骤中提前17%–27%退出。对LangFlow、BitstreamDiffusion和连续潜在扩散语言模型(Cola-DLM)的边界检查表明,当状态对象和解码器改变时,相同的接口问题仍然有意义。因此,连续和潜在扩散语言模型应作为表示-解码器系统进行评估。

英文摘要

Gaussian-corrupted sentence embeddings have no direct linguistic interpretation, yet continuous diffusion language models can generate fluent text from them. We study this puzzle through Embedded Language Flows (ELF) and identify a decoder-basin mechanism: denoising succeeds when trajectories reach regions where the native decoder can read stable tokens. We introduce a diagnostic protocol for denoisability, semantic recoverability, order sensitivity, decoder compatibility, and trajectory reliability. It exposes failures hidden by scalar metrics: low mean-squared error can discard linguistic content, low perplexity can reflect low-entropy collapse, and clean latent reconstruction can coexist with a narrow decoder basin. A decoder-margin bound explains why token recovery depends on margin and local decoder sensitivity, not latent error alone. Auditing public ELF checkpoints reveals an interface phase diagram: early predictions are weakly readable, mid-trajectory disagreement marks a competition region, and late predictions enter a high-margin final-token basin. Once inside, token realization is surprisingly simple on generated ELF states: frozen T5 token-embedding lookup recovers $93$--$96\%$ of native decoder decisions, and a single linear readout reaches $97.9\%$ agreement at 32k samples, leaving about a 1.1 perplexity gap in a structured residual tail. A conservative margin gate exits $17$--$27\%$ earlier in denoising steps under an explicit diagnostic monitor. Boundary checks on LangFlow, BitstreamDiffusion, and the Continuous Latent Diffusion Language Model (Cola-DLM) show that the same interface questions remain meaningful when the state object and decoder change. Continuous and latent diffusion language models should therefore be evaluated as representation-decoder systems.
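
A margin gate of the kind mentioned above can be sketched as an early-exit rule. This is only an illustration under assumptions the abstract does not confirm: the margin is taken to be the gap between the two largest decoder logits at each token position, and denoising exits at the first step where every position clears a fixed threshold. The names `token_margin` and `first_exit_step` are hypothetical.

```python
def token_margin(logits):
    """Gap between the largest and second-largest logit for one token position."""
    top = sorted(logits, reverse=True)
    return top[0] - top[1]

def first_exit_step(trajectory_logits, threshold):
    """Index of the first denoising step whose smallest token margin clears the threshold.

    trajectory_logits: one entry per denoising step, each a list of
    per-position logit lists. Falls back to the final step if no step qualifies.
    """
    for step, per_position in enumerate(trajectory_logits):
        if min(token_margin(logits) for logits in per_position) >= threshold:
            return step
    return len(trajectory_logits) - 1
```

Using the minimum margin across positions makes the gate conservative: a single ambiguous token is enough to keep denoising.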

2606.08802 2026-06-09 cs.LG 新提交

Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules

主动流扩展用于分布外发现:从理论到分子

Riccardo De Santi, Bruce Lee, Cristian Perez Jensen, Kimon Protopapas, Sophia Tang, Cheng-Hao Liu, Pranam Chatterjee, Yisong Yue, Andreas Krause

发表机构 * ETH Zurich(苏黎世联邦理工学院) ETH AI Center(ETH AI 中心) University of Pennsylvania(宾夕法尼亚大学) Caltech(加州理工学院) FutureHouse

AI总结 提出Active Flow Expansion (ActFlow)方法,通过验证器反馈和主动探索扩展预训练流模型的生成集,覆盖更多有效设计空间,理论证明统计学习保证,在分子和蛋白质任务上优于现有方法。

详情
AI中文摘要

标准流和扩散预训练匹配可用数据(例如分子)的分布,这通常只覆盖有效设计空间的一小部分。然而,在生成发现中,目标是采样有效的新自然设计,这些设计在标准模型下被赋予可忽略的概率,因此无法从拟合观测数据的标准模型中获取。为克服这一限制,我们偏离数据分布匹配,通过生成集(模型以非可忽略概率覆盖的区域)来审视生成模型。这允许引入一种新的分布外流建模学习原则:扩大模型的生成集以增加对有效设计空间的覆盖。我们提出主动流扩展(ActFlow),一种持续预训练方法,利用验证器反馈,通过迭代适应在学习的流表示中主动探索生成的合成数据,将预训练模型扩展到新的有效区域。理论上,我们建立了据我们所知首个分布外流建模的统计学习保证,将生成集扩展分析为在学习表示上的局部到全局可达过程。实验上,我们使用合适的分布外生成建模指标,在小有机分子、中等大小药物样分子、治疗性肽和蛋白质序列设计任务上评估ActFlow。结果表明,ActFlow将有效覆盖扩展到远超初始预训练模型建模的区域,显著优于广泛采用的合成流预训练方法。

英文摘要

Standard flow and diffusion pre-training matches the distribution of available data (e.g., molecules), which often covers only a small fraction of the valid design space. In generative discovery, however, one aims to sample valid new-to-nature designs, which are assigned negligible probability under, and are thus inaccessible to, standard models fitted to the observed data. To overcome this limitation, we depart from data distribution matching and view a generative model through its generable set: the region it covers with non-negligible probability. This allows us to introduce a new learning principle for out-of-distribution flow modeling: enlarging a model's generable set to increase coverage of the valid design space. We propose Active Flow Expansion (ActFlow), a continued pre-training method that employs verifier feedback to expand a pre-trained model over new valid regions by iteratively adapting to synthetic data generated through active exploration in the learned flow representation. Theoretically, we establish what are, to our knowledge, the first statistical learning guarantees for out-of-distribution flow modeling, analyzing generable set expansion as a local-to-global reachability process over a learned representation. Empirically, we assess ActFlow with suitable out-of-distribution generative modeling metrics across small organic molecules, mid-sized drug-like molecules, therapeutic peptides, and protein sequence design tasks. Results show that ActFlow expands valid coverage far beyond the region modeled by the initial pre-trained model, significantly outperforming widely adopted synthetic flow pre-training methods.

2606.08800 2026-06-09 cs.AI 新提交

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

通过自进化桥接专家知识与自动化特征工程

Varun Khurana, Vijval Ekbote, Vashu Chauhan, Yaman Kumar Singla, Rajiv Ratn Shah, Balaji Krishnamurthy

发表机构 * Adobe Media and Data Science Research(Adobe媒体与数据科学研究) IIIT-Delhi(德里英德拉普拉斯塔信息技术学院)

AI总结 提出FEST方法,结合双流特征生成、语义去重和树引导迭代进化,从原始文本和图像中发现可审计特征,在品牌分类等任务中平均提升4.2个百分点,并实现60-80%的专家特征覆盖。

详情
AI中文摘要

在品牌合规、临床护理和内容审核等高风险场景中,机器学习不能作为不透明的预言机部署:从业者需要检查驱动模型决策的特征,模型必须利用管理这些领域的专家文档。实际上,数据以非结构化内容形式到达,从中提取的特征必须可解释、有区分度,并与专家认为重要的内容对齐。现有方法存在不足:它们针对表格输入,缺乏专家对齐的证明,并且无法将诸如“保持专业语气”之类的定性标准转化为精确特征。我们提出了FEST(自进化树特征工程),结合了双流特征生成(语义和确定性)、语义去重和树引导的迭代进化,从原始文本和图像中发现可审计特征。FEST在品牌分类、内容真实性检测和压力检测的20个分类器-任务组合中领先17个,在五个分类器上平均比最强基线高出4.2个百分点。LLM作为评判者的评估显示,在严格的语义对齐阈值下,FEST实现了60-80%的专家设计品牌特征覆盖率,并通过人类专家研究证实,这些特征在相关性、清晰度和可操作性方面获得高评分。当以专家指南作为种子时,FEST将定性标准细化为可操作特征,跨品牌平均提高6-12个百分点的准确率。为了实现对自动化特征工程中专家对齐的系统评估,我们发布了BrandGuide,这是第一个将专家设计特征与2,683个品牌的100万+资产配对的数据集。通过将特征工程建立在专家知识基础上,FEST为需要人类监督的可解释机器学习开辟了一条实用途径。

英文摘要

In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning models cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), which combines dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows that FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study that rates the features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.

2606.08797 2026-06-09 cs.LG cs.AI 新提交

Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition

通过拉格朗日分解将决策聚焦学习扩展到大规模问题

Stéphane Eilles-Chan Way, Hugo Percot, Quentin Cappart, Tias Guns, Louis-Martin Rousseau

发表机构 * Polytechnique Montréal(蒙特利尔综合理工学院) Ecole Polytechnique(巴黎综合理工学院) UCLouvain(鲁汶大学) Mila - Québec AI Institute(魁北克人工智能研究所) KU Leuven(荷语鲁汶大学)

AI总结 提出结合拉格朗日分解的决策聚焦学习框架,通过新代理目标和两种损失函数,在保持可并行化的同时,有效处理大规模约束优化问题,实验表明在变量数多八倍的实例上优于传统方法。

详情
AI中文摘要

决策聚焦学习在解决预测-优化问题中显示出巨大潜力,尤其是在模型欠规范的情况下。然而,其实际部署常因高计算成本和有限的可扩展性而受阻,因为需要在每次迭代中对每个训练实例求解一个约束优化问题。为解决这些挑战,我们提出了一种新颖的框架,将拉格朗日分解融入决策聚焦学习范式。具体而言,我们引入了一个新的代理目标以及两个用于评估和训练底层预测模型的损失函数。我们进一步提出了两种变体,它们在计算效率和解决方案质量之间提供了不同的权衡。我们的框架可以无缝集成到标准的决策聚焦学习方法中,包括Smart Predict-then-Optimize (SPO+)和隐式最大似然估计 (IMLE)。通过在两个标准基准测试(多维背包问题和二次投资组合优化)上的实验,我们证明了我们的方法在保持可并行化的同时实现了有竞争力的性能。特别是,在大规模实例上,它始终优于传统的决策聚焦学习方法,这些实例的变量数比相关工作通常考虑的要多出八倍。实现代码可在 https://github.com/corail-research/DFL-LD 获取。

英文摘要

Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational costs and limited scalability, as it requires solving a constrained optimization problem for each training instance at every iteration. To address these challenges, we propose a novel framework that incorporates Lagrangian decomposition into the decision-focused learning paradigm. Specifically, we introduce a new surrogate objective along with two loss functions for evaluating and training the underlying prediction model. We further propose two variants of our approach, which offer different trade-offs between computational efficiency and solution quality. Our framework can be seamlessly integrated with standard decision-focused learning methods, including Smart Predict-then-Optimize (SPO+) and Implicit Maximum Likelihood Estimation (IMLE). Through experiments on two standard benchmarks, the multi-dimensional knapsack problem and quadratic portfolio optimization, we demonstrate that our approach achieves competitive performance while remaining amenable to parallelization. In particular, it consistently outperforms traditional decision-focused learning methods on large-scale instances, involving up to eight times more variables than those typically considered in related work. The implementation is available at https://github.com/corail-research/DFL-LD.

2606.08795 2026-06-09 cs.CV 新提交

PairWise Image Finder: An Open-source Tool for Finding Visually Aligned Street-Level Image Pairs for Urban Perception Studies

PairWise Image Finder: 用于城市感知研究的视觉对齐街景图像对查找开源工具

Jussi Torkko

发表机构 * Digital Geography Lab, Department of Geosciences and Geography, University of Helsinki(赫尔辛基大学地球科学与地理系数字地理实验室)

AI总结 提出PairWise图像查找工具,集成特征检测与匹配及语义分割掩码,量化不同时期图像的视觉对齐度,输出匹配特征比例、距离、覆盖率和语义掩码对齐度,支持过滤高质量图像对,用于纵向变化研究和减少人工工作量。

详情
Comments
6 pages, two figures, github repo link near the end
AI中文摘要

变化检测和场景识别技术已广泛应用于街景图像(SVI)以理解跨年场景的变化。然而,仅凭元数据往往不足以可靠地找到视觉对齐的图像对。本研究介绍了PairWise图像查找器,该工具集成了特征检测和匹配,并辅以语义分割掩码来量化不同时期两幅图像的视觉对齐度。该工具输出匹配关键特征的比例、匹配特征距离和覆盖率以及语义掩码的对齐度,使用户能够根据对齐质量和用例过滤图像对。从该工具导出的视觉对齐对可用于准确研究显式的纵向变化,并有助于减少感知研究中的人工工作量。通过比较纵向变化展示了该工具的可用性,强调了量化变化时视角的重要性。所提出的方法为研究人员和利益相关者提供了一个可扩展的开源工具,用于查找用于城市分析、感知及相关应用的高质量图像对。

英文摘要

Change detection and scene recognition techniques have been widely applied to Street View Imagery (SVI) to understand how scenes change over the years. However, metadata alone is often insufficient to reliably find visually aligned image pairs. This study introduces the PairWise image finder, a tool that integrates feature detection and matching, supported by semantic segmentation masks, to quantify the visual alignment of two images from different time periods. The tool outputs the share of matched key features, the matched feature distance and coverage, and the alignment of semantic masks, which enables the user to filter image pairs depending on the alignment quality and use case. The visually aligned pairs derived from the tool can be used to accurately study explicit longitudinal change and help reduce manual effort for perception studies. The usability of the tool is demonstrated through a comparison of longitudinal changes, highlighting the importance of perspective when quantifying changes. The proposed method provides a scalable and open tool for researchers and stakeholders to find high-quality image pairs for urban analysis, perception and related applications.

2606.08792 2026-06-09 cs.CL 新提交

The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

放大之镜:定位和操控大语言模型内的党派方向

Wendy K. Tam

发表机构 * Vanderbilt University(范德比尔特大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过线性探针在Llama 3.1 8B Instruct模型的隐藏状态中定位党派政治身份方向,并利用稀疏自编码器分解为可解释特征,因果干预可系统性改变模型输出,证明党派偏见是可定位和操控的几何特征。

详情
AI中文摘要

大型语言模型正迅速取代搜索引擎,成为人与信息之间的主要界面。与检索现有内容的搜索引擎不同,LLM生成受训练期间学到的内部表示影响的新文本。在这里,我们展示了党派政治身份编码在模型的激活空间中,并且这个方向直接塑造生成。使用来自美国国会现任议员的190,491条推文作为标记训练数据,我们在Llama 3.1 8B Instruct模型的隐藏状态上训练线性探针。我们在第18层识别出一个单一的几何轴,该轴以0.945的AUC和1.94的Cohen's d区分共和党和民主党文本,并使用稀疏自编码器将该轴分解为可解释的党派特征。沿该轴进行因果干预,在生成过程中消融或放大党派成分,会产生模型输出的系统性变化。我们观察到立场反转、语域转换以及结构化的权威捏造。我们的结果表明,语言模型中的党派偏见不是模糊的涌现属性,而是可以精确定位和操控的习得几何特征。党派偏见不是需要修补的漏洞,而是这些模型如何编码关于用户信息的结构属性。随着LLM取代搜索引擎成为知识界面,理解产品设计(及其后果)对于驾驭从策划到生成的信息生态系统的法律、社会和政治转型至关重要。

英文摘要

Large language models are rapidly replacing search engines as the primary interface between people and information. Unlike search engines, which retrieve existing content, LLMs generate novel text shaped by internal representations learned during training. Here we show that partisan political identity is encoded in the model's activation space, and that this direction directly shapes generation. Using 190,491 tweets from sitting members of the U.S. Congress as labeled training data, we train linear probes on the hidden states of the Llama 3.1 8B Instruct model. We identify a single geometric axis at layer 18 that separates Republican from Democratic text with an AUC of 0.945 and a Cohen's d of 1.94, and use sparse autoencoders to decompose that axis into interpretable partisan features. Causally intervening along this axis, ablating or amplifying the partisan component mid-generation, produces systematic shifts in the model's output. We observe stance reversals, register shifts, and structured fabrications of authority. Our results demonstrate that partisan bias in language models is not a vague emergent property but a learned geometric feature that can be precisely located and steered. Partisan bias is not a bug to be patched, but a structural property of how these models encode information about their users. As LLMs displace search engines as the interface to knowledge, understanding that product design (and its consequences) will be essential for navigating the legal, social, and political transitions from an information ecosystem that is curated to one that is generated.
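
Ablating or amplifying a component along a single axis is commonly done by rescaling the projection of a hidden state onto that axis. The sketch below shows that generic operation on plain Python lists; it assumes the direction is already known (for example, taken from a linear probe's weight vector) and is not the paper's intervention code. The name `steer` and the `scale` parameter are hypothetical.

```python
import math

def steer(hidden, direction, scale):
    """Rescale the component of `hidden` that lies along `direction`.

    scale = 0.0 removes the component (ablation), scale = 1.0 leaves the
    vector unchanged, and scale > 1.0 amplifies the component.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h + (scale - 1.0) * projection * u for h, u in zip(hidden, unit)]
```

Because only the component along `direction` changes, everything orthogonal to the axis is preserved, which is what makes this kind of intervention attributable to the one direction being tested.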

2606.08790 2026-06-09 cs.AI cs.CR cs.MA New submission

RAILS: Verification-Native Clearing For Agentic Commerce

RAILS: 面向代理商务的验证原生清算

Adrian de Valois-Franklin, Alex Bogdan

Affiliations * Evolutionairy AI

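AI summary: To address the lack of a neutral clearing mechanism for autonomous agents in commerce, the paper proposes the RAILS protocol, which achieves verification-native clearing through reliability scores, reliability records, and a clearing function, ensuring that financial settlement rests on sufficient evidence.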

Details
Comments
49 pages, 15 figures
AI Chinese abstract

自主代理进行谈判、购买、部署代码和转移资金,但缺乏中立机制来确定它们是否履行了委托义务、未履行时谁负责、以及后续采取何种结算行动。这就是代理清算问题。工具协议(MCP)、代理间通信(A2A)、支付轨道(x402)、授权和网络代理协议(AP2、Visa、Mastercard)以及结算风险标准都假设存在这种确定机制,但都没有产生它。清算是缺失的原语。支付不是清算。授权不是清算。LLM作为法官的评估不是清算。结算风险托管不是清算:它消耗清算决策。RAILS(实时代理完整性与账本结算)是代理商务的完整性和清算层,涵盖每个输出的可靠性评分、公开的可靠性记录以及消耗它们的清算函数。其核心的清算协议填补了这一空白。七个原语(义务对象、证据信封、验证网格、清算决策、结算指令、清算护照、终局规则),由可接受性分级验证的形式模型约束,共同产生一个可靠性属性:没有财务上重要的结算得到低于义务可接受性底线的证据支持。该属性可针对规范进行证伪。我们不知道先前的代理商务验证机制陈述过此类属性。最接近的方法输出通过、交付保证、裸分数或均衡。本文详细说明了该清算协议。

English abstract

Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated obligation, who is responsible when they did not, or which settlement action follows. This is the agentic clearing problem. Tool protocols (MCP), inter-agent communication (A2A), payment rails (x402), mandate and network agent protocols (AP2, Visa, Mastercard), and settlement-risk standards each assume that determination and none produce it. Clearing is the missing primitive. Payment is not clearing. Authorization is not clearing. LLM-as-judge evaluation is not clearing. Settlement-risk escrow is not clearing: it consumes clearing decisions. RAILS (Real-Time Agent Integrity & Ledger Settlement) is the integrity and clearing layer for agentic commerce, spanning a per-output reliability score, a published reliability record, and a clearing function that consumes them. The clearing protocol at its core closes that gap. Seven primitives (Obligation Object, Evidence Envelope, Verification Mesh, Clearing Decision, Settlement Instruction, Clearing Passport, Finality Rules), bound by a formal model of admissibility-graded verification, together yield a soundness property: no financially material settlement is supported by evidence below the obligation's admissibility floor. The property is falsifiable against the spec. We are not aware of a prior agent-commerce verification mechanism that states a property of this kind. The approaches nearest to it emit a pass, a delivery guarantee, a bare score, or an equilibrium. This paper specifies that clearing protocol.
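The soundness property stated in this abstract (no financially material settlement is supported by evidence below the obligation's admissibility floor) can be read as a predicate over a set of settlements. The sketch below is one possible reading, not the RAILS specification: the `Settlement` record, its field names, and the use of integer grades are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settlement:
    """Hypothetical record; the fields are illustrative, not taken from the RAILS spec."""
    obligation_id: str
    material: bool            # is the settlement financially material?
    evidence_grade: int       # admissibility grade of the supporting evidence
    admissibility_floor: int  # minimum grade the obligation requires

def violations(settlements):
    """Return the settlements that would falsify the soundness property."""
    return [
        s for s in settlements
        if s.material and s.evidence_grade < s.admissibility_floor
    ]

def is_sound(settlements):
    """True when no material settlement rests on evidence below its floor."""
    return not violations(settlements)

ledger = [
    Settlement("ob-1", material=True, evidence_grade=3, admissibility_floor=2),
    # Non-material settlements are outside the property's scope.
    Settlement("ob-2", material=False, evidence_grade=0, admissibility_floor=2),
]
bad_ledger = ledger + [
    Settlement("ob-3", material=True, evidence_grade=1, admissibility_floor=2),
]
```

A single counterexample such as `ob-3` is enough to falsify the property, which is the sense in which the abstract calls it falsifiable against the spec.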

2606.08788 2026-06-09 cs.CV New submission

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

MaskAlign: 面向高效扩散训练的令牌子集表示对齐

Lianyu Pang, Tianlin Pan, Cheng Da, Changqian Yu, Huan Yang, Kun Gai, Song Guo, Wenhan Luo

Affiliations * The Hong Kong University of Science and Technology; Kuaishou Technology; University of Chinese Academy of Sciences

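AI summary: To address the token-level information mismatch that arises when aligning diffusion-model representations with those of pretrained vision models, the paper proposes MaskAlign, which applies alignment to randomly sampled token subsets and adds a pre-mask token mixing block to reduce information loss, improving training efficiency and generation quality.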

Details
AI Chinese abstract

与预训练视觉模型的表示对齐最近显示出加速扩散Transformer训练的潜力。通过将中间扩散特征与来自自监督视觉编码器的干净图像表示对齐,现有方法提高了收敛速度和生成质量。然而,这种对齐也引入了一个非平凡的约束:扩散模型处理噪声输入,其可用信息随时间步变化,而参考特征是从干净图像中提取的。在本文中,我们从令牌级角度重新审视这种不匹配。我们发现,在全令牌表示对齐下,具有较大对齐梯度范数的令牌表现出稳定的空间偏好,这表明对齐目标并非均匀影响所有令牌,可能鼓励模型依赖完整的干净图像令牌集。为了解决这个问题,我们提出了MaskAlign,一种令牌子集表示对齐方法,在训练期间对随机采样的令牌子集应用对齐。通过在不同迭代中向模型暴露不同的令牌子集,MaskAlign减少了表示对齐对完整令牌集的依赖,并鼓励在令牌子集扰动下更稳定的对齐行为。为了缓解直接丢弃令牌导致的信息损失,我们进一步引入了一个轻量级的预掩码令牌混合块,在掩码之前跨令牌共享信息。

English abstract

Representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By aligning intermediate diffusion features with clean-image representations from self-supervised vision encoders, existing methods improve convergence and generation quality. However, such alignment also introduces a non-trivial constraint: diffusion models operate on noisy inputs whose usable information varies across timesteps, while the reference features are extracted from clean images. In this paper, we revisit this mismatch from a token-level perspective. We find that, under full-token representation alignment, tokens with large alignment-gradient norms exhibit a stable spatial preference, suggesting that the alignment objective does not affect all tokens uniformly and may encourage the model to rely on the complete set of clean-image tokens. To address this issue, we propose MaskAlign, a token-subset representation alignment method that applies alignment to randomly sampled token subsets during training. By exposing the model to different token subsets across iterations, MaskAlign reduces the dependence of representation alignment on the complete token set and encourages alignment behavior that is more stable under token-subset perturbations. To mitigate the information loss caused by directly dropping tokens, we further introduce a lightweight pre-mask token mixing block that shares information across tokens before masking.
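The core idea in this abstract, applying the alignment objective to a randomly sampled subset of tokens instead of the full token set, can be sketched in a few lines. This is an illustration under stated assumptions, not the MaskAlign implementation: the negative cosine similarity loss is an assumed choice, the function names and `keep_ratio` parameter are hypothetical, and the pre-mask token mixing block is omitted.

```python
import numpy as np

def cosine_alignment_loss(diff_feats, ref_feats, eps=1e-8):
    """Mean negative cosine similarity between matching tokens (assumed loss form)."""
    a = diff_feats / (np.linalg.norm(diff_feats, axis=-1, keepdims=True) + eps)
    b = ref_feats / (np.linalg.norm(ref_feats, axis=-1, keepdims=True) + eps)
    return float(-(a * b).sum(axis=-1).mean())

def token_subset_alignment_loss(diff_feats, ref_feats, keep_ratio, rng):
    """Apply the alignment loss to a random subset of token positions.

    diff_feats, ref_feats: arrays of shape (num_tokens, dim).
    Returns the loss and the sampled token indices.
    """
    num_tokens = diff_feats.shape[0]
    num_keep = max(1, int(round(keep_ratio * num_tokens)))
    idx = rng.choice(num_tokens, size=num_keep, replace=False)
    return cosine_alignment_loss(diff_feats[idx], ref_feats[idx]), idx

rng = np.random.default_rng(0)
ref = rng.normal(size=(64, 32))               # stand-in for clean-image encoder tokens
noisy = ref + 0.5 * rng.normal(size=(64, 32)) # stand-in for diffusion features

loss_same, idx = token_subset_alignment_loss(ref, ref, keep_ratio=0.25, rng=rng)
loss_noisy, _ = token_subset_alignment_loss(noisy, ref, keep_ratio=0.25, rng=rng)
```

Because a fresh subset is drawn on every call, repeated training iterations would expose the model to different token subsets, which is the behavior the abstract describes.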

2606.08781 2026-06-09 cs.CV New submission

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

DeepMine-Mamba:缓解基于Mamba的状态空间模型在文档图像二值化中的信息稀释问题

Sheng-Wei Chan, Yung-Che Wang, Hsin-Jui Pan, Chia-Min Lin, Jen-Shiun Chiang

Affiliations * Department of Electrical and Computer Engineering, Tamkang University

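AI summary: Proposes the DeepMine-Mamba framework, whose Anti-Dilution Gate selectively restores stroke-sensitive local responses and suppresses irrelevant background enhancement, addressing the dilution of weak foreground cues in Mamba state space models for document binarization.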

Details
Comments
code will be released on https://github.com/henrychan0719/Deep-Mine-Mamba
AI Chinese abstract

文档图像二值化旨在从退化的背景中分离前景文本,同时保留细、断裂和低对比度的笔画。尽管深度学习方法提高了二值化性能,但大多数现有方法依赖于卷积、基于Transformer或生成架构,而基于Mamba的状态空间模型在此任务中尚未被充分探索。在这项工作中,我们研究了基于Mamba的特征传播,并观察到直接的状态空间传播可能会在长程建模过程中稀释弱前景线索,特别是淡墨迹、碎片化字符和边界敏感的笔画细节。为了解决这个问题,我们提出了DeepMine-Mamba,一个基于Mamba的二值化框架,配备了一种新颖的抗稀释门控机制,该机制估计传播引起的特征变化,并选择性地恢复笔画敏感的局部响应,同时抑制不必要的背景增强。在严格的留一年验证协议下,对DIBCO/H-DIBCO基准的实验表明,DeepMine-Mamba取得了具有竞争力的整体性能,在基准年份中具有强大的平均FM和Fps。消融结果进一步表明,抗稀释门控机制改善了笔画保留,并减少了感知上显著的二值化误差。

English abstract

Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin, broken, and low-contrast strokes. Although deep learning methods have improved binarization performance, most existing approaches rely on convolutional, transformer-based, or generative architectures, while Mamba-based state space models remain largely unexplored for this task. In this work, we investigate Mamba-based feature propagation and observe that direct state-space propagation may dilute weak foreground cues during long-range modeling, especially faint ink traces, fragmented characters, and boundary-sensitive stroke details. To address this problem, we propose DeepMine-Mamba, a Mamba-based binarization framework equipped with a novel Anti-Dilution Gate that estimates propagation-induced feature changes and selectively restores stroke-sensitive local responses while suppressing unnecessary background enhancement. Experiments on DIBCO/H-DIBCO benchmarks under a strict leave-one-year-out protocol show that DeepMine-Mamba achieves competitive overall performance, with strong average FM and Fps across benchmark years. Ablation results further demonstrate that the Anti-Dilution Gate improves stroke preservation and reduces perceptually significant binarization errors.
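For readers unfamiliar with the FM metric reported in this abstract, the standard F-measure for binarization treats text pixels as the positive class and combines precision and recall as FM = 2PR / (P + R). The sketch below computes it for a toy prediction; the function name and the toy arrays are illustrative, and the Fps metric (presumably the pseudo-F-measure used in DIBCO evaluations) is not sketched here.

```python
import numpy as np

def f_measure(pred, truth):
    """F-measure over boolean masks where True marks a foreground (text) pixel."""
    tp = np.logical_and(pred, truth).sum()    # text pixels correctly detected
    fp = np.logical_and(pred, ~truth).sum()   # background marked as text
    fn = np.logical_and(~pred, truth).sum()   # text pixels missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

truth = np.array([[1, 1, 0, 0],
                  [0, 1, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 0, 0],
                 [0, 1, 1, 0]], dtype=bool)

# Two hits, one false alarm, one miss: precision = recall = 2/3, so FM = 2/3.
score = f_measure(pred, truth)
perfect = f_measure(truth, truth)
```

Benchmark tables usually report this value as a percentage; the function above returns a fraction in [0, 1].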