arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1832
2510.12476 2026-05-01 cs.CL cs.AI

When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, Xiuying Chen

Comments Oral of ACL 2026

详情
英文摘要

Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the \textit{feature-inversion trap}, where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85\% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.

2510.04978 2026-05-01 cs.AI

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

Kun Xiang, Terry Jingchen Zhang, Yinya Huang, Jixi He, Zirong Liu, Yueling Tang, Ruizhe Zhou, Lijing Luo, Youpeng Wen, Xiuwei Chen, Bingqian Lin, Jianhua Han, Hang Xu, Hanhui Li, Bin Dong, Xiaodan Liang

详情
英文摘要

The rapid advancement of embodied intelligence and world models has intensified efforts to integrate physical laws into AI systems, yet physical perception and symbolic physics reasoning have developed along separate trajectories without a unified bridging framework. This work provides a comprehensive overview of physical AI, establishing clear distinctions between theoretical physics reasoning and applied physical understanding while systematically examining how physics-grounded methods enhance AI's real-world comprehension across structured symbolic reasoning, embodied systems, and generative models. Through rigorous analysis of recent advances, we advocate for intelligent systems that ground learning in both physical principles and embodied reasoning processes, transcending pattern recognition toward genuine understanding of physical laws. Our synthesis envisions next-generation world models capable of explaining physical phenomena and predicting future states, advancing safe, generalizable, and interpretable AI systems. We maintain a continuously updated resource at https://github.com/AI4Phys/Awesome-AI-for-Physics.

2509.25845 2026-05-01 cs.CV cs.AI

Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

Jinho Chang, Jaemin Kim, Jong Chul Ye

Comments Poster in ICLR 2026; 22 pages, 9 figures. The code is available at https://github.com/jinhojsk515/ITOC

详情
英文摘要

Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, leveraging this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.

2509.22017 2026-05-01 cs.LG

AEGIS: Authentic Edge Growth In Sparsity for Link Prediction in Edge-Sparse Bipartite Knowledge Graphs

Hugh Xuechen Liu, Kıvanç Tatar

详情
英文摘要

Bipartite knowledge graphs in niche domains are typically data-poor and edge-sparse, which hinders link prediction. We introduce AEGIS (Authentic Edge Growth In Sparsity), an edge-only augmentation framework that resamples existing training edges -either uniformly simple or with inverse-degree bias degree-aware -thereby preserving the original node set and sidestepping fabricated endpoints. To probe authenticity across regimes, we consider naturally sparse graphs (game design pattern's game-pattern network) and induce sparsity in denser benchmarks (Amazon, MovieLens) via high-rate bond percolation. We evaluate augmentations on two complementary metrics: AUC-ROC (higher is better) and the Brier score (lower is better), using two-tailed paired t-tests against sparse baselines. On Amazon and MovieLens, copy-based AEGIS variants match the baseline while the semantic KNN augmentation is the only method that restores AUC and calibration; random and synthetic edges remain detrimental. On the text-rich GDP graph, semantic KNN achieves the largest AUC improvement and Brier score reduction, and simple also lowers the Brier score relative to the sparse control. These findings position authenticity-constrained resampling as a data-efficient strategy for sparse bipartite link prediction, with semantic augmentation providing an additional boost when informative node descriptions are available.

2509.18308 2026-05-01 cs.CV

Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model

Yixin Zhang, Ryan Chamberlain, Lawrence Ngo, Kevin Kramer, Maciej A. Mazurowski

Comments Published in Journal of Imaging Informatics in Medicine. Model weights available at: https://github.com/mazurowski-lab/PulmonaryEmbolismSegmentation

详情
英文摘要

Pulmonary Embolism (PE) is a life-threatening condition for which accurate and timely detection is critical to patient care. However, our systematic study of PE segmentation algorithms reveals concerning limitations in the current state of research. Challenges such as small and inconsistent datasets, a lack of reproducible baselines, and limited comparative evaluation across models are hindering progress in the field. In this study, we curated a densely annotated dataset comprising 490 CTPA scans, each from a unique patient (430 for training and 60 for testing). We evaluated nine widely used segmentation architectures, including both CNN- and ViT-based models, in 2D and 3D configurations, using mean Dice Similarity Coefficient (mDSC) and Average Symmetric Surface Distance (ASSD) as evaluation metrics. Furthermore, the highest-performing model was evaluated on a public dataset without fine-tuning and achieved reasonable generalization performance. Our results show that: (1) a 3D U-Net with ResNet encoding blocks remains a highly effective architecture for PE segmentation; (2) 3D models consistently outperform their 2D counterparts; (3) across all architectures, when trained and evaluated on the same datasets, model error patterns are highly consistent; and (4) distal emboli remain particularly challenging due to both task complexity and the scarcity of high-quality datasets, highlighting the need for datasets with more comprehensive and consistent distal PE coverage. To promote research reproducibility, the architecture and pretrained weights of our best-performing model are publicly available at https://github.com/mazurowski-lab/PulmonaryEmbolismSegmentation

2509.15549 2026-05-01 cs.CL

M-DaQ: Retrieving Samples with Multilingual Diversity and Quality for Instruction Fine-Tuning Datasets

Chunguang Zhao, Yilun Liu, Pufan Zeng, Yuanchang Luo, Shimin Tao, Minggui He, Weibin Meng, Song Xu, Chen Liu, Hongxia Ma, Li Zhang, Boxing Chen, Daimeng Wei

Comments Accepted by SIGIR 2026 Short

详情
英文摘要

Multilingual instruction fine-tuning (IFT) empowers large language models to generalize across diverse linguistic and cultural contexts; however, high-quality, systematically curated multilingual IFT datasets remain scarce. To address this gap, we propose M-DaQ (Multilingual Diversity and Quality), a diversity-aware sampling framework that jointly optimizes instruction-response quality and cross-lingual semantic diversity. M-DaQ leverages a fine-tuned Quality Scoring Model alongside a maximal marginal relevance-inspired selection strategy to construct balanced, high-fidelity training data. Furthermore, we present the first systematic investigation of the Superficial Alignment Hypothesis in multilingual settings. Extensive evaluations across 18 languages demonstrate that models trained on M-DaQ-curated data achieve average win rates exceeding 60% against strong baselines on Alpaca-Eval and MT-Bench. Complementary human evaluations corroborate these gains, highlighting significant improvements in cultural relevance, contextual appropriateness, and instruction-following capability. The code are publicly released to facilitate reproducibility and future research.

2509.06033 2026-05-01 cs.CV

Analysis of Blood Report Images Using General Purpose Vision-Language Models

Nadia Bakhsheshi, Hamid Beigy

Comments 4 pages , 3 figures , This paper has been submitted to the IEEE-affiliated ICBME Conference (Iran), 2025, and is currently under review. DOR number: [20.1001.2.0425023682.1404.10.1.440.7]

详情
Journal ref
Proc. 2025 32nd ICBME, IEEE, 2026
英文摘要

The reliable analysis of blood reports is important for health knowledge, but individuals often struggle with interpretation, leading to anxiety and overlooked issues. We explore the potential of general-purpose Vision-Language Models (VLMs) to address this challenge by automatically analyzing blood report images. We conduct a comparative evaluation of three VLMs: Qwen-VL-Max, Gemini 2.5 Pro, and Llama 4 Maverick, determining their performance on a dataset of 100 diverse blood report images. Each model was prompted with clinically relevant questions adapted to each blood report. The answers were then processed using Sentence-BERT to compare and evaluate how closely the models responded. The findings suggest that general-purpose VLMs are a practical and promising technology for developing patient-facing tools for preliminary blood report analysis. Their ability to provide clear interpretations directly from images can improve health literacy and reduce the limitations to understanding complex medical information. This work establishes a foundation for the future development of reliable and accessible AI-assisted healthcare applications. While results are encouraging, they should be interpreted cautiously given the limited dataset size.

2509.05291 2026-05-01 cs.CL cs.AI cs.LG

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Deniz Bayazit, Aaron Mueller, Antoine Bosselut

Comments Accepted to ACL 2026

详情
英文摘要

Large language models (LLMs) learn non-trivial abstractions during pretraining, such as detecting irregular plural noun subjects. However, because traditional evaluation methods (e.g., benchmarking) fail to reveal how models acquire these concepts and capabilities, it is not well understood when and how these specific linguistic abilities emerge. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.

2508.21787 2026-05-01 cs.CL cs.AI

PiCSAR: Probabilistic Confidence Selection And Ranking for Reasoning Chains

Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen

详情
英文摘要

Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.

2508.13316 2026-05-01 cs.LG

Constraint-Aware Flow Matching via Randomized Exploration

Zhengyan Huan, Jacob Boerma, Li-Ping Liu, Shuchin Aeron

详情
Journal ref
Transactions on Machine Learning Research (TMLR), 2026
英文摘要

We consider the problem of designing constraint-aware flow matching (FM) models that address the issue of constraint violations commonly observed in vanilla generative models. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient than the corresponding one-stage approach. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions. Our code is available at https://github.com/ZhengyanHuan/FM-RE.

2508.13024 2026-05-01 cs.CL

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Ralph Peeters, Aaron Steiner, Luca Schwarz, Julian Yuya Caspary, Christian Bizer

Comments Accepted for publication at SIGIR 2026

详情
英文摘要

LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the users needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, the latter allowing for the exact reproduction of the experimental setup. While DeepShop and ShoppingComp provide online benchmarks that require agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, and Mind2Web cover only comparatively simple e-commerce tasks performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex retrieval tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced searches for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with the best-performing agents achieving task completion rates below 65% in the task categories cheapest product search and vague product search.

2508.12022 2026-05-01 cs.AI

AI Models for Depressive Disorder Detection and Diagnosis: A Review

Dorsa Macky Aleagha, Payam Zohari, Mostafa Haghir Chehreghani

详情
英文摘要

Major Depressive Disorder is one of the leading causes of disability worldwide, yet its diagnosis still depends largely on subjective clinical assessments. Integrating Artificial Intelligence (AI) holds promise for developing objective, scalable, and timely diagnostic tools. In this paper, we present a comprehensive survey of state-of-the-art AI methods for depression detection and diagnosis, based on a systematic review of 55 key studies. We introduce a novel hierarchical taxonomy that structures the field by primary clinical task (diagnosis vs. prediction), data modality (text, speech, neuroimaging, multimodal), and computational model class (e.g., graph neural networks, large language models, hybrid approaches). Our in-depth analysis reveals three major trends: the predominance of graph neural networks for modeling brain connectivity, the rise of large language models for linguistic and conversational data, and an emerging focus on multimodal fusion, explainability, and algorithmic fairness. Alongside methodological insights, we provide an overview of prominent public datasets and standard evaluation metrics as a practical guide for researchers. By synthesizing current advances and highlighting open challenges, this survey offers a comprehensive roadmap for future innovation in computational psychiatry.

2508.05469 2026-05-01 cs.LG cs.IT math.IT

Let's Measure Information Step-by-Step: AI-Based Evaluation Beyond Vibes

Zachary Robertson, Sanmi Koyejo

Comments Accepted to TMLR (2026)

详情
英文摘要

We evaluate artificial intelligence (AI) systems without ground truth by exploiting a link between strategic gaming and information loss. Building on established information theory, we analyze which mechanisms resist adversarial manipulation. This motivates mutual evaluation, where the overseer is treated as a strategic player estimating mutual information by prompting, making truthful agent reporting an optimal strategy. We show that certain f-divergences, such as total variation distance (TVD), maintain polynomial guarantees under attack, building on an established exponential barrier for estimating mutual information (MI) in worst-case certification settings. Under adversarial attacks, TVD-MI maintains effectiveness (area under the curve 0.70--0.77) while other approaches can decay toward chance, demonstrating that prompting the same system for information relationships rather than quality judgments can improve robustness. The mechanisms decompose pairwise evaluations into reliable item-level detection scores without ground truth, addressing a key limitation of standard peer prediction. Pre-registration: https://osf.io/c7pum

2507.14719 2026-05-01 cs.AI

Policy-Grounded Safety Evaluation of 20 Large Language Models

Juan Manuel Contreras

详情
英文摘要

As large language models (LLMs) become increasingly integrated into real-world applications, scalable and rigorous safety evaluation is essential. This paper introduces Aymara AI, a programmatic platform for generating and administering customized, policy-grounded safety evaluations. Aymara AI transforms natural-language safety policies into adversarial prompts and scores model responses using an AI-based rater validated against human judgments. We demonstrate its capabilities through the Aymara LLM Risk and Responsibility Matrix, which evaluates 20 commercially available LLMs across 10 real-world safety domains. Results reveal wide performance disparities, with mean safety scores ranging from 86.2% to 52.4%. While models performed well in well-established safety domains such as Misinformation (mean = 95.7%), they consistently failed in more complex or underspecified domains, notably Privacy & Impersonation (mean = 24.3%). Analyses of Variance confirmed that safety scores differed significantly across both models and domains (p < .05). These findings underscore the inconsistent and context-dependent nature of LLM safety and highlight the need for scalable, customizable tools like Aymara AI to support responsible AI development and oversight.

2507.13486 2026-05-01 cs.CV

Uncertainty Quantification Framework for Aerial and UAV Photogrammetry through Error Propagation

Debao Huang, Rongjun Qin

Comments 27 pages, 12 figures, this manuscript has been accepted to ISPRS Journal of Photogrammetry and Remote Sensing

详情
英文摘要

Uncertainty quantification of the photogrammetry process is essential for providing per-point accuracy credentials of the point clouds. Unlike airborne LiDAR, whose accuracy generally remains consistent with objects with varying geometric complexity, the accuracy of photogrammetric point clouds is rather object/scene-dependent, as it relies on algorithm-derived measurements. Generally, errors of the photogrammetric point clouds propagate through a two-step process: Structure-from-Motion (SfM) with Bundle adjustment (BA), followed by Multi-view Stereo (MVS). While uncertainty estimation in the SfM stage has been well studied using the first-order statistics of the reprojection error function, that in the MVS stage remains largely unsolved and non-standardized, primarily due to its non-differentiable and multi-modal nature (i.e., from pixel values to geometry). In this paper, we present an uncertainty quantification framework closing this gap by associating an error covariance matrix per point accounting for this two-step photogrammetry process. Specifically, to estimate the uncertainty in the MVS stage, we propose a novel, self-calibrating method by taking reliable n-view points (n>=6) per-view to regress the disparity uncertainty using highly relevant cues (such as matching cost values) from the MVS stage. Compared to existing approaches, our method uses self-contained, reliable 3D points extracted directly from the MVS process, with the benefit of being self-supervised and naturally adhering to error propagation path of the photogrammetry process, thereby providing a robust and certifiable uncertainty quantification across diverse scenes. We evaluate the framework using a variety of publicly available airborne and UAV imagery datasets. Results demonstrate that our method outperforms existing approaches by achieving high bounding rates without overestimating uncertainty.

2507.12414 2026-05-01 cs.CV cs.AI cs.LG cs.RO

AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models

Santosh Vasa, Aditi Ramadwar, Jnana Rama Krishna Darabattula, Md Zafar Anwar, Stanislaw Antol, Andrei Vatavu, Thomas Monninger, Sihao Ding

Comments Accepted to IV 2026 Drive-X Foundation Models for Autonomous Driving (Oral presentation)

详情
英文摘要

Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce high-quality datasets. However, manually reviewing large datasets is laborious and expensive. In this paper, we introduce AutoVDC (Automated Vision Data Cleaning) framework and investigate the utilization of Vision-Language Models (VLMs) to automatically identify erroneous annotations in vision datasets, thereby enabling users to eliminate these errors and enhance data quality. We validate our approach using the KITTI and nuImages datasets, which contain object detection benchmarks for autonomous driving. To test the effectiveness of AutoVDC, we create dataset variants with intentionally injected erroneous annotations and observe the error detection rate of our approach. Additionally, we compare the detection rates using different VLMs and explore the impact of VLM fine-tuning on our pipeline. The results demonstrate our method's high performance in error detection and data cleaning experiments, indicating its potential to significantly improve the reliability and accuracy of large-scale production datasets in autonomous driving.

2507.07986 2026-05-01 cs.LG cs.AI

EXPO: Stable Reinforcement Learning with Expressive Policies

Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn

Comments changed formatting, added code

详情
英文摘要

We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable value maximization. Unlike simpler Gaussian policies commonly used in online RL, expressive policies like diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against some value function. Our key insight is that we can address stable value maximization by avoiding direct optimization over value with the expressive policy and instead construct an on-the-fly RL policy to maximize Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that utilizes an on-the-fly policy to maximize value with two parameterized policies -- a larger expressive base policy trained with a stable imitation learning objective and a light-weight Gaussian edit policy that edits the actions sampled from the base policy toward a higher value distribution. The on-the-fly policy optimizes the actions from the base policy with the learned edit policy and chooses the value maximizing action from the base and edited actions for both sampling and temporal-difference (TD) backup. Our approach yields up to 2-3x improvement in sample efficiency on average over prior methods both in the setting of fine-tuning a pretrained policy given offline data and in leveraging offline data to train online.

2506.22500 2026-05-01 cs.CV cs.AI

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment

Weiyi Zhao, Xiaoyu Tan, Liang Liu, Sijia Li, Youwei Song, Xihe Qiu

Comments 13 pages, 5 figures. The dataset and appendix are available at https://github.com/zgg2577/VS-KC

详情
英文摘要

Automated identification of surgical safety risks is critical for improving patient outcomes; however, Multimodal Large Language Models (MLLMs) frequently suffer from Visual-Semantic Knowledge Conflicts (VS-KC), a phenomenon where models possess safety knowledge but fail to activate it during visual inspection. Investigating this alignment gap in operating rooms (ORs) is impeded by a critical bottleneck: the scarcity and privacy constraints of real-world OR data depicting safety violations. To address this, we introduce OR-VSKC, a benchmark for studying VS-KC and surgical risk perception in strictly regulated OR environments. Constructed via our Protocol-to-Pixel Generative Framework, OR-VSKC comprises 28,190 high-fidelity synthetic images grounded in authoritative safety standards, complemented by a 713-image expert-authored challenge subset validated by multiple experts. The full benchmark is built from authentic OR contexts drawn from the 4D-OR and CAMMA-MVOR datasets, where the 4D-OR-based portion serves as the primary benchmark core and the CAMMA-MVOR-based portion is reserved for external validation and cross-dataset generalization analysis. Evaluations of state-of-the-art MLLMs reveal substantial reliability gaps even in advanced generalist models. Furthermore, experiments show that fine-tuning on OR-VSKC effectively mitigates VS-KC and enables robust generalization to unseen camera viewpoints. We open-source the code and dataset to support reproducible research in safety-critical medical environments. The source code is available at https://github.com/zgg2577/VS-KC.

2506.08332 2026-05-01 cs.AI

ORFS-agent: Tool-Using Agents for Chip Design Optimization

Amur Ghose, Andrew B. Kahng, Sayak Kundu, Zhiang Wang

详情
英文摘要

Machine learning has been widely used to optimize complex engineering workflows across numerous domains. In integrated circuit design, modern flows (e.g., register-transfer level to physical layout) involve extensive configuration via thousands of parameters, and small changes can have large downstream impacts on design performance, power, and area. Recent advances in Large Language Models (LLMs) offer new opportunities for learning and reasoning within such high-dimensional optimization tasks. In this work, we introduce ORFS-agent, an LLM-based iterative optimization agent that automates parameter tuning in an open-source hardware design flow. ORFS-agent adaptively explores parameter configurations, demonstrating improvements over standard Bayesian optimization approaches in terms of resource efficiency and final design metrics. Across six benchmarks on ASAP7 and SKY130HD, thinking-model backends (Sonnet 4.6 [69] and Kimi K2.5 [28]) improve the geometric-mean normalized wirelength, effective clock period, and co-optimization objectives by up to 1.0%, 1.3%, and 2.7% over OR-AutoTuner while using 40% fewer iterations; the open-weight Kimi K2.5 remains within 0.24% of Sonnet 4.6, enabling private deployment. Relative to the earlier Sonnet 3.5 backend, these thinking models improve the same objectives by up to 7.5%, 3.1%, and 4.0%. Optional retrieval tools accelerate early convergence but do not improve final endpoints. By following natural language objectives to trade off certain metrics for others, ORFS-agent demonstrates a flexible and interpretable framework for multi-objective and constrained optimization. Crucially, ORFS-agent is modular and model-agnostic, and can be plugged into any frontier LLM without any further fine-tuning. We also report checkpoint-aligned trajectories and reasoning summaries that document the agent's decision process.

2506.07179 2026-05-01 cs.LG cs.AI

Efficient Traffic Forecasting on Large-Scale Road Network by Regularized Adaptive Graph Convolution

Kaiqi Wu, Weiyang Kong, Sen Zhang, Zitong Chen, Yubao Liu

Comments Accepted by ICDE 2026

详情
英文摘要

Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. To model the complex spatial-temporal dependencies in traffic data, Spatial-Temporal Graph Convolutional Networks (STGCNs) have been widely employed, achieving advanced performance. However, when applied to large-scale road networks, the quadratic computational complexity of traditional graph convolution operations severely limits their scalability. Several methods attempt to address this issue through approximation, compression, or spatial partitioning. Nevertheless, these methods often either fail to achieve sufficient computational efficiency or compromise prediction accuracy. To address these challenges, we propose a Regularized Adaptive Graph Convolution (RAGC) model. First, to ensure scalability on large road networks, we develop the Efficient Cosine Operator (ECO), which performs graph convolution based on the cosine similarity of node embeddings with linear time complexity. Second, we introduce a regularized adaptive graph convolution framework that combines Stochastic Shared Embedding (SSE) and adaptive graph convolution through a residual difference mechanism. This design enables the model to learn high-quality node embeddings, thereby improving prediction accuracy while maintaining computational efficiency. Extensive experiments on four large-scale real-world traffic datasets show that RAGC consistently outperforms state-of-the-art methods in terms of prediction accuracy and exhibits competitive computational efficiency. The code is available at: https://github.com/wkq-wukaiqi/RAGC.

2506.02671 2026-05-01 cs.CV

Test-Time Distillation for Continual Model Adaptation

Xiao Chen, Jiazhen Huang, Zhiming Liu, Qinting Jiang, Fanding Huang, Jingyan Jiang, Zhi Wang

Comments Accepted by CVPR 2026 Findings

详情
英文摘要

Deep neural networks often suffer performance degradation upon deployment due to distribution shifts. Continual Test-Time Adaptation (CTTA) aims to address this issue in an unsupervised manner. However, existing methods that rely on self-supervision are prone to an inherent self-referential feedback loop that amplifies initial prediction errors, leading to model drift. We revisit this limitation and propose Test-Time Distillation (TTD), which reframes adaptation as a distillation process guided by a frozen Vision-Language Model (VLM) as an external signal. While promising, we find that direct distillation is fraught with two pitfalls: (1) the Generalist Trap, where the VLM's broad but non-specialized knowledge leads to suboptimal performance on specific tasks and shifts; and (2) the Entropy Bias, where naive model fusion techniques based on entropy fail due to the disparate calibration of heterogeneous models. These pitfalls highlight the need to build a robust supervisory signal and leverage it to guide the target model toward stable adaptation. Hence, we present CoDiRe, a Continual Distillation and Rectification framework for TTD. CoDiRe first constructs a robust blended teacher by dynamically fusing the predictions of the VLM and the target model. Critically, it circumvents the Entropy Bias by leveraging Maximum Softmax Probability (MSP) as a more reliable confidence metric for weighting each model's expertise. Then it applies an Optimal Transport-based rectification to further align predictions with the blended teacher, enabling continuous and stable adaptation. Extensive experiments show that CoDiRe outperforms state-of-the-art baselines, exceeding CoTTA by 10.55% with only 48% of its time cost on ImageNet-C. Project page is publicly available at https://github.com/walawalagoose/TTD.

2506.02515 2026-05-01 cs.CL cs.AI cs.LG

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov

Comments 24 pages, includes 12 figures and 9 tables; introduces the FinChain benchmark and ChainEval metric

详情
英文摘要

Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning steps required for transparency and verification. To address this gap, we introduce FinChain, the first benchmark specifically designed for verifiable Chain-of-Thought evaluation in finance. FinChain spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python code that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FinChain exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI. This project is available at https://github.com/mbzuai-nlp/finchain.git.

2505.24848 2026-05-01 cs.CV cs.LG

Reading Recognition in the Wild

Charig Yang, Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx, Lambert Mathias, Kiran Somasundaram, Luis Pesqueira, James Fort, Sheroze Sheriffdeen, Omkar Parkhi, Carl Ren, Mi Zhang, Yuning Chai, Richard Newcombe, Hyo Jin Kim

Comments NeurIPS 2025. Project Page: https://www.projectaria.com/datasets/reading-in-the-wild/

详情
英文摘要

To enable egocentric contextual AI in always-on smart glasses, it is crucial to be able to keep a record of the user's interactions with the world, including during reading. In this paper, we introduce a new task of reading recognition to determine when the user is reading. We first introduce the first-of-its-kind large-scale multimodal Reading in the Wild dataset, containing 100 hours of reading and non-reading videos in diverse and realistic scenarios. We then identify three modalities (egocentric RGB, eye gaze, head pose) that can be used to solve the task, and present a flexible transformer model that performs the task using these modalities, either individually or combined. We show that these modalities are relevant and complementary to the task, and investigate how to efficiently and effectively encode each modality. Additionally, we show the usefulness of this dataset towards classifying types of reading, extending current reading understanding studies conducted in constrained settings to larger scale, diversity and realism.

2505.22798 2026-05-01 cs.LG cs.AI cs.CR

Efficient Preimage Approximation for Neural Network Certification

Anton Björklund, Mykola Zaitsev, Paolo Morettin, Marta Kwiatkowska

Comments Code available at https://github.com/Anton-Bjorklund/Premap2

详情
英文摘要

The growing reliance on artificial intelligence in safety- and security-critical applications is raising concerns about the robustness of neural networks to erroneous or adversarial input. Certification is a methodology for ensuring model trustworthiness by providing formal guarantees on model behaviour. While most verification methods focus on worst-case analysis by bounding the network output, an alternative approach based on approximating the preimage can complement such analysis by estimating the proportion of inputs that satisfy a given specification. However, existing preimage-based methods, such as the state-of-the-art PREMAP, are limited to fully connected neural networks of moderate dimensionality. In this paper, we introduce PREMAP2, a collection of algorithmic extensions to PREMAP that enhance its scalability and efficiency through improved branching heuristics, adaptive Monte Carlo sampling, and reverse bound propagation. We further endow PREMAP2 with additional functionality such as support for non-uniform priors and confidence intervals. These advances enable the application of PREMAP2 to previously intractable settings, including real-world patch attacks against convolutional neural networks, where adversarial stickers or lighting conditions obscure parts of images. We showcase the effectiveness of our approach across several use cases, including certifying reliability, robustness, interpretability, and fairness, on domains ranging from computer vision to control tasks. Our implementation is available as open-source software.

2505.22701 2026-05-01 cs.CV

Frequency-Adaptive Discrete Cosine-ViT-ResNet Architecture for Sparse-Data Vision

Ziyue Kang, Weichuan Zhang

Comments This preprint has been withdrawn by the primary author after identifying issues in the dataset validation process, including noisy and potentially out-of-domain samples. These issues may affect the reliability of the reported experimental results. The author therefore considers withdrawal to be the most responsible course of action and appreciates the readers' understanding

详情
英文摘要

A major challenge in rare animal image classification is the scarcity of data, as many species usually have only a small number of labeled samples. To address this challenge, we designed a hybrid deep-learning framework comprising a novel adaptive DCT preprocessing module, ViT-B16 and ResNet50 backbones, and a Bayesian linear classification head. To our knowledge, we are the first to introduce an adaptive frequency-domain selection mechanism that learns optimal low-, mid-, and high-frequency boundaries suited to the subsequent backbones. Our network first captures image frequency-domain cues via this adaptive DCT partitioning. The adaptively filtered frequency features are then fed into ViT-B16 to model global contextual relationships, while ResNet50 concurrently extracts local, multi-scale spatial representations from the original image. A cross-level fusion strategy seamlessly integrates these frequency- and spatial-domain embeddings, and the fused features are passed through a Bayesian linear classifier to output the final category predictions. On our self-built 50-class wildlife dataset, this approach outperforms conventional CNN and fixed-band DCT pipelines, achieving state-of-the-art accuracy under extreme sample scarcity.

2505.17056 2026-05-01 cs.CL cs.AI

From Test-taking to Cognitive Scaffolding: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests

Luoxi Tang, Tharunya Sundar, Yuqiao Meng, Shuai Yang, Ankita Patra, Lakshmi Manohar Chippada, Jiqian Zhao, Yi Li, Weicheng Ma, Zhaohan Xi

详情
英文摘要

As large language models (LLMs) are increasingly integrated into educational tools, current evaluations on standardized tests predominantly focus on binary outcome accuracy. Instead, an effective AI tutor must exhibit faithful reasoning, elucidate solution strategies, and diagnose specific human misconceptions. To bridge this gap, we introduce a pedagogical diagnostic framework that models English Standardized Test (EST) problem-solving as a traversal through a cognitive framework. Based on this framework, we present ESTBook, a multimodal benchmark encompassing 10,576 questions and 29 task types across five major exams. Unlike traditional datasets, ESTBook goes beyond data aggregation by enriching questions with formalized reasoning trajectories and distractor rationales that capture specific cognitive traps. Through extensive evaluations, we empirically demonstrate the practical utility of our diagnostic framework, showing that identifying cognitive trajectories facilitates the mitigation of performance gap and improves pedagogical reasoning through guided elicitation.

2504.12612 2026-05-01 cs.AI cs.CR cs.MA

Chronology of Multi-Agent Interactions for Provenance of Evolving Information

Ching-Chun Chang, Isao Echizen

详情
Journal ref
Royal Society Open Science (2026)
英文摘要

Provenance is the chronological history of things, resonating with the fundamental pursuit to uncover origins, trace connections, and situate entities within the flow of space and time. As artificial intelligence advances towards autonomous agents capable of interactive collaboration on complex tasks, the provenance of generated content becomes entangled in the interplay of collective creation, where contributions are continuously revised, extended or overwritten. In a multi-agent generative chain, content undergoes successive transformations, often leaving little, if any, trace of prior contributions. In this study, we investigate the problem of tracking multi-agent provenance across the temporal dimension of generation. We propose a chronological system for post hoc attribution of generative history from content alone, without reliance on internal memory states or external meta-information. At its core lies the notion of symbolic chronicles, representing signed and time-stamped records, in a form analogous to the chain of custody in forensic science. The system operates through a feedback loop, whereby each generative timestep updates the chronicle of prior interactions and synchronises it with the synthetic content in the very act of generation. This research seeks to develop an accountable form of collaborative artificial intelligence within evolving cyber ecosystems.

2503.22676 2026-05-01 cs.CV

TranSplat: Instant Object Relighting in Gaussian Splatting via Spherical Harmonic Radiance Transfer

Boyang Tony Yu, Yanlin Jin, Yun He, Akshat Dave, Ravi Ramamoorthi, Guha Balakrishnan

详情
英文摘要

We present TranSplat, a method for instant, accurate object relighting within the Gaussian Splatting (GS) framework. Rather than relying on costly inverse rendering routines, we propose a BRDF-free radiance transfer strategy that analytically modulates the spherical harmonic (SH) appearance coefficients of an object's 2D Gaussian surfels using per-normal irradiance ratios derived from source and target environment maps. To handle view-dependent and glossy appearances without explicit material estimation, we introduce a specularity-aware dual-path SH transfer strategy that adapts higher-order SH bands in the reflection domain. Additionally, we propose a lightweight SH-domain self-shadowing module to ensure physically realistic occlusion without explicit mesh raycasting. Operating as a post-processing step, TranSplat requires no additional GS retraining for a pair of source and target scenes. Evaluations on synthetic and real-world objects demonstrate state-of-the-art accuracy, outperforming recent inverse-rendering and diffusion-based GS relighting methods across most conditions, all while completing relighting operations in under one second. Although bounded by radially symmetric BRDF approximations and the low-pass nature of the SH basis, TranSplat produces perceptually realistic renderings even for glossy, complex materials, establishing a valuable, lightweight path forward for GS relighting.

2503.12844 2026-05-01 cs.CV

GuideDog: A Real-World Egocentric Multimodal Dataset for Blind and Low-Vision Accessibility-Aware Guidance

Junhyeok Kim, Jaewoo Park, Junhee Park, Sangeyl Lee, Jiwan Chung, Jisung Kim, Ji Hoon Joung, Youngjae Yu

Comments ACL 2026 Main. Project page: https://jun297.github.io/GuideDog/

详情
英文摘要

For people affected by blindness and low vision (BLV), safe and independent navigation remains a major challenge, impacting over 2.2 billion individuals worldwide. Although multimodal large language models (MLLMs) offer new opportunities for assistive navigation, progress has been limited by the scarcity of accessibility-aware datasets, because creating them requires labor-intensive expert annotation. To this end, we introduce GuideDog, a novel dataset containing 22K image-description pairs (2K human-verified) capturing real-world pedestrian scenes across 46 countries. Our human-AI pipeline shifts annotation from generation to verification, grounded in established BLV guidance standards from experts and research, improving scalability while maintaining quality. We also present GuideDogQA, an 818-sample benchmark evaluating object recognition and depth perception. Experiments reveal that depth perception and adherence to these standards remain challenging for current MLLMs.

2503.07878 2026-05-01 cs.CV cs.AI

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

Rahul Nair, Bhanu Tokas, Hannah Kerner

Comments Accepted at WACV 2026. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2026

详情
英文摘要

When we train models on biased datasets, they not only reproduce data biases, but can worsen them at test time - a phenomenon called bias amplification. Many of the current bias amplification metrics (e.g., BA (MALS), DPA) measure bias amplification only in classification datasets. These metrics are ineffective for image captioning datasets, as they cannot capture the language semantics of a caption. Recent work introduced Leakage in Captioning (LIC), a language-aware bias amplification metric that understands caption semantics. However, LIC has a crucial limitation: it cannot identify the source of bias amplification in captioning models. We propose Directional Bias Amplification in Captioning (DBAC), a language-aware and directional metric that can identify when captioning models amplify biases. DBAC has two more improvements over LIC: (1) it is less sensitive to sentence encoders (a hyperparameter in language-aware metrics), and (2) it provides a more accurate estimate of bias amplification in captions. Our experiments on gender and race attributes in the COCO captions dataset show that DBAC is the only reliable metric to measure bias amplification in captions.