arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1423
专题追踪
2511.02463 2026-02-05 cs.AI

Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation

Mengyu Zhang, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu

Comments 8 pages

详情
英文摘要

Reinforcement Learning with Verifiable Rewards(RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs). However, its success has thus far been largely confined to the mathematical and programming domains with clear and automatically checkable outcomes. Reinforcement learning on open-ended tasks (e.g., creative writing and subjective Q&A) continues to rely on reward models due to the absence of verifiable solutions. This raises a key question: how can we extend RLVR to strengthen reasoning in open-ended tasks regardless of the absence of the unambiguous ground truth? To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation for Reinforcement Learning from Verifiable Rewards (VMR-RLVR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective training even in the absence of explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across seven open-ended benchmarks, our VMR-RLVR training delivers an average gain of 3.29 points over the RL with reward model.

2511.00640 2026-02-05 cs.AI cs.CL cs.LG

DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching

Zicheng Xu, Xiuyi Lou, Guanchu Wang, Yu-Neng Chuang, Feng Luo, Guangyao Zheng, Alexander S. Szalay, Zirui Liu, Vladimir Braverman

详情
英文摘要

Large Reasoning Models (LRMs) achieve remarkable inference-time improvements through parallel thinking. However, existing methods rely on redundant sampling of reasoning trajectories, failing to effectively explore the reasoning space to uncover high-quality solutions. To address these limitations, we propose Decoding Tree Sketching (DTS), a plug-and-play decoding framework for structural multi-trajectory exploration and reasoning selection. For reasoning exploration, DTS sketches a backbone tree of the reasoning space by selectively branching at decision tokens. For reasoning selection, guided by length-accuracy anti-correlation, DTS designs an early termination to prioritize short and reliable trajectories during decoding. Experimental results across four LRMs and datasets demonstrate that DTS significantly enhances accuracy by 14% and reduces repetitive generation by 8% on average. Notably, DTS enables smaller models to outperform larger models with 10$\times$ the size, highlighting its potential to strengthen reasoning capabilities.

2510.24366 2026-02-05 cs.CV

Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation

Hoang-Thien Nguyen, Thanh-Huy Nguyen, Ba-Thinh Lam, Vi Vu, Bach X. Nguyen, Jianhua Xing, Tianyang Wang, Xingjian Li, Min Xu

Comments The paper is published at Pattern Recognition Journal

详情
英文摘要

Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.

2510.19435 2026-02-05 cs.SD math.AT nlin.AO physics.data-an physics.soc-ph

Time delay embeddings to characterize the timbre of musical instruments using Topological Data Analysis: a study on synthetic and real data

Gakusei Sato, Hiroya Nakao, Riccardo Muolo

详情
英文摘要

Timbre allows us to distinguish between sounds even when they share the same pitch and loudness, playing an important role in music, instrument recognition, and speech. Traditional approaches, such as frequency analysis or machine learning, often overlook subtle characteristics of sound. Topological Data Analysis (TDA) can capture complex patterns, but its application to timbre has been limited, partly because it is unclear how to represent sound effectively for TDA. In this study, we investigate how different time delay embeddings affect TDA results. Using both synthetic and real audio signals, we identify time delays that enhance the detection of harmonic structures. Our findings show that specific delays, related to fractions of the fundamental period, allow TDA to reveal key harmonic features and distinguish between integer and non-integer harmonics. The method is effective for synthetic and real musical instrument sounds and opens the way for future works, which could extend it to more complex sounds using higher-dimensional embeddings and additional persistence statistics.

2510.13060 2026-02-05 cs.LG cs.GT math.OC stat.ML

Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games

Anupam Nayak, Tong Yang, Osman Yagan, Gauri Joshi, Yuejie Chi

详情
英文摘要

Reverse Kullback-Leibler (KL) divergence-based regularization with respect to a fixed reference policy is widely used in modern reinforcement learning to preserve the desired traits of the reference policy and sometimes to promote exploration (using uniform reference policy, known as entropy regularization). Beyond serving as a mere anchor, the reference policy can also be interpreted as encoding prior knowledge about good actions in the environment. In the context of alignment, recent game-theoretic approaches have leveraged KL regularization with pretrained language models as reference policies, achieving notable empirical success in self-play methods. Despite these advances, the theoretical benefits of KL regularization in game-theoretic settings remain poorly understood. In this work, we develop and analyze algorithms that provably achieve improved sample efficiency under KL regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG, which also uses best response sampling and a novel concept of superoptimistic bonuses. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $β$ in addition to the traditional $\widetilde{\mathcal{O}}(\sqrt{T})$ regret without the $β^{-1}$ dependence.

2510.13025 2026-02-05 cs.LG cs.SY eess.SY

Information Shapes Koopman Representation

Xiaoyuan Cheng, Wenxuan Yuan, Yiming Yang, Yuanzhao Zhang, Sibo Cheng, Yi He, Zhuo Sun

Comments Published as a conference paper at ICLR 2026

详情
英文摘要

The Koopman operator provides a powerful framework for modeling dynamical systems and has attracted growing interest from the machine learning community. However, its infinite-dimensional nature makes identifying suitable finite-dimensional subspaces challenging, especially for deep architectures. We argue that these difficulties come from suboptimal representation learning, where latent variables fail to balance expressivity and simplicity. This tension is closely related to the information bottleneck (IB) dilemma: constructing compressed representations that are both compact and predictive. Rethinking Koopman learning through this lens, we demonstrate that latent mutual information promotes simplicity, yet an overemphasis on simplicity may cause latent space to collapse onto a few dominant modes. In contrast, expressiveness is sustained by the von Neumann entropy, which prevents such collapse and encourages mode diversity. This insight leads us to propose an information-theoretic Lagrangian formulation that explicitly balances this tradeoff. Furthermore, we propose a new algorithm based on the Lagrangian formulation that encourages both simplicity and expressiveness, leading to a stable and interpretable Koopman representation. Beyond quantitative evaluations, we further visualize the learned manifolds under our representations, observing empirical results consistent with our theoretical predictions. Finally, we validate our approach across a diverse range of dynamical systems, demonstrating improved performance over existing Koopman learning methods. The implementation is publicly available at https://github.com/Wenxuan52/InformationKoopman.

2510.12780 2026-02-05 cs.SD cs.CL

Content Anonymization for Privacy in Long-form Audio

Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews

Comments Accepted to ICASSP 2026; v2: added more related work, used a more speech-adapted content-attack model, added a github link to code/prompts

详情
英文摘要

Voice anonymization techniques have been found to successfully obscure a speaker's acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual's vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose a new approach that performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.

2510.04670 2026-02-05 cs.AI

Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing

Xuanhua Yin, Runkai Zhao, Weidong Cai

Comments 7 pages, 4 figures, accepted by ICASSP 2026

详情
英文摘要

Naturalistic fMRI encoding must handle multimodal inputs, shifting fusion styles, and pronounced inter-subject variability. We introduce AFIRE (Agnostic Framework for Multimodal fMRI Response Encoding), an agnostic interface that standardizes time-aligned post-fusion tokens from varied encoders, and MIND, a plug-and-play Mixture-of-Experts decoder with a subject-aware dynamic gating. Trained end-to-end for whole-brain prediction, AFIRE decouples the decoder from upstream fusion, while MIND combines token-dependent Top-K sparse routing with a subject prior to personalize expert usage without sacrificing generality. Experiments across multiple multimodal backbones and subjects show consistent improvements over strong baselines, enhanced cross-subject generalization, and interpretable expert patterns that correlate with content type. The framework offers a simple attachment point for new encoders and datasets, enabling robust, plug-and-improve performance for naturalistic neuroimaging studies.

2510.02712 2026-02-05 cs.CL cs.AI cs.LG

Time-To-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks

Yubo Li, Ramayya Krishnan, Rema Padman

详情
英文摘要

Large Language Models (LLMs) have revolutionized conversational AI, yet their robustness in extended multi-turn dialogues remains poorly understood. Existing evaluation frameworks focus on static benchmarks and single-turn assessments, failing to capture the temporal dynamics of conversational degradation that characterize real-world interactions. In this work, we present a large-scale survival analysis of conversational robustness, modeling failure as a time-to-event process over 36,951 turns from 9 state-of-the-art LLMs on the MT-Consistency benchmark. Our framework combines Cox proportional hazards, Accelerated Failure Time (AFT), and Random Survival Forest models with simple semantic drift features. We find that abrupt prompt-to-prompt semantic drift sharply increases the hazard of inconsistency, whereas cumulative drift is counterintuitively \emph{protective}, suggesting adaptation in conversations that survive multiple shifts. AFT models with model-drift interactions achieve the best combination of discrimination and calibration, and proportional hazards checks reveal systematic violations for key drift covariates, explaining the limitations of Cox-style modeling in this setting. Finally, we show that a lightweight AFT model can be turned into a turn-level risk monitor that flags most failing conversations several turns before the first inconsistent answer while keeping false alerts modest. These results establish survival analysis as a powerful paradigm for evaluating multi-turn robustness and for designing practical safeguards for conversational AI systems.

2510.02250 2026-02-05 cs.AI cs.CL cs.CV cs.LG

Scaling Agents for Computer Use

Gonzalo Gonzalez-Pumariega, Vincent Tu, Chih-Lun Lee, Jiachen Yang, Ang Li, Xin Eric Wang

Comments 21 pages, 7 figures, 13 tables

详情
英文摘要

Computer-use agents (CUAs) hold promise for automating everyday digital tasks, but their performance on long-horizon, complex problems remains unreliable. Single-rollout execution is brittle, with small errors compounding over time and leading to high variance in outcomes. While prior work has attempted to scale within a single rollout, such approaches have yielded limited gains. Scaling over multiple rollouts offers a more promising alternative but doing so effectively is challenging due to the difficulty of evaluating and selecting among long-horizon agent behaviors. We introduce Behavior Judge (BJudge), which addresses this challenge by representing agent executions as behavior narratives and comparing candidate behaviors at this level, substantially improving robustness and success rates. Using multiple rollouts, BJudge establishes a new state of the art (SoTA) in OSWorld at 72.6%, significantly outperforming prior methods and surpassing human-level performance at 72.36%, with comprehensive ablations validating key design choices. We further demonstrate strong generalization results to different operating systems on WindowsAgentArena and AndroidWorld. Crucially, our results highlight the strong effectiveness of scaling CUAs, when you do it right: effective scaling requires structured trajectory understanding and selection, and BJudge provides a practical framework to achieve this.

2509.26634 2026-02-05 cs.CL eess.AS

Scaling Spoken Language Models with Syllabic Speech Tokenization

Nicholas Lee, Cheol Jun Cho, Alan W Black, Gopala K. Anumanchipalli

Comments ICASSP 2026

详情
英文摘要

Spoken language models (SLMs) typically discretize speech into high-frame-rate tokens extracted from SSL speech models. As the most successful LMs are based on the Transformer architecture, processing these long token streams with self-attention is expensive, as attention scales quadratically with sequence length. A recent SSL work introduces acoustic tokenization of speech at the syllable level, which is more interpretable and potentially more scalable with significant compression in token lengths (4-5 Hz). Yet, their value for spoken language modeling is not yet fully explored. We present the first systematic study of syllabic tokenization for spoken language modeling, evaluating models on a suite of SLU benchmarks while varying training data scale. Syllabic tokens can match or surpass the previous high-frame rate tokens while significantly cutting training and inference costs, achieving more than a 2x reduction in training time and a 5x reduction in FLOPs. Our findings highlight syllable-level language modeling as a promising path to efficient long-context spoken language models.

2509.25495 2026-02-05 cs.SD cs.AI

EMO-TTA: Improving Test-Time Adaptation of Audio-Language Models for Speech Emotion Recognition

Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao

Comments Accepted to ICASSP 2026

详情
英文摘要

Speech emotion recognition (SER) with audio-language models (ALMs) remains vulnerable to distribution shifts at test time, leading to performance degradation in out-of-domain scenarios. Test-time adaptation (TTA) provides a promising solution but often relies on gradient-based updates or prompt tuning, limiting flexibility and practicality. We propose Emo-TTA, a lightweight, training-free adaptation framework that incrementally updates class-conditional statistics via an Expectation-Maximization procedure for explicit test-time distribution estimation, using ALM predictions as priors. Emo-TTA operates on individual test samples without modifying model weights. Experiments on six out-of-domain SER benchmarks show consistent accuracy improvements over prior TTA baselines, demonstrating the effectiveness of statistical adaptation in aligning model predictions with evolving test distributions.

2509.25458 2026-02-05 cs.AI

Plug-and-Play Emotion Graphs for Compositional Prompting in Zero-Shot Speech Emotion Recognition

Jiacheng Shi, Hongfei Du, Y. Alicia Hong, Ye Gao

Comments Accepted to ICASSP 2026

详情
英文摘要

Large audio-language models (LALMs) exhibit strong zero-shot performance across speech tasks but struggle with speech emotion recognition (SER) due to weak paralinguistic modeling and limited cross-modal reasoning. We propose Compositional Chain-of-Thought Prompting for Emotion Reasoning (CCoT-Emo), a framework that introduces structured Emotion Graphs (EGs) to guide LALMs in emotion inference without fine-tuning. Each EG encodes seven acoustic features (e.g., pitch, speech rate, jitter, shimmer), textual sentiment, keywords, and cross-modal associations. Embedded into prompts, EGs provide interpretable and compositional representations that enhance LALM reasoning. Experiments across SER benchmarks show that CCoT-Emo outperforms prior SOTA and improves accuracy over zero-shot baselines.

2509.21826 2026-02-05 cs.CL

ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models

Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Guojun Yin, Wei Lin, Ran He

Comments Accepted by ICLR2026

详情
英文摘要

Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose \textbf{Res}haped \textbf{T}oken-level policy gradients (\textbf{ResT}) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to $8.76\%$. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by $4.11\%$ on single-turn tasks and $1.50\%$ on multi-turn base tasks. Code is available at https://github.com/1229095296/ResT_Tool_use_LLM.git.

2509.21058 2026-02-05 cs.LG

SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion

Sedjro Salomon Hotegni, Sebastian Peitz

详情
英文摘要

Developing efficient multi-objective optimization methods to compute the Pareto set of optimal compromises between conflicting objectives remains a key challenge, especially for large-scale and expensive problems. To bridge this gap, we introduce SPREAD, a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). SPREAD first learns a conditional diffusion process over points sampled from the decision space and then, at each reverse diffusion step, refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence alongside a Gaussian RBF-based repulsion term for diversity. Empirical results on multi-objective optimization benchmarks, including offline and Bayesian surrogate-based settings, show that SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage. Code is available at https://github.com/safe-autonomous-systems/moo-spread .

2509.20628 2026-02-05 cs.CV

Recov-Vision: Linking Street View Imagery and Vision-Language Models for Post-Disaster Recovery

Yiming Xiao, Archit Gupta, Miguel Esparza, Yu-Hsuan Ho, Antonia Sebastian, Hannah Weas, Rose Houck, Ali Mostafavi

Comments 20 pages, 10 figures

详情
英文摘要

Building-level occupancy after disasters is vital for triage, inspections, utility re-energization, and equitable resource allocation. Overhead imagery provides rapid coverage but often misses facade and access cues that determine habitability, while street-view imagery captures those details but is sparse and difficult to align with parcels. We present FacadeTrack, a street-level, language-guided framework that links panoramic video to parcels, rectifies views to facades, and elicits interpretable attributes (for example, entry blockage, temporary coverings, localized debris) that drive two decision strategies: a transparent one-stage rule and a two-stage design that separates perception from conservative reasoning. Evaluated across two post-Hurricane Helene surveys, the two-stage approach achieves a precision of 0.927, a recall of 0.781, and an F-1 score of 0.848, compared with the one-stage baseline at a precision of 0.943, a recall of 0.728, and an F-1 score of 0.822. Beyond accuracy, intermediate attributes and spatial diagnostics reveal where and why residual errors occur, enabling targeted quality control. The pipeline provides auditable, scalable occupancy assessments suitable for integration into geospatial and emergency-management workflows.

2509.17489 2026-02-05 cs.CL cs.AI

MapCoder-Lite: Distilling Multi-Agent Coding into a Single Small LLM

Woongkyu Lee, Junhee Cho, Jungwook Choi

详情
英文摘要

Large language models (LLMs) have advanced code generation from single-function tasks to competitive-programming problems, but existing multi-agent solutions either rely on costly large-scale (>30B) models or collapse when downsized to small open-source models. We present MapCoder-Lite, a framework for distilling the complex reasoning of large, multi-agent coding systems into a single 7B model. Our contribution is a novel, three-pillar methodology that synergistically generates, refines, and encodes multi-agent knowledge: (i) pass-based trajectory distillation from strong LLMs fixes format fragility in retrieval and reduces failures in debugging, (ii) supervisor-guided correction with global feedback strengthens planning and coding agents, and (iii) agent-wise LoRA fine-tuning delivers memory-efficient specialisation. Comprehensive evaluation on xCodeEval, APPS, and CodeContests shows that MapCoder-Lite more than doubles xCodeEval accuracy (from 13.2% to 28.3%), eliminates all format failures, while reducing GPU memory and token-generation time by 4x compared to a 32B model. It also achieves over 10% gains on simpler coding benchmarks, demonstrating broad improvements beyond competitive programming. These results demonstrate that careful agent-wise fine-tuning unleashes high-quality multi-agent coding on a small language model. Our code is publicly available at https://github.com/aiha-lab/MapCoder-Lite.

2509.16479 2026-02-05 cs.CV cs.AI

Thermal Imaging-based Real-time Fall Detection using Motion Flow and Attention-enhanced Convolutional Recurrent Architecture

Christopher Silver, Thangarajah Akilan

详情
英文摘要

Falls among seniors are a major public health issue. Existing solutions using wearable sensors, ambient sensors, and RGB-based vision systems face challenges in reliability, user compliance, and practicality. Studies indicate that stakeholders, such as older adults and eldercare facilities, prefer non-wearable, passive, privacy-preserving, and real-time fall detection systems that require no user interaction. This study proposes an advanced thermal fall detection method using a Bidirectional Convolutional Long Short-Term Memory (BiConvLSTM) model, enhanced with spatial, temporal, feature, self, and general attention mechanisms. Through systematic experimentation across hundreds of model variations exploring the integration of attention mechanisms, recurrent modules, and motion flow, we identified top-performing architectures. Among them, BiConvLSTM achieved state-of-the-art performance with a ROC-AUC of $99.7\%$ on the TSF dataset and demonstrated robust results on TF-66, a newly emerged, diverse, and privacy-preserving benchmark. These results highlight the generalizability and practicality of the proposed model, setting new standards for thermal fall detection and paving the way toward deployable, high-performance solutions.

2509.12434 2026-02-05 cs.AI

Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization

Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

详情
英文摘要

Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce EntroPO, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. EntroPO augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate EntroPO by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters).To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model free approaches. On the SWEBENCH leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B parameter model trained with EntroPO ranks 1st on SWEBENCH-LITE and 4th on SWEBENCH-VERIFIED on the open-weight leaderboard, surpassed only by models with over 10x more parameters(e.g., >$350B).

2509.09843 2026-02-05 cs.LG cs.AI

HGEN: Heterogeneous Graph Ensemble Networks

Jiajun Shen, Yufei Jin, Yi He, Xingquan Zhu

Comments The paper is in proceedings of the 34th IJCAI Conference, 2025

详情
英文摘要

This paper presents HGEN that pioneers ensemble learning for heterogeneous graphs. We argue that the heterogeneity in node types, nodal features, and local neighborhood topology poses significant challenges for ensemble learning, particularly in accommodating diverse graph learners. Our HGEN framework ensembles multiple learners through a meta-path and transformation-based optimization pipeline to uplift classification accuracy. Specifically, HGEN uses meta-path combined with random dropping to create Allele Graph Neural Networks (GNNs), whereby the base graph learners are trained and aligned for later ensembling. To ensure effective ensemble learning, HGEN presents two key components: 1) a residual-attention mechanism to calibrate allele GNNs of different meta-paths, thereby enforcing node embeddings to focus on more informative graphs to improve base learner accuracy, and 2) a correlation-regularization term to enlarge the disparity among embedding matrices generated from different meta-paths, thereby enriching base learner diversity. We analyze the convergence of HGEN and attest its higher regularization magnitude over simple voting. Experiments on five heterogeneous networks validate that HGEN consistently outperforms its state-of-the-art competitors by substantial margin.

2508.15610 2026-02-05 cs.AI cs.LG

Transduction is All You Need for Structured Data Workflows

Alfio Gliozzo, Naweed Khan, Christodoulos Constantinides, Nandana Mihindukulasooriya, Nahuel Defosse, Gaetano Rossiello, Junkyu Lee

Comments 38 pages, 5 figures

详情
英文摘要

This paper introduces Agentics, a functional agentic AI framework for building LLM-based structured data workflow pipelines. Designed for both research and practical applications, Agentics offers a new data-centric paradigm in which agents are embedded within data types, enabling logical transduction between structured states. This design shifts the focus toward principled data modeling, providing a declarative language where data types are directly exposed to large language models and the data values are composed through transductions between input and output types. We present a range of structured data workflow tasks and empirical evidence demonstrating the effectiveness of this approach, including data wrangling, text-to-SQL semantic parsing, and domain-specific multiple-choice question answering, and data-driven scientific discovery tasks.

2508.13131 2026-02-05 cs.CL cs.LG stat.ML

Improving Detection of Watermarked Language Models

Dara Bahri, John Wieting

Comments Published at TMLR 2026

详情
英文摘要

Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.

2508.10542 2026-02-05 cs.CV

Beyond Global Scanning: Adaptive Visual State Space Modeling for Salient Object Detection in Optical Remote Sensing Images

Mengyu Ren, Yutong Li, Hua Li, Chuhong Wang, Runmin Cong

详情
英文摘要

Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose an adaptive state space context network (ASCNet), which builds upon the state space model mechanism to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we design a Multi-Level Context Module (MLCM), which module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model's structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the Adaptive Patchwise Visual State Space (APVSS) block as the decoder of ASCNet, which integrates our proposed Dynamic Adaptive Granularity Scan (DAGS) and Granularity-aware Propagation Module (GPM). It performs adaptive patch scanning on feature maps enhanced by local perception, thereby capturing rich local region information and enhancing state space model's local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.

2508.09100 2026-02-05 cs.LG cs.AI

Towards Universal Neural Likelihood Inference

Shreyas Bhat Brahmavar, Yang Li, Qiyang Liu, Shashank Srivastava, Junier Oliva

详情
英文摘要

We introduce universal neural likelihood inference (UNLI): enabling a single model to provide data-grounded, conditional likelihood predictions for arbitrary targets given any collection of observed features, across diverse domains and tasks. To achieve UNLI over heterogeneous tabular data, we develop the Arbitrary Set-based Permutation-Invariant Reasoning Engine (ASPIRE) model. Our design addresses critical gaps in existing approaches to merge semantic-understanding capabilities and generalised numerical feature reasoning within a zero-shot capable framework. Trained on over 1,400 real diverse datasets spanning various domains, ASPIRE achieves 15\% higher F1 scores and 85\% lower RMSE than existing tabular foundation models in zero-shot and few-shot settings. Lastly, this work introduces open-world active feature acquisition, where we leverage the UNLI capabilities of ASPIRE to adeptly determine next feature-values to observe to improve inference time prediction accuracies.

2508.04441 2026-02-05 cs.CV

Benchmarking Foundation Models for Mitotic Figure Classification

Jonas Ammeling, Jonathan Ganz, Emely Rosbach, Ludwig Lausser, Christof A. Bertram, Katharina Breininger, Marc Aubreville

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:003

Journal ref Machine.Learning.for.Biomedical.Imaging. 2026 (2026)

详情
英文摘要

The performance of deep learning models is known to scale with data quantity and diversity. In pathology, as in many other medical imaging domains, the availability of labeled images for a specific task is often limited. Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks, i.e., foundation models, that can address the limited data problem by providing semantically rich feature vectors that can generalize well to new tasks with minimal training effort increasing model performance and robustness. In this work, we investigate the use of foundation models for mitotic figure classification. The mitotic count, which can be derived from this classification task, is an independent prognostic marker for specific tumors and part of certain tumor grading systems. In particular, we investigate the data scaling laws on multiple current foundation models and evaluate their robustness to unseen tumor domains. Next to the commonly used linear probing paradigm, we also adapt the models using low-rank adaptation (LoRA) of their attention mechanisms. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers. Our results demonstrate that LoRA-adapted foundation models provide superior performance to those adapted with standard linear probing, reaching performance levels close to 100% data availability with only 10% of training data. Furthermore, LoRA-adaptation of the most recent foundation models almost closes the out-of-domain performance gap when evaluated on unseen tumor domains. However, full fine-tuning of traditional architectures still yields competitive performance.

2508.04428 2026-02-05 cs.AI

Building Scaffolding Dialogue Data with LLM-Simulated Novices

Si Chen, Izzy Molnar, Ting Hua, Peiyu Li, Le Huy Khiem, G. Alex Ambrose, Jim Lang, Ronald Metoyer, Nitesh V. Chawla

详情
英文摘要

High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding -- the process by which an expert supports a novice's thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM's persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real novice participants. Our results reveal that persona traits, such as extroversion and introversion, meaningfully influence how experts engage. Compared to real mentoring recordings, SimInstruct dialogues demonstrate comparable pedagogical relevance and cognitive depth. Experts also reported the process as engaging and reflective, improving both data quality and their own professional insight. We further fine-tuned a LLaMA model to be an expert model using the augmented dataset, which outperformed GPT-4o in instructional quality. Our analysis highlights GPT-4o's limitations in weak reflective questioning, overuse of generic praise, a condescending tone, and a tendency to overwhelm novices with excessive suggestions.

2508.03365 2026-02-05 cs.SD cs.AI cs.CR eess.AS

When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

Hiskias Dingeto, Taeyoun Kwon, Dasol Choi, Bodam Kim, DongGeon Lee, Haon Park, JaeHoon Lee, Jongho Shin

详情
英文摘要

As large language models (LLMs) become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that manipulates state-of-the-art audio language models to generate harmful content. Our method embeds harmful payloads as subtle perturbations into audio inputs that remain intelligible to human listeners. The first stage uses a novel reward-based white-box optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to jailbreak the target model and elicit harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use gradient-based optimization to embed subtle perturbations into benign audio carriers, such as weather queries or greeting messages. Our method achieves average attack success rates of 60-78% across two benchmarks and five multimodal LLMs, validated by multiple evaluation frameworks. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating multimodal AI systems.

2507.14843 2026-02-05 cs.LG cs.AI cs.CL

The Invisible Leash: Why RLVR May or May Not Escape Its Origin

Fang Wu, Weihao Xuan, Ximing Lu, Mingjie Liu, Yi Dong, Zaid Harchaoui, Yejin Choi

详情
英文摘要

Recent advances highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing LLMs' capabilities. However, it remains unclear whether the current practice of RLVR truly expands a model's reasoning boundary or mainly amplifies high-reward outputs that the base model already knows, thereby improving precision. This study presents an empirical investigation that provides fresh insights into the limits of RLVR. We examine how RLVR can operate as a support-constrained optimization mechanism that may restrict the discovery of entirely original solutions, remaining constrained by the base model's initial distribution. We also identify an entropy-reward trade-off: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves \texttt{pass@1}, \textit{the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets}, failing to recover correct answers that were previously accessible to the base model. Interestingly, while RLVR sometimes increases token-level entropy, it results in greater uncertainty at each generation step and declining answer-level entropy. This indicates that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, we reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash requires future innovations that seed probability mass into underrepresented solution regions.

2507.13551 2026-02-05 cs.CL cs.AI

Reading Between the Lines: Combining Pause Dynamics and Semantic Coherence for Automated Assessment of Thought Disorder

Feng Chen, Weizhe Xu, Changye Li, Serguei Pakhomov, Alex Cohen, Simran Bhola, Sandy Yin, Sunny X Tang, Michael Mackinley, Lena Palaniyappan, Dror Ben-Zeev, Trevor Cohen

详情
英文摘要

Formal thought disorder (FTD), a hallmark of schizophrenia spectrum disorders, manifests as incoherent speech and poses challenges for clinical assessment. Traditional clinical rating scales, though validated, are resource-intensive and lack scalability. Automated speech recognition (ASR) allows for objective quantification of linguistic and temporal features of speech, offering scalable alternatives. Furthermore, ASR-derived utterance timestamps provide access to pause dynamics, which are thought to reflect the cognitive processes underlying speech production. Yet, their added value beyond semantic measures remains insufficiently explored. In this study, we evaluated a scalable multimodal framework that integrates pause features with semantic coherence metrics across three datasets: naturalistic self-recorded diaries (AVH), structured picture descriptions (TOPSY), and dream narratives (PsyCL). Pause-related features were evaluated alongside established coherence measures using support vector regression to predict clinical FTD scores. Models using pause features alone robustly predict manually rated FTD severity consistently across datasets. Integrating pause features with semantic coherence metrics enhanced predictive performance compared to coherence-only models, with late fusion yielding the most robust and consistent gains in all three datasets. On average across datasets, Spearman correlation increased from \r{ho} = 0.413 for semantic-only models to \r{ho} = 0.455 with late fusion. The performance gains from semantic and pause features integration held consistently across all contexts, though the nature of the most informative pause patterns was dataset-dependent. These findings suggest that both pause dynamics and semantic coherence reflect complementary aspects of thought disorganization.

2507.07855 2026-02-05 cs.LG cs.AI cs.CL

DPO Unchained: Your Training Algorithm is Secretly Disentangled in Human Choice Theory

Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard

详情
英文摘要

Normative theories allow one to elicit key parts of a ML algorithm from first principles, which is crucial at a time of championed scrutiny for ML work. Direct Preference Optimization (DPO) cleverly bypasses reward modeling by making an explicit link with a specific normative model of human choice. Our paper elevates this connection to the full generality of DPO's normative framework. Getting there requires reworking human choice theory's textbook path for a better RLHF/ML fit. It elevates the connection to a remarkably broad viewpoint on preference optimization, considering the current panorama of DPO follow-ups. It also unveils unexpected riches for ML, chief among which the support for non-convex losses, the fact that any compliant ML analytical choice can be embedded with any human choice model, and a normative framework's umbrella wide enough to safeguard DPO's extensions (margins, length correction, ...). A toy experiment ``far away'' from the DPO crowd is given.