arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2088
2605.05707 2026-05-08 cs.RO

On the Emergence of Pendular Structure in Multi-Contact Locomotion

Lingxue Lyu, Zihui Liu

详情
英文摘要

LIPM is everywhere in legged-locomotion control, but almost always as a modeling choice rather than as something the controller's cost actually prefers. This note tries to make that link more explicit. Working from a small centroidal OCP that penalizes the rate of angular momentum, we look at what its optimum tends to look like. Three things come out. With full-rank stance, the optimum drifts toward a pendular force pattern at a rate determined by the SVD of the moment Jacobian; the constant is set by foot-span geometry and matches the experiments to within 16%. With N=2 stance, as in trot, the friction cone introduces a lower bound on $\|\dot{H}_G\|$ that no amount of weight tuning fixes; we also see a non-smooth feasibility kink at a critical horizontal acceleration that we can write in closed form. Adding a task term that asks for a nonzero $\dot{H}_G$ moves the optimum off the pendular set in a predictable way. None of this is far from the classical ZMP/DCM picture. We test these claims on a point-mass quadruped and on the Unitree Go1 in MuJoCo (open-loop QP and a torque-level closed-loop controller), and we note where the asymptotic story stops being a good description of what the closed loop actually does.

2605.05706 2026-05-08 cs.AI q-bio.QM

Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine

Peisong Zhang, Manqiang Peng, Yuxuan Wu, Pawit Phadungsaksawasdi, Wesley Yeung, Ye Zhang, Trang Nguyen, Qiang Zhang, Nan Liu, Meng Wang, Kee Yuan Ngiam, Yih-Chung Tham, Ching-Yu Cheng, Tianfan Fu, Qingyu Chen, Rosemary Ke, Chang Li, Wenzhuo Yang, Zhenghao Lu, Chunyou Lai, Yu Zhang, Sheng Zhong, Hao Deng, Dianbo Liu

详情
英文摘要

Estimating individualized treatment effects from longitudinal observational data is central to data-driven medicine, yet existing methods face a fundamental limitation: reducing confounding bias often suppresses clinically informative heterogeneity, degrading patient-specific predictions. Here, we identify this tension as a bias-precision paradox in causal representation learning and introduce sampling-based maximum mean discrepancy (sMMD), a stochastic alignment strategy that replaces global adversarial balancing with subset-level matching. We instantiate this approach in a framework for counterfactual outcome prediction with attribution-grounded interpretability. Across two large-scale ICU cohorts (n = 27,783), our framework improves accuracy under distribution shift, reducing error by up to 11.5% and substantially increasing recall in high-risk tasks. Mechanistic analyses show that sMMD selectively preserves clinically decisive variables. In human-AI evaluation, our method outperforms clinicians-in-training and large language models, and improves clinician accuracy by 14.7% while reducing decision time, enabling interpretable, real-time clinical decision support.

2605.05702 2026-05-08 cs.AI

Knowledge-Graph Paths as Intermediate Supervision for Self-Evolving Search Agents

Huyu Wu, Jun Liu, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu

详情
英文摘要

Self-evolving search agents reduce reliance on human-written training questions by generating and solving their own search tasks. We build on Search Self-Play (SSP), a representative Proposer and Solver framework in which questions are generated and answered via multi-step search and reasoning. In practice, however, SSP faces two bottlenecks: the Proposer constructs questions from isolated answer entities without relational context, yielding many invalid or unverifiable questions in early self-play training, while the Solver receives only a binary outcome reward that discards useful signal from partially on-track search trajectories. We address both bottlenecks by reusing knowledge-graph paths as construction-derived intermediate supervision for both question construction and reward shaping. First, we ground question construction in LLM-guided knowledge-graph subgraphs, providing relational context for the Proposer. Second, we observe that constructing and solving a multi-hop question can involve overlapping intermediate entities: the factual bridges used to formulate the question may provide approximate waypoints for answering it. Exploiting this overlap, we introduce Waypoint Coverage Reward (WCR), which grants graded partial credit to incorrect Solver trajectories according to their coverage of entities on the construction path, while preserving full reward for correct answers. Across seven QA benchmarks and nine model configurations, our approach improves the average score over standard SSP in all configurations, including notable gains on multi-hop QA tasks. These results suggest that knowledge-graph paths can be reused as lightweight intermediate supervision, providing both relational guidance and process feedback without additional task-specific human annotations or manually labeled process steps.

2605.05701 2026-05-08 cs.AI

Inference-Time Budget Control for LLM Search Agents

Zhengru Fang, Senkang Forest Hu, Zhonghao Chang, Yu Guo, Yihang Tao, Hongyao Liu, Mengzhe Ruan, Jun Huang, Yuguang Fang

详情
英文摘要

LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.

2605.05697 2026-05-08 cs.LG cs.AI

Budgeted Attention Allocation: Cost-Conditioned Compute Control for Efficient Transformers

Amrit Nidhi

Comments 12 pages, 1 figure, 10 tables

详情
英文摘要

Transformers usually expose one inference cost per trained model, while deployed systems often need multiple cost-quality operating points. We study Budgeted Attention Allocation, a monotone head-gating mechanism conditioned on a requested attention budget. Dense warm-starting is important for stability: on a robust synthetic sequence task, one budgeted model reaches 99.7% accuracy at 0.303 estimated attention cost and 100.0% accuracy at 0.504 cost. On held-out AG News with a custom word-level transformer, hard-gate adaptation turns soft cost control into measured single-thread CPU speed, reaching 82.1% accuracy with 1.28x speedup at budget 0.50. In pretrained BERT-Mini AG News, budgeted structural pruning reaches 87.6% accuracy with 1.20x speedup at budget 0.50; a validation-ranked zero-shot dense post-hoc structural baseline reaches 86.1%, and one recovery epoch raises that per-budget specialist to 87.9%. On DBpedia14, BERT-Mini budgeted gates reach 97.4% at exact budget 0.50 versus 96.6% for dense full attention. Static fixed-budget gates and recovered dense specialists remain strong. The contribution is therefore not universal dominance, but a reproducible feasibility study of one controllable checkpoint across budgets that can trade attention cost for accuracy and be converted into measured structural speedups on small CPU benchmarks.

2605.05694 2026-05-08 cs.CV

Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition

Xiwen Luo, Jia Li, Rencheng Song, Yu Liu, Juan Cheng

Comments The source code will be available at https://github.com/MSA-LMC/SCPT

详情
英文摘要

Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens within a frozen Vision Transformer (ViT). This design enables effective cross-modal interaction while preserving the generalizable facial representations learned by the pretrained backbone. In addition, we introduce a decoupled shared-specific adapter (DSSA) into each ViT layer to explicitly separate subject-shared and subject-specific components, thereby improving cross-subject generalization. Experiments on the MAHNOB-HCI and DEAP benchmarks demonstrate that the proposed method consistently outperforms strong baselines in both recognition accuracy and generalization ability, highlighting its effectiveness for video-based emotion recognition.

2605.05692 2026-05-08 cs.CV cs.AI cs.CR

CFE-PPAR: Compression-friendly encryption for privacy-preserving action recognition leveraging video transformers

Haiwei Lin, Shoko Imaizumi, Hitoshi Kiya

Comments 6 pages, 5 figures, accepted to 2026 IEEE International Conference on Image Processing (ICIP)

详情
英文摘要

Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaining high recognition performance. However, these methods lead to a catastrophic decrease in recognition performance and visual quality when the encrypted videos are compressed. That is, the previous methods are not compression-friendly. To address these issues, in this paper, we propose the first compression-friendly encryption method for PPAR, called CFE-PPAR. In CFE-PPAR, videos encrypted with secret keys can be directly recognized by a video transformer, which uses parameters transformed by the same keys as those used for video encryption. In experiments, it is verified that CFE-PPAR outperforms previous methods on the UCF101 and HMDB51 datasets under Motion-JPEG and H.264 compression.

2605.05689 2026-05-08 cs.AI

GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

Shaozhen Ma, Wei Huang, Hanchen Wang, Dong Wen, Wenjie Zhang

详情
英文摘要

Conditional generative models, particularly diffusion-based methods, have recently been applied to graph prediction by modeling the target as a conditional distribution given the input graph, yielding competitive results compared to deterministic predictor. However, existing diffusion-based prediction methods typically require expensive iterative denoising at inference and often suffer from unstable sampling, which motivates recent efforts to reduce inference denoising steps and enable stable sampling via techniques such as consistency training. Despite this progress, we find that existing consistency training methods for graph prediction could potentially fall into a shortcut solution: the model may attempt to satisfy the self-consistency constraint by ignoring the noisy target (i.e., assigning it negligible weight), ultimately collapsing into a purely deterministic predictor. To mitigate such shortcut solution, we propose GCCM, a graph contrastive consistency model that goes beyond isolated pairwise matching between the same target at different noise levels by introducing negative pairs into a contrastive consistency objective. This adds an additional separation requirement, making the shortcut solution no longer trivially sufficient to satisfy the proposed objective. Moreover, we apply feature perturbation to the input node/edge features to break identical conditioning on the input graph, so that the shortcut no longer yields the same predictions across noise levels and becomes less attractive. Extensive experiments on benchmark datasets demonstrate that GCCM mitigates the shortcut solution and yields consistent performance improvements in graph prediction compared to deterministic predictors.

2605.05688 2026-05-08 cs.CV

R2H-Diff: Guided Spectral Diffusion Model for RGB-to-Hyperspectral Reconstruction

Songyu Ding, Ronggiang Zhao, Mingchun Sun, Jie Liu

详情
英文摘要

RGB-to-hyperspectral image reconstruction is a highly ill-posed inverse problem, since multiple plausible spectral distributions may correspond to the same RGB observation. Existing regression-based methods usually learn a deterministic mapping, which limits their ability to model reconstruction uncertainty and often leads to over-smoothed spectral responses. Although diffusion models provide strong distribution modeling capability, their direct application to hyperspectral reconstruction remains challenging due to the high spectral dimensionality, strong inter-band correlations, and strict requirement for spectral fidelity. To this end, we propose R2H-Diff, an efficient diffusion-based framework tailored for RGB-to-HSI reconstruction. Specifically, R2H-Diff formulates spectral recovery as a conditional iterative refinement process, enabling progressive reconstruction under RGB guidance. We proposed a Guided Spectral Refinement Module for RGB-conditioned feature fusion and a Hyperspectral-Adaptive Transposed Attention module for efficient spatial--spectral dependency modeling. Furthermore, a normalization-free denoising backbone is adopted to preserve spectral amplitude consistency, while a task-adapted linear noise schedule enables high-quality reconstruction with only five denoising steps. Extensive experiments on NTIRE2022, CAVE, and Harvard demonstrate that R2H-Diff achieves a favorable balance between reconstruction quality and computational efficiency. Notably, on NTIRE2022, R2H-Diff obtains 35.37 dB PSNR with a sub-million-parameter model of 0.58M parameters and 12.25G FLOPs, achieving the lowest model complexity among the evaluated methods while maintaining strong reconstruction fidelity.

2605.05687 2026-05-08 cs.AI

DataDignity: Training Data Attribution for Large Language Models

Xiaomin Li, Andrzej Banburski-Fahey, Jaron Lanier

详情
英文摘要

Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.

2605.05685 2026-05-08 cs.LG cs.AI stat.ML

Temporal Functional Circuits: From Spline Plots to Faithful Explanations in KAN Forecasting

Naveen Mysore

Comments 9 pages, 4 figures, 6 tables, plus appendix. Under review at NeurIPS 2026

详情
英文摘要

Unlike MLPs, Kolmogorov-Arnold Networks (KANs) expose explicit learnable edge functions on every connection, enabling mechanistic explanation in time-series forecasting. This paper introduces Temporal Functional Circuits, a framework that transforms KAN edge functions from latent visualizations into faithful, temporally grounded explanations. Built on a gated residual KAN that decomposes forecasts into a linear base and a sparsely activated KAN correction, the framework (i) maps each edge to input lags via output-aware attribution, (ii) ranks edges by learned activation range, and (iii) validates faithfulness through edge-level interventions including zeroing and spline removal. Removing the learned B-spline component while retaining the base SiLU term degrades forecasts, providing evidence that the spline shape itself carries predictive value beyond the base activation. On four synthetic regimes of increasing complexity, the learned gate opens progressively wider as signal complexity grows. On regime-switching signals, gated KAN achieves 59% lower MSE than linear-only models. Across eight benchmarks, the gated architecture is competitive with linear, attention, and MLP alternatives, while providing interpretable edge functions that MLP-based corrections cannot offer.

2605.05678 2026-05-08 cs.AI

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

Xiaomin Li, Jianheng Hou, Zheyuan Deng, Zhiwei Zhang, Taoran Li, Binghang Lu, Bing Hu, Yunhan Zhao, Yuexing Hao

详情
英文摘要

Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources, we evaluate 15 open-weight and API-based LRMs across 41K prompts per model. Reasoning traces consistently reveal additional safety risks beyond final answers, especially in high-severity stage-wise failures: leak cases, where unsafe reasoning precedes a safe-looking answer, and escape cases, where benign-looking reasoning precedes an unsafe final response. Principle-level analysis shows that risk concentrates in misinformation, legal compliance, discrimination, physical harm, and psychological harm. We further propose adaptive multi-principle steering, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates only directions whose current hidden state is closer to the unsafe than safe centroid. On three steerable open reasoning models, adaptive steering reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. DeepSeek-R1-Qwen-7B achieves a 40.8% average unsafe-count reduction while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU. These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.

2605.05676 2026-05-08 cs.CL cs.AI

Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning

Bing Wang, Ximing Li, Changchun Li, Jinjin Chi, Gang Niu, Masashi Sugiyama

Comments Accepted by ICML 2026. 25 pages, 13 figures. Code: https://github.com/wangbing1416/BADIT

详情
英文摘要

Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that the cross-task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross-task interference.

2605.05668 2026-05-08 cs.AI cs.CV

Large Vision-Language Models Get Lost in Attention

Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin, Xiaoshuai Hao, Kun Wang, Wendong Wang

Comments 25 pages, 10 figures. Accepted by ICML 2026

详情
英文摘要

Despite the rapid evolution of training paradigms, the decoder backbone of large vision--language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively ``get lost in attention'' rather than efficiently leveraging visual context.

2605.05664 2026-05-08 cs.CV

Sparse-to-Complete: From Sparse Image Captures to Complete 3D Scenes

Yiyang Shen, Yin Yang, Kun Zhou, Tianjia Shao

详情
Journal ref
SIGGRAPH 2026 Conf. Proc
英文摘要

We introduce S2C-3D, a novel sparse-view 3D reconstruction framework for high-fidelity and complete scene reconstruction from as few as six to eight images. Our framework features three components: a specialized diffusion model for scene-specific image restoration, a training-free view-consistency conditioned sampling process in the diffusion model for refined Gaussian optimization, and a camera trajectory planning scheme to ensure comprehensive scene coverage. The specialized diffusion model is developed by finetuning a pretrained architecture on the input views and their corresponding degraded counterparts. The adaptation to the scene distribution allows the model to repair Gaussian renderings while effectively eliminating domain gaps. Meanwhile, the trajectory planning scheme optimizes scene coverage by connecting each newly sampled camera to its two nearest neighbors. By iteratively constructing paths and retaining only those that significantly enhance visibility, the scheme establishes a trajectory that covers the entire scene. To address multi-view conflicts, the view-consistency conditioned sampling process quantifies the consistency between neighboring repaired images. This information is injected as a condition into the sampling process of the frozen diffusion model, facilitating the generation of view-consistent images without additional training. Consequently, our approach produces high-fidelity 3D Gaussians that are robust to artifacts. Experimental results demonstrate that S2C-3D outperforms state-of-the-art methods, constructing high-quality scenes that are free from missing regions, blurring, or other artifacts with very sparse inputs. The source code and data are available at https://gapszju.github.io/S2C-3D.

2605.05662 2026-05-08 cs.CL cs.AI

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

Dasol Choi, Eugenia Kim, Jaewon Noh, Sang Seo, Eunmi Kim, Myunggyo Oh, Yunjin Park, Brigitta Jesica Kartono, Josef Pichlmeier, Helena Berndt, Sai Krishna Mendu, Glenn Johannes Tungka, Özlem Gökçe, Suresh Gehlot, Katherine Pratt, Amanda Minnich, Haon Park

详情
英文摘要

Current LLM safety benchmarks are predominantly English-centric and often rely on translation, failing to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench. a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark where local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and dual independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness do not show a coupled relationship among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.

2605.05660 2026-05-08 cs.LG math.OC

Distributionally Robust Multi-Objective Optimization

Yufeng Yang, Fangning Zhuo, Ziyi Chen, Heng Huang, Yi Zhou

Comments 47 pages

详情
英文摘要

Multi-objective optimization (MOO) has received growing attention in applications that require learning under multiple criteria. However, the existing MOO formulations do not explicitly account for distributional shifts in the data. We introduce distributionally robust multi-objective optimization (DR-MOO), which minimizes multiple objectives under their respective worst-case distributions. We propose Pareto-type solution concepts for DR-MOO and develop multi-gradient descent algorithms (MGDA) with provable guarantees. Leveraging a Lagrangian dual reformulation, we first design a double-loop MGDA that uses an inner loop to estimate dual variables and achieves a total sample complexity $\mathcal{O}(ε^{-12})$ for reaching an $ε$-Pareto-stationary point. To further improve efficiency, we incorporate gradient clipping to handle generalized-smooth and biased gradient estimates, removing the need for double sampling. This yields a single-loop double-clip MGDA with substantially improved sample complexity $\mathcal{O}(ε^{-4})$. Our theory applies to the nonconvex setting and does not require bounded objectives or gradients. Experiments demonstrate that our methods are competitive with state-of-the-art MGDA baselines.

2605.05659 2026-05-08 cs.LG

Structural Correspondence and Universal Approximation in Diagonal plus Low-Rank Neural Networks

Ying Chen, Aoxi Li, Jihun Kim, Javad Lavaei

Comments 27 pages, 6 figures

详情
英文摘要

The massive computational costs of scaling modern deep learning architectures have driven the widespread use of parameter-efficient low-rank structures, such as LoRA and low-rank factorization. However, theoretical guarantees for their expressive power are less explored, often relying on restrictive priors like a pretrained base matrix, ReLU activations or non-verifiable singularity conditions. We first investigate the limits of neural networks constrained strictly to low-rank manifolds without pretrained dense priors. We demonstrate a theoretical paradox: while purely rank-1 layers can exactly interpolate arbitrary scalar datasets, they collapse for function approximations. To overcome this bottleneck without surrendering parameter efficiency, we introduce a unified \textit{Structural Correspondence} framework. We prove that augmenting low-rank layers with only a minimal sparse diagonal component, say a Diagonal plus Low-Rank (DLoR) structure, is sufficient to reach Universal Approximation. We show that any full-rank transformation can be exactly reconstructed using these DLoR components by trading off network width (additive decomposition) or depth (multiplicative decomposition). By tracking asymptotic Taylor remainders, we prove that DLoR neural networks fully restore the Universal Approximation Theorem for general activation functions. Finally, we establish that multiplicative depth provides superior parameter-to-expressivity scaling compared to additive width. Our results show that dense matrices and specific activation functions are not topological prerequisites for universal expressivity.

2605.05657 2026-05-08 cs.AI cs.MA

Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

Abhijit Talluri, Pujith Anne, Bhagavan Choudary Pendiyala, Raghavendra Chilukuri

Comments 30 pages, 9 figures. NeurIPS 2026 Evaluations and Datasets Track Submission Under review

详情
英文摘要

Multi-agent LLM systems for code generation face a fundamental routing problem: the optimal orchestration topology depends on the structural complexity of the code under modification, yet existing systems select topologies without consulting the codebase. We present Retrieval-Guided Adaptive Orchestration (RGAO), an architecture that closes this loop by extracting a structural complexity vector from a hierarchical code index before selecting the orchestration topology. RGAO operates within Code-Agent, a multi-agent framework whose sub-agents are governed by formal contracts with six-dimensional budget vectors. Our headline contribution is the composition of two previously separate lines of work -- complexity-conditioned LLM routing and formal resource algebras -- yielding a property neither admits alone: provable budget conservation under retrieval-conditioned dynamic topology selection. Concretely we contribute: (1) a complexity-conditioned topology router that reduces proxy-measured misrouting from 30.1% to 8.2%; (2) a budget algebra with a structural-induction conservation theorem; and (3) a hierarchical code retrieval engine. Empirical evaluation demonstrates sub-millisecond DAG construction and linear tree-index scalability.

2605.05653 2026-05-08 cs.CL

Negative Before Positive: Asymmetric Valence Processing in Large Language Models

Sohan Venkatesh

详情
英文摘要

Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.

2605.05646 2026-05-08 cs.CV

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

Panqi Yang, Haodong Jing, Jiahao Chao, Tingyan Xiang, Li Lin, Yao Hu, Yang Luo, Yongqiang Ma

Comments 21 pages,Accepted by ICML 2026 main track

详情
英文摘要

Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2\% vs. 82.5\%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.

2605.05643 2026-05-08 cs.AI cs.IR

Text-Graph Synergy: A Bidirectional Verification and Completion Framework for RAG

Jiarui Zhong, Hong Cai Chen

Comments 12 pages, 3 figures

详情
英文摘要

Retrieval-Augmented Generation (RAG) has become a core paradigm for enhancing factual grounding and multi-hop reasoning in Large Language Models (LLMs). Traditional text-based RAG often retrieves logically irrelevant pseudo-evidence, while graph-based RAG is frequently hindered by search-time pruning, which may discard potentially valid reasoning paths. Existing hybrid approaches primarily adopt simple evidence concatenation or unidirectional enhancement, which fails to address the fundamental "Information Island" problem caused by asymmetric reasoning flows between unstructured text and structured graphs. We propose \textbf{TGS-RAG}, a unified framework for \textbf{T}ext-\textbf{G}raph \textbf{S}ynergistic enhancement. TGS-RAG introduces a bidirectional mechanism: (i) a \textbf{Graph-to-Text} channel that employs a Global Voting strategy from visited graph nodes to re-rank and refine textual evidence, filtering out semantic noise; and (ii) a \textbf{Text-to-Graph} channel that utilizes the \textbf{Memory-based Orphan Entity Bridging} algorithm. This algorithm utilizes textual cues to proactively resurrect valid but previously pruned reasoning paths from the search history without additional database overhead. Experimental results on multiple multi-hop reasoning benchmarks demonstrate that TGS-RAG significantly outperforms state-of-the-art baselines, achieving a superior balance between retrieval precision and computational efficiency.

2605.05640 2026-05-08 cs.CV

AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries

Zhen Zhang, Yuhang Yang, Yunxiang Jiang, Yuhuan Lu, Haifeng Lu, Zheng Lian, Runhao Zeng, Xiping Hu

详情
英文摘要

Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-cliped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study \textbf{Vague-Query-driven video Affective Understanding (VQAU)}, a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct \textbf{VQAU-Bench}, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose \textbf{AffectSeek}, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.

2605.05638 2026-05-08 cs.LG

Scaling Pretrained Representations Enables Label-Free Out-of-Distribution Detection Without Fine-Tuning

Brett Barkley, Preston Culbertson, David Fridovich-Keil

详情
英文摘要

Models trained with deep learning often fail to signal when inputs fall outside their training data manifold, leading to unreliable predictions under distribution shift. Prior work suggests that effective out-of-distribution (OOD) detection often requires class-conditional modeling or specialized models obtained through supervised fine-tuning. We revisit this assumption in modern pretrained models and show that their frozen representations already encode sufficient geometric structure for accurate label-free OOD detection. Across 59 backbone-task pairings spanning vision and language, we compare two complementary label-free detectors: a global Mahalanobis estimator fit on unlabeled latent representations, and ReSCOPED, a lightweight, diffusion-based typicality estimator operating on the same features at a local level. Despite their different detection mechanisms, representation scaling reveals a consistent regime-dependent pattern: both local and global detectors' absolute performance improves with better representation quality, and performance gaps between the two detectors disappear across both language and vision tasks as representations scale. These results suggest that label-free OOD detection depends strongly on the geometry exposed by frozen pretrained backbones, reducing the importance of detector choice as backbone scale increases and enabling efficient deployment directly on frozen models.

2605.05636 2026-05-08 cs.CV cs.GR

Learning a Delighting Prior for Facial Appearance Capture in the Wild

Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang, Lan Xu, Feng Xu

Comments ACM Transactions on Graphics (Proc. of SIGGRAPH), 2026. Code: https://github.com/yxuhan/OpenDelight Project Page: https://yxuhan.github.io/OpenDelight/index.html

详情
英文摘要

High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm into training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for training, and propose Dataset Latent Modulation (DLM) to seamlessly integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, enabling the emergence of a delighting prior that outperforms existing proprietary models. This powerful delighting prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior arts by a large margin. Furthermore, we leverage our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.

2605.05627 2026-05-08 cs.CV cs.AI cs.LG cs.RO

Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan, Anthony Deschênes, François Pomerleau, Philippe Giguère

Comments 36 pages, 17 figures

详情
英文摘要

Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation for high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real-world data with AI-generated images, highlighting that AI-generated data is highly complementary to real-world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, some of which saw per-species F1 score gains of up to 30 %pt. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab-ulaval.github.io/gen4regen.

2605.05626 2026-05-08 cs.CL cs.AI

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

Vihaan Nama, Shreya Mendi, Zian Ye, Brinnae Bent

Comments Currently under review. Dataset can be found: https://huggingface.co/datasets/duke-trust-lab/When2Speak

详情
英文摘要

Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.

2605.05623 2026-05-08 cs.LG

Region-adaptable retrieval of coastal biogeochemical parameters from near-surface hyperspectral remote sensing reflectance using physics-aware meta-learning

Yiqing Guo, Nagur R. C. Cherukuru, Eric A. Lehmann, S. L. Kesav Unnithan, Tim J. Malthus, Gemma Kerrisk, Xiubin Qi, Faisal Islam, Tisham Dhar, Mark J. Doubell

详情
英文摘要

Hyperspectral in situ sensing has shown promise in retrieving aquatic biogeochemical (BGC) parameters, such as total suspended solids, dissolved organic carbon, and total chlorophyll-a, for cost-effective monitoring of coastal water quality. However, generalising such retrieval algorithms across water bodies remains challenging, as the relationship between remote sensing reflectance (Rrs) and BGC parameters can vary considerably from one region to another due to regional distinctions in environmental conditions and biogeochemistry that lead to different BGC ranges and bio-optical properties. In this study, we propose a two-stage physics-aware meta-learning framework for retrieving coastal BGC parameters from near-surface Rrs observations. In the first stage, a bio-optical forward model is used to generate a large synthetic dataset based on an in situ bio-optical spectral library with broad representativeness of Australian coastal waters. This dataset is then used to pretrain a region-agnostic base model with meta-learning, allowing the model to learn fundamental physical relationships. In the second stage, the pretrained base model is fine-tuned for specific regions with local samples. We collected in situ hyperspectral Rrs and BGC measurements from five geographically distinct sites in Australian coastal waters. Our experimental results suggest: (1) the BGC parameters and their corresponding hyperspectral Rrs signatures exhibited clear regional distinctions among the experimental sites; (2) the synthetic dataset was physically plausible and closely aligned with real-world samples in both parameter distributions and inter-parameter correlations; (3) the proposed approach outperformed five benchmark models in BGC retrieval; and (4) time series of in situ measured and model-predicted BGC parameters showed good agreement in both magnitude and temporal dynamics.

2605.05616 2026-05-08 cs.CV cs.LG

RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

Songxiao Yang, Haolin Wang, Yao Fu, Junmu Peng, Lin Fan, Hongruixuan Chen, Jian Song, Masayuki Ikebe, Shinya Takamaeda-Yamazaki, Masatoshi Okutomi, Tamotsu Kamishima, Yafei Ou

Comments 50 pages, 24 figures, 25 tables

详情
英文摘要

Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 hand radiographs collected from six medical centers, with multi-level annotations including (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM-H1200 is the first public large-scale benchmark that jointly supports whole-hand bone structure instance segmentation, pixel-level BE delineation, and clinically grounded joint-level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole-hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM-H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.

2605.05609 2026-05-08 cs.LG econ.EM stat.ML

Optimal Contextual Pricing under Agnostic Non-Lipschitz Demand

Jianyu Xu, Yu-Xiang Wang

Comments 30 pages, 1 figure, 1 table

详情
英文摘要

We study contextual dynamic pricing with linear valuations and bounded-support agnostic noise, whose induced demand curve may be non-Lipschitz with arbitrary jumps and atoms. Such discontinuities break the cross-context interpolation arguments used by smooth-demand pricing algorithms, while the best previous method achieved only $\tilde O(T^{3/4})$ regret. We propose Conservative-Markdown Redirect-UCB Pricing, a polynomial-time algorithm that combines randomized parameter estimation, conservative residual-grid probing, and confidence-based one-step redirection. Our algorithm achieves $\tilde O(T^{2/3})$ optimal regret, matching the known lower bounds of Kleinberg and Leighton (2003) up to logarithmic factors and improving over the previous upper bound of Xu and Wang (2022). Under stochastic well-conditioned contexts, this closes the long-existing open regret gap in linear-valuation contextual pricing under agnostic non-Lipschitz noise distribution.