arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2393
专题追踪
2508.01055 2026-02-17 cs.LG cs.AI q-bio.BM q-bio.QM

FGBench: A Dataset and Benchmark for Molecular Property Reasoning at Functional Group-Level in Large Language Models

Xuan Liu, Siru Ouyang, Xianrui Zhong, Jiawei Han, Huimin Zhao

Comments NeurIPS 2025 (Datasets and Benchmarks Track)

详情
英文摘要

Large language models (LLMs) have gained significant attention in chemistry. However, most existing datasets center on molecular-level property prediction and overlook the role of fine-grained functional group (FG) information. Incorporating FG-level data can provide valuable prior knowledge that links molecular structures with textual descriptions, which can be used to build more interpretable, structure-aware LLMs for reasoning on molecule-related tasks. Moreover, LLMs can learn from such fine-grained information to uncover hidden relationships between specific functional groups and molecular properties, thereby advancing molecular design and drug discovery. Here, we introduce FGBench, a dataset comprising 625K molecular property reasoning problems with functional group information. Functional groups are precisely annotated and localized within the molecule, which ensures the dataset's interoperability thereby facilitating further multimodal applications. FGBench includes both regression and classification tasks on 245 different functional groups across three categories for molecular property reasoning: (1) single functional group impacts, (2) multiple functional group interactions, and (3) direct molecular comparisons. In the benchmark of state-of-the-art LLMs on 7K curated data, the results indicate that current LLMs struggle with FG-level property reasoning, highlighting the need to enhance reasoning capabilities in LLMs for chemistry tasks. We anticipate that the methodology employed in FGBench to construct datasets with functional group-level information will serve as a foundational framework for generating new question-answer pairs, enabling LLMs to better understand fine-grained molecular structure-property relationships. The dataset and evaluation code are available at https://github.com/xuanliugit/FGBench.

2507.19666 2026-02-17 cs.CL

RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams

Andrei Vlad Man, Răzvan-Alexandru Smădu, Cristian-George Craciun, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel

Comments 41 pages, 30 figures, Accepted by the Findings of EACL 2026

详情
英文摘要

The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about the Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, along with annotated legal references and explanations written by human experts. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks, including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum passing grades required for driving exams. We highlight the potential and limitations of applying LLMs and VLMs to legal education. We release the code and resources through the GitHub repository.

2507.19457 2026-02-17 cs.CL cs.AI cs.LG cs.SE

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

Comments Accepted to ICLR 2026 (Oral). Code: https://github.com/gepa-ai/gepa

详情
英文摘要

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language often provides a much richer learning medium for LLMs, compared to policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across six tasks, GEPA outperforms GRPO by 6% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% (e.g., +12% accuracy on AIME-2025), and demonstrates promising results as an inference-time search strategy for code optimization. We release our code at https://github.com/gepa-ai/gepa .

2507.15856 2026-02-17 cs.CV

Latent Denoising Makes Good Tokenizers

Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang

Comments Code is available at: https://github.com/Jiawei-Yang/DeTok

详情
英文摘要

Despite their fundamental role, it remains unclear what properties could make tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under significant corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MSCOCO) image generation benchmarks demonstrate that our l-DeTok consistently improves generation quality across six representative generative models compared to prior tokenizers. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it could motivate new perspectives for future tokenizer design.

2507.08838 2026-02-17 cs.LG cs.AI stat.ML

wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models

Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic

Comments Accepted to ICLR 2026

详情
英文摘要

Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of dLLMs likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead, and can lead to large variance and estimation error in RL objective -- particularly in computing the policy ratio for importance sampling. To mitigate these issues, we introduce wd1, a novel ratio-free policy optimization approach that reformulates the RL objective as a weighted log-likelihood, requiring only a single approximation for the current parametrized policy likelihood. We formally show that our proposed method can be interpreted as energy-guided discrete diffusion training combined with negative sample unlearning, thereby confirming its theoretical soundness. In experiments on LLaDA-8B model, wd1 outperforms diffusion-based GRPO (d1) while requiring lower computational cost, achieving up to a $+59\%$ improvement in accuracy. Furthermore, we extend wd1 to denoising-stepwise weighted policy optimization (wd1++), achieving state-of-the-art math performance of $44.2\%$ on MATH500 and $84.5\%$ on GSM8K with only 20 RL training steps.

2507.07139 2026-02-17 cs.CV cs.CR cs.LG

Image Can Bring Your Memory Back: A Novel Multi-Modal Guided Attack against Image Generation Model Unlearning

Renyang Liu, Guanlin Li, Tianwei Zhang, See-Kiong Ng

Comments Accepted by ICLR 2026

详情
英文摘要

Recent advances in image generation models (IGMs), particularly diffusion-based architectures such as Stable Diffusion (SD), have markedly enhanced the quality and diversity of AI-generated visual content. However, their generative capability has also raised significant ethical, legal, and societal concerns, including the potential to produce harmful, misleading, or copyright-infringing content. To mitigate these concerns, machine unlearning (MU) emerges as a promising solution by selectively removing undesirable concepts from pretrained models. Nevertheless, the robustness and effectiveness of existing unlearning techniques remain largely unexplored, particularly in the presence of multi-modal adversarial inputs. To bridge this gap, we propose Recall, a novel adversarial framework explicitly designed to compromise the robustness of unlearned IGMs. Unlike existing approaches that predominantly rely on adversarial text prompts, Recall exploits the intrinsic multi-modal conditioning capabilities of diffusion models by efficiently optimizing adversarial image prompts with guidance from a single semantically relevant reference image. Extensive experiments across ten state-of-the-art unlearning methods and diverse tasks show that Recall consistently outperforms existing baselines in terms of adversarial effectiveness, computational efficiency, and semantic fidelity with the original textual prompt. These findings reveal critical vulnerabilities in current unlearning mechanisms and underscore the need for more robust solutions to ensure the safety and reliability of generative models. Code and data are publicly available at \textcolor{blue}{https://github.com/ryliu68/RECALL}.

2506.20430 2026-02-17 cs.CL cs.AI cs.CV cs.MA

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning

Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie

详情
英文摘要

Rare diseases affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains an urgent challenge. Patients often endure a prolonged diagnostic odyssey exceeding five years, marked by repeated referrals, misdiagnoses, and unnecessary interventions, leading to delayed treatment and substantial emotional and economic burdens. Here we present DeepRare, a multi-agent system for rare disease differential diagnosis decision support powered by large language models, integrating over 40 specialized tools and up-to-date knowledge sources. DeepRare processes heterogeneous clinical inputs, including free-text descriptions, structured Human Phenotype Ontology terms, and genetic testing results, to generate ranked diagnostic hypotheses with transparent reasoning linked to verifiable medical evidence. Evaluated across nine datasets from literature, case reports and clinical centres across Asia, North America and Europe spanning 14 medical specialties, DeepRare demonstrates exceptional performance on 3,134 diseases. In human-phenotype-ontology-based tasks, it achieves an average Recall@1 of 57.18%, outperforming the next-best method by 23.79%; in multi-modal tests, it reaches 69.1% compared with Exomiser's 55.9% on 168 cases. Expert review achieved 95.4% agreement on its reasoning chains, confirming their validity and traceability. Our work not only advances rare disease diagnosis but also demonstrates how the latest powerful large-language-model-driven agentic systems can reshape current clinical workflows.

2506.18960 2026-02-17 cs.RO

FORTE: Tactile Force and Slip Sensing on Compliant Fingers for Delicate Manipulation

Siqi Shang, Mingyo Seo, Yuke Zhu, Lillian Chin

详情
英文摘要

Handling fragile objects remains a major challenge for robotic manipulation. Tactile sensing and soft robotics can improve delicate object handling, but typically involve high integration complexity or slow response times. We address these issues through FORTE, an easy-to-fabricate tactile sensing system. FORTE uses 3D-printed fin-ray grippers with internal air channels to provide low-latency force and slip feedback. This feedback allows us to apply just enough force to grasp objects without damaging them. We accurately estimate grasping forces from 0-8 N with an average error of 0.2 N, and detect slip events within 100 ms of occurring. FORTE can grasp a wide range of slippery, fragile, and deformable objects, including raspberries and potato chips with 92% success and achieves 93% accuracy in detecting slip events. These results highlight FORTE's potential as a robust solution for delicate robotic manipulation. Project page: https://merge-lab.github.io/FORTE/

2506.15715 2026-02-17 cs.LG cs.AI

NeuronSeek: On Stability and Expressivity of Task-driven Neurons

Hanyu Pei, Jing-Xiao Liao, Qibin Zhao, Ting Gao, Shijun Zhang, Xiaoge Zhang, Feng-Lei Fan

Comments 14 pages, 10 figures

详情
英文摘要

Drawing inspiration from our human brain that designs different neurons for different tasks, recent advances in deep learning have explored modifying a network's neurons to develop so-called task-driven neurons. Prototyping task-driven neurons (referred to as NeuronSeek) employs symbolic regression (SR) to discover the optimal neuron formulation and construct a network from these optimized neurons. Along this direction, this work replaces symbolic regression with tensor decomposition (TD) to discover optimal neuronal formulations, offering enhanced stability and faster convergence. Furthermore, we establish theoretical guarantees that modifying the aggregation functions with common activation functions can empower a network with a fixed number of parameters to approximate any continuous function with an arbitrarily small error, providing a rigorous mathematical foundation for the NeuronSeek framework. Extensive empirical evaluations demonstrate that our NeuronSeek-TD framework not only achieves superior stability, but also is competitive relative to the state-of-the-art models across diverse benchmarks. The code is available at https://github.com/HanyuPei22/NeuronSeek.

2506.11087 2026-02-17 cs.LG cs.AI cs.CL

Enhancing Delta Compression in LLMs via SVD-based Quantization Error Minimization

Boya Xiong, Shuo Wang, Weifeng Ge, Guanhua Chen, Yun Chen

详情
英文摘要

Supervised Fine-Tuning (SFT) empowers Large Language Models (LLMs) with exceptional performance on specialized tasks, but it yields dense, high-dimensional delta parameters that pose severe storage and distribution challenges. Singular Value Decomposition (SVD)-based compression offers a compact representation for such delta parameters, but existing methods adopt heuristic quantization without clarifying underlying mechanisms, leading to poor generalizability. In this work, we propose PrinMix, a rigorous SVD-based framework that models quantization as an optimization problem, grounding the design in mathematical mechanisms. We first theoretically derive quantization error and identify a key singular-value-dominated scaling mechanism, which mathematically proves the necessity of mix-precision quantization. We then model the quantization scheme as a 0/1 Integer Linear Programming (ILP) problem, which yields optimal bit-budget-constrained solutions without empirical assumptions. Furthermore, PrinMix integrates a Reconstruction Target Correction (RTC) method to compensate for errors from the $\mathbf{V}$-then-$\mathbf{U}$ sequential quantization process. Extensive experiments confirm PrinMix performs well: for 7B LLMs, PrinMix outperforms SOTA Delta-CoMe on challenging benchmarks by 22.3% on AIME2024 and 6.1% on GQA.

2506.08672 2026-02-17 cs.CL cs.AI cs.LG

RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Yang Liu, Jiaqi Li, Zilong Zheng

Comments ICLR 2026 camera ready, 28 pages, 10 figures, 15 tables

详情
英文摘要

Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static mix-training engineered by human. Evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($Δ$4.1% on eight ID tasks and $Δ$10.4% on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior methods.

2506.07272 2026-02-17 cs.LG

A Cramér-von Mises Approach to Incentivizing Truthful Data Sharing

Alex Clinton, Thomas Zeng, Yiding Chen, Xiaojin Zhu, Kirthevasan Kandasamy

Comments NeurIPS 2025

详情
英文摘要

Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent's data against others' to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g. Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel, two-sample test inspired by the Cramér-von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in three canonical data sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.

2506.06964 2026-02-17 cs.CL cs.LG

Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization

Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton

Comments Advances in Neural Information Processing Systems 38

详情
英文摘要

Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using similar techniques to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in a stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality.

2506.05289 2026-02-17 cs.CV

Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model

Pingyu Wu, Kai Zhu, Yu Liu, Longxiang Tang, Jian Yang, Yansong Peng, Wei Zhai, Yang Cao, Zheng-Jun Zha

Comments ICLR2026

详情
英文摘要

Autoregressive image generation aims to predict the next token based on previous ones. However, this process is challenged by the bidirectional dependencies inherent in conventional image tokenizations, which creates a fundamental misalignment with the unidirectional nature of autoregressive models. To resolve this, we introduce AliTok, a novel Aligned Tokenizer that alters the dependency structure of the token sequence. AliTok employs a bidirectional encoder constrained by a causal decoder, a design that compels the encoder to produce a token sequence with both semantic richness and forward-dependency. Furthermore, by incorporating prefix tokens and employing a two-stage tokenizer training process to enhance reconstruction performance, AliTok achieves high fidelity and predictability simultaneously. Building upon AliTok, a standard decoder-only autoregressive model with just 177M parameters achieves a gFID of 1.44 and an IS of 319.5 on ImageNet-256. Scaling to 662M, our model reaches a gFID of 1.28, surpassing the SOTA diffusion method with 10x faster sampling. On ImageNet-512, our 318M model also achieves a SOTA gFID of 1.39. Code and weights at https://github.com/ali-vilab/alitok.

2506.04051 2026-02-17 cs.CL cs.AI

High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning

Tim Franzmeyer, Archie Sravankumar, Lijuan Liu, Yuning Mao, Rui Hou, Sinong Wang, Jakob N. Foerster, Luke Zettlemoyer, Madian Khabsa

详情
英文摘要

Large Language Models (LLMs) currently respond to every prompt. However, they can produce incorrect answers when they lack knowledge or capability -- a problem known as hallucination. We instead propose post-training an LLM to generate content only when confident in its correctness and to otherwise (partially) abstain. Specifically, our method, HALT, produces capability-aligned post-training data that encodes what the model can and cannot reliably generate. We generate this data by splitting responses of the pretrained LLM into factual fragments (atomic statements or reasoning steps), and use ground truth information to identify incorrect fragments. We achieve capability-aligned finetuning responses by either removing incorrect fragments or replacing them with "Unsure from Here" -- according to a tunable threshold that allows practitioners to trade off response completeness and mean correctness of the response's fragments. We finetune four open-source models for biography writing, mathematics, coding, and medicine with HALT for three different trade-off thresholds. HALT effectively trades off response completeness for correctness, increasing the mean correctness of response fragments by 15% on average, while resulting in a 4% improvement in the F1 score (mean of completeness and correctness of the response) compared to the relevant baselines. By tuning HALT for highest correctness, we train a single reliable Llama3-70B model with correctness increased from 51% to 87% across all four domains while maintaining 53% of the response completeness achieved with standard finetuning.

2506.03490 2026-02-17 cs.CL

Beyond Memorization: A Rigorous Evaluation Framework for Medical Knowledge Editing

Shigeng Chen, Linhao Luo, Zhangchi Qiu, Yanan Cao, Carl Yang, Shirui Pan

Comments Accepted to EACL 2026 Main Conference

详情
英文摘要

Recently, knowledge editing (KE) has emerged as a promising approach to update specific facts in Large Language Models (LLMs) without the need for full retraining. Despite the effectiveness in general-domain benchmarks, their applicability to complex medical domain remains largely unexplored. Medical knowledge editing is particularly challenging, as it requires LLMs to internalize the knowledge and generalize to unseen scenarios for effective and interpretable decision-making. In this work, we propose a novel framework called MedEditBench to rigorously evaluate the effectiveness of existing KE methods in the medical domain. In MedEditBench, we introduce a new medical knowledge editing benchmark as well as three different knowledge editing paradigms, which are designed to assess the impact of different knowledge sources for editing. Our findings indicate that current KE methods result in only superficial memorization of the injected information, failing to generalize to new scenarios. To overcome this limitation, we present Self-Generated Rationale Editing (SGR-Edit), which utilizes model-derived rationales as the target knowledge for editing, thereby uncovering the underlying reasoning process and demonstrating significant improvements over existing KE approaches. Additionally, we offer deeper insights into medical knowledge editing, including the localization of medical knowledge in LLMs and the impact of sequential editing on evolving knowledge. This could provide practical guidance for implementing KE methods in real-world medical applications.

2506.02873 2026-02-17 cs.AI

It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine

详情
英文摘要

Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly ``follow orders'' to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, that shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval

2506.00494 2026-02-17 cs.RO cs.AI cs.CE

Multi-Objective Neural Network-Assisted Design Optimization of Soft Fin-Ray Fingers for Enhanced Grasping Performance

Ali Ghanizadeh, Ali Ahmadi, Arash Bahrami

Comments Major revision correcting errors that affected the reported results. Corresponding results were regenerated, and the manuscript was updated accordingly

详情
英文摘要

The internal structure of the Fin-Ray fingers plays a significant role in their adaptability and grasping performance. However, modeling the grasp force and deformation behavior for design purposes is challenging. When the Fin-Ray finger becomes more rigid and capable of exerting higher forces, it becomes less delicate in handling objects. The contrast between these two gives rise to a multi-objective optimization problem. We employ the finite element method to estimate the deflections and contact forces of the Fin-Ray fingers grasping cylindrical objects, generating a dataset of 120 simulations. This dataset includes three input variables: the thickness of the front and support beams, the thickness of the crossbeams, and the equal spacing between the crossbeams, which are the design variables in the optimization. This dataset is then used to construct a multilayer perceptron (MLP) with four output neurons predicting the contact force and tip displacement in two directions. The magnitudes of maximum contact force and maximum tip displacement are two optimization objectives, showing the trade-off between force and delicate manipulation. The set of solutions is found using the non-dominated sorting genetic algorithm (NSGA-II). The results of the simulations demonstrate that the proposed methodology can be used to improve the design and grasping performance of soft grippers, aiding to choose a design not only for delicate grasping but also for high-force applications.

2505.23522 2026-02-17 cs.CV cs.LG

OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data

Fengxiang Wang, Mingshuo Chen, Xuming He, Yi-Fan Zhang, Yueying Li, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Junchao Gong, Di Wang, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang

详情
英文摘要

Existing benchmarks for multimodal learning in Earth science offer limited, siloed coverage of Earth's spheres and their cross-sphere interactions, typically restricting evaluation to the human-activity sphere of atmosphere and to at most 16 tasks. These limitations: narrow-source heterogeneity (single/few data sources), constrained scientific granularity, and limited-sphere extensibility. Therefore, we introduce OmniEarth-Bench, the first multimodal benchmark that systematically spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, and human-activity sphere, and cross-spheres. Built with a scalable, modular-topology data inference framework and native multi-observation sources and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized, expert-curated annotations. All annotations are organized into a four-level hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the most advanced models struggle with our benchmarks, where none of them reach 35% accuracy, revealing systematic gaps in Earth-system cognitive ability. The dataset and evaluation code were released at OmniEarth-Bench (https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).

2505.19712 2026-02-17 cs.LG math.PR stat.ML

On the Relation between Rectified Flows and Optimal Transport

Johannes Hertrich, Antonin Chambolle, Julie Delon

Comments Accepted for NeurIPS 2025

详情
英文摘要

This paper investigates the connections between rectified flows, flow matching, and optimal transport. Flow matching is a recent approach to learning generative models by estimating velocity fields that guide transformations from a source to a target distribution. Rectified flow matching aims to straighten the learned transport paths, yielding more direct flows between distributions. Our first contribution is a set of invariance properties of rectified flows and explicit velocity fields. In addition, we also provide explicit constructions and analysis in the Gaussian (not necessarily independent) and Gaussian mixture settings and study the relation to optimal transport. Our second contribution addresses recent claims suggesting that rectified flows, when constrained such that the learned velocity field is a gradient, can yield (asymptotically) solutions to optimal transport problems. We study the existence of solutions for this problem and demonstrate that they only relate to optimal transport under assumptions that are significantly stronger than those previously acknowledged. In particular, we present several counterexamples that invalidate earlier equivalence results in the literature, and we argue that enforcing a gradient constraint on rectified flows is, in general, not a reliable method for computing optimal transport maps.

2505.19645 2026-02-17 cs.LG cs.AI

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang

Comments Accepted as spotlight at NeurIPS 2025

详情
英文摘要

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of algorithms, changes in workload and model architecture can still lead to degraded SD acceleration even with high acceptance rates. To address this limitation, we introduce a new metric 'target efficiency' that characterizes these effects, thus helping researchers identify system bottlenecks and understand SD acceleration more comprehensively. For scenarios like private serving, this work unveils a new perspective to speed up MoE inference, where existing solutions struggle. Experiments on different GPUs show up to 2.29x speedup for Qwen2-57B-A14B at medium batch sizes and validate our theoretical predictions.

2505.18487 2026-02-17 cs.RO cs.CV cs.LG

Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

Junlin Wang, Zhiyun Lin

Comments A preprint version

详情
英文摘要

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present $\textbf{I}$nter-token $\textbf{Con}$trast ($\textbf{ICon}$), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: https://inter-token-contrast.github.io/icon/

2505.14462 2026-02-17 cs.CV cs.CL

RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding

Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich, Anders Søgaard, Ivan Vulić, Wenxuan Zhang, Paul Pu Liang, Yang Deng, Serge Belongie

Comments ICLR 2026; Project page: https://jiaangli.github.io/ravenea/

详情
英文摘要

As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 11,396 unique Wikipedia documents curated and ranked by human annotators. Through the extensive evaluation on seven multimodal retrievers and fifteen VLMs, RAVENEA reveals some undiscovered findings: (i) In general, cultural grounding annotations can enhance multimodal retrieval and corresponding downstream tasks. (ii) VLMs, when augmented with culture-aware retrieval, generally outperform their non-augmented counterparts (by averaging +6% on cVQA and +11% on cIC). (iii) Performance of culture-aware retrieval augmented varies widely across countries. These findings highlight the limitations of current multimodal retrievers and VLMs, underscoring the need to enhance visual culture understanding within RAG systems. We believe RAVENEA offers a valuable resource for advancing research on retrieval-augmented visual culture understanding.

2505.12641 2026-02-17 cs.CV cs.AI

Single Image Reflection Separation via Dual Prior Interaction Transformer

Yue Huang, Tianle Hu, Yu Chen, Zi'ang Li, Jie Wen, Xiaozhao Fang

详情
英文摘要

Single image reflection separation aims to separate the transmission and reflection layers from a mixed image. Existing methods typically combine general priors from pre-trained models with task-specific priors such as text prompts and reflection detection. However, the transmission prior, as the most direct task-specific prior for the target transmission layer, has not been effectively modeled or fully utilized, limiting performance in complex scenarios. To address this issue, we propose a dual-prior interaction framework based on lightweight transmission prior generation and effective prior fusion. First, we design a Local Linear Correction Network (LLCN) that finetunes pre-trained models based on the physical constraint T=SI+B, where S and B represent pixel-wise and channel-wise scaling and bias transformations. LLCN efficiently generates high-quality transmission priors with minimal parameters. Second, we construct a Dual-Prior Interaction Transformer (DPIT) that employs a dual-stream channel reorganization attention mechanism. By reorganizing features from general and transmission priors for attention computation, DPIT achieves deep fusion of both priors, fully exploiting their complementary information. Experimental results on multiple benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance.

2505.11839 2026-02-17 cs.AI

On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study

Shuai Yang, Qi Yang, Luoxi Tang, Yuqiao Meng, Nancy Guo, Jeremy Blackburn, Zhaohan Xi

Comments ICLR 2026

详情
英文摘要

Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models (LLMs). By generating and analyzing counterfactual scenarios, researchers can assess the adaptability and reliability of model decision-making. Although prior work has shown that LLMs often struggle with counterfactual reasoning, it remains unclear which factors most significantly impede their performance across different tasks and modalities. In this paper, we propose a decompositional strategy that breaks down the counterfactual generation from causality construction to the reasoning over counterfactual interventions. To support decompositional analysis, we investigate \ntask datasets spanning diverse tasks, including natural language understanding, mathematics, programming, and vision-language tasks. Through extensive evaluations, we characterize LLM behavior across each decompositional stage and identify how modality type and intermediate reasoning influence performance. By establishing a structured framework for analyzing counterfactual reasoning, this work contributes to the development of more reliable LLM-based reasoning systems and informs future elicitation strategies.

2505.08417 2026-02-17 cs.RO

ORACLE-Grasp: Zero-Shot Affordance-Aligned Robotic Grasping using Large Multimodal Models

Avihai Giuili, Rotem Atari, Avishai Sintov

详情
英文摘要

Grasping unknown objects in unstructured environments is a critical challenge for service robots, which must operate in dynamic, real-world settings such as homes, hospitals, and warehouses. Success in these environments requires both semantic understanding and spatial reasoning. Traditional methods often rely on dense training datasets or detailed geometric modeling, which demand extensive data collection and do not generalize well to novel objects or affordances. We present ORACLE-Grasp, a zero-shot framework that leverages Large Multimodal Models (LMMs) as semantic oracles to guide affordance-aligned grasp selection, without requiring task-specific training or manual input. The system reformulates grasp prediction as a structured, iterative decision process, using a dual-prompt tool-calling strategy: the first prompt extracts high-level object semantics, while the second identifies graspable regions aligned with the object's function. To address the spatial limitations of LMMs, ORACLE-Grasp discretizes the image into candidate regions and reasons over them to produce human-like and context-sensitive grasp suggestions. A depth-based refinement step improves grasp reliability when available, and an early stopping mechanism enhances computational efficiency. We evaluate ORACLE-Grasp on a diverse set of RGB and RGB-D images featuring both everyday and AI-generated objects. The results show that our method produces physically feasible and semantically appropriate grasps that align closely with human annotations, achieving high success rates in real-world pick-up tasks. Our findings highlight the potential of LMMs for enabling flexible and generalizable grasping strategies in autonomous service robots, eliminating the need for object-specific models or extensive training.

2505.08145 2026-02-17 cs.LG cs.DC cs.IT cs.NI math.IT

A Generalized Hierarchical Federated Learning Framework with Theoretical Guarantees

Seyed Mohammad Azimi-Abarghouyi, Carlo Fischione

详情
英文摘要

Almost all existing hierarchical federated learning (FL) models are limited to two aggregation layers, restricting scalability and flexibility in complex, large-scale networks. In this work, we propose a Multi-Layer Hierarchical Federated Learning framework (QMLHFL), which appears to be the first study that generalizes hierarchical FL to arbitrary numbers of layers and network architectures through nested aggregation, while employing a layer-specific quantization scheme to meet communication constraints. We develop a comprehensive convergence analysis for QMLHFL and derive a general convergence condition and rate that reveal the effects of key factors, including quantization parameters, hierarchical architecture, and intra-layer iteration counts. Furthermore, we determine the optimal number of intra-layer iterations to maximize the convergence rate while meeting a deadline constraint that accounts for both communication and computation times. Our results show that QMLHFL consistently achieves high learning accuracy, even under high data heterogeneity, and delivers notably improved performance when optimized, compared to using randomly selected values.

2505.07861 2026-02-17 cs.CL cs.AI cs.LG

Scalable LLM Reasoning Acceleration with Low-rank Distillation

Harry Dong, Bilge Acun, Beidi Chen, Yuejie Chi

详情
英文摘要

Due to long generations, large language model (LLM) math reasoning demands significant computational resources and time. While many existing efficient inference methods have been developed with excellent performance preservation on language tasks, they often severely degrade math performance. In this paper, we propose Caprese, a resource-efficient distillation method to recover lost capabilities from deploying efficient inference methods, focused primarily in feedforward blocks. With original weights unperturbed, roughly 1% of additional parameters, and only 20K synthetic training samples, we are able to recover much if not all of the reasoning capabilities lost from efficient inference for thinking LLMs and without harm to language tasks for instruct LLMs. Moreover, Caprese slashes the number of active parameters (~2B cut for Gemma 2 9B and Llama 3.1 8B) and integrates cleanly into existing model layers to reduce latency (>16% time-to-next-token reduction) while encouraging response brevity (up to 8.5% fewer tokens).

2505.07671 2026-02-17 cs.CL cs.AI cs.IR

Benchmarking Retrieval-Augmented Generation for Chemistry

Xianrui Zhong, Bowen Jin, Siru Ouyang, Yanzhen Shen, Qiao Jin, Yin Fang, Zhiyong Lu, Jiawei Han

Comments Accepted to COLM 2025

详情
英文摘要

Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data is available at https://chemrag.github.io.

2505.06795 2026-02-17 cs.LG cs.AI cs.CE

Sparse Latent Factor Forecaster (SLFF) with Iterative Inference for Transparent Multi-Horizon Commodity Futures Prediction

Abhijit Gupta

详情
英文摘要

Amortized variational inference in latent-variable forecasters creates a deployment gap: the test-time encoder approximates a training-time optimization-refined latent, but without access to future targets. This gap introduces unnecessary forecast error and interpretability challenges. In this work, we propose the Sparse Latent Factor Forecaster with Iterative Inference (SLFF), addressing this through (i) a sparse coding objective with L1 regularization for low-dimensional latents, (ii) unrolled proximal gradient descent (LISTA-style) for iterative refinement during training, and (iii) encoder alignment to ensure amortized outputs match optimization-refined solutions. Under a linearized decoder assumption, we derive a design-motivating bound on the amortization gap based on encoder-optimizer distance, with convergence rates under mild conditions; empirical checks confirm the bound is predictive for the deployed MLP decoder. To prevent mixed-frequency data leakage, we introduce an information-set-aware protocol using release calendars and vintage macroeconomic data. Interpretability is formalized via a three-stage protocol: stability (Procrustes alignment across seeds), driver validity (held-out regressions against observables), and behavioral consistency (counterfactuals and event studies). Using commodity futures (Copper, WTI, Gold; 2005--2025) as a testbed, SLFF demonstrates significant improvements over neural baselines at 1- and 5-day horizons, yielding sparse factors that are stable across seeds and correlated with observable economic fundamentals (interpretability remains correlational, not causal). Code, manifests, diagnostics, and artifacts are released.