arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09144 2026-05-12 cs.LG

FedVSSAM: Mitigating Flatness Incompatibility in Sharpness-Aware Federated Learning

Bingnan Xiao, Yuan Gao, Bingcong Li, Wei Ni, Xin Wang, Tony Q. S. Quek

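AI summary: This paper studies "flatness incompatibility" in sharpness-aware minimization (SAM) for federated learning under data heterogeneity: the flat regions sought by local optimization do not match those preferred by the global objective, which limits training and generalization. To address this, the authors propose FedVSSAM, a federated learning method that introduces a variance-suppressed adjusted direction and applies it consistently in local optimization and global updates, better aligning local and global optimization directions. Experiments show that FedVSSAM mitigates flatness incompatibility and outperforms existing methods across a range of federated learning settings.

English abstract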

Sharpness-aware minimization (SAM) is an effective method for improving the generalization of federated learning (FL) by steering local training toward flat minima. Under data heterogeneity, however, device-side SAM searches for locally flat basins that are incompatible with the flat region preferred by the global objective. We identify this structural failure mode as flatness incompatibility, which explains why improving local flatness alone may provide limited training and generalization improvement for the global model. We reveal that flatness incompatibility arises from data heterogeneity and the friendly adversary phenomenon, and is further amplified by local updates and partial device participation. To mitigate this issue, we propose Federated Learning with variance-suppressed sharpness-aware minimization (FedVSSAM), which constructs a variance-suppressed adjusted direction and uses it consistently in local flatness search, local descent, and global update. FedVSSAM anchors both perturbation and update directions to a more stable global direction, instead of correcting only an isolated local perturbation. We establish non-convex convergence guarantees of FedVSSAM and prove that the mean-square deviation between the adjusted direction and the global gradient is effectively controlled. Experiments demonstrate that FedVSSAM mitigates flatness incompatibility and outperforms the baselines across diverse FL settings.
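
As a reading aid, the basic SAM step that the abstract builds on can be sketched as below. This is plain single-device SAM on a toy quadratic, not FedVSSAM's variance-suppressed adjusted direction; the toy loss, the function names, and the values of `rho` and `lr` are illustrative assumptions.

```python
import math

def loss(w):
    # Toy loss f(w) = 0.5 * (w0^2 + 10 * w1^2), chosen only for illustration.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [w[0], 10.0 * w[1]]

def sam_step(w, rho=0.05, lr=0.05):
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0  # avoid dividing by zero
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: update the original weights with the gradient at w_adv.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)
```

In a federated setting each device would run steps like this on its own data, which is where the locally flat basins described in the abstract can diverge from the global objective.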

2605.09137 2026-05-12 cs.LG

Evaluating Federated Learning approaches for mammography under breast density heterogeneity

Gonzalo Iñaki Quintana, Franco Martin Di Maria, Laurence Vancamberg

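AI summary: This study evaluates how breast density heterogeneity affects federated learning (FL) for mammography image classification and tests the robustness of common FL algorithms in realistic clinical settings. Experiments cover two scenarios, one simulating strong heterogeneity and one simulating population-based breast density distributions. FL methods perform comparably to centralized training, while local training and naive aggregation approaches perform worse under strong heterogeneity. FedAvg, which includes no dedicated heterogeneity handling, matches or exceeds the accuracy of centralized training, indicating its feasibility and potential for collaborative mammography modeling in practice.

English abstract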

Breast density is a key factor that influences mammography interpretation and is a major source of heterogeneity in multicenter datasets. Such heterogeneity poses challenges for collaborative machine learning across institutions, particularly in Federated Learning. This study aims to evaluate the impact of breast density-induced heterogeneity on FL for mammography image classification and to assess the robustness of common FL algorithms in realistic clinical settings. We conducted experiments under two scenarios: (1) a strongly heterogeneous setting where each participating site contributed exclusively low- or high-density cases, based on the BI-RADS density score, and (2) a population-based setting simulating breast density distributions in White and Asian populations. For the strongly heterogeneous setting, we evaluated two configurations: one with 2 clients, where the cases were grouped as BI-RADS A-B and C-D, and one with 4 clients, where each site contained cases of a single BI-RADS density. We compared three FL methods (FedAvg, FedProx, SCAFFOLD) against centralized training, local-only training, and naive aggregation approaches, including ensembling and weight averaging. Across both scenarios, FL achieved performance comparable to centralized training, while local models and naive aggregation approaches underperformed in the presence of strong heterogeneity. Notably, FedAvg achieved accuracy on par with or exceeding centralized training, demonstrating resilience to breast density-induced data imbalance without requiring specialized heterogeneity mitigation algorithms. These findings show that FL can address breast density-related heterogeneity, supporting its feasibility for real-world mammography workflows. The demonstrated robustness of FedAvg underscores the potential for broad clinical deployment of FL, enabling collaborative model development while maintaining data privacy.
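
For readers unfamiliar with FedAvg, its aggregation step can be sketched as a sample-count-weighted average of client parameters. This is a minimal illustration of the general algorithm, not the study's training code; the two-site example values and the name `fedavg` are assumptions.

```python
def fedavg(client_weights, client_sizes):
    """Average per-client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical sites, e.g. one holding BI-RADS A-B cases and one holding C-D cases.
site_weights = [[1.0, 2.0], [3.0, 6.0]]
site_sizes = [100, 300]
global_weights = fedavg(site_weights, site_sizes)
```

The larger site pulls the global parameters toward its local solution, which is why strongly skewed sites are a natural stress test for this aggregation rule.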

2605.09132 2026-05-12 cs.CV

KEPIL: Knowledge-Enhanced Prompt-Image Learning for Prompt-Robust Disease Detection

Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Robert Berke, Mauricio Reyes

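AI summary: This study proposes KEPIL, a knowledge-enhanced prompt-image learning framework that aims to improve prompt-based zero-shot inference in medical image diagnosis. To address the sensitivity of current vision-language models to prompt variations and their lack of reliable external knowledge, KEPIL incorporates structured medical knowledge through dynamic prompt enrichment, a semantic-aware contrastive loss, and entity-centric report standardization, improving robustness and generalization. Experiments show that KEPIL achieves leading zero-shot performance on multiple benchmarks and markedly improves diagnostic accuracy under prompt variation.

English abstract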

Vision-language models (VLMs) show promise for clinical decision support in radiology because they enable joint reasoning over radiological images and clinical text, thereby leveraging complementary clinical information. However, radiological findings are long-tailed in practice, leaving some conditions underrepresented and making zero-shot inference essential. Yet current CLIP-style medical VLMs are sensitive to prompt variations and often lack trustworthy external knowledge at inference time, which hinders reliable clinical deployment. We present KEPIL, a prompt-robust framework that integrates curated medical knowledge to stabilize zero-shot generalization. KEPIL comprises: (i) dynamic prompt enrichment using ontologies with LLM assistance, (ii) a semantic-aware contrastive loss aligning embeddings of equivalent prompt variants via a dual-embedding objective, and (iii) entity-centric report standardization to yield ontology-aligned representations. Across seven benchmarks, KEPIL achieves state-of-the-art zero-shot inference performance; under prompt-variation tests, it improves AUC by 6.37% on CheXpert and by 4.11% on average. These results suggest that structured knowledge and robust prompt design are key to clinically reliable radiology-facing VLMs. Code will be released at https://github.com/Roypic/KEPIL.

2605.09131 2026-05-12 cs.AI cs.MA

MCP-Cosmos: World Model-Augmented Agents for Complex Task Execution in MCP Environments

Giridhar Ganapavarapu, Dhaval Patel

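AI summary: This paper proposes MCP-Cosmos, a framework that integrates generative world models into the MCP ecosystem to address agents' limited understanding of environment dynamics and lack of long-horizon planning in complex task execution. The framework unifies three technologies (MCP, world models, and agents) and lets agents simulate state transitions in a latent space and refine task plans, improving the accuracy and success rate of task execution. Experiments show that the approach markedly improves tool success rate and parameter accuracy on multiple MCP-Bench tasks, and the framework introduces new evaluation metrics for analyzing world-model effectiveness.

English abstract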

The Model Context Protocol (MCP) has unified the interface between Large Language Models (LLMs) and external tools, yet a fundamental gap remains in how agents conceptualize the environments within which they operate. Current paradigms are bifurcated: task-level planning often ignores execution-time dynamics, while reactive execution lacks long-horizon foresight. We present MCP-Cosmos, a framework that infuses generative World Models (WM) into the MCP ecosystem to enable predictive task automation. By unifying three disparate technologies, namely MCP, World Model, and Agent, we demonstrate that a "Bring Your Own World Model" (BYOWM) strategy allows agents to simulate state transitions and refine plans in a latent space before execution. We conducted experiments using two strategies, namely ReAct and SPIRAL, with 2 planning models and 3 representative world models over 20+ MCP-Bench tasks. We observed improvements in the agent's environment-interaction KPIs, such as tool success rate and tool parameter accuracy. The framework also offers new metrics, such as Execution Quality, that yield new insights into the effectiveness of world models compared to the baseline.

2605.09129 2026-05-12 cs.AI

Data-driven Circuit Discovery for Interpretability of Language Models

Daking Rai, Mor Geva, Ziyu Yao

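AI summary: This study examines circuit discovery in language models, which aims to improve interpretability by localizing and interpreting computational subgraphs inside the model. Existing methods define a task from a hypothesis and search for a single circuit, but the authors find that these methods are sensitive to dataset variations and may discover dataset-specific circuits rather than general task circuits. The study therefore proposes Data-driven Circuit Discovery (DCD), a framework that clusters the examples in a dataset and discovers a separate circuit for each cluster, revealing the model's internal mechanistic structure more accurately.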

Comments 40 pages, 54 figures, 12 tables, Under review

English abstract

Circuit discovery aims to explain how language models (LMs) implement a specific task by localizing and interpreting a circuit, a computational subgraph responsible for the LM's behavior. Existing circuit discovery methods are hypothesis-driven; they first informally define a task with a dataset, and then apply a circuit discovery algorithm over that dataset to obtain a single circuit. This imposes two strong assumptions: that the LM implements the task with a single circuit, and that the dataset adequately represents the task as humans understand it. We systematically test these assumptions across four previously studied tasks and find that even minor dataset variations that preserve task semantics can produce circuits with low edge overlap and cross-dataset faithfulness. More strikingly, when applied to a mixed dataset with two distinct tasks whose separately discovered circuits have near-zero cross-faithfulness, existing methods still return a single circuit with high faithfulness across both tasks. This indicates that current methods discover dataset-specific circuits, rather than general task circuits. We propose Data-driven Circuit Discovery (DCD), a new discovery framework that drops both assumptions: instead of returning a single circuit for a dataset, DCD first clusters examples in the dataset by how similarly the model processes them and discovers a separate circuit for each group. This allows distinct mechanisms to appear separately rather than merged into a single circuit; each circuit explains its group, not the full task. Experiments show that DCD discovers multiple circuits per dataset, each more faithful to its group than a single circuit discovered by existing methods. Broadly, DCD lets the data reveal mechanistic structure within LMs, rather than relying on human-defined task boundaries that may not align with how models organize their computation.

2605.09126 2026-05-12 cs.LG

Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo

Vatsal Shah, Jiahao Sun

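AI summary: This paper studies the optimization problems caused by delayed pseudo-gradients in asynchronous DiLoCo systems and proposes a new outer optimizer, Cosine-Gated Adam Decay (CGAD), which adds an age-based decay scaling to make optimization more robust to delay. Building on Adam, CGAD weights gradients of different ages with an exponential decay and a cosine gate, mitigating the effect of large delays on training stability, and shows strong performance and stability on several language model pretraining tasks.

English abstract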

Asynchronous DiLoCo systems may receive pseudo-gradients computed several outer rounds earlier, yet the standard Nesterov outer optimizer does not explicitly condition its update on per-update age. This can make the outer momentum buffer brittle under large controlled delays. We propose Cosine Gated Adam Decay (CGAD), a simple, drop-in, age-aware outer optimizer that scales each incoming pseudo-gradient by $σ(τ) = γ(τ) e^{-ατ}$ before it enters Adam's first- and second-moment buffers; the exponential models information decay and the cosine gate $γ(τ)$ smoothly zeroes contributions past a chosen cutoff. CGAD reduces to plain Adam at $τ=0$, adds two hyperparameters whose defaults transfer across scales, and extends to partial-sync schedulers via a per-fragment age-aware variant (PA-CGAD). For an idealized gated-adaptive update on smooth non-convex objectives, we prove a non-asymptotic convergence bound whose staleness-bias term depends on $α$ alone, rather than on the realized maximum delay $τ_{\max}$; standard analyses of asynchronous momentum-SGD instead carry a $τ_{\max}^2$ factor. Empirically, on Llama-style language model pretraining at 25M, 1B, and 7B parameters, CGAD trains stably across the controlled delays we sweep. The cosine cutoff acts as scale insurance: the closest baseline, Adam Decay (CGAD without the cutoff), is competitive at 25M but its seed-to-seed $σ$ at $τ=8$ grows 27x from 25M to 7B, pushing its single-shot risk (mean + $σ$) above the chance-level loss while CGAD's stays well below. The published Nesterov recipe is the least stable method on the full sweep.
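
The staleness scale $σ(τ) = γ(τ) e^{-ατ}$ can be illustrated with a short sketch. The abstract does not give the exact form of the cosine gate, so the half-cosine taper below is an assumption, as are the default values of `alpha` and `cutoff` and both function names; the only properties taken from the abstract are that the scale equals 1 at $τ=0$ and is zero past the cutoff.

```python
import math

def staleness_scale(tau, alpha=0.1, cutoff=8.0):
    """sigma(tau) = gamma(tau) * exp(-alpha * tau) with an assumed half-cosine gate."""
    if tau >= cutoff:
        return 0.0  # contributions past the cutoff are dropped entirely
    # Assumed gate: 1 at tau = 0, tapering smoothly to 0 at tau = cutoff.
    gamma = 0.5 * (1.0 + math.cos(math.pi * tau / cutoff))
    return gamma * math.exp(-alpha * tau)

def scale_pseudo_gradient(pseudo_grad, tau, alpha=0.1, cutoff=8.0):
    """Scale a pseudo-gradient by its age before it would enter Adam's moment buffers."""
    s = staleness_scale(tau, alpha, cutoff)
    return [s * g for g in pseudo_grad]

fresh = scale_pseudo_gradient([1.0, -2.0], tau=0)
stale = scale_pseudo_gradient([1.0, -2.0], tau=8)
```

With `tau=0` the pseudo-gradient passes through unchanged, matching the statement that CGAD reduces to plain Adam when there is no delay.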

2605.09121 2026-05-12 cs.LG cs.AI cs.CL cs.IT math.IT

A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability

Hamed Omidvar, Vahideh Akhlaghi

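AI summary: This paper proposes a communication-theoretic framework for analyzing and optimizing reliability techniques for large language model (LLM) agents. It treats LLM sampling as a discrete stochastic channel in the sense of Shannon's coding theory and builds a unified analytical framework on that basis, grouping various reliability methods into six classical operators. The paper also introduces a cost-aware semantic-nearest-neighbor router that trades off quality against cost dynamically without retraining; experiments show that it outperforms fixed methods on multiple tasks, demonstrating the framework's effectiveness and flexibility.

English abstract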

Agents built on large language models (LLMs) rely on a range of reliability techniques, including retry, majority voting, and self-consistency, that have been developed in parallel rather than within a common analytical framework. We observe that an LLM sampled at temperature $T$ is a discrete stochastic channel $p(y \mid x)$ in the sense of Shannon's coding theory, and use this identity as the entry point for such a framework grounded in communication theory. Each of these techniques is a special case of one of six classical reliability operators: diversity combining, hybrid retransmission, iterative generator-critic decoding, rateless sampling, structured redundant verification, and difficulty-adaptive routing. Within the framework we give two closed-form results: a noise-variance threshold above which uniform averaging beats quality-weighted averaging, and a contractivity criterion for generator-critic refinement, consistent with a contractive-to-divergent transition we observe between 3B- and 14B-parameter models. We further introduce a cost-aware semantic-nearest-neighbor router whose single Lagrangian knob traverses the quality-cost frontier without retraining. Across six channel configurations spanning local and cloud models on 69 hard tasks, no fixed model-technique-budget choice dominates, motivating per-task allocation. On a 300-item hard split of MMLU, GSM8K, and HumanEval, our router occupies the full empirical Pareto frontier: at matched quality, its normalized cost is approximately 56% lower than the strongest fixed technique; at matched normalized cost, it improves quality by approximately 7% (26% over single-shot decoding). These results argue for consolidating these reliability techniques into a single tunable layer informed by channel coding.
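
One way to picture a cost-aware nearest-neighbor router with a single Lagrangian knob is sketched below. The record layout, the cosine-similarity neighbor search, the value of `k`, and the technique names are all hypothetical; the abstract only states that one parameter trades quality against cost without retraining.

```python
import math
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def route(query, history, lam, k=2):
    """Pick the technique with the best mean (quality - lam * cost) over the k nearest past tasks."""
    neighbors = sorted(history, key=lambda h: cosine(query, h["emb"]), reverse=True)[:k]
    stats = defaultdict(lambda: [0.0, 0.0, 0])  # technique -> [quality sum, cost sum, count]
    for h in neighbors:
        for tech, (quality, cost) in h["outcomes"].items():
            s = stats[tech]
            s[0] += quality
            s[1] += cost
            s[2] += 1
    return max(stats, key=lambda t: (stats[t][0] - lam * stats[t][1]) / stats[t][2])

# Hypothetical past tasks: embedding plus observed (quality, normalized cost) per technique.
history = [
    {"emb": [1.0, 0.0], "outcomes": {"single_shot": (0.6, 1.0), "majority_vote": (0.8, 5.0)}},
    {"emb": [0.9, 0.1], "outcomes": {"single_shot": (0.5, 1.0), "majority_vote": (0.9, 5.0)}},
    {"emb": [0.0, 1.0], "outcomes": {"single_shot": (0.9, 1.0), "majority_vote": (0.9, 5.0)}},
]
quality_first = route([1.0, 0.05], history, lam=0.0)
cost_aware = route([1.0, 0.05], history, lam=0.1)
```

Sweeping `lam` from zero upward moves the choice from the highest-quality technique toward cheaper ones, which is the sense in which one knob traverses a quality-cost frontier.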

2605.09119 2026-05-12 cs.LG cs.AI cs.CL

Personalized Alignment Revisited: The Necessity and Sufficiency of User Diversity

Enoch Hyunwook Kang

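AI summary: This paper revisits personalized alignment, which aims to adapt large language models to the preferences of different users, and gives the first theoretical characterization of the conditions for statistical efficiency. It shows that user diversity is the key: the set of user-specific heads must cover every latent reward direction that can affect the optimal response. This condition is proved to be both necessary and sufficient. When it holds, simple greedy algorithms achieve optimal efficiency; otherwise, every learner in a natural admissible class incurs at least logarithmic regret. The finding reveals the fundamental role of user diversity in personalized identifiability.

English abstract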

Personalized alignment aims to adapt large language models to heterogeneous user preferences, yet the precise theoretical conditions for its statistical efficiency have not been formally established. This paper characterizes the conditions under which personalized alignment achieves O(1) online regret and log(1/epsilon) offline sample complexity. We show that these optimal rates depend on a specific user-diversity condition: the population of user-specific heads must span the latent reward directions that can alter the optimal response. We prove that this condition is both necessary and sufficient. When it holds, simple greedy algorithms achieve benchmark efficiency; when it fails, every learner in a natural admissible class incurs at least logarithmic regret. Our results identify user diversity as the fundamental driver of personalized identifiability.

2605.09112 2026-05-12 cs.LG cs.AI

Contextual Plackett-Luce: An Efficient Neural Model for Probabilistic Sequence Selection under Ambiguity

Noam Mizrachi, Nadav Har-Tuv, Shai Shalev-Shwartz

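AI summary: This paper proposes Contextual Plackett-Luce (CPL), an efficient neural model for probabilistic sequence selection under ambiguity. The model extends the classical Plackett-Luce model with a context-dependent parameterization that combines unary and pairwise interaction terms, keeping computation efficient while preserving expressivity. CPL decouples parallel scoring from sequential selection, which avoids the computational bottleneck of fully autoregressive models on parallel hardware as well as the difficulty parallel methods have in modeling multi-modal dependencies. It shows better structural consistency and robustness on multi-modal path prediction and representative subset selection tasks.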

Comments 22 pages, 5 figures

English abstract

Selecting a coherent sequence or subset of elements is a fundamental problem in structured prediction, arising in tasks such as detection, trajectory forecasting, and representative subset selection. In many such settings, the target is inherently ambiguous: each input admits multiple valid outputs, while supervision provides only a single sampled instance. This induces a mismatch between the underlying multi-modal target distribution and the observed training signal. We propose Contextual Plackett-Luce (CPL), a structured probabilistic model for sequence selection that extends the classical Plackett-Luce model to a context-dependent setting following an Ising-style parameterization with unary and pairwise interaction terms. CPL can be viewed as a hybrid between fully autoregressive prediction and parallel sequence selection: autoregressive models effectively capture uncertainty but are computationally expensive on modern parallel hardware such as GPUs, while parallel methods are efficient but struggle to represent multi-modal dependencies. CPL combines the strengths of both by constructing the parameters of a probabilistic selection model in a fully parallel manner, followed by a lightweight autoregressive selection process in which each step applies incremental updates to contextual logits. This decoupling of parallel scoring and sequential selection enables efficient computation without sacrificing expressivity. We evaluate CPL on two structured selection tasks: multi-modal path prediction and representative subset selection. CPL achieves improved structural consistency and robustness under ambiguous supervision compared to strong parallel baselines.
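
The split between parallel scoring and lightweight sequential selection can be sketched as Plackett-Luce sampling in which each pick adds a pairwise row to the remaining logits. In the paper the unary and pairwise terms come from a neural network; here they are fixed toy values, and the function name and update rule are assumptions made for illustration.

```python
import math
import random

def sample_sequence(unary, pairwise, length, rng):
    """Sample `length` distinct items; after each pick, add that item's pairwise row to the logits."""
    logits = list(unary)
    chosen = []
    for _ in range(length):
        avail = [i for i in range(len(logits)) if i not in chosen]
        top = max(logits[i] for i in avail)  # subtract the max for numerical stability
        weights = [math.exp(logits[i] - top) for i in avail]
        pick = rng.choices(avail, weights=weights, k=1)[0]
        chosen.append(pick)
        # Incremental update: the chosen item shifts the logits of all other items.
        logits = [l + p for l, p in zip(logits, pairwise[pick])]
    return chosen

# Toy parameters: item 0 is strongly preferred first, and picking it favors item 2 over item 1.
unary = [50.0, 0.0, 0.0]
pairwise = [[0.0, -50.0, 50.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
sequence = sample_sequence(unary, pairwise, length=2, rng=random.Random(0))
```

Only the short selection loop is sequential; the unary and pairwise terms are computed once up front, which is the decoupling the abstract describes.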

2605.09109 2026-05-12 cs.AI

When (and How) to Trust the Expert: Diagnosing Query-Time Expert-Guided Reinforcement Learning

Yann Berthelot, Philippe Preux, Riad Akrour

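AI summary: This paper studies when and how to trust an expert controller in reinforcement learning and proposes a unified experimental framework that systematically compares several expert-guided reinforcement learning methods. By analyzing factors such as degraded expert performance, action bias, and observation noise, it reveals the failure modes of existing methods under different task structures and proposes a decision rule based on observables available before training. The main contributions are a standardized benchmark, a task taxonomy, and a decision rule that can guide practical use.

English abstract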

Many continuous-control problems ship with a competent but suboptimal controller (a tuned PID, a hand-designed gait). A growing family of methods uses such controllers as queryable experts during RL, but each method has been proposed in isolation, on a different benchmark, without imperfect-expert testing. We harmonize the comparison on a shared SAC backbone, common HPO and evaluation protocols, 100/50 seeds per (env, method), and a degradation sweep over expert undertuning, action bias, and observation noise. The comparison surfaces three failure modes single-paper evaluations miss: (F1) a critic blind spot under argmax-plus-bootstrap that drags IBRL below no-expert SAC on experts close to the no-expert-RL ceiling (RL-near-ceiling, distinct from the absolute physical ceiling); (F2) residual saturation on far-from-optimal experts; and (F3) warm-start buffer poisoning that collapses training-time-handoff methods under deployment-time expert undertuning. No single method dominates: each wins on one task-structure regime and fails predictably elsewhere; on RL-near-ceiling experts (FourTank, GlassFurnace) no query-time method clears the expert within our 1M-step budget, leaving open whether this is a fundamental wall or a budget effect. We convert the spread into a testable decision rule keyed on three pre-training observables (expert quality, task termination, perturbation type). The benchmark, taxonomy, and decision rule are the primary contribution; we additionally describe EDGE, a softmax-over-ensemble-LCB design point used to demonstrate that both axes the taxonomy points to (gate form, scoring rule) are individually exploitable.

2605.09106 2026-05-12 cs.CL

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

Xiaoyu Hu, Jinman Zhao

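AI summary: This study proposes Fin-Bias, a benchmark for evaluating the decision-making of large language models in the finance domain when faced with human bias. Built on a large collection of company reports that include analyst ratings, Fin-Bias tests the models' investment judgments when explicitly biased information is present and finds that models are easily swayed by bias in the context. The study also proposes a method for detecting human opinions, which helps steer models toward independent reasoning; some models even exceed human performance in predicting stock returns.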

Comments ACL 2026 Findings

Journal ref
ACL 2026 Findings
English abstract

Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs' capabilities in stock trading, they are often restricted to small samples and fail to demonstrate LLM susceptibility to context containing potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making when faced with uncertainty and possibly human-biased opinions. Fin-Bias includes 8868 long firm-specific analyst reports spanning various industries, each covering firm aspects summarized and analyzed by sophisticated analysts together with an investment rating (Bullish/Neutral/Bearish). We present LLMs with analyst reports with or without the analyst's investment rating, and also with a 'fake' rating, and collect the investment ratings the LLMs generate. Our results reveal that LLMs tend to herd toward the explicit bias in context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently; some models even exceed human performance in predicting future stock returns.

2605.09104 2026-05-12 cs.AI

Token Economics for LLM Agents: A Dual-View Study from Computing and Economics

Yuxi Chen, Junming Chen, Chenyu He, Yiwei Li, Yicheng Ji, Yifan Wu, Dingyu Yang, Lansong Diao, Lidan Shou, Hongliang Zhang, Huan Li, Gang Chen

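AI summary: As large language model agents develop, tokens have become the core economic element of agentic systems, but their exponential consumption creates computational, collaborative, and security bottlenecks. Taking the dual perspective of computer science and economics, this paper presents the first systematic survey of token economics. It proposes a taxonomy with four dimensions (micro level, meso level, macro level, and security), unifies the view of tokens as production factors, exchange media, and units of value, and outlines future research directions such as differentiable token budgets and dynamic market mechanisms, laying a theoretical foundation for scalable next-generation agent systems.

English abstract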

As LLM agents evolve, tokens have emerged as the core economic primitives of Agentic AI. However, their exponential consumption introduces severe computational, collaborative, and security bottlenecks. Current surveys remain fragmented across system optimization, architecture design, and trust, lacking a unified framework to evaluate the fundamental trade-off between output quality and economic cost. To bridge this gap, we present the first comprehensive survey of Token Economics. By unifying computer science and economics, we conceptualize tokens as production factors, exchange mediums, and units of account. We synthesize existing literature across a four-dimensional taxonomy: (1) Micro-level (Single Agent): Optimizing budget-constrained factor substitution via neoclassical firm theory. (2) Meso-level (Multi-Agent Systems): Minimizing collaboration friction using transaction cost and principal-agent theories. (3) Macro-level (Agent Ecosystems): Addressing congestion externalities and pricing via mechanism design. (4) Security: Internalizing adversarial threats as endogenous economic constraints. Finally, we outline frontier directions, including differentiable token budgets and dynamic markets, to lay the theoretical foundation for scalable next-generation agent systems.

2605.09096 2026-05-12 cs.LG

Bridging Spectral Operator Learning and U-Net Hierarchies: SpectraNet for Stable Autoregressive PDE Surrogates

Enrique Hernández Noguera, Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi, Julian Simeonov

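AI summary: This paper proposes SpectraNet, an autoregressive neural operator that aims to resolve the structural tension between spectral architectures and U-Net hierarchies for time-dependent partial differential equations (PDEs). The method combines truncated spectral convolutions with a U-Net hierarchy and introduces a Residual-Target Spectral Block trained with a Semigroup-Consistency Loss, yielding more stable predictions while preserving multi-scale detail. Experiments show strong performance and parameter efficiency on several PDE benchmarks, with a clear advantage when compute is limited.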

Comments 29 pages, 9 figures. Code: https://github.com/Enrikkk/spectranet

English abstract

Neural operators for time-dependent PDEs face a structural tension: spectral architectures (FNO and descendants) inherit exponential rollout-error growth from their one-step Lipschitz constant, while hierarchical U-Net operators trade resolution invariance for multi-scale detail. We introduce SpectraNet, an autoregressive neural operator that composes truncated spectral convolutions inside a U-Net hierarchy with a Residual-Target Spectral Block trained under a Semigroup-Consistency Loss. The residual-target parametrization replaces L^T stability blow-up with linear T*delta drift, and the spectral path's parameter count is Theta(L w^2 M^2), independent of grid N. Under a single unified protocol against 16 published neural-operator baselines on Navier-Stokes nu=1e-5 at 64x64, SpectraNet reaches test relative L2 = 0.0822 at 2.04M parameters -- 2.33x fewer than canonical FNO at ~20% lower error -- and wins five of six rows in a cross-PDE comparison against FNO (NS at nu in {1e-4, 1e-3}, PDEBench Shallow-Water 2D and Diffusion-Reaction, with the Active-Matter row going to FNO inside its seed spread). Trained from scratch at native 128^2 under the same protocol, SpectraNet improves to 0.0724 while FNO regresses to 0.3080. Free rollout stays bounded for T=100 where FNO diverges across all 200 test trajectories. On consumer CPU at B=1, SpectraNet runs sub-200ms while the full-attention Transformer that wins raw L2 pays ~60x latency; we do not claim to beat that Transformer on raw L2, only to dominate the lightweight (<=5M parameter, sub-200ms CPU) Pareto frontier. Source code: https://github.com/Enrikkk/spectranet

2605.09093 2026-05-12 cs.RO cs.SY eess.SY

HyDRA Scorpion: A Cost-effective and Modular ROV for Real-Time Underwater Inspection, Intervention, and Object Detection

Anika Tabassum Orchi, Md Farhan Zaman, Md Darain Khan, Md Alamgir, Mahbubul Islam, Md. Jobayer Rahman, A. M. Zayed Abdullah, Md Mehrab Hossain Khan, Md. Kutub Al Baki, Iftekharul Islam, Shakil Ahmed, Md Sadique Hossain, Md Muzahidul Islam, Shah Mohammad Seaman, Nusrat Jahan Piyal, Shekh Md. Saifur Rahman, Fahim Hafiz, A. K. M. Muzahidul Islam, M. Rezwan Khan

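AI summary: This paper presents HyDRA Scorpion, a low-cost, modular remotely operated underwater vehicle intended to address the high cost and limited intelligence of existing commercial ROVs. The system integrates an AI-driven perception module and in-situ measurement capabilities, and offers 4-DoF maneuverability, dual manipulators, and a custom pressure-tested housing. It operates stably under pressure equivalent to a depth of 304.8 meters and performs accurate real-time object detection and 3D distance measurement. Experiments verify its stability and reliability under prolonged pressure testing and external disturbances, demonstrating its practical value for underwater inspection and intervention tasks.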

Comments 9 Pages, 11 figures, Research Paper by UIU Mariner Team

English abstract

A Remotely Operated Vehicle (ROV) is a tethered underwater robot used for tasks like inspection and intervention. While ROVs are essential tools for underwater science, the high cost of commercial ROVs and a persistent gap between mechanically capable platforms and those with integrated intelligence create a significant barrier to access. HyDRA Scorpion differs from conventional systems by addressing these challenges, integrating an advanced, AI-driven perception stack with in-situ measurement capabilities onto a low-cost, locally manufacturable platform. The system combines 4-DoF maneuverability, dual manipulators, and a custom pressure-tested housing. Experimental results validate the system's robustness and performance. Leak-free operation was confirmed through prolonged pressure testing of the electronics housing to 4 bar in a simulated environment, equivalent to the pressure at a water depth of approximately 304.8 meters, with no moisture ingress detected. The vehicle also demonstrated stable station-keeping, maintaining its position within a tight tolerance of ±0.15 meters under external disturbances. The onboard AI module achieved a mean Average Precision (mAP) of 0.89 for underwater object detection with real-time inference, along with length measurement and 3D-mapping-based distance measurement. In addition, the 4-DoF manipulator arm can grip and hold, and its dual-function manipulator feature supports 360-degree tangle-free rotation.

2605.09092 2026-05-12 cs.CL

Character-Level Transformer for Tajik-Persian Transliteration with a Parallel Lexical Corpus

Mullosharaf K. Arabov

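AI summary: This study addresses automatic transliteration from Tajik (written in Cyrillic script) to Persian (written in Perso-Arabic script). The author builds a lexicographically verified parallel corpus of 52,152 Tajik-Persian words and phrases and trains a character-level sequence-to-sequence Transformer on it. Experiments show that the model outperforms dictionary-based rule-based methods and recurrent neural baselines in both character error rate and exact-match accuracy, and that beam search improves performance further. The study also documents the data preprocessing pipeline, model architecture, and experimental setup to support follow-up research.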

Comments Published in Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script (AbjadNLP), pages 75-83, Rabat, Morocco, March 2026

Journal ref
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script (AbjadNLP), pages 75-83, Rabat, Morocco. Association for Computational Linguistics, March 2026
English abstract

This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik-Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik-Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The Transformer achieves a CER of 0.3216 and an exact-match accuracy of 0.3133, outperforming both dictionary-based, rule-based baselines and recurrent neural baselines. With beam search (k=3), performance further improves to CER 0.3182 and accuracy 0.3215. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All preprocessing scripts, deterministic splits into training, validation, and test sets, and training configurations are released to support reproducibility and further research on Tajik and related Persian dialects. The corpus supports research in character-level transliteration, cross-script NLP, and lexicographic applications.
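
The two reported metrics can be computed as in the sketch below. It uses the common corpus-level definition of CER (total character edits divided by total reference characters); whether the paper aggregates per corpus or per item is not stated in the abstract, so that choice is an assumption, and the two Persian word pairs are illustrative rather than taken from the corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Corpus-level character error rate: total edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)

def exact_match(refs, hyps):
    """Fraction of hypotheses identical to their reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

refs = ["کتاب", "دست"]
hyps = ["کتاب", "دشت"]  # second hypothesis differs by one substituted letter
corpus_cer = cer(refs, hyps)
accuracy = exact_match(refs, hyps)
```

A single wrong character makes a whole word count as a miss under exact match while barely moving CER, which helps explain why the two reported numbers can sit close together on short entries.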

2605.09090 2026-05-12 cs.CV cs.AI

Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

Gabriele Lombardo, Luigi Maiorana, Liliana Lo Presti, Marco La Cascia

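AI summary: This study investigates anisotropy in visual grounding models under controlled counterfactual perturbations, with the goal of analyzing how models behave when given semantically mismatched descriptions. It introduces a similarity-controlled method for generating counterfactual captions that systematically perturbs object or contextual components, in order to analyze grounding behavior at different levels of alignment. Experiments indicate that anisotropy of the embedding space is not the main cause of counterfactual errors, and that understanding robustness requires examining finer-grained geometric properties of the embedding space.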

Comments To be published in the proceedings of the 5th Explainable AI for Computer Vision (XAI4CV) Workshop at CVPR 2026

English abstract

Visual Grounding benchmarks assume that the object described by a referring expression is always present in the image, and grounding models are therefore rarely evaluated under semantically mismatched captions. In such cases, models frequently exhibit approximation behavior, producing a plausible bounding box that satisfies only part of the expression (e.g., preserving the original object while ignoring modified contextual cues). Because mismatched captions represent realistic edge cases, this behavior compromises reliability and raises concerns from an explainability perspective. Identifying its underlying causes is thus essential for improving model faithfulness and interpretability. Adopting a mechanistic interpretability viewpoint, this work examines whether embedding anisotropy contributes to counterfactual failures. A similarity-controlled counterfactual caption generation protocol is introduced to systematically perturb object or contextual components within predefined embedding similarity intervals, enabling a fine-grained analysis of grounding behavior as a function of alignment. Experiments on two Transformer-based models with markedly different embedding geometries (BERT-based TransVG and CLIP-based SwimVG) reveal no meaningful correlation between cosine similarity and approximation. These findings suggest that anisotropy alone does not account for counterfactual errors, and that robustness requires investigating finer-grained geometric properties of the embedding space.

2605.09089 2026-05-12 cs.CV cs.AI

Field-Localized Forgery Detection for Digital Identity Documents

Abhishek Kumar, Riya Tapwal, Carsten Maple, Mark Hooper

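AI summary: This paper proposes FLiD, a lightweight field-localized forgery detection framework designed for digital identity documents in remote identity verification, targeting localized tampering of key fields such as facial photographs and textual information. The method localizes key regions with an object detector, extracts compact feature embeddings with a frozen MobileNetV3-Small network, and classifies them with a lightweight network to detect forgeries accurately. Experiments show that FLiD clearly outperforms existing general-purpose forgery detection methods on several evaluation metrics while using far fewer parameters and much less computation.

English abstract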

Digital identity verification systems used in remote onboarding rely on document images to authenticate users, making them vulnerable to localized manipulations of key identity fields such as facial photographs and textual information. Existing forgery detection methods, developed primarily for natural-image forensics, show limited transferability to structured identity documents. We propose FLiD, a lightweight field-localized framework that targets critical identity regions rather than processing full-document images. A fine-tuned object detector first localizes face and text fields; a frozen MobileNetV3-Small backbone then extracts compact field-level embeddings, which are classified by a lightweight neural network with only 191K trainable parameters. FLiD achieves AUC scores of 0.880 (face), 0.954 (text), and 0.923 (both-field attacks), with corresponding EERs of 18.05%, 11.61%, and 15.16%, representing absolute reductions of 29-35 percentage points over a full-document baseline trained from scratch. FLiD also consistently outperforms general-purpose manipulation detectors (TruFor, MMFusion, UniVAD) across all attack scenarios while requiring 13x fewer parameters and 21x fewer FLOPs.

2605.09087 2026-05-12 cs.SD cs.LG

Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias

Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila

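AI summary This paper addresses gender bias in audio deepfake detection systems and proposes a systematic framework for diagnosing and mitigating it. The study finds that gender disparity stems mainly from acoustic representation differences, gender leakage in learned features, and structural asymmetry in evaluation, rather than from imbalanced training data. A new fairness regularisation method and a threshold-adjustment strategy reduce unfairness without harming detection accuracy, offering guidance for building trustworthy audio deepfake detection systems.

Comments Submitted to SMC 2026 conference

Details
English abstract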

Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it systematically. We present the first diagnosis-first framework that identifies bias source before applying targeted mitigation, evaluated on two models, AASIST and Wav2Vec2+ResNet18, on ASVSpoof5. Our diagnosis shows that bias does not stem from imbalanced training data but from acoustic representation differences, gender leakage in learned features, and structural evaluation asymmetry. We test mitigation strategies across in-processing, post-processing and combined families, including novel methods introduced in this work. Adjusting the decision threshold separately per gender reduces unfairness by 54% to 75% at no cost to detection accuracy, and our new epoch-level fairness regularisation method outperforms existing per-batch approaches. Adversarial debiasing succeeds only when gender leakage is localised, and fails when it is diffuse, an outcome correctly predicted by our diagnosis before training. No single method fully closes the fairness gap, confirming that bias sources must be identified before fixes are applied and that fairer benchmark design is equally important.
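
The per-gender threshold adjustment mentioned in the abstract can be pictured with a small post-processing sketch. The abstract does not say which criterion the authors use to set each threshold, so the version below assumes one plausible choice: place each group's threshold at the same quantile of that group's bona fide scores, so the groups have roughly matching false-positive rates. All names and the `target_fpr` parameter are hypothetical.

```python
import numpy as np

def per_group_thresholds(scores, labels, groups, target_fpr=0.05):
    """Return one threshold per group (assumed criterion: matched quantile).

    scores: higher means "more likely spoofed"; labels: 1 = spoof, 0 = bona fide.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    groups_arr = np.asarray(groups)
    thresholds = {}
    for g in sorted(set(groups)):
        bona_fide = scores[(groups_arr == g) & (labels == 0)]
        # Same quantile in every group, so the share of bona fide audio
        # scoring above the threshold is approximately target_fpr.
        thresholds[g] = float(np.quantile(bona_fide, 1.0 - target_fpr))
    return thresholds

def predict_spoof(scores, groups, thresholds):
    """Flag a sample as spoofed when it exceeds its own group's threshold."""
    return np.array([s > thresholds[g] for s, g in zip(scores, groups)])
```

Because this only changes the decision rule, the underlying model and its scores stay untouched, which is consistent with the abstract's description of threshold adjustment as a post-processing strategy.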

2605.09085 2026-05-12 cs.AI math.PR

Constant-Target Energy Matching: A Unified Framework for Continuous and Discrete Density Estimation

Zhijun Zeng, Yixuan Jiang, Pipi Hu, Zuoqiang Shi

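AI summary This paper proposes Constant-Target Energy Matching (CTEM), a unified energy-matching framework for density estimation on continuous, discrete, and mixed-variable domains. By introducing a bounded energy-difference transform, CTEM avoids the unstable targets of conventional methods, and its sample-only training objective estimates the log-density without partition-function estimation or unbounded density-ratio regression. Experiments show that CTEM substantially outperforms existing methods on a range of benchmarks and yields higher-quality samples.

Details
English abstract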

Density estimation is a central primitive in probabilistic modeling, yet continuous, discrete, and mixed-variable domains are often treated by separate objectives, limiting the ability to exploit a common statistical structure across data types. Continuous score-based methods rely on log-density gradients, while discrete extensions typically use concrete score whose unbounded targets become unstable near low-probability states. We introduce Constant-Target Energy Matching (CTEM), a unified energy-based framework for density estimation on general state spaces. CTEM replaces ordinary density-ratio regression with a bounded energy-difference transform and derives from it a sample-only training objective with the constant target 1. The learned scalar potential recovers log p without partition-function estimation or explicit unbounded ratio regression. Across continuous, discrete, and mixed-variable benchmarks, CTEM substantially improves density estimation over competitive baselines and yields higher-quality samples under standard sampling procedures.

2605.09079 2026-05-12 cs.AI

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Nicolás Astorga, Anita Kriz, Mihaela van der Schaar

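AI summary Although large language models have surpassed human performance on knowledge-intensive tasks such as mathematics and coding, they still fall short on causal reasoning. This paper proposes CauSim, a framework that builds executable structural causal models (SCMs) incrementally generated by large language models, turning causal reasoning from a scarce-label problem into a scalable supervised learning task. CauSim formalizes non-executable causal knowledge into code and translates executable SCMs into natural language, which supports causal reasoning across representations; the study demonstrates generalization to complex systems, consistent gains from curriculum learning, and data augmentation based on formalized domain knowledge.

Details
English abstract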

Despite surpassing human performance across mathematics, coding, and other knowledge-intensive tasks, large language models (LLMs) continue to struggle with causal reasoning. A core obstacle is the target data itself: causal systems are complex and often expressed in non-executable forms, while ground-truth answers to causal queries are inherently scarce. We introduce CauSim, a framework that turns causal reasoning from a scarce-label problem into a scalable supervised one. CauSim constructs increasingly complex causal simulators: executable structural causal models (SCMs), incrementally built by LLMs, that scale to globally complex systems while maintaining verifiable answers to causal queries. CauSim operates across representations by formalizing non-executable causal knowledge into code, enabling data augmentation, and translating executable SCMs into natural language, enabling supervision in previously difficult-to-supervise representations. We structure our research into two parts: (1) how to construct increasingly complex causal simulators, and (2) a systematic study of what CauSim enables, demonstrating generalization across representations, consistent gains from curriculum scaling and data volume, LLM self-improvement through self-generated simulators, and data augmentation via formalization of existing domain knowledge.
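
To make "executable SCM with verifiable answers" concrete, here is a toy sketch. It is not taken from CauSim: the variables (`rain`, `sprinkler`, `wet`), the structural equations, and the `simulate` function are all invented for illustration. The point is only that once a causal system is code, the answer to an interventional query can be checked by running it.

```python
def simulate(noise, do=None):
    """Run a toy structural causal model.

    noise: dict of exogenous values for "rain" and "sprinkler".
    do:    optional dict of interventions that override structural equations.
    """
    do = do or {}
    rain = do.get("rain", noise["rain"])
    # The sprinkler only runs when it is not raining (hypothetical mechanism).
    sprinkler = do.get("sprinkler", (not rain) and noise["sprinkler"])
    wet = do.get("wet", rain or sprinkler)
    return {"rain": rain, "sprinkler": sprinkler, "wet": wet}

def answer_query(noise, intervention, target):
    """Ground-truth answer to 'what is target under do(intervention)?'."""
    return simulate(noise, do=intervention)[target]
```

A question such as "would the grass still be wet if it had not rained?" then has a label obtained by execution (`answer_query(noise, {"rain": False}, "wet")`) rather than by manual annotation.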

2605.09073 2026-05-12 cs.RO

Smoothing Out the Edges: Continuous-Time Estimation with Gaussian Process Motion Priors on Factor Graphs

Connor Holmes, Sven Lilge, Zi Cong Guo, Frank Dellaert, Timothy D. Barfoot

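AI summary This paper studies continuous-time state estimation and presents Gaussian process motion priors on factor graphs, which yield smoother trajectory estimates and handle asynchronous sensor data. Compared with conventional parametric methods, the approach exploits the nonparametric nature of Gaussian processes to offer a more flexible and easily implemented solution. The article also provides three example implementations in the GTSAM framework to ease practical adoption.

Details
English abstract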

Continuous-time state estimation is gaining in popularity due to its abilities to provide smooth solutions, handle asynchronous sensors, and interpolate between data points. While there are two main paradigms, parametric (e.g., temporal basis functions, splines) and nonparametric (Gaussian processes), the latter has seen less adoption despite its technical advantages and relative ease of implementation. In this article, we seek to rectify this situation by providing a new simplified explanation of GP continuous-time estimation rooted in the language of factor graphs, which have become the de facto estimation paradigm in much of robotics. To simplify onboarding, we also provide three working examples implemented in the popular GTSAM estimation framework.
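
The abstract does not name a specific prior, so the sketch below assumes the white-noise-on-acceleration (constant-velocity) model that is often used in GP continuous-time estimation, restricted to one dimension with state `[position, velocity]`. It shows how such a prior becomes a binary factor between two consecutive states: the residual is the deviation from constant-velocity motion, weighted by the prior covariance. It uses NumPy only and does not reproduce the GTSAM examples mentioned in the abstract; function names and `qc` (power spectral density) are illustrative.

```python
import numpy as np

def wnoa_transition(dt):
    """State transition over an interval dt for state [position, velocity]."""
    return np.array([[1.0, dt],
                     [0.0, 1.0]])

def wnoa_covariance(dt, qc):
    """Process-noise covariance accumulated over dt for white noise on acceleration."""
    return qc * np.array([[dt**3 / 3.0, dt**2 / 2.0],
                          [dt**2 / 2.0, dt]])

def gp_prior_cost(x_k, x_k1, dt, qc=1.0):
    """Cost of the motion-prior factor linking two consecutive states."""
    error = x_k1 - wnoa_transition(dt) @ x_k
    return 0.5 * float(error @ np.linalg.solve(wnoa_covariance(dt, qc), error))
```

A trajectory that moves at exactly constant velocity incurs zero cost from this factor, and the cost grows as the second state departs from that prediction.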

2605.09071 2026-05-12 cs.CV

Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation

Rohith Ramanan, A. N. Rajagopalan

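AI summary This paper proposes Probability-Flow Distillation (PFD), a new method that addresses the mode collapse and loss of detail seen in text-to-3D generation with Score Distillation Sampling (SDS) and its variants. By modeling the distillation process as an exact Wasserstein gradient flow, PFD achieves more accurate distribution matching, produces 3D models with fine-grained, high-fidelity details, and markedly improves generation quality.

Details
English abstract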

Score Distillation Sampling (SDS) and its variants have been widely used for text-to-3D generation by distilling 2D image diffusion priors. However, the standard SDS objective is prone to severe mode collapse, frequently yielding over-smoothed and over-saturated results. Although recent advancements, such as Score Distillation via Inversion (SDI), mitigate these artifacts and produce visually sharper models, they ultimately fail to faithfully capture the full target distribution. In this work, we show that the bottleneck limiting the sampling capacity of SDI stems from its reliance on the posterior mean estimator, which is mathematically equivalent to a single-step Euler approximation of the deterministic reverse DDIM trajectory. To address this, we propose a naturally motivated extension termed Probability-Flow Distillation (PFD). We establish that PFD corresponds exactly to a Wasserstein gradient flow, thereby inducing principled distribution-matching dynamics. Finally, we show that PFD can synthesize 3D assets with fine-grained, high-fidelity details and achieve improved quality compared to existing methods.
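
The link between the posterior mean estimator and a single DDIM step can be seen in standard DDIM notation. The symbols below ($\bar\alpha_t$ for the cumulative noise schedule, $\epsilon_\theta$ for the noise prediction) are the usual conventions and are assumed here; the abstract itself does not fix notation, and the paper's own derivation may be phrased differently.

```latex
\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}},
\qquad
x_s = \sqrt{\bar\alpha_s}\,\hat{x}_0(x_t) + \sqrt{1-\bar\alpha_s}\,\epsilon_\theta(x_t, t), \quad s < t .
```

Taking the deterministic DDIM update in one jump from $t$ to $s = 0$, where $\bar\alpha_0 = 1$, gives $x_0 = \hat{x}_0(x_t)$. In this sense the posterior mean is what a single coarse step along the reverse trajectory returns, whereas following the trajectory with many small steps re-evaluates $\epsilon_\theta$ along the way.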

2605.09067 2026-05-12 cs.CV

Reducing Annotation Burden for Femoral Cartilage Segmentation in Knee MRI via Cross-Sequence Transfer Learning

Francesco Chiumento, Gianluigi Crimi, Elisa Moretta, Rocco Milieri, Alberto Bazzocchi, Giulio Vara, Giacomo Dal Fabbro, Stefano Zaffagnini, Fulvia Taddei, Serena Bonaretti

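AI summary This study aims to reduce the manual annotation burden of femoral cartilage segmentation in knee MRI through cross-sequence transfer learning, testing bidirectional transfer between dual-echo steady-state (DESS) and sagittal proton density-weighted 3D fast spin-echo (Cube) sequences. A modified 2D U-Net is pretrained on 507 DESS images from the OAI dataset and then transferred across sequences. Cube-to-DESS transfer comes close to same-sequence training, whereas DESS-to-Cube transfer needs more annotated data, and lesions affect segmentation differently depending on the sequence. The results offer an effective way to reduce annotation effort in medical image segmentation.

Details
English abstract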

Purpose: To develop and evaluate cross-sequence transfer learning for automatic femoral cartilage segmentation, testing bidirectional transfer between dual-echo steady-state (DESS) and sagittal proton density-weighted 3D fast spin-echo (Cube) sequences. Materials and Methods: We optimized a modified 2D U-Net on 507 DESS images from the Osteoarthritis Initiative (OAI). We then established same-sequence baselines using subject-level cross-validation on a subset of 44 OAI DESS images and 44 Cube images acquired at the Istituto Ortopedico Rizzoli, Bologna, Italy. Each subset included 22 non-lesioned and 22 lesioned subjects. Finally, we performed transfer learning across sequences by fine-tuning the pretrained models on the target sequence with increasing training set sizes to study convergence, while keeping validation and test sets fixed. Segmentations were evaluated using Dice similarity coefficient (DSC) and average surface distance (ASD). Lesion effects were assessed with two-sided Mann-Whitney U tests with Bonferroni correction. Results: Same-sequence training yielded higher accuracy on DESS than Cube (DSC, $0.900$ vs $0.830$; $P < .001$). Cube-to-DESS transfer matched DESS performance (DSC, $0.903 \pm 0.032$ vs $0.900 \pm 0.027$), reaching a performance plateau at 9 training subjects. DESS-to-Cube yielded a lower combined DSC ($0.802 \pm 0.049$ vs $0.830 \pm 0.042$), reaching a plateau at 24 training subjects. Lesions did not affect DESS ($P \ge .39$) but reduced Cube accuracy (DSC, $0.805$ vs $0.856$; $P < .001$). Conclusion: Transfer learning across sequences can substantially reduce target-sequence annotation requirements for femoral cartilage segmentation, but performance is direction- and sequence-dependent, and the effects of lesions on segmentation may vary across MRI sequences.
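
For readers less familiar with the headline metric, the Dice similarity coefficient between a predicted and a reference binary mask is $2|A \cap B| / (|A| + |B|)$. The sketch below is a generic implementation, not the authors' evaluation code; returning 1.0 when both masks are empty is a convention chosen here.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return float(2.0 * np.logical_and(pred, truth).sum() / denom)
```

On this scale, the reported gap between 0.900 (DESS) and 0.830 (Cube) is a difference in overlap relative to the combined mask size, which is why thin structures such as cartilage tend to be sensitive to small boundary errors.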

2605.09065 2026-05-12 cs.CV cs.LG

Dependency-Aware Discrete Diffusion for Scene Graph Generation

Rajalaxmi Rajagopalan, Romit Roy Choudhury

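AI summary This study proposes a dependency-aware discrete diffusion model for scene graph generation, addressing the challenge of producing structured scene graphs from natural language. The method decouples structure and semantics across the forward and reverse processes to capture conditional dependencies among objects, edges, and relations, and so generates scene graphs that better match the text description. Experiments show that it outperforms existing continuous and discrete graph generation methods on standard benchmarks and yields better compositional alignment in downstream image generation, especially in multi-object scenes.

Details
English abstract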

Scene graphs (SGs) represent objects and their relationships as structured graphs, enabling applications in image generation, robotics, and 3D understanding. Recent work suggests that conditioning image generation on scene graphs improves compositional fidelity compared to text-only prompting. However, since users typically provide text rather than structured graphs, a key challenge is to generate scene graphs from natural language. Prior work on discrete diffusion has demonstrated success in generating generic graphs such as molecules and circuits, but fails to account for the hierarchical structure and strong dependencies between objects, edges, and relations in scene graphs. We address this limitation by introducing a dependency-aware, hierarchically constrained discrete diffusion model for scene graph generation. Our approach decouples structure and semantics across the forward and reverse processes, enabling the model to capture conditional dependencies. At inference time, we perform training-free conditioning to sample text-aligned scene graphs. We evaluate our method on standard SG benchmarks and demonstrate improvements over both continuous and discrete graph generation baselines across graph and layout metrics. When fed to downstream image generation, our approach yields improved compositional alignment compared to text-to-image models, particularly in multi-object scenarios.

2605.09060 2026-05-12 cs.CL

Language-Conditioned Visual Grounding with CLIP Multilingual

J. de Curtò, Mauro Liz, I. de Zarzà

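AI summary This study investigates why multilingual vision-language models perform differently across languages. Using a dense multilingual CLIP probe in which the visual encoder is fixed and only the text branch varies, it shows that the performance gap originates mainly in the text branch. Low-resource languages suffer a structural deficit in the text branch, and scaling up the visual encoder widens the gap for some languages while improving Arabic. Cross-language similarity remains high, but the drop in mask IoU indicates that the dominant failure mode is spatial misalignment rather than signal collapse. The method is competitive in energy efficiency and suitable for multilingual deployment.

Details
English abstract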

Multilingual vision-language models exhibit systematic performance gaps across languages, but the mechanism remains ambiguous: cross-language divergence could arise from the visual encoder, the text branch, or their interaction. We resolve this ambiguity through a dense multilingual CLIP probe in which the visual encoder is held identical across thirteen typologically diverse languages and only the XLM-RoBERTa text branch varies. We evaluate two CLIP architectures spanning a 7x visual-encoder scale gap (XLM-R base + ViT-B/32, ~87M visual parameters; XLM-R large + ViT-H/14, ~632M) on 11 concepts and 210 images, and quantify cross-language agreement via cluster-mask IoU, top-percentile IoU, and Spearman rank correlation against an English reference (n=2,310 paired observations per language). Three findings emerge. First, low-resource languages (Arabic, Basque, Luxembourgish) incur a structural penalty at both backbone scales (Wilcoxon HR>LR p<10^-300; cluster-mask IoU gap +0.114 at base, +0.143 at large), isolating the deficit to the text branch. Second, scaling the encoder 7x widens the gap for structural failure cases (Basque Δ=-0.056, Luxembourgish Δ=-0.076) while improving Arabic (Δ=+0.033), separating corpus-coverage from tokeniser-fertility failures. Third, peak similarity is preserved across languages (mean ratio 0.94 at large scale) while cluster-mask IoU drops sharply, identifying spatial misalignment, not signal collapse, as the dominant failure mode. At 3.4-3.9 Wh per 1,000 queries, dense-CLIP grounding is competitive with high-throughput inference budgets, positioning it as a practical substrate for energy-aware multilingual deployment.
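
The agreement metrics in the abstract compare where two languages place high similarity on the same image. The abstract does not spell out how the masks are built, so the sketch below assumes the simplest reading of "top-percentile IoU": threshold each dense similarity map at a fixed percentile and compute intersection-over-union of the resulting masks. Function names and the default percentile are hypothetical.

```python
import numpy as np

def top_percentile_mask(sim_map, percentile=90):
    """Binary mask of the locations at or above the given percentile of similarity."""
    sim_map = np.asarray(sim_map, dtype=float)
    return sim_map >= np.percentile(sim_map, percentile)

def mask_iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks of equal shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty (assumed convention)
    return float(np.logical_and(a, b).sum() / union)
```

Under this reading, two languages can have similar peak similarity values yet a low IoU if their peaks fall on different image regions, which matches the abstract's distinction between spatial misalignment and signal collapse.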

2605.09055 2026-05-12 cs.RO cs.AI cs.MA

Octopus Protocol: One-Shot Hardware Discovery and Control for AI Agents via Infrastructure-as-Prompts

Quilee Simeon, Justin M. Wei, Yile Fan

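AI summary This paper presents Octopus Protocol, a system that follows the idea of "infrastructure as prompts" to give AI agents one-shot hardware discovery and control. Through a five-stage automated pipeline that needs only operating-system access and a language-model API key, it identifies connected devices, generates typed tool interfaces, and deploys them as a live HTTP endpoint, greatly lowering the engineering cost of hardware onboarding. Experiments show that hardware can be onboarded in 10 to 15 minutes, exposing up to 30 usable tools and supporting closed-loop visual-motor control.

Details
English abstract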

Recent agentic-robotics systems, from Code-as-Policies to modern vision-language-action (VLA) foundation models, presuppose that drivers, SDKs, or ROS-style primitives for the target hardware already exist. Writing those primitives is the dominant engineering cost of bringing up new hardware for agent control. We present Octopus Protocol, a system that collapses that cost to a single shell command. Given only raw OS access and a language-model API key, a coding agent executes a five-stage pipeline--PROBE, IDENTIFY, INTERFACE, SERVE, DEPLOY--to discover connected devices, infer their capabilities, generate a Model Context Protocol (MCP) server with typed tools, and deploy it as a live HTTP endpoint. A persistent daemon then monitors the system, heals broken code, and perceives physical state through the camera tools it generated for itself. Two architectural principles make this work: protocols are prompts, not code, and the coding agent is the runtime. We validate the system on three heterogeneous platforms (PC/WSL, Apple Silicon macOS, Raspberry Pi 4) and on a commercial 6-DOF robotic arm with USB camera feedback. One command onboards the hardware in ~10-15 minutes and exposes up to 30 MCP tools; an MCP-compliant client then performs closed-loop visual-motor control through tools no human wrote.

2605.09053 2026-05-12 cs.CV

LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation

Jiankun Peng, Jianyuan Guo, Yiguang Yang, Yue Liu, Jiashuang Yan, Ying Xu

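AI summary In Vision-Language Navigation in Continuous Environments (VLN-CE), existing online topological planning methods still suffer from redundant local depth information and from weakened attention to the current candidate nodes as the topological graph grows. This paper proposes LCGNav, a modular local geometric enhancement framework that converts candidate depth views into 3D point clouds and applies physical truncation based on the reachable range, giving more compact local geometric modeling. LCGNav also introduces a dimension-preserving local fusion strategy that applies geometric enhancement only to the currently relevant "ghost" nodes without changing the original planner interface. Experiments show that LCGNav is an effective cross-architecture enhancement module that improves key metrics of several representative online topological methods at low training cost and achieves the best performance on the val-unseen splits of R2R-CE and RxR-CE.

Details
English abstract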

Online topological planning has become an effective paradigm for Vision-Language Navigation in Continuous Environments (VLN-CE), but existing methods still suffer from two limitations: redundant local depth information and weakened focus on current frontier candidates as the topological graph grows. To address this, we propose LCGNav, a modular local geometric enhancement framework for topological VLN. LCGNav explicitly converts candidate depth views into 3D point clouds and applies physical truncation based on the agent's reachable range, enabling more compact local geometric modeling. It further introduces a dimension-preserving local fusion strategy with transient state degradation, so that geometric enhancement is applied only to the currently relevant ghost nodes without changing the original planner interface. Experiments on R2R-CE and RxR-CE show that LCGNav serves as an effective cross-architecture enhancement module, consistently improving multiple key metrics of representative online topological baselines with low additional training cost. When integrated with ETP-R1, LCGNav achieves the best performance among the compared online topological methods on the val-unseen splits of the R2R-CE and RxR-CE benchmarks. The code is available at https://github.com/shannanshouyin/LCGNav.
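
The step "convert depth views into 3D point clouds and truncate by reachable range" can be sketched with a standard pinhole back-projection. This is a generic illustration, not the LCGNav code from the linked repository: the intrinsics (`fx`, `fy`, `cx`, `cy`), the `max_range` parameter, and the choice to truncate by Euclidean distance from the camera are all assumptions.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, max_range):
    """Back-project a depth image to camera-frame points and drop far points.

    depth: (H, W) array of metric depth along the optical axis; 0 = no reading.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel row (v) and column (u) indices
    z = depth.astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Keep valid readings that lie within the assumed reachable range.
    keep = (points[:, 2] > 0) & (np.linalg.norm(points, axis=1) <= max_range)
    return points[keep]
```

Truncation of this kind shrinks the point set before any further encoding, which is one way to obtain the more compact local geometry the abstract describes.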

2605.09050 2026-05-12 cs.RO cs.CV

Automated Robotic Moisture Monitoring in Agricultural Fields

Senthil Palanisamy, Akila I. S

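AI summary This paper aims to develop an automated robotic system for soil moisture monitoring in large agricultural fields. The system combines in-field moisture sensors with a robot, plans paths with Dijkstra's algorithm, and computes soil moisture with image processing techniques, enabling efficient and economical monitoring. A small study field was set up and a prototype was tested, confirming the feasibility of the approach.

Comments 2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)

Details
Journal ref
2018 International Seminar on Intelligent Technology and Its Applications (ISITIA)
English abstract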

Monitoring the moisture level of land in a large-scale plantation is tedious. The main objective of this project is to use a robotic kit in collaboration with on-field moisture sensor circuits, thereby creating an efficient and economical moisture monitoring system. A large agricultural field is divided into smaller grids, and a moisture sensor is placed in each grid. Whenever a sensor reports the soil to be dry, the robot goes to the field concerned for inspection. The path to that field is found by applying Dijkstra's shortest path algorithm to the aerial image of the field. The robot then calculates the total moisture content of the field using suitable image processing algorithms and reports it. For developing and testing this work, a small study field was set up, above which a camera was mounted at an appropriate height to capture its aerial view. A prototype of an automated system for monitoring the moisture of agricultural fields has thus been developed through this work.
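
The path-planning step can be illustrated with Dijkstra's algorithm on a small grid. The abstract says the algorithm is applied to the aerial image but not how the image becomes a graph, so this sketch assumes a hypothetical preprocessing step that yields a grid of traversal costs with `None` marking blocked cells; the function name and 4-neighbour connectivity are likewise assumptions.

```python
import heapq

def dijkstra_grid(grid, start, goal):
    """Shortest path on a 4-connected grid.

    grid[r][c] is the cost of entering a cell, or None if the cell is blocked.
    Returns (path as a list of (row, col), total cost), or (None, inf) if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] is not None:
                nd = d + grid[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]
```

In the setting described, `start` would be the robot's current cell and `goal` the cell whose sensor reported dry soil.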

2605.09045 2026-05-12 cs.AI cs.CR cs.SE

Containment Verification: AI Safety Guarantees Independent of Alignment

Royce Moon, Lav R. Varshney

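AI summary This paper proposes containment verification, a new approach that ensures an AI agent's behavior stays within safety boundaries while it carries out tasks, with guarantees that are independent of model alignment. The approach places a safety boundary policy inside the agentic framework, uses formal verification to enforce the safety constraints for every possible model output, and mechanizes the verification in Dafny. The study presents the first deductive formal verification of a minimalist agentic LLM framework, and its safety guarantee is unaffected by changes in model capability.

Comments 14 pages

Details
English abstract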

Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.
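
The idea of treating the model as an unconstrained oracle over a typed action space can be pictured with a toy example. The sketch below is not the paper's Dafny proof and does not model PocketFlow: the action type, tool names, paths, and policy are invented, and exhaustive enumeration over a tiny finite space stands in for the deductive, universally quantified argument the authors describe.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Action:
    tool: str
    path: str

# Hypothetical typed action space: every (tool, path) pair the model could emit.
TOOLS = ("read_file", "write_file", "delete_file")
PATHS = ("/sandbox/a.txt", "/sandbox/b.txt", "/etc/passwd")
NOOP = Action("noop", "")

def policy_allows(action):
    """Hypothetical boundary policy: only reads and writes inside /sandbox/."""
    return action.tool in ("read_file", "write_file") and action.path.startswith("/sandbox/")

def contain(proposed):
    """Containment layer: forward allowed actions, replace anything else with a no-op."""
    return proposed if policy_allows(proposed) else NOOP

def check_all_outputs():
    """Treat the model as 'havoc': check the policy for every action it could propose."""
    for tool, path in product(TOOLS, PATHS):
        executed = contain(Action(tool, path))
        if executed != NOOP and not policy_allows(executed):
            return False
    return True
```

Nothing in `check_all_outputs` depends on how the model chooses its action, which is the sense in which such a guarantee is independent of alignment; the formal version replaces enumeration with a proof over the whole typed space.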

2605.09044 2026-05-12 cs.LG

Predicting Plasticity in Deep Continual Learning: A Theoretical Perspective

Jiuqi Wang, Jayanth Srinivasa, Claire Chen, Shuze Daniel Liu, Ali Payani, Shangtong Zhang

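AI summary Deep continual learning requires models to adapt to new tasks without retraining from scratch, but neural networks can lose their ability to adapt after training on previous tasks, a phenomenon known as loss of plasticity. Taking a theoretical perspective, this paper shows that several commonly used plasticity diagnostics are limited in predicting a model's future optimization ability, and proposes a new metric, optimization readiness, which combines gradient strength and gradient reliability, comes with a theoretical guarantee, and shows better predictive performance across several continual learning tasks.

Comments 21 pages, 4 figures, 2 tables

Details
English abstract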

Deep continual learning requires models to adapt to new tasks without retraining from scratch. However, neural networks can lose their ability to adapt to new tasks after training on previous ones, a phenomenon known as loss of plasticity. There have been several explanations and diagnostics proposed for plasticity loss. Motivated by the philosophy that "all models are wrong, but some are useful", we ask: can existing diagnostics predict a neural network's plasticity? In this work, we take a practical view to interpret plasticity as trainability, i.e., a neural network's future optimization gain on a target task. We first take a theoretical approach, showing, by constructing a few counterexamples, that some widely adopted diagnostics of plasticity, including representation rank and neural tangent kernel rank, can fail to predict the loss of trainability in both regression and classification settings. We instead propose a novel metric, called optimization readiness, which combines gradient strength and gradient reliability. We prove that optimization readiness lower bounds one-step optimization gain under standard smoothness assumptions, providing a theoretical guarantee for its predictive power. Empirically, we show that across commonly used deep continual learning settings, such as Slowly-Changing Regression and Permuted MNIST, optimization readiness more reliably ranks checkpoints by trainability than prior diagnostics, even with substantially fewer samples.
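
The abstract does not give the definition of optimization readiness, so it is not reproduced here. As background for the phrase "lower bounds one-step optimization gain under standard smoothness assumptions", the textbook descent lemma already shows how gradient strength controls the gain of a single gradient step. The notation below ($\mathcal{L}$ for the loss, $L$ for the smoothness constant, $\eta$ for the step size) is assumed for illustration.

```latex
\mathcal{L}(\theta - \eta g) \le \mathcal{L}(\theta) - \eta\|g\|^2 + \frac{L\eta^2}{2}\|g\|^2
\;\Longrightarrow\;
\mathcal{L}(\theta) - \mathcal{L}(\theta - \eta g) \ge \eta\left(1 - \frac{L\eta}{2}\right)\|g\|^2,
\qquad g = \nabla\mathcal{L}(\theta).
```

For $\eta < 2/L$ the right-hand side is positive and scales with $\|g\|^2$. This bound assumes the exact gradient; with a noisy estimate the guaranteed gain shrinks, which is presumably where a gradient reliability term would enter.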