Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing
Antonio Castaldo, Johanna Monti, Sheila Castilho
AI summary: Studies the emotional profiles of LLM translations and how post-editing moves them closer to human translation. Comparing LLM translations of Oryx and Crake, their post-edited versions, and a human translation, the paper finds that MT systems introduce model-specific emotional fingerprints that weaken the author's voice.
This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood's Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to limited preservation of the author's voice.
Duality for Optimal Multi-Item, Multi-Bidder Auction Design: Revenue Certificates via Deep Learning
Yanchen Jiang, David C. Parkes, Tonghan Wang
AI summary: Proposes the first computational framework that directly tackles the dual problem for multi-item, multi-bidder auctions, parametrizing Lagrange multipliers with neural networks and introducing a lifting technique to produce certified revenue upper bounds and near-optimality certificates for continuous types.
Characterizing revenue-optimal auctions for multi-item, multi-bidder settings remains a fundamental open problem, with no known closed-form solution existing beyond restrictive binary-type instances. This has motivated interest in computational approaches to optimal auction design. In this paper, we introduce the first computational framework that directly tackles the dual problem for multi-item, multi-bidder auctions and dominant-strategy incentive compatibility (DSIC), generating certified revenue upper bounds. Our approach parametrizes Lagrange multipliers with a structurally guaranteed strict flow-conservation property using neural networks, enabling efficient optimization over feasible dual solutions via gradient descent. To bridge the gap between discrete computational methods and theoretical guarantees for continuous types, we develop a novel lifting technique that maps dual certificates from coarse discretizations to fine refinements. We prove that lifting gives valid revenue upper bounds for multi-item, multi-bidder auctions with continuous uniform valuations. Furthermore, we give a generalized lifting construction for arbitrary continuous distributions and demonstrate that these lifted duals converge to the revenue of the original continuous problem in the discrete limit. We validate this computational framework for the dual auction design problem by recovering known analytical mechanisms for canonical instances. For multi-item multi-bidder problems, our framework establishes a small gap between the optimal revenue and best-known DSIC mechanisms, providing computational certificates of near-optimality.
Nonlinear Estimators: Dual Bayesian Affine Estimators for Parameter Learning
Sasan Vakili, Daniël Woonings, Pradyumna Paruchuri, Peyman Mohajerin Esfahani
AI summary: Proposes a nonlinear parameter estimator for Wiener-type state-space models that couples two affine minimum mean-squared error estimators through a fixed-point architecture, one for the unknown parameters and one for the latent variables. Two dual-estimator frameworks are developed, and experiments show that the dual state-parameter estimator attains the lowest parameter mean-squared error among the compared methods.
This paper presents a nonlinear parameter estimator for Wiener-type state-space models obtained as a fixed-point architecture that couples two affine minimum mean-squared error (MMSE) estimators: one for the unknown parameters and one for latent variables. The architecture retains the functional structure of the optimal affine MMSE parameter estimator while incorporating Dynamic Basis Statistics (DBS) estimates that summarize nonlinear basis-function evaluations. Two DBS construction strategies are developed, leading to two nonlinear estimator frameworks. The dual basis-parameter estimator combines an affine basis estimator with the affine parameter estimator, whereas the dual state-parameter estimator first computes affine state estimates and their covariances, then maps these state-estimate statistics through a Gaussian DBS operator to obtain DBS estimates. Both dual estimators admit fixed-point characterizations that alternate between estimating each component using the updated prior of the other, obtained from that component's plug-in estimate statistics from the previous iteration. The efficacy of the proposed methods is examined via extensive Monte Carlo experiments, showing that the dual basis-parameter estimator attains parameter mean-squared errors comparable to those of the purely affine parameter estimator, while the dual state-parameter estimator achieves the lowest parameter mean-squared error, outperforming both the dual basis-parameter and purely affine parameter estimators, as well as sequential Monte Carlo variants of classical Particle Gibbs and Expectation-Maximization schemes.
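For readers unfamiliar with the building block named in this abstract, the standard affine MMSE estimator of a parameter vector from an observation vector can be written as below. The symbols ($\theta$, $y$, $\mu$, $\Sigma$) are notation assumed here for illustration and are not taken from the paper; the paper's dual estimators couple two such maps in a fixed-point iteration whose details the abstract does not give.

```latex
% Affine (linear-plus-offset) MMSE estimator of \theta from y.
% \mu_\theta, \mu_y: prior means; \Sigma_{\theta y}: cross-covariance;
% \Sigma_{yy}: covariance of y (assumed invertible).
\hat{\theta}(y) = \mu_\theta + \Sigma_{\theta y}\,\Sigma_{yy}^{-1}\,(y - \mu_y)
```

Among all estimators of the form $Ay + b$, this choice minimizes $\mathbb{E}\,\lVert \theta - \hat{\theta}(y) \rVert^2$, and it depends on the joint distribution only through first and second moments.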
Maximum Matching Accuracy: An Instance Segmentation Evaluation Metric Using Globally Optimal Matching
Kaden Stillwagon, Alexandra D. VandeLoo, Craig R. Forest
AI summary: Proposes Maximum Matching Accuracy (MMA), which uses globally optimal one-to-one matching and per-pixel normalization to overcome the discontinuity, low sensitivity, and non-optimal matching of existing metrics for cell segmentation evaluation, giving more stable, sensitive, and interpretable scores.
Reliable evaluation of instance segmentation models requires metrics that accurately and consistently reflect segmentation quality. However, the metrics most widely used in biological imaging carry fundamental mathematical weaknesses: hard Intersection-over-Union (IoU) thresholds that produce discontinuous, low-sensitivity scoring; per-object normalization that distorts scores under object size variation; and greedy or one-to-many matching procedures that yield non-optimal, order-dependent correspondences. Together, these properties produce unintuitive and unreliable model rankings under common failure modes such as split cells, merged cells, and cell boundary imprecision. We propose Maximum Matching Accuracy (MMA), a threshold-free continuous metric that finds a globally optimal one-to-one matching between predicted and ground-truth objects and aggregates total overlap using per-pixel normalization. We evaluate MMA against AP@50, PQ, SEG, and AJI across three experiments: synthetic failure cases, progressive corruption tests, and a model ranking comparison. MMA produces scores that are more stable, more sensitive, and more interpretable than existing alternatives, providing a principled foundation for fair instance segmentation benchmarking in biological cell imaging.
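The idea of a globally optimal one-to-one matching with per-pixel normalization can be sketched as follows. This is a minimal illustration, not the paper's definition: the abstract does not give the exact formula, so the normalization used here (matched overlap divided by the number of pixels in the union of all predicted and ground-truth objects) is an assumption, and the function name `mma_sketch` is hypothetical. The matching is found by brute force over permutations, which is only practical for a handful of objects; an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at scale.

```python
from itertools import permutations


def mma_sketch(pred, gt):
    """Illustrative matching score (assumed normalization, see lead-in).

    pred, gt: lists of sets of (row, col) pixel coordinates, one set per object.
    """
    # Pad the shorter list with empty objects so a full one-to-one
    # assignment always exists.
    n = max(len(pred), len(gt))
    p = list(pred) + [set()] * (n - len(pred))
    g = list(gt) + [set()] * (n - len(gt))

    # overlap[i][j] = number of pixels shared by prediction i and truth j.
    overlap = [[len(a & b) for b in g] for a in p]

    # Globally optimal assignment by exhaustive search (small n only).
    best = max(
        sum(overlap[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )

    # Per-pixel normalization: divide by all foreground pixels in either mask.
    union = len(set().union(*pred, *gt))
    return best / union if union else 1.0


# One ground-truth cell of four pixels.
truth = [{(0, 0), (0, 1), (0, 2), (0, 3)}]
perfect = [{(0, 0), (0, 1), (0, 2), (0, 3)}]
split = [{(0, 0), (0, 1)}, {(0, 2), (0, 3)}]  # the cell predicted as two halves

perfect_score = mma_sketch(perfect, truth)
split_score = mma_sketch(split, truth)
```

Under this assumed normalization, the split prediction is credited only for the half that can be matched one-to-one, so its score falls continuously with the unmatched area rather than jumping at an IoU threshold.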
What Makes a Harness a Harness: Necessary and Sufficient Conditions for an Agent Harness
Sanderson Oliveira de Macedo
AI summary: Through conceptual analysis, defines the necessary and sufficient conditions for an agent harness and provides an inclusion and exclusion test that distinguishes agent harnesses from agent frameworks, SDKs, IDE plugins, and related systems.
The term agent harness now circulates widely in software engineering with generative artificial intelligence. It names the layer that wraps a language model and turns it into a coding agent able to act on a repository. The usage is loose and polysemous. Sometimes the term denotes the whole product (Claude Code, Codex CLI); sometimes it denotes the evaluation scaffold that runs an agent against tasks (the SWE-bench harness); sometimes it gets conflated with an agent framework, an SDK, an IDE plugin, or an orchestrator. What is missing is a reference definition that works as an instrument, one that includes and excludes cases consistently. We build that definition through a conceptual analysis that combines works with persistent identifiers and primary grey-literature sources, such as official documentation, glossaries, and engineering reports. We reconstruct the genealogy of the term, from the horse's tack to the classic test harness, to the machine-learning evaluation harness, and finally to the agent harness. We then propose a constitutive definition that states the necessary and sufficient conditions for a system to be an agent harness, we operationalize it as an inclusion and exclusion test, and we draw the boundary of the concept against an agent framework, an agent SDK, an IDE plugin, an eval harness, and an orchestrator. We apply the definition to six real harnesses (Claude Code, Codex CLI, Aider, Cline, OpenHands, and SWE-agent) and to deliberate edge cases; the test includes and excludes consistently. We close with a research agenda organized by design tension axes. The contribution is an operational definition of agent harness, with a shared vocabulary, able to guide engineering practice and the scientific comparison of agentic systems.
Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion
Rafael Rivera Soto, Barry Chen, Nicholas Andrews
AI summary: Proposes an unsupervised style encoder that learns discriminative style features by reconstructing human-authored text from its machine-generated paraphrase, enabling few-shot and zero-shot AI-text detection that matches or outperforms the baselines.
The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training, and are limited to few-shot inference for detection, requiring in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.
Predictive Assistance and the Temporal Dynamics of Exploratory Compression
Balaraju Battu
AI summary: Proposes a geometric dynamical framework for studying how predictive AI alters the temporal dynamics of cognitive exploration through exogenous exploratory compression. It finds that sustained stabilization reduces exploratory responsiveness, that asymmetric curvature accumulation produces hysteresis, and that early intervention limits later exploratory diversity.
Classical theories of cognition describe problem solving as exploratory search through structured problem spaces in which repeated interaction gradually compresses search into efficient representational structures. Predictive artificial intelligence systems introduce a distinct regime in which stabilization may occur before exploratory diversification unfolds, supplying solutions and decision trajectories prior to internally generated search. This paper develops a geometric dynamical framework in which attention evolves over a landscape of strategies shaped by stabilizing drift, endogenous exploratory perturbation, and responsiveness-gated learning. Predictive assistance is modeled as a process of exogenous exploratory compression that stabilizes trajectories before self-generated exploration broadens the accessible regions of strategy space. The framework yields three main results. First, sustained predictive stabilization reduces exploratory responsiveness by attenuating the effective influence of intrinsic perturbations even when exploratory variability remains present. Second, curvature accumulates and relaxes asymmetrically, producing hysteresis and delayed recovery of exploratory mobility after assistance withdrawal. Third, developmental outcomes depend critically on the timing of stabilization, with early intervention narrowing future exploratory traversal before broad representational diversification has occurred. The framework generates empirically testable predictions concerning exploratory entropy, premature convergence, and delayed recovery following predictive stabilization. More broadly, the results suggest that predictive systems may reshape the geometry of exploratory cognition itself.
Decision-Making Under Combinatorial Risk
Yifan Hong, Hongmiao Fan, Chen Wang
AI summary: Studies decision-making under combinatorial risk through an investment-allocation task, finding that participants rely mainly on features such as the after-investment success probability rather than exact evaluation of the full distribution, and uses symbolic regression to discover compact descriptive models.
Decision-making under risk is typically studied through single-shot lottery choices. Yet many real decisions involve combinatorial risk, where risk arises from multiple risky components, so the lottery over outcomes is induced rather than given outright and can be costly to evaluate exactly. We introduce an investment-allocation task to study decision under combinatorial risk, where investing in a component raises its success probability and thereby reshapes the outcome distribution. Participants favor the option with the larger probability increment, and, when increments are equal, the option with the higher initial success probability. Revealing the induced probability mass function (PMF) substantially changes behavior, making participants less responsive to combinatorial-risk features and reducing choice variance. To explain these patterns, we move beyond standard benchmarks and hand-crafted hypotheses with symbolic regression to discover compact descriptive models. The discovered models rely mainly on combinatorial-risk features, such as the after-investment success probability, rather than exact evaluation of the full induced distribution. Behavior under the displayed PMF is then well explained by augmenting this model with a prospect-theoretic residual model. The results show that people navigate combinatorial risk primarily through its core features, shifting toward lottery valuation only when the induced PMF is displayed.
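To make "induced" concrete, the sketch below computes the outcome distribution that follows from a set of component success probabilities and shows how investing in one component reshapes it. Two modeling choices are assumptions made for illustration and are not stated in the abstract: components succeed independently, and the outcome is the number of successful components. The function names `induced_pmf` and `invest` are hypothetical.

```python
def induced_pmf(probs):
    """PMF of the number of successes among independent components.

    probs: list of per-component success probabilities.
    Returns a list where entry k is P(exactly k components succeed).
    """
    pmf = [1.0]  # with zero components, zero successes is certain
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this component fails
            nxt[k + 1] += mass * p      # this component succeeds
        pmf = nxt
    return pmf


def invest(probs, index, increment):
    """Return a copy of probs with one component's success probability raised."""
    updated = list(probs)
    updated[index] = min(1.0, updated[index] + increment)
    return updated


baseline = [0.5, 0.5]
before = induced_pmf(baseline)
after = induced_pmf(invest(baseline, index=0, increment=0.25))
```

Even in this two-component case the decision maker never sees `before` or `after` directly; they follow from the component probabilities, which is why exact evaluation becomes costly as the number of components grows.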
SoK: Colluding Adversaries in Machine Learning Pipelines
Vasisht Duddu, Lipeng He, Asim Waheed, N. Asokan
AI summary: Presents a systematic framework for studying collusion among adversaries in machine learning pipelines, covering training-time and inference-time adversaries, empirically validates five conjectured cases, and discusses how adversaries' characteristics affect the potential for collusion.
Machine learning (ML) models are susceptible to various security, privacy, and fairness risks. Adversaries with different characteristics (i.e., objectives, knowledge, and capabilities) can collude by executing one attack to amplify others. Existing work lacks a systematic framework to explore collusion among adversaries, and to study the implications of the adversaries' characteristics. We present a framework covering collusion (a) between train- and inference-time adversaries, and (b) among inference-time adversaries. Our framework accounts for factors enabling collusion between adversaries. We propose a guideline to conjecture about the potential for collusion using enabling factors. We use it to explain prior work, conjecture about unexplored collusions, and empirically validate five such cases. Finally, we discuss how adversaries' characteristics influence the potential for collusion.
A Theory of Flow Matching with Neural Networks
Yihan He, Qishuo Yin, Yuan Cao, Jianqing Fan, Han Liu
AI summary: Establishes a theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields, proving convergence of gradient descent for over-parameterized two-layer ReLU networks, deriving generalization bounds for the conditional velocity-field matching objective, and providing Wasserstein-distance guarantees for generated samples.
In this work, we develop a theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields. We establish convergence guarantees for gradient descent in the over-parameterized two-layer ReLU neural network regime. We derive generalization bounds for the conditional velocity-field matching objective. Building on these results, we provide Wasserstein-distance guarantees for the samples generated by the induced flow. Our analysis is based on a generalization bound for multi-task representation learning with unbounded losses, which may be of independent interest beyond flow-based generative modeling. These theoretical results are validated through extensive experiments on both synthetic and real-world image benchmarks.
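As a point of reference, one common instance of a conditional velocity-field matching objective is written below. The abstract does not specify the probability path or the notation, so the linear interpolation path and the symbols $x_0$, $x_1$, $x_t$, $v_\theta$ are assumptions chosen for illustration rather than the paper's setup.

```latex
% Assumed linear path between a source sample x_0 ~ p_0 and a data sample x_1 ~ p_1.
x_t = (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1]

% Conditional flow matching loss: regress the network v_\theta onto the
% conditional target velocity, which for the linear path is x_1 - x_0.
\mathcal{L}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim p_0,\; x_1 \sim p_1}
  \left\lVert v_\theta(x_t, t) - (x_1 - x_0) \right\rVert^2
```

Samples are then generated by integrating $\dot{x}_t = v_\theta(x_t, t)$ from $t = 0$ to $t = 1$, which is the induced flow that a Wasserstein-distance guarantee would concern.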
Interpretable Temporal Facial-Region Motion Analysis for In-the-Wild Parkinson's Disease Video Classification
Riyadh Almushrafy
AI summary: Proposes temporal motion descriptors based on facial-region keypoints for lightweight, interpretable PD video classification on the YouTubePD benchmark, reaching a balanced accuracy of 0.826.
Reduced facial expressivity is a common motor manifestation of Parkinson's disease (PD), often described as hypomimia or facial bradykinesia. This paper examines whether temporal motion descriptors extracted from facial-region keypoints can support in-the-wild PD-related video classification on the YouTubePD benchmark. Each video is represented using geometric descriptors from 14 predefined facial regions. Static geometry, normalized geometry, velocity-based descriptors, relative-velocity descriptors, and a GRU sequence baseline are compared under the same binary classification protocol. To assess stability and interpretability, the study includes seed-robustness analysis, region-level ablation, and permutation importance. The best result is obtained with normalized velocity descriptors and a Random Forest classifier, reaching a balanced accuracy of 0.826 and an AUROC of 0.855 on the held-out test split. Across 10 random seeds, this representation remains stable, with balanced accuracy of 0.810 ± 0.018 and AUROC of 0.855 ± 0.005. Overall, the results suggest that normalized facial-region motion is a lightweight and interpretable representation for YouTubePD video classification. The study is framed as a benchmark-level analysis and does not claim clinical severity assessment or MDS-UPDRS facial-expression scoring.
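Balanced accuracy, the headline metric here, is the mean of per-class recall, which keeps a majority-class predictor from looking good on an imbalanced split. The sketch below is a plain-Python illustration of that definition with made-up labels; it is not the study's evaluation code, and the toy data are hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        # Indices of all samples whose true label is this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced split: four positive videos, one negative.
labels = [1, 1, 1, 1, 0]
always_positive = [1, 1, 1, 1, 1]

plain_acc = sum(t == p for t, p in zip(labels, always_positive)) / len(labels)
bal_acc = balanced_accuracy(labels, always_positive)
```

On this toy split, plain accuracy rewards the trivial predictor with 0.8 while balanced accuracy stays at chance level, 0.5.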
CodeAlchemy: Synthetic Code Rewriting at Scale
Ankit Gupta, Aditya Prasad, Rameswar Panda
AI summary: Proposes the CodeAlchemy framework, which generates more than 500B tokens of synthetic code data through 5 strategies, introduces the DevEval and TraceEval benchmarks, and reports 3B models that outperform frontier models 10x their size on several tasks.
Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.
Exploratory Responsiveness and Adaptive Rigidity Under AI-Assisted Optimization
Balaraju Battu
AI summary: Proposes a theory of exploratory adaptation under AI-assisted optimization, using a dynamical framework to analyze how predictive assistance affects a system's exploratory responsiveness. Convergent predictive regimes reduce adaptability and increase rigidity, whereas exploration-enhancing regimes promote adaptability.
This paper develops a theory of exploratory adaptation under AI-assisted optimization. The central argument is that the long-run adaptive effects of AI systems depend critically on how predictive assistance interacts with exploratory responsiveness itself. We formalize this mechanism using a dynamical framework in which cognitive, institutional, and technological systems evolve over rugged epistemic landscapes characterized by multiple locally reinforced configurations. A central state variable in the model is adaptive responsiveness, which measures the capacity of a system to traverse unfamiliar conceptual and institutional trajectories under changing conditions. Under convergent predictive regimes, AI systems substitute for exploratory engagement, reducing adaptive responsiveness and generating metastable trapping, hysteresis, premature convergence, and exploration-collapse dynamics in which systems become locally efficient but globally rigid. The framework also identifies contrasting exploration-enhancing regimes in which AI systems amplify exploratory search, conceptual traversal, and adaptive mobility. The effective substitution parameter is therefore responsiveness-dependent: systems possessing weak exploratory routines are more vulnerable to exploratory substitution, whereas systems already possessing high adaptive responsiveness may use AI assistance to expand exploratory mobility across rugged landscapes. The long-run adaptive effects of AI consequently depend not only on AI capability itself, but also on institutional structure, developmental context, and the architecture of human-machine interaction.
A Divide-and-Conquer Modeling Strategy for the CTF-4-Science Lorenz Benchmark
Shundong Li
AI summary: Proposes a divide-and-conquer modeling strategy that designs separate models for the five scenario families of the CTF-4-Science Lorenz benchmark, combining smoothing-based denoising, NG-RC/NVAR forecasting, a Lorenz transition correction, and a parametric prefix blend. Its score of 79.63 suggests that scenario-specific updates can outperform a general-purpose model.
This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction across twelve hidden scores and five scenario families: clean forecasting, noisy reconstruction, noisy-input forecasting, few-shot learning, and parametric generalization. Rather than forcing one model class to handle all regimes, the final system matches each prediction block to the evaluation behavior of its task group. The main contributions are: smoothing-based reconstruction for noisy full-trajectory denoising; NG-RC/NVAR models tuned for noisy long-time attractor forecasting; a fitted Lorenz transition correction restricted to the sensitive clean short-time prefix; and a parametric prefix blend for the interpolation task. The resulting system, with a final public score of 79.63, shows that bounded, scenario-specific updates can outperform broad model replacement on mixed chaotic forecasting benchmarks.
VFUSE: Virulent Feature Understanding with Sparse autoEncoders
Michael Yu, Matthew L. Olson
AI summary: Proposes VFUSE, which trains sparse autoencoders (SAEs) on diffusion-transformer activations to identify hazard-related features in protein design models, improving interpretability without sacrificing performance.
Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that, for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than when fit on the original model's representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge, this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.
Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport
Md Sadek Hossain Asif, Tanzila Khan, Md. Mosaddek Khan
AI summary: Proposes Temporal Sheaf Neural Networks (TSNN), which perform temporal link prediction using dynamic orthogonal frames and explicit transport between local coordinate systems, matching or surpassing existing methods on most benchmarks, especially on graphs with strong node-role heterogeneity.
We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.
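The reason Householder products are a convenient way to parameterize orthogonal frames can be seen in a few lines: each reflection is orthogonal, so any product of them is too, with no projection or re-orthogonalization step. The sketch below is a generic pure-Python illustration of that fact. It is not the TSNN parameterization, whose details the abstract does not give, and the helper names and example vectors are hypothetical.

```python
def householder(v):
    """Reflection H = I - 2 v v^T / (v^T v) for a nonzero vector v."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Dense matrix product of two lists-of-rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frame_from_vectors(vectors):
    """Orthogonal frame built as a product of a few Householder reflections."""
    n = len(vectors[0])
    frame = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for v in vectors:
        frame = matmul(frame, householder(v))
    return frame


# Two hypothetical node frames in 3 dimensions, each from two reflections.
frame_u = frame_from_vectors([[1.0, 2.0, 0.5], [0.3, -1.0, 2.0]])
frame_v = frame_from_vectors([[0.0, 1.0, 1.0], [2.0, 0.1, -0.7]])

# Transport from v's coordinates into u's coordinates: Q_u^T Q_v.
transport = matmul(transpose(frame_u), frame_v)

state_v = [[1.0], [2.0], [3.0]]  # column vector stored in v's frame
moved = matmul(transport, state_v)
norm_before = sum(x[0] ** 2 for x in state_v)
norm_after = sum(x[0] ** 2 for x in moved)
```

Because the transport map is itself orthogonal, it changes coordinates without changing lengths, so distances computed after transport are not distorted by the change of frame.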
Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization
Ruinan Wang, Ian Nabney, Mohammad Golbabaee
AI summary: Proposes GIF, which estimates hyperparameter importance from a small-sample warm start, groups hyperparameters by importance, allocates trials proportionally, and retains a full-space fallback. It outperforms TPE and other methods on high-dimensional benchmarks and improves sample efficiency.
Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in high-dimensional spaces where evaluations are costly and progress is diluted across many low-impact variables. We propose Greedy Importance First (GIF), an importance-aware scheduling strategy that uses a small-sample warm start to estimate hyperparameter importance, forms importance-based groups, allocates trials proportionally, and retains a full-space fallback. We evaluate GIF under fixed evaluation budgets on five anisotropic analytic functions, Bayesmark, and NAS-Bench-301. On the higher-dimensional benchmarks, GIF reaches better incumbents with faster convergence than TPE, BOHB, Random Search, and Sequential Grouping. On Bayesmark, where the effective dimensionality is smaller, GIF remains competitive but the margins are smaller. Ablation studies show that importance estimation, proportional allocation, and the fallback step all contribute to the gains. We also verify that the HIA component recovers the intended anisotropy on the analytic benchmarks. These results suggest that GIF is a simple and plug-compatible way to improve sample efficiency in high-dimensional HPO.
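The proportional-allocation step can be sketched in isolation. The snippet below splits a fixed trial budget across importance groups in proportion to their scores, using largest-remainder rounding so the counts sum to the budget. It is a minimal illustration under assumptions: the abstract does not say how GIF rounds, how groups are formed, or how the full-space fallback is triggered, and the function name, group names, and scores are hypothetical.

```python
def allocate_trials(group_importance, budget):
    """Split `budget` trials across groups in proportion to importance.

    group_importance: dict mapping group name -> non-negative importance score.
    Returns a dict mapping group name -> integer number of trials.
    """
    total = sum(group_importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")

    # Ideal (fractional) share for each group.
    raw = {g: budget * w / total for g, w in group_importance.items()}

    # Round down first, then hand the leftover trials to the groups with
    # the largest fractional parts (largest-remainder rounding).
    alloc = {g: int(r) for g, r in raw.items()}
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda g: raw[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc


# Hypothetical importance scores from a warm-start phase.
scores = {"high": 6, "mid": 3, "low": 1}
plan = allocate_trials(scores, budget=20)
```

In this sketch the low-importance group still receives a few trials, which loosely mirrors the role a fallback plays in guarding against a poor warm-start importance estimate.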
A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks
Bruce Changlong Xu, Lan Wu, Alexander Ryu
AI summary: The audit finds image-source overlap and a text-side canonical-order exchangeability signal in public medical VLM benchmarks, but confirmed pixel-level duplicates are rare, and existing membership-inference detectors are unreliable on small medical-VLM cohorts.
Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.
Bittensor Agent Arenas as Trajectory Primitives: Distilling Shopping Agents from ShoppingBench Subnet Traces
Shardul Bansal, Seth Schilbe, Jarrod Barnes
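AI summary: To address the shortage of multi-turn trajectory data for small-model post-training, the authors use the race mechanism of Bittensor subnet SN15 to generate incentive-aligned trajectories and extract agentic trajectories with a structural-quality filter; the post-trained Qwen3-4B model reaches 42.7% ASR on ShoppingBench, close to the synthetic-data baseline.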
Small-model agentic post-training is bottlenecked less by the algorithm than by the trajectory substrate it consumes. Leading recipes (RLVR, group-relative RL, rejection-sampled re-SFT) all need multi-turn traces carrying per-trajectory supervision, and the two existing sources fall short: frontier-synthesised data inherits the synthesizer's biases and collapses the long tail, while unfiltered production logs are unjudged and contaminated by shortcut behaviour. We argue that an incentive-aligned agent arena can be engineered to manufacture such trajectories, and demonstrate this on ORO Subnet 15 (SN15), a Bittensor deployment of the ShoppingBench agentic-commerce benchmark. SN15's race mechanism, LLM reasoning judge, and rotating leak-cluster-guarded problem suite yield a corpus with three properties: incentive-aligned diversity, per-trajectory judging, and anti-memorised held-out evaluation. We introduce a structural-quality filter that converts the raw firehose into a trainable corpus by keeping agentic trajectories (the model itself emits the tool calls) and rejecting sub-task trajectories (the model only classifies or narrates over a deterministic search loop), then post-train Qwen3-4B with a recipe matched to the published ShoppingBench SFT-then-GRPO pipeline. On a leak-cluster-guarded held-out partition scored production-strict, the model lifts from the published Qwen3-4B base of 18.0% ASR to 42.7%, within single-problem noise of the synthetic-data SFT-only baseline (43.6%), while training on a fraction of a single day of subnet output. The supervised stack leaves a large pass@8 to pass@1 gap (53.3% vs 34.8%); a per-step teacher-grounded Dr. GRPO reward converts that headroom into process improvement, and we identify the sub-task firehose as the primary lever for closing the gap to the 48.7% SFT+GRPO bar. We release the filter, the corpus splits, and the arena mechanics.
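For readers unfamiliar with the pass@8 versus pass@1 comparison, the sketch below shows the widely used unbiased pass@k estimator. The abstract does not state how pass@k was computed, so treating this estimator as the one used is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k attempts is correct, when k attempts
    are drawn without replacement from n attempts of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 8 sampled attempts, 2 of them correct.
    print(pass_at_k(8, 2, 1), pass_at_k(8, 2, 8))
```

A large gap between the two values means the model can often solve a problem in some attempt but not reliably in its first one, which is the headroom the abstract refers to.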
Deployment-Time Memorization in Foundation-Model Agents
Lei, Chen, Guilin Zhang, Kai Zhao, Dalmo Cirne, Andy Olsen, Xu Chu, Zeke Miller, Alet Blanken, Amine Anoun, Jerry Ting
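AI summary: Studies how deployment-time memory design choices in foundation-model agents affect personalization utility, extraction risk, and deletion fidelity; introduces the Forgetting Residue Score and reveals a trade-off between compression and deletion.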
Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism, assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.
BenSyc: Benchmarking Conversational Sycophancy and Human Alignment of Large Language Models in Bengali Contexts
Kazi Noshin, Sajib Acharjee Dip, Ranat Das Prangon, Fardin Hassan Tamim, Syed Ishtiaque Ahmed, Liqing Zhang, Sharifa Sultana
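AI summary: Introduces the BenSyc benchmark, a five-level annotated set built from Bengali social data, and evaluates 15+ models on conversational alignment classification and generation tasks, finding that frontier models still struggle to distinguish empathy from reinforcing validation.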
Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.
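The Macro-F1 scores reported above average per-class F1 with equal weight, so a rare class such as Escalation counts as much as a frequent one. A minimal sketch of the standard definition follows; the labels in the example are made up for illustration and are not BenSyc data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy binary example (hypothetical labels).
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

Note that the abstract reports Macro-F1 on a 0 to 100 scale, whereas this function returns a value between 0 and 1.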
Compiling Rewrite Rules into Finite-State Transducers with the Worsening Trick
Mans Hulden, Michael Ginn
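AI summary: Proposes a compact compilation scheme based on the "worsening trick" that compiles rewrite rules into finite-state transducers, supports multiple contexts and rewrite modes, and is simple to implement and easy to extend.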
Finite-state transducers (FSTs) are essential for modeling string rewriting in computational linguistics and natural language processing (NLP), particularly for phonological and morphological rewrite rules. Compiling general rewrite rules of the form $A \to B / L \, \_ \, R$, where $A$, $B$, $L$, and $R$ are arbitrary regular languages, is complex due to overlapping matches and context constraints. Traditional methods, such as those by Kaplan and Kay or Karttunen, rely on intricate transducer compositions with auxiliary markers. This paper presents a compact compilation scheme based on the "worsening trick": generate all legal rewrite candidates, then filter out candidates that are worse than another candidate for the same input. Implemented as the built-in rewrite compiler in PyFoma, the construction supports multiple contexts, arbitrary transductions, markup, directed rewriting, weights, and parallel rewriting. The resulting formulas are short and uniform, and where semantics coincide, they reproduce the same rule transducers as earlier approaches while remaining easier to extend. The implementation has been validated against foma on both a substantial collection of rewrite grammars and an automated regression suite covering the major rewrite modalities, with the resulting transducers matching exactly apart from state numbering.
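The generate-then-filter idea can be mimicked on plain strings with a brute-force toy. This is not the PyFoma construction and builds no transducer: it assumes literal (non-regex) `a`, `b`, `left`, and `right`, checks contexts on the input side, and adopts one plausible reading of "worse", namely a candidate whose set of rewritten sites is a strict subset of another candidate's. All function names are hypothetical.

```python
from itertools import combinations


def sites(s, a, left, right):
    """Start indices where literal `a` occurs between `left` and `right`."""
    return [i for i in range(len(s) - len(a) + 1)
            if s.startswith(a, i)
            and s[:i].endswith(left)
            and s.startswith(right, i + len(a))]


def candidates(s, a, b, left, right):
    """Map each non-overlapping set of rewritten sites to its output string."""
    idx = sites(s, a, left, right)
    result = {}
    for r in range(len(idx) + 1):
        for chosen in combinations(idx, r):
            # Skip selections in which two occurrences of `a` overlap.
            if any(q - p < len(a) for p, q in zip(chosen, chosen[1:])):
                continue
            out, pos = [], 0
            for i in chosen:
                out.append(s[pos:i])
                out.append(b)
                pos = i + len(a)
            out.append(s[pos:])
            result[frozenset(chosen)] = "".join(out)
    return result


def rewrite(s, a, b, left="", right=""):
    """Keep only candidates that are not worse than some other candidate."""
    cands = candidates(s, a, b, left, right)
    best = [out for chosen, out in cands.items()
            if not any(chosen < other for other in cands)]
    return sorted(set(best))


if __name__ == "__main__":
    print(rewrite("abab", "a", "x", right="b"))
    print(rewrite("aaa", "aa", "b"))
```

The second call returns two outputs because the overlapping matches admit two maximal rewritings; directed rewriting modes, which the abstract lists among the supported features, are one way such ambiguity is resolved in practice.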
Continuous Neural Reparameterization as a Deep Geometric Prior for Robust Fixed-Chart UV Repair
Mohammad Sadegh Salehi
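AI summary: Treats fixed-chart UV unwrapping as continuous neural reparameterization, optimizing an untrained SIREN network against a geometric objective and combining spectral initialization, Tutte residual warm-up, and related strategies to obtain a robust, zero-flip chart solver.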
Traditional UV unwrapping relies on direct optimization of geometric distortion energies and can fail through invalid initialization, local minima, or topological foldovers. We recast fixed-chart UV unwrapping as continuous neural reparameterization: an untrained SIREN maps per-vertex mesh features to UV coordinates, and its weights are optimized for a geometric objective. The practical contribution is a robust chart-solver recipe, combining Laplace-Beltrami spectral inputs, Tutte residual warm-up, a $C^2$ determinant extension, an injectivity barrier, and validity-checked retry/fallback routing, rather than a claim that any single component guarantees validity or that recutting methods should be replaced. NTK-LBO diagnostics show that spectral conditioning changes update geometry, especially at initialization and mid-rank subspaces, but does not by itself predict chart success. On compact pre-cut charts and a 47-chart stratified Thingi10K/xatlas-cut benchmark, the neural solver produces zero flips on all compact charts and 42/47 valid zero-flip stratified solves. BFF and OptCuts comparisons sharpen the scope: recutting can be faster and lower-distortion when allowed, while the neural solver targets supplied-chart validity and validation-first atlas construction. On Amara Spatial generated meshes, the full atlas construction path gives packed-atlas coverage on a 25-asset set and 1000/1000 strict locally valid atlases with zero UV flips in a large-scale Rust atlas run after fallback routing.
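The "zero flips" criterion can be checked with a signed-area test on the UV triangles. The sketch below is a generic validity check, not the paper's validator: it assumes triangles are consistently counter-clockwise in the source mesh and counts degenerate (zero-area) triangles as flips, which is a stricter convention than some tools use.

```python
def signed_area(a, b, c):
    """Signed area of the 2D triangle (a, b, c); positive if counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def count_flips(uv, triangles, eps=0.0):
    """Number of UV triangles with non-positive signed area (flipped or degenerate)."""
    return sum(1 for i, j, k in triangles
               if signed_area(uv[i], uv[j], uv[k]) <= eps)


if __name__ == "__main__":
    tris = [(0, 1, 2), (1, 3, 2)]
    valid_uv = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    folded_uv = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.2, 0.2)]
    print(count_flips(valid_uv, tris), count_flips(folded_uv, tris))
```

A check of this kind is local: zero flips rules out inverted triangles but does not by itself prove that the chart is globally free of overlaps.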
Business World Models
Cecil Pang, Hiroki Sayama
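AI summary: Proposes the business world model (BWM) architecture, which applies world-model ideas to business environments by encoding states, dynamics, constraints, and objectives to support autonomous decision-making and planning.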
Businesses are increasingly adopting AI-enabled tools to improve productivity, reduce costs, and enhance products and services. However, the transformative potential of AI extends beyond automating predefined tasks: it lies in enabling intelligent systems to plan, optimize, and execute business initiatives from high-level strategic objectives. This paper introduces the concept and architecture of a business world model (BWM), a world model specialized for business and organizational environments. Inspired by world models in artificial intelligence, cognitive science, and control theory, a BWM encodes business states, dynamics, constraints, objectives, and a feasible action space to support autonomous decision-making. We propose a business-semantics-centric formulation in which business states, dynamics, and actions are linked to key business entities. Within this framework, agents can simulate alternative action sequences, estimate their effects on future business outcomes, and evaluate trade-offs under uncertainty. The proposed architecture integrates semantic data representations, probabilistic machine learning models, deterministic business rules, and an explicit action space into a coherent structure for planning and counterfactual reasoning. Although its individual components are not new, the contribution of the BWM lies in organizing them as an executable internal simulator for business initiatives. This work establishes a conceptual foundation for autonomous business systems capable of moving from instruction-based execution toward goal-driven planning and execution.
Robotic Nonprehensile Object Transportation with a Hanging Tray
Adam Heins, Angela P. Schoellig
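AI summary: For the robotic waiter's problem, proposes a tray suspended by ropes that moves as a three-dimensional pendulum, reducing sliding and sloshing with only a 3-DOF mobile base; experiments validate the approach, which is also integrated into an interactive demonstration.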
We consider the nonprehensile object transportation task known as the waiter's problem, in which a robot must move an object balanced on a tray from one location to another. In contrast to prior works on the robotic waiter's problem, which make the robot tilt a tray rigidly held by its end effector (EE), we use a tray suspended from the EE by ropes, such that it behaves like a three-dimensional pendulum. Some prior works have actuated the robot so that the EE simulates the behavior of a pendulum, because pendular motion reduces the shear forces acting on the transported objects, minimizing the sliding of rigid objects and sloshing in containers of liquid. In contrast, our use of a real hanging tray allows us to obtain the benefits of pendular motion while only actuating a 3 degree-of-freedom (DOF) mobile base, rather than requiring a full 6-DOF manipulator arm. Our experiments in simulation and on real hardware show that the hanging tray substantially reduces both sliding and sloshing compared to a static, rigidly-grasped tray. Furthermore, we integrate the hanging tray into an interactive robot waiter demonstration, which uses computer vision to identify people with a raised hand and visual servoing to steer toward them and allow them to access the tray.
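The claim that pendular motion reduces shear forces can be seen in an idealized planar case. The derivation below is a textbook sketch, not the paper's model: it assumes a point object of mass $m$ on a tray tilted by angle $\theta$, a constant horizontal acceleration $a$, Coulomb friction with coefficient $\mu$, and an ideal pendulum that has settled to its steady tilt. The symbols $f$ (friction along the tray) and $N$ (normal force) are introduced here for illustration.

```latex
% Newton's second law projected onto the tray tangent and the tray normal:
\[
  f = m\,(a\cos\theta - g\sin\theta), \qquad
  N = m\,(g\cos\theta + a\sin\theta).
\]
% Rigidly held horizontal tray (\theta = 0): friction must supply all of m a,
% so the object stays put only if
\[
  |f| \le \mu N \;\Longleftrightarrow\; |a| \le \mu g .
\]
% Ideal hanging tray in steady state: the ropes align with the apparent gravity,
% so \tan\theta = a/g and the required friction vanishes:
\[
  \tan\theta = \frac{a}{g} \;\Longrightarrow\; f = 0 .
\]
```

In practice the tray swings rather than sitting at the steady tilt, so the shear force is reduced rather than eliminated, which is consistent with the abstract reporting a substantial reduction in sliding and sloshing rather than none at all.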
Interpreting and Steering Text-to-Speech Language Models with Sparse Autoencoders
Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov
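AI summary: Trains BatchTopK sparse autoencoders on the language-model backbone of CosyVoice3 and finds that the features are interpretable and causally controllable, able to manipulate laughter, gender, and speech rate.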
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature according to where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
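The BatchTopK activation mentioned above differs from per-sample top-k in that the sparsity budget is shared across the batch. A minimal pure-Python sketch of that selection rule follows, as it is commonly described for BatchTopK SAEs; dropping non-positive values stands in for a preceding ReLU and is an assumption, and none of this is taken from the authors' code.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch.

    `pre_acts` is a list of rows (one per sample); all other entries become 0.
    """
    batch_size = len(pre_acts)
    flat = [(value, row, col)
            for row, acts in enumerate(pre_acts)
            for col, value in enumerate(acts)]
    flat.sort(key=lambda item: item[0], reverse=True)
    # Non-positive values are never kept (assumed ReLU before selection).
    keep = {(row, col) for value, row, col in flat[:k * batch_size] if value > 0}
    return [[value if (row, col) in keep else 0.0
             for col, value in enumerate(acts)]
            for row, acts in enumerate(pre_acts)]


if __name__ == "__main__":
    # With k = 1 and two samples, both kept activations may land in one row.
    print(batch_topk([[5.0, 4.0, 0.5], [0.2, 0.1, 0.3]], k=1))
```

Because the budget is global, an information-rich token can use more than k features while a simple one uses fewer, with k active features per sample only on average.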
GHOST: Hierarchical Sub-Goal Policies for Generalizable Robot Manipulation
Sriram Krishna, Ben Eisner, Haotian Zhan, Ying Yuan, Haoyu Zhen, Chuang Gan, Shubham Tulsiani, David Held
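AI summary: Proposes the GHOST framework, which generalizes visuomotor manipulation policies by factorizing control into high-level sub-goal prediction and a low-level goal-conditioned controller, and uses human demonstrations to adapt to novel objects and task variations.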
We present GHOST, a framework for learning visuomotor manipulation policies that generalize beyond the training distribution. GHOST factorizes control into (i) a high-level policy that predicts the next sub-goal as a distribution over 3D end-effector poses from multi-view RGB-D observations, and (ii) a low-level goal-conditioned controller that executes embodiment-specific actions. To condition image-based policies on 3D goals, we introduce a simple spatial interface that projects predicted goals into the image plane and represents them as end-effector heatmaps. Across a suite of manipulation tasks, this hierarchical factorization consistently improves performance and robustness compared to a flat Diffusion Policy. Further, we show that this hierarchical interface also makes it easy to incorporate human demonstrations without relying on (noisy) action retargeting. As sub-goals are largely embodiment-agnostic, we train the high-level policy on human video to specify how learned skills should be applied and composed, while keeping the low-level policy trained purely on robot data. This hierarchy enables adaptation to novel objects and task variations using a small number of human demonstrations.
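The spatial interface described above, projecting a 3D goal into the image and drawing it as a heatmap, can be sketched with a pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the Gaussian shape, and `sigma` are assumptions made for illustration; the abstract does not specify how GHOST renders its heatmaps, and only the goal position (not orientation) is handled here.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in the camera frame to pixel coordinates."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return fx * x / z + cx, fy * y / z + cy


def gaussian_heatmap(u, v, width, height, sigma=2.0):
    """Image-sized map (rows of columns) with a Gaussian bump centred at (u, v)."""
    return [[math.exp(-((col - u) ** 2 + (row - v) ** 2) / (2 * sigma ** 2))
             for col in range(width)]
            for row in range(height)]


if __name__ == "__main__":
    # Hypothetical goal position and camera intrinsics.
    u, v = project_point((0.5, -0.25, 2.0), fx=80.0, fy=80.0, cx=32.0, cy=24.0)
    heatmap = gaussian_heatmap(u, v, width=64, height=48)
    print(u, v, heatmap[int(v)][int(u)])
```

Expressing the goal in image space lets an image-based low-level policy consume it as an extra input channel, which matches the motivation given in the abstract.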
SpineReport: Automated 3D Quantification and Reporting of Lumbar Spine Degeneration on MRI
Nathan Molinier, Adrian A. Marth, Reto Sutter, Christoph Germann, Jacob A. Connolly, Mathieu Guay-Paquet, Nathan D. Schilaty, Kenneth A. Weber, Julien Cohen-Adad
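AI summary: Proposes SpineReport, an open-source framework that uses robust anatomical segmentation to extract 3D morphological and signal features from lumbar spine MRI and generate subject-specific reports, reaching an AUC of 0.95 for central canal stenosis assessment.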
Lumbar spine conditions are a leading cause of disability worldwide, yet reliable quantification of degeneration from MRI remains challenging. In clinical practice, analysis is predominantly performed in two dimensions (2D), as manual three-dimensional (3D) assessment is time-consuming. However, 2D measurements suffer from limited reproducibility, particularly when anatomical structures are not aligned with the imaging plane. Existing automated approaches are often restricted to 2D, rely on discrete grading, or lack robustness and interpretability. We introduce SpineReport, an open-source, fully automated framework for comprehensive 3D morphometric analysis of lumbar spine MRI. Leveraging robust anatomical segmentations, the method extracts quantitative metrics from key structures, including the spinal canal, spinal cord, vertebrae, intervertebral discs, and foramina. These include both morphological and signal-based features, enabling cross-subject and longitudinal assessment. SpineReport further generates subject-specific reports that allow comparison with cohort distributions, improving interpretability and objective characterization of spinal morphology. Clinical relevance was evaluated against radiologist-reported severity grades for central canal, lateral recess, and foraminal stenosis. Metrics showed strong associations with central canal stenosis severity, with T2-weighted CSF signal providing the highest performance (AUC = 0.95). Canal AP diameter and area ratios also demonstrated strong correlations and high discriminative ability (AUC > 0.80). For lateral recess stenosis, associations were moderate, with lateral CSF signal being the most informative (AUC = 0.73). No significant associations were observed for foraminal stenosis despite robust region-of-interest extraction. SpineReport is released as an open-access tool: https://ivadomed.github.io/SpineReport/
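The AUC values quoted above summarize how well a single metric separates two severity groups. A minimal sketch of the pairwise (Mann-Whitney) definition follows; the scores are made-up numbers, and orienting a metric so that higher means more severe (for example by negating CSF signal) is an assumption about how such a metric would be fed in, not a detail from the paper.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive case scores above a random negative one,
    counting ties as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical severity scores for stenotic (positive) and normal (negative) levels.
    print(auc([0.9, 0.8], [0.1, 0.85]))
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation.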
Generalized CVO: Fast Correspondence-Free Local Point Cloud Registration via Second-Order Riemannian Optimization
Ray Zhang, Marcus Greiff, Thomas Lew, John Subosits
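AI summary: Proposes a correspondence-free local point cloud registration method based on geometric surface structure and reproducing kernel Hilbert space embeddings, using second-order on-manifold optimization for up to a 10x speedup and markedly reducing drift and improving robustness in LiDAR and RGB-D tracking and object registration.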
We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing kernel Hilbert space (RKHS) embeddings. The method represents point clouds as continuous functions with point-wise anisotropic kernels that encode local geometry. This formulation improves alignment along surface normals while relaxing alignment along tangential directions. To solve the resulting registration problem, we propose a second-order on-manifold optimization scheme with approximate Riemannian Hessians, achieving a speedup of up to 10x over the first-order solvers used in prior correspondence-free RKHS-based methods. We demonstrate improved frame-to-frame LiDAR and RGB-D tracking accuracy across diverse indoor and outdoor datasets. On a LiDAR tracking registration task in the driving domain, we achieve a reduction of $>55\%$ in both translational and rotational drift in challenging feature-sparse environments. On object registration benchmarks, we show improved robustness over ICP-based methods and further gains when refining global initialization, particularly under moderate misalignment.
Spiking Neural Network Inference on FPGAs with hls4ml
Barry M. Dillon
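AI summary: Extends the hls4ml tool to deploy spiking neural networks (SNNs) trained in PyTorch onto FPGA firmware, achieving an inference latency of about 34 μs on the Heidelberg Spiking Digits dataset.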
Spiking Neural Networks (SNNs) provide a naturally temporal machine-learning framework. Their neurons maintain an internal state and propagate information through discrete spikes, enabling low-latency temporal inference. Although SNNs are often associated with asynchronous neuromorphic processors, many scientific real-time inference systems rely on conventional synchronous field-programmable gate arrays (FPGAs) and high-level synthesis (HLS) workflows. In this paper we present an extension of hls4ml that enables clock-driven deployment of SNNs trained in PyTorch onto FPGA firmware. We demonstrate the workflow using a dense quantised SNN trained on the Heidelberg Spiking Digits dataset, where it achieves inference latencies of approximately $34\,\mu$s. We validate the generated design through software reference comparisons, HLS C simulation, HLS synthesis, export, and Vivado synthesis reports. This work opens up the hls4ml toolkit to neuromorphic computing, allowing streamlined optimisation, synthesis, and deployment of SNN models for real-time inference.
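Clock-driven execution means every neuron's state is updated once per time step, whether or not a spike arrives. The sketch below shows this for a single leaky integrate-and-fire neuron; the LIF model, the decay `beta`, the threshold, and reset-by-subtraction are common choices assumed here, since the abstract does not state which neuron model the hls4ml extension implements.

```python
def lif_run(inputs, beta=0.5, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron, one update per clock tick.

    Returns the list of output spikes (0 or 1) for each time step.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak the stored potential, then integrate this step's input.
        membrane = beta * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # reset by subtraction (assumed reset rule)
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    print(lif_run([0.6, 0.6, 0.6, 0.0]))
```

The fixed per-step update is what makes the model a natural fit for synchronous FPGA logic, in contrast to the event-driven processing of asynchronous neuromorphic hardware.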