NVIDIA Developer, October 9, 01:06
Using AI and Federated Learning to Predict Protein Subcellular Localization

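This article describes how NVIDIA FLARE and the NVIDIA BioNeMo Framework can be used to collaboratively train AI models with federated learning to predict the subcellular localization of proteins. The approach lets researchers jointly improve model performance without moving sensitive data. The article walks through the fine-tuning workflow, including the data format, the use of the ESM-2nv model, and the federated averaging (FedAvg) strategy. In the reported experiments, federated learning raised average prediction accuracy over local training from 78.8% to 81.7%, offering a new route to accelerating biological research and drug development.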

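🔬 **Why protein localization matters**: Predicting where a protein sits inside the cell (for example the nucleus, cytoplasm, or cell membrane) is essential to understanding its function. It sheds light on cellular processes and can point to potential drug targets.

🤝 **Advantages of federated learning**: NVIDIA FLARE and the BioNeMo Framework support federated learning, so participants at different institutions train locally and share only model updates, never the raw sensitive data. Aggregating those updates with federated averaging (FedAvg) builds a stronger global model while the data stays on site and privacy is preserved.

⚙️ **Model training and data format**: Using subcellular localization prediction as the example, the article shows how to fine-tune a pretrained ESM-2nv model. The data follows the biotrainer standard: FASTA files that contain the protein sequence, a training/validation marker, and one of 10 possible localization classes such as "Nucleus" or "Cell_membrane".

📊 **Results and value**: Comparing local training with federated training (FedAvg) under heterogeneous data conditions, average accuracy rose from 78.8% to 81.7%. Sharing knowledge across institutions produced a model stronger than any single site achieved on its own, which can speed up AI adoption in drug development and the life sciences.

💡 **Outlook**: Combining the language of life (protein sequences) with federated AI workflows is a new paradigm for accelerating discovery in drug development, healthcare, and biotechnology. NVIDIA FLARE and the BioNeMo Framework are pushing life sciences AI toward a collaborative future; the GitHub repository has more information and examples.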

Predicting where proteins are located inside a cell is critical in biology and drug discovery. This process is known as subcellular localization. The location of a protein is tightly linked to its function. Knowing whether a protein resides in the nucleus, cytoplasm, or cell membrane can unlock new insights into cellular processes and potential therapeutic targets. 

This post explains how researchers can use NVIDIA FLARE and NVIDIA BioNeMo Framework to collaboratively train AI models that predict protein properties such as subcellular location, without moving sensitive data across institutions.

How to fine-tune a model for subcellular localization 

A new NVIDIA FLARE tutorial demonstrates how to fine-tune an ESM-2nv model to classify proteins by their subcellular localization. The ESM-2nv model learns from embeddings of protein sequences, leveraging datasets introduced in the paper "Light Attention Predicts Protein Location from the Language of Life."

We focus on subcellular localization prediction. The data is formatted as FASTA files following the biotrainer standard, and each record includes the sequence, the training/validation split, and the location class (one of 10, for example Nucleus or Cell_membrane).

Figure 1. Cross-section of an animal cell showing the location of various membrane-bound organelles that are targeted for protein property prediction

A data sample in this FASTA format looks like this: 

>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False
MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSSIVSRFLNGRFEDQYTPTIEDFHRKVYNIHGDMYQLDILDTSGNHPFPAMRRLSILTGDVFILVFSLDSRESFDEVKRLQKQILEVKSCLKNKTKEAAELPMVICGNKNDHSELCRQVPAMEAELLVSGDENCAYFEVSAKKNTNVNEMFYVLFSMAKLPHEMSPALHHKISVQYGDAFHPRPFCMRRTKVAGAYGMVSPFARRPSVNSDLKYIKAKVLREGQARERDKCSIQ

Where:

    TARGET = subcellular location class
    SET = training versus test data
    VALIDATION = marks validation sequences
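To make the header layout concrete, here is a minimal parsing sketch in plain Python. It assumes headers look exactly like the sample above (an identifier followed by space-separated `KEY=VALUE` fields); the `ProteinRecord` class and `parse_fasta` function are hypothetical names for illustration and are not part of biotrainer, BioNeMo, or FLARE.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class ProteinRecord:
    seq_id: str
    target: str       # subcellular location class, e.g. "Cell_membrane"
    split: str        # value of SET, e.g. "train" or "test"
    validation: bool  # True when the header says VALIDATION=True
    sequence: str


def _build(header: str, chunks: List[str]) -> ProteinRecord:
    # Header layout assumed from the sample: "<id> TARGET=... SET=... VALIDATION=..."
    seq_id, *fields = header.split()
    attrs = dict(field.split("=", 1) for field in fields)
    return ProteinRecord(
        seq_id=seq_id,
        target=attrs["TARGET"],
        split=attrs["SET"],
        validation=attrs["VALIDATION"] == "True",
        sequence="".join(chunks),
    )


def parse_fasta(text: str) -> Iterator[ProteinRecord]:
    """Yield one record per '>' header; wrapped sequence lines are joined."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _build(header, chunks)


sample = (
    ">Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False\n"
    "MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSS\n"  # sequence shortened for brevity
)
for record in parse_fasta(sample):
    print(record.seq_id, record.target, record.split, record.validation)
```

Keeping the three header fields as typed attributes makes it easy to filter records, for example selecting only those with `split == "train"` and `validation` set to `False` for fitting.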

The dataset spans 10 location classes, making it an excellent real-world classification challenge. 

How to use federated learning with BioNeMo protein language models

Running this example is refreshingly simple. With BioNeMo Framework v2.5 in Docker, you can spin up a Jupyter Lab environment directly and run the Federated Protein Property Prediction with BioNeMo tutorial notebook in your browser. 

On top of the BioNeMo framework, NVIDIA FLARE is used to bring in federated training. Instead of pooling datasets from multiple sites, each participant trains locally and contributes only model updates. With FedAvg, those updates are aggregated centrally to form a shared global model—privacy preserved, collaboration enabled.
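The aggregation step can be illustrated with a small, framework-free sketch. It assumes each client reports its parameters as named lists of floats plus the number of local training samples, and it computes the sample-weighted mean that FedAvg is built on. The function name `fed_avg` and the data layout are illustrative assumptions; this is not the NVIDIA FLARE API, which works on real model tensors.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def fed_avg(updates: List[Weights], num_samples: List[int]) -> Weights:
    """Return the sample-weighted average of the clients' parameters."""
    if not updates or len(updates) != len(num_samples):
        raise ValueError("need one sample count per client update")
    total = sum(num_samples)
    averaged: Weights = {}
    for name in updates[0]:
        length = len(updates[0][name])
        # Each coordinate is averaged independently, weighted by local data size.
        averaged[name] = [
            sum(u[name][i] * n for u, n in zip(updates, num_samples)) / total
            for i in range(length)
        ]
    return averaged


# Two hypothetical clients; the second holds three times as much data.
client_a = {"layer.weight": [1.0, 2.0]}
client_b = {"layer.weight": [3.0, 4.0]}
global_weights = fed_avg([client_a, client_b], num_samples=[1, 3])
print(global_weights)  # {'layer.weight': [2.5, 3.5]}
```

Only the parameter values and the sample counts cross the network in this scheme; the protein sequences themselves never leave the client.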

Training and visualization 

For this demonstration, the team fine-tuned the 650-million-parameter ESM-2nv model, pretrained in BioNeMo. This larger model offers a strong balance between predictive accuracy and computational efficiency, making it well-suited for federated training scenarios. 

Key steps in the workflow include: 

    - Data splitting: Heterogeneous sampling is applied to mimic the variability one would expect across real-world institutions. This ensures the federated setup more closely reflects practical deployment conditions.
    - Federated averaging (FedAvg): Local client updates are aggregated into a shared global model, enabling collaboration without exposing raw protein sequence data.
    - Visualization with TensorBoard: Researchers can monitor both local and federated training runs in real time. Continuous server-side metrics provide insight into how the global model evolves with each communication round.
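The post does not spell out how the heterogeneous split is produced. One common way to simulate it, shown here purely as an assumption, is Dirichlet label partitioning: for every class, the share of samples sent to each site is drawn from a Dirichlet distribution with concentration `alpha`, so smaller values give more skewed sites. The function `dirichlet_split` is a hypothetical helper written with the standard library only.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def dirichlet_split(
    labels: Sequence[str], num_sites: int, alpha: float, seed: int = 0
) -> List[List[int]]:
    """Assign sample indices to sites, with per-class shares ~ Dirichlet(alpha)."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    sites: List[List[int]] = [[] for _ in range(num_sites)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet draw equals independent Gamma(alpha, 1) draws, normalised.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(num_sites)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for site, g in enumerate(gammas):
            cumulative += g / total
            # The last site takes whatever remains so no sample is dropped.
            is_last = site == num_sites - 1
            end = len(indices) if is_last else round(cumulative * len(indices))
            sites[site].extend(indices[start:end])
            start = end
    return sites


# Toy labels using two of the class names mentioned in the post.
labels = ["Nucleus"] * 50 + ["Cell_membrane"] * 50
for site_id, idxs in enumerate(dirichlet_split(labels, num_sites=3, alpha=1.0), 1):
    counts = {c: sum(labels[i] == c for i in idxs) for c in set(labels)}
    print(f"Site-{site_id}: {len(idxs)} samples, {counts}")
```

Every sample lands on exactly one site, and the fixed seed keeps the partition reproducible between runs.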

Results 

The team compared local training at each site against federated training (FedAvg) under heterogeneous data conditions (alpha = 1.0). 

| Client  | # Samples | Local accuracy (%) | FedAvg accuracy (%) |
| Site-1  | 1,844     | 78.2               | 81.8                |
| Site-2  | 2,921     | 78.9               | 81.3                |
| Site-3  | 2,151     | 79.2               | 82.1                |
| Average | —         | 78.8               | 81.7                |
Table 1. Federated training consistently outperformed local models across all sites, improving average accuracy from 78.8% to 81.7%

These results highlight how federated learning leverages knowledge across institutions to build a stronger model than any site could achieve alone.
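The averages in Table 1 can be reproduced from the per-site numbers. The short sketch below assumes the reported averages are unweighted means over the three sites, which matches the published figures after rounding to one decimal place.

```python
# Per-site accuracies (%) copied from Table 1.
local = {"Site-1": 78.2, "Site-2": 78.9, "Site-3": 79.2}
fedavg = {"Site-1": 81.8, "Site-2": 81.3, "Site-3": 82.1}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


local_avg = round(mean(local.values()), 1)
fedavg_avg = round(mean(fedavg.values()), 1)
gains = {site: round(fedavg[site] - local[site], 1) for site in local}

print(local_avg, fedavg_avg)  # 78.8 81.7
print(gains)                  # {'Site-1': 3.6, 'Site-2': 2.4, 'Site-3': 2.9}
```

Every site gains between 2.4 and 3.6 percentage points, and the average gain is 2.9 points.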

Figure 3. Federated training (FedAvg) yields higher accuracy at all sites compared to local models, demonstrating the benefit of collaborative learning

Benefits of using BioNeMo and FLARE for protein prediction

The benefits of using BioNeMo and FLARE extend beyond predicting where proteins localize in a cell. This approach helps the community build AI for science together. With BioNeMo plus FLARE:

    - Federated learning strengthens protein property prediction: Pool collective intelligence without sharing raw data.
    - Collaboration benefits everyone: Each site contributes to a stronger model while keeping sensitive data local.
    - BioNeMo Framework accelerates discovery: Access state-of-the-art tools for biological sequence analysis.

Get started with federated protein prediction 

Federated protein property prediction with NVIDIA BioNeMo and NVIDIA FLARE is part of a powerful new paradigm. Combining the language of life (protein sequences) with federated AI workflows can accelerate discoveries in drug development, healthcare, and biotech—all while respecting data privacy. 

The future of life sciences AI isn’t siloed—it’s collaborative. And with FLARE and BioNeMo, that future is already here. Visit the NVIDIA/NVFlare GitHub repo to get started with Federated Protein Property Prediction with BioNeMo and to see more advanced examples.
