NVIDIA Developer · 8 hours ago
NVIDIA BioNeMo Recipes: Accelerating Large-Scale AI Model Training

 

Training AI models with massive parameter counts requires advanced parallel computing techniques. NVIDIA BioNeMo Recipes aims to simplify and accelerate this process, lowering the barrier to entry for large-scale model training by providing step-by-step guides built on familiar frameworks such as PyTorch and Hugging Face. The article focuses on integrating NVIDIA Transformer Engine (TE) for speed and memory efficiency, and on scaling performance with techniques such as Fully Sharded Data Parallel (FSDP) and Context Parallelism. A concrete case study shows how the Hugging Face ESM-2 protein language model is accelerated: integrating TE, FSDP2, and sequence packing delivers a significant boost in training performance. The article also highlights TE's interoperability with the Hugging Face ecosystem, making it easy to adopt TE's advantages in existing projects.

🚀 **Simplified large-scale model training**: NVIDIA BioNeMo Recipes provides step-by-step guides built on PyTorch and Hugging Face, reducing the complexity of adopting advanced parallel computing techniques for large-scale AI model training and letting researchers integrate efficient accelerated libraries and low-precision formats to improve training efficiency without sacrificing speed or memory.

💡 **Transformer Engine (TE) and low-precision optimization**: The article details how to integrate NVIDIA Transformer Engine (TE) to accelerate transformer-style AI models. By optimizing transformer computations, particularly on NVIDIA GPUs, TE delivers significant performance gains and supports low-precision formats such as FP8 and FP4, substantially increasing training speed and reducing memory usage while preserving model accuracy.

🔗 **Sequence packing and efficiency gains**: To eliminate the padding waste caused by variable sequence lengths in standard data formats, the article shows how sequence packing (the THD format) optimizes the input data. Removing unnecessary padding tokens reduces memory usage and speeds up processing, and NVIDIA TE simplifies this optimization via the `attn_input_format` parameter.

🤝 **Hugging Face ecosystem interoperability**: NVIDIA Transformer Engine is highly compatible with mainstream machine learning ecosystems such as Hugging Face; TE can be integrated directly into models loaded from the Hugging Face Transformers library. This means TE's performance benefits can be obtained without large-scale code refactoring, simply by replacing standard PyTorch layers with their TE-optimized counterparts.

📈 **Significant performance gains**: By integrating TE and adopting sequence packing, the article demonstrates a substantial throughput improvement when training the ESM-2 protein language model compared with the unoptimized baseline. EvolutionaryScale also confirms that TE was indispensable for training its 98-billion-parameter ESM3 model, improving training efficiency and GPU utilization.

Training models with billions or trillions of parameters demands advanced parallel computing. Researchers must decide how to combine parallelism strategies, select the most efficient accelerated libraries, and integrate low-precision formats such as FP8 and FP4—all without sacrificing speed or memory. 

There are accelerated frameworks that help, but adapting to these specific methodologies can significantly slow R&D, as users typically need to learn an entirely new codebase. 

NVIDIA BioNeMo Recipes can simplify and accelerate this process by lowering the barrier to entry for large-scale model training. Using step-by-step guides built on familiar frameworks like PyTorch and Hugging Face (HF), we show how integrating accelerated libraries such as NVIDIA Transformer Engine (TE) unlocks speed and memory efficiency, scaling performance through techniques like Fully Sharded Data Parallel (FSDP) and Context Parallelism.

In this blog post, we demonstrate how to accelerate transformer-style AI models for biology by taking the Hugging Face ESM-2 protein language model with a native PyTorch training loop and:

- Accelerating it with TE.
- Integrating with FSDP2 for auto-parallelism (a minimal sharding sketch follows this list).
- Showing sequence packing to achieve even greater performance.
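
To give a flavor of the FSDP2 step, here is a minimal sketch that shards a TE-based encoder (such as the MyEsmEncoder defined later in this post) with PyTorch's fully_shard API. It assumes PyTorch 2.6 or newer, where fully_shard is exposed under torch.distributed.fsdp, and a process group launched with torchrun; the recipes contain the authoritative integration.

```python
# Minimal FSDP2 sketch (assumes PyTorch >= 2.6 and an NCCL process group);
# illustrative only -- see bionemo-recipes for the full training scripts.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard


def shard_model(model: torch.nn.Module) -> torch.nn.Module:
    # Shard each transformer block first, then the root module, so parameters
    # are grouped per layer and communication can overlap with compute.
    for layer in model.layers:
        fully_shard(layer)
    fully_shard(model)
    return model


# Typical launch: torchrun --nproc-per-node=8 train.py
# dist.init_process_group("nccl")
# model = shard_model(MyEsmEncoder(...).cuda())
# out = model(x, attention_mask=mask)   # forward/backward work as usual;
# out.sum().backward()                  # gradients and optimizer state are sharded
```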

All you need to get started is PyTorch, NVIDIA CUDA 12.8, and the bionemo-recipes repository on GitHub.

Integrating Transformer Engine into ESM-2

TE enables significant performance gains by optimizing transformer computations, particularly on NVIDIA GPUs. It can be integrated into existing training pipelines without requiring a complete overhaul of your datasets, data loaders, or trainers. This section shows how to incorporate TE into a model like ESM-2, drawing inspiration from the BioNeMo recipes.

In most use cases, using the ready-made TransformerLayer module from TE is straightforward. This encapsulates all fused TE operations and best practices into a single drop-in module, reducing boilerplate code and setup. The following snippet shows how we integrated TE in ESM-2. The full implementation can be found in the NVEsmEncoder class definition in bionemo-recipes.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling


class MyEsmEncoder(torch.nn.Module):
    def __init__(self, num_layers, hidden_size, ffn_hidden_size, num_heads):
        super().__init__()
        self.layers = torch.nn.ModuleList([
            te.TransformerLayer(
                hidden_size=hidden_size,
                ffn_hidden_size=ffn_hidden_size,
                num_attention_heads=num_heads,
                layer_type="encoder",
                self_attn_mask_type="padding",
                attn_input_format="bshd",  # or 'thd', read below.
                window_size=(-1, -1),      # disable windowed attention
            ) for _ in range(num_layers)
        ])
        # Optionally add embedding, head, etc.

    def forward(self, x, attention_mask=None):
        for layer in self.layers:
            x = layer(x, attention_mask=attention_mask)
        return x


# Layer configuration
layer_num = 8
hidden_size = 4096
sequence_length = 2048
batch_size = 4
ffn_hidden_size = 16384
num_attention_heads = 32
dtype = torch.bfloat16

# Synthetic data (batch, seq, hidden) for bshd format
x = torch.rand(batch_size, sequence_length, hidden_size).cuda().to(dtype=dtype)
attention_mask = torch.ones(batch_size, 1, 1, sequence_length, dtype=torch.bool).cuda()

myEsm = MyEsmEncoder(layer_num, hidden_size, ffn_hidden_size, num_attention_heads)
myEsm.to(dtype=dtype).cuda()

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = myEsm(x, attention_mask=attention_mask)
```

If your architecture deviates from a standard Transformer block, TE can still be integrated at the layer level. The core idea is to replace standard PyTorch modules (e.g., nn.Linear, nn.LayerNorm) with their TE counterparts and use FP8 autocasting to achieve maximum performance gains. TE provides several alternative implementations to common layers, such as Linear, fused LayerNormLinear, and attention modules like DotProductAttention and MultiheadAttention. For a complete list of supported modules, check the TE documentation.
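
As a hedged illustration of that layer-level substitution, the snippet below rewrites a plain PyTorch feed-forward block with TE modules, fusing the LayerNorm and first projection into a single LayerNormLinear. PlainMLP and TEMLP are made-up names for this post, not classes from the recipes.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Format, DelayedScaling


class PlainMLP(torch.nn.Module):
    """A standard PyTorch feed-forward block."""

    def __init__(self, hidden, ffn):
        super().__init__()
        self.norm = torch.nn.LayerNorm(hidden)
        self.fc1 = torch.nn.Linear(hidden, ffn)
        self.fc2 = torch.nn.Linear(ffn, hidden)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(self.norm(x))))


class TEMLP(torch.nn.Module):
    """Same block with TE drop-ins; LayerNormLinear fuses the norm and first projection."""

    def __init__(self, hidden, ffn):
        super().__init__()
        self.norm_fc1 = te.LayerNormLinear(hidden, ffn)
        self.fc2 = te.Linear(ffn, hidden)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.norm_fc1(x)))


mlp = TEMLP(1024, 4096).cuda().to(torch.bfloat16)
x = torch.randn(8, 512, 1024, device="cuda", dtype=torch.bfloat16)

# FP8 autocast wraps the forward pass, just as with the TransformerLayer example above.
with te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling(fp8_format=Format.HYBRID)):
    y = mlp(x)
```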

Efficient sequence packing

Standard input data formats can be inefficient when samples have varying sequence lengths. For example, ESM-2 pretraining with a context length of 1,024 can consist of around 60% padding tokens, wasting compute on tokens that do not participate in the model’s attention mechanism. Internally, networks typically represent the hidden state of input sequences in a tensor with four dimensions: [batch size (B), max sequence length (S), number of attention heads (H), and head hidden dimension (D)], or BSHD.

As an alternative, modern attention kernels enable users to provide packed inputs without padding tokens, using index vectors to denote the boundaries between input sequences. Here, hidden states are represented by a flattened tensor of size [flattened input tokens (T), number of attention heads (H), head hidden dimension (D)], or THD. Figure 1 shows this format change, which results in less memory usage and faster token throughput by removing padding tokens (grey).
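
As a concrete toy example of the two layouts, consider three sequences of lengths 3, 5, and 2 with made-up values for H and D: BSHD pads every sequence to the longest one, while THD packs them end to end and records the boundaries in a cumulative-length vector.

```python
import torch

H, D = 4, 16                       # attention heads, head dim (toy values)
seq_lens = [3, 5, 2]               # three variable-length sequences
B, S = len(seq_lens), max(seq_lens)

# BSHD: padded to the longest sequence -> 3 * 5 = 15 token slots, 5 wasted on padding
bshd = torch.zeros(B, S, H, D)

# THD: packed, no padding -> exactly 3 + 5 + 2 = 10 token slots
thd = torch.zeros(sum(seq_lens), H, D)

# Cumulative sequence lengths mark where each packed sequence starts and ends
cu_seqlens = torch.tensor([0, 3, 8, 10], dtype=torch.int32)

print(bshd.shape)   # torch.Size([3, 5, 4, 16])
print(thd.shape)    # torch.Size([10, 4, 16])
```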

Figure 1. BSHD vs. THD “sequence‑packed” input: converting padded BSHD tensors to THD using cumulative sequence lengths (cu_seq_lens)

TE makes this optimization relatively simple by adding an attn_input_format parameter to relevant layers, which then accepts standard flash-attention-style cumulative sequence length keyword arguments (cu_seq_lens_q). These can be generated using THD-aware collators, such as Hugging Face’s DataCollatorWithFlattening, or the masking version implemented in BioNeMo Recipes.

```python
def sequence_pack(input_ids, labels):
    # input_ids is a list of sequences: [(S1,), (S2,), ..., (SN,)] of shape (B, S)
    # Flatten and track sequence boundaries

    # Determine the length of each sequence
    sample_lengths = [len(sample) for sample in input_ids]

    # Flatten the input_ids and labels
    flat_input_ids = [token for sample in input_ids for token in sample]
    flat_labels = [label for sample in labels for label in sample]

    # Create a list of cumulative sums showing where the sequences start/stop
    # Note: for self attention cu_seqlens_q and cu_seqlens_kv will be the same
    cu_seqlens = torch.cumsum(torch.tensor([0] + sample_lengths), dim=0, dtype=torch.int32)
    max_length = max(sample_lengths)

    return {
        "input_ids": torch.tensor(flat_input_ids, dtype=torch.int64),
        "labels": torch.tensor(flat_labels, dtype=torch.int64),
        # These are the same kwargs used by `flash_attn_varlen_func`, etc.
        "cu_seqlens_q": cu_seqlens,
        "cu_seqlens_kv": cu_seqlens,
        "max_length_q": max_length,
        "max_length_kv": max_length,
    }
```
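
For example, calling the collator above on two short, made-up sequences yields the flattened tensors and cumulative-length vectors that the THD attention path expects:

```python
# Two toy sequences of lengths 3 and 5; token and label values are arbitrary.
batch = sequence_pack(
    input_ids=[[5, 9, 12], [7, 3, 3, 8, 2]],
    labels=[[-100, 9, 12], [-100, 3, 3, 8, 2]],
)

print(batch["input_ids"])     # tensor([ 5,  9, 12,  7,  3,  3,  8,  2])
print(batch["cu_seqlens_q"])  # tensor([0, 3, 8], dtype=torch.int32)
print(batch["max_length_q"])  # 5
```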

TE and sequence packing on/off performance 

Figure 2 shows the performance comparison, with a significant uplift in token throughput when TE is employed. This demonstrates TE’s ability to maximize the computational efficiency of your NVIDIA GPUs.

EvolutionaryScale integrated Transformer Engine across their next-generation models as well:

“ESM3 is the largest foundation model trained on biological data. Integrating the NVIDIA Transformer Engine was crucial to training it at this 98B parameter scale with high throughput and GPU utilization,” said Tom Sercu, co-founder and VP of Engineering at EvolutionaryScale. “The precision and speed of FP8 acceleration, combined with optimized kernels for fused layers, allow us to push the boundaries of compute and model scale across NVIDIA GPUs. This leads to emergent understanding of biology in our frontier models for the scientific community.”

Hugging Face interoperability

One of the key advantages of TE is its interoperability with existing machine learning ecosystems, including popular libraries like Hugging Face. This means you can use TE’s performance benefits even when working with models loaded from the Hugging Face Transformers library.

TE layers can be embedded directly inside a Hugging Face Transformers PreTrainedModel, and are fully compatible with AutoModel.from_pretrained. See the NVIDIA BioNeMo Collection on the Hugging Face Hub for pre-optimized models.

The process typically involves loading your Hugging Face model, then carefully identifying and replacing its standard PyTorch layers (such as nn.Linear, nn.LayerNorm, and nn.MultiheadAttention) with their TE-optimized counterparts. This often requires renaming some layers or a custom model wrapper to ensure the TE layers are correctly integrated into the model’s forward pass.
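
A minimal sketch of that pattern, assuming the transformers library and the small public facebook/esm2_t6_8M_UR50D checkpoint, is shown below: it walks the module tree and swaps every nn.Linear for a weight-copied te.Linear. A real conversion, such as the one in the recipes, would also cover the LayerNorm and attention modules.

```python
import torch
import transformer_engine.pytorch as te
from transformers import AutoModelForMaskedLM


def swap_linear_for_te(module: torch.nn.Module) -> None:
    """Recursively replace nn.Linear submodules with te.Linear, copying weights."""
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            te_linear = te.Linear(
                child.in_features, child.out_features, bias=child.bias is not None
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            swap_linear_for_te(child)


model = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D")
swap_linear_for_te(model)
model = model.cuda().to(torch.bfloat16)
```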

Get started

Our mission with BioNeMo Recipes is to make acceleration and scaling accessible for all foundation model builders. To help us build a more powerful and practical toolkit, we want to hear from you. We encourage you to try out the recipes and contribute by submitting a pull request or opening an issue on our GitHub. 
