NVIDIA Developer, October 14, 21:24
NVIDIA Parabricks v4.6: Accelerating Genomic Analysis with New DeepVariant and STAR Features

 

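NVIDIA Parabricks v4.6 is a scalable genomics software suite designed for data scientists and bioinformaticians and focused on secondary analysis. By providing GPU-accelerated versions of open-source tools, it significantly improves speed and accuracy, helping researchers reach biological insights faster. The latest release, v4.6, improves multiple features, most notably adding support for Google DeepVariant and DeepSomatic 1.9, including a pangenome-aware mode that improves accuracy across genetic variations and diverse populations. It also introduces DeepSomatic support for long reads and whole exome sequencing (WES), along with the quantMode GeneCounts option for STAR, further improving the speed and efficiency of RNA-seq alignment.

🚀 **GPU-accelerated genomic analysis platform**: By GPU-accelerating open-source tools, NVIDIA Parabricks v4.6 gives data scientists and bioinformaticians an efficient solution for secondary genomic analysis, substantially improving analysis speed and accuracy and speeding up the discovery of biological insights.

💡 **DeepVariant and DeepSomatic 1.9 support**: The new release adds support for Google DeepVariant and DeepSomatic 1.9, notably introducing a pangenome-aware mode for DeepVariant that significantly improves variant-calling accuracy across genetic variations and diverse populations. DeepSomatic also gains support for long-read and whole exome sequencing (WES) data.

✨ **Faster STAR RNA-seq alignment**: Parabricks v4.6 improves STAR, which now runs almost 8x faster on two NVIDIA RTX PRO 6000 GPUs than CPU-only solutions. The new quantMode GeneCounts option quickly generates gene-level read counts, greatly improving the efficiency of RNA-seq alignment for applications such as gene expression analysis, quality control, and data integration.

📈 **Advantages of pangenome-aware mode**: Replacing the traditional linear reference genome with a pangenome captures the diversity of genetic variation in human populations more fully and reduces reference bias, improving variant-calling accuracy across populations. Compared with tools such as BWA, pangenome-aware DeepVariant achieves higher F1 scores for SNP and indel detection, and combining Giraffe with Parabricks GPU acceleration delivers a speedup of more than 14x.

🛠️ **Easy-to-follow implementation guide**: The article provides detailed Docker command examples that show how to run a variant-calling workflow with Giraffe and pangenome-aware DeepVariant in Parabricks v4.6, and how to use the quantMode GeneCounts option for STAR, so users can get started quickly with accelerated genomic analysis.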

Built for data scientists and bioinformaticians, NVIDIA Parabricks is a scalable genomics software suite for secondary analysis. It provides GPU-accelerated versions of open-source tools for increased speed and accuracy, so researchers can uncover biological insights faster.

The latest release, Parabricks v4.6, offers improvements to multiple features, most notably support for Google’s DeepVariant and DeepSomatic 1.9. This includes a pangenome-aware mode for DeepVariant, which improves accuracy across genetic variations and diverse populations. 

New features:

    DeepVariant and DeepSomatic 1.9, including pangenome-aware DeepVariant.
    DeepSomatic long read and whole exome sequencing (WES) support.
    STAR quantMode including GeneCounts.

Improved features:

    STAR speedups: Almost 8x faster on two NVIDIA RTX PRO 6000 GPUs compared to CPU-only solutions.
    Additional arguments for Mutectcaller, including mitochondrial mode.

Improve variant calling with DeepVariant and DeepSomatic 1.9

Variant calling is a critical step in genomic analysis. It identifies differences between the sample genome (i.e., an individual or population) and a reference genome. Understanding these genetic differences gives scientists a better understanding of diseases and potential treatments.

There is a wide variety of tools built to address variant calling, including HaplotypeCaller and Mutect2 in the Genome Analysis Toolkit (GATK) from the Broad Institute. In addition to the industry standards from GATK, deep-learning-based variant callers have become widely used.

Developed by Google, DeepVariant and DeepSomatic use deep learning to support variant identification. For germline data, DeepVariant identifies inherited variants. DeepSomatic, by contrast, identifies somatic variants, which are non-inherited mutations such as those found in tumor cells.

Enhancing variant calling accuracy is critical, particularly when considering genetic diversity. According to a recent paper, pangenome-aware DeepVariant reduced errors by up to 25.5% across all settings when compared to linear-reference-based DeepVariant.

“Taking genetic diversity into account is critical to accurate genome analysis, especially across diverse populations. New pangenome methods allow more comprehensive maps of genetic variation to inform analysis,” says Andrew Carroll, product lead at Google Research. “I’m excited by Parabricks v4.6 support for pangenome-aware DeepVariant v1.9, which combines the incredible speed of Parabricks with the new DeepVariant ability to directly use pangenome information during variant calling.”

Improve accuracy even more with Giraffe and DeepVariant v1.9

Traditional linear references, including the Genome Reference Consortium Human Build 38 (GRCh38), are built from the DNA of only a few individuals, providing a universal coordinate system for genomic research. However, these references don’t capture the full spectrum of genetic variation present across the broader human population. As a result, important subpopulation diversity is often underrepresented. This can introduce bias into subsequent analyses, such as read mapping and variant detection, which may miss or inaccurately interpret important genetic differences tied to ancestry or disease. 

Unlike linear references, pangenomes are built by integrating multiple high-quality genomes from diverse individuals, capturing a much broader range of genetic variation present in human populations. This comprehensive approach reduces reference bias, improves variant detection across populations, and supports more accurate and equitable genomic analyses. Giraffe, a software tool developed by researchers at the University of California, Santa Cruz, enables efficient read alignment to pangenome graphs.

Giraffe maps genomic sequences to a reference pangenome rather than a traditional linear reference, improving variant-calling accuracy across diverse populations. Combining Giraffe with pangenome-aware mode in DeepVariant, which is now available in Parabricks v4.6, improves the accuracy of identified variants and provides the speed of Parabricks GPU acceleration. 

    Accuracy: Open-source pangenome-aware DeepVariant was more accurate than BWA, receiving the following F1 scores according to Pangenome-aware DeepVariant.
      Pangenome-aware DeepVariant: SNP: 0.9981 | Indel: 0.9971
      BWA: SNP: 0.9973 | Indel: 0.9968
    Speed: With GPU acceleration in Parabricks on four NVIDIA RTX PRO 6000 GPUs, Giraffe and pangenome-aware DeepVariant together ran more than 14x faster than the CPU-only versions.

Figure 1. On four NVIDIA RTX PRO 6000 GPUs, the total runtime for pangenome-aware DeepVariant 1.9 and Giraffe fell from more than 9 hours with a CPU-only solution to under 40 minutes
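
To put the accuracy and speed figures above in perspective, the following sketch redoes the arithmetic with awk. The helper names are hypothetical, and treating 1 - F1 as an error measure is only a rough proxy chosen here for illustration; it is not necessarily the metric behind the 25.5% error reduction cited from the paper. The runtime inputs are the rounded bounds from the Figure 1 caption (9 hours, 40 minutes), so the result is a lower bound rather than the measured speedup.

```shell
# speedup CPU_MINUTES GPU_MINUTES
# Prints the runtime ratio with one decimal place.
speedup() {
    awk -v cpu="$1" -v gpu="$2" 'BEGIN { printf "%.1f\n", cpu / gpu }'
}

# error_reduction BASELINE_F1 CANDIDATE_F1
# Prints the percentage drop in (1 - F1) from the baseline to the candidate.
# Assumption: (1 - F1) is used as a rough stand-in for the error rate.
error_reduction() {
    awk -v base="$1" -v cand="$2" \
        'BEGIN { printf "%.1f\n", 100 * ((1 - base) - (1 - cand)) / (1 - base) }'
}

# Example calls using the numbers quoted in the article:
#   speedup 540 40                  # 13.5 (9 h = 540 min versus 40 min)
#   error_reduction 0.9973 0.9981   # 29.6 (SNP)
#   error_reduction 0.9968 0.9971   # 9.4  (indel)
```

Because the CPU-only run took more than 9 hours and the GPU run less than 40 minutes, the true ratio is above 13.5, which is in line with the reported speedup of over 14x.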

“Roche’s SBX technology enables sequencing at unparalleled data rates and flexible data processing workflows for different sequencing applications,” says John Mannion, VP Computational Sciences at Roche. “Through our collaboration with NVIDIA, we plan to leverage GPU-accelerated versions of multiple aligners, including Giraffe, to provide users with an integrated solution allowing for faster and more accurate analysis.”

Get started with Giraffe and DeepVariant

Existing users of Parabricks can run DeepVariant after providing:

    the appropriate FASTA reference file derived from the Giraffe index files, the BAM file output from running Giraffe, and the graph GBZ file.

Instructions on obtaining these files are available in the Parabricks Giraffe documentation focused on Using Giraffe in Variant Calling workflows. The following steps also guide you through the process.

Step 1 

Run baseline VG to generate a FASTA file from the graph.

Note that step 1, which uses baseline VG, only needs to run once. Once you have the FASTA file from the graph, skip step 1 and run only steps 2 and 3 for each additional FASTQ sample.

# Extract the sequences corresponding to the list of paths to a FASTA file
docker run --rm --volume $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.1-mc-grch38.gbz -p hprc-v1.1-mc-grch38.paths.sub -F > hprc-v1.1-mc-grch38.fa

# Index the FASTA file
samtools faidx hprc-v1.1-mc-grch38.fa
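
Since step 1 only needs to run once, a script that processes many samples can guard it with a file check. The following is a minimal sketch, assuming the file names used in the command above; the function names fasta_ready and ensure_fasta are hypothetical. It relies on the fact that samtools faidx writes its index next to the FASTA with a .fai suffix.

```shell
GRAPH="hprc-v1.1-mc-grch38"

# Succeeds only when both the FASTA and its samtools index exist and are non-empty.
fasta_ready() {
    [ -s "$1" ] && [ -s "$1.fai" ]
}

# Run step 1 only if its outputs are not already present in the current directory.
ensure_fasta() {
    local fa="${GRAPH}.fa"
    if fasta_ready "$fa"; then
        echo "step 1 already done: $fa"
        return 0
    fi
    # Same commands as step 1; the index is built only if extraction succeeds.
    docker run --rm --volume "$(pwd)":/workdir --workdir /workdir \
        quay.io/vgteam/vg:v1.59.0 \
        vg paths -x "${GRAPH}.gbz" -p "${GRAPH}.paths.sub" -F > "$fa" &&
        samtools faidx "$fa"
}
```

Checking for the .fai index as well as the FASTA means that an extraction that was interrupted before indexing is treated as incomplete and rerun, rather than silently reused.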

Step 2

Next, run Giraffe normally.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun giraffe --read-group "sample_rg1" \
    --sample "sample-name" --read-group-library "library" \
    --read-group-platform "platform" --read-group-pu "pu" \
    --dist-name /workdir/hprc-v1.1-mc-grch38.dist \
    --minimizer-name /workdir/hprc-v1.1-mc-grch38.min \
    --gbz-name /workdir/hprc-v1.1-mc-grch38.gbz \
    --ref-paths /workdir/hprc-v1.1-mc-grch38.paths.sub \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM}

Step 3 

Finally, these three files can be used as inputs for DeepVariant. Run pangenome_aware_deepvariant with the BAM from step 2, the FASTA from step 1, and the graph GBZ file.

# Pangenome_aware_deepvariant
# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun pangenome_aware_deepvariant \
    --ref /workdir/hprc-v1.1-mc-grch38.fa \
    --pangenome /workdir/hprc-v1.1-mc-grch38.gbz \
    --in-bam /workdir/${INPUT_BAM} \
    --out-variants /outputdir/${OUTPUT_VCF}
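
Steps 2 and 3 repeat for every additional sample, so they are natural to wrap in a small script. The sketch below chains the two commands above for one FASTQ pair and loops over all pairs in the current directory. Several things here are assumptions for illustration and do not come from Parabricks itself: the _R1/_R2 FASTQ naming scheme, the per-sample .bam and .vcf output names, the DRY_RUN switch, and every function name. Only flags that appear in the commands above are used.

```shell
PB_IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"
GRAPH="hprc-v1.1-mc-grch38"

# Print the command instead of executing it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Derive a sample name from the R1 file name: NA12878_R1.fastq.gz -> NA12878
sample_name() {
    local base
    base=$(basename "$1")
    echo "${base%%_R1*}"
}

process_sample() {
    local fq1="$1" fq2="$2" sample
    sample=$(sample_name "$fq1")

    # Step 2: align the read pair to the pangenome graph with Giraffe.
    run docker run --rm --gpus all \
        --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir "$PB_IMAGE" \
        pbrun giraffe --read-group "${sample}_rg1" \
        --sample "$sample" --read-group-library "library" \
        --read-group-platform "platform" --read-group-pu "pu" \
        --dist-name "/workdir/${GRAPH}.dist" \
        --minimizer-name "/workdir/${GRAPH}.min" \
        --gbz-name "/workdir/${GRAPH}.gbz" \
        --ref-paths "/workdir/${GRAPH}.paths.sub" \
        --in-fq "/workdir/${fq1}" "/workdir/${fq2}" \
        --out-bam "/outputdir/${sample}.bam" || return 1

    # Step 3: call variants from the Giraffe BAM, reusing the FASTA from step 1.
    run docker run --rm --gpus all \
        --volume "$(pwd)":/workdir --volume "$(pwd)":/outputdir \
        --workdir /workdir "$PB_IMAGE" \
        pbrun pangenome_aware_deepvariant \
        --ref "/workdir/${GRAPH}.fa" \
        --pangenome "/workdir/${GRAPH}.gbz" \
        --in-bam "/workdir/${sample}.bam" \
        --out-variants "/outputdir/${sample}.vcf"
}

# Process every *_R1.fastq.gz / *_R2.fastq.gz pair in the current directory.
process_all() {
    local fq1
    for fq1 in *_R1.fastq.gz; do
        [ -e "$fq1" ] || continue
        process_sample "$fq1" "${fq1%_R1.fastq.gz}_R2.fastq.gz" || return 1
    done
}
```

Setting DRY_RUN=1 before calling process_all prints the two docker commands per sample without launching any containers, which is a cheap way to check paths and sample names before committing GPU time.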

STAR improvements, including quantMode GeneCounts

In addition to pangenome-aware mode for DeepVariant, the latest release of Parabricks also includes improvements to STAR, a tool for RNA-sequencing alignment. STAR is particularly useful because of its speed, its accuracy on RNA-seq data across sequencing platforms, and its scalability to large datasets. Already available in Parabricks, STAR is now further GPU-accelerated, running nearly 8x faster on two NVIDIA RTX PRO 6000 GPUs than CPU-only solutions.

In the latest release of Parabricks, quantMode GeneCounts is a new option available for STAR, which is valuable for a variety of applications relevant to gene expression, QC, normalization, and data integration. During the mapping step of alignment, quantMode GeneCounts enables fast generation of gene-level read counts.

Get started with STAR

To enable quantMode GeneCounts, add it as an argument to the STAR command. An example command is below.

docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun rna_fq2bam \
    --genome-lib-dir ${GENOME_DIR} \
    --in-fq ${FASTQ1} ${FASTQ2} \
    --output-dir ${OUT_DIR} \
    --ref ${GENOME} \
    --out-bam ${OUT_BAM} \
    --num-gpus ${GPU_NUM} \
    --quantMode GeneCounts
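
Once the run finishes, the gene-level counts can be pulled out with a few lines of awk. In upstream STAR, GeneCounts writes a tab-separated file named ReadsPerGene.out.tab whose first four rows are summary counters (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), followed by one row per gene with the gene ID, the unstranded count, and the counts for each of the two strand orientations. The sketch below assumes Parabricks produces a file in the same layout in the output directory; that file name and location are assumptions to verify against your own run, and the function names are hypothetical.

```shell
# gene_counts FILE [COLUMN]
# Prints "gene_id<TAB>count" for every gene, skipping the N_* summary rows.
# COLUMN: 2 = unstranded (default), 3 = read strand matches the gene, 4 = reverse.
gene_counts() {
    awk -F'\t' -v col="${2:-2}" '$1 !~ /^N_/ { print $1 "\t" $col }' "$1"
}

# assigned_total FILE [COLUMN]
# Prints the total number of reads assigned to genes in the chosen column.
assigned_total() {
    awk -F'\t' -v col="${2:-2}" '$1 !~ /^N_/ { sum += $col } END { print sum + 0 }' "$1"
}

# Example (path is an assumption):
#   gene_counts "${OUT_DIR}/ReadsPerGene.out.tab" 4 | sort -k2,2nr | head
```

Which column to use depends on the strandedness of the library prep; comparing the N_noFeature counter across columns 3 and 4 is a common way to tell which orientation matches the protocol.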

Download Parabricks today

Download NVIDIA Parabricks v4.6 to get started with GPU-accelerated genomic analysis, and join the conversation on the NVIDIA Parabricks Developer Forum.
