DeepVariant is a set of programs used to transform aligned sequencing reads into variant calls. At the highest level, a user needs to provide three inputs:
- A reference genome in FASTA format and its corresponding .fai index file generated using the `samtools faidx` command.
- An aligned reads file in BAM format and its corresponding index file (.bai). The reads must be aligned to the reference genome described above.
- A model checkpoint for DeepVariant.
The output of DeepVariant is a list of all variant calls in VCF format.
DeepVariant can be run in a variety of ways. The simplest method is to use `run_deepvariant`, which will configure DeepVariant for a given model type (`--model_type`) and run all subprograms.

However, under the hood, DeepVariant is composed of three programs: `make_examples`, `call_variants`, and `postprocess_variants`. Each program is described in detail in the Inputs and outputs section.
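For example, here is a minimal sketch of the one-step path via Docker. The binary path and flags follow the release documentation, but the `/data` mount, file names, and version are placeholders to substitute for your own:

```bash
BIN_VERSION="1.8.0"  # substitute the DeepVariant release you use

docker run \
  -v "${PWD}":/data \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref=/data/reference.fasta \
  --reads=/data/sample.bam \
  --output_vcf=/data/sample.output.vcf.gz \
  --num_shards=$(nproc)
```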
- Sharded files are a single logical collection of files with a common naming convention. For example, we talk about `filename@10` as a single 10-way sharded file named `filename`. On most filesystems this actually looks like 10 distinct files: `filename-00000-of-00010`, ..., `filename-00009-of-00010`. DeepVariant can write sharded files using their `filename@10`-style name and can read sharded files using both that style as well as the glob forms, such as `filename-*` or `filename-*-of-00010`.
- Files with the `.gz` suffix are interpreted as being compressed with gzip and are read/written accordingly.
`make_examples` consumes reads and the reference genome to create TensorFlow examples for evaluation with our deep learning models. The tf.Example protos are written out in TFRecord format. To learn more about tf.Example and TFRecord, see the Using TFRecords and tf.Example Colab.
`make_examples` is a single-threaded program. Since the process of generating examples is embarrassingly parallel across the genome, `make_examples` supports sharding of its input and output via the `--task` argument with a sharded output specification. For example, if the output is specified as `--examples examples.tfrecord@10.gz` and `--task 0`, the input to the program will be 10% of the regions and the output will be written to `examples.tfrecord-00000-of-00010.gz`.
Memory usage per instance is detailed below. These values do not account for the small-model or fast-pipeline modes of `make_examples`, and should be considered upper limits on memory usage.
Memory Usage

These values were calculated by looking at memory usage for 1/25th of chr20.

| Model Type | Version | Mean memory (MB) | Median memory (MB) | Max memory (MB) |
|---|---|---|---|---|
| PacBio | 1.6.1 | 464.6 | 426.4 | 590.7 |
| PacBio | 1.8.0 | 487.0 | 457.1 | 630.3 |
| Illumina (WGS) | 1.6.1 | 404.7 | 403.9 | 418.3 |
| Illumina (WGS) | 1.8.0 | 435.1 | 434.7 | 449.3 |
`make_examples` requires its input files to satisfy a few basic requirements to be processed correctly.
First, the reference genome FASTA, passed in using the `--ref` flag, must be indexed and can either be uncompressed or compressed with bgzip.
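For example, assuming `samtools` and `bgzip` (from htslib) are installed, either form of the reference could be prepared like so (file names are placeholders):

```bash
# Uncompressed reference: create the .fai index alongside it.
samtools faidx reference.fasta

# Or bgzip-compressed reference: compress, then index
# (this writes both .fai and .gzi index files).
bgzip reference.fasta
samtools faidx reference.fasta.gz
```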
Second, the BAM file provided to `--reads` should be aligned to a "compatible" version of the genome reference provided as the `--ref`. By compatible here we mean the BAM and FASTA share at least a reasonable set of common contigs, as DeepVariant will only process contigs shared by both the BAM and reference. As an example, suppose you have a BAM file mapped to a b37 + decoy FASTA and you provide just the vanilla b37 FASTA to `make_examples`. DeepVariant will only process variants on the shared contigs, effectively excluding the hs37d5 contig present in the BAM but not in the reference.
The BAM file must also be sorted and indexed. It must exist on disk, so you cannot pipe it into DeepVariant. Duplicate marking may be performed; in our analyses there is almost no difference in accuracy except at lower (<20x) coverages. Finally, we recommend that you do not perform BQSR, as running BQSR results in a small decrease in accuracy. It is not necessary to do any form of indel realignment; it makes no difference to DeepVariant accuracy either way.
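If your BAM is not yet sorted and indexed, a typical preparation with samtools (file names here are placeholders) is:

```bash
# Coordinate-sort the reads, then index the result.
samtools sort -o sample.sorted.bam sample.bam
samtools index sample.sorted.bam   # writes sample.sorted.bam.bai
```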
Third, if you are providing `--regions` or other similar arguments, these should refer to contigs present in the reference genome. These arguments accept space-separated lists, so all of the following examples are valid arguments for `--regions` or similar arguments:

- `--regions chr20` => only process all of chromosome 20
- `--regions chr20:10,000,000-11,000,000` => only process 10-11mb of chr20
- `--regions "chr20 chr21"` => only process chromosomes 20 and 21
Fourth and finally, if running in training mode, the `truth_vcf` and `confident_regions` arguments should point to VCF and BED files containing the true variants and the regions where we are confident in our calls (i.e., calls within these regions and not in the `truth_vcf` are considered false positives). These should be bgzipped and tabix indexed, and be on a reference consistent with the one provided with the `--ref` argument.
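For instance, a truth VCF could be prepared with bgzip and tabix as follows (the file name is a placeholder):

```bash
# Compress the truth VCF with bgzip and build a tabix (.tbi) index.
bgzip HG002_truth.vcf            # produces HG002_truth.vcf.gz
tabix -p vcf HG002_truth.vcf.gz  # produces HG002_truth.vcf.gz.tbi
```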
`call_variants` consumes TFRecord file(s) of tf.Example protos created by `make_examples` and a deep learning model checkpoint, and evaluates the model on each example in the input TFRecord. The output is a TFRecord of CallVariantsOutput protos. `call_variants` doesn't directly support sharding its outputs, but accepts a glob or shard-pattern for its inputs.
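A sketch of a typical invocation, assuming the sharded examples produced above and a checkpoint path like the ones shipped in the release Docker images (the exact checkpoint location varies by release and model type):

```bash
/opt/deepvariant/bin/call_variants \
  --outfile call_variants_output.tfrecord.gz \
  --examples "examples.tfrecord@${N_SHARDS}.gz" \
  --checkpoint /opt/models/wgs   # assumed checkpoint location
```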
`call_variants` uses around 4 GB per process and uses TensorFlow for evaluation. When evaluating a model in CPU mode, TensorFlow can make use of multiple cores, but scaling is sub-linear. In other words, `call_variants` on a 64-core machine is less than 8x faster than running on an 8-core machine.
When using a GPU, `call_variants` is faster and more efficient, and needs fewer CPUs. Based on a small number of experiments, currently the most efficient configuration for `call_variants` on a GPU instance is 4-8 CPUs and 1 GPU. Compared to our setting in the whole genome case study, we noticed a 2.5x speedup on the `call_variants` step using a single P100 GPU and 8 CPUs. Note that `call_variants` can currently use at most one GPU, so a multi-GPU machine will not improve its speed.
`postprocess_variants` reads all of the output TFRecord files from `call_variants`, sorts them, combines multi-allelic records, and writes out a VCF file. When gVCF output is requested, it also outputs a gVCF file which merges the VCF with the non-variant sites.

Because `postprocess_variants` combines and sorts the output of `call_variants`, it needs to see all of the outputs from `call_variants` for a single sample to merge into a final VCF. `postprocess_variants` is single-threaded and needs a non-trivial amount of memory to run (20-30 GB), so it is best run on a single/dual core machine with sufficient memory.
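A sketch of a typical invocation (gVCF options omitted; file names are placeholders):

```bash
/opt/deepvariant/bin/postprocess_variants \
  --ref reference.fasta \
  --infile call_variants_output.tfrecord.gz \
  --outfile sample.output.vcf.gz
```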
The DeepVariant team has been hard at work since we first presented the method. Key changes and improvements include:
- Rearchitected with open source release in mind
- Built on TensorFlow
- Increased variant calling accuracy, especially for indels
- Vastly faster with reduced memory usage
We have made a number of improvements to the methodology as well. The biggest change was to move away from RGB-encoded (3-channel) pileup images and instead represent the aligned read data using a multi-channel tensor data layout. Channels represent sequencing features. All of our models currently have a set of 6 'base channels':
- The read base (A, C, G, T)
- The base's quality score
- The read's mapping quality score
- The read's strand (positive or negative)
- Does the read support the allele being evaluated?
- Does the base match the reference genome at this position?
We can add additional channels to this base channel set to tailor the model input to a particular sequencing platform or technology to maximize accuracy. For example, for our Illumina models (`wgs`, `exome`) we add an additional `insert_size` channel. Because long-read data can be phased, we add a `haplotype` channel to our `pacbio` and `ont` models.
Additional modeling changes were to move to the inception-v3 architecture and to train on many more independent sequencing replicates of the ground truth training samples, including 50% downsampled versions of each of those read sets. In our testing this allowed the model to better generalize to other data types.
In the end these changes reduced our error rate by more than 50% on the held out evaluation sample (NA24385 / HG002) as compared to our results in the PrecisionFDA Truth Challenge:
DeepVariant April 2016 (HG002, GIAB v3.2.2, b37):

| Type | # FN | # FP | Recall | Precision | F1_Score |
|---|---|---|---|---|---|
| INDEL | 4175 | 2839 | 0.987882 | 0.991728 | 0.989802 |
| SNP | 1689 | 832 | 0.999447 | 0.999728 | 0.999587 |

DeepVariant December 2017 (HG002, GIAB v3.2.2, b37):

| Type | # FN | # FP | Recall | Precision | F1_Score |
|---|---|---|---|---|---|
| INDEL | 2384 | 1811 | 0.993081 | 0.994954 | 0.994017 |
| SNP | 735 | 363 | 0.999759 | 0.999881 | 0.999820 |
See the whole genome case study, which we update with each release of DeepVariant, for the latest results.
You can also see the Colab example to see how you can visualize the pileup images.
For the models we've released over time, you can find more details about the training data in DeepVariant training data.
As of v0.7, DeepVariant accepts CRAM files as input in addition to BAM files. As of v0.9.0, we changed the default to use the reference file specified by the `--ref` flag, instead of the path to the original reference in the CRAM file (encoded in the file's "UR" tag).

For more information about CRAM, see the samtools documentation in general, but particularly the sections on Global Options and reference sequences in CRAM. htslib also hosts a nice page benchmarking CRAM, with information on the effect of different CRAM options on file size and encoding/decoding performance.
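If you want to convert an existing BAM to CRAM for use with DeepVariant, a standard samtools conversion (file names are placeholders) looks like:

```bash
# Convert BAM to CRAM against the same reference, then index it.
samtools view -C -T reference.fasta -o sample.cram sample.bam
samtools index sample.cram   # writes sample.cram.crai
```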
Here are some basic file size and runtime numbers for running a single `make_examples` job on a 30x whole genome sample in BAM and CRAM format:

| Filetype | Size (GB) | Runtime |
|---|---|---|
| BAM | 66.99 | 79m47.37307s |
| CRAM | 37.85 | 96m53.477s |
| Ratio | 56.50% | 121.43% |

- BAM file: `gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.bam`
- CRAM file: `gs://deepvariant/performance-testdata/HG002_NIST_150bp_downsampled_30x.cram`

Runtime was measured on n1-standard-64 machines.
Based on user feedback (GitHub issue #414), we added samtools and bcftools to our Docker image:

```bash
docker run google/deepvariant:"${BIN_VERSION}" samtools
docker run google/deepvariant:"${BIN_VERSION}" bcftools
```

You can read more about samtools and bcftools here: http://www.htslib.org/doc/.
We report runtime in our case studies documentation. To make sure the reported results are reproducible without too much variation, we provide here the commands we used, showing what kind of machines we ran the case studies on. This is NOT the fastest or cheapest configuration.
We used a 96-core (vCPU) machine with 384GiB of memory and no GPU, on the Google Cloud Platform. Specifying the CPU platform also allows us to report the runtime more consistently.
```bash
gcloud compute instances create "${USER}-cpu" \
  --scopes "compute-rw,storage-full,cloud-platform" \
  --image-family "ubuntu-2204-lts" \
  --image-project "ubuntu-os-cloud" \
  --machine-type "n2-standard-96" \
  --boot-disk-size "300" \
  --zone "us-west1-b"
```
We tested with a P100 GPU, which is currently only available on N1 machines:
```bash
gcloud compute instances create "${USER}-gpu" \
  --scopes "compute-rw,storage-full,cloud-platform" \
  --maintenance-policy "TERMINATE" \
  --accelerator=type=nvidia-tesla-p100,count=1 \
  --image-family "ubuntu-2204-lts" \
  --image-project "ubuntu-os-cloud" \
  --machine-type "n1-standard-16" \
  --boot-disk-size "300" \
  --zone "us-west1-b" \
  --min-cpu-platform "Intel Skylake"
```
NOTE: Be sure to manage instances efficiently. Remember to delete the instances you're not using. You can find the instances at: https://console.cloud.google.com/compute/instances?project=YOUR_PROJECT