8 Principal Component Analysis (PCA)

8.1 PCA plot

Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS) are dimension-reduction techniques that reduce a large set of possibly correlated variables to a smaller set of uncorrelated variables that still contain most of the information in the original set. These new variables are referred to as "principal components" (PCs). By definition, the first PC accounts for the largest share of the variation in the data, and each succeeding PC accounts for as much of the remaining variability as possible.

In RNA-seq and other sequencing technologies, PCA is an efficient visualization tool for quickly identifying the effect of variables (treatment, sex, cell type, etc.) on gene expression (Figure 8.1). Each dot on the PCA/MDS plot corresponds to an individual sample. Hovering the mouse pointer over a dot shows the name of the corresponding sample.

Figure 8.1: rnaseqDRaMA PCA plot, colored by cell type

PCA is also useful for diagnosing possible technical issues such as poor replicate reproducibility. rnaseqDRaMA performs PCA with individual RNA-seq samples acting as variables and genes as cases. You can select the number of genes used for PCA, and the selection criteria for these genes, in the PCA Plot Control box. Within the Controls panel you can also select any pair of principal components to plot. The '% Variance' explained by each component is calculated from the genes you have chosen, not from the whole dataset, and is shown in the axis labels. Under the hood, PCA is performed by the R function prcomp(), which performs singular value decomposition (SVD) of the scaled, log-transformed counts per million (CPMs).
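The calculation described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the rnaseqDRaMA code: the tool itself calls R's prcomp(), whereas the sketch uses power iteration with deflation on the sample correlation matrix as a stand-in for SVD, reads "scaled" as centering and scaling each sample to unit variance, and uses a made-up log-CPM matrix with hypothetical samples A1, A2, B1 and B2.

```python
import math

def standardize_columns(matrix):
    """Center each column (sample) and scale it to unit sample variance."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scaled = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_rows
        sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n_rows - 1))
        for i in range(n_rows):
            scaled[i][j] = (matrix[i][j] - mean) / sd
    return scaled

def correlation_matrix(scaled):
    """Sample-by-sample correlation matrix of standardized columns."""
    n_rows, n_cols = len(scaled), len(scaled[0])
    return [[sum(row[a] * row[b] for row in scaled) / (n_rows - 1)
             for b in range(n_cols)] for a in range(n_cols)]

def top_eigenpairs(symmetric, n_components, iterations=500):
    """Leading eigenpairs by power iteration with deflation (stand-in for SVD)."""
    size = len(symmetric)
    work = [row[:] for row in symmetric]
    pairs = []
    for _component in range(n_components):
        vec = [1.0 / (i + 1) for i in range(size)]  # arbitrary non-zero start
        for _step in range(iterations):
            product = [sum(work[i][j] * vec[j] for j in range(size))
                       for i in range(size)]
            norm = math.sqrt(sum(v * v for v in product))
            if norm == 0.0:
                break
            vec = [v / norm for v in product]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(size) for j in range(size))
        pairs.append((value, vec))
        # Deflate: remove the component just found before searching for the next.
        for i in range(size):
            for j in range(size):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

# Hypothetical log-CPM values: rows are genes (cases), columns are samples
# (variables) in the order A1, A2, B1, B2.
log_cpm = [
    [5.0, 5.2, 1.0, 1.1],
    [4.8, 5.1, 1.2, 0.9],
    [1.0, 0.8, 5.1, 4.9],
    [1.3, 1.1, 4.7, 5.2],
    [3.0, 3.1, 2.9, 3.0],
    [2.0, 2.2, 2.1, 1.9],
]

corr = correlation_matrix(standardize_columns(log_cpm))
pairs = top_eigenpairs(corr, n_components=2)

# Total variance is the trace of the correlation matrix (one unit per sample),
# so each percentage refers only to the genes that were passed in.
total_variance = sum(corr[i][i] for i in range(len(corr)))
percent_variance = [100.0 * value / total_variance for value, _ in pairs]

for index, (_, axis) in enumerate(pairs, start=1):
    coefficients = [round(c, 2) for c in axis]
    print(f"PC{index}: {percent_variance[index - 1]:.1f}% variance, "
          f"per-sample coefficients {coefficients}")
```

Because the percentages are computed from the rows supplied, changing the gene set changes the '% Variance' values, which matches the note above that they describe the chosen genes rather than the whole dataset. The sign of each axis is arbitrary, so only the relative signs of the per-sample coefficients are meaningful.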

The MDS plot has been added for compatibility with edgeR, which produces this plot by default. The code generating the MDS plot is considerably slower, so be patient if you have selected a large number of genes.

Figure 8.2: rnaseqDRaMA MDS plot, colored by cell type

8.2 PCA plot control panel

Analysis Method: Plot type selection -- either PCA or MDS

Gene Selection Method: Criterion used to rank genes -- by variance (tagwise dispersion) or by average log fold change (logFC) across all groups.

Variables to Highlight: Following PCA, samples on the plot can be highlighted based on the available covariates/variables (see the Summary section).

Number of Genes: Select the number of top genes, ranked by the Gene Selection Method, to be used in the PCA/MDS analysis (default: 250). Selecting a small number of genes should enhance the difference between groups but may not give a clear picture of the whole dataset. Selecting a large number of genes might give a better sense of the whole dataset but can mask how important the most influential genes are to the difference between groups.

X-axis Component: and Y-axis Component: Select principal components/dimensions to plot in the PCA/MDS plot (default: 1 and 2). These can be selected with the assistance of the Loadings Plot in order to show the importance and contribution of a specific variable.

Select Point Type: changes the default marker appearance (filled circles) to a variety of alternatives such as triangles, squares, diamonds, etc.
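The interplay of Gene Selection Method and Number of Genes can be illustrated with a minimal sketch. It assumes the plain sample variance of log-CPM values as a stand-in for the tagwise dispersion ranking, and the function name, gene names and values are hypothetical rather than taken from rnaseqDRaMA.

```python
from statistics import variance

def top_variable_genes(log_cpm_by_gene, n_genes):
    """Return the names of the n_genes genes with the largest variance across samples."""
    ranked = sorted(
        log_cpm_by_gene,
        key=lambda gene: variance(log_cpm_by_gene[gene]),
        reverse=True,
    )
    return ranked[:n_genes]

# Hypothetical log-CPM values for four samples (two per group).
log_cpm_by_gene = {
    "geneA": [5.0, 5.2, 1.0, 1.1],  # differs strongly between groups
    "geneB": [3.0, 3.1, 2.9, 3.0],  # nearly constant
    "geneC": [1.0, 0.8, 5.1, 4.9],  # differs strongly between groups
    "geneD": [2.0, 2.6, 2.1, 1.5],  # mild sample-to-sample noise
}

selected = top_variable_genes(log_cpm_by_gene, n_genes=2)
print(selected)
```

With a small n_genes only the genes that differ most between samples are kept, which is why a short list tends to sharpen group separation while a long list also brings in the near-constant and noisy genes.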

8.3 Loadings Plot

PCs are linear combinations of the original variables, weighted by coefficients called loadings that reflect the contribution of each original variable to a PC. Formally, loadings are the correlations between the original variables and the PCs. Since rnaseqDRaMA performs correlation-based PCA, a squared loading is the fraction of an original variable's variance that is accounted for by a given PC. In practical terms, the PCs whose loadings differ most between the groups of the variable chosen in Variables to Highlight account for the variation separating those groups.

For example, in the loadings plot shown in Figure 8.3, the largest loading differences for the variable of interest (M, F, and YAA) are associated with PC4 and PC6. Plotting these PCs indeed shows the separation of the groups, although PC4 and PC6 describe only a small percentage of the variation.

Figure 8.3: Relationship between loadings and original variable
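The definition of loadings as correlations can be checked numerically on a toy case. The sketch below is an illustration only: it uses two hypothetical variables x and y and the closed-form principal axes of a 2x2 correlation matrix, not rnaseqDRaMA's prcomp() output. It shows that, in correlation-based PCA, the squared loadings of each variable add up to 1 across the PCs.

```python
import math
from statistics import mean, stdev

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Two hypothetical, positively correlated variables.
x = [2.0, 4.0, 6.0, 8.0, 10.0]
y = [1.0, 3.0, 2.0, 5.0, 4.0]
zx, zy = standardize(x), standardize(y)
r = pearson(x, y)

# For a 2x2 correlation matrix the principal axes are known in closed form:
# (1, 1)/sqrt(2) with eigenvalue 1 + r and (1, -1)/sqrt(2) with eigenvalue 1 - r.
root2 = math.sqrt(2.0)
pc1 = [(a + b) / root2 for a, b in zip(zx, zy)]
pc2 = [(a - b) / root2 for a, b in zip(zx, zy)]

# Loadings: correlation of each original variable with each PC.
loadings = {
    "x": (pearson(x, pc1), pearson(x, pc2)),
    "y": (pearson(y, pc1), pearson(y, pc2)),
}

for name, (on_pc1, on_pc2) in loadings.items():
    explained = on_pc1 ** 2 + on_pc2 ** 2
    print(f"{name}: loading on PC1 = {on_pc1:.3f}, on PC2 = {on_pc2:.3f}, "
          f"sum of squared loadings = {explained:.3f}")
```

Here both variables load with the same sign on PC1 and with opposite signs on PC2, so PC2 is the component whose loadings separate them, even though it carries the smaller share of the variance. This mirrors the PC4 and PC6 example above.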