Statistical tests and multiple hypothesis tests

Please Do Not Distribute

Dependencies

The following packages must be downloaded from CRAN or Bioconductor. Search these package names for more detailed installation instructions. When you have a problem with compiling any package, try to get the binary versions. R packages on CRAN can be installed with install.packages(). Bioconductor packages are installed by using BiocManager::install():

# install.packages("package_name")
install.packages(c("package_name1","package_name2"))

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Biobase","limma","genefilter","edge","qvalue"))

library(devtools)
library(Biobase)
library(limma)
library(edge)
library(genefilter)
library(qvalue)
library(tidyverse)
library(data.table)

Differential Expression Genes

A common goal in DNA microarray experiments is to detect genes that show differential expression across two or more biological conditions. In this scenario, the “features” are the genes, and they are tested against the null hypothesis that there is no differential gene expression. One of the goals of Hedenfalk et al. was to find genes that are differentially expressed between BRCA1- and BRCA2-mutation-positive tumors by obtaining several microarrays from each cell type. In their analysis they computed a modified F statistic and used it to assign a p value to each gene.

# import data
data(hedenfalk)
stat <- hedenfalk$stat
stat0 <- hedenfalk$stat0 #vector from null distribution
pvalues <- empPvals(stat=stat, stat0=stat0)

# calculate q-values and view results
q.object <- qvalue(pvalues)
summary(q.object)

## 
## Call:
## qvalue(p = pvalues)
## 
## pi0: 0.669926    
## 
## Cumulative number of significant calls:
## 
##           <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1   <1
## p-value       15     76   265    424   605  868 3170
## q-value        0      0     1     73   162  319 3170
## local FDR      0      0     3     30    85  167 2241

hist(q.object)

plot(q.object)

Gene expression from RNA-seq data

This week, we use the RNA-seq data from 2 genetic strains of mouse. Originally, this data/experiment was used to assess the (dis)similarities between RNA-seq and microarrays using straightforward study designs (e.g., no disease or environmental factors). See:

Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays.

C57BL/6J (B6) and DBA/2J (D2) are two of the most commonly used inbred mouse strains in neuroscience research. However, the only currently available mouse genome is based entirely on the B6 strain sequence. Subsequently, oligonucleotide microarray probes are based solely on this B6 reference sequence, making their application for gene expression profiling comparisons across mouse strains dubious due to their allelic sequence differences, including single nucleotide polymorphisms (SNPs). The emergence of next-generation sequencing (NGS) and the RNA-Seq application provides a clear alternative to oligonucleotide arrays for detecting differential gene expression without the problems inherent to hybridization-based technologies. Using RNA-Seq, an average of 22 million short sequencing reads were generated per sample for 21 samples (10 B6 and 11 D2), and these reads were aligned to the mouse reference genome, allowing 16,183 Ensembl genes to be queried in striatum for both strains. To determine differential expression, ‘digital mRNA counting’ is applied based on reads that map to exons. The current study compares RNA-Seq (Illumina GA IIx) with two microarray platforms (Illumina MouseRef-8 v2.0 and Affymetrix MOE 430 2.0) to detect differential striatal gene expression between the B6 and D2 inbred mouse strains. We show that by using stringent data processing requirements differential expression as determined by RNA-Seq is concordant with both the Affymetrix and Illumina platforms in more instances than it is concordant with only a single platform, and that instances of discordance with respect to direction of fold change were rare. Finally, we show that additional information is gained from RNA-Seq compared to hybridization-based techniques as RNA-Seq detects more genes than either microarray platform. The majority of genes differentially expressed in RNA-Seq were only detected as present in RNA-Seq, which is important for studies with smaller effect sizes where the sensitivity of hybridization-based techniques could bias interpretation.

ReCount

We will get the pre-processed data, thanks to the ReCount project. More datasets and explanations are available from ReCount and ReCount2. They have applied consistent processing steps such as genome mapping, alignment, and summarization. Instead of very large original data, we get the data in ExpressionSet that is specifically designed to work in R.

Once downloaded, look at the data. And save the downloaded ExpressionSet for the future use:

con = url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(file=con)
close(con)
ls()

## [1] "bottomly.eset" "con"           "hedenfalk"     "pvalues"      
## [5] "q.object"      "stat"          "stat0"

bottomly.eset

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36536 features, 21 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: SRX033480 SRX033488 ... SRX033494 (21 total)
##   varLabels: sample.id num.tech.reps ... lane.number (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000090268 (36536 total)
##   fvarLabels: gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

class(bottomly.eset)

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

save(bottomly.eset, file="bottomly.Rdata")

ExpressionSet is organized by using related matrices contained as follow:

The phenotype data is about samples:

pdata=pData(bottomly.eset)
dim(pdata)

## [1] 21  5

head(pdata)

##           sample.id num.tech.reps   strain experiment.number lane.number
## SRX033480 SRX033480             1 C57BL/6J                 6           1
## SRX033488 SRX033488             1 C57BL/6J                 7           1
## SRX033481 SRX033481             1 C57BL/6J                 6           2
## SRX033489 SRX033489             1 C57BL/6J                 7           2
## SRX033482 SRX033482             1 C57BL/6J                 6           3
## SRX033490 SRX033490             1 C57BL/6J                 7           3

The gene expression data as we know:

edata=as.matrix(exprs(bottomly.eset))
dim(edata)

## [1] 36536    21

edata[1:5,1:5]

##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0

In the feature data, each row is a gene (e.g., feature), that is annotated or explained:

fdata = fData(bottomly.eset)
dim(fdata)

## [1] 36536     1

head(fdata)

##                                  gene
## ENSMUSG00000000001 ENSMUSG00000000001
## ENSMUSG00000000003 ENSMUSG00000000003
## ENSMUSG00000000028 ENSMUSG00000000028
## ENSMUSG00000000031 ENSMUSG00000000031
## ENSMUSG00000000037 ENSMUSG00000000037
## ENSMUSG00000000049 ENSMUSG00000000049

As the feature data may contain limited information about genes, you should search and look at biological processes and molecular pathways using a variety of web servers. Search GeneCard, Genbank, GO term. There are corresponding APIs for R, such that you can pull down annotations for a list of genes programmatically. Use them to population the feature data.

Log2 transformation

Why do we apply log transformation on gene expression data?

edata_log = log2(edata + 1)
edata_log = edata_log[rowMeans(edata_log) > 10, ]

Conduct t-tests

Let’s test if the means of gene expression are different according to two groups (i.e., two strains). There are many different way to compute this in R depending on data structures. The most basic one:

tgene1 <- t.test(edata[1, pdata$strain == "C57BL/6J"],edata[1, pdata$strain == "DBA/2J"],
       var.equal = TRUE)
print(tgene1)

## 
##  Two Sample t-test
## 
## data:  edata[1, pdata$strain == "C57BL/6J"] and edata[1, pdata$strain == "DBA/2J"]
## t = -0.28241, df = 19, p-value = 0.7807
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -220.3735  167.9735
## sample estimates:
## mean of x mean of y 
##     512.8     539.0

You can use this function in a for or while loops, as well as R-specific apply to apply row-by-row.

tout_apply <- apply(edata, 1, function(x) {
        t.test(x[pdata$strain == "C57BL/6J"],x[pdata$strain == "DBA/2J"], var.equal = TRUE)$p.value
      })

There are many useful functions that automate or speed up statistical tests in large data. The package genefilter provides:

tout = rowttests(x = edata, fac = as.factor(pdata$strain))
# are these p-values really identical? 
all(tout$p.value == tout_apply)

## [1] FALSE

# did i mis-calculate? check!
plot(tout$p.value, tout_apply)

sum(tout$p.value - tout_apply)

## [1] NaN

Visualize statistics and p-values

We transform what we have so far, tout, into a tidy format. This allow us to quickly visualize (e.g.,) as histograms:

ttidy <- gather(tout)
ggplot(ttidy) + geom_histogram(bins = 30,aes(x=value)) + facet_wrap(~ key, scales="free")

## Warning: Removed 45208 rows containing non-finite values (stat_bin).

Please use these as a starting point to learn more about ggplot2.

Adjusting p-values

There are several methods that output adjusted p-values according to multiple comparisons. Look at ? p.adjust. One of the most popular and also the most conservative approaches is a Bonferroni correction:

tout$p.adjust <- p.adjust(tout$p.value, method="bonferroni")
ttidy <- gather(tout[,c("p.value","p.adjust")])
ggplot(ttidy) + geom_histogram(bins = 30,aes(x=value)) + facet_wrap(~ key, scales="free")

## Warning: Removed 45208 rows containing non-finite values (stat_bin).

The Bonferroni correction attempts to control the FWER which is very strigent. We may want to relax this criteria.

Calculating q-value to control false discovery rates

Download, install, an use the qvalue Bioconductor library. Also the original paper and the library vignette are helpful:

Statistical significance for genome-wide studies. https://www.pnas.org/content/100/16/9440.full

With the increase in genomewide experiments and the sequencing of multiple genomes, the analysis of large data sets has become com- monplace in biology. It is often the case that thousands of features in a genomewide data set are tested against some null hypothesis, where a number of features are expected to be significant. Here we propose an approach to measuring statistical significance in these genomewide studies based on the concept of the false discovery rate. This approach offers a sensible balance between the number of true and false positives that is automatically calibrated and easily inter- preted. In doing so, a measure of statistical significance called the q value is associated with each tested feature. The q value is similar to the well known p value, except it is a measure of significance in terms of the false discovery rate rather than the false positive rate. Our approach avoids a flood of false positive results, while offering a more liberal criterion than what has been used in genome scans for linkage

Using qvalue, you can automatically estimate pi0 and q-values for each gene:

q.obj <- qvalue(tout$p.value)
plot(q.obj)

# look at the estimate of m0 (proportion of null variables)
q.obj$pi0

## [1] 0.7387381

We organize the qvalues into our previous data frame. Then transform into a tidy format and visualize using ggplot2:

tout$q.value <- q.obj$qvalues
ttidy <- gather(tout[,c("p.value","q.value")])
ggplot(ttidy) + geom_histogram(bins = 30,aes(x=value)) + facet_wrap(~ key, scales="free")

## Warning: Removed 45208 rows containing non-finite values (stat_bin).

Let’s order this data frame by p-values so we see the top 10 genes. Note their p-values and q-values:

tout.order <- tout[order(tout$p.value),]
tout.order[1:10,]

##                     statistic         dm      p.value     p.adjust      q.value
## ENSMUSG00000041748 -11.772174 -23.090909 3.578885e-10 4.986102e-06 3.532856e-06
## ENSMUSG00000043719 -11.322898 -10.918182 6.865182e-10 9.564571e-06 3.532856e-06
## ENSMUSG00000050141  10.806512  42.809091 1.487561e-09 2.072470e-05 5.103376e-06
## ENSMUSG00000035775   9.912957  53.254545 6.057808e-09 8.439739e-05 1.558689e-05
## ENSMUSG00000029298   9.659477   7.663636 9.169062e-09 1.277434e-04 1.653185e-05
## ENSMUSG00000020912   9.629301  78.336364 9.637597e-09 1.342710e-04 1.653185e-05
## ENSMUSG00000028009  -9.129021  -6.181818 2.236712e-08 3.116187e-04 3.288637e-05
## ENSMUSG00000044678  -8.997073  -2.636364 2.806929e-08 3.910613e-04 3.611149e-05
## ENSMUSG00000027855  -8.860112 -24.363636 3.561099e-08 4.961324e-04 4.072354e-05
## ENSMUSG00000078633   8.735904   9.954545 4.427838e-08 6.168864e-04 4.557175e-05

Modeling the lanes and experiments using limma

We would like to use the meta information (in pdata) to adjust for technical variations that are likely contaminating the expression data. Then, the R package limma makes the computation of P-values and statistics easy for us. Particularly, we get to acccess its moderated t-statistics and moderated F-statistics, in which residual variances are shrunken towards the mean. This empirical Bayes method is more powerful than univariate tests because it borrows information across the whole dataset.

mod = model.matrix(~ pdata$strain + pdata$lane.number + pdata$experiment.number)
fit_limma = lmFit(edata,mod)
ebayes_limma = eBayes(fit_limma)

## Warning in fitFDist(var, df1 = df, covariate = covariate): More than half of
## residual variances are exactly zero: eBayes unreliable

limma_pvals = topTable(ebayes_limma,number=dim(edata)[1])$P.Value

## Removing intercept from test coefficients

hist(limma_pvals,col=4)

Why shrinkage?

They use an empirical Bayes method to squeeze the genewise-wise residual variances towards a common value (or towards a global trend) (Smyth, 2004; Phipson et al, 2016). The degrees of freedom for the individual variances are increased to reflect the extra information gained from the empirical Bayes moderation, resulting in increased statistical power to detect differential expression. The empirical Bayes moderated t-statistics test each individual contrast equal to zero. For each gene (row), the moderated F-statistic tests whether all the contrasts are zero. The F-statistic is an overall test computed from the set of t-statistics for that probe. This is exactly analogous the relationship between t-tests and F-statistics in conventional anova, except that the residual mean squares have been moderated between genes.

#  Ordinary t-statistic
ordinary.t <- ebayes_limma$coef[,2] / ebayes_limma$stdev.unscaled[,2] / ebayes_limma$sigma
plot(ordinary.t, ebayes_limma$coefficients[,2],pch=20); abline(0,1,col="red")

We computed p-values, adjusted p-values (by the Bonferroni correction), and q-values. Put them altogether and visualize them at once.

limma_out <- data.table(pvalue = ebayes_limma$p.value[,2], coeff = ebayes_limma$coefficients[,2],
                        p.bonferroni = p.adjust(ebayes_limma$p.value[,2], method="bonferroni"),
                        qvalue = qvalue(ebayes_limma$p.value[,2])$qvalues)
limma_tidy <- gather(limma_out)
ggplot(limma_tidy) + geom_histogram(bins = 30,aes(x=value)) + facet_wrap(~ key, scales="free")