Please Do Not Distribute

Dependencies

The following packages must be downloaded from CRAN or Bioconductor. Search these package names for more detailed installation instructions. When you have a problem with compiling any package, try to get the binary versions. R packages on CRAN can be installed with install.packages(). Bioconductor packages are installed by using BiocManager::install():

# install.packages("package_name")
install.packages(c("package_name1","package_name2"))

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("Biobase","limma","genefilter","edge","qvalue"))
library(devtools)
library(Biobase)
library(limma)
library(edge)
library(genefilter)
library(qvalue)
library(tidyverse)
library(data.table)

Differential Expression Genes

A common goal in DNA microarray experiments is to detect genes that show differential expression across two or more biological conditions. In this scenario, the “features” are the genes, and they are tested against the null hypothesis that there is no differential gene expression. One of the goals of Hedenfalk et al. was to find genes that are differentially expressed between BRCA1- and BRCA2-mutation-positive tumors by obtaining several microarrays from each cell type. In their analysis they computed a modified F statistic and used it to assign a p value to each gene.

# import data
data(hedenfalk)
stat <- hedenfalk$stat
stat0 <- hedenfalk$stat0 #vector from null distribution
pvalues <- empPvals(stat=stat, stat0=stat0)

# calculate q-values and view results
q.object <- qvalue(pvalues)
summary(q.object)
## 
## Call:
## qvalue(p = pvalues)
## 
## pi0: 0.669926    
## 
## Cumulative number of significant calls:
## 
##           <1e-04 <0.001 <0.01 <0.025 <0.05 <0.1   <1
## p-value       15     76   265    424   605  868 3170
## q-value        0      0     1     73   162  319 3170
## local FDR      0      0     3     30    85  167 2241
hist(q.object)

plot(q.object)

Gene expression from RNA-seq data

This week, we use the RNA-seq data from 2 genetic strains of mouse. Originally, this data/experiment was used to assess the (dis)similarities between RNA-seq and microarrays using straightforward study designs (e.g., no disease or environmental factors). See:

Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays.

C57BL/6J (B6) and DBA/2J (D2) are two of the most commonly used inbred mouse strains in neuroscience research. However, the only currently available mouse genome is based entirely on the B6 strain sequence. Subsequently, oligonucleotide microarray probes are based solely on this B6 reference sequence, making their application for gene expression profiling comparisons across mouse strains dubious due to their allelic sequence differences, including single nucleotide polymorphisms (SNPs). The emergence of next-generation sequencing (NGS) and the RNA-Seq application provides a clear alternative to oligonucleotide arrays for detecting differential gene expression without the problems inherent to hybridization-based technologies. Using RNA-Seq, an average of 22 million short sequencing reads were generated per sample for 21 samples (10 B6 and 11 D2), and these reads were aligned to the mouse reference genome, allowing 16,183 Ensembl genes to be queried in striatum for both strains. To determine differential expression, ‘digital mRNA counting’ is applied based on reads that map to exons. The current study compares RNA-Seq (Illumina GA IIx) with two microarray platforms (Illumina MouseRef-8 v2.0 and Affymetrix MOE 430 2.0) to detect differential striatal gene expression between the B6 and D2 inbred mouse strains. We show that by using stringent data processing requirements differential expression as determined by RNA-Seq is concordant with both the Affymetrix and Illumina platforms in more instances than it is concordant with only a single platform, and that instances of discordance with respect to direction of fold change were rare. Finally, we show that additional information is gained from RNA-Seq compared to hybridization-based techniques as RNA-Seq detects more genes than either microarray platform. The majority of genes differentially expressed in RNA-Seq were only detected as present in RNA-Seq, which is important for studies with smaller effect sizes where the sensitivity of hybridization-based techniques could bias interpretation.

ReCount

We will get the pre-processed data, thanks to the ReCount project. More datasets and explanations are available from ReCount and ReCount2. They have applied consistent processing steps such as genome mapping, alignment, and summarization. Instead of very large original data, we get the data in ExpressionSet that is specifically designed to work in R.

Once downloaded, look at the data. And save the downloaded ExpressionSet for the future use:

con = url("http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bottomly_eset.RData")
load(file=con)
close(con)
ls()
## [1] "bottomly.eset" "con"           "hedenfalk"     "pvalues"      
## [5] "q.object"      "stat"          "stat0"
bottomly.eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36536 features, 21 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: SRX033480 SRX033488 ... SRX033494 (21 total)
##   varLabels: sample.id num.tech.reps ... lane.number (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000090268 (36536 total)
##   fvarLabels: gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
class(bottomly.eset)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
save(bottomly.eset, file="bottomly.Rdata")

ExpressionSet is organized by using related matrices contained as follow:

The phenotype data is about samples:

pdata=pData(bottomly.eset)
dim(pdata)
## [1] 21  5
head(pdata)
##           sample.id num.tech.reps   strain experiment.number lane.number
## SRX033480 SRX033480             1 C57BL/6J                 6           1
## SRX033488 SRX033488             1 C57BL/6J                 7           1
## SRX033481 SRX033481             1 C57BL/6J                 6           2
## SRX033489 SRX033489             1 C57BL/6J                 7           2
## SRX033482 SRX033482             1 C57BL/6J                 6           3
## SRX033490 SRX033490             1 C57BL/6J                 7           3

The gene expression data as we know:

edata=as.matrix(exprs(bottomly.eset))
dim(edata)
## [1] 36536    21
edata[1:5,1:5]
##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0

In the feature data, each row is a gene (e.g., feature), that is annotated or explained:

fdata = fData(bottomly.eset)
dim(fdata)
## [1] 36536     1
head(fdata)
##                                  gene
## ENSMUSG00000000001 ENSMUSG00000000001
## ENSMUSG00000000003 ENSMUSG00000000003
## ENSMUSG00000000028 ENSMUSG00000000028
## ENSMUSG00000000031 ENSMUSG00000000031
## ENSMUSG00000000037 ENSMUSG00000000037
## ENSMUSG00000000049 ENSMUSG00000000049

As the feature data may contain limited information about genes, you should search and look at biological processes and molecular pathways using a variety of web servers. Search GeneCard, Genbank, GO term. There are corresponding APIs for R, such that you can pull down annotations for a list of genes programmatically. Use them to population the feature data.

Log2 transformation

Why do we apply log transformation on gene expression data?

edata_log = log2(edata + 1)
edata_log = edata_log[rowMeans(edata_log) > 10, ]

Conduct t-tests

Let’s test if the means of gene expression are different according to two groups (i.e., two strains). There are many different way to compute this in R depending on data structures. The most basic one:

tgene1 <- t.test(edata[1, pdata$strain == "C57BL/6J"],edata[1, pdata$strain == "DBA/2J"],
       var.equal = TRUE)
print(tgene1)
## 
##  Two Sample t-test
## 
## data:  edata[1, pdata$strain == "C57BL/6J"] and edata[1, pdata$strain == "DBA/2J"]
## t = -0.28241, df = 19, p-value = 0.7807
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -220.3735  167.9735
## sample estimates:
## mean of x mean of y 
##     512.8     539.0

You can use this function in a for or while loops, as well as R-specific apply to apply row-by-row.

tout_apply <- apply(edata, 1, function(x) {
        t.test(x[pdata$strain == "C57BL/6J"],x[pdata$strain == "DBA/2J"], var.equal = TRUE)$p.value
      })

There are many useful functions that automate or speed up statistical tests in large data. The package genefilter provides:

tout = rowttests(x = edata, fac = as.factor(pdata$strain))
# are these p-values really identical? 
all(tout$p.value == tout_apply)
## [1] FALSE
# did i mis-calculate? check!
plot(tout$p.value, tout_apply)