Differential expression anlaysis of the TCGA breast cancer set

12. Differential expression anlaysis of the TCGA breast cancer set#

First we retrieve the breast cancer RNAseq data as well as the clinical classification of the sets from cbioportal.org.

The gene expresion data is stored in the DataFrame brca, and the adherent clinical information of the cancers and their patients is stored in the DataFrame brca_clin. It can be woth exploring these data structures.

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    ![ ! -f "dsbook/README.md" ] && git clone https://github.com/statisticalbiotechnology/dsbook.git
    my_path = "dsbook/dsbook/common/"
else:
    my_path = "../common/"
sys.path.append(my_path) # Read local modules for tcga access and qvalue calculations
import load_tcga as tcga
import qvalue 

brca = tcga.get_expression_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_mrna_seq_v2_rsem.txt")
brca_clin = tcga.get_clinical_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_clinical_sample.txt")

Before any further analysis we clean our data. This includes removal of genes where no transcripts were found for any of the samples , i.e. their values are either NaN or zero.

The data is also log transformed. It is generally assumed that expression values follow a log-normal distribution, and hence the log transformation implies that the new values follow a nomal distribution.

brca.dropna(axis=0, how='any', inplace=True)
brca = brca.loc[~(brca<=0.0).any(axis=1)]
brca = pd.DataFrame(data=np.log2(brca),index=brca.index,columns=brca.columns)

We can get an overview of the expression data, i.e differnt characterizations of the tumors from other sources (patient file, histological analysis etc) than the expression data:

brca

and the clinical data:

brca_clin

12.1. Differential expression analysis#

The goal of the excercise is to determine which genes that are differentially expressed in so called tripple negative cancers as compared to other cancers. A breast cancer is triple negative when it does not express either Progesterone receptors, Estrogen receptors or Epidermal growth factor receptor 2. Such cancers are known to behave different than other cancers, and are not amendable to regular hormonal theraphies.

We first create a vector of booleans, that track which cancers that are tripple negative. This will be needed as an input for subsequent significance estimation.

brca_clin.index
brca_clin.loc["3N"]= (brca_clin.loc["PR_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["ER_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["IHC_HER2"]=="Negative")
tripple_negative_bool = (brca_clin.loc["3N"] == True)

Next, for each transcript that has been measured, we calculate (1) log of the average Fold Change difference between tripple negative and other cancers, and (2) the significance of the difference between tripple negative and other cancers.

An easy way to do so is by defining a separate function, get_significance_two_groups(row), that can do such calculations for any row of the brca DataFrame, and subsequently we use the function apply for the function to execute on each row of the DataFrame. For the significance test we use a \(t\) test, which is provided by the function ttest_ind.

This results in a new table with gene names and their \(p\) values of differential concentration, and their fold changes.

def get_significance_two_groups(row):
    log_fold_change = row[tripple_negative_bool].mean() - row[~tripple_negative_bool].mean() # Calculate the log Fold Change
    p = ttest_ind(row[tripple_negative_bool],row[~tripple_negative_bool],equal_var=False)[1] # Calculate the significance
    return [p,-np.log10(p),log_fold_change]

pvalues = brca.apply(get_significance_two_groups,axis=1,result_type="expand")
pvalues.rename(columns = {list(pvalues)[0]: 'p', list(pvalues)[1]: '-log_p', list(pvalues)[2]: 'log_FC'}, inplace = True)

The resulting list can be further investigated.

pvalues

A common way to illustrate the diffrential expression values are by plotting the negative log of the \(p\) values, as a function of the mean fold change of each transcript. This is known as a Volcano plot.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
sns.set_context("talk")
ax = sns.relplot(data=pvalues,x="log_FC",y="-log_p",aspect=1.5,height=6)
ax.set(xlabel="$log_2(TN/not TN)$", ylabel="$-log_{10}(p)$");

The regular interpretation of a Volcano plot is that the ges in the top left and the top right corner are the most interesting ones, as the have a large fold change between the conditions as well as being very significant.