Differential expression anlaysis of the TCGA breast cancer set

12. Differential expression anlaysis of the TCGA breast cancer set#

First we retrieve the breast cancer RNAseq data as well as the clinical classification of the sets from cbioportal.org.

The gene expresion data is stored in the DataFrame brca, and the adherent clinical information of the cancers and their patients is stored in the DataFrame brca_clin. It can be woth exploring these data structures.

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    ![ ! -f "dsbook/README.md" ] && git clone https://github.com/statisticalbiotechnology/dsbook.git
    my_path = "dsbook/dsbook/common/"
else:
    my_path = "../common/"
sys.path.append(my_path) # Read local modules for tcga access and qvalue calculations
import load_tcga as tcga
import qvalue 

brca = tcga.get_expression_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_mrna_seq_v2_rsem.txt")
brca_clin = tcga.get_clinical_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_clinical_sample.txt")
File extracted to ../data/brca_tcga_pub2015
File extracted to ../data/brca_tcga_pub2015

Before any further analysis we clean our data. This includes removal of genes where no transcripts were found for any of the samples , i.e. their values are either NaN or zero.

The data is also log transformed. It is generally assumed that expression values follow a log-normal distribution, and hence the log transformation implies that the new values follow a nomal distribution.

brca.dropna(axis=0, how='any', inplace=True)
brca = brca.loc[~(brca<=0.0).any(axis=1)]
brca = pd.DataFrame(data=np.log2(brca),index=brca.index,columns=brca.columns)

We can get an overview of the expression data, i.e differnt characterizations of the tumors from other sources (patient file, histological analysis etc) than the expression data:

brca
TCGA-A1-A0SB-01 TCGA-A1-A0SD-01 TCGA-A1-A0SE-01 TCGA-A1-A0SF-01 TCGA-A1-A0SH-01 TCGA-A1-A0SI-01 TCGA-A1-A0SJ-01 TCGA-A1-A0SK-01 TCGA-A1-A0SM-01 TCGA-A1-A0SN-01 ... TCGA-LL-A5YM-01 TCGA-LL-A5YN-01 TCGA-LL-A5YO-01 TCGA-LL-A5YP-01 TCGA-LQ-A4E4-01 TCGA-MS-A51U-01 TCGA-OL-A66H-01 TCGA-OL-A66I-01 TCGA-OL-A66J-01 TCGA-OL-A66K-01
Hugo_Symbol
HMGB1P1 6.862786 5.913201 7.258756 7.141528 6.318322 5.989800 7.075461 10.128243 6.900577 6.571219 ... 6.662298 6.648297 8.121858 7.689562 6.870735 7.235377 6.849722 7.307063 6.961539 6.732642
LOC155060 8.128052 6.387132 6.223071 8.296679 6.580528 7.226836 5.992061 7.381276 6.336428 6.760258 ... 7.032748 8.520790 7.373054 6.700139 8.263093 9.077958 8.401726 7.719361 8.335920 9.115184
HSPB1P1 5.449515 7.826203 8.227829 8.110198 6.986008 7.996865 9.313161 6.095703 7.521303 8.912048 ... 11.926292 9.389839 8.576146 8.456419 7.290927 7.765053 8.651058 7.933979 10.153871 10.070840
GTPBP6 8.616132 8.083619 8.895371 9.538594 8.465846 8.730170 8.611165 8.370326 8.954813 9.254954 ... 10.601717 10.453578 10.685847 10.133098 8.878949 9.282737 9.881552 8.969242 9.087300 9.781890
A1BG 5.620563 7.152768 7.591106 8.348814 7.494444 7.005119 7.825150 6.437021 8.264413 6.452506 ... 10.557569 8.821466 8.207272 7.087837 8.962278 8.502185 5.679393 7.695365 8.060956 8.721849
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
ZYG11B 10.414229 9.761690 9.907780 9.783369 10.402201 9.932282 10.093600 10.431197 10.461518 9.406503 ... 8.029939 8.735835 9.524218 9.022067 9.711410 9.745021 9.308973 8.904128 9.416144 9.477514
ZYX 12.594962 11.797529 11.554500 12.383563 12.415996 12.140607 11.566101 9.207597 12.158002 11.505789 ... 12.404470 12.495560 12.714737 13.491160 11.410198 12.267554 12.023571 12.692872 12.348940 12.276578
ZZEF1 10.915356 10.320764 9.855441 9.867816 10.358420 9.980989 9.904470 10.812922 10.207563 10.812135 ... 8.524975 9.913596 10.200806 10.172626 10.285172 11.111679 10.289877 9.342848 10.460192 10.005446
ZZZ3 10.488039 10.223519 10.070714 8.988931 9.967392 9.905970 9.489432 10.383320 10.506666 9.533681 ... 7.236577 8.989705 9.244033 8.948726 10.094338 9.378562 9.594594 9.608044 9.478536 9.741248
TPTEP1 9.109346 6.426506 9.758466 5.707779 8.832177 4.935752 5.443935 5.628774 10.191686 4.734417 ... 4.934016 4.107805 7.694024 6.203377 4.696745 7.965032 2.545524 5.559400 9.148153 5.221173

12928 rows × 817 columns

and the clinical data:

brca_clin
SAMPLE_ID TCGA-A1-A0SB-01 TCGA-A1-A0SD-01 TCGA-A1-A0SE-01 TCGA-A1-A0SF-01 TCGA-A1-A0SH-01 TCGA-A1-A0SI-01 TCGA-A1-A0SJ-01 TCGA-A1-A0SK-01 TCGA-A1-A0SM-01 TCGA-A1-A0SN-01 ... TCGA-LL-A5YM-01 TCGA-LL-A5YN-01 TCGA-LL-A5YO-01 TCGA-LL-A5YP-01 TCGA-LQ-A4E4-01 TCGA-MS-A51U-01 TCGA-OL-A66H-01 TCGA-OL-A66I-01 TCGA-OL-A66J-01 TCGA-OL-A66K-01
PATIENT_ID TCGA-A1-A0SB TCGA-A1-A0SD TCGA-A1-A0SE TCGA-A1-A0SF TCGA-A1-A0SH TCGA-A1-A0SI TCGA-A1-A0SJ TCGA-A1-A0SK TCGA-A1-A0SM TCGA-A1-A0SN ... TCGA-LL-A5YM TCGA-LL-A5YN TCGA-LL-A5YO TCGA-LL-A5YP TCGA-LQ-A4E4 TCGA-MS-A51U TCGA-OL-A66H TCGA-OL-A66I TCGA-OL-A66J TCGA-OL-A66K
OTHER_SAMPLE_ID f66358d4-9319-4c23-983d-ab5056daacf1 81f193d9-ad19-43c2-b89f-54003a8d133e 2fd1998c-1ea4-42eb-84b9-db352da9cf25 f4d51ca6-89a6-4571-ae8d-0fad7dde4452 08044da0-8cb7-4bf7-bbb6-b80be4fddd8c c5181ae5-bd05-459f-920d-26bd31bd9088 9ec60572-39e4-4a0d-b157-cf0ec726fb1d dbd8fda0-bdf2-40f0-80b6-daef5ebf9c97 289f5433-4834-401f-9361-efc8e883e630 62d2a36e-feb1-4084-8111-7ea38a4e3049 ... CF5FDA71-8FBA-43FC-91F3-E4BAA36571DD 97AD1ECF-7643-461F-B9A2-1AC953825895 2B7230C7-0182-4417-9861-03672629D98D 3DC0BC4F-6863-4297-8915-F29E1AA25D26 8C8D4BD4-3AA9-4AD1-9715-2376EE540A0C 764A5732-0B44-4CBE-93F2-7B84FD7D81A1 31E39E5F-AF9C-4F34-B065-8363CF9DF39C 2B8C960A-3E9A-49BF-8329-7D322DA90326 1889A2AF-AE25-41EC-8A6D-0EC5C5B852CB 796B90A8-99D6-4102-81A0-96C6954E9F26
DAYS_TO_COLLECTION 764 1697 1672 1648 1305 1267 1399 1164 1115 1091 ... 131 41 47 29 414 129 441 1715 1645 981
IS_FFPE NO NO NO NO NO NO NO NO NO NO ... NO NO NO NO NO NO NO NO NO NO
OCT_EMBEDDED TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ... FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
CANCER_TYPE_DETAILED Invasive Breast Carcinoma Breast Invasive Ductal Carcinoma Breast Invasive Lobular Carcinoma Breast Invasive Ductal Carcinoma Breast Mixed Ductal and Lobular Carcinoma Breast Invasive Ductal Carcinoma Breast Invasive Ductal Carcinoma Invasive Breast Carcinoma Breast Invasive Ductal Carcinoma Breast Invasive Ductal Carcinoma ... Breast Invasive Ductal Carcinoma Breast Invasive Ductal Carcinoma Breast Invasive Ductal Carcinoma Breast Invasive Ductal Carcinoma Breast Invasive Lobular Carcinoma Breast Invasive Lobular Carcinoma Invasive Breast Carcinoma Breast Invasive Ductal Carcinoma Breast Mixed Ductal and Lobular Carcinoma Breast Invasive Lobular Carcinoma
ONCOTREE_CODE BRCA IDC ILC IDC MDLC IDC IDC BRCA IDC IDC ... IDC IDC IDC IDC ILC ILC BRCA IDC MDLC ILC
SAMPLE_TYPE Primary Primary Primary Primary Primary Primary Primary Primary Primary Primary ... Primary Primary Primary Primary Primary Primary Primary Primary Primary Primary
SOMATIC_STATUS Matched Matched Matched Matched Matched Matched Matched Matched Matched Matched ... Matched Matched Matched Matched Matched Matched Matched Matched Matched Matched
TMB_NONSYNONYMOUS 0.6 1.0 0.7 1.266667 2.6 5.566667 1.0 3.366667 0.733333 1.533333 ... 0.766667 0.866667 1.5 2.133333 2.3 0.433333 0.7 0.966667 1.733333 0.833333

62 rows × 817 columns

12.1. Differential expression analysis#

The goal of the excercise is to determine which genes that are differentially expressed in so called tripple negative cancers as compared to other cancers. A breast cancer is triple negative when it does not express either Progesterone receptors, Estrogen receptors or Epidermal growth factor receptor 2. Such cancers are known to behave different than other cancers, and are not amendable to regular hormonal theraphies.

We first create a vector of booleans, that track which cancers that are tripple negative. This will be needed as an input for subsequent significance estimation.

brca_clin.index
Index(['PATIENT_ID', 'OTHER_SAMPLE_ID', 'DAYS_TO_COLLECTION', 'IS_FFPE',
       'OCT_EMBEDDED', 'PATHOLOGY_REPORT_FILE_NAME',
       'SURGICAL_PROCEDURE_FIRST', 'FIRST_SURGICAL_PROCEDURE_OTHER',
       'SURGERY_FOR_POSITIVE_MARGINS', 'SURGERY_FOR_POSITIVE_MARGINS_OTHER',
       'MARGIN_STATUS_REEXCISION', 'STAGING_SYSTEM_OTHER',
       'MICROMET_DETECTION_BY_IHC', 'METASTATIC_SITE', 'METASTATIC_SITE_OTHER',
       'ER_STATUS_BY_IHC', 'ER_STATUS_IHC_PERCENT_POSITIVE',
       'ER_POSITIVITY_SCALE_USED', 'IHC_SCORE', 'ER_POSITIVITY_SCALE_OTHER',
       'PR_STATUS_BY_IHC', 'PR_STATUS_IHC_PERCENT_POSITIVE',
       'PR_POSITIVITY_SCALE_USED', 'PR_POSITIVITY_IHC_INTENSITY_SCORE',
       'PR_POSITIVITY_SCALE_OTHER', 'PR_POSITIVITY_DEFINE_METHOD', 'IHC_HER2',
       'HER2_IHC_PERCENT_POSITIVE', 'HER2_IHC_SCORE',
       'HER2_POSITIVITY_SCALE_OTHER', 'HER2_POSITIVITY_METHOD_TEXT',
       'HER2_FISH_STATUS', 'HER2_COPY_NUMBER', 'CENT17_COPY_NUMBER',
       'HER2_AND_CENT17_CELLS_COUNT', 'HER2_CENT17_RATIO',
       'HER2_AND_CENT17_SCALE_OTHER', 'HER2_FISH_METHOD',
       'NEW_TUMOR_EVENT_AFTER_INITIAL_TREATMENT', 'NTE_ER_STATUS',
       'NTE_ER_STATUS_IHC__POSITIVE', 'NTE_ER_IHC_INTENSITY_SCORE',
       'NTE_PR_STATUS_BY_IHC', 'NTE_PR_STATUS_IHC__POSITIVE',
       'NTE_PR_IHC_INTENSITY_SCORE', 'NTE_HER2_STATUS',
       'NTE_HER2_STATUS_IHC__POSITIVE', 'NTE_HER2_POSITIVITY_IHC_SCORE',
       'NTE_HER2_FISH_STATUS', 'NTE_CENT_17_HER2_RATIO', 'PRIMARY_SITE',
       'DISEASE_CODE', 'METASTATIC_TUMOR_INDICATOR', 'PROJECT_CODE',
       'TISSUE_SOURCE_SITE', 'TUMOR_TISSUE_SITE', 'CANCER_TYPE',
       'CANCER_TYPE_DETAILED', 'ONCOTREE_CODE', 'SAMPLE_TYPE',
       'SOMATIC_STATUS', 'TMB_NONSYNONYMOUS'],
      dtype='object')
brca_clin.loc["3N"]= (brca_clin.loc["PR_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["ER_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["IHC_HER2"]=="Negative")
tripple_negative_bool = (brca_clin.loc["3N"] == True)

Next, for each transcript that has been measured, we calculate (1) log of the average Fold Change difference between tripple negative and other cancers, and (2) the significance of the difference between tripple negative and other cancers.

An easy way to do so is by defining a separate function, get_significance_two_groups(row), that can do such calculations for any row of the brca DataFrame, and subsequently we use the function apply for the function to execute on each row of the DataFrame. For the significance test we use a \(t\) test, which is provided by the function ttest_ind.

This results in a new table with gene names and their \(p\) values of differential concentration, and their fold changes.

def get_significance_two_groups(row):
    log_fold_change = row[tripple_negative_bool].mean() - row[~tripple_negative_bool].mean() # Calculate the log Fold Change
    p = ttest_ind(row[tripple_negative_bool],row[~tripple_negative_bool],equal_var=False)[1] # Calculate the significance
    return [p,-np.log10(p),log_fold_change]

pvalues = brca.apply(get_significance_two_groups,axis=1,result_type="expand")
pvalues.rename(columns = {list(pvalues)[0]: 'p', list(pvalues)[1]: '-log_p', list(pvalues)[2]: 'log_FC'}, inplace = True)

The resulting list can be further investigated.

pvalues
p -log_p log_FC
Hugo_Symbol
HMGB1P1 4.916745e-08 7.308322 0.475075
LOC155060 6.431571e-02 1.191683 -0.193073
HSPB1P1 7.853037e-08 7.104962 -0.867070
GTPBP6 5.730296e-01 0.241823 0.059863
A1BG 2.801925e-05 4.552543 -0.628046
... ... ... ...
ZYG11B 1.125863e-01 0.948514 0.087070
ZYX 2.382079e-05 4.623044 0.383241
ZZEF1 7.551192e-03 2.121985 -0.178845
ZZZ3 3.317690e-02 1.479164 0.152268
TPTEP1 5.024608e-01 0.298898 0.175348

12928 rows × 3 columns

A common way to illustrate the diffrential expression values are by plotting the negative log of the \(p\) values, as a function of the mean fold change of each transcript. This is known as a Volcano plot.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
sns.set_context("talk")
ax = sns.relplot(data=pvalues,x="log_FC",y="-log_p",aspect=1.5,height=6)
ax.set(xlabel="$log_2(TN/not TN)$", ylabel="$-log_{10}(p)$");
../_images/a88463fae9b8075ad8980a45fe61767c181339bfbd63b0eedda0aa9836714702.png

The regular interpretation of a Volcano plot is that the ges in the top left and the top right corner are the most interesting ones, as the have a large fold change between the conditions as well as being very significant.