22. Example of ORA and GSEA#

We first run the same steps as in the previous notebook on multiple testing.

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    ![ ! -f "dsbook/README.md" ] && git clone https://github.com/statisticalbiotechnology/dsbook.git
    my_path = "dsbook/dsbook/common/"
else:
    my_path = "../common/"
sys.path.append(my_path) # Read local modules for tcga access and qvalue calculations
import load_tcga as tcga
import qvalue 

brca = tcga.get_expression_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_mrna_seq_v2_rsem.txt")
brca_clin = tcga.get_clinical_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_clinical_sample.txt")
brca.dropna(axis=0, how='any', inplace=True)
brca = brca.loc[~(brca<=0.0).any(axis=1)]
brca = pd.DataFrame(data=np.log2(brca),index=brca.index,columns=brca.columns)
brca_clin.loc["3N"]= (brca_clin.loc["PR_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["ER_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["IHC_HER2"]=="Negative")
tripple_negative_bool = (brca_clin.loc["3N"] == True)

def get_significance_two_groups(row):
    log_fold_change = row[tripple_negative_bool].mean() - row[~tripple_negative_bool].mean()
    p = ttest_ind(row[tripple_negative_bool],row[~tripple_negative_bool],equal_var=False)[1]
    return [p,-np.log10(p),log_fold_change]

pvalues = brca.apply(get_significance_two_groups,axis=1,result_type="expand")
pvalues.rename(columns = {list(pvalues)[0]: 'p', list(pvalues)[1]: '-log_p', list(pvalues)[2]: 'log_FC'}, inplace = True)
qvalues = qvalue.qvalues(pvalues)
File extracted to ../data/brca_tcga_pub2015
File extracted to ../data/brca_tcga_pub2015

If we investigate a Volcano plot of the tripple negative cancers vs. the other cancers, we see an large number of both up and down regulated genes. We will in this note book ecamine if there are common patterns in the up and down regulation.

import matplotlib.pyplot as plt
import seaborn as sns

sns.relplot(data=qvalues,x="log_FC",y="-log_p")
plt.xlabel("$log_2(FC)$")
plt.ylabel("$-log_{10}(p)$")
plt.show()
../_images/e4ea909c902d2d3a0bdf6c225708f4550eabc220e584a9bcc1d983608cd08698.png

22.1. Over-representation analysis#

We use the gseapy module to run an overrepresentation analysis. The module is unfortunately not implementing pathway analysis itself. It instead call a remote webserverEnrichr.

In the analysis here we use the KEGG database’s definition of metabolomic pathways. This choice can easily be changed to other databases such as GO.

Here we select to use the \(q\) values below \(10^{-12}\) as an input for the analysis. First we select this as our gene_list, and then we calculate the overlap of the gene list to all the pathwauys in KEGG.

import gseapy as gp

pathway_db=['KEGG_2019_Human']
background=set(qvalues.index)
gene_list = list(qvalues.loc[qvalues["q"]<1e-12,"q"].index)

output_enrichr=pd.DataFrame()
enr=gp.enrichr(
                gene_list=gene_list,
                gene_sets=pathway_db,
                background=background,
                outdir = None
            )
len(gene_list)
1403

We clean up the results a bit by only keeping some of the resulting metics. We also multiple hypothesis correct our results, and list the terms with a FDR less than 20%.

kegg_enr = enr.results[["P-value","Term"]].rename(columns={"P-value": "p"})
kegg_enr = qvalue.qvalues(kegg_enr)
kegg_enr.loc[kegg_enr["q"]<0.20]
p Term q
0 0.000002 Cell cycle 0.000701
1 0.000265 Hedgehog signaling pathway 0.037638
2 0.001870 Prostate cancer 0.176826

The analysis seem to find overrepresentation of relatively few pathways, particularly given the significance of the differences between case and controll on transcript level.

22.2. Geneset Enrichment analysis#

Subsequently we us pygsea to perform a geneset enricment analysis (GSEA). This time we compare to pathways in GSEAs oncogenic signature database, MSigDB.

classes = ["TrippleNeg" if tripple_negative_bool[sample_name] else "Respond" for sample_name in brca.columns]
gs = gp.GSEA(data=brca, 
                 gene_sets='MSigDB_Oncogenic_Signatures', 
#                 gene_sets='KEGG_2019_Human', 
                 classes=classes, # cls=class_vector
                 permutation_type='phenotype', # null from permutations of class labels
                 permutation_num=2000, # reduce number to speed up test
                 min_size=15, # minimal size of pathway 
                 outdir=None,  # do not write output to disk
                 no_plot=True, # Skip plotting
                 method='signal_to_noise',
                 threads=4, # Number of allowed parallel processes
                 seed=42,
                 format='png',)
                 # ascending=True)
gs.run()
gs_res = gs.res2d
2024-11-19 16:31:57,187 [WARNING] Found duplicated gene names, values averaged by gene names!

We first list the pathways with a FDR < 0.5.

significant = gs_res[gs_res["FDR q-val"]<0.5]
significant
Name Term ES NES NOM p-val FDR q-val FWER p-val Tag % Gene % Lead_genes
0 gsea RPS14 DN.V1 DN -0.506616 -1.809224 0.018145 0.424336 0.1695 69/152 20.65% PSAT1;CLCN4;SUV39H2;CDC20;MCM5;KIF2C;CDCA3;ASN...
1 gsea CSR LATE UP.V1 UP -0.572861 -1.771695 0.014911 0.308419 0.2205 81/138 21.54% LYAR;FANCA;CDCA8;MRAS;SNRPA1;FOXM1;CHEK2;RPF2;...
3 gsea E2F1 UP.V1 UP -0.448257 -1.711008 0.01608 0.354616 0.318 82/167 30.03% NDUFAF4;E2F3;AURKB;PIM1;KIFC1;CHEK1;UCK2;NASP;...
5 gsea RB P107 DN.V1 UP -0.518007 -1.639151 0.052427 0.47894 0.4595 56/122 21.09% CCNE1;STMN1;FANCA;MCM7;MCM5;ORC6;GTSE1;CDC25B;...

The package we are using for accessing GSEA, gseapy, has some built in plotting routine for illustrating the enrichment for any given pathway.

axs = gs.plot(terms=significant.Term[0])
../_images/9f26019e42639cce77f8264e022483d9807530559fee1350b0cac4887c83080a.png

We can also compare the enrichment between multiple pathways

axs = gs.plot(significant.Term, show_ranking=False, legend_kws={'loc': (1.05, 0)}, )
../_images/f244f965c468636cc8ef3cc9201a626d6adfc554909f66754fd6f60ef898c8bd.png

as well as the heatmap (Normalized gene expression as a function of Gene and patient) for the genes in a given pathway.

from gseapy import heatmap
# plotting heatmap
i = 0
genes = gs.res2d.Lead_genes[i].split(";")
# Make sure that ``ofname`` is not None, if you want to save your figure to disk
ax = heatmap(df = gs.heatmat.loc[genes], z_score=0, title=gs_res.Term[i], figsize=(14,4))
../_images/551751fddee4db0198f293895c09897a0b503f49e6ac1d7b6ef5e0ec06583c12.png