14. \(q\) Value calculations in breast cancer set#

14.1. Differential expression analysis with multiple testing#

This notebook continues from where the previous notebook on hypothesis testing ended.

We would like to compare so called tripple negative cancers with other cancers. A breast cancer is triple negative when it does not express either Progesterone receptors, Estrogen receptors or Epidermal growth factor receptor 2. Such cancers are known to behave different than other cancers, and are not amendable to regular hormonal theraphies in the TCGA breast cancer data set.

We first recreate the steps of the previous notebook.

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
    ![ ! -f "dsbook/README.md" ] && git clone https://github.com/statisticalbiotechnology/dsbook.git
    my_path = "dsbook/dsbook/common/"
else:
    my_path = "../common/"
sys.path.append(my_path) # Read local modules for tcga access and qvalue calculations
import load_tcga as tcga
import qvalue 

brca = tcga.get_expression_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_mrna_seq_v2_rsem.txt")
brca_clin = tcga.get_clinical_data(my_path + "../data/brca_tcga_pub2015.tar.gz", 'https://cbioportal-datahub.s3.amazonaws.com/brca_tcga_pub2015.tar.gz',"data_clinical_sample.txt")

brca.dropna(axis=0, how='any', inplace=True)
brca = brca.loc[~(brca<=0.0).any(axis=1)]
brca = pd.DataFrame(data=np.log2(brca),index=brca.index,columns=brca.columns)
brca_clin.loc["3N"]= (brca_clin.loc["PR_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["ER_STATUS_BY_IHC"]=="Negative") & (brca_clin.loc["IHC_HER2"]=="Negative")
tripple_negative_bool = (brca_clin.loc["3N"] == True)
def get_significance_two_groups(row):
    log_fold_change = row[tripple_negative_bool].mean() - row[~tripple_negative_bool].mean() # Calculate the log Fold Change
    p = ttest_ind(row[tripple_negative_bool],row[~tripple_negative_bool],equal_var=False)[1] # Calculate the significance
    return [p,-np.log10(p),log_fold_change]

pvalues = brca.apply(get_significance_two_groups,axis=1,result_type="expand")
pvalues.rename(columns = {list(pvalues)[0]: 'p', list(pvalues)[1]: '-log_p', list(pvalues)[2]: 'log_FC'}, inplace = True)
pvalues = pvalues.loc[~pvalues.index.duplicated(keep='first')]
File extracted to ../data/brca_tcga_pub2015
File extracted to ../data/brca_tcga_pub2015

When plotting the \(p\) value distribution below, we see an enrichment of low p values. These are the tests of the genes that adhere to the alternative hypothesis. We also see a uniform distribution of the \(p\) values in the higher end of the distribution i.e. \(p\) values of 0.3-1.0. These are likely steming from genes adhering to \(H_0\)

import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
sns.histplot(x=pvalues["p"],kde=False)
plt.xlim(0,1.0);
../_images/fa72a8357914d25760a26a5f46139933ff1faa2ab438cc5a06e9b957f38abae3.png

14.1.1. \(q\) value esitmation#

We define a function for the palculation of \(\pi_0\). Here we use a different method than the one described in Storey&Tibshirani. The details of this method, known as the bootstrap method, are given in Storey 2002

import numpy as np
import numpy.random as npr

def bootstrap(invec):
    idx = npr.randint(0, len(invec), len(invec))
    return [invec[i] for i in idx]

def estimatePi0(p, numBoot=100, numLambda=100, maxLambda=0.95):
    p.sort()
    n=len(p)
    lambdas=np.linspace(maxLambda/numLambda,maxLambda,numLambda)
    Wls=np.array([n-np.argmax(p>=l) for l in lambdas])
    pi0s=np.array([Wls[i] / (n * (1 - lambdas[i])) for i in range(numLambda)])
    minPi0=np.min(pi0s)
    mse = np.zeros(numLambda)
    for boot in range(numBoot):
        pBoot = bootstrap(p)
        pBoot.sort()
        WlsBoot =np.array([n-np.argmax(pBoot>=l) for l in lambdas])
        pi0sBoot =np.array([WlsBoot[i] / (n *(1 - lambdas[i])) for i in range(numLambda)])
        mse = mse + np.square(pi0sBoot-minPi0)
    minIx = np.argmin(mse)
    return pi0s[minIx]

We subsequently use Storey&Tibshirani to calculate first calculate,

\[ \hat{\rm FDR}(t) = \frac{\pi_0mp}{|\{p_i\le t\}|}, \]

and then smooth the \(\rm FDR(t)\) estimates as,

\[\hat{q}(p_i)=\min_{t \ge p_i}\hat{\rm FDR}(t).\]
def qvalues(pvalues):
    m = pvalues.shape[0] # The number of p-values
    pvalues.sort_values("p",inplace=True) # sort the pvalues in acending order
    pi0 = estimatePi0(list(pvalues["p"].values))
    print("pi_0 estimated to " + str(pi0))
    
    # calculate a FDR(t) as in Storey & Tibshirani
    num_p = 0.0
    for ix in pvalues.index:
        num_p += 1.0
        t = pvalues.loc[ix,"p"]
        fdr = pi0*t*m/num_p
        pvalues.loc[ix,"q"] = fdr
    
    # calculate a q(p) as the minimal FDR(t)
    old_q=1.0
    for ix in reversed(list(pvalues.index)):
        q = min(old_q,pvalues.loc[ix,"q"])
        old_q = q
        pvalues.loc[ix,"q"] = q
    return pvalues
qv = qvalues(pvalues)
pi_0 estimated to 0.19388086386728304

We note a very low \(\pi_0\) indicating that about 80% of all genes are differentially expressed in the tripple negative cancers. This is maybe not only reflects the different biology of tripple negative cancers, but might also relate to differencees in sample handlig or normalization of data.

We can list the differntial genes, in descending order of significance.

qv
p -log_p log_FC q
Hugo_Symbol
NAT1 3.104964e-45 44.507943 -3.919466 7.780762e-42
TBC1D9 2.162932e-43 42.664957 -3.116726 2.475169e-40
ESR1 2.963198e-43 42.528239 -5.855994 2.475169e-40
ARMT1 5.785445e-43 42.237663 -1.932617 3.624451e-40
SLC16A6 7.397218e-43 42.130932 -3.157309 3.707353e-40
... ... ... ... ...
SCRN1 9.989525e-01 0.000455 -0.000178 1.937377e-01
GPKOW 9.990908e-01 0.000395 0.000066 1.937378e-01
FUT10 9.991074e-01 0.000388 -0.000106 1.937378e-01
DNAH2 9.995367e-01 0.000201 -0.000150 1.938060e-01
AMMECR1L 9.999002e-01 0.000043 -0.000007 1.938615e-01

12925 rows × 4 columns

14.2. Displaying number of findings as a function of \(q\) value (a \(p\)-\(q\) plot)#

A plot of the number of differentially expressed genes as a function of \(q\) value gives the same message.

sns.lineplot(x=pvalues["q"],y=list(range(pvalues.shape[0])),errorbar=None,lw=3)
plt.xlim(0,0.1);
plt.ylim();
plt.ylabel("Number of differential genes");
../_images/ab477091abbc27e6b55fb57c768023c55c8099b7166de47e40dbd74ce5a66386.png

14.3. Volcano plots revisited#

We often see that Volcano plots are complemented with FDR tresholds. Here we complement the previous lecture’s volcano plot with coloring indicating if transcripts are significantly differentially abundant at a FDR-treshhold of \(10^{-10}\).

qv["Significant"] = qv["q"]<1e-10
less_than_FDR_10 = qv[qv["q"]<1e-10]
p_treshold = float(less_than_FDR_10.iloc[-1:]["-log_p"].values)
/tmp/ipykernel_128661/3856227648.py:3: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  p_treshold = float(less_than_FDR_10.iloc[-1:]["-log_p"].values)
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
sns.set_context("talk")
ax = sns.relplot(data=pvalues,x="log_FC",y="-log_p",hue="Significant",aspect=1.5,height=6)
plt.axhline(p_treshold)
#sns.lineplot([-6,4],[p_treshold,p_treshold],ax=ax)
ax.set(xlabel="$log_2(TN/not TN)$", ylabel="$-log_{10}(p)$");
../_images/09d285c8cecf9249485e9def631d266b4db2227a5d6e055231e51ecd316f8503.png

Again it should be noted that the large number of differential transcripts maybe not only reflects the different biology of tripple negative cancers, but might also relate to differencees in sample handlig or normalization of data.

14.4. Spline estimation of \(\pi_0\)#

Storey and Tibshirani outlines an other procedure for estimating \(\pi_0\) than the bootstrap procedure used above, i.e. using \(\hat{\pi_0}(\lambda) = \frac{|\{p>\lambda \}|}{m(1-\lambda)}\).

Below, we almost follow the article’s described procedure (please try to find the difference on how we select which lambdas we evaluate). Furthermore we fit a qubic spline to these \(\pi_0\) estimates.

from scipy.interpolate import UnivariateSpline

m = pvalues.shape[0] # The number of p-values
pvalues.sort_values("p",inplace=True,ascending=False) # sort the pvalues in decending order
num_p = -1
for ix in pvalues.index:
    num_p += 1
    lambda_p = pvalues.loc[ix,"p"]
    pi0_hat = num_p/(m*(1-lambda_p))
    pvalues.loc[ix,"pi0_hat"] = pi0_hat

pvalues.sort_values("p",inplace=True) # sort the pvalues in ascending order
s = UnivariateSpline(pvalues["p"],pvalues["pi0_hat"], k=3,s=10)

We plot the estimates (blue) as well as the spline fit (red) for two different intevalls of \(\lambda\). You will see the need of a smoother, particularly in the region near 1.0.

def plot_pi0_hat(p,s,xlow,xhigh,ylow,yhigh,ax):
  sns.lineplot(x=pvalues["p"],y=pvalues["pi0_hat"],errorbar=None,lw=3, ax=ax, color='b')
  sns.lineplot(x=pvalues["p"],y=s(pvalues["p"]),errorbar=None,lw=3,ax=ax, color='r')
  ax.set_xlim(xlow,xhigh);
  ax.set_ylim(ylow,yhigh);
  ax.set_xlabel("$\lambda $");
  ax.set_ylabel("$\pi_0(\lambda)$");

fig, ax = plt.subplots(1,2,figsize=(12, 4))
plot_pi0_hat(pvalues,s,0,1,0,0.6,ax[0])
plot_pi0_hat(pvalues,s,0.95,1,0.1,0.35,ax[1])
../_images/6bc6cd62ffe4fa228ff68b1ecc1c129c3bf07e56b9f355c42f997f9b0f1df67a.png

We can obtain a final estimate by evaluating the spline for \(\lambda=1\), and compare the it to the bootstrapping estimate.

print("Spline estimate of pi_0: " + str(s(1)))
print("Bootstrap estimate of pi_0: " + str(estimatePi0(list(pvalues["p"].values))))
Spline estimate of pi_0: 0.18473806167848936
Bootstrap estimate of pi_0: 0.19388086386728304