4. Multiple Regression Example#
In this example, we apply multiple regression to analyze the relationship between several proteins and a clinical variable, BMI (Body Mass Index). We perform Ridge Regression using regularization to prevent overfitting, and visualize how well the model predicts BMI from the selected protein features.
4.1. Data Loading and Preparation#
The first step involves loading and preparing data from the CPTAC (Clinical Proteomic Tumor Analysis Consortium) database for Lung Squamous Cell Carcinoma (LSCC). We retrieve proteomics data and a relevant clinical variable (BMI), then merge these datasets based on matching patient records.
import pandas as pd
import cptac
import cptac.utils as ut
en = cptac.Lscc()
en.list_data_sources()
prot = en.get_proteomics("umich")
prot = ut.reduce_multiindex(df=prot, tuples=True)
clin_var = "bmi"
clinical = en.get_clinical('mssm')[[clin_var]]
clin_and_prot = clinical.join(prot, how='inner').dropna(subset=[clin_var])
relevant_prot = [('POLI', 'ENSP00000462664.1'), ('MYL4', 'ENSP00000347055.1'), ('NRP2', 'ENSP00000350432.5'), ('CFHR2', 'ENSP00000356385.4'), ('SMAD2', 'ENSP00000262160.6'), ('KIAA1328', 'ENSP00000280020.5')]
variables_df = clin_and_prot.loc[:, [clin_var] + relevant_prot ].dropna()
Show code cell output
Downloading cptac_genes.csv: 0%| | 0.00/462k [00:00<?, ?B/s]
Downloading cptac_genes.csv: 0%| | 1.02k/462k [00:00<04:09, 1.85kB/s]
Downloading cptac_genes.csv: 10%|▉ | 45.1k/462k [00:00<00:04, 86.3kB/s]
Downloading cptac_genes.csv: 37%|███▋ | 173k/462k [00:00<00:00, 337kB/s]
Downloading cptac_genes.csv: 77%|███████▋ | 355k/462k [00:00<00:00, 602kB/s]
Downloading cptac_genes.csv: 100%|██████████| 462k/462k [00:00<00:00, 463kB/s]
Downloading brca_mapping.csv: 0%| | 0.00/6.37k [00:00<?, ?B/s]
Downloading brca_mapping.csv: 16%|█▌ | 1.02k/6.37k [00:00<00:03, 1.77kB/s]
Downloading brca_mapping.csv: 91%|█████████ | 5.80k/6.37k [00:00<00:00, 9.29kB/s]
Downloading brca_mapping.csv: 100%|██████████| 6.37k/6.37k [00:00<00:00, 8.32kB/s]
Downloading index.tsv: 0%| | 0.00/30.2k [00:00<?, ?B/s]
Downloading index.tsv: 3%|▎ | 1.02k/30.2k [00:00<00:19, 1.51kB/s]
Downloading index.tsv: 100%|██████████| 30.2k/30.2k [00:00<00:00, 43.5kB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 0%| | 0.00/24.4M [00:00<?, ?B/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 0%| | 1.02k/24.4M [00:00<3:47:34, 1.79kB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 0%| | 48.1k/24.4M [00:00<04:22, 92.9kB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 1%| | 140k/24.4M [00:00<01:32, 264kB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 1%| | 253k/24.4M [00:00<00:53, 450kB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 2%|▏ | 499k/24.4M [00:01<00:25, 922kB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 3%|▎ | 784k/24.4M [00:01<00:17, 1.36MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 5%|▍ | 1.16M/24.4M [00:01<00:11, 1.96MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 6%|▋ | 1.57M/24.4M [00:01<00:09, 2.43MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 10%|█ | 2.45M/24.4M [00:01<00:06, 3.59MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 17%|█▋ | 4.05M/24.4M [00:01<00:03, 6.46MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 22%|██▏ | 5.50M/24.4M [00:01<00:02, 8.50MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 29%|██▊ | 6.98M/24.4M [00:01<00:01, 10.1MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 34%|███▍ | 8.37M/24.4M [00:01<00:01, 9.59MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 38%|███▊ | 9.39M/24.4M [00:02<00:03, 4.99MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 42%|████▏ | 10.2M/24.4M [00:02<00:03, 3.93MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 44%|████▍ | 10.8M/24.4M [00:03<00:04, 3.33MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 46%|████▌ | 11.3M/24.4M [00:03<00:04, 2.74MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 48%|████▊ | 11.6M/24.4M [00:03<00:04, 2.62MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 49%|████▉ | 12.0M/24.4M [00:03<00:04, 2.52MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 50%|█████ | 12.3M/24.4M [00:03<00:05, 2.43MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 51%|█████▏ | 12.6M/24.4M [00:04<00:05, 2.30MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 52%|█████▏ | 12.8M/24.4M [00:04<00:05, 2.13MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 54%|█████▍ | 13.2M/24.4M [00:04<00:04, 2.25MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 55%|█████▌ | 13.6M/24.4M [00:04<00:05, 2.17MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 57%|█████▋ | 13.8M/24.4M [00:04<00:05, 2.02MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 58%|█████▊ | 14.1M/24.4M [00:04<00:05, 1.91MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 59%|█████▊ | 14.3M/24.4M [00:05<00:05, 1.80MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 60%|█████▉ | 14.6M/24.4M [00:05<00:05, 1.73MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 61%|██████ | 14.8M/24.4M [00:05<00:05, 1.74MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 62%|██████▏ | 15.2M/24.4M [00:05<00:05, 1.83MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 63%|██████▎ | 15.5M/24.4M [00:05<00:04, 1.88MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 64%|██████▍ | 15.7M/24.4M [00:05<00:04, 1.78MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 64%|██████▍ | 15.7M/24.4M [00:05<00:05, 1.71MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 66%|██████▌ | 16.2M/24.4M [00:05<00:04, 2.03MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 67%|██████▋ | 16.5M/24.4M [00:06<00:04, 1.98MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 69%|██████▊ | 16.8M/24.4M [00:06<00:03, 2.00MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 70%|███████ | 17.1M/24.4M [00:06<00:03, 2.02MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 70%|███████ | 17.1M/24.4M [00:06<00:03, 2.03MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 71%|███████▏ | 17.4M/24.4M [00:06<00:03, 1.97MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 72%|███████▏ | 17.7M/24.4M [00:06<00:03, 1.93MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 73%|███████▎ | 17.7M/24.4M [00:06<00:03, 1.91MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 74%|███████▎ | 18.0M/24.4M [00:06<00:03, 1.84MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 74%|███████▎ | 18.0M/24.4M [00:06<00:03, 1.80MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 75%|███████▍ | 18.3M/24.4M [00:07<00:03, 1.79MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 75%|███████▍ | 18.3M/24.4M [00:07<00:03, 1.79MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 77%|███████▋ | 18.7M/24.4M [00:07<00:02, 1.96MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 78%|███████▊ | 19.1M/24.4M [00:07<00:02, 2.11MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 78%|███████▊ | 19.1M/24.4M [00:07<00:02, 2.21MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 79%|███████▉ | 19.4M/24.4M [00:07<00:02, 2.07MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 79%|███████▉ | 19.4M/24.4M [00:07<00:02, 1.96MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 81%|████████ | 19.8M/24.4M [00:07<00:02, 2.05MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 81%|████████ | 19.8M/24.4M [00:07<00:02, 2.11MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 83%|████████▎ | 20.2M/24.4M [00:07<00:01, 2.14MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 83%|████████▎ | 20.2M/24.4M [00:07<00:01, 2.17MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 84%|████████▍ | 20.5M/24.4M [00:08<00:01, 2.08MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 84%|████████▍ | 20.5M/24.4M [00:08<00:01, 2.02MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 85%|████████▌ | 20.9M/24.4M [00:08<00:01, 2.03MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 85%|████████▌ | 20.9M/24.4M [00:08<00:01, 2.04MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 87%|████████▋ | 21.2M/24.4M [00:08<00:01, 2.06MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 87%|████████▋ | 21.2M/24.4M [00:08<00:01, 2.07MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 88%|████████▊ | 21.6M/24.4M [00:08<00:01, 2.12MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 88%|████████▊ | 21.6M/24.4M [00:08<00:01, 2.15MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 90%|████████▉ | 21.9M/24.4M [00:08<00:01, 2.12MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 91%|█████████ | 22.3M/24.4M [00:08<00:01, 2.11MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 92%|█████████▏| 22.6M/24.4M [00:08<00:00, 2.16MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 94%|█████████▍| 22.9M/24.4M [00:09<00:00, 2.10MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 94%|█████████▍| 22.9M/24.4M [00:09<00:00, 2.05MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 95%|█████████▌| 23.3M/24.4M [00:09<00:00, 2.02MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 96%|█████████▋| 23.5M/24.4M [00:09<00:00, 1.97MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 96%|█████████▋| 23.6M/24.4M [00:09<00:00, 1.94MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 98%|█████████▊| 23.9M/24.4M [00:09<00:00, 1.99MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 99%|█████████▉| 24.2M/24.4M [00:09<00:00, 1.99MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 100%|█████████▉| 24.4M/24.4M [00:10<00:00, 1.26MB/s]
Downloading Report_abundance_groupby=protein_protNorm=MD_gu=2.tsv.gz: 100%|██████████| 24.4M/24.4M [00:10<00:00, 2.41MB/s]
Downloading aliquot_to_patient_ID.tsv.gz: 0%| | 0.00/22.9k [00:00<?, ?B/s]
Downloading aliquot_to_patient_ID.tsv.gz: 4%|▍ | 1.02k/22.9k [00:00<00:12, 1.81kB/s]
Downloading aliquot_to_patient_ID.tsv.gz: 100%|██████████| 22.9k/22.9k [00:00<00:00, 38.6kB/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 0%| | 0.00/243k [00:00<?, ?B/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 0%| | 1.02k/243k [00:00<02:13, 1.81kB/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 1%|▏ | 3.07k/243k [00:00<00:59, 4.04kB/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 4%|▍ | 10.2k/243k [00:00<00:57, 4.04kB/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 24%|██▎ | 57.3k/243k [00:00<00:03, 47.4kB/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 71%|███████▏ | 173k/243k [00:00<00:00, 185kB/s]
Downloading clinical_Pan-cancer.May2022.tsv.gz: 100%|██████████| 243k/243k [00:00<00:00, 276kB/s]
Clinical Data: We extract BMI as the clinical variable.
Proteomics Data: We use proteomic measurements from several proteins (POLI, MYL4, NRP2, CFHR2, SMAD2, and KIAA1328) as our independent variables.
Data Joining: The clinical and proteomics data are merged into a single dataframe and missing values are handled.
4.2. Defining the Ridge Regression Model#
Ridge regression adds a regularization term to penalize large coefficients, helping to control model complexity and reduce overfitting. The following steps define the ridge regression loss function and optimize it using the scipy.optimize.minimize
function.
import numpy as np
from scipy.optimize import minimize
X = variables_df.drop(columns=[clin_var])
y = variables_df[clin_var].values
# Standardize features
# Convert to numpy arrays
X = np.array(X)
y = np.array(y)
# Set the regularization strength
lambda_ridge = 0.5
# Define the Ridge loss function
def ridge_loss(beta, X, y, lambda_ridge):
# Prediction
y_pred = X @ beta[:-1] + beta[-1] # Last beta is intercept
# Ridge loss: Sum of Squared Errors + regularization term
error = y - y_pred
loss = np.sum(error ** 2) + lambda_ridge * np.sum(beta[:-1] ** 2)
return loss
# Initial guess for beta (coefficients and intercept)
initial_beta = np.zeros(X.shape[1] + 1)
# Optimize the loss function using scipy's minimize
opt_result = minimize(ridge_loss, initial_beta, args=(X, y, lambda_ridge), method='L-BFGS-B')
# Extract optimized coefficients
optimized_beta = opt_result.x
# Separate coefficients and intercept
coefficients = optimized_beta[:-1]
intercept = optimized_beta[-1]
print(f"Optimized coefficients: {coefficients}")
print(f"Optimized intercept: {intercept}")
Optimized coefficients: [3.1372491 2.57581743 2.16237214 0.36522699 2.37369714 2.16228561]
Optimized intercept: 28.32913561455519
Ridge Loss Function: The loss function includes the sum of squared errors (SSE) and a regularization term (𝜆) applied to the coefficients.
Optimization: We initialize the coefficients and intercept to zero and use minimize to find the optimal values by minimizing the ridge loss function.
4.3. Visualization: Predicted vs Actual BMI#
After fitting the model, we calculate the predicted BMI values and plot them against the actual BMI values to evaluate the model’s performance.
import seaborn as sns
import matplotlib.pyplot as plt
# Predicted tumor size using optimized coefficients
y_pred = X @ coefficients + intercept
#y_pred = X @ coefficients_sklearn + intercept_sklearn
# y_pred = X @ [0.,0.,0.,0.,1. ] #+ intercept_sklearn
# Create a DataFrame for easy plotting
plot_df = pd.DataFrame({'Actual': y, 'Predicted': y_pred})
# Plot predicted vs actual using seaborn
plt.figure(figsize=(8, 6))
sns.scatterplot(data=plot_df, x='Actual', y='Predicted', color='blue', s=60)
sns.lineplot(x=[y.min(), y.max()], y=[y.min(), y.max()], color='red', linestyle='--', linewidth=2)
plt.xlabel('Actual BMI')
plt.ylabel('Predicted BMI')
plt.title('Predicted vs Actual BMI')
plt.grid(True)
plt.show()

Scatter Plot: The plot compares the actual BMI values against the predicted values from the model. A red dashed line indicates the ideal scenario where predictions perfectly match the actual values.
Visual Evaluation: If the points lie close to the red line, the model’s predictions are accurate. Deviations from this line represent prediction errors.
This example demonstrates how multiple regression can be extended with regularization to improve model generalization, particularly when working with clinical and proteomic datasets.