# High Dimensional Problems
In our linear and logistic regression models, we have always assumed that the number of predictors is less than the number of samples. However, that assumption fails in **high-dimensional problems**, where the number of predictors is *greater* than the number of samples analyzed. How could this be? Consider the following examples:
- In molecular biology, high-throughput sequencing technologies can make thousands of measurements across a human genome at once. For instance, bulk RNA-seq can in theory measure the expression of roughly 20,000 genes in a single sample.
- In computer vision, if each sample is an image, then one can derive a vast number of measurements from it. Each pixel could be a single measurement, or some defined **feature** of the image, such as the presence of a face or a cat. One can imagine gathering thousands of measurements from an image.
- In Natural Language Processing (NLP), a document can be treated as a sample. A document consists of a large number of words, and one can derive measurements from single words or from collections of words, such as themes or writing style. The number of measurements can be vast.
In mathematical form, we say we have a high-dimensional problem when $p$, the number of predictors, is large relative to the number of samples:
$$
Y = \beta_0 + \beta_1 \cdot X_1 + ... + \beta_p \cdot X_p
$$
If we try to run linear or logistic regression on a high-dimensional problem, it will not work: with more predictors than samples, least squares no longer has a unique solution.
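To see this concretely, here is a small sketch on synthetic data: when $p > n$, the matrix $X^\top X$ that least squares relies on can have rank at most $n$, so it is singular and the normal equations have no unique solution. (The dimensions below are arbitrary choices for illustration.)

```{python}
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 20  # fewer samples than predictors
X = rng.normal(size=(n, p))

# X'X is p x p, but its rank is at most n, so it is singular when p > n
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print(rank)       # at most n = 10, far below p = 20
print(rank < p)   # True: no unique least-squares solution
```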
What people typically do in light of a high-dimensional problem is to use a **dimension reduction** method to bring the number of features *below* the number of samples in the analysis. We will explore two approaches today: **regularization methods** and **principal components analysis**.
## Regularization methods
Regularization methods let us fit all the predictors in a model, and in the process of fitting, they naturally encourage some of the parameter estimates to be zero. This is an example of dimension reduction: the resulting model uses fewer predictors than there are samples. The best-known regularization techniques are **ridge regression** and **lasso regression**. For this class, we will use lasso regression; you can read about the difference between the two methods in the appendix.
Recall that in linear regression, we fit a line where the Mean Squared Error (MSE) is minimized. In regularization methods, the quantity we try to minimize for the model includes additional terms:
$$
\text{minimize}(MSE + \lambda \cdot Parameters)
$$
What is going on here?
- When the $MSE$ is minimized, the model fits the data points as closely as possible, just as in linear regression.
- When the $Parameters$ (such as $\beta_1, \ldots, \beta_p$) are minimized, their magnitudes shrink towards zero. This reduces the number of predictors used in the final model, performing dimension reduction.
- The **tuning parameter** (sometimes called a **hyperparameter**) $\lambda$ ("lambda") controls the balance between minimizing the $MSE$ and minimizing the $Parameters$. When $\lambda = 0$, we only minimize the $MSE$, i.e. ordinary linear regression. When $\lambda$ is large, we minimize the $Parameters$ more heavily, which encourages some of the parameter estimates to be exactly zero. Selecting an appropriate $\lambda$ is an important part of the model training process, and we will see how to do that in the Cross-Validation section below.
Let's see an example of the effect of $\lambda$ on a model with 5 predictors and parameter weights $\beta_1, \ldots, \beta_5$.
```{python}
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, lasso_path
X, y = load_diabetes(return_X_y=True)
X = X[:, :5]
X /= X.std(axis=0)
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, n_alphas=10, eps=5e-3)
plt.clf()
for i, coef_lasso in enumerate(coefs_lasso):
    plt.semilogx(alphas_lasso, coef_lasso, label='beta'+str(i+1))
plt.xlabel("lambda")
plt.ylabel("coefficients")
plt.title("Effect of lambda on regression coefficients")
plt.axis("tight")
plt.legend()
plt.show()
```
As $\lambda$ increases along the x-axis, you can see the parameters (the coefficients $\beta_1, \ldots, \beta_5$, shown as colored lines) move towards zero. We will need to find the optimal value of $\lambda$ for our model.
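The shrinkage is easy to check numerically as well. A minimal sketch on the same diabetes data, with two alpha values chosen arbitrarily for illustration:

```{python}
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
X = X / X.std(axis=0)  # put predictors on a common scale

# A small lambda keeps most coefficients; a large one zeroes many out
small = Lasso(alpha=0.1).fit(X, y)
large = Lasso(alpha=10.0).fit(X, y)
print(np.count_nonzero(small.coef_), np.count_nonzero(large.coef_))
```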
Now, let's get to a real-world problem!
## Example dataset
Our example dataset for the high-dimensional setting centers on this question: can we use RNA gene expression to predict response to a cancer treatment? If so, RNA gene expression could serve as a biomarker for how cancer patients may respond to various treatments. We use data from the [Dependency Map Project](https://depmap.org/portal/), which has RNA gene expression profiles and cancer treatment responses for the largest collection of cancer cell line models.
```{python}
from sklearn.model_selection import train_test_split
from formulaic import model_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import pickle
with open('classroom_data/GEFITINIB_Expression.pickle', 'rb') as handle:
    gefitinib_expression = pickle.load(handle)
with open('classroom_data/DOCETAXEL_Expression.pickle', 'rb') as handle:
    docetaxel_expression = pickle.load(handle)
```
Let's start by looking at the cancer drug "Gefitinib", a targeted therapy for non-small cell lung cancer with EGFR mutations and high levels of EGFR expression. We might expect to find the gene EGFR highly predictive of Gefitinib response.
Let's look at the distribution of our response. The drug response is measured in terms of [Area Under the Curve](https://en.wikipedia.org/wiki/Area_under_the_curve_(pharmacokinetics)) (unrelated to the area under the ROC curve from the week 3 appendix), and a lower value indicates that the drug is more effective against the cancer.
```{python}
plt.clf()
g = sns.displot(x="GEFITINIB", data=gefitinib_expression)
plt.show()
```
It is quite skewed, with only a few samples having low values.
## Data Preparation
Next, let's look at the dimensions of this DataFrame:
```{python}
gefitinib_expression.shape
```
We have way more potential predictors than samples even before we split into training and testing sets! This is definitely a high-dimensional problem.
Now, let's split our data into training and testing:
```{python}
gefitinib_expression_train, gefitinib_expression_test = train_test_split(gefitinib_expression, test_size=0.2, random_state=42)
```
We create our `y_train` and `X_train` variables, and when we need to use all predictors, we indicate via the `.` symbol.
```{python}
y_train, X_train = model_matrix("GEFITINIB ~ .", gefitinib_expression_train)
y_test, X_test = model_matrix("GEFITINIB ~ .", gefitinib_expression_test)
```
Before we fit the model, we have to do some data transformations. In linear and logistic regression, all the parameter estimates are **scale equivariant**: if we multiply a predictor by a constant $c$ and change nothing else, the parameter estimate for that predictor is automatically scaled by $\frac{1}{c}$ during model training. In regularization methods, however, because we are minimizing both the Mean Squared Error and the magnitude of the parameters, the parameter estimates may change substantially if the scale of a predictor changes.
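A small sketch of this contrast on synthetic data: rescaling a predictor by a constant $c$ rescales its linear regression coefficient by exactly $\frac{1}{c}$, while the lasso coefficient changes by a different amount because the penalty sees the new scale. (The constants and `alpha` below are arbitrary choices for illustration.)

```{python}
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 1 * X[:, 1] + rng.normal(size=200)

c = 100.0
X_rescaled = X.copy()
X_rescaled[:, 0] *= c  # rescale the first predictor by c

# OLS is scale equivariant: the coefficient shrinks by exactly 1/c
ols_a = LinearRegression().fit(X, y).coef_[0]
ols_b = LinearRegression().fit(X_rescaled, y).coef_[0]
print(np.isclose(ols_b, ols_a / c))  # True

# Lasso is not: the penalized estimate reacts to the new scale
lasso_a = Lasso(alpha=0.5).fit(X, y).coef_[0]
lasso_b = Lasso(alpha=0.5).fit(X_rescaled, y).coef_[0]
print(np.isclose(lasso_b, lasso_a / c))  # False
```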
Therefore, the best practice is to first **standardize the predictors** before using regularization methods. We use the following code to ensure that each predictor has a mean of 0 and variance of 1:
```{python}
# Fit the scaler on the training data only, then apply it to both sets,
# so no information from the test set leaks into training
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```
With this data standardization, we are almost ready for lasso regression.
When using regularization methods, we have to figure out the value of $\lambda$ ("lambda"), which dictates the balance between minimizing the Mean Squared Error and minimizing the model parameters. Like the parameters $\beta$, $\lambda$ must be learned during training. However, learning $\lambda$ is different in two ways:
- One must pick the value of $\lambda$ before the parameters $\beta$ are learned; they cannot be learned simultaneously.
- There is no formula for picking $\lambda$: one has to try many values and see how the model performs. For the $\beta$s, by contrast, there is a deterministic algorithm (called "least squares") that finds the optimal values.
Therefore, we have to search for the optimal $\lambda$. We can use Cross-Validation, which we learned last week, to explore a range of $\lambda$ values.
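Conceptually, the search is just a loop: for each candidate $\lambda$, fit the model and score it on held-out folds, then keep the winner. A hand-rolled sketch on the diabetes data from earlier (the candidate grid is an arbitrary choice for illustration):

```{python}
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Try each candidate lambda (alpha) and keep the one with the best CV score
candidates = [0.01, 0.1, 1.0, 10.0]
scores = {a: cross_val_score(Lasso(alpha=a), X, y, cv=2).mean()
          for a in candidates}
best_alpha = max(scores, key=scores.get)
print(best_alpha)
```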
## Lasso Regression
Let's see how to implement cross-validation in lasso regression. The `LassoCV` function searches for the optimal $\lambda$ via Cross-Validation on whatever dataset you give it as input. For speed, we use 2-fold cross-validation, but in practice people typically use 5- to 10-fold cross-validation.
```{python}
import time
start_time = time.perf_counter()
reg = LassoCV(cv=2, random_state=0).fit(X_train_scaled, np.ravel(y_train))
end_time = time.perf_counter()
elapsed_time = end_time - start_time
print(f"Elapsed time to fit model: {elapsed_time} seconds")
```
What is the $\lambda$ learned in the cross validation?
```{python}
print(reg.alpha_)
```
(Notice that in the code we refer to it as "alpha". This is due to a difference in variable naming in Scikit-Learn.)
We expect that only a small subset of predictors will end up being used in the lasso model. Which genes have non-zero coefficients?
```{python}
X_train.columns[np.nonzero(reg.coef_)]
```
Now, let's see how the model performs on the Test Set:
```{python}
y_test_predicted = reg.predict(X_test_scaled)
test_err = mean_absolute_error(y_test, y_test_predicted)
test_err
```
Let's look at the plot:
```{python}
plt.clf()
plt.scatter(y_test_predicted, y_test, alpha=.5)
plt.axline((.9, .9), slope=1, color='r', linestyle='--')
plt.xlabel('Predicted AUC')
plt.ylabel('True AUC')
plt.title('Mean Absolute Error: ' + str(round(test_err, 2)))
plt.show()
```
## Some remarks
- Most variables in the model can be written as a combination of other variables, so the problem of predictors being correlated with each other is present everywhere.
- Therefore, we can never identify *the* best predictors of the outcome, only one of many possible models to consider.
- Most statistics of fit used in low-dimensional regression, such as p-values, $R^2$, AIC, and BIC on the training dataset, do not work in the high-dimensional setting.
- Our diagnostic plots do not scale to many dimensions, so they are generally not used in practice.
## Other ways to reduce dimensions
Lasso regularization is one of many ways to reduce the number of dimensions in your modeling process. Here are some other common methods; they are not mutually exclusive, and it is common to combine several of them:
- Removing predictors with a large number of missing values. Many regression methods only run on samples with no missing values, so including a predictor with many missing values may reduce the number of usable samples.
- It is important to explore *why* there are missing values in the dataset, especially whether it is related to the response variable or not. If there is a strong trend, it may create strong bias in the model.
- As a sidenote, one can consider imputing the missing values with educated guesses. This is a big topic in statistical literature that goes beyond the scope of the course. See this [well-used reference](https://tmb.njtierney.com/) in the R community.
- We know from linear and logistic regression that collinear predictors affect the performance of our models. We can remove the most correlated predictors.
- A more nuanced way of doing this is to "remove the minimum number of predictors to ensure that all pairwise correlations are below a certain threshold". The algorithm is as follows (from "Applied Predictive Modeling" by Kuhn and Johnson):
1. Calculate the correlation matrix of the predictors
2. Determine the two predictors associated with the largest absolute pairwise correlation (call them predictors A and B)
3. Determine the average correlation between A and other variables. Do the same for predictor B.
4. If A has a larger average correlation, remove it; otherwise remove B.
5. Repeat Steps 2-4 until no absolute correlations are above the threshold.
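A minimal sketch of these steps (the function name and threshold are hypothetical choices for illustration):

```{python}
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Greedy filter following Kuhn & Johnson: repeatedly drop the predictor
    with the higher average correlation from the most correlated pair."""
    cols = list(df.columns)
    while True:
        corr = df[cols].corr().abs()
        np.fill_diagonal(corr.values, 0.0)   # ignore self-correlation
        # Step 2: the pair with the largest absolute correlation
        i, j = np.unravel_index(np.argmax(corr.values), corr.shape)
        if corr.values[i, j] <= threshold:   # Step 5: stop condition
            return cols
        a, b = corr.columns[i], corr.columns[j]
        # Steps 3-4: drop whichever of the pair is, on average, more correlated
        drop = a if corr[a].mean() >= corr[b].mean() else b
        cols.remove(drop)

# Tiny demo: x2 is nearly a copy of x1, so one of the pair is removed
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({"x1": x1,
                   "x2": x1 + 0.01 * rng.normal(size=100),
                   "x3": rng.normal(size=100)})
kept = drop_correlated(df, threshold=0.9)
print(kept)
```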
- Predictors with extremely low variance are generally unhelpful. Imagine a predictor that contains only a single value: it has a variance of 0 and cannot relate to the response at all. It is common practice to remove predictors with variance below a low threshold.
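scikit-learn provides this filter as `VarianceThreshold`; a small sketch on made-up data:

```{python}
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=50),         # a normal, variable predictor
    np.ones(50),                 # a constant predictor: variance 0
    rng.normal(size=50) * 1e-4,  # a nearly constant predictor
])

selector = VarianceThreshold(threshold=1e-6)
X_kept = selector.fit_transform(X)
print(X_kept.shape)           # the constant and near-constant columns are dropped
print(selector.get_support())
```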
- Principal Components Analysis (PCA) is a way to reduce the number of predictors by generating a new, smaller set of predictors that seek to capture the majority of information in the original variables. Suppose that you have 100 predictors, and you look at the variance of each predictor and sum all of them up as the *total variance*. When one performs PCA, the process will take a combination of the 100 original predictors to yield a new predictor (the first principal component) that captures the most variability of the total variance. Then, the second principal component is another combination of the 100 original predictors that captures the most of the *remaining* variability, and is uncorrelated with the first principal component. And so on. Thus, it is possible to take the first few principal components into a new regression model while capturing most of the total variance in the data.
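A sketch with scikit-learn's `PCA` on synthetic data, where 100 correlated predictors are driven by only 3 hidden factors, so a few components capture most of the total variance:

```{python}
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 correlated predictors driven by only 3 hidden factors plus small noise
factors = rng.normal(size=(200, 3))
loadings = rng.normal(size=(3, 100))
X = factors @ loadings + 0.1 * rng.normal(size=(200, 100))

pca = PCA(n_components=10).fit(X)
explained = pca.explained_variance_ratio_
print(explained[:3].sum())  # the first 3 components capture nearly all variance
```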
## Appendix: Ridge and Lasso regression
The most common regularization methods are ridge and lasso regression. We only cover lasso regression in this course, but this appendix expands on the technical details behind the scenes and some other variations.
## Exercises
Exercises for week 4 can be found [here](https://colab.research.google.com/drive/17LFQzaGmnCsLdDZNnnvDsYwANwgj2dob?usp=sharing).