
Transforming data – PCA and LDA with scikit-learn
Often, a transformation can make data more digestible. In particular, data scientists use transformations to rotate the data onto the axes of greatest or most important variation, with the aim of representing similar information in fewer dimensions. As an example, we can take the four features of the iris dataset and represent similar information in two dimensions. Let's start with principal component analysis (PCA), which orients the data along the axes of highest variance. The iris set has only four dimensions, but this technique can be used on data with tens or hundreds of features:
# reduce dimensions with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
out_pca = pca.fit_transform(df[['sepal length in cm',
                                'sepal width in cm',
                                'petal length in cm',
                                'petal width in cm']])
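Before moving on, it can be worth checking how much of the original variance the two retained components capture. The following is a minimal sketch rather than part of the chapter's workflow: it assumes the iris data is loaded with scikit-learn's load_iris() helper instead of the chapter's df, and the variable names X, ratios, and total_retained are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Assumption: load iris from scikit-learn rather than the chapter's df
X = load_iris().data  # four feature columns, one row per flower

pca = PCA(n_components=2)
out_pca = pca.fit_transform(X)

# Fraction of the total variance captured by each retained component
ratios = pca.explained_variance_ratio_
total_retained = ratios.sum()

print("variance ratio per component:", ratios)
print("total variance retained:", total_retained)
```

If the retained total is close to 1, the two components lose little of the variance in the original features; a low total would suggest keeping more components.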
Now, let's create a pandas DataFrame with the output data and use the .head() sanity check to see what we have:
df_pca = pd.DataFrame(data=out_pca, columns=['pca1', 'pca2'])
print(df_pca.head())
You will see the following output after executing the preceding code:

This looks good, but we are missing the target or label column (species). Let's add it by concatenating with the species column of the original DataFrame; because pd.concat with axis = 1 aligns rows on the index, this assumes df still has its default integer index. This gives us a PCA DataFrame (df_pca) that is ready for downstream work and predictions. Then, let's plot it to see what our transformed data looks like in just two dimensions:
df_pca = pd.concat([df_pca, df[['species']]], axis = 1)
print(df_pca.head())
sns.lmplot(x="pca1", y="pca2", hue="species", data=df_pca, fit_reg=False)
You will see the following output after executing the preceding code:

The following plot is obtained after executing the same code snippet:

We now have our higher-dimensional data represented in two easily digestible and plottable dimensions. However, can we do better? The goal of PCA is to orient the data in the direction of the greatest variation, but it ignores some important information in our dataset: the labels are not used. Perhaps we can extract even better transformation vectors if we include them. A popular supervised (label-aware) dimension-reduction technique is linear discriminant analysis (LDA). The following code fits LDA using the class labels and finds the directions that best separate the classes:
# reduce dimensions with LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
# fit on the first four columns (the features), using species as the class labels
out_lda = lda.fit_transform(X=df.iloc[:, :4], y=df['species'])
# format dataframe
df_lda = pd.DataFrame(data = out_lda, columns = ['lda1', 'lda2'])
df_lda = pd.concat([df_lda, df[['species']]], axis = 1)
# sanity check
print(df_lda.head())
# plot
sns.lmplot(x="lda1", y="lda2", hue="species", data=df_lda, fit_reg=False)
You will see the following output after executing the preceding code:

The following plot is obtained after executing the same code snippet:

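One practical difference from PCA is that LDA can produce at most min(number of classes - 1, number of features) components, so with three iris species two components is the maximum. The sketch below illustrates this and inspects how the separation is shared between the two discriminants. It assumes the data comes from scikit-learn's load_iris() rather than the chapter's df, and max_components is an illustrative name, not something defined earlier in the chapter.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: load iris from scikit-learn rather than the chapter's df
iris = load_iris()
X, y = iris.data, iris.target

# LDA yields at most min(n_classes - 1, n_features) components
n_features = X.shape[1]
n_classes = len(set(y))
max_components = min(n_classes - 1, n_features)

lda = LinearDiscriminantAnalysis(n_components=max_components)
out_lda = lda.fit_transform(X, y)

print("maximum LDA components:", max_components)
print("transformed shape:", out_lda.shape)
# Share of between-class variance explained by each discriminant
print("explained variance ratio:", lda.explained_variance_ratio_)
```

The explained_variance_ratio_ attribute is available with the default svd solver; when one discriminant dominates, most of the class separation lies along that single axis.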
The scatter plots may tempt you into thinking that the PCA and LDA techniques performed the same transformation on the data. Let's look a little closer at the first component of each using the powerful violin plot routine. First, we will begin with PCA, as follows:
sns.violinplot(x='species',y='pca1', data=df_pca).set_title("Violin plot: Feature = PCA_1")
You will see the following output after executing the preceding code:

Now, let's plot the first LDA component, as follows:
sns.violinplot(x='species',y='lda1', data=df_lda).set_title("Violin plot: Feature = LDA_1")
You will see the following output after executing the preceding code:

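To put a number on what the two violin plots suggest, one option is to compare how well the first component of each method separates the species. The sketch below computes the ratio of between-class to within-class sum of squares along a single component, which is the quantity the first linear discriminant is designed to maximize. It is an illustration under stated assumptions, not the chapter's own code: the data is loaded with load_iris(), and separation_ratio is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: load iris from scikit-learn rather than the chapter's df
iris = load_iris()
X, y = iris.data, iris.target


def separation_ratio(values, labels):
    """Hypothetical helper: between-class / within-class sum of squares in 1-D."""
    overall_mean = values.mean()
    between = 0.0
    within = 0.0
    for cls in np.unique(labels):
        group = values[labels == cls]
        # Spread of the class mean around the overall mean, weighted by class size
        between += group.size * (group.mean() - overall_mean) ** 2
        # Spread of the samples around their own class mean
        within += ((group - group.mean()) ** 2).sum()
    return between / within


out_pca = PCA(n_components=2).fit_transform(X)
out_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

pca_score = separation_ratio(out_pca[:, 0], y)
lda_score = separation_ratio(out_lda[:, 0], y)

print("PCA first component separation:", pca_score)
print("LDA first component separation:", lda_score)
```

A higher ratio means the species are more tightly grouped and further apart along that component, so the LDA value should be at least as large as the PCA value.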