
Transforming data – PCA and LDA with scikit-learn

Often, a transformation can make data more digestible. In particular, data scientists use transformations to rotate the data onto the axes of the largest or most important variation, with the aim of representing similar information with a smaller number of dimensions. We can use the iris dataset as an example, taking its four features and representing similar information in two dimensions. Let's start with principal component analysis (PCA), which orients the data along the axes of highest variation. The iris dataset has only four dimensions, but this technique can be used on data with tens or hundreds of features:

# reduce dimensions with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
out_pca = pca.fit_transform(df[['sepal length in cm',
                                'sepal width in cm',
                                'petal length in cm',
                                'petal width in cm']])
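
As an optional check that is not part of the original listing, you can ask the fitted PCA object how much of the total variance the two retained components capture via its explained_variance_ratio_ attribute:

# optional: how much of the total variance do the two components capture?
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

If the ratios sum to a value close to 1, very little information is lost by dropping the remaining components.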

Now, let's create a pandas DataFrame with the output data and use the .head() sanity check to see what we have:

import pandas as pd  # needed if pandas has not been imported earlier in the chapter

df_pca = pd.DataFrame(data = out_pca, columns = ['pca1', 'pca2'])
print(df_pca.head())

Executing the preceding code prints the first five rows of the transformed data, with the two principal components stored in the pca1 and pca2 columns.

This looks good, but we are missing the target or label column (species). Let's add that column by concatenating with the original DataFrame. This gives us a PCA DataFrame (df_pca) that is ready for downstream work and predictions. Then, let's plot it and see what our transformed data looks like when plotted in just two dimensions:

import seaborn as sns  # needed if seaborn has not been imported earlier in the chapter

df_pca = pd.concat([df_pca, df[['species']]], axis = 1)
print(df_pca.head())
sns.lmplot(x="pca1", y="pca2", hue="species", data=df_pca, fit_reg=False)

Executing the preceding code prints the first five rows of df_pca, now with the species column attached, and produces a scatter plot of pca1 against pca2 with each species shown in its own color.

We now have our higher-dimensional data represented in two easily digestible and plottable dimensions. However, can we do better? The goal of PCA is to orient the data in the direction of the greatest variation. However, it ignores some important information in our dataset; for instance, the labels are not used. Perhaps we can extract even better transformation vectors if we include the labels. The most popular labeled dimension-reduction technique is called linear discriminant analysis (LDA). The underlying math groups the data by class label and then finds the directions of greatest separation between the classes:
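
As a sketch (stated from the standard Fisher formulation of LDA, since the original equation figure is not reproduced here), LDA seeks a projection vector $w$ that maximizes the ratio of between-class scatter to within-class scatter:

$$
J(w) = \frac{w^{\top} S_{B}\, w}{w^{\top} S_{W}\, w}
$$

Here, $S_{B}$ and $S_{W}$ are the between-class and within-class scatter matrices, and the leading eigenvectors of $S_{W}^{-1} S_{B}$ give the discriminant directions.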

Ignoring labels in the transformation step can be desirable for some problem statements (especially those with unreliable class labels) to avoid pulling the reduced component vectors in an unhelpful direction. For this reason, I recommend that you always start with PCA before deciding whether you need to do any further work or not. Indeed, unless your dataset is large, the computation time for PCA is short, so there's no harm in starting here.

# reduce dimensions with LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)

# format dataframe
out_lda = lda.fit_transform(X=df.iloc[:,:4], y=df['species'])
df_lda = pd.DataFrame(data = out_lda, columns = ['lda1', 'lda2'])
df_lda = pd.concat([df_lda, df[['species']]], axis = 1)

# sanity check
print(df_lda.head())

# plot
sns.lmplot(x="lda1", y="lda2", hue="species", data=df_lda, fit_reg=False)

Executing the preceding code prints the first five rows of df_lda, which contains the lda1, lda2, and species columns, and produces a scatter plot of lda1 against lda2 colored by species.

The scatter plots may tempt you into thinking that the PCA and LDA techniques performed the same transformation on the data. Let's look a little closer at the first component of each using the powerful violin plot routine. First, we will begin with PCA, as follows:

sns.violinplot(x='species',y='pca1', data=df_pca).set_title("Violin plot: Feature = PCA_1")

Executing this code produces a violin plot of the first principal component (pca1), with one violin per species.

Now, let's plot the first LDA component, as follows:

sns.violinplot(x='species',y='lda1', data=df_lda).set_title("Violin plot: Feature = LDA_1")

Executing this code produces the corresponding violin plot of the first LDA component (lda1), with one violin per species, which you can compare against the PCA plot above.
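
If you want to compare the two first components directly, one option (a minimal sketch that is not part of the original text, and assumes the df_pca and df_lda DataFrames from the preceding steps) is to draw both violin plots side by side on a single figure:

# optional: draw both first-component violin plots side by side for comparison
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.violinplot(x='species', y='pca1', data=df_pca, ax=axes[0])
axes[0].set_title("Violin plot: Feature = PCA_1")
sns.violinplot(x='species', y='lda1', data=df_lda, ax=axes[1])
axes[1].set_title("Violin plot: Feature = LDA_1")
plt.tight_layout()
plt.show()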