Introduction to Principal Component Analysis – Part 1

Introduction to Principal Component Analysis (PCA)

PCA, short for Principal Component Analysis, is one of those statistical algorithms that is popular among data scientists and statisticians but little known to people outside those fields.

Problem if PCA is not used:-

In real-world data analysis tasks we work with complex, multi-dimensional data. We plot the data to find patterns in it, or use it to train machine learning models. One way to think about dimensions: if you treat a data point x as a physical object, then each dimension is merely a point of view, such as where the data is located when it is observed along the horizontal axis or the vertical axis.

As the number of dimensions increases, the data becomes harder to visualize and more expensive to compute on. So how do we reduce the dimensions of a dataset?

* Remove the redundant dimensions
* Only keep the most important dimensions

What is PCA:-

PCA is a way of identifying patterns in data and expressing the data so as to highlight their similarities and differences. Since patterns can be hard to find in high-dimensional data, where the luxury of graphical representation is not available, PCA is a powerful tool for analyzing data. The other main advantage of PCA is that once you have found these patterns, you can compress the data, i.e. reduce the number of dimensions, without much loss of information. PCA is used as a dimensionality reduction technique in domains like facial recognition, computer vision and image compression. It is also used for finding patterns in high-dimensional data in finance, data mining, bioinformatics, psychology, etc.
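To make "compress without much loss of information" concrete, here is a minimal sketch, assuming NumPy and a small synthetic dataset invented for illustration. The helper names `pca_compress` and `pca_reconstruct` are hypothetical, and the SVD route is just one common way to compute the components.

```python
import numpy as np

def pca_compress(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean                       # PCA works on centered data
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered from largest to smallest singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    scores = Xc @ components.T          # compressed representation (n_samples x k)
    return scores, components, mean

def pca_reconstruct(scores, components, mean):
    """Map the compressed scores back to the original feature space."""
    return scores @ components + mean

# Synthetic data (an assumption for illustration): the third feature is
# almost exactly the sum of the first two, so it is redundant.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=200)])

scores, components, mean = pca_compress(X, k=2)
X_hat = pca_reconstruct(scores, components, mean)

# Relative reconstruction error after dropping one of three dimensions
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X - mean)
print(scores.shape, rel_error)
```

Because one of the three features is redundant here, two components are enough to rebuild the data almost exactly. On real data the error depends on how much variance the dropped components carried.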

This chapter will take you through the steps you need to perform a Principal Component Analysis on a set of data. I’ll try to explain what is happening at each point so that you can make informed decisions when you use this technique yourself.

Why & when PCA:-

PCA finds a more meaningful basis, or coordinate system, for your data. It works from the covariance matrix to find the strongest features of your samples.

We are usually surrounded by data with a large number of variables, some of which may be correlated. This correlation between variables brings redundancy into the information we can gather from the data set. Having too many dimensions (features) in your data also causes noise and difficulties, whether the data is sound, picture or context. This gets worse when the features have different scales (e.g. weight, length, area, speed, power, temperature, volume, time, cell number, etc.).

Thus, to reduce computational cost and complexity, we use PCA to transform the original variables into linear combinations of those variables that are uncorrelated with one another.
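The idea of turning correlated variables into uncorrelated linear combinations can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a made-up pair of correlated variables. The function name `pca_eig` is hypothetical, and the eigen-decomposition of the covariance matrix is one standard route among several.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigen-decomposition of the covariance matrix."""
    Xc = X - X.mean(axis=0)                   # center each variable
    cov = np.cov(Xc, rowvar=False)            # features are columns
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                     # each column is a linear combination
    return eigvals, eigvecs, scores

# Synthetic data (an assumption for illustration): two strongly correlated variables
rng = np.random.default_rng(1)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=500)])

eigvals, eigvecs, scores = pca_eig(X)

# The transformed variables have a diagonal covariance matrix:
# the off-diagonal entries (their covariances) are numerically zero.
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
```

The diagonal of `score_cov` holds the eigenvalues, i.e. the variance captured by each new variable.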

But why & when should we reduce or change dimensions?

1- Better perspective and less complexity: When we need a more realistic perspective on a data set with many features, especially when we know intuitively that we don’t need that many features.

Similarly, in many other practices, modelling is easier in 2D than in 3D, right?

2- Better visualization: When the number of dimensions is too high for a good visualization, we use PCA to project the data down to a 2D or 3D shadow. Even a less drastic reduction helps: if you go from 100 features to 10 you still cannot draw the data in 2D or 3D, but you get much more readable parallel coordinates or Andrews curves.

3- Reduce size: When we have too much data and are going to run process-intensive algorithms (like many supervised algorithms) on it, we need to get rid of redundancy.

Sometimes a change of perspective matters more than dimension reduction, and we want to exploit dimensionality:

4- Different perspective: Maybe you don’t have any of these motivations and merely want to understand your data better. PCA gives you uncorrelated combinations of features, ordered by how much variance they capture, which you can use to describe your data differently.

PCA is widely applied wherever there is a lot of data, e.g. in media editing, statistical quality control, portfolio analysis, etc., as long as we are concerned with linear relationships.

Normalize your Data before performing PCA :-

Why normalize data before PCA:-

So it’s time to address the elephant in the room!! 😛

The reason you should think about normalizing (standardizing) your data before performing PCA is that PCA is a variance-maximizing exercise. PCA calculates a new projection of your data set: it projects your original data onto the directions that maximize the variance. The new axes therefore depend on the standard deviations of your variables, and a variable with a high standard deviation gets more weight in the calculation of the axes than a variable with a low standard deviation. If you standardize your data, all variables have the same standard deviation, so all variables carry the same weight and PCA calculates relevant axes.
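The effect of scale on the first axis can be checked with a small sketch, assuming NumPy and two made-up, unrelated variables whose standard deviations differ by a factor of 100. The helpers `standardize` and `first_pc_loadings` are hypothetical names used only for this illustration.

```python
import numpy as np

def standardize(X):
    """z-score each column: subtract the mean, divide by the standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def first_pc_loadings(X):
    """Absolute weights of each variable in the first principal axis."""
    cov = np.cov(X, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)   # ascending order: last column is the top axis
    return np.abs(eigvecs[:, -1])

# Synthetic data (an assumption for illustration): same kind of signal,
# but the second variable is measured on a scale 100 times larger.
rng = np.random.default_rng(2)
small = rng.normal(scale=1.0, size=1000)
large = rng.normal(scale=100.0, size=1000)
X = np.column_stack([small, large])

raw_loadings = first_pc_loadings(X)    # dominated by the large-scale variable
Z = standardize(X)
std_after = Z.std(axis=0, ddof=1)      # both columns now have unit std
std_loadings = first_pc_loadings(Z)    # both variables weighted equally

print(raw_loadings, std_after, std_loadings)
```

Before standardization the first axis is essentially the large-scale variable on its own. Afterwards both variables enter with the same weight.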

As you know, different variables in your data set may have different units of measurement. For example, one may be a cost, another a production count, another a percentage interest in purchase. Normalizing the data is necessary to get a reasonable covariance analysis across all such variables. Another example where normalization helps is an analysis of student data in a classroom, which may include height and weight and look for a correlation between them.

The first plot below shows the share of total variance explained by each principal component when the data has not been normalized. As you can see, component one seems to explain most of the variance in the data.

In the second picture, the data has been normalized first. Here it is clear that the other components contribute as well. The reason is that PCA seeks to maximize the variance of each component. The covariance matrix of this particular dataset is:

             Murder   Assault   UrbanPop      Rape
Murder    18.970465  291.0624   4.386204  22.99141
Assault  291.062367 6945.1657 312.275102 519.26906
UrbanPop   4.386204  312.2751 209.518776  55.76808
Rape      22.991412  519.2691  55.768082  87.72916

From this structure, PCA will project as much as possible in the direction of Assault, since its variance is much greater than that of the other variables. So for finding features usable in any kind of model, a PCA without normalization would generally perform worse than one with normalization.
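The claim about Assault can be checked directly from the covariance matrix printed above. The sketch below assumes NumPy, enters the printed values rounded to four decimals, and compares the variance explained by each component with and without rescaling. Converting the covariance matrix to a correlation matrix stands in for standardizing the raw data, which is not available here. The helper name `explained_ratio` is hypothetical.

```python
import numpy as np

names = ["Murder", "Assault", "UrbanPop", "Rape"]

# Covariance matrix from the text, rounded to 4 decimals
cov = np.array([
    [ 18.9705,  291.0624,   4.3862,  22.9914],
    [291.0624, 6945.1657, 312.2751, 519.2691],
    [  4.3862,  312.2751, 209.5188,  55.7681],
    [ 22.9914,  519.2691,  55.7681,  87.7292],
])

def explained_ratio(mat):
    """Share of total variance captured by each component, largest first."""
    eigvals = np.linalg.eigvalsh(mat)[::-1]   # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

# PCA on standardized data is PCA on the correlation matrix:
# divide each covariance by the product of the two standard deviations.
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

ratio_cov = explained_ratio(cov)     # without normalization
ratio_corr = explained_ratio(corr)   # with normalization

# Which variable carries the largest weight in the first unnormalized axis?
_, eigvecs = np.linalg.eigh(cov)
dominant = names[int(np.argmax(np.abs(eigvecs[:, -1])))]

print(dominant)
print(np.round(ratio_cov, 3))
print(np.round(ratio_corr, 3))
```

Without normalization the first component takes well over 90% of the variance and is almost entirely Assault. With normalization its share drops to roughly 60% and the remaining components become visible, which matches the two plots described above.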

This is why normalizing your data is of paramount importance while applying PCA, or even ICA for that matter.

Now let’s see some of the novice mistakes people make.

Common Mistakes (or DON’Ts)

Fixing over-fit: One common mistake is using PCA to reduce over-fitting. Over-fitting is often caused by having too many features: the large number of features raises the cross-validation error because the model has high variance on the training data. People assume that reducing the number of dimensions will automatically reduce the influence of certain features and hence fix over-fitting. But PCA chooses its components by variance alone, without looking at the target, so it simply reduces the number of dimensions of your original features and may not fix the over-fit. Better options are feature selection and regularization.
Set standard: Another common mistake is believing that we HAVE TO USE PCA for every machine learning application. This is a false assumption. PCA is mainly worth using when memory or computation speed becomes an issue, or when you need to visualize or explore the data. Otherwise, we are completely fine without it.

In a nutshell, principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize. And it helps with the curse of dimensionality by capturing the essence of the data in a few principal components.