What Is Principal Component Analysis? (With Steps)

By Indeed Editorial Team

Updated 13 December 2022

Published 4 May 2022

The Indeed Editorial Team comprises a diverse and talented team of writers, researchers and subject matter experts equipped with Indeed's data and insights to deliver useful tips to help guide your career journey.

Principal component analysis (PCA) is a dimensionality-reduction method used to condense a large data set into a smaller, more manageable set. This method serves various purposes in the technology and software industries, including facial recognition and image compression. If you learn what principal component analysis is, it can help you understand how you can use this method to deal with data in your job when necessary. In this article, we define principal component analysis, discuss some important terms related to it, provide a guide to help execute it and list the benefits of using this method.

What Is Principal Component Analysis?

The answer to 'What is principal component analysis?' is quite simple in machine learning and data science. It is a method of transforming a large dataset with an extensive set of variables into a smaller one, reducing its dimensionality while retaining most of its variation. This means the smaller set still contains most of the information present in the larger set. PCA simplifies the complexity inherent in high-dimensional data while preserving the necessary trends and patterns. In the PCA method, the data transforms into fewer dimensions that summarise the pre-existing features.

This reduction can undoubtedly make data sets less accurate, but it also makes them easier to handle. Bulky sets of data with superfluous variables can be difficult to review and analyse for both people and machines. It is time intensive for machine learning algorithms to process such extraneous data. PCA also emphasises variation within a dataset and helps identify patterns. It helps data scientists identify, manage, control and minimise the relationships between variables in a large set of data.

Related: What Is Machine Learning? (Skills, Jobs And Salaries)

Some Important Terms Related To Principal Component Analysis

Here are the terms you are most likely to encounter while handling the PCA algorithm:

  • Dimensionality: Number of variables or features present in the dataset, represented by the number of columns on the spreadsheet. For example, healthcare data can have multiple dimensions, like diabetes, blood pressure and heart health.

  • Correlation: Denotes how strongly a pair of variables relate to each other, such that, if one variable changes, the other changes as well. The correlation value ranges from -1 to +1, where -1 indicates a perfect negative correlation (as one variable increases, the other decreases) and +1 indicates a perfect positive correlation (the variables increase and decrease together).

  • Orthogonal: Signifies that variables do not correlate to each other so that the correlation between the pair of variables is zero.

  • Eigenvectors: An eigenvector of a square matrix is a non-zero vector that, when multiplied by that matrix, equals a scalar multiple of itself. For example, if there is a square matrix 'X' and a non-zero vector 'w', then 'w' is an eigenvector of 'X' if 'Xw' is a scalar multiple of 'w' (see the sketch after this list).

  • Eigenvalues: Coefficients applied to eigenvectors that give them their length or magnitude.

  • Covariance: Determines the relationship between the movements of two random variables. When the variables move together, the covariance is positive and when they move inversely, the covariance is negative.

  • Covariance matrix: A matrix giving the covariance between each pair of elements of a given random vector. It is also called the dispersion matrix, variance matrix or variance-covariance matrix.
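To illustrate the eigenvector and eigenvalue definitions above, here is a minimal Python/NumPy sketch. The 2 × 2 matrix and its values are hypothetical and only serve to verify the definition:

import numpy as np

# A hypothetical 2 x 2 square matrix X (illustrative values only)
X = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and the corresponding eigenvectors
# (one eigenvector per column of the returned matrix)
eigenvalues, eigenvectors = np.linalg.eig(X)

# Verify the definition: X @ w equals the scalar multiple (eigenvalue) * w
for value, w in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(X @ w, value * w))  # True for each eigenpair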

Related: Supervised Machine Learning Examples (And How It Works)

How Does Principal Component Analysis Work?

You can conduct principal component analysis in these five steps:

1. Standardisation of data

Data standardisation is the process of transforming all variables in the initial data set onto the same scale. Standardising the data is necessary for two reasons. Firstly, it ensures that each variable contributes equally to the analysis. Secondly, it ensures that the varying ranges of your initial variables do not skew the results. As PCA is sensitive to the variances of the initial variables, sizeable differences between the variable ranges mean that variables with broader ranges overshadow those with narrower ranges.

For instance, a variable ranging between 50 and 500 can dominate one that ranges between 0 and 5, leading to inaccurate, skewed results. Transforming the data to comparable scales through standardisation averts this problem. To standardise your data, you can follow this formula:

Z = (value - mean) / standard deviation
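As a brief illustration, here is a minimal Python/NumPy sketch of this standardisation formula applied column by column. The sample values are hypothetical and chosen only to show two variables with very different ranges:

import numpy as np

# Hypothetical data set: rows are observations, columns are two variables
# with very different ranges (roughly 0-5 versus 50-500)
data = np.array([[1.2,  60.0],
                 [3.4, 250.0],
                 [0.8, 480.0],
                 [4.9, 120.0]])

# Z = (value - mean) / standard deviation, applied to each column
standardised = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardised.mean(axis=0))  # approximately 0 for each variable
print(standardised.std(axis=0))   # 1 for each variable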

Related: 10 Valuable Data Analysis Skills

2. Computation of covariance matrix

Computing the covariance matrix lets you determine whether any relationship, known as covariance, exists between the variables of your data set. This is possible by understanding how the variables vary from the mean of the input data set in relation to each other. Variables can sometimes correlate highly and contain superfluous information, which you may want to remove. Computing the covariance matrix helps identify those correlations between all possible variable pairs in the form of a table.

The covariance matrix is a symmetrical matrix that covers all possible pairs of the initial variables. It is a table that summarises the correlations between them. If the covariance is positive, the two variables increase and decrease together, signifying a direct correlation. If the covariance is negative, one variable increases when the other decreases, implying an inverse correlation between the two variables.

Example

For a two-dimensional data set with two variables, m and n, the covariance matrix is a 2 × 2 matrix of this form:

Cov(m,m)   Cov(m,n)
Cov(n,m)   Cov(n,n)

Note that in this example, the top left (Cov(m,m)) and the bottom right (Cov(n,n)) elements give the variances of the variables m and n, as the variance of a variable is the same as its covariance with itself (Var(m) = Cov(m,m)). Also, the top right (Cov(m,n)) and the bottom left (Cov(n,m)) elements are equal because covariance is commutative (Cov(m,n) = Cov(n,m)).
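For illustration, here is a minimal Python/NumPy sketch of this step. The standardised values are hypothetical and stand in for the output of step 1:

import numpy as np

# Hypothetical standardised data: columns are the variables m and n
standardised = np.array([[-0.9,  1.1],
                         [ 0.4, -0.3],
                         [-0.5, -1.2],
                         [ 1.0,  0.4]])

# rowvar=False tells NumPy that each column is a variable
cov_matrix = np.cov(standardised, rowvar=False)

print(cov_matrix)                                      # 2 x 2 symmetric matrix
print(np.isclose(cov_matrix[0, 1], cov_matrix[1, 0]))  # Cov(m,n) == Cov(n,m) -> True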

3. Calculation of eigenvectors and eigenvalues

Eigenvectors and eigenvalues, which are concepts from linear algebra, are used to determine the principal components of the covariance matrix. Principal components are new variables formed as combinations of the initial variables, constructed so that they compress most of the information into the first components and remove correlations within the data set. In other words, principal components reduce dimensionality while avoiding too much information loss, and each one represents a direction of maximum remaining variance in the data. The bigger the variance along a line, the greater the dispersion of the data points along it, and the greater the dispersion, the more information that line carries.

The eigenvectors of the covariance matrix point along the directions that contain the most variance, or information; these directions are the principal components. Eigenvalues are the coefficients associated with the eigenvectors and give the amount of variance carried by each principal component. Each eigenvector has a corresponding eigenvalue, so they always occur in pairs, and the number of eigenvector-eigenvalue pairs equals the number of data dimensions. A two-dimensional data set has two eigenvectors and two eigenvalues. To determine the principal components in order of significance, calculate the eigenvectors and list them in descending order of their eigenvalues.
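As a sketch of this step in Python/NumPy, using a hypothetical 2 × 2 covariance matrix in place of the one from step 2:

import numpy as np

# Hypothetical 2 x 2 covariance matrix (illustrative values only)
cov_matrix = np.array([[1.0, 0.6],
                       [0.6, 1.0]])

# Because the covariance matrix is symmetric, np.linalg.eigh is appropriate
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort in descending order of eigenvalue so the first principal
# component carries the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)   # e.g. [1.6, 0.4]: variance carried by each component
print(eigenvectors)  # one principal-component direction per column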

4. Creation of feature vector

After obtaining your list of principal components, you can discard those with lower eigenvalues, as they are of less significance. You can form a matrix from the remaining components, which makes up a feature vector. This means that a feature vector is a matrix whose columns are the eigenvectors of the components you decide to keep. This is how you conduct dimensionality reduction. If that is not your goal, a feature vector can still be useful for expressing your data in terms of the new variables, or principal components.
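Continuing the sketch from step 3, forming the feature vector might look like this in Python/NumPy; the eigenvector values and the choice to keep one component are hypothetical:

import numpy as np

# Eigenvectors sorted by descending eigenvalue (illustrative values)
eigenvectors = np.array([[ 0.7071,  0.7071],
                         [ 0.7071, -0.7071]])

# Keep only the first component to reduce two dimensions to one:
# the feature vector is the matrix of the eigenvectors we retain
n_components = 1
feature_vector = eigenvectors[:, :n_components]

print(feature_vector.shape)  # (2, 1): two original variables, one component kept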

5. Recasting of the data

In this last step, you use the information from the feature vector, calculated from the eigenvectors of the covariance matrix, to reorient the data from the original axes to the principal component axes. During this process, there is no alteration of the initial information, so the input data set remains the same. You simply express the original data along the new axes, remodelling it according to the new distribution of variance.

To recast and remodel the data, multiply the transpose of the feature vector by the transpose of the original data set. The formula for remodelling the data is:

Final data set = (feature vector)ᵀ × (standardised original data set)ᵀ
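As a final sketch in Python/NumPy, this formula can be applied to the hypothetical standardised data and feature vector from the earlier steps:

import numpy as np

# Hypothetical standardised data (observations in rows) and feature vector
standardised = np.array([[-0.9,  1.1],
                         [ 0.4, -0.3],
                         [-0.5, -1.2],
                         [ 1.0,  0.4]])
feature_vector = np.array([[0.7071],
                           [0.7071]])

# Final data set = (feature vector)^T x (standardised original data set)^T
final = feature_vector.T @ standardised.T

# Each column of `final` is one observation expressed along the
# retained principal component axis
print(final.shape)  # (1, 4): one component, four observations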

Related: What Does A Data Scientist Do? And How To Become One

What Are The Benefits Of Using Principal Component Analysis?

PCA benefits data scientists and machine learning engineers in the following ways:

  • Improves algorithm performance

  • Reduces data redundancy

  • Removes correlated features

  • Helps create a model that matches real physical constraints

  • Decreases variable noise

  • Reduces variable over-fitting within the model

  • Rationalises the inputs for unbalanced measurement systems

  • Eases the visualisation process

  • Brings transparency and insight regarding the phenomena under analysis


Related:

  • 11 Data Analysis Tools (Including Tips For Choosing One)

  • 14 Data Modelling Tools For Data Analysis (With Features)

  • Top 20 Big Data Tools: Big Data and Types of Big Data Jobs

  • 12 Data Transformation Tools (With Examples And FAQs)

  • What Is Data Visualisation? Importance, Types And How To

  • 13 Data Mining Techniques: A Complete Guide

  • How To Find The Range Of A Data Set (Formula And Examples)

