What Is Principal Component Analysis? (With Steps)
By Indeed Editorial Team
Updated 13 December 2022
Published 4 May 2022
The Indeed Editorial Team comprises a diverse and talented team of writers, researchers and subject matter experts equipped with Indeed's data and insights to deliver useful tips to help guide your career journey.
Principal component analysis (PCA) is a dimensionality-reduction method used to condense a large data set into a smaller, more manageable set. This method serves various purposes in technology and software industries, including the use of facial recognition software and image compression. If you learn what principal component analysis is, it can help you understand how you can use this method to deal with data in your job when necessary. In this article, we define principal component analysis, discuss some important terms related to it, provide a guide to help execute it and list the benefits of using this method.
What Is Principal Component Analysis?
The answer to 'What is principal component analysis?' is quite simple in machine learning and data science. It is a method for transforming a massive data set with an extensive set of variables into a modest one, reducing its dimensionality while preserving most of its variation. This means the smaller set still contains most of the information present in the larger one. PCA simplifies the complexity inherent in high-dimensional data while retaining the necessary trends and patterns. In the PCA method, the data transforms into fewer dimensions that summarise the pre-existing features.
This reduction can undoubtedly make data sets less accurate, but it also makes them easier to handle. Bulky sets of data with superfluous variables can be difficult to review and analyse for both people and machines. It is time intensive for machine learning algorithms to process such extraneous data. PCA also emphasises variation within a dataset and helps identify patterns. It helps data scientists identify, manage, control and minimise the relationships between variables in a large set of data.
Some Important Terms Related To Principal Component Analysis
Here are the terms you are most likely to encounter while working with the PCA algorithm:
Dimensionality: Number of variables or features present in the dataset, represented by the number of columns on the spreadsheet. For example, healthcare data can have multiple dimensions, like diabetes, blood pressure and heart health.
Correlation: Denotes how strongly a pair of variables relate to each other, such that, if one variable changes, the other changes as well. The correlation value ranges from -1 to +1, where -1 indicates that variables are inversely proportional to each other and +1 shows that variables are directly proportional to each other.
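To make the -1 to +1 range concrete, here is a minimal sketch using NumPy's `corrcoef` function; the data values are made up for illustration:

```python
import numpy as np

# Made-up variables: y rises in step with x, z falls as x rises.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # directly proportional to x
z = np.array([10.0, 8.0, 6.0, 4.0, 2.0])   # inversely proportional to x

# corrcoef returns a correlation matrix; [0, 1] is the cross-correlation.
print(np.corrcoef(x, y)[0, 1])  # +1.0, perfect positive correlation
print(np.corrcoef(x, z)[0, 1])  # -1.0, perfect negative correlation
```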
Orthogonal: Signifies that variables do not correlate with each other, meaning the correlation between the pair of variables is zero.
Eigenvectors: An eigenvector of a square matrix is a non-zero vector that equals a scalar multiple of itself when you multiply the matrix by it. For example, if there is a square matrix 'X' and a non-zero vector 'w', then 'w' is an eigenvector of 'X' if 'Xw' is a scalar multiple of 'w'.
Eigenvalues: Coefficients applied to eigenvectors that give them their length or magnitude.
Covariance: Determines the relationship between the movements of two random variables. When the variables move together, the covariance is positive and when they move inversely, the covariance is negative.
Covariance matrix: Matrix giving the covariance between each pair of elements of a given random vector. It is also called the dispersion matrix, variance matrix or variance-covariance matrix.
How Does Principal Component Analysis Work?
You can conduct principal component analysis in these five steps:
1. Standardisation of data
Data standardisation is the process of transforming all variables in the initial data set onto the same scale. Standardising the data is necessary for two reasons. Firstly, it ensures an equal contribution from each variable to the analysis. Secondly, it ascertains that the varying ranges of your initial variables do not skew the results. As PCA is sensitive to the variances of the initial variables, if there are sizeable differences between the variable ranges, variables with broader ranges are going to overshadow those with narrower ranges.
For instance, a variable ranging between 50 and 500 can dominate over one that ranges between 0 and 5. This leads to inaccurate, warped results. If the data transforms to comparable scales through standardisation, you can avert this problem. To standardise your data, you can follow this formula:
Z = (value - mean) / standard deviation
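As a brief sketch, the z-score formula above can be applied column by column with NumPy; the data values below are made up for illustration:

```python
import numpy as np

# Made-up data: two variables on very different scales.
data = np.array([
    [50.0, 0.5],
    [200.0, 1.5],
    [350.0, 3.0],
    [500.0, 5.0],
])

# Z = (value - mean) / standard deviation, computed per column.
standardised = (data - data.mean(axis=0)) / data.std(axis=0)

print(standardised.mean(axis=0))  # each column's mean is now ~0
print(standardised.std(axis=0))   # each column's standard deviation is now 1
```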
2. Computation of covariance matrix
Computing the covariance matrix lets you determine the existence of any relationship between the variables of your data set, also known as covariance. This is possible by understanding how the variables vary from the mean of the input data set in relation to each other. Variables can correlate highly sometimes and might contain superfluous information, which has to be removed. Computing the covariance matrix helps identify those correlations between all possible variable pairs in the form of a table display.
The covariance matrix is a symmetrical matrix that includes all possible pairs of the initial variables. It is a table that summarises the correlations between them. If the covariance is positive, the variables increase and decrease together, signifying a direct correlation between them. If the covariance is negative, one variable increases when the other decreases, implying an inverse correlation between the two variables.
For a two-dimensional data set with two variables, m and n, the covariance matrix is a 2 × 2 matrix of this form:
Cov(m,m) Cov(m,n)
Cov(n,m) Cov(n,n)
Note that in this example, the top left (Cov(m,m)) and the bottom right (Cov(n,n)) elements give the variances of the variables m and n, as the variance of a variable is the same as its covariance with itself (Var(m)=Cov(m,m)). Also, the top right (Cov(m,n)) and the bottom left (Cov(n,m)) elements are equal because covariance is commutative (Cov(m,n)=Cov(n,m)).
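These properties can be checked directly with NumPy's `cov` function; the observations below are made up for illustration:

```python
import numpy as np

# Made-up standardised observations; columns are the variables m and n.
X = np.array([
    [-1.2,  1.1],
    [-0.4,  0.3],
    [ 0.3, -0.2],
    [ 1.3, -1.2],
])

# np.cov treats rows as variables by default, hence rowvar=False here.
C = np.cov(X, rowvar=False)

print(C)
# C[0, 0] is Var(m) and C[1, 1] is Var(n); C[0, 1] equals C[1, 0]
# because Cov(m, n) = Cov(n, m), so the matrix is symmetric.
```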
3. Calculation of eigenvectors and eigenvalues
Eigenvectors and eigenvalues, which are concepts from linear algebra, determine the principal components of the covariance matrix. Principal components are new variables formed by combining the initial variables so that most of their information compresses into the first components. This removes correlations within the data set. Principal components reduce dimensionality while avoiding excessive information loss, and each represents a direction of maximum variance in the data. The bigger the variance along a line, the greater the dispersion of the data points along it, and the greater the dispersion, the more information it carries.
The eigenvectors of the covariance matrix are the directions that contain the most variance or information. These are the principal components. Eigenvalues are the coefficients associated with the eigenvectors and give the amount of variance carried by each principal component. Each eigenvector has a corresponding eigenvalue, so they always occur in pairs. The number of data dimensions is equal to the number of eigenvectors and eigenvalues: a two-dimensional data set has two eigenvectors and two eigenvalues. To determine the principal components in order of significance, calculate the eigenvectors and list them in descending order of their eigenvalues.
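A minimal sketch of this step with NumPy, assuming a hypothetical covariance matrix from a two-variable data set:

```python
import numpy as np

# Hypothetical covariance matrix of a two-variable data set.
C = np.array([[2.0, 0.8],
              [0.8, 0.6]])

# eigh is suited to symmetric matrices; it returns eigenvalues ascending.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Reorder so the most significant principal component comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Sanity check of the definition: C @ w equals eigenvalue * w for each pair.
for value, w in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(C @ w, value * w)

print(eigenvalues)  # descending: the first component explains the most variance
```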
4. Creation of feature vector
After obtaining your list of principal components, you can discard those with lower eigenvalues, as they are of less significance. You can form a matrix of vectors with the remaining components, which makes up a feature vector. This means that a feature vector is a matrix whose columns are the eigenvectors of the components you decide to keep. This is how you conduct dimensionality reduction. If that is not your goal, a feature vector can still be useful for expressing your data in terms of the new variables or principal components.
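A brief sketch of building the feature vector, assuming hypothetical eigen-decomposition results that are already sorted in descending order:

```python
import numpy as np

# Hypothetical eigenvalues (descending) and matching eigenvector columns.
eigenvalues = np.array([2.4, 0.2])
eigenvectors = np.array([[0.92, -0.39],
                         [0.39,  0.92]])

# Keep only the components carrying most of the variance (here, the first).
k = 1
feature_vector = eigenvectors[:, :k]

print(feature_vector.shape)  # (2, 1): two original variables, one component kept
```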
5. Recasting of the data
In this last step, you use the information in the feature vector, calculated from the eigenvectors of the covariance matrix, to reorient the data from the original axes to the principal component axes. During this process, there is no alteration of the initial information, so the input data set remains the same. You simply re-express the original data in terms of the new variables, remodelling it along the directions of greatest variance.
To recast and remodel the data, multiply the transpose of the feature vector by the transpose of the original data set. The formula for remodelling the data is:
Final data set = (feature vector)^T × (standardised original data set)^T
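The five steps above can be combined into one short end-to-end sketch with NumPy; the observations are made up for illustration:

```python
import numpy as np

# Made-up data: four observations of two variables.
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2]])

# 1. Standardise each column.
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# 2. Covariance matrix of the standardised data.
C = np.cov(Z, rowvar=False)

# 3. Eigenvectors and eigenvalues, sorted by descending eigenvalue.
values, vectors = np.linalg.eigh(C)
order = np.argsort(values)[::-1]
vectors = vectors[:, order]

# 4. Feature vector: keep only the first principal component.
feature_vector = vectors[:, :1]

# 5. Recast: (feature vector)^T times (standardised data)^T.
final = feature_vector.T @ Z.T

print(final.shape)  # one row (the kept component) by four columns (observations)
```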
What Are The Benefits Of Using Principal Component Analysis?
PCA benefits data scientists and machine learning engineers in the following ways:
Improves algorithm performance
Reduces data redundancy
Removes correlated features
Helps create a model that matches real physical constraints
Decreases variable noise
Reduces variable over-fitting within the model
Rationalises the inputs for unbalanced measurement systems
Eases the visualisation process
Brings transparency and insight regarding the phenomena under analysis