The first version of this article was published on my Italian blog, uomodellamansarda.com.
Between July and August I had the opportunity to lead a cost-cutting optimization project for a UK company.
The project was based on the application of PCA.
In this article and the following ones I will try to explain this fundamental concept, which is also very useful in Machine Learning.
Let me say in advance that I will simplify the PCA theory; on YouTube you can find more detailed treatments.
Why write a boring article? Easy! For me, trying to explain a subject is the best way to learn it.
This learning method was taught by the grandmaster Feynman; for more info about the technique you can click on the link -> https://medium.com/taking-note/learning-from-the-feynman-technique-5373014ad230
Moving on: Principal Component Analysis (PCA) is a linear transformation.
Doing a PCA is like taking a list of 50 life pillars and reducing it to 3: “La femmina, il danaro e la mortazza” (a famous Italian quote; in English, “Women, Money and Mortadella”).
[youtube https://www.youtube.com/watch?v=aLEfp7js620]
PCA allows you to reduce the number of variables and identify the most important ones, which are uncorrelated with each other.
PCA is a mathematical operation: in this case a linear, orthogonal transformation that maps the data onto a new coordinate system.
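To make the “linear and orthogonal” part concrete, here is a minimal NumPy sketch (mine, not part of the exercise below; the toy numbers are made up): the principal axes are the eigenvectors of the covariance matrix, and projecting the centered data onto them gives the decorrelated components.
import numpy as np
# Toy data: 5 samples of 2 correlated variables (hypothetical numbers)
X = np.array([[2.0, 2.1],
              [3.0, 2.9],
              [4.0, 4.2],
              [5.0, 4.8],
              [6.0, 6.1]])
# Center the data: PCA works on deviations from the mean
X_centered = X - X.mean(axis=0)
# The eigenvectors of the covariance matrix are the principal axes
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
# eigh returns eigenvalues in ascending order: flip to descending
order = eigenvalues.argsort()[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# Projecting onto the eigenvectors yields decorrelated components:
# their covariance matrix is (numerically) diagonal
components = X_centered @ eigenvectors
print(np.round(np.cov(components, rowvar=False), 10))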
In the following example I am going to apply PCA not to reduce the number of variables, but to decorrelate them.
It’s a modified version of a DataCamp.com exercise that you can find in this chapter (https://www.datacamp.com/courses/unsupervised-learning-in-python).
In the first part of the example I study the correlation between two variables; in the second part I apply PCA.
We took 209 seeds and measured their length and width.
The measurements were saved in a CSV file.
First I made a scatter plot to see the correlation between the two variables and calculated the Pearson coefficient; then I applied PCA to decorrelate the two variables and identify the principal components.
The component with the highest variance becomes the first axis; the second one has lower variance.
If we had m variables and obtained n components from the PCA, with m > n, then the second axis would be the component with the second-highest variance, the third one the component with the third-highest variance, and so on up to the n-th component.
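In scikit-learn this ordering is exposed by the explained_variance_ attribute of a fitted PCA model. A quick sketch (the random data and variable names are mine, just to show the attribute):
from sklearn.decomposition import PCA
import numpy as np
# Three hypothetical variables, two of them correlated
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))
data[:, 1] += 2 * data[:, 0]
pca = PCA()
pca.fit(data)
# One variance per principal axis, sorted from highest to lowest
print(pca.explained_variance_)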
In the following articles I will try to illustrate the PCA concept better with practical examples, until I eventually draft a post titled “PCA: the Definitive Guide” or “PCA Explained to my Grandmother”.
#PCA analysis
#Importing the libraries we need: matplotlib, scipy.stats and pandas
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import pandas as pd
#Loading the CSV file
grains = pd.read_csv("seeds-width-vs-length.csv")
#Always explore the data and how it is structured
print(grains.info())
print(grains.describe())
#Extracting only the values we need to work with from our dataframe
grains = grains.values
# The 0-th dataframe column holds the seed widths
width = grains[:,0]
# The 1st dataframe column holds the seed lengths
length = grains[:,1]
# Plotting the data
# Using a scatter plot width-length
plt.scatter(width, length)
plt.axis('equal')
plt.show()
#Calculating the Pearson coefficient
#Also called the correlation coefficient
#We also calculate the data p-value
correlation, pvalue = pearsonr(width,length)
# Visualising the two calculated values
print("Correlation between width and length:", round(correlation, 4))
print("Data P-value:",pvalue)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 2 columns):
3.312    209 non-null float64
5.763    209 non-null float64
dtypes: float64(2)
memory usage: 3.3 KB
None
            3.312       5.763
count  209.000000  209.000000
mean     3.258349    5.627890
std      0.378603    0.444029
min      2.630000    4.899000
25%      2.941000    5.262000
50%      3.232000    5.520000
75%      3.562000    5.980000
max      4.033000    6.675000
Correlation between width and length: 0.8604
Data P-value: 1.5696623081483666e-62
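As a sanity check (my own addition, not part of the exercise), the Pearson coefficient is just the covariance of the two variables divided by the product of their standard deviations, so we can reproduce pearsonr by hand:
import numpy as np
# Pearson's r = cov(width, length) / (std(width) * std(length))
# ddof=1 matches the sample statistics that np.cov uses by default
manual_r = np.cov(width, length)[0, 1] / (np.std(width, ddof=1) * np.std(length, ddof=1))
print(round(manual_r, 4))  # should print 0.8604, matching pearsonr above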
Now we can compute the principal components and decorrelate the two variables using PCA.
#Loading the library module we need for this operation
#PCA Analysis
# Import PCA
from sklearn.decomposition import PCA
#Creating the PCA instance
modello = PCA()
#Applying the fit_transform method to our grains dataset
#We obtain a new array with two new decorrelated variables
pca_features = modello.fit_transform(grains)
#Assigning the 0-th pca_features column to xs
xs = pca_features[:,0]
#Assigning the 1st pca_features column to ys
ys = pca_features[:,1]
#Plotting the two new decorrelated variables
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()
# Calculating the Pearson coefficient of xs and ys
correlation, pvalue = pearsonr(xs, ys)
#Visualizing the two new results
print("Correlation between two new variables xs and ys ", round(correlation,4))
print("Data P-value",pvalue)
Thanks for reading!
A great hug
Andrea
If you notice any errors, ping me here or on Twitter. I am still learning 🙂