The first version of this article was published on my Italian blog, uomodellamansarda.com.
Between July and August I led a cost-cutting optimization project for a UK company.
The project was based on the application of PCA.
In this article and the following ones, I will try to explain a fundamental concept that is also very useful in Machine Learning.
Let me say in advance that I will simplify the PCA theory; you can find more details on YouTube.
Why a boring article?
Easy! For me, trying to explain a subject is the best way to learn it.
This learning method was taught by the grandmaster Feynman; for more info about the technique you can click on the link -> https://medium.com/taking-note/learning-from-the-feynman-technique-5373014ad230
Moving on: Principal Component Analysis (PCA) is a linear transformation.
Doing a PCA is like taking a list of 50 life pillars and reducing it to 3: “La femmina, il danaro e la mortazza” (a famous Italian quote; in English, “Women, Money and Mortadella”).
PCA allows you to reduce the number of variables and to identify the most important ones, which are uncorrelated with each other.
PCA is a transformation, i.e. a mathematical operation; in this case it is a linear, orthogonal transformation that maps one set of variables into another.
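To make the idea of a linear, orthogonal transformation concrete, here is a minimal NumPy sketch on made-up toy data (the variable names and numbers are mine, not from the seeds dataset): centering the data and projecting it onto the eigenvectors of its covariance matrix is the linear transformation PCA performs, and the resulting variables come out uncorrelated.

```python
import numpy as np

# Toy data: two strongly correlated variables (hypothetical values)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=200)])

# Center the data, then diagonalize its covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The PCA step is a single linear map: project the centered data
# onto the orthogonal eigenvectors (the principal axes)
transformed = centered @ eigenvectors

# The covariance of the transformed variables is diagonal,
# i.e. the new variables are uncorrelated
print(np.round(np.cov(transformed, rowvar=False), 4))
```

The eigenvector matrix is orthogonal, which is exactly the "orthogonal transformation" mentioned above.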
In the following example I am going to apply PCA not to reduce the number of variables, but to decorrelate them.
It’s a modified version of a DataCamp.com exercise that you can find in this chapter (https://www.datacamp.com/courses/unsupervised-learning-in-python).
In the first part of the example I study the correlation between the two variables; in the second part I apply PCA.
We took 209 seeds and measured their length and width.
The measurements were then saved in a CSV file.
First I made a scatter plot to see the correlation between the two variables and calculated the Pearson coefficient; then I applied PCA to decorrelate the two variables and identify the principal components.
The component with the highest variance defines the first axis; the second component has the next highest variance.
If we had m variables and obtained n variables from the PCA, with m > n, then the second axis would be described by the component with the second highest variance, the third by the third highest, and so on up to the n-th component.
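As a quick sanity check of this ordering, here is a hedged sketch on synthetic data (the scales and shapes are assumptions of mine): scikit-learn's PCA sorts its components by decreasing explained variance, so keeping n components keeps the n highest-variance axes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: three independent variables with different spreads
rng = np.random.default_rng(1)
data = rng.normal(scale=[5.0, 2.0, 0.5], size=(300, 3))

pca = PCA()
pca.fit(data)

# The component variances come out in decreasing order,
# so the first axis is the direction of highest variance
print(pca.explained_variance_)

# Asking for n_components=2 simply drops the lowest-variance axis
reduced = PCA(n_components=2).fit_transform(data)
print(reduced.shape)
```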
In the following articles I will try to illustrate the PCA concept better with practical examples, until I can draft a post with the title “PCA: the Definitive Guide” or “PCA Explained to my Grandmother”.
```python
# PCA analysis
# Importing the needed libraries: matplotlib, scipy.stats and pandas
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import pandas as pd

# Loading the file
grains = pd.read_csv("seeds-width-vs-length.csv")

# Always explore the data and how it is structured
print(grains.info())
print(grains.describe())

# Extract only the values we need to work with from the dataframe
grains = grains.values

# The 0-th dataframe column represents seed width
width = grains[:, 0]
# The 1st dataframe column represents seed length
length = grains[:, 1]

# Plotting the data with a width-length scatter plot
plt.scatter(width, length)
plt.axis('equal')
plt.show()

# Calculating the Pearson coefficient
# (also called the correlation coefficient)
# together with the p-value of the data
correlation, pvalue = pearsonr(width, length)

# Displaying the two calculated values
print("Correlation between width and length:", round(correlation, 4))
print("Data P-value:", pvalue)
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 2 columns):
3.312    209 non-null float64
5.763    209 non-null float64
dtypes: float64(2)
memory usage: 3.3 KB
None
            3.312       5.763
count  209.000000  209.000000
mean     3.258349    5.627890
std      0.378603    0.444029
min      2.630000    4.899000
25%      2.941000    5.262000
50%      3.232000    5.520000
75%      3.562000    5.980000
max      4.033000    6.675000
```
```
Correlation between width and length: 0.8604
Data P-value: 1.5696623081483666e-62
```
Now we can compute the principal components of our variables and decorrelate the two variables using PCA.
```python
# PCA analysis
# Loading the library module for the operation
from sklearn.decomposition import PCA

# Creating the PCA instance
modello = PCA()

# Applying the fit_transform method to our grains dataset:
# we obtain a new array with two new decorrelated variables
pca_features = modello.fit_transform(grains)

# Assigning the 0-th pca_features column to xs
xs = pca_features[:, 0]
# Assigning the 1st pca_features column to ys
ys = pca_features[:, 1]

# Plotting the two new decorrelated variables
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculating the Pearson coefficient between xs and ys
correlation, pvalue = pearsonr(xs, ys)

# Displaying the two new results
print("Correlation between the two new variables xs and ys:", round(correlation, 4))
print("Data P-value:", pvalue)
```
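If you want to verify that the transformation really is orthogonal and that the new features really are decorrelated, you could add a check like this sketch. It uses synthetic width/length data in place of the CSV file, so the numbers are illustrative only; the variable names are my own.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated data standing in for the grains array
rng = np.random.default_rng(2)
width = rng.normal(3.25, 0.38, size=209)
length = 0.9 * width + rng.normal(2.7, 0.25, size=209)
grains = np.column_stack([width, length])

model = PCA()
features = model.fit_transform(grains)

# The principal axes are orthogonal unit vectors:
# components_ times its transpose should be the identity matrix
print(np.round(model.components_ @ model.components_.T, 4))

# ...and the transformed features are uncorrelated
print(round(float(np.corrcoef(features[:, 0], features[:, 1])[0, 1]), 4))
```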
Thanks for reading!
A great hug
If you notice any error, ping me here or on Twitter. I am still learning 🙂