The first version of this article was published on my Italian blog, uomodellamansarda.com.

Between July and August, I led a cost-cutting optimization project for a UK company.

The project was based on an application of PCA.

In this article and the following ones, I will try to explain a fundamental concept that is also very useful in Machine Learning.

Let me say in advance that I will simplify the PCA theory; on YouTube you can find more detailed explanations.

Why a boring article?

Easy! For me, trying to explain a subject is the best way to learn it.

This learning method was taught by the grandmaster Feynman; for more info about the technique you can follow this link -> https://medium.com/taking-note/learning-from-the-feynman-technique-5373014ad230

Moving on: Principal Component Analysis (PCA) is a linear transformation.

Doing a PCA is like taking the list of your 50 life pillars and reducing it to 3: “La femmina, il danaro e la mortazza” (in English, “Women, Money and Mortadella”, a famous Italian quote).

[youtube https://www.youtube.com/watch?v=aLEfp7js620]

PCA allows you to reduce the number of variables and identify the most important ones, which are uncorrelated with each other.

PCA is a mathematical operation; in this case, a linear and orthogonal transformation that maps one set of variables into another.
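To make the "linear and orthogonal" part concrete, here is a minimal sketch (on toy data I made up, not the seeds dataset): PCA amounts to centering the data, diagonalizing its covariance matrix, and rotating the data onto the eigenvectors, which form an orthogonal basis.

```
# Minimal sketch: PCA as a linear, orthogonal transformation.
# We diagonalize the covariance matrix of toy 2-D data; the
# eigenvectors are the principal axes.
import numpy as np

rng = np.random.default_rng(0)
# Two correlated toy variables
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Projecting the centered data onto the eigenvectors
# is exactly the PCA transformation
projected = centered @ eigenvectors

# The rotation is orthogonal: V @ V.T is the identity matrix
print(np.allclose(eigenvectors @ eigenvectors.T, np.eye(2)))
```

Because the eigenvectors diagonalize the covariance matrix, the projected variables end up uncorrelated, which is the property we will use below.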

In the following example, I am going to apply PCA not to reduce the number of variables, but to decorrelate them.

It’s a modified version of a DataCamp.com exercise that you can find in this chapter (https://www.datacamp.com/courses/unsupervised-learning-in-python).

In the first part of the example I study the correlation between two variables; in the second part, I apply PCA.

We took 209 seeds and measured their length and width.

Then the information was saved in a CSV file.

First, I made a scatter plot to see the correlation between the two variables and calculated the Pearson coefficient; then I applied PCA to decorrelate the two variables and identify the principal components.

The component with the higher variance represents the first axis; the second one has the lower variance.

**In the general case, if we had m variables and the PCA gave us n variables, with m>n, then the second axis would be described by the component with the second-highest variance, the third one by the third-highest variance, and so on up to the n-th component.**
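This ordering can be checked directly in code. A sketch on synthetic data (the variable names and data here are my own, assuming scikit-learn is available): after fitting, scikit-learn's `PCA` exposes the variances of the components in `explained_variance_`, sorted from highest to lowest.

```
# Sketch: PCA orders its components by decreasing variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Four synthetic variables with very different spreads
data = rng.normal(size=(100, 4)) * np.array([5.0, 3.0, 1.0, 0.5])

pca = PCA()
pca.fit(data)

# explained_variance_ is sorted from the highest variance
# (first axis) down to the lowest (last axis)
print(pca.explained_variance_)
```

If you wanted to keep only n of the m original variables, you would pass `n_components=n` to `PCA()` and keep exactly the top-variance axes described above.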

In the following article I will try to illustrate the PCA concept better with a practical example, until I eventually draft a post titled **“PCA: the Definitive Guide”** or **“PCA Explained to my Grandmother”**.

```
#PCA analysis
#Importing the libraries we need: matplotlib, scipy.stats and pandas
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import pandas as pd

#Loading the file
grains = pd.read_csv("seeds-width-vs-length.csv")

#Always explore the data and how they are structured
print(grains.info())
print(grains.describe())

#Extract only the values we need to work with from the dataframe
grains = grains.values

#The 0-th dataframe column represents the seeds' width
width = grains[:,0]
#The 1st dataframe column represents the seeds' length
length = grains[:,1]

#Plotting the data as a width-length scatter plot
plt.scatter(width, length)
plt.axis('equal')
plt.show()

#Calculating the Pearson coefficient (also called correlation coefficient)
#together with the p-value of the data
correlation, pvalue = pearsonr(width, length)

#Visualizing the two calculated values
print("Correlation between width and length:", round(correlation, 4))
print("Data p-value:", pvalue)
```

Now we can compute the principal components of our variables and decorrelate the two variables using PCA.

```
#PCA analysis
#Importing the PCA module needed for the operation
from sklearn.decomposition import PCA

#Creating the PCA instance
modello = PCA()

#Applying the fit_transform method to our grains dataset:
#we obtain a new array with two new decorrelated variables
pca_features = modello.fit_transform(grains)

#Assigning the 0-th pca_features column to xs
xs = pca_features[:,0]
#Assigning the 1st pca_features column to ys
ys = pca_features[:,1]

#Plotting the two new decorrelated variables
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

#Calculating the Pearson coefficient between xs and ys
correlation, pvalue = pearsonr(xs, ys)

#Visualizing the two new results
print("Correlation between the two new variables xs and ys:", round(correlation, 4))
print("Data p-value:", pvalue)
```
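As a side note, the directions of the principal axes themselves can be read from the fitted PCA object. A sketch on synthetic data (since the seeds CSV is not bundled with this post; the data here are my own): each row of `components_` is a unit-length principal axis, and the first row is the direction of maximum variance.

```
# Sketch: reading the principal axis directions from a fitted PCA.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two correlated synthetic variables
x = rng.normal(size=300)
data = np.column_stack([x, 0.8 * x + rng.normal(scale=0.3, size=300)])

modello = PCA()
modello.fit(data)

# Rows of components_ are the principal axes, each of unit length;
# the first row points along the direction of maximum variance
first_axis = modello.components_[0]
print("First principal axis:", first_axis)
print("Unit length:", np.isclose(np.linalg.norm(first_axis), 1.0))
```

Plotting this first axis as an arrow on top of the original scatter plot is a nice way to visualize what PCA found.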

Thanks for reading!

A great hug

Andrea

If you notice any errors, ping me here or on Twitter. I am still learning 🙂