PCA Second Chapter
I am not sure my previous explanation of PCA was clear; I suspect it was not.
Let me try again.
PCA is a very common technique in Machine Learning; the acronym stands for Principal Component Analysis.
Imagine that, for some unlucky reason, you HAVE TO buy your girlfriend a present: a bag (I wouldn’t wish it on any man).
Imagine yourself and everything you know about the topic “bags for women”.
Imagine yourself and that knowledge, which goes:
- Starting from the trolley you bought for your last high school trip, the same trolley that your mother and your girlfriend hope every day to throw away. The same trolley that is still perfectly good for you, despite the big stain left by your friend known as “Er bresaola”.
- Ending with the laptop backpack, with the exception of the briefcase you received on graduation day because now “You are a big boy”, but which you never used because it could hold only a laptop and its charger.
YOU, really YOU, have to buy a bag.
You start classifying the products:
- Price Range (5-5000 https://www.chanel.com/it_IT/moda/p/hdb/a57739y83868/a57739y8386894305/borsa-shopping-grande-montone-a-pelo-lungo-metallo-finitura-rutenio-nero.html)
- Brand (Low Cost-Amazon-High Fashion)
- Color (Blue-Red-Black) (you have at least 216 shades for each color)
- Dimension (Clutch-Trousse-Hobo Bag-Shopping Bag)
- Where to buy (Distance from home)
- How to pay it (Cash-20Y Mortgage-Bitcoin-Robbery, because you don’t have the money)
- Intraperiod Time Analysis (Spring-Summer-Autumn-Winter)
- Interperiod Time Analysis (Outlet-Latest Releases)
We have at least eight variables. Hard times, right?
Furthermore, you cannot avoid the purchase because you have to make amends.
You do not know why you are guilty, but there is always a good reason: as a man, you are guilty by definition.
PCA helps you simplify the problem and the input data for your fateful choice.
Some variables in our problem are in some way redundant, so we can aggregate them.
For example, “Brand”, “Price Range” and “How to pay it” could be aggregated into one variable.
This is what PCA does.
Are we discarding the redundant variables?
No, not least because we know that any mistake here would turn into a big deal.
You still consider all the variables in your analysis, but you transform them using this technique.
PCA builds new variables by combining the original ones and lets you keep the most meaningful.
This is a fundamental point: in my last post I talked about a “reduction”, but that does not mean we are throwing some of the original variables away (in mathematical terms, each new variable is a linear combination of the original ones).
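To make the “linear combination” idea concrete, here is a minimal sketch with NumPy. The six bags, their prices, and the numeric encodings for brand tier and payment effort are all made up for illustration; only the mechanics matter. It shows that the new variable is a weighted sum in which every original variable takes part.

```python
import numpy as np

# Hypothetical toy data: 6 bags, 3 numeric features (all values are made up).
# Columns: price, brand tier (1 = low cost .. 3 = high fashion),
# payment effort (1 = cash .. 4 = mortgage).
X = np.array([
    [20.0, 1.0, 1.0],
    [45.0, 1.0, 1.0],
    [150.0, 2.0, 2.0],
    [300.0, 2.0, 2.0],
    [1500.0, 3.0, 3.0],
    [5000.0, 3.0, 4.0],
])

# Standardize, so that price does not dominate just because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance matrix of the standardized data.
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so the last column
# holds the weights of the first principal component.
weights = eigenvectors[:, -1]
if weights.sum() < 0:
    weights = -weights  # the sign is arbitrary; flip it for readability

# The new variable: one weighted sum of ALL three original variables per bag.
new_variable = Z @ weights

print("weights:", weights)
print("new variable per bag:", new_variable)
```

None of the three weights is zero: nothing has been thrown away, the three columns have simply been blended into one.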
In our case study, we reduced our variables from 8 to 6 by merging those three into one.
With the new transformation, we identified a variable whose value changes considerably from one bag to another.
This is a key point because it lets us differentiate and identify the different bag categories.
From a mathematical perspective, we identified a new variable characterized by the highest variance.
That’s why it is called the “Principal Component”: it is the new variable with the highest variance.
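The “highest variance” claim can be checked on simulated data. This is only a sketch under my own assumptions: the 200 bags are random numbers, and the hidden `luxury` factor that drives price, brand and payment is a hypothetical device to create redundant variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bags = 200

# Hypothetical hidden "luxury level" that drives three redundant variables.
luxury = rng.normal(size=n_bags)
price = luxury + 0.1 * rng.normal(size=n_bags)
brand = luxury + 0.1 * rng.normal(size=n_bags)
payment = luxury + 0.1 * rng.normal(size=n_bags)
# Two variables unrelated to luxury.
distance = rng.normal(size=n_bags)
season = rng.normal(size=n_bags)

X = np.column_stack([price, brand, payment, distance, season])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort the components from highest to lowest variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each new variable.
explained_ratio = eigenvalues / eigenvalues.sum()

# Variance of the data projected on each component equals its eigenvalue.
scores = Z @ eigenvectors
component_variances = scores.var(axis=0, ddof=1)

print("explained variance ratio:", explained_ratio)
print("variance of each new variable:", component_variances)
```

The first entry of `explained_ratio` is the largest: that new variable is the principal component, and here it mostly blends the three redundant columns.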
Now we know how to classify bags. So, which one should you choose?
This point falls outside the scope of PCA, sorry about that.
In classic economic theory, for the rational man, this problem would not exist. He would only look at:
- Volume to carry
- Minimizing the cost per unit of volume carried (€/cm³)
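The rational man's procedure fits in a few lines. The catalog below, its prices, volumes and the required volume are all invented for the sake of the example.

```python
# Hypothetical catalog: (name, price in EUR, volume in cm^3). All values are made up.
bags = [
    ("clutch", 80.0, 1_500.0),
    ("hobo bag", 120.0, 12_000.0),
    ("shopping bag", 60.0, 20_000.0),
    ("trolley", 90.0, 45_000.0),
]

required_volume = 10_000.0  # cm^3 to carry (assumed)

# Keep only the bags that are big enough, then minimize EUR per cm^3.
candidates = [bag for bag in bags if bag[2] >= required_volume]
best = min(candidates, key=lambda bag: bag[1] / bag[2])

print(best[0])
```

With these made-up numbers the winner is the trolley, which says a lot about how far this approach will get you.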
It works this way only in the engineers’ world. A time series analysis of past purchases could solve the problem, but it would not be easy.
“Being an engineer is an illness. One could ask the wife of an engineer: ‘How is your husband? Is he still an engineer?’ And she could reply: ‘No, he is getting better now.’” - Luciano De Crescenzo, Bellavista Thoughts
A frequency analysis of past purchases, using Bayes’ Theorem, could help you buy the “most frequent bag”, which is not necessarily “the bag that will make her happiest”.
What you could do is assign different weights to the variables and then run an analysis on the weighted frequencies.
One way could be to rate bags used on Saturday nights higher than everyday ones.
Then you choose the model with the highest weighted score and, with some probability, you will have chosen the alternative that maximizes the target (or minimizes the error).
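Here is a rough sketch of the weighted frequency idea. The usage log and the weights (a Saturday night counts three times a weekday) are assumptions of mine, not measured data.

```python
from collections import defaultdict

# Hypothetical log of past usage: (bag model, occasion). All entries are made up.
usage_log = [
    ("shopping bag", "weekday"),
    ("shopping bag", "weekday"),
    ("shopping bag", "weekday"),
    ("clutch", "saturday_night"),
    ("clutch", "saturday_night"),
    ("hobo bag", "weekday"),
    ("hobo bag", "saturday_night"),
]

# Assumed weights: a Saturday night counts three times as much as a weekday.
occasion_weight = {"weekday": 1.0, "saturday_night": 3.0}

raw_counts = defaultdict(float)
weighted_scores = defaultdict(float)
for model, occasion in usage_log:
    raw_counts[model] += 1
    weighted_scores[model] += occasion_weight[occasion]

most_frequent = max(raw_counts, key=raw_counts.get)
best_weighted = max(weighted_scores, key=weighted_scores.get)

print("most frequent bag:", most_frequent)
print("highest weighted score:", best_weighted)
```

In this toy log the most frequent bag is the shopping bag, while the weighted score picks the clutch: the two criteria can disagree, which is the whole point of weighting.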
In this post, the description of PCA is highly qualitative and I have simplified a lot of hypotheses.
In the last post, you can see how the correlation between the variables and the p-value change, through a small Python script.
Thanks for reading.
If you spot any mistake, ping me; it is always appreciated.