PCA Part 2, for the Unlucky Boyfriend/Husband

PCA Second Chapter

I do not know if my previous explanation of PCA was clear; I do not think it was.

So I will try again.

PCA is a very common technique in Machine Learning; the acronym stands for Principal Component Analysis.

Imagine that, for some unlucky reason, you HAVE TO buy a present for your girlfriend: a bag (something I wouldn't wish on any man).

Imagine you, with your knowledge of the topic "bags for women".

Imagine you, whose experience of bags:

  • Starts with the trolley you bought for your last high-school trip, the same trolley that your mother and your girlfriend hope every day you will throw away, and that you still consider perfectly fine despite the big stain left by your friend "Er bresaola".
  • Ends with the laptop backpack, with the exception of the briefcase you received on graduation day because now "you are a big boy", which you never used because it could only hold a laptop and its charger.

YOU, really YOU, have to buy a bag.

You start classifying the products:

We have at least eight variables to consider. Hard times, right?

Furthermore, you cannot avoid the purchase, because you have to make amends.

You do not know why you are guilty, but there is always a good reason: as a man, you are guilty by definition.

PCA helps you to simplify the problem and the input data for your fateful choice.

Some variables in our problem are somewhat redundant, and we can aggregate them.

For example, "Brand", "Price Range" and "How to Pay for It" could be aggregated into one variable.

This is what PCA does.

Are we discarding redundant variables?

No, not least because we know that any mistake here will become a big deal.

You still consider all the variables in your classification, but you transform them using this technique.

PCA lets you build new variables by aggregating the original ones into the most meaningful combinations.

This is a fundamental point: in my last post I talked about a "reduction", but this does not mean that we are discarding variables (in mathematical terms, we are taking linear combinations of them).

In our case study, we reduced our variables from 8 to 6.

With the new transformation, we identified a variable whose value changes considerably from bag to bag.

This is a key point, because it allows us to differentiate the bags and identify different bag categories.

From a mathematical perspective, we identified a new variable characterized by the largest variance.

That is why it is called the "Principal Component": it is the variable with the highest variance.
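
Here is a minimal sketch of that idea in Python with scikit-learn. The data are purely hypothetical: 100 made-up bags described by 8 numeric variables, reduced to 6 linear combinations ordered by variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 hypothetical bags described by 8 made-up numeric variables
bags = rng.normal(size=(100, 8))

pca = PCA(n_components=6)            # keep 6 linear combinations of the 8 variables
bags_reduced = pca.fit_transform(bags)

print(bags_reduced.shape)            # (100, 6)
print(pca.explained_variance_ratio_) # the first component carries the largest share of variance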

Now we know how to classify bags; so, which one should you choose?

This question falls outside PCA, sorry about that.

In classical economic theory, for the rational man this problem would not exist. He would simply:

  • Estimate the volume to carry
  • Minimize the cost per unit of volume carried (€/cm3)

It only works this way in the engineers' world. A time-series analysis of past purchases could solve the problem, but it will not be easy.

 

"Being an engineer is an illness. You could ask an engineer's wife: 'How is your husband? Is he still an engineer?' And she could reply: 'No, now he is getting better.'" - Luciano De Crescenzo, Bellavista Thoughts

A frequency analysis of past purchases, using Bayes' Theorem, could help you buy the "most frequent bag", which is not necessarily "the bag that will make her happiest".

What you could do is assign different weights and then run an analysis on the weighted frequencies.

One way could be to give a higher weight to bags used on Saturday nights than to everyday ones.

Then you choose the model with the highest weighted score and, with some probability, you will have picked the alternative that maximizes the target (or minimizes the error).
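
As a rough illustration of this weighted-frequency idea, here is a minimal sketch with an entirely made-up usage log; categories, days and weights are all hypothetical.

import pandas as pd

# hypothetical usage log: which bag category was used and on which day
usage = pd.DataFrame({
    "category": ["clutch", "tote", "clutch", "backpack", "tote", "clutch"],
    "day":      ["Saturday", "Monday", "Saturday", "Tuesday", "Saturday", "Friday"],
})

# arbitrary weights: Saturday-night bags count three times as much as everyday ones
usage["weight"] = usage["day"].apply(lambda d: 3.0 if d == "Saturday" else 1.0)

weighted_freq = usage.groupby("category")["weight"].sum().sort_values(ascending=False)
print(weighted_freq)
print("Candidate bag:", weighted_freq.idxmax())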

In this post the description of PCA is highly qualitative and I have simplified many of the hypotheses.

In the previous post, you can see how the correlation between the variables and the p-value change, through a small Python script.

 

Thanks for reading.

If you spot any mistake you can ping me; feedback is always appreciated.

PCA (Chapter One)

The first version of this article was published on my Italian blog, uomodellamansarda.com.
Between July and August, I had the chance to lead a cost-cutting optimization project for a UK company.
The project could be based on an application of PCA.
In this article and the following one, I will try to explain this fundamental concept.
It is also very useful in Machine Learning.
Let me say in advance that I will simplify the theory behind PCA; on YouTube you can find more detailed material.
Why write such a boring article?
Easy! For me, trying to explain a subject is the best way to learn it.
This learning method was taught by the grand master Feynman; for more info about the technique you can follow this link -> https://medium.com/taking-note/learning-from-the-feynman-technique-5373014ad230

Moving on: Principal Component Analysis (PCA) is a linear transformation.
Doing a PCA is like taking a list of 50 life pillars and reducing it to 3: "La femmina, il danaro e la mortazza" (in English, "Women, money and mortadella", a famous Italian quote).
https://www.youtube.com/watch?v=aLEfp7js620

PCA allows you to reduce the number of variables and to identify the most important ones, which are not correlated with each other.
PCA is a transformation, a mathematical operation; in this case it is a linear, orthogonal transformation that maps one set of variables into another.
In the following example, I am going to apply PCA not to reduce the number of variables, but to decorrelate them.

It is a modified version of a DataCamp.com exercise that you can find in this course: https://www.datacamp.com/courses/unsupervised-learning-in-python.
In the first part of the example I study the correlation between two variables; in the second part I apply PCA.
We took 209 seeds and measured their length and width.

The measurements were then saved in a *.csv file.

First, I made a scatter plot to see the correlation between the two variables and calculated the Pearson coefficient; then I applied PCA to decorrelate the two variables and identify the principal components.
The component with the higher variance represents the first axis; the second one has the lower variance.

If we had m variables and obtained n variables from the PCA, with m > n, then the second axis would be described by the variable with the second-highest variance, the third axis by the one with the third-highest variance, and so on down to the n-th variable.
In the following articles I will try to illustrate the PCA concept better with practical examples, until I eventually draft a post titled "PCA: the Definitive Guide" or "PCA Explained to My Grandmother".

In [10]:
#PCA analysis 
#Importing libraries needed matplot scipy.stats and pandas 
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import pandas as pd
#loading file 
grains=pd.read_csv("seeds-width-vs-length.csv")
#Always exploring data and how our data are structured
print(grains.info())
print(grains.describe())
#extract only the values we need to work with from the dataframe
grains=grains.values
# 0-th dataframe column represent seeds width
width = grains[:,0]

#1-th dataframe column represent seeds length
length = grains[:,1]

# Plotting the data
# Using a scatter plot width-length
plt.scatter(width, length)
plt.axis('equal')
plt.show()

#Calculating the Pearson coefficient
#Also called the correlation coefficient
#We also calculate the p-value of the data
correlation, pvalue = pearsonr(width,length)

# Visualising the two calculated values
print("Correlation between width and length:", round(correlation, 4))
print("Data P-value:",pvalue)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 2 columns):
3.312    209 non-null float64
5.763    209 non-null float64
dtypes: float64(2)
memory usage: 3.3 KB
None
            3.312       5.763
count  209.000000  209.000000
mean     3.258349    5.627890
std      0.378603    0.444029
min      2.630000    4.899000
25%      2.941000    5.262000
50%      3.232000    5.520000
75%      3.562000    5.980000
max      4.033000    6.675000
Correlation between width and length: 0.8604
Data P-value: 1.5696623081483666e-62

Now we can compute the principal components and decorrelate the two variables using PCA.

In [4]:
#Loading library module for the operation
#PCA Analysis 
# Import PCA
from sklearn.decomposition import PCA

#Creating the PCA instance
modello = PCA()

#Applying  fit_transform method to our dataset on grains
#Now we obtained a new array with two new decorrelated variables
pca_features = modello.fit_transform(grains)

#Assigning the 0-th pca_features column to xs
xs = pca_features[:,0]

#Assigning the 1-th pca_features column to ys
ys = pca_features[:,1]

#Plotting the two new decorrelated variables
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculating the Pearson coefficient of xs and ys
correlation, pvalue = pearsonr(xs, ys)

#Visualizing the two new results
print("Correlation between two new variables xs and ys ", round(correlation,4))
print("Data P-value",pvalue)
Correlation between two new variables xs and ys -0.0
"Data P-value 1.0"

Thanks for reading!
A great hug
Andrea

If you notice any error, ping me here or on Twitter. I am still learning 🙂

Not only Theory

I received some negative feedback on my last post on the Italian blog uomodellamansarda.com from Filippo and Francesco, two dear friends, and I am planning a dinner to discuss their suggestions further.
A BBQ, a bottle of wine (actually I would like to try this non-commercial vermouth -> https://amzn.to/2v0oles ), a friendly discussion, and I hope on that occasion to also learn how to make a great Negroni.
But this is another story! I want to talk about something else!

I want to talk about the dichotomy between practice and theory.
Theory alone is not enough; this is true for physicians as for other professions, practice is needed.
But practice needs theory to refine the technique.
I guess you would never want surgery from a physician who studied everything in the books but has no practical experience; likewise, you would never want surgery from a trainee doctor without a theoretical foundation.

I tend to be strongly theoretical: I often study a problem from every point of view before coming to a solution, and this tendency can be extremely negative if you do not compensate for it with some practice.
This is especially true with Python.

"Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. We combined theory and practice: nothing works and no one knows why." - Albert Einstein*

During the coming week I will allocate at least 20 hours to Python, and the mistake I could make is to hit the books or online courses and focus too much on theory.
To avoid this mistake I created some activity labels in my Clockwork Pomodoro app.
These labels are crunched by a small script that takes as input the information on how I used my time in the last week and gives back the percentage of accomplishment, based on the mix between practice, theory and writing about the progress made.
In short:

  • Working 10 h
  • Studying 6 h
  • Marketing 4h

The script is raw and I will improve it; I could use some "for" loops to make it more readable (I appreciate in advance any feedback on what I could enhance and improve).

The first part of the script keeps me updated on how my practice with Python is going (I have written about it on this blog before); the second part of the script evaluates the "mix".

Thanks for reading the article, big hug.

Andrea
PS: If you notice any mistake ping me; if you liked it, share it! 🙂
In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta


va='Python'
tvoi='Length'

csvname="logs2.csv"
#read the csv	
columns_name=['Year', 'Month', 'Day', 'Time', 'Length', 'Start', 'End', 'Activity']
dfraw=pd.read_csv(csvname,names=columns_name,sep=',',skiprows=1,skipfooter=0, index_col=False)


dfraw[tvoi] = dfraw[tvoi].astype('str')
mask = (dfraw[tvoi].str.len() == 6) 
dfraw = dfraw.loc[mask]

dfraw[tvoi]=dfraw[tvoi].str.strip()

dfraw[tvoi]=pd.to_datetime(dfraw[tvoi], format='%M:%S')
dfraw['Date'] = dfraw.apply(lambda row: datetime(row['Year'], row['Month'], row['Day']), axis=1)


pythondf=dfraw[(dfraw['Activity'].str.contains("Python",na=False)) | (dfraw['Activity'].str.contains("python",na=False))] 
numacti=pythondf.groupby('Date').count()
numacti=numacti['Activity']
numacti=numacti.divide(2)
cumulata=numacti.cumsum()


day=pd.concat([numacti, cumulata], axis=1)
day.columns=['pgiorno','cumulata']
maxh=cumulata.max()
plt.plot(day.index,day['cumulata'])
plt.xticks(rotation=90)
plt.title('Total hours of study and work with Python (%d hours)' %(maxh))
plt.tight_layout()
plt.show()
In [26]:
#Section for weekly analysis
python_work=10
python_study=6
study='study'
marketing='marketing'

python_marketing=4
total=python_work+python_study+python_marketing

#Selection only the last 7 days of the log 
days=7
cutoff_date= pythondf['Date'].iloc[-1]- pd.Timedelta(days=days)
print(cutoff_date)
last_7days= pythondf[pythondf['Date'] > cutoff_date] 
#Any activity whose label is not "marketing", "study", "datacamp", "ripasso" or "libro" is considered "work"
#For cleaner code I will try, in future logs, to use only the three labels study, work and marketing as meta-tags
study_mask=(last_7days['Activity'].str.contains("ripasso",na=False) | last_7days['Activity'].str.contains("datacamp",na=False)) | (last_7days['Activity'].str.contains("Libro",na=False))
pythondf_study=last_7days[study_mask]

pythondf_marketing=last_7days[last_7days['Activity'].str.contains("marketing",na=False)]


pythondf_work=last_7days[~study_mask]

#Pomodoro time slots last 30 minutes (25+5)
#We have to group by category and then count
#Not lazy enough for a for loop, sorry

print("Weekly % of Python Working",round(pythondf_work['Activity'].count()/2/python_work*100,2))
print("Weekly % of Python Study", round(pythondf_study['Activity'].count()/2/python_study*100,2))
print("Weekly % of Python Marketng",round(pythondf_marketing['Activity'].count()/2/python_marketing*100,2))
2018-07-18 00:00:00
Weekly % of Python Working 95.0
Weekly % of Python Study 50.0
Weekly % of Python Marketing 62.5

*quote to be verified

Hypothesis Testing, easy explanation

The first time I studied "hypothesis testing" was when I enrolled in "Probability and Statistics" with Prof. Martinelli, during my master's degree.
In the beginning I did not understand the topic easily, but in the following months, practicing and practicing, I became quite confident.

In this post I will try to explain hypothesis testing, also because I promised Diego I would publish this article, after a Skype call where he helped me understand how PyCharm and Jupyter work.

Some days ago Francesco, a friend of mine, learned from his University Student Office that the average age of all the enrolled students was 23 years.

Francesco, always the skeptic, replied to the University Student Office: "Doubts*".

How could he verify whether the statement was reasonable?
He could, with "hypothesis testing".

In this specific case, the hypothesis concerns the average age of the enrolled students.
How can we verify this hypothesis?

Put brutally, Francesco needs to verify that the assumed value is not too different from another value that he is going to compute as a check.

A simplified and not fully accurate explanation: we evaluate how likely it would be, if the hypothesis were true, for the difference between our hypothesized value and the check value to exceed a defined threshold.

We accept the hypothesis if the difference is smaller than our threshold value, and we reject it if it is greater.

The probability used to set this threshold is called the "level of significance" of the test.

The picture represents the difference between the check value X (the sample mean) and the hypothesized value µ0: if that difference falls inside the central 95% of the distribution bell, we accept the hypothesis at a 5% level of significance; otherwise we reject it.

In other words, when we reject, we are stating that such a large difference between the observed value and our hypothesis would be unlikely if the hypothesis were true.

The level of significance is a key concept in hypothesis testing: it is the probability of rejecting a hypothesis that is actually true.

If Francesco stated that the students' average age is not 23 when it actually is, he would be rejecting a true hypothesis.

The significance level is often denoted by α.

For example, a 5% level of significance means that there is a 5% probability of declaring our hypothesis false, and rejecting it, when it is actually true.

Francesco does not know whether the average age is 23 as the University Student Office stated; it could be 24, 25 or 22.
Each value has a certain probability of being acceptable: it is plausible that the average age is around 23, but unlikely that it is 35 or 18.
Francesco takes a random sample of 40 enrolled students and determines the average age of the sampled group.
I will not describe all the mathematical formulas behind the scenes here (but I do describe them in the appendix of the post).
Francesco wants to verify his hypothesis with a 5% level of significance.

If the inequality (described in the appendix) holds, he accepts the hypothesis; otherwise, he rejects it.
There are some points that I have left implicit and that need to be discussed further.
If you have the patience, you can find them below.

In the following days I will talk about p-values and A/B testing with Python.

Thanks for reading; if you find any mistake let me know and I will fix it, especially grammar errors.

Andrea
___________________________

In the example I did not spell out that:

µ0 = 23 is the hypothesized mean of the enrolled students' age distribution.
The variance of the distribution is assumed known.

The sample mean is a natural point estimator of the (unknown) mean of the enrolled students' age distribution.
If we assume that our hypothesis is true, it follows that the sample mean of the enrolled students' ages is (approximately) normally distributed.

If we accept a hypothesis, for example that the mean value of a distribution is µ0, at a level of significance α (in our example 5%), we are stating that there exists a region C of the probability space such that the sample mean falls inside C, where, if the hypothesis is true, the probability of falling inside C is 1 - α.

If the sample mean X follows a normal distribution with mean µ0 (and the standard deviation σ is known), we can define Z = (X - µ0) / (σ / √n), a random variable that follows the standard normal distribution.

What we are doing, given the chosen level of significance, is identifying the value of the standard normal variable associated with that probability, which becomes our threshold.

97.5% is the probability that Z takes a value less than 1.96; equivalently, 2.5% is the probability that Z is greater than 1.96.
What we are saying is: "If the sample mean minus the hypothesized mean, divided by the standard deviation and multiplied by the square root of the number of samples, is greater than 1.96, then at a 5% significance level the hypothesis is rejected as false."
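
To make the criterion concrete, here is a minimal Python sketch of the test described above. The standard deviation and the sample mean are made-up numbers, since the post does not report Francesco's actual data.

import math
from scipy.stats import norm

mu0 = 23.0          # hypothesized mean age
sigma = 2.5         # standard deviation, assumed known (made-up value)
n = 40              # number of sampled students
sample_mean = 23.8  # made-up average age of Francesco's sample

z = (sample_mean - mu0) / (sigma / math.sqrt(n))
z_crit = norm.ppf(0.975)   # about 1.96 for a two-sided test at the 5% level

print("Z =", round(z, 3), "critical value =", round(z_crit, 3))
if abs(z) > z_crit:
    print("Reject the hypothesis that the average age is 23 (5% significance level)")
else:
    print("Do not reject the hypothesis (5% significance level)")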

The last point concerns the two types of error you can make in hypothesis testing:

  • Type I error: when the data lead us to reject a hypothesis that is true.
  • Type II error: when the data lead us to accept a hypothesis that is false.

 

*"Doubts" is a short, Italian-style way to express skepticism about something.

Most Required Junior Data Scientist Skills, Based on My Personal Experience and Analysis, Part 1

It is not easy to be a wannabe Data Scientist.

Being a Data Scientist is fucking hard; being a self-taught Data Scientist is even harder.

Time is never enough: you need to focus, and focus on what the market needs; this way you will have a better chance of surviving.

Where to focus?

You need to identify a path to follow and practice, or you will be distracted by all the noise on the web.

From September 2017 until now, quite often after sending my CV for Data Scientist positions, I took note of the skills required and added them manually to a Google Sheet.

I have reached more than 430 rows, each one containing a piece of information.

Today I decided to analyze this CSV in order to identify the most frequent skills required for a Data Scientist.

The analysis I have done is very rough and needs to be improved, but it tells me where to focus.

In [80]:
#importing the libraries 

import pandas as pd
import matplotlib.pyplot as plt
In [40]:
csvname= "skill.csv"
df= pd.read_csv(csvname,sep= ",", header=None, index_col=False)
print(df.head(30))
                             0    1
0                       Agile   NaN
1                           AI  NaN
2                    Algorithm  NaN
3                    Algorithm  NaN
4                   Algorithms  NaN
5                    Analytics  NaN
6                      Apache   NaN
7                      Apache   NaN
8                          API  NaN
9   Artificial neural networks  NaN
10                         AWS  NaN
11                         AWS  NaN
12                         AWS  NaN
13                         AWS  NaN
14                         AWS  NaN
15                         AWS  NaN
16                         AWS  NaN
17                         AWS  NaN
18                       Azure  NaN
19                       Azure  NaN
20                       Azure  NaN
21                       Azure  NaN
22                       Azure  NaN
23              Bayesian Model  NaN
24              Bayesian Model  NaN
25              Bayesian Model  NaN
26         Bayesian Statistics  NaN
27                          BI  NaN
28                          BI  NaN
29                         BI   NaN
30                    Big Data  NaN
31                    Big Data  NaN
32                    Big Data  NaN
33                    Big Data  NaN
34                    Big Data  NaN
35                    Big Data  NaN
36                    Big Data  NaN
37                    Big Data  NaN
38                    BIgQuery  NaN
39                    BIgQuery  NaN
In [34]:
print(df.columns)
Int64Index([0, 1], dtype='int64')
In [50]:
df.columns=['skills','empty']
In [51]:
print(df.head())
       skills empty
0      Agile    NaN
1          AI   NaN
2   Algorithm   NaN
3   Algorithm   NaN
4  Algorithms   NaN
In [65]:
df_skill=pd.DataFrame(df.iloc[:,0], columns=['skills'])
print(df_skill.head(5))
       skills
0      Agile 
1          AI
2   Algorithm
3   Algorithm
4  Algorithms
In [71]:
print(df_skill.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423 entries, 0 to 422
Data columns (total 1 columns):
skills    423 non-null object
dtypes: object(1)
memory usage: 3.4+ KB
None
In [84]:
df_skill_grouped=df_skill.groupby(['skills']).size().sort_values(ascending=False)
In [85]:
print(df_skill_grouped)
skills
SQL                                        37
Python                                     36
Spark                                      16
Python                                     13
Handoop                                    12
Scala                                      10
Scikit Learn                               10
NLP                                        10
Machine Learning                           10
Statistics                                 10
AWS                                         8
Big Data                                    8
NOSQL                                       7
Kafka                                       7
TensorFlow                                  6
Tableau                                     6
Pandas                                      5
Numpy                                       5
Azure                                       5
SQL                                         5
Machine learning                            5
Financial Systems                           4
Predictive Model                            4
Neural Networks                             4
C++                                         4
Machine Learning                            4
Go                                          3
Bayesian Model                              3
MapReduce                                   3
Clustering                                  3
                                           ..
Sentiment Analysis                          1
NLP                                         1
Scraping                                    1
NOSQL                                       1
Naive Bayes classifier                      1
Natural language processing                 1
Numpy                                       1
Linear Model                                1
Latent semantic indexing                    1
Pig                                         1
Hashmaps                                    1
Flask                                       1
Flink                                       1
Gis                                         1
GitHub                                      1
Testing Software                            1
Google 360                                  1
Gradient Boosted Machine                    1
TF-IDF                                      1
Plotly                                      1
T-SQL                                       1
Html                                        1
Information Extraction                      1
Instantaneously trained neural networks     1
JQuery                                      1
JSON                                        1
Java                                        1
JavaScript                                  1
Jira                                        1
AI                                          1
Length: 150, dtype: int64
In [90]:
df_skill_grouped.head(25).plot.bar()
Out[90]:
First 25 skills required for a Data Scientist

I will improve this analysis by:
1) Using regex, so I can fix typing errors and be more accurate (see, for example, "Python" and "Python " in the bar graph); a first sketch is shown below
2) Web-scraping my applications in order to automatically extract all the required skills
3) Improving my Clockwork Pomodoro Analyzer in order to be aware of where my time is allocated and whether it is consistent with the market requirements
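
As a first step towards point 1, here is a minimal sketch that reuses the df_skill dataframe from the cells above and normalises the skill strings before grouping; the normalisation rules are just an assumption of what the cleanup could look like.

import re

def normalise_skill(raw):
    s = raw.strip()              # drop leading/trailing spaces ("Python " -> "Python")
    s = re.sub(r"\s+", " ", s)   # collapse internal whitespace
    return s.lower()             # case-insensitive ("Machine Learning" == "Machine learning")

df_skill["skills_clean"] = df_skill["skills"].apply(normalise_skill)
df_skill_grouped = df_skill.groupby("skills_clean").size().sort_values(ascending=False)
print(df_skill_grouped.head(10))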

 

 

Git and GitHub

In my last job interview for a Data Scientist position, one of the required skills was knowledge of version control.
A version control system is a change-management tool for software development; one of the most common is Git.

I had never used Git: as a self-learner I always coded and did my analyses in Notepad++ and then ran my scripts through Windows PowerShell.
I used Notepad++ as suggested in "Python The Hard Way".

A new project in Git is called a repository, and you can also host a repository online through GitHub.

On the 16th of April 2016 I created my GitHub account, without a clear idea of what a repository was.
A friend from MUG suggested putting my projects there and creating a repository.
There I created at least five repositories for five different projects in three different languages, without knowing how Git worked.

So, after the first interview step, I decided to take a more rigorous approach and started "Introduction to Git for Data Science".
The course explained the basics of Git:

  • What a repository is
  • How to create one
  • How to commit new files
  • How to manage conflicts between different versions
  • Etc.

The course runs on DataCamp's servers, so you do not need to install Git on your PC.
I started applying what I learned on my projects, creating new repositories on my PC.

The next goal for the coming days is to update all my repositories on GitHub through Git Bash.

Applying Markov's Inequality and the Central Limit Theorem to Pomodoro Records to Estimate the Probability of Improving Daily Performance

One day I will learn how to publish better posts from Jupyter to WordPress; it is all still a work in progress.

The script, which you can find on my GitHub, estimates, based on my past records, the probability of studying more Python (or whatever variable you are tracking), in terms of dedicated Pomodoro time slots.

Enjoy!
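
Before the full script, here is a tiny standalone sketch of the two estimates it uses: Markov's inequality P(X >= a) <= E[X]/a, and a normal (CLT-style) approximation. The mean, standard deviation and target below are made-up numbers.

from scipy.stats import norm

mean_daily = 3.7   # hypothetical average pomodoros per day
std_daily = 1.6    # hypothetical standard deviation
target = 8         # pomodoros per day we would like to reach

# Markov's inequality: needs only the mean and X >= 0
markov_bound = mean_daily / target

# Normal approximation: tail probability from the survival function
normal_tail = norm.sf(target, loc=mean_daily, scale=std_daily)

print("Markov upper bound on P(X >= 8):", round(markov_bound, 3))
print("Normal-approximation estimate of P(X >= 8):", round(normal_tail, 4))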

In [3]:

#THIS IS THE 5th VERSION OF THE Clockwork Pomodoro Estimator
#It is still a work in progress

# The script is based on the pomodoro technique
#It reads the csv with the past recorded logs and estimates the probability of dedicating more or fewer pomodoro time slots
#per day to a target activity
#The script relies on the past records and the central limit theorem,
#with the strong hypothesis that the daily pomodoro slots follow a normal distribution

#to do: How to plot more figure 
#Merging standard deviation and mean from dataframe 
#to do:identify the mean problems, specifically understand the value under the denominator 


import pandas as pd 
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np
from datetime import datetime, timedelta
import scipy.stats


va='Python'

tvoi='Durata'
#select the number of pomodoros for a selected activity for knowing, based on time series, the probability to improve
improve_pomodoro=8

csvname="logs.csv"
#read the csv with the logged data

dfraw=pd.read_csv(csvname,sep=',',header=0,skipfooter=0, index_col=False)
dfraw.columns=['Anno', 'Mese', 'Giorno', 'Tempo', 'Durata', 'Avvio (secondi trascorsi)', 'Fine (secondi trascorsi)','Attivit']

#tvoi in this case is the length of a single record
#for some reason they are not all in the same format
#some records, because of a bug, are longer than
#00:00:00
#for this reason we (brutally) clean the data by removing all the longer records,
#which we can consider outliers

#tvoi is a non null object
#we need to convert to string in order to clean the data 
dfraw[tvoi] = dfraw[tvoi].astype('str')
mask = (dfraw[tvoi].str.len() == 6) 
dfraw = dfraw.loc[mask]
dfraw[tvoi]=dfraw[tvoi].str.strip()
#Converting datetime to minutes and seconds, dropping the hours:
#they are always 00 because pomodoro slots last 25 minutes
dfraw[tvoi]=pd.to_datetime(dfraw[tvoi], format='%M:%S')
#The Avvio column is expressed in epoch seconds; we can use it later as a new index
#useful for resampling

dfraw['IndexDate']=pd.to_datetime(dfraw['Avvio (secondi trascorsi)'], unit='s')
dfraw=dfraw.reset_index().set_index('IndexDate')


#extract all the row contains va word in our case is Python
Python_df=dfraw[dfraw['Attivit'].str.contains(va,na=False)].copy()

Python_df['Date'] = Python_df.apply(lambda row: datetime(row['Anno'], row['Mese'], row['Giorno']), axis=1)

#resample the subset to get daily counts, then the weekly mean of those counts
resample_prova=Python_df.resample('D').count()
#weekly mean and standard deviation of the daily counts
resample_prova_2=resample_prova.resample('W').mean()

resample_w_std=resample_prova.resample('W').std()
#collapsing the weekly series into single summary values (see the to-do above)
total_std=resample_w_std.std()
total_mean=resample_prova_2.mean()


#The probability that the variable takes a value LESS than or equal to
#the target "improve_pomodoro" comes from the cumulative distribution function

probability_less=scipy.stats.norm.cdf(improve_pomodoro,total_mean[0],total_std[0])

print("The probability to dedicate daily less than or equal to %s pomodoros is" %(improve_pomodoro), probability_less)

#The probability that the variable takes a value greater than or equal to
#the target "improve_pomodoro" comes from the survival function
probability_more=scipy.stats.norm.sf(improve_pomodoro,total_mean[0],total_std[0])
print("The probability to dedicate daily more than %s pomodoros is" %(improve_pomodoro), probability_more)

#With Markov's inequality: P(X >= a) <= E[X]/a
print("The upper bound (Markov Inequality) on the probability to dedicate daily more than %s pomodoros, given a daily average of %s pomodoros, is" %(improve_pomodoro,total_mean[0]))
print(total_mean[0]/improve_pomodoro)

x = np.linspace(total_mean[0] - 3*total_std[0],total_mean[0] + 3*total_std[0], 100)
#note: mlab.normpdf was removed in recent matplotlib versions; scipy.stats.norm.pdf is an alternative
plt.plot(x,mlab.normpdf(x, total_mean[0], total_std[0]))
plt.show()
The probability to dedicate daily less than or equal to 8 pomodoros is 0.984711072229
The probability to dedicate daily more than 8 pomodoros is 0.0152889277713
The upper bound (Markov Inequality) on the probability to dedicate daily more than 8 pomodoros, given a daily average of 3.66206896552 pomodoros, is
0.45775862069

 

 

Data Scientist Career Track with Python

On the 30th of December I finished the "Data Scientist Career Track with Python" on DataCamp.com.

It was a great journey and it lasted 226 h (tracked with the Pomodoro Technique).

[Figure caption: "I was too lazy to remove the word 'Working'"]

The Career Track is composed of 20 courses; I also enrolled in two more, the first on SQL, the second on PostgreSQL.

The career track cost $180; actually, with $180 I get one year of access to all the courses, so I can enroll in new courses (and after February I will) until August 2019.

It was very interesting and I discovered a new discipline that really engaged me.

What I really liked about the Career Track was the modularity of the courses; moreover, every 5 minutes of theory, explained through a video, was followed by at least three exercises.

Preliminary knowledge of Python was not necessary, although I had some basics of C thanks to Arduino.

Would I recommend it?

Yes, if you are interested in the subject; but after the Career Track you must start some meaty project to put into practice what you learned and avoid the risk of forgetting it.

If there is a negative side to this "Career Track", it is the time needed. On the website I read that the whole Career Track lasts 67 hours; I do not know how they calculated this figure.
That is 3.35 hours per course on average, but based on my personal experience I think the estimate is not realistic.

The effort and time needed to master the subjects explained are higher.
Now it is time to put into practice all the things I have learned!

In January I have to study for the National Engineering Exam, work on some projects, and create a personal portfolio on GitHub.

I have also promised Diego that I will write some posts on his blog explaining the statistical concept of hypothesis testing and the errors related to these tests.

Also because, as you can see from the first graph and the one below, most of the time dedicated to Python during these months went into studying on DataCamp (226 h out of 290 h in total).

Happy new year! 🙂

Python Pomodoro Technique Logs Analyzer

I am a big fan of the Pomodoro Technique.
It was developed by an Italian, and I wrote about it on my Italian blog (wow, that was in 2013, time runs fast).
The technique splits time into 25-minute slots.
After every 25 minutes, you take a 5-minute break.
It is a great way to avoid distractions, stay focused and manage your energy.
Obviously, if you do not know why you are doing something, what really "motivates" you, it will not work.
I use a smartphone app that not only alerts me when the 25 minutes have passed, but also lets me label the kind of activity done during each time slot.
This way it is easier to analyze how I manage my work time and how I waste it.
Awareness of how time flows is essential.
Over time I gathered a lot of logs and it became quite hard to analyze them all quickly, so I developed a short Python script to try to solve the problem.
In this example the script is used to analyze how much time I have dedicated to Python since I started studying it.
Basically the code reads the Clockwork Pomodoro activity log *.csv, cleans the data, extracts all the rows containing the word "Python" from the Activity column, then sums them up and plots the data; a condensed sketch is shown below.
The future goal is to apply Markov's inequality or the central limit theorem to estimate what I can achieve in the following months, based on past results.
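
Here is a condensed sketch of those steps, assuming the exported CSV has "Activity", "Year", "Month" and "Day" columns (the column names are an assumption; the full script elsewhere in this blog handles the real export format).

import pandas as pd
import matplotlib.pyplot as plt

logs = pd.read_csv("logs.csv")   # Clockwork Pomodoro export

# keep only the rows whose activity label mentions Python
python_logs = logs[logs["Activity"].str.contains("python", case=False, na=False)]

# each record is one pomodoro slot (25 + 5 minutes), so count/2 gives hours per day
daily_hours = python_logs.groupby(["Year", "Month", "Day"]).size().div(2)

daily_hours.cumsum().plot(title="Cumulative hours dedicated to Python")
plt.show()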

First Post, Where It All Started

The first time I heard about Machine Learning was during my master's degree in Civil Engineering, when I enrolled in the course "Theory of Road Infrastructure".
There Prof. De Blasiis talked about neural networks applied to road accident analysis, and the subject was completely mind-blowing.

After that I started reading about Machine Learning, but only in 2016 did I start learning how to code.

I started studying MATLAB with the Machine Learning course on Coursera; then, after my master's degree in Civil Engineering, I realized that I needed to learn Python, so I decided to buy a one-year subscription to DataCamp.com.

This blog will be a travel journal of this exciting experience in the world of Data Science.

I will describe my scripts and projects, but also my ideas about data and the events I will attend.

I hope to update the blog regularly, although it is not going to be easy.

PS: This blog is also a gym to improve my bad English.