Hypothesis Testing, easy explanation

The first time I studied “Hypothesis Testing” was when I enrolled in “Probability and Statistics” with Prof. Martinelli, during my master’s degree.
At first I didn’t find the topic easy, but over the following months, practicing and practicing, I became quite confident.

In this post, I will try to explain Hypothesis Testing, also because I promised Diego I would publish this article after a Skype call in which he helped me understand how PyCharm and Jupyter work.

Some days ago Francesco, a friend of mine, learned from his University Student Office that the average age of all the enrolled students was 23 years.

Francesco, always skeptical, replied to the University Student Office: “Doubts!”*

How could he verify if the statement was reasonable?
He could with “Hypothesis Testing”.

In this specific case, the Hypothesis is on the average age of enrolled students.
How to verify this Hypothesis?

Put bluntly, Francesco needs to verify that the assumed value is not too different from another value that he is going to determine as a check value.

Simplified and not fully accurate explanation: we evaluate the probability that the difference between our hypothesis value and the check value is higher than a defined threshold.

We accept the hypothesis if the difference is smaller than our threshold value; we reject it if it is greater.

This threshold is set by the “level of significance” of the test.

In the picture, the difference between our observed value X and the hypothesis value µ0 is shown: if that difference falls within 95% of the distribution bell, we accept our hypothesis at a 5% level of significance; otherwise, we reject it.

In other words, when we reject, we are stating that it is unlikely for the difference between our hypothesis and the observed value to be that high.

The level of significance is a key concept in hypothesis testing: it is the probability of rejecting a hypothesis that is actually true.

If Francesco stated that the students’ average age is not 23 when it actually is, he would be rejecting a true hypothesis.

The significance level is often denoted by α.

For example, a 5% level of significance means that we have a 5% probability of stating that our hypothesis is false, and rejecting it, when it is actually true.

Francesco doesn’t know whether the average age is 23, as the University Student Office stated; it could be 24, 25 or 22.
Each value is characterized by a probability of being acceptable, so it is likely that the average age is 23 and unlikely that it is 35 or 18.
Francesco takes a random sample of 40 enrolled students and determines the average age of the sampled group.
I will not describe all the mathematical formulas behind the scenes here (I describe them in the appendix of this post).
Francesco wants to verify his hypothesis with a 5% level of significance.

If the inequality holds, he accepts the hypothesis; otherwise, he rejects it.
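Spelled out (a sketch of the standard two-sided z-test condition, using the known standard deviation σ and the sample size n = 40 from the example), the inequality for accepting the hypothesis at a 5% level of significance is:

| (X − µ0) / (σ / √n) | ≤ 1.96

where X is the sample mean and µ0 = 23 is the hypothesized mean.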
There are some points that I have glossed over and that need to be discussed further.
If you have the patience, you can find them below, in the appendix.

In the following days, I will talk about p-values and A/B testing with Python.

Thanks for reading, and if you find any mistakes, especially grammar errors, let me know and I will fix them.

Andrea
___________________________

In the example I didn’t say:

µ0 = 23 is the hypothesized mean of the enrolled students’ age distribution.
The distribution variance is known.

The sample mean is a natural point estimator of the mean of the enrolled students’ age distribution, which is unknown.
If we assume that our hypothesis is true, it follows that the sample mean has a normal distribution with mean µ0.

If we accept a hypothesis, for example that the mean value of a distribution is µ0, with a level of significance α (in our example 5%), we are stating that there exists a critical region C in the probability space such that:
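In symbols, the usual way to write this is that, if the hypothesis is true, the standardized statistic falls in the critical region C with probability α:

P( (X − µ0) / (σ / √n) ∈ C | H0 true ) = α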

If the sample mean X follows a normal distribution with mean µ0, we can define a random variable Z such that:
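With the known standard deviation σ and a sample of size n, the standard construction is:

Z = (X − µ0) / (σ / √n), which follows a standard normal distribution N(0, 1).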

What we are doing, based on the chosen level of significance, is identifying the value of the standard normal variable associated with that probability threshold.

97.5% is the probability that our Z assumes a value less than 1.96; vice versa, 2.5% is the probability that our Z will be greater than 1.96.
What we are saying is: “If the sample mean minus the hypothesized mean, divided by the standard deviation over the square root of the number of samples, is greater than 1.96, then with a significance level of 5% the hypothesis is rejected.”
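A quick numeric sketch of that check in Python. Only n = 40 and the hypothesized mean of 23 come from the example; the sample mean and the standard deviation below are made-up values, since the post does not report Francesco’s actual numbers:

import math
from scipy.stats import norm

mu0 = 23            # hypothesized mean age
sigma = 2.5         # known standard deviation (illustrative value)
n = 40              # sample size
sample_mean = 23.8  # illustrative sample mean

# Standardized test statistic: Z = (X - mu0) / (sigma / sqrt(n))
z = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Two-sided critical value for a 5% significance level
z_crit = norm.ppf(0.975)   # about 1.96

print("Z =", round(z, 3), "critical value =", round(z_crit, 3))
if abs(z) > z_crit:
    print("Reject the hypothesis at the 5% level")
else:
    print("Accept (fail to reject) the hypothesis at the 5% level")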

The last point is about the two types of errors you can make in hypothesis testing:

  • Type I, when the data lead us to reject a hypothesis that is true.
  • Type II, when the data lead us to accept a hypothesis that is false.


* “Doubts” is a short Italian way of expressing skepticism about something.

Most Required Junior Data Scientist Skills, Based on My Personal Experience and Analysis, Part 1

It is not easy to be a wannabe Data Scientist.

Being a Data Scientist is fucking hard; being a self-taught Data Scientist is even harder.

Time is never enough: you need to focus, and focus on what the market needs; this way you will have a better chance to survive.

Where to focus?

You need to identify a path to follow and practice it, or you will be distracted by all the noise on the web.

From September 2017 until now, quite often after sending a CV for a Data Scientist position, I took note of the required skills and added them manually to a Google Sheet.

I have reached more than 430 rows, each one containing a piece of information.

Today I decided to analyze this CSV in order to identify the most frequent skills required for a Data Scientist.

The analysis I have done is very rough and needs to be improved, but it shows me where to focus.

In [80]:
#importing the libraries 

import pandas as pd
import matplotlib.pyplot as plt
In [40]:
csvname= "skill.csv"
df= pd.read_csv(csvname,sep= ",", header=None, index_col=False)
print(df.head(40))
                             0    1
0                       Agile   NaN
1                           AI  NaN
2                    Algorithm  NaN
3                    Algorithm  NaN
4                   Algorithms  NaN
5                    Analytics  NaN
6                      Apache   NaN
7                      Apache   NaN
8                          API  NaN
9   Artificial neural networks  NaN
10                         AWS  NaN
11                         AWS  NaN
12                         AWS  NaN
13                         AWS  NaN
14                         AWS  NaN
15                         AWS  NaN
16                         AWS  NaN
17                         AWS  NaN
18                       Azure  NaN
19                       Azure  NaN
20                       Azure  NaN
21                       Azure  NaN
22                       Azure  NaN
23              Bayesian Model  NaN
24              Bayesian Model  NaN
25              Bayesian Model  NaN
26         Bayesian Statistics  NaN
27                          BI  NaN
28                          BI  NaN
29                         BI   NaN
30                    Big Data  NaN
31                    Big Data  NaN
32                    Big Data  NaN
33                    Big Data  NaN
34                    Big Data  NaN
35                    Big Data  NaN
36                    Big Data  NaN
37                    Big Data  NaN
38                    BIgQuery  NaN
39                    BIgQuery  NaN
In [34]:
print(df.columns)
Int64Index([0, 1], dtype='int64')
In [50]:
df.columns=['skills','empty']
In [51]:
print(df.head())
       skills empty
0      Agile    NaN
1          AI   NaN
2   Algorithm   NaN
3   Algorithm   NaN
4  Algorithms   NaN
In [65]:
df_skill=pd.DataFrame(df.iloc[:,0], columns=['skills'])
print(df_skill.head(5))
       skills
0      Agile 
1          AI
2   Algorithm
3   Algorithm
4  Algorithms
In [71]:
print(df_skill.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423 entries, 0 to 422
Data columns (total 1 columns):
skills    423 non-null object
dtypes: object(1)
memory usage: 3.4+ KB
None
In [84]:
df_skill_grouped=df_skill.groupby(['skills']).size().sort_values(ascending=False)
In [85]:
print(df_skill_grouped)
skills
SQL                                        37
Python                                     36
Spark                                      16
Python                                     13
Handoop                                    12
Scala                                      10
Scikit Learn                               10
NLP                                        10
Machine Learning                           10
Statistics                                 10
AWS                                         8
Big Data                                    8
NOSQL                                       7
Kafka                                       7
TensorFlow                                  6
Tableau                                     6
Pandas                                      5
Numpy                                       5
Azure                                       5
SQL                                         5
Machine learning                            5
Financial Systems                           4
Predictive Model                            4
Neural Networks                             4
C++                                         4
Machine Learning                            4
Go                                          3
Bayesian Model                              3
MapReduce                                   3
Clustering                                  3
                                           ..
Sentiment Analysis                          1
NLP                                         1
Scraping                                    1
NOSQL                                       1
Naive Bayes classifier                      1
Natural language processing                 1
Numpy                                       1
Linear Model                                1
Latent semantic indexing                    1
Pig                                         1
Hashmaps                                    1
Flask                                       1
Flink                                       1
Gis                                         1
GitHub                                      1
Testing Software                            1
Google 360                                  1
Gradient Boosted Machine                    1
TF-IDF                                      1
Plotly                                      1
T-SQL                                       1
Html                                        1
Information Extraction                      1
Instantaneously trained neural networks     1
JQuery                                      1
JSON                                        1
Java                                        1
JavaScript                                  1
Jira                                        1
AI                                          1
Length: 150, dtype: int64
In [90]:
df_skill_grouped.head(25).plot.bar()
Out[90]:
Top 25 skills required for a Data Scientist

I will improve this analysis by working on:
1) Regex, so I can fix typing errors and be more accurate (see for example “Python” and “Python ” in the bar graph); a minimal sketch follows below.
2) Web scraping my applications in order to automatically extract all the required skills.
3) Improving my ClockWork Pomodoro Analyzer in order to be aware of where my time is allocated and whether it is coherent with the market requirements.


Git and GitHub

In my last job interview for a Data Scientist position, one of the required skills was version control knowledge.
A version control system is a change management tool for software development; one of the most common is Git.

I had never used Git; as a self-learner I always coded and did my analyses in Notepad++ and then ran my scripts through Windows PowerShell.
I used Notepad++ as suggested in “Learn Python the Hard Way”.

A new project in Git is called a repository; you can also host a repository online through GitHub.

On the 16th of April 2016 I created my GitHub account, without a clear idea of what a repository was.
A friend from MUG suggested that I put my projects there and create a repository.
There I created at least five repositories for five different projects in three different languages, without knowing how Git worked.

So after the first interview step I decided to take a more rigorous approach and I started “Introduction to Git for Data Science“.
The course explained the basics of Git:

  • What a repository is
  • How to create one
  • How to commit new files
  • How to manage conflicts between different versions
  • Etc.

The course runs on DataCamp’s servers, so you don’t need to install Git on your PC.
I started applying what I learned on my projects, creating new repositories on my PC.

The next goal for the following days is to update all my repositories on GitHub through Git Bash.

Applying Markov’s Inequality and the Central Limit Theorem to Pomodoro Records to Estimate the Probability of Improving Daily Performance

One day I will work out how to publish a nicer post from Jupyter to WordPress; it is all still work in progress.

The script, which you can find on my GitHub, estimates from my past records the probability of studying more Python (or whatever activity you are tracking), in terms of daily Pomodoro time slots dedicated to it.

Enjoy!
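For reference, the two ingredients of the estimate: the central limit theorem justifies approximating the daily number of pomodoro slots with a normal distribution built from the mean and standard deviation of the logs, while Markov’s inequality gives a (much looser) upper bound that holds for any non-negative random variable:

P(X ≥ a) ≤ E[X] / a

In the script below, a is the target number of daily slots (improve_pomodoro) and E[X] is the average number of slots per day estimated from the records.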

In [3]:

#THIS IS THE 5th VERSION OF THE ClockWork Pomodoro Estimator
#It is still a work in progress

#The script is based on the Pomodoro Technique.
#It reads the CSV with the past recorded logs and estimates the probability of dedicating
#more or fewer pomodoro time slots per day to a target activity.
#The estimate relies on the past records and the central limit theorem,
#with the strong hypothesis that the daily number of pomodoro slots follows a normal distribution.

#to do: how to plot more figures
#to do: merge the standard deviation and the mean from the dataframe
#to do: identify the mean problems, specifically understand the value in the denominator


import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import scipy.stats


#va is the target activity to analyze; tvoi is the column of interest (record duration)
va='Python'
tvoi='Durata'

#target number of daily pomodoro slots: we estimate the probability of reaching it
improve_pomodoro=8

csvname="logs.csv"
#read the csv with the logged data

dfraw=pd.read_csv(csvname,sep=',',header=0,skipfooter=0, index_col=False)
dfraw.columns=['Anno', 'Mese', 'Giorno', 'Tempo', 'Durata', 'Avvio (secondi trascorsi)', 'Fine (secondi trascorsi)','Attivit']

#tvoi in this case is the duration of the single record
#for some reason the records are not all in the same format:
#due to a bug, some are longer than 00:00:00
#for this reason we (brutally) clean the data, removing the malformed records,
#which we can consider outliers

#tvoi is a non null object
#we need to convert to string in order to clean the data 
dfraw[tvoi] = dfraw[tvoi].astype('str')
mask = (dfraw[tvoi].str.len() == 6) 
dfraw = dfraw.loc[mask]
dfraw[tvoi]=dfraw[tvoi].str.strip()
#Converting datetime to minutes and seconds, removing hours
#(hours are always 00 because pomodoro slots last 25 minutes)
dfraw[tvoi]=pd.to_datetime(dfraw[tvoi], format='%M:%S')
#The Avvio column is expressed in epoch seconds; we can use it later as a new index,
#useful for resampling

dfraw['IndexDate']=pd.to_datetime(dfraw['Avvio (secondi trascorsi)'], unit='s')
dfraw=dfraw.reset_index().set_index('IndexDate')


#extract all the rows containing the va word, in our case 'Python'
Python_df=dfraw[dfraw['Attivit'].str.contains(va,na=False)].copy()

Python_df['Date'] = Python_df.apply(lambda row: datetime(row['Anno'], row['Mese'], row['Giorno']), axis=1)

#resample the subset to daily counts of pomodoro slots
resample_prova=Python_df.resample('D').count()
#then compute the weekly mean and weekly standard deviation of the daily counts
resample_prova_2=resample_prova.resample('W').mean()
resample_w_std=resample_prova.resample('W').std()
#Collapse the weekly series into single values per column
#(mean of the weekly means, std of the weekly stds)
total_std=resample_w_std.std()
total_mean=resample_prova_2.mean()


#The probability that the variable takes a value LESS than or equal to
#the target "improve_pomodoro" is given by the cumulative distribution function

probability_less=scipy.stats.norm.cdf(improve_pomodoro,total_mean[0],total_std[0])

print("The probability to dedicate daily less than or equal to %s pomodoro is" %(improve_pomodoro), probability_less)

#The probability that the variable takes a value greater than or equal to
#"improve_pomodoro" is given by the survival function
probability_more=scipy.stats.norm.sf(improve_pomodoro,total_mean[0],total_std[0])
print("The probability to dedicate daily more than %s pomodoro is" %(improve_pomodoro), probability_more)

#Markov's inequality: P(X >= a) <= E[X] / a, an upper bound for non-negative X
print("The Markov Inequality upper bound on the probability to dedicate daily more than %s pomodoro, given an average of %s pomodoro per day, is" %(improve_pomodoro,total_mean[0]))
print(total_mean[0]/improve_pomodoro)

#Plot the fitted normal distribution over +/- 3 standard deviations around the mean
x = np.linspace(total_mean[0] - 3*total_std[0],total_mean[0] + 3*total_std[0], 100)
plt.plot(x,scipy.stats.norm.pdf(x, total_mean[0], total_std[0]))
plt.show()
The probability to dedicate daily less than or equal to 8 pomodoro is 0.984711072229
The probability to dedicate daily more than 8 pomodoro is 0.0152889277713
The Markov Inequality upper bound on the probability to dedicate daily more than 8 pomodoro, given an average of 3.66206896552 pomodoro per day, is
0.45775862069


Data Scientist Career Track with Python

On the 30th of December I finished the “Data Scientist Career Track with Python” on DataCamp.com.

It was a great journey and it lasted 226 h (tracked with the Pomodoro Technique).

I was too lazy to remove the word "Working"

The Career Track is composed of 20 courses; I also enrolled in two others, the first on SQL, the second on PostgreSQL.

The career track cost $180; actually, with $180 I have one year of access to all the courses, so I can enroll in new courses (and after February I will do it) until August 2019.

It was very interesting and I discovered a new discipline that really engaged me.

What I really liked about the Career Track was the modularity of the courses; moreover, every 5 minutes of theory, explained through a video, is followed by at least three exercises.

Preliminary knowledge of Python was not necessary, although I had some basics of C thanks to Arduino.

Would I recommend it?

Yes, if you are interested in the subject; but after the Career Track you must start some meaty project to put into practice what you learned and avoid the risk of forgetting it.

If there is a negative side to this “Career Track”, it is the time needed. On the website I read that the whole Career Track lasts 67 hours; I don’t know how they calculated this time.
That is 3.35 hours per course on average, but based on my personal experience I think that estimate is not realistic.

The effort and time needed to master the subjects explained are higher.
Now it’s time to put into practice all the things I have learned!

In January I have to study for the National Engineer Exam, work on some projects, and create a personal portfolio on GitHub.

I have also promised Diego that I will write some posts on his blog explaining the statistical concept of “hypothesis testing” and the errors related to these tests.

Also because, as you can see from the first and the following graph, most of the time dedicated to Python during these months went to studying on DataCamp (226 h of 290 h total).

Happy new year! 🙂

Python Pomodoro Technique Logs Analyzer

I am a big fan of the Pomodoro Technique.
It was developed by an Italian, and I wrote about it on my Italian blog (wow, it was in 2013, time is running fast).
The technique splits your time into 25-minute slots.
After every 25 minutes, you take a 5-minute break.
It is a great way to avoid distractions, stay focused, and manage your energy.
Obviously, if you don’t know why you are doing something, what really “motivates” you, it will not work.
I use a smartphone app that not only alerts me when the 25 minutes have passed, but also lets me label the kind of activity done during the time slot.
This way it is easier to analyze how I manage my work time and how I waste it.
Awareness of how time flows is essential.
Over time I gathered a lot of logs, and it became quite hard to analyze them all quickly, so I developed a short script in Python to solve the problem.
The script in this example is used to analyze how much time I have dedicated to Python since I started studying it.
Basically the code reads the Clockwork Pomodoro activity log CSV, cleans the data, extracts all the rows that contain the word “Python” in the activity column, then sums them up and plots the data.
The future goal is to apply Markov’s inequality or the central limit theorem to estimate what I can reach in the following months, based on past results.
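A minimal sketch of that flow; the file name and the Italian column names below are assumptions about the app’s CSV export format (they match the ones used in the estimator script on this page):

import pandas as pd
import matplotlib.pyplot as plt

# Read the Clockwork Pomodoro activity log
df = pd.read_csv("logs.csv", sep=',', header=0)
df.columns = ['Anno', 'Mese', 'Giorno', 'Tempo', 'Durata',
              'Avvio (secondi trascorsi)', 'Fine (secondi trascorsi)', 'Attivit']

# Keep only the rows whose activity label contains the word "Python"
python_df = df[df['Attivit'].str.contains('Python', na=False)].copy()

# One row = one 25-minute pomodoro slot: count slots per day and convert to hours
python_df['Date'] = pd.to_datetime(dict(year=python_df['Anno'],
                                        month=python_df['Mese'],
                                        day=python_df['Giorno']))
daily_hours = python_df.groupby('Date').size() * 25 / 60

print("Total hours dedicated to Python:", daily_hours.sum())
daily_hours.cumsum().plot()
plt.show()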

First Post, Where It All Started

The first time I heard about Machine Learning was during my master’s degree in Civil Engineering, when I enrolled in the course “Theory of Road Infrastructure”.
There Prof. De Blasiis talked about neural networks applied to road accident analysis, and the subject was completely mind-blowing.

After that I started reading about Machine Learning, but only in 2016 did I start learning how to code.

I started studying MATLAB with the Machine Learning course on Coursera; then, after my master’s degree in Civil Engineering, I realized that I needed to learn Python, so I decided to buy a one-year subscription to DataCamp.com.

This blog will be a travel journal of this exciting experience in the world of Data Science.

I will describe my scripts and projects, but also my ideas about data and the events I will attend.

I hope to update the blog regularly, although it is not going to be easy.

PS: This blog is also a gym to improve my bad English.