Nine Principles for Forecasting in Survey Analysis (With Scientific Bibliography)

While I was building a forecast model and looking for scientific confirmation of some assumptions, I discovered an interesting book that I highly recommend: "Principles of Forecasting: A Handbook for Researchers and Practitioners", edited by J. Scott Armstrong of the Wharton School.

This handbook deals with all kinds of decision models.

It ranges from judgmental methods (expert opinions, surveys) to econometric models, multivariate analysis and neural network applications for forecasting.

More than 600 pages, rich with information, and each chapter is written by a different professor from a top university.


This post is a summary of a chapter about customer intentions and behaviors, and it made me think.

Why?

Because I used a lot of surveys for my first startup, and I often read reports based on surveys and interviews.

For example, this year I ran a survey with Mosiello (an InnovAction Lab alumnus) to evaluate and improve the InnovAction Lab Annual BBQ.


The chapter “Methods For Forecasting From Intentions Data” is written by Vicki G. Morwitz, from New York University.

First, let's define the word "intention" so that we have a common language.

There are many definitions, but for the purpose of this post, and for consistency with the book, I will use the following one:

Intentions are a measure. Specifically, they represent how likely individuals are to carry out plans, goals and future expectations. Intentions are often used to estimate what people are willing to do in the future.

Morwitz developed nine principles to drive decisions based on information gathered from customer intentions.

Morwitz highlights that even when using these nine principles, we must be careful when making forecasts based on intentions data.

Always.

The nine principles developed by the author are based on the following questions:

  1. How should intentions be measured?
  2. How should intentions be used to forecast behavior?
  3. How should intentions be adjusted when we have to forecast behavior?
  4. When should declared intentions/preferences be used to predict behavior?
  5. Why might intentions not be consistent with actual behavior?
  1. How should intentions be measured?

    • Use a probability scale rather than other classification methods (First Principle).
    • Ask respondents to be extremely realistic about their actual expectations and personal circumstances; the classic example is wishing for a Ferrari versus actually having the money to afford one (Second Principle).
  2. How should intentions be used to forecast behavior?

    • Avoid using "raw" measured intentions: adjust them before using them for a forecast. For example, if intentions are measured on a probability scale, a quick and easy estimate of the percentage of potential buyers is the average purchase probability across respondents (Third Principle). In general, research has shown that stated probabilities of buying durable goods tend to understate actual sales*, while Bird and Ehrenberg's studies showed the opposite for nondurable goods: intentions overstate purchases** (a numerical sketch of this kind of adjustment follows after this outline).
  3. How should intentions be adjusted when we have to forecast behavior?

    1. We can use past survey data and adjust the latest intentions data based on those results (Fourth Principle):
      1. Actual Behavior(t) = Average Intentions(t-1) + Bias(t-1, t)
      2. Bias(t-1, t) = Actual Behavior(t) - Average Intentions(t-1)
      3. Forecast Behavior(t+1) = Average Intentions(t) + Bias(t-1, t)
        • (t-1) is the time of the previous survey
        • (t) is the time of the latest survey, the one used to drive the new strategy or forecast; it is also when we observe the actual behavior corresponding to the intentions measured at time t-1
        • (t+1) is the time when the intentions expressed at time t will (or will not) turn into behavior
        • Bias is the forecast error made at time t-1, computed from the actual behavior observed at time t
    2. We can segment and cluster respondents before adjusting intentions (Fifth Principle)
      • Morwitz and Schmittlein analyzed this point in detail, splitting respondents into those who intended to buy and those who did not:

Their studies found that segmenting respondents with methods that distinguish a dependent variable (criterion) from independent variables (predictors) yields a lower forecast error than working with aggregate values.

Specifically, they analyzed households and their intention to buy a car in one group and a PC in another.

The techniques for splitting respondents into intenders and non-intenders were cutting-edge at the time (the research was done in 1992), and the authors did not settle for a single method; they evaluated several, some well known to all Data Science and Analytics enthusiasts (a minimal sketch follows after the list):

  1. Aggregation (a priori segmentation) by income
  2. K-means clustering based on demographic and product variables
  3. Discriminant analysis, where purchase was predicted from demographic and product-usage variables
  4. CART (Classification And Regression Trees), where the prediction was based on demographic data, product-usage data and other independent variables
    • Quoting the paper cited in the chapter: "The main empirical finding is that more accurate sales forecasts appear to be obtained by applying statistical segmentation methods that distinguish between dependent and independent variables (e.g., CART, discriminant analysis) than by applying simpler direct clustering approaches (e.g., a priori segmentation or K-means clustering)"****
Independent variables (predictors) identified in the Morwitz and Schmittlein research
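To make the segmentation idea more concrete, here is a minimal sketch on made-up data (column names and numbers are invented for illustration), using scikit-learn's DecisionTreeClassifier as a stand-in for CART: a shallow tree splits respondents into segments based on predictors, and a purchase rate is then estimated per segment.

# Minimal sketch (hypothetical data): segment respondents before adjusting intentions.
# A CART-style tree (sklearn's DecisionTreeClassifier) predicts purchase from
# intention and demographic/usage predictors; purchase rates are then estimated per leaf/segment.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical survey: stated intention (0-1 probability scale), income (thousands),
# a product-usage score, and the behavior observed in a follow-up wave (1 = bought).
survey = pd.DataFrame({
    "intention": [0.9, 0.7, 0.2, 0.1, 0.8, 0.3, 0.6, 0.05],
    "income":    [55, 42, 30, 28, 60, 35, 48, 25],
    "usage":     [5, 3, 1, 0, 4, 1, 2, 0],
    "bought":    [1, 1, 0, 0, 1, 0, 0, 0],
})

X = survey[["intention", "income", "usage"]]
y = survey["bought"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
survey["segment"] = tree.apply(X)  # leaf id used as segment label

# Purchase rate per segment, to be applied to new respondents falling in the same leaf
print(survey.groupby("segment")["bought"].mean())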
    3. Use intentions measures to define upper and lower bounds on purchase probability (Sixth Principle).
      • When you have to evaluate best- and worst-case scenarios, the author points out that you can use the extreme values of the measured intentions.
  4. When should intentions and preferences be used to predict behavior?

    • According to Armstrong's study, there are six conditions that determine when reported intentions should be predictive of behavior (Seventh Principle)***:
      1. The predicted behavior is important
      2. The answers come from the decision maker
      3. The respondent has a plan
      4. The respondent can clearly describe the plan (as opposed to a respondent who says "I can't tell you", describes the plan vaguely, or whose plan is inconsistent with their attitude)
      5. The respondent is able to fulfill the plan
      6. New information is unlikely to change the plan over the forecast time span
  5. Why can intentions be inconsistent with actual behavior?
    • Measuring intentions modifies behavior. Morwitz, Johnson, and Schmittlein, in one of their studies, selected two groups of potential car buyers: in one group they measured purchase intentions, in the other they did not. The result was that in the group where buying intentions were measured, more people actually bought a car. The explanation lies on the cognitive plane: when respondents answer, they trigger a process that Goleman, in his book "Emotional Intelligence" (strongly recommended), calls "metacognition" (Eighth Principle).
    • People are often affected by cognitive biases when they have to recall when their last purchase was made (Ninth Principle). If this bias is present in the answers, the forecast could be distorted, so it is important to identify it and define an uncertainty level for the forecast. The ninth principle is the result of Kalwani and Silk's research.
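As a rough numerical illustration of the Third and Fourth Principles, here is a minimal sketch with hypothetical numbers: the raw forecast is the mean of the probability-scale intentions, and it is then corrected with the bias observed between the previous survey wave and the behavior that followed it.

# Minimal sketch (hypothetical numbers) of the Third and Fourth Principles:
# use the mean of probability-scale intentions as the raw forecast, then correct it
# with the bias observed between the previous survey and the behavior that followed.
import numpy as np

# Probability-scale answers (0 = "no chance", 1 = "certain") from the latest survey at time t
intentions_t = np.array([0.1, 0.3, 0.9, 0.0, 0.6, 0.8, 0.2, 0.5])
raw_forecast = intentions_t.mean()  # Third Principle: average purchase probability

# Previous wave: mean intention at t-1 and the purchase rate actually observed at t
mean_intentions_t_minus_1 = 0.40
observed_behavior_t = 0.28

bias = observed_behavior_t - mean_intentions_t_minus_1   # Bias(t-1, t)
adjusted_forecast = raw_forecast + bias                  # forecast of behavior at t+1

print(f"raw forecast: {raw_forecast:.2f}, bias: {bias:+.2f}, adjusted: {adjusted_forecast:.2f}")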

I found these principles really helpful and interesting. In particular, the first three are quick to apply, while the later ones are more complex and need to be studied more closely.

And you?

If you liked it or found it useful, share it on social networks; a single click can create an opportunity.
If you think something needs to be fixed, or you found a typo, write to me!
Thanks for reading the article!

Andrea

* Juster (1966), "Consumer Buying Intentions and Purchase Probability: An Experiment in Survey Design"; McNeil (1974), "Federal Programs to Measure Consumer Purchase Expectations"

** Bird and Ehrenberg (1966), "Intentions-to-Buy and Claimed Brand Usage"

*** Armstrong, Long-Range Forecasting, p. 83

**** Vicki G. Morwitz and David Schmittlein (Nov. 1992), "Using Segmentation to Improve Sales Forecasts Based on Purchase Intent: Which 'Intenders' Actually Buy?"

***** Silk and Urban (May 1978), "Pre-Test-Market Evaluation of New Packaged Goods: A Model and Measurement Methodology", Journal of Marketing Research, Vol. 15, No. 2, pp. 171-191

 

 

The best selling drugs in Italy are the ones that could be advertised


On the Ministry of Health website there is an open data section where you can find, in CSV format, information on the top 50 best-selling drugs in Italy.

I decided to investigate this dataset, grouping some information (Python code attached below; I will also upload it to GitHub) and visualizing the data.

Often plots are better than data in a tabular format.

Drugs are divided into two categories:

  • SOP, "Senza Obbligo di Prescrizione", i.e. drugs that do not require a prescription
  • OTC, Over The Counter or self-medication drugs, which can be advertised

If you clean the Ministry's data a bit, you can plot an interesting graph.

I built a scatter plot where SOPs are the blue dots and OTCs are the red dots.

Strictly speaking, these drugs are not the best selling but the most "distributed" (this is the definition in the dataset metadata); still, distribution is a good proxy for sales.

 

If we group the two categories (SOP and OTC) and sum the "boxes provided to pharmacies and drugstores", we find that in the first semester of 2016*:

30.3 million OTC boxes were distributed against 12.6 million SOP boxes

I am sorry I did not format the Y-axis in a readable way; it is expressed in powers of 10 (3*10^7 = 30 million).

 

Based on these results I can say:

  • Equivalent (generic) drugs are little used, even though Italian pharmacists are obliged to inform the customer about this option at the time of purchase
  • There might be excessive drug consumption with a negative impact on health; this could be related to advertising rather than to prescriptions by general practitioners, but I am not a doctor, so this consideration is beyond my expertise
  • For a rigorous analysis, it is fundamental that the Ministry of Health publishes:
    • The 2017 and 2018 time series on the 50 best-selling drugs
    • Data on generic drugs, in order to evaluate correlations and run comparisons

For intellectual integrity, I wrote to the Ministry offices asking for an updated dataset and for data on equivalent drugs; if they reply, I will update you.

 

At the end of the article, you can find all the Python work for data selection, analysis, and visualization from the starting dataset.

After the aggregation with a groupby, I added a new column with the drug's "Prescription Modality". This column was added through a merge with a DataFrame containing the set of drugs and the "Prescription Modality" extracted from the original DataFrame.

Thanks for reading the article!

If you like it and find it useful, share it! If you think a fix or any improvement is needed, text me : )

Andrea

 

In [1]:
# Import pandas
import pandas as pd

import numpy as np
# Import plotting module
import matplotlib.pyplot as plt

#Import regex module
import re
file_name='C_17_dataset_15_download_itemDownload_0_upFile.csv'
csv=pd.read_csv(file_name,sep=';',encoding="ISO-8859-1",skiprows=1)
csv.info()

#We explore drugs name 
#A necessary step for 
#the next phase of cleaning and aggregation 
csv_mod=csv.dropna().copy()
csv_mod['Farmaco']=None
#A drug "Rinazina" started with ** because the Ministry have to confirm the data
#for semplicity i rempoved ** 
csv_mod['Denominazione della confezione']=csv_mod['Denominazione della confezione'].str.replace('*','')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
N°                                                                        51 non-null object
Codice Autorizzazione all'Immissione in Commercio
(AIC)                   50 non-null float64
Denominazione della confezione                                            50 non-null object
Fascia PTN                                                                50 non-null object
Modalità
Prescrizione                                                     50 non-null object
Quantità confezioni fornite alle farmacie ed agli esercizi commerciali    50 non-null object
dtypes: float64(1), object(5)
memory usage: 2.5+ KB
In [2]:
#We add a Column "Farmaco" to our Dataframe 
#Why? This column helps us with the following grouping operations

csv_mod['Farmaco']=None
def primaparola(colonna_di_testo):
    #Extract the first word of the pack name (stop at spaces and punctuation)
    pattern=r'\W*(\w[^,. !?"]*)'
    return re.match(pattern,colonna_di_testo).group(1)

print(type(csv_mod['Denominazione della confezione']))

estratto=csv_mod['Denominazione della confezione'].apply(primaparola)
csv_mod['Farmaco']=estratto
#The quantities use dots as thousands separators ("1.000.000"): strip them literally and cast to numeric
csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali']=pd.to_numeric(csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].str.replace('.','',regex=False))
<class 'pandas.core.series.Series'>
In [3]:
#In this section we create a DataFrame
#containing the unique "Farmaco" values
#E.g. the TACHIPIRINA label could appear in multiple rows of the
#Farmaco column because
#there are at least two versions,
#containing either 500 mg or 1000 mg of paracetamol

#For doing so 
#We drop duplicates
selezione= csv_mod.drop_duplicates(subset=['Farmaco']).copy()
#We select only the relevant columns
#for building our DataFrame
selezione=selezione[['Modalità\nPrescrizione','Farmaco']]
In [4]:
#Here we start our aggregation job
#here we also visualize in descending order
#the boxes distributed
grouped=csv_mod.groupby(by='Farmaco').sum()
grouped=grouped.sort_values(by='Quantità confezioni fornite alle farmacie ed agli esercizi commerciali',ascending=False)
print("Grouped Prima del reset dell'index")
print(grouped.head(2))
#we reset index in order to get
#drugs name as a column and not as an index
grouped=grouped.reset_index()
print("Grouped Dopo il reset dell' index")
print(grouped.head(2))

#We merge the two DataFrame 
#on the "Farmaco" column
df_farma=grouped.merge(selezione,on='Farmaco')
print(df_farma.head(3))
print(grouped.info())
Grouped Prima del reset dell'index
               Codice Autorizzazione all'Immissione in Commercio\n(AIC)  \
Farmaco                                                                   
TACHIPIRINA                                           89215558.0          
ENTEROGERMINA                                         39138155.0          

               Quantità confezioni fornite alle farmacie ed agli esercizi commerciali  
Farmaco                                                                                
TACHIPIRINA                                              8309360                       
ENTEROGERMINA                                            3834716                       
Grouped Dopo il reset dell' index
         Farmaco  Codice Autorizzazione all'Immissione in Commercio\n(AIC)  \
0    TACHIPIRINA                                         89215558.0          
1  ENTEROGERMINA                                         39138155.0          

   Quantità confezioni fornite alle farmacie ed agli esercizi commerciali  
0                                            8309360                       
1                                            3834716                       
         Farmaco  Codice Autorizzazione all'Immissione in Commercio\n(AIC)  \
0    TACHIPIRINA                                         89215558.0          
1  ENTEROGERMINA                                         39138155.0          
2       VOLTAREN                                        138192273.0          

   Quantità confezioni fornite alle farmacie ed agli esercizi commerciali  \
0                                            8309360                        
1                                            3834716                        
2                                            2940672                        

  Modalità\nPrescrizione  
0                    SOP  
1                    OTC  
2                    OTC  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 3 columns):
Farmaco                                                                   32 non-null object
Codice Autorizzazione all'Immissione in Commercio
(AIC)                   32 non-null float64
Quantità confezioni fornite alle farmacie ed agli esercizi commerciali    32 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 848.0+ bytes
None
In [7]:
#Now we have added a column telling us
#which drugs are SOP and which are OTC
#Now we can create our scatter plot with different colours

df_farma=df_farma.sort_values(by='Quantità confezioni fornite alle farmacie ed agli esercizi commerciali',ascending=False)
label=df_farma['Farmaco']
ln=np.arange(0,len(df_farma))
#We create two DataFrame one for the OTC and one for the SOP

df_otc=df_farma[df_farma['Modalità\nPrescrizione']=='OTC']
df_sop=df_farma[df_farma['Modalità\nPrescrizione']=='SOP']
#DataFrame index is necessary for 
#displaying correctly on x-axis 
#the x values and 
#the corresponding y-values
index_otc=df_otc.index

index_sop=df_sop.index

otc_plot=plt.scatter(index_otc,df_otc['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'], color='red')
sop_plot=plt.scatter(index_sop,df_sop['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'], color='blue')
plt.ylabel('# of distributed boxes')
plt.xticks(ln,(label) ,rotation=90)
plt.title('Best selling Drugs without prescription in the first semester of 2016')

plt.text(10, 4, 'Data Elaboration Ing. Andrea Ciufo. Data Source: Ministry of Health',
         fontsize=15, color='gray',
         ha='center', va='top', alpha=0.5)
plt.legend((sop_plot,otc_plot),('SOP','Over The Counter that could be advertised'),loc='upper right')
plt.rcParams["figure.figsize"] = (15,4)
plt.show()
In [8]:
#In this section we aggregate
#all the SOP and OTC drugs
#in order to visualize them
#through a bar plot
#For time reasons I did not format the Y-axis ticks nicely;
#they are expressed in powers of 10


sum_otc=df_otc.groupby(by='Modalità\nPrescrizione').sum()
sum_sop=df_sop.groupby(by='Modalità\nPrescrizione').sum()
print(sum_otc['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'])
print(sum_sop['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'])
sum_df_y=[sum_otc['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].values, sum_sop['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].values]
sum_df_x=['OTC','SOP']

sum_df = pd.DataFrame({'Modalità Prescrizione':sum_df_x, 'Quantità confezioni fornite alle farmacie ed agli esercizi commerciali':sum_df_y})

ind=np.arange(len(sum_df_y))

plt.bar(ind,sum_df['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'], width=0.2)
plt.rcParams["figure.figsize"] = (7,3)
plt.title('Best selling Drugs without prescription in the first semester of 2016')

plt.ylabel('# of boxes distributed')
plt.xticks(ind,(sum_df_x) ,rotation=90)


plt.text(0.5, 0.5, 'Data Elaboration Ing. Andrea Ciufo. Data Source: Ministry of Health',
         fontsize=15, color='gray',
         ha='center', va='top', alpha=0.5)

plt.show()
Modalità\nPrescrizione
OTC    30361800
Name: Quantità confezioni fornite alle farmacie ed agli esercizi commerciali, dtype: int64
Modalità\nPrescrizione
SOP    12599616
Name: Quantità confezioni fornite alle farmacie ed agli esercizi commerciali, dtype: int64

 

 

What are the best selling drugs in Italy?

The Italian Ministry of Health published a dataset on the drugs most distributed* through drugstores (here you can find the dataset).

In methodological terms, I aggregated all the drugs whose names start with the same word. E.g. all kinds of "Tachipirina" packs (the most sold paracetamol drug in Italy) are grouped into a single variable, regardless of whether the dose was 500 mg or 1000 mg or of the route of administration**.

There is a fundamental categorical variable inside the dataset, "Modalità Prescrizione", which can assume two values:

  • SOP, drugs without prescription
  • OTC, Over The Counter, all the non-prescription drugs that can be advertised

In the next post, I will analyze the relationship between drug consumption and advertising.

The 5 drugs with the highest distribution in Italy in the first semester of 2016 are:

  1. Tachipirina, 8+ million boxes

  2. Enterogermina, 3+ million boxes

  3. Voltaren

  4. Rinazina

  5. Aspirina

80% of the time was spent cleaning the dataset.

I had to skip the first rows; you can understand why from the picture attached.


Moreover, the data were encoded in "ISO-8859-1" instead of "utf-8", and the file was not comma-separated: the dataset used the semicolon ";".

Numeric values were in the format "1.000.000", so I stripped the dots in order to cast them to integers. I noticed this only after a first data inspection; otherwise, I could have used a specific option inside pd.read_csv().
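For reference, pd.read_csv accepts thousands and decimal parameters that handle this format directly; here is a minimal sketch of how the call could look, assuming the same file used in the notebook below:

# Alternative to stripping the dots manually: let pandas parse "1.000.000" as an integer.
import pandas as pd

csv = pd.read_csv(
    'C_17_dataset_15_download_itemDownload_0_upFile.csv',
    sep=';',
    encoding='ISO-8859-1',
    skiprows=1,
    thousands='.',   # treat the dot as a thousands separator
    decimal=',',     # assumption: Italian files typically use the comma as decimal separator
)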

Here is a link to an interesting Stack Overflow question that helped me read the CSV with pandas; at first, I didn't recognize the encoding problem.

This question is also useful; I used it for the first-word extraction when some names start with a special character.

Thanks for reading the article!

If you liked this post, consider sharing it; I really appreciate it!

If you think it could be improved or I have to fix something, text me 🙂
Andrea

In [1]:
# Import pandas
import pandas as pd

import numpy as np
# Import plotting module
import matplotlib.pyplot as plt

#Import regex module
import re
file_name='C_17_dataset_15_download_itemDownload_0_upFile.csv'
csv=pd.read_csv(file_name,sep=';',encoding="ISO-8859-1",skiprows=1)
csv.info()

#We explore drugs name 
#A necessary step for 
#the next phase of cleaning and aggregation 
csv_mod=csv.dropna().copy()
csv_mod['Farmaco']=None
#A drug "Rinazina" started with ** because the Ministry have to confirm the data
#for semplicity i rempoved ** 
csv_mod['Denominazione della confezione']=csv_mod['Denominazione della confezione'].str.replace('*','')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
N°                                                                        51 non-null object
Codice Autorizzazione all'Immissione in Commercio
(AIC)                   50 non-null float64
Denominazione della confezione                                            50 non-null object
Fascia PTN                                                                50 non-null object
Modalità
Prescrizione                                                     50 non-null object
Quantità confezioni fornite alle farmacie ed agli esercizi commerciali    50 non-null object
dtypes: float64(1), object(5)
memory usage: 2.5+ KB
In [2]:
#We add a Column "Farmaco" to our Dataframe 
#Why? This column helps us with the following grouping operations

csv_mod['Farmaco']=None
def primaparola(colonna_di_testo):
    #Extract the first word of the pack name (stop at spaces and punctuation)
    pattern=r'\W*(\w[^,. !?"]*)'
    return re.match(pattern,colonna_di_testo).group(1)

print(type(csv_mod['Denominazione della confezione']))

estratto=csv_mod['Denominazione della confezione'].apply(primaparola)
csv_mod['Farmaco']=estratto
#The quantities use dots as thousands separators ("1.000.000"): strip them literally and cast to numeric
csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali']=pd.to_numeric(csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].str.replace('.','',regex=False))
<class 'pandas.core.series.Series'>
In [3]:
#We group our dataframe by 'Farmaco' column
#And we determine the sum 
grouped=csv_mod.groupby(by='Farmaco').sum()
#We sort our data in descending order
df_farmaco=grouped['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].sort_values(ascending=False)
#We create an array 
#to plot on the X axis our drugs
ln=np.arange(0,len(df_farmaco))
label=df_farmaco.index
In [5]:
#The names of drugs will be our x-axis labels
label=df_farmaco.index
#We visualize our data through a Scatter Plot
plt.scatter(ln,df_farmaco)
plt.title('Best selling Drugs without prescription in the first semester of 2016')

plt.ylabel('# boxes sold')
plt.xticks(ln,(label) ,rotation=90)
plt.text(10, 4, 'Data Elaboration Ing. Andrea Ciufo',
         fontsize=15, color='gray',
         ha='center', va='top', alpha=0.9)
plt.rcParams["figure.figsize"] = (15,4)
plt.show()
*The Ministry uses the term "distributed" instead of "sold"; the two terms are similar but not the same. I take the most distributed drugs as a good proxy for the best-selling ones.
**For time reasons I didn't investigate which drugs were advertised on TV and other communication channels.

A/B Testing explained to the Nerd who wants to pick up on Tinder

You just bought the ultimate fragrance, your abs are not perfectly sculpted, the beard is perfect, but the only match you get on Tinder is with the fake profile made by your mate.

You start thinking that you have a problem.

Your "pick-up strategy" is not working, obviously.

You decide to rely on your friend, the pick-up wiz, the king of pick-up in the MySpace and Netlog era, known to your friends as "The Driller".

After a short chat and a beer, The Driller decides to help you, but only if you agree to buy him drinks as a success fee.

He immediately realizes that your selection of profile pics is not suitable: the bare-chested one in discount underwear and the out-of-focus one with the stoned face, titled "Ibiza 2k12", must be censored.


"Shock therapy": these are The Driller's words in a moment of deep desperation and pity for you.

A shopping spree at Primark, a new photo taken with a reflex camera in auto mode by your photographer friend, and a random poetic quote.

After the updates, you start getting good matches and you can't believe it.
But you don't want to buy drinks for The Driller: you think it is just a coincidence, that you have been lucky.

With the old profile pics -> 100 tries -> 1 match
With the new profile pics -> 100 tries -> 10 matches

You look at The Driller and say: "I think it's just a coincidence; with the new profile pics I was just lucky."

The Driller looks you in the eyes; he can't believe it, he wants you to buy him his drinks. He calmly replies: "OK, let's suppose it was just a coincidence. If this is true, there is no difference between before and after the changes, and we have 200 tries in total, right?"
You: "Sure"
The Driller: "So now let's run some simulations.

We take 200 slips of paper; on each one we write the girl's name and whether it was a match or not:

1 if it was a match, 0 if it was not."

The request is strange, but you fill out these 200 slips.

The Driller: "Now we shuffle these 200 slips, we assign the first 100 to the old condition (Case A') and the last 100 to the new condition (Case B').

Once we have done that, we compute the difference between the new Case B' and the new Case A'; we call this value the "Pick Up Delta".
Do you remember that in the original case this difference was 0.09?" (10 matches/100 tries - 1 match/100 tries)

The Driller: "Once we have determined the "Pick Up Delta" for the second time, we shuffle the slips again and repeat this operation several times" (a number of times n, with n very large).

If what you said is true, shuffles where the "Pick Up Delta" is greater than or equal to the "Pick Up Delta" of the original case should be fairly common, because we assumed the difference is just a coincidence.
You: "Yes, it makes sense"

The Driller: "We can evaluate this by dividing the number of times the "Pick Up Delta" is greater than or equal to that of the original case by the number of times we shuffled our slips" (this value will be our p-value).
The Driller: "If this ratio is high, your hypothesis is probably true; if it is small, your hypothesis is probably false."
You: "How small should it be?"
The Driller: "If we are going to reject the hypothesis with 95% confidence, this value must be smaller than 0.05."
You and the Driller discovered that:

  • The number of shuffles where the Pick Up Delta was greater than or equal to that of the original case was only 1 in 100, so the p-value was 0.01
  • The hypothesis "it was just a coincidence" was false
  • You have to buy drinks

A/B tests are extremely common, especially in digital marketing, but their evaluation is not easy.

This article, with the attached script, is only a light introduction; I simplified a lot of the assumptions.

For a rigorous discussion, I always recommend Ross's book on probability and statistics.

Moreover, we must estimate the cost of the experiment: the improvement from test A or test B must be not only statistically significant but also economically significant.

Economically significant means that the experimentation costs are justified by the improvements created, a really tough point and one that is sometimes neglected.

Thanks for reading the article!
If you liked it, share this post with others

Ping me on Twitter for further discussion

Andrea

 

In [4]:
import numpy as np
import pandas as pd 
In [5]:
#Representing our two analysis cases (A and B) with two arrays
old_pic=np.array([True] * 1 + [False] * 99)
new_pic=np.array([True] * 10 + [False] * 90)
In [6]:
#We define our test statistic:
#the absolute difference between the success rate with the new photos
#and the success rate with the old photos
def frac_abs_success(a,b):
    afrac = np.sum(a) /len(a)
    bfrac= np.sum(b) /len(b)
    ababs=abs(afrac-bfrac)
    return ababs
def permutation_sample(data1, data2,func):
    """Once we defined the two dataset we generate our permutation"""

    # We concatenate the two dataset: data
    data = np.concatenate((data1,data2))

    #We define the permutation array "permuted_data"
    permuted_data = np.random.permutation(data)

    # We divide the permuted array into two sub-arrays: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    delta_rim=func(perm_sample_1,perm_sample_2)

    return delta_rim
In [7]:
#We perform n permutations of our two datasets A* and B*
n=1000
#for each permutation we evaluate the analysis statistic value
#The difference between the first and the second dataset
def draw_rep_stat(data,data2, func, size):
    """Draw permutation replicates."""

    # Initialize array of replicates: stat_replicates
    stat_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        stat_replicates[i] = permutation_sample(data,data2,func)

    return stat_replicates
In [16]:
test_set=draw_rep_stat(old_pic, new_pic,frac_abs_success,n)

print(len(test_set))
#We evaluate the data p-value  
#n is the number of permutations realized 
p = np.sum(test_set >= frac_abs_success(old_pic,new_pic)) / len(test_set)
print('p-value =', p)
1000
p-value = 0.01

 

How to become a (Self-Taught) Data Scientist

-Doctor, my son wants to become a Data Scientist. Should I worry?

-It is a critical situation, Madam, I am sorry, but I must warn you.

Sadly, we have no cure for this kind of illness.

-You have to be prepared, you must be prepared: your son will go to IKEA or, to save time, he will buy a blackboard on Amazon (https://amzn.to/2NpOImd); it is not the first time we have seen this. Did he start talking enthusiastically about Monte Carlo (https://en.wikipedia.org/wiki/Monte_Carlo)?

-Yes, he did.

-Oh, this is a lot worse than I originally thought.

Some days ago Davide Sicignani wrote to me.

Davide is a brother of the InnLab family from Terracina (so close to home), but I met him for the first time in London.

Davide asked me for advice for a biologist friend of his who wants to start with Python and Data Science.

Specifically, he pinged me because I started from zero and I am self-taught.

He is not the first person to ping me for this reason; some months ago Stefano Spe (also InnLab) asked too, so I decided to write briefly about it.

I should distinguish three categories that we can call "Data Scientist", but I will do that in another post:

  • Data Engineering (Computer Science)
  • Data Modelling (Probability and Statistics- Operational Research)
  • Business Intelligence (Analytic Knowledge)

I am going to write a brief overview of my path and the resources I used.

No doubt, the best way to become a Data Scientist is the same as the way to become a great surgeon:

A perfect synergy between Practice, Study and Great Mentors.

What I have understood so far:

  • Courses that turn you into a Data Scientist in one week or one month don't exist; if they say so, it's bullshit
  • You can't become a Data Scientist with practice alone
  • You can't become a Data Scientist with books alone
  • It's really exciting
  • It's hard, very hard
  • A lot of people in the field are available to help you (for free)
  • Before you start to understand anything, you need one year of training (Practice + Study + Mentors)
  • Even if you need one year of training, you still have to search for a job in the Data Science job market; what you can do and what you know is something other people have to judge (here is a great video on "How To Start"). Otherwise you risk delaying your debut in the job market because you are scared of being underprepared.

 

“The world ain’t all sunshine and rainbows. It is a very mean and nasty place and it will beat you to your knees and keep you there permanently if you let it. You, me, or nobody is gonna hit as hard as life. But it ain’t how hard you hit; it’s about how hard you can get hit, and keep moving forward. How much you can take, and keep moving forward. ” Rocky

Which resources to start the journey?

Here are mine:

DataCamp.com

In August 2017 I started studying on DataCamp.com; I enrolled in and completed the "Data Scientist Career Path".

Highly recommended.

The cost for one year of access to all courses is about $130-180, I don't remember exactly.

A great investment: simple but effective courses.

These courses are useful as a first overview.

The DataCamp mobile app lets you practice basic concepts during your commute.

The downside: it is not sufficient to start working; you need to supplement it with other resources.

Python for Data Analysis: Data Wrangling with Pandas, Numpy, and IPython

https://amzn.to/2NrqLuJ

This book was written by Wes McKinney, the author of pandas, one of the most used Python libraries for data manipulation and cleaning.

This book was a present from Marchetti when I started this journey.

It is a great resource because it explains, step by step, everything you need to know about data manipulation.

It took me one year to study it completely, and another six months will be useful to re-read it and practice all the topics it covers.

You must read and study the book with your laptop and a Jupyter notebook open. This way you can quickly replicate all the examples and tips in the book.

If you don't put the examples into practice, even modifying them, the book loses its effectiveness.

Practice

This section is F-U-N-D-A-M-E-N-T-A-L

I had the opportunity to practice through some consultancy projects, non-profit projects and public datasets.

Most of the time goes into data cleaning and manipulation; it is an annoying task, but that is how it always is.

There are a lot of public datasets, including Italian ones, where you can start doing some data visualisation and inference and building some basic models.

Some Data Set:

Open Data on Italian Election from Viminale https://dait.interno.gov.it/elezioni/open-data

Open Data from Lazio Region on Tourism and Hospitality http://dati.lazio.it/catalog/it/dataset?category=Turismo%2C+sport+e+tempo+libero

For an international view:

Kaggle

Kaggle is a platform made specifically for Data Scientists.

Getting a good ranking on Kaggle and participating in competitions is a great way to build your personal brand, be spotted by recruiters and show your knowledge.

Postgresql

SQL knowledge is the second most requested skill in job posts, right after Python.

This is based on the CVs I sent and the job descriptions I analyzed (100+).

A great platform to practice on is https://pgexercises.com/

PostgreSQL was one of the most frequently mentioned DBMSs in job posts; there are others, so don't feel constrained in your choice.

Mentor

A technical mentor is a key resource for different reasons:

  • He pushes you to do better
  • He can help you solve any problem faster during hard times (obviously after you have sweated blood on the issue for at least two days)
  • He makes more human a path otherwise made only of numbers and lines of code

Podcast

There are several podcasts on SoundCloud and Spotify that you can listen to in dead time to stay updated on new technologies and market trends.

Secret Sauce

The secret sauce is passion.

If you are not electrified by a good plot, if you are not curious about the possibility of planning and predicting sales trends, if you are not crazy about the idea of spending the night analyzing the exponential process that could represent electronic component failure rates, please don't start this career.

Passion moves everything; the other resources are secondary.

Thanks for reading! If you liked the post and found it useful, share it with others on LinkedIn or Twitter.

I really appreciate it

Andrea

PCA part 2, for the unlucky boyfriend/husband

PCA Second Chapter

I do not know if my explanation of PCA was clear; I do not think so.

I will retry.

PCA, Principal Component Analysis, is a very common technique used in Machine Learning.

Imagine that, for some unlucky reason, you HAVE TO buy a present for your girlfriend: a bag (something I wouldn't wish on any man).

Imagine you and your knowledge of the topic "bags for women".

Imagine you, whose knowledge of bags:

  • Starts from the trolley you bought for your last high-school trip, the same trolley your mother and your girlfriend hope every day to throw away, and which you still consider perfectly fine despite a big stain caused by your friend known as "Er Bresaola".
  • Ends with the laptop backpack, with the exception of the briefcase you received on graduation day because now "you are a big boy", which you never used because it could only hold a laptop and its charger.

YOU, really YOU, have to buy a bag.

You start classifying the products:

We have at least eight variables. Hard times?

Furthermore, you cannot avoid the purchase, because you have to make amends.

You do not know why you are guilty, but there is always a good reason: as a man, you are guilty by definition.

PCA helps you simplify the problem and the input data for your fateful choice.

Some variables in our problem are somewhat redundant, and we can aggregate them.

For example, "Brand", "Price Range" and "Payment Method" could be aggregated into one variable.

This is what PCA does.

Are we discarding redundant variables?

No, also because we know that any mistake will be made into a big deal.

You considered all the variables in your classification, but you transformed them using this technique.

PCA allows you to build new variables, aggregating the most meaningful information.

This is a fundamental point: in my last post I talked about a "reduction", but this doesn't mean we are discarding variables (in mathematical terms, we are taking a linear combination).

In our case study, we reduced our variables from 8 to 6.

With this transformation, we identified a new variable that varies considerably.

This is a key point because it allows us to differentiate and identify different bag categories.

From a mathematical perspective, we identified a new variable characterized by the largest variance.

That's why it is called the "Principal Component": it is the variable with the highest variance.
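To make this slightly more concrete, here is a minimal sketch with scikit-learn on made-up data: eight numeric "bag" variables reduced to two principal components (the data are random and purely illustrative).

# Minimal sketch (made-up data): PCA on eight hypothetical "bag" variables.
# The components are linear combinations of the original variables,
# ordered by how much variance they explain.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
bags = rng.normal(size=(50, 8))  # 50 hypothetical bags, 8 numeric features each

pca = PCA(n_components=2)
scores = pca.fit_transform(bags)

print(scores.shape)                   # (50, 2): each bag described by 2 components
print(pca.explained_variance_ratio_)  # share of variance captured by each component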

Now we know how to classify bags, so, which one to choose?

This point falls outside PCA, sorry about that.

In classical economic theory, for a rational man this problem would not exist; he would just consider:

  • The volume to carry
  • Minimizing the cost per volume carried (€/cm3)

It works this way only in the engineers' world; a time-series analysis of past purchases could solve the problem, but it would not be easy.

 

"Being an engineer is an illness. You could ask an engineer's wife: 'How is your husband? Is he still an engineer?' And she could reply: 'No, now he is getting better.'" -Luciano De Crescenzo, Bellavista's Thoughts

A frequency analysis of past purchases, using Bayes' theorem, could help you buy the "most frequent bag", which is not necessarily "the bag that will make her happiest".

What you could do is assign different weights to the variables and then run an analysis on the weighted frequencies.

One way could be to weight bags used on Saturday nights higher than everyday ones.

Then you choose the model with the highest weighted score and, with some probability, you will have chosen the alternative that maximizes the target (or minimizes the error).

In this post the PCA description is highly qualitative and I have simplified many assumptions.

In the last post, you can see, through a small Python script, how the correlation between the variables and the p-value change.

 

Thanks for reading.

If you see any mistake, ping me; it is always appreciated.

PCA (Chapter One)

The first version of this article was published on my Italian blog uomodellamansarda.com.
Between July and August I had the chance to lead a cost-cutting optimization project for a UK company.
This project could be based on applying PCA.
In this article and in the following ones, I will try to explain this fundamental concept.
It is also very useful in Machine Learning.
Let me say in advance that I will simplify the PCA theory; on YouTube you can find more details about it.
Why such a boring article?
Easy! For me, trying to explain a subject is the best way to learn it.
This learning method was taught by the grandmaster Feynman; for more info about the technique you can click on the link -> https://medium.com/taking-note/learning-from-the-feynman-technique-5373014ad230

Moving on, Principal Component Analysis (PCA) is a linear transformation.
Doing a PCA is like taking a list of 50 life pillars and reducing it to 3: "La femmina, il danaro e la mortazza" (in English, "Women, Money and Mortadella", a famous Italian quote).
[youtube https://www.youtube.com/watch?v=aLEfp7js620]

PCA allows you to reduce the number of variables and to identify the most important ones, uncorrelated with each other.
PCA is a transformation, a mathematical operation; in this case it is a linear, orthogonal transformation that maps one set of variables into another.
In the following example, I am going to apply PCA not to reduce the number of variables but to decorrelate them.

It's a modified version of a DataCamp.com exercise that you can find in this chapter (https://www.datacamp.com/courses/unsupervised-learning-in-python).
In the first part of the example I study the correlation between two variables; in the second part I apply PCA.
We take 209 seeds whose length and width were measured.

The information was then saved in a CSV file.

At first I made a scatter plot to see the correlation between the two variables and calculated the Pearson coefficient; then I applied PCA to decorrelate the two variables and identify the principal components.
The component with the higher variance defines the first axis; the second one has the lower variance.

If we had m variables and PCA gave us n variables, with m > n, then the second axis would be described by the variable with the second-highest variance, the third by the third-highest variance, and so on up to the n-th variable.
In the following articles I will try to illustrate the PCA concept better with practical examples, until I draft a post titled "PCA: the definitive guide" or "PCA explained to my grandmother".

In [10]:
#PCA analysis 
#Importing libraries needed matplot scipy.stats and pandas 
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
import pandas as pd
#loading file 
grains=pd.read_csv("seeds-width-vs-length.csv")
#Always exploring data and how our data are structured
print(grains.info())
print(grains.describe())
#extract from our dataframe only the values we need to work with
grains=grains.values
# 0-th dataframe column represent seeds width
width = grains[:,0]

#1-th dataframe column represent seeds length
length = grains[:,1]

# Plotting the data
# Using a scatter plot width-length
plt.scatter(width, length)
plt.axis('equal')
plt.show()

#Calculating the Pearson coefficient
#(also called the correlation coefficient)
#We also calculate data p-values
correlation, pvalue = pearsonr(width,length)

# Visualising the two calculated values
print("Correlation between width and length:", round(correlation, 4))
print("Data P-value:",pvalue)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 2 columns):
3.312    209 non-null float64
5.763    209 non-null float64
dtypes: float64(2)
memory usage: 3.3 KB
None
            3.312       5.763
count  209.000000  209.000000
mean     3.258349    5.627890
std      0.378603    0.444029
min      2.630000    4.899000
25%      2.941000    5.262000
50%      3.232000    5.520000
75%      3.562000    5.980000
max      4.033000    6.675000
Correlation between width and length: 0.8604
Data P-value: 1.5696623081483666e-62

Now we can compute the principal components and decorrelate the two variables using PCA.

In [4]:
#Loading library module for the operation
#PCA Analysis 
# Import PCA
from sklearn.decomposition import PCA

#Creating the PCA instance
modello = PCA()

#Applying  fit_transform method to our dataset on grains
#Now we obtained a new array with two new decorrelated variables
pca_features = modello.fit_transform(grains)

#Assigning the 0-th pca_features column to xs
xs = pca_features[:,0]

#Assigning the 1-th pca_features column to ys
ys = pca_features[:,1]

#Plotting the two new decorrelated variables
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()

# Calculating the Pearson coefficient of xs and ys
correlation, pvalue = pearsonr(xs, ys)

#Visualizing the two new results
print("Correlation between two new variables xs and ys ", round(correlation,4))
print("Data P-value",pvalue)
Correlation between two new variables xs and ys -0.0
"Data P-value 1.0"

Thanks for reading!
A great hug
Andrea

If you notice any error, ping me here or on Twitter. I am still learning 🙂

Not only Theory

I received some negative feedback on my last post on the Italian blog uomodellamansarda.com from Filippo and Francesco, two dear friends, and I am planning a dinner to discuss their suggestions in more detail.
A BBQ, a bottle of wine (actually I would try this non-commercial vermouth -> https://amzn.to/2v0oles), a friendly discussion, and I hope on this occasion to also learn how to make a great Negroni.
But this is another story! I want to talk about something else!

I want to talk about the dichotomy between practice and theory.
Theory alone is not enough; this is as true for physicians as for other professions: practice is needed.
But practice needs theory to refine the technique.
I guess you would never want surgery from a physician who studied everything in books but has no practical experience; likewise, you would never want surgery from a trainee doctor without a theoretical foundation.

I tend to be strongly theoretical: I often study a problem from every point of view before coming to a solution, and this tendency can be extremely negative if you do not compensate for it with some practice.
This is especially true with Python.

“Theory is when you know everything but nothing works. Practice is when everything works but no one knows why. We combined theory and practice: nothing works and no one knows why”-Albert Einstein *

During the coming week I will allocate at least 20 hours to Python, and the mistake I could make is to hit the books or online courses and focus too much on theory.
To avoid this error I created some activity labels in my Clockwork Pomodoro.
These labels are crunched by a small script that takes as input the information on how I used my time in the last week and gives back the percentage of accomplishment, based on the mix of Practice, Theory and Writing about the progress made.
In short:

  • Working 10 h
  • Studying 6 h
  • Marketing 4h

The script is raw and I will improve it; I could use some "for" loops to make it more readable (thanks in advance for any feedback on what I could enhance and improve).

The first part of the script keeps me updated on how my practice with Python is going (I have written about this script on this blog before); the second part evaluates the "mix".

Thanks for reading the article, big hug.

Andrea
PS: If you notice any mistake ping me; if you liked it, share it! 🙂
In [2]:
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta


va='Python'
tvoi='Length'

csvname="logs2.csv"
#read the csv	
columns_name=['Year', 'Month', 'Day', 'Time', 'Length', 'Start', 'End', 'Activity']
dfraw=pd.read_csv(csvname,names=columns_name,sep=',',skiprows=1,skipfooter=0, index_col=False)


dfraw[tvoi] = dfraw[tvoi].astype('str')
mask = (dfraw[tvoi].str.len() == 6) 
dfraw = dfraw.loc[mask]

dfraw[tvoi]=dfraw[tvoi].str.strip()

dfraw[tvoi]=pd.to_datetime(dfraw[tvoi], format='%M:%S')
dfraw['Date'] = dfraw.apply(lambda row: datetime(row['Year'], row['Month'], row['Day']), axis=1)


pythondf=dfraw[(dfraw['Activity'].str.contains("Python",na=False)) | (dfraw['Activity'].str.contains("python",na=False))] 
numacti=pythondf.groupby('Date').count()
numacti=numacti['Activity']
numacti=numacti.divide(2)
cumulata=numacti.cumsum()


day=pd.concat([numacti, cumulata], axis=1)
day.columns=['pgiorno','cumulata']
maxh=cumulata.max()
plt.plot(day.index,day['cumulata'])
plt.xticks(rotation=90)
plt.title('Totale ore di studio e lavoro con Python (%d ore)' %(maxh))
plt.tight_layout()
plt.show()
In [26]:
#Section for weekly analysis
python_work=10
python_study=6
sutdy='study'
marketing='marketing'

python_marketing=4
total=python_work+python_study+python_marketing

#Selection only the last 7 days of the log 
days=7
cutoff_date= pythondf['Date'].iloc[-1]- pd.Timedelta(days=days)
print(cutoff_date)
last_7days= pythondf[pythondf['Date'] > cutoff_date] 
#Any activity whose label is not "marketing", "study", "datacamp", "ripasso" or "libro" is considered "work"
#For cleaner code, in future logs I will try to use only the three labels study, work and marketing as meta-tags
study_mask=(last_7days['Activity'].str.contains("ripasso",na=False) | last_7days['Activity'].str.contains("datacamp",na=False)) | (last_7days['Activity'].str.contains("Libro",na=False))
pythondf_study=last_7days[study_mask]

pythondf_marketing=last_7days[last_7days['Activity'].str.contains("marketing",na=False)]


pythondf_work=last_7days[~study_mask]

#Pomodoro time slots last 30 minutes (25+5)
#We have to group by category and then count
#Not lazy enough for a for loop, sorry

print("Weekly % of Python Working",round(pythondf_work['Activity'].count()/2/python_work*100,2))
print("Weekly % of Python Study", round(pythondf_study['Activity'].count()/2/python_study*100,2))
print("Weekly % of Python Marketng",round(pythondf_marketing['Activity'].count()/2/python_marketing*100,2))
2018-07-18 00:00:00
Weekly % of Python Working 95.0
Weekly % of Python Study 50.0
Weekly % of Python Marketng 62.5

*quote to be verified

Hypothesis Testing, easy explanation

The first time I studied hypothesis testing was when I took "Probability and Statistics" with Prof. Martinelli during my master's degree.
In the beginning I didn't find the topic easy, but over the following months, practicing and practicing, I became quite confident.

In this post I will try to explain hypothesis testing, also because I promised Diego I would publish this article after a Skype call where he helped me understand how PyCharm and Jupyter work.

Some days ago Francesco, a friend of mine, learned from his University Student Office that the average age of all enrolled students was 23 years.

Francesco, always the skeptic, replied to the University Student Office: "Doubts*".

How could he verify whether the statement was reasonable?
He could with hypothesis testing.

In this specific case, the hypothesis is about the average age of enrolled students.
How do we verify this hypothesis?

Put brutally, Francesco needs to verify that the assumed value is not too different from another value that he will compute as a check.

A simplified and not fully accurate explanation: we evaluate the probability that the difference between our hypothesized value and the check value is larger than a defined threshold.

We accept the hypothesis if the difference is smaller than our threshold value, and we reject it if it is greater.

The threshold value is determined by the "level of significance" of the test.

The picture shows the difference between our check value X and the hypothesized value µ0: if that difference falls inside the central 95% of the distribution bell, we accept our hypothesis at a 5% level of significance; otherwise, we reject it.

In other terms, we are stating that such a large difference between our hypothesis and the observed value is unlikely.

The level of significance is a key concept in hypothesis testing: it is the probability of rejecting a hypothesis that is actually true.

If Francesco stated that the students' average age is not 23 when it actually is, he would be rejecting a true hypothesis.

The significance level is usually denoted by α.

For example, a 5% level of significance means that we have a 5% probability of stating that our hypothesis is false, and rejecting it, when it is actually true.

Francesco doesn't know whether the average age is 23 as the University Student Office stated; it could be 24, 25 or 22.
Each value has a certain probability of being acceptable, so an average age of 23 may be likely, while 35 or 18 would be unlikely.
Francesco takes a random sample of 40 enrolled students and computes the average age of the sampled group.
I will not describe all the mathematical formulas behind the scenes here (I describe them in the appendix of the post).
Francesco wants to verify his hypothesis with a 5% level of significance.

If the inequality holds, he accepts the hypothesis; otherwise, he rejects it.
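Here is a minimal sketch of that decision rule, with hypothetical sample data and an assumed known standard deviation (all numbers are invented for illustration):

# Minimal sketch (hypothetical numbers): two-sided z-test for the mean age.
import numpy as np
from scipy import stats

mu0 = 23       # average age claimed by the Student Office
sigma = 2.5    # assumed known standard deviation of the age distribution
n = 40         # sample size

rng = np.random.default_rng(1)
sample = rng.normal(24, sigma, n)  # Francesco's hypothetical sample of 40 students

z = (sample.mean() - mu0) / (sigma / np.sqrt(n))
z_crit = stats.norm.ppf(0.975)     # 1.96 for a 5% significance level (two-sided)

print(f"z = {z:.2f}, critical value = {z_crit:.2f}")
print("Reject the hypothesis" if abs(z) > z_crit else "Accept the hypothesis")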
There are some points that I have left implicit and that need to be discussed further.
If you have the patience, you can find them in the appendix below.

In the following days I will talk about p-values and A/B testing with Python.

Thanks for reading, and if you find any mistake let me know and I will fix it, especially grammar errors.

Andrea
___________________________

In the example I didn't say that:

µ0 = 23 is the hypothesized mean of the enrolled students' age distribution.
The variance of the distribution is known.

The sample mean is a natural point estimator of the (unknown) mean of the enrolled students' age distribution.
If we assume our hypothesis is true, it follows that the sample mean is normally distributed.

If we accept a hypothesis, for example that the mean value of a distribution is µ0, with a level of significance α (in our example 5%), we are stating that there is a critical region c whose probability, when the hypothesis is true, is equal to α.

If the sample mean X follows a normal distribution with mean µ0, we can define the random variable Z = (X − µ0) / (σ/√n), which follows a standard normal distribution.

What we are doing, based on the chosen level of significance, is identifying the value of the standard normal variable associated with our threshold probability.

97.5% is the probability that Z takes a value less than 1.96; vice versa, 2.5% is the probability that Z is greater than 1.96.
What we are saying is: "If the sample mean minus the hypothesized mean, divided by the standard deviation and multiplied by the square root of the number of samples, is greater than 1.96 in absolute value, then at a 5% significance level the hypothesis is rejected."

The last point concerns the two types of error you can make in hypothesis testing:

  • Type I, when the data lead us to reject a hypothesis that is true.
  • Type II, when the data lead us to accept a hypothesis that is false.

 

*"Doubts" is a short Italian way of expressing skepticism about something.

Most Required Junior Data Scientist Skills, based on my personal experience and analysis, part 1

It is not easy to be a wannabe Data Scientist.

Being a Data Scientist is fucking hard; being a self-taught Data Scientist is even harder.

Time is never enough; you need to focus, and focus on what the market needs. This way you will have a better chance of surviving.

Where to focus?

You need to identify a path to follow and practice, or you will be distracted by all the noise on the web.

From September 2017 until now, quite often after sending my CV for Data Scientist positions, I took note of the skills required and added them manually to a Google Sheet.

I reached more than 430 rows, each one containing a piece of information.

Today I decided to analyze this CSV in order to identify the most frequent skills required for a Data Scientist.

The analysis I have done is very rough and needs to be improved, but it tells me where to focus.

In [80]:
#importing the libraries 

import pandas as pd
import matplotlib.pyplot as plt
In [40]:
csvname= "skill.csv"
df= pd.read_csv(csvname,sep= ",", header=None, index_col=False)
print(df.head(30))
                             0    1
0                       Agile   NaN
1                           AI  NaN
2                    Algorithm  NaN
3                    Algorithm  NaN
4                   Algorithms  NaN
5                    Analytics  NaN
6                      Apache   NaN
7                      Apache   NaN
8                          API  NaN
9   Artificial neural networks  NaN
10                         AWS  NaN
11                         AWS  NaN
12                         AWS  NaN
13                         AWS  NaN
14                         AWS  NaN
15                         AWS  NaN
16                         AWS  NaN
17                         AWS  NaN
18                       Azure  NaN
19                       Azure  NaN
20                       Azure  NaN
21                       Azure  NaN
22                       Azure  NaN
23              Bayesian Model  NaN
24              Bayesian Model  NaN
25              Bayesian Model  NaN
26         Bayesian Statistics  NaN
27                          BI  NaN
28                          BI  NaN
29                         BI   NaN
30                    Big Data  NaN
31                    Big Data  NaN
32                    Big Data  NaN
33                    Big Data  NaN
34                    Big Data  NaN
35                    Big Data  NaN
36                    Big Data  NaN
37                    Big Data  NaN
38                    BIgQuery  NaN
39                    BIgQuery  NaN
In [34]:
print(df.columns)
Int64Index([0, 1], dtype='int64')
In [50]:
df.columns=['skills','empty']
In [51]:
print(df.head())
       skills empty
0      Agile    NaN
1          AI   NaN
2   Algorithm   NaN
3   Algorithm   NaN
4  Algorithms   NaN
In [65]:
df_skill=pd.DataFrame(df.iloc[:,0], columns=['skills'])
print(df_skill.head(5))
       skills
0      Agile 
1          AI
2   Algorithm
3   Algorithm
4  Algorithms
In [71]:
print(df_skill.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423 entries, 0 to 422
Data columns (total 1 columns):
skills    423 non-null object
dtypes: object(1)
memory usage: 3.4+ KB
None
In [84]:
df_skill_grouped=df_skill.groupby(['skills']).size().sort_values(ascending=False)
In [85]:
print(df_skill_grouped)
skills
SQL                                        37
Python                                     36
Spark                                      16
Python                                     13
Handoop                                    12
Scala                                      10
Scikit Learn                               10
NLP                                        10
Machine Learning                           10
Statistics                                 10
AWS                                         8
Big Data                                    8
NOSQL                                       7
Kafka                                       7
TensorFlow                                  6
Tableau                                     6
Pandas                                      5
Numpy                                       5
Azure                                       5
SQL                                         5
Machine learning                            5
Financial Systems                           4
Predictive Model                            4
Neural Networks                             4
C++                                         4
Machine Learning                            4
Go                                          3
Bayesian Model                              3
MapReduce                                   3
Clustering                                  3
                                           ..
Sentiment Analysis                          1
NLP                                         1
Scraping                                    1
NOSQL                                       1
Naive Bayes classifier                      1
Natural language processing                 1
Numpy                                       1
Linear Model                                1
Latent semantic indexing                    1
Pig                                         1
Hashmaps                                    1
Flask                                       1
Flink                                       1
Gis                                         1
GitHub                                      1
Testing Software                            1
Google 360                                  1
Gradient Boosted Machine                    1
TF-IDF                                      1
Plotly                                      1
T-SQL                                       1
Html                                        1
Information Extraction                      1
Instantaneously trained neural networks     1
JQuery                                      1
JSON                                        1
Java                                        1
JavaScript                                  1
Jira                                        1
AI                                          1
Length: 150, dtype: int64
In [90]:
df_skill_grouped.head(25).plot.bar()
Out[90]:
First 25 skills required for a Data Scientist

I will improve this analysis by working on:
1) Regex and string cleaning, so I can fix typing errors and be more accurate (see, for example, "Python" and "Python " in the bar graph); a minimal sketch follows below
2) Web scraping my applications in order to automatically extract all the required skills
3) Improving my Clockwork Pomodoro Analyzer so I am aware of where my time is allocated and whether it is consistent with market requirements
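As a minimal sketch of point 1 (assuming the df_skill DataFrame built above), normalizing the labels before counting already merges duplicates such as "Python" and "Python ":

# Normalize labels so that "Python", "Python " and "python" are counted together.
clean = (df_skill['skills']
         .str.strip()                            # drop leading/trailing spaces, e.g. "Python "
         .str.replace(r'\s+', ' ', regex=True)   # collapse internal whitespace
         .str.title())                           # unify case (acronyms like SQL would need extra handling)

print(clean.value_counts().head(10))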