Most Required Skills for a Junior Data Scientist, Based on My Personal Experience and Analysis (Part 1)

It is not easy to be a wannabe Data Scientist.

Being a Data Scientist is fucking hard; being a self-taught Data Scientist is even harder.

Time is never enough, so you need to focus, and focus on what the market needs. This way you will have a better chance to survive.

Where to focus?

You need to identify a path to follow and practice, or you will be distracted by all the noise on the web.

Since September 2017, quite often after sending my CV for a Data Scientist position, I have taken note of the required skills and added them manually to a Google Sheet.

I reached more than 430 rows, each containing one skill.

Today I decided to analyze this CSV to identify the skills most frequently required for a Data Scientist.

The analysis I have done is very rough and needs to be improved, but it shows me where to focus.

In [80]:
# Import the libraries

import pandas as pd
import matplotlib.pyplot as plt
In [40]:
# Load the skills CSV (it has no header row) and show the first 40 rows
csvname = "skill.csv"
df = pd.read_csv(csvname, sep=",", header=None, index_col=False)
print(df.head(40))
                             0    1
0                       Agile   NaN
1                           AI  NaN
2                    Algorithm  NaN
3                    Algorithm  NaN
4                   Algorithms  NaN
5                    Analytics  NaN
6                      Apache   NaN
7                      Apache   NaN
8                          API  NaN
9   Artificial neural networks  NaN
10                         AWS  NaN
11                         AWS  NaN
12                         AWS  NaN
13                         AWS  NaN
14                         AWS  NaN
15                         AWS  NaN
16                         AWS  NaN
17                         AWS  NaN
18                       Azure  NaN
19                       Azure  NaN
20                       Azure  NaN
21                       Azure  NaN
22                       Azure  NaN
23              Bayesian Model  NaN
24              Bayesian Model  NaN
25              Bayesian Model  NaN
26         Bayesian Statistics  NaN
27                          BI  NaN
28                          BI  NaN
29                         BI   NaN
30                    Big Data  NaN
31                    Big Data  NaN
32                    Big Data  NaN
33                    Big Data  NaN
34                    Big Data  NaN
35                    Big Data  NaN
36                    Big Data  NaN
37                    Big Data  NaN
38                    BIgQuery  NaN
39                    BIgQuery  NaN
In [34]:
print(df.columns)
Int64Index([0, 1], dtype='int64')
In [50]:
# Name the columns; the second one is empty
df.columns = ['skills', 'empty']
In [51]:
print(df.head())
       skills empty
0      Agile    NaN
1          AI   NaN
2   Algorithm   NaN
3   Algorithm   NaN
4  Algorithms   NaN
In [65]:
# Keep only the skills column
df_skill = df[['skills']]
print(df_skill.head(5))
       skills
0      Agile 
1          AI
2   Algorithm
3   Algorithm
4  Algorithms
In [71]:
print(df_skill.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423 entries, 0 to 422
Data columns (total 1 columns):
skills    423 non-null object
dtypes: object(1)
memory usage: 3.4+ KB
None
In [84]:
# Count how many times each skill appears, most frequent first
df_skill_grouped = df_skill.groupby('skills').size().sort_values(ascending=False)
In [85]:
print(df_skill_grouped)
skills
SQL                                        37
Python                                     36
Spark                                      16
Python                                     13
Handoop                                    12
Scala                                      10
Scikit Learn                               10
NLP                                        10
Machine Learning                           10
Statistics                                 10
AWS                                         8
Big Data                                    8
NOSQL                                       7
Kafka                                       7
TensorFlow                                  6
Tableau                                     6
Pandas                                      5
Numpy                                       5
Azure                                       5
SQL                                         5
Machine learning                            5
Financial Systems                           4
Predictive Model                            4
Neural Networks                             4
C++                                         4
Machine Learning                            4
Go                                          3
Bayesian Model                              3
MapReduce                                   3
Clustering                                  3
                                           ..
Sentiment Analysis                          1
NLP                                         1
Scraping                                    1
NOSQL                                       1
Naive Bayes classifier                      1
Natural language processing                 1
Numpy                                       1
Linear Model                                1
Latent semantic indexing                    1
Pig                                         1
Hashmaps                                    1
Flask                                       1
Flink                                       1
Gis                                         1
GitHub                                      1
Testing Software                            1
Google 360                                  1
Gradient Boosted Machine                    1
TF-IDF                                      1
Plotly                                      1
T-SQL                                       1
Html                                        1
Information Extraction                      1
Instantaneously trained neural networks     1
JQuery                                      1
JSON                                        1
Java                                        1
JavaScript                                  1
Jira                                        1
AI                                          1
Length: 150, dtype: int64
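
Some skills show up more than once in the output above ("Python", "SQL", "Machine Learning" and "Machine learning"), most likely because of trailing spaces or different capitalization in the sheet. Below is a minimal sketch of how the labels could be normalized before counting. The names `ALIASES`, `normalize_skill` and `count_skills` are hypothetical, the sample list is made up for illustration, and reading "Handoop" as a typo for "Hadoop" is my assumption.

```python
import re
from collections import Counter

# Hypothetical alias table: lower-cased variant -> canonical label.
# Treating "Handoop" as a typo for "Hadoop" is an assumption.
ALIASES = {
    "handoop": "Hadoop",
    "machine learning": "Machine Learning",
}


def normalize_skill(label):
    """Collapse runs of whitespace, trim the ends, and apply known aliases."""
    cleaned = re.sub(r"\s+", " ", label).strip()
    return ALIASES.get(cleaned.casefold(), cleaned)


def count_skills(labels):
    """Count the labels after normalization."""
    return Counter(normalize_skill(label) for label in labels)


# Made-up sample that mimics the duplicates seen in the output
sample = ["Python", "Python ", "SQL", "SQL ", "Machine Learning",
          "Machine learning", "Handoop"]
counts = count_skills(sample)
for skill, n in counts.most_common():
    print(skill, n)
```

The same function could be applied to the real column with something like `df_skill['skills'].map(normalize_skill)` before grouping.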
In [90]:
df_skill_grouped.head(25).plot.bar()
Out[90]:
Figure: First 25 skills required for a Data Scientist

I will improve this analysis by working on:
1) Regex, so I can fix typing errors and be more accurate (see, for example, “Python” and “Python ” in the bar graph)
2) Web scraping my applications to automatically extract all the required skills
3) Improving my ClockWork Pomodoro Analyzer so I know where my time goes and whether it is consistent with market requirements
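
For point 2, once the text of a job posting has been downloaded, the extraction step could look roughly like the sketch below. It only covers matching skills in text that is already available, not the scraping itself. `KNOWN_SKILLS`, `build_pattern` and `extract_skills` are hypothetical names, and the vocabulary and sample posting are made up for illustration.

```python
import re

# Hypothetical vocabulary; in practice it could come from the skills sheet.
KNOWN_SKILLS = ["Python", "SQL", "Spark", "Scala", "C++", "Machine Learning"]


def build_pattern(skills):
    """Build one case-insensitive regex that matches any known skill."""
    # Longest names first, so longer skills win over shorter overlapping ones.
    ordered = sorted(skills, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # (?<!\w) and (?!\w) behave like word boundaries but also work for "C++".
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def extract_skills(text, skills=KNOWN_SKILLS):
    """Return the known skills mentioned in text, sorted, without duplicates."""
    canonical = {s.casefold(): s for s in skills}
    pattern = build_pattern(skills)
    found = {canonical[m.group(0).casefold()] for m in pattern.finditer(text)}
    return sorted(found)


# Made-up posting text
posting = ("We need strong Python and SQL skills, experience with Spark, "
           "and some C++. Machine learning is a plus.")
print(extract_skills(posting))
```

Very short skill names (for example "Go" or "R") would need extra care, because a case-insensitive match would also hit ordinary words.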

 

 
