Sharing is caring!
It is not easy to be a wannabe Data Scientist.
Being a Data Scientist is fucking hard; being a self-taught Data Scientist is even harder.
Time is never enough: you need to focus, and focus on what the market needs. This way you will have a better chance to survive.
Where to focus?
You need to identify a path to follow and practice, or you will be distracted by all the noise on the web.
From September 2017 until now, quite often after sending my CV for Data Scientist positions, I took note of the skills required and added them manually to a Google Sheet.
I collected more than 430 rows, each one containing a required skill.
Today I decided to analyze this CSV in order to identify the most frequently requested skills for a Data Scientist.
The analysis I have done is very rough and needs to be improved, but it shows me where to focus.
# importing the libraries
import pandas as pd
import matplotlib.pyplot as plt

# load the manually collected skills (one required skill per row)
csvname = "skill.csv"
df = pd.read_csv(csvname, sep=",", header=None, index_col=False)
print(df.head(30))
                             0    1
0                        Agile  NaN
1                           AI  NaN
2                    Algorithm  NaN
3                    Algorithm  NaN
4                   Algorithms  NaN
5                    Analytics  NaN
6                       Apache  NaN
7                       Apache  NaN
8                          API  NaN
9   Artificial neural networks  NaN
10                         AWS  NaN
11                         AWS  NaN
12                         AWS  NaN
13                         AWS  NaN
14                         AWS  NaN
15                         AWS  NaN
16                         AWS  NaN
17                         AWS  NaN
18                       Azure  NaN
19                       Azure  NaN
20                       Azure  NaN
21                       Azure  NaN
22                       Azure  NaN
23              Bayesian Model  NaN
24              Bayesian Model  NaN
25              Bayesian Model  NaN
26         Bayesian Statistics  NaN
27                          BI  NaN
28                          BI  NaN
29                          BI  NaN
print(df.columns)
Int64Index([0, 1], dtype='int64')
# give the two columns readable names
df.columns = ['skills', 'empty']
print(df.head())
       skills  empty
0       Agile    NaN
1          AI    NaN
2   Algorithm    NaN
3   Algorithm    NaN
4  Algorithms    NaN
# keep only the first column as a new DataFrame
df_skill = pd.DataFrame(df.iloc[:, 0], columns=['skills'])
print(df_skill.head(5))
       skills
0       Agile
1          AI
2   Algorithm
3   Algorithm
4  Algorithms
print(df_skill.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423 entries, 0 to 422
Data columns (total 1 columns):
skills    423 non-null object
dtypes: object(1)
memory usage: 3.4+ KB
None
# count how many times each skill appears and sort by frequency
df_skill_grouped = df_skill.groupby(['skills']).size().sort_values(ascending=False)
print(df_skill_grouped)
skills
SQL 37
Python 36
Spark 16
Python 13
Handoop 12
Scala 10
Scikit Learn 10
NLP 10
Machine Learning 10
Statistics 10
AWS 8
Big Data 8
NOSQL 7
Kafka 7
TensorFlow 6
Tableau 6
Pandas 5
Numpy 5
Azure 5
SQL 5
Machine learning 5
Financial Systems 4
Predictive Model 4
Neural Networks 4
C++ 4
Machine Learning 4
Go 3
Bayesian Model 3
MapReduce 3
Clustering 3
..
Sentiment Analysis 1
NLP 1
Scraping 1
NOSQL 1
Naive Bayes classifier 1
Natural language processing 1
Numpy 1
Linear Model 1
Latent semantic indexing 1
Pig 1
Hashmaps 1
Flask 1
Flink 1
Gis 1
GitHub 1
Testing Software 1
Google 360 1
Gradient Boosted Machine 1
TF-IDF 1
Plotly 1
T-SQL 1
Html 1
Information Extraction 1
Instantaneously trained neural networks 1
JQuery 1
JSON 1
Java 1
JavaScript 1
Jira 1
AI 1
Length: 150, dtype: int64
# plot the 25 most frequent skills
df_skill_grouped.head(25).plot.bar()
plt.show()
I will improve this analysis by working on:
1) Regex, so that I can fix typing errors and be more accurate (see, for example, "Python" and "Python " in the bar graph); a cleanup sketch follows this list
2) Web scraping my applications, in order to automatically extract all the required skills; a second sketch below shows the idea
3) Improving my ClockWork Pomodoro Analyzer, so that I know where my time is allocated and whether it is coherent with the market requirements
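Here is a minimal cleanup sketch for point 1. It assumes the same skill.csv file as above; the normalization rules (trimming whitespace, lowercasing, and a small typo map such as "Handoop" -> "Hadoop") are my own assumptions, not a definitive list.
# a minimal cleanup sketch for point 1; the normalization rules are assumptions
import re
import pandas as pd

df = pd.read_csv("skill.csv", sep=",", header=None, index_col=False, names=['skills', 'empty'])

# assumed corrections for typos visible in the raw counts; extend as needed
typo_map = {'handoop': 'hadoop'}

def normalize(skill):
    # trim surrounding whitespace and collapse inner runs of spaces
    s = re.sub(r'\s+', ' ', skill.strip())
    # lowercase so that "Machine Learning" and "Machine learning" merge
    s = s.lower()
    # apply the assumed typo corrections
    return typo_map.get(s, s)

df['skills'] = df['skills'].apply(normalize)
df_skill_grouped = df.groupby('skills').size().sort_values(ascending=False)
print(df_skill_grouped.head(25))
With this normalization the duplicated rows ("Python" / "Python ", the two SQL entries, the three spellings of Machine Learning) collapse into single counts, which already changes the top of the ranking.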
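For point 2, this is only a sketch of what the extraction could look like, assuming the job posts are reachable as plain HTML pages; the URL, the skill list, and the simple substring matching are all placeholders for illustration.
# a sketch for point 2; URL, skill list and matching logic are assumptions
import requests
from bs4 import BeautifulSoup

KNOWN_SKILLS = ['python', 'sql', 'spark', 'hadoop', 'scala', 'aws', 'tensorflow']

def extract_skills(url):
    # download the job post and reduce the markup to plain lowercase text
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, 'html.parser').get_text(' ').lower()
    # keep the known skills mentioned anywhere in the text
    return [skill for skill in KNOWN_SKILLS if skill in text]

print(extract_skills('https://example.com/job-post'))  # placeholder URL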

