A French startup (https://navee.co/) contacted me on LinkedIn for a challenging job interview (spoiler: it didn’t go well).
They identify potential fraud on online marketplaces through Natural Language Processing and Image Recognition.
For example, a house picture on a marketplace that also appears on a stock photo website, paired with a boilerplate message, could be classified as a scam by the startup’s technology.
The technical interview was a project that I had to develop in one week.
An SMS fraud detector.
Detection would be based on Natural Language Processing techniques.
The dataset was 5,500 SMS messages from Kaggle.
I knew I was not very confident with Natural Language Processing.
Until that moment I had only worked with some regex and simpler analyses, and that is exactly why I accepted to proceed with the interview.
When you have a deadline, everything moves faster, learning included.
The workflow was the following:
- Imported the necessary modules and libraries
- Defined two functions
- The first one in order to train the classifier
- The second one to evaluate classifier performance
- Imported the CSV as a DataFrame, taking care of the encoding
- Explored data in tabular format
- With a pie chart, I plotted the % of “Genuine SMS” and “Fraud SMS”
- Defined a classifier list to train with different hyperparameters
- Trained and tested the classifiers
- Sorted the results by the performance metric I picked, the F1 score (see the note just below)
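A note on that last choice: the F1 score is the harmonic mean of precision and recall, F1 = 2·(precision·recall)/(precision + recall), so it only rewards a classifier that does reasonably well on both. On an unbalanced dataset it is more informative than plain accuracy, although, as the feedback further down points out, ROC AUC would have been an even better choice here.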
Which parameters did I test for every classifier?
Basically n-grams.
An n-gram is a contiguous subsequence of n elements from a given sequence.
For example, the phrase “Andrea Ciufo is really nice” is composed of:
- 5 1-grams [“Andrea” “Ciufo” “is” “really” “nice”]
- 4 2-grams [“Andrea Ciufo” “Ciufo is” “is really” “really nice”]
- 3 3-grams [“Andrea Ciufo is” “Ciufo is really” “is really nice”]
In the task I evaluated only 1-grams, 2-grams, and their combination (in some runs I trained the classifiers on a single n-gram size, in others on both at once).
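This is essentially the splitting that scikit-learn’s CountVectorizer does under the hood. Here is a minimal sketch using the same phrase (note that the vectorizer lowercases tokens by default, and get_feature_names_out needs a reasonably recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

phrase = ["Andrea Ciufo is really nice"]
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(phrase)
    # e.g. (2, 2) -> ['andrea ciufo', 'ciufo is', 'is really', 'really nice']
    print(ngram_range, list(vectorizer.get_feature_names_out()))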
Once I delivered the work, the interview itself was not long, one hour or less. During it, I walked the CEO through the whole workflow, and he dug into some issues, such as how I cleaned and preprocessed the data and which performance metrics I chose.
After the job interview, I asked, as a courtesy, for structured feedback in case I was rejected.
In the following days they replied, explaining my mistakes and gaps, which I really appreciated:
- You saw that the dataset was very unbalanced but you didn’t do anything about it. Take a look at class weight and sample weight methods (sketched after this list)
- You have to be clear about which metric to use to compare different algorithms. You were computing every metric possible, but many of them were irrelevant. “ROC AUC” was the best in this case
- You did not do any pre-processing: removing punctuation, lowercasing everything, grouping numbers … [I really forgot to do that]
- Adding other features like: length of the text, ratio of capital letters, use of “!”, presence of URLs, emails …
- Take a look at the xgboost algorithm, which usually performs better than most other methods, or at some deep learning methods
Without looking for excuses: ALL of the notes above are extremely useful and will be taken into account in the next projects.
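To make those notes concrete, here is a minimal, untested sketch of what I could have added; the weighting strategy, the regexes and the exact feature set are my own assumptions for illustration, not the reviewers’ prescriptions:

import re
import pandas as pd
from sklearn.linear_model import LogisticRegression

# 1) class imbalance: weight the rare (fraud) class more heavily instead of ignoring the skew
clf = LogisticRegression(class_weight='balanced')

# 2) basic pre-processing: lowercase, group numbers under one token, strip punctuation
def preprocess(text):
    text = text.lower()
    text = re.sub(r'\d+', ' NUM ', text)    # group numbers together
    text = re.sub(r'[^\w\s]', ' ', text)    # remove punctuation
    return re.sub(r'\s+', ' ', text).strip()

# 3) hand-crafted features to add next to the bag of words
def extra_features(sms: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        'length': sms.str.len(),
        'capital_ratio': sms.str.count(r'[A-Z]') / sms.str.len().clip(lower=1),
        'exclamation_marks': sms.str.count('!'),
        'has_url': sms.str.contains(r'http|www\.', case=False).astype(int),
        'has_email': sms.str.contains(r'\S+@\S+').astype(int),
    })

As for the last point, xgboost ships a scikit-learn-compatible XGBClassifier, so it could be dropped into the classifier list used below with very little extra work.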
I must thank Pier, an Innlaber and researcher at the Aldo Moro University of Bari, who has also collaborated with the Alan Turing Institute.
He is well grounded in NLP (Natural Language Processing); throughout the 7 days I kept asking him for tips and showing him my results and the algorithms I used.
In the following posts, I will address all the feedback that I received with some code included.
You can find the code on GitHub, or attached below.
If you liked it or found it useful, share it on social networks; with just one click you can create an opportunity.
If you think something needs to be fixed, or you found a typo, write to me!
Thanks for reading my article!
Andrea
#Importing all the modules that we will use
#to read csv and manipulate dataframe
import pandas as pd
#to modify array
import numpy as np
#to do some data visualization
import matplotlib.pyplot as plt
#importing the classes to convert SMSs into a matrix of token counts or TF-IDF scores
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#Note that count vectorizer just counts the word frequencies
#TFIDF vectorizer assigns a score
from sklearn.model_selection import train_test_split
#Importing the classifiers that we will test
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#Importing performance metrics to evaluate our models
from sklearn.metrics import confusion_matrix,auc,roc_auc_score
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score
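A quick side note on the two vectorizers, with a tiny made-up example (the two sentences are just for illustration, they are not from the dataset, and get_feature_names_out needs a recent scikit-learn):

toy_sms = ["win a free prize now", "are you free for lunch now"]
cv = CountVectorizer()
print(cv.fit_transform(toy_sms).toarray())     # raw token counts per SMS
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(toy_sms).toarray())  # scores that down-weight tokens shared by both SMS
print(list(cv.get_feature_names_out()))        # the vocabulary both matrices refer to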
#We define the two main functions for our task
#one to get predictions from our classifier
#the other to evaluate the performance of the model based on different
#key performance metrics
def get_predictions(clf, count_train, y_train, count_test):
    # fit the classifier to the training data
    clf.fit(count_train, y_train)
    # predict using the test data
    y_pred = clf.predict(count_test)
    # compute predicted probabilities: y_pred_prob
    y_pred_prob = clf.predict_proba(count_test)
    # for fun: train-set predictions
    # train_pred = clf.predict(count_train)
    # print('train-set confusion matrix:\n', confusion_matrix(y_train, train_pred))
    return y_pred, y_pred_prob
def print_scores(y_test, y_pred, y_pred_prob):
    conf_matrix = confusion_matrix(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    roc = roc_auc_score(y_test, y_pred_prob[:, 1])
    print('test-set confusion matrix:\n', conf_matrix)
    print("recall score: ", recall)
    print("precision score: ", precision)
    print("f1 score: ", f1)
    print("accuracy score: ", accuracy)
    print("ROC AUC: ", roc)
    kpi = pd.DataFrame([[recall, precision, f1, accuracy, roc]],
                       columns=['Recall', 'Precision', 'F1_Score', 'Accuracy', 'ROC-AUC'])
    return kpi
#Name of the file in the directory
name_path='data_spam.csv'
#reading the csv and converting it into a DataFrame
csv_raw=pd.read_csv(name_path,index_col=0,encoding="ISO-8859-1")
#We need to inspect our dataframe
print(csv_raw.head(2))
print(csv_raw.info())
#We count how many genuine and fraud SMS are in the dataset
#We are going to group the dataset by label
#and plot through a pie chart
group_label=csv_raw.groupby(by='label').count()
print('The dataset is composed of')
print(group_label)
# We Plot Our Dataset
print("Our Dataset as pie chart:")
fig, ax = plt.subplots(1, 1)
ax.pie(group_label,autopct='%1.1f%%', labels=['Genuine','Fraud'], colors=['yellow','r'])
plt.axis('equal')
plt.ylabel('')
Probabilistic note: if we pick an SMS at random, under a frequentist view we have a 13.4% probability of picking a fraudulent SMS.
#We map the labels to a boolean 1-0 dummy variable for faster training/testing and later manipulation
#True (1) marks a fraudulent/spam SMS, so recall, precision and F1 below refer to the fraud class
df=csv_raw.copy()
b={'ham':False,'spam':True}
df['Status']=df['label'].map(b)
print(df.head())
list_classifier=[MultinomialNB(),LogisticRegression(),RandomForestClassifier(),RandomForestClassifier(criterion='entropy')]
names=['MultinomialNaiveBayes','Logistic Regression', 'Random Forest Classifier', 'Random Forest Classifier crit=Entropy']
y=df['Status']
X_train, X_test, y_train, y_test = train_test_split(df['SMS'],y,test_size=0.33)
Naive Bayes –> generative classifiers learn a model of the joint probability p(x, y) and use Bayes’ rule to compute p(y|x) in order to make a prediction.
Logistic Regression –> discriminative models learn the posterior probability p(y|x) “directly”.
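For reference, the Bayes’ rule step behind the generative route is p(y|x) = p(x|y)·p(y) / p(x): Naive Bayes estimates p(x|y) and p(y) from the training data and combines them, while logistic regression skips the joint model and fits p(y|x) directly.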
#In this section we will use the count vectorizer for the four classifiers, selecting single keywords and bi-grams
results_df=pd.DataFrame(columns=['n_grams','Classifier','Recall','Precision','F1_Score','Accuracy','ROC-AUC'])
performance=pd.DataFrame()
for i in [(1,1),(2,2),(1,2)]:
    count_vectorizer=CountVectorizer(stop_words='english',ngram_range=i)
    count_train=count_vectorizer.fit_transform(X_train.values)
    count_test=count_vectorizer.transform(X_test.values)
    for name, clf in zip(names,list_classifier):
        print('Classifier used',name)
        print('n-gram range is',i)
        print()
        y_pred, y_pred_prob = get_predictions(clf, count_train, y_train, count_test)
        n_grams=str(i)
        classifier=name
        loop_performance=print_scores(y_test,y_pred,y_pred_prob)
        loop_performance['n_grams']=n_grams
        loop_performance['Classifier']=name
        #DataFrame.append is deprecated, pd.concat is the supported way to stack the results
        performance=pd.concat([performance,loop_performance],ignore_index=True)
        print('__________________')
#Now we have to tidy our output
print(performance.sort_values(['F1_Score'],ascending=False))
X_train, X_test, y_train, y_test = train_test_split(df['SMS'],y,test_size=0.33)
#In this section we will use the Tfidf vectorizer for the four classifiers, selecting single keywords and bi-grams
results_df=pd.DataFrame(columns=['n_grams','Classifier','Recall','Precision','F1_Score','Accuracy','ROC-AUC'])
performance=pd.DataFrame()
for i in [(1,1),(2,2),(1,2)]:
    tfidf_vectorizer=TfidfVectorizer(stop_words='english',ngram_range=i)
    tfidf_train=tfidf_vectorizer.fit_transform(X_train.values)
    tfidf_test=tfidf_vectorizer.transform(X_test.values)
    for name, clf in zip(names,list_classifier):
        print('Classifier used',name)
        print('n-gram range is',i)
        print()
        y_pred, y_pred_prob = get_predictions(clf, tfidf_train, y_train, tfidf_test)
        n_grams=str(i)
        classifier=name
        loop_performance=print_scores(y_test,y_pred,y_pred_prob)
        loop_performance['n_grams']=n_grams
        loop_performance['Classifier']=name
        #DataFrame.append is deprecated, pd.concat is the supported way to stack the results
        performance=pd.concat([performance,loop_performance],ignore_index=True)
        print('__________________')
#Now we have to tidy our output
print(performance.sort_values(['F1_Score'],ascending=False))
#We didn't perform an in-depth analysis of overfitting/underfitting, a next step will be to use k-fold cross-validation
#As we saw in the pie chart, the data are skewed. One test to run is to undersample the dataset
#in order to get a less skewed dataset and analyze the results
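As a minimal sketch of that next step, assuming a Pipeline so that each fold fits the vocabulary only on its own training portion (the scoring choice follows the ROC AUC feedback above):

#Sketch of the planned cross-validation step (not part of the delivered notebook)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('classifier', LogisticRegression(class_weight='balanced')),
])
scores = cross_val_score(pipe, df['SMS'], y, cv=5, scoring='roc_auc')
print('ROC AUC per fold:', scores)
print('mean ROC AUC:', scores.mean())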