What are the best selling drugs in Italy?

Sharing is caring!

The Italian Ministry of Health published a dataset on the most distributed * drugs through drugstores (here you can find the dataset).

In methodological terms I have aggregated all the drugs with the same starting word E.g. All kind of “Tachipirina” packs (most sold paracetamol drug in Italy) are grouped in a single variable, regardless of whether the active substance was 500mg or 1000mg, or shall be used the oral mouth**.

There is a fundamental categorical variable inside the dataset“Modalità” and could assume two values:

  • SOP, Drugs Without Prescription)
  • OTC, Over The Counter, all the drugs without prescription that could be advertised

In the next post, I will analyze the relationship between drugs consumption and advertising.

The first 5 drugs in Italy, with the highest distribution in the first semester of 2016 are:

  1. Tachipirina 8+Mil of boxes

  2. Enterorgermina 3+Mil of boxes

  3. Voltaren

  4. Rinazina

  5. Aspirina

80% of the time spent was for cleaning the dataset.

I had to skip first rows, you can understand why from the picture attached.


Moreover, data were decoded through “ISO-8859-1”, instead of “utf-8” and was not comma-separated but the dataset used the semicolon “;”

Numeric values were in the following format “1.000.000” so I stripped the dots in order to cast as Integer. I noticed only after a first data inspection, otherwise, I could use a specific option inside pd.read_csv()

Here a link to an interesting question on stack overflow that helped me in reading with pandas the csv. At first, I didn’t recognize the encoding problem.

This question is also useful and I used for the first-word extraction when some words start with special character.

Thanks for reading the article!

If you liked this post consider to share it, I really appreciate!

If you think it could be improved or I have to fix something text me 🙂
Andrea

In [1]:
# Import pandas
import pandas as pd

import numpy as np
# Import plotting module
import matplotlib.pyplot as plt

#Import regex module
import re
file_name='C_17_dataset_15_download_itemDownload_0_upFile.csv'
csv=pd.read_csv(file_name,sep=';',encoding="ISO-8859-1",skiprows=1)
csv.info()

#We explore drugs name 
#A necessary step for 
#the next phase of cleaning and aggregation 
csv_mod=csv.dropna().copy()
csv_mod['Farmaco']=None
#A drug "Rinazina" started with ** because the Ministry have to confirm the data
#for semplicity i rempoved ** 
csv_mod['Denominazione della confezione']=csv_mod['Denominazione della confezione'].str.replace('*','')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 6 columns):
N°                                                                        51 non-null object
Codice Autorizzazione all'Immissione in Commercio
(AIC)                   50 non-null float64
Denominazione della confezione                                            50 non-null object
Fascia PTN                                                                50 non-null object
Modalità
Prescrizione                                                     50 non-null object
Quantità confezioni fornite alle farmacie ed agli esercizi commerciali    50 non-null object
dtypes: float64(1), object(5)
memory usage: 2.5+ KB
In [2]:
#We add a Column "Farmaco" to our Dataframe 
#Why? This column help us for the following group operations

csv_mod['Farmaco']=None
def primaparola(colonna_di_testo):
    
    pattern=r'\W*(\w[^,. !?"]*)'
    return re.match(pattern,colonna_di_testo).group()

print(type(csv_mod['Denominazione della confezione']))

estratto=csv_mod['Denominazione della confezione'].apply(primaparola)
csv_mod['Farmaco']=estratto
csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali']=pd.to_numeric(csv_mod['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].str.replace('.',''))
<class 'pandas.core.series.Series'>
In [3]:
#We group our dataframe by 'Farmaco' column
#And we determine the sum 
grouped=csv_mod.groupby(by='Farmaco').sum()
#We sort our data in descending order
df_farmaco=grouped['Quantità confezioni fornite alle farmacie ed agli esercizi commerciali'].sort_values(ascending=False)
#We create an array 
#to plot on the X axis our drugs
ln=np.arange(0,len(df_farmaco))
label=df_farmaco.index
In [5]:
#The names of drugs will be our x-axis labels
label=df_farmaco.index
#We visualize our data through a Scatter Plot
plt.scatter(ln,df_farmaco)
plt.title('Best selling Drugs without prescription in the first semester of 2016')

plt.ylabel('# boxes sold')
plt.xticks(ln,(label) ,rotation=90)
plt.text(10, 4, 'Data Elaboration Ing. Andrea Ciufo',
         fontsize=15, color='gray',
         ha='center', va='top', alpha=0.9)
plt.rcParams["figure.figsize"] = (15,4)
plt.show()
*The Ministry uses the term “distributed ” instead of “sold”, the two terms are similar but not the same, I supposed the most distributed as a good proxy of the best selling drugs
**For time reason I didn’t investigate which kind of drugs were advertised on TV and other communication channels.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related articles

Beware of YES Master NO

Talking about negotiation, like other people, I am not so good at handling “NO” as a reply.  I mean, after

Questions?
Let's get in touch!

Tell me about your project

I will arrange a meeting to discuss the details

we will start working together

Scroll to Top