Disaster Response Pipeline with Figure Eight
Udacity Data Scientist Nanodegree Program Project
This project is part of the Udacity Data Scientist Nanodegree Program: Disaster Response Pipeline Project. The goal is to apply the data engineering skills learned in the course to analyze disaster data from Figure Eight and build a model for an API that classifies disaster messages. As always, let's apply the CRISP-DM process (Cross-Industry Standard Process for Data Mining) to tackle the problem:
- Business Understanding
- Data Understanding
- Prepare Data
- Data Modeling
- Evaluate the Results
- Deploy
Business Understanding
During and immediately after a natural disaster there are millions of communications to disaster response organizations, either directly or through social media. Disaster response organizations have to filter and pull out the most important messages from this huge volume of communications and redirect specific requests or indications to the proper organization that takes care of medical aid, water, logistics, etc. Every second is vital in these situations, so handling each message correctly is key.
The project is divided into three sections:
- Data Processing: build an ETL (Extract, Transform, and Load) pipeline to extract data from the given dataset, clean the data, and then store it in a SQLite database
- Machine Learning Pipeline: split the data into a training set and a test set. Then create a machine learning pipeline that uses NLTK, as well as scikit-learn's Pipeline and GridSearchCV, to output a final model that predicts message classifications for the 36 categories (multi-output classification)
- Web development: develop a web application to classify messages in real time
Data Understanding
The dataset provided by Figure Eight contains about 30,000 messages drawn from events including the 2010 earthquake in Haiti, the 2010 earthquake in Chile, the 2010 floods in Pakistan, superstorm Sandy in the U.S.A. in 2012, and news articles spanning a large number of years and hundreds of different disasters. The messages have been classified into 36 different categories related to disaster response and have been stripped of sensitive information in their entirety. A translation from the original language to English is also provided. More information about the dataset can be found here.
Prepare Data
The provided dataset is composed of two files:
- disaster_categories.csv: Categories of the messages
- disaster_messages.csv: Multilingual disaster response messages
Data preparation steps (see the sketch after this list):
1 Merge the two datasets
2 Split categories into separate category columns
3 One-hot encode category
4 Remove duplicates
5 Upload to SQLite database
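A minimal sketch of these steps with pandas and SQLAlchemy, assuming the file names listed above, a simplified layout of the categories column (values like "related-1;request-0;…"), and a table name of messages:

```python
import pandas as pd
from sqlalchemy import create_engine

# 1. Merge the two datasets on their common id column
messages = pd.read_csv('disaster_messages.csv')
categories = pd.read_csv('disaster_categories.csv')
df = messages.merge(categories, on='id')

# 2. Split the single categories string into one column per category
categories = df['categories'].str.split(';', expand=True)
categories.columns = [value.split('-')[0] for value in categories.iloc[0]]

# 3. One-hot encode: keep only the trailing digit of each value as 0/1
for column in categories.columns:
    categories[column] = categories[column].str[-1].astype(int)
df = pd.concat([df.drop(columns='categories'), categories], axis=1)

# 4. Remove duplicates
df = df.drop_duplicates()

# 5. Upload to a SQLite database
engine = create_engine('sqlite:///DisasterResponse.db')
df.to_sql('messages', engine, index=False, if_exists='replace')
```

Storing the cleaned table in SQLite keeps the ETL output decoupled from the model training step.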
At the end we obtain a SQLite table containing the messages and all their attributes:
- id
- message
- original
- genre
- related
- request
- offer
- aid_related
- medical_help
- medical_products
- search_and_rescue
- security
- military
- child_alone
- water
- food
- shelter
- clothing
- money
- missing_people
- refugees
- death
- other_aid
- infrastructure_related
- transport
- buildings
- electricity
- tools
- hospitals
- shops
- aid_centers
- other_infrastructure
- weather_related
- floods
- storm
- fire
- earthquake
- cold
- other_weather
- direct_report
Data Modeling
Now we will use the data to train a model that takes the message column as input and outputs classification results for the other 36 categories in the dataset. The steps are the following (the first one is sketched in code after the list):
1 Load the data
2 Create a ML Pipeline
3 Train the ML Pipeline
4 Test the model
5 Tune the model
6 Evaluate the results
7 Export the model
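A minimal sketch of the first step, loading the cleaned table back from SQLite and splitting it into train and test sets (the database and table names are the ones assumed in the ETL sketch above):

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split

# Load the cleaned data produced by the ETL pipeline
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages', engine)

# X is the raw message text, Y the 36 binary category columns
X = df['message']
Y = df.drop(columns=['id', 'message', 'original', 'genre'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
```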
The components used in the pipeline are:
- CountVectorizer: Convert a collection of text documents to a matrix of token counts
- TfidfTransformer: Transform a count matrix to a tf-idf (term-frequency times inverse document-frequency) representation
- MultiOutputClassifier: Multi target classification
As already mentioned, we then use GridSearchCV to perform an exhaustive search over specified parameter values for our estimator; a sketch of the full pipeline and the grid search follows.
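A minimal sketch of such a pipeline and grid search, assuming a simple NLTK tokenize function and an illustrative (not the project's actual) parameter grid; X_train and Y_train come from the split above:

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

nltk.download(['punkt', 'wordnet'])

def tokenize(text):
    """Tokenize, lemmatize, and normalize a message."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok).lower().strip() for tok in word_tokenize(text)]

# Text processing and multi-output classifier chained in a single pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# Illustrative parameter grid; any pipeline parameter can be searched this way
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__n_estimators': [50, 100],
}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
cv.fit(X_train, Y_train)
```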
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection: the term frequency counts how often a word appears in a document, while the inverse document frequency downweights words that appear in many documents. Check this fantastic Quora article for more information.
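As a tiny illustration (toy corpus, not the project data): in the last message below, "hospital" receives a higher weight than "water", even though both appear once, because "water" occurs in every document and therefore has a lower inverse document frequency:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'water' appears in every message, 'hospital' in only one
corpus = [
    "we need water",
    "send food and water",
    "the hospital needs water",
]

tfidf = TfidfVectorizer().fit(corpus)
weights = pd.DataFrame(tfidf.transform(corpus).toarray(),
                       columns=tfidf.get_feature_names_out())
# 'hospital' gets a higher weight than 'water' in the last row
print(weights.round(2))
```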
The model is finally saved so that it can be loaded later and used for real-time message classification.
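A minimal sketch of exporting and re-loading the tuned model, assuming a hypothetical file name classifier.pkl:

```python
import pickle

# Export the best estimator found by the grid search
with open('classifier.pkl', 'wb') as f:
    pickle.dump(cv.best_estimator_, f)

# Later, reload it and classify a new message in real time
with open('classifier.pkl', 'rb') as f:
    model = pickle.load(f)

print(model.predict(["We need water and medical supplies"]))
```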
Evaluate the Results
The dataset is highly imbalanced: some categories appear in only a small fraction of the messages, which is why the overall accuracy is high while the recall is pretty low.
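One way to see this is to print a per-category report on the test set; a minimal sketch, reusing the model and split from the sketches above:

```python
from sklearn.metrics import classification_report

# Predict on the held-out test set and report per-category metrics
Y_pred = cv.predict(X_test)
for i, column in enumerate(Y_test.columns):
    print(column)
    print(classification_report(Y_test[column], Y_pred[:, i], zero_division=0))
```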
There are many ways to tackle an imbalanced dataset, as shown in this really interesting Medium post.
Deploy
A Dash application has been developed as the user interface: it is possible to submit a message to classify and to get an overview of some information about the training dataset.
When a message is submitted with the Classify Message button, the resulting categories are highlighted in green.
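The following is not the project's actual app, just a minimal sketch of the idea with Dash; the component ids, file names, and layout are assumptions:

```python
import pickle
import pandas as pd
from sqlalchemy import create_engine
from dash import Dash, html, dcc, Input, Output, State

# Load the trained pipeline and the category names from the database
with open('classifier.pkl', 'rb') as f:
    model = pickle.load(f)
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages', engine)
category_names = df.columns[4:]  # the 36 category columns

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id='message', type='text', placeholder='Enter a message to classify'),
    html.Button('Classify Message', id='classify'),
    html.Ul(id='results'),
])

@app.callback(Output('results', 'children'),
              Input('classify', 'n_clicks'),
              State('message', 'value'),
              prevent_initial_call=True)
def classify_message(n_clicks, message):
    if not message:
        return []
    prediction = model.predict([message])[0]
    # Highlight the predicted categories in green
    return [html.Li(name, style={'color': 'green' if flag else 'gray'})
            for name, flag in zip(category_names, prediction)]

if __name__ == '__main__':
    app.run(debug=True)
```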
You can try it out on my website on this page.
Outro
I hope the story was interesting and thank you for taking the time to read it. The code for this project can be found in this GitHub repository, and on my Blogspot you can find the same post in Italian. Let me know if you have any questions, and if you like the content that I create, feel free to buy me a coffee.