Disaster Response Pipeline with Figure Eight

Udacity Data Scientist Nanodegree Program Project

Simone Rigoni
4 min read · Feb 9, 2020
https://www.figure-eight.com/dataset/combined-disaster-response-data/

This project is part of the Udacity Data Scientist Nanodegree Program: Disaster Response Pipeline Project. The goal was to apply the data engineering skills learned in the course to analyze disaster data from Figure Eight and build a model for an API that classifies disaster messages. As always, let's apply the CRISP-DM process (Cross-Industry Standard Process for Data Mining) to tackle the problem:

  1. Business Understanding
  2. Data Understanding
  3. Prepare Data
  4. Data Modeling
  5. Evaluate the Results
  6. Deploy

Business Understanding

During and immediately after a natural disaster there are millions of communications to disaster response organizations, either direct or through social media. Disaster response organizations have to filter and pull out the most important messages from this huge amount of communications and redirect specific requests or indications to the proper organization that takes care of medical aid, water, logistics, etc. Every second is vital in these situations, so handling each message correctly is key.

The project is divided into three sections:

  • Data Processing: build an ETL (Extract, Transform, and Load) Pipeline to extract data from the given dataset, clean the data, and then store it in a SQLite database
  • Machine Learning Pipeline: split the data into a training set and a test set. Then, create a machine learning pipeline that uses NLTK, as well as scikit-learn’s Pipeline and GridSearchCV, to output a final model that predicts a message’s classifications for the 36 categories (multi-output classification)
  • Web development: develop a web application that classifies messages in real time

Data Understanding

The dataset provided by Figure Eight contains 30000 messages drawn from events including an earthquake in Haiti in 2010, an earthquake in Chile in 2010, floods in Pakistan in 2010, Superstorm Sandy in the U.S.A. in 2012, and news articles spanning a large number of years and hundreds of different disasters. The messages have been classified into 36 different categories related to disaster response and have been stripped of sensitive information in their entirety. A translation from the original language to English has also been provided. More information about the dataset here

Message categories

Prepare Data

The provided dataset is basically composed of two files:

  • disaster_categories.csv: Categories of the messages
  • disaster_messages.csv: Multilingual disaster response messages

Data preparation steps (sketched in code right after the list):

1. Merge the two datasets
2. Split categories into separate category columns
3. One-hot encode the category values
4. Remove duplicates
5. Upload to a SQLite database
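A minimal sketch of these steps with pandas and SQLAlchemy, assuming the usual layout of the categories file (a single semicolon-separated categories column) and a hypothetical database file name:

```python
import pandas as pd
from sqlalchemy import create_engine

# 1. Merge the two datasets on the common id column
messages = pd.read_csv('disaster_messages.csv')
categories = pd.read_csv('disaster_categories.csv')
df = messages.merge(categories, on='id')

# 2. Split the single 'categories' string into 36 separate columns
cats = df['categories'].str.split(';', expand=True)
cats.columns = [value.split('-')[0] for value in cats.iloc[0]]

# 3. One-hot encode: keep only the trailing 0/1 of each value
for col in cats.columns:
    cats[col] = cats[col].str[-1].astype(int)
df = pd.concat([df.drop(columns='categories'), cats], axis=1)

# 4. Remove duplicates
df = df.drop_duplicates()

# 5. Upload to a SQLite database (file and table names are assumptions)
engine = create_engine('sqlite:///DisasterResponse.db')
df.to_sql('messages', engine, index=False, if_exists='replace')
```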

At the end we obtain a SQLite table containing the messages and all their attributes:

- id
- message
- original
- genre
- related
- request
- offer
- aid_related
- medical_help
- medical_products
- search_and_rescue
- security
- military
- child_alone
- water
- food
- shelter
- clothing
- money
- missing_people
- refugees
- death
- other_aid
- infrastructure_related
- transport
- buildings
- electricity
- tools
- hospitals
- shops
- aid_centers
- other_infrastructure
- weather_related
- floods
- storm
- fire
- earthquake
- cold
- other_weather
- direct_report

First 5 records in the table

Data Modeling

Now we will use the data to train a model that takes the message column as input and outputs classification results for the other 36 categories in the dataset. The steps, sketched in code below, are:

1. Load the data
2. Create a ML Pipeline
3. Train the ML Pipeline
4. Test the model
5. Tune the model
6. Evaluate the results
7. Export the model
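The loading step simply reads the table written by the ETL pipeline back out of SQLite (same file and table name assumptions as above) and splits it into a training set and a test set:

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split

engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages', engine)

X = df['message']    # model input: the message text
y = df.iloc[:, 4:]   # the 36 category columns (after id, message, original, genre)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```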

The components used in the pipeline are:

ML Pipeline components
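A minimal sketch of such a pipeline, assuming a typical setup of a CountVectorizer with an NLTK-based tokenizer, a TfidfTransformer and a MultiOutputClassifier wrapping a RandomForestClassifier (the exact components are listed in the figure above):

```python
# nltk.download('punkt') and nltk.download('wordnet') may be needed beforehand
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

def tokenize(text):
    """Normalize, tokenize and lemmatize a raw message."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok.lower().strip()) for tok in word_tokenize(text)]

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),              # bag-of-words counts
    ('tfidf', TfidfTransformer()),                               # re-weight counts with TF-IDF
    ('clf', MultiOutputClassifier(RandomForestClassifier())),   # one classifier per category
])

# pipeline.fit(X_train, y_train) and pipeline.predict(X_test) then handle all 36 categories
```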

As already mentioned, we then used GridSearchCV to exhaustively search over specified parameter values for our estimator.
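For example, a small parameter grid over the pipeline sketched above could look like this (the values here are hypothetical, not the ones actually tuned in the project):

```python
from sklearn.model_selection import GridSearchCV

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__n_estimators': [50, 100],
    'clf__estimator__min_samples_split': [2, 4],
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=2)
# cv.fit(X_train, y_train) evaluates every combination and keeps the best
# fitted pipeline in cv.best_estimator_
```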

TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection. Check this fantastic Quora article for more information:

https://www.quora.com/What-is-a-tf-idf-vector
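As a quick illustration (not taken from the project code), scikit-learn's TfidfVectorizer computes these weights directly: words that appear in many documents get a low weight, while words specific to a few documents get a high one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "we need water and food",
    "water pipes are broken",
    "send food and shelter",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# Weights for the third document: "shelter" scores higher than the more
# common "food" and "and"
print(dict(zip(tfidf.get_feature_names_out(), matrix.toarray()[2].round(2))))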

The model is finally saved so it can be loaded later and used for real-time message classification.
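A common way to do this is with pickle (joblib would work equally well); the file name and the cv object below come from the earlier sketches and are assumptions:

```python
import pickle

# Persist the tuned pipeline so the web app can load it without retraining
with open('classifier.pkl', 'wb') as f:
    pickle.dump(cv.best_estimator_, f)

# Later, in the web application:
with open('classifier.pkl', 'rb') as f:
    model = pickle.load(f)

category_predictions = model.predict(['We need water and medical supplies'])
```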

Evaluate the Results

The dataset is highly imbalanced, which is why the accuracy is high while the recall is pretty low.

Classification results
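Per-category precision, recall and f1-score can be inspected with scikit-learn's classification_report, for example (reusing the test split and the tuned estimator from the sketches above):

```python
from sklearn.metrics import classification_report

y_pred = cv.best_estimator_.predict(X_test)

# Print precision, recall and f1-score for each of the 36 categories
for i, column in enumerate(y_test.columns):
    print(column)
    print(classification_report(y_test.iloc[:, i], y_pred[:, i]))
```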

There are many ways to tackle an imbalanced dataset, as shown in this really interesting Medium post.
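One simple option, purely as an illustration and not necessarily what the linked post recommends, is to re-weight classes inversely to their frequency:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# class_weight='balanced' penalizes mistakes on rare categories more heavily,
# usually trading a little accuracy for better recall on the minority classes
balanced_clf = MultiOutputClassifier(RandomForestClassifier(class_weight='balanced'))
```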

Deploy

A Dash application has been developed as the user interface: it is possible to submit a message to classify and to get an overview of some information about the training dataset.

Dash web application home
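A minimal sketch of such a Dash front end, reusing the hypothetical DisasterResponse.db and classifier.pkl artifacts from the earlier sketches:

```python
import pickle
import pandas as pd
from dash import Dash, html, dcc, Input, Output, State
from sqlalchemy import create_engine

# Load the exported model (file name is an assumption)
with open('classifier.pkl', 'rb') as f:
    model = pickle.load(f)

# The first four columns are id, message, original and genre; the rest are categories
engine = create_engine('sqlite:///DisasterResponse.db')
CATEGORY_NAMES = pd.read_sql('SELECT * FROM messages LIMIT 1', engine).columns[4:]

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id='message', type='text', placeholder='Type a disaster message...'),
    html.Button('Classify Message', id='classify'),
    html.Div(id='categories'),
])

@app.callback(Output('categories', 'children'),
              Input('classify', 'n_clicks'),
              State('message', 'value'),
              prevent_initial_call=True)
def classify_message(n_clicks, message):
    # Return the names of the categories predicted as 1 for the submitted message
    flags = model.predict([message])[0]
    return ', '.join(name for name, flag in zip(CATEGORY_NAMES, flags) if flag == 1)

if __name__ == '__main__':
    app.run(debug=True)
```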

When a message is submitted with the Classify Message button, the resulting categories are highlighted in green.

Message classification

You can try it out on this page of my website.

Note: the code can be found in this GitHub repository.
