Earthquake Damage Prediction with Machine Learning — Part 4

Ng Jen Neng
4 min read · Nov 20, 2020


Photo by Zoshua Colah on Unsplash

This story continues a series:

Part 1: Background Research

Part 2: Data Analysis

Part 3: Implementation

Part 4: Implementation (Continued)

Part 4: Implementation (Continued)

Experiment 7: No SMOTE

This experiment only requires commenting out a few lines of code:
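As a minimal sketch, assuming the up-sampling from Part 3 was done with `SmoteClassif` from the UBL package (the variable names here are hypothetical), the lines to comment out would look like this:

```r
library(UBL)

# Experiment 7: skip SMOTE up-sampling.
# Comment out the SmoteClassif call so the model trains on the raw,
# imbalanced data instead of the interpolated samples.
# train_data <- SmoteClassif(damage_grade ~ ., train_data,
#                            C.perc = "balance", k = 5)

# All downstream training code stays unchanged; it now sees the
# original class distribution.
```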

Experiment 8: Remove Outlier

The outlier-removal code was already demonstrated in the preprocessing section.

Experiment 9: PCA Transformation

We will reuse the psych package that performed the PCA analysis in the previous section.
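A sketch of that step, assuming (as the summary below describes) that `height_percentage` and `count_floors_pre_eq` are the two columns combined into one component; the new column name is hypothetical:

```r
library(psych)

# Extract a single principal component from the two correlated
# building-height variables identified in the earlier PCA analysis.
pca_fit <- principal(train_data[, c("height_percentage", "count_floors_pre_eq")],
                     nfactors = 1, rotate = "none")

# Replace the two original columns with the component scores,
# reducing the dataset's dimensionality by one.
train_data$height_floor_pc    <- pca_fit$scores[, 1]
train_data$height_percentage   <- NULL
train_data$count_floors_pre_eq <- NULL
```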

Experiment 10: Feature Selection (Recursive Feature Elimination)

Recursive Feature Elimination (RFE) is a feature selection method that recursively removes the weakest features until a target number of features is reached, guided by cross-validation. The caret package provides an rfe() function that implements this backward selection strategy and finds the optimal feature set size.
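A hedged sketch of the caret call, assuming random-forest scoring and the `damage_grade` target from earlier parts (the candidate `sizes` and seed are illustrative choices, not the article's exact settings):

```r
library(caret)

set.seed(42)

# Cross-validated backward selection scored with random forests.
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

predictor_cols <- setdiff(names(train_data), "damage_grade")

rfe_fit <- rfe(x = train_data[, predictor_cols],
               y = train_data$damage_grade,
               sizes = c(5, 10, 15, 18, 20, 25),  # subset sizes to evaluate
               rfeControl = ctrl)

predictors(rfe_fit)  # optimal feature subset
plot(rfe_fit)        # variable count vs. error curve
```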

The variable count vs. RMSE plot shows that the optimal number of variables is 18, where the error is lowest. Increasing the number of variables beyond that does not help the model perform better. Most of the continuous features also appear at the top of the importance list.

Based on the optimal feature list obtained from RFE, we simply filter the dataset down to those features and feed it into the Experiment 4 random forest model.
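That filtering step can be sketched as follows, assuming the fitted RFE object from caret's rfe() is stored in a variable named `rfe_fit` (a hypothetical name):

```r
# Keep only the RFE-selected predictors plus the target column,
# then reuse the Experiment 4 random forest call on the smaller dataset.
selected      <- predictors(rfe_fit)
train_reduced <- train_data[, c(selected, "damage_grade")]
```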

Conclusion & Summary (Experiments 7–10)

Experiment 7–10 Summary

Experiment 7:

Accuracy increased by almost 7% when the SmoteClassif up-sampling method was not used. This matches what we expected from the related-work analysis. The likely explanation is that the up-sampled (interpolated) data is not reliable: the generated samples do not clearly reflect the variance relationship with the dependent variable, especially for the smallest class (damage_grade class 1).

Experiment 8:

Removing the age outliers also causes class 1 to be predicted less accurately than in the previous experiment. This could be because the extreme age values correlate strongly with class 1 observations.

Experiment 9:

PCA combines height_percentage and count_floors_pre_eq into one component, which reduces the dimensionality of the dataset without affecting the score. This could help reduce the time needed to train the model.

Experiment 10:

The accuracy is similar to the Experiment 4 random forest model (0.677), but the processing time dropped from 2005 seconds to 498 seconds. Feature selection reduced the dataset's dimensionality and eliminated unnecessary features, which can speed up models with long training times, such as random forest.

Ending Summary

The experiments have confirmed the assumptions we defined in the related-work analysis. In conclusion, finding the trade-off point between a generalising model and a "more node splitting" model is an empirical question in data science: hyperparameter tuning has to be experimented with for each dataset.

For this competition, we could submit a solution with no up-sampling/resampling and use XGBoost to achieve high accuracy. Using the Experiment 7 model, I achieved 0.730 accuracy on DrivenData (the current top-1 rank is 0.755), even though the model overfits heavily (validation accuracy of 0.92+).

Whether to use a less overfit model such as random forest or a heavily overfit one like the Experiment 7 model remains an open question. It depends on production needs and what you are willing to trade off.

