Project 2 - Kenyan DHS Data
For this project, I am interested in predicting wealth based on DHS data from Kenya. Below is a summary of how the observations are distributed across the five wealth categories:
Model 1 - Penalized Logistic Regression
See below for a summary of the top_models:
penalty .metric .estimator mean n std_err .config
<dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
1 0.0001 roc_auc hand_till 0.643 1 NA Preprocessor1_Model01
2 0.000127 roc_auc hand_till 0.643 1 NA Preprocessor1_Model02
3 0.000161 roc_auc hand_till 0.643 1 NA Preprocessor1_Model03
4 0.000204 roc_auc hand_till 0.643 1 NA Preprocessor1_Model04
5 0.000259 roc_auc hand_till 0.643 1 NA Preprocessor1_Model05
6 0.000329 roc_auc hand_till 0.642 1 NA Preprocessor1_Model06
7 0.000418 roc_auc hand_till 0.642 1 NA Preprocessor1_Model07
8 0.000530 roc_auc hand_till 0.642 1 NA Preprocessor1_Model08
9 0.000672 roc_auc hand_till 0.642 1 NA Preprocessor1_Model09
10 0.000853 roc_auc hand_till 0.641 1 NA Preprocessor1_Model10
11 0.00108 roc_auc hand_till 0.641 1 NA Preprocessor1_Model11
12 0.00137 roc_auc hand_till 0.641 1 NA Preprocessor1_Model12
13 0.00174 roc_auc hand_till 0.640 1 NA Preprocessor1_Model13
14 0.00221 roc_auc hand_till 0.640 1 NA Preprocessor1_Model14
15 0.00281 roc_auc hand_till 0.639 1 NA Preprocessor1_Model15
And below is a plot of penalty values versus the area under the ROC curve:
Based on the above information, I selected model 7 as the best-performing model: it has very nearly the highest ROC AUC, and it is the last model before a slight drop-off in the graph. This model has a penalty of 0.000418. That said, I’m not convinced that it is significantly better than the other models - based on the available information there is very little difference between them, especially among the first nine.
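The output above appears to come from a tidymodels tuning run in R (hand_till is yardstick's multiclass ROC AUC estimator). As a rough illustration of the same tuning loop, here is a scikit-learn equivalent; X and y are placeholders for the DHS predictors and the five-level wealth label, and the penalty-to-C mapping is only approximate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: DHS predictors, y: wealth category 1-5 (both assumed placeholders).
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

# A log-spaced grid roughly matching the penalty values in the table above.
for penalty in np.logspace(-4, -2.55, 15):
    model = make_pipeline(
        StandardScaler(),
        # A glmnet-style penalty corresponds (roughly) to 1/C in scikit-learn.
        LogisticRegression(C=1.0 / penalty, max_iter=5000),
    )
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)
    # multi_class="ovo" with macro averaging is the Hand & Till (2001)
    # multiclass AUC - the same "hand_till" estimator reported above.
    auc = roc_auc_score(y_val, proba, multi_class="ovo", average="macro")
    print(f"penalty={penalty:.6f}  roc_auc={auc:.3f}")
```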
Below are the ROC plots, showing how well this model predicted the wealth categories:
The dotted 45-degree line through each of the graphs shows what the curve would look like if the model were no better than random guessing. The higher the curve sits above that line, the better the model is at predicting that particular wealth category. As you can see, the most prominent curves are on graphs 1 and 5. This makes sense - the model is good at identifying those with very high and very low wealth. It does not fare as well with the middle categories. This is to be expected: distinguishing between adjacent middle categories is likely much more complicated and nuanced, and would probably require more variables.
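For reference, per-category curves like these can be drawn by binarizing the outcome one-vs-rest. Continuing from the sketch above (reusing proba and y_val):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay
from sklearn.preprocessing import label_binarize

# One ROC curve per wealth category, treating each as a one-vs-rest problem.
classes = [1, 2, 3, 4, 5]  # assumed label encoding
y_bin = label_binarize(y_val, classes=classes)
fig, axes = plt.subplots(1, 5, figsize=(20, 4), sharey=True)
for i, (cls, ax) in enumerate(zip(classes, axes)):
    RocCurveDisplay.from_predictions(y_bin[:, i], proba[:, i], ax=ax)
    ax.plot([0, 1], [0, 1], linestyle="dotted")  # random-guessing reference line
    ax.set_title(f"Wealth {cls}")
plt.show()
```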
Model 2 - Random Forest
Below, you can see the ROC AUC values across the tuning grid for the number of randomly selected predictors and the minimal node size.
In general, the results were very comparable to the penalized logistic regression. You can see this in the graph below, which compares the AUC of the penalized logistic regression model to the AUC of the random forest model:
While the results are nearly identical, this comparison shows that the random forest model performs slightly better. Below are the ROC plots for the random forest, showing how well this model predicted the wealth categories:
Again, we see that the model does significantly better at predicting those with very high or very low wealth, and does not fare as well with those in the middle.
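Stepping back to how this model was fit: a hedged scikit-learn sketch of the tuning over the two hyperparameters above (the grid values are my own guesses, not the ones actually used), with max_features standing in for the number of randomly selected predictors and min_samples_leaf for the minimal node size:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid search over the two hyperparameters shown in the tuning plot above.
grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=42),
    param_grid={
        "max_features": [2, 4, 8, 16],        # randomly selected predictors
        "min_samples_leaf": [2, 10, 25, 40],  # minimal node size
    },
    scoring="roc_auc_ovo",  # Hand & Till multiclass AUC, as in Model 1
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
rf = grid.best_estimator_
print(grid.best_params_, round(grid.best_score_, 3))
```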
Interestingly, gender is weighted significantly lower than the other predictors in this model - I had expected it to be higher. Age, by contrast, is the most important feature. Below, we can see the relative importance of each feature included in the model:
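An importance ranking like the one plotted above might be pulled from the fitted forest as sketched here (rf from the sketch above; feature_names is an assumed placeholder, and these impurity-based importances are only a stand-in for whatever importance measure the original model reported):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances from the fitted forest; feature_names is
# assumed to match the columns of X.
importances = pd.Series(rf.feature_importances_, index=feature_names).sort_values()
importances.plot.barh()
plt.xlabel("Relative importance")
plt.tight_layout()
plt.show()
```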
Model 3 - Logistic Regression with Linear Classifier
This model is trained as a logistic regression using the TensorFlow Estimator API (a LinearClassifier) and the DHS data from Kenya. Because metrics are reported separately for each wealth bin, this appears to be one binary, one-vs-rest classifier per category.
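A minimal sketch of that setup, assuming numeric features held in a dict of NumPy arrays and a 0/1 label marking membership in wealth bin k (features, labels_k, and their eval_* counterparts are hypothetical placeholders); the step count mirrors the global_step reported below:

```python
import tensorflow as tf  # the tf.estimator API (TF 1.x style)

feature_columns = [tf.feature_column.numeric_column(name) for name in features]

def train_input_fn():
    ds = tf.data.Dataset.from_tensor_slices((features, labels_k))
    return ds.shuffle(10_000).repeat().batch(32)

def eval_input_fn():
    ds = tf.data.Dataset.from_tensor_slices((eval_features, eval_labels_k))
    return ds.batch(32)

model = tf.estimator.LinearClassifier(feature_columns=feature_columns)
model.train(input_fn=train_input_fn, steps=35_960)  # matches global_step below
results = model.evaluate(input_fn=eval_input_fn)
print(results)  # accuracy, average_loss, loss, global_step
```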
Wealth 1:
accuracy 0.753285
average_loss 0.515050
loss 0.515025
global_step 35960.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 2:
accuracy 0.793826
average_loss 0.496922
loss 0.496950
global_step 35960.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 3:
accuracy 0.810512
average_loss 0.478042
loss 0.478025
global_step 35960.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 4:
accuracy 0.821097
average_loss 0.455290
loss 0.455296
global_step 35960.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 5:
accuracy 0.863307
average_loss 0.365876
loss 0.365920
global_step 35960.000000
ROC Plot:
Predicted Probabilities Plot:
Model 4 - Gradient Boosting Model
For this model, I trained a gradient boosting model using decision trees with the TensorFlow Estimator API. Below are the results for each wealth bin. As with the other models, it does a much better job of predicting very high and very low wealth, and a fairly poor job of predicting the middle wealth levels. Wealth 1 is an interesting (and suspicious) category here - it appears to have 100% accuracy, a perfect AUC, and a near-zero loss. I can’t find an issue with the code, but results like this usually point to target leakage (a predictor that effectively encodes the label) rather than genuine skill, so we should be cautious of that particular result.
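A hedged sketch of the boosted-trees counterpart, reusing the feature columns from Model 3 (the tree count and depth are my assumptions; only the 100 training steps are taken from the global_step reported below):

```python
import tensorflow as tf

# One-vs-rest boosted-trees classifier for a single wealth bin.
def tree_input_fn():
    ds = tf.data.Dataset.from_tensor_slices((features, labels_k))
    return ds.repeat().batch(len(labels_k))  # whole dataset as one batch

model = tf.estimator.BoostedTreesClassifier(
    feature_columns=feature_columns,
    n_batches_per_layer=1,  # one (full-dataset) batch per tree layer
    n_trees=100,            # assumption, not taken from the results
    max_depth=6,            # assumption
)
model.train(input_fn=tree_input_fn, max_steps=100)  # global_step of 100 below
results = model.evaluate(input_fn=eval_input_fn)
print(results)  # accuracy, auc, precision, recall, label/mean, ...
```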
Wealth 1:
accuracy 1.000000
accuracy_baseline 0.712066
auc 1.000000
auc_precision_recall 1.000000
average_loss 0.000022
label/mean 0.287934
loss 0.000022
precision 1.000000
prediction/mean 0.287944
recall 1.000000
global_step 100.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 2:
accuracy 0.794947
accuracy_baseline 0.795052
auc 0.766762
auc_precision_recall 0.375286
average_loss 0.408367
label/mean 0.204948
loss 0.408367
precision 0.409091
prediction/mean 0.204926
recall 0.001145
global_step 100.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 3:
accuracy 0.809808
accuracy_baseline 0.809782
auc 0.714789
auc_precision_recall 0.299503
average_loss 0.408735
label/mean 0.190218
loss 0.408735
precision 1.000000
prediction/mean 0.187240
recall 0.000137
global_step 100.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 4:
accuracy 0.821514
accuracy_baseline 0.821332
auc 0.714486
auc_precision_recall 0.298418
average_loss 0.397650
label/mean 0.178668
loss 0.397650
precision 0.606061
prediction/mean 0.175882
recall 0.002918
global_step 100.000000
ROC Plot:
Predicted Probabilities Plot:
Wealth 5:
accuracy 0.861873
accuracy_baseline 0.853869
auc 0.791587
auc_precision_recall 0.397457
average_loss 0.331681
label/mean 0.146131
loss 0.331681
precision 0.594345
prediction/mean 0.144980
recall 0.172525
global_step 100.000000
ROC Plot:
Predicted Probabilities Plot:
So, which model is best?
Based on the above results, Model 4 - the Gradient Boosting Model - produced the best results. All of the models followed the same trend: moderately good predictions for wealth 1 and wealth 5, and moderately poor predictions for wealth 2, 3, and 4. Notice, for example, that for bins 2 through 4 the gradient boosting model’s accuracy barely exceeds its accuracy_baseline and its recall is close to zero - it almost never predicts the positive class for the middle bins. Gradient boosting is also an outlier in having produced perfect predictions for wealth 1. However, I believe there is reason to be skeptical of this: it is highly unlikely to be a true indication of the model’s efficacy, and more likely reflects an error in the code or a leaking feature in the data.
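One quick way to test the leakage hypothesis would be to check whether any single predictor nearly separates the wealth-1 label on its own - something along these lines (names are hypothetical; features as a dict of numeric arrays and y as the five-level label, as in the sketches above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y1 = (y == 1).astype(int)  # one-vs-rest label for wealth bin 1 (assumed encoding)
for name, values in features.items():
    auc = roc_auc_score(y1, np.asarray(values, dtype=float))
    if max(auc, 1.0 - auc) > 0.99:  # a lone near-perfect predictor suggests leakage
        print(f"suspicious predictor: {name} (single-feature AUC {auc:.3f})")
```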