Increase Donations with Machine Learning
A charity organization set a goal to increase donations. From the data it has, a person with an income of more than $50,000 per year is more likely to donate. The goal of the project was therefore to build a model that predicts whether a person's income exceeds that threshold.
Several algorithms were tested during the project. Gradient Boosting was chosen because it achieved the best prediction accuracy. Its one disadvantage is that it was the slowest algorithm during the model training phase.
At the beginning of the project, simpler algorithms such as Logistic Regression, Decision Trees, and Gaussian Naive Bayes were tested. None reached the required accuracy, so the project moved to ensemble methods. Of the three ensemble methods tested, Gradient Boosting performed best. After optimization, Gradient Boosting reached an accuracy score of 87%, meaning that 87 out of 100 predictions were correct. Its F-beta score (beta = 0.5) was around 76%.
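As a rough sketch of how such a comparison might be run with scikit-learn, the snippet below trains the four classifiers mentioned above and reports accuracy and the F-beta score with beta = 0.5. The synthetic data and the train/test split are illustrative assumptions, not the project's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the census data: binary target (income > $50K or not).
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit each candidate and record accuracy and F0.5 on the held-out test set.
results = {}
for clf in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(),
            GaussianNB(), GradientBoostingClassifier()):
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[type(clf).__name__] = (accuracy_score(y_test, pred),
                                   fbeta_score(y_test, pred, beta=0.5))

for name, (acc, fb) in results.items():
    print(f"{name}: accuracy={acc:.3f}, F0.5={fb:.3f}")
```

The F-beta score with beta = 0.5 weights precision more heavily than recall, which fits a use case where contacting a non-donor costs more than missing a donor.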
The data for the project was taken from the UCI Machine Learning Repository. It included the following attributes (features): age, workclass, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country. Using this data, the goal was to construct a model that predicts whether a person makes more than $50,000 per year.
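A minimal sketch of how the binary prediction target can be derived from the income column with pandas. The two inline rows stand in for the real UCI file, which would be read the same way (e.g. with `pd.read_csv("adult.data", names=columns, skipinitialspace=True)` — the filename here is an assumption):

```python
import pandas as pd

# Feature names as listed in the write-up, plus the income label column.
columns = ["age", "workclass", "education", "education-num", "marital-status",
           "occupation", "relationship", "race", "sex", "capital-gain",
           "capital-loss", "hours-per-week", "native-country", "income"]

# Two illustrative rows standing in for the actual dataset.
data = pd.DataFrame([
    [39, "State-gov", "Bachelors", 13, "Never-married", "Adm-clerical",
     "Not-in-family", "White", "Male", 2174, 0, 40, "United-States", "<=50K"],
    [52, "Self-emp", "HS-grad", 9, "Married", "Exec-managerial",
     "Husband", "White", "Male", 0, 0, 45, "United-States", ">50K"],
], columns=columns)

# Binary target: 1 if the person makes more than $50,000 per year.
data["target"] = (data["income"] == ">50K").astype(int)
print(data["target"].tolist())  # → [0, 1]
```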
During the exploratory data analysis and work with the models, it was identified that five features are crucial for making an accurate prediction. Figure 1.0 shows these five features.
Figure 1.0: Normalized Weights For Five Most Predictive Features
As shown in Figure 1.0, capital-loss, age, capital-gain, hours-per-week, and education are the most important features for predicting a person's income. Together, these five features contribute approximately 50% of the weight for the model. This information can be used to speed up the model through dimensionality reduction: instead of all the available features, only these five can be used to train the model. This approach was tested during the project. Prediction speed roughly doubled, but both metrics suffered (see Figure 1.1).
Figure 1.1: Dimension Reduction
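The feature-ranking and reduced-feature retraining described above can be sketched as follows. This is a hypothetical reconstruction on synthetic data, not the project's code; it uses the `feature_importances_` attribute that scikit-learn's Gradient Boosting exposes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

feature_names = ["age", "workclass", "education", "education-num",
                 "marital-status", "occupation", "relationship", "race", "sex",
                 "capital-gain", "capital-loss", "hours-per-week",
                 "native-country"]

# Synthetic stand-in for the census data (illustrative only).
X, y = make_classification(n_samples=2000, n_features=13, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Rank features by normalized importance and keep the top five.
top5 = np.argsort(model.feature_importances_)[::-1][:5]
print("Top-5 features:", [feature_names[i] for i in top5])
print("Top-5 weight:  ", model.feature_importances_[top5].sum())

# Retrain on the reduced feature set: faster, but metrics may drop.
reduced = GradientBoostingClassifier(random_state=0).fit(X_train[:, top5],
                                                         y_train)
print("Full:   ", accuracy_score(y_test, model.predict(X_test)))
print("Reduced:", accuracy_score(y_test, reduced.predict(X_test[:, top5])))
```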
Due to the losses in accuracy and F-score, the decision was made to keep all features in the dataset. Figure 1.2 shows the performance of the last three tested algorithms. The x-axis shows the training set size: 1%, 10%, and 100% of the data. The black dotted line shows the performance of a Naive Predictor, which simply guesses the result and serves as a benchmark for comparing guessing against a real algorithm. As Figure 1.2 shows, Gradient Boosting runs more slowly than Random Forest and AdaBoost. It is also visible that Random Forest tends to overfit: it shows high scores on the training set but much lower scores on the testing set.
Figure 1.2: Performance Metrics for Three Supervised Learning Models
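A minimal sketch of such a training-set-size sweep, again on synthetic stand-in data, using scikit-learn's `DummyClassifier` as the naive guessing baseline (an assumption about how the Naive Predictor might be implemented):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data (illustrative only).
X, y = make_classification(n_samples=3000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive Predictor benchmark: always guesses the most frequent class.
naive = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline = accuracy_score(y_test, naive.predict(X_test))
print(f"Naive baseline: {baseline:.3f}")

# Train on 1%, 10%, and 100% of the training set; score on the test set.
scores = {}
for frac in (0.01, 0.10, 1.00):
    n = max(1, int(frac * len(X_train)))
    clf = GradientBoostingClassifier(random_state=0).fit(X_train[:n],
                                                         y_train[:n])
    scores[frac] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{frac:>4.0%} of training data: accuracy={scores[frac]:.3f}")
```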
Another important metric is the time in seconds it takes to make a prediction. AdaBoost trains faster than Gradient Boosting but lags during the prediction phase. Based on all of the above, Gradient Boosting was chosen. The next phase of the project was to optimize the final model.
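Training and prediction time can be measured separately, for example with `time.perf_counter`. The snippet below is an illustrative benchmark on synthetic data, not the project's measurement code:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data (illustrative only).
X, y = make_classification(n_samples=2000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

timings = {}
for clf in (AdaBoostClassifier(), GradientBoostingClassifier()):
    # Time the training phase.
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    # Time the prediction phase separately.
    start = time.perf_counter()
    clf.predict(X_test)
    pred_time = time.perf_counter() - start

    name = type(clf).__name__
    timings[name] = (train_time, pred_time)
    print(f"{name}: train={train_time:.3f}s predict={pred_time:.4f}s")
```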
| Metric | Unoptimized Model | Optimized Model |
| --- | --- | --- |
Table 1.0: Results of Optimization - Grid Search
Table 1.0 shows the results before and after the optimization of the model. The optimization led to improvements in accuracy and f-score.
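The grid-search optimization might look like the sketch below, which tunes Gradient Boosting with `GridSearchCV` and scores candidates by the F0.5 metric used in the project. The parameter grid and data here are hypothetical; the write-up does not list the values that were actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the census data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hypothetical grid; the actual values tuned in the project are not given.
param_grid = {"n_estimators": [50, 100],
              "learning_rate": [0.1, 0.5],
              "max_depth": [3, 5]}

# Select the best combination by cross-validated F0.5.
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                    param_grid, scoring=scorer, cv=3)
grid.fit(X_train, y_train)

best = grid.best_estimator_
test_f = fbeta_score(y_test, best.predict(X_test), beta=0.5)
print("Best params:", grid.best_params_)
print(f"Test F0.5:   {test_f:.3f}")
```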
There is often a trade-off between accuracy and speed, and the choice depends on the main focus of the project. The focus of this project was to obtain the best accuracy possible, so Gradient Boosting was chosen even though it was not the fastest algorithm.
Simpler algorithms such as Logistic Regression and Naive Bayes were tested first, but it was necessary to switch to more advanced methods to reach the required accuracy. For production use cases it is still recommended to start with simpler linear methods and move on to non-linear classifiers only when needed.
The whole codebase was rewritten out of a Jupyter Notebook so that, in the current version, each part of the pipeline can run independently. The code is not yet covered by unit tests. Since the project was made during Udacity's Machine Learning program, there are no plans for refactoring.