In the Big Data module, Benedikt W. and I built a financial product recommendation tool with PySpark. To recommend products, we developed a random forest model with over 90% accuracy.
Lessons learned and future work:
– Data imputation is important to avoid losing too many observations
– Recommendations based on age or gender could raise ethical issues
– Grid search is computationally intensive but did not result in a significant performance increase
– Only considering personal data without looking at the product portfolio made our model perform worse
– Results showed that prediction on imbalanced datasets, i.e. with a sparse product distribution, is problematic
– Accuracy for unpopular products is automatically high when the model simply never recommends them
– Future work could tackle this sparsity using SMOTE
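The accuracy point above can be illustrated with a minimal sketch in plain Python. The numbers are hypothetical and not taken from our dataset: we assume 1,000 customers of whom only 20 hold a given product, and a "model" that never recommends it. The helper functions `accuracy` and `recall` are written here for illustration only.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of actual product holders (label 1) that were predicted as 1."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / positives


# Hypothetical, made-up data: 20 of 1,000 customers hold the product.
y_true = [1] * 20 + [0] * 980

# A degenerate "model" that never recommends the product.
y_never = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_never)
baseline_recall = recall(y_true, y_never)

print(f"accuracy: {baseline_accuracy:.2f}")  # high, despite a useless model
print(f"recall:   {baseline_recall:.2f}")    # no product holder is found
```

In this made-up case the accuracy is 98% while the recall is zero, which is why a metric such as recall or F1 per product is likely a better guide than accuracy for rarely held products.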
We also have a poster of our work; contact me if you are interested.