Customer Churn Analysis using Spark and Hadoop

doi:10.23940/ijpe.23.10.p4.663675

Abstract

Predicting customer churn is one of the telecommunication industry's biggest challenges: why do customers stop using a product, site, service, or subscription? Machine learning with Spark and Hadoop has considerably improved the ability to predict customer behaviour. Logistic regression, a widely used predictive model, is applied in the prediction process, and the Binary Classification Evaluator and Multi Classification Evaluator are used to assess it. Enhancing and outfit approaches are applied to the training dataset to examine their impact on model effectiveness. Additionally, K-fold cross-validation is applied to the training dataset to tune the hyperparameters and fit the models. Finally, results on the test data are evaluated using the AUC-ROC curve and the confusion matrix. In this research, the Spark and Hadoop frameworks are adapted to predict customer churn. The data is pre-processed, feature analyses are performed, and feature selection is carried out using the Vector Assembler algorithm. This study aims to analyse customer behaviour using a dataset.
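The two evaluation tools named in the abstract, the confusion matrix and the area under the ROC curve, can be illustrated with a minimal pure-Python sketch. This is not the authors' Spark code: the function names, the 0.5 threshold, and the toy labels and scores below are assumptions made purely for illustration. The AUC is computed with the pairwise-ranking definition (the probability that a randomly chosen churner is scored above a randomly chosen non-churner), which is practical only for small inputs.

```python
from typing import List, Tuple


def confusion_matrix(labels: List[int], scores: List[float],
                     threshold: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for binary labels, where 1 means churned."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def auc_roc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the
    positive example has the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true churn labels and predicted churn probabilities.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.3, 0.1, 0.5]

if __name__ == "__main__":
    print("TP, FP, TN, FN:", confusion_matrix(toy_labels, toy_scores))
    print("AUC-ROC:", round(auc_roc(toy_labels, toy_scores), 4))
```

The confusion matrix depends on the chosen threshold, whereas the AUC summarises ranking quality across all thresholds, which is one reason the two are commonly reported together.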

Key words: Hadoop and Spark, Machine Learning, Logistic regression, Random Forest, Vector Assembler, Binary Classification Evaluator

Priyanshu Verma, Ishan Sharma, Sonia Deshmukh, and Rohit Vashisht. Customer Churn Analysis using Spark and Hadoop [J]. Int J Performability Eng, 2023, 19(10): 663-675.

