ESD: E-mail Spam Detection using Cybersecurity-Driven Header Analysis and Machine Learning based Content Analysis

doi:10.23940/ijpe.24.04.p2.205213

Abstract

Background: Spam emails are unwanted commercial or deceptive messages that target specific individuals or businesses to promote products or mislead recipients. With machine learning and natural language processing, computers can be trained to classify emails as spam or legitimate (ham) messages. Despite considerable efforts in spam filtering, the effective identification and mitigation of spam emails remains an ongoing challenge.

Methods: This research places particular emphasis on scrutinizing email headers and extracting data such as hop count and IP address, using a Python script that serves as a forensic tool for analyzing email files. It also assesses vectorization techniques to gauge the efficacy of machine-learning approaches for spam classification. The work covers a range of supervised learning algorithms, including Logistic Regression, Decision Trees, and Naive Bayes, as well as Natural Language Processing (NLP) methods such as Bidirectional Encoder Representations from Transformers (BERT). Two vectorization methods, count vectorization and TF-IDF vectorization, are compared. The evaluation metrics are accuracy, training time, CPU and wall times, precision, recall, F1 score, and support.

Conclusion: The performance of the Decision Tree classifier is particularly noteworthy, achieving 100% accuracy. The trained model is integrated into both an Android application and a website, enabling real-time spam detection and classification.
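The header-analysis step described in the Methods can be illustrated with a minimal sketch. This is not the authors' script: the function name `analyze_headers`, the sample message, and the choice to count `Received` headers as hops and to read bracketed IPv4 addresses from them are all assumptions made for illustration, using only the Python standard library.

```python
import ipaddress
import re
from email import message_from_string

# Hypothetical sample message (documentation-range IPs), not from the paper's data.
RAW_EMAIL = """\
Received: from mail.example.org (mail.example.org [203.0.113.7])
\tby mx.recipient.example (Postfix) with ESMTP id ABC123
\tfor <user@recipient.example>; Mon, 1 Jan 2024 10:00:02 +0000
Received: from sender-pc (unknown [198.51.100.23])
\tby mail.example.org (Postfix) with ESMTP id DEF456;
\tMon, 1 Jan 2024 10:00:00 +0000
From: Alice <alice@example.org>
To: user@recipient.example
Subject: Hello

Body text.
"""

# Assumption: relay IPs appear as bracketed IPv4 literals in Received headers.
IPV4_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def analyze_headers(raw: str) -> dict:
    """Return hop count and relay IPs found in a raw email's Received headers."""
    msg = message_from_string(raw)
    # Each relay prepends one Received header, so their number approximates hops.
    received = msg.get_all("Received") or []
    ips = []
    for hop in received:
        for candidate in IPV4_IN_BRACKETS.findall(hop):
            try:
                ipaddress.ip_address(candidate)  # reject values like 999.1.1.1
            except ValueError:
                continue
            ips.append(candidate)
    return {
        "hop_count": len(received),
        "relay_ips": ips,
        # Headers are listed newest first, so the last IP is the earliest relay.
        "origin_ip": ips[-1] if ips else None,
    }


if __name__ == "__main__":
    print(analyze_headers(RAW_EMAIL))
```

Features such as `hop_count` and `origin_ip` could then be combined with content-based features, though `Received` headers can be forged by a sender, so values from the earliest hops should be treated with caution.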

Key words: spam email, machine learning, header analysis, vectorization technique, bidirectional encoder

Harshita Batra and Leema Nelson. ESD: E-mail Spam Detection using Cybersecurity-Driven Header Analysis and Machine Learning based Content Analysis [J]. Int J Performability Eng, 2024, 20(4): 205-213.
