Int J Performability Eng ›› 2024, Vol. 20 ›› Issue (4): 205-213.doi: 10.23940/ijpe.24.04.p2.205213


ESD: E-mail Spam Detection using Cybersecurity-Driven Header Analysis and Machine Learning based Content Analysis

Harshita Batra and Leema Nelson*   

  1. Chitkara University Institute of Engineering & Technology, Chitkara University, Punjab, India
  • Contact: * E-mail address: leema.nelson@gmail.com

Abstract: Background: Spam emails are unwanted commercial or deceptive messages that target individuals or businesses to promote products or mislead recipients. With machine learning and natural language processing, computers can be trained to classify emails as spam or legitimate (ham). Despite considerable efforts in spam filtering, the effective identification and mitigation of spam remains an ongoing challenge. Methods: This research emphasizes the analysis of email headers, extracting data such as hop count and IP address with a Python script that serves as a forensic tool for analyzing email files. It also assesses vectorization techniques to gauge the efficacy of machine-learning approaches for spam classification. The work covers supervised learning algorithms, including Logistic Regression, Decision Trees, and Naive Bayes, as well as Natural Language Processing (NLP) methods such as Bidirectional Encoder Representations from Transformers (BERT). Two vectorization methods, count vectorization and tf-idf vectorization, are compared. The evaluation metrics are accuracy, training time, CPU and wall times, precision, recall, F1 score, and support. Conclusion: Decision Trees performed best, achieving a reported accuracy of 100%. The trained model is integrated into both an Android application and a website, enabling real-time spam detection and classification.
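
The header-analysis step described in the Methods can be illustrated with a minimal sketch. The abstract does not show the authors' script, so this is not their implementation: the function name analyze_headers, the IPv4-only regular expression, and the sample message are assumptions made for illustration. The sketch takes the hop count to be the number of Received headers and collects the IPv4 addresses that appear in them, using only the Python standard library.

```python
import re
from email import message_from_string

# Assumption: only dotted-quad IPv4 addresses are extracted; IPv6 is ignored.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def analyze_headers(raw_email: str) -> dict:
    """Return the hop count and the IPv4 addresses found in Received headers.

    Hypothetical helper for illustration; not the script used in the paper.
    """
    msg = message_from_string(raw_email)
    # Each relay that handles the message prepends one Received header,
    # so the number of Received headers approximates the hop count.
    received = msg.get_all("Received") or []
    ips = []
    for hop in received:
        for ip in IPV4_RE.findall(hop):
            octets_ok = all(0 <= int(part) <= 255 for part in ip.split("."))
            if octets_ok and ip not in ips:
                ips.append(ip)
    return {"hop_count": len(received), "ip_addresses": ips}


# Made-up sample message using documentation-reserved IP ranges.
SAMPLE = (
    "Received: from mail.example.org (mail.example.org [203.0.113.7])\n"
    "\tby mx.example.net with ESMTP id abc123; Mon, 1 Jan 2024 10:00:00 +0000\n"
    "Received: from sender-host (unknown [198.51.100.23])\n"
    "\tby mail.example.org with SMTP id def456; Mon, 1 Jan 2024 09:59:58 +0000\n"
    "From: alice@example.org\n"
    "To: bob@example.net\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)

if __name__ == "__main__":
    print(analyze_headers(SAMPLE))
```

Received headers can be forged by a sender, so values extracted this way are best treated as features for a classifier rather than as ground truth.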

Key words: spam email, machine learning, header analysis, vectorization technique, bidirectional encoder