Int J Performability Eng ›› 2023, Vol. 19 ›› Issue (3): 193-202. doi: 10.23940/ijpe.23.03.p5.193202


DCADS: Data-Driven Computer Aided Diagnostic System using Machine Learning Techniques for Polycystic Ovary Syndrome

Harshita Batra and Leema Nelson*   

  1. Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, 140401, India
  • Contact: * E-mail address:

Abstract: Background: In recent years, lifestyle and environmental changes have exposed people to many diseases. One of these is polycystic ovary syndrome (PCOS), a hormonal condition that affects a large percentage of women of reproductive age. One in five (20%) Indian women has PCOS, making it one of the most prevalent causes of female infertility. Causes: The ovaries develop small fluid-filled sacs called follicles, which fail to release eggs regularly, resulting in an irregular menstrual cycle. Manually analyzing disease symptoms is challenging for doctors, but machine-learning approaches may help by accurately identifying individuals with chronic diseases. Early identification and diagnosis of these diseases are important because they can prevent them from reaching their worst stage. Methods: This work aims to develop a Data-driven Computer Aided Diagnostic System (DCADS) using the Synthetic Minority Oversampling Technique (SMOTE), correlation-based feature selection, and machine-learning techniques to diagnose PCOS without the need for clinical testing. SMOTE oversamples the minority class in the dataset, and the correlation-based feature selection method retains the features whose correlation value exceeds a threshold of 0.25. Three machine-learning algorithms, Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR), learn from the selected input features. Classification accuracy is the measure used to compare the classifiers. Because the RF classifier is more accurate than the other classifiers in this study, it is employed in DCADS for non-clinical testing. The developed DCADS was tested on the PCOS dataset obtained from Kaggle and owned by Prasoon Kottarathil. Conclusion: On the PCOS dataset, random forest achieves an overall accuracy of 92.024%, logistic regression achieves 90.18%, and SVM achieves 70.55%. Gynecologists and women can use the developed DCADS to diagnose PCOS without the need for clinical tests.
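
The two preprocessing steps named in the Methods (SMOTE-style oversampling followed by correlation-based feature selection at a 0.25 threshold) can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the toy data, the function names `pearson`, `select_features`, and `smote`, the neighbour count `k`, and the choice of absolute Pearson correlation against the class label are all assumptions made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def select_features(rows, labels, threshold=0.25):
    """Indices of columns whose |correlation with the label| exceeds the threshold.

    Assumption: correlation is measured against the class label.
    """
    n_cols = len(rows[0])
    return [j for j in range(n_cols)
            if abs(pearson([r[j] for r in rows], labels)) > threshold]


def smote(minority, n_new, k=5, seed=0):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Invented toy data: column 0 separates the classes, column 1 does not.
rows = [[1.0, 4.0], [1.2, 2.0], [0.9, 3.0], [1.1, 5.0], [1.0, 1.0], [1.3, 3.0],
        [3.0, 3.0], [3.2, 3.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]

# Step 1: oversample the minority class until both classes are the same size.
minority = [r for r, y in zip(rows, labels) if y == 1]
n_new = labels.count(0) - labels.count(1)
synthetic = smote(minority, n_new)
balanced_rows = rows + synthetic
balanced_labels = labels + [1] * n_new

# Step 2: keep only features whose correlation exceeds the 0.25 threshold.
selected = select_features(balanced_rows, balanced_labels, threshold=0.25)
print("selected feature indices:", selected)
```

In a fuller pipeline, the selected columns of the balanced data would then be passed to the RF, SVM, and LR classifiers and compared on classification accuracy; that step is omitted here because it would normally rely on a machine-learning library rather than hand-written code.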

Key words: machine learning, support vector machine, random forest, logistic regression, correlation, SMOTE, Polycystic Ovary Syndrome, DCADS