Int J Performability Eng, 2019, Vol. 15, Issue (7): 1939-1946. doi: 10.23940/ijpe.19.07.p21.19391946

Combining Stochastic Grammar and Semi-Supervised Learning Techniques to Extract RNA Structures with Pseudoknots

Sixin Tang*   

  1. College of Computer Science and Technology, Hengyang Normal University, Hengyang, 421002, China
  • Contact: * E-mail address: tangsix@qq.com
  • About author: * Corresponding author. E-mail address: tangsix@qq.com. Sixin Tang is a lecturer in the College of Computer Science and Technology at Hengyang Normal University. His research interests include machine learning and bioinformatics.
  • Supported by:
    This work is supported by the Scientific Research Projects (No. 15C0204) of the Hunan Education Department.

Abstract: To predict RNA structures with pseudoknots, traditional stochastic grammar models require a set of related, labeled RNA sequences, which limits their practical application. To make effective use of the large number of unlabeled RNA sequences available for structure prediction, a combination of stochastic grammar and semi-supervised learning techniques is proposed. The training set of the prediction model consists of a small number of labeled RNA sequences and a large number of unlabeled sequences. We design a semi-supervised learning model based on the inside/outside algorithm for stochastic context-free grammars (SCFGs) and use a generative SCFG model as the classifier; during training, the unlabeled RNA sequences are labeled and then gradually merged into the labeled data set. The model can regulate the proportion of labeled and unlabeled sequences, and it finally outputs the sequence of structure tags. Experimental results show that this method uses unlabeled sequence data effectively, greatly reduces the number of related sequence samples required, and improves prediction accuracy. In addition, we measure how different amounts of unlabeled sequences affect prediction performance.
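
The label-then-merge loop described in the abstract follows the general pattern of self-training, which can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the SCFG and its inside/outside estimation are replaced by a toy generative model (per-class nucleotide frequencies), each sequence receives a single whole-sequence label rather than a per-position structure tag sequence, and all names (`fit`, `predict`, `self_train`, `threshold`, `batch`) are hypothetical. Sequences are assumed to contain only the symbols A, C, G, and U.

```python
import math
from collections import Counter

ALPHABET = "ACGU"  # assumption: inputs contain only these four symbols


def fit(labeled):
    """Toy generative model: per-class symbol frequencies with add-one smoothing.

    Stand-in for SCFG parameter estimation (inside/outside), which is NOT
    implemented here.
    """
    counts = {}
    for seq, label in labeled:
        counts.setdefault(label, Counter()).update(seq)
    model = {}
    for label, c in counts.items():
        total = sum(c[b] for b in ALPHABET) + len(ALPHABET)
        model[label] = {b: (c[b] + 1) / total for b in ALPHABET}
    return model


def predict(model, seq):
    """Return (best_label, posterior) assuming equal class priors."""
    logp = {lab: sum(math.log(p[b]) for b in seq) for lab, p in model.items()}
    top = max(logp.values())
    weights = {lab: math.exp(v - top) for lab, v in logp.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def self_train(labeled, unlabeled, threshold=0.9, batch=2, max_rounds=10):
    """Label the unlabeled pool and merge confident items in small batches.

    `batch` limits how many newly labeled sequences join the labeled set per
    round, a simple (hypothetical) way to control the labeled/unlabeled mix.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        model = fit(labeled)
        # Score every pooled sequence, most confident first.
        scored = sorted(((predict(model, s), s) for s in pool),
                        key=lambda item: -item[0][1])
        accepted = [(s, lab) for (lab, conf), s in scored[:batch]
                    if conf >= threshold]
        if not accepted:
            break  # nothing confident enough; stop early
        labeled.extend(accepted)
        taken = {s for s, _ in accepted}
        pool = [s for s in pool if s not in taken]
    return fit(labeled), labeled, pool
```

Calling `self_train` with two labeled seed sequences and a few unlabeled ones returns the refitted model, the enlarged labeled set, and whatever remains unlabeled because no prediction for it reached `threshold`.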

Key words: RNA structures, semi-supervised learning, prediction accuracy, performance improvement