Int J Performability Eng, 2024, Vol. 20, Issue (12): 723-732. doi: 10.23940/ijpe.24.12.p2.723732

• Original article

Discovering Elementary Discourse Units in Textual Data using Canonical Correlation Analysis

Mehndiratta Akanksha and Asawa Krishna

  1. Department of Computer Science & Engineering and Information Technology, Jaypee Institute of Information Technology, Noida, India
  • Contact: Mehndiratta Akanksha, E-mail: akanksha.mehndiratta@mail.jiit.ac.in

Abstract:

Canonical Correlation Analysis (CCA) has been widely used for learning latent representations in various fields. This study goes a step further by demonstrating the potential of CCA for identifying Elementary Discourse Units (EDUs) that capture the latent information within textual data. The probabilistic interpretation of CCA discussed in this study exploits the two-view nature of textual data, i.e., consecutive sentences in a document or consecutive turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model for EDU segmentation that discovers EDUs in textual data without any supervision. To validate the model, the EDUs are used as textual units for content selection in textual similarity tasks. Empirical results on the Semantic Textual Similarity Benchmark (STSB) and Mohler datasets confirm that, despite being represented as unigrams, the EDUs deliver competitive results and can even outperform several sophisticated supervised techniques. The model is simple, linear, adaptable and language-independent, making it an ideal baseline, particularly when labeled training data is scarce or nonexistent.

Key words: discourse modeling, discourse analysis, EDU segmentation, unlabeled data, low resource language, canonical correlation analysis
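
The two-view idea in the abstract can be illustrated with a minimal sketch of classical linear CCA, in which each sentence forms one view and the sentence that follows it forms the other. This is not the authors' model: the function names `two_views` and `linear_cca`, the bag-of-words features, the regularization constant `reg` and the toy sentences are all assumptions made for illustration only.

```python
import numpy as np

def two_views(sentences):
    """Build bag-of-words count matrices for consecutive sentence pairs.

    Hypothetical helper: view X holds sentence i, view Y holds sentence i+1.
    """
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(sentences), len(vocab)))
    for row, s in enumerate(sentences):
        for w in s.lower().split():
            M[row, index[w]] += 1.0
    return M[:-1], M[1:], vocab

def linear_cca(X, Y, k=1, reg=1e-3):
    """Classical linear CCA via SVD of the whitened cross-covariance.

    `reg` is an assumed ridge term that keeps the covariances invertible.
    Returns projections A, B and the top-k singular values.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]      # projection for view X
    B = Wy @ Vt.T[:, :k]   # projection for view Y
    return A, B, s[:k]

# Toy document (invented for illustration).
sentences = [
    "the model learns latent units",
    "latent units capture discourse",
    "discourse structure links sentences",
    "sentences share latent structure",
    "the structure is learned without labels",
    "labels are scarce in many languages",
]
X, Y, vocab = two_views(sentences)
A, B, corr = linear_cca(X, Y, k=2)
print(corr)  # regularized canonical correlations, largest first
```

With so few sentence pairs the correlations are heavily overfitted, so the sketch shows only the mechanics of relating two adjacent textual views, not the segmentation quality reported in the paper.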