Discovering Elementary Discourse Units in Textual Data using Canonical Correlation Analysis

doi:10.23940/ijpe.24.12.p2.723732

Abstract

Canonical Correlation Analysis (CCA) has been widely used to learn latent representations in various fields. This study goes a step further by demonstrating the potential of CCA for identifying Elementary Discourse Units (EDUs) that capture the latent information within textual data. The probabilistic interpretation of CCA discussed in this study exploits the two-view nature of textual data, i.e., the consecutive sentences in a document or turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model for EDU segmentation that discovers EDUs in textual data without any supervision. To validate the model, the EDUs are used as textual units for content selection in textual similarity tasks. Empirical results on the Semantic Textual Similarity Benchmark (STSB) and Mohler datasets confirm that, despite being represented as unigrams, the EDUs deliver competitive results and can even outperform various sophisticated supervised techniques. The model is simple, linear, adaptable, and language-independent, making it an ideal baseline, particularly when labeled training data is scarce or nonexistent.

Key words: discourse modeling, discourse analysis, EDU segmentation, unlabeled data, low resource language, canonical correlation analysis
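
The two-view idea mentioned in the abstract can be illustrated with a minimal sketch of linear CCA over consecutive sentences. This is not the authors' model: the bag-of-words views, the ridge term `reg`, the toy document, and the helper names `bag_of_words` and `cca` are all assumptions made for illustration.

```python
import numpy as np

def bag_of_words(sentences, vocab):
    """Count matrix with one row per sentence and one column per vocabulary word."""
    M = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for tok in s.lower().split():
            M[i, vocab[tok]] += 1.0
    return M

def cca(X, Y, k, reg=1e-3):
    """Linear CCA via whitening and SVD; rows of X and Y are paired samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariances (the ridge term is an assumption that keeps
    # the matrices invertible when there are few samples).
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the
    # (regularized) canonical correlations, in descending order.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]      # projection for view 1
    B = Wy @ Vt.T[:, :k]   # projection for view 2
    return A, B, s[:k]

# Toy document (hypothetical); view 1 is sentence i, view 2 is sentence i + 1.
doc = [
    "the model segments text",
    "the segments are discourse units",
    "discourse units carry latent information",
    "latent information links consecutive sentences",
    "consecutive sentences form two views",
]
words = sorted({t for s in doc for t in s.lower().split()})
vocab = {w: i for i, w in enumerate(words)}
X = bag_of_words(doc[:-1], vocab)
Y = bag_of_words(doc[1:], vocab)
A, B, corr = cca(X, Y, k=2)
print("top canonical correlations:", corr)
```

Each column of `A` and `B` assigns a weight to every vocabulary word, so words with large weights in the leading directions are the ones most predictive across adjacent sentences. With a corpus this small the correlations are not meaningful; the sketch only shows the mechanics.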

Mehndiratta Akanksha and Asawa Krishna. Discovering Elementary Discourse Units in Textual Data using Canonical Correlation Analysis [J]. Int J Performability Eng, 2024, 20(12): 723-732.

