International Journal of Performability Engineering, 2019, 15(2): 667-675 doi: 10.23940/ijpe.19.02.p31.667675

Short Text Classification based on Feature Extension using Information in Images

Shengjie Zhao a,b and Qianyun Jiang a

a College of Electronic and Information Engineering, Tongji University, Shanghai, 200800, China

b School of Software Engineering, Tongji University, Shanghai, 200800, China

Corresponding author. E-mail address: shengjiezhao@tongji.edu.cn

Accepted: 2019-01-05   Online: 2019-02-25

Abstract

With the rapid development and extensive application of the Internet, there is a growing desire for people to share their lives or opinions on social networks, which produces a mass of short texts. Short texts are characterized by short length, sparse features, and a lack of contextual information. Thus, it is difficult for conventional methods to achieve high quality classification performance. To achieve higher classification accuracy, this paper proposes a novel short text classification method based on feature extension by incorporating the information of associated images. Specifically, we first generate a sentence that describes the image using image caption technology, and then we combine the generated sentence with the text as the input of the classifier. Meanwhile, we introduce a similarity module that measures the correlation between the image and the short text so as to determine whether the two sentences are combined or not. Simulation results show that our proposed model significantly outperforms state-of-the-art methods in terms of classification accuracy.

Keywords: short text classification; image caption; feature extension; sentence similarity



1. Introduction

With the rapid development and widespread use of social networks, countless short text messages are posted on Twitter every day. They are carriers of a great deal of information covering almost all areas, such as entertainment, education, sports, etc. In order to effectively sort out the content that we are interested in, it is fundamentally important to organize these texts into different categories. However, the texts on Twitter are always very short. Thus, when dealing with these short text messages, many classic classification algorithms fail to achieve the high quality classification performance they would have achieved on longer texts. This accords with our intuition, since short texts have the following characteristics:

· Short texts lack context. A message often expresses the feelings about an event at the moment it is posted.

· Short texts do not always follow the grammar of a language. As we often see on Twitter, the text often contains emoticons, such as T_T.

· The texts are short in length, generally less than 100 words. That is, these texts may fail to provide enough knowledge about the text itself, which leads to data sparseness problems.

To effectively tackle the challenges induced by these characteristics in short text classification, many well-developed methods have been devised to capture more information from short texts. The most commonly used methods are based on feature extension and can be divided into two classes: rule-based and statistics-based approaches. Rule-based feature selection methods, such as 8F and TFIDF [1-2], have been proposed to pick out the words that are shared by fewer texts. Nevertheless, using these approaches to build the feature space will lead to the sparseness problem, as very few words are shared among short texts. The statistics-based methods, from the perspective of machine learning, introduce external resources to expand the short text, such as public online knowledge bases, e.g., WordNet and HowNet [3-4]. However, feature extension using search engines or external resources is computationally expensive. Moreover, manual feature selection is laborious work driven by experience, intuition, and domain knowledge.

Nowadays, due to the popularity of deep learning, an increasing number of machine learning researchers have begun to apply deep learning models to text classification. Multi-layer neural networks have the capacity to combine low-level text information into a more abstract high-level text representation, thereby enhancing the performance of the classifier model. CNNs (Convolutional Neural Networks), for example, are good at extracting features and can greatly reduce the necessity of manual feature extraction in classification tasks. Meanwhile, word embedding technology built on neural network models is being applied to represent the semantic vectors of short texts [5].

In this paper, borrowing the idea of the work of Lynks [6], we combine the text and an associated image that contains complementary information into a single classifier model. Namely, unlike previous work, we not only analyze the text but also use the information contained in the image. Recently, an increasing number of works have revealed the rich semantic information conveyed by online user generated content: the texts and the associated images. This shows that people often attach an image when they use Twitter, and in most cases, this image is related to the content of the text. Thus, we propose a classifier model with related information drawn from the associated images that can deal with short and sparse text successfully. To the best of our knowledge, this is the first work that combines image and text for short text classification.

We generate an image description by image caption technology and then combine the generated sentence with the text as the input of a neural network for text classification. Nevertheless, the image may have nothing to do with the text, in which case the information contained in the image is interference. Therefore, we need to determine whether the information generated from the image is usable by calculating the relevance of the image description and the text. Ultimately, we adopt a model that is a bit more complex but can yield significantly better performance compared to the most advanced approaches.

Our contributions in this paper are as follows:

· We propose a novel feature extension approach by utilizing the information contained in images for short text classification. We employ the image caption technology to generate the description of the image.

· We also take the correlation between image and text into account by introducing the similarity module.

Our experiments demonstrate that the model we propose is indeed effective. Furthermore, the proposed classifier obtains nearly a 10% accuracy improvement.

We organize the rest of the paper as follows: firstly, we give a brief overview of work related to image caption generation and short text categorization in Section 2. Then, we describe each part of our model in Section 3. Experiments and results are presented in Section 4. Lastly, we summarize the contributions of this paper.

2. Related Works

In the last few years, researchers in related fields have introduced some new approaches to address short text classification problems [7]. A large number of works have managed to reduce the spatial dimension by integrating additional features. The most widely used approach is to represent text as a vector, e.g., the Continuous Bag-of-Words (CBOW), continuous skip-gram, and GloVe models [8-9]. Latent topic models like LDA (Latent Dirichlet Allocation) are often used for document classification. Blei et al. [10] developed the LDA algorithm to extract topic information from short text; each topic can be represented as a multinomial distribution over the words in the vocabulary, and these latent topics can then generate the document. However, these approaches require sufficient information about word occurrences. Some methods rely on human selected features to address the short text sparseness problem. Sriram et al. [1] fully exploited the characteristics of Twitter text and proposed representing short texts by a small set of domain-specific features (such as the presence of shortened words and slang, opinions, references to another user, and so on). Some studies have also introduced methods of short text expansion using search engines. It is easy to see that querying these data sources online incurs long response times, while snapshots of these data sources can lead to outdated information.

Due to the remarkable performance of deep neural networks, many papers have been devoted to experiments on sentence-level classification tasks with CNN models trained on pre-trained word vectors [11]. Johnson et al. [12] directly used one-hot vectors as input to the CNN to reduce the number of learning parameters of the model. Kim [13] further improved the performance by using multi-channel embeddings. Santos et al. [14] combined word-level and sentence-level features learned from the short text sequence to improve the accuracy of short text classification. These works showed that a simple CNN with little hyper-parameter tuning can achieve good performance on a range of tasks. Inspired by these works, in this paper we settled on a CNN model that is flexible yet simple.

As mentioned before, the information contained in the image can be used as a feature for short text classification. However, understanding the content of images is hard, especially when the images are blurred. The goal of image caption technology is to generate grammatical and coherent sentences describing pictures. There are several models for generating image descriptions. Kiros et al. [15] first addressed caption generation by introducing a neural network structure. Recently, many models based on RNNs (Recurrent Neural Networks) have been proposed, which adopt the encoder-decoder framework widely used in machine translation [16]. Using this framework allows us to "translate" an image into a sentence. LSTM (Long Short Term Memory) [17] units are utilized to "decode" the feature vectors "encoded" by a CNN. A generation model based on a deep recurrent structure was proposed in [18-19]: by combining the deep convolutional networks of computer vision with a recurrent network, natural sentences describing the image are generated. In our work, we utilize a deep learning model under a probabilistic framework to generate descriptions from images.

We can combine the sentences generated by image captioning with the original texts for text categorization. However, as mentioned in Section 1, the information contained in the image may be interference. Thus, we need to evaluate the correlation between the generated sentence and the text by mapping them into a joint feature space. Several complex lexical, syntactic, and semantic features were used to encode input text pairs in [20], from which various similarity measures between the representations could be obtained. Meanwhile, [13, 21] showed that convolutional neural networks can learn low-dimensional vectors from input sentences that retain important syntactic and semantic aspects, yielding state-of-the-art results on many NLP tasks. The distributional sentence model based on a CNN (Section 3.1) is the main building block of the model in this paper. Two such sentence models work in parallel, projecting the generated sentence and the text into vectors, which are then used to obtain the similarity score between them [22].

3. Overall Framework

Our goal is to classify short texts based on feature extension using information in images. Thus, we decided to combine texts and images into a single machine learning model, since they contain complementary information. In order to combine the original texts and image contents as a new input to the CNN text classifier, they must be projected into the same vector space. The overall framework of the model is shown in Figure 1. The general idea of our model is as follows: we extract image features through a CNN image embedder, followed by an LSTM model that generates the corresponding sentence. We then project the generated sentence and the original text into intermediate feature representations, represented by the red grids in the figure. Next, the similarity score between the generated sentence and the original text is calculated to determine whether or not the two sentences are combined. Finally, the concatenated text is classified by a CNN classification network. Below, we describe each function and module in detail.

Figure 1.   The overall framework of the model
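To make the data flow concrete, the following minimal sketch (PyTorch-style Python) traces the forward pass described above. The module names (captioner, sentence_model, similarity, classifier) and the 0.5 decision threshold are our illustrative assumptions, not names or values from the paper:

    import torch

    def classify_tweet(text_ids, image, captioner, sentence_model,
                       similarity, classifier, pad_ids):
        # "Translate" the image into a sentence (Section 3.2).
        caption_ids = captioner.generate(image)
        # Project both sentences into the intermediate feature space (Section 3.1).
        x_c = sentence_model(caption_ids)   # generated-caption vector
        x_o = sentence_model(text_ids)      # original-text vector
        # Keep the caption only if it is related to the text (Section 3.3).
        if similarity(x_c, x_o) > 0.5:      # threshold is an assumption
            joined = torch.cat([caption_ids, text_ids])   # caption + text
        else:
            joined = torch.cat([pad_ids, text_ids])       # text alone, padded
        # Classify the (possibly extended) text with the CNN classifier (Section 3.4).
        return classifier(joined)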


3.1. Convolution Sentence Model

Since the distributional sentence model is involved in both the sentence similarity module and the sentence classification module, we first present a convolutional architecture for sentence modeling. The goal of this sentence model is to convert an input sentence into an intermediate feature representation, which is then used for computing semantic similarity or performing various sentence classification tasks.

As shown in Figure 2, this convolutional sentence model consists of a convolution layer with multiple filter sizes followed by a simple max pooling layer. It takes fixed-length embedding vectors (pre-trained word embedding vectors) as input and passes them through the convolution layer and the max pooling layer until a fixed-length representation is produced at the last level. Like most convolutional neural network models [13, 23], we adopt convolution filters with "shared weights".

Figure 2.   A convolutional layer with multiple filter widths and feature maps, followed by a simple max pooling layer


We denote S ∈ R^{n×k} as the input of the network, where n represents the length of the sentence (padded where necessary) and k represents the dimension of the word embeddings. The function of the convolution layers is to extract different levels of features from the input matrix.

Given a convolution filter ω ∈ R^{h×k}, a feature F_i is generated from a window of h words w_{i:i+h-1} by:

F_i = f(ω · w_{i:i+h-1} + b)  (1)

Here, b ∈ R is a bias term and f is a non-linear activation function; in this work, we take ReLU as the activation function of the convolution layers. The filter slides over the sentence S to produce a feature map F ∈ R^{n-h+1}. This process is repeated for various filters with different sizes to increase the feature coverage of the model.

The function of the max pooling layer is to further abstract the features generated by the convolution layer. The idea is to take the maximum value F̂ = max(F) to capture the most important feature. With the pooling layer, we obtain a fixed-length vector, which can be used as the input of the other modules.
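The following is a minimal PyTorch sketch of this convolutional sentence model. The filter widths (3, 4, 5), the embedding dimension k = 300, and the number of feature maps per width are illustrative assumptions, not the paper's exact configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvSentenceModel(nn.Module):
        """Maps a padded sentence S in R^{n x k} to a fixed-length vector
        via convolution (Eq. (1)) and max-over-time pooling."""
        def __init__(self, k=300, widths=(3, 4, 5), maps_per_width=100):
            super().__init__()
            # One convolution per filter width h; weights are shared across positions.
            self.convs = nn.ModuleList(
                nn.Conv1d(in_channels=k, out_channels=maps_per_width, kernel_size=h)
                for h in widths)

        def forward(self, S):                        # S: (batch, n, k)
            S = S.transpose(1, 2)                    # Conv1d expects (batch, k, n)
            pooled = []
            for conv in self.convs:
                feature_map = F.relu(conv(S))        # (batch, maps, n - h + 1)
                pooled.append(feature_map.max(dim=2).values)  # F_hat = max(F)
            return torch.cat(pooled, dim=1)          # fixed-length representation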

3.2. Extracting Information from Image

The image caption problem can be defined over pairs (I, S), where I is an image and S is a sequence of target words. An overview of the image caption module is given in Figure 3.

Figure 3.   The overview of the image caption module


The module includes two parts: an "encoder" and a "decoder". The encoding part makes use of GoogLeNet [24], which "encodes" the given image into a fixed-dimensional feature vector. The decoding part adopts the classical LSTM model to "decode" the feature vector into the desired output sentence. The LSTM consists of a series of memory cells (see Figure 4). Each cell performs the same task, so they can share parameters. The output of a cell depends on the current input x_t and the "memory" state h_{t-1}. As we can see from Figure 4, there are three gates in a cell (input gate i, output gate o, and forget gate f). The function of the forget gate is to decide what information to discard from the cell state. It feeds the input x_t and the last hidden state h_{t-1} into a sigmoid layer, which decides how to treat the previous cell state c_{t-1}. The input gate also takes x_t and h_{t-1} as input, but its function is to update the memory cell: this step decides what new information to "remember" in the cell state. Together, these two gates change the state of the memory unit. Finally, the LSTM unit uses an output gate, which takes the same inputs as the other two gates, and produces the result according to the state of the unit.

Figure 4.   The LSTM memory cell, where ⊙ represents the product with a gate value


Therefore, the LSTM model can be trained to predict the word s_t of the description given the image I and all previously predicted words s_0, s_1, …, s_{t-1}, i.e., to model P(s_t | I, s_0, s_1, …, s_{t-1}). Our implementation of the LSTM closely follows the one used in [25]:

x_t = W_e s_t  (2)
f_t = σ(W_f [h_{t-1}, x_t] + b_f)  (3)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)  (4)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)  (5)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t  (6)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)  (7)
h_t = o_t ⊙ tanh(C_t)  (8)
P(s_t | I, s_0, …, s_{t-1}) = Softmax(W_p h_t)  (9)

where the various W and b are trained parameters, W_e is the word embedding matrix, and σ is the logistic sigmoid function. Each word is represented as a one-hot vector s_t whose dimension equals the vocabulary size. These gates make it possible to properly handle the exploding and vanishing gradient issues from which traditional RNNs suffer [26]. h_t in Equation (8) is the input of the Softmax, and Equation (9) yields the probability distribution over all possible words. The image I is fed in only once, at t = 0, to inform the LSTM model of the image content.

We maximize the probability of the correct description given the image directly, using Equation (10). The probability of a sentence is obtained by multiplying the probabilities of each word conditioned on the image and on the words before the current time:

P(s_0, s_1, …, s_m | I) = Π_{i=0}^{m} P(s_i | I, s_0, …, s_{i-1})  (10)

The negative sum of the log likelihoods of the correct words in the sentence sequence is defined as the loss of our model, given in Equation (11), which we optimize by the stochastic gradient descent method throughout training:

L(I, S) = -Σ_{i=0}^{m} log P(s_i | I, s_0, …, s_{i-1})  (11)

Our goal is to minimize the above loss over all parameters of the LSTM as well as the word embeddings W_e.
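As a sketch of how Equations (2)-(11) translate into code, the following PyTorch decoder computes the teacher-forced caption loss. The class name, the use of nn.LSTMCell (which implements the gates of Equations (3)-(8)), and the assumption that the image feature dimension equals the embedding dimension are our choices for one reasonable realization, not the authors' released code:

    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        def __init__(self, vocab_size, dim=512):
            super().__init__()
            self.W_e = nn.Embedding(vocab_size, dim)   # word embeddings, Eq. (2)
            self.cell = nn.LSTMCell(dim, dim)          # gates of Eqs. (3)-(8)
            self.W_p = nn.Linear(dim, vocab_size)      # Softmax input, Eq. (9)

        def loss(self, image_code, words):
            """image_code: (dim,) GoogLeNet feature; words: (m+1,) token ids."""
            h = torch.zeros(1, image_code.numel())
            c = torch.zeros(1, image_code.numel())
            # The image is fed only once, at t = 0.
            h, c = self.cell(image_code.unsqueeze(0), (h, c))
            nll = torch.zeros(())
            for i in range(len(words) - 1):
                x = self.W_e(words[i].unsqueeze(0))    # x_t = W_e s_t
                h, c = self.cell(x, (h, c))
                logp = torch.log_softmax(self.W_p(h), dim=-1)
                nll = nll - logp[0, words[i + 1]]      # summand of Eq. (11)
            return nll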

3.3. Similarity Score of Sentences

The architecture of the sentence similarity module is presented in Figure 5. The inputs of the module are the outputs of the previous layers, which are used to compute the similarity. We use the texts and images of tweets whose topics are related as positive samples, and unrelated pairs as negative samples. We now describe our similarity module in detail:

Figure 5.   The architecture of the similarity module, where M represents the similarity matrix


Distributional representations are obtained by feeding the generated sentence and the text into our sentence models (Section 3.1); we denote them x_c ∈ R^d and x_o ∈ R^d, respectively. The similarity score is then calculated using the similarity matrix M according to Equation (12). The similarity between the vectors x_c and x_o is defined as follows [27]:

x_sim = sim(x_c, x_o) = x_c^T M x_o  (12)

where M ∈ R^{d×d} is the similarity matrix, which is also a parameter of our model and is optimized through the training process (Section 4.2).

The scalar x_sim is a measure of the (syntactic and semantic) similarity of the two sentences. Therefore, according to Equation (13), we concatenate the two intermediate vectors and the similarity score x_sim into a single vector x_join in the join layer [27]:

x_join = [x_c^T; x_sim; x_o^T]  (13)

This vector is passed to the next hidden layer, which is fully connected and models the interactions among the components of the concatenated representation vector. The hidden layer performs the following transformation:

α(w_h · x_join + b)  (14)

where w_h is the weight vector of the hidden layer and α is a non-linear activation function.

The similarity score between the two sentences approximates the correlation between the text and the image. Therefore, as described earlier, we can decide whether or not to use the information contained in the image when classifying, according to Equation (15):

P = { S ⊕ Y, sim = 1;  Y, sim = 0 }  (15)

where S is the sentence generated by the LSTM model, Y is the original text, and ⊕ denotes concatenation. We concatenate them as P = (r_1, r_2, …, r_l), where l is the length of the sentence (padded where necessary) and is a fixed value. Namely, if the two sentences are related, the image caption and the original text are concatenated; otherwise, the text is padded with a special character.
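The sketch below assembles Equations (12)-(15) into one module. The hidden-layer size, the choice of tanh for α, the 0.5 decision threshold, and the padding token are our illustrative assumptions:

    import torch
    import torch.nn as nn

    class SimilarityModule(nn.Module):
        def __init__(self, d, hidden=200, pad_id=0):
            super().__init__()
            self.M = nn.Parameter(0.01 * torch.randn(d, d))  # similarity matrix M
            self.hidden = nn.Linear(2 * d + 1, hidden)       # w_h and b of Eq. (14)
            self.out = nn.Linear(hidden, 1)                  # related / unrelated
            self.pad_id = pad_id

        def forward(self, x_c, x_o, caption_ids, text_ids):
            x_sim = x_c @ self.M @ x_o                       # Eq. (12)
            x_join = torch.cat([x_c, x_sim.view(1), x_o])    # Eq. (13)
            h = torch.tanh(self.hidden(x_join))              # Eq. (14)
            if torch.sigmoid(self.out(h)) > 0.5:             # sim = 1 in Eq. (15)
                return torch.cat([caption_ids, text_ids])    # P = S concat Y
            pad = torch.full_like(caption_ids, self.pad_id)  # pad to fixed length l
            return torch.cat([pad, text_ids])                # P = Y, padded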

3.4. CNN Model

The last step of our model is a fully-connected deep learning model that processes the text. The final layer simply combines both parts of the model and produces the final classification. Readers interested in this subject can find more detailed discussions in [12-13, 17].

4. Experiments and Results

4.1. Dataset

The dataset we perform experiments on is MS COCO (the Microsoft Common Objects in Context), which was carefully created to provide resources for automatic image caption generation tasks [28]. Currently, the MS COCO 2014 dataset contains one million captions and more than 160,000 images. Compared with existing image caption datasets such as Flickr8k or Flickr30k, the COCO dataset is advantageous because it has more images and annotations for training and testing. Corresponding to the problem we need to solve, we regard the MS COCO pairs as positive samples and add some texts and images with irrelevant content as negative samples.

4.2. Training Details

When we train models on the MS COCO dataset, we inevitably encounter over-fitting problems. In fact, supervised machine learning methods need a mass of data for training, while the dataset we used has fewer than 100,000 images, which is far from enough. Thus, we explore several techniques to handle overfitting. In the image caption generation module, we initialize the word embeddings W_e randomly. Meanwhile, for the Convolutional Sentence Model (Section 3.1), we initialize the embedding vectors with pre-trained word embedding vectors. Due to the complexity of our model, we also applied some model-level techniques for avoiding over-adaptation: we tried dropout and ensemble models, and we explored the size of the model by weighing the number and depth of hidden units.

The parameters of the network are optimized via SGD (stochastic gradient descent) with the back-propagation algorithm. We randomly initialized all weights except for the weights of the CNN model. We set the dimension to 512 for both the embeddings and the LSTM memory.
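A minimal sketch of this optimization setup; the cross-entropy loss and the learning rate are our assumptions, as the paper does not report them:

    import torch
    import torch.nn as nn

    def train_step(model, optimizer, inputs, labels):
        """One SGD update via back-propagation."""
        model.train()                    # enables dropout layers, if any
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), labels)
        loss.backward()                  # back-propagation
        optimizer.step()                 # stochastic gradient descent update
        return loss.item()

    # Example: optimizer = torch.optim.SGD(model.parameters(), lr=0.01)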

4.3. Generation Results

The first experiment is performed in the absence of the similarity module. The experiments are carried out on binary datasets. The combined model we propose is larger and a bit more complex than models that focus only on text or only on images. In order to understand and justify the performance gains of the model, it makes sense to look at each component separately and compare it with the final model:

The experimental result is quite clear: analysing only text performs slightly better than analysing only images, while its computational cost is much lower. Moreover, it is striking that the images alone can also achieve good classification performance. We can conclude from Table 1 that the CNN model outperforms the two traditional text classification models (SVM and LDA). It is intuitive that the combination of both the images and the text should lead to the best performance, since they presumably pick up on different things. Indeed, when we focus on the performance of the combined model, we see a significant increase in the precision rate. Figure 6 shows the ROC curve (receiver operating characteristic curve), which plots two parameters: the true positive rate and the false positive rate. The pre-trained GoogLeNet used here is trained on the 1000 image classes of ImageNet.

Table 1.   Comparison of classifiers using only text or only images with the classifier using both (without the similarity module)

Method                  Precision Rate
ResNet (Images Only)    0.8496
SVM (Text Only)         0.8412
LDA (Text Only)         0.8210
CNN (Text Only)         0.8651
Images + Text           0.9550


Figure 6.   The receiver operating characteristic curve


The second experiment introduces the similarity module on top of experiment one. As mentioned in the introduction, in real life the image we post sometimes has nothing to do with the text. In order to simulate this phenomenon as closely as possible, we add some negative samples (in which the images and texts are irrelevant) to the COCO dataset and run our model on this dataset. The experimental results are as follows:

From the experimental results in Table 2, we can conclude that the accuracy of classification decreases after adding the irrelevant data. With the introduction of the similarity module, the accuracy of classification increases by nearly 10%, which confirms that our idea is sound.

Table 2.   Comparison of classification accuracy with and without the similarity module

Strategy                    Classifier accuracy (%)
Similarity module absent    72.08
Similarity module added     82.14


5. Conclusions

This paper introduces a feature extension method that takes advantage of information in images for short text classification. We enrich the features of short texts with the help of image caption technology to capture the complementary information in images. The proposed feature extension method does improve short text classification. Meanwhile, we also take the relevance of images into account by adding the similarity module, and the accuracy is further improved. We then compare the proposed method with other existing classification models. A series of experimental results shows that our proposed method performs better in classifying short texts.

References

[1] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas, "Short Text Classification in Twitter to Improve Information Filtering," in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 841-842, 2010
[2] M. Wang, L. Lin, and F. Wang, "Improving Short Text Classification through Better Feature Space Selection," in Proceedings of the International Conference on Computational Intelligence and Security, pp. 120-124, 2014
[3] D. Bollegala, M. Ishizuka, and Y. Matsuo, "Measuring Semantic Similarity between Words using Web Search Engines," in Proceedings of the International Conference on World Wide Web, pp. 757-766, Banff, Alberta, Canada, May 2007
[4] X. Hu, N. Sun, C. Zhang, and T. S. Chua, "Exploiting Internal and External Semantics for the Clustering of Short Texts using World Knowledge," in Proceedings of the ACM Conference on Information and Knowledge Management, pp. 919-928, 2009
[5] A. Paccanaro and G. E. Hinton, "Learning Distributed Representations of Concepts using Linear Relational Embedding," IEEE Transactions on Knowledge & Data Engineering, Vol. 13, No. 2, pp. 232-244, 2002
[6] X. Zhang and B. Wu, "Short Text Classification based on Feature Extension using the n-Gram Model," in Proceedings of the 12th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 710-716, IEEE, 2015
[7] C. Bonnett, "Classifying E-Commerce Products based on Images and Text," http://cbonnett.github.io/Insight.html
[8] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Proceedings of the International Conference on Neural Information Processing Systems, pp. 3111-3119, 2013
[9] J. Pennington, R. Socher, and C. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1532-1543, 2014
[10] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation," Journal of Machine Learning Research, Vol. 3, pp. 993-1022, 2003
[11] A. Bordes, J. Weston, R. Collobert, and Y. Bengio, "Learning Structured Embeddings of Knowledge Bases," in Proceedings of the AAAI Conference on Artificial Intelligence, 2011
[12] R. Johnson and T. Zhang, "Effective Use of Word Order for Text Categorization with Convolutional Neural Networks," arXiv preprint arXiv:1412.1058, 2014
[13] Y. Kim, "Convolutional Neural Networks for Sentence Classification," arXiv preprint arXiv:1408.5882, 2014
[14] C. N. D. Santos and M. Gatti, "Deep Convolutional Neural Networks for Sentiment Analysis of Short Texts," in Proceedings of the International Conference on Computational Linguistics, 2014
[15] R. Kiros, R. Salakhutdinov, and R. Zemel, "Multimodal Neural Language Models," in Proceedings of the International Conference on Machine Learning, pp. 595-603, 2014
[16] K. Cho, B. Van Merriënboer, C. Gulcehre, et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," arXiv preprint arXiv:1406.1078, 2014
[17] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, Vol. 9, No. 8, pp. 1735-1780, 1997
[18] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: A Neural Image Caption Generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164, 2015
[19] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 652-663, 2016
[20] M. Surdeanu, M. Ciaramita, and H. Zaragoza, "Learning to Rank Answers to Non-Factoid Questions from Web Collections," Computational Linguistics, Vol. 37, No. 2, pp. 351-383, 2011
[21] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A Convolutional Neural Network for Modelling Sentences," arXiv preprint arXiv:1404.2188, 2014
[22] A. Severyn and A. Moschitti, "Modeling Relational Information in Question-Answer Pairs with Convolutional Neural Networks," arXiv preprint arXiv:1604.01178, 2016
[23] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, and G. Penn, "Applying Convolutional Neural Networks Concepts to Hybrid NN-HMM Model for Speech Recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4277-4280, 2012
[24] C. Szegedy, L. Wei, J. Yangqing, et al., "Going Deeper with Convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015
[25] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent Neural Network Regularization," arXiv preprint, 2014
[26] L. M. Surhone, M. T. Tennoe, and S. F. Henssonow, "Long Short Term Memory," Betascript Publishing, 2010
[27] A. Severyn and A. Moschitti, "Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks," in Proceedings of the International ACM SIGIR Conference, pp. 373-382, 2015
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, et al., "Microsoft COCO: Common Objects in Context," in Proceedings of the European Conference on Computer Vision, Springer, pp. 740-755, 2014