Cross-Media Retrieval based on Pseudo-Label Learning and Semantic Consistency Algorithm

doi:10.23940/ijpe.18.09.p31.22192229

Abstract

Many algorithms have been proposed for retrieving heterogeneous multimodal data that share the same semantics, and the organization and analysis of such data have become a focus of intensive research. Here, a new and efficient cross-media retrieval algorithm based on pseudo-label learning and semantic consistency (PLSC) is proposed. The algorithm introduces an adaptive optimization method for learning the projection matrices that takes into account the semantic information of both labeled and unlabeled samples. As a result, PLSC can exploit more useful information than other methods and learn more effective projection matrices. First, the class centers of the labeled text are computed, using median feature vectors as the class center vectors. Next, unlabeled images are projected onto the text space and assigned pseudo-labels by comparison with the class center vectors of the text data. Finally, a new training dataset, which includes labeled and unlabeled data, is generated for training the projection matrix. Once the projection matrix has projected image and text data onto the same feature space, the data can be compared directly for similarity, with the distance between data points calculated using the Euclidean metric. Validation experiments suggest that PLSC outperforms other state-of-the-art algorithms.
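The pseudo-labeling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the image-to-text projection used for pseudo-labeling is obtained, so the matrix W below is simply assumed to be given, and the function names (class_centers, project, assign_pseudo_labels) and the toy data are hypothetical. The sketch shows only the median class centers, the projection of unlabeled images onto the text space, and the nearest-center assignment under the Euclidean metric.

```python
import math
from statistics import median


def class_centers(text_feats, labels):
    """Per-class center: coordinate-wise median of the labeled text features."""
    by_class = {}
    for vec, lab in zip(text_feats, labels):
        by_class.setdefault(lab, []).append(vec)
    return {lab: [median(col) for col in zip(*vecs)]
            for lab, vecs in by_class.items()}


def project(vec, matrix):
    """Row vector times matrix: maps an image feature into the text space."""
    return [sum(v * m for v, m in zip(vec, col)) for col in zip(*matrix)]


def assign_pseudo_labels(image_feats, matrix, centers):
    """Give each projected image the label of its nearest class center (Euclidean)."""
    labels = []
    for vec in image_feats:
        p = project(vec, matrix)
        labels.append(min(centers, key=lambda lab: math.dist(p, centers[lab])))
    return labels


# Toy data, made up for illustration only.
text_feats = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]]
text_labels = ["a", "a", "a", "b", "b", "b"]

# Assumed (hypothetical) projection from 3-d image features to the 2-d text space.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
unlabeled_images = [[0.1, 0.2, 5.0], [1.0, 0.9, -3.0]]

centers = class_centers(text_feats, text_labels)
pseudo = assign_pseudo_labels(unlabeled_images, W, centers)
print(centers)
print(pseudo)
```

In the workflow the abstract describes, the pseudo-labeled images would then be merged with the labeled data to form the new training set for the projection matrix; that training step is not sketched here because the abstract does not specify its objective.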

Key words: cross-media retrieval, pseudo-label, semi-supervised, semantic analysis

Gongwen Xu, Zhiqi Sang, and Zhijun Zhang. Cross-Media Retrieval based on Pseudo-Label Learning and Semantic Consistency Algorithm [J]. Int J Performability Eng, 2018, 14(9): 2219-2229.

