Rule-based Merged Line Segmentation Technique
- Bhupendra Kumar and Sarvesh Tanwar
Text line segmentation is an important step in a document image understanding system. The volume of document images grows daily as a result of digitization activities run by various organizations. These data include images of old archival documents that are important to heritage. Understanding such varied documents and extracting information from them is important yet challenging. Document layout analysis, graphics detection, table detection, stamp detection, annotation understanding, segmentation, and recognition are among the components of a document understanding system. Segmentation plays a very important role because stable recognition engines can recognize only words, characters, or symbols. Older printed documents and manuscripts generally contain artifacts, noise, skew, annotations, merged components, etc., which make information extraction difficult because segmentation does not perform well on them. We consider document images containing Indian scripts in which upper- and lower-zone characters from different lines merge and lines interfere with one another because of the small spacing between them. Moreover, these documents exhibit complex layouts, noise, deformities, local skew, etc., which make line segmentation more challenging. The line segmentation output is fed to the word segmentation module and then to the character segmentation or recognition engine. Wrong or merged line segments produce erroneous input for the word segmentation and recognition engines, reducing the overall performance of the document understanding system. Robust line segmentation is therefore required, as overall system performance is very sensitive to the performance of line segmentation. In this paper, we propose a rule-based line segmentation method that employs region growing, morphological image processing, and connected component analysis to handle merged lines.
The proposed technique can segment lines even where the separation between adjacent lines is nonlinear. Evaluation was performed on 350 document images taken from old books, containing approximately 10,000 text lines. The proposed approach is observed to outperform the traditional projection-profile-based approach on documents containing merged lines.
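To illustrate why merged lines are a problem for the traditional baseline, the following is a minimal sketch of projection-profile line segmentation on a toy binary image. It is not the rule-based method proposed in the paper; the function names, the `min_ink` threshold, and the tiny example images are assumptions made only for illustration. A single stroke bridging the gap between two lines removes the empty row in the profile, so the baseline reports one line instead of two.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink pixel, 0 = background


def horizontal_projection(image: Image) -> List[int]:
    """Count ink pixels in every row of the image."""
    return [sum(row) for row in image]


def segment_lines_by_profile(image: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top_row, bottom_row) for each run of rows with at least min_ink ink pixels.

    min_ink is a hypothetical threshold chosen for this sketch.
    """
    profile = horizontal_projection(image)
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # a gap row ends the current line
            start = None
    if start is not None:                  # line runs to the bottom of the image
        lines.append((start, len(profile) - 1))
    return lines


# Two text lines separated by one blank row.
clean: Image = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

# Same page, but a lower-zone stroke of the first line touches the second line.
merged: Image = [row[:] for row in clean]
merged[2][1] = 1

clean_lines = segment_lines_by_profile(clean)
merged_lines = segment_lines_by_profile(merged)
print(clean_lines)   # two separate line bands
print(merged_lines)  # a single band covering both lines
```

Raising `min_ink` would split this toy example, but on real pages with local skew and dense touching strokes no single threshold separates every pair of lines, which is the situation the abstract describes as motivating a component-level, rule-based approach.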