Web page title extraction and its application
Information Processing and Management: an International Journal
This paper considers the extraction of meta-information from printed documents without using OCR. It can be verified statistically that important terms in articles are printed in italic, bold, or all-capital type. Detecting these type styles helps in automatically extracting the lines that contain titles, authors' names, subtitles, and references, as well as sentences in which important terms occur. It also helps improve OCR performance on italicized text. Experimental results on the performance of the approach on both good-quality and degraded document images are presented.