Business Specific Online Information Extraction from German Websites

Authors:
Yeong Su Lee;Michaela Geierhos
Affiliations:
CIS, University of Munich, Germany;CIS, University of Munich, Germany
Venue:
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009

Abstract

This paper presents a system that uses the domain name of a German business website to locate its information pages (e.g. company profile, contact page, imprint) and then identifies business-specific information on them. We concentrate on extracting characteristic vocabulary such as company names, addresses, contact details, and CEOs. Above all, we interpret the HTML structure of the documents and analyze contextual cues to transform the unstructured web pages into structured forms. Our approach is robust to variation in the DOM, is upgradeable, and keeps the data up to date. The evaluation experiments show that the generated data can be accessed efficiently. The technique can also be adapted to non-German websites with slight language-specific modifications, and experimental results on real-life websites confirm the feasibility of the approach.
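
To make the pipeline described in the abstract concrete, the following Python sketch builds candidate information-page URLs from a domain name and turns an imprint page into a structured record. It is a minimal illustration, not the authors' system. The path names in `INFO_PATHS`, the regular expressions in `PATTERNS`, the function names, and the sample page are all assumptions introduced here. The sketch also flattens the page into text lines, whereas the paper interprets the HTML structure and contextual evidence.

```python
import re
from html.parser import HTMLParser

# Hypothetical page names; the abstract does not list the paths the system probes.
INFO_PATHS = ("impressum", "kontakt", "ueber-uns")


def candidate_urls(domain):
    """Build candidate information-page URLs from a bare domain name."""
    return [f"http://{domain}/{path}" for path in INFO_PATHS]


class TextLines(HTMLParser):
    """Collect the visible text between tags as a list of non-empty lines."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.lines.append(text)


# Illustrative patterns only (assumptions of this sketch, not from the paper).
PATTERNS = {
    "company": re.compile(r".*\b(?:GmbH|AG|KG|OHG)\b"),
    "street": re.compile(
        r"[A-ZÄÖÜ][\w.\- ]*(?:straße|str\.|weg|platz|allee)\s+\d+\s*[a-zA-Z]?"
    ),
    "zip_city": re.compile(r"\b\d{5}\s+[A-ZÄÖÜ][\w.\- ]+"),
    "phone": re.compile(r"(?:Telefon|Tel\.?)\s*:?\s*([+\d][\d\s/()\-]{5,})"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ceo": re.compile(r"Geschäftsführer(?:in)?\s*:\s*(.+)"),
}


def extract_fields(html):
    """Return a dict holding the first match per field, scanning line by line."""
    parser = TextLines()
    parser.feed(html)
    parser.close()
    record = {}
    for line in parser.lines:
        for field, pattern in PATTERNS.items():
            if field in record:
                continue  # keep the first occurrence of each field
            match = pattern.search(line)
            if match:
                # Use the capture group when the pattern defines one.
                value = match.group(1) if pattern.groups else match.group(0)
                record[field] = value.strip()
    return record


# Made-up imprint page used only to exercise the sketch.
SAMPLE_IMPRINT = """
<html><body>
<h1>Impressum</h1>
<p>Muster GmbH<br>Musterstraße 1<br>80538 München</p>
<p>Telefon: +49 89 123456</p>
<p>E-Mail: info@muster.example</p>
<p>Geschäftsführer: Max Mustermann</p>
</body></html>
"""

record = extract_fields(SAMPLE_IMPRINT)
```

Keeping only the first match per field is a simplification chosen for brevity. A system that relies on DOM structure and context, as the abstract describes, would need to decide between competing candidates rather than take the first one.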