RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Data extraction and label assignment for web databases
WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Retrieving answers from frequently asked questions pages on the web
Proceedings of the 14th ACM international conference on Information and knowledge management
Web Communities Defined by Web Page Content
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
Hi-index | 0.00 |
In this poster, we present an information extraction engine for web-based forums. The engine analyzes the HTML files crawled from web forums, deduces the wrapper (template) of the pages and extracts the information about posts (e.g., author, title, content, number of replies and views, etc.). Extraction is an important module for forum search engine, since it helps to understand the content of a forum HTML page and facilitates ranking during retrieval. We discuss the system architecture of the extraction engine in the context of a forum search engine and present various components in the extraction engine. We also introduce briefly the extraction process and discuss some implementation issues.