Information extraction from the Web is both interesting and challenging, and it supports Web search, information retrieval, and Web mining. Many sites generate their pages dynamically, filling an HTML template with structured records from a back-end database. To efficiently extract meaningful information from such pages, including the records and their data schema, a new method based on tag tree templates is proposed. Web pages from different Web sites are parsed into tag trees, and a template for each site is then generated from its trees using a cost-based tree similarity measure. The exclusive content of each page is then extracted by parsing the page against the templates. Finally, the records in each page and the schema of those records are extracted from the exclusive content by finding repeating patterns and applying heuristic rules. Extraction experiments on 360 pages from 12 Web sites show that the proposed method is an effective way to extract meaningful information.
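The first steps of this pipeline (parsing pages into tag trees, comparing trees, and isolating page-specific content) can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not define its cost-based measure, so the sketch substitutes the well-known Simple Tree Matching algorithm, normalized by tree size. It also assumes that "exclusive content" means text that does not recur at the same tag path on another page of the same site. All names (`Node`, `TagTreeBuilder`, `similarity`, `exclusive_content`) and the two sample pages are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # elements with no end tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree; assumes reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    return 1 + sum(size(child) for child in node.children)


def match(a, b):
    """Simple Tree Matching: node count of a maximum top-down, ordered matching."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1] + match(a.children[i - 1], b.children[j - 1]),
            )
    return 1 + table[m][n]


def similarity(a, b):
    """Assumed stand-in for the cost-based measure: 1.0 means identical structure."""
    return 2 * match(a, b) / (size(a) + size(b))


def path_texts(node, prefix=""):
    """List (tag path, text) pairs for every node that carries text."""
    path = prefix + "/" + node.tag
    items = [(path, node.text)] if node.text else []
    for child in node.children:
        items.extend(path_texts(child, path))
    return items


def exclusive_content(page, other):
    """Text of `page` that does not recur at the same tag path in `other`."""
    shared = set(path_texts(other))
    return [text for path, text in path_texts(page) if (path, text) not in shared]


# Two made-up pages that share a site template but list different records.
tree_a = parse("<html><body><div>Shop</div><ul><li>Pen</li><li>Ink</li></ul></body></html>")
tree_b = parse("<html><body><div>Shop</div><ul><li>Cup</li></ul></body></html>")
score = similarity(tree_a, tree_b)
records_a = exclusive_content(tree_a, tree_b)
print(round(score, 3), records_a)
```

In this toy case the shared "Shop" header is treated as template text, while the list items survive as exclusive content. Discovering repeating patterns among those items and inferring a record schema, the final step described above, is not covered by the sketch.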