Information extraction from semi-structured web documents

Authors:
Bo-Hyun Yun;Chang-Ho Seo
Affiliations:
Dept. of Computer Education, Mokwon University, Taejon, Korea;Dept. of Applied Mathematics, Kongju University, Kongju-City, Korea
Venue:
KSEM'06 Proceedings of the First International Conference on Knowledge Science, Engineering and Management
Year:
2006

Abstract

This paper proposes a web information extraction system that automatically extracts predefined information from web documents (i.e., HTML documents) and integrates the extracted information. The system recognizes unlabeled entities with a probabilistic entity recognition method and semi-automatically extends its existing domain knowledge using the extracted data. It also extracts sub-linked information, that is, information on pages linked from the base page, and integrates similar results extracted from heterogeneous sources. In experiments, the global precision over seven domain sites is 93.5%. The system that uses sub-linked information and probabilistic entity recognition achieves significantly higher precision than a system that uses only domain knowledge. Moreover, because the system can be applied flexibly to different domains, it can extract a wider variety of information precisely. The system can therefore help maximize user satisfaction and contribute to the revitalization of e-business.
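
The abstract does not specify the probabilistic model, so the following is only an illustrative sketch of how unlabeled values might be assigned an entity type and how domain knowledge might then be extended. It assumes a naive Bayes classifier over coarse character-shape features, with confident predictions added back to the knowledge base and uncertain ones flagged for manual review. The feature set, the names EntityRecognizer and extend_knowledge, and the 0.9 threshold are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict


def features(value):
    """Coarse character-shape features (a hypothetical feature set)."""
    lowered = value.lower()
    return [
        "has_digit" if re.search(r"\d", value) else "no_digit",
        "has_currency" if re.search(r"[$]|won|usd", lowered) else "no_currency",
        "has_at" if "@" in value else "no_at",
        "len_short" if len(value) <= 12 else "len_long",
    ]


class EntityRecognizer:
    """Naive Bayes over binary features; stands in for the domain knowledge."""

    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)

    def add_example(self, value, label):
        self.label_counts[label] += 1
        for feat in features(value):
            self.feat_counts[label][feat] += 1

    def predict(self, value):
        """Return (best_label, posterior probability of that label)."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            logp = math.log(n / total)
            for feat in features(value):
                # Laplace smoothing; each feature slot has two possible values.
                logp += math.log((self.feat_counts[label][feat] + 1) / (n + 2))
            log_scores[label] = logp
        # Normalize the log scores into posterior probabilities.
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        best = max(exp_scores, key=exp_scores.get)
        return best, exp_scores[best] / norm


def extend_knowledge(recognizer, value, threshold=0.9):
    """Assumed semi-automatic step: accept confident labels, flag the rest."""
    label, confidence = recognizer.predict(value)
    if confidence >= threshold:
        recognizer.add_example(value, label)
        return label, confidence, "accepted"
    return label, confidence, "needs_review"


# Seed the recognizer with a few labeled values (made-up illustration data).
recognizer = EntityRecognizer()
for text, label in [
    ("$12.99", "price"), ("15,000 won", "price"), ("USD 30", "price"),
    ("a@b.com", "email"), ("info@shop.co.kr", "email"),
    ("Introduction to Data Mining", "title"),
    ("Learning Python Programming", "title"),
]:
    recognizer.add_example(text, label)

price_result = extend_knowledge(recognizer, "$45.00")
email_result = extend_knowledge(recognizer, "sales@example.org")
print(price_result)
print(email_result)
```

With this seed data, the unlabeled value "$45.00" is labeled as a price with high confidence and is added to the knowledge base, while "sales@example.org" is labeled as an email with lower confidence and is left for review. A real system would need far richer features and training data than this toy example.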