Web information extraction is the first step toward effective web mining. This article summarizes several heuristic rules that characterize the main content of web pages. The rules are built from pre-defined terms and metrics, making them reusable and extensible across different kinds of HTML pages. A probabilistic model that combines these rules and metrics is then proposed, and the corresponding algorithm is implemented. Tested on 1000 randomly selected web pages, the algorithm proves more precise, and more robust to the diverse structures of different web sites, than competing algorithms.
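The abstract does not spell out the concrete rules or the probabilistic model, so the following is only an illustrative sketch of the general idea: score each line of an HTML page with a content metric (here the commonly used text-to-tag ratio, which the paper's rules may or may not include) and label lines above a threshold as main content. The function names and the threshold are assumptions, not the paper's method.

```python
import re

# Matches any HTML tag; a rough stand-in for real HTML parsing.
TAG_RE = re.compile(r"<[^>]+>")

def text_to_tag_ratio(html_line: str) -> float:
    """Characters of visible text per HTML tag on the line (tags + 1 avoids /0)."""
    tags = len(TAG_RE.findall(html_line))
    text = len(TAG_RE.sub("", html_line).strip())
    return text / (tags + 1)

def score_main_content(html: str, threshold: float = 10.0):
    """Label each non-empty line as main content when its ratio exceeds a threshold.

    The fixed threshold stands in for the probabilistic combination of rules
    described in the abstract; it is a simplifying assumption.
    """
    return [
        (line, text_to_tag_ratio(line) > threshold)
        for line in html.splitlines()
        if line.strip()
    ]
```

For example, a navigation line such as `<div><a href='/'>Home</a></div>` yields a low ratio (much markup, little text), while a plain paragraph of article text yields a high one, so the two are labeled boilerplate and main content respectively.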