Web information extraction is the first step toward effective web mining. This article summarizes several heuristic rules that characterize the main content of web pages. The rules are built from pre-defined terms and metrics, making them reusable and extensible across different kinds of HTML pages. A probabilistic model that combines these rules and metrics is then proposed, and the corresponding algorithm is implemented. Tested on 1000 randomly selected web pages, the algorithm proves more precise, and more robust to the diverse structures of different web sites, than competing algorithms.
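The abstract does not spell out the concrete rules or the probabilistic model, so the following is only an illustrative sketch of the general idea: score each line of an HTML page with a content metric (here the commonly used text-to-tag ratio, which the paper's rules may or may not include) and label lines above a threshold as main content. The function names and the threshold are assumptions, not the paper's method.

```python
import re

# Matches any HTML tag; a rough stand-in for real HTML parsing.
TAG_RE = re.compile(r"<[^>]+>")

def text_to_tag_ratio(html_line: str) -> float:
    """Characters of visible text per HTML tag on the line (tags + 1 avoids /0)."""
    tags = len(TAG_RE.findall(html_line))
    text = len(TAG_RE.sub("", html_line).strip())
    return text / (tags + 1)

def score_main_content(html: str, threshold: float = 10.0):
    """Label each non-empty line as main content when its ratio exceeds a threshold.

    The fixed threshold stands in for the probabilistic combination of rules
    described in the abstract; it is a simplifying assumption.
    """
    return [
        (line, text_to_tag_ratio(line) > threshold)
        for line in html.splitlines()
        if line.strip()
    ]
```

For example, a navigation line such as `<div><a href='/'>Home</a></div>` yields a low ratio (much markup, little text), while a plain paragraph of article text yields a high one, so the two are labeled boilerplate and main content respectively.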