Web information extraction using Markov logic networks

Authors:
Sandeepkumar Satpal;Sahely Bhadra;Sundararajan Sellamanickam;Rajeev Rastogi;Prithviraj Sen
Affiliations:
Microsoft, Hyderabad, India;CSA, Indian Institute of Science, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India;Yahoo! Labs, Bangalore, India
Venue:
Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2011

Abstract

In this paper, we consider the problem of extracting structured data from web pages, taking into account both the content of individual attributes and the structure of pages and sites. We use Markov Logic Networks (MLNs) to capture content and structural features in a single unified framework, which enables more accurate inference. MLNs allow us to model a wide range of rich structural features, such as proximity, precedence, alignment, and contiguity, using first-order clauses. We show that inference in our information extraction scenario reduces to solving an instance of the maximum weight subgraph problem. We develop specialized procedures for solving variants of the maximum weight subgraph problem that are far more efficient than previously proposed inference methods for MLNs, which solve variants of MAX-SAT. Experiments with real-life datasets demonstrate the effectiveness of our MLN-based approach compared to existing state-of-the-art extraction methods.
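
To make the idea of combining content and structural clauses concrete, here is a minimal Python sketch. It is not the authors' system: the page, the features (`bold`, `phone_like`), the clause weights, and the uniqueness clause are all assumptions chosen for illustration, and MAP inference is done by exhaustive search rather than by the maximum weight subgraph procedures the abstract describes. It only shows how a labeling can be scored as the total weight of the ground clauses it satisfies, and how a precedence clause can break a tie that content features alone leave open.

```python
from itertools import product

# Hypothetical page: text nodes in document order, each with toy content features.
NODES = [
    {"text": "Joe's Diner", "bold": True,  "phone_like": False},
    {"text": "Open daily",  "bold": False, "phone_like": False},
    {"text": "555-0100",    "bold": False, "phone_like": True},
    {"text": "Call now",    "bold": True,  "phone_like": False},
]
LABELS = ("name", "phone", "other")

# Illustrative clause weights (assumed here, not learned from data).
W_NAME, W_PHONE, W_PRECEDENCE, W_UNIQUE = 2.0, 2.0, 1.0, 3.0


def score(labeling):
    """Return the total weight of satisfied ground clauses for a labeling."""
    total = 0.0
    n = len(NODES)
    for i in range(n):
        # Content clause: Bold(i) <=> Label(i, name)
        if NODES[i]["bold"] == (labeling[i] == "name"):
            total += W_NAME
        # Content clause: PhoneLike(i) <=> Label(i, phone)
        if NODES[i]["phone_like"] == (labeling[i] == "phone"):
            total += W_PHONE
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Structural clause (precedence):
            # Label(i, name) ^ Label(j, phone) => i < j
            if not (labeling[i] == "name" and labeling[j] == "phone" and i > j):
                total += W_PRECEDENCE
            # Structural clause (uniqueness, an added assumption):
            # Label(i, name) ^ Label(j, name) => i = j, grounded once per pair
            if i < j and not (labeling[i] == "name" and labeling[j] == "name"):
                total += W_UNIQUE
    return total


def map_labeling():
    """Exhaustively search all labelings and return the highest-scoring one."""
    return max(product(LABELS, repeat=len(NODES)), key=score)


if __name__ == "__main__":
    best = map_labeling()
    for node, label in zip(NODES, best):
        print(f"{label:6s} {node['text']}")
```

In this toy setup both bold nodes are plausible names on content alone. The uniqueness clause discourages labeling both, and the precedence clause then favors the bold node that appears before the phone number. Exhaustive search is exponential in the number of nodes, which is the motivation for specialized inference procedures.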