The parallel path framework for entity discovery on the web

Authors:
Tim Weninger; Thomas J. Johnston; Jiawei Han
Affiliations:
University of Illinois at Urbana-Champaign; University of Illinois at Urbana-Champaign; University of Illinois at Urbana-Champaign
Venue:
ACM Transactions on the Web (TWEB)
Year:
2013

Abstract

It has long been a goal of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Even though databases are currently used to generate Web content on some sites, the schemas of these databases are rarely consistent across a domain. This makes it difficult to compare and aggregate information from different domains. We aim to take an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an entity-page (e.g., a university department site and a faculty member's home page), we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, and office numbers). To do this, we propose a Web structure mining method that grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that by utilizing these parallel paths we can efficiently discover entity-pages and attributes. Finally, we demonstrate the accuracy of our method with a large case study.
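
The abstract does not give the algorithm itself, so the following is only a rough illustration of the intuition behind "parallel paths", not the authors' method. It assumes a single list page, takes the DOM tag path of the link that points at a known seed entity-page, and returns every other link on that page reached through the same tag path. The names `LinkPathParser`, `parallel_links`, and the sample page with its URLs are all hypothetical; the sketch also ignores the Web graph and attribute propagation that the paper describes.

```python
from html.parser import HTMLParser

# HTML elements that never have a closing tag, so they must not be pushed
# onto the open-tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Record each hyperlink together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.links = []   # (tag_path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; unmatched
        # closing tags are ignored so sloppy HTML does not break the stack.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def parallel_links(html, seed_href):
    """Return links whose DOM tag path matches that of the seed link.

    Illustrative assumption: links that share the seed's tag path on the
    same page are candidates for entity-pages of the same type.
    """
    parser = LinkPathParser()
    parser.feed(html)
    seed_paths = {path for path, href in parser.links if href == seed_href}
    return [href for path, href in parser.links if path in seed_paths]


# Hypothetical department page: one navigation link and a faculty list.
PAGE = """
<html><body>
  <div id="nav"><a href="/about.html">About</a></div>
  <ul>
    <li><a href="/people/alice.html">Alice</a></li>
    <li><a href="/people/bob.html">Bob</a></li>
  </ul>
</body></html>
"""

print(parallel_links(PAGE, "/people/alice.html"))
```

Given the seed `/people/alice.html`, the sketch returns both faculty links and skips the navigation link, because only the list items share the path `html/body/ul/li/a`. A real system would need more than exact tag-path equality, since templates vary across sites and pages.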