A system for extracting top-K lists from the web

Authors:
Zhixian Zhang;Kenny Qili Zhu;Haixun Wang
Affiliations:
Shanghai Jiao Tong University, Shanghai, China;Shanghai Jiao Tong University, Shanghai, China;Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2012

Citing 11
Cited 2

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Towards domain-independent information extraction from web tables

Proceedings of the 16th international conference on World Wide Web
From dirt to shovels: fully automatic tool generation from ad hoc data

Proceedings of the 35th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages
WebTables: exploring the power of tables on the web

Proceedings of the VLDB Endowment
Extracting data records from the web using tag path clustering

Proceedings of the 18th international conference on World wide web
FACTO: a fact lookup engine based on web tables

Proceedings of the 20th international conference on World wide web
Unexpected results in automatic list extraction on the web

ACM SIGKDD Explorations Newsletter
Extracting general lists from web documents: a hybrid approach

IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
Probase: a probabilistic taxonomy for text understanding

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Short text conceptualization using a probabilistic knowledgebase

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume Three

Understanding tables on the web

ER'12 Proceedings of the 31st international conference on Conceptual Modeling
Concept-based web search

ER'12 Proceedings of the 31st international conference on Conceptual Modeling

Quantified Score

Hi-index	0.00

Visualization

Abstract

List data is an important source of structured data on the web. This paper is concerned with "top-k" pages, which are web pages that describe a list of k instances of a particular topic or concept. Examples include "the 10 tallest persons in the world" and "the 50 hits of 2010 you don't want to miss". Compared to normal web list data, "top-k" lists contain richer information and are easier to understand. Therefore the extraction of such lists can help enrich existing knowledge bases about general concepts, or act as a preprocessing step to produce facts for a fact answering engine. We present an efficient system that extracts the target lists from web pages with high accuracy. We have used the system to process up to 160 million, or 1/10 of a high-frequency web snapshot from Bing, and obtained over 140,000 lists with 90.4% precision.