BioSnowball: automated population of Wikis

Authors:
Xiaojiang Liu; Zaiqing Nie; Nenghai Yu; Ji-Rong Wen
Affiliations:
University of Science and Technology of China, Hefei, China; Microsoft Research Asia, Beijing, China; University of Science and Technology of China, Hefei, China; Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2010

Abstract

Internet users regularly need to find biographies of, and facts about, people of interest. Wikipedia has become the first stop for celebrity biographies and facts. However, Wikipedia can only provide information for celebrities because of its neutral point of view (NPOV) editorial policy. In this paper we propose BioSnowball, an integrated bootstrapping framework that automatically summarizes the Web to generate Wikipedia-style pages for any person with a modest web presence. In BioSnowball, biography ranking and fact extraction are performed together in a single integrated training and inference process that uses Markov Logic Networks (MLNs) as the underlying statistical model. The bootstrapping framework starts with only a small number of seeds and iteratively finds new facts and biographies. Because biography paragraphs on the Web are composed of the most important facts, the joint summarization model can improve the accuracy of both fact extraction and biography ranking compared with the decoupled methods in the literature. Empirical results on both a small labeled data set and a real Web-scale data set show that BioSnowball is effective and that it outperforms the decoupled methods.
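
The seed-driven loop described in the abstract can be illustrated with a toy sketch. This is not the BioSnowball model: it swaps the joint MLN training and inference for plain regular-expression patterns in the style of earlier Snowball-like systems, and it only shows how a few seeds can grow into more facts over several rounds. The function names, the `SLOT` pattern, the birthplace relation, and the sample sentences are all assumptions made for illustration.

```python
import re

# Hypothetical slot pattern: a capitalized phrase made of word characters and spaces.
SLOT = r"[A-Z][\w ]*?"


def make_pattern(sentence, person, value):
    """Generalize one sentence into a regex by replacing the known pair with slots.

    Simplifying assumption: the person is mentioned before the value.
    Returns None when the sentence does not contain the pair in that order.
    """
    i = sentence.find(person)
    if i < 0:
        return None
    j = sentence.find(value, i + len(person))
    if j < 0:
        return None
    return (
        re.escape(sentence[:i])
        + f"(?P<person>{SLOT})"
        + re.escape(sentence[i + len(person):j])
        + f"(?P<value>{SLOT})"
        + re.escape(sentence[j + len(value):])
    )


def induce_patterns(sentences, facts):
    """Collect a pattern from every sentence that mentions a known fact."""
    patterns = set()
    for sentence in sentences:
        for person, value in facts:
            pattern = make_pattern(sentence, person, value)
            if pattern is not None:
                patterns.add(pattern)
    return patterns


def apply_patterns(sentences, patterns):
    """Extract (person, value) pairs from sentences that fully match a pattern."""
    found = set()
    for sentence in sentences:
        for pattern in patterns:
            match = re.fullmatch(pattern, sentence)
            if match:
                found.add((match.group("person"), match.group("value")))
    return found


def bootstrap(sentences, seeds, max_rounds=5):
    """Alternate pattern induction and fact extraction until nothing new appears."""
    facts = set(seeds)
    for _ in range(max_rounds):
        patterns = induce_patterns(sentences, facts)
        new_facts = apply_patterns(sentences, patterns) - facts
        if not new_facts:
            break
        facts |= new_facts
    return facts


# Illustrative input only; not data from the paper.
SENTENCES = [
    "Ada Lovelace was born in London.",
    "Alan Turing was born in Maida Vale.",
    "Alan Turing is a native of Maida Vale.",
    "Katherine Johnson is a native of White Sulphur Springs.",
    "Marie Curie worked in Paris.",
]
SEEDS = {("Ada Lovelace", "London")}

if __name__ == "__main__":
    for fact in sorted(bootstrap(SENTENCES, SEEDS)):
        print(fact)
```

In this sketch the single seed yields the "was born in" pattern, which finds a second person. That person's other sentence then yields the "is a native of" pattern, which finds a third. A purely pattern-based loop like this one has no notion of fact importance or biography ranking, which is the part the paper's joint model is meant to add.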