Information extraction from Wikipedia: moving down the long tail

Authors:
Fei Wu; Raphael Hoffmann; Daniel S. Weld
Affiliations:
University of Washington, Seattle, WA, USA; University of Washington, Seattle, WA, USA; University of Washington, Seattle, WA, USA
Venue:
Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Year:
2008

Abstract

Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.
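
The abstract gives no algorithmic detail, so the following is only a rough illustration of the general idea behind technique (1): a sparse class borrows training examples from related classes in a subsumption taxonomy, with lower weight for more distant classes. Everything here is an assumption made for the sketch, not the authors' method: the function name pool_training_examples, the choice to use only ancestors and descendants, the max_hops cutoff, the exponential decay weighting, and the toy class names.

```python
from collections import defaultdict


def pool_training_examples(target, parent_of, examples, max_hops=2, decay=0.5):
    """Collect weighted training examples for a sparse class.

    parent_of maps each class to its parent in a tree-shaped taxonomy.
    examples maps each class to a list of training sentences.
    Returns (sentence, weight, source_class) triples. The weighting
    scheme (decay ** hops) is a hypothetical choice for illustration.
    """
    # Invert the child -> parent map so we can also walk downward.
    children_of = defaultdict(list)
    for child, parent in parent_of.items():
        children_of[parent].append(child)

    hops = {target: 0}

    # Walk up through ancestors, at most max_hops levels.
    node, dist = target, 0
    while node in parent_of and dist < max_hops:
        node, dist = parent_of[node], dist + 1
        hops[node] = dist

    # Walk down through descendants, level by level.
    frontier = [target]
    for dist in range(1, max_hops + 1):
        frontier = [c for p in frontier for c in children_of[p]]
        for cls in frontier:
            hops[cls] = dist

    # Examples from the class itself get weight 1.0; others are discounted.
    return [
        (sentence, decay ** hops[cls], cls)
        for cls in hops
        for sentence in examples.get(cls, [])
    ]


# Toy taxonomy and data (hypothetical, not taken from the paper).
toy_parent_of = {
    "performer": "person",
    "actor": "performer",
    "comedian": "performer",
    "politician": "person",
}
toy_examples = {
    "person": ["p1", "p2"],
    "performer": ["f1"],
    "actor": ["a1"],
    "comedian": ["c1"],
    "politician": ["x1"],
}

for sentence, weight, source in pool_training_examples(
    "performer", toy_parent_of, toy_examples
):
    print(sentence, weight, source)
```

In this sketch a sibling branch such as "politician" contributes nothing to "performer", because only classes on the direct ancestor or descendant path are pooled. A real system would need to decide how to weight borrowed examples and how to map attribute names between classes, neither of which the abstract specifies.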