Wikipedia entity expansion and attribute extraction from the web using semi-supervised learning

Authors:
Lidong Bing;Wai Lam;Tak-Lam Wong
Affiliations:
The Chinese University of Hong Kong, Hong Kong, Hong Kong;The Chinese University of Hong Kong, Hong Kong, Hong Kong;The Hong Kong Institute of Education, Hong Kong, Hong Kong
Venue:
Proceedings of the sixth ACM international conference on Web search and data mining
Year:
2013

Citing 31
Cited 3

Mining data records in Web pages

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Web-scale information extraction in knowitall: (preliminary results)

Proceedings of the 13th international conference on World Wide Web
Web data extraction based on partial tree alignment

WWW '05 Proceedings of the 14th international conference on World Wide Web
Simultaneous record detection and attribute labeling in web data extraction

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Semi-supervised conditional random fields for improved sequence segmentation and labeling

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
KnowItNow: fast, scalable information extraction from the web

HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Organizing and searching the world wide web of facts -- step two: harnessing the wisdom of the crowds

Proceedings of the 16th international conference on World Wide Web
Autonomously semantifying wikipedia

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Weakly-supervised discovery of named entities using web search queries

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Freebase: a collaboratively created graph database for structuring human knowledge

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
YAGO: A Large Ontology from Wikipedia and WordNet

Web Semantics: Science, Services and Agents on the World Wide Web
Language-Independent Set Expansion of Named Entities Using the Web

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
SOFIE: a self-organizing framework for information extraction

Proceedings of the 18th international conference on World wide web
Open information extraction from the web

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Answering table augmentation queries from unstructured lists on the web

Proceedings of the VLDB Endowment
Harvesting relational tables from lists on the web

Proceedings of the VLDB Endowment
Entity extraction via ensemble semantics

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 - Volume 1
Web-scale distributional similarity and entity set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2
Character-level analysis of semi-structured documents for set expansion

EMNLP '09 Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 - Volume 3
Coupled semi-supervised learning for information extraction

Proceedings of the third ACM international conference on Web search and data mining
Learning to Adapt Web Information Extraction Knowledge and Discovering New Attributes via a Bayesian Approach

IEEE Transactions on Knowledge and Data Engineering
DBpedia: a nucleus for a web of open data

ISWC'07/ASWC'07 Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference
Open information extraction using Wikipedia

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Learning 5000 relational extractors

ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Distributional similarity vs. PU learning for entity set expansion

ACLShort '10 Proceedings of the ACL 2010 Conference Short Papers
Efficient graph-based semi-supervised learning of structured tagging models

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Annotating and searching web tables using entities, types and relationships

Proceedings of the VLDB Endowment
Web-scale table census and classification

Proceedings of the fourth ACM international conference on Web search and data mining
Recovering semantics of tables on the web

Proceedings of the VLDB Endowment
Towards a unified solution: data record region detection and segmentation

Proceedings of the 20th ACM international conference on Information and knowledge management
Open information extraction: the second generation

IJCAI'11 Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Volume One

Towards an enhanced and adaptable ontology by distilling and assembling online encyclopedias

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Structured positional entity language model for enterprise entity retrieval

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model

ACM Transactions on the Web (TWEB)

Quantified Score

Hi-index	0.00

Visualization

Abstract

We develop a new framework to achieve the goal of Wikipedia entity expansion and attribute extraction from the Web. Our framework takes a few existing entities that are automatically collected from a particular Wikipedia category as seed input and explores their attribute infoboxes to obtain clues for the discovery of more entities for this category and the attribute content of the newly discovered entities. One characteristic of our framework is to conduct discovery and extraction from desirable semi-structured data record sets which are automatically collected from the Web. A semi-supervised learning model with Conditional Random Fields is developed to deal with the issues of extraction learning and limited number of labeled examples derived from the seed entities. We make use of a proximate record graph to guide the semi-supervised learning process. The graph captures alignment similarity among data records. Then the semi-supervised learning process can leverage the unlabeled data in the record set by controlling the label regularization under the guidance of the proximate record graph. Extensive experiments on different domains have been conducted to demonstrate its superiority for discovering new entities and extracting attribute content.