Annotating Wikipedia articles with semantic tags for structured retrieval

Authors:
Saravadee Sae Tan;Tang Enya Kong;Gian Chand Sodhy
Affiliations:
Multimedia University, Cyberjaya, Malaysia;Multimedia University, Cyberjaya, Malaysia;Universiti Sains Malaysia, Penang, Malaysia
Venue:
Proceedings of the 2nd ACM workshop on Social web search and mining
Year:
2009

Abstract

Structured retrieval exploits the structure of documents, in addition to their content, to improve information retrieval. Its success therefore depends on the availability of semantic structure in the documents. However, the majority of documents on the Web still lack semantically rich structure. This motivates us to automatically identify semantic information in web documents and annotate it explicitly with semantic tags. Using the well-known Wikipedia corpus, this paper describes an unsupervised learning approach that identifies the conceptual information and descriptive information of the entity described in a Wikipedia article. The approach uses the Wikipedia link structure and Infobox information to learn the semantic structure of the articles. We also describe a lazy approach to the learning process: using the categories assigned by Wikipedia contributors, only a subset of the entities in a category is used as training data, and the results are then applied to the remaining entities in that category.
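
To illustrate why semantic tags matter for structured retrieval, the following minimal Python sketch contrasts a content-only match with a match restricted to one tag. It is not the paper's method: the sample article, the tag names (`concept`, `description`, `capital`, `country`, `text`), and the helper functions `content_only_match` and `structured_match` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated article. The tag names are illustrative
# assumptions and are not taken from the paper.
ARTICLE = """
<article title="Penang">
  <concept>state</concept>
  <description>
    <capital>George Town</capital>
    <country>Malaysia</country>
  </description>
  <text>Penang is a state on the northwest coast of Malaysia.</text>
</article>
"""


def content_only_match(root, term):
    """Return the tag of every element whose own text mentions the term."""
    term = term.lower()
    return [el.tag for el in root.iter() if el.text and term in el.text.lower()]


def structured_match(root, tag, term):
    """Return the text of elements with the given tag that mention the term."""
    term = term.lower()
    return [el.text for el in root.iter(tag) if el.text and term in el.text.lower()]


root = ET.fromstring(ARTICLE)

# Content-only search cannot tell a country value from a passing mention.
print(content_only_match(root, "malaysia"))

# A structured query restricts the match to the semantic tag of interest.
print(structured_match(root, "country", "malaysia"))
print(structured_match(root, "capital", "malaysia"))
```

In this sketch the content-only search matches both the `country` element and the free `text`, whereas the structured query returns a hit only for the tag it names. Without such tags in the document, the second kind of query is not possible, which is the gap that automatic annotation aims to close.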