Portable extraction of partially structured facts from the web

Authors:
Andrew Salway;Liadh Kelly;Inguna Skadiņa;Gareth J. F. Jones
Affiliations:
Centre for Digital Video Processing, School of Computing, Dublin City University, Dublin 9, Ireland;Centre for Digital Video Processing, School of Computing, Dublin City University, Dublin 9, Ireland;Tilde, Riga, Latvia;Centre for Digital Video Processing, School of Computing, Dublin City University, Dublin 9, Ireland
Venue:
IceTAL'10 Proceedings of the 7th international conference on Advances in natural language processing
Year:
2010

Abstract

A novel fact extraction task is defined to fill a gap between current information retrieval and information extraction technologies. It is shown that useful, partially structured facts can be extracted about different kinds of entities in a broad domain, namely all kinds of places depicted in tourist images. Importantly, the approach does not rely on existing linguistic resources (gazetteers, taggers, parsers, etc.), and it was ported easily and cheaply between two rather different languages (English and Latvian). Previous work on fact extraction from the web has focused on structured data, e.g. (Building-LocatedIn-Town). In contrast, we extract richer and more interesting facts, such as a fact explaining why a building was built. Enough structure is maintained to facilitate subsequent processing of the information; for example, the partial structure enables straightforward template-based text generation. We report positive results for the correctness and interest of English and Latvian facts and for their utility in enhancing image captions.
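
The idea of template-based text generation from a partially structured fact can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify a schema, so the `PartialFact` class, its field names, the cue labels, the templates, and the placeholder entity "Example Tower" are all hypothetical. The sketch assumes that a fact keeps two structured slots (the entity and a fact-type cue) alongside a free-text remainder that is carried over verbatim.

```python
from dataclasses import dataclass


# Hypothetical representation; the abstract does not define this schema.
@dataclass(frozen=True)
class PartialFact:
    entity: str     # structured slot: the place the fact is about
    cue: str        # structured slot: fact type, e.g. "built_to"
    remainder: str  # unstructured slot: free text kept verbatim


# Hypothetical templates keyed by cue; illustrative only.
TEMPLATES = {
    "built_to": "{entity} was built to {remainder}.",
    "known_for": "{entity} is known for {remainder}.",
}


def render(fact: PartialFact) -> str:
    """Fill the template for the fact's cue, or fall back to a generic form."""
    template = TEMPLATES.get(fact.cue, "{entity}: {remainder}.")
    # Strip a trailing full stop so the template's own full stop is not doubled.
    return template.format(entity=fact.entity,
                           remainder=fact.remainder.rstrip("."))


def enhance_caption(caption: str, facts: list[PartialFact]) -> str:
    """Append one rendered sentence per fact to an existing image caption."""
    return " ".join([caption.rstrip()] + [render(f) for f in facts])


if __name__ == "__main__":
    # Placeholder data, not a fact taken from the paper.
    fact = PartialFact("Example Tower", "built_to", "guard the harbour entrance.")
    print(enhance_caption("A view of Example Tower.", [fact]))
```

Because only the entity and cue are structured, the same templates would apply to any free-text remainder, which is one plausible reading of how partial structure keeps downstream processing simple.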