Tracking entities in web archives: the LAWA project

Authors:
Marc Spaniol;Gerhard Weikum
Affiliations:
Max Planck Institute for Informatics, Saarbrücken, Germany;Max Planck Institute for Informatics, Saarbrücken, Germany
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012

Citing 8
Cited 0

Web Archiving

Web Archiving
On the value of temporal information in information retrieval

ACM SIGIR Forum
Timely YAGO: harvesting, querying, and visualizing temporal knowledge from Wikipedia

Proceedings of the 13th International Conference on Extending Database Technology
DBpedia: a nucleus for a web of open data

ISWC'07/ASWC'07 Proceedings of the 6th international The semantic web and 2nd Asian conference on Asian semantic web conference
WikiWars: a new corpus for research on temporal expressions

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
TimeTrails: a system for exploring spatio-temporal information in documents

Proceedings of the VLDB Endowment
YAGO2: exploring and querying world knowledge in time, space, context, and many languages

Proceedings of the 20th international conference companion on World wide web
Harvesting facts from textual web sources by constrained label propagation

Proceedings of the 20th ACM international conference on Information and knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Web-preservation organization like the Internet Archive not only capture the history of born-digital content but also reflect the zeitgeist of different time periods over more than a decade. This longitudinal data is a potential gold mine for researchers like sociologists, politologists, media and market analysts, or experts on intellectual property. The LAWA project (Longitudinal Analytics of Web Archive data) is developing an Internet-based experimental testbed for large-scale data analytics on Web archive collections. Its emphasis is on scalable methods for this specific kind of big-data analytics, and software tools for aggregating, querying, mining, and analyzing Web contents over long epochs. In this paper, we highlight our research on {\em entity-level analytics} in Web archive data, which lifts Web analytics from plain text to the entity-level by detecting named entities, resolving ambiguous names, extracting temporal facts and visualizing entities over extended time periods. Our results provide key assets for tracking named entities in the evolving Web, news, and social media.