Detecting and exploiting stability in evolving heterogeneous information spaces

Authors:
George Papadakis;George Giannakopoulos;Claudia Niederée;Themis Palpanas;Wolfgang Nejdl
Affiliations:
National Technical University of Athens, Greece & L3S Research Center, Germany, Athens, Greece;SKEL - NCSR Demokritos, Athens, Greece;L3S Research Center, Germany, Hannover, Germany;University of Trento, Italy, Trento, Italy;L3S Research Center, Germany, Hannover, Germany
Venue:
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Year:
2011

Abstract

Individuals contribute content to the Web at an unprecedented rate, accumulating immense quantities of (semi-)structured data. The Wisdom of the Crowds theory suggests that such information (or parts of it) is constantly overwritten, updated, or even deleted by other users, with the goal of rendering it more accurate or up to date. This is particularly true for the collaboratively edited, semi-structured data of entity repositories, whose entity profiles are consistently kept fresh. Therefore, the core information in these profiles that remains stable over time, despite being reviewed by numerous users, is particularly useful for describing an entity. Based on this hypothesis, we introduce a classification scheme that predicts, on the basis of statistical and content patterns, whether an attribute (i.e., a name-value pair) is going to be modified in the future. We apply our scheme to a large, real-world, versioned dataset and verify its effectiveness. Our thorough experimental study also suggests that reducing entity profiles to their stable parts conveys significant benefits to two common tasks in computer science: information retrieval and information integration.
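
To make the idea of attribute stability concrete, the following is a minimal sketch and not the classification scheme of the paper. It assumes a version history is available as a list of snapshots (each a mapping from attribute name to value) and uses a single hypothetical statistical feature, the fraction of version transitions a name-value pair survived, with a fixed threshold standing in for a trained classifier. The names `persistence_ratio`, `predict_stable`, and the `threshold` parameter are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# One snapshot of an entity profile: attribute name -> value (assumed layout).
Profile = Dict[str, str]


def persistence_ratio(history: List[Profile], name: str, value: str) -> float:
    """Fraction of version transitions in which (name, value) survived,
    counted only over transitions that start with the pair present."""
    present = 0
    survived = 0
    for before, after in zip(history, history[1:]):
        if before.get(name) == value:
            present += 1
            if after.get(name) == value:
                survived += 1
    # A pair that never appeared before the latest version has no track record.
    return survived / present if present else 0.0


def predict_stable(history: List[Profile],
                   threshold: float = 0.8) -> Dict[Tuple[str, str], bool]:
    """Flag each name-value pair of the latest snapshot as stable or not.
    The threshold rule is a hypothetical stand-in for a learned classifier."""
    latest = history[-1]
    return {
        (name, value): persistence_ratio(history, name, value) >= threshold
        for name, value in latest.items()
    }


# Toy history: the name never changes, the population changes in every version.
history = [
    {"name": "Athens", "population": "3.0M"},
    {"name": "Athens", "population": "3.1M"},
    {"name": "Athens", "population": "3.2M"},
]
prediction = predict_stable(history)
```

In this toy history the name pair is flagged as stable and the latest population value is not. Reducing a profile to its stable part would then amount to keeping only the pairs flagged as stable.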