Finding similar identities among objects from multiple web sources

Authors:
Joyce C. P. Carvalho;Altigran S. da Silva
Affiliations:
Federal University of Minas Gerais, Belo Horizonte, Brazil;Federal University of Amazonas, Manaus, Brazil
Venue:
WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Year:
2003

Abstract

When integrating data from multiple Web sources, the same object can appear in different formats and structures, which makes it difficult to identify the objects that should be matched. In this paper, we propose an approach to finding similar identities among objects from multiple Web sources. In this approach, object identification works like the relational join operation, with a similarity function taking the place of the equality condition. This similarity function is based on information retrieval techniques. Our approach differs from others in the literature in that it can identify objects with more complex structure (e.g., XML documents), and not only objects with a flat structure such as relations. The effectiveness of our approach is demonstrated by experimental results with real Web data sources from different domains, which reach precision levels above 75%.
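
The idea of a join whose equality condition is replaced by a similarity function can be illustrated with a minimal sketch. The abstract does not specify the similarity function beyond saying it is based on information retrieval techniques, so everything below is an assumption made for illustration: nested objects are flattened into bags of words, weighted with a smoothed TF-IDF scheme, and compared by cosine similarity. The function names (`tokens`, `tfidf_vectors`, `cosine`, `similarity_join`), the threshold value, and the sample records are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokens(obj):
    """Flatten a nested object (dicts, lists, scalars) into lowercase word tokens.

    Only values are tokenized; keys are ignored, so differing field names
    across sources do not affect the comparison (an assumption of this sketch).
    """
    if isinstance(obj, dict):
        return [t for value in obj.values() for t in tokens(value)]
    if isinstance(obj, (list, tuple)):
        return [t for item in obj for t in tokens(item)]
    return re.findall(r"\w+", str(obj).lower())


def tfidf_vectors(objects):
    """Build one sparse TF-IDF vector (a dict term -> weight) per object."""
    bags = [Counter(tokens(o)) for o in objects]
    n = len(bags)
    # Document frequency: number of objects in which each term occurs.
    df = Counter(term for bag in bags for term in bag)
    # Smoothed IDF keeps a positive weight for terms present in every object.
    return [
        {term: tf * (math.log((1 + n) / (1 + df[term])) + 1.0)
         for term, tf in bag.items()}
        for bag in bags
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_join(left, right, threshold=0.5):
    """Return (i, j, score) for every pair whose similarity meets the threshold.

    This is a nested-loop join in which `cosine(...) >= threshold`
    plays the role of the equality condition.
    """
    vectors = tfidf_vectors(list(left) + list(right))
    left_vecs, right_vecs = vectors[:len(left)], vectors[len(left):]
    matches = []
    for i, u in enumerate(left_vecs):
        for j, v in enumerate(right_vecs):
            score = cosine(u, v)
            if score >= threshold:
                matches.append((i, j, score))
    return matches


# Hypothetical records from two sources with different structures.
source_a = [{"title": "Web Data Integration", "authors": ["Ana Souza", "Rui Lima"]}]
source_b = [
    {"name": "web data integration", "by": "Ana Souza and Rui Lima"},
    {"name": "Compiler Construction", "by": "Jo Park"},
]
matches = similarity_join(source_a, source_b)
```

In this sketch the first record of `source_b` is paired with the record in `source_a` despite the different field names and nesting, while the unrelated second record is not. A nested-loop join is quadratic in the number of objects; it is used here only to keep the correspondence with the relational join visible.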