Visual integration tool for heterogeneous data type by unified vectorization

Authors:
Farid Bourennani;Ken Q. Pu;Ying Zhu
Affiliations:
University of Ontario Institute of Technology, Oshawa, Ontario, Canada;University of Ontario Institute of Technology, Oshawa, Ontario, Canada;University of Ontario Institute of Technology, Oshawa, Ontario, Canada
Venue:
IRI'09 Proceedings of the 10th IEEE international conference on Information Reuse & Integration
Year:
2009

Abstract

Data integration is the problem of combining data residing at different sources and providing the user with a unified view of these data. One of the critical issues in data integration is detecting similar entities based on their content. This task is difficult for three reasons: the data types in the databases are heterogeneous, the database schemas are unfamiliar and also heterogeneous, and the number of records is large and time-consuming to analyze. As a solution to these problems, we extend work from another of our papers by introducing a new measure that handles heterogeneous textual and numerical data types for coincident meaning extraction. First, to accommodate the heterogeneous data types, we propose a new weight called Bin Frequency - Inverse Document Bin Frequency (BF-IDBF) for effective pre-processing and classification of heterogeneous data by unified vectorization. Second, to handle the unfamiliar data structure, we use the unsupervised Self-Organizing Map (SOM) algorithm. Finally, to help the user explore and browse semantically similar entities among the large amount of data, we use a SOM-based visualization tool to map the database tables based on their semantic content.
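
The abstract does not give the BF-IDBF formula, so the following is only an illustrative sketch of what unified vectorization might look like, not the authors' method. It assumes that numeric values are discretized into bins, that each bin is treated as a token alongside ordinary words, and that the weight follows the familiar TF-IDF pattern (frequency in the record times log inverse record frequency). The function names `to_tokens` and `weight_vectors`, the bin edges, and the sample rows are all hypothetical.

```python
import math
from collections import Counter


def to_tokens(row, bin_edges):
    """Turn one record into tokens: words for text, bin labels for numbers.

    Assumption: numeric columns are discretized with fixed, user-supplied edges.
    """
    tokens = []
    for column, value in row.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Bin index = number of edges at or below the value.
            idx = sum(1 for edge in bin_edges[column] if value >= edge)
            tokens.append(f"{column}#bin{idx}")
        else:
            tokens.extend(str(value).lower().split())
    return tokens


def weight_vectors(docs):
    """Build TF-IDF-style vectors over one shared vocabulary of words and bins.

    Assumption: this stands in for BF-IDBF, whose exact definition is not
    stated in the abstract.
    """
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        size = max(len(doc), 1)  # avoid division by zero for empty records
        vectors.append(
            [counts[t] / size * math.log(n / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical sample data mixing a text column and a numeric column.
rows = [
    {"name": "red apple", "price": 1.2},
    {"name": "green apple", "price": 1.4},
    {"name": "steel hammer", "price": 15.0},
]
bin_edges = {"price": [5.0, 10.0]}

docs = [to_tokens(row, bin_edges) for row in rows]
vocab, vectors = weight_vectors(docs)
```

Because words and bin labels share one vocabulary, textual and numerical attributes end up in the same vector space, and vectors of this kind could then be passed to an unsupervised method such as a Self-Organizing Map.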